diff --git a/.coverage b/.coverage
new file mode 100644
index 00000000..f27ad65b
Binary files /dev/null and b/.coverage differ
diff --git a/.cursor/archive/markr-handoff-2026-06-22.mdc b/.cursor/archive/markr-handoff-2026-06-22.mdc
new file mode 100644
index 00000000..6a357e47
--- /dev/null
+++ b/.cursor/archive/markr-handoff-2026-06-22.mdc
@@ -0,0 +1,140 @@
+---
+description: Archived Markr session handoff (stale — do not apply)
+alwaysApply: false
+---
+
+<!-- Archived 2026-06-22. Superseded by AGENTS.md. Regenerate via Markr when starting a new handoff. -->
+
+# Handoff from Cursor (archived)
+> 10 messages | ~344 tokens | Projects/VideoTuna | branch `main`
+>
+> Conditional Residual Handoff — transmits what the repo can't tell you (decisions, dead-ends, constraints, uncommitted diff), not the code itself.
+
+## ⚡ Paste this first
+
+Continue in Cursor Chat. Use the project files as source of truth and keep changes scoped.
+
+```text
+I'm resuming a previous Cursor session on Projects/VideoTuna. You have the repository — read it for anything not stated here. This handoff carries only what the code itself cannot tell you.
+
+TASK
+Pin SimpleTuner upstream SHA on next sync
+Migrate third_party/flux/ → videotuna/vendor/simpletuner/ via submodule
+Remove cogvideo_sat after SAT deprecation
+First-party Flux LoRA trainer to drop the 71-file snapshot
+Original task: @/home/menes/.cursor/projects/home-menes-Projects-VideoTuna/terminals/10.txt:9-239
+
+STATE
+- Branch `main` · 40 uncommitted file(s)
+UNCOMMITTED (in-flight — not on HEAD, you can't see this by reading committed code)
+- M README.md
+- D configs/005_cogvideox1.5/cogvideox1.5_5b.yaml
+- M docs/MODEL_VERSIONS.md
+- M docs/checkpoints.md
+- M poetry.lock
+- M pyproject.toml
+- M scripts/__init__.py
+- D scripts/inference_cogVideo_sat_refactor.py
+- M tests/conftest.py
+- M tests/test_import_smoke.py
+- M uv.lock
+- D videotuna/models/cogvideo_sat/arguments.py
+- D videotuna/models/cogvideo_sat/data_video.py
+- D videotuna/models/cogvideo_sat/diffusion_video.py
+- D videotuna/models/cogvideo_sat/dit_video_concat.py
+- D videotuna/models/cogvideo_sat/sgm/__init__.py
+- D videotuna/models/cogvideo_sat/sgm/lr_scheduler.py
+- D videotuna/models/cogvideo_sat/sgm/models/__init__.py
+- D videotuna/models/cogvideo_sat/sgm/models/autoencoder.py
+- D videotuna/models/cogvideo_sat/sgm/modules/__init__.py
+NEXT
+- Run the relevant build, lint, or test command before calling the handoff complete.
+- Preserve existing user changes and avoid reverting unrelated work.
+VERIFY
+- No verification command was captured — run the project build/lint/test before finishing.
+
+SYNTHESIS — before you change anything, restate in one line: (a) the task, and (b) the one constraint you must not break. Then proceed.
+```
+
+---
+
+## 🧠 Decision log
+
+_No explicit decisions were captured in the transcript._
+
+## 🛑 Dead-ends — do not redo
+
+_None captured._
+
+## 📌 Constraints
+
+_None explicitly stated._
+
+## 🔀 In-flight (uncommitted) state
+
+Branch: `main`
+
+Uncommitted changes:
+- `M README.md`
+- `D configs/005_cogvideox1.5/cogvideox1.5_5b.yaml`
+- `M docs/MODEL_VERSIONS.md`
+- `M docs/checkpoints.md`
+- `M poetry.lock`
+- `M pyproject.toml`
+- `M scripts/__init__.py`
+- `D scripts/inference_cogVideo_sat_refactor.py`
+- `M tests/conftest.py`
+- `M tests/test_import_smoke.py`
+- `M uv.lock`
+- `D videotuna/models/cogvideo_sat/arguments.py`
+- `D videotuna/models/cogvideo_sat/data_video.py`
+- `D videotuna/models/cogvideo_sat/diffusion_video.py`
+- `D videotuna/models/cogvideo_sat/dit_video_concat.py`
+- `D videotuna/models/cogvideo_sat/sgm/__init__.py`
+- `D videotuna/models/cogvideo_sat/sgm/lr_scheduler.py`
+- `D videotuna/models/cogvideo_sat/sgm/models/__init__.py`
+- `D videotuna/models/cogvideo_sat/sgm/models/autoencoder.py`
+- `D videotuna/models/cogvideo_sat/sgm/modules/__init__.py`
+- `D videotuna/models/cogvideo_sat/sgm/modules/attention.py`
+- `D videotuna/models/cogvideo_sat/sgm/modules/autoencoding/__init__.py`
+- `D videotuna/models/cogvideo_sat/sgm/modules/autoencoding/losses/__init__.py`
+- `D videotuna/models/cogvideo_sat/sgm/modules/autoencoding/losses/discriminator_loss.py`
+- `D videotuna/models/cogvideo_sat/sgm/modules/autoencoding/losses/lpips.py`
+- `D videotuna/models/cogvideo_sat/sgm/modules/autoencoding/losses/video_loss.py`
+- `D videotuna/models/cogvideo_sat/sgm/modules/autoencoding/lpips/__init__.py`
+- `D videotuna/models/cogvideo_sat/sgm/modules/autoencoding/lpips/loss/.gitignore`
+- `D videotuna/models/cogvideo_sat/sgm/modules/autoencoding/lpips/loss/LICENSE`
+- `D videotuna/models/cogvideo_sat/sgm/modules/autoencoding/lpips/loss/__init__.py`
+- `D videotuna/models/cogvideo_sat/sgm/modules/autoencoding/lpips/loss/lpips.py`
+- `D videotuna/models/cogvideo_sat/sgm/modules/autoencoding/lpips/model/LICENSE`
+- `D videotuna/models/cogvideo_sat/sgm/modules/autoencoding/lpips/model/__init__.py`
+- `D videotuna/models/cogvideo_sat/sgm/modules/autoencoding/lpips/model/model.py`
+- `D videotuna/models/cogvideo_sat/sgm/modules/autoencoding/lpips/util.py`
+- `D videotuna/models/cogvideo_sat/sgm/modules/autoencoding/lpips/vqperceptual.py`
+- `D videotuna/models/cogvideo_sat/sgm/modules/autoencoding/magvit2_pytorch.py`
+- `D videotuna/models/cogvideo_sat/sgm/modules/autoencoding/regularizers/__init__.py`
+- `D videotuna/models/cogvideo_sat/sgm/modules/autoencoding/regularizers/base.py`
+- `D videotuna/models/cogvideo_sat/sgm/modules/autoencoding/regularizers/finite_scalar_quantization.py`
+
+## 🎯 Task
+
+**Continue:** Pin SimpleTuner upstream SHA on next sync
+Migrate third_party/flux/ → videotuna/vendor/simpletuner/ via submodule
+Remove cogvideo_sat after SAT deprecation
+First-party Flux LoRA trainer to drop the 71-file snapshot
+
+**Original request:** @/home/menes/.cursor/projects/home-menes-Projects-VideoTuna/terminals/10.txt:9-239
+
+## 💬 Recent exchange (tail)
+
+**You**: Provide me with 3 comprehensive prompts, to run in plan model to setup amdu rocm support, imrpove nvidia support and use cpu. Also, be thorough on how to improve integration with the current system.
+
+**You**: This is too slow poetry run pytest tests/test_diffusers_video_flow.py
+
+**You**: @videotuna/third_party Is there a better way than doing this in our repo ? Provide me with a prompt to re-organize and improve the dependencies, management, etc
+
+**You**: Consume this article https://bitmovin.com/blog/ai-video-research/ , suggest me 10 improvements you would do on this codebase based on the information.
+
+**You**: Provide me with 3 comprehensive prompts, to run in plan mode to setup amdu rocm support, imrpove nvidia support and use cpu. Also, be thorough on how to improve integration with the current system.
+
+**You**: Pin SimpleTuner upstream SHA on next sync Migrate third_party/flux/ → videotuna/vendor/simpletuner/ via submodule Remove cogvideo_sat after SAT deprecation First-party Flux LoRA trainer to drop the 71-file snapshot
diff --git a/.cursor/mcp.json b/.cursor/mcp.json
new file mode 100644
index 00000000..da39e4ff
--- /dev/null
+++ b/.cursor/mcp.json
@@ -0,0 +1,3 @@
+{
+  "mcpServers": {}
+}
diff --git a/.cursor/rules/privtune.mdc b/.cursor/rules/privtune.mdc
new file mode 100644
index 00000000..40d2320d
--- /dev/null
+++ b/.cursor/rules/privtune.mdc
@@ -0,0 +1,15 @@
+---
+description: PrivTune project conventions and agent workflow
+alwaysApply: true
+---
+
+# PrivTune
+
+**Role:** Private-domain LoRA training platform (Flux T2I + Wan 2.1 T2V train, Wan 2.2 Diffusers validate). Optimize for correct training behavior, portable CUDA/ROCm/CPU handling, and minimal scoped diffs.
+
+Primary instructions: [`AGENTS.md`](../AGENTS.md) at the repo root.
+
+- Python 3.11+ · Poetry default (`poetry run …`) · optional uv
+- **Before finishing (required):** `poetry run lint`, `poetry run format-check`, `poetry run coverage-gate`
+- Env vars: [`.env.example`](../.env.example) (`VIDEOTUNA_*` retained) · Vendor policy: [`docs/vendor-policy.md`](../docs/vendor-policy.md)
+- Never commit `.env`, checkpoints, `outputs/`, weights, or secrets
diff --git a/.cursor/settings.json b/.cursor/settings.json
new file mode 100644
index 00000000..29968d9a
--- /dev/null
+++ b/.cursor/settings.json
@@ -0,0 +1,7 @@
+{
+  "plugins": {
+    "context7-plugin": {
+      "enabled": true
+    }
+  }
+}
diff --git a/.env.example b/.env.example
new file mode 100644
index 00000000..fd0faf60
--- /dev/null
+++ b/.env.example
@@ -0,0 +1,54 @@
+# PrivTune environment variables (VIDEOTUNA_* prefix retained for compatibility)
+# Copy to .env and export, or set in your shell profile.
+# Do not commit .env — it may contain secrets.
+
+# --- Compute backend ---
+# auto | cuda | rocm | cpu
+VIDEOTUNA_COMPUTE_BACKEND=auto
+
+# --- Attention backend ---
+# auto | flash | sdpa | eager
+# ROCm: use sdpa (flash is not supported)
+VIDEOTUNA_ATTN_BACKEND=auto
+
+# Fail when flash is requested but flash-attn is not installed (0 | 1)
+VIDEOTUNA_ATTN_BACKEND_STRICT=0
+
+# --- torch.compile (denoiser only) ---
+VIDEOTUNA_TORCH_COMPILE=0
+VIDEOTUNA_TORCH_COMPILE_MODE=reduce-overhead
+
+# --- Metrics ---
+# Inference metrics.json ownership (not training experiment tracking): script | flow
+VIDEOTUNA_METRICS_OWNER=script
+
+# Training experiment tracking (Flux Phase 1): tensorboard | trackio
+# trackio requires: poetry install -E trackio
+# VIDEOTUNA_METRICS_BACKEND=tensorboard
+# Optional Trackio remote dashboard (private Space recommended for domain content):
+# VIDEOTUNA_TRACKIO_SPACE_ID=username/privtune-trackio
+# VIDEOTUNA_TRACKIO_PROJECT=flux-domain-lora
+
+# --- CPU inference mode ---
+# off | smoke | force
+VIDEOTUNA_CPU_MODE=off
+# Deprecated — use VIDEOTUNA_CPU_MODE=force
+# VIDEOTUNA_ALLOW_CPU_INFERENCE=0
+
+# --- Training / benchmarks (optional) ---
+# VIDEOTUNA_LOG_LEVEL=INFO
+# VIDEOTUNA_BENCH_MODEL=
+
+# --- GPU selection ---
+# CUDA_VISIBLE_DEVICES=0
+# HIP_VISIBLE_DEVICES=0
+
+# --- Hugging Face (gated models, higher rate limits) ---
+# HF_TOKEN=
+# HF_HOME=~/.cache/huggingface
+
+# --- DashScope (Wan prompt extension via dashscope method) ---
+# DASH_API_KEY=
+
+# --- Test harness ---
+# ENV=test
diff --git a/.gemini/settings.json b/.gemini/settings.json
new file mode 100644
index 00000000..94adb007
--- /dev/null
+++ b/.gemini/settings.json
@@ -0,0 +1,15 @@
+{
+	"hooks": {
+		"AfterAgent": [
+			{
+				"hooks": [
+					{
+						"type": "command",
+						"command": "\"$HOME/.jolli/jollimemory/run-hook\" gemini-after-agent",
+						"name": "jolli-session-tracker"
+					}
+				]
+			}
+		]
+	}
+}
\ No newline at end of file
diff --git a/.github/workflows/cpu.yml b/.github/workflows/cpu.yml
new file mode 100644
index 00000000..7fc61bdd
--- /dev/null
+++ b/.github/workflows/cpu.yml
@@ -0,0 +1,37 @@
+name: CPU
+
+on:
+  push:
+    branches: [main]
+  pull_request:
+
+jobs:
+  test:
+    runs-on: ubuntu-latest
+    steps:
+      - uses: actions/checkout@v4
+
+      - uses: actions/setup-python@v5
+        with:
+          python-version: "3.11"
+
+      - name: Install Poetry
+        run: pip install poetry
+
+      - name: Install dependencies (CPU extra + dev + training)
+        run: poetry install -E cpu --with dev --with training
+
+      - name: Install CPU PyTorch wheels
+        run: poetry run install-cpu-torch
+
+      - name: Verify CPU torch
+        run: poetry run verify-cpu-torch
+
+      - name: Lint
+        run: poetry run lint
+
+      - name: Format check
+        run: poetry run format-check
+
+      - name: Smoke tests + coverage gate
+        run: poetry run coverage-gate
diff --git a/.github/workflows/lint-autofix.yml b/.github/workflows/lint-autofix.yml
new file mode 100644
index 00000000..5c2c07d8
--- /dev/null
+++ b/.github/workflows/lint-autofix.yml
@@ -0,0 +1,43 @@
+name: Lint autofix
+
+on:
+  pull_request:
+    types: [opened, synchronize, reopened]
+
+permissions:
+  contents: write
+
+concurrency:
+  group: lint-autofix-${{ github.event.pull_request.number }}
+  cancel-in-progress: true
+
+jobs:
+  autofix:
+    if: github.event.pull_request.head.repo.full_name == github.repository
+    runs-on: ubuntu-latest
+    steps:
+      - uses: actions/checkout@v4
+        with:
+          ref: ${{ github.event.pull_request.head.ref }}
+          repository: ${{ github.event.pull_request.head.repo.full_name }}
+          token: ${{ secrets.GITHUB_TOKEN }}
+          persist-credentials: true
+
+      # Install ruff only (no Poetry / torch). Pin matches poetry.lock (0.6.9).
+      # Mirrors scripts/__init__.py code_format() targets and command order.
+      - name: Install ruff
+        uses: astral-sh/ruff-action@v4.0.0
+        with:
+          args: "--version"
+          version: "0.6.9"
+
+      - name: Apply lint and format fixes
+        run: |
+          ruff check --fix videotuna tests scripts tools
+          ruff check --select I --fix .
+          ruff format .
+
+      - uses: stefanzweifel/git-auto-commit-action@v7
+        with:
+          commit_message: "style: apply ruff lint and format fixes"
+          commit_author: "github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>"
diff --git a/.gitignore b/.gitignore
index 1f568fa9..cded9802 100644
--- a/.gitignore
+++ b/.gitignore
@@ -19,6 +19,10 @@ Dataset/
 output/
 outputs/
 /data
+data/t2i/domain/
+data/t2v/domain/
+
+cloud/vast/.env
 
 HPSv2/
 SwissArmyTransformer/
@@ -28,3 +32,5 @@ temp
 
 
 *.outputs
+.jolli/jollimemory
+.jolli/jollimemory/debug.log
diff --git a/.gitmodules b/.gitmodules
new file mode 100644
index 00000000..dcf80594
--- /dev/null
+++ b/.gitmodules
@@ -0,0 +1,3 @@
+[submodule "videotuna/vendor/simpletuner"]
+	path = videotuna/vendor/simpletuner
+	url = https://github.com/bghira/SimpleTuner.git
diff --git a/.markr/config-tests.json b/.markr/config-tests.json
new file mode 100644
index 00000000..1d43c097
--- /dev/null
+++ b/.markr/config-tests.json
@@ -0,0 +1,18 @@
+{
+  "version": 1,
+  "suites": [
+    {
+      "configPath": ".agents/skills/jolli-recall/SKILL.md",
+      "tests": [
+        {
+          "id": "3266f96c-3940-4c08-a8f9-0334be69b6e4",
+          "name": "New test",
+          "prompt": "",
+          "expectedBehavior": "",
+          "mustInclude": [],
+          "mustNotInclude": []
+        }
+      ]
+    }
+  ]
+}
diff --git a/.pre-commit-config.yaml b/.pre-commit-config.yaml
index f2e18a6d..2394ea3d 100644
--- a/.pre-commit-config.yaml
+++ b/.pre-commit-config.yaml
@@ -10,24 +10,24 @@ repos:
         pass_filenames: false
         language: system
         stages: [pre-commit]
-#      - id: linting
-#        name: linting
-#        entry: poetry run lint
-#        pass_filenames: false
-#        language: system
-#        stages: [commit]
-#      - id: type-checking
-#        name: type checking
-#        entry: poetry run type-check
-#        pass_filenames: false
-#        language: system
-#        stages: [commit]
-#      - id: unit-tests
-#        name: unit tests
-#        entry: poetry run test
-#        pass_filenames: false
-#        language: system
-#        stages: [commit]
+      - id: linting
+        name: linting
+        entry: poetry run lint
+        pass_filenames: false
+        language: system
+        stages: [pre-commit]
+      #      - id: type-checking
+      #        name: type checking
+      #        entry: poetry run type-check
+      #        pass_filenames: false
+      #        language: system
+      #        stages: [commit]
+      - id: unit-tests
+        name: unit tests
+        entry: poetry run coverage-gate
+        pass_filenames: false
+        language: system
+        stages: [pre-commit]
   - repo: https://github.com/commitizen-tools/commitizen
     rev: v2.28.0
     hooks:
diff --git a/AGENTS.md b/AGENTS.md
new file mode 100644
index 00000000..86f39933
--- /dev/null
+++ b/AGENTS.md
@@ -0,0 +1,146 @@
+# PrivTune — Agent Instructions
+
+## Project overview
+
+**PrivTune** (`privtune` in Poetry; Python import path `videotuna/`) is a **training platform ONLY** for private-domain LoRA:
+
+- **Phase 1:** Flux T2I LoRA (`videotuna/training/flux_lora/`, `train-domain-t2i`)
+- **Phase 2:** Wan 2.1 T2V LoRA (`videotuna/flow/wanvideo.py`, `train-domain-t2v`)
+- **Phase 2.5 (optional):** Wan 2.1 I2V LoRA (`train-domain-i2v`) — LoRA-only; full Wan fine-tune is out of scope
+- **Phase 3:** Wan 2.2 Diffusers domain LoRA validation via `validate-domain-t2v` (production-ready; see runbook). General Wan 2.2 720p inference profile remains optional — [`wan2.2-inference-profile.md`](docs/runbooks/wan2.2-inference-profile.md).
+
+Training stacks differ by design — see [ADR-001](docs/decisions/0001-dual-training-stacks.md).
+
+Canonical runbook: [`docs/runbooks/domain-adult-finetune.md`](docs/runbooks/domain-adult-finetune.md)
+
+Python 3.11+; Poetry default (`poetry run …`); optional uv.
+
+## Role
+
+Optimize for:
+
+1. **Correct behavior** — training and smoke inference on CUDA, ROCm, and CPU config validation.
+2. **Scoped diffs** — change only what the task requires.
+3. **Portable device handling** — respect `videotuna/utils/device_utils.py` and `.env.example`.
+4. **Safe boundaries** — never commit weights, datasets, `outputs/`, or secrets.
+
+Cursor rules: [`.cursor/rules/privtune.mdc`](.cursor/rules/privtune.mdc)
+
+## Agent workflow
+
+1. `cd` into the repo root before running commands.
+2. Prefer **Poetry** (`poetry run …`) unless the user explicitly uses uv.
+3. Read [`docs/vendor-policy.md`](docs/vendor-policy.md) before touching vendored code.
+
+## Install profiles
+
+| Use case | Poetry | uv |
+|----------|--------|-----|
+| **Default (CUDA + training)** | `poetry install -E cuda --with training` | `uv sync --group training` |
+| Inference AMD ROCm | `poetry install -E rocm --with training` then `poetry run install-rocm` | Wan training requires CUDA; ROCm is inference + Flux training only — see [install-rocm.md](docs/install-rocm.md) |
+| CPU dev / CI | `poetry install -E cpu --with dev --with training` then `poetry run install-cpu-torch` | see [install-cpu.md](docs/install-cpu.md) |
+| + Dev | add `--with dev` | `uv sync --group dev` |
+
+After install for Wan LoRA: `poetry run install-deepspeed`
+
+## Verification (required before finishing)
+
+```bash
+poetry run lint
+poetry run format-check
+poetry run coverage-gate
+```
+
+`coverage-gate` runs the CI smoke test list and enforces a **33%** line-coverage floor on `videotuna/training/` + `videotuna/utils/`. For local exploratory reporting without a gate, use `poetry run coverage-report`.
+
+| Change area | Additional tests |
+|-------------|------------------|
+| Wan 2.2 presets / bridge | `test_wan_inference_presets.py` |
+| diffusers_video | `test_diffusers_video_flow.py` |
+| device/attention | `test_device_utils.py`, `test_attention_backend.py` |
+| inference CLI / memory | `test_inference_optimization.py` |
+
+## Commands
+
+### Training (canonical)
+
+```bash
+poetry run train-domain-t2i
+poetry run train-domain-t2v
+poetry run install-deepspeed   # Wan LoRA
+```
+
+Configs: `configs/domain/flux_t2i.json`, `configs/domain/flux_t2i_data.json`, `configs/domain/wan_t2v_lora.yaml`
+
+Legacy aliases: `train-flux-lora`, `train-wan2-1-t2v-lora`
+
+### Smoke inference
+
+```bash
+poetry run inference-domain-t2i --lorackpt results/train/flux-domain-adult/checkpoint-2000
+poetry run validate-domain-t2v --trained_ckpt results/train/.../denoiser.ckpt
+poetry run inference-wan2.2-t2v-720p   # general Wan 2.2 720p — optional
+```
+
+### Dev tooling
+
+```bash
+poetry run test -q
+poetry run lint
+poetry run format-check
+poetry run type-check   # mypy on typed allowlist only (see pyproject.toml)
+```
+
+## Environment variables
+
+`VIDEOTUNA_*` prefix is retained for compatibility (no `PRIVTUNE_*` aliases in v0.2).
+
+| Variable | Default | Purpose |
+|----------|---------|---------|
+| `VIDEOTUNA_COMPUTE_BACKEND` | `auto` | cuda / rocm / cpu override |
+| `VIDEOTUNA_ATTN_BACKEND` | `auto` | flash / sdpa / eager |
+| `VIDEOTUNA_TORCH_COMPILE` | `0` | denoiser compile |
+| `HF_TOKEN` | — | Gated HF models |
+
+## Project layout
+
+```
+videotuna/
+  training/flux_lora/   # Phase 1
+  models/wan/           # Phase 2
+  flow/                 # wanvideo, diffusers_video
+configs/domain/         # flux_t2i*.json, wan_t2v_lora.yaml
+configs/inference/presets/  # smoke + Wan 2.2 presets
+cloud/vast/
+docs/runbooks/
+```
+
+## Safety
+
+- Never commit `.env`, checkpoints, `outputs/`, `results/`, or training data.
+- ROCm: `VIDEOTUNA_ATTN_BACKEND=sdpa`; do not run `install-flash-attn`.
+- **QA = training callbacks + smoke inference.** VBench was removed: domain QA does not need generic T2V benchmarking; ImageLogger previews and LoRA smoke inference cover the supported workflows.
+
+## Related docs
+
+| Doc | Topic |
+|-----|-------|
+| [0001-dual-training-stacks.md](docs/decisions/0001-dual-training-stacks.md) | Why Flux uses Accelerate and Wan uses Lightning+DeepSpeed |
+| [domain-adult-finetune.md](docs/runbooks/domain-adult-finetune.md) | Domain training runbook |
+| [wan2.2-inference-profile.md](docs/runbooks/wan2.2-inference-profile.md) | Wan 2.2 rental GPU presets (Phase 3) |
+| [checkpoints.md](docs/checkpoints.md) | Weight layout |
+
+## Cursor Cloud specific instructions
+
+The Cloud VM is **CPU-only (no GPU/CUDA driver)** and runs Python 3.12 (satisfies `^3.11`). It uses the documented "CPU dev / CI" profile (see [docs/install-cpu.md](docs/install-cpu.md)). The startup update script installs deps and swaps to CPU torch; the notes below are durable caveats, not setup steps.
+
+- **CPU-torch swap is mandatory after every `poetry install`.** The lockfile pins CUDA `torch==2.6.0+cu126`, so any `poetry install` re-installs the CUDA wheel; you must re-run `poetry run install-cpu-torch` afterward or imports break with CUDA errors. Verify with `poetry run verify-cpu-torch`.
+- **Use `VIDEOTUNA_ATTN_BACKEND=eager` on CPU** (flash/xformers/bitsandbytes are CUDA-only and absent here). Keep `VIDEOTUNA_TORCH_COMPILE=0`.
+- **The `poetry run test <path>` script appends args to `pytest tests`**, so it always collects the whole suite regardless of the path you pass. To run a single file, call `poetry run pytest <path>` directly.
+- **Full suite + the `test_import_smoke.py` gate need the `training` group** (`pytorch_lightning`, `pandas`); without it ~5 modules fail to import. Install with `--with dev --with training`.
+- **Known pre-existing baseline failures (not environment issues):** `poetry run lint` reports ~1000 ruff errors; `poetry run pytest tests` shows ~6 failures (`tests/datasets/test_dataset_from_csv.py` hits a `PosixPath` bug in `videotuna/data/datasets.py`, and `test_wan_checkpoint.py::test_wan_from_pretrained_missing_dir` depends on diffusers/network behavior). The rest (~129) pass.
+- **GPU training/inference and real FLUX/Wan weights are not runnable here.** `inference-wan2.2-t2v-720p` and the CPU smoke preset download a 14B Wan 2.2 model. Validate core behavior via the CPU test gates in [capability-matrix.md](docs/capability-matrix.md) and small LoRA training-step smokes instead.
+=======
+| [MODEL_VERSIONS.md](docs/MODEL_VERSIONS.md) | Model pins |
+
+Removed: `capability-matrix.md`, `finetune_flux.md`, `finetune_wan.md`, `evaluation.md` (superseded by runbook + smoke inference).
diff --git a/CLAUDE.md b/CLAUDE.md
new file mode 100644
index 00000000..f7105961
--- /dev/null
+++ b/CLAUDE.md
@@ -0,0 +1,161 @@
+# CLAUDE.md
+
+This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository.
+
+## Project
+
+**PrivTune** (Poetry package: `privtune`; Python import path: `videotuna/`) is a private-domain LoRA training platform:
+
+| Phase | Model | Goal |
+|-------|-------|------|
+| 1 — T2I | FLUX.1-dev LoRA | Still-image domain style training |
+| 2 — T2V | Wan 2.1 T2V LoRA | Short-video motion training |
+| 2.5 — I2V | Wan 2.1 I2V LoRA | Image-to-video domain training (optional) |
+| 3 — Validate | Wan 2.2 Diffusers | Domain LoRA validation via `validate-domain-t2v` (production-ready); general 720p inference optional |
+
+Python 3.11+. Use `poetry run …` by default; `uv` is supported but secondary.
+
+## Commands
+
+### Install
+
+```bash
+# Default (CUDA + training)
+poetry install -E cuda --with training
+poetry run install-deepspeed   # required for Wan LoRA (DeepSpeed ZeRO-3)
+
+# CPU dev / CI
+poetry install -E cpu --with dev --with training
+poetry run install-cpu-torch
+```
+
+### Dev tools
+
+```bash
+poetry run lint            # ruff check
+poetry run format-check    # ruff format --check
+poetry run format          # ruff format (apply)
+poetry run type-check      # mypy on typed allowlist only (see pyproject.toml)
+```
+
+### Tests
+
+```bash
+poetry run test -q                          # full suite
+poetry run test tests/test_foo.py -q        # single file
+poetry run test tests/test_foo.py::test_bar # single test
+```
+
+Tests use markers: `gpu` (skipped without CUDA), `rocm`, `cpu_smoke`.
+
+### Training
+
+```bash
+poetry run train-domain-t2i   # Phase 1 — Flux T2I
+poetry run train-domain-t2v   # Phase 2 — Wan T2V
+poetry run train-domain-i2v   # Phase 2.5 — Wan I2V (optional)
+```
+
+### Smoke inference / validation
+
+```bash
+# Phase 1
+poetry run inference-domain-t2i --lorackpt results/train/flux-domain-adult/checkpoint-2000
+
+# Phase 2 production (Wan 2.1 → 2.2 bridge)
+poetry run validate-domain-t2v --trained_ckpt <ckpt>
+poetry run validate-domain-i2v --trained_ckpt <ckpt> --prompt_dir <dir>
+
+# Phase 3 — general Wan 2.2 720p (optional)
+poetry run inference-wan2.2-t2v-720p
+```
+
+### Pre-merge checklist
+
+Run these before finishing any change:
+
+```bash
+poetry run lint
+poetry run format-check
+poetry run test tests/test_import_smoke.py -q
+poetry run test tests/test_domain_finetune_configs.py -q
+poetry run test tests/test_flux_lora_train_smoke.py -q
+poetry run test tests/test_wan_lora_bridge.py -q
+poetry run test tests/test_wan_i2v_lora_bridge.py -q
+poetry run test tests/test_wan_domain_lora_smoke_22_config.py -q
+poetry run test tests/test_wan_domain_i2v_smoke_22_config.py -q
+poetry run test tests/test_wan_i2v_dataset.py -q
+poetry run test tests/test_wan_training_step.py -q
+poetry run test tests/test_poetry_scripts.py -q
+```
+
+Additional tests by change area:
+
+| Area | Tests |
+|------|-------|
+| Wan 2.2 presets / bridge | `test_wan_inference_presets.py` |
+| `diffusers_video` flow | `test_diffusers_video_flow.py` |
+| Device / attention | `test_device_utils.py`, `test_attention_backend.py` |
+| Inference CLI / memory | `test_inference_optimization.py` |
+
+## Architecture
+
+### Two inference flows
+
+`videotuna/flow/` contains the two execution paths that never mix:
+
+- **`wanvideo.py` (`WanVideoModelFlow`)** — PyTorch Lightning / native Wan 2.1 stack. Used by `train-domain-t2v`, `train-domain-i2v`, and the Wan 2.1 smoke scripts. Requires GPU (`FLOW_TIERS["gpu_required"]`).
+- **`diffusers_video.py` (`DiffusersVideoFlow`)** — Unified Diffusers pipeline supporting Flux T2I, Wan 2.2 T2V, and Wan 2.2 I2V. Used for all production inference and Flux training. Runs in CPU smoke mode for config validation.
+
+### LoRA bridge: Wan 2.1 → 2.2
+
+`videotuna/utils/wan_lora_bridge.py` is the critical path between training and validation. Wan 2.1 native Lightning LoRA checkpoints (`blocks.N.self_attn.q`) use different key names than Wan 2.2 Diffusers (`attn1.to_q`). The bridge remaps and loads the LoRA weights onto a `WanTransformer3DModel` via PEFT. Always verify remap coverage ≥ 90% when touching this file (`test_wan_lora_bridge.py`).
+
+For offline export: `tools/convert_wan_lora_21_to_22.py`. For debugging: `tools/spike_wan_lora_bridge.py`.
+
+### Phase 1: Flux LoRA trainer
+
+`videotuna/training/flux_lora/` is a first-party trainer (Diffusers + PEFT + Accelerate). Configs live in `configs/domain/flux_t2i.json` + `configs/domain/flux_t2i_data.json`. `LoraModelCheckpoint` in `videotuna/utils/callbacks.py` strips all non-LoRA weights from checkpoints.
+
+### Phase 2: Wan 2.1 native stack
+
+`videotuna/models/wan/wan/` is vendored upstream Wan 2.1 code (see `docs/vendor-policy.md`). Do not modify freely — check the vendor policy before making changes. Training runs under DeepSpeed ZeRO-3 via `wanvideo.py`.
+
+### Settings and device handling
+
+All environment configuration goes through `videotuna/settings.py` (`PrivTuneSettings`, pydantic-settings). Environment variables use `VIDEOTUNA_*` prefix (retained from upstream for compatibility; no `PRIVTUNE_*` aliases exist).
+
+`videotuna/utils/device_utils.py` is the single source of truth for compute backend detection (cuda/rocm/cpu). Always call `detect_compute_backend()` or `resolve_inference_device()` — do not check `torch.cuda.is_available()` directly in flow code.
+
+### CLI layer
+
+`videotuna/cli/inference_app.py` uses `cyclopts` to register all `inference-*` and `validate-*` Poetry scripts. Options are declared as dataclasses in `videotuna/cli/inference_options.py` (`InferenceRunOptions`, `StandardInferenceOptions`, `InferencePreset`). New inference entry points follow this pattern.
+
+### Config layout
+
+```
+configs/domain/          # training configs (flux_t2i*.json, wan_t2v_lora.yaml, wan_i2v_lora.yaml)
+configs/inference/
+  presets/               # smoke + production presets (wan_domain_*, balanced_*, low_vram_*)
+```
+
+## Key environment variables
+
+Loaded from `.env` (copy from `.env.example`). All optional — defaults work for CUDA.
+
+| Variable | Default | Purpose |
+|----------|---------|---------|
+| `VIDEOTUNA_COMPUTE_BACKEND` | `auto` | `cuda` / `rocm` / `cpu` |
+| `VIDEOTUNA_ATTN_BACKEND` | `auto` | `flash` / `sdpa` / `eager` — use `sdpa` on ROCm |
+| `VIDEOTUNA_TORCH_COMPILE` | `0` | Enable denoiser compile |
+| `VIDEOTUNA_CPU_MODE` | `off` | `smoke` / `force` for CPU inference |
+| `HF_TOKEN` | — | Gated HF models (FLUX.1-dev) |
+| `DASH_API_KEY` | — | DashScope prompt extension |
+
+## Safety
+
+Never commit `.env`, `outputs/`, `results/`, `data/`, or model checkpoints. These are in `.gitignore` but the restriction is absolute — do not override.
+
+ROCm: always set `VIDEOTUNA_ATTN_BACKEND=sdpa`; do not run `install-flash-attn`.
+
+QA is training callbacks (ImageLogger) + smoke inference on held-out prompts. VBench was removed; do not reintroduce generic T2V benchmarking.
diff --git a/README.md b/README.md
index 22f235a4..0e7a76d7 100644
--- a/README.md
+++ b/README.md
@@ -1,331 +1,189 @@
-<p align="center" width="50%">
-<img src="https://github.com/user-attachments/assets/38efb5bc-723e-4012-aebd-f55723c593fb" alt="VideoTuna" style="width: 75%; min-width: 450px; display: block; margin: auto; background-color: transparent;">
-</p>
+# PrivTune
 
-# VideoTuna
+**PrivTune** is a private-domain LoRA training platform for still-image and short-video generation.
 
-![Version](https://img.shields.io/badge/version-0.1.0-blue) ![visitors](https://visitor-badge.laobi.icu/badge?page_id=VideoVerses.VideoTuna&left_color=green&right_color=red)  [![](https://dcbadge.limes.pink/api/server/AammaaR2?style=flat)](https://discord.gg/AammaaR2) <a href='https://github.com/user-attachments/assets/a48d57a3-4d89-482c-8181-e0bce4f750fd'><img src='https://badges.aleen42.com/src/wechat.svg'></a> [![Homepage](https://img.shields.io/badge/Homepage-VideoTuna-orange)](https://videoverses.github.io/videotuna/) [![GitHub](https://img.shields.io/github/stars/VideoVerses/VideoTuna?style=social)](https://github.com/VideoVerses/VideoTuna)
+Canonical runbook: [`docs/runbooks/domain-adult-finetune.md`](docs/runbooks/domain-adult-finetune.md)
 
+The Python package directory remains `videotuna/` for compatibility; Poetry project name is `privtune`.
 
-🤗🤗🤗 Videotuna is a useful codebase for text-to-video applications.  
-🌟 VideoTuna is the first repo that integrates multiple AI video generation models including `text-to-video (T2V)`, `image-to-video (I2V)`, `text-to-image (T2I)`, and `video-to-video (V2V)` generation for model inference and finetuning (to the best of our knowledge).  
-🌟 VideoTuna is the first repo that provides comprehensive pipelines in video generation, from fine-tuning to pre-training, continuous training, and post-training (alignment) (to the best of our knowledge).  
+## What PrivTune does
 
+| Phase | Model | Role |
+|-------|-------|------|
+| 1 — T2I | FLUX.1-dev LoRA | Train domain still-image style |
+| 2 — T2V | Wan 2.1 T2V LoRA | Train domain short-video motion |
+| 3 — Validate | Wan 2.2 Diffusers | Production domain LoRA validation via `validate-domain-t2v` (see [domain-adult-finetune.md](docs/runbooks/domain-adult-finetune.md)); general 720p inference profile in [wan2.2-inference-profile.md](docs/runbooks/wan2.2-inference-profile.md) |
 
+QA uses **training ImageLogger callbacks** and **LoRA smoke inference** on held-out prompts — not generic T2V benchmarking (VBench removed).
 
-## 🔆 Features
-![videotuna-pipeline-fig3](https://github.com/user-attachments/assets/625693d9-b5cf-4c00-8e84-20ea855c2445)
-🌟 **All-in-one framework:** Inference and fine-tune various up-to-date pre-trained video generation models.  
-🌟 **Continuous training:** Keep improving your model with new data.  
-🌟 **Fine-tuning:** Adapt pre-trained models to specific domains.  
-🌟 **Human preference alignment:** Leverage RLHF to align with human preferences.  
-🌟 **Post-processing:** Enhance and rectify the videos with video-to-video enhancement model.  
+## Legal and data requirements
 
+- Use only **rights-cleared, consented** training data.
+- Never commit `data/`, checkpoints, `results/`, or `outputs/` to git.
 
-## 🔆 Updates
+## Install (default = training)
 
-- [2025-04-22] 🐟 Supported **inference** for `Wan2.1` and `Step Video` and **fine-tuning** for `HunyuanVideo T2V`, with a unified codebase architecture.
-- [2025-02-03] 🐟 Supported automatic code formatting via [PR#27](https://github.com/VideoVerses/VideoTuna/pull/27). Thanks [@samidarko](https://github.com/samidarko)!
-- [2025-02-01] 🐟 Migrated to [Poetry](https://python-poetry.org) for streamlined dependency and script management ([PR#25](https://github.com/VideoVerses/VideoTuna/pull/25)). Thanks [@samidarko](https://github.com/samidarko)!
-- [2025-01-20] 🐟 Supported **fine-tuning** for `Flux-T2I`.
-- [2025-01-01] 🐟 Released **training** for `VideoVAE+` in the [VideoVAEPlus repo](https://github.com/VideoVerses/VideoVAEPlus).
-- [2025-01-01] 🐟 Supported **inference** for `Hunyuan Video` and `Mochi`.
-- [2024-12-24] 🐟 Released `VideoVAE+`: a SOTA Video VAE model—now available in [this repo](https://github.com/VideoVerses/VideoVAEPlus)! Achieves better video reconstruction than NVIDIA’s [`Cosmos-Tokenizer`](https://github.com/NVIDIA/Cosmos-Tokenizer).
-- [2024-12-01] 🐟 Supported **inference** for `CogVideoX-1.5-T2V&I2V` and `Video-to-Video Enhancement` from ModelScope.
-- [2024-12-01] 🐟 Supported **fine-tuning** for `CogVideoX`.
-- [2024-11-01] 🐟 🎉 Released **VideoTuna v0.1.0**!  
-  Initial support includes inference for `VideoCrafter1-T2V&I2V`, `VideoCrafter2-T2V`, `DynamiCrafter-I2V`, `OpenSora-T2V`, `CogVideoX-1-2B-T2V`, `CogVideoX-1-T2V`, `Flux-T2I`, and training/fine-tuning of `VideoCrafter`, `DynamiCrafter`, and `Open-Sora`.
+PrivTune supports **Poetry** (default) and **[uv](https://docs.astral.sh/uv/)**.
 
-## 🔆 Get started
-
-### 1.Prepare environment
-
-#### (1) If you use Linux and Conda (Recommend)
-``` shell
-conda create -n videotuna python=3.10 -y
-conda activate videotuna
+```shell
+conda create -n privtune python=3.11 -y
+conda activate privtune
 pip install poetry
-poetry install
-```
-- ↑ It takes around 3 minitues.
-
-**Optional: Flash-attn installation**
-
-Hunyuan model uses it to reduce memory usage and speed up inference. If it is not installed, the model will run in normal mode. Install the `flash-attn` via:
-``` shell
-poetry run install-flash-attn 
-```
-- ↑ It takes 1 minitue.
-
-**Optional: Video-to-video enhancement**
+poetry install -E cuda --with training
+poetry run install-deepspeed   # required for Wan LoRA (DeepSpeed ZeRO-3)
 ```
-poetry run pip install "modelscope[cv]" -f https://modelscope.oss-cn-beijing.aliyuncs.com/releases/repo.html
-```
-- If this command ↑ get stucked, kill and re-run it will solve the issue.
-
-
-#### (2) If you use Linux and Poetry (without Conda):
-<details>
-  <summary>Click to check instructions</summary>
-  <br>
-
-  Install Poetry: https://python-poetry.org/docs/#installation  
-  Then:
-
-  ``` shell
-  poetry config virtualenvs.in-project true # optional but recommended, will ensure the virtual env is created in the project root
-  poetry config virtualenvs.create true # enable this argument to ensure the virtual env is created in the project root
-  poetry env use python3.10 # will create the virtual env, check with `ls -l .venv`.
-  poetry env activate # optional because Poetry commands (e.g. `poetry install` or `poetry run <command>`) will always automatically load the virtual env.
-  poetry install
-  ```
-
-  **Optional: Flash-attn installation**
-
-  Hunyuan model uses it to reduce memory usage and speed up inference. If it is not installed, the model will run in normal mode. Install the `flash-attn` via:
-  ``` shell
-  poetry run install-flash-attn
-  ```
-  
-  **Optional: Video-to-video enhancement**
-  ```
-  poetry run pip install "modelscope[cv]" -f https://modelscope.oss-cn-beijing.aliyuncs.com/releases/repo.html
-  ```
-  - If this command ↑ get stucked, kill and re-run it will solve the issue.
-
-</details>
-
-
-
-#### (3) If you use MacOS
-<details>
-  <summary>Click to check instructions</summary>
-  <br>
-
-  On MacOS with Apple Silicon chip use [docker compose](https://docs.docker.com/compose/) because some dependencies are not supporting arm64 (e.g. `bitsandbytes`, `decord`, `xformers`).
-
-  First build:
-
-  ```shell
-  docker compose build videotuna
-  ```
-
-  To preserve the project's files permissions set those env variables:
-
-  ```shell
-  export HOST_UID=$(id -u)
-  export HOST_GID=$(id -g)
-  ```
-
-  Install dependencies:
-
-  ```shell
-  docker compose run --remove-orphans videotuna poetry env use /usr/local/bin/python
-  docker compose run --remove-orphans videotuna poetry run python -m pip install --upgrade pip setuptools wheel
-  docker compose run --remove-orphans videotuna poetry install
-  docker compose run --remove-orphans videotuna poetry run pip install "modelscope[cv]" -f https://modelscope.oss-cn-beijing.aliyuncs.com/releases/repo.html
-  ```
-
-  Note: installing swissarmytransformer might hang. Just try again and it should work.
-
-  Add a dependency:
-
-  ```shell
-  docker compose run --remove-orphans videotuna poetry add wheel
-  ```
 
-  Check dependencies:
+| Use case | Poetry | uv |
+|----------|--------|-----|
+| **Default (CUDA + training)** | `poetry install -E cuda --with training` | `uv sync --group training` |
+| Inference AMD ROCm | `poetry install -E rocm --with training` then `poetry run install-rocm` | Wan training requires CUDA; ROCm is inference + Flux training only — see [install-rocm.md](docs/install-rocm.md) |
+| CPU dev / CI | `poetry install -E cpu --with dev --with training` then `poetry run install-cpu-torch` | see [install-cpu.md](docs/install-cpu.md) |
+| + Dev (pytest, ruff) | add `--with dev` | `uv sync --group dev` |
 
-  ```shell
-  docker compose run --remove-orphans videotuna poetry run pip freeze
-  ```
+### Docker (optional)
 
-  Run Poetry commands:
-
-  ```shell
-  docker compose run --remove-orphans videotuna poetry run format
-  ```
-
-  Start a terminal:
-
-  ```shell
-  docker compose run -it --remove-orphans videotuna bash
-  ```
-</details>
-
-### 2.Prepare checkpoints
-
-- Please follow [docs/checkpoints.md](https://github.com/VideoVerses/VideoTuna/blob/main/docs/checkpoints.md) to download model checkpoints.  
-- After downloading, the model checkpoints should be placed as [Checkpoint Structure](https://github.com/VideoVerses/VideoTuna/blob/main/docs/checkpoints.md#checkpoint-orgnization-structure).
-
-### 3.Inference state-of-the-art T2V/I2V/T2I models
-
-
-Run the following commands to inference models:
-It will automatically perform T2V/T2I based on prompts in `inputs/t2v/prompts.txt`, 
-and I2V based on images and prompts in `inputs/i2v/576x1024`.  
-
-**T2V**
-Task|Model|Command|Length (#Frames)|Resolution|Inference Time|GPU Memory (GB)|
-|:---------|:---------|:---------|:---------|:---------|:---------|:---------|
-|T2V|HunyuanVideo|`poetry run inference-hunyuan-t2v`|129|720x1280|32min|60G|
-|T2V|WanVideo|`poetry run inference-wanvideo-t2v-720p`|81|720x1280|32min|70G|
-|T2V|StepVideo|`poetry run inference-stepvideo-t2v-544x992`|51|544x992|8min|61G|
-|T2V|Mochi|`poetry run inference-mochi`|84|480x848|2min|26G|
-|T2V|CogVideoX-5b|`poetry run inference-cogvideo-t2v-diffusers`|49|480x720|2min|3G|
-|T2V|CogVideoX-2b|`poetry run inference-cogvideo-t2v-diffusers`|49|480x720|2min|3G|
-|T2V|Open Sora V1.0|`poetry run inference-opensora-v10-16x256x256`|16|256x256|11s|24G|
-|T2V|VideoCrafter-V2-320x512|`poetry run inference-vc2-t2v-320x512`|16|320x512|26s|11G|
-|T2V|VideoCrafter-V1-576x1024|`poetry run inference-vc1-t2v-576x1024`|16|576x1024|2min|15G|
-
----
-
-
-**I2V**
-
-
-Task|Model|Command|Length (#Frames)|Resolution|Inference Time|GPU Memory (GB)|
-|:---------|:---------|:---------|:---------|:---------|:---------|:---------|
-|I2V|WanVideo|`poetry run inference-wanvideo-i2v-720p `|81|720x1280|28min|77G|
-|I2V|HunyuanVideo|`poetry run inference-hunyuan-i2v-720p`|129|720x1280|29min|43G|
-|I2V|CogVideoX-5b-I2V|`poetry run inference-cogvideox-15-5b-i2v`|49|480x720|5min|5G|
-|I2V|DynamiCrafter|`poetry run inference-dc-i2v-576x1024`|16|576x1024|2min|53G|
-|I2V|VideoCrafter-V1|`poetry run inference-vc1-i2v-320x512`|16|320x512|26s|11G|
+For containerized dev (e.g. Mac arm64), use the Compose image **`privtune`** (`privtune:latest` or `privtune:${TAG}`):
 
+```shell
+export HOST_UID=$(id -u) HOST_GID=$(id -g)
+docker compose build privtune
+docker compose run --rm privtune bash
+# inside container: poetry install -E cuda --with training
+```
 
----
+The legacy **`videotune`** Compose service and `videotune:latest` image tag remain as a backward-compatible alias (deprecated). A separate legacy all-in-one RunPod image lives under [`docker/Dockerfile`](docker/Dockerfile) and is not the recommended training path.
 
-**T2I**
+See [`docs/vendor-policy.md`](docs/vendor-policy.md) for vendored upstream policy.
 
-Task|Model|Command|Length (#Frames)|Resolution|Inference Time|GPU Memory (GB)|
-|:---------|:---------|:---------|:---------|:---------|:---------|:---------|
-|T2I|Flux-dev|`poetry run inference-flux-dev`|1|768x1360|4s|37G|
-|T2I|Flux-dev|`poetry run inference-flux-dev --enable_vae_tiling --enable_sequential_cpu_offload`|1|768x1360|4.2min|2G|
-|T2I|Flux-schnell|`poetry run inference-flux-schnell`|1|768x1360|1s|37G|
-|T2I|Flux-schnell|`poetry run inference-flux-schnell --enable_vae_tiling --enable_sequential_cpu_offload`|1|768x1360|24s|2G|
+## Data preparation
 
-### 4. Finetune T2V models
-#### (1) Prepare dataset
-Please follow the [docs/datasets.md](docs/datasets.md) to try provided toydataset or build your own datasets.
+**Phase 1 — still images** (`data/t2i/domain/`):
 
-#### (2) Fine-tune
-All  training commands were tested on H800 80G GPUs.  
-**T2V**
+```
+data/t2i/domain/
+  0001.jpg
+  0001.txt          # e.g. "sks_style, portrait, studio lighting"
+```
 
-|Task|Model|Mode|Command|More Details|#GPUs|
-|:----|:---------|:---------------|:-----------------------------------------|:----------------------------|:------|
-|T2V|Wan Video|Lora Fine-tune|`poetry run train-wan2-1-t2v-lora`|[docs/finetune_wan.md](docs/finetune_wan.md)|1|
-|T2V|Wan Video|Full Fine-tune|`poetry run train-wan2-1-t2v-fullft`|[docs/finetune_wan.md](docs/finetune_wan.md)|1|
-|T2V|Hunyuan Video|Lora Fine-tune|`poetry run train-hunyuan-t2v-lora`|[docs/finetune_hunyuanvideo.md](docs/finetune_hunyuanvideo.md)|2|
-|T2V|CogvideoX|Lora Fine-tune|`poetry run train-cogvideox-t2v-lora`|[docs/finetune_cogvideox.md](docs/finetune_cogvideox.md)|1|
-|T2V|CogvideoX|Full Fine-tune|`poetry run train-cogvideox-t2v-fullft`|[docs/finetune_cogvideox.md](docs/finetune_cogvideox.md)|4|
-|T2V|Open-Sora v1.0|Full Fine-tune|`poetry run train-opensorav10`|-|1|
-|T2V|VideoCrafter|Lora Fine-tune|`poetry run train-videocrafter-lora`|[docs/finetune_videocrafter.md](docs/finetune_videocrafter.md)|1|
-|T2V|VideoCrafter|Full Fine-tune|`poetry run train-videocrafter-v2`|[docs/finetune_videocrafter.md](docs/finetune_videocrafter.md)|1|
+Use a consistent trigger token (default: `sks_style`) in every `.txt` caption file.
 
----
+**Phase 2 — short video** (`data/t2v/domain/`):
 
-**I2V**
+```
+data/t2v/domain/
+  metadata.csv
+  videos/
+    clip001.mp4
+```
 
-|Task|Model|Mode|Command|More Details|#GPUs|
-|:----|:---------|:---------------|:-----------------------------------------|:----------------------------|:------|
-|I2V|Wan Video|Lora Fine-tune|`poetry run train-wan2-1-i2v-lora`|[docs/finetune_wan.md](docs/finetune_wan.md)|1|
-|I2V|Wan Video|Full Fine-tune|`poetry run train-wan2-1-i2v-fullft`|[docs/finetune_wan.md](docs/finetune_wan.md)|1|
-|I2V|CogvideoX|Lora Fine-tune|`poetry run train-cogvideox-i2v-lora`|[docs/finetune_cogvideox.md](docs/finetune_cogvideox.md)|1|
-|I2V|CogvideoX|Full Fine-tune|`poetry run train-cogvideox-i2v-fullft`|[docs/finetune_cogvideox.md](docs/finetune_cogvideox.md)|4|
+See the runbook for CSV format and ffmpeg re-encode notes.
 
----
+## Train
 
-**T2I**
+```bash
+# Phase 1 — Flux T2I domain LoRA
+poetry run train-domain-t2i
 
-|Task|Model|Mode|Command|More Details|#GPUs|
-|:----|:---------|:---------------|:-----------------------------------------|:----------------------------|:------|
-|T2I|Flux|Lora Fine-tune|`poetry run train-flux-lora`|[docs/finetune_flux.md](docs/finetune_flux.md)|1|
+# Phase 2 — Wan 2.1 T2V domain LoRA
+poetry run train-domain-t2v
+```
 
+Configs: [`configs/domain/flux_t2i.json`](configs/domain/flux_t2i.json), [`configs/domain/flux_t2i_data.json`](configs/domain/flux_t2i_data.json), [`configs/domain/wan_t2v_lora.yaml`](configs/domain/wan_t2v_lora.yaml).
 
-### 5. Evaluation
-We support VBench evaluation to evaluate the T2V generation performance.
-Please check [eval/README.md](docs/evaluation.md) for details.
+Legacy aliases `train-flux-lora` and `train-wan2-1-t2v-lora` remain available.
 
-<!-- ### 6. Alignment
-We support video alignment post-training to align human perference for video diffusion models. Please check [configs/train/004_rlhf_vc2/README.md](configs/train/004_rlhf_vc2/README.md) for details. -->
+Outputs:
 
-## Contribute
+- Phase 1: `results/train/flux-domain-adult/checkpoint-*/`
+- Phase 2: `results/train/train_wan_domain_t2v_lora_*/`
 
-## Git hooks
+## Validate
 
-Git hooks are handled with [pre-commit](https://pre-commit.com) library.
+**Phase 1 (this milestone):**
 
-### Hooks installation
+```bash
+poetry run inference-domain-t2i \
+  --lorackpt results/train/flux-domain-adult/checkpoint-2000 \
+  --prompt "sks_style, portrait, soft lighting"
+```
 
-Run the following command to install hooks on `commit`. They will check formatting, linting and types.
+**Phase 2 interim (Wan 2.1 native smoke):**
 
-```shell
-poetry run pre-commit install
-poetry run pre-commit install --hook-type commit-msg
+```bash
+poetry run python scripts/inference_new.py \
+  --config configs/inference/presets/wan_domain_lora_smoke.yaml \
+  --ckpt_path checkpoints/wan/Wan2.1-T2V-14B \
+  --trained_ckpt results/train/.../denoiser-000-000000025.ckpt \
+  --prompt "sks_style, slow camera push-in"
 ```
 
-### Running the hooks without commiting
+**Phase 2 production (Wan 2.2 domain LoRA validation):**
 
-```shell
-poetry run pre-commit run --all-files
+```bash
+poetry run validate-domain-t2v \
+  --trained_ckpt results/train/.../denoiser-000-000000025.ckpt \
+  --prompt_file inputs/t2v/domain_prompt.txt
 ```
 
-## Acknowledgement
-We thank the following repos for sharing their awesome models and codes!
+See [`docs/runbooks/domain-adult-finetune.md`](docs/runbooks/domain-adult-finetune.md) for VRAM presets and known bridge limitations. For general Wan 2.2 720p inference (non-domain), see [`docs/runbooks/wan2.2-inference-profile.md`](docs/runbooks/wan2.2-inference-profile.md).
 
-* [Wan2.1](https://github.com/Wan-Video/Wan2.1): Wan: Open and Advanced Large-Scale Video Generative Models.
-* [HunyuanVideo](https://github.com/Tencent/HunyuanVideo): A Systematic Framework For Large Video Generation Model.
-* [Step-Video](https://github.com/stepfun-ai/Step-Video-T2V): A text-to-video pre-trained model with 30 billion parameters and the capability to generate videos up to 204 frames.
-* [Mochi](https://www.genmo.ai/blog): A new SOTA in open-source video generation models
-* [VideoCrafter2](https://github.com/AILab-CVC/VideoCrafter): Overcoming Data Limitations for High-Quality Video Diffusion Models
-* [VideoCrafter1](https://github.com/AILab-CVC/VideoCrafter): Open Diffusion Models for High-Quality Video Generation
-* [DynamiCrafter](https://github.com/Doubiiu/DynamiCrafter): Animating Open-domain Images with Video Diffusion Priors
-* [Open-Sora](https://github.com/hpcaitech/Open-Sora): Democratizing Efficient Video Production for All
-* [CogVideoX](https://github.com/THUDM/CogVideo): Text-to-Video Diffusion Models with An Expert Transformer
-* [VADER](https://github.com/mihirp1998/VADER): Video Diffusion Alignment via Reward Gradients
-* [VBench](https://github.com/Vchitect/VBench): Comprehensive Benchmark Suite for Video Generative Models
-* [Flux](https://github.com/black-forest-labs/flux): Text-to-image models from Black Forest Labs.
-* [SimpleTuner](https://github.com/bghira/SimpleTuner): A fine-tuning kit for text-to-image generation.
+## VRAM and hardware
 
+| Phase | Model | Peak VRAM | Notes |
+|-------|-------|-----------|-------|
+| 1 — T2I | Flux LoRA @ 512px | ~24–40 GB | 1 GPU |
+| 2 — T2V | Wan 2.1 LoRA @ 480×832×81 | ~38 GB | 1 GPU + DeepSpeed ZeRO-3 offload |
 
+## CPU dev / CI gates
 
+```bash
+poetry install -E cpu --with dev --with training
+poetry run install-cpu-torch
+poetry run lint
+poetry run format-check
+poetry run test tests/test_import_smoke.py -q
+poetry run test tests/test_domain_finetune_configs.py -q
+poetry run test tests/test_flux_lora_train_smoke.py -q
+poetry run test tests/test_poetry_scripts.py -q
+```
 
-## Some Resources
-* [LLMs-Meet-MM-Generation](https://github.com/YingqingHe/Awesome-LLMs-meet-Multimodal-Generation): A paper collection of utilizing LLMs for multimodal generation (image, video, 3D and audio).
-* [MMTrail](https://github.com/litwellchi/MMTrail): A multimodal trailer video dataset with language and music descriptions.
-* [Seeing-and-Hearing](https://github.com/yzxing87/Seeing-and-Hearing): A versatile framework for Joint VA generation, V2A, A2V, and I2A.
-* [Self-Cascade](https://github.com/GuoLanqing/Self-Cascade): A Self-Cascade model for higher-resolution image and video generation.
-* [ScaleCrafter](https://github.com/YingqingHe/ScaleCrafter) and [HiPrompt](https://liuxinyv.github.io/HiPrompt/): Free method for higher-resolution image and video generation.
-* [FreeTraj](https://github.com/arthur-qiu/FreeTraj) and [FreeNoise](https://github.com/AILab-CVC/FreeNoise): Free method for video trajectory control and longer-video generation.
-* [Follow-Your-Emoji](https://github.com/mayuelala/FollowYourEmoji), [Follow-Your-Click](https://github.com/mayuelala/FollowYourClick), and [Follow-Your-Pose](https://follow-your-pose.github.io/): Follow family for controllable video generation.
-* [Animate-A-Story](https://github.com/AILab-CVC/Animate-A-Story): A framework for storytelling video generation.
-* [LVDM](https://github.com/YingqingHe/LVDM): Latent Video Diffusion Model for long video generation and text-to-video generation.
+## Environment variables
 
+`VIDEOTUNA_*` env vars are retained for compatibility (see [`.env.example`](.env.example)).
 
+| Variable | Purpose |
+|----------|---------|
+| `VIDEOTUNA_ATTN_BACKEND` | `auto`, `flash`, `sdpa`, `eager` — use `sdpa` on ROCm |
+| `VIDEOTUNA_COMPUTE_BACKEND` | `auto`, `cuda`, `rocm`, `cpu` |
+| `HF_TOKEN` | Gated models (FLUX.1-dev) |
 
-## 🍻 Contributors
+## Project layout
 
-<a href="https://github.com/VideoVerses/VideoTuna/graphs/contributors">
-  <img src="https://contrib.rocks/image?repo=VideoVerses/VideoTuna" />
-</a>
+```
+videotuna/
+  training/flux_lora/   # Phase 1 trainer
+  models/wan/           # Phase 2 native stack
+  flow/                 # wanvideo (train), diffusers_video (infer)
+configs/domain/         # flux_t2i*.json, wan_t2v_lora.yaml, cloud smoke variants
+configs/inference/presets/  # smoke + Wan 2.2 inference presets
+cloud/vast/             # rented GPU provisioning
+docs/runbooks/          # domain-adult-finetune, wan2.2-inference-profile
+```
 
-## 📋 License
-Please follow [CC-BY-NC-ND](./LICENSE). If you want a license authorization, please contact the project leads Yingqing He (yhebm@connect.ust.hk) and Yazhou Xing (yxingag@connect.ust.hk).
+## Cloud GPU training
 
-## 😊 Citation
+Rented GPU provisioning (Vast.ai): [`docs/runbooks/cloud-gpu-training.md`](docs/runbooks/cloud-gpu-training.md)
 
-```bibtex
-@software{videotuna,
-  author = {Yingqing He and Yazhou Xing and Zhefan Rao and Haoyu Wu and Zhaoyang Liu and Jingye Chen and Pengjun Fang and Jiajun Li and Liya Ji and Runtao Liu and Xiaowei Chi and Yang Fei and Guocheng Shao and Yue Ma and Qifeng Chen},
-  title = {VideoTuna: A Powerful Toolkit for Video Generation with Model Fine-Tuning and Post-Training},
-  month = {Nov},
-  year = {2024},
-  url = {https://github.com/VideoVerses/VideoTuna}
-}
-```
+## Related docs
 
+| Doc | Topic |
+|-----|-------|
+| [domain-adult-finetune.md](docs/runbooks/domain-adult-finetune.md) | Full domain training runbook |
+| [checkpoints.md](docs/checkpoints.md) | Weight download layout |
+| [MODEL_VERSIONS.md](docs/MODEL_VERSIONS.md) | FLUX.1 + Wan 2.1/2.2 pins |
+| [cloud-gpu-training.md](docs/runbooks/cloud-gpu-training.md) | Vast.ai provisioning |
+| [vendor-policy.md](docs/vendor-policy.md) | Vendored upstream policy |
 
-## Star History
+## License
 
-[![Star History Chart](https://api.star-history.com/svg?repos=VideoVerses/VideoTuna&type=Date)](https://star-history.com/#VideoVerses/VideoTuna&Date)
+See [LICENSE](./LICENSE).
diff --git a/cloud/vast/.env.cloud.example b/cloud/vast/.env.cloud.example
new file mode 100644
index 00000000..684ac508
--- /dev/null
+++ b/cloud/vast/.env.cloud.example
@@ -0,0 +1,31 @@
+# PrivTune cloud instance environment (Vast.ai / linux-desktop template)
+# Copied to /workspace/PrivTune/.env by bootstrap.sh — do not commit .env.
+
+WORKSPACE=/workspace
+
+# --- Compute ---
+VIDEOTUNA_COMPUTE_BACKEND=cuda
+VIDEOTUNA_ATTN_BACKEND=auto
+VIDEOTUNA_ATTN_BACKEND_STRICT=0
+
+# --- GPU selection ---
+CUDA_VISIBLE_DEVICES=0
+
+# --- Hugging Face (required for gated models e.g. FLUX.1-dev) ---
+HF_TOKEN=
+HF_HOME=/workspace/.cache/huggingface
+
+# --- Training launcher (run-train.sh / run-smoke-train.sh) ---
+# flux-lora | wan-t2v-lora
+TRAIN_PROFILE=flux-lora
+CONFIG_PATH=
+DATA_CONFIG_PATH=
+RESUME_CKPT=
+
+# --- Optional provisioning knobs ---
+# VIDEOTUNA_INSTALL_FLASH_ATTN=1
+# VIDEOTUNA_FAST_HF_DOWNLOAD=1   # cloud only: HF_XET_HIGH_PERFORMANCE=1 for bootstrap + training hub pulls
+
+# --- Template host vars (set by rental provider — do not override) ---
+# PUBLIC_IPADDR=
+# OPEN_BUTTON_TOKEN=
diff --git a/cloud/vast/bootstrap-requirements.txt b/cloud/vast/bootstrap-requirements.txt
new file mode 100644
index 00000000..8a5089c6
--- /dev/null
+++ b/cloud/vast/bootstrap-requirements.txt
@@ -0,0 +1,2 @@
+tenacity>=9.0.0
+pyyaml>=6.0
diff --git a/cloud/vast/bootstrap.sh b/cloud/vast/bootstrap.sh
new file mode 100755
index 00000000..ce292ae6
--- /dev/null
+++ b/cloud/vast/bootstrap.sh
@@ -0,0 +1,202 @@
+#!/usr/bin/env bash
+# VideoTuna first-boot / re-provision bootstrap for Vast.ai linux-desktop templates.
+# Usable as PROVISIONING_SCRIPT or invoked from provisioning.yaml post_commands.
+set -euo pipefail
+
+WORKSPACE="${WORKSPACE:-/workspace}"
+REPO="${WORKSPACE}/VideoTuna"
+MARKER="${WORKSPACE}/.videotuna_provisioned"
+SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
+PROVISION_RETRY="${SCRIPT_DIR}/provision_retry.py"
+
+log() { echo "[videotuna-bootstrap] $*"; }
+
+enable_fast_hf_download() {
+  if [[ "${VIDEOTUNA_FAST_HF_DOWNLOAD:-0}" == "1" ]]; then
+    export HF_XET_HIGH_PERFORMANCE=1
+    log "Fast HF downloads enabled (HF_XET_HIGH_PERFORMANCE=1 via VIDEOTUNA_FAST_HF_DOWNLOAD)"
+  fi
+}
+
+ensure_provision_retry() {
+  log "Ensuring bootstrap retry dependencies (tenacity, pyyaml)..."
+  python3 "${PROVISION_RETRY}" install-bootstrap-deps
+}
+
+ensure_poetry() {
+  export PATH="${HOME}/.local/bin:${PATH}"
+  if command -v poetry >/dev/null 2>&1; then
+    log "Poetry already installed: $(poetry --version)"
+    return 0
+  fi
+  python3 "${PROVISION_RETRY}" install-poetry
+  export PATH="${HOME}/.local/bin:${PATH}"
+  poetry --version
+}
+
+setup_workspace_layout() {
+  log "Creating workspace directories and symlinks..."
+  mkdir -p \
+    "${WORKSPACE}/data/t2i/domain" \
+    "${WORKSPACE}/data/t2v/domain/videos" \
+    "${WORKSPACE}/checkpoints/flux" \
+    "${WORKSPACE}/checkpoints/wan" \
+    "${WORKSPACE}/results" \
+    "${WORKSPACE}/.cache/huggingface"
+
+  mkdir -p "${REPO}/data"
+  ln -sfn "${WORKSPACE}/data/t2i" "${REPO}/data/t2i"
+  ln -sfn "${WORKSPACE}/data/t2v" "${REPO}/data/t2v"
+  ln -sfn "${WORKSPACE}/checkpoints" "${REPO}/checkpoints"
+  ln -sfn "${WORKSPACE}/results" "${REPO}/results"
+}
+
+write_env_file() {
+  local env_file="${REPO}/.env"
+  local example="${SCRIPT_DIR}/.env.cloud.example"
+  if [[ -f "${env_file}" ]]; then
+    log ".env already exists at ${env_file}; skipping template copy"
+    return 0
+  fi
+  if [[ ! -f "${example}" ]]; then
+    log "WARNING: ${example} not found; creating minimal .env"
+    cat >"${env_file}" <<EOF
+WORKSPACE=${WORKSPACE}
+VIDEOTUNA_COMPUTE_BACKEND=cuda
+VIDEOTUNA_ATTN_BACKEND=auto
+CUDA_VISIBLE_DEVICES=0
+HF_HOME=${WORKSPACE}/.cache/huggingface
+TRAIN_PROFILE=flux-lora
+EOF
+  else
+    cp "${example}" "${env_file}"
+  fi
+  chmod 600 "${env_file}"
+
+  # Inject host secrets from template env when set.
+  if [[ -n "${HF_TOKEN:-}" ]]; then
+    if grep -q '^HF_TOKEN=' "${env_file}"; then
+      sed -i "s|^HF_TOKEN=.*|HF_TOKEN=${HF_TOKEN}|" "${env_file}"
+    else
+      echo "HF_TOKEN=${HF_TOKEN}" >>"${env_file}"
+    fi
+  fi
+  if [[ -n "${VIDEOTUNA_ATTN_BACKEND:-}" ]]; then
+    sed -i "s|^VIDEOTUNA_ATTN_BACKEND=.*|VIDEOTUNA_ATTN_BACKEND=${VIDEOTUNA_ATTN_BACKEND}|" \
+      "${env_file}" || echo "VIDEOTUNA_ATTN_BACKEND=${VIDEOTUNA_ATTN_BACKEND}" >>"${env_file}"
+  fi
+  if [[ "${VIDEOTUNA_FAST_HF_DOWNLOAD:-0}" == "1" ]]; then
+    if grep -q '^HF_XET_HIGH_PERFORMANCE=' "${env_file}"; then
+      sed -i "s|^HF_XET_HIGH_PERFORMANCE=.*|HF_XET_HIGH_PERFORMANCE=1|" "${env_file}"
+    else
+      echo "HF_XET_HIGH_PERFORMANCE=1" >>"${env_file}"
+    fi
+  fi
+  log "Wrote ${env_file}"
+}
+
+install_videotuna() {
+  if [[ ! -d "${REPO}" ]]; then
+    log "ERROR: ${REPO} not found — git_repos phase must clone VideoTuna first"
+    exit 1
+  fi
+  cd "${REPO}"
+  export PATH="${HOME}/.local/bin:${PATH}"
+  export HF_HOME="${HF_HOME:-${WORKSPACE}/.cache/huggingface}"
+
+  log "Running poetry install -E cuda --with training..."
+  python3 "${PROVISION_RETRY}" run -- poetry install -E cuda --with training --no-interaction
+
+  poetry run python -c "import hf_xet" 2>/dev/null \
+    || log "WARNING: hf-xet not importable; HF downloads use fallback path"
+
+  log "Installing DeepSpeed (required for Wan / CogVideoX LoRA)..."
+  python3 "${PROVISION_RETRY}" run -- poetry run install-deepspeed
+
+  if [[ "${VIDEOTUNA_INSTALL_FLASH_ATTN:-0}" == "1" ]]; then
+    log "Installing flash-attn (optional, datacenter GPUs)..."
+    python3 "${PROVISION_RETRY}" run -- poetry run install-flash-attn \
+      || log "WARNING: install-flash-attn failed; use VIDEOTUNA_ATTN_BACKEND=sdpa"
+  fi
+}
+
+hf_login() {
+  if [[ -z "${HF_TOKEN:-}" ]]; then
+    log "HF_TOKEN not set; skipping huggingface-cli login and gated downloads"
+    return 0
+  fi
+  export PATH="${HOME}/.local/bin:${PATH}"
+  export HF_HOME="${HF_HOME:-${WORKSPACE}/.cache/huggingface}"
+  cd "${REPO}"
+  if command -v huggingface-cli >/dev/null 2>&1; then
+    huggingface-cli login --token "${HF_TOKEN}" --add-to-git-credential || true
+  else
+    poetry run huggingface-cli login --token "${HF_TOKEN}" --add-to-git-credential || true
+  fi
+}
+
+download_weights_if_missing() {
+  if [[ -z "${HF_TOKEN:-}" ]]; then
+    return 0
+  fi
+  export HF_HOME="${HF_HOME:-${WORKSPACE}/.cache/huggingface}"
+
+  log "Pre-downloading FLUX.1-dev (with retries)..."
+  python3 "${PROVISION_RETRY}" hf-download \
+    black-forest-labs/FLUX.1-dev \
+    "${WORKSPACE}/checkpoints/flux/FLUX.1-dev" \
+    --repo-root "${REPO}"
+
+  log "Pre-downloading Wan2.1-T2V-14B (with retries)..."
+  python3 "${PROVISION_RETRY}" hf-download \
+    Wan-AI/Wan2.1-T2V-14B \
+    "${WORKSPACE}/checkpoints/wan/Wan2.1-T2V-14B" \
+    --repo-root "${REPO}"
+
+  log "Pre-downloading Wan2.1-I2V-14B-480P (with retries)..."
+  python3 "${PROVISION_RETRY}" hf-download \
+    Wan-AI/Wan2.1-I2V-14B-480P \
+    "${WORKSPACE}/checkpoints/wan/Wan2.1-I2V-14B-480P" \
+    --repo-root "${REPO}"
+
+  log "Pre-downloading Wan2.2-T2V-A14B-Diffusers into HF hub cache (with retries)..."
+  python3 "${PROVISION_RETRY}" hf-download-cache \
+    Wan-AI/Wan2.2-T2V-A14B-Diffusers \
+    --repo-root "${REPO}"
+
+  log "Pre-downloading Wan2.2-I2V-A14B-Diffusers into HF hub cache (with retries)..."
+  python3 "${PROVISION_RETRY}" hf-download-cache \
+    Wan-AI/Wan2.2-I2V-A14B-Diffusers \
+    --repo-root "${REPO}"
+}
+
+run_smoke_validation() {
+  cd "${REPO}"
+  export PATH="${HOME}/.local/bin:${PATH}"
+  # shellcheck disable=SC1091
+  [[ -f .env ]] && set -a && source .env && set +a
+
+  log "Running import smoke test..."
+  poetry run test tests/test_import_smoke.py -q
+
+  log "Describing compute environment..."
+  poetry run python -c \
+    "from videotuna.utils.device_utils import describe_compute_environment; print(describe_compute_environment())"
+}
+
+main() {
+  log "Starting VideoTuna bootstrap (workspace=${WORKSPACE})"
+  enable_fast_hf_download
+  ensure_provision_retry
+  ensure_poetry
+  setup_workspace_layout
+  write_env_file
+  install_videotuna
+  hf_login
+  download_weights_if_missing
+  run_smoke_validation
+  touch "${MARKER}"
+  log "Bootstrap complete. Marker: ${MARKER}"
+}
+
+main "$@"
diff --git a/cloud/vast/provision_retry.py b/cloud/vast/provision_retry.py
new file mode 100644
index 00000000..737d951f
--- /dev/null
+++ b/cloud/vast/provision_retry.py
@@ -0,0 +1,302 @@
+#!/usr/bin/env python3
+"""Retry wrapper for Vast bootstrap network steps.
+
+Mirrors provisioning.yaml settings.retry backoff for shell/subprocess steps.
+"""
+
+from __future__ import annotations
+
+import argparse
+import logging
+import os
+import shutil
+import subprocess
+import sys
+import time
+from dataclasses import dataclass
+from pathlib import Path
+
+SCRIPT_DIR = Path(__file__).resolve().parent
+PROVISIONING_MANIFEST = SCRIPT_DIR / "provisioning.yaml"
+BOOTSTRAP_REQUIREMENTS = SCRIPT_DIR / "bootstrap-requirements.txt"
+DOWNLOAD_OK_SENTINEL = ".privtune_download_ok"
+HUB_CACHE_SENTINEL_DIR = ".privtune_hub_cache"
+
+
+def _hub_cache_sentinel_path(repo_id: str) -> Path:
+    hf_home = Path(os.environ.get("HF_HOME", Path.home() / ".cache" / "huggingface"))
+    slug = repo_id.replace("/", "--")
+    return hf_home / HUB_CACHE_SENTINEL_DIR / slug / DOWNLOAD_OK_SENTINEL
+
+
+LOG = logging.getLogger("videotuna-provision-retry")
+
+
+@dataclass(frozen=True)
+class RetrySettings:
+    max_attempts: int = 5
+    initial_delay: float = 2
+    backoff_multiplier: float = 2
+
+
+def load_retry_settings(manifest_path: Path | None = None) -> RetrySettings:
+    path = manifest_path or PROVISIONING_MANIFEST
+    if not path.is_file():
+        return RetrySettings()
+    try:
+        import yaml
+    except ImportError:
+        return RetrySettings()
+    data = yaml.safe_load(path.read_text(encoding="utf-8")) or {}
+    retry = (data.get("settings") or {}).get("retry") or {}
+    return RetrySettings(
+        max_attempts=int(retry.get("max_attempts", 5)),
+        initial_delay=float(retry.get("initial_delay", 2)),
+        backoff_multiplier=float(retry.get("backoff_multiplier", 2)),
+    )
+
+
+def _wait_seconds(settings: RetrySettings, attempt: int) -> float:
+    """Mirror tenacity wait_exponential(multiplier, exp_base, min=multiplier)."""
+    return max(
+        settings.initial_delay,
+        settings.initial_delay * (settings.backoff_multiplier ** (attempt - 1)),
+    )
+
+
+def _make_retry_decorator(settings: RetrySettings):
+    from tenacity import (
+        before_sleep_log,
+        retry,
+        retry_if_exception_type,
+        stop_after_attempt,
+        wait_exponential,
+    )
+
+    return retry(
+        stop=stop_after_attempt(settings.max_attempts),
+        wait=wait_exponential(
+            multiplier=settings.initial_delay,
+            exp_base=settings.backoff_multiplier,
+            min=settings.initial_delay,
+        ),
+        retry=retry_if_exception_type((subprocess.CalledProcessError, OSError)),
+        before_sleep=before_sleep_log(LOG, logging.WARNING),
+        reraise=True,
+    )
+
+
+def run_command(
+    argv: list[str],
+    *,
+    cwd: Path | None = None,
+    env: dict[str, str] | None = None,
+    settings: RetrySettings | None = None,
+) -> None:
+    retry_settings = settings or load_retry_settings()
+
+    @_make_retry_decorator(retry_settings)
+    def _run() -> None:
+        LOG.info("Running: %s", " ".join(argv))
+        subprocess.run(
+            argv,
+            check=True,
+            cwd=str(cwd) if cwd else None,
+            env=env,
+        )
+
+    _run()
+
+
+def install_bootstrap_deps(settings: RetrySettings | None = None) -> None:
+    if not BOOTSTRAP_REQUIREMENTS.is_file():
+        raise FileNotFoundError(f"Missing {BOOTSTRAP_REQUIREMENTS}")
+    retry_settings = settings or load_retry_settings()
+    argv = [
+        sys.executable,
+        "-m",
+        "pip",
+        "install",
+        "--user",
+        "-q",
+        "-r",
+        str(BOOTSTRAP_REQUIREMENTS),
+    ]
+    last_error: BaseException | None = None
+    for attempt in range(1, retry_settings.max_attempts + 1):
+        try:
+            LOG.info("Running: %s", " ".join(argv))
+            subprocess.run(argv, check=True)
+            return
+        except (subprocess.CalledProcessError, OSError) as exc:
+            last_error = exc
+            if attempt >= retry_settings.max_attempts:
+                break
+            delay = _wait_seconds(retry_settings, attempt)
+            LOG.warning(
+                "Attempt %s/%s failed (%s); retrying in %.1fs",
+                attempt,
+                retry_settings.max_attempts,
+                exc,
+                delay,
+            )
+            time.sleep(delay)
+    assert last_error is not None
+    raise last_error
+
+
+def _resolve_hf_argv(repo_root: Path | None) -> list[str]:
+    if shutil.which("hf"):
+        return ["hf"]
+    if shutil.which("huggingface-cli"):
+        return ["huggingface-cli"]
+    if repo_root and repo_root.is_dir():
+        return ["poetry", "run", "hf"]
+    raise RuntimeError("hf / huggingface-cli not found and repo root unavailable")
+
+
+def hf_download(
+    repo_id: str,
+    local_dir: Path,
+    *,
+    repo_root: Path | None = None,
+    settings: RetrySettings | None = None,
+) -> None:
+    local_dir = local_dir.resolve()
+    sentinel = local_dir / DOWNLOAD_OK_SENTINEL
+    if sentinel.is_file():
+        LOG.info("Skipping %s (sentinel %s exists)", repo_id, sentinel)
+        return
+
+    local_dir.mkdir(parents=True, exist_ok=True)
+    hf_base = _resolve_hf_argv(repo_root)
+    argv = [*hf_base, "download", repo_id, "--local-dir", str(local_dir)]
+    run_command(argv, cwd=repo_root, env=os.environ.copy(), settings=settings)
+    sentinel.write_text(f"{repo_id}\n", encoding="utf-8")
+    LOG.info("Download complete: %s -> %s", repo_id, local_dir)
+
+
+def hf_download_hub_cache(
+    repo_id: str,
+    *,
+    repo_root: Path | None = None,
+    settings: RetrySettings | None = None,
+) -> None:
+    """Download a HF repo into the default hub cache (no --local-dir)."""
+    sentinel = _hub_cache_sentinel_path(repo_id)
+    if sentinel.is_file():
+        LOG.info("Skipping %s (hub cache sentinel %s exists)", repo_id, sentinel)
+        return
+
+    hf_base = _resolve_hf_argv(repo_root)
+    argv = [*hf_base, "download", repo_id]
+    run_command(argv, cwd=repo_root, env=os.environ.copy(), settings=settings)
+    sentinel.parent.mkdir(parents=True, exist_ok=True)
+    sentinel.write_text(f"{repo_id}\n", encoding="utf-8")
+    LOG.info("Hub cache download complete: %s", repo_id)
+
+
+def install_poetry(settings: RetrySettings | None = None) -> None:
+    retry_settings = settings or load_retry_settings()
+
+    @_make_retry_decorator(retry_settings)
+    def _install() -> None:
+        LOG.info("Installing Poetry via official installer...")
+        curl = subprocess.run(
+            ["curl", "-sSL", "https://install.python-poetry.org"],
+            check=True,
+            capture_output=True,
+        )
+        subprocess.run(
+            [sys.executable, "-"],
+            input=curl.stdout,
+            check=True,
+        )
+
+    _install()
+
+
+def _configure_logging() -> None:
+    logging.basicConfig(
+        level=logging.INFO,
+        format="[videotuna-provision-retry] %(levelname)s %(message)s",
+    )
+
+
+def main(argv: list[str] | None = None) -> int:
+    _configure_logging()
+    parser = argparse.ArgumentParser(
+        description="Retry wrapper for Vast bootstrap steps"
+    )
+    sub = parser.add_subparsers(dest="command", required=True)
+
+    sub.add_parser(
+        "install-bootstrap-deps", help="pip install bootstrap-requirements.txt"
+    )
+
+    run_parser = sub.add_parser("run", help="Run a command with retries")
+    run_parser.add_argument(
+        "cmd",
+        nargs=argparse.REMAINDER,
+        help="Command after -- (e.g. run -- poetry install ...)",
+    )
+
+    hf_parser = sub.add_parser("hf-download", help="Download HF repo with retries")
+    hf_parser.add_argument("repo_id", help="Hugging Face repo id (org/name)")
+    hf_parser.add_argument("local_dir", type=Path, help="Destination directory")
+    hf_parser.add_argument(
+        "--repo-root",
+        type=Path,
+        default=None,
+        help="VideoTuna repo root for poetry run hf fallback",
+    )
+
+    hf_cache_parser = sub.add_parser(
+        "hf-download-cache", help="Download HF repo into hub cache with retries"
+    )
+    hf_cache_parser.add_argument("repo_id", help="Hugging Face repo id (org/name)")
+    hf_cache_parser.add_argument(
+        "--repo-root",
+        type=Path,
+        default=None,
+        help="VideoTuna repo root for poetry run hf fallback",
+    )
+
+    sub.add_parser("install-poetry", help="Install Poetry via official installer")
+
+    args = parser.parse_args(argv)
+    settings = load_retry_settings()
+
+    try:
+        if args.command == "install-bootstrap-deps":
+            install_bootstrap_deps(settings)
+        elif args.command == "run":
+            cmd = args.cmd
+            if cmd and cmd[0] == "--":
+                cmd = cmd[1:]
+            if not cmd:
+                parser.error("run requires a command after --")
+            run_command(cmd, settings=settings)
+        elif args.command == "hf-download":
+            hf_download(
+                args.repo_id,
+                args.local_dir,
+                repo_root=args.repo_root,
+                settings=settings,
+            )
+        elif args.command == "hf-download-cache":
+            hf_download_hub_cache(
+                args.repo_id,
+                repo_root=args.repo_root,
+                settings=settings,
+            )
+        elif args.command == "install-poetry":
+            install_poetry(settings)
+    except (subprocess.CalledProcessError, OSError, RuntimeError) as exc:
+        LOG.error("Failed after retries: %s", exc)
+        return 1
+    return 0
+
+
+if __name__ == "__main__":
+    sys.exit(main())
diff --git a/cloud/vast/provisioning.yaml b/cloud/vast/provisioning.yaml
new file mode 100644
index 00000000..a2f96bb0
--- /dev/null
+++ b/cloud/vast/provisioning.yaml
@@ -0,0 +1,54 @@
+# VideoTuna provisioning manifest for Vast.ai linux-desktop templates.
+# Set: PROVISIONING_MANIFEST=https://raw.githubusercontent.com/miguelenes/VideoTuna/main/cloud/vast/provisioning.yaml
+version: 1
+
+# settings.retry — per-download backoff for manifest phases (apt, git, conditional_downloads).
+# post_commands (bootstrap.sh) are fail-fast at the provisioner; bootstrap uses provision_retry.py
+# with the same backoff for poetry install / hf download. See docs/runbooks/cloud-gpu-training.md.
+settings:
+  retry:
+    max_attempts: 5
+    initial_delay: 2
+    backoff_multiplier: 2
+
+# on_failure — whole-pipeline retry when any phase exits non-zero (including bootstrap.sh).
+# action: continue keeps the instance up after retries are exhausted. Override via PROVISIONER_* env.
+on_failure:
+  action: continue
+  max_retries: 3
+  retry_delay: 60
+
+auth:
+  huggingface:
+    token_env: HF_TOKEN
+
+apt_packages:
+  - git
+  - curl
+  - wget
+  - ffmpeg
+  - build-essential
+  - libgl1
+  - libsndfile1
+  - ca-certificates
+
+git_repos:
+  - url: https://github.com/miguelenes/VideoTuna.git
+    dest: /workspace/VideoTuna
+    ref: main
+    pull_if_exists: true
+
+# Manifest-phase HF pulls honor instance env HF_XET_HIGH_PERFORMANCE=1.
+# Set VIDEOTUNA_FAST_HF_DOWNLOAD=1 at rent time (bootstrap maps it for post_commands + .env).
+conditional_downloads:
+  - when: hf_token_valid
+    downloads:
+      - url: https://huggingface.co/black-forest-labs/FLUX.1-dev
+        dest: /workspace/checkpoints/flux/FLUX.1-dev
+      - url: https://huggingface.co/Wan-AI/Wan2.1-T2V-14B
+        dest: /workspace/checkpoints/wan/Wan2.1-T2V-14B
+      - url: https://huggingface.co/Wan-AI/Wan2.1-I2V-14B-480P
+        dest: /workspace/checkpoints/wan/Wan2.1-I2V-14B-480P
+
+post_commands:
+  - bash /workspace/VideoTuna/cloud/vast/bootstrap.sh
diff --git a/cloud/vast/run-smoke-train.sh b/cloud/vast/run-smoke-train.sh
new file mode 100755
index 00000000..608212f8
--- /dev/null
+++ b/cloud/vast/run-smoke-train.sh
@@ -0,0 +1,62 @@
+#!/usr/bin/env bash
+# Short GPU smoke training run (5–50 steps) to validate deps before a long job.
+set -euo pipefail
+
+WORKSPACE="${WORKSPACE:-/workspace}"
+REPO="${WORKSPACE}/VideoTuna"
+cd "${REPO}"
+
+export PATH="${HOME}/.local/bin:${PATH}"
+
+if [[ -f "${REPO}/.env" ]]; then
+  set -a
+  # shellcheck disable=SC1091
+  source "${REPO}/.env"
+  set +a
+fi
+
+mkdir -p "${WORKSPACE}/results"
+LOG_OUT="${WORKSPACE}/results/smoke-train.log"
+LOG_ERR="${WORKSPACE}/results/smoke-train.err"
+
+log() { echo "[$(date -Iseconds)] [run-smoke-train] $*" | tee -a "${LOG_OUT}"; }
+
+log "GPU info:"
+nvidia-smi 2>&1 | tee -a "${LOG_OUT}" || log "nvidia-smi not available"
+
+POETRY_VENV_BIN="$(poetry env info -p 2>/dev/null)/bin" || true
+if [[ -n "${POETRY_VENV_BIN}" && -d "${POETRY_VENV_BIN}" ]]; then
+  export PATH="${POETRY_VENV_BIN}:${PATH}"
+fi
+
+TRAIN_PROFILE="${TRAIN_PROFILE:-flux-lora}"
+CONFIG_PATH="${CONFIG_PATH:-}"
+DATA_CONFIG_PATH="${DATA_CONFIG_PATH:-}"
+
+log "Smoke TRAIN_PROFILE=${TRAIN_PROFILE}"
+
+run_cmd() {
+  log "Executing: $*"
+  "$@" >>"${LOG_OUT}" 2>>"${LOG_ERR}"
+}
+
+case "${TRAIN_PROFILE}" in
+  flux-lora)
+  CONFIG_PATH="${CONFIG_PATH:-configs/domain/flux_t2i_cloud_smoke.json}"
+  DATA_CONFIG_PATH="${DATA_CONFIG_PATH:-configs/domain/flux_t2i_data.json}"
+  run_cmd poetry run train-flux-lora \
+    --config_path "${CONFIG_PATH}" \
+    --data_config_path "${DATA_CONFIG_PATH}"
+  ;;
+  wan-t2v-lora)
+  CONFIG_PATH="${CONFIG_PATH:-configs/domain/wan_t2v_lora_cloud_smoke.yaml}"
+  run_cmd poetry run train-wan2-1-t2v-lora --base "${CONFIG_PATH}"
+  ;;
+  *)
+  echo "Unknown TRAIN_PROFILE=${TRAIN_PROFILE}" | tee -a "${LOG_ERR}"
+  echo "Valid: flux-lora, wan-t2v-lora" | tee -a "${LOG_ERR}"
+  exit 1
+  ;;
+esac
+
+log "Smoke training finished."
diff --git a/cloud/vast/run-train.sh b/cloud/vast/run-train.sh
new file mode 100755
index 00000000..f85e2acd
--- /dev/null
+++ b/cloud/vast/run-train.sh
@@ -0,0 +1,75 @@
+#!/usr/bin/env bash
+# Parameterized PrivTune training launcher for cloud GPU instances.
+set -euo pipefail
+
+WORKSPACE="${WORKSPACE:-/workspace}"
+REPO="${WORKSPACE}/PrivTune"
+cd "${REPO}"
+
+export PATH="${HOME}/.local/bin:${PATH}"
+
+if [[ -f "${REPO}/.env" ]]; then
+  set -a
+  # shellcheck disable=SC1091
+  source "${REPO}/.env"
+  set +a
+fi
+
+mkdir -p "${WORKSPACE}/results"
+LOG_OUT="${WORKSPACE}/results/train.log"
+LOG_ERR="${WORKSPACE}/results/train.err"
+
+log() { echo "[$(date -Iseconds)] [run-train] $*" | tee -a "${LOG_OUT}"; }
+
+log "GPU info:"
+nvidia-smi 2>&1 | tee -a "${LOG_OUT}" || log "nvidia-smi not available"
+
+POETRY_VENV_BIN="$(poetry env info -p 2>/dev/null)/bin" || true
+if [[ -n "${POETRY_VENV_BIN}" && -d "${POETRY_VENV_BIN}" ]]; then
+  export PATH="${POETRY_VENV_BIN}:${PATH}"
+fi
+
+TRAIN_PROFILE="${TRAIN_PROFILE:-flux-lora}"
+CONFIG_PATH="${CONFIG_PATH:-}"
+DATA_CONFIG_PATH="${DATA_CONFIG_PATH:-}"
+RESUME_CKPT="${RESUME_CKPT:-}"
+
+log "TRAIN_PROFILE=${TRAIN_PROFILE}"
+
+run_cmd() {
+  log "Executing: $*"
+  "$@" >>"${LOG_OUT}" 2>>"${LOG_ERR}"
+}
+
+case "${TRAIN_PROFILE}" in
+  flux-lora)
+  CONFIG_PATH="${CONFIG_PATH:-configs/domain/flux_t2i.json}"
+  DATA_CONFIG_PATH="${DATA_CONFIG_PATH:-configs/domain/flux_t2i_data.json}"
+  run_cmd poetry run train-flux-lora \
+    --config_path "${CONFIG_PATH}" \
+    --data_config_path "${DATA_CONFIG_PATH}"
+  ;;
+  wan-t2v-lora)
+  CONFIG_PATH="${CONFIG_PATH:-configs/domain/wan_t2v_lora.yaml}"
+  ARGS=(poetry run train-wan2-1-t2v-lora --base "${CONFIG_PATH}")
+  if [[ -n "${RESUME_CKPT}" ]]; then
+    ARGS+=(--resume_ckpt "${RESUME_CKPT}")
+  fi
+  run_cmd "${ARGS[@]}"
+  ;;
+  wan-i2v-lora)
+  CONFIG_PATH="${CONFIG_PATH:-configs/domain/wan_i2v_lora.yaml}"
+  ARGS=(poetry run train-wan2-1-i2v-lora --base "${CONFIG_PATH}")
+  if [[ -n "${RESUME_CKPT}" ]]; then
+    ARGS+=(--resume_ckpt "${RESUME_CKPT}")
+  fi
+  run_cmd "${ARGS[@]}"
+  ;;
+  *)
+  echo "Unknown TRAIN_PROFILE=${TRAIN_PROFILE}" | tee -a "${LOG_ERR}"
+  echo "Valid: flux-lora, wan-t2v-lora, wan-i2v-lora" | tee -a "${LOG_ERR}"
+  exit 1
+  ;;
+esac
+
+log "Training finished successfully."
diff --git a/cloud/vast/supervisor/videotuna-train.conf b/cloud/vast/supervisor/videotuna-train.conf
new file mode 100644
index 00000000..7d9298f3
--- /dev/null
+++ b/cloud/vast/supervisor/videotuna-train.conf
@@ -0,0 +1,20 @@
+; VideoTuna training service for Supervisor (Vast.ai linux-desktop template).
+; Install:
+;   cp /workspace/VideoTuna/cloud/vast/supervisor/videotuna-train.conf /etc/supervisor/conf.d/
+;   supervisorctl reread && supervisorctl update
+; Start:
+;   export TRAIN_PROFILE=flux-lora  # in /workspace/VideoTuna/.env or shell
+;   supervisorctl start videotuna-train
+
+[program:videotuna-train]
+command=/workspace/VideoTuna/cloud/vast/run-train.sh
+directory=/workspace/VideoTuna
+autostart=false
+autorestart=unexpected
+startsecs=10
+stopwaitsecs=120
+stdout_logfile=/workspace/results/train.log
+stderr_logfile=/workspace/results/train.err
+stdout_logfile_maxbytes=50MB
+stderr_logfile_maxbytes=50MB
+environment=HOME="/root",PATH="/root/.local/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin",WORKSPACE="/workspace"
diff --git a/configs/000_videocrafter/vc1_i2v_512.yaml b/configs/000_videocrafter/vc1_i2v_512.yaml
deleted file mode 100644
index 8d112651..00000000
--- a/configs/000_videocrafter/vc1_i2v_512.yaml
+++ /dev/null
@@ -1,90 +0,0 @@
-model:
-  target: videotuna.models.lvdm.ddpm3d.LatentVisualDiffusionFlow
-  params:
-    linear_start: 0.00085
-    linear_end: 0.012
-    timesteps: 1000
-    first_stage_key: video
-    cond_stage_key: caption
-    cond_stage_trainable: false
-    conditioning_key: crossattn
-    image_size:
-    - 40
-    - 64
-    channels: 4
-    scale_by_std: false
-    scale_factor: 0.18215
-    use_ema: false
-    uncond_type: empty_seq
-    use_scale: true
-    scale_b: 0.7
-    finegrained: true
-
-    diffusion_scheduler_config:
-      target: videotuna.schedulers.diffusion_schedulers.LDMScheduler
-      params:
-        timesteps: 1000
-        linear_start: 0.00085
-        linear_end: 0.012
-
-    unet_config:
-      target: videotuna.models.lvdm.modules.networks.openaimodel3d.UNetModel
-      params:
-        in_channels: 4
-        out_channels: 4
-        model_channels: 320
-        attention_resolutions:
-        - 4
-        - 2
-        - 1
-        num_res_blocks: 2
-        channel_mult:
-        - 1
-        - 2
-        - 4
-        - 4
-        num_head_channels: 64
-        transformer_depth: 1
-        context_dim: 1024
-        use_linear: true
-        use_checkpoint: true
-        temporal_conv: true
-        temporal_attention: true
-        temporal_selfatt_only: true
-        use_relative_position: false
-        use_causal_attention: false
-        use_image_attention: true
-        temporal_length: 16
-        addition_attention: true
-        fps_cond: true
-    first_stage_config:
-      target: videotuna.models.lvdm.modules.vae.autoencoder.AutoencoderKL
-      params:
-        embed_dim: 4
-        monitor: val/rec_loss
-        ddconfig:
-          double_z: true
-          z_channels: 4
-          resolution: 512
-          in_channels: 3
-          out_ch: 3
-          ch: 128
-          ch_mult:
-          - 1
-          - 2
-          - 4
-          - 4
-          num_res_blocks: 2
-          attn_resolutions: []
-          dropout: 0.0
-        lossconfig:
-          target: torch.nn.Identity
-    cond_stage_config:
-      target: videotuna.models.lvdm.modules.encoders.condition.FrozenOpenCLIPEmbedder
-      params:
-        freeze: true
-        layer: penultimate
-    img_cond_stage_config:
-      target: videotuna.models.lvdm.modules.encoders.condition.FrozenOpenCLIPImageEmbedderV2
-      params:
-        freeze: true
diff --git a/configs/000_videocrafter/vc1_t2v_1024.yaml b/configs/000_videocrafter/vc1_t2v_1024.yaml
deleted file mode 100644
index 9f9f736d..00000000
--- a/configs/000_videocrafter/vc1_t2v_1024.yaml
+++ /dev/null
@@ -1,84 +0,0 @@
-model:
-  target: videotuna.models.lvdm.ddpm3d.LVDMFlow
-  params:
-    linear_start: 0.00085
-    linear_end: 0.012
-    timesteps: 1000
-    first_stage_key: video
-    cond_stage_key: caption
-    cond_stage_trainable: false
-    conditioning_key: crossattn
-    image_size:
-    - 72
-    - 128
-    channels: 4
-    scale_by_std: false
-    scale_factor: 0.18215
-    use_ema: false
-    uncond_type: empty_seq
-    use_scale: true
-    fix_scale_bug: true
-
-    diffusion_scheduler_config:
-      target: videotuna.schedulers.diffusion_schedulers.LDMScheduler
-      params:
-        timesteps: 1000
-        linear_start: 0.00085
-        linear_end: 0.012
-
-    unet_config:
-      target: videotuna.models.lvdm.modules.networks.openaimodel3d.UNetModel
-      params:
-        in_channels: 4
-        out_channels: 4
-        model_channels: 320
-        attention_resolutions:
-        - 4
-        - 2
-        - 1
-        num_res_blocks: 2
-        channel_mult:
-        - 1
-        - 2
-        - 4
-        - 4
-        num_head_channels: 64
-        transformer_depth: 1
-        context_dim: 1024
-        use_linear: true
-        use_checkpoint: true
-        temporal_conv: false
-        temporal_attention: true
-        temporal_selfatt_only: true
-        use_relative_position: true
-        use_causal_attention: false
-        temporal_length: 16
-        addition_attention: true
-        fps_cond: true
-    first_stage_config:
-      target: videotuna.models.lvdm.modules.vae.autoencoder.AutoencoderKL
-      params:
-        embed_dim: 4
-        monitor: val/rec_loss
-        ddconfig:
-          double_z: true
-          z_channels: 4
-          resolution: 512
-          in_channels: 3
-          out_ch: 3
-          ch: 128
-          ch_mult:
-          - 1
-          - 2
-          - 4
-          - 4
-          num_res_blocks: 2
-          attn_resolutions: []
-          dropout: 0.0
-        lossconfig:
-          target: torch.nn.Identity
-    cond_stage_config:
-      target: videotuna.models.lvdm.modules.encoders.condition.FrozenOpenCLIPEmbedder
-      params:
-        freeze: true
-        layer: penultimate
diff --git a/configs/001_videocrafter2/vc2_t2v_320x512.yaml b/configs/001_videocrafter2/vc2_t2v_320x512.yaml
deleted file mode 100644
index ea1ea223..00000000
--- a/configs/001_videocrafter2/vc2_t2v_320x512.yaml
+++ /dev/null
@@ -1,156 +0,0 @@
-flow:
-  # empty_params_only: True # disable this means finetuning all parameters
-  target: videotuna.flow.videocrafter.VideocrafterFlow
-  params:
-    log_every_t: 200
-    first_stage_key: video
-    cond_stage_key: caption
-    cond_stage_trainable: false
-    conditioning_key: crossattn
-    image_size:
-    - 40
-    - 64
-    channels: 4
-    scale_by_std: false
-    scale_factor: 0.18215
-    use_ema: false
-    uncond_type: empty_seq
-    monitor: train/loss_step
-    encoder_type: 2d
-    use_scale: true
-    scale_b: 0.7 # adapt to videocrafter-v2
-
-    scheduler_config:
-      target: videotuna.schedulers.ddpm.LDDPM
-      params: 
-        timesteps: 1000
-        linear_start: 0.00085
-        linear_end: 0.012
-
-    denoiser_config:
-      target: videotuna.models.lvdm.modules.networks.openaimodel3d.UNetModel
-      params:
-        in_channels: 4
-        out_channels: 4
-        model_channels: 320
-        attention_resolutions:
-        - 4
-        - 2
-        - 1
-        num_res_blocks: 2
-        channel_mult:
-        - 1
-        - 2
-        - 4
-        - 4
-        num_head_channels: 64
-        transformer_depth: 1
-        context_dim: 1024
-        use_linear: true
-        use_checkpoint: true
-        temporal_conv: true # adapt to videocrafter-v2
-        temporal_attention: true
-        temporal_selfatt_only: true
-        use_relative_position: false # adapt to videocrafter-v2
-        use_causal_attention: false
-        temporal_length: 16
-        addition_attention: true
-        fps_cond: true # adapt to videocrafter-v2
-
-    first_stage_config:
-      target: videotuna.models.lvdm.modules.vae.autoencoder.AutoencoderKL
-      params:
-        embed_dim: 4
-        monitor: val/rec_loss
-        ddconfig:
-          double_z: true
-          z_channels: 4
-          resolution: 512
-          in_channels: 3
-          out_ch: 3
-          ch: 128
-          ch_mult:
-          - 1
-          - 2
-          - 4
-          - 4
-          num_res_blocks: 2
-          attn_resolutions: []
-          dropout: 0.0
-        lossconfig:
-          target: torch.nn.Identity
-
-    cond_stage_config:
-      target: videotuna.models.lvdm.modules.encoders.condition.FrozenOpenCLIPEmbedder
-      params:
-        freeze: true
-        layer: penultimate
-        
-train:
-  ckpt: checkpoints/videocrafter/t2v_v2_512_split
-  name: train_vc_t2v
-  logdir: results/train
-  seed: 42
-  debug: false  
-  first_stage_key: video
-  cond_stage_key: caption
-
-  lr_config:
-    base_learning_rate: 6.0e-06
-    scale_lr: False
-
-  data:
-    target: videotuna.data.lightningdata.DataModuleFromConfig
-    params:
-      batch_size: 4
-      num_workers: 16
-      wrap: false
-      train:
-        target: videotuna.data.datasets.DatasetFromCSV
-        params:
-          csv_path: Dataset/ToyDataset/toydataset.csv
-          resolution: [320, 512]
-          video_length: 16
-          frame_interval: 3
-          train: True
-
-  lightning:
-    trainer:
-      benchmark: True
-      num_nodes: 1
-      accumulate_grad_batches: 2
-      max_epochs: 2000
-      precision: bf16 # training precision
-    callbacks:
-      image_logger:
-        target: videotuna.utils.callbacks.ImageLogger
-        params:
-          batch_frequency: 50
-          max_images: 6
-          to_local: True # save videos into files
-          log_images_kwargs:
-            unconditional_guidance_scale: 12.0 # need this, otherwise it is grey
-      model_checkpoint:
-        target: videotuna.utils.callbacks.VideoTunaModelCheckpoint
-        params:
-          filename: "{epoch:03}-{step:09}"
-          save_only_selected_model: True
-          selected_model: ["denoiser"]
-          save_weights_only: False
-          save_on_train_epoch_end: False
-          save_last: True
-          every_n_epochs: 0
-          every_n_train_steps: 100
-
-inference:
-  mode: t2v
-  savedir: results/t2v/videocrafter2
-  seed: 42
-  height: 320
-  width: 512
-  fps: 28
-  n_samples_prompt: 3
-  bs: 2
-  ddim_steps: 50
-  ddim_eta: 1.0
-  unconditional_guidance_scale: 12.0
\ No newline at end of file
diff --git a/configs/001_videocrafter2/vc2_t2v_lora.yaml b/configs/001_videocrafter2/vc2_t2v_lora.yaml
deleted file mode 100644
index 71e35c93..00000000
--- a/configs/001_videocrafter2/vc2_t2v_lora.yaml
+++ /dev/null
@@ -1,146 +0,0 @@
-model:
-  base_learning_rate: 6.0e-06 # 1.5e-04
-  scale_lr: False
-  # empty_params_only: True # If enabled, only the newly added temporal parameters are fine-tuned. If disabled, all spatial-temporal parameters will be fine-tuned.
-  target: videotuna.models.lvdm.ddpm3d.LVDMFlow
-  params:
-    lora_args:
-      # lora_ckpt: "/path/to/lora.ckpt" # no need for the first-time training, only used for resume training.
-      target_modules:  ["to_q", "to_k", "to_v"]
-      lora_rank: 4
-      lora_alpha: 1
-      lora_dropout: 0.0
-    log_every_t: 200
-    first_stage_key: video
-    cond_stage_key: caption
-    cond_stage_trainable: false
-    conditioning_key: crossattn
-    image_size:
-    - 40
-    - 64
-    channels: 4
-    scale_by_std: false
-    scale_factor: 0.18215
-    use_ema: false
-    uncond_type: empty_seq
-    monitor: val/loss_simple_ema
-    encoder_type: 2d
-    use_scale: true
-    scale_b: 0.7 # adapt to videocrafter-v2
-
-    diffusion_scheduler_config:
-      target: videotuna.schedulers.diffusion_schedulers.LDMScheduler
-      params:
-        timesteps: 1000
-        linear_start: 0.00085
-        linear_end: 0.012
-
-    unet_config:
-      target: videotuna.models.lvdm.modules.networks.openaimodel3d.UNetModel
-      params:
-        in_channels: 4
-        out_channels: 4
-        model_channels: 320
-        attention_resolutions:
-        - 4
-        - 2
-        - 1
-        num_res_blocks: 2
-        channel_mult:
-        - 1
-        - 2
-        - 4
-        - 4
-        num_head_channels: 64
-        transformer_depth: 1
-        context_dim: 1024
-        use_linear: true
-        use_checkpoint: true
-        temporal_conv: true # adapt to videocrafter-v2
-        temporal_attention: true
-        temporal_selfatt_only: true
-        use_relative_position: false # adapt to videocrafter-v2
-        use_causal_attention: false
-        temporal_length: 16
-        addition_attention: true
-        fps_cond: true # adapt to videocrafter-v2
-    first_stage_config:
-      target: videotuna.models.lvdm.modules.vae.autoencoder.AutoencoderKL
-      params:
-        embed_dim: 4
-        monitor: val/rec_loss
-        ddconfig:
-          double_z: true
-          z_channels: 4
-          resolution: 512
-          in_channels: 3
-          out_ch: 3
-          ch: 128
-          ch_mult:
-          - 1
-          - 2
-          - 4
-          - 4
-          num_res_blocks: 2
-          attn_resolutions: []
-          dropout: 0.0
-        lossconfig:
-          target: torch.nn.Identity
-    cond_stage_config:
-      target: videotuna.models.lvdm.modules.encoders.condition.FrozenOpenCLIPEmbedder
-      params:
-        freeze: true
-        layer: penultimate
-
-data:
-  target: videotuna.data.lightningdata.DataModuleFromConfig
-  params:
-    batch_size: 4
-    num_workers: 16
-    wrap: false
-    train:
-      target: videotuna.data.datasets.DatasetFromCSV
-      params:
-        csv_path: Dataset/ToyDataset/toydataset.csv
-        resolution: [320, 512]
-        video_length: 16
-        frame_interval: 3
-        train: True
-    validation:
-      target: videotuna.data.datasets.DatasetFromCSV
-      params:
-        csv_path: Dataset/ToyDataset/toydataset.csv
-        resolution: [320, 512]
-        video_length: 16
-        frame_interval: 3
-        train: False
-
-lightning:
-  trainer:
-    benchmark: True
-    # num_workers: 32
-    num_nodes: 1
-    accumulate_grad_batches: 2
-    max_epochs: 2000
-    precision: bf16 # training precision
-  callbacks:
-    image_logger:
-      target: videotuna.utils.callbacks.ImageLogger
-      params:
-        batch_frequency: 50
-        max_images: 2
-        to_local: True # save videos into files
-        log_images_kwargs:
-          unconditional_guidance_scale: 12 # need this, otherwise it is grey
-    modelcheckpoint:
-      target: videotuna.utils.callbacks.LoraModelCheckpoint
-      params:
-        every_n_epochs: 1
-        filename: "{epoch:04}-{step:06}"
-    metrics_over_trainsteps_checkpoint:
-      target: videotuna.utils.callbacks.LoraModelCheckpoint
-      params:
-        filename: "{epoch:06}-{step:09}"
-        save_weights_only: False
-        # every_n_epochs: 300
-        every_n_train_steps: 100
diff --git a/configs/002_dynamicrafter/dc_i2v_1024.yaml b/configs/002_dynamicrafter/dc_i2v_1024.yaml
deleted file mode 100644
index ebecef8c..00000000
--- a/configs/002_dynamicrafter/dc_i2v_1024.yaml
+++ /dev/null
@@ -1,172 +0,0 @@
-model:
-  base_learning_rate: 1.0e-05
-  scale_lr: False
-  target: videotuna.models.lvdm.ddpm3d.LatentVisualDiffusionFlow
-  params:
-    parameterization: v
-    log_every_t: 200
-    first_stage_key: video
-    cond_stage_key: caption
-    cond_stage_trainable: False
-    image_proj_model_trainable: True
-    conditioning_key: hybrid
-    image_size: [72, 128]
-    channels: 4
-    scale_by_std: False
-    scale_factor: 0.18215
-    use_ema: False
-    uncond_prob: 0.05
-    uncond_type: empty_seq
-    rand_cond_frame: true
-    use_scale: true
-    scale_b: 0.3
-    fps_condition_type: fps
-
-    diffusion_scheduler_config:
-      target: videotuna.schedulers.diffusion_schedulers.LDMScheduler
-      params:
-        timesteps: 1000
-        linear_start: 0.00085
-        linear_end: 0.012
-        rescale_betas_zero_snr: True
-
-    unet_config:
-      target: videotuna.models.lvdm.modules.networks.openaimodel3d_dc.UNetModel
-      params:
-        in_channels: 8
-        out_channels: 4
-        model_channels: 320
-        attention_resolutions:
-        - 4
-        - 2
-        - 1
-        num_res_blocks: 2
-        channel_mult:
-        - 1
-        - 2
-        - 4
-        - 4
-        dropout: 0.1
-        num_head_channels: 64
-        transformer_depth: 1
-        context_dim: 1024
-        use_linear: true
-        use_checkpoint: True
-        temporal_conv: True
-        temporal_attention: True
-        temporal_selfatt_only: true
-        use_relative_position: false
-        use_causal_attention: False
-        temporal_length: 16
-        addition_attention: true
-        img_cross_attention: true
-        default_fs: 10
-        fs_condition: true
-
-    first_stage_config:
-      target: videotuna.models.lvdm.modules.vae.autoencoder.AutoencoderKL
-      params:
-        embed_dim: 4
-        monitor: val/rec_loss
-        ddconfig:
-          double_z: True
-          z_channels: 4
-          resolution: 256
-          in_channels: 3
-          out_ch: 3
-          ch: 128
-          ch_mult:
-          - 1
-          - 2
-          - 4
-          - 4
-          num_res_blocks: 2
-          attn_resolutions: []
-          dropout: 0.0
-        lossconfig:
-          target: torch.nn.Identity
-
-    cond_stage_config:
-      target: videotuna.models.lvdm.modules.encoders.condition.FrozenOpenCLIPEmbedder
-      params:
-        freeze: true
-        layer: penultimate
-
-    img_cond_stage_config:
-      target: videotuna.models.lvdm.modules.encoders.condition.FrozenOpenCLIPImageEmbedderV2
-      params:
-        freeze: true
-
-    image_proj_stage_config:
-      target: videotuna.models.lvdm.modules.encoders.ip_resampler.Resampler
-      params:
-        dim: 1024
-        depth: 4
-        dim_head: 64
-        heads: 12
-        num_queries: 16
-        embedding_dim: 1280
-        output_dim: 1024
-        ff_mult: 4
-        video_length: 16
-
-data:
-  target: videotuna.data.lightningdata.DataModuleFromConfig
-  params:
-    batch_size: 2
-    num_workers: 16
-    wrap: false
-    train:
-      target: videotuna.data.datasets.DatasetFromCSV
-      params:
-        csv_path: Dataset/ToyDataset/toydataset.csv
-        resolution: [576, 1024]
-        video_length: 16
-        frame_interval: 3
-        train: True
-    validation:
-      target: videotuna.data.datasets.DatasetFromCSV
-      params:
-        csv_path: Dataset/ToyDataset/toydataset.csv
-        resolution: [576, 1024]
-        video_length: 16
-        frame_interval: 3
-        train: False
-
-lightning:
-  trainer:
-    benchmark: True
-    accumulate_grad_batches: 2
-    max_steps: 100000
-    # logger
-    log_every_n_steps: 50
-    # val
-    val_check_interval: 0.5
-    gradient_clip_algorithm: 'norm'
-    gradient_clip_val: 0.5
-  callbacks:
-    image_logger:
-      target: videotuna.utils.callbacks.ImageLogger
-      params:
-        batch_frequency: 2
-        save_dir: outputs/samples
-        max_images: 6
-        to_local: True # save videos into files
-        log_images_kwargs:
-          sample: false
-          ddim_steps: 50
-          unconditional_guidance_scale: 7.5
-          timestep_spacing: uniform_trailing
-          guidance_rescale: 0.7
-    model_checkpoint:
-      target: pytorch_lightning.callbacks.ModelCheckpoint
-      params:
-        every_n_train_steps: 50
-        filename: "{epoch:04}-{step:06}"
-        save_weights_only: True
-    metrics_over_trainsteps_checkpoint:
-      target: pytorch_lightning.callbacks.ModelCheckpoint
-      params:
-        filename: "{epoch:06}-{step:09}"
-        save_weights_only: True
-        every_n_train_steps: 10000
diff --git a/configs/003_opensora/opensorav10_256x256.yaml b/configs/003_opensora/opensorav10_256x256.yaml
deleted file mode 100644
index c08a4253..00000000
--- a/configs/003_opensora/opensorav10_256x256.yaml
+++ /dev/null
@@ -1,102 +0,0 @@
-model:
-  base_learning_rate: 6.0e-06 # 1.5e-04
-  scale_lr: False
-  target: videotuna.models.opensora.models.iddpm3d.LatentDiffusion
-  params:
-    log_every_t: 200
-    first_stage_key: video
-    cond_stage_key: caption
-    cond_stage_trainable: true
-    conditioning_key: crossattn_stdit
-    image_size: # TO CHECK
-    - 32
-    - 32
-    channels: 4
-    scale_by_std: false
-    scale_factor: 0.18215
-    use_ema: false
-    uncond_type: empty_seq
-    monitor: val/loss_simple_ema
-    encoder_type: 3d
-    use_scale: true
-    scale_b: 0.7 # adapt to videocrafter-v2
-
-    diffusion_scheduler_config:
-      target: videotuna.models.opensora.models.iddpm3d.OpenSoraScheduler
-      params:
-        timesteps: 1000
-        linear_start: 0.00085
-        linear_end: 0.012
-
-    unet_config:
-      target: videotuna.models.opensora.models.stdit.stdit.STDiT_XL_2
-      params:
-        space_scale: 0.5
-        time_scale: 1.0
-        from_pretrained: False
-        enable_flashattn: True
-        enable_layernorm_kernel: False
-        input_size:
-        - 16
-        - 32
-        - 32
-    first_stage_config:
-      target: videotuna.models.opensora.models.vae.opensoravae.VideoAutoencoderKL
-      params:
-          from_pretrained: stabilityai/sd-vae-ft-ema
-          micro_batch_size: 4
-    cond_stage_config:
-      target: videotuna.models.opensora.models.text_encoder.t5.T5Encoder
-      params:
-        from_pretrained: DeepFloyd/t5-v1_1-xxl
-        model_max_length: 120
-        shardformer: False # TODO
-
-data:
-  target: videotuna.data.lightningdata.DataModuleFromConfig
-  params:
-    batch_size: 4
-    num_workers: 16
-    wrap: false
-    train:
-      target: videotuna.data.datasets.DatasetFromCSV
-      params:
-        csv_path: Dataset/ToyDataset/toydataset.csv
-        resolution: [256, 256]
-        video_length: 16
-        frame_interval: 3
-        train: True
-    validation:
-      target: videotuna.data.datasets.DatasetFromCSV
-      params:
-        csv_path: Dataset/ToyDataset/toydataset.csv
-        resolution: [256, 256]
-        video_length: 16
-        frame_interval: 3
-        train: False
-
-lightning:
-  callbacks:
-    image_logger:
-      target: videotuna.utils.callbacks.ImageLogger
-      params:
-        batch_frequency: 200
-        max_images: 6
-        to_local: True # save videos into files
-        log_images_kwargs:
-          unconditional_guidance_scale: 12 # need this, otherwise it is grey
-        save_dir: ./results
-    metrics_over_trainsteps_checkpoint:
-      target: pytorch_lightning.callbacks.ModelCheckpoint
-      params:
-        filename: "{epoch:06}-{step:09}"
-        save_weights_only: False
-        every_n_epochs: null
-        every_n_train_steps: 1000
-  trainer:
-    benchmark: True
-    # num_workers: 32
-    num_nodes: 1
-    accumulate_grad_batches: 1
-    max_epochs: 2000
-    precision: bf16 # training precision
diff --git a/configs/004_cogvideox/cogvideo2b.yaml b/configs/004_cogvideox/cogvideo2b.yaml
deleted file mode 100644
index de4ce8d4..00000000
--- a/configs/004_cogvideox/cogvideo2b.yaml
+++ /dev/null
@@ -1,94 +0,0 @@
-model:
-  base_learning_rate: 6e-6
-  target: videotuna.models.cogvideo_hf.cogvideo_pl.CogVideoXWorkFlow
-  params:
-    # VAE of CogVideoX
-    first_stage_config:
-      target: diffusers.AutoencoderKLCogVideoX
-      params:
-        pretrained_model_name_or_path: checkpoints/cogvideo/CogVideoX-2b
-        subfolder: "vae"
-
-    # Text encoder (T5) of CogVideoX
-    cond_stage_config:
-      target: videotuna.models.lvdm.modules.encoders.condition.FrozenT5Embedder
-      params:
-        version: "DeepFloyd/t5-v1_1-xxl"
-        device: "cuda"
-        max_length: 226
-        freeze: True
-
-    # Denosier model
-    denoiser_config:
-      target: diffusers.CogVideoXTransformer3DModel
-      params:
-        pretrained_model_name_or_path: checkpoints/cogvideo/CogVideoX-2b
-        subfolder: "transformer"
-        load_dtype: fp16 # bf16 5b / fp16 2B
-        # revision: null
-        # variant: null
-
-    # Lora module
-    adapter_config:
-      target: peft.LoraConfig
-      params:
-        r: 4
-        lora_alpha: 1.0
-        init_lora_weights: True
-        target_modules: ["to_k", "to_q", "to_v", "to_out.0"]
-
-    # Diffusion sampling scheduler
-    scheduler_config:
-      target: diffusers.CogVideoXDPMScheduler
-      params:
-        pretrained_model_name_or_path: checkpoints/cogvideo/CogVideoX-2b
-        subfolder: scheduler
-
-# data configs
-data:
-  target: videotuna.data.lightningdata.DataModuleFromConfig
-  params:
-    batch_size: 2
-    num_workers: 16
-    wrap: false
-    train:
-      target: videotuna.data.cogvideo_dataset.VideoDataset
-      params:
-        instance_data_root: "inputs/t2v/cogvideo/elon_musk_video"
-        dataset_name: null
-        dataset_config_name: null
-        caption_column: "labels.txt"
-        video_column: "videos.txt"
-        height: 480
-        width: 720
-        fps: 28
-        max_num_frames: 2
-        skip_frames_start: 0
-        skip_frames_end: 0
-        cache_dir: ~/.cache
-        id_token: null
-
-# training configs
-lightning:
-  trainer:
-    benchmark: True
-    num_nodes: 1
-    accumulate_grad_batches: 2
-    max_epochs: 2000
-    precision: 32
-  callbacks:
-    image_logger:
-      target: videotuna.utils.callbacks.ImageLogger
-      params:
-        batch_frequency: 100000
-        max_images: 2
-        to_local: True # save videos into local files
-        log_images_kwargs:
-          unconditional_guidance_scale: 6
-    metrics_over_trainsteps_checkpoint:
-      target: pytorch_lightning.callbacks.ModelCheckpoint
-      params:
-        filename: "{epoch:06}-{step:09}"
-        save_weights_only: False
-        # every_n_epochs: 300
-        every_n_train_steps: 10
diff --git a/configs/004_cogvideox/cogvideo5b-i2v-fullft.yaml b/configs/004_cogvideox/cogvideo5b-i2v-fullft.yaml
deleted file mode 100644
index 2a4390ba..00000000
--- a/configs/004_cogvideox/cogvideo5b-i2v-fullft.yaml
+++ /dev/null
@@ -1,88 +0,0 @@
-model:
-  base_learning_rate: 6e-6
-  target: videotuna.models.cogvideo_hf.cogvideo_i2v.CogVideoXI2V
-  params:
-    noised_image_input: True
-    noised_image_dropout: 0.05
-    # VAE of CogVideoX
-    first_stage_config:
-      target: diffusers.AutoencoderKLCogVideoX
-      params:
-        pretrained_model_name_or_path: checkpoints/cogvideo/CogVideoX-5b-I2V
-        subfolder: "vae"
-
-    # Text encoder (T5) of CogVideoX
-    cond_stage_config:
-      target: videotuna.models.lvdm.modules.encoders.condition.FrozenT5Embedder
-      params:
-        version: "DeepFloyd/t5-v1_1-xxl"
-        device: "cuda"
-        max_length: 226
-        freeze: True
-
-    # Denosier model
-    denoiser_config:
-      target: diffusers.CogVideoXTransformer3DModel
-      params:
-        pretrained_model_name_or_path: checkpoints/cogvideo/CogVideoX-5b-I2V
-        subfolder: "transformer"
-
-    # Diffusion sampling scheduler
-    scheduler_config:
-      target: diffusers.CogVideoXDPMScheduler
-      params:
-        pretrained_model_name_or_path: checkpoints/cogvideo/CogVideoX-5b-I2V
-        subfolder: scheduler
-
-data:
-  target: videotuna.data.lightningdata.DataModuleFromConfig
-  params:
-    batch_size: 1
-    num_workers: 16
-    wrap: false
-    train:
-      target: videotuna.data.datasets.DatasetFromCSV
-      params:
-        csv_path: ${YOUR_DATA_CSV_PATH}
-        height: 480
-        width: 720
-        video_length: 49
-        frame_interval: 1
-        train: True
-        image_to_video: true
-    validation:
-      target: videotuna.data.datasets.DatasetFromCSV
-      params:
-        csv_path: ${YOUR_DATA_CSV_PATH}
-        height: 480
-        width: 720
-        video_length: 49
-        frame_interval: 1
-        train: False
-        image_to_video: true
-
-# training configs
-lightning:
-  strategy: deepspeed_stage_2
-  trainer:
-    benchmark: True
-    num_nodes: 1
-    accumulate_grad_batches: 2
-    max_epochs: 2000
-    precision: 16
-  callbacks:
-    image_logger:
-      target: videotuna.utils.callbacks.ImageLogger
-      params:
-        batch_frequency: 100
-        max_images: 2
-        to_local: True # save videos into local files
-        log_images_kwargs:
-          unconditional_guidance_scale: 6
-    metrics_over_trainsteps_checkpoint:
-      target: pytorch_lightning.callbacks.ModelCheckpoint
-      params:
-        filename: "{epoch:06}-{step:09}"
-        save_weights_only: False
-        # every_n_epochs: 300
-        every_n_train_steps: 200
diff --git a/configs/004_cogvideox/cogvideo5b-i2v.yaml b/configs/004_cogvideox/cogvideo5b-i2v.yaml
deleted file mode 100644
index 6f752d81..00000000
--- a/configs/004_cogvideox/cogvideo5b-i2v.yaml
+++ /dev/null
@@ -1,97 +0,0 @@
-model:
-  base_learning_rate: 6e-6
-  target: videotuna.models.cogvideo_hf.cogvideo_i2v.CogVideoXI2V
-  params:
-    noised_image_input: True
-    noised_image_dropout: 0.05
-    # VAE of CogVideoX
-    first_stage_config:
-      target: diffusers.AutoencoderKLCogVideoX
-      params:
-        pretrained_model_name_or_path: checkpoints/cogvideo/CogVideoX-5b-I2V
-        subfolder: "vae"
-
-    # Text encoder (T5) of CogVideoX
-    cond_stage_config:
-      target: videotuna.models.lvdm.modules.encoders.condition.FrozenT5Embedder
-      params:
-        version: "DeepFloyd/t5-v1_1-xxl"
-        device: "cuda"
-        max_length: 226
-        freeze: True
-
-    # Denosier model
-    denoiser_config:
-      target: diffusers.CogVideoXTransformer3DModel
-      params:
-        pretrained_model_name_or_path: checkpoints/cogvideo/CogVideoX-5b-I2V
-        subfolder: "transformer"
-        # load_dtype: bf16 # bf16 for 5b / fp16 for 2B
-
-    # Lora module
-    adapter_config:
-      target: peft.LoraConfig
-      params:
-        r: 4
-        lora_alpha: 1.0
-        init_lora_weights: True
-        target_modules: ["to_k", "to_q", "to_v", "to_out.0"]
-
-    # Diffusion sampling scheduler
-    scheduler_config:
-      target: diffusers.CogVideoXDPMScheduler
-      params:
-        pretrained_model_name_or_path: checkpoints/cogvideo/CogVideoX-5b-I2V
-        subfolder: scheduler
-
-data:
-  target: videotuna.data.lightningdata.DataModuleFromConfig
-  params:
-    batch_size: 1
-    num_workers: 16
-    wrap: false
-    train:
-      target: videotuna.data.datasets.DatasetFromCSV
-      params:
-        csv_path: ${YOUR_DATA_CSV_PATH}
-        height: 480
-        width: 720
-        video_length: 49
-        frame_interval: 1
-        train: True
-        image_to_video: true
-    validation:
-      target: videotuna.data.datasets.DatasetFromCSV
-      params:
-        csv_path: ${YOUR_DATA_CSV_PATH}
-        height: 480
-        width: 720
-        video_length: 49
-        frame_interval: 1
-        train: False
-        image_to_video: true
-
-# training configs
-lightning:
-  trainer:
-    benchmark: True
-    num_nodes: 1
-    accumulate_grad_batches: 2
-    max_epochs: 2000
-    precision: 32
-  callbacks:
-    image_logger:
-      target: videotuna.utils.callbacks.ImageLogger
-      params:
-        batch_frequency: 100
-        max_images: 2
-        to_local: True # save videos into local files
-        log_images_kwargs:
-          unconditional_guidance_scale: 6
-    metrics_over_trainsteps_checkpoint:
-      target: pytorch_lightning.callbacks.ModelCheckpoint
-      params:
-        filename: "{epoch:06}-{step:09}"
-        save_weights_only: False
-        # every_n_epochs: 300
-        every_n_train_steps: 2 #200
diff --git a/configs/004_cogvideox/cogvideo5b-t2v-fullft.yaml b/configs/004_cogvideox/cogvideo5b-t2v-fullft.yaml
deleted file mode 100644
index f9ad3944..00000000
--- a/configs/004_cogvideox/cogvideo5b-t2v-fullft.yaml
+++ /dev/null
@@ -1,84 +0,0 @@
-model:
-  base_learning_rate: 6e-6
-  target: videotuna.models.cogvideo_hf.cogvideo_pl.CogVideoXWorkFlow
-  params:
-    # VAE of CogVideoX
-    first_stage_config:
-      target: diffusers.AutoencoderKLCogVideoX
-      params:
-        pretrained_model_name_or_path: checkpoints/cogvideo/CogVideoX-5b
-        subfolder: "vae"
-
-    # Text encoder (T5) of CogVideoX
-    cond_stage_config:
-      target: videotuna.models.lvdm.modules.encoders.condition.FrozenT5Embedder
-      params:
-        version: "DeepFloyd/t5-v1_1-xxl"
-        device: "cuda"
-        max_length: 226
-        freeze: True
-
-    # Denosier model
-    denoiser_config:
-      target: diffusers.CogVideoXTransformer3DModel
-      params:
-        pretrained_model_name_or_path: checkpoints/cogvideo/CogVideoX-5b
-        subfolder: "transformer"
-
-    # Diffusion sampling scheduler
-    scheduler_config:
-      target: diffusers.CogVideoXDPMScheduler
-      params:
-        pretrained_model_name_or_path: checkpoints/cogvideo/CogVideoX-5b
-        subfolder: scheduler
-
-data:
-  target: videotuna.data.lightningdata.DataModuleFromConfig
-  params:
-    batch_size: 1
-    num_workers: 16
-    wrap: false
-    train:
-      target: videotuna.data.datasets.DatasetFromCSV
-      params:
-        csv_path: ${YOUR_DATA_CSV_PATH}
-        height: 480
-        width: 720
-        video_length: 49
-        frame_interval: 1
-        train: True
-    validation:
-      target: videotuna.data.datasets.DatasetFromCSV
-      params:
-        csv_path: ${YOUR_DATA_CSV_PATH}
-        height: 480
-        width: 720
-        video_length: 49
-        frame_interval: 1
-        train: False
-
-# training configs
-lightning:
-  strategy: deepspeed_stage_2
-  trainer:
-    benchmark: True
-    num_nodes: 1
-    accumulate_grad_batches: 2
-    max_epochs: 2000
-    precision: 16
-  callbacks:
-    image_logger:
-      target: videotuna.utils.callbacks.ImageLogger
-      params:
-        batch_frequency: 50
-        max_images: 2
-        to_local: True # save videos into local files
-        log_images_kwargs:
-          unconditional_guidance_scale: 6
-    metrics_over_trainsteps_checkpoint:
-      target: pytorch_lightning.callbacks.ModelCheckpoint
-      params:
-        filename: "{epoch:06}-{step:09}"
-        save_weights_only: False
-        # every_n_epochs: 300
-        every_n_train_steps: 10
diff --git a/configs/004_cogvideox/cogvideo5b.yaml b/configs/004_cogvideox/cogvideo5b.yaml
deleted file mode 100644
index 6b3e4747..00000000
--- a/configs/004_cogvideox/cogvideo5b.yaml
+++ /dev/null
@@ -1,92 +0,0 @@
-model:
-  base_learning_rate: 6e-6
-  target: videotuna.models.cogvideo_hf.cogvideo_pl.CogVideoXWorkFlow
-  params:
-    # VAE of CogVideoX
-    first_stage_config:
-      target: diffusers.AutoencoderKLCogVideoX
-      params:
-        pretrained_model_name_or_path: checkpoints/cogvideo/CogVideoX-5b
-        subfolder: "vae"
-
-    # Text encoder (T5) of CogVideoX
-    cond_stage_config:
-      target: videotuna.models.lvdm.modules.encoders.condition.FrozenT5Embedder
-      params:
-        version: "DeepFloyd/t5-v1_1-xxl"
-        device: "cuda"
-        max_length: 226
-        freeze: True
-
-    # Denosier model
-    denoiser_config:
-      target: diffusers.CogVideoXTransformer3DModel
-      params:
-        pretrained_model_name_or_path: checkpoints/cogvideo/CogVideoX-5b
-        subfolder: "transformer"
-
-    # Lora module
-    adapter_config:
-      target: peft.LoraConfig
-      params:
-        r: 4
-        lora_alpha: 1.0
-        init_lora_weights: True
-        target_modules: ["to_k", "to_q", "to_v", "to_out.0"]
-
-    # Diffusion sampling scheduler
-    scheduler_config:
-      target: diffusers.CogVideoXDPMScheduler
-      params:
-        pretrained_model_name_or_path: checkpoints/cogvideo/CogVideoX-5b
-        subfolder: scheduler
-
-data:
-  target: videotuna.data.lightningdata.DataModuleFromConfig
-  params:
-    batch_size: 1
-    num_workers: 16
-    wrap: false
-    train:
-      target: videotuna.data.datasets.DatasetFromCSV
-      params:
-        csv_path: ${YOUR_DATA_CSV_PATH}
-        height: 480
-        width: 720
-        video_length: 49
-        frame_interval: 1
-        train: True
-    validation:
-      target: videotuna.data.datasets.DatasetFromCSV
-      params:
-        csv_path: ${YOUR_DATA_CSV_PATH}
-        height: 480
-        width: 720
-        video_length: 49
-        frame_interval: 1
-        train: False
-
-# training configs
-lightning:
-  trainer:
-    benchmark: True
-    num_nodes: 1
-    accumulate_grad_batches: 2
-    max_epochs: 2000
-    precision: 32
-  callbacks:
-    image_logger:
-      target: videotuna.utils.callbacks.ImageLogger
-      params:
-        batch_frequency: 50
-        max_images: 2
-        to_local: True # save videos into local files
-        log_images_kwargs:
-          unconditional_guidance_scale: 6
-    metrics_over_trainsteps_checkpoint:
-      target: pytorch_lightning.callbacks.ModelCheckpoint
-      params:
-        filename: "{epoch:06}-{step:09}"
-        save_weights_only: False
-        # every_n_epochs: 300
-        every_n_train_steps: 10
diff --git a/configs/005_cogvideox1.5/cogvideox1.5_5b.yaml b/configs/005_cogvideox1.5/cogvideox1.5_5b.yaml
deleted file mode 100644
index a096a90c..00000000
--- a/configs/005_cogvideox1.5/cogvideox1.5_5b.yaml
+++ /dev/null
@@ -1,149 +0,0 @@
-model:
-  scale_factor: 0.7
-  disable_first_stage_autocast: true
-  latent_input: true
-  log_keys:
-    - txt
-
-  denoiser_config:
-    target: sgm.modules.diffusionmodules.denoiser.DiscreteDenoiser
-    params:
-      num_idx: 1000
-      quantize_c_noise: False
-
-      weighting_config:
-        target: sgm.modules.diffusionmodules.denoiser_weighting.EpsWeighting
-      scaling_config:
-        target: sgm.modules.diffusionmodules.denoiser_scaling.VideoScaling
-      discretization_config:
-        target: sgm.modules.diffusionmodules.discretizer.ZeroSNRDDPMDiscretization
-
-  network_config:
-    target: dit_video_concat.DiffusionTransformer
-    params:
-      time_embed_dim: 512
-      elementwise_affine: True
-      num_frames: 81 # for 5 seconds and 161 for 10 seconds
-      time_compressed_rate: 4
-      latent_width: 300
-      latent_height: 300
-      num_layers: 42
-      patch_size: [2, 2, 2]
-      in_channels: 16
-      out_channels: 16
-      hidden_size: 3072
-      adm_in_channels: 256
-      num_attention_heads: 48
-
-      transformer_args:
-        checkpoint_activations: True
-        vocab_size: 1
-        max_sequence_length: 64
-        layernorm_order: pre
-        skip_init: false
-        model_parallel_size: 1
-        is_decoder: false
-
-      modules:
-        pos_embed_config:
-          target: dit_video_concat.Rotary3DPositionEmbeddingMixin
-          params:
-            hidden_size_head: 64
-            text_length: 224
-
-        patch_embed_config:
-          target: dit_video_concat.ImagePatchEmbeddingMixin
-          params:
-            text_hidden_size: 4096
-
-        adaln_layer_config:
-          target: dit_video_concat.AdaLNMixin
-          params:
-            qk_ln: True
-
-        final_layer_config:
-          target: dit_video_concat.FinalLayerMixin
-
-  conditioner_config:
-    target: sgm.modules.GeneralConditioner
-    params:
-      emb_models:
-          - is_trainable: false
-            input_key: txt
-            ucg_rate: 0.1
-            target: sgm.modules.encoders.modules.FrozenT5Embedder
-            params:
-              model_dir: "checkpoints/cogvideo/CogVideoX1.5-5B-SAT/t5-v1_1-xxl"
-              max_length: 224
-
-
-  first_stage_config:
-    target : vae_modules.autoencoder.VideoAutoencoderInferenceWrapper
-    params:
-        cp_size: 1
-        ckpt_path: "checkpoints/cogvideo/CogVideoX1.5-5B-SAT/vae/3d-vae.pt"
-        ignore_keys: ['loss']
-
-        loss_config:
-          target: torch.nn.Identity
-
-        regularizer_config:
-          target: vae_modules.regularizers.DiagonalGaussianRegularizer
-
-        encoder_config:
-          target: vae_modules.cp_enc_dec.ContextParallelEncoder3D
-          params:
-            double_z: true
-            z_channels: 16
-            resolution: 256
-            in_channels: 3
-            out_ch: 3
-            ch: 128
-            ch_mult: [1, 2, 2, 4]
-            attn_resolutions: []
-            num_res_blocks: 3
-            dropout: 0.0
-            gather_norm: True
-
-        decoder_config:
-          target: vae_modules.cp_enc_dec.ContextParallelDecoder3D
-          params:
-            double_z: True
-            z_channels: 16
-            resolution: 256
-            in_channels: 3
-            out_ch: 3
-            ch: 128
-            ch_mult: [1, 2, 2, 4]
-            attn_resolutions: []
-            num_res_blocks: 3
-            dropout: 0.0
-            gather_norm: True
-
-  loss_fn_config:
-    target: sgm.modules.diffusionmodules.loss.VideoDiffusionLoss
-    params:
-      offset_noise_level: 0
-      sigma_sampler_config:
-        target: sgm.modules.diffusionmodules.sigma_sampling.DiscreteSampling
-        params:
-          uniform_sampling: True
-          group_num: 40
-          num_idx: 1000
-          discretization_config:
-            target: sgm.modules.diffusionmodules.discretizer.ZeroSNRDDPMDiscretization
-
-  sampler_config:
-    target: sgm.modules.diffusionmodules.sampling.VPSDEDPMPP2MSampler
-    params:
-      num_steps: 50
-      verbose: True
-
-      discretization_config:
-        target: sgm.modules.diffusionmodules.discretizer.ZeroSNRDDPMDiscretization
-      guider_config:
-        target: sgm.modules.diffusionmodules.guiders.DynamicCFG
-        params:
-          scale: 6
-          exp: 5
-          num_steps: 50
diff --git a/configs/006_flux/README.md b/configs/006_flux/README.md
new file mode 100644
index 00000000..ab7ec790
--- /dev/null
+++ b/configs/006_flux/README.md
@@ -0,0 +1,3 @@
+# Deprecated
+
+This directory holds backward-compat symlinks only. Use [`configs/domain/`](../domain/) for training configs.
diff --git a/configs/006_flux/cloud_smoke.json b/configs/006_flux/cloud_smoke.json
new file mode 120000
index 00000000..d364eafe
--- /dev/null
+++ b/configs/006_flux/cloud_smoke.json
@@ -0,0 +1 @@
+../domain/flux_t2i_cloud_smoke.json
\ No newline at end of file
diff --git a/configs/007_hunyuanvideo/hunyuanvideo_i2v.yaml b/configs/007_hunyuanvideo/hunyuanvideo_i2v.yaml
deleted file mode 100644
index 141b36fc..00000000
--- a/configs/007_hunyuanvideo/hunyuanvideo_i2v.yaml
+++ /dev/null
@@ -1,146 +0,0 @@
-flow:
-  target: videotuna.flow.hunyuanvideo.HunyuanVideoFlow
-  params:
-    # Model Configuration
-    precision: bf16
-    rope_theta: 256
-    time_shift: 7.0
-
-    # Image-to-Video Settings
-    i2v_mode: true
-    i2v_condition_type: token_replace
-    use_cpu_offload: true
-    disable_autocast: false
-
-    # VAE Configuration
-    vae_type: 884-16c-hy
-    vae_precision: fp16
-    vae_tiling: true
-
-    # Path Settings
-    ckpt_path: checkpoints/hunyuanvideo/HunyuanVideo-I2V
-    denoiser_ckpt_path: ${flow.params.ckpt_path}
-    dit_weight: ${flow.params.ckpt_path}/hunyuan-video-i2v-720p/transformers/mp_rank_00_model_states.pt
-    first_stage_ckpt_path: ${flow.params.ckpt_path}/hunyuan-video-i2v-720p/vae
-
-    # Undeveloppred Settings
-    use_fp8: false
-    ulysses_degree: 1
-    ring_degree: 1
-
-    use_lora: false
-    lora_path: ''
-    lora_scale: 1.0
-    lora_rank: 64
-      
-    first_stage_config:
-      target: videotuna.models.hunyuan.hyvideo_i2v.vae.autoencoder_kl_causal_3d.AutoencoderKLCausal3DWrapper
-      params:
-        vae_type: ${flow.params.vae_type}
-        vae_path: ${flow.params.first_stage_ckpt_path}
-        use_cpu_offload: ${flow.params.use_cpu_offload}
-        vae_precision: fp16
-        device: cuda
-    
-    cond_stage_config:
-      target: videotuna.models.hunyuan.hyvideo_i2v.text_encoder.TextEncoderWrapper
-      params:
-        i2v_mode: ${flow.params.i2v_mode}
-        i2v_condition_type: ${flow.params.i2v_condition_type}
-        text_encoder: "llm-i2v"
-        text_states_dim: 4096
-        text_len: 256
-        tokenizer: llm-i2v
-        prompt_template: dit-llm-encode-i2v
-        prompt_template_video: dit-llm-encode-video-i2v
-        hidden_state_skip_layer: 2
-        apply_final_norm: false
-        reproduce: false
-        use_cpu_offload: ${flow.params.use_cpu_offload}
-        device: cuda
-        text_encoder_precision: "fp16"
-
-    cond_stage_2_config:
-      target: videotuna.models.hunyuan.hyvideo_i2v.text_encoder.TextEncoder
-      params:
-        text_encoder_type: clipL
-        max_length: 77
-        text_encoder_precision: fp16
-        tokenizer_type: clipL
-        device: cpu
-    
-    # Denoiser model wrapper 
-    denoiser_config:
-      target: videotuna.models.hunyuan.hyvideo_i2v.modules.models.HYVideoDiffusionTransformerWrapper
-      params:
-        i2v_mode: ${flow.params.i2v_mode}
-        i2v_condition_type: ${flow.params.i2v_condition_type}
-        device: 'cuda'
-        precision: bf16
-        latent_channels: 16
-        text_states_dim: 4096
-        text_states_dim_2: 768
-        gradient_checkpoint: false
-        gradient_checkpoint_layers: -1
-        embedded_cfg_scale: 6.0
-        model: HYVideo-T/2
-        ckpt_path: ${flow.params.denoiser_ckpt_path}
-        i2v_dit_weight: ${flow.params.dit_weight}
-        load_key: module
-    
-    # Diffusion sampling scheduler
-    scheduler_config:
-      target: videotuna.models.hunyuan.hyvideo_i2v.diffusion.schedulers.scheduling_flow_match_discrete.FlowMatchDiscreteScheduler
-      params:
-        shift: ${flow.params.time_shift}
-        reverse: True
-        solver: 'euler'
-
-
-
-inference:
-  mode: i2v
-  ckpt_path: checkpoints/hunyuanvideo/HunyuanVideo-I2V
-  dit_weight: checkpoints/hunyuanvideo/HunyuanVideo-I2V/hunyuan-video-i2v-720p/transformers/mp_rank_00_model_states.pt
-  savedir: results/i2v/hunyuanvideo
-  seed: 42
-  height: 360
-  width: 640
-  i2v_resolution: 360p
-  # height: 720
-  # width: 1280
-  # i2v_resolution: 720p
-  prompt_dir: "inputs/i2v/576x1024"      
-  num_inference_steps: 50                
-  time_shift: 7.0                
-  unconditional_guidance_scale: 1.0  
-  uncond_prompt: null                     
-  frames: 129
-  n_samples_prompt: 1
-  bs: 1
-  savefps: 28
-  embedded_guidance_scale: 6.0
-  ulysses_degree: 1
-  ring_degree: 1
-  xdit_adaptive_size: false
-
-  i2v_mode: true
-  i2v_condition_type: token_replace
-  i2v_stability: true
-  enable_sequential_cpu_offload: true
-  enable_vae_tiling: true
-
-  mapping:
-    inference.time_shift : flow.params.time_shift
-    inference.i2v_mode : flow.params.i2v_mode
-    inference.i2v_condition_type : flow.params.i2v_condition_type
-    inference.ring_degree : flow.params.ring_degree
-    inference.ulysses_degree : flow.params.ulysses_degree
-    inference.ckpt_path : flow.params.ckpt_path
-    inference.dit_weight : flow.params.dit_weight
-    inference.enable_sequential_cpu_offload : flow.params.use_cpu_offload
-    inference.enable_vae_tiling: flow.params.vae_tiling
-
-
-    
-
diff --git a/configs/007_hunyuanvideo/hunyuanvideo_t2v_diffuser.yaml b/configs/007_hunyuanvideo/hunyuanvideo_t2v_diffuser.yaml
deleted file mode 100644
index e04c1c59..00000000
--- a/configs/007_hunyuanvideo/hunyuanvideo_t2v_diffuser.yaml
+++ /dev/null
@@ -1,122 +0,0 @@
-model:
-  base_learning_rate: 6e-6
-  skip_loading_weight: true
-  target: videotuna.models.hunyuan.hyvideo_t2v.hunyuanvideo.HunyuanVideoWorkFlow
-  params: 
-    # VAE of HunyuanVideo
-    first_stage_config:
-      target: diffusers.AutoencoderKLHunyuanVideo
-      params:
-        pretrained_model_name_or_path: checkpoints/hunyuanvideo/HunyuanVideo
-        subfolder: "vae"
-        load_dtype: fp16 # bf16 5b / fp16 2B 
-        # load_dtype: fp32 # bf16 5b / fp16 2B 
-    
-    # Text encoder 
-    cond_stage_config:
-      target: transformers.LlamaModel
-      params:
-        pretrained_model_name_or_path: checkpoints/hunyuanvideo/HunyuanVideo
-        subfolder: "text_encoder"
-        # torch_dtype: auto # bf16 5b / fp16 2B 
-        torch_dtype: float16 # bf16 5b / fp16 2B 
-
-    tokenizer_config: 
-      target: transformers.LlamaTokenizerFast
-      params:
-        pretrained_model_name_or_path: checkpoints/hunyuanvideo/HunyuanVideo
-        subfolder: "tokenizer"
-        # torch_dtype: auto # bf16 5b / fp16 2B 
-        # torch_dtype: fp16 # bf16 5b / fp16 2B 
-        torch_dtype: float16 # bf16 5b / fp16 2B 
-
-
-    cond_stage_config_2:
-      target: transformers.CLIPTextModel
-      params:
-        pretrained_model_name_or_path: checkpoints/hunyuanvideo/HunyuanVideo
-        subfolder: "text_encoder_2"
-        # torch_dtype: auto # bf16 5b / fp16 2B 
-        # torch_dtype: fp16 # bf16 5b / fp16 2B 
-        torch_dtype: float16 # bf16 5b / fp16 2B 
-    
-    tokenizer_config_2: 
-      target: transformers.CLIPTokenizer
-      params:
-        pretrained_model_name_or_path: checkpoints/hunyuanvideo/HunyuanVideo
-        subfolder: "tokenizer_2"
-        # torch_dtype: auto # bf16 5b / fp16 2B 
-        # torch_dtype: fp16 # bf16 5b / fp16 2B 
-        torch_dtype: float16 # bf16 5b / fp16 2B 
-
-
-    # Denosier model
-    denoiser_config:
-      target: diffusers.HunyuanVideoTransformer3DModel
-      params:
-        pretrained_model_name_or_path: checkpoints/hunyuanvideo/HunyuanVideo
-        subfolder: "transformer"
-        load_dtype: fp16 # bf16 5b / fp16 2B 
-        # load_dtype: fp32 # bf16 5b / fp16 2B 
-        
-
-    # # Diffusion sampling scheduler
-    scheduler_config:
-      target: diffusers.FlowMatchEulerDiscreteScheduler
-      params:
-        pretrained_model_name_or_path: checkpoints/hunyuanvideo/HunyuanVideo
-        subfolder: scheduler
-
-
-# data configs
-data:
-  target: videotuna.data.lightningdata.DataModuleFromConfig
-  params:
-    batch_size: 1
-    num_workers: 16
-    wrap: false
-    train:
-      target: videotuna.data.cogvideo_dataset.VideoDataset
-      params:
-        instance_data_root: "inputs/t2v/hunyuanvideo/tyler_swift_video"
-        dataset_name: null 
-        dataset_config_name: null
-        caption_column: "labels.txt"
-        video_column: "videos.txt"
-        height: 256
-        width: 256
-        fps: 28
-        max_num_frames: 17
-        skip_frames_start: 0
-        skip_frames_end: 0
-        cache_dir: ~/.cache
-        id_token: null
-
-# training configs
-lightning:
-  strategy: fsdp
-  trainer:
-    benchmark: True
-    num_nodes: 1
-    accumulate_grad_batches: 2
-    max_epochs: 2000
-    # precision: 32
-    precision: 16
-  callbacks:
-    image_logger:
-      target: videotuna.utils.callbacks.ImageLogger
-      params:
-        batch_frequency: 100000
-        max_images: 2
-        to_local: True # save videos into local files
-        log_images_kwargs:
-          unconditional_guidance_scale: 6
-    metrics_over_trainsteps_checkpoint:
-      target: pytorch_lightning.callbacks.ModelCheckpoint
-      params:
-        filename: "{epoch:06}-{step:09}"
-        save_weights_only: False
-        # every_n_epochs: 1
-        every_n_train_steps: 20
-    
-
diff --git a/configs/007_hunyuanvideo/hunyuanvideo_t2v_diffuser_lora.yaml b/configs/007_hunyuanvideo/hunyuanvideo_t2v_diffuser_lora.yaml
deleted file mode 100644
index 20ecdd42..00000000
--- a/configs/007_hunyuanvideo/hunyuanvideo_t2v_diffuser_lora.yaml
+++ /dev/null
@@ -1,147 +0,0 @@
-model:
-  base_learning_rate: 6e-6
-  target: videotuna.models.hunyuan.hyvideo_t2v.hunyuanvideo.HunyuanVideoWorkFlow
-  params: 
-    # VAE of HunyuanVideo
-    first_stage_config:
-      target: diffusers.AutoencoderKLHunyuanVideo
-      params:
-        pretrained_model_name_or_path: checkpoints/hunyuanvideo/HunyuanVideo
-        subfolder: "vae"
-        load_dtype: fp16 # bf16 5b / fp16 2B 
-    
-    # Text encoder 
-    cond_stage_config:
-      target: transformers.LlamaModel
-      params:
-        pretrained_model_name_or_path: checkpoints/hunyuanvideo/HunyuanVideo
-        subfolder: "text_encoder"
-        torch_dtype: float16 # bf16 5b / fp16 2B 
-
-    tokenizer_config: 
-      target: transformers.LlamaTokenizerFast
-      params:
-        pretrained_model_name_or_path: checkpoints/hunyuanvideo/HunyuanVideo
-        subfolder: "tokenizer"
-        torch_dtype: float16 # bf16 5b / fp16 2B 
-
-
-    cond_stage_config_2:
-      target: transformers.CLIPTextModel
-      params:
-        pretrained_model_name_or_path: checkpoints/hunyuanvideo/HunyuanVideo
-        subfolder: "text_encoder_2"
-        torch_dtype: float16 # bf16 5b / fp16 2B 
-    
-    tokenizer_config_2: 
-      target: transformers.CLIPTokenizer
-      params:
-        pretrained_model_name_or_path: checkpoints/hunyuanvideo/HunyuanVideo
-        subfolder: "tokenizer_2"
-        torch_dtype: float16 # bf16 5b / fp16 2B 
-
-
-    # Denosier model
-    denoiser_config:
-      target: diffusers.HunyuanVideoTransformer3DModel
-      params:
-        pretrained_model_name_or_path: checkpoints/hunyuanvideo/HunyuanVideo
-        subfolder: "transformer"
-        load_dtype: fp16 # bf16 5b / fp16 2B 
-        
-    # # Deepspeed config
-    deepspeed_config:
-      params: 
-        use_cpu_adam: True
-    
-    # Lora module 
-    adapter_config: 
-      target: peft.LoraConfig
-      params:
-        r: 4
-        lora_alpha: 1.0 
-        init_lora_weights: True
-        target_modules: ["to_k", "to_q", "to_v", "to_out.0"]
-
-    # # Diffusion sampling scheduler
-    scheduler_config:
-      target: diffusers.FlowMatchEulerDiscreteScheduler
-      params:
-        pretrained_model_name_or_path: checkpoints/hunyuanvideo/HunyuanVideo
-        subfolder: scheduler
-
-
-
-# data configs
-data:
-  target: videotuna.data.lightningdata.DataModuleFromConfig
-  params:
-    batch_size: 1
-    num_workers: 16
-    wrap: false
-    train:
-      target: videotuna.data.cogvideo_dataset.VideoDataset
-      params:
-        instance_data_root: "inputs/t2v/hunyuanvideo/tyler_swift_video"
-        dataset_name: null 
-        dataset_config_name: null
-        caption_column: "labels.txt"
-        video_column: "videos.txt"
-        height: 544
-        width: 960
-        fps: 28
-        max_num_frames: 17
-        skip_frames_start: 0
-        skip_frames_end: 0
-        cache_dir: ~/.cache
-        id_token: null
-
-# training configs
-lightning:
-  trainer:
-    benchmark: True
-    num_nodes: 1
-    accumulate_grad_batches: 2
-    max_epochs: 2000
-    # precision: 32
-    precision: 16
-  
-  strategy:
-    target: pytorch_lightning.strategies.DeepSpeedStrategy
-    params:
-      stage: 3
-      config: 
-        bf16: 
-          enabled: auto 
-        zero_optimization: 
-          stage: 3
-          offload_optimizer: 
-            device: cpu
-            pin_memory: True
-          overlap_comm: True 
-          contiguous_gradients: True 
-        fp16: 
-          enabled: False
-          loss_scale: 0
-          loss_scale_window: 1000
-          hysteresis: 2
-          min_loss_scale: 1 
-
-  callbacks:
-    image_logger:
-      target: videotuna.utils.callbacks.ImageLogger
-      params:
-        batch_frequency: 100000
-        max_images: 2
-        to_local: True # save videos into local files
-        log_images_kwargs:
-          unconditional_guidance_scale: 6
-    metrics_over_trainsteps_checkpoint:
-      target: pytorch_lightning.callbacks.ModelCheckpoint
-      params:
-        filename: "{epoch:06}-{step:09}"
-        save_weights_only: False
-        # every_n_epochs: 1
-        every_n_train_steps: 20
-    
-
diff --git a/configs/008_wanvideo/README.md b/configs/008_wanvideo/README.md
new file mode 100644
index 00000000..79816154
--- /dev/null
+++ b/configs/008_wanvideo/README.md
@@ -0,0 +1,3 @@
+# Deprecated
+
+This directory holds backward-compat symlinks only. Use [`configs/domain/`](../domain/) for training and [`configs/inference/presets/`](../inference/presets/) for inference.
diff --git a/configs/008_wanvideo/wan2_1_i2v_14B_480P.yaml b/configs/008_wanvideo/wan2_1_i2v_14B_480P.yaml
deleted file mode 100644
index 3d4718a0..00000000
--- a/configs/008_wanvideo/wan2_1_i2v_14B_480P.yaml
+++ /dev/null
@@ -1,97 +0,0 @@
-flow:
-  target: videotuna.flow.wanvideo.WanVideoModelFlow
-  params:
-    task: "i2v-14B"                   # The task to run (choices from WAN_CONFIGS.keys())
-    ckpt_path: "checkpoints/wan/Wan2.1-I2V-14B-480P"                    # The path to the checkpoint directory.
-    offload_model: true               # Whether to offload the model to CPU after each model forward.
-    ulysses_size: 1                   # The size of the ulysses parallelism in DiT.
-    ring_size: 1                      # The size of the ring attention parallelism in DiT.
-    t5_fsdp: false                    # Whether to use FSDP for T5.
-    t5_cpu: false                     # Whether to place T5 model on CPU.
-    dit_fsdp: false                   # Whether to use FSDP for DiT.
-    use_prompt_extend: false          # Whether to use prompt extend.
-    prompt_extend_method: "local_qwen" # The prompt extend method to use (choices: dashscope, local_qwen)
-    prompt_extend_model: null         # The prompt extend model to use.
-    prompt_extend_target_lang: "zh"   # The target language of prompt extend (choices: zh, en)
-    seed: 42                     # The seed to use for generating the image or video
-
-    scheduler_config: __is_first_stage__
-
-    denoiser_config:
-      target: videotuna.models.wan.wan.modules.model.WanModel
-      use_from_pretrained: true
-      params:
-        pretrained_model_name_or_path: ${flow.params.ckpt_path}
-
-    first_stage_config:
-      target: videotuna.models.wan.wan.modules.vae.WanVAE_
-      params:
-        dim: 96
-        z_dim: 16
-        dim_mult: [1, 2, 4, 4]
-        num_res_blocks: 2
-        attn_scales: []
-        temperal_downsample: [false, true, true]
-        dropout: 0.0
-
-    cond_stage_config:
-      target: videotuna.models.wan.wan.modules.t5.T5Encoder
-      params:
-        dim: 4096
-        dim_attn: 4096
-        dim_ffn: 10240
-        num_heads: 64
-        num_buckets: 32
-        shared_pos: false
-        dropout: 0.1
-        vocab: 256384
-        num_layers: 24
-
-      
-    cond_stage_2_config:
-      target: videotuna.models.wan.wan.modules.clip.XLMRobertaCLIP
-      params:
-        embed_dim: 1024
-        image_size: 224
-        patch_size: 14
-        vision_dim: 1280
-        vision_mlp_ratio: 4
-        vision_heads: 16
-        vision_layers: 32
-        vision_pool: "token"
-        activation: "gelu"
-        vocab_size: 250002
-        max_text_len: 514
-        type_size: 1
-        pad_id: 1
-        text_dim: 1024
-        text_heads: 16
-        text_layers: 24
-        text_post_norm: true
-        text_dropout: 0.1
-        attn_dropout: 0.0
-        proj_dropout: 0.0
-        embedding_dropout: 0.0
-
-inference:
-  mode: i2v
-  ckpt_path: checkpoints/wan/Wan2.1-I2V-14B-480P
-  savedir: results/i2v/wanvideo
-  seed: 42
-  height: 480
-  width: 832
-  prompt_dir: "inputs/i2v/576x1024"
-  solver: "unipc"           
-  num_inference_steps: 40                
-  time_shift: 3.0                
-  unconditional_guidance_scale: 5.0                       
-  frames: 81
-  n_samples_prompt: 1
-  bs: 1
-  savefps: 16
-  enable_model_cpu_offload: true
-
-  mapping:
-    inference.ckpt_path : flow.params.ckpt_path
-    inference.seed : flow.params.seed
-    inference.enable_model_cpu_offload : flow.params.offload_model
\ No newline at end of file
diff --git a/configs/008_wanvideo/wan2_1_i2v_14B_480P_fullft.yaml b/configs/008_wanvideo/wan2_1_i2v_14B_480P_fullft.yaml
deleted file mode 100644
index a15f5452..00000000
--- a/configs/008_wanvideo/wan2_1_i2v_14B_480P_fullft.yaml
+++ /dev/null
@@ -1,156 +0,0 @@
-flow:
-  target: videotuna.flow.wanvideo.WanVideoModelFlow
-  params:
-    task: "i2v-14B"                  
-    ckpt_path: "checkpoints/wan/Wan2.1-I2V-14B-480P"                    
-    offload_model: true               
-    ulysses_size: 1                  
-    ring_size: 1                     
-    t5_fsdp: false                    
-    t5_cpu: false                     
-    dit_fsdp: false                   
-    use_prompt_extend: false          
-    prompt_extend_method: "local_qwen" 
-    prompt_extend_model: null         
-    prompt_extend_target_lang: "zh"   
-    seed: 42                     
-
-    denoiser_config:
-      target: videotuna.models.wan.wan.modules.model.WanModel
-      use_from_pretrained: true
-      params:
-        pretrained_model_name_or_path: ${flow.params.ckpt_path}
-
-    first_stage_config:
-      target: videotuna.models.wan.wan.modules.vae.WanVAE_
-      params:
-        dim: 96
-        z_dim: 16
-        dim_mult: [1, 2, 4, 4]
-        num_res_blocks: 2
-        attn_scales: []
-        temperal_downsample: [false, true, true]
-        dropout: 0.0
-
-    cond_stage_config:
-      target: videotuna.models.wan.wan.modules.t5.T5Encoder
-      params:
-        dim: 4096
-        dim_attn: 4096
-        dim_ffn: 10240
-        num_heads: 64
-        num_buckets: 32
-        shared_pos: false
-        dropout: 0.1
-        vocab: 256384
-        num_layers: 24
-
-      
-    cond_stage_2_config:
-      target: videotuna.models.wan.wan.modules.clip.XLMRobertaCLIP
-      params:
-        embed_dim: 1024
-        image_size: 224
-        patch_size: 14
-        vision_dim: 1280
-        vision_mlp_ratio: 4
-        vision_heads: 16
-        vision_layers: 32
-        vision_pool: "token"
-        activation: "gelu"
-        vocab_size: 250002
-        max_text_len: 514
-        type_size: 1
-        pad_id: 1
-        text_dim: 1024
-        text_heads: 16
-        text_layers: 24
-        text_post_norm: true
-        text_dropout: 0.1
-        attn_dropout: 0.0
-        proj_dropout: 0.0
-        embedding_dropout: 0.0
-
-train:
-  ckpt: checkpoints/wan/Wan2.1-I2V-14B-480P
-  name: train_wan_i2v_fullft
-  logdir: results/train
-  seed: 42
-  debug: false         
-  first_stage_key: video
-  cond_stage_key: caption
-  mapping:
-    train.ckpt : flow.params.ckpt_path
-
-  lr_config:
-    base_learning_rate: 6.0e-06
-    scale_lr: False
-
-  data:
-    target: videotuna.data.lightningdata.DataModuleFromConfig
-    params:
-      batch_size: 1
-      num_workers: 16
-      wrap: false
-      train:
-        target: videotuna.data.datasets.DatasetFromCSV
-        params:
-          csv_path: data/apply_lipstick/metadata.csv
-          height: 480
-          width: 832
-          num_frames: 81
-          frame_interval: 1
-          train: True
-
-  lightning:
-    strategy: deepspeed_stage_3_offload
-    trainer:
-      accelerator: gpu
-      benchmark: True
-      num_nodes: 1
-      accumulate_grad_batches: 1
-      max_epochs: 2000
-      precision: bf16-mixed
-    callbacks:
-      image_logger:
-        target: videotuna.utils.callbacks.ImageLogger
-        params:
-          batch_frequency: 50
-          max_images: 6
-          to_local: True # save videos into files
-          log_images_kwargs:
-            unconditional_guidance_scale: 12.0 # need this, otherwise it is grey
-      model_checkpoint:
-        target: videotuna.utils.callbacks.VideoTunaModelCheckpoint
-        params:
-          filename: "{epoch:03}-{step:09}"
-          save_only_selected_model: True
-          selected_model: ["denoiser"]
-          save_weights_only: False
-          save_on_train_epoch_end: False
-          save_last: True
-          every_n_epochs: 0
-          every_n_train_steps: 50
-
-inference:
-  mode: i2v
-  ckpt_path: checkpoints/wan/Wan2.1-I2V-14B-480P
-  savedir: results/i2v/wanvideo
-  seed: 42
-  height: 480
-  width: 832
-  prompt_dir: "inputs/i2v/576x1024"
-  solver: "unipc"           
-  num_inference_steps: 40                
-  time_shift: 3.0                
-  unconditional_guidance_scale: 5.0                       
-  frames: 81
-  n_samples_prompt: 1
-  bs: 1
-  savefps: 16
-  enable_model_cpu_offload: true
-
-  mapping:
-    inference.ckpt_path : flow.params.ckpt_path
-    inference.seed : flow.params.seed
-    inference.enable_model_cpu_offload : flow.params.offload_model
\ No newline at end of file
diff --git a/configs/008_wanvideo/wan2_1_i2v_14B_720P.yaml b/configs/008_wanvideo/wan2_1_i2v_14B_720P.yaml
deleted file mode 100644
index 9f6a2f56..00000000
--- a/configs/008_wanvideo/wan2_1_i2v_14B_720P.yaml
+++ /dev/null
@@ -1,98 +0,0 @@
-flow:
-  target: videotuna.flow.wanvideo.WanVideoModelFlow
-  params:
-    task: "i2v-14B"                   # The task to run (choices from WAN_CONFIGS.keys())
-    ckpt_path: "checkpoints/wan/Wan2.1-I2V-14B-720P"                    # The path to the checkpoint directory.
-    offload_model: true               # Whether to offload the model to CPU after each model forward.
-    ulysses_size: 1                   # The size of the ulysses parallelism in DiT.
-    ring_size: 1                      # The size of the ring attention parallelism in DiT.
-    t5_fsdp: false                    # Whether to use FSDP for T5.
-    t5_cpu: false                     # Whether to place T5 model on CPU.
-    dit_fsdp: false                   # Whether to use FSDP for DiT.
-    use_prompt_extend: false          # Whether to use prompt extend.
-    prompt_extend_method: "local_qwen" # The prompt extend method to use (choices: dashscope, local_qwen)
-    prompt_extend_model: null         # The prompt extend model to use.
-    prompt_extend_target_lang: "zh"   # The target language of prompt extend (choices: zh, en)
-    seed: 42                     # The seed to use for generating the image or video
-
-
-    scheduler_config: __is_first_stage__
-
-    denoiser_config:
-      target: videotuna.models.wan.wan.modules.model.WanModel
-      use_from_pretrained: true
-      params:
-        pretrained_model_name_or_path: ${flow.params.ckpt_path}
-
-    first_stage_config:
-      target: videotuna.models.wan.wan.modules.vae.WanVAE_
-      params:
-        dim: 96
-        z_dim: 16
-        dim_mult: [1, 2, 4, 4]
-        num_res_blocks: 2
-        attn_scales: []
-        temperal_downsample: [false, true, true]
-        dropout: 0.0
-
-    cond_stage_config:
-      target: videotuna.models.wan.wan.modules.t5.T5Encoder
-      params:
-        dim: 4096
-        dim_attn: 4096
-        dim_ffn: 10240
-        num_heads: 64
-        num_buckets: 32
-        shared_pos: false
-        dropout: 0.1
-        vocab: 256384
-        num_layers: 24
-
-      
-    cond_stage_2_config:
-      target: videotuna.models.wan.wan.modules.clip.XLMRobertaCLIP
-      params:
-        embed_dim: 1024
-        image_size: 224
-        patch_size: 14
-        vision_dim: 1280
-        vision_mlp_ratio: 4
-        vision_heads: 16
-        vision_layers: 32
-        vision_pool: "token"
-        activation: "gelu"
-        vocab_size: 250002
-        max_text_len: 514
-        type_size: 1
-        pad_id: 1
-        text_dim: 1024
-        text_heads: 16
-        text_layers: 24
-        text_post_norm: true
-        text_dropout: 0.1
-        attn_dropout: 0.0
-        proj_dropout: 0.0
-        embedding_dropout: 0.0
-
-inference:
-  mode: i2v
-  ckpt_path: "checkpoints/wan/Wan2.1-I2V-14B-720P"
-  prompt_dir: "inputs/i2v/576x1024"
-  savedir: results/i2v/wanvideo
-  seed: 42
-  height: 720
-  width: 1280
-  solver: "unipc"           
-  num_inference_steps: 40                
-  time_shift: 5.0                
-  unconditional_guidance_scale: 5.0                       
-  frames: 81
-  n_samples_prompt: 1
-  bs: 1
-  savefps: 16
-  enable_model_cpu_offload: true
-
-  mapping:
-    inference.ckpt_path : flow.params.ckpt_path
-    inference.seed : flow.params.seed
-    inference.enable_model_cpu_offload : flow.params.offload_model
\ No newline at end of file
diff --git a/configs/008_wanvideo/wan2_1_t2v_14B.yaml b/configs/008_wanvideo/wan2_1_t2v_14B.yaml
deleted file mode 100644
index 6a94bbe5..00000000
--- a/configs/008_wanvideo/wan2_1_t2v_14B.yaml
+++ /dev/null
@@ -1,74 +0,0 @@
-flow:
-  target: videotuna.flow.wanvideo.WanVideoModelFlow
-  params:
-    task: "t2v-14B"                   # The task to run (choices from WAN_CONFIGS.keys())
-    ckpt_path: "checkpoints/wan/Wan2.1-T2V-14B"                    # The path to the checkpoint directory.
-    offload_model: true               # Whether to offload the model to CPU after each model forward.
-    ulysses_size: 1                   # The size of the ulysses parallelism in DiT.
-    ring_size: 1                      # The size of the ring attention parallelism in DiT.
-    t5_fsdp: false                    # Whether to use FSDP for T5.
-    t5_cpu: false                     # Whether to place T5 model on CPU.
-    dit_fsdp: false                   # Whether to use FSDP for DiT.
-    use_prompt_extend: false          # Whether to use prompt extend.
-    prompt_extend_method: "local_qwen" # The prompt extend method to use (choices: dashscope, local_qwen)
-    prompt_extend_model: null         # The prompt extend model to use.
-    prompt_extend_target_lang: "zh"   # The target language of prompt extend (choices: zh, en)
-    seed: 42                    
-
-
-    scheduler_config: __is_first_stage__
-
-    denoiser_config:
-      target: videotuna.models.wan.wan.modules.model.WanModel
-      use_from_pretrained: true
-      params:
-        pretrained_model_name_or_path: ${flow.params.ckpt_path}
-
-    first_stage_config:
-      target: videotuna.models.wan.wan.modules.vae.WanVAE_
-      params:
-        dim: 96
-        z_dim: 16
-        dim_mult: [1, 2, 4, 4]
-        num_res_blocks: 2
-        attn_scales: []
-        temperal_downsample: [false, true, true]
-        dropout: 0.0
-
-    cond_stage_config:
-      target: videotuna.models.wan.wan.modules.t5.T5Encoder
-      params:
-        dim: 4096
-        dim_attn: 4096
-        dim_ffn: 10240
-        num_heads: 64
-        num_buckets: 32
-        shared_pos: false
-        dropout: 0.1
-        vocab: 256384
-        num_layers: 24
-
-inference:
-  mode: t2v
-  ckpt_path: "checkpoints/wan/Wan2.1-T2V-14B"
-  savedir: results/t2v/wanvideo
-  seed: 42
-  height: 480
-  width: 832
-  image: null                       
-  prompt_file: 'Two anthropomorphic cats in comfy boxing gear and bright gloves fight intensely on a spotlighted stage.'                     
-  solver: "unipc"           
-  num_inference_steps: 50                
-  time_shift: 3.0                
-  unconditional_guidance_scale: 5.0                        
-  frames: 81
-  n_samples_prompt: 1
-  bs: 1
-  savefps: 30
-  enable_model_cpu_offload: true
-
-  mapping:
-    inference.ckpt_path : flow.params.ckpt_path
-    inference.seed : flow.params.seed
-    inference.enable_model_cpu_offload : flow.params.offload_model
-    
\ No newline at end of file
diff --git a/configs/008_wanvideo/wan2_1_t2v_14B_lora_cloud_smoke.yaml b/configs/008_wanvideo/wan2_1_t2v_14B_lora_cloud_smoke.yaml
new file mode 120000
index 00000000..9e7526ac
--- /dev/null
+++ b/configs/008_wanvideo/wan2_1_t2v_14B_lora_cloud_smoke.yaml
@@ -0,0 +1 @@
+../domain/wan_t2v_lora_cloud_smoke.yaml
\ No newline at end of file
diff --git a/configs/008_wanvideo/wan2_2_t2v_14b.yaml b/configs/008_wanvideo/wan2_2_t2v_14b.yaml
new file mode 120000
index 00000000..b32aeeb4
--- /dev/null
+++ b/configs/008_wanvideo/wan2_2_t2v_14b.yaml
@@ -0,0 +1 @@
+../inference/presets/wan2_2_native_t2v_14b.yaml
\ No newline at end of file
diff --git a/configs/009_stepvideo/stepvideo_t2v.yaml b/configs/009_stepvideo/stepvideo_t2v.yaml
deleted file mode 100644
index 36b9db24..00000000
--- a/configs/009_stepvideo/stepvideo_t2v.yaml
+++ /dev/null
@@ -1,69 +0,0 @@
-flow:
-  target: videotuna.flow.stepvideo.StepVideoModelFlow
-  params:
-    ckpt_path: checkpoints/stepvideo/stepvideo-t2v
-    denoiser_ckpt_path: ${flow.params.ckpt_path}/transformer
-    scheduler_ckpt_path: ${flow.params.ckpt_path}/scheduler
-    first_stage_ckpt_path: ${flow.params.ckpt_path}/vae
-    cond_stage_ckpt_path: ${flow.params.ckpt_path}/step_llm
-    cond_stage_2_ckpt_path: ${flow.params.ckpt_path}/hunyuan_clip
-    enable_model_cpu_offload: True
-    enable_sequential_cpu_offload: False
-
-    scheduler_config: 
-      target: videotuna.models.stepvideo.stepvideo.diffusion.scheduler.FlowMatchDiscreteScheduler
-      use_from_pretrained: True
-      params:
-        pretrained_model_name_or_path: ${flow.params.scheduler_ckpt_path}
-
-    denoiser_config:
-      target: videotuna.models.stepvideo.stepvideo.modules.model.StepVideoModel
-      use_from_pretrained: True
-      params:
-        pretrained_model_name_or_path: ${flow.params.denoiser_ckpt_path}
-        torch_dtype: ${dtype_resolver:torch.bfloat16}
-        attention_type: torch
-
-    first_stage_config:
-      target: videotuna.models.stepvideo.stepvideo.vae.vae.AutoencoderKL
-      params:
-        z_channels: 64
-        model_path: ${flow.params.first_stage_ckpt_path}/vae_v2.safetensors
-        version: 2
-
-    cond_stage_config:
-      target: videotuna.models.stepvideo.stepvideo.text_encoder.stepllm.STEP1TextEncoder
-      params:
-        model_dir: ${flow.params.cond_stage_ckpt_path}
-        max_length: 320
-
-    cond_stage_2_config:
-      target: videotuna.models.stepvideo.stepvideo.text_encoder.clip.HunyuanClip
-      params:
-        model_dir: ${flow.params.cond_stage_2_ckpt_path}
-        max_length: 77
-
-inference:
-  ckpt_path: checkpoints/stepvideo/stepvideo-t2v
-  mode: t2v
-  savedir: results/t2v/stepvideo
-  seed: 42
-  height: 544
-  width: 992
-  frames: 51 
-  num_inference_steps: 50
-  time_shift: 13.0
-  unconditional_guidance_scale: 12.0
-  prompt_file: '一名宇航员在月球上发现一块石碑，上面印有“stepfun”字样，闪闪发光'
-  uncond_prompt: ''
-  pos_prompt: ''
-  n_samples_prompt: 1
-  bs: 1
-  savefps: 28
-  enable_model_cpu_offload: True
-  enable_sequential_cpu_offload: False
-
-  mapping:
-    inference.ckpt_path : flow.params.ckpt_path
-    inference.enable_model_cpu_offload: flow.params.enable_model_cpu_offload
-    inference.enable_sequential_cpu_offload: flow.params.enable_sequential_cpu_offload
\ No newline at end of file
diff --git a/configs/009_stepvideo/stepvideo_t2v_lora.yaml b/configs/009_stepvideo/stepvideo_t2v_lora.yaml
deleted file mode 100644
index 77825077..00000000
--- a/configs/009_stepvideo/stepvideo_t2v_lora.yaml
+++ /dev/null
@@ -1,134 +0,0 @@
-flow:
-  target: videotuna.flow.stepvideo.StepVideoModelFlow
-  params:
-    ckpt_path: checkpoints/stepvideo/stepvideo-t2v
-    denoiser_ckpt_path: ${flow.params.ckpt_path}/transformer
-    scheduler_ckpt_path: ${flow.params.ckpt_path}/scheduler
-    first_stage_ckpt_path: ${flow.params.ckpt_path}/vae
-    cond_stage_ckpt_path: ${flow.params.ckpt_path}/step_llm
-    cond_stage_2_ckpt_path: ${flow.params.ckpt_path}/hunyuan_clip
-    enable_model_cpu_offload: True
-    enable_sequential_cpu_offload: False
-
-    scheduler_config: 
-      target: videotuna.models.stepvideo.stepvideo.diffusion.scheduler.FlowMatchDiscreteScheduler
-      use_from_pretrained: True
-      params:
-        pretrained_model_name_or_path: ${flow.params.scheduler_ckpt_path}
-
-    denoiser_config:
-      target: videotuna.models.stepvideo.stepvideo.modules.model.StepVideoModel
-      use_from_pretrained: True
-      params:
-        pretrained_model_name_or_path: ${flow.params.denoiser_ckpt_path}
-        torch_dtype: ${dtype_resolver:torch.bfloat16}
-        attention_type: torch
-
-    first_stage_config:
-      target: videotuna.models.stepvideo.stepvideo.vae.vae.AutoencoderKL
-      params:
-        z_channels: 64
-        model_path: ${flow.params.first_stage_ckpt_path}/vae_v2.safetensors
-        version: 2
-
-    cond_stage_config:
-      target: videotuna.models.stepvideo.stepvideo.text_encoder.stepllm.STEP1TextEncoder
-      params:
-        model_dir: ${flow.params.cond_stage_ckpt_path}
-        max_length: 320
-
-    cond_stage_2_config:
-      target: videotuna.models.stepvideo.stepvideo.text_encoder.clip.HunyuanClip
-      params:
-        model_dir: ${flow.params.cond_stage_2_ckpt_path}
-        max_length: 77
-
-    lora_config: 
-      target: peft.LoraConfig
-      params:
-        r: 16
-        lora_alpha: 16.0 
-        init_lora_weights: True
-        target_modules: [wq, wkv, wo, ff.net.0.proj, ffn.net.2]
-train:
-  ckpt: checkpoints/stepvideo/stepvideo-t2v
-  name: train_stepvideo_t2v_lora
-  logdir: results/train
-  seed: 42
-  debug: false         
-  first_stage_key: video
-  cond_stage_key: caption
-
-  lr_config:
-    base_learning_rate: 1e-4
-    scale_lr: False
-
-  data:
-    target: videotuna.data.lightningdata.DataModuleFromConfig
-    params:
-      batch_size: 1
-      num_workers: 16
-      wrap: false
-      train:
-        target: videotuna.data.datasets.DatasetFromCSV
-        params:
-          csv_path: data/apply_lipstick/metadata.csv
-          height: 544
-          width: 992
-          frame_interval: 1
-          train: True
-
-  lightning:
-    strategy: deepspeed_stage_3_offload
-    trainer:
-      accelerator: gpu
-      benchmark: True
-      num_nodes: 1
-      accumulate_grad_batches: 1
-      max_epochs: 2000
-      precision: bf16
-    callbacks:
-      image_logger:
-        target: videotuna.utils.callbacks.ImageLogger
-        params:
-          batch_frequency: 50
-          max_images: 6
-          to_local: True # save videos into files
-          log_images_kwargs:
-            unconditional_guidance_scale: 12.0 # need this, otherwise it is grey
-      model_checkpoint:
-        target: videotuna.utils.callbacks.VideoTunaModelCheckpoint
-        params:
-          filename: "{epoch:03}-{step:09}"
-          save_only_selected_model: True
-          selected_model: ["denoiser"]
-          save_weights_only: False
-          save_on_train_epoch_end: False
-          save_last: True
-          every_n_epochs: 0
-          every_n_train_steps: 50
-
-inference:
-  ckpt_path: checkpoints/stepvideo/stepvideo-t2v
-  mode: t2v
-  savedir: results/t2v/stepvideo
-  seed: 42
-  height: 544
-  width: 992
-  frames: 51 
-  num_inference_steps: 50
-  time_shift: 13.0
-  unconditional_guidance_scale: 12.0
-  prompt_file: '一名宇航员在月球上发现一块石碑，上面印有“stepfun”字样，闪闪发光'
-  uncond_prompt: ''
-  pos_prompt: ''
-  n_samples_prompt: 1
-  bs: 1
-  savefps: 28
-  enable_model_cpu_offload: True
-  enable_sequential_cpu_offload: False
-
-  mapping:
-    inference.ckpt_path : flow.params.ckpt_path
-    inference.enable_model_cpu_offload: flow.params.enable_model_cpu_offload
-    inference.enable_sequential_cpu_offload: flow.params.enable_sequential_cpu_offload
\ No newline at end of file
diff --git a/configs/006_flux/config.json b/configs/domain/flux_t2i.json
similarity index 80%
rename from configs/006_flux/config.json
rename to configs/domain/flux_t2i.json
index 5286edea..fe8fb5fe 100644
--- a/configs/006_flux/config.json
+++ b/configs/domain/flux_t2i.json
@@ -1,16 +1,16 @@
 {
     "--resume_from_checkpoint": "latest",
-    "--data_backend_config": "configs/006_flux/multidatabackend.json",
+    "--data_backend_config": "configs/domain/flux_t2i_data.json",
     "--aspect_bucket_rounding": 2,
     "--seed": 42,
     "--minimum_image_size": 0,
     "--disable_benchmark": false,
-    "--output_dir": "results/train/flux-000_20250419190951",
+    "--output_dir": "results/train/flux-domain-adult",
     "--lora_type": "standard",
     "--lora_rank": 4,
-    "--max_train_steps": 12000,
+    "--max_train_steps": 2000,
     "--num_train_epochs": -1,
-    "--checkpointing_steps": 500,
+    "--checkpointing_steps": 250,
     "--checkpoints_total_limit": 20,
     "--model_type": "lora",
     "--pretrained_model_name_or_path": "black-forest-labs/FLUX.1-dev",
@@ -27,11 +27,11 @@
     "--validation_guidance": 3.0,
     "--validation_guidance_rescale": "0.0",
     "--validation_num_inference_steps": "10",
-    "--validation_prompt": "a photo of teddybear",
+    "--validation_prompt": "sks_style, portrait, soft lighting",
     "--disable_tf32": "true",
     "--mixed_precision": "bf16",
     "--optimizer": "adamw_bf16",
     "--learning_rate": "8e-5",
     "--lr_scheduler": "polynomial",
     "--lr_warmup_steps": 5
-}
\ No newline at end of file
+}
diff --git a/configs/domain/flux_t2i_cloud_smoke.json b/configs/domain/flux_t2i_cloud_smoke.json
new file mode 100644
index 00000000..39fd81c0
--- /dev/null
+++ b/configs/domain/flux_t2i_cloud_smoke.json
@@ -0,0 +1,37 @@
+{
+    "--resume_from_checkpoint": "latest",
+    "--data_backend_config": "configs/domain/flux_t2i_data.json",
+    "--aspect_bucket_rounding": 2,
+    "--seed": 42,
+    "--minimum_image_size": 0,
+    "--disable_benchmark": false,
+    "--output_dir": "results/train/flux-cloud-smoke",
+    "--lora_type": "standard",
+    "--lora_rank": 4,
+    "--max_train_steps": 50,
+    "--num_train_epochs": -1,
+    "--checkpointing_steps": 25,
+    "--checkpoints_total_limit": 5,
+    "--model_type": "lora",
+    "--pretrained_model_name_or_path": "checkpoints/flux/FLUX.1-dev",
+    "--model_family": "flux",
+    "--train_batch_size": 1,
+    "--write_batch_size": 1,
+    "--gradient_checkpointing": "true",
+    "--caption_dropout_probability": 0.0,
+    "--resolution_type": "pixel_area",
+    "--resolution": 512,
+    "--validation_seed": 42,
+    "--validation_steps": 40,
+    "--validation_resolution": "512x512",
+    "--validation_guidance": 3.0,
+    "--validation_guidance_rescale": "0.0",
+    "--validation_num_inference_steps": "10",
+    "--validation_prompt": "sks_style, portrait, soft lighting",
+    "--disable_tf32": "true",
+    "--mixed_precision": "bf16",
+    "--optimizer": "adamw_bf16",
+    "--learning_rate": "8e-5",
+    "--lr_scheduler": "polynomial",
+    "--lr_warmup_steps": 5
+}
diff --git a/configs/006_flux/multidatabackend.json b/configs/domain/flux_t2i_data.json
similarity index 71%
rename from configs/006_flux/multidatabackend.json
rename to configs/domain/flux_t2i_data.json
index 8f58614a..36f224e6 100644
--- a/configs/006_flux/multidatabackend.json
+++ b/configs/domain/flux_t2i_data.json
@@ -1,6 +1,6 @@
 [
     {
-      "id": "pseudo-camera-10k-flux",
+      "id": "domain-adult-t2i",
       "type": "local",
       "crop": true,
       "crop_aspect": "square",
@@ -10,9 +10,8 @@
       "maximum_image_size": 512,
       "target_downsample_size": 512,
       "resolution_type": "pixel_area",
-      "cache_dir_vae": "cache/vae/flux/pseudo-camera-10k/train",
-      "instance_data_dir": "inputs/t2i/flux/plushie_teddybear",
-      "caption": "nezha",
+      "cache_dir_vae": "cache/vae/flux/domain-adult/train",
+      "instance_data_dir": "data/t2i/domain",
       "ignore_epochs": true,
       "disabled": false,
       "skip_file_discovery": "",
@@ -24,7 +23,7 @@
       "type": "local",
       "dataset_type": "text_embeds",
       "default": true,
-      "cache_dir": "cache/text/flux/pseudo-camera-10k",
+      "cache_dir": "cache/text/flux/domain-adult",
       "disabled": false,
       "write_batch_size": 128
     }
diff --git a/configs/008_wanvideo/wan2_1_i2v_14B_480P_lora.yaml b/configs/domain/wan_i2v_lora.yaml
similarity index 53%
rename from configs/008_wanvideo/wan2_1_i2v_14B_480P_lora.yaml
rename to configs/domain/wan_i2v_lora.yaml
index 67770d6c..6b0aabd2 100644
--- a/configs/008_wanvideo/wan2_1_i2v_14B_480P_lora.yaml
+++ b/configs/domain/wan_i2v_lora.yaml
@@ -1,25 +1,28 @@
 flow:
   target: videotuna.flow.wanvideo.WanVideoModelFlow
   params:
-    task: "i2v-14B"                  
-    ckpt_path: "checkpoints/wan/Wan2.1-I2V-14B-480P"                    
-    offload_model: true               
-    ulysses_size: 1                  
-    ring_size: 1                     
-    t5_fsdp: false                    
-    t5_cpu: false                     
-    dit_fsdp: false                   
-    use_prompt_extend: false          
-    prompt_extend_method: "local_qwen" 
-    prompt_extend_model: null         
-    prompt_extend_target_lang: "zh"   
-    seed: 42                     
+    task: "i2v-14B"
+    ckpt_path: "checkpoints/wan/Wan2.1-I2V-14B-480P"
+    offload_model: true
+    ulysses_size: 1
+    ring_size: 1
+    t5_fsdp: false
+    t5_cpu: false
+    dit_fsdp: false
+    use_prompt_extend: false
+    prompt_extend_method: "local_qwen"
+    prompt_extend_model: null
+    prompt_extend_target_lang: "zh"
+    seed: 42
+    gradient_checkpointing: true
 
     denoiser_config:
       target: videotuna.models.wan.wan.modules.model.WanModel
       use_from_pretrained: true
       params:
         pretrained_model_name_or_path: ${flow.params.ckpt_path}
+        subfolder: high_noise_model
+        model_type: i2v
 
     first_stage_config:
       target: videotuna.models.wan.wan.modules.vae.WanVAE_
@@ -45,68 +48,47 @@ flow:
         vocab: 256384
         num_layers: 24
 
-      
-    cond_stage_2_config:
-      target: videotuna.models.wan.wan.modules.clip.XLMRobertaCLIP
-      params:
-        embed_dim: 1024
-        image_size: 224
-        patch_size: 14
-        vision_dim: 1280
-        vision_mlp_ratio: 4
-        vision_heads: 16
-        vision_layers: 32
-        vision_pool: "token"
-        activation: "gelu"
-        vocab_size: 250002
-        max_text_len: 514
-        type_size: 1
-        pad_id: 1
-        text_dim: 1024
-        text_heads: 16
-        text_layers: 24
-        text_post_norm: true
-        text_dropout: 0.1
-        attn_dropout: 0.0
-        proj_dropout: 0.0
-        embedding_dropout: 0.0
-    
-    lora_config: 
+    lora_config:
       target: peft.LoraConfig
       params:
         r: 16
-        lora_alpha: 16.0 
+        lora_alpha: 16.0
         init_lora_weights: True
         target_modules: [q, k, v, o, ffn.0, ffn.2]
+
 train:
   ckpt: checkpoints/wan/Wan2.1-I2V-14B-480P
-  name: train_wan_i2v_lora
+  name: train_wan_domain_i2v_lora
   logdir: results/train
   seed: 42
-  debug: false         
+  debug: false
   first_stage_key: video
   cond_stage_key: caption
   mapping:
-    train.ckpt : flow.params.ckpt_path
+    train.ckpt: flow.params.ckpt_path
 
   lr_config:
-    base_learning_rate: 6.0e-06
+    base_learning_rate: 1e-4
     scale_lr: False
 
   data:
     target: videotuna.data.lightningdata.DataModuleFromConfig
     params:
       batch_size: 1
-      num_workers: 16
+      num_workers: 4
+      pin_memory: true
+      persistent_workers: true
+      prefetch_factor: 2
       wrap: false
       train:
         target: videotuna.data.datasets.DatasetFromCSV
         params:
-          csv_path: data/apply_lipstick/metadata.csv
+          csv_path: data/i2v/domain/metadata.csv
           height: 480
           width: 832
           num_frames: 81
           frame_interval: 1
+          image_to_video: false
           train: True
 
   lightning:
@@ -116,7 +98,7 @@ train:
       benchmark: True
       num_nodes: 1
       accumulate_grad_batches: 1
-      max_epochs: 2000
+      max_epochs: 50
       precision: bf16-mixed
     callbacks:
       image_logger:
@@ -124,9 +106,9 @@ train:
         params:
           batch_frequency: 50
           max_images: 6
-          to_local: True # save videos into files
+          to_local: True
           log_images_kwargs:
-            unconditional_guidance_scale: 12.0 # need this, otherwise it is grey
+            unconditional_guidance_scale: 12.0
       model_checkpoint:
         target: videotuna.utils.callbacks.VideoTunaModelCheckpoint
         params:
@@ -137,27 +119,28 @@ train:
           save_on_train_epoch_end: False
           save_last: True
           every_n_epochs: 0
-          every_n_train_steps: 50
+          every_n_train_steps: 25
 
 inference:
   mode: i2v
-  ckpt_path: checkpoints/wan/Wan2.1-I2V-14B-480P
-  savedir: results/i2v/wanvideo
+  ckpt_path: "checkpoints/wan/Wan2.1-I2V-14B-480P"
+  savedir: results/i2v/wanvideo-domain
   seed: 42
   height: 480
   width: 832
-  prompt_dir: "inputs/i2v/576x1024"
-  solver: "unipc"           
-  num_inference_steps: 40                
-  time_shift: 3.0                
-  unconditional_guidance_scale: 5.0                       
+  image: null
+  prompt_dir: inputs/i2v/domain_smoke
+  solver: "unipc"
+  num_inference_steps: 20
+  time_shift: 3.0
+  unconditional_guidance_scale: 5.0
   frames: 81
   n_samples_prompt: 1
   bs: 1
-  savefps: 16
+  savefps: 30
   enable_model_cpu_offload: true
 
   mapping:
-    inference.ckpt_path : flow.params.ckpt_path
-    inference.seed : flow.params.seed
-    inference.enable_model_cpu_offload : flow.params.offload_model
\ No newline at end of file
+    inference.ckpt_path: flow.params.ckpt_path
+    inference.seed: flow.params.seed
+    inference.enable_model_cpu_offload: flow.params.offload_model
diff --git a/configs/008_wanvideo/wan2_1_t2v_14B_lora.yaml b/configs/domain/wan_t2v_lora.yaml
similarity index 63%
rename from configs/008_wanvideo/wan2_1_t2v_14B_lora.yaml
rename to configs/domain/wan_t2v_lora.yaml
index 9b56d0db..ae029dfa 100644
--- a/configs/008_wanvideo/wan2_1_t2v_14B_lora.yaml
+++ b/configs/domain/wan_t2v_lora.yaml
@@ -1,19 +1,20 @@
 flow:
   target: videotuna.flow.wanvideo.WanVideoModelFlow
   params:
-    task: "t2v-14B"                   
-    ckpt_path: "checkpoints/wan/Wan2.1-T2V-14B"                    
-    offload_model: true               
-    ulysses_size: 1          
-    ring_size: 1                      
-    t5_fsdp: false                   
-    t5_cpu: false                     
-    dit_fsdp: false                   
-    use_prompt_extend: false          
+    task: "t2v-14B"
+    ckpt_path: "checkpoints/wan/Wan2.1-T2V-14B"
+    offload_model: true
+    ulysses_size: 1
+    ring_size: 1
+    t5_fsdp: false
+    t5_cpu: false
+    dit_fsdp: false
+    use_prompt_extend: false
     prompt_extend_method: "local_qwen"
-    prompt_extend_model: null         
-    prompt_extend_target_lang: "zh"  
-    seed: 42       
+    prompt_extend_model: null
+    prompt_extend_target_lang: "zh"
+    seed: 42
+    gradient_checkpointing: true
 
     denoiser_config:
       target: videotuna.models.wan.wan.modules.model.WanModel
@@ -44,26 +45,26 @@ flow:
         dropout: 0.1
         vocab: 256384
         num_layers: 24
-      
-    lora_config: 
+
+    lora_config:
       target: peft.LoraConfig
       params:
         r: 16
-        lora_alpha: 16.0 
+        lora_alpha: 16.0
         init_lora_weights: True
         target_modules: [q, k, v, o, ffn.0, ffn.2]
 
 train:
   ckpt: checkpoints/wan/Wan2.1-T2V-14B
-  name: train_wan_t2v_lora
+  name: train_wan_domain_t2v_lora
   logdir: results/train
   seed: 42
-  debug: false         
+  debug: false
   first_stage_key: video
   cond_stage_key: caption
   mapping:
-    train.ckpt : flow.params.ckpt_path
-    
+    train.ckpt: flow.params.ckpt_path
+
   lr_config:
     base_learning_rate: 1e-4
     scale_lr: False
@@ -72,12 +73,15 @@ train:
     target: videotuna.data.lightningdata.DataModuleFromConfig
     params:
       batch_size: 1
-      num_workers: 16
+      num_workers: 4
+      pin_memory: true
+      persistent_workers: true
+      prefetch_factor: 2
       wrap: false
       train:
         target: videotuna.data.datasets.DatasetFromCSV
         params:
-          csv_path: data/apply_lipstick/metadata.csv
+          csv_path: data/t2v/domain/metadata.csv
           height: 480
           width: 832
           num_frames: 81
@@ -91,7 +95,7 @@ train:
       benchmark: True
       num_nodes: 1
       accumulate_grad_batches: 1
-      max_epochs: 2000
+      max_epochs: 50
       precision: bf16-mixed
     callbacks:
       image_logger:
@@ -99,9 +103,9 @@ train:
         params:
           batch_frequency: 50
           max_images: 6
-          to_local: True # save videos into files
+          to_local: True
           log_images_kwargs:
-            unconditional_guidance_scale: 12.0 # need this, otherwise it is grey
+            unconditional_guidance_scale: 12.0
       model_checkpoint:
         target: videotuna.utils.callbacks.VideoTunaModelCheckpoint
         params:
@@ -112,21 +116,21 @@ train:
           save_on_train_epoch_end: False
           save_last: True
           every_n_epochs: 0
-          every_n_train_steps: 50
+          every_n_train_steps: 25
 
 inference:
   mode: t2v
   ckpt_path: "checkpoints/wan/Wan2.1-T2V-14B"
-  savedir: results/t2v/wanvideo
+  savedir: results/t2v/wanvideo-domain
   seed: 42
   height: 480
   width: 832
-  image: null                       
-  prompt_file: 'Two anthropomorphic cats in comfy boxing gear and bright gloves fight intensely on a spotlighted stage.'                     
-  solver: "unipc"           
-  num_inference_steps: 50                
-  time_shift: 3.0                
-  unconditional_guidance_scale: 5.0                        
+  image: null
+  prompt_file: "sks_style, slow camera push-in, soft lighting"
+  solver: "unipc"
+  num_inference_steps: 20
+  time_shift: 3.0
+  unconditional_guidance_scale: 5.0
   frames: 81
   n_samples_prompt: 1
   bs: 1
@@ -134,7 +138,6 @@ inference:
   enable_model_cpu_offload: true
 
   mapping:
-    inference.ckpt_path : flow.params.ckpt_path
-    inference.seed : flow.params.seed
-    inference.enable_model_cpu_offload : flow.params.offload_model
-    
\ No newline at end of file
+    inference.ckpt_path: flow.params.ckpt_path
+    inference.seed: flow.params.seed
+    inference.enable_model_cpu_offload: flow.params.offload_model
diff --git a/configs/008_wanvideo/wan2_1_t2v_14B_fullft.yaml b/configs/domain/wan_t2v_lora_cloud_smoke.yaml
similarity index 62%
rename from configs/008_wanvideo/wan2_1_t2v_14B_fullft.yaml
rename to configs/domain/wan_t2v_lora_cloud_smoke.yaml
index b09f6c33..d957bd90 100644
--- a/configs/008_wanvideo/wan2_1_t2v_14B_fullft.yaml
+++ b/configs/domain/wan_t2v_lora_cloud_smoke.yaml
@@ -1,19 +1,20 @@
 flow:
   target: videotuna.flow.wanvideo.WanVideoModelFlow
   params:
-    task: "t2v-14B"                   
-    ckpt_path: "checkpoints/wan/Wan2.1-T2V-14B"                    
-    offload_model: true               
-    ulysses_size: 1          
-    ring_size: 1                      
-    t5_fsdp: false                   
-    t5_cpu: false                     
-    dit_fsdp: false                   
-    use_prompt_extend: false          
+    task: "t2v-14B"
+    ckpt_path: "checkpoints/wan/Wan2.1-T2V-14B"
+    offload_model: true
+    ulysses_size: 1
+    ring_size: 1
+    t5_fsdp: false
+    t5_cpu: false
+    dit_fsdp: false
+    use_prompt_extend: false
     prompt_extend_method: "local_qwen"
-    prompt_extend_model: null         
-    prompt_extend_target_lang: "zh"  
-    seed: 42       
+    prompt_extend_model: null
+    prompt_extend_target_lang: "zh"
+    seed: 42
+    gradient_checkpointing: true
 
     denoiser_config:
       target: videotuna.models.wan.wan.modules.model.WanModel
@@ -45,31 +46,42 @@ flow:
         vocab: 256384
         num_layers: 24
 
+    lora_config:
+      target: peft.LoraConfig
+      params:
+        r: 16
+        lora_alpha: 16.0
+        init_lora_weights: True
+        target_modules: [q, k, v, o, ffn.0, ffn.2]
+
 train:
   ckpt: checkpoints/wan/Wan2.1-T2V-14B
-  name: train_wan_t2v_fullft
+  name: train_wan_cloud_smoke
   logdir: results/train
   seed: 42
-  debug: false         
+  debug: false
   first_stage_key: video
   cond_stage_key: caption
   mapping:
-    train.ckpt : flow.params.ckpt_path
+    train.ckpt: flow.params.ckpt_path
 
   lr_config:
-    base_learning_rate: 6.0e-06
+    base_learning_rate: 1e-4
     scale_lr: False
 
   data:
     target: videotuna.data.lightningdata.DataModuleFromConfig
     params:
       batch_size: 1
-      num_workers: 16
+      num_workers: 4
+      pin_memory: true
+      persistent_workers: true
+      prefetch_factor: 2
       wrap: false
       train:
         target: videotuna.data.datasets.DatasetFromCSV
         params:
-          csv_path: data/apply_lipstick/metadata.csv
+          csv_path: data/t2v/domain/metadata.csv
           height: 480
           width: 832
           num_frames: 81
@@ -83,7 +95,7 @@ train:
       benchmark: True
       num_nodes: 1
       accumulate_grad_batches: 1
-      max_epochs: 2000
+      max_epochs: 1
       precision: bf16-mixed
     callbacks:
       image_logger:
@@ -91,9 +103,9 @@ train:
         params:
           batch_frequency: 50
           max_images: 6
-          to_local: True # save videos into files
+          to_local: True
           log_images_kwargs:
-            unconditional_guidance_scale: 12.0 # need this, otherwise it is grey
+            unconditional_guidance_scale: 12.0
       model_checkpoint:
         target: videotuna.utils.callbacks.VideoTunaModelCheckpoint
         params:
@@ -104,21 +116,21 @@ train:
           save_on_train_epoch_end: False
           save_last: True
           every_n_epochs: 0
-          every_n_train_steps: 50
+          every_n_train_steps: 5
 
 inference:
   mode: t2v
   ckpt_path: "checkpoints/wan/Wan2.1-T2V-14B"
-  savedir: results/t2v/wanvideo
+  savedir: results/t2v/wanvideo-domain
   seed: 42
   height: 480
   width: 832
-  image: null                       
-  prompt_file: 'Two anthropomorphic cats in comfy boxing gear and bright gloves fight intensely on a spotlighted stage.'                     
-  solver: "unipc"           
-  num_inference_steps: 50                
-  time_shift: 3.0                
-  unconditional_guidance_scale: 5.0                        
+  image: null
+  prompt_file: "sks_style, slow camera push-in, soft lighting"
+  solver: "unipc"
+  num_inference_steps: 20
+  time_shift: 3.0
+  unconditional_guidance_scale: 5.0
   frames: 81
   n_samples_prompt: 1
   bs: 1
@@ -126,7 +138,6 @@ inference:
   enable_model_cpu_offload: true
 
   mapping:
-    inference.ckpt_path : flow.params.ckpt_path
-    inference.seed : flow.params.seed
-    inference.enable_model_cpu_offload : flow.params.offload_model
-    
\ No newline at end of file
+    inference.ckpt_path: flow.params.ckpt_path
+    inference.seed: flow.params.seed
+    inference.enable_model_cpu_offload: flow.params.offload_model
diff --git a/configs/inference/presets/balanced_wan2_2_720p.yaml b/configs/inference/presets/balanced_wan2_2_720p.yaml
new file mode 100644
index 00000000..e4848108
--- /dev/null
+++ b/configs/inference/presets/balanced_wan2_2_720p.yaml
@@ -0,0 +1,27 @@
+# Balanced preset for Wan 2.2 Diffusers 720p (~24 GB VRAM)
+# Usage: poetry run inference-wan2.2-t2v-720p --config configs/inference/presets/balanced_wan2_2_720p.yaml
+flow:
+  target: videotuna.flow.diffusers_video.DiffusersVideoFlow
+  params:
+    model_family: wan
+    mode: t2v
+    pipeline_only: true
+    model_variant: "2.2"
+    pretrained_model_name_or_path: Wan-AI/Wan2.2-T2V-A14B-Diffusers
+inference:
+  mode: t2v
+  ckpt_path: Wan-AI/Wan2.2-T2V-A14B-Diffusers
+  savedir: results/t2v/wan2.2-t2v-a14b-balanced
+  prompt_file: inputs/t2v/prompts.txt
+  frames: 81
+  height: 720
+  width: 1280
+  num_inference_steps: 50
+  unconditional_guidance_scale: 5.0
+  seed: 42
+  savefps: 16
+  memory_preset: balanced
+  enable_model_cpu_offload: true
+  enable_vae_tiling: true
+  dtype: bf16
+  min_vram_gb: 20
diff --git a/configs/inference/presets/flux1_dev.yaml b/configs/inference/presets/flux1_dev.yaml
new file mode 100644
index 00000000..b99e3e4a
--- /dev/null
+++ b/configs/inference/presets/flux1_dev.yaml
@@ -0,0 +1,18 @@
+flow:
+  target: videotuna.flow.diffusers_video.DiffusersVideoFlow
+  params:
+    model_family: flux
+    mode: t2i
+    pipeline_only: true
+    model_variant: 1-dev
+    pretrained_model_name_or_path: black-forest-labs/FLUX.1-dev
+inference:
+  mode: t2i
+  ckpt_path: black-forest-labs/FLUX.1-dev
+  savedir: results/t2i/flux1-dev
+  prompt_file: inputs/t2v/prompts.txt
+  height: 768
+  width: 1360
+  num_inference_steps: 28
+  unconditional_guidance_scale: 3.5
+  seed: 42
diff --git a/configs/inference/presets/flux_domain_lora_smoke.yaml b/configs/inference/presets/flux_domain_lora_smoke.yaml
new file mode 100644
index 00000000..a392e5c9
--- /dev/null
+++ b/configs/inference/presets/flux_domain_lora_smoke.yaml
@@ -0,0 +1,25 @@
+# Smoke preset for domain Flux LoRA (few steps, 512px, offload)
+# Usage:
+#   poetry run python scripts/inference_new.py \
+#     --config configs/inference/presets/flux_domain_lora_smoke.yaml \
+#     --lorackpt results/train/flux-domain-adult/checkpoint-2000 \
+#     --prompt "sks_style, portrait, soft lighting"
+flow:
+  target: videotuna.flow.diffusers_video.DiffusersVideoFlow
+  params:
+    model_family: flux
+    mode: t2i
+    pipeline_only: true
+    model_variant: 1-dev
+    pretrained_model_name_or_path: black-forest-labs/FLUX.1-dev
+inference:
+  mode: t2i
+  ckpt_path: black-forest-labs/FLUX.1-dev
+  savedir: results/t2i/flux-domain-lora-smoke
+  prompt_file: inputs/t2v/prompts.txt
+  height: 512
+  width: 512
+  num_inference_steps: 8
+  unconditional_guidance_scale: 3.5
+  seed: 42
+  enable_model_cpu_offload: true
diff --git a/configs/inference/presets/low_vram_wan2_2_720p.yaml b/configs/inference/presets/low_vram_wan2_2_720p.yaml
new file mode 100644
index 00000000..ca728378
--- /dev/null
+++ b/configs/inference/presets/low_vram_wan2_2_720p.yaml
@@ -0,0 +1,32 @@
+# Low VRAM preset for Wan 2.2 Diffusers 720p
+# Usage: poetry run inference-wan2.2-t2v-720p --config configs/inference/presets/low_vram_wan2_2_720p.yaml
+flow:
+  target: videotuna.flow.diffusers_video.DiffusersVideoFlow
+  params:
+    model_family: wan
+    mode: t2v
+    pipeline_only: true
+    model_variant: "2.2"
+    pretrained_model_name_or_path: Wan-AI/Wan2.2-T2V-A14B-Diffusers
+inference:
+  mode: t2v
+  ckpt_path: Wan-AI/Wan2.2-T2V-A14B-Diffusers
+  savedir: results/t2v/wan2.2-t2v-a14b-low-vram
+  prompt_file: inputs/t2v/prompts.txt
+  frames: 81
+  height: 720
+  width: 1280
+  num_inference_steps: 50
+  unconditional_guidance_scale: 5.0
+  seed: 42
+  savefps: 16
+  memory_preset: low_vram
+  enable_sequential_cpu_offload: true
+  enable_vae_tiling: true
+  dtype: fp16
+  min_vram_gb: 10
+  # Optional transformer quant (CUDA only):
+  #   int8_wo — low_vram_wan2_2_720p_int8.yaml (sm >= 8.0)
+  #   fp8_wo  — low_vram_wan2_2_720p_fp8.yaml (Ada/Hopper sm >= 8.9)
+  # transformer_quant: int8_wo
+  # quant_backend: torchao
diff --git a/configs/inference/presets/low_vram_wan2_2_720p_fp8.yaml b/configs/inference/presets/low_vram_wan2_2_720p_fp8.yaml
new file mode 100644
index 00000000..c6c5f5fb
--- /dev/null
+++ b/configs/inference/presets/low_vram_wan2_2_720p_fp8.yaml
@@ -0,0 +1,30 @@
+# Low VRAM + fp8 weight-only transformer quant (Wan 2.2 Diffusers 720p)
+# Requires NVIDIA CUDA (Ada/Hopper sm >= 8.9) and torchao>=0.15.0.
+# Usage: poetry run inference-wan2.2-t2v-720p --config configs/inference/presets/low_vram_wan2_2_720p_fp8.yaml
+flow:
+  target: videotuna.flow.diffusers_video.DiffusersVideoFlow
+  params:
+    model_family: wan
+    mode: t2v
+    pipeline_only: true
+    model_variant: "2.2"
+    pretrained_model_name_or_path: Wan-AI/Wan2.2-T2V-A14B-Diffusers
+inference:
+  mode: t2v
+  ckpt_path: Wan-AI/Wan2.2-T2V-A14B-Diffusers
+  savedir: results/t2v/wan2.2-t2v-a14b-low-vram-fp8
+  prompt_file: inputs/t2v/prompts.txt
+  frames: 81
+  height: 720
+  width: 1280
+  num_inference_steps: 50
+  unconditional_guidance_scale: 5.0
+  seed: 42
+  savefps: 16
+  memory_preset: low_vram
+  enable_model_cpu_offload: true
+  enable_vae_tiling: true
+  dtype: bf16
+  transformer_quant: fp8_wo
+  quant_backend: torchao
+  min_vram_gb: 10
diff --git a/configs/inference/presets/low_vram_wan2_2_720p_int8.yaml b/configs/inference/presets/low_vram_wan2_2_720p_int8.yaml
new file mode 100644
index 00000000..d9bbd0ac
--- /dev/null
+++ b/configs/inference/presets/low_vram_wan2_2_720p_int8.yaml
@@ -0,0 +1,30 @@
+# Low VRAM + int8 weight-only transformer quant (Wan 2.2 Diffusers 720p)
+# Requires NVIDIA CUDA and torchao>=0.15.0.
+# Usage: poetry run inference-wan2.2-t2v-720p --config configs/inference/presets/low_vram_wan2_2_720p_int8.yaml
+flow:
+  target: videotuna.flow.diffusers_video.DiffusersVideoFlow
+  params:
+    model_family: wan
+    mode: t2v
+    pipeline_only: true
+    model_variant: "2.2"
+    pretrained_model_name_or_path: Wan-AI/Wan2.2-T2V-A14B-Diffusers
+inference:
+  mode: t2v
+  ckpt_path: Wan-AI/Wan2.2-T2V-A14B-Diffusers
+  savedir: results/t2v/wan2.2-t2v-a14b-low-vram-int8
+  prompt_file: inputs/t2v/prompts.txt
+  frames: 81
+  height: 720
+  width: 1280
+  num_inference_steps: 50
+  unconditional_guidance_scale: 5.0
+  seed: 42
+  savefps: 16
+  memory_preset: low_vram
+  enable_model_cpu_offload: true
+  enable_vae_tiling: true
+  dtype: fp16
+  transformer_quant: int8_wo
+  quant_backend: torchao
+  min_vram_gb: 10
diff --git a/configs/inference/presets/max_speed_wan2_2_720p.yaml b/configs/inference/presets/max_speed_wan2_2_720p.yaml
new file mode 100644
index 00000000..0d0923d9
--- /dev/null
+++ b/configs/inference/presets/max_speed_wan2_2_720p.yaml
@@ -0,0 +1,27 @@
+# Max speed preset for Wan 2.2 Diffusers 720p (full GPU, ~40–48 GB VRAM)
+# Usage:
+#   poetry run inference-wan2.2-t2v-720p --config configs/inference/presets/max_speed_wan2_2_720p.yaml
+# Optional after warm-up: add --compile (sets VIDEOTUNA_TORCH_COMPILE=1)
+flow:
+  target: videotuna.flow.diffusers_video.DiffusersVideoFlow
+  params:
+    model_family: wan
+    mode: t2v
+    pipeline_only: true
+    model_variant: "2.2"
+    pretrained_model_name_or_path: Wan-AI/Wan2.2-T2V-A14B-Diffusers
+inference:
+  mode: t2v
+  ckpt_path: Wan-AI/Wan2.2-T2V-A14B-Diffusers
+  savedir: results/t2v/wan2.2-t2v-a14b-max-speed
+  prompt_file: inputs/t2v/prompts.txt
+  frames: 81
+  height: 720
+  width: 1280
+  num_inference_steps: 50
+  unconditional_guidance_scale: 5.0
+  seed: 42
+  savefps: 16
+  memory_preset: max_speed
+  dtype: bf16
+  min_vram_gb: 38
diff --git a/configs/inference/presets/wan2_2_cpu_smoke.yaml b/configs/inference/presets/wan2_2_cpu_smoke.yaml
new file mode 100644
index 00000000..342841cd
--- /dev/null
+++ b/configs/inference/presets/wan2_2_cpu_smoke.yaml
@@ -0,0 +1,26 @@
+# CPU smoke preset for Wan 2.2 Diffusers (dev/CI only — not for production)
+# Usage:
+#   poetry run inference-wan2.2-t2v-720p \
+#     --config configs/inference/presets/wan2_2_cpu_smoke.yaml --cpu-smoke
+flow:
+  target: videotuna.flow.diffusers_video.DiffusersVideoFlow
+  params:
+    model_family: wan
+    mode: t2v
+    pipeline_only: true
+    model_variant: "2.2"
+    pretrained_model_name_or_path: Wan-AI/Wan2.2-T2V-A14B-Diffusers
+inference:
+  mode: t2v
+  device: cpu
+  ckpt_path: Wan-AI/Wan2.2-T2V-A14B-Diffusers
+  savedir: results/t2v/wan2.2-t2v-a14b-cpu-smoke
+  prompt_file: inputs/t2v/prompts.txt
+  frames: 2
+  height: 256
+  width: 448
+  num_inference_steps: 4
+  unconditional_guidance_scale: 5.0
+  seed: 42
+  savefps: 8
+  dtype: fp32
diff --git a/configs/inference/presets/wan2_2_native_t2v_14b.yaml b/configs/inference/presets/wan2_2_native_t2v_14b.yaml
new file mode 100644
index 00000000..c63981db
--- /dev/null
+++ b/configs/inference/presets/wan2_2_native_t2v_14b.yaml
@@ -0,0 +1,74 @@
+flow:
+  target: videotuna.flow.wan2.2-native.WanVideoModelFlow
+  params:
+    task: "t2v-A14B" # The task to run (choices from WAN_CONFIGS.keys())
+    ckpt_path: "checkpoints/wan/Wan2.2-T2V-A14B" # The path to the checkpoint directory.
+    offload_model: true # Whether to offload the model to CPU after each model forward.
+    ulysses_size: 1 # The size of the ulysses parallelism in DiT.
+    ring_size: 1 # The size of the ring attention parallelism in DiT.
+    t5_fsdp: false # Whether to use FSDP for T5.
+    t5_cpu: false # Whether to place T5 model on CPU.
+    dit_fsdp: false # Whether to use FSDP for DiT.
+    use_prompt_extend: false # Whether to use prompt extend.
+    prompt_extend_method: "local_qwen" # The prompt extend method to use (choices: dashscope, local_qwen)
+    prompt_extend_model: null # The prompt extend model to use.
+    prompt_extend_target_lang: "zh" # The target language of prompt extend (choices: zh, en)
+    seed: 42
+
+    scheduler_config: __is_first_stage__
+
+    denoiser_config:
+      target: videotuna.models.wan.wan.modules.model.WanModel
+      use_from_pretrained: true
+      params:
+        pretrained_model_name_or_path: ${flow.params.ckpt_path}
+
+    first_stage_config:
+      target: videotuna.models.wan.wan.modules.vae.WanVAE_
+      params:
+        dim: 96
+        z_dim: 16
+        dim_mult: [1, 2, 4, 4]
+        num_res_blocks: 2
+        attn_scales: []
+        temperal_downsample: [false, true, true]
+        dropout: 0.0
+
+    cond_stage_config:
+      target: videotuna.models.wan.wan.modules.t5.T5Encoder
+      params:
+        dim: 4096
+        dim_attn: 4096
+        dim_ffn: 10240
+        num_heads: 64
+        num_buckets: 32
+        shared_pos: false
+        dropout: 0.1
+        vocab: 256384
+        num_layers: 24
+
+inference:
+  mode: t2v
+  ckpt_path: "checkpoints/wan/Wan2.2-T2V-A14B"
+  savedir: results/t2v/wan2.2-native
+  seed: 42
+  height: 480
+  width: 832
+  image: null
+  prompt_file: "Two anthropomorphic cats in comfy boxing gear and bright gloves fight intensely on a spotlighted stage."
+  solver: "unipc"
+  num_inference_steps: 50
+  time_shift: 3.0
+  unconditional_guidance_scale: 5.0
+  frames: 81
+  n_samples_prompt: 1
+  bs: 1
+  savefps: 30
+  enable_model_cpu_offload: true
+
+  mapping:
+    inference.ckpt_path: flow.params.ckpt_path
+    inference.seed: flow.params.seed
+    inference.enable_model_cpu_offload: flow.params.offload_model
+    inference.ulysses_degree: flow.params.ulysses_size
+    inference.ring_degree: flow.params.ring_size
diff --git a/configs/inference/presets/wan_domain_i2v_smoke.yaml b/configs/inference/presets/wan_domain_i2v_smoke.yaml
new file mode 100644
index 00000000..80949cef
--- /dev/null
+++ b/configs/inference/presets/wan_domain_i2v_smoke.yaml
@@ -0,0 +1,69 @@
+# Smoke preset for domain Wan 2.1 I2V LoRA (480p, paired image+prompt dir)
+# Usage:
+#   poetry run python scripts/inference_new.py \
+#     --config configs/inference/presets/wan_domain_i2v_smoke.yaml \
+#     --ckpt_path checkpoints/wan/Wan2.1-I2V-14B-480P \
+#     --trained_ckpt results/train/train_wan_domain_i2v_lora_<ts>/checkpoints/only_trained_model/denoiser-000-000000025.ckpt \
+#     --prompt_dir inputs/i2v/domain_smoke
+flow:
+  target: videotuna.flow.wanvideo.WanVideoModelFlow
+  params:
+    task: "i2v-14B"
+    ckpt_path: checkpoints/wan/Wan2.1-I2V-14B-480P
+    offload_model: true
+    seed: 42
+    gradient_checkpointing: true
+    denoiser_config:
+      target: videotuna.models.wan.wan.modules.model.WanModel
+      use_from_pretrained: true
+      params:
+        pretrained_model_name_or_path: ${flow.params.ckpt_path}
+        subfolder: high_noise_model
+        model_type: i2v
+    first_stage_config:
+      target: videotuna.models.wan.wan.modules.vae.WanVAE_
+      params:
+        dim: 96
+        z_dim: 16
+        dim_mult: [1, 2, 4, 4]
+        num_res_blocks: 2
+        attn_scales: []
+        temperal_downsample: [false, true, true]
+        dropout: 0.0
+    cond_stage_config:
+      target: videotuna.models.wan.wan.modules.t5.T5Encoder
+      params:
+        dim: 4096
+        dim_attn: 4096
+        dim_ffn: 10240
+        num_heads: 64
+        num_buckets: 32
+        shared_pos: false
+        dropout: 0.1
+        vocab: 256384
+        num_layers: 24
+    lora_config:
+      target: peft.LoraConfig
+      params:
+        r: 16
+        lora_alpha: 16.0
+        init_lora_weights: True
+        target_modules: [q, k, v, o, ffn.0, ffn.2]
+inference:
+  mode: i2v
+  ckpt_path: checkpoints/wan/Wan2.1-I2V-14B-480P
+  savedir: results/i2v/wan-domain-lora-smoke
+  seed: 42
+  height: 480
+  width: 832
+  frames: 81
+  num_inference_steps: 20
+  time_shift: 3.0
+  unconditional_guidance_scale: 5.0
+  savefps: 30
+  prompt_dir: inputs/i2v/domain_smoke
+  enable_model_cpu_offload: true
+  mapping:
+    inference.ckpt_path: flow.params.ckpt_path
+    inference.seed: flow.params.seed
+    inference.enable_model_cpu_offload: flow.params.offload_model
diff --git a/configs/inference/presets/wan_domain_i2v_smoke_22.yaml b/configs/inference/presets/wan_domain_i2v_smoke_22.yaml
new file mode 100644
index 00000000..ebd5c233
--- /dev/null
+++ b/configs/inference/presets/wan_domain_i2v_smoke_22.yaml
@@ -0,0 +1,30 @@
+# Domain Wan 2.2 I2V LoRA validation smoke (720p, few steps, offload)
+# Usage:
+#   poetry run validate-domain-i2v \
+#     --trained_ckpt results/train/train_wan_domain_i2v_lora_<ts>/checkpoints/only_trained_model/denoiser-000-000000025.ckpt \
+#     --prompt_dir inputs/i2v/domain_smoke
+flow:
+  target: videotuna.flow.diffusers_video.DiffusersVideoFlow
+  params:
+    model_family: wan
+    mode: i2v
+    pipeline_only: true
+    model_variant: "2.2"
+    pretrained_model_name_or_path: Wan-AI/Wan2.2-I2V-A14B-Diffusers
+inference:
+  mode: i2v
+  ckpt_path: Wan-AI/Wan2.2-I2V-A14B-Diffusers
+  savedir: results/i2v/wan-domain-lora-smoke-22
+  prompt_dir: inputs/i2v/domain_smoke
+  frames: 81
+  height: 720
+  width: 1280
+  num_inference_steps: 4
+  unconditional_guidance_scale: 5.0
+  seed: 42
+  savefps: 16
+  memory_preset: balanced
+  enable_model_cpu_offload: true
+  enable_vae_tiling: true
+  dtype: bf16
+  min_vram_gb: 24
diff --git a/configs/inference/presets/wan_domain_lora_smoke.yaml b/configs/inference/presets/wan_domain_lora_smoke.yaml
new file mode 100644
index 00000000..7763b5a4
--- /dev/null
+++ b/configs/inference/presets/wan_domain_lora_smoke.yaml
@@ -0,0 +1,66 @@
+# Smoke preset for domain Wan 2.1 T2V LoRA (few steps, 480p, offload)
+# Usage:
+#   poetry run python scripts/inference_new.py \
+#     --config configs/inference/presets/wan_domain_lora_smoke.yaml \
+#     --ckpt_path checkpoints/wan/Wan2.1-T2V-14B \
+#     --trained_ckpt results/train/train_wan_domain_t2v_lora_<ts>/checkpoints/only_trained_model/denoiser-000-000000025.ckpt \
+#     --prompt "sks_style, slow camera push-in, soft lighting"
+flow:
+  target: videotuna.flow.wanvideo.WanVideoModelFlow
+  params:
+    task: "t2v-14B"
+    ckpt_path: checkpoints/wan/Wan2.1-T2V-14B
+    offload_model: true
+    seed: 42
+    gradient_checkpointing: true
+    denoiser_config:
+      target: videotuna.models.wan.wan.modules.model.WanModel
+      use_from_pretrained: true
+      params:
+        pretrained_model_name_or_path: ${flow.params.ckpt_path}
+    first_stage_config:
+      target: videotuna.models.wan.wan.modules.vae.WanVAE_
+      params:
+        dim: 96
+        z_dim: 16
+        dim_mult: [1, 2, 4, 4]
+        num_res_blocks: 2
+        attn_scales: []
+        temperal_downsample: [false, true, true]
+        dropout: 0.0
+    cond_stage_config:
+      target: videotuna.models.wan.wan.modules.t5.T5Encoder
+      params:
+        dim: 4096
+        dim_attn: 4096
+        dim_ffn: 10240
+        num_heads: 64
+        num_buckets: 32
+        shared_pos: false
+        dropout: 0.1
+        vocab: 256384
+        num_layers: 24
+    lora_config:
+      target: peft.LoraConfig
+      params:
+        r: 16
+        lora_alpha: 16.0
+        init_lora_weights: True
+        target_modules: [q, k, v, o, ffn.0, ffn.2]
+inference:
+  mode: t2v
+  ckpt_path: checkpoints/wan/Wan2.1-T2V-14B
+  savedir: results/t2v/wanvideo-domain-smoke
+  seed: 42
+  height: 480
+  width: 832
+  frames: 81
+  num_inference_steps: 20
+  time_shift: 3.0
+  unconditional_guidance_scale: 5.0
+  savefps: 30
+  enable_model_cpu_offload: true
+  mapping:
+    inference.ckpt_path: flow.params.ckpt_path
+    inference.seed: flow.params.seed
+    inference.enable_model_cpu_offload: flow.params.offload_model
diff --git a/configs/inference/presets/wan_domain_lora_smoke_22.yaml b/configs/inference/presets/wan_domain_lora_smoke_22.yaml
new file mode 100644
index 00000000..b8fb1df0
--- /dev/null
+++ b/configs/inference/presets/wan_domain_lora_smoke_22.yaml
@@ -0,0 +1,30 @@
+# Domain Wan 2.2 T2V LoRA validation smoke (720p, few steps, offload)
+# Usage:
+#   poetry run validate-domain-t2v \
+#     --trained_ckpt results/train/train_wan_domain_t2v_lora_<ts>/checkpoints/only_trained_model/denoiser-000-000000025.ckpt \
+#     --prompt_file inputs/t2v/domain_prompt.txt
+flow:
+  target: videotuna.flow.diffusers_video.DiffusersVideoFlow
+  params:
+    model_family: wan
+    mode: t2v
+    pipeline_only: true
+    model_variant: "2.2"
+    pretrained_model_name_or_path: Wan-AI/Wan2.2-T2V-A14B-Diffusers
+inference:
+  mode: t2v
+  ckpt_path: Wan-AI/Wan2.2-T2V-A14B-Diffusers
+  savedir: results/t2v/wan-domain-lora-smoke-22
+  prompt_file: "sks_style, slow camera push-in, soft lighting"
+  frames: 81
+  height: 720
+  width: 1280
+  num_inference_steps: 4
+  unconditional_guidance_scale: 5.0
+  seed: 42
+  savefps: 16
+  memory_preset: balanced
+  enable_model_cpu_offload: true
+  enable_vae_tiling: true
+  dtype: bf16
+  min_vram_gb: 20
diff --git a/configs/inference/presets/wan_domain_lora_smoke_22_low_vram.yaml b/configs/inference/presets/wan_domain_lora_smoke_22_low_vram.yaml
new file mode 100644
index 00000000..7ec65638
--- /dev/null
+++ b/configs/inference/presets/wan_domain_lora_smoke_22_low_vram.yaml
@@ -0,0 +1,35 @@
+# Low VRAM domain Wan 2.2 LoRA validation smoke (12–16 GB GPUs)
+# Usage:
+#   poetry run validate-domain-t2v \
+#     --config configs/inference/presets/wan_domain_lora_smoke_22_low_vram.yaml \
+#     --trained_ckpt <denoiser.ckpt>
+flow:
+  target: videotuna.flow.diffusers_video.DiffusersVideoFlow
+  params:
+    model_family: wan
+    mode: t2v
+    pipeline_only: true
+    model_variant: "2.2"
+    pretrained_model_name_or_path: Wan-AI/Wan2.2-T2V-A14B-Diffusers
+inference:
+  mode: t2v
+  ckpt_path: Wan-AI/Wan2.2-T2V-A14B-Diffusers
+  savedir: results/t2v/wan-domain-lora-smoke-22-low-vram
+  prompt_file: "sks_style, slow camera push-in, soft lighting"
+  frames: 81
+  height: 720
+  width: 1280
+  num_inference_steps: 4
+  unconditional_guidance_scale: 5.0
+  seed: 42
+  savefps: 16
+  memory_preset: low_vram
+  enable_sequential_cpu_offload: true
+  enable_vae_tiling: true
+  dtype: fp16
+  min_vram_gb: 10
+  # Optional transformer quant for base inference (LoRA+quant is attempted; use none if bridge fails):
+  #   int8_wo — see low_vram_wan2_2_720p_int8.yaml
+  #   fp8_wo  — see low_vram_wan2_2_720p_fp8.yaml (Ada/Hopper sm >= 8.9)
+  # transformer_quant: int8_wo
+  # quant_backend: torchao
diff --git a/configs/inference/wan2_2_t2v_a14b.yaml b/configs/inference/wan2_2_t2v_a14b.yaml
new file mode 120000
index 00000000..3974a5c9
--- /dev/null
+++ b/configs/inference/wan2_2_t2v_a14b.yaml
@@ -0,0 +1 @@
+presets/balanced_wan2_2_720p.yaml
\ No newline at end of file
diff --git a/docker-compose.yml b/docker-compose.yml
index d4cb5b28..b462e7d2 100644
--- a/docker-compose.yml
+++ b/docker-compose.yml
@@ -1,10 +1,17 @@
+x-privtune-dev: &privtune-dev
+  platform: linux/amd64
+  build:
+    context: .
+    dockerfile: docker/poetry/Dockerfile
+  user: "${HOST_UID:-1000}:${HOST_GID:-1000}"  # Run as the host's user
+  volumes:
+    - .:/opt/VideoTuna
+
 services:
-  videotuna:
-    image: "videotuna:${TAG-latest}"
-    platform: linux/amd64
-    build:
-      context: .
-      dockerfile: docker/poetry/Dockerfile
-    user: "${HOST_UID:-1000}:${HOST_GID:-1000}"  # Run as the host's user
-    volumes:
-      - .:/opt/VideoTuna
+  privtune:
+    <<: *privtune-dev
+    image: "privtune:${TAG-latest}"
+
+  videotuna:  # deprecated — use privtune
+    <<: *privtune-dev
+    image: "videotune:${TAG-latest}"
diff --git a/docker/Dockerfile b/docker/Dockerfile
index 2d1fea4c..20c52e1e 100644
--- a/docker/Dockerfile
+++ b/docker/Dockerfile
@@ -1,81 +1,16 @@
 # Usage: 1. cd ./docker
-# Usage: 2. docker build -t videotuna:v1.0 .
-# Usage: 3. docker run --gpus all -it videotuna:v1.0 /bin/bash
-
-
+# Usage: 2. docker build -t privtune:latest .
+# Usage: 3. docker run --gpus all -it privtune:latest /bin/bash
+#
+# Legacy VideoTuna all-in-one image — prefer Poetry install from repo root
+# (see README.md) for PrivTune domain training.
 
 FROM runpod/pytorch:2.2.1-py3.10-cuda12.1.1-devel-ubuntu22.04
 WORKDIR /content
-ENV PATH="/home/videotuna/.local/bin:${PATH}"
+ENV PATH="/home/privtune/.local/bin:${PATH}"
 
 RUN apt update -y && add-apt-repository -y ppa:git-core/ppa && apt update -y
 RUN apt install -y aria2 git git-lfs unzip ffmpeg python3-tk
 
-# ---------------------------- Set up python env ----------------------------
-RUN git clone https://github.com/VideoVerses/VideoTuna.git &&\
-    cd VideoTuna && \
-    pip install -r requirements.txt && \
-    git clone https://github.com/JingyeChen/SwissArmyTransformer && \
-    pip install -e SwissArmyTransformer/ && \
-    git clone https://github.com/tgxs002/HPSv2.git && \
-    cd ./HPSv2 && \
-    pip install -e . && \
-    cd .. && \
-    pip install ffmpeg
-
-# ---------------------------- For add check points ----------------------------
-# RUN mkdir checkpoints
-
-# ---------------------------- T2V ----------------------------
-
-# # ---- CogVideo (diffusers) ----
-# RUN mkdir -p checkpoints/cogvideo; cd checkpoints/cogvideo
-# RUN git clone https://huggingface.co/THUDM/CogVideoX-2b         # This are checkpoints for CogVideoX T2V-2B
-# # RUN git clone https://huggingface.co/THUDM/CogVideoX-5b         # This are checkpoints for CogVideoX T2V-5B
-# # RUN git clone https://huggingface.co/THUDM/CogVideoX-5b-I2V     # This are checkpoints for CogVideoX I2V-5B
-# # RUN git clone https://huggingface.co/THUDM/CogVideoX1.5-5B-SAT  # This are checkpoints for CogVideoX 1.5-5B (both T2V and I2V)
-
-
-# # ---- Open-Sora ----
-# RUN mkdir -p checkpoints/open-sora/t2v_v10
-# # RUN wget https://huggingface.co/hpcai-tech/Open-Sora/resolve/main/OpenSora-v1-HQ-16x512x512.pth -P checkpoints/open-sora/t2v_v10/
-# # RUN wget https://huggingface.co/hpcai-tech/Open-Sora/resolve/main/OpenSora-v1-HQ-16x256x256.pth -P checkpoints/open-sora/t2v_v10/
-# RUN wget https://huggingface.co/hpcai-tech/Open-Sora/resolve/mai/n/OpenSora-v1-16x256x256.pth -P checkpoints/open-sora/t2v_v10/
-# #
-# RUN mkdir -p checkpoints/open-sora/t2v_v11
-# RUN cd checkpoints/open-sora/t2v_v11
-# RUN git clone https://huggingface.co/hpcai-tech/OpenSora-STDiT-v2-stage2
-# RUN git clone https://huggingface.co/hpcai-tech/OpenSora-STDiT-v2-stage3
-# RUN cd ../../..
-# #
-# RUN mkdir -p checkpoints/open-sora/t2v_v12/OpenSora-STDiT-v3
-# RUN mkdir -p checkpoints/open-sora/t2v_v12/OpenSora-VAE-v1.2
-# RUN wget https://huggingface.co/hpcai-tech/OpenSora-VAE-v1.2/resolve/main/model.safetensors -P checkpoints/open-sora/t2v_v12/OpenSora-VAE-v1.2
-# RUN wget https://huggingface.co/hpcai-tech/OpenSora-STDiT-v3/resolve/main/model.safetensors -P checkpoints/open-sora/t2v_v12/OpenSora-STDiT-v3
-
-
-# # ---- Videocrafter ----
-# RUN mkdir checkpoints/videocrafter/
-
-# RUN mkdir checkpoints/videocrafter/t2v_v2_512
-# RUN wget https://huggingface.co/VideoCrafter/VideoCrafter2/resolve/main/model.ckpt -P checkpoints/videocrafter/t2v_v2_512  # videocrafter2-t2v-512
-
-# RUN mkdir checkpoints/videocrafter/t2v_v1_1024
-# RUN wget https://huggingface.co/VideoCrafter/Text2Video-1024/resolve/main/model.ckpt -P checkpoints/videocrafter/t2v_v1_1024 # videocrafter1-t2v-1024
-
-
-# # ---------------------------- I2V ----------------------------
-# # ---- Dynamicrafter ----
-# RUN mkdir checkpoints/dynamicrafter/
-# RUN mkdir checkpoints/dynamicrafter/i2v_576x1024
-
-# RUN wget https://huggingface.co/Doubiiu/DynamiCrafter_1024/resolve/main/model.ckpt -P checkpoints/dynamicrafter/i2v_576x1024  # dynamicrafter-i2v-1024
-
-# # ---- Videocrafter ----
-# RUN mkdir -p checkpoints/videocrafter/i2v_v1_512
-
-# RUN wget https://huggingface.co/VideoCrafter/Image2Video-512/resolve/main/model.ckpt -P checkpoints/videocrafter/i2v_v1_512 # videocrafter1-i2v-512
-
-# # ---- Stable Diffusion checkpoint for VC2 Training ----
-# RUN mkdir -p checkpoints/stablediffusion/v2-1_512-ema
-# RUN wget https://huggingface.co/stabilityai/stable-diffusion-2-1-base/resolve/main/v2-1_512-ema-pruned.ckpt -P checkpoints/stablediffusion/v2-1_512-ema
+# PrivTune: clone repo and install via Poetry (CUDA + training).
+# Checkpoint download: see docs/checkpoints.md (Wan 2.1 local, Flux/Wan2.2 from HF hub).
diff --git a/docker/poetry/Dockerfile b/docker/poetry/Dockerfile
index 63d52396..284ade79 100644
--- a/docker/poetry/Dockerfile
+++ b/docker/poetry/Dockerfile
@@ -1,21 +1,24 @@
-FROM python:3.10-bookworm
+FROM python:3.11-bookworm
 
 ARG UID=1000
 ARG GID=1000
 
 WORKDIR /opt/VideoTuna/
 
-RUN groupadd -g "${GID}" videotuna \
-    && useradd -m -u "${UID}" -s /usr/bin/bash -g videotuna videotuna \
-    && chown -R videotuna:videotuna /opt/VideoTuna/ \
+RUN apt-get update \
+    && apt-get install -y --no-install-recommends build-essential ninja-build \
+    && rm -rf /var/lib/apt/lists/* \
+    && groupadd -g "${GID}" privtune \
+    && useradd -m -u "${UID}" -s /usr/bin/bash -g privtune privtune \
+    && chown -R privtune:privtune /opt/VideoTuna/ \
     && chmod -R 755 /opt/VideoTuna/ \
     && pip install pipx
 
-USER videotuna
+USER privtune
 
 WORKDIR /opt/VideoTuna/
 
-ENV PATH="/home/videotuna/.local/bin:${PATH}"
+ENV PATH="/home/privtune/.local/bin:${PATH}"
 
 RUN pipx ensurepath \
     && pipx install poetry \
diff --git a/docker/start_docker.sh b/docker/start_docker.sh
index dddfd606..fe1c56dc 100644
--- a/docker/start_docker.sh
+++ b/docker/start_docker.sh
@@ -1,6 +1,6 @@
 cd ./docker
-docker build -t videotuna:v1.0 .
-docker run --gpus all -it videotuna:v1.0 /bin/bash
+docker build -t privtune:v1.0 -t videotuna:v1.0 .
+docker run --gpus all -it privtune:v1.0 /bin/bash
 
 # if you want to use your local path:
-docker run  -v /path/to/VideoTuna/:/content/local_VideoTuna --gpus all -it videotuna:v1.0 /bin/bash
+docker run  -v /path/to/VideoTuna/:/content/local_VideoTuna --gpus all -it privtune:v1.0 /bin/bash
diff --git a/docs/MODEL_VERSIONS.md b/docs/MODEL_VERSIONS.md
new file mode 100644
index 00000000..21161db9
--- /dev/null
+++ b/docs/MODEL_VERSIONS.md
@@ -0,0 +1,176 @@
+# Model versions
+
+PrivTune supports three model families: two for **training** and one for **validation**.
+
+| Model | Hub ID | Role |
+|-------|--------|------|
+| FLUX.1-dev | `black-forest-labs/FLUX.1-dev` | **Train** — Phase 1 T2I LoRA |
+| Wan 2.1 T2V 14B | `Wan-AI/Wan2.1-T2V-14B` | **Train** — Phase 2 T2V LoRA |
+| Wan 2.1 I2V 14B 480P | `Wan-AI/Wan2.1-I2V-14B-480P` | **Train** — Phase 2.5 I2V LoRA (optional) |
+| Wan 2.2 T2V A14B Diffusers | `Wan-AI/Wan2.2-T2V-A14B-Diffusers` | **Validate** — Phase 2 production inference |
+| Wan 2.2 I2V A14B Diffusers | `Wan-AI/Wan2.2-I2V-A14B-Diffusers` | **Validate** — Phase 2.5 I2V production inference |
+
+## Training configs
+
+| Phase | Config |
+|-------|--------|
+| Flux T2I | `configs/domain/flux_t2i.json` + `configs/domain/flux_t2i_data.json` |
+| Wan T2V | `configs/domain/wan_t2v_lora.yaml` |
+| Wan I2V (optional) | `configs/domain/wan_i2v_lora.yaml` |
+| Flux T2I cloud smoke | `configs/domain/flux_t2i_cloud_smoke.json` + `configs/domain/flux_t2i_data.json` |
+| Wan T2V cloud smoke | `configs/domain/wan_t2v_lora_cloud_smoke.yaml` |
+
+Commands: `poetry run train-domain-t2i` / `poetry run train-domain-t2v` / `poetry run train-domain-i2v`
+
+Wan training requires DeepSpeed ZeRO-3: `poetry run install-deepspeed`
+
+## Smoke inference presets
+
+| Phase | Preset | Command |
+|-------|--------|---------|
+| Flux LoRA | `configs/inference/presets/flux_domain_lora_smoke.yaml` | `poetry run inference-domain-t2i` |
+| Wan 2.2 domain LoRA | `configs/inference/presets/wan_domain_lora_smoke_22.yaml` | `poetry run validate-domain-t2v --trained_ckpt <ckpt>` |
+| Wan 2.2 domain I2V LoRA | `configs/inference/presets/wan_domain_i2v_smoke_22.yaml` | `poetry run validate-domain-i2v --trained_ckpt <ckpt> --prompt_dir <dir>` |
+| Wan 2.2 (general) | `configs/inference/presets/balanced_wan2_2_720p.yaml` | `poetry run inference-wan2.2-t2v-720p` |
+| Wan 2.1 LoRA (optional) | `configs/inference/presets/wan_domain_lora_smoke.yaml` | `inference_new` + `--trained_ckpt` |
+| Wan 2.1 I2V LoRA (optional) | `configs/inference/presets/wan_domain_i2v_smoke.yaml` | `inference_new` + `--prompt_dir` |
+
+LoRA bridge (Wan 2.1 native → 2.2 Diffusers): `videotuna/utils/wan_lora_bridge.py`
+
+Offline export: `tools/convert_wan_lora_21_to_22.py`
+
+## ML stack pins (Wan 2.2 LoRA bridge audit)
+
+Audited at commit `a17b6a0` (2026-06-23). Validated against `tests/test_wan_lora_bridge.py` (remap coverage ≥ 90% on production-style fixture keys; 11 CPU tests + optional GPU smoke).
+
+| Package | Constraint | Locked | Notes |
+|---------|------------|--------|-------|
+| diffusers | `^0.38.0` | 0.38.0 | Wan `WanTransformer3DModel` + `WanLoraLoaderMixin`; pipeline `set_adapters(adapter_weights=…)` |
+| peft | `^0.17.0` | 0.17.1 | Runtime bridge uses `get_peft_model` + `set_peft_model_state_dict`; cap at 0.17.x (see matrix) |
+| transformers | `^4.48.0` | 4.57.6 | Shared with Flux training; lock resolves above floor |
+| accelerate | `^1.14.0` | 1.14.0 | peft requires `>=0.21.0`; diffusers 0.38 dev extra recommends `>=0.31.0` |
+| safetensors | `^0.8.0` | 0.8.0 | Required by diffusers 0.38 (`>=0.8.0rc0`) |
+
+### Upstream API alignment (diffusers 0.38 + peft 0.17)
+
+Wan 2.2 T2V/I2V pipelines expose two denoisers: `transformer` (high-noise) and `transformer_2` (low-noise). Diffusers community practice ([PR #12074](https://github.com/huggingface/diffusers/pull/12074)):
+
+- `pipeline.load_lora_weights(..., adapter_name=…)` for high-noise expert
+- `pipeline.load_lora_weights(..., load_into_transformer_2=True)` for low-noise expert
+- `pipeline.set_adapters(["a1", "a2"], adapter_weights=[w1, w2])` to activate both
+
+PrivTune bridge (`videotuna/utils/wan_lora_bridge.py`) mirrors this at runtime without pre-exported safetensors:
+
+| Upstream | PrivTune bridge |
+|----------|-----------------|
+| `load_lora_weights` per expert | `get_peft_model()` + `set_peft_model_state_dict()` on each `WanTransformer3DModel` |
+| Named adapters per expert | `domain_high` / `domain_low` |
+| `set_adapters` with weights | `pipeline.set_adapters(adapters, adapter_weights=scales)` |
+
+Offline export (`tools/convert_wan_lora_21_to_22.py`) writes `high_noise.safetensors` / `low_noise.safetensors` for native `load_lora_weights`; validated in `test_exported_lora_loads_via_diffusers_adapter`.
+
+Production inference (`videotuna/flow/diffusers_video.py`) detects native Wan 2.1 ckpts and calls the bridge; Diffusers-format LoRA dirs fall through to `pipeline.load_lora_weights()`.
+
+### Version matrix
+
+Procedure: swap only `diffusers` + `peft` in the Poetry env (`poetry run pip install --no-deps diffusers==X peft==Y`), run `poetry run test tests/test_wan_lora_bridge.py -q -k "not gpu"` and `poetry run python tools/spike_wan_lora_bridge.py --synthetic /tmp/synthetic-matrix.ckpt`, restore `diffusers==0.38.0 peft==0.17.1`.
+
+| Row | diffusers | peft | Result |
+|-----|-----------|------|--------|
+| A (baseline) | 0.36.0 | 0.17.1 | pass — remap 100% |
+| B | 0.37.1 | 0.17.1 | pass — remap 100% |
+| C | 0.38.0 | 0.17.1 | pass — **chosen combo** |
+| F | 0.38.0 | 0.18.1 | pass — remap 100% |
+| D | 0.38.0 | 0.19.1 | fail — `peft` 0.19.1 requires `torchao>=0.16.0`; project pins `torchao ^0.9.0` |
+| E | 0.36.0 | 0.19.1 | fail — same `torchao` incompatibility in `get_peft_model()` |
+
+**Decision:** keep `diffusers ^0.38.0` + `peft ^0.17.0`. peft 0.18.1 passes bridge tests but offers no benefit over 0.17.1 for this workflow; peft 0.19.x is blocked by the `torchao` pin (used elsewhere for Wan 2.2 quant inference).
+
+Debug inventory: `poetry run python tools/spike_wan_lora_bridge.py --synthetic /tmp/synthetic.ckpt`
+
+## Wan training stack pins (DeepSpeed + PyTorch Lightning audit)
+
+See [ADR-001](decisions/0001-dual-training-stacks.md) for rationale; pins below.
+
+| Phase | Stack | Entry point |
+|-------|-------|-------------|
+| Flux T2I LoRA | Hugging Face **Accelerate** | `poetry run train-domain-t2i` |
+| Wan 2.1 T2V / I2V LoRA | **PyTorch Lightning** + **DeepSpeed ZeRO-3** | `poetry run train-domain-t2v` / `train-domain-i2v` |
+
+Wan domain YAMLs set `train.lightning.strategy: deepspeed_stage_3_offload` with `precision: bf16-mixed`. Training runs through `scripts/train_new.py` (Lightning `Trainer` + DeepSpeed autocast wrapper). Checkpoint export for LoRA uses `deepspeed.utils.zero_to_fp32` in `videotuna/utils/callbacks.py`.
+
+Pinned in `[tool.poetry.group.training]` / `uv` `training` group (`poetry install -E cuda --with training`):
+
+| Package | Pin | Rationale |
+|---------|-----|-----------|
+| `deepspeed` | **0.19.2** | ZeRO-3 CPU offload for 14B LoRA on ~40 GB GPUs; `poetry run install-deepspeed` rebuilds against the active torch/CUDA build. |
+| `pytorch-lightning` | **2.4.0** | Native Wan training path (callbacks, `VideoTunaModelCheckpoint`, DeepSpeed strategy registry). Flux stays on Accelerate — no Lightning upgrade required for Phase 1. |
+| `torch` | **^2.6** (cu126) | Shared base; DeepSpeed ops JIT-built against installed torch. |
+
+### Breaking-change notes (0.19.2 + 2.4.0)
+
+**DeepSpeed 0.19.2 — mixed-dtype ZeRO-3 + PEFT (critical for Wan LoRA)**
+
+PR [#8066](https://github.com/deepspeedai/DeepSpeed/pull/8066) stopped blanket-casting all ZeRO-Init parameters to bf16 (correct for fp32 buffers such as RoPE `inv_freq`). That exposed a latent bug when PEFT’s default `autocast_adapter_dtype=True` keeps LoRA adapters in **fp32** while the frozen base stays **bf16**: the first optimizer step can fail in `_allgather_params_coalesced` ([DeepSpeed #8072](https://github.com/deepspeedai/DeepSpeed/issues/8072)). Upstream fix: [DeepSpeed #8073](https://github.com/deepspeedai/DeepSpeed/pull/8073) (open; no 0.19.3 release yet).
+
+PrivTune mitigates by passing `autocast_adapter_dtype=False` to `get_peft_model()` in `videotuna/base/generation_base.py` and `videotuna/utils/wan_training.py` (same pattern as [TRL #6091](https://github.com/huggingface/trl/pull/6091)).
+
+**PyTorch Lightning 2.4.0 — DeepSpeed integration**
+
+- String strategy `deepspeed_stage_3_offload` maps to `DeepSpeedStrategy` (stage 3, CPU optimizer + param offload).
+- `DeepSpeedOptimizer` import path fix landed in PL 2.3.x; compatible with DeepSpeed ≥ 0.14.1.
+- PL 2.5.x adds `exclude_frozen_parameters` on `DeepSpeedStrategy` (useful for LoRA) but is **not required** for current configs. Defer upgrade; avoid PL 2.6.2+ (upstream supply-chain advisory). Torch 2.6 is ahead of PL 2.4’s original test matrix but works in practice with these pins.
+
+**ZeRO-3 gradients:** use `deepspeed.utils.safe_get_full_grad(param)` if reading partitioned grads in custom code (not used in default Wan loss path).
+
+### GPU training smoke (manual)
+
+Requires NVIDIA GPU (~40 GB for Wan cloud smoke), Wan 14B weights under `checkpoints/wan/Wan2.1-T2V-14B`, and `data/t2v/domain/metadata.csv` + videos.
+
+```bash
+poetry install -E cuda --with training
+poetry run install-deepspeed
+
+# Short cloud smoke preset (1 epoch, checkpoint every 5 steps)
+poetry run train-domain-t2v \
+  --base configs/domain/wan_t2v_lora_cloud_smoke.yaml \
+  --devices 0,
+```
+
+On Vast.ai after provisioning: `TRAIN_PROFILE=wan-t2v-lora ./cloud/vast/run-smoke-train.sh` (see [cloud-gpu-training.md](runbooks/cloud-gpu-training.md)).
+
+Success: trainer reports `DeepSpeedStrategy`, completes ≥5 steps, writes `results/train/.../checkpoints/only_trained_model/denoiser-*.ckpt`.
+
+Local dev without GPU: CPU CI covers config YAML and flow-matching helpers only (`tests/test_domain_finetune_configs.py`, `tests/test_wan_training_step.py`); ZeRO-3 behavior is not exercised in CI.
+
+## CI smoke (CPU config validation)
+
+```bash
+poetry run lint
+poetry run format-check
+poetry run test tests/test_import_smoke.py -q
+poetry run test tests/test_domain_finetune_configs.py -q
+poetry run test tests/test_flux_lora_train_smoke.py -q
+poetry run test tests/test_wan_lora_bridge.py -q
+poetry run test tests/test_wan_i2v_lora_bridge.py -q
+poetry run test tests/test_wan_domain_lora_smoke_22_config.py -q
+poetry run test tests/test_wan_domain_i2v_smoke_22_config.py -q
+poetry run test tests/test_wan_i2v_dataset.py -q
+poetry run test tests/test_wan_training_step.py -q
+poetry run test tests/test_poetry_scripts.py -q
+```
+
+GPU inference smoke (optional, manual — skipped in CI without GPU):
+
+```bash
+# Base pipeline only (no LoRA)
+poetry run inference-wan2.2-t2v-720p \
+  --config configs/inference/presets/wan2_2_cpu_smoke.yaml \
+  --num_inference_steps 4 --enable_model_cpu_offload
+
+# Domain LoRA validation (requires trained denoiser ckpt + GPU)
+poetry run validate-domain-t2v \
+  --trained_ckpt results/train/.../denoiser-000-000000025.ckpt \
+  --num_inference_steps 4 \
+  --config configs/inference/presets/wan_domain_lora_smoke_22_low_vram.yaml
+```
diff --git a/docs/checkpoints.md b/docs/checkpoints.md
index 928a157c..5eb755e1 100644
--- a/docs/checkpoints.md
+++ b/docs/checkpoints.md
@@ -1,187 +1,60 @@
-
 # Prepare checkpoints
 
-This document contains commands for preparing model checkpoints and the final checkpoint organization structure.
-
-
-### 1. Supported Models
-
-|T2V-Models|HxWxL|Checkpoints|
-|:---------|:---------|:--------|
-|HunyuanVideo|720x1280x129|[Hugging Face](https://huggingface.co/tencent/HunyuanVideo)
-|Mochi|848x480, 3s|[Hugging Face](https://huggingface.co/genmo/mochi-1-preview)
-|CogVideoX-2B|480x720x49|[Hugging Face](https://huggingface.co/THUDM/CogVideoX-2b)
-|CogVideoX-5B|480x720x49|[Hugging Face](https://huggingface.co/THUDM/CogVideoX-5b)
-|Open-Sora 1.0|512×512x16|[Hugging Face](https://huggingface.co/hpcai-tech/Open-Sora/blob/main/OpenSora-v1-HQ-16x512x512.pth)
-|Open-Sora 1.0|256×256x16|[Hugging Face](https://huggingface.co/hpcai-tech/Open-Sora/blob/main/OpenSora-v1-HQ-16x256x256.pth)
-|Open-Sora 1.0|256×256x16|[Hugging Face](https://huggingface.co/hpcai-tech/Open-Sora/blob/main/OpenSora-v1-16x256x256.pth)
-|VideoCrafter2|320x512x16|[Hugging Face](https://huggingface.co/VideoCrafter/VideoCrafter2/blob/main/model.ckpt)
-|VideoCrafter1|576x1024x16|[Hugging Face](https://huggingface.co/VideoCrafter/Text2Video-1024/blob/main/model.ckpt)
-|VideoCrafter1|320x512x16|[Hugging Face](https://huggingface.co/VideoCrafter/Text2Video-512/blob/main/model.ckpt)
-
-|I2V-Models|HxWxL|Checkpoints|
-|:---------|:---------|:--------|
-|CogVideoX-5B-I2V|480x720x49|[Hugging Face](https://huggingface.co/THUDM/CogVideoX-5b-I2V)
-|DynamiCrafter|576x1024x16|[Hugging Face](https://huggingface.co/Doubiiu/DynamiCrafter_1024/blob/main/model.ckpt)|
-|VideoCrafter1|320x512x16|[Hugging Face](https://huggingface.co/VideoCrafter/Image2Video-512/blob/main/model.ckpt)|
-
-* Note: H: height; W: width; L: length
-
+PrivTune domain training and validation use Hugging Face hub weights (downloaded on first run) or local clones under `checkpoints/`.
 
-### 2. Download checkpoints
-Please run the following commands in your terminal to download the checkpoints for each model.
-```
-mkdir checkpoints
-
-# ---------------------------- T2V ----------------------------
-
-# ---- CogVideo (diffusers) ----
-mkdir -p checkpoints/cogvideo; cd checkpoints/cogvideo
-git clone https://huggingface.co/THUDM/CogVideoX-2b         # This are checkpoints for CogVideoX T2V-2B
-git clone https://huggingface.co/THUDM/CogVideoX-5b         # This are checkpoints for CogVideoX T2V-5B
-git clone https://huggingface.co/THUDM/CogVideoX-5b-I2V     # This are checkpoints for CogVideoX I2V-5B
-git clone https://huggingface.co/THUDM/CogVideoX1.5-5B-SAT  # This are checkpoints for CogVideoX 1.5-5B (both T2V and I2V)
-
-# ---- HunyuanVideo (diffusers) ----
-cd VideoTuna   # Make sure you are under the root path of VideoTuna
-mkdir checkpoints/hunyuanvideo
-cd checkpoints/hunyuanvideo
-git lfs install
-git clone https://huggingface.co/hunyuanvideo-community/HunyuanVideo
-
-
-# ---- Open-Sora ----
-mkdir -p checkpoints/open-sora/t2v_v10
-wget https://huggingface.co/hpcai-tech/Open-Sora/resolve/main/OpenSora-v1-HQ-16x512x512.pth -P checkpoints/open-sora/t2v_v10/
-wget https://huggingface.co/hpcai-tech/Open-Sora/resolve/main/OpenSora-v1-HQ-16x256x256.pth -P checkpoints/open-sora/t2v_v10/
-wget https://huggingface.co/hpcai-tech/Open-Sora/resolve/main/OpenSora-v1-16x256x256.pth -P checkpoints/open-sora/t2v_v10/
-#
-mkdir -p checkpoints/open-sora/t2v_v11
-cd checkpoints/open-sora/t2v_v11
-git clone https://huggingface.co/hpcai-tech/OpenSora-STDiT-v2-stage2
-git clone https://huggingface.co/hpcai-tech/OpenSora-STDiT-v2-stage3
-cd ../../..
-#
-mkdir -p checkpoints/open-sora/t2v_v12/OpenSora-STDiT-v3
-mkdir -p checkpoints/open-sora/t2v_v12/OpenSora-VAE-v1.2
-wget https://huggingface.co/hpcai-tech/OpenSora-VAE-v1.2/resolve/main/model.safetensors -P checkpoints/open-sora/t2v_v12/OpenSora-VAE-v1.2
-wget https://huggingface.co/hpcai-tech/OpenSora-STDiT-v3/resolve/main/model.safetensors -P checkpoints/open-sora/t2v_v12/OpenSora-STDiT-v3
-
-
-# ---- Videocrafter ----
-mkdir checkpoints/videocrafter/
-
-mkdir checkpoints/videocrafter/t2v_v2_512
-wget https://huggingface.co/VideoCrafter/VideoCrafter2/resolve/main/model.ckpt -P checkpoints/videocrafter/t2v_v2_512  # videocrafter2-t2v-512
-python tools/videocrafter_checkpoint_converter.py
-
-mkdir checkpoints/videocrafter/t2v_v1_1024
-wget https://huggingface.co/VideoCrafter/Text2Video-1024/resolve/main/model.ckpt -P checkpoints/videocrafter/t2v_v1_1024 # videocrafter1-t2v-1024
-
-
-# ---- StepVideo ----
-mkdir checkpoints/stepvideo/
-cd checkpoints/stepvideo
-huggingface-cli download stepfun-ai/stepvideo-t2v --local-dir ./stepvideo-t2v
-cd ../..
-
-# ---- Wan ----
-mkdir checkpoints/wan/
-cd checkpoints/wan
-huggingface-cli download Wan-AI/Wan2.1-T2V-14B --local-dir ./Wan2.1-T2V-14B
-cd ../..
-
-
-# ---- HunyuanVideo ----
-mkdir -p checkpoints/hunyuanvideo/
-huggingface-cli download tencent/HunyuanVideo-I2V --local-dir ./checkpoints/hunyuanvideo/HunyuanVideo-I2V
-cd checkpoints/hunyuanvideo/HunyuanVideo-I2V
-huggingface-cli download xtuner/llava-llama-3-8b-v1_1-transformers --local-dir ./text_encoder_i2v
-huggingface-cli download openai/clip-vit-large-patch14 --local-dir ./text_encoder_2
-cd ../..
+## Supported models
 
+| Phase | Model | Hub ID | Local path (optional) |
+|-------|-------|--------|----------------------|
+| T2I LoRA | FLUX.1-dev | `black-forest-labs/FLUX.1-dev` | — (HF cache) |
+| T2V LoRA train | Wan 2.1 T2V 14B | `Wan-AI/Wan2.1-T2V-14B` | `checkpoints/wan/Wan2.1-T2V-14B` |
+| I2V LoRA train | Wan 2.1 I2V 14B 480P | `Wan-AI/Wan2.1-I2V-14B-480P` | `checkpoints/wan/Wan2.1-I2V-14B-480P` |
+| T2V validate | Wan 2.2 Diffusers | `Wan-AI/Wan2.2-T2V-A14B-Diffusers` | — (HF cache) |
+| I2V validate | Wan 2.2 I2V Diffusers | `Wan-AI/Wan2.2-I2V-A14B-Diffusers` | — (HF cache) |
 
-# ---------------------------- I2V ----------------------------
-# ---- Dynamicrafter ----
-mkdir checkpoints/dynamicrafter/
-mkdir checkpoints/dynamicrafter/i2v_576x1024
+## Compute compatibility
 
-wget https://huggingface.co/Doubiiu/DynamiCrafter_1024/resolve/main/model.ckpt -P checkpoints/dynamicrafter/i2v_576x1024  # dynamicrafter-i2v-1024
+| Backend | Flux T2I | Wan 2.1 train | Wan 2.2 infer |
+|---------|----------|---------------|---------------|
+| NVIDIA CUDA | Yes | Yes (+ DeepSpeed) | Yes |
+| AMD ROCm | Yes (`sdpa`) | Experimental | Yes (`sdpa` + offload) |
+| CPU | Config/smoke only | Config validation only | Tiny smoke preset only |
 
-# ---- Videocrafter ----
-mkdir -p checkpoints/videocrafter/i2v_v1_512
+Install: NVIDIA `poetry install -E cuda --with training` · AMD [`docs/install-rocm.md`](install-rocm.md) · CPU [`docs/install-cpu.md`](install-cpu.md)
 
-wget https://huggingface.co/VideoCrafter/Image2Video-512/resolve/main/model.ckpt -P checkpoints/videocrafter/i2v_v1_512 # videocrafter1-i2v-512
+See [`docs/MODEL_VERSIONS.md`](MODEL_VERSIONS.md) and [`docs/runbooks/domain-adult-finetune.md`](runbooks/domain-adult-finetune.md) for commands and presets.
 
-# ---- Stable Diffusion checkpoint for VC2 Training ----
-mkdir -p checkpoints/stablediffusion/v2-1_512-ema
-wget https://huggingface.co/stabilityai/stable-diffusion-2-1-base/resolve/main/v2-1_512-ema-pruned.ckpt -P checkpoints/stablediffusion/v2-1_512-ema
+## Download (offline / air-gapped)
 
-# ---- Wan ----
-mkdir -p checkpoints/wan/
+```bash
+mkdir -p checkpoints/wan
 cd checkpoints/wan
-huggingface-cli download Wan-AI/Wan2.1-I2V-14B-720P --local-dir ./Wan2.1-I2V-14B-720P
-cd ../..
+hf download Wan-AI/Wan2.1-T2V-14B --local-dir ./Wan2.1-T2V-14B
+```
 
-# ---------------------------- V2V ----------------------------
-# ---- ModelScope Video-to-Video ----
-cd checkpoints
-# please ensure that you have installed lfs. If not, you can install it by running the following command:
-git lfs install
-# after installing lfs, you can clone the Video-to-Video checkpoints
-git clone https://www.modelscope.cn/iic/Video-to-Video.git
+Cloud renters on Vast.ai can opt into faster multi-GB hub pulls with `VIDEOTUNA_FAST_HF_DOWNLOAD=1` at instance launch — see [`docs/runbooks/cloud-gpu-training.md`](runbooks/cloud-gpu-training.md#fast-model-downloads-opt-in). Local dev is unchanged.
 
-```
+Flux and Wan 2.2 Diffusers weights are pulled from the hub on first `train-flux-lora` or `inference-wan2.2-t2v-720p` run unless you set `HF_HOME` or pass `--ckpt_path` / config overrides.
+
+## Commands
 
+| Use case | Command |
+|----------|---------|
+| Flux LoRA train | `poetry run train-domain-t2i` |
+| Flux LoRA smoke | `poetry run inference-domain-t2i` |
+| Wan LoRA train | `poetry run train-domain-t2v` |
+| Wan I2V LoRA train | `poetry run train-domain-i2v` |
+| Wan native smoke | `poetry run python scripts/inference_new.py --config configs/inference/presets/wan_domain_lora_smoke.yaml` |
+| Wan 2.2 T2V validation | `poetry run inference-wan2.2-t2v-720p` |
+| Wan 2.2 I2V validation | `poetry run validate-domain-i2v` |
 
-### 3. Checkpoint Orgnization Structure
-After downloading, the model checkpoints should be placed as follows:
+## Checkpoint layout
 
 ```
 VideoTuna/
     └── checkpoints/
-        ├── cogvideo/
-        │   └── CogVideoX-2b/
-        │   └── CogVideoX-5b/
-        │   └── CogVideoX-5b-I2V/
-        ├── hunyuanvideo/
-        │   ├── HunyuanVideo-I2V/
-        │   │   └── hunyuan-video-i2v-720p/
-        │   │   └── text_encoder_2
-        │   │   └── text_encoder_i2v
-        │   └── HunyuanVideo/
-        │       └── scheduler/
-        │       └── text_encoder/
-        │       └── text_encoder_2/
-        │       └── tokenizer/
-        │       └── tokenizer_2/
-        │       └── transformer/
-        │       └── vae/
-        ├── dynamicrafter/
-        │   └── i2v_576x1024/
-        │       └── model.ckpt
-        ├── videocrafter/
-        │   ├── t2v_v2_512/
-        │   │   └── model.ckpt
-        │   ├── t2v_v2_512_split/
-        │   │   └── cond_stage.ckpt
-        │   │   └── denoiser.ckpt
-        │   │   └── first_stage.ckpt
-        │   │   └── model_new.ckpt
-        │   ├── t2v_v1_1024/
-        │   │   └── model.ckpt
-        │   └── i2v_v1_512/
-        │       └── model.ckpt
-        └── open-sora/
-            ├── t2v_v10/
-            │   ├── OpenSora-v1-16x256x256.pth
-            │   └── OpenSora-v1-HQ-16x512x512.pth
-            ├── t2v_v11/
-            │   ├── OpenSora-STDiT-v2-stage2/
-            │   └── OpenSora-STDiT-v2-stage3/
-            └── t2v_v12/
-                ├── OpenSora-STDiT-v3/
-                └── OpenSora-VAE-v1.2/
+        └── wan/
+            └── Wan2.1-T2V-14B/    # required for Wan 2.1 LoRA training
 ```
 
-If you do not follow these locations, please modify the default checkpoint path argument during training/inference.
+Training outputs go under `results/train/` (not committed).
diff --git a/docs/decisions/0001-dual-training-stacks.md b/docs/decisions/0001-dual-training-stacks.md
new file mode 100644
index 00000000..f66a059b
--- /dev/null
+++ b/docs/decisions/0001-dual-training-stacks.md
@@ -0,0 +1,108 @@
+# ADR-001: Dual training stacks (Accelerate vs Lightning+DeepSpeed)
+
+## Status
+
+Accepted
+
+## Date
+
+2026-06-23
+
+## Context
+
+PrivTune is a **training-only** platform with two model families at different scales and upstream ecosystems:
+
+| Phase | Model | Role |
+|-------|-------|------|
+| 1 — T2I | FLUX.1-dev LoRA | Still-image domain style |
+| 2 — T2V / 2.5 — I2V | Wan 2.1 LoRA | Short-video domain motion |
+
+New contributors often see two incompatible training paths and assume technical debt or an incomplete migration. The split is **intentional**.
+
+**Flux T2I** fits the Hugging Face stack end-to-end: Diffusers pipeline, PEFT LoRA, single-GPU training (~24–40 GB VRAM; see [domain-adult-finetune runbook](../runbooks/domain-adult-finetune.md)).
+
+**Wan 2.1 T2V/I2V** uses a [vendored native stack](../../videotuna/models/wan/) without a maintained Diffusers training path for domain LoRA. The 14B video model at 480×832×81 frames needs **DeepSpeed ZeRO-3 CPU offload** to fit on ~38–44 GB GPUs.
+
+**Validation inference** unifies on Diffusers (Wan 2.2) via [`videotuna/utils/wan_lora_bridge.py`](../../videotuna/utils/wan_lora_bridge.py). That bridge is inference-only; it does not replace the native Wan training stack.
+
+## Decision
+
+Keep two training stacks:
+
+| Phase | Stack | Entry point | Code home |
+|-------|-------|-------------|-----------|
+| Flux T2I LoRA | Hugging Face **Accelerate** | `poetry run train-domain-t2i` | [`videotuna/training/flux_lora/`](../../videotuna/training/flux_lora/) |
+| Wan 2.1 T2V / I2V LoRA | **PyTorch Lightning** + **DeepSpeed ZeRO-3** | `poetry run train-domain-t2v` / `train-domain-i2v` | [`scripts/train_new.py`](../../scripts/train_new.py), [`videotuna/flow/wanvideo.py`](../../videotuna/flow/wanvideo.py) |
+
+**Flux on Accelerate** — first-party trainer launched via `accelerate launch` (`scripts/__init__.py` → `videotuna/training/flux_lora/train.py`).
+
+**Wan on Lightning + DeepSpeed** — YAML-driven `GenerationBase.init_trainer()` (`videotuna/base/generation_base.py`), strategy `deepspeed_stage_3_offload` in domain YAMLs (e.g. `configs/domain/wan_t2v_lora.yaml`), DeepSpeed-specific checkpoint export in `videotuna/utils/callbacks.py`.
+
+```mermaid
+flowchart LR
+  subgraph phase1 [Phase1_Flux_T2I]
+    FluxConfig["configs/domain/flux_t2i.json"]
+    FluxTrain["videotuna/training/flux_lora/"]
+    Accelerate["Accelerate launch"]
+    FluxConfig --> FluxTrain --> Accelerate
+  end
+  subgraph phase2 [Phase2_Wan_T2V_I2V]
+    WanYaml["configs/domain/wan_*_lora.yaml"]
+    WanFlow["wanvideo.py + GenerationBase"]
+    PL["PyTorch Lightning Trainer"]
+    DS["DeepSpeed ZeRO-3 offload"]
+    WanYaml --> WanFlow --> PL --> DS
+  end
+  subgraph validate [Phase3_Validation]
+    Bridge["wan_lora_bridge.py"]
+    Diffusers["DiffusersVideoFlow"]
+    Bridge --> Diffusers
+  end
+  phase2 --> Bridge
+```
+
+## Alternatives considered
+
+### Unify both on Accelerate
+
+Wan 14B LoRA still needs ZeRO-3 offload. Reimplementing Lightning callbacks, DeepSpeed checkpoint gather (`zero_to_fp32`), and PEFT dtype mitigations (`autocast_adapter_dtype=False` in `generation_base.py` and `wan_training.py`) on raw Accelerate is high risk for little gain.
+
+**Rejected.**
+
+### Unify both on Lightning
+
+Flux LoRA is a small, Diffusers-native loop. Forcing Lightning adds dependency surface and diverges from Hugging Face training examples without solving a Flux problem.
+
+**Rejected.**
+
+### Train Wan via Diffusers
+
+Wan 2.2 Diffusers is the **validation** target, not the 2.1 training stack. Domain training must stay on native 2.1 weights and checkpoint layout until/unless a first-party Diffusers trainer is built and bridge coverage is re-proven (≥ 90% remap in `tests/test_wan_lora_bridge.py`).
+
+**Rejected** for current scope.
+
+## Consequences
+
+### Install and dependencies
+
+- Wan training requires `poetry install -E cuda --with training` (pulls `deepspeed==0.19.2`, `pytorch-lightning==2.4.0`) and `poetry run install-deepspeed` after torch install.
+- Flux only needs `accelerate` from main deps; no DeepSpeed or Lightning in the Flux code path.
+
+### Contributor routing
+
+| If you are changing… | Touch |
+|---------------------|-------|
+| Flux LoRA loss, checkpoints, data | `videotuna/training/flux_lora/` |
+| Wan training loop, callbacks, ZeRO | `generation_base.py`, `callbacks.py`, `wan_training.py`, domain YAMLs |
+| Wan ↔ 2.2 validation | `wan_lora_bridge.py`, Diffusers presets |
+
+### Governance
+
+- **Do not** refactor one phase to match the other's stack without superseding this ADR.
+- Version pins and breaking-change notes stay in [MODEL_VERSIONS.md](../MODEL_VERSIONS.md); this ADR links there instead of duplicating version tables.
+
+## Related docs
+
+- [MODEL_VERSIONS.md](../MODEL_VERSIONS.md) — stack pins and upgrade audit
+- [vendor-policy.md](../vendor-policy.md) — vendored Wan vs first-party Flux layout
+- [domain-adult-finetune.md](../runbooks/domain-adult-finetune.md) — VRAM and training runbook
diff --git a/docs/evaluation.md b/docs/evaluation.md
deleted file mode 100644
index a53f4699..00000000
--- a/docs/evaluation.md
+++ /dev/null
@@ -1,70 +0,0 @@
-## Installation
-If you have installed the environment for the model training and inference, you can simply install some extra packages for evaluation.
-```shell
-pip install -r eval/requirements_vbench.txt
-python -m pip install 'git+https://github.com/facebookresearch/detectron2.git'
-```
-If you encounter errors during installing the [detectron2](https://github.com/facebookresearch/detectron2), you can check [here](https://detectron2.readthedocs.io/en/latest/tutorials/install.html) for detailed suggestions.
-
-## Usage
-1. Prepare samples and a json file.
-    Firstly, if you already have video samples, please export a json file for mapping the video file name to prompt. The format is as follows:
-    ```json
-    {
-        "sample1.mp4": "sample1's prompt",
-        "sample2.mp4": "sample2's prompt",
-        ...
-    }
-    ```
-
-    For the standard vbench evaluation, you have to do inference on `all_dimensions.txt`.
-
-2. Evaluation
-(1) Standard evaluation
-
-    Run the following command:
-    ```shell
-    python eval/scripts/evaluation.py  \
-        --output_path $output_path \
-        --videos_path $video_path \
-        --map_json_path $json_path
-    ```
-    The final score of all dimensions are saved in the file `final_results.json`. If you want to submit your result to the VBench Leaderboard, you can zip the files `results_eval_results.json` and `results_full_info.json` and upload it to the [Leaderboard](https://huggingface.co/spaces/Vchitect/VBench_Leaderboard).
-
-    Besides, you also can caluate the ***overall score***, ***quality score*** and ***sementic score*** in the VBench Leaderboard by yourself:
-    ```shell
-    python eval/scripts/tabular_score.py \
-        --result_path $result_json_path
-    ```
-    The result will be saved in the file `scaled_results.json`.
-
-    (2) Customized evaluation
-
-    If you want to evaluate the generation performance on your own prompts, you can choose the custom mode.
-    Note that Vbench only support the following dimensions for the custom mode:
-    ```python
-    dimensions = [
-        # Quality Score
-        "subject_consistency",
-        "background_consistency",
-        "motion_smoothness",
-        "dynamic_degree",
-        "aesthetic_quality",
-        "imaging_quality",
-        "temporal_flickering",
-        # Semantic Score
-        "temporal_style",
-        "overall_consistency",
-        "human_action",
-    ]
-    ```
-    You can run the following command to perform the customized evaluation:
-    ```shell
-    python eval/scripts/evaluation.py  \
-        --output_path $output_path \
-        --videos_path $video_path \
-        --map_json_path $json_path \
-        --dimension $dim1 $dim2 ... \
-        --mode custom_input
-    ```
-    The final score of each dimension is saved in the file `final_results.json`.
diff --git a/docs/finetune_cogvideox.md b/docs/finetune_cogvideox.md
deleted file mode 100644
index 26f4b23f..00000000
--- a/docs/finetune_cogvideox.md
+++ /dev/null
@@ -1,75 +0,0 @@
-
-# Introduction
-- This document provides instructions for fine-tuning the CogvideoX model.
-- It supports both text-to-video and image-to-video.
-- It supports both full fine-tuning and lora fine-tuning.
-
-# Preliminary steps
-1. Install the videotuna environment (see [Installation](https://github.com/VideoVerses/VideoTuna?tab=readme-ov-file#1prepare-environment)).
-2. Download the CogvideoX checkpoints (see [docs/checkpoints](https://github.com/VideoVerses/VideoTuna/blob/main/docs/CHECKPOINTS.md)). 
-3. Download the example training data.
-You can download manually from [this link](https://huggingface.co/datasets/Yingqing/VideoTuna-Datasets/resolve/main/apply_lipstick.zip), or download via `wget`:
-    ```
-    wget https://huggingface.co/datasets/Yingqing/VideoTuna-Datasets/resolve/main/apply_lipstick.zip
-    cd data
-    unzip apply_lipstick.zip -d apply_lipstick
-    ```
-    Make sure the data is putted at `data/apply_lipstick/metadata.csv`
-
-# Steps of Simple Fine-tuning
-**Lora Fine-tuning of CogVideoX Text-to-Video:**
-
-1. Run the commands in the terminal to launch training.
-    ```
-    bash shscripts/train_cogvideox_t2v_lora.sh
-    ```
-2. After training, run the commands to inference your personalized models.
-    ```
-    bash shscripts/inference_cogvideo_t2v_lora.sh
-    ```
-    - You need to provide the checkpoint path to the `ckpt` argument in the above shell script.  
-
-    Note: 
-    - The training and inference use the default model config from `configs/004_cogvideox/cogvideo5b.yaml`
-
-
-**Lora Fine-tuning of CogVideoX Image-to-Video:**
-1. Run the commands in the terminal to launch training.
-    ```
-    bash shscripts/train_cogvideox_i2v_lora.sh
-    ```
-2. After training, run the commands to inference your personalized models.
-    ```
-    bash shscripts/inference_cogvideo_i2v_lora.sh
-    ```
-    - You need to provide the checkpoint path to the `ckpt` argument in the above shell script.  
-
-    Note: 
-    - The training and inference use the default model config from `configs/004_cogvideox/cogvideo5b-i2v.yaml`
-
-**Full Fine-tuning of CogVideoX Text-to-Video:**
-1. Run the commands in the terminal to launch training.
-    ```
-    bash shscripts/train_cogvideox_t2v_fullft.sh
-    ```
-    We tested on 4 H800 GPUs. The training requires 68GB GPU memory.
-2. After training, run the commands to inference your personalized models.
-    ```
-    shscripts/inference_cogvideo_t2v_fullft.sh
-    ```
-    - You need to provide the checkpoint path to the `ckpt` argument in the above shell script. Because the full fine-tuning uses deepspeed to reduce GPU memory, so the checkpoint is like `${exp_save_dir}/checkpoints/trainstep_checkpoints/epoch=xxxxxx-step=xxxxxxxxx.ckpt/checkpoint/mp_rank_00_model_states.pt`
-
-    Note: 
-    - The training and inference use the default model config from `configs/004_cogvideox/cogvideo5b-i2v-fullft.yaml`
-
-**Full Fine-tuning of CogVideoX Image-to-Video:**
-
-Same as above full fine-tuning of text-to-video. 
-1. Training:
-```
-bash shscripts/train_cogvideox_i2v_fullft.sh
-```
-2. Inference:
-```
-shscripts/inference_cogvideo_i2v_fullft.sh
-```
\ No newline at end of file
diff --git a/docs/finetune_flux.md b/docs/finetune_flux.md
deleted file mode 100644
index ec7f0e7f..00000000
--- a/docs/finetune_flux.md
+++ /dev/null
@@ -1,104 +0,0 @@
-
-# Introduction
-This document provides instructions for fine-tuning the Flux.1-dev model.
-
-# Preliminary steps
-1. **Install the environment** (see [Installation]()). 
-2. **Log in to Hugging Face to get the access to the pretrained Flux model.** The pretrained model will be automatically downloaded when lauch the training.   
-    **(1) Log in in the Hugging face accoun**t from the model webpage [Flux.1-dev](https://huggingface.co/black-forest-labs/FLUX.1-dev), to be granted access to this model. 
-    - More Flux models can be check in the [Flux repo](https://github.com/black-forest-labs/flux?tab=readme-ov-file#models) 
-
-    **(2) Run `huggingface-cli login` in your terminal to log in to your Hugging Face account**  
-    - To log in, `huggingface_hub` requires a token generated from [this link](https://huggingface.co/settings/tokens). Get the token from Hugging Face and enter the token in your terminal.
-
-# Steps of Simple Fine-tuning
-We use images in `inputs/t2i/flux/plushie_teddybear` to train.  
-1. Set the exp configs in the file `configs/006_flux/config.json` and `configs/006_flux/multidatabackend.json`
-    <details>
-      <summary>Click to view the introduction to these arguments</summary>
-
-      
-      **Necessary arguments that you need to modify to train different loras.** 
-
-      config.json  
-      - `output_dir`: the directory for saving trained lora models and intermediate results.  
-      - `validation_prompt`: the testing prompt for validation during training. It should contain the concept name used in training labels.  
-
-      multidatabackend.json  
-      - `instance_data_dir`: the image directory. set to `data/images/${DataName}`
-      - `caption`: the simple caption that be used for all images. 
-      
-      **Optional arguments that you may need to adjust to match more advanced requirements.**  
-      config.json
-      - `max_train_steps`: the total steps for training.  
-      - `num_train_epochs`: Total number of training epochs (-1 means determined by steps).
-      - `lora_rank`: the rank of the LoRA models, the bigger, the more learnable parameters.
-      - `learning_rate`: controls how much the model weights are adjusted per update, balancing convergence speed and stability.
-      - `checkpointing_steps`: the steps intersection for saving each LoRA checkpoint.
-      - `checkpoints_total_limit`: the total number of saved model checkpoints.
-      - `resume_from_checkpoint`: Resume training from the latest checkpoint.
-      - `data_backend_config`: Path to the data backend configuration file.
-      - `pretrained_model_name_or_path`: Name or path of the pre-trained model.
-      - `seed`: Random seed for reproducibility.
-      - `train_batch_size`: Batch size for training.
-      - `gradient_checkpointing`: Whether to enable gradient checkpointing.
-      - `disable_tf32`: Whether to disable TF32.
-      - `mixed_precision`: Type of mixed precision.
-      - `optimizer`: Type of optimizer.
-      - `lr_warmup_steps`: Number of warmup steps for learning rate.
-      - `lr_scheduler`: Type of learning rate scheduler.
-      - `resolution_type`: Type of resolution.
-      - `resolution`: Image resolution.
-      - `validation_seed`: Random seed for validation.
-      - `validation_steps`: Number of validation steps.
-      - `validation_resolution`: Image resolution for validation.
-      - `validation_guidance`: Guidance coefficient for validation.
-      - `validation_guidance_rescale`: Guidance rescale for validation.
-      - `validation_num_inference_steps`: Number of inference steps for validation.
-      - `aspect_bucket_rounding`: Rounding precision for image aspect ratio bucketing.
-      - `minimum_image_size`: Minimum image size.
-      - `disable_benchmark`: Whether to disable benchmarking.
-      - `lora_type`: Type of LoRA (Low-Rank Adaptation).
-      - `model_type`: Type of the model.
-      - `model_family`: Family of the model.
-      - `write_batch_size`: Batch size for writing.
-      - `caption_dropout_probability`: Probability of caption dropout.
-    </details>
-
-
-3. Run the commands in the terminal to launch training.
-    ```
-    poetry run train-flux-lora
-    ```
-4. After training, run the commands in the terminal to inference your personalized videotuna models.
-    ```
-    poetry run inference-flux-lora \
-    --prompt "nezha is riding a bike" \
-    --lora_path ${lora_path} \
-    --out_path ${out_path}
-    ```
-    - ${out_path} should be a file path like `image.jpg`  
-
-    You can also inference multiple prompts by passing a txt file:
-    ```
-    poetry run inference-flux-lora \
-    --prompt data/prompts/nezha.txt \
-    --lora_path ${lora_path} \
-    --out_path ${out_path}
-    ```
-    - ${out_path} should be a directory.
-
-# Use your own dataset
-If you want to build your own dataset, please organize your data as `inputs/t2i/flux/plushie_teddybear`, which contains the training images and the corresponding text prompt files, as shown in the following directory structure.  
-Then modify the `instance_data_dir` in`configs/006_flux/multidatabackend.json`.
-```
-your_own_data/
-    ├── img1.jpg
-    ├── img2.jpg
-    ├── img3.jpg
-    ├── ...
-    ├── prompt1.txt      # prompt of img1.jpg
-    ├── prompt2.txt      # prompt of img2.jpg
-    ├── prompt3.txt      # prompt of img3.jpg
-    ├── ...
-```
diff --git a/docs/finetune_hunyuanvideo.md b/docs/finetune_hunyuanvideo.md
deleted file mode 100644
index b1f38091..00000000
--- a/docs/finetune_hunyuanvideo.md
+++ /dev/null
@@ -1,37 +0,0 @@
-# Introduction
-This document provides instructions for fine-tuning the HunyuanVideo model.
-
-# Preliminary steps
-1. Install the videotuna environment (see [here](https://github.com/VideoVerses/VideoTuna?tab=readme-ov-file#1prepare-environment))
-2. Download the checkpoints for HunyuanVideo (see [here](https://github.com/VideoVerses/VideoTuna/blob/main/docs/CHECKPOINTS.md))
-3. Install deepspeed:
-```shell
-poetry run install-deepspeed
-```
-
-# Steps for Fine-tuning
-### Lora Fine-tuning of HunyuanVideo Text-to-Video
-
-1. Run the commands in the terminal to launch training.
-    ```
-    bash shscripts/train_hunyuanvideo_t2v_lora.sh
-    ```
-    NOTE: this script uses deepspeed for training.
-
-2. After training, one additional checkpoints converting step is needed. The script is:
-    ```shell
-    tools/deepspeed_checkpoint_converter.py
-    ```
-
-3. Inference:
-    ```
-    bash shscripts/inference_hunyuanvideo_t2v_lora.sh
-    ```
-    - You need to provide the checkpoint path to the `ckpt` argument in the above shell script.  
-
-    Note: 
-    - The training and inference use the default model config from `configs/007_hunyuanvideo/hunyuanvideo_diffuser.yaml`
-
-
-
-
diff --git a/docs/finetune_videocrafter.md b/docs/finetune_videocrafter.md
deleted file mode 100644
index e86e3473..00000000
--- a/docs/finetune_videocrafter.md
+++ /dev/null
@@ -1,74 +0,0 @@
-
-# Introduction
-- This document provides instructions for fine-tuning the VideoCrafter2 model.
-- It supports both full fine-tuning and lora fine-tuning.
-
-
-
-# Preliminary steps
-  1) [Install the environment](#1prepare-environment)
-  2) [Prepare the dataset   ](#41-prepare-dataset) to get the example dataset
-```
-$ ll Dataset/ToyDataset/
-
-ToyDataset/
-    ├── toydataset.csv
-    ├── videos/
-        ├── video1.mp4
-        ├── video2.mp4
-        ...
-```
-  3) [Download the checkpoints](docs/CHECKPOINTS.md) and get the checkpoint
-```
-  $ ll checkpoints/videocrafter/t2v_v2_512/model.ckpt
-```
-Then, run this command to convert the VC2 checkpoint as we make minor modifications on the keys of the state dict of the checkpoint. The converted checkpoint will be automatically save at `checkpoints/videocrafter/t2v_v2_512/model_converted.ckpt`.
-```
-python tools/convert_checkpoint.py \
---input_path checkpoints/videocrafter/t2v_v2_512/model.ckpt
-```
-Then you will get the following checkpoints
-```
-  $ ll checkpoints/videocrafter/t2v_v2_512_split
-  cond_stage.ckpt
-  denoiser.ckpt
-  first_stage.ckpt
-  model_new.ckpt
-```
-
-# Steps of Simple Fine-tuning
-**1. Full Fine-tuning of VideoCrafter2 Text-to-Video:**
-
-**(1) Train:** Run this command to start training on the single GPU. 
-```
-bash shscripts/train_videocrafter_v2.sh
-```
-or
-```
-poetry run train-videocrafter-v2
-```
-
-The training results will be automatically saved at `results/train/${CURRENT_TIME}_${EXPNAME}`. The checkpoints will be save every 100 iteractions.
-
-**(2) Inference:** Replace denoiser.ckpt with the newly trained denoiser.ckpt saved in above directory (e.g., `results/train/${CURRENT_TIME}_${EXPNAME}/checkpoints/only_trained_model`) and perform inference via running:
-```
-bash shscripts/inference_vc2_t2v_320x512.sh
-```
-
-**2. Lora Fine-tuning of VideoCrafter2 Text-to-Video:**  
-**(1) Train:**
-
-```
-bash shscripts/train_videocrafter_lora.sh
-```
-or
-```
-poetry run train-videocrafter-v2
-```
-
-- The training and inference use the default model config from `configs/001_videocrafter2/vc2_t2v_lora.yaml`
-
-**(2) Inference:**
-```
-bash shscripts/inference_vc2_t2v_320x512_lora.sh
-```
diff --git a/docs/finetune_wan.md b/docs/finetune_wan.md
deleted file mode 100644
index 3b8b3e59..00000000
--- a/docs/finetune_wan.md
+++ /dev/null
@@ -1,107 +0,0 @@
-
-# Introduction
-- This document provides instructions for fine-tuning the WanVideo T2V and I2V model.
-- It supports both full fine-tuning and lora fine-tuning.
-
-| Model    | Avg. Epoch Time (s) | Avg. Peak Memory (GB) |
-| -------- | ------------------- | ---------------------- |
-| i2v Full | 57.76               | 48.41                  |
-| i2v LoRA | 45.19               | 38.45                  |
-| t2v Full | 51.04               | 43.59                  |
-| t2v LoRA | 40.89               | 37.79                  |
-
-# Preliminary steps
-  1) [Install the environment](#1prepare-environment)
-  2) To use deepspeed Zero3 training, please review the following preparation steps.
-```shell
-poetry run install-deepspeed
-```
-  3) Download the example training data.
-You can download manually from [this link](https://huggingface.co/datasets/Yingqing/VideoTuna-Datasets/resolve/main/apply_lipstick.zip), or download via `wget`:
-```
-wget https://huggingface.co/datasets/Yingqing/VideoTuna-Datasets/resolve/main/apply_lipstick.zip
-cd data
-unzip apply_lipstick.zip -d apply_lipstick
-```
-Make sure the data is putted at `data/apply_lipstick/metadata.csv`
-
-  4) [Download the checkpoints](docs/CHECKPOINTS.md) and get the checkpoint
-```
-  $ ll checkpoints/wan/Wan2.1-T2V-14B
-  $ ll checkpoints/wan/Wan2.1-I2V-14B-480P
-```
-
-# Steps of Simple Fine-tuning
-**1. Full Fine-tuning of WanVideo Text-to-Video:**
-
-**(1) Train:** Run this command to start training on the single GPU. 
-```
-bash shscripts/train_wanvideo_t2v_fullft.sh
-```
-or
-```
-poetry run train-wan2-1-t2v-fullft
-```
-
-The training results will be automatically saved at `results/train/train_wanvideo_t2v_fullft_${CURRENT_TIME}_${EXPNAME}`. The checkpoints will be save at `results/train/train_wanvideo_t2v_fullft_${CURRENT_TIME}_${EXPNAME}/checkpoints/only_trained_model/denoiser-$epoch-$step.ckpt` every 50 iteractions. Saving checkpoints is time consuming, you can increase every_n_train_steps in `configs/008_wanvideo/wan2_1_t2v_14B_fullft.yaml` callbacks section
-
-**(2) Inference:**  Remember replace trained_ckpt
-```
-bash shscripts/inference_wanvideo_t2v_fullft.sh
-```
-
-**2. Lora Fine-tuning of WanVideo Text-to-Video:**  
-**(1) Train:**
-
-```
-bash shscripts/train_wanvideo_t2v_lora.sh
-```
-The training results will be automatically saved at `results/train/train_wanvideo_t2v_lora_${CURRENT_TIME}_${EXPNAME}`. The checkpoints will be save at `results/train/train_wanvideo_t2v_lora_${CURRENT_TIME}_${EXPNAME}/checkpoints/only_trained_model/denoiser-$epoch-$step.ckpt` every 50 iteractions. Saving checkpoints is time consuming, you can increase every_n_train_steps in `configs/008_wanvideo/wan2_1_t2v_14B_lora.yaml` callbacks section
-
-or
-```
-poetry run train-wan2-1-t2v-lora
-```
-
-
-**(2) Inference:** Remember replace trained_ckpt
-```
-bash shscripts/inference_wanvideo_t2v_lora.sh
-```
-
-**3. Full Fine-tuning of WanVideo Image-to-Video:**
-
-**(1) Train:** Run this command to start training on the single GPU. 
-```
-bash shscripts/train_wanvideo_i2v_fullft.sh
-```
-or
-```
-poetry run train-wan2-1-i2v-fullft
-```
-
-The training results will be automatically saved at `results/train/train_wanvideo_i2v_fullft_${CURRENT_TIME}_${EXPNAME}`. The checkpoints will be save at `results/train/train_wanvideo_i2v_fullft_${CURRENT_TIME}_${EXPNAME}/checkpoints/only_trained_model/denoiser-$epoch-$step.ckpt` every 50 iteractions. Saving checkpoints is time consuming, you can increase every_n_train_steps in `configs/008_wanvideo/wan2_1_i2v_14B_480P_fullft.yaml` callbacks section
-
-**(2) Inference:**  Remember replace trained_ckpt
-```
-bash shscripts/inference_wanvideo_i2v_fullft.sh
-```
-
-**4. Lora Fine-tuning of WanVideo Image-to-Video:**  
-**(1) Train:**
-
-```
-bash shscripts/train_wanvideo_i2v_lora.sh
-```
-The training results will be automatically saved at `results/train/train_wanvideo_i2v_lora_${CURRENT_TIME}_${EXPNAME}`. The checkpoints will be save at `results/train/train_wanvideo_i2v_lora_${CURRENT_TIME}_${EXPNAME}/checkpoints/only_trained_model/denoiser-$epoch-$step.ckpt` every 50 iteractions. Saving checkpoints is time consuming, you can increase every_n_train_steps in `configs/008_wanvideo/wan2_1_i2v_14B_480P_lora.yaml` callbacks section
-
-or
-```
-poetry run train-wan2-1-i2v-lora
-```
-
-
-**(2) Inference:** Remember replace trained_ckpt
-```
-bash shscripts/inference_wanvideo_i2v_lora.sh
-```
diff --git a/docs/install-cpu.md b/docs/install-cpu.md
new file mode 100644
index 00000000..e68cb267
--- /dev/null
+++ b/docs/install-cpu.md
@@ -0,0 +1,111 @@
+# CPU-only development install
+
+PrivTune supports CPU-only installs for **unit tests**, **config validation**, and **tiny smoke inference**. CPU is not practical for domain LoRA training or 14B video generation — use NVIDIA CUDA or AMD ROCm for production training.
+
+## Prerequisites
+
+- Linux, macOS, or Windows
+- Python 3.11+
+- No GPU required
+
+## Install
+
+```bash
+poetry install -E cpu --with dev --with training
+poetry run install-cpu-torch
+```
+
+`install-cpu-torch` removes CUDA/ROCm wheels and installs CPU-only `torch==2.6.0` + `torchvision==0.21.0` from `https://download.pytorch.org/whl/cpu`.
+
+**Important:** The committed `poetry.lock` pins NVIDIA CUDA torch. Any later plain `poetry install` may restore `+cu126` wheels — re-run `poetry run install-cpu-torch` on CPU-only machines.
+
+Verify:
+
+```bash
+poetry run verify-cpu-torch
+poetry run python -c "from videotuna.utils.device_utils import describe_compute_environment; print(describe_compute_environment())"
+```
+
+## Environment variables
+
+| Variable | Purpose |
+|----------|---------|
+| `VIDEOTUNA_COMPUTE_BACKEND` | Set `cpu` to force CPU even when a GPU is visible |
+| `VIDEOTUNA_CPU_MODE` | `off` (default), `smoke` (tiny runs), `force` (debug init; deprecated alias: `VIDEOTUNA_ALLOW_CPU_INFERENCE=1`) |
+| `VIDEOTUNA_ATTN_BACKEND` | Use `eager` or `sdpa` on CPU (`flash` is not supported) |
+| `VIDEOTUNA_TORCH_COMPILE` | Keep `0` on CPU (compile is GPU-only) |
+
+## CPU inference vs GPU + CPU offload
+
+| | CPU-only inference | GPU inference + CPU offload |
+|--|-------------------|----------------------------|
+| **Purpose** | Dev, CI, smoke tests | Reduce VRAM on a GPU machine |
+| **Requires GPU** | No | Yes |
+| **Flags** | `--cpu-smoke`, `device: cpu` | `--enable_model_cpu_offload`, `--memory-preset low_vram` |
+| **Practical for 720p 14B** | No | Yes (slow) |
+
+CPU offload flags move weights between GPU and **host RAM**. They do not replace a GPU for large models.
+
+## Smoke tests
+
+```bash
+export VIDEOTUNA_ATTN_BACKEND=eager
+poetry run lint
+poetry run format-check
+poetry run test tests/test_import_smoke.py -q
+poetry run test tests/test_domain_finetune_configs.py -q
+poetry run test tests/test_flux_lora_train_smoke.py -q
+poetry run test tests/test_poetry_scripts.py -q
+```
+
+Optional CPU inference smoke (downloads Wan 2.2 weights on first run):
+
+```bash
+poetry run inference-wan2.2-t2v-720p \
+  --config configs/inference/presets/wan2_2_cpu_smoke.yaml \
+  --cpu-smoke --num_inference_steps 2
+```
+
+Preset: [`configs/inference/presets/wan2_2_cpu_smoke.yaml`](../configs/inference/presets/wan2_2_cpu_smoke.yaml). See [MODEL_VERSIONS.md](MODEL_VERSIONS.md).
+
+## Model tiers on CPU
+
+| Tier | Models | Status |
+|------|--------|--------|
+| **cpu_ok** | Import smoke, config parse, unit tests | Always |
+| **cpu_smoke** | Flux dev (≤512px), Wan 2.2 tiny preset | `--cpu-smoke` required |
+| **gpu_required** | Production 720p Wan 2.2, native Wan 2.1 720p | Clear error; use `wan2_2_cpu_smoke.yaml` or config validation only |
+
+## NVIDIA install (default)
+
+```bash
+poetry install -E cuda --with training
+poetry run install-deepspeed
+poetry run install-flash-attn   # optional
+```
+
+## AMD ROCm
+
+See [install-rocm.md](install-rocm.md).
+
+## Apple Silicon
+
+Linux CPU install is separate from Apple Silicon. For Mac arm64, use the Compose **`privtune`** image for containerized dev — see [Docker (optional)](../README.md#docker-optional) in the README. Native CPU training on Mac remains limited (same constraints as Linux CPU smoke).
+
+## Troubleshooting
+
+**`verify-cpu-torch` reports CUDA build**
+
+Re-run `poetry run install-cpu-torch` after any `poetry install`.
+
+**`flash` / `xformers` import errors on CPU**
+
+Expected — these are CUDA-only optional deps. Use `VIDEOTUNA_ATTN_BACKEND=eager`.
+
+**Offload flags rejected on CPU**
+
+`--enable_model_cpu_offload` and `--memory-preset low_vram` need a GPU. Remove them for CPU smoke runs.
+
+**Wan blocked at production resolution**
+
+Production 720p configs are `gpu_required` on CPU. Use `wan2_2_cpu_smoke.yaml` with `--cpu-smoke`, or stick to config/unit tests.
diff --git a/docs/install-rocm.md b/docs/install-rocm.md
new file mode 100644
index 00000000..9e670c09
--- /dev/null
+++ b/docs/install-rocm.md
@@ -0,0 +1,129 @@
+# AMD ROCm install
+
+PrivTune supports AMD GPUs on Linux x86_64 via PyTorch ROCm wheels. ROCm uses the same `torch.cuda` API as NVIDIA CUDA (HIP backend).
+
+## Prerequisites
+
+- Linux x86_64
+- AMD GPU (e.g. RX 7900 XTX, MI300 series)
+- ROCm driver **≥ 6.2** (PyTorch 2.6 wheels target **ROCm 6.2.4**)
+- Python 3.11+
+
+Verify the driver:
+
+```bash
+rocminfo | head
+```
+
+## Install
+
+```bash
+poetry install -E rocm
+poetry run install-rocm
+```
+
+`install-rocm` removes CUDA-only packages, uninstalls any existing torch/torchvision wheels, then installs matching **ROCm** builds of `torch==2.6.0` and `torchvision==0.21.0` from `https://download.pytorch.org/whl/rocm6.2.4`.
+
+**Important:** The committed `poetry.lock` pins NVIDIA CUDA torch. Any later `poetry install` may restore `+cu126` wheels — re-run `poetry run install-rocm` on AMD machines before inference.
+
+Verify:
+
+```bash
+poetry run python -c "import torch; print(torch.cuda.is_available(), torch.version.hip)"
+poetry run python -c "from videotuna.utils.device_utils import describe_compute_environment; print(describe_compute_environment())"
+```
+
+## Training vs inference scope
+
+| Workflow | ROCm support | Notes |
+|----------|--------------|-------|
+| Flux T2I LoRA training | Supported | Accelerate only; no DeepSpeed |
+| Wan 2.1 T2V/I2V LoRA training | **Not supported** | Requires DeepSpeed ZeRO-3; `poetry run install-deepspeed` aborts on ROCm |
+| Wan 2.2 / Flux inference & validation | Supported | Use `VIDEOTUNA_ATTN_BACKEND=sdpa` + CPU offload presets |
+
+Wan training requires NVIDIA CUDA:
+
+```bash
+poetry install -E cuda --with training
+poetry run install-deepspeed
+```
+
+For Flux training on ROCm, add `--with training` to the install command above (no `install-deepspeed`).
+
+## Environment variables
+
+| Variable | Purpose |
+|----------|---------|
+| `VIDEOTUNA_COMPUTE_BACKEND` | `auto`, `cuda`, `rocm`, or `cpu` |
+| `VIDEOTUNA_ATTN_BACKEND` | Use `sdpa` or `eager` on ROCm (`flash` is not supported) |
+| `HIP_VISIBLE_DEVICES` | GPU selection (like `CUDA_VISIBLE_DEVICES`) |
+
+## Smoke tests
+
+```bash
+export VIDEOTUNA_ATTN_BACKEND=sdpa
+poetry run benchmark-attn-backends --num-inference-steps 2
+poetry run inference-wan2.2-t2v-720p \
+  --config configs/inference/presets/wan2_2_cpu_smoke.yaml \
+  --num_inference_steps 2 --enable_model_cpu_offload
+```
+
+Per-model presets: [MODEL_VERSIONS.md](MODEL_VERSIONS.md).
+
+## Model tiers on ROCm
+
+| Tier | Models | Status |
+|------|--------|--------|
+| **Supported** | Flux T2I, Wan 2.2 Diffusers | Use `sdpa` + CPU offload |
+| **Not supported** | Wan 2.1 native train/infer | CUDA + DeepSpeed required; see [Training vs inference scope](#training-vs-inference-scope) |
+
+## NVIDIA install (default)
+
+```bash
+poetry install -E cuda
+poetry run install-flash-attn   # optional
+```
+
+Training (NVIDIA only): `poetry install -E cuda --with training` then `poetry run install-deepspeed` if needed.
+
+## CPU-only dev
+
+```bash
+poetry install -E cpu
+poetry run install-cpu-torch
+```
+
+See [install-cpu.md](install-cpu.md) for smoke tests and tier matrix.
+
+## Troubleshooting
+
+**`torchvision::nms` / import errors after `install-rocm`**
+
+torch and torchvision must come from the same ROCm index. If torchvision still shows `+cu126`, re-run:
+
+```bash
+poetry run install-rocm
+```
+
+**`torch.cuda.is_available()` is False**
+
+- Confirm ROCm driver: `rocminfo`
+- Re-run `poetry run install-rocm`
+- Check `HIP_VISIBLE_DEVICES` is not masking all GPUs
+
+**Out of memory**
+
+- Use `--enable_sequential_cpu_offload`, `--enable_vae_tiling`, `--dtype bf16`
+- Prefer Wan 2.2 Diffusers presets with `--memory-preset low_vram`
+
+**flash-attn / xformers errors**
+
+- ROCm does not use these packages. Set `export VIDEOTUNA_ATTN_BACKEND=sdpa`
+
+**`install-deepspeed` exits with "not supported on AMD ROCm"**
+
+Expected. Wan LoRA training needs NVIDIA CUDA + DeepSpeed ZeRO-3. Use ROCm for inference/validation and Flux T2I training only, or train Wan on a CUDA machine / cloud GPU (see [domain-adult-finetune.md](runbooks/domain-adult-finetune.md)).
+
+## Lockfile note
+
+The committed `poetry.lock` targets the `cuda` extra. ROCm users rely on `install-rocm` for PyTorch wheels. To regenerate a ROCm lock locally: `poetry lock` after editing extras (advanced).
diff --git a/docs/multi-gpu.md b/docs/multi-gpu.md
new file mode 100644
index 00000000..7faecb0d
--- /dev/null
+++ b/docs/multi-gpu.md
@@ -0,0 +1,67 @@
+# Multi-GPU inference on PrivTune
+
+PrivTune supports multi-GPU paths for Wan 2.2 Diffusers validation and optional native Wan distributed inference.
+
+## Single-process multi-GPU (Diffusers)
+
+For Wan 2.2 on a single host:
+
+```shell
+CUDA_VISIBLE_DEVICES=0,1 poetry run inference-wan2.2-t2v-720p --device-map auto
+```
+
+- Uses `accelerate` `infer_auto_device_map` to spread the transformer across GPUs.
+- Requires `poetry install -E cuda` (accelerate is a core dependency).
+
+## Distributed sequence parallel (xfuser)
+
+Native Wan flows support Ulysses + Ring attention via [xfuser](https://github.com/xdit-project/xDiT).
+
+### Requirements
+
+- NVIDIA CUDA only (blocked on ROCm).
+- `ulysses_degree × ring_degree == WORLD_SIZE` (number of processes).
+- No CPU offload when USP is enabled.
+- NCCL-compatible driver and peers on the same node.
+
+### Wan native
+
+```shell
+torchrun --nproc_per_node=4 scripts/inference_new.py \
+  --config configs/inference/presets/wan_domain_lora_smoke.yaml \
+  --ulysses_degree 2 --ring_degree 2
+```
+
+Wan re-enables `dist.init_process_group` when `WORLD_SIZE > 1`.
+
+## Environment variables
+
+| Variable | Purpose |
+|----------|---------|
+| `NCCL_DEBUG=INFO` | Debug collective hangs |
+| `CUDA_DEVICE_MAX_CONNECTIONS=1` | Sometimes stabilizes NCCL + flash attention |
+| `CUDA_VISIBLE_DEVICES` | Restrict visible GPUs before `--device cuda:0` remapping |
+
+## Failure modes
+
+| Symptom | Likely cause |
+|---------|----------------|
+| Hang at init | `ulysses × ring ≠ nproc` or missing `torchrun` |
+| OOM on rank > 0 | Model loaded on all ranks without broadcast (check flow logs) |
+| xfuser import error | Install CUDA extra: `poetry install -E cuda` |
+| xfuser on ROCm | Use single-GPU Wan 2.2 Diffusers with `VIDEOTUNA_ATTN_BACKEND=sdpa` |
+
+## Training multi-GPU
+
+- **Wan Lightning:** `--devices N` in `train-wan2-1-t2v-lora` / `scripts/train_new.py`
+- **DeepSpeed:** `poetry run install-deepspeed` for ZeRO stage configs in domain YAMLs
+
+## Device selection with `CUDA_VISIBLE_DEVICES`
+
+When GPUs are remapped, always use logical indices after remapping:
+
+```shell
+CUDA_VISIBLE_DEVICES=1 poetry run inference-wan2.2-t2v-720p --device cuda:0
+```
+
+`--device cuda:1` selects the second *visible* GPU.
diff --git a/docs/runbooks/cloud-gpu-training.md b/docs/runbooks/cloud-gpu-training.md
new file mode 100644
index 00000000..c3b562be
--- /dev/null
+++ b/docs/runbooks/cloud-gpu-training.md
@@ -0,0 +1,199 @@
+# Cloud GPU training runbook (Vast.ai / linux-desktop template)
+
+Headless training on rented NVIDIA GPUs using the PrivTune `cloud/vast/` provisioning bundle. Primary workflow: **Flux LoRA T2I → Wan 2.1 T2V LoRA** (see [domain-adult-finetune.md](domain-adult-finetune.md)).
+
+Never commit datasets, weights, API keys, or `results/` to git.
+
+## A. Instance selection
+
+| Workload | Peak VRAM | GPU examples | Notes |
+|----------|-----------|--------------|-------|
+| Flux LoRA @ 512px | ~24–40 GB | RTX 4090 24GB, A100 40GB | No DeepSpeed required |
+| Wan 2.1 T2V LoRA @ 480×832×81 | ~38 GB | A100 40GB, H100 | Requires DeepSpeed ZeRO-3 offload |
+
+**CUDA:** Template should ship NVIDIA driver + CUDA 12.x compatible with PrivTune's cu126 PyTorch wheels (`poetry install -E cuda --with training`). For faster Wan video dataloading, add `-E video-fast` (optional `torchcodec` extra; PyAV remains the fallback).
+
+**Disk:** Base weights are large (Wan 14B ≈ tens of GB; full train + validate bundle ≈ 200 GB+). Use **≥200 GB** volume. Pre-download via manifest when `HF_TOKEN` is set:
+
+| Model | When downloaded | Purpose |
+|-------|-----------------|---------|
+| `black-forest-labs/FLUX.1-dev` | Manifest + bootstrap | Flux T2I train |
+| `Wan-AI/Wan2.1-T2V-14B` | Manifest + bootstrap | Wan T2V train |
+| `Wan-AI/Wan2.1-I2V-14B-480P` | Manifest + bootstrap | Wan I2V train (optional) |
+| `Wan-AI/Wan2.2-T2V-A14B-Diffusers` | Bootstrap (`hf-download-cache`) | `validate-domain-t2v` |
+| `Wan-AI/Wan2.2-I2V-A14B-Diffusers` | Bootstrap (`hf-download-cache`) | `validate-domain-i2v` |
+
+Wan 2.2 Diffusers weights populate the Hugging Face hub cache so validation presets (hub IDs) resolve without config changes.
+
+## B. Launch checklist
+
+1. Rent a **linux-desktop** (Selkies/VNC + SSH + Jupyter) template with a compatible CUDA tag.
+2. Set template environment variables at rent time:
+
+| Variable | Value |
+|----------|-------|
+| `WORKSPACE` | `/workspace` |
+| `PROVISIONING_MANIFEST` | `https://raw.githubusercontent.com/miguelenes/VideoTuna/main/cloud/vast/provisioning.yaml` |
+| `HF_TOKEN` | `hf_...` (required for FLUX.1-dev; accept license on Hugging Face first) |
+| `VIDEOTUNA_ATTN_BACKEND` | `auto` or `sdpa` if flash-attn not installed |
+
+### Fast model downloads (opt-in)
+
+Multi-GB first-boot pulls (FLUX.1-dev, Wan 2.1 train weights, Wan 2.2 validate weights) can dominate rental cost on datacenter GPUs with fast network and NVMe storage. PrivTune exposes an **opt-in** cloud knob:
+
+| Variable | Value |
+|----------|-------|
+| `VIDEOTUNA_FAST_HF_DOWNLOAD` | `1` |
+
+When set at **rent time**, bootstrap exports `HF_XET_HIGH_PERFORMANCE=1` (modern `hf-xet` high-bandwidth mode; **not** deprecated `hf_transfer`) and persists it into `.env` for training-time hub pulls.
+
+**Important:** `conditional_downloads` in the manifest run **before** `bootstrap.sh`. Set `VIDEOTUNA_FAST_HF_DOWNLOAD=1` when launching the instance so both manifest-phase and bootstrap-phase pulls benefit.
+
+**When to use:** datacenter GPU + NVMe, multi-GB weight pre-downloads.
+
+**Caveats:** higher CPU/RAM use; best on SSD/NVMe. On spinning disks, consider `HF_XET_RECONSTRUCT_WRITE_SEQUENTIALLY=1`. Leave unset for local dev (default adaptive `hf-xet` only).
+
+See [HF Xet env vars](https://huggingface.co/docs/huggingface_hub/en/package_reference/environment_variables#hfxethighperformance).
+
+Fallback imperative provisioner:
+
+```text
+PROVISIONING_SCRIPT=https://raw.githubusercontent.com/miguelenes/VideoTuna/main/cloud/vast/bootstrap.sh
+```
+
+### Provisioning retries
+
+Two independent retry layers apply during first boot:
+
+**Manifest-level (Vast provisioner)** — configured in [`cloud/vast/provisioning.yaml`](../../cloud/vast/provisioning.yaml):
+
+| Key | Behavior |
+|-----|----------|
+| `settings.retry` | Exponential backoff on manifest-phase downloads (`conditional_downloads`, wget, apt). Default: 5 attempts, delays 2s → 4s → 8s → 16s. |
+| `post_commands` | **No per-command retry.** `bootstrap.sh` is fail-fast; a non-zero exit aborts the phase. |
+| `on_failure` | Whole-pipeline retry: sleep **60s**, re-run from the failed phase (idempotent hash skip), up to **3** times; then `action: continue` (log + exit 1, instance stays up). |
+
+Override manifest failure behavior at rent time: `PROVISIONER_RETRY_MAX`, `PROVISIONER_RETRY_DELAY`, `PROVISIONER_FAILURE_ACTION`.
+
+**Bootstrap-level (tenacity)** — [`cloud/vast/provision_retry.py`](../../cloud/vast/provision_retry.py) mirrors `settings.retry` for network-heavy steps inside `bootstrap.sh`: Poetry install, DeepSpeed install, and bootstrap-phase `hf download`. This catches transient flakes without waiting 60s for a full manifest retry.
+
+Manual recovery (idempotent): `bash /workspace/VideoTuna/cloud/vast/bootstrap.sh`
+
+3. **SSH in** (preferred) or use the Jupyter terminal on port 8080.
+4. Wait for provisioning to finish:
+   - Template marker: `/.provisioning_complete`
+   - PrivTune marker: `/workspace/.videotuna_provisioned`
+   - Logs: `/var/log/portal/provisioning.log` (if present)
+5. Confirm smoke tests passed during bootstrap (`poetry run test tests/test_import_smoke.py`).
+6. Sync datasets via Syncthing (port **8384**) — see [C. Data sync](#c-data-sync).
+7. Start training:
+
+```bash
+cd /workspace/VideoTuna
+source .env
+export TRAIN_PROFILE=flux-lora
+./cloud/vast/run-smoke-train.sh    # short GPU validation
+./cloud/vast/run-train.sh          # full run
+```
+
+**Supervisor (survives SSH disconnect):**
+
+```bash
+cp cloud/vast/supervisor/videotuna-train.conf /etc/supervisor/conf.d/
+supervisorctl reread && supervisorctl update
+# Set TRAIN_PROFILE in .env first
+supervisorctl start videotuna-train
+supervisorctl status
+```
+
+## C. Data sync
+
+Provisioner creates Syncthing-friendly paths and symlinks into the repo:
+
+```
+/workspace/data/t2i/domain/     ↔  local data/t2i/domain/
+/workspace/data/t2v/domain/     ↔  local data/t2v/domain/
+/workspace/results/             ↔  local results/  (pull back after training)
+/workspace/checkpoints/         ↔  local checkpoints/  (optional, large)
+```
+
+Repo-relative configs (`data/t2i/domain`, `checkpoints/wan/...`) resolve via symlinks under `/workspace/VideoTuna/`.
+
+**Flux layout:** paired `0001.jpg` + `0001.txt` with trigger token `sks_style`.
+
+**Wan layout:** `metadata.csv` + `videos/*.mp4` at 480×832, 81 frames.
+
+## D. Monitoring
+
+```bash
+tail -f /workspace/results/train.log
+tail -f /workspace/results/train.err
+supervisorctl status videotuna-train
+nvidia-smi -l 5
+```
+
+**TensorBoard (training metrics):** event files live under each run directory (Flux: `{output_dir}/tensorboard/`; Wan: `{workdir}/tensorboard/`). View locally with SSH port-forward:
+
+```bash
+tensorboard --logdir /workspace/PrivTune/results/train --bind_all --port 6006
+# then open http://<instance-ip>:6006 from your laptop
+```
+
+**Trackio (Flux, optional):** when `VIDEOTUNA_METRICS_BACKEND=trackio` and the `trackio` extra is installed, metrics sync to a local SQLite database and optional private Hugging Face Space (`VIDEOTUNA_TRACKIO_SPACE_ID`). View with `trackio show` or open the Space URL — no port-forward required. See [domain-adult-finetune.md](domain-adult-finetune.md#optional-trackio-flux-phase-1).
+
+Smoke run logs: `/workspace/results/smoke-train.log`.
+
+## E. Checkpoint recovery
+
+| Phase | Checkpoint path |
+|-------|-----------------|
+| Flux LoRA | `results/train/flux-domain-adult/checkpoint-<step>/` |
+| Flux smoke | `results/train/flux-cloud-smoke/checkpoint-<step>/` |
+| Wan LoRA | `results/train/train_wan_domain_t2v_lora_<timestamp>/checkpoints/only_trained_model/denoiser-*.ckpt` |
+
+**Before terminating the instance:** Syncthing `results/` (and optionally `checkpoints/`) back to your machine.
+
+Resume Wan training:
+
+```bash
+export TRAIN_PROFILE=wan-t2v-lora
+export RESUME_CKPT=/workspace/results/train/.../checkpoints/...
+./cloud/vast/run-train.sh
+```
+
+## F. Troubleshooting
+
+| Issue | Fix |
+|-------|-----|
+| CUDA OOM (Flux) | Lower `--resolution` in JSON; keep `gradient_checkpointing: true` |
+| CUDA OOM (Wan) | Confirm `poetry run install-deepspeed` succeeded; reduce frames/resolution in YAML |
+| HF gated model | Set `HF_TOKEN`; `huggingface-cli login`; accept FLUX.1-dev license |
+| flash-attn build fail | `export VIDEOTUNA_ATTN_BACKEND=sdpa` in `.env`; do not run `install-flash-attn` |
+| DeepSpeed build fail | Check CUDA toolkit / nvcc; re-run `poetry run install-deepspeed` |
+| Wan grey preview | Use `unconditional_guidance_scale: 12.0` in training YAML `image_logger` |
+| Provisioning retry | Manifest `on_failure` retries the whole pipeline (60s delay); bootstrap uses `provision_retry.py` for finer-grained retries on poetry/HF steps. Re-run `bash /workspace/VideoTuna/cloud/vast/bootstrap.sh` manually. |
+| Slow HF weight download | Set `VIDEOTUNA_FAST_HF_DOWNLOAD=1` at rent time (see [Fast model downloads](#fast-model-downloads-opt-in)) |
+
+## G. Cost control
+
+1. Run `./cloud/vast/run-smoke-train.sh` before any multi-hour job.
+2. Stop the instance when finished — only persist `results/` and `checkpoints/` via Syncthing.
+3. Use smoke configs: `configs/domain/flux_t2i_cloud_smoke.json`, `configs/domain/wan_t2v_lora_cloud_smoke.yaml`.
+
+## Training profiles (`TRAIN_PROFILE`)
+
+| Profile | Default config |
+|---------|----------------|
+| `flux-lora` | `configs/domain/flux_t2i.json` |
+| `wan-t2v-lora` | `configs/domain/wan_t2v_lora.yaml` |
+| `flux-lora` (smoke) | `configs/domain/flux_t2i_cloud_smoke.json` |
+| `wan-t2v-lora` (smoke) | `configs/domain/wan_t2v_lora_cloud_smoke.yaml` |
+
+Override with `CONFIG_PATH` and `DATA_CONFIG_PATH` (Flux only) in `.env`.
+
+## Related docs
+
+- [domain-adult-finetune.md](domain-adult-finetune.md) — dataset layout, hyperparameters, inference smoke
+- [../checkpoints.md](../checkpoints.md) — weight download layout
+- [`cloud/vast/provisioning.yaml`](../../cloud/vast/provisioning.yaml) — manifest source
+- [`AGENTS.md`](../../AGENTS.md) — local dev verification gates
diff --git a/docs/runbooks/domain-adult-finetune.md b/docs/runbooks/domain-adult-finetune.md
new file mode 100644
index 00000000..076f4637
--- /dev/null
+++ b/docs/runbooks/domain-adult-finetune.md
@@ -0,0 +1,409 @@
+# Domain adult fine-tuning runbook (T2I + T2V)
+
+Two-phase pipeline for domain-specific adult content: **Phase 1** Flux LoRA (still images), **Phase 2** Wan 2.1 T2V LoRA (short video clips).
+
+All training data must be rights-cleared and consented. Never commit datasets, weights, or `outputs/` to git.
+
+## Prerequisites
+
+```bash
+cd /path/to/PrivTune
+huggingface-cli login   # FLUX.1-dev is gated on Hugging Face
+
+# NVIDIA (Phases 1–2.5)
+poetry install -E cuda --with training
+poetry run install-deepspeed   # required for Wan LoRA (ZeRO-3 offload)
+
+# Optional: faster video frame decode for Wan T2V/I2V dataloading (PyAV fallback when omitted)
+# poetry install -E cuda --with training -E video-fast
+
+# AMD ROCm (Phase 1 Flux training + Phase 3 inference only)
+# poetry install -E rocm --with training
+# poetry run install-rocm
+# export VIDEOTUNA_ATTN_BACKEND=sdpa
+# Do NOT run install-deepspeed — Wan training is CUDA-only (see install-rocm.md)
+```
+
+| Environment | Extra steps |
+|-------------|-------------|
+| AMD ROCm | `export VIDEOTUNA_ATTN_BACKEND=sdpa` — do not run `install-flash-attn` or `install-deepspeed`; Wan training requires CUDA |
+| CPU only | Config validation only — run training on a GPU machine (see [CPU stub](#cpu-stub-no-gpu)) |
+
+## VRAM and time expectations
+
+| Phase | Model | Peak VRAM | GPUs | Rough time | Limitation |
+|-------|-------|-----------|------|------------|------------|
+| 1 — T2I | Flux LoRA @ 512px | ~24–40 GB | 1 | 2000 steps ≈ hours on A100-class | Trains **FLUX.1-dev** |
+| 2 — T2V | Wan 2.1 T2V LoRA @ 480×832×81 | ~38 GB | 1 + DeepSpeed | ~41 s/epoch on H800 | Trains **Wan 2.1**; validates on **Wan 2.2** Diffusers 720p |
+| 2.5 — I2V (optional) | Wan 2.1 I2V LoRA @ 480×832×81 | ~40–44 GB | 1 + DeepSpeed | Similar to T2V | Reference image + clip pairs; validates on **Wan 2.2 I2V** Diffusers 720p |
+
+---
+
+## Training metrics
+
+PrivTune uses **TensorBoard** as the default training experiment tracker. Console logs (stdlib / loguru / tqdm) are for tailing only — not the canonical metrics store.
+
+| Phase | Canonical metrics path | Logged scalars | Other artifacts (not loggers) |
+|-------|------------------------|----------------|-------------------------------|
+| **1 — Flux T2I** | `{output_dir}/tensorboard/` | `train/loss`, `train/lr`, `validation/sample` | LoRA checkpoints, `{output_dir}/validation/step-*.png`, `training_config.json` |
+| **2 — Wan T2V** | `{workdir}/tensorboard/` | `train/loss`, LR monitor, `epoch_time_s`, `peak_vram_gb` | `metrics.json` (epoch summary export), `images/train/` previews, `loginfo/*.txt` |
+| **2.5 — Wan I2V** | Same as T2V | Same | Same |
+
+View locally:
+
+```bash
+# Flux (single run)
+tensorboard --logdir results/train/flux-domain-adult
+
+# Wan (all runs under logdir)
+tensorboard --logdir results/train
+```
+
+On cloud GPUs, see [cloud-gpu-training.md](cloud-gpu-training.md) for SSH port-forward to TensorBoard.
+
+### Optional Trackio (Flux Phase 1)
+
+[Trackio](https://huggingface.co/docs/trackio) is an opt-in Hugging Face experiment tracker that logs **alongside** TensorBoard when enabled. Wan phases remain TensorBoard-only.
+
+```bash
+# Install optional extra (in addition to training profile)
+poetry install -E cuda --with training -E trackio
+
+# Enable dual logging (TensorBoard + Trackio)
+export VIDEOTUNA_METRICS_BACKEND=trackio
+poetry run train-domain-t2i
+```
+
+Logged metrics match TensorBoard: `train/loss`, `train/lr`, and `validation/sample` (preview images).
+
+View the local Trackio dashboard:
+
+```bash
+trackio show
+```
+
+For remote monitoring on rented GPUs (no SSH port-forward), sync to a **private** Hugging Face Space:
+
+```bash
+export HF_TOKEN=...                              # required for Space sync
+export VIDEOTUNA_TRACKIO_SPACE_ID=username/privtune-trackio  # private Space
+export VIDEOTUNA_METRICS_BACKEND=trackio
+poetry run train-domain-t2i
+```
+
+Optional project name override: `VIDEOTUNA_TRACKIO_PROJECT` (default: `flux-domain-lora`).
+
+**Privacy:** domain validation images may contain sensitive content — use **private** Spaces only if syncing to Hugging Face.
+
+**Note:** Wan `ImageLogger` previews require a non-stub `log_images` implementation in `wanvideo.py`; until then, rely on smoke inference for visual QA.
+
+---
+
+## Phase 1 — Flux T2I LoRA
+
+### Dataset layout
+
+Place images and sidecar captions under `data/t2i/domain/` (gitignored):
+
+```
+data/t2i/domain/
+  0001.jpg
+  0001.txt          # e.g. "sks_style, portrait, studio lighting"
+  0002.jpg
+  0002.txt
+```
+
+- Use a **consistent trigger token** (default: `sks_style`) in every `.txt` file.
+- `caption_strategy: filename` pairs `0001.txt` with `0001.jpg`.
+- Minimum ~10–30 images for a smoke run; 50–200+ recommended for production.
+
+### Config files
+
+| File | Purpose |
+|------|---------|
+| `configs/domain/flux_t2i.json` | Training hyperparameters |
+| `configs/domain/flux_t2i_data.json` | Dataset backend (`data/t2i/domain`) |
+
+In-training preview images are controlled by `validation_steps`, `validation_prompt`, and `validation_resolution` in `flux_t2i.json`. Every `validation_steps` training steps, the trainer writes a PNG under `{output_dir}/validation/` and logs it to TensorBoard as `validation/sample`. This is separate from post-training smoke inference via `inference-domain-t2i`.
+
+### Download base weights
+
+Weights auto-download on first train. For offline use:
+
+```bash
+mkdir -p checkpoints/flux
+hf download black-forest-labs/FLUX.1-dev --local-dir checkpoints/flux/FLUX.1-dev
+```
+
+Then set `"--pretrained_model_name_or_path": "checkpoints/flux/FLUX.1-dev"` in `flux_t2i.json`.
+
+### Train
+
+```bash
+poetry run train-domain-t2i
+```
+
+Legacy alias: `poetry run train-flux-lora` (same defaults).
+
+Checkpoints: `results/train/flux-domain-adult/checkpoint-<step>/` (Diffusers LoRA format).
+
+For a quick smoke on GPU, temporarily set `"--max_train_steps": 50` in the JSON.
+
+### Resume training
+
+Set `"--resume_from_checkpoint": "latest"` (or a relative path like `"checkpoint-500"`) in `flux_t2i.json`.
+
+| Behavior | Detail |
+|----------|--------|
+| First run | `train-domain-t2i` stamps a timestamped `output_dir` (e.g. `flux-domain-adult_20260101120000`) and writes `checkpoint-*` dirs there. |
+| Resume run | When `resume_from_checkpoint` is set, output-dir stamping is **skipped** so `"latest"` resolves checkpoints under the existing stamped directory. |
+| Restored | LoRA safetensors from `checkpoint-{step}/`; training continues from that step; LR scheduler is advanced to match the step. |
+| Not restored | Optimizer momentum, Accelerate RNG, and full experiment metadata. |
+
+To start a fresh run, remove `resume_from_checkpoint` from the JSON or set it to `null`. If resume is requested but no matching checkpoint exists under `output_dir`, training fails with an error.
+
+### Inference smoke
+
+```bash
+poetry run inference-domain-t2i \
+  --lorackpt results/train/flux-domain-adult/checkpoint-2000 \
+  --prompt "sks_style, portrait, soft lighting"
+```
+
+---
+
+## Phase 2 — Wan 2.1 T2V LoRA
+
+### Dataset layout
+
+```
+data/t2v/domain/
+  metadata.csv
+  videos/
+    clip001.mp4
+    clip002.mp4
+```
+
+`metadata.csv`:
+
+```csv
+path,caption
+data/t2v/domain/videos/clip001.mp4,"sks_style, slow pan, cinematic lighting"
+data/t2v/domain/videos/clip002.mp4,"sks_style, close-up, warm lighting"
+```
+
+Clips should be **480×832**, **81 frames**. Re-encode if needed:
+
+```bash
+ffmpeg -i in.mp4 -vf scale=832:480 -r 16 -frames:v 81 data/t2v/domain/videos/clip001.mp4
+```
+
+### Config file
+
+`configs/domain/wan_t2v_lora.yaml` — domain CSV path, 25-step checkpoint interval, 50 max epochs (raise for production).
+
+### Download base weights
+
+```bash
+mkdir -p checkpoints/wan
+hf download Wan-AI/Wan2.1-T2V-14B --local-dir checkpoints/wan/Wan2.1-T2V-14B
+```
+
+### Train
+
+```bash
+poetry run train-domain-t2v
+```
+
+Legacy alias: `poetry run train-wan2-1-t2v-lora` (same defaults).
+
+Checkpoint example:
+
+`results/train/train_wan_domain_t2v_lora_<timestamp>/checkpoints/only_trained_model/denoiser-000-000000025.ckpt`
+
+### Validation (Wan 2.2 Diffusers — primary)
+
+After training, validate the native Lightning LoRA on Wan 2.2 Diffusers 720p:
+
+```bash
+export VIDEOTUNA_ATTN_BACKEND=auto   # NVIDIA; use sdpa on ROCm
+poetry run validate-domain-t2v \
+  --trained_ckpt results/train/train_wan_domain_t2v_lora_<ts>/checkpoints/only_trained_model/denoiser-000-000000025.ckpt \
+  --prompt_file inputs/t2v/domain_prompt.txt \
+  --num_inference_steps 4
+```
+
+| VRAM | Preset override |
+|------|-----------------|
+| ~24 GB | default (`wan_domain_lora_smoke_22.yaml`) |
+| 12–16 GB | `--config configs/inference/presets/wan_domain_lora_smoke_22_low_vram.yaml` |
+
+Output: `results/t2v/wan-domain-lora-smoke-22/*.mp4` at **720×1280**, **81 frames**, **16 fps**.
+
+For full-quality QA (20–50 steps), use [`balanced_wan2_2_720p.yaml`](../../configs/inference/presets/balanced_wan2_2_720p.yaml) — see [wan2.2-inference-profile.md](wan2.2-inference-profile.md).
+
+**Optional:** export Diffusers safetensors for reuse without runtime PEFT injection:
+
+```bash
+poetry run python tools/convert_wan_lora_21_to_22.py \
+  --input results/train/.../denoiser-000-000000025.ckpt \
+  --output-dir results/lora/wan22-export/
+```
+
+**Bridge debug / spike:** `poetry run python tools/spike_wan_lora_bridge.py --synthetic /tmp/synthetic.ckpt`
+
+<details>
+<summary>Optional fast path — Wan 2.1 native smoke (480p)</summary>
+
+Use only when debugging training checkpoints on the same base model (no 2.1→2.2 bridge):
+
+```bash
+poetry run python scripts/inference_new.py \
+  --config configs/inference/presets/wan_domain_lora_smoke.yaml \
+  --trained_ckpt results/train/train_wan_domain_t2v_lora_<ts>/checkpoints/only_trained_model/denoiser-000-000000025.ckpt \
+  --prompt "sks_style, slow camera push-in, soft lighting" \
+  --enable_model_cpu_offload
+```
+
+</details>
+
+---
+
+## Phase 2.5 — Wan 2.1 I2V LoRA (optional)
+
+Optional when you need **reference-frame control** (e.g. Flux still → motion). **LoRA-only** — full Wan fine-tune is out of platform scope.
+
+### Dataset layout (primary: image + video pairs)
+
+```
+data/i2v/domain/
+  metadata.csv
+  images/
+    ref001.jpg
+  videos/
+    clip001.mp4
+```
+
+`metadata.csv`:
+
+```csv
+image_path,video_path,caption
+data/i2v/domain/images/ref001.jpg,data/i2v/domain/videos/clip001.mp4,"sks_style, slow pan, cinematic lighting"
+```
+
+**Secondary (first-frame conditioning):** use `path,caption` only and set `image_to_video: true` in `configs/domain/wan_i2v_lora.yaml`.
+
+Normalize clips (same as T2V):
+
+```bash
+ffmpeg -i in.mp4 -vf scale=832:480 -r 16 -frames:v 81 data/i2v/domain/videos/clip001.mp4
+```
+
+Extract a reference frame from a clip:
+
+```bash
+ffmpeg -i data/i2v/domain/videos/clip001.mp4 -vf scale=832:480 -frames:v 1 \
+  data/i2v/domain/images/ref001.jpg
+```
+
+From a Flux still:
+
+```bash
+ffmpeg -i flux_output.png -vf scale=832:480 data/i2v/domain/images/ref001.jpg
+```
+
+### Config
+
+`configs/domain/wan_i2v_lora.yaml`
+
+### Download base weights
+
+```bash
+mkdir -p checkpoints/wan
+hf download Wan-AI/Wan2.1-I2V-14B-480P --local-dir checkpoints/wan/Wan2.1-I2V-14B-480P
+```
+
+### Train
+
+```bash
+poetry run train-domain-i2v
+```
+
+Legacy alias: `poetry run train-wan2-1-i2v-lora`
+
+### Validation
+
+**Wan 2.2 I2V Diffusers (primary):**
+
+```bash
+poetry run validate-domain-i2v \
+  --trained_ckpt results/train/train_wan_domain_i2v_lora_<ts>/checkpoints/only_trained_model/denoiser-000-000000025.ckpt \
+  --prompt_dir inputs/i2v/domain_smoke
+```
+
+`inputs/i2v/domain_smoke/` must contain paired `.txt` prompts and images (`.jpg`/`.png`), one line per prompt.
+
+**Wan 2.1 native smoke (debug):**
+
+```bash
+poetry run python scripts/inference_new.py \
+  --config configs/inference/presets/wan_domain_i2v_smoke.yaml \
+  --trained_ckpt results/train/train_wan_domain_i2v_lora_<ts>/checkpoints/only_trained_model/denoiser-000-000000025.ckpt \
+  --prompt_dir inputs/i2v/domain_smoke \
+  --enable_model_cpu_offload
+```
+
+---
+
+## CPU stub (no GPU)
+
+When no CUDA/ROCm GPU is available locally:
+
+1. Do **not** run training.
+2. Validate configs:
+
+```bash
+poetry run test tests/test_domain_finetune_configs.py -q
+poetry run test tests/test_flux_lora_train_smoke.py -q
+poetry run test tests/test_wan_lora_bridge.py -q
+poetry run test tests/test_wan_i2v_lora_bridge.py -q
+poetry run test tests/test_wan_domain_lora_smoke_22_config.py -q
+poetry run test tests/test_wan_domain_i2v_smoke_22_config.py -q
+poetry run test tests/test_wan_i2v_dataset.py -q
+poetry run test tests/test_wan_training_step.py -q
+poetry run test tests/test_import_smoke.py -q
+poetry run test tests/test_poetry_scripts.py -q
+```
+
+3. Run the full train/infer commands above on a GPU machine with the same repo checkout and dataset paths.
+
+---
+
+## Troubleshooting
+
+| Issue | Fix |
+|-------|-----|
+| CUDA OOM (Flux) | Lower `--resolution` to 384 in JSON; keep `gradient_checkpointing: true` |
+| CUDA OOM (Wan) | Confirm DeepSpeed installed; reduce `num_frames` or resolution in YAML |
+| ROCm flash-attn error | `export VIDEOTUNA_ATTN_BACKEND=sdpa` |
+| HF gated model | `huggingface-cli login` and accept FLUX.1-dev license |
+| Wan grey output at inference | Use `unconditional_guidance_scale: 12.0` during training preview (set in YAML `image_logger`) |
+| Wan 2.2 validation OOM | Use `wan_domain_lora_smoke_22_low_vram.yaml` or `--enable_sequential_cpu_offload` |
+| Bridge load warnings | Run `tools/spike_wan_lora_bridge.py --input <ckpt>` for key inventory |
+
+## Known limitations
+
+- **FLUX.1 only:** Training uses FLUX.1-dev; see [`docs/MODEL_VERSIONS.md`](../MODEL_VERSIONS.md).
+- **Flux resume:** LoRA weights and step counter only — optimizer state is not saved or restored.
+- **Wan 2.1 → 2.2 bridge (`validate-domain-t2v`):** Production-ready for domain LoRA QA via `poetry run validate-domain-t2v`. The bridge remaps native Wan 2.1 Lightning keys onto Wan 2.2 Diffusers and loads the **same** LoRA onto both high-noise (`transformer`) and low-noise (`transformer_2`) experts. Limitations:
+  - Default smoke preset uses **4 inference steps** at 720×1280 — use [`balanced_wan2_2_720p.yaml`](../../configs/inference/presets/balanced_wan2_2_720p.yaml) for higher-quality QA.
+  - Remap ratio below 90% logs a warning (does not block load) — run visual QA and `tools/spike_wan_lora_bridge.py --input <ckpt>` if keys look wrong.
+  - Training runs on Wan 2.1 block layout; validate visually after bridge — block-count mismatch may leave some pipeline LoRA slots at init.
+  - Optional offline export: `tools/convert_wan_lora_21_to_22.py` writes `high_noise.safetensors` / `low_noise.safetensors`.
+
+## Related docs
+
+- [`docs/runbooks/cloud-gpu-training.md`](cloud-gpu-training.md) — Vast.ai / rented GPU provisioning and Syncthing workflow
+- [`docs/runbooks/wan2.2-inference-profile.md`](wan2.2-inference-profile.md) — Wan 2.2 VRAM tiers and benchmarks
+- [`docs/checkpoints.md`](../checkpoints.md)
+- [`docs/datasets.md`](../datasets.md)
diff --git a/docs/runbooks/wan2.2-inference-profile.md b/docs/runbooks/wan2.2-inference-profile.md
new file mode 100644
index 00000000..19e0fc58
--- /dev/null
+++ b/docs/runbooks/wan2.2-inference-profile.md
@@ -0,0 +1,264 @@
+# Wan 2.2-T2V-720p inference profile
+
+Optimized inference presets for **Wan-AI/Wan2.2-T2V-A14B-Diffusers** (Diffusers path via `inference-wan2.2-t2v-720p`). Device and attention routing go through `videotuna/utils/device_utils.py` and `videotuna/utils/attention.py`.
+
+## Hardware tiers
+
+| Environment | Typical hardware | Wan 2.2 720p feasible? |
+|-------------|------------------|------------------------|
+| Home dev | RX 550 / CPU only | **No** for production 720p. RX 550 is not a supported ROCm target and has far too little VRAM (~2–4 GB). Use CPU smoke for pipeline validation only. |
+| Rental | 24 GB (RTX 4090, A10) | Yes — `balanced` or `low_vram` |
+| Rental | 40–48 GB (A6000, L40S) | Yes — `max_speed` |
+| Rental | 2× A100 | Yes — `max_speed` + `--device-map auto` (Diffusers) or native xfuser USP |
+
+## Preset YAMLs
+
+| Preset file | Tier | Est. peak VRAM |
+|-------------|------|----------------|
+| [`configs/inference/presets/low_vram_wan2_2_720p.yaml`](../../configs/inference/presets/low_vram_wan2_2_720p.yaml) | Minimum | 12–16 GB |
+| [`configs/inference/presets/low_vram_wan2_2_720p_int8.yaml`](../../configs/inference/presets/low_vram_wan2_2_720p_int8.yaml) | Minimum + int8 quant | 10–14 GB (CUDA) |
+| [`configs/inference/presets/low_vram_wan2_2_720p_fp8.yaml`](../../configs/inference/presets/low_vram_wan2_2_720p_fp8.yaml) | Minimum + fp8 quant (Ada/Hopper) | 10–14 GB (CUDA sm ≥ 8.9) |
+| [`configs/inference/presets/balanced_wan2_2_720p.yaml`](../../configs/inference/presets/balanced_wan2_2_720p.yaml) | Recommended | ~24 GB |
+| [`configs/inference/presets/max_speed_wan2_2_720p.yaml`](../../configs/inference/presets/max_speed_wan2_2_720p.yaml) | Max speed | 40–48 GB |
+| [`configs/inference/presets/wan2_2_cpu_smoke.yaml`](../../configs/inference/presets/wan2_2_cpu_smoke.yaml) | Home dev only | RAM (not practical) |
+
+## Quantization by hardware tier
+
+Requires **torchao ≥ 0.15.0** (default Poetry dependency) and NVIDIA CUDA. See [Diffusers torchao quantization](https://huggingface.co/docs/diffusers/main/en/quantization/torchao).
+
+| Tier / GPU examples | `int8_wo` | `fp8_wo` | Preset / CLI |
+|---------------------|-----------|----------|--------------|
+| Home CPU / ROCm | Not supported | Not supported | `transformer_quant: none` |
+| 12–16 GB (A10, RTX 3090 — sm 8.6) | **Recommended** | Not supported (sm < 8.9) | [`low_vram_wan2_2_720p_int8.yaml`](../../configs/inference/presets/low_vram_wan2_2_720p_int8.yaml) or `--transformer-quant int8_wo` |
+| 24 GB RTX 4090 (sm 8.9) | Supported | **Preferred** (speed + VRAM) | [`low_vram_wan2_2_720p_fp8.yaml`](../../configs/inference/presets/low_vram_wan2_2_720p_fp8.yaml) or `--transformer-quant fp8_wo` |
+| 40–48 GB A6000 (sm 8.6) | Supported | Not supported | int8 preset or no quant + `max_speed` |
+| 40–48 GB L40S / Hopper (sm ≥ 8.9) | Supported | **Preferred** | fp8 preset or CLI |
+| 2× A100 (sm 8.0) | Supported if VRAM-tight | Not supported | int8 or offload without quant |
+
+Measure peak VRAM on rental hardware with `tools/spike_wan_quant_compare.py` (records `torchao` version and per-scheme metrics).
+
+## Three-tier command matrix (rental GPU)
+
+### Minimum VRAM (~12–16 GB)
+
+```bash
+export VIDEOTUNA_ATTN_BACKEND=auto   # flash→sdpa on NVIDIA; sdpa on ROCm
+poetry run inference-wan2.2-t2v-720p \
+  --config configs/inference/presets/low_vram_wan2_2_720p.yaml \
+  --min-vram-gb 10
+```
+
+Settings: sequential CPU offload, fp16, VAE tiling.
+
+Optional **transformer weight-only quantization** (CUDA only, torchao):
+
+```bash
+poetry run inference-wan2.2-t2v-720p \
+  --config configs/inference/presets/low_vram_wan2_2_720p_int8.yaml \
+  --min-vram-gb 10
+```
+
+Ada/Hopper (sm ≥ 8.9) — fp8 weight-only:
+
+```bash
+poetry run inference-wan2.2-t2v-720p \
+  --config configs/inference/presets/low_vram_wan2_2_720p_fp8.yaml \
+  --min-vram-gb 10
+```
+
+Or add to any preset / CLI:
+
+```bash
+--transformer-quant int8_wo --quant-backend torchao
+```
+
+| Scheme | VRAM impact | GPU requirement | LoRA |
+|--------|-------------|-----------------|------|
+| `int8_wo` (default quant) | Lower transformer weight memory | NVIDIA CUDA | Attempted; use `none` if PEFT bridge fails |
+| `int4_wo` | Further weight savings | NVIDIA CUDA | Same as int8 |
+| `fp8_wo` | Best speed/memory on Ada+ | sm ≥ 8.9 (RTX 4090, Hopper) | Same as int8 |
+
+**FP8 on Wan 2.2 Diffusers:** use `--transformer-quant fp8_wo` (torchao weight-only; Ada/Hopper sm ≥ 8.9). Legacy native checkpoint FP8 is not supported in PrivTune.
+
+**optimum-quanto:** evaluated via `tools/spike_wan_quant_compare.py` on rental GPU (`--include-quanto`); not added as a default dependency. Use `--quant-backend quanto` only after installing `optimum-quanto>=0.2.6` manually if torchao is insufficient.
+
+When `transformer_quant` is enabled, sequential CPU offload is upgraded to **model CPU offload** automatically for Diffusers quant compatibility.
+
+### Recommended (~24 GB)
+
+```bash
+export VIDEOTUNA_ATTN_BACKEND=auto
+poetry run inference-wan2.2-t2v-720p \
+  --config configs/inference/presets/balanced_wan2_2_720p.yaml \
+  --min-vram-gb 20
+```
+
+Settings: model CPU offload, bf16, VAE tiling.
+
+### Max speed (~40–48 GB+)
+
+```bash
+poetry run install-flash-attn   # NVIDIA only, optional
+export VIDEOTUNA_ATTN_BACKEND=flash
+export VIDEOTUNA_TORCH_COMPILE=0
+poetry run inference-wan2.2-t2v-720p \
+  --config configs/inference/presets/max_speed_wan2_2_720p.yaml \
+  --min-vram-gb 38
+# Optional after a warm-up run (discard first compile iteration when timing):
+# poetry run inference-wan2.2-t2v-720p ... --compile
+```
+
+Settings: full GPU, bf16, no offload. `--compile` sets `VIDEOTUNA_TORCH_COMPILE=1` and compiles the transformer when offload is disabled.
+
+### Home — CPU smoke (not production)
+
+```bash
+poetry install -E cpu --with dev
+poetry run install-cpu-torch
+export VIDEOTUNA_ATTN_BACKEND=eager
+poetry run inference-wan2.2-t2v-720p \
+  --config configs/inference/presets/wan2_2_cpu_smoke.yaml \
+  --cpu-smoke
+```
+
+Also validate configs without weights:
+
+```bash
+poetry run test tests/test_wan_inference_presets.py -q
+poetry run test tests/test_import_smoke.py -q
+```
+
+## VRAM / speed / quality tradeoffs
+
+| Tier | Est. peak VRAM | Speed | Quality tradeoffs |
+|------|----------------|-------|-------------------|
+| `low_vram` | 12–16 GB | Slowest (sequential PCIe offload) | fp16 vs bf16 — minor; full 720p / 81 frames |
+| `balanced` | ~24 GB | Moderate | bf16; model offload latency between steps |
+| `max_speed` | 40–48 GB | Fastest single-GPU | Full bf16 on GPU; optional compile after warm-up |
+| 2× GPU `device-map auto` | ~22 GB/GPU | Moderate–fast | Same quality as max_speed |
+| CPU smoke | RAM only | Impractical | 256p / 2 frames — pipeline validation only |
+
+Quantitative throughput: check `metrics.json` beside outputs after a rental run. Use the benchmark script (below) for frames/sec at 480p.
+
+## Attention backend
+
+| Backend | NVIDIA | ROCm | CPU |
+|---------|--------|------|-----|
+| `auto` | flash → sdpa → eager | sdpa → eager | eager |
+| `flash` | Yes (after `install-flash-attn`) | **Not supported** — use `sdpa` | No |
+| `sdpa` | Yes | **Recommended** | Yes |
+| `eager` | Yes | Yes | **Required** for `--cpu-smoke` |
+
+```bash
+export VIDEOTUNA_ATTN_BACKEND=sdpa   # ROCm rental
+export VIDEOTUNA_ATTN_BACKEND=flash    # NVIDIA max_speed
+```
+
+## Benchmark methodology
+
+The benchmark script runs a **warm-up** at `num_inference_steps=1` before resetting peak VRAM and starting the timer. The first `torch.compile` iteration is therefore excluded from timed results.
+
+### NVIDIA rental
+
+```bash
+poetry run install-flash-attn   # optional
+export VIDEOTUNA_ATTN_BACKEND=auto
+poetry run benchmark-attn-backends \
+  --pipeline wan \
+  --model-path Wan-AI/Wan2.2-T2V-A14B-Diffusers \
+  --resolutions 480 \
+  --num-inference-steps 4 \
+  --json-out results/bench_wan22_attn.json
+```
+
+### ROCm rental
+
+```bash
+export VIDEOTUNA_ATTN_BACKEND=sdpa
+poetry run benchmark-attn-backends \
+  --pipeline wan \
+  --backends eager sdpa \
+  --resolutions 480
+```
+
+### 24 GB realistic offload benchmark
+
+```bash
+poetry run benchmark-attn-backends \
+  --pipeline wan \
+  --resolutions 480 \
+  --enable-offload
+```
+
+**Interpretation:** on NVIDIA, expect `flash` ≈ `sdpa` > `eager`. On ROCm, use `sdpa`. When using `--compile` in production, run twice and discard the first timed iteration.
+
+## Domain LoRA validation (Wan 2.1 → 2.2 bridge)
+
+After Phase 2 training, validate the native Lightning LoRA on Wan 2.2 Diffusers:
+
+```bash
+poetry run validate-domain-t2v \
+  --trained_ckpt results/train/train_wan_domain_t2v_lora_<ts>/checkpoints/only_trained_model/denoiser-000-000000025.ckpt \
+  --prompt_file inputs/t2v/domain_prompt.txt \
+  --enable_model_cpu_offload
+```
+
+Low VRAM (12–16 GB):
+
+```bash
+poetry run validate-domain-t2v \
+  --config configs/inference/presets/wan_domain_lora_smoke_22_low_vram.yaml \
+  --trained_ckpt <denoiser.ckpt>
+```
+
+The bridge loads adapters onto both `transformer` (high-noise) and `transformer_2` (low-noise). Run `poetry run test tests/test_wan_lora_bridge.py -q` on CPU; full visual QA requires a rental GPU.
+
+## Domain I2V LoRA validation (Wan 2.1 → 2.2 I2V bridge)
+
+After optional Phase 2.5 I2V training:
+
+```bash
+poetry run validate-domain-i2v \
+  --trained_ckpt results/train/train_wan_domain_i2v_lora_<ts>/checkpoints/only_trained_model/denoiser-000-000000025.ckpt \
+  --prompt_dir inputs/i2v/domain_smoke \
+  --enable_model_cpu_offload
+```
+
+Preset: `configs/inference/presets/wan_domain_i2v_smoke_22.yaml` (720×1280, 4 steps, ~24 GB with offload).
+
+`--prompt_dir` must contain paired `.txt` prompts and reference images (same contract as native Wan I2V inference).
+
+Export LoRA for reuse: `poetry run python tools/convert_wan_lora_21_to_22.py --input <ckpt> --output-dir results/lora/wan22-i2v-export/ --mode i2v`
+
+## Multi-GPU (2× A100)
+
+Wan 2.2 via `inference-wan2.2-t2v-720p` uses **Diffusers** (`DiffusersVideoFlow`).
+
+| Path | Command | Pros | Cons |
+|------|---------|------|------|
+| **device-map auto** (recommended) | `CUDA_VISIBLE_DEVICES=0,1 poetry run inference-wan2.2-t2v-720p --config configs/inference/presets/max_speed_wan2_2_720p.yaml --device-map auto` | Single process; spreads transformer across GPUs | Slower than xfuser USP; experimental |
+| **xfuser USP** (native) | `torchrun --nproc_per_node=2 scripts/inference_new.py --config configs/inference/presets/wan2_2_native_t2v_14b.yaml --ulysses_degree 2 --ring_degree 1` | Faster sequence-parallel attention | CUDA-only; no CPU offload; needs `checkpoints/wan/` layout |
+
+See [multi-gpu.md](../multi-gpu.md) for xfuser requirements (`ulysses_degree × ring_degree == WORLD_SIZE`).
+
+## Clear errors when VRAM is insufficient
+
+- **`min_vram_gb` in preset YAML** or **`--min-vram-gb` CLI** — `require_min_vram()` fails before model load with next-step hints (`low_vram`, lower resolution, pick another GPU).
+- **720p without GPU** — `require_accelerator_for_flow()` returns `tier=gpu_required` for Wan Diffusers at 720×1280 unless `--cpu-smoke` is set.
+
+## Environment variables (summary)
+
+| Variable | Rental NVIDIA | Rental ROCm | Home CPU |
+|----------|---------------|-------------|----------|
+| `VIDEOTUNA_ATTN_BACKEND` | `auto` or `flash` | `sdpa` | `eager` |
+| `VIDEOTUNA_TORCH_COMPILE` | `0` (or `1` / `--compile` after warm-up) | `0` | `0` |
+| `VIDEOTUNA_COMPUTE_BACKEND` | `auto` | `rocm` | `cpu` |
+| `CUDA_VISIBLE_DEVICES` / `HIP_VISIBLE_DEVICES` | GPU selection | GPU selection | N/A |
+
+## Related docs
+
+- [README.md](../../README.md) — performance tuning section
+- [install-cpu.md](../install-cpu.md) — CPU smoke tiers
+- [install-rocm.md](../install-rocm.md) — AMD ROCm setup
+- [multi-gpu.md](../multi-gpu.md) — device-map and xfuser
+- [domain-adult-finetune.md](domain-adult-finetune.md) — Wan 2.1 LoRA training (separate from 2.2 Diffusers inference)
diff --git a/docs/vendor-policy.md b/docs/vendor-policy.md
new file mode 100644
index 00000000..5c5f6dab
--- /dev/null
+++ b/docs/vendor-policy.md
@@ -0,0 +1,74 @@
+# Vendor policy
+
+PrivTune vendors upstream model code when Diffusers/Hugging Face pipelines are insufficient for training. This document defines **where** vendored code lives, **how** to update it, and **what** attribution is required.
+
+## Directory convention
+
+| Location | Purpose | Example |
+|----------|---------|---------|
+| `videotuna/models/<family>/` | Native model implementations for training and/or smoke inference | `wan/` |
+| `videotuna/training/<task>/` | First-party training loops (Diffusers + PEFT + Accelerate) | `flux_lora/` |
+| `videotuna/vendor/<upstream>/` | Third-party snapshots (git submodule preferred) | *(none today)* |
+
+**Rule:** New upstream code goes under `videotuna/vendor/<upstream>/` with a `VENDOR.md` at the tree root. Prefer first-party trainers under `videotuna/training/` when Diffusers covers the model.
+
+## Required provenance (`VENDOR.md`)
+
+Every vendored tree must include `VENDOR.md` (or `LICENSE` + `VENDOR.md`) with:
+
+1. **Upstream repository URL**
+2. **License** (SPDX identifier + link to upstream `LICENSE`)
+3. **Pinned commit** (full SHA) at last sync
+4. **Import date / PR** that introduced or last updated the snapshot
+5. **PrivTune entrypoints** that depend on the tree
+6. **Update procedure** (submodule bump, manual diff, or replacement plan)
+
+## Update process
+
+1. Identify upstream release or commit to pin.
+2. Record the SHA in `VENDOR.md` before merging.
+3. Run import smoke tests (`poetry run test tests/test_import_smoke.py -q`).
+4. Run the relevant Poetry script on a smoke config.
+5. Note breaking config changes in `README.md` and `docs/MODEL_VERSIONS.md` if applicable.
+
+Prefer **git submodule** or **pip/git dependency** over copying large trees.
+
+## Dependency groups
+
+| Group | Install command | Consumers |
+|-------|-----------------|-----------|
+| *(default / main)* | `poetry install -E cuda` or `uv sync` | Diffusers inference (Flux + Wan 2.2), Wan native smoke |
+| `training` | `--with training` | Flux LoRA, Wan 2.1 Lightning LoRA, DeepSpeed |
+| `dev` | `--with dev` | pytest, ruff, mypy |
+
+## Poetry scripts → dependency groups
+
+| Scripts | Groups required |
+|---------|-----------------|
+| `inference-flux-lora`, `inference-wan2.2-t2v-720p` | default (inference) |
+| `train-flux-lora`, `train-wan2-1-t2v-lora` | `training` |
+| `install-deepspeed`, `install-flash-attn` | `training` / optional runtime |
+| `test`, `lint`, `format*` | `dev` |
+
+## Inventory
+
+| Path | Upstream | License | Entrypoints | Vendor deps | Status |
+|------|----------|---------|-------------|-------------|--------|
+| `videotuna/training/flux_lora/` | PrivTune first-party | N/A | `train-flux-lora` | — | **Keep** — Accelerate stack ([ADR-001](decisions/0001-dual-training-stacks.md)) |
+| `videotuna/models/wan/` | [Wan-Video/Wan2.2](https://github.com/Wan-Video/Wan2.2) (Wan 2.1 hub weights for domain training — see `VENDOR.md`) | Apache-2.0 | `train-domain-t2v`, `train-domain-i2v`, `train-wan2-1-*-lora`, `scripts/train_new.py`, `scripts/inference_new.py` | `easydict` (configs only) | **Keep** — Lightning+DeepSpeed stack ([ADR-001](decisions/0001-dual-training-stacks.md)); T2V/I2V only; s2v/animate/ti2v pruned (see `VENDOR.md`) |
+
+Provenance details: [`videotuna/models/wan/VENDOR.md`](../videotuna/models/wan/VENDOR.md).
+
+`easydict` is pinned in `pyproject.toml` for upstream Wan config modules under `videotuna/models/wan/wan/configs/`. First-party code must not import it; `tests/test_vendor_import_boundary.py` enforces that boundary.
+
+## Flux LoRA training
+
+First-party trainer at `videotuna/training/flux_lora/` (Diffusers + PEFT + Accelerate). Configs under `configs/domain/`. Inference via `DiffusersVideoFlow` / `inference-flux-lora`. Training stack rationale: [ADR-001](decisions/0001-dual-training-stacks.md).
+
+## Removing vendored code
+
+Before deleting any file:
+
+1. Confirm zero references outside the vendor tree.
+2. Confirm no Poetry script imports it.
+3. Archive provenance in `docs/vendor/` and update this inventory.
diff --git a/eval/prompts/vbench_all_dimension.txt b/eval/prompts/vbench_all_dimension.txt
deleted file mode 100644
index f26fbf80..00000000
--- a/eval/prompts/vbench_all_dimension.txt
+++ /dev/null
@@ -1,946 +0,0 @@
-In a still frame, a stop sign
-a toilet, frozen in time
-a laptop, frozen in time
-A tranquil tableau of alley
-A tranquil tableau of bar
-A tranquil tableau of barn
-A tranquil tableau of bathroom
-A tranquil tableau of bedroom
-A tranquil tableau of cliff
-In a still frame, courtyard
-In a still frame, gas station
-A tranquil tableau of house
-indoor gymnasium, frozen in time
-A tranquil tableau of indoor library
-A tranquil tableau of kitchen
-A tranquil tableau of palace
-In a still frame, parking lot
-In a still frame, phone booth
-A tranquil tableau of restaurant
-A tranquil tableau of tower
-A tranquil tableau of a bowl
-A tranquil tableau of an apple
-A tranquil tableau of a bench
-A tranquil tableau of a bed
-A tranquil tableau of a chair
-A tranquil tableau of a cup
-A tranquil tableau of a dining table
-In a still frame, a pear
-A tranquil tableau of a bunch of grapes
-A tranquil tableau of a bowl on the kitchen counter
-A tranquil tableau of a beautiful, handcrafted ceramic bowl
-A tranquil tableau of an antique bowl
-A tranquil tableau of an exquisite mahogany dining table
-A tranquil tableau of a wooden bench in the park
-A tranquil tableau of a beautiful wrought-iron bench surrounded by blooming flowers
-In a still frame, a park bench with a view of the lake
-A tranquil tableau of a vintage rocking chair was placed on the porch
-A tranquil tableau of the jail cell was small and dimly lit, with cold, steel bars
-A tranquil tableau of the phone booth was tucked away in a quiet alley
-a dilapidated phone booth stood as a relic of a bygone era on the sidewalk, frozen in time
-A tranquil tableau of the old red barn stood weathered and iconic against the backdrop of the countryside
-A tranquil tableau of a picturesque barn was painted a warm shade of red and nestled in a picturesque meadow
-In a still frame, within the desolate desert, an oasis unfolded, characterized by the stoic presence of palm trees and a motionless, glassy pool of water
-In a still frame, the Parthenon's majestic Doric columns stand in serene solitude atop the Acropolis, framed by the tranquil Athenian landscape
-In a still frame, the Temple of Hephaestus, with its timeless Doric grace, stands stoically against the backdrop of a quiet Athens
-In a still frame, the ornate Victorian streetlamp stands solemnly, adorned with intricate ironwork and stained glass panels
-A tranquil tableau of the Stonehenge presented itself as an enigmatic puzzle, each colossal stone meticulously placed against the backdrop of tranquility
-In a still frame, in the vast desert, an oasis nestled among dunes, featuring tall palm trees and an air of serenity
-static view on a desert scene with an oasis, palm trees, and a clear, calm pool of water
-A tranquil tableau of an ornate Victorian streetlamp standing on a cobblestone street corner, illuminating the empty night
-A tranquil tableau of a tranquil lakeside cabin nestled among tall pines, its reflection mirrored perfectly in the calm water
-In a still frame, a vintage gas lantern, adorned with intricate details, gracing a historic cobblestone square
-In a still frame, a tranquil Japanese tea ceremony room, with tatami mats, a delicate tea set, and a bonsai tree in the corner
-A tranquil tableau of the Parthenon stands resolute in its classical elegance, a timeless symbol of Athens' cultural legacy
-A tranquil tableau of in the heart of Plaka, the neoclassical architecture of the old city harmonizes with the ancient ruins
-A tranquil tableau of in the desolate beauty of the American Southwest, Chaco Canyon's ancient ruins whispered tales of an enigmatic civilization that once thrived amidst the arid landscapes
-A tranquil tableau of at the edge of the Arabian Desert, the ancient city of Petra beckoned with its enigmatic rock-carved façades
-In a still frame, amidst the cobblestone streets, an Art Nouveau lamppost stood tall
-A tranquil tableau of in the quaint village square, a traditional wrought-iron streetlamp featured delicate filigree patterns and amber-hued glass panels
-A tranquil tableau of the lampposts were adorned with Art Deco motifs, their geometric shapes and frosted glass creating a sense of vintage glamour
-In a still frame, in the picturesque square, a Gothic-style lamppost adorned with intricate stone carvings added a touch of medieval charm to the setting
-In a still frame, in the heart of the old city, a row of ornate lantern-style streetlamps bathed the narrow alleyway in a warm, welcoming light
-A tranquil tableau of in the heart of the Utah desert, a massive sandstone arch spanned the horizon
-A tranquil tableau of in the Arizona desert, a massive stone bridge arched across a rugged canyon
-A tranquil tableau of in the corner of the minimalist tea room, a bonsai tree added a touch of nature's beauty to the otherwise simple and elegant space
-In a still frame, amidst the hushed ambiance of the traditional tea room, a meticulously arranged tea set awaited, with porcelain cups, a bamboo whisk
-In a still frame, nestled in the Zen garden, a rustic teahouse featured tatami seating and a traditional charcoal brazier
-A tranquil tableau of a country estate's library featured elegant wooden shelves
-A tranquil tableau of beneath the shade of a solitary oak tree, an old wooden park bench sat patiently
-A tranquil tableau of beside a tranquil pond, a weeping willow tree draped its branches gracefully over the water's surface, creating a serene tableau of reflection and calm
-A tranquil tableau of in the Zen garden, a perfectly raked gravel path led to a serene rock garden
-In a still frame, a tranquil pond was fringed by weeping cherry trees, their blossoms drifting lazily onto the glassy surface
-In a still frame, within the historic library's reading room, rows of antique leather chairs and mahogany tables offered a serene haven for literary contemplation
-A tranquil tableau of a peaceful orchid garden showcased a variety of delicate blooms
-A tranquil tableau of in the serene courtyard, a centuries-old stone well stood as a symbol of a bygone era, its mossy stones bearing witness to the passage of time
-a bird and a cat
-a cat and a dog
-a dog and a horse
-a horse and a sheep
-a sheep and a cow
-a cow and an elephant
-an elephant and a bear
-a bear and a zebra
-a zebra and a giraffe
-a giraffe and a bird
-a chair and a couch
-a couch and a potted plant
-a potted plant and a tv
-a tv and a laptop
-a laptop and a remote
-a remote and a keyboard
-a keyboard and a cell phone
-a cell phone and a book
-a book and a clock
-a clock and a backpack
-a backpack and an umbrella
-an umbrella and a handbag
-a handbag and a tie
-a tie and a suitcase
-a suitcase and a vase
-a vase and scissors
-scissors and a teddy bear
-a teddy bear and a frisbee
-a frisbee and skis
-skis and a snowboard
-a snowboard and a sports ball
-a sports ball and a kite
-a kite and a baseball bat
-a baseball bat and a baseball glove
-a baseball glove and a skateboard
-a skateboard and a surfboard
-a surfboard and a tennis racket
-a tennis racket and a bottle
-a bottle and a chair
-an airplane and a train
-a train and a boat
-a boat and an airplane
-a bicycle and a car
-a car and a motorcycle
-a motorcycle and a bus
-a bus and a traffic light
-a traffic light and a fire hydrant
-a fire hydrant and a stop sign
-a stop sign and a parking meter
-a parking meter and a truck
-a truck and a bicycle
-a toilet and a hair drier
-a hair drier and a toothbrush
-a toothbrush and a sink
-a sink and a toilet
-a wine glass and a chair
-a cup and a couch
-a fork and a potted plant
-a knife and a tv
-a spoon and a laptop
-a bowl and a remote
-a banana and a keyboard
-an apple and a cell phone
-a sandwich and a book
-an orange and a clock
-broccoli and a backpack
-a carrot and an umbrella
-a hot dog and a handbag
-a pizza and a tie
-a donut and a suitcase
-a cake and a vase
-an oven and scissors
-a toaster and a teddy bear
-a microwave and a frisbee
-a refrigerator and skis
-a bicycle and an airplane
-a car and a train
-a motorcycle and a boat
-a person and a toilet
-a person and a hair drier
-a person and a toothbrush
-a person and a sink
-A person is riding a bike
-A person is marching
-A person is roller skating
-A person is tasting beer
-A person is clapping
-A person is drawing
-A person is petting animal (not cat)
-A person is eating watermelon
-A person is playing harp
-A person is wrestling
-A person is riding scooter
-A person is sweeping floor
-A person is skateboarding
-A person is dunking basketball
-A person is playing flute
-A person is stretching leg
-A person is tying tie
-A person is skydiving
-A person is shooting goal (soccer)
-A person is playing piano
-A person is finger snapping
-A person is canoeing or kayaking
-A person is laughing
-A person is digging
-A person is clay pottery making
-A person is shooting basketball
-A person is bending back
-A person is shaking hands
-A person is bandaging
-A person is push up
-A person is catching or throwing frisbee
-A person is playing trumpet
-A person is flying kite
-A person is filling eyebrows
-A person is shuffling cards
-A person is folding clothes
-A person is smoking
-A person is tai chi
-A person is squat
-A person is playing controller
-A person is throwing axe
-A person is giving or receiving award
-A person is air drumming
-A person is taking a shower
-A person is planting trees
-A person is sharpening knives
-A person is robot dancing
-A person is rock climbing
-A person is hula hooping
-A person is writing
-A person is bungee jumping
-A person is pushing cart
-A person is cleaning windows
-A person is cutting watermelon
-A person is cheerleading
-A person is washing hands
-A person is ironing
-A person is cutting nails
-A person is hugging
-A person is trimming or shaving beard
-A person is jogging
-A person is making bed
-A person is washing dishes
-A person is grooming dog
-A person is doing laundry
-A person is knitting
-A person is reading book
-A person is baby waking up
-A person is massaging legs
-A person is brushing teeth
-A person is crawling baby
-A person is motorcycling
-A person is driving car
-A person is sticking tongue out
-A person is shaking head
-A person is sword fighting
-A person is doing aerobics
-A person is strumming guitar
-A person is riding or walking with horse
-A person is archery
-A person is catching or throwing baseball
-A person is playing chess
-A person is rock scissors paper
-A person is using computer
-A person is arranging flowers
-A person is bending metal
-A person is ice skating
-A person is climbing a rope
-A person is crying
-A person is dancing ballet
-A person is getting a haircut
-A person is running on treadmill
-A person is kissing
-A person is counting money
-A person is barbequing
-A person is peeling apples
-A person is milking cow
-A person is shining shoes
-A person is making snowman
-A person is sailing
-a person swimming in ocean
-a person giving a presentation to a room full of colleagues
-a person washing the dishes
-a person eating a burger
-a person walking in the snowstorm
-a person drinking coffee in a cafe
-a person playing guitar
-a bicycle leaning against a tree
-a bicycle gliding through a snowy field
-a bicycle slowing down to stop
-a bicycle accelerating to gain speed
-a car stuck in traffic during rush hour
-a car turning a corner
-a car slowing down to stop
-a car accelerating to gain speed
-a motorcycle cruising along a coastal highway
-a motorcycle turning a corner
-a motorcycle slowing down to stop
-a motorcycle gliding through a snowy field
-a motorcycle accelerating to gain speed
-an airplane soaring through a clear blue sky
-an airplane taking off
-an airplane landing smoothly on a runway
-an airplane accelerating to gain speed
-a bus turning a corner
-a bus stuck in traffic during rush hour
-a bus accelerating to gain speed
-a train speeding down the tracks
-a train crossing over a tall bridge
-a train accelerating to gain speed
-a truck turning a corner
-a truck anchored in a tranquil bay
-a truck stuck in traffic during rush hour
-a truck slowing down to stop
-a truck accelerating to gain speed
-a boat sailing smoothly on a calm lake
-a boat slowing down to stop
-a boat accelerating to gain speed
-a bird soaring gracefully in the sky
-a bird building a nest from twigs and leaves
-a bird flying over a snowy forest
-a cat grooming itself meticulously with its tongue
-a cat playing in park
-a cat drinking water
-a cat running happily
-a dog enjoying a peaceful walk
-a dog playing in park
-a dog drinking water
-a dog running happily
-a horse bending down to drink water from a river
-a horse galloping across an open field
-a horse taking a peaceful walk
-a horse running to join a herd of its kind
-a sheep bending down to drink water from a river
-a sheep taking a peaceful walk
-a sheep running to join a herd of its kind
-a cow bending down to drink water from a river
-a cow chewing cud while resting in a tranquil barn
-a cow running to join a herd of its kind
-an elephant spraying itself with water using its trunk to cool down
-an elephant taking a peaceful walk
-an elephant running to join a herd of its kind
-a bear catching a salmon in its powerful jaws
-a bear sniffing the air for scents of food
-a bear climbing a tree
-a bear hunting for prey
-a zebra bending down to drink water from a river
-a zebra running to join a herd of its kind
-a zebra taking a peaceful walk
-a giraffe bending down to drink water from a river
-a giraffe taking a peaceful walk
-a giraffe running to join a herd of its kind
-a person
-a bicycle
-a car
-a motorcycle
-an airplane
-a bus
-a train
-a truck
-a boat
-a traffic light
-a fire hydrant
-a stop sign
-a parking meter
-a bench
-a bird
-a cat
-a dog
-a horse
-a sheep
-a cow
-an elephant
-a bear
-a zebra
-a giraffe
-a backpack
-an umbrella
-a handbag
-a tie
-a suitcase
-a frisbee
-skis
-a snowboard
-a sports ball
-a kite
-a baseball bat
-a baseball glove
-a skateboard
-a surfboard
-a tennis racket
-a bottle
-a wine glass
-a cup
-a fork
-a knife
-a spoon
-a bowl
-a banana
-an apple
-a sandwich
-an orange
-broccoli
-a carrot
-a hot dog
-a pizza
-a donut
-a cake
-a chair
-a couch
-a potted plant
-a bed
-a dining table
-a toilet
-a tv
-a laptop
-a remote
-a keyboard
-a cell phone
-a microwave
-an oven
-a toaster
-a sink
-a refrigerator
-a book
-a clock
-a vase
-scissors
-a teddy bear
-a hair drier
-a toothbrush
-a red bicycle
-a green bicycle
-a blue bicycle
-a yellow bicycle
-an orange bicycle
-a purple bicycle
-a pink bicycle
-a black bicycle
-a white bicycle
-a red car
-a green car
-a blue car
-a yellow car
-an orange car
-a purple car
-a pink car
-a black car
-a white car
-a red bird
-a green bird
-a blue bird
-a yellow bird
-an orange bird
-a purple bird
-a pink bird
-a black bird
-a white bird
-a black cat
-a white cat
-an orange cat
-a yellow cat
-a red umbrella
-a green umbrella
-a blue umbrella
-a yellow umbrella
-an orange umbrella
-a purple umbrella
-a pink umbrella
-a black umbrella
-a white umbrella
-a red suitcase
-a green suitcase
-a blue suitcase
-a yellow suitcase
-an orange suitcase
-a purple suitcase
-a pink suitcase
-a black suitcase
-a white suitcase
-a red bowl
-a green bowl
-a blue bowl
-a yellow bowl
-an orange bowl
-a purple bowl
-a pink bowl
-a black bowl
-a white bowl
-a red chair
-a green chair
-a blue chair
-a yellow chair
-an orange chair
-a purple chair
-a pink chair
-a black chair
-a white chair
-a red clock
-a green clock
-a blue clock
-a yellow clock
-an orange clock
-a purple clock
-a pink clock
-a black clock
-a white clock
-a red vase
-a green vase
-a blue vase
-a yellow vase
-an orange vase
-a purple vase
-a pink vase
-a black vase
-a white vase
-A beautiful coastal beach in spring, waves lapping on sand, Van Gogh style
-A beautiful coastal beach in spring, waves lapping on sand, oil painting
-A beautiful coastal beach in spring, waves lapping on sand by Hokusai, in the style of Ukiyo
-A beautiful coastal beach in spring, waves lapping on sand, black and white
-A beautiful coastal beach in spring, waves lapping on sand, pixel art
-A beautiful coastal beach in spring, waves lapping on sand, in cyberpunk style
-A beautiful coastal beach in spring, waves lapping on sand, animated style
-A beautiful coastal beach in spring, waves lapping on sand, watercolor painting
-A beautiful coastal beach in spring, waves lapping on sand, surrealism style
-The bund Shanghai, Van Gogh style
-The bund Shanghai, oil painting
-The bund Shanghai by Hokusai, in the style of Ukiyo
-The bund Shanghai, black and white
-The bund Shanghai, pixel art
-The bund Shanghai, in cyberpunk style
-The bund Shanghai, animated style
-The bund Shanghai, watercolor painting
-The bund Shanghai, surrealism style
-a shark is swimming in the ocean, Van Gogh style
-a shark is swimming in the ocean, oil painting
-a shark is swimming in the ocean by Hokusai, in the style of Ukiyo
-a shark is swimming in the ocean, black and white
-a shark is swimming in the ocean, pixel art
-a shark is swimming in the ocean, in cyberpunk style
-a shark is swimming in the ocean, animated style
-a shark is swimming in the ocean, watercolor painting
-a shark is swimming in the ocean, surrealism style
-A panda drinking coffee in a cafe in Paris, Van Gogh style
-A panda drinking coffee in a cafe in Paris, oil painting
-A panda drinking coffee in a cafe in Paris by Hokusai, in the style of Ukiyo
-A panda drinking coffee in a cafe in Paris, black and white
-A panda drinking coffee in a cafe in Paris, pixel art
-A panda drinking coffee in a cafe in Paris, in cyberpunk style
-A panda drinking coffee in a cafe in Paris, animated style
-A panda drinking coffee in a cafe in Paris, watercolor painting
-A panda drinking coffee in a cafe in Paris, surrealism style
-A cute happy Corgi playing in park, sunset, Van Gogh style
-A cute happy Corgi playing in park, sunset, oil painting
-A cute happy Corgi playing in park, sunset by Hokusai, in the style of Ukiyo
-A cute happy Corgi playing in park, sunset, black and white
-A cute happy Corgi playing in park, sunset, pixel art
-A cute happy Corgi playing in park, sunset, in cyberpunk style
-A cute happy Corgi playing in park, sunset, animated style
-A cute happy Corgi playing in park, sunset, watercolor painting
-A cute happy Corgi playing in park, sunset, surrealism style
-Gwen Stacy reading a book, Van Gogh style
-Gwen Stacy reading a book, oil painting
-Gwen Stacy reading a book by Hokusai, in the style of Ukiyo
-Gwen Stacy reading a book, black and white
-Gwen Stacy reading a book, pixel art
-Gwen Stacy reading a book, in cyberpunk style
-Gwen Stacy reading a book, animated style
-Gwen Stacy reading a book, watercolor painting
-Gwen Stacy reading a book, surrealism style
-A boat sailing leisurely along the Seine River with the Eiffel Tower in background, Van Gogh style
-A boat sailing leisurely along the Seine River with the Eiffel Tower in background, oil painting
-A boat sailing leisurely along the Seine River with the Eiffel Tower in background by Hokusai, in the style of Ukiyo
-A boat sailing leisurely along the Seine River with the Eiffel Tower in background, black and white
-A boat sailing leisurely along the Seine River with the Eiffel Tower in background, pixel art
-A boat sailing leisurely along the Seine River with the Eiffel Tower in background, in cyberpunk style
-A boat sailing leisurely along the Seine River with the Eiffel Tower in background, animated style
-A boat sailing leisurely along the Seine River with the Eiffel Tower in background, watercolor painting
-A boat sailing leisurely along the Seine River with the Eiffel Tower in background, surrealism style
-A couple in formal evening wear going home get caught in a heavy downpour with umbrellas, Van Gogh style
-A couple in formal evening wear going home get caught in a heavy downpour with umbrellas, oil painting
-A couple in formal evening wear going home get caught in a heavy downpour with umbrellas by Hokusai, in the style of Ukiyo
-A couple in formal evening wear going home get caught in a heavy downpour with umbrellas, black and white
-A couple in formal evening wear going home get caught in a heavy downpour with umbrellas, pixel art
-A couple in formal evening wear going home get caught in a heavy downpour with umbrellas, in cyberpunk style
-A couple in formal evening wear going home get caught in a heavy downpour with umbrellas, animated style
-A couple in formal evening wear going home get caught in a heavy downpour with umbrellas, watercolor painting
-A couple in formal evening wear going home get caught in a heavy downpour with umbrellas, surrealism style
-An astronaut flying in space, Van Gogh style
-An astronaut flying in space, oil painting
-An astronaut flying in space by Hokusai, in the style of Ukiyo
-An astronaut flying in space, black and white
-An astronaut flying in space, pixel art
-An astronaut flying in space, in cyberpunk style
-An astronaut flying in space, animated style
-An astronaut flying in space, watercolor painting
-An astronaut flying in space, surrealism style
-Snow rocky mountains peaks canyon. snow blanketed rocky mountains surround and shadow deep canyons. the canyons twist and bend through the high elevated mountain peaks, Van Gogh style
-Snow rocky mountains peaks canyon. snow blanketed rocky mountains surround and shadow deep canyons. the canyons twist and bend through the high elevated mountain peaks, oil painting
-Snow rocky mountains peaks canyon. snow blanketed rocky mountains surround and shadow deep canyons. the canyons twist and bend through the high elevated mountain peaks by Hokusai, in the style of Ukiyo
-Snow rocky mountains peaks canyon. snow blanketed rocky mountains surround and shadow deep canyons. the canyons twist and bend through the high elevated mountain peaks, black and white
-Snow rocky mountains peaks canyon. snow blanketed rocky mountains surround and shadow deep canyons. the canyons twist and bend through the high elevated mountain peaks, pixel art
-Snow rocky mountains peaks canyon. snow blanketed rocky mountains surround and shadow deep canyons. the canyons twist and bend through the high elevated mountain peaks, in cyberpunk style
-Snow rocky mountains peaks canyon. snow blanketed rocky mountains surround and shadow deep canyons. the canyons twist and bend through the high elevated mountain peaks, animated style
-Snow rocky mountains peaks canyon. snow blanketed rocky mountains surround and shadow deep canyons. the canyons twist and bend through the high elevated mountain peaks, watercolor painting
-Snow rocky mountains peaks canyon. snow blanketed rocky mountains surround and shadow deep canyons. the canyons twist and bend through the high elevated mountain peaks, surrealism style
-A beautiful coastal beach in spring, waves lapping on sand, in super slow motion
-A beautiful coastal beach in spring, waves lapping on sand, zoom in
-A beautiful coastal beach in spring, waves lapping on sand, zoom out
-A beautiful coastal beach in spring, waves lapping on sand, pan left
-A beautiful coastal beach in spring, waves lapping on sand, pan right
-A beautiful coastal beach in spring, waves lapping on sand, tilt up
-A beautiful coastal beach in spring, waves lapping on sand, tilt down
-A beautiful coastal beach in spring, waves lapping on sand, with an intense shaking effect
-A beautiful coastal beach in spring, waves lapping on sand, featuring a steady and smooth perspective
-A beautiful coastal beach in spring, waves lapping on sand, racking focus
-The bund Shanghai, in super slow motion
-The bund Shanghai, zoom in
-The bund Shanghai, zoom out
-The bund Shanghai, pan left
-The bund Shanghai, pan right
-The bund Shanghai, tilt up
-The bund Shanghai, tilt down
-The bund Shanghai, with an intense shaking effect
-The bund Shanghai, featuring a steady and smooth perspective
-The bund Shanghai, racking focus
-a shark is swimming in the ocean, in super slow motion
-a shark is swimming in the ocean, zoom in
-a shark is swimming in the ocean, zoom out
-a shark is swimming in the ocean, pan left
-a shark is swimming in the ocean, pan right
-a shark is swimming in the ocean, tilt up
-a shark is swimming in the ocean, tilt down
-a shark is swimming in the ocean, with an intense shaking effect
-a shark is swimming in the ocean, featuring a steady and smooth perspective
-a shark is swimming in the ocean, racking focus
-A panda drinking coffee in a cafe in Paris, in super slow motion
-A panda drinking coffee in a cafe in Paris, zoom in
-A panda drinking coffee in a cafe in Paris, zoom out
-A panda drinking coffee in a cafe in Paris, pan left
-A panda drinking coffee in a cafe in Paris, pan right
-A panda drinking coffee in a cafe in Paris, tilt up
-A panda drinking coffee in a cafe in Paris, tilt down
-A panda drinking coffee in a cafe in Paris, with an intense shaking effect
-A panda drinking coffee in a cafe in Paris, featuring a steady and smooth perspective
-A panda drinking coffee in a cafe in Paris, racking focus
-A cute happy Corgi playing in park, sunset, in super slow motion
-A cute happy Corgi playing in park, sunset, zoom in
-A cute happy Corgi playing in park, sunset, zoom out
-A cute happy Corgi playing in park, sunset, pan left
-A cute happy Corgi playing in park, sunset, pan right
-A cute happy Corgi playing in park, sunset, tilt up
-A cute happy Corgi playing in park, sunset, tilt down
-A cute happy Corgi playing in park, sunset, with an intense shaking effect
-A cute happy Corgi playing in park, sunset, featuring a steady and smooth perspective
-A cute happy Corgi playing in park, sunset, racking focus
-Gwen Stacy reading a book, in super slow motion
-Gwen Stacy reading a book, zoom in
-Gwen Stacy reading a book, zoom out
-Gwen Stacy reading a book, pan left
-Gwen Stacy reading a book, pan right
-Gwen Stacy reading a book, tilt up
-Gwen Stacy reading a book, tilt down
-Gwen Stacy reading a book, with an intense shaking effect
-Gwen Stacy reading a book, featuring a steady and smooth perspective
-Gwen Stacy reading a book, racking focus
-A boat sailing leisurely along the Seine River with the Eiffel Tower in background, in super slow motion
-A boat sailing leisurely along the Seine River with the Eiffel Tower in background, zoom in
-A boat sailing leisurely along the Seine River with the Eiffel Tower in background, zoom out
-A boat sailing leisurely along the Seine River with the Eiffel Tower in background, pan left
-A boat sailing leisurely along the Seine River with the Eiffel Tower in background, pan right
-A boat sailing leisurely along the Seine River with the Eiffel Tower in background, tilt up
-A boat sailing leisurely along the Seine River with the Eiffel Tower in background, tilt down
-A boat sailing leisurely along the Seine River with the Eiffel Tower in background, with an intense shaking effect
-A boat sailing leisurely along the Seine River with the Eiffel Tower in background, featuring a steady and smooth perspective
-A boat sailing leisurely along the Seine River with the Eiffel Tower in background, racking focus
-A couple in formal evening wear going home get caught in a heavy downpour with umbrellas, in super slow motion
-A couple in formal evening wear going home get caught in a heavy downpour with umbrellas, zoom in
-A couple in formal evening wear going home get caught in a heavy downpour with umbrellas, zoom out
-A couple in formal evening wear going home get caught in a heavy downpour with umbrellas, pan left
-A couple in formal evening wear going home get caught in a heavy downpour with umbrellas, pan right
-A couple in formal evening wear going home get caught in a heavy downpour with umbrellas, tilt up
-A couple in formal evening wear going home get caught in a heavy downpour with umbrellas, tilt down
-A couple in formal evening wear going home get caught in a heavy downpour with umbrellas, with an intense shaking effect
-A couple in formal evening wear going home get caught in a heavy downpour with umbrellas, featuring a steady and smooth perspective
-A couple in formal evening wear going home get caught in a heavy downpour with umbrellas, racking focus
-An astronaut flying in space, in super slow motion
-An astronaut flying in space, zoom in
-An astronaut flying in space, zoom out
-An astronaut flying in space, pan left
-An astronaut flying in space, pan right
-An astronaut flying in space, tilt up
-An astronaut flying in space, tilt down
-An astronaut flying in space, with an intense shaking effect
-An astronaut flying in space, featuring a steady and smooth perspective
-An astronaut flying in space, racking focus
-Snow rocky mountains peaks canyon. snow blanketed rocky mountains surround and shadow deep canyons. the canyons twist and bend through the high elevated mountain peaks, in super slow motion
-Snow rocky mountains peaks canyon. snow blanketed rocky mountains surround and shadow deep canyons. the canyons twist and bend through the high elevated mountain peaks, zoom in
-Snow rocky mountains peaks canyon. snow blanketed rocky mountains surround and shadow deep canyons. the canyons twist and bend through the high elevated mountain peaks, zoom out
-Snow rocky mountains peaks canyon. snow blanketed rocky mountains surround and shadow deep canyons. the canyons twist and bend through the high elevated mountain peaks, pan left
-Snow rocky mountains peaks canyon. snow blanketed rocky mountains surround and shadow deep canyons. the canyons twist and bend through the high elevated mountain peaks, pan right
-Snow rocky mountains peaks canyon. snow blanketed rocky mountains surround and shadow deep canyons. the canyons twist and bend through the high elevated mountain peaks, tilt up
-Snow rocky mountains peaks canyon. snow blanketed rocky mountains surround and shadow deep canyons. the canyons twist and bend through the high elevated mountain peaks, tilt down
-Snow rocky mountains peaks canyon. snow blanketed rocky mountains surround and shadow deep canyons. the canyons twist and bend through the high elevated mountain peaks, with an intense shaking effect
-Snow rocky mountains peaks canyon. snow blanketed rocky mountains surround and shadow deep canyons. the canyons twist and bend through the high elevated mountain peaks, featuring a steady and smooth perspective
-Snow rocky mountains peaks canyon. snow blanketed rocky mountains surround and shadow deep canyons. the canyons twist and bend through the high elevated mountain peaks, racking focus
-Close up of grapes on a rotating table.
-Turtle swimming in ocean.
-A storm trooper vacuuming the beach.
-A panda standing on a surfboard in the ocean in sunset.
-An astronaut feeding ducks on a sunny afternoon, reflection from the water.
-Two pandas discussing an academic paper.
-Sunset time lapse at the beach with moving clouds and colors in the sky.
-A fat rabbit wearing a purple robe walking through a fantasy landscape.
-A koala bear playing piano in the forest.
-An astronaut flying in space.
-Fireworks.
-An animated painting of fluffy white clouds moving in sky.
-Flying through fantasy landscapes.
-A bigfoot walking in the snowstorm.
-A squirrel eating a burger.
-A cat wearing sunglasses and working as a lifeguard at a pool.
-Snow rocky mountains peaks canyon. snow blanketed rocky mountains surround and shadow deep canyons. the canyons twist and bend through the high elevated mountain peaks.
-Splash of turquoise water in extreme slow motion, alpha channel included.
-an ice cream is melting on the table.
-a drone flying over a snowy forest.
-a shark is swimming in the ocean.
-Aerial panoramic video from a drone of a fantasy land.
-a teddy bear is swimming in the ocean.
-time lapse of sunrise on mars.
-golden fish swimming in the ocean.
-An artist brush painting on a canvas close up.
-A drone view of celebration with Christmas tree and fireworks, starry sky - background.
-happy dog wearing a yellow turtleneck, studio, portrait, facing camera, dark background
-Origami dancers in white paper, 3D render, on white background, studio shot, dancing modern dance.
-Campfire at night in a snowy forest with starry sky in the background.
-a fantasy landscape
-A 3D model of a 1800s victorian house.
-this is how I do makeup in the morning.
-A raccoon that looks like a turtle, digital art.
-Robot dancing in Times Square.
-Busy freeway at night.
-Balloon full of water exploding in extreme slow motion.
-An astronaut is riding a horse in the space in a photorealistic style.
-Macro slo-mo. Slow motion cropped closeup of roasted coffee beans falling into an empty bowl.
-Sewing machine, old sewing machine working.
-Motion colour drop in water, ink swirling in water, colourful ink in water, abstraction fancy dream cloud of ink.
-Few big purple plums rotating on the turntable. water drops appear on the skin during rotation. isolated on the white background. close-up. macro.
-Vampire makeup face of beautiful girl, red contact lenses.
-Ashtray full of butts on table, smoke flowing on black background, close-up
-Pacific coast, carmel by the sea ocean and waves.
-A teddy bear is playing drum kit in NYC Times Square.
-A corgi is playing drum kit.
-An Iron man is playing the electronic guitar, high electronic guitar.
-A raccoon is playing the electronic guitar.
-A boat sailing leisurely along the Seine River with the Eiffel Tower in background by Vincent van Gogh
-A corgi's head depicted as an explosion of a nebula
-A fantasy landscape
-A future where humans have achieved teleportation technology
-A jellyfish floating through the ocean, with bioluminescent tentacles
-A Mars rover moving on Mars
-A panda drinking coffee in a cafe in Paris
-A space shuttle launching into orbit, with flames and smoke billowing out from the engines
-A steam train moving on a mountainside
-A super cool giant robot in Cyberpunk Beijing
-A tropical beach at sunrise, with palm trees and crystal-clear water in the foreground
-Cinematic shot of Van Gogh's selfie, Van Gogh style
-Gwen Stacy reading a book
-Iron Man flying in the sky
-The bund Shanghai, oil painting
-Yoda playing guitar on the stage
-A beautiful coastal beach in spring, waves lapping on sand by Hokusai, in the style of Ukiyo
-A beautiful coastal beach in spring, waves lapping on sand by Vincent van Gogh
-A boat sailing leisurely along the Seine River with the Eiffel Tower in background
-A car moving slowly on an empty street, rainy evening
-A cat eating food out of a bowl
-A cat wearing sunglasses at a pool
-A confused panda in calculus class
-A cute fluffy panda eating Chinese food in a restaurant
-A cute happy Corgi playing in park, sunset
-A cute raccoon playing guitar in a boat on the ocean
-A happy fuzzy panda playing guitar nearby a campfire, snow mountain in the background
-A lightning striking atop of eiffel tower, dark clouds in the sky
-A modern art museum, with colorful paintings
-A panda cooking in the kitchen
-A panda playing on a swing set
-A polar bear is playing guitar
-A raccoon dressed in suit playing the trumpet, stage background
-A robot DJ is playing the turntable, in heavy raining futuristic tokyo rooftop cyberpunk night, sci-fi, fantasy
-A shark swimming in clear Caribbean ocean
-A super robot protecting city
-A teddy bear washing the dishes
-An epic tornado attacking above a glowing city at night, the tornado is made of smoke
-An oil painting of a couple in formal evening wear going home get caught in a heavy downpour with umbrellas
-Clown fish swimming through the coral reef
-Hyper-realistic spaceship landing on Mars
-The bund Shanghai, vibrant color
-Vincent van Gogh is painting in the room
-Yellow flowers swing in the wind
-alley
-amusement park
-aquarium
-arch
-art gallery
-bathroom
-bakery shop
-ballroom
-bar
-barn
-basement
-beach
-bedroom
-bridge
-botanical garden
-cafeteria
-campsite
-campus
-carrousel
-castle
-cemetery
-classroom
-cliff
-crosswalk
-construction site
-corridor
-courtyard
-desert
-downtown
-driveway
-farm
-food court
-football field
-forest road
-fountain
-gas station
-glacier
-golf course
-indoor gymnasium
-harbor
-highway
-hospital
-house
-iceberg
-industrial area
-jail cell
-junkyard
-kitchen
-indoor library
-lighthouse
-laboratory
-mansion
-marsh
-mountain
-indoor movie theater
-indoor museum
-music studio
-nursery
-ocean
-office
-palace
-parking lot
-pharmacy
-phone booth
-raceway
-restaurant
-river
-science museum
-shower
-ski slope
-sky
-skyscraper
-baseball stadium
-staircase
-street
-supermarket
-indoor swimming pool
-tower
-outdoor track
-train railway
-train station platform
-underwater coral reef
-valley
-volcano
-waterfall
-windmill
-a bicycle on the left of a car, front view
-a car on the right of a motorcycle, front view
-a motorcycle on the left of a bus, front view
-a bus on the right of a traffic light, front view
-a traffic light on the left of a fire hydrant, front view
-a fire hydrant on the right of a stop sign, front view
-a stop sign on the left of a parking meter, front view
-a parking meter on the right of a bench, front view
-a bench on the left of a truck, front view
-a truck on the right of a bicycle, front view
-a bird on the left of a cat, front view
-a cat on the right of a dog, front view
-a dog on the left of a horse, front view
-a horse on the right of a sheep, front view
-a sheep on the left of a cow, front view
-a cow on the right of an elephant, front view
-an elephant on the left of a bear, front view
-a bear on the right of a zebra, front view
-a zebra on the left of a giraffe, front view
-a giraffe on the right of a bird, front view
-a bottle on the left of a wine glass, front view
-a wine glass on the right of a cup, front view
-a cup on the left of a fork, front view
-a fork on the right of a knife, front view
-a knife on the left of a spoon, front view
-a spoon on the right of a bowl, front view
-a bowl on the left of a bottle, front view
-a potted plant on the left of a remote, front view
-a remote on the right of a clock, front view
-a clock on the left of a vase, front view
-a vase on the right of scissors, front view
-scissors on the left of a teddy bear, front view
-a teddy bear on the right of a potted plant, front view
-a frisbee on the left of a sports ball, front view
-a sports ball on the right of a baseball bat, front view
-a baseball bat on the left of a baseball glove, front view
-a baseball glove on the right of a tennis racket, front view
-a tennis racket on the left of a frisbee, front view
-a toilet on the left of a hair drier, front view
-a hair drier on the right of a toothbrush, front view
-a toothbrush on the left of a sink, front view
-a sink on the right of a toilet, front view
-a chair on the left of a couch, front view
-a couch on the right of a bed, front view
-a bed on the left of a tv, front view
-a tv on the right of a dining table, front view
-a dining table on the left of a chair, front view
-an airplane on the left of a train, front view
-a train on the right of a boat, front view
-a boat on the left of an airplane, front view
-an oven on the top of a toaster, front view
-an oven on the bottom of a toaster, front view
-a toaster on the top of a microwave, front view
-a toaster on the bottom of a microwave, front view
-a microwave on the top of an oven, front view
-a microwave on the bottom of an oven, front view
-a banana on the top of an apple, front view
-a banana on the bottom of an apple, front view
-an apple on the top of a sandwich, front view
-an apple on the bottom of a sandwich, front view
-a sandwich on the top of an orange, front view
-a sandwich on the bottom of an orange, front view
-an orange on the top of a carrot, front view
-an orange on the bottom of a carrot, front view
-a carrot on the top of a hot dog, front view
-a carrot on the bottom of a hot dog, front view
-a hot dog on the top of a pizza, front view
-a hot dog on the bottom of a pizza, front view
-a pizza on the top of a donut, front view
-a pizza on the bottom of a donut, front view
-a donut on the top of broccoli, front view
-a donut on the bottom of broccoli, front view
-broccoli on the top of a banana, front view
-broccoli on the bottom of a banana, front view
-skis on the top of a snowboard, front view
-skis on the bottom of a snowboard, front view
-a snowboard on the top of a kite, front view
-a snowboard on the bottom of a kite, front view
-a kite on the top of a skateboard, front view
-a kite on the bottom of a skateboard, front view
-a skateboard on the top of a surfboard, front view
-a skateboard on the bottom of a surfboard, front view
-a surfboard on the top of skis, front view
-a surfboard on the bottom of skis, front view
diff --git a/eval/requirements_vbench.txt b/eval/requirements_vbench.txt
deleted file mode 100644
index e747d773..00000000
--- a/eval/requirements_vbench.txt
+++ /dev/null
@@ -1,8 +0,0 @@
-imageio>=2.34.1
-pyiqa==0.1.10
-scikit-learn
-scikit-image
-lvis
-boto3
-easydict
-fairscale
diff --git a/eval/scripts/evaluation.py b/eval/scripts/evaluation.py
deleted file mode 100644
index e8ccb837..00000000
--- a/eval/scripts/evaluation.py
+++ /dev/null
@@ -1,212 +0,0 @@
-import sys
-from pathlib import Path
-
-sys.path.append(str(Path(__file__).resolve().parents[1]))
-
-import argparse
-import json
-import os
-from datetime import datetime
-
-import torch
-from vbench import VBench
-
-STANDARD_DIMENSION = [
-    # a: 10min
-    "subject_consistency",  # 4min
-    "imaging_quality",  # 6min
-    # b: 12min
-    "background_consistency",  # 2min
-    "motion_smoothness",  # 5min
-    "overall_consistency",  # 2min
-    "human_action",  # 3min
-    # c: 14min
-    "multiple_objects",  # 14min
-    # d: 14min
-    "spatial_relationship",  # 14min
-    # e: 12min
-    "object_class",  # 12min
-    # f: 12min
-    "color",  # 12min
-    # g: 10.5min
-    "aesthetic_quality",  # 2.5min
-    "appearance_style",  # 6min
-    "temporal_flickering",  # 2min
-    # h: 9min
-    "scene",  # 3min
-    "temporal_style",  # 2min
-    "dynamic_degree",  # 4min
-]
-
-
-def parse_args():
-
-    CUR_DIR = os.path.dirname(os.path.abspath(__file__))
-    PARENT_DIR = os.path.dirname(CUR_DIR)
-    parser = argparse.ArgumentParser(
-        description="VBench", formatter_class=argparse.RawTextHelpFormatter
-    )
-    parser.add_argument(
-        "--output_path",
-        type=str,
-        default="./evaluation_results/",
-        help="output path to save the evaluation results",
-    )
-    parser.add_argument(
-        "--full_json_dir",
-        type=str,
-        default=f"{PARENT_DIR}/vbench/VBench_full_info.json",
-        help="path to save the json file that contains the prompt and dimension information",
-    )
-    parser.add_argument(
-        "--map_json_path",
-        type=str,
-        required=True,
-        help="json file path of mapping from video path to prompt",
-    )
-    parser.add_argument(
-        "--videos_path",
-        type=str,
-        required=True,
-        help="folder that contains the sampled videos",
-    )
-    parser.add_argument(
-        "--dimension",
-        nargs="+",
-        required=False,
-        default=None,
-        help="list of evaluation dimensions, usage: --dimension <dim_1> <dim_2>",
-    )
-    parser.add_argument(
-        "--load_ckpt_from_local",
-        type=bool,
-        required=False,
-        help="whether load checkpoints from local default paths (assuming you have downloaded the checkpoints locally",
-    )
-    parser.add_argument(
-        "--read_frame",
-        type=bool,
-        required=False,
-        help="whether directly read frames, or directly read videos",
-    )
-    parser.add_argument(
-        "--mode",
-        choices=["custom_input", "vbench_standard", "vbench_category"],
-        default="vbench_standard",
-        help="""This flags determine the mode of evaluations, choose one of the following:
-        1. "custom_input": receive input prompt from either --prompt/--prompt_file flags or the filename
-        2. "vbench_standard": evaluate on standard prompt suite of VBench
-        3. "vbench_category": evaluate on specific category
-        """,
-    )
-    parser.add_argument(
-        "--custom_input",
-        action="store_true",
-        required=False,
-        help='(deprecated) use --mode="custom_input" instead',
-    )
-    parser.add_argument(
-        "--prompt",
-        type=str,
-        default="",
-        help="""Specify the input prompt
-        If not specified, filenames will be used as input prompts
-        * Mutually exclusive to --prompt_file.
-        ** This option must be used with --custom_input flag
-        """,
-    )
-    parser.add_argument(
-        "--prompt_file",
-        type=str,
-        required=False,
-        help="""Specify the path of the file that contains prompt lists
-        If not specified, filenames will be used as input prompts
-        * Mutually exclusive to --prompt.
-        ** This option must be used with --custom_input flag
-        """,
-    )
-    parser.add_argument(
-        "--category",
-        type=str,
-        required=False,
-        help="""This is for mode=='vbench_category'
-        The category to evaluate on, usage: --category=animal.
-        """,
-    )
-
-    ## for dimension specific params ###
-    parser.add_argument(
-        "--imaging_quality_preprocessing_mode",
-        type=str,
-        required=False,
-        default="longer",
-        help="""This is for setting preprocessing in imaging_quality
-        1. 'shorter': if the shorter side is more than 512, the image is resized so that the shorter side is 512.
-        2. 'longer': if the longer side is more than 512, the image is resized so that the longer side is 512.
-        3. 'shorter_centercrop': if the shorter side is more than 512, the image is resized so that the shorter side is 512.
-        Then the center 512 x 512 after resized is used for evaluation.
-        4. 'None': no preprocessing
-        """,
-    )
-    args = parser.parse_args()
-    return args
-
-
-def main():
-    args = parse_args()
-    print(f"args: {args}")
-
-    device = torch.device("cuda")
-    my_VBench = VBench(device, args.full_json_dir, args.output_path)
-
-    print(f"start evaluation")
-
-    if args.dimension is None:
-        dimensions = STANDARD_DIMENSION
-    else:
-        dimensions = args.dimension
-
-    video_path = args.videos_path
-    prompt_file = args.map_json_path
-
-    kwargs = {}
-    prompt = []
-    with open(prompt_file, "r") as f:
-        prompt = json.load(f)
-    assert (
-        type(prompt) == dict
-    ), 'Invalid prompt file format. The correct format is {"video_path": prompt, ... }'
-
-    if args.category != "":
-        kwargs["category"] = args.category
-
-    kwargs["imaging_quality_preprocessing_mode"] = (
-        args.imaging_quality_preprocessing_mode
-    )
-    result_save_name = args.output_path + f"results"
-
-    my_VBench.evaluate(
-        videos_path=video_path,
-        name=result_save_name,
-        prompt_list=prompt,  # pass in [] to read prompt from filename
-        dimension_list=dimensions,
-        local=args.load_ckpt_from_local,
-        read_frame=args.read_frame,
-        mode=args.mode,
-        **kwargs,
-    )
-
-    with open(result_save_name + "_eval_results.json", "r") as f:
-        result = json.load(f)
-
-    avg_dict = {}
-    for key, value in result.items():
-        avg_dict[key] = value[0]
-    with open(os.path.join(args.output_path, "final_results.json"), "w") as f:
-        json.dump(avg_dict, f, indent=4)
-
-    print("done")
-
-
-if __name__ == "__main__":
-    main()
diff --git a/eval/scripts/tabular_score.py b/eval/scripts/tabular_score.py
deleted file mode 100644
index e260c91c..00000000
--- a/eval/scripts/tabular_score.py
+++ /dev/null
@@ -1,150 +0,0 @@
-import argparse
-import json
-import os
-import shutil
-from pathlib import Path
-
-SEMANTIC_WEIGHT = 1
-QUALITY_WEIGHT = 4
-
-QUALITY_LIST = [
-    "subject consistency",
-    "background consistency",
-    "temporal flickering",
-    "motion smoothness",
-    "aesthetic quality",
-    "imaging quality",
-    "dynamic degree",
-]
-
-SEMANTIC_LIST = [
-    "object class",
-    "multiple objects",
-    "human action",
-    "color",
-    "spatial relationship",
-    "scene",
-    "appearance style",
-    "temporal style",
-    "overall consistency",
-]
-
-NORMALIZE_DIC = {
-    "subject consistency": {"Min": 0.1462, "Max": 1.0},
-    "background consistency": {"Min": 0.2615, "Max": 1.0},
-    "temporal flickering": {"Min": 0.6293, "Max": 1.0},
-    "motion smoothness": {"Min": 0.706, "Max": 0.9975},
-    "dynamic degree": {"Min": 0.0, "Max": 1.0},
-    "aesthetic quality": {"Min": 0.0, "Max": 1.0},
-    "imaging quality": {"Min": 0.0, "Max": 1.0},
-    "object class": {"Min": 0.0, "Max": 1.0},
-    "multiple objects": {"Min": 0.0, "Max": 1.0},
-    "human action": {"Min": 0.0, "Max": 1.0},
-    "color": {"Min": 0.0, "Max": 1.0},
-    "spatial relationship": {"Min": 0.0, "Max": 1.0},
-    "scene": {"Min": 0.0, "Max": 0.8222},
-    "appearance style": {"Min": 0.0009, "Max": 0.2855},
-    "temporal style": {"Min": 0.0, "Max": 0.364},
-    "overall consistency": {"Min": 0.0, "Max": 0.364},
-}
-
-DIM_WEIGHT = {
-    "subject consistency": 1,
-    "background consistency": 1,
-    "temporal flickering": 1,
-    "motion smoothness": 1,
-    "aesthetic quality": 1,
-    "imaging quality": 1,
-    "dynamic degree": 0.5,
-    "object class": 1,
-    "multiple objects": 1,
-    "human action": 1,
-    "color": 1,
-    "spatial relationship": 1,
-    "scene": 1,
-    "appearance style": 1,
-    "temporal style": 1,
-    "overall consistency": 1,
-}
-
-ordered_scaled_res = [
-    "total score",
-    "quality score",
-    "semantic score",
-    "subject consistency",
-    "background consistency",
-    "temporal flickering",
-    "motion smoothness",
-    "dynamic degree",
-    "aesthetic quality",
-    "imaging quality",
-    "object class",
-    "multiple objects",
-    "human action",
-    "color",
-    "spatial relationship",
-    "scene",
-    "appearance style",
-    "temporal style",
-    "overall consistency",
-]
-
-
-def main(args):
-    ori_result_path = args.result_path
-    output_dir = os.path.dirname(ori_result_path)
-    with open(ori_result_path, "r") as f:
-        full_results = json.load(f)
-
-    scaled_results = {}
-    dims = set()
-    for key, val in full_results.items():
-        dim = key.replace("_", " ") if "_" in key else key
-        scaled_score = (float(val) - NORMALIZE_DIC[dim]["Min"]) / (
-            NORMALIZE_DIC[dim]["Max"] - NORMALIZE_DIC[dim]["Min"]
-        )
-        scaled_score *= DIM_WEIGHT[dim]
-        scaled_results[dim] = scaled_score
-        dims.add(dim)
-
-    quality_score = sum([scaled_results[i] for i in QUALITY_LIST]) / sum(
-        [DIM_WEIGHT[i] for i in QUALITY_LIST]
-    )
-    semantic_score = sum([scaled_results[i] for i in SEMANTIC_LIST]) / sum(
-        [DIM_WEIGHT[i] for i in SEMANTIC_LIST]
-    )
-    scaled_results["quality score"] = quality_score
-    scaled_results["semantic score"] = semantic_score
-    scaled_results["total score"] = (
-        quality_score * QUALITY_WEIGHT + semantic_score * SEMANTIC_WEIGHT
-    ) / (QUALITY_WEIGHT + SEMANTIC_WEIGHT)
-
-    formated_scaled_results = {"items": []}
-    for key in ordered_scaled_res:
-        formated_score = format(scaled_results[key] * 100, ".2f") + "%"
-        formated_scaled_results["items"].append({key: formated_score})
-
-    # all_results.json is the same with final_results.json
-    # output_file_path = os.path.join(output_dir, "all_results.json")
-    # with open(output_file_path, "w") as outfile:
-    #     json.dump(full_results, outfile, indent=4, sort_keys=True)
-    # print(f"results saved to: {output_file_path}")
-
-    scaled_file_path = os.path.join(output_dir, "scaled_results.json")
-    with open(scaled_file_path, "w") as outfile:
-        json.dump(formated_scaled_results, outfile, indent=4, sort_keys=True)
-    print(f"results saved to: {scaled_file_path}")
-
-
-if __name__ == "__main__":
-    parser = argparse.ArgumentParser(
-        description="VBench", formatter_class=argparse.RawTextHelpFormatter
-    )
-    parser.add_argument(
-        "--result_path",
-        type=str,
-        required=True,
-        help="The path of result json file",
-    )
-    args = parser.parse_args()
-    main(args)
diff --git a/eval/vbench/VBench_full_info.json b/eval/vbench/VBench_full_info.json
deleted file mode 100644
index e60c40eb..00000000
--- a/eval/vbench/VBench_full_info.json
+++ /dev/null
@@ -1,9132 +0,0 @@
-[
-    {
-        "prompt_en": "In a still frame, a stop sign",
-        "dimension": [
-            "temporal_flickering"
-        ]
-    },
-    {
-        "prompt_en": "a toilet, frozen in time",
-        "dimension": [
-            "temporal_flickering"
-        ]
-    },
-    {
-        "prompt_en": "a laptop, frozen in time",
-        "dimension": [
-            "temporal_flickering"
-        ]
-    },
-    {
-        "prompt_en": "A tranquil tableau of alley",
-        "dimension": [
-            "temporal_flickering"
-        ]
-    },
-    {
-        "prompt_en": "A tranquil tableau of bar",
-        "dimension": [
-            "temporal_flickering"
-        ]
-    },
-    {
-        "prompt_en": "A tranquil tableau of barn",
-        "dimension": [
-            "temporal_flickering"
-        ]
-    },
-    {
-        "prompt_en": "A tranquil tableau of bathroom",
-        "dimension": [
-            "temporal_flickering"
-        ]
-    },
-    {
-        "prompt_en": "A tranquil tableau of bedroom",
-        "dimension": [
-            "temporal_flickering"
-        ]
-    },
-    {
-        "prompt_en": "A tranquil tableau of cliff",
-        "dimension": [
-            "temporal_flickering"
-        ]
-    },
-    {
-        "prompt_en": "In a still frame, courtyard",
-        "dimension": [
-            "temporal_flickering"
-        ]
-    },
-    {
-        "prompt_en": "In a still frame, gas station",
-        "dimension": [
-            "temporal_flickering"
-        ]
-    },
-    {
-        "prompt_en": "A tranquil tableau of house",
-        "dimension": [
-            "temporal_flickering"
-        ]
-    },
-    {
-        "prompt_en": "indoor gymnasium, frozen in time",
-        "dimension": [
-            "temporal_flickering"
-        ]
-    },
-    {
-        "prompt_en": "A tranquil tableau of indoor library",
-        "dimension": [
-            "temporal_flickering"
-        ]
-    },
-    {
-        "prompt_en": "A tranquil tableau of kitchen",
-        "dimension": [
-            "temporal_flickering"
-        ]
-    },
-    {
-        "prompt_en": "A tranquil tableau of palace",
-        "dimension": [
-            "temporal_flickering"
-        ]
-    },
-    {
-        "prompt_en": "In a still frame, parking lot",
-        "dimension": [
-            "temporal_flickering"
-        ]
-    },
-    {
-        "prompt_en": "In a still frame, phone booth",
-        "dimension": [
-            "temporal_flickering"
-        ]
-    },
-    {
-        "prompt_en": "A tranquil tableau of restaurant",
-        "dimension": [
-            "temporal_flickering"
-        ]
-    },
-    {
-        "prompt_en": "A tranquil tableau of tower",
-        "dimension": [
-            "temporal_flickering"
-        ]
-    },
-    {
-        "prompt_en": "A tranquil tableau of a bowl",
-        "dimension": [
-            "temporal_flickering"
-        ]
-    },
-    {
-        "prompt_en": "A tranquil tableau of an apple",
-        "dimension": [
-            "temporal_flickering"
-        ]
-    },
-    {
-        "prompt_en": "A tranquil tableau of a bench",
-        "dimension": [
-            "temporal_flickering"
-        ]
-    },
-    {
-        "prompt_en": "A tranquil tableau of a bed",
-        "dimension": [
-            "temporal_flickering"
-        ]
-    },
-    {
-        "prompt_en": "A tranquil tableau of a chair",
-        "dimension": [
-            "temporal_flickering"
-        ]
-    },
-    {
-        "prompt_en": "A tranquil tableau of a cup",
-        "dimension": [
-            "temporal_flickering"
-        ]
-    },
-    {
-        "prompt_en": "A tranquil tableau of a dining table",
-        "dimension": [
-            "temporal_flickering"
-        ]
-    },
-    {
-        "prompt_en": "In a still frame, a pear",
-        "dimension": [
-            "temporal_flickering"
-        ]
-    },
-    {
-        "prompt_en": "A tranquil tableau of a bunch of grapes",
-        "dimension": [
-            "temporal_flickering"
-        ]
-    },
-    {
-        "prompt_en": "A tranquil tableau of a bowl on the kitchen counter",
-        "dimension": [
-            "temporal_flickering"
-        ]
-    },
-    {
-        "prompt_en": "A tranquil tableau of a beautiful, handcrafted ceramic bowl",
-        "dimension": [
-            "temporal_flickering"
-        ]
-    },
-    {
-        "prompt_en": "A tranquil tableau of an antique bowl",
-        "dimension": [
-            "temporal_flickering"
-        ]
-    },
-    {
-        "prompt_en": "A tranquil tableau of an exquisite mahogany dining table",
-        "dimension": [
-            "temporal_flickering"
-        ]
-    },
-    {
-        "prompt_en": "A tranquil tableau of a wooden bench in the park",
-        "dimension": [
-            "temporal_flickering"
-        ]
-    },
-    {
-        "prompt_en": "A tranquil tableau of a beautiful wrought-iron bench surrounded by blooming flowers",
-        "dimension": [
-            "temporal_flickering"
-        ]
-    },
-    {
-        "prompt_en": "In a still frame, a park bench with a view of the lake",
-        "dimension": [
-            "temporal_flickering"
-        ]
-    },
-    {
-        "prompt_en": "A tranquil tableau of a vintage rocking chair was placed on the porch",
-        "dimension": [
-            "temporal_flickering"
-        ]
-    },
-    {
-        "prompt_en": "A tranquil tableau of the jail cell was small and dimly lit, with cold, steel bars",
-        "dimension": [
-            "temporal_flickering"
-        ]
-    },
-    {
-        "prompt_en": "A tranquil tableau of the phone booth was tucked away in a quiet alley",
-        "dimension": [
-            "temporal_flickering"
-        ]
-    },
-    {
-        "prompt_en": "a dilapidated phone booth stood as a relic of a bygone era on the sidewalk, frozen in time",
-        "dimension": [
-            "temporal_flickering"
-        ]
-    },
-    {
-        "prompt_en": "A tranquil tableau of the old red barn stood weathered and iconic against the backdrop of the countryside",
-        "dimension": [
-            "temporal_flickering"
-        ]
-    },
-    {
-        "prompt_en": "A tranquil tableau of a picturesque barn was painted a warm shade of red and nestled in a picturesque meadow",
-        "dimension": [
-            "temporal_flickering"
-        ]
-    },
-    {
-        "prompt_en": "In a still frame, within the desolate desert, an oasis unfolded, characterized by the stoic presence of palm trees and a motionless, glassy pool of water",
-        "dimension": [
-            "temporal_flickering"
-        ]
-    },
-    {
-        "prompt_en": "In a still frame, the Parthenon's majestic Doric columns stand in serene solitude atop the Acropolis, framed by the tranquil Athenian landscape",
-        "dimension": [
-            "temporal_flickering"
-        ]
-    },
-    {
-        "prompt_en": "In a still frame, the Temple of Hephaestus, with its timeless Doric grace, stands stoically against the backdrop of a quiet Athens",
-        "dimension": [
-            "temporal_flickering"
-        ]
-    },
-    {
-        "prompt_en": "In a still frame, the ornate Victorian streetlamp stands solemnly, adorned with intricate ironwork and stained glass panels",
-        "dimension": [
-            "temporal_flickering"
-        ]
-    },
-    {
-        "prompt_en": "A tranquil tableau of the Stonehenge presented itself as an enigmatic puzzle, each colossal stone meticulously placed against the backdrop of tranquility",
-        "dimension": [
-            "temporal_flickering"
-        ]
-    },
-    {
-        "prompt_en": "In a still frame, in the vast desert, an oasis nestled among dunes, featuring tall palm trees and an air of serenity",
-        "dimension": [
-            "temporal_flickering"
-        ]
-    },
-    {
-        "prompt_en": "static view on a desert scene with an oasis, palm trees, and a clear, calm pool of water",
-        "dimension": [
-            "temporal_flickering"
-        ]
-    },
-    {
-        "prompt_en": "A tranquil tableau of an ornate Victorian streetlamp standing on a cobblestone street corner, illuminating the empty night",
-        "dimension": [
-            "temporal_flickering"
-        ]
-    },
-    {
-        "prompt_en": "A tranquil tableau of a tranquil lakeside cabin nestled among tall pines, its reflection mirrored perfectly in the calm water",
-        "dimension": [
-            "temporal_flickering"
-        ]
-    },
-    {
-        "prompt_en": "In a still frame, a vintage gas lantern, adorned with intricate details, gracing a historic cobblestone square",
-        "dimension": [
-            "temporal_flickering"
-        ]
-    },
-    {
-        "prompt_en": "In a still frame, a tranquil Japanese tea ceremony room, with tatami mats, a delicate tea set, and a bonsai tree in the corner",
-        "dimension": [
-            "temporal_flickering"
-        ]
-    },
-    {
-        "prompt_en": "A tranquil tableau of the Parthenon stands resolute in its classical elegance, a timeless symbol of Athens' cultural legacy",
-        "dimension": [
-            "temporal_flickering"
-        ]
-    },
-    {
-        "prompt_en": "A tranquil tableau of in the heart of Plaka, the neoclassical architecture of the old city harmonizes with the ancient ruins",
-        "dimension": [
-            "temporal_flickering"
-        ]
-    },
-    {
-        "prompt_en": "A tranquil tableau of in the desolate beauty of the American Southwest, Chaco Canyon's ancient ruins whispered tales of an enigmatic civilization that once thrived amidst the arid landscapes",
-        "dimension": [
-            "temporal_flickering"
-        ]
-    },
-    {
-        "prompt_en": "A tranquil tableau of at the edge of the Arabian Desert, the ancient city of Petra beckoned with its enigmatic rock-carved fa\u00e7ades",
-        "dimension": [
-            "temporal_flickering"
-        ]
-    },
-    {
-        "prompt_en": "In a still frame, amidst the cobblestone streets, an Art Nouveau lamppost stood tall",
-        "dimension": [
-            "temporal_flickering"
-        ]
-    },
-    {
-        "prompt_en": "A tranquil tableau of in the quaint village square, a traditional wrought-iron streetlamp featured delicate filigree patterns and amber-hued glass panels",
-        "dimension": [
-            "temporal_flickering"
-        ]
-    },
-    {
-        "prompt_en": "A tranquil tableau of the lampposts were adorned with Art Deco motifs, their geometric shapes and frosted glass creating a sense of vintage glamour",
-        "dimension": [
-            "temporal_flickering"
-        ]
-    },
-    {
-        "prompt_en": "In a still frame, in the picturesque square, a Gothic-style lamppost adorned with intricate stone carvings added a touch of medieval charm to the setting",
-        "dimension": [
-            "temporal_flickering"
-        ]
-    },
-    {
-        "prompt_en": "In a still frame, in the heart of the old city, a row of ornate lantern-style streetlamps bathed the narrow alleyway in a warm, welcoming light",
-        "dimension": [
-            "temporal_flickering"
-        ]
-    },
-    {
-        "prompt_en": "A tranquil tableau of in the heart of the Utah desert, a massive sandstone arch spanned the horizon",
-        "dimension": [
-            "temporal_flickering"
-        ]
-    },
-    {
-        "prompt_en": "A tranquil tableau of in the Arizona desert, a massive stone bridge arched across a rugged canyon",
-        "dimension": [
-            "temporal_flickering"
-        ]
-    },
-    {
-        "prompt_en": "A tranquil tableau of in the corner of the minimalist tea room, a bonsai tree added a touch of nature's beauty to the otherwise simple and elegant space",
-        "dimension": [
-            "temporal_flickering"
-        ]
-    },
-    {
-        "prompt_en": "In a still frame, amidst the hushed ambiance of the traditional tea room, a meticulously arranged tea set awaited, with porcelain cups, a bamboo whisk",
-        "dimension": [
-            "temporal_flickering"
-        ]
-    },
-    {
-        "prompt_en": "In a still frame, nestled in the Zen garden, a rustic teahouse featured tatami seating and a traditional charcoal brazier",
-        "dimension": [
-            "temporal_flickering"
-        ]
-    },
-    {
-        "prompt_en": "A tranquil tableau of a country estate's library featured elegant wooden shelves",
-        "dimension": [
-            "temporal_flickering"
-        ]
-    },
-    {
-        "prompt_en": "A tranquil tableau of beneath the shade of a solitary oak tree, an old wooden park bench sat patiently",
-        "dimension": [
-            "temporal_flickering"
-        ]
-    },
-    {
-        "prompt_en": "A tranquil tableau of beside a tranquil pond, a weeping willow tree draped its branches gracefully over the water's surface, creating a serene tableau of reflection and calm",
-        "dimension": [
-            "temporal_flickering"
-        ]
-    },
-    {
-        "prompt_en": "A tranquil tableau of in the Zen garden, a perfectly raked gravel path led to a serene rock garden",
-        "dimension": [
-            "temporal_flickering"
-        ]
-    },
-    {
-        "prompt_en": "In a still frame, a tranquil pond was fringed by weeping cherry trees, their blossoms drifting lazily onto the glassy surface",
-        "dimension": [
-            "temporal_flickering"
-        ]
-    },
-    {
-        "prompt_en": "In a still frame, within the historic library's reading room, rows of antique leather chairs and mahogany tables offered a serene haven for literary contemplation",
-        "dimension": [
-            "temporal_flickering"
-        ]
-    },
-    {
-        "prompt_en": "A tranquil tableau of a peaceful orchid garden showcased a variety of delicate blooms",
-        "dimension": [
-            "temporal_flickering"
-        ]
-    },
-    {
-        "prompt_en": "A tranquil tableau of in the serene courtyard, a centuries-old stone well stood as a symbol of a bygone era, its mossy stones bearing witness to the passage of time",
-        "dimension": [
-            "temporal_flickering"
-        ]
-    },
-    {
-        "prompt_en": "a bird and a cat",
-        "dimension": [
-            "multiple_objects"
-        ],
-        "auxiliary_info": {
-            "multiple_objects": {
-                "object": "bird and cat"
-            }
-        }
-    },
-    {
-        "prompt_en": "a cat and a dog",
-        "dimension": [
-            "multiple_objects"
-        ],
-        "auxiliary_info": {
-            "multiple_objects": {
-                "object": "cat and dog"
-            }
-        }
-    },
-    {
-        "prompt_en": "a dog and a horse",
-        "dimension": [
-            "multiple_objects"
-        ],
-        "auxiliary_info": {
-            "multiple_objects": {
-                "object": "dog and horse"
-            }
-        }
-    },
-    {
-        "prompt_en": "a horse and a sheep",
-        "dimension": [
-            "multiple_objects"
-        ],
-        "auxiliary_info": {
-            "multiple_objects": {
-                "object": "horse and sheep"
-            }
-        }
-    },
-    {
-        "prompt_en": "a sheep and a cow",
-        "dimension": [
-            "multiple_objects"
-        ],
-        "auxiliary_info": {
-            "multiple_objects": {
-                "object": "sheep and cow"
-            }
-        }
-    },
-    {
-        "prompt_en": "a cow and an elephant",
-        "dimension": [
-            "multiple_objects"
-        ],
-        "auxiliary_info": {
-            "multiple_objects": {
-                "object": "cow and elephant"
-            }
-        }
-    },
-    {
-        "prompt_en": "an elephant and a bear",
-        "dimension": [
-            "multiple_objects"
-        ],
-        "auxiliary_info": {
-            "multiple_objects": {
-                "object": "elephant and bear"
-            }
-        }
-    },
-    {
-        "prompt_en": "a bear and a zebra",
-        "dimension": [
-            "multiple_objects"
-        ],
-        "auxiliary_info": {
-            "multiple_objects": {
-                "object": "bear and zebra"
-            }
-        }
-    },
-    {
-        "prompt_en": "a zebra and a giraffe",
-        "dimension": [
-            "multiple_objects"
-        ],
-        "auxiliary_info": {
-            "multiple_objects": {
-                "object": "zebra and giraffe"
-            }
-        }
-    },
-    {
-        "prompt_en": "a giraffe and a bird",
-        "dimension": [
-            "multiple_objects"
-        ],
-        "auxiliary_info": {
-            "multiple_objects": {
-                "object": "giraffe and bird"
-            }
-        }
-    },
-    {
-        "prompt_en": "a chair and a couch",
-        "dimension": [
-            "multiple_objects"
-        ],
-        "auxiliary_info": {
-            "multiple_objects": {
-                "object": "chair and couch"
-            }
-        }
-    },
-    {
-        "prompt_en": "a couch and a potted plant",
-        "dimension": [
-            "multiple_objects"
-        ],
-        "auxiliary_info": {
-            "multiple_objects": {
-                "object": "couch and potted plant"
-            }
-        }
-    },
-    {
-        "prompt_en": "a potted plant and a tv",
-        "dimension": [
-            "multiple_objects"
-        ],
-        "auxiliary_info": {
-            "multiple_objects": {
-                "object": "potted plant and tv"
-            }
-        }
-    },
-    {
-        "prompt_en": "a tv and a laptop",
-        "dimension": [
-            "multiple_objects"
-        ],
-        "auxiliary_info": {
-            "multiple_objects": {
-                "object": "tv and laptop"
-            }
-        }
-    },
-    {
-        "prompt_en": "a laptop and a remote",
-        "dimension": [
-            "multiple_objects"
-        ],
-        "auxiliary_info": {
-            "multiple_objects": {
-                "object": "laptop and remote"
-            }
-        }
-    },
-    {
-        "prompt_en": "a remote and a keyboard",
-        "dimension": [
-            "multiple_objects"
-        ],
-        "auxiliary_info": {
-            "multiple_objects": {
-                "object": "remote and keyboard"
-            }
-        }
-    },
-    {
-        "prompt_en": "a keyboard and a cell phone",
-        "dimension": [
-            "multiple_objects"
-        ],
-        "auxiliary_info": {
-            "multiple_objects": {
-                "object": "keyboard and cell phone"
-            }
-        }
-    },
-    {
-        "prompt_en": "a cell phone and a book",
-        "dimension": [
-            "multiple_objects"
-        ],
-        "auxiliary_info": {
-            "multiple_objects": {
-                "object": "cell phone and book"
-            }
-        }
-    },
-    {
-        "prompt_en": "a book and a clock",
-        "dimension": [
-            "multiple_objects"
-        ],
-        "auxiliary_info": {
-            "multiple_objects": {
-                "object": "book and clock"
-            }
-        }
-    },
-    {
-        "prompt_en": "a clock and a backpack",
-        "dimension": [
-            "multiple_objects"
-        ],
-        "auxiliary_info": {
-            "multiple_objects": {
-                "object": "clock and backpack"
-            }
-        }
-    },
-    {
-        "prompt_en": "a backpack and an umbrella",
-        "dimension": [
-            "multiple_objects"
-        ],
-        "auxiliary_info": {
-            "multiple_objects": {
-                "object": "backpack and umbrella"
-            }
-        }
-    },
-    {
-        "prompt_en": "an umbrella and a handbag",
-        "dimension": [
-            "multiple_objects"
-        ],
-        "auxiliary_info": {
-            "multiple_objects": {
-                "object": "umbrella and handbag"
-            }
-        }
-    },
-    {
-        "prompt_en": "a handbag and a tie",
-        "dimension": [
-            "multiple_objects"
-        ],
-        "auxiliary_info": {
-            "multiple_objects": {
-                "object": "handbag and tie"
-            }
-        }
-    },
-    {
-        "prompt_en": "a tie and a suitcase",
-        "dimension": [
-            "multiple_objects"
-        ],
-        "auxiliary_info": {
-            "multiple_objects": {
-                "object": "tie and suitcase"
-            }
-        }
-    },
-    {
-        "prompt_en": "a suitcase and a vase",
-        "dimension": [
-            "multiple_objects"
-        ],
-        "auxiliary_info": {
-            "multiple_objects": {
-                "object": "suitcase and vase"
-            }
-        }
-    },
-    {
-        "prompt_en": "a vase and scissors",
-        "dimension": [
-            "multiple_objects"
-        ],
-        "auxiliary_info": {
-            "multiple_objects": {
-                "object": "vase and scissors"
-            }
-        }
-    },
-    {
-        "prompt_en": "scissors and a teddy bear",
-        "dimension": [
-            "multiple_objects"
-        ],
-        "auxiliary_info": {
-            "multiple_objects": {
-                "object": "scissors and teddy bear"
-            }
-        }
-    },
-    {
-        "prompt_en": "a teddy bear and a frisbee",
-        "dimension": [
-            "multiple_objects"
-        ],
-        "auxiliary_info": {
-            "multiple_objects": {
-                "object": "teddy bear and frisbee"
-            }
-        }
-    },
-    {
-        "prompt_en": "a frisbee and skis",
-        "dimension": [
-            "multiple_objects"
-        ],
-        "auxiliary_info": {
-            "multiple_objects": {
-                "object": "frisbee and skis"
-            }
-        }
-    },
-    {
-        "prompt_en": "skis and a snowboard",
-        "dimension": [
-            "multiple_objects"
-        ],
-        "auxiliary_info": {
-            "multiple_objects": {
-                "object": "skis and snowboard"
-            }
-        }
-    },
-    {
-        "prompt_en": "a snowboard and a sports ball",
-        "dimension": [
-            "multiple_objects"
-        ],
-        "auxiliary_info": {
-            "multiple_objects": {
-                "object": "snowboard and sports ball"
-            }
-        }
-    },
-    {
-        "prompt_en": "a sports ball and a kite",
-        "dimension": [
-            "multiple_objects"
-        ],
-        "auxiliary_info": {
-            "multiple_objects": {
-                "object": "sports ball and kite"
-            }
-        }
-    },
-    {
-        "prompt_en": "a kite and a baseball bat",
-        "dimension": [
-            "multiple_objects"
-        ],
-        "auxiliary_info": {
-            "multiple_objects": {
-                "object": "kite and baseball bat"
-            }
-        }
-    },
-    {
-        "prompt_en": "a baseball bat and a baseball glove",
-        "dimension": [
-            "multiple_objects"
-        ],
-        "auxiliary_info": {
-            "multiple_objects": {
-                "object": "baseball bat and baseball glove"
-            }
-        }
-    },
-    {
-        "prompt_en": "a baseball glove and a skateboard",
-        "dimension": [
-            "multiple_objects"
-        ],
-        "auxiliary_info": {
-            "multiple_objects": {
-                "object": "baseball glove and skateboard"
-            }
-        }
-    },
-    {
-        "prompt_en": "a skateboard and a surfboard",
-        "dimension": [
-            "multiple_objects"
-        ],
-        "auxiliary_info": {
-            "multiple_objects": {
-                "object": "skateboard and surfboard"
-            }
-        }
-    },
-    {
-        "prompt_en": "a surfboard and a tennis racket",
-        "dimension": [
-            "multiple_objects"
-        ],
-        "auxiliary_info": {
-            "multiple_objects": {
-                "object": "surfboard and tennis racket"
-            }
-        }
-    },
-    {
-        "prompt_en": "a tennis racket and a bottle",
-        "dimension": [
-            "multiple_objects"
-        ],
-        "auxiliary_info": {
-            "multiple_objects": {
-                "object": "tennis racket and bottle"
-            }
-        }
-    },
-    {
-        "prompt_en": "a bottle and a chair",
-        "dimension": [
-            "multiple_objects"
-        ],
-        "auxiliary_info": {
-            "multiple_objects": {
-                "object": "bottle and chair"
-            }
-        }
-    },
-    {
-        "prompt_en": "an airplane and a train",
-        "dimension": [
-            "multiple_objects"
-        ],
-        "auxiliary_info": {
-            "multiple_objects": {
-                "object": "airplane and train"
-            }
-        }
-    },
-    {
-        "prompt_en": "a train and a boat",
-        "dimension": [
-            "multiple_objects"
-        ],
-        "auxiliary_info": {
-            "multiple_objects": {
-                "object": "train and boat"
-            }
-        }
-    },
-    {
-        "prompt_en": "a boat and an airplane",
-        "dimension": [
-            "multiple_objects"
-        ],
-        "auxiliary_info": {
-            "multiple_objects": {
-                "object": "boat and airplane"
-            }
-        }
-    },
-    {
-        "prompt_en": "a bicycle and a car",
-        "dimension": [
-            "multiple_objects"
-        ],
-        "auxiliary_info": {
-            "multiple_objects": {
-                "object": "bicycle and car"
-            }
-        }
-    },
-    {
-        "prompt_en": "a car and a motorcycle",
-        "dimension": [
-            "multiple_objects"
-        ],
-        "auxiliary_info": {
-            "multiple_objects": {
-                "object": "car and motorcycle"
-            }
-        }
-    },
-    {
-        "prompt_en": "a motorcycle and a bus",
-        "dimension": [
-            "multiple_objects"
-        ],
-        "auxiliary_info": {
-            "multiple_objects": {
-                "object": "motorcycle and bus"
-            }
-        }
-    },
-    {
-        "prompt_en": "a bus and a traffic light",
-        "dimension": [
-            "multiple_objects"
-        ],
-        "auxiliary_info": {
-            "multiple_objects": {
-                "object": "bus and traffic light"
-            }
-        }
-    },
-    {
-        "prompt_en": "a traffic light and a fire hydrant",
-        "dimension": [
-            "multiple_objects"
-        ],
-        "auxiliary_info": {
-            "multiple_objects": {
-                "object": "traffic light and fire hydrant"
-            }
-        }
-    },
-    {
-        "prompt_en": "a fire hydrant and a stop sign",
-        "dimension": [
-            "multiple_objects"
-        ],
-        "auxiliary_info": {
-            "multiple_objects": {
-                "object": "fire hydrant and stop sign"
-            }
-        }
-    },
-    {
-        "prompt_en": "a stop sign and a parking meter",
-        "dimension": [
-            "multiple_objects"
-        ],
-        "auxiliary_info": {
-            "multiple_objects": {
-                "object": "stop sign and parking meter"
-            }
-        }
-    },
-    {
-        "prompt_en": "a parking meter and a truck",
-        "dimension": [
-            "multiple_objects"
-        ],
-        "auxiliary_info": {
-            "multiple_objects": {
-                "object": "parking meter and truck"
-            }
-        }
-    },
-    {
-        "prompt_en": "a truck and a bicycle",
-        "dimension": [
-            "multiple_objects"
-        ],
-        "auxiliary_info": {
-            "multiple_objects": {
-                "object": "truck and bicycle"
-            }
-        }
-    },
-    {
-        "prompt_en": "a toilet and a hair drier",
-        "dimension": [
-            "multiple_objects"
-        ],
-        "auxiliary_info": {
-            "multiple_objects": {
-                "object": "toilet and hair drier"
-            }
-        }
-    },
-    {
-        "prompt_en": "a hair drier and a toothbrush",
-        "dimension": [
-            "multiple_objects"
-        ],
-        "auxiliary_info": {
-            "multiple_objects": {
-                "object": "hair drier and toothbrush"
-            }
-        }
-    },
-    {
-        "prompt_en": "a toothbrush and a sink",
-        "dimension": [
-            "multiple_objects"
-        ],
-        "auxiliary_info": {
-            "multiple_objects": {
-                "object": "toothbrush and sink"
-            }
-        }
-    },
-    {
-        "prompt_en": "a sink and a toilet",
-        "dimension": [
-            "multiple_objects"
-        ],
-        "auxiliary_info": {
-            "multiple_objects": {
-                "object": "sink and toilet"
-            }
-        }
-    },
-    {
-        "prompt_en": "a wine glass and a chair",
-        "dimension": [
-            "multiple_objects"
-        ],
-        "auxiliary_info": {
-            "multiple_objects": {
-                "object": "wine glass and chair"
-            }
-        }
-    },
-    {
-        "prompt_en": "a cup and a couch",
-        "dimension": [
-            "multiple_objects"
-        ],
-        "auxiliary_info": {
-            "multiple_objects": {
-                "object": "cup and couch"
-            }
-        }
-    },
-    {
-        "prompt_en": "a fork and a potted plant",
-        "dimension": [
-            "multiple_objects"
-        ],
-        "auxiliary_info": {
-            "multiple_objects": {
-                "object": "fork and potted plant"
-            }
-        }
-    },
-    {
-        "prompt_en": "a knife and a tv",
-        "dimension": [
-            "multiple_objects"
-        ],
-        "auxiliary_info": {
-            "multiple_objects": {
-                "object": "knife and tv"
-            }
-        }
-    },
-    {
-        "prompt_en": "a spoon and a laptop",
-        "dimension": [
-            "multiple_objects"
-        ],
-        "auxiliary_info": {
-            "multiple_objects": {
-                "object": "spoon and laptop"
-            }
-        }
-    },
-    {
-        "prompt_en": "a bowl and a remote",
-        "dimension": [
-            "multiple_objects"
-        ],
-        "auxiliary_info": {
-            "multiple_objects": {
-                "object": "bowl and remote"
-            }
-        }
-    },
-    {
-        "prompt_en": "a banana and a keyboard",
-        "dimension": [
-            "multiple_objects"
-        ],
-        "auxiliary_info": {
-            "multiple_objects": {
-                "object": "banana and keyboard"
-            }
-        }
-    },
-    {
-        "prompt_en": "an apple and a cell phone",
-        "dimension": [
-            "multiple_objects"
-        ],
-        "auxiliary_info": {
-            "multiple_objects": {
-                "object": "apple and cell phone"
-            }
-        }
-    },
-    {
-        "prompt_en": "a sandwich and a book",
-        "dimension": [
-            "multiple_objects"
-        ],
-        "auxiliary_info": {
-            "multiple_objects": {
-                "object": "sandwich and book"
-            }
-        }
-    },
-    {
-        "prompt_en": "an orange and a clock",
-        "dimension": [
-            "multiple_objects"
-        ],
-        "auxiliary_info": {
-            "multiple_objects": {
-                "object": "orange and clock"
-            }
-        }
-    },
-    {
-        "prompt_en": "broccoli and a backpack",
-        "dimension": [
-            "multiple_objects"
-        ],
-        "auxiliary_info": {
-            "multiple_objects": {
-                "object": "broccoli and backpack"
-            }
-        }
-    },
-    {
-        "prompt_en": "a carrot and an umbrella",
-        "dimension": [
-            "multiple_objects"
-        ],
-        "auxiliary_info": {
-            "multiple_objects": {
-                "object": "carrot and umbrella"
-            }
-        }
-    },
-    {
-        "prompt_en": "a hot dog and a handbag",
-        "dimension": [
-            "multiple_objects"
-        ],
-        "auxiliary_info": {
-            "multiple_objects": {
-                "object": "hot dog and handbag"
-            }
-        }
-    },
-    {
-        "prompt_en": "a pizza and a tie",
-        "dimension": [
-            "multiple_objects"
-        ],
-        "auxiliary_info": {
-            "multiple_objects": {
-                "object": "pizza and tie"
-            }
-        }
-    },
-    {
-        "prompt_en": "a donut and a suitcase",
-        "dimension": [
-            "multiple_objects"
-        ],
-        "auxiliary_info": {
-            "multiple_objects": {
-                "object": "donut and suitcase"
-            }
-        }
-    },
-    {
-        "prompt_en": "a cake and a vase",
-        "dimension": [
-            "multiple_objects"
-        ],
-        "auxiliary_info": {
-            "multiple_objects": {
-                "object": "cake and vase"
-            }
-        }
-    },
-    {
-        "prompt_en": "an oven and scissors",
-        "dimension": [
-            "multiple_objects"
-        ],
-        "auxiliary_info": {
-            "multiple_objects": {
-                "object": "oven and scissors"
-            }
-        }
-    },
-    {
-        "prompt_en": "a toaster and a teddy bear",
-        "dimension": [
-            "multiple_objects"
-        ],
-        "auxiliary_info": {
-            "multiple_objects": {
-                "object": "toaster and teddy bear"
-            }
-        }
-    },
-    {
-        "prompt_en": "a microwave and a frisbee",
-        "dimension": [
-            "multiple_objects"
-        ],
-        "auxiliary_info": {
-            "multiple_objects": {
-                "object": "microwave and frisbee"
-            }
-        }
-    },
-    {
-        "prompt_en": "a refrigerator and skis",
-        "dimension": [
-            "multiple_objects"
-        ],
-        "auxiliary_info": {
-            "multiple_objects": {
-                "object": "refrigerator and skis"
-            }
-        }
-    },
-    {
-        "prompt_en": "a bicycle and an airplane",
-        "dimension": [
-            "multiple_objects"
-        ],
-        "auxiliary_info": {
-            "multiple_objects": {
-                "object": "bicycle and airplane"
-            }
-        }
-    },
-    {
-        "prompt_en": "a car and a train",
-        "dimension": [
-            "multiple_objects"
-        ],
-        "auxiliary_info": {
-            "multiple_objects": {
-                "object": "car and train"
-            }
-        }
-    },
-    {
-        "prompt_en": "a motorcycle and a boat",
-        "dimension": [
-            "multiple_objects"
-        ],
-        "auxiliary_info": {
-            "multiple_objects": {
-                "object": "motorcycle and boat"
-            }
-        }
-    },
-    {
-        "prompt_en": "a person and a toilet",
-        "dimension": [
-            "multiple_objects"
-        ],
-        "auxiliary_info": {
-            "multiple_objects": {
-                "object": "person and toilet"
-            }
-        }
-    },
-    {
-        "prompt_en": "a person and a hair drier",
-        "dimension": [
-            "multiple_objects"
-        ],
-        "auxiliary_info": {
-            "multiple_objects": {
-                "object": "person and hair drier"
-            }
-        }
-    },
-    {
-        "prompt_en": "a person and a toothbrush",
-        "dimension": [
-            "multiple_objects"
-        ],
-        "auxiliary_info": {
-            "multiple_objects": {
-                "object": "person and toothbrush"
-            }
-        }
-    },
-    {
-        "prompt_en": "a person and a sink",
-        "dimension": [
-            "multiple_objects"
-        ],
-        "auxiliary_info": {
-            "multiple_objects": {
-                "object": "person and sink"
-            }
-        }
-    },
-    {
-        "prompt_en": "A person is riding a bike",
-        "dimension": [
-            "human_action"
-        ]
-    },
-    {
-        "prompt_en": "A person is marching",
-        "dimension": [
-            "human_action"
-        ]
-    },
-    {
-        "prompt_en": "A person is roller skating",
-        "dimension": [
-            "human_action"
-        ]
-    },
-    {
-        "prompt_en": "A person is tasting beer",
-        "dimension": [
-            "human_action"
-        ]
-    },
-    {
-        "prompt_en": "A person is clapping",
-        "dimension": [
-            "human_action"
-        ]
-    },
-    {
-        "prompt_en": "A person is drawing",
-        "dimension": [
-            "human_action"
-        ]
-    },
-    {
-        "prompt_en": "A person is petting animal (not cat)",
-        "dimension": [
-            "human_action"
-        ]
-    },
-    {
-        "prompt_en": "A person is eating watermelon",
-        "dimension": [
-            "human_action"
-        ]
-    },
-    {
-        "prompt_en": "A person is playing harp",
-        "dimension": [
-            "human_action"
-        ]
-    },
-    {
-        "prompt_en": "A person is wrestling",
-        "dimension": [
-            "human_action"
-        ]
-    },
-    {
-        "prompt_en": "A person is riding scooter",
-        "dimension": [
-            "human_action"
-        ]
-    },
-    {
-        "prompt_en": "A person is sweeping floor",
-        "dimension": [
-            "human_action"
-        ]
-    },
-    {
-        "prompt_en": "A person is skateboarding",
-        "dimension": [
-            "human_action"
-        ]
-    },
-    {
-        "prompt_en": "A person is dunking basketball",
-        "dimension": [
-            "human_action"
-        ]
-    },
-    {
-        "prompt_en": "A person is playing flute",
-        "dimension": [
-            "human_action"
-        ]
-    },
-    {
-        "prompt_en": "A person is stretching leg",
-        "dimension": [
-            "human_action"
-        ]
-    },
-    {
-        "prompt_en": "A person is tying tie",
-        "dimension": [
-            "human_action"
-        ]
-    },
-    {
-        "prompt_en": "A person is skydiving",
-        "dimension": [
-            "human_action"
-        ]
-    },
-    {
-        "prompt_en": "A person is shooting goal (soccer)",
-        "dimension": [
-            "human_action"
-        ]
-    },
-    {
-        "prompt_en": "A person is playing piano",
-        "dimension": [
-            "human_action"
-        ]
-    },
-    {
-        "prompt_en": "A person is finger snapping",
-        "dimension": [
-            "human_action"
-        ]
-    },
-    {
-        "prompt_en": "A person is canoeing or kayaking",
-        "dimension": [
-            "human_action"
-        ]
-    },
-    {
-        "prompt_en": "A person is laughing",
-        "dimension": [
-            "human_action"
-        ]
-    },
-    {
-        "prompt_en": "A person is digging",
-        "dimension": [
-            "human_action"
-        ]
-    },
-    {
-        "prompt_en": "A person is clay pottery making",
-        "dimension": [
-            "human_action"
-        ]
-    },
-    {
-        "prompt_en": "A person is shooting basketball",
-        "dimension": [
-            "human_action"
-        ]
-    },
-    {
-        "prompt_en": "A person is bending back",
-        "dimension": [
-            "human_action"
-        ]
-    },
-    {
-        "prompt_en": "A person is shaking hands",
-        "dimension": [
-            "human_action"
-        ]
-    },
-    {
-        "prompt_en": "A person is bandaging",
-        "dimension": [
-            "human_action"
-        ]
-    },
-    {
-        "prompt_en": "A person is push up",
-        "dimension": [
-            "human_action"
-        ]
-    },
-    {
-        "prompt_en": "A person is catching or throwing frisbee",
-        "dimension": [
-            "human_action"
-        ]
-    },
-    {
-        "prompt_en": "A person is playing trumpet",
-        "dimension": [
-            "human_action"
-        ]
-    },
-    {
-        "prompt_en": "A person is flying kite",
-        "dimension": [
-            "human_action"
-        ]
-    },
-    {
-        "prompt_en": "A person is filling eyebrows",
-        "dimension": [
-            "human_action"
-        ]
-    },
-    {
-        "prompt_en": "A person is shuffling cards",
-        "dimension": [
-            "human_action"
-        ]
-    },
-    {
-        "prompt_en": "A person is folding clothes",
-        "dimension": [
-            "human_action"
-        ]
-    },
-    {
-        "prompt_en": "A person is smoking",
-        "dimension": [
-            "human_action"
-        ]
-    },
-    {
-        "prompt_en": "A person is tai chi",
-        "dimension": [
-            "human_action"
-        ]
-    },
-    {
-        "prompt_en": "A person is squat",
-        "dimension": [
-            "human_action"
-        ]
-    },
-    {
-        "prompt_en": "A person is playing controller",
-        "dimension": [
-            "human_action"
-        ]
-    },
-    {
-        "prompt_en": "A person is throwing axe",
-        "dimension": [
-            "human_action"
-        ]
-    },
-    {
-        "prompt_en": "A person is giving or receiving award",
-        "dimension": [
-            "human_action"
-        ]
-    },
-    {
-        "prompt_en": "A person is air drumming",
-        "dimension": [
-            "human_action"
-        ]
-    },
-    {
-        "prompt_en": "A person is taking a shower",
-        "dimension": [
-            "human_action"
-        ]
-    },
-    {
-        "prompt_en": "A person is planting trees",
-        "dimension": [
-            "human_action"
-        ]
-    },
-    {
-        "prompt_en": "A person is sharpening knives",
-        "dimension": [
-            "human_action"
-        ]
-    },
-    {
-        "prompt_en": "A person is robot dancing",
-        "dimension": [
-            "human_action"
-        ]
-    },
-    {
-        "prompt_en": "A person is rock climbing",
-        "dimension": [
-            "human_action"
-        ]
-    },
-    {
-        "prompt_en": "A person is hula hooping",
-        "dimension": [
-            "human_action"
-        ]
-    },
-    {
-        "prompt_en": "A person is writing",
-        "dimension": [
-            "human_action"
-        ]
-    },
-    {
-        "prompt_en": "A person is bungee jumping",
-        "dimension": [
-            "human_action"
-        ]
-    },
-    {
-        "prompt_en": "A person is pushing cart",
-        "dimension": [
-            "human_action"
-        ]
-    },
-    {
-        "prompt_en": "A person is cleaning windows",
-        "dimension": [
-            "human_action"
-        ]
-    },
-    {
-        "prompt_en": "A person is cutting watermelon",
-        "dimension": [
-            "human_action"
-        ]
-    },
-    {
-        "prompt_en": "A person is cheerleading",
-        "dimension": [
-            "human_action"
-        ]
-    },
-    {
-        "prompt_en": "A person is washing hands",
-        "dimension": [
-            "human_action"
-        ]
-    },
-    {
-        "prompt_en": "A person is ironing",
-        "dimension": [
-            "human_action"
-        ]
-    },
-    {
-        "prompt_en": "A person is cutting nails",
-        "dimension": [
-            "human_action"
-        ]
-    },
-    {
-        "prompt_en": "A person is hugging",
-        "dimension": [
-            "human_action"
-        ]
-    },
-    {
-        "prompt_en": "A person is trimming or shaving beard",
-        "dimension": [
-            "human_action"
-        ]
-    },
-    {
-        "prompt_en": "A person is jogging",
-        "dimension": [
-            "human_action"
-        ]
-    },
-    {
-        "prompt_en": "A person is making bed",
-        "dimension": [
-            "human_action"
-        ]
-    },
-    {
-        "prompt_en": "A person is washing dishes",
-        "dimension": [
-            "human_action"
-        ]
-    },
-    {
-        "prompt_en": "A person is grooming dog",
-        "dimension": [
-            "human_action"
-        ]
-    },
-    {
-        "prompt_en": "A person is doing laundry",
-        "dimension": [
-            "human_action"
-        ]
-    },
-    {
-        "prompt_en": "A person is knitting",
-        "dimension": [
-            "human_action"
-        ]
-    },
-    {
-        "prompt_en": "A person is reading book",
-        "dimension": [
-            "human_action"
-        ]
-    },
-    {
-        "prompt_en": "A person is baby waking up",
-        "dimension": [
-            "human_action"
-        ]
-    },
-    {
-        "prompt_en": "A person is massaging legs",
-        "dimension": [
-            "human_action"
-        ]
-    },
-    {
-        "prompt_en": "A person is brushing teeth",
-        "dimension": [
-            "human_action"
-        ]
-    },
-    {
-        "prompt_en": "A person is crawling baby",
-        "dimension": [
-            "human_action"
-        ]
-    },
-    {
-        "prompt_en": "A person is motorcycling",
-        "dimension": [
-            "human_action"
-        ]
-    },
-    {
-        "prompt_en": "A person is driving car",
-        "dimension": [
-            "human_action"
-        ]
-    },
-    {
-        "prompt_en": "A person is sticking tongue out",
-        "dimension": [
-            "human_action"
-        ]
-    },
-    {
-        "prompt_en": "A person is shaking head",
-        "dimension": [
-            "human_action"
-        ]
-    },
-    {
-        "prompt_en": "A person is sword fighting",
-        "dimension": [
-            "human_action"
-        ]
-    },
-    {
-        "prompt_en": "A person is doing aerobics",
-        "dimension": [
-            "human_action"
-        ]
-    },
-    {
-        "prompt_en": "A person is strumming guitar",
-        "dimension": [
-            "human_action"
-        ]
-    },
-    {
-        "prompt_en": "A person is riding or walking with horse",
-        "dimension": [
-            "human_action"
-        ]
-    },
-    {
-        "prompt_en": "A person is archery",
-        "dimension": [
-            "human_action"
-        ]
-    },
-    {
-        "prompt_en": "A person is catching or throwing baseball",
-        "dimension": [
-            "human_action"
-        ]
-    },
-    {
-        "prompt_en": "A person is playing chess",
-        "dimension": [
-            "human_action"
-        ]
-    },
-    {
-        "prompt_en": "A person is rock scissors paper",
-        "dimension": [
-            "human_action"
-        ]
-    },
-    {
-        "prompt_en": "A person is using computer",
-        "dimension": [
-            "human_action"
-        ]
-    },
-    {
-        "prompt_en": "A person is arranging flowers",
-        "dimension": [
-            "human_action"
-        ]
-    },
-    {
-        "prompt_en": "A person is bending metal",
-        "dimension": [
-            "human_action"
-        ]
-    },
-    {
-        "prompt_en": "A person is ice skating",
-        "dimension": [
-            "human_action"
-        ]
-    },
-    {
-        "prompt_en": "A person is climbing a rope",
-        "dimension": [
-            "human_action"
-        ]
-    },
-    {
-        "prompt_en": "A person is crying",
-        "dimension": [
-            "human_action"
-        ]
-    },
-    {
-        "prompt_en": "A person is dancing ballet",
-        "dimension": [
-            "human_action"
-        ]
-    },
-    {
-        "prompt_en": "A person is getting a haircut",
-        "dimension": [
-            "human_action"
-        ]
-    },
-    {
-        "prompt_en": "A person is running on treadmill",
-        "dimension": [
-            "human_action"
-        ]
-    },
-    {
-        "prompt_en": "A person is kissing",
-        "dimension": [
-            "human_action"
-        ]
-    },
-    {
-        "prompt_en": "A person is counting money",
-        "dimension": [
-            "human_action"
-        ]
-    },
-    {
-        "prompt_en": "A person is barbequing",
-        "dimension": [
-            "human_action"
-        ]
-    },
-    {
-        "prompt_en": "A person is peeling apples",
-        "dimension": [
-            "human_action"
-        ]
-    },
-    {
-        "prompt_en": "A person is milking cow",
-        "dimension": [
-            "human_action"
-        ]
-    },
-    {
-        "prompt_en": "A person is shining shoes",
-        "dimension": [
-            "human_action"
-        ]
-    },
-    {
-        "prompt_en": "A person is making snowman",
-        "dimension": [
-            "human_action"
-        ]
-    },
-    {
-        "prompt_en": "A person is sailing",
-        "dimension": [
-            "human_action"
-        ]
-    },
-    {
-        "prompt_en": "a person swimming in ocean",
-        "dimension": [
-            "subject_consistency",
-            "dynamic_degree",
-            "motion_smoothness"
-        ]
-    },
-    {
-        "prompt_en": "a person giving a presentation to a room full of colleagues",
-        "dimension": [
-            "subject_consistency",
-            "dynamic_degree",
-            "motion_smoothness"
-        ]
-    },
-    {
-        "prompt_en": "a person washing the dishes",
-        "dimension": [
-            "subject_consistency",
-            "dynamic_degree",
-            "motion_smoothness"
-        ]
-    },
-    {
-        "prompt_en": "a person eating a burger",
-        "dimension": [
-            "subject_consistency",
-            "dynamic_degree",
-            "motion_smoothness"
-        ]
-    },
-    {
-        "prompt_en": "a person walking in the snowstorm",
-        "dimension": [
-            "subject_consistency",
-            "dynamic_degree",
-            "motion_smoothness"
-        ]
-    },
-    {
-        "prompt_en": "a person drinking coffee in a cafe",
-        "dimension": [
-            "subject_consistency",
-            "dynamic_degree",
-            "motion_smoothness"
-        ]
-    },
-    {
-        "prompt_en": "a person playing guitar",
-        "dimension": [
-            "subject_consistency",
-            "dynamic_degree",
-            "motion_smoothness"
-        ]
-    },
-    {
-        "prompt_en": "a bicycle leaning against a tree",
-        "dimension": [
-            "subject_consistency",
-            "dynamic_degree",
-            "motion_smoothness"
-        ]
-    },
-    {
-        "prompt_en": "a bicycle gliding through a snowy field",
-        "dimension": [
-            "subject_consistency",
-            "dynamic_degree",
-            "motion_smoothness"
-        ]
-    },
-    {
-        "prompt_en": "a bicycle slowing down to stop",
-        "dimension": [
-            "subject_consistency",
-            "dynamic_degree",
-            "motion_smoothness"
-        ]
-    },
-    {
-        "prompt_en": "a bicycle accelerating to gain speed",
-        "dimension": [
-            "subject_consistency",
-            "dynamic_degree",
-            "motion_smoothness"
-        ]
-    },
-    {
-        "prompt_en": "a car stuck in traffic during rush hour",
-        "dimension": [
-            "subject_consistency",
-            "dynamic_degree",
-            "motion_smoothness"
-        ]
-    },
-    {
-        "prompt_en": "a car turning a corner",
-        "dimension": [
-            "subject_consistency",
-            "dynamic_degree",
-            "motion_smoothness"
-        ]
-    },
-    {
-        "prompt_en": "a car slowing down to stop",
-        "dimension": [
-            "subject_consistency",
-            "dynamic_degree",
-            "motion_smoothness"
-        ]
-    },
-    {
-        "prompt_en": "a car accelerating to gain speed",
-        "dimension": [
-            "subject_consistency",
-            "dynamic_degree",
-            "motion_smoothness"
-        ]
-    },
-    {
-        "prompt_en": "a motorcycle cruising along a coastal highway",
-        "dimension": [
-            "subject_consistency",
-            "dynamic_degree",
-            "motion_smoothness"
-        ]
-    },
-    {
-        "prompt_en": "a motorcycle turning a corner",
-        "dimension": [
-            "subject_consistency",
-            "dynamic_degree",
-            "motion_smoothness"
-        ]
-    },
-    {
-        "prompt_en": "a motorcycle slowing down to stop",
-        "dimension": [
-            "subject_consistency",
-            "dynamic_degree",
-            "motion_smoothness"
-        ]
-    },
-    {
-        "prompt_en": "a motorcycle gliding through a snowy field",
-        "dimension": [
-            "subject_consistency",
-            "dynamic_degree",
-            "motion_smoothness"
-        ]
-    },
-    {
-        "prompt_en": "a motorcycle accelerating to gain speed",
-        "dimension": [
-            "subject_consistency",
-            "dynamic_degree",
-            "motion_smoothness"
-        ]
-    },
-    {
-        "prompt_en": "an airplane soaring through a clear blue sky",
-        "dimension": [
-            "subject_consistency",
-            "dynamic_degree",
-            "motion_smoothness"
-        ]
-    },
-    {
-        "prompt_en": "an airplane taking off",
-        "dimension": [
-            "subject_consistency",
-            "dynamic_degree",
-            "motion_smoothness"
-        ]
-    },
-    {
-        "prompt_en": "an airplane landing smoothly on a runway",
-        "dimension": [
-            "subject_consistency",
-            "dynamic_degree",
-            "motion_smoothness"
-        ]
-    },
-    {
-        "prompt_en": "an airplane accelerating to gain speed",
-        "dimension": [
-            "subject_consistency",
-            "dynamic_degree",
-            "motion_smoothness"
-        ]
-    },
-    {
-        "prompt_en": "a bus turning a corner",
-        "dimension": [
-            "subject_consistency",
-            "dynamic_degree",
-            "motion_smoothness"
-        ]
-    },
-    {
-        "prompt_en": "a bus stuck in traffic during rush hour",
-        "dimension": [
-            "subject_consistency",
-            "dynamic_degree",
-            "motion_smoothness"
-        ]
-    },
-    {
-        "prompt_en": "a bus accelerating to gain speed",
-        "dimension": [
-            "subject_consistency",
-            "dynamic_degree",
-            "motion_smoothness"
-        ]
-    },
-    {
-        "prompt_en": "a train speeding down the tracks",
-        "dimension": [
-            "subject_consistency",
-            "dynamic_degree",
-            "motion_smoothness"
-        ]
-    },
-    {
-        "prompt_en": "a train crossing over a tall bridge",
-        "dimension": [
-            "subject_consistency",
-            "dynamic_degree",
-            "motion_smoothness"
-        ]
-    },
-    {
-        "prompt_en": "a train accelerating to gain speed",
-        "dimension": [
-            "subject_consistency",
-            "dynamic_degree",
-            "motion_smoothness"
-        ]
-    },
-    {
-        "prompt_en": "a truck turning a corner",
-        "dimension": [
-            "subject_consistency",
-            "dynamic_degree",
-            "motion_smoothness"
-        ]
-    },
-    {
-        "prompt_en": "a truck anchored in a tranquil bay",
-        "dimension": [
-            "subject_consistency",
-            "dynamic_degree",
-            "motion_smoothness"
-        ]
-    },
-    {
-        "prompt_en": "a truck stuck in traffic during rush hour",
-        "dimension": [
-            "subject_consistency",
-            "dynamic_degree",
-            "motion_smoothness"
-        ]
-    },
-    {
-        "prompt_en": "a truck slowing down to stop",
-        "dimension": [
-            "subject_consistency",
-            "dynamic_degree",
-            "motion_smoothness"
-        ]
-    },
-    {
-        "prompt_en": "a truck accelerating to gain speed",
-        "dimension": [
-            "subject_consistency",
-            "dynamic_degree",
-            "motion_smoothness"
-        ]
-    },
-    {
-        "prompt_en": "a boat sailing smoothly on a calm lake",
-        "dimension": [
-            "subject_consistency",
-            "dynamic_degree",
-            "motion_smoothness"
-        ]
-    },
-    {
-        "prompt_en": "a boat slowing down to stop",
-        "dimension": [
-            "subject_consistency",
-            "dynamic_degree",
-            "motion_smoothness"
-        ]
-    },
-    {
-        "prompt_en": "a boat accelerating to gain speed",
-        "dimension": [
-            "subject_consistency",
-            "dynamic_degree",
-            "motion_smoothness"
-        ]
-    },
-    {
-        "prompt_en": "a bird soaring gracefully in the sky",
-        "dimension": [
-            "subject_consistency",
-            "dynamic_degree",
-            "motion_smoothness"
-        ]
-    },
-    {
-        "prompt_en": "a bird building a nest from twigs and leaves",
-        "dimension": [
-            "subject_consistency",
-            "dynamic_degree",
-            "motion_smoothness"
-        ]
-    },
-    {
-        "prompt_en": "a bird flying over a snowy forest",
-        "dimension": [
-            "subject_consistency",
-            "dynamic_degree",
-            "motion_smoothness"
-        ]
-    },
-    {
-        "prompt_en": "a cat grooming itself meticulously with its tongue",
-        "dimension": [
-            "subject_consistency",
-            "dynamic_degree",
-            "motion_smoothness"
-        ]
-    },
-    {
-        "prompt_en": "a cat playing in park",
-        "dimension": [
-            "subject_consistency",
-            "dynamic_degree",
-            "motion_smoothness"
-        ]
-    },
-    {
-        "prompt_en": "a cat drinking water",
-        "dimension": [
-            "subject_consistency",
-            "dynamic_degree",
-            "motion_smoothness"
-        ]
-    },
-    {
-        "prompt_en": "a cat running happily",
-        "dimension": [
-            "subject_consistency",
-            "dynamic_degree",
-            "motion_smoothness"
-        ]
-    },
-    {
-        "prompt_en": "a dog enjoying a peaceful walk",
-        "dimension": [
-            "subject_consistency",
-            "dynamic_degree",
-            "motion_smoothness"
-        ]
-    },
-    {
-        "prompt_en": "a dog playing in park",
-        "dimension": [
-            "subject_consistency",
-            "dynamic_degree",
-            "motion_smoothness"
-        ]
-    },
-    {
-        "prompt_en": "a dog drinking water",
-        "dimension": [
-            "subject_consistency",
-            "dynamic_degree",
-            "motion_smoothness"
-        ]
-    },
-    {
-        "prompt_en": "a dog running happily",
-        "dimension": [
-            "subject_consistency",
-            "dynamic_degree",
-            "motion_smoothness"
-        ]
-    },
-    {
-        "prompt_en": "a horse bending down to drink water from a river",
-        "dimension": [
-            "subject_consistency",
-            "dynamic_degree",
-            "motion_smoothness"
-        ]
-    },
-    {
-        "prompt_en": "a horse galloping across an open field",
-        "dimension": [
-            "subject_consistency",
-            "dynamic_degree",
-            "motion_smoothness"
-        ]
-    },
-    {
-        "prompt_en": "a horse taking a peaceful walk",
-        "dimension": [
-            "subject_consistency",
-            "dynamic_degree",
-            "motion_smoothness"
-        ]
-    },
-    {
-        "prompt_en": "a horse running to join a herd of its kind",
-        "dimension": [
-            "subject_consistency",
-            "dynamic_degree",
-            "motion_smoothness"
-        ]
-    },
-    {
-        "prompt_en": "a sheep bending down to drink water from a river",
-        "dimension": [
-            "subject_consistency",
-            "dynamic_degree",
-            "motion_smoothness"
-        ]
-    },
-    {
-        "prompt_en": "a sheep taking a peaceful walk",
-        "dimension": [
-            "subject_consistency",
-            "dynamic_degree",
-            "motion_smoothness"
-        ]
-    },
-    {
-        "prompt_en": "a sheep running to join a herd of its kind",
-        "dimension": [
-            "subject_consistency",
-            "dynamic_degree",
-            "motion_smoothness"
-        ]
-    },
-    {
-        "prompt_en": "a cow bending down to drink water from a river",
-        "dimension": [
-            "subject_consistency",
-            "dynamic_degree",
-            "motion_smoothness"
-        ]
-    },
-    {
-        "prompt_en": "a cow chewing cud while resting in a tranquil barn",
-        "dimension": [
-            "subject_consistency",
-            "dynamic_degree",
-            "motion_smoothness"
-        ]
-    },
-    {
-        "prompt_en": "a cow running to join a herd of its kind",
-        "dimension": [
-            "subject_consistency",
-            "dynamic_degree",
-            "motion_smoothness"
-        ]
-    },
-    {
-        "prompt_en": "an elephant spraying itself with water using its trunk to cool down",
-        "dimension": [
-            "subject_consistency",
-            "dynamic_degree",
-            "motion_smoothness"
-        ]
-    },
-    {
-        "prompt_en": "an elephant taking a peaceful walk",
-        "dimension": [
-            "subject_consistency",
-            "dynamic_degree",
-            "motion_smoothness"
-        ]
-    },
-    {
-        "prompt_en": "an elephant running to join a herd of its kind",
-        "dimension": [
-            "subject_consistency",
-            "dynamic_degree",
-            "motion_smoothness"
-        ]
-    },
-    {
-        "prompt_en": "a bear catching a salmon in its powerful jaws",
-        "dimension": [
-            "subject_consistency",
-            "dynamic_degree",
-            "motion_smoothness"
-        ]
-    },
-    {
-        "prompt_en": "a bear sniffing the air for scents of food",
-        "dimension": [
-            "subject_consistency",
-            "dynamic_degree",
-            "motion_smoothness"
-        ]
-    },
-    {
-        "prompt_en": "a bear climbing a tree",
-        "dimension": [
-            "subject_consistency",
-            "dynamic_degree",
-            "motion_smoothness"
-        ]
-    },
-    {
-        "prompt_en": "a bear hunting for prey",
-        "dimension": [
-            "subject_consistency",
-            "dynamic_degree",
-            "motion_smoothness"
-        ]
-    },
-    {
-        "prompt_en": "a zebra bending down to drink water from a river",
-        "dimension": [
-            "subject_consistency",
-            "dynamic_degree",
-            "motion_smoothness"
-        ]
-    },
-    {
-        "prompt_en": "a zebra running to join a herd of its kind",
-        "dimension": [
-            "subject_consistency",
-            "dynamic_degree",
-            "motion_smoothness"
-        ]
-    },
-    {
-        "prompt_en": "a zebra taking a peaceful walk",
-        "dimension": [
-            "subject_consistency",
-            "dynamic_degree",
-            "motion_smoothness"
-        ]
-    },
-    {
-        "prompt_en": "a giraffe bending down to drink water from a river",
-        "dimension": [
-            "subject_consistency",
-            "dynamic_degree",
-            "motion_smoothness"
-        ]
-    },
-    {
-        "prompt_en": "a giraffe taking a peaceful walk",
-        "dimension": [
-            "subject_consistency",
-            "dynamic_degree",
-            "motion_smoothness"
-        ]
-    },
-    {
-        "prompt_en": "a giraffe running to join a herd of its kind",
-        "dimension": [
-            "subject_consistency",
-            "dynamic_degree",
-            "motion_smoothness"
-        ]
-    },
-    {
-        "prompt_en": "a person",
-        "dimension": [
-            "object_class"
-        ],
-        "auxiliary_info": {
-            "object_class": {
-                "object": "person"
-            }
-        }
-    },
-    {
-        "prompt_en": "a bicycle",
-        "dimension": [
-            "object_class"
-        ],
-        "auxiliary_info": {
-            "object_class": {
-                "object": "bicycle"
-            }
-        }
-    },
-    {
-        "prompt_en": "a car",
-        "dimension": [
-            "object_class"
-        ],
-        "auxiliary_info": {
-            "object_class": {
-                "object": "car"
-            }
-        }
-    },
-    {
-        "prompt_en": "a motorcycle",
-        "dimension": [
-            "object_class"
-        ],
-        "auxiliary_info": {
-            "object_class": {
-                "object": "motorcycle"
-            }
-        }
-    },
-    {
-        "prompt_en": "an airplane",
-        "dimension": [
-            "object_class"
-        ],
-        "auxiliary_info": {
-            "object_class": {
-                "object": "airplane"
-            }
-        }
-    },
-    {
-        "prompt_en": "a bus",
-        "dimension": [
-            "object_class"
-        ],
-        "auxiliary_info": {
-            "object_class": {
-                "object": "bus"
-            }
-        }
-    },
-    {
-        "prompt_en": "a train",
-        "dimension": [
-            "object_class"
-        ],
-        "auxiliary_info": {
-            "object_class": {
-                "object": "train"
-            }
-        }
-    },
-    {
-        "prompt_en": "a truck",
-        "dimension": [
-            "object_class"
-        ],
-        "auxiliary_info": {
-            "object_class": {
-                "object": "truck"
-            }
-        }
-    },
-    {
-        "prompt_en": "a boat",
-        "dimension": [
-            "object_class"
-        ],
-        "auxiliary_info": {
-            "object_class": {
-                "object": "boat"
-            }
-        }
-    },
-    {
-        "prompt_en": "a traffic light",
-        "dimension": [
-            "object_class"
-        ],
-        "auxiliary_info": {
-            "object_class": {
-                "object": "traffic light"
-            }
-        }
-    },
-    {
-        "prompt_en": "a fire hydrant",
-        "dimension": [
-            "object_class"
-        ],
-        "auxiliary_info": {
-            "object_class": {
-                "object": "fire hydrant"
-            }
-        }
-    },
-    {
-        "prompt_en": "a stop sign",
-        "dimension": [
-            "object_class"
-        ],
-        "auxiliary_info": {
-            "object_class": {
-                "object": "stop sign"
-            }
-        }
-    },
-    {
-        "prompt_en": "a parking meter",
-        "dimension": [
-            "object_class"
-        ],
-        "auxiliary_info": {
-            "object_class": {
-                "object": "parking meter"
-            }
-        }
-    },
-    {
-        "prompt_en": "a bench",
-        "dimension": [
-            "object_class"
-        ],
-        "auxiliary_info": {
-            "object_class": {
-                "object": "bench"
-            }
-        }
-    },
-    {
-        "prompt_en": "a bird",
-        "dimension": [
-            "object_class"
-        ],
-        "auxiliary_info": {
-            "object_class": {
-                "object": "bird"
-            }
-        }
-    },
-    {
-        "prompt_en": "a cat",
-        "dimension": [
-            "object_class"
-        ],
-        "auxiliary_info": {
-            "object_class": {
-                "object": "cat"
-            }
-        }
-    },
-    {
-        "prompt_en": "a dog",
-        "dimension": [
-            "object_class"
-        ],
-        "auxiliary_info": {
-            "object_class": {
-                "object": "dog"
-            }
-        }
-    },
-    {
-        "prompt_en": "a horse",
-        "dimension": [
-            "object_class"
-        ],
-        "auxiliary_info": {
-            "object_class": {
-                "object": "horse"
-            }
-        }
-    },
-    {
-        "prompt_en": "a sheep",
-        "dimension": [
-            "object_class"
-        ],
-        "auxiliary_info": {
-            "object_class": {
-                "object": "sheep"
-            }
-        }
-    },
-    {
-        "prompt_en": "a cow",
-        "dimension": [
-            "object_class"
-        ],
-        "auxiliary_info": {
-            "object_class": {
-                "object": "cow"
-            }
-        }
-    },
-    {
-        "prompt_en": "an elephant",
-        "dimension": [
-            "object_class"
-        ],
-        "auxiliary_info": {
-            "object_class": {
-                "object": "elephant"
-            }
-        }
-    },
-    {
-        "prompt_en": "a bear",
-        "dimension": [
-            "object_class"
-        ],
-        "auxiliary_info": {
-            "object_class": {
-                "object": "bear"
-            }
-        }
-    },
-    {
-        "prompt_en": "a zebra",
-        "dimension": [
-            "object_class"
-        ],
-        "auxiliary_info": {
-            "object_class": {
-                "object": "zebra"
-            }
-        }
-    },
-    {
-        "prompt_en": "a giraffe",
-        "dimension": [
-            "object_class"
-        ],
-        "auxiliary_info": {
-            "object_class": {
-                "object": "giraffe"
-            }
-        }
-    },
-    {
-        "prompt_en": "a backpack",
-        "dimension": [
-            "object_class"
-        ],
-        "auxiliary_info": {
-            "object_class": {
-                "object": "backpack"
-            }
-        }
-    },
-    {
-        "prompt_en": "an umbrella",
-        "dimension": [
-            "object_class"
-        ],
-        "auxiliary_info": {
-            "object_class": {
-                "object": "umbrella"
-            }
-        }
-    },
-    {
-        "prompt_en": "a handbag",
-        "dimension": [
-            "object_class"
-        ],
-        "auxiliary_info": {
-            "object_class": {
-                "object": "handbag"
-            }
-        }
-    },
-    {
-        "prompt_en": "a tie",
-        "dimension": [
-            "object_class"
-        ],
-        "auxiliary_info": {
-            "object_class": {
-                "object": "tie"
-            }
-        }
-    },
-    {
-        "prompt_en": "a suitcase",
-        "dimension": [
-            "object_class"
-        ],
-        "auxiliary_info": {
-            "object_class": {
-                "object": "suitcase"
-            }
-        }
-    },
-    {
-        "prompt_en": "a frisbee",
-        "dimension": [
-            "object_class"
-        ],
-        "auxiliary_info": {
-            "object_class": {
-                "object": "frisbee"
-            }
-        }
-    },
-    {
-        "prompt_en": "skis",
-        "dimension": [
-            "object_class"
-        ],
-        "auxiliary_info": {
-            "object_class": {
-                "object": "skis"
-            }
-        }
-    },
-    {
-        "prompt_en": "a snowboard",
-        "dimension": [
-            "object_class"
-        ],
-        "auxiliary_info": {
-            "object_class": {
-                "object": "snowboard"
-            }
-        }
-    },
-    {
-        "prompt_en": "a sports ball",
-        "dimension": [
-            "object_class"
-        ],
-        "auxiliary_info": {
-            "object_class": {
-                "object": "sports ball"
-            }
-        }
-    },
-    {
-        "prompt_en": "a kite",
-        "dimension": [
-            "object_class"
-        ],
-        "auxiliary_info": {
-            "object_class": {
-                "object": "kite"
-            }
-        }
-    },
-    {
-        "prompt_en": "a baseball bat",
-        "dimension": [
-            "object_class"
-        ],
-        "auxiliary_info": {
-            "object_class": {
-                "object": "baseball bat"
-            }
-        }
-    },
-    {
-        "prompt_en": "a baseball glove",
-        "dimension": [
-            "object_class"
-        ],
-        "auxiliary_info": {
-            "object_class": {
-                "object": "baseball glove"
-            }
-        }
-    },
-    {
-        "prompt_en": "a skateboard",
-        "dimension": [
-            "object_class"
-        ],
-        "auxiliary_info": {
-            "object_class": {
-                "object": "skateboard"
-            }
-        }
-    },
-    {
-        "prompt_en": "a surfboard",
-        "dimension": [
-            "object_class"
-        ],
-        "auxiliary_info": {
-            "object_class": {
-                "object": "surfboard"
-            }
-        }
-    },
-    {
-        "prompt_en": "a tennis racket",
-        "dimension": [
-            "object_class"
-        ],
-        "auxiliary_info": {
-            "object_class": {
-                "object": "tennis racket"
-            }
-        }
-    },
-    {
-        "prompt_en": "a bottle",
-        "dimension": [
-            "object_class"
-        ],
-        "auxiliary_info": {
-            "object_class": {
-                "object": "bottle"
-            }
-        }
-    },
-    {
-        "prompt_en": "a wine glass",
-        "dimension": [
-            "object_class"
-        ],
-        "auxiliary_info": {
-            "object_class": {
-                "object": "wine glass"
-            }
-        }
-    },
-    {
-        "prompt_en": "a cup",
-        "dimension": [
-            "object_class"
-        ],
-        "auxiliary_info": {
-            "object_class": {
-                "object": "cup"
-            }
-        }
-    },
-    {
-        "prompt_en": "a fork",
-        "dimension": [
-            "object_class"
-        ],
-        "auxiliary_info": {
-            "object_class": {
-                "object": "fork"
-            }
-        }
-    },
-    {
-        "prompt_en": "a knife",
-        "dimension": [
-            "object_class"
-        ],
-        "auxiliary_info": {
-            "object_class": {
-                "object": "knife"
-            }
-        }
-    },
-    {
-        "prompt_en": "a spoon",
-        "dimension": [
-            "object_class"
-        ],
-        "auxiliary_info": {
-            "object_class": {
-                "object": "spoon"
-            }
-        }
-    },
-    {
-        "prompt_en": "a bowl",
-        "dimension": [
-            "object_class"
-        ],
-        "auxiliary_info": {
-            "object_class": {
-                "object": "bowl"
-            }
-        }
-    },
-    {
-        "prompt_en": "a banana",
-        "dimension": [
-            "object_class"
-        ],
-        "auxiliary_info": {
-            "object_class": {
-                "object": "banana"
-            }
-        }
-    },
-    {
-        "prompt_en": "an apple",
-        "dimension": [
-            "object_class"
-        ],
-        "auxiliary_info": {
-            "object_class": {
-                "object": "apple"
-            }
-        }
-    },
-    {
-        "prompt_en": "a sandwich",
-        "dimension": [
-            "object_class"
-        ],
-        "auxiliary_info": {
-            "object_class": {
-                "object": "sandwich"
-            }
-        }
-    },
-    {
-        "prompt_en": "an orange",
-        "dimension": [
-            "object_class"
-        ],
-        "auxiliary_info": {
-            "object_class": {
-                "object": "orange"
-            }
-        }
-    },
-    {
-        "prompt_en": "broccoli",
-        "dimension": [
-            "object_class"
-        ],
-        "auxiliary_info": {
-            "object_class": {
-                "object": "broccoli"
-            }
-        }
-    },
-    {
-        "prompt_en": "a carrot",
-        "dimension": [
-            "object_class"
-        ],
-        "auxiliary_info": {
-            "object_class": {
-                "object": "carrot"
-            }
-        }
-    },
-    {
-        "prompt_en": "a hot dog",
-        "dimension": [
-            "object_class"
-        ],
-        "auxiliary_info": {
-            "object_class": {
-                "object": "hot dog"
-            }
-        }
-    },
-    {
-        "prompt_en": "a pizza",
-        "dimension": [
-            "object_class"
-        ],
-        "auxiliary_info": {
-            "object_class": {
-                "object": "pizza"
-            }
-        }
-    },
-    {
-        "prompt_en": "a donut",
-        "dimension": [
-            "object_class"
-        ],
-        "auxiliary_info": {
-            "object_class": {
-                "object": "donut"
-            }
-        }
-    },
-    {
-        "prompt_en": "a cake",
-        "dimension": [
-            "object_class"
-        ],
-        "auxiliary_info": {
-            "object_class": {
-                "object": "cake"
-            }
-        }
-    },
-    {
-        "prompt_en": "a chair",
-        "dimension": [
-            "object_class"
-        ],
-        "auxiliary_info": {
-            "object_class": {
-                "object": "chair"
-            }
-        }
-    },
-    {
-        "prompt_en": "a couch",
-        "dimension": [
-            "object_class"
-        ],
-        "auxiliary_info": {
-            "object_class": {
-                "object": "couch"
-            }
-        }
-    },
-    {
-        "prompt_en": "a potted plant",
-        "dimension": [
-            "object_class"
-        ],
-        "auxiliary_info": {
-            "object_class": {
-                "object": "potted plant"
-            }
-        }
-    },
-    {
-        "prompt_en": "a bed",
-        "dimension": [
-            "object_class"
-        ],
-        "auxiliary_info": {
-            "object_class": {
-                "object": "bed"
-            }
-        }
-    },
-    {
-        "prompt_en": "a dining table",
-        "dimension": [
-            "object_class"
-        ],
-        "auxiliary_info": {
-            "object_class": {
-                "object": "dining table"
-            }
-        }
-    },
-    {
-        "prompt_en": "a toilet",
-        "dimension": [
-            "object_class"
-        ],
-        "auxiliary_info": {
-            "object_class": {
-                "object": "toilet"
-            }
-        }
-    },
-    {
-        "prompt_en": "a tv",
-        "dimension": [
-            "object_class"
-        ],
-        "auxiliary_info": {
-            "object_class": {
-                "object": "tv"
-            }
-        }
-    },
-    {
-        "prompt_en": "a laptop",
-        "dimension": [
-            "object_class"
-        ],
-        "auxiliary_info": {
-            "object_class": {
-                "object": "laptop"
-            }
-        }
-    },
-    {
-        "prompt_en": "a remote",
-        "dimension": [
-            "object_class"
-        ],
-        "auxiliary_info": {
-            "object_class": {
-                "object": "remote"
-            }
-        }
-    },
-    {
-        "prompt_en": "a keyboard",
-        "dimension": [
-            "object_class"
-        ],
-        "auxiliary_info": {
-            "object_class": {
-                "object": "keyboard"
-            }
-        }
-    },
-    {
-        "prompt_en": "a cell phone",
-        "dimension": [
-            "object_class"
-        ],
-        "auxiliary_info": {
-            "object_class": {
-                "object": "cell phone"
-            }
-        }
-    },
-    {
-        "prompt_en": "a microwave",
-        "dimension": [
-            "object_class"
-        ],
-        "auxiliary_info": {
-            "object_class": {
-                "object": "microwave"
-            }
-        }
-    },
-    {
-        "prompt_en": "an oven",
-        "dimension": [
-            "object_class"
-        ],
-        "auxiliary_info": {
-            "object_class": {
-                "object": "oven"
-            }
-        }
-    },
-    {
-        "prompt_en": "a toaster",
-        "dimension": [
-            "object_class"
-        ],
-        "auxiliary_info": {
-            "object_class": {
-                "object": "toaster"
-            }
-        }
-    },
-    {
-        "prompt_en": "a sink",
-        "dimension": [
-            "object_class"
-        ],
-        "auxiliary_info": {
-            "object_class": {
-                "object": "sink"
-            }
-        }
-    },
-    {
-        "prompt_en": "a refrigerator",
-        "dimension": [
-            "object_class"
-        ],
-        "auxiliary_info": {
-            "object_class": {
-                "object": "refrigerator"
-            }
-        }
-    },
-    {
-        "prompt_en": "a book",
-        "dimension": [
-            "object_class"
-        ],
-        "auxiliary_info": {
-            "object_class": {
-                "object": "book"
-            }
-        }
-    },
-    {
-        "prompt_en": "a clock",
-        "dimension": [
-            "object_class"
-        ],
-        "auxiliary_info": {
-            "object_class": {
-                "object": "clock"
-            }
-        }
-    },
-    {
-        "prompt_en": "a vase",
-        "dimension": [
-            "object_class"
-        ],
-        "auxiliary_info": {
-            "object_class": {
-                "object": "vase"
-            }
-        }
-    },
-    {
-        "prompt_en": "scissors",
-        "dimension": [
-            "object_class"
-        ],
-        "auxiliary_info": {
-            "object_class": {
-                "object": "scissors"
-            }
-        }
-    },
-    {
-        "prompt_en": "a teddy bear",
-        "dimension": [
-            "object_class"
-        ],
-        "auxiliary_info": {
-            "object_class": {
-                "object": "teddy bear"
-            }
-        }
-    },
-    {
-        "prompt_en": "a hair drier",
-        "dimension": [
-            "object_class"
-        ],
-        "auxiliary_info": {
-            "object_class": {
-                "object": "hair drier"
-            }
-        }
-    },
-    {
-        "prompt_en": "a toothbrush",
-        "dimension": [
-            "object_class"
-        ],
-        "auxiliary_info": {
-            "object_class": {
-                "object": "toothbrush"
-            }
-        }
-    },
-    {
-        "prompt_en": "a red bicycle",
-        "dimension": [
-            "color"
-        ],
-        "auxiliary_info": {
-            "color": {
-                "color": "red"
-            }
-        }
-    },
-    {
-        "prompt_en": "a green bicycle",
-        "dimension": [
-            "color"
-        ],
-        "auxiliary_info": {
-            "color": {
-                "color": "green"
-            }
-        }
-    },
-    {
-        "prompt_en": "a blue bicycle",
-        "dimension": [
-            "color"
-        ],
-        "auxiliary_info": {
-            "color": {
-                "color": "blue"
-            }
-        }
-    },
-    {
-        "prompt_en": "a yellow bicycle",
-        "dimension": [
-            "color"
-        ],
-        "auxiliary_info": {
-            "color": {
-                "color": "yellow"
-            }
-        }
-    },
-    {
-        "prompt_en": "an orange bicycle",
-        "dimension": [
-            "color"
-        ],
-        "auxiliary_info": {
-            "color": {
-                "color": "orange"
-            }
-        }
-    },
-    {
-        "prompt_en": "a purple bicycle",
-        "dimension": [
-            "color"
-        ],
-        "auxiliary_info": {
-            "color": {
-                "color": "purple"
-            }
-        }
-    },
-    {
-        "prompt_en": "a pink bicycle",
-        "dimension": [
-            "color"
-        ],
-        "auxiliary_info": {
-            "color": {
-                "color": "pink"
-            }
-        }
-    },
-    {
-        "prompt_en": "a black bicycle",
-        "dimension": [
-            "color"
-        ],
-        "auxiliary_info": {
-            "color": {
-                "color": "black"
-            }
-        }
-    },
-    {
-        "prompt_en": "a white bicycle",
-        "dimension": [
-            "color"
-        ],
-        "auxiliary_info": {
-            "color": {
-                "color": "white"
-            }
-        }
-    },
-    {
-        "prompt_en": "a red car",
-        "dimension": [
-            "color"
-        ],
-        "auxiliary_info": {
-            "color": {
-                "color": "red"
-            }
-        }
-    },
-    {
-        "prompt_en": "a green car",
-        "dimension": [
-            "color"
-        ],
-        "auxiliary_info": {
-            "color": {
-                "color": "green"
-            }
-        }
-    },
-    {
-        "prompt_en": "a blue car",
-        "dimension": [
-            "color"
-        ],
-        "auxiliary_info": {
-            "color": {
-                "color": "blue"
-            }
-        }
-    },
-    {
-        "prompt_en": "a yellow car",
-        "dimension": [
-            "color"
-        ],
-        "auxiliary_info": {
-            "color": {
-                "color": "yellow"
-            }
-        }
-    },
-    {
-        "prompt_en": "an orange car",
-        "dimension": [
-            "color"
-        ],
-        "auxiliary_info": {
-            "color": {
-                "color": "orange"
-            }
-        }
-    },
-    {
-        "prompt_en": "a purple car",
-        "dimension": [
-            "color"
-        ],
-        "auxiliary_info": {
-            "color": {
-                "color": "purple"
-            }
-        }
-    },
-    {
-        "prompt_en": "a pink car",
-        "dimension": [
-            "color"
-        ],
-        "auxiliary_info": {
-            "color": {
-                "color": "pink"
-            }
-        }
-    },
-    {
-        "prompt_en": "a black car",
-        "dimension": [
-            "color"
-        ],
-        "auxiliary_info": {
-            "color": {
-                "color": "black"
-            }
-        }
-    },
-    {
-        "prompt_en": "a white car",
-        "dimension": [
-            "color"
-        ],
-        "auxiliary_info": {
-            "color": {
-                "color": "white"
-            }
-        }
-    },
-    {
-        "prompt_en": "a red bird",
-        "dimension": [
-            "color"
-        ],
-        "auxiliary_info": {
-            "color": {
-                "color": "red"
-            }
-        }
-    },
-    {
-        "prompt_en": "a green bird",
-        "dimension": [
-            "color"
-        ],
-        "auxiliary_info": {
-            "color": {
-                "color": "green"
-            }
-        }
-    },
-    {
-        "prompt_en": "a blue bird",
-        "dimension": [
-            "color"
-        ],
-        "auxiliary_info": {
-            "color": {
-                "color": "blue"
-            }
-        }
-    },
-    {
-        "prompt_en": "a yellow bird",
-        "dimension": [
-            "color"
-        ],
-        "auxiliary_info": {
-            "color": {
-                "color": "yellow"
-            }
-        }
-    },
-    {
-        "prompt_en": "an orange bird",
-        "dimension": [
-            "color"
-        ],
-        "auxiliary_info": {
-            "color": {
-                "color": "orange"
-            }
-        }
-    },
-    {
-        "prompt_en": "a purple bird",
-        "dimension": [
-            "color"
-        ],
-        "auxiliary_info": {
-            "color": {
-                "color": "purple"
-            }
-        }
-    },
-    {
-        "prompt_en": "a pink bird",
-        "dimension": [
-            "color"
-        ],
-        "auxiliary_info": {
-            "color": {
-                "color": "pink"
-            }
-        }
-    },
-    {
-        "prompt_en": "a black bird",
-        "dimension": [
-            "color"
-        ],
-        "auxiliary_info": {
-            "color": {
-                "color": "black"
-            }
-        }
-    },
-    {
-        "prompt_en": "a white bird",
-        "dimension": [
-            "color"
-        ],
-        "auxiliary_info": {
-            "color": {
-                "color": "white"
-            }
-        }
-    },
-    {
-        "prompt_en": "a black cat",
-        "dimension": [
-            "color"
-        ],
-        "auxiliary_info": {
-            "color": {
-                "color": "black"
-            }
-        }
-    },
-    {
-        "prompt_en": "a white cat",
-        "dimension": [
-            "color"
-        ],
-        "auxiliary_info": {
-            "color": {
-                "color": "white"
-            }
-        }
-    },
-    {
-        "prompt_en": "an orange cat",
-        "dimension": [
-            "color"
-        ],
-        "auxiliary_info": {
-            "color": {
-                "color": "orange"
-            }
-        }
-    },
-    {
-        "prompt_en": "a yellow cat",
-        "dimension": [
-            "color"
-        ],
-        "auxiliary_info": {
-            "color": {
-                "color": "yellow"
-            }
-        }
-    },
-    {
-        "prompt_en": "a red umbrella",
-        "dimension": [
-            "color"
-        ],
-        "auxiliary_info": {
-            "color": {
-                "color": "red"
-            }
-        }
-    },
-    {
-        "prompt_en": "a green umbrella",
-        "dimension": [
-            "color"
-        ],
-        "auxiliary_info": {
-            "color": {
-                "color": "green"
-            }
-        }
-    },
-    {
-        "prompt_en": "a blue umbrella",
-        "dimension": [
-            "color"
-        ],
-        "auxiliary_info": {
-            "color": {
-                "color": "blue"
-            }
-        }
-    },
-    {
-        "prompt_en": "a yellow umbrella",
-        "dimension": [
-            "color"
-        ],
-        "auxiliary_info": {
-            "color": {
-                "color": "yellow"
-            }
-        }
-    },
-    {
-        "prompt_en": "an orange umbrella",
-        "dimension": [
-            "color"
-        ],
-        "auxiliary_info": {
-            "color": {
-                "color": "orange"
-            }
-        }
-    },
-    {
-        "prompt_en": "a purple umbrella",
-        "dimension": [
-            "color"
-        ],
-        "auxiliary_info": {
-            "color": {
-                "color": "purple"
-            }
-        }
-    },
-    {
-        "prompt_en": "a pink umbrella",
-        "dimension": [
-            "color"
-        ],
-        "auxiliary_info": {
-            "color": {
-                "color": "pink"
-            }
-        }
-    },
-    {
-        "prompt_en": "a black umbrella",
-        "dimension": [
-            "color"
-        ],
-        "auxiliary_info": {
-            "color": {
-                "color": "black"
-            }
-        }
-    },
-    {
-        "prompt_en": "a white umbrella",
-        "dimension": [
-            "color"
-        ],
-        "auxiliary_info": {
-            "color": {
-                "color": "white"
-            }
-        }
-    },
-    {
-        "prompt_en": "a red suitcase",
-        "dimension": [
-            "color"
-        ],
-        "auxiliary_info": {
-            "color": {
-                "color": "red"
-            }
-        }
-    },
-    {
-        "prompt_en": "a green suitcase",
-        "dimension": [
-            "color"
-        ],
-        "auxiliary_info": {
-            "color": {
-                "color": "green"
-            }
-        }
-    },
-    {
-        "prompt_en": "a blue suitcase",
-        "dimension": [
-            "color"
-        ],
-        "auxiliary_info": {
-            "color": {
-                "color": "blue"
-            }
-        }
-    },
-    {
-        "prompt_en": "a yellow suitcase",
-        "dimension": [
-            "color"
-        ],
-        "auxiliary_info": {
-            "color": {
-                "color": "yellow"
-            }
-        }
-    },
-    {
-        "prompt_en": "an orange suitcase",
-        "dimension": [
-            "color"
-        ],
-        "auxiliary_info": {
-            "color": {
-                "color": "orange"
-            }
-        }
-    },
-    {
-        "prompt_en": "a purple suitcase",
-        "dimension": [
-            "color"
-        ],
-        "auxiliary_info": {
-            "color": {
-                "color": "purple"
-            }
-        }
-    },
-    {
-        "prompt_en": "a pink suitcase",
-        "dimension": [
-            "color"
-        ],
-        "auxiliary_info": {
-            "color": {
-                "color": "pink"
-            }
-        }
-    },
-    {
-        "prompt_en": "a black suitcase",
-        "dimension": [
-            "color"
-        ],
-        "auxiliary_info": {
-            "color": {
-                "color": "black"
-            }
-        }
-    },
-    {
-        "prompt_en": "a white suitcase",
-        "dimension": [
-            "color"
-        ],
-        "auxiliary_info": {
-            "color": {
-                "color": "white"
-            }
-        }
-    },
-    {
-        "prompt_en": "a red bowl",
-        "dimension": [
-            "color"
-        ],
-        "auxiliary_info": {
-            "color": {
-                "color": "red"
-            }
-        }
-    },
-    {
-        "prompt_en": "a green bowl",
-        "dimension": [
-            "color"
-        ],
-        "auxiliary_info": {
-            "color": {
-                "color": "green"
-            }
-        }
-    },
-    {
-        "prompt_en": "a blue bowl",
-        "dimension": [
-            "color"
-        ],
-        "auxiliary_info": {
-            "color": {
-                "color": "blue"
-            }
-        }
-    },
-    {
-        "prompt_en": "a yellow bowl",
-        "dimension": [
-            "color"
-        ],
-        "auxiliary_info": {
-            "color": {
-                "color": "yellow"
-            }
-        }
-    },
-    {
-        "prompt_en": "an orange bowl",
-        "dimension": [
-            "color"
-        ],
-        "auxiliary_info": {
-            "color": {
-                "color": "orange"
-            }
-        }
-    },
-    {
-        "prompt_en": "a purple bowl",
-        "dimension": [
-            "color"
-        ],
-        "auxiliary_info": {
-            "color": {
-                "color": "purple"
-            }
-        }
-    },
-    {
-        "prompt_en": "a pink bowl",
-        "dimension": [
-            "color"
-        ],
-        "auxiliary_info": {
-            "color": {
-                "color": "pink"
-            }
-        }
-    },
-    {
-        "prompt_en": "a black bowl",
-        "dimension": [
-            "color"
-        ],
-        "auxiliary_info": {
-            "color": {
-                "color": "black"
-            }
-        }
-    },
-    {
-        "prompt_en": "a white bowl",
-        "dimension": [
-            "color"
-        ],
-        "auxiliary_info": {
-            "color": {
-                "color": "white"
-            }
-        }
-    },
-    {
-        "prompt_en": "a red chair",
-        "dimension": [
-            "color"
-        ],
-        "auxiliary_info": {
-            "color": {
-                "color": "red"
-            }
-        }
-    },
-    {
-        "prompt_en": "a green chair",
-        "dimension": [
-            "color"
-        ],
-        "auxiliary_info": {
-            "color": {
-                "color": "green"
-            }
-        }
-    },
-    {
-        "prompt_en": "a blue chair",
-        "dimension": [
-            "color"
-        ],
-        "auxiliary_info": {
-            "color": {
-                "color": "blue"
-            }
-        }
-    },
-    {
-        "prompt_en": "a yellow chair",
-        "dimension": [
-            "color"
-        ],
-        "auxiliary_info": {
-            "color": {
-                "color": "yellow"
-            }
-        }
-    },
-    {
-        "prompt_en": "an orange chair",
-        "dimension": [
-            "color"
-        ],
-        "auxiliary_info": {
-            "color": {
-                "color": "orange"
-            }
-        }
-    },
-    {
-        "prompt_en": "a purple chair",
-        "dimension": [
-            "color"
-        ],
-        "auxiliary_info": {
-            "color": {
-                "color": "purple"
-            }
-        }
-    },
-    {
-        "prompt_en": "a pink chair",
-        "dimension": [
-            "color"
-        ],
-        "auxiliary_info": {
-            "color": {
-                "color": "pink"
-            }
-        }
-    },
-    {
-        "prompt_en": "a black chair",
-        "dimension": [
-            "color"
-        ],
-        "auxiliary_info": {
-            "color": {
-                "color": "black"
-            }
-        }
-    },
-    {
-        "prompt_en": "a white chair",
-        "dimension": [
-            "color"
-        ],
-        "auxiliary_info": {
-            "color": {
-                "color": "white"
-            }
-        }
-    },
-    {
-        "prompt_en": "a red clock",
-        "dimension": [
-            "color"
-        ],
-        "auxiliary_info": {
-            "color": {
-                "color": "red"
-            }
-        }
-    },
-    {
-        "prompt_en": "a green clock",
-        "dimension": [
-            "color"
-        ],
-        "auxiliary_info": {
-            "color": {
-                "color": "green"
-            }
-        }
-    },
-    {
-        "prompt_en": "a blue clock",
-        "dimension": [
-            "color"
-        ],
-        "auxiliary_info": {
-            "color": {
-                "color": "blue"
-            }
-        }
-    },
-    {
-        "prompt_en": "a yellow clock",
-        "dimension": [
-            "color"
-        ],
-        "auxiliary_info": {
-            "color": {
-                "color": "yellow"
-            }
-        }
-    },
-    {
-        "prompt_en": "an orange clock",
-        "dimension": [
-            "color"
-        ],
-        "auxiliary_info": {
-            "color": {
-                "color": "orange"
-            }
-        }
-    },
-    {
-        "prompt_en": "a purple clock",
-        "dimension": [
-            "color"
-        ],
-        "auxiliary_info": {
-            "color": {
-                "color": "purple"
-            }
-        }
-    },
-    {
-        "prompt_en": "a pink clock",
-        "dimension": [
-            "color"
-        ],
-        "auxiliary_info": {
-            "color": {
-                "color": "pink"
-            }
-        }
-    },
-    {
-        "prompt_en": "a black clock",
-        "dimension": [
-            "color"
-        ],
-        "auxiliary_info": {
-            "color": {
-                "color": "black"
-            }
-        }
-    },
-    {
-        "prompt_en": "a white clock",
-        "dimension": [
-            "color"
-        ],
-        "auxiliary_info": {
-            "color": {
-                "color": "white"
-            }
-        }
-    },
-    {
-        "prompt_en": "a red vase",
-        "dimension": [
-            "color"
-        ],
-        "auxiliary_info": {
-            "color": {
-                "color": "red"
-            }
-        }
-    },
-    {
-        "prompt_en": "a green vase",
-        "dimension": [
-            "color"
-        ],
-        "auxiliary_info": {
-            "color": {
-                "color": "green"
-            }
-        }
-    },
-    {
-        "prompt_en": "a blue vase",
-        "dimension": [
-            "color"
-        ],
-        "auxiliary_info": {
-            "color": {
-                "color": "blue"
-            }
-        }
-    },
-    {
-        "prompt_en": "a yellow vase",
-        "dimension": [
-            "color"
-        ],
-        "auxiliary_info": {
-            "color": {
-                "color": "yellow"
-            }
-        }
-    },
-    {
-        "prompt_en": "an orange vase",
-        "dimension": [
-            "color"
-        ],
-        "auxiliary_info": {
-            "color": {
-                "color": "orange"
-            }
-        }
-    },
-    {
-        "prompt_en": "a purple vase",
-        "dimension": [
-            "color"
-        ],
-        "auxiliary_info": {
-            "color": {
-                "color": "purple"
-            }
-        }
-    },
-    {
-        "prompt_en": "a pink vase",
-        "dimension": [
-            "color"
-        ],
-        "auxiliary_info": {
-            "color": {
-                "color": "pink"
-            }
-        }
-    },
-    {
-        "prompt_en": "a black vase",
-        "dimension": [
-            "color"
-        ],
-        "auxiliary_info": {
-            "color": {
-                "color": "black"
-            }
-        }
-    },
-    {
-        "prompt_en": "a white vase",
-        "dimension": [
-            "color"
-        ],
-        "auxiliary_info": {
-            "color": {
-                "color": "white"
-            }
-        }
-    },
-    {
-        "prompt_en": "A beautiful coastal beach in spring, waves lapping on sand, Van Gogh style",
-        "dimension": [
-            "appearance_style"
-        ],
-        "auxiliary_info": {
-            "appearance_style": {
-                "appearance_style": "Van Gogh style"
-            }
-        }
-    },
-    {
-        "prompt_en": "A beautiful coastal beach in spring, waves lapping on sand, oil painting",
-        "dimension": [
-            "appearance_style"
-        ],
-        "auxiliary_info": {
-            "appearance_style": {
-                "appearance_style": "oil painting"
-            }
-        }
-    },
-    {
-        "prompt_en": "A beautiful coastal beach in spring, waves lapping on sand by Hokusai, in the style of Ukiyo",
-        "dimension": [
-            "appearance_style"
-        ],
-        "auxiliary_info": {
-            "appearance_style": {
-                "appearance_style": "by Hokusai, in the style of Ukiyo"
-            }
-        }
-    },
-    {
-        "prompt_en": "A beautiful coastal beach in spring, waves lapping on sand, black and white",
-        "dimension": [
-            "appearance_style"
-        ],
-        "auxiliary_info": {
-            "appearance_style": {
-                "appearance_style": "black and white"
-            }
-        }
-    },
-    {
-        "prompt_en": "A beautiful coastal beach in spring, waves lapping on sand, pixel art",
-        "dimension": [
-            "appearance_style"
-        ],
-        "auxiliary_info": {
-            "appearance_style": {
-                "appearance_style": "pixel art"
-            }
-        }
-    },
-    {
-        "prompt_en": "A beautiful coastal beach in spring, waves lapping on sand, in cyberpunk style",
-        "dimension": [
-            "appearance_style"
-        ],
-        "auxiliary_info": {
-            "appearance_style": {
-                "appearance_style": "in cyberpunk style"
-            }
-        }
-    },
-    {
-        "prompt_en": "A beautiful coastal beach in spring, waves lapping on sand, animated style",
-        "dimension": [
-            "appearance_style"
-        ],
-        "auxiliary_info": {
-            "appearance_style": {
-                "appearance_style": "animated style"
-            }
-        }
-    },
-    {
-        "prompt_en": "A beautiful coastal beach in spring, waves lapping on sand, watercolor painting",
-        "dimension": [
-            "appearance_style"
-        ],
-        "auxiliary_info": {
-            "appearance_style": {
-                "appearance_style": "watercolor painting"
-            }
-        }
-    },
-    {
-        "prompt_en": "A beautiful coastal beach in spring, waves lapping on sand, surrealism style",
-        "dimension": [
-            "appearance_style"
-        ],
-        "auxiliary_info": {
-            "appearance_style": {
-                "appearance_style": "surrealism style"
-            }
-        }
-    },
-    {
-        "prompt_en": "The bund Shanghai, Van Gogh style",
-        "dimension": [
-            "appearance_style"
-        ],
-        "auxiliary_info": {
-            "appearance_style": {
-                "appearance_style": "Van Gogh style"
-            }
-        }
-    },
-    {
-        "prompt_en": "The bund Shanghai, oil painting",
-        "dimension": [
-            "appearance_style"
-        ],
-        "auxiliary_info": {
-            "appearance_style": {
-                "appearance_style": "oil painting"
-            }
-        }
-    },
-    {
-        "prompt_en": "The bund Shanghai by Hokusai, in the style of Ukiyo",
-        "dimension": [
-            "appearance_style"
-        ],
-        "auxiliary_info": {
-            "appearance_style": {
-                "appearance_style": "by Hokusai, in the style of Ukiyo"
-            }
-        }
-    },
-    {
-        "prompt_en": "The bund Shanghai, black and white",
-        "dimension": [
-            "appearance_style"
-        ],
-        "auxiliary_info": {
-            "appearance_style": {
-                "appearance_style": "black and white"
-            }
-        }
-    },
-    {
-        "prompt_en": "The bund Shanghai, pixel art",
-        "dimension": [
-            "appearance_style"
-        ],
-        "auxiliary_info": {
-            "appearance_style": {
-                "appearance_style": "pixel art"
-            }
-        }
-    },
-    {
-        "prompt_en": "The bund Shanghai, in cyberpunk style",
-        "dimension": [
-            "appearance_style"
-        ],
-        "auxiliary_info": {
-            "appearance_style": {
-                "appearance_style": "in cyberpunk style"
-            }
-        }
-    },
-    {
-        "prompt_en": "The bund Shanghai, animated style",
-        "dimension": [
-            "appearance_style"
-        ],
-        "auxiliary_info": {
-            "appearance_style": {
-                "appearance_style": "animated style"
-            }
-        }
-    },
-    {
-        "prompt_en": "The bund Shanghai, watercolor painting",
-        "dimension": [
-            "appearance_style"
-        ],
-        "auxiliary_info": {
-            "appearance_style": {
-                "appearance_style": "watercolor painting"
-            }
-        }
-    },
-    {
-        "prompt_en": "The bund Shanghai, surrealism style",
-        "dimension": [
-            "appearance_style"
-        ],
-        "auxiliary_info": {
-            "appearance_style": {
-                "appearance_style": "surrealism style"
-            }
-        }
-    },
-    {
-        "prompt_en": "a shark is swimming in the ocean, Van Gogh style",
-        "dimension": [
-            "appearance_style"
-        ],
-        "auxiliary_info": {
-            "appearance_style": {
-                "appearance_style": "Van Gogh style"
-            }
-        }
-    },
-    {
-        "prompt_en": "a shark is swimming in the ocean, oil painting",
-        "dimension": [
-            "appearance_style"
-        ],
-        "auxiliary_info": {
-            "appearance_style": {
-                "appearance_style": "oil painting"
-            }
-        }
-    },
-    {
-        "prompt_en": "a shark is swimming in the ocean by Hokusai, in the style of Ukiyo",
-        "dimension": [
-            "appearance_style"
-        ],
-        "auxiliary_info": {
-            "appearance_style": {
-                "appearance_style": "by Hokusai, in the style of Ukiyo"
-            }
-        }
-    },
-    {
-        "prompt_en": "a shark is swimming in the ocean, black and white",
-        "dimension": [
-            "appearance_style"
-        ],
-        "auxiliary_info": {
-            "appearance_style": {
-                "appearance_style": "black and white"
-            }
-        }
-    },
-    {
-        "prompt_en": "a shark is swimming in the ocean, pixel art",
-        "dimension": [
-            "appearance_style"
-        ],
-        "auxiliary_info": {
-            "appearance_style": {
-                "appearance_style": "pixel art"
-            }
-        }
-    },
-    {
-        "prompt_en": "a shark is swimming in the ocean, in cyberpunk style",
-        "dimension": [
-            "appearance_style"
-        ],
-        "auxiliary_info": {
-            "appearance_style": {
-                "appearance_style": "in cyberpunk style"
-            }
-        }
-    },
-    {
-        "prompt_en": "a shark is swimming in the ocean, animated style",
-        "dimension": [
-            "appearance_style"
-        ],
-        "auxiliary_info": {
-            "appearance_style": {
-                "appearance_style": "animated style"
-            }
-        }
-    },
-    {
-        "prompt_en": "a shark is swimming in the ocean, watercolor painting",
-        "dimension": [
-            "appearance_style"
-        ],
-        "auxiliary_info": {
-            "appearance_style": {
-                "appearance_style": "watercolor painting"
-            }
-        }
-    },
-    {
-        "prompt_en": "a shark is swimming in the ocean, surrealism style",
-        "dimension": [
-            "appearance_style"
-        ],
-        "auxiliary_info": {
-            "appearance_style": {
-                "appearance_style": "surrealism style"
-            }
-        }
-    },
-    {
-        "prompt_en": "A panda drinking coffee in a cafe in Paris, Van Gogh style",
-        "dimension": [
-            "appearance_style"
-        ],
-        "auxiliary_info": {
-            "appearance_style": {
-                "appearance_style": "Van Gogh style"
-            }
-        }
-    },
-    {
-        "prompt_en": "A panda drinking coffee in a cafe in Paris, oil painting",
-        "dimension": [
-            "appearance_style"
-        ],
-        "auxiliary_info": {
-            "appearance_style": {
-                "appearance_style": "oil painting"
-            }
-        }
-    },
-    {
-        "prompt_en": "A panda drinking coffee in a cafe in Paris by Hokusai, in the style of Ukiyo",
-        "dimension": [
-            "appearance_style"
-        ],
-        "auxiliary_info": {
-            "appearance_style": {
-                "appearance_style": "by Hokusai, in the style of Ukiyo"
-            }
-        }
-    },
-    {
-        "prompt_en": "A panda drinking coffee in a cafe in Paris, black and white",
-        "dimension": [
-            "appearance_style"
-        ],
-        "auxiliary_info": {
-            "appearance_style": {
-                "appearance_style": "black and white"
-            }
-        }
-    },
-    {
-        "prompt_en": "A panda drinking coffee in a cafe in Paris, pixel art",
-        "dimension": [
-            "appearance_style"
-        ],
-        "auxiliary_info": {
-            "appearance_style": {
-                "appearance_style": "pixel art"
-            }
-        }
-    },
-    {
-        "prompt_en": "A panda drinking coffee in a cafe in Paris, in cyberpunk style",
-        "dimension": [
-            "appearance_style"
-        ],
-        "auxiliary_info": {
-            "appearance_style": {
-                "appearance_style": "in cyberpunk style"
-            }
-        }
-    },
-    {
-        "prompt_en": "A panda drinking coffee in a cafe in Paris, animated style",
-        "dimension": [
-            "appearance_style"
-        ],
-        "auxiliary_info": {
-            "appearance_style": {
-                "appearance_style": "animated style"
-            }
-        }
-    },
-    {
-        "prompt_en": "A panda drinking coffee in a cafe in Paris, watercolor painting",
-        "dimension": [
-            "appearance_style"
-        ],
-        "auxiliary_info": {
-            "appearance_style": {
-                "appearance_style": "watercolor painting"
-            }
-        }
-    },
-    {
-        "prompt_en": "A panda drinking coffee in a cafe in Paris, surrealism style",
-        "dimension": [
-            "appearance_style"
-        ],
-        "auxiliary_info": {
-            "appearance_style": {
-                "appearance_style": "surrealism style"
-            }
-        }
-    },
-    {
-        "prompt_en": "A cute happy Corgi playing in park, sunset, Van Gogh style",
-        "dimension": [
-            "appearance_style"
-        ],
-        "auxiliary_info": {
-            "appearance_style": {
-                "appearance_style": "Van Gogh style"
-            }
-        }
-    },
-    {
-        "prompt_en": "A cute happy Corgi playing in park, sunset, oil painting",
-        "dimension": [
-            "appearance_style"
-        ],
-        "auxiliary_info": {
-            "appearance_style": {
-                "appearance_style": "oil painting"
-            }
-        }
-    },
-    {
-        "prompt_en": "A cute happy Corgi playing in park, sunset by Hokusai, in the style of Ukiyo",
-        "dimension": [
-            "appearance_style"
-        ],
-        "auxiliary_info": {
-            "appearance_style": {
-                "appearance_style": "by Hokusai, in the style of Ukiyo"
-            }
-        }
-    },
-    {
-        "prompt_en": "A cute happy Corgi playing in park, sunset, black and white",
-        "dimension": [
-            "appearance_style"
-        ],
-        "auxiliary_info": {
-            "appearance_style": {
-                "appearance_style": "black and white"
-            }
-        }
-    },
-    {
-        "prompt_en": "A cute happy Corgi playing in park, sunset, pixel art",
-        "dimension": [
-            "appearance_style"
-        ],
-        "auxiliary_info": {
-            "appearance_style": {
-                "appearance_style": "pixel art"
-            }
-        }
-    },
-    {
-        "prompt_en": "A cute happy Corgi playing in park, sunset, in cyberpunk style",
-        "dimension": [
-            "appearance_style"
-        ],
-        "auxiliary_info": {
-            "appearance_style": {
-                "appearance_style": "in cyberpunk style"
-            }
-        }
-    },
-    {
-        "prompt_en": "A cute happy Corgi playing in park, sunset, animated style",
-        "dimension": [
-            "appearance_style"
-        ],
-        "auxiliary_info": {
-            "appearance_style": {
-                "appearance_style": "animated style"
-            }
-        }
-    },
-    {
-        "prompt_en": "A cute happy Corgi playing in park, sunset, watercolor painting",
-        "dimension": [
-            "appearance_style"
-        ],
-        "auxiliary_info": {
-            "appearance_style": {
-                "appearance_style": "watercolor painting"
-            }
-        }
-    },
-    {
-        "prompt_en": "A cute happy Corgi playing in park, sunset, surrealism style",
-        "dimension": [
-            "appearance_style"
-        ],
-        "auxiliary_info": {
-            "appearance_style": {
-                "appearance_style": "surrealism style"
-            }
-        }
-    },
-    {
-        "prompt_en": "Gwen Stacy reading a book, Van Gogh style",
-        "dimension": [
-            "appearance_style"
-        ],
-        "auxiliary_info": {
-            "appearance_style": {
-                "appearance_style": "Van Gogh style"
-            }
-        }
-    },
-    {
-        "prompt_en": "Gwen Stacy reading a book, oil painting",
-        "dimension": [
-            "appearance_style"
-        ],
-        "auxiliary_info": {
-            "appearance_style": {
-                "appearance_style": "oil painting"
-            }
-        }
-    },
-    {
-        "prompt_en": "Gwen Stacy reading a book by Hokusai, in the style of Ukiyo",
-        "dimension": [
-            "appearance_style"
-        ],
-        "auxiliary_info": {
-            "appearance_style": {
-                "appearance_style": "by Hokusai, in the style of Ukiyo"
-            }
-        }
-    },
-    {
-        "prompt_en": "Gwen Stacy reading a book, black and white",
-        "dimension": [
-            "appearance_style"
-        ],
-        "auxiliary_info": {
-            "appearance_style": {
-                "appearance_style": "black and white"
-            }
-        }
-    },
-    {
-        "prompt_en": "Gwen Stacy reading a book, pixel art",
-        "dimension": [
-            "appearance_style"
-        ],
-        "auxiliary_info": {
-            "appearance_style": {
-                "appearance_style": "pixel art"
-            }
-        }
-    },
-    {
-        "prompt_en": "Gwen Stacy reading a book, in cyberpunk style",
-        "dimension": [
-            "appearance_style"
-        ],
-        "auxiliary_info": {
-            "appearance_style": {
-                "appearance_style": "in cyberpunk style"
-            }
-        }
-    },
-    {
-        "prompt_en": "Gwen Stacy reading a book, animated style",
-        "dimension": [
-            "appearance_style"
-        ],
-        "auxiliary_info": {
-            "appearance_style": {
-                "appearance_style": "animated style"
-            }
-        }
-    },
-    {
-        "prompt_en": "Gwen Stacy reading a book, watercolor painting",
-        "dimension": [
-            "appearance_style"
-        ],
-        "auxiliary_info": {
-            "appearance_style": {
-                "appearance_style": "watercolor painting"
-            }
-        }
-    },
-    {
-        "prompt_en": "Gwen Stacy reading a book, surrealism style",
-        "dimension": [
-            "appearance_style"
-        ],
-        "auxiliary_info": {
-            "appearance_style": {
-                "appearance_style": "surrealism style"
-            }
-        }
-    },
-    {
-        "prompt_en": "A boat sailing leisurely along the Seine River with the Eiffel Tower in background, Van Gogh style",
-        "dimension": [
-            "appearance_style"
-        ],
-        "auxiliary_info": {
-            "appearance_style": {
-                "appearance_style": "Van Gogh style"
-            }
-        }
-    },
-    {
-        "prompt_en": "A boat sailing leisurely along the Seine River with the Eiffel Tower in background, oil painting",
-        "dimension": [
-            "appearance_style"
-        ],
-        "auxiliary_info": {
-            "appearance_style": {
-                "appearance_style": "oil painting"
-            }
-        }
-    },
-    {
-        "prompt_en": "A boat sailing leisurely along the Seine River with the Eiffel Tower in background by Hokusai, in the style of Ukiyo",
-        "dimension": [
-            "appearance_style"
-        ],
-        "auxiliary_info": {
-            "appearance_style": {
-                "appearance_style": "by Hokusai, in the style of Ukiyo"
-            }
-        }
-    },
-    {
-        "prompt_en": "A boat sailing leisurely along the Seine River with the Eiffel Tower in background, black and white",
-        "dimension": [
-            "appearance_style"
-        ],
-        "auxiliary_info": {
-            "appearance_style": {
-                "appearance_style": "black and white"
-            }
-        }
-    },
-    {
-        "prompt_en": "A boat sailing leisurely along the Seine River with the Eiffel Tower in background, pixel art",
-        "dimension": [
-            "appearance_style"
-        ],
-        "auxiliary_info": {
-            "appearance_style": {
-                "appearance_style": "pixel art"
-            }
-        }
-    },
-    {
-        "prompt_en": "A boat sailing leisurely along the Seine River with the Eiffel Tower in background, in cyberpunk style",
-        "dimension": [
-            "appearance_style"
-        ],
-        "auxiliary_info": {
-            "appearance_style": {
-                "appearance_style": "in cyberpunk style"
-            }
-        }
-    },
-    {
-        "prompt_en": "A boat sailing leisurely along the Seine River with the Eiffel Tower in background, animated style",
-        "dimension": [
-            "appearance_style"
-        ],
-        "auxiliary_info": {
-            "appearance_style": {
-                "appearance_style": "animated style"
-            }
-        }
-    },
-    {
-        "prompt_en": "A boat sailing leisurely along the Seine River with the Eiffel Tower in background, watercolor painting",
-        "dimension": [
-            "appearance_style"
-        ],
-        "auxiliary_info": {
-            "appearance_style": {
-                "appearance_style": "watercolor painting"
-            }
-        }
-    },
-    {
-        "prompt_en": "A boat sailing leisurely along the Seine River with the Eiffel Tower in background, surrealism style",
-        "dimension": [
-            "appearance_style"
-        ],
-        "auxiliary_info": {
-            "appearance_style": {
-                "appearance_style": "surrealism style"
-            }
-        }
-    },
-    {
-        "prompt_en": "A couple in formal evening wear going home get caught in a heavy downpour with umbrellas, Van Gogh style",
-        "dimension": [
-            "appearance_style"
-        ],
-        "auxiliary_info": {
-            "appearance_style": {
-                "appearance_style": "Van Gogh style"
-            }
-        }
-    },
-    {
-        "prompt_en": "A couple in formal evening wear going home get caught in a heavy downpour with umbrellas, oil painting",
-        "dimension": [
-            "appearance_style"
-        ],
-        "auxiliary_info": {
-            "appearance_style": {
-                "appearance_style": "oil painting"
-            }
-        }
-    },
-    {
-        "prompt_en": "A couple in formal evening wear going home get caught in a heavy downpour with umbrellas by Hokusai, in the style of Ukiyo",
-        "dimension": [
-            "appearance_style"
-        ],
-        "auxiliary_info": {
-            "appearance_style": {
-                "appearance_style": "by Hokusai, in the style of Ukiyo"
-            }
-        }
-    },
-    {
-        "prompt_en": "A couple in formal evening wear going home get caught in a heavy downpour with umbrellas, black and white",
-        "dimension": [
-            "appearance_style"
-        ],
-        "auxiliary_info": {
-            "appearance_style": {
-                "appearance_style": "black and white"
-            }
-        }
-    },
-    {
-        "prompt_en": "A couple in formal evening wear going home get caught in a heavy downpour with umbrellas, pixel art",
-        "dimension": [
-            "appearance_style"
-        ],
-        "auxiliary_info": {
-            "appearance_style": {
-                "appearance_style": "pixel art"
-            }
-        }
-    },
-    {
-        "prompt_en": "A couple in formal evening wear going home get caught in a heavy downpour with umbrellas, in cyberpunk style",
-        "dimension": [
-            "appearance_style"
-        ],
-        "auxiliary_info": {
-            "appearance_style": {
-                "appearance_style": "in cyberpunk style"
-            }
-        }
-    },
-    {
-        "prompt_en": "A couple in formal evening wear going home get caught in a heavy downpour with umbrellas, animated style",
-        "dimension": [
-            "appearance_style"
-        ],
-        "auxiliary_info": {
-            "appearance_style": {
-                "appearance_style": "animated style"
-            }
-        }
-    },
-    {
-        "prompt_en": "A couple in formal evening wear going home get caught in a heavy downpour with umbrellas, watercolor painting",
-        "dimension": [
-            "appearance_style"
-        ],
-        "auxiliary_info": {
-            "appearance_style": {
-                "appearance_style": "watercolor painting"
-            }
-        }
-    },
-    {
-        "prompt_en": "A couple in formal evening wear going home get caught in a heavy downpour with umbrellas, surrealism style",
-        "dimension": [
-            "appearance_style"
-        ],
-        "auxiliary_info": {
-            "appearance_style": {
-                "appearance_style": "surrealism style"
-            }
-        }
-    },
-    {
-        "prompt_en": "An astronaut flying in space, Van Gogh style",
-        "dimension": [
-            "appearance_style"
-        ],
-        "auxiliary_info": {
-            "appearance_style": {
-                "appearance_style": "Van Gogh style"
-            }
-        }
-    },
-    {
-        "prompt_en": "An astronaut flying in space, oil painting",
-        "dimension": [
-            "appearance_style"
-        ],
-        "auxiliary_info": {
-            "appearance_style": {
-                "appearance_style": "oil painting"
-            }
-        }
-    },
-    {
-        "prompt_en": "An astronaut flying in space by Hokusai, in the style of Ukiyo",
-        "dimension": [
-            "appearance_style"
-        ],
-        "auxiliary_info": {
-            "appearance_style": {
-                "appearance_style": "by Hokusai, in the style of Ukiyo"
-            }
-        }
-    },
-    {
-        "prompt_en": "An astronaut flying in space, black and white",
-        "dimension": [
-            "appearance_style"
-        ],
-        "auxiliary_info": {
-            "appearance_style": {
-                "appearance_style": "black and white"
-            }
-        }
-    },
-    {
-        "prompt_en": "An astronaut flying in space, pixel art",
-        "dimension": [
-            "appearance_style"
-        ],
-        "auxiliary_info": {
-            "appearance_style": {
-                "appearance_style": "pixel art"
-            }
-        }
-    },
-    {
-        "prompt_en": "An astronaut flying in space, in cyberpunk style",
-        "dimension": [
-            "appearance_style"
-        ],
-        "auxiliary_info": {
-            "appearance_style": {
-                "appearance_style": "in cyberpunk style"
-            }
-        }
-    },
-    {
-        "prompt_en": "An astronaut flying in space, animated style",
-        "dimension": [
-            "appearance_style"
-        ],
-        "auxiliary_info": {
-            "appearance_style": {
-                "appearance_style": "animated style"
-            }
-        }
-    },
-    {
-        "prompt_en": "An astronaut flying in space, watercolor painting",
-        "dimension": [
-            "appearance_style"
-        ],
-        "auxiliary_info": {
-            "appearance_style": {
-                "appearance_style": "watercolor painting"
-            }
-        }
-    },
-    {
-        "prompt_en": "An astronaut flying in space, surrealism style",
-        "dimension": [
-            "appearance_style"
-        ],
-        "auxiliary_info": {
-            "appearance_style": {
-                "appearance_style": "surrealism style"
-            }
-        }
-    },
-    {
-        "prompt_en": "Snow rocky mountains peaks canyon. snow blanketed rocky mountains surround and shadow deep canyons. the canyons twist and bend through the high elevated mountain peaks, Van Gogh style",
-        "dimension": [
-            "appearance_style"
-        ],
-        "auxiliary_info": {
-            "appearance_style": {
-                "appearance_style": "Van Gogh style"
-            }
-        }
-    },
-    {
-        "prompt_en": "Snow rocky mountains peaks canyon. snow blanketed rocky mountains surround and shadow deep canyons. the canyons twist and bend through the high elevated mountain peaks, oil painting",
-        "dimension": [
-            "appearance_style"
-        ],
-        "auxiliary_info": {
-            "appearance_style": {
-                "appearance_style": "oil painting"
-            }
-        }
-    },
-    {
-        "prompt_en": "Snow rocky mountains peaks canyon. snow blanketed rocky mountains surround and shadow deep canyons. the canyons twist and bend through the high elevated mountain peaks by Hokusai, in the style of Ukiyo",
-        "dimension": [
-            "appearance_style"
-        ],
-        "auxiliary_info": {
-            "appearance_style": {
-                "appearance_style": "by Hokusai, in the style of Ukiyo"
-            }
-        }
-    },
-    {
-        "prompt_en": "Snow rocky mountains peaks canyon. snow blanketed rocky mountains surround and shadow deep canyons. the canyons twist and bend through the high elevated mountain peaks, black and white",
-        "dimension": [
-            "appearance_style"
-        ],
-        "auxiliary_info": {
-            "appearance_style": {
-                "appearance_style": "black and white"
-            }
-        }
-    },
-    {
-        "prompt_en": "Snow rocky mountains peaks canyon. snow blanketed rocky mountains surround and shadow deep canyons. the canyons twist and bend through the high elevated mountain peaks, pixel art",
-        "dimension": [
-            "appearance_style"
-        ],
-        "auxiliary_info": {
-            "appearance_style": {
-                "appearance_style": "pixel art"
-            }
-        }
-    },
-    {
-        "prompt_en": "Snow rocky mountains peaks canyon. snow blanketed rocky mountains surround and shadow deep canyons. the canyons twist and bend through the high elevated mountain peaks, in cyberpunk style",
-        "dimension": [
-            "appearance_style"
-        ],
-        "auxiliary_info": {
-            "appearance_style": {
-                "appearance_style": "in cyberpunk style"
-            }
-        }
-    },
-    {
-        "prompt_en": "Snow rocky mountains peaks canyon. snow blanketed rocky mountains surround and shadow deep canyons. the canyons twist and bend through the high elevated mountain peaks, animated style",
-        "dimension": [
-            "appearance_style"
-        ],
-        "auxiliary_info": {
-            "appearance_style": {
-                "appearance_style": "animated style"
-            }
-        }
-    },
-    {
-        "prompt_en": "Snow rocky mountains peaks canyon. snow blanketed rocky mountains surround and shadow deep canyons. the canyons twist and bend through the high elevated mountain peaks, watercolor painting",
-        "dimension": [
-            "appearance_style"
-        ],
-        "auxiliary_info": {
-            "appearance_style": {
-                "appearance_style": "watercolor painting"
-            }
-        }
-    },
-    {
-        "prompt_en": "Snow rocky mountains peaks canyon. snow blanketed rocky mountains surround and shadow deep canyons. the canyons twist and bend through the high elevated mountain peaks, surrealism style",
-        "dimension": [
-            "appearance_style"
-        ],
-        "auxiliary_info": {
-            "appearance_style": {
-                "appearance_style": "surrealism style"
-            }
-        }
-    },
-    {
-        "prompt_en": "A beautiful coastal beach in spring, waves lapping on sand, in super slow motion",
-        "dimension": [
-            "temporal_style"
-        ]
-    },
-    {
-        "prompt_en": "A beautiful coastal beach in spring, waves lapping on sand, zoom in",
-        "dimension": [
-            "temporal_style"
-        ]
-    },
-    {
-        "prompt_en": "A beautiful coastal beach in spring, waves lapping on sand, zoom out",
-        "dimension": [
-            "temporal_style"
-        ]
-    },
-    {
-        "prompt_en": "A beautiful coastal beach in spring, waves lapping on sand, pan left",
-        "dimension": [
-            "temporal_style"
-        ]
-    },
-    {
-        "prompt_en": "A beautiful coastal beach in spring, waves lapping on sand, pan right",
-        "dimension": [
-            "temporal_style"
-        ]
-    },
-    {
-        "prompt_en": "A beautiful coastal beach in spring, waves lapping on sand, tilt up",
-        "dimension": [
-            "temporal_style"
-        ]
-    },
-    {
-        "prompt_en": "A beautiful coastal beach in spring, waves lapping on sand, tilt down",
-        "dimension": [
-            "temporal_style"
-        ]
-    },
-    {
-        "prompt_en": "A beautiful coastal beach in spring, waves lapping on sand, with an intense shaking effect",
-        "dimension": [
-            "temporal_style"
-        ]
-    },
-    {
-        "prompt_en": "A beautiful coastal beach in spring, waves lapping on sand, featuring a steady and smooth perspective",
-        "dimension": [
-            "temporal_style"
-        ]
-    },
-    {
-        "prompt_en": "A beautiful coastal beach in spring, waves lapping on sand, racking focus",
-        "dimension": [
-            "temporal_style"
-        ]
-    },
-    {
-        "prompt_en": "The bund Shanghai, in super slow motion",
-        "dimension": [
-            "temporal_style"
-        ]
-    },
-    {
-        "prompt_en": "The bund Shanghai, zoom in",
-        "dimension": [
-            "temporal_style"
-        ]
-    },
-    {
-        "prompt_en": "The bund Shanghai, zoom out",
-        "dimension": [
-            "temporal_style"
-        ]
-    },
-    {
-        "prompt_en": "The bund Shanghai, pan left",
-        "dimension": [
-            "temporal_style"
-        ]
-    },
-    {
-        "prompt_en": "The bund Shanghai, pan right",
-        "dimension": [
-            "temporal_style"
-        ]
-    },
-    {
-        "prompt_en": "The bund Shanghai, tilt up",
-        "dimension": [
-            "temporal_style"
-        ]
-    },
-    {
-        "prompt_en": "The bund Shanghai, tilt down",
-        "dimension": [
-            "temporal_style"
-        ]
-    },
-    {
-        "prompt_en": "The bund Shanghai, with an intense shaking effect",
-        "dimension": [
-            "temporal_style"
-        ]
-    },
-    {
-        "prompt_en": "The bund Shanghai, featuring a steady and smooth perspective",
-        "dimension": [
-            "temporal_style"
-        ]
-    },
-    {
-        "prompt_en": "The bund Shanghai, racking focus",
-        "dimension": [
-            "temporal_style"
-        ]
-    },
-    {
-        "prompt_en": "a shark is swimming in the ocean, in super slow motion",
-        "dimension": [
-            "temporal_style"
-        ]
-    },
-    {
-        "prompt_en": "a shark is swimming in the ocean, zoom in",
-        "dimension": [
-            "temporal_style"
-        ]
-    },
-    {
-        "prompt_en": "a shark is swimming in the ocean, zoom out",
-        "dimension": [
-            "temporal_style"
-        ]
-    },
-    {
-        "prompt_en": "a shark is swimming in the ocean, pan left",
-        "dimension": [
-            "temporal_style"
-        ]
-    },
-    {
-        "prompt_en": "a shark is swimming in the ocean, pan right",
-        "dimension": [
-            "temporal_style"
-        ]
-    },
-    {
-        "prompt_en": "a shark is swimming in the ocean, tilt up",
-        "dimension": [
-            "temporal_style"
-        ]
-    },
-    {
-        "prompt_en": "a shark is swimming in the ocean, tilt down",
-        "dimension": [
-            "temporal_style"
-        ]
-    },
-    {
-        "prompt_en": "a shark is swimming in the ocean, with an intense shaking effect",
-        "dimension": [
-            "temporal_style"
-        ]
-    },
-    {
-        "prompt_en": "a shark is swimming in the ocean, featuring a steady and smooth perspective",
-        "dimension": [
-            "temporal_style"
-        ]
-    },
-    {
-        "prompt_en": "a shark is swimming in the ocean, racking focus",
-        "dimension": [
-            "temporal_style"
-        ]
-    },
-    {
-        "prompt_en": "A panda drinking coffee in a cafe in Paris, in super slow motion",
-        "dimension": [
-            "temporal_style"
-        ]
-    },
-    {
-        "prompt_en": "A panda drinking coffee in a cafe in Paris, zoom in",
-        "dimension": [
-            "temporal_style"
-        ]
-    },
-    {
-        "prompt_en": "A panda drinking coffee in a cafe in Paris, zoom out",
-        "dimension": [
-            "temporal_style"
-        ]
-    },
-    {
-        "prompt_en": "A panda drinking coffee in a cafe in Paris, pan left",
-        "dimension": [
-            "temporal_style"
-        ]
-    },
-    {
-        "prompt_en": "A panda drinking coffee in a cafe in Paris, pan right",
-        "dimension": [
-            "temporal_style"
-        ]
-    },
-    {
-        "prompt_en": "A panda drinking coffee in a cafe in Paris, tilt up",
-        "dimension": [
-            "temporal_style"
-        ]
-    },
-    {
-        "prompt_en": "A panda drinking coffee in a cafe in Paris, tilt down",
-        "dimension": [
-            "temporal_style"
-        ]
-    },
-    {
-        "prompt_en": "A panda drinking coffee in a cafe in Paris, with an intense shaking effect",
-        "dimension": [
-            "temporal_style"
-        ]
-    },
-    {
-        "prompt_en": "A panda drinking coffee in a cafe in Paris, featuring a steady and smooth perspective",
-        "dimension": [
-            "temporal_style"
-        ]
-    },
-    {
-        "prompt_en": "A panda drinking coffee in a cafe in Paris, racking focus",
-        "dimension": [
-            "temporal_style"
-        ]
-    },
-    {
-        "prompt_en": "A cute happy Corgi playing in park, sunset, in super slow motion",
-        "dimension": [
-            "temporal_style"
-        ]
-    },
-    {
-        "prompt_en": "A cute happy Corgi playing in park, sunset, zoom in",
-        "dimension": [
-            "temporal_style"
-        ]
-    },
-    {
-        "prompt_en": "A cute happy Corgi playing in park, sunset, zoom out",
-        "dimension": [
-            "temporal_style"
-        ]
-    },
-    {
-        "prompt_en": "A cute happy Corgi playing in park, sunset, pan left",
-        "dimension": [
-            "temporal_style"
-        ]
-    },
-    {
-        "prompt_en": "A cute happy Corgi playing in park, sunset, pan right",
-        "dimension": [
-            "temporal_style"
-        ]
-    },
-    {
-        "prompt_en": "A cute happy Corgi playing in park, sunset, tilt up",
-        "dimension": [
-            "temporal_style"
-        ]
-    },
-    {
-        "prompt_en": "A cute happy Corgi playing in park, sunset, tilt down",
-        "dimension": [
-            "temporal_style"
-        ]
-    },
-    {
-        "prompt_en": "A cute happy Corgi playing in park, sunset, with an intense shaking effect",
-        "dimension": [
-            "temporal_style"
-        ]
-    },
-    {
-        "prompt_en": "A cute happy Corgi playing in park, sunset, featuring a steady and smooth perspective",
-        "dimension": [
-            "temporal_style"
-        ]
-    },
-    {
-        "prompt_en": "A cute happy Corgi playing in park, sunset, racking focus",
-        "dimension": [
-            "temporal_style"
-        ]
-    },
-    {
-        "prompt_en": "Gwen Stacy reading a book, in super slow motion",
-        "dimension": [
-            "temporal_style"
-        ]
-    },
-    {
-        "prompt_en": "Gwen Stacy reading a book, zoom in",
-        "dimension": [
-            "temporal_style"
-        ]
-    },
-    {
-        "prompt_en": "Gwen Stacy reading a book, zoom out",
-        "dimension": [
-            "temporal_style"
-        ]
-    },
-    {
-        "prompt_en": "Gwen Stacy reading a book, pan left",
-        "dimension": [
-            "temporal_style"
-        ]
-    },
-    {
-        "prompt_en": "Gwen Stacy reading a book, pan right",
-        "dimension": [
-            "temporal_style"
-        ]
-    },
-    {
-        "prompt_en": "Gwen Stacy reading a book, tilt up",
-        "dimension": [
-            "temporal_style"
-        ]
-    },
-    {
-        "prompt_en": "Gwen Stacy reading a book, tilt down",
-        "dimension": [
-            "temporal_style"
-        ]
-    },
-    {
-        "prompt_en": "Gwen Stacy reading a book, with an intense shaking effect",
-        "dimension": [
-            "temporal_style"
-        ]
-    },
-    {
-        "prompt_en": "Gwen Stacy reading a book, featuring a steady and smooth perspective",
-        "dimension": [
-            "temporal_style"
-        ]
-    },
-    {
-        "prompt_en": "Gwen Stacy reading a book, racking focus",
-        "dimension": [
-            "temporal_style"
-        ]
-    },
-    {
-        "prompt_en": "A boat sailing leisurely along the Seine River with the Eiffel Tower in background, in super slow motion",
-        "dimension": [
-            "temporal_style"
-        ]
-    },
-    {
-        "prompt_en": "A boat sailing leisurely along the Seine River with the Eiffel Tower in background, zoom in",
-        "dimension": [
-            "temporal_style"
-        ]
-    },
-    {
-        "prompt_en": "A boat sailing leisurely along the Seine River with the Eiffel Tower in background, zoom out",
-        "dimension": [
-            "temporal_style"
-        ]
-    },
-    {
-        "prompt_en": "A boat sailing leisurely along the Seine River with the Eiffel Tower in background, pan left",
-        "dimension": [
-            "temporal_style"
-        ]
-    },
-    {
-        "prompt_en": "A boat sailing leisurely along the Seine River with the Eiffel Tower in background, pan right",
-        "dimension": [
-            "temporal_style"
-        ]
-    },
-    {
-        "prompt_en": "A boat sailing leisurely along the Seine River with the Eiffel Tower in background, tilt up",
-        "dimension": [
-            "temporal_style"
-        ]
-    },
-    {
-        "prompt_en": "A boat sailing leisurely along the Seine River with the Eiffel Tower in background, tilt down",
-        "dimension": [
-            "temporal_style"
-        ]
-    },
-    {
-        "prompt_en": "A boat sailing leisurely along the Seine River with the Eiffel Tower in background, with an intense shaking effect",
-        "dimension": [
-            "temporal_style"
-        ]
-    },
-    {
-        "prompt_en": "A boat sailing leisurely along the Seine River with the Eiffel Tower in background, featuring a steady and smooth perspective",
-        "dimension": [
-            "temporal_style"
-        ]
-    },
-    {
-        "prompt_en": "A boat sailing leisurely along the Seine River with the Eiffel Tower in background, racking focus",
-        "dimension": [
-            "temporal_style"
-        ]
-    },
-    {
-        "prompt_en": "A couple in formal evening wear going home get caught in a heavy downpour with umbrellas, in super slow motion",
-        "dimension": [
-            "temporal_style"
-        ]
-    },
-    {
-        "prompt_en": "A couple in formal evening wear going home get caught in a heavy downpour with umbrellas, zoom in",
-        "dimension": [
-            "temporal_style"
-        ]
-    },
-    {
-        "prompt_en": "A couple in formal evening wear going home get caught in a heavy downpour with umbrellas, zoom out",
-        "dimension": [
-            "temporal_style"
-        ]
-    },
-    {
-        "prompt_en": "A couple in formal evening wear going home get caught in a heavy downpour with umbrellas, pan left",
-        "dimension": [
-            "temporal_style"
-        ]
-    },
-    {
-        "prompt_en": "A couple in formal evening wear going home get caught in a heavy downpour with umbrellas, pan right",
-        "dimension": [
-            "temporal_style"
-        ]
-    },
-    {
-        "prompt_en": "A couple in formal evening wear going home get caught in a heavy downpour with umbrellas, tilt up",
-        "dimension": [
-            "temporal_style"
-        ]
-    },
-    {
-        "prompt_en": "A couple in formal evening wear going home get caught in a heavy downpour with umbrellas, tilt down",
-        "dimension": [
-            "temporal_style"
-        ]
-    },
-    {
-        "prompt_en": "A couple in formal evening wear going home get caught in a heavy downpour with umbrellas, with an intense shaking effect",
-        "dimension": [
-            "temporal_style"
-        ]
-    },
-    {
-        "prompt_en": "A couple in formal evening wear going home get caught in a heavy downpour with umbrellas, featuring a steady and smooth perspective",
-        "dimension": [
-            "temporal_style"
-        ]
-    },
-    {
-        "prompt_en": "A couple in formal evening wear going home get caught in a heavy downpour with umbrellas, racking focus",
-        "dimension": [
-            "temporal_style"
-        ]
-    },
-    {
-        "prompt_en": "An astronaut flying in space, in super slow motion",
-        "dimension": [
-            "temporal_style"
-        ]
-    },
-    {
-        "prompt_en": "An astronaut flying in space, zoom in",
-        "dimension": [
-            "temporal_style"
-        ]
-    },
-    {
-        "prompt_en": "An astronaut flying in space, zoom out",
-        "dimension": [
-            "temporal_style"
-        ]
-    },
-    {
-        "prompt_en": "An astronaut flying in space, pan left",
-        "dimension": [
-            "temporal_style"
-        ]
-    },
-    {
-        "prompt_en": "An astronaut flying in space, pan right",
-        "dimension": [
-            "temporal_style"
-        ]
-    },
-    {
-        "prompt_en": "An astronaut flying in space, tilt up",
-        "dimension": [
-            "temporal_style"
-        ]
-    },
-    {
-        "prompt_en": "An astronaut flying in space, tilt down",
-        "dimension": [
-            "temporal_style"
-        ]
-    },
-    {
-        "prompt_en": "An astronaut flying in space, with an intense shaking effect",
-        "dimension": [
-            "temporal_style"
-        ]
-    },
-    {
-        "prompt_en": "An astronaut flying in space, featuring a steady and smooth perspective",
-        "dimension": [
-            "temporal_style"
-        ]
-    },
-    {
-        "prompt_en": "An astronaut flying in space, racking focus",
-        "dimension": [
-            "temporal_style"
-        ]
-    },
-    {
-        "prompt_en": "Snow rocky mountains peaks canyon. snow blanketed rocky mountains surround and shadow deep canyons. the canyons twist and bend through the high elevated mountain peaks, in super slow motion",
-        "dimension": [
-            "temporal_style"
-        ]
-    },
-    {
-        "prompt_en": "Snow rocky mountains peaks canyon. snow blanketed rocky mountains surround and shadow deep canyons. the canyons twist and bend through the high elevated mountain peaks, zoom in",
-        "dimension": [
-            "temporal_style"
-        ]
-    },
-    {
-        "prompt_en": "Snow rocky mountains peaks canyon. snow blanketed rocky mountains surround and shadow deep canyons. the canyons twist and bend through the high elevated mountain peaks, zoom out",
-        "dimension": [
-            "temporal_style"
-        ]
-    },
-    {
-        "prompt_en": "Snow rocky mountains peaks canyon. snow blanketed rocky mountains surround and shadow deep canyons. the canyons twist and bend through the high elevated mountain peaks, pan left",
-        "dimension": [
-            "temporal_style"
-        ]
-    },
-    {
-        "prompt_en": "Snow rocky mountains peaks canyon. snow blanketed rocky mountains surround and shadow deep canyons. the canyons twist and bend through the high elevated mountain peaks, pan right",
-        "dimension": [
-            "temporal_style"
-        ]
-    },
-    {
-        "prompt_en": "Snow rocky mountains peaks canyon. snow blanketed rocky mountains surround and shadow deep canyons. the canyons twist and bend through the high elevated mountain peaks, tilt up",
-        "dimension": [
-            "temporal_style"
-        ]
-    },
-    {
-        "prompt_en": "Snow rocky mountains peaks canyon. snow blanketed rocky mountains surround and shadow deep canyons. the canyons twist and bend through the high elevated mountain peaks, tilt down",
-        "dimension": [
-            "temporal_style"
-        ]
-    },
-    {
-        "prompt_en": "Snow rocky mountains peaks canyon. snow blanketed rocky mountains surround and shadow deep canyons. the canyons twist and bend through the high elevated mountain peaks, with an intense shaking effect",
-        "dimension": [
-            "temporal_style"
-        ]
-    },
-    {
-        "prompt_en": "Snow rocky mountains peaks canyon. snow blanketed rocky mountains surround and shadow deep canyons. the canyons twist and bend through the high elevated mountain peaks, featuring a steady and smooth perspective",
-        "dimension": [
-            "temporal_style"
-        ]
-    },
-    {
-        "prompt_en": "Snow rocky mountains peaks canyon. snow blanketed rocky mountains surround and shadow deep canyons. the canyons twist and bend through the high elevated mountain peaks, racking focus",
-        "dimension": [
-            "temporal_style"
-        ]
-    },
-    {
-        "prompt_en": "Close up of grapes on a rotating table.",
-        "dimension": [
-            "overall_consistency",
-            "aesthetic_quality",
-            "imaging_quality"
-        ]
-    },
-    {
-        "prompt_en": "Turtle swimming in ocean.",
-        "dimension": [
-            "overall_consistency",
-            "aesthetic_quality",
-            "imaging_quality"
-        ]
-    },
-    {
-        "prompt_en": "A storm trooper vacuuming the beach.",
-        "dimension": [
-            "overall_consistency",
-            "aesthetic_quality",
-            "imaging_quality"
-        ]
-    },
-    {
-        "prompt_en": "A panda standing on a surfboard in the ocean in sunset.",
-        "dimension": [
-            "overall_consistency",
-            "aesthetic_quality",
-            "imaging_quality"
-        ]
-    },
-    {
-        "prompt_en": "An astronaut feeding ducks on a sunny afternoon, reflection from the water.",
-        "dimension": [
-            "overall_consistency",
-            "aesthetic_quality",
-            "imaging_quality"
-        ]
-    },
-    {
-        "prompt_en": "Two pandas discussing an academic paper.",
-        "dimension": [
-            "overall_consistency",
-            "aesthetic_quality",
-            "imaging_quality"
-        ]
-    },
-    {
-        "prompt_en": "Sunset time lapse at the beach with moving clouds and colors in the sky.",
-        "dimension": [
-            "overall_consistency",
-            "aesthetic_quality",
-            "imaging_quality"
-        ]
-    },
-    {
-        "prompt_en": "A fat rabbit wearing a purple robe walking through a fantasy landscape.",
-        "dimension": [
-            "overall_consistency",
-            "aesthetic_quality",
-            "imaging_quality"
-        ]
-    },
-    {
-        "prompt_en": "A koala bear playing piano in the forest.",
-        "dimension": [
-            "overall_consistency",
-            "aesthetic_quality",
-            "imaging_quality"
-        ]
-    },
-    {
-        "prompt_en": "An astronaut flying in space.",
-        "dimension": [
-            "overall_consistency",
-            "aesthetic_quality",
-            "imaging_quality"
-        ]
-    },
-    {
-        "prompt_en": "Fireworks.",
-        "dimension": [
-            "overall_consistency",
-            "aesthetic_quality",
-            "imaging_quality"
-        ]
-    },
-    {
-        "prompt_en": "An animated painting of fluffy white clouds moving in sky.",
-        "dimension": [
-            "overall_consistency",
-            "aesthetic_quality",
-            "imaging_quality"
-        ]
-    },
-    {
-        "prompt_en": "Flying through fantasy landscapes.",
-        "dimension": [
-            "overall_consistency",
-            "aesthetic_quality",
-            "imaging_quality"
-        ]
-    },
-    {
-        "prompt_en": "A bigfoot walking in the snowstorm.",
-        "dimension": [
-            "overall_consistency",
-            "aesthetic_quality",
-            "imaging_quality"
-        ]
-    },
-    {
-        "prompt_en": "A squirrel eating a burger.",
-        "dimension": [
-            "overall_consistency",
-            "aesthetic_quality",
-            "imaging_quality"
-        ]
-    },
-    {
-        "prompt_en": "A cat wearing sunglasses and working as a lifeguard at a pool.",
-        "dimension": [
-            "overall_consistency",
-            "aesthetic_quality",
-            "imaging_quality"
-        ]
-    },
-    {
-        "prompt_en": "Snow rocky mountains peaks canyon. snow blanketed rocky mountains surround and shadow deep canyons. the canyons twist and bend through the high elevated mountain peaks.",
-        "dimension": [
-            "overall_consistency",
-            "aesthetic_quality",
-            "imaging_quality"
-        ]
-    },
-    {
-        "prompt_en": "Splash of turquoise water in extreme slow motion, alpha channel included.",
-        "dimension": [
-            "overall_consistency",
-            "aesthetic_quality",
-            "imaging_quality"
-        ]
-    },
-    {
-        "prompt_en": "an ice cream is melting on the table.",
-        "dimension": [
-            "overall_consistency",
-            "aesthetic_quality",
-            "imaging_quality"
-        ]
-    },
-    {
-        "prompt_en": "a drone flying over a snowy forest.",
-        "dimension": [
-            "overall_consistency",
-            "aesthetic_quality",
-            "imaging_quality"
-        ]
-    },
-    {
-        "prompt_en": "a shark is swimming in the ocean.",
-        "dimension": [
-            "overall_consistency",
-            "aesthetic_quality",
-            "imaging_quality"
-        ]
-    },
-    {
-        "prompt_en": "Aerial panoramic video from a drone of a fantasy land.",
-        "dimension": [
-            "overall_consistency",
-            "aesthetic_quality",
-            "imaging_quality"
-        ]
-    },
-    {
-        "prompt_en": "a teddy bear is swimming in the ocean.",
-        "dimension": [
-            "overall_consistency",
-            "aesthetic_quality",
-            "imaging_quality"
-        ]
-    },
-    {
-        "prompt_en": "time lapse of sunrise on mars.",
-        "dimension": [
-            "overall_consistency",
-            "aesthetic_quality",
-            "imaging_quality"
-        ]
-    },
-    {
-        "prompt_en": "golden fish swimming in the ocean.",
-        "dimension": [
-            "overall_consistency",
-            "aesthetic_quality",
-            "imaging_quality"
-        ]
-    },
-    {
-        "prompt_en": "An artist brush painting on a canvas close up.",
-        "dimension": [
-            "overall_consistency",
-            "aesthetic_quality",
-            "imaging_quality"
-        ]
-    },
-    {
-        "prompt_en": "A drone view of celebration with Christmas tree and fireworks, starry sky - background.",
-        "dimension": [
-            "overall_consistency",
-            "aesthetic_quality",
-            "imaging_quality"
-        ]
-    },
-    {
-        "prompt_en": "happy dog wearing a yellow turtleneck, studio, portrait, facing camera, dark background",
-        "dimension": [
-            "overall_consistency",
-            "aesthetic_quality",
-            "imaging_quality"
-        ]
-    },
-    {
-        "prompt_en": "Origami dancers in white paper, 3D render, on white background, studio shot, dancing modern dance.",
-        "dimension": [
-            "overall_consistency",
-            "aesthetic_quality",
-            "imaging_quality"
-        ]
-    },
-    {
-        "prompt_en": "Campfire at night in a snowy forest with starry sky in the background.",
-        "dimension": [
-            "overall_consistency",
-            "aesthetic_quality",
-            "imaging_quality"
-        ]
-    },
-    {
-        "prompt_en": "a fantasy landscape",
-        "dimension": [
-            "overall_consistency",
-            "aesthetic_quality",
-            "imaging_quality"
-        ]
-    },
-    {
-        "prompt_en": "A 3D model of a 1800s victorian house.",
-        "dimension": [
-            "overall_consistency",
-            "aesthetic_quality",
-            "imaging_quality"
-        ]
-    },
-    {
-        "prompt_en": "this is how I do makeup in the morning.",
-        "dimension": [
-            "overall_consistency",
-            "aesthetic_quality",
-            "imaging_quality"
-        ]
-    },
-    {
-        "prompt_en": "A raccoon that looks like a turtle, digital art.",
-        "dimension": [
-            "overall_consistency",
-            "aesthetic_quality",
-            "imaging_quality"
-        ]
-    },
-    {
-        "prompt_en": "Robot dancing in Times Square.",
-        "dimension": [
-            "overall_consistency",
-            "aesthetic_quality",
-            "imaging_quality"
-        ]
-    },
-    {
-        "prompt_en": "Busy freeway at night.",
-        "dimension": [
-            "overall_consistency",
-            "aesthetic_quality",
-            "imaging_quality"
-        ]
-    },
-    {
-        "prompt_en": "Balloon full of water exploding in extreme slow motion.",
-        "dimension": [
-            "overall_consistency",
-            "aesthetic_quality",
-            "imaging_quality"
-        ]
-    },
-    {
-        "prompt_en": "An astronaut is riding a horse in the space in a photorealistic style.",
-        "dimension": [
-            "overall_consistency",
-            "aesthetic_quality",
-            "imaging_quality"
-        ]
-    },
-    {
-        "prompt_en": "Macro slo-mo. Slow motion cropped closeup of roasted coffee beans falling into an empty bowl.",
-        "dimension": [
-            "overall_consistency",
-            "aesthetic_quality",
-            "imaging_quality"
-        ]
-    },
-    {
-        "prompt_en": "Sewing machine, old sewing machine working.",
-        "dimension": [
-            "overall_consistency",
-            "aesthetic_quality",
-            "imaging_quality"
-        ]
-    },
-    {
-        "prompt_en": "Motion colour drop in water, ink swirling in water, colourful ink in water, abstraction fancy dream cloud of ink.",
-        "dimension": [
-            "overall_consistency",
-            "aesthetic_quality",
-            "imaging_quality"
-        ]
-    },
-    {
-        "prompt_en": "Few big purple plums rotating on the turntable. water drops appear on the skin during rotation. isolated on the white background. close-up. macro.",
-        "dimension": [
-            "overall_consistency",
-            "aesthetic_quality",
-            "imaging_quality"
-        ]
-    },
-    {
-        "prompt_en": "Vampire makeup face of beautiful girl, red contact lenses.",
-        "dimension": [
-            "overall_consistency",
-            "aesthetic_quality",
-            "imaging_quality"
-        ]
-    },
-    {
-        "prompt_en": "Ashtray full of butts on table, smoke flowing on black background, close-up",
-        "dimension": [
-            "overall_consistency",
-            "aesthetic_quality",
-            "imaging_quality"
-        ]
-    },
-    {
-        "prompt_en": "Pacific coast, carmel by the sea ocean and waves.",
-        "dimension": [
-            "overall_consistency",
-            "aesthetic_quality",
-            "imaging_quality"
-        ]
-    },
-    {
-        "prompt_en": "A teddy bear is playing drum kit in NYC Times Square.",
-        "dimension": [
-            "overall_consistency",
-            "aesthetic_quality",
-            "imaging_quality"
-        ]
-    },
-    {
-        "prompt_en": "A corgi is playing drum kit.",
-        "dimension": [
-            "overall_consistency",
-            "aesthetic_quality",
-            "imaging_quality"
-        ]
-    },
-    {
-        "prompt_en": "An Iron man is playing the electronic guitar, high electronic guitar.",
-        "dimension": [
-            "overall_consistency",
-            "aesthetic_quality",
-            "imaging_quality"
-        ]
-    },
-    {
-        "prompt_en": "A raccoon is playing the electronic guitar.",
-        "dimension": [
-            "overall_consistency",
-            "aesthetic_quality",
-            "imaging_quality"
-        ]
-    },
-    {
-        "prompt_en": "A boat sailing leisurely along the Seine River with the Eiffel Tower in background by Vincent van Gogh",
-        "dimension": [
-            "overall_consistency",
-            "aesthetic_quality",
-            "imaging_quality"
-        ]
-    },
-    {
-        "prompt_en": "A corgi's head depicted as an explosion of a nebula",
-        "dimension": [
-            "overall_consistency",
-            "aesthetic_quality",
-            "imaging_quality"
-        ]
-    },
-    {
-        "prompt_en": "A fantasy landscape",
-        "dimension": [
-            "overall_consistency",
-            "aesthetic_quality",
-            "imaging_quality"
-        ]
-    },
-    {
-        "prompt_en": "A future where humans have achieved teleportation technology",
-        "dimension": [
-            "overall_consistency",
-            "aesthetic_quality",
-            "imaging_quality"
-        ]
-    },
-    {
-        "prompt_en": "A jellyfish floating through the ocean, with bioluminescent tentacles",
-        "dimension": [
-            "overall_consistency",
-            "aesthetic_quality",
-            "imaging_quality"
-        ]
-    },
-    {
-        "prompt_en": "A Mars rover moving on Mars",
-        "dimension": [
-            "overall_consistency",
-            "aesthetic_quality",
-            "imaging_quality"
-        ]
-    },
-    {
-        "prompt_en": "A panda drinking coffee in a cafe in Paris",
-        "dimension": [
-            "overall_consistency",
-            "aesthetic_quality",
-            "imaging_quality"
-        ]
-    },
-    {
-        "prompt_en": "A space shuttle launching into orbit, with flames and smoke billowing out from the engines",
-        "dimension": [
-            "overall_consistency",
-            "aesthetic_quality",
-            "imaging_quality"
-        ]
-    },
-    {
-        "prompt_en": "A steam train moving on a mountainside",
-        "dimension": [
-            "overall_consistency",
-            "aesthetic_quality",
-            "imaging_quality"
-        ]
-    },
-    {
-        "prompt_en": "A super cool giant robot in Cyberpunk Beijing",
-        "dimension": [
-            "overall_consistency",
-            "aesthetic_quality",
-            "imaging_quality"
-        ]
-    },
-    {
-        "prompt_en": "A tropical beach at sunrise, with palm trees and crystal-clear water in the foreground",
-        "dimension": [
-            "overall_consistency",
-            "aesthetic_quality",
-            "imaging_quality"
-        ]
-    },
-    {
-        "prompt_en": "Cinematic shot of Van Gogh's selfie, Van Gogh style",
-        "dimension": [
-            "overall_consistency",
-            "aesthetic_quality",
-            "imaging_quality"
-        ]
-    },
-    {
-        "prompt_en": "Gwen Stacy reading a book",
-        "dimension": [
-            "overall_consistency",
-            "aesthetic_quality",
-            "imaging_quality"
-        ]
-    },
-    {
-        "prompt_en": "Iron Man flying in the sky",
-        "dimension": [
-            "overall_consistency",
-            "aesthetic_quality",
-            "imaging_quality"
-        ]
-    },
-    {
-        "prompt_en": "The bund Shanghai, oil painting",
-        "dimension": [
-            "overall_consistency",
-            "aesthetic_quality",
-            "imaging_quality"
-        ]
-    },
-    {
-        "prompt_en": "Yoda playing guitar on the stage",
-        "dimension": [
-            "overall_consistency",
-            "aesthetic_quality",
-            "imaging_quality"
-        ]
-    },
-    {
-        "prompt_en": "A beautiful coastal beach in spring, waves lapping on sand by Hokusai, in the style of Ukiyo",
-        "dimension": [
-            "overall_consistency",
-            "aesthetic_quality",
-            "imaging_quality"
-        ]
-    },
-    {
-        "prompt_en": "A beautiful coastal beach in spring, waves lapping on sand by Vincent van Gogh",
-        "dimension": [
-            "overall_consistency",
-            "aesthetic_quality",
-            "imaging_quality"
-        ]
-    },
-    {
-        "prompt_en": "A boat sailing leisurely along the Seine River with the Eiffel Tower in background",
-        "dimension": [
-            "overall_consistency",
-            "aesthetic_quality",
-            "imaging_quality"
-        ]
-    },
-    {
-        "prompt_en": "A car moving slowly on an empty street, rainy evening",
-        "dimension": [
-            "overall_consistency",
-            "aesthetic_quality",
-            "imaging_quality"
-        ]
-    },
-    {
-        "prompt_en": "A cat eating food out of a bowl",
-        "dimension": [
-            "overall_consistency",
-            "aesthetic_quality",
-            "imaging_quality"
-        ]
-    },
-    {
-        "prompt_en": "A cat wearing sunglasses at a pool",
-        "dimension": [
-            "overall_consistency",
-            "aesthetic_quality",
-            "imaging_quality"
-        ]
-    },
-    {
-        "prompt_en": "A confused panda in calculus class",
-        "dimension": [
-            "overall_consistency",
-            "aesthetic_quality",
-            "imaging_quality"
-        ]
-    },
-    {
-        "prompt_en": "A cute fluffy panda eating Chinese food in a restaurant",
-        "dimension": [
-            "overall_consistency",
-            "aesthetic_quality",
-            "imaging_quality"
-        ]
-    },
-    {
-        "prompt_en": "A cute happy Corgi playing in park, sunset",
-        "dimension": [
-            "overall_consistency",
-            "aesthetic_quality",
-            "imaging_quality"
-        ]
-    },
-    {
-        "prompt_en": "A cute raccoon playing guitar in a boat on the ocean",
-        "dimension": [
-            "overall_consistency",
-            "aesthetic_quality",
-            "imaging_quality"
-        ]
-    },
-    {
-        "prompt_en": "A happy fuzzy panda playing guitar nearby a campfire, snow mountain in the background",
-        "dimension": [
-            "overall_consistency",
-            "aesthetic_quality",
-            "imaging_quality"
-        ]
-    },
-    {
-        "prompt_en": "A lightning striking atop of eiffel tower, dark clouds in the sky",
-        "dimension": [
-            "overall_consistency",
-            "aesthetic_quality",
-            "imaging_quality"
-        ]
-    },
-    {
-        "prompt_en": "A modern art museum, with colorful paintings",
-        "dimension": [
-            "overall_consistency",
-            "aesthetic_quality",
-            "imaging_quality"
-        ]
-    },
-    {
-        "prompt_en": "A panda cooking in the kitchen",
-        "dimension": [
-            "overall_consistency",
-            "aesthetic_quality",
-            "imaging_quality"
-        ]
-    },
-    {
-        "prompt_en": "A panda playing on a swing set",
-        "dimension": [
-            "overall_consistency",
-            "aesthetic_quality",
-            "imaging_quality"
-        ]
-    },
-    {
-        "prompt_en": "A polar bear is playing guitar",
-        "dimension": [
-            "overall_consistency",
-            "aesthetic_quality",
-            "imaging_quality"
-        ]
-    },
-    {
-        "prompt_en": "A raccoon dressed in suit playing the trumpet, stage background",
-        "dimension": [
-            "overall_consistency",
-            "aesthetic_quality",
-            "imaging_quality"
-        ]
-    },
-    {
-        "prompt_en": "A robot DJ is playing the turntable, in heavy raining futuristic tokyo rooftop cyberpunk night, sci-fi, fantasy",
-        "dimension": [
-            "overall_consistency",
-            "aesthetic_quality",
-            "imaging_quality"
-        ]
-    },
-    {
-        "prompt_en": "A shark swimming in clear Caribbean ocean",
-        "dimension": [
-            "overall_consistency",
-            "aesthetic_quality",
-            "imaging_quality"
-        ]
-    },
-    {
-        "prompt_en": "A super robot protecting city",
-        "dimension": [
-            "overall_consistency",
-            "aesthetic_quality",
-            "imaging_quality"
-        ]
-    },
-    {
-        "prompt_en": "A teddy bear washing the dishes",
-        "dimension": [
-            "overall_consistency",
-            "aesthetic_quality",
-            "imaging_quality"
-        ]
-    },
-    {
-        "prompt_en": "An epic tornado attacking above a glowing city at night, the tornado is made of smoke",
-        "dimension": [
-            "overall_consistency",
-            "aesthetic_quality",
-            "imaging_quality"
-        ]
-    },
-    {
-        "prompt_en": "An oil painting of a couple in formal evening wear going home get caught in a heavy downpour with umbrellas",
-        "dimension": [
-            "overall_consistency",
-            "aesthetic_quality",
-            "imaging_quality"
-        ]
-    },
-    {
-        "prompt_en": "Clown fish swimming through the coral reef",
-        "dimension": [
-            "overall_consistency",
-            "aesthetic_quality",
-            "imaging_quality"
-        ]
-    },
-    {
-        "prompt_en": "Hyper-realistic spaceship landing on Mars",
-        "dimension": [
-            "overall_consistency",
-            "aesthetic_quality",
-            "imaging_quality"
-        ]
-    },
-    {
-        "prompt_en": "The bund Shanghai, vibrant color",
-        "dimension": [
-            "overall_consistency",
-            "aesthetic_quality",
-            "imaging_quality"
-        ]
-    },
-    {
-        "prompt_en": "Vincent van Gogh is painting in the room",
-        "dimension": [
-            "overall_consistency",
-            "aesthetic_quality",
-            "imaging_quality"
-        ]
-    },
-    {
-        "prompt_en": "Yellow flowers swing in the wind",
-        "dimension": [
-            "overall_consistency",
-            "aesthetic_quality",
-            "imaging_quality"
-        ]
-    },
-    {
-        "prompt_en": "alley",
-        "dimension": [
-            "scene",
-            "background_consistency"
-        ],
-        "auxiliary_info": {
-            "scene": {
-                "scene": {
-                    "scene": "alley"
-                }
-            }
-        }
-    },
-    {
-        "prompt_en": "amusement park",
-        "dimension": [
-            "scene",
-            "background_consistency"
-        ],
-        "auxiliary_info": {
-            "scene": {
-                "scene": {
-                    "scene": "amusement park"
-                }
-            }
-        }
-    },
-    {
-        "prompt_en": "aquarium",
-        "dimension": [
-            "scene",
-            "background_consistency"
-        ],
-        "auxiliary_info": {
-            "scene": {
-                "scene": {
-                    "scene": "aquarium"
-                }
-            }
-        }
-    },
-    {
-        "prompt_en": "arch",
-        "dimension": [
-            "scene",
-            "background_consistency"
-        ],
-        "auxiliary_info": {
-            "scene": {
-                "scene": {
-                    "scene": "arch"
-                }
-            }
-        }
-    },
-    {
-        "prompt_en": "art gallery",
-        "dimension": [
-            "scene",
-            "background_consistency"
-        ],
-        "auxiliary_info": {
-            "scene": {
-                "scene": {
-                    "scene": "art gallery"
-                }
-            }
-        }
-    },
-    {
-        "prompt_en": "bathroom",
-        "dimension": [
-            "scene",
-            "background_consistency"
-        ],
-        "auxiliary_info": {
-            "scene": {
-                "scene": {
-                    "scene": "bathroom"
-                }
-            }
-        }
-    },
-    {
-        "prompt_en": "bakery shop",
-        "dimension": [
-            "scene",
-            "background_consistency"
-        ],
-        "auxiliary_info": {
-            "scene": {
-                "scene": {
-                    "scene": "bakery shop"
-                }
-            }
-        }
-    },
-    {
-        "prompt_en": "ballroom",
-        "dimension": [
-            "scene",
-            "background_consistency"
-        ],
-        "auxiliary_info": {
-            "scene": {
-                "scene": {
-                    "scene": "ballroom"
-                }
-            }
-        }
-    },
-    {
-        "prompt_en": "bar",
-        "dimension": [
-            "scene",
-            "background_consistency"
-        ],
-        "auxiliary_info": {
-            "scene": {
-                "scene": {
-                    "scene": "bar"
-                }
-            }
-        }
-    },
-    {
-        "prompt_en": "barn",
-        "dimension": [
-            "scene",
-            "background_consistency"
-        ],
-        "auxiliary_info": {
-            "scene": {
-                "scene": {
-                    "scene": "barn"
-                }
-            }
-        }
-    },
-    {
-        "prompt_en": "basement",
-        "dimension": [
-            "scene",
-            "background_consistency"
-        ],
-        "auxiliary_info": {
-            "scene": {
-                "scene": {
-                    "scene": "basement"
-                }
-            }
-        }
-    },
-    {
-        "prompt_en": "beach",
-        "dimension": [
-            "scene",
-            "background_consistency"
-        ],
-        "auxiliary_info": {
-            "scene": {
-                "scene": {
-                    "scene": "beach"
-                }
-            }
-        }
-    },
-    {
-        "prompt_en": "bedroom",
-        "dimension": [
-            "scene",
-            "background_consistency"
-        ],
-        "auxiliary_info": {
-            "scene": {
-                "scene": {
-                    "scene": "bedroom"
-                }
-            }
-        }
-    },
-    {
-        "prompt_en": "bridge",
-        "dimension": [
-            "scene",
-            "background_consistency"
-        ],
-        "auxiliary_info": {
-            "scene": {
-                "scene": {
-                    "scene": "bridge"
-                }
-            }
-        }
-    },
-    {
-        "prompt_en": "botanical garden",
-        "dimension": [
-            "scene",
-            "background_consistency"
-        ],
-        "auxiliary_info": {
-            "scene": {
-                "scene": {
-                    "scene": "botanical garden"
-                }
-            }
-        }
-    },
-    {
-        "prompt_en": "cafeteria",
-        "dimension": [
-            "scene",
-            "background_consistency"
-        ],
-        "auxiliary_info": {
-            "scene": {
-                "scene": {
-                    "scene": "cafeteria"
-                }
-            }
-        }
-    },
-    {
-        "prompt_en": "campsite",
-        "dimension": [
-            "scene",
-            "background_consistency"
-        ],
-        "auxiliary_info": {
-            "scene": {
-                "scene": {
-                    "scene": "campsite"
-                }
-            }
-        }
-    },
-    {
-        "prompt_en": "campus",
-        "dimension": [
-            "scene",
-            "background_consistency"
-        ],
-        "auxiliary_info": {
-            "scene": {
-                "scene": {
-                    "scene": "campus"
-                }
-            }
-        }
-    },
-    {
-        "prompt_en": "carrousel",
-        "dimension": [
-            "scene",
-            "background_consistency"
-        ],
-        "auxiliary_info": {
-            "scene": {
-                "scene": {
-                    "scene": "carrousel"
-                }
-            }
-        }
-    },
-    {
-        "prompt_en": "castle",
-        "dimension": [
-            "scene",
-            "background_consistency"
-        ],
-        "auxiliary_info": {
-            "scene": {
-                "scene": {
-                    "scene": "castle"
-                }
-            }
-        }
-    },
-    {
-        "prompt_en": "cemetery",
-        "dimension": [
-            "scene",
-            "background_consistency"
-        ],
-        "auxiliary_info": {
-            "scene": {
-                "scene": {
-                    "scene": "cemetery"
-                }
-            }
-        }
-    },
-    {
-        "prompt_en": "classroom",
-        "dimension": [
-            "scene",
-            "background_consistency"
-        ],
-        "auxiliary_info": {
-            "scene": {
-                "scene": {
-                    "scene": "classroom"
-                }
-            }
-        }
-    },
-    {
-        "prompt_en": "cliff",
-        "dimension": [
-            "scene",
-            "background_consistency"
-        ],
-        "auxiliary_info": {
-            "scene": {
-                "scene": {
-                    "scene": "cliff"
-                }
-            }
-        }
-    },
-    {
-        "prompt_en": "crosswalk",
-        "dimension": [
-            "scene",
-            "background_consistency"
-        ],
-        "auxiliary_info": {
-            "scene": {
-                "scene": {
-                    "scene": "crosswalk"
-                }
-            }
-        }
-    },
-    {
-        "prompt_en": "construction site",
-        "dimension": [
-            "scene",
-            "background_consistency"
-        ],
-        "auxiliary_info": {
-            "scene": {
-                "scene": {
-                    "scene": "construction site"
-                }
-            }
-        }
-    },
-    {
-        "prompt_en": "corridor",
-        "dimension": [
-            "scene",
-            "background_consistency"
-        ],
-        "auxiliary_info": {
-            "scene": {
-                "scene": {
-                    "scene": "corridor"
-                }
-            }
-        }
-    },
-    {
-        "prompt_en": "courtyard",
-        "dimension": [
-            "scene",
-            "background_consistency"
-        ],
-        "auxiliary_info": {
-            "scene": {
-                "scene": {
-                    "scene": "courtyard"
-                }
-            }
-        }
-    },
-    {
-        "prompt_en": "desert",
-        "dimension": [
-            "scene",
-            "background_consistency"
-        ],
-        "auxiliary_info": {
-            "scene": {
-                "scene": {
-                    "scene": "desert"
-                }
-            }
-        }
-    },
-    {
-        "prompt_en": "downtown",
-        "dimension": [
-            "scene",
-            "background_consistency"
-        ],
-        "auxiliary_info": {
-            "scene": {
-                "scene": {
-                    "scene": "downtown"
-                }
-            }
-        }
-    },
-    {
-        "prompt_en": "driveway",
-        "dimension": [
-            "scene",
-            "background_consistency"
-        ],
-        "auxiliary_info": {
-            "scene": {
-                "scene": {
-                    "scene": "driveway"
-                }
-            }
-        }
-    },
-    {
-        "prompt_en": "farm",
-        "dimension": [
-            "scene",
-            "background_consistency"
-        ],
-        "auxiliary_info": {
-            "scene": {
-                "scene": {
-                    "scene": "farm"
-                }
-            }
-        }
-    },
-    {
-        "prompt_en": "food court",
-        "dimension": [
-            "scene",
-            "background_consistency"
-        ],
-        "auxiliary_info": {
-            "scene": {
-                "scene": {
-                    "scene": "food court"
-                }
-            }
-        }
-    },
-    {
-        "prompt_en": "football field",
-        "dimension": [
-            "scene",
-            "background_consistency"
-        ],
-        "auxiliary_info": {
-            "scene": {
-                "scene": {
-                    "scene": "football field"
-                }
-            }
-        }
-    },
-    {
-        "prompt_en": "forest road",
-        "dimension": [
-            "scene",
-            "background_consistency"
-        ],
-        "auxiliary_info": {
-            "scene": {
-                "scene": {
-                    "scene": "forest road"
-                }
-            }
-        }
-    },
-    {
-        "prompt_en": "fountain",
-        "dimension": [
-            "scene",
-            "background_consistency"
-        ],
-        "auxiliary_info": {
-            "scene": {
-                "scene": {
-                    "scene": "fountain"
-                }
-            }
-        }
-    },
-    {
-        "prompt_en": "gas station",
-        "dimension": [
-            "scene",
-            "background_consistency"
-        ],
-        "auxiliary_info": {
-            "scene": {
-                "scene": {
-                    "scene": "gas station"
-                }
-            }
-        }
-    },
-    {
-        "prompt_en": "glacier",
-        "dimension": [
-            "scene",
-            "background_consistency"
-        ],
-        "auxiliary_info": {
-            "scene": {
-                "scene": {
-                    "scene": "glacier"
-                }
-            }
-        }
-    },
-    {
-        "prompt_en": "golf course",
-        "dimension": [
-            "scene",
-            "background_consistency"
-        ],
-        "auxiliary_info": {
-            "scene": {
-                "scene": {
-                    "scene": "golf course"
-                }
-            }
-        }
-    },
-    {
-        "prompt_en": "indoor gymnasium",
-        "dimension": [
-            "scene",
-            "background_consistency"
-        ],
-        "auxiliary_info": {
-            "scene": {
-                "scene": {
-                    "scene": "indoor gymnasium"
-                }
-            }
-        }
-    },
-    {
-        "prompt_en": "harbor",
-        "dimension": [
-            "scene",
-            "background_consistency"
-        ],
-        "auxiliary_info": {
-            "scene": {
-                "scene": {
-                    "scene": "harbor"
-                }
-            }
-        }
-    },
-    {
-        "prompt_en": "highway",
-        "dimension": [
-            "scene",
-            "background_consistency"
-        ],
-        "auxiliary_info": {
-            "scene": {
-                "scene": {
-                    "scene": "highway"
-                }
-            }
-        }
-    },
-    {
-        "prompt_en": "hospital",
-        "dimension": [
-            "scene",
-            "background_consistency"
-        ],
-        "auxiliary_info": {
-            "scene": {
-                "scene": {
-                    "scene": "hospital"
-                }
-            }
-        }
-    },
-    {
-        "prompt_en": "house",
-        "dimension": [
-            "scene",
-            "background_consistency"
-        ],
-        "auxiliary_info": {
-            "scene": {
-                "scene": {
-                    "scene": "house"
-                }
-            }
-        }
-    },
-    {
-        "prompt_en": "iceberg",
-        "dimension": [
-            "scene",
-            "background_consistency"
-        ],
-        "auxiliary_info": {
-            "scene": {
-                "scene": {
-                    "scene": "iceberg"
-                }
-            }
-        }
-    },
-    {
-        "prompt_en": "industrial area",
-        "dimension": [
-            "scene",
-            "background_consistency"
-        ],
-        "auxiliary_info": {
-            "scene": {
-                "scene": {
-                    "scene": "industrial area"
-                }
-            }
-        }
-    },
-    {
-        "prompt_en": "jail cell",
-        "dimension": [
-            "scene",
-            "background_consistency"
-        ],
-        "auxiliary_info": {
-            "scene": {
-                "scene": {
-                    "scene": "jail cell"
-                }
-            }
-        }
-    },
-    {
-        "prompt_en": "junkyard",
-        "dimension": [
-            "scene",
-            "background_consistency"
-        ],
-        "auxiliary_info": {
-            "scene": {
-                "scene": {
-                    "scene": "junkyard"
-                }
-            }
-        }
-    },
-    {
-        "prompt_en": "kitchen",
-        "dimension": [
-            "scene",
-            "background_consistency"
-        ],
-        "auxiliary_info": {
-            "scene": {
-                "scene": {
-                    "scene": "kitchen"
-                }
-            }
-        }
-    },
-    {
-        "prompt_en": "indoor library",
-        "dimension": [
-            "scene",
-            "background_consistency"
-        ],
-        "auxiliary_info": {
-            "scene": {
-                "scene": {
-                    "scene": "indoor library"
-                }
-            }
-        }
-    },
-    {
-        "prompt_en": "lighthouse",
-        "dimension": [
-            "scene",
-            "background_consistency"
-        ],
-        "auxiliary_info": {
-            "scene": {
-                "scene": {
-                    "scene": "lighthouse"
-                }
-            }
-        }
-    },
-    {
-        "prompt_en": "laboratory",
-        "dimension": [
-            "scene",
-            "background_consistency"
-        ],
-        "auxiliary_info": {
-            "scene": {
-                "scene": {
-                    "scene": "laboratory"
-                }
-            }
-        }
-    },
-    {
-        "prompt_en": "mansion",
-        "dimension": [
-            "scene",
-            "background_consistency"
-        ],
-        "auxiliary_info": {
-            "scene": {
-                "scene": {
-                    "scene": "mansion"
-                }
-            }
-        }
-    },
-    {
-        "prompt_en": "marsh",
-        "dimension": [
-            "scene",
-            "background_consistency"
-        ],
-        "auxiliary_info": {
-            "scene": {
-                "scene": {
-                    "scene": "marsh"
-                }
-            }
-        }
-    },
-    {
-        "prompt_en": "mountain",
-        "dimension": [
-            "scene",
-            "background_consistency"
-        ],
-        "auxiliary_info": {
-            "scene": {
-                "scene": {
-                    "scene": "mountain"
-                }
-            }
-        }
-    },
-    {
-        "prompt_en": "indoor movie theater",
-        "dimension": [
-            "scene",
-            "background_consistency"
-        ],
-        "auxiliary_info": {
-            "scene": {
-                "scene": {
-                    "scene": "indoor movie theater"
-                }
-            }
-        }
-    },
-    {
-        "prompt_en": "indoor museum",
-        "dimension": [
-            "scene",
-            "background_consistency"
-        ],
-        "auxiliary_info": {
-            "scene": {
-                "scene": {
-                    "scene": "indoor museum"
-                }
-            }
-        }
-    },
-    {
-        "prompt_en": "music studio",
-        "dimension": [
-            "scene",
-            "background_consistency"
-        ],
-        "auxiliary_info": {
-            "scene": {
-                "scene": {
-                    "scene": "music studio"
-                }
-            }
-        }
-    },
-    {
-        "prompt_en": "nursery",
-        "dimension": [
-            "scene",
-            "background_consistency"
-        ],
-        "auxiliary_info": {
-            "scene": {
-                "scene": {
-                    "scene": "nursery"
-                }
-            }
-        }
-    },
-    {
-        "prompt_en": "ocean",
-        "dimension": [
-            "scene",
-            "background_consistency"
-        ],
-        "auxiliary_info": {
-            "scene": {
-                "scene": {
-                    "scene": "ocean"
-                }
-            }
-        }
-    },
-    {
-        "prompt_en": "office",
-        "dimension": [
-            "scene",
-            "background_consistency"
-        ],
-        "auxiliary_info": {
-            "scene": {
-                "scene": {
-                    "scene": "office"
-                }
-            }
-        }
-    },
-    {
-        "prompt_en": "palace",
-        "dimension": [
-            "scene",
-            "background_consistency"
-        ],
-        "auxiliary_info": {
-            "scene": {
-                "scene": {
-                    "scene": "palace"
-                }
-            }
-        }
-    },
-    {
-        "prompt_en": "parking lot",
-        "dimension": [
-            "scene",
-            "background_consistency"
-        ],
-        "auxiliary_info": {
-            "scene": {
-                "scene": {
-                    "scene": "parking lot"
-                }
-            }
-        }
-    },
-    {
-        "prompt_en": "pharmacy",
-        "dimension": [
-            "scene",
-            "background_consistency"
-        ],
-        "auxiliary_info": {
-            "scene": {
-                "scene": {
-                    "scene": "pharmacy"
-                }
-            }
-        }
-    },
-    {
-        "prompt_en": "phone booth",
-        "dimension": [
-            "scene",
-            "background_consistency"
-        ],
-        "auxiliary_info": {
-            "scene": {
-                "scene": {
-                    "scene": "phone booth"
-                }
-            }
-        }
-    },
-    {
-        "prompt_en": "raceway",
-        "dimension": [
-            "scene",
-            "background_consistency"
-        ],
-        "auxiliary_info": {
-            "scene": {
-                "scene": {
-                    "scene": "raceway"
-                }
-            }
-        }
-    },
-    {
-        "prompt_en": "restaurant",
-        "dimension": [
-            "scene",
-            "background_consistency"
-        ],
-        "auxiliary_info": {
-            "scene": {
-                "scene": {
-                    "scene": "restaurant"
-                }
-            }
-        }
-    },
-    {
-        "prompt_en": "river",
-        "dimension": [
-            "scene",
-            "background_consistency"
-        ],
-        "auxiliary_info": {
-            "scene": {
-                "scene": {
-                    "scene": "river"
-                }
-            }
-        }
-    },
-    {
-        "prompt_en": "science museum",
-        "dimension": [
-            "scene",
-            "background_consistency"
-        ],
-        "auxiliary_info": {
-            "scene": {
-                "scene": {
-                    "scene": "science museum"
-                }
-            }
-        }
-    },
-    {
-        "prompt_en": "shower",
-        "dimension": [
-            "scene",
-            "background_consistency"
-        ],
-        "auxiliary_info": {
-            "scene": {
-                "scene": {
-                    "scene": "shower"
-                }
-            }
-        }
-    },
-    {
-        "prompt_en": "ski slope",
-        "dimension": [
-            "scene",
-            "background_consistency"
-        ],
-        "auxiliary_info": {
-            "scene": {
-                "scene": {
-                    "scene": "ski slope"
-                }
-            }
-        }
-    },
-    {
-        "prompt_en": "sky",
-        "dimension": [
-            "scene",
-            "background_consistency"
-        ],
-        "auxiliary_info": {
-            "scene": {
-                "scene": {
-                    "scene": "sky"
-                }
-            }
-        }
-    },
-    {
-        "prompt_en": "skyscraper",
-        "dimension": [
-            "scene",
-            "background_consistency"
-        ],
-        "auxiliary_info": {
-            "scene": {
-                "scene": {
-                    "scene": "skyscraper"
-                }
-            }
-        }
-    },
-    {
-        "prompt_en": "baseball stadium",
-        "dimension": [
-            "scene",
-            "background_consistency"
-        ],
-        "auxiliary_info": {
-            "scene": {
-                "scene": {
-                    "scene": "baseball stadium"
-                }
-            }
-        }
-    },
-    {
-        "prompt_en": "staircase",
-        "dimension": [
-            "scene",
-            "background_consistency"
-        ],
-        "auxiliary_info": {
-            "scene": {
-                "scene": {
-                    "scene": "staircase"
-                }
-            }
-        }
-    },
-    {
-        "prompt_en": "street",
-        "dimension": [
-            "scene",
-            "background_consistency"
-        ],
-        "auxiliary_info": {
-            "scene": {
-                "scene": {
-                    "scene": "street"
-                }
-            }
-        }
-    },
-    {
-        "prompt_en": "supermarket",
-        "dimension": [
-            "scene",
-            "background_consistency"
-        ],
-        "auxiliary_info": {
-            "scene": {
-                "scene": {
-                    "scene": "supermarket"
-                }
-            }
-        }
-    },
-    {
-        "prompt_en": "indoor swimming pool",
-        "dimension": [
-            "scene",
-            "background_consistency"
-        ],
-        "auxiliary_info": {
-            "scene": {
-                "scene": {
-                    "scene": "indoor swimming pool"
-                }
-            }
-        }
-    },
-    {
-        "prompt_en": "tower",
-        "dimension": [
-            "scene",
-            "background_consistency"
-        ],
-        "auxiliary_info": {
-            "scene": {
-                "scene": {
-                    "scene": "tower"
-                }
-            }
-        }
-    },
-    {
-        "prompt_en": "outdoor track",
-        "dimension": [
-            "scene",
-            "background_consistency"
-        ],
-        "auxiliary_info": {
-            "scene": {
-                "scene": {
-                    "scene": "outdoor track"
-                }
-            }
-        }
-    },
-    {
-        "prompt_en": "train railway",
-        "dimension": [
-            "scene",
-            "background_consistency"
-        ],
-        "auxiliary_info": {
-            "scene": {
-                "scene": {
-                    "scene": "train railway"
-                }
-            }
-        }
-    },
-    {
-        "prompt_en": "train station platform",
-        "dimension": [
-            "scene",
-            "background_consistency"
-        ],
-        "auxiliary_info": {
-            "scene": {
-                "scene": {
-                    "scene": "train station platform"
-                }
-            }
-        }
-    },
-    {
-        "prompt_en": "underwater coral reef",
-        "dimension": [
-            "scene",
-            "background_consistency"
-        ],
-        "auxiliary_info": {
-            "scene": {
-                "scene": {
-                    "scene": "underwater coral reef"
-                }
-            }
-        }
-    },
-    {
-        "prompt_en": "valley",
-        "dimension": [
-            "scene",
-            "background_consistency"
-        ],
-        "auxiliary_info": {
-            "scene": {
-                "scene": {
-                    "scene": "valley"
-                }
-            }
-        }
-    },
-    {
-        "prompt_en": "volcano",
-        "dimension": [
-            "scene",
-            "background_consistency"
-        ],
-        "auxiliary_info": {
-            "scene": {
-                "scene": {
-                    "scene": "volcano"
-                }
-            }
-        }
-    },
-    {
-        "prompt_en": "waterfall",
-        "dimension": [
-            "scene",
-            "background_consistency"
-        ],
-        "auxiliary_info": {
-            "scene": {
-                "scene": {
-                    "scene": "waterfall"
-                }
-            }
-        }
-    },
-    {
-        "prompt_en": "windmill",
-        "dimension": [
-            "scene",
-            "background_consistency"
-        ],
-        "auxiliary_info": {
-            "scene": {
-                "scene": {
-                    "scene": "windmill"
-                }
-            }
-        }
-    },
-    {
-        "prompt_en": "a bicycle on the left of a car, front view",
-        "dimension": [
-            "spatial_relationship"
-        ],
-        "auxiliary_info": {
-            "spatial_relationship": {
-                "spatial_relationship": {
-                    "object_a": "bicycle",
-                    "object_b": "car",
-                    "relationship": "on the left of"
-                }
-            }
-        }
-    },
-    {
-        "prompt_en": "a car on the right of a motorcycle, front view",
-        "dimension": [
-            "spatial_relationship"
-        ],
-        "auxiliary_info": {
-            "spatial_relationship": {
-                "spatial_relationship": {
-                    "object_a": "car",
-                    "object_b": "motorcycle",
-                    "relationship": "on the right of"
-                }
-            }
-        }
-    },
-    {
-        "prompt_en": "a motorcycle on the left of a bus, front view",
-        "dimension": [
-            "spatial_relationship"
-        ],
-        "auxiliary_info": {
-            "spatial_relationship": {
-                "spatial_relationship": {
-                    "object_a": "motorcycle",
-                    "object_b": "bus",
-                    "relationship": "on the left of"
-                }
-            }
-        }
-    },
-    {
-        "prompt_en": "a bus on the right of a traffic light, front view",
-        "dimension": [
-            "spatial_relationship"
-        ],
-        "auxiliary_info": {
-            "spatial_relationship": {
-                "spatial_relationship": {
-                    "object_a": "bus",
-                    "object_b": "traffic light",
-                    "relationship": "on the right of"
-                }
-            }
-        }
-    },
-    {
-        "prompt_en": "a traffic light on the left of a fire hydrant, front view",
-        "dimension": [
-            "spatial_relationship"
-        ],
-        "auxiliary_info": {
-            "spatial_relationship": {
-                "spatial_relationship": {
-                    "object_a": "traffic light",
-                    "object_b": "fire hydrant",
-                    "relationship": "on the left of"
-                }
-            }
-        }
-    },
-    {
-        "prompt_en": "a fire hydrant on the right of a stop sign, front view",
-        "dimension": [
-            "spatial_relationship"
-        ],
-        "auxiliary_info": {
-            "spatial_relationship": {
-                "spatial_relationship": {
-                    "object_a": "fire hydrant",
-                    "object_b": "stop sign",
-                    "relationship": "on the right of"
-                }
-            }
-        }
-    },
-    {
-        "prompt_en": "a stop sign on the left of a parking meter, front view",
-        "dimension": [
-            "spatial_relationship"
-        ],
-        "auxiliary_info": {
-            "spatial_relationship": {
-                "spatial_relationship": {
-                    "object_a": "stop sign",
-                    "object_b": "parking meter",
-                    "relationship": "on the left of"
-                }
-            }
-        }
-    },
-    {
-        "prompt_en": "a parking meter on the right of a bench, front view",
-        "dimension": [
-            "spatial_relationship"
-        ],
-        "auxiliary_info": {
-            "spatial_relationship": {
-                "spatial_relationship": {
-                    "object_a": "parking meter",
-                    "object_b": "bench",
-                    "relationship": "on the right of"
-                }
-            }
-        }
-    },
-    {
-        "prompt_en": "a bench on the left of a truck, front view",
-        "dimension": [
-            "spatial_relationship"
-        ],
-        "auxiliary_info": {
-            "spatial_relationship": {
-                "spatial_relationship": {
-                    "object_a": "bench",
-                    "object_b": "truck",
-                    "relationship": "on the left of"
-                }
-            }
-        }
-    },
-    {
-        "prompt_en": "a truck on the right of a bicycle, front view",
-        "dimension": [
-            "spatial_relationship"
-        ],
-        "auxiliary_info": {
-            "spatial_relationship": {
-                "spatial_relationship": {
-                    "object_a": "truck",
-                    "object_b": "bicycle",
-                    "relationship": "on the right of"
-                }
-            }
-        }
-    },
-    {
-        "prompt_en": "a bird on the left of a cat, front view",
-        "dimension": [
-            "spatial_relationship"
-        ],
-        "auxiliary_info": {
-            "spatial_relationship": {
-                "spatial_relationship": {
-                    "object_a": "bird",
-                    "object_b": "cat",
-                    "relationship": "on the left of"
-                }
-            }
-        }
-    },
-    {
-        "prompt_en": "a cat on the right of a dog, front view",
-        "dimension": [
-            "spatial_relationship"
-        ],
-        "auxiliary_info": {
-            "spatial_relationship": {
-                "spatial_relationship": {
-                    "object_a": "cat",
-                    "object_b": "dog",
-                    "relationship": "on the right of"
-                }
-            }
-        }
-    },
-    {
-        "prompt_en": "a dog on the left of a horse, front view",
-        "dimension": [
-            "spatial_relationship"
-        ],
-        "auxiliary_info": {
-            "spatial_relationship": {
-                "spatial_relationship": {
-                    "object_a": "dog",
-                    "object_b": "horse",
-                    "relationship": "on the left of"
-                }
-            }
-        }
-    },
-    {
-        "prompt_en": "a horse on the right of a sheep, front view",
-        "dimension": [
-            "spatial_relationship"
-        ],
-        "auxiliary_info": {
-            "spatial_relationship": {
-                "spatial_relationship": {
-                    "object_a": "horse",
-                    "object_b": "sheep",
-                    "relationship": "on the right of"
-                }
-            }
-        }
-    },
-    {
-        "prompt_en": "a sheep on the left of a cow, front view",
-        "dimension": [
-            "spatial_relationship"
-        ],
-        "auxiliary_info": {
-            "spatial_relationship": {
-                "spatial_relationship": {
-                    "object_a": "sheep",
-                    "object_b": "cow",
-                    "relationship": "on the left of"
-                }
-            }
-        }
-    },
-    {
-        "prompt_en": "a cow on the right of an elephant, front view",
-        "dimension": [
-            "spatial_relationship"
-        ],
-        "auxiliary_info": {
-            "spatial_relationship": {
-                "spatial_relationship": {
-                    "object_a": "cow",
-                    "object_b": "elephant",
-                    "relationship": "on the right of"
-                }
-            }
-        }
-    },
-    {
-        "prompt_en": "an elephant on the left of a bear, front view",
-        "dimension": [
-            "spatial_relationship"
-        ],
-        "auxiliary_info": {
-            "spatial_relationship": {
-                "spatial_relationship": {
-                    "object_a": "elephant",
-                    "object_b": "bear",
-                    "relationship": "on the left of"
-                }
-            }
-        }
-    },
-    {
-        "prompt_en": "a bear on the right of a zebra, front view",
-        "dimension": [
-            "spatial_relationship"
-        ],
-        "auxiliary_info": {
-            "spatial_relationship": {
-                "spatial_relationship": {
-                    "object_a": "bear",
-                    "object_b": "zebra",
-                    "relationship": "on the right of"
-                }
-            }
-        }
-    },
-    {
-        "prompt_en": "a zebra on the left of a giraffe, front view",
-        "dimension": [
-            "spatial_relationship"
-        ],
-        "auxiliary_info": {
-            "spatial_relationship": {
-                "spatial_relationship": {
-                    "object_a": "zebra",
-                    "object_b": "giraffe",
-                    "relationship": "on the left of"
-                }
-            }
-        }
-    },
-    {
-        "prompt_en": "a giraffe on the right of a bird, front view",
-        "dimension": [
-            "spatial_relationship"
-        ],
-        "auxiliary_info": {
-            "spatial_relationship": {
-                "spatial_relationship": {
-                    "object_a": "giraffe",
-                    "object_b": "bird",
-                    "relationship": "on the right of"
-                }
-            }
-        }
-    },
-    {
-        "prompt_en": "a bottle on the left of a wine glass, front view",
-        "dimension": [
-            "spatial_relationship"
-        ],
-        "auxiliary_info": {
-            "spatial_relationship": {
-                "spatial_relationship": {
-                    "object_a": "bottle",
-                    "object_b": "wine glass",
-                    "relationship": "on the left of"
-                }
-            }
-        }
-    },
-    {
-        "prompt_en": "a wine glass on the right of a cup, front view",
-        "dimension": [
-            "spatial_relationship"
-        ],
-        "auxiliary_info": {
-            "spatial_relationship": {
-                "spatial_relationship": {
-                    "object_a": "wine glass",
-                    "object_b": "cup",
-                    "relationship": "on the right of"
-                }
-            }
-        }
-    },
-    {
-        "prompt_en": "a cup on the left of a fork, front view",
-        "dimension": [
-            "spatial_relationship"
-        ],
-        "auxiliary_info": {
-            "spatial_relationship": {
-                "spatial_relationship": {
-                    "object_a": "cup",
-                    "object_b": "fork",
-                    "relationship": "on the left of"
-                }
-            }
-        }
-    },
-    {
-        "prompt_en": "a fork on the right of a knife, front view",
-        "dimension": [
-            "spatial_relationship"
-        ],
-        "auxiliary_info": {
-            "spatial_relationship": {
-                "spatial_relationship": {
-                    "object_a": "fork",
-                    "object_b": "knife",
-                    "relationship": "on the right of"
-                }
-            }
-        }
-    },
-    {
-        "prompt_en": "a knife on the left of a spoon, front view",
-        "dimension": [
-            "spatial_relationship"
-        ],
-        "auxiliary_info": {
-            "spatial_relationship": {
-                "spatial_relationship": {
-                    "object_a": "knife",
-                    "object_b": "spoon",
-                    "relationship": "on the left of"
-                }
-            }
-        }
-    },
-    {
-        "prompt_en": "a spoon on the right of a bowl, front view",
-        "dimension": [
-            "spatial_relationship"
-        ],
-        "auxiliary_info": {
-            "spatial_relationship": {
-                "spatial_relationship": {
-                    "object_a": "spoon",
-                    "object_b": "bowl",
-                    "relationship": "on the right of"
-                }
-            }
-        }
-    },
-    {
-        "prompt_en": "a bowl on the left of a bottle, front view",
-        "dimension": [
-            "spatial_relationship"
-        ],
-        "auxiliary_info": {
-            "spatial_relationship": {
-                "spatial_relationship": {
-                    "object_a": "bowl",
-                    "object_b": "bottle",
-                    "relationship": "on the left of"
-                }
-            }
-        }
-    },
-    {
-        "prompt_en": "a potted plant on the left of a remote, front view",
-        "dimension": [
-            "spatial_relationship"
-        ],
-        "auxiliary_info": {
-            "spatial_relationship": {
-                "spatial_relationship": {
-                    "object_a": "potted plant",
-                    "object_b": "remote",
-                    "relationship": "on the left of"
-                }
-            }
-        }
-    },
-    {
-        "prompt_en": "a remote on the right of a clock, front view",
-        "dimension": [
-            "spatial_relationship"
-        ],
-        "auxiliary_info": {
-            "spatial_relationship": {
-                "spatial_relationship": {
-                    "object_a": "remote",
-                    "object_b": "clock",
-                    "relationship": "on the right of"
-                }
-            }
-        }
-    },
-    {
-        "prompt_en": "a clock on the left of a vase, front view",
-        "dimension": [
-            "spatial_relationship"
-        ],
-        "auxiliary_info": {
-            "spatial_relationship": {
-                "spatial_relationship": {
-                    "object_a": "clock",
-                    "object_b": "vase",
-                    "relationship": "on the left of"
-                }
-            }
-        }
-    },
-    {
-        "prompt_en": "a vase on the right of scissors, front view",
-        "dimension": [
-            "spatial_relationship"
-        ],
-        "auxiliary_info": {
-            "spatial_relationship": {
-                "spatial_relationship": {
-                    "object_a": "vase",
-                    "object_b": "scissors",
-                    "relationship": "on the right of"
-                }
-            }
-        }
-    },
-    {
-        "prompt_en": "scissors on the left of a teddy bear, front view",
-        "dimension": [
-            "spatial_relationship"
-        ],
-        "auxiliary_info": {
-            "spatial_relationship": {
-                "spatial_relationship": {
-                    "object_a": "scissors",
-                    "object_b": "teddy bear",
-                    "relationship": "on the left of"
-                }
-            }
-        }
-    },
-    {
-        "prompt_en": "a teddy bear on the right of a potted plant, front view",
-        "dimension": [
-            "spatial_relationship"
-        ],
-        "auxiliary_info": {
-            "spatial_relationship": {
-                "spatial_relationship": {
-                    "object_a": "teddy bear",
-                    "object_b": "potted plant",
-                    "relationship": "on the right of"
-                }
-            }
-        }
-    },
-    {
-        "prompt_en": "a frisbee on the left of a sports ball, front view",
-        "dimension": [
-            "spatial_relationship"
-        ],
-        "auxiliary_info": {
-            "spatial_relationship": {
-                "spatial_relationship": {
-                    "object_a": "frisbee",
-                    "object_b": "sports ball",
-                    "relationship": "on the left of"
-                }
-            }
-        }
-    },
-    {
-        "prompt_en": "a sports ball on the right of a baseball bat, front view",
-        "dimension": [
-            "spatial_relationship"
-        ],
-        "auxiliary_info": {
-            "spatial_relationship": {
-                "spatial_relationship": {
-                    "object_a": "sports ball",
-                    "object_b": "baseball bat",
-                    "relationship": "on the right of"
-                }
-            }
-        }
-    },
-    {
-        "prompt_en": "a baseball bat on the left of a baseball glove, front view",
-        "dimension": [
-            "spatial_relationship"
-        ],
-        "auxiliary_info": {
-            "spatial_relationship": {
-                "spatial_relationship": {
-                    "object_a": "baseball bat",
-                    "object_b": "baseball glove",
-                    "relationship": "on the left of"
-                }
-            }
-        }
-    },
-    {
-        "prompt_en": "a baseball glove on the right of a tennis racket, front view",
-        "dimension": [
-            "spatial_relationship"
-        ],
-        "auxiliary_info": {
-            "spatial_relationship": {
-                "spatial_relationship": {
-                    "object_a": "baseball glove",
-                    "object_b": "tennis racket",
-                    "relationship": "on the right of"
-                }
-            }
-        }
-    },
-    {
-        "prompt_en": "a tennis racket on the left of a frisbee, front view",
-        "dimension": [
-            "spatial_relationship"
-        ],
-        "auxiliary_info": {
-            "spatial_relationship": {
-                "spatial_relationship": {
-                    "object_a": "tennis racket",
-                    "object_b": "frisbee",
-                    "relationship": "on the left of"
-                }
-            }
-        }
-    },
-    {
-        "prompt_en": "a toilet on the left of a hair drier, front view",
-        "dimension": [
-            "spatial_relationship"
-        ],
-        "auxiliary_info": {
-            "spatial_relationship": {
-                "spatial_relationship": {
-                    "object_a": "toilet",
-                    "object_b": "hair drier",
-                    "relationship": "on the left of"
-                }
-            }
-        }
-    },
-    {
-        "prompt_en": "a hair drier on the right of a toothbrush, front view",
-        "dimension": [
-            "spatial_relationship"
-        ],
-        "auxiliary_info": {
-            "spatial_relationship": {
-                "spatial_relationship": {
-                    "object_a": "hair drier",
-                    "object_b": "toothbrush",
-                    "relationship": "on the right of"
-                }
-            }
-        }
-    },
-    {
-        "prompt_en": "a toothbrush on the left of a sink, front view",
-        "dimension": [
-            "spatial_relationship"
-        ],
-        "auxiliary_info": {
-            "spatial_relationship": {
-                "spatial_relationship": {
-                    "object_a": "toothbrush",
-                    "object_b": "sink",
-                    "relationship": "on the left of"
-                }
-            }
-        }
-    },
-    {
-        "prompt_en": "a sink on the right of a toilet, front view",
-        "dimension": [
-            "spatial_relationship"
-        ],
-        "auxiliary_info": {
-            "spatial_relationship": {
-                "spatial_relationship": {
-                    "object_a": "sink",
-                    "object_b": "toilet",
-                    "relationship": "on the right of"
-                }
-            }
-        }
-    },
-    {
-        "prompt_en": "a chair on the left of a couch, front view",
-        "dimension": [
-            "spatial_relationship"
-        ],
-        "auxiliary_info": {
-            "spatial_relationship": {
-                "spatial_relationship": {
-                    "object_a": "chair",
-                    "object_b": "couch",
-                    "relationship": "on the left of"
-                }
-            }
-        }
-    },
-    {
-        "prompt_en": "a couch on the right of a bed, front view",
-        "dimension": [
-            "spatial_relationship"
-        ],
-        "auxiliary_info": {
-            "spatial_relationship": {
-                "spatial_relationship": {
-                    "object_a": "couch",
-                    "object_b": "bed",
-                    "relationship": "on the right of"
-                }
-            }
-        }
-    },
-    {
-        "prompt_en": "a bed on the left of a tv, front view",
-        "dimension": [
-            "spatial_relationship"
-        ],
-        "auxiliary_info": {
-            "spatial_relationship": {
-                "spatial_relationship": {
-                    "object_a": "bed",
-                    "object_b": "tv",
-                    "relationship": "on the left of"
-                }
-            }
-        }
-    },
-    {
-        "prompt_en": "a tv on the right of a dining table, front view",
-        "dimension": [
-            "spatial_relationship"
-        ],
-        "auxiliary_info": {
-            "spatial_relationship": {
-                "spatial_relationship": {
-                    "object_a": "tv",
-                    "object_b": "dining table",
-                    "relationship": "on the right of"
-                }
-            }
-        }
-    },
-    {
-        "prompt_en": "a dining table on the left of a chair, front view",
-        "dimension": [
-            "spatial_relationship"
-        ],
-        "auxiliary_info": {
-            "spatial_relationship": {
-                "spatial_relationship": {
-                    "object_a": "dining table",
-                    "object_b": "chair",
-                    "relationship": "on the left of"
-                }
-            }
-        }
-    },
-    {
-        "prompt_en": "an airplane on the left of a train, front view",
-        "dimension": [
-            "spatial_relationship"
-        ],
-        "auxiliary_info": {
-            "spatial_relationship": {
-                "spatial_relationship": {
-                    "object_a": "airplane",
-                    "object_b": "train",
-                    "relationship": "on the left of"
-                }
-            }
-        }
-    },
-    {
-        "prompt_en": "a train on the right of a boat, front view",
-        "dimension": [
-            "spatial_relationship"
-        ],
-        "auxiliary_info": {
-            "spatial_relationship": {
-                "spatial_relationship": {
-                    "object_a": "train",
-                    "object_b": "boat",
-                    "relationship": "on the right of"
-                }
-            }
-        }
-    },
-    {
-        "prompt_en": "a boat on the left of an airplane, front view",
-        "dimension": [
-            "spatial_relationship"
-        ],
-        "auxiliary_info": {
-            "spatial_relationship": {
-                "spatial_relationship": {
-                    "object_a": "boat",
-                    "object_b": "airplane",
-                    "relationship": "on the left of"
-                }
-            }
-        }
-    },
-    {
-        "prompt_en": "an oven on the top of a toaster, front view",
-        "dimension": [
-            "spatial_relationship"
-        ],
-        "auxiliary_info": {
-            "spatial_relationship": {
-                "spatial_relationship": {
-                    "object_a": "oven",
-                    "object_b": "toaster",
-                    "relationship": "on the top of"
-                }
-            }
-        }
-    },
-    {
-        "prompt_en": "an oven on the bottom of a toaster, front view",
-        "dimension": [
-            "spatial_relationship"
-        ],
-        "auxiliary_info": {
-            "spatial_relationship": {
-                "spatial_relationship": {
-                    "object_a": "oven",
-                    "object_b": "toaster",
-                    "relationship": "on the bottom of"
-                }
-            }
-        }
-    },
-    {
-        "prompt_en": "a toaster on the top of a microwave, front view",
-        "dimension": [
-            "spatial_relationship"
-        ],
-        "auxiliary_info": {
-            "spatial_relationship": {
-                "spatial_relationship": {
-                    "object_a": "toaster",
-                    "object_b": "microwave",
-                    "relationship": "on the top of"
-                }
-            }
-        }
-    },
-    {
-        "prompt_en": "a toaster on the bottom of a microwave, front view",
-        "dimension": [
-            "spatial_relationship"
-        ],
-        "auxiliary_info": {
-            "spatial_relationship": {
-                "spatial_relationship": {
-                    "object_a": "toaster",
-                    "object_b": "microwave",
-                    "relationship": "on the bottom of"
-                }
-            }
-        }
-    },
-    {
-        "prompt_en": "a microwave on the top of an oven, front view",
-        "dimension": [
-            "spatial_relationship"
-        ],
-        "auxiliary_info": {
-            "spatial_relationship": {
-                "spatial_relationship": {
-                    "object_a": "microwave",
-                    "object_b": "oven",
-                    "relationship": "on the top of"
-                }
-            }
-        }
-    },
-    {
-        "prompt_en": "a microwave on the bottom of an oven, front view",
-        "dimension": [
-            "spatial_relationship"
-        ],
-        "auxiliary_info": {
-            "spatial_relationship": {
-                "spatial_relationship": {
-                    "object_a": "microwave",
-                    "object_b": "oven",
-                    "relationship": "on the bottom of"
-                }
-            }
-        }
-    },
-    {
-        "prompt_en": "a banana on the top of an apple, front view",
-        "dimension": [
-            "spatial_relationship"
-        ],
-        "auxiliary_info": {
-            "spatial_relationship": {
-                "spatial_relationship": {
-                    "object_a": "banana",
-                    "object_b": "apple",
-                    "relationship": "on the top of"
-                }
-            }
-        }
-    },
-    {
-        "prompt_en": "a banana on the bottom of an apple, front view",
-        "dimension": [
-            "spatial_relationship"
-        ],
-        "auxiliary_info": {
-            "spatial_relationship": {
-                "spatial_relationship": {
-                    "object_a": "banana",
-                    "object_b": "apple",
-                    "relationship": "on the bottom of"
-                }
-            }
-        }
-    },
-    {
-        "prompt_en": "an apple on the top of a sandwich, front view",
-        "dimension": [
-            "spatial_relationship"
-        ],
-        "auxiliary_info": {
-            "spatial_relationship": {
-                "spatial_relationship": {
-                    "object_a": "apple",
-                    "object_b": "sandwich",
-                    "relationship": "on the top of"
-                }
-            }
-        }
-    },
-    {
-        "prompt_en": "an apple on the bottom of a sandwich, front view",
-        "dimension": [
-            "spatial_relationship"
-        ],
-        "auxiliary_info": {
-            "spatial_relationship": {
-                "spatial_relationship": {
-                    "object_a": "apple",
-                    "object_b": "sandwich",
-                    "relationship": "on the bottom of"
-                }
-            }
-        }
-    },
-    {
-        "prompt_en": "a sandwich on the top of an orange, front view",
-        "dimension": [
-            "spatial_relationship"
-        ],
-        "auxiliary_info": {
-            "spatial_relationship": {
-                "spatial_relationship": {
-                    "object_a": "sandwich",
-                    "object_b": "orange",
-                    "relationship": "on the top of"
-                }
-            }
-        }
-    },
-    {
-        "prompt_en": "a sandwich on the bottom of an orange, front view",
-        "dimension": [
-            "spatial_relationship"
-        ],
-        "auxiliary_info": {
-            "spatial_relationship": {
-                "spatial_relationship": {
-                    "object_a": "sandwich",
-                    "object_b": "orange",
-                    "relationship": "on the bottom of"
-                }
-            }
-        }
-    },
-    {
-        "prompt_en": "an orange on the top of a carrot, front view",
-        "dimension": [
-            "spatial_relationship"
-        ],
-        "auxiliary_info": {
-            "spatial_relationship": {
-                "spatial_relationship": {
-                    "object_a": "orange",
-                    "object_b": "carrot",
-                    "relationship": "on the top of"
-                }
-            }
-        }
-    },
-    {
-        "prompt_en": "an orange on the bottom of a carrot, front view",
-        "dimension": [
-            "spatial_relationship"
-        ],
-        "auxiliary_info": {
-            "spatial_relationship": {
-                "spatial_relationship": {
-                    "object_a": "orange",
-                    "object_b": "carrot",
-                    "relationship": "on the bottom of"
-                }
-            }
-        }
-    },
-    {
-        "prompt_en": "a carrot on the top of a hot dog, front view",
-        "dimension": [
-            "spatial_relationship"
-        ],
-        "auxiliary_info": {
-            "spatial_relationship": {
-                "spatial_relationship": {
-                    "object_a": "carrot",
-                    "object_b": "hot dog",
-                    "relationship": "on the top of"
-                }
-            }
-        }
-    },
-    {
-        "prompt_en": "a carrot on the bottom of a hot dog, front view",
-        "dimension": [
-            "spatial_relationship"
-        ],
-        "auxiliary_info": {
-            "spatial_relationship": {
-                "spatial_relationship": {
-                    "object_a": "carrot",
-                    "object_b": "hot dog",
-                    "relationship": "on the bottom of"
-                }
-            }
-        }
-    },
-    {
-        "prompt_en": "a hot dog on the top of a pizza, front view",
-        "dimension": [
-            "spatial_relationship"
-        ],
-        "auxiliary_info": {
-            "spatial_relationship": {
-                "spatial_relationship": {
-                    "object_a": "hot dog",
-                    "object_b": "pizza",
-                    "relationship": "on the top of"
-                }
-            }
-        }
-    },
-    {
-        "prompt_en": "a hot dog on the bottom of a pizza, front view",
-        "dimension": [
-            "spatial_relationship"
-        ],
-        "auxiliary_info": {
-            "spatial_relationship": {
-                "spatial_relationship": {
-                    "object_a": "hot dog",
-                    "object_b": "pizza",
-                    "relationship": "on the bottom of"
-                }
-            }
-        }
-    },
-    {
-        "prompt_en": "a pizza on the top of a donut, front view",
-        "dimension": [
-            "spatial_relationship"
-        ],
-        "auxiliary_info": {
-            "spatial_relationship": {
-                "spatial_relationship": {
-                    "object_a": "pizza",
-                    "object_b": "donut",
-                    "relationship": "on the top of"
-                }
-            }
-        }
-    },
-    {
-        "prompt_en": "a pizza on the bottom of a donut, front view",
-        "dimension": [
-            "spatial_relationship"
-        ],
-        "auxiliary_info": {
-            "spatial_relationship": {
-                "spatial_relationship": {
-                    "object_a": "pizza",
-                    "object_b": "donut",
-                    "relationship": "on the bottom of"
-                }
-            }
-        }
-    },
-    {
-        "prompt_en": "a donut on the top of broccoli, front view",
-        "dimension": [
-            "spatial_relationship"
-        ],
-        "auxiliary_info": {
-            "spatial_relationship": {
-                "spatial_relationship": {
-                    "object_a": "donut",
-                    "object_b": "broccoli",
-                    "relationship": "on the top of"
-                }
-            }
-        }
-    },
-    {
-        "prompt_en": "a donut on the bottom of broccoli, front view",
-        "dimension": [
-            "spatial_relationship"
-        ],
-        "auxiliary_info": {
-            "spatial_relationship": {
-                "spatial_relationship": {
-                    "object_a": "donut",
-                    "object_b": "broccoli",
-                    "relationship": "on the bottom of"
-                }
-            }
-        }
-    },
-    {
-        "prompt_en": "broccoli on the top of a banana, front view",
-        "dimension": [
-            "spatial_relationship"
-        ],
-        "auxiliary_info": {
-            "spatial_relationship": {
-                "spatial_relationship": {
-                    "object_a": "broccoli",
-                    "object_b": "banana",
-                    "relationship": "on the top of"
-                }
-            }
-        }
-    },
-    {
-        "prompt_en": "broccoli on the bottom of a banana, front view",
-        "dimension": [
-            "spatial_relationship"
-        ],
-        "auxiliary_info": {
-            "spatial_relationship": {
-                "spatial_relationship": {
-                    "object_a": "broccoli",
-                    "object_b": "banana",
-                    "relationship": "on the bottom of"
-                }
-            }
-        }
-    },
-    {
-        "prompt_en": "skis on the top of a snowboard, front view",
-        "dimension": [
-            "spatial_relationship"
-        ],
-        "auxiliary_info": {
-            "spatial_relationship": {
-                "spatial_relationship": {
-                    "object_a": "skis",
-                    "object_b": "snowboard",
-                    "relationship": "on the top of"
-                }
-            }
-        }
-    },
-    {
-        "prompt_en": "skis on the bottom of a snowboard, front view",
-        "dimension": [
-            "spatial_relationship"
-        ],
-        "auxiliary_info": {
-            "spatial_relationship": {
-                "spatial_relationship": {
-                    "object_a": "skis",
-                    "object_b": "snowboard",
-                    "relationship": "on the bottom of"
-                }
-            }
-        }
-    },
-    {
-        "prompt_en": "a snowboard on the top of a kite, front view",
-        "dimension": [
-            "spatial_relationship"
-        ],
-        "auxiliary_info": {
-            "spatial_relationship": {
-                "spatial_relationship": {
-                    "object_a": "snowboard",
-                    "object_b": "kite",
-                    "relationship": "on the top of"
-                }
-            }
-        }
-    },
-    {
-        "prompt_en": "a snowboard on the bottom of a kite, front view",
-        "dimension": [
-            "spatial_relationship"
-        ],
-        "auxiliary_info": {
-            "spatial_relationship": {
-                "spatial_relationship": {
-                    "object_a": "snowboard",
-                    "object_b": "kite",
-                    "relationship": "on the bottom of"
-                }
-            }
-        }
-    },
-    {
-        "prompt_en": "a kite on the top of a skateboard, front view",
-        "dimension": [
-            "spatial_relationship"
-        ],
-        "auxiliary_info": {
-            "spatial_relationship": {
-                "spatial_relationship": {
-                    "object_a": "kite",
-                    "object_b": "skateboard",
-                    "relationship": "on the top of"
-                }
-            }
-        }
-    },
-    {
-        "prompt_en": "a kite on the bottom of a skateboard, front view",
-        "dimension": [
-            "spatial_relationship"
-        ],
-        "auxiliary_info": {
-            "spatial_relationship": {
-                "spatial_relationship": {
-                    "object_a": "kite",
-                    "object_b": "skateboard",
-                    "relationship": "on the bottom of"
-                }
-            }
-        }
-    },
-    {
-        "prompt_en": "a skateboard on the top of a surfboard, front view",
-        "dimension": [
-            "spatial_relationship"
-        ],
-        "auxiliary_info": {
-            "spatial_relationship": {
-                "spatial_relationship": {
-                    "object_a": "skateboard",
-                    "object_b": "surfboard",
-                    "relationship": "on the top of"
-                }
-            }
-        }
-    },
-    {
-        "prompt_en": "a skateboard on the bottom of a surfboard, front view",
-        "dimension": [
-            "spatial_relationship"
-        ],
-        "auxiliary_info": {
-            "spatial_relationship": {
-                "spatial_relationship": {
-                    "object_a": "skateboard",
-                    "object_b": "surfboard",
-                    "relationship": "on the bottom of"
-                }
-            }
-        }
-    },
-    {
-        "prompt_en": "a surfboard on the top of skis, front view",
-        "dimension": [
-            "spatial_relationship"
-        ],
-        "auxiliary_info": {
-            "spatial_relationship": {
-                "spatial_relationship": {
-                    "object_a": "surfboard",
-                    "object_b": "skis",
-                    "relationship": "on the top of"
-                }
-            }
-        }
-    },
-    {
-        "prompt_en": "a surfboard on the bottom of skis, front view",
-        "dimension": [
-            "spatial_relationship"
-        ],
-        "auxiliary_info": {
-            "spatial_relationship": {
-                "spatial_relationship": {
-                    "object_a": "surfboard",
-                    "object_b": "skis",
-                    "relationship": "on the bottom of"
-                }
-            }
-        }
-    }
-]
diff --git a/eval/vbench/__init__.py b/eval/vbench/__init__.py
deleted file mode 100644
index f4234c4c..00000000
--- a/eval/vbench/__init__.py
+++ /dev/null
@@ -1,255 +0,0 @@
-import importlib
-import os
-from itertools import chain
-from pathlib import Path
-
-from .utils import get_prompt_from_filename, init_submodules, load_json, save_json
-
-
-class VBench(object):
-    def __init__(self, device, full_info_dir, output_path):
-        self.device = device  # cuda or cpu
-        self.full_info_dir = (
-            full_info_dir  # full json file that VBench originally provides
-        )
-        self.output_path = output_path  # output directory to save VBench results
-        os.makedirs(self.output_path, exist_ok=True)
-
-    def build_full_dimension_list(
-        self,
-    ):
-        return [
-            "subject_consistency",
-            "background_consistency",
-            "aesthetic_quality",
-            "imaging_quality",
-            "object_class",
-            "multiple_objects",
-            "color",
-            "spatial_relationship",
-            "scene",
-            "temporal_style",
-            "overall_consistency",
-            "human_action",
-            "temporal_flickering",
-            "motion_smoothness",
-            "dynamic_degree",
-            "appearance_style",
-        ]
-
-    def check_dimension_requires_extra_info(self, dimension_list):
-        dim_custom_not_supported = set(dimension_list) & set(
-            [
-                "object_class",
-                "multiple_objects",
-                "scene",
-                "appearance_style",
-                "color",
-                "spatial_relationship",
-            ]
-        )
-
-        assert (
-            len(dim_custom_not_supported) == 0
-        ), f"dimensions : {dim_custom_not_supported} not supported for custom input"
-
-    def build_full_info_json(
-        self,
-        videos_path,
-        name,
-        dimension_list,
-        prompt_list=[],
-        special_str="",
-        verbose=False,
-        mode="vbench_standard",
-        **kwargs,
-    ):
-        cur_full_info_list = (
-            []
-        )  # to save the prompt and video path info for the current dimensions
-        if mode == "custom_input":
-            self.check_dimension_requires_extra_info(dimension_list)
-            if os.path.isfile(videos_path):
-                cur_full_info_list = [
-                    {
-                        "prompt_en": get_prompt_from_filename(videos_path),
-                        "dimension": dimension_list,
-                        "video_list": [videos_path],
-                    }
-                ]
-                if len(prompt_list) == 1:
-                    cur_full_info_list[0]["prompt_en"] = prompt_list[0]
-            else:
-                video_names = os.listdir(videos_path)
-
-                cur_full_info_list = []
-
-                for filename in video_names:
-                    postfix = Path(os.path.join(videos_path, filename)).suffix
-                    if postfix.lower() not in [".mp4", ".gif", ".jpg", ".png"]:
-                        continue
-                    cur_full_info_list.append(
-                        {
-                            "prompt_en": get_prompt_from_filename(filename),
-                            "dimension": dimension_list,
-                            "video_list": [os.path.join(videos_path, filename)],
-                        }
-                    )
-
-                if len(prompt_list) > 0:
-                    prompt_list = {
-                        os.path.join(videos_path, path): prompt_list[path]
-                        for path in prompt_list
-                    }
-                    assert len(prompt_list) >= len(
-                        cur_full_info_list
-                    ), """
-                        Number of prompts should match with number of videos.\n
-                        Got {len(prompt_list)=}, {len(cur_full_info_list)=}\n
-                        To read the prompt from filename, delete --prompt_file and --prompt_list
-                        """
-
-                    all_video_path = [
-                        os.path.abspath(file)
-                        for file in list(
-                            chain.from_iterable(
-                                vid["video_list"] for vid in cur_full_info_list
-                            )
-                        )
-                    ]
-                    backslash = "\n"
-                    assert (
-                        len(
-                            set(all_video_path)
-                            - set(
-                                [os.path.abspath(path_key) for path_key in prompt_list]
-                            )
-                        )
-                        == 0
-                    ), f"""
-                    The prompts for the following videos are not found in the prompt file: \n
-                    {backslash.join(set(all_video_path) - set([os.path.abspath(path_key) for path_key in prompt_list]))}
-                    """
-
-                    video_map = {}
-                    for prompt_key in prompt_list:
-                        video_map[os.path.abspath(prompt_key)] = prompt_list[prompt_key]
-
-                    for video_info in cur_full_info_list:
-                        video_info["prompt_en"] = video_map[
-                            os.path.abspath(video_info["video_list"][0])
-                        ]
-
-        elif mode == "vbench_category":
-            self.check_dimension_requires_extra_info(dimension_list)
-            CUR_DIR = os.path.dirname(os.path.abspath(__file__))
-            category_supported = [
-                Path(category).stem
-                for category in os.listdir(f"prompts/prompts_per_category")
-            ]  # TODO: probably need refactoring again
-            if "category" not in kwargs:
-                category = category_supported
-            else:
-                category = kwargs["category"]
-
-            assert (
-                category is not None
-            ), "Please specify the category to be evaluated with --category"
-            assert (
-                category in category_supported
-            ), f"""
-            The following category is not supported, {category}.
-            """
-
-            video_names = os.listdir(videos_path)
-            postfix = Path(video_names[0]).suffix
-
-            with open(f"{CUR_DIR}/prompts_per_category/{category}.txt", "r") as f:
-                video_prompts = [line.strip() for line in f.readlines()]
-
-            for prompt in video_prompts:
-                video_list = []
-                for filename in video_names:
-                    if not Path(filename).stem.startswith(prompt):
-                        continue
-                    postfix = Path(os.path.join(videos_path, filename)).suffix
-                    if postfix.lower() not in [".mp4", ".gif", ".jpg", ".png"]:
-                        continue
-                    video_list.append(os.path.join(videos_path, filename))
-
-                cur_full_info_list.append(
-                    {
-                        "prompt_en": prompt,
-                        "dimension": dimension_list,
-                        "video_list": video_list,
-                    }
-                )
-
-        else:
-            full_info_list = load_json(self.full_info_dir)
-            video_names = os.listdir(videos_path)
-            postfix = Path(video_names[0]).suffix
-            for prompt_dict in full_info_list:
-                # if the prompt belongs to any dimension we want to evaluate
-                if set(dimension_list) & set(prompt_dict["dimension"]):
-                    prompt = prompt_dict["prompt_en"]
-                    prompt_dict["video_list"] = []
-                    for i in range(5):  # video index for the same prompt
-                        intended_video_name = f"{prompt}{special_str}-{str(i)}{postfix}"
-                        if intended_video_name in video_names:  # if the video exists
-                            intended_video_path = os.path.join(
-                                videos_path, intended_video_name
-                            )
-                            prompt_dict["video_list"].append(intended_video_path)
-                            if verbose:
-                                print(
-                                    f"Successfully found video: {intended_video_name}"
-                                )
-                        else:
-                            print(
-                                f"WARNING!!! This required video is not found! Missing benchmark videos can lead to unfair evaluation result. The missing video is: {intended_video_name}"
-                            )
-                    cur_full_info_list.append(prompt_dict)
-
-        cur_full_info_path = os.path.join(self.output_path, name + "_full_info.json")
-        save_json(cur_full_info_list, cur_full_info_path)
-        print(f"Evaluation meta data saved to {cur_full_info_path}")
-        return cur_full_info_path
-
-    def evaluate(
-        self,
-        videos_path,
-        name,
-        prompt_list=[],
-        dimension_list=None,
-        local=False,
-        read_frame=False,
-        mode="vbench_standard",
-        **kwargs,
-    ):
-        results_dict = {}
-        if dimension_list is None:
-            dimension_list = self.build_full_dimension_list()
-        submodules_dict = init_submodules(
-            dimension_list, local=local, read_frame=read_frame
-        )
-
-        cur_full_info_path = self.build_full_info_json(
-            videos_path, name, dimension_list, prompt_list, mode=mode, **kwargs
-        )
-
-        for dimension in dimension_list:
-            try:
-                dimension_module = importlib.import_module(f"vbench.{dimension}")
-                evaluate_func = getattr(dimension_module, f"compute_{dimension}")
-            except Exception as e:
-                raise NotImplementedError(f"UnImplemented dimension {dimension}!, {e}")
-            submodules_list = submodules_dict[dimension]
-            print(f"cur_full_info_path: {cur_full_info_path}")  # TODO: to delete
-            results = evaluate_func(
-                cur_full_info_path, self.device, submodules_list, **kwargs
-            )
-            results_dict[dimension] = results
-        output_name = os.path.join(self.output_path, name + "_eval_results.json")
-        save_json(results_dict, output_name)
-        print(f"Evaluation results saved to {output_name}")
diff --git a/eval/vbench/aesthetic_quality.py b/eval/vbench/aesthetic_quality.py
deleted file mode 100644
index 972cf21e..00000000
--- a/eval/vbench/aesthetic_quality.py
+++ /dev/null
@@ -1,75 +0,0 @@
-import os
-import subprocess
-from urllib.request import urlretrieve
-
-import clip
-import torch
-import torch.nn as nn
-import torch.nn.functional as F
-from tqdm import tqdm
-from vbench.utils import clip_transform, load_dimension_info, load_video
-
-
-def get_aesthetic_model(cache_folder):
-    """load the aethetic model"""
-    path_to_model = cache_folder + "/sa_0_4_vit_l_14_linear.pth"
-    if not os.path.exists(path_to_model):
-        os.makedirs(cache_folder, exist_ok=True)
-        url_model = "https://github.com/LAION-AI/aesthetic-predictor/blob/main/sa_0_4_vit_l_14_linear.pth?raw=true"
-        # download aesthetic predictor
-        if not os.path.isfile(path_to_model):
-            try:
-                print(f"trying urlretrieve to download {url_model} to {path_to_model}")
-                urlretrieve(
-                    url_model, path_to_model
-                )  # unable to download https://github.com/LAION-AI/aesthetic-predictor/blob/main/sa_0_4_vit_l_14_linear.pth?raw=true to pretrained/aesthetic_model/emb_reader/sa_0_4_vit_l_14_linear.pth
-            except:
-                print(
-                    f"unable to download {url_model} to {path_to_model} using urlretrieve, trying wget"
-                )
-                wget_command = ["wget", url_model, "-P", os.path.dirname(path_to_model)]
-                subprocess.run(wget_command)
-    m = nn.Linear(768, 1)
-    s = torch.load(path_to_model)
-    m.load_state_dict(s)
-    m.eval()
-    return m
-
-
-def laion_aesthetic(aesthetic_model, clip_model, video_list, device):
-    aesthetic_model.eval()
-    clip_model.eval()
-    aesthetic_avg = 0.0
-    num = 0
-    video_results = []
-    for video_path in tqdm(video_list):
-        images = load_video(video_path)
-        image_transform = clip_transform(224)
-        images = image_transform(images)
-        images = images.to(device)
-        image_feats = clip_model.encode_image(images).to(torch.float32)
-        image_feats = F.normalize(image_feats, dim=-1, p=2)
-        aesthetic_scores = aesthetic_model(image_feats).squeeze()
-        normalized_aesthetic_scores = aesthetic_scores / 10
-        cur_avg = torch.mean(normalized_aesthetic_scores, dim=0, keepdim=True)
-        aesthetic_avg += cur_avg.item()
-        num += 1
-        video_results.append(
-            {"video_path": video_path, "video_results": cur_avg.item()}
-        )
-    aesthetic_avg /= num
-    return aesthetic_avg, video_results
-
-
-def compute_aesthetic_quality(json_dir, device, submodules_list, **kwargs):
-    vit_path = submodules_list[0]
-    aes_path = submodules_list[1]
-    aesthetic_model = get_aesthetic_model(aes_path).to(device)
-    clip_model, preprocess = clip.load(vit_path, device=device)
-    video_list, _ = load_dimension_info(
-        json_dir, dimension="aesthetic_quality", lang="en"
-    )
-    all_results, video_results = laion_aesthetic(
-        aesthetic_model, clip_model, video_list, device
-    )
-    return all_results, video_results
diff --git a/eval/vbench/appearance_style.py b/eval/vbench/appearance_style.py
deleted file mode 100644
index 85332f9f..00000000
--- a/eval/vbench/appearance_style.py
+++ /dev/null
@@ -1,87 +0,0 @@
-import json
-import os
-
-import clip
-import numpy as np
-import torch
-from PIL import Image
-from tqdm import tqdm
-from vbench.utils import (
-    clip_transform,
-    clip_transform_Image,
-    load_dimension_info,
-    load_video,
-    read_frames_decord_by_fps,
-)
-
-
-def get_text_features(model, input_text, tokenizer, text_feature_dict={}):
-    if input_text in text_feature_dict:
-        return text_feature_dict[input_text]
-    text_template = f"{input_text}"
-    with torch.no_grad():
-        text_features = model.encode_text(text_template).float()
-        text_features /= text_features.norm(dim=-1, keepdim=True)
-        text_feature_dict[input_text] = text_features
-    return text_features
-
-
-def get_vid_features(model, input_frames):
-    with torch.no_grad():
-        clip_feat = model.encode_vision(input_frames, test=True).float()
-        clip_feat /= clip_feat.norm(dim=-1, keepdim=True)
-    return clip_feat
-
-
-def get_predict_label(clip_feature, text_feats_tensor, top=5):
-    label_probs = (100.0 * clip_feature @ text_feats_tensor.T).softmax(dim=-1)
-    top_probs, top_labels = label_probs.cpu().topk(top, dim=-1)
-    return top_probs, top_labels
-
-
-def appearance_style(clip_model, video_dict, device, sample="rand"):
-    sim = 0.0
-    cnt = 0
-    video_results = []
-    image_transform = clip_transform_Image(224)
-    for info in tqdm(video_dict):
-        if "auxiliary_info" not in info:
-            raise "Auxiliary info is not in json, please check your json."
-        query = info["auxiliary_info"]["appearance_style"]
-        text = clip.tokenize([query]).to(device)
-        video_list = info["video_list"]
-        for video_path in video_list:
-            cur_video = []
-            with torch.no_grad():
-                video_arrays = load_video(video_path, return_tensor=False)
-                images = [Image.fromarray(i) for i in video_arrays]
-                for image in images:
-                    image = image_transform(image)
-                    image = image.to(device)
-                    logits_per_image, logits_per_text = clip_model(
-                        image.unsqueeze(0), text
-                    )
-                    cur_sim = float(logits_per_text[0][0].cpu())
-                    cur_sim = cur_sim / 100
-                    cur_video.append(cur_sim)
-                    sim += cur_sim
-                    cnt += 1
-                video_sim = np.mean(cur_video)
-                video_results.append(
-                    {
-                        "video_path": video_path,
-                        "video_results": video_sim,
-                        "frame_results": cur_video,
-                    }
-                )
-    sim_per_frame = sim / cnt
-    return sim_per_frame, video_results
-
-
-def compute_appearance_style(json_dir, device, submodules_list, **kwargs):
-    clip_model, preprocess = clip.load(device=device, **submodules_list)
-    _, video_dict = load_dimension_info(
-        json_dir, dimension="appearance_style", lang="en"
-    )
-    all_results, video_results = appearance_style(clip_model, video_dict, device)
-    return all_results, video_results
diff --git a/eval/vbench/background_consistency.py b/eval/vbench/background_consistency.py
deleted file mode 100644
index 204fbff9..00000000
--- a/eval/vbench/background_consistency.py
+++ /dev/null
@@ -1,69 +0,0 @@
-import json
-import logging
-import os
-
-import clip
-import numpy as np
-import torch
-import torch.nn as nn
-import torch.nn.functional as F
-from PIL import Image
-from tqdm import tqdm
-from vbench.utils import clip_transform, load_dimension_info, load_video
-
-
-def background_consistency(clip_model, preprocess, video_list, device, read_frame):
-    sim = 0.0
-    cnt = 0
-    video_results = []
-    image_transform = clip_transform(224)
-    for video_path in tqdm(video_list):
-        video_sim = 0.0
-        if read_frame:
-            video_path = video_path[:-4].replace("videos", "frames").replace(" ", "_")
-            tmp_paths = [
-                os.path.join(video_path, f) for f in sorted(os.listdir(video_path))
-            ]
-            images = []
-            for tmp_path in tmp_paths:
-                images.append(preprocess(Image.open(tmp_path)))
-            images = torch.stack(images)
-        else:
-            images = load_video(video_path)
-            images = image_transform(images)
-        images = images.to(device)
-        image_features = clip_model.encode_image(images)
-        image_features = F.normalize(image_features, dim=-1, p=2)
-        for i in range(len(image_features)):
-            image_feature = image_features[i].unsqueeze(0)
-            if i == 0:
-                first_image_feature = image_feature
-            else:
-                sim_pre = max(
-                    0.0, F.cosine_similarity(former_image_feature, image_feature).item()
-                )
-                sim_fir = max(
-                    0.0, F.cosine_similarity(first_image_feature, image_feature).item()
-                )
-                cur_sim = (sim_pre + sim_fir) / 2
-                video_sim += cur_sim
-                cnt += 1
-            former_image_feature = image_feature
-        sim_per_image = video_sim / (len(image_features) - 1)
-        sim += video_sim
-        video_results.append({"video_path": video_path, "video_results": sim_per_image})
-    # sim_per_video = sim / (len(video_list) - 1)
-    sim_per_frame = sim / cnt
-    return sim_per_frame, video_results
-
-
-def compute_background_consistency(json_dir, device, submodules_list, **kwargs):
-    vit_path, read_frame = submodules_list[0], submodules_list[1]
-    clip_model, preprocess = clip.load(vit_path, device=device)
-    video_list, _ = load_dimension_info(
-        json_dir, dimension="background_consistency", lang="en"
-    )
-    all_results, video_results = background_consistency(
-        clip_model, preprocess, video_list, device, read_frame
-    )
-    return all_results, video_results
diff --git a/eval/vbench/cli/evaluate.py b/eval/vbench/cli/evaluate.py
deleted file mode 100644
index eb811387..00000000
--- a/eval/vbench/cli/evaluate.py
+++ /dev/null
@@ -1,163 +0,0 @@
-import argparse
-import json
-import os
-from datetime import datetime
-
-import torch
-from vbench import VBench
-
-CUR_DIR = os.path.dirname(os.path.abspath(__file__))
-
-
-def register_subparsers(subparser):
-    parser = subparser.add_parser(
-        "evaluate", formatter_class=argparse.RawTextHelpFormatter
-    )
-    parser.add_argument(
-        "--output_path",
-        type=str,
-        default="./evaluation_results/",
-        help="output path to save the evaluation results",
-    )
-    parser.add_argument(
-        "--full_json_dir",
-        type=str,
-        default=f"{CUR_DIR}/../VBench_full_info.json",
-        help="path to save the json file that contains the prompt and dimension information",
-    )
-    parser.add_argument(
-        "--videos_path",
-        type=str,
-        required=True,
-        help="folder that contains the sampled videos",
-    )
-    parser.add_argument(
-        "--dimension",
-        nargs="+",
-        required=True,
-        help="list of evaluation dimensions, usage: --dimension <dim_1> <dim_2>",
-    )
-    parser.add_argument(
-        "--load_ckpt_from_local",
-        type=bool,
-        required=False,
-        help="whether load checkpoints from local default paths (assuming you have downloaded the checkpoints locally",
-    )
-    parser.add_argument(
-        "--read_frame",
-        type=bool,
-        required=False,
-        help="whether directly read frames, or directly read videos",
-    )
-    parser.add_argument(
-        "--mode",
-        choices=["custom_input", "vbench_standard", "vbench_category"],
-        default="vbench_standard",
-        help="""This flags determine the mode of evaluations, choose one of the following:
-        1. "custom_input": receive input prompt from either --prompt/--prompt_file flags or the filename
-        2. "vbench_standard": evaluate on standard prompt suite of VBench
-        3. "vbench_category": evaluate on specific category
-        """,
-    )
-    parser.add_argument(
-        "--custom_input",
-        action="store_true",
-        required=False,
-        help='(deprecated) use --mode="custom_input" instead',
-    )
-    parser.add_argument(
-        "--prompt",
-        type=str,
-        default="",
-        help="""Specify the input prompt
-        If not specified, filenames will be used as input prompts
-        * Mutually exclusive to --prompt_file.
-        ** This option must be used with --custom_input flag
-        """,
-    )
-    parser.add_argument(
-        "--prompt_file",
-        type=str,
-        required=False,
-        help="""Specify the path of the file that contains prompt lists
-        If not specified, filenames will be used as input prompts
-        * Mutually exclusive to --prompt.
-        ** This option must be used with --custom_input flag
-        """,
-    )
-    parser.add_argument(
-        "--category",
-        type=str,
-        required=False,
-        help="""This is for mode=='vbench_category'
-        The category to evaluate on, usage: --category=animal.
-        """,
-    )
-
-    ## for dimension specific params ###
-    parser.add_argument(
-        "--imaging_quality_preprocessing_mode",
-        type=str,
-        required=False,
-        default="longer",
-        help="""This is for setting preprocessing in imaging_quality
-        1. 'shorter': if the shorter side is more than 512, the image is resized so that the shorter side is 512.
-        2. 'longer': if the longer side is more than 512, the image is resized so that the longer side is 512.
-        3. 'shorter_centercrop': if the shorter side is more than 512, the image is resized so that the shorter side is 512.
-        Then the center 512 x 512 after resized is used for evaluation.
-        4. 'None': no preprocessing
-        """,
-    )
-    parser.set_defaults(func=evaluate)
-
-
-def evaluate(args):
-    print(f"args: {args}")
-
-    device = torch.device("cuda")
-    my_VBench = VBench(device, args.full_json_dir, args.output_path)
-
-    print(f"start evaluation")
-
-    current_time = datetime.now().strftime("%Y-%m-%d-%H:%M:%S")
-
-    kwargs = {}
-
-    prompt = []
-
-    assert args.custom_input == False, "(Deprecated) use --mode=custom_input instead"
-
-    if (args.prompt_file is not None) and (args.prompt != ""):
-        raise Exception("--prompt_file and --prompt cannot be used together")
-    if (args.prompt_file is not None or args.prompt != "") and (
-        not args.mode == "custom_input"
-    ):
-        raise Exception("must set --mode=custom_input for using external prompt")
-
-    if args.prompt_file:
-        with open(args.prompt_file, "r") as f:
-            prompt = json.load(f)
-        assert (
-            type(prompt) == dict
-        ), 'Invalid prompt file format. The correct format is {"video_path": prompt, ... }'
-    elif args.prompt != "":
-        prompt = [args.prompt]
-
-    if args.category != "":
-        kwargs["category"] = args.category
-
-    kwargs["imaging_quality_preprocessing_mode"] = (
-        args.imaging_quality_preprocessing_mode
-    )
-
-    my_VBench.evaluate(
-        videos_path=args.videos_path,
-        name=f"results_{current_time}",
-        prompt_list=prompt,  # pass in [] to read prompt from filename
-        dimension_list=args.dimension,
-        local=args.load_ckpt_from_local,
-        read_frame=args.read_frame,
-        mode=args.mode,
-        **kwargs,
-    )
-    print("done")
diff --git a/eval/vbench/cli/static_filter.py b/eval/vbench/cli/static_filter.py
deleted file mode 100644
index 589a0ec5..00000000
--- a/eval/vbench/cli/static_filter.py
+++ /dev/null
@@ -1,219 +0,0 @@
-import glob
-import json
-import logging
-import os
-import shutil
-from pathlib import Path
-
-import cv2
-import numpy as np
-import torch
-from tqdm import tqdm
-
-logging.basicConfig(
-    level=logging.INFO, format="%(asctime)s - %(levelname)s - %(message)s"
-)
-logger = logging.getLogger(__name__)
-
-from vbench.third_party.RAFT.core.raft import RAFT
-from vbench.third_party.RAFT.core.utils_core.utils import InputPadder
-from vbench.utils import CACHE_DIR, get_prompt_from_filename, load_json
-
-CUR_DIR = os.path.dirname(os.path.abspath(__file__))
-DEVICE = "cuda"
-
-
-class StaticFilter:
-    def __init__(self, args, device):
-        self.args = args
-        self.device = device
-        self.load_model()
-
-    def load_model(self):
-        self.model = torch.nn.DataParallel(RAFT(self.args))
-        self.model.load_state_dict(torch.load(self.args.model))
-
-        self.model = self.model.module
-        self.model.to(self.device)
-        self.model.eval()
-
-    def get_score(self, img, flo):
-        img = img[0].permute(1, 2, 0).cpu().numpy()
-        flo = flo[0].permute(1, 2, 0).cpu().numpy()
-
-        u = flo[:, :, 0]
-        v = flo[:, :, 1]
-        rad = np.sqrt(np.square(u) + np.square(v))
-
-        h, w = rad.shape
-        rad_flat = rad.flatten()
-        cut_index = int(h * w * 0.02)
-
-        max_rad = np.mean(abs(np.sort(-rad_flat))[:cut_index])
-
-        return max_rad
-
-    def check_static(self, score_list):
-        thres = self.params["thres"]
-        count_num = self.params["count_num"]
-        count = 0
-        for score in score_list[:-2]:
-            if score > thres:
-                count += 1
-            if count > count_num:
-                return False
-        for score in score_list[-2:]:
-            if score > thres * count_num * 2:
-                return False
-        return True
-
-    def set_params(self, frame, count):
-        scale = min(list(frame.shape)[-2:])
-        self.params = {
-            "thres": 3.0 * (scale / 256.0),
-            "count_num": round(2 * (count / 16.0)),
-        }
-
-    def infer(self, path):
-        with torch.no_grad():
-            frames = self.get_frames(path)
-            self.set_params(frame=frames[0], count=len(frames))
-            static_score = []
-            for image1, image2 in zip(
-                frames[:-1] + [frames[0], frames[-1]],
-                frames[1:] + [frames[-1], frames[0]],
-            ):
-                padder = InputPadder(image1.shape)
-                image1, image2 = padder.pad(image1, image2)
-                _, flow_up = self.model(image1, image2, iters=20, test_mode=True)
-                max_rad = self.get_score(image1, flow_up)
-                static_score.append(max_rad)
-            whether_static = self.check_static(static_score)
-            return whether_static
-
-    def get_frames(self, video_path):
-        frame_list = []
-        video = cv2.VideoCapture(video_path)
-        while video.isOpened():
-            success, frame = video.read()
-            if success:
-                frame = cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)  # convert to rgb
-                frame = (
-                    torch.from_numpy(frame.astype(np.uint8)).permute(2, 0, 1).float()
-                )
-                frame = frame[None].to(DEVICE)
-                frame_list.append(frame)
-            else:
-                break
-        video.release()
-        assert frame_list != []
-        return frame_list
-
-
-def check_and_move(args, filter_results, target_path=None):
-    if target_path is None:
-        target_path = os.path.join(args.result_path, "filtered_videos")
-    os.makedirs(target_path, exist_ok=True)
-    for prompt, v in filter_results.items():
-        if v["static_count"] < 5 and args.filter_scope == "temporal_flickering":
-            logger.warning(f"Prompt: '{prompt}' has fewer than 5 filter results.")
-        for i, video_path in enumerate(v["static_path"]):
-            target_name = os.path.join(target_path, f"{prompt}-{i}.mp4")
-            shutil.copy(video_path, target_name)
-    logger.info(f"All filtered videos are saved in the '{target_path}' path")
-
-
-def static_filter(args):
-    static_filter = StaticFilter(args, device=DEVICE)
-    prompt_dict = {}
-    prompt_list = []
-    paths = sorted(glob.glob(os.path.join(args.videos_path, "*.mp4")))
-
-    if args.filter_scope == "temporal_flickering":
-        full_prompt_list = load_json(f"{CUR_DIR}/../VBench_full_info.json")
-        for prompt in full_prompt_list:
-            if "temporal_flickering" in prompt["dimension"]:
-                prompt_dict[prompt["prompt_en"]] = {
-                    "static_count": 0,
-                    "static_path": [],
-                }
-                prompt_list.append(prompt["prompt_en"])
-
-    elif args.filter_scope == "all":
-        for prompt in paths:
-            prompt = get_prompt_from_filename(prompt)
-            prompt_dict[prompt] = {"static_count": 0, "static_path": []}
-            prompt_list.append(prompt)
-
-    else:
-        assert (
-            os.path.isfile(args.filter_scope)
-            and Path(args.filter_scope).suffix.lower() == ".json"
-        ), f"""
-        --filter_scope flag is not correctly set, set to 'all' to filter all videos in the --videos_path directory,
-        or provide the correct path to the JSON file
-        """
-        full_prompt_list = load_json(args.filter_scope)
-        for prompt in full_prompt_list:
-            prompt = get_prompt_from_filename(prompt)
-            prompt_dict[prompt] = {"static_count": 0, "static_path": []}
-            prompt_list.append(prompt)
-
-    for path in tqdm(paths):
-        name = get_prompt_from_filename(path)
-        if name in prompt_list:
-            if (
-                prompt_dict[name]["static_count"] < 5
-                or args.filter_scope != "temporal_flickering"
-            ):
-                if static_filter.infer(path):
-                    prompt_dict[name]["static_count"] += 1
-                    prompt_dict[name]["static_path"].append(path)
-
-    os.makedirs(args.result_path, exist_ok=True)
-    info_file = os.path.join(args.result_path, args.store_name)
-    json.dump(prompt_dict, open(info_file, "w"))
-    logger.info(f"Filtered results info is saved in the '{info_file}' file")
-    check_and_move(args, prompt_dict)
-
-
-def register_subparsers(subparser):
-    parser = subparser.add_parser("static_filter")
-    parser.add_argument(
-        "--model",
-        type=str,
-        default=f"{CACHE_DIR}/raft_model/models/raft-things.pth",
-        help="restore checkpoint",
-    )
-    parser.add_argument(
-        "--videos_path", default="", required=True, help="video path for filtering"
-    )
-    parser.add_argument(
-        "--result_path", type=str, default="./filter_results", help="result save path"
-    )
-    parser.add_argument(
-        "--store_name",
-        type=str,
-        default="filtered_static_video.json",
-        help="result file name",
-    )
-    parser.add_argument("--small", action="store_true", help="use small model")
-    parser.add_argument(
-        "--mixed_precision", action="store_true", help="use mixed precision"
-    )
-    parser.add_argument(
-        "--alternate_corr",
-        action="store_true",
-        help="use efficent correlation implementation",
-    )
-    parser.add_argument(
-        "--filter_scope",
-        default="temporal_flickering",
-        help=f"""For specifying the scope for filtering videos
-        1. 'temporal_flickering' (default): filter videos based on matches with temporal_flickering dimension of VBench.
-        2. 'all': filter all video in the current directory.
-        3. '$filename': if a filepath to a JSON file is provided, only the filename exists in JSON file will be filtered.
-                >       usage: --filter_scope example.json
-    """,
-    )
-    parser.set_defaults(func=static_filter)
diff --git a/eval/vbench/cli/vbench.py b/eval/vbench/cli/vbench.py
deleted file mode 100644
index 2f870405..00000000
--- a/eval/vbench/cli/vbench.py
+++ /dev/null
@@ -1,23 +0,0 @@
-import argparse
-import importlib
-import subprocess
-
-vbench_cmd = ["evaluate", "static_filter"]
-
-
-def main():
-    parser = argparse.ArgumentParser(
-        prog="vbench", formatter_class=argparse.RawTextHelpFormatter
-    )
-    subparsers = parser.add_subparsers(title="vbench subcommands")
-
-    for cmd in vbench_cmd:
-        module = importlib.import_module(f"vbench.cli.{cmd}")
-        module.register_subparsers(subparsers)
-    parser.set_defaults(func=help)
-    args = parser.parse_args()
-    args.func(args)
-
-
-def help(args):
-    subprocess.run(["vbench", "-h"], check=True)
diff --git a/eval/vbench/color.py b/eval/vbench/color.py
deleted file mode 100644
index 6df91cf3..00000000
--- a/eval/vbench/color.py
+++ /dev/null
@@ -1,103 +0,0 @@
-import json
-import logging
-import os
-
-import numpy as np
-import torch
-from tqdm import tqdm
-from vbench.third_party.grit_model import DenseCaptioning
-from vbench.utils import load_dimension_info, load_video, read_frames_decord_by_fps
-
-logging.basicConfig(
-    level=logging.INFO, format="%(asctime)s - %(name)s - %(levelname)s - %(message)s"
-)
-logger = logging.getLogger(__name__)
-
-
-def get_dect_from_grit(model, image_arrays):
-    pred = []
-    if type(image_arrays) is not list and type(image_arrays) is not np.ndarray:
-        image_arrays = image_arrays.numpy()
-    with torch.no_grad():
-        for frame in image_arrays:
-            ret = model.run_caption_tensor(frame)
-            cur_pred = []
-            if len(ret[0]) < 1:
-                cur_pred.append(["", ""])
-            else:
-                for idx, cap_det in enumerate(ret[0]):
-                    cur_pred.append([cap_det[0], cap_det[2][0]])
-            pred.append(cur_pred)
-    return pred
-
-
-def check_generate(color_key, object_key, predictions):
-    cur_object_color, cur_object = 0, 0
-    for frame_pred in predictions:
-        object_flag, color_flag = False, False
-        for pred in frame_pred:
-            if object_key == pred[1]:
-                for color_query in [
-                    "white",
-                    "red",
-                    "pink",
-                    "blue",
-                    "silver",
-                    "purple",
-                    "orange",
-                    "green",
-                    "gray",
-                    "yellow",
-                    "black",
-                    "grey",
-                ]:
-                    if color_query in pred[0]:
-                        object_flag = True
-                if color_key in pred[0]:
-                    color_flag = True
-        if color_flag:
-            cur_object_color += 1
-        if object_flag:
-            cur_object += 1
-    return cur_object, cur_object_color
-
-
-def color(model, video_dict, device):
-    success_frame_count_all, video_count = 0, 0
-    video_results = []
-    for info in tqdm(video_dict):
-        if "auxiliary_info" not in info:
-            raise "Auxiliary info is not in json, please check your json."
-        # print(info)
-        color_info = info["auxiliary_info"]["color"]
-        object_info = info["prompt"]
-        object_info = (
-            object_info.replace("a ", "")
-            .replace("an ", "")
-            .replace(color_info, "")
-            .strip()
-        )
-        for video_path in info["video_list"]:
-            video_arrays = load_video(video_path, num_frames=16, return_tensor=False)
-            cur_video_pred = get_dect_from_grit(model, video_arrays)
-            cur_object, cur_object_color = check_generate(
-                color_info, object_info, cur_video_pred
-            )
-            if cur_object > 0:
-                cur_success_frame_rate = cur_object_color / cur_object
-                success_frame_count_all += cur_success_frame_rate
-                video_count += 1
-                video_results.append(
-                    {"video_path": video_path, "video_results": cur_success_frame_rate}
-                )
-    success_rate = success_frame_count_all / video_count
-    return success_rate, video_results
-
-
-def compute_color(json_dir, device, submodules_dict, **kwargs):
-    dense_caption_model = DenseCaptioning(device)
-    dense_caption_model.initialize_model(**submodules_dict)
-    logger.info("Initialize detection model success")
-    _, prompt_dict_ls = load_dimension_info(json_dir, dimension="color", lang="en")
-    all_results, video_results = color(dense_caption_model, prompt_dict_ls, device)
-    return all_results, video_results
diff --git a/eval/vbench/dynamic_degree.py b/eval/vbench/dynamic_degree.py
deleted file mode 100644
index 533a791f..00000000
--- a/eval/vbench/dynamic_degree.py
+++ /dev/null
@@ -1,170 +0,0 @@
-import argparse
-import glob
-import os
-
-import cv2
-import numpy as np
-import torch
-from easydict import EasyDict as edict
-from tqdm import tqdm
-from vbench.third_party.RAFT.core.raft import RAFT
-from vbench.third_party.RAFT.core.utils_core.utils import InputPadder
-from vbench.utils import load_dimension_info
-
-
-class DynamicDegree:
-    def __init__(self, args, device):
-        self.args = args
-        self.device = device
-        self.load_model()
-
-    def load_model(self):
-        self.model = torch.nn.DataParallel(RAFT(self.args))
-        self.model.load_state_dict(torch.load(self.args.model))
-
-        self.model = self.model.module
-        self.model.to(self.device)
-        self.model.eval()
-
-    def get_score(self, img, flo):
-        img = img[0].permute(1, 2, 0).cpu().numpy()
-        flo = flo[0].permute(1, 2, 0).cpu().numpy()
-
-        u = flo[:, :, 0]
-        v = flo[:, :, 1]
-        rad = np.sqrt(np.square(u) + np.square(v))
-
-        h, w = rad.shape
-        rad_flat = rad.flatten()
-        cut_index = int(h * w * 0.05)
-
-        max_rad = np.mean(abs(np.sort(-rad_flat))[:cut_index])
-
-        return max_rad.item()
-
-    def set_params(self, frame, count):
-        scale = min(list(frame.shape)[-2:])
-        self.params = {
-            "thres": 6.0 * (scale / 256.0),
-            "count_num": round(4 * (count / 16.0)),
-        }
-
-    def infer(self, video_path):
-        with torch.no_grad():
-            if video_path.endswith(".mp4"):
-                frames = self.get_frames(video_path)
-            elif os.path.isdir(video_path):
-                frames = self.get_frames_from_img_folder(video_path)
-            else:
-                raise NotImplementedError
-            self.set_params(frame=frames[0], count=len(frames))
-            static_score = []
-            for image1, image2 in zip(frames[:-1], frames[1:]):
-                padder = InputPadder(image1.shape)
-                image1, image2 = padder.pad(image1, image2)
-                _, flow_up = self.model(image1, image2, iters=20, test_mode=True)
-                max_rad = self.get_score(image1, flow_up)
-                static_score.append(max_rad)
-            whether_move = self.check_move(static_score)
-            return whether_move
-
-    def check_move(self, score_list):
-        thres = self.params["thres"]
-        count_num = self.params["count_num"]
-        count = 0
-        for score in score_list:
-            if score > thres:
-                count += 1
-            if count >= count_num:
-                return True
-        return False
-
-    def get_frames(self, video_path):
-        frame_list = []
-        video = cv2.VideoCapture(video_path)
-        fps = video.get(cv2.CAP_PROP_FPS)  # get fps
-        interval = round(fps / 8)
-        while video.isOpened():
-            success, frame = video.read()
-            if success:
-                frame = cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)  # convert to rgb
-                frame = (
-                    torch.from_numpy(frame.astype(np.uint8)).permute(2, 0, 1).float()
-                )
-                frame = frame[None].to(self.device)
-                frame_list.append(frame)
-            else:
-                break
-        video.release()
-        assert frame_list != []
-        frame_list = self.extract_frame(frame_list, interval)
-        return frame_list
-
-    def extract_frame(self, frame_list, interval=1):
-        extract = []
-        for i in range(0, len(frame_list), interval):
-            extract.append(frame_list[i])
-        return extract
-
-    def get_frames_from_img_folder(self, img_folder):
-        exts = [
-            "jpg",
-            "png",
-            "jpeg",
-            "bmp",
-            "tif",
-            "tiff",
-            "JPG",
-            "PNG",
-            "JPEG",
-            "BMP",
-            "TIF",
-            "TIFF",
-        ]
-        frame_list = []
-        imgs = sorted(
-            [
-                p
-                for p in glob.glob(os.path.join(img_folder, "*"))
-                if os.path.splitext(p)[1][1:] in exts
-            ]
-        )
-        # imgs = sorted(glob.glob(os.path.join(img_folder, "*.png")))
-        for img in imgs:
-            frame = cv2.imread(img, cv2.IMREAD_COLOR)
-            frame = cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)
-            frame = torch.from_numpy(frame.astype(np.uint8)).permute(2, 0, 1).float()
-            frame = frame[None].to(self.device)
-            frame_list.append(frame)
-        assert frame_list != []
-        return frame_list
-
-
-def dynamic_degree(dynamic, video_list):
-    sim = []
-    video_results = []
-    for video_path in tqdm(video_list):
-        score_per_video = dynamic.infer(video_path)
-        video_results.append(
-            {"video_path": video_path, "video_results": score_per_video}
-        )
-        sim.append(score_per_video)
-    avg_score = np.mean(sim)
-    return avg_score, video_results
-
-
-def compute_dynamic_degree(json_dir, device, submodules_list, **kwargs):
-    model_path = submodules_list["model"]
-    # set_args
-    args_new = edict(
-        {
-            "model": model_path,
-            "small": False,
-            "mixed_precision": False,
-            "alternate_corr": False,
-        }
-    )
-    dynamic = DynamicDegree(args_new, device)
-    video_list, _ = load_dimension_info(json_dir, dimension="dynamic_degree", lang="en")
-    all_results, video_results = dynamic_degree(dynamic, video_list)
-    return all_results, video_results
diff --git a/eval/vbench/human_action.py b/eval/vbench/human_action.py
deleted file mode 100644
index 2a02d59d..00000000
--- a/eval/vbench/human_action.py
+++ /dev/null
@@ -1,119 +0,0 @@
-import json
-import os
-
-import clip
-import numpy as np
-import torch
-import torch.nn as nn
-import torch.nn.functional as F
-from PIL import Image
-from timm.models import create_model
-from tqdm import tqdm
-from vbench.third_party.umt.datasets.video_transforms import (
-    CenterCrop,
-    Compose,
-    Normalize,
-    Resize,
-    create_random_augment,
-    horizontal_flip,
-    random_crop,
-    random_resized_crop,
-    random_resized_crop_with_shift,
-    random_short_side_scale_jitter,
-    uniform_crop,
-)
-from vbench.third_party.umt.datasets.volume_transforms import ClipToTensor
-from vbench.third_party.umt.models.modeling_finetune import vit_large_patch16_224
-from vbench.utils import load_dimension_info, load_video
-
-
-def build_dict():
-    CUR_DIR = os.path.dirname(os.path.abspath(__file__))
-    path = f"{CUR_DIR}/third_party/umt/kinetics_400_categories.txt"
-    results = {}
-    with open(path, "r") as f:
-        cat_list = f.readlines()
-        cat_list = [c.strip() for c in cat_list]
-        for line in cat_list:
-            cat, number = line.split("\t")
-            results[number] = cat.lower()
-    return results
-
-
-def human_action(umt_path, video_list, device):
-    state_dict = torch.load(umt_path, map_location="cpu")
-    model = create_model(
-        "vit_large_patch16_224",
-        pretrained=False,
-        num_classes=400,
-        all_frames=16,
-        tubelet_size=1,
-        use_learnable_pos_emb=False,
-        fc_drop_rate=0.0,
-        drop_rate=0.0,
-        drop_path_rate=0.2,
-        attn_drop_rate=0.0,
-        drop_block_rate=None,
-        use_checkpoint=False,
-        checkpoint_num=16,
-        use_mean_pooling=True,
-        init_scale=0.001,
-    )
-    data_transform = Compose(
-        [
-            Resize(256, interpolation="bilinear"),
-            CenterCrop(size=(224, 224)),
-            ClipToTensor(),
-            Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
-        ]
-    )
-    model = model.to(device)
-    model.load_state_dict(state_dict, strict=False)
-    model.eval()
-    cat_dict = build_dict()
-    cnt = 0
-    cor_num = 0
-    video_results = []
-    for video_path in tqdm(video_list):
-        video_label_ls = (
-            video_path.split("/")[-1]
-            .lower()
-            .split("-")[0]
-            .split("person is ")[-1]
-            .split("_")[0]
-        )
-        cnt += 1
-        images = load_video(video_path, data_transform, num_frames=16)
-        images = images.unsqueeze(0)
-        images = images.to(device)
-        with torch.no_grad():
-            logits = torch.sigmoid(model(images))
-            results, indices = torch.topk(logits, 5, dim=1)
-        indices = indices.squeeze().tolist()
-        results = results.squeeze().tolist()
-        results = [round(f, 4) for f in results]
-        cat_ls = []
-        for i in range(5):
-            if results[i] >= 0.85:
-                cat_ls.append(cat_dict[str(indices[i])])
-        flag = False
-        for cat in cat_ls:
-            if cat == video_label_ls:
-                cor_num += 1
-                flag = True
-                # print(f"{cnt}: {video_path} correct, top-5: {cat_ls}, logits: {results}", flush=True)
-                break
-        if flag is False:
-            # print(f"{cnt}: {video_path} false, gt: {video_label_ls}, top-5: {cat_ls}, logits: {results}", flush=True)
-            pass
-        video_results.append({"video_path": video_path, "video_results": flag})
-    # print(f"cor num: {cor_num}, total: {cnt}")
-    acc = cor_num / cnt
-    return acc, video_results
-
-
-def compute_human_action(json_dir, device, submodules_list, **kwargs):
-    umt_path = submodules_list[0]
-    video_list, _ = load_dimension_info(json_dir, dimension="human_action", lang="en")
-    all_results, video_results = human_action(umt_path, video_list, device)
-    return all_results, video_results
diff --git a/eval/vbench/imaging_quality.py b/eval/vbench/imaging_quality.py
deleted file mode 100644
index f5a18788..00000000
--- a/eval/vbench/imaging_quality.py
+++ /dev/null
@@ -1,63 +0,0 @@
-import torch
-from pyiqa.archs.musiq_arch import MUSIQ
-from torchvision import transforms
-from tqdm import tqdm
-from vbench.utils import load_dimension_info, load_video
-
-
-def transform(images, preprocess_mode="shorter"):
-    if preprocess_mode.startswith("shorter"):
-        _, _, h, w = images.size()
-        if min(h, w) > 512:
-            scale = 512.0 / min(h, w)
-            images = transforms.Resize(size=(int(scale * h), int(scale * w)))(images)
-            if preprocess_mode == "shorter_centercrop":
-                images = transforms.CenterCrop(512)(images)
-
-    elif preprocess_mode == "longer":
-        _, _, h, w = images.size()
-        if max(h, w) > 512:
-            scale = 512.0 / max(h, w)
-            images = transforms.Resize(size=(int(scale * h), int(scale * w)))(images)
-
-    elif preprocess_mode == "None":
-        return images / 255.0
-
-    else:
-        raise ValueError("Please recheck imaging_quality_mode")
-    return images / 255.0
-
-
-def technical_quality(model, video_list, device, **kwargs):
-    preprocess_mode = kwargs["imaging_quality_preprocessing_mode"]
-    video_results = []
-    for video_path in tqdm(video_list):
-        images = load_video(video_path)
-        images = transform(images, preprocess_mode)
-        acc_score_video = 0.0
-        for i in range(len(images)):
-            frame = images[i].unsqueeze(0).to(device)
-            score = model(frame)
-            acc_score_video += float(score)
-        video_results.append(
-            {"video_path": video_path, "video_results": acc_score_video / len(images)}
-        )
-    average_score = sum([o["video_results"] for o in video_results]) / len(
-        video_results
-    )
-    average_score = average_score / 100.0
-    return average_score, video_results
-
-
-def compute_imaging_quality(json_dir, device, submodules_list, **kwargs):
-    model_path = submodules_list["model_path"]
-
-    model = MUSIQ(pretrained_model_path=model_path)
-    model.to(device)
-    model.training = False
-
-    video_list, _ = load_dimension_info(
-        json_dir, dimension="imaging_quality", lang="en"
-    )
-    all_results, video_results = technical_quality(model, video_list, device, **kwargs)
-    return all_results, video_results
diff --git a/eval/vbench/motion_smoothness.py b/eval/vbench/motion_smoothness.py
deleted file mode 100644
index ceee2cab..00000000
--- a/eval/vbench/motion_smoothness.py
+++ /dev/null
@@ -1,199 +0,0 @@
-import glob
-import os
-
-import cv2
-import numpy as np
-import torch
-from omegaconf import OmegaConf
-from tqdm import tqdm
-from vbench.third_party.amt.utils.build_utils import build_from_cfg
-from vbench.third_party.amt.utils.utils import (
-    InputPadder,
-    check_dim_and_resize,
-    img2tensor,
-    tensor2img,
-)
-from vbench.utils import load_dimension_info
-
-
-class FrameProcess:
-    def __init__(self):
-        pass
-
-    def get_frames(self, video_path):
-        frame_list = []
-        video = cv2.VideoCapture(video_path)
-        while video.isOpened():
-            success, frame = video.read()
-            if success:
-                frame = cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)  # convert to rgb
-                frame_list.append(frame)
-            else:
-                break
-        video.release()
-        assert frame_list != []
-        return frame_list
-
-    def get_frames_from_img_folder(self, img_folder):
-        exts = [
-            "jpg",
-            "png",
-            "jpeg",
-            "bmp",
-            "tif",
-            "tiff",
-            "JPG",
-            "PNG",
-            "JPEG",
-            "BMP",
-            "TIF",
-            "TIFF",
-        ]
-        frame_list = []
-        imgs = sorted(
-            [
-                p
-                for p in glob.glob(os.path.join(img_folder, "*"))
-                if os.path.splitext(p)[1][1:] in exts
-            ]
-        )
-        # imgs = sorted(glob.glob(os.path.join(img_folder, "*.png")))
-        for img in imgs:
-            frame = cv2.imread(img, cv2.IMREAD_COLOR)
-            frame = cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)
-            frame_list.append(frame)
-        assert frame_list != []
-        return frame_list
-
-    def extract_frame(self, frame_list, start_from=0):
-        extract = []
-        for i in range(start_from, len(frame_list), 2):
-            extract.append(frame_list[i])
-        return extract
-
-
-class MotionSmoothness:
-    def __init__(self, config, ckpt, device):
-        self.device = device
-        self.config = config
-        self.ckpt = ckpt
-        self.niters = 1
-        self.initialization()
-        self.load_model()
-
-    def load_model(self):
-        cfg_path = self.config
-        ckpt_path = self.ckpt
-        network_cfg = OmegaConf.load(cfg_path).network
-        network_name = network_cfg.name
-        print(f"Loading [{network_name}] from [{ckpt_path}]...")
-        self.model = build_from_cfg(network_cfg)
-        ckpt = torch.load(ckpt_path)
-        self.model.load_state_dict(ckpt["state_dict"])
-        self.model = self.model.to(self.device)
-        self.model.eval()
-
-    def initialization(self):
-        if self.device == "cuda":
-            self.anchor_resolution = 1024 * 512
-            self.anchor_memory = 1500 * 1024**2
-            self.anchor_memory_bias = 2500 * 1024**2
-            self.vram_avail = torch.cuda.get_device_properties(self.device).total_memory
-            print("VRAM available: {:.1f} MB".format(self.vram_avail / 1024**2))
-        else:
-            # Do not resize in cpu mode
-            self.anchor_resolution = 8192 * 8192
-            self.anchor_memory = 1
-            self.anchor_memory_bias = 0
-            self.vram_avail = 1
-
-        self.embt = torch.tensor(1 / 2).float().view(1, 1, 1, 1).to(self.device)
-        self.fp = FrameProcess()
-
-    def motion_score(self, video_path):
-        iters = int(self.niters)
-        # get inputs
-        if video_path.endswith(".mp4"):
-            frames = self.fp.get_frames(video_path)
-        elif os.path.isdir(video_path):
-            frames = self.fp.get_frames_from_img_folder(video_path)
-        else:
-            raise NotImplementedError
-        frame_list = self.fp.extract_frame(frames, start_from=0)
-        # print(f'Loading [images] from [{video_path}], the number of images = [{len(frame_list)}]')
-        inputs = [img2tensor(frame).to(self.device) for frame in frame_list]
-        assert (
-            len(inputs) > 1
-        ), f"The number of input should be more than one (current {len(inputs)})"
-        inputs = check_dim_and_resize(inputs)
-        h, w = inputs[0].shape[-2:]
-        scale = (
-            self.anchor_resolution
-            / (h * w)
-            * np.sqrt((self.vram_avail - self.anchor_memory_bias) / self.anchor_memory)
-        )
-        scale = 1 if scale > 1 else scale
-        scale = 1 / np.floor(1 / np.sqrt(scale) * 16) * 16
-        if scale < 1:
-            print(f"Due to the limited VRAM, the video will be scaled by {scale:.2f}")
-        padding = int(16 / scale)
-        padder = InputPadder(inputs[0].shape, padding)
-        inputs = padder.pad(*inputs)
-
-        # -----------------------  Interpolater -----------------------
-        # print(f'Start frame interpolation:')
-        for i in range(iters):
-            # print(f'Iter {i+1}. input_frames={len(inputs)} output_frames={2*len(inputs)-1}')
-            outputs = [inputs[0]]
-            for in_0, in_1 in zip(inputs[:-1], inputs[1:]):
-                in_0 = in_0.to(self.device)
-                in_1 = in_1.to(self.device)
-                with torch.no_grad():
-                    imgt_pred = self.model(
-                        in_0, in_1, self.embt, scale_factor=scale, eval=True
-                    )["imgt_pred"]
-                outputs += [imgt_pred.cpu(), in_1.cpu()]
-            inputs = outputs
-
-        # -----------------------  cal_vfi_score -----------------------
-        outputs = padder.unpad(*outputs)
-        outputs = [tensor2img(out) for out in outputs]
-        vfi_score = self.vfi_score(frames, outputs)
-        norm = (255.0 - vfi_score) / 255.0
-        return norm
-
-    def vfi_score(self, ori_frames, interpolate_frames):
-        ori = self.fp.extract_frame(ori_frames, start_from=1)
-        interpolate = self.fp.extract_frame(interpolate_frames, start_from=1)
-        scores = []
-        for i in range(len(interpolate)):
-            scores.append(self.get_diff(ori[i], interpolate[i]))
-        return np.mean(np.array(scores))
-
-    def get_diff(self, img1, img2):
-        img = cv2.absdiff(img1, img2)
-        return np.mean(img)
-
-
-def motion_smoothness(motion, video_list):
-    sim = []
-    video_results = []
-    for video_path in tqdm(video_list):
-        score_per_video = motion.motion_score(video_path)
-        video_results.append(
-            {"video_path": video_path, "video_results": score_per_video}
-        )
-        sim.append(score_per_video)
-    avg_score = np.mean(sim)
-    return avg_score, video_results
-
-
-def compute_motion_smoothness(json_dir, device, submodules_list, **kwargs):
-    config = submodules_list["config"]  # pretrained/amt_model/AMT-S.yaml
-    ckpt = submodules_list["ckpt"]  # pretrained/amt_model/amt-s.pth
-    motion = MotionSmoothness(config, ckpt, device)
-    video_list, _ = load_dimension_info(
-        json_dir, dimension="motion_smoothness", lang="en"
-    )
-    all_results, video_results = motion_smoothness(motion, video_list)
-    return all_results, video_results
diff --git a/eval/vbench/multiple_objects.py b/eval/vbench/multiple_objects.py
deleted file mode 100644
index 6ad43a4c..00000000
--- a/eval/vbench/multiple_objects.py
+++ /dev/null
@@ -1,73 +0,0 @@
-import json
-import logging
-import os
-
-import numpy as np
-import torch
-from tqdm import tqdm
-from vbench.third_party.grit_model import DenseCaptioning
-from vbench.utils import load_dimension_info, load_video
-
-logging.basicConfig(
-    level=logging.INFO, format="%(asctime)s - %(name)s - %(levelname)s - %(message)s"
-)
-logger = logging.getLogger(__name__)
-
-
-def get_dect_from_grit(model, image_arrays):
-    pred = []
-    if type(image_arrays) is not list:
-        image_arrays = image_arrays.numpy()
-    with torch.no_grad():
-        for frame in image_arrays:
-            ret = model.run_caption_tensor(frame)
-            if len(ret[0]) > 0:
-                pred.append(set(ret[0][0][2]))
-            else:
-                pred.append(set([]))
-    return pred
-
-
-def check_generate(key_info, predictions):
-    cur_cnt = 0
-    key_a, key_b = key_info.split(" and ")
-    key_a = key_a.strip()
-    key_b = key_b.strip()
-    for pred in predictions:
-        if key_a in pred and key_b in pred:
-            cur_cnt += 1
-    return cur_cnt
-
-
-def multiple_objects(model, video_dict, device):
-    success_frame_count, frame_count = 0, 0
-    video_results = []
-    for info in tqdm(video_dict):
-        if "auxiliary_info" not in info:
-            raise "Auxiliary info is not in json, please check your json."
-        object_info = info["auxiliary_info"]["object"]
-        for video_path in info["video_list"]:
-            video_tensor = load_video(video_path, num_frames=16)
-            cur_video_pred = get_dect_from_grit(model, video_tensor.permute(0, 2, 3, 1))
-            cur_success_frame_count = check_generate(object_info, cur_video_pred)
-            cur_success_frame_rate = cur_success_frame_count / len(cur_video_pred)
-            success_frame_count += cur_success_frame_count
-            frame_count += len(cur_video_pred)
-            video_results.append(
-                {"video_path": video_path, "video_results": cur_success_frame_rate}
-            )
-    success_rate = success_frame_count / frame_count
-    return success_rate, video_results
-
-
-def compute_multiple_objects(json_dir, device, submodules_dict, **kwargs):
-    dense_caption_model = DenseCaptioning(device)
-    dense_caption_model.initialize_model_det(**submodules_dict)
-    logger.info("Initialize detection model success")
-    _, prompt_dict_ls = load_dimension_info(
-        json_dir, dimension="multiple_objects", lang="en"
-    )
-    all_results, video_results = multiple_objects(
-        dense_caption_model, prompt_dict_ls, device
-    )
-    return all_results, video_results
diff --git a/eval/vbench/object_class.py b/eval/vbench/object_class.py
deleted file mode 100644
index 9200c657..00000000
--- a/eval/vbench/object_class.py
+++ /dev/null
@@ -1,69 +0,0 @@
-import json
-import logging
-import os
-
-import numpy as np
-import torch
-from tqdm import tqdm
-from vbench.third_party.grit_model import DenseCaptioning
-from vbench.utils import load_dimension_info, load_video
-
-logging.basicConfig(
-    level=logging.INFO, format="%(asctime)s - %(name)s - %(levelname)s - %(message)s"
-)
-logger = logging.getLogger(__name__)
-
-
-def get_dect_from_grit(model, image_arrays):
-    pred = []
-    if type(image_arrays) is not list:
-        image_arrays = image_arrays.numpy()
-    with torch.no_grad():
-        for frame in image_arrays:
-            try:
-                pred.append(set(model.run_caption_tensor(frame)[0][0][2]))
-            except:
-                pred.append(set())
-    return pred
-
-
-def check_generate(key_info, predictions):
-    cur_cnt = 0
-    for pred in predictions:
-        if key_info in pred:
-            cur_cnt += 1
-    return cur_cnt
-
-
-def object_class(model, video_dict, device):
-    success_frame_count, frame_count = 0, 0
-    video_results = []
-    for info in tqdm(video_dict):
-        if "auxiliary_info" not in info:
-            raise "Auxiliary info is not in json, please check your json."
-        object_info = info["auxiliary_info"]["object"]
-        for video_path in info["video_list"]:
-            video_tensor = load_video(video_path, num_frames=16)
-            cur_video_pred = get_dect_from_grit(model, video_tensor.permute(0, 2, 3, 1))
-            cur_success_frame_count = check_generate(object_info, cur_video_pred)
-            cur_success_frame_rate = cur_success_frame_count / len(cur_video_pred)
-            success_frame_count += cur_success_frame_count
-            frame_count += len(cur_video_pred)
-            video_results.append(
-                {"video_path": video_path, "video_results": cur_success_frame_rate}
-            )
-    success_rate = success_frame_count / frame_count
-    return success_rate, video_results
-
-
-def compute_object_class(json_dir, device, submodules_dict, **kwargs):
-    dense_caption_model = DenseCaptioning(device)
-    dense_caption_model.initialize_model_det(**submodules_dict)
-    logger.info("Initialize detection model success")
-    _, prompt_dict_ls = load_dimension_info(
-        json_dir, dimension="object_class", lang="en"
-    )
-    all_results, video_results = object_class(
-        dense_caption_model, prompt_dict_ls, device
-    )
-    return all_results, video_results
diff --git a/eval/vbench/overall_consistency.py b/eval/vbench/overall_consistency.py
deleted file mode 100644
index 68ffee00..00000000
--- a/eval/vbench/overall_consistency.py
+++ /dev/null
@@ -1,82 +0,0 @@
-import json
-import os
-
-import clip
-import numpy as np
-import torch
-from tqdm import tqdm
-from vbench.third_party.ViCLIP.simple_tokenizer import SimpleTokenizer
-from vbench.third_party.ViCLIP.viclip import ViCLIP
-from vbench.utils import (
-    CACHE_DIR,
-    clip_transform,
-    load_dimension_info,
-    load_video,
-    read_frames_decord_by_fps,
-)
-
-
-def get_text_features(model, input_text, tokenizer, text_feature_dict={}):
-    if input_text in text_feature_dict:
-        return text_feature_dict[input_text]
-    text_template = f"{input_text}"
-    with torch.no_grad():
-        text_features = model.encode_text(text_template).float()
-        text_features /= text_features.norm(dim=-1, keepdim=True)
-        text_feature_dict[input_text] = text_features
-    return text_features
-
-
-def get_vid_features(model, input_frames):
-    with torch.no_grad():
-        clip_feat = model.encode_vision(input_frames, test=True).float()
-        clip_feat /= clip_feat.norm(dim=-1, keepdim=True)
-    return clip_feat
-
-
-def get_predict_label(clip_feature, text_feats_tensor, top=5):
-    label_probs = (100.0 * clip_feature @ text_feats_tensor.T).softmax(dim=-1)
-    top_probs, top_labels = label_probs.cpu().topk(top, dim=-1)
-    return top_probs, top_labels
-
-
-def overall_consistency(clip_model, video_dict, tokenizer, device, sample="middle"):
-    sim = []
-    video_results = []
-    image_transform = clip_transform(224)
-    for info in tqdm(video_dict):
-        query = info["prompt"]
-        # text = clip.tokenize([query]).to(device)
-        video_list = info["video_list"]
-        for video_path in video_list:
-            cur_video = []
-            with torch.no_grad():
-                images = read_frames_decord_by_fps(
-                    video_path, num_frames=8, sample=sample
-                )
-                images = image_transform(images)
-                images = images.to(device)
-                clip_feat = get_vid_features(clip_model, images.unsqueeze(0))
-                text_feat = get_text_features(clip_model, query, tokenizer)
-                logit_per_text = clip_feat @ text_feat.T
-                score_per_video = float(logit_per_text[0][0].cpu())
-                sim.append(score_per_video)
-                video_results.append(
-                    {"video_path": video_path, "video_results": score_per_video}
-                )
-    avg_score = np.mean(sim)
-    return avg_score, video_results
-
-
-def compute_overall_consistency(json_dir, device, submodules_list, **kwargs):
-    tokenizer = SimpleTokenizer(
-        os.path.join(CACHE_DIR, "ViCLIP/bpe_simple_vocab_16e6.txt.gz")
-    )
-    viclip = ViCLIP(tokenizer=tokenizer, **submodules_list).to(device)
-    _, video_dict = load_dimension_info(
-        json_dir, dimension="overall_consistency", lang="en"
-    )
-    all_results, video_results = overall_consistency(
-        viclip, video_dict, tokenizer, device
-    )
-    return all_results, video_results
diff --git a/eval/vbench/scene.py b/eval/vbench/scene.py
deleted file mode 100644
index a8ce2ee0..00000000
--- a/eval/vbench/scene.py
+++ /dev/null
@@ -1,69 +0,0 @@
-import json
-import logging
-import os
-
-import numpy as np
-import torch
-from tqdm import tqdm
-from vbench.third_party.tag2Text.tag2text import tag2text_caption
-from vbench.utils import load_dimension_info, load_video, tag2text_transform
-
-logging.basicConfig(
-    level=logging.INFO, format="%(asctime)s - %(name)s - %(levelname)s - %(message)s"
-)
-logger = logging.getLogger(__name__)
-
-
-def get_caption(model, image_arrays):
-    caption, tag_predict = model.generate(
-        image_arrays, tag_input=None, return_tag_predict=True
-    )
-    return caption
-
-
-def check_generate(key_info, predictions):
-    cur_cnt = 0
-    key = key_info["scene"]
-    for pred in predictions:
-        q_flag = [q in pred for q in key.split(" ")]
-        if len(q_flag) == sum(q_flag):
-            cur_cnt += 1
-    return cur_cnt
-
-
-def scene(model, video_dict, device):
-    success_frame_count, frame_count = 0, 0
-    video_results = []
-    transform = tag2text_transform(384)
-    for info in tqdm(video_dict):
-        if "auxiliary_info" not in info:
-            raise "Auxiliary info is not in json, please check your json."
-        scene_info = info["auxiliary_info"]["scene"]
-        for video_path in info["video_list"]:
-            video_array = load_video(
-                video_path, num_frames=16, return_tensor=False, width=384, height=384
-            )
-            video_tensor_list = []
-            for i in video_array:
-                video_tensor_list.append(transform(i).to(device).unsqueeze(0))
-            video_tensor = torch.cat(video_tensor_list)
-            cur_video_pred = get_caption(model, video_tensor)
-            cur_success_frame_count = check_generate(scene_info, cur_video_pred)
-            cur_success_frame_rate = cur_success_frame_count / len(cur_video_pred)
-            success_frame_count += cur_success_frame_count
-            frame_count += len(cur_video_pred)
-            video_results.append(
-                {"video_path": video_path, "video_results": cur_success_frame_rate}
-            )
-    success_rate = success_frame_count / frame_count
-    return success_rate, video_results
-
-
-def compute_scene(json_dir, device, submodules_dict, **kwargs):
-    model = tag2text_caption(**submodules_dict)
-    model.eval()
-    model = model.to(device)
-    logger.info("Initialize caption model success")
-    _, prompt_dict_ls = load_dimension_info(json_dir, dimension="scene", lang="en")
-    all_results, video_results = scene(model, prompt_dict_ls, device)
-    return all_results, video_results
diff --git a/eval/vbench/spatial_relationship.py b/eval/vbench/spatial_relationship.py
deleted file mode 100644
index 665f9cca..00000000
--- a/eval/vbench/spatial_relationship.py
+++ /dev/null
@@ -1,158 +0,0 @@
-import json
-import logging
-import os
-
-import numpy as np
-import torch
-from tqdm import tqdm
-from vbench.third_party.grit_model import DenseCaptioning
-from vbench.utils import load_dimension_info, load_video
-
-logging.basicConfig(
-    level=logging.INFO, format="%(asctime)s - %(name)s - %(levelname)s - %(message)s"
-)
-logger = logging.getLogger(__name__)
-
-
-def get_position_score(locality, obj1, obj2, iou_threshold=0.1):
-    # input obj1 and obj2 should be [x0,y0,x1,y1]
-    # Calculate centers of bounding boxes
-    box1 = {
-        "x_min": obj1[0],
-        "y_min": obj1[1],
-        "x_max": obj1[2],
-        "y_max": obj1[3],
-        "width": obj1[2] - obj1[0],
-        "height": obj1[3] - obj1[1],
-    }
-
-    box2 = {
-        "x_min": obj2[0],
-        "y_min": obj2[1],
-        "x_max": obj2[2],
-        "y_max": obj2[3],
-        "width": obj2[2] - obj2[0],
-        "height": obj2[3] - obj2[1],
-    }
-
-    # Get the object center
-    box1_center = (
-        (box1["x_min"] + box1["x_max"]) / 2,
-        (box1["y_min"] + box1["y_max"]) / 2,
-    )
-    box2_center = (
-        (box2["x_min"] + box2["x_max"]) / 2,
-        (box2["y_min"] + box2["y_max"]) / 2,
-    )
-
-    # Calculate horizontal and vertical distances
-    x_distance = box2_center[0] - box1_center[0]
-    y_distance = box2_center[1] - box1_center[1]
-
-    # Calculate IoU
-    x_overlap = max(
-        0, min(box1["x_max"], box2["x_max"]) - max(box1["x_min"], box2["x_min"])
-    )
-    y_overlap = max(
-        0, min(box1["y_max"], box2["y_max"]) - max(box1["y_min"], box2["y_min"])
-    )
-    intersection = x_overlap * y_overlap
-    box1_area = (box1["x_max"] - box1["x_min"]) * (box1["y_max"] - box1["y_min"])
-    box2_area = (box2["x_max"] - box2["x_min"]) * (box2["y_max"] - box2["y_min"])
-    union = box1_area + box2_area - intersection
-    iou = intersection / union
-
-    # get max object width and max object height
-    max_width = max(box1["width"], box2["width"])
-    max_height = max(box1["height"], box2["height"])
-
-    score = 0
-    if locality in "on the right of" or locality in "on the left of":
-        if abs(x_distance) > abs(y_distance) and iou < iou_threshold:
-            score = 1
-        elif abs(x_distance) > abs(y_distance) and iou >= iou_threshold:
-            score = iou_threshold / iou
-        else:
-            score = 0
-    elif locality in "on the bottom of" or locality in "on the top of":
-        if abs(y_distance) > abs(x_distance) and iou < iou_threshold:
-            score = 1
-        elif abs(y_distance) > abs(x_distance) and iou >= iou_threshold:
-            score = iou_threshold / iou
-        else:
-            score = 0
-    return score
-
-
-def get_dect_from_grit(model, image_arrays):
-    pred = []
-    if type(image_arrays) is not list:
-        image_arrays = image_arrays.numpy()
-    with torch.no_grad():
-        for frame in image_arrays:
-            ret = model.run_caption_tensor(frame)
-            pred_cur = []
-            if len(ret[0]) > 0:
-                for info in ret[0]:
-                    pred_cur.append([info[0], info[1]])
-            pred.append(pred_cur)
-    return pred
-
-
-def check_generate(key_info, predictions):
-    key_a = key_info["object_a"]
-    key_b = key_info["object_b"]
-    relation = key_info["relationship"]
-    frame_score = []
-    for frame_pred in predictions:
-        # filter the target object
-        frame_obj_locats = []
-        cur_score = [0]
-        for item in frame_pred:
-            if (key_a == item[0]) or (key_b == item[0]):
-                frame_obj_locats.append(item[1])
-            for c_obj1 in range(len(frame_obj_locats) - 1):
-                for c_obj2 in range(c_obj1 + 1, len(frame_obj_locats)):
-                    score_obj1_obj2 = get_position_score(
-                        relation, frame_obj_locats[c_obj1], frame_obj_locats[c_obj2]
-                    )
-                    cur_score.append(score_obj1_obj2)
-        frame_score.append(max(cur_score))
-    return frame_score
-
-
-def spatial_relationship(model, video_dict, device):
-    video_results = []
-    frame_score_overall = []
-    for info in tqdm(video_dict):
-        if "auxiliary_info" not in info:
-            raise "Auxiliary info is not in json, please check your json."
-        object_info = info["auxiliary_info"]["spatial_relationship"]
-        for video_path in info["video_list"]:
-            video_tensor = load_video(video_path, num_frames=16)
-            cur_video_pred = get_dect_from_grit(model, video_tensor.permute(0, 2, 3, 1))
-            cur_video_frame_score = check_generate(object_info, cur_video_pred)
-            cur_success_frame_rate = np.mean(cur_video_frame_score)
-            frame_score_overall.extend(cur_video_frame_score)
-            video_results.append(
-                {
-                    "video_path": video_path,
-                    "video_results": cur_success_frame_rate,
-                    "frame_results": cur_video_frame_score,
-                }
-            )
-    success_rate = np.mean(frame_score_overall)
-    return success_rate, video_results
-
-
-def compute_spatial_relationship(json_dir, device, submodules_dict, **kwargs):
-    dense_caption_model = DenseCaptioning(device)
-    dense_caption_model.initialize_model_det(**submodules_dict)
-    logger.info("Initialize detection model success")
-    _, prompt_dict_ls = load_dimension_info(
-        json_dir, dimension="spatial_relationship", lang="en"
-    )
-    all_results, video_results = spatial_relationship(
-        dense_caption_model, prompt_dict_ls, device
-    )
-    return all_results, video_results
diff --git a/eval/vbench/subject_consistency.py b/eval/vbench/subject_consistency.py
deleted file mode 100644
index 904772a3..00000000
--- a/eval/vbench/subject_consistency.py
+++ /dev/null
@@ -1,93 +0,0 @@
-import io
-import json
-import logging
-import os
-
-import cv2
-import numpy as np
-import torch
-import torch.nn as nn
-import torch.nn.functional as F
-import torchvision.transforms as transforms
-from PIL import Image
-from tqdm import tqdm
-from vbench.utils import (
-    dino_transform,
-    dino_transform_Image,
-    load_dimension_info,
-    load_video,
-)
-
-logging.basicConfig(
-    level=logging.INFO, format="%(asctime)s - %(name)s - %(levelname)s - %(message)s"
-)
-logger = logging.getLogger(__name__)
-
-
-def subject_consistency(model, video_list, device, read_frame):
-    sim = 0.0
-    cnt = 0
-    video_results = []
-    if read_frame:
-        image_transform = dino_transform_Image(224)
-    else:
-        image_transform = dino_transform(224)
-    for video_path in tqdm(video_list):
-        video_sim = 0.0
-        if read_frame:
-            video_path = video_path[:-4].replace("videos", "frames").replace(" ", "_")
-            tmp_paths = [
-                os.path.join(video_path, f) for f in sorted(os.listdir(video_path))
-            ]
-            images = []
-            for tmp_path in tmp_paths:
-                images.append(image_transform(Image.open(tmp_path)))
-        else:
-            images = load_video(video_path)
-            images = image_transform(images)
-        for i in range(len(images)):
-            with torch.no_grad():
-                image = images[i].unsqueeze(0)
-                image = image.to(device)
-                image_features = model(image)
-                image_features = F.normalize(image_features, dim=-1, p=2)
-                if i == 0:
-                    first_image_features = image_features
-                else:
-                    sim_pre = max(
-                        0.0,
-                        F.cosine_similarity(
-                            former_image_features, image_features
-                        ).item(),
-                    )
-                    sim_fir = max(
-                        0.0,
-                        F.cosine_similarity(
-                            first_image_features, image_features
-                        ).item(),
-                    )
-                    cur_sim = (sim_pre + sim_fir) / 2
-                    video_sim += cur_sim
-                    cnt += 1
-            former_image_features = image_features
-        sim_per_images = video_sim / (len(images) - 1)
-        sim += video_sim
-        video_results.append(
-            {"video_path": video_path, "video_results": sim_per_images}
-        )
-    # sim_per_video = sim / (len(video_list) - 1)
-    sim_per_frame = sim / cnt
-    return sim_per_frame, video_results
-
-
-def compute_subject_consistency(json_dir, device, submodules_list, **kwargs):
-    dino_model = torch.hub.load(**submodules_list).to(device)
-    read_frame = submodules_list["read_frame"]
-    logger.info("Initialize DINO success")
-    video_list, _ = load_dimension_info(
-        json_dir, dimension="subject_consistency", lang="en"
-    )
-    all_results, video_results = subject_consistency(
-        dino_model, video_list, device, read_frame
-    )
-    return all_results, video_results
diff --git a/eval/vbench/temporal_flickering.py b/eval/vbench/temporal_flickering.py
deleted file mode 100644
index 96d49c7c..00000000
--- a/eval/vbench/temporal_flickering.py
+++ /dev/null
@@ -1,66 +0,0 @@
-import cv2
-import numpy as np
-from tqdm import tqdm
-from vbench.utils import load_dimension_info
-
-
-def get_frames(video_path):
-    frames = []
-    video = cv2.VideoCapture(video_path)
-    while video.isOpened():
-        success, frame = video.read()
-        if success:
-            frames.append(frame)
-        else:
-            break
-    video.release()
-    assert frames != []
-    return frames
-
-
-def mae_seq(frames):
-    ssds = []
-    for i in range(len(frames) - 1):
-        ssds.append(calculate_mae(frames[i], frames[i + 1]))
-    return np.array(ssds)
-
-
-def calculate_mae(img1, img2):
-    """Computing the mean absolute error (MAE) between two images."""
-    if img1.shape != img2.shape:
-        print("Images don't have the same shape.")
-        return
-    return np.mean(
-        cv2.absdiff(np.array(img1, dtype=np.float32), np.array(img2, dtype=np.float32))
-    )
-
-
-def cal_score(video_path):
-    """please ensure the video is static"""
-    frames = get_frames(video_path)
-    score_seq = mae_seq(frames)
-    return (255.0 - np.mean(score_seq).item()) / 255.0
-
-
-def temporal_flickering(video_list):
-    sim = []
-    video_results = []
-    for video_path in tqdm(video_list):
-        try:
-            score_per_video = cal_score(video_path)
-        except AssertionError:
-            continue
-        video_results.append(
-            {"video_path": video_path, "video_results": score_per_video}
-        )
-        sim.append(score_per_video)
-    avg_score = np.mean(sim)
-    return avg_score, video_results
-
-
-def compute_temporal_flickering(json_dir, device, submodules_list, **kwargs):
-    video_list, _ = load_dimension_info(
-        json_dir, dimension="temporal_flickering", lang="en"
-    )
-    all_results, video_results = temporal_flickering(video_list)
-    return all_results, video_results
diff --git a/eval/vbench/temporal_style.py b/eval/vbench/temporal_style.py
deleted file mode 100644
index e2d4d6db..00000000
--- a/eval/vbench/temporal_style.py
+++ /dev/null
@@ -1,79 +0,0 @@
-import json
-import os
-
-import clip
-import numpy as np
-import torch
-from tqdm import tqdm
-from vbench.third_party.ViCLIP.simple_tokenizer import SimpleTokenizer
-from vbench.third_party.ViCLIP.viclip import ViCLIP
-from vbench.utils import (
-    CACHE_DIR,
-    clip_transform,
-    load_dimension_info,
-    load_video,
-    read_frames_decord_by_fps,
-)
-
-
-def get_text_features(model, input_text, tokenizer, text_feature_dict={}):
-    if input_text in text_feature_dict:
-        return text_feature_dict[input_text]
-    text_template = f"{input_text}"
-    with torch.no_grad():
-        text_features = model.encode_text(text_template).float()
-        text_features /= text_features.norm(dim=-1, keepdim=True)
-        text_feature_dict[input_text] = text_features
-    return text_features
-
-
-def get_vid_features(model, input_frames):
-    with torch.no_grad():
-        clip_feat = model.encode_vision(input_frames, test=True).float()
-        clip_feat /= clip_feat.norm(dim=-1, keepdim=True)
-    return clip_feat
-
-
-def get_predict_label(clip_feature, text_feats_tensor, top=5):
-    label_probs = (100.0 * clip_feature @ text_feats_tensor.T).softmax(dim=-1)
-    top_probs, top_labels = label_probs.cpu().topk(top, dim=-1)
-    return top_probs, top_labels
-
-
-def temporal_style(clip_model, video_dict, tokenizer, device, sample="middle"):
-    sim = []
-    video_results = []
-    image_transform = clip_transform(224)
-    for info in tqdm(video_dict):
-        query = info["prompt"]
-        # text = clip.tokenize([query]).to(device)
-        video_list = info["video_list"]
-        for video_path in video_list:
-            cur_video = []
-            with torch.no_grad():
-                # images = load_video(video_path, num_frames=8)
-                images = read_frames_decord_by_fps(
-                    video_path, num_frames=8, sample=sample
-                )
-                images = image_transform(images)
-                images = images.to(device)
-                clip_feat = get_vid_features(clip_model, images.unsqueeze(0))
-                text_feat = get_text_features(clip_model, query, tokenizer)
-                logit_per_text = clip_feat @ text_feat.T
-                score_per_video = float(logit_per_text[0][0].cpu())
-                sim.append(score_per_video)
-                video_results.append(
-                    {"video_path": video_path, "video_results": score_per_video}
-                )
-    avg_score = np.mean(sim)
-    return avg_score, video_results
-
-
-def compute_temporal_style(json_dir, device, submodules_list, **kwargs):
-    tokenizer = SimpleTokenizer(
-        os.path.join(CACHE_DIR, "ViCLIP/bpe_simple_vocab_16e6.txt.gz")
-    )
-    viclip = ViCLIP(tokenizer=tokenizer, **submodules_list).to(device)
-    _, video_dict = load_dimension_info(json_dir, dimension="temporal_style", lang="en")
-    all_results, video_results = temporal_style(viclip, video_dict, tokenizer, device)
-    return all_results, video_results
diff --git a/eval/vbench/third_party/RAFT/LICENSE b/eval/vbench/third_party/RAFT/LICENSE
deleted file mode 100644
index ed13d840..00000000
--- a/eval/vbench/third_party/RAFT/LICENSE
+++ /dev/null
@@ -1,29 +0,0 @@
-BSD 3-Clause License
-
-Copyright (c) 2020, princeton-vl
-All rights reserved.
-
-Redistribution and use in source and binary forms, with or without
-modification, are permitted provided that the following conditions are met:
-
-* Redistributions of source code must retain the above copyright notice, this
-  list of conditions and the following disclaimer.
-
-* Redistributions in binary form must reproduce the above copyright notice,
-  this list of conditions and the following disclaimer in the documentation
-  and/or other materials provided with the distribution.
-
-* Neither the name of the copyright holder nor the names of its
-  contributors may be used to endorse or promote products derived from
-  this software without specific prior written permission.
-
-THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS"
-AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
-IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE
-DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE
-FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
-DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR
-SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER
-CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY,
-OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
-OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
diff --git a/eval/vbench/third_party/RAFT/RAFT.png b/eval/vbench/third_party/RAFT/RAFT.png
deleted file mode 100644
index a387fe2c..00000000
Binary files a/eval/vbench/third_party/RAFT/RAFT.png and /dev/null differ
diff --git a/eval/vbench/third_party/RAFT/README.md b/eval/vbench/third_party/RAFT/README.md
deleted file mode 100644
index 388d2629..00000000
--- a/eval/vbench/third_party/RAFT/README.md
+++ /dev/null
@@ -1,80 +0,0 @@
-# RAFT
-This repository contains the source code for our paper:
-
-[RAFT: Recurrent All Pairs Field Transforms for Optical Flow](https://arxiv.org/pdf/2003.12039.pdf)<br/>
-ECCV 2020 <br/>
-Zachary Teed and Jia Deng<br/>
-
-<img src="RAFT.png">
-
-## Requirements
-The code has been tested with PyTorch 1.6 and Cuda 10.1.
-```Shell
-conda create --name raft
-conda activate raft
-conda install pytorch=1.6.0 torchvision=0.7.0 cudatoolkit=10.1 matplotlib tensorboard scipy opencv -c pytorch
-```
-
-## Demos
-Pretrained models can be downloaded by running
-```Shell
-./download_models.sh
-```
-or downloaded from [google drive](https://drive.google.com/drive/folders/1sWDsfuZ3Up38EUQt7-JDTT1HcGHuJgvT?usp=sharing)
-
-You can demo a trained model on a sequence of frames
-```Shell
-python demo.py --model=models/raft-things.pth --path=demo-frames
-```
-
-## Required Data
-To evaluate/train RAFT, you will need to download the required datasets.
-* [FlyingChairs](https://lmb.informatik.uni-freiburg.de/resources/datasets/FlyingChairs.en.html#flyingchairs)
-* [FlyingThings3D](https://lmb.informatik.uni-freiburg.de/resources/datasets/SceneFlowDatasets.en.html)
-* [Sintel](http://sintel.is.tue.mpg.de/)
-* [KITTI](http://www.cvlibs.net/datasets/kitti/eval_scene_flow.php?benchmark=flow)
-* [HD1K](http://hci-benchmark.iwr.uni-heidelberg.de/) (optional)
-
-
-By default `datasets.py` will search for the datasets in these locations. You can create symbolic links to wherever the datasets were downloaded in the `datasets` folder
-
-```Shell
-├── datasets
-    ├── Sintel
-        ├── test
-        ├── training
-    ├── KITTI
-        ├── testing
-        ├── training
-        ├── devkit
-    ├── FlyingChairs_release
-        ├── data
-    ├── FlyingThings3D
-        ├── frames_cleanpass
-        ├── frames_finalpass
-        ├── optical_flow
-```
-
-## Evaluation
-You can evaluate a trained model using `evaluate.py`
-```Shell
-python evaluate.py --model=models/raft-things.pth --dataset=sintel --mixed_precision
-```
-
-## Training
-We used the following training schedule in our paper (2 GPUs). Training logs will be written to the `runs` which can be visualized using tensorboard
-```Shell
-./train_standard.sh
-```
-
-If you have a RTX GPU, training can be accelerated using mixed precision. You can expect similiar results in this setting (1 GPU)
-```Shell
-./train_mixed.sh
-```
-
-## (Optional) Efficent Implementation
-You can optionally use our alternate (efficent) implementation by compiling the provided cuda extension
-```Shell
-cd alt_cuda_corr && python setup.py install && cd ..
-```
-and running `demo.py` and `evaluate.py` with the `--alternate_corr` flag Note, this implementation is somewhat slower than all-pairs, but uses significantly less GPU memory during the forward pass.
diff --git a/eval/vbench/third_party/RAFT/__init__.py b/eval/vbench/third_party/RAFT/__init__.py
deleted file mode 100644
index e69de29b..00000000
diff --git a/eval/vbench/third_party/RAFT/alt_cuda_corr/correlation.cpp b/eval/vbench/third_party/RAFT/alt_cuda_corr/correlation.cpp
deleted file mode 100644
index 9ba63069..00000000
--- a/eval/vbench/third_party/RAFT/alt_cuda_corr/correlation.cpp
+++ /dev/null
@@ -1,54 +0,0 @@
-#include <torch/extension.h>
-#include <vector>
-
-// CUDA forward declarations
-std::vector<torch::Tensor> corr_cuda_forward(
-    torch::Tensor fmap1,
-    torch::Tensor fmap2,
-    torch::Tensor coords,
-    int radius);
-
-std::vector<torch::Tensor> corr_cuda_backward(
-  torch::Tensor fmap1,
-  torch::Tensor fmap2,
-  torch::Tensor coords,
-  torch::Tensor corr_grad,
-  int radius);
-
-// C++ interface
-#define CHECK_CUDA(x) TORCH_CHECK(x.type().is_cuda(), #x " must be a CUDA tensor")
-#define CHECK_CONTIGUOUS(x) TORCH_CHECK(x.is_contiguous(), #x " must be contiguous")
-#define CHECK_INPUT(x) CHECK_CUDA(x); CHECK_CONTIGUOUS(x)
-
-std::vector<torch::Tensor> corr_forward(
-    torch::Tensor fmap1,
-    torch::Tensor fmap2,
-    torch::Tensor coords,
-    int radius) {
-  CHECK_INPUT(fmap1);
-  CHECK_INPUT(fmap2);
-  CHECK_INPUT(coords);
-
-  return corr_cuda_forward(fmap1, fmap2, coords, radius);
-}
-
-
-std::vector<torch::Tensor> corr_backward(
-    torch::Tensor fmap1,
-    torch::Tensor fmap2,
-    torch::Tensor coords,
-    torch::Tensor corr_grad,
-    int radius) {
-  CHECK_INPUT(fmap1);
-  CHECK_INPUT(fmap2);
-  CHECK_INPUT(coords);
-  CHECK_INPUT(corr_grad);
-
-  return corr_cuda_backward(fmap1, fmap2, coords, corr_grad, radius);
-}
-
-
-PYBIND11_MODULE(TORCH_EXTENSION_NAME, m) {
-  m.def("forward", &corr_forward, "CORR forward");
-  m.def("backward", &corr_backward, "CORR backward");
-}
diff --git a/eval/vbench/third_party/RAFT/alt_cuda_corr/correlation_kernel.cu b/eval/vbench/third_party/RAFT/alt_cuda_corr/correlation_kernel.cu
deleted file mode 100644
index 017dee1d..00000000
--- a/eval/vbench/third_party/RAFT/alt_cuda_corr/correlation_kernel.cu
+++ /dev/null
@@ -1,324 +0,0 @@
-#include <torch/extension.h>
-#include <cuda.h>
-#include <cuda_runtime.h>
-#include <vector>
-
-
-#define BLOCK_H 4
-#define BLOCK_W 8
-#define BLOCK_HW BLOCK_H * BLOCK_W
-#define CHANNEL_STRIDE 32
-
-
-__forceinline__ __device__
-bool within_bounds(int h, int w, int H, int W) {
-  return h >= 0 && h < H && w >= 0 && w < W;
-}
-
-template <typename scalar_t>
-__global__ void corr_forward_kernel(
-    const torch::PackedTensorAccessor32<scalar_t,4,torch::RestrictPtrTraits> fmap1,
-    const torch::PackedTensorAccessor32<scalar_t,4,torch::RestrictPtrTraits> fmap2,
-    const torch::PackedTensorAccessor32<scalar_t,5,torch::RestrictPtrTraits> coords,
-    torch::PackedTensorAccessor32<scalar_t,5,torch::RestrictPtrTraits> corr,
-    int r)
-{
-  const int b = blockIdx.x;
-  const int h0 = blockIdx.y * blockDim.x;
-  const int w0 = blockIdx.z * blockDim.y;
-  const int tid = threadIdx.x * blockDim.y + threadIdx.y;
-
-  const int H1 = fmap1.size(1);
-  const int W1 = fmap1.size(2);
-  const int H2 = fmap2.size(1);
-  const int W2 = fmap2.size(2);
-  const int N = coords.size(1);
-  const int C = fmap1.size(3);
-
-  __shared__ scalar_t f1[CHANNEL_STRIDE][BLOCK_HW+1];
-  __shared__ scalar_t f2[CHANNEL_STRIDE][BLOCK_HW+1];
-  __shared__ scalar_t x2s[BLOCK_HW];
-  __shared__ scalar_t y2s[BLOCK_HW];
-
-  for (int c=0; c<C; c+=CHANNEL_STRIDE) {
-    for (int k=0; k<BLOCK_HW; k+=BLOCK_HW/CHANNEL_STRIDE) {
-      int k1 = k + tid / CHANNEL_STRIDE;
-      int h1 = h0 + k1 / BLOCK_W;
-      int w1 = w0 + k1 % BLOCK_W;
-      int c1 = tid % CHANNEL_STRIDE;
-
-      auto fptr = fmap1[b][h1][w1];
-      if (within_bounds(h1, w1, H1, W1))
-        f1[c1][k1] = fptr[c+c1];
-      else
-        f1[c1][k1] = 0.0;
-    }
-
-    __syncthreads();
-
-    for (int n=0; n<N; n++) {
-      int h1 = h0 + threadIdx.x;
-      int w1 = w0 + threadIdx.y;
-      if (within_bounds(h1, w1, H1, W1)) {
-        x2s[tid] = coords[b][n][h1][w1][0];
-        y2s[tid] = coords[b][n][h1][w1][1];
-      }
-
-      scalar_t dx = x2s[tid] - floor(x2s[tid]);
-      scalar_t dy = y2s[tid] - floor(y2s[tid]);
-
-      int rd = 2*r + 1;
-      for (int iy=0; iy<rd+1; iy++) {
-        for (int ix=0; ix<rd+1; ix++) {
-          for (int k=0; k<BLOCK_HW; k+=BLOCK_HW/CHANNEL_STRIDE) {
-            int k1 = k + tid / CHANNEL_STRIDE;
-            int h2 = static_cast<int>(floor(y2s[k1]))-r+iy;
-            int w2 = static_cast<int>(floor(x2s[k1]))-r+ix;
-            int c2 = tid % CHANNEL_STRIDE;
-
-            auto fptr = fmap2[b][h2][w2];
-            if (within_bounds(h2, w2, H2, W2))
-              f2[c2][k1] = fptr[c+c2];
-            else
-              f2[c2][k1] = 0.0;
-          }
-
-          __syncthreads();
-
-          scalar_t s = 0.0;
-          for (int k=0; k<CHANNEL_STRIDE; k++)
-            s += f1[k][tid] * f2[k][tid];
-
-          int ix_nw = H1*W1*((iy-1) + rd*(ix-1));
-          int ix_ne = H1*W1*((iy-1) + rd*ix);
-          int ix_sw = H1*W1*(iy + rd*(ix-1));
-          int ix_se = H1*W1*(iy + rd*ix);
-
-          scalar_t nw = s * (dy) * (dx);
-          scalar_t ne = s * (dy) * (1-dx);
-          scalar_t sw = s * (1-dy) * (dx);
-          scalar_t se = s * (1-dy) * (1-dx);
-
-          scalar_t* corr_ptr = &corr[b][n][0][h1][w1];
-
-          if (iy > 0 && ix > 0 && within_bounds(h1, w1, H1, W1))
-            *(corr_ptr + ix_nw) += nw;
-
-          if (iy > 0 && ix < rd && within_bounds(h1, w1, H1, W1))
-            *(corr_ptr + ix_ne) += ne;
-
-          if (iy < rd && ix > 0 && within_bounds(h1, w1, H1, W1))
-            *(corr_ptr + ix_sw) += sw;
-
-          if (iy < rd && ix < rd && within_bounds(h1, w1, H1, W1))
-            *(corr_ptr + ix_se) += se;
-        }
-      }
-    }
-  }
-}
-
-
-template <typename scalar_t>
-__global__ void corr_backward_kernel(
-    const torch::PackedTensorAccessor32<scalar_t,4,torch::RestrictPtrTraits> fmap1,
-    const torch::PackedTensorAccessor32<scalar_t,4,torch::RestrictPtrTraits> fmap2,
-    const torch::PackedTensorAccessor32<scalar_t,5,torch::RestrictPtrTraits> coords,
-    const torch::PackedTensorAccessor32<scalar_t,5,torch::RestrictPtrTraits> corr_grad,
-    torch::PackedTensorAccessor32<scalar_t,4,torch::RestrictPtrTraits> fmap1_grad,
-    torch::PackedTensorAccessor32<scalar_t,4,torch::RestrictPtrTraits> fmap2_grad,
-    torch::PackedTensorAccessor32<scalar_t,5,torch::RestrictPtrTraits> coords_grad,
-    int r)
-{
-
-  const int b = blockIdx.x;
-  const int h0 = blockIdx.y * blockDim.x;
-  const int w0 = blockIdx.z * blockDim.y;
-  const int tid = threadIdx.x * blockDim.y + threadIdx.y;
-
-  const int H1 = fmap1.size(1);
-  const int W1 = fmap1.size(2);
-  const int H2 = fmap2.size(1);
-  const int W2 = fmap2.size(2);
-  const int N = coords.size(1);
-  const int C = fmap1.size(3);
-
-  __shared__ scalar_t f1[CHANNEL_STRIDE][BLOCK_HW+1];
-  __shared__ scalar_t f2[CHANNEL_STRIDE][BLOCK_HW+1];
-
-  __shared__ scalar_t f1_grad[CHANNEL_STRIDE][BLOCK_HW+1];
-  __shared__ scalar_t f2_grad[CHANNEL_STRIDE][BLOCK_HW+1];
-
-  __shared__ scalar_t x2s[BLOCK_HW];
-  __shared__ scalar_t y2s[BLOCK_HW];
-
-  for (int c=0; c<C; c+=CHANNEL_STRIDE) {
-
-    for (int k=0; k<BLOCK_HW; k+=BLOCK_HW/CHANNEL_STRIDE) {
-      int k1 = k + tid / CHANNEL_STRIDE;
-      int h1 = h0 + k1 / BLOCK_W;
-      int w1 = w0 + k1 % BLOCK_W;
-      int c1 = tid % CHANNEL_STRIDE;
-
-      auto fptr = fmap1[b][h1][w1];
-      if (within_bounds(h1, w1, H1, W1))
-        f1[c1][k1] = fptr[c+c1];
-      else
-        f1[c1][k1] = 0.0;
-
-      f1_grad[c1][k1] = 0.0;
-    }
-
-    __syncthreads();
-
-    int h1 = h0 + threadIdx.x;
-    int w1 = w0 + threadIdx.y;
-
-    for (int n=0; n<N; n++) {
-      x2s[tid] = coords[b][n][h1][w1][0];
-      y2s[tid] = coords[b][n][h1][w1][1];
-
-      scalar_t dx = x2s[tid] - floor(x2s[tid]);
-      scalar_t dy = y2s[tid] - floor(y2s[tid]);
-
-      int rd = 2*r + 1;
-      for (int iy=0; iy<rd+1; iy++) {
-        for (int ix=0; ix<rd+1; ix++) {
-          for (int k=0; k<BLOCK_HW; k+=BLOCK_HW/CHANNEL_STRIDE) {
-            int k1 = k + tid / CHANNEL_STRIDE;
-            int h2 = static_cast<int>(floor(y2s[k1]))-r+iy;
-            int w2 = static_cast<int>(floor(x2s[k1]))-r+ix;
-            int c2 = tid % CHANNEL_STRIDE;
-
-            auto fptr = fmap2[b][h2][w2];
-            if (within_bounds(h2, w2, H2, W2))
-              f2[c2][k1] = fptr[c+c2];
-            else
-              f2[c2][k1] = 0.0;
-
-            f2_grad[c2][k1] = 0.0;
-          }
-
-          __syncthreads();
-
-          const scalar_t* grad_ptr = &corr_grad[b][n][0][h1][w1];
-          scalar_t g = 0.0;
-
-          int ix_nw = H1*W1*((iy-1) + rd*(ix-1));
-          int ix_ne = H1*W1*((iy-1) + rd*ix);
-          int ix_sw = H1*W1*(iy + rd*(ix-1));
-          int ix_se = H1*W1*(iy + rd*ix);
-
-          if (iy > 0 && ix > 0 && within_bounds(h1, w1, H1, W1))
-            g +=  *(grad_ptr + ix_nw) * dy * dx;
-
-          if (iy > 0 && ix < rd && within_bounds(h1, w1, H1, W1))
-            g += *(grad_ptr + ix_ne) * dy * (1-dx);
-
-          if (iy < rd && ix > 0 && within_bounds(h1, w1, H1, W1))
-            g += *(grad_ptr + ix_sw) * (1-dy) * dx;
-
-          if (iy < rd && ix < rd && within_bounds(h1, w1, H1, W1))
-            g += *(grad_ptr + ix_se) * (1-dy) * (1-dx);
-
-          for (int k=0; k<CHANNEL_STRIDE; k++) {
-            f1_grad[k][tid] += g * f2[k][tid];
-            f2_grad[k][tid] += g * f1[k][tid];
-          }
-
-          for (int k=0; k<BLOCK_HW; k+=BLOCK_HW/CHANNEL_STRIDE) {
-            int k1 = k + tid / CHANNEL_STRIDE;
-            int h2 = static_cast<int>(floor(y2s[k1]))-r+iy;
-            int w2 = static_cast<int>(floor(x2s[k1]))-r+ix;
-            int c2 = tid % CHANNEL_STRIDE;
-
-            scalar_t* fptr = &fmap2_grad[b][h2][w2][0];
-            if (within_bounds(h2, w2, H2, W2))
-              atomicAdd(fptr+c+c2, f2_grad[c2][k1]);
-          }
-        }
-      }
-    }
-    __syncthreads();
-
-
-    for (int k=0; k<BLOCK_HW; k+=BLOCK_HW/CHANNEL_STRIDE) {
-      int k1 = k + tid / CHANNEL_STRIDE;
-      int h1 = h0 + k1 / BLOCK_W;
-      int w1 = w0 + k1 % BLOCK_W;
-      int c1 = tid % CHANNEL_STRIDE;
-
-      scalar_t* fptr = &fmap1_grad[b][h1][w1][0];
-      if (within_bounds(h1, w1, H1, W1))
-        fptr[c+c1] += f1_grad[c1][k1];
-    }
-  }
-}
-
-
-
-std::vector<torch::Tensor> corr_cuda_forward(
-  torch::Tensor fmap1,
-  torch::Tensor fmap2,
-  torch::Tensor coords,
-  int radius)
-{
-  const auto B = coords.size(0);
-  const auto N = coords.size(1);
-  const auto H = coords.size(2);
-  const auto W = coords.size(3);
-
-  const auto rd = 2 * radius + 1;
-  auto opts = fmap1.options();
-  auto corr = torch::zeros({B, N, rd*rd, H, W}, opts);
-
-  const dim3 blocks(B, (H+BLOCK_H-1)/BLOCK_H, (W+BLOCK_W-1)/BLOCK_W);
-  const dim3 threads(BLOCK_H, BLOCK_W);
-
-  corr_forward_kernel<float><<<blocks, threads>>>(
-    fmap1.packed_accessor32<float,4,torch::RestrictPtrTraits>(),
-    fmap2.packed_accessor32<float,4,torch::RestrictPtrTraits>(),
-    coords.packed_accessor32<float,5,torch::RestrictPtrTraits>(),
-    corr.packed_accessor32<float,5,torch::RestrictPtrTraits>(),
-    radius);
-
-  return {corr};
-}
-
-std::vector<torch::Tensor> corr_cuda_backward(
-  torch::Tensor fmap1,
-  torch::Tensor fmap2,
-  torch::Tensor coords,
-  torch::Tensor corr_grad,
-  int radius)
-{
-  const auto B = coords.size(0);
-  const auto N = coords.size(1);
-
-  const auto H1 = fmap1.size(1);
-  const auto W1 = fmap1.size(2);
-  const auto H2 = fmap2.size(1);
-  const auto W2 = fmap2.size(2);
-  const auto C = fmap1.size(3);
-
-  auto opts = fmap1.options();
-  auto fmap1_grad = torch::zeros({B, H1, W1, C}, opts);
-  auto fmap2_grad = torch::zeros({B, H2, W2, C}, opts);
-  auto coords_grad = torch::zeros({B, N, H1, W1, 2}, opts);
-
-  const dim3 blocks(B, (H1+BLOCK_H-1)/BLOCK_H, (W1+BLOCK_W-1)/BLOCK_W);
-  const dim3 threads(BLOCK_H, BLOCK_W);
-
-
-  corr_backward_kernel<float><<<blocks, threads>>>(
-    fmap1.packed_accessor32<float,4,torch::RestrictPtrTraits>(),
-    fmap2.packed_accessor32<float,4,torch::RestrictPtrTraits>(),
-    coords.packed_accessor32<float,5,torch::RestrictPtrTraits>(),
-    corr_grad.packed_accessor32<float,5,torch::RestrictPtrTraits>(),
-    fmap1_grad.packed_accessor32<float,4,torch::RestrictPtrTraits>(),
-    fmap2_grad.packed_accessor32<float,4,torch::RestrictPtrTraits>(),
-    coords_grad.packed_accessor32<float,5,torch::RestrictPtrTraits>(),
-    radius);
-
-  return {fmap1_grad, fmap2_grad, coords_grad};
-}
diff --git a/eval/vbench/third_party/RAFT/alt_cuda_corr/setup.py b/eval/vbench/third_party/RAFT/alt_cuda_corr/setup.py
deleted file mode 100644
index 799d4d9f..00000000
--- a/eval/vbench/third_party/RAFT/alt_cuda_corr/setup.py
+++ /dev/null
@@ -1,14 +0,0 @@
-from setuptools import setup
-from torch.utils.cpp_extension import BuildExtension, CUDAExtension
-
-setup(
-    name="correlation",
-    ext_modules=[
-        CUDAExtension(
-            "alt_cuda_corr",
-            sources=["correlation.cpp", "correlation_kernel.cu"],
-            extra_compile_args={"cxx": [], "nvcc": ["-O3"]},
-        ),
-    ],
-    cmdclass={"build_ext": BuildExtension},
-)
diff --git a/eval/vbench/third_party/RAFT/chairs_split.txt b/eval/vbench/third_party/RAFT/chairs_split.txt
deleted file mode 100644
index fa637708..00000000
--- a/eval/vbench/third_party/RAFT/chairs_split.txt
+++ /dev/null
@@ -1,22872 +0,0 @@
-1
-1
-1
-1
-1
-2
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-2
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-2
-1
-1
-2
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-2
-1
-1
-1
-2
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-2
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-2
-1
-1
-1
-1
-1
-2
-1
-1
-2
-2
-1
-1
-1
-1
-1
-1
-1
-1
-1
-2
-2
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-2
-1
-1
-1
-1
-1
-1
-1
-2
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-2
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-2
-2
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-2
-1
-2
-1
-2
-1
-1
-1
-2
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-2
-1
-1
-1
-2
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-2
-2
-1
-1
-1
-1
-1
-2
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-2
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-2
-1
-2
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-2
-1
-1
-1
-1
-1
-1
-1
-1
-2
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-2
-1
-1
-1
-1
-1
-1
-1
-2
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-2
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-2
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-2
-1
-1
-2
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-2
-1
-1
-1
-1
-1
-1
-1
-1
-2
-1
-2
-1
-1
-1
-1
-2
-1
-1
-1
-1
-2
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-2
-1
-1
-1
-1
-1
-1
-2
-1
-1
-1
-1
-1
-1
-1
-2
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-2
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-2
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-2
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-2
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-2
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-2
-1
-1
-1
-1
-2
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-2
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-2
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-2
-1
-1
-1
-1
-2
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-2
-1
-1
-1
-2
-1
-1
-1
-1
-1
-2
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-2
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-2
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-2
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-2
-1
-1
-2
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-2
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-2
-1
-2
-1
-1
-2
-2
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-2
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-2
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-2
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-2
-2
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-2
-1
-1
-1
-1
-1
-1
-2
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-2
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-2
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-2
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-2
-1
-1
-1
-1
-1
-1
-1
-1
-2
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-2
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-2
-1
-1
-1
-1
-1
-1
-1
-1
-1
-2
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-2
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-2
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-2
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-2
-1
-1
-1
-1
-2
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-2
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-2
-1
-2
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-2
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-2
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-2
-1
-1
-1
-1
-1
-1
-2
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-2
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-2
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-2
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-2
-1
-1
-1
-1
-1
-2
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-2
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-2
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-2
-1
-1
-1
-1
-1
-1
-1
-1
-2
-1
-1
-1
-2
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-2
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-2
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-2
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-2
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-2
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-2
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-2
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-2
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-2
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-2
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-2
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-2
-1
-1
-1
-1
-1
-1
-1
-2
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-2
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-2
-1
-1
-1
-2
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-2
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-2
-1
-2
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-2
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-2
-2
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-2
-1
-2
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-2
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-2
-1
-1
-1
-2
-1
-1
-2
-2
-1
-1
-1
-1
-2
-1
-1
-1
-1
-1
-1
-1
-2
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-2
-2
-1
-2
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-2
-1
-1
-1
-1
-1
-1
-2
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-2
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-2
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-2
-1
-2
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-2
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-2
-1
-1
-1
-1
-1
-1
-2
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-2
-1
-1
-1
-1
-1
-2
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-2
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-2
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-2
-1
-1
-1
-1
-1
-2
-1
-1
-1
-1
-1
-1
-1
-1
-2
-2
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-2
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-2
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-2
-1
-1
-1
-1
-2
-1
-1
-1
-1
-1
-1
-1
-2
-1
-1
-1
-2
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-2
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-2
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-2
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-2
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-2
-1
-1
-1
-1
-1
-2
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-2
-1
-1
-1
-2
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-2
-1
-1
-1
-1
-1
-1
-2
-1
-1
-1
-1
-2
-1
-1
-1
-1
-2
-1
-1
-2
-1
-1
-1
-1
-2
-1
-1
-1
-1
-2
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-2
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-2
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-2
-1
-1
-2
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-2
-1
-1
-1
-1
-1
-1
-2
-1
-1
-1
-1
-1
-1
-1
-2
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-2
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-2
-1
-1
-2
-1
-1
-1
-1
-1
-1
-1
-2
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-2
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-2
-1
-1
-1
-1
-1
-1
-2
-1
-1
-1
-1
-1
-1
-2
-1
-1
-2
-1
-1
-2
-1
-1
-1
-1
-1
-2
-1
-1
-1
-1
-1
-1
-1
-1
-1
-2
-1
-2
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-2
-1
-1
-1
-1
-1
-2
-1
-2
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-2
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-2
-1
-1
-1
-1
-2
-1
-1
-1
-1
-1
-2
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-2
-1
-1
-1
-1
-2
-1
-2
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-2
-1
-1
-2
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-2
-1
-1
-1
-2
-1
-2
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-2
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-2
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-2
-1
-1
-2
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-2
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-2
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-2
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-2
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-2
-1
-1
-1
-2
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-2
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-2
-1
-1
-1
-1
-1
-1
-1
-1
-2
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-2
-1
-1
-1
-2
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-2
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-2
-1
-1
-2
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-2
-1
-1
-1
-1
-1
-1
-1
-1
-2
-1
-1
-1
-1
-1
-1
-1
-1
-1
-2
-2
-2
-2
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-2
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-2
-1
-2
-1
-1
-1
-1
-1
-2
-1
-1
-1
-1
-1
-1
-1
-1
-2
-1
-1
-1
-1
-1
-1
-2
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-2
-1
-1
-1
-1
-1
-1
-2
-1
-2
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-2
-1
-1
-1
-1
-1
-1
-1
-2
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-2
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-2
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-2
-1
-1
-2
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-2
-1
-1
-2
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-2
-1
-1
-1
-1
-1
-1
-1
-1
-2
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-2
-1
-1
-1
-1
-1
-2
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-2
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-2
-1
-1
-1
-1
-1
-1
-1
-2
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-2
-1
-1
-1
-2
-1
-2
-1
-1
-1
-1
-1
-1
-1
-1
-2
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-2
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-2
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-2
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-2
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-2
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-2
-1
-1
-2
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-2
-1
-1
-1
-1
-1
-1
-2
-2
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-2
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-2
-1
-1
-1
-1
-1
-1
-1
-2
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-2
-1
-2
-1
-1
-2
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-2
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-2
-1
-1
-1
-2
-1
-1
-1
-1
-1
-1
-1
-2
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-2
-1
-2
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-2
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-2
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-2
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-2
-1
-1
-1
-2
-1
-1
-2
-1
-1
-1
-1
-1
-2
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-2
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-2
-1
-1
-1
-2
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-2
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-2
-1
-1
-1
-1
-1
-1
-1
-1
-2
-1
-1
-1
-1
-1
-1
-1
-1
-2
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-2
-1
-1
-2
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-2
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-2
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-2
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-2
-1
-1
-1
-2
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-2
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-2
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-2
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-2
-1
-1
-1
-1
-1
-1
-2
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-2
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-2
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-2
-1
-1
-1
-1
-1
-1
-1
-2
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-2
-1
-1
-1
-1
-1
-1
-1
-1
-1
-2
-1
-1
-2
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-2
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-2
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-2
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-2
-1
-1
-1
-1
-1
-1
-1
-2
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-2
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-2
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-2
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-2
-1
-1
-2
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-2
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-2
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-2
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-2
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-2
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-2
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-2
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-2
-1
-1
-1
-1
-1
-1
-1
-2
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-2
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-2
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-2
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-2
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-2
-1
-1
-1
-1
-2
-1
-2
-1
-1
-1
-1
-1
-1
-1
-1
-1
-2
-1
-1
-1
-2
-1
-1
-1
-1
-1
-1
-1
-2
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-2
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-2
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-2
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-2
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-2
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-2
-1
-1
-1
-1
-1
-1
-2
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-2
-2
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-2
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-2
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-2
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-2
-1
-2
-1
-1
-1
-1
-1
-1
-1
-1
-2
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-2
-1
-1
-1
-1
-1
-1
-1
-1
-2
-1
-2
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-2
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-2
-1
-1
-1
-2
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-2
-2
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-2
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-2
-1
-1
-1
-2
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-2
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-2
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-2
-1
-1
-1
-1
-1
-1
-2
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-2
-2
-1
-1
-2
-1
-1
-2
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-2
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-2
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-2
-1
-1
-1
-1
-1
-1
-2
-1
-2
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-2
-2
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-2
-1
-1
-1
-1
-1
-1
-1
-2
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-2
-1
-2
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-2
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-2
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-2
-1
-1
-1
-1
-1
-1
-1
-2
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-2
-1
-1
-2
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-2
-1
-2
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-2
-1
-1
-1
-1
-2
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-2
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-2
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-2
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-2
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-2
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-2
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-2
-1
-1
-1
-1
-1
-1
-1
-1
-2
-1
-2
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-2
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-2
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-2
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-2
-1
-2
-1
-1
-1
-1
-1
-1
-1
-2
-1
-1
-1
-1
-1
-1
-2
-1
-1
-1
-1
-1
-1
-1
-1
-1
-2
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-2
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-2
-1
-1
-2
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-2
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-2
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-2
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-2
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-2
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-2
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-2
-1
-1
-1
-2
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-2
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-2
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-2
-1
-1
-1
-1
-1
-2
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-2
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-2
-1
-1
-1
-1
-1
-1
-2
-1
-1
-1
-2
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-2
-1
-1
-1
-1
-1
-1
-1
-2
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-2
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-2
-2
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-2
-1
-1
-1
-1
-2
-1
-1
-2
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-2
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-2
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-2
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-2
-1
-1
-2
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-2
-1
-1
-2
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-2
-1
-1
-1
-1
-1
-1
-1
-2
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-2
-1
-1
-1
-1
-1
-1
-1
-2
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-2
-2
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-2
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-2
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-2
-1
-1
-1
-1
-2
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-2
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-2
-1
-1
-1
-1
-1
-1
-2
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-2
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-2
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-2
-1
-2
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-2
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-2
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-2
-1
-1
-1
-1
-1
-1
-1
-1
-2
-1
-1
-1
-2
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-2
-1
-1
-1
-1
-1
-2
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-2
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-2
-1
-1
-1
-2
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-2
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-2
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-2
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-2
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-2
-1
-2
-1
-1
-1
-1
-1
-1
-1
-1
-1
-2
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-2
-1
-1
-1
-2
-2
-1
-1
-1
-1
-1
-1
-1
-1
-2
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-2
-1
-1
-1
-1
-1
-1
-1
-1
-2
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-2
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-2
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-2
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-2
-1
-1
-1
-1
-1
-1
-1
-1
-2
-1
-1
-1
-1
-1
-1
-1
-2
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-2
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-2
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-2
-2
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-2
-1
-1
-2
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-2
-1
-1
-1
-2
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-2
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-2
-1
-1
-1
-1
-1
-2
-1
-1
-1
-1
-1
-1
-2
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-2
-1
-2
-2
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-2
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-2
-1
-1
-1
-2
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-2
-1
-1
-1
-1
-1
-1
-1
-1
-2
-2
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-2
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-2
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-2
-1
-1
-1
-1
-1
-1
-1
-1
-1
-2
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-2
-1
-1
-1
-1
-2
-1
-2
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-2
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-2
-1
-1
-1
-1
-1
-2
-2
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-2
-1
-2
-1
-1
-1
-1
-2
-1
-1
-1
-1
-1
-1
-1
-2
-2
-2
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-2
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-2
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-2
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-2
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-2
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-2
-1
-1
-1
-1
-1
-1
-2
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-2
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-2
-1
-1
-1
-2
-2
-1
-1
-2
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-2
-1
-1
-2
-1
-2
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-2
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-2
-1
-1
-1
-1
-1
-1
-1
-1
-2
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-2
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-2
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-2
-2
-1
-1
-1
-1
-2
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-2
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-2
-1
-1
-1
-1
-1
-1
-2
-1
-1
-1
-1
-1
-1
-2
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-2
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-2
-1
-1
-1
-1
-1
-1
-2
-2
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-2
-2
-1
-1
-1
-1
-2
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-2
-1
-1
-1
-1
-1
-1
-1
-2
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-2
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-2
-1
-1
-1
-1
-1
-1
-2
-1
-1
-2
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-2
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-2
-1
-1
-1
-1
-1
-1
-1
-1
-2
-1
-1
-1
-1
-1
-1
-1
-1
-2
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-2
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-2
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-2
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-2
-1
-1
-1
-1
-1
-1
-1
-2
-1
-2
-1
-1
-1
-1
-2
-1
-1
-1
-1
-1
-1
-1
-2
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-2
-1
-1
-1
-1
-2
-2
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-2
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-2
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-2
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-2
-1
-1
-2
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-2
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-2
-1
-1
-2
-1
-1
-2
-1
-1
-2
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-2
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-2
-1
-1
-1
-1
-1
-1
-1
-1
-2
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-2
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-2
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-2
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-2
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-2
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-2
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-2
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-2
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-2
-1
-1
-1
-1
-2
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-2
-2
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-2
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-2
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-2
-1
-1
-1
-1
-1
-1
-2
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-2
-1
-1
-2
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-2
-2
-1
-1
-2
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-2
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-2
-1
-1
-1
-2
-1
-1
-1
-1
-1
-1
-1
-1
-1
-2
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-2
-2
-1
-1
-2
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-2
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-2
-2
-1
-1
-1
-1
-1
-1
-1
-2
-1
-1
-1
-1
-1
diff --git a/eval/vbench/third_party/RAFT/core/__init__.py b/eval/vbench/third_party/RAFT/core/__init__.py
deleted file mode 100644
index e69de29b..00000000
diff --git a/eval/vbench/third_party/RAFT/core/corr.py b/eval/vbench/third_party/RAFT/core/corr.py
deleted file mode 100644
index 616d8eb1..00000000
--- a/eval/vbench/third_party/RAFT/core/corr.py
+++ /dev/null
@@ -1,92 +0,0 @@
-import torch
-import torch.nn.functional as F
-
-from .utils_core.utils import bilinear_sampler, coords_grid
-
-try:
-    import alt_cuda_corr
-except:
-    # alt_cuda_corr is not compiled
-    pass
-
-
-class CorrBlock:
-    def __init__(self, fmap1, fmap2, num_levels=4, radius=4):
-        self.num_levels = num_levels
-        self.radius = radius
-        self.corr_pyramid = []
-
-        # all pairs correlation
-        corr = CorrBlock.corr(fmap1, fmap2)
-
-        batch, h1, w1, dim, h2, w2 = corr.shape
-        corr = corr.reshape(batch * h1 * w1, dim, h2, w2)
-
-        self.corr_pyramid.append(corr)
-        for i in range(self.num_levels - 1):
-            corr = F.avg_pool2d(corr, 2, stride=2)
-            self.corr_pyramid.append(corr)
-
-    def __call__(self, coords):
-        r = self.radius
-        coords = coords.permute(0, 2, 3, 1)
-        batch, h1, w1, _ = coords.shape
-
-        out_pyramid = []
-        for i in range(self.num_levels):
-            corr = self.corr_pyramid[i]
-            dx = torch.linspace(-r, r, 2 * r + 1, device=coords.device)
-            dy = torch.linspace(-r, r, 2 * r + 1, device=coords.device)
-            delta = torch.stack(torch.meshgrid(dy, dx), axis=-1)
-
-            centroid_lvl = coords.reshape(batch * h1 * w1, 1, 1, 2) / 2**i
-            delta_lvl = delta.view(1, 2 * r + 1, 2 * r + 1, 2)
-            coords_lvl = centroid_lvl + delta_lvl
-
-            corr = bilinear_sampler(corr, coords_lvl)
-            corr = corr.view(batch, h1, w1, -1)
-            out_pyramid.append(corr)
-
-        out = torch.cat(out_pyramid, dim=-1)
-        return out.permute(0, 3, 1, 2).contiguous().float()
-
-    @staticmethod
-    def corr(fmap1, fmap2):
-        batch, dim, ht, wd = fmap1.shape
-        fmap1 = fmap1.view(batch, dim, ht * wd)
-        fmap2 = fmap2.view(batch, dim, ht * wd)
-
-        corr = torch.matmul(fmap1.transpose(1, 2), fmap2)
-        corr = corr.view(batch, ht, wd, 1, ht, wd)
-        return corr / torch.sqrt(torch.tensor(dim).float())
-
-
-class AlternateCorrBlock:
-    def __init__(self, fmap1, fmap2, num_levels=4, radius=4):
-        self.num_levels = num_levels
-        self.radius = radius
-
-        self.pyramid = [(fmap1, fmap2)]
-        for i in range(self.num_levels):
-            fmap1 = F.avg_pool2d(fmap1, 2, stride=2)
-            fmap2 = F.avg_pool2d(fmap2, 2, stride=2)
-            self.pyramid.append((fmap1, fmap2))
-
-    def __call__(self, coords):
-        coords = coords.permute(0, 2, 3, 1)
-        B, H, W, _ = coords.shape
-        dim = self.pyramid[0][0].shape[1]
-
-        corr_list = []
-        for i in range(self.num_levels):
-            r = self.radius
-            fmap1_i = self.pyramid[0][0].permute(0, 2, 3, 1).contiguous()
-            fmap2_i = self.pyramid[i][1].permute(0, 2, 3, 1).contiguous()
-
-            coords_i = (coords / 2**i).reshape(B, 1, H, W, 2).contiguous()
-            (corr,) = alt_cuda_corr.forward(fmap1_i, fmap2_i, coords_i, r)
-            corr_list.append(corr.squeeze(1))
-
-        corr = torch.stack(corr_list, dim=1)
-        corr = corr.reshape(B, -1, H, W)
-        return corr / torch.sqrt(torch.tensor(dim).float())
diff --git a/eval/vbench/third_party/RAFT/core/datasets.py b/eval/vbench/third_party/RAFT/core/datasets.py
deleted file mode 100644
index f9eb361c..00000000
--- a/eval/vbench/third_party/RAFT/core/datasets.py
+++ /dev/null
@@ -1,290 +0,0 @@
-# Data loading based on https://github.com/NVIDIA/flownet2-pytorch
-
-import math
-import os
-import os.path as osp
-import random
-from glob import glob
-
-import numpy as np
-import torch
-import torch.nn.functional as F
-import torch.utils.data as data
-from utils_core import frame_utils
-from utils_core.augmentor import FlowAugmentor, SparseFlowAugmentor
-
-
-class FlowDataset(data.Dataset):
-    def __init__(self, aug_params=None, sparse=False):
-        self.augmentor = None
-        self.sparse = sparse
-        if aug_params is not None:
-            if sparse:
-                self.augmentor = SparseFlowAugmentor(**aug_params)
-            else:
-                self.augmentor = FlowAugmentor(**aug_params)
-
-        self.is_test = False
-        self.init_seed = False
-        self.flow_list = []
-        self.image_list = []
-        self.extra_info = []
-
-    def __getitem__(self, index):
-
-        if self.is_test:
-            img1 = frame_utils.read_gen(self.image_list[index][0])
-            img2 = frame_utils.read_gen(self.image_list[index][1])
-            img1 = np.array(img1).astype(np.uint8)[..., :3]
-            img2 = np.array(img2).astype(np.uint8)[..., :3]
-            img1 = torch.from_numpy(img1).permute(2, 0, 1).float()
-            img2 = torch.from_numpy(img2).permute(2, 0, 1).float()
-            return img1, img2, self.extra_info[index]
-
-        if not self.init_seed:
-            worker_info = torch.utils.data.get_worker_info()
-            if worker_info is not None:
-                torch.manual_seed(worker_info.id)
-                np.random.seed(worker_info.id)
-                random.seed(worker_info.id)
-                self.init_seed = True
-
-        index = index % len(self.image_list)
-        valid = None
-        if self.sparse:
-            flow, valid = frame_utils.readFlowKITTI(self.flow_list[index])
-        else:
-            flow = frame_utils.read_gen(self.flow_list[index])
-
-        img1 = frame_utils.read_gen(self.image_list[index][0])
-        img2 = frame_utils.read_gen(self.image_list[index][1])
-
-        flow = np.array(flow).astype(np.float32)
-        img1 = np.array(img1).astype(np.uint8)
-        img2 = np.array(img2).astype(np.uint8)
-
-        # grayscale images
-        if len(img1.shape) == 2:
-            img1 = np.tile(img1[..., None], (1, 1, 3))
-            img2 = np.tile(img2[..., None], (1, 1, 3))
-        else:
-            img1 = img1[..., :3]
-            img2 = img2[..., :3]
-
-        if self.augmentor is not None:
-            if self.sparse:
-                img1, img2, flow, valid = self.augmentor(img1, img2, flow, valid)
-            else:
-                img1, img2, flow = self.augmentor(img1, img2, flow)
-
-        img1 = torch.from_numpy(img1).permute(2, 0, 1).float()
-        img2 = torch.from_numpy(img2).permute(2, 0, 1).float()
-        flow = torch.from_numpy(flow).permute(2, 0, 1).float()
-
-        if valid is not None:
-            valid = torch.from_numpy(valid)
-        else:
-            valid = (flow[0].abs() < 1000) & (flow[1].abs() < 1000)
-
-        return img1, img2, flow, valid.float()
-
-    def __rmul__(self, v):
-        self.flow_list = v * self.flow_list
-        self.image_list = v * self.image_list
-        return self
-
-    def __len__(self):
-        return len(self.image_list)
-
-
-class MpiSintel(FlowDataset):
-    def __init__(
-        self, aug_params=None, split="training", root="datasets/Sintel", dstype="clean"
-    ):
-        super(MpiSintel, self).__init__(aug_params)
-        flow_root = osp.join(root, split, "flow")
-        image_root = osp.join(root, split, dstype)
-
-        if split == "test":
-            self.is_test = True
-
-        for scene in os.listdir(image_root):
-            image_list = sorted(glob(osp.join(image_root, scene, "*.png")))
-            for i in range(len(image_list) - 1):
-                self.image_list += [[image_list[i], image_list[i + 1]]]
-                self.extra_info += [(scene, i)]  # scene and frame_id
-
-            if split != "test":
-                self.flow_list += sorted(glob(osp.join(flow_root, scene, "*.flo")))
-
-
-class FlyingChairs(FlowDataset):
-    def __init__(
-        self, aug_params=None, split="train", root="datasets/FlyingChairs_release/data"
-    ):
-        super(FlyingChairs, self).__init__(aug_params)
-
-        images = sorted(glob(osp.join(root, "*.ppm")))
-        flows = sorted(glob(osp.join(root, "*.flo")))
-        assert len(images) // 2 == len(flows)
-
-        split_list = np.loadtxt("chairs_split.txt", dtype=np.int32)
-        for i in range(len(flows)):
-            xid = split_list[i]
-            if (split == "training" and xid == 1) or (
-                split == "validation" and xid == 2
-            ):
-                self.flow_list += [flows[i]]
-                self.image_list += [[images[2 * i], images[2 * i + 1]]]
-
-
-class FlyingThings3D(FlowDataset):
-    def __init__(
-        self, aug_params=None, root="datasets/FlyingThings3D", dstype="frames_cleanpass"
-    ):
-        super(FlyingThings3D, self).__init__(aug_params)
-
-        for cam in ["left"]:
-            for direction in ["into_future", "into_past"]:
-                image_dirs = sorted(glob(osp.join(root, dstype, "TRAIN/*/*")))
-                image_dirs = sorted([osp.join(f, cam) for f in image_dirs])
-
-                flow_dirs = sorted(glob(osp.join(root, "optical_flow/TRAIN/*/*")))
-                flow_dirs = sorted([osp.join(f, direction, cam) for f in flow_dirs])
-
-                for idir, fdir in zip(image_dirs, flow_dirs):
-                    images = sorted(glob(osp.join(idir, "*.png")))
-                    flows = sorted(glob(osp.join(fdir, "*.pfm")))
-                    for i in range(len(flows) - 1):
-                        if direction == "into_future":
-                            self.image_list += [[images[i], images[i + 1]]]
-                            self.flow_list += [flows[i]]
-                        elif direction == "into_past":
-                            self.image_list += [[images[i + 1], images[i]]]
-                            self.flow_list += [flows[i + 1]]
-
-
-class KITTI(FlowDataset):
-    def __init__(self, aug_params=None, split="training", root="datasets/KITTI"):
-        super(KITTI, self).__init__(aug_params, sparse=True)
-        if split == "testing":
-            self.is_test = True
-
-        root = osp.join(root, split)
-        images1 = sorted(glob(osp.join(root, "image_2/*_10.png")))
-        images2 = sorted(glob(osp.join(root, "image_2/*_11.png")))
-
-        for img1, img2 in zip(images1, images2):
-            frame_id = img1.split("/")[-1]
-            self.extra_info += [[frame_id]]
-            self.image_list += [[img1, img2]]
-
-        if split == "training":
-            self.flow_list = sorted(glob(osp.join(root, "flow_occ/*_10.png")))
-
-
-class HD1K(FlowDataset):
-    def __init__(self, aug_params=None, root="datasets/HD1k"):
-        super(HD1K, self).__init__(aug_params, sparse=True)
-
-        seq_ix = 0
-        while 1:
-            flows = sorted(
-                glob(os.path.join(root, "hd1k_flow_gt", "flow_occ/%06d_*.png" % seq_ix))
-            )
-            images = sorted(
-                glob(os.path.join(root, "hd1k_input", "image_2/%06d_*.png" % seq_ix))
-            )
-
-            if len(flows) == 0:
-                break
-
-            for i in range(len(flows) - 1):
-                self.flow_list += [flows[i]]
-                self.image_list += [[images[i], images[i + 1]]]
-
-            seq_ix += 1
-
-
-def fetch_dataloader(args, TRAIN_DS="C+T+K+S+H"):
-    """Create the data loader for the corresponding trainign set"""
-
-    if args.stage == "chairs":
-        aug_params = {
-            "crop_size": args.image_size,
-            "min_scale": -0.1,
-            "max_scale": 1.0,
-            "do_flip": True,
-        }
-        train_dataset = FlyingChairs(aug_params, split="training")
-
-    elif args.stage == "things":
-        aug_params = {
-            "crop_size": args.image_size,
-            "min_scale": -0.4,
-            "max_scale": 0.8,
-            "do_flip": True,
-        }
-        clean_dataset = FlyingThings3D(aug_params, dstype="frames_cleanpass")
-        final_dataset = FlyingThings3D(aug_params, dstype="frames_finalpass")
-        train_dataset = clean_dataset + final_dataset
-
-    elif args.stage == "sintel":
-        aug_params = {
-            "crop_size": args.image_size,
-            "min_scale": -0.2,
-            "max_scale": 0.6,
-            "do_flip": True,
-        }
-        things = FlyingThings3D(aug_params, dstype="frames_cleanpass")
-        sintel_clean = MpiSintel(aug_params, split="training", dstype="clean")
-        sintel_final = MpiSintel(aug_params, split="training", dstype="final")
-
-        if TRAIN_DS == "C+T+K+S+H":
-            kitti = KITTI(
-                {
-                    "crop_size": args.image_size,
-                    "min_scale": -0.3,
-                    "max_scale": 0.5,
-                    "do_flip": True,
-                }
-            )
-            hd1k = HD1K(
-                {
-                    "crop_size": args.image_size,
-                    "min_scale": -0.5,
-                    "max_scale": 0.2,
-                    "do_flip": True,
-                }
-            )
-            train_dataset = (
-                100 * sintel_clean
-                + 100 * sintel_final
-                + 200 * kitti
-                + 5 * hd1k
-                + things
-            )
-
-        elif TRAIN_DS == "C+T+K/S":
-            train_dataset = 100 * sintel_clean + 100 * sintel_final + things
-
-    elif args.stage == "kitti":
-        aug_params = {
-            "crop_size": args.image_size,
-            "min_scale": -0.2,
-            "max_scale": 0.4,
-            "do_flip": False,
-        }
-        train_dataset = KITTI(aug_params, split="training")
-
-    train_loader = data.DataLoader(
-        train_dataset,
-        batch_size=args.batch_size,
-        pin_memory=False,
-        shuffle=True,
-        num_workers=4,
-        drop_last=True,
-    )
-
-    print("Training with %d image pairs" % len(train_dataset))
-    return train_loader
diff --git a/eval/vbench/third_party/RAFT/core/extractor.py b/eval/vbench/third_party/RAFT/core/extractor.py
deleted file mode 100644
index 4215b796..00000000
--- a/eval/vbench/third_party/RAFT/core/extractor.py
+++ /dev/null
@@ -1,269 +0,0 @@
-import torch
-import torch.nn as nn
-import torch.nn.functional as F
-
-
-class ResidualBlock(nn.Module):
-    def __init__(self, in_planes, planes, norm_fn="group", stride=1):
-        super(ResidualBlock, self).__init__()
-
-        self.conv1 = nn.Conv2d(
-            in_planes, planes, kernel_size=3, padding=1, stride=stride
-        )
-        self.conv2 = nn.Conv2d(planes, planes, kernel_size=3, padding=1)
-        self.relu = nn.ReLU(inplace=True)
-
-        num_groups = planes // 8
-
-        if norm_fn == "group":
-            self.norm1 = nn.GroupNorm(num_groups=num_groups, num_channels=planes)
-            self.norm2 = nn.GroupNorm(num_groups=num_groups, num_channels=planes)
-            if not stride == 1:
-                self.norm3 = nn.GroupNorm(num_groups=num_groups, num_channels=planes)
-
-        elif norm_fn == "batch":
-            self.norm1 = nn.BatchNorm2d(planes)
-            self.norm2 = nn.BatchNorm2d(planes)
-            if not stride == 1:
-                self.norm3 = nn.BatchNorm2d(planes)
-
-        elif norm_fn == "instance":
-            self.norm1 = nn.InstanceNorm2d(planes)
-            self.norm2 = nn.InstanceNorm2d(planes)
-            if not stride == 1:
-                self.norm3 = nn.InstanceNorm2d(planes)
-
-        elif norm_fn == "none":
-            self.norm1 = nn.Sequential()
-            self.norm2 = nn.Sequential()
-            if not stride == 1:
-                self.norm3 = nn.Sequential()
-
-        if stride == 1:
-            self.downsample = None
-
-        else:
-            self.downsample = nn.Sequential(
-                nn.Conv2d(in_planes, planes, kernel_size=1, stride=stride), self.norm3
-            )
-
-    def forward(self, x):
-        y = x
-        y = self.relu(self.norm1(self.conv1(y)))
-        y = self.relu(self.norm2(self.conv2(y)))
-
-        if self.downsample is not None:
-            x = self.downsample(x)
-
-        return self.relu(x + y)
-
-
-class BottleneckBlock(nn.Module):
-    def __init__(self, in_planes, planes, norm_fn="group", stride=1):
-        super(BottleneckBlock, self).__init__()
-
-        self.conv1 = nn.Conv2d(in_planes, planes // 4, kernel_size=1, padding=0)
-        self.conv2 = nn.Conv2d(
-            planes // 4, planes // 4, kernel_size=3, padding=1, stride=stride
-        )
-        self.conv3 = nn.Conv2d(planes // 4, planes, kernel_size=1, padding=0)
-        self.relu = nn.ReLU(inplace=True)
-
-        num_groups = planes // 8
-
-        if norm_fn == "group":
-            self.norm1 = nn.GroupNorm(num_groups=num_groups, num_channels=planes // 4)
-            self.norm2 = nn.GroupNorm(num_groups=num_groups, num_channels=planes // 4)
-            self.norm3 = nn.GroupNorm(num_groups=num_groups, num_channels=planes)
-            if not stride == 1:
-                self.norm4 = nn.GroupNorm(num_groups=num_groups, num_channels=planes)
-
-        elif norm_fn == "batch":
-            self.norm1 = nn.BatchNorm2d(planes // 4)
-            self.norm2 = nn.BatchNorm2d(planes // 4)
-            self.norm3 = nn.BatchNorm2d(planes)
-            if not stride == 1:
-                self.norm4 = nn.BatchNorm2d(planes)
-
-        elif norm_fn == "instance":
-            self.norm1 = nn.InstanceNorm2d(planes // 4)
-            self.norm2 = nn.InstanceNorm2d(planes // 4)
-            self.norm3 = nn.InstanceNorm2d(planes)
-            if not stride == 1:
-                self.norm4 = nn.InstanceNorm2d(planes)
-
-        elif norm_fn == "none":
-            self.norm1 = nn.Sequential()
-            self.norm2 = nn.Sequential()
-            self.norm3 = nn.Sequential()
-            if not stride == 1:
-                self.norm4 = nn.Sequential()
-
-        if stride == 1:
-            self.downsample = None
-
-        else:
-            self.downsample = nn.Sequential(
-                nn.Conv2d(in_planes, planes, kernel_size=1, stride=stride), self.norm4
-            )
-
-    def forward(self, x):
-        y = x
-        y = self.relu(self.norm1(self.conv1(y)))
-        y = self.relu(self.norm2(self.conv2(y)))
-        y = self.relu(self.norm3(self.conv3(y)))
-
-        if self.downsample is not None:
-            x = self.downsample(x)
-
-        return self.relu(x + y)
-
-
-class BasicEncoder(nn.Module):
-    def __init__(self, output_dim=128, norm_fn="batch", dropout=0.0):
-        super(BasicEncoder, self).__init__()
-        self.norm_fn = norm_fn
-
-        if self.norm_fn == "group":
-            self.norm1 = nn.GroupNorm(num_groups=8, num_channels=64)
-
-        elif self.norm_fn == "batch":
-            self.norm1 = nn.BatchNorm2d(64)
-
-        elif self.norm_fn == "instance":
-            self.norm1 = nn.InstanceNorm2d(64)
-
-        elif self.norm_fn == "none":
-            self.norm1 = nn.Sequential()
-
-        self.conv1 = nn.Conv2d(3, 64, kernel_size=7, stride=2, padding=3)
-        self.relu1 = nn.ReLU(inplace=True)
-
-        self.in_planes = 64
-        self.layer1 = self._make_layer(64, stride=1)
-        self.layer2 = self._make_layer(96, stride=2)
-        self.layer3 = self._make_layer(128, stride=2)
-
-        # output convolution
-        self.conv2 = nn.Conv2d(128, output_dim, kernel_size=1)
-
-        self.dropout = None
-        if dropout > 0:
-            self.dropout = nn.Dropout2d(p=dropout)
-
-        for m in self.modules():
-            if isinstance(m, nn.Conv2d):
-                nn.init.kaiming_normal_(m.weight, mode="fan_out", nonlinearity="relu")
-            elif isinstance(m, (nn.BatchNorm2d, nn.InstanceNorm2d, nn.GroupNorm)):
-                if m.weight is not None:
-                    nn.init.constant_(m.weight, 1)
-                if m.bias is not None:
-                    nn.init.constant_(m.bias, 0)
-
-    def _make_layer(self, dim, stride=1):
-        layer1 = ResidualBlock(self.in_planes, dim, self.norm_fn, stride=stride)
-        layer2 = ResidualBlock(dim, dim, self.norm_fn, stride=1)
-        layers = (layer1, layer2)
-
-        self.in_planes = dim
-        return nn.Sequential(*layers)
-
-    def forward(self, x):
-
-        # if input is list, combine batch dimension
-        is_list = isinstance(x, tuple) or isinstance(x, list)
-        if is_list:
-            batch_dim = x[0].shape[0]
-            x = torch.cat(x, dim=0)
-
-        x = self.conv1(x)
-        x = self.norm1(x)
-        x = self.relu1(x)
-
-        x = self.layer1(x)
-        x = self.layer2(x)
-        x = self.layer3(x)
-
-        x = self.conv2(x)
-
-        if self.training and self.dropout is not None:
-            x = self.dropout(x)
-
-        if is_list:
-            x = torch.split(x, [batch_dim, batch_dim], dim=0)
-
-        return x
-
-
-class SmallEncoder(nn.Module):
-    def __init__(self, output_dim=128, norm_fn="batch", dropout=0.0):
-        super(SmallEncoder, self).__init__()
-        self.norm_fn = norm_fn
-
-        if self.norm_fn == "group":
-            self.norm1 = nn.GroupNorm(num_groups=8, num_channels=32)
-
-        elif self.norm_fn == "batch":
-            self.norm1 = nn.BatchNorm2d(32)
-
-        elif self.norm_fn == "instance":
-            self.norm1 = nn.InstanceNorm2d(32)
-
-        elif self.norm_fn == "none":
-            self.norm1 = nn.Sequential()
-
-        self.conv1 = nn.Conv2d(3, 32, kernel_size=7, stride=2, padding=3)
-        self.relu1 = nn.ReLU(inplace=True)
-
-        self.in_planes = 32
-        self.layer1 = self._make_layer(32, stride=1)
-        self.layer2 = self._make_layer(64, stride=2)
-        self.layer3 = self._make_layer(96, stride=2)
-
-        self.dropout = None
-        if dropout > 0:
-            self.dropout = nn.Dropout2d(p=dropout)
-
-        self.conv2 = nn.Conv2d(96, output_dim, kernel_size=1)
-
-        for m in self.modules():
-            if isinstance(m, nn.Conv2d):
-                nn.init.kaiming_normal_(m.weight, mode="fan_out", nonlinearity="relu")
-            elif isinstance(m, (nn.BatchNorm2d, nn.InstanceNorm2d, nn.GroupNorm)):
-                if m.weight is not None:
-                    nn.init.constant_(m.weight, 1)
-                if m.bias is not None:
-                    nn.init.constant_(m.bias, 0)
-
-    def _make_layer(self, dim, stride=1):
-        layer1 = BottleneckBlock(self.in_planes, dim, self.norm_fn, stride=stride)
-        layer2 = BottleneckBlock(dim, dim, self.norm_fn, stride=1)
-        layers = (layer1, layer2)
-
-        self.in_planes = dim
-        return nn.Sequential(*layers)
-
-    def forward(self, x):
-
-        # if input is list, combine batch dimension
-        is_list = isinstance(x, tuple) or isinstance(x, list)
-        if is_list:
-            batch_dim = x[0].shape[0]
-            x = torch.cat(x, dim=0)
-
-        x = self.conv1(x)
-        x = self.norm1(x)
-        x = self.relu1(x)
-
-        x = self.layer1(x)
-        x = self.layer2(x)
-        x = self.layer3(x)
-        x = self.conv2(x)
-
-        if self.training and self.dropout is not None:
-            x = self.dropout(x)
-
-        if is_list:
-            x = torch.split(x, [batch_dim, batch_dim], dim=0)
-
-        return x
diff --git a/eval/vbench/third_party/RAFT/core/raft.py b/eval/vbench/third_party/RAFT/core/raft.py
deleted file mode 100644
index a8a6aeb1..00000000
--- a/eval/vbench/third_party/RAFT/core/raft.py
+++ /dev/null
@@ -1,155 +0,0 @@
-import numpy as np
-import torch
-import torch.nn as nn
-import torch.nn.functional as F
-
-from .corr import AlternateCorrBlock, CorrBlock
-from .extractor import BasicEncoder, SmallEncoder
-from .update import BasicUpdateBlock, SmallUpdateBlock
-from .utils_core.utils import bilinear_sampler, coords_grid, upflow8
-
-try:
-    autocast = torch.cuda.amp.autocast
-except:
-    # dummy autocast for PyTorch < 1.6
-    class autocast:
-        def __init__(self, enabled):
-            pass
-
-        def __enter__(self):
-            pass
-
-        def __exit__(self, *args):
-            pass
-
-
-class RAFT(nn.Module):
-    def __init__(self, args):
-        super(RAFT, self).__init__()
-        self.args = args
-
-        if args.small:
-            self.hidden_dim = hdim = 96
-            self.context_dim = cdim = 64
-            args.corr_levels = 4
-            args.corr_radius = 3
-
-        else:
-            self.hidden_dim = hdim = 128
-            self.context_dim = cdim = 128
-            args.corr_levels = 4
-            args.corr_radius = 4
-
-        if "dropout" not in self.args:
-            self.args.dropout = 0
-
-        if "alternate_corr" not in self.args:
-            self.args.alternate_corr = False
-
-        # feature network, context network, and update block
-        if args.small:
-            self.fnet = SmallEncoder(
-                output_dim=128, norm_fn="instance", dropout=args.dropout
-            )
-            self.cnet = SmallEncoder(
-                output_dim=hdim + cdim, norm_fn="none", dropout=args.dropout
-            )
-            self.update_block = SmallUpdateBlock(self.args, hidden_dim=hdim)
-
-        else:
-            self.fnet = BasicEncoder(
-                output_dim=256, norm_fn="instance", dropout=args.dropout
-            )
-            self.cnet = BasicEncoder(
-                output_dim=hdim + cdim, norm_fn="batch", dropout=args.dropout
-            )
-            self.update_block = BasicUpdateBlock(self.args, hidden_dim=hdim)
-
-    def freeze_bn(self):
-        for m in self.modules():
-            if isinstance(m, nn.BatchNorm2d):
-                m.eval()
-
-    def initialize_flow(self, img):
-        """Flow is represented as difference between two coordinate grids flow = coords1 - coords0"""
-        N, C, H, W = img.shape
-        coords0 = coords_grid(N, H // 8, W // 8, device=img.device)
-        coords1 = coords_grid(N, H // 8, W // 8, device=img.device)
-
-        # optical flow computed as difference: flow = coords1 - coords0
-        return coords0, coords1
-
-    def upsample_flow(self, flow, mask):
-        """Upsample flow field [H/8, W/8, 2] -> [H, W, 2] using convex combination"""
-        N, _, H, W = flow.shape
-        mask = mask.view(N, 1, 9, 8, 8, H, W)
-        mask = torch.softmax(mask, dim=2)
-
-        up_flow = F.unfold(8 * flow, [3, 3], padding=1)
-        up_flow = up_flow.view(N, 2, 9, 1, 1, H, W)
-
-        up_flow = torch.sum(mask * up_flow, dim=2)
-        up_flow = up_flow.permute(0, 1, 4, 2, 5, 3)
-        return up_flow.reshape(N, 2, 8 * H, 8 * W)
-
-    def forward(
-        self, image1, image2, iters=12, flow_init=None, upsample=True, test_mode=False
-    ):
-        """Estimate optical flow between pair of frames"""
-
-        image1 = 2 * (image1 / 255.0) - 1.0
-        image2 = 2 * (image2 / 255.0) - 1.0
-
-        image1 = image1.contiguous()
-        image2 = image2.contiguous()
-
-        hdim = self.hidden_dim
-        cdim = self.context_dim
-
-        # run the feature network
-        with autocast(enabled=self.args.mixed_precision):
-            fmap1, fmap2 = self.fnet([image1, image2])
-
-        fmap1 = fmap1.float()
-        fmap2 = fmap2.float()
-        if self.args.alternate_corr:
-            corr_fn = AlternateCorrBlock(fmap1, fmap2, radius=self.args.corr_radius)
-        else:
-            corr_fn = CorrBlock(fmap1, fmap2, radius=self.args.corr_radius)
-
-        # run the context network
-        with autocast(enabled=self.args.mixed_precision):
-            cnet = self.cnet(image1)
-            net, inp = torch.split(cnet, [hdim, cdim], dim=1)
-            net = torch.tanh(net)
-            inp = torch.relu(inp)
-
-        coords0, coords1 = self.initialize_flow(image1)
-
-        if flow_init is not None:
-            coords1 = coords1 + flow_init
-
-        flow_predictions = []
-        for itr in range(iters):
-            coords1 = coords1.detach()
-            corr = corr_fn(coords1)  # index correlation volume
-
-            flow = coords1 - coords0
-            with autocast(enabled=self.args.mixed_precision):
-                net, up_mask, delta_flow = self.update_block(net, inp, corr, flow)
-
-            # F(t+1) = F(t) + \Delta(t)
-            coords1 = coords1 + delta_flow
-
-            # upsample predictions
-            if up_mask is None:
-                flow_up = upflow8(coords1 - coords0)
-            else:
-                flow_up = self.upsample_flow(coords1 - coords0, up_mask)
-
-            flow_predictions.append(flow_up)
-
-        if test_mode:
-            return coords1 - coords0, flow_up
-
-        return flow_predictions
diff --git a/eval/vbench/third_party/RAFT/core/update.py b/eval/vbench/third_party/RAFT/core/update.py
deleted file mode 100644
index ced6df06..00000000
--- a/eval/vbench/third_party/RAFT/core/update.py
+++ /dev/null
@@ -1,154 +0,0 @@
-import torch
-import torch.nn as nn
-import torch.nn.functional as F
-
-
-class FlowHead(nn.Module):
-    def __init__(self, input_dim=128, hidden_dim=256):
-        super(FlowHead, self).__init__()
-        self.conv1 = nn.Conv2d(input_dim, hidden_dim, 3, padding=1)
-        self.conv2 = nn.Conv2d(hidden_dim, 2, 3, padding=1)
-        self.relu = nn.ReLU(inplace=True)
-
-    def forward(self, x):
-        return self.conv2(self.relu(self.conv1(x)))
-
-
-class ConvGRU(nn.Module):
-    def __init__(self, hidden_dim=128, input_dim=192 + 128):
-        super(ConvGRU, self).__init__()
-        self.convz = nn.Conv2d(hidden_dim + input_dim, hidden_dim, 3, padding=1)
-        self.convr = nn.Conv2d(hidden_dim + input_dim, hidden_dim, 3, padding=1)
-        self.convq = nn.Conv2d(hidden_dim + input_dim, hidden_dim, 3, padding=1)
-
-    def forward(self, h, x):
-        hx = torch.cat([h, x], dim=1)
-
-        z = torch.sigmoid(self.convz(hx))
-        r = torch.sigmoid(self.convr(hx))
-        q = torch.tanh(self.convq(torch.cat([r * h, x], dim=1)))
-
-        h = (1 - z) * h + z * q
-        return h
-
-
-class SepConvGRU(nn.Module):
-    def __init__(self, hidden_dim=128, input_dim=192 + 128):
-        super(SepConvGRU, self).__init__()
-        self.convz1 = nn.Conv2d(
-            hidden_dim + input_dim, hidden_dim, (1, 5), padding=(0, 2)
-        )
-        self.convr1 = nn.Conv2d(
-            hidden_dim + input_dim, hidden_dim, (1, 5), padding=(0, 2)
-        )
-        self.convq1 = nn.Conv2d(
-            hidden_dim + input_dim, hidden_dim, (1, 5), padding=(0, 2)
-        )
-
-        self.convz2 = nn.Conv2d(
-            hidden_dim + input_dim, hidden_dim, (5, 1), padding=(2, 0)
-        )
-        self.convr2 = nn.Conv2d(
-            hidden_dim + input_dim, hidden_dim, (5, 1), padding=(2, 0)
-        )
-        self.convq2 = nn.Conv2d(
-            hidden_dim + input_dim, hidden_dim, (5, 1), padding=(2, 0)
-        )
-
-    def forward(self, h, x):
-        # horizontal
-        hx = torch.cat([h, x], dim=1)
-        z = torch.sigmoid(self.convz1(hx))
-        r = torch.sigmoid(self.convr1(hx))
-        q = torch.tanh(self.convq1(torch.cat([r * h, x], dim=1)))
-        h = (1 - z) * h + z * q
-
-        # vertical
-        hx = torch.cat([h, x], dim=1)
-        z = torch.sigmoid(self.convz2(hx))
-        r = torch.sigmoid(self.convr2(hx))
-        q = torch.tanh(self.convq2(torch.cat([r * h, x], dim=1)))
-        h = (1 - z) * h + z * q
-
-        return h
-
-
-class SmallMotionEncoder(nn.Module):
-    def __init__(self, args):
-        super(SmallMotionEncoder, self).__init__()
-        cor_planes = args.corr_levels * (2 * args.corr_radius + 1) ** 2
-        self.convc1 = nn.Conv2d(cor_planes, 96, 1, padding=0)
-        self.convf1 = nn.Conv2d(2, 64, 7, padding=3)
-        self.convf2 = nn.Conv2d(64, 32, 3, padding=1)
-        self.conv = nn.Conv2d(128, 80, 3, padding=1)
-
-    def forward(self, flow, corr):
-        cor = F.relu(self.convc1(corr))
-        flo = F.relu(self.convf1(flow))
-        flo = F.relu(self.convf2(flo))
-        cor_flo = torch.cat([cor, flo], dim=1)
-        out = F.relu(self.conv(cor_flo))
-        return torch.cat([out, flow], dim=1)
-
-
-class BasicMotionEncoder(nn.Module):
-    def __init__(self, args):
-        super(BasicMotionEncoder, self).__init__()
-        cor_planes = args.corr_levels * (2 * args.corr_radius + 1) ** 2
-        self.convc1 = nn.Conv2d(cor_planes, 256, 1, padding=0)
-        self.convc2 = nn.Conv2d(256, 192, 3, padding=1)
-        self.convf1 = nn.Conv2d(2, 128, 7, padding=3)
-        self.convf2 = nn.Conv2d(128, 64, 3, padding=1)
-        self.conv = nn.Conv2d(64 + 192, 128 - 2, 3, padding=1)
-
-    def forward(self, flow, corr):
-        cor = F.relu(self.convc1(corr))
-        cor = F.relu(self.convc2(cor))
-        flo = F.relu(self.convf1(flow))
-        flo = F.relu(self.convf2(flo))
-
-        cor_flo = torch.cat([cor, flo], dim=1)
-        out = F.relu(self.conv(cor_flo))
-        return torch.cat([out, flow], dim=1)
-
-
-class SmallUpdateBlock(nn.Module):
-    def __init__(self, args, hidden_dim=96):
-        super(SmallUpdateBlock, self).__init__()
-        self.encoder = SmallMotionEncoder(args)
-        self.gru = ConvGRU(hidden_dim=hidden_dim, input_dim=82 + 64)
-        self.flow_head = FlowHead(hidden_dim, hidden_dim=128)
-
-    def forward(self, net, inp, corr, flow):
-        motion_features = self.encoder(flow, corr)
-        inp = torch.cat([inp, motion_features], dim=1)
-        net = self.gru(net, inp)
-        delta_flow = self.flow_head(net)
-
-        return net, None, delta_flow
-
-
-class BasicUpdateBlock(nn.Module):
-    def __init__(self, args, hidden_dim=128, input_dim=128):
-        super(BasicUpdateBlock, self).__init__()
-        self.args = args
-        self.encoder = BasicMotionEncoder(args)
-        self.gru = SepConvGRU(hidden_dim=hidden_dim, input_dim=128 + hidden_dim)
-        self.flow_head = FlowHead(hidden_dim, hidden_dim=256)
-
-        self.mask = nn.Sequential(
-            nn.Conv2d(128, 256, 3, padding=1),
-            nn.ReLU(inplace=True),
-            nn.Conv2d(256, 64 * 9, 1, padding=0),
-        )
-
-    def forward(self, net, inp, corr, flow, upsample=True):
-        motion_features = self.encoder(flow, corr)
-        inp = torch.cat([inp, motion_features], dim=1)
-
-        net = self.gru(net, inp)
-        delta_flow = self.flow_head(net)
-
-        # scale mask to balence gradients
-        mask = 0.25 * self.mask(net)
-        return net, mask, delta_flow
diff --git a/eval/vbench/third_party/RAFT/core/utils_core/__init__.py b/eval/vbench/third_party/RAFT/core/utils_core/__init__.py
deleted file mode 100644
index e69de29b..00000000
diff --git a/eval/vbench/third_party/RAFT/core/utils_core/augmentor.py b/eval/vbench/third_party/RAFT/core/utils_core/augmentor.py
deleted file mode 100644
index 4cb60e69..00000000
--- a/eval/vbench/third_party/RAFT/core/utils_core/augmentor.py
+++ /dev/null
@@ -1,267 +0,0 @@
-import math
-import random
-
-import cv2
-import numpy as np
-from PIL import Image
-
-cv2.setNumThreads(0)
-cv2.ocl.setUseOpenCL(False)
-
-import torch
-import torch.nn.functional as F
-from torchvision.transforms import ColorJitter
-
-
-class FlowAugmentor:
-    def __init__(self, crop_size, min_scale=-0.2, max_scale=0.5, do_flip=True):
-
-        # spatial augmentation params
-        self.crop_size = crop_size
-        self.min_scale = min_scale
-        self.max_scale = max_scale
-        self.spatial_aug_prob = 0.8
-        self.stretch_prob = 0.8
-        self.max_stretch = 0.2
-
-        # flip augmentation params
-        self.do_flip = do_flip
-        self.h_flip_prob = 0.5
-        self.v_flip_prob = 0.1
-
-        # photometric augmentation params
-        self.photo_aug = ColorJitter(
-            brightness=0.4, contrast=0.4, saturation=0.4, hue=0.5 / 3.14
-        )
-        self.asymmetric_color_aug_prob = 0.2
-        self.eraser_aug_prob = 0.5
-
-    def color_transform(self, img1, img2):
-        """Photometric augmentation"""
-
-        # asymmetric
-        if np.random.rand() < self.asymmetric_color_aug_prob:
-            img1 = np.array(self.photo_aug(Image.fromarray(img1)), dtype=np.uint8)
-            img2 = np.array(self.photo_aug(Image.fromarray(img2)), dtype=np.uint8)
-
-        # symmetric
-        else:
-            image_stack = np.concatenate([img1, img2], axis=0)
-            image_stack = np.array(
-                self.photo_aug(Image.fromarray(image_stack)), dtype=np.uint8
-            )
-            img1, img2 = np.split(image_stack, 2, axis=0)
-
-        return img1, img2
-
-    def eraser_transform(self, img1, img2, bounds=[50, 100]):
-        """Occlusion augmentation"""
-
-        ht, wd = img1.shape[:2]
-        if np.random.rand() < self.eraser_aug_prob:
-            mean_color = np.mean(img2.reshape(-1, 3), axis=0)
-            for _ in range(np.random.randint(1, 3)):
-                x0 = np.random.randint(0, wd)
-                y0 = np.random.randint(0, ht)
-                dx = np.random.randint(bounds[0], bounds[1])
-                dy = np.random.randint(bounds[0], bounds[1])
-                img2[y0 : y0 + dy, x0 : x0 + dx, :] = mean_color
-
-        return img1, img2
-
-    def spatial_transform(self, img1, img2, flow):
-        # randomly sample scale
-        ht, wd = img1.shape[:2]
-        min_scale = np.maximum(
-            (self.crop_size[0] + 8) / float(ht), (self.crop_size[1] + 8) / float(wd)
-        )
-
-        scale = 2 ** np.random.uniform(self.min_scale, self.max_scale)
-        scale_x = scale
-        scale_y = scale
-        if np.random.rand() < self.stretch_prob:
-            scale_x *= 2 ** np.random.uniform(-self.max_stretch, self.max_stretch)
-            scale_y *= 2 ** np.random.uniform(-self.max_stretch, self.max_stretch)
-
-        scale_x = np.clip(scale_x, min_scale, None)
-        scale_y = np.clip(scale_y, min_scale, None)
-
-        if np.random.rand() < self.spatial_aug_prob:
-            # rescale the images
-            img1 = cv2.resize(
-                img1, None, fx=scale_x, fy=scale_y, interpolation=cv2.INTER_LINEAR
-            )
-            img2 = cv2.resize(
-                img2, None, fx=scale_x, fy=scale_y, interpolation=cv2.INTER_LINEAR
-            )
-            flow = cv2.resize(
-                flow, None, fx=scale_x, fy=scale_y, interpolation=cv2.INTER_LINEAR
-            )
-            flow = flow * [scale_x, scale_y]
-
-        if self.do_flip:
-            if np.random.rand() < self.h_flip_prob:  # h-flip
-                img1 = img1[:, ::-1]
-                img2 = img2[:, ::-1]
-                flow = flow[:, ::-1] * [-1.0, 1.0]
-
-            if np.random.rand() < self.v_flip_prob:  # v-flip
-                img1 = img1[::-1, :]
-                img2 = img2[::-1, :]
-                flow = flow[::-1, :] * [1.0, -1.0]
-
-        y0 = np.random.randint(0, img1.shape[0] - self.crop_size[0])
-        x0 = np.random.randint(0, img1.shape[1] - self.crop_size[1])
-
-        img1 = img1[y0 : y0 + self.crop_size[0], x0 : x0 + self.crop_size[1]]
-        img2 = img2[y0 : y0 + self.crop_size[0], x0 : x0 + self.crop_size[1]]
-        flow = flow[y0 : y0 + self.crop_size[0], x0 : x0 + self.crop_size[1]]
-
-        return img1, img2, flow
-
-    def __call__(self, img1, img2, flow):
-        img1, img2 = self.color_transform(img1, img2)
-        img1, img2 = self.eraser_transform(img1, img2)
-        img1, img2, flow = self.spatial_transform(img1, img2, flow)
-
-        img1 = np.ascontiguousarray(img1)
-        img2 = np.ascontiguousarray(img2)
-        flow = np.ascontiguousarray(flow)
-
-        return img1, img2, flow
-
-
-class SparseFlowAugmentor:
-    def __init__(self, crop_size, min_scale=-0.2, max_scale=0.5, do_flip=False):
-        # spatial augmentation params
-        self.crop_size = crop_size
-        self.min_scale = min_scale
-        self.max_scale = max_scale
-        self.spatial_aug_prob = 0.8
-        self.stretch_prob = 0.8
-        self.max_stretch = 0.2
-
-        # flip augmentation params
-        self.do_flip = do_flip
-        self.h_flip_prob = 0.5
-        self.v_flip_prob = 0.1
-
-        # photometric augmentation params
-        self.photo_aug = ColorJitter(
-            brightness=0.3, contrast=0.3, saturation=0.3, hue=0.3 / 3.14
-        )
-        self.asymmetric_color_aug_prob = 0.2
-        self.eraser_aug_prob = 0.5
-
-    def color_transform(self, img1, img2):
-        image_stack = np.concatenate([img1, img2], axis=0)
-        image_stack = np.array(
-            self.photo_aug(Image.fromarray(image_stack)), dtype=np.uint8
-        )
-        img1, img2 = np.split(image_stack, 2, axis=0)
-        return img1, img2
-
-    def eraser_transform(self, img1, img2):
-        ht, wd = img1.shape[:2]
-        if np.random.rand() < self.eraser_aug_prob:
-            mean_color = np.mean(img2.reshape(-1, 3), axis=0)
-            for _ in range(np.random.randint(1, 3)):
-                x0 = np.random.randint(0, wd)
-                y0 = np.random.randint(0, ht)
-                dx = np.random.randint(50, 100)
-                dy = np.random.randint(50, 100)
-                img2[y0 : y0 + dy, x0 : x0 + dx, :] = mean_color
-
-        return img1, img2
-
-    def resize_sparse_flow_map(self, flow, valid, fx=1.0, fy=1.0):
-        ht, wd = flow.shape[:2]
-        coords = np.meshgrid(np.arange(wd), np.arange(ht))
-        coords = np.stack(coords, axis=-1)
-
-        coords = coords.reshape(-1, 2).astype(np.float32)
-        flow = flow.reshape(-1, 2).astype(np.float32)
-        valid = valid.reshape(-1).astype(np.float32)
-
-        coords0 = coords[valid >= 1]
-        flow0 = flow[valid >= 1]
-
-        ht1 = int(round(ht * fy))
-        wd1 = int(round(wd * fx))
-
-        coords1 = coords0 * [fx, fy]
-        flow1 = flow0 * [fx, fy]
-
-        xx = np.round(coords1[:, 0]).astype(np.int32)
-        yy = np.round(coords1[:, 1]).astype(np.int32)
-
-        v = (xx > 0) & (xx < wd1) & (yy > 0) & (yy < ht1)
-        xx = xx[v]
-        yy = yy[v]
-        flow1 = flow1[v]
-
-        flow_img = np.zeros([ht1, wd1, 2], dtype=np.float32)
-        valid_img = np.zeros([ht1, wd1], dtype=np.int32)
-
-        flow_img[yy, xx] = flow1
-        valid_img[yy, xx] = 1
-
-        return flow_img, valid_img
-
-    def spatial_transform(self, img1, img2, flow, valid):
-        # randomly sample scale
-
-        ht, wd = img1.shape[:2]
-        min_scale = np.maximum(
-            (self.crop_size[0] + 1) / float(ht), (self.crop_size[1] + 1) / float(wd)
-        )
-
-        scale = 2 ** np.random.uniform(self.min_scale, self.max_scale)
-        scale_x = np.clip(scale, min_scale, None)
-        scale_y = np.clip(scale, min_scale, None)
-
-        if np.random.rand() < self.spatial_aug_prob:
-            # rescale the images
-            img1 = cv2.resize(
-                img1, None, fx=scale_x, fy=scale_y, interpolation=cv2.INTER_LINEAR
-            )
-            img2 = cv2.resize(
-                img2, None, fx=scale_x, fy=scale_y, interpolation=cv2.INTER_LINEAR
-            )
-            flow, valid = self.resize_sparse_flow_map(
-                flow, valid, fx=scale_x, fy=scale_y
-            )
-
-        if self.do_flip:
-            if np.random.rand() < 0.5:  # h-flip
-                img1 = img1[:, ::-1]
-                img2 = img2[:, ::-1]
-                flow = flow[:, ::-1] * [-1.0, 1.0]
-                valid = valid[:, ::-1]
-
-        margin_y = 20
-        margin_x = 50
-
-        y0 = np.random.randint(0, img1.shape[0] - self.crop_size[0] + margin_y)
-        x0 = np.random.randint(-margin_x, img1.shape[1] - self.crop_size[1] + margin_x)
-
-        y0 = np.clip(y0, 0, img1.shape[0] - self.crop_size[0])
-        x0 = np.clip(x0, 0, img1.shape[1] - self.crop_size[1])
-
-        img1 = img1[y0 : y0 + self.crop_size[0], x0 : x0 + self.crop_size[1]]
-        img2 = img2[y0 : y0 + self.crop_size[0], x0 : x0 + self.crop_size[1]]
-        flow = flow[y0 : y0 + self.crop_size[0], x0 : x0 + self.crop_size[1]]
-        valid = valid[y0 : y0 + self.crop_size[0], x0 : x0 + self.crop_size[1]]
-        return img1, img2, flow, valid
-
-    def __call__(self, img1, img2, flow, valid):
-        img1, img2 = self.color_transform(img1, img2)
-        img1, img2 = self.eraser_transform(img1, img2)
-        img1, img2, flow, valid = self.spatial_transform(img1, img2, flow, valid)
-
-        img1 = np.ascontiguousarray(img1)
-        img2 = np.ascontiguousarray(img2)
-        flow = np.ascontiguousarray(flow)
-        valid = np.ascontiguousarray(valid)
-
-        return img1, img2, flow, valid
diff --git a/eval/vbench/third_party/RAFT/core/utils_core/flow_viz.py b/eval/vbench/third_party/RAFT/core/utils_core/flow_viz.py
deleted file mode 100644
index fec08363..00000000
--- a/eval/vbench/third_party/RAFT/core/utils_core/flow_viz.py
+++ /dev/null
@@ -1,133 +0,0 @@
-# Flow visualization code used from https://github.com/tomrunia/OpticalFlow_Visualization
-
-
-# MIT License
-#
-# Copyright (c) 2018 Tom Runia
-#
-# Permission is hereby granted, free of charge, to any person obtaining a copy
-# of this software and associated documentation files (the "Software"), to deal
-# in the Software without restriction, including without limitation the rights
-# to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
-# copies of the Software, and to permit persons to whom the Software is
-# furnished to do so, subject to conditions.
-#
-# Author: Tom Runia
-# Date Created: 2018-08-03
-
-import numpy as np
-
-
-def make_colorwheel():
-    """
-    Generates a color wheel for optical flow visualization as presented in:
-        Baker et al. "A Database and Evaluation Methodology for Optical Flow" (ICCV, 2007)
-        URL: http://vision.middlebury.edu/flow/flowEval-iccv07.pdf
-
-    Code follows the original C++ source code of Daniel Scharstein.
-    Code follows the the Matlab source code of Deqing Sun.
-
-    Returns:
-        np.ndarray: Color wheel
-    """
-
-    RY = 15
-    YG = 6
-    GC = 4
-    CB = 11
-    BM = 13
-    MR = 6
-
-    ncols = RY + YG + GC + CB + BM + MR
-    colorwheel = np.zeros((ncols, 3))
-    col = 0
-
-    # RY
-    colorwheel[0:RY, 0] = 255
-    colorwheel[0:RY, 1] = np.floor(255 * np.arange(0, RY) / RY)
-    col = col + RY
-    # YG
-    colorwheel[col : col + YG, 0] = 255 - np.floor(255 * np.arange(0, YG) / YG)
-    colorwheel[col : col + YG, 1] = 255
-    col = col + YG
-    # GC
-    colorwheel[col : col + GC, 1] = 255
-    colorwheel[col : col + GC, 2] = np.floor(255 * np.arange(0, GC) / GC)
-    col = col + GC
-    # CB
-    colorwheel[col : col + CB, 1] = 255 - np.floor(255 * np.arange(CB) / CB)
-    colorwheel[col : col + CB, 2] = 255
-    col = col + CB
-    # BM
-    colorwheel[col : col + BM, 2] = 255
-    colorwheel[col : col + BM, 0] = np.floor(255 * np.arange(0, BM) / BM)
-    col = col + BM
-    # MR
-    colorwheel[col : col + MR, 2] = 255 - np.floor(255 * np.arange(MR) / MR)
-    colorwheel[col : col + MR, 0] = 255
-    return colorwheel
-
-
-def flow_uv_to_colors(u, v, convert_to_bgr=False):
-    """
-    Applies the flow color wheel to (possibly clipped) flow components u and v.
-
-    According to the C++ source code of Daniel Scharstein
-    According to the Matlab source code of Deqing Sun
-
-    Args:
-        u (np.ndarray): Input horizontal flow of shape [H,W]
-        v (np.ndarray): Input vertical flow of shape [H,W]
-        convert_to_bgr (bool, optional): Convert output image to BGR. Defaults to False.
-
-    Returns:
-        np.ndarray: Flow visualization image of shape [H,W,3]
-    """
-    flow_image = np.zeros((u.shape[0], u.shape[1], 3), np.uint8)
-    colorwheel = make_colorwheel()  # shape [55x3]
-    ncols = colorwheel.shape[0]
-    rad = np.sqrt(np.square(u) + np.square(v))
-    a = np.arctan2(-v, -u) / np.pi
-    fk = (a + 1) / 2 * (ncols - 1)
-    k0 = np.floor(fk).astype(np.int32)
-    k1 = k0 + 1
-    k1[k1 == ncols] = 0
-    f = fk - k0
-    for i in range(colorwheel.shape[1]):
-        tmp = colorwheel[:, i]
-        col0 = tmp[k0] / 255.0
-        col1 = tmp[k1] / 255.0
-        col = (1 - f) * col0 + f * col1
-        idx = rad <= 1
-        col[idx] = 1 - rad[idx] * (1 - col[idx])
-        col[~idx] = col[~idx] * 0.75  # out of range
-        # Note the 2-i => BGR instead of RGB
-        ch_idx = 2 - i if convert_to_bgr else i
-        flow_image[:, :, ch_idx] = np.floor(255 * col)
-    return flow_image
-
-
-def flow_to_image(flow_uv, clip_flow=None, convert_to_bgr=False):
-    """
-    Expects a two dimensional flow image of shape.
-
-    Args:
-        flow_uv (np.ndarray): Flow UV image of shape [H,W,2]
-        clip_flow (float, optional): Clip maximum of flow values. Defaults to None.
-        convert_to_bgr (bool, optional): Convert output image to BGR. Defaults to False.
-
-    Returns:
-        np.ndarray: Flow visualization image of shape [H,W,3]
-    """
-    assert flow_uv.ndim == 3, "input flow must have three dimensions"
-    assert flow_uv.shape[2] == 2, "input flow must have shape [H,W,2]"
-    if clip_flow is not None:
-        flow_uv = np.clip(flow_uv, 0, clip_flow)
-    u = flow_uv[:, :, 0]
-    v = flow_uv[:, :, 1]
-    rad = np.sqrt(np.square(u) + np.square(v))
-    rad_max = np.max(rad)
-    epsilon = 1e-5
-    u = u / (rad_max + epsilon)
-    v = v / (rad_max + epsilon)
-    return flow_uv_to_colors(u, v, convert_to_bgr)
diff --git a/eval/vbench/third_party/RAFT/core/utils_core/frame_utils.py b/eval/vbench/third_party/RAFT/core/utils_core/frame_utils.py
deleted file mode 100644
index 17ba4006..00000000
--- a/eval/vbench/third_party/RAFT/core/utils_core/frame_utils.py
+++ /dev/null
@@ -1,142 +0,0 @@
-import re
-from os.path import *
-
-import cv2
-import numpy as np
-from PIL import Image
-
-cv2.setNumThreads(0)
-cv2.ocl.setUseOpenCL(False)
-
-TAG_CHAR = np.array([202021.25], np.float32)
-
-
-def readFlow(fn):
-    """Read .flo file in Middlebury format"""
-    # Code adapted from:
-    # http://stackoverflow.com/questions/28013200/reading-middlebury-flow-files-with-python-bytes-array-numpy
-
-    # WARNING: this will work on little-endian architectures (eg Intel x86) only!
-    # print 'fn = %s'%(fn)
-    with open(fn, "rb") as f:
-        magic = np.fromfile(f, np.float32, count=1)
-        if 202021.25 != magic:
-            print("Magic number incorrect. Invalid .flo file")
-            return None
-        else:
-            w = np.fromfile(f, np.int32, count=1)
-            h = np.fromfile(f, np.int32, count=1)
-            # print 'Reading %d x %d flo file\n' % (w, h)
-            data = np.fromfile(f, np.float32, count=2 * int(w) * int(h))
-            # Reshape data into 3D array (columns, rows, bands)
-            # The reshape here is for visualization, the original code is (w,h,2)
-            return np.resize(data, (int(h), int(w), 2))
-
-
-def readPFM(file):
-    file = open(file, "rb")
-
-    color = None
-    width = None
-    height = None
-    scale = None
-    endian = None
-
-    header = file.readline().rstrip()
-    if header == b"PF":
-        color = True
-    elif header == b"Pf":
-        color = False
-    else:
-        raise Exception("Not a PFM file.")
-
-    dim_match = re.match(rb"^(\d+)\s(\d+)\s$", file.readline())
-    if dim_match:
-        width, height = map(int, dim_match.groups())
-    else:
-        raise Exception("Malformed PFM header.")
-
-    scale = float(file.readline().rstrip())
-    if scale < 0:  # little-endian
-        endian = "<"
-        scale = -scale
-    else:
-        endian = ">"  # big-endian
-
-    data = np.fromfile(file, endian + "f")
-    shape = (height, width, 3) if color else (height, width)
-
-    data = np.reshape(data, shape)
-    data = np.flipud(data)
-    return data
-
-
-def writeFlow(filename, uv, v=None):
-    """Write optical flow to file.
-
-    If v is None, uv is assumed to contain both u and v channels,
-    stacked in depth.
-    Original code by Deqing Sun, adapted from Daniel Scharstein.
-    """
-    nBands = 2
-
-    if v is None:
-        assert uv.ndim == 3
-        assert uv.shape[2] == 2
-        u = uv[:, :, 0]
-        v = uv[:, :, 1]
-    else:
-        u = uv
-
-    assert u.shape == v.shape
-    height, width = u.shape
-    f = open(filename, "wb")
-    # write the header
-    f.write(TAG_CHAR)
-    np.array(width).astype(np.int32).tofile(f)
-    np.array(height).astype(np.int32).tofile(f)
-    # arrange into matrix form
-    tmp = np.zeros((height, width * nBands))
-    tmp[:, np.arange(width) * 2] = u
-    tmp[:, np.arange(width) * 2 + 1] = v
-    tmp.astype(np.float32).tofile(f)
-    f.close()
-
-
-def readFlowKITTI(filename):
-    flow = cv2.imread(filename, cv2.IMREAD_ANYDEPTH | cv2.IMREAD_COLOR)
-    flow = flow[:, :, ::-1].astype(np.float32)
-    flow, valid = flow[:, :, :2], flow[:, :, 2]
-    flow = (flow - 2**15) / 64.0
-    return flow, valid
-
-
-def readDispKITTI(filename):
-    disp = cv2.imread(filename, cv2.IMREAD_ANYDEPTH) / 256.0
-    valid = disp > 0.0
-    flow = np.stack([-disp, np.zeros_like(disp)], -1)
-    return flow, valid
-
-
-def writeFlowKITTI(filename, uv):
-    uv = 64.0 * uv + 2**15
-    valid = np.ones([uv.shape[0], uv.shape[1], 1])
-    uv = np.concatenate([uv, valid], axis=-1).astype(np.uint16)
-    cv2.imwrite(filename, uv[..., ::-1])
-
-
-def read_gen(file_name, pil=False):
-    ext = splitext(file_name)[-1]
-    if ext == ".png" or ext == ".jpeg" or ext == ".ppm" or ext == ".jpg":
-        return Image.open(file_name)
-    elif ext == ".bin" or ext == ".raw":
-        return np.load(file_name)
-    elif ext == ".flo":
-        return readFlow(file_name).astype(np.float32)
-    elif ext == ".pfm":
-        flow = readPFM(file_name).astype(np.float32)
-        if len(flow.shape) == 2:
-            return flow
-        else:
-            return flow[:, :, :-1]
-    return []
diff --git a/eval/vbench/third_party/RAFT/core/utils_core/utils.py b/eval/vbench/third_party/RAFT/core/utils_core/utils.py
deleted file mode 100644
index bcd92c15..00000000
--- a/eval/vbench/third_party/RAFT/core/utils_core/utils.py
+++ /dev/null
@@ -1,93 +0,0 @@
-import numpy as np
-import torch
-import torch.nn.functional as F
-from scipy import interpolate
-
-
-class InputPadder:
-    """Pads images such that dimensions are divisible by 8"""
-
-    def __init__(self, dims, mode="sintel"):
-        self.ht, self.wd = dims[-2:]
-        pad_ht = (((self.ht // 8) + 1) * 8 - self.ht) % 8
-        pad_wd = (((self.wd // 8) + 1) * 8 - self.wd) % 8
-        if mode == "sintel":
-            self._pad = [
-                pad_wd // 2,
-                pad_wd - pad_wd // 2,
-                pad_ht // 2,
-                pad_ht - pad_ht // 2,
-            ]
-        else:
-            self._pad = [pad_wd // 2, pad_wd - pad_wd // 2, 0, pad_ht]
-
-    def pad(self, *inputs):
-        return [F.pad(x, self._pad, mode="replicate") for x in inputs]
-
-    def unpad(self, x):
-        ht, wd = x.shape[-2:]
-        c = [self._pad[2], ht - self._pad[3], self._pad[0], wd - self._pad[1]]
-        return x[..., c[0] : c[1], c[2] : c[3]]
-
-
-def forward_interpolate(flow):
-    flow = flow.detach().cpu().numpy()
-    dx, dy = flow[0], flow[1]
-
-    ht, wd = dx.shape
-    x0, y0 = np.meshgrid(np.arange(wd), np.arange(ht))
-
-    x1 = x0 + dx
-    y1 = y0 + dy
-
-    x1 = x1.reshape(-1)
-    y1 = y1.reshape(-1)
-    dx = dx.reshape(-1)
-    dy = dy.reshape(-1)
-
-    valid = (x1 > 0) & (x1 < wd) & (y1 > 0) & (y1 < ht)
-    x1 = x1[valid]
-    y1 = y1[valid]
-    dx = dx[valid]
-    dy = dy[valid]
-
-    flow_x = interpolate.griddata(
-        (x1, y1), dx, (x0, y0), method="nearest", fill_value=0
-    )
-
-    flow_y = interpolate.griddata(
-        (x1, y1), dy, (x0, y0), method="nearest", fill_value=0
-    )
-
-    flow = np.stack([flow_x, flow_y], axis=0)
-    return torch.from_numpy(flow).float()
-
-
-def bilinear_sampler(img, coords, mode="bilinear", mask=False):
-    """Wrapper for grid_sample, uses pixel coordinates"""
-    H, W = img.shape[-2:]
-    xgrid, ygrid = coords.split([1, 1], dim=-1)
-    xgrid = 2 * xgrid / (W - 1) - 1
-    ygrid = 2 * ygrid / (H - 1) - 1
-
-    grid = torch.cat([xgrid, ygrid], dim=-1)
-    img = F.grid_sample(img, grid, align_corners=True)
-
-    if mask:
-        mask = (xgrid > -1) & (ygrid > -1) & (xgrid < 1) & (ygrid < 1)
-        return img, mask.float()
-
-    return img
-
-
-def coords_grid(batch, ht, wd, device):
-    coords = torch.meshgrid(
-        torch.arange(ht, device=device), torch.arange(wd, device=device)
-    )
-    coords = torch.stack(coords[::-1], dim=0).float()
-    return coords[None].repeat(batch, 1, 1, 1)
-
-
-def upflow8(flow, mode="bilinear"):
-    new_size = (8 * flow.shape[2], 8 * flow.shape[3])
-    return 8 * F.interpolate(flow, size=new_size, mode=mode, align_corners=True)
diff --git a/eval/vbench/third_party/RAFT/download_models.sh b/eval/vbench/third_party/RAFT/download_models.sh
deleted file mode 100644
index dfd8d473..00000000
--- a/eval/vbench/third_party/RAFT/download_models.sh
+++ /dev/null
@@ -1,3 +0,0 @@
-#!/bin/bash
-wget https://dl.dropboxusercontent.com/s/4j4z58wuv8o0mfz/models.zip
-unzip models.zip
diff --git a/eval/vbench/third_party/ViCLIP/__init__.py b/eval/vbench/third_party/ViCLIP/__init__.py
deleted file mode 100644
index e69de29b..00000000
diff --git a/eval/vbench/third_party/ViCLIP/simple_tokenizer.py b/eval/vbench/third_party/ViCLIP/simple_tokenizer.py
deleted file mode 100644
index 5634e24b..00000000
--- a/eval/vbench/third_party/ViCLIP/simple_tokenizer.py
+++ /dev/null
@@ -1,159 +0,0 @@
-import gzip
-import html
-import os
-import subprocess
-from functools import lru_cache
-
-import ftfy
-import regex as re
-from vbench.utils import CACHE_DIR
-
-
-def default_bpe():
-    tokenizer_file = os.path.join(CACHE_DIR, "ViCLIP/bpe_simple_vocab_16e6.txt.gz")
-    if not os.path.exists(tokenizer_file):
-        print(f"Downloading ViCLIP tokenizer to {tokenizer_file}")
-        wget_command = [
-            "wget",
-            "https://raw.githubusercontent.com/openai/CLIP/main/clip/bpe_simple_vocab_16e6.txt.gz",
-            "-P",
-            os.path.dirname(tokenizer_file),
-        ]
-        subprocess.run(wget_command)
-    return tokenizer_file
-
-
-@lru_cache()
-def bytes_to_unicode():
-    """
-    Returns list of utf-8 byte and a corresponding list of unicode strings.
-    The reversible bpe codes work on unicode strings.
-    This means you need a large # of unicode characters in your vocab if you want to avoid UNKs.
-    When you're at something like a 10B token dataset you end up needing around 5K for decent coverage.
-    This is a signficant percentage of your normal, say, 32K bpe vocab.
-    To avoid that, we want lookup tables between utf-8 bytes and unicode strings.
-    And avoids mapping to whitespace/control characters the bpe code barfs on.
-    """
-    bs = (
-        list(range(ord("!"), ord("~") + 1))
-        + list(range(ord("¡"), ord("¬") + 1))
-        + list(range(ord("®"), ord("ÿ") + 1))
-    )
-    cs = bs[:]
-    n = 0
-    for b in range(2**8):
-        if b not in bs:
-            bs.append(b)
-            cs.append(2**8 + n)
-            n += 1
-    cs = [chr(n) for n in cs]
-    return dict(zip(bs, cs))
-
-
-def get_pairs(word):
-    """Return set of symbol pairs in a word.
-    Word is represented as tuple of symbols (symbols being variable-length strings).
-    """
-    pairs = set()
-    prev_char = word[0]
-    for char in word[1:]:
-        pairs.add((prev_char, char))
-        prev_char = char
-    return pairs
-
-
-def basic_clean(text):
-    text = ftfy.fix_text(text)
-    text = html.unescape(html.unescape(text))
-    return text.strip()
-
-
-def whitespace_clean(text):
-    text = re.sub(r"\s+", " ", text)
-    text = text.strip()
-    return text
-
-
-class SimpleTokenizer(object):
-    def __init__(self, bpe_path: str = default_bpe()):
-        self.byte_encoder = bytes_to_unicode()
-        self.byte_decoder = {v: k for k, v in self.byte_encoder.items()}
-        merges = gzip.open(bpe_path).read().decode("utf-8").split("\n")
-        merges = merges[1 : 49152 - 256 - 2 + 1]
-        merges = [tuple(merge.split()) for merge in merges]
-        vocab = list(bytes_to_unicode().values())
-        vocab = vocab + [v + "</w>" for v in vocab]
-        for merge in merges:
-            vocab.append("".join(merge))
-        vocab.extend(["<|startoftext|>", "<|endoftext|>"])
-        self.encoder = dict(zip(vocab, range(len(vocab))))
-        self.decoder = {v: k for k, v in self.encoder.items()}
-        self.bpe_ranks = dict(zip(merges, range(len(merges))))
-        self.cache = {
-            "<|startoftext|>": "<|startoftext|>",
-            "<|endoftext|>": "<|endoftext|>",
-        }
-        self.pat = re.compile(
-            r"""<\|startoftext\|>|<\|endoftext\|>|'s|'t|'re|'ve|'m|'ll|'d|[\p{L}]+|[\p{N}]|[^\s\p{L}\p{N}]+""",
-            re.IGNORECASE,
-        )
-
-    def bpe(self, token):
-        if token in self.cache:
-            return self.cache[token]
-        word = tuple(token[:-1]) + (token[-1] + "</w>",)
-        pairs = get_pairs(word)
-
-        if not pairs:
-            return token + "</w>"
-
-        while True:
-            bigram = min(pairs, key=lambda pair: self.bpe_ranks.get(pair, float("inf")))
-            if bigram not in self.bpe_ranks:
-                break
-            first, second = bigram
-            new_word = []
-            i = 0
-            while i < len(word):
-                try:
-                    j = word.index(first, i)
-                    new_word.extend(word[i:j])
-                    i = j
-                except:
-                    new_word.extend(word[i:])
-                    break
-
-                if word[i] == first and i < len(word) - 1 and word[i + 1] == second:
-                    new_word.append(first + second)
-                    i += 2
-                else:
-                    new_word.append(word[i])
-                    i += 1
-            new_word = tuple(new_word)
-            word = new_word
-            if len(word) == 1:
-                break
-            else:
-                pairs = get_pairs(word)
-        word = " ".join(word)
-        self.cache[token] = word
-        return word
-
-    def encode(self, text):
-        bpe_tokens = []
-        text = whitespace_clean(basic_clean(text)).lower()
-        for token in re.findall(self.pat, text):
-            token = "".join(self.byte_encoder[b] for b in token.encode("utf-8"))
-            bpe_tokens.extend(
-                self.encoder[bpe_token] for bpe_token in self.bpe(token).split(" ")
-            )
-        return bpe_tokens
-
-    def decode(self, tokens):
-        text = "".join([self.decoder[token] for token in tokens])
-        text = (
-            bytearray([self.byte_decoder[c] for c in text])
-            .decode("utf-8", errors="replace")
-            .replace("</w>", " ")
-        )
-        return text
diff --git a/eval/vbench/third_party/ViCLIP/viclip.py b/eval/vbench/third_party/ViCLIP/viclip.py
deleted file mode 100644
index b7fb2091..00000000
--- a/eval/vbench/third_party/ViCLIP/viclip.py
+++ /dev/null
@@ -1,227 +0,0 @@
-import logging
-import math
-import os
-
-import torch
-from einops import rearrange
-from torch import nn
-
-from .simple_tokenizer import SimpleTokenizer as _Tokenizer
-from .viclip_text import clip_text_l14
-from .viclip_vision import clip_joint_l14
-
-logger = logging.getLogger(__name__)
-
-
-class ViCLIP(nn.Module):
-    """docstring for ViCLIP"""
-
-    def __init__(
-        self,
-        tokenizer=None,
-        pretrain=os.path.join(
-            os.path.dirname(os.path.abspath(__file__)), "ViClip-InternVid-10M-FLT.pth"
-        ),
-        freeze_text=True,
-    ):
-        super(ViCLIP, self).__init__()
-        if tokenizer:
-            self.tokenizer = tokenizer
-        else:
-            self.tokenizer = _Tokenizer()
-        self.max_txt_l = 32
-
-        self.vision_encoder_name = "vit_l14"
-
-        self.vision_encoder_pretrained = False
-        self.inputs_image_res = 224
-        self.vision_encoder_kernel_size = 1
-        self.vision_encoder_center = True
-        self.video_input_num_frames = 8
-        self.vision_encoder_drop_path_rate = 0.1
-        self.vision_encoder_checkpoint_num = 24
-        self.is_pretrain = pretrain
-        self.vision_width = 1024
-        self.text_width = 768
-        self.embed_dim = 768
-        self.masking_prob = 0.9
-
-        self.text_encoder_name = "vit_l14"
-        self.text_encoder_pretrained = False  #'bert-base-uncased'
-        self.text_encoder_d_model = 768
-
-        self.text_encoder_vocab_size = 49408
-
-        # create modules.
-        self.vision_encoder = self.build_vision_encoder()
-        self.text_encoder = self.build_text_encoder()
-
-        self.temp = nn.parameter.Parameter(torch.ones([]) * 1 / 100.0)
-        self.temp_min = 1 / 100.0
-
-        if pretrain:
-            logger.info(f"Load pretrained weights from {pretrain}")
-            state_dict = torch.load(pretrain, map_location="cpu")["model"]
-            self.load_state_dict(state_dict)
-
-        # Freeze weights
-        if freeze_text:
-            self.freeze_text()
-
-    def freeze_text(self):
-        """freeze text encoder"""
-        for p in self.text_encoder.parameters():
-            p.requires_grad = False
-
-    def no_weight_decay(self):
-        ret = {"temp"}
-        ret.update(
-            {"vision_encoder." + k for k in self.vision_encoder.no_weight_decay()}
-        )
-        ret.update({"text_encoder." + k for k in self.text_encoder.no_weight_decay()})
-
-        return ret
-
-    def forward(
-        self, image, text, raw_text, idx, log_generation=None, return_sims=False
-    ):
-        """forward and calculate loss.
-
-        Args:
-            image (torch.Tensor): The input images. Shape: [B,T,C,H,W].
-            text (dict): TODO
-            idx (torch.Tensor): TODO
-
-        Returns: TODO
-
-        """
-        self.clip_contrastive_temperature()
-
-        vision_embeds = self.encode_vision(image)
-        text_embeds = self.encode_text(raw_text)
-        if return_sims:
-            sims = torch.nn.functional.normalize(
-                vision_embeds, dim=-1
-            ) @ torch.nn.functional.normalize(text_embeds, dim=-1).transpose(0, 1)
-            return sims
-
-        # calculate loss
-
-        ## VTC loss
-        loss_vtc = self.clip_loss.vtc_loss(
-            vision_embeds, text_embeds, idx, self.temp, all_gather=True
-        )
-
-        return dict(
-            loss_vtc=loss_vtc,
-        )
-
-    def encode_vision(self, image, test=False):
-        """encode image / videos as features.
-
-        Args:
-            image (torch.Tensor): The input images.
-            test (bool): Whether testing.
-
-        Returns: tuple.
-            - vision_embeds (torch.Tensor): The features of all patches. Shape: [B,T,L,C].
-            - pooled_vision_embeds (torch.Tensor): The pooled features. Shape: [B,T,C].
-
-        """
-        if image.ndim == 5:
-            image = image.permute(0, 2, 1, 3, 4).contiguous()
-        else:
-            image = image.unsqueeze(2)
-
-        if not test and self.masking_prob > 0.0:
-            return self.vision_encoder(image, masking_prob=self.masking_prob)
-
-        return self.vision_encoder(image)
-
-    def encode_text(self, text):
-        """encode text.
-        Args:
-            text (dict): The output of huggingface's `PreTrainedTokenizer`. contains keys:
-                - input_ids (torch.Tensor): Token ids to be fed to a model. Shape: [B,L].
-                - attention_mask (torch.Tensor): The mask indicate padded tokens. Shape: [B,L]. 0 is padded token.
-                - other keys refer to "https://huggingface.co/docs/transformers/v4.21.2/en/main_classes/tokenizer#transformers.PreTrainedTokenizer.__call__".
-        Returns: tuple.
-            - text_embeds (torch.Tensor): The features of all tokens. Shape: [B,L,C].
-            - pooled_text_embeds (torch.Tensor): The pooled features. Shape: [B,C].
-
-        """
-        device = next(self.text_encoder.parameters()).device
-        text = self.text_encoder.tokenize(text, context_length=self.max_txt_l).to(
-            device
-        )
-        text_embeds = self.text_encoder(text)
-        return text_embeds
-
-    @torch.no_grad()
-    def clip_contrastive_temperature(self, min_val=0.001, max_val=0.5):
-        """Seems only used during pre-training"""
-        self.temp.clamp_(min=self.temp_min)
-
-    def build_vision_encoder(self):
-        """build vision encoder
-        Returns: (vision_encoder, vision_layernorm). Each is a `nn.Module`.
-
-        """
-        encoder_name = self.vision_encoder_name
-        if encoder_name != "vit_l14":
-            raise ValueError(f"Not implemented: {encoder_name}")
-        vision_encoder = clip_joint_l14(
-            pretrained=self.vision_encoder_pretrained,
-            input_resolution=self.inputs_image_res,
-            kernel_size=self.vision_encoder_kernel_size,
-            center=self.vision_encoder_center,
-            num_frames=self.video_input_num_frames,
-            drop_path=self.vision_encoder_drop_path_rate,
-            checkpoint_num=self.vision_encoder_checkpoint_num,
-        )
-        return vision_encoder
-
-    def build_text_encoder(self):
-        """build text_encoder and possiblly video-to-text multimodal fusion encoder.
-        Returns: nn.Module. The text encoder
-
-        """
-        encoder_name = self.text_encoder_name
-        if encoder_name != "vit_l14":
-            raise ValueError(f"Not implemented: {encoder_name}")
-        text_encoder = clip_text_l14(
-            pretrained=self.text_encoder_pretrained,
-            embed_dim=self.text_encoder_d_model,
-            context_length=self.max_txt_l,
-            vocab_size=self.text_encoder_vocab_size,
-            checkpoint_num=0,
-        )
-
-        return text_encoder
-
-    def get_text_encoder(self):
-        """get text encoder, used for text and cross-modal encoding"""
-        encoder = self.text_encoder
-        return encoder.bert if hasattr(encoder, "bert") else encoder
-
-    def get_text_features(self, input_text, tokenizer, text_feature_dict={}):
-        if input_text in text_feature_dict:
-            return text_feature_dict[input_text]
-        text_template = f"{input_text}"
-        with torch.no_grad():
-            # text_token = tokenizer.encode(text_template).cuda()
-            text_features = self.encode_text(text_template).float()
-            text_features /= text_features.norm(dim=-1, keepdim=True)
-            text_feature_dict[input_text] = text_features
-        return text_features
-
-    def get_vid_features(self, input_frames):
-        with torch.no_grad():
-            clip_feat = self.encode_vision(input_frames, test=True).float()
-            clip_feat /= clip_feat.norm(dim=-1, keepdim=True)
-        return clip_feat
-
-    def get_predict_label(self, clip_feature, text_feats_tensor, top=5):
-        label_probs = (100.0 * clip_feature @ text_feats_tensor.T).softmax(dim=-1)
-        top_probs, top_labels = label_probs.cpu().topk(top, dim=-1)
-        return top_probs, top_labels
diff --git a/eval/vbench/third_party/ViCLIP/viclip_text.py b/eval/vbench/third_party/ViCLIP/viclip_text.py
deleted file mode 100644
index 4e7a7c68..00000000
--- a/eval/vbench/third_party/ViCLIP/viclip_text.py
+++ /dev/null
@@ -1,304 +0,0 @@
-import functools
-import logging
-import os
-from collections import OrderedDict
-
-import numpy as np
-import torch
-import torch.nn.functional as F
-import torch.utils.checkpoint as checkpoint
-from pkg_resources import packaging
-from torch import nn
-
-from .simple_tokenizer import SimpleTokenizer as _Tokenizer
-
-logger = logging.getLogger(__name__)
-
-
-MODEL_PATH = "https://huggingface.co/laion/CLIP-ViT-L-14-DataComp.XL-s13B-b90K"
-_MODELS = {
-    "ViT-L/14": os.path.join(MODEL_PATH, "vit_l14_text.pth"),
-}
-
-
-class LayerNorm(nn.LayerNorm):
-    """Subclass torch's LayerNorm to handle fp16."""
-
-    def forward(self, x: torch.Tensor):
-        orig_type = x.dtype
-        ret = super().forward(x.type(torch.float32))
-        return ret.type(orig_type)
-
-
-class QuickGELU(nn.Module):
-    def forward(self, x: torch.Tensor):
-        return x * torch.sigmoid(1.702 * x)
-
-
-class ResidualAttentionBlock(nn.Module):
-    def __init__(self, d_model: int, n_head: int, attn_mask: torch.Tensor = None):
-        super().__init__()
-
-        self.attn = nn.MultiheadAttention(d_model, n_head)
-        self.ln_1 = LayerNorm(d_model)
-        self.mlp = nn.Sequential(
-            OrderedDict(
-                [
-                    ("c_fc", nn.Linear(d_model, d_model * 4)),
-                    ("gelu", QuickGELU()),
-                    ("c_proj", nn.Linear(d_model * 4, d_model)),
-                ]
-            )
-        )
-        self.ln_2 = LayerNorm(d_model)
-        self.attn_mask = attn_mask
-
-    def attention(self, x: torch.Tensor):
-        self.attn_mask = (
-            self.attn_mask.to(dtype=x.dtype, device=x.device)
-            if self.attn_mask is not None
-            else None
-        )
-        return self.attn(x, x, x, need_weights=False, attn_mask=self.attn_mask)[0]
-
-    def forward(self, x: torch.Tensor):
-        x = x + self.attention(self.ln_1(x))
-        x = x + self.mlp(self.ln_2(x))
-        return x
-
-
-class Transformer(nn.Module):
-    def __init__(
-        self,
-        width: int,
-        layers: int,
-        heads: int,
-        attn_mask: torch.Tensor = None,
-        checkpoint_num: int = 0,
-    ):
-        super().__init__()
-        self.width = width
-        self.layers = layers
-        self.resblocks = nn.Sequential(
-            *[ResidualAttentionBlock(width, heads, attn_mask) for _ in range(layers)]
-        )
-
-        self.checkpoint_num = checkpoint_num
-
-    def forward(self, x: torch.Tensor):
-        if self.checkpoint_num > 0:
-            segments = min(self.checkpoint_num, len(self.resblocks))
-            return checkpoint.checkpoint_sequential(self.resblocks, segments, x)
-        else:
-            return self.resblocks(x)
-
-
-class CLIP_TEXT(nn.Module):
-    def __init__(
-        self,
-        embed_dim: int,
-        context_length: int,
-        vocab_size: int,
-        transformer_width: int,
-        transformer_heads: int,
-        transformer_layers: int,
-        checkpoint_num: int,
-    ):
-        super().__init__()
-
-        self.context_length = context_length
-        self._tokenizer = _Tokenizer()
-
-        self.transformer = Transformer(
-            width=transformer_width,
-            layers=transformer_layers,
-            heads=transformer_heads,
-            attn_mask=self.build_attention_mask(),
-            checkpoint_num=checkpoint_num,
-        )
-
-        self.vocab_size = vocab_size
-        self.token_embedding = nn.Embedding(vocab_size, transformer_width)
-        self.positional_embedding = nn.Parameter(
-            torch.empty(self.context_length, transformer_width)
-        )
-        self.ln_final = LayerNorm(transformer_width)
-
-        self.text_projection = nn.Parameter(torch.empty(transformer_width, embed_dim))
-
-    def no_weight_decay(self):
-        return {"token_embedding", "positional_embedding"}
-
-    @functools.lru_cache(maxsize=None)
-    def build_attention_mask(self):
-        # lazily create causal attention mask, with full attention between the vision tokens
-        # pytorch uses additive attention mask; fill with -inf
-        mask = torch.empty(self.context_length, self.context_length)
-        mask.fill_(float("-inf"))
-        mask.triu_(1)  # zero out the lower diagonal
-        return mask
-
-    def tokenize(self, texts, context_length=77, truncate=True):
-        """
-        Returns the tokenized representation of given input string(s)
-        Parameters
-        ----------
-        texts : Union[str, List[str]]
-            An input string or a list of input strings to tokenize
-        context_length : int
-            The context length to use; all CLIP models use 77 as the context length
-        truncate: bool
-            Whether to truncate the text in case its encoding is longer than the context length
-        Returns
-        -------
-        A two-dimensional tensor containing the resulting tokens, shape = [number of input strings, context_length].
-        We return LongTensor when torch version is <1.8.0, since older index_select requires indices to be long.
-        """
-        if isinstance(texts, str):
-            texts = [texts]
-
-        sot_token = self._tokenizer.encoder["<|startoftext|>"]
-        eot_token = self._tokenizer.encoder["<|endoftext|>"]
-        all_tokens = [
-            [sot_token] + self._tokenizer.encode(text) + [eot_token] for text in texts
-        ]
-        if packaging.version.parse(torch.__version__) < packaging.version.parse(
-            "1.8.0"
-        ):
-            result = torch.zeros(len(all_tokens), context_length, dtype=torch.long)
-        else:
-            result = torch.zeros(len(all_tokens), context_length, dtype=torch.int)
-
-        for i, tokens in enumerate(all_tokens):
-            if len(tokens) > context_length:
-                if truncate:
-                    tokens = tokens[:context_length]
-                    tokens[-1] = eot_token
-                else:
-                    raise RuntimeError(
-                        f"Input {texts[i]} is too long for context length {context_length}"
-                    )
-            result[i, : len(tokens)] = torch.tensor(tokens)
-
-        return result
-
-    def forward(self, text):
-        x = self.token_embedding(text)  # [batch_size, n_ctx, d_model]
-
-        x = x + self.positional_embedding
-        x = x.permute(1, 0, 2)  # NLD -> LND
-        x = self.transformer(x)
-        x = x.permute(1, 0, 2)  # LND -> NLD
-        x = self.ln_final(x)
-
-        # x.shape = [batch_size, n_ctx, transformer.width]
-        # take features from the eot embedding (eot_token is the highest number in each sequence)
-        x = x[torch.arange(x.shape[0]), text.argmax(dim=-1)] @ self.text_projection
-
-        return x
-
-
-def clip_text_b16(
-    embed_dim=512,
-    context_length=77,
-    vocab_size=49408,
-    transformer_width=512,
-    transformer_heads=8,
-    transformer_layers=12,
-):
-    raise NotImplementedError
-    model = CLIP_TEXT(
-        embed_dim,
-        context_length,
-        vocab_size,
-        transformer_width,
-        transformer_heads,
-        transformer_layers,
-    )
-    pretrained = _MODELS["ViT-B/16"]
-    logger.info(f"Load pretrained weights from {pretrained}")
-    state_dict = torch.load(pretrained, map_location="cpu")
-    model.load_state_dict(state_dict, strict=False)
-    return model.eval()
-
-
-def clip_text_l14(
-    embed_dim=768,
-    context_length=77,
-    vocab_size=49408,
-    transformer_width=768,
-    transformer_heads=12,
-    transformer_layers=12,
-    checkpoint_num=0,
-    pretrained=True,
-):
-    model = CLIP_TEXT(
-        embed_dim,
-        context_length,
-        vocab_size,
-        transformer_width,
-        transformer_heads,
-        transformer_layers,
-        checkpoint_num,
-    )
-    if pretrained:
-        if isinstance(pretrained, str) and pretrained != "bert-base-uncased":
-            pretrained = _MODELS[pretrained]
-        else:
-            pretrained = _MODELS["ViT-L/14"]
-        logger.info(f"Load pretrained weights from {pretrained}")
-        state_dict = torch.load(pretrained, map_location="cpu")
-        if context_length != state_dict["positional_embedding"].size(0):
-            # assert context_length < state_dict["positional_embedding"].size(0), "Cannot increase context length."
-            print(
-                f"Resize positional embedding from {state_dict['positional_embedding'].size(0)} to {context_length}"
-            )
-            if context_length < state_dict["positional_embedding"].size(0):
-                state_dict["positional_embedding"] = state_dict["positional_embedding"][
-                    :context_length
-                ]
-            else:
-                state_dict["positional_embedding"] = F.pad(
-                    state_dict["positional_embedding"],
-                    (
-                        0,
-                        0,
-                        0,
-                        context_length - state_dict["positional_embedding"].size(0),
-                    ),
-                    value=0,
-                )
-
-        message = model.load_state_dict(state_dict, strict=False)
-        print(f"Load pretrained weights from {pretrained}: {message}")
-    return model.eval()
-
-
-def clip_text_l14_336(
-    embed_dim=768,
-    context_length=77,
-    vocab_size=49408,
-    transformer_width=768,
-    transformer_heads=12,
-    transformer_layers=12,
-):
-    raise NotImplementedError
-    model = CLIP_TEXT(
-        embed_dim,
-        context_length,
-        vocab_size,
-        transformer_width,
-        transformer_heads,
-        transformer_layers,
-    )
-    pretrained = _MODELS["ViT-L/14_336"]
-    logger.info(f"Load pretrained weights from {pretrained}")
-    state_dict = torch.load(pretrained, map_location="cpu")
-    model.load_state_dict(state_dict, strict=False)
-    return model.eval()
-
-
-def build_clip(config):
-    model_cls = config.text_encoder.clip_teacher
-    model = eval(model_cls)()
-    return model
diff --git a/eval/vbench/third_party/ViCLIP/viclip_vision.py b/eval/vbench/third_party/ViCLIP/viclip_vision.py
deleted file mode 100644
index 163be681..00000000
--- a/eval/vbench/third_party/ViCLIP/viclip_vision.py
+++ /dev/null
@@ -1,437 +0,0 @@
-#!/usr/bin/env python
-import logging
-import os
-from collections import OrderedDict
-
-import torch
-import torch.utils.checkpoint as checkpoint
-from einops import rearrange
-from timm.models.layers import DropPath
-from timm.models.registry import register_model
-from torch import nn
-
-logger = logging.getLogger(__name__)
-
-
-def load_temp_embed_with_mismatch(temp_embed_old, temp_embed_new, add_zero=True):
-    """
-    Add/Remove extra temporal_embeddings as needed.
-    https://arxiv.org/abs/2104.00650 shows adding zero paddings works.
-
-    temp_embed_old: (1, num_frames_old, 1, d)
-    temp_embed_new: (1, num_frames_new, 1, d)
-    add_zero: bool, if True, add zero, else, interpolate trained embeddings.
-    """
-    # TODO zero pad
-    num_frms_new = temp_embed_new.shape[1]
-    num_frms_old = temp_embed_old.shape[1]
-    logger.info(f"Load temporal_embeddings, lengths: {num_frms_old}-->{num_frms_new}")
-    if num_frms_new > num_frms_old:
-        if add_zero:
-            temp_embed_new[:, :num_frms_old] = (
-                temp_embed_old  # untrained embeddings are zeros.
-            )
-        else:
-            temp_embed_new = interpolate_temporal_pos_embed(
-                temp_embed_old, num_frms_new
-            )
-    elif num_frms_new < num_frms_old:
-        temp_embed_new = temp_embed_old[:, :num_frms_new]
-    else:  # =
-        temp_embed_new = temp_embed_old
-    return temp_embed_new
-
-
-MODEL_PATH = "https://pjlab-gvm-data.oss-cn-shanghai.aliyuncs.com/internvideo/viclip/"
-_MODELS = {
-    "ViT-L/14": os.path.join(MODEL_PATH, "ViClip-InternVid-10M-FLT.pth"),
-}
-
-
-class QuickGELU(nn.Module):
-    def forward(self, x):
-        return x * torch.sigmoid(1.702 * x)
-
-
-class ResidualAttentionBlock(nn.Module):
-    def __init__(self, d_model, n_head, drop_path=0.0, attn_mask=None, dropout=0.0):
-        super().__init__()
-
-        self.drop_path1 = DropPath(drop_path) if drop_path > 0.0 else nn.Identity()
-        self.drop_path2 = DropPath(drop_path) if drop_path > 0.0 else nn.Identity()
-        self.attn = nn.MultiheadAttention(d_model, n_head, dropout=dropout)
-        self.ln_1 = nn.LayerNorm(d_model)
-        self.mlp = nn.Sequential(
-            OrderedDict(
-                [
-                    ("c_fc", nn.Linear(d_model, d_model * 4)),
-                    ("gelu", QuickGELU()),
-                    ("drop1", nn.Dropout(dropout)),
-                    ("c_proj", nn.Linear(d_model * 4, d_model)),
-                    ("drop2", nn.Dropout(dropout)),
-                ]
-            )
-        )
-        self.ln_2 = nn.LayerNorm(d_model)
-        self.attn_mask = attn_mask
-
-    def attention(self, x):
-        self.attn_mask = (
-            self.attn_mask.to(dtype=x.dtype, device=x.device)
-            if self.attn_mask is not None
-            else None
-        )
-        return self.attn(x, x, x, need_weights=False, attn_mask=self.attn_mask)[0]
-
-    def forward(self, x):
-        x = x + self.drop_path1(self.attention(self.ln_1(x)))
-        x = x + self.drop_path2(self.mlp(self.ln_2(x)))
-        return x
-
-
-class Transformer(nn.Module):
-    def __init__(
-        self, width, layers, heads, drop_path=0.0, checkpoint_num=0, dropout=0.0
-    ):
-        super().__init__()
-        dpr = [x.item() for x in torch.linspace(0, drop_path, layers)]
-        self.resblocks = nn.ModuleList()
-        for idx in range(layers):
-            self.resblocks.append(
-                ResidualAttentionBlock(
-                    width, heads, drop_path=dpr[idx], dropout=dropout
-                )
-            )
-        self.checkpoint_num = checkpoint_num
-
-    def forward(self, x):
-        for idx, blk in enumerate(self.resblocks):
-            if idx < self.checkpoint_num:
-                x = checkpoint.checkpoint(blk, x)
-            else:
-                x = blk(x)
-        return x
-
-
-class VisionTransformer(nn.Module):
-    def __init__(
-        self,
-        input_resolution,
-        patch_size,
-        width,
-        layers,
-        heads,
-        output_dim=None,
-        kernel_size=1,
-        num_frames=8,
-        drop_path=0,
-        checkpoint_num=0,
-        dropout=0.0,
-        temp_embed=True,
-    ):
-        super().__init__()
-        self.output_dim = output_dim
-        self.conv1 = nn.Conv3d(
-            3,
-            width,
-            (kernel_size, patch_size, patch_size),
-            (kernel_size, patch_size, patch_size),
-            (0, 0, 0),
-            bias=False,
-        )
-
-        scale = width**-0.5
-        self.class_embedding = nn.Parameter(scale * torch.randn(width))
-        self.positional_embedding = nn.Parameter(
-            scale * torch.randn((input_resolution // patch_size) ** 2 + 1, width)
-        )
-        self.ln_pre = nn.LayerNorm(width)
-        if temp_embed:
-            self.temporal_positional_embedding = nn.Parameter(
-                torch.zeros(1, num_frames, width)
-            )
-
-        self.transformer = Transformer(
-            width,
-            layers,
-            heads,
-            drop_path=drop_path,
-            checkpoint_num=checkpoint_num,
-            dropout=dropout,
-        )
-
-        self.ln_post = nn.LayerNorm(width)
-        if output_dim is not None:
-            self.proj = nn.Parameter(torch.empty(width, output_dim))
-        else:
-            self.proj = None
-
-        self.dropout = nn.Dropout(dropout)
-
-    def get_num_layers(self):
-        return len(self.transformer.resblocks)
-
-    @torch.jit.ignore
-    def no_weight_decay(self):
-        return {
-            "positional_embedding",
-            "class_embedding",
-            "temporal_positional_embedding",
-        }
-
-    def mask_tokens(self, inputs, masking_prob=0.0):
-        B, L, _ = inputs.shape
-
-        # This is different from text as we are masking a fix number of tokens
-        Lm = int(masking_prob * L)
-        masked_indices = torch.zeros(B, L)
-        indices = torch.argsort(torch.rand_like(masked_indices), dim=-1)[:, :Lm]
-        batch_indices = (
-            torch.arange(masked_indices.shape[0]).unsqueeze(-1).expand_as(indices)
-        )
-        masked_indices[batch_indices, indices] = 1
-
-        masked_indices = masked_indices.bool()
-
-        return inputs[~masked_indices].reshape(B, -1, inputs.shape[-1])
-
-    def forward(self, x, masking_prob=0.0):
-        x = self.conv1(x)  # shape = [*, width, grid, grid]
-        B, C, T, H, W = x.shape
-        x = x.permute(0, 2, 3, 4, 1).reshape(B * T, H * W, C)
-
-        x = torch.cat(
-            [
-                self.class_embedding.to(x.dtype)
-                + torch.zeros(
-                    x.shape[0], 1, x.shape[-1], dtype=x.dtype, device=x.device
-                ),
-                x,
-            ],
-            dim=1,
-        )  # shape = [*, grid ** 2 + 1, width]
-        x = x + self.positional_embedding.to(x.dtype)
-
-        # temporal pos
-        cls_tokens = x[:B, :1, :]
-        x = x[:, 1:]
-        x = rearrange(x, "(b t) n m -> (b n) t m", b=B, t=T)
-        if hasattr(self, "temporal_positional_embedding"):
-            if x.size(1) == 1:
-                # This is a workaround for unused parameter issue
-                x = x + self.temporal_positional_embedding.mean(1)
-            else:
-                x = x + self.temporal_positional_embedding
-        x = rearrange(x, "(b n) t m -> b (n t) m", b=B, t=T)
-
-        if masking_prob > 0.0:
-            x = self.mask_tokens(x, masking_prob)
-
-        x = torch.cat((cls_tokens, x), dim=1)
-
-        x = self.ln_pre(x)
-
-        x = x.permute(1, 0, 2)  # BND -> NBD
-        x = self.transformer(x)
-
-        x = self.ln_post(x)
-
-        if self.proj is not None:
-            x = self.dropout(x[0]) @ self.proj
-        else:
-            x = x.permute(1, 0, 2)  # NBD -> BND
-
-        return x
-
-
-def inflate_weight(weight_2d, time_dim, center=True):
-    logger.info(f"Init center: {center}")
-    if center:
-        weight_3d = torch.zeros(*weight_2d.shape)
-        weight_3d = weight_3d.unsqueeze(2).repeat(1, 1, time_dim, 1, 1)
-        middle_idx = time_dim // 2
-        weight_3d[:, :, middle_idx, :, :] = weight_2d
-    else:
-        weight_3d = weight_2d.unsqueeze(2).repeat(1, 1, time_dim, 1, 1)
-        weight_3d = weight_3d / time_dim
-    return weight_3d
-
-
-def load_state_dict(
-    model, state_dict, input_resolution=224, patch_size=16, center=True
-):
-    state_dict_3d = model.state_dict()
-    for k in state_dict.keys():
-        if k in state_dict_3d.keys() and state_dict[k].shape != state_dict_3d[k].shape:
-            if len(state_dict_3d[k].shape) <= 2:
-                logger.info(f"Ignore: {k}")
-                continue
-            logger.info(
-                f"Inflate: {k}, {state_dict[k].shape} => {state_dict_3d[k].shape}"
-            )
-            time_dim = state_dict_3d[k].shape[2]
-            state_dict[k] = inflate_weight(state_dict[k], time_dim, center=center)
-
-    pos_embed_checkpoint = state_dict["positional_embedding"]
-    embedding_size = pos_embed_checkpoint.shape[-1]
-    num_patches = (input_resolution // patch_size) ** 2
-    orig_size = int((pos_embed_checkpoint.shape[-2] - 1) ** 0.5)
-    new_size = int(num_patches**0.5)
-    if orig_size != new_size:
-        logger.info(f"Pos_emb from {orig_size} to {new_size}")
-        extra_tokens = pos_embed_checkpoint[:1]
-        pos_tokens = pos_embed_checkpoint[1:]
-        pos_tokens = pos_tokens.reshape(
-            -1, orig_size, orig_size, embedding_size
-        ).permute(0, 3, 1, 2)
-        pos_tokens = torch.nn.functional.interpolate(
-            pos_tokens, size=(new_size, new_size), mode="bicubic", align_corners=False
-        )
-        pos_tokens = pos_tokens.permute(0, 2, 3, 1).flatten(0, 2)
-        new_pos_embed = torch.cat((extra_tokens, pos_tokens), dim=0)
-        state_dict["positional_embedding"] = new_pos_embed
-
-    message = model.load_state_dict(state_dict, strict=False)
-    logger.info(f"Load pretrained weights: {message}")
-
-
-@register_model
-def clip_joint_b16(
-    pretrained=True,
-    input_resolution=224,
-    kernel_size=1,
-    center=True,
-    num_frames=8,
-    drop_path=0.0,
-):
-    model = VisionTransformer(
-        input_resolution=input_resolution,
-        patch_size=16,
-        width=768,
-        layers=12,
-        heads=12,
-        output_dim=512,
-        kernel_size=kernel_size,
-        num_frames=num_frames,
-        drop_path=drop_path,
-    )
-    raise NotImplementedError
-    if pretrained:
-        logger.info("load pretrained weights")
-        state_dict = torch.load(_MODELS["ViT-B/16"], map_location="cpu")
-        load_state_dict(
-            model,
-            state_dict,
-            input_resolution=input_resolution,
-            patch_size=16,
-            center=center,
-        )
-    return model.eval()
-
-
-@register_model
-def clip_joint_l14(
-    pretrained=False,
-    input_resolution=224,
-    kernel_size=1,
-    center=True,
-    num_frames=8,
-    drop_path=0.0,
-    checkpoint_num=0,
-    dropout=0.0,
-):
-    model = VisionTransformer(
-        input_resolution=input_resolution,
-        patch_size=14,
-        width=1024,
-        layers=24,
-        heads=16,
-        output_dim=768,
-        kernel_size=kernel_size,
-        num_frames=num_frames,
-        drop_path=drop_path,
-        checkpoint_num=checkpoint_num,
-        dropout=dropout,
-    )
-    if pretrained:
-        if isinstance(pretrained, str):
-            model_name = pretrained
-        else:
-            model_name = "ViT-L/14"
-        logger.info("load pretrained weights")
-        state_dict = torch.load(_MODELS[model_name], map_location="cpu")
-        load_state_dict(
-            model,
-            state_dict,
-            input_resolution=input_resolution,
-            patch_size=14,
-            center=center,
-        )
-    return model.eval()
-
-
-@register_model
-def clip_joint_l14_336(
-    pretrained=True,
-    input_resolution=336,
-    kernel_size=1,
-    center=True,
-    num_frames=8,
-    drop_path=0.0,
-):
-    raise NotImplementedError
-    model = VisionTransformer(
-        input_resolution=input_resolution,
-        patch_size=14,
-        width=1024,
-        layers=24,
-        heads=16,
-        output_dim=768,
-        kernel_size=kernel_size,
-        num_frames=num_frames,
-        drop_path=drop_path,
-    )
-    if pretrained:
-        logger.info("load pretrained weights")
-        state_dict = torch.load(_MODELS["ViT-L/14_336"], map_location="cpu")
-        load_state_dict(
-            model,
-            state_dict,
-            input_resolution=input_resolution,
-            patch_size=14,
-            center=center,
-        )
-    return model.eval()
-
-
-def interpolate_pos_embed_vit(state_dict, new_model):
-    key = "vision_encoder.temporal_positional_embedding"
-    if key in state_dict:
-        vision_temp_embed_new = new_model.state_dict()[key]
-        vision_temp_embed_new = vision_temp_embed_new.unsqueeze(
-            2
-        )  # [1, n, d] -> [1, n, 1, d]
-        vision_temp_embed_old = state_dict[key]
-        vision_temp_embed_old = vision_temp_embed_old.unsqueeze(2)
-
-        state_dict[key] = load_temp_embed_with_mismatch(
-            vision_temp_embed_old, vision_temp_embed_new, add_zero=False
-        ).squeeze(2)
-
-    key = "text_encoder.positional_embedding"
-    if key in state_dict:
-        text_temp_embed_new = new_model.state_dict()[key]
-        text_temp_embed_new = text_temp_embed_new.unsqueeze(0).unsqueeze(
-            2
-        )  # [n, d] -> [1, n, 1, d]
-        text_temp_embed_old = state_dict[key]
-        text_temp_embed_old = text_temp_embed_old.unsqueeze(0).unsqueeze(2)
-
-        state_dict[key] = (
-            load_temp_embed_with_mismatch(
-                text_temp_embed_old, text_temp_embed_new, add_zero=False
-            )
-            .squeeze(2)
-            .squeeze(0)
-        )
-    return state_dict
diff --git a/eval/vbench/third_party/__init__.py b/eval/vbench/third_party/__init__.py
deleted file mode 100644
index e69de29b..00000000
diff --git a/eval/vbench/third_party/amt/LICENSE b/eval/vbench/third_party/amt/LICENSE
deleted file mode 100644
index 594d9f3e..00000000
--- a/eval/vbench/third_party/amt/LICENSE
+++ /dev/null
@@ -1,176 +0,0 @@
-## creative commons
-
-# Attribution-NonCommercial 4.0 International
-
-Creative Commons Corporation (“Creative Commons”) is not a law firm and does not provide legal services or legal advice. Distribution of Creative Commons public licenses does not create a lawyer-client or other relationship. Creative Commons makes its licenses and related information available on an “as-is” basis. Creative Commons gives no warranties regarding its licenses, any material licensed under their terms and conditions, or any related information. Creative Commons disclaims all liability for damages resulting from their use to the fullest extent possible.
-
-### Using Creative Commons Public Licenses
-
-Creative Commons public licenses provide a standard set of terms and conditions that creators and other rights holders may use to share original works of authorship and other material subject to copyright and certain other rights specified in the public license below. The following considerations are for informational purposes only, are not exhaustive, and do not form part of our licenses.
-
-* __Considerations for licensors:__ Our public licenses are intended for use by those authorized to give the public permission to use material in ways otherwise restricted by copyright and certain other rights. Our licenses are irrevocable. Licensors should read and understand the terms and conditions of the license they choose before applying it. Licensors should also secure all rights necessary before applying our licenses so that the public can reuse the material as expected. Licensors should clearly mark any material not subject to the license. This includes other CC-licensed material, or material used under an exception or limitation to copyright. [More considerations for licensors](http://wiki.creativecommons.org/Considerations_for_licensors_and_licensees#Considerations_for_licensors).
-
-* __Considerations for the public:__ By using one of our public licenses, a licensor grants the public permission to use the licensed material under specified terms and conditions. If the licensor’s permission is not necessary for any reason–for example, because of any applicable exception or limitation to copyright–then that use is not regulated by the license. Our licenses grant only permissions under copyright and certain other rights that a licensor has authority to grant. Use of the licensed material may still be restricted for other reasons, including because others have copyright or other rights in the material. A licensor may make special requests, such as asking that all changes be marked or described. Although not required by our licenses, you are encouraged to respect those requests where reasonable. [More considerations for the public](http://wiki.creativecommons.org/Considerations_for_licensors_and_licensees#Considerations_for_licensees).
-
-## Creative Commons Attribution-NonCommercial 4.0 International Public License
-
-By exercising the Licensed Rights (defined below), You accept and agree to be bound by the terms and conditions of this Creative Commons Attribution-NonCommercial 4.0 International Public License ("Public License"). To the extent this Public License may be interpreted as a contract, You are granted the Licensed Rights in consideration of Your acceptance of these terms and conditions, and the Licensor grants You such rights in consideration of benefits the Licensor receives from making the Licensed Material available under these terms and conditions.
-
-### Section 1 – Definitions.
-
-a. __Adapted Material__ means material subject to Copyright and Similar Rights that is derived from or based upon the Licensed Material and in which the Licensed Material is translated, altered, arranged, transformed, or otherwise modified in a manner requiring permission under the Copyright and Similar Rights held by the Licensor. For purposes of this Public License, where the Licensed Material is a musical work, performance, or sound recording, Adapted Material is always produced where the Licensed Material is synched in timed relation with a moving image.
-
-b. __Adapter's License__ means the license You apply to Your Copyright and Similar Rights in Your contributions to Adapted Material in accordance with the terms and conditions of this Public License.
-
-c. __Copyright and Similar Rights__ means copyright and/or similar rights closely related to copyright including, without limitation, performance, broadcast, sound recording, and Sui Generis Database Rights, without regard to how the rights are labeled or categorized. For purposes of this Public License, the rights specified in Section 2(b)(1)-(2) are not Copyright and Similar Rights.
-
-d. __Effective Technological Measures__ means those measures that, in the absence of proper authority, may not be circumvented under laws fulfilling obligations under Article 11 of the WIPO Copyright Treaty adopted on December 20, 1996, and/or similar international agreements.
-
-e. __Exceptions and Limitations__ means fair use, fair dealing, and/or any other exception or limitation to Copyright and Similar Rights that applies to Your use of the Licensed Material.
-
-f. __Licensed Material__ means the artistic or literary work, database, or other material to which the Licensor applied this Public License.
-
-g. __Licensed Rights__ means the rights granted to You subject to the terms and conditions of this Public License, which are limited to all Copyright and Similar Rights that apply to Your use of the Licensed Material and that the Licensor has authority to license.
-
-h. __Licensor__ means the individual(s) or entity(ies) granting rights under this Public License.
-
-i. __NonCommercial__ means not primarily intended for or directed towards commercial advantage or monetary compensation. For purposes of this Public License, the exchange of the Licensed Material for other material subject to Copyright and Similar Rights by digital file-sharing or similar means is NonCommercial provided there is no payment of monetary compensation in connection with the exchange.
-
-j. __Share__ means to provide material to the public by any means or process that requires permission under the Licensed Rights, such as reproduction, public display, public performance, distribution, dissemination, communication, or importation, and to make material available to the public including in ways that members of the public may access the material from a place and at a time individually chosen by them.
-
-k. __Sui Generis Database Rights__ means rights other than copyright resulting from Directive 96/9/EC of the European Parliament and of the Council of 11 March 1996 on the legal protection of databases, as amended and/or succeeded, as well as other essentially equivalent rights anywhere in the world.
-
-l. __You__ means the individual or entity exercising the Licensed Rights under this Public License. Your has a corresponding meaning.
-
-### Section 2 – Scope.
-
-a. ___License grant.___
-
-   1. Subject to the terms and conditions of this Public License, the Licensor hereby grants You a worldwide, royalty-free, non-sublicensable, non-exclusive, irrevocable license to exercise the Licensed Rights in the Licensed Material to:
-
-       A. reproduce and Share the Licensed Material, in whole or in part, for NonCommercial purposes only; and
-
-       B. produce, reproduce, and Share Adapted Material for NonCommercial purposes only.
-
-   2. __Exceptions and Limitations.__ For the avoidance of doubt, where Exceptions and Limitations apply to Your use, this Public License does not apply, and You do not need to comply with its terms and conditions.
-
-   3. __Term.__ The term of this Public License is specified in Section 6(a).
-
-   4. __Media and formats; technical modifications allowed.__ The Licensor authorizes You to exercise the Licensed Rights in all media and formats whether now known or hereafter created, and to make technical modifications necessary to do so. The Licensor waives and/or agrees not to assert any right or authority to forbid You from making technical modifications necessary to exercise the Licensed Rights, including technical modifications necessary to circumvent Effective Technological Measures. For purposes of this Public License, simply making modifications authorized by this Section 2(a)(4) never produces Adapted Material.
-
-   5. __Downstream recipients.__
-
-        A. __Offer from the Licensor – Licensed Material.__ Every recipient of the Licensed Material automatically receives an offer from the Licensor to exercise the Licensed Rights under the terms and conditions of this Public License.
-
-        B. __No downstream restrictions.__ You may not offer or impose any additional or different terms or conditions on, or apply any Effective Technological Measures to, the Licensed Material if doing so restricts exercise of the Licensed Rights by any recipient of the Licensed Material.
-
-   6. __No endorsement.__ Nothing in this Public License constitutes or may be construed as permission to assert or imply that You are, or that Your use of the Licensed Material is, connected with, or sponsored, endorsed, or granted official status by, the Licensor or others designated to receive attribution as provided in Section 3(a)(1)(A)(i).
-
-b. ___Other rights.___
-
-   1. Moral rights, such as the right of integrity, are not licensed under this Public License, nor are publicity, privacy, and/or other similar personality rights; however, to the extent possible, the Licensor waives and/or agrees not to assert any such rights held by the Licensor to the limited extent necessary to allow You to exercise the Licensed Rights, but not otherwise.
-
-   2. Patent and trademark rights are not licensed under this Public License.
-
-   3. To the extent possible, the Licensor waives any right to collect royalties from You for the exercise of the Licensed Rights, whether directly or through a collecting society under any voluntary or waivable statutory or compulsory licensing scheme. In all other cases the Licensor expressly reserves any right to collect such royalties, including when the Licensed Material is used other than for NonCommercial purposes.
-
-### Section 3 – License Conditions.
-
-Your exercise of the Licensed Rights is expressly made subject to the following conditions.
-
-a. ___Attribution.___
-
-   1. If You Share the Licensed Material (including in modified form), You must:
-
-       A. retain the following if it is supplied by the Licensor with the Licensed Material:
-
-         i. identification of the creator(s) of the Licensed Material and any others designated to receive attribution, in any reasonable manner requested by the Licensor (including by pseudonym if designated);
-
-         ii. a copyright notice;
-
-         iii. a notice that refers to this Public License;
-
-         iv. a notice that refers to the disclaimer of warranties;
-
-         v. a URI or hyperlink to the Licensed Material to the extent reasonably practicable;
-
-       B. indicate if You modified the Licensed Material and retain an indication of any previous modifications; and
-
-       C. indicate the Licensed Material is licensed under this Public License, and include the text of, or the URI or hyperlink to, this Public License.
-
-   2. You may satisfy the conditions in Section 3(a)(1) in any reasonable manner based on the medium, means, and context in which You Share the Licensed Material. For example, it may be reasonable to satisfy the conditions by providing a URI or hyperlink to a resource that includes the required information.
-
-   3. If requested by the Licensor, You must remove any of the information required by Section 3(a)(1)(A) to the extent reasonably practicable.
-
-   4. If You Share Adapted Material You produce, the Adapter's License You apply must not prevent recipients of the Adapted Material from complying with this Public License.
-
-### Section 4 – Sui Generis Database Rights.
-
-Where the Licensed Rights include Sui Generis Database Rights that apply to Your use of the Licensed Material:
-
-a. for the avoidance of doubt, Section 2(a)(1) grants You the right to extract, reuse, reproduce, and Share all or a substantial portion of the contents of the database for NonCommercial purposes only;
-
-b. if You include all or a substantial portion of the database contents in a database in which You have Sui Generis Database Rights, then the database in which You have Sui Generis Database Rights (but not its individual contents) is Adapted Material; and
-
-c. You must comply with the conditions in Section 3(a) if You Share all or a substantial portion of the contents of the database.
-
-For the avoidance of doubt, this Section 4 supplements and does not replace Your obligations under this Public License where the Licensed Rights include other Copyright and Similar Rights.
-
-### Section 5 – Disclaimer of Warranties and Limitation of Liability.
-
-a. __Unless otherwise separately undertaken by the Licensor, to the extent possible, the Licensor offers the Licensed Material as-is and as-available, and makes no representations or warranties of any kind concerning the Licensed Material, whether express, implied, statutory, or other. This includes, without limitation, warranties of title, merchantability, fitness for a particular purpose, non-infringement, absence of latent or other defects, accuracy, or the presence or absence of errors, whether or not known or discoverable. Where disclaimers of warranties are not allowed in full or in part, this disclaimer may not apply to You.__
-
-b. __To the extent possible, in no event will the Licensor be liable to You on any legal theory (including, without limitation, negligence) or otherwise for any direct, special, indirect, incidental, consequential, punitive, exemplary, or other losses, costs, expenses, or damages arising out of this Public License or use of the Licensed Material, even if the Licensor has been advised of the possibility of such losses, costs, expenses, or damages. Where a limitation of liability is not allowed in full or in part, this limitation may not apply to You.__
-
-c. The disclaimer of warranties and limitation of liability provided above shall be interpreted in a manner that, to the extent possible, most closely approximates an absolute disclaimer and waiver of all liability.
-
-### Section 6 – Term and Termination.
-
-a. This Public License applies for the term of the Copyright and Similar Rights licensed here. However, if You fail to comply with this Public License, then Your rights under this Public License terminate automatically.
-
-b. Where Your right to use the Licensed Material has terminated under Section 6(a), it reinstates:
-
-   1. automatically as of the date the violation is cured, provided it is cured within 30 days of Your discovery of the violation; or
-
-   2. upon express reinstatement by the Licensor.
-
-   For the avoidance of doubt, this Section 6(b) does not affect any right the Licensor may have to seek remedies for Your violations of this Public License.
-
-c. For the avoidance of doubt, the Licensor may also offer the Licensed Material under separate terms or conditions or stop distributing the Licensed Material at any time; however, doing so will not terminate this Public License.
-
-d. Sections 1, 5, 6, 7, and 8 survive termination of this Public License.
-
-### Section 7 – Other Terms and Conditions.
-
-a. The Licensor shall not be bound by any additional or different terms or conditions communicated by You unless expressly agreed.
-
-b. Any arrangements, understandings, or agreements regarding the Licensed Material not stated herein are separate from and independent of the terms and conditions of this Public License.
-
-### Section 8 – Interpretation.
-
-a. For the avoidance of doubt, this Public License does not, and shall not be interpreted to, reduce, limit, restrict, or impose conditions on any use of the Licensed Material that could lawfully be made without permission under this Public License.
-
-b. To the extent possible, if any provision of this Public License is deemed unenforceable, it shall be automatically reformed to the minimum extent necessary to make it enforceable. If the provision cannot be reformed, it shall be severed from this Public License without affecting the enforceability of the remaining terms and conditions.
-
-c. No term or condition of this Public License will be waived and no failure to comply consented to unless expressly agreed to by the Licensor.
-
-d. Nothing in this Public License constitutes or may be interpreted as a limitation upon, or waiver of, any privileges and immunities that apply to the Licensor or You, including from the legal processes of any jurisdiction or authority.
-
-> Creative Commons is not a party to its public licenses. Notwithstanding, Creative Commons may elect to apply one of its public licenses to material it publishes and in those instances will be considered the “Licensor.” Except for the limited purpose of indicating that material is shared under a Creative Commons public license or as otherwise permitted by the Creative Commons policies published at [creativecommons.org/policies](http://creativecommons.org/policies), Creative Commons does not authorize the use of the trademark “Creative Commons” or any other trademark or logo of Creative Commons without its prior written consent including, without limitation, in connection with any unauthorized modifications to any of its public licenses or any other arrangements, understandings, or agreements concerning use of licensed material. For the avoidance of doubt, this paragraph does not form part of the public licenses.
->
-> Creative Commons may be contacted at creativecommons.org
-
-
-### Commercial licensing opportunities
-For commercial uses of the Model & Software, please send email to cmm[AT]nankai.edu.cn
-
-Citation:
-
-@inproceedings{licvpr23amt,
-    title     = {AMT: All-Pairs Multi-Field Transforms for Efficient Frame Interpolation},
-    author    = {Li, Zhen and Zhu, Zuo-Liang and Han, Ling-Hao and Hou, Qibin and Guo, Chun-Le and Cheng, Ming-Ming},
-    booktitle = {IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
-    year      = {2023}
-}
-
-Copyright (c) 2023 MCG-NKU
diff --git a/eval/vbench/third_party/amt/README.md b/eval/vbench/third_party/amt/README.md
deleted file mode 100644
index b7238c48..00000000
--- a/eval/vbench/third_party/amt/README.md
+++ /dev/null
@@ -1,166 +0,0 @@
-# AMT: All-Pairs Multi-Field Transforms for Efficient Frame Interpolation
-
-
-This repository contains the official implementation of the following paper:
-> **AMT: All-Pairs Multi-Field Transforms for Efficient Frame Interpolation**<br>
-> [Zhen Li](https://paper99.github.io/)<sup>\*</sup>, [Zuo-Liang Zhu](https://nk-cs-zzl.github.io/)<sup>\*</sup>, [Ling-Hao Han](https://scholar.google.com/citations?user=0ooNdgUAAAAJ&hl=en), [Qibin Hou](https://scholar.google.com/citations?hl=en&user=fF8OFV8AAAAJ&view_op=list_works), [Chun-Le Guo](https://scholar.google.com/citations?hl=en&user=RZLYwR0AAAAJ),  [Ming-Ming Cheng](https://mmcheng.net/cmm)<br>
-> (\* denotes equal contribution) <br>
-> Nankai University <br>
-> In CVPR 2023<br>
-
-[[Paper](https://arxiv.org/abs/2304.09790)]
-[[Project Page](https://nk-cs-zzl.github.io/projects/amt/index.html)]
-[[Web demos](#web-demos)]
-[Video]
-
-AMT is a **lightweight, fast, and accurate** algorithm for Frame Interpolation.
-It aims to provide practical solutions for **video generation** from **a few given frames (at least two frames)**.
-
-![Demo gif](assets/amt_demo.gif)
-* More examples can be found in our [project page](https://nk-cs-zzl.github.io/projects/amt/index.html).
-
-## Web demos
-Integrated into [Hugging Face Spaces 🤗](https://huggingface.co/spaces) using [Gradio](https://github.com/gradio-app/gradio). Try out the Web Demo: [![Hugging Face Spaces](https://img.shields.io/badge/%F0%9F%A4%97%20Hugging%20Face-Spaces-blue)](https://huggingface.co/spaces/NKU-AMT/AMT)
-
-Try AMT to interpolate between two or more images at [![PyTTI-Tools:FILM](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/drive/1IeVO5BmLouhRh6fL2z_y18kgubotoaBq?usp=sharing)
-
-
-## Change Log
-- **Apr 20, 2023**: Our code is publicly available.
-
-
-## Method Overview
-![pipeline](https://user-images.githubusercontent.com/21050959/229420451-65951bd0-732c-4f09-9121-f291a3862d6e.png)
-
-For technical details, please refer to the [method.md](docs/method.md) file, or read the full report on [arXiv](https://arxiv.org/abs/2304.09790).
-
-## Dependencies and Installation
-1. Clone Repo
-
-   ```bash
-   git clone https://github.com/MCG-NKU/AMT.git
-   ```
-
-2. Create Conda Environment and Install Dependencies
-
-   ```bash
-   conda env create -f environment.yaml
-   conda activate amt
-   ```
-3. Download pretrained models for demos from [Pretrained Models](#pretrained-models) and place them to the `pretrained` folder
-
-## Quick Demo
-
-**Note that the selected pretrained model (`[CKPT_PATH]`) needs to match the config file (`[CFG]`).**
-
- > Creating a video demo, increasing $n$ will slow down the motion in the video. (With $m$ input frames, `[N_ITER]` $=n$ corresponds to $2^n\times (m-1)+1$ output frames.)
-
-
- ```bash
- python demos/demo_2x.py -c [CFG] -p [CKPT] -n [N_ITER] -i [INPUT] -o [OUT_PATH] -r [FRAME_RATE]
- # e.g. [INPUT]
- # -i could be a video / a regular expression / a folder contains multiple images
- # -i demo.mp4 (video)/img_*.png (regular expression)/img0.png img1.png (images)/demo_input (folder)
-
- # e.g. a simple usage
- python demos/demo_2x.py -c cfgs/AMT-S.yaml -p pretrained/amt-s.pth -n 6 -i assets/quick_demo/img0.png assets/quick_demo/img1.png
-
- ```
-
- + Note: Please enable `--save_images` for saving the output images (Save speed will be slowed down if there are too many output images)
- + Input type supported: `a video` / `a regular expression` / `multiple images` / `a folder containing input frames`.
- + Results are in the `[OUT_PATH]` (default is `results/2x`) folder.
-
-## Pretrained Models
-
-<p id="Pretrained"></p>
-
-<table>
-<thead>
-  <tr>
-    <th> Dataset </th>
-    <th> :link: Download Links </th>
-    <th> Config file </th>
-    <th> Trained on </th>
-    <th> Arbitrary/Fixed </th>
-  </tr>
-</thead>
-<tbody>
-  <tr>
-    <td>AMT-S</td>
-    <th> [<a href="https://drive.google.com/file/d/1WmOKmQmd6pnLpID8EpUe-TddFpJuavrL/view?usp=share_link">Google Driver</a>][<a href="https://pan.baidu.com/s/1yGaNLeb9TG5-81t0skrOUA?pwd=f66n">Baidu Cloud</a>][<a href="https://huggingface.co/lalala125/AMT/resolve/main/amt-s.pth">Hugging Face</a>] </th>
-    <th> [<a href="cfgs/AMT-S.yaml">cfgs/AMT-S</a>] </th>
-    <th>Vimeo90k</th>
-    <th>Fixed</th>
-  </tr>
-  <tr>
-    <td>AMT-L</td>
-    <th>[<a href="https://drive.google.com/file/d/1UyhYpAQLXMjFA55rlFZ0kdiSVTL7oU-z/view?usp=share_link">Google Driver</a>][<a href="https://pan.baidu.com/s/1qI4fBgS405Bd4Wn1R3Gbeg?pwd=nbne">Baidu Cloud</a>][<a href="https://huggingface.co/lalala125/AMT/resolve/main/amt-l.pth">Hugging Face</a>]</th>
-    <th> [<a href="cfgs/AMT-L.yaml">cfgs/AMT-L</a>] </th>
-    <th>Vimeo90k</th>
-    <th>Fixed</th>
-  </tr>
-  <tr>
-    <td>AMT-G</td>
-    <th>[<a href="https://drive.google.com/file/d/1yieLtKh4ei3gOrLN1LhKSP_9157Q-mtP/view?usp=share_link">Google Driver</a>][<a href="https://pan.baidu.com/s/1AjmQVziQut1bXgQnDcDKvA?pwd=caf6">Baidu Cloud</a>][<a href="https://huggingface.co/lalala125/AMT/resolve/main/amt-g.pth">Hugging Face</a>] </th>
-    <th> [<a href="cfgs/AMT-G.yaml">cfgs/AMT-G</a>] </th>
-    <th>Vimeo90k</th>
-    <th>Fixed</th>
-  </tr>
-  <tr>
-    <td>AMT-S</td>
-    <th>[<a href="https://drive.google.com/file/d/1f1xAF0EDm-rjDdny8_aLyeedfM0QL4-C/view?usp=share_link">Google Driver</a>][<a href="https://pan.baidu.com/s/1eZtoULyduQM8AkXeYEBOEw?pwd=8hy3">Baidu Cloud</a>][<a href="https://huggingface.co/lalala125/AMT/resolve/main/gopro_amt-s.pth">Hugging Face</a>] </th>
-    <th> [<a href="cfgs/AMT-S_gopro.yaml">cfgs/AMT-S_gopro</a>] </th>
-    <th>GoPro</th>
-    <th>Arbitrary</th>
-  </tr>
-</tbody>
-</table>
-
-## Training and Evaluation
-
-Please refer to [develop.md](docs/develop.md) to learn how to benchmark the AMT and how to train a new AMT model from scratch.
-
-
-## Citation
-   If you find our repo useful for your research, please consider citing our paper:
-
-   ```bibtex
-   @inproceedings{licvpr23amt,
-      title={AMT: All-Pairs Multi-Field Transforms for Efficient Frame Interpolation},
-      author={Li, Zhen and Zhu, Zuo-Liang and Han, Ling-Hao and Hou, Qibin and Guo, Chun-Le and Cheng, Ming-Ming},
-      booktitle={IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
-      year={2023}
-   }
-   ```
-
-
-## License
-This code is licensed under the [Creative Commons Attribution-NonCommercial 4.0 International](https://creativecommons.org/licenses/by-nc/4.0/) for non-commercial use only.
-Please note that any commercial use of this code requires formal permission prior to use.
-
-## Contact
-
-For technical questions, please contact `zhenli1031[AT]gmail.com` and `nkuzhuzl[AT]gmail.com`.
-
-For commercial licensing, please contact `cmm[AT]nankai.edu.cn`
-
-## Acknowledgement
-
-We thank Jia-Wen Xiao, Zheng-Peng Duan, Rui-Qi Wu, and Xin Jin for proof reading.
-We thank [Zhewei Huang](https://github.com/hzwer) for his suggestions.
-
-Here are some great resources we benefit from:
-
-- [IFRNet](https://github.com/ltkong218/IFRNet) and [RIFE](https://github.com/megvii-research/ECCV2022-RIFE) for data processing, benchmarking, and loss designs.
-- [RAFT](https://github.com/princeton-vl/RAFT), [M2M-VFI](https://github.com/feinanshan/M2M_VFI), and [GMFlow](https://github.com/haofeixu/gmflow) for inspirations.
-- [FILM](https://github.com/google-research/frame-interpolation) for Web demo reference.
-
-
-**If you develop/use AMT in your projects, welcome to let us know. We will list your projects in this repository.**
-
-We also thank all of our contributors.
-
-<a href="https://github.com/MCG-NKU/AMT/graphs/contributors">
-  <img src="https://contrib.rocks/image?repo=MCG-NKU/AMT" />
-</a>
diff --git a/eval/vbench/third_party/amt/__init__.py b/eval/vbench/third_party/amt/__init__.py
deleted file mode 100644
index e69de29b..00000000
diff --git a/eval/vbench/third_party/amt/benchmarks/__init__.py b/eval/vbench/third_party/amt/benchmarks/__init__.py
deleted file mode 100644
index e69de29b..00000000
diff --git a/eval/vbench/third_party/amt/benchmarks/adobe240.py b/eval/vbench/third_party/amt/benchmarks/adobe240.py
deleted file mode 100644
index a262e783..00000000
--- a/eval/vbench/third_party/amt/benchmarks/adobe240.py
+++ /dev/null
@@ -1,62 +0,0 @@
-import argparse
-import sys
-
-import numpy as np
-import torch
-import tqdm
-from omegaconf import OmegaConf
-
-sys.path.append(".")
-from datasets.adobe_datasets import Adobe240_Dataset
-from metrics.psnr_ssim import calculate_psnr, calculate_ssim
-from utils.build_utils import build_from_cfg
-
-parser = argparse.ArgumentParser(
-    prog="AMT",
-    description="Adobe240 evaluation",
-)
-parser.add_argument("-c", "--config", default="cfgs/AMT-S_gopro.yaml")
-parser.add_argument(
-    "-p",
-    "--ckpt",
-    default="pretrained/gopro_amt-s.pth",
-)
-parser.add_argument(
-    "-r",
-    "--root",
-    default="data/Adobe240/test_frames",
-)
-args = parser.parse_args()
-
-device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
-cfg_path = args.config
-ckpt_path = args.ckpt
-root = args.root
-
-network_cfg = OmegaConf.load(cfg_path).network
-network_name = network_cfg.name
-model = build_from_cfg(network_cfg)
-ckpt = torch.load(ckpt_path)
-model.load_state_dict(ckpt["state_dict"])
-model = model.to(device)
-model.eval()
-
-dataset = Adobe240_Dataset(dataset_dir=root, augment=False)
-
-psnr_list = []
-ssim_list = []
-pbar = tqdm.tqdm(dataset, total=len(dataset))
-for data in pbar:
-    input_dict = {}
-    for k, v in data.items():
-        input_dict[k] = v.to(device).unsqueeze(0)
-    with torch.no_grad():
-        imgt_pred = model(**input_dict)["imgt_pred"]
-        psnr = calculate_psnr(imgt_pred, input_dict["imgt"])
-        ssim = calculate_ssim(imgt_pred, input_dict["imgt"])
-    psnr_list.append(psnr)
-    ssim_list.append(ssim)
-    avg_psnr = np.mean(psnr_list)
-    avg_ssim = np.mean(ssim_list)
-    desc_str = f"[{network_name}/Adobe240] psnr: {avg_psnr:.02f}, ssim: {avg_ssim:.04f}"
-    pbar.set_description_str(desc_str)
diff --git a/eval/vbench/third_party/amt/benchmarks/gopro.py b/eval/vbench/third_party/amt/benchmarks/gopro.py
deleted file mode 100644
index 96d8fb8c..00000000
--- a/eval/vbench/third_party/amt/benchmarks/gopro.py
+++ /dev/null
@@ -1,62 +0,0 @@
-import argparse
-import sys
-
-import numpy as np
-import torch
-import tqdm
-from omegaconf import OmegaConf
-
-sys.path.append(".")
-from datasets.gopro_datasets import GoPro_Test_Dataset
-from metrics.psnr_ssim import calculate_psnr, calculate_ssim
-from utils.build_utils import build_from_cfg
-
-parser = argparse.ArgumentParser(
-    prog="AMT",
-    description="GOPRO evaluation",
-)
-parser.add_argument("-c", "--config", default="cfgs/AMT-S_gopro.yaml")
-parser.add_argument(
-    "-p",
-    "--ckpt",
-    default="pretrained/gopro_amt-s.pth",
-)
-parser.add_argument(
-    "-r",
-    "--root",
-    default="data/GOPRO",
-)
-args = parser.parse_args()
-
-device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
-cfg_path = args.config
-ckpt_path = args.ckpt
-root = args.root
-
-network_cfg = OmegaConf.load(cfg_path).network
-network_name = network_cfg.name
-model = build_from_cfg(network_cfg)
-ckpt = torch.load(ckpt_path)
-model.load_state_dict(ckpt["state_dict"])
-model = model.to(device)
-model.eval()
-
-dataset = GoPro_Test_Dataset(dataset_dir=root)
-
-psnr_list = []
-ssim_list = []
-pbar = tqdm.tqdm(dataset, total=len(dataset))
-for data in pbar:
-    input_dict = {}
-    for k, v in data.items():
-        input_dict[k] = v.to(device).unsqueeze(0)
-    with torch.no_grad():
-        imgt_pred = model(**input_dict)["imgt_pred"]
-        psnr = calculate_psnr(imgt_pred, input_dict["imgt"])
-        ssim = calculate_ssim(imgt_pred, input_dict["imgt"])
-    psnr_list.append(psnr)
-    ssim_list.append(ssim)
-    avg_psnr = np.mean(psnr_list)
-    avg_ssim = np.mean(ssim_list)
-    desc_str = f"[{network_name}/GOPRO] psnr: {avg_psnr:.02f}, ssim: {avg_ssim:.04f}"
-    pbar.set_description_str(desc_str)
diff --git a/eval/vbench/third_party/amt/benchmarks/snu_film.py b/eval/vbench/third_party/amt/benchmarks/snu_film.py
deleted file mode 100644
index 040df7ec..00000000
--- a/eval/vbench/third_party/amt/benchmarks/snu_film.py
+++ /dev/null
@@ -1,76 +0,0 @@
-import argparse
-import os
-import os.path as osp
-import sys
-
-import numpy as np
-import torch
-import tqdm
-from omegaconf import OmegaConf
-
-sys.path.append(".")
-from metrics.psnr_ssim import calculate_psnr, calculate_ssim
-from utils.build_utils import build_from_cfg
-from utils.utils import InputPadder, img2tensor, read
-
-
-def parse_path(path):
-    path_list = path.split("/")
-    new_path = osp.join(*path_list[-3:])
-    return new_path
-
-
-parser = argparse.ArgumentParser(
-    prog="AMT",
-    description="SNU-FILM evaluation",
-)
-parser.add_argument("-c", "--config", default="cfgs/AMT-S.yaml")
-parser.add_argument("-p", "--ckpt", default="pretrained/amt-s.pth")
-parser.add_argument("-r", "--root", default="data/SNU_FILM")
-args = parser.parse_args()
-
-device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
-cfg_path = args.config
-ckpt_path = args.ckpt
-root = args.root
-
-network_cfg = OmegaConf.load(cfg_path).network
-network_name = network_cfg.name
-model = build_from_cfg(network_cfg)
-ckpt = torch.load(ckpt_path)
-model.load_state_dict(ckpt["state_dict"])
-model = model.to(device)
-model.eval()
-
-divisor = 20
-scale_factor = 0.8
-splits = ["easy", "medium", "hard", "extreme"]
-for split in splits:
-    with open(os.path.join(root, f"test-{split}.txt"), "r") as fr:
-        file_list = [l.strip().split(" ") for l in fr.readlines()]
-    pbar = tqdm.tqdm(file_list, total=len(file_list))
-
-    psnr_list = []
-    ssim_list = []
-    for name in pbar:
-        img0 = img2tensor(read(osp.join(root, parse_path(name[0])))).to(device)
-        imgt = img2tensor(read(osp.join(root, parse_path(name[1])))).to(device)
-        img1 = img2tensor(read(osp.join(root, parse_path(name[2])))).to(device)
-        padder = InputPadder(img0.shape, divisor)
-        img0, img1 = padder.pad(img0, img1)
-
-        embt = torch.tensor(1 / 2).float().view(1, 1, 1, 1).to(device)
-        imgt_pred = model(img0, img1, embt, scale_factor=scale_factor, eval=True)[
-            "imgt_pred"
-        ]
-        imgt_pred = padder.unpad(imgt_pred)
-
-        psnr = calculate_psnr(imgt_pred, imgt).detach().cpu().numpy()
-        ssim = calculate_ssim(imgt_pred, imgt).detach().cpu().numpy()
-
-        psnr_list.append(psnr)
-        ssim_list.append(ssim)
-        avg_psnr = np.mean(psnr_list)
-        avg_ssim = np.mean(ssim_list)
-        desc_str = f"[{network_name}/SNU-FILM] [{split}] psnr: {avg_psnr:.02f}, ssim: {avg_ssim:.04f}"
-        pbar.set_description_str(desc_str)
diff --git a/eval/vbench/third_party/amt/benchmarks/speed_parameters.py b/eval/vbench/third_party/amt/benchmarks/speed_parameters.py
deleted file mode 100644
index 762886be..00000000
--- a/eval/vbench/third_party/amt/benchmarks/speed_parameters.py
+++ /dev/null
@@ -1,39 +0,0 @@
-import argparse
-import sys
-import time
-
-import torch
-from omegaconf import OmegaConf
-
-sys.path.append(".")
-from utils.build_utils import build_from_cfg
-
-parser = argparse.ArgumentParser(
-    prog="AMT",
-    description="Speed&parameter benchmark",
-)
-parser.add_argument("-c", "--config", default="cfgs/AMT-S.yaml")
-args = parser.parse_args()
-
-cfg_path = args.config
-network_cfg = OmegaConf.load(cfg_path).network
-model = build_from_cfg(network_cfg)
-model = model.cuda()
-model.eval()
-
-img0 = torch.randn(1, 3, 256, 448).cuda()
-img1 = torch.randn(1, 3, 256, 448).cuda()
-embt = torch.tensor(1 / 2).float().view(1, 1, 1, 1).cuda()
-
-with torch.no_grad():
-    for i in range(100):
-        out = model(img0, img1, embt, eval=True)
-    torch.cuda.synchronize()
-    time_stamp = time.time()
-    for i in range(1000):
-        out = model(img0, img1, embt, eval=True)
-    torch.cuda.synchronize()
-    print("Time: {:.5f}s".format((time.time() - time_stamp) / 1))
-
-total = sum([param.nelement() for param in model.parameters()])
-print("Parameters: {:.2f}M".format(total / 1e6))
diff --git a/eval/vbench/third_party/amt/benchmarks/ucf101.py b/eval/vbench/third_party/amt/benchmarks/ucf101.py
deleted file mode 100644
index 8632f38f..00000000
--- a/eval/vbench/third_party/amt/benchmarks/ucf101.py
+++ /dev/null
@@ -1,60 +0,0 @@
-import argparse
-import os
-import os.path as osp
-import sys
-
-import numpy as np
-import torch
-import tqdm
-from omegaconf import OmegaConf
-
-sys.path.append(".")
-from metrics.psnr_ssim import calculate_psnr, calculate_ssim
-from utils.build_utils import build_from_cfg
-from utils.utils import img2tensor, read
-
-parser = argparse.ArgumentParser(
-    prog="AMT",
-    description="UCF101 evaluation",
-)
-parser.add_argument("-c", "--config", default="cfgs/AMT-S.yaml")
-parser.add_argument("-p", "--ckpt", default="pretrained/amt-s.pth")
-parser.add_argument("-r", "--root", default="data/ucf101_interp_ours")
-args = parser.parse_args()
-
-device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
-cfg_path = args.config
-ckpt_path = args.ckpt
-root = args.root
-
-network_cfg = OmegaConf.load(cfg_path).network
-network_name = network_cfg.name
-model = build_from_cfg(network_cfg)
-ckpt = torch.load(ckpt_path)
-model.load_state_dict(ckpt["state_dict"])
-model = model.to(device)
-model.eval()
-
-dirs = sorted(os.listdir(root))
-psnr_list = []
-ssim_list = []
-pbar = tqdm.tqdm(dirs, total=len(dirs))
-for d in pbar:
-    dir_path = osp.join(root, d)
-    I0 = img2tensor(read(osp.join(dir_path, "frame_00.png"))).to(device)
-    I1 = img2tensor(read(osp.join(dir_path, "frame_01_gt.png"))).to(device)
-    I2 = img2tensor(read(osp.join(dir_path, "frame_02.png"))).to(device)
-    embt = torch.tensor(1 / 2).float().view(1, 1, 1, 1).to(device)
-
-    I1_pred = model(I0, I2, embt, eval=True)["imgt_pred"]
-
-    psnr = calculate_psnr(I1_pred, I1).detach().cpu().numpy()
-    ssim = calculate_ssim(I1_pred, I1).detach().cpu().numpy()
-
-    psnr_list.append(psnr)
-    ssim_list.append(ssim)
-
-    avg_psnr = np.mean(psnr_list)
-    avg_ssim = np.mean(ssim_list)
-    desc_str = f"[{network_name}/UCF101] psnr: {avg_psnr:.02f}, ssim: {avg_ssim:.04f}"
-    pbar.set_description_str(desc_str)
diff --git a/eval/vbench/third_party/amt/benchmarks/vimeo90k.py b/eval/vbench/third_party/amt/benchmarks/vimeo90k.py
deleted file mode 100644
index 206b2c52..00000000
--- a/eval/vbench/third_party/amt/benchmarks/vimeo90k.py
+++ /dev/null
@@ -1,72 +0,0 @@
-import argparse
-import os.path as osp
-import sys
-
-import numpy as np
-import torch
-import tqdm
-from omegaconf import OmegaConf
-
-sys.path.append(".")
-from metrics.psnr_ssim import calculate_psnr, calculate_ssim
-from utils.build_utils import build_from_cfg
-from utils.utils import img2tensor, read
-
-parser = argparse.ArgumentParser(
-    prog="AMT",
-    description="Vimeo90K evaluation",
-)
-parser.add_argument("-c", "--config", default="cfgs/AMT-S.yaml")
-parser.add_argument(
-    "-p",
-    "--ckpt",
-    default="pretrained/amt-s.pth",
-)
-parser.add_argument(
-    "-r",
-    "--root",
-    default="data/vimeo_triplet",
-)
-args = parser.parse_args()
-
-device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
-cfg_path = args.config
-ckpt_path = args.ckpt
-root = args.root
-
-network_cfg = OmegaConf.load(cfg_path).network
-network_name = network_cfg.name
-model = build_from_cfg(network_cfg)
-ckpt = torch.load(ckpt_path)
-model.load_state_dict(ckpt["state_dict"])
-model = model.to(device)
-model.eval()
-
-with open(osp.join(root, "tri_testlist.txt"), "r") as fr:
-    file_list = fr.readlines()
-
-psnr_list = []
-ssim_list = []
-
-pbar = tqdm.tqdm(file_list, total=len(file_list))
-for name in pbar:
-    name = str(name).strip()
-    if len(name) <= 1:
-        continue
-    dir_path = osp.join(root, "sequences", name)
-    I0 = img2tensor(read(osp.join(dir_path, "im1.png"))).to(device)
-    I1 = img2tensor(read(osp.join(dir_path, "im2.png"))).to(device)
-    I2 = img2tensor(read(osp.join(dir_path, "im3.png"))).to(device)
-    embt = torch.tensor(1 / 2).float().view(1, 1, 1, 1).to(device)
-
-    I1_pred = model(I0, I2, embt, scale_factor=1.0, eval=True)["imgt_pred"]
-
-    psnr = calculate_psnr(I1_pred, I1).detach().cpu().numpy()
-    ssim = calculate_ssim(I1_pred, I1).detach().cpu().numpy()
-
-    psnr_list.append(psnr)
-    ssim_list.append(ssim)
-    avg_psnr = np.mean(psnr_list)
-    avg_ssim = np.mean(ssim_list)
-    desc_str = f"[{network_name}/Vimeo90K] psnr: {avg_psnr:.02f}, ssim: {avg_ssim:.04f}"
-    pbar.set_description_str(desc_str)
diff --git a/eval/vbench/third_party/amt/benchmarks/vimeo90k_tta.py b/eval/vbench/third_party/amt/benchmarks/vimeo90k_tta.py
deleted file mode 100644
index 6726b24c..00000000
--- a/eval/vbench/third_party/amt/benchmarks/vimeo90k_tta.py
+++ /dev/null
@@ -1,75 +0,0 @@
-import argparse
-import os.path as osp
-import sys
-
-import numpy as np
-import torch
-import tqdm
-from omegaconf import OmegaConf
-
-sys.path.append(".")
-from metrics.psnr_ssim import calculate_psnr, calculate_ssim
-from utils.build_utils import build_from_cfg
-from utils.utils import img2tensor, read
-
-parser = argparse.ArgumentParser(
-    prog="AMT",
-    description="Vimeo90K evaluation (with Test-Time Augmentation)",
-)
-parser.add_argument("-c", "--config", default="cfgs/AMT-S.yaml")
-parser.add_argument(
-    "p",
-    "--ckpt",
-    default="pretrained/amt-s.pth",
-)
-parser.add_argument(
-    "-r",
-    "--root",
-    default="data/vimeo_triplet",
-)
-args = parser.parse_args()
-
-device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
-cfg_path = args.config
-ckpt_path = args.ckpt
-root = args.root
-
-network_cfg = OmegaConf.load(cfg_path).network
-network_name = network_cfg.name
-model = build_from_cfg(network_cfg)
-ckpt = torch.load(ckpt_path)
-model.load_state_dict(ckpt["state_dict"])
-model = model.to(device)
-model.eval()
-
-with open(osp.join(root, "tri_testlist.txt"), "r") as fr:
-    file_list = fr.readlines()
-
-psnr_list = []
-ssim_list = []
-
-pbar = tqdm.tqdm(file_list, total=len(file_list))
-for name in pbar:
-    name = str(name).strip()
-    if len(name) <= 1:
-        continue
-    dir_path = osp.join(root, "sequences", name)
-    I0 = img2tensor(read(osp.join(dir_path, "im1.png"))).to(device)
-    I1 = img2tensor(read(osp.join(dir_path, "im2.png"))).to(device)
-    I2 = img2tensor(read(osp.join(dir_path, "im3.png"))).to(device)
-    embt = torch.tensor(1 / 2).float().view(1, 1, 1, 1).to(device)
-
-    I1_pred1 = model(I0, I2, embt, scale_factor=1.0, eval=True)["imgt_pred"]
-    I1_pred2 = model(
-        torch.flip(I0, [2]), torch.flip(I2, [2]), embt, scale_factor=1.0, eval=True
-    )["imgt_pred"]
-    I1_pred = I1_pred1 / 2 + torch.flip(I1_pred2, [2]) / 2
-    psnr = calculate_psnr(I1_pred, I1).detach().cpu().numpy()
-    ssim = calculate_ssim(I1_pred, I1).detach().cpu().numpy()
-
-    psnr_list.append(psnr)
-    ssim_list.append(ssim)
-    avg_psnr = np.mean(psnr_list)
-    avg_ssim = np.mean(ssim_list)
-    desc_str = f"[{network_name}/Vimeo90K] psnr: {avg_psnr:.02f}, ssim: {avg_ssim:.04f}"
-    pbar.set_description_str(desc_str)
diff --git a/eval/vbench/third_party/amt/benchmarks/xiph.py b/eval/vbench/third_party/amt/benchmarks/xiph.py
deleted file mode 100644
index 9689772a..00000000
--- a/eval/vbench/third_party/amt/benchmarks/xiph.py
+++ /dev/null
@@ -1,134 +0,0 @@
-import argparse
-import glob
-import os
-import os.path as osp
-import sys
-
-import cv2
-import numpy as np
-import torch
-import tqdm
-from omegaconf import OmegaConf
-
-sys.path.append(".")
-from metrics.psnr_ssim import calculate_psnr, calculate_ssim
-from utils.build_utils import build_from_cfg
-from utils.utils import InputPadder, img2tensor, read
-
-parser = argparse.ArgumentParser(
-    prog="AMT",
-    description="Xiph evaluation",
-)
-parser.add_argument("-c", "--config", default="cfgs/AMT-S.yaml")
-parser.add_argument("-p", "--ckpt", default="pretrained/amt-s.pth")
-parser.add_argument("-r", "--root", default="data/xiph")
-args = parser.parse_args()
-
-device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
-cfg_path = args.config
-ckpt_path = args.ckpt
-root = args.root
-
-network_cfg = OmegaConf.load(cfg_path).network
-network_name = network_cfg.name
-model = build_from_cfg(network_cfg)
-ckpt = torch.load(ckpt_path)
-model.load_state_dict(ckpt["state_dict"], False)
-model = model.to(device)
-model.eval()
-
-############################################# Prepare Dataset #############################################
-download_links = [
-    "https://media.xiph.org/video/derf/ElFuente/Netflix_BoxingPractice_4096x2160_60fps_10bit_420.y4m",
-    "https://media.xiph.org/video/derf/ElFuente/Netflix_Crosswalk_4096x2160_60fps_10bit_420.y4m",
-    "https://media.xiph.org/video/derf/Chimera/Netflix_DrivingPOV_4096x2160_60fps_10bit_420.y4m",
-    "https://media.xiph.org/video/derf/ElFuente/Netflix_FoodMarket_4096x2160_60fps_10bit_420.y4m",
-    "https://media.xiph.org/video/derf/ElFuente/Netflix_FoodMarket2_4096x2160_60fps_10bit_420.y4m",
-    "https://media.xiph.org/video/derf/ElFuente/Netflix_RitualDance_4096x2160_60fps_10bit_420.y4m",
-    "https://media.xiph.org/video/derf/ElFuente/Netflix_SquareAndTimelapse_4096x2160_60fps_10bit_420.y4m",
-    "https://media.xiph.org/video/derf/ElFuente/Netflix_Tango_4096x2160_60fps_10bit_420.y4m",
-]
-file_list = [
-    "BoxingPractice",
-    "Crosswalk",
-    "DrivingPOV",
-    "FoodMarket",
-    "FoodMarket2",
-    "RitualDance",
-    "SquareAndTimelapse",
-    "Tango",
-]
-
-for file_name, link in zip(file_list, download_links):
-    data_dir = osp.join(root, file_name)
-    if osp.exists(data_dir) is False:
-        os.makedirs(data_dir)
-    if len(glob.glob(f"{data_dir}/*.png")) < 100:
-        os.system(f"ffmpeg -i {link} -pix_fmt rgb24 -vframes 100 {data_dir}/%03d.png")
-############################################### Prepare End ###############################################
-
-
-divisor = 32
-scale_factor = 0.5
-for category in ["resized-2k", "cropped-4k"]:
-    psnr_list = []
-    ssim_list = []
-    pbar = tqdm.tqdm(file_list, total=len(file_list))
-    for flie_name in pbar:
-        dir_name = osp.join(root, flie_name)
-        for intFrame in range(2, 99, 2):
-            img0 = read(f"{dir_name}/{intFrame - 1:03d}.png")
-            img1 = read(f"{dir_name}/{intFrame + 1:03d}.png")
-            imgt = read(f"{dir_name}/{intFrame:03d}.png")
-
-            if category == "resized-2k":
-                img0 = cv2.resize(
-                    src=img0,
-                    dsize=(2048, 1080),
-                    fx=0.0,
-                    fy=0.0,
-                    interpolation=cv2.INTER_AREA,
-                )
-                img1 = cv2.resize(
-                    src=img1,
-                    dsize=(2048, 1080),
-                    fx=0.0,
-                    fy=0.0,
-                    interpolation=cv2.INTER_AREA,
-                )
-                imgt = cv2.resize(
-                    src=imgt,
-                    dsize=(2048, 1080),
-                    fx=0.0,
-                    fy=0.0,
-                    interpolation=cv2.INTER_AREA,
-                )
-
-            elif category == "cropped-4k":
-                img0 = img0[540:-540, 1024:-1024, :]
-                img1 = img1[540:-540, 1024:-1024, :]
-                imgt = imgt[540:-540, 1024:-1024, :]
-            img0 = img2tensor(img0).to(device)
-            imgt = img2tensor(imgt).to(device)
-            img1 = img2tensor(img1).to(device)
-            embt = torch.tensor(1 / 2).float().view(1, 1, 1, 1).to(device)
-
-            padder = InputPadder(img0.shape, divisor)
-            img0, img1 = padder.pad(img0, img1)
-
-            with torch.no_grad():
-                imgt_pred = model(
-                    img0, img1, embt, scale_factor=scale_factor, eval=True
-                )["imgt_pred"]
-                imgt_pred = padder.unpad(imgt_pred)
-
-            psnr = calculate_psnr(imgt_pred, imgt)
-            ssim = calculate_ssim(imgt_pred, imgt)
-
-            avg_psnr = np.mean(psnr_list)
-            avg_ssim = np.mean(ssim_list)
-            psnr_list.append(psnr)
-            ssim_list.append(ssim)
-            desc_str = f"[{network_name}/Xiph] [{category}/{flie_name}] psnr: {avg_psnr:.02f}, ssim: {avg_ssim:.04f}"
-
-            pbar.set_description_str(desc_str)
diff --git a/eval/vbench/third_party/amt/cfgs/AMT-G.yaml b/eval/vbench/third_party/amt/cfgs/AMT-G.yaml
deleted file mode 100644
index d07e33c2..00000000
--- a/eval/vbench/third_party/amt/cfgs/AMT-G.yaml
+++ /dev/null
@@ -1,62 +0,0 @@
-exp_name: floloss1e-2_300epoch_bs24_lr1p5e-4
-seed: 2023
-epochs: 300
-distributed: true
-lr: 1.5e-4
-lr_min: 2e-5
-weight_decay: 0.0
-resume_state: null
-save_dir: work_dir
-eval_interval: 1
-
-network:
-  name: networks.AMT-G.Model
-  params:
-    corr_radius: 3
-    corr_lvls: 4
-    num_flows: 5
-data:
-  train:
-    name: datasets.vimeo_datasets.Vimeo90K_Train_Dataset
-    params:
-      dataset_dir: data/vimeo_triplet
-  val:
-    name: datasets.vimeo_datasets.Vimeo90K_Test_Dataset
-    params:
-      dataset_dir: data/vimeo_triplet
-  train_loader:
-    batch_size: 24
-    num_workers: 12
-  val_loader:
-    batch_size: 24
-    num_workers: 3
-
-logger:
-  use_wandb: true
-  resume_id: null
-
-losses:
-  - {
-    name: losses.loss.CharbonnierLoss,
-    nickname: l_rec,
-    params: {
-      loss_weight: 1.0,
-      keys: [imgt_pred, imgt]
-    }
-  }
-  - {
-    name: losses.loss.TernaryLoss,
-    nickname: l_ter,
-    params: {
-      loss_weight: 1.0,
-      keys: [imgt_pred, imgt]
-    }
-  }
-  - {
-    name: losses.loss.MultipleFlowLoss,
-    nickname: l_flo,
-    params: {
-      loss_weight: 0.005,
-      keys: [flow0_pred, flow1_pred, flow]
-    }
-  }
diff --git a/eval/vbench/third_party/amt/cfgs/AMT-L.yaml b/eval/vbench/third_party/amt/cfgs/AMT-L.yaml
deleted file mode 100644
index 42861b73..00000000
--- a/eval/vbench/third_party/amt/cfgs/AMT-L.yaml
+++ /dev/null
@@ -1,62 +0,0 @@
-exp_name: floloss1e-2_300epoch_bs24_lr2e-4
-seed: 2023
-epochs: 300
-distributed: true
-lr: 2e-4
-lr_min: 2e-5
-weight_decay: 0.0
-resume_state: null
-save_dir: work_dir
-eval_interval: 1
-
-network:
-  name: networks.AMT-L.Model
-  params:
-    corr_radius: 3
-    corr_lvls: 4
-    num_flows: 5
-data:
-  train:
-    name: datasets.vimeo_datasets.Vimeo90K_Train_Dataset
-    params:
-      dataset_dir: data/vimeo_triplet
-  val:
-    name: datasets.vimeo_datasets.Vimeo90K_Test_Dataset
-    params:
-      dataset_dir: data/vimeo_triplet
-  train_loader:
-    batch_size: 24
-    num_workers: 12
-  val_loader:
-    batch_size: 24
-    num_workers: 3
-
-logger:
-  use_wandb: true
-  resume_id: null
-
-losses:
-  - {
-    name: losses.loss.CharbonnierLoss,
-    nickname: l_rec,
-    params: {
-      loss_weight: 1.0,
-      keys: [imgt_pred, imgt]
-    }
-  }
-  - {
-    name: losses.loss.TernaryLoss,
-    nickname: l_ter,
-    params: {
-      loss_weight: 1.0,
-      keys: [imgt_pred, imgt]
-    }
-  }
-  - {
-    name: losses.loss.MultipleFlowLoss,
-    nickname: l_flo,
-    params: {
-      loss_weight: 0.002,
-      keys: [flow0_pred, flow1_pred, flow]
-    }
-  }
diff --git a/eval/vbench/third_party/amt/cfgs/AMT-S.yaml b/eval/vbench/third_party/amt/cfgs/AMT-S.yaml
deleted file mode 100644
index 903ed8a8..00000000
--- a/eval/vbench/third_party/amt/cfgs/AMT-S.yaml
+++ /dev/null
@@ -1,63 +0,0 @@
-exp_name: floloss1e-2_300epoch_bs24_lr2e-4
-seed: 2023
-epochs: 300
-distributed: true
-lr: 2e-4
-lr_min: 2e-5
-weight_decay: 0.0
-resume_state: null
-save_dir: work_dir
-eval_interval: 1
-
-network:
-  name: networks.AMT-S.Model
-  params:
-    corr_radius: 3
-    corr_lvls: 4
-    num_flows: 3
-
-data:
-  train:
-    name: datasets.vimeo_datasets.Vimeo90K_Train_Dataset
-    params:
-      dataset_dir: data/vimeo_triplet
-  val:
-    name: datasets.vimeo_datasets.Vimeo90K_Test_Dataset
-    params:
-      dataset_dir: data/vimeo_triplet
-  train_loader:
-    batch_size: 24
-    num_workers: 12
-  val_loader:
-    batch_size: 24
-    num_workers: 3
-
-logger:
-  use_wandb: false
-  resume_id: null
-
-losses:
-  - {
-    name: losses.loss.CharbonnierLoss,
-    nickname: l_rec,
-    params: {
-      loss_weight: 1.0,
-      keys: [imgt_pred, imgt]
-    }
-  }
-  - {
-    name: losses.loss.TernaryLoss,
-    nickname: l_ter,
-    params: {
-      loss_weight: 1.0,
-      keys: [imgt_pred, imgt]
-    }
-  }
-  - {
-    name: losses.loss.MultipleFlowLoss,
-    nickname: l_flo,
-    params: {
-      loss_weight: 0.002,
-      keys: [flow0_pred, flow1_pred, flow]
-    }
-  }
diff --git a/eval/vbench/third_party/amt/cfgs/AMT-S_gopro.yaml b/eval/vbench/third_party/amt/cfgs/AMT-S_gopro.yaml
deleted file mode 100644
index ded91c51..00000000
--- a/eval/vbench/third_party/amt/cfgs/AMT-S_gopro.yaml
+++ /dev/null
@@ -1,55 +0,0 @@
-exp_name: wofloloss_400epoch_bs24_lr2e-4
-seed: 2023
-epochs: 400
-distributed: true
-lr: 2e-4
-lr_min: 2e-5
-weight_decay: 0.0
-resume_state: null
-save_dir: work_dir
-eval_interval: 1
-
-network:
-  name: networks.AMT-S.Model
-  params:
-    corr_radius: 3
-    corr_lvls: 4
-    num_flows: 3
-
-data:
-  train:
-    name: datasets.gopro_datasets.GoPro_Train_Dataset
-    params:
-      dataset_dir: data/GOPRO
-  val:
-    name: datasets.gopro_datasets.GoPro_Test_Dataset
-    params:
-      dataset_dir: data/GOPRO
-  train_loader:
-    batch_size: 24
-    num_workers: 12
-  val_loader:
-    batch_size: 24
-    num_workers: 3
-
-logger:
-  use_wandb: false
-  resume_id: null
-
-losses:
-  - {
-    name: losses.loss.CharbonnierLoss,
-    nickname: l_rec,
-    params: {
-      loss_weight: 1.0,
-      keys: [imgt_pred, imgt]
-    }
-  }
-  - {
-    name: losses.loss.TernaryLoss,
-    nickname: l_ter,
-    params: {
-      loss_weight: 1.0,
-      keys: [imgt_pred, imgt]
-    }
-  }
diff --git a/eval/vbench/third_party/amt/cfgs/IFRNet.yaml b/eval/vbench/third_party/amt/cfgs/IFRNet.yaml
deleted file mode 100644
index 7dd0d704..00000000
--- a/eval/vbench/third_party/amt/cfgs/IFRNet.yaml
+++ /dev/null
@@ -1,67 +0,0 @@
-exp_name: floloss1e-2_geoloss1e-2_300epoch_bs24_lr1e-4
-seed: 2023
-epochs: 300
-distributed: true
-lr: 1e-4
-lr_min: 1e-5
-weight_decay: 1e-6
-resume_state: null
-save_dir: work_dir
-eval_interval: 1
-
-network:
-  name: networks.IFRNet.Model
-
-data:
-  train:
-    name: datasets.datasets.Vimeo90K_Train_Dataset
-    params:
-      dataset_dir: data/vimeo_triplet
-  val:
-    name: datasets.datasets.Vimeo90K_Test_Dataset
-    params:
-      dataset_dir: data/vimeo_triplet
-  train_loader:
-    batch_size: 24
-    num_workers: 12
-  val_loader:
-    batch_size: 24
-    num_workers: 3
-
-logger:
-  use_wandb: true
-  resume_id: null
-
-losses:
-  - {
-    name: losses.loss.CharbonnierLoss,
-    nickname: l_rec,
-    params: {
-      loss_weight: 1.0,
-      keys: [imgt_pred, imgt]
-    }
-  }
-  - {
-    name: losses.loss.TernaryLoss,
-    nickname: l_ter,
-    params: {
-      loss_weight: 1.0,
-      keys: [imgt_pred, imgt]
-    }
-  }
-  - {
-    name: losses.loss.IFRFlowLoss,
-    nickname: l_flo,
-    params: {
-      loss_weight: 0.01,
-      keys: [flow0_pred, flow1_pred, flow]
-    }
-  }
-  - {
-    name: losses.loss.GeometryLoss,
-    nickname: l_geo,
-    params: {
-      loss_weight: 0.01,
-      keys: [ft_pred, ft_gt]
-    }
-  }
diff --git a/eval/vbench/third_party/amt/datasets/__init__.py b/eval/vbench/third_party/amt/datasets/__init__.py
deleted file mode 100644
index e69de29b..00000000
diff --git a/eval/vbench/third_party/amt/datasets/adobe_datasets.py b/eval/vbench/third_party/amt/datasets/adobe_datasets.py
deleted file mode 100644
index a22fe5bf..00000000
--- a/eval/vbench/third_party/amt/datasets/adobe_datasets.py
+++ /dev/null
@@ -1,101 +0,0 @@
-"""
-    This code is partially borrowed from IFRNet (https://github.com/ltkong218/IFRNet).
-"""
-
-import os
-import sys
-
-import numpy as np
-import torch
-from torch.utils.data import Dataset
-
-sys.path.append(".")
-from datasets.gopro_datasets import (
-    center_crop_woflow,
-    random_crop_woflow,
-    random_horizontal_flip_woflow,
-    random_resize_woflow,
-    random_reverse_channel_woflow,
-    random_reverse_time_woflow,
-    random_rotate_woflow,
-    random_vertical_flip_woflow,
-)
-from utils.utils import img2tensor, read
-
-
-class Adobe240_Dataset(Dataset):
-    def __init__(
-        self, dataset_dir="data/adobe240/test_frames", interFrames=7, augment=True
-    ):
-        super().__init__()
-        self.augment = augment
-        self.interFrames = interFrames
-        self.setLength = interFrames + 2
-        self.dataset_dir = os.path.join(dataset_dir)
-        video_list = os.listdir(self.dataset_dir)[9::10]
-        self.frames_list = []
-        self.file_list = []
-        for video in video_list:
-            frames = sorted(os.listdir(os.path.join(self.dataset_dir, video)))
-            n_sets = (len(frames) - self.setLength) // (interFrames + 1) + 1
-            videoInputs = [
-                frames[(interFrames + 1) * i : (interFrames + 1) * i + self.setLength]
-                for i in range(n_sets)
-            ]
-            videoInputs = [
-                [os.path.join(video, f) for f in group] for group in videoInputs
-            ]
-            self.file_list.extend(videoInputs)
-
-    def __getitem__(self, idx):
-        clip_idx = idx // self.interFrames
-        embt_idx = idx % self.interFrames
-        imgpaths = [
-            os.path.join(self.dataset_dir, fp) for fp in self.file_list[clip_idx]
-        ]
-        pick_idxs = list(range(0, self.setLength, self.interFrames + 1))
-        imgt_beg = self.setLength // 2 - self.interFrames // 2
-        imgt_end = self.setLength // 2 + self.interFrames // 2 + self.interFrames % 2
-        imgt_idx = list(range(imgt_beg, imgt_end))
-        input_paths = [imgpaths[idx] for idx in pick_idxs]
-        imgt_paths = [imgpaths[idx] for idx in imgt_idx]
-
-        img0 = np.array(read(input_paths[0]))
-        imgt = np.array(read(imgt_paths[embt_idx]))
-        img1 = np.array(read(input_paths[1]))
-        embt = torch.from_numpy(
-            np.array((embt_idx + 1) / (self.interFrames + 1))
-            .reshape(1, 1, 1)
-            .astype(np.float32)
-        )
-
-        if self.augment == True:
-            img0, imgt, img1 = random_resize_woflow(img0, imgt, img1, p=0.1)
-            img0, imgt, img1 = random_crop_woflow(
-                img0, imgt, img1, crop_size=(224, 224)
-            )
-            img0, imgt, img1 = random_reverse_channel_woflow(img0, imgt, img1, p=0.5)
-            img0, imgt, img1 = random_vertical_flip_woflow(img0, imgt, img1, p=0.3)
-            img0, imgt, img1 = random_horizontal_flip_woflow(img0, imgt, img1, p=0.5)
-            img0, imgt, img1 = random_rotate_woflow(img0, imgt, img1, p=0.05)
-            img0, imgt, img1, embt = random_reverse_time_woflow(
-                img0, imgt, img1, embt=embt, p=0.5
-            )
-        else:
-            img0, imgt, img1 = center_crop_woflow(
-                img0, imgt, img1, crop_size=(512, 512)
-            )
-
-        img0 = img2tensor(img0).squeeze(0)
-        imgt = img2tensor(imgt).squeeze(0)
-        img1 = img2tensor(img1).squeeze(0)
-
-        return {
-            "img0": img0.float(),
-            "imgt": imgt.float(),
-            "img1": img1.float(),
-            "embt": embt,
-        }
-
-    def __len__(self):
-        return len(self.file_list) * self.interFrames
diff --git a/eval/vbench/third_party/amt/datasets/gopro_datasets.py b/eval/vbench/third_party/amt/datasets/gopro_datasets.py
deleted file mode 100644
index 3cbfcfb1..00000000
--- a/eval/vbench/third_party/amt/datasets/gopro_datasets.py
+++ /dev/null
@@ -1,264 +0,0 @@
-"""
-    This code is partially borrowed from IFRNet (https://github.com/ltkong218/IFRNet).
-    In the consideration of the difficulty in flow supervision generation, we abort
-    flow loss in the 8x case.
-"""
-
-import os
-import random
-
-import cv2
-import numpy as np
-import torch
-from torch.utils.data import Dataset
-from utils.utils import img2tensor, read
-
-
-def random_resize_woflow(img0, imgt, img1, p=0.1):
-    if random.uniform(0, 1) < p:
-        img0 = cv2.resize(
-            img0, dsize=None, fx=2.0, fy=2.0, interpolation=cv2.INTER_LINEAR
-        )
-        imgt = cv2.resize(
-            imgt, dsize=None, fx=2.0, fy=2.0, interpolation=cv2.INTER_LINEAR
-        )
-        img1 = cv2.resize(
-            img1, dsize=None, fx=2.0, fy=2.0, interpolation=cv2.INTER_LINEAR
-        )
-    return img0, imgt, img1
-
-
-def random_crop_woflow(img0, imgt, img1, crop_size=(224, 224)):
-    h, w = crop_size[0], crop_size[1]
-    ih, iw, _ = img0.shape
-    x = np.random.randint(0, ih - h + 1)
-    y = np.random.randint(0, iw - w + 1)
-    img0 = img0[x : x + h, y : y + w, :]
-    imgt = imgt[x : x + h, y : y + w, :]
-    img1 = img1[x : x + h, y : y + w, :]
-    return img0, imgt, img1
-
-
-def center_crop_woflow(img0, imgt, img1, crop_size=(512, 512)):
-    h, w = crop_size[0], crop_size[1]
-    ih, iw, _ = img0.shape
-    img0 = img0[
-        ih // 2 - h // 2 : ih // 2 + h // 2, iw // 2 - w // 2 : iw // 2 + w // 2, :
-    ]
-    imgt = imgt[
-        ih // 2 - h // 2 : ih // 2 + h // 2, iw // 2 - w // 2 : iw // 2 + w // 2, :
-    ]
-    img1 = img1[
-        ih // 2 - h // 2 : ih // 2 + h // 2, iw // 2 - w // 2 : iw // 2 + w // 2, :
-    ]
-    return img0, imgt, img1
-
-
-def random_reverse_channel_woflow(img0, imgt, img1, p=0.5):
-    if random.uniform(0, 1) < p:
-        img0 = img0[:, :, ::-1]
-        imgt = imgt[:, :, ::-1]
-        img1 = img1[:, :, ::-1]
-    return img0, imgt, img1
-
-
-def random_vertical_flip_woflow(img0, imgt, img1, p=0.3):
-    if random.uniform(0, 1) < p:
-        img0 = img0[::-1]
-        imgt = imgt[::-1]
-        img1 = img1[::-1]
-    return img0, imgt, img1
-
-
-def random_horizontal_flip_woflow(img0, imgt, img1, p=0.5):
-    if random.uniform(0, 1) < p:
-        img0 = img0[:, ::-1]
-        imgt = imgt[:, ::-1]
-        img1 = img1[:, ::-1]
-    return img0, imgt, img1
-
-
-def random_rotate_woflow(img0, imgt, img1, p=0.05):
-    if random.uniform(0, 1) < p:
-        img0 = img0.transpose((1, 0, 2))
-        imgt = imgt.transpose((1, 0, 2))
-        img1 = img1.transpose((1, 0, 2))
-    return img0, imgt, img1
-
-
-def random_reverse_time_woflow(img0, imgt, img1, embt, p=0.5):
-    if random.uniform(0, 1) < p:
-        tmp = img1
-        img1 = img0
-        img0 = tmp
-    embt = 1 - embt
-    return img0, imgt, img1, embt
-
-
-class GoPro_Train_Dataset(Dataset):
-    def __init__(self, dataset_dir="data/GOPRO", interFrames=7, augment=True):
-        self.dataset_dir = dataset_dir + "/train"
-        self.interFrames = interFrames
-        self.augment = augment
-        self.setLength = interFrames + 2
-        video_list = [
-            "GOPR0372_07_00",
-            "GOPR0374_11_01",
-            "GOPR0378_13_00",
-            "GOPR0384_11_01",
-            "GOPR0384_11_04",
-            "GOPR0477_11_00",
-            "GOPR0868_11_02",
-            "GOPR0884_11_00",
-            "GOPR0372_07_01",
-            "GOPR0374_11_02",
-            "GOPR0379_11_00",
-            "GOPR0384_11_02",
-            "GOPR0385_11_00",
-            "GOPR0857_11_00",
-            "GOPR0871_11_01",
-            "GOPR0374_11_00",
-            "GOPR0374_11_03",
-            "GOPR0380_11_00",
-            "GOPR0384_11_03",
-            "GOPR0386_11_00",
-            "GOPR0868_11_01",
-            "GOPR0881_11_00",
-        ]
-        self.frames_list = []
-        self.file_list = []
-        for video in video_list:
-            frames = sorted(os.listdir(os.path.join(self.dataset_dir, video)))
-            n_sets = (len(frames) - self.setLength) // (interFrames + 1) + 1
-            videoInputs = [
-                frames[(interFrames + 1) * i : (interFrames + 1) * i + self.setLength]
-                for i in range(n_sets)
-            ]
-            videoInputs = [
-                [os.path.join(video, f) for f in group] for group in videoInputs
-            ]
-            self.file_list.extend(videoInputs)
-
-    def __len__(self):
-        return len(self.file_list) * self.interFrames
-
-    def __getitem__(self, idx):
-        clip_idx = idx // self.interFrames
-        embt_idx = idx % self.interFrames
-        imgpaths = [
-            os.path.join(self.dataset_dir, fp) for fp in self.file_list[clip_idx]
-        ]
-        pick_idxs = list(range(0, self.setLength, self.interFrames + 1))
-        imgt_beg = self.setLength // 2 - self.interFrames // 2
-        imgt_end = self.setLength // 2 + self.interFrames // 2 + self.interFrames % 2
-        imgt_idx = list(range(imgt_beg, imgt_end))
-        input_paths = [imgpaths[idx] for idx in pick_idxs]
-        imgt_paths = [imgpaths[idx] for idx in imgt_idx]
-
-        embt = torch.from_numpy(
-            np.array((embt_idx + 1) / (self.interFrames + 1))
-            .reshape(1, 1, 1)
-            .astype(np.float32)
-        )
-        img0 = np.array(read(input_paths[0]))
-        imgt = np.array(read(imgt_paths[embt_idx]))
-        img1 = np.array(read(input_paths[1]))
-
-        if self.augment == True:
-            img0, imgt, img1 = random_resize_woflow(img0, imgt, img1, p=0.1)
-            img0, imgt, img1 = random_crop_woflow(
-                img0, imgt, img1, crop_size=(224, 224)
-            )
-            img0, imgt, img1 = random_reverse_channel_woflow(img0, imgt, img1, p=0.5)
-            img0, imgt, img1 = random_vertical_flip_woflow(img0, imgt, img1, p=0.3)
-            img0, imgt, img1 = random_horizontal_flip_woflow(img0, imgt, img1, p=0.5)
-            img0, imgt, img1 = random_rotate_woflow(img0, imgt, img1, p=0.05)
-            img0, imgt, img1, embt = random_reverse_time_woflow(
-                img0, imgt, img1, embt=embt, p=0.5
-            )
-        else:
-            img0, imgt, img1 = center_crop_woflow(
-                img0, imgt, img1, crop_size=(512, 512)
-            )
-
-        img0 = img2tensor(img0.copy()).squeeze(0)
-        imgt = img2tensor(imgt.copy()).squeeze(0)
-        img1 = img2tensor(img1.copy()).squeeze(0)
-
-        return {
-            "img0": img0.float(),
-            "imgt": imgt.float(),
-            "img1": img1.float(),
-            "embt": embt,
-        }
-
-
-class GoPro_Test_Dataset(Dataset):
-    def __init__(self, dataset_dir="data/GOPRO", interFrames=7):
-        self.dataset_dir = dataset_dir + "/test"
-        self.interFrames = interFrames
-        self.setLength = interFrames + 2
-        video_list = [
-            "GOPR0384_11_00",
-            "GOPR0385_11_01",
-            "GOPR0410_11_00",
-            "GOPR0862_11_00",
-            "GOPR0869_11_00",
-            "GOPR0881_11_01",
-            "GOPR0384_11_05",
-            "GOPR0396_11_00",
-            "GOPR0854_11_00",
-            "GOPR0868_11_00",
-            "GOPR0871_11_00",
-        ]
-        self.frames_list = []
-        self.file_list = []
-        for video in video_list:
-            frames = sorted(os.listdir(os.path.join(self.dataset_dir, video)))
-            n_sets = (len(frames) - self.setLength) // (interFrames + 1) + 1
-            videoInputs = [
-                frames[(interFrames + 1) * i : (interFrames + 1) * i + self.setLength]
-                for i in range(n_sets)
-            ]
-            videoInputs = [
-                [os.path.join(video, f) for f in group] for group in videoInputs
-            ]
-            self.file_list.extend(videoInputs)
-
-    def __len__(self):
-        return len(self.file_list) * self.interFrames
-
-    def __getitem__(self, idx):
-        clip_idx = idx // self.interFrames
-        embt_idx = idx % self.interFrames
-        imgpaths = [
-            os.path.join(self.dataset_dir, fp) for fp in self.file_list[clip_idx]
-        ]
-        pick_idxs = list(range(0, self.setLength, self.interFrames + 1))
-        imgt_beg = self.setLength // 2 - self.interFrames // 2
-        imgt_end = self.setLength // 2 + self.interFrames // 2 + self.interFrames % 2
-        imgt_idx = list(range(imgt_beg, imgt_end))
-        input_paths = [imgpaths[idx] for idx in pick_idxs]
-        imgt_paths = [imgpaths[idx] for idx in imgt_idx]
-
-        img0 = np.array(read(input_paths[0]))
-        imgt = np.array(read(imgt_paths[embt_idx]))
-        img1 = np.array(read(input_paths[1]))
-
-        img0, imgt, img1 = center_crop_woflow(img0, imgt, img1, crop_size=(512, 512))
-
-        img0 = img2tensor(img0).squeeze(0)
-        imgt = img2tensor(imgt).squeeze(0)
-        img1 = img2tensor(img1).squeeze(0)
-
-        embt = torch.from_numpy(
-            np.array((embt_idx + 1) / (self.interFrames + 1))
-            .reshape(1, 1, 1)
-            .astype(np.float32)
-        )
-        return {
-            "img0": img0.float(),
-            "imgt": imgt.float(),
-            "img1": img1.float(),
-            "embt": embt,
-        }
diff --git a/eval/vbench/third_party/amt/datasets/vimeo_datasets.py b/eval/vbench/third_party/amt/datasets/vimeo_datasets.py
deleted file mode 100644
index 6b50cac3..00000000
--- a/eval/vbench/third_party/amt/datasets/vimeo_datasets.py
+++ /dev/null
@@ -1,230 +0,0 @@
-"""
-    This code is partially borrowed from IFRNet (https://github.com/ltkong218/IFRNet).
-"""
-
-import os
-import random
-
-import cv2
-import numpy as np
-import torch
-from torch.utils.data import Dataset
-from utils.utils import read
-
-
-def random_resize(img0, imgt, img1, flow, p=0.1):
-    if random.uniform(0, 1) < p:
-        img0 = cv2.resize(
-            img0, dsize=None, fx=2.0, fy=2.0, interpolation=cv2.INTER_LINEAR
-        )
-        imgt = cv2.resize(
-            imgt, dsize=None, fx=2.0, fy=2.0, interpolation=cv2.INTER_LINEAR
-        )
-        img1 = cv2.resize(
-            img1, dsize=None, fx=2.0, fy=2.0, interpolation=cv2.INTER_LINEAR
-        )
-        flow = (
-            cv2.resize(flow, dsize=None, fx=2.0, fy=2.0, interpolation=cv2.INTER_LINEAR)
-            * 2.0
-        )
-    return img0, imgt, img1, flow
-
-
-def random_crop(img0, imgt, img1, flow, crop_size=(224, 224)):
-    h, w = crop_size[0], crop_size[1]
-    ih, iw, _ = img0.shape
-    x = np.random.randint(0, ih - h + 1)
-    y = np.random.randint(0, iw - w + 1)
-    img0 = img0[x : x + h, y : y + w, :]
-    imgt = imgt[x : x + h, y : y + w, :]
-    img1 = img1[x : x + h, y : y + w, :]
-    flow = flow[x : x + h, y : y + w, :]
-    return img0, imgt, img1, flow
-
-
-def random_reverse_channel(img0, imgt, img1, flow, p=0.5):
-    if random.uniform(0, 1) < p:
-        img0 = img0[:, :, ::-1]
-        imgt = imgt[:, :, ::-1]
-        img1 = img1[:, :, ::-1]
-    return img0, imgt, img1, flow
-
-
-def random_vertical_flip(img0, imgt, img1, flow, p=0.3):
-    if random.uniform(0, 1) < p:
-        img0 = img0[::-1]
-        imgt = imgt[::-1]
-        img1 = img1[::-1]
-        flow = flow[::-1]
-        flow = np.concatenate(
-            (flow[:, :, 0:1], -flow[:, :, 1:2], flow[:, :, 2:3], -flow[:, :, 3:4]), 2
-        )
-    return img0, imgt, img1, flow
-
-
-def random_horizontal_flip(img0, imgt, img1, flow, p=0.5):
-    if random.uniform(0, 1) < p:
-        img0 = img0[:, ::-1]
-        imgt = imgt[:, ::-1]
-        img1 = img1[:, ::-1]
-        flow = flow[:, ::-1]
-        flow = np.concatenate(
-            (-flow[:, :, 0:1], flow[:, :, 1:2], -flow[:, :, 2:3], flow[:, :, 3:4]), 2
-        )
-    return img0, imgt, img1, flow
-
-
-def random_rotate(img0, imgt, img1, flow, p=0.05):
-    if random.uniform(0, 1) < p:
-        img0 = img0.transpose((1, 0, 2))
-        imgt = imgt.transpose((1, 0, 2))
-        img1 = img1.transpose((1, 0, 2))
-        flow = flow.transpose((1, 0, 2))
-        flow = np.concatenate(
-            (flow[:, :, 1:2], flow[:, :, 0:1], flow[:, :, 3:4], flow[:, :, 2:3]), 2
-        )
-    return img0, imgt, img1, flow
-
-
-def random_reverse_time(img0, imgt, img1, flow, p=0.5):
-    if random.uniform(0, 1) < p:
-        tmp = img1
-        img1 = img0
-        img0 = tmp
-        flow = np.concatenate((flow[:, :, 2:4], flow[:, :, 0:2]), 2)
-    return img0, imgt, img1, flow
-
-
-class Vimeo90K_Train_Dataset(Dataset):
-    def __init__(
-        self,
-        dataset_dir="data/vimeo_triplet",
-        flow_dir=None,
-        augment=True,
-        crop_size=(224, 224),
-    ):
-        self.dataset_dir = dataset_dir
-        self.augment = augment
-        self.crop_size = crop_size
-        self.img0_list = []
-        self.imgt_list = []
-        self.img1_list = []
-        self.flow_t0_list = []
-        self.flow_t1_list = []
-        if flow_dir is None:
-            flow_dir = "flow"
-        with open(os.path.join(dataset_dir, "tri_trainlist.txt"), "r") as f:
-            for i in f:
-                name = str(i).strip()
-                if len(name) <= 1:
-                    continue
-                self.img0_list.append(
-                    os.path.join(dataset_dir, "sequences", name, "im1.png")
-                )
-                self.imgt_list.append(
-                    os.path.join(dataset_dir, "sequences", name, "im2.png")
-                )
-                self.img1_list.append(
-                    os.path.join(dataset_dir, "sequences", name, "im3.png")
-                )
-                self.flow_t0_list.append(
-                    os.path.join(dataset_dir, flow_dir, name, "flow_t0.flo")
-                )
-                self.flow_t1_list.append(
-                    os.path.join(dataset_dir, flow_dir, name, "flow_t1.flo")
-                )
-
-    def __len__(self):
-        return len(self.imgt_list)
-
-    def __getitem__(self, idx):
-        img0 = read(self.img0_list[idx])
-        imgt = read(self.imgt_list[idx])
-        img1 = read(self.img1_list[idx])
-        flow_t0 = read(self.flow_t0_list[idx])
-        flow_t1 = read(self.flow_t1_list[idx])
-        flow = np.concatenate((flow_t0, flow_t1), 2).astype(np.float64)
-
-        if self.augment == True:
-            img0, imgt, img1, flow = random_resize(img0, imgt, img1, flow, p=0.1)
-            img0, imgt, img1, flow = random_crop(
-                img0, imgt, img1, flow, crop_size=self.crop_size
-            )
-            img0, imgt, img1, flow = random_reverse_channel(
-                img0, imgt, img1, flow, p=0.5
-            )
-            img0, imgt, img1, flow = random_vertical_flip(img0, imgt, img1, flow, p=0.3)
-            img0, imgt, img1, flow = random_horizontal_flip(
-                img0, imgt, img1, flow, p=0.5
-            )
-            img0, imgt, img1, flow = random_rotate(img0, imgt, img1, flow, p=0.05)
-            img0, imgt, img1, flow = random_reverse_time(img0, imgt, img1, flow, p=0.5)
-
-        img0 = torch.from_numpy(img0.transpose((2, 0, 1)).astype(np.float32) / 255.0)
-        imgt = torch.from_numpy(imgt.transpose((2, 0, 1)).astype(np.float32) / 255.0)
-        img1 = torch.from_numpy(img1.transpose((2, 0, 1)).astype(np.float32) / 255.0)
-        flow = torch.from_numpy(flow.transpose((2, 0, 1)).astype(np.float32))
-        embt = torch.from_numpy(np.array(1 / 2).reshape(1, 1, 1).astype(np.float32))
-
-        return {
-            "img0": img0.float(),
-            "imgt": imgt.float(),
-            "img1": img1.float(),
-            "flow": flow.float(),
-            "embt": embt,
-        }
-
-
-class Vimeo90K_Test_Dataset(Dataset):
-    def __init__(self, dataset_dir="data/vimeo_triplet"):
-        self.dataset_dir = dataset_dir
-        self.img0_list = []
-        self.imgt_list = []
-        self.img1_list = []
-        self.flow_t0_list = []
-        self.flow_t1_list = []
-        with open(os.path.join(dataset_dir, "tri_testlist.txt"), "r") as f:
-            for i in f:
-                name = str(i).strip()
-                if len(name) <= 1:
-                    continue
-                self.img0_list.append(
-                    os.path.join(dataset_dir, "sequences", name, "im1.png")
-                )
-                self.imgt_list.append(
-                    os.path.join(dataset_dir, "sequences", name, "im2.png")
-                )
-                self.img1_list.append(
-                    os.path.join(dataset_dir, "sequences", name, "im3.png")
-                )
-                self.flow_t0_list.append(
-                    os.path.join(dataset_dir, "flow", name, "flow_t0.flo")
-                )
-                self.flow_t1_list.append(
-                    os.path.join(dataset_dir, "flow", name, "flow_t1.flo")
-                )
-
-    def __len__(self):
-        return len(self.imgt_list)
-
-    def __getitem__(self, idx):
-        img0 = read(self.img0_list[idx])
-        imgt = read(self.imgt_list[idx])
-        img1 = read(self.img1_list[idx])
-        flow_t0 = read(self.flow_t0_list[idx])
-        flow_t1 = read(self.flow_t1_list[idx])
-        flow = np.concatenate((flow_t0, flow_t1), 2)
-
-        img0 = torch.from_numpy(img0.transpose((2, 0, 1)).astype(np.float32) / 255.0)
-        imgt = torch.from_numpy(imgt.transpose((2, 0, 1)).astype(np.float32) / 255.0)
-        img1 = torch.from_numpy(img1.transpose((2, 0, 1)).astype(np.float32) / 255.0)
-        flow = torch.from_numpy(flow.transpose((2, 0, 1)).astype(np.float32))
-        embt = torch.from_numpy(np.array(1 / 2).reshape(1, 1, 1).astype(np.float32))
-
-        return {
-            "img0": img0.float(),
-            "imgt": imgt.float(),
-            "img1": img1.float(),
-            "flow": flow.float(),
-            "embt": embt,
-        }
diff --git a/eval/vbench/third_party/amt/docs/develop.md b/eval/vbench/third_party/amt/docs/develop.md
deleted file mode 100644
index df5c7aa0..00000000
--- a/eval/vbench/third_party/amt/docs/develop.md
+++ /dev/null
@@ -1,239 +0,0 @@
-# Development for evaluation and training
-
-- [Datasets](#Datasets)
-- [Pretrained Models](#pretrained-models)
-- [Evaluation](#evaluation)
-- [Training](#training)
-
-## Datasets<p id="Datasets"></p>
-First, please prepare standard datasets for evaluation and training.
-
-We present most of prevailing datasets in video frame interpolation, though some are not used in our project. Hope this collection could help your research.
-
-<table>
-<thead>
-  <tr>
-    <th> Dataset </th>
-    <th> :link: Source </th>
-    <th> Train/Eval </th>
-    <th> Arbitrary/Fixed </th>
-  </tr>
-</thead>
-<tbody>
-  <tr>
-    <td>Vimeo90k</td>
-    <th><a href="http://toflow.csail.mit.edu/">ToFlow (IJCV 2019)</a></th>
-    <th>Both</th>
-    <th>Fixed</th>
-  </tr>
-  <tr>
-    <td>ATD-12K</td>
-    <th><a href="https://github.com/lisiyao21/AnimeInterp">AnimeInterp (CVPR 2021)</a></th>
-    <th>Both</th>
-    <th>Fixed</th>
-  </tr>
-  <tr>
-    <td>SNU-FILM</td>
-    <th><a href="https://myungsub.github.io/CAIN/">CAIN (AAAI 2021)</a></th>
-    <th>Eval</th>
-    <th>Fixed</th>
-  </tr>
-  <tr>
-    <td>UCF101</td>
-    <th><a href="https://drive.google.com/file/d/0B7EVK8r0v71pdHBNdXB6TE1wSTQ/view?resourcekey=0-r6ihCy20h3kbgZ3ZdimPiA">Google Driver</a></th>
-    <th>Eval</th>
-    <th>Fixed</th>
-  </tr>
-  <tr>
-    <td>HD</td>
-    <th><a href="https://github.com/baowenbo/MEMC-Net">MEMC-Net (TPAMI 2018)</a>/<a href="https://github.com/baowenbo/MEMC-Net">Google Driver</a></th>
-    <th>Eval</th>
-    <th>Fixed</th>
-  </tr>
-  <tr>
-    <td>Xiph-2k/-4k</td>
-    <th><a href="https://github.com/sniklaus/softmax-splatting/blob/master/benchmark_xiph.py">SoftSplat (CVPR 2020)</a></th>
-    <th>Eval</th>
-    <th>Fixed</th>
-  </tr>
-  <tr>
-    <td>MiddleBury</td>
-    <th><a href="https://vision.middlebury.edu/flow/data/">MiddleBury</a></th>
-    <th>Eval</th>
-    <th>Fixed</th>
-  </tr>
-  <tr>
-    <td>GoPro</td>
-    <th><a href="https://seungjunnah.github.io/Datasets/gopro">GoPro</a></th>
-    <th>Both</th>
-    <th>Arbitrary</th>
-  </tr>
-  <tr>
-    <td>Adobe240fps</td>
-    <th><a href="http://www.cs.ubc.ca/labs/imager/tr/2017/DeepVideoDeblurring">DBN (CVPR 2017)</a></th>
-    <th>Both</th>
-    <th>Arbitrary</th>
-  </tr>
-   <tr>
-    <td>X4K1000FPS</td>
-    <th><a href="https://github.com/JihyongOh/XVFI">XVFI (ICCV 2021)</a></th>
-    <th>Both</th>
-    <th>Arbitrary</th>
-  </tr>
-</tbody>
-</table>
-
-
-## Pretrained Models
-
-<p id="Pretrained"></p>
-
-<table>
-<thead>
-  <tr>
-    <th> Dataset </th>
-    <th> :link: Download Links </th>
-    <th> Config file </th>
-    <th> Trained on </th>
-    <th> Arbitrary/Fixed </th>
-  </tr>
-</thead>
-<tbody>
-  <tr>
-    <td>AMT-S</td>
-    <th> [<a href="https://drive.google.com/file/d/1WmOKmQmd6pnLpID8EpUe-TddFpJuavrL/view?usp=share_link">Google Driver</a>][<a href="https://pan.baidu.com/s/1yGaNLeb9TG5-81t0skrOUA?pwd=f66n">Baidu Cloud</a>]</th>
-    <th> [<a href="../cfgs/AMT-S.yaml">cfgs/AMT-S</a>] </th>
-    <th>Vimeo90k</th>
-    <th>Fixed</th>
-  </tr>
-  <tr>
-    <td>AMT-L</td>
-    <th>[<a href="https://drive.google.com/file/d/1UyhYpAQLXMjFA55rlFZ0kdiSVTL7oU-z/view?usp=share_link">Google Driver</a>][<a href="https://pan.baidu.com/s/1qI4fBgS405Bd4Wn1R3Gbeg?pwd=nbne">Baidu Cloud</a>]</th>
-    <th> [<a href="../cfgs/AMT-L.yaml">cfgs/AMT-L</a>] </th>
-    <th>Vimeo90k</th>
-    <th>Fixed</th>
-  </tr>
-  <tr>
-    <td>AMT-G</td>
-    <th>[<a href="https://drive.google.com/file/d/1yieLtKh4ei3gOrLN1LhKSP_9157Q-mtP/view?usp=share_link">Google Driver</a>][<a href="https://pan.baidu.com/s/1AjmQVziQut1bXgQnDcDKvA?pwd=caf6">Baidu Cloud</a>]</th>
-    <th> [<a href="../cfgs/AMT-G.yaml">cfgs/AMT-G</a>] </th>
-    <th>Vimeo90k</th>
-    <th>Fixed</th>
-  </tr>
-  <tr>
-    <td>AMT-S</td>
-    <th>[<a href="https://drive.google.com/file/d/1f1xAF0EDm-rjDdny8_aLyeedfM0QL4-C/view?usp=share_link">Google Driver</a>][<a href="https://pan.baidu.com/s/1eZtoULyduQM8AkXeYEBOEw?pwd=8hy3">Baidu Cloud</a>]</th>
-    <th> [<a href="../cfgs/AMT-S_gopro.yaml">cfgs/AMT-S_gopro</a>] </th>
-    <th>GoPro</th>
-    <th>Arbitrary</th>
-  </tr>
-</tbody>
-</table>
-
-## Evaluation
-Before evaluation, you should:
-
-1. Check the dataroot is organized as follows:
-
-```shell
-./data
-├── Adobe240
-│   ├── original_high_fps_videos
-│   └── test_frames # using ffmpeg to extract 240 fps frames from `original_high_fps_videos`
-├── GOPRO
-│   ├── test
-│   └── train
-├── SNU_FILM
-│   ├── GOPRO_test
-│   ├── test-easy.txt
-│   ├── test-extreme.txt
-│   ├── test-hard.txt
-│   ├── test-medium.txt
-│   └── YouTube_test
-├── ucf101_interp_ours
-│   ├── 1
-│   ├── 1001
-│   └── ...
-└── vimeo_triplet
-    ├── readme.txt
-    ├── sequences
-    ├── tri_testlist.txt
-    └── tri_trainlist.txt
-```
-
-2. Download the provided [pretrained models](#pretrained-models).
-
-Then, you can perform evaluation as follows:
-
-+ Run all benchmarks for fixed-time models.
-
-    ```shell
-    sh ./scripts/benchmark_fixed.sh [CFG] [CKPT_PATH]
-    ## e.g.
-    sh ./scripts/benchmark_fixed.sh cfgs/AMT-S.yaml pretrained/amt-s.pth
-    ```
-
-+ Run all benchmarks for arbitrary-time models.
-
-    ```shell
-    sh ./scripts/benchmark_arbitrary.sh [CFG] [CKPT_PATH]
-    ## e.g.
-    sh ./scripts/benchmark_arbitrary.sh cfgs/AMT-S.yaml pretrained/gopro_amt-s.pth
-    ```
-
-+ Run a single benchmark for fixed-time models. *You can custom data paths in this case*.
-
-    ```shell
-    python [BENCHMARK] -c [CFG] -p [CKPT_PATH] -r [DATAROOT]
-    ## e.g.
-    python benchmarks/vimeo90k.py -c cfgs/AMT-S.yaml -p pretrained/amt-s.pth -r data/vimeo_triplet
-    ```
-
-+ Run the inference speed & model size comparisons using:
-
-    ```shell
-    python speed_parameters.py -c [CFG]
-    ## e.g.
-    python speed_parameters.py -c cfgs/AMT-S.yaml
-    ```
-
-
-## Training
-
-Before training, please first prepare the optical flows (which are used for supervision).
-
-We need to install `cupy` first before flow generation:
-
-```shell
-conda activate amt # satisfying `requirement.txt`
-conda install -c conda-forge cupy
-```
-
-
-After installing `cupy`, we can generate optical flows by the following command:
-
-```shell
-python flow_generation/gen_flow.py -r [DATA_ROOT]
-## e.g.
-python flow_generation/gen_flow.py -r data/vimeo_triplet
-```
-
-After obtaining the optical flow of the training data,
-run the following commands for training (DDP mode):
-
-```shell
- sh ./scripts/train.sh [NUM_GPU] [CFG] [MASTER_PORT]
- ## e.g.
- sh ./scripts/train.sh 2 cfgs/AMT-S.yaml 14514
-```
-
-Our training configuration files are provided in [`cfgs`](../cfgs). Please carefully check the `dataset_dir` is suitable for you.
-
-
-Note:
-
-- If you intend to turn off DDP training, you can switch the key `distributed` from `true`
-to `false` in the config file.
-
-- If you do not use wandb, you can switch the key `logger.use_wandb` from `true`
-to `false` in the config file.
diff --git a/eval/vbench/third_party/amt/docs/method.md b/eval/vbench/third_party/amt/docs/method.md
deleted file mode 100644
index df9c0378..00000000
--- a/eval/vbench/third_party/amt/docs/method.md
+++ /dev/null
@@ -1,126 +0,0 @@
-# Illustration of AMT
-
-<p align="center">
-<img src="https://user-images.githubusercontent.com/21050959/229420451-65951bd0-732c-4f09-9121-f291a3862d6e.png" width="1200">
-</p>
-
-### :rocket: Highlights:
-
-+ [**Good tradeoff**](#good-tradeoff) between performance and efficiency.
-
-+ [**All-pairs correlation**](#all-pairs-correlation) for modeling large motions during interpolation.
-
-+ A [**plug-and-play operator**](#multi-field-refinement) to improve the diversity of predicted task-oriented flows, further **boosting the interpolation performance**.
-
-
-## Good Tradeoff
-
-<p align="left">
-<img src="https://user-images.githubusercontent.com/21050959/229470703-2f386d62-d26c-46a3-af97-ddfc4270678a.png" width="500">
-</p>
-
-We examine the proposed AMT on several public benchmarks with different model scales, showing strong performance and high efficiency in contrast to the SOTA methods (see Figure). Our small model outperforms [IFRNet-B](https://arxiv.org/abs/2205.14620), a SOTA lightweight model, by **\+0.17dB PSNR** on Vimeo90K with **only 60% of its FLOPs and parameters**. For large-scale setting, our AMT exceeds the previous SOTA (i.e., [IFRNet-L](https://arxiv.org/abs/2205.14620)) by **+0.15 dB PSNR** on Vimeo90K with **75% of its FLOPs and 65% of its parameters**. Besides, we provide a huge model for comparison
-with the SOTA transformer-based method [VFIFormer](https://arxiv.org/abs/2205.07230). Our convolution-based AMT shows a **comparable performance** but only needs **nearly 23× less computational cost** compared to VFIFormer.
-
-Considering its effectiveness, we hope our AMT could bring a new perspective for the architecture design in efficient frame interpolation.
-
-## All-pairs correlation
-
-We build all-pairs correlation to effectively model large motions during interpolation.
-
-Here is an example about the update operation at a single scale in AMT:
-
-```python
-  # Construct bidirectional correlation volumes
-  fmap0, fmap1 = self.feat_encoder([img0_, img1_]) # [B, C, H//8, W//8]
-  corr_fn = BidirCorrBlock(fmap0, fmap1, radius=self.radius, num_levels=self.corr_levels)
-
-  # Correlation scaled lookup (bilateral -> bidirectional)
-  t1_scale = 1. / embt
-  t0_scale = 1. / (1. - embt)
-  coord = coords_grid(b, h // 8, w // 8, img0.device)
-  corr0, corr1 = corr_fn(coord + flow1 * t1_scale, coord + flow0 * t0_scale)
-  corr = torch.cat([corr0, corr1], dim=1)
-  flow = torch.cat([flow0, flow1], dim=1)
-
-  # Update both intermediate feature and bilateral flows
-  delta_feat, delta_flow = self.update(feat, flow, corr)
-  delta_flow0, delta_flow1 = torch.chunk(delta_flow, 2, 1)
-  flow0 = flow0 + delta_flow0
-  flow1= flow1 + delta_flow1
-  feat = feat + delta_feat
-
-```
-
-Note: we extend above operations to each pyramid scale (except for the last one), which guarantees the consistency of flows on the coarse scale.
-
-### ⏫ performance gain
-|                         | Vimeo 90k | Hard  | Extreme |
-|-------------------------|-----------|-------|---------|
-| Baseline                | 35.60     | 30.39 | 25.06   |
-| + All-pairs correlation | 35.97 (**+0.37**)  | 30.60 (**+0.21**) | 25.30 (**+0.24**)  |
-
-More ablations can be found in the [paper](https://arxiv.org/abs/2304.09790).
-
-## Multi-field Refinement
-
-For most frame interpolation methods which are based on backward warping, the common formulation for
-interpolating the final intermediate frame $I_{t}$ is:
-
-$I_{t} = M \odot \mathcal{W}(I_{0}, F_{t\rightarrow 0}) + (1 - M) \odot \mathcal{W}(I_{1}, F_{t\rightarrow 1}) + R$
-
-Above formualtion only utilizes **one set of** bilateral optical flows $F_{t\rightarrow 0}$ and $F_{t\rightarrow 1}$, occulusion masks $M$, and residuals $R$.
-
-Multi-field refinement aims to improve the common formulation of backward warping.
-Specifically, we first predict **multiple** bilateral optical flows (accompanied by the corresponding masks and residuals) through simply enlarging the output channels of the last decoder.
-Then, we use aforementioned equation to genearate each interpolated candidate frame. Finally, we obtain the final interpolated frame through combining candidate frames using stacked convolutional layers.
-
-Please refer to [this code snippet](../networks/blocks/multi_flow.py#L46) for the details of the first step.
-Please refer to [this code snippet](../networks/blocks/multi_flow.py#L10) for the details of the last two steps.
-
-### 🌟 easy to use
-The proposed multi-field refinement can be **easily migrated to any frame interpolation model** to improve the performance.
-
-Code examples are shown below:
-
-```python
-
-# (At the __init__ stage) Initialize a decoder that predicts multiple flow fields (accompanied by the corresponding masks and residuals)
-self.decoder1 = MultiFlowDecoder(channels[0], skip_channels, num_flows)
-...
-
-# (At the forward stage) Predict multiple flow fields (accompanied by the corresponding masks and residuals)
-up_flow0_1, up_flow1_1, mask, img_res = self.decoder1(ft_1_, f0_1, f1_1, up_flow0_2, up_flow1_2)
-# Merge multiple predictions
-imgt_pred = multi_flow_combine(self.comb_block, img0, img1, up_flow0_1, up_flow1_1,  # self.comb_block stacks two convolutional layers
-                                                            mask, img_res, mean_)
-
-```
-
-### ⏫ performance gain
-
-| # Number of flow pairs | Vimeo 90k     | Hard          | Extreme       |
-|------------------------|---------------|---------------|---------------|
-| Baseline (1 pair)      | 35.84         | 30.52         | 25.25         |
-| 3 pairs                | 35.97 (**+0.13**) | 30.60 (**+0.08**) | 25.30 (**+0.05**) |
-| 5 pairs                | 36.00 (**+0.16**) | 30.63 (**+0.11**) | 25.33 (**+0.08**) |
-
-## Comparison with SOTA methods
-<p align="left">
-<img src="https://user-images.githubusercontent.com/21050959/230716340-dea52895-1713-4857-97e5-48cdff9c478f.png" width="1200">
-</p>
-
-
-## Discussions
-
-We encountered the challenges about the novelty issue during the rebuttal process.
-
-We are ready to clarify again here:
-
-1. We consider the estimation of task-oriented flows from **the perspective of architecture formulation rather than loss function designs** in previous works. The detailed analysis can be found in Sec. 1 of the main paper. We introduce all-pairs correlation to strengthen the ability
-in motion modeling, which guarantees **the consistency of flows on the coarse scale**. We employ multi-field refinement to **ensure diversity for the flow regions that need to be task-specific at the finest scale**. The two designs also enable our AMT to capture large motions and successfully handle occlusion regions with high efficiency. As a consequence, they both bring noticeable performance improvements, as shown in the ablations.
-2. The frame interpolation task is closely related to the **motion modeling**. We strongly believe that a [RAFT-style](https://arxiv.org/abs/2003.12039) approach to motion modeling would be beneficial for the frame interpolation task. However, such style **has not been well studied** in the recent frame interpolation literature. Experimental results show that **all-pairs correlation is very important for the performance gain**. We also involve many novel and task-specific designs
-beyond the original RAFT. For other task-related design choices, our volume design, scaled lookup strategy, content update, and cross-scale update way have good performance gains on challenging cases (i.e., Hard and Extreme). Besides, if we discard all design choices (but remaining multi-field refinement) and follow the original RAFT to retrain a new model, **the PSNR values will dramatically decrease** (-0.20dB on Vimeo, -0.33dB on Hard, and -0.39dB on Extreme).
-3.  [M2M-VFI](https://arxiv.org/abs/2204.03513) is the most relevant to our multi-field refinement. It also generates multiple flows through the decoder and prepares warped candidates in the image domain. However, there are **five key differences** between our multi-field refinement and M2M-VFI. **First**, our method generates the candidate frames by backward warping rather than forward warping in M2M-VFI. The proposed multi-field refinement aims to improve the common formulation of backward warping (see Eqn.~(4) in the main paper). **Second**, while M2M-VFI predicts multiple flows to overcome the hole issue and artifacts in overlapped regions caused by forward warping, we aim to alleviate the ambiguity issue in the occluded areas and motion boundaries by enhancing the diversity of flows. **Third**, M2M-VFI needs to estimate bidirectional flows first through an off-the-shelf optical flow estimator and then predict multiple bilateral flows through a motion refinement network. On the contrary, we directly estimate multiple bilateral flows in a one-stage network. In this network, we first estimate one pair of bilateral flows at the coarse scale and then derive multiple groups of fine-grained bilateral flows from the coarse flow pairs. **Fourth**, M2M-VFI jointly estimates two reliability maps together with all pairs of bilateral flows, which can be further used to fuse the overlapping pixels caused by forward warping. As shown in Eqn. (5) of the main paper, we estimate not only an occlusion mask but a residual content for cooperating with each pair of bilateral flows. The residual content is used to compensate for the unreliable details after warping. This design has been investigated in Tab. 2e of the main paper. **Fifth**, we stack two convolutional layers to adaptively merge candidate frames, while M2M-VFI normalizes the sum of all candidate frames through a pre-computed weighting map
-
-More discussions and details can be found in the [appendix](https://arxiv.org/abs/2304.09790) of our paper.
diff --git a/eval/vbench/third_party/amt/environment.yaml b/eval/vbench/third_party/amt/environment.yaml
deleted file mode 100644
index 979925fe..00000000
--- a/eval/vbench/third_party/amt/environment.yaml
+++ /dev/null
@@ -1,19 +0,0 @@
-name: amt
-channels:
-  - pytorch
-  - conda-forge
-  - defaults
-dependencies:
-  - python=3.8.5
-  - pip=20.3
-  - cudatoolkit=11.3
-  - pytorch=1.11.0
-  - torchvision=0.12.0
-  - numpy=1.21.5
-  - pip:
-    - opencv-python==4.1.2.30
-    - imageio==2.19.3
-    - omegaconf==2.3.0
-    - Pillow==9.4.0
-    - tqdm==4.64.1
-    - wandb==0.12.21
diff --git a/eval/vbench/third_party/amt/flow_generation/__init__.py b/eval/vbench/third_party/amt/flow_generation/__init__.py
deleted file mode 100644
index e69de29b..00000000
diff --git a/eval/vbench/third_party/amt/flow_generation/gen_flow.py b/eval/vbench/third_party/amt/flow_generation/gen_flow.py
deleted file mode 100644
index 52b01b5b..00000000
--- a/eval/vbench/third_party/amt/flow_generation/gen_flow.py
+++ /dev/null
@@ -1,77 +0,0 @@
-import argparse
-import os
-import os.path as osp
-import sys
-
-import numpy as np
-import torch
-import torch.nn.functional as F
-
-sys.path.append(".")
-from flow_generation.liteflownet.run import estimate
-from utils.utils import read, write
-
-parser = argparse.ArgumentParser(
-    prog="AMT",
-    description="Flow generation",
-)
-parser.add_argument("-r", "--root", default="data/vimeo_triplet")
-args = parser.parse_args()
-
-vimeo90k_dir = args.root
-vimeo90k_sequences_dir = osp.join(vimeo90k_dir, "sequences")
-vimeo90k_flow_dir = osp.join(vimeo90k_dir, "flow")
-
-
-def pred_flow(img1, img2):
-    img1 = torch.from_numpy(img1).float().permute(2, 0, 1) / 255.0
-    img2 = torch.from_numpy(img2).float().permute(2, 0, 1) / 255.0
-
-    flow = estimate(img1, img2)
-
-    flow = flow.permute(1, 2, 0).cpu().numpy()
-    return flow
-
-
-print("Built Flow Path")
-if not osp.exists(vimeo90k_flow_dir):
-    os.makedirs(vimeo90k_flow_dir)
-
-for sequences_path in sorted(os.listdir(vimeo90k_sequences_dir)):
-    vimeo90k_sequences_path_dir = osp.join(vimeo90k_sequences_dir, sequences_path)
-    vimeo90k_flow_path_dir = osp.join(vimeo90k_flow_dir, sequences_path)
-    if not osp.exists(vimeo90k_flow_path_dir):
-        os.mkdir(vimeo90k_flow_path_dir)
-
-    for sequences_id in sorted(os.listdir(vimeo90k_sequences_path_dir)):
-        vimeo90k_flow_id_dir = osp.join(vimeo90k_flow_path_dir, sequences_id)
-        if not osp.exists(vimeo90k_flow_id_dir):
-            os.mkdir(vimeo90k_flow_id_dir)
-
-for sequences_path in sorted(os.listdir(vimeo90k_sequences_dir)):
-    vimeo90k_sequences_path_dir = os.path.join(vimeo90k_sequences_dir, sequences_path)
-    vimeo90k_flow_path_dir = os.path.join(vimeo90k_flow_dir, sequences_path)
-
-    for sequences_id in sorted(os.listdir(vimeo90k_sequences_path_dir)):
-        vimeo90k_sequences_id_dir = os.path.join(
-            vimeo90k_sequences_path_dir, sequences_id
-        )
-        vimeo90k_flow_id_dir = os.path.join(vimeo90k_flow_path_dir, sequences_id)
-
-        img0_path = vimeo90k_sequences_id_dir + "/im1.png"
-        imgt_path = vimeo90k_sequences_id_dir + "/im2.png"
-        img1_path = vimeo90k_sequences_id_dir + "/im3.png"
-        flow_t0_path = vimeo90k_flow_id_dir + "/flow_t0.flo"
-        flow_t1_path = vimeo90k_flow_id_dir + "/flow_t1.flo"
-
-        img0 = read(img0_path)
-        imgt = read(imgt_path)
-        img1 = read(img1_path)
-
-        flow_t0 = pred_flow(imgt, img0)
-        flow_t1 = pred_flow(imgt, img1)
-
-        write(flow_t0_path, flow_t0)
-        write(flow_t1_path, flow_t1)
-
-    print("Written Sequences {}".format(sequences_path))
diff --git a/eval/vbench/third_party/amt/flow_generation/liteflownet/README.md b/eval/vbench/third_party/amt/flow_generation/liteflownet/README.md
deleted file mode 100644
index 556ab2ff..00000000
--- a/eval/vbench/third_party/amt/flow_generation/liteflownet/README.md
+++ /dev/null
@@ -1,45 +0,0 @@
-# pytorch-liteflownet
-This is a personal reimplementation of LiteFlowNet [1] using PyTorch. Should you be making use of this work, please cite the paper accordingly. Also, make sure to adhere to the <a href="https://github.com/twhui/LiteFlowNet#license-and-citation">licensing terms</a> of the authors. Should you be making use of this particular implementation, please acknowledge it appropriately [2].
-
-<a href="https://arxiv.org/abs/1805.07036" rel="Paper"><img src="http://www.arxiv-sanity.com/static/thumbs/1805.07036v1.pdf.jpg" alt="Paper" width="100%"></a>
-
-For the original Caffe version of this work, please see: https://github.com/twhui/LiteFlowNet
-<br />
-Other optical flow implementations from me: [pytorch-pwc](https://github.com/sniklaus/pytorch-pwc), [pytorch-unflow](https://github.com/sniklaus/pytorch-unflow), [pytorch-spynet](https://github.com/sniklaus/pytorch-spynet)
-
-## setup
-The correlation layer is implemented in CUDA using CuPy, which is why CuPy is a required dependency. It can be installed using `pip install cupy` or alternatively using one of the provided [binary packages](https://docs.cupy.dev/en/stable/install.html#installing-cupy) as outlined in the CuPy repository. If you would like to use Docker, you can take a look at [this](https://github.com/sniklaus/pytorch-liteflownet/pull/43) pull request to get started.
-
-## usage
-To run it on your own pair of images, use the following command. You can choose between three models, please make sure to see their paper / the code for more details.
-
-```
-python run.py --model default --one ./images/one.png --two ./images/two.png --out ./out.flo
-```
-
-I am afraid that I cannot guarantee that this reimplementation is correct. However, it produced results pretty much identical to the implementation of the original authors in the examples that I tried. There are some numerical deviations that stem from differences in the `DownsampleLayer` of Caffe and the `torch.nn.functional.interpolate` function of PyTorch. Please feel free to contribute to this repository by submitting issues and pull requests.
-
-## comparison
-<p align="center"><img src="comparison/comparison.gif?raw=true" alt="Comparison"></p>
-
-## license
-As stated in the <a href="https://github.com/twhui/LiteFlowNet#license-and-citation">licensing terms</a> of the authors of the paper, their material is provided for research purposes only. Please make sure to further consult their licensing terms.
-
-## references
-```
-[1]  @inproceedings{Hui_CVPR_2018,
-         author = {Tak-Wai Hui and Xiaoou Tang and Chen Change Loy},
-         title = {{LiteFlowNet}: A Lightweight Convolutional Neural Network for Optical Flow Estimation},
-         booktitle = {IEEE Conference on Computer Vision and Pattern Recognition},
-         year = {2018}
-     }
-```
-
-```
-[2]  @misc{pytorch-liteflownet,
-         author = {Simon Niklaus},
-         title = {A Reimplementation of {LiteFlowNet} Using {PyTorch}},
-         year = {2019},
-         howpublished = {\url{https://github.com/sniklaus/pytorch-liteflownet}}
-    }
-```
diff --git a/eval/vbench/third_party/amt/flow_generation/liteflownet/__init__.py b/eval/vbench/third_party/amt/flow_generation/liteflownet/__init__.py
deleted file mode 100644
index e69de29b..00000000
diff --git a/eval/vbench/third_party/amt/flow_generation/liteflownet/correlation/README.md b/eval/vbench/third_party/amt/flow_generation/liteflownet/correlation/README.md
deleted file mode 100644
index fa99e1d8..00000000
--- a/eval/vbench/third_party/amt/flow_generation/liteflownet/correlation/README.md
+++ /dev/null
@@ -1 +0,0 @@
-This is an adaptation of the FlowNet2 implementation in order to compute cost volumes. Should you be making use of this work, please make sure to adhere to the licensing terms of the original authors. Should you be making use or modify this particular implementation, please acknowledge it appropriately.
diff --git a/eval/vbench/third_party/amt/flow_generation/liteflownet/correlation/correlation.py b/eval/vbench/third_party/amt/flow_generation/liteflownet/correlation/correlation.py
deleted file mode 100644
index 48eac1db..00000000
--- a/eval/vbench/third_party/amt/flow_generation/liteflownet/correlation/correlation.py
+++ /dev/null
@@ -1,513 +0,0 @@
-#!/usr/bin/env python
-
-import math
-import re
-
-import cupy
-import torch
-
-kernel_Correlation_rearrange = """
-    extern "C" __global__ void kernel_Correlation_rearrange(
-        const int n,
-        const float* input,
-        float* output
-    ) {
-      int intIndex = (blockIdx.x * blockDim.x) + threadIdx.x;
-      if (intIndex >= n) {
-        return;
-      }
-      int intSample = blockIdx.z;
-      int intChannel = blockIdx.y;
-      float fltValue = input[(((intSample * SIZE_1(input)) + intChannel) * SIZE_2(input) * SIZE_3(input)) + intIndex];
-      __syncthreads();
-      int intPaddedY = (intIndex / SIZE_3(input)) + 3*{{intStride}};
-      int intPaddedX = (intIndex % SIZE_3(input)) + 3*{{intStride}};
-      int intRearrange = ((SIZE_3(input) + 6*{{intStride}}) * intPaddedY) + intPaddedX;
-      output[(((intSample * SIZE_1(output) * SIZE_2(output)) + intRearrange) * SIZE_1(input)) + intChannel] = fltValue;
-    }
-"""
-
-kernel_Correlation_updateOutput = """
-    extern "C" __global__ void kernel_Correlation_updateOutput(
-      const int n,
-      const float* rbot0,
-      const float* rbot1,
-      float* top
-    ) {
-      extern __shared__ char patch_data_char[];
-
-      float *patch_data = (float *)patch_data_char;
-
-      // First (upper left) position of kernel upper-left corner in current center position of neighborhood in image 1
-      int x1 = (blockIdx.x + 3) * {{intStride}};
-      int y1 = (blockIdx.y + 3) * {{intStride}};
-      int item = blockIdx.z;
-      int ch_off = threadIdx.x;
-
-      // Load 3D patch into shared shared memory
-      for (int j = 0; j < 1; j++) { // HEIGHT
-        for (int i = 0; i < 1; i++) { // WIDTH
-          int ji_off = (j + i) * SIZE_3(rbot0);
-          for (int ch = ch_off; ch < SIZE_3(rbot0); ch += 32) { // CHANNELS
-            int idx1 = ((item * SIZE_1(rbot0) + y1+j) * SIZE_2(rbot0) + x1+i) * SIZE_3(rbot0) + ch;
-            int idxPatchData = ji_off + ch;
-            patch_data[idxPatchData] = rbot0[idx1];
-          }
-        }
-      }
-
-      __syncthreads();
-
-      __shared__ float sum[32];
-
-      // Compute correlation
-      for (int top_channel = 0; top_channel < SIZE_1(top); top_channel++) {
-        sum[ch_off] = 0;
-
-        int s2o = (top_channel % 7 - 3) * {{intStride}};
-        int s2p = (top_channel / 7 - 3) * {{intStride}};
-
-        for (int j = 0; j < 1; j++) { // HEIGHT
-          for (int i = 0; i < 1; i++) { // WIDTH
-            int ji_off = (j + i) * SIZE_3(rbot0);
-            for (int ch = ch_off; ch < SIZE_3(rbot0); ch += 32) { // CHANNELS
-              int x2 = x1 + s2o;
-              int y2 = y1 + s2p;
-
-              int idxPatchData = ji_off + ch;
-              int idx2 = ((item * SIZE_1(rbot0) + y2+j) * SIZE_2(rbot0) + x2+i) * SIZE_3(rbot0) + ch;
-
-              sum[ch_off] += patch_data[idxPatchData] * rbot1[idx2];
-            }
-          }
-        }
-
-        __syncthreads();
-
-        if (ch_off == 0) {
-          float total_sum = 0;
-          for (int idx = 0; idx < 32; idx++) {
-            total_sum += sum[idx];
-          }
-          const int sumelems = SIZE_3(rbot0);
-          const int index = ((top_channel*SIZE_2(top) + blockIdx.y)*SIZE_3(top))+blockIdx.x;
-          top[index + item*SIZE_1(top)*SIZE_2(top)*SIZE_3(top)] = total_sum / (float)sumelems;
-        }
-      }
-    }
-"""
-
-kernel_Correlation_updateGradOne = """
-    #define ROUND_OFF 50000
-    extern "C" __global__ void kernel_Correlation_updateGradOne(
-      const int n,
-      const int intSample,
-      const float* rbot0,
-      const float* rbot1,
-      const float* gradOutput,
-      float* gradOne,
-      float* gradTwo
-    ) { for (int intIndex = (blockIdx.x * blockDim.x) + threadIdx.x; intIndex < n; intIndex += blockDim.x * gridDim.x) {
-      int n = intIndex % SIZE_1(gradOne); // channels
-      int l = (intIndex / SIZE_1(gradOne)) % SIZE_3(gradOne) + 3*{{intStride}}; // w-pos
-      int m = (intIndex / SIZE_1(gradOne) / SIZE_3(gradOne)) % SIZE_2(gradOne) + 3*{{intStride}}; // h-pos
-
-      // round_off is a trick to enable integer division with ceil, even for negative numbers
-      // We use a large offset, for the inner part not to become negative.
-      const int round_off = ROUND_OFF;
-      const int round_off_s1 = {{intStride}} * round_off;
-
-      // We add round_off before_s1 the int division and subtract round_off after it, to ensure the formula matches ceil behavior:
-      int xmin = (l - 3*{{intStride}} + round_off_s1 - 1) / {{intStride}} + 1 - round_off; // ceil (l - 3*{{intStride}}) / {{intStride}}
-      int ymin = (m - 3*{{intStride}} + round_off_s1 - 1) / {{intStride}} + 1 - round_off; // ceil (l - 3*{{intStride}}) / {{intStride}}
-
-      // Same here:
-      int xmax = (l - 3*{{intStride}} + round_off_s1) / {{intStride}} - round_off; // floor (l - 3*{{intStride}}) / {{intStride}}
-      int ymax = (m - 3*{{intStride}} + round_off_s1) / {{intStride}} - round_off; // floor (m - 3*{{intStride}}) / {{intStride}}
-
-      float sum = 0;
-      if (xmax>=0 && ymax>=0 && (xmin<=SIZE_3(gradOutput)-1) && (ymin<=SIZE_2(gradOutput)-1)) {
-        xmin = max(0,xmin);
-        xmax = min(SIZE_3(gradOutput)-1,xmax);
-
-        ymin = max(0,ymin);
-        ymax = min(SIZE_2(gradOutput)-1,ymax);
-
-        for (int p = -3; p <= 3; p++) {
-          for (int o = -3; o <= 3; o++) {
-            // Get rbot1 data:
-            int s2o = {{intStride}} * o;
-            int s2p = {{intStride}} * p;
-            int idxbot1 = ((intSample * SIZE_1(rbot0) + (m+s2p)) * SIZE_2(rbot0) + (l+s2o)) * SIZE_3(rbot0) + n;
-            float bot1tmp = rbot1[idxbot1]; // rbot1[l+s2o,m+s2p,n]
-
-            // Index offset for gradOutput in following loops:
-            int op = (p+3) * 7 + (o+3); // index[o,p]
-            int idxopoffset = (intSample * SIZE_1(gradOutput) + op);
-
-            for (int y = ymin; y <= ymax; y++) {
-              for (int x = xmin; x <= xmax; x++) {
-                int idxgradOutput = (idxopoffset * SIZE_2(gradOutput) + y) * SIZE_3(gradOutput) + x; // gradOutput[x,y,o,p]
-                sum += gradOutput[idxgradOutput] * bot1tmp;
-              }
-            }
-          }
-        }
-      }
-      const int sumelems = SIZE_1(gradOne);
-      const int bot0index = ((n * SIZE_2(gradOne)) + (m-3*{{intStride}})) * SIZE_3(gradOne) + (l-3*{{intStride}});
-      gradOne[bot0index + intSample*SIZE_1(gradOne)*SIZE_2(gradOne)*SIZE_3(gradOne)] = sum / (float)sumelems;
-    } }
-"""
-
-kernel_Correlation_updateGradTwo = """
-    #define ROUND_OFF 50000
-    extern "C" __global__ void kernel_Correlation_updateGradTwo(
-      const int n,
-      const int intSample,
-      const float* rbot0,
-      const float* rbot1,
-      const float* gradOutput,
-      float* gradOne,
-      float* gradTwo
-    ) { for (int intIndex = (blockIdx.x * blockDim.x) + threadIdx.x; intIndex < n; intIndex += blockDim.x * gridDim.x) {
-      int n = intIndex % SIZE_1(gradTwo); // channels
-      int l = (intIndex / SIZE_1(gradTwo)) % SIZE_3(gradTwo) + 3*{{intStride}}; // w-pos
-      int m = (intIndex / SIZE_1(gradTwo) / SIZE_3(gradTwo)) % SIZE_2(gradTwo) + 3*{{intStride}}; // h-pos
-
-      // round_off is a trick to enable integer division with ceil, even for negative numbers
-      // We use a large offset, for the inner part not to become negative.
-      const int round_off = ROUND_OFF;
-      const int round_off_s1 = {{intStride}} * round_off;
-
-      float sum = 0;
-      for (int p = -3; p <= 3; p++) {
-        for (int o = -3; o <= 3; o++) {
-          int s2o = {{intStride}} * o;
-          int s2p = {{intStride}} * p;
-
-          //Get X,Y ranges and clamp
-          // We add round_off before_s1 the int division and subtract round_off after it, to ensure the formula matches ceil behavior:
-          int xmin = (l - 3*{{intStride}} - s2o + round_off_s1 - 1) / {{intStride}} + 1 - round_off; // ceil (l - 3*{{intStride}} - s2o) / {{intStride}}
-          int ymin = (m - 3*{{intStride}} - s2p + round_off_s1 - 1) / {{intStride}} + 1 - round_off; // ceil (l - 3*{{intStride}} - s2o) / {{intStride}}
-
-          // Same here:
-          int xmax = (l - 3*{{intStride}} - s2o + round_off_s1) / {{intStride}} - round_off; // floor (l - 3*{{intStride}} - s2o) / {{intStride}}
-          int ymax = (m - 3*{{intStride}} - s2p + round_off_s1) / {{intStride}} - round_off; // floor (m - 3*{{intStride}} - s2p) / {{intStride}}
-
-          if (xmax>=0 && ymax>=0 && (xmin<=SIZE_3(gradOutput)-1) && (ymin<=SIZE_2(gradOutput)-1)) {
-            xmin = max(0,xmin);
-            xmax = min(SIZE_3(gradOutput)-1,xmax);
-
-            ymin = max(0,ymin);
-            ymax = min(SIZE_2(gradOutput)-1,ymax);
-
-            // Get rbot0 data:
-            int idxbot0 = ((intSample * SIZE_1(rbot0) + (m-s2p)) * SIZE_2(rbot0) + (l-s2o)) * SIZE_3(rbot0) + n;
-            float bot0tmp = rbot0[idxbot0]; // rbot1[l+s2o,m+s2p,n]
-
-            // Index offset for gradOutput in following loops:
-            int op = (p+3) * 7 + (o+3); // index[o,p]
-            int idxopoffset = (intSample * SIZE_1(gradOutput) + op);
-
-            for (int y = ymin; y <= ymax; y++) {
-              for (int x = xmin; x <= xmax; x++) {
-                int idxgradOutput = (idxopoffset * SIZE_2(gradOutput) + y) * SIZE_3(gradOutput) + x; // gradOutput[x,y,o,p]
-                sum += gradOutput[idxgradOutput] * bot0tmp;
-              }
-            }
-          }
-        }
-      }
-      const int sumelems = SIZE_1(gradTwo);
-      const int bot1index = ((n * SIZE_2(gradTwo)) + (m-3*{{intStride}})) * SIZE_3(gradTwo) + (l-3*{{intStride}});
-      gradTwo[bot1index + intSample*SIZE_1(gradTwo)*SIZE_2(gradTwo)*SIZE_3(gradTwo)] = sum / (float)sumelems;
-    } }
-"""
-
-
-def cupy_kernel(strFunction, objVariables):
-    strKernel = globals()[strFunction].replace(
-        "{{intStride}}", str(objVariables["intStride"])
-    )
-
-    while True:
-        objMatch = re.search("(SIZE_)([0-4])(\()([^\)]*)(\))", strKernel)
-
-        if objMatch is None:
-            break
-        # end
-
-        intArg = int(objMatch.group(2))
-
-        strTensor = objMatch.group(4)
-        intSizes = objVariables[strTensor].size()
-
-        strKernel = strKernel.replace(
-            objMatch.group(),
-            str(
-                intSizes[intArg]
-                if torch.is_tensor(intSizes[intArg]) == False
-                else intSizes[intArg].item()
-            ),
-        )
-    # end
-
-    while True:
-        objMatch = re.search("(VALUE_)([0-4])(\()([^\)]+)(\))", strKernel)
-
-        if objMatch is None:
-            break
-        # end
-
-        intArgs = int(objMatch.group(2))
-        strArgs = objMatch.group(4).split(",")
-
-        strTensor = strArgs[0]
-        intStrides = objVariables[strTensor].stride()
-        strIndex = [
-            "(("
-            + strArgs[intArg + 1].replace("{", "(").replace("}", ")").strip()
-            + ")*"
-            + str(
-                intStrides[intArg]
-                if torch.is_tensor(intStrides[intArg]) == False
-                else intStrides[intArg].item()
-            )
-            + ")"
-            for intArg in range(intArgs)
-        ]
-
-        strKernel = strKernel.replace(
-            objMatch.group(0), strTensor + "[" + str.join("+", strIndex) + "]"
-        )
-    # end
-
-    return strKernel
-
-
-# end
-
-
-@cupy.memoize(for_each_device=True)
-def cupy_launch(strFunction, strKernel):
-    return cupy.cuda.compile_with_cache(strKernel).get_function(strFunction)
-
-
-# end
-
-
-class _FunctionCorrelation(torch.autograd.Function):
-    @staticmethod
-    def forward(self, one, two, intStride):
-        rbot0 = one.new_zeros(
-            [
-                one.shape[0],
-                one.shape[2] + (6 * intStride),
-                one.shape[3] + (6 * intStride),
-                one.shape[1],
-            ]
-        )
-        rbot1 = one.new_zeros(
-            [
-                one.shape[0],
-                one.shape[2] + (6 * intStride),
-                one.shape[3] + (6 * intStride),
-                one.shape[1],
-            ]
-        )
-
-        self.intStride = intStride
-
-        one = one.contiguous()
-        assert one.is_cuda == True
-        two = two.contiguous()
-        assert two.is_cuda == True
-
-        output = one.new_zeros(
-            [
-                one.shape[0],
-                49,
-                int(math.ceil(one.shape[2] / intStride)),
-                int(math.ceil(one.shape[3] / intStride)),
-            ]
-        )
-
-        if one.is_cuda == True:
-            n = one.shape[2] * one.shape[3]
-            cupy_launch(
-                "kernel_Correlation_rearrange",
-                cupy_kernel(
-                    "kernel_Correlation_rearrange",
-                    {"intStride": self.intStride, "input": one, "output": rbot0},
-                ),
-            )(
-                grid=tuple([int((n + 16 - 1) / 16), one.shape[1], one.shape[0]]),
-                block=tuple([16, 1, 1]),
-                args=[cupy.int32(n), one.data_ptr(), rbot0.data_ptr()],
-            )
-
-            n = two.shape[2] * two.shape[3]
-            cupy_launch(
-                "kernel_Correlation_rearrange",
-                cupy_kernel(
-                    "kernel_Correlation_rearrange",
-                    {"intStride": self.intStride, "input": two, "output": rbot1},
-                ),
-            )(
-                grid=tuple([int((n + 16 - 1) / 16), two.shape[1], two.shape[0]]),
-                block=tuple([16, 1, 1]),
-                args=[cupy.int32(n), two.data_ptr(), rbot1.data_ptr()],
-            )
-
-            n = output.shape[1] * output.shape[2] * output.shape[3]
-            cupy_launch(
-                "kernel_Correlation_updateOutput",
-                cupy_kernel(
-                    "kernel_Correlation_updateOutput",
-                    {
-                        "intStride": self.intStride,
-                        "rbot0": rbot0,
-                        "rbot1": rbot1,
-                        "top": output,
-                    },
-                ),
-            )(
-                grid=tuple([output.shape[3], output.shape[2], output.shape[0]]),
-                block=tuple([32, 1, 1]),
-                shared_mem=one.shape[1] * 4,
-                args=[
-                    cupy.int32(n),
-                    rbot0.data_ptr(),
-                    rbot1.data_ptr(),
-                    output.data_ptr(),
-                ],
-            )
-
-        elif one.is_cuda == False:
-            raise NotImplementedError()
-
-        # end
-
-        self.save_for_backward(one, two, rbot0, rbot1)
-
-        return output
-
-    # end
-
-    @staticmethod
-    def backward(self, gradOutput):
-        one, two, rbot0, rbot1 = self.saved_tensors
-
-        gradOutput = gradOutput.contiguous()
-        assert gradOutput.is_cuda == True
-
-        gradOne = (
-            one.new_zeros([one.shape[0], one.shape[1], one.shape[2], one.shape[3]])
-            if self.needs_input_grad[0] == True
-            else None
-        )
-        gradTwo = (
-            one.new_zeros([one.shape[0], one.shape[1], one.shape[2], one.shape[3]])
-            if self.needs_input_grad[1] == True
-            else None
-        )
-
-        if one.is_cuda == True:
-            if gradOne is not None:
-                for intSample in range(one.shape[0]):
-                    n = one.shape[1] * one.shape[2] * one.shape[3]
-                    cupy_launch(
-                        "kernel_Correlation_updateGradOne",
-                        cupy_kernel(
-                            "kernel_Correlation_updateGradOne",
-                            {
-                                "intStride": self.intStride,
-                                "rbot0": rbot0,
-                                "rbot1": rbot1,
-                                "gradOutput": gradOutput,
-                                "gradOne": gradOne,
-                                "gradTwo": None,
-                            },
-                        ),
-                    )(
-                        grid=tuple([int((n + 512 - 1) / 512), 1, 1]),
-                        block=tuple([512, 1, 1]),
-                        args=[
-                            cupy.int32(n),
-                            intSample,
-                            rbot0.data_ptr(),
-                            rbot1.data_ptr(),
-                            gradOutput.data_ptr(),
-                            gradOne.data_ptr(),
-                            None,
-                        ],
-                    )
-                # end
-            # end
-
-            if gradTwo is not None:
-                for intSample in range(one.shape[0]):
-                    n = one.shape[1] * one.shape[2] * one.shape[3]
-                    cupy_launch(
-                        "kernel_Correlation_updateGradTwo",
-                        cupy_kernel(
-                            "kernel_Correlation_updateGradTwo",
-                            {
-                                "intStride": self.intStride,
-                                "rbot0": rbot0,
-                                "rbot1": rbot1,
-                                "gradOutput": gradOutput,
-                                "gradOne": None,
-                                "gradTwo": gradTwo,
-                            },
-                        ),
-                    )(
-                        grid=tuple([int((n + 512 - 1) / 512), 1, 1]),
-                        block=tuple([512, 1, 1]),
-                        args=[
-                            cupy.int32(n),
-                            intSample,
-                            rbot0.data_ptr(),
-                            rbot1.data_ptr(),
-                            gradOutput.data_ptr(),
-                            None,
-                            gradTwo.data_ptr(),
-                        ],
-                    )
-                # end
-            # end
-
-        elif one.is_cuda == False:
-            raise NotImplementedError()
-
-        # end
-
-        return gradOne, gradTwo, None
-
-    # end
-
-
-# end
-
-
-def FunctionCorrelation(tenOne, tenTwo, intStride):
-    return _FunctionCorrelation.apply(tenOne, tenTwo, intStride)
-
-
-# end
-
-
-class ModuleCorrelation(torch.nn.Module):
-    def __init__(self):
-        super().__init__()
-
-    # end
-
-    def forward(self, tenOne, tenTwo, intStride):
-        return _FunctionCorrelation.apply(tenOne, tenTwo, intStride)
-
-    # end
-
-
-# end
diff --git a/eval/vbench/third_party/amt/flow_generation/liteflownet/run.py b/eval/vbench/third_party/amt/flow_generation/liteflownet/run.py
deleted file mode 100644
index 9da2baa8..00000000
--- a/eval/vbench/third_party/amt/flow_generation/liteflownet/run.py
+++ /dev/null
@@ -1,813 +0,0 @@
-#!/usr/bin/env python
-
-import getopt
-import math
-import sys
-
-import numpy
-import PIL
-import PIL.Image
-import torch
-
-try:
-    from .correlation import correlation  # the custom cost volume layer
-except:
-    sys.path.insert(0, "./correlation")
-    import correlation  # you should consider upgrading python
-# end
-
-##########################################################
-
-assert (
-    int(str("").join(torch.__version__.split(".")[0:2])) >= 13
-)  # requires at least pytorch version 1.3.0
-
-torch.set_grad_enabled(
-    False
-)  # make sure to not compute gradients for computational performance
-
-torch.backends.cudnn.enabled = (
-    True  # make sure to use cudnn for computational performance
-)
-
-##########################################################
-
-arguments_strModel = "default"  # 'default', or 'kitti', or 'sintel'
-arguments_strOne = "./images/one.png"
-arguments_strTwo = "./images/two.png"
-arguments_strOut = "./out.flo"
-
-for strOption, strArgument in getopt.getopt(
-    sys.argv[1:], "", [strParameter[2:] + "=" for strParameter in sys.argv[1::2]]
-)[0]:
-    if strOption == "--model" and strArgument != "":
-        arguments_strModel = strArgument  # which model to use
-    if strOption == "--one" and strArgument != "":
-        arguments_strOne = strArgument  # path to the first frame
-    if strOption == "--two" and strArgument != "":
-        arguments_strTwo = strArgument  # path to the second frame
-    if strOption == "--out" and strArgument != "":
-        arguments_strOut = strArgument  # path to where the output should be stored
-# end
-
-##########################################################
-
-backwarp_tenGrid = {}
-
-
-def backwarp(tenInput, tenFlow):
-    if str(tenFlow.shape) not in backwarp_tenGrid:
-        tenHor = (
-            torch.linspace(
-                -1.0 + (1.0 / tenFlow.shape[3]),
-                1.0 - (1.0 / tenFlow.shape[3]),
-                tenFlow.shape[3],
-            )
-            .view(1, 1, 1, -1)
-            .repeat(1, 1, tenFlow.shape[2], 1)
-        )
-        tenVer = (
-            torch.linspace(
-                -1.0 + (1.0 / tenFlow.shape[2]),
-                1.0 - (1.0 / tenFlow.shape[2]),
-                tenFlow.shape[2],
-            )
-            .view(1, 1, -1, 1)
-            .repeat(1, 1, 1, tenFlow.shape[3])
-        )
-
-        backwarp_tenGrid[str(tenFlow.shape)] = torch.cat([tenHor, tenVer], 1).cuda()
-    # end
-
-    tenFlow = torch.cat(
-        [
-            tenFlow[:, 0:1, :, :] / ((tenInput.shape[3] - 1.0) / 2.0),
-            tenFlow[:, 1:2, :, :] / ((tenInput.shape[2] - 1.0) / 2.0),
-        ],
-        1,
-    )
-
-    return torch.nn.functional.grid_sample(
-        input=tenInput,
-        grid=(backwarp_tenGrid[str(tenFlow.shape)] + tenFlow).permute(0, 2, 3, 1),
-        mode="bilinear",
-        padding_mode="zeros",
-        align_corners=False,
-    )
-
-
-# end
-
-##########################################################
-
-
-class Network(torch.nn.Module):
-    def __init__(self):
-        super().__init__()
-
-        class Features(torch.nn.Module):
-            def __init__(self):
-                super().__init__()
-
-                self.netOne = torch.nn.Sequential(
-                    torch.nn.Conv2d(
-                        in_channels=3,
-                        out_channels=32,
-                        kernel_size=7,
-                        stride=1,
-                        padding=3,
-                    ),
-                    torch.nn.LeakyReLU(inplace=False, negative_slope=0.1),
-                )
-
-                self.netTwo = torch.nn.Sequential(
-                    torch.nn.Conv2d(
-                        in_channels=32,
-                        out_channels=32,
-                        kernel_size=3,
-                        stride=2,
-                        padding=1,
-                    ),
-                    torch.nn.LeakyReLU(inplace=False, negative_slope=0.1),
-                    torch.nn.Conv2d(
-                        in_channels=32,
-                        out_channels=32,
-                        kernel_size=3,
-                        stride=1,
-                        padding=1,
-                    ),
-                    torch.nn.LeakyReLU(inplace=False, negative_slope=0.1),
-                    torch.nn.Conv2d(
-                        in_channels=32,
-                        out_channels=32,
-                        kernel_size=3,
-                        stride=1,
-                        padding=1,
-                    ),
-                    torch.nn.LeakyReLU(inplace=False, negative_slope=0.1),
-                )
-
-                self.netThr = torch.nn.Sequential(
-                    torch.nn.Conv2d(
-                        in_channels=32,
-                        out_channels=64,
-                        kernel_size=3,
-                        stride=2,
-                        padding=1,
-                    ),
-                    torch.nn.LeakyReLU(inplace=False, negative_slope=0.1),
-                    torch.nn.Conv2d(
-                        in_channels=64,
-                        out_channels=64,
-                        kernel_size=3,
-                        stride=1,
-                        padding=1,
-                    ),
-                    torch.nn.LeakyReLU(inplace=False, negative_slope=0.1),
-                )
-
-                self.netFou = torch.nn.Sequential(
-                    torch.nn.Conv2d(
-                        in_channels=64,
-                        out_channels=96,
-                        kernel_size=3,
-                        stride=2,
-                        padding=1,
-                    ),
-                    torch.nn.LeakyReLU(inplace=False, negative_slope=0.1),
-                    torch.nn.Conv2d(
-                        in_channels=96,
-                        out_channels=96,
-                        kernel_size=3,
-                        stride=1,
-                        padding=1,
-                    ),
-                    torch.nn.LeakyReLU(inplace=False, negative_slope=0.1),
-                )
-
-                self.netFiv = torch.nn.Sequential(
-                    torch.nn.Conv2d(
-                        in_channels=96,
-                        out_channels=128,
-                        kernel_size=3,
-                        stride=2,
-                        padding=1,
-                    ),
-                    torch.nn.LeakyReLU(inplace=False, negative_slope=0.1),
-                )
-
-                self.netSix = torch.nn.Sequential(
-                    torch.nn.Conv2d(
-                        in_channels=128,
-                        out_channels=192,
-                        kernel_size=3,
-                        stride=2,
-                        padding=1,
-                    ),
-                    torch.nn.LeakyReLU(inplace=False, negative_slope=0.1),
-                )
-
-            # end
-
-            def forward(self, tenInput):
-                tenOne = self.netOne(tenInput)
-                tenTwo = self.netTwo(tenOne)
-                tenThr = self.netThr(tenTwo)
-                tenFou = self.netFou(tenThr)
-                tenFiv = self.netFiv(tenFou)
-                tenSix = self.netSix(tenFiv)
-
-                return [tenOne, tenTwo, tenThr, tenFou, tenFiv, tenSix]
-
-            # end
-
-        # end
-
-        class Matching(torch.nn.Module):
-            def __init__(self, intLevel):
-                super().__init__()
-
-                self.fltBackwarp = [0.0, 0.0, 10.0, 5.0, 2.5, 1.25, 0.625][intLevel]
-
-                if intLevel != 2:
-                    self.netFeat = torch.nn.Sequential()
-
-                elif intLevel == 2:
-                    self.netFeat = torch.nn.Sequential(
-                        torch.nn.Conv2d(
-                            in_channels=32,
-                            out_channels=64,
-                            kernel_size=1,
-                            stride=1,
-                            padding=0,
-                        ),
-                        torch.nn.LeakyReLU(inplace=False, negative_slope=0.1),
-                    )
-
-                # end
-
-                if intLevel == 6:
-                    self.netUpflow = None
-
-                elif intLevel != 6:
-                    self.netUpflow = torch.nn.ConvTranspose2d(
-                        in_channels=2,
-                        out_channels=2,
-                        kernel_size=4,
-                        stride=2,
-                        padding=1,
-                        bias=False,
-                        groups=2,
-                    )
-
-                # end
-
-                if intLevel >= 4:
-                    self.netUpcorr = None
-
-                elif intLevel < 4:
-                    self.netUpcorr = torch.nn.ConvTranspose2d(
-                        in_channels=49,
-                        out_channels=49,
-                        kernel_size=4,
-                        stride=2,
-                        padding=1,
-                        bias=False,
-                        groups=49,
-                    )
-
-                # end
-
-                self.netMain = torch.nn.Sequential(
-                    torch.nn.Conv2d(
-                        in_channels=49,
-                        out_channels=128,
-                        kernel_size=3,
-                        stride=1,
-                        padding=1,
-                    ),
-                    torch.nn.LeakyReLU(inplace=False, negative_slope=0.1),
-                    torch.nn.Conv2d(
-                        in_channels=128,
-                        out_channels=64,
-                        kernel_size=3,
-                        stride=1,
-                        padding=1,
-                    ),
-                    torch.nn.LeakyReLU(inplace=False, negative_slope=0.1),
-                    torch.nn.Conv2d(
-                        in_channels=64,
-                        out_channels=32,
-                        kernel_size=3,
-                        stride=1,
-                        padding=1,
-                    ),
-                    torch.nn.LeakyReLU(inplace=False, negative_slope=0.1),
-                    torch.nn.Conv2d(
-                        in_channels=32,
-                        out_channels=2,
-                        kernel_size=[0, 0, 7, 5, 5, 3, 3][intLevel],
-                        stride=1,
-                        padding=[0, 0, 3, 2, 2, 1, 1][intLevel],
-                    ),
-                )
-
-            # end
-
-            def forward(self, tenOne, tenTwo, tenFeaturesOne, tenFeaturesTwo, tenFlow):
-                tenFeaturesOne = self.netFeat(tenFeaturesOne)
-                tenFeaturesTwo = self.netFeat(tenFeaturesTwo)
-
-                if tenFlow is not None:
-                    tenFlow = self.netUpflow(tenFlow)
-                # end
-
-                if tenFlow is not None:
-                    tenFeaturesTwo = backwarp(
-                        tenInput=tenFeaturesTwo, tenFlow=tenFlow * self.fltBackwarp
-                    )
-                # end
-
-                if self.netUpcorr is None:
-                    tenCorrelation = torch.nn.functional.leaky_relu(
-                        input=correlation.FunctionCorrelation(
-                            tenOne=tenFeaturesOne, tenTwo=tenFeaturesTwo, intStride=1
-                        ),
-                        negative_slope=0.1,
-                        inplace=False,
-                    )
-
-                elif self.netUpcorr is not None:
-                    tenCorrelation = self.netUpcorr(
-                        torch.nn.functional.leaky_relu(
-                            input=correlation.FunctionCorrelation(
-                                tenOne=tenFeaturesOne,
-                                tenTwo=tenFeaturesTwo,
-                                intStride=2,
-                            ),
-                            negative_slope=0.1,
-                            inplace=False,
-                        )
-                    )
-
-                # end
-
-                return (tenFlow if tenFlow is not None else 0.0) + self.netMain(
-                    tenCorrelation
-                )
-
-            # end
-
-        # end
-
-        class Subpixel(torch.nn.Module):
-            def __init__(self, intLevel):
-                super().__init__()
-
-                self.fltBackward = [0.0, 0.0, 10.0, 5.0, 2.5, 1.25, 0.625][intLevel]
-
-                if intLevel != 2:
-                    self.netFeat = torch.nn.Sequential()
-
-                elif intLevel == 2:
-                    self.netFeat = torch.nn.Sequential(
-                        torch.nn.Conv2d(
-                            in_channels=32,
-                            out_channels=64,
-                            kernel_size=1,
-                            stride=1,
-                            padding=0,
-                        ),
-                        torch.nn.LeakyReLU(inplace=False, negative_slope=0.1),
-                    )
-
-                # end
-
-                self.netMain = torch.nn.Sequential(
-                    torch.nn.Conv2d(
-                        in_channels=[0, 0, 130, 130, 194, 258, 386][intLevel],
-                        out_channels=128,
-                        kernel_size=3,
-                        stride=1,
-                        padding=1,
-                    ),
-                    torch.nn.LeakyReLU(inplace=False, negative_slope=0.1),
-                    torch.nn.Conv2d(
-                        in_channels=128,
-                        out_channels=64,
-                        kernel_size=3,
-                        stride=1,
-                        padding=1,
-                    ),
-                    torch.nn.LeakyReLU(inplace=False, negative_slope=0.1),
-                    torch.nn.Conv2d(
-                        in_channels=64,
-                        out_channels=32,
-                        kernel_size=3,
-                        stride=1,
-                        padding=1,
-                    ),
-                    torch.nn.LeakyReLU(inplace=False, negative_slope=0.1),
-                    torch.nn.Conv2d(
-                        in_channels=32,
-                        out_channels=2,
-                        kernel_size=[0, 0, 7, 5, 5, 3, 3][intLevel],
-                        stride=1,
-                        padding=[0, 0, 3, 2, 2, 1, 1][intLevel],
-                    ),
-                )
-
-            # end
-
-            def forward(self, tenOne, tenTwo, tenFeaturesOne, tenFeaturesTwo, tenFlow):
-                tenFeaturesOne = self.netFeat(tenFeaturesOne)
-                tenFeaturesTwo = self.netFeat(tenFeaturesTwo)
-
-                if tenFlow is not None:
-                    tenFeaturesTwo = backwarp(
-                        tenInput=tenFeaturesTwo, tenFlow=tenFlow * self.fltBackward
-                    )
-                # end
-
-                return (tenFlow if tenFlow is not None else 0.0) + self.netMain(
-                    torch.cat([tenFeaturesOne, tenFeaturesTwo, tenFlow], 1)
-                )
-
-            # end
-
-        # end
-
-        class Regularization(torch.nn.Module):
-            def __init__(self, intLevel):
-                super().__init__()
-
-                self.fltBackward = [0.0, 0.0, 10.0, 5.0, 2.5, 1.25, 0.625][intLevel]
-
-                self.intUnfold = [0, 0, 7, 5, 5, 3, 3][intLevel]
-
-                if intLevel >= 5:
-                    self.netFeat = torch.nn.Sequential()
-
-                elif intLevel < 5:
-                    self.netFeat = torch.nn.Sequential(
-                        torch.nn.Conv2d(
-                            in_channels=[0, 0, 32, 64, 96, 128, 192][intLevel],
-                            out_channels=128,
-                            kernel_size=1,
-                            stride=1,
-                            padding=0,
-                        ),
-                        torch.nn.LeakyReLU(inplace=False, negative_slope=0.1),
-                    )
-
-                # end
-
-                self.netMain = torch.nn.Sequential(
-                    torch.nn.Conv2d(
-                        in_channels=[0, 0, 131, 131, 131, 131, 195][intLevel],
-                        out_channels=128,
-                        kernel_size=3,
-                        stride=1,
-                        padding=1,
-                    ),
-                    torch.nn.LeakyReLU(inplace=False, negative_slope=0.1),
-                    torch.nn.Conv2d(
-                        in_channels=128,
-                        out_channels=128,
-                        kernel_size=3,
-                        stride=1,
-                        padding=1,
-                    ),
-                    torch.nn.LeakyReLU(inplace=False, negative_slope=0.1),
-                    torch.nn.Conv2d(
-                        in_channels=128,
-                        out_channels=64,
-                        kernel_size=3,
-                        stride=1,
-                        padding=1,
-                    ),
-                    torch.nn.LeakyReLU(inplace=False, negative_slope=0.1),
-                    torch.nn.Conv2d(
-                        in_channels=64,
-                        out_channels=64,
-                        kernel_size=3,
-                        stride=1,
-                        padding=1,
-                    ),
-                    torch.nn.LeakyReLU(inplace=False, negative_slope=0.1),
-                    torch.nn.Conv2d(
-                        in_channels=64,
-                        out_channels=32,
-                        kernel_size=3,
-                        stride=1,
-                        padding=1,
-                    ),
-                    torch.nn.LeakyReLU(inplace=False, negative_slope=0.1),
-                    torch.nn.Conv2d(
-                        in_channels=32,
-                        out_channels=32,
-                        kernel_size=3,
-                        stride=1,
-                        padding=1,
-                    ),
-                    torch.nn.LeakyReLU(inplace=False, negative_slope=0.1),
-                )
-
-                if intLevel >= 5:
-                    self.netDist = torch.nn.Sequential(
-                        torch.nn.Conv2d(
-                            in_channels=32,
-                            out_channels=[0, 0, 49, 25, 25, 9, 9][intLevel],
-                            kernel_size=[0, 0, 7, 5, 5, 3, 3][intLevel],
-                            stride=1,
-                            padding=[0, 0, 3, 2, 2, 1, 1][intLevel],
-                        )
-                    )
-
-                elif intLevel < 5:
-                    self.netDist = torch.nn.Sequential(
-                        torch.nn.Conv2d(
-                            in_channels=32,
-                            out_channels=[0, 0, 49, 25, 25, 9, 9][intLevel],
-                            kernel_size=([0, 0, 7, 5, 5, 3, 3][intLevel], 1),
-                            stride=1,
-                            padding=([0, 0, 3, 2, 2, 1, 1][intLevel], 0),
-                        ),
-                        torch.nn.Conv2d(
-                            in_channels=[0, 0, 49, 25, 25, 9, 9][intLevel],
-                            out_channels=[0, 0, 49, 25, 25, 9, 9][intLevel],
-                            kernel_size=(1, [0, 0, 7, 5, 5, 3, 3][intLevel]),
-                            stride=1,
-                            padding=(0, [0, 0, 3, 2, 2, 1, 1][intLevel]),
-                        ),
-                    )
-
-                # end
-
-                self.netScaleX = torch.nn.Conv2d(
-                    in_channels=[0, 0, 49, 25, 25, 9, 9][intLevel],
-                    out_channels=1,
-                    kernel_size=1,
-                    stride=1,
-                    padding=0,
-                )
-                self.netScaleY = torch.nn.Conv2d(
-                    in_channels=[0, 0, 49, 25, 25, 9, 9][intLevel],
-                    out_channels=1,
-                    kernel_size=1,
-                    stride=1,
-                    padding=0,
-                )
-
-            # eny
-
-            def forward(self, tenOne, tenTwo, tenFeaturesOne, tenFeaturesTwo, tenFlow):
-                tenDifference = (
-                    (
-                        (
-                            tenOne
-                            - backwarp(
-                                tenInput=tenTwo, tenFlow=tenFlow * self.fltBackward
-                            )
-                        )
-                        ** 2
-                    )
-                    .sum(1, True)
-                    .sqrt()
-                    .detach()
-                )
-
-                tenDist = self.netDist(
-                    self.netMain(
-                        torch.cat(
-                            [
-                                tenDifference,
-                                tenFlow
-                                - tenFlow.view(tenFlow.shape[0], 2, -1)
-                                .mean(2, True)
-                                .view(tenFlow.shape[0], 2, 1, 1),
-                                self.netFeat(tenFeaturesOne),
-                            ],
-                            1,
-                        )
-                    )
-                )
-                tenDist = (tenDist**2).neg()
-                tenDist = (tenDist - tenDist.max(1, True)[0]).exp()
-
-                tenDivisor = tenDist.sum(1, True).reciprocal()
-
-                tenScaleX = (
-                    self.netScaleX(
-                        tenDist
-                        * torch.nn.functional.unfold(
-                            input=tenFlow[:, 0:1, :, :],
-                            kernel_size=self.intUnfold,
-                            stride=1,
-                            padding=int((self.intUnfold - 1) / 2),
-                        ).view_as(tenDist)
-                    )
-                    * tenDivisor
-                )
-                tenScaleY = (
-                    self.netScaleY(
-                        tenDist
-                        * torch.nn.functional.unfold(
-                            input=tenFlow[:, 1:2, :, :],
-                            kernel_size=self.intUnfold,
-                            stride=1,
-                            padding=int((self.intUnfold - 1) / 2),
-                        ).view_as(tenDist)
-                    )
-                    * tenDivisor
-                )
-
-                return torch.cat([tenScaleX, tenScaleY], 1)
-
-            # end
-
-        # end
-
-        self.netFeatures = Features()
-        self.netMatching = torch.nn.ModuleList(
-            [Matching(intLevel) for intLevel in [2, 3, 4, 5, 6]]
-        )
-        self.netSubpixel = torch.nn.ModuleList(
-            [Subpixel(intLevel) for intLevel in [2, 3, 4, 5, 6]]
-        )
-        self.netRegularization = torch.nn.ModuleList(
-            [Regularization(intLevel) for intLevel in [2, 3, 4, 5, 6]]
-        )
-
-        self.load_state_dict(
-            {
-                strKey.replace("module", "net"): tenWeight
-                for strKey, tenWeight in torch.hub.load_state_dict_from_url(
-                    url="http://content.sniklaus.com/github/pytorch-liteflownet/network-"
-                    + arguments_strModel
-                    + ".pytorch"
-                ).items()
-            }
-        )
-        # self.load_state_dict(torch.load('./liteflownet/network-default.pth'))
-
-    # end
-
-    def forward(self, tenOne, tenTwo):
-        tenOne[:, 0, :, :] = tenOne[:, 0, :, :] - 0.411618
-        tenOne[:, 1, :, :] = tenOne[:, 1, :, :] - 0.434631
-        tenOne[:, 2, :, :] = tenOne[:, 2, :, :] - 0.454253
-
-        tenTwo[:, 0, :, :] = tenTwo[:, 0, :, :] - 0.410782
-        tenTwo[:, 1, :, :] = tenTwo[:, 1, :, :] - 0.433645
-        tenTwo[:, 2, :, :] = tenTwo[:, 2, :, :] - 0.452793
-
-        tenFeaturesOne = self.netFeatures(tenOne)
-        tenFeaturesTwo = self.netFeatures(tenTwo)
-
-        tenOne = [tenOne]
-        tenTwo = [tenTwo]
-
-        for intLevel in [1, 2, 3, 4, 5]:
-            tenOne.append(
-                torch.nn.functional.interpolate(
-                    input=tenOne[-1],
-                    size=(
-                        tenFeaturesOne[intLevel].shape[2],
-                        tenFeaturesOne[intLevel].shape[3],
-                    ),
-                    mode="bilinear",
-                    align_corners=False,
-                )
-            )
-            tenTwo.append(
-                torch.nn.functional.interpolate(
-                    input=tenTwo[-1],
-                    size=(
-                        tenFeaturesTwo[intLevel].shape[2],
-                        tenFeaturesTwo[intLevel].shape[3],
-                    ),
-                    mode="bilinear",
-                    align_corners=False,
-                )
-            )
-        # end
-
-        tenFlow = None
-
-        for intLevel in [-1, -2, -3, -4, -5]:
-            tenFlow = self.netMatching[intLevel](
-                tenOne[intLevel],
-                tenTwo[intLevel],
-                tenFeaturesOne[intLevel],
-                tenFeaturesTwo[intLevel],
-                tenFlow,
-            )
-            tenFlow = self.netSubpixel[intLevel](
-                tenOne[intLevel],
-                tenTwo[intLevel],
-                tenFeaturesOne[intLevel],
-                tenFeaturesTwo[intLevel],
-                tenFlow,
-            )
-            tenFlow = self.netRegularization[intLevel](
-                tenOne[intLevel],
-                tenTwo[intLevel],
-                tenFeaturesOne[intLevel],
-                tenFeaturesTwo[intLevel],
-                tenFlow,
-            )
-        # end
-
-        return tenFlow * 20.0
-
-    # end
-
-
-# end
-
-netNetwork = None
-
-##########################################################
-
-
-def estimate(tenOne, tenTwo):
-    global netNetwork
-
-    if netNetwork is None:
-        netNetwork = Network().cuda().eval()
-    # end
-
-    assert tenOne.shape[1] == tenTwo.shape[1]
-    assert tenOne.shape[2] == tenTwo.shape[2]
-
-    intWidth = tenOne.shape[2]
-    intHeight = tenOne.shape[1]
-
-    # assert(intWidth == 1024) # remember that there is no guarantee for correctness, comment this line out if you acknowledge this and want to continue
-    # assert(intHeight == 436) # remember that there is no guarantee for correctness, comment this line out if you acknowledge this and want to continue
-
-    tenPreprocessedOne = tenOne.cuda().view(1, 3, intHeight, intWidth)
-    tenPreprocessedTwo = tenTwo.cuda().view(1, 3, intHeight, intWidth)
-
-    intPreprocessedWidth = int(math.floor(math.ceil(intWidth / 32.0) * 32.0))
-    intPreprocessedHeight = int(math.floor(math.ceil(intHeight / 32.0) * 32.0))
-
-    tenPreprocessedOne = torch.nn.functional.interpolate(
-        input=tenPreprocessedOne,
-        size=(intPreprocessedHeight, intPreprocessedWidth),
-        mode="bilinear",
-        align_corners=False,
-    )
-    tenPreprocessedTwo = torch.nn.functional.interpolate(
-        input=tenPreprocessedTwo,
-        size=(intPreprocessedHeight, intPreprocessedWidth),
-        mode="bilinear",
-        align_corners=False,
-    )
-
-    tenFlow = torch.nn.functional.interpolate(
-        input=netNetwork(tenPreprocessedOne, tenPreprocessedTwo),
-        size=(intHeight, intWidth),
-        mode="bilinear",
-        align_corners=False,
-    )
-
-    tenFlow[:, 0, :, :] *= float(intWidth) / float(intPreprocessedWidth)
-    tenFlow[:, 1, :, :] *= float(intHeight) / float(intPreprocessedHeight)
-
-    return tenFlow[0, :, :, :].cpu()
-
-
-# end
-
-##########################################################
-
-if __name__ == "__main__":
-    tenOne = torch.FloatTensor(
-        numpy.ascontiguousarray(
-            numpy.array(PIL.Image.open(arguments_strOne))[:, :, ::-1]
-            .transpose(2, 0, 1)
-            .astype(numpy.float32)
-            * (1.0 / 255.0)
-        )
-    )
-    tenTwo = torch.FloatTensor(
-        numpy.ascontiguousarray(
-            numpy.array(PIL.Image.open(arguments_strTwo))[:, :, ::-1]
-            .transpose(2, 0, 1)
-            .astype(numpy.float32)
-            * (1.0 / 255.0)
-        )
-    )
-
-    tenOutput = estimate(tenOne, tenTwo)
-
-    objOutput = open(arguments_strOut, "wb")
-
-    numpy.array([80, 73, 69, 72], numpy.uint8).tofile(objOutput)
-    numpy.array([tenOutput.shape[2], tenOutput.shape[1]], numpy.int32).tofile(objOutput)
-    numpy.array(tenOutput.numpy().transpose(1, 2, 0), numpy.float32).tofile(objOutput)
-
-    objOutput.close()
-# end
diff --git a/eval/vbench/third_party/amt/losses/__init__.py b/eval/vbench/third_party/amt/losses/__init__.py
deleted file mode 100644
index e69de29b..00000000
diff --git a/eval/vbench/third_party/amt/losses/loss.py b/eval/vbench/third_party/amt/losses/loss.py
deleted file mode 100644
index 1ebf0d40..00000000
--- a/eval/vbench/third_party/amt/losses/loss.py
+++ /dev/null
@@ -1,209 +0,0 @@
-import numpy as np
-import torch
-import torch.nn as nn
-import torch.nn.functional as F
-
-
-class Loss(nn.Module):
-    def __init__(self, loss_weight, keys, mapping=None) -> None:
-        """
-        mapping: map the kwargs keys into desired ones.
-        """
-        super().__init__()
-        self.loss_weight = loss_weight
-        self.keys = keys
-        self.mapping = mapping
-        if isinstance(mapping, dict):
-            self.mapping = {k: v for k, v in mapping if v in keys}
-
-    def forward(self, **kwargs):
-        params = {k: v for k, v in kwargs.items() if k in self.keys}
-        if self.mapping is not None:
-            for k, v in kwargs.items():
-                if self.mapping.get(k) is not None:
-                    params[self.mapping[k]] = v
-
-        return self._forward(**params) * self.loss_weight
-
-    def _forward(self, **kwargs):
-        pass
-
-
-class CharbonnierLoss(Loss):
-    def __init__(self, loss_weight, keys) -> None:
-        super().__init__(loss_weight, keys)
-
-    def _forward(self, imgt_pred, imgt):
-        diff = imgt_pred - imgt
-        loss = ((diff**2 + 1e-6) ** 0.5).mean()
-        return loss
-
-
-class AdaCharbonnierLoss(Loss):
-    def __init__(self, loss_weight, keys) -> None:
-        super().__init__(loss_weight, keys)
-
-    def _forward(self, imgt_pred, imgt, weight):
-        alpha = weight / 2
-        epsilon = 10 ** (-(10 * weight - 1) / 3)
-
-        diff = imgt_pred - imgt
-        loss = ((diff**2 + epsilon**2) ** alpha).mean()
-        return loss
-
-
-class TernaryLoss(Loss):
-    def __init__(self, loss_weight, keys, patch_size=7):
-        super().__init__(loss_weight, keys)
-        self.patch_size = patch_size
-        out_channels = patch_size * patch_size
-        self.w = np.eye(out_channels).reshape((patch_size, patch_size, 1, out_channels))
-        self.w = np.transpose(self.w, (3, 2, 0, 1))
-        self.w = torch.tensor(self.w, dtype=torch.float32)
-
-    def transform(self, tensor):
-        self.w = self.w.to(tensor.device)
-        tensor_ = tensor.mean(dim=1, keepdim=True)
-        patches = F.conv2d(tensor_, self.w, padding=self.patch_size // 2, bias=None)
-        loc_diff = patches - tensor_
-        loc_diff_norm = loc_diff / torch.sqrt(0.81 + loc_diff**2)
-        return loc_diff_norm
-
-    def valid_mask(self, tensor):
-        padding = self.patch_size // 2
-        b, c, h, w = tensor.size()
-        inner = torch.ones(b, 1, h - 2 * padding, w - 2 * padding).type_as(tensor)
-        mask = F.pad(inner, [padding] * 4)
-        return mask
-
-    def _forward(self, imgt_pred, imgt):
-        loc_diff_x = self.transform(imgt_pred)
-        loc_diff_y = self.transform(imgt)
-        diff = loc_diff_x - loc_diff_y.detach()
-        dist = (diff**2 / (0.1 + diff**2)).mean(dim=1, keepdim=True)
-        mask = self.valid_mask(imgt_pred)
-        loss = (dist * mask).mean()
-        return loss
-
-
-class GeometryLoss(Loss):
-    def __init__(self, loss_weight, keys, patch_size=3):
-        super().__init__(loss_weight, keys)
-        self.patch_size = patch_size
-        out_channels = patch_size * patch_size
-        self.w = np.eye(out_channels).reshape((patch_size, patch_size, 1, out_channels))
-        self.w = np.transpose(self.w, (3, 2, 0, 1))
-        self.w = torch.tensor(self.w).float()
-
-    def transform(self, tensor):
-        b, c, h, w = tensor.size()
-        self.w = self.w.to(tensor.device)
-        tensor_ = tensor.reshape(b * c, 1, h, w)
-        patches = F.conv2d(tensor_, self.w, padding=self.patch_size // 2, bias=None)
-        loc_diff = patches - tensor_
-        loc_diff_ = loc_diff.reshape(b, c * (self.patch_size**2), h, w)
-        loc_diff_norm = loc_diff_ / torch.sqrt(0.81 + loc_diff_**2)
-        return loc_diff_norm
-
-    def valid_mask(self, tensor):
-        padding = self.patch_size // 2
-        b, c, h, w = tensor.size()
-        inner = torch.ones(b, 1, h - 2 * padding, w - 2 * padding).type_as(tensor)
-        mask = F.pad(inner, [padding] * 4)
-        return mask
-
-    def _forward(self, ft_pred, ft_gt):
-        loss = 0.0
-        for pred, gt in zip(ft_pred, ft_gt):
-            loc_diff_x = self.transform(pred)
-            loc_diff_y = self.transform(gt)
-            diff = loc_diff_x - loc_diff_y
-            dist = (diff**2 / (0.1 + diff**2)).mean(dim=1, keepdim=True)
-            mask = self.valid_mask(pred)
-            loss = loss + (dist * mask).mean()
-        return loss
-
-
-class IFRFlowLoss(Loss):
-    def __init__(self, loss_weight, keys, beta=0.3) -> None:
-        super().__init__(loss_weight, keys)
-        self.beta = beta
-        self.ada_cb_loss = AdaCharbonnierLoss(1.0, ["imgt_pred", "imgt", "weight"])
-
-    def _forward(self, flow0_pred, flow1_pred, flow):
-
-        robust_weight0 = self.get_robust_weight(flow0_pred[0], flow[:, 0:2])
-        robust_weight1 = self.get_robust_weight(flow1_pred[0], flow[:, 2:4])
-        loss = 0
-        for lvl in range(1, len(flow0_pred)):
-            scale_factor = 2**lvl
-            loss = loss + self.ada_cb_loss(
-                **{
-                    "imgt_pred": self.resize(flow0_pred[lvl], scale_factor),
-                    "imgt": flow[:, 0:2],
-                    "weight": robust_weight0,
-                }
-            )
-            loss = loss + self.ada_cb_loss(
-                **{
-                    "imgt_pred": self.resize(flow1_pred[lvl], scale_factor),
-                    "imgt": flow[:, 2:4],
-                    "weight": robust_weight1,
-                }
-            )
-        return loss
-
-    def resize(self, x, scale_factor):
-        return scale_factor * F.interpolate(
-            x, scale_factor=scale_factor, mode="bilinear", align_corners=False
-        )
-
-    def get_robust_weight(self, flow_pred, flow_gt):
-        epe = ((flow_pred.detach() - flow_gt) ** 2).sum(dim=1, keepdim=True) ** 0.5
-        robust_weight = torch.exp(-self.beta * epe)
-        return robust_weight
-
-
-class MultipleFlowLoss(Loss):
-    def __init__(self, loss_weight, keys, beta=0.3) -> None:
-        super().__init__(loss_weight, keys)
-        self.beta = beta
-        self.ada_cb_loss = AdaCharbonnierLoss(1.0, ["imgt_pred", "imgt", "weight"])
-
-    def _forward(self, flow0_pred, flow1_pred, flow):
-
-        robust_weight0 = self.get_mutli_flow_robust_weight(flow0_pred[0], flow[:, 0:2])
-        robust_weight1 = self.get_mutli_flow_robust_weight(flow1_pred[0], flow[:, 2:4])
-        loss = 0
-        for lvl in range(1, len(flow0_pred)):
-            scale_factor = 2**lvl
-            loss = loss + self.ada_cb_loss(
-                **{
-                    "imgt_pred": self.resize(flow0_pred[lvl], scale_factor),
-                    "imgt": flow[:, 0:2],
-                    "weight": robust_weight0,
-                }
-            )
-            loss = loss + self.ada_cb_loss(
-                **{
-                    "imgt_pred": self.resize(flow1_pred[lvl], scale_factor),
-                    "imgt": flow[:, 2:4],
-                    "weight": robust_weight1,
-                }
-            )
-        return loss
-
-    def resize(self, x, scale_factor):
-        return scale_factor * F.interpolate(
-            x, scale_factor=scale_factor, mode="bilinear", align_corners=False
-        )
-
-    def get_mutli_flow_robust_weight(self, flow_pred, flow_gt):
-        b, num_flows, c, h, w = flow_pred.shape
-        flow_pred = flow_pred.view(b, num_flows, c, h, w)
-        flow_gt = flow_gt.repeat(1, num_flows, 1, 1).view(b, num_flows, c, h, w)
-        epe = ((flow_pred.detach() - flow_gt) ** 2).sum(dim=2, keepdim=True).max(1)[
-            0
-        ] ** 0.5
-        robust_weight = torch.exp(-self.beta * epe)
-        return robust_weight
diff --git a/eval/vbench/third_party/amt/metrics/__init__.py b/eval/vbench/third_party/amt/metrics/__init__.py
deleted file mode 100644
index e69de29b..00000000
diff --git a/eval/vbench/third_party/amt/metrics/psnr_ssim.py b/eval/vbench/third_party/amt/metrics/psnr_ssim.py
deleted file mode 100644
index c8bb9e70..00000000
--- a/eval/vbench/third_party/amt/metrics/psnr_ssim.py
+++ /dev/null
@@ -1,236 +0,0 @@
-from math import exp
-
-import torch
-import torch.nn.functional as F
-
-device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
-
-
-def gaussian(window_size, sigma):
-    gauss = torch.Tensor(
-        [
-            exp(-((x - window_size // 2) ** 2) / float(2 * sigma**2))
-            for x in range(window_size)
-        ]
-    )
-    return gauss / gauss.sum()
-
-
-def create_window(window_size, channel=1):
-    _1D_window = gaussian(window_size, 1.5).unsqueeze(1)
-    _2D_window = (
-        _1D_window.mm(_1D_window.t()).float().unsqueeze(0).unsqueeze(0).to(device)
-    )
-    window = _2D_window.expand(channel, 1, window_size, window_size).contiguous()
-    return window
-
-
-def create_window_3d(window_size, channel=1):
-    _1D_window = gaussian(window_size, 1.5).unsqueeze(1)
-    _2D_window = _1D_window.mm(_1D_window.t())
-    _3D_window = _2D_window.unsqueeze(2) @ (_1D_window.t())
-    window = (
-        _3D_window.expand(1, channel, window_size, window_size, window_size)
-        .contiguous()
-        .to(device)
-    )
-    return window
-
-
-def ssim(
-    img1,
-    img2,
-    window_size=11,
-    window=None,
-    size_average=True,
-    full=False,
-    val_range=None,
-):
-    if val_range is None:
-        if torch.max(img1) > 128:
-            max_val = 255
-        else:
-            max_val = 1
-
-        if torch.min(img1) < -0.5:
-            min_val = -1
-        else:
-            min_val = 0
-        L = max_val - min_val
-    else:
-        L = val_range
-
-    padd = 0
-    (_, channel, height, width) = img1.size()
-    if window is None:
-        real_size = min(window_size, height, width)
-        window = create_window(real_size, channel=channel).to(img1.device)
-
-    mu1 = F.conv2d(
-        F.pad(img1, (5, 5, 5, 5), mode="replicate"),
-        window,
-        padding=padd,
-        groups=channel,
-    )
-    mu2 = F.conv2d(
-        F.pad(img2, (5, 5, 5, 5), mode="replicate"),
-        window,
-        padding=padd,
-        groups=channel,
-    )
-
-    mu1_sq = mu1.pow(2)
-    mu2_sq = mu2.pow(2)
-    mu1_mu2 = mu1 * mu2
-
-    sigma1_sq = (
-        F.conv2d(
-            F.pad(img1 * img1, (5, 5, 5, 5), "replicate"),
-            window,
-            padding=padd,
-            groups=channel,
-        )
-        - mu1_sq
-    )
-    sigma2_sq = (
-        F.conv2d(
-            F.pad(img2 * img2, (5, 5, 5, 5), "replicate"),
-            window,
-            padding=padd,
-            groups=channel,
-        )
-        - mu2_sq
-    )
-    sigma12 = (
-        F.conv2d(
-            F.pad(img1 * img2, (5, 5, 5, 5), "replicate"),
-            window,
-            padding=padd,
-            groups=channel,
-        )
-        - mu1_mu2
-    )
-
-    C1 = (0.01 * L) ** 2
-    C2 = (0.03 * L) ** 2
-
-    v1 = 2.0 * sigma12 + C2
-    v2 = sigma1_sq + sigma2_sq + C2
-    cs = torch.mean(v1 / v2)
-
-    ssim_map = ((2 * mu1_mu2 + C1) * v1) / ((mu1_sq + mu2_sq + C1) * v2)
-
-    if size_average:
-        ret = ssim_map.mean()
-    else:
-        ret = ssim_map.mean(1).mean(1).mean(1)
-
-    if full:
-        return ret, cs
-    return ret
-
-
-def calculate_ssim(
-    img1,
-    img2,
-    window_size=11,
-    window=None,
-    size_average=True,
-    full=False,
-    val_range=None,
-):
-    if val_range is None:
-        if torch.max(img1) > 128:
-            max_val = 255
-        else:
-            max_val = 1
-
-        if torch.min(img1) < -0.5:
-            min_val = -1
-        else:
-            min_val = 0
-        L = max_val - min_val
-    else:
-        L = val_range
-
-    padd = 0
-    (_, _, height, width) = img1.size()
-    if window is None:
-        real_size = min(window_size, height, width)
-        window = create_window_3d(real_size, channel=1).to(img1.device)
-
-    img1 = img1.unsqueeze(1)
-    img2 = img2.unsqueeze(1)
-
-    mu1 = F.conv3d(
-        F.pad(img1, (5, 5, 5, 5, 5, 5), mode="replicate"),
-        window,
-        padding=padd,
-        groups=1,
-    )
-    mu2 = F.conv3d(
-        F.pad(img2, (5, 5, 5, 5, 5, 5), mode="replicate"),
-        window,
-        padding=padd,
-        groups=1,
-    )
-
-    mu1_sq = mu1.pow(2)
-    mu2_sq = mu2.pow(2)
-    mu1_mu2 = mu1 * mu2
-
-    sigma1_sq = (
-        F.conv3d(
-            F.pad(img1 * img1, (5, 5, 5, 5, 5, 5), "replicate"),
-            window,
-            padding=padd,
-            groups=1,
-        )
-        - mu1_sq
-    )
-    sigma2_sq = (
-        F.conv3d(
-            F.pad(img2 * img2, (5, 5, 5, 5, 5, 5), "replicate"),
-            window,
-            padding=padd,
-            groups=1,
-        )
-        - mu2_sq
-    )
-    sigma12 = (
-        F.conv3d(
-            F.pad(img1 * img2, (5, 5, 5, 5, 5, 5), "replicate"),
-            window,
-            padding=padd,
-            groups=1,
-        )
-        - mu1_mu2
-    )
-
-    C1 = (0.01 * L) ** 2
-    C2 = (0.03 * L) ** 2
-
-    v1 = 2.0 * sigma12 + C2
-    v2 = sigma1_sq + sigma2_sq + C2
-    cs = torch.mean(v1 / v2)
-
-    ssim_map = ((2 * mu1_mu2 + C1) * v1) / ((mu1_sq + mu2_sq + C1) * v2)
-
-    if size_average:
-        ret = ssim_map.mean()
-    else:
-        ret = ssim_map.mean(1).mean(1).mean(1)
-
-    if full:
-        return ret, cs
-    return ret.detach().cpu().numpy()
-
-
-def calculate_psnr(img1, img2):
-    psnr = -10 * torch.log10(((img1 - img2) * (img1 - img2)).mean())
-    return psnr.detach().cpu().numpy()
-
-
-def calculate_ie(img1, img2):
-    ie = torch.abs(torch.round(img1 * 255.0) - torch.round(img2 * 255.0)).mean()
-    return ie.detach().cpu().numpy()
diff --git a/eval/vbench/third_party/amt/networks/AMT-G.py b/eval/vbench/third_party/amt/networks/AMT-G.py
deleted file mode 100644
index 510e1445..00000000
--- a/eval/vbench/third_party/amt/networks/AMT-G.py
+++ /dev/null
@@ -1,202 +0,0 @@
-import torch
-import torch.nn as nn
-import torch.nn.functional as F
-from vbench.third_party.amt.networks.blocks.feat_enc import LargeEncoder
-from vbench.third_party.amt.networks.blocks.ifrnet import (
-    Encoder,
-    InitDecoder,
-    IntermediateDecoder,
-    resize,
-)
-from vbench.third_party.amt.networks.blocks.multi_flow import (
-    MultiFlowDecoder,
-    multi_flow_combine,
-)
-from vbench.third_party.amt.networks.blocks.raft import (
-    BasicUpdateBlock,
-    BidirCorrBlock,
-    coords_grid,
-)
-
-
-class Model(nn.Module):
-    def __init__(
-        self,
-        corr_radius=3,
-        corr_lvls=4,
-        num_flows=5,
-        channels=[84, 96, 112, 128],
-        skip_channels=84,
-    ):
-        super(Model, self).__init__()
-        self.radius = corr_radius
-        self.corr_levels = corr_lvls
-        self.num_flows = num_flows
-
-        self.feat_encoder = LargeEncoder(
-            output_dim=128, norm_fn="instance", dropout=0.0
-        )
-        self.encoder = Encoder(channels, large=True)
-        self.decoder4 = InitDecoder(channels[3], channels[2], skip_channels)
-        self.decoder3 = IntermediateDecoder(channels[2], channels[1], skip_channels)
-        self.decoder2 = IntermediateDecoder(channels[1], channels[0], skip_channels)
-        self.decoder1 = MultiFlowDecoder(channels[0], skip_channels, num_flows)
-
-        self.update4 = self._get_updateblock(112, None)
-        self.update3_low = self._get_updateblock(96, 2.0)
-        self.update2_low = self._get_updateblock(84, 4.0)
-
-        self.update3_high = self._get_updateblock(96, None)
-        self.update2_high = self._get_updateblock(84, None)
-
-        self.comb_block = nn.Sequential(
-            nn.Conv2d(3 * self.num_flows, 6 * self.num_flows, 7, 1, 3),
-            nn.PReLU(6 * self.num_flows),
-            nn.Conv2d(6 * self.num_flows, 3, 7, 1, 3),
-        )
-
-    def _get_updateblock(self, cdim, scale_factor=None):
-        return BasicUpdateBlock(
-            cdim=cdim,
-            hidden_dim=192,
-            flow_dim=64,
-            corr_dim=256,
-            corr_dim2=192,
-            fc_dim=188,
-            scale_factor=scale_factor,
-            corr_levels=self.corr_levels,
-            radius=self.radius,
-        )
-
-    def _corr_scale_lookup(self, corr_fn, coord, flow0, flow1, embt, downsample=1):
-        # convert t -> 0 to 0 -> 1 | convert t -> 1 to 1 -> 0
-        # based on linear assumption
-        t1_scale = 1.0 / embt
-        t0_scale = 1.0 / (1.0 - embt)
-        if downsample != 1:
-            inv = 1 / downsample
-            flow0 = inv * resize(flow0, scale_factor=inv)
-            flow1 = inv * resize(flow1, scale_factor=inv)
-
-        corr0, corr1 = corr_fn(coord + flow1 * t1_scale, coord + flow0 * t0_scale)
-        corr = torch.cat([corr0, corr1], dim=1)
-        flow = torch.cat([flow0, flow1], dim=1)
-        return corr, flow
-
-    def forward(self, img0, img1, embt, scale_factor=1.0, eval=False, **kwargs):
-        mean_ = (
-            torch.cat([img0, img1], 2)
-            .mean(1, keepdim=True)
-            .mean(2, keepdim=True)
-            .mean(3, keepdim=True)
-        )
-        img0 = img0 - mean_
-        img1 = img1 - mean_
-        img0_ = resize(img0, scale_factor) if scale_factor != 1.0 else img0
-        img1_ = resize(img1, scale_factor) if scale_factor != 1.0 else img1
-        b, _, h, w = img0_.shape
-        coord = coords_grid(b, h // 8, w // 8, img0.device)
-
-        fmap0, fmap1 = self.feat_encoder([img0_, img1_])  # [1, 128, H//8, W//8]
-        corr_fn = BidirCorrBlock(
-            fmap0, fmap1, radius=self.radius, num_levels=self.corr_levels
-        )
-
-        # f0_1: [1, c0, H//2, W//2] | f0_2: [1, c1, H//4, W//4]
-        # f0_3: [1, c2, H//8, W//8] | f0_4: [1, c3, H//16, W//16]
-        f0_1, f0_2, f0_3, f0_4 = self.encoder(img0_)
-        f1_1, f1_2, f1_3, f1_4 = self.encoder(img1_)
-
-        ######################################### the 4th decoder #########################################
-        up_flow0_4, up_flow1_4, ft_3_ = self.decoder4(f0_4, f1_4, embt)
-        corr_4, flow_4 = self._corr_scale_lookup(
-            corr_fn, coord, up_flow0_4, up_flow1_4, embt, downsample=1
-        )
-
-        # residue update with lookup corr
-        delta_ft_3_, delta_flow_4 = self.update4(ft_3_, flow_4, corr_4)
-        delta_flow0_4, delta_flow1_4 = torch.chunk(delta_flow_4, 2, 1)
-        up_flow0_4 = up_flow0_4 + delta_flow0_4
-        up_flow1_4 = up_flow1_4 + delta_flow1_4
-        ft_3_ = ft_3_ + delta_ft_3_
-
-        ######################################### the 3rd decoder #########################################
-        up_flow0_3, up_flow1_3, ft_2_ = self.decoder3(
-            ft_3_, f0_3, f1_3, up_flow0_4, up_flow1_4
-        )
-        corr_3, flow_3 = self._corr_scale_lookup(
-            corr_fn, coord, up_flow0_3, up_flow1_3, embt, downsample=2
-        )
-
-        # residue update with lookup corr
-        delta_ft_2_, delta_flow_3 = self.update3_low(ft_2_, flow_3, corr_3)
-        delta_flow0_3, delta_flow1_3 = torch.chunk(delta_flow_3, 2, 1)
-        up_flow0_3 = up_flow0_3 + delta_flow0_3
-        up_flow1_3 = up_flow1_3 + delta_flow1_3
-        ft_2_ = ft_2_ + delta_ft_2_
-
-        # residue update with lookup corr (hr)
-        corr_3 = resize(corr_3, scale_factor=2.0)
-        up_flow_3 = torch.cat([up_flow0_3, up_flow1_3], dim=1)
-        delta_ft_2_, delta_up_flow_3 = self.update3_high(ft_2_, up_flow_3, corr_3)
-        ft_2_ += delta_ft_2_
-        up_flow0_3 += delta_up_flow_3[:, 0:2]
-        up_flow1_3 += delta_up_flow_3[:, 2:4]
-
-        ######################################### the 2nd decoder #########################################
-        up_flow0_2, up_flow1_2, ft_1_ = self.decoder2(
-            ft_2_, f0_2, f1_2, up_flow0_3, up_flow1_3
-        )
-        corr_2, flow_2 = self._corr_scale_lookup(
-            corr_fn, coord, up_flow0_2, up_flow1_2, embt, downsample=4
-        )
-
-        # residue update with lookup corr
-        delta_ft_1_, delta_flow_2 = self.update2_low(ft_1_, flow_2, corr_2)
-        delta_flow0_2, delta_flow1_2 = torch.chunk(delta_flow_2, 2, 1)
-        up_flow0_2 = up_flow0_2 + delta_flow0_2
-        up_flow1_2 = up_flow1_2 + delta_flow1_2
-        ft_1_ = ft_1_ + delta_ft_1_
-
-        # residue update with lookup corr (hr)
-        corr_2 = resize(corr_2, scale_factor=4.0)
-        up_flow_2 = torch.cat([up_flow0_2, up_flow1_2], dim=1)
-        delta_ft_1_, delta_up_flow_2 = self.update2_high(ft_1_, up_flow_2, corr_2)
-        ft_1_ += delta_ft_1_
-        up_flow0_2 += delta_up_flow_2[:, 0:2]
-        up_flow1_2 += delta_up_flow_2[:, 2:4]
-
-        ######################################### the 1st decoder #########################################
-        up_flow0_1, up_flow1_1, mask, img_res = self.decoder1(
-            ft_1_, f0_1, f1_1, up_flow0_2, up_flow1_2
-        )
-
-        if scale_factor != 1.0:
-            up_flow0_1 = resize(up_flow0_1, scale_factor=(1.0 / scale_factor)) * (
-                1.0 / scale_factor
-            )
-            up_flow1_1 = resize(up_flow1_1, scale_factor=(1.0 / scale_factor)) * (
-                1.0 / scale_factor
-            )
-            mask = resize(mask, scale_factor=(1.0 / scale_factor))
-            img_res = resize(img_res, scale_factor=(1.0 / scale_factor))
-
-        # Merge multiple predictions
-        imgt_pred = multi_flow_combine(
-            self.comb_block, img0, img1, up_flow0_1, up_flow1_1, mask, img_res, mean_
-        )
-        imgt_pred = torch.clamp(imgt_pred, 0, 1)
-
-        if eval:
-            return {
-                "imgt_pred": imgt_pred,
-            }
-        else:
-            up_flow0_1 = up_flow0_1.reshape(b, self.num_flows, 2, h, w)
-            up_flow1_1 = up_flow1_1.reshape(b, self.num_flows, 2, h, w)
-            return {
-                "imgt_pred": imgt_pred,
-                "flow0_pred": [up_flow0_1, up_flow0_2, up_flow0_3, up_flow0_4],
-                "flow1_pred": [up_flow1_1, up_flow1_2, up_flow1_3, up_flow1_4],
-                "ft_pred": [ft_1_, ft_2_, ft_3_],
-            }
diff --git a/eval/vbench/third_party/amt/networks/AMT-L.py b/eval/vbench/third_party/amt/networks/AMT-L.py
deleted file mode 100644
index a238d1cd..00000000
--- a/eval/vbench/third_party/amt/networks/AMT-L.py
+++ /dev/null
@@ -1,183 +0,0 @@
-import torch
-import torch.nn as nn
-from vbench.third_party.amt.networks.blocks.feat_enc import BasicEncoder
-from vbench.third_party.amt.networks.blocks.ifrnet import (
-    Encoder,
-    InitDecoder,
-    IntermediateDecoder,
-    resize,
-)
-from vbench.third_party.amt.networks.blocks.multi_flow import (
-    MultiFlowDecoder,
-    multi_flow_combine,
-)
-from vbench.third_party.amt.networks.blocks.raft import (
-    BasicUpdateBlock,
-    BidirCorrBlock,
-    coords_grid,
-)
-
-
-class Model(nn.Module):
-    def __init__(
-        self,
-        corr_radius=3,
-        corr_lvls=4,
-        num_flows=5,
-        channels=[48, 64, 72, 128],
-        skip_channels=48,
-    ):
-        super(Model, self).__init__()
-        self.radius = corr_radius
-        self.corr_levels = corr_lvls
-        self.num_flows = num_flows
-
-        self.feat_encoder = BasicEncoder(
-            output_dim=128, norm_fn="instance", dropout=0.0
-        )
-        self.encoder = Encoder([48, 64, 72, 128], large=True)
-
-        self.decoder4 = InitDecoder(channels[3], channels[2], skip_channels)
-        self.decoder3 = IntermediateDecoder(channels[2], channels[1], skip_channels)
-        self.decoder2 = IntermediateDecoder(channels[1], channels[0], skip_channels)
-        self.decoder1 = MultiFlowDecoder(channels[0], skip_channels, num_flows)
-
-        self.update4 = self._get_updateblock(72, None)
-        self.update3 = self._get_updateblock(64, 2.0)
-        self.update2 = self._get_updateblock(48, 4.0)
-
-        self.comb_block = nn.Sequential(
-            nn.Conv2d(3 * self.num_flows, 6 * self.num_flows, 7, 1, 3),
-            nn.PReLU(6 * self.num_flows),
-            nn.Conv2d(6 * self.num_flows, 3, 7, 1, 3),
-        )
-
-    def _get_updateblock(self, cdim, scale_factor=None):
-        return BasicUpdateBlock(
-            cdim=cdim,
-            hidden_dim=128,
-            flow_dim=48,
-            corr_dim=256,
-            corr_dim2=160,
-            fc_dim=124,
-            scale_factor=scale_factor,
-            corr_levels=self.corr_levels,
-            radius=self.radius,
-        )
-
-    def _corr_scale_lookup(self, corr_fn, coord, flow0, flow1, embt, downsample=1):
-        # convert t -> 0 to 0 -> 1 | convert t -> 1 to 1 -> 0
-        # based on linear assumption
-        t1_scale = 1.0 / embt
-        t0_scale = 1.0 / (1.0 - embt)
-        if downsample != 1:
-            inv = 1 / downsample
-            flow0 = inv * resize(flow0, scale_factor=inv)
-            flow1 = inv * resize(flow1, scale_factor=inv)
-
-        corr0, corr1 = corr_fn(coord + flow1 * t1_scale, coord + flow0 * t0_scale)
-        corr = torch.cat([corr0, corr1], dim=1)
-        flow = torch.cat([flow0, flow1], dim=1)
-        return corr, flow
-
-    def forward(self, img0, img1, embt, scale_factor=1.0, eval=False, **kwargs):
-        mean_ = (
-            torch.cat([img0, img1], 2)
-            .mean(1, keepdim=True)
-            .mean(2, keepdim=True)
-            .mean(3, keepdim=True)
-        )
-        img0 = img0 - mean_
-        img1 = img1 - mean_
-        img0_ = resize(img0, scale_factor) if scale_factor != 1.0 else img0
-        img1_ = resize(img1, scale_factor) if scale_factor != 1.0 else img1
-        b, _, h, w = img0_.shape
-        coord = coords_grid(b, h // 8, w // 8, img0.device)
-
-        fmap0, fmap1 = self.feat_encoder([img0_, img1_])  # [1, 128, H//8, W//8]
-        corr_fn = BidirCorrBlock(
-            fmap0, fmap1, radius=self.radius, num_levels=self.corr_levels
-        )
-
-        # f0_1: [1, c0, H//2, W//2] | f0_2: [1, c1, H//4, W//4]
-        # f0_3: [1, c2, H//8, W//8] | f0_4: [1, c3, H//16, W//16]
-        f0_1, f0_2, f0_3, f0_4 = self.encoder(img0_)
-        f1_1, f1_2, f1_3, f1_4 = self.encoder(img1_)
-
-        ######################################### the 4th decoder #########################################
-        up_flow0_4, up_flow1_4, ft_3_ = self.decoder4(f0_4, f1_4, embt)
-        corr_4, flow_4 = self._corr_scale_lookup(
-            corr_fn, coord, up_flow0_4, up_flow1_4, embt, downsample=1
-        )
-
-        # residue update with lookup corr
-        delta_ft_3_, delta_flow_4 = self.update4(ft_3_, flow_4, corr_4)
-        delta_flow0_4, delta_flow1_4 = torch.chunk(delta_flow_4, 2, 1)
-        up_flow0_4 = up_flow0_4 + delta_flow0_4
-        up_flow1_4 = up_flow1_4 + delta_flow1_4
-        ft_3_ = ft_3_ + delta_ft_3_
-
-        ######################################### the 3rd decoder #########################################
-        up_flow0_3, up_flow1_3, ft_2_ = self.decoder3(
-            ft_3_, f0_3, f1_3, up_flow0_4, up_flow1_4
-        )
-        corr_3, flow_3 = self._corr_scale_lookup(
-            corr_fn, coord, up_flow0_3, up_flow1_3, embt, downsample=2
-        )
-
-        # residue update with lookup corr
-        delta_ft_2_, delta_flow_3 = self.update3(ft_2_, flow_3, corr_3)
-        delta_flow0_3, delta_flow1_3 = torch.chunk(delta_flow_3, 2, 1)
-        up_flow0_3 = up_flow0_3 + delta_flow0_3
-        up_flow1_3 = up_flow1_3 + delta_flow1_3
-        ft_2_ = ft_2_ + delta_ft_2_
-
-        ######################################### the 2nd decoder #########################################
-        up_flow0_2, up_flow1_2, ft_1_ = self.decoder2(
-            ft_2_, f0_2, f1_2, up_flow0_3, up_flow1_3
-        )
-        corr_2, flow_2 = self._corr_scale_lookup(
-            corr_fn, coord, up_flow0_2, up_flow1_2, embt, downsample=4
-        )
-
-        # residue update with lookup corr
-        delta_ft_1_, delta_flow_2 = self.update2(ft_1_, flow_2, corr_2)
-        delta_flow0_2, delta_flow1_2 = torch.chunk(delta_flow_2, 2, 1)
-        up_flow0_2 = up_flow0_2 + delta_flow0_2
-        up_flow1_2 = up_flow1_2 + delta_flow1_2
-        ft_1_ = ft_1_ + delta_ft_1_
-
-        ######################################### the 1st decoder #########################################
-        up_flow0_1, up_flow1_1, mask, img_res = self.decoder1(
-            ft_1_, f0_1, f1_1, up_flow0_2, up_flow1_2
-        )
-
-        if scale_factor != 1.0:
-            up_flow0_1 = resize(up_flow0_1, scale_factor=(1.0 / scale_factor)) * (
-                1.0 / scale_factor
-            )
-            up_flow1_1 = resize(up_flow1_1, scale_factor=(1.0 / scale_factor)) * (
-                1.0 / scale_factor
-            )
-            mask = resize(mask, scale_factor=(1.0 / scale_factor))
-            img_res = resize(img_res, scale_factor=(1.0 / scale_factor))
-
-        # Merge multiple predictions
-        imgt_pred = multi_flow_combine(
-            self.comb_block, img0, img1, up_flow0_1, up_flow1_1, mask, img_res, mean_
-        )
-        imgt_pred = torch.clamp(imgt_pred, 0, 1)
-
-        if eval:
-            return {
-                "imgt_pred": imgt_pred,
-            }
-        else:
-            up_flow0_1 = up_flow0_1.reshape(b, self.num_flows, 2, h, w)
-            up_flow1_1 = up_flow1_1.reshape(b, self.num_flows, 2, h, w)
-            return {
-                "imgt_pred": imgt_pred,
-                "flow0_pred": [up_flow0_1, up_flow0_2, up_flow0_3, up_flow0_4],
-                "flow1_pred": [up_flow1_1, up_flow1_2, up_flow1_3, up_flow1_4],
-                "ft_pred": [ft_1_, ft_2_, ft_3_],
-            }
diff --git a/eval/vbench/third_party/amt/networks/AMT-S.py b/eval/vbench/third_party/amt/networks/AMT-S.py
deleted file mode 100644
index 9b9f058c..00000000
--- a/eval/vbench/third_party/amt/networks/AMT-S.py
+++ /dev/null
@@ -1,182 +0,0 @@
-import torch
-import torch.nn as nn
-from vbench.third_party.amt.networks.blocks.feat_enc import SmallEncoder
-from vbench.third_party.amt.networks.blocks.ifrnet import (
-    Encoder,
-    InitDecoder,
-    IntermediateDecoder,
-    resize,
-)
-from vbench.third_party.amt.networks.blocks.multi_flow import (
-    MultiFlowDecoder,
-    multi_flow_combine,
-)
-from vbench.third_party.amt.networks.blocks.raft import (
-    BidirCorrBlock,
-    SmallUpdateBlock,
-    coords_grid,
-)
-
-
-class Model(nn.Module):
-    def __init__(
-        self,
-        corr_radius=3,
-        corr_lvls=4,
-        num_flows=3,
-        channels=[20, 32, 44, 56],
-        skip_channels=20,
-    ):
-        super(Model, self).__init__()
-        self.radius = corr_radius
-        self.corr_levels = corr_lvls
-        self.num_flows = num_flows
-        self.channels = channels
-        self.skip_channels = skip_channels
-
-        self.feat_encoder = SmallEncoder(output_dim=84, norm_fn="instance", dropout=0.0)
-        self.encoder = Encoder(channels)
-
-        self.decoder4 = InitDecoder(channels[3], channels[2], skip_channels)
-        self.decoder3 = IntermediateDecoder(channels[2], channels[1], skip_channels)
-        self.decoder2 = IntermediateDecoder(channels[1], channels[0], skip_channels)
-        self.decoder1 = MultiFlowDecoder(channels[0], skip_channels, num_flows)
-
-        self.update4 = self._get_updateblock(44)
-        self.update3 = self._get_updateblock(32, 2)
-        self.update2 = self._get_updateblock(20, 4)
-
-        self.comb_block = nn.Sequential(
-            nn.Conv2d(3 * num_flows, 6 * num_flows, 3, 1, 1),
-            nn.PReLU(6 * num_flows),
-            nn.Conv2d(6 * num_flows, 3, 3, 1, 1),
-        )
-
-    def _get_updateblock(self, cdim, scale_factor=None):
-        return SmallUpdateBlock(
-            cdim=cdim,
-            hidden_dim=76,
-            flow_dim=20,
-            corr_dim=64,
-            fc_dim=68,
-            scale_factor=scale_factor,
-            corr_levels=self.corr_levels,
-            radius=self.radius,
-        )
-
-    def _corr_scale_lookup(self, corr_fn, coord, flow0, flow1, embt, downsample=1):
-        # convert t -> 0 to 0 -> 1 | convert t -> 1 to 1 -> 0
-        # based on linear assumption
-        t1_scale = 1.0 / embt
-        t0_scale = 1.0 / (1.0 - embt)
-        if downsample != 1:
-            inv = 1 / downsample
-            flow0 = inv * resize(flow0, scale_factor=inv)
-            flow1 = inv * resize(flow1, scale_factor=inv)
-
-        corr0, corr1 = corr_fn(coord + flow1 * t1_scale, coord + flow0 * t0_scale)
-        corr = torch.cat([corr0, corr1], dim=1)
-        flow = torch.cat([flow0, flow1], dim=1)
-        return corr, flow
-
-    def forward(self, img0, img1, embt, scale_factor=1.0, eval=False, **kwargs):
-        mean_ = (
-            torch.cat([img0, img1], 2)
-            .mean(1, keepdim=True)
-            .mean(2, keepdim=True)
-            .mean(3, keepdim=True)
-        )
-        img0 = img0 - mean_
-        img1 = img1 - mean_
-        img0_ = resize(img0, scale_factor) if scale_factor != 1.0 else img0
-        img1_ = resize(img1, scale_factor) if scale_factor != 1.0 else img1
-        b, _, h, w = img0_.shape
-        coord = coords_grid(b, h // 8, w // 8, img0.device)
-
-        fmap0, fmap1 = self.feat_encoder([img0_, img1_])  # [1, 128, H//8, W//8]
-        corr_fn = BidirCorrBlock(
-            fmap0, fmap1, radius=self.radius, num_levels=self.corr_levels
-        )
-
-        # f0_1: [1, c0, H//2, W//2] | f0_2: [1, c1, H//4, W//4]
-        # f0_3: [1, c2, H//8, W//8] | f0_4: [1, c3, H//16, W//16]
-        f0_1, f0_2, f0_3, f0_4 = self.encoder(img0_)
-        f1_1, f1_2, f1_3, f1_4 = self.encoder(img1_)
-
-        ######################################### the 4th decoder #########################################
-        up_flow0_4, up_flow1_4, ft_3_ = self.decoder4(f0_4, f1_4, embt)
-        corr_4, flow_4 = self._corr_scale_lookup(
-            corr_fn, coord, up_flow0_4, up_flow1_4, embt, downsample=1
-        )
-
-        # residue update with lookup corr
-        delta_ft_3_, delta_flow_4 = self.update4(ft_3_, flow_4, corr_4)
-        delta_flow0_4, delta_flow1_4 = torch.chunk(delta_flow_4, 2, 1)
-        up_flow0_4 = up_flow0_4 + delta_flow0_4
-        up_flow1_4 = up_flow1_4 + delta_flow1_4
-        ft_3_ = ft_3_ + delta_ft_3_
-
-        ######################################### the 3rd decoder #########################################
-        up_flow0_3, up_flow1_3, ft_2_ = self.decoder3(
-            ft_3_, f0_3, f1_3, up_flow0_4, up_flow1_4
-        )
-        corr_3, flow_3 = self._corr_scale_lookup(
-            corr_fn, coord, up_flow0_3, up_flow1_3, embt, downsample=2
-        )
-
-        # residue update with lookup corr
-        delta_ft_2_, delta_flow_3 = self.update3(ft_2_, flow_3, corr_3)
-        delta_flow0_3, delta_flow1_3 = torch.chunk(delta_flow_3, 2, 1)
-        up_flow0_3 = up_flow0_3 + delta_flow0_3
-        up_flow1_3 = up_flow1_3 + delta_flow1_3
-        ft_2_ = ft_2_ + delta_ft_2_
-
-        ######################################### the 2nd decoder #########################################
-        up_flow0_2, up_flow1_2, ft_1_ = self.decoder2(
-            ft_2_, f0_2, f1_2, up_flow0_3, up_flow1_3
-        )
-        corr_2, flow_2 = self._corr_scale_lookup(
-            corr_fn, coord, up_flow0_2, up_flow1_2, embt, downsample=4
-        )
-
-        # residue update with lookup corr
-        delta_ft_1_, delta_flow_2 = self.update2(ft_1_, flow_2, corr_2)
-        delta_flow0_2, delta_flow1_2 = torch.chunk(delta_flow_2, 2, 1)
-        up_flow0_2 = up_flow0_2 + delta_flow0_2
-        up_flow1_2 = up_flow1_2 + delta_flow1_2
-        ft_1_ = ft_1_ + delta_ft_1_
-
-        ######################################### the 1st decoder #########################################
-        up_flow0_1, up_flow1_1, mask, img_res = self.decoder1(
-            ft_1_, f0_1, f1_1, up_flow0_2, up_flow1_2
-        )
-
-        if scale_factor != 1.0:
-            up_flow0_1 = resize(up_flow0_1, scale_factor=(1.0 / scale_factor)) * (
-                1.0 / scale_factor
-            )
-            up_flow1_1 = resize(up_flow1_1, scale_factor=(1.0 / scale_factor)) * (
-                1.0 / scale_factor
-            )
-            mask = resize(mask, scale_factor=(1.0 / scale_factor))
-            img_res = resize(img_res, scale_factor=(1.0 / scale_factor))
-
-        # Merge multiple predictions
-        imgt_pred = multi_flow_combine(
-            self.comb_block, img0, img1, up_flow0_1, up_flow1_1, mask, img_res, mean_
-        )
-        imgt_pred = torch.clamp(imgt_pred, 0, 1)
-
-        if eval:
-            return {
-                "imgt_pred": imgt_pred,
-            }
-        else:
-            up_flow0_1 = up_flow0_1.reshape(b, self.num_flows, 2, h, w)
-            up_flow1_1 = up_flow1_1.reshape(b, self.num_flows, 2, h, w)
-            return {
-                "imgt_pred": imgt_pred,
-                "flow0_pred": [up_flow0_1, up_flow0_2, up_flow0_3, up_flow0_4],
-                "flow1_pred": [up_flow1_1, up_flow1_2, up_flow1_3, up_flow1_4],
-                "ft_pred": [ft_1_, ft_2_, ft_3_],
-            }
diff --git a/eval/vbench/third_party/amt/networks/IFRNet.py b/eval/vbench/third_party/amt/networks/IFRNet.py
deleted file mode 100644
index 8cae18a3..00000000
--- a/eval/vbench/third_party/amt/networks/IFRNet.py
+++ /dev/null
@@ -1,173 +0,0 @@
-import torch
-import torch.nn as nn
-from vbench.third_party.amt.networks.blocks.ifrnet import ResBlock, convrelu, resize
-from vbench.third_party.amt.utils.flow_utils import warp
-
-
-class Encoder(nn.Module):
-    def __init__(self):
-        super(Encoder, self).__init__()
-        self.pyramid1 = nn.Sequential(
-            convrelu(3, 32, 3, 2, 1), convrelu(32, 32, 3, 1, 1)
-        )
-        self.pyramid2 = nn.Sequential(
-            convrelu(32, 48, 3, 2, 1), convrelu(48, 48, 3, 1, 1)
-        )
-        self.pyramid3 = nn.Sequential(
-            convrelu(48, 72, 3, 2, 1), convrelu(72, 72, 3, 1, 1)
-        )
-        self.pyramid4 = nn.Sequential(
-            convrelu(72, 96, 3, 2, 1), convrelu(96, 96, 3, 1, 1)
-        )
-
-    def forward(self, img):
-        f1 = self.pyramid1(img)
-        f2 = self.pyramid2(f1)
-        f3 = self.pyramid3(f2)
-        f4 = self.pyramid4(f3)
-        return f1, f2, f3, f4
-
-
-class Decoder4(nn.Module):
-    def __init__(self):
-        super(Decoder4, self).__init__()
-        self.convblock = nn.Sequential(
-            convrelu(192 + 1, 192),
-            ResBlock(192, 32),
-            nn.ConvTranspose2d(192, 76, 4, 2, 1, bias=True),
-        )
-
-    def forward(self, f0, f1, embt):
-        b, c, h, w = f0.shape
-        embt = embt.repeat(1, 1, h, w)
-        f_in = torch.cat([f0, f1, embt], 1)
-        f_out = self.convblock(f_in)
-        return f_out
-
-
-class Decoder3(nn.Module):
-    def __init__(self):
-        super(Decoder3, self).__init__()
-        self.convblock = nn.Sequential(
-            convrelu(220, 216),
-            ResBlock(216, 32),
-            nn.ConvTranspose2d(216, 52, 4, 2, 1, bias=True),
-        )
-
-    def forward(self, ft_, f0, f1, up_flow0, up_flow1):
-        f0_warp = warp(f0, up_flow0)
-        f1_warp = warp(f1, up_flow1)
-        f_in = torch.cat([ft_, f0_warp, f1_warp, up_flow0, up_flow1], 1)
-        f_out = self.convblock(f_in)
-        return f_out
-
-
-class Decoder2(nn.Module):
-    def __init__(self):
-        super(Decoder2, self).__init__()
-        self.convblock = nn.Sequential(
-            convrelu(148, 144),
-            ResBlock(144, 32),
-            nn.ConvTranspose2d(144, 36, 4, 2, 1, bias=True),
-        )
-
-    def forward(self, ft_, f0, f1, up_flow0, up_flow1):
-        f0_warp = warp(f0, up_flow0)
-        f1_warp = warp(f1, up_flow1)
-        f_in = torch.cat([ft_, f0_warp, f1_warp, up_flow0, up_flow1], 1)
-        f_out = self.convblock(f_in)
-        return f_out
-
-
-class Decoder1(nn.Module):
-    def __init__(self):
-        super(Decoder1, self).__init__()
-        self.convblock = nn.Sequential(
-            convrelu(100, 96),
-            ResBlock(96, 32),
-            nn.ConvTranspose2d(96, 8, 4, 2, 1, bias=True),
-        )
-
-    def forward(self, ft_, f0, f1, up_flow0, up_flow1):
-        f0_warp = warp(f0, up_flow0)
-        f1_warp = warp(f1, up_flow1)
-        f_in = torch.cat([ft_, f0_warp, f1_warp, up_flow0, up_flow1], 1)
-        f_out = self.convblock(f_in)
-        return f_out
-
-
-class Model(nn.Module):
-    def __init__(self):
-        super(Model, self).__init__()
-        self.encoder = Encoder()
-        self.decoder4 = Decoder4()
-        self.decoder3 = Decoder3()
-        self.decoder2 = Decoder2()
-        self.decoder1 = Decoder1()
-
-    def forward(self, img0, img1, embt, scale_factor=1.0, eval=False, **kwargs):
-        mean_ = (
-            torch.cat([img0, img1], 2)
-            .mean(1, keepdim=True)
-            .mean(2, keepdim=True)
-            .mean(3, keepdim=True)
-        )
-        img0 = img0 - mean_
-        img1 = img1 - mean_
-
-        img0_ = resize(img0, scale_factor) if scale_factor != 1.0 else img0
-        img1_ = resize(img1, scale_factor) if scale_factor != 1.0 else img1
-
-        f0_1, f0_2, f0_3, f0_4 = self.encoder(img0_)
-        f1_1, f1_2, f1_3, f1_4 = self.encoder(img1_)
-
-        out4 = self.decoder4(f0_4, f1_4, embt)
-        up_flow0_4 = out4[:, 0:2]
-        up_flow1_4 = out4[:, 2:4]
-        ft_3_ = out4[:, 4:]
-
-        out3 = self.decoder3(ft_3_, f0_3, f1_3, up_flow0_4, up_flow1_4)
-        up_flow0_3 = out3[:, 0:2] + 2.0 * resize(up_flow0_4, scale_factor=2.0)
-        up_flow1_3 = out3[:, 2:4] + 2.0 * resize(up_flow1_4, scale_factor=2.0)
-        ft_2_ = out3[:, 4:]
-
-        out2 = self.decoder2(ft_2_, f0_2, f1_2, up_flow0_3, up_flow1_3)
-        up_flow0_2 = out2[:, 0:2] + 2.0 * resize(up_flow0_3, scale_factor=2.0)
-        up_flow1_2 = out2[:, 2:4] + 2.0 * resize(up_flow1_3, scale_factor=2.0)
-        ft_1_ = out2[:, 4:]
-
-        out1 = self.decoder1(ft_1_, f0_1, f1_1, up_flow0_2, up_flow1_2)
-        up_flow0_1 = out1[:, 0:2] + 2.0 * resize(up_flow0_2, scale_factor=2.0)
-        up_flow1_1 = out1[:, 2:4] + 2.0 * resize(up_flow1_2, scale_factor=2.0)
-        up_mask_1 = torch.sigmoid(out1[:, 4:5])
-        up_res_1 = out1[:, 5:]
-
-        if scale_factor != 1.0:
-            up_flow0_1 = resize(up_flow0_1, scale_factor=(1.0 / scale_factor)) * (
-                1.0 / scale_factor
-            )
-            up_flow1_1 = resize(up_flow1_1, scale_factor=(1.0 / scale_factor)) * (
-                1.0 / scale_factor
-            )
-            up_mask_1 = resize(up_mask_1, scale_factor=(1.0 / scale_factor))
-            up_res_1 = resize(up_res_1, scale_factor=(1.0 / scale_factor))
-
-        img0_warp = warp(img0, up_flow0_1)
-        img1_warp = warp(img1, up_flow1_1)
-        imgt_merge = up_mask_1 * img0_warp + (1 - up_mask_1) * img1_warp + mean_
-        imgt_pred = imgt_merge + up_res_1
-        imgt_pred = torch.clamp(imgt_pred, 0, 1)
-
-        if eval:
-            return {
-                "imgt_pred": imgt_pred,
-            }
-        else:
-            return {
-                "imgt_pred": imgt_pred,
-                "flow0_pred": [up_flow0_1, up_flow0_2, up_flow0_3, up_flow0_4],
-                "flow1_pred": [up_flow1_1, up_flow1_2, up_flow1_3, up_flow1_4],
-                "ft_pred": [ft_1_, ft_2_, ft_3_],
-                "img0_warp": img0_warp,
-                "img1_warp": img1_warp,
-            }
diff --git a/eval/vbench/third_party/amt/networks/__init__.py b/eval/vbench/third_party/amt/networks/__init__.py
deleted file mode 100644
index e69de29b..00000000
diff --git a/eval/vbench/third_party/amt/networks/blocks/__init__.py b/eval/vbench/third_party/amt/networks/blocks/__init__.py
deleted file mode 100644
index e69de29b..00000000
diff --git a/eval/vbench/third_party/amt/networks/blocks/feat_enc.py b/eval/vbench/third_party/amt/networks/blocks/feat_enc.py
deleted file mode 100644
index 7af11533..00000000
--- a/eval/vbench/third_party/amt/networks/blocks/feat_enc.py
+++ /dev/null
@@ -1,346 +0,0 @@
-import torch
-import torch.nn as nn
-
-
-class BottleneckBlock(nn.Module):
-    def __init__(self, in_planes, planes, norm_fn="group", stride=1):
-        super(BottleneckBlock, self).__init__()
-
-        self.conv1 = nn.Conv2d(in_planes, planes // 4, kernel_size=1, padding=0)
-        self.conv2 = nn.Conv2d(
-            planes // 4, planes // 4, kernel_size=3, padding=1, stride=stride
-        )
-        self.conv3 = nn.Conv2d(planes // 4, planes, kernel_size=1, padding=0)
-        self.relu = nn.ReLU(inplace=True)
-
-        num_groups = planes // 8
-
-        if norm_fn == "group":
-            self.norm1 = nn.GroupNorm(num_groups=num_groups, num_channels=planes // 4)
-            self.norm2 = nn.GroupNorm(num_groups=num_groups, num_channels=planes // 4)
-            self.norm3 = nn.GroupNorm(num_groups=num_groups, num_channels=planes)
-            if not stride == 1:
-                self.norm4 = nn.GroupNorm(num_groups=num_groups, num_channels=planes)
-
-        elif norm_fn == "batch":
-            self.norm1 = nn.BatchNorm2d(planes // 4)
-            self.norm2 = nn.BatchNorm2d(planes // 4)
-            self.norm3 = nn.BatchNorm2d(planes)
-            if not stride == 1:
-                self.norm4 = nn.BatchNorm2d(planes)
-
-        elif norm_fn == "instance":
-            self.norm1 = nn.InstanceNorm2d(planes // 4)
-            self.norm2 = nn.InstanceNorm2d(planes // 4)
-            self.norm3 = nn.InstanceNorm2d(planes)
-            if not stride == 1:
-                self.norm4 = nn.InstanceNorm2d(planes)
-
-        elif norm_fn == "none":
-            self.norm1 = nn.Sequential()
-            self.norm2 = nn.Sequential()
-            self.norm3 = nn.Sequential()
-            if not stride == 1:
-                self.norm4 = nn.Sequential()
-
-        if stride == 1:
-            self.downsample = None
-
-        else:
-            self.downsample = nn.Sequential(
-                nn.Conv2d(in_planes, planes, kernel_size=1, stride=stride), self.norm4
-            )
-
-    def forward(self, x):
-        y = x
-        y = self.relu(self.norm1(self.conv1(y)))
-        y = self.relu(self.norm2(self.conv2(y)))
-        y = self.relu(self.norm3(self.conv3(y)))
-
-        if self.downsample is not None:
-            x = self.downsample(x)
-
-        return self.relu(x + y)
-
-
-class ResidualBlock(nn.Module):
-    def __init__(self, in_planes, planes, norm_fn="group", stride=1):
-        super(ResidualBlock, self).__init__()
-
-        self.conv1 = nn.Conv2d(
-            in_planes, planes, kernel_size=3, padding=1, stride=stride
-        )
-        self.conv2 = nn.Conv2d(planes, planes, kernel_size=3, padding=1)
-        self.relu = nn.ReLU(inplace=True)
-
-        num_groups = planes // 8
-
-        if norm_fn == "group":
-            self.norm1 = nn.GroupNorm(num_groups=num_groups, num_channels=planes)
-            self.norm2 = nn.GroupNorm(num_groups=num_groups, num_channels=planes)
-            if not stride == 1:
-                self.norm3 = nn.GroupNorm(num_groups=num_groups, num_channels=planes)
-
-        elif norm_fn == "batch":
-            self.norm1 = nn.BatchNorm2d(planes)
-            self.norm2 = nn.BatchNorm2d(planes)
-            if not stride == 1:
-                self.norm3 = nn.BatchNorm2d(planes)
-
-        elif norm_fn == "instance":
-            self.norm1 = nn.InstanceNorm2d(planes)
-            self.norm2 = nn.InstanceNorm2d(planes)
-            if not stride == 1:
-                self.norm3 = nn.InstanceNorm2d(planes)
-
-        elif norm_fn == "none":
-            self.norm1 = nn.Sequential()
-            self.norm2 = nn.Sequential()
-            if not stride == 1:
-                self.norm3 = nn.Sequential()
-
-        if stride == 1:
-            self.downsample = None
-
-        else:
-            self.downsample = nn.Sequential(
-                nn.Conv2d(in_planes, planes, kernel_size=1, stride=stride), self.norm3
-            )
-
-    def forward(self, x):
-        y = x
-        y = self.relu(self.norm1(self.conv1(y)))
-        y = self.relu(self.norm2(self.conv2(y)))
-
-        if self.downsample is not None:
-            x = self.downsample(x)
-
-        return self.relu(x + y)
-
-
-class SmallEncoder(nn.Module):
-    def __init__(self, output_dim=128, norm_fn="batch", dropout=0.0):
-        super(SmallEncoder, self).__init__()
-        self.norm_fn = norm_fn
-
-        if self.norm_fn == "group":
-            self.norm1 = nn.GroupNorm(num_groups=8, num_channels=32)
-
-        elif self.norm_fn == "batch":
-            self.norm1 = nn.BatchNorm2d(32)
-
-        elif self.norm_fn == "instance":
-            self.norm1 = nn.InstanceNorm2d(32)
-
-        elif self.norm_fn == "none":
-            self.norm1 = nn.Sequential()
-
-        self.conv1 = nn.Conv2d(3, 32, kernel_size=7, stride=2, padding=3)
-        self.relu1 = nn.ReLU(inplace=True)
-
-        self.in_planes = 32
-        self.layer1 = self._make_layer(32, stride=1)
-        self.layer2 = self._make_layer(64, stride=2)
-        self.layer3 = self._make_layer(96, stride=2)
-
-        self.dropout = None
-        if dropout > 0:
-            self.dropout = nn.Dropout2d(p=dropout)
-
-        self.conv2 = nn.Conv2d(96, output_dim, kernel_size=1)
-
-        for m in self.modules():
-            if isinstance(m, nn.Conv2d):
-                nn.init.kaiming_normal_(m.weight, mode="fan_out", nonlinearity="relu")
-            elif isinstance(m, (nn.BatchNorm2d, nn.InstanceNorm2d, nn.GroupNorm)):
-                if m.weight is not None:
-                    nn.init.constant_(m.weight, 1)
-                if m.bias is not None:
-                    nn.init.constant_(m.bias, 0)
-
-    def _make_layer(self, dim, stride=1):
-        layer1 = BottleneckBlock(self.in_planes, dim, self.norm_fn, stride=stride)
-        layer2 = BottleneckBlock(dim, dim, self.norm_fn, stride=1)
-        layers = (layer1, layer2)
-
-        self.in_planes = dim
-        return nn.Sequential(*layers)
-
-    def forward(self, x):
-
-        # if input is list, combine batch dimension
-        is_list = isinstance(x, tuple) or isinstance(x, list)
-        if is_list:
-            batch_dim = x[0].shape[0]
-            x = torch.cat(x, dim=0)
-
-        x = self.conv1(x)
-        x = self.norm1(x)
-        x = self.relu1(x)
-
-        x = self.layer1(x)
-        x = self.layer2(x)
-        x = self.layer3(x)
-        x = self.conv2(x)
-
-        if self.training and self.dropout is not None:
-            x = self.dropout(x)
-
-        if is_list:
-            x = torch.split(x, [batch_dim, batch_dim], dim=0)
-
-        return x
-
-
-class BasicEncoder(nn.Module):
-    def __init__(self, output_dim=128, norm_fn="batch", dropout=0.0):
-        super(BasicEncoder, self).__init__()
-        self.norm_fn = norm_fn
-
-        if self.norm_fn == "group":
-            self.norm1 = nn.GroupNorm(num_groups=8, num_channels=64)
-
-        elif self.norm_fn == "batch":
-            self.norm1 = nn.BatchNorm2d(64)
-
-        elif self.norm_fn == "instance":
-            self.norm1 = nn.InstanceNorm2d(64)
-
-        elif self.norm_fn == "none":
-            self.norm1 = nn.Sequential()
-
-        self.conv1 = nn.Conv2d(3, 64, kernel_size=7, stride=2, padding=3)
-        self.relu1 = nn.ReLU(inplace=True)
-
-        self.in_planes = 64
-        self.layer1 = self._make_layer(64, stride=1)
-        self.layer2 = self._make_layer(72, stride=2)
-        self.layer3 = self._make_layer(128, stride=2)
-
-        # output convolution
-        self.conv2 = nn.Conv2d(128, output_dim, kernel_size=1)
-
-        self.dropout = None
-        if dropout > 0:
-            self.dropout = nn.Dropout2d(p=dropout)
-
-        for m in self.modules():
-            if isinstance(m, nn.Conv2d):
-                nn.init.kaiming_normal_(m.weight, mode="fan_out", nonlinearity="relu")
-            elif isinstance(m, (nn.BatchNorm2d, nn.InstanceNorm2d, nn.GroupNorm)):
-                if m.weight is not None:
-                    nn.init.constant_(m.weight, 1)
-                if m.bias is not None:
-                    nn.init.constant_(m.bias, 0)
-
-    def _make_layer(self, dim, stride=1):
-        layer1 = ResidualBlock(self.in_planes, dim, self.norm_fn, stride=stride)
-        layer2 = ResidualBlock(dim, dim, self.norm_fn, stride=1)
-        layers = (layer1, layer2)
-
-        self.in_planes = dim
-        return nn.Sequential(*layers)
-
-    def forward(self, x):
-
-        # if input is list, combine batch dimension
-        is_list = isinstance(x, tuple) or isinstance(x, list)
-        if is_list:
-            batch_dim = x[0].shape[0]
-            x = torch.cat(x, dim=0)
-
-        x = self.conv1(x)
-        x = self.norm1(x)
-        x = self.relu1(x)
-
-        x = self.layer1(x)
-        x = self.layer2(x)
-        x = self.layer3(x)
-
-        x = self.conv2(x)
-
-        if self.training and self.dropout is not None:
-            x = self.dropout(x)
-
-        if is_list:
-            x = torch.split(x, [batch_dim, batch_dim], dim=0)
-
-        return x
-
-
-class LargeEncoder(nn.Module):
-    def __init__(self, output_dim=128, norm_fn="batch", dropout=0.0):
-        super(LargeEncoder, self).__init__()
-        self.norm_fn = norm_fn
-
-        if self.norm_fn == "group":
-            self.norm1 = nn.GroupNorm(num_groups=8, num_channels=64)
-
-        elif self.norm_fn == "batch":
-            self.norm1 = nn.BatchNorm2d(64)
-
-        elif self.norm_fn == "instance":
-            self.norm1 = nn.InstanceNorm2d(64)
-
-        elif self.norm_fn == "none":
-            self.norm1 = nn.Sequential()
-
-        self.conv1 = nn.Conv2d(3, 64, kernel_size=7, stride=2, padding=3)
-        self.relu1 = nn.ReLU(inplace=True)
-
-        self.in_planes = 64
-        self.layer1 = self._make_layer(64, stride=1)
-        self.layer2 = self._make_layer(112, stride=2)
-        self.layer3 = self._make_layer(160, stride=2)
-        self.layer3_2 = self._make_layer(160, stride=1)
-
-        # output convolution
-        self.conv2 = nn.Conv2d(self.in_planes, output_dim, kernel_size=1)
-
-        self.dropout = None
-        if dropout > 0:
-            self.dropout = nn.Dropout2d(p=dropout)
-
-        for m in self.modules():
-            if isinstance(m, nn.Conv2d):
-                nn.init.kaiming_normal_(m.weight, mode="fan_out", nonlinearity="relu")
-            elif isinstance(m, (nn.BatchNorm2d, nn.InstanceNorm2d, nn.GroupNorm)):
-                if m.weight is not None:
-                    nn.init.constant_(m.weight, 1)
-                if m.bias is not None:
-                    nn.init.constant_(m.bias, 0)
-
-    def _make_layer(self, dim, stride=1):
-        layer1 = ResidualBlock(self.in_planes, dim, self.norm_fn, stride=stride)
-        layer2 = ResidualBlock(dim, dim, self.norm_fn, stride=1)
-        layers = (layer1, layer2)
-
-        self.in_planes = dim
-        return nn.Sequential(*layers)
-
-    def forward(self, x):
-
-        # if input is list, combine batch dimension
-        is_list = isinstance(x, tuple) or isinstance(x, list)
-        if is_list:
-            batch_dim = x[0].shape[0]
-            x = torch.cat(x, dim=0)
-
-        x = self.conv1(x)
-        x = self.norm1(x)
-        x = self.relu1(x)
-
-        x = self.layer1(x)
-        x = self.layer2(x)
-        x = self.layer3(x)
-        x = self.layer3_2(x)
-
-        x = self.conv2(x)
-
-        if self.training and self.dropout is not None:
-            x = self.dropout(x)
-
-        if is_list:
-            x = torch.split(x, [batch_dim, batch_dim], dim=0)
-
-        return x
diff --git a/eval/vbench/third_party/amt/networks/blocks/ifrnet.py b/eval/vbench/third_party/amt/networks/blocks/ifrnet.py
deleted file mode 100644
index 356959fa..00000000
--- a/eval/vbench/third_party/amt/networks/blocks/ifrnet.py
+++ /dev/null
@@ -1,159 +0,0 @@
-import torch
-import torch.nn as nn
-import torch.nn.functional as F
-from vbench.third_party.amt.utils.flow_utils import warp
-
-
-def resize(x, scale_factor):
-    return F.interpolate(
-        x, scale_factor=scale_factor, mode="bilinear", align_corners=False
-    )
-
-
-def convrelu(
-    in_channels,
-    out_channels,
-    kernel_size=3,
-    stride=1,
-    padding=1,
-    dilation=1,
-    groups=1,
-    bias=True,
-):
-    return nn.Sequential(
-        nn.Conv2d(
-            in_channels,
-            out_channels,
-            kernel_size,
-            stride,
-            padding,
-            dilation,
-            groups,
-            bias=bias,
-        ),
-        nn.PReLU(out_channels),
-    )
-
-
-class ResBlock(nn.Module):
-    def __init__(self, in_channels, side_channels, bias=True):
-        super(ResBlock, self).__init__()
-        self.side_channels = side_channels
-        self.conv1 = nn.Sequential(
-            nn.Conv2d(
-                in_channels, in_channels, kernel_size=3, stride=1, padding=1, bias=bias
-            ),
-            nn.PReLU(in_channels),
-        )
-        self.conv2 = nn.Sequential(
-            nn.Conv2d(
-                side_channels,
-                side_channels,
-                kernel_size=3,
-                stride=1,
-                padding=1,
-                bias=bias,
-            ),
-            nn.PReLU(side_channels),
-        )
-        self.conv3 = nn.Sequential(
-            nn.Conv2d(
-                in_channels, in_channels, kernel_size=3, stride=1, padding=1, bias=bias
-            ),
-            nn.PReLU(in_channels),
-        )
-        self.conv4 = nn.Sequential(
-            nn.Conv2d(
-                side_channels,
-                side_channels,
-                kernel_size=3,
-                stride=1,
-                padding=1,
-                bias=bias,
-            ),
-            nn.PReLU(side_channels),
-        )
-        self.conv5 = nn.Conv2d(
-            in_channels, in_channels, kernel_size=3, stride=1, padding=1, bias=bias
-        )
-        self.prelu = nn.PReLU(in_channels)
-
-    def forward(self, x):
-        out = self.conv1(x)
-
-        res_feat = out[:, : -self.side_channels, ...]
-        side_feat = out[:, -self.side_channels :, :, :]
-        side_feat = self.conv2(side_feat)
-        out = self.conv3(torch.cat([res_feat, side_feat], 1))
-
-        res_feat = out[:, : -self.side_channels, ...]
-        side_feat = out[:, -self.side_channels :, :, :]
-        side_feat = self.conv4(side_feat)
-        out = self.conv5(torch.cat([res_feat, side_feat], 1))
-
-        out = self.prelu(x + out)
-        return out
-
-
-class Encoder(nn.Module):
-    def __init__(self, channels, large=False):
-        super(Encoder, self).__init__()
-        self.channels = channels
-        prev_ch = 3
-        for idx, ch in enumerate(channels, 1):
-            k = 7 if large and idx == 1 else 3
-            p = 3 if k == 7 else 1
-            self.register_module(
-                f"pyramid{idx}",
-                nn.Sequential(
-                    convrelu(prev_ch, ch, k, 2, p), convrelu(ch, ch, 3, 1, 1)
-                ),
-            )
-            prev_ch = ch
-
-    def forward(self, in_x):
-        fs = []
-        for idx in range(len(self.channels)):
-            out_x = getattr(self, f"pyramid{idx+1}")(in_x)
-            fs.append(out_x)
-            in_x = out_x
-        return fs
-
-
-class InitDecoder(nn.Module):
-    def __init__(self, in_ch, out_ch, skip_ch) -> None:
-        super().__init__()
-        self.convblock = nn.Sequential(
-            convrelu(in_ch * 2 + 1, in_ch * 2),
-            ResBlock(in_ch * 2, skip_ch),
-            nn.ConvTranspose2d(in_ch * 2, out_ch + 4, 4, 2, 1, bias=True),
-        )
-
-    def forward(self, f0, f1, embt):
-        h, w = f0.shape[2:]
-        embt = embt.repeat(1, 1, h, w)
-        out = self.convblock(torch.cat([f0, f1, embt], 1))
-        flow0, flow1 = torch.chunk(out[:, :4, ...], 2, 1)
-        ft_ = out[:, 4:, ...]
-        return flow0, flow1, ft_
-
-
-class IntermediateDecoder(nn.Module):
-    def __init__(self, in_ch, out_ch, skip_ch) -> None:
-        super().__init__()
-        self.convblock = nn.Sequential(
-            convrelu(in_ch * 3 + 4, in_ch * 3),
-            ResBlock(in_ch * 3, skip_ch),
-            nn.ConvTranspose2d(in_ch * 3, out_ch + 4, 4, 2, 1, bias=True),
-        )
-
-    def forward(self, ft_, f0, f1, flow0_in, flow1_in):
-        f0_warp = warp(f0, flow0_in)
-        f1_warp = warp(f1, flow1_in)
-        f_in = torch.cat([ft_, f0_warp, f1_warp, flow0_in, flow1_in], 1)
-        out = self.convblock(f_in)
-        flow0, flow1 = torch.chunk(out[:, :4, ...], 2, 1)
-        ft_ = out[:, 4:, ...]
-        flow0 = flow0 + 2.0 * resize(flow0_in, scale_factor=2.0)
-        flow1 = flow1 + 2.0 * resize(flow1_in, scale_factor=2.0)
-        return flow0, flow1, ft_
diff --git a/eval/vbench/third_party/amt/networks/blocks/multi_flow.py b/eval/vbench/third_party/amt/networks/blocks/multi_flow.py
deleted file mode 100644
index 2f839094..00000000
--- a/eval/vbench/third_party/amt/networks/blocks/multi_flow.py
+++ /dev/null
@@ -1,80 +0,0 @@
-import torch
-import torch.nn as nn
-from vbench.third_party.amt.networks.blocks.ifrnet import ResBlock, convrelu, resize
-from vbench.third_party.amt.utils.flow_utils import warp
-
-
-def multi_flow_combine(
-    comb_block, img0, img1, flow0, flow1, mask=None, img_res=None, mean=None
-):
-    """
-    A parallel implementation of multiple flow field warping
-    comb_block: An nn.Seqential object.
-    img shape: [b, c, h, w]
-    flow shape: [b, 2*num_flows, h, w]
-    mask (opt):
-        If 'mask' is None, the function conduct a simple average.
-    img_res (opt):
-        If 'img_res' is None, the function adds zero instead.
-    mean (opt):
-        If 'mean' is None, the function adds zero instead.
-    """
-    b, c, h, w = flow0.shape
-    num_flows = c // 2
-    flow0 = flow0.reshape(b, num_flows, 2, h, w).reshape(-1, 2, h, w)
-    flow1 = flow1.reshape(b, num_flows, 2, h, w).reshape(-1, 2, h, w)
-
-    mask = (
-        mask.reshape(b, num_flows, 1, h, w).reshape(-1, 1, h, w)
-        if mask is not None
-        else None
-    )
-    img_res = (
-        img_res.reshape(b, num_flows, 3, h, w).reshape(-1, 3, h, w)
-        if img_res is not None
-        else 0
-    )
-    img0 = torch.stack([img0] * num_flows, 1).reshape(-1, 3, h, w)
-    img1 = torch.stack([img1] * num_flows, 1).reshape(-1, 3, h, w)
-    mean = (
-        torch.stack([mean] * num_flows, 1).reshape(-1, 1, 1, 1)
-        if mean is not None
-        else 0
-    )
-
-    img0_warp = warp(img0, flow0)
-    img1_warp = warp(img1, flow1)
-    img_warps = mask * img0_warp + (1 - mask) * img1_warp + mean + img_res
-    img_warps = img_warps.reshape(b, num_flows, 3, h, w)
-    imgt_pred = img_warps.mean(1) + comb_block(img_warps.view(b, -1, h, w))
-    return imgt_pred
-
-
-class MultiFlowDecoder(nn.Module):
-    def __init__(self, in_ch, skip_ch, num_flows=3):
-        super(MultiFlowDecoder, self).__init__()
-        self.num_flows = num_flows
-        self.convblock = nn.Sequential(
-            convrelu(in_ch * 3 + 4, in_ch * 3),
-            ResBlock(in_ch * 3, skip_ch),
-            nn.ConvTranspose2d(in_ch * 3, 8 * num_flows, 4, 2, 1, bias=True),
-        )
-
-    def forward(self, ft_, f0, f1, flow0, flow1):
-        n = self.num_flows
-        f0_warp = warp(f0, flow0)
-        f1_warp = warp(f1, flow1)
-        out = self.convblock(torch.cat([ft_, f0_warp, f1_warp, flow0, flow1], 1))
-        delta_flow0, delta_flow1, mask, img_res = torch.split(
-            out, [2 * n, 2 * n, n, 3 * n], 1
-        )
-        mask = torch.sigmoid(mask)
-
-        flow0 = delta_flow0 + 2.0 * resize(flow0, scale_factor=2.0).repeat(
-            1, self.num_flows, 1, 1
-        )
-        flow1 = delta_flow1 + 2.0 * resize(flow1, scale_factor=2.0).repeat(
-            1, self.num_flows, 1, 1
-        )
-
-        return flow0, flow1, mask, img_res
diff --git a/eval/vbench/third_party/amt/networks/blocks/raft.py b/eval/vbench/third_party/amt/networks/blocks/raft.py
deleted file mode 100644
index 2c0644b4..00000000
--- a/eval/vbench/third_party/amt/networks/blocks/raft.py
+++ /dev/null
@@ -1,240 +0,0 @@
-import torch
-import torch.nn as nn
-import torch.nn.functional as F
-
-
-def resize(x, scale_factor):
-    return F.interpolate(
-        x, scale_factor=scale_factor, mode="bilinear", align_corners=False
-    )
-
-
-def bilinear_sampler(img, coords, mask=False):
-    """Wrapper for grid_sample, uses pixel coordinates"""
-    H, W = img.shape[-2:]
-    xgrid, ygrid = coords.split([1, 1], dim=-1)
-    xgrid = 2 * xgrid / (W - 1) - 1
-    ygrid = 2 * ygrid / (H - 1) - 1
-
-    grid = torch.cat([xgrid, ygrid], dim=-1)
-    img = F.grid_sample(img, grid, align_corners=True)
-
-    if mask:
-        mask = (xgrid > -1) & (ygrid > -1) & (xgrid < 1) & (ygrid < 1)
-        return img, mask.float()
-
-    return img
-
-
-def coords_grid(batch, ht, wd, device):
-    coords = torch.meshgrid(
-        torch.arange(ht, device=device), torch.arange(wd, device=device), indexing="ij"
-    )
-    coords = torch.stack(coords[::-1], dim=0).float()
-    return coords[None].repeat(batch, 1, 1, 1)
-
-
-class SmallUpdateBlock(nn.Module):
-    def __init__(
-        self,
-        cdim,
-        hidden_dim,
-        flow_dim,
-        corr_dim,
-        fc_dim,
-        corr_levels=4,
-        radius=3,
-        scale_factor=None,
-    ):
-        super(SmallUpdateBlock, self).__init__()
-        cor_planes = corr_levels * (2 * radius + 1) ** 2
-        self.scale_factor = scale_factor
-
-        self.convc1 = nn.Conv2d(2 * cor_planes, corr_dim, 1, padding=0)
-        self.convf1 = nn.Conv2d(4, flow_dim * 2, 7, padding=3)
-        self.convf2 = nn.Conv2d(flow_dim * 2, flow_dim, 3, padding=1)
-        self.conv = nn.Conv2d(corr_dim + flow_dim, fc_dim, 3, padding=1)
-
-        self.gru = nn.Sequential(
-            nn.Conv2d(fc_dim + 4 + cdim, hidden_dim, 3, padding=1),
-            nn.LeakyReLU(negative_slope=0.1, inplace=True),
-            nn.Conv2d(hidden_dim, hidden_dim, 3, padding=1),
-        )
-
-        self.feat_head = nn.Sequential(
-            nn.Conv2d(hidden_dim, hidden_dim, 3, padding=1),
-            nn.LeakyReLU(negative_slope=0.1, inplace=True),
-            nn.Conv2d(hidden_dim, cdim, 3, padding=1),
-        )
-
-        self.flow_head = nn.Sequential(
-            nn.Conv2d(hidden_dim, hidden_dim, 3, padding=1),
-            nn.LeakyReLU(negative_slope=0.1, inplace=True),
-            nn.Conv2d(hidden_dim, 4, 3, padding=1),
-        )
-
-        self.lrelu = nn.LeakyReLU(negative_slope=0.1, inplace=True)
-
-    def forward(self, net, flow, corr):
-        net = (
-            resize(net, 1 / self.scale_factor) if self.scale_factor is not None else net
-        )
-        cor = self.lrelu(self.convc1(corr))
-        flo = self.lrelu(self.convf1(flow))
-        flo = self.lrelu(self.convf2(flo))
-        cor_flo = torch.cat([cor, flo], dim=1)
-        inp = self.lrelu(self.conv(cor_flo))
-        inp = torch.cat([inp, flow, net], dim=1)
-
-        out = self.gru(inp)
-        delta_net = self.feat_head(out)
-        delta_flow = self.flow_head(out)
-
-        if self.scale_factor is not None:
-            delta_net = resize(delta_net, scale_factor=self.scale_factor)
-            delta_flow = self.scale_factor * resize(
-                delta_flow, scale_factor=self.scale_factor
-            )
-
-        return delta_net, delta_flow
-
-
-class BasicUpdateBlock(nn.Module):
-    def __init__(
-        self,
-        cdim,
-        hidden_dim,
-        flow_dim,
-        corr_dim,
-        corr_dim2,
-        fc_dim,
-        corr_levels=4,
-        radius=3,
-        scale_factor=None,
-        out_num=1,
-    ):
-        super(BasicUpdateBlock, self).__init__()
-        cor_planes = corr_levels * (2 * radius + 1) ** 2
-
-        self.scale_factor = scale_factor
-        self.convc1 = nn.Conv2d(2 * cor_planes, corr_dim, 1, padding=0)
-        self.convc2 = nn.Conv2d(corr_dim, corr_dim2, 3, padding=1)
-        self.convf1 = nn.Conv2d(4, flow_dim * 2, 7, padding=3)
-        self.convf2 = nn.Conv2d(flow_dim * 2, flow_dim, 3, padding=1)
-        self.conv = nn.Conv2d(flow_dim + corr_dim2, fc_dim, 3, padding=1)
-
-        self.gru = nn.Sequential(
-            nn.Conv2d(fc_dim + 4 + cdim, hidden_dim, 3, padding=1),
-            nn.LeakyReLU(negative_slope=0.1, inplace=True),
-            nn.Conv2d(hidden_dim, hidden_dim, 3, padding=1),
-        )
-
-        self.feat_head = nn.Sequential(
-            nn.Conv2d(hidden_dim, hidden_dim, 3, padding=1),
-            nn.LeakyReLU(negative_slope=0.1, inplace=True),
-            nn.Conv2d(hidden_dim, cdim, 3, padding=1),
-        )
-
-        self.flow_head = nn.Sequential(
-            nn.Conv2d(hidden_dim, hidden_dim, 3, padding=1),
-            nn.LeakyReLU(negative_slope=0.1, inplace=True),
-            nn.Conv2d(hidden_dim, 4 * out_num, 3, padding=1),
-        )
-
-        self.lrelu = nn.LeakyReLU(negative_slope=0.1, inplace=True)
-
-    def forward(self, net, flow, corr):
-        net = (
-            resize(net, 1 / self.scale_factor) if self.scale_factor is not None else net
-        )
-        cor = self.lrelu(self.convc1(corr))
-        cor = self.lrelu(self.convc2(cor))
-        flo = self.lrelu(self.convf1(flow))
-        flo = self.lrelu(self.convf2(flo))
-        cor_flo = torch.cat([cor, flo], dim=1)
-        inp = self.lrelu(self.conv(cor_flo))
-        inp = torch.cat([inp, flow, net], dim=1)
-
-        out = self.gru(inp)
-        delta_net = self.feat_head(out)
-        delta_flow = self.flow_head(out)
-
-        if self.scale_factor is not None:
-            delta_net = resize(delta_net, scale_factor=self.scale_factor)
-            delta_flow = self.scale_factor * resize(
-                delta_flow, scale_factor=self.scale_factor
-            )
-        return delta_net, delta_flow
-
-
-class BidirCorrBlock:
-    def __init__(self, fmap1, fmap2, num_levels=4, radius=4):
-        self.num_levels = num_levels
-        self.radius = radius
-        self.corr_pyramid = []
-        self.corr_pyramid_T = []
-
-        corr = BidirCorrBlock.corr(fmap1, fmap2)
-        batch, h1, w1, dim, h2, w2 = corr.shape
-        corr_T = corr.clone().permute(0, 4, 5, 3, 1, 2)
-
-        corr = corr.reshape(batch * h1 * w1, dim, h2, w2)
-        corr_T = corr_T.reshape(batch * h2 * w2, dim, h1, w1)
-
-        self.corr_pyramid.append(corr)
-        self.corr_pyramid_T.append(corr_T)
-
-        for _ in range(self.num_levels - 1):
-            corr = F.avg_pool2d(corr, 2, stride=2)
-            corr_T = F.avg_pool2d(corr_T, 2, stride=2)
-            self.corr_pyramid.append(corr)
-            self.corr_pyramid_T.append(corr_T)
-
-    def __call__(self, coords0, coords1):
-        r = self.radius
-        coords0 = coords0.permute(0, 2, 3, 1)
-        coords1 = coords1.permute(0, 2, 3, 1)
-        assert (
-            coords0.shape == coords1.shape
-        ), f"coords0 shape: [{coords0.shape}] is not equal to [{coords1.shape}]"
-        batch, h1, w1, _ = coords0.shape
-
-        out_pyramid = []
-        out_pyramid_T = []
-        for i in range(self.num_levels):
-            corr = self.corr_pyramid[i]
-            corr_T = self.corr_pyramid_T[i]
-
-            dx = torch.linspace(-r, r, 2 * r + 1, device=coords0.device)
-            dy = torch.linspace(-r, r, 2 * r + 1, device=coords0.device)
-            delta = torch.stack(torch.meshgrid(dy, dx, indexing="ij"), axis=-1)
-            delta_lvl = delta.view(1, 2 * r + 1, 2 * r + 1, 2)
-
-            centroid_lvl_0 = coords0.reshape(batch * h1 * w1, 1, 1, 2) / 2**i
-            centroid_lvl_1 = coords1.reshape(batch * h1 * w1, 1, 1, 2) / 2**i
-            coords_lvl_0 = centroid_lvl_0 + delta_lvl
-            coords_lvl_1 = centroid_lvl_1 + delta_lvl
-
-            corr = bilinear_sampler(corr, coords_lvl_0)
-            corr_T = bilinear_sampler(corr_T, coords_lvl_1)
-            corr = corr.view(batch, h1, w1, -1)
-            corr_T = corr_T.view(batch, h1, w1, -1)
-            out_pyramid.append(corr)
-            out_pyramid_T.append(corr_T)
-
-        out = torch.cat(out_pyramid, dim=-1)
-        out_T = torch.cat(out_pyramid_T, dim=-1)
-        return (
-            out.permute(0, 3, 1, 2).contiguous().float(),
-            out_T.permute(0, 3, 1, 2).contiguous().float(),
-        )
-
-    @staticmethod
-    def corr(fmap1, fmap2):
-        batch, dim, ht, wd = fmap1.shape
-        fmap1 = fmap1.view(batch, dim, ht * wd)
-        fmap2 = fmap2.view(batch, dim, ht * wd)
-
-        corr = torch.matmul(fmap1.transpose(1, 2), fmap2)
-        corr = corr.view(batch, ht, wd, 1, ht, wd)
-        return corr / torch.sqrt(torch.tensor(dim).float())
diff --git a/eval/vbench/third_party/amt/scripts/benchmark_arbitrary.sh b/eval/vbench/third_party/amt/scripts/benchmark_arbitrary.sh
deleted file mode 100644
index a9c55787..00000000
--- a/eval/vbench/third_party/amt/scripts/benchmark_arbitrary.sh
+++ /dev/null
@@ -1,5 +0,0 @@
-CFG=$1
-CKPT=$2
-
-python benchmarks/gopro.py -c $CFG -p $CKPT
-python benchmarks/adobe240.py -c $CFG -p $CKPT
diff --git a/eval/vbench/third_party/amt/scripts/benchmark_fixed.sh b/eval/vbench/third_party/amt/scripts/benchmark_fixed.sh
deleted file mode 100644
index 613b9662..00000000
--- a/eval/vbench/third_party/amt/scripts/benchmark_fixed.sh
+++ /dev/null
@@ -1,7 +0,0 @@
-CFG=$1
-CKPT=$2
-
-python benchmarks/vimeo90k.py -c $CFG -p $CKPT
-python benchmarks/ucf101.py -c $CFG -p $CKPT
-python benchmarks/snu_film.py -c $CFG -p $CKPT
-python benchmarks/xiph.py -c $CFG -p $CKPT
diff --git a/eval/vbench/third_party/amt/scripts/train.sh b/eval/vbench/third_party/amt/scripts/train.sh
deleted file mode 100644
index 561656fa..00000000
--- a/eval/vbench/third_party/amt/scripts/train.sh
+++ /dev/null
@@ -1,6 +0,0 @@
-NUM_GPU=$1
-CFG=$2
-PORT=$3
-python -m torch.distributed.launch \
---nproc_per_node $NUM_GPU \
---master_port $PORT train.py -c $CFG
diff --git a/eval/vbench/third_party/amt/train.py b/eval/vbench/third_party/amt/train.py
deleted file mode 100644
index 914b3f8c..00000000
--- a/eval/vbench/third_party/amt/train.py
+++ /dev/null
@@ -1,67 +0,0 @@
-import argparse
-import datetime
-import importlib
-import os
-from shutil import copyfile
-
-import torch
-import torch.distributed as dist
-from omegaconf import OmegaConf
-from utils.dist_utils import get_world_size
-from utils.utils import seed_all
-
-parser = argparse.ArgumentParser(description="VFI")
-parser.add_argument("-c", "--config", type=str)
-parser.add_argument("-p", "--port", default="23455", type=str)
-parser.add_argument("--local_rank", default="0")
-
-args = parser.parse_args()
-
-
-def main_worker(rank, config):
-    if "local_rank" not in config:
-        config["local_rank"] = config["global_rank"] = rank
-    if torch.cuda.is_available():
-        print(f"Rank {rank} is available")
-        config["device"] = f"cuda:{rank}"
-        if config["distributed"]:
-            dist.init_process_group(
-                backend="nccl", timeout=datetime.timedelta(seconds=5400)
-            )
-    else:
-        config["device"] = "cpu"
-
-    cfg_name = os.path.basename(args.config).split(".")[0]
-    config["exp_name"] = cfg_name + "_" + config["exp_name"]
-    config["save_dir"] = os.path.join(config["save_dir"], config["exp_name"])
-
-    if (not config["distributed"]) or rank == 0:
-        os.makedirs(config["save_dir"], exist_ok=True)
-        os.makedirs(f'{config["save_dir"]}/ckpts', exist_ok=True)
-        config_path = os.path.join(config["save_dir"], args.config.split("/")[-1])
-        if not os.path.isfile(config_path):
-            copyfile(args.config, config_path)
-        print("[**] create folder {}".format(config["save_dir"]))
-
-    trainer_name = config.get("trainer_type", "base_trainer")
-    print(f"using GPU {rank} for training")
-    if rank == 0:
-        print(trainer_name)
-    trainer_pack = importlib.import_module("trainers." + trainer_name)
-    trainer = trainer_pack.Trainer(config)
-
-    trainer.train()
-
-
-if __name__ == "__main__":
-    torch.backends.cudnn.benchmark = True
-    cfg = OmegaConf.load(args.config)
-    seed_all(cfg.seed)
-    rank = int(args.local_rank)
-    torch.cuda.set_device(torch.device(f"cuda:{rank}"))
-    # setting distributed cfgurations
-    cfg["world_size"] = get_world_size()
-    cfg["local_rank"] = rank
-    if rank == 0:
-        print("world_size: ", cfg["world_size"])
-    main_worker(rank, cfg)
diff --git a/eval/vbench/third_party/amt/trainers/__init__.py b/eval/vbench/third_party/amt/trainers/__init__.py
deleted file mode 100644
index e69de29b..00000000
diff --git a/eval/vbench/third_party/amt/trainers/base_trainer.py b/eval/vbench/third_party/amt/trainers/base_trainer.py
deleted file mode 100644
index 160fe5d7..00000000
--- a/eval/vbench/third_party/amt/trainers/base_trainer.py
+++ /dev/null
@@ -1,278 +0,0 @@
-import logging
-import os.path as osp
-import time
-from collections import OrderedDict
-
-import numpy as np
-import torch
-import wandb
-from metrics.psnr_ssim import calculate_psnr
-from torch.nn.parallel import DistributedDataParallel as DDP
-from torch.optim import AdamW
-from torch.utils.data import DataLoader
-from torch.utils.data.distributed import DistributedSampler
-from utils.build_utils import build_from_cfg
-from utils.utils import AverageMeterGroups
-
-from .logger import CustomLogger
-
-
-class Trainer:
-    def __init__(self, config):
-        super().__init__()
-        self.config = config
-        self.rank = self.config["local_rank"]
-        init_log = self._init_logger()
-        self._init_dataset()
-        self._init_loss()
-        self.model_name = config["exp_name"]
-        self.model = build_from_cfg(config.network).to(self.config.device)
-
-        if config["distributed"]:
-            self.model = DDP(
-                self.model,
-                device_ids=[self.rank],
-                output_device=self.rank,
-                broadcast_buffers=True,
-                find_unused_parameters=False,
-            )
-
-        init_log += str(self.model)
-        self.optimizer = AdamW(
-            self.model.parameters(), lr=config.lr, weight_decay=config.weight_decay
-        )
-        if self.rank == 0:
-            print(init_log)
-        self.logger(init_log)
-        self.resume_training()
-
-    def resume_training(self):
-        ckpt_path = self.config.get("resume_state")
-        if ckpt_path is not None:
-            ckpt = torch.load(self.config["resume_state"])
-            if self.config["distributed"]:
-                self.model.module.load_state_dict(ckpt["state_dict"])
-            else:
-                self.model.load_state_dict(ckpt["state_dict"])
-            self.optimizer.load_state_dict(ckpt["optim"])
-            self.resume_epoch = ckpt.get("epoch")
-            self.logger(
-                f"load model from {ckpt_path} and training resumes from epoch {self.resume_epoch}"
-            )
-        else:
-            self.resume_epoch = 0
-
-    def _init_logger(self):
-        init_log = ""
-        console_cfg = dict(
-            level=logging.INFO,
-            format="%(asctime)s %(filename)s[line:%(lineno)d]"
-            "%(levelname)s %(message)s",
-            datefmt="%a, %d %b %Y %H:%M:%S",
-            filename=f"{self.config['save_dir']}/log",
-            filemode="w",
-        )
-        tb_cfg = dict(log_dir=osp.join(self.config["save_dir"], "tb_logger"))
-        wandb_cfg = None
-        use_wandb = self.config["logger"].get("use_wandb", False)
-        if use_wandb:
-            resume_id = self.config["logger"].get("resume_id", None)
-            if resume_id:
-                wandb_id = resume_id
-                resume = "allow"
-                init_log += f"Resume wandb logger with id={wandb_id}."
-            else:
-                wandb_id = wandb.util.generate_id()
-                resume = "never"
-
-            wandb_cfg = dict(
-                id=wandb_id,
-                resume=resume,
-                name=osp.basename(self.config["save_dir"]),
-                config=self.config,
-                project="YOUR PROJECT",
-                entity="YOUR ENTITY",
-                sync_tensorboard=True,
-            )
-            init_log += f"Use wandb logger with id={wandb_id}; project=[YOUR PROJECT]."
-        self.logger = CustomLogger(console_cfg, tb_cfg, wandb_cfg, self.rank)
-        return init_log
-
-    def _init_dataset(self):
-        dataset_train = build_from_cfg(self.config.data.train)
-        dataset_val = build_from_cfg(self.config.data.val)
-
-        self.sampler = DistributedSampler(
-            dataset_train,
-            num_replicas=self.config["world_size"],
-            rank=self.config["local_rank"],
-        )
-        self.config.data.train_loader.batch_size //= self.config["world_size"]
-        self.loader_train = DataLoader(
-            dataset_train,
-            **self.config.data.train_loader,
-            pin_memory=True,
-            drop_last=True,
-            sampler=self.sampler,
-        )
-
-        self.loader_val = DataLoader(
-            dataset_val,
-            **self.config.data.val_loader,
-            pin_memory=True,
-            shuffle=False,
-            drop_last=False,
-        )
-
-    def _init_loss(self):
-        self.loss_dict = dict()
-        for loss_cfg in self.config.losses:
-            loss = build_from_cfg(loss_cfg)
-            self.loss_dict[loss_cfg["nickname"]] = loss
-
-    def set_lr(self, optimizer, lr):
-        for param_group in optimizer.param_groups:
-            param_group["lr"] = lr
-
-    def get_lr(self, iters):
-        ratio = 0.5 * (
-            1.0
-            + np.cos(
-                iters / (self.config["epochs"] * self.loader_train.__len__()) * np.pi
-            )
-        )
-        lr = (self.config["lr"] - self.config["lr_min"]) * ratio + self.config["lr_min"]
-        return lr
-
-    def train(self):
-        local_rank = self.config["local_rank"]
-        best_psnr = 0.0
-        loss_group = AverageMeterGroups()
-        time_group = AverageMeterGroups()
-        iters_per_epoch = self.loader_train.__len__()
-        iters = self.resume_epoch * iters_per_epoch
-        total_iters = self.config["epochs"] * iters_per_epoch
-
-        start_t = time.time()
-        total_t = 0
-        for epoch in range(self.resume_epoch, self.config["epochs"]):
-            self.sampler.set_epoch(epoch)
-            for data in self.loader_train:
-                for k, v in data.items():
-                    data[k] = v.to(self.config["device"])
-                data_t = time.time() - start_t
-
-                lr = self.get_lr(iters)
-                self.set_lr(self.optimizer, lr)
-
-                self.optimizer.zero_grad()
-                results = self.model(**data)
-                total_loss = torch.tensor(0.0, device=self.config["device"])
-                for name, loss in self.loss_dict.items():
-                    l = loss(**results, **data)
-                    loss_group.update({name: l.cpu().data})
-                    total_loss += l
-                total_loss.backward()
-                self.optimizer.step()
-
-                iters += 1
-
-                iter_t = time.time() - start_t
-                total_t += iter_t
-                time_group.update({"data_t": data_t, "iter_t": iter_t})
-
-                if (iters + 1) % 100 == 0 and local_rank == 0:
-                    tpi = total_t / (iters - self.resume_epoch * iters_per_epoch)
-                    eta = total_iters * tpi
-                    remainder = (total_iters - iters) * tpi
-                    eta = self.eta_format(eta)
-
-                    remainder = self.eta_format(remainder)
-                    log_str = (
-                        f"[{self.model_name}]epoch:{epoch +1}/{self.config['epochs']} "
-                    )
-                    log_str += (
-                        f"iter:{iters + 1}/{self.config['epochs'] * iters_per_epoch} "
-                    )
-                    log_str += f"time:{time_group.avg('iter_t'):.3f}({time_group.avg('data_t'):.3f}) "
-                    log_str += f"lr:{lr:.3e} eta:{remainder}({eta})\n"
-                    for name in self.loss_dict.keys():
-                        avg_l = loss_group.avg(name)
-                        log_str += f"{name}:{avg_l:.3e} "
-                        self.logger(tb_msg=[f"loss/{name}", avg_l, iters])
-                    log_str += f"best:{best_psnr:.2f}dB\n\n"
-                    self.logger(log_str)
-                    loss_group.reset()
-                    time_group.reset()
-                start_t = time.time()
-
-            if (epoch + 1) % self.config["eval_interval"] == 0 and local_rank == 0:
-                psnr, eval_t = self.evaluate(epoch)
-                total_t += eval_t
-                self.logger(tb_msg=["eval/psnr", psnr, epoch])
-                if psnr > best_psnr:
-                    best_psnr = psnr
-                    self.save("psnr_best.pth", epoch)
-                    if self.logger.enable_wandb:
-                        wandb.run.summary["best_psnr"] = best_psnr
-                if (epoch + 1) % 50 == 0:
-                    self.save(f"epoch_{epoch+1}.pth", epoch)
-                self.save("latest.pth", epoch)
-
-        self.logger.close()
-
-    def evaluate(self, epoch):
-        psnr_list = []
-        time_stamp = time.time()
-        for i, data in enumerate(self.loader_val):
-            for k, v in data.items():
-                data[k] = v.to(self.config["device"])
-
-            with torch.no_grad():
-                results = self.model(**data, eval=True)
-                imgt_pred = results["imgt_pred"]
-                for j in range(data["img0"].shape[0]):
-                    psnr = (
-                        calculate_psnr(
-                            imgt_pred[j].detach().unsqueeze(0),
-                            data["imgt"][j].unsqueeze(0),
-                        )
-                        .cpu()
-                        .data
-                    )
-                    psnr_list.append(psnr)
-
-        eval_time = time.time() - time_stamp
-
-        self.logger(
-            "eval epoch:{}/{} time:{:.2f} psnr:{:.3f}".format(
-                epoch + 1, self.config["epochs"], eval_time, np.array(psnr_list).mean()
-            )
-        )
-        return np.array(psnr_list).mean(), eval_time
-
-    def save(self, name, epoch):
-        save_path = "{}/{}/{}".format(self.config["save_dir"], "ckpts", name)
-        ckpt = OrderedDict(epoch=epoch)
-        if self.config["distributed"]:
-            ckpt["state_dict"] = self.model.module.state_dict()
-        else:
-            ckpt["state_dict"] = self.model.state_dict()
-        ckpt["optim"] = self.optimizer.state_dict()
-        torch.save(ckpt, save_path)
-
-    def eta_format(self, eta):
-        time_str = ""
-        if eta >= 3600:
-            hours = int(eta // 3600)
-            eta -= hours * 3600
-            time_str = f"{hours}"
-
-        if eta >= 60:
-            mins = int(eta // 60)
-            eta -= mins * 60
-            time_str = f"{time_str}:{mins:02}"
-
-        eta = int(eta)
-        time_str = f"{time_str}:{eta:02}"
-        return time_str
diff --git a/eval/vbench/third_party/amt/trainers/logger.py b/eval/vbench/third_party/amt/trainers/logger.py
deleted file mode 100644
index 8e7bc24c..00000000
--- a/eval/vbench/third_party/amt/trainers/logger.py
+++ /dev/null
@@ -1,64 +0,0 @@
-import logging
-import os.path as osp
-import shutil
-import time
-
-import wandb
-from torch.utils.tensorboard import SummaryWriter
-
-
-def mv_archived_logger(name):
-    timestamp = time.strftime("%Y-%m-%d_%H:%M:%S_", time.localtime())
-    basename = "archived_" + timestamp + osp.basename(name)
-    archived_name = osp.join(osp.dirname(name), basename)
-    shutil.move(name, archived_name)
-
-
-class CustomLogger:
-    def __init__(self, common_cfg, tb_cfg=None, wandb_cfg=None, rank=0):
-        global global_logger
-        self.rank = rank
-
-        if self.rank == 0:
-            self.logger = logging.getLogger("VFI")
-            self.logger.setLevel(logging.INFO)
-            format_str = logging.Formatter(common_cfg["format"])
-
-            console_handler = logging.StreamHandler()
-            console_handler.setFormatter(format_str)
-
-            if osp.exists(common_cfg["filename"]):
-                mv_archived_logger(common_cfg["filename"])
-
-            file_handler = logging.FileHandler(
-                common_cfg["filename"], common_cfg["filemode"]
-            )
-            file_handler.setFormatter(format_str)
-
-            self.logger.addHandler(console_handler)
-            self.logger.addHandler(file_handler)
-            self.tb_logger = None
-
-            self.enable_wandb = False
-
-            if wandb_cfg is not None:
-                self.enable_wandb = True
-                wandb.init(**wandb_cfg)
-
-            if tb_cfg is not None:
-                self.tb_logger = SummaryWriter(**tb_cfg)
-
-        global_logger = self
-
-    def __call__(self, msg=None, level=logging.INFO, tb_msg=None):
-        if self.rank != 0:
-            return
-        if msg is not None:
-            self.logger.log(level, msg)
-
-        if self.tb_logger is not None and tb_msg is not None:
-            self.tb_logger.add_scalar(*tb_msg)
-
-    def close(self):
-        if self.rank == 0 and self.enable_wandb:
-            wandb.finish()
diff --git a/eval/vbench/third_party/amt/utils/__init__.py b/eval/vbench/third_party/amt/utils/__init__.py
deleted file mode 100644
index e69de29b..00000000
diff --git a/eval/vbench/third_party/amt/utils/build_utils.py b/eval/vbench/third_party/amt/utils/build_utils.py
deleted file mode 100644
index 9e574264..00000000
--- a/eval/vbench/third_party/amt/utils/build_utils.py
+++ /dev/null
@@ -1,16 +0,0 @@
-import importlib
-import os
-import sys
-
-CUR_DIR = os.path.dirname(os.path.abspath(__file__))
-sys.path.append(os.path.join(CUR_DIR, "../"))
-
-
-def base_build_fn(module, cls, params):
-    return getattr(importlib.import_module(module, package=None), cls)(**params)
-
-
-def build_from_cfg(config):
-    module, cls = config["name"].rsplit(".", 1)
-    params = config.get("params", {})
-    return base_build_fn(module, cls, params)
diff --git a/eval/vbench/third_party/amt/utils/dist_utils.py b/eval/vbench/third_party/amt/utils/dist_utils.py
deleted file mode 100644
index d754d4fc..00000000
--- a/eval/vbench/third_party/amt/utils/dist_utils.py
+++ /dev/null
@@ -1,48 +0,0 @@
-import os
-
-import torch
-
-
-def get_world_size():
-    """Find OMPI world size without calling mpi functions
-    :rtype: int
-    """
-    if os.environ.get("PMI_SIZE") is not None:
-        return int(os.environ.get("PMI_SIZE") or 1)
-    elif os.environ.get("OMPI_COMM_WORLD_SIZE") is not None:
-        return int(os.environ.get("OMPI_COMM_WORLD_SIZE") or 1)
-    else:
-        return torch.cuda.device_count()
-
-
-def get_global_rank():
-    """Find OMPI world rank without calling mpi functions
-    :rtype: int
-    """
-    if os.environ.get("PMI_RANK") is not None:
-        return int(os.environ.get("PMI_RANK") or 0)
-    elif os.environ.get("OMPI_COMM_WORLD_RANK") is not None:
-        return int(os.environ.get("OMPI_COMM_WORLD_RANK") or 0)
-    else:
-        return 0
-
-
-def get_local_rank():
-    """Find OMPI local rank without calling mpi functions
-    :rtype: int
-    """
-    if os.environ.get("MPI_LOCALRANKID") is not None:
-        return int(os.environ.get("MPI_LOCALRANKID") or 0)
-    elif os.environ.get("OMPI_COMM_WORLD_LOCAL_RANK") is not None:
-        return int(os.environ.get("OMPI_COMM_WORLD_LOCAL_RANK") or 0)
-    else:
-        return 0
-
-
-def get_master_ip():
-    if os.environ.get("AZ_BATCH_MASTER_NODE") is not None:
-        return os.environ.get("AZ_BATCH_MASTER_NODE").split(":")[0]
-    elif os.environ.get("AZ_BATCHAI_MPI_MASTER_NODE") is not None:
-        return os.environ.get("AZ_BATCHAI_MPI_MASTER_NODE")
-    else:
-        return "127.0.0.1"
diff --git a/eval/vbench/third_party/amt/utils/flow_utils.py b/eval/vbench/third_party/amt/utils/flow_utils.py
deleted file mode 100644
index 4415a528..00000000
--- a/eval/vbench/third_party/amt/utils/flow_utils.py
+++ /dev/null
@@ -1,137 +0,0 @@
-import numpy as np
-import torch
-import torch.nn.functional as F
-from PIL import ImageFile
-
-ImageFile.LOAD_TRUNCATED_IMAGES = True
-
-
-def warp(img, flow):
-    B, _, H, W = flow.shape
-    xx = torch.linspace(-1.0, 1.0, W).view(1, 1, 1, W).expand(B, -1, H, -1)
-    yy = torch.linspace(-1.0, 1.0, H).view(1, 1, H, 1).expand(B, -1, -1, W)
-    grid = torch.cat([xx, yy], 1).to(img)
-    flow_ = torch.cat(
-        [
-            flow[:, 0:1, :, :] / ((W - 1.0) / 2.0),
-            flow[:, 1:2, :, :] / ((H - 1.0) / 2.0),
-        ],
-        1,
-    )
-    grid_ = (grid + flow_).permute(0, 2, 3, 1)
-    output = F.grid_sample(
-        input=img,
-        grid=grid_,
-        mode="bilinear",
-        padding_mode="border",
-        align_corners=True,
-    )
-    return output
-
-
-def make_colorwheel():
-    """
-    Generates a color wheel for optical flow visualization as presented in:
-        Baker et al. "A Database and Evaluation Methodology for Optical Flow" (ICCV, 2007)
-        URL: http://vision.middlebury.edu/flow/flowEval-iccv07.pdf
-    Code follows the original C++ source code of Daniel Scharstein.
-    Code follows the the Matlab source code of Deqing Sun.
-    Returns:
-        np.ndarray: Color wheel
-    """
-
-    RY = 15
-    YG = 6
-    GC = 4
-    CB = 11
-    BM = 13
-    MR = 6
-
-    ncols = RY + YG + GC + CB + BM + MR
-    colorwheel = np.zeros((ncols, 3))
-    col = 0
-
-    # RY
-    colorwheel[0:RY, 0] = 255
-    colorwheel[0:RY, 1] = np.floor(255 * np.arange(0, RY) / RY)
-    col = col + RY
-    # YG
-    colorwheel[col : col + YG, 0] = 255 - np.floor(255 * np.arange(0, YG) / YG)
-    colorwheel[col : col + YG, 1] = 255
-    col = col + YG
-    # GC
-    colorwheel[col : col + GC, 1] = 255
-    colorwheel[col : col + GC, 2] = np.floor(255 * np.arange(0, GC) / GC)
-    col = col + GC
-    # CB
-    colorwheel[col : col + CB, 1] = 255 - np.floor(255 * np.arange(CB) / CB)
-    colorwheel[col : col + CB, 2] = 255
-    col = col + CB
-    # BM
-    colorwheel[col : col + BM, 2] = 255
-    colorwheel[col : col + BM, 0] = np.floor(255 * np.arange(0, BM) / BM)
-    col = col + BM
-    # MR
-    colorwheel[col : col + MR, 2] = 255 - np.floor(255 * np.arange(MR) / MR)
-    colorwheel[col : col + MR, 0] = 255
-    return colorwheel
-
-
-def flow_uv_to_colors(u, v, convert_to_bgr=False):
-    """
-    Applies the flow color wheel to (possibly clipped) flow components u and v.
-    According to the C++ source code of Daniel Scharstein
-    According to the Matlab source code of Deqing Sun
-    Args:
-        u (np.ndarray): Input horizontal flow of shape [H,W]
-        v (np.ndarray): Input vertical flow of shape [H,W]
-        convert_to_bgr (bool, optional): Convert output image to BGR. Defaults to False.
-    Returns:
-        np.ndarray: Flow visualization image of shape [H,W,3]
-    """
-    flow_image = np.zeros((u.shape[0], u.shape[1], 3), np.uint8)
-    colorwheel = make_colorwheel()  # shape [55x3]
-    ncols = colorwheel.shape[0]
-    rad = np.sqrt(np.square(u) + np.square(v))
-    a = np.arctan2(-v, -u) / np.pi
-    fk = (a + 1) / 2 * (ncols - 1)
-    k0 = np.floor(fk).astype(np.int32)
-    k1 = k0 + 1
-    k1[k1 == ncols] = 0
-    f = fk - k0
-    for i in range(colorwheel.shape[1]):
-        tmp = colorwheel[:, i]
-        col0 = tmp[k0] / 255.0
-        col1 = tmp[k1] / 255.0
-        col = (1 - f) * col0 + f * col1
-        idx = rad <= 1
-        col[idx] = 1 - rad[idx] * (1 - col[idx])
-        col[~idx] = col[~idx] * 0.75  # out of range
-        # Note the 2-i => BGR instead of RGB
-        ch_idx = 2 - i if convert_to_bgr else i
-        flow_image[:, :, ch_idx] = np.floor(255 * col)
-    return flow_image
-
-
-def flow_to_image(flow_uv, clip_flow=None, convert_to_bgr=False):
-    """
-    Expects a two dimensional flow image of shape.
-    Args:
-        flow_uv (np.ndarray): Flow UV image of shape [H,W,2]
-        clip_flow (float, optional): Clip maximum of flow values. Defaults to None.
-        convert_to_bgr (bool, optional): Convert output image to BGR. Defaults to False.
-    Returns:
-        np.ndarray: Flow visualization image of shape [H,W,3]
-    """
-    assert flow_uv.ndim == 3, "input flow must have three dimensions"
-    assert flow_uv.shape[2] == 2, "input flow must have shape [H,W,2]"
-    if clip_flow is not None:
-        flow_uv = np.clip(flow_uv, 0, clip_flow)
-    u = flow_uv[:, :, 0]
-    v = flow_uv[:, :, 1]
-    rad = np.sqrt(np.square(u) + np.square(v))
-    rad_max = np.max(rad)
-    epsilon = 1e-5
-    u = u / (rad_max + epsilon)
-    v = v / (rad_max + epsilon)
-    return flow_uv_to_colors(u, v, convert_to_bgr)
diff --git a/eval/vbench/third_party/amt/utils/utils.py b/eval/vbench/third_party/amt/utils/utils.py
deleted file mode 100644
index 9b04c9f7..00000000
--- a/eval/vbench/third_party/amt/utils/utils.py
+++ /dev/null
@@ -1,334 +0,0 @@
-import random
-import re
-import sys
-
-import numpy as np
-import torch
-import torch.nn.functional as F
-from imageio import imread, imwrite
-from PIL import ImageFile
-
-ImageFile.LOAD_TRUNCATED_IMAGES = True
-
-
-class AverageMeter:
-    def __init__(self):
-        self.reset()
-
-    def reset(self):
-        self.val = 0.0
-        self.avg = 0.0
-        self.sum = 0.0
-        self.count = 0
-
-    def update(self, val, n=1):
-        self.val = val
-        self.sum += val * n
-        self.count += n
-        self.avg = self.sum / self.count
-
-
-class AverageMeterGroups:
-    def __init__(self) -> None:
-        self.meter_dict = dict()
-
-    def update(self, dict, n=1):
-        for name, val in dict.items():
-            if self.meter_dict.get(name) is None:
-                self.meter_dict[name] = AverageMeter()
-            self.meter_dict[name].update(val, n)
-
-    def reset(self, name=None):
-        if name is None:
-            for v in self.meter_dict.values():
-                v.reset()
-        else:
-            meter = self.meter_dict.get(name)
-            if meter is not None:
-                meter.reset()
-
-    def avg(self, name):
-        meter = self.meter_dict.get(name)
-        if meter is not None:
-            return meter.avg
-
-
-class InputPadder:
-    """Pads images such that dimensions are divisible by divisor"""
-
-    def __init__(self, dims, divisor=16):
-        self.ht, self.wd = dims[-2:]
-        pad_ht = (((self.ht // divisor) + 1) * divisor - self.ht) % divisor
-        pad_wd = (((self.wd // divisor) + 1) * divisor - self.wd) % divisor
-        self._pad = [
-            pad_wd // 2,
-            pad_wd - pad_wd // 2,
-            pad_ht // 2,
-            pad_ht - pad_ht // 2,
-        ]
-
-    def pad(self, *inputs):
-        if len(inputs) == 1:
-            return F.pad(inputs[0], self._pad, mode="replicate")
-        else:
-            return [F.pad(x, self._pad, mode="replicate") for x in inputs]
-
-    def unpad(self, *inputs):
-        if len(inputs) == 1:
-            return self._unpad(inputs[0])
-        else:
-            return [self._unpad(x) for x in inputs]
-
-    def _unpad(self, x):
-        ht, wd = x.shape[-2:]
-        c = [self._pad[2], ht - self._pad[3], self._pad[0], wd - self._pad[1]]
-        return x[..., c[0] : c[1], c[2] : c[3]]
-
-
-def img2tensor(img):
-    if img.shape[-1] > 3:
-        img = img[:, :, :3]
-    return torch.tensor(img).permute(2, 0, 1).unsqueeze(0) / 255.0
-
-
-def tensor2img(img_t):
-    return (
-        (img_t * 255.0)
-        .detach()
-        .squeeze(0)
-        .permute(1, 2, 0)
-        .cpu()
-        .numpy()
-        .clip(0, 255)
-        .astype(np.uint8)
-    )
-
-
-def seed_all(seed):
-    random.seed(seed)
-    np.random.seed(seed)
-    torch.manual_seed(seed)
-    torch.cuda.manual_seed_all(seed)
-
-
-def read(file):
-    if file.endswith(".float3"):
-        return readFloat(file)
-    elif file.endswith(".flo"):
-        return readFlow(file)
-    elif file.endswith(".ppm"):
-        return readImage(file)
-    elif file.endswith(".pgm"):
-        return readImage(file)
-    elif file.endswith(".png"):
-        return readImage(file)
-    elif file.endswith(".jpg"):
-        return readImage(file)
-    elif file.endswith(".pfm"):
-        return readPFM(file)[0]
-    else:
-        raise Exception("don't know how to read %s" % file)
-
-
-def write(file, data):
-    if file.endswith(".float3"):
-        return writeFloat(file, data)
-    elif file.endswith(".flo"):
-        return writeFlow(file, data)
-    elif file.endswith(".ppm"):
-        return writeImage(file, data)
-    elif file.endswith(".pgm"):
-        return writeImage(file, data)
-    elif file.endswith(".png"):
-        return writeImage(file, data)
-    elif file.endswith(".jpg"):
-        return writeImage(file, data)
-    elif file.endswith(".pfm"):
-        return writePFM(file, data)
-    else:
-        raise Exception("don't know how to write %s" % file)
-
-
-def readPFM(file):
-    file = open(file, "rb")
-
-    color = None
-    width = None
-    height = None
-    scale = None
-    endian = None
-
-    header = file.readline().rstrip()
-    if header.decode("ascii") == "PF":
-        color = True
-    elif header.decode("ascii") == "Pf":
-        color = False
-    else:
-        raise Exception("Not a PFM file.")
-
-    dim_match = re.match(r"^(\d+)\s(\d+)\s$", file.readline().decode("ascii"))
-    if dim_match:
-        width, height = list(map(int, dim_match.groups()))
-    else:
-        raise Exception("Malformed PFM header.")
-
-    scale = float(file.readline().decode("ascii").rstrip())
-    if scale < 0:
-        endian = "<"
-        scale = -scale
-    else:
-        endian = ">"
-
-    data = np.fromfile(file, endian + "f")
-    shape = (height, width, 3) if color else (height, width)
-
-    data = np.reshape(data, shape)
-    data = np.flipud(data)
-    return data, scale
-
-
-def writePFM(file, image, scale=1):
-    file = open(file, "wb")
-
-    color = None
-
-    if image.dtype.name != "float32":
-        raise Exception("Image dtype must be float32.")
-
-    image = np.flipud(image)
-
-    if len(image.shape) == 3 and image.shape[2] == 3:
-        color = True
-    elif len(image.shape) == 2 or len(image.shape) == 3 and image.shape[2] == 1:
-        color = False
-    else:
-        raise Exception("Image must have H x W x 3, H x W x 1 or H x W dimensions.")
-
-    file.write("PF\n" if color else "Pf\n".encode())
-    file.write("%d %d\n".encode() % (image.shape[1], image.shape[0]))
-
-    endian = image.dtype.byteorder
-
-    if endian == "<" or endian == "=" and sys.byteorder == "little":
-        scale = -scale
-
-    file.write("%f\n".encode() % scale)
-
-    image.tofile(file)
-
-
-def readFlow(name):
-    if name.endswith(".pfm") or name.endswith(".PFM"):
-        return readPFM(name)[0][:, :, 0:2]
-
-    f = open(name, "rb")
-
-    header = f.read(4)
-    if header.decode("utf-8") != "PIEH":
-        raise Exception("Flow file header does not contain PIEH")
-
-    width = np.fromfile(f, np.int32, 1).squeeze()
-    height = np.fromfile(f, np.int32, 1).squeeze()
-
-    flow = np.fromfile(f, np.float32, width * height * 2).reshape((height, width, 2))
-
-    return flow.astype(np.float32)
-
-
-def readImage(name):
-    if name.endswith(".pfm") or name.endswith(".PFM"):
-        data = readPFM(name)[0]
-        if len(data.shape) == 3:
-            return data[:, :, 0:3]
-        else:
-            return data
-    return imread(name)
-
-
-def writeImage(name, data):
-    if name.endswith(".pfm") or name.endswith(".PFM"):
-        return writePFM(name, data, 1)
-    return imwrite(name, data)
-
-
-def writeFlow(name, flow):
-    f = open(name, "wb")
-    f.write("PIEH".encode("utf-8"))
-    np.array([flow.shape[1], flow.shape[0]], dtype=np.int32).tofile(f)
-    flow = flow.astype(np.float32)
-    flow.tofile(f)
-
-
-def readFloat(name):
-    f = open(name, "rb")
-
-    if (f.readline().decode("utf-8")) != "float\n":
-        raise Exception("float file %s did not contain <float> keyword" % name)
-
-    dim = int(f.readline())
-
-    dims = []
-    count = 1
-    for i in range(0, dim):
-        d = int(f.readline())
-        dims.append(d)
-        count *= d
-
-    dims = list(reversed(dims))
-
-    data = np.fromfile(f, np.float32, count).reshape(dims)
-    if dim > 2:
-        data = np.transpose(data, (2, 1, 0))
-        data = np.transpose(data, (1, 0, 2))
-
-    return data
-
-
-def writeFloat(name, data):
-    f = open(name, "wb")
-
-    dim = len(data.shape)
-    if dim > 3:
-        raise Exception("bad float file dimension: %d" % dim)
-
-    f.write(("float\n").encode("ascii"))
-    f.write(("%d\n" % dim).encode("ascii"))
-
-    if dim == 1:
-        f.write(("%d\n" % data.shape[0]).encode("ascii"))
-    else:
-        f.write(("%d\n" % data.shape[1]).encode("ascii"))
-        f.write(("%d\n" % data.shape[0]).encode("ascii"))
-        for i in range(2, dim):
-            f.write(("%d\n" % data.shape[i]).encode("ascii"))
-
-    data = data.astype(np.float32)
-    if dim == 2:
-        data.tofile(f)
-
-    else:
-        np.transpose(data, (2, 0, 1)).tofile(f)
-
-
-def check_dim_and_resize(tensor_list):
-    shape_list = []
-    for t in tensor_list:
-        shape_list.append(t.shape[2:])
-
-    if len(set(shape_list)) > 1:
-        desired_shape = shape_list[0]
-        print(
-            f"Inconsistent size of input video frames. All frames will be resized to {desired_shape}"
-        )
-
-        resize_tensor_list = []
-        for t in tensor_list:
-            resize_tensor_list.append(
-                torch.nn.functional.interpolate(
-                    t, size=tuple(desired_shape), mode="bilinear"
-                )
-            )
-
-        tensor_list = resize_tensor_list
-
-    return tensor_list
diff --git a/eval/vbench/third_party/grit_model.py b/eval/vbench/third_party/grit_model.py
deleted file mode 100644
index bca8d6f0..00000000
--- a/eval/vbench/third_party/grit_model.py
+++ /dev/null
@@ -1,48 +0,0 @@
-import os
-import sys
-
-from detectron2.data.detection_utils import read_image
-
-from .grit_src.image_dense_captions import (
-    dense_pred_to_caption,
-    dense_pred_to_caption_only_name,
-    dense_pred_to_caption_tuple,
-    image_caption_api,
-    init_demo,
-)
-
-
-class DenseCaptioning:
-    def __init__(self, device):
-        self.device = device
-        self.demo = None
-
-    def initialize_model(self, model_weight):
-        self.demo = init_demo(self.device, model_weight=model_weight)
-
-    def initialize_model_det(self, model_weight):
-        self.demo = init_demo(self.device, model_weight=model_weight, task="ObjectDet")
-
-    def image_dense_caption(self, image_src):
-        dense_caption = image_caption_api(image_src, self.device)
-        print("\033[1;35m" + "*" * 100 + "\033[0m")
-        print("Step2, Dense Caption:\n")
-        print(dense_caption)
-        print("\033[1;35m" + "*" * 100 + "\033[0m")
-        return dense_caption
-
-    def run_caption_api(self, image_src):
-        img = read_image(image_src, format="BGR")
-        print(img.shape)
-        predictions, visualized_output = self.demo.run_on_image(img)
-        new_caption = dense_pred_to_caption_only_name(predictions)
-        return new_caption
-
-    def run_caption_tensor(self, img):
-        predictions, visualized_output = self.demo.run_on_image(img)
-        new_caption = dense_pred_to_caption_tuple(predictions)
-        return new_caption, visualized_output
-
-    def run_det_tensor(self, img):
-        predictions, visualized_output = self.demo.run_on_image(img)
-        return predictions, visualized_output
diff --git a/eval/vbench/third_party/grit_src/__init__.py b/eval/vbench/third_party/grit_src/__init__.py
deleted file mode 100644
index e69de29b..00000000
diff --git a/eval/vbench/third_party/grit_src/centernet2/.gitignore b/eval/vbench/third_party/grit_src/centernet2/.gitignore
deleted file mode 100644
index 51c17688..00000000
--- a/eval/vbench/third_party/grit_src/centernet2/.gitignore
+++ /dev/null
@@ -1,10 +0,0 @@
-# compilation and distribution
-__pycache__
-_ext
-*.pyc
-*.pyd
-*.so
-centernet.egg-info/
-build/
-dist/
-wheels/
diff --git a/eval/vbench/third_party/grit_src/centernet2/__init__.py b/eval/vbench/third_party/grit_src/centernet2/__init__.py
deleted file mode 100644
index e69de29b..00000000
diff --git a/eval/vbench/third_party/grit_src/centernet2/centernet/__init__.py b/eval/vbench/third_party/grit_src/centernet2/centernet/__init__.py
deleted file mode 100644
index 11af3898..00000000
--- a/eval/vbench/third_party/grit_src/centernet2/centernet/__init__.py
+++ /dev/null
@@ -1,9 +0,0 @@
-from .modeling.backbone.bifpn import build_resnet_bifpn_backbone
-from .modeling.backbone.bifpn_fcos import build_fcos_resnet_bifpn_backbone
-from .modeling.backbone.dla import build_dla_backbone
-from .modeling.backbone.dlafpn import build_dla_fpn3_backbone
-from .modeling.backbone.fpn_p5 import build_p67_resnet_fpn_backbone
-from .modeling.backbone.res2net import build_p67_res2net_fpn_backbone
-from .modeling.dense_heads.centernet import CenterNet
-from .modeling.meta_arch.centernet_detector import CenterNetDetector
-from .modeling.roi_heads.custom_roi_heads import CustomCascadeROIHeads, CustomROIHeads
diff --git a/eval/vbench/third_party/grit_src/centernet2/centernet/config.py b/eval/vbench/third_party/grit_src/centernet2/centernet/config.py
deleted file mode 100644
index 7447a154..00000000
--- a/eval/vbench/third_party/grit_src/centernet2/centernet/config.py
+++ /dev/null
@@ -1,93 +0,0 @@
-from detectron2.config import CfgNode as CN
-
-
-def add_centernet_config(cfg):
-    _C = cfg
-
-    _C.MODEL.CENTERNET = CN()
-    _C.MODEL.CENTERNET.NUM_CLASSES = 80
-    _C.MODEL.CENTERNET.IN_FEATURES = ["p3", "p4", "p5", "p6", "p7"]
-    _C.MODEL.CENTERNET.FPN_STRIDES = [8, 16, 32, 64, 128]
-    _C.MODEL.CENTERNET.PRIOR_PROB = 0.01
-    _C.MODEL.CENTERNET.INFERENCE_TH = 0.05
-    _C.MODEL.CENTERNET.CENTER_NMS = False
-    _C.MODEL.CENTERNET.NMS_TH_TRAIN = 0.6
-    _C.MODEL.CENTERNET.NMS_TH_TEST = 0.6
-    _C.MODEL.CENTERNET.PRE_NMS_TOPK_TRAIN = 1000
-    _C.MODEL.CENTERNET.POST_NMS_TOPK_TRAIN = 100
-    _C.MODEL.CENTERNET.PRE_NMS_TOPK_TEST = 1000
-    _C.MODEL.CENTERNET.POST_NMS_TOPK_TEST = 100
-    _C.MODEL.CENTERNET.NORM = "GN"
-    _C.MODEL.CENTERNET.USE_DEFORMABLE = False
-    _C.MODEL.CENTERNET.NUM_CLS_CONVS = 4
-    _C.MODEL.CENTERNET.NUM_BOX_CONVS = 4
-    _C.MODEL.CENTERNET.NUM_SHARE_CONVS = 0
-    _C.MODEL.CENTERNET.LOC_LOSS_TYPE = "giou"
-    _C.MODEL.CENTERNET.SIGMOID_CLAMP = 1e-4
-    _C.MODEL.CENTERNET.HM_MIN_OVERLAP = 0.8
-    _C.MODEL.CENTERNET.MIN_RADIUS = 4
-    _C.MODEL.CENTERNET.SOI = [
-        [0, 80],
-        [64, 160],
-        [128, 320],
-        [256, 640],
-        [512, 10000000],
-    ]
-    _C.MODEL.CENTERNET.POS_WEIGHT = 1.0
-    _C.MODEL.CENTERNET.NEG_WEIGHT = 1.0
-    _C.MODEL.CENTERNET.REG_WEIGHT = 2.0
-    _C.MODEL.CENTERNET.HM_FOCAL_BETA = 4
-    _C.MODEL.CENTERNET.HM_FOCAL_ALPHA = 0.25
-    _C.MODEL.CENTERNET.LOSS_GAMMA = 2.0
-    _C.MODEL.CENTERNET.WITH_AGN_HM = False
-    _C.MODEL.CENTERNET.ONLY_PROPOSAL = False
-    _C.MODEL.CENTERNET.AS_PROPOSAL = False
-    _C.MODEL.CENTERNET.IGNORE_HIGH_FP = -1.0
-    _C.MODEL.CENTERNET.MORE_POS = False
-    _C.MODEL.CENTERNET.MORE_POS_THRESH = 0.2
-    _C.MODEL.CENTERNET.MORE_POS_TOPK = 9
-    _C.MODEL.CENTERNET.NOT_NORM_REG = True
-    _C.MODEL.CENTERNET.NOT_NMS = False
-    _C.MODEL.CENTERNET.NO_REDUCE = False
-
-    _C.MODEL.ROI_BOX_HEAD.USE_SIGMOID_CE = False
-    _C.MODEL.ROI_BOX_HEAD.PRIOR_PROB = 0.01
-    _C.MODEL.ROI_BOX_HEAD.USE_EQL_LOSS = False
-    _C.MODEL.ROI_BOX_HEAD.CAT_FREQ_PATH = "datasets/lvis/lvis_v1_train_cat_info.json"
-    _C.MODEL.ROI_BOX_HEAD.EQL_FREQ_CAT = 200
-    _C.MODEL.ROI_BOX_HEAD.USE_FED_LOSS = False
-    _C.MODEL.ROI_BOX_HEAD.FED_LOSS_NUM_CAT = 50
-    _C.MODEL.ROI_BOX_HEAD.FED_LOSS_FREQ_WEIGHT = 0.5
-    _C.MODEL.ROI_BOX_HEAD.MULT_PROPOSAL_SCORE = False
-
-    _C.MODEL.BIFPN = CN()
-    _C.MODEL.BIFPN.NUM_LEVELS = 5
-    _C.MODEL.BIFPN.NUM_BIFPN = 6
-    _C.MODEL.BIFPN.NORM = "GN"
-    _C.MODEL.BIFPN.OUT_CHANNELS = 160
-    _C.MODEL.BIFPN.SEPARABLE_CONV = False
-
-    _C.MODEL.DLA = CN()
-    _C.MODEL.DLA.OUT_FEATURES = ["dla2"]
-    _C.MODEL.DLA.USE_DLA_UP = True
-    _C.MODEL.DLA.NUM_LAYERS = 34
-    _C.MODEL.DLA.MS_OUTPUT = False
-    _C.MODEL.DLA.NORM = "BN"
-    _C.MODEL.DLA.DLAUP_IN_FEATURES = ["dla3", "dla4", "dla5"]
-    _C.MODEL.DLA.DLAUP_NODE = "conv"
-
-    _C.SOLVER.RESET_ITER = False
-    _C.SOLVER.TRAIN_ITER = -1
-
-    _C.INPUT.CUSTOM_AUG = ""
-    _C.INPUT.TRAIN_SIZE = 640
-    _C.INPUT.TEST_SIZE = 640
-    _C.INPUT.SCALE_RANGE = (0.1, 2.0)
-    # 'default' for fixed short/ long edge, 'square' for max size=INPUT.SIZE
-    _C.INPUT.TEST_INPUT_TYPE = "default"
-
-    _C.DEBUG = False
-    _C.SAVE_DEBUG = False
-    _C.SAVE_PTH = False
-    _C.VIS_THRESH = 0.3
-    _C.DEBUG_SHOW_NAME = False
diff --git a/eval/vbench/third_party/grit_src/centernet2/centernet/modeling/__init__.py b/eval/vbench/third_party/grit_src/centernet2/centernet/modeling/__init__.py
deleted file mode 100644
index e69de29b..00000000
diff --git a/eval/vbench/third_party/grit_src/centernet2/centernet/modeling/backbone/__init__.py b/eval/vbench/third_party/grit_src/centernet2/centernet/modeling/backbone/__init__.py
deleted file mode 100644
index e69de29b..00000000
diff --git a/eval/vbench/third_party/grit_src/centernet2/centernet/modeling/backbone/bifpn.py b/eval/vbench/third_party/grit_src/centernet2/centernet/modeling/backbone/bifpn.py
deleted file mode 100644
index d00f2721..00000000
--- a/eval/vbench/third_party/grit_src/centernet2/centernet/modeling/backbone/bifpn.py
+++ /dev/null
@@ -1,542 +0,0 @@
-# Modified from https://github.com/rwightman/efficientdet-pytorch/blob/master/effdet/efficientdet.py
-# The original file is under Apache-2.0 License
-import math
-from collections import OrderedDict
-from os.path import join
-from typing import List
-
-import fvcore.nn.weight_init as weight_init
-import numpy as np
-import torch
-import torch.nn.functional as F
-import torch.utils.model_zoo as model_zoo
-from detectron2.layers import Conv2d, ShapeSpec
-from detectron2.layers.batch_norm import get_norm
-from detectron2.modeling.backbone import Backbone
-from detectron2.modeling.backbone.build import BACKBONE_REGISTRY
-from detectron2.modeling.backbone.resnet import build_resnet_backbone
-from torch import nn
-
-from .dlafpn import dla34
-
-
-def get_fpn_config(base_reduction=8):
-    """BiFPN config with sum."""
-    p = {
-        "nodes": [
-            {"reduction": base_reduction << 3, "inputs_offsets": [3, 4]},
-            {"reduction": base_reduction << 2, "inputs_offsets": [2, 5]},
-            {"reduction": base_reduction << 1, "inputs_offsets": [1, 6]},
-            {"reduction": base_reduction, "inputs_offsets": [0, 7]},
-            {"reduction": base_reduction << 1, "inputs_offsets": [1, 7, 8]},
-            {"reduction": base_reduction << 2, "inputs_offsets": [2, 6, 9]},
-            {"reduction": base_reduction << 3, "inputs_offsets": [3, 5, 10]},
-            {"reduction": base_reduction << 4, "inputs_offsets": [4, 11]},
-        ],
-        "weight_method": "fastattn",
-    }
-    return p
-
-
-def swish(x, inplace: bool = False):
-    """Swish - Described in: https://arxiv.org/abs/1710.05941"""
-    return x.mul_(x.sigmoid()) if inplace else x.mul(x.sigmoid())
-
-
-class Swish(nn.Module):
-    def __init__(self, inplace: bool = False):
-        super(Swish, self).__init__()
-        self.inplace = inplace
-
-    def forward(self, x):
-        return swish(x, self.inplace)
-
-
-class SequentialAppend(nn.Sequential):
-    def __init__(self, *args):
-        super(SequentialAppend, self).__init__(*args)
-
-    def forward(self, x):
-        for module in self:
-            x.append(module(x))
-        return x
-
-
-class SequentialAppendLast(nn.Sequential):
-    def __init__(self, *args):
-        super(SequentialAppendLast, self).__init__(*args)
-
-    # def forward(self, x: List[torch.Tensor]):
-    def forward(self, x):
-        for module in self:
-            x.append(module(x[-1]))
-        return x
-
-
-class ConvBnAct2d(nn.Module):
-    def __init__(
-        self,
-        in_channels,
-        out_channels,
-        kernel_size,
-        stride=1,
-        dilation=1,
-        padding="",
-        bias=False,
-        norm="",
-        act_layer=Swish,
-    ):
-        super(ConvBnAct2d, self).__init__()
-        # self.conv = create_conv2d(
-        #     in_channels, out_channels, kernel_size, stride=stride, dilation=dilation, padding=padding, bias=bias)
-        self.conv = Conv2d(
-            in_channels,
-            out_channels,
-            kernel_size=kernel_size,
-            stride=stride,
-            padding=kernel_size // 2,
-            bias=(norm == ""),
-        )
-        self.bn = get_norm(norm, out_channels)
-        self.act = None if act_layer is None else act_layer(inplace=True)
-
-    def forward(self, x):
-        x = self.conv(x)
-        if self.bn is not None:
-            x = self.bn(x)
-        if self.act is not None:
-            x = self.act(x)
-        return x
-
-
-class SeparableConv2d(nn.Module):
-    """Separable Conv"""
-
-    def __init__(
-        self,
-        in_channels,
-        out_channels,
-        kernel_size=3,
-        stride=1,
-        dilation=1,
-        padding="",
-        bias=False,
-        channel_multiplier=1.0,
-        pw_kernel_size=1,
-        act_layer=Swish,
-        norm="",
-    ):
-        super(SeparableConv2d, self).__init__()
-
-        # self.conv_dw = create_conv2d(
-        #     in_channels, int(in_channels * channel_multiplier), kernel_size,
-        #     stride=stride, dilation=dilation, padding=padding, depthwise=True)
-
-        self.conv_dw = Conv2d(
-            in_channels,
-            int(in_channels * channel_multiplier),
-            kernel_size=kernel_size,
-            stride=stride,
-            padding=kernel_size // 2,
-            bias=bias,
-            groups=out_channels,
-        )
-        # print('conv_dw', kernel_size, stride)
-        # self.conv_pw = create_conv2d(
-        #     int(in_channels * channel_multiplier), out_channels, pw_kernel_size, padding=padding, bias=bias)
-
-        self.conv_pw = Conv2d(
-            int(in_channels * channel_multiplier),
-            out_channels,
-            kernel_size=pw_kernel_size,
-            padding=pw_kernel_size // 2,
-            bias=(norm == ""),
-        )
-        # print('conv_pw', pw_kernel_size)
-
-        self.bn = get_norm(norm, out_channels)
-        self.act = None if act_layer is None else act_layer(inplace=True)
-
-    def forward(self, x):
-        x = self.conv_dw(x)
-        x = self.conv_pw(x)
-        if self.bn is not None:
-            x = self.bn(x)
-        if self.act is not None:
-            x = self.act(x)
-        return x
-
-
-class ResampleFeatureMap(nn.Sequential):
-    def __init__(
-        self,
-        in_channels,
-        out_channels,
-        reduction_ratio=1.0,
-        pad_type="",
-        pooling_type="max",
-        norm="",
-        apply_bn=False,
-        conv_after_downsample=False,
-        redundant_bias=False,
-    ):
-        super(ResampleFeatureMap, self).__init__()
-        pooling_type = pooling_type or "max"
-        self.in_channels = in_channels
-        self.out_channels = out_channels
-        self.reduction_ratio = reduction_ratio
-        self.conv_after_downsample = conv_after_downsample
-
-        conv = None
-        if in_channels != out_channels:
-            conv = ConvBnAct2d(
-                in_channels,
-                out_channels,
-                kernel_size=1,
-                padding=pad_type,
-                norm=norm if apply_bn else "",
-                bias=not apply_bn or redundant_bias,
-                act_layer=None,
-            )
-
-        if reduction_ratio > 1:
-            stride_size = int(reduction_ratio)
-            if conv is not None and not self.conv_after_downsample:
-                self.add_module("conv", conv)
-            self.add_module(
-                "downsample",
-                # create_pool2d(
-                #     pooling_type, kernel_size=stride_size + 1, stride=stride_size, padding=pad_type)
-                # nn.MaxPool2d(kernel_size=stride_size + 1, stride=stride_size, padding=pad_type)
-                nn.MaxPool2d(kernel_size=stride_size, stride=stride_size),
-            )
-            if conv is not None and self.conv_after_downsample:
-                self.add_module("conv", conv)
-        else:
-            if conv is not None:
-                self.add_module("conv", conv)
-            if reduction_ratio < 1:
-                scale = int(1 // reduction_ratio)
-                self.add_module("upsample", nn.UpsamplingNearest2d(scale_factor=scale))
-
-
-class FpnCombine(nn.Module):
-    def __init__(
-        self,
-        feature_info,
-        fpn_config,
-        fpn_channels,
-        inputs_offsets,
-        target_reduction,
-        pad_type="",
-        pooling_type="max",
-        norm="",
-        apply_bn_for_resampling=False,
-        conv_after_downsample=False,
-        redundant_bias=False,
-        weight_method="attn",
-    ):
-        super(FpnCombine, self).__init__()
-        self.inputs_offsets = inputs_offsets
-        self.weight_method = weight_method
-
-        self.resample = nn.ModuleDict()
-        for idx, offset in enumerate(inputs_offsets):
-            in_channels = fpn_channels
-            if offset < len(feature_info):
-                in_channels = feature_info[offset]["num_chs"]
-                input_reduction = feature_info[offset]["reduction"]
-            else:
-                node_idx = offset - len(feature_info)
-                # print('node_idx, len', node_idx, len(fpn_config['nodes']))
-                input_reduction = fpn_config["nodes"][node_idx]["reduction"]
-            reduction_ratio = target_reduction / input_reduction
-            self.resample[str(offset)] = ResampleFeatureMap(
-                in_channels,
-                fpn_channels,
-                reduction_ratio=reduction_ratio,
-                pad_type=pad_type,
-                pooling_type=pooling_type,
-                norm=norm,
-                apply_bn=apply_bn_for_resampling,
-                conv_after_downsample=conv_after_downsample,
-                redundant_bias=redundant_bias,
-            )
-
-        if weight_method == "attn" or weight_method == "fastattn":
-            # WSM
-            self.edge_weights = nn.Parameter(
-                torch.ones(len(inputs_offsets)), requires_grad=True
-            )
-        else:
-            self.edge_weights = None
-
-    def forward(self, x):
-        dtype = x[0].dtype
-        nodes = []
-        for offset in self.inputs_offsets:
-            input_node = x[offset]
-            input_node = self.resample[str(offset)](input_node)
-            nodes.append(input_node)
-
-        if self.weight_method == "attn":
-            normalized_weights = torch.softmax(self.edge_weights.type(dtype), dim=0)
-            x = torch.stack(nodes, dim=-1) * normalized_weights
-        elif self.weight_method == "fastattn":
-            edge_weights = nn.functional.relu(self.edge_weights.type(dtype))
-            weights_sum = torch.sum(edge_weights)
-            x = torch.stack(
-                [
-                    (nodes[i] * edge_weights[i]) / (weights_sum + 0.0001)
-                    for i in range(len(nodes))
-                ],
-                dim=-1,
-            )
-        elif self.weight_method == "sum":
-            x = torch.stack(nodes, dim=-1)
-        else:
-            raise ValueError("unknown weight_method {}".format(self.weight_method))
-        x = torch.sum(x, dim=-1)
-        return x
-
-
-class BiFpnLayer(nn.Module):
-    def __init__(
-        self,
-        feature_info,
-        fpn_config,
-        fpn_channels,
-        num_levels=5,
-        pad_type="",
-        pooling_type="max",
-        norm="",
-        act_layer=Swish,
-        apply_bn_for_resampling=False,
-        conv_after_downsample=True,
-        conv_bn_relu_pattern=False,
-        separable_conv=True,
-        redundant_bias=False,
-    ):
-        super(BiFpnLayer, self).__init__()
-        self.fpn_config = fpn_config
-        self.num_levels = num_levels
-        self.conv_bn_relu_pattern = False
-
-        self.feature_info = []
-        self.fnode = SequentialAppend()
-        for i, fnode_cfg in enumerate(fpn_config["nodes"]):
-            # logging.debug('fnode {} : {}'.format(i, fnode_cfg))
-            # print('fnode {} : {}'.format(i, fnode_cfg))
-            fnode_layers = OrderedDict()
-
-            # combine features
-            reduction = fnode_cfg["reduction"]
-            fnode_layers["combine"] = FpnCombine(
-                feature_info,
-                fpn_config,
-                fpn_channels,
-                fnode_cfg["inputs_offsets"],
-                target_reduction=reduction,
-                pad_type=pad_type,
-                pooling_type=pooling_type,
-                norm=norm,
-                apply_bn_for_resampling=apply_bn_for_resampling,
-                conv_after_downsample=conv_after_downsample,
-                redundant_bias=redundant_bias,
-                weight_method=fpn_config["weight_method"],
-            )
-            self.feature_info.append(dict(num_chs=fpn_channels, reduction=reduction))
-
-            # after combine ops
-            after_combine = OrderedDict()
-            if not conv_bn_relu_pattern:
-                after_combine["act"] = act_layer(inplace=True)
-                conv_bias = redundant_bias
-                conv_act = None
-            else:
-                conv_bias = False
-                conv_act = act_layer
-            conv_kwargs = dict(
-                in_channels=fpn_channels,
-                out_channels=fpn_channels,
-                kernel_size=3,
-                padding=pad_type,
-                bias=conv_bias,
-                norm=norm,
-                act_layer=conv_act,
-            )
-            after_combine["conv"] = (
-                SeparableConv2d(**conv_kwargs)
-                if separable_conv
-                else ConvBnAct2d(**conv_kwargs)
-            )
-            fnode_layers["after_combine"] = nn.Sequential(after_combine)
-
-            self.fnode.add_module(str(i), nn.Sequential(fnode_layers))
-
-        self.feature_info = self.feature_info[-num_levels::]
-
-    def forward(self, x):
-        x = self.fnode(x)
-        return x[-self.num_levels : :]
-
-
-class BiFPN(Backbone):
-    def __init__(
-        self,
-        cfg,
-        bottom_up,
-        in_features,
-        out_channels,
-        norm="",
-        num_levels=5,
-        num_bifpn=4,
-        separable_conv=False,
-    ):
-        super(BiFPN, self).__init__()
-        assert isinstance(bottom_up, Backbone)
-
-        # Feature map strides and channels from the bottom up network (e.g. ResNet)
-        input_shapes = bottom_up.output_shape()
-        in_strides = [input_shapes[f].stride for f in in_features]
-        in_channels = [input_shapes[f].channels for f in in_features]
-
-        self.num_levels = num_levels
-        self.num_bifpn = num_bifpn
-        self.bottom_up = bottom_up
-        self.in_features = in_features
-        self._size_divisibility = 128
-        levels = [int(math.log2(s)) for s in in_strides]
-        self._out_feature_strides = {
-            "p{}".format(int(math.log2(s))): s for s in in_strides
-        }
-        if len(in_features) < num_levels:
-            for l in range(num_levels - len(in_features)):
-                s = l + levels[-1]
-                self._out_feature_strides["p{}".format(s + 1)] = 2 ** (s + 1)
-        self._out_features = list(sorted(self._out_feature_strides.keys()))
-        self._out_feature_channels = {k: out_channels for k in self._out_features}
-
-        # print('self._out_feature_strides', self._out_feature_strides)
-        # print('self._out_feature_channels', self._out_feature_channels)
-
-        feature_info = [
-            {"num_chs": in_channels[level], "reduction": in_strides[level]}
-            for level in range(len(self.in_features))
-        ]
-        # self.config = config
-        fpn_config = get_fpn_config()
-        self.resample = SequentialAppendLast()
-        for level in range(num_levels):
-            if level < len(feature_info):
-                in_chs = in_channels[level]  # feature_info[level]['num_chs']
-                reduction = in_strides[level]  # feature_info[level]['reduction']
-            else:
-                # Adds a coarser level by downsampling the last feature map
-                reduction_ratio = 2
-                self.resample.add_module(
-                    str(level),
-                    ResampleFeatureMap(
-                        in_channels=in_chs,
-                        out_channels=out_channels,
-                        pad_type="same",
-                        pooling_type=None,
-                        norm=norm,
-                        reduction_ratio=reduction_ratio,
-                        apply_bn=True,
-                        conv_after_downsample=False,
-                        redundant_bias=False,
-                    ),
-                )
-                in_chs = out_channels
-                reduction = int(reduction * reduction_ratio)
-                feature_info.append(dict(num_chs=in_chs, reduction=reduction))
-
-        self.cell = nn.Sequential()
-        for rep in range(self.num_bifpn):
-            # logging.debug('building cell {}'.format(rep))
-            # print('building cell {}'.format(rep))
-            fpn_layer = BiFpnLayer(
-                feature_info=feature_info,
-                fpn_config=fpn_config,
-                fpn_channels=out_channels,
-                num_levels=self.num_levels,
-                pad_type="same",
-                pooling_type=None,
-                norm=norm,
-                act_layer=Swish,
-                separable_conv=separable_conv,
-                apply_bn_for_resampling=True,
-                conv_after_downsample=False,
-                conv_bn_relu_pattern=False,
-                redundant_bias=False,
-            )
-            self.cell.add_module(str(rep), fpn_layer)
-            feature_info = fpn_layer.feature_info
-        # import pdb; pdb.set_trace()
-
-    @property
-    def size_divisibility(self):
-        return self._size_divisibility
-
-    def forward(self, x):
-        # print('input shapes', x.shape)
-        bottom_up_features = self.bottom_up(x)
-        x = [bottom_up_features[f] for f in self.in_features]
-        assert len(self.resample) == self.num_levels - len(x)
-        x = self.resample(x)
-        shapes = [xx.shape for xx in x]
-        # print('resample shapes', shapes)
-        x = self.cell(x)
-        out = {f: xx for f, xx in zip(self._out_features, x)}
-        # import pdb; pdb.set_trace()
-        return out
-
-
-@BACKBONE_REGISTRY.register()
-def build_resnet_bifpn_backbone(cfg, input_shape: ShapeSpec):
-    """
-    Args:
-        cfg: a detectron2 CfgNode
-
-    Returns:
-        backbone (Backbone): backbone module, must be a subclass of :class:`Backbone`.
-    """
-    bottom_up = build_resnet_backbone(cfg, input_shape)
-    in_features = cfg.MODEL.FPN.IN_FEATURES
-    backbone = BiFPN(
-        cfg=cfg,
-        bottom_up=bottom_up,
-        in_features=in_features,
-        out_channels=cfg.MODEL.BIFPN.OUT_CHANNELS,
-        norm=cfg.MODEL.BIFPN.NORM,
-        num_levels=cfg.MODEL.BIFPN.NUM_LEVELS,
-        num_bifpn=cfg.MODEL.BIFPN.NUM_BIFPN,
-        separable_conv=cfg.MODEL.BIFPN.SEPARABLE_CONV,
-    )
-    return backbone
-
-
-@BACKBONE_REGISTRY.register()
-def build_p37_dla_bifpn_backbone(cfg, input_shape: ShapeSpec):
-    """
-    Args:
-        cfg: a detectron2 CfgNode
-    Returns:
-        backbone (Backbone): backbone module, must be a subclass of :class:`Backbone`.
-    """
-    bottom_up = dla34(cfg)
-    in_features = cfg.MODEL.FPN.IN_FEATURES
-    assert cfg.MODEL.BIFPN.NUM_LEVELS == 5
-
-    backbone = BiFPN(
-        cfg=cfg,
-        bottom_up=bottom_up,
-        in_features=in_features,
-        out_channels=cfg.MODEL.BIFPN.OUT_CHANNELS,
-        norm=cfg.MODEL.BIFPN.NORM,
-        num_levels=cfg.MODEL.BIFPN.NUM_LEVELS,
-        num_bifpn=cfg.MODEL.BIFPN.NUM_BIFPN,
-        separable_conv=cfg.MODEL.BIFPN.SEPARABLE_CONV,
-    )
-    return backbone
diff --git a/eval/vbench/third_party/grit_src/centernet2/centernet/modeling/backbone/bifpn_fcos.py b/eval/vbench/third_party/grit_src/centernet2/centernet/modeling/backbone/bifpn_fcos.py
deleted file mode 100644
index 07981044..00000000
--- a/eval/vbench/third_party/grit_src/centernet2/centernet/modeling/backbone/bifpn_fcos.py
+++ /dev/null
@@ -1,478 +0,0 @@
-# This file is modified from https://github.com/aim-uofa/AdelaiDet/blob/master/adet/modeling/backbone/bifpn.py
-# The original file is under 2-clause BSD License for academic use, and *non-commercial use*.
-import torch
-import torch.nn.functional as F
-from detectron2.layers import Conv2d, ShapeSpec, get_norm
-from detectron2.modeling import BACKBONE_REGISTRY
-from detectron2.modeling.backbone import Backbone, build_resnet_backbone
-from torch import nn
-
-from .dlafpn import dla34
-
-__all__ = []
-
-
-def swish(x):
-    return x * x.sigmoid()
-
-
-def split_name(name):
-    for i, c in enumerate(name):
-        if not c.isalpha():
-            return name[:i], int(name[i:])
-    raise ValueError()
-
-
-class FeatureMapResampler(nn.Module):
-    def __init__(self, in_channels, out_channels, stride, norm=""):
-        super(FeatureMapResampler, self).__init__()
-        if in_channels != out_channels:
-            self.reduction = Conv2d(
-                in_channels,
-                out_channels,
-                kernel_size=1,
-                bias=(norm == ""),
-                norm=get_norm(norm, out_channels),
-                activation=None,
-            )
-        else:
-            self.reduction = None
-
-        assert stride <= 2
-        self.stride = stride
-
-    def forward(self, x):
-        if self.reduction is not None:
-            x = self.reduction(x)
-
-        if self.stride == 2:
-            x = F.max_pool2d(
-                x, kernel_size=self.stride + 1, stride=self.stride, padding=1
-            )
-        elif self.stride == 1:
-            pass
-        else:
-            raise NotImplementedError()
-        return x
-
-
-class BackboneWithTopLevels(Backbone):
-    def __init__(self, backbone, out_channels, num_top_levels, norm=""):
-        super(BackboneWithTopLevels, self).__init__()
-        self.backbone = backbone
-        backbone_output_shape = backbone.output_shape()
-
-        self._out_feature_channels = {
-            name: shape.channels for name, shape in backbone_output_shape.items()
-        }
-        self._out_feature_strides = {
-            name: shape.stride for name, shape in backbone_output_shape.items()
-        }
-        self._out_features = list(self._out_feature_strides.keys())
-
-        last_feature_name = max(
-            self._out_feature_strides.keys(), key=lambda x: split_name(x)[1]
-        )
-        self.last_feature_name = last_feature_name
-        self.num_top_levels = num_top_levels
-
-        last_channels = self._out_feature_channels[last_feature_name]
-        last_stride = self._out_feature_strides[last_feature_name]
-
-        prefix, suffix = split_name(last_feature_name)
-        prev_channels = last_channels
-        for i in range(num_top_levels):
-            name = prefix + str(suffix + i + 1)
-            self.add_module(
-                name, FeatureMapResampler(prev_channels, out_channels, 2, norm)
-            )
-            prev_channels = out_channels
-
-            self._out_feature_channels[name] = out_channels
-            self._out_feature_strides[name] = last_stride * 2 ** (i + 1)
-            self._out_features.append(name)
-
-    def forward(self, x):
-        outputs = self.backbone(x)
-        last_features = outputs[self.last_feature_name]
-        prefix, suffix = split_name(self.last_feature_name)
-
-        x = last_features
-        for i in range(self.num_top_levels):
-            name = prefix + str(suffix + i + 1)
-            x = self.__getattr__(name)(x)
-            outputs[name] = x
-
-        return outputs
-
-
-class SingleBiFPN(Backbone):
-    """
-    This module implements Feature Pyramid Network.
-    It creates pyramid features built on top of some input feature maps.
-    """
-
-    def __init__(self, in_channels_list, out_channels, norm=""):
-        """
-        Args:
-            bottom_up (Backbone): module representing the bottom up subnetwork.
-                Must be a subclass of :class:`Backbone`. The multi-scale feature
-                maps generated by the bottom up network, and listed in `in_features`,
-                are used to generate FPN levels.
-            in_features (list[str]): names of the input feature maps coming
-                from the backbone to which FPN is attached. For example, if the
-                backbone produces ["res2", "res3", "res4"], any *contiguous* sublist
-                of these may be used; order must be from high to low resolution.
-            out_channels (int): number of channels in the output feature maps.
-            norm (str): the normalization to use.
-        """
-        super(SingleBiFPN, self).__init__()
-
-        self.out_channels = out_channels
-        # build 5-levels bifpn
-        if len(in_channels_list) == 5:
-            self.nodes = [
-                {"feat_level": 3, "inputs_offsets": [3, 4]},
-                {"feat_level": 2, "inputs_offsets": [2, 5]},
-                {"feat_level": 1, "inputs_offsets": [1, 6]},
-                {"feat_level": 0, "inputs_offsets": [0, 7]},
-                {"feat_level": 1, "inputs_offsets": [1, 7, 8]},
-                {"feat_level": 2, "inputs_offsets": [2, 6, 9]},
-                {"feat_level": 3, "inputs_offsets": [3, 5, 10]},
-                {"feat_level": 4, "inputs_offsets": [4, 11]},
-            ]
-        elif len(in_channels_list) == 3:
-            self.nodes = [
-                {"feat_level": 1, "inputs_offsets": [1, 2]},
-                {"feat_level": 0, "inputs_offsets": [0, 3]},
-                {"feat_level": 1, "inputs_offsets": [1, 3, 4]},
-                {"feat_level": 2, "inputs_offsets": [2, 5]},
-            ]
-        else:
-            raise NotImplementedError
-
-        node_info = [_ for _ in in_channels_list]
-
-        num_output_connections = [0 for _ in in_channels_list]
-        for fnode in self.nodes:
-            feat_level = fnode["feat_level"]
-            inputs_offsets = fnode["inputs_offsets"]
-            inputs_offsets_str = "_".join(map(str, inputs_offsets))
-            for input_offset in inputs_offsets:
-                num_output_connections[input_offset] += 1
-
-                in_channels = node_info[input_offset]
-                if in_channels != out_channels:
-                    lateral_conv = Conv2d(
-                        in_channels,
-                        out_channels,
-                        kernel_size=1,
-                        norm=get_norm(norm, out_channels),
-                    )
-                    self.add_module(
-                        "lateral_{}_f{}".format(input_offset, feat_level), lateral_conv
-                    )
-            node_info.append(out_channels)
-            num_output_connections.append(0)
-
-            # generate attention weights
-            name = "weights_f{}_{}".format(feat_level, inputs_offsets_str)
-            self.__setattr__(
-                name,
-                nn.Parameter(
-                    torch.ones(len(inputs_offsets), dtype=torch.float32),
-                    requires_grad=True,
-                ),
-            )
-
-            # generate convolutions after combination
-            name = "outputs_f{}_{}".format(feat_level, inputs_offsets_str)
-            self.add_module(
-                name,
-                Conv2d(
-                    out_channels,
-                    out_channels,
-                    kernel_size=3,
-                    padding=1,
-                    norm=get_norm(norm, out_channels),
-                    bias=(norm == ""),
-                ),
-            )
-
-    def forward(self, feats):
-        """
-        Args:
-            input (dict[str->Tensor]): mapping feature map name (e.g., "p5") to
-                feature map tensor for each feature level in high to low resolution order.
-        Returns:
-            dict[str->Tensor]:
-                mapping from feature map name to FPN feature map tensor
-                in high to low resolution order. Returned feature names follow the FPN
-                paper convention: "p<stage>", where stage has stride = 2 ** stage e.g.,
-                ["n2", "n3", ..., "n6"].
-        """
-        feats = [_ for _ in feats]
-        num_levels = len(feats)
-        num_output_connections = [0 for _ in feats]
-        for fnode in self.nodes:
-            feat_level = fnode["feat_level"]
-            inputs_offsets = fnode["inputs_offsets"]
-            inputs_offsets_str = "_".join(map(str, inputs_offsets))
-            input_nodes = []
-            _, _, target_h, target_w = feats[feat_level].size()
-            for input_offset in inputs_offsets:
-                num_output_connections[input_offset] += 1
-                input_node = feats[input_offset]
-
-                # reduction
-                if input_node.size(1) != self.out_channels:
-                    name = "lateral_{}_f{}".format(input_offset, feat_level)
-                    input_node = self.__getattr__(name)(input_node)
-
-                # maybe downsample
-                _, _, h, w = input_node.size()
-                if h > target_h and w > target_w:
-                    height_stride_size = int((h - 1) // target_h + 1)
-                    width_stride_size = int((w - 1) // target_w + 1)
-                    assert height_stride_size == width_stride_size == 2
-                    input_node = F.max_pool2d(
-                        input_node,
-                        kernel_size=(height_stride_size + 1, width_stride_size + 1),
-                        stride=(height_stride_size, width_stride_size),
-                        padding=1,
-                    )
-                elif h <= target_h and w <= target_w:
-                    if h < target_h or w < target_w:
-                        input_node = F.interpolate(
-                            input_node, size=(target_h, target_w), mode="nearest"
-                        )
-                else:
-                    raise NotImplementedError()
-                input_nodes.append(input_node)
-
-            # attention
-            name = "weights_f{}_{}".format(feat_level, inputs_offsets_str)
-            weights = F.relu(self.__getattr__(name))
-            norm_weights = weights / (weights.sum() + 0.0001)
-
-            new_node = torch.stack(input_nodes, dim=-1)
-            new_node = (norm_weights * new_node).sum(dim=-1)
-            new_node = swish(new_node)
-
-            name = "outputs_f{}_{}".format(feat_level, inputs_offsets_str)
-            feats.append(self.__getattr__(name)(new_node))
-
-            num_output_connections.append(0)
-
-        output_feats = []
-        for idx in range(num_levels):
-            for i, fnode in enumerate(reversed(self.nodes)):
-                if fnode["feat_level"] == idx:
-                    output_feats.append(feats[-1 - i])
-                    break
-            else:
-                raise ValueError()
-        return output_feats
-
-
-class BiFPN(Backbone):
-    """
-    This module implements Feature Pyramid Network.
-    It creates pyramid features built on top of some input feature maps.
-    """
-
-    def __init__(
-        self, bottom_up, in_features, out_channels, num_top_levels, num_repeats, norm=""
-    ):
-        """
-        Args:
-            bottom_up (Backbone): module representing the bottom up subnetwork.
-                Must be a subclass of :class:`Backbone`. The multi-scale feature
-                maps generated by the bottom up network, and listed in `in_features`,
-                are used to generate FPN levels.
-            in_features (list[str]): names of the input feature maps coming
-                from the backbone to which FPN is attached. For example, if the
-                backbone produces ["res2", "res3", "res4"], any *contiguous* sublist
-                of these may be used; order must be from high to low resolution.
-            out_channels (int): number of channels in the output feature maps.
-            num_top_levels (int): the number of the top levels (p6 or p7).
-            num_repeats (int): the number of repeats of BiFPN.
-            norm (str): the normalization to use.
-        """
-        super(BiFPN, self).__init__()
-        assert isinstance(bottom_up, Backbone)
-
-        # add extra feature levels (i.e., 6 and 7)
-        self.bottom_up = BackboneWithTopLevels(
-            bottom_up, out_channels, num_top_levels, norm
-        )
-        bottom_up_output_shapes = self.bottom_up.output_shape()
-
-        in_features = sorted(in_features, key=lambda x: split_name(x)[1])
-        self._size_divisibility = 128  # bottom_up_output_shapes[in_features[-1]].stride
-        self.out_channels = out_channels
-        self.min_level = split_name(in_features[0])[1]
-
-        # add the names for top blocks
-        prefix, last_suffix = split_name(in_features[-1])
-        for i in range(num_top_levels):
-            in_features.append(prefix + str(last_suffix + i + 1))
-        self.in_features = in_features
-
-        # generate output features
-        self._out_features = ["p{}".format(split_name(name)[1]) for name in in_features]
-        self._out_feature_strides = {
-            out_name: bottom_up_output_shapes[in_name].stride
-            for out_name, in_name in zip(self._out_features, in_features)
-        }
-        self._out_feature_channels = {k: out_channels for k in self._out_features}
-
-        # build bifpn
-        self.repeated_bifpn = nn.ModuleList()
-        for i in range(num_repeats):
-            if i == 0:
-                in_channels_list = [
-                    bottom_up_output_shapes[name].channels for name in in_features
-                ]
-            else:
-                in_channels_list = [
-                    self._out_feature_channels[name] for name in self._out_features
-                ]
-            self.repeated_bifpn.append(
-                SingleBiFPN(in_channels_list, out_channels, norm)
-            )
-
-    @property
-    def size_divisibility(self):
-        return self._size_divisibility
-
-    def forward(self, x):
-        """
-        Args:
-            input (dict[str->Tensor]): mapping feature map name (e.g., "p5") to
-                feature map tensor for each feature level in high to low resolution order.
-        Returns:
-            dict[str->Tensor]:
-                mapping from feature map name to FPN feature map tensor
-                in high to low resolution order. Returned feature names follow the FPN
-                paper convention: "p<stage>", where stage has stride = 2 ** stage e.g.,
-                ["n2", "n3", ..., "n6"].
-        """
-        bottom_up_features = self.bottom_up(x)
-        feats = [bottom_up_features[f] for f in self.in_features]
-
-        for bifpn in self.repeated_bifpn:
-            feats = bifpn(feats)
-
-        return dict(zip(self._out_features, feats))
-
-
-def _assert_strides_are_log2_contiguous(strides):
-    """
-    Assert that each stride is 2x times its preceding stride, i.e. "contiguous in log2".
-    """
-    for i, stride in enumerate(strides[1:], 1):
-        assert (
-            stride == 2 * strides[i - 1]
-        ), "Strides {} {} are not log2 contiguous".format(stride, strides[i - 1])
-
-
-@BACKBONE_REGISTRY.register()
-def build_fcos_resnet_bifpn_backbone(cfg, input_shape: ShapeSpec):
-    """
-    Args:
-        cfg: a detectron2 CfgNode
-    Returns:
-        backbone (Backbone): backbone module, must be a subclass of :class:`Backbone`.
-    """
-    bottom_up = build_resnet_backbone(cfg, input_shape)
-    in_features = cfg.MODEL.FPN.IN_FEATURES
-    out_channels = cfg.MODEL.BIFPN.OUT_CHANNELS
-    num_repeats = cfg.MODEL.BIFPN.NUM_BIFPN
-    top_levels = 2
-
-    backbone = BiFPN(
-        bottom_up=bottom_up,
-        in_features=in_features,
-        out_channels=out_channels,
-        num_top_levels=top_levels,
-        num_repeats=num_repeats,
-        norm=cfg.MODEL.BIFPN.NORM,
-    )
-    return backbone
-
-
-@BACKBONE_REGISTRY.register()
-def build_p35_fcos_resnet_bifpn_backbone(cfg, input_shape: ShapeSpec):
-    """
-    Args:
-        cfg: a detectron2 CfgNode
-    Returns:
-        backbone (Backbone): backbone module, must be a subclass of :class:`Backbone`.
-    """
-    bottom_up = build_resnet_backbone(cfg, input_shape)
-    in_features = cfg.MODEL.FPN.IN_FEATURES
-    out_channels = cfg.MODEL.BIFPN.OUT_CHANNELS
-    num_repeats = cfg.MODEL.BIFPN.NUM_BIFPN
-    top_levels = 0
-
-    backbone = BiFPN(
-        bottom_up=bottom_up,
-        in_features=in_features,
-        out_channels=out_channels,
-        num_top_levels=top_levels,
-        num_repeats=num_repeats,
-        norm=cfg.MODEL.BIFPN.NORM,
-    )
-    return backbone
-
-
-@BACKBONE_REGISTRY.register()
-def build_p35_fcos_dla_bifpn_backbone(cfg, input_shape: ShapeSpec):
-    """
-    Args:
-        cfg: a detectron2 CfgNode
-    Returns:
-        backbone (Backbone): backbone module, must be a subclass of :class:`Backbone`.
-    """
-    bottom_up = dla34(cfg)
-    in_features = cfg.MODEL.FPN.IN_FEATURES
-    out_channels = cfg.MODEL.BIFPN.OUT_CHANNELS
-    num_repeats = cfg.MODEL.BIFPN.NUM_BIFPN
-    top_levels = 0
-
-    backbone = BiFPN(
-        bottom_up=bottom_up,
-        in_features=in_features,
-        out_channels=out_channels,
-        num_top_levels=top_levels,
-        num_repeats=num_repeats,
-        norm=cfg.MODEL.BIFPN.NORM,
-    )
-    return backbone
-
-
-@BACKBONE_REGISTRY.register()
-def build_p37_fcos_dla_bifpn_backbone(cfg, input_shape: ShapeSpec):
-    """
-    Args:
-        cfg: a detectron2 CfgNode
-    Returns:
-        backbone (Backbone): backbone module, must be a subclass of :class:`Backbone`.
-    """
-    bottom_up = dla34(cfg)
-    in_features = cfg.MODEL.FPN.IN_FEATURES
-    out_channels = cfg.MODEL.BIFPN.OUT_CHANNELS
-    num_repeats = cfg.MODEL.BIFPN.NUM_BIFPN
-    assert cfg.MODEL.BIFPN.NUM_LEVELS == 5
-    top_levels = 2
-
-    backbone = BiFPN(
-        bottom_up=bottom_up,
-        in_features=in_features,
-        out_channels=out_channels,
-        num_top_levels=top_levels,
-        num_repeats=num_repeats,
-        norm=cfg.MODEL.BIFPN.NORM,
-    )
-    return backbone
diff --git a/eval/vbench/third_party/grit_src/centernet2/centernet/modeling/backbone/dla.py b/eval/vbench/third_party/grit_src/centernet2/centernet/modeling/backbone/dla.py
deleted file mode 100644
index bd915cc9..00000000
--- a/eval/vbench/third_party/grit_src/centernet2/centernet/modeling/backbone/dla.py
+++ /dev/null
@@ -1,610 +0,0 @@
-import math
-from os.path import join
-
-import fvcore.nn.weight_init as weight_init
-import numpy as np
-import torch
-import torch.nn.functional as F
-import torch.utils.model_zoo as model_zoo
-from detectron2.layers import (
-    Conv2d,
-    DeformConv,
-    FrozenBatchNorm2d,
-    ModulatedDeformConv,
-    ShapeSpec,
-    get_norm,
-)
-from detectron2.modeling.backbone.backbone import Backbone
-from detectron2.modeling.backbone.build import BACKBONE_REGISTRY
-from detectron2.modeling.backbone.fpn import FPN
-from detectron2.modeling.backbone.resnet import (
-    BasicStem,
-    BottleneckBlock,
-    DeformBottleneckBlock,
-)
-from torch import nn
-
-__all__ = [
-    "BottleneckBlock",
-    "DeformBottleneckBlock",
-    "BasicStem",
-]
-
-DCNV1 = False
-
-HASH = {
-    34: "ba72cf86",
-    60: "24839fc4",
-}
-
-
-def get_model_url(data, name, hash):
-    return join("http://dl.yf.io/dla/models", data, "{}-{}.pth".format(name, hash))
-
-
-class BasicBlock(nn.Module):
-    def __init__(self, inplanes, planes, stride=1, dilation=1, norm="BN"):
-        super(BasicBlock, self).__init__()
-        self.conv1 = nn.Conv2d(
-            inplanes,
-            planes,
-            kernel_size=3,
-            stride=stride,
-            padding=dilation,
-            bias=False,
-            dilation=dilation,
-        )
-        self.bn1 = get_norm(norm, planes)
-        self.relu = nn.ReLU(inplace=True)
-        self.conv2 = nn.Conv2d(
-            planes,
-            planes,
-            kernel_size=3,
-            stride=1,
-            padding=dilation,
-            bias=False,
-            dilation=dilation,
-        )
-        self.bn2 = get_norm(norm, planes)
-        self.stride = stride
-
-    def forward(self, x, residual=None):
-        if residual is None:
-            residual = x
-
-        out = self.conv1(x)
-        out = self.bn1(out)
-        out = self.relu(out)
-
-        out = self.conv2(out)
-        out = self.bn2(out)
-
-        out += residual
-        out = self.relu(out)
-
-        return out
-
-
-class Bottleneck(nn.Module):
-    expansion = 2
-
-    def __init__(self, inplanes, planes, stride=1, dilation=1, norm="BN"):
-        super(Bottleneck, self).__init__()
-        expansion = Bottleneck.expansion
-        bottle_planes = planes // expansion
-        self.conv1 = nn.Conv2d(inplanes, bottle_planes, kernel_size=1, bias=False)
-        self.bn1 = get_norm(norm, bottle_planes)
-        self.conv2 = nn.Conv2d(
-            bottle_planes,
-            bottle_planes,
-            kernel_size=3,
-            stride=stride,
-            padding=dilation,
-            bias=False,
-            dilation=dilation,
-        )
-        self.bn2 = get_norm(norm, bottle_planes)
-        self.conv3 = nn.Conv2d(bottle_planes, planes, kernel_size=1, bias=False)
-        self.bn3 = get_norm(norm, planes)
-        self.relu = nn.ReLU(inplace=True)
-        self.stride = stride
-
-    def forward(self, x, residual=None):
-        if residual is None:
-            residual = x
-
-        out = self.conv1(x)
-        out = self.bn1(out)
-        out = self.relu(out)
-
-        out = self.conv2(out)
-        out = self.bn2(out)
-        out = self.relu(out)
-
-        out = self.conv3(out)
-        out = self.bn3(out)
-
-        out += residual
-        out = self.relu(out)
-
-        return out
-
-
-class Root(nn.Module):
-    def __init__(self, in_channels, out_channels, kernel_size, residual, norm="BN"):
-        super(Root, self).__init__()
-        self.conv = nn.Conv2d(
-            in_channels,
-            out_channels,
-            1,
-            stride=1,
-            bias=False,
-            padding=(kernel_size - 1) // 2,
-        )
-        self.bn = get_norm(norm, out_channels)
-        self.relu = nn.ReLU(inplace=True)
-        self.residual = residual
-
-    def forward(self, *x):
-        children = x
-        x = self.conv(torch.cat(x, 1))
-        x = self.bn(x)
-        if self.residual:
-            x += children[0]
-        x = self.relu(x)
-
-        return x
-
-
-class Tree(nn.Module):
-    def __init__(
-        self,
-        levels,
-        block,
-        in_channels,
-        out_channels,
-        stride=1,
-        level_root=False,
-        root_dim=0,
-        root_kernel_size=1,
-        dilation=1,
-        root_residual=False,
-        norm="BN",
-    ):
-        super(Tree, self).__init__()
-        if root_dim == 0:
-            root_dim = 2 * out_channels
-        if level_root:
-            root_dim += in_channels
-        if levels == 1:
-            self.tree1 = block(
-                in_channels, out_channels, stride, dilation=dilation, norm=norm
-            )
-            self.tree2 = block(
-                out_channels, out_channels, 1, dilation=dilation, norm=norm
-            )
-        else:
-            self.tree1 = Tree(
-                levels - 1,
-                block,
-                in_channels,
-                out_channels,
-                stride,
-                root_dim=0,
-                root_kernel_size=root_kernel_size,
-                dilation=dilation,
-                root_residual=root_residual,
-                norm=norm,
-            )
-            self.tree2 = Tree(
-                levels - 1,
-                block,
-                out_channels,
-                out_channels,
-                root_dim=root_dim + out_channels,
-                root_kernel_size=root_kernel_size,
-                dilation=dilation,
-                root_residual=root_residual,
-                norm=norm,
-            )
-        if levels == 1:
-            self.root = Root(
-                root_dim, out_channels, root_kernel_size, root_residual, norm=norm
-            )
-        self.level_root = level_root
-        self.root_dim = root_dim
-        self.downsample = None
-        self.project = None
-        self.levels = levels
-        if stride > 1:
-            self.downsample = nn.MaxPool2d(stride, stride=stride)
-        if in_channels != out_channels:
-            self.project = nn.Sequential(
-                nn.Conv2d(
-                    in_channels, out_channels, kernel_size=1, stride=1, bias=False
-                ),
-                get_norm(norm, out_channels),
-            )
-
-    def forward(self, x, residual=None, children=None):
-        children = [] if children is None else children
-        bottom = self.downsample(x) if self.downsample else x
-        residual = self.project(bottom) if self.project else bottom
-        if self.level_root:
-            children.append(bottom)
-        x1 = self.tree1(x, residual)
-        if self.levels == 1:
-            x2 = self.tree2(x1)
-            x = self.root(x2, x1, *children)
-        else:
-            children.append(x1)
-            x = self.tree2(x1, children=children)
-        return x
-
-
-class DLA(nn.Module):
-    def __init__(
-        self,
-        num_layers,
-        levels,
-        channels,
-        block=BasicBlock,
-        residual_root=False,
-        norm="BN",
-    ):
-        """
-        Args:
-        """
-        super(DLA, self).__init__()
-        self.norm = norm
-        self.channels = channels
-        self.base_layer = nn.Sequential(
-            nn.Conv2d(3, channels[0], kernel_size=7, stride=1, padding=3, bias=False),
-            get_norm(self.norm, channels[0]),
-            nn.ReLU(inplace=True),
-        )
-        self.level0 = self._make_conv_level(channels[0], channels[0], levels[0])
-        self.level1 = self._make_conv_level(
-            channels[0], channels[1], levels[1], stride=2
-        )
-        self.level2 = Tree(
-            levels[2],
-            block,
-            channels[1],
-            channels[2],
-            2,
-            level_root=False,
-            root_residual=residual_root,
-            norm=norm,
-        )
-        self.level3 = Tree(
-            levels[3],
-            block,
-            channels[2],
-            channels[3],
-            2,
-            level_root=True,
-            root_residual=residual_root,
-            norm=norm,
-        )
-        self.level4 = Tree(
-            levels[4],
-            block,
-            channels[3],
-            channels[4],
-            2,
-            level_root=True,
-            root_residual=residual_root,
-            norm=norm,
-        )
-        self.level5 = Tree(
-            levels[5],
-            block,
-            channels[4],
-            channels[5],
-            2,
-            level_root=True,
-            root_residual=residual_root,
-            norm=norm,
-        )
-        self.load_pretrained_model(
-            data="imagenet", name="dla{}".format(num_layers), hash=HASH[num_layers]
-        )
-
-    def load_pretrained_model(self, data, name, hash):
-        model_url = get_model_url(data, name, hash)
-        model_weights = model_zoo.load_url(model_url)
-        num_classes = len(model_weights[list(model_weights.keys())[-1]])
-        self.fc = nn.Conv2d(
-            self.channels[-1],
-            num_classes,
-            kernel_size=1,
-            stride=1,
-            padding=0,
-            bias=True,
-        )
-        print("Loading pretrained")
-        self.load_state_dict(model_weights, strict=False)
-
-    def _make_conv_level(self, inplanes, planes, convs, stride=1, dilation=1):
-        modules = []
-        for i in range(convs):
-            modules.extend(
-                [
-                    nn.Conv2d(
-                        inplanes,
-                        planes,
-                        kernel_size=3,
-                        stride=stride if i == 0 else 1,
-                        padding=dilation,
-                        bias=False,
-                        dilation=dilation,
-                    ),
-                    get_norm(self.norm, planes),
-                    nn.ReLU(inplace=True),
-                ]
-            )
-            inplanes = planes
-        return nn.Sequential(*modules)
-
-    def forward(self, x):
-        y = []
-        x = self.base_layer(x)
-        for i in range(6):
-            x = getattr(self, "level{}".format(i))(x)
-            y.append(x)
-        return y
-
-
-def fill_up_weights(up):
-    w = up.weight.data
-    f = math.ceil(w.size(2) / 2)
-    c = (2 * f - 1 - f % 2) / (2.0 * f)
-    for i in range(w.size(2)):
-        for j in range(w.size(3)):
-            w[0, 0, i, j] = (1 - math.fabs(i / f - c)) * (1 - math.fabs(j / f - c))
-    for c in range(1, w.size(0)):
-        w[c, 0, :, :] = w[0, 0, :, :]
-
-
-class _DeformConv(nn.Module):
-    def __init__(self, chi, cho, norm="BN"):
-        super(_DeformConv, self).__init__()
-        self.actf = nn.Sequential(get_norm(norm, cho), nn.ReLU(inplace=True))
-        if DCNV1:
-            self.offset = Conv2d(
-                chi, 18, kernel_size=3, stride=1, padding=1, dilation=1
-            )
-            self.conv = DeformConv(
-                chi,
-                cho,
-                kernel_size=(3, 3),
-                stride=1,
-                padding=1,
-                dilation=1,
-                deformable_groups=1,
-            )
-        else:
-            self.offset = Conv2d(
-                chi, 27, kernel_size=3, stride=1, padding=1, dilation=1
-            )
-            self.conv = ModulatedDeformConv(
-                chi,
-                cho,
-                kernel_size=3,
-                stride=1,
-                padding=1,
-                dilation=1,
-                deformable_groups=1,
-            )
-        nn.init.constant_(self.offset.weight, 0)
-        nn.init.constant_(self.offset.bias, 0)
-
-    def forward(self, x):
-        if DCNV1:
-            offset = self.offset(x)
-            x = self.conv(x, offset)
-        else:
-            offset_mask = self.offset(x)
-            offset_x, offset_y, mask = torch.chunk(offset_mask, 3, dim=1)
-            offset = torch.cat((offset_x, offset_y), dim=1)
-            mask = mask.sigmoid()
-            x = self.conv(x, offset, mask)
-        x = self.actf(x)
-        return x
-
-
-class IDAUp(nn.Module):
-    def __init__(self, o, channels, up_f, norm="BN"):
-        super(IDAUp, self).__init__()
-        for i in range(1, len(channels)):
-            c = channels[i]
-            f = int(up_f[i])
-            proj = _DeformConv(c, o, norm=norm)
-            node = _DeformConv(o, o, norm=norm)
-
-            up = nn.ConvTranspose2d(
-                o,
-                o,
-                f * 2,
-                stride=f,
-                padding=f // 2,
-                output_padding=0,
-                groups=o,
-                bias=False,
-            )
-            fill_up_weights(up)
-
-            setattr(self, "proj_" + str(i), proj)
-            setattr(self, "up_" + str(i), up)
-            setattr(self, "node_" + str(i), node)
-
-    def forward(self, layers, startp, endp):
-        for i in range(startp + 1, endp):
-            upsample = getattr(self, "up_" + str(i - startp))
-            project = getattr(self, "proj_" + str(i - startp))
-            layers[i] = upsample(project(layers[i]))
-            node = getattr(self, "node_" + str(i - startp))
-            layers[i] = node(layers[i] + layers[i - 1])
-
-
-class DLAUp(nn.Module):
-    def __init__(self, startp, channels, scales, in_channels=None, norm="BN"):
-        super(DLAUp, self).__init__()
-        self.startp = startp
-        if in_channels is None:
-            in_channels = channels
-        self.channels = channels
-        channels = list(channels)
-        scales = np.array(scales, dtype=int)
-        for i in range(len(channels) - 1):
-            j = -i - 2
-            setattr(
-                self,
-                "ida_{}".format(i),
-                IDAUp(channels[j], in_channels[j:], scales[j:] // scales[j], norm=norm),
-            )
-            scales[j + 1 :] = scales[j]
-            in_channels[j + 1 :] = [channels[j] for _ in channels[j + 1 :]]
-
-    def forward(self, layers):
-        out = [layers[-1]]  # start with 32
-        for i in range(len(layers) - self.startp - 1):
-            ida = getattr(self, "ida_{}".format(i))
-            ida(layers, len(layers) - i - 2, len(layers))
-            out.insert(0, layers[-1])
-        return out
-
-
-DLA_CONFIGS = {
-    34: ([1, 1, 1, 2, 2, 1], [16, 32, 64, 128, 256, 512], BasicBlock),
-    60: ([1, 1, 1, 2, 3, 1], [16, 32, 128, 256, 512, 1024], Bottleneck),
-}
-
-
-class DLASeg(Backbone):
-    def __init__(
-        self, num_layers, out_features, use_dla_up=True, ms_output=False, norm="BN"
-    ):
-        super(DLASeg, self).__init__()
-        # depth = 34
-        levels, channels, Block = DLA_CONFIGS[num_layers]
-        self.base = DLA(
-            num_layers=num_layers,
-            levels=levels,
-            channels=channels,
-            block=Block,
-            norm=norm,
-        )
-        down_ratio = 4
-        self.first_level = int(np.log2(down_ratio))
-        self.ms_output = ms_output
-        self.last_level = 5 if not self.ms_output else 6
-        channels = self.base.channels
-        scales = [2**i for i in range(len(channels[self.first_level :]))]
-        self.use_dla_up = use_dla_up
-        if self.use_dla_up:
-            self.dla_up = DLAUp(
-                self.first_level, channels[self.first_level :], scales, norm=norm
-            )
-        out_channel = channels[self.first_level]
-        if not self.ms_output:  # stride 4 DLA
-            self.ida_up = IDAUp(
-                out_channel,
-                channels[self.first_level : self.last_level],
-                [2**i for i in range(self.last_level - self.first_level)],
-                norm=norm,
-            )
-        self._out_features = out_features
-        self._out_feature_channels = {"dla{}".format(i): channels[i] for i in range(6)}
-        self._out_feature_strides = {"dla{}".format(i): 2**i for i in range(6)}
-        self._size_divisibility = 32
-
-    @property
-    def size_divisibility(self):
-        return self._size_divisibility
-
-    def forward(self, x):
-        x = self.base(x)
-        if self.use_dla_up:
-            x = self.dla_up(x)
-        if not self.ms_output:  # stride 4 dla
-            y = []
-            for i in range(self.last_level - self.first_level):
-                y.append(x[i].clone())
-            self.ida_up(y, 0, len(y))
-            ret = {}
-            for i in range(self.last_level - self.first_level):
-                out_feature = "dla{}".format(i)
-                if out_feature in self._out_features:
-                    ret[out_feature] = y[i]
-        else:
-            ret = {}
-            st = self.first_level if self.use_dla_up else 0
-            for i in range(self.last_level - st):
-                out_feature = "dla{}".format(i + st)
-                if out_feature in self._out_features:
-                    ret[out_feature] = x[i]
-
-        return ret
-
-
-@BACKBONE_REGISTRY.register()
-def build_dla_backbone(cfg, input_shape):
-    """
-    Create a ResNet instance from config.
-
-    Returns:
-        ResNet: a :class:`ResNet` instance.
-    """
-    return DLASeg(
-        out_features=cfg.MODEL.DLA.OUT_FEATURES,
-        num_layers=cfg.MODEL.DLA.NUM_LAYERS,
-        use_dla_up=cfg.MODEL.DLA.USE_DLA_UP,
-        ms_output=cfg.MODEL.DLA.MS_OUTPUT,
-        norm=cfg.MODEL.DLA.NORM,
-    )
-
-
-class LastLevelP6P7(nn.Module):
-    """
-    This module is used in RetinaNet to generate extra layers, P6 and P7 from
-    C5 feature.
-    """
-
-    def __init__(self, in_channels, out_channels):
-        super().__init__()
-        self.num_levels = 2
-        self.in_feature = "dla5"
-        self.p6 = nn.Conv2d(in_channels, out_channels, 3, 2, 1)
-        self.p7 = nn.Conv2d(out_channels, out_channels, 3, 2, 1)
-        for module in [self.p6, self.p7]:
-            weight_init.c2_xavier_fill(module)
-
-    def forward(self, c5):
-        p6 = self.p6(c5)
-        p7 = self.p7(F.relu(p6))
-        return [p6, p7]
-
-
-@BACKBONE_REGISTRY.register()
-def build_retinanet_dla_fpn_backbone(cfg, input_shape: ShapeSpec):
-    """
-    Args:
-        cfg: a detectron2 CfgNode
-    Returns:
-        backbone (Backbone): backbone module, must be a subclass of :class:`Backbone`.
-    """
-    bottom_up = build_dla_backbone(cfg, input_shape)
-    in_features = cfg.MODEL.FPN.IN_FEATURES
-    out_channels = cfg.MODEL.FPN.OUT_CHANNELS
-    in_channels_p6p7 = bottom_up.output_shape()["dla5"].channels
-    backbone = FPN(
-        bottom_up=bottom_up,
-        in_features=in_features,
-        out_channels=out_channels,
-        norm=cfg.MODEL.FPN.NORM,
-        top_block=LastLevelP6P7(in_channels_p6p7, out_channels),
-        fuse_type=cfg.MODEL.FPN.FUSE_TYPE,
-    )
-    return backbone
diff --git a/eval/vbench/third_party/grit_src/centernet2/centernet/modeling/backbone/dlafpn.py b/eval/vbench/third_party/grit_src/centernet2/centernet/modeling/backbone/dlafpn.py
deleted file mode 100644
index d9d19ddf..00000000
--- a/eval/vbench/third_party/grit_src/centernet2/centernet/modeling/backbone/dlafpn.py
+++ /dev/null
@@ -1,594 +0,0 @@
-#!/usr/bin/env python
-# -*- coding: utf-8 -*-
-
-# this file is from https://github.com/ucbdrive/dla/blob/master/dla.py.
-
-import math
-from os.path import join
-
-import fvcore.nn.weight_init as weight_init
-import numpy as np
-import torch
-import torch.nn.functional as F
-import torch.utils.model_zoo as model_zoo
-from detectron2.layers import Conv2d, ModulatedDeformConv, ShapeSpec
-from detectron2.layers.batch_norm import get_norm
-from detectron2.modeling.backbone import FPN, Backbone
-from detectron2.modeling.backbone.build import BACKBONE_REGISTRY
-from torch import nn
-
-WEB_ROOT = "http://dl.yf.io/dla/models"
-
-
-def get_model_url(data, name, hash):
-    return join("http://dl.yf.io/dla/models", data, "{}-{}.pth".format(name, hash))
-
-
-def conv3x3(in_planes, out_planes, stride=1):
-    "3x3 convolution with padding"
-    return nn.Conv2d(
-        in_planes, out_planes, kernel_size=3, stride=stride, padding=1, bias=False
-    )
-
-
-class BasicBlock(nn.Module):
-    def __init__(self, cfg, inplanes, planes, stride=1, dilation=1):
-        super(BasicBlock, self).__init__()
-        self.conv1 = nn.Conv2d(
-            inplanes,
-            planes,
-            kernel_size=3,
-            stride=stride,
-            padding=dilation,
-            bias=False,
-            dilation=dilation,
-        )
-        self.bn1 = get_norm(cfg.MODEL.DLA.NORM, planes)
-        self.relu = nn.ReLU(inplace=True)
-        self.conv2 = nn.Conv2d(
-            planes,
-            planes,
-            kernel_size=3,
-            stride=1,
-            padding=dilation,
-            bias=False,
-            dilation=dilation,
-        )
-        self.bn2 = get_norm(cfg.MODEL.DLA.NORM, planes)
-        self.stride = stride
-
-    def forward(self, x, residual=None):
-        if residual is None:
-            residual = x
-
-        out = self.conv1(x)
-        out = self.bn1(out)
-        out = self.relu(out)
-
-        out = self.conv2(out)
-        out = self.bn2(out)
-
-        out += residual
-        out = self.relu(out)
-
-        return out
-
-
-class Bottleneck(nn.Module):
-    expansion = 2
-
-    def __init__(self, cfg, inplanes, planes, stride=1, dilation=1):
-        super(Bottleneck, self).__init__()
-        expansion = Bottleneck.expansion
-        bottle_planes = planes // expansion
-        self.conv1 = nn.Conv2d(inplanes, bottle_planes, kernel_size=1, bias=False)
-        self.bn1 = get_norm(cfg.MODEL.DLA.NORM, bottle_planes)
-        self.conv2 = nn.Conv2d(
-            bottle_planes,
-            bottle_planes,
-            kernel_size=3,
-            stride=stride,
-            padding=dilation,
-            bias=False,
-            dilation=dilation,
-        )
-        self.bn2 = get_norm(cfg.MODEL.DLA.NORM, bottle_planes)
-        self.conv3 = nn.Conv2d(bottle_planes, planes, kernel_size=1, bias=False)
-        self.bn3 = get_norm(cfg.MODEL.DLA.NORM, planes)
-        self.relu = nn.ReLU(inplace=True)
-        self.stride = stride
-
-    def forward(self, x, residual=None):
-        if residual is None:
-            residual = x
-
-        out = self.conv1(x)
-        out = self.bn1(out)
-        out = self.relu(out)
-
-        out = self.conv2(out)
-        out = self.bn2(out)
-        out = self.relu(out)
-
-        out = self.conv3(out)
-        out = self.bn3(out)
-
-        out += residual
-        out = self.relu(out)
-
-        return out
-
-
-class Root(nn.Module):
-    def __init__(self, cfg, in_channels, out_channels, kernel_size, residual):
-        super(Root, self).__init__()
-        self.conv = nn.Conv2d(
-            in_channels,
-            out_channels,
-            kernel_size,
-            stride=1,
-            bias=False,
-            padding=(kernel_size - 1) // 2,
-        )
-        self.bn = get_norm(cfg.MODEL.DLA.NORM, out_channels)
-        self.relu = nn.ReLU(inplace=True)
-        self.residual = residual
-
-    def forward(self, *x):
-        children = x
-        x = self.conv(torch.cat(x, 1))
-        x = self.bn(x)
-        if self.residual:
-            x += children[0]
-        x = self.relu(x)
-
-        return x
-
-
-class Tree(nn.Module):
-    def __init__(
-        self,
-        cfg,
-        levels,
-        block,
-        in_channels,
-        out_channels,
-        stride=1,
-        level_root=False,
-        root_dim=0,
-        root_kernel_size=1,
-        dilation=1,
-        root_residual=False,
-    ):
-        super(Tree, self).__init__()
-        if root_dim == 0:
-            root_dim = 2 * out_channels
-        if level_root:
-            root_dim += in_channels
-        if levels == 1:
-            self.tree1 = block(
-                cfg, in_channels, out_channels, stride, dilation=dilation
-            )
-            self.tree2 = block(cfg, out_channels, out_channels, 1, dilation=dilation)
-        else:
-            self.tree1 = Tree(
-                cfg,
-                levels - 1,
-                block,
-                in_channels,
-                out_channels,
-                stride,
-                root_dim=0,
-                root_kernel_size=root_kernel_size,
-                dilation=dilation,
-                root_residual=root_residual,
-            )
-            self.tree2 = Tree(
-                cfg,
-                levels - 1,
-                block,
-                out_channels,
-                out_channels,
-                root_dim=root_dim + out_channels,
-                root_kernel_size=root_kernel_size,
-                dilation=dilation,
-                root_residual=root_residual,
-            )
-        if levels == 1:
-            self.root = Root(
-                cfg, root_dim, out_channels, root_kernel_size, root_residual
-            )
-        self.level_root = level_root
-        self.root_dim = root_dim
-        self.downsample = None
-        self.project = None
-        self.levels = levels
-        if stride > 1:
-            self.downsample = nn.MaxPool2d(stride, stride=stride)
-        if in_channels != out_channels:
-            self.project = nn.Sequential(
-                nn.Conv2d(
-                    in_channels, out_channels, kernel_size=1, stride=1, bias=False
-                ),
-                get_norm(cfg.MODEL.DLA.NORM, out_channels),
-            )
-
-    def forward(self, x, residual=None, children=None):
-        if self.training and residual is not None:
-            x = x + residual.sum() * 0.0
-        children = [] if children is None else children
-        bottom = self.downsample(x) if self.downsample else x
-        residual = self.project(bottom) if self.project else bottom
-        if self.level_root:
-            children.append(bottom)
-        x1 = self.tree1(x, residual)
-        if self.levels == 1:
-            x2 = self.tree2(x1)
-            x = self.root(x2, x1, *children)
-        else:
-            children.append(x1)
-            x = self.tree2(x1, children=children)
-        return x
-
-
-class DLA(Backbone):
-    def __init__(self, cfg, levels, channels, block=BasicBlock, residual_root=False):
-        super(DLA, self).__init__()
-        self.cfg = cfg
-        self.channels = channels
-
-        self._out_features = ["dla{}".format(i) for i in range(6)]
-        self._out_feature_channels = {
-            k: channels[i] for i, k in enumerate(self._out_features)
-        }
-        self._out_feature_strides = {k: 2**i for i, k in enumerate(self._out_features)}
-
-        self.base_layer = nn.Sequential(
-            nn.Conv2d(3, channels[0], kernel_size=7, stride=1, padding=3, bias=False),
-            get_norm(cfg.MODEL.DLA.NORM, channels[0]),
-            nn.ReLU(inplace=True),
-        )
-        self.level0 = self._make_conv_level(channels[0], channels[0], levels[0])
-        self.level1 = self._make_conv_level(
-            channels[0], channels[1], levels[1], stride=2
-        )
-        self.level2 = Tree(
-            cfg,
-            levels[2],
-            block,
-            channels[1],
-            channels[2],
-            2,
-            level_root=False,
-            root_residual=residual_root,
-        )
-        self.level3 = Tree(
-            cfg,
-            levels[3],
-            block,
-            channels[2],
-            channels[3],
-            2,
-            level_root=True,
-            root_residual=residual_root,
-        )
-        self.level4 = Tree(
-            cfg,
-            levels[4],
-            block,
-            channels[3],
-            channels[4],
-            2,
-            level_root=True,
-            root_residual=residual_root,
-        )
-        self.level5 = Tree(
-            cfg,
-            levels[5],
-            block,
-            channels[4],
-            channels[5],
-            2,
-            level_root=True,
-            root_residual=residual_root,
-        )
-
-        for m in self.modules():
-            if isinstance(m, nn.Conv2d):
-                n = m.kernel_size[0] * m.kernel_size[1] * m.out_channels
-                m.weight.data.normal_(0, math.sqrt(2.0 / n))
-
-        self.load_pretrained_model(data="imagenet", name="dla34", hash="ba72cf86")
-
-    def load_pretrained_model(self, data, name, hash):
-        model_url = get_model_url(data, name, hash)
-        model_weights = model_zoo.load_url(model_url)
-        del model_weights["fc.weight"]
-        del model_weights["fc.bias"]
-        print("Loading pretrained DLA!")
-        self.load_state_dict(model_weights, strict=True)
-
-    def _make_conv_level(self, inplanes, planes, convs, stride=1, dilation=1):
-        modules = []
-        for i in range(convs):
-            modules.extend(
-                [
-                    nn.Conv2d(
-                        inplanes,
-                        planes,
-                        kernel_size=3,
-                        stride=stride if i == 0 else 1,
-                        padding=dilation,
-                        bias=False,
-                        dilation=dilation,
-                    ),
-                    get_norm(self.cfg.MODEL.DLA.NORM, planes),
-                    nn.ReLU(inplace=True),
-                ]
-            )
-            inplanes = planes
-        return nn.Sequential(*modules)
-
-    def forward(self, x):
-        y = {}
-        x = self.base_layer(x)
-        for i in range(6):
-            name = "level{}".format(i)
-            x = getattr(self, name)(x)
-            y["dla{}".format(i)] = x
-        return y
-
-
-def fill_up_weights(up):
-    w = up.weight.data
-    f = math.ceil(w.size(2) / 2)
-    c = (2 * f - 1 - f % 2) / (2.0 * f)
-    for i in range(w.size(2)):
-        for j in range(w.size(3)):
-            w[0, 0, i, j] = (1 - math.fabs(i / f - c)) * (1 - math.fabs(j / f - c))
-    for c in range(1, w.size(0)):
-        w[c, 0, :, :] = w[0, 0, :, :]
-
-
-class Conv(nn.Module):
-    def __init__(self, chi, cho, norm):
-        super(Conv, self).__init__()
-        self.conv = nn.Sequential(
-            nn.Conv2d(chi, cho, kernel_size=1, stride=1, bias=False),
-            get_norm(norm, cho),
-            nn.ReLU(inplace=True),
-        )
-
-    def forward(self, x):
-        return self.conv(x)
-
-
-class DeformConv(nn.Module):
-    def __init__(self, chi, cho, norm):
-        super(DeformConv, self).__init__()
-        self.actf = nn.Sequential(get_norm(norm, cho), nn.ReLU(inplace=True))
-        self.offset = Conv2d(chi, 27, kernel_size=3, stride=1, padding=1, dilation=1)
-        self.conv = ModulatedDeformConv(
-            chi,
-            cho,
-            kernel_size=3,
-            stride=1,
-            padding=1,
-            dilation=1,
-            deformable_groups=1,
-        )
-        nn.init.constant_(self.offset.weight, 0)
-        nn.init.constant_(self.offset.bias, 0)
-
-    def forward(self, x):
-        offset_mask = self.offset(x)
-        offset_x, offset_y, mask = torch.chunk(offset_mask, 3, dim=1)
-        offset = torch.cat((offset_x, offset_y), dim=1)
-        mask = mask.sigmoid()
-        x = self.conv(x, offset, mask)
-        x = self.actf(x)
-        return x
-
-
-class IDAUp(nn.Module):
-    def __init__(self, o, channels, up_f, norm="FrozenBN", node_type=Conv):
-        super(IDAUp, self).__init__()
-        for i in range(1, len(channels)):
-            c = channels[i]
-            f = int(up_f[i])
-            proj = node_type(c, o, norm)
-            node = node_type(o, o, norm)
-
-            up = nn.ConvTranspose2d(
-                o,
-                o,
-                f * 2,
-                stride=f,
-                padding=f // 2,
-                output_padding=0,
-                groups=o,
-                bias=False,
-            )
-            fill_up_weights(up)
-
-            setattr(self, "proj_" + str(i), proj)
-            setattr(self, "up_" + str(i), up)
-            setattr(self, "node_" + str(i), node)
-
-    def forward(self, layers, startp, endp):
-        for i in range(startp + 1, endp):
-            upsample = getattr(self, "up_" + str(i - startp))
-            project = getattr(self, "proj_" + str(i - startp))
-            layers[i] = upsample(project(layers[i]))
-            node = getattr(self, "node_" + str(i - startp))
-            layers[i] = node(layers[i] + layers[i - 1])
-
-
-DLAUP_NODE_MAP = {
-    "conv": Conv,
-    "dcn": DeformConv,
-}
-
-
-class DLAUP(Backbone):
-    def __init__(self, bottom_up, in_features, norm, dlaup_node="conv"):
-        super(DLAUP, self).__init__()
-        assert isinstance(bottom_up, Backbone)
-        self.bottom_up = bottom_up
-        input_shapes = bottom_up.output_shape()
-        in_strides = [input_shapes[f].stride for f in in_features]
-        in_channels = [input_shapes[f].channels for f in in_features]
-        in_levels = [int(math.log2(input_shapes[f].stride)) for f in in_features]
-        self.in_features = in_features
-        out_features = ["dlaup{}".format(l) for l in in_levels]
-        self._out_features = out_features
-        self._out_feature_channels = {
-            "dlaup{}".format(l): in_channels[i] for i, l in enumerate(in_levels)
-        }
-        self._out_feature_strides = {"dlaup{}".format(l): 2**l for l in in_levels}
-
-        print("self._out_features", self._out_features)
-        print("self._out_feature_channels", self._out_feature_channels)
-        print("self._out_feature_strides", self._out_feature_strides)
-        self._size_divisibility = 32
-
-        node_type = DLAUP_NODE_MAP[dlaup_node]
-
-        self.startp = int(math.log2(in_strides[0]))
-        self.channels = in_channels
-        channels = list(in_channels)
-        scales = np.array([2**i for i in range(len(out_features))], dtype=int)
-        for i in range(len(channels) - 1):
-            j = -i - 2
-            setattr(
-                self,
-                "ida_{}".format(i),
-                IDAUp(
-                    channels[j],
-                    in_channels[j:],
-                    scales[j:] // scales[j],
-                    norm=norm,
-                    node_type=node_type,
-                ),
-            )
-            scales[j + 1 :] = scales[j]
-            in_channels[j + 1 :] = [channels[j] for _ in channels[j + 1 :]]
-
-    @property
-    def size_divisibility(self):
-        return self._size_divisibility
-
-    def forward(self, x):
-        bottom_up_features = self.bottom_up(x)
-        layers = [bottom_up_features[f] for f in self.in_features]
-        out = [layers[-1]]  # start with 32
-        for i in range(len(layers) - 1):
-            ida = getattr(self, "ida_{}".format(i))
-            ida(layers, len(layers) - i - 2, len(layers))
-            out.insert(0, layers[-1])
-        ret = {}
-        for k, v in zip(self._out_features, out):
-            ret[k] = v
-        # import pdb; pdb.set_trace()
-        return ret
-
-
-def dla34(cfg, pretrained=None):  # DLA-34
-    model = DLA(cfg, [1, 1, 1, 2, 2, 1], [16, 32, 64, 128, 256, 512], block=BasicBlock)
-    return model
-
-
-class LastLevelP6P7(nn.Module):
-    """
-    This module is used in RetinaNet to generate extra layers, P6 and P7 from
-    C5 feature.
-    """
-
-    def __init__(self, in_channels, out_channels):
-        super().__init__()
-        self.num_levels = 2
-        self.in_feature = "dla5"
-        self.p6 = nn.Conv2d(in_channels, out_channels, 3, 2, 1)
-        self.p7 = nn.Conv2d(out_channels, out_channels, 3, 2, 1)
-        for module in [self.p6, self.p7]:
-            weight_init.c2_xavier_fill(module)
-
-    def forward(self, c5):
-        p6 = self.p6(c5)
-        p7 = self.p7(F.relu(p6))
-        return [p6, p7]
-
-
-@BACKBONE_REGISTRY.register()
-def build_dla_fpn3_backbone(cfg, input_shape: ShapeSpec):
-    """
-    Args:
-        cfg: a detectron2 CfgNode
-    Returns:
-        backbone (Backbone): backbone module, must be a subclass of :class:`Backbone`.
-    """
-
-    depth_to_creator = {"dla34": dla34}
-    bottom_up = depth_to_creator["dla{}".format(cfg.MODEL.DLA.NUM_LAYERS)](cfg)
-    in_features = cfg.MODEL.FPN.IN_FEATURES
-    out_channels = cfg.MODEL.FPN.OUT_CHANNELS
-
-    backbone = FPN(
-        bottom_up=bottom_up,
-        in_features=in_features,
-        out_channels=out_channels,
-        norm=cfg.MODEL.FPN.NORM,
-        top_block=None,
-        fuse_type=cfg.MODEL.FPN.FUSE_TYPE,
-    )
-
-    return backbone
-
-
-@BACKBONE_REGISTRY.register()
-def build_dla_fpn5_backbone(cfg, input_shape: ShapeSpec):
-    """
-    Args:
-        cfg: a detectron2 CfgNode
-    Returns:
-        backbone (Backbone): backbone module, must be a subclass of :class:`Backbone`.
-    """
-
-    depth_to_creator = {"dla34": dla34}
-    bottom_up = depth_to_creator["dla{}".format(cfg.MODEL.DLA.NUM_LAYERS)](cfg)
-    in_features = cfg.MODEL.FPN.IN_FEATURES
-    out_channels = cfg.MODEL.FPN.OUT_CHANNELS
-    in_channels_top = bottom_up.output_shape()["dla5"].channels
-
-    backbone = FPN(
-        bottom_up=bottom_up,
-        in_features=in_features,
-        out_channels=out_channels,
-        norm=cfg.MODEL.FPN.NORM,
-        top_block=LastLevelP6P7(in_channels_top, out_channels),
-        fuse_type=cfg.MODEL.FPN.FUSE_TYPE,
-    )
-
-    return backbone
-
-
-@BACKBONE_REGISTRY.register()
-def build_dlaup_backbone(cfg, input_shape: ShapeSpec):
-    """
-    Args:
-        cfg: a detectron2 CfgNode
-    Returns:
-        backbone (Backbone): backbone module, must be a subclass of :class:`Backbone`.
-    """
-
-    depth_to_creator = {"dla34": dla34}
-    bottom_up = depth_to_creator["dla{}".format(cfg.MODEL.DLA.NUM_LAYERS)](cfg)
-
-    backbone = DLAUP(
-        bottom_up=bottom_up,
-        in_features=cfg.MODEL.DLA.DLAUP_IN_FEATURES,
-        norm=cfg.MODEL.DLA.NORM,
-        dlaup_node=cfg.MODEL.DLA.DLAUP_NODE,
-    )
-
-    return backbone
diff --git a/eval/vbench/third_party/grit_src/centernet2/centernet/modeling/backbone/fpn_p5.py b/eval/vbench/third_party/grit_src/centernet2/centernet/modeling/backbone/fpn_p5.py
deleted file mode 100644
index 3b62a393..00000000
--- a/eval/vbench/third_party/grit_src/centernet2/centernet/modeling/backbone/fpn_p5.py
+++ /dev/null
@@ -1,78 +0,0 @@
-# Copyright (c) Facebook, Inc. and its affiliates. All Rights Reserved
-import math
-
-import fvcore.nn.weight_init as weight_init
-import torch.nn.functional as F
-from detectron2.layers import Conv2d, ShapeSpec, get_norm
-from detectron2.modeling.backbone import Backbone
-from detectron2.modeling.backbone.build import BACKBONE_REGISTRY
-from detectron2.modeling.backbone.fpn import FPN
-from detectron2.modeling.backbone.resnet import build_resnet_backbone
-from torch import nn
-
-
-class LastLevelP6P7_P5(nn.Module):
-    """
-    This module is used in RetinaNet to generate extra layers, P6 and P7 from
-    C5 feature.
-    """
-
-    def __init__(self, in_channels, out_channels):
-        super().__init__()
-        self.num_levels = 2
-        self.in_feature = "p5"
-        self.p6 = nn.Conv2d(in_channels, out_channels, 3, 2, 1)
-        self.p7 = nn.Conv2d(out_channels, out_channels, 3, 2, 1)
-        for module in [self.p6, self.p7]:
-            weight_init.c2_xavier_fill(module)
-
-    def forward(self, c5):
-        p6 = self.p6(c5)
-        p7 = self.p7(F.relu(p6))
-        return [p6, p7]
-
-
-@BACKBONE_REGISTRY.register()
-def build_p67_resnet_fpn_backbone(cfg, input_shape: ShapeSpec):
-    """
-    Args:
-        cfg: a detectron2 CfgNode
-
-    Returns:
-        backbone (Backbone): backbone module, must be a subclass of :class:`Backbone`.
-    """
-    bottom_up = build_resnet_backbone(cfg, input_shape)
-    in_features = cfg.MODEL.FPN.IN_FEATURES
-    out_channels = cfg.MODEL.FPN.OUT_CHANNELS
-    backbone = FPN(
-        bottom_up=bottom_up,
-        in_features=in_features,
-        out_channels=out_channels,
-        norm=cfg.MODEL.FPN.NORM,
-        top_block=LastLevelP6P7_P5(out_channels, out_channels),
-        fuse_type=cfg.MODEL.FPN.FUSE_TYPE,
-    )
-    return backbone
-
-
-@BACKBONE_REGISTRY.register()
-def build_p35_resnet_fpn_backbone(cfg, input_shape: ShapeSpec):
-    """
-    Args:
-        cfg: a detectron2 CfgNode
-
-    Returns:
-        backbone (Backbone): backbone module, must be a subclass of :class:`Backbone`.
-    """
-    bottom_up = build_resnet_backbone(cfg, input_shape)
-    in_features = cfg.MODEL.FPN.IN_FEATURES
-    out_channels = cfg.MODEL.FPN.OUT_CHANNELS
-    backbone = FPN(
-        bottom_up=bottom_up,
-        in_features=in_features,
-        out_channels=out_channels,
-        norm=cfg.MODEL.FPN.NORM,
-        top_block=None,
-        fuse_type=cfg.MODEL.FPN.FUSE_TYPE,
-    )
-    return backbone
diff --git a/eval/vbench/third_party/grit_src/centernet2/centernet/modeling/backbone/res2net.py b/eval/vbench/third_party/grit_src/centernet2/centernet/modeling/backbone/res2net.py
deleted file mode 100644
index 036e73f7..00000000
--- a/eval/vbench/third_party/grit_src/centernet2/centernet/modeling/backbone/res2net.py
+++ /dev/null
@@ -1,826 +0,0 @@
-# Copyright (c) Facebook, Inc. and its affiliates. All Rights Reserved
-# This file is modified from https://github.com/Res2Net/Res2Net-detectron2/blob/master/detectron2/modeling/backbone/resnet.py
-# The original file is under Apache-2.0 License
-import fvcore.nn.weight_init as weight_init
-import numpy as np
-import torch
-import torch.nn.functional as F
-from detectron2.layers import (
-    CNNBlockBase,
-    Conv2d,
-    DeformConv,
-    ModulatedDeformConv,
-    ShapeSpec,
-    get_norm,
-)
-from detectron2.modeling.backbone import Backbone
-from detectron2.modeling.backbone.build import BACKBONE_REGISTRY
-from detectron2.modeling.backbone.fpn import FPN
-from torch import nn
-
-from .bifpn import BiFPN
-from .fpn_p5 import LastLevelP6P7_P5
-
-__all__ = [
-    "ResNetBlockBase",
-    "BasicBlock",
-    "BottleneckBlock",
-    "DeformBottleneckBlock",
-    "BasicStem",
-    "ResNet",
-    "make_stage",
-    "build_res2net_backbone",
-]
-
-
-ResNetBlockBase = CNNBlockBase
-"""
-Alias for backward compatibiltiy.
-"""
-
-
-class BasicBlock(CNNBlockBase):
-    """
-    The basic residual block for ResNet-18 and ResNet-34, with two 3x3 conv layers
-    and a projection shortcut if needed.
-    """
-
-    def __init__(self, in_channels, out_channels, *, stride=1, norm="BN"):
-        """
-        Args:
-            in_channels (int): Number of input channels.
-            out_channels (int): Number of output channels.
-            stride (int): Stride for the first conv.
-            norm (str or callable): normalization for all conv layers.
-                See :func:`layers.get_norm` for supported format.
-        """
-        super().__init__(in_channels, out_channels, stride)
-
-        if in_channels != out_channels:
-            self.shortcut = Conv2d(
-                in_channels,
-                out_channels,
-                kernel_size=1,
-                stride=stride,
-                bias=False,
-                norm=get_norm(norm, out_channels),
-            )
-        else:
-            self.shortcut = None
-
-        self.conv1 = Conv2d(
-            in_channels,
-            out_channels,
-            kernel_size=3,
-            stride=stride,
-            padding=1,
-            bias=False,
-            norm=get_norm(norm, out_channels),
-        )
-
-        self.conv2 = Conv2d(
-            out_channels,
-            out_channels,
-            kernel_size=3,
-            stride=1,
-            padding=1,
-            bias=False,
-            norm=get_norm(norm, out_channels),
-        )
-
-        for layer in [self.conv1, self.conv2, self.shortcut]:
-            if layer is not None:  # shortcut can be None
-                weight_init.c2_msra_fill(layer)
-
-    def forward(self, x):
-        out = self.conv1(x)
-        out = F.relu_(out)
-        out = self.conv2(out)
-
-        if self.shortcut is not None:
-            shortcut = self.shortcut(x)
-        else:
-            shortcut = x
-
-        out += shortcut
-        out = F.relu_(out)
-        return out
-
-
-class BottleneckBlock(CNNBlockBase):
-    """
-    The standard bottle2neck residual block used by Res2Net-50, 101 and 152.
-    """
-
-    def __init__(
-        self,
-        in_channels,
-        out_channels,
-        *,
-        bottleneck_channels,
-        stride=1,
-        num_groups=1,
-        norm="BN",
-        stride_in_1x1=False,
-        dilation=1,
-        basewidth=26,
-        scale=4,
-    ):
-        """
-        Args:
-            bottleneck_channels (int): number of output channels for the 3x3
-                "bottleneck" conv layers.
-            num_groups (int): number of groups for the 3x3 conv layer.
-            norm (str or callable): normalization for all conv layers.
-                See :func:`layers.get_norm` for supported format.
-            stride_in_1x1 (bool): when stride>1, whether to put stride in the
-                first 1x1 convolution or the bottleneck 3x3 convolution.
-            dilation (int): the dilation rate of the 3x3 conv layer.
-        """
-        super().__init__(in_channels, out_channels, stride)
-
-        if in_channels != out_channels:
-            self.shortcut = nn.Sequential(
-                nn.AvgPool2d(
-                    kernel_size=stride,
-                    stride=stride,
-                    ceil_mode=True,
-                    count_include_pad=False,
-                ),
-                Conv2d(
-                    in_channels,
-                    out_channels,
-                    kernel_size=1,
-                    stride=1,
-                    bias=False,
-                    norm=get_norm(norm, out_channels),
-                ),
-            )
-        else:
-            self.shortcut = None
-
-        # The original MSRA ResNet models have stride in the first 1x1 conv
-        # The subsequent fb.torch.resnet and Caffe2 ResNe[X]t implementations have
-        # stride in the 3x3 conv
-        stride_1x1, stride_3x3 = (stride, 1) if stride_in_1x1 else (1, stride)
-        width = bottleneck_channels // scale
-
-        self.conv1 = Conv2d(
-            in_channels,
-            bottleneck_channels,
-            kernel_size=1,
-            stride=stride_1x1,
-            bias=False,
-            norm=get_norm(norm, bottleneck_channels),
-        )
-        if scale == 1:
-            self.nums = 1
-        else:
-            self.nums = scale - 1
-        if self.in_channels != self.out_channels and stride_3x3 != 2:
-            self.pool = nn.AvgPool2d(kernel_size=3, stride=stride_3x3, padding=1)
-
-        convs = []
-        bns = []
-        for i in range(self.nums):
-            convs.append(
-                nn.Conv2d(
-                    width,
-                    width,
-                    kernel_size=3,
-                    stride=stride_3x3,
-                    padding=1 * dilation,
-                    bias=False,
-                    groups=num_groups,
-                    dilation=dilation,
-                )
-            )
-            bns.append(get_norm(norm, width))
-        self.convs = nn.ModuleList(convs)
-        self.bns = nn.ModuleList(bns)
-
-        self.conv3 = Conv2d(
-            bottleneck_channels,
-            out_channels,
-            kernel_size=1,
-            bias=False,
-            norm=get_norm(norm, out_channels),
-        )
-        self.scale = scale
-        self.width = width
-        self.in_channels = in_channels
-        self.out_channels = out_channels
-        self.stride_3x3 = stride_3x3
-        for layer in [self.conv1, self.conv3]:
-            if layer is not None:  # shortcut can be None
-                weight_init.c2_msra_fill(layer)
-        if self.shortcut is not None:
-            for layer in self.shortcut.modules():
-                if isinstance(layer, Conv2d):
-                    weight_init.c2_msra_fill(layer)
-
-        for layer in self.convs:
-            if layer is not None:  # shortcut can be None
-                weight_init.c2_msra_fill(layer)
-
-        # Zero-initialize the last normalization in each residual branch,
-        # so that at the beginning, the residual branch starts with zeros,
-        # and each residual block behaves like an identity.
-        # See Sec 5.1 in "Accurate, Large Minibatch SGD: Training ImageNet in 1 Hour":
-        # "For BN layers, the learnable scaling coefficient γ is initialized
-        # to be 1, except for each residual block's last BN
-        # where γ is initialized to be 0."
-
-        # nn.init.constant_(self.conv3.norm.weight, 0)
-        # TODO this somehow hurts performance when training GN models from scratch.
-        # Add it as an option when we need to use this code to train a backbone.
-
-    def forward(self, x):
-        out = self.conv1(x)
-        out = F.relu_(out)
-
-        spx = torch.split(out, self.width, 1)
-        for i in range(self.nums):
-            if i == 0 or self.in_channels != self.out_channels:
-                sp = spx[i]
-            else:
-                sp = sp + spx[i]
-            sp = self.convs[i](sp)
-            sp = F.relu_(self.bns[i](sp))
-            if i == 0:
-                out = sp
-            else:
-                out = torch.cat((out, sp), 1)
-        if self.scale != 1 and self.stride_3x3 == 1:
-            out = torch.cat((out, spx[self.nums]), 1)
-        elif self.scale != 1 and self.stride_3x3 == 2:
-            out = torch.cat((out, self.pool(spx[self.nums])), 1)
-
-        out = self.conv3(out)
-
-        if self.shortcut is not None:
-            shortcut = self.shortcut(x)
-        else:
-            shortcut = x
-
-        out += shortcut
-        out = F.relu_(out)
-        return out
-
-
-class DeformBottleneckBlock(ResNetBlockBase):
-    """
-    Not implemented for res2net yet.
-    Similar to :class:`BottleneckBlock`, but with deformable conv in the 3x3 convolution.
-    """
-
-    def __init__(
-        self,
-        in_channels,
-        out_channels,
-        *,
-        bottleneck_channels,
-        stride=1,
-        num_groups=1,
-        norm="BN",
-        stride_in_1x1=False,
-        dilation=1,
-        deform_modulated=False,
-        deform_num_groups=1,
-        basewidth=26,
-        scale=4,
-    ):
-        super().__init__(in_channels, out_channels, stride)
-        self.deform_modulated = deform_modulated
-
-        if in_channels != out_channels:
-            # self.shortcut = Conv2d(
-            #     in_channels,
-            #     out_channels,
-            #     kernel_size=1,
-            #     stride=stride,
-            #     bias=False,
-            #     norm=get_norm(norm, out_channels),
-            # )
-            self.shortcut = nn.Sequential(
-                nn.AvgPool2d(
-                    kernel_size=stride,
-                    stride=stride,
-                    ceil_mode=True,
-                    count_include_pad=False,
-                ),
-                Conv2d(
-                    in_channels,
-                    out_channels,
-                    kernel_size=1,
-                    stride=1,
-                    bias=False,
-                    norm=get_norm(norm, out_channels),
-                ),
-            )
-        else:
-            self.shortcut = None
-
-        stride_1x1, stride_3x3 = (stride, 1) if stride_in_1x1 else (1, stride)
-        width = bottleneck_channels // scale
-
-        self.conv1 = Conv2d(
-            in_channels,
-            bottleneck_channels,
-            kernel_size=1,
-            stride=stride_1x1,
-            bias=False,
-            norm=get_norm(norm, bottleneck_channels),
-        )
-
-        if scale == 1:
-            self.nums = 1
-        else:
-            self.nums = scale - 1
-        if self.in_channels != self.out_channels and stride_3x3 != 2:
-            self.pool = nn.AvgPool2d(kernel_size=3, stride=stride_3x3, padding=1)
-
-        if deform_modulated:
-            deform_conv_op = ModulatedDeformConv
-            # offset channels are 2 or 3 (if with modulated) * kernel_size * kernel_size
-            offset_channels = 27
-        else:
-            deform_conv_op = DeformConv
-            offset_channels = 18
-
-        # self.conv2_offset = Conv2d(
-        #     bottleneck_channels,
-        #     offset_channels * deform_num_groups,
-        #     kernel_size=3,
-        #     stride=stride_3x3,
-        #     padding=1 * dilation,
-        #     dilation=dilation,
-        # )
-        # self.conv2 = deform_conv_op(
-        #     bottleneck_channels,
-        #     bottleneck_channels,
-        #     kernel_size=3,
-        #     stride=stride_3x3,
-        #     padding=1 * dilation,
-        #     bias=False,
-        #     groups=num_groups,
-        #     dilation=dilation,
-        #     deformable_groups=deform_num_groups,
-        #     norm=get_norm(norm, bottleneck_channels),
-        # )
-
-        conv2_offsets = []
-        convs = []
-        bns = []
-        for i in range(self.nums):
-            conv2_offsets.append(
-                Conv2d(
-                    width,
-                    offset_channels * deform_num_groups,
-                    kernel_size=3,
-                    stride=stride_3x3,
-                    padding=1 * dilation,
-                    bias=False,
-                    groups=num_groups,
-                    dilation=dilation,
-                )
-            )
-            convs.append(
-                deform_conv_op(
-                    width,
-                    width,
-                    kernel_size=3,
-                    stride=stride_3x3,
-                    padding=1 * dilation,
-                    bias=False,
-                    groups=num_groups,
-                    dilation=dilation,
-                    deformable_groups=deform_num_groups,
-                )
-            )
-            bns.append(get_norm(norm, width))
-        self.conv2_offsets = nn.ModuleList(conv2_offsets)
-        self.convs = nn.ModuleList(convs)
-        self.bns = nn.ModuleList(bns)
-
-        self.conv3 = Conv2d(
-            bottleneck_channels,
-            out_channels,
-            kernel_size=1,
-            bias=False,
-            norm=get_norm(norm, out_channels),
-        )
-        self.scale = scale
-        self.width = width
-        self.in_channels = in_channels
-        self.out_channels = out_channels
-        self.stride_3x3 = stride_3x3
-        # for layer in [self.conv1, self.conv2, self.conv3, self.shortcut]:
-        #     if layer is not None:  # shortcut can be None
-        #         weight_init.c2_msra_fill(layer)
-
-        # nn.init.constant_(self.conv2_offset.weight, 0)
-        # nn.init.constant_(self.conv2_offset.bias, 0)
-        for layer in [self.conv1, self.conv3]:
-            if layer is not None:  # shortcut can be None
-                weight_init.c2_msra_fill(layer)
-        if self.shortcut is not None:
-            for layer in self.shortcut.modules():
-                if isinstance(layer, Conv2d):
-                    weight_init.c2_msra_fill(layer)
-
-        for layer in self.convs:
-            if layer is not None:  # shortcut can be None
-                weight_init.c2_msra_fill(layer)
-
-        for layer in self.conv2_offsets:
-            if layer.weight is not None:
-                nn.init.constant_(layer.weight, 0)
-            if layer.bias is not None:
-                nn.init.constant_(layer.bias, 0)
-
-    def forward(self, x):
-        out = self.conv1(x)
-        out = F.relu_(out)
-
-        # if self.deform_modulated:
-        #     offset_mask = self.conv2_offset(out)
-        #     offset_x, offset_y, mask = torch.chunk(offset_mask, 3, dim=1)
-        #     offset = torch.cat((offset_x, offset_y), dim=1)
-        #     mask = mask.sigmoid()
-        #     out = self.conv2(out, offset, mask)
-        # else:
-        #     offset = self.conv2_offset(out)
-        #     out = self.conv2(out, offset)
-        # out = F.relu_(out)
-
-        spx = torch.split(out, self.width, 1)
-        for i in range(self.nums):
-            if i == 0 or self.in_channels != self.out_channels:
-                sp = spx[i].contiguous()
-            else:
-                sp = sp + spx[i].contiguous()
-
-            # sp = self.convs[i](sp)
-            if self.deform_modulated:
-                offset_mask = self.conv2_offsets[i](sp)
-                offset_x, offset_y, mask = torch.chunk(offset_mask, 3, dim=1)
-                offset = torch.cat((offset_x, offset_y), dim=1)
-                mask = mask.sigmoid()
-                sp = self.convs[i](sp, offset, mask)
-            else:
-                offset = self.conv2_offsets[i](sp)
-                sp = self.convs[i](sp, offset)
-            sp = F.relu_(self.bns[i](sp))
-            if i == 0:
-                out = sp
-            else:
-                out = torch.cat((out, sp), 1)
-        if self.scale != 1 and self.stride_3x3 == 1:
-            out = torch.cat((out, spx[self.nums]), 1)
-        elif self.scale != 1 and self.stride_3x3 == 2:
-            out = torch.cat((out, self.pool(spx[self.nums])), 1)
-
-        out = self.conv3(out)
-
-        if self.shortcut is not None:
-            shortcut = self.shortcut(x)
-        else:
-            shortcut = x
-
-        out += shortcut
-        out = F.relu_(out)
-        return out
-
-
-def make_stage(
-    block_class, num_blocks, first_stride, *, in_channels, out_channels, **kwargs
-):
-    """
-    Create a list of blocks just like those in a ResNet stage.
-    Args:
-        block_class (type): a subclass of ResNetBlockBase
-        num_blocks (int):
-        first_stride (int): the stride of the first block. The other blocks will have stride=1.
-        in_channels (int): input channels of the entire stage.
-        out_channels (int): output channels of **every block** in the stage.
-        kwargs: other arguments passed to the constructor of every block.
-    Returns:
-        list[nn.Module]: a list of block module.
-    """
-    assert "stride" not in kwargs, "Stride of blocks in make_stage cannot be changed."
-    blocks = []
-    for i in range(num_blocks):
-        blocks.append(
-            block_class(
-                in_channels=in_channels,
-                out_channels=out_channels,
-                stride=first_stride if i == 0 else 1,
-                **kwargs,
-            )
-        )
-        in_channels = out_channels
-    return blocks
-
-
-class BasicStem(CNNBlockBase):
-    """
-    The standard ResNet stem (layers before the first residual block).
-    """
-
-    def __init__(self, in_channels=3, out_channels=64, norm="BN"):
-        """
-        Args:
-            norm (str or callable): norm after the first conv layer.
-                See :func:`layers.get_norm` for supported format.
-        """
-        super().__init__(in_channels, out_channels, 4)
-        self.in_channels = in_channels
-        self.conv1 = nn.Sequential(
-            Conv2d(
-                in_channels,
-                32,
-                kernel_size=3,
-                stride=2,
-                padding=1,
-                bias=False,
-            ),
-            get_norm(norm, 32),
-            nn.ReLU(inplace=True),
-            Conv2d(
-                32,
-                32,
-                kernel_size=3,
-                stride=1,
-                padding=1,
-                bias=False,
-            ),
-            get_norm(norm, 32),
-            nn.ReLU(inplace=True),
-            Conv2d(
-                32,
-                out_channels,
-                kernel_size=3,
-                stride=1,
-                padding=1,
-                bias=False,
-            ),
-        )
-        self.bn1 = get_norm(norm, out_channels)
-
-        for layer in self.conv1:
-            if isinstance(layer, Conv2d):
-                weight_init.c2_msra_fill(layer)
-
-    def forward(self, x):
-        x = self.conv1(x)
-        x = self.bn1(x)
-        x = F.relu_(x)
-        x = F.max_pool2d(x, kernel_size=3, stride=2, padding=1)
-        return x
-
-
-class ResNet(Backbone):
-    def __init__(self, stem, stages, num_classes=None, out_features=None):
-        """
-        Args:
-            stem (nn.Module): a stem module
-            stages (list[list[CNNBlockBase]]): several (typically 4) stages,
-                each contains multiple :class:`CNNBlockBase`.
-            num_classes (None or int): if None, will not perform classification.
-                Otherwise, will create a linear layer.
-            out_features (list[str]): name of the layers whose outputs should
-                be returned in forward. Can be anything in "stem", "linear", or "res2" ...
-                If None, will return the output of the last layer.
-        """
-        super(ResNet, self).__init__()
-        self.stem = stem
-        self.num_classes = num_classes
-
-        current_stride = self.stem.stride
-        self._out_feature_strides = {"stem": current_stride}
-        self._out_feature_channels = {"stem": self.stem.out_channels}
-
-        self.stages_and_names = []
-        for i, blocks in enumerate(stages):
-            assert len(blocks) > 0, len(blocks)
-            for block in blocks:
-                assert isinstance(block, CNNBlockBase), block
-
-            name = "res" + str(i + 2)
-            stage = nn.Sequential(*blocks)
-
-            self.add_module(name, stage)
-            self.stages_and_names.append((stage, name))
-
-            self._out_feature_strides[name] = current_stride = int(
-                current_stride * np.prod([k.stride for k in blocks])
-            )
-            self._out_feature_channels[name] = curr_channels = blocks[-1].out_channels
-
-        if num_classes is not None:
-            self.avgpool = nn.AdaptiveAvgPool2d((1, 1))
-            self.linear = nn.Linear(curr_channels, num_classes)
-
-            # Sec 5.1 in "Accurate, Large Minibatch SGD: Training ImageNet in 1 Hour":
-            # "The 1000-way fully-connected layer is initialized by
-            # drawing weights from a zero-mean Gaussian with standard deviation of 0.01."
-            nn.init.normal_(self.linear.weight, std=0.01)
-            name = "linear"
-
-        if out_features is None:
-            out_features = [name]
-        self._out_features = out_features
-        assert len(self._out_features)
-        children = [x[0] for x in self.named_children()]
-        for out_feature in self._out_features:
-            assert out_feature in children, "Available children: {}".format(
-                ", ".join(children)
-            )
-
-    def forward(self, x):
-        outputs = {}
-        x = self.stem(x)
-        if "stem" in self._out_features:
-            outputs["stem"] = x
-        for stage, name in self.stages_and_names:
-            x = stage(x)
-            if name in self._out_features:
-                outputs[name] = x
-        if self.num_classes is not None:
-            x = self.avgpool(x)
-            x = torch.flatten(x, 1)
-            x = self.linear(x)
-            if "linear" in self._out_features:
-                outputs["linear"] = x
-        return outputs
-
-    def output_shape(self):
-        return {
-            name: ShapeSpec(
-                channels=self._out_feature_channels[name],
-                stride=self._out_feature_strides[name],
-            )
-            for name in self._out_features
-        }
-
-    def freeze(self, freeze_at=0):
-        """
-        Freeze the first several stages of the ResNet. Commonly used in
-        fine-tuning.
-        Args:
-            freeze_at (int): number of stem and stages to freeze.
-                `1` means freezing the stem. `2` means freezing the stem and
-                the first stage, etc.
-        Returns:
-            nn.Module: this ResNet itself
-        """
-        if freeze_at >= 1:
-            self.stem.freeze()
-        for idx, (stage, _) in enumerate(self.stages_and_names, start=2):
-            if freeze_at >= idx:
-                for block in stage.children():
-                    block.freeze()
-        return self
-
-
-@BACKBONE_REGISTRY.register()
-def build_res2net_backbone(cfg, input_shape):
-    """
-    Create a Res2Net instance from config.
-    Returns:
-        ResNet: a :class:`ResNet` instance.
-    """
-    # need registration of new blocks/stems?
-    norm = cfg.MODEL.RESNETS.NORM
-    stem = BasicStem(
-        in_channels=input_shape.channels,
-        out_channels=cfg.MODEL.RESNETS.STEM_OUT_CHANNELS,
-        norm=norm,
-    )
-
-    # fmt: off
-    freeze_at           = cfg.MODEL.BACKBONE.FREEZE_AT
-    out_features        = cfg.MODEL.RESNETS.OUT_FEATURES
-    depth               = cfg.MODEL.RESNETS.DEPTH
-    num_groups          = cfg.MODEL.RESNETS.NUM_GROUPS
-    width_per_group     = cfg.MODEL.RESNETS.WIDTH_PER_GROUP
-    scale              = 4
-    bottleneck_channels = num_groups * width_per_group * scale
-    in_channels         = cfg.MODEL.RESNETS.STEM_OUT_CHANNELS
-    out_channels        = cfg.MODEL.RESNETS.RES2_OUT_CHANNELS
-    stride_in_1x1       = cfg.MODEL.RESNETS.STRIDE_IN_1X1
-    res5_dilation       = cfg.MODEL.RESNETS.RES5_DILATION
-    deform_on_per_stage = cfg.MODEL.RESNETS.DEFORM_ON_PER_STAGE
-    deform_modulated    = cfg.MODEL.RESNETS.DEFORM_MODULATED
-    deform_num_groups   = cfg.MODEL.RESNETS.DEFORM_NUM_GROUPS
-    # fmt: on
-    assert res5_dilation in {1, 2}, "res5_dilation cannot be {}.".format(res5_dilation)
-
-    num_blocks_per_stage = {
-        18: [2, 2, 2, 2],
-        34: [3, 4, 6, 3],
-        50: [3, 4, 6, 3],
-        101: [3, 4, 23, 3],
-        152: [3, 8, 36, 3],
-    }[depth]
-
-    if depth in [18, 34]:
-        assert (
-            out_channels == 64
-        ), "Must set MODEL.RESNETS.RES2_OUT_CHANNELS = 64 for R18/R34"
-        assert not any(
-            deform_on_per_stage
-        ), "MODEL.RESNETS.DEFORM_ON_PER_STAGE unsupported for R18/R34"
-        assert (
-            res5_dilation == 1
-        ), "Must set MODEL.RESNETS.RES5_DILATION = 1 for R18/R34"
-        assert num_groups == 1, "Must set MODEL.RESNETS.NUM_GROUPS = 1 for R18/R34"
-
-    stages = []
-
-    # Avoid creating variables without gradients
-    # It consumes extra memory and may cause allreduce to fail
-    out_stage_idx = [
-        {"res2": 2, "res3": 3, "res4": 4, "res5": 5}[f] for f in out_features
-    ]
-    max_stage_idx = max(out_stage_idx)
-    for idx, stage_idx in enumerate(range(2, max_stage_idx + 1)):
-        dilation = res5_dilation if stage_idx == 5 else 1
-        first_stride = 1 if idx == 0 or (stage_idx == 5 and dilation == 2) else 2
-        stage_kargs = {
-            "num_blocks": num_blocks_per_stage[idx],
-            "first_stride": first_stride,
-            "in_channels": in_channels,
-            "out_channels": out_channels,
-            "norm": norm,
-        }
-        # Use BasicBlock for R18 and R34.
-        if depth in [18, 34]:
-            stage_kargs["block_class"] = BasicBlock
-        else:
-            stage_kargs["bottleneck_channels"] = bottleneck_channels
-            stage_kargs["stride_in_1x1"] = stride_in_1x1
-            stage_kargs["dilation"] = dilation
-            stage_kargs["num_groups"] = num_groups
-            stage_kargs["scale"] = scale
-
-            if deform_on_per_stage[idx]:
-                stage_kargs["block_class"] = DeformBottleneckBlock
-                stage_kargs["deform_modulated"] = deform_modulated
-                stage_kargs["deform_num_groups"] = deform_num_groups
-            else:
-                stage_kargs["block_class"] = BottleneckBlock
-        blocks = make_stage(**stage_kargs)
-        in_channels = out_channels
-        out_channels *= 2
-        bottleneck_channels *= 2
-        stages.append(blocks)
-    return ResNet(stem, stages, out_features=out_features).freeze(freeze_at)
-
-
-@BACKBONE_REGISTRY.register()
-def build_p67_res2net_fpn_backbone(cfg, input_shape: ShapeSpec):
-    """
-    Args:
-        cfg: a detectron2 CfgNode
-
-    Returns:
-        backbone (Backbone): backbone module, must be a subclass of :class:`Backbone`.
-    """
-    bottom_up = build_res2net_backbone(cfg, input_shape)
-    in_features = cfg.MODEL.FPN.IN_FEATURES
-    out_channels = cfg.MODEL.FPN.OUT_CHANNELS
-    backbone = FPN(
-        bottom_up=bottom_up,
-        in_features=in_features,
-        out_channels=out_channels,
-        norm=cfg.MODEL.FPN.NORM,
-        top_block=LastLevelP6P7_P5(out_channels, out_channels),
-        fuse_type=cfg.MODEL.FPN.FUSE_TYPE,
-    )
-    return backbone
-
-
-@BACKBONE_REGISTRY.register()
-def build_res2net_bifpn_backbone(cfg, input_shape: ShapeSpec):
-    """
-    Args:
-        cfg: a detectron2 CfgNode
-
-    Returns:
-        backbone (Backbone): backbone module, must be a subclass of :class:`Backbone`.
-    """
-    bottom_up = build_res2net_backbone(cfg, input_shape)
-    in_features = cfg.MODEL.FPN.IN_FEATURES
-    backbone = BiFPN(
-        cfg=cfg,
-        bottom_up=bottom_up,
-        in_features=in_features,
-        out_channels=cfg.MODEL.BIFPN.OUT_CHANNELS,
-        norm=cfg.MODEL.BIFPN.NORM,
-        num_levels=cfg.MODEL.BIFPN.NUM_LEVELS,
-        num_bifpn=cfg.MODEL.BIFPN.NUM_BIFPN,
-        separable_conv=cfg.MODEL.BIFPN.SEPARABLE_CONV,
-    )
-    return backbone
diff --git a/eval/vbench/third_party/grit_src/centernet2/centernet/modeling/debug.py b/eval/vbench/third_party/grit_src/centernet2/centernet/modeling/debug.py
deleted file mode 100644
index 66b385c5..00000000
--- a/eval/vbench/third_party/grit_src/centernet2/centernet/modeling/debug.py
+++ /dev/null
@@ -1,373 +0,0 @@
-import cv2
-import numpy as np
-import torch
-import torch.nn.functional as F
-
-COLORS = (
-    ((np.random.rand(1300, 3) * 0.4 + 0.6) * 255)
-    .astype(np.uint8)
-    .reshape(1300, 1, 1, 3)
-)
-
-
-def _get_color_image(heatmap):
-    heatmap = heatmap.reshape(heatmap.shape[0], heatmap.shape[1], heatmap.shape[2], 1)
-    if heatmap.shape[0] == 1:
-        color_map = (
-            (heatmap * np.ones((1, 1, 1, 3), np.uint8) * 255)
-            .max(axis=0)
-            .astype(np.uint8)
-        )  # H, W, 3
-    else:
-        color_map = (
-            (heatmap * COLORS[: heatmap.shape[0]]).max(axis=0).astype(np.uint8)
-        )  # H, W, 3
-
-    return color_map
-
-
-def _blend_image(image, color_map, a=0.7):
-    color_map = cv2.resize(color_map, (image.shape[1], image.shape[0]))
-    ret = np.clip(image * (1 - a) + color_map * a, 0, 255).astype(np.uint8)
-    return ret
-
-
-def _blend_image_heatmaps(image, color_maps, a=0.7):
-    merges = np.zeros((image.shape[0], image.shape[1], 3), np.float32)
-    for color_map in color_maps:
-        color_map = cv2.resize(color_map, (image.shape[1], image.shape[0]))
-        merges = np.maximum(merges, color_map)
-    ret = np.clip(image * (1 - a) + merges * a, 0, 255).astype(np.uint8)
-    return ret
-
-
-def _decompose_level(x, shapes_per_level, N):
-    """
-    x: LNHiWi x C
-    """
-    x = x.view(x.shape[0], -1)
-    ret = []
-    st = 0
-    for l in range(len(shapes_per_level)):
-        ret.append([])
-        h = shapes_per_level[l][0].int().item()
-        w = shapes_per_level[l][1].int().item()
-        for i in range(N):
-            ret[l].append(
-                x[st + h * w * i : st + h * w * (i + 1)].view(h, w, -1).permute(2, 0, 1)
-            )
-        st += h * w * N
-    return ret
-
-
-def _imagelist_to_tensor(images):
-    images = [x for x in images]
-    image_sizes = [x.shape[-2:] for x in images]
-    h = max([size[0] for size in image_sizes])
-    w = max([size[1] for size in image_sizes])
-    S = 32
-    h, w = ((h - 1) // S + 1) * S, ((w - 1) // S + 1) * S
-    images = [F.pad(x, (0, w - x.shape[2], 0, h - x.shape[1], 0, 0)) for x in images]
-    images = torch.stack(images)
-    return images
-
-
-def _ind2il(ind, shapes_per_level, N):
-    r = ind
-    l = 0
-    S = 0
-    while r - S >= N * shapes_per_level[l][0] * shapes_per_level[l][1]:
-        S += N * shapes_per_level[l][0] * shapes_per_level[l][1]
-        l += 1
-    i = (r - S) // (shapes_per_level[l][0] * shapes_per_level[l][1])
-    return i, l
-
-
-def debug_train(
-    images,
-    gt_instances,
-    flattened_hms,
-    reg_targets,
-    labels,
-    pos_inds,
-    shapes_per_level,
-    locations,
-    strides,
-):
-    """
-    images: N x 3 x H x W
-    flattened_hms: LNHiWi x C
-    shapes_per_level: L x 2 [(H_i, W_i)]
-    locations: LNHiWi x 2
-    """
-    reg_inds = torch.nonzero(reg_targets.max(dim=1)[0] > 0).squeeze(1)
-    N = len(images)
-    images = _imagelist_to_tensor(images)
-    repeated_locations = [torch.cat([loc] * N, dim=0) for loc in locations]
-    locations = torch.cat(repeated_locations, dim=0)
-    gt_hms = _decompose_level(flattened_hms, shapes_per_level, N)
-    masks = flattened_hms.new_zeros((flattened_hms.shape[0], 1))
-    masks[pos_inds] = 1
-    masks = _decompose_level(masks, shapes_per_level, N)
-    for i in range(len(images)):
-        image = images[i].detach().cpu().numpy().transpose(1, 2, 0)
-        color_maps = []
-        for l in range(len(gt_hms)):
-            color_map = _get_color_image(gt_hms[l][i].detach().cpu().numpy())
-            color_maps.append(color_map)
-            cv2.imshow("gthm_{}".format(l), color_map)
-        blend = _blend_image_heatmaps(image.copy(), color_maps)
-        if gt_instances is not None:
-            bboxes = gt_instances[i].gt_boxes.tensor
-            for j in range(len(bboxes)):
-                bbox = bboxes[j]
-                cv2.rectangle(
-                    blend,
-                    (int(bbox[0]), int(bbox[1])),
-                    (int(bbox[2]), int(bbox[3])),
-                    (0, 0, 255),
-                    3,
-                    cv2.LINE_AA,
-                )
-
-        for j in range(len(pos_inds)):
-            image_id, l = _ind2il(pos_inds[j], shapes_per_level, N)
-            if image_id != i:
-                continue
-            loc = locations[pos_inds[j]]
-            cv2.drawMarker(
-                blend,
-                (int(loc[0]), int(loc[1])),
-                (0, 255, 255),
-                markerSize=(l + 1) * 16,
-            )
-
-        for j in range(len(reg_inds)):
-            image_id, l = _ind2il(reg_inds[j], shapes_per_level, N)
-            if image_id != i:
-                continue
-            ltrb = reg_targets[reg_inds[j]]
-            ltrb *= strides[l]
-            loc = locations[reg_inds[j]]
-            bbox = [
-                (loc[0] - ltrb[0]),
-                (loc[1] - ltrb[1]),
-                (loc[0] + ltrb[2]),
-                (loc[1] + ltrb[3]),
-            ]
-            cv2.rectangle(
-                blend,
-                (int(bbox[0]), int(bbox[1])),
-                (int(bbox[2]), int(bbox[3])),
-                (255, 0, 0),
-                1,
-                cv2.LINE_AA,
-            )
-            cv2.circle(blend, (int(loc[0]), int(loc[1])), 2, (255, 0, 0), -1)
-
-        cv2.imshow("blend", blend)
-        cv2.waitKey()
-
-
-def debug_test(
-    images,
-    logits_pred,
-    reg_pred,
-    agn_hm_pred=[],
-    preds=[],
-    vis_thresh=0.3,
-    debug_show_name=False,
-    mult_agn=False,
-):
-    """
-    images: N x 3 x H x W
-    class_target: LNHiWi x C
-    cat_agn_heatmap: LNHiWi
-    shapes_per_level: L x 2 [(H_i, W_i)]
-    """
-    N = len(images)
-    for i in range(len(images)):
-        image = images[i].detach().cpu().numpy().transpose(1, 2, 0)
-        result = image.copy().astype(np.uint8)
-        pred_image = image.copy().astype(np.uint8)
-        color_maps = []
-        L = len(logits_pred)
-        for l in range(L):
-            if logits_pred[0] is not None:
-                stride = min(image.shape[0], image.shape[1]) / min(
-                    logits_pred[l][i].shape[1], logits_pred[l][i].shape[2]
-                )
-            else:
-                stride = min(image.shape[0], image.shape[1]) / min(
-                    agn_hm_pred[l][i].shape[1], agn_hm_pred[l][i].shape[2]
-                )
-            stride = stride if stride < 60 else 64 if stride < 100 else 128
-            if logits_pred[0] is not None:
-                if mult_agn:
-                    logits_pred[l][i] = logits_pred[l][i] * agn_hm_pred[l][i]
-                color_map = _get_color_image(logits_pred[l][i].detach().cpu().numpy())
-                color_maps.append(color_map)
-                cv2.imshow("predhm_{}".format(l), color_map)
-
-            if debug_show_name:
-                from detectron2.data.datasets.lvis_v1_categories import LVIS_CATEGORIES
-
-                cat2name = [x["name"] for x in LVIS_CATEGORIES]
-            for j in range(len(preds[i].scores) if preds is not None else 0):
-                if preds[i].scores[j] > vis_thresh:
-                    bbox = (
-                        preds[i].proposal_boxes[j]
-                        if preds[i].has("proposal_boxes")
-                        else preds[i].pred_boxes[j]
-                    )
-                    bbox = bbox.tensor[0].detach().cpu().numpy().astype(np.int32)
-                    cat = (
-                        int(preds[i].pred_classes[j])
-                        if preds[i].has("pred_classes")
-                        else 0
-                    )
-                    cl = COLORS[cat, 0, 0]
-                    cv2.rectangle(
-                        pred_image,
-                        (int(bbox[0]), int(bbox[1])),
-                        (int(bbox[2]), int(bbox[3])),
-                        (int(cl[0]), int(cl[1]), int(cl[2])),
-                        2,
-                        cv2.LINE_AA,
-                    )
-                    if debug_show_name:
-                        txt = "{}{:.1f}".format(
-                            cat2name[cat] if cat > 0 else "", preds[i].scores[j]
-                        )
-                        font = cv2.FONT_HERSHEY_SIMPLEX
-                        cat_size = cv2.getTextSize(txt, font, 0.5, 2)[0]
-                        cv2.rectangle(
-                            pred_image,
-                            (int(bbox[0]), int(bbox[1] - cat_size[1] - 2)),
-                            (int(bbox[0] + cat_size[0]), int(bbox[1] - 2)),
-                            (int(cl[0]), int(cl[1]), int(cl[2])),
-                            -1,
-                        )
-                        cv2.putText(
-                            pred_image,
-                            txt,
-                            (int(bbox[0]), int(bbox[1] - 2)),
-                            font,
-                            0.5,
-                            (0, 0, 0),
-                            thickness=1,
-                            lineType=cv2.LINE_AA,
-                        )
-
-            if agn_hm_pred[l] is not None:
-                agn_hm_ = agn_hm_pred[l][i, 0, :, :, None].detach().cpu().numpy()
-                agn_hm_ = (agn_hm_ * np.array([255, 255, 255]).reshape(1, 1, 3)).astype(
-                    np.uint8
-                )
-                cv2.imshow("agn_hm_{}".format(l), agn_hm_)
-        blend = _blend_image_heatmaps(image.copy(), color_maps)
-        cv2.imshow("blend", blend)
-        cv2.imshow("preds", pred_image)
-        cv2.waitKey()
-
-
-global cnt
-cnt = 0
-
-
-def debug_second_stage(
-    images,
-    instances,
-    proposals=None,
-    vis_thresh=0.3,
-    save_debug=False,
-    debug_show_name=False,
-):
-    images = _imagelist_to_tensor(images)
-    if debug_show_name:
-        from detectron2.data.datasets.lvis_v1_categories import LVIS_CATEGORIES
-
-        cat2name = [x["name"] for x in LVIS_CATEGORIES]
-    for i in range(len(images)):
-        image = (
-            images[i].detach().cpu().numpy().transpose(1, 2, 0).astype(np.uint8).copy()
-        )
-        if instances[i].has("gt_boxes"):
-            bboxes = instances[i].gt_boxes.tensor.cpu().numpy()
-            scores = np.ones(bboxes.shape[0])
-            cats = instances[i].gt_classes.cpu().numpy()
-        else:
-            bboxes = instances[i].pred_boxes.tensor.cpu().numpy()
-            scores = instances[i].scores.cpu().numpy()
-            cats = instances[i].pred_classes.cpu().numpy()
-        for j in range(len(bboxes)):
-            if scores[j] > vis_thresh:
-                bbox = bboxes[j]
-                cl = COLORS[cats[j], 0, 0]
-                cl = (int(cl[0]), int(cl[1]), int(cl[2]))
-                cv2.rectangle(
-                    image,
-                    (int(bbox[0]), int(bbox[1])),
-                    (int(bbox[2]), int(bbox[3])),
-                    cl,
-                    2,
-                    cv2.LINE_AA,
-                )
-                if debug_show_name:
-                    cat = cats[j]
-                    txt = "{}{:.1f}".format(cat2name[cat] if cat > 0 else "", scores[j])
-                    font = cv2.FONT_HERSHEY_SIMPLEX
-                    cat_size = cv2.getTextSize(txt, font, 0.5, 2)[0]
-                    cv2.rectangle(
-                        image,
-                        (int(bbox[0]), int(bbox[1] - cat_size[1] - 2)),
-                        (int(bbox[0] + cat_size[0]), int(bbox[1] - 2)),
-                        (int(cl[0]), int(cl[1]), int(cl[2])),
-                        -1,
-                    )
-                    cv2.putText(
-                        image,
-                        txt,
-                        (int(bbox[0]), int(bbox[1] - 2)),
-                        font,
-                        0.5,
-                        (0, 0, 0),
-                        thickness=1,
-                        lineType=cv2.LINE_AA,
-                    )
-        if proposals is not None:
-            proposal_image = (
-                images[i]
-                .detach()
-                .cpu()
-                .numpy()
-                .transpose(1, 2, 0)
-                .astype(np.uint8)
-                .copy()
-            )
-            bboxes = proposals[i].proposal_boxes.tensor.cpu().numpy()
-            if proposals[i].has("scores"):
-                scores = proposals[i].scores.cpu().numpy()
-            else:
-                scores = proposals[i].objectness_logits.sigmoid().cpu().numpy()
-            for j in range(len(bboxes)):
-                if scores[j] > vis_thresh:
-                    bbox = bboxes[j]
-                    cl = (209, 159, 83)
-                    cv2.rectangle(
-                        proposal_image,
-                        (int(bbox[0]), int(bbox[1])),
-                        (int(bbox[2]), int(bbox[3])),
-                        cl,
-                        2,
-                        cv2.LINE_AA,
-                    )
-
-        cv2.imshow("image", image)
-        if proposals is not None:
-            cv2.imshow("proposals", proposal_image)
-            if save_debug:
-                global cnt
-                cnt += 1
-                cv2.imwrite("output/save_debug/{}.jpg".format(cnt), proposal_image)
-        cv2.waitKey()
diff --git a/eval/vbench/third_party/grit_src/centernet2/centernet/modeling/dense_heads/__init__.py b/eval/vbench/third_party/grit_src/centernet2/centernet/modeling/dense_heads/__init__.py
deleted file mode 100644
index e69de29b..00000000
diff --git a/eval/vbench/third_party/grit_src/centernet2/centernet/modeling/dense_heads/centernet.py b/eval/vbench/third_party/grit_src/centernet2/centernet/modeling/dense_heads/centernet.py
deleted file mode 100644
index 65383541..00000000
--- a/eval/vbench/third_party/grit_src/centernet2/centernet/modeling/dense_heads/centernet.py
+++ /dev/null
@@ -1,961 +0,0 @@
-import copy
-import json
-import math
-from typing import Dict, List
-
-import numpy as np
-import torch
-from detectron2.config import configurable
-from detectron2.layers import ShapeSpec, cat
-from detectron2.modeling import detector_postprocess
-from detectron2.modeling.proposal_generator.build import PROPOSAL_GENERATOR_REGISTRY
-from detectron2.structures import Boxes, Instances
-from detectron2.utils.comm import get_world_size
-from torch import nn
-from torch.nn import functional as F
-
-from ..debug import debug_test, debug_train
-from ..layers.heatmap_focal_loss import (
-    binary_heatmap_focal_loss,
-    heatmap_focal_loss_jit,
-)
-from ..layers.iou_loss import IOULoss
-from ..layers.ml_nms import ml_nms
-from .centernet_head import CenterNetHead
-from .utils import _transpose, reduce_sum
-
-__all__ = ["CenterNet"]
-
-INF = 100000000
-
-
-@PROPOSAL_GENERATOR_REGISTRY.register()
-class CenterNet(nn.Module):
-    @configurable
-    def __init__(
-        self,
-        # input_shape: Dict[str, ShapeSpec],
-        in_channels=256,
-        *,
-        num_classes=80,
-        in_features=("p3", "p4", "p5", "p6", "p7"),
-        strides=(8, 16, 32, 64, 128),
-        score_thresh=0.05,
-        hm_min_overlap=0.8,
-        loc_loss_type="giou",
-        min_radius=4,
-        hm_focal_alpha=0.25,
-        hm_focal_beta=4,
-        loss_gamma=2.0,
-        reg_weight=2.0,
-        not_norm_reg=True,
-        with_agn_hm=False,
-        only_proposal=False,
-        as_proposal=False,
-        not_nms=False,
-        pos_weight=1.0,
-        neg_weight=1.0,
-        sigmoid_clamp=1e-4,
-        ignore_high_fp=-1.0,
-        center_nms=False,
-        sizes_of_interest=[[0, 80], [64, 160], [128, 320], [256, 640], [512, 10000000]],
-        more_pos=False,
-        more_pos_thresh=0.2,
-        more_pos_topk=9,
-        pre_nms_topk_train=1000,
-        pre_nms_topk_test=1000,
-        post_nms_topk_train=100,
-        post_nms_topk_test=100,
-        nms_thresh_train=0.6,
-        nms_thresh_test=0.6,
-        no_reduce=False,
-        debug=False,
-        vis_thresh=0.5,
-        pixel_mean=[103.530, 116.280, 123.675],
-        pixel_std=[1.0, 1.0, 1.0],
-        device="cuda",
-        centernet_head=None,
-    ):
-        super().__init__()
-        self.num_classes = num_classes
-        self.in_features = in_features
-        self.strides = strides
-        self.score_thresh = score_thresh
-        self.min_radius = min_radius
-        self.hm_focal_alpha = hm_focal_alpha
-        self.hm_focal_beta = hm_focal_beta
-        self.loss_gamma = loss_gamma
-        self.reg_weight = reg_weight
-        self.not_norm_reg = not_norm_reg
-        self.with_agn_hm = with_agn_hm
-        self.only_proposal = only_proposal
-        self.as_proposal = as_proposal
-        self.not_nms = not_nms
-        self.pos_weight = pos_weight
-        self.neg_weight = neg_weight
-        self.sigmoid_clamp = sigmoid_clamp
-        self.ignore_high_fp = ignore_high_fp
-        self.center_nms = center_nms
-        self.sizes_of_interest = sizes_of_interest
-        self.more_pos = more_pos
-        self.more_pos_thresh = more_pos_thresh
-        self.more_pos_topk = more_pos_topk
-        self.pre_nms_topk_train = pre_nms_topk_train
-        self.pre_nms_topk_test = pre_nms_topk_test
-        self.post_nms_topk_train = post_nms_topk_train
-        self.post_nms_topk_test = post_nms_topk_test
-        self.nms_thresh_train = nms_thresh_train
-        self.nms_thresh_test = nms_thresh_test
-        self.no_reduce = no_reduce
-        self.debug = debug
-        self.vis_thresh = vis_thresh
-        if self.center_nms:
-            self.not_nms = True
-        self.iou_loss = IOULoss(loc_loss_type)
-        assert (not self.only_proposal) or self.with_agn_hm
-        # delta for rendering heatmap
-        self.delta = (1 - hm_min_overlap) / (1 + hm_min_overlap)
-        if centernet_head is None:
-            self.centernet_head = CenterNetHead(
-                in_channels=in_channels,
-                num_levels=len(in_features),
-                with_agn_hm=with_agn_hm,
-                only_proposal=only_proposal,
-            )
-        else:
-            self.centernet_head = centernet_head
-        if self.debug:
-            pixel_mean = torch.Tensor(pixel_mean).to(torch.device(device)).view(3, 1, 1)
-            pixel_std = torch.Tensor(pixel_std).to(torch.device(device)).view(3, 1, 1)
-            self.denormalizer = lambda x: x * pixel_std + pixel_mean
-
-    @classmethod
-    def from_config(cls, cfg, input_shape):
-        ret = {
-            # 'input_shape': input_shape,
-            "in_channels": input_shape[cfg.MODEL.CENTERNET.IN_FEATURES[0]].channels,
-            "num_classes": cfg.MODEL.CENTERNET.NUM_CLASSES,
-            "in_features": cfg.MODEL.CENTERNET.IN_FEATURES,
-            "strides": cfg.MODEL.CENTERNET.FPN_STRIDES,
-            "score_thresh": cfg.MODEL.CENTERNET.INFERENCE_TH,
-            "loc_loss_type": cfg.MODEL.CENTERNET.LOC_LOSS_TYPE,
-            "hm_min_overlap": cfg.MODEL.CENTERNET.HM_MIN_OVERLAP,
-            "min_radius": cfg.MODEL.CENTERNET.MIN_RADIUS,
-            "hm_focal_alpha": cfg.MODEL.CENTERNET.HM_FOCAL_ALPHA,
-            "hm_focal_beta": cfg.MODEL.CENTERNET.HM_FOCAL_BETA,
-            "loss_gamma": cfg.MODEL.CENTERNET.LOSS_GAMMA,
-            "reg_weight": cfg.MODEL.CENTERNET.REG_WEIGHT,
-            "not_norm_reg": cfg.MODEL.CENTERNET.NOT_NORM_REG,
-            "with_agn_hm": cfg.MODEL.CENTERNET.WITH_AGN_HM,
-            "only_proposal": cfg.MODEL.CENTERNET.ONLY_PROPOSAL,
-            "as_proposal": cfg.MODEL.CENTERNET.AS_PROPOSAL,
-            "not_nms": cfg.MODEL.CENTERNET.NOT_NMS,
-            "pos_weight": cfg.MODEL.CENTERNET.POS_WEIGHT,
-            "neg_weight": cfg.MODEL.CENTERNET.NEG_WEIGHT,
-            "sigmoid_clamp": cfg.MODEL.CENTERNET.SIGMOID_CLAMP,
-            "ignore_high_fp": cfg.MODEL.CENTERNET.IGNORE_HIGH_FP,
-            "center_nms": cfg.MODEL.CENTERNET.CENTER_NMS,
-            "sizes_of_interest": cfg.MODEL.CENTERNET.SOI,
-            "more_pos": cfg.MODEL.CENTERNET.MORE_POS,
-            "more_pos_thresh": cfg.MODEL.CENTERNET.MORE_POS_THRESH,
-            "more_pos_topk": cfg.MODEL.CENTERNET.MORE_POS_TOPK,
-            "pre_nms_topk_train": cfg.MODEL.CENTERNET.PRE_NMS_TOPK_TRAIN,
-            "pre_nms_topk_test": cfg.MODEL.CENTERNET.PRE_NMS_TOPK_TEST,
-            "post_nms_topk_train": cfg.MODEL.CENTERNET.POST_NMS_TOPK_TRAIN,
-            "post_nms_topk_test": cfg.MODEL.CENTERNET.POST_NMS_TOPK_TEST,
-            "nms_thresh_train": cfg.MODEL.CENTERNET.NMS_TH_TRAIN,
-            "nms_thresh_test": cfg.MODEL.CENTERNET.NMS_TH_TEST,
-            "no_reduce": cfg.MODEL.CENTERNET.NO_REDUCE,
-            "debug": cfg.DEBUG,
-            "vis_thresh": cfg.VIS_THRESH,
-            "pixel_mean": cfg.MODEL.PIXEL_MEAN,
-            "pixel_std": cfg.MODEL.PIXEL_STD,
-            "device": cfg.MODEL.DEVICE,
-            "centernet_head": CenterNetHead(
-                cfg, [input_shape[f] for f in cfg.MODEL.CENTERNET.IN_FEATURES]
-            ),
-        }
-        return ret
-
-    def forward(self, images, features_dict, gt_instances):
-        features = [features_dict[f] for f in self.in_features]
-        clss_per_level, reg_pred_per_level, agn_hm_pred_per_level = self.centernet_head(
-            features
-        )
-        grids = self.compute_grids(features)
-        shapes_per_level = grids[0].new_tensor(
-            [(x.shape[2], x.shape[3]) for x in reg_pred_per_level]
-        )
-
-        if not self.training:
-            return self.inference(
-                images, clss_per_level, reg_pred_per_level, agn_hm_pred_per_level, grids
-            )
-        else:
-            pos_inds, labels, reg_targets, flattened_hms = self._get_ground_truth(
-                grids, shapes_per_level, gt_instances
-            )
-            # logits_pred: M x F, reg_pred: M x 4, agn_hm_pred: M
-            logits_pred, reg_pred, agn_hm_pred = self._flatten_outputs(
-                clss_per_level, reg_pred_per_level, agn_hm_pred_per_level
-            )
-
-            if self.more_pos:
-                # add more pixels as positive if \
-                #   1. they are within the center3x3 region of an object
-                #   2. their regression losses are small (<self.more_pos_thresh)
-                pos_inds, labels = self._add_more_pos(
-                    reg_pred, gt_instances, shapes_per_level
-                )
-
-            losses = self.losses(
-                pos_inds,
-                labels,
-                reg_targets,
-                flattened_hms,
-                logits_pred,
-                reg_pred,
-                agn_hm_pred,
-            )
-
-            proposals = None
-            if self.only_proposal:
-                agn_hm_pred_per_level = [x.sigmoid() for x in agn_hm_pred_per_level]
-                proposals = self.predict_instances(
-                    grids,
-                    agn_hm_pred_per_level,
-                    reg_pred_per_level,
-                    images.image_sizes,
-                    [None for _ in agn_hm_pred_per_level],
-                )
-            elif self.as_proposal:  # category specific bbox as agnostic proposals
-                clss_per_level = [x.sigmoid() for x in clss_per_level]
-                proposals = self.predict_instances(
-                    grids,
-                    clss_per_level,
-                    reg_pred_per_level,
-                    images.image_sizes,
-                    agn_hm_pred_per_level,
-                )
-            if self.only_proposal or self.as_proposal:
-                for p in range(len(proposals)):
-                    proposals[p].proposal_boxes = proposals[p].get("pred_boxes")
-                    proposals[p].objectness_logits = proposals[p].get("scores")
-                    proposals[p].remove("pred_boxes")
-                    proposals[p].remove("scores")
-                    proposals[p].remove("pred_classes")
-
-            if self.debug:
-                debug_train(
-                    [self.denormalizer(x) for x in images],
-                    gt_instances,
-                    flattened_hms,
-                    reg_targets,
-                    labels,
-                    pos_inds,
-                    shapes_per_level,
-                    grids,
-                    self.strides,
-                )
-            return proposals, losses
-
-    def losses(
-        self,
-        pos_inds,
-        labels,
-        reg_targets,
-        flattened_hms,
-        logits_pred,
-        reg_pred,
-        agn_hm_pred,
-    ):
-        """
-        Inputs:
-            pos_inds: N
-            labels: N
-            reg_targets: M x 4
-            flattened_hms: M x C
-            logits_pred: M x C
-            reg_pred: M x 4
-            agn_hm_pred: M x 1 or None
-            N: number of positive locations in all images
-            M: number of pixels from all FPN levels
-            C: number of classes
-        """
-        assert torch.isfinite(reg_pred).all().item()
-        num_pos_local = pos_inds.numel()
-        num_gpus = get_world_size()
-        if self.no_reduce:
-            total_num_pos = num_pos_local * num_gpus
-        else:
-            total_num_pos = reduce_sum(pos_inds.new_tensor([num_pos_local])).item()
-        num_pos_avg = max(total_num_pos / num_gpus, 1.0)
-
-        losses = {}
-        if not self.only_proposal:
-            pos_loss, neg_loss = heatmap_focal_loss_jit(
-                logits_pred,
-                flattened_hms,
-                pos_inds,
-                labels,
-                alpha=self.hm_focal_alpha,
-                beta=self.hm_focal_beta,
-                gamma=self.loss_gamma,
-                reduction="sum",
-                sigmoid_clamp=self.sigmoid_clamp,
-                ignore_high_fp=self.ignore_high_fp,
-            )
-            pos_loss = self.pos_weight * pos_loss / num_pos_avg
-            neg_loss = self.neg_weight * neg_loss / num_pos_avg
-            losses["loss_centernet_pos"] = pos_loss
-            losses["loss_centernet_neg"] = neg_loss
-
-        reg_inds = torch.nonzero(reg_targets.max(dim=1)[0] >= 0).squeeze(1)
-        reg_pred = reg_pred[reg_inds]
-        reg_targets_pos = reg_targets[reg_inds]
-        reg_weight_map = flattened_hms.max(dim=1)[0]
-        reg_weight_map = reg_weight_map[reg_inds]
-        reg_weight_map = reg_weight_map * 0 + 1 if self.not_norm_reg else reg_weight_map
-        if self.no_reduce:
-            reg_norm = max(reg_weight_map.sum(), 1)
-        else:
-            reg_norm = max(reduce_sum(reg_weight_map.sum()).item() / num_gpus, 1)
-
-        reg_loss = (
-            self.reg_weight
-            * self.iou_loss(reg_pred, reg_targets_pos, reg_weight_map, reduction="sum")
-            / reg_norm
-        )
-        losses["loss_centernet_loc"] = reg_loss
-
-        if self.with_agn_hm:
-            cat_agn_heatmap = flattened_hms.max(dim=1)[0]  # M
-            agn_pos_loss, agn_neg_loss = binary_heatmap_focal_loss(
-                agn_hm_pred,
-                cat_agn_heatmap,
-                pos_inds,
-                alpha=self.hm_focal_alpha,
-                beta=self.hm_focal_beta,
-                gamma=self.loss_gamma,
-                sigmoid_clamp=self.sigmoid_clamp,
-                ignore_high_fp=self.ignore_high_fp,
-            )
-            agn_pos_loss = self.pos_weight * agn_pos_loss / num_pos_avg
-            agn_neg_loss = self.neg_weight * agn_neg_loss / num_pos_avg
-            losses["loss_centernet_agn_pos"] = agn_pos_loss
-            losses["loss_centernet_agn_neg"] = agn_neg_loss
-
-        if self.debug:
-            print("losses", losses)
-            print("total_num_pos", total_num_pos)
-        return losses
-
-    def compute_grids(self, features):
-        grids = []
-        for level, feature in enumerate(features):
-            h, w = feature.size()[-2:]
-            shifts_x = torch.arange(
-                0,
-                w * self.strides[level],
-                step=self.strides[level],
-                dtype=torch.float32,
-                device=feature.device,
-            )
-            shifts_y = torch.arange(
-                0,
-                h * self.strides[level],
-                step=self.strides[level],
-                dtype=torch.float32,
-                device=feature.device,
-            )
-            shift_y, shift_x = torch.meshgrid(shifts_y, shifts_x)
-            shift_x = shift_x.reshape(-1)
-            shift_y = shift_y.reshape(-1)
-            grids_per_level = (
-                torch.stack((shift_x, shift_y), dim=1) + self.strides[level] // 2
-            )
-            grids.append(grids_per_level)
-        return grids
-
-    def _get_ground_truth(self, grids, shapes_per_level, gt_instances):
-        """
-        Input:
-            grids: list of tensors [(hl x wl, 2)]_l
-            shapes_per_level: list of tuples L x 2:
-            gt_instances: gt instances
-        Retuen:
-            pos_inds: N
-            labels: N
-            reg_targets: M x 4
-            flattened_hms: M x C or M x 1
-            N: number of objects in all images
-            M: number of pixels from all FPN levels
-        """
-
-        # get positive pixel index
-        if not self.more_pos:
-            pos_inds, labels = self._get_label_inds(gt_instances, shapes_per_level)
-        else:
-            pos_inds, labels = None, None
-        heatmap_channels = self.num_classes
-        L = len(grids)
-        num_loc_list = [len(loc) for loc in grids]
-        strides = torch.cat(
-            [
-                shapes_per_level.new_ones(num_loc_list[l]) * self.strides[l]
-                for l in range(L)
-            ]
-        ).float()  # M
-        reg_size_ranges = torch.cat(
-            [
-                shapes_per_level.new_tensor(self.sizes_of_interest[l])
-                .float()
-                .view(1, 2)
-                .expand(num_loc_list[l], 2)
-                for l in range(L)
-            ]
-        )  # M x 2
-        grids = torch.cat(grids, dim=0)  # M x 2
-        M = grids.shape[0]
-
-        reg_targets = []
-        flattened_hms = []
-        for i in range(len(gt_instances)):  # images
-            boxes = gt_instances[i].gt_boxes.tensor  # N x 4
-            area = gt_instances[i].gt_boxes.area()  # N
-            gt_classes = gt_instances[i].gt_classes  # N in [0, self.num_classes]
-
-            N = boxes.shape[0]
-            if N == 0:
-                reg_targets.append(grids.new_zeros((M, 4)) - INF)
-                flattened_hms.append(
-                    grids.new_zeros((M, 1 if self.only_proposal else heatmap_channels))
-                )
-                continue
-
-            l = grids[:, 0].view(M, 1) - boxes[:, 0].view(1, N)  # M x N
-            t = grids[:, 1].view(M, 1) - boxes[:, 1].view(1, N)  # M x N
-            r = boxes[:, 2].view(1, N) - grids[:, 0].view(M, 1)  # M x N
-            b = boxes[:, 3].view(1, N) - grids[:, 1].view(M, 1)  # M x N
-            reg_target = torch.stack([l, t, r, b], dim=2)  # M x N x 4
-
-            centers = (boxes[:, [0, 1]] + boxes[:, [2, 3]]) / 2  # N x 2
-            centers_expanded = centers.view(1, N, 2).expand(M, N, 2)  # M x N x 2
-            strides_expanded = strides.view(M, 1, 1).expand(M, N, 2)
-            centers_discret = (
-                (centers_expanded / strides_expanded).int() * strides_expanded
-            ).float() + strides_expanded / 2  # M x N x 2
-
-            is_peak = (
-                (grids.view(M, 1, 2).expand(M, N, 2) - centers_discret) ** 2
-            ).sum(
-                dim=2
-            ) == 0  # M x N
-            is_in_boxes = reg_target.min(dim=2)[0] > 0  # M x N
-            is_center3x3 = (
-                self.get_center3x3(grids, centers, strides) & is_in_boxes
-            )  # M x N
-            is_cared_in_the_level = self.assign_reg_fpn(
-                reg_target, reg_size_ranges
-            )  # M x N
-            reg_mask = is_center3x3 & is_cared_in_the_level  # M x N
-
-            dist2 = ((grids.view(M, 1, 2).expand(M, N, 2) - centers_expanded) ** 2).sum(
-                dim=2
-            )  # M x N
-            dist2[is_peak] = 0
-            radius2 = self.delta**2 * 2 * area  # N
-            radius2 = torch.clamp(radius2, min=self.min_radius**2)
-            weighted_dist2 = dist2 / radius2.view(1, N).expand(M, N)  # M x N
-            reg_target = self._get_reg_targets(
-                reg_target, weighted_dist2.clone(), reg_mask, area
-            )  # M x 4
-
-            if self.only_proposal:
-                flattened_hm = self._create_agn_heatmaps_from_dist(
-                    weighted_dist2.clone()
-                )  # M x 1
-            else:
-                flattened_hm = self._create_heatmaps_from_dist(
-                    weighted_dist2.clone(), gt_classes, channels=heatmap_channels
-                )  # M x C
-
-            reg_targets.append(reg_target)
-            flattened_hms.append(flattened_hm)
-
-        # transpose im first training_targets to level first ones
-        reg_targets = _transpose(reg_targets, num_loc_list)
-        flattened_hms = _transpose(flattened_hms, num_loc_list)
-        for l in range(len(reg_targets)):
-            reg_targets[l] = reg_targets[l] / float(self.strides[l])
-        reg_targets = cat([x for x in reg_targets], dim=0)  # MB x 4
-        flattened_hms = cat([x for x in flattened_hms], dim=0)  # MB x C
-
-        return pos_inds, labels, reg_targets, flattened_hms
-
-    def _get_label_inds(self, gt_instances, shapes_per_level):
-        """
-        Inputs:
-            gt_instances: [n_i], sum n_i = N
-            shapes_per_level: L x 2 [(h_l, w_l)]_L
-        Returns:
-            pos_inds: N'
-            labels: N'
-        """
-        pos_inds = []
-        labels = []
-        L = len(self.strides)
-        B = len(gt_instances)
-        shapes_per_level = shapes_per_level.long()
-        loc_per_level = (shapes_per_level[:, 0] * shapes_per_level[:, 1]).long()  # L
-        level_bases = []
-        s = 0
-        for l in range(L):
-            level_bases.append(s)
-            s = s + B * loc_per_level[l]
-        level_bases = shapes_per_level.new_tensor(level_bases).long()  # L
-        strides_default = shapes_per_level.new_tensor(self.strides).float()  # L
-        for im_i in range(B):
-            targets_per_im = gt_instances[im_i]
-            bboxes = targets_per_im.gt_boxes.tensor  # n x 4
-            n = bboxes.shape[0]
-            centers = (bboxes[:, [0, 1]] + bboxes[:, [2, 3]]) / 2  # n x 2
-            centers = centers.view(n, 1, 2).expand(n, L, 2)
-            strides = strides_default.view(1, L, 1).expand(n, L, 2)
-            centers_inds = (centers / strides).long()  # n x L x 2
-            Ws = shapes_per_level[:, 1].view(1, L).expand(n, L)
-            pos_ind = (
-                level_bases.view(1, L).expand(n, L)
-                + im_i * loc_per_level.view(1, L).expand(n, L)
-                + centers_inds[:, :, 1] * Ws
-                + centers_inds[:, :, 0]
-            )  # n x L
-            is_cared_in_the_level = self.assign_fpn_level(bboxes)
-            pos_ind = pos_ind[is_cared_in_the_level].view(-1)
-            label = (
-                targets_per_im.gt_classes.view(n, 1)
-                .expand(n, L)[is_cared_in_the_level]
-                .view(-1)
-            )
-
-            pos_inds.append(pos_ind)  # n'
-            labels.append(label)  # n'
-        pos_inds = torch.cat(pos_inds, dim=0).long()
-        labels = torch.cat(labels, dim=0)
-        return pos_inds, labels  # N, N
-
-    def assign_fpn_level(self, boxes):
-        """
-        Inputs:
-            boxes: n x 4
-            size_ranges: L x 2
-        Return:
-            is_cared_in_the_level: n x L
-        """
-        size_ranges = boxes.new_tensor(self.sizes_of_interest).view(
-            len(self.sizes_of_interest), 2
-        )  # L x 2
-        crit = ((boxes[:, 2:] - boxes[:, :2]) ** 2).sum(dim=1) ** 0.5 / 2  # n
-        n, L = crit.shape[0], size_ranges.shape[0]
-        crit = crit.view(n, 1).expand(n, L)
-        size_ranges_expand = size_ranges.view(1, L, 2).expand(n, L, 2)
-        is_cared_in_the_level = (crit >= size_ranges_expand[:, :, 0]) & (
-            crit <= size_ranges_expand[:, :, 1]
-        )
-        return is_cared_in_the_level
-
-    def assign_reg_fpn(self, reg_targets_per_im, size_ranges):
-        """
-        TODO (Xingyi): merge it with assign_fpn_level
-        Inputs:
-            reg_targets_per_im: M x N x 4
-            size_ranges: M x 2
-        """
-        crit = ((reg_targets_per_im[:, :, :2] + reg_targets_per_im[:, :, 2:]) ** 2).sum(
-            dim=2
-        ) ** 0.5 / 2  # M x N
-        is_cared_in_the_level = (crit >= size_ranges[:, [0]]) & (
-            crit <= size_ranges[:, [1]]
-        )
-        return is_cared_in_the_level
-
-    def _get_reg_targets(self, reg_targets, dist, mask, area):
-        """
-        reg_targets (M x N x 4): long tensor
-        dist (M x N)
-        is_*: M x N
-        """
-        dist[mask == 0] = INF * 1.0
-        min_dist, min_inds = dist.min(dim=1)  # M
-        reg_targets_per_im = reg_targets[
-            range(len(reg_targets)), min_inds
-        ]  # M x N x 4 --> M x 4
-        reg_targets_per_im[min_dist == INF] = -INF
-        return reg_targets_per_im
-
-    def _create_heatmaps_from_dist(self, dist, labels, channels):
-        """
-        dist: M x N
-        labels: N
-        return:
-          heatmaps: M x C
-        """
-        heatmaps = dist.new_zeros((dist.shape[0], channels))
-        for c in range(channels):
-            inds = labels == c  # N
-            if inds.int().sum() == 0:
-                continue
-            heatmaps[:, c] = torch.exp(-dist[:, inds].min(dim=1)[0])
-            zeros = heatmaps[:, c] < 1e-4
-            heatmaps[zeros, c] = 0
-        return heatmaps
-
-    def _create_agn_heatmaps_from_dist(self, dist):
-        """
-        TODO (Xingyi): merge it with _create_heatmaps_from_dist
-        dist: M x N
-        return:
-          heatmaps: M x 1
-        """
-        heatmaps = dist.new_zeros((dist.shape[0], 1))
-        heatmaps[:, 0] = torch.exp(-dist.min(dim=1)[0])
-        zeros = heatmaps < 1e-4
-        heatmaps[zeros] = 0
-        return heatmaps
-
-    def _flatten_outputs(self, clss, reg_pred, agn_hm_pred):
-        # Reshape: (N, F, Hl, Wl) -> (N, Hl, Wl, F) -> (sum_l N*Hl*Wl, F)
-        clss = (
-            cat([x.permute(0, 2, 3, 1).reshape(-1, x.shape[1]) for x in clss], dim=0)
-            if clss[0] is not None
-            else None
-        )
-        reg_pred = cat([x.permute(0, 2, 3, 1).reshape(-1, 4) for x in reg_pred], dim=0)
-        agn_hm_pred = (
-            cat([x.permute(0, 2, 3, 1).reshape(-1) for x in agn_hm_pred], dim=0)
-            if self.with_agn_hm
-            else None
-        )
-        return clss, reg_pred, agn_hm_pred
-
-    def get_center3x3(self, locations, centers, strides):
-        """
-        Inputs:
-            locations: M x 2
-            centers: N x 2
-            strides: M
-        """
-        M, N = locations.shape[0], centers.shape[0]
-        locations_expanded = locations.view(M, 1, 2).expand(M, N, 2)  # M x N x 2
-        centers_expanded = centers.view(1, N, 2).expand(M, N, 2)  # M x N x 2
-        strides_expanded = strides.view(M, 1, 1).expand(M, N, 2)  # M x N
-        centers_discret = (
-            (centers_expanded / strides_expanded).int() * strides_expanded
-        ).float() + strides_expanded / 2  # M x N x 2
-        dist_x = (locations_expanded[:, :, 0] - centers_discret[:, :, 0]).abs()
-        dist_y = (locations_expanded[:, :, 1] - centers_discret[:, :, 1]).abs()
-        return (dist_x <= strides_expanded[:, :, 0]) & (
-            dist_y <= strides_expanded[:, :, 0]
-        )
-
-    def inference(
-        self, images, clss_per_level, reg_pred_per_level, agn_hm_pred_per_level, grids
-    ):
-        logits_pred = [x.sigmoid() if x is not None else None for x in clss_per_level]
-        agn_hm_pred_per_level = [
-            x.sigmoid() if x is not None else None for x in agn_hm_pred_per_level
-        ]
-
-        if self.only_proposal:
-            proposals = self.predict_instances(
-                grids,
-                agn_hm_pred_per_level,
-                reg_pred_per_level,
-                images.image_sizes,
-                [None for _ in agn_hm_pred_per_level],
-            )
-        else:
-            proposals = self.predict_instances(
-                grids,
-                logits_pred,
-                reg_pred_per_level,
-                images.image_sizes,
-                agn_hm_pred_per_level,
-            )
-        if self.as_proposal or self.only_proposal:
-            for p in range(len(proposals)):
-                proposals[p].proposal_boxes = proposals[p].get("pred_boxes")
-                proposals[p].objectness_logits = proposals[p].get("scores")
-                proposals[p].remove("pred_boxes")
-
-        if self.debug:
-            debug_test(
-                [self.denormalizer(x) for x in images],
-                logits_pred,
-                reg_pred_per_level,
-                agn_hm_pred_per_level,
-                preds=proposals,
-                vis_thresh=self.vis_thresh,
-                debug_show_name=False,
-            )
-        return proposals, {}
-
-    def predict_instances(
-        self, grids, logits_pred, reg_pred, image_sizes, agn_hm_pred, is_proposal=False
-    ):
-        sampled_boxes = []
-        for l in range(len(grids)):
-            sampled_boxes.append(
-                self.predict_single_level(
-                    grids[l],
-                    logits_pred[l],
-                    reg_pred[l] * self.strides[l],
-                    image_sizes,
-                    agn_hm_pred[l],
-                    l,
-                    is_proposal=is_proposal,
-                )
-            )
-        boxlists = list(zip(*sampled_boxes))
-        boxlists = [Instances.cat(boxlist) for boxlist in boxlists]
-        boxlists = self.nms_and_topK(boxlists, nms=not self.not_nms)
-        return boxlists
-
-    def predict_single_level(
-        self, grids, heatmap, reg_pred, image_sizes, agn_hm, level, is_proposal=False
-    ):
-        N, C, H, W = heatmap.shape
-        # put in the same format as grids
-        if self.center_nms:
-            heatmap_nms = nn.functional.max_pool2d(heatmap, (3, 3), stride=1, padding=1)
-            heatmap = heatmap * (heatmap_nms == heatmap).float()
-        heatmap = heatmap.permute(0, 2, 3, 1)  # N x H x W x C
-        heatmap = heatmap.reshape(N, -1, C)  # N x HW x C
-        box_regression = reg_pred.view(N, 4, H, W).permute(0, 2, 3, 1)  # N x H x W x 4
-        box_regression = box_regression.reshape(N, -1, 4)
-
-        candidate_inds = heatmap > self.score_thresh  # 0.05
-        pre_nms_top_n = candidate_inds.view(N, -1).sum(1)  # N
-        pre_nms_topk = (
-            self.pre_nms_topk_train if self.training else self.pre_nms_topk_test
-        )
-        pre_nms_top_n = pre_nms_top_n.clamp(max=pre_nms_topk)  # N
-
-        if agn_hm is not None:
-            agn_hm = agn_hm.view(N, 1, H, W).permute(0, 2, 3, 1)
-            agn_hm = agn_hm.reshape(N, -1)
-            heatmap = heatmap * agn_hm[:, :, None]
-
-        results = []
-        for i in range(N):
-            per_box_cls = heatmap[i]  # HW x C
-            per_candidate_inds = candidate_inds[i]  # n
-            per_box_cls = per_box_cls[per_candidate_inds]  # n
-
-            per_candidate_nonzeros = per_candidate_inds.nonzero()  # n
-            per_box_loc = per_candidate_nonzeros[:, 0]  # n
-            per_class = per_candidate_nonzeros[:, 1]  # n
-
-            per_box_regression = box_regression[i]  # HW x 4
-            per_box_regression = per_box_regression[per_box_loc]  # n x 4
-            per_grids = grids[per_box_loc]  # n x 2
-
-            per_pre_nms_top_n = pre_nms_top_n[i]  # 1
-
-            if per_candidate_inds.sum().item() > per_pre_nms_top_n.item():
-                per_box_cls, top_k_indices = per_box_cls.topk(
-                    per_pre_nms_top_n, sorted=False
-                )
-                per_class = per_class[top_k_indices]
-                per_box_regression = per_box_regression[top_k_indices]
-                per_grids = per_grids[top_k_indices]
-
-            detections = torch.stack(
-                [
-                    per_grids[:, 0] - per_box_regression[:, 0],
-                    per_grids[:, 1] - per_box_regression[:, 1],
-                    per_grids[:, 0] + per_box_regression[:, 2],
-                    per_grids[:, 1] + per_box_regression[:, 3],
-                ],
-                dim=1,
-            )  # n x 4
-
-            # avoid invalid boxes in RoI heads
-            detections[:, 2] = torch.max(detections[:, 2], detections[:, 0] + 0.01)
-            detections[:, 3] = torch.max(detections[:, 3], detections[:, 1] + 0.01)
-            boxlist = Instances(image_sizes[i])
-            boxlist.scores = (
-                torch.sqrt(per_box_cls) if self.with_agn_hm else per_box_cls
-            )  # n
-            # import pdb; pdb.set_trace()
-            boxlist.pred_boxes = Boxes(detections)
-            boxlist.pred_classes = per_class
-            results.append(boxlist)
-        return results
-
-    def nms_and_topK(self, boxlists, nms=True):
-        num_images = len(boxlists)
-        results = []
-        for i in range(num_images):
-            nms_thresh = (
-                self.nms_thresh_train if self.training else self.nms_thresh_test
-            )
-            result = ml_nms(boxlists[i], nms_thresh) if nms else boxlists[i]
-            if self.debug:
-                print("#proposals before nms", len(boxlists[i]))
-                print("#proposals after nms", len(result))
-            num_dets = len(result)
-            post_nms_topk = (
-                self.post_nms_topk_train if self.training else self.post_nms_topk_test
-            )
-            if num_dets > post_nms_topk:
-                cls_scores = result.scores
-                image_thresh, _ = torch.kthvalue(
-                    cls_scores.float().cpu(), num_dets - post_nms_topk + 1
-                )
-                keep = cls_scores >= image_thresh.item()
-                keep = torch.nonzero(keep).squeeze(1)
-                result = result[keep]
-            if self.debug:
-                print("#proposals after filter", len(result))
-            results.append(result)
-        return results
-
-    def _add_more_pos(self, reg_pred, gt_instances, shapes_per_level):
-        labels, level_masks, c33_inds, c33_masks, c33_regs = self._get_c33_inds(
-            gt_instances, shapes_per_level
-        )
-        N, L, K = labels.shape[0], len(self.strides), 9
-        c33_inds[c33_masks == 0] = 0
-        reg_pred_c33 = reg_pred[c33_inds].detach()  # N x L x K
-        invalid_reg = c33_masks == 0
-        c33_regs_expand = c33_regs.view(N * L * K, 4).clamp(min=0)
-        if N > 0:
-            with torch.no_grad():
-                c33_reg_loss = (
-                    self.iou_loss(
-                        reg_pred_c33.view(N * L * K, 4),
-                        c33_regs_expand,
-                        None,
-                        reduction="none",
-                    )
-                    .view(N, L, K)
-                    .detach()
-                )  # N x L x K
-        else:
-            c33_reg_loss = reg_pred_c33.new_zeros((N, L, K)).detach()
-        c33_reg_loss[invalid_reg] = INF  # N x L x K
-        c33_reg_loss.view(N * L, K)[level_masks.view(N * L), 4] = 0  # real center
-        c33_reg_loss = c33_reg_loss.view(N, L * K)
-        if N == 0:
-            loss_thresh = c33_reg_loss.new_ones((N)).float()
-        else:
-            loss_thresh = torch.kthvalue(c33_reg_loss, self.more_pos_topk, dim=1)[
-                0
-            ]  # N
-        loss_thresh[loss_thresh > self.more_pos_thresh] = self.more_pos_thresh  # N
-        new_pos = c33_reg_loss.view(N, L, K) < loss_thresh.view(N, 1, 1).expand(N, L, K)
-        pos_inds = c33_inds[new_pos].view(-1)  # P
-        labels = labels.view(N, 1, 1).expand(N, L, K)[new_pos].view(-1)
-        return pos_inds, labels
-
-    def _get_c33_inds(self, gt_instances, shapes_per_level):
-        """
-        TODO (Xingyi): The current implementation is ugly. Refactor.
-        Get the center (and the 3x3 region near center) locations of each objects
-        Inputs:
-            gt_instances: [n_i], sum n_i = N
-            shapes_per_level: L x 2 [(h_l, w_l)]_L
-        """
-        labels = []
-        level_masks = []
-        c33_inds = []
-        c33_masks = []
-        c33_regs = []
-        L = len(self.strides)
-        B = len(gt_instances)
-        shapes_per_level = shapes_per_level.long()
-        loc_per_level = (shapes_per_level[:, 0] * shapes_per_level[:, 1]).long()  # L
-        level_bases = []
-        s = 0
-        for l in range(L):
-            level_bases.append(s)
-            s = s + B * loc_per_level[l]
-        level_bases = shapes_per_level.new_tensor(level_bases).long()  # L
-        strides_default = shapes_per_level.new_tensor(self.strides).float()  # L
-        K = 9
-        dx = shapes_per_level.new_tensor([-1, 0, 1, -1, 0, 1, -1, 0, 1]).long()
-        dy = shapes_per_level.new_tensor([-1, -1, -1, 0, 0, 0, 1, 1, 1]).long()
-        for im_i in range(B):
-            targets_per_im = gt_instances[im_i]
-            bboxes = targets_per_im.gt_boxes.tensor  # n x 4
-            n = bboxes.shape[0]
-            if n == 0:
-                continue
-            centers = (bboxes[:, [0, 1]] + bboxes[:, [2, 3]]) / 2  # n x 2
-            centers = centers.view(n, 1, 2).expand(n, L, 2)
-
-            strides = strides_default.view(1, L, 1).expand(n, L, 2)  #
-            centers_inds = (centers / strides).long()  # n x L x 2
-            center_grids = centers_inds * strides + strides // 2  # n x L x 2
-            l = center_grids[:, :, 0] - bboxes[:, 0].view(n, 1).expand(n, L)
-            t = center_grids[:, :, 1] - bboxes[:, 1].view(n, 1).expand(n, L)
-            r = bboxes[:, 2].view(n, 1).expand(n, L) - center_grids[:, :, 0]
-            b = bboxes[:, 3].view(n, 1).expand(n, L) - center_grids[:, :, 1]  # n x L
-            reg = torch.stack([l, t, r, b], dim=2)  # n x L x 4
-            reg = reg / strides_default.view(1, L, 1).expand(n, L, 4).float()
-
-            Ws = shapes_per_level[:, 1].view(1, L).expand(n, L)
-            Hs = shapes_per_level[:, 0].view(1, L).expand(n, L)
-            expand_Ws = Ws.view(n, L, 1).expand(n, L, K)
-            expand_Hs = Hs.view(n, L, 1).expand(n, L, K)
-            label = targets_per_im.gt_classes.view(n).clone()
-            mask = reg.min(dim=2)[0] >= 0  # n x L
-            mask = mask & self.assign_fpn_level(bboxes)
-            labels.append(label)  # n
-            level_masks.append(mask)  # n x L
-
-            Dy = dy.view(1, 1, K).expand(n, L, K)
-            Dx = dx.view(1, 1, K).expand(n, L, K)
-            c33_ind = (
-                level_bases.view(1, L, 1).expand(n, L, K)
-                + im_i * loc_per_level.view(1, L, 1).expand(n, L, K)
-                + (centers_inds[:, :, 1:2].expand(n, L, K) + Dy) * expand_Ws
-                + (centers_inds[:, :, 0:1].expand(n, L, K) + Dx)
-            )  # n x L x K
-
-            c33_mask = (
-                ((centers_inds[:, :, 1:2].expand(n, L, K) + dy) < expand_Hs)
-                & ((centers_inds[:, :, 1:2].expand(n, L, K) + dy) >= 0)
-                & ((centers_inds[:, :, 0:1].expand(n, L, K) + dx) < expand_Ws)
-                & ((centers_inds[:, :, 0:1].expand(n, L, K) + dx) >= 0)
-            )
-            # TODO (Xingyi): think about better way to implement this
-            # Currently it hard codes the 3x3 region
-            c33_reg = reg.view(n, L, 1, 4).expand(n, L, K, 4).clone()
-            c33_reg[:, :, [0, 3, 6], 0] -= 1
-            c33_reg[:, :, [0, 3, 6], 2] += 1
-            c33_reg[:, :, [2, 5, 8], 0] += 1
-            c33_reg[:, :, [2, 5, 8], 2] -= 1
-            c33_reg[:, :, [0, 1, 2], 1] -= 1
-            c33_reg[:, :, [0, 1, 2], 3] += 1
-            c33_reg[:, :, [6, 7, 8], 1] += 1
-            c33_reg[:, :, [6, 7, 8], 3] -= 1
-            c33_mask = c33_mask & (c33_reg.min(dim=3)[0] >= 0)  # n x L x K
-            c33_inds.append(c33_ind)
-            c33_masks.append(c33_mask)
-            c33_regs.append(c33_reg)
-
-        if len(level_masks) > 0:
-            labels = torch.cat(labels, dim=0)
-            level_masks = torch.cat(level_masks, dim=0)
-            c33_inds = torch.cat(c33_inds, dim=0).long()
-            c33_regs = torch.cat(c33_regs, dim=0)
-            c33_masks = torch.cat(c33_masks, dim=0)
-        else:
-            labels = shapes_per_level.new_zeros((0)).long()
-            level_masks = shapes_per_level.new_zeros((0, L)).bool()
-            c33_inds = shapes_per_level.new_zeros((0, L, K)).long()
-            c33_regs = shapes_per_level.new_zeros((0, L, K, 4)).float()
-            c33_masks = shapes_per_level.new_zeros((0, L, K)).bool()
-        return labels, level_masks, c33_inds, c33_masks, c33_regs  # N x L, N x L x K
diff --git a/eval/vbench/third_party/grit_src/centernet2/centernet/modeling/dense_heads/centernet_head.py b/eval/vbench/third_party/grit_src/centernet2/centernet/modeling/dense_heads/centernet_head.py
deleted file mode 100644
index 3e661b96..00000000
--- a/eval/vbench/third_party/grit_src/centernet2/centernet/modeling/dense_heads/centernet_head.py
+++ /dev/null
@@ -1,177 +0,0 @@
-import math
-from typing import List
-
-import torch
-from detectron2.config import configurable
-from detectron2.layers import ShapeSpec, get_norm
-from torch import nn
-from torch.nn import functional as F
-
-from ..layers.deform_conv import DFConv2d
-
-__all__ = ["CenterNetHead"]
-
-
-class Scale(nn.Module):
-    def __init__(self, init_value=1.0):
-        super(Scale, self).__init__()
-        self.scale = nn.Parameter(torch.FloatTensor([init_value]))
-
-    def forward(self, input):
-        return input * self.scale
-
-
-class CenterNetHead(nn.Module):
-    @configurable
-    def __init__(
-        self,
-        # input_shape: List[ShapeSpec],
-        in_channels,
-        num_levels,
-        *,
-        num_classes=80,
-        with_agn_hm=False,
-        only_proposal=False,
-        norm="GN",
-        num_cls_convs=4,
-        num_box_convs=4,
-        num_share_convs=0,
-        use_deformable=False,
-        prior_prob=0.01,
-    ):
-        super().__init__()
-        self.num_classes = num_classes
-        self.with_agn_hm = with_agn_hm
-        self.only_proposal = only_proposal
-        self.out_kernel = 3
-
-        head_configs = {
-            "cls": (num_cls_convs if not self.only_proposal else 0, use_deformable),
-            "bbox": (num_box_convs, use_deformable),
-            "share": (num_share_convs, use_deformable),
-        }
-
-        # in_channels = [s.channels for s in input_shape]
-        # assert len(set(in_channels)) == 1, \
-        #     "Each level must have the same channel!"
-        # in_channels = in_channels[0]
-        channels = {
-            "cls": in_channels,
-            "bbox": in_channels,
-            "share": in_channels,
-        }
-        for head in head_configs:
-            tower = []
-            num_convs, use_deformable = head_configs[head]
-            channel = channels[head]
-            for i in range(num_convs):
-                if use_deformable and i == num_convs - 1:
-                    conv_func = DFConv2d
-                else:
-                    conv_func = nn.Conv2d
-                tower.append(
-                    conv_func(
-                        in_channels if i == 0 else channel,
-                        channel,
-                        kernel_size=3,
-                        stride=1,
-                        padding=1,
-                        bias=True,
-                    )
-                )
-                if norm == "GN" and channel % 32 != 0:
-                    tower.append(nn.GroupNorm(25, channel))
-                elif norm != "":
-                    tower.append(get_norm(norm, channel))
-                tower.append(nn.ReLU())
-            self.add_module("{}_tower".format(head), nn.Sequential(*tower))
-
-        self.bbox_pred = nn.Conv2d(
-            in_channels,
-            4,
-            kernel_size=self.out_kernel,
-            stride=1,
-            padding=self.out_kernel // 2,
-        )
-
-        self.scales = nn.ModuleList([Scale(init_value=1.0) for _ in range(num_levels)])
-
-        for modules in [
-            self.cls_tower,
-            self.bbox_tower,
-            self.share_tower,
-            self.bbox_pred,
-        ]:
-            for l in modules.modules():
-                if isinstance(l, nn.Conv2d):
-                    torch.nn.init.normal_(l.weight, std=0.01)
-                    torch.nn.init.constant_(l.bias, 0)
-
-        torch.nn.init.constant_(self.bbox_pred.bias, 8.0)
-        prior_prob = prior_prob
-        bias_value = -math.log((1 - prior_prob) / prior_prob)
-
-        if self.with_agn_hm:
-            self.agn_hm = nn.Conv2d(
-                in_channels,
-                1,
-                kernel_size=self.out_kernel,
-                stride=1,
-                padding=self.out_kernel // 2,
-            )
-            torch.nn.init.constant_(self.agn_hm.bias, bias_value)
-            torch.nn.init.normal_(self.agn_hm.weight, std=0.01)
-
-        if not self.only_proposal:
-            cls_kernel_size = self.out_kernel
-            self.cls_logits = nn.Conv2d(
-                in_channels,
-                self.num_classes,
-                kernel_size=cls_kernel_size,
-                stride=1,
-                padding=cls_kernel_size // 2,
-            )
-
-            torch.nn.init.constant_(self.cls_logits.bias, bias_value)
-            torch.nn.init.normal_(self.cls_logits.weight, std=0.01)
-
-    @classmethod
-    def from_config(cls, cfg, input_shape):
-        ret = {
-            # 'input_shape': input_shape,
-            "in_channels": [s.channels for s in input_shape][0],
-            "num_levels": len(input_shape),
-            "num_classes": cfg.MODEL.CENTERNET.NUM_CLASSES,
-            "with_agn_hm": cfg.MODEL.CENTERNET.WITH_AGN_HM,
-            "only_proposal": cfg.MODEL.CENTERNET.ONLY_PROPOSAL,
-            "norm": cfg.MODEL.CENTERNET.NORM,
-            "num_cls_convs": cfg.MODEL.CENTERNET.NUM_CLS_CONVS,
-            "num_box_convs": cfg.MODEL.CENTERNET.NUM_BOX_CONVS,
-            "num_share_convs": cfg.MODEL.CENTERNET.NUM_SHARE_CONVS,
-            "use_deformable": cfg.MODEL.CENTERNET.USE_DEFORMABLE,
-            "prior_prob": cfg.MODEL.CENTERNET.PRIOR_PROB,
-        }
-        return ret
-
-    def forward(self, x):
-        clss = []
-        bbox_reg = []
-        agn_hms = []
-        for l, feature in enumerate(x):
-            feature = self.share_tower(feature)
-            cls_tower = self.cls_tower(feature)
-            bbox_tower = self.bbox_tower(feature)
-            if not self.only_proposal:
-                clss.append(self.cls_logits(cls_tower))
-            else:
-                clss.append(None)
-
-            if self.with_agn_hm:
-                agn_hms.append(self.agn_hm(bbox_tower))
-            else:
-                agn_hms.append(None)
-            reg = self.bbox_pred(bbox_tower)
-            reg = self.scales[l](reg)
-            bbox_reg.append(F.relu(reg))
-
-        return clss, bbox_reg, agn_hms
diff --git a/eval/vbench/third_party/grit_src/centernet2/centernet/modeling/dense_heads/utils.py b/eval/vbench/third_party/grit_src/centernet2/centernet/modeling/dense_heads/utils.py
deleted file mode 100644
index 3853048e..00000000
--- a/eval/vbench/third_party/grit_src/centernet2/centernet/modeling/dense_heads/utils.py
+++ /dev/null
@@ -1,39 +0,0 @@
-import cv2
-import numpy as np
-import torch
-
-# from .data import CenterNetCrop
-import torch.nn.functional as F
-from detectron2.structures import Boxes, ImageList, Instances, pairwise_iou
-from detectron2.utils.comm import get_world_size
-from torch import nn
-
-__all__ = ["reduce_sum", "_transpose"]
-
-INF = 1000000000
-
-
-def _transpose(training_targets, num_loc_list):
-    """
-    This function is used to transpose image first training targets to
-        level first ones
-    :return: level first training targets
-    """
-    for im_i in range(len(training_targets)):
-        training_targets[im_i] = torch.split(
-            training_targets[im_i], num_loc_list, dim=0
-        )
-
-    targets_level_first = []
-    for targets_per_level in zip(*training_targets):
-        targets_level_first.append(torch.cat(targets_per_level, dim=0))
-    return targets_level_first
-
-
-def reduce_sum(tensor):
-    world_size = get_world_size()
-    if world_size < 2:
-        return tensor
-    tensor = tensor.clone()
-    torch.distributed.all_reduce(tensor, op=torch.distributed.ReduceOp.SUM)
-    return tensor
diff --git a/eval/vbench/third_party/grit_src/centernet2/centernet/modeling/layers/__init__.py b/eval/vbench/third_party/grit_src/centernet2/centernet/modeling/layers/__init__.py
deleted file mode 100644
index e69de29b..00000000
diff --git a/eval/vbench/third_party/grit_src/centernet2/centernet/modeling/layers/deform_conv.py b/eval/vbench/third_party/grit_src/centernet2/centernet/modeling/layers/deform_conv.py
deleted file mode 100644
index 89973ea1..00000000
--- a/eval/vbench/third_party/grit_src/centernet2/centernet/modeling/layers/deform_conv.py
+++ /dev/null
@@ -1,114 +0,0 @@
-import torch
-from detectron2.layers import Conv2d
-from torch import nn
-
-
-class _NewEmptyTensorOp(torch.autograd.Function):
-    @staticmethod
-    def forward(ctx, x, new_shape):
-        ctx.shape = x.shape
-        return x.new_empty(new_shape)
-
-    @staticmethod
-    def backward(ctx, grad):
-        shape = ctx.shape
-        return _NewEmptyTensorOp.apply(grad, shape), None
-
-
-class DFConv2d(nn.Module):
-    """Deformable convolutional layer"""
-
-    def __init__(
-        self,
-        in_channels,
-        out_channels,
-        with_modulated_dcn=True,
-        kernel_size=3,
-        stride=1,
-        groups=1,
-        dilation=1,
-        deformable_groups=1,
-        bias=False,
-        padding=None,
-    ):
-        super(DFConv2d, self).__init__()
-        if isinstance(kernel_size, (list, tuple)):
-            assert isinstance(stride, (list, tuple))
-            assert isinstance(dilation, (list, tuple))
-            assert len(kernel_size) == 2
-            assert len(stride) == 2
-            assert len(dilation) == 2
-            padding = (
-                dilation[0] * (kernel_size[0] - 1) // 2,
-                dilation[1] * (kernel_size[1] - 1) // 2,
-            )
-            offset_base_channels = kernel_size[0] * kernel_size[1]
-        else:
-            padding = dilation * (kernel_size - 1) // 2
-            offset_base_channels = kernel_size * kernel_size
-        if with_modulated_dcn:
-            from detectron2.layers.deform_conv import ModulatedDeformConv
-
-            offset_channels = offset_base_channels * 3  # default: 27
-            conv_block = ModulatedDeformConv
-        else:
-            from detectron2.layers.deform_conv import DeformConv
-
-            offset_channels = offset_base_channels * 2  # default: 18
-            conv_block = DeformConv
-        self.offset = Conv2d(
-            in_channels,
-            deformable_groups * offset_channels,
-            kernel_size=kernel_size,
-            stride=stride,
-            padding=padding,
-            groups=1,
-            dilation=dilation,
-        )
-        nn.init.constant_(self.offset.weight, 0)
-        nn.init.constant_(self.offset.bias, 0)
-        """
-        for l in [self.offset, ]:
-            nn.init.kaiming_uniform_(l.weight, a=1)
-            torch.nn.init.constant_(l.bias, 0.)
-        """
-        self.conv = conv_block(
-            in_channels,
-            out_channels,
-            kernel_size=kernel_size,
-            stride=stride,
-            padding=padding,
-            dilation=dilation,
-            groups=groups,
-            deformable_groups=deformable_groups,
-            bias=bias,
-        )
-        self.with_modulated_dcn = with_modulated_dcn
-        self.kernel_size = kernel_size
-        self.stride = stride
-        self.padding = padding
-        self.dilation = dilation
-        self.offset_split = offset_base_channels * deformable_groups * 2
-
-    def forward(self, x, return_offset=False):
-        if x.numel() > 0:
-            if not self.with_modulated_dcn:
-                offset_mask = self.offset(x)
-                x = self.conv(x, offset_mask)
-            else:
-                offset_mask = self.offset(x)
-                offset = offset_mask[:, : self.offset_split, :, :]
-                mask = offset_mask[:, self.offset_split :, :, :].sigmoid()
-                x = self.conv(x, offset, mask)
-            if return_offset:
-                return x, offset_mask
-            return x
-        # get output shape
-        output_shape = [
-            (i + 2 * p - (di * (k - 1) + 1)) // d + 1
-            for i, p, di, k, d in zip(
-                x.shape[-2:], self.padding, self.dilation, self.kernel_size, self.stride
-            )
-        ]
-        output_shape = [x.shape[0], self.conv.weight.shape[0]] + output_shape
-        return _NewEmptyTensorOp.apply(x, output_shape)
diff --git a/eval/vbench/third_party/grit_src/centernet2/centernet/modeling/layers/heatmap_focal_loss.py b/eval/vbench/third_party/grit_src/centernet2/centernet/modeling/layers/heatmap_focal_loss.py
deleted file mode 100644
index c6e5d223..00000000
--- a/eval/vbench/third_party/grit_src/centernet2/centernet/modeling/layers/heatmap_focal_loss.py
+++ /dev/null
@@ -1,96 +0,0 @@
-import torch
-from torch.nn import functional as F
-
-
-# TODO: merge these two function
-def heatmap_focal_loss(
-    inputs,
-    targets,
-    pos_inds,
-    labels,
-    alpha: float = -1,
-    beta: float = 4,
-    gamma: float = 2,
-    reduction: str = "sum",
-    sigmoid_clamp: float = 1e-4,
-    ignore_high_fp: float = -1.0,
-):
-    """
-    Loss used in RetinaNet for dense detection: https://arxiv.org/abs/1708.02002.
-    Args:
-        inputs:  (sum_l N*Hl*Wl, C)
-        targets: (sum_l N*Hl*Wl, C)
-        pos_inds: N
-        labels: N
-    Returns:
-        Loss tensor with the reduction option applied.
-    """
-    pred = torch.clamp(inputs.sigmoid_(), min=sigmoid_clamp, max=1 - sigmoid_clamp)
-    neg_weights = torch.pow(1 - targets, beta)
-    pos_pred_pix = pred[pos_inds]  # N x C
-    pos_pred = pos_pred_pix.gather(1, labels.unsqueeze(1))
-    pos_loss = torch.log(pos_pred) * torch.pow(1 - pos_pred, gamma)
-    neg_loss = torch.log(1 - pred) * torch.pow(pred, gamma) * neg_weights
-
-    if ignore_high_fp > 0:
-        not_high_fp = (pred < ignore_high_fp).float()
-        neg_loss = not_high_fp * neg_loss
-
-    if reduction == "sum":
-        pos_loss = pos_loss.sum()
-        neg_loss = neg_loss.sum()
-
-    if alpha >= 0:
-        pos_loss = alpha * pos_loss
-        neg_loss = (1 - alpha) * neg_loss
-
-    return -pos_loss, -neg_loss
-
-
-heatmap_focal_loss_jit = torch.jit.script(heatmap_focal_loss)
-# heatmap_focal_loss_jit = heatmap_focal_loss
-
-
-def binary_heatmap_focal_loss(
-    inputs,
-    targets,
-    pos_inds,
-    alpha: float = -1,
-    beta: float = 4,
-    gamma: float = 2,
-    sigmoid_clamp: float = 1e-4,
-    ignore_high_fp: float = -1.0,
-):
-    """
-    Args:
-        inputs:  (sum_l N*Hl*Wl,)
-        targets: (sum_l N*Hl*Wl,)
-        pos_inds: N
-    Returns:
-        Loss tensor with the reduction option applied.
-    """
-    pred = torch.clamp(inputs.sigmoid_(), min=sigmoid_clamp, max=1 - sigmoid_clamp)
-    neg_weights = torch.pow(1 - targets, beta)
-    for i, ind in enumerate(pos_inds):
-        if ind >= pred.shape[0]:
-            print("%" * 100)
-            print(pred.shape, ind, pos_inds)
-            pos_inds[i] = pred.shape[0] - 1
-    pos_pred = pred[pos_inds]  # N
-    pos_loss = torch.log(pos_pred) * torch.pow(1 - pos_pred, gamma)
-    neg_loss = torch.log(1 - pred) * torch.pow(pred, gamma) * neg_weights
-    if ignore_high_fp > 0:
-        not_high_fp = (pred < ignore_high_fp).float()
-        neg_loss = not_high_fp * neg_loss
-
-    pos_loss = -pos_loss.sum()
-    neg_loss = -neg_loss.sum()
-
-    if alpha >= 0:
-        pos_loss = alpha * pos_loss
-        neg_loss = (1 - alpha) * neg_loss
-
-    return pos_loss, neg_loss
-
-
-# binary_heatmap_focal_loss_jit = torch.jit.script(binary_heatmap_focal_loss)
diff --git a/eval/vbench/third_party/grit_src/centernet2/centernet/modeling/layers/iou_loss.py b/eval/vbench/third_party/grit_src/centernet2/centernet/modeling/layers/iou_loss.py
deleted file mode 100644
index 29c653ed..00000000
--- a/eval/vbench/third_party/grit_src/centernet2/centernet/modeling/layers/iou_loss.py
+++ /dev/null
@@ -1,123 +0,0 @@
-import torch
-from torch import nn
-
-
-class IOULoss(nn.Module):
-    def __init__(self, loc_loss_type="iou"):
-        super(IOULoss, self).__init__()
-        self.loc_loss_type = loc_loss_type
-
-    def forward(self, pred, target, weight=None, reduction="sum"):
-        pred_left = pred[:, 0]
-        pred_top = pred[:, 1]
-        pred_right = pred[:, 2]
-        pred_bottom = pred[:, 3]
-
-        target_left = target[:, 0]
-        target_top = target[:, 1]
-        target_right = target[:, 2]
-        target_bottom = target[:, 3]
-
-        target_aera = (target_left + target_right) * (target_top + target_bottom)
-        pred_aera = (pred_left + pred_right) * (pred_top + pred_bottom)
-
-        w_intersect = torch.min(pred_left, target_left) + torch.min(
-            pred_right, target_right
-        )
-        h_intersect = torch.min(pred_bottom, target_bottom) + torch.min(
-            pred_top, target_top
-        )
-
-        g_w_intersect = torch.max(pred_left, target_left) + torch.max(
-            pred_right, target_right
-        )
-        g_h_intersect = torch.max(pred_bottom, target_bottom) + torch.max(
-            pred_top, target_top
-        )
-        ac_uion = g_w_intersect * g_h_intersect
-
-        area_intersect = w_intersect * h_intersect
-        area_union = target_aera + pred_aera - area_intersect
-
-        ious = (area_intersect + 1.0) / (area_union + 1.0)
-        gious = ious - (ac_uion - area_union) / ac_uion
-        if self.loc_loss_type == "iou":
-            losses = -torch.log(ious)
-        elif self.loc_loss_type == "linear_iou":
-            losses = 1 - ious
-        elif self.loc_loss_type == "giou":
-            losses = 1 - gious
-        else:
-            raise NotImplementedError
-
-        if weight is not None:
-            losses = losses * weight
-        else:
-            losses = losses
-
-        if reduction == "sum":
-            return losses.sum()
-        elif reduction == "batch":
-            return losses.sum(dim=[1])
-        elif reduction == "none":
-            return losses
-        else:
-            raise NotImplementedError
-
-
-def giou_loss(
-    boxes1: torch.Tensor,
-    boxes2: torch.Tensor,
-    reduction: str = "none",
-    eps: float = 1e-7,
-) -> torch.Tensor:
-    """
-    Generalized Intersection over Union Loss (Hamid Rezatofighi et. al)
-    https://arxiv.org/abs/1902.09630
-    Gradient-friendly IoU loss with an additional penalty that is non-zero when the
-    boxes do not overlap and scales with the size of their smallest enclosing box.
-    This loss is symmetric, so the boxes1 and boxes2 arguments are interchangeable.
-    Args:
-        boxes1, boxes2 (Tensor): box locations in XYXY format, shape (N, 4) or (4,).
-        reduction: 'none' | 'mean' | 'sum'
-                 'none': No reduction will be applied to the output.
-                 'mean': The output will be averaged.
-                 'sum': The output will be summed.
-        eps (float): small number to prevent division by zero
-    """
-
-    x1, y1, x2, y2 = boxes1.unbind(dim=-1)
-    x1g, y1g, x2g, y2g = boxes2.unbind(dim=-1)
-
-    assert (x2 >= x1).all(), "bad box: x1 larger than x2"
-    assert (y2 >= y1).all(), "bad box: y1 larger than y2"
-
-    # Intersection keypoints
-    xkis1 = torch.max(x1, x1g)
-    ykis1 = torch.max(y1, y1g)
-    xkis2 = torch.min(x2, x2g)
-    ykis2 = torch.min(y2, y2g)
-
-    intsctk = torch.zeros_like(x1)
-    mask = (ykis2 > ykis1) & (xkis2 > xkis1)
-    intsctk[mask] = (xkis2[mask] - xkis1[mask]) * (ykis2[mask] - ykis1[mask])
-    unionk = (x2 - x1) * (y2 - y1) + (x2g - x1g) * (y2g - y1g) - intsctk
-    iouk = intsctk / (unionk + eps)
-
-    # smallest enclosing box
-    xc1 = torch.min(x1, x1g)
-    yc1 = torch.min(y1, y1g)
-    xc2 = torch.max(x2, x2g)
-    yc2 = torch.max(y2, y2g)
-
-    area_c = (xc2 - xc1) * (yc2 - yc1)
-    miouk = iouk - ((area_c - unionk) / (area_c + eps))
-
-    loss = 1 - miouk
-
-    if reduction == "mean":
-        loss = loss.mean()
-    elif reduction == "sum":
-        loss = loss.sum()
-
-    return loss
diff --git a/eval/vbench/third_party/grit_src/centernet2/centernet/modeling/layers/ml_nms.py b/eval/vbench/third_party/grit_src/centernet2/centernet/modeling/layers/ml_nms.py
deleted file mode 100644
index 02f353e1..00000000
--- a/eval/vbench/third_party/grit_src/centernet2/centernet/modeling/layers/ml_nms.py
+++ /dev/null
@@ -1,33 +0,0 @@
-from detectron2.layers import batched_nms
-
-
-def ml_nms(
-    boxlist, nms_thresh, max_proposals=-1, score_field="scores", label_field="labels"
-):
-    """
-    Performs non-maximum suppression on a boxlist, with scores specified
-    in a boxlist field via score_field.
-    Arguments:
-        boxlist(BoxList)
-        nms_thresh (float)
-        max_proposals (int): if > 0, then only the top max_proposals are kept
-            after non-maximum suppression
-        score_field (str)
-    """
-    if nms_thresh <= 0:
-        return boxlist
-    if boxlist.has("pred_boxes"):
-        boxes = boxlist.pred_boxes.tensor
-        labels = boxlist.pred_classes
-    else:
-        boxes = boxlist.proposal_boxes.tensor
-        labels = boxlist.proposal_boxes.tensor.new_zeros(
-            len(boxlist.proposal_boxes.tensor)
-        )
-    scores = boxlist.scores
-
-    keep = batched_nms(boxes, scores, labels, nms_thresh)
-    if max_proposals > 0:
-        keep = keep[:max_proposals]
-    boxlist = boxlist[keep]
-    return boxlist
diff --git a/eval/vbench/third_party/grit_src/centernet2/centernet/modeling/meta_arch/__init__.py b/eval/vbench/third_party/grit_src/centernet2/centernet/modeling/meta_arch/__init__.py
deleted file mode 100644
index e69de29b..00000000
diff --git a/eval/vbench/third_party/grit_src/centernet2/centernet/modeling/meta_arch/centernet_detector.py b/eval/vbench/third_party/grit_src/centernet2/centernet/modeling/meta_arch/centernet_detector.py
deleted file mode 100644
index cf89399e..00000000
--- a/eval/vbench/third_party/grit_src/centernet2/centernet/modeling/meta_arch/centernet_detector.py
+++ /dev/null
@@ -1,75 +0,0 @@
-import json
-import math
-
-import numpy as np
-import torch
-from detectron2.modeling import (
-    build_backbone,
-    build_proposal_generator,
-    detector_postprocess,
-)
-from detectron2.modeling.meta_arch.build import META_ARCH_REGISTRY
-from detectron2.structures import ImageList
-from torch import nn
-
-
-@META_ARCH_REGISTRY.register()
-class CenterNetDetector(nn.Module):
-    def __init__(self, cfg):
-        super().__init__()
-        self.mean, self.std = cfg.MODEL.PIXEL_MEAN, cfg.MODEL.PIXEL_STD
-        self.register_buffer(
-            "pixel_mean", torch.Tensor(cfg.MODEL.PIXEL_MEAN).view(-1, 1, 1)
-        )
-        self.register_buffer(
-            "pixel_std", torch.Tensor(cfg.MODEL.PIXEL_STD).view(-1, 1, 1)
-        )
-
-        self.backbone = build_backbone(cfg)
-        self.proposal_generator = build_proposal_generator(
-            cfg, self.backbone.output_shape()
-        )  # TODO: change to a more precise name
-
-    def forward(self, batched_inputs):
-        if not self.training:
-            return self.inference(batched_inputs)
-        images = self.preprocess_image(batched_inputs)
-        features = self.backbone(images.tensor)
-        gt_instances = [x["instances"].to(self.device) for x in batched_inputs]
-
-        _, proposal_losses = self.proposal_generator(images, features, gt_instances)
-        return proposal_losses
-
-    @property
-    def device(self):
-        return self.pixel_mean.device
-
-    @torch.no_grad()
-    def inference(self, batched_inputs, do_postprocess=True):
-        images = self.preprocess_image(batched_inputs)
-        inp = images.tensor
-        features = self.backbone(inp)
-        proposals, _ = self.proposal_generator(images, features, None)
-
-        processed_results = []
-        for results_per_image, input_per_image, image_size in zip(
-            proposals, batched_inputs, images.image_sizes
-        ):
-            if do_postprocess:
-                height = input_per_image.get("height", image_size[0])
-                width = input_per_image.get("width", image_size[1])
-                r = detector_postprocess(results_per_image, height, width)
-                processed_results.append({"instances": r})
-            else:
-                r = results_per_image
-                processed_results.append(r)
-        return processed_results
-
-    def preprocess_image(self, batched_inputs):
-        """
-        Normalize, pad and batch the input images.
-        """
-        images = [x["image"].to(self.device) for x in batched_inputs]
-        images = [(x - self.pixel_mean) / self.pixel_std for x in images]
-        images = ImageList.from_tensors(images, self.backbone.size_divisibility)
-        return images
diff --git a/eval/vbench/third_party/grit_src/centernet2/centernet/modeling/roi_heads/__init__.py b/eval/vbench/third_party/grit_src/centernet2/centernet/modeling/roi_heads/__init__.py
deleted file mode 100644
index e69de29b..00000000
diff --git a/eval/vbench/third_party/grit_src/centernet2/centernet/modeling/roi_heads/custom_fast_rcnn.py b/eval/vbench/third_party/grit_src/centernet2/centernet/modeling/roi_heads/custom_fast_rcnn.py
deleted file mode 100644
index 5513f789..00000000
--- a/eval/vbench/third_party/grit_src/centernet2/centernet/modeling/roi_heads/custom_fast_rcnn.py
+++ /dev/null
@@ -1,134 +0,0 @@
-# Copyright (c) Facebook, Inc. and its affiliates. All Rights Reserved
-# Part of the code is from https://github.com/tztztztztz/eql.detectron2/blob/master/projects/EQL/eql/fast_rcnn.py
-import json
-import logging
-import math
-from typing import Dict, Union
-
-import torch
-from detectron2.config import configurable
-from detectron2.layers import Linear, ShapeSpec, batched_nms, cat, nonzero_tuple
-from detectron2.modeling.box_regression import Box2BoxTransform
-from detectron2.modeling.roi_heads.fast_rcnn import (
-    FastRCNNOutputLayers,
-    _log_classification_stats,
-    fast_rcnn_inference,
-)
-from detectron2.structures import Boxes, Instances
-from detectron2.utils.comm import get_world_size
-from detectron2.utils.events import get_event_storage
-from fvcore.nn import giou_loss, smooth_l1_loss
-from torch import nn
-from torch.nn import functional as F
-
-from .fed_loss import get_fed_loss_inds, load_class_freq
-
-__all__ = ["CustomFastRCNNOutputLayers"]
-
-
-class CustomFastRCNNOutputLayers(FastRCNNOutputLayers):
-    def __init__(self, cfg, input_shape: ShapeSpec, **kwargs):
-        super().__init__(cfg, input_shape, **kwargs)
-
-        self.cfg = cfg
-
-    def losses(self, predictions, proposals):
-        """
-        enable advanced loss
-        """
-        scores, proposal_deltas = predictions
-        gt_classes = (
-            cat([p.gt_classes for p in proposals], dim=0)
-            if len(proposals)
-            else torch.empty(0)
-        )
-        num_classes = self.num_classes
-        _log_classification_stats(scores, gt_classes)
-
-        if len(proposals):
-            proposal_boxes = cat(
-                [p.proposal_boxes.tensor for p in proposals], dim=0
-            )  # Nx4
-            assert (
-                not proposal_boxes.requires_grad
-            ), "Proposals should not require gradients!"
-            gt_boxes = cat(
-                [
-                    (p.gt_boxes if p.has("gt_boxes") else p.proposal_boxes).tensor
-                    for p in proposals
-                ],
-                dim=0,
-            )
-        else:
-            proposal_boxes = gt_boxes = torch.empty(
-                (0, 4), device=proposal_deltas.device
-            )
-
-        loss_cls = self.softmax_cross_entropy_loss(scores, gt_classes)
-        return {
-            "loss_cls": loss_cls,
-            "loss_box_reg": self.box_reg_loss(
-                proposal_boxes, gt_boxes, proposal_deltas, gt_classes
-            ),
-        }
-
-    def sigmoid_cross_entropy_loss(self, pred_class_logits, gt_classes):
-        if pred_class_logits.numel() == 0:
-            return pred_class_logits.new_zeros([1])[
-                0
-            ]  # This is more robust than .sum() * 0.
-
-        B = pred_class_logits.shape[0]
-        C = pred_class_logits.shape[1] - 1
-
-        target = pred_class_logits.new_zeros(B, C + 1)
-        target[range(len(gt_classes)), gt_classes] = 1  # B x (C + 1)
-        target = target[:, :C]  # B x C
-
-        weight = 1
-
-        cls_loss = F.binary_cross_entropy_with_logits(
-            pred_class_logits[:, :-1], target, reduction="none"
-        )  # B x C
-        loss = torch.sum(cls_loss * weight) / B
-        return loss
-
-    def softmax_cross_entropy_loss(self, pred_class_logits, gt_classes):
-        """
-        change _no_instance handling
-        """
-        if pred_class_logits.numel() == 0:
-            return pred_class_logits.new_zeros([1])[0]
-
-        loss = F.cross_entropy(pred_class_logits, gt_classes, reduction="mean")
-        return loss
-
-    def inference(self, predictions, proposals):
-        """
-        enable use proposal boxes
-        """
-        boxes = self.predict_boxes(predictions, proposals)
-        scores = self.predict_probs(predictions, proposals)
-        if self.cfg.MODEL.ROI_BOX_HEAD.MULT_PROPOSAL_SCORE:
-            proposal_scores = [p.get("objectness_logits") for p in proposals]
-            scores = [
-                (s * ps[:, None]) ** 0.5 for s, ps in zip(scores, proposal_scores)
-            ]
-        image_shapes = [x.image_size for x in proposals]
-        return fast_rcnn_inference(
-            boxes,
-            scores,
-            image_shapes,
-            self.test_score_thresh,
-            self.test_nms_thresh,
-            self.test_topk_per_image,
-        )
-
-    def predict_probs(self, predictions, proposals):
-        """
-        support sigmoid
-        """
-        scores, _ = predictions
-        num_inst_per_image = [len(p) for p in proposals]
-        probs = F.softmax(scores, dim=-1)
-        return probs.split(num_inst_per_image, dim=0)
diff --git a/eval/vbench/third_party/grit_src/centernet2/centernet/modeling/roi_heads/custom_roi_heads.py b/eval/vbench/third_party/grit_src/centernet2/centernet/modeling/roi_heads/custom_roi_heads.py
deleted file mode 100644
index bbcb268b..00000000
--- a/eval/vbench/third_party/grit_src/centernet2/centernet/modeling/roi_heads/custom_roi_heads.py
+++ /dev/null
@@ -1,210 +0,0 @@
-# Copyright (c) Facebook, Inc. and its affiliates. All Rights Reserved
-import json
-import math
-from typing import Dict, List, Optional, Tuple, Union
-
-import numpy as np
-import torch
-from detectron2.layers import ShapeSpec
-from detectron2.modeling.box_regression import Box2BoxTransform
-from detectron2.modeling.roi_heads.box_head import build_box_head
-from detectron2.modeling.roi_heads.cascade_rcnn import CascadeROIHeads
-from detectron2.modeling.roi_heads.fast_rcnn import fast_rcnn_inference
-from detectron2.modeling.roi_heads.roi_heads import ROI_HEADS_REGISTRY, StandardROIHeads
-from detectron2.structures import Boxes, Instances, pairwise_iou
-from detectron2.utils.events import get_event_storage
-from torch import nn
-from torch.autograd.function import Function
-
-from .custom_fast_rcnn import CustomFastRCNNOutputLayers
-
-
-@ROI_HEADS_REGISTRY.register()
-class CustomROIHeads(StandardROIHeads):
-    @classmethod
-    def _init_box_head(self, cfg, input_shape):
-        ret = super()._init_box_head(cfg, input_shape)
-        del ret["box_predictor"]
-        ret["box_predictor"] = CustomFastRCNNOutputLayers(
-            cfg, ret["box_head"].output_shape
-        )
-        self.debug = cfg.DEBUG
-        if self.debug:
-            self.debug_show_name = cfg.DEBUG_SHOW_NAME
-            self.save_debug = cfg.SAVE_DEBUG
-            self.vis_thresh = cfg.VIS_THRESH
-            self.pixel_mean = (
-                torch.Tensor(cfg.MODEL.PIXEL_MEAN)
-                .to(torch.device(cfg.MODEL.DEVICE))
-                .view(3, 1, 1)
-            )
-            self.pixel_std = (
-                torch.Tensor(cfg.MODEL.PIXEL_STD)
-                .to(torch.device(cfg.MODEL.DEVICE))
-                .view(3, 1, 1)
-            )
-        return ret
-
-    def forward(self, images, features, proposals, targets=None):
-        """
-        enable debug
-        """
-        if not self.debug:
-            del images
-        if self.training:
-            assert targets
-            proposals = self.label_and_sample_proposals(proposals, targets)
-        del targets
-
-        if self.training:
-            losses = self._forward_box(features, proposals)
-            losses.update(self._forward_mask(features, proposals))
-            losses.update(self._forward_keypoint(features, proposals))
-            return proposals, losses
-        else:
-            pred_instances = self._forward_box(features, proposals)
-            pred_instances = self.forward_with_given_boxes(features, pred_instances)
-            if self.debug:
-                from ..debug import debug_second_stage
-
-                denormalizer = lambda x: x * self.pixel_std + self.pixel_mean
-                debug_second_stage(
-                    [denormalizer(images[0].clone())],
-                    pred_instances,
-                    proposals=proposals,
-                    debug_show_name=self.debug_show_name,
-                )
-            return pred_instances, {}
-
-
-@ROI_HEADS_REGISTRY.register()
-class CustomCascadeROIHeads(CascadeROIHeads):
-    @classmethod
-    def _init_box_head(self, cfg, input_shape):
-        self.mult_proposal_score = cfg.MODEL.ROI_BOX_HEAD.MULT_PROPOSAL_SCORE
-        ret = super()._init_box_head(cfg, input_shape)
-        del ret["box_predictors"]
-        cascade_bbox_reg_weights = cfg.MODEL.ROI_BOX_CASCADE_HEAD.BBOX_REG_WEIGHTS
-        box_predictors = []
-        for box_head, bbox_reg_weights in zip(
-            ret["box_heads"], cascade_bbox_reg_weights
-        ):
-            box_predictors.append(
-                CustomFastRCNNOutputLayers(
-                    cfg,
-                    box_head.output_shape,
-                    box2box_transform=Box2BoxTransform(weights=bbox_reg_weights),
-                )
-            )
-        ret["box_predictors"] = box_predictors
-        self.debug = cfg.DEBUG
-        if self.debug:
-            self.debug_show_name = cfg.DEBUG_SHOW_NAME
-            self.save_debug = cfg.SAVE_DEBUG
-            self.vis_thresh = cfg.VIS_THRESH
-            self.pixel_mean = (
-                torch.Tensor(cfg.MODEL.PIXEL_MEAN)
-                .to(torch.device(cfg.MODEL.DEVICE))
-                .view(3, 1, 1)
-            )
-            self.pixel_std = (
-                torch.Tensor(cfg.MODEL.PIXEL_STD)
-                .to(torch.device(cfg.MODEL.DEVICE))
-                .view(3, 1, 1)
-            )
-        return ret
-
-    def _forward_box(self, features, proposals, targets=None):
-        """
-        Add mult proposal scores at testing
-        """
-        if (not self.training) and self.mult_proposal_score:
-            if len(proposals) > 0 and proposals[0].has("scores"):
-                proposal_scores = [p.get("scores") for p in proposals]
-            else:
-                proposal_scores = [p.get("objectness_logits") for p in proposals]
-
-        features = [features[f] for f in self.box_in_features]
-        head_outputs = []  # (predictor, predictions, proposals)
-        prev_pred_boxes = None
-        image_sizes = [x.image_size for x in proposals]
-        for k in range(self.num_cascade_stages):
-            if k > 0:
-                proposals = self._create_proposals_from_boxes(
-                    prev_pred_boxes, image_sizes
-                )
-                if self.training:
-                    proposals = self._match_and_label_boxes(proposals, k, targets)
-            predictions = self._run_stage(features, proposals, k)
-            prev_pred_boxes = self.box_predictor[k].predict_boxes(
-                predictions, proposals
-            )
-            head_outputs.append((self.box_predictor[k], predictions, proposals))
-
-        if self.training:
-            losses = {}
-            storage = get_event_storage()
-            for stage, (predictor, predictions, proposals) in enumerate(head_outputs):
-                with storage.name_scope("stage{}".format(stage)):
-                    stage_losses = predictor.losses(predictions, proposals)
-                losses.update(
-                    {k + "_stage{}".format(stage): v for k, v in stage_losses.items()}
-                )
-            return losses
-        else:
-            # Each is a list[Tensor] of length #image. Each tensor is Ri x (K+1)
-            scores_per_stage = [h[0].predict_probs(h[1], h[2]) for h in head_outputs]
-            scores = [
-                sum(list(scores_per_image)) * (1.0 / self.num_cascade_stages)
-                for scores_per_image in zip(*scores_per_stage)
-            ]
-
-            if self.mult_proposal_score:
-                scores = [
-                    (s * ps[:, None]) ** 0.5 for s, ps in zip(scores, proposal_scores)
-                ]
-
-            predictor, predictions, proposals = head_outputs[-1]
-            boxes = predictor.predict_boxes(predictions, proposals)
-            pred_instances, _ = fast_rcnn_inference(
-                boxes,
-                scores,
-                image_sizes,
-                predictor.test_score_thresh,
-                predictor.test_nms_thresh,
-                predictor.test_topk_per_image,
-            )
-
-            return pred_instances
-
-    def forward(self, images, features, proposals, targets=None):
-        """
-        enable debug
-        """
-        if not self.debug:
-            del images
-        if self.training:
-            proposals = self.label_and_sample_proposals(proposals, targets)
-
-        if self.training:
-            losses = self._forward_box(features, proposals, targets)
-            losses.update(self._forward_mask(features, proposals))
-            losses.update(self._forward_keypoint(features, proposals))
-            return proposals, losses
-        else:
-            # import pdb; pdb.set_trace()
-            pred_instances = self._forward_box(features, proposals)
-            pred_instances = self.forward_with_given_boxes(features, pred_instances)
-            if self.debug:
-                from ..debug import debug_second_stage
-
-                denormalizer = lambda x: x * self.pixel_std + self.pixel_mean
-                debug_second_stage(
-                    [denormalizer(x.clone()) for x in images],
-                    pred_instances,
-                    proposals=proposals,
-                    save_debug=self.save_debug,
-                    debug_show_name=self.debug_show_name,
-                    vis_thresh=self.vis_thresh,
-                )
-            return pred_instances, {}
diff --git a/eval/vbench/third_party/grit_src/centernet2/centernet/modeling/roi_heads/fed_loss.py b/eval/vbench/third_party/grit_src/centernet2/centernet/modeling/roi_heads/fed_loss.py
deleted file mode 100644
index 3745062e..00000000
--- a/eval/vbench/third_party/grit_src/centernet2/centernet/modeling/roi_heads/fed_loss.py
+++ /dev/null
@@ -1,33 +0,0 @@
-import json
-
-import numpy as np
-import torch
-from torch.nn import functional as F
-
-
-def load_class_freq(path="datasets/lvis/lvis_v1_train_cat_info.json", freq_weight=0.5):
-    cat_info = json.load(open(path, "r"))
-    cat_info = torch.tensor(
-        [c["image_count"] for c in sorted(cat_info, key=lambda x: x["id"])]
-    )
-    freq_weight = cat_info.float() ** freq_weight
-    return freq_weight
-
-
-def get_fed_loss_inds(
-    gt_classes, num_sample_cats=50, C=1203, weight=None, fed_cls_inds=-1
-):
-    appeared = torch.unique(gt_classes)  # C'
-    prob = appeared.new_ones(C + 1).float()
-    prob[-1] = 0
-    if len(appeared) < num_sample_cats:
-        if weight is not None:
-            prob[:C] = weight.float().clone()
-        prob[appeared] = 0
-        if fed_cls_inds > 0:
-            prob[fed_cls_inds:] = 0
-        more_appeared = torch.multinomial(
-            prob, num_sample_cats - len(appeared), replacement=False
-        )
-        appeared = torch.cat([appeared, more_appeared])
-    return appeared
diff --git a/eval/vbench/third_party/grit_src/centernet2/centernet2_docs/MODEL_ZOO.md b/eval/vbench/third_party/grit_src/centernet2/centernet2_docs/MODEL_ZOO.md
deleted file mode 100644
index 19afa67e..00000000
--- a/eval/vbench/third_party/grit_src/centernet2/centernet2_docs/MODEL_ZOO.md
+++ /dev/null
@@ -1,73 +0,0 @@
-# MODEL_ZOO
-
-### Common settings and notes
-
-- Multiscale training is used by default in all models. The results are all reported using single-scale testing.
-- We report runtime on our local workstation with a TitanXp GPU and a Titan RTX GPU.
-- All models are trained on 8-GPU servers by default. The 1280 models are trained on 24G GPUs. Reducing the batchsize with the linear learning rate rule should be fine.
-- All models can be downloaded directly from [Google drive](https://drive.google.com/drive/folders/1eae1cTX8tvIaCeof36sBgxrXEXALYlf-?usp=sharing).
-
-
-## COCO
-
-### CenterNet
-
-| Model                                     | val mAP | FPS (Titan Xp/ Titan RTX) | links     |
-|-------------------------------------------|---------|---------|-----------|
-| CenterNet-S4_DLA_8x                       |  42.5   | 50 / 71 |[config](../configs/CenterNet-S4_DLA_8x.yaml)/[model](https://drive.google.com/file/d/1lNBhVHnZAEBRD66MFaHjm5Ij6Z4KYrJq/view?usp=sharing)|
-| CenterNet-FPN_R50_1x                      |  40.2   | 20 / 24 |[config](../configs/CenterNet-FPN_R50_1x.yaml)/[model](https://drive.google.com/file/d/1rVG1YTthMXvutC6jr9KoE2DthT5-jhGj/view?usp=sharing)|
-
-#### Note
-
-- `CenterNet-S4_DLA_8x` is a re-implemented version of the original CenterNet (stride 4), with several changes, including
-  - Using top-left-right-bottom box encoding and GIoU Loss; adding regression loss to the center 3x3 region.
-  - Adding more positive pixels for the heatmap loss whose regression loss is small and is within the center3x3 region.
-  - Using more heavy crop augmentation (EfficientDet-style crop ratio 0.1-2), and removing color augmentations.
-  - Using standard NMS instead of max pooling.
-  - Using RetinaNet-style optimizer (SGD), learning rate rule (0.01 for each batch size 16), and schedule (8x12 epochs).
-- `CenterNet-FPN_R50_1x` is a (new) FPN version of CenterNet. It includes the changes above, and assigns objects to FPN levels based on a fixed size range. The model is trained with standard short edge 640-800 multi-scale training with 12 epochs (1x).
-
-
-### CenterNet2
-
-| Model                                     | val mAP | FPS (Titan Xp/ Titan RTX) | links     |
-|-------------------------------------------|---------|---------|-----------|
-| CenterNet2-F_R50_1x                       |   41.7  | 22 / 27  |[config](../configs/CenterNet2-F_R50_1x.yaml)/[model](X)|
-| CenterNet2_R50_1x                         |  42.9   | 18 / 24 |[config](../configs/CenterNet2_R50_1x.yaml)/[model](https://drive.google.com/file/d/1Osu1J_sskt_1FaGdfJKa4vd2N71TWS9W/view?usp=sharing)|
-| CenterNet2_X101-DCN_2x                    |  49.9   | 6 / 8  |[config](../configs/CenterNet2_X101-DCN_2x.yaml)/[model](https://drive.google.com/file/d/1IHgpUHVJWpvMuFUUetgKWsw27pRNN2oK/view?usp=sharing)|
-| CenterNet2_DLA-BiFPN-P3_4x                |  43.8   | 40 / 50|[config](../configs/CenterNet2_DLA-BiFPN-P3_4x.yaml)/[model](https://drive.google.com/file/d/12GUNlDW9RmOs40UEMSiiUsk5QK_lpGsE/view?usp=sharing)|
-| CenterNet2_DLA-BiFPN-P3_24x               |  45.6   | 40 / 50  |[config](../configs/CenterNet2_DLA-BiFPN-P3_24x.yaml)/[model](https://drive.google.com/file/d/15ZES1ySxubDPzKsHPA7pYg8o_Vwmf-Mb/view?usp=sharing)|
-| CenterNet2_R2-101-DCN_896_4x              |  51.2   | 9 / 13 |[config](../configs/CenterNet2_R2-101-DCN_896_4x.yaml)/[model](https://drive.google.com/file/d/1S7_GE8ZDQBWuLEfKHkxzeF3KBsxsbABg/view?usp=sharing)|
-| CenterNet2_R2-101-DCN-BiFPN_1280_4x       |  52.9   | 6 / 8 |[config](../configs/CenterNet2_R2-101-DCN-BiFPN_1280_4x.yaml)/[model](https://drive.google.com/file/d/14EBHNMagBCNTQjOXcHoZwLYIi2lFIm7F/view?usp=sharing)|
-| CenterNet2_R2-101-DCN-BiFPN_4x+4x_1560_ST |  56.1   | 3 / 5 |[config](../configs/CenterNet2_R2-101-DCN-BiFPN_4x+4x_1560_ST.yaml)/[model](https://drive.google.com/file/d/11ww9VlOi_nhpdsU_vBAecSxBU0dR_JzW/view?usp=sharing)|
-| CenterNet2_DLA-BiFPN-P5_640_24x_ST        |  49.2   | 33 / 38 |[config](../configs/CenterNet2_DLA-BiFPN-P5_640_24x_ST.yaml)/[model](https://drive.google.com/file/d/1qsHp2HrM1u8WrtBzF5S0oCoLMz-B40wk/view?usp=sharing)|
-
-#### Note
-
-- `CenterNet2-F_R50_1x` uses Faster RCNN as the second stage. All other CenterNet2 models use Cascade RCNN as the second stage.
-- `CenterNet2_DLA-BiFPN-P3_4x` follows the same training setting as [realtime-FCOS](https://github.com/aim-uofa/AdelaiDet/blob/master/configs/FCOS-Detection/README.md).
-- `CenterNet2_DLA-BiFPN-P3_24x` is trained by repeating the `4x` schedule (starting from learning rate 0.01) 6 times.
-- R2 means [Res2Net](https://github.com/Res2Net/Res2Net-detectron2) backbone. To train Res2Net models, you need to download the ImageNet pre-trained weight [here](https://github.com/Res2Net/Res2Net-detectron2) and place it in `output/r2_101.pkl`.
-- The last 4 models in the table are trained with the EfficientDet-style resize-and-crop augmentation, instead of the default random resizing short edge in detectron2. We found this trains faster (per-iteration) and gives better performance under a long schedule.
-- `_ST` means using [self-training](https://arxiv.org/abs/2006.06882) using pseudo-labels produced by [Scaled-YOLOv4](https://github.com/WongKinYiu/ScaledYOLOv4) on COCO unlabeled images, with a hard score threshold 0.5. Our processed pseudo-labels can be downloaded [here](https://drive.google.com/file/d/1LMBjtHhLp6dYf6MjwEQmzCLWQLkmWPpw/view?usp=sharing).
-- `CenterNet2_R2-101-DCN-BiFPN_4x+4x_1560_ST` finetunes from `CenterNet2_R2-101-DCN-BiFPN_1280_4x` for an additional `4x` schedule with the self-training data. It is trained under `1280x1280` but tested under `1560x1560`.
-
-## LVIS v1
-
-| Model                                     |  val mAP box | links     |
-|-------------------------------------------|--------------|-----------|
-| LVIS_CenterNet2_R50_1x                    |  26.5        |[config](../configs/LVIS_CenterNet2_R50_1x.yaml)/[model](https://drive.google.com/file/d/1gT9e-tNw8uzEBaCadQuoOOP2TEYa4kKP/view?usp=sharing)|
-| LVIS_CenterNet2_R50_Fed_1x            |  28.3        |[config](../configs/LVIS_CenterNet2_R50_Fed_1x.yaml)/[model](https://drive.google.com/file/d/1a9UjheMCKax0qAKEwPVpq2ZHN6vpqJv8/view?usp=sharing)|
-
-- The models are trained with repeat-factor sampling.
-- `LVIS_CenterNet2_R50_Fed_1x` is CenterNet2 with our federated loss. Check our Appendix D of our [paper](https://arxiv.org/abs/2103.07461) or our [technical report at LVIS challenge](https://www.lvisdataset.org/assets/challenge_reports/2020/CenterNet2.pdf) for references.
-
-## Objects365
-
-| Model                                     |  val mAP| links     |
-|-------------------------------------------|---------|-----------|
-| O365_CenterNet2_R50_1x                    |  22.6   |[config](../configs/O365_CenterNet2_R50_1x.yaml)/[model](https://drive.google.com/file/d/18fG6xGchAlpNp5sx8RAtwadGkS-gdIBU/view?usp=sharing)|
-
-#### Note
-- Objects365 dataset can be downloaded [here](https://www.objects365.org/overview.html).
-- The model is trained with class-aware sampling.
diff --git a/eval/vbench/third_party/grit_src/centernet2/configs/Base-CenterNet-FPN.yaml b/eval/vbench/third_party/grit_src/centernet2/configs/Base-CenterNet-FPN.yaml
deleted file mode 100644
index bef3dc10..00000000
--- a/eval/vbench/third_party/grit_src/centernet2/configs/Base-CenterNet-FPN.yaml
+++ /dev/null
@@ -1,28 +0,0 @@
-MODEL:
-  META_ARCHITECTURE: "CenterNetDetector"
-  PROPOSAL_GENERATOR:
-    NAME: "CenterNet"
-  BACKBONE:
-    NAME: "build_p67_resnet_fpn_backbone"
-  WEIGHTS: "detectron2://ImageNetPretrained/MSRA/R-50.pkl"
-  RESNETS:
-    DEPTH: 50
-    OUT_FEATURES: ["res3", "res4", "res5"]
-  FPN:
-    IN_FEATURES: ["res3", "res4", "res5"]
-DATASETS:
-  TRAIN: ("coco_2017_train",)
-  TEST: ("coco_2017_val",)
-SOLVER:
-  IMS_PER_BATCH: 16
-  BASE_LR: 0.01
-  STEPS: (60000, 80000)
-  MAX_ITER: 90000
-  CHECKPOINT_PERIOD: 1000000000
-  WARMUP_ITERS: 4000
-  WARMUP_FACTOR: 0.00025
-  CLIP_GRADIENTS:
-    ENABLED: True
-INPUT:
-  MIN_SIZE_TRAIN: (640, 672, 704, 736, 768, 800)
-OUTPUT_DIR: "./output/CenterNet2/auto"
diff --git a/eval/vbench/third_party/grit_src/centernet2/configs/Base-CenterNet2.yaml b/eval/vbench/third_party/grit_src/centernet2/configs/Base-CenterNet2.yaml
deleted file mode 100644
index 68937231..00000000
--- a/eval/vbench/third_party/grit_src/centernet2/configs/Base-CenterNet2.yaml
+++ /dev/null
@@ -1,56 +0,0 @@
-MODEL:
-  META_ARCHITECTURE: "GeneralizedRCNN"
-  PROPOSAL_GENERATOR:
-    NAME: "CenterNet"
-  BACKBONE:
-    NAME: "build_p67_resnet_fpn_backbone"
-  WEIGHTS: "detectron2://ImageNetPretrained/MSRA/R-50.pkl"
-  RESNETS:
-    DEPTH: 50
-    OUT_FEATURES: ["res3", "res4", "res5"]
-  FPN:
-    IN_FEATURES: ["res3", "res4", "res5"]
-  ROI_HEADS:
-    NAME: CustomCascadeROIHeads
-    IN_FEATURES: ["p3", "p4", "p5", "p6", "p7"]
-    IOU_THRESHOLDS: [0.6]
-    NMS_THRESH_TEST: 0.7
-  ROI_BOX_CASCADE_HEAD:
-    IOUS: [0.6, 0.7, 0.8]
-  ROI_BOX_HEAD:
-    NAME: "FastRCNNConvFCHead"
-    NUM_FC: 2
-    POOLER_RESOLUTION: 7
-    CLS_AGNOSTIC_BBOX_REG: True
-    MULT_PROPOSAL_SCORE: True
-  CENTERNET:
-    REG_WEIGHT: 1.
-    NOT_NORM_REG: True
-    ONLY_PROPOSAL: True
-    WITH_AGN_HM: True
-    INFERENCE_TH: 0.0001
-    PRE_NMS_TOPK_TRAIN: 4000
-    POST_NMS_TOPK_TRAIN: 2000
-    PRE_NMS_TOPK_TEST: 1000
-    POST_NMS_TOPK_TEST: 256
-    NMS_TH_TRAIN: 0.9
-    NMS_TH_TEST: 0.9
-    POS_WEIGHT: 0.5
-    NEG_WEIGHT: 0.5
-    IGNORE_HIGH_FP: 0.85
-DATASETS:
-  TRAIN: ("coco_2017_train",)
-  TEST: ("coco_2017_val",)
-SOLVER:
-  IMS_PER_BATCH: 16
-  BASE_LR: 0.02
-  STEPS: (60000, 80000)
-  MAX_ITER: 90000
-  CHECKPOINT_PERIOD: 1000000000
-  WARMUP_ITERS: 4000
-  WARMUP_FACTOR: 0.00025
-  CLIP_GRADIENTS:
-    ENABLED: True
-INPUT:
-  MIN_SIZE_TRAIN: (640, 672, 704, 736, 768, 800)
-OUTPUT_DIR: "./output/CenterNet2/auto"
diff --git a/eval/vbench/third_party/grit_src/centernet2/configs/Base_S4_DLA.yaml b/eval/vbench/third_party/grit_src/centernet2/configs/Base_S4_DLA.yaml
deleted file mode 100644
index 7e01be7e..00000000
--- a/eval/vbench/third_party/grit_src/centernet2/configs/Base_S4_DLA.yaml
+++ /dev/null
@@ -1,40 +0,0 @@
-MODEL:
-  META_ARCHITECTURE: "CenterNetDetector"
-  PROPOSAL_GENERATOR:
-    NAME: "CenterNet"
-  PIXEL_STD: [57.375, 57.120, 58.395]
-  BACKBONE:
-    NAME: "build_dla_backbone"
-  DLA:
-    NORM: "BN"
-  CENTERNET:
-    IN_FEATURES: ["dla2"]
-    FPN_STRIDES: [4]
-    SOI: [[0, 1000000]]
-    NUM_CLS_CONVS: 1
-    NUM_BOX_CONVS: 1
-    REG_WEIGHT: 1.
-    MORE_POS: True
-    HM_FOCAL_ALPHA: 0.25
-DATASETS:
-  TRAIN: ("coco_2017_train",)
-  TEST: ("coco_2017_val",)
-SOLVER:
-  LR_SCHEDULER_NAME: "WarmupCosineLR"
-  MAX_ITER: 90000
-  BASE_LR: 0.04
-  IMS_PER_BATCH: 64
-  WEIGHT_DECAY: 0.0001
-  CHECKPOINT_PERIOD: 1000000
-  CLIP_GRADIENTS:
-    ENABLED: True
-INPUT:
-  CUSTOM_AUG: EfficientDetResizeCrop
-  TRAIN_SIZE: 640
-  MIN_SIZE_TEST: 608
-  MAX_SIZE_TEST: 900
-TEST:
-  EVAL_PERIOD: 7500
-DATALOADER:
-  NUM_WORKERS: 8
-OUTPUT_DIR: "output/CenterNet2/auto"
diff --git a/eval/vbench/third_party/grit_src/centernet2/configs/CenterNet-FPN_R50_1x.yaml b/eval/vbench/third_party/grit_src/centernet2/configs/CenterNet-FPN_R50_1x.yaml
deleted file mode 100644
index 811a5096..00000000
--- a/eval/vbench/third_party/grit_src/centernet2/configs/CenterNet-FPN_R50_1x.yaml
+++ /dev/null
@@ -1,4 +0,0 @@
-_BASE_: "Base-CenterNet-FPN.yaml"
-MODEL:
-  CENTERNET:
-    MORE_POS: True
diff --git a/eval/vbench/third_party/grit_src/centernet2/configs/CenterNet-S4_DLA_8x.yaml b/eval/vbench/third_party/grit_src/centernet2/configs/CenterNet-S4_DLA_8x.yaml
deleted file mode 100644
index 68665a2c..00000000
--- a/eval/vbench/third_party/grit_src/centernet2/configs/CenterNet-S4_DLA_8x.yaml
+++ /dev/null
@@ -1,5 +0,0 @@
-_BASE_: "Base_S4_DLA.yaml"
-SOLVER:
-  MAX_ITER: 90000
-  BASE_LR: 0.08
-  IMS_PER_BATCH: 128
diff --git a/eval/vbench/third_party/grit_src/centernet2/configs/CenterNet2-F_R50_1x.yaml b/eval/vbench/third_party/grit_src/centernet2/configs/CenterNet2-F_R50_1x.yaml
deleted file mode 100644
index 8d0bfaf3..00000000
--- a/eval/vbench/third_party/grit_src/centernet2/configs/CenterNet2-F_R50_1x.yaml
+++ /dev/null
@@ -1,4 +0,0 @@
-_BASE_: "Base-CenterNet2.yaml"
-MODEL:
-  ROI_HEADS:
-    NAME: CustomROIHeads
diff --git a/eval/vbench/third_party/grit_src/centernet2/configs/CenterNet2_DLA-BiFPN-P3_24x.yaml b/eval/vbench/third_party/grit_src/centernet2/configs/CenterNet2_DLA-BiFPN-P3_24x.yaml
deleted file mode 100644
index 9bf4de3a..00000000
--- a/eval/vbench/third_party/grit_src/centernet2/configs/CenterNet2_DLA-BiFPN-P3_24x.yaml
+++ /dev/null
@@ -1,36 +0,0 @@
-_BASE_: "Base-CenterNet2.yaml"
-MODEL:
-  BACKBONE:
-    NAME: "build_p35_fcos_dla_bifpn_backbone"
-  BIFPN:
-    OUT_CHANNELS: 160
-    NUM_LEVELS: 3
-    NUM_BIFPN: 4
-  DLA:
-    NUM_LAYERS: 34
-    NORM: "SyncBN"
-  FPN:
-    IN_FEATURES: ["dla3", "dla4", "dla5"]
-  ROI_HEADS:
-    IN_FEATURES: ["p3", "p4", "p5"]
-  CENTERNET:
-    POST_NMS_TOPK_TEST: 128
-    FPN_STRIDES: [8, 16, 32]
-    IN_FEATURES: ['p3', 'p4', 'p5']
-    SOI: [[0, 64], [48, 192], [128, 1000000]]
-DATASETS:
-  TRAIN: ("coco_2017_train",)
-  TEST: ("coco_2017_val",)
-SOLVER:
-  IMS_PER_BATCH: 16
-  BASE_LR: 0.02
-  STEPS: (300000, 340000)
-  MAX_ITER: 360000
-  CHECKPOINT_PERIOD: 100000
-  WARMUP_ITERS: 4000
-  WARMUP_FACTOR: 0.00025
-INPUT:
-  MIN_SIZE_TRAIN: (256, 288, 320, 352, 384, 416, 448, 480, 512, 544, 576, 608)
-  MAX_SIZE_TRAIN: 900
-  MAX_SIZE_TEST: 736
-  MIN_SIZE_TEST: 512
diff --git a/eval/vbench/third_party/grit_src/centernet2/configs/CenterNet2_DLA-BiFPN-P3_4x.yaml b/eval/vbench/third_party/grit_src/centernet2/configs/CenterNet2_DLA-BiFPN-P3_4x.yaml
deleted file mode 100644
index 9bf4de3a..00000000
--- a/eval/vbench/third_party/grit_src/centernet2/configs/CenterNet2_DLA-BiFPN-P3_4x.yaml
+++ /dev/null
@@ -1,36 +0,0 @@
-_BASE_: "Base-CenterNet2.yaml"
-MODEL:
-  BACKBONE:
-    NAME: "build_p35_fcos_dla_bifpn_backbone"
-  BIFPN:
-    OUT_CHANNELS: 160
-    NUM_LEVELS: 3
-    NUM_BIFPN: 4
-  DLA:
-    NUM_LAYERS: 34
-    NORM: "SyncBN"
-  FPN:
-    IN_FEATURES: ["dla3", "dla4", "dla5"]
-  ROI_HEADS:
-    IN_FEATURES: ["p3", "p4", "p5"]
-  CENTERNET:
-    POST_NMS_TOPK_TEST: 128
-    FPN_STRIDES: [8, 16, 32]
-    IN_FEATURES: ['p3', 'p4', 'p5']
-    SOI: [[0, 64], [48, 192], [128, 1000000]]
-DATASETS:
-  TRAIN: ("coco_2017_train",)
-  TEST: ("coco_2017_val",)
-SOLVER:
-  IMS_PER_BATCH: 16
-  BASE_LR: 0.02
-  STEPS: (300000, 340000)
-  MAX_ITER: 360000
-  CHECKPOINT_PERIOD: 100000
-  WARMUP_ITERS: 4000
-  WARMUP_FACTOR: 0.00025
-INPUT:
-  MIN_SIZE_TRAIN: (256, 288, 320, 352, 384, 416, 448, 480, 512, 544, 576, 608)
-  MAX_SIZE_TRAIN: 900
-  MAX_SIZE_TEST: 736
-  MIN_SIZE_TEST: 512
diff --git a/eval/vbench/third_party/grit_src/centernet2/configs/CenterNet2_DLA-BiFPN-P5_640_16x.yaml b/eval/vbench/third_party/grit_src/centernet2/configs/CenterNet2_DLA-BiFPN-P5_640_16x.yaml
deleted file mode 100644
index 80413a62..00000000
--- a/eval/vbench/third_party/grit_src/centernet2/configs/CenterNet2_DLA-BiFPN-P5_640_16x.yaml
+++ /dev/null
@@ -1,29 +0,0 @@
-_BASE_: "Base-CenterNet2.yaml"
-MODEL:
-  BACKBONE:
-    NAME: "build_p37_dla_bifpn_backbone"
-  BIFPN:
-    OUT_CHANNELS: 160
-    NUM_LEVELS: 5
-    NUM_BIFPN: 3
-  CENTERNET:
-    POST_NMS_TOPK_TEST: 128
-  WEIGHTS: ''
-  PIXEL_MEAN: [123.675, 116.280, 103.530]
-  PIXEL_STD: [58.395, 57.12, 57.375]
-  FPN:
-    IN_FEATURES: ["dla3", "dla4", "dla5"]
-SOLVER:
-  LR_SCHEDULER_NAME: "WarmupCosineLR"
-  MAX_ITER: 360000
-  BASE_LR: 0.08
-  IMS_PER_BATCH: 64
-  CHECKPOINT_PERIOD: 90000
-TEST:
-  EVAL_PERIOD: 7500
-INPUT:
-  FORMAT: RGB
-  CUSTOM_AUG: EfficientDetResizeCrop
-  TRAIN_SIZE: 640
-  MIN_SIZE_TEST: 608
-  MAX_SIZE_TEST: 900
diff --git a/eval/vbench/third_party/grit_src/centernet2/configs/CenterNet2_DLA-BiFPN-P5_640_16x_ST.yaml b/eval/vbench/third_party/grit_src/centernet2/configs/CenterNet2_DLA-BiFPN-P5_640_16x_ST.yaml
deleted file mode 100644
index 8813b39c..00000000
--- a/eval/vbench/third_party/grit_src/centernet2/configs/CenterNet2_DLA-BiFPN-P5_640_16x_ST.yaml
+++ /dev/null
@@ -1,30 +0,0 @@
-_BASE_: "Base-CenterNet2.yaml"
-MODEL:
-  BACKBONE:
-    NAME: "build_p37_dla_bifpn_backbone"
-  BIFPN:
-    OUT_CHANNELS: 160
-    NUM_LEVELS: 5
-    NUM_BIFPN: 3
-  CENTERNET:
-    POST_NMS_TOPK_TEST: 128
-  WEIGHTS: ''
-  PIXEL_MEAN: [123.675, 116.280, 103.530]
-  PIXEL_STD: [58.395, 57.12, 57.375]
-  FPN:
-    IN_FEATURES: ["dla3", "dla4", "dla5"]
-SOLVER:
-  LR_SCHEDULER_NAME: "WarmupCosineLR"
-  MAX_ITER: 360000
-  BASE_LR: 0.08
-  IMS_PER_BATCH: 64
-TEST:
-  EVAL_PERIOD: 7500
-INPUT:
-  FORMAT: RGB
-  CUSTOM_AUG: EfficientDetResizeCrop
-  TRAIN_SIZE: 640
-  MIN_SIZE_TEST: 608
-  MAX_SIZE_TEST: 900
-DATASETS:
-  TRAIN: ("coco_2017_train","coco_un_yolov4_55_0.5",)
diff --git a/eval/vbench/third_party/grit_src/centernet2/configs/CenterNet2_DLA-fcosBiFPN-P5_640_16x_ST.yaml b/eval/vbench/third_party/grit_src/centernet2/configs/CenterNet2_DLA-fcosBiFPN-P5_640_16x_ST.yaml
deleted file mode 100644
index f94f1358..00000000
--- a/eval/vbench/third_party/grit_src/centernet2/configs/CenterNet2_DLA-fcosBiFPN-P5_640_16x_ST.yaml
+++ /dev/null
@@ -1,30 +0,0 @@
-_BASE_: "Base-CenterNet2.yaml"
-MODEL:
-  BACKBONE:
-    NAME: "build_p37_fcos_dla_bifpn_backbone"
-  BIFPN:
-    OUT_CHANNELS: 160
-    NUM_LEVELS: 5
-    NUM_BIFPN: 3
-  CENTERNET:
-    POST_NMS_TOPK_TEST: 128
-  WEIGHTS: ''
-  PIXEL_MEAN: [123.675, 116.280, 103.530]
-  PIXEL_STD: [58.395, 57.12, 57.375]
-  FPN:
-    IN_FEATURES: ["dla3", "dla4", "dla5"]
-TEST:
-  EVAL_PERIOD: 7500
-SOLVER:
-  LR_SCHEDULER_NAME: "WarmupCosineLR"
-  MAX_ITER: 360000
-  BASE_LR: 0.08
-  IMS_PER_BATCH: 64
-INPUT:
-  FORMAT: RGB
-  CUSTOM_AUG: EfficientDetResizeCrop
-  TRAIN_SIZE: 640
-  MIN_SIZE_TEST: 608
-  MAX_SIZE_TEST: 900
-DATASETS:
-  TRAIN: ("coco_2017_train","coco_un_yolov4_55_0.5",)
diff --git a/eval/vbench/third_party/grit_src/centernet2/configs/CenterNet2_R2-101-DCN-BiFPN_1280_4x.yaml b/eval/vbench/third_party/grit_src/centernet2/configs/CenterNet2_R2-101-DCN-BiFPN_1280_4x.yaml
deleted file mode 100644
index e07574b3..00000000
--- a/eval/vbench/third_party/grit_src/centernet2/configs/CenterNet2_R2-101-DCN-BiFPN_1280_4x.yaml
+++ /dev/null
@@ -1,32 +0,0 @@
-_BASE_: "Base-CenterNet2.yaml"
-MODEL:
-  BACKBONE:
-    NAME: "build_res2net_bifpn_backbone"
-  BIFPN:
-    NUM_BIFPN: 7
-    OUT_CHANNELS: 288
-  WEIGHTS: "output/r2_101.pkl"
-  RESNETS:
-    DEPTH: 101
-    WIDTH_PER_GROUP: 26
-    DEFORM_ON_PER_STAGE: [False, False, True, True] # on Res4, Res5
-    DEFORM_MODULATED: True
-  PIXEL_MEAN: [123.675, 116.280, 103.530]
-  PIXEL_STD: [58.395, 57.12, 57.375]
-  CENTERNET:
-    USE_DEFORMABLE: True
-  ROI_HEADS:
-    IN_FEATURES: ["p3", "p4"]
-INPUT:
-  FORMAT: RGB
-TEST:
-  EVAL_PERIOD: 7500
-SOLVER:
-  MAX_ITER: 180000
-  CHECKPOINT_PERIOD: 60000
-  LR_SCHEDULER_NAME: "WarmupCosineLR"
-  BASE_LR: 0.04
-  IMS_PER_BATCH: 32
-INPUT:
-  CUSTOM_AUG: EfficientDetResizeCrop
-  TRAIN_SIZE: 1280
diff --git a/eval/vbench/third_party/grit_src/centernet2/configs/CenterNet2_R2-101-DCN-BiFPN_4x+4x_1560_ST.yaml b/eval/vbench/third_party/grit_src/centernet2/configs/CenterNet2_R2-101-DCN-BiFPN_4x+4x_1560_ST.yaml
deleted file mode 100644
index e1185c55..00000000
--- a/eval/vbench/third_party/grit_src/centernet2/configs/CenterNet2_R2-101-DCN-BiFPN_4x+4x_1560_ST.yaml
+++ /dev/null
@@ -1,35 +0,0 @@
-_BASE_: "Base-CenterNet2.yaml"
-MODEL:
-  BACKBONE:
-    NAME: "build_res2net_bifpn_backbone"
-  BIFPN:
-    NUM_BIFPN: 7
-    OUT_CHANNELS: 288
-  WEIGHTS: "output/r2_101.pkl"
-  RESNETS:
-    DEPTH: 101
-    WIDTH_PER_GROUP: 26
-    DEFORM_ON_PER_STAGE: [False, False, True, True] # on Res4, Res5
-    DEFORM_MODULATED: True
-  PIXEL_MEAN: [123.675, 116.280, 103.530]
-  PIXEL_STD: [58.395, 57.12, 57.375]
-  CENTERNET:
-    USE_DEFORMABLE: True
-  ROI_HEADS:
-    IN_FEATURES: ["p3", "p4"]
-TEST:
-  EVAL_PERIOD: 7500
-SOLVER:
-  MAX_ITER: 180000
-  CHECKPOINT_PERIOD: 7500
-  LR_SCHEDULER_NAME: "WarmupCosineLR"
-  BASE_LR: 0.04
-  IMS_PER_BATCH: 32
-DATASETS:
-  TRAIN: "('coco_2017_train', 'coco_un_yolov4_55_0.5')"
-INPUT:
-  FORMAT: RGB
-  CUSTOM_AUG: EfficientDetResizeCrop
-  TRAIN_SIZE: 1280
-  TEST_SIZE: 1560
-  TEST_INPUT_TYPE: 'square'
diff --git a/eval/vbench/third_party/grit_src/centernet2/configs/CenterNet2_R2-101-DCN_896_4x.yaml b/eval/vbench/third_party/grit_src/centernet2/configs/CenterNet2_R2-101-DCN_896_4x.yaml
deleted file mode 100644
index 5f6fe5ef..00000000
--- a/eval/vbench/third_party/grit_src/centernet2/configs/CenterNet2_R2-101-DCN_896_4x.yaml
+++ /dev/null
@@ -1,29 +0,0 @@
-_BASE_: "Base-CenterNet2.yaml"
-MODEL:
-  BACKBONE:
-    NAME: "build_p67_res2net_fpn_backbone"
-  WEIGHTS: "output/r2_101.pkl"
-  RESNETS:
-    DEPTH: 101
-    WIDTH_PER_GROUP: 26
-    DEFORM_ON_PER_STAGE: [False, False, True, True] # on Res4, Res5
-    DEFORM_MODULATED: True
-  PIXEL_MEAN: [123.675, 116.280, 103.530]
-  PIXEL_STD: [58.395, 57.12, 57.375]
-  CENTERNET:
-    USE_DEFORMABLE: True
-  ROI_HEADS:
-    IN_FEATURES: ["p3", "p4"]
-INPUT:
-  FORMAT: RGB
-TEST:
-  EVAL_PERIOD: 7500
-SOLVER:
-  MAX_ITER: 180000
-  CHECKPOINT_PERIOD: 600000
-  LR_SCHEDULER_NAME: "WarmupCosineLR"
-  BASE_LR: 0.04
-  IMS_PER_BATCH: 32
-INPUT:
-  CUSTOM_AUG: EfficientDetResizeCrop
-  TRAIN_SIZE: 896
diff --git a/eval/vbench/third_party/grit_src/centernet2/configs/CenterNet2_R50_1x.yaml b/eval/vbench/third_party/grit_src/centernet2/configs/CenterNet2_R50_1x.yaml
deleted file mode 100644
index 9dcdf5b8..00000000
--- a/eval/vbench/third_party/grit_src/centernet2/configs/CenterNet2_R50_1x.yaml
+++ /dev/null
@@ -1 +0,0 @@
-_BASE_: "Base-CenterNet2.yaml"
diff --git a/eval/vbench/third_party/grit_src/centernet2/configs/CenterNet2_X101-DCN_2x.yaml b/eval/vbench/third_party/grit_src/centernet2/configs/CenterNet2_X101-DCN_2x.yaml
deleted file mode 100644
index 009c6808..00000000
--- a/eval/vbench/third_party/grit_src/centernet2/configs/CenterNet2_X101-DCN_2x.yaml
+++ /dev/null
@@ -1,22 +0,0 @@
-_BASE_: "Base-CenterNet2.yaml"
-MODEL:
-  CENTERNET:
-    USE_DEFORMABLE: True
-  WEIGHTS: "detectron2://ImageNetPretrained/FAIR/X-101-32x8d.pkl"
-  PIXEL_STD: [57.375, 57.120, 58.395]
-  RESNETS:
-    STRIDE_IN_1X1: False
-    NUM_GROUPS: 32
-    WIDTH_PER_GROUP: 8
-    DEPTH: 101
-    DEFORM_ON_PER_STAGE: [False, False, True, True] # on Res4, Res5
-    DEFORM_MODULATED: True
-  ROI_HEADS:
-    IN_FEATURES: ["p3", "p4"]
-SOLVER:
-  STEPS: (120000, 160000)
-  MAX_ITER: 180000
-  CHECKPOINT_PERIOD: 40000
-INPUT:
-  MIN_SIZE_TRAIN: (480, 960)
-  MIN_SIZE_TRAIN_SAMPLING: "range"
diff --git a/eval/vbench/third_party/grit_src/centernet2/configs/LVIS_CenterNet2_R50_1x.yaml b/eval/vbench/third_party/grit_src/centernet2/configs/LVIS_CenterNet2_R50_1x.yaml
deleted file mode 100644
index c5338aca..00000000
--- a/eval/vbench/third_party/grit_src/centernet2/configs/LVIS_CenterNet2_R50_1x.yaml
+++ /dev/null
@@ -1,17 +0,0 @@
-_BASE_: "Base-CenterNet2.yaml"
-MODEL:
-  ROI_HEADS:
-    NUM_CLASSES: 1203
-    SCORE_THRESH_TEST: 0.02
-    NMS_THRESH_TEST: 0.5
-  CENTERNET:
-    NUM_CLASSES: 1203
-
-DATASETS:
-  TRAIN: ("lvis_v1_train",)
-  TEST: ("lvis_v1_val",)
-DATALOADER:
-  SAMPLER_TRAIN: "RepeatFactorTrainingSampler"
-  REPEAT_THRESHOLD: 0.001
-TEST:
-  DETECTIONS_PER_IMAGE: 300
diff --git a/eval/vbench/third_party/grit_src/centernet2/configs/LVIS_CenterNet2_R50_Fed_1x.yaml b/eval/vbench/third_party/grit_src/centernet2/configs/LVIS_CenterNet2_R50_Fed_1x.yaml
deleted file mode 100644
index d6b6c823..00000000
--- a/eval/vbench/third_party/grit_src/centernet2/configs/LVIS_CenterNet2_R50_Fed_1x.yaml
+++ /dev/null
@@ -1,19 +0,0 @@
-_BASE_: "Base-CenterNet2.yaml"
-MODEL:
-  ROI_HEADS:
-    NUM_CLASSES: 1203
-    SCORE_THRESH_TEST: 0.02
-    NMS_THRESH_TEST: 0.5
-  CENTERNET:
-    NUM_CLASSES: 1203
-  ROI_BOX_HEAD:
-    USE_SIGMOID_CE: True
-    USE_FED_LOSS: True
-DATASETS:
-  TRAIN: ("lvis_v1_train",)
-  TEST: ("lvis_v1_val",)
-DATALOADER:
-  SAMPLER_TRAIN: "RepeatFactorTrainingSampler"
-  REPEAT_THRESHOLD: 0.001
-TEST:
-  DETECTIONS_PER_IMAGE: 300
diff --git a/eval/vbench/third_party/grit_src/centernet2/configs/O365_CenterNet2_R50_1x.yaml b/eval/vbench/third_party/grit_src/centernet2/configs/O365_CenterNet2_R50_1x.yaml
deleted file mode 100644
index 9ef16f6c..00000000
--- a/eval/vbench/third_party/grit_src/centernet2/configs/O365_CenterNet2_R50_1x.yaml
+++ /dev/null
@@ -1,13 +0,0 @@
-_BASE_: "Base-CenterNet2.yaml"
-MODEL:
-  ROI_HEADS:
-    NUM_CLASSES: 365
-  CENTERNET:
-    NUM_CLASSES: 365
-DATASETS:
-  TRAIN: ("objects365_train",)
-  TEST: ("objects365_val",)
-DATALOADER:
-  SAMPLER_TRAIN: "ClassAwareSampler"
-TEST:
-  DETECTIONS_PER_IMAGE: 300
diff --git a/eval/vbench/third_party/grit_src/centernet2/configs/nuImages_CenterNet2_DLA_640_8x.yaml b/eval/vbench/third_party/grit_src/centernet2/configs/nuImages_CenterNet2_DLA_640_8x.yaml
deleted file mode 100644
index c400e92c..00000000
--- a/eval/vbench/third_party/grit_src/centernet2/configs/nuImages_CenterNet2_DLA_640_8x.yaml
+++ /dev/null
@@ -1,42 +0,0 @@
-_BASE_: "Base-CenterNet2.yaml"
-MODEL:
-  MASK_ON: True
-  ROI_MASK_HEAD:
-    NAME: "MaskRCNNConvUpsampleHead"
-    NUM_CONV: 4
-    POOLER_RESOLUTION: 14
-  ROI_HEADS:
-    NUM_CLASSES: 10
-    IN_FEATURES: ["dla2"]
-  BACKBONE:
-    NAME: "build_dla_backbone"
-  DLA:
-    NORM: "BN"
-  CENTERNET:
-    IN_FEATURES: ["dla2"]
-    FPN_STRIDES: [4]
-    SOI: [[0, 1000000]]
-    NUM_CLS_CONVS: 1
-    NUM_BOX_CONVS: 1
-    REG_WEIGHT: 1.
-    MORE_POS: True
-    HM_FOCAL_ALPHA: 0.25
-    POST_NMS_TOPK_TEST: 128
-  WEIGHTS: ''
-  PIXEL_MEAN: [123.675, 116.280, 103.530]
-  PIXEL_STD: [58.395, 57.12, 57.375]
-SOLVER:
-  MAX_ITER: 180000
-  STEPS: (120000, 160000)
-  BASE_LR: 0.08
-  IMS_PER_BATCH: 64
-INPUT:
-  FORMAT: RGB
-  CUSTOM_AUG: EfficientDetResizeCrop
-  TRAIN_SIZE: 640
-  MIN_SIZE_TEST: 608
-  MAX_SIZE_TEST: 900
-  MASK_FORMAT: bitmask
-DATASETS:
-  TRAIN: ("nuimages_train",)
-  TEST: ("nuimages_val",)
diff --git a/eval/vbench/third_party/grit_src/centernet2/predictor.py b/eval/vbench/third_party/grit_src/centernet2/predictor.py
deleted file mode 100644
index 58e72571..00000000
--- a/eval/vbench/third_party/grit_src/centernet2/predictor.py
+++ /dev/null
@@ -1,255 +0,0 @@
-# Copyright (c) Facebook, Inc. and its affiliates. All Rights Reserved
-import atexit
-import bisect
-import multiprocessing as mp
-from collections import deque
-
-import cv2
-import torch
-from detectron2.data import MetadataCatalog
-from detectron2.engine.defaults import DefaultPredictor
-from detectron2.utils.video_visualizer import VideoVisualizer
-from detectron2.utils.visualizer import ColorMode, Visualizer
-
-
-class VisualizationDemo(object):
-    def __init__(self, cfg, instance_mode=ColorMode.IMAGE, parallel=False):
-        """
-        Args:
-            cfg (CfgNode):
-            instance_mode (ColorMode):
-            parallel (bool): whether to run the model in different processes from visualization.
-                Useful since the visualization logic can be slow.
-        """
-        self.metadata = MetadataCatalog.get(
-            cfg.DATASETS.TRAIN[0] if len(cfg.DATASETS.TRAIN) else "__unused"
-        )
-        self.cpu_device = torch.device("cpu")
-        self.instance_mode = instance_mode
-
-        self.parallel = parallel
-        if parallel:
-            num_gpu = torch.cuda.device_count()
-            self.predictor = AsyncPredictor(cfg, num_gpus=num_gpu)
-        else:
-            self.predictor = DefaultPredictor(cfg)
-
-    def run_on_image(self, image, visualizer=None):
-        """
-        Args:
-            image (np.ndarray): an image of shape (H, W, C) (in BGR order).
-                This is the format used by OpenCV.
-
-        Returns:
-            predictions (dict): the output of the model.
-            vis_output (VisImage): the visualized image output.
-        """
-        vis_output = None
-        predictions = self.predictor(image)
-        # Convert image from OpenCV BGR format to Matplotlib RGB format.
-        image = image[:, :, ::-1]
-        use_video_vis = True
-        if visualizer is None:
-            use_video_vis = False
-            visualizer = Visualizer(
-                image, self.metadata, instance_mode=self.instance_mode
-            )
-        if "panoptic_seg" in predictions:
-            panoptic_seg, segments_info = predictions["panoptic_seg"]
-            vis_output = visualizer.draw_panoptic_seg_predictions(
-                panoptic_seg.to(self.cpu_device), segments_info
-            )
-        else:
-            if "sem_seg" in predictions:
-                vis_output = visualizer.draw_sem_seg(
-                    predictions["sem_seg"].argmax(dim=0).to(self.cpu_device)
-                )
-            if "instances" in predictions:
-                instances = predictions["instances"].to(self.cpu_device)
-                if use_video_vis:
-                    vis_output = visualizer.draw_instance_predictions(
-                        image, predictions=instances
-                    )
-                else:
-                    vis_output = visualizer.draw_instance_predictions(
-                        predictions=instances
-                    )
-            elif "proposals" in predictions:
-                instances = predictions["proposals"].to(self.cpu_device)
-                instances.pred_boxes = instances.proposal_boxes
-                instances.scores = instances.objectness_logits
-                instances.pred_classes[:] = -1
-                if use_video_vis:
-                    vis_output = visualizer.draw_instance_predictions(
-                        image, predictions=instances
-                    )
-                else:
-                    vis_output = visualizer.draw_instance_predictions(
-                        predictions=instances
-                    )
-
-        return predictions, vis_output
-
-    def _frame_from_video(self, video):
-        while video.isOpened():
-            success, frame = video.read()
-            if success:
-                yield frame
-            else:
-                break
-
-    def run_on_video(self, video):
-        """
-        Visualizes predictions on frames of the input video.
-
-        Args:
-            video (cv2.VideoCapture): a :class:`VideoCapture` object, whose source can be
-                either a webcam or a video file.
-
-        Yields:
-            ndarray: BGR visualizations of each video frame.
-        """
-        video_visualizer = VideoVisualizer(self.metadata, self.instance_mode)
-
-        def process_predictions(frame, predictions):
-            frame = cv2.cvtColor(frame, cv2.COLOR_RGB2BGR)
-            if "panoptic_seg" in predictions:
-                panoptic_seg, segments_info = predictions["panoptic_seg"]
-                vis_frame = video_visualizer.draw_panoptic_seg_predictions(
-                    frame, panoptic_seg.to(self.cpu_device), segments_info
-                )
-            elif "instances" in predictions:
-                predictions = predictions["instances"].to(self.cpu_device)
-                vis_frame = video_visualizer.draw_instance_predictions(
-                    frame, predictions
-                )
-            elif "sem_seg" in predictions:
-                vis_frame = video_visualizer.draw_sem_seg(
-                    frame, predictions["sem_seg"].argmax(dim=0).to(self.cpu_device)
-                )
-            elif "proposals" in predictions:
-                predictions = predictions["proposals"].to(self.cpu_device)
-                predictions.pred_boxes = predictions.proposal_boxes
-                predictions.scores = predictions.objectness_logits
-                predictions.pred_classes[:] = -1
-                vis_frame = video_visualizer.draw_instance_predictions(
-                    frame, predictions
-                )
-
-            # Converts Matplotlib RGB format to OpenCV BGR format
-            vis_frame = cv2.cvtColor(vis_frame.get_image(), cv2.COLOR_RGB2BGR)
-            return vis_frame
-
-        frame_gen = self._frame_from_video(video)
-        if self.parallel:
-            buffer_size = self.predictor.default_buffer_size
-
-            frame_data = deque()
-
-            for cnt, frame in enumerate(frame_gen):
-                frame_data.append(frame)
-                self.predictor.put(frame)
-
-                if cnt >= buffer_size:
-                    frame = frame_data.popleft()
-                    predictions = self.predictor.get()
-                    yield process_predictions(frame, predictions)
-
-            while len(frame_data):
-                frame = frame_data.popleft()
-                predictions = self.predictor.get()
-                yield process_predictions(frame, predictions)
-        else:
-            for frame in frame_gen:
-                yield process_predictions(frame, self.predictor(frame))
-
-
-class AsyncPredictor:
-    """
-    A predictor that runs the model asynchronously, possibly on >1 GPUs.
-    Because rendering the visualization takes considerably amount of time,
-    this helps improve throughput when rendering videos.
-    """
-
-    class _StopToken:
-        pass
-
-    class _PredictWorker(mp.Process):
-        def __init__(self, cfg, task_queue, result_queue):
-            self.cfg = cfg
-            self.task_queue = task_queue
-            self.result_queue = result_queue
-            super().__init__()
-
-        def run(self):
-            predictor = DefaultPredictor(self.cfg)
-
-            while True:
-                task = self.task_queue.get()
-                if isinstance(task, AsyncPredictor._StopToken):
-                    break
-                idx, data = task
-                result = predictor(data)
-                self.result_queue.put((idx, result))
-
-    def __init__(self, cfg, num_gpus: int = 1):
-        """
-        Args:
-            cfg (CfgNode):
-            num_gpus (int): if 0, will run on CPU
-        """
-        num_workers = max(num_gpus, 1)
-        self.task_queue = mp.Queue(maxsize=num_workers * 3)
-        self.result_queue = mp.Queue(maxsize=num_workers * 3)
-        self.procs = []
-        for gpuid in range(max(num_gpus, 1)):
-            cfg = cfg.clone()
-            cfg.defrost()
-            cfg.MODEL.DEVICE = "cuda:{}".format(gpuid) if num_gpus > 0 else "cpu"
-            self.procs.append(
-                AsyncPredictor._PredictWorker(cfg, self.task_queue, self.result_queue)
-            )
-
-        self.put_idx = 0
-        self.get_idx = 0
-        self.result_rank = []
-        self.result_data = []
-
-        for p in self.procs:
-            p.start()
-        atexit.register(self.shutdown)
-
-    def put(self, image):
-        self.put_idx += 1
-        self.task_queue.put((self.put_idx, image))
-
-    def get(self):
-        self.get_idx += 1  # the index needed for this request
-        if len(self.result_rank) and self.result_rank[0] == self.get_idx:
-            res = self.result_data[0]
-            del self.result_data[0], self.result_rank[0]
-            return res
-
-        while True:
-            # make sure the results are returned in the correct order
-            idx, res = self.result_queue.get()
-            if idx == self.get_idx:
-                return res
-            insert = bisect.bisect(self.result_rank, idx)
-            self.result_rank.insert(insert, idx)
-            self.result_data.insert(insert, res)
-
-    def __len__(self):
-        return self.put_idx - self.get_idx
-
-    def __call__(self, image):
-        self.put(image)
-        return self.get()
-
-    def shutdown(self):
-        for _ in self.procs:
-            self.task_queue.put(AsyncPredictor._StopToken())
-
-    @property
-    def default_buffer_size(self):
-        return len(self.procs) * 5
diff --git a/eval/vbench/third_party/grit_src/centernet2/train_net.py b/eval/vbench/third_party/grit_src/centernet2/train_net.py
deleted file mode 100644
index 0cabb3e7..00000000
--- a/eval/vbench/third_party/grit_src/centernet2/train_net.py
+++ /dev/null
@@ -1,246 +0,0 @@
-import datetime
-import json
-import logging
-import os
-import time
-from collections import OrderedDict
-
-import detectron2.utils.comm as comm
-import torch
-from centernet.config import add_centernet_config
-from centernet.data.custom_build_augmentation import build_custom_augmentation
-from detectron2.checkpoint import DetectionCheckpointer, PeriodicCheckpointer
-from detectron2.config import get_cfg
-from detectron2.data import MetadataCatalog, build_detection_test_loader
-from detectron2.data.build import build_detection_train_loader
-from detectron2.data.dataset_mapper import DatasetMapper
-from detectron2.engine import default_argument_parser, default_setup, launch
-from detectron2.evaluation import (
-    COCOEvaluator,
-    LVISEvaluator,
-    inference_on_dataset,
-    print_csv_format,
-)
-from detectron2.modeling import build_model
-from detectron2.modeling.test_time_augmentation import GeneralizedRCNNWithTTA
-from detectron2.solver import build_lr_scheduler, build_optimizer
-from detectron2.utils.events import (
-    CommonMetricPrinter,
-    EventStorage,
-    JSONWriter,
-    TensorboardXWriter,
-)
-from fvcore.common.timer import Timer
-from torch.nn.parallel import DistributedDataParallel
-
-logger = logging.getLogger("detectron2")
-
-
-def do_test(cfg, model):
-    results = OrderedDict()
-    for dataset_name in cfg.DATASETS.TEST:
-        mapper = (
-            None
-            if cfg.INPUT.TEST_INPUT_TYPE == "default"
-            else DatasetMapper(
-                cfg, False, augmentations=build_custom_augmentation(cfg, False)
-            )
-        )
-        data_loader = build_detection_test_loader(cfg, dataset_name, mapper=mapper)
-        output_folder = os.path.join(
-            cfg.OUTPUT_DIR, "inference_{}".format(dataset_name)
-        )
-        evaluator_type = MetadataCatalog.get(dataset_name).evaluator_type
-
-        if evaluator_type == "lvis":
-            evaluator = LVISEvaluator(dataset_name, cfg, True, output_folder)
-        elif evaluator_type == "coco":
-            evaluator = COCOEvaluator(dataset_name, cfg, True, output_folder)
-        else:
-            assert 0, evaluator_type
-
-        results[dataset_name] = inference_on_dataset(model, data_loader, evaluator)
-        if comm.is_main_process():
-            logger.info("Evaluation results for {} in csv format:".format(dataset_name))
-            print_csv_format(results[dataset_name])
-    if len(results) == 1:
-        results = list(results.values())[0]
-    return results
-
-
-def do_train(cfg, model, resume=False):
-    model.train()
-    optimizer = build_optimizer(cfg, model)
-    scheduler = build_lr_scheduler(cfg, optimizer)
-
-    checkpointer = DetectionCheckpointer(
-        model, cfg.OUTPUT_DIR, optimizer=optimizer, scheduler=scheduler
-    )
-
-    start_iter = (
-        checkpointer.resume_or_load(
-            cfg.MODEL.WEIGHTS,
-            resume=resume,
-        ).get("iteration", -1)
-        + 1
-    )
-    if cfg.SOLVER.RESET_ITER:
-        logger.info("Reset loaded iteration. Start training from iteration 0.")
-        start_iter = 0
-    max_iter = (
-        cfg.SOLVER.MAX_ITER if cfg.SOLVER.TRAIN_ITER < 0 else cfg.SOLVER.TRAIN_ITER
-    )
-
-    periodic_checkpointer = PeriodicCheckpointer(
-        checkpointer, cfg.SOLVER.CHECKPOINT_PERIOD, max_iter=max_iter
-    )
-
-    writers = (
-        [
-            CommonMetricPrinter(max_iter),
-            JSONWriter(os.path.join(cfg.OUTPUT_DIR, "metrics.json")),
-            TensorboardXWriter(cfg.OUTPUT_DIR),
-        ]
-        if comm.is_main_process()
-        else []
-    )
-
-    mapper = (
-        DatasetMapper(cfg, True)
-        if cfg.INPUT.CUSTOM_AUG == ""
-        else DatasetMapper(
-            cfg, True, augmentations=build_custom_augmentation(cfg, True)
-        )
-    )
-    if cfg.DATALOADER.SAMPLER_TRAIN in [
-        "TrainingSampler",
-        "RepeatFactorTrainingSampler",
-    ]:
-        data_loader = build_detection_train_loader(cfg, mapper=mapper)
-    else:
-        from centernet.data.custom_dataset_dataloader import build_custom_train_loader
-
-        data_loader = build_custom_train_loader(cfg, mapper=mapper)
-
-    logger.info("Starting training from iteration {}".format(start_iter))
-    with EventStorage(start_iter) as storage:
-        step_timer = Timer()
-        data_timer = Timer()
-        start_time = time.perf_counter()
-        for data, iteration in zip(data_loader, range(start_iter, max_iter)):
-            data_time = data_timer.seconds()
-            storage.put_scalars(data_time=data_time)
-            step_timer.reset()
-            iteration = iteration + 1
-            storage.step()
-            loss_dict = model(data)
-
-            losses = sum(loss for k, loss in loss_dict.items())
-            assert torch.isfinite(losses).all(), loss_dict
-
-            loss_dict_reduced = {
-                k: v.item() for k, v in comm.reduce_dict(loss_dict).items()
-            }
-            losses_reduced = sum(loss for loss in loss_dict_reduced.values())
-            if comm.is_main_process():
-                storage.put_scalars(total_loss=losses_reduced, **loss_dict_reduced)
-
-            optimizer.zero_grad()
-            losses.backward()
-            optimizer.step()
-
-            storage.put_scalar(
-                "lr", optimizer.param_groups[0]["lr"], smoothing_hint=False
-            )
-
-            step_time = step_timer.seconds()
-            storage.put_scalars(time=step_time)
-            data_timer.reset()
-            scheduler.step()
-
-            if (
-                cfg.TEST.EVAL_PERIOD > 0
-                and iteration % cfg.TEST.EVAL_PERIOD == 0
-                and iteration != max_iter
-            ):
-                do_test(cfg, model)
-                comm.synchronize()
-
-            if iteration - start_iter > 5 and (
-                iteration % 20 == 0 or iteration == max_iter
-            ):
-                for writer in writers:
-                    writer.write()
-            periodic_checkpointer.step(iteration)
-
-        total_time = time.perf_counter() - start_time
-        logger.info(
-            "Total training time: {}".format(
-                str(datetime.timedelta(seconds=int(total_time)))
-            )
-        )
-
-
-def setup(args):
-    """
-    Create configs and perform basic setups.
-    """
-    cfg = get_cfg()
-    add_centernet_config(cfg)
-    cfg.merge_from_file(args.config_file)
-    cfg.merge_from_list(args.opts)
-    if "/auto" in cfg.OUTPUT_DIR:
-        file_name = os.path.basename(args.config_file)[:-5]
-        cfg.OUTPUT_DIR = cfg.OUTPUT_DIR.replace("/auto", "/{}".format(file_name))
-        logger.info("OUTPUT_DIR: {}".format(cfg.OUTPUT_DIR))
-    cfg.freeze()
-    default_setup(cfg, args)
-    return cfg
-
-
-def main(args):
-    cfg = setup(args)
-
-    model = build_model(cfg)
-    logger.info("Model:\n{}".format(model))
-    if args.eval_only:
-        DetectionCheckpointer(model, save_dir=cfg.OUTPUT_DIR).resume_or_load(
-            cfg.MODEL.WEIGHTS, resume=args.resume
-        )
-        if cfg.TEST.AUG.ENABLED:
-            logger.info("Running inference with test-time augmentation ...")
-            model = GeneralizedRCNNWithTTA(cfg, model, batch_size=1)
-
-        return do_test(cfg, model)
-
-    distributed = comm.get_world_size() > 1
-    if distributed:
-        model = DistributedDataParallel(
-            model,
-            device_ids=[comm.get_local_rank()],
-            broadcast_buffers=False,
-            find_unused_parameters=True,
-        )
-
-    do_train(cfg, model, resume=args.resume)
-    return do_test(cfg, model)
-
-
-if __name__ == "__main__":
-    args = default_argument_parser()
-    args.add_argument("--manual_device", default="")
-    args = args.parse_args()
-    if args.manual_device != "":
-        os.environ["CUDA_VISIBLE_DEVICES"] = args.manual_device
-    args.dist_url = "tcp://127.0.0.1:{}".format(
-        torch.randint(11111, 60000, (1,))[0].item()
-    )
-    print("Command Line Args:", args)
-    launch(
-        main,
-        args.num_gpus,
-        num_machines=args.num_machines,
-        machine_rank=args.machine_rank,
-        dist_url=args.dist_url,
-        args=(args,),
-    )
diff --git a/eval/vbench/third_party/grit_src/configs/Base.yaml b/eval/vbench/third_party/grit_src/configs/Base.yaml
deleted file mode 100644
index 620bee8e..00000000
--- a/eval/vbench/third_party/grit_src/configs/Base.yaml
+++ /dev/null
@@ -1,77 +0,0 @@
-MODEL:
-  META_ARCHITECTURE: "GRiT"
-  MASK_ON: True
-  PROPOSAL_GENERATOR:
-    NAME: "CenterNet"
-  FPN:
-    IN_FEATURES: ["layer3", "layer4", "layer5"]
-  PIXEL_MEAN: [123.675, 116.280, 103.530]
-  PIXEL_STD: [58.395, 57.12, 57.375]
-  ROI_HEADS:
-    NAME: GRiTROIHeadsAndTextDecoder
-    IN_FEATURES: ["p3", "p4", "p5"]
-    IOU_THRESHOLDS: [0.6]
-    NUM_CLASSES: 1
-    SCORE_THRESH_TEST: 0.02
-    NMS_THRESH_TEST: 0.5
-    OBJECT_FEAT_POOLER_RES: 14
-  ROI_BOX_CASCADE_HEAD:
-    IOUS: [0.6, 0.7, 0.8]
-  ROI_BOX_HEAD:
-    NAME: "FastRCNNConvFCHead"
-    NUM_FC: 2
-    POOLER_RESOLUTION: 7
-    CLS_AGNOSTIC_BBOX_REG: True
-    MULT_PROPOSAL_SCORE: True
-  ROI_MASK_HEAD:
-    NAME: "MaskRCNNConvUpsampleHead"
-    NUM_CONV: 4
-    POOLER_RESOLUTION: 14
-    CLS_AGNOSTIC_MASK: True
-  CENTERNET:
-    NUM_CLASSES: 1
-    REG_WEIGHT: 1.
-    NOT_NORM_REG: True
-    ONLY_PROPOSAL: True
-    WITH_AGN_HM: True
-    INFERENCE_TH: 0.0001
-    PRE_NMS_TOPK_TRAIN: 4000
-    POST_NMS_TOPK_TRAIN: 2000
-    PRE_NMS_TOPK_TEST: 1000
-    POST_NMS_TOPK_TEST: 256
-    NMS_TH_TRAIN: 0.9
-    NMS_TH_TEST: 0.9
-    POS_WEIGHT: 0.5
-    NEG_WEIGHT: 0.5
-    IGNORE_HIGH_FP: 0.85
-DATASETS:
-  TRAIN: ("coco_2017_train",)
-  TEST: ("coco_2017_val",)
-DATALOADER:
-  SAMPLER_TRAIN: "MultiDatasetSampler"
-  DATASET_RATIO: [1]
-  DATASET_INPUT_SIZE: [1024]
-  DATASET_INPUT_SCALE: [[0.1, 2.0]]
-  FILTER_EMPTY_ANNOTATIONS: False
-  NUM_WORKERS: 8
-TEST:
-  DETECTIONS_PER_IMAGE: 256
-SOLVER:
-  LR_SCHEDULER_NAME: "WarmupCosineLR"
-  CHECKPOINT_PERIOD: 10000
-  WARMUP_ITERS: 1000
-  WARMUP_FACTOR: 0.001
-  USE_CUSTOM_SOLVER: True
-  OPTIMIZER: "ADAMW"
-  MAX_ITER: 180000
-  IMS_PER_BATCH: 64
-  BASE_LR: 0.00008
-  VIT_LAYER_DECAY: True
-  CLIP_GRADIENTS:
-    ENABLED: True
-INPUT:
-  FORMAT: RGB
-  CUSTOM_AUG: EfficientDetResizeCrop
-  TRAIN_SIZE: 640
-USE_ACT_CHECKPOINT: True
-VERSION: 2
diff --git a/eval/vbench/third_party/grit_src/configs/GRiT_B_DenseCap.yaml b/eval/vbench/third_party/grit_src/configs/GRiT_B_DenseCap.yaml
deleted file mode 100644
index a312d242..00000000
--- a/eval/vbench/third_party/grit_src/configs/GRiT_B_DenseCap.yaml
+++ /dev/null
@@ -1,20 +0,0 @@
-_BASE_: "Base.yaml"
-MODEL:
-  TRAIN_TASK: ["DenseCap"]
-  TEST_TASK: "DenseCap"
-  MASK_ON: False
-  ROI_HEADS:
-    SOFT_NMS_ENABLED: False
-  BEAM_SIZE: 1
-  WEIGHTS: "detectron2://ImageNetPretrained/MAE/mae_pretrain_vit_base.pth"
-  BACKBONE:
-    NAME: build_vit_fpn_backbone
-  VIT_LAYERS: 12
-SOLVER:
-  VIT_LAYER_DECAY_RATE: 0.7
-DATASETS:
-  TRAIN: ("vg_train",)
-  TEST: ("vg_test",)
-DATALOADER:
-  DATASET_BS: 2
-OUTPUT_DIR: "./output/GRiT_B_DenseCap"
diff --git a/eval/vbench/third_party/grit_src/configs/GRiT_B_DenseCap_ObjectDet.yaml b/eval/vbench/third_party/grit_src/configs/GRiT_B_DenseCap_ObjectDet.yaml
deleted file mode 100644
index 03ba3356..00000000
--- a/eval/vbench/third_party/grit_src/configs/GRiT_B_DenseCap_ObjectDet.yaml
+++ /dev/null
@@ -1,23 +0,0 @@
-_BASE_: "Base.yaml"
-MODEL:
-  TRAIN_TASK: ["ObjectDet", "DenseCap"]
-  TEST_TASK: "DenseCap" # DenseCap or ObjectDet: Choose one for testing
-  MASK_ON: True
-  ROI_HEADS:
-    SOFT_NMS_ENABLED: False
-  BEAM_SIZE: 1
-  WEIGHTS: "detectron2://ImageNetPretrained/MAE/mae_pretrain_vit_base.pth"
-  BACKBONE:
-    NAME: build_vit_fpn_backbone
-  VIT_LAYERS: 12
-SOLVER:
-  VIT_LAYER_DECAY_RATE: 0.7
-DATASETS:
-  TRAIN: ("GRiT_coco2017_train", "vg_train")
-  TEST: ("coco_2017_test-dev",)
-DATALOADER:
-  DATASET_RATIO: [1, 1]
-  DATASET_BS: 2
-  DATASET_INPUT_SIZE: [1024, 1024]
-  DATASET_INPUT_SCALE: [[0.1, 2.0], [0.1, 2.0]]
-OUTPUT_DIR: "./output/GRiT_B_DenseCap_ObjectDet"
diff --git a/eval/vbench/third_party/grit_src/configs/GRiT_B_ObjectDet.yaml b/eval/vbench/third_party/grit_src/configs/GRiT_B_ObjectDet.yaml
deleted file mode 100644
index a23d6b3b..00000000
--- a/eval/vbench/third_party/grit_src/configs/GRiT_B_ObjectDet.yaml
+++ /dev/null
@@ -1,20 +0,0 @@
-_BASE_: "Base.yaml"
-MODEL:
-  TRAIN_TASK: ["ObjectDet"]
-  TEST_TASK: "ObjectDet"
-  MASK_ON: True
-  ROI_HEADS:
-    SOFT_NMS_ENABLED: True
-  BEAM_SIZE: 3
-  WEIGHTS: "detectron2://ImageNetPretrained/MAE/mae_pretrain_vit_base.pth"
-  BACKBONE:
-    NAME: build_vit_fpn_backbone
-  VIT_LAYERS: 12
-SOLVER:
-  VIT_LAYER_DECAY_RATE: 0.7
-DATASETS:
-  TRAIN: ("GRiT_coco2017_train",)
-  TEST: ("coco_2017_val",)
-DATALOADER:
-  DATASET_BS: 2
-OUTPUT_DIR: "./output/GRiT_B_ObjectDet"
diff --git a/eval/vbench/third_party/grit_src/configs/GRiT_H_ObjectDet.yaml b/eval/vbench/third_party/grit_src/configs/GRiT_H_ObjectDet.yaml
deleted file mode 100644
index 7bd21c74..00000000
--- a/eval/vbench/third_party/grit_src/configs/GRiT_H_ObjectDet.yaml
+++ /dev/null
@@ -1,21 +0,0 @@
-_BASE_: "Base.yaml"
-MODEL:
-  TRAIN_TASK: ["ObjectDet"]
-  TEST_TASK: "ObjectDet"
-  MASK_ON: True
-  ROI_HEADS:
-    SOFT_NMS_ENABLED: True
-  BEAM_SIZE: 3
-  WEIGHTS: "detectron2://ImageNetPretrained/MAE/mae_pretrain_vit_huge_p14to16.pth"
-  BACKBONE:
-    NAME: build_vit_fpn_backbone_huge
-  VIT_LAYERS: 32
-SOLVER:
-  MAX_ITER: 135000
-  VIT_LAYER_DECAY_RATE: 0.9
-DATASETS:
-  TRAIN: ("GRiT_coco2017_train",)
-  TEST: ("coco_2017_val",)
-DATALOADER:
-  DATASET_BS: 1
-OUTPUT_DIR: "./output/GRiT_H_ObjectDet"
diff --git a/eval/vbench/third_party/grit_src/configs/GRiT_L_ObjectDet.yaml b/eval/vbench/third_party/grit_src/configs/GRiT_L_ObjectDet.yaml
deleted file mode 100644
index a9055031..00000000
--- a/eval/vbench/third_party/grit_src/configs/GRiT_L_ObjectDet.yaml
+++ /dev/null
@@ -1,20 +0,0 @@
-_BASE_: "Base.yaml"
-MODEL:
-  TRAIN_TASK: ["ObjectDet"]
-  TEST_TASK: "ObjectDet"
-  MASK_ON: True
-  ROI_HEADS:
-    SOFT_NMS_ENABLED: True
-  BEAM_SIZE: 3
-  WEIGHTS: "detectron2://ImageNetPretrained/MAE/mae_pretrain_vit_large.pth"
-  BACKBONE:
-    NAME: build_vit_fpn_backbone_large
-  VIT_LAYERS: 24
-SOLVER:
-  VIT_LAYER_DECAY_RATE: 0.8
-DATASETS:
-  TRAIN: ("GRiT_coco2017_train",)
-  TEST: ("coco_2017_val",)
-DATALOADER:
-  DATASET_BS: 1
-OUTPUT_DIR: "./output/GRiT_L_ObjectDet"
diff --git a/eval/vbench/third_party/grit_src/grit/__init__.py b/eval/vbench/third_party/grit_src/grit/__init__.py
deleted file mode 100644
index 1c42b177..00000000
--- a/eval/vbench/third_party/grit_src/grit/__init__.py
+++ /dev/null
@@ -1,4 +0,0 @@
-from .data.datasets import grit_coco, object365, vg
-from .modeling.backbone import vit
-from .modeling.meta_arch import grit
-from .modeling.roi_heads import grit_roi_heads
diff --git a/eval/vbench/third_party/grit_src/grit/config.py b/eval/vbench/third_party/grit_src/grit/config.py
deleted file mode 100644
index 041fe6a0..00000000
--- a/eval/vbench/third_party/grit_src/grit/config.py
+++ /dev/null
@@ -1,50 +0,0 @@
-from detectron2.config import CfgNode as CN
-
-
-def add_grit_config(cfg):
-    _C = cfg
-
-    _C.MODEL.BEAM_SIZE = 1
-    _C.MODEL.TRAIN_TASK = ["ObjectDet", "DenseCap"]
-    _C.MODEL.TEST_TASK = "DenseCap"  # This can be varied if the model is jointly trained on multiple tasks
-
-    _C.MODEL.ROI_BOX_HEAD.USE_BIAS = 0.0  # >= 0: not use
-    _C.MODEL.ROI_BOX_HEAD.MULT_PROPOSAL_SCORE = False
-
-    _C.MODEL.ROI_HEADS.MASK_WEIGHT = 1.0
-    _C.MODEL.ROI_HEADS.OBJECT_FEAT_POOLER_RES = 14
-    _C.MODEL.ROI_HEADS.SOFT_NMS_ENABLED = False
-
-    # Backbones
-    _C.MODEL.VIT_LAYERS = 12
-
-    # Text Decoder
-    _C.TEXT_DECODER = CN()
-    _C.TEXT_DECODER.VOCAB_SIZE = 30522
-    _C.TEXT_DECODER.HIDDEN_SIZE = 768
-    _C.TEXT_DECODER.NUM_LAYERS = 6
-    _C.TEXT_DECODER.ATTENTION_HEADS = 12
-    _C.TEXT_DECODER.FEEDFORWARD_SIZE = 768 * 4
-
-    # Multi-dataset dataloader
-    _C.DATALOADER.DATASET_RATIO = [1, 1]  # sample ratio
-    _C.DATALOADER.DATASET_BS = 1
-    _C.DATALOADER.DATASET_INPUT_SIZE = [1024, 1024]
-    _C.DATALOADER.DATASET_INPUT_SCALE = [(0.1, 2.0), (0.1, 2.0)]
-    _C.DATALOADER.DATASET_MIN_SIZES = [(640, 800), (640, 800)]
-    _C.DATALOADER.DATASET_MAX_SIZES = [1333, 1333]
-
-    _C.SOLVER.USE_CUSTOM_SOLVER = True
-    _C.SOLVER.OPTIMIZER = "ADAMW"
-    _C.SOLVER.VIT_LAYER_DECAY = True
-    _C.SOLVER.VIT_LAYER_DECAY_RATE = 0.7
-
-    _C.INPUT.CUSTOM_AUG = "EfficientDetResizeCrop"
-    _C.INPUT.TRAIN_SIZE = 1024
-    _C.INPUT.TEST_SIZE = 1024
-    _C.INPUT.SCALE_RANGE = (0.1, 2.0)
-    # 'default' for fixed short / long edge
-    _C.INPUT.TEST_INPUT_TYPE = "default"
-
-    _C.FIND_UNUSED_PARAM = True
-    _C.USE_ACT_CHECKPOINT = True
diff --git a/eval/vbench/third_party/grit_src/grit/custom_solver.py b/eval/vbench/third_party/grit_src/grit/custom_solver.py
deleted file mode 100644
index 25fc968f..00000000
--- a/eval/vbench/third_party/grit_src/grit/custom_solver.py
+++ /dev/null
@@ -1,91 +0,0 @@
-# Copyright (c) Facebook, Inc. and its affiliates. All Rights Reserved
-# Modified by Jialian Wu from https://github.com/facebookresearch/Detic/blob/main/detic/custom_solver.py
-import itertools
-from typing import Any, Callable, Dict, Iterable, List, Set, Type, Union
-
-import torch
-from detectron2.config import CfgNode
-from detectron2.solver.build import maybe_add_gradient_clipping
-
-
-def build_custom_optimizer(
-    cfg: CfgNode, model: torch.nn.Module
-) -> torch.optim.Optimizer:
-    params: List[Dict[str, Any]] = []
-    memo: Set[torch.nn.parameter.Parameter] = set()
-    optimizer_type = cfg.SOLVER.OPTIMIZER
-
-    for key, value in model.named_parameters(recurse=True):
-        if not value.requires_grad:
-            continue
-        # Avoid duplicating parameters
-        if value in memo:
-            continue
-        memo.add(value)
-        lr = cfg.SOLVER.BASE_LR
-        weight_decay = cfg.SOLVER.WEIGHT_DECAY
-
-        if cfg.SOLVER.VIT_LAYER_DECAY:
-            lr = lr * get_vit_lr_decay_rate(
-                key, cfg.SOLVER.VIT_LAYER_DECAY_RATE, cfg.MODEL.VIT_LAYERS
-            )
-
-        param = {"params": [value], "lr": lr}
-        if optimizer_type != "ADAMW":
-            param["weight_decay"] = weight_decay
-        params += [param]
-
-    def maybe_add_full_model_gradient_clipping(optim):  # optim: the optimizer class
-        # detectron2 doesn't have full model gradient clipping now
-        clip_norm_val = cfg.SOLVER.CLIP_GRADIENTS.CLIP_VALUE
-        enable = (
-            cfg.SOLVER.CLIP_GRADIENTS.ENABLED
-            and cfg.SOLVER.CLIP_GRADIENTS.CLIP_TYPE == "full_model"
-            and clip_norm_val > 0.0
-        )
-
-        class FullModelGradientClippingOptimizer(optim):
-            def step(self, closure=None):
-                all_params = itertools.chain(*[x["params"] for x in self.param_groups])
-                torch.nn.utils.clip_grad_norm_(all_params, clip_norm_val)
-                super().step(closure=closure)
-
-        return FullModelGradientClippingOptimizer if enable else optim
-
-    if optimizer_type == "SGD":
-        optimizer = maybe_add_full_model_gradient_clipping(torch.optim.SGD)(
-            params,
-            cfg.SOLVER.BASE_LR,
-            momentum=cfg.SOLVER.MOMENTUM,
-            nesterov=cfg.SOLVER.NESTEROV,
-        )
-    elif optimizer_type == "ADAMW":
-        optimizer = maybe_add_full_model_gradient_clipping(torch.optim.AdamW)(
-            params, cfg.SOLVER.BASE_LR, weight_decay=cfg.SOLVER.WEIGHT_DECAY
-        )
-    else:
-        raise NotImplementedError(f"no optimizer type {optimizer_type}")
-    if not cfg.SOLVER.CLIP_GRADIENTS.CLIP_TYPE == "full_model":
-        optimizer = maybe_add_gradient_clipping(cfg, optimizer)
-    return optimizer
-
-
-def get_vit_lr_decay_rate(name, lr_decay_rate=1.0, num_layers=12):
-    """
-    Calculate lr decay rate for different ViT blocks.
-    Args:
-        name (string): parameter name.
-        lr_decay_rate (float): base lr decay rate.
-        num_layers (int): number of ViT blocks.
-
-    Returns:
-        lr decay rate for the given parameter.
-    """
-    layer_id = num_layers + 1
-    if name.startswith("backbone"):
-        if ".pos_embed" in name or ".patch_embed" in name:
-            layer_id = 0
-        elif ".blocks." in name and ".residual." not in name:
-            layer_id = int(name[name.find(".blocks.") :].split(".")[2]) + 1
-
-    return lr_decay_rate ** (num_layers + 1 - layer_id)
diff --git a/eval/vbench/third_party/grit_src/grit/data/__init__.py b/eval/vbench/third_party/grit_src/grit/data/__init__.py
deleted file mode 100644
index e69de29b..00000000
diff --git a/eval/vbench/third_party/grit_src/grit/data/custom_build_augmentation.py b/eval/vbench/third_party/grit_src/grit/data/custom_build_augmentation.py
deleted file mode 100644
index bf9f4b8d..00000000
--- a/eval/vbench/third_party/grit_src/grit/data/custom_build_augmentation.py
+++ /dev/null
@@ -1,46 +0,0 @@
-# Copyright (c) Facebook, Inc. and its affiliates.
-from detectron2.data import transforms as T
-
-from .transforms.custom_augmentation_impl import EfficientDetResizeCrop
-
-
-def build_custom_augmentation(
-    cfg, is_train, scale=None, size=None, min_size=None, max_size=None
-):
-    """
-    Create a list of default :class:`Augmentation` from config.
-    Now it includes resizing and flipping.
-
-    Returns:
-        list[Augmentation]
-    """
-    if cfg.INPUT.CUSTOM_AUG == "ResizeShortestEdge":
-        if is_train:
-            min_size = cfg.INPUT.MIN_SIZE_TRAIN if min_size is None else min_size
-            max_size = cfg.INPUT.MAX_SIZE_TRAIN if max_size is None else max_size
-            sample_style = cfg.INPUT.MIN_SIZE_TRAIN_SAMPLING
-        else:
-            min_size = cfg.INPUT.MIN_SIZE_TEST
-            max_size = cfg.INPUT.MAX_SIZE_TEST
-            sample_style = "choice"
-        augmentation = [T.ResizeShortestEdge(min_size, max_size, sample_style)]
-    elif cfg.INPUT.CUSTOM_AUG == "EfficientDetResizeCrop":
-        if is_train:
-            scale = cfg.INPUT.SCALE_RANGE if scale is None else scale
-            size = cfg.INPUT.TRAIN_SIZE if size is None else size
-        else:
-            scale = (1, 1)
-            size = cfg.INPUT.TEST_SIZE
-        augmentation = [EfficientDetResizeCrop(size, scale)]
-    else:
-        assert 0, cfg.INPUT.CUSTOM_AUG
-
-    if is_train:
-        augmentation.append(T.RandomFlip())
-    return augmentation
-
-
-build_custom_transform_gen = build_custom_augmentation
-"""
-Alias for backward-compatibility.
-"""
diff --git a/eval/vbench/third_party/grit_src/grit/data/custom_dataset_dataloader.py b/eval/vbench/third_party/grit_src/grit/data/custom_dataset_dataloader.py
deleted file mode 100644
index b49e2546..00000000
--- a/eval/vbench/third_party/grit_src/grit/data/custom_dataset_dataloader.py
+++ /dev/null
@@ -1,278 +0,0 @@
-# Copyright (c) Facebook, Inc. and its affiliates.
-# Modified by Jialian Wu from https://github.com/facebookresearch/Detic/blob/main/detic/data/custom_dataset_dataloader.py
-import itertools
-import operator
-from typing import Optional
-
-import torch
-import torch.utils.data
-from detectron2.config import configurable
-from detectron2.data.build import (
-    build_batch_data_loader,
-    check_metadata_consistency,
-    filter_images_with_few_keypoints,
-    filter_images_with_only_crowd_annotations,
-    get_detection_dataset_dicts,
-    print_instances_class_histogram,
-    worker_init_reset_seed,
-)
-from detectron2.data.catalog import DatasetCatalog, MetadataCatalog
-from detectron2.data.common import DatasetFromList, MapDataset
-from detectron2.data.dataset_mapper import DatasetMapper
-from detectron2.data.samplers import TrainingSampler
-from detectron2.utils import comm
-from detectron2.utils.comm import get_world_size
-from torch.utils.data.sampler import BatchSampler, Sampler
-
-
-def _custom_train_loader_from_config(cfg, mapper=None, *, dataset=None, sampler=None):
-    sampler_name = cfg.DATALOADER.SAMPLER_TRAIN
-    if "MultiDataset" in sampler_name:
-        dataset_dicts = get_detection_dataset_dicts_with_source(
-            cfg.DATASETS.TRAIN,
-            filter_empty=cfg.DATALOADER.FILTER_EMPTY_ANNOTATIONS,
-            min_keypoints=(
-                cfg.MODEL.ROI_KEYPOINT_HEAD.MIN_KEYPOINTS_PER_IMAGE
-                if cfg.MODEL.KEYPOINT_ON
-                else 0
-            ),
-            proposal_files=(
-                cfg.DATASETS.PROPOSAL_FILES_TRAIN if cfg.MODEL.LOAD_PROPOSALS else None
-            ),
-        )
-    else:
-        dataset_dicts = get_detection_dataset_dicts(
-            cfg.DATASETS.TRAIN,
-            filter_empty=cfg.DATALOADER.FILTER_EMPTY_ANNOTATIONS,
-            min_keypoints=(
-                cfg.MODEL.ROI_KEYPOINT_HEAD.MIN_KEYPOINTS_PER_IMAGE
-                if cfg.MODEL.KEYPOINT_ON
-                else 0
-            ),
-            proposal_files=(
-                cfg.DATASETS.PROPOSAL_FILES_TRAIN if cfg.MODEL.LOAD_PROPOSALS else None
-            ),
-        )
-
-    if mapper is None:
-        mapper = DatasetMapper(cfg, True)
-
-    if sampler is not None:
-        pass
-    elif sampler_name == "TrainingSampler":
-        sampler = TrainingSampler(len(dataset))
-    elif sampler_name == "MultiDatasetSampler":
-        sampler = MultiDatasetSampler(
-            dataset_dicts,
-            dataset_ratio=cfg.DATALOADER.DATASET_RATIO,
-        )
-    else:
-        raise ValueError("Unknown training sampler: {}".format(sampler_name))
-
-    return {
-        "dataset": dataset_dicts,
-        "sampler": sampler,
-        "mapper": mapper,
-        "total_batch_size": cfg.SOLVER.IMS_PER_BATCH,
-        "num_workers": cfg.DATALOADER.NUM_WORKERS,
-        "dataset_bs": cfg.DATALOADER.DATASET_BS,
-        "num_datasets": len(cfg.DATASETS.TRAIN),
-    }
-
-
-@configurable(from_config=_custom_train_loader_from_config)
-def build_custom_train_loader(
-    dataset,
-    *,
-    mapper,
-    sampler,
-    total_batch_size=16,
-    num_workers=0,
-    num_datasets=1,
-    dataset_bs=1,
-):
-
-    if isinstance(dataset, list):
-        dataset = DatasetFromList(dataset, copy=False)
-    if mapper is not None:
-        dataset = MapDataset(dataset, mapper)
-    if sampler is None:
-        sampler = TrainingSampler(len(dataset))
-    assert isinstance(sampler, torch.utils.data.sampler.Sampler)
-
-    return build_dataset_batch_data_loader(
-        dataset_bs,
-        dataset,
-        sampler,
-        total_batch_size,
-        num_datasets=num_datasets,
-        num_workers=num_workers,
-    )
-
-
-def build_dataset_batch_data_loader(
-    dataset_bs, dataset, sampler, total_batch_size, num_datasets, num_workers=0
-):
-
-    world_size = get_world_size()
-    assert (
-        total_batch_size > 0 and total_batch_size % world_size == 0
-    ), "Total batch size ({}) must be divisible by the number of gpus ({}).".format(
-        total_batch_size, world_size
-    )
-
-    data_loader = torch.utils.data.DataLoader(
-        dataset,
-        sampler=sampler,
-        num_workers=num_workers,
-        batch_sampler=None,
-        collate_fn=operator.itemgetter(0),  # don't batch, but yield individual elements
-        worker_init_fn=worker_init_reset_seed,
-    )
-
-    if num_datasets > 1:
-        return MultiDatasets(data_loader, dataset_bs, num_datasets)
-    else:
-        return SingleDataset(data_loader, dataset_bs)
-
-
-def get_detection_dataset_dicts_with_source(
-    dataset_names, filter_empty=True, min_keypoints=0, proposal_files=None
-):
-    assert len(dataset_names)
-    dataset_dicts = [DatasetCatalog.get(dataset_name) for dataset_name in dataset_names]
-    for dataset_name, dicts in zip(dataset_names, dataset_dicts):
-        assert len(dicts), "Dataset '{}' is empty!".format(dataset_name)
-
-    for source_id, (dataset_name, dicts) in enumerate(
-        zip(dataset_names, dataset_dicts)
-    ):
-        assert len(dicts), "Dataset '{}' is empty!".format(dataset_name)
-        for d in dicts:
-            d["dataset_source"] = source_id
-
-        if "annotations" in dicts[0]:
-            try:
-                class_names = MetadataCatalog.get(dataset_name).thing_classes
-                check_metadata_consistency("thing_classes", dataset_name)
-                print_instances_class_histogram(dicts, class_names)
-            except AttributeError:  # class names are not available for this dataset
-                pass
-
-    assert proposal_files is None
-
-    dataset_dicts = list(itertools.chain.from_iterable(dataset_dicts))
-
-    has_instances = "annotations" in dataset_dicts[0]
-    if filter_empty and has_instances:
-        dataset_dicts = filter_images_with_only_crowd_annotations(dataset_dicts)
-    if min_keypoints > 0 and has_instances:
-        dataset_dicts = filter_images_with_few_keypoints(dataset_dicts, min_keypoints)
-
-    return dataset_dicts
-
-
-class MultiDatasetSampler(Sampler):
-    def __init__(
-        self,
-        dataset_dicts,
-        dataset_ratio,
-        seed: Optional[int] = None,
-    ):
-        sizes = [0 for _ in range(len(dataset_ratio))]
-        for d in dataset_dicts:
-            sizes[d["dataset_source"]] += 1
-        print("dataset sizes", sizes)
-        self.sizes = sizes
-        assert len(dataset_ratio) == len(
-            sizes
-        ), "length of dataset ratio {} should be equal to number if dataset {}".format(
-            len(dataset_ratio), len(sizes)
-        )
-        if seed is None:
-            seed = comm.shared_random_seed()
-        self._seed = int(seed)
-        self._rank = comm.get_rank()
-        self._world_size = comm.get_world_size()
-
-        self.dataset_ids = torch.tensor(
-            [d["dataset_source"] for d in dataset_dicts], dtype=torch.long
-        )
-        self.dataset_ratio = dataset_ratio
-
-        dataset_weight = [
-            torch.ones(s) * max(sizes) / s * r / sum(dataset_ratio)
-            for i, (r, s) in enumerate(zip(dataset_ratio, sizes))
-        ]
-        dataset_weight = torch.cat(dataset_weight)
-
-        self.weights = dataset_weight
-        self.sample_epoch_size = len(self.weights)
-
-    def __iter__(self):
-        start = self._rank
-        yield from itertools.islice(
-            self._infinite_indices(), start, None, self._world_size
-        )
-
-    def _infinite_indices(self):
-        g = torch.Generator()
-        g.manual_seed(self._seed)
-        while True:
-            if len(self.dataset_ratio) > 1:
-                # multiple datasets
-                ids = torch.multinomial(
-                    self.weights, self.sample_epoch_size, generator=g, replacement=True
-                )
-                nums = [
-                    (self.dataset_ids[ids] == i).sum().int().item()
-                    for i in range(len(self.sizes))
-                ]
-                yield from ids
-            else:
-                # single dataset
-                yield from torch.randperm(self.sizes[0], generator=g).tolist()
-
-
-class SingleDataset(torch.utils.data.IterableDataset):
-    def __init__(self, dataset, batch_sizes):
-        self.dataset = dataset
-        self.batch_sizes = batch_sizes
-        self._buckets = [[] for _ in range(2)]
-
-    def __iter__(self):
-        for d in self.dataset:
-            w, h = d["width"], d["height"]
-            aspect_ratio_bucket_id = 0 if w > h else 1
-            bucket_id = aspect_ratio_bucket_id
-            bucket = self._buckets[bucket_id]
-            bucket.append(d)
-            if len(bucket) == self.batch_sizes:
-                yield bucket[:]
-                del bucket[:]
-
-
-class MultiDatasets(torch.utils.data.IterableDataset):
-    def __init__(self, dataset, batch_sizes, num_datasets):
-        self.dataset = dataset
-        self.batch_sizes = batch_sizes
-        self._buckets = [[] for _ in range(2 * num_datasets)]
-        self.iter_idx = 0
-        self.num_datasets = num_datasets
-
-    def __iter__(self):
-        for d in self.dataset:
-            w, h = d["width"], d["height"]
-            aspect_ratio_bucket_id = 0 if w > h else 1
-            bucket_id = d["dataset_source"] * 2 + aspect_ratio_bucket_id
-            bucket = self._buckets[bucket_id]
-            if len(bucket) < self.batch_sizes:
-                bucket.append(d)
-            selected_dataset = self.iter_idx % self.num_datasets
-            if (
-                len(bucket) == self.batch_sizes
-                and selected_dataset == d["dataset_source"]
-            ):
-                self.iter_idx += 1
-                yield bucket[:]
-                del bucket[:]
diff --git a/eval/vbench/third_party/grit_src/grit/data/custom_dataset_mapper.py b/eval/vbench/third_party/grit_src/grit/data/custom_dataset_mapper.py
deleted file mode 100644
index 9da5e9ba..00000000
--- a/eval/vbench/third_party/grit_src/grit/data/custom_dataset_mapper.py
+++ /dev/null
@@ -1,161 +0,0 @@
-# Copyright (c) Facebook, Inc. and its affiliates. All Rights Reserved
-# Modified by Jialian Wu from https://github.com/facebookresearch/Detic/blob/main/detic/data/custom_dataset_mapper.py
-import copy
-import logging
-from itertools import compress
-
-import numpy as np
-import torch
-from detectron2.config import configurable
-from detectron2.data import detection_utils as utils
-from detectron2.data import transforms as T
-from detectron2.data.dataset_mapper import DatasetMapper
-
-from .custom_build_augmentation import build_custom_augmentation
-
-__all__ = ["CustomDatasetMapper", "ObjDescription"]
-logger = logging.getLogger(__name__)
-
-
-class CustomDatasetMapper(DatasetMapper):
-    @configurable
-    def __init__(self, is_train: bool, dataset_augs=[], **kwargs):
-        if is_train:
-            self.dataset_augs = [T.AugmentationList(x) for x in dataset_augs]
-        super().__init__(is_train, **kwargs)
-
-    @classmethod
-    def from_config(cls, cfg, is_train: bool = True):
-        ret = super().from_config(cfg, is_train)
-        if is_train:
-            if cfg.INPUT.CUSTOM_AUG == "EfficientDetResizeCrop":
-                dataset_scales = cfg.DATALOADER.DATASET_INPUT_SCALE
-                dataset_sizes = cfg.DATALOADER.DATASET_INPUT_SIZE
-                ret["dataset_augs"] = [
-                    build_custom_augmentation(cfg, True, scale, size)
-                    for scale, size in zip(dataset_scales, dataset_sizes)
-                ]
-            else:
-                assert cfg.INPUT.CUSTOM_AUG == "ResizeShortestEdge"
-                min_sizes = cfg.DATALOADER.DATASET_MIN_SIZES
-                max_sizes = cfg.DATALOADER.DATASET_MAX_SIZES
-                ret["dataset_augs"] = [
-                    build_custom_augmentation(cfg, True, min_size=mi, max_size=ma)
-                    for mi, ma in zip(min_sizes, max_sizes)
-                ]
-        else:
-            ret["dataset_augs"] = []
-
-        return ret
-
-    def __call__(self, dataset_dict):
-        dataset_dict_out = self.prepare_data(dataset_dict)
-
-        # When augmented image is too small, do re-augmentation
-        retry = 0
-        while (
-            dataset_dict_out["image"].shape[1] < 32
-            or dataset_dict_out["image"].shape[2] < 32
-        ):
-            retry += 1
-            if retry == 100:
-                logger.info(
-                    "Retry 100 times for augmentation. Make sure the image size is not too small."
-                )
-                logger.info("Find image information below")
-                logger.info(dataset_dict)
-            dataset_dict_out = self.prepare_data(dataset_dict)
-
-        return dataset_dict_out
-
-    def prepare_data(self, dataset_dict_in):
-        dataset_dict = copy.deepcopy(dataset_dict_in)
-        if "file_name" in dataset_dict:
-            ori_image = utils.read_image(
-                dataset_dict["file_name"], format=self.image_format
-            )
-        else:
-            ori_image, _, _ = self.tar_dataset[dataset_dict["tar_index"]]
-            ori_image = utils._apply_exif_orientation(ori_image)
-            ori_image = utils.convert_PIL_to_numpy(ori_image, self.image_format)
-        utils.check_image_size(dataset_dict, ori_image)
-
-        aug_input = T.AugInput(copy.deepcopy(ori_image), sem_seg=None)
-        if self.is_train:
-            transforms = self.dataset_augs[dataset_dict["dataset_source"]](aug_input)
-        else:
-            transforms = self.augmentations(aug_input)
-        image, sem_seg_gt = aug_input.image, aug_input.sem_seg
-
-        image_shape = image.shape[:2]
-        dataset_dict["image"] = torch.as_tensor(
-            np.ascontiguousarray(image.transpose(2, 0, 1))
-        )
-
-        if not self.is_train:
-            # USER: Modify this if you want to keep them for some reason.
-            dataset_dict.pop("annotations", None)
-            return dataset_dict
-
-        if "annotations" in dataset_dict:
-            if len(dataset_dict["annotations"]) > 0:
-                object_descriptions = [
-                    an["object_description"] for an in dataset_dict["annotations"]
-                ]
-            else:
-                object_descriptions = []
-            # USER: Modify this if you want to keep them for some reason.
-            for anno in dataset_dict["annotations"]:
-                if not self.use_instance_mask:
-                    anno.pop("segmentation", None)
-                if not self.use_keypoint:
-                    anno.pop("keypoints", None)
-
-            all_annos = [
-                (
-                    utils.transform_instance_annotations(
-                        obj,
-                        transforms,
-                        image_shape,
-                        keypoint_hflip_indices=self.keypoint_hflip_indices,
-                    ),
-                    obj.get("iscrowd", 0),
-                )
-                for obj in dataset_dict.pop("annotations")
-            ]
-            annos = [ann[0] for ann in all_annos if ann[1] == 0]
-            instances = utils.annotations_to_instances(
-                annos, image_shape, mask_format=self.instance_mask_format
-            )
-
-            instances.gt_object_descriptions = ObjDescription(object_descriptions)
-
-            del all_annos
-            if self.recompute_boxes:
-                instances.gt_boxes = instances.gt_masks.get_bounding_boxes()
-            dataset_dict["instances"] = utils.filter_empty_instances(instances)
-
-        return dataset_dict
-
-
-class ObjDescription:
-    def __init__(self, object_descriptions):
-        self.data = object_descriptions
-
-    def __getitem__(self, item):
-        assert type(item) == torch.Tensor
-        assert item.dim() == 1
-        if len(item) > 0:
-            assert item.dtype == torch.int64 or item.dtype == torch.bool
-            if item.dtype == torch.int64:
-                return ObjDescription([self.data[x.item()] for x in item])
-            elif item.dtype == torch.bool:
-                return ObjDescription(list(compress(self.data, item)))
-
-        return ObjDescription(list(compress(self.data, item)))
-
-    def __len__(self):
-        return len(self.data)
-
-    def __repr__(self):
-        return "ObjDescription({})".format(self.data)
diff --git a/eval/vbench/third_party/grit_src/grit/data/datasets/__init__.py b/eval/vbench/third_party/grit_src/grit/data/datasets/__init__.py
deleted file mode 100644
index e69de29b..00000000
diff --git a/eval/vbench/third_party/grit_src/grit/data/datasets/grit_coco.py b/eval/vbench/third_party/grit_src/grit/data/datasets/grit_coco.py
deleted file mode 100644
index 9a311201..00000000
--- a/eval/vbench/third_party/grit_src/grit/data/datasets/grit_coco.py
+++ /dev/null
@@ -1,121 +0,0 @@
-import logging
-import os
-
-from detectron2.data import DatasetCatalog, MetadataCatalog
-from detectron2.structures import BoxMode
-from fvcore.common.file_io import PathManager
-from fvcore.common.timer import Timer
-from lvis import LVIS
-
-logger = logging.getLogger(__name__)
-
-__all__ = ["load_GRiTcoco_json", "register_GRiTcoco_instances"]
-
-
-def register_GRiTcoco_instances(name, metadata, json_file, image_root):
-    """ """
-    DatasetCatalog.register(
-        name, lambda: load_GRiTcoco_json(json_file, image_root, name)
-    )
-    MetadataCatalog.get(name).set(
-        json_file=json_file, image_root=image_root, evaluator_type="coco", **metadata
-    )
-
-
-def get_GRiTcoco_meta():
-    categories = [{"supercategory": "object", "id": 1, "name": "object"}]
-    categories = sorted(categories, key=lambda x: x["id"])
-    thing_classes = [k["name"] for k in categories]
-    meta = {"thing_classes": thing_classes}
-    return meta
-
-
-def load_GRiTcoco_json(json_file, image_root, dataset_name=None):
-    """
-    Load COCO class name text for object description for GRiT
-    """
-
-    json_file = PathManager.get_local_path(json_file)
-
-    timer = Timer()
-    lvis_api = LVIS(json_file)
-    if timer.seconds() > 1:
-        logger.info(
-            "Loading {} takes {:.2f} seconds.".format(json_file, timer.seconds())
-        )
-
-    class_names = {}
-    sort_cat = sorted(lvis_api.dataset["categories"], key=lambda x: x["id"])
-    for x in sort_cat:
-        class_names[x["id"]] = x["name"]
-
-    img_ids = sorted(lvis_api.imgs.keys())
-    imgs = lvis_api.load_imgs(img_ids)
-    anns = [lvis_api.img_ann_map[img_id] for img_id in img_ids]
-
-    ann_ids = [ann["id"] for anns_per_image in anns for ann in anns_per_image]
-    assert len(set(ann_ids)) == len(
-        ann_ids
-    ), "Annotation ids in '{}' are not unique".format(json_file)
-
-    imgs_anns = list(zip(imgs, anns))
-    logger.info(
-        "Loaded {} images in the LVIS v1 format from {}".format(
-            len(imgs_anns), json_file
-        )
-    )
-
-    dataset_dicts = []
-
-    for img_dict, anno_dict_list in imgs_anns:
-        record = {}
-        if "file_name" in img_dict:
-            file_name = img_dict["file_name"]
-            record["file_name"] = os.path.join(image_root, file_name)
-
-        record["height"] = int(img_dict["height"])
-        record["width"] = int(img_dict["width"])
-        image_id = record["image_id"] = img_dict["id"]
-
-        objs = []
-        for anno in anno_dict_list:
-            assert anno["image_id"] == image_id
-            if anno.get("iscrowd", 0) > 0:
-                continue
-            obj = {"bbox": anno["bbox"], "bbox_mode": BoxMode.XYWH_ABS}
-            obj["category_id"] = 0
-            obj["object_description"] = class_names[anno["category_id"]]
-            if "segmentation" in anno:
-                segm = anno["segmentation"]
-                valid_segm = [
-                    poly for poly in segm if len(poly) % 2 == 0 and len(poly) >= 6
-                ]
-                if not len(segm) == len(valid_segm):
-                    print("Annotation contains an invalid polygon with < 3 points")
-                assert len(segm) > 0
-                obj["segmentation"] = segm
-            objs.append(obj)
-        record["annotations"] = objs
-        if len(record["annotations"]) == 0:
-            continue
-        record["task"] = "ObjectDet"
-        dataset_dicts.append(record)
-
-    return dataset_dicts
-
-
-_CUSTOM_SPLITS_LVIS = {
-    "GRiT_coco2017_train": (
-        "coco/train2017/",
-        "coco/annotations/instances_train2017.json",
-    ),
-}
-
-
-for key, (image_root, json_file) in _CUSTOM_SPLITS_LVIS.items():
-    register_GRiTcoco_instances(
-        key,
-        get_GRiTcoco_meta(),
-        os.path.join("datasets", json_file) if "://" not in json_file else json_file,
-        os.path.join("datasets", image_root),
-    )
diff --git a/eval/vbench/third_party/grit_src/grit/data/datasets/object365.py b/eval/vbench/third_party/grit_src/grit/data/datasets/object365.py
deleted file mode 100644
index 11e9cf30..00000000
--- a/eval/vbench/third_party/grit_src/grit/data/datasets/object365.py
+++ /dev/null
@@ -1,118 +0,0 @@
-import logging
-import os
-
-from detectron2.data import DatasetCatalog, MetadataCatalog
-from detectron2.structures import BoxMode
-from fvcore.common.file_io import PathManager
-from fvcore.common.timer import Timer
-from lvis import LVIS
-
-logger = logging.getLogger(__name__)
-
-__all__ = ["load_o365_json", "register_o365_instances"]
-
-
-def register_o365_instances(name, metadata, json_file, image_root):
-    DatasetCatalog.register(name, lambda: load_o365_json(json_file, image_root, name))
-    MetadataCatalog.get(name).set(
-        json_file=json_file, image_root=image_root, evaluator_type="lvis", **metadata
-    )
-
-
-def get_o365_meta():
-    categories = [{"supercategory": "object", "id": 1, "name": "object"}]
-    o365_categories = sorted(categories, key=lambda x: x["id"])
-    thing_classes = [k["name"] for k in o365_categories]
-    meta = {"thing_classes": thing_classes}
-    return meta
-
-
-def load_o365_json(json_file, image_root, dataset_name=None):
-    """
-    Load Object365 class name text for object description for GRiT
-    """
-
-    json_file = PathManager.get_local_path(json_file)
-
-    timer = Timer()
-    lvis_api = LVIS(json_file)
-    if timer.seconds() > 1:
-        logger.info(
-            "Loading {} takes {:.2f} seconds.".format(json_file, timer.seconds())
-        )
-
-    class_names = {}
-    sort_cat = sorted(lvis_api.dataset["categories"], key=lambda x: x["id"])
-    for x in sort_cat:
-        if "/" in x["name"]:
-            text = ""
-            for xx in x["name"].split("/"):
-                text += xx
-                text += " "
-            text = text[:-1]
-        else:
-            text = x["name"]
-        class_names[x["id"]] = text
-
-    img_ids = sorted(lvis_api.imgs.keys())
-    imgs = lvis_api.load_imgs(img_ids)
-    anns = [lvis_api.img_ann_map[img_id] for img_id in img_ids]
-
-    ann_ids = [ann["id"] for anns_per_image in anns for ann in anns_per_image]
-    assert len(set(ann_ids)) == len(
-        ann_ids
-    ), "Annotation ids in '{}' are not unique".format(json_file)
-
-    imgs_anns = list(zip(imgs, anns))
-    logger.info(
-        "Loaded {} images in the LVIS v1 format from {}".format(
-            len(imgs_anns), json_file
-        )
-    )
-
-    dataset_dicts = []
-
-    for img_dict, anno_dict_list in imgs_anns:
-        record = {}
-        if "file_name" in img_dict:
-            file_name = img_dict["file_name"]
-            record["file_name"] = os.path.join(image_root, file_name)
-
-        record["height"] = int(img_dict["height"])
-        record["width"] = int(img_dict["width"])
-        image_id = record["image_id"] = img_dict["id"]
-
-        objs = []
-        for anno in anno_dict_list:
-            assert anno["image_id"] == image_id
-            if anno.get("iscrowd", 0) > 0:
-                continue
-            obj = {"bbox": anno["bbox"], "bbox_mode": BoxMode.XYWH_ABS}
-            obj["category_id"] = 0
-            obj["object_description"] = class_names[anno["category_id"]]
-
-            objs.append(obj)
-        record["annotations"] = objs
-        if len(record["annotations"]) == 0:
-            continue
-        record["task"] = "ObjectDet"
-        dataset_dicts.append(record)
-
-    return dataset_dicts
-
-
-_CUSTOM_SPLITS_LVIS = {
-    "object365_train": (
-        "object365/images/train/",
-        "object365/annotations/train_v1.json",
-    ),
-}
-
-
-for key, (image_root, json_file) in _CUSTOM_SPLITS_LVIS.items():
-    register_o365_instances(
-        key,
-        get_o365_meta(),
-        os.path.join("datasets", json_file) if "://" not in json_file else json_file,
-        os.path.join("datasets", image_root),
-    )
diff --git a/eval/vbench/third_party/grit_src/grit/data/datasets/vg.py b/eval/vbench/third_party/grit_src/grit/data/datasets/vg.py
deleted file mode 100644
index dfbea5e1..00000000
--- a/eval/vbench/third_party/grit_src/grit/data/datasets/vg.py
+++ /dev/null
@@ -1,101 +0,0 @@
-import logging
-import os
-
-from detectron2.data import DatasetCatalog, MetadataCatalog
-from detectron2.structures import BoxMode
-from fvcore.common.file_io import PathManager
-from fvcore.common.timer import Timer
-from lvis import LVIS
-
-logger = logging.getLogger(__name__)
-
-__all__ = ["load_vg_json", "register_vg_instances"]
-
-
-def register_vg_instances(name, metadata, json_file, image_root):
-    """ """
-    DatasetCatalog.register(name, lambda: load_vg_json(json_file, image_root, name))
-    MetadataCatalog.get(name).set(
-        json_file=json_file, image_root=image_root, evaluator_type="vg", **metadata
-    )
-
-
-def get_vg_meta():
-    categories = [{"supercategory": "object", "id": 1, "name": "object"}]
-    vg_categories = sorted(categories, key=lambda x: x["id"])
-    thing_classes = [k["name"] for k in vg_categories]
-    meta = {"thing_classes": thing_classes}
-    return meta
-
-
-def load_vg_json(json_file, image_root, dataset_name=None):
-
-    json_file = PathManager.get_local_path(json_file)
-
-    timer = Timer()
-    lvis_api = LVIS(json_file)
-    if timer.seconds() > 1:
-        logger.info(
-            "Loading {} takes {:.2f} seconds.".format(json_file, timer.seconds())
-        )
-
-    img_ids = sorted(lvis_api.imgs.keys())
-    imgs = lvis_api.load_imgs(img_ids)
-    anns = [lvis_api.img_ann_map[img_id] for img_id in img_ids]
-
-    ann_ids = [ann["id"] for anns_per_image in anns for ann in anns_per_image]
-    assert len(set(ann_ids)) == len(
-        ann_ids
-    ), "Annotation ids in '{}' are not unique".format(json_file)
-
-    imgs_anns = list(zip(imgs, anns))
-    logger.info(
-        "Loaded {} images in the LVIS v1 format from {}".format(
-            len(imgs_anns), json_file
-        )
-    )
-
-    dataset_dicts = []
-
-    for img_dict, anno_dict_list in imgs_anns:
-        record = {}
-        if "file_name" in img_dict:
-            file_name = img_dict["file_name"]
-            record["file_name"] = os.path.join(image_root, file_name)
-
-        record["height"] = int(img_dict["height"])
-        record["width"] = int(img_dict["width"])
-        image_id = record["image_id"] = img_dict["id"]
-
-        objs = []
-        for anno in anno_dict_list:
-            assert anno["image_id"] == image_id
-            if anno.get("iscrowd", 0) > 0:
-                continue
-            obj = {"bbox": anno["bbox"], "bbox_mode": BoxMode.XYWH_ABS}
-            obj["category_id"] = 0
-            obj["object_description"] = anno["caption"]
-
-            objs.append(obj)
-        record["annotations"] = objs
-        if len(record["annotations"]) == 0:
-            continue
-        record["task"] = "DenseCap"
-        dataset_dicts.append(record)
-
-    return dataset_dicts
-
-
-_CUSTOM_SPLITS_LVIS = {
-    "vg_train": ("vg/images", "vg/annotations/train.json"),
-    "vg_test": ("vg/images", "vg/annotations/test.json"),
-}
-
-
-for key, (image_root, json_file) in _CUSTOM_SPLITS_LVIS.items():
-    register_vg_instances(
-        key,
-        get_vg_meta(),
-        os.path.join("datasets", json_file) if "://" not in json_file else json_file,
-        os.path.join("datasets", image_root),
-    )
diff --git a/eval/vbench/third_party/grit_src/grit/data/transforms/__init__.py b/eval/vbench/third_party/grit_src/grit/data/transforms/__init__.py
deleted file mode 100644
index e69de29b..00000000
diff --git a/eval/vbench/third_party/grit_src/grit/data/transforms/custom_augmentation_impl.py b/eval/vbench/third_party/grit_src/grit/data/transforms/custom_augmentation_impl.py
deleted file mode 100644
index eef89c05..00000000
--- a/eval/vbench/third_party/grit_src/grit/data/transforms/custom_augmentation_impl.py
+++ /dev/null
@@ -1,56 +0,0 @@
-# -*- coding: utf-8 -*-
-# Copyright (c) Facebook, Inc. and its affiliates. All Rights Reserved
-# Part of the code is from https://github.com/rwightman/efficientdet-pytorch/blob/master/effdet/data/transforms.py
-# Modified by Xingyi Zhou
-# The original code is under Apache-2.0 License
-import numpy as np
-from detectron2.data.transforms.augmentation import Augmentation
-from PIL import Image
-
-from .custom_transform import EfficientDetResizeCropTransform
-
-__all__ = [
-    "EfficientDetResizeCrop",
-]
-
-
-class EfficientDetResizeCrop(Augmentation):
-    """
-    Scale the shorter edge to the given size, with a limit of `max_size` on the longer edge.
-    If `max_size` is reached, then downscale so that the longer edge does not exceed max_size.
-    """
-
-    def __init__(self, size, scale, interp=Image.BILINEAR):
-        """ """
-        super().__init__()
-        self.target_size = (size, size)
-        self.scale = scale
-        self.interp = interp
-
-    def get_transform(self, img):
-        # Select a random scale factor.
-        scale_factor = np.random.uniform(*self.scale)
-        scaled_target_height = scale_factor * self.target_size[0]
-        scaled_target_width = scale_factor * self.target_size[1]
-        # Recompute the accurate scale_factor using rounded scaled image size.
-        width, height = img.shape[1], img.shape[0]
-        img_scale_y = scaled_target_height / height
-        img_scale_x = scaled_target_width / width
-        img_scale = min(img_scale_y, img_scale_x)
-
-        # Select non-zero random offset (x, y) if scaled image is larger than target size
-        scaled_h = int(height * img_scale)
-        scaled_w = int(width * img_scale)
-        offset_y = scaled_h - self.target_size[0]
-        offset_x = scaled_w - self.target_size[1]
-        offset_y = int(max(0.0, float(offset_y)) * np.random.uniform(0, 1))
-        offset_x = int(max(0.0, float(offset_x)) * np.random.uniform(0, 1))
-        return EfficientDetResizeCropTransform(
-            scaled_h,
-            scaled_w,
-            offset_y,
-            offset_x,
-            img_scale,
-            self.target_size,
-            self.interp,
-        )
diff --git a/eval/vbench/third_party/grit_src/grit/data/transforms/custom_transform.py b/eval/vbench/third_party/grit_src/grit/data/transforms/custom_transform.py
deleted file mode 100644
index 62d2be65..00000000
--- a/eval/vbench/third_party/grit_src/grit/data/transforms/custom_transform.py
+++ /dev/null
@@ -1,121 +0,0 @@
-# -*- coding: utf-8 -*-
-# Copyright (c) Facebook, Inc. and its affiliates. All Rights Reserved
-# Part of the code is from https://github.com/rwightman/efficientdet-pytorch/blob/master/effdet/data/transforms.py
-# Modified by Xingyi Zhou
-# The original code is under Apache-2.0 License
-import numpy as np
-import torch
-import torch.nn.functional as F
-from fvcore.transforms.transform import (
-    CropTransform,
-    HFlipTransform,
-    NoOpTransform,
-    Transform,
-    TransformList,
-)
-from PIL import Image
-
-try:
-    import cv2  # noqa
-except ImportError:
-    # OpenCV is an optional dependency at the moment
-    pass
-
-__all__ = [
-    "EfficientDetResizeCropTransform",
-]
-
-
-class EfficientDetResizeCropTransform(Transform):
-    """ """
-
-    def __init__(
-        self,
-        scaled_h,
-        scaled_w,
-        offset_y,
-        offset_x,
-        img_scale,
-        target_size,
-        interp=None,
-    ):
-        """
-        Args:
-            h, w (int): original image size
-            new_h, new_w (int): new image size
-            interp: PIL interpolation methods, defaults to bilinear.
-        """
-        # TODO decide on PIL vs opencv
-        super().__init__()
-        if interp is None:
-            interp = Image.BILINEAR
-        self._set_attributes(locals())
-
-    def apply_image(self, img, interp=None):
-        assert len(img.shape) <= 4
-
-        if img.dtype == np.uint8:
-            pil_image = Image.fromarray(img)
-            interp_method = interp if interp is not None else self.interp
-            pil_image = pil_image.resize((self.scaled_w, self.scaled_h), interp_method)
-            ret = np.asarray(pil_image)
-            right = min(self.scaled_w, self.offset_x + self.target_size[1])
-            lower = min(self.scaled_h, self.offset_y + self.target_size[0])
-            if len(ret.shape) <= 3:
-                ret = ret[self.offset_y : lower, self.offset_x : right]
-            else:
-                ret = ret[..., self.offset_y : lower, self.offset_x : right, :]
-        else:
-            # PIL only supports uint8
-            img = torch.from_numpy(img)
-            shape = list(img.shape)
-            shape_4d = shape[:2] + [1] * (4 - len(shape)) + shape[2:]
-            img = img.view(shape_4d).permute(2, 3, 0, 1)  # hw(c) -> nchw
-            _PIL_RESIZE_TO_INTERPOLATE_MODE = {
-                Image.BILINEAR: "bilinear",
-                Image.BICUBIC: "bicubic",
-            }
-            mode = _PIL_RESIZE_TO_INTERPOLATE_MODE[self.interp]
-            img = F.interpolate(
-                img, (self.scaled_h, self.scaled_w), mode=mode, align_corners=False
-            )
-            shape[:2] = (self.scaled_h, self.scaled_w)
-            ret = img.permute(2, 3, 0, 1).view(shape).numpy()  # nchw -> hw(c)
-            right = min(self.scaled_w, self.offset_x + self.target_size[1])
-            lower = min(self.scaled_h, self.offset_y + self.target_size[0])
-            if len(ret.shape) <= 3:
-                ret = ret[self.offset_y : lower, self.offset_x : right]
-            else:
-                ret = ret[..., self.offset_y : lower, self.offset_x : right, :]
-        return ret
-
-    def apply_coords(self, coords):
-        coords[:, 0] = coords[:, 0] * self.img_scale
-        coords[:, 1] = coords[:, 1] * self.img_scale
-        coords[:, 0] -= self.offset_x
-        coords[:, 1] -= self.offset_y
-        return coords
-
-    def apply_segmentation(self, segmentation):
-        segmentation = self.apply_image(segmentation, interp=Image.NEAREST)
-        return segmentation
-
-    def inverse(self):
-        raise NotImplementedError
-
-    def inverse_apply_coords(self, coords):
-        coords[:, 0] += self.offset_x
-        coords[:, 1] += self.offset_y
-        coords[:, 0] = coords[:, 0] / self.img_scale
-        coords[:, 1] = coords[:, 1] / self.img_scale
-        return coords
-
-    def inverse_apply_box(self, box: np.ndarray) -> np.ndarray:
-        """ """
-        idxs = np.array([(0, 1), (2, 1), (0, 3), (2, 3)]).flatten()
-        coords = np.asarray(box).reshape(-1, 4)[:, idxs].reshape(-1, 2)
-        coords = self.inverse_apply_coords(coords).reshape((-1, 4, 2))
-        minxy = coords.min(axis=1)
-        maxxy = coords.max(axis=1)
-        trans_boxes = np.concatenate((minxy, maxxy), axis=1)
-        return trans_boxes
diff --git a/eval/vbench/third_party/grit_src/grit/evaluation/eval.py b/eval/vbench/third_party/grit_src/grit/evaluation/eval.py
deleted file mode 100644
index 458475ec..00000000
--- a/eval/vbench/third_party/grit_src/grit/evaluation/eval.py
+++ /dev/null
@@ -1,165 +0,0 @@
-import itertools
-import json
-import os
-
-import numpy as np
-import pycocotools.mask as mask_util
-from detectron2.evaluation.coco_evaluation import (
-    COCOEvaluator,
-    _evaluate_predictions_on_coco,
-)
-from detectron2.structures import Boxes, BoxMode, pairwise_iou
-from detectron2.utils.file_io import PathManager
-
-
-class GRiTCOCOEvaluator(COCOEvaluator):
-    def process(self, inputs, outputs):
-        for input, output in zip(inputs, outputs):
-            prediction = {"image_id": input["image_id"]}
-
-            if "instances" in output:
-                instances = output["instances"].to(self._cpu_device)
-                prediction["instances"] = instances_to_coco_json(
-                    instances, input["image_id"]
-                )
-
-            if len(prediction) > 1:
-                self._predictions.append(prediction)
-
-    def _eval_predictions(self, predictions, img_ids=None):
-        self._logger.info("Preparing results for COCO format ...")
-        coco_results = list(itertools.chain(*[x["instances"] for x in predictions]))
-        tasks = self._tasks or self._tasks_from_predictions(coco_results)
-
-        if self._output_dir:
-            file_path = os.path.join(self._output_dir, "coco_instances_results.json")
-            self._logger.info("Saving results to {}".format(file_path))
-            with PathManager.open(file_path, "w") as f:
-                f.write(json.dumps(coco_results))
-                f.flush()
-
-        if not self._do_evaluation:
-            self._logger.info("Annotations are not available for evaluation.")
-            return
-
-        self._logger.info(
-            "Evaluating predictions with {} COCO API...".format(
-                "unofficial" if self._use_fast_impl else "official"
-            )
-        )
-
-        coco_results = self.convert_classname_to_id(coco_results)
-
-        for task in sorted(tasks):
-            assert task in {"bbox", "segm", "keypoints"}, f"Got unknown task: {task}!"
-            coco_eval = (
-                _evaluate_predictions_on_coco(
-                    self._coco_api,
-                    coco_results,
-                    task,
-                    kpt_oks_sigmas=self._kpt_oks_sigmas,
-                    use_fast_impl=self._use_fast_impl,
-                    img_ids=img_ids,
-                    max_dets_per_image=self._max_dets_per_image,
-                )
-                if len(coco_results) > 0
-                else None  # cocoapi does not handle empty results very well
-            )
-
-            res = self._derive_coco_results(
-                coco_eval, task, class_names=self._metadata.get("thing_classes")
-            )
-            self._results[task] = res
-
-    def convert_classname_to_id(self, results):
-        outputs = []
-        class_name_to_id = {}
-        categories = sorted(self._coco_api.dataset["categories"], key=lambda x: x["id"])
-
-        for cat in categories:
-            class_name_to_id[cat["name"]] = cat["id"]
-
-        for pred in results:
-            if pred["object_descriptions"] in class_name_to_id:
-                pred["category_id"] = class_name_to_id[pred["object_descriptions"]]
-                del pred["object_descriptions"]
-                outputs.append(pred)
-
-        return outputs
-
-
-class GRiTVGEvaluator(COCOEvaluator):
-    def process(self, inputs, outputs):
-        for input, output in zip(inputs, outputs):
-            assert input["image_id"] == int(
-                input["file_name"].split("/")[-1].split(".")[0]
-            )
-            prediction = {"image_id": input["image_id"]}
-
-            if "instances" in output:
-                instances = output["instances"].to(self._cpu_device)
-                prediction["instances"] = instances_to_coco_json(
-                    instances, input["image_id"], output_logits=True
-                )
-                h = input["height"]
-                w = input["width"]
-                scale = 720.0 / max(h, w)
-                scaled_inst = []
-                for inst in prediction["instances"]:
-                    inst["bbox"][0] = inst["bbox"][0] * scale
-                    inst["bbox"][1] = inst["bbox"][1] * scale
-                    inst["bbox"][2] = inst["bbox"][2] * scale
-                    inst["bbox"][3] = inst["bbox"][3] * scale
-                    scaled_inst.append(inst)
-                if len(scaled_inst) > 0:
-                    prediction["instances"] = scaled_inst
-            if len(prediction) > 1:
-                self._predictions.append(prediction)
-
-    def _eval_predictions(self, predictions, img_ids=None):
-        """
-        This is only for saving the results to json file
-        """
-        self._logger.info("Preparing results for COCO format ...")
-        coco_results = list(itertools.chain(*[x["instances"] for x in predictions]))
-
-        if self._output_dir:
-            file_path = os.path.join(self._output_dir, "vg_instances_results.json")
-            self._logger.info("Saving results to {}".format(file_path))
-            with PathManager.open(file_path, "w") as f:
-                f.write(json.dumps(coco_results))
-                f.flush()
-
-
-def instances_to_coco_json(instances, img_id, output_logits=False):
-    """
-    Add object_descriptions and logit (if applicable) to
-    detectron2's instances_to_coco_json
-    """
-    num_instance = len(instances)
-    if num_instance == 0:
-        return []
-
-    boxes = instances.pred_boxes.tensor.numpy()
-    boxes = BoxMode.convert(boxes, BoxMode.XYXY_ABS, BoxMode.XYWH_ABS)
-    boxes = boxes.tolist()
-    scores = instances.scores.tolist()
-    classes = instances.pred_classes.tolist()
-    object_descriptions = instances.pred_object_descriptions.data
-    if output_logits:
-        logits = instances.logits.tolist()
-
-    results = []
-    for k in range(num_instance):
-        result = {
-            "image_id": img_id,
-            "category_id": classes[k],
-            "bbox": boxes[k],
-            "score": scores[k],
-            "object_descriptions": object_descriptions[k],
-        }
-        if output_logits:
-            result["logit"] = logits[k]
-
-        results.append(result)
-    return results
diff --git a/eval/vbench/third_party/grit_src/grit/modeling/__init__.py b/eval/vbench/third_party/grit_src/grit/modeling/__init__.py
deleted file mode 100644
index e69de29b..00000000
diff --git a/eval/vbench/third_party/grit_src/grit/modeling/backbone/__init__.py b/eval/vbench/third_party/grit_src/grit/modeling/backbone/__init__.py
deleted file mode 100644
index e69de29b..00000000
diff --git a/eval/vbench/third_party/grit_src/grit/modeling/backbone/utils.py b/eval/vbench/third_party/grit_src/grit/modeling/backbone/utils.py
deleted file mode 100644
index a92b1796..00000000
--- a/eval/vbench/third_party/grit_src/grit/modeling/backbone/utils.py
+++ /dev/null
@@ -1,199 +0,0 @@
-# Copyright (c) Facebook, Inc. and its affiliates. All Rights Reserved
-# This code is from https://github.com/facebookresearch/detectron2/blob/main/detectron2/modeling/backbone/utils.py
-import math
-
-import torch
-import torch.nn as nn
-import torch.nn.functional as F
-
-__all__ = [
-    "window_partition",
-    "window_unpartition",
-    "add_decomposed_rel_pos",
-    "get_abs_pos",
-    "PatchEmbed",
-]
-
-
-def window_partition(x, window_size):
-    """
-    Partition into non-overlapping windows with padding if needed.
-    Args:
-        x (tensor): input tokens with [B, H, W, C].
-        window_size (int): window size.
-
-    Returns:
-        windows: windows after partition with [B * num_windows, window_size, window_size, C].
-        (Hp, Wp): padded height and width before partition
-    """
-    B, H, W, C = x.shape
-
-    pad_h = (window_size - H % window_size) % window_size
-    pad_w = (window_size - W % window_size) % window_size
-    if pad_h > 0 or pad_w > 0:
-        x = F.pad(x, (0, 0, 0, pad_w, 0, pad_h))
-    Hp, Wp = H + pad_h, W + pad_w
-
-    x = x.view(B, Hp // window_size, window_size, Wp // window_size, window_size, C)
-    windows = (
-        x.permute(0, 1, 3, 2, 4, 5).contiguous().view(-1, window_size, window_size, C)
-    )
-    return windows, (Hp, Wp)
-
-
-def window_unpartition(windows, window_size, pad_hw, hw):
-    """
-    Window unpartition into original sequences and removing padding.
-    Args:
-        x (tensor): input tokens with [B * num_windows, window_size, window_size, C].
-        window_size (int): window size.
-        pad_hw (Tuple): padded height and width (Hp, Wp).
-        hw (Tuple): original height and width (H, W) before padding.
-
-    Returns:
-        x: unpartitioned sequences with [B, H, W, C].
-    """
-    Hp, Wp = pad_hw
-    H, W = hw
-    B = windows.shape[0] // (Hp * Wp // window_size // window_size)
-    x = windows.view(
-        B, Hp // window_size, Wp // window_size, window_size, window_size, -1
-    )
-    x = x.permute(0, 1, 3, 2, 4, 5).contiguous().view(B, Hp, Wp, -1)
-
-    if Hp > H or Wp > W:
-        x = x[:, :H, :W, :].contiguous()
-    return x
-
-
-def get_rel_pos(q_size, k_size, rel_pos):
-    """
-    Get relative positional embeddings according to the relative positions of
-        query and key sizes.
-    Args:
-        q_size (int): size of query q.
-        k_size (int): size of key k.
-        rel_pos (Tensor): relative position embeddings (L, C).
-
-    Returns:
-        Extracted positional embeddings according to relative positions.
-    """
-    max_rel_dist = int(2 * max(q_size, k_size) - 1)
-    # Interpolate rel pos if needed.
-    if rel_pos.shape[0] != max_rel_dist:
-        # Interpolate rel pos.
-        rel_pos_resized = F.interpolate(
-            rel_pos.reshape(1, rel_pos.shape[0], -1).permute(0, 2, 1),
-            size=max_rel_dist,
-            mode="linear",
-        )
-        rel_pos_resized = rel_pos_resized.reshape(-1, max_rel_dist).permute(1, 0)
-    else:
-        rel_pos_resized = rel_pos
-
-    # Scale the coords with short length if shapes for q and k are different.
-    q_coords = torch.arange(q_size)[:, None] * max(k_size / q_size, 1.0)
-    k_coords = torch.arange(k_size)[None, :] * max(q_size / k_size, 1.0)
-    relative_coords = (q_coords - k_coords) + (k_size - 1) * max(q_size / k_size, 1.0)
-
-    return rel_pos_resized[relative_coords.long()]
-
-
-def add_decomposed_rel_pos(attn, q, rel_pos_h, rel_pos_w, q_size, k_size):
-    """
-    Calculate decomposed Relative Positional Embeddings from :paper:`mvitv2`.
-    https://github.com/facebookresearch/mvit/blob/19786631e330df9f3622e5402b4a419a263a2c80/mvit/models/attention.py   # noqa B950
-    Args:
-        attn (Tensor): attention map.
-        q (Tensor): query q in the attention layer with shape (B, q_h * q_w, C).
-        rel_pos_h (Tensor): relative position embeddings (Lh, C) for height axis.
-        rel_pos_w (Tensor): relative position embeddings (Lw, C) for width axis.
-        q_size (Tuple): spatial sequence size of query q with (q_h, q_w).
-        k_size (Tuple): spatial sequence size of key k with (k_h, k_w).
-
-    Returns:
-        attn (Tensor): attention map with added relative positional embeddings.
-    """
-    q_h, q_w = q_size
-    k_h, k_w = k_size
-    Rh = get_rel_pos(q_h, k_h, rel_pos_h)
-    Rw = get_rel_pos(q_w, k_w, rel_pos_w)
-
-    B, _, dim = q.shape
-    r_q = q.reshape(B, q_h, q_w, dim)
-    rel_h = torch.einsum("bhwc,hkc->bhwk", r_q, Rh)
-    rel_w = torch.einsum("bhwc,wkc->bhwk", r_q, Rw)
-
-    attn = (
-        attn.view(B, q_h, q_w, k_h, k_w)
-        + rel_h[:, :, :, :, None]
-        + rel_w[:, :, :, None, :]
-    ).view(B, q_h * q_w, k_h * k_w)
-
-    return attn
-
-
-def get_abs_pos(abs_pos, has_cls_token, hw):
-    """
-    Calculate absolute positional embeddings. If needed, resize embeddings and remove cls_token
-        dimension for the original embeddings.
-    Args:
-        abs_pos (Tensor): absolute positional embeddings with (1, num_position, C).
-        has_cls_token (bool): If true, has 1 embedding in abs_pos for cls token.
-        hw (Tuple): size of input image tokens.
-
-    Returns:
-        Absolute positional embeddings after processing with shape (1, H, W, C)
-    """
-    h, w = hw
-    if has_cls_token:
-        abs_pos = abs_pos[:, 1:]
-    xy_num = abs_pos.shape[1]
-    size = int(math.sqrt(xy_num))
-    assert size * size == xy_num
-
-    if size != h or size != w:
-        new_abs_pos = F.interpolate(
-            abs_pos.reshape(1, size, size, -1).permute(0, 3, 1, 2),
-            size=(h, w),
-            mode="bicubic",
-            align_corners=False,
-        )
-
-        return new_abs_pos.permute(0, 2, 3, 1)
-    else:
-        return abs_pos.reshape(1, h, w, -1)
-
-
-class PatchEmbed(nn.Module):
-    """
-    Image to Patch Embedding.
-    """
-
-    def __init__(
-        self,
-        kernel_size=(16, 16),
-        stride=(16, 16),
-        padding=(0, 0),
-        in_chans=3,
-        embed_dim=768,
-    ):
-        """
-        Args:
-            kernel_size (Tuple): kernel size of the projection layer.
-            stride (Tuple): stride of the projection layer.
-            padding (Tuple): padding size of the projection layer.
-            in_chans (int): Number of input image channels.
-            embed_dim (int):  embed_dim (int): Patch embedding dimension.
-        """
-        super().__init__()
-
-        self.proj = nn.Conv2d(
-            in_chans, embed_dim, kernel_size=kernel_size, stride=stride, padding=padding
-        )
-
-    def forward(self, x):
-        x = self.proj(x)
-        # B C H W -> B H W C
-        x = x.permute(0, 2, 3, 1)
-        return x
diff --git a/eval/vbench/third_party/grit_src/grit/modeling/backbone/vit.py b/eval/vbench/third_party/grit_src/grit/modeling/backbone/vit.py
deleted file mode 100644
index e2bee979..00000000
--- a/eval/vbench/third_party/grit_src/grit/modeling/backbone/vit.py
+++ /dev/null
@@ -1,628 +0,0 @@
-# Modified by Jialian Wu from https://github.com/facebookresearch/detectron2/blob/main/detectron2/modeling/backbone/vit.py
-import logging
-import math
-import os
-import sys
-from functools import partial
-
-import fvcore.nn.weight_init as weight_init
-import torch
-import torch.nn as nn
-from detectron2.layers import CNNBlockBase, Conv2d, ShapeSpec, get_norm
-from detectron2.modeling.backbone.build import BACKBONE_REGISTRY
-
-CUR_DIR = os.path.dirname(os.path.abspath(__file__))
-sys.path.append(os.path.join(CUR_DIR, "../../../centernet2"))
-import torch.utils.checkpoint as checkpoint
-from centernet.modeling.backbone.fpn_p5 import LastLevelP6P7_P5
-from detectron2.modeling.backbone.backbone import Backbone
-from timm.models.layers import DropPath, Mlp, trunc_normal_
-
-from .utils import (
-    PatchEmbed,
-    add_decomposed_rel_pos,
-    get_abs_pos,
-    window_partition,
-    window_unpartition,
-)
-
-logger = logging.getLogger(__name__)
-
-
-__all__ = ["ViT"]
-
-
-class Attention(nn.Module):
-    """Multi-head Attention block with relative position embeddings."""
-
-    def __init__(
-        self,
-        dim,
-        num_heads=8,
-        qkv_bias=True,
-        use_rel_pos=False,
-        rel_pos_zero_init=True,
-        input_size=None,
-    ):
-        """
-        Args:
-            dim (int): Number of input channels.
-            num_heads (int): Number of attention heads.
-            qkv_bias (bool:  If True, add a learnable bias to query, key, value.
-            rel_pos (bool): If True, add relative positional embeddings to the attention map.
-            rel_pos_zero_init (bool): If True, zero initialize relative positional parameters.
-            input_size (int or None): Input resolution for calculating the relative positional
-                parameter size.
-        """
-        super().__init__()
-        self.num_heads = num_heads
-        head_dim = dim // num_heads
-        self.scale = head_dim**-0.5
-
-        self.qkv = nn.Linear(dim, dim * 3, bias=qkv_bias)
-        self.proj = nn.Linear(dim, dim)
-
-        self.use_rel_pos = use_rel_pos
-        if self.use_rel_pos:
-            # initialize relative positional embeddings
-            self.rel_pos_h = nn.Parameter(torch.zeros(2 * input_size[0] - 1, head_dim))
-            self.rel_pos_w = nn.Parameter(torch.zeros(2 * input_size[1] - 1, head_dim))
-
-            if not rel_pos_zero_init:
-                trunc_normal_(self.rel_pos_h, std=0.02)
-                trunc_normal_(self.rel_pos_w, std=0.02)
-
-    def forward(self, x):
-        B, H, W, _ = x.shape
-        # qkv with shape (3, B, nHead, H * W, C)
-        qkv = (
-            self.qkv(x).reshape(B, H * W, 3, self.num_heads, -1).permute(2, 0, 3, 1, 4)
-        )
-        # q, k, v with shape (B * nHead, H * W, C)
-        q, k, v = qkv.reshape(3, B * self.num_heads, H * W, -1).unbind(0)
-
-        attn = (q * self.scale) @ k.transpose(-2, -1)
-
-        if self.use_rel_pos:
-            attn = add_decomposed_rel_pos(
-                attn, q, self.rel_pos_h, self.rel_pos_w, (H, W), (H, W)
-            )
-
-        attn = attn.softmax(dim=-1)
-        x = (
-            (attn @ v)
-            .view(B, self.num_heads, H, W, -1)
-            .permute(0, 2, 3, 1, 4)
-            .reshape(B, H, W, -1)
-        )
-        x = self.proj(x)
-
-        return x
-
-
-class ResBottleneckBlock(CNNBlockBase):
-    """
-    The standard bottleneck residual block without the last activation layer.
-    It contains 3 conv layers with kernels 1x1, 3x3, 1x1.
-    """
-
-    def __init__(
-        self,
-        in_channels,
-        out_channels,
-        bottleneck_channels,
-        norm="LN",
-        act_layer=nn.GELU,
-    ):
-        """
-        Args:
-            in_channels (int): Number of input channels.
-            out_channels (int): Number of output channels.
-            bottleneck_channels (int): number of output channels for the 3x3
-                "bottleneck" conv layers.
-            norm (str or callable): normalization for all conv layers.
-                See :func:`layers.get_norm` for supported format.
-            act_layer (callable): activation for all conv layers.
-        """
-        super().__init__(in_channels, out_channels, 1)
-
-        self.conv1 = Conv2d(in_channels, bottleneck_channels, 1, bias=False)
-        self.norm1 = get_norm(norm, bottleneck_channels)
-        self.act1 = act_layer()
-
-        self.conv2 = Conv2d(
-            bottleneck_channels,
-            bottleneck_channels,
-            3,
-            padding=1,
-            bias=False,
-        )
-        self.norm2 = get_norm(norm, bottleneck_channels)
-        self.act2 = act_layer()
-
-        self.conv3 = Conv2d(bottleneck_channels, out_channels, 1, bias=False)
-        self.norm3 = get_norm(norm, out_channels)
-
-        for layer in [self.conv1, self.conv2, self.conv3]:
-            weight_init.c2_msra_fill(layer)
-        for layer in [self.norm1, self.norm2]:
-            layer.weight.data.fill_(1.0)
-            layer.bias.data.zero_()
-        # zero init last norm layer.
-        self.norm3.weight.data.zero_()
-        self.norm3.bias.data.zero_()
-
-    def forward(self, x):
-        out = x
-        for layer in self.children():
-            out = layer(out)
-
-        out = x + out
-        return out
-
-
-class Block(nn.Module):
-    """Transformer blocks with support of window attention and residual propagation blocks"""
-
-    def __init__(
-        self,
-        dim,
-        num_heads,
-        mlp_ratio=4.0,
-        qkv_bias=True,
-        drop_path=0.0,
-        norm_layer=nn.LayerNorm,
-        act_layer=nn.GELU,
-        use_rel_pos=False,
-        rel_pos_zero_init=True,
-        window_size=0,
-        use_residual_block=False,
-        input_size=None,
-    ):
-        """
-        Args:
-            dim (int): Number of input channels.
-            num_heads (int): Number of attention heads in each ViT block.
-            mlp_ratio (float): Ratio of mlp hidden dim to embedding dim.
-            qkv_bias (bool): If True, add a learnable bias to query, key, value.
-            drop_path (float): Stochastic depth rate.
-            norm_layer (nn.Module): Normalization layer.
-            act_layer (nn.Module): Activation layer.
-            use_rel_pos (bool): If True, add relative positional embeddings to the attention map.
-            rel_pos_zero_init (bool): If True, zero initialize relative positional parameters.
-            window_size (int): Window size for window attention blocks. If it equals 0, then not
-                use window attention.
-            use_residual_block (bool): If True, use a residual block after the MLP block.
-            input_size (int or None): Input resolution for calculating the relative positional
-                parameter size.
-        """
-        super().__init__()
-        self.norm1 = norm_layer(dim)
-        self.attn = Attention(
-            dim,
-            num_heads=num_heads,
-            qkv_bias=qkv_bias,
-            use_rel_pos=use_rel_pos,
-            rel_pos_zero_init=rel_pos_zero_init,
-            input_size=input_size if window_size == 0 else (window_size, window_size),
-        )
-
-        self.drop_path = DropPath(drop_path) if drop_path > 0.0 else nn.Identity()
-        self.norm2 = norm_layer(dim)
-        self.mlp = Mlp(
-            in_features=dim, hidden_features=int(dim * mlp_ratio), act_layer=act_layer
-        )
-
-        self.window_size = window_size
-
-        self.use_residual_block = use_residual_block
-        if use_residual_block:
-            # Use a residual block with bottleneck channel as dim // 2
-            self.residual = ResBottleneckBlock(
-                in_channels=dim,
-                out_channels=dim,
-                bottleneck_channels=dim // 2,
-                norm="LN",
-                act_layer=act_layer,
-            )
-
-    def forward(self, x):
-        shortcut = x
-        x = self.norm1(x)
-        # Window partition
-        if self.window_size > 0:
-            H, W = x.shape[1], x.shape[2]
-            x, pad_hw = window_partition(x, self.window_size)
-
-        x = self.attn(x)
-        # Reverse window partition
-        if self.window_size > 0:
-            x = window_unpartition(x, self.window_size, pad_hw, (H, W))
-
-        x = shortcut + self.drop_path(x)
-        x = x + self.drop_path(self.mlp(self.norm2(x)))
-
-        if self.use_residual_block:
-            x = self.residual(x.permute(0, 3, 1, 2)).permute(0, 2, 3, 1)
-
-        return x
-
-
-class ViT(Backbone):
-    """
-    This module implements Vision Transformer (ViT) backbone in :paper:`vitdet`.
-    "Exploring Plain Vision Transformer Backbones for Object Detection",
-    https://arxiv.org/abs/2203.16527
-    """
-
-    def __init__(
-        self,
-        img_size=1024,
-        patch_size=16,
-        in_chans=3,
-        embed_dim=768,
-        depth=12,
-        num_heads=12,
-        mlp_ratio=4.0,
-        qkv_bias=True,
-        drop_path_rate=0.0,
-        norm_layer=nn.LayerNorm,
-        act_layer=nn.GELU,
-        use_abs_pos=True,
-        use_rel_pos=False,
-        rel_pos_zero_init=True,
-        window_size=0,
-        window_block_indexes=(),
-        residual_block_indexes=(),
-        use_act_checkpoint=True,
-        pretrain_img_size=224,
-        pretrain_use_cls_token=True,
-        out_feature="last_feat",
-    ):
-        """
-        Args:
-            img_size (int): Input image size.
-            patch_size (int): Patch size.
-            in_chans (int): Number of input image channels.
-            embed_dim (int): Patch embedding dimension.
-            depth (int): Depth of ViT.
-            num_heads (int): Number of attention heads in each ViT block.
-            mlp_ratio (float): Ratio of mlp hidden dim to embedding dim.
-            qkv_bias (bool): If True, add a learnable bias to query, key, value.
-            drop_path_rate (float): Stochastic depth rate.
-            norm_layer (nn.Module): Normalization layer.
-            act_layer (nn.Module): Activation layer.
-            use_abs_pos (bool): If True, use absolute positional embeddings.
-            use_rel_pos (bool): If True, add relative positional embeddings to the attention map.
-            rel_pos_zero_init (bool): If True, zero initialize relative positional parameters.
-            window_size (int): Window size for window attention blocks.
-            window_block_indexes (list): Indexes for blocks using window attention.
-            residual_block_indexes (list): Indexes for blocks using conv propagation.
-            use_act_checkpoint (bool): If True, use activation checkpointing.
-            pretrain_img_size (int): input image size for pretraining models.
-            pretrain_use_cls_token (bool): If True, pretrainig models use class token.
-            out_feature (str): name of the feature from the last block.
-        """
-        super().__init__()
-        self.pretrain_use_cls_token = pretrain_use_cls_token
-        self.use_act_checkpoint = use_act_checkpoint
-
-        self.patch_embed = PatchEmbed(
-            kernel_size=(patch_size, patch_size),
-            stride=(patch_size, patch_size),
-            in_chans=in_chans,
-            embed_dim=embed_dim,
-        )
-
-        if use_abs_pos:
-            # Initialize absolute positional embedding with pretrain image size.
-            num_patches = (pretrain_img_size // patch_size) * (
-                pretrain_img_size // patch_size
-            )
-            num_positions = (num_patches + 1) if pretrain_use_cls_token else num_patches
-            self.pos_embed = nn.Parameter(torch.zeros(1, num_positions, embed_dim))
-        else:
-            self.pos_embed = None
-
-        # stochastic depth decay rule
-        dpr = [x.item() for x in torch.linspace(0, drop_path_rate, depth)]
-
-        self.blocks = nn.ModuleList()
-        for i in range(depth):
-            block = Block(
-                dim=embed_dim,
-                num_heads=num_heads,
-                mlp_ratio=mlp_ratio,
-                qkv_bias=qkv_bias,
-                drop_path=dpr[i],
-                norm_layer=norm_layer,
-                act_layer=act_layer,
-                use_rel_pos=use_rel_pos,
-                rel_pos_zero_init=rel_pos_zero_init,
-                window_size=window_size if i in window_block_indexes else 0,
-                use_residual_block=i in residual_block_indexes,
-                input_size=(img_size // patch_size, img_size // patch_size),
-            )
-            self.blocks.append(block)
-
-        self._out_feature_channels = {out_feature: embed_dim}
-        self._out_feature_strides = {out_feature: patch_size}
-        self._out_features = [out_feature]
-
-        if self.pos_embed is not None:
-            trunc_normal_(self.pos_embed, std=0.02)
-
-        self.apply(self._init_weights)
-
-    def _init_weights(self, m):
-        if isinstance(m, nn.Linear):
-            trunc_normal_(m.weight, std=0.02)
-            if isinstance(m, nn.Linear) and m.bias is not None:
-                nn.init.constant_(m.bias, 0)
-        elif isinstance(m, nn.LayerNorm):
-            nn.init.constant_(m.bias, 0)
-            nn.init.constant_(m.weight, 1.0)
-
-    def forward(self, x):
-        x = self.patch_embed(x)
-        if self.pos_embed is not None:
-            x = x + get_abs_pos(
-                self.pos_embed, self.pretrain_use_cls_token, (x.shape[1], x.shape[2])
-            )
-
-        for blk in self.blocks:
-            if self.use_act_checkpoint:
-                x = checkpoint.checkpoint(blk, x)
-            else:
-                x = blk(x)
-
-        return x.permute(0, 3, 1, 2)
-
-
-class ViT_FPN(Backbone):
-    def __init__(
-        self,
-        bottom_up=None,
-        top_block=None,
-        out_channels=None,
-        strides=None,
-        vit_out_dim=None,
-    ):
-        super(ViT_FPN, self).__init__()
-        assert isinstance(bottom_up, Backbone)
-        self.bottom_up = bottom_up
-        self.top_block = top_block
-
-        self._out_feature_strides = {
-            "p{}".format(int(math.log2(s))): s for s in strides
-        }
-        self._out_features = list(self._out_feature_strides.keys())
-        self._out_feature_channels = {k: out_channels for k in self._out_features}
-        self._size_divisibility = strides[2]
-
-        self.maxpool = nn.MaxPool2d(2, stride=2)
-        self.fpn_stride_16_8 = nn.ConvTranspose2d(
-            vit_out_dim, vit_out_dim, 2, stride=2, bias=False
-        )
-        self.fpn_stride8_conv1 = nn.Conv2d(
-            in_channels=vit_out_dim,
-            out_channels=out_channels,
-            kernel_size=1,
-            bias=False,
-        )
-        self.fpn_stride8_norm1 = nn.LayerNorm(out_channels)
-        self.fpn_stride8_conv2 = nn.Conv2d(
-            in_channels=out_channels,
-            out_channels=out_channels,
-            kernel_size=3,
-            stride=1,
-            padding=1,
-            bias=False,
-        )
-        self.fpn_stride8_norm2 = nn.LayerNorm(out_channels)
-
-        self.fpn_stride16_conv1 = nn.Conv2d(
-            in_channels=vit_out_dim,
-            out_channels=out_channels,
-            kernel_size=1,
-            bias=False,
-        )
-        self.fpn_stride16_norm1 = nn.LayerNorm(out_channels)
-        self.fpn_stride16_conv2 = nn.Conv2d(
-            in_channels=out_channels,
-            out_channels=out_channels,
-            kernel_size=3,
-            stride=1,
-            padding=1,
-            bias=False,
-        )
-        self.fpn_stride16_norm2 = nn.LayerNorm(out_channels)
-
-        self.fpn_stride32_conv1 = nn.Conv2d(
-            in_channels=vit_out_dim,
-            out_channels=out_channels,
-            kernel_size=1,
-            bias=False,
-        )
-        self.fpn_stride32_norm1 = nn.LayerNorm(out_channels)
-        self.fpn_stride32_conv2 = nn.Conv2d(
-            in_channels=out_channels,
-            out_channels=out_channels,
-            kernel_size=3,
-            stride=1,
-            padding=1,
-            bias=False,
-        )
-        self.fpn_stride32_norm2 = nn.LayerNorm(out_channels)
-
-    def forward(self, x):
-        vit_output_featuremap = self.bottom_up(x)
-
-        stride8_feature = self.fpn_stride_16_8(vit_output_featuremap)
-        stride8_feature = self.fpn_stride8_norm1(
-            self.fpn_stride8_conv1(stride8_feature).permute(0, 2, 3, 1)
-        ).permute(0, 3, 1, 2)
-        stride8_feature = self.fpn_stride8_norm2(
-            self.fpn_stride8_conv2(stride8_feature).permute(0, 2, 3, 1)
-        ).permute(0, 3, 1, 2)
-
-        stride32_feature = self.maxpool(vit_output_featuremap)
-        stride32_feature = self.fpn_stride32_norm1(
-            self.fpn_stride32_conv1(stride32_feature).permute(0, 2, 3, 1)
-        ).permute(0, 3, 1, 2)
-        stride32_feature = self.fpn_stride32_norm2(
-            self.fpn_stride32_conv2(stride32_feature).permute(0, 2, 3, 1)
-        ).permute(0, 3, 1, 2)
-
-        stride16_feature = self.fpn_stride16_norm1(
-            self.fpn_stride16_conv1(vit_output_featuremap).permute(0, 2, 3, 1)
-        ).permute(0, 3, 1, 2)
-        stride16_feature = self.fpn_stride16_norm2(
-            self.fpn_stride16_conv2(stride16_feature).permute(0, 2, 3, 1)
-        ).permute(0, 3, 1, 2)
-
-        results = [stride8_feature, stride16_feature, stride32_feature]
-
-        results.extend(self.top_block(stride32_feature))
-
-        assert len(self._out_features) == len(results)
-        fpn_out = {f: res for f, res in zip(self._out_features, results)}
-
-        return fpn_out
-
-    @property
-    def size_divisibility(self):
-        return self._size_divisibility
-
-    def output_shape(self):
-        return {
-            name: ShapeSpec(
-                channels=self._out_feature_channels[name],
-                stride=self._out_feature_strides[name],
-            )
-            for name in self._out_features
-        }
-
-
-@BACKBONE_REGISTRY.register()
-def build_vit_fpn_backbone(cfg, input_shape: ShapeSpec):
-    embed_dim = 768
-    vit_out_dim = embed_dim
-    bottom_up = ViT(  # Single-scale ViT backbone
-        img_size=1024,
-        patch_size=16,
-        embed_dim=embed_dim,
-        depth=12,
-        num_heads=12,
-        drop_path_rate=0.1,
-        window_size=14,
-        mlp_ratio=4,
-        qkv_bias=True,
-        norm_layer=partial(nn.LayerNorm, eps=1e-6),
-        window_block_indexes=[
-            # 2, 5, 8 11 for global attention
-            0,
-            1,
-            3,
-            4,
-            6,
-            7,
-            9,
-            10,
-        ],
-        residual_block_indexes=[],
-        use_act_checkpoint=cfg.USE_ACT_CHECKPOINT,
-        use_rel_pos=True,
-        out_feature="last_feat",
-    )
-
-    out_channels = cfg.MODEL.FPN.OUT_CHANNELS
-    assert out_channels == 256 or out_channels == 768 or out_channels == 1024
-    backbone = ViT_FPN(
-        bottom_up=bottom_up,
-        top_block=LastLevelP6P7_P5(out_channels, out_channels),
-        out_channels=out_channels,
-        strides=[8, 16, 32, 64, 128],
-        vit_out_dim=vit_out_dim,
-    )
-    return backbone
-
-
-@BACKBONE_REGISTRY.register()
-def build_vit_fpn_backbone_large(cfg, input_shape: ShapeSpec):
-    window_block_indexes = (
-        list(range(0, 5))
-        + list(range(6, 11))
-        + list(range(12, 17))
-        + list(range(18, 23))
-    )
-    embed_dim = 1024
-    vit_out_dim = embed_dim
-    bottom_up = ViT(  # Single-scale ViT backbone
-        img_size=1024,
-        patch_size=16,
-        embed_dim=embed_dim,
-        depth=24,
-        num_heads=16,
-        drop_path_rate=0.4,
-        window_size=14,
-        mlp_ratio=4,
-        qkv_bias=True,
-        norm_layer=partial(nn.LayerNorm, eps=1e-6),
-        window_block_indexes=window_block_indexes,
-        residual_block_indexes=[],
-        use_act_checkpoint=cfg.USE_ACT_CHECKPOINT,
-        use_rel_pos=True,
-        out_feature="last_feat",
-    )
-
-    out_channels = cfg.MODEL.FPN.OUT_CHANNELS
-    assert out_channels == 256 or out_channels == 768 or out_channels == 1024
-    backbone = ViT_FPN(
-        bottom_up=bottom_up,
-        top_block=LastLevelP6P7_P5(out_channels, out_channels),
-        out_channels=out_channels,
-        strides=[8, 16, 32, 64, 128],
-        vit_out_dim=vit_out_dim,
-    )
-    return backbone
-
-
-@BACKBONE_REGISTRY.register()
-def build_vit_fpn_backbone_huge(cfg, input_shape: ShapeSpec):
-    window_block_indexes = (
-        list(range(0, 7))
-        + list(range(8, 15))
-        + list(range(16, 23))
-        + list(range(24, 31))
-    )
-    embed_dim = 1280
-    vit_out_dim = embed_dim
-    bottom_up = ViT(  # Single-scale ViT backbone
-        img_size=1024,
-        patch_size=16,
-        embed_dim=embed_dim,
-        depth=32,
-        num_heads=16,
-        drop_path_rate=0.5,
-        window_size=14,
-        mlp_ratio=4,
-        qkv_bias=True,
-        norm_layer=partial(nn.LayerNorm, eps=1e-6),
-        window_block_indexes=window_block_indexes,
-        residual_block_indexes=[],
-        use_act_checkpoint=cfg.USE_ACT_CHECKPOINT,
-        use_rel_pos=True,
-        out_feature="last_feat",
-    )
-
-    out_channels = cfg.MODEL.FPN.OUT_CHANNELS
-    assert out_channels == 256 or out_channels == 768 or out_channels == 1024
-    backbone = ViT_FPN(
-        bottom_up=bottom_up,
-        top_block=LastLevelP6P7_P5(out_channels, out_channels),
-        out_channels=out_channels,
-        strides=[8, 16, 32, 64, 128],
-        vit_out_dim=vit_out_dim,
-    )
-    return backbone
diff --git a/eval/vbench/third_party/grit_src/grit/modeling/meta_arch/__init__.py b/eval/vbench/third_party/grit_src/grit/modeling/meta_arch/__init__.py
deleted file mode 100644
index e69de29b..00000000
diff --git a/eval/vbench/third_party/grit_src/grit/modeling/meta_arch/grit.py b/eval/vbench/third_party/grit_src/grit/modeling/meta_arch/grit.py
deleted file mode 100644
index 62691846..00000000
--- a/eval/vbench/third_party/grit_src/grit/modeling/meta_arch/grit.py
+++ /dev/null
@@ -1,72 +0,0 @@
-from typing import Dict, List, Optional, Tuple
-
-import torch
-from detectron2.config import configurable
-from detectron2.modeling.meta_arch.build import META_ARCH_REGISTRY
-from detectron2.modeling.meta_arch.rcnn import GeneralizedRCNN
-from detectron2.structures import Boxes, ImageList, Instances
-
-
-@META_ARCH_REGISTRY.register()
-class GRiT(GeneralizedRCNN):
-    @configurable
-    def __init__(self, **kwargs):
-        super().__init__(**kwargs)
-        assert self.proposal_generator is not None
-
-    @classmethod
-    def from_config(cls, cfg):
-        ret = super().from_config(cfg)
-        return ret
-
-    def inference(
-        self,
-        batched_inputs: Tuple[Dict[str, torch.Tensor]],
-        detected_instances: Optional[List[Instances]] = None,
-        do_postprocess: bool = True,
-    ):
-        assert not self.training
-        assert detected_instances is None
-
-        images = self.preprocess_image(batched_inputs)
-        features = self.backbone(images.tensor)
-        proposals, _ = self.proposal_generator(images, features, None)
-        results, _ = self.roi_heads(features, proposals)
-        results_det, _ = self.roi_heads.forward_object(features, proposals)
-        # results_det.get
-        for idx in range(len(results)):
-            obj_type = results_det[idx].get("pred_object_descriptions")
-            results[idx].set("det_obj", obj_type)
-        if do_postprocess:
-            assert (
-                not torch.jit.is_scripting()
-            ), "Scripting is not supported for postprocess."
-            return GRiT._postprocess(results, batched_inputs, images.image_sizes)
-        else:
-            return results
-
-    def forward(self, batched_inputs: List[Dict[str, torch.Tensor]]):
-        if not self.training:
-            return self.inference(batched_inputs)
-
-        images = self.preprocess_image(batched_inputs)
-
-        gt_instances = [x["instances"].to(self.device) for x in batched_inputs]
-
-        targets_task = batched_inputs[0]["task"]
-        for anno_per_image in batched_inputs:
-            assert targets_task == anno_per_image["task"]
-
-        features = self.backbone(images.tensor)
-        proposals, proposal_losses = self.proposal_generator(
-            images, features, gt_instances
-        )
-        proposals, roihead_textdecoder_losses = self.roi_heads(
-            features, proposals, gt_instances, targets_task=targets_task
-        )
-
-        losses = {}
-        losses.update(roihead_textdecoder_losses)
-        losses.update(proposal_losses)
-
-        return losses
diff --git a/eval/vbench/third_party/grit_src/grit/modeling/roi_heads/__init__.py b/eval/vbench/third_party/grit_src/grit/modeling/roi_heads/__init__.py
deleted file mode 100644
index e69de29b..00000000
diff --git a/eval/vbench/third_party/grit_src/grit/modeling/roi_heads/grit_fast_rcnn.py b/eval/vbench/third_party/grit_src/grit/modeling/roi_heads/grit_fast_rcnn.py
deleted file mode 100644
index 61175c1b..00000000
--- a/eval/vbench/third_party/grit_src/grit/modeling/roi_heads/grit_fast_rcnn.py
+++ /dev/null
@@ -1,142 +0,0 @@
-# Copyright (c) Facebook, Inc. and its affiliates.
-# Modified by Jialian Wu from https://github.com/facebookresearch/Detic/blob/main/detic/modeling/roi_heads/detic_fast_rcnn.py
-import fvcore.nn.weight_init as weight_init
-import torch
-from detectron2.config import configurable
-from detectron2.layers import ShapeSpec, batched_nms, cat, cross_entropy, nonzero_tuple
-from detectron2.modeling.roi_heads.fast_rcnn import (
-    FastRCNNOutputLayers,
-    _log_classification_stats,
-)
-from fvcore.nn import giou_loss, smooth_l1_loss
-from torch import nn
-from torch.nn import functional as F
-
-__all__ = ["GRiTFastRCNNOutputLayers"]
-
-
-class GRiTFastRCNNOutputLayers(FastRCNNOutputLayers):
-    @configurable
-    def __init__(
-        self,
-        input_shape: ShapeSpec,
-        **kwargs,
-    ):
-        super().__init__(
-            input_shape=input_shape,
-            **kwargs,
-        )
-
-        input_size = (
-            input_shape.channels * (input_shape.width or 1) * (input_shape.height or 1)
-        )
-
-        self.bbox_pred = nn.Sequential(
-            nn.Linear(input_size, input_size),
-            nn.ReLU(inplace=True),
-            nn.Linear(input_size, 4),
-        )
-        weight_init.c2_xavier_fill(self.bbox_pred[0])
-        nn.init.normal_(self.bbox_pred[-1].weight, std=0.001)
-        nn.init.constant_(self.bbox_pred[-1].bias, 0)
-
-    @classmethod
-    def from_config(cls, cfg, input_shape):
-        ret = super().from_config(cfg, input_shape)
-        return ret
-
-    def losses(self, predictions, proposals):
-        scores, proposal_deltas = predictions
-        gt_classes = (
-            cat([p.gt_classes for p in proposals], dim=0)
-            if len(proposals)
-            else torch.empty(0)
-        )
-        num_classes = self.num_classes
-        _log_classification_stats(scores, gt_classes)
-
-        if len(proposals):
-            proposal_boxes = cat(
-                [p.proposal_boxes.tensor for p in proposals], dim=0
-            )  # Nx4
-            assert (
-                not proposal_boxes.requires_grad
-            ), "Proposals should not require gradients!"
-            gt_boxes = cat(
-                [
-                    (p.gt_boxes if p.has("gt_boxes") else p.proposal_boxes).tensor
-                    for p in proposals
-                ],
-                dim=0,
-            )
-        else:
-            proposal_boxes = gt_boxes = torch.empty(
-                (0, 4), device=proposal_deltas.device
-            )
-
-        loss_cls = self.softmax_cross_entropy_loss(scores, gt_classes)
-        return {
-            "loss_cls": loss_cls,
-            "loss_box_reg": self.box_reg_loss(
-                proposal_boxes,
-                gt_boxes,
-                proposal_deltas,
-                gt_classes,
-                num_classes=num_classes,
-            ),
-        }
-
-    def softmax_cross_entropy_loss(self, pred_class_logits, gt_classes):
-        if pred_class_logits.numel() == 0:
-            return pred_class_logits.new_zeros([1])[0]
-
-        loss = F.cross_entropy(pred_class_logits, gt_classes, reduction="mean")
-        return loss
-
-    def box_reg_loss(
-        self, proposal_boxes, gt_boxes, pred_deltas, gt_classes, num_classes=-1
-    ):
-        num_classes = num_classes if num_classes > 0 else self.num_classes
-        box_dim = proposal_boxes.shape[1]
-        fg_inds = nonzero_tuple((gt_classes >= 0) & (gt_classes < num_classes))[0]
-        if pred_deltas.shape[1] == box_dim:
-            fg_pred_deltas = pred_deltas[fg_inds]
-        else:
-            fg_pred_deltas = pred_deltas.view(-1, self.num_classes, box_dim)[
-                fg_inds, gt_classes[fg_inds]
-            ]
-
-        if self.box_reg_loss_type == "smooth_l1":
-            gt_pred_deltas = self.box2box_transform.get_deltas(
-                proposal_boxes[fg_inds],
-                gt_boxes[fg_inds],
-            )
-            loss_box_reg = smooth_l1_loss(
-                fg_pred_deltas, gt_pred_deltas, self.smooth_l1_beta, reduction="sum"
-            )
-        elif self.box_reg_loss_type == "giou":
-            fg_pred_boxes = self.box2box_transform.apply_deltas(
-                fg_pred_deltas, proposal_boxes[fg_inds]
-            )
-            loss_box_reg = giou_loss(fg_pred_boxes, gt_boxes[fg_inds], reduction="sum")
-        else:
-            raise ValueError(f"Invalid bbox reg loss type '{self.box_reg_loss_type}'")
-        return loss_box_reg / max(gt_classes.numel(), 1.0)
-
-    def predict_probs(self, predictions, proposals):
-        scores = predictions[0]
-        num_inst_per_image = [len(p) for p in proposals]
-        probs = F.softmax(scores, dim=-1)
-        return probs.split(num_inst_per_image, dim=0)
-
-    def forward(self, x):
-        if x.dim() > 2:
-            x = torch.flatten(x, start_dim=1)
-        scores = []
-
-        cls_scores = self.cls_score(x)
-        scores.append(cls_scores)
-        scores = torch.cat(scores, dim=1)
-
-        proposal_deltas = self.bbox_pred(x)
-        return scores, proposal_deltas
diff --git a/eval/vbench/third_party/grit_src/grit/modeling/roi_heads/grit_roi_heads.py b/eval/vbench/third_party/grit_src/grit/modeling/roi_heads/grit_roi_heads.py
deleted file mode 100644
index 54f1121f..00000000
--- a/eval/vbench/third_party/grit_src/grit/modeling/roi_heads/grit_roi_heads.py
+++ /dev/null
@@ -1,611 +0,0 @@
-import logging
-import math
-from typing import Dict, List, Optional, Tuple, Union
-
-import torch
-from detectron2.config import configurable
-from detectron2.layers import batched_nms
-from detectron2.modeling.box_regression import Box2BoxTransform
-from detectron2.modeling.poolers import ROIPooler
-from detectron2.modeling.roi_heads.cascade_rcnn import CascadeROIHeads, _ScaleGradient
-from detectron2.modeling.roi_heads.roi_heads import ROI_HEADS_REGISTRY, StandardROIHeads
-from detectron2.structures import Boxes, Instances, pairwise_iou
-from detectron2.utils.events import get_event_storage
-from transformers import BertTokenizer
-from vbench.third_party.grit_src.grit.data.custom_dataset_mapper import ObjDescription
-
-from ..soft_nms import batched_soft_nms
-from ..text.load_text_token import LoadTextTokens
-from ..text.text_decoder import (
-    AutoRegressiveBeamSearch,
-    GRiTTextDecoder,
-    TransformerDecoderTextualHead,
-)
-from .grit_fast_rcnn import GRiTFastRCNNOutputLayers
-
-logger = logging.getLogger(__name__)
-
-
-@ROI_HEADS_REGISTRY.register()
-class GRiTROIHeadsAndTextDecoder(CascadeROIHeads):
-    @configurable
-    def __init__(
-        self,
-        *,
-        text_decoder_transformer,
-        train_task: list,
-        test_task: str,
-        mult_proposal_score: bool = False,
-        mask_weight: float = 1.0,
-        object_feat_pooler=None,
-        soft_nms_enabled=False,
-        beam_size=1,
-        **kwargs,
-    ):
-        super().__init__(**kwargs)
-        self.mult_proposal_score = mult_proposal_score
-        self.mask_weight = mask_weight
-        self.object_feat_pooler = object_feat_pooler
-        self.soft_nms_enabled = soft_nms_enabled
-        self.test_task = test_task
-        self.beam_size = beam_size
-
-        tokenizer = BertTokenizer.from_pretrained(
-            "bert-base-uncased", do_lower_case=True
-        )
-        self.tokenizer = tokenizer
-
-        assert test_task in train_task, (
-            "GRiT has not been trained on {} task, "
-            "please verify the task name or train a new "
-            "GRiT on {} task".format(test_task, test_task)
-        )
-        task_begin_tokens = {}
-        for i, task in enumerate(train_task):
-            if i == 0:
-                task_begin_tokens[task] = tokenizer.cls_token_id
-            else:
-                task_begin_tokens[task] = 103 + i
-        self.task_begin_tokens = task_begin_tokens
-
-        beamsearch_decode = AutoRegressiveBeamSearch(
-            end_token_id=tokenizer.sep_token_id,
-            max_steps=40,
-            beam_size=beam_size,
-            objectdet=test_task == "ObjectDet",
-            per_node_beam_size=1,
-        )
-        self.text_decoder = GRiTTextDecoder(
-            text_decoder_transformer,
-            beamsearch_decode=beamsearch_decode,
-            begin_token_id=task_begin_tokens[test_task],
-            loss_type="smooth",
-            tokenizer=tokenizer,
-        )
-        self.text_decoder_det = GRiTTextDecoder(
-            text_decoder_transformer,
-            beamsearch_decode=beamsearch_decode,
-            begin_token_id=task_begin_tokens["ObjectDet"],
-            loss_type="smooth",
-            tokenizer=tokenizer,
-        )
-        self.get_target_text_tokens = LoadTextTokens(
-            tokenizer, max_text_len=40, padding="do_not_pad"
-        )
-
-    @classmethod
-    def from_config(cls, cfg, input_shape):
-        ret = super().from_config(cfg, input_shape)
-        text_decoder_transformer = TransformerDecoderTextualHead(
-            object_feature_size=cfg.MODEL.FPN.OUT_CHANNELS,
-            vocab_size=cfg.TEXT_DECODER.VOCAB_SIZE,
-            hidden_size=cfg.TEXT_DECODER.HIDDEN_SIZE,
-            num_layers=cfg.TEXT_DECODER.NUM_LAYERS,
-            attention_heads=cfg.TEXT_DECODER.ATTENTION_HEADS,
-            feedforward_size=cfg.TEXT_DECODER.FEEDFORWARD_SIZE,
-            mask_future_positions=True,
-            padding_idx=0,
-            decoder_type="bert_en",
-            use_act_checkpoint=cfg.USE_ACT_CHECKPOINT,
-        )
-        ret.update(
-            {
-                "text_decoder_transformer": text_decoder_transformer,
-                "train_task": cfg.MODEL.TRAIN_TASK,
-                "test_task": cfg.MODEL.TEST_TASK,
-                "mult_proposal_score": cfg.MODEL.ROI_BOX_HEAD.MULT_PROPOSAL_SCORE,
-                "mask_weight": cfg.MODEL.ROI_HEADS.MASK_WEIGHT,
-                "soft_nms_enabled": cfg.MODEL.ROI_HEADS.SOFT_NMS_ENABLED,
-                "beam_size": cfg.MODEL.BEAM_SIZE,
-            }
-        )
-        return ret
-
-    @classmethod
-    def _init_box_head(self, cfg, input_shape):
-        ret = super()._init_box_head(cfg, input_shape)
-        del ret["box_predictors"]
-        cascade_bbox_reg_weights = cfg.MODEL.ROI_BOX_CASCADE_HEAD.BBOX_REG_WEIGHTS
-        box_predictors = []
-        for box_head, bbox_reg_weights in zip(
-            ret["box_heads"], cascade_bbox_reg_weights
-        ):
-            box_predictors.append(
-                GRiTFastRCNNOutputLayers(
-                    cfg,
-                    box_head.output_shape,
-                    box2box_transform=Box2BoxTransform(weights=bbox_reg_weights),
-                )
-            )
-        ret["box_predictors"] = box_predictors
-
-        in_features = cfg.MODEL.ROI_HEADS.IN_FEATURES
-        pooler_scales = tuple(1.0 / input_shape[k].stride for k in in_features)
-        sampling_ratio = cfg.MODEL.ROI_BOX_HEAD.POOLER_SAMPLING_RATIO
-        pooler_type = cfg.MODEL.ROI_BOX_HEAD.POOLER_TYPE
-        object_feat_pooler = ROIPooler(
-            output_size=cfg.MODEL.ROI_HEADS.OBJECT_FEAT_POOLER_RES,
-            scales=pooler_scales,
-            sampling_ratio=sampling_ratio,
-            pooler_type=pooler_type,
-        )
-        ret["object_feat_pooler"] = object_feat_pooler
-        return ret
-
-    def check_if_all_background(self, proposals, targets, stage):
-        all_background = True
-        for proposals_per_image in proposals:
-            if not (proposals_per_image.gt_classes == self.num_classes).all():
-                all_background = False
-
-        if all_background:
-            logger.info("all proposals are background at stage {}".format(stage))
-            proposals[0].proposal_boxes.tensor[0, :] = targets[0].gt_boxes.tensor[0, :]
-            proposals[0].gt_boxes.tensor[0, :] = targets[0].gt_boxes.tensor[0, :]
-            proposals[0].objectness_logits[0] = math.log(
-                (1.0 - 1e-10) / (1 - (1.0 - 1e-10))
-            )
-            proposals[0].gt_classes[0] = targets[0].gt_classes[0]
-            proposals[0].gt_object_descriptions.data[0] = targets[
-                0
-            ].gt_object_descriptions.data[0]
-            if "foreground" in proposals[0].get_fields().keys():
-                proposals[0].foreground[0] = 1
-        return proposals
-
-    def _forward_box(
-        self, features, proposals, targets=None, task="ObjectDet", det_box=False
-    ):
-        if self.training:
-            proposals = self.check_if_all_background(proposals, targets, 0)
-        if (not self.training) and self.mult_proposal_score:
-            if len(proposals) > 0 and proposals[0].has("scores"):
-                proposal_scores = [p.get("scores") for p in proposals]
-            else:
-                proposal_scores = [p.get("objectness_logits") for p in proposals]
-
-        features = [features[f] for f in self.box_in_features]
-        head_outputs = []
-        prev_pred_boxes = None
-        image_sizes = [x.image_size for x in proposals]
-
-        for k in range(self.num_cascade_stages):
-            if k > 0:
-                proposals = self._create_proposals_from_boxes(
-                    prev_pred_boxes,
-                    image_sizes,
-                    logits=[p.objectness_logits for p in proposals],
-                )
-                if self.training:
-                    proposals = self._match_and_label_boxes_GRiT(proposals, k, targets)
-                    proposals = self.check_if_all_background(proposals, targets, k)
-            predictions = self._run_stage(features, proposals, k)
-            prev_pred_boxes = self.box_predictor[k].predict_boxes(
-                (predictions[0], predictions[1]), proposals
-            )
-            head_outputs.append((self.box_predictor[k], predictions, proposals))
-
-        if self.training:
-            object_features = self.object_feat_pooler(
-                features, [x.proposal_boxes for x in proposals]
-            )
-            object_features = _ScaleGradient.apply(
-                object_features, 1.0 / self.num_cascade_stages
-            )
-            foreground = torch.cat([x.foreground for x in proposals])
-            object_features = object_features[foreground > 0]
-
-            object_descriptions = []
-            for x in proposals:
-                object_descriptions += x.gt_object_descriptions[x.foreground > 0].data
-            object_descriptions = ObjDescription(object_descriptions)
-            object_descriptions = object_descriptions.data
-
-            if len(object_descriptions) > 0:
-                begin_token = self.task_begin_tokens[task]
-                text_decoder_inputs = self.get_target_text_tokens(
-                    object_descriptions, object_features, begin_token
-                )
-                object_features = (
-                    object_features.view(
-                        object_features.shape[0], object_features.shape[1], -1
-                    )
-                    .permute(0, 2, 1)
-                    .contiguous()
-                )
-                text_decoder_inputs.update({"object_features": object_features})
-                text_decoder_loss = self.text_decoder(text_decoder_inputs)
-            else:
-                text_decoder_loss = head_outputs[0][1][0].new_zeros([1])[0]
-
-            losses = {}
-            storage = get_event_storage()
-            # RoI Head losses (For the proposal generator loss, please find it in grit.py)
-            for stage, (predictor, predictions, proposals) in enumerate(head_outputs):
-                with storage.name_scope("stage{}".format(stage)):
-                    stage_losses = predictor.losses(
-                        (predictions[0], predictions[1]), proposals
-                    )
-                losses.update(
-                    {k + "_stage{}".format(stage): v for k, v in stage_losses.items()}
-                )
-            # Text Decoder loss
-            losses.update({"text_decoder_loss": text_decoder_loss})
-            return losses
-        else:
-            scores_per_stage = [h[0].predict_probs(h[1], h[2]) for h in head_outputs]
-            logits_per_stage = [(h[1][0],) for h in head_outputs]
-            scores = [
-                sum(list(scores_per_image)) * (1.0 / self.num_cascade_stages)
-                for scores_per_image in zip(*scores_per_stage)
-            ]
-            logits = [
-                sum(list(logits_per_image)) * (1.0 / self.num_cascade_stages)
-                for logits_per_image in zip(*logits_per_stage)
-            ]
-            if self.mult_proposal_score:
-                scores = [
-                    (s * ps[:, None]) ** 0.5 for s, ps in zip(scores, proposal_scores)
-                ]
-            predictor, predictions, proposals = head_outputs[-1]
-            boxes = predictor.predict_boxes((predictions[0], predictions[1]), proposals)
-            assert len(boxes) == 1
-            pred_instances, _ = self.fast_rcnn_inference_GRiT(
-                boxes,
-                scores,
-                logits,
-                image_sizes,
-                predictor.test_score_thresh,
-                predictor.test_nms_thresh,
-                predictor.test_topk_per_image,
-                self.soft_nms_enabled,
-            )
-
-            assert len(pred_instances) == 1, "Only support one image"
-            for i, pred_instance in enumerate(pred_instances):
-                if len(pred_instance.pred_boxes) > 0:
-                    object_features = self.object_feat_pooler(
-                        features, [pred_instance.pred_boxes]
-                    )
-                    object_features = (
-                        object_features.view(
-                            object_features.shape[0], object_features.shape[1], -1
-                        )
-                        .permute(0, 2, 1)
-                        .contiguous()
-                    )
-                    if det_box:
-                        text_decoder_output = self.text_decoder_det(
-                            {"object_features": object_features}
-                        )
-                    else:
-                        text_decoder_output = self.text_decoder(
-                            {"object_features": object_features}
-                        )
-                    if self.beam_size > 1 and self.test_task == "ObjectDet":
-                        pred_boxes = []
-                        pred_scores = []
-                        pred_classes = []
-                        pred_object_descriptions = []
-
-                        for beam_id in range(self.beam_size):
-                            pred_boxes.append(pred_instance.pred_boxes.tensor)
-                            # object score = sqrt(objectness score x description score)
-                            pred_scores.append(
-                                (
-                                    pred_instance.scores
-                                    * torch.exp(text_decoder_output["logprobs"])[
-                                        :, beam_id
-                                    ]
-                                )
-                                ** 0.5
-                            )
-                            pred_classes.append(pred_instance.pred_classes)
-                            for prediction in text_decoder_output["predictions"][
-                                :, beam_id, :
-                            ]:
-                                # convert text tokens to words
-                                description = self.tokenizer.decode(
-                                    prediction.tolist()[1:], skip_special_tokens=True
-                                )
-                                pred_object_descriptions.append(description)
-
-                        merged_instances = Instances(image_sizes[0])
-                        if (
-                            torch.cat(pred_scores, dim=0).shape[0]
-                            <= predictor.test_topk_per_image
-                        ):
-                            merged_instances.scores = torch.cat(pred_scores, dim=0)
-                            merged_instances.pred_boxes = Boxes(
-                                torch.cat(pred_boxes, dim=0)
-                            )
-                            merged_instances.pred_classes = torch.cat(
-                                pred_classes, dim=0
-                            )
-                            merged_instances.pred_object_descriptions = ObjDescription(
-                                pred_object_descriptions
-                            )
-                        else:
-                            pred_scores, top_idx = torch.topk(
-                                torch.cat(pred_scores, dim=0),
-                                predictor.test_topk_per_image,
-                            )
-                            merged_instances.scores = pred_scores
-                            merged_instances.pred_boxes = Boxes(
-                                torch.cat(pred_boxes, dim=0)[top_idx, :]
-                            )
-                            merged_instances.pred_classes = torch.cat(
-                                pred_classes, dim=0
-                            )[top_idx]
-                            merged_instances.pred_object_descriptions = ObjDescription(
-                                ObjDescription(pred_object_descriptions)[top_idx].data
-                            )
-
-                        pred_instances[i] = merged_instances
-                    else:
-                        # object score = sqrt(objectness score x description score)
-                        pred_instance.scores = (
-                            pred_instance.scores
-                            * torch.exp(text_decoder_output["logprobs"])
-                        ) ** 0.5
-
-                        pred_object_descriptions = []
-                        for prediction in text_decoder_output["predictions"]:
-                            # convert text tokens to words
-                            description = self.tokenizer.decode(
-                                prediction.tolist()[1:], skip_special_tokens=True
-                            )
-                            pred_object_descriptions.append(description)
-                        pred_instance.pred_object_descriptions = ObjDescription(
-                            pred_object_descriptions
-                        )
-                else:
-                    pred_instance.pred_object_descriptions = ObjDescription([])
-
-            return pred_instances
-
-    def forward(self, features, proposals, targets=None, targets_task="ObjectDet"):
-        if self.training:
-            proposals = self.label_and_sample_proposals(proposals, targets)
-
-            losses = self._forward_box(features, proposals, targets, task=targets_task)
-            if targets[0].has("gt_masks"):
-                mask_losses = self._forward_mask(features, proposals)
-                losses.update({k: v * self.mask_weight for k, v in mask_losses.items()})
-            else:
-                losses.update(
-                    self._get_empty_mask_loss(
-                        device=proposals[0].objectness_logits.device
-                    )
-                )
-            return proposals, losses
-        else:
-            pred_instances = self._forward_box(features, proposals, task=self.test_task)
-            pred_instances = self.forward_with_given_boxes(features, pred_instances)
-            return pred_instances, {}
-
-    def forward_object(
-        self, features, proposals, targets=None, targets_task="ObjectDet"
-    ):
-        if self.training:
-            proposals = self.label_and_sample_proposals(proposals, targets)
-
-            losses = self._forward_box(features, proposals, targets, task="ObjectDet")
-            if targets[0].has("gt_masks"):
-                mask_losses = self._forward_mask(features, proposals)
-                losses.update({k: v * self.mask_weight for k, v in mask_losses.items()})
-            else:
-                losses.update(
-                    self._get_empty_mask_loss(
-                        device=proposals[0].objectness_logits.device
-                    )
-                )
-            return proposals, losses
-        else:
-            pred_instances = self._forward_box(
-                features, proposals, task="ObjectDet", det_box=True
-            )
-            pred_instances = self.forward_with_given_boxes(features, pred_instances)
-            return pred_instances, {}
-
-    @torch.no_grad()
-    def _match_and_label_boxes_GRiT(self, proposals, stage, targets):
-        """
-        Add  "gt_object_description" and "foreground" to detectron2's _match_and_label_boxes
-        """
-        num_fg_samples, num_bg_samples = [], []
-        for proposals_per_image, targets_per_image in zip(proposals, targets):
-            match_quality_matrix = pairwise_iou(
-                targets_per_image.gt_boxes, proposals_per_image.proposal_boxes
-            )
-            # proposal_labels are 0 or 1
-            matched_idxs, proposal_labels = self.proposal_matchers[stage](
-                match_quality_matrix
-            )
-            if len(targets_per_image) > 0:
-                gt_classes = targets_per_image.gt_classes[matched_idxs]
-                # Label unmatched proposals (0 label from matcher) as background (label=num_classes)
-                gt_classes[proposal_labels == 0] = self.num_classes
-                foreground = torch.ones_like(gt_classes)
-                foreground[proposal_labels == 0] = 0
-                gt_boxes = targets_per_image.gt_boxes[matched_idxs]
-                gt_object_descriptions = targets_per_image.gt_object_descriptions[
-                    matched_idxs
-                ]
-            else:
-                gt_classes = torch.zeros_like(matched_idxs) + self.num_classes
-                foreground = torch.zeros_like(gt_classes)
-                gt_boxes = Boxes(
-                    targets_per_image.gt_boxes.tensor.new_zeros(
-                        (len(proposals_per_image), 4)
-                    )
-                )
-                gt_object_descriptions = ObjDescription(
-                    ["None" for i in range(len(proposals_per_image))]
-                )
-            proposals_per_image.gt_classes = gt_classes
-            proposals_per_image.gt_boxes = gt_boxes
-            proposals_per_image.gt_object_descriptions = gt_object_descriptions
-            proposals_per_image.foreground = foreground
-
-            num_fg_samples.append((proposal_labels == 1).sum().item())
-            num_bg_samples.append(proposal_labels.numel() - num_fg_samples[-1])
-
-        # Log the number of fg/bg samples in each stage
-        storage = get_event_storage()
-        storage.put_scalar(
-            "stage{}/roi_head/num_fg_samples".format(stage),
-            sum(num_fg_samples) / len(num_fg_samples),
-        )
-        storage.put_scalar(
-            "stage{}/roi_head/num_bg_samples".format(stage),
-            sum(num_bg_samples) / len(num_bg_samples),
-        )
-        return proposals
-
-    def fast_rcnn_inference_GRiT(
-        self,
-        boxes: List[torch.Tensor],
-        scores: List[torch.Tensor],
-        logits: List[torch.Tensor],
-        image_shapes: List[Tuple[int, int]],
-        score_thresh: float,
-        nms_thresh: float,
-        topk_per_image: int,
-        soft_nms_enabled: bool,
-    ):
-        result_per_image = [
-            self.fast_rcnn_inference_single_image_GRiT(
-                boxes_per_image,
-                scores_per_image,
-                logits_per_image,
-                image_shape,
-                score_thresh,
-                nms_thresh,
-                topk_per_image,
-                soft_nms_enabled,
-            )
-            for scores_per_image, boxes_per_image, image_shape, logits_per_image in zip(
-                scores, boxes, image_shapes, logits
-            )
-        ]
-        return [x[0] for x in result_per_image], [x[1] for x in result_per_image]
-
-    def fast_rcnn_inference_single_image_GRiT(
-        self,
-        boxes,
-        scores,
-        logits,
-        image_shape: Tuple[int, int],
-        score_thresh: float,
-        nms_thresh: float,
-        topk_per_image: int,
-        soft_nms_enabled,
-    ):
-        """
-        Add soft NMS to detectron2's fast_rcnn_inference_single_image
-        """
-        valid_mask = torch.isfinite(boxes).all(dim=1) & torch.isfinite(scores).all(
-            dim=1
-        )
-        if not valid_mask.all():
-            boxes = boxes[valid_mask]
-            scores = scores[valid_mask]
-            logits = logits[valid_mask]
-
-        scores = scores[:, :-1]
-        logits = logits[:, :-1]
-        num_bbox_reg_classes = boxes.shape[1] // 4
-        # Convert to Boxes to use the `clip` function ...
-        boxes = Boxes(boxes.reshape(-1, 4))
-        boxes.clip(image_shape)
-        boxes = boxes.tensor.view(-1, num_bbox_reg_classes, 4)  # R x C x 4
-
-        # 1. Filter results based on detection scores. It can make NMS more efficient
-        #    by filtering out low-confidence detections.
-        filter_mask = scores > score_thresh  # R x K
-        # R' x 2. First column contains indices of the R predictions;
-        # Second column contains indices of classes.
-        filter_inds = filter_mask.nonzero()
-        if num_bbox_reg_classes == 1:
-            boxes = boxes[filter_inds[:, 0], 0]
-        else:
-            boxes = boxes[filter_mask]
-        scores = scores[filter_mask]
-        logits = logits[filter_mask]
-
-        # 2. Apply NMS for each class independently.
-        if not soft_nms_enabled:
-            keep = batched_nms(boxes, scores, filter_inds[:, 1], nms_thresh)
-        else:
-            keep, soft_nms_scores = batched_soft_nms(
-                boxes,
-                scores,
-                filter_inds[:, 1],
-                "linear",
-                0.5,
-                nms_thresh,
-                0.001,
-            )
-            scores[keep] = soft_nms_scores
-        if topk_per_image >= 0:
-            keep = keep[:topk_per_image]
-        boxes, scores, filter_inds = boxes[keep], scores[keep], filter_inds[keep]
-        logits = logits[keep]
-
-        result = Instances(image_shape)
-        result.pred_boxes = Boxes(boxes)
-        result.scores = scores
-        result.pred_classes = filter_inds[:, 1]
-        result.logits = logits
-        return result, filter_inds[:, 0]
-
-    def _get_empty_mask_loss(self, device):
-        if self.mask_on:
-            return {
-                "loss_mask": torch.zeros((1,), device=device, dtype=torch.float32)[0]
-            }
-        else:
-            return {}
-
-    def _create_proposals_from_boxes(self, boxes, image_sizes, logits):
-        boxes = [Boxes(b.detach()) for b in boxes]
-        proposals = []
-        for boxes_per_image, image_size, logit in zip(boxes, image_sizes, logits):
-            boxes_per_image.clip(image_size)
-            if self.training:
-                inds = boxes_per_image.nonempty()
-                boxes_per_image = boxes_per_image[inds]
-                logit = logit[inds]
-            prop = Instances(image_size)
-            prop.proposal_boxes = boxes_per_image
-            prop.objectness_logits = logit
-            proposals.append(prop)
-        return proposals
-
-    def _run_stage(self, features, proposals, stage):
-        pool_boxes = [x.proposal_boxes for x in proposals]
-        box_features = self.box_pooler(features, pool_boxes)
-        box_features = _ScaleGradient.apply(box_features, 1.0 / self.num_cascade_stages)
-        box_features = self.box_head[stage](box_features)
-        return self.box_predictor[stage](box_features)
diff --git a/eval/vbench/third_party/grit_src/grit/modeling/soft_nms.py b/eval/vbench/third_party/grit_src/grit/modeling/soft_nms.py
deleted file mode 100644
index 550cb068..00000000
--- a/eval/vbench/third_party/grit_src/grit/modeling/soft_nms.py
+++ /dev/null
@@ -1,185 +0,0 @@
-import torch
-from detectron2.structures import (
-    Boxes,
-    RotatedBoxes,
-    pairwise_iou,
-    pairwise_iou_rotated,
-)
-
-
-def soft_nms(boxes, scores, method, gaussian_sigma, linear_threshold, prune_threshold):
-    """
-    Performs soft non-maximum suppression algorithm on axis aligned boxes
-
-    Args:
-        boxes (Tensor[N, 5]):
-           boxes where NMS will be performed. They
-           are expected to be in (x_ctr, y_ctr, width, height, angle_degrees) format
-        scores (Tensor[N]):
-           scores for each one of the boxes
-        method (str):
-           one of ['gaussian', 'linear', 'hard']
-           see paper for details. users encouraged not to use "hard", as this is the
-           same nms available elsewhere in detectron2
-        gaussian_sigma (float):
-           parameter for Gaussian penalty function
-        linear_threshold (float):
-           iou threshold for applying linear decay. Nt from the paper
-           re-used as threshold for standard "hard" nms
-        prune_threshold (float):
-           boxes with scores below this threshold are pruned at each iteration.
-           Dramatically reduces computation time. Authors use values in [10e-4, 10e-2]
-
-    Returns:
-        tuple(Tensor, Tensor):
-            [0]: int64 tensor with the indices of the elements that have been kept
-            by Soft NMS, sorted in decreasing order of scores
-            [1]: float tensor with the re-scored scores of the elements that were kept
-    """
-    return _soft_nms(
-        Boxes,
-        pairwise_iou,
-        boxes,
-        scores,
-        method,
-        gaussian_sigma,
-        linear_threshold,
-        prune_threshold,
-    )
-
-
-def batched_soft_nms(
-    boxes, scores, idxs, method, gaussian_sigma, linear_threshold, prune_threshold
-):
-    """
-    Performs soft non-maximum suppression in a batched fashion.
-
-    Each index value correspond to a category, and NMS
-    will not be applied between elements of different categories.
-
-    Args:
-        boxes (Tensor[N, 4]):
-           boxes where NMS will be performed. They
-           are expected to be in (x1, y1, x2, y2) format
-        scores (Tensor[N]):
-           scores for each one of the boxes
-        idxs (Tensor[N]):
-           indices of the categories for each one of the boxes.
-        method (str):
-           one of ['gaussian', 'linear', 'hard']
-           see paper for details. users encouraged not to use "hard", as this is the
-           same nms available elsewhere in detectron2
-        gaussian_sigma (float):
-           parameter for Gaussian penalty function
-        linear_threshold (float):
-           iou threshold for applying linear decay. Nt from the paper
-           re-used as threshold for standard "hard" nms
-        prune_threshold (float):
-           boxes with scores below this threshold are pruned at each iteration.
-           Dramatically reduces computation time. Authors use values in [10e-4, 10e-2]
-    Returns:
-        tuple(Tensor, Tensor):
-            [0]: int64 tensor with the indices of the elements that have been kept
-            by Soft NMS, sorted in decreasing order of scores
-            [1]: float tensor with the re-scored scores of the elements that were kept
-    """
-    if boxes.numel() == 0:
-        return (
-            torch.empty((0,), dtype=torch.int64, device=boxes.device),
-            torch.empty((0,), dtype=torch.float32, device=scores.device),
-        )
-    # strategy: in order to perform NMS independently per class.
-    # we add an offset to all the boxes. The offset is dependent
-    # only on the class idx, and is large enough so that boxes
-    # from different classes do not overlap
-    max_coordinate = boxes.max()
-    offsets = idxs.to(boxes) * (max_coordinate + 1)
-    boxes_for_nms = boxes + offsets[:, None]
-    return soft_nms(
-        boxes_for_nms, scores, method, gaussian_sigma, linear_threshold, prune_threshold
-    )
-
-
-def _soft_nms(
-    box_class,
-    pairwise_iou_func,
-    boxes,
-    scores,
-    method,
-    gaussian_sigma,
-    linear_threshold,
-    prune_threshold,
-):
-    """
-    Soft non-max suppression algorithm.
-
-    Implementation of [Soft-NMS -- Improving Object Detection With One Line of Codec]
-    (https://arxiv.org/abs/1704.04503)
-
-    Args:
-        box_class (cls): one of Box, RotatedBoxes
-        pairwise_iou_func (func): one of pairwise_iou, pairwise_iou_rotated
-        boxes (Tensor[N, ?]):
-           boxes where NMS will be performed
-           if Boxes, in (x1, y1, x2, y2) format
-           if RotatedBoxes, in (x_ctr, y_ctr, width, height, angle_degrees) format
-        scores (Tensor[N]):
-           scores for each one of the boxes
-        method (str):
-           one of ['gaussian', 'linear', 'hard']
-           see paper for details. users encouraged not to use "hard", as this is the
-           same nms available elsewhere in detectron2
-        gaussian_sigma (float):
-           parameter for Gaussian penalty function
-        linear_threshold (float):
-           iou threshold for applying linear decay. Nt from the paper
-           re-used as threshold for standard "hard" nms
-        prune_threshold (float):
-           boxes with scores below this threshold are pruned at each iteration.
-           Dramatically reduces computation time. Authors use values in [10e-4, 10e-2]
-
-    Returns:
-        tuple(Tensor, Tensor):
-            [0]: int64 tensor with the indices of the elements that have been kept
-            by Soft NMS, sorted in decreasing order of scores
-            [1]: float tensor with the re-scored scores of the elements that were kept
-    """
-    boxes = boxes.clone()
-    scores = scores.clone()
-    idxs = torch.arange(scores.size()[0])
-
-    idxs_out = []
-    scores_out = []
-
-    while scores.numel() > 0:
-        top_idx = torch.argmax(scores)
-        idxs_out.append(idxs[top_idx].item())
-        scores_out.append(scores[top_idx].item())
-
-        top_box = boxes[top_idx]
-        ious = pairwise_iou_func(box_class(top_box.unsqueeze(0)), box_class(boxes))[0]
-
-        if method == "linear":
-            decay = torch.ones_like(ious)
-            decay_mask = ious > linear_threshold
-            decay[decay_mask] = 1 - ious[decay_mask]
-        elif method == "gaussian":
-            decay = torch.exp(-torch.pow(ious, 2) / gaussian_sigma)
-        elif method == "hard":  # standard NMS
-            decay = (ious < linear_threshold).float()
-        else:
-            raise NotImplementedError(
-                "{} soft nms method not implemented.".format(method)
-            )
-
-        scores *= decay
-        keep = scores > prune_threshold
-        keep[top_idx] = False
-
-        boxes = boxes[keep]
-        scores = scores[keep]
-        idxs = idxs[keep]
-
-    return torch.tensor(idxs_out).to(boxes.device), torch.tensor(scores_out).to(
-        scores.device
-    )
diff --git a/eval/vbench/third_party/grit_src/grit/modeling/text/__init__.py b/eval/vbench/third_party/grit_src/grit/modeling/text/__init__.py
deleted file mode 100644
index e69de29b..00000000
diff --git a/eval/vbench/third_party/grit_src/grit/modeling/text/file_utils.py b/eval/vbench/third_party/grit_src/grit/modeling/text/file_utils.py
deleted file mode 100644
index 9876898b..00000000
--- a/eval/vbench/third_party/grit_src/grit/modeling/text/file_utils.py
+++ /dev/null
@@ -1,264 +0,0 @@
-# Utilities for working with the local dataset cache.
-# This file is adapted from the AllenNLP library at https://github.com/allenai/allennlp
-# Copyright by the AllenNLP authors.
-
-from __future__ import absolute_import, division, print_function, unicode_literals
-
-import fnmatch
-import json
-import logging
-import os
-import shutil
-import sys
-import tempfile
-from functools import wraps
-from hashlib import sha256
-from io import open
-
-import boto3
-import requests
-from botocore.exceptions import ClientError
-from tqdm import tqdm
-
-try:
-    from torch.hub import _get_torch_home
-
-    torch_cache_home = _get_torch_home()
-except ImportError:
-    torch_cache_home = os.path.expanduser(
-        os.getenv(
-            "TORCH_HOME", os.path.join(os.getenv("XDG_CACHE_HOME", "~/.cache"), "torch")
-        )
-    )
-default_cache_path = os.path.join(torch_cache_home, "pytorch_transformers")
-
-try:
-    from urllib.parse import urlparse
-except ImportError:
-    from urlparse import urlparse
-
-try:
-    from pathlib import Path
-
-    PYTORCH_PRETRAINED_BERT_CACHE = Path(
-        os.getenv("PYTORCH_PRETRAINED_BERT_CACHE", default_cache_path)
-    )
-except (AttributeError, ImportError):
-    PYTORCH_PRETRAINED_BERT_CACHE = os.getenv(
-        "PYTORCH_PRETRAINED_BERT_CACHE", default_cache_path
-    )
-
-logger = logging.getLogger(__name__)  # pylint: disable=invalid-name
-
-
-def url_to_filename(url, etag=None):
-    """
-    Convert `url` into a hashed filename in a repeatable way.
-    If `etag` is specified, append its hash to the url's, delimited
-    by a period.
-    """
-    url_bytes = url.encode("utf-8")
-    url_hash = sha256(url_bytes)
-    filename = url_hash.hexdigest()
-
-    if etag:
-        etag_bytes = etag.encode("utf-8")
-        etag_hash = sha256(etag_bytes)
-        filename += "." + etag_hash.hexdigest()
-
-    return filename
-
-
-def filename_to_url(filename, cache_dir=None):
-    """
-    Return the url and etag (which may be ``None``) stored for `filename`.
-    Raise ``EnvironmentError`` if `filename` or its stored metadata do not exist.
-    """
-    if cache_dir is None:
-        cache_dir = PYTORCH_PRETRAINED_BERT_CACHE
-    if sys.version_info[0] == 3 and isinstance(cache_dir, Path):
-        cache_dir = str(cache_dir)
-
-    cache_path = os.path.join(cache_dir, filename)
-    if not os.path.exists(cache_path):
-        raise EnvironmentError("file {} not found".format(cache_path))
-
-    meta_path = cache_path + ".json"
-    if not os.path.exists(meta_path):
-        raise EnvironmentError("file {} not found".format(meta_path))
-
-    with open(meta_path, encoding="utf-8") as meta_file:
-        metadata = json.load(meta_file)
-    url = metadata["url"]
-    etag = metadata["etag"]
-
-    return url, etag
-
-
-def cached_path(url_or_filename, cache_dir=None):
-    """
-    Given something that might be a URL (or might be a local path),
-    determine which. If it's a URL, download the file and cache it, and
-    return the path to the cached file. If it's already a local path,
-    make sure the file exists and then return the path.
-    """
-    if cache_dir is None:
-        cache_dir = PYTORCH_PRETRAINED_BERT_CACHE
-    if sys.version_info[0] == 3 and isinstance(url_or_filename, Path):
-        url_or_filename = str(url_or_filename)
-    if sys.version_info[0] == 3 and isinstance(cache_dir, Path):
-        cache_dir = str(cache_dir)
-
-    parsed = urlparse(url_or_filename)
-
-    if parsed.scheme in ("http", "https", "s3"):
-        # URL, so get it from the cache (downloading if necessary)
-        return get_from_cache(url_or_filename, cache_dir)
-    elif os.path.exists(url_or_filename):
-        # File, and it exists.
-        return url_or_filename
-    elif parsed.scheme == "":
-        # File, but it doesn't exist.
-        raise EnvironmentError("file {} not found".format(url_or_filename))
-    else:
-        # Something unknown
-        raise ValueError(
-            "unable to parse {} as a URL or as a local path".format(url_or_filename)
-        )
-
-
-def split_s3_path(url):
-    """Split a full s3 path into the bucket name and path."""
-    parsed = urlparse(url)
-    if not parsed.netloc or not parsed.path:
-        raise ValueError("bad s3 path {}".format(url))
-    bucket_name = parsed.netloc
-    s3_path = parsed.path
-    # Remove '/' at beginning of path.
-    if s3_path.startswith("/"):
-        s3_path = s3_path[1:]
-    return bucket_name, s3_path
-
-
-def s3_request(func):
-    """
-    Wrapper function for s3 requests in order to create more helpful error
-    messages.
-    """
-
-    @wraps(func)
-    def wrapper(url, *args, **kwargs):
-        try:
-            return func(url, *args, **kwargs)
-        except ClientError as exc:
-            if int(exc.response["Error"]["Code"]) == 404:
-                raise EnvironmentError("file {} not found".format(url))
-            else:
-                raise
-
-    return wrapper
-
-
-@s3_request
-def s3_etag(url):
-    """Check ETag on S3 object."""
-    s3_resource = boto3.resource("s3")
-    bucket_name, s3_path = split_s3_path(url)
-    s3_object = s3_resource.Object(bucket_name, s3_path)
-    return s3_object.e_tag
-
-
-@s3_request
-def s3_get(url, temp_file):
-    """Pull a file directly from S3."""
-    s3_resource = boto3.resource("s3")
-    bucket_name, s3_path = split_s3_path(url)
-    s3_resource.Bucket(bucket_name).download_fileobj(s3_path, temp_file)
-
-
-def http_get(url, temp_file):
-    req = requests.get(url, stream=True)
-    content_length = req.headers.get("Content-Length")
-    total = int(content_length) if content_length is not None else None
-    progress = tqdm(unit="B", total=total)
-    for chunk in req.iter_content(chunk_size=1024):
-        if chunk:  # filter out keep-alive new chunks
-            progress.update(len(chunk))
-            temp_file.write(chunk)
-    progress.close()
-
-
-def get_from_cache(url, cache_dir=None):
-    """
-    Given a URL, look for the corresponding dataset in the local cache.
-    If it's not there, download it. Then return the path to the cached file.
-    """
-    if cache_dir is None:
-        cache_dir = PYTORCH_PRETRAINED_BERT_CACHE
-    if sys.version_info[0] == 3 and isinstance(cache_dir, Path):
-        cache_dir = str(cache_dir)
-    if sys.version_info[0] == 2 and not isinstance(cache_dir, str):
-        cache_dir = str(cache_dir)
-
-    if not os.path.exists(cache_dir):
-        os.makedirs(cache_dir)
-
-    # Get eTag to add to filename, if it exists.
-    if url.startswith("s3://"):
-        etag = s3_etag(url)
-    else:
-        try:
-            response = requests.head(url, allow_redirects=True)
-            if response.status_code != 200:
-                etag = None
-            else:
-                etag = response.headers.get("ETag")
-        except EnvironmentError:
-            etag = None
-
-    if sys.version_info[0] == 2 and etag is not None:
-        etag = etag.decode("utf-8")
-    filename = url_to_filename(url, etag)
-
-    # get cache path to put the file
-    cache_path = os.path.join(cache_dir, filename)
-
-    # If we don't have a connection (etag is None) and can't identify the file
-    # try to get the last downloaded one
-    if not os.path.exists(cache_path) and etag is None:
-        matching_files = fnmatch.filter(os.listdir(cache_dir), filename + ".*")
-        matching_files = list(filter(lambda s: not s.endswith(".json"), matching_files))
-        if matching_files:
-            cache_path = os.path.join(cache_dir, matching_files[-1])
-
-    if not os.path.exists(cache_path):
-        # Download to temporary file, then copy to cache dir once finished.
-        # Otherwise you get corrupt cache entries if the download gets interrupted.
-        with tempfile.NamedTemporaryFile() as temp_file:
-            logger.info("%s not found in cache, downloading to %s", url, temp_file.name)
-
-            # GET file object
-            if url.startswith("s3://"):
-                s3_get(url, temp_file)
-            else:
-                http_get(url, temp_file)
-
-            # we are copying the file before closing it, so flush to avoid truncation
-            temp_file.flush()
-            # shutil.copyfileobj() starts at the current position, so go to the start
-            temp_file.seek(0)
-
-            logger.info("copying %s to cache at %s", temp_file.name, cache_path)
-            with open(cache_path, "wb") as cache_file:
-                shutil.copyfileobj(temp_file, cache_file)
-
-            logger.info("creating metadata file for %s", cache_path)
-            meta = {"url": url, "etag": etag}
-            meta_path = cache_path + ".json"
-            with open(meta_path, "w") as meta_file:
-                output_string = json.dumps(meta)
-                meta_file.write(output_string)
-
-            logger.info("removing temp file %s", temp_file.name)
-
-    return cache_path
diff --git a/eval/vbench/third_party/grit_src/grit/modeling/text/load_text_token.py b/eval/vbench/third_party/grit_src/grit/modeling/text/load_text_token.py
deleted file mode 100644
index 7ba6b6d4..00000000
--- a/eval/vbench/third_party/grit_src/grit/modeling/text/load_text_token.py
+++ /dev/null
@@ -1,89 +0,0 @@
-import torch
-
-
-class LoadTextTokens(object):
-    def __init__(self, tokenizer, max_text_len=40, padding="do_not_pad"):
-        self.tokenizer = tokenizer
-        self.max_text_len = max_text_len
-        self.padding = padding
-
-    def descriptions_to_text_tokens(self, target, begin_token):
-        target_encoding = self.tokenizer(
-            target,
-            padding=self.padding,
-            add_special_tokens=False,
-            truncation=True,
-            max_length=self.max_text_len,
-        )
-
-        need_predict = [1] * len(target_encoding["input_ids"])
-        payload = target_encoding["input_ids"]
-        if len(payload) > self.max_text_len - 2:
-            payload = payload[-(self.max_text_len - 2) :]
-            need_predict = payload[-(self.max_text_len - 2) :]
-
-        input_ids = [begin_token] + payload + [self.tokenizer.sep_token_id]
-
-        need_predict = [0] + need_predict + [1]
-        data = {
-            "text_tokens": torch.tensor(input_ids),
-            "text_lengths": len(input_ids),
-            "need_predict": torch.tensor(need_predict),
-        }
-
-        return data
-
-    def __call__(self, object_descriptions, box_features, begin_token):
-        text_tokens = []
-        text_lengths = []
-        need_predict = []
-        for description in object_descriptions:
-            tokens = self.descriptions_to_text_tokens(description, begin_token)
-            text_tokens.append(tokens["text_tokens"])
-            text_lengths.append(tokens["text_lengths"])
-            need_predict.append(tokens["need_predict"])
-
-        text_tokens = torch.cat(self.collate(text_tokens), dim=0).to(
-            box_features.device
-        )
-        text_lengths = torch.tensor(text_lengths).to(box_features.device)
-        need_predict = torch.cat(self.collate(need_predict), dim=0).to(
-            box_features.device
-        )
-
-        assert text_tokens.dim() == 2 and need_predict.dim() == 2
-        data = {
-            "text_tokens": text_tokens,
-            "text_lengths": text_lengths,
-            "need_predict": need_predict,
-        }
-
-        return data
-
-    def collate(self, batch):
-        if all(isinstance(b, torch.Tensor) for b in batch) and len(batch) > 0:
-            if not all(b.shape == batch[0].shape for b in batch[1:]):
-                assert all(len(b.shape) == len(batch[0].shape) for b in batch[1:])
-                shape = torch.tensor([b.shape for b in batch])
-                max_shape = tuple(shape.max(dim=0)[0].tolist())
-                batch2 = []
-                for b in batch:
-                    if any(c < m for c, m in zip(b.shape, max_shape)):
-                        b2 = torch.zeros(max_shape, dtype=b.dtype, device=b.device)
-                        if b.dim() == 1:
-                            b2[: b.shape[0]] = b
-                        elif b.dim() == 2:
-                            b2[: b.shape[0], : b.shape[1]] = b
-                        elif b.dim() == 3:
-                            b2[: b.shape[0], : b.shape[1], : b.shape[2]] = b
-                        else:
-                            raise NotImplementedError
-                        b = b2
-                    batch2.append(b[None, ...])
-            else:
-                batch2 = []
-                for b in batch:
-                    batch2.append(b[None, ...])
-            return batch2
-        else:
-            raise NotImplementedError
diff --git a/eval/vbench/third_party/grit_src/grit/modeling/text/modeling_bert.py b/eval/vbench/third_party/grit_src/grit/modeling/text/modeling_bert.py
deleted file mode 100644
index ec9d960b..00000000
--- a/eval/vbench/third_party/grit_src/grit/modeling/text/modeling_bert.py
+++ /dev/null
@@ -1,601 +0,0 @@
-# coding=utf-8
-# Copyright 2018 The Google AI Language Team Authors and The HuggingFace Inc. team.
-# Copyright (c) 2018, NVIDIA CORPORATION.  All rights reserved.
-#
-# Licensed under the Apache License, Version 2.0 (the "License");
-# you may not use this file except in compliance with the License.
-# You may obtain a copy of the License at
-#
-#     http://www.apache.org/licenses/LICENSE-2.0
-#
-# Unless required by applicable law or agreed to in writing, software
-# distributed under the License is distributed on an "AS IS" BASIS,
-# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
-# See the License for the specific language governing permissions and
-# limitations under the License.
-"""PyTorch BERT model. """
-# Adapted from https://github.com/huggingface/transformers/blob/main/src/transformers/models/bert/modeling_bert.py
-
-from __future__ import absolute_import, division, print_function, unicode_literals
-
-import copy
-import json
-import logging
-import math
-import os
-import sys
-from io import open
-
-import torch
-import torch.utils.checkpoint as checkpoint
-from torch import nn
-
-from .file_utils import cached_path
-
-logger = logging.getLogger()
-
-
-BERT_PRETRAINED_CONFIG_ARCHIVE_MAP = {
-    "bert-base-uncased": "https://s3.amazonaws.com/models.huggingface.co/bert/bert-base-uncased-config.json",
-    "bert-large-uncased": "https://s3.amazonaws.com/models.huggingface.co/bert/bert-large-uncased-config.json",
-    "bert-base-cased": "https://s3.amazonaws.com/models.huggingface.co/bert/bert-base-cased-config.json",
-    "bert-large-cased": "https://s3.amazonaws.com/models.huggingface.co/bert/bert-large-cased-config.json",
-    "bert-base-multilingual-uncased": "https://s3.amazonaws.com/models.huggingface.co/bert/bert-base-multilingual-uncased-config.json",
-    "bert-base-multilingual-cased": "https://s3.amazonaws.com/models.huggingface.co/bert/bert-base-multilingual-cased-config.json",
-    "bert-base-chinese": "https://s3.amazonaws.com/models.huggingface.co/bert/bert-base-chinese-config.json",
-    "bert-base-german-cased": "https://s3.amazonaws.com/models.huggingface.co/bert/bert-base-german-cased-config.json",
-    "bert-large-uncased-whole-word-masking": "https://s3.amazonaws.com/models.huggingface.co/bert/bert-large-uncased-whole-word-masking-config.json",
-    "bert-large-cased-whole-word-masking": "https://s3.amazonaws.com/models.huggingface.co/bert/bert-large-cased-whole-word-masking-config.json",
-    "bert-large-uncased-whole-word-masking-finetuned-squad": "https://s3.amazonaws.com/models.huggingface.co/bert/bert-large-uncased-whole-word-masking-finetuned-squad-config.json",
-    "bert-large-cased-whole-word-masking-finetuned-squad": "https://s3.amazonaws.com/models.huggingface.co/bert/bert-large-cased-whole-word-masking-finetuned-squad-config.json",
-    "bert-base-cased-finetuned-mrpc": "https://s3.amazonaws.com/models.huggingface.co/bert/bert-base-cased-finetuned-mrpc-config.json",
-}
-
-
-def qk2attn(query, key, attention_mask, gamma):
-    query = query / gamma
-    attention_scores = torch.matmul(query, key.transpose(-1, -2))
-    if attention_mask is not None:
-        # Apply the attention mask is (precomputed for all layers in BertModel forward() function)
-        attention_scores = attention_scores + attention_mask
-    return attention_scores.softmax(dim=-1)
-
-
-class QK2Attention(nn.Module):
-    def forward(self, query, key, attention_mask, gamma):
-        return qk2attn(query, key, attention_mask, gamma)
-
-
-LayerNormClass = torch.nn.LayerNorm
-
-
-class BertSelfAttention(nn.Module):
-    def __init__(self, config):
-        super(BertSelfAttention, self).__init__()
-        if config.hidden_size % config.num_attention_heads != 0:
-            raise ValueError(
-                "The hidden size (%d) is not a multiple of the number of attention "
-                "heads (%d)" % (config.hidden_size, config.num_attention_heads)
-            )
-        self.output_attentions = config.output_attentions
-
-        self.num_attention_heads = config.num_attention_heads
-        self.attention_head_size = int(config.hidden_size / config.num_attention_heads)
-        self.all_head_size = self.num_attention_heads * self.attention_head_size
-
-        self.query = nn.Linear(config.hidden_size, self.all_head_size)
-        self.key = nn.Linear(config.hidden_size, self.all_head_size)
-        self.value = nn.Linear(config.hidden_size, self.all_head_size)
-
-        self.dropout = nn.Dropout(config.attention_probs_dropout_prob)
-        self.softmax = nn.Softmax(dim=-1)
-        self.qk2attn = QK2Attention()
-
-    def transpose_for_scores(self, x):
-        if torch._C._get_tracing_state():
-            # exporter is not smart enough to detect dynamic size for some paths
-            x = x.view(
-                x.shape[0], -1, self.num_attention_heads, self.attention_head_size
-            )
-        else:
-            new_x_shape = x.size()[:-1] + (
-                self.num_attention_heads,
-                self.attention_head_size,
-            )
-            x = x.view(*new_x_shape)
-        return x.permute(0, 2, 1, 3)
-
-    def forward(
-        self, hidden_states, attention_mask, head_mask=None, history_state=None
-    ):
-        if history_state is not None:
-            x_states = torch.cat([history_state, hidden_states], dim=1)
-            mixed_query_layer = self.query(hidden_states)
-            mixed_key_layer = self.key(x_states)
-            mixed_value_layer = self.value(x_states)
-        else:
-            mixed_query_layer = self.query(hidden_states)
-            mixed_key_layer = self.key(hidden_states)
-            mixed_value_layer = self.value(hidden_states)
-
-        query_layer = self.transpose_for_scores(mixed_query_layer)
-        key_layer = self.transpose_for_scores(mixed_key_layer)
-        value_layer = self.transpose_for_scores(mixed_value_layer)
-
-        attention_probs = self.qk2attn(
-            query_layer, key_layer, attention_mask, math.sqrt(self.attention_head_size)
-        )
-
-        # This is actually dropping out entire tokens to attend to, which might
-        # seem a bit unusual, but is taken from the original Transformer paper.
-        attention_probs = self.dropout(attention_probs)
-
-        # Mask heads if we want to
-        if head_mask is not None:
-            attention_probs = attention_probs * head_mask
-
-        context_layer = torch.matmul(attention_probs, value_layer)
-
-        context_layer = context_layer.permute(0, 2, 1, 3).contiguous()
-        new_context_layer_shape = context_layer.size()[:-2] + (self.all_head_size,)
-        context_layer = context_layer.view(*new_context_layer_shape)
-
-        outputs = (
-            (context_layer, attention_probs)
-            if self.output_attentions
-            else (context_layer,)
-        )
-        return outputs
-
-
-class BertSelfOutput(nn.Module):
-    def __init__(self, config):
-        super(BertSelfOutput, self).__init__()
-        self.dense = nn.Linear(config.hidden_size, config.hidden_size)
-        self.pre_norm = hasattr(config, "pre_norm") and config.pre_norm
-        if not self.pre_norm:
-            self.LayerNorm = LayerNormClass(
-                config.hidden_size, eps=config.layer_norm_eps
-            )
-        self.dropout = nn.Dropout(config.hidden_dropout_prob)
-
-    def forward(self, hidden_states, input_tensor):
-        hidden_states = self.dense(hidden_states)
-        hidden_states = self.dropout(hidden_states)
-        if not self.pre_norm:
-            hidden_states = self.LayerNorm(hidden_states + input_tensor)
-        else:
-            hidden_states = hidden_states + input_tensor
-        return hidden_states
-
-
-class BertAttention(nn.Module):
-    def __init__(self, config):
-        super(BertAttention, self).__init__()
-        self.pre_norm = hasattr(config, "pre_norm") and config.pre_norm
-        if self.pre_norm:
-            self.LayerNorm = LayerNormClass(
-                config.hidden_size, eps=config.layer_norm_eps
-            )
-        self.self = BertSelfAttention(config)
-        self.output = BertSelfOutput(config)
-
-    def forward(self, input_tensor, attention_mask, head_mask=None, history_state=None):
-        if self.pre_norm:
-            self_outputs = self.self(
-                self.LayerNorm(input_tensor),
-                attention_mask,
-                head_mask,
-                self.layerNorm(history_state) if history_state else history_state,
-            )
-        else:
-            self_outputs = self.self(
-                input_tensor, attention_mask, head_mask, history_state
-            )
-        attention_output = self.output(self_outputs[0], input_tensor)
-        outputs = (attention_output,) + self_outputs[
-            1:
-        ]  # add attentions if we output them
-        return outputs
-
-
-class BertIntermediate(nn.Module):
-    def __init__(self, config):
-        super(BertIntermediate, self).__init__()
-        self.dense = nn.Linear(config.hidden_size, config.intermediate_size)
-        assert (
-            config.hidden_act == "gelu"
-        ), "Please implement other activation functions"
-        self.intermediate_act_fn = _gelu_python
-
-    def forward(self, hidden_states):
-        hidden_states = self.dense(hidden_states)
-        hidden_states = self.intermediate_act_fn(hidden_states)
-        return hidden_states
-
-
-class BertOutput(nn.Module):
-    def __init__(self, config):
-        super(BertOutput, self).__init__()
-        self.dense = nn.Linear(config.intermediate_size, config.hidden_size)
-        self.pre_norm = hasattr(config, "pre_norm") and config.pre_norm
-        self.dropout = nn.Dropout(config.hidden_dropout_prob)
-        if not self.pre_norm:
-            self.LayerNorm = LayerNormClass(
-                config.hidden_size, eps=config.layer_norm_eps
-            )
-
-    def forward(self, hidden_states, input_tensor):
-        hidden_states = self.dense(hidden_states)
-        hidden_states = self.dropout(hidden_states)
-        if not self.pre_norm:
-            hidden_states = self.LayerNorm(hidden_states + input_tensor)
-        else:
-            hidden_states = hidden_states + input_tensor
-        return hidden_states
-
-
-class Mlp(nn.Module):
-    def __init__(self, config):
-        super().__init__()
-        self.pre_norm = hasattr(config, "pre_norm") and config.pre_norm
-        self.intermediate = BertIntermediate(config)
-        if self.pre_norm:
-            self.LayerNorm = LayerNormClass(
-                config.hidden_size, eps=config.layer_norm_eps
-            )
-        self.output = BertOutput(config)
-
-    def forward(self, attention_output):
-        if not self.pre_norm:
-            intermediate_output = self.intermediate(attention_output)
-        else:
-            intermediate_output = self.intermediate(self.LayerNorm(attention_output))
-        layer_output = self.output(intermediate_output, attention_output)
-        return layer_output
-
-
-class BertLayer(nn.Module):
-    def __init__(self, config, use_act_checkpoint=True):
-        super(BertLayer, self).__init__()
-        self.pre_norm = hasattr(config, "pre_norm") and config.pre_norm
-        self.use_mlp_wrapper = (
-            hasattr(config, "use_mlp_wrapper") and config.use_mlp_wrapper
-        )
-        self.attention = BertAttention(config)
-        self.use_act_checkpoint = use_act_checkpoint
-        if self.use_mlp_wrapper:
-            self.mlp = Mlp(config)
-        else:
-            self.intermediate = BertIntermediate(config)
-            if self.pre_norm:
-                self.LayerNorm = LayerNormClass(
-                    config.hidden_size, eps=config.layer_norm_eps
-                )
-            self.output = BertOutput(config)
-
-    def forward(
-        self, hidden_states, attention_mask, head_mask=None, history_state=None
-    ):
-        if self.use_act_checkpoint:
-            attention_outputs = checkpoint.checkpoint(
-                self.attention, hidden_states, attention_mask, head_mask, history_state
-            )
-        else:
-            attention_outputs = self.attention(
-                hidden_states, attention_mask, head_mask, history_state
-            )
-        attention_output = attention_outputs[0]
-        if self.use_mlp_wrapper:
-            layer_output = self.mlp(attention_output)
-        else:
-            if not self.pre_norm:
-                intermediate_output = self.intermediate(attention_output)
-            else:
-                intermediate_output = self.intermediate(
-                    self.LayerNorm(attention_output)
-                )
-            layer_output = self.output(intermediate_output, attention_output)
-        outputs = (layer_output,) + attention_outputs[
-            1:
-        ]  # add attentions if we output them
-        return outputs
-
-
-class BertEncoder(nn.Module):
-    def __init__(self, config, use_act_checkpoint=True):
-        super(BertEncoder, self).__init__()
-        self.output_attentions = config.output_attentions
-        self.output_hidden_states = config.output_hidden_states
-        self.layer = nn.ModuleList(
-            [
-                BertLayer(config, use_act_checkpoint=use_act_checkpoint)
-                for _ in range(config.num_hidden_layers)
-            ]
-        )
-        self.pre_norm = hasattr(config, "pre_norm") and config.pre_norm
-        if self.pre_norm:
-            self.LayerNorm = LayerNormClass(
-                config.hidden_size, eps=config.layer_norm_eps
-            )
-
-    def forward(
-        self, hidden_states, attention_mask, head_mask=None, encoder_history_states=None
-    ):
-        all_hidden_states = ()
-        all_attentions = ()
-        for i, layer_module in enumerate(self.layer):
-            if self.output_hidden_states:
-                all_hidden_states = all_hidden_states + (hidden_states,)
-
-            history_state = (
-                None if encoder_history_states is None else encoder_history_states[i]
-            )
-            layer_outputs = layer_module(
-                hidden_states,
-                attention_mask,
-                (None if head_mask is None else head_mask[i]),
-                history_state,
-            )
-            hidden_states = layer_outputs[0]
-
-            if self.output_attentions:
-                all_attentions = all_attentions + (layer_outputs[1],)
-        if self.pre_norm:
-            hidden_states = self.LayerNorm(hidden_states)
-        outputs = (hidden_states,)
-        if self.output_hidden_states:
-            outputs = outputs + (all_hidden_states,)
-        if self.output_attentions:
-            outputs = outputs + (all_attentions,)
-        return outputs
-
-
-CONFIG_NAME = "config.json"
-
-
-class PretrainedConfig(object):
-    """Base class for all configuration classes.
-    Handle a few common parameters and methods for loading/downloading/saving configurations.
-    """
-
-    pretrained_config_archive_map = {}
-
-    def __init__(self, **kwargs):
-        self.finetuning_task = kwargs.pop("finetuning_task", None)
-        self.num_labels = kwargs.pop("num_labels", 2)
-        self.output_attentions = kwargs.pop("output_attentions", False)
-        self.output_hidden_states = kwargs.pop("output_hidden_states", False)
-        self.torchscript = kwargs.pop("torchscript", False)
-
-    def save_pretrained(self, save_directory):
-        """Save a configuration object to a directory, so that it
-        can be re-loaded using the `from_pretrained(save_directory)` class method.
-        """
-        assert os.path.isdir(
-            save_directory
-        ), "Saving path should be a directory where the model and configuration can be saved"
-
-        # If we save using the predefined names, we can load using `from_pretrained`
-        output_config_file = os.path.join(save_directory, CONFIG_NAME)
-
-        self.to_json_file(output_config_file)
-
-    @classmethod
-    def from_pretrained(cls, pretrained_model_name_or_path, **kwargs):
-        r"""Instantiate a PretrainedConfig from a pre-trained model configuration.
-
-        Params:
-            **pretrained_model_name_or_path**: either:
-                - a string with the `shortcut name` of a pre-trained model configuration to load from cache
-                    or download and cache if not already stored in cache (e.g. 'bert-base-uncased').
-                - a path to a `directory` containing a configuration file saved
-                    using the `save_pretrained(save_directory)` method.
-                - a path or url to a saved configuration `file`.
-            **cache_dir**: (`optional`) string:
-                Path to a directory in which a downloaded pre-trained model
-                configuration should be cached if the standard cache should not be used.
-            **return_unused_kwargs**: (`optional`) bool:
-                - If False, then this function returns just the final configuration object.
-                - If True, then this functions returns a tuple `(config, unused_kwargs)` where `unused_kwargs`
-                is a dictionary consisting of the key/value pairs whose keys are not configuration attributes:
-                ie the part of kwargs which has not been used to update `config` and is otherwise ignored.
-            **kwargs**: (`optional`) dict:
-                Dictionary of key/value pairs with which to update the configuration object after loading.
-                - The values in kwargs of any keys which are configuration attributes will be used
-                to override the loaded values.
-                - Behavior concerning key/value pairs whose keys are *not* configuration attributes is controlled
-                by the `return_unused_kwargs` keyword parameter.
-
-        Examples::
-
-            >>> config = BertConfig.from_pretrained('bert-base-uncased')    # Download configuration from S3 and cache.
-            >>> config = BertConfig.from_pretrained('./test/saved_model/')  # E.g. config (or model) was saved using `save_pretrained('./test/saved_model/')`
-            >>> config = BertConfig.from_pretrained('./test/saved_model/my_configuration.json')
-            >>> config = BertConfig.from_pretrained('bert-base-uncased', output_attention=True, foo=False)
-            >>> assert config.output_attention == True
-            >>> config, unused_kwargs = BertConfig.from_pretrained('bert-base-uncased', output_attention=True,
-            >>>                                                    foo=False, return_unused_kwargs=True)
-            >>> assert config.output_attention == True
-            >>> assert unused_kwargs == {'foo': False}
-
-        """
-        cache_dir = kwargs.pop("cache_dir", None)
-        return_unused_kwargs = kwargs.pop("return_unused_kwargs", False)
-
-        if pretrained_model_name_or_path in cls.pretrained_config_archive_map:
-            config_file = cls.pretrained_config_archive_map[
-                pretrained_model_name_or_path
-            ]
-        elif os.path.isdir(pretrained_model_name_or_path):
-            config_file = os.path.join(pretrained_model_name_or_path, CONFIG_NAME)
-        else:
-            config_file = pretrained_model_name_or_path
-        # redirect to the cache, if necessary
-        try:
-            resolved_config_file = cached_path(config_file, cache_dir=cache_dir)
-        except EnvironmentError:
-            if pretrained_model_name_or_path in cls.pretrained_config_archive_map:
-                logger.error(
-                    "Couldn't reach server at '{}' to download pretrained model configuration file.".format(
-                        config_file
-                    )
-                )
-            else:
-                logger.error(
-                    "Model name '{}' was not found in model name list ({}). "
-                    "We assumed '{}' was a path or url but couldn't find any file "
-                    "associated to this path or url.".format(
-                        pretrained_model_name_or_path,
-                        ", ".join(cls.pretrained_config_archive_map.keys()),
-                        config_file,
-                    )
-                )
-            return None
-        if resolved_config_file == config_file:
-            logger.info("loading configuration file {}".format(config_file))
-        else:
-            logger.info(
-                "loading configuration file {} from cache at {}".format(
-                    config_file, resolved_config_file
-                )
-            )
-
-        # Load config
-        config = cls.from_json_file(resolved_config_file)
-
-        # Update config with kwargs if needed
-        to_remove = []
-        for key, value in kwargs.items():
-            if hasattr(config, key):
-                setattr(config, key, value)
-                to_remove.append(key)
-        # add img_layer_norm_eps, use_img_layernorm
-        if "img_layer_norm_eps" in kwargs:
-            setattr(config, "img_layer_norm_eps", kwargs["img_layer_norm_eps"])
-            to_remove.append("img_layer_norm_eps")
-        if "use_img_layernorm" in kwargs:
-            setattr(config, "use_img_layernorm", kwargs["use_img_layernorm"])
-            to_remove.append("use_img_layernorm")
-        for key in to_remove:
-            kwargs.pop(key, None)
-
-        logger.info("Model config %s", config)
-        if return_unused_kwargs:
-            return config, kwargs
-        else:
-            return config
-
-    @classmethod
-    def from_dict(cls, json_object):
-        """Constructs a `Config` from a Python dictionary of parameters."""
-        config = cls(vocab_size_or_config_json_file=-1)
-        for key, value in json_object.items():
-            config.__dict__[key] = value
-        return config
-
-    @classmethod
-    def from_json_file(cls, json_file):
-        """Constructs a `BertConfig` from a json file of parameters."""
-        with open(json_file, "r", encoding="utf-8") as reader:
-            text = reader.read()
-        return cls.from_dict(json.loads(text))
-
-    def __eq__(self, other):
-        return self.__dict__ == other.__dict__
-
-    def __repr__(self):
-        return str(self.to_json_string())
-
-    def to_dict(self):
-        """Serializes this instance to a Python dictionary."""
-        output = copy.deepcopy(self.__dict__)
-        return output
-
-    def to_json_string(self):
-        """Serializes this instance to a JSON string."""
-        return json.dumps(self.to_dict(), indent=2, sort_keys=True) + "\n"
-
-    def to_json_file(self, json_file_path):
-        """Save this instance to a json file."""
-        with open(json_file_path, "w", encoding="utf-8") as writer:
-            writer.write(self.to_json_string())
-
-
-class BertConfig(PretrainedConfig):
-    r"""
-    :class:`~pytorch_transformers.BertConfig` is the configuration class to store the configuration of a
-    `BertModel`.
-
-
-    Arguments:
-        vocab_size_or_config_json_file: Vocabulary size of `inputs_ids` in `BertModel`.
-        hidden_size: Size of the encoder layers and the pooler layer.
-        num_hidden_layers: Number of hidden layers in the Transformer encoder.
-        num_attention_heads: Number of attention heads for each attention layer in
-            the Transformer encoder.
-        intermediate_size: The size of the "intermediate" (i.e., feed-forward)
-            layer in the Transformer encoder.
-        hidden_act: The non-linear activation function (function or string) in the
-            encoder and pooler. If string, "gelu", "relu" and "swish" are supported.
-        hidden_dropout_prob: The dropout probabilitiy for all fully connected
-            layers in the embeddings, encoder, and pooler.
-        attention_probs_dropout_prob: The dropout ratio for the attention
-            probabilities.
-        max_position_embeddings: The maximum sequence length that this model might
-            ever be used with. Typically set this to something large just in case
-            (e.g., 512 or 1024 or 2048).
-        type_vocab_size: The vocabulary size of the `token_type_ids` passed into
-            `BertModel`.
-        initializer_range: The sttdev of the truncated_normal_initializer for
-            initializing all weight matrices.
-        layer_norm_eps: The epsilon used by LayerNorm.
-    """
-
-    pretrained_config_archive_map = BERT_PRETRAINED_CONFIG_ARCHIVE_MAP
-
-    def __init__(
-        self,
-        vocab_size_or_config_json_file=30522,
-        hidden_size=768,
-        num_hidden_layers=12,
-        num_attention_heads=12,
-        intermediate_size=3072,
-        hidden_act="gelu",
-        hidden_dropout_prob=0.1,
-        attention_probs_dropout_prob=0.1,
-        max_position_embeddings=512,
-        type_vocab_size=2,
-        initializer_range=0.02,
-        layer_norm_eps=1e-12,
-        **kwargs,
-    ):
-        super(BertConfig, self).__init__(**kwargs)
-        if isinstance(vocab_size_or_config_json_file, str):
-            with open(vocab_size_or_config_json_file, "r", encoding="utf-8") as reader:
-                json_config = json.loads(reader.read())
-            for key, value in json_config.items():
-                self.__dict__[key] = value
-        elif isinstance(vocab_size_or_config_json_file, int):
-            self.vocab_size = vocab_size_or_config_json_file
-            self.hidden_size = hidden_size
-            self.num_hidden_layers = num_hidden_layers
-            self.num_attention_heads = num_attention_heads
-            self.hidden_act = hidden_act
-            self.intermediate_size = intermediate_size
-            self.hidden_dropout_prob = hidden_dropout_prob
-            self.attention_probs_dropout_prob = attention_probs_dropout_prob
-            self.max_position_embeddings = max_position_embeddings
-            self.type_vocab_size = type_vocab_size
-            self.initializer_range = initializer_range
-            self.layer_norm_eps = layer_norm_eps
-        else:
-            raise ValueError(
-                "First argument must be either a vocabulary size (int)"
-                "or the path to a pretrained model config file (str)"
-            )
-
-
-def _gelu_python(x):
-
-    return x * 0.5 * (1.0 + torch.erf(x / math.sqrt(2.0)))
diff --git a/eval/vbench/third_party/grit_src/grit/modeling/text/text_decoder.py b/eval/vbench/third_party/grit_src/grit/modeling/text/text_decoder.py
deleted file mode 100644
index 26e8df55..00000000
--- a/eval/vbench/third_party/grit_src/grit/modeling/text/text_decoder.py
+++ /dev/null
@@ -1,716 +0,0 @@
-# Modified by Jialian Wu from
-# https://github.com/microsoft/GenerativeImage2Text/blob/main/generativeimage2text/layers/decoder.py
-# and https://github.com/kdexd/virtex
-import functools
-import warnings
-
-import torch
-from torch import nn
-from torch.nn import functional as F
-
-
-class TextualHead(nn.Module):
-    def __init__(self, visual_feature_size: int, vocab_size: int, hidden_size: int):
-        super().__init__()
-        self.visual_feature_size = visual_feature_size
-        self.vocab_size = vocab_size
-        self.hidden_size = hidden_size
-
-    @property
-    def textual_feature_size(self):
-        return self.hidden_size
-
-
-class WordAndPositionalEmbedding(nn.Module):
-    def __init__(
-        self,
-        vocab_size: int,
-        hidden_size: int,
-        dropout: float = 0.0,
-        max_caption_length: int = 30,
-        padding_idx: int = 0,
-    ):
-        super().__init__()
-        self.vocab_size = vocab_size
-        self.padding_idx = padding_idx
-
-        # self.words = nn.Embedding(vocab_size, hidden_size, padding_idx=padding_idx)
-        self.words = nn.Embedding(vocab_size, hidden_size)
-
-        # We provide no "padding index" for positional embeddings. We zero out
-        # the positional embeddings of padded positions as a post-processing.
-        self.positions = nn.Embedding(max_caption_length, hidden_size)
-        self.layer_norm = nn.LayerNorm(hidden_size, eps=1e-8, elementwise_affine=True)
-        self.dropout = nn.Dropout(p=dropout)
-
-    def forward(self, tokens: torch.Tensor):
-        position_indices = self._create_position_indices(tokens)
-
-        # shape: (batch_size, max_caption_length, hidden_size)
-        word_embeddings = self.words(tokens)
-        position_embeddings = self.positions(position_indices)
-
-        # shape: (batch_size, max_caption_length, hidden_size)
-        embeddings = self.layer_norm(word_embeddings + position_embeddings)
-        embeddings = self.dropout(embeddings)
-
-        return embeddings
-
-    @functools.lru_cache(maxsize=128)
-    def _create_position_indices(self, tokens: torch.Tensor):
-
-        # Create position indices of the same size as token indices.
-        batch_size, max_caption_length = tokens.size()
-        positions = torch.arange(
-            max_caption_length, dtype=tokens.dtype, device=tokens.device
-        )
-        # shape: (batch_size, max_caption_length)
-        positions = positions.unsqueeze(0).expand(batch_size, max_caption_length)
-        return positions
-
-
-class BertEncoderAsDecoder(nn.Module):
-    def __init__(self, encoder):
-        super().__init__()
-        self.encoder = encoder
-
-    def forward(
-        self,
-        tgt,
-        memory,
-        tgt_mask=None,
-        tgt_key_padding_mask=None,
-        memory_key_padding_mask=None,
-        tgt_bi_valid_mask=None,
-        encoder_history_states=None,
-    ):
-        assert tgt_key_padding_mask is None, "not supported"
-        assert tgt_mask.dim() == 2
-        assert tgt_mask.shape[0] == tgt_mask.shape[1]
-        # tgt_mask should always be 0/negative infinity
-        tgt = tgt.transpose(0, 1)
-        memory = memory.transpose(0, 1)
-
-        hidden_states = torch.cat((memory, tgt), dim=1)
-        num_tgt = tgt.shape[1]
-        num_memory = memory.shape[1]
-        device = tgt.device
-        dtype = tgt.dtype
-        top_left = torch.zeros((num_memory, num_memory), device=device, dtype=dtype)
-        top_right = torch.full(
-            (num_memory, num_tgt),
-            float("-inf"),
-            device=tgt.device,
-            dtype=dtype,
-        )
-        bottom_left = torch.zeros(
-            (num_tgt, num_memory),
-            dtype=dtype,
-            device=tgt_mask.device,
-        )
-        left = torch.cat((top_left, bottom_left), dim=0)
-        right = torch.cat((top_right, tgt_mask.to(dtype)), dim=0)
-
-        full_attention_mask = torch.cat((left, right), dim=1)[None, :]
-
-        if memory_key_padding_mask is None:
-            memory_key_padding_mask = torch.full(
-                (memory.shape[0], memory.shape[1]), fill_value=False, device=device
-            )
-        # if it is False, it means valid. That is, it is not a padding
-        assert memory_key_padding_mask.dtype == torch.bool
-        zero_negative_infinity = torch.zeros_like(
-            memory_key_padding_mask, dtype=tgt.dtype
-        )
-        zero_negative_infinity[memory_key_padding_mask] = float("-inf")
-        full_attention_mask = full_attention_mask.expand(
-            (
-                memory_key_padding_mask.shape[0],
-                num_memory + num_tgt,
-                num_memory + num_tgt,
-            )
-        )
-        full_attention_mask = full_attention_mask.clone()
-        origin_left = full_attention_mask[:, :, :num_memory]
-        update = zero_negative_infinity[:, None, :]
-        full_attention_mask[:, :, :num_memory] = origin_left + update
-
-        if tgt_bi_valid_mask is not None:
-            # verify the correctness
-            bs = full_attention_mask.shape[0]
-            # during inference, tgt_bi_valid_mask's length is not changed, but
-            # num_tgt can be increased
-            max_valid_target = tgt_bi_valid_mask.shape[1]
-            mask = tgt_bi_valid_mask[:, None, :].expand(
-                (bs, num_memory + num_tgt, max_valid_target)
-            )
-            full_attention_mask[:, :, num_memory : (num_memory + max_valid_target)][
-                mask
-            ] = 0
-
-        # add axis for multi-head
-        full_attention_mask = full_attention_mask[:, None, :, :]
-
-        if encoder_history_states is None:
-            result = self.encoder(
-                hidden_states=hidden_states,
-                attention_mask=full_attention_mask,
-                encoder_history_states=encoder_history_states,
-            )
-            result = list(result)
-            result[0] = result[0][:, num_memory:].transpose(0, 1)
-            if self.encoder.output_hidden_states:
-                return result[0], result[1]
-            else:
-                # make it back-compatible
-                return result[0]
-        else:
-            encoder_out = self.encoder(
-                hidden_states=hidden_states[:, -1:],
-                attention_mask=full_attention_mask[:, :, -1:],
-                encoder_history_states=encoder_history_states,
-            )
-            result = encoder_out[0].transpose(0, 1)
-            if self.encoder.output_hidden_states:
-                return result, encoder_out[1]
-            else:
-                return result
-
-
-def create_transformer(
-    decoder_type,
-    norm_type,
-    textual_feature_size,
-    attention_heads,
-    feedforward_size,
-    dropout,
-    num_layers,
-    output_hidden_states=False,
-    use_mlp_wrapper=None,
-    use_act_checkpoint=True,
-):
-    assert norm_type in ["post", "pre"]
-    if decoder_type is None:
-        LayerClass = (
-            nn.TransformerDecoderLayer
-            if norm_type == "post"
-            else PreNormTransformerDecoderLayer
-        )
-        _layer = LayerClass(
-            textual_feature_size,
-            attention_heads,
-            dim_feedforward=feedforward_size,
-            dropout=dropout,
-            activation="gelu",
-        )
-        return nn.TransformerDecoder(_layer, num_layers)
-    elif decoder_type == "bert_en":
-        from .modeling_bert import BertConfig, BertEncoder
-
-        config = BertConfig(
-            vocab_size_or_config_json_file=30522,
-            hidden_size=textual_feature_size,
-            num_hidden_layers=num_layers,
-            num_attention_heads=attention_heads,
-            intermediate_size=feedforward_size,
-            hidden_act="gelu",
-            hidden_dropout_prob=0.1,
-            attention_probs_dropout_prob=0.1,
-            layer_norm_eps=1e-12,
-        )
-        config.pre_norm = norm_type == "pre"
-        config.use_mlp_wrapper = use_mlp_wrapper
-        config.output_hidden_states = output_hidden_states
-        encoder = BertEncoder(config, use_act_checkpoint=use_act_checkpoint)
-        return BertEncoderAsDecoder(encoder)
-
-
-class PreNormTransformerDecoderLayer(nn.TransformerDecoderLayer):
-    def forward(
-        self,
-        tgt,
-        memory,
-        tgt_mask=None,
-        memory_mask=None,
-        tgt_key_padding_mask=None,
-        memory_key_padding_mask=None,
-    ):
-        # fmt: off
-        # We use the members (modules) from super-class, just the order of
-        # operations is changed here. First layernorm, then attention.
-        tgt2 = self.norm1(tgt)
-        tgt2, _ = self.self_attn(
-            tgt2, tgt2, tgt2, attn_mask=tgt_mask,
-            key_padding_mask=tgt_key_padding_mask
-        )
-        tgt = tgt + self.dropout1(tgt2)
-
-        # Layernorm first, then decoder attention.
-        tgt2 = self.norm2(tgt)
-        tgt2, _ = self.multihead_attn(
-            tgt2, memory, memory, attn_mask=memory_mask,
-            key_padding_mask=memory_key_padding_mask
-        )
-        tgt = tgt + self.dropout2(tgt2)
-
-        # Layernorm first, then transformation through feedforward network.
-        tgt2 = self.norm3(tgt)
-        tgt2 = self.linear2(self.dropout(self.activation(self.linear1(tgt2))))
-        tgt = tgt + self.dropout3(tgt2)
-        return tgt
-
-
-class TransformerDecoderTextualHead(TextualHead):
-    def __init__(
-        self,
-        object_feature_size: int,
-        vocab_size: int,
-        hidden_size: int,
-        num_layers: int,
-        attention_heads: int,
-        feedforward_size: int,
-        dropout: float = 0.1,
-        norm_type: str = "post",
-        mask_future_positions: bool = True,
-        max_caption_length: int = 1024,
-        padding_idx: int = 0,
-        decoder_type=None,
-        not_tie_weight=None,
-        output_hidden_states=None,
-        use_mlp_wrapper=None,
-        use_act_checkpoint=True,
-    ):
-        super().__init__(object_feature_size, vocab_size, hidden_size)
-        self.num_layers = num_layers
-        self.attention_heads = attention_heads
-        self.feedforward_size = feedforward_size
-        self.dropout = dropout
-        assert mask_future_positions
-        self.padding_idx = padding_idx
-
-        self.object_feature_projection = nn.Sequential(
-            nn.Linear(object_feature_size, self.textual_feature_size),
-            nn.LayerNorm(self.textual_feature_size),
-        )
-
-        self.embedding = WordAndPositionalEmbedding(
-            self.vocab_size,
-            self.textual_feature_size,
-            dropout=dropout,
-            max_caption_length=max_caption_length,
-            padding_idx=padding_idx,
-        )
-        self.transformer = create_transformer(
-            decoder_type=decoder_type,
-            norm_type=norm_type,
-            textual_feature_size=self.textual_feature_size,
-            attention_heads=self.attention_heads,
-            feedforward_size=self.feedforward_size,
-            dropout=dropout,
-            num_layers=self.num_layers,
-            output_hidden_states=output_hidden_states,
-            use_mlp_wrapper=use_mlp_wrapper,
-            use_act_checkpoint=use_act_checkpoint,
-        )
-        self.apply(self._init_weights)
-
-        # Create an output linear layer and tie the input and output word
-        # embeddings to reduce parametejs.
-        self.output = nn.Linear(self.textual_feature_size, vocab_size)
-        if not not_tie_weight:
-            self.output.weight = self.embedding.words.weight
-
-    @staticmethod
-    def _init_weights(module):
-        """Initialize weights like BERT - N(0.0, 0.02), bias = 0."""
-
-        if isinstance(module, nn.Linear):
-            module.weight.data.normal_(mean=0.0, std=0.02)
-        elif isinstance(module, nn.MultiheadAttention):
-            module.in_proj_weight.data.normal_(mean=0.0, std=0.02)
-            module.out_proj.weight.data.normal_(mean=0.0, std=0.02)
-        elif isinstance(module, nn.Embedding):
-            module.weight.data.normal_(mean=0.0, std=0.02)
-            if module.padding_idx is not None:
-                module.weight.data[module.padding_idx].zero_()
-
-    def forward(
-        self,
-        hidden_states,
-        text_tokens,
-    ):
-        projected_object_features = (
-            self.object_feature_projection(hidden_states)
-            if hidden_states is not None
-            else None
-        )
-        batch_size, max_text_length = text_tokens.size()
-        text_embeddings = self.embedding(text_tokens)
-
-        # An additive mask for masking the future (one direction).
-        uni_mask_zero_neg = self._generate_future_mask(
-            max_text_length, text_embeddings.dtype, text_embeddings.device
-        )
-
-        # We transpose the first two dimensions of tokens embeddings and visual
-        # features, as required by decoder.
-        text_embeddings = text_embeddings.transpose(0, 1)
-
-        projected_object_features = projected_object_features.transpose(0, 1)
-
-        # if transformer here is the pytorch/decoder, there is no chance, the
-        # output is always tensor
-        trans_out = self.transformer(
-            text_embeddings,
-            projected_object_features,
-            tgt_mask=uni_mask_zero_neg,
-        )
-        if isinstance(trans_out, tuple):
-            textual_features = trans_out[0]
-        else:
-            assert isinstance(trans_out, torch.Tensor)
-            textual_features = trans_out
-        # Undo the transpose and bring batch to dim 0.
-        # shape: (batch_size, max_caption_length, hidden_size)
-        textual_features = textual_features.transpose(0, 1)
-
-        # shape: (batch_size, max_caption_length, vocab_size)
-        output_logits = self.output(textual_features)
-        if isinstance(trans_out, tuple):
-            return output_logits, trans_out[1]
-        else:
-            return output_logits
-
-    def _generate_future_mask(
-        self, size: int, dtype: torch.dtype, device: torch.device
-    ):
-        # Default mask is for forward direction. Flip for backward direction.
-        mask = torch.triu(
-            torch.ones(size, size, device=device, dtype=dtype), diagonal=1
-        )
-        mask = mask.masked_fill(mask == 1, float("-inf"))
-        return mask
-
-
-class AutoRegressiveBeamSearch(object):
-    def __init__(
-        self,
-        end_token_id: int,
-        max_steps: int = 50,
-        beam_size: int = 5,
-        objectdet=True,
-        per_node_beam_size: int = 2,
-    ):
-        self._eos_index = end_token_id
-        self.max_steps = max_steps
-        self.beam_size = beam_size
-        self.objectdet = objectdet
-        self.per_node_beam_size = per_node_beam_size or beam_size
-
-    def search(self, begin_tokens, step):
-        if self.beam_size > 1 and self.objectdet:
-            only_return_best = False
-        else:
-            only_return_best = True
-
-        batch_size = begin_tokens.size()[0]
-
-        predictions = begin_tokens.unsqueeze(1).expand(
-            (batch_size, self.beam_size, begin_tokens.shape[-1])
-        )
-        # Calculate the first timestep. This is done outside the main loop
-        # because we are going from a single decoder input (the output from the
-        # encoder) to the top `beam_size` decoder outputs. On the other hand,
-        # within the main loop we are going from the `beam_size` elements of the
-        # beam to `beam_size`^2 candidates from which we will select the top
-        # `beam_size` elements for the next iteration.
-        # shape: (batch_size, num_classes)
-        start_class_logits = step(begin_tokens)
-
-        # Convert logits to logprobs.
-        # shape: (batch_size * beam_size, vocab_size)
-        start_class_logprobs = F.log_softmax(start_class_logits, dim=1)
-
-        num_classes = start_class_logprobs.size()[1]
-
-        # shape: (batch_size, beam_size), (batch_size, beam_size)
-        start_top_logprobs, start_predicted_classes = start_class_logprobs.topk(
-            self.beam_size
-        )
-
-        if self.beam_size == 1 and (start_predicted_classes == self._eos_index).all():
-            warnings.warn(
-                "Empty object description predicted. You may want to increase beam"
-                "size or ensure your step function is working properly.",
-                RuntimeWarning,
-            )
-            if only_return_best:
-                return start_predicted_classes, start_top_logprobs
-            else:
-                return start_predicted_classes.unsqueeze(-1), start_top_logprobs
-
-        # The log probs for the last time step.
-        # shape: (batch_size, beam_size)
-        last_logprobs = start_top_logprobs
-
-        # shape: (batch_size, beam_size, sequence_length)
-        predictions = torch.cat(
-            [predictions, start_predicted_classes.unsqueeze(-1)], dim=-1
-        )
-
-        # Log probability tensor that mandates that the end token is selected.
-        # shape: (batch_size * beam_size, num_classes)
-        logprobs_after_end = start_class_logprobs.new_full(
-            (batch_size * self.beam_size, num_classes), float("-inf")
-        )
-        logprobs_after_end[:, self._eos_index] = 0.0
-
-        logits_after_end = start_class_logprobs.new_full(
-            (batch_size * self.beam_size, num_classes), float("-inf")
-        )
-        logits_after_end[:, self._eos_index] = 0
-
-        while predictions.shape[-1] < self.max_steps:
-            # shape: (batch_size * beam_size,)
-            last_predictions = predictions[:, :, -1].reshape(
-                batch_size * self.beam_size
-            )
-
-            # If every predicted token from the last step is `self._eos_index`,
-            # then we can stop early.
-            if (last_predictions == self._eos_index).all():
-                break
-
-            predictions_so_far = predictions.view(batch_size * self.beam_size, -1)
-            # shape: (batch_size * beam_size, num_classes)
-            class_logits = step(predictions_so_far)
-
-            # Set logprobs of last predicted tokens as high negative value to avoid
-            # repetition in description.
-            class_logits = class_logits.scatter(
-                1, predictions_so_far[:, -1].view((-1, 1)), -10000
-            )
-
-            # shape: (batch_size * beam_size, num_classes)
-            last_predictions_expanded = last_predictions.unsqueeze(-1).expand(
-                batch_size * self.beam_size, num_classes
-            )
-
-            # Here we are finding any beams where we predicted the end token in
-            # the previous timestep and replacing the distribution with a
-            # one-hot distribution, forcing the beam to predict the end token
-            # this timestep as well.
-            class_logits = torch.where(
-                last_predictions_expanded == self._eos_index,
-                logits_after_end,
-                class_logits,
-            )
-
-            # Convert logits to logprobs.
-            # shape: (batch_size * beam_size, vocab_size)
-            class_logprobs = F.log_softmax(class_logits, dim=1)
-
-            # shape (both): (batch_size * beam_size, per_node_beam_size)
-            top_logprobs, predicted_classes = class_logprobs.topk(
-                self.per_node_beam_size
-            )
-
-            # Here we expand the last log probs to `(batch_size * beam_size,
-            # per_node_beam_size)` so that we can add them to the current log
-            # probs for this timestep. This lets us maintain the log
-            # probability of each element on the beam.
-            # shape: (batch_size * beam_size, per_node_beam_size)
-            expanded_last_logprobs = (
-                last_logprobs.unsqueeze(2)
-                .expand(batch_size, self.beam_size, self.per_node_beam_size)
-                .reshape(batch_size * self.beam_size, self.per_node_beam_size)
-            )
-            # shape: (batch_size * beam_size, per_node_beam_size)
-            summed_top_logprobs = top_logprobs + expanded_last_logprobs
-
-            # shape: (batch_size, beam_size * per_node_beam_size)
-            reshaped_summed = summed_top_logprobs.reshape(
-                batch_size, self.beam_size * self.per_node_beam_size
-            )
-            # shape: (batch_size, beam_size * per_node_beam_size)
-            reshaped_predicted_classes = predicted_classes.reshape(
-                batch_size, self.beam_size * self.per_node_beam_size
-            )
-            # Append the predictions to the current beam.
-            reshaped_beam = (
-                predictions.view(batch_size * self.beam_size, 1, -1)
-                .repeat(1, self.per_node_beam_size, 1)
-                .reshape(batch_size, self.beam_size * self.per_node_beam_size, -1)
-            )
-            # batch_size, (beam_size * per_node_beach_size), #token
-            reshaped_beam = torch.cat(
-                [reshaped_beam, reshaped_predicted_classes.unsqueeze(-1)], dim=-1
-            )
-
-            # Keep only the top `beam_size` beam indices.
-            # shape: (batch_size, beam_size), (batch_size, beam_size)
-            restricted_beam_logprobs, restricted_beam_indices = reshaped_summed.topk(
-                self.beam_size
-            )
-            predictions = reshaped_beam.gather(
-                1,
-                restricted_beam_indices.unsqueeze(-1).repeat(
-                    1, 1, reshaped_beam.shape[-1]
-                ),
-            )
-
-            # shape: (batch_size, beam_size)
-            last_logprobs = restricted_beam_logprobs
-
-        if not torch.isfinite(last_logprobs).all():
-            warnings.warn(
-                "Infinite log probs encountered. Some final descriptions may not "
-                "make sense. This can happen when the beam size is larger than"
-                " the number of valid (non-zero probability) transitions that "
-                "the step function produces.",
-                RuntimeWarning,
-            )
-
-        # Optionally select best beam and its logprobs.
-        if only_return_best:
-            # shape: (batch_size, sequence_length)
-            predictions = predictions[:, 0, :]
-            last_logprobs = last_logprobs[:, 0]
-        num_valid = (predictions != self._eos_index).sum(dim=-1)
-        num_valid += (predictions == self._eos_index).sum(dim=-1) > 0
-        num_valid = num_valid - begin_tokens.shape[1]
-        num_valid = num_valid.clip(min=1)
-
-        last_logprobs = last_logprobs / num_valid
-
-        return predictions, last_logprobs
-
-
-class GRiTTextDecoder(nn.Module):
-    def __init__(
-        self,
-        transformer,
-        begin_token_id=101,
-        beamsearch_decode=None,
-        loss_type=None,
-        tokenizer=None,
-    ):
-        super().__init__()
-        self.textual = transformer
-        self.padding_idx = self.textual.padding_idx
-
-        self.begin_token_id = begin_token_id
-        self.beamsearch_decode = beamsearch_decode
-        self.tokenizer = tokenizer
-
-        if loss_type is None:
-            self.loss = nn.CrossEntropyLoss(ignore_index=self.padding_idx)
-        elif loss_type == "smooth":
-            self.loss = SmoothLabelCrossEntropyLoss(ignore_index=self.padding_idx)
-        else:
-            raise NotImplementedError(loss_type)
-
-    def forward(self, batch):
-        object_features = batch["object_features"]
-
-        if self.training:
-            caption_token_input = batch["text_tokens"]
-
-            output_logits = self.textual(
-                object_features,
-                caption_token_input,
-            )
-
-            if "need_predict" in batch:
-                # in place should also be good, but we do not choose that for
-                # safety as we may use it in prediction results in future
-                target = batch["text_tokens"].clone()
-                target[batch["need_predict"] == 0] = self.padding_idx
-            else:
-                target = batch["text_tokens"]
-
-            feat = output_logits[:, :-1].contiguous()
-            target = target[:, 1:].contiguous()
-            feat = feat.view(-1, self.textual.vocab_size)
-            target = target.view(-1)
-
-            valid_mask = target != self.padding_idx
-            target = target[valid_mask]
-            feat = feat[valid_mask]
-            loss = self.loss(feat, target)
-
-            return loss
-        else:
-            output_dict = self.infer(object_features)
-        return output_dict
-
-    def infer(self, object_features):
-        batch_size = object_features.size(0)
-        begin_tokens = object_features.new_full(
-            (batch_size, 1), self.begin_token_id
-        ).long()
-
-        decoding_step = functools.partial(self.decoding_step, object_features)
-
-        object_description_tokens, logprobs = self.beamsearch_decode.search(
-            begin_tokens, decoding_step
-        )
-
-        output_dict = {
-            "predictions": object_description_tokens,
-            "logprobs": logprobs,
-        }
-
-        return output_dict
-
-    def decoding_step(self, object_features, partial_text):
-        batch_size = object_features.shape[0]
-        beam_size = int(partial_text.size(0) / batch_size)
-        if beam_size > 1:
-            batch_size, num_token, channels = object_features.size()
-            object_features = object_features.unsqueeze(1).repeat(1, beam_size, 1, 1)
-            object_features = object_features.view(
-                batch_size * beam_size, num_token, channels
-            )
-
-        text_lengths = torch.ones_like(partial_text)
-        if len(text_lengths.size()) != 2:
-            partial_text = partial_text.unsqueeze(1)
-
-        # shape: (batch_size * beam_size, partial_caption_length, vocab_size)
-        logits = self.textual(
-            object_features,
-            partial_text,
-        )
-
-        return logits[:, -1, :].float()
-
-
-class SmoothLabelCrossEntropyLoss(nn.Module):
-    def __init__(self, eps=0.1, log_prefix="", ignore_index=None):
-        super().__init__()
-        self.eps = eps
-        self.log_soft = nn.LogSoftmax(dim=1)
-        self.kl = nn.KLDivLoss(reduction="none")
-
-        self.iter = 0
-        self.max_loss = 0
-        self.min_loss = 0
-        self.log_prefix = log_prefix
-        self.ignore_index = ignore_index
-
-    def forward(self, feature, target):
-        feature = feature.float()
-        if self.ignore_index is not None:
-            valid_mask = target != self.ignore_index
-            target = target[valid_mask]
-            feature = feature[valid_mask]
-        assert target.numel() > 0
-        self.iter += 1
-        eps = self.eps
-        n_class = feature.size(1)
-        one_hot = torch.zeros_like(feature).scatter(1, target.view(-1, 1), 1)
-        one_hot = one_hot * (1 - eps) + (1 - one_hot) * eps / (n_class - 1)
-        log_prb = self.log_soft(feature)
-        loss = self.kl(log_prb, one_hot)
-        return loss.sum(dim=1).mean()
diff --git a/eval/vbench/third_party/grit_src/grit/predictor.py b/eval/vbench/third_party/grit_src/grit/predictor.py
deleted file mode 100644
index 80df3739..00000000
--- a/eval/vbench/third_party/grit_src/grit/predictor.py
+++ /dev/null
@@ -1,122 +0,0 @@
-# Copyright (c) Facebook, Inc. and its affiliates.
-# Modified by Jialian Wu from https://github.com/facebookresearch/detectron2/blob/main/detectron2/utils/visualizer.py
-import torch
-from detectron2.engine.defaults import DefaultPredictor
-from detectron2.utils.visualizer import ColorMode, Visualizer
-
-
-class BatchDefaultPredictor(DefaultPredictor):
-    def __call__(self, original_images):
-        """
-        Args:
-            original_image (np.ndarray): an image of shape (H, W, C) (in BGR order).
-
-        Returns:
-            predictions (dict):
-                the output of the model for one image only.
-                See :doc:`/tutorials/models` for details about the format.
-        """
-        with torch.no_grad():  # https://github.com/sphinx-doc/sphinx/issues/4258
-            # Apply pre-processing to image.
-            height, width = original_images.shape[1:3]
-            batch_inputs = []
-            for original_image in original_images:
-                image = self.aug.get_transform(original_image).apply_image(
-                    original_image
-                )
-                image = torch.as_tensor(image.astype("float32").transpose(2, 0, 1))
-
-                inputs = {"image": image, "height": height, "width": width}
-                batch_inputs.append(inputs)
-            predictions = self.model(batch_inputs)[0]
-            return predictions
-
-
-class SingleDefaultPredictor(DefaultPredictor):
-    def __call__(self, original_image):
-        """
-        Args:
-            original_image (np.ndarray): an image of shape (H, W, C) (in BGR order).
-
-        Returns:
-            predictions (dict):
-                the output of the model for one image only.
-                See :doc:`/tutorials/models` for details about the format.
-        """
-        with torch.no_grad():  # https://github.com/sphinx-doc/sphinx/issues/4258
-            # Apply pre-processing to image.
-            height, width = original_image.shape[-3:-1]
-            image = self.aug.get_transform(original_image).apply_image(original_image)
-            image = torch.as_tensor(original_image.astype("float32").transpose(2, 0, 1))
-
-            inputs = {"image": image, "height": height, "width": width}
-            predictions = self.model([inputs])[0]
-            return predictions
-
-
-class Visualizer_GRiT(Visualizer):
-    def __init__(self, image, instance_mode=None):
-        super().__init__(image, instance_mode=instance_mode)
-
-    def draw_instance_predictions(self, predictions):
-        boxes = predictions.pred_boxes if predictions.has("pred_boxes") else None
-        scores = predictions.scores if predictions.has("scores") else None
-        classes = (
-            predictions.pred_classes.tolist()
-            if predictions.has("pred_classes")
-            else None
-        )
-        object_description = predictions.pred_object_descriptions.data
-        # uncomment to output scores in visualized images
-        # object_description = [c + '|' + str(round(s.item(), 1)) for c, s in zip(object_description, scores)]
-
-        if self._instance_mode == ColorMode.SEGMENTATION and self.metadata.get(
-            "thing_colors"
-        ):
-            colors = [
-                self._jitter([x / 255 for x in self.metadata.thing_colors[c]])
-                for c in classes
-            ]
-            alpha = 0.8
-        else:
-            colors = None
-            alpha = 0.5
-
-        if self._instance_mode == ColorMode.IMAGE_BW:
-            self.output.reset_image(
-                self._create_grayscale_image(
-                    (predictions.pred_masks.any(dim=0) > 0).numpy()
-                    if predictions.has("pred_masks")
-                    else None
-                )
-            )
-            alpha = 0.3
-
-        self.overlay_instances(
-            masks=None,
-            boxes=boxes,
-            labels=object_description,
-            keypoints=None,
-            assigned_colors=colors,
-            alpha=alpha,
-        )
-        return self.output
-
-
-class VisualizationDemo(object):
-    def __init__(self, cfg, instance_mode=ColorMode.IMAGE):
-        self.cpu_device = torch.device("cpu")
-        self.instance_mode = instance_mode
-
-        self.predictor = SingleDefaultPredictor(cfg)
-
-    def run_on_image(self, image):
-        # device = image.device
-        predictions = self.predictor(image)
-        # Convert image from OpenCV BGR format to Matplotlib RGB format.
-        image = image[:, :, ::-1]
-        visualizer = Visualizer_GRiT(image, instance_mode=self.instance_mode)
-        instances = predictions["instances"].to(self.cpu_device)
-        vis_output = visualizer.draw_instance_predictions(predictions=instances)
-
-        return predictions, vis_output
diff --git a/eval/vbench/third_party/grit_src/image_dense_captions.py b/eval/vbench/third_party/grit_src/image_dense_captions.py
deleted file mode 100644
index 1031d648..00000000
--- a/eval/vbench/third_party/grit_src/image_dense_captions.py
+++ /dev/null
@@ -1,147 +0,0 @@
-import os
-from itertools import compress
-
-import torch
-from detectron2.config import get_cfg
-from detectron2.data.detection_utils import read_image
-
-# constants
-WINDOW_NAME = "GRiT"
-CUR_DIR = os.path.dirname(os.path.abspath(__file__))
-# sys.path.insert(0, f"{CUR_DIR}/../")
-# print(CUR_DIR)
-import sys
-
-from vbench.utils import CACHE_DIR
-
-sys.path.append(os.path.join(CUR_DIR, "./centernet2/"))
-from centernet.config import add_centernet_config
-
-from .grit.config import add_grit_config
-from .grit.predictor import VisualizationDemo
-
-
-class ObjDescription:
-    def __init__(self, object_descriptions):
-        self.data = object_descriptions
-
-    def __getitem__(self, item):
-        assert type(item) == torch.Tensor
-        assert item.dim() == 1
-        if len(item) > 0:
-            assert item.dtype == torch.int64 or item.dtype == torch.bool
-            if item.dtype == torch.int64:
-                return ObjDescription([self.data[x.item()] for x in item])
-            elif item.dtype == torch.bool:
-                return ObjDescription(list(compress(self.data, item)))
-
-        return ObjDescription(list(compress(self.data, item)))
-
-    def __len__(self):
-        return len(self.data)
-
-    def __repr__(self):
-        return "ObjDescription({})".format(self.data)
-
-
-def dense_pred_to_caption(predictions):
-    boxes = (
-        predictions["instances"].pred_boxes
-        if predictions["instances"].has("pred_boxes")
-        else None
-    )
-    object_description = predictions["instances"].pred_object_descriptions.data
-    new_caption = ""
-    for i in range(len(object_description)):
-        new_caption += (
-            object_description[i]
-            + ": "
-            + str([int(a) for a in boxes[i].tensor.cpu().detach().numpy()[0]])
-        ) + "; "
-    return new_caption
-
-
-def dense_pred_to_caption_only_name(predictions):
-    object_description = predictions["instances"].pred_object_descriptions.data
-    new_caption = ",".join(object_description)
-    del predictions
-    return new_caption
-
-
-def dense_pred_to_caption_tuple(predictions):
-    boxes = (
-        predictions["instances"].pred_boxes
-        if predictions["instances"].has("pred_boxes")
-        else None
-    )
-    object_description = predictions["instances"].pred_object_descriptions.data
-    object_type = predictions["instances"].det_obj.data
-    new_caption = []
-    for i in range(len(object_description)):
-        # new_caption += (object_description[i] + ": " + str([int(a) for a in boxes[i].tensor.cpu().detach().numpy()[0]])) + "; "
-        new_caption.append(
-            (
-                object_description[i],
-                [int(a) for a in boxes[i].tensor.cpu().detach().numpy()[0]],
-                object_type,
-            )
-        )
-    return new_caption
-
-
-def setup_cfg(args):
-    cfg = get_cfg()
-    if args["cpu"]:
-        cfg.MODEL.DEVICE = "cpu"
-    add_centernet_config(cfg)
-    add_grit_config(cfg)
-    cfg.merge_from_file(args["config_file"])
-    cfg.merge_from_list(args["opts"])
-    # Set score_threshold for builtin models
-    cfg.MODEL.ROI_HEADS.SCORE_THRESH_TEST = args["confidence_threshold"]
-    cfg.MODEL.PANOPTIC_FPN.COMBINE.INSTANCES_CONFIDENCE_THRESH = args[
-        "confidence_threshold"
-    ]
-    if args["test_task"]:
-        cfg.MODEL.TEST_TASK = args["test_task"]
-    cfg.MODEL.BEAM_SIZE = 1
-    cfg.MODEL.ROI_HEADS.SOFT_NMS_ENABLED = False
-    cfg.USE_ACT_CHECKPOINT = False
-    cfg.freeze()
-    return cfg
-
-
-def get_parser(
-    device, model_weight=f"{CACHE_DIR}/grit_model/grit_b_densecap_objectdet.pth"
-):
-    arg_dict = {
-        "config_file": f"{CUR_DIR}/configs/GRiT_B_DenseCap_ObjectDet.yaml",
-        "cpu": False,
-        "confidence_threshold": 0.5,
-        "test_task": "DenseCap",
-        "opts": ["MODEL.WEIGHTS", model_weight],
-    }
-    if device.type == "cpu":
-        arg_dict["cpu"] = True
-    return arg_dict
-
-
-def image_caption_api(image_src, device, model_weight):
-    args2 = get_parser(device, model_weight)
-    cfg = setup_cfg(args2)
-    demo = VisualizationDemo(cfg)
-    if image_src:
-        img = read_image(image_src, format="BGR")
-        predictions, visualized_output = demo.run_on_image(img)
-        new_caption = dense_pred_to_caption(predictions)
-    return new_caption
-
-
-def init_demo(device, model_weight, task="DenseCap"):
-    args2 = get_parser(device, model_weight)
-    if task != "DenseCap":
-        args2["test_task"] = task
-    cfg = setup_cfg(args2)
-
-    demo = VisualizationDemo(cfg)
-    return demo
diff --git a/eval/vbench/third_party/tag2Text/__init__.py b/eval/vbench/third_party/tag2Text/__init__.py
deleted file mode 100644
index 868ecfd7..00000000
--- a/eval/vbench/third_party/tag2Text/__init__.py
+++ /dev/null
@@ -1,3 +0,0 @@
-import sys
-
-sys.path.append("third_party/grit_src")
diff --git a/eval/vbench/third_party/tag2Text/config_swinB_384.json b/eval/vbench/third_party/tag2Text/config_swinB_384.json
deleted file mode 100644
index 82a68889..00000000
--- a/eval/vbench/third_party/tag2Text/config_swinB_384.json
+++ /dev/null
@@ -1,9 +0,0 @@
-{
-    "ckpt": "pretrain_model/swin_base_patch4_window7_224_22k.pth",
-    "vision_width": 1024,
-    "image_res": 384,
-    "window_size": 12,
-    "embed_dim": 128,
-    "depths": [ 2, 2, 18, 2 ],
-    "num_heads": [ 4, 8, 16, 32 ]
-  }
diff --git a/eval/vbench/third_party/tag2Text/med.py b/eval/vbench/third_party/tag2Text/med.py
deleted file mode 100644
index 3bfac359..00000000
--- a/eval/vbench/third_party/tag2Text/med.py
+++ /dev/null
@@ -1,1163 +0,0 @@
-"""
- * Copyright (c) 2022, salesforce.com, inc.
- * All rights reserved.
- * SPDX-License-Identifier: BSD-3-Clause
- * For full license text, see LICENSE.txt file in the repo root or https://opensource.org/licenses/BSD-3-Clause
- * By Junnan Li
- * Based on huggingface code base
- * https://github.com/huggingface/transformers/blob/v4.15.0/src/transformers/models/bert
-"""
-
-import math
-import os
-import warnings
-from dataclasses import dataclass
-from typing import Optional, Tuple
-
-import torch
-import torch.nn.functional as F
-import torch.utils.checkpoint
-from torch import Tensor, device, dtype, nn
-from torch.nn import CrossEntropyLoss
-from transformers.activations import ACT2FN
-from transformers.file_utils import ModelOutput
-from transformers.modeling_outputs import (
-    BaseModelOutputWithPastAndCrossAttentions,
-    BaseModelOutputWithPoolingAndCrossAttentions,
-    CausalLMOutputWithCrossAttentions,
-    MaskedLMOutput,
-    MultipleChoiceModelOutput,
-    NextSentencePredictorOutput,
-    QuestionAnsweringModelOutput,
-    SequenceClassifierOutput,
-    TokenClassifierOutput,
-)
-from transformers.modeling_utils import (
-    PreTrainedModel,
-    apply_chunking_to_forward,
-    find_pruneable_heads_and_indices,
-    prune_linear_layer,
-)
-from transformers.models.bert.configuration_bert import BertConfig
-from transformers.utils import logging
-
-logger = logging.get_logger(__name__)
-
-
-class BertEmbeddings_nopos(nn.Module):
-    """Construct the embeddings from word and position embeddings."""
-
-    def __init__(self, config):
-        super().__init__()
-        self.word_embeddings = nn.Embedding(
-            config.vocab_size, config.hidden_size, padding_idx=config.pad_token_id
-        )
-        # self.position_embeddings = nn.Embedding(config.max_position_embeddings, config.hidden_size)
-
-        # self.LayerNorm is not snake-cased to stick with TensorFlow model variable name and be able to load
-        # any TensorFlow checkpoint file
-        self.LayerNorm = nn.LayerNorm(config.hidden_size, eps=config.layer_norm_eps)
-        self.dropout = nn.Dropout(config.hidden_dropout_prob)
-
-        # position_ids (1, len position emb) is contiguous in memory and exported when serialized
-        # self.register_buffer("position_ids", torch.arange(config.max_position_embeddings).expand((1, -1)))
-        # self.position_embedding_type = getattr(config, "position_embedding_type", "absolute")
-
-        self.config = config
-
-    def forward(
-        self,
-        input_ids=None,
-        position_ids=None,
-        inputs_embeds=None,
-        past_key_values_length=0,
-    ):
-        if input_ids is not None:
-            input_shape = input_ids.size()
-        else:
-            input_shape = inputs_embeds.size()[:-1]
-
-        seq_length = input_shape[1]
-
-        # if position_ids is None:
-        # position_ids = self.position_ids[:, past_key_values_length : seq_length + past_key_values_length]
-
-        if inputs_embeds is None:
-            inputs_embeds = self.word_embeddings(input_ids)
-
-        embeddings = inputs_embeds
-
-        # if self.position_embedding_type == "absolute":
-        #     position_embeddings = self.position_embeddings(position_ids)
-        #     # print('add position_embeddings!!!!')
-        #     embeddings += position_embeddings
-        embeddings = self.LayerNorm(embeddings)
-        embeddings = self.dropout(embeddings)
-        return embeddings
-
-
-class BertEmbeddings(nn.Module):
-    """Construct the embeddings from word and position embeddings."""
-
-    def __init__(self, config):
-        super().__init__()
-        self.word_embeddings = nn.Embedding(
-            config.vocab_size, config.hidden_size, padding_idx=config.pad_token_id
-        )
-        self.position_embeddings = nn.Embedding(
-            config.max_position_embeddings, config.hidden_size
-        )
-
-        # self.LayerNorm is not snake-cased to stick with TensorFlow model variable name and be able to load
-        # any TensorFlow checkpoint file
-        self.LayerNorm = nn.LayerNorm(config.hidden_size, eps=config.layer_norm_eps)
-        self.dropout = nn.Dropout(config.hidden_dropout_prob)
-
-        # position_ids (1, len position emb) is contiguous in memory and exported when serialized
-        self.register_buffer(
-            "position_ids", torch.arange(config.max_position_embeddings).expand((1, -1))
-        )
-        self.position_embedding_type = getattr(
-            config, "position_embedding_type", "absolute"
-        )
-
-        self.config = config
-
-    def forward(
-        self,
-        input_ids=None,
-        position_ids=None,
-        inputs_embeds=None,
-        past_key_values_length=0,
-    ):
-        if input_ids is not None:
-            input_shape = input_ids.size()
-        else:
-            input_shape = inputs_embeds.size()[:-1]
-
-        seq_length = input_shape[1]
-
-        if position_ids is None:
-            position_ids = self.position_ids[
-                :, past_key_values_length : seq_length + past_key_values_length
-            ]
-
-        if inputs_embeds is None:
-            inputs_embeds = self.word_embeddings(input_ids)
-
-        embeddings = inputs_embeds
-
-        if self.position_embedding_type == "absolute":
-            position_embeddings = self.position_embeddings(position_ids)
-            # print('add position_embeddings!!!!')
-            embeddings += position_embeddings
-        embeddings = self.LayerNorm(embeddings)
-        embeddings = self.dropout(embeddings)
-        return embeddings
-
-
-class BertSelfAttention(nn.Module):
-    def __init__(self, config, is_cross_attention):
-        super().__init__()
-        self.config = config
-        if config.hidden_size % config.num_attention_heads != 0 and not hasattr(
-            config, "embedding_size"
-        ):
-            raise ValueError(
-                "The hidden size (%d) is not a multiple of the number of attention "
-                "heads (%d)" % (config.hidden_size, config.num_attention_heads)
-            )
-
-        self.num_attention_heads = config.num_attention_heads
-        self.attention_head_size = int(config.hidden_size / config.num_attention_heads)
-        self.all_head_size = self.num_attention_heads * self.attention_head_size
-
-        self.query = nn.Linear(config.hidden_size, self.all_head_size)
-        if is_cross_attention:
-            self.key = nn.Linear(config.encoder_width, self.all_head_size)
-            self.value = nn.Linear(config.encoder_width, self.all_head_size)
-        else:
-            self.key = nn.Linear(config.hidden_size, self.all_head_size)
-            self.value = nn.Linear(config.hidden_size, self.all_head_size)
-
-        self.dropout = nn.Dropout(config.attention_probs_dropout_prob)
-        self.position_embedding_type = getattr(
-            config, "position_embedding_type", "absolute"
-        )
-        if (
-            self.position_embedding_type == "relative_key"
-            or self.position_embedding_type == "relative_key_query"
-        ):
-            self.max_position_embeddings = config.max_position_embeddings
-            self.distance_embedding = nn.Embedding(
-                2 * config.max_position_embeddings - 1, self.attention_head_size
-            )
-        self.save_attention = False
-
-    def save_attn_gradients(self, attn_gradients):
-        self.attn_gradients = attn_gradients
-
-    def get_attn_gradients(self):
-        return self.attn_gradients
-
-    def save_attention_map(self, attention_map):
-        self.attention_map = attention_map
-
-    def get_attention_map(self):
-        return self.attention_map
-
-    def transpose_for_scores(self, x):
-        new_x_shape = x.size()[:-1] + (
-            self.num_attention_heads,
-            self.attention_head_size,
-        )
-        x = x.view(*new_x_shape)
-        return x.permute(0, 2, 1, 3)
-
-    def forward(
-        self,
-        hidden_states,
-        attention_mask=None,
-        head_mask=None,
-        encoder_hidden_states=None,
-        encoder_attention_mask=None,
-        past_key_value=None,
-        output_attentions=False,
-    ):
-        mixed_query_layer = self.query(hidden_states)
-
-        # If this is instantiated as a cross-attention module, the keys
-        # and values come from an encoder; the attention mask needs to be
-        # such that the encoder's padding tokens are not attended to.
-        is_cross_attention = encoder_hidden_states is not None
-
-        if is_cross_attention:
-            # print(self.key.weight.shape)
-            key_layer = self.transpose_for_scores(self.key(encoder_hidden_states))
-            value_layer = self.transpose_for_scores(self.value(encoder_hidden_states))
-            attention_mask = encoder_attention_mask
-        elif past_key_value is not None:
-            key_layer = self.transpose_for_scores(self.key(hidden_states))
-            value_layer = self.transpose_for_scores(self.value(hidden_states))
-            key_layer = torch.cat([past_key_value[0], key_layer], dim=2)
-            value_layer = torch.cat([past_key_value[1], value_layer], dim=2)
-        else:
-            key_layer = self.transpose_for_scores(self.key(hidden_states))
-            value_layer = self.transpose_for_scores(self.value(hidden_states))
-
-        query_layer = self.transpose_for_scores(mixed_query_layer)
-
-        if key_layer.shape[0] > query_layer.shape[0]:
-            key_layer = key_layer[: query_layer.shape[0], :, :, :]
-            attention_mask = attention_mask[: query_layer.shape[0], :, :]
-            value_layer = value_layer[: query_layer.shape[0], :, :, :]
-        attention_scores = torch.matmul(query_layer, key_layer.transpose(-1, -2))
-
-        past_key_value = (key_layer, value_layer)
-
-        # Take the dot product between "query" and "key" to get the raw attention scores.
-        attention_scores = torch.matmul(query_layer, key_layer.transpose(-1, -2))
-
-        if (
-            self.position_embedding_type == "relative_key"
-            or self.position_embedding_type == "relative_key_query"
-        ):
-            seq_length = hidden_states.size()[1]
-            position_ids_l = torch.arange(
-                seq_length, dtype=torch.long, device=hidden_states.device
-            ).view(-1, 1)
-            position_ids_r = torch.arange(
-                seq_length, dtype=torch.long, device=hidden_states.device
-            ).view(1, -1)
-            distance = position_ids_l - position_ids_r
-            positional_embedding = self.distance_embedding(
-                distance + self.max_position_embeddings - 1
-            )
-            positional_embedding = positional_embedding.to(
-                dtype=query_layer.dtype
-            )  # fp16 compatibility
-
-            if self.position_embedding_type == "relative_key":
-                relative_position_scores = torch.einsum(
-                    "bhld,lrd->bhlr", query_layer, positional_embedding
-                )
-                attention_scores = attention_scores + relative_position_scores
-            elif self.position_embedding_type == "relative_key_query":
-                relative_position_scores_query = torch.einsum(
-                    "bhld,lrd->bhlr", query_layer, positional_embedding
-                )
-                relative_position_scores_key = torch.einsum(
-                    "bhrd,lrd->bhlr", key_layer, positional_embedding
-                )
-                attention_scores = (
-                    attention_scores
-                    + relative_position_scores_query
-                    + relative_position_scores_key
-                )
-
-        attention_scores = attention_scores / math.sqrt(self.attention_head_size)
-        if attention_mask is not None:
-            # Apply the attention mask is (precomputed for all layers in BertModel forward() function)
-            attention_scores = attention_scores + attention_mask
-
-        # Normalize the attention scores to probabilities.
-        attention_probs = nn.Softmax(dim=-1)(attention_scores)
-
-        if is_cross_attention and self.save_attention:
-            self.save_attention_map(attention_probs)
-            attention_probs.register_hook(self.save_attn_gradients)
-
-        # This is actually dropping out entire tokens to attend to, which might
-        # seem a bit unusual, but is taken from the original Transformer paper.
-        attention_probs_dropped = self.dropout(attention_probs)
-
-        # Mask heads if we want to
-        if head_mask is not None:
-            attention_probs_dropped = attention_probs_dropped * head_mask
-
-        context_layer = torch.matmul(attention_probs_dropped, value_layer)
-
-        context_layer = context_layer.permute(0, 2, 1, 3).contiguous()
-        new_context_layer_shape = context_layer.size()[:-2] + (self.all_head_size,)
-        context_layer = context_layer.view(*new_context_layer_shape)
-
-        outputs = (
-            (context_layer, attention_probs) if output_attentions else (context_layer,)
-        )
-
-        outputs = outputs + (past_key_value,)
-        return outputs
-
-
-class BertSelfOutput(nn.Module):
-    def __init__(self, config):
-        super().__init__()
-        self.dense = nn.Linear(config.hidden_size, config.hidden_size)
-        self.LayerNorm = nn.LayerNorm(config.hidden_size, eps=config.layer_norm_eps)
-        self.dropout = nn.Dropout(config.hidden_dropout_prob)
-
-    def forward(self, hidden_states, input_tensor):
-        hidden_states = self.dense(hidden_states)
-        hidden_states = self.dropout(hidden_states)
-        hidden_states = self.LayerNorm(hidden_states + input_tensor)
-        return hidden_states
-
-
-class BertAttention(nn.Module):
-    def __init__(self, config, is_cross_attention=False):
-        super().__init__()
-        self.self = BertSelfAttention(config, is_cross_attention)
-        self.output = BertSelfOutput(config)
-        self.pruned_heads = set()
-
-    def prune_heads(self, heads):
-        if len(heads) == 0:
-            return
-        heads, index = find_pruneable_heads_and_indices(
-            heads,
-            self.self.num_attention_heads,
-            self.self.attention_head_size,
-            self.pruned_heads,
-        )
-
-        # Prune linear layers
-        self.self.query = prune_linear_layer(self.self.query, index)
-        self.self.key = prune_linear_layer(self.self.key, index)
-        self.self.value = prune_linear_layer(self.self.value, index)
-        self.output.dense = prune_linear_layer(self.output.dense, index, dim=1)
-
-        # Update hyper params and store pruned heads
-        self.self.num_attention_heads = self.self.num_attention_heads - len(heads)
-        self.self.all_head_size = (
-            self.self.attention_head_size * self.self.num_attention_heads
-        )
-        self.pruned_heads = self.pruned_heads.union(heads)
-
-    def forward(
-        self,
-        hidden_states,
-        attention_mask=None,
-        head_mask=None,
-        encoder_hidden_states=None,
-        encoder_attention_mask=None,
-        past_key_value=None,
-        output_attentions=False,
-    ):
-        self_outputs = self.self(
-            hidden_states,
-            attention_mask,
-            head_mask,
-            encoder_hidden_states,
-            encoder_attention_mask,
-            past_key_value,
-            output_attentions,
-        )
-        attention_output = self.output(self_outputs[0], hidden_states)
-        outputs = (attention_output,) + self_outputs[
-            1:
-        ]  # add attentions if we output them
-        return outputs
-
-
-class BertIntermediate(nn.Module):
-    def __init__(self, config):
-        super().__init__()
-        self.dense = nn.Linear(config.hidden_size, config.intermediate_size)
-        if isinstance(config.hidden_act, str):
-            self.intermediate_act_fn = ACT2FN[config.hidden_act]
-        else:
-            self.intermediate_act_fn = config.hidden_act
-
-    def forward(self, hidden_states):
-        hidden_states = self.dense(hidden_states)
-        hidden_states = self.intermediate_act_fn(hidden_states)
-        return hidden_states
-
-
-class BertOutput(nn.Module):
-    def __init__(self, config):
-        super().__init__()
-        self.dense = nn.Linear(config.intermediate_size, config.hidden_size)
-        self.LayerNorm = nn.LayerNorm(config.hidden_size, eps=config.layer_norm_eps)
-        self.dropout = nn.Dropout(config.hidden_dropout_prob)
-
-    def forward(self, hidden_states, input_tensor):
-        hidden_states = self.dense(hidden_states)
-        hidden_states = self.dropout(hidden_states)
-        hidden_states = self.LayerNorm(hidden_states + input_tensor)
-        return hidden_states
-
-
-class BertLayer(nn.Module):
-    def __init__(self, config, layer_num):
-        super().__init__()
-        self.config = config
-        self.chunk_size_feed_forward = config.chunk_size_feed_forward
-        self.seq_len_dim = 1
-        self.attention = BertAttention(config)
-        self.layer_num = layer_num
-        if self.config.add_cross_attention:
-            self.crossattention = BertAttention(
-                config, is_cross_attention=self.config.add_cross_attention
-            )
-        self.intermediate = BertIntermediate(config)
-        self.output = BertOutput(config)
-
-    def forward(
-        self,
-        hidden_states,
-        attention_mask=None,
-        head_mask=None,
-        encoder_hidden_states=None,
-        encoder_attention_mask=None,
-        past_key_value=None,
-        output_attentions=False,
-        mode=None,
-    ):
-
-        if mode == "mlr":
-
-            assert (
-                encoder_hidden_states is not None
-            ), "encoder_hidden_states must be given for cross-attention layers"
-
-            # print('attention_output.shape',attention_output.shape)
-            # print('encoder_hidden_states.shape',encoder_hidden_states.shape)
-            cross_attention_outputs = self.crossattention(
-                hidden_states,
-                attention_mask,
-                head_mask,
-                encoder_hidden_states,
-                encoder_attention_mask,
-                output_attentions=output_attentions,
-            )
-            attention_output = cross_attention_outputs[0]
-            outputs = cross_attention_outputs[
-                1:-1
-            ]  # add cross attentions if we output attention weights
-
-            present_key_value = cross_attention_outputs[-1]
-
-        else:
-            # decoder uni-directional self-attention cached key/values tuple is at positions 1,2
-            self_attn_past_key_value = (
-                past_key_value[:2] if past_key_value is not None else None
-            )
-            self_attention_outputs = self.attention(
-                hidden_states,
-                attention_mask,
-                head_mask,
-                output_attentions=output_attentions,
-                past_key_value=self_attn_past_key_value,
-            )
-            attention_output = self_attention_outputs[0]
-
-            outputs = self_attention_outputs[1:-1]
-            present_key_value = self_attention_outputs[-1]
-
-            if mode == "multimodal":
-                assert (
-                    encoder_hidden_states is not None
-                ), "encoder_hidden_states must be given for cross-attention layers"
-
-                cross_attention_outputs = self.crossattention(
-                    attention_output,
-                    attention_mask,
-                    head_mask,
-                    encoder_hidden_states,
-                    encoder_attention_mask,
-                    output_attentions=output_attentions,
-                )
-                attention_output = cross_attention_outputs[0]
-                outputs = (
-                    outputs + cross_attention_outputs[1:-1]
-                )  # add cross attentions if we output attention weights
-        layer_output = apply_chunking_to_forward(
-            self.feed_forward_chunk,
-            self.chunk_size_feed_forward,
-            self.seq_len_dim,
-            attention_output,
-        )
-        outputs = (layer_output,) + outputs
-
-        outputs = outputs + (present_key_value,)
-
-        return outputs
-
-    def feed_forward_chunk(self, attention_output):
-        intermediate_output = self.intermediate(attention_output)
-        layer_output = self.output(intermediate_output, attention_output)
-        return layer_output
-
-
-class BertEncoder(nn.Module):
-    def __init__(self, config):
-        super().__init__()
-        self.config = config
-        self.layer = nn.ModuleList(
-            [BertLayer(config, i) for i in range(config.num_hidden_layers)]
-        )
-        self.gradient_checkpointing = False
-
-    def forward(
-        self,
-        hidden_states,
-        attention_mask=None,
-        head_mask=None,
-        encoder_hidden_states=None,
-        encoder_attention_mask=None,
-        past_key_values=None,
-        use_cache=None,
-        output_attentions=False,
-        output_hidden_states=False,
-        return_dict=True,
-        mode="multimodal",
-    ):
-        all_hidden_states = () if output_hidden_states else None
-        all_self_attentions = () if output_attentions else None
-        all_cross_attentions = (
-            () if output_attentions and self.config.add_cross_attention else None
-        )
-
-        next_decoder_cache = () if use_cache else None
-
-        for i in range(self.config.num_hidden_layers):
-            layer_module = self.layer[i]
-            if output_hidden_states:
-                all_hidden_states = all_hidden_states + (hidden_states,)
-
-            layer_head_mask = head_mask[i] if head_mask is not None else None
-            past_key_value = past_key_values[i] if past_key_values is not None else None
-
-            if self.gradient_checkpointing and self.training:
-
-                if use_cache:
-                    logger.warning(
-                        "`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`..."
-                    )
-                    use_cache = False
-
-                def create_custom_forward(module):
-                    def custom_forward(*inputs):
-                        return module(*inputs, past_key_value, output_attentions)
-
-                    return custom_forward
-
-                layer_outputs = torch.utils.checkpoint.checkpoint(
-                    create_custom_forward(layer_module),
-                    hidden_states,
-                    attention_mask,
-                    layer_head_mask,
-                    encoder_hidden_states,
-                    encoder_attention_mask,
-                    mode=mode,
-                )
-            else:
-                layer_outputs = layer_module(
-                    hidden_states,
-                    attention_mask,
-                    layer_head_mask,
-                    encoder_hidden_states,
-                    encoder_attention_mask,
-                    past_key_value,
-                    output_attentions,
-                    mode=mode,
-                )
-
-            hidden_states = layer_outputs[0]
-            if use_cache:
-                next_decoder_cache += (layer_outputs[-1],)
-            if output_attentions:
-                all_self_attentions = all_self_attentions + (layer_outputs[1],)
-
-        if output_hidden_states:
-            all_hidden_states = all_hidden_states + (hidden_states,)
-
-        if not return_dict:
-            return tuple(
-                v
-                for v in [
-                    hidden_states,
-                    next_decoder_cache,
-                    all_hidden_states,
-                    all_self_attentions,
-                    all_cross_attentions,
-                ]
-                if v is not None
-            )
-        return BaseModelOutputWithPastAndCrossAttentions(
-            last_hidden_state=hidden_states,
-            past_key_values=next_decoder_cache,
-            hidden_states=all_hidden_states,
-            attentions=all_self_attentions,
-            cross_attentions=all_cross_attentions,
-        )
-
-
-class BertPooler(nn.Module):
-    def __init__(self, config):
-        super().__init__()
-        self.dense = nn.Linear(config.hidden_size, config.hidden_size)
-        self.activation = nn.Tanh()
-
-    def forward(self, hidden_states):
-        # We "pool" the model by simply taking the hidden state corresponding
-        # to the first token.
-        first_token_tensor = hidden_states[:, 0]
-        pooled_output = self.dense(first_token_tensor)
-        pooled_output = self.activation(pooled_output)
-        return pooled_output
-
-
-class BertPredictionHeadTransform(nn.Module):
-    def __init__(self, config):
-        super().__init__()
-        self.dense = nn.Linear(config.hidden_size, config.hidden_size)
-        if isinstance(config.hidden_act, str):
-            self.transform_act_fn = ACT2FN[config.hidden_act]
-        else:
-            self.transform_act_fn = config.hidden_act
-        self.LayerNorm = nn.LayerNorm(config.hidden_size, eps=config.layer_norm_eps)
-
-    def forward(self, hidden_states):
-        hidden_states = self.dense(hidden_states)
-        hidden_states = self.transform_act_fn(hidden_states)
-        hidden_states = self.LayerNorm(hidden_states)
-        return hidden_states
-
-
-class BertLMPredictionHead(nn.Module):
-    def __init__(self, config):
-        super().__init__()
-        self.transform = BertPredictionHeadTransform(config)
-
-        # The output weights are the same as the input embeddings, but there is
-        # an output-only bias for each token.
-        self.decoder = nn.Linear(config.hidden_size, config.vocab_size, bias=False)
-
-        self.bias = nn.Parameter(torch.zeros(config.vocab_size))
-
-        # Need a link between the two variables so that the bias is correctly resized with `resize_token_embeddings`
-        self.decoder.bias = self.bias
-
-    def forward(self, hidden_states):
-        hidden_states = self.transform(hidden_states)
-        hidden_states = self.decoder(hidden_states)
-        return hidden_states
-
-
-class BertOnlyMLMHead(nn.Module):
-    def __init__(self, config):
-        super().__init__()
-        self.predictions = BertLMPredictionHead(config)
-
-    def forward(self, sequence_output):
-        prediction_scores = self.predictions(sequence_output)
-        return prediction_scores
-
-
-class BertPreTrainedModel(PreTrainedModel):
-    """
-    An abstract class to handle weights initialization and a simple interface for downloading and loading pretrained
-    models.
-    """
-
-    config_class = BertConfig
-    base_model_prefix = "bert"
-    _keys_to_ignore_on_load_missing = [r"position_ids"]
-
-    def _init_weights(self, module):
-        """Initialize the weights"""
-        if isinstance(module, (nn.Linear, nn.Embedding)):
-            # Slightly different from the TF version which uses truncated_normal for initialization
-            # cf https://github.com/pytorch/pytorch/pull/5617
-            module.weight.data.normal_(mean=0.0, std=self.config.initializer_range)
-        elif isinstance(module, nn.LayerNorm):
-            module.bias.data.zero_()
-            module.weight.data.fill_(1.0)
-        if isinstance(module, nn.Linear) and module.bias is not None:
-            module.bias.data.zero_()
-
-
-class BertModel(BertPreTrainedModel):
-    """
-    The model can behave as an encoder (with only self-attention) as well as a decoder, in which case a layer of
-    cross-attention is added between the self-attention layers, following the architecture described in `Attention is
-    all you need <https://arxiv.org/abs/1706.03762>`__ by Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit,
-    Llion Jones, Aidan N. Gomez, Lukasz Kaiser and Illia Polosukhin.
-    argument and :obj:`add_cross_attention` set to :obj:`True`; an :obj:`encoder_hidden_states` is then expected as an
-    input to the forward pass.
-    """
-
-    def __init__(self, config, add_pooling_layer=True):
-        super().__init__(config)
-        self.config = config
-
-        self.embeddings = BertEmbeddings(config)
-
-        self.encoder = BertEncoder(config)
-
-        self.pooler = BertPooler(config) if add_pooling_layer else None
-
-        self.init_weights()
-
-    def get_input_embeddings(self):
-        return self.embeddings.word_embeddings
-
-    def set_input_embeddings(self, value):
-        self.embeddings.word_embeddings = value
-
-    def _prune_heads(self, heads_to_prune):
-        """
-        Prunes heads of the model. heads_to_prune: dict of {layer_num: list of heads to prune in this layer} See base
-        class PreTrainedModel
-        """
-        for layer, heads in heads_to_prune.items():
-            self.encoder.layer[layer].attention.prune_heads(heads)
-
-    def get_extended_attention_mask(
-        self,
-        attention_mask: Tensor,
-        input_shape: Tuple[int],
-        device: device,
-        is_decoder: bool,
-    ) -> Tensor:
-        """
-        Makes broadcastable attention and causal masks so that future and masked tokens are ignored.
-
-        Arguments:
-            attention_mask (:obj:`torch.Tensor`):
-                Mask with ones indicating tokens to attend to, zeros for tokens to ignore.
-            input_shape (:obj:`Tuple[int]`):
-                The shape of the input to the model.
-            device: (:obj:`torch.device`):
-                The device of the input to the model.
-
-        Returns:
-            :obj:`torch.Tensor` The extended attention mask, with a the same dtype as :obj:`attention_mask.dtype`.
-        """
-        # We can provide a self-attention mask of dimensions [batch_size, from_seq_length, to_seq_length]
-        # ourselves in which case we just need to make it broadcastable to all heads.
-        if attention_mask.dim() == 3:
-            extended_attention_mask = attention_mask[:, None, :, :]
-        elif attention_mask.dim() == 2:
-            # Provided a padding mask of dimensions [batch_size, seq_length]
-            # - if the model is a decoder, apply a causal mask in addition to the padding mask
-            # - if the model is an encoder, make the mask broadcastable to [batch_size, num_heads, seq_length, seq_length]
-            if is_decoder:
-                batch_size, seq_length = input_shape
-
-                seq_ids = torch.arange(seq_length, device=device)
-                causal_mask = (
-                    seq_ids[None, None, :].repeat(batch_size, seq_length, 1)
-                    <= seq_ids[None, :, None]
-                )
-                # in case past_key_values are used we need to add a prefix ones mask to the causal mask
-                # causal and attention masks must have same type with pytorch version < 1.3
-                causal_mask = causal_mask.to(attention_mask.dtype)
-
-                if causal_mask.shape[1] < attention_mask.shape[1]:
-                    prefix_seq_len = attention_mask.shape[1] - causal_mask.shape[1]
-                    causal_mask = torch.cat(
-                        [
-                            torch.ones(
-                                (batch_size, seq_length, prefix_seq_len),
-                                device=device,
-                                dtype=causal_mask.dtype,
-                            ),
-                            causal_mask,
-                        ],
-                        axis=-1,
-                    )
-
-                extended_attention_mask = (
-                    causal_mask[:, None, :, :] * attention_mask[:, None, None, :]
-                )
-            else:
-                extended_attention_mask = attention_mask[:, None, None, :]
-        else:
-            raise ValueError(
-                "Wrong shape for input_ids (shape {}) or attention_mask (shape {})".format(
-                    input_shape, attention_mask.shape
-                )
-            )
-
-        # Since attention_mask is 1.0 for positions we want to attend and 0.0 for
-        # masked positions, this operation will create a tensor which is 0.0 for
-        # positions we want to attend and -10000.0 for masked positions.
-        # Since we are adding it to the raw scores before the softmax, this is
-        # effectively the same as removing these entirely.
-        extended_attention_mask = extended_attention_mask.to(
-            dtype=self.dtype
-        )  # fp16 compatibility
-        extended_attention_mask = (1.0 - extended_attention_mask) * -10000.0
-        return extended_attention_mask
-
-    def forward(
-        self,
-        input_ids=None,
-        attention_mask=None,
-        position_ids=None,
-        head_mask=None,
-        inputs_embeds=None,
-        encoder_embeds=None,
-        encoder_hidden_states=None,
-        encoder_attention_mask=None,
-        past_key_values=None,
-        use_cache=None,
-        output_attentions=None,
-        output_hidden_states=None,
-        return_dict=None,
-        is_decoder=False,
-        mode="multimodal",
-    ):
-        r"""
-        encoder_hidden_states  (:obj:`torch.FloatTensor` of shape :obj:`(batch_size, sequence_length, hidden_size)`, `optional`):
-            Sequence of hidden-states at the output of the last layer of the encoder. Used in the cross-attention if
-            the model is configured as a decoder.
-        encoder_attention_mask (:obj:`torch.FloatTensor` of shape :obj:`(batch_size, sequence_length)`, `optional`):
-            Mask to avoid performing attention on the padding token indices of the encoder input. This mask is used in
-            the cross-attention if the model is configured as a decoder. Mask values selected in ``[0, 1]``:
-            - 1 for tokens that are **not masked**,
-            - 0 for tokens that are **masked**.
-        past_key_values (:obj:`tuple(tuple(torch.FloatTensor))` of length :obj:`config.n_layers` with each tuple having 4 tensors of shape :obj:`(batch_size, num_heads, sequence_length - 1, embed_size_per_head)`):
-            Contains precomputed key and value hidden states of the attention blocks. Can be used to speed up decoding.
-            If :obj:`past_key_values` are used, the user can optionally input only the last :obj:`decoder_input_ids`
-            (those that don't have their past key value states given to this model) of shape :obj:`(batch_size, 1)`
-            instead of all :obj:`decoder_input_ids` of shape :obj:`(batch_size, sequence_length)`.
-        use_cache (:obj:`bool`, `optional`):
-            If set to :obj:`True`, :obj:`past_key_values` key value states are returned and can be used to speed up
-            decoding (see :obj:`past_key_values`).
-        """
-        output_attentions = (
-            output_attentions
-            if output_attentions is not None
-            else self.config.output_attentions
-        )
-        output_hidden_states = (
-            output_hidden_states
-            if output_hidden_states is not None
-            else self.config.output_hidden_states
-        )
-        return_dict = (
-            return_dict if return_dict is not None else self.config.use_return_dict
-        )
-
-        if is_decoder:
-            use_cache = use_cache if use_cache is not None else self.config.use_cache
-        else:
-            use_cache = False
-
-        if input_ids is not None and inputs_embeds is not None:
-            raise ValueError(
-                "You cannot specify both input_ids and inputs_embeds at the same time"
-            )
-        elif input_ids is not None:
-            input_shape = input_ids.size()
-            batch_size, seq_length = input_shape
-            device = input_ids.device
-        elif inputs_embeds is not None:
-            input_shape = inputs_embeds.size()[:-1]
-            batch_size, seq_length = input_shape
-            device = inputs_embeds.device
-        elif encoder_embeds is not None:
-            input_shape = encoder_embeds.size()[:-1]
-            batch_size, seq_length = input_shape
-            device = encoder_embeds.device
-        else:
-            raise ValueError(
-                "You have to specify either input_ids or inputs_embeds or encoder_embeds"
-            )
-
-        # past_key_values_length
-        past_key_values_length = (
-            past_key_values[0][0].shape[2] if past_key_values is not None else 0
-        )
-
-        if attention_mask is None:
-            attention_mask = torch.ones(
-                ((batch_size, seq_length + past_key_values_length)), device=device
-            )
-
-        # We can provide a self-attention mask of dimensions [batch_size, from_seq_length, to_seq_length]
-        # ourselves in which case we just need to make it broadcastable to all heads.
-        extended_attention_mask: torch.Tensor = self.get_extended_attention_mask(
-            attention_mask, input_shape, device, is_decoder
-        )
-
-        # If a 2D or 3D attention mask is provided for the cross-attention
-        # we need to make broadcastable to [batch_size, num_heads, seq_length, seq_length]
-        if encoder_hidden_states is not None:
-            if type(encoder_hidden_states) == list:
-                encoder_batch_size, encoder_sequence_length, _ = encoder_hidden_states[
-                    0
-                ].size()
-            else:
-                encoder_batch_size, encoder_sequence_length, _ = (
-                    encoder_hidden_states.size()
-                )
-            encoder_hidden_shape = (encoder_batch_size, encoder_sequence_length)
-
-            if type(encoder_attention_mask) == list:
-                encoder_extended_attention_mask = [
-                    self.invert_attention_mask(mask) for mask in encoder_attention_mask
-                ]
-            elif encoder_attention_mask is None:
-                encoder_attention_mask = torch.ones(encoder_hidden_shape, device=device)
-                encoder_extended_attention_mask = self.invert_attention_mask(
-                    encoder_attention_mask
-                )
-            else:
-                encoder_extended_attention_mask = self.invert_attention_mask(
-                    encoder_attention_mask
-                )
-        else:
-            encoder_extended_attention_mask = None
-
-        # Prepare head mask if needed
-        # 1.0 in head_mask indicate we keep the head
-        # attention_probs has shape bsz x n_heads x N x N
-        # input head_mask has shape [num_heads] or [num_hidden_layers x num_heads]
-        # and head_mask is converted to shape [num_hidden_layers x batch x num_heads x seq_length x seq_length]
-        head_mask = self.get_head_mask(head_mask, self.config.num_hidden_layers)
-
-        if encoder_embeds is None:
-            embedding_output = self.embeddings(
-                input_ids=input_ids,
-                position_ids=position_ids,
-                inputs_embeds=inputs_embeds,
-                past_key_values_length=past_key_values_length,
-            )
-        else:
-            embedding_output = encoder_embeds
-
-        encoder_outputs = self.encoder(
-            embedding_output,
-            attention_mask=extended_attention_mask,
-            head_mask=head_mask,
-            encoder_hidden_states=encoder_hidden_states,
-            encoder_attention_mask=encoder_extended_attention_mask,
-            past_key_values=past_key_values,
-            use_cache=use_cache,
-            output_attentions=output_attentions,
-            output_hidden_states=output_hidden_states,
-            return_dict=return_dict,
-            mode=mode,
-        )
-        sequence_output = encoder_outputs[0]
-        pooled_output = (
-            self.pooler(sequence_output) if self.pooler is not None else None
-        )
-
-        if not return_dict:
-            return (sequence_output, pooled_output) + encoder_outputs[1:]
-
-        return BaseModelOutputWithPoolingAndCrossAttentions(
-            last_hidden_state=sequence_output,
-            pooler_output=pooled_output,
-            past_key_values=encoder_outputs.past_key_values,
-            hidden_states=encoder_outputs.hidden_states,
-            attentions=encoder_outputs.attentions,
-            cross_attentions=encoder_outputs.cross_attentions,
-        )
-
-
-class BertLMHeadModel(BertPreTrainedModel):
-
-    _keys_to_ignore_on_load_unexpected = [r"pooler"]
-    _keys_to_ignore_on_load_missing = [r"position_ids", r"predictions.decoder.bias"]
-
-    def __init__(self, config):
-        super().__init__(config)
-
-        self.bert = BertModel(config, add_pooling_layer=False)
-        self.cls = BertOnlyMLMHead(config)
-
-        self.init_weights()
-
-    def get_output_embeddings(self):
-        return self.cls.predictions.decoder
-
-    def set_output_embeddings(self, new_embeddings):
-        self.cls.predictions.decoder = new_embeddings
-
-    def forward(
-        self,
-        input_ids=None,
-        attention_mask=None,
-        position_ids=None,
-        head_mask=None,
-        inputs_embeds=None,
-        encoder_hidden_states=None,
-        encoder_attention_mask=None,
-        labels=None,
-        past_key_values=None,
-        use_cache=None,
-        output_attentions=None,
-        output_hidden_states=None,
-        return_dict=None,
-        return_logits=False,
-        is_decoder=True,
-        reduction="mean",
-        mode="multimodal",
-    ):
-        r"""
-        encoder_hidden_states  (:obj:`torch.FloatTensor` of shape :obj:`(batch_size, sequence_length, hidden_size)`, `optional`):
-            Sequence of hidden-states at the output of the last layer of the encoder. Used in the cross-attention if
-            the model is configured as a decoder.
-        encoder_attention_mask (:obj:`torch.FloatTensor` of shape :obj:`(batch_size, sequence_length)`, `optional`):
-            Mask to avoid performing attention on the padding token indices of the encoder input. This mask is used in
-            the cross-attention if the model is configured as a decoder. Mask values selected in ``[0, 1]``:
-            - 1 for tokens that are **not masked**,
-            - 0 for tokens that are **masked**.
-        labels (:obj:`torch.LongTensor` of shape :obj:`(batch_size, sequence_length)`, `optional`):
-            Labels for computing the left-to-right language modeling loss (next word prediction). Indices should be in
-            ``[-100, 0, ..., config.vocab_size]`` (see ``input_ids`` docstring) Tokens with indices set to ``-100`` are
-            ignored (masked), the loss is only computed for the tokens with labels n ``[0, ..., config.vocab_size]``
-        past_key_values (:obj:`tuple(tuple(torch.FloatTensor))` of length :obj:`config.n_layers` with each tuple having 4 tensors of shape :obj:`(batch_size, num_heads, sequence_length - 1, embed_size_per_head)`):
-            Contains precomputed key and value hidden states of the attention blocks. Can be used to speed up decoding.
-            If :obj:`past_key_values` are used, the user can optionally input only the last :obj:`decoder_input_ids`
-            (those that don't have their past key value states given to this model) of shape :obj:`(batch_size, 1)`
-            instead of all :obj:`decoder_input_ids` of shape :obj:`(batch_size, sequence_length)`.
-        use_cache (:obj:`bool`, `optional`):
-            If set to :obj:`True`, :obj:`past_key_values` key value states are returned and can be used to speed up
-            decoding (see :obj:`past_key_values`).
-        Returns:
-        Example::
-            >>> from transformers import BertTokenizer, BertLMHeadModel, BertConfig
-            >>> import torch
-            >>> tokenizer = BertTokenizer.from_pretrained('bert-base-cased')
-            >>> config = BertConfig.from_pretrained("bert-base-cased")
-            >>> model = BertLMHeadModel.from_pretrained('bert-base-cased', config=config)
-            >>> inputs = tokenizer("Hello, my dog is cute", return_tensors="pt")
-            >>> outputs = model(**inputs)
-            >>> prediction_logits = outputs.logits
-        """
-        return_dict = (
-            return_dict if return_dict is not None else self.config.use_return_dict
-        )
-        if labels is not None:
-            use_cache = False
-
-        outputs = self.bert(
-            input_ids,
-            attention_mask=attention_mask,
-            position_ids=position_ids,
-            head_mask=head_mask,
-            inputs_embeds=inputs_embeds,
-            encoder_hidden_states=encoder_hidden_states,
-            encoder_attention_mask=encoder_attention_mask,
-            past_key_values=past_key_values,
-            use_cache=use_cache,
-            output_attentions=output_attentions,
-            output_hidden_states=output_hidden_states,
-            return_dict=return_dict,
-            is_decoder=is_decoder,
-            mode=mode,
-        )
-
-        sequence_output = outputs[0]
-        prediction_scores = self.cls(sequence_output)
-        # sequence_output.shape torch.Size([85, 30, 768])
-        # prediction_scores.shape torch.Size([85, 30, 30524])
-        # labels.shape torch.Size([85, 30])
-
-        if return_logits:
-            return prediction_scores[:, :-1, :].contiguous()
-
-        lm_loss = None
-        if labels is not None:
-            # we are doing next-token prediction; shift prediction scores and input ids by one
-            shifted_prediction_scores = prediction_scores[:, :-1, :].contiguous()
-            labels = labels[:, 1:].contiguous()
-            loss_fct = CrossEntropyLoss(reduction=reduction, label_smoothing=0.1)
-            lm_loss = loss_fct(
-                shifted_prediction_scores.view(-1, self.config.vocab_size),
-                labels.view(-1),
-            )
-            if reduction == "none":
-                lm_loss = lm_loss.view(prediction_scores.size(0), -1).sum(1)
-
-        if not return_dict:
-            output = (prediction_scores,) + outputs[2:]
-            return ((lm_loss,) + output) if lm_loss is not None else output
-
-        return CausalLMOutputWithCrossAttentions(
-            loss=lm_loss,
-            logits=prediction_scores,
-            past_key_values=outputs.past_key_values,
-            hidden_states=outputs.hidden_states,
-            attentions=outputs.attentions,
-            cross_attentions=outputs.cross_attentions,
-        )
-
-    def prepare_inputs_for_generation(
-        self, input_ids, past=None, attention_mask=None, **model_kwargs
-    ):
-        input_shape = input_ids.shape
-        # if model is used as a decoder in encoder-decoder model, the decoder attention mask is created on the fly
-        if attention_mask is None:
-            attention_mask = input_ids.new_ones(input_shape)
-
-        # cut decoder_input_ids if past is used
-        if past is not None:
-            input_ids = input_ids[:, -1:]
-
-        return {
-            "input_ids": input_ids,
-            "attention_mask": attention_mask,
-            "past_key_values": past,
-            "encoder_hidden_states": model_kwargs.get("encoder_hidden_states", None),
-            "encoder_attention_mask": model_kwargs.get("encoder_attention_mask", None),
-            "is_decoder": True,
-        }
-
-    def _reorder_cache(self, past, beam_idx):
-        reordered_past = ()
-        for layer_past in past:
-            reordered_past += (
-                tuple(
-                    past_state.index_select(0, beam_idx) for past_state in layer_past
-                ),
-            )
-        return reordered_past
diff --git a/eval/vbench/third_party/tag2Text/med_config.json b/eval/vbench/third_party/tag2Text/med_config.json
deleted file mode 100644
index 391d5ca7..00000000
--- a/eval/vbench/third_party/tag2Text/med_config.json
+++ /dev/null
@@ -1,21 +0,0 @@
-{
-  "architectures": [
-    "BertModel"
-  ],
-  "attention_probs_dropout_prob": 0.1,
-  "hidden_act": "gelu",
-  "hidden_dropout_prob": 0.1,
-  "hidden_size": 768,
-  "initializer_range": 0.02,
-  "intermediate_size": 3072,
-  "layer_norm_eps": 1e-12,
-  "max_position_embeddings": 512,
-  "model_type": "bert",
-  "num_attention_heads": 12,
-  "num_hidden_layers": 12,
-  "pad_token_id": 0,
-  "type_vocab_size": 2,
-  "vocab_size": 30524,
-  "encoder_width": 768,
-  "add_cross_attention": true
-}
diff --git a/eval/vbench/third_party/tag2Text/q2l_config.json b/eval/vbench/third_party/tag2Text/q2l_config.json
deleted file mode 100644
index 1b7443c8..00000000
--- a/eval/vbench/third_party/tag2Text/q2l_config.json
+++ /dev/null
@@ -1,22 +0,0 @@
-{
-    "architectures": [
-      "BertModel"
-    ],
-    "attention_probs_dropout_prob": 0.1,
-    "hidden_act": "gelu",
-    "hidden_dropout_prob": 0.1,
-    "hidden_size": 768,
-    "initializer_range": 0.02,
-    "intermediate_size": 3072,
-    "layer_norm_eps": 1e-12,
-    "max_position_embeddings": 512,
-    "model_type": "bert",
-    "num_attention_heads": 4,
-    "num_hidden_layers": 2,
-    "pad_token_id": 0,
-    "type_vocab_size": 2,
-    "vocab_size": 30522,
-    "encoder_width": 768,
-    "add_cross_attention": true,
-    "add_tag_cross_attention": false
-  }
diff --git a/eval/vbench/third_party/tag2Text/swin_transformer.py b/eval/vbench/third_party/tag2Text/swin_transformer.py
deleted file mode 100644
index c58f772e..00000000
--- a/eval/vbench/third_party/tag2Text/swin_transformer.py
+++ /dev/null
@@ -1,832 +0,0 @@
-# --------------------------------------------------------
-# Swin Transformer
-# Copyright (c) 2021 Microsoft
-# Licensed under The MIT License [see LICENSE for details]
-# Written by Ze Liu
-# --------------------------------------------------------
-
-import numpy as np
-import torch
-import torch.nn as nn
-import torch.utils.checkpoint as checkpoint
-from scipy import interpolate
-from timm.models.layers import DropPath, to_2tuple, trunc_normal_
-
-
-class Mlp(nn.Module):
-    def __init__(
-        self,
-        in_features,
-        hidden_features=None,
-        out_features=None,
-        act_layer=nn.GELU,
-        drop=0.0,
-    ):
-        super().__init__()
-        out_features = out_features or in_features
-        hidden_features = hidden_features or in_features
-        self.fc1 = nn.Linear(in_features, hidden_features)
-        self.act = act_layer()
-        self.fc2 = nn.Linear(hidden_features, out_features)
-        self.drop = nn.Dropout(drop)
-
-    def forward(self, x):
-        x = self.fc1(x)
-        x = self.act(x)
-        x = self.drop(x)
-        x = self.fc2(x)
-        x = self.drop(x)
-        return x
-
-
-def window_partition(x, window_size):
-    """
-    Args:
-        x: (B, H, W, C)
-        window_size (int): window size
-
-    Returns:
-        windows: (num_windows*B, window_size, window_size, C)
-    """
-    B, H, W, C = x.shape
-    x = x.view(B, H // window_size, window_size, W // window_size, window_size, C)
-    windows = (
-        x.permute(0, 1, 3, 2, 4, 5).contiguous().view(-1, window_size, window_size, C)
-    )
-    return windows
-
-
-def window_reverse(windows, window_size, H, W):
-    """
-    Args:
-        windows: (num_windows*B, window_size, window_size, C)
-        window_size (int): Window size
-        H (int): Height of image
-        W (int): Width of image
-
-    Returns:
-        x: (B, H, W, C)
-    """
-    B = int(windows.shape[0] / (H * W / window_size / window_size))
-    x = windows.view(
-        B, H // window_size, W // window_size, window_size, window_size, -1
-    )
-    x = x.permute(0, 1, 3, 2, 4, 5).contiguous().view(B, H, W, -1)
-    return x
-
-
-class WindowAttention(nn.Module):
-    r"""Window based multi-head self attention (W-MSA) module with relative position bias.
-    It supports both of shifted and non-shifted window.
-
-    Args:
-        dim (int): Number of input channels.
-        window_size (tuple[int]): The height and width of the window.
-        num_heads (int): Number of attention heads.
-        qkv_bias (bool, optional):  If True, add a learnable bias to query, key, value. Default: True
-        qk_scale (float | None, optional): Override default qk scale of head_dim ** -0.5 if set
-        attn_drop (float, optional): Dropout ratio of attention weight. Default: 0.0
-        proj_drop (float, optional): Dropout ratio of output. Default: 0.0
-    """
-
-    def __init__(
-        self,
-        dim,
-        window_size,
-        num_heads,
-        qkv_bias=True,
-        qk_scale=None,
-        attn_drop=0.0,
-        proj_drop=0.0,
-    ):
-
-        super().__init__()
-        self.dim = dim
-        self.window_size = window_size  # Wh, Ww
-        self.num_heads = num_heads
-        head_dim = dim // num_heads
-        self.scale = qk_scale or head_dim**-0.5
-
-        # define a parameter table of relative position bias
-        self.relative_position_bias_table = nn.Parameter(
-            torch.zeros((2 * window_size[0] - 1) * (2 * window_size[1] - 1), num_heads)
-        )  # 2*Wh-1 * 2*Ww-1, nH
-
-        # get pair-wise relative position index for each token inside the window
-        coords_h = torch.arange(self.window_size[0])
-        coords_w = torch.arange(self.window_size[1])
-        coords = torch.stack(torch.meshgrid([coords_h, coords_w]))  # 2, Wh, Ww
-        coords_flatten = torch.flatten(coords, 1)  # 2, Wh*Ww
-        relative_coords = (
-            coords_flatten[:, :, None] - coords_flatten[:, None, :]
-        )  # 2, Wh*Ww, Wh*Ww
-        relative_coords = relative_coords.permute(
-            1, 2, 0
-        ).contiguous()  # Wh*Ww, Wh*Ww, 2
-        relative_coords[:, :, 0] += self.window_size[0] - 1  # shift to start from 0
-        relative_coords[:, :, 1] += self.window_size[1] - 1
-        relative_coords[:, :, 0] *= 2 * self.window_size[1] - 1
-        relative_position_index = relative_coords.sum(-1)  # Wh*Ww, Wh*Ww
-        self.register_buffer("relative_position_index", relative_position_index)
-
-        self.qkv = nn.Linear(dim, dim * 3, bias=qkv_bias)
-        self.attn_drop = nn.Dropout(attn_drop)
-        self.proj = nn.Linear(dim, dim)
-        self.proj_drop = nn.Dropout(proj_drop)
-
-        trunc_normal_(self.relative_position_bias_table, std=0.02)
-        self.softmax = nn.Softmax(dim=-1)
-
-    def forward(self, x, mask=None):
-        """
-        Args:
-            x: input features with shape of (num_windows*B, N, C)
-            mask: (0/-inf) mask with shape of (num_windows, Wh*Ww, Wh*Ww) or None
-        """
-        B_, N, C = x.shape
-        qkv = (
-            self.qkv(x)
-            .reshape(B_, N, 3, self.num_heads, C // self.num_heads)
-            .permute(2, 0, 3, 1, 4)
-        )
-        q, k, v = (
-            qkv[0],
-            qkv[1],
-            qkv[2],
-        )  # make torchscript happy (cannot use tensor as tuple)
-
-        q = q * self.scale
-        attn = q @ k.transpose(-2, -1)
-
-        relative_position_bias = self.relative_position_bias_table[
-            self.relative_position_index.view(-1)
-        ].view(
-            self.window_size[0] * self.window_size[1],
-            self.window_size[0] * self.window_size[1],
-            -1,
-        )  # Wh*Ww,Wh*Ww,nH
-        relative_position_bias = relative_position_bias.permute(
-            2, 0, 1
-        ).contiguous()  # nH, Wh*Ww, Wh*Ww
-        attn = attn + relative_position_bias.unsqueeze(0)
-
-        if mask is not None:
-            nW = mask.shape[0]
-            attn = attn.view(B_ // nW, nW, self.num_heads, N, N) + mask.unsqueeze(
-                1
-            ).unsqueeze(0)
-            attn = attn.view(-1, self.num_heads, N, N)
-            attn = self.softmax(attn)
-        else:
-            attn = self.softmax(attn)
-
-        attn = self.attn_drop(attn)
-
-        x = (attn @ v).transpose(1, 2).reshape(B_, N, C)
-        x = self.proj(x)
-        x = self.proj_drop(x)
-        return x
-
-    def extra_repr(self) -> str:
-        return f"dim={self.dim}, window_size={self.window_size}, num_heads={self.num_heads}"
-
-    def flops(self, N):
-        # calculate flops for 1 window with token length of N
-        flops = 0
-        # qkv = self.qkv(x)
-        flops += N * self.dim * 3 * self.dim
-        # attn = (q @ k.transpose(-2, -1))
-        flops += self.num_heads * N * (self.dim // self.num_heads) * N
-        #  x = (attn @ v)
-        flops += self.num_heads * N * N * (self.dim // self.num_heads)
-        # x = self.proj(x)
-        flops += N * self.dim * self.dim
-        return flops
-
-
-class SwinTransformerBlock(nn.Module):
-    r"""Swin Transformer Block.
-
-    Args:
-        dim (int): Number of input channels.
-        input_resolution (tuple[int]): Input resulotion.
-        num_heads (int): Number of attention heads.
-        window_size (int): Window size.
-        shift_size (int): Shift size for SW-MSA.
-        mlp_ratio (float): Ratio of mlp hidden dim to embedding dim.
-        qkv_bias (bool, optional): If True, add a learnable bias to query, key, value. Default: True
-        qk_scale (float | None, optional): Override default qk scale of head_dim ** -0.5 if set.
-        drop (float, optional): Dropout rate. Default: 0.0
-        attn_drop (float, optional): Attention dropout rate. Default: 0.0
-        drop_path (float, optional): Stochastic depth rate. Default: 0.0
-        act_layer (nn.Module, optional): Activation layer. Default: nn.GELU
-        norm_layer (nn.Module, optional): Normalization layer.  Default: nn.LayerNorm
-    """
-
-    def __init__(
-        self,
-        dim,
-        input_resolution,
-        num_heads,
-        window_size=7,
-        shift_size=0,
-        mlp_ratio=4.0,
-        qkv_bias=True,
-        qk_scale=None,
-        drop=0.0,
-        attn_drop=0.0,
-        drop_path=0.0,
-        act_layer=nn.GELU,
-        norm_layer=nn.LayerNorm,
-    ):
-        super().__init__()
-        self.dim = dim
-        self.input_resolution = input_resolution
-        self.num_heads = num_heads
-        self.window_size = window_size
-        self.shift_size = shift_size
-        self.mlp_ratio = mlp_ratio
-        if min(self.input_resolution) <= self.window_size:
-            # if window size is larger than input resolution, we don't partition windows
-            self.shift_size = 0
-            self.window_size = min(self.input_resolution)
-        assert (
-            0 <= self.shift_size < self.window_size
-        ), "shift_size must in 0-window_size"
-
-        self.norm1 = norm_layer(dim)
-        self.attn = WindowAttention(
-            dim,
-            window_size=to_2tuple(self.window_size),
-            num_heads=num_heads,
-            qkv_bias=qkv_bias,
-            qk_scale=qk_scale,
-            attn_drop=attn_drop,
-            proj_drop=drop,
-        )
-
-        self.drop_path = DropPath(drop_path) if drop_path > 0.0 else nn.Identity()
-        self.norm2 = norm_layer(dim)
-        mlp_hidden_dim = int(dim * mlp_ratio)
-        self.mlp = Mlp(
-            in_features=dim,
-            hidden_features=mlp_hidden_dim,
-            act_layer=act_layer,
-            drop=drop,
-        )
-
-        if self.shift_size > 0:
-            # calculate attention mask for SW-MSA
-            H, W = self.input_resolution
-            img_mask = torch.zeros((1, H, W, 1))  # 1 H W 1
-            h_slices = (
-                slice(0, -self.window_size),
-                slice(-self.window_size, -self.shift_size),
-                slice(-self.shift_size, None),
-            )
-            w_slices = (
-                slice(0, -self.window_size),
-                slice(-self.window_size, -self.shift_size),
-                slice(-self.shift_size, None),
-            )
-            cnt = 0
-            for h in h_slices:
-                for w in w_slices:
-                    img_mask[:, h, w, :] = cnt
-                    cnt += 1
-
-            mask_windows = window_partition(
-                img_mask, self.window_size
-            )  # nW, window_size, window_size, 1
-            mask_windows = mask_windows.view(-1, self.window_size * self.window_size)
-            attn_mask = mask_windows.unsqueeze(1) - mask_windows.unsqueeze(2)
-            attn_mask = attn_mask.masked_fill(
-                attn_mask != 0, float(-100.0)
-            ).masked_fill(attn_mask == 0, float(0.0))
-        else:
-            attn_mask = None
-
-        self.register_buffer("attn_mask", attn_mask)
-
-    def forward(self, x):
-        H, W = self.input_resolution
-        B, L, C = x.shape
-        assert L == H * W, "input feature has wrong size"
-
-        shortcut = x
-        x = self.norm1(x)
-        x = x.view(B, H, W, C)
-
-        # cyclic shift
-        if self.shift_size > 0:
-            shifted_x = torch.roll(
-                x, shifts=(-self.shift_size, -self.shift_size), dims=(1, 2)
-            )
-        else:
-            shifted_x = x
-
-        # partition windows
-        x_windows = window_partition(
-            shifted_x, self.window_size
-        )  # nW*B, window_size, window_size, C
-        x_windows = x_windows.view(
-            -1, self.window_size * self.window_size, C
-        )  # nW*B, window_size*window_size, C
-
-        # W-MSA/SW-MSA
-        attn_windows = self.attn(
-            x_windows, mask=self.attn_mask
-        )  # nW*B, window_size*window_size, C
-
-        # merge windows
-        attn_windows = attn_windows.view(-1, self.window_size, self.window_size, C)
-        shifted_x = window_reverse(attn_windows, self.window_size, H, W)  # B H' W' C
-
-        # reverse cyclic shift
-        if self.shift_size > 0:
-            x = torch.roll(
-                shifted_x, shifts=(self.shift_size, self.shift_size), dims=(1, 2)
-            )
-        else:
-            x = shifted_x
-        x = x.view(B, H * W, C)
-
-        # FFN
-        x = shortcut + self.drop_path(x)
-        x = x + self.drop_path(self.mlp(self.norm2(x)))
-
-        return x
-
-    def extra_repr(self) -> str:
-        return (
-            f"dim={self.dim}, input_resolution={self.input_resolution}, num_heads={self.num_heads}, "
-            f"window_size={self.window_size}, shift_size={self.shift_size}, mlp_ratio={self.mlp_ratio}"
-        )
-
-    def flops(self):
-        flops = 0
-        H, W = self.input_resolution
-        # norm1
-        flops += self.dim * H * W
-        # W-MSA/SW-MSA
-        nW = H * W / self.window_size / self.window_size
-        flops += nW * self.attn.flops(self.window_size * self.window_size)
-        # mlp
-        flops += 2 * H * W * self.dim * self.dim * self.mlp_ratio
-        # norm2
-        flops += self.dim * H * W
-        return flops
-
-
-class PatchMerging(nn.Module):
-    r"""Patch Merging Layer.
-
-    Args:
-        input_resolution (tuple[int]): Resolution of input feature.
-        dim (int): Number of input channels.
-        norm_layer (nn.Module, optional): Normalization layer.  Default: nn.LayerNorm
-    """
-
-    def __init__(self, input_resolution, dim, norm_layer=nn.LayerNorm):
-        super().__init__()
-        self.input_resolution = input_resolution
-        self.dim = dim
-        self.reduction = nn.Linear(4 * dim, 2 * dim, bias=False)
-        self.norm = norm_layer(4 * dim)
-
-    def forward(self, x):
-        """
-        x: B, H*W, C
-        """
-        H, W = self.input_resolution
-        B, L, C = x.shape
-        assert L == H * W, "input feature has wrong size"
-        assert H % 2 == 0 and W % 2 == 0, f"x size ({H}*{W}) are not even."
-
-        x = x.view(B, H, W, C)
-
-        x0 = x[:, 0::2, 0::2, :]  # B H/2 W/2 C
-        x1 = x[:, 1::2, 0::2, :]  # B H/2 W/2 C
-        x2 = x[:, 0::2, 1::2, :]  # B H/2 W/2 C
-        x3 = x[:, 1::2, 1::2, :]  # B H/2 W/2 C
-        x = torch.cat([x0, x1, x2, x3], -1)  # B H/2 W/2 4*C
-        x = x.view(B, -1, 4 * C)  # B H/2*W/2 4*C
-
-        x = self.norm(x)
-        x = self.reduction(x)
-
-        return x
-
-    def extra_repr(self) -> str:
-        return f"input_resolution={self.input_resolution}, dim={self.dim}"
-
-    def flops(self):
-        H, W = self.input_resolution
-        flops = H * W * self.dim
-        flops += (H // 2) * (W // 2) * 4 * self.dim * 2 * self.dim
-        return flops
-
-
-class BasicLayer(nn.Module):
-    """A basic Swin Transformer layer for one stage.
-
-    Args:
-        dim (int): Number of input channels.
-        input_resolution (tuple[int]): Input resolution.
-        depth (int): Number of blocks.
-        num_heads (int): Number of attention heads.
-        window_size (int): Local window size.
-        mlp_ratio (float): Ratio of mlp hidden dim to embedding dim.
-        qkv_bias (bool, optional): If True, add a learnable bias to query, key, value. Default: True
-        qk_scale (float | None, optional): Override default qk scale of head_dim ** -0.5 if set.
-        drop (float, optional): Dropout rate. Default: 0.0
-        attn_drop (float, optional): Attention dropout rate. Default: 0.0
-        drop_path (float | tuple[float], optional): Stochastic depth rate. Default: 0.0
-        norm_layer (nn.Module, optional): Normalization layer. Default: nn.LayerNorm
-        downsample (nn.Module | None, optional): Downsample layer at the end of the layer. Default: None
-        use_checkpoint (bool): Whether to use checkpointing to save memory. Default: False.
-    """
-
-    def __init__(
-        self,
-        dim,
-        input_resolution,
-        depth,
-        num_heads,
-        window_size,
-        mlp_ratio=4.0,
-        qkv_bias=True,
-        qk_scale=None,
-        drop=0.0,
-        attn_drop=0.0,
-        drop_path=0.0,
-        norm_layer=nn.LayerNorm,
-        downsample=None,
-        use_checkpoint=False,
-    ):
-
-        super().__init__()
-        self.dim = dim
-        self.input_resolution = input_resolution
-        self.depth = depth
-        self.use_checkpoint = use_checkpoint
-
-        # build blocks
-        self.blocks = nn.ModuleList(
-            [
-                SwinTransformerBlock(
-                    dim=dim,
-                    input_resolution=input_resolution,
-                    num_heads=num_heads,
-                    window_size=window_size,
-                    shift_size=0 if (i % 2 == 0) else window_size // 2,
-                    mlp_ratio=mlp_ratio,
-                    qkv_bias=qkv_bias,
-                    qk_scale=qk_scale,
-                    drop=drop,
-                    attn_drop=attn_drop,
-                    drop_path=(
-                        drop_path[i] if isinstance(drop_path, list) else drop_path
-                    ),
-                    norm_layer=norm_layer,
-                )
-                for i in range(depth)
-            ]
-        )
-
-        # patch merging layer
-        if downsample is not None:
-            self.downsample = downsample(
-                input_resolution, dim=dim, norm_layer=norm_layer
-            )
-        else:
-            self.downsample = None
-
-    def forward(self, x):
-        for blk in self.blocks:
-            if self.use_checkpoint:
-                x = checkpoint.checkpoint(blk, x)
-            else:
-                x = blk(x)
-        if self.downsample is not None:
-            x = self.downsample(x)
-        return x
-
-    def extra_repr(self) -> str:
-        return f"dim={self.dim}, input_resolution={self.input_resolution}, depth={self.depth}"
-
-    def flops(self):
-        flops = 0
-        for blk in self.blocks:
-            flops += blk.flops()
-        if self.downsample is not None:
-            flops += self.downsample.flops()
-        return flops
-
-
-class PatchEmbed(nn.Module):
-    r"""Image to Patch Embedding
-
-    Args:
-        img_size (int): Image size.  Default: 224.
-        patch_size (int): Patch token size. Default: 4.
-        in_chans (int): Number of input image channels. Default: 3.
-        embed_dim (int): Number of linear projection output channels. Default: 96.
-        norm_layer (nn.Module, optional): Normalization layer. Default: None
-    """
-
-    def __init__(
-        self, img_size=224, patch_size=4, in_chans=3, embed_dim=96, norm_layer=None
-    ):
-        super().__init__()
-        img_size = to_2tuple(img_size)
-        patch_size = to_2tuple(patch_size)
-        patches_resolution = [
-            img_size[0] // patch_size[0],
-            img_size[1] // patch_size[1],
-        ]
-        self.img_size = img_size
-        self.patch_size = patch_size
-        self.patches_resolution = patches_resolution
-        self.num_patches = patches_resolution[0] * patches_resolution[1]
-
-        self.in_chans = in_chans
-        self.embed_dim = embed_dim
-
-        self.proj = nn.Conv2d(
-            in_chans, embed_dim, kernel_size=patch_size, stride=patch_size
-        )
-        if norm_layer is not None:
-            self.norm = norm_layer(embed_dim)
-        else:
-            self.norm = None
-
-    def forward(self, x):
-        B, C, H, W = x.shape
-        # FIXME look at relaxing size constraints
-        assert (
-            H == self.img_size[0] and W == self.img_size[1]
-        ), f"Input image size ({H}*{W}) doesn't match model ({self.img_size[0]}*{self.img_size[1]})."
-        x = self.proj(x).flatten(2).transpose(1, 2)  # B Ph*Pw C
-        if self.norm is not None:
-            x = self.norm(x)
-        return x
-
-    def flops(self):
-        Ho, Wo = self.patches_resolution
-        flops = (
-            Ho
-            * Wo
-            * self.embed_dim
-            * self.in_chans
-            * (self.patch_size[0] * self.patch_size[1])
-        )
-        if self.norm is not None:
-            flops += Ho * Wo * self.embed_dim
-        return flops
-
-
-class SwinTransformer(nn.Module):
-    r"""Swin Transformer
-        A PyTorch impl of : `Swin Transformer: Hierarchical Vision Transformer using Shifted Windows`  -
-          https://arxiv.org/pdf/2103.14030
-
-    Args:
-        img_size (int | tuple(int)): Input image size. Default 224
-        patch_size (int | tuple(int)): Patch size. Default: 4
-        in_chans (int): Number of input image channels. Default: 3
-        num_classes (int): Number of classes for classification head. Default: 1000
-        embed_dim (int): Patch embedding dimension. Default: 96
-        depths (tuple(int)): Depth of each Swin Transformer layer.
-        num_heads (tuple(int)): Number of attention heads in different layers.
-        window_size (int): Window size. Default: 7
-        mlp_ratio (float): Ratio of mlp hidden dim to embedding dim. Default: 4
-        qkv_bias (bool): If True, add a learnable bias to query, key, value. Default: True
-        qk_scale (float): Override default qk scale of head_dim ** -0.5 if set. Default: None
-        drop_rate (float): Dropout rate. Default: 0
-        attn_drop_rate (float): Attention dropout rate. Default: 0
-        drop_path_rate (float): Stochastic depth rate. Default: 0.1
-        norm_layer (nn.Module): Normalization layer. Default: nn.LayerNorm.
-        ape (bool): If True, add absolute position embedding to the patch embedding. Default: False
-        patch_norm (bool): If True, add normalization after patch embedding. Default: True
-        use_checkpoint (bool): Whether to use checkpointing to save memory. Default: False
-    """
-
-    def __init__(
-        self,
-        img_size=224,
-        patch_size=4,
-        in_chans=3,
-        num_classes=1000,
-        embed_dim=96,
-        depths=[2, 2, 6, 2],
-        num_heads=[3, 6, 12, 24],
-        window_size=7,
-        mlp_ratio=4.0,
-        qkv_bias=True,
-        qk_scale=None,
-        drop_rate=0.0,
-        attn_drop_rate=0.0,
-        drop_path_rate=0.1,
-        norm_layer=nn.LayerNorm,
-        ape=False,
-        patch_norm=True,
-        use_checkpoint=False,
-        **kwargs,
-    ):
-        super().__init__()
-
-        self.num_classes = num_classes
-        self.num_layers = len(depths)
-        self.embed_dim = embed_dim
-        self.ape = ape
-        self.patch_norm = patch_norm
-        self.num_features = int(embed_dim * 2 ** (self.num_layers - 1))
-        self.mlp_ratio = mlp_ratio
-
-        # split image into non-overlapping patches
-        self.patch_embed = PatchEmbed(
-            img_size=img_size,
-            patch_size=patch_size,
-            in_chans=in_chans,
-            embed_dim=embed_dim,
-            norm_layer=norm_layer if self.patch_norm else None,
-        )
-        num_patches = self.patch_embed.num_patches
-        patches_resolution = self.patch_embed.patches_resolution
-        self.patches_resolution = patches_resolution
-
-        # absolute position embedding
-        if self.ape:
-            self.absolute_pos_embed = nn.Parameter(
-                torch.zeros(1, num_patches, embed_dim)
-            )
-            trunc_normal_(self.absolute_pos_embed, std=0.02)
-
-        self.pos_drop = nn.Dropout(p=drop_rate)
-
-        # stochastic depth
-        dpr = [
-            x.item() for x in torch.linspace(0, drop_path_rate, sum(depths))
-        ]  # stochastic depth decay rule
-
-        # build layers
-        self.layers = nn.ModuleList()
-        for i_layer in range(self.num_layers):
-            layer = BasicLayer(
-                dim=int(embed_dim * 2**i_layer),
-                input_resolution=(
-                    patches_resolution[0] // (2**i_layer),
-                    patches_resolution[1] // (2**i_layer),
-                ),
-                depth=depths[i_layer],
-                num_heads=num_heads[i_layer],
-                window_size=window_size,
-                mlp_ratio=self.mlp_ratio,
-                qkv_bias=qkv_bias,
-                qk_scale=qk_scale,
-                drop=drop_rate,
-                attn_drop=attn_drop_rate,
-                drop_path=dpr[sum(depths[:i_layer]) : sum(depths[: i_layer + 1])],
-                norm_layer=norm_layer,
-                downsample=PatchMerging if (i_layer < self.num_layers - 1) else None,
-                use_checkpoint=use_checkpoint,
-            )
-            self.layers.append(layer)
-
-        self.norm = norm_layer(self.num_features)
-        self.avgpool = nn.AdaptiveAvgPool1d(1)
-        # self.head = nn.Linear(self.num_features, num_classes) if num_classes > 0 else nn.Identity()
-
-        self.apply(self._init_weights)
-
-    def _init_weights(self, m):
-        if isinstance(m, nn.Linear):
-            trunc_normal_(m.weight, std=0.02)
-            if isinstance(m, nn.Linear) and m.bias is not None:
-                nn.init.constant_(m.bias, 0)
-        elif isinstance(m, nn.LayerNorm):
-            nn.init.constant_(m.bias, 0)
-            nn.init.constant_(m.weight, 1.0)
-
-    @torch.jit.ignore
-    def no_weight_decay(self):
-        return {"absolute_pos_embed"}
-
-    @torch.jit.ignore
-    def no_weight_decay_keywords(self):
-        return {"relative_position_bias_table"}
-
-    def forward(self, x, idx_to_group_img=None, image_atts=None, **kwargs):
-        x = self.patch_embed(x)
-        if self.ape:
-            x = x + self.absolute_pos_embed
-        x = self.pos_drop(x)
-
-        for layer in self.layers:
-            x = layer(x)
-
-        x = self.norm(x)  # B L C
-
-        x_cls = self.avgpool(x.transpose(1, 2))  # B C 1
-
-        if idx_to_group_img is None:
-            return torch.cat([x_cls.transpose(1, 2), x], dim=1)
-        else:
-            x_bs = torch.gather(
-                x,
-                dim=0,
-                index=idx_to_group_img.view(-1, 1, 1).expand(
-                    -1, x.shape[1], x.shape[2]
-                ),
-            )
-            weights = image_atts[:, 1:].unsqueeze(2)  # B L 1
-            x_bs_cls = torch.sum(
-                (weights * x_bs).transpose(1, 2), dim=-1, keepdim=True
-            )  # B C 1
-            x_bs_cls = x_bs_cls / torch.sum(
-                weights.transpose(1, 2), dim=-1, keepdim=True
-            )  # avgpool
-
-            return torch.cat([x_bs_cls.transpose(1, 2), x_bs], dim=1), torch.cat(
-                [x_cls.transpose(1, 2), x], dim=1
-            )
-
-    def flops(self):
-        flops = 0
-        flops += self.patch_embed.flops()
-        for i, layer in enumerate(self.layers):
-            flops += layer.flops()
-        flops += (
-            self.num_features
-            * self.patches_resolution[0]
-            * self.patches_resolution[1]
-            // (2**self.num_layers)
-        )
-        flops += self.num_features * self.num_classes
-        return flops
-
-
-def interpolate_relative_pos_embed(rel_pos_bias, dst_num_pos, param_name=""):
-    # from: https://github.com/microsoft/unilm/blob/8a0a1c1f4e7326938ea7580a00d56d7f17d65612/beit/run_class_finetuning.py#L348
-
-    # rel_pos_bias: relative_position_bias_table
-    src_num_pos, num_attn_heads = rel_pos_bias.size()
-
-    num_extra_tokens = 0
-    src_size = int((src_num_pos - num_extra_tokens) ** 0.5)
-    dst_size = int((dst_num_pos - num_extra_tokens) ** 0.5)
-    if src_size != dst_size:
-        print(
-            "Position interpolate %s from %dx%d to %dx%d"
-            % (param_name, src_size, src_size, dst_size, dst_size)
-        )
-
-        # extra_tokens = rel_pos_bias[-num_extra_tokens:, :]
-        # rel_pos_bias = rel_pos_bias[:-num_extra_tokens, :]
-
-        def geometric_progression(a, r, n):
-            return a * (1.0 - r**n) / (1.0 - r)
-
-        left, right = 1.01, 1.5
-        while right - left > 1e-6:
-            q = (left + right) / 2.0
-            gp = geometric_progression(1, q, src_size // 2)
-            if gp > dst_size // 2:
-                right = q
-            else:
-                left = q
-
-        # if q > 1.090307:
-        #     q = 1.090307
-
-        dis = []
-        cur = 1
-        for i in range(src_size // 2):
-            dis.append(cur)
-            cur += q ** (i + 1)
-
-        r_ids = [-_ for _ in reversed(dis)]
-
-        x = r_ids + [0] + dis
-        y = r_ids + [0] + dis
-
-        t = dst_size // 2.0
-        dx = np.arange(-t, t + 0.1, 1.0)
-        dy = np.arange(-t, t + 0.1, 1.0)
-
-        # print("Original positions = %s" % str(x))
-        # print("Target positions = %s" % str(dx))
-
-        all_rel_pos_bias = []
-
-        for i in range(num_attn_heads):
-            z = rel_pos_bias[:, i].view(src_size, src_size).float().numpy()
-            f = interpolate.interp2d(x, y, z, kind="cubic")
-            all_rel_pos_bias.append(
-                torch.Tensor(f(dx, dy)).contiguous().view(-1, 1).to(rel_pos_bias.device)
-            )
-
-        rel_pos_bias = torch.cat(all_rel_pos_bias, dim=-1)
-
-    return rel_pos_bias
diff --git a/eval/vbench/third_party/tag2Text/tag2text.py b/eval/vbench/third_party/tag2Text/tag2text.py
deleted file mode 100644
index 6456a901..00000000
--- a/eval/vbench/third_party/tag2Text/tag2text.py
+++ /dev/null
@@ -1,507 +0,0 @@
-"""
- * Tag2Text
- * Written by Xinyu Huang
-"""
-
-import warnings
-
-warnings.filterwarnings("ignore")
-
-import os
-
-import torch
-import torch.nn.functional as F
-from torch import nn
-from transformers import BertTokenizer
-
-from .med import BertConfig, BertLMHeadModel, BertModel
-from .swin_transformer import SwinTransformer, interpolate_relative_pos_embed
-from .vit import VisionTransformer, interpolate_pos_embed
-
-CUR_DIR = os.path.dirname(os.path.abspath(__file__))
-import json
-import math
-from urllib.parse import urlparse
-
-import numpy as np
-from timm.models.hub import download_cached_file
-
-from .tag_class import tra_array
-
-
-def read_json(rpath):
-    with open(rpath, "r") as f:
-        return json.load(f)
-
-
-delete_tag_index = [127, 3351, 3265, 3338, 3355, 3359]
-
-
-class Tag2Text_Caption(nn.Module):
-    def __init__(
-        self,
-        med_config=f"{CUR_DIR}/med_config.json",
-        image_size=384,
-        vit="base",
-        vit_grad_ckpt=False,
-        vit_ckpt_layer=0,
-        prompt="a picture of ",
-        threshold=0.7,
-    ):
-        """
-        Args:
-            med_config (str): path for the mixture of encoder-decoder model's configuration file
-            image_size (int): input image size
-            vit (str): model size of vision transformer
-        """
-        super().__init__()
-
-        if vit == "swin_b":
-            if image_size == 224:
-                vision_config_path = "configs/swin/config_swinB_224.json"
-            elif image_size == 384:
-                vision_config_path = f"{CUR_DIR}/config_swinB_384.json"
-            vision_config = read_json(vision_config_path)
-            assert image_size == vision_config["image_res"]
-
-            vision_width = vision_config["vision_width"]
-
-            self.visual_encoder = SwinTransformer(
-                img_size=vision_config["image_res"],
-                patch_size=4,
-                in_chans=3,
-                embed_dim=vision_config["embed_dim"],
-                depths=vision_config["depths"],
-                num_heads=vision_config["num_heads"],
-                window_size=vision_config["window_size"],
-                mlp_ratio=4.0,
-                qkv_bias=True,
-                drop_rate=0.0,
-                drop_path_rate=0.1,
-                ape=False,
-                patch_norm=True,
-                use_checkpoint=False,
-            )
-
-        else:
-            self.visual_encoder, vision_width = create_vit(
-                vit, image_size, vit_grad_ckpt, vit_ckpt_layer
-            )
-
-        self.tokenizer = init_tokenizer()
-
-        # create the decoder
-        decoder_config = BertConfig.from_json_file(med_config)
-        decoder_config.encoder_width = 768
-        self.text_decoder = BertLMHeadModel(config=decoder_config)
-
-        # create encoder
-        encoder_config = BertConfig.from_json_file(med_config)
-        encoder_config.encoder_width = vision_width
-        self.tag_encoder = BertModel(config=encoder_config, add_pooling_layer=False)
-
-        self.prompt = prompt
-        self.prompt_length = len(self.tokenizer(self.prompt).input_ids) - 1
-
-        self.threshold = threshold
-        num_features = 768
-        self.num_class = 3429
-
-        q2l_config = BertConfig.from_json_file(f"{CUR_DIR}/q2l_config.json")
-        q2l_config.encoder_width = vision_width
-        self.vision_multi = BertModel.from_pretrained(
-            "bert-base-uncased", config=q2l_config, add_pooling_layer=False
-        )
-        self.vision_multi.resize_token_embeddings(len(self.tokenizer))
-        self.label_embed = nn.Embedding(self.num_class, q2l_config.hidden_size)
-        self.fc = GroupWiseLinear(self.num_class, num_features, bias=True)
-        self.del_selfattention()
-
-        tie_encoder_decoder_weights(self.tag_encoder, self.vision_multi, "", " ")
-        self.tag_array = tra_array
-
-    def del_selfattention(self):
-        del self.vision_multi.embeddings
-        for layer in self.vision_multi.encoder.layer:
-            del layer.attention
-
-    def generate(
-        self,
-        image,
-        sample=False,
-        num_beams=3,
-        max_length=30,
-        min_length=10,
-        top_p=0.9,
-        repetition_penalty=1.0,
-        tag_input=None,
-        return_tag_predict=False,
-    ):
-        image_embeds = self.visual_encoder(image)
-        image_atts = torch.ones(image_embeds.size()[:-1], dtype=torch.long).to(
-            image.device
-        )
-
-        # ==============generate tag==============#
-        if tag_input == None:
-            image_spatial_embeds = image_embeds[:, 1:, :]
-            image_cls_embeds = image_embeds[:, 0, :]
-
-            bs = image_spatial_embeds.shape[0]
-            label_embed = self.label_embed.weight.unsqueeze(0).repeat(bs, 1, 1)
-            mlr_tagembedding = self.vision_multi(
-                encoder_embeds=label_embed,
-                encoder_hidden_states=image_embeds,
-                encoder_attention_mask=image_atts,
-                return_dict=False,
-                mode="mlr",
-            )
-
-            logits = self.fc(mlr_tagembedding[0])
-
-            targets = torch.where(
-                torch.sigmoid(logits) > self.threshold,
-                torch.tensor(1.0).to(image.device),
-                torch.zeros(self.num_class).to(image.device),
-            )
-
-            tag = targets.cpu().numpy()
-            tag[:, delete_tag_index] = 0
-            bs = image.size(0)
-            tag_input = []
-            for b in range(bs):
-                index = np.argwhere(tag[b] == 1)
-                token = self.tag_array[index].squeeze(axis=1)
-                tag_input.append(" | ".join(token))
-        # ========================================#
-
-        if not sample:
-            image_embeds = image_embeds.repeat_interleave(num_beams, dim=0)
-            image_atts = image_atts.repeat_interleave(num_beams, dim=0)
-            tag_input_temp = []
-            for tag in tag_input:
-                for i in range(num_beams):
-                    tag_input_temp.append(tag)
-            tag_input = tag_input_temp
-
-        tag_input_tokenzier = self.tokenizer(
-            tag_input,
-            padding="max_length",
-            truncation=True,
-            max_length=40,
-            return_tensors="pt",
-        ).to(image.device)
-
-        encoder_input_ids = tag_input_tokenzier.input_ids
-        encoder_input_ids[:, 0] = self.tokenizer.enc_token_id
-        # print(encoder_input_ids.size(), tag_input_tokenzier.attention_mask.size(),image_embeds.size(),  image_atts.size())
-        # import pdb
-        # pdb.set_trace()
-        output_tagembedding = self.tag_encoder(
-            encoder_input_ids,
-            attention_mask=tag_input_tokenzier.attention_mask,
-            encoder_hidden_states=image_embeds,
-            encoder_attention_mask=image_atts,
-            return_dict=True,
-        )
-
-        prompt = [self.prompt] * image.size(0)
-        input_ids = self.tokenizer(prompt, return_tensors="pt").input_ids.to(
-            image.device
-        )
-        input_ids[:, 0] = self.tokenizer.bos_token_id
-        input_ids = input_ids[:, :-1]
-
-        if sample:
-            # nucleus sampling
-            model_kwargs = {
-                "encoder_hidden_states": output_tagembedding.last_hidden_state,
-                "encoder_attention_mask": None,
-            }
-            outputs = self.text_decoder.generate(
-                input_ids=input_ids,
-                max_length=max_length,
-                min_length=min_length,
-                do_sample=True,
-                top_p=top_p,
-                num_return_sequences=1,
-                eos_token_id=self.tokenizer.sep_token_id,
-                pad_token_id=self.tokenizer.pad_token_id,
-                repetition_penalty=1.1,
-                **model_kwargs,
-            )
-        else:
-            # beam search
-            model_kwargs = {
-                "encoder_hidden_states": output_tagembedding.last_hidden_state,
-                "encoder_attention_mask": None,
-            }
-            outputs = self.text_decoder.generate(
-                input_ids=input_ids,
-                max_length=max_length,
-                min_length=min_length,
-                num_beams=num_beams,
-                eos_token_id=self.tokenizer.sep_token_id,
-                pad_token_id=self.tokenizer.pad_token_id,
-                repetition_penalty=repetition_penalty,
-                **model_kwargs,
-            )
-
-        captions = []
-        for output in outputs:
-            caption = self.tokenizer.decode(output, skip_special_tokens=True)
-            captions.append(caption[len(self.prompt) :])
-        if return_tag_predict == True:
-            if sample:
-                return captions, tag_input
-            else:
-                return captions, tag_input[0 : int(len(tag_input) / num_beams)]
-        return captions
-
-
-def tag2text_caption(pretrained="", **kwargs):
-    model = Tag2Text_Caption(**kwargs)
-    if pretrained:
-        if kwargs["vit"] == "swin_b":
-            model, msg = load_checkpoint_swinbase(model, pretrained, kwargs)
-        else:
-            model, msg = load_checkpoint(model, pretrained)
-        # print('vit:',kwargs['vit'])
-        # print('msg_v2',msg)
-    return model
-
-
-from typing import List
-
-
-def tie_encoder_decoder_weights(
-    encoder: nn.Module, decoder: nn.Module, base_model_prefix: str, skip_key: str
-):
-    uninitialized_encoder_weights: List[str] = []
-    if decoder.__class__ != encoder.__class__:
-        logger.info(
-            f"{decoder.__class__} and {encoder.__class__} are not equal. In this case make sure that all encoder weights are correctly initialized."
-        )
-
-    def tie_encoder_to_decoder_recursively(
-        decoder_pointer: nn.Module,
-        encoder_pointer: nn.Module,
-        module_name: str,
-        uninitialized_encoder_weights: List[str],
-        skip_key: str,
-        depth=0,
-    ):
-        assert isinstance(decoder_pointer, nn.Module) and isinstance(
-            encoder_pointer, nn.Module
-        ), f"{decoder_pointer} and {encoder_pointer} have to be of type torch.nn.Module"
-        if hasattr(decoder_pointer, "weight") and skip_key not in module_name:
-            assert hasattr(encoder_pointer, "weight")
-            encoder_pointer.weight = decoder_pointer.weight
-            if hasattr(decoder_pointer, "bias"):
-                assert hasattr(encoder_pointer, "bias")
-                encoder_pointer.bias = decoder_pointer.bias
-            # print(module_name+' is tied')
-            return
-
-        encoder_modules = encoder_pointer._modules
-        decoder_modules = decoder_pointer._modules
-        if len(decoder_modules) > 0:
-            assert (
-                len(encoder_modules) > 0
-            ), f"Encoder module {encoder_pointer} does not match decoder module {decoder_pointer}"
-
-            all_encoder_weights = set(
-                [module_name + "/" + sub_name for sub_name in encoder_modules.keys()]
-            )
-            encoder_layer_pos = 0
-            for name, module in decoder_modules.items():
-                if name.isdigit():
-                    encoder_name = str(int(name) + encoder_layer_pos)
-                    decoder_name = name
-                    if not isinstance(
-                        decoder_modules[decoder_name],
-                        type(encoder_modules[encoder_name]),
-                    ) and len(encoder_modules) != len(decoder_modules):
-                        # this can happen if the name corresponds to the position in a list module list of layers
-                        # in this case the decoder has added a cross-attention that the encoder does not have
-                        # thus skip this step and subtract one layer pos from encoder
-                        encoder_layer_pos -= 1
-                        continue
-                elif name not in encoder_modules:
-                    continue
-                elif depth > 500:
-                    raise ValueError(
-                        "Max depth of recursive function `tie_encoder_to_decoder` reached. It seems that there is a circular dependency between two or more `nn.Modules` of your model."
-                    )
-                else:
-                    decoder_name = encoder_name = name
-                tie_encoder_to_decoder_recursively(
-                    decoder_modules[decoder_name],
-                    encoder_modules[encoder_name],
-                    module_name + "/" + name,
-                    uninitialized_encoder_weights,
-                    skip_key,
-                    depth=depth + 1,
-                )
-                all_encoder_weights.remove(module_name + "/" + encoder_name)
-
-            uninitialized_encoder_weights += list(all_encoder_weights)
-
-    # tie weights recursively
-    tie_encoder_to_decoder_recursively(
-        decoder, encoder, base_model_prefix, uninitialized_encoder_weights, skip_key
-    )
-
-
-class GroupWiseLinear(nn.Module):
-    # could be changed to:
-    # output = torch.einsum('ijk,zjk->ij', x, self.W)
-    # or output = torch.einsum('ijk,jk->ij', x, self.W[0])
-    def __init__(self, num_class, hidden_dim, bias=True):
-        super().__init__()
-        self.num_class = num_class
-        self.hidden_dim = hidden_dim
-        self.bias = bias
-
-        self.W = nn.Parameter(torch.Tensor(1, num_class, hidden_dim))
-        if bias:
-            self.b = nn.Parameter(torch.Tensor(1, num_class))
-        self.reset_parameters()
-
-    def reset_parameters(self):
-        stdv = 1.0 / math.sqrt(self.W.size(2))
-        for i in range(self.num_class):
-            self.W[0][i].data.uniform_(-stdv, stdv)
-        if self.bias:
-            for i in range(self.num_class):
-                self.b[0][i].data.uniform_(-stdv, stdv)
-
-    def forward(self, x):
-        # x: B,K,d
-        x = (self.W * x).sum(-1)
-        if self.bias:
-            x = x + self.b
-        return x
-
-
-def init_tokenizer():
-    tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
-    tokenizer.add_special_tokens({"bos_token": "[DEC]"})
-    tokenizer.add_special_tokens({"additional_special_tokens": ["[ENC]"]})
-    tokenizer.enc_token_id = tokenizer.additional_special_tokens_ids[0]
-    return tokenizer
-
-
-def create_vit(
-    vit, image_size, use_grad_checkpointing=False, ckpt_layer=0, drop_path_rate=0
-):
-
-    assert vit in ["base", "large"], "vit parameter must be base or large"
-    if vit == "base":
-        vision_width = 768
-        visual_encoder = VisionTransformer(
-            img_size=image_size,
-            patch_size=16,
-            embed_dim=vision_width,
-            depth=12,
-            num_heads=12,
-            use_grad_checkpointing=use_grad_checkpointing,
-            ckpt_layer=ckpt_layer,
-            drop_path_rate=0 or drop_path_rate,
-        )
-    elif vit == "large":
-        vision_width = 1024
-        visual_encoder = VisionTransformer(
-            img_size=image_size,
-            patch_size=16,
-            embed_dim=vision_width,
-            depth=24,
-            num_heads=16,
-            use_grad_checkpointing=use_grad_checkpointing,
-            ckpt_layer=ckpt_layer,
-            drop_path_rate=0.1 or drop_path_rate,
-        )
-    return visual_encoder, vision_width
-
-
-def is_url(url_or_filename):
-    parsed = urlparse(url_or_filename)
-    return parsed.scheme in ("http", "https")
-
-
-def load_checkpoint(model, url_or_filename):
-    if is_url(url_or_filename):
-        cached_file = download_cached_file(
-            url_or_filename, check_hash=False, progress=True
-        )
-        checkpoint = torch.load(cached_file, map_location="cpu")
-    elif os.path.isfile(url_or_filename):
-        checkpoint = torch.load(url_or_filename, map_location="cpu")
-    else:
-        raise RuntimeError("checkpoint url or path is invalid")
-
-    state_dict = checkpoint["model"]
-
-    state_dict["visual_encoder.pos_embed"] = interpolate_pos_embed(
-        state_dict["visual_encoder.pos_embed"], model.visual_encoder
-    )
-    if "visual_encoder_m.pos_embed" in model.state_dict().keys():
-        state_dict["visual_encoder_m.pos_embed"] = interpolate_pos_embed(
-            state_dict["visual_encoder_m.pos_embed"], model.visual_encoder_m
-        )
-    for key in model.state_dict().keys():
-        if key in state_dict.keys():
-            if state_dict[key].shape != model.state_dict()[key].shape:
-                del state_dict[key]
-
-    msg = model.load_state_dict(state_dict, strict=False)
-    # print('load checkpoint from %s'%url_or_filename)
-    return model, msg
-
-
-def load_checkpoint_swinbase(model, url_or_filename, kwargs):
-    if kwargs["image_size"] == 224:
-        vision_config_path = "configs/swin/config_swinB_224.json"
-    elif kwargs["image_size"] == 384:
-        vision_config_path = f"{CUR_DIR}/config_swinB_384.json"
-    elif kwargs["image_size"] == 480:
-        vision_config_path = "configs/swin/config_swinB_480.json"
-    elif kwargs["image_size"] == 576:
-        vision_config_path = "configs/swin/config_swinB_576.json"
-    elif kwargs["image_size"] == 608:
-        vision_config_path = "configs/swin/config_swinB_608.json"
-    window_size = read_json(vision_config_path)["window_size"]
-    # print('--------------')
-    # print(url_or_filename)
-    # print('--------------')
-    if is_url(url_or_filename):
-        cached_file = download_cached_file(
-            url_or_filename, check_hash=False, progress=True
-        )
-        checkpoint = torch.load(cached_file, map_location="cpu")
-    elif os.path.isfile(url_or_filename):
-        checkpoint = torch.load(url_or_filename, map_location="cpu")
-    else:
-        raise RuntimeError("checkpoint url or path is invalid")
-
-    state_dict = checkpoint["model"]
-
-    for k in list(state_dict.keys()):
-        if "relative_position_bias_table" in k:
-            dst_num_pos = (2 * window_size - 1) ** 2
-            state_dict[k] = interpolate_relative_pos_embed(
-                state_dict[k], dst_num_pos, param_name=k
-            )
-        elif ("relative_position_index" in k) or ("attn_mask" in k):
-            del state_dict[k]
-
-    msg = model.load_state_dict(state_dict, strict=False)
-    print("load checkpoint from %s" % url_or_filename)
-    return model, msg
-
-
-if __name__ == "__main__":
-    model = Tag2Text_Caption()
-    import pdb
-
-    pdb.set_trace()
diff --git a/eval/vbench/third_party/tag2Text/tag_class.py b/eval/vbench/third_party/tag2Text/tag_class.py
deleted file mode 100644
index e1b3ac18..00000000
--- a/eval/vbench/third_party/tag2Text/tag_class.py
+++ /dev/null
@@ -1,3436 +0,0 @@
-import numpy as np
-
-tra_array = [
-    "tennis",
-    "bear cub",
-    "observatory",
-    "bicycle",
-    "hillside",
-    "judge",
-    "watercolor illustration",
-    "granite",
-    "lobster",
-    "livery",
-    "stone",
-    "ceramic",
-    "ranch",
-    "cloth",
-    "smile",
-    "building",
-    "tattoo",
-    "cricketer",
-    "cheek",
-    "pear",
-    "source",
-    "winter",
-    "surface",
-    "spray",
-    "ceremony",
-    "magic",
-    "curve",
-    "container",
-    "fair",
-    "medicine",
-    "baby",
-    "tennis racquet",
-    "ornament",
-    "bamboo",
-    "duckling",
-    "song",
-    "safari",
-    "team presentation",
-    "daffodil",
-    "cross",
-    "toothpaste",
-    "shield",
-    "fashion model",
-    "capsule",
-    "map",
-    "creek",
-    "glass house",
-    "glass plate",
-    "siding",
-    "corner",
-    "water buffalo",
-    "bison",
-    "figure skater",
-    "diploma",
-    "tire",
-    "race",
-    "cable car",
-    "brain",
-    "gas stove",
-    "soap bubble",
-    "palette",
-    "snowboard",
-    "school child",
-    "trench coat",
-    "monk",
-    "fiber",
-    "kitchen window",
-    "sunglass",
-    "coffee",
-    "security",
-    "strawberry",
-    "penguin",
-    "tree root",
-    "loaf",
-    "engagement ring",
-    "lamb",
-    "vector cartoon illustration",
-    "sandwich",
-    "mountain village",
-    "shape",
-    "charm",
-    "fiction",
-    "knot",
-    "greenhouse",
-    "sushi",
-    "text",
-    "disaster",
-    "trophy",
-    "gang",
-    "strap",
-    "soccer game",
-    "cardinal",
-    "tee",
-    "turtle",
-    "water surface",
-    "grassland",
-    "dolphin",
-    "store",
-    "dirt",
-    "iceberg",
-    "pergola",
-    "farmer market",
-    "publicity portrait",
-    "tote bag",
-    "teenage girl",
-    "view mirror",
-    "session",
-    "commuter",
-    "dressing room",
-    "tricycle",
-    "christmas ball",
-    "headlight",
-    "police",
-    "armchair",
-    "chart",
-    "yacht",
-    "saw",
-    "printer",
-    "rock band",
-    "gingerbread house",
-    "tag",
-    "table lamp",
-    "hockey game",
-    "slope",
-    "font",
-    "wicker basket",
-    "jewelry",
-    "quarter",
-    "software",
-    "weapon",
-    "pin",
-    "worship",
-    "painter",
-    "goal",
-    "morning light",
-    "bike",
-    "baseball bat",
-    "elevator",
-    "cuisine",
-    "sausage",
-    "stunt",
-    "wrestler",
-    "statue",
-    "landing",
-    "pillar",
-    "willow tree",
-    "sea wave",
-    "chicken",
-    "peanut",
-    "muscle",
-    "bob",
-    "tv genre",
-    "bathroom window",
-    "radish",
-    "textile",
-    "pelican",
-    "marketplace",
-    "crest",
-    "elevation map",
-    "gift",
-    "parish",
-    "traffic light",
-    "campfire",
-    "fog",
-    "award winner",
-    "beach ball",
-    "mat",
-    "white house",
-    "plaster",
-    "moped",
-    "football team",
-    "solution",
-    "bicyclist",
-    "bit",
-    "playground",
-    "darkness",
-    "cake",
-    "maple leave",
-    "mold",
-    "cracker",
-    "blueberry",
-    "rubble",
-    "container ship",
-    "pedestrian bridge",
-    "snail",
-    "parrot",
-    "form",
-    "circuit",
-    "highlight",
-    "pickup truck",
-    "koala",
-    "rain",
-    "system",
-    "weather",
-    "raincoat",
-    "soccer team",
-    "windshield",
-    "thunderstorm",
-    "mike",
-    "bird house",
-    "bridge",
-    "grandfather",
-    "restroom",
-    "animation",
-    "wilderness",
-    "clown",
-    "banana",
-    "brown",
-    "braid",
-    "dining room",
-    "kindergarten",
-    "launch event",
-    "purple",
-    "school",
-    "stairwell",
-    "brooch",
-    "movie poster image",
-    "mountain river",
-    "shelf",
-    "wicket",
-    "headboard",
-    "buddha",
-    "flower field",
-    "dugout",
-    "cd",
-    "bald eagle",
-    "lagoon",
-    "seaweed",
-    "agriculture",
-    "emergency service",
-    "maple tree",
-    "parachute",
-    "continent",
-    "amusement park",
-    "remote",
-    "bun",
-    "tackle",
-    "hospital",
-    "garage door",
-    "birthday party",
-    "friendship",
-    "go",
-    "mausoleum",
-    "jeep",
-    "raccoon",
-    "step",
-    "ice hockey team",
-    "cigarette",
-    "lace dress",
-    "forest floor",
-    "mall",
-    "captain",
-    "milk",
-    "golf course",
-    "meal",
-    "picnic table",
-    "sail",
-    "volleyball",
-    "canal",
-    "terrace",
-    "computer desk",
-    "caravan",
-    "hotel",
-    "cheerleader",
-    "nurse",
-    "museum",
-    "marsh",
-    "fox",
-    "plateau",
-    "night",
-    "twin",
-    "letter logo",
-    "autumn tree",
-    "powder",
-    "convention",
-    "creature",
-    "lighthouse",
-    "shop window",
-    "jacket",
-    "stork",
-    "taxi",
-    "trade",
-    "blackboard",
-    "olive",
-    "road sign",
-    "resort",
-    "snowflake",
-    "cemetery",
-    "travel",
-    "evening dress",
-    "picnic",
-    "drink",
-    "winter morning",
-    "football player",
-    "snack",
-    "boxing glove",
-    "dinner party",
-    "airline",
-    "swing",
-    "port",
-    "wheelbarrow",
-    "bathroom sink",
-    "sweater",
-    "ambulance",
-    "gear",
-    "oil",
-    "wii controller",
-    "array",
-    "home office",
-    "car show",
-    "mixture",
-    "profession",
-    "tree frog",
-    "square",
-    "facility",
-    "coral reef",
-    "sea wall",
-    "pizza",
-    "exhibit",
-    "demolition",
-    "trout",
-    "ring",
-    "coffee shop",
-    "bracelet",
-    "bean",
-    "lip",
-    "fencing",
-    "landscape",
-    "sitting",
-    "package",
-    "metal",
-    "bust",
-    "king",
-    "hair",
-    "window seat",
-    "wildlife",
-    "trunk",
-    "greenery",
-    "stencil",
-    "fire hydrant",
-    "bridesmaid",
-    "plaza",
-    "alps",
-    "tower bridge",
-    "crop top",
-    "crossing",
-    "cinema",
-    "pedestrian crossing",
-    "family",
-    "shopping cart",
-    "stomach",
-    "church building",
-    "screen door",
-    "skater",
-    "soccer field",
-    "kettle",
-    "mussel",
-    "raindrop",
-    "candy cane",
-    "water lily",
-    "flower girl",
-    "desert",
-    "enclosure",
-    "christmas light",
-    "kitchen",
-    "caterpillar",
-    "plaid",
-    "bath",
-    "bush",
-    "mud",
-    "ballet",
-    "knee",
-    "adult",
-    "raft",
-    "sea view",
-    "cactus",
-    "office chair",
-    "overall",
-    "rim",
-    "scaffolding",
-    "pig",
-    "cover",
-    "poster page",
-    "sprinkle",
-    "chandelier",
-    "algae",
-    "traffic",
-    "surfboard",
-    "book",
-    "filming",
-    "flash",
-    "mansion",
-    "camouflage",
-    "trouser",
-    "ticket",
-    "weed",
-    "cab",
-    "trench",
-    "elephant",
-    "huddle",
-    "sphere",
-    "christmas decoration",
-    "city",
-    "launch",
-    "doll",
-    "christmas ornament",
-    "fabric",
-    "bikini",
-    "biplane",
-    "breakfast",
-    "neighbourhood",
-    "race track",
-    "foliage",
-    "avocado",
-    "school bus",
-    "footwear",
-    "highway",
-    "ocean view",
-    "art vector illustration",
-    "wall clock",
-    "curtain",
-    "teenager",
-    "kitchen area",
-    "robot",
-    "tusk",
-    "lounge chair",
-    "beam",
-    "paddle",
-    "camel",
-    "lid",
-    "world map",
-    "city view",
-    "newlywed",
-    "cargo ship",
-    "yellow",
-    "exhibition",
-    "bend",
-    "novel",
-    "wool",
-    "ontario",
-    "bread",
-    "campus",
-    "coastline",
-    "cutting board",
-    "booth",
-    "table top",
-    "carpet",
-    "beach chair",
-    "workout",
-    "street food",
-    "fun",
-    "costumer film designer",
-    "gadget",
-    "artist",
-    "fishing village",
-    "builder",
-    "violinist",
-    "iphone",
-    "spider web",
-    "traffic sign",
-    "ruin",
-    "rescue",
-    "clipboard",
-    "seal",
-    "film director",
-    "paw",
-    "nursery",
-    "intersection",
-    "tomato sauce",
-    "taste",
-    "paddy field",
-    "christmas tree",
-    "wave",
-    "stool",
-    "watering can",
-    "rug",
-    "daytime",
-    "subway station",
-    "craft",
-    "pine forest",
-    "black",
-    "planet",
-    "motif",
-    "christmas market",
-    "glass window",
-    "college",
-    "wheat",
-    "damage",
-    "rectangle",
-    "picture frame",
-    "chess",
-    "guest room",
-    "street corner",
-    "religion",
-    "seed",
-    "puzzle",
-    "freeway",
-    "beauty",
-    "ocean",
-    "watch",
-    "mother",
-    "garage",
-    "quote",
-    "dj",
-    "supporter",
-    "hip hop artist",
-    "muffin",
-    "eiffel tower",
-    "cash",
-    "firefighter",
-    "cauliflower",
-    "bunker",
-    "sled",
-    "manicure",
-    "shark",
-    "stall",
-    "jungle",
-    "family home",
-    "tour bus",
-    "chimney",
-    "touchdown",
-    "roundabout",
-    "coyote",
-    "street scene",
-    "tank",
-    "wedding dress",
-    "mantle",
-    "bedroom window",
-    "coconut",
-    "chapel",
-    "goat",
-    "living space",
-    "rock wall",
-    "polka dot",
-    "railway",
-    "mandala",
-    "mango",
-    "lesson",
-    "mountain landscape",
-    "team photo",
-    "bookshelf",
-    "meter",
-    "bulldog",
-    "evening sun",
-    "stick",
-    "card",
-    "pink",
-    "fish pond",
-    "paint",
-    "pill",
-    "cart",
-    "pea",
-    "van",
-    "album",
-    "football college game",
-    "mountain pass",
-    "doughnut",
-    "ski slope",
-    "match",
-    "official",
-    "shadow",
-    "organ",
-    "celebration",
-    "coin",
-    "log cabin",
-    "firework display",
-    "present",
-    "twig",
-    "chef",
-    "confetti",
-    "footpath",
-    "tour",
-    "ponytail",
-    "artwork",
-    "race car",
-    "club",
-    "season",
-    "hose",
-    "pencil",
-    "aircraft",
-    "rock formation",
-    "wardrobe",
-    "participant",
-    "politician",
-    "engineer",
-    "peace",
-    "filter",
-    "sailing boat",
-    "water bottle",
-    "service dog",
-    "poodle",
-    "loki",
-    "statesman",
-    "sleeping bag",
-    "outskirt",
-    "clock",
-    "factory",
-    "oak tree",
-    "physician",
-    "color",
-    "room",
-    "stairway",
-    "company",
-    "lady",
-    "graph",
-    "faucet",
-    "tablecloth",
-    "subway train",
-    "chocolate chip cookie",
-    "headquarters",
-    "screw",
-    "goggle",
-    "halloween",
-    "city street",
-    "swirl",
-    "cord",
-    "forward",
-    "bone",
-    "bedding",
-    "archway",
-    "wig",
-    "lobby",
-    "mask",
-    "attic",
-    "kitchen table",
-    "skylight",
-    "fire",
-    "exit",
-    "oil painting",
-    "passenger",
-    "meditation",
-    "salmon",
-    "fedora",
-    "rubber stamp",
-    "orange juice",
-    "arch",
-    "scientist",
-    "stroll",
-    "manhattan",
-    "float",
-    "baseball uniform",
-    "circle",
-    "church",
-    "decker bus",
-    "competitor",
-    "zoo",
-    "basketball team",
-    "tourist",
-    "daughter",
-    "silverware",
-    "ceiling fan",
-    "birth",
-    "vase",
-    "jack",
-    "mushroom",
-    "spiral",
-    "cage",
-    "limb",
-    "salad",
-    "ad",
-    "control",
-    "earth",
-    "party",
-    "bolt",
-    "tractor",
-    "barley",
-    "wedding photo",
-    "hawk",
-    "warehouse",
-    "vegetable garden",
-    "chocolate cake",
-    "cabbage",
-    "floor window",
-    "baby shower",
-    "magnifying glass",
-    "table",
-    "stethoscope",
-    "reading",
-    "mission",
-    "croissant",
-    "gift box",
-    "rocket",
-    "forest road",
-    "cooking",
-    "suite",
-    "hill country",
-    "motorcycle",
-    "baseball player",
-    "angle",
-    "drug",
-    "sport association",
-    "championship",
-    "family portrait",
-    "florist",
-    "softball",
-    "egret",
-    "office",
-    "plywood",
-    "jockey",
-    "mosque",
-    "brunch",
-    "beanie",
-    "office building",
-    "pattern",
-    "calendar",
-    "indoor",
-    "pepper",
-    "ledge",
-    "trail",
-    "fuel",
-    "laptop computer",
-    "tennis shoe",
-    "deck chair",
-    "guitarist",
-    "barn",
-    "surgery",
-    "cartoon illustration",
-    "nebula",
-    "railroad",
-    "mountain goat",
-    "goose",
-    "car door",
-    "cheer",
-    "liquid",
-    "hardwood floor",
-    "pathway",
-    "acorn",
-    "gull",
-    "airliner",
-    "couch",
-    "lake house",
-    "spaghetti",
-    "promenade",
-    "collection",
-    "garden",
-    "bank",
-    "robin",
-    "tennis ball",
-    "peony",
-    "gymnast",
-    "lavender",
-    "deck",
-    "test",
-    "riverside",
-    "rapper",
-    "domino",
-    "bride",
-    "mouse",
-    "basil",
-    "wedding couple",
-    "ocean wave",
-    "arm",
-    "kitchen floor",
-    "grove",
-    "family member",
-    "backyard",
-    "raspberry",
-    "forest fire",
-    "officer",
-    "hibiscus",
-    "canyon",
-    "composer",
-    "signature",
-    "olive oil",
-    "hibiscus flower",
-    "rose",
-    "vector icon",
-    "sunrise",
-    "horseback",
-    "motor scooter",
-    "office worker",
-    "tradition",
-    "ingredient",
-    "washing machine",
-    "lighting",
-    "bagel",
-    "sailboat",
-    "policeman",
-    "mare",
-    "graphic",
-    "halloween pumpkin",
-    "stock",
-    "pilot",
-    "education",
-    "team",
-    "body",
-    "horse",
-    "kimono",
-    "bazaar",
-    "bag",
-    "recording studio",
-    "parsley",
-    "entrance",
-    "denim",
-    "vet",
-    "horse farm",
-    "charcoal",
-    "architecture",
-    "glass vase",
-    "puppy",
-    "estuary",
-    "television show host",
-    "city bus",
-    "shoulder",
-    "beast",
-    "balance",
-    "golfer",
-    "roadside",
-    "denim jacket",
-    "stone wall",
-    "counter top",
-    "app icon",
-    "toast",
-    "head coach",
-    "ham",
-    "warrior",
-    "gem",
-    "refrigerator",
-    "snowman",
-    "construction worker",
-    "coal",
-    "website",
-    "morning fog",
-    "mustard",
-    "human",
-    "owl",
-    "puppy dog",
-    "piggy bank",
-    "vegetation",
-    "pirate",
-    "action film",
-    "marshmallow",
-    "thanksgiving",
-    "business",
-    "disease",
-    "signage",
-    "greeting",
-    "skate park",
-    "tile",
-    "mouth",
-    "spinach",
-    "vacation",
-    "leader",
-    "shrine",
-    "walker",
-    "science fiction film",
-    "bill",
-    "rabbit",
-    "motor boat",
-    "bar",
-    "radio",
-    "barge",
-    "tail",
-    "chainsaw",
-    "gallery",
-    "rainbow",
-    "pasta",
-    "padlock",
-    "web",
-    "pastry",
-    "ink",
-    "reef",
-    "school uniform",
-    "shawl",
-    "treasure",
-    "peach",
-    "dinner table",
-    "injury",
-    "harbor",
-    "witch",
-    "car dealership",
-    "litter",
-    "gesture",
-    "documentary",
-    "marriage",
-    "sea shell",
-    "priest",
-    "dome",
-    "kit",
-    "icon",
-    "seaside",
-    "bucket",
-    "entertainment",
-    "stable",
-    "hat",
-    "puddle",
-    "sock",
-    "shopper",
-    "technology",
-    "harbour",
-    "orbit",
-    "antler",
-    "tube",
-    "flag waving",
-    "cook",
-    "tight",
-    "commander",
-    "farmland",
-    "switch",
-    "hiker",
-    "wedding ceremony",
-    "award ceremony",
-    "champion",
-    "chopstick",
-    "farmhouse",
-    "performer",
-    "spike",
-    "accident",
-    "cruise ship",
-    "passenger train",
-    "attraction",
-    "entertainer",
-    "rear view",
-    "sidewalk",
-    "parade",
-    "racing",
-    "plane",
-    "ritual",
-    "peacock",
-    "pocket",
-    "plum",
-    "drop",
-    "carrot",
-    "floor",
-    "sunset",
-    "troop",
-    "architect",
-    "coffee table",
-    "dust",
-    "outline",
-    "leather",
-    "charity event",
-    "heat",
-    "whale",
-    "laundry",
-    "coconut tree",
-    "crosswalk",
-    "pony",
-    "ant",
-    "pipe",
-    "string",
-    "coat",
-    "angel",
-    "beef",
-    "church tower",
-    "dish",
-    "pitch",
-    "cupboard",
-    "thermometer",
-    "dirt field",
-    "fireworks",
-    "minute",
-    "cane",
-    "pajama",
-    "flower garden",
-    "autumn",
-    "trash can",
-    "dachshund",
-    "banana tree",
-    "tray",
-    "moose",
-    "roadway",
-    "carnival",
-    "antenna",
-    "pole",
-    "castle wall",
-    "ram",
-    "cattle",
-    "hay",
-    "cookie",
-    "swimmer",
-    "baseball team",
-    "strait",
-    "hedge",
-    "jet",
-    "fire pit",
-    "octopus",
-    "calf",
-    "cube",
-    "opera",
-    "cardboard box",
-    "tiara",
-    "kitchen sink",
-    "prairie",
-    "bowl",
-    "galaxy",
-    "straw hat",
-    "linen",
-    "ski resort",
-    "stitch",
-    "street lamp",
-    "motorist",
-    "icicle",
-    "stain",
-    "flora",
-    "drain",
-    "kitchen cabinet",
-    "decor",
-    "bouquet",
-    "pound",
-    "interior design",
-    "nail polish",
-    "figurine",
-    "tomb",
-    "disc",
-    "twist",
-    "blouse",
-    "ribbon",
-    "figure",
-    "burger",
-    "cork",
-    "soccer goalkeeper",
-    "train bridge",
-    "drinking water",
-    "dew",
-    "baker",
-    "storm cloud",
-    "tarmac",
-    "tv drama",
-    "sponge",
-    "magnet",
-    "sailor",
-    "entry",
-    "swan",
-    "exercise",
-    "sloth",
-    "jewel",
-    "scuba diver",
-    "bite",
-    "cat tree",
-    "tent",
-    "can",
-    "tennis match",
-    "ecosystem",
-    "picket fence",
-    "palm",
-    "train car",
-    "frying pan",
-    "rally",
-    "tablet pc",
-    "reindeer",
-    "image",
-    "wolf",
-    "chin",
-    "conservatory",
-    "flood water",
-    "cityscape",
-    "beach sand",
-    "car park",
-    "pavement",
-    "farm field",
-    "swimming",
-    "winter storm",
-    "stem",
-    "pillow",
-    "inning",
-    "gorilla",
-    "desk",
-    "avenue",
-    "fern",
-    "money",
-    "pearl",
-    "train station",
-    "skillet",
-    "nap",
-    "barber",
-    "library",
-    "freezer",
-    "label",
-    "rainforest",
-    "parking sign",
-    "mirror",
-    "wing",
-    "noodle",
-    "press room",
-    "sculpture",
-    "tablet",
-    "viewer",
-    "prayer",
-    "mini",
-    "mechanic",
-    "laugh",
-    "rice field",
-    "hand",
-    "mustache",
-    "mountain road",
-    "catwalk",
-    "conference",
-    "cape",
-    "installation",
-    "musician",
-    "stream",
-    "machine",
-    "speech",
-    "crocodile",
-    "soccer match",
-    "town square",
-    "passport",
-    "post box",
-    "point",
-    "stone building",
-    "motorway",
-    "mix",
-    "dentist",
-    "businessperson",
-    "happiness",
-    "boat",
-    "vineyard",
-    "treadmill",
-    "glass wall",
-    "water droplet",
-    "coffee mug",
-    "graduate",
-    "sunflower",
-    "parliament",
-    "shepherd",
-    "movie",
-    "wine",
-    "orchard",
-    "tulip",
-    "motherboard",
-    "cup",
-    "broom",
-    "spot",
-    "drawing",
-    "polo shirt",
-    "graduation",
-    "film producer",
-    "moonlight",
-    "glow",
-    "film format",
-    "t shirt",
-    "rock face",
-    "sword",
-    "clinic",
-    "festival day",
-    "meadow",
-    "staple",
-    "pupil",
-    "training ground",
-    "rider",
-    "flower",
-    "foal",
-    "wharf",
-    "foot bridge",
-    "shooting",
-    "top",
-    "mast",
-    "police car",
-    "robe",
-    "wedding bouquet",
-    "stop sign",
-    "birthday cake",
-    "glitter",
-    "butter",
-    "scooter",
-    "tundra",
-    "superhero",
-    "pocket watch",
-    "inscription",
-    "youngster",
-    "fruit tree",
-    "movie poster",
-    "engine",
-    "foundation",
-    "motorcyclist",
-    "take",
-    "woman",
-    "antelope",
-    "country artist",
-    "road trip",
-    "typewriter",
-    "tuxedo",
-    "brand",
-    "pine",
-    "bathroom",
-    "paradise",
-    "texture",
-    "balloon",
-    "dining table",
-    "home",
-    "computer screen",
-    "actor",
-    "clip",
-    "tv tower",
-    "panorama",
-    "summit",
-    "cat",
-    "plot",
-    "eagle",
-    "dancer",
-    "pup",
-    "studio shot",
-    "tear",
-    "bird bath",
-    "classroom",
-    "bookstore",
-    "city wall",
-    "tv programme",
-    "blade",
-    "easel",
-    "buttercream",
-    "sweet",
-    "designer",
-    "diamond",
-    "handshake",
-    "herb",
-    "corn field",
-    "seafront",
-    "concrete",
-    "street artist",
-    "gas",
-    "stamp",
-    "window display",
-    "paper",
-    "note",
-    "pint",
-    "quarry",
-    "research",
-    "fixture",
-    "manager",
-    "soil",
-    "leopard",
-    "board game",
-    "ladder",
-    "stop light",
-    "island",
-    "ramp",
-    "football match",
-    "icing",
-    "drill",
-    "currency",
-    "summer evening",
-    "topping",
-    "pyramid",
-    "pomegranate",
-    "cell",
-    "ivy",
-    "squad",
-    "scenery",
-    "computer",
-    "locomotive",
-    "surf",
-    "mascot",
-    "dune",
-    "path",
-    "duck",
-    "twilight",
-    "wire",
-    "bow tie",
-    "strike",
-    "cormorant",
-    "car wash",
-    "crane",
-    "market",
-    "philosopher",
-    "alarm clock",
-    "camera",
-    "birch",
-    "greeting card",
-    "plain",
-    "clay",
-    "donut",
-    "lock",
-    "moth",
-    "laboratory",
-    "fan",
-    "violin",
-    "jazz fusion artist",
-    "mountain biker",
-    "terrain",
-    "magazine",
-    "pickup",
-    "comedy film",
-    "smartphone",
-    "film",
-    "bed",
-    "microwave oven",
-    "tournament",
-    "lawn",
-    "car window",
-    "alligator",
-    "screen",
-    "jetty",
-    "shopping bag",
-    "landscape view",
-    "cabinetry",
-    "friendly match",
-    "thing",
-    "petal",
-    "shopping center",
-    "transport",
-    "ballet dancer",
-    "shoreline",
-    "princess",
-    "car seat",
-    "parking meter",
-    "green",
-    "vodka",
-    "band",
-    "rock",
-    "costume",
-    "warning sign",
-    "strip",
-    "plaque",
-    "wheelchair",
-    "headband",
-    "ginger",
-    "dice",
-    "media",
-    "hairdresser",
-    "press",
-    "living room",
-    "stove",
-    "player",
-    "cherry",
-    "workshop",
-    "carving",
-    "embroidery",
-    "doodle",
-    "adventure",
-    "rugby player",
-    "monument",
-    "brush",
-    "marker",
-    "loft",
-    "postcard",
-    "collage",
-    "ball",
-    "professor",
-    "dresser",
-    "gig",
-    "festival",
-    "blackbird",
-    "makeup artist",
-    "video camera",
-    "sticker",
-    "peak",
-    "wildflower",
-    "santa hat",
-    "rodeo",
-    "wedding photographer",
-    "guy",
-    "staff",
-    "waterfall",
-    "operation",
-    "defender",
-    "falcon",
-    "haze",
-    "individual",
-    "gentleman",
-    "greyhound",
-    "rocking chair",
-    "rice",
-    "garbage",
-    "platter",
-    "chocolate",
-    "splash",
-    "business suit",
-    "cheetah",
-    "valley",
-    "maze",
-    "trampoline",
-    "garland",
-    "slalom",
-    "unicorn",
-    "tree stump",
-    "painting",
-    "romance",
-    "fight",
-    "alcohol",
-    "ghost",
-    "fondant",
-    "spa",
-    "shutter",
-    "death",
-    "demonstration",
-    "cotton",
-    "pier",
-    "flea market",
-    "history",
-    "savannah",
-    "fist",
-    "aisle",
-    "crew",
-    "jug",
-    "pose",
-    "anchor",
-    "teapot",
-    "boat house",
-    "business team",
-    "tripod",
-    "bee",
-    "pebble",
-    "mattress",
-    "canvas",
-    "hallway",
-    "campaign",
-    "pod",
-    "lake district",
-    "article",
-    "white",
-    "sofa",
-    "honey",
-    "marathon",
-    "pancake",
-    "tourist attraction",
-    "wedding gown",
-    "battle",
-    "shelving",
-    "sea",
-    "sheet music",
-    "pie",
-    "yarn",
-    "construction site",
-    "flyer",
-    "tie",
-    "star",
-    "lettuce",
-    "martial artist",
-    "dart",
-    "straw",
-    "reflection",
-    "conference room",
-    "temperature",
-    "rugby",
-    "mosquito",
-    "physicist",
-    "rock climber",
-    "crash",
-    "backdrop",
-    "toilet seat",
-    "sand castle",
-    "water park",
-    "toy car",
-    "waste",
-    "luxury",
-    "hangar",
-    "rv",
-    "tree trunk",
-    "board",
-    "gold",
-    "project picture",
-    "cap",
-    "cottage",
-    "relief",
-    "attire",
-    "microscope",
-    "battery",
-    "roll",
-    "line",
-    "parking garage",
-    "crystal",
-    "broadcasting",
-    "brick wall",
-    "lab",
-    "flooring",
-    "meeting",
-    "3d cg rendering",
-    "desktop computer",
-    "cowboy",
-    "sailing ship",
-    "junction",
-    "hairstyle",
-    "homework",
-    "profile",
-    "model",
-    "flower pot",
-    "street light",
-    "salt lake",
-    "maple",
-    "space",
-    "blizzard",
-    "throw",
-    "zebras",
-    "brochure",
-    "constellation",
-    "beak",
-    "kilt",
-    "pond",
-    "blue sky",
-    "sneaker",
-    "sand dune",
-    "morning sun",
-    "almond",
-    "grill",
-    "curl",
-    "basketball girl game",
-    "chameleon",
-    "toilet bowl",
-    "prince",
-    "keyboard",
-    "queen",
-    "computer monitor",
-    "writing",
-    "crown",
-    "basilica",
-    "kiss",
-    "house",
-    "parking",
-    "football competition",
-    "shell",
-    "sport equipment",
-    "comedy",
-    "baboon",
-    "vendor",
-    "rise building",
-    "wrap",
-    "food truck",
-    "cat bed",
-    "rickshaw",
-    "flare",
-    "teal",
-    "nectar",
-    "eclipse",
-    "vehicle",
-    "steam locomotive",
-    "gorge",
-    "cow",
-    "christmas card",
-    "demonstrator",
-    "memorial",
-    "towel",
-    "jewellery",
-    "train",
-    "frisbee",
-    "baseball game",
-    "fur",
-    "afternoon sun",
-    "community",
-    "sparkler",
-    "bandage",
-    "firework",
-    "dollar",
-    "pasture",
-    "video",
-    "bus",
-    "tree house",
-    "seashore",
-    "field",
-    "hamburger",
-    "souvenir",
-    "hedgehog",
-    "worm",
-    "pine cone",
-    "osprey",
-    "dinosaur",
-    "vegetable",
-    "junk",
-    "poster",
-    "army",
-    "winger",
-    "bundle",
-    "stage",
-    "growth",
-    "wedding party",
-    "service",
-    "blanket",
-    "ruler",
-    "eye",
-    "credit card",
-    "castle",
-    "diner",
-    "hut",
-    "elk",
-    "hard rock artist",
-    "nun",
-    "dog breed",
-    "nest",
-    "drama film",
-    "number icon",
-    "water tank",
-    "giraffe",
-    "altar",
-    "pavilion",
-    "tv personality",
-    "suv",
-    "street vendor",
-    "street sign",
-    "ditch",
-    "debris",
-    "foam",
-    "takeoff",
-    "spice",
-    "mountain lake",
-    "tea",
-    "orchestra",
-    "spacecraft",
-    "counter",
-    "abbey",
-    "mountain",
-    "hydrangea",
-    "racer",
-    "orange tree",
-    "tide",
-    "cowboy hat",
-    "rapid",
-    "town",
-    "wild",
-    "herd",
-    "vein",
-    "driveway",
-    "jar",
-    "bark",
-    "illustration",
-    "horror film",
-    "corn",
-    "stroller",
-    "industry",
-    "mountain stream",
-    "gym",
-    "neckline",
-    "pan",
-    "client",
-    "spectator",
-    "eggplant",
-    "camper",
-    "fawn",
-    "hoodie",
-    "meat",
-    "lemonade",
-    "food market",
-    "slum",
-    "comic book character",
-    "flower market",
-    "love",
-    "palace",
-    "gun",
-    "heel",
-    "shopping street",
-    "shooting basketball guard",
-    "family photo",
-    "rooftop",
-    "laundry basket",
-    "airport runway",
-    "horn",
-    "face mask",
-    "flight",
-    "appetizer",
-    "violet",
-    "country lane",
-    "cement",
-    "instrument",
-    "tv actor",
-    "spark",
-    "celebrity",
-    "award",
-    "country house",
-    "standing",
-    "auction",
-    "date",
-    "engagement",
-    "puck",
-    "advertisement",
-    "chair",
-    "zebra",
-    "driftwood",
-    "bumblebee",
-    "maple leaf",
-    "bonnet",
-    "orange",
-    "water tower",
-    "door",
-    "singer",
-    "floor plan",
-    "discussion",
-    "theatre",
-    "pilgrim",
-    "mug",
-    "branch",
-    "window sill",
-    "baseball pitcher",
-    "bakery",
-    "lollipop",
-    "basketball player",
-    "toilet paper",
-    "chalkboard",
-    "cabin",
-    "sign",
-    "night sky",
-    "cannon",
-    "fishing net",
-    "submarine",
-    "suit",
-    "fur coat",
-    "wine bottle",
-    "folder",
-    "street art",
-    "suspension bridge",
-    "evening sky",
-    "billboard",
-    "postage stamp",
-    "newspaper",
-    "transportation",
-    "surgeon",
-    "light",
-    "park",
-    "horizon",
-    "road",
-    "sand bar",
-    "trumpet",
-    "lounge",
-    "cloud forest",
-    "birthday celebration",
-    "balcony",
-    "anime",
-    "beehive",
-    "umbrella",
-    "goldfish",
-    "baseball cap",
-    "waterhole",
-    "ceiling",
-    "carousel",
-    "backpack",
-    "plant pot",
-    "atmosphere",
-    "sunflower field",
-    "spire",
-    "vision",
-    "woodpecker",
-    "chip",
-    "pool table",
-    "lotus flower",
-    "cone",
-    "humpback whale",
-    "reservoir",
-    "hunt",
-    "piano",
-    "plate",
-    "dining area",
-    "luggage",
-    "skier",
-    "dance floor",
-    "crow",
-    "stair",
-    "overpass",
-    "opera house",
-    "bear",
-    "jazz artist",
-    "water",
-    "vessel",
-    "cast",
-    "yard",
-    "cathedral",
-    "basketball hoop",
-    "graveyard",
-    "sound",
-    "berry",
-    "onlooker",
-    "fauna",
-    "birch tree",
-    "retail",
-    "hill",
-    "skeleton",
-    "journalist",
-    "frost",
-    "basket",
-    "nail",
-    "dusk",
-    "trash",
-    "dawn",
-    "clover",
-    "hen",
-    "volcano",
-    "basketball coach",
-    "home decor",
-    "charge",
-    "haircut",
-    "sense",
-    "university",
-    "lizard",
-    "daisy",
-    "tablet computer",
-    "grass field",
-    "prison",
-    "metal artist",
-    "bathroom mirror",
-    "window frame",
-    "chest",
-    "flavor",
-    "pop country artist",
-    "market square",
-    "monkey",
-    "blog",
-    "deer",
-    "speech bubble",
-    "dog",
-    "independence day",
-    "girl",
-    "boy",
-    "tartan",
-    "furniture",
-    "appliance",
-    "office window",
-    "fish boat",
-    "sand box",
-    "tv sitcom",
-    "drama",
-    "sleigh",
-    "depression",
-    "paper towel",
-    "baseball",
-    "protestor",
-    "grape",
-    "wedding cake",
-    "invitation",
-    "accessory",
-    "pick",
-    "grandparent",
-    "racket",
-    "tea plantation",
-    "outdoors",
-    "egg",
-    "glass bowl",
-    "sun",
-    "organization",
-    "lion",
-    "panel",
-    "station",
-    "wallpaper",
-    "helicopter",
-    "salt",
-    "vanity",
-    "patio",
-    "lunch",
-    "street performer",
-    "mountain range",
-    "soup",
-    "bacon",
-    "power station",
-    "cantilever bridge",
-    "hummingbird",
-    "shirt",
-    "rope",
-    "hip",
-    "chalk",
-    "pendant",
-    "choir",
-    "tv",
-    "lichen",
-    "railway bridge",
-    "art gallery",
-    "bartender",
-    "wagon",
-    "baby elephant",
-    "accordion",
-    "horseshoe",
-    "building site",
-    "clutch",
-    "harvest",
-    "savanna",
-    "geranium",
-    "business woman",
-    "paddock",
-    "patch",
-    "beech tree",
-    "war",
-    "suburbs",
-    "hospital bed",
-    "motorcycle racer",
-    "moss",
-    "gravel",
-    "government agency",
-    "dollar bill",
-    "father",
-    "fjord",
-    "concert",
-    "nut",
-    "wedding photography",
-    "finish line",
-    "home plate",
-    "food",
-    "nose",
-    "thumb",
-    "village",
-    "dining room table",
-    "bumper",
-    "monster",
-    "blackberry",
-    "lime",
-    "conflict",
-    "gala",
-    "wallet",
-    "wrist",
-    "hug",
-    "mermaid",
-    "lava",
-    "lawyer",
-    "folk rock artist",
-    "arena",
-    "onion",
-    "toothbrush",
-    "fashion",
-    "perfume",
-    "flip",
-    "triangle",
-    "woodland",
-    "mail",
-    "grasshopper",
-    "studio",
-    "wood floor",
-    "den",
-    "racquet",
-    "cello",
-    "lemur",
-    "astronaut",
-    "glass table",
-    "blood",
-    "dvd",
-    "planter",
-    "silver",
-    "leash",
-    "master bedroom",
-    "forest",
-    "batter",
-    "shoe",
-    "engraving",
-    "opening",
-    "product",
-    "toe",
-    "cocktail",
-    "mallard duck",
-    "bike ride",
-    "oasis",
-    "wedding ring",
-    "cinematographer",
-    "holly",
-    "autograph",
-    "fence",
-    "ice cube",
-    "cove",
-    "pineapple",
-    "aurora",
-    "glass bead",
-    "produce",
-    "apartment building",
-    "cob",
-    "miniature",
-    "cockpit",
-    "flashlight",
-    "frog",
-    "sheep",
-    "groom",
-    "steel",
-    "watermelon",
-    "clip art",
-    "paper plate",
-    "ostrich",
-    "contour",
-    "mural",
-    "cub",
-    "paisley bandanna",
-    "winery",
-    "turn",
-    "handle",
-    "satellite",
-    "post",
-    "pork",
-    "child",
-    "asphalt",
-    "grocery store",
-    "vulture",
-    "trolley",
-    "nightclub",
-    "brick",
-    "trailer",
-    "compass",
-    "cereal",
-    "cafe",
-    "cartoon character",
-    "sugar",
-    "fiction book",
-    "glass floor",
-    "umpire",
-    "guitar",
-    "hamster",
-    "protester",
-    "airplane",
-    "garment",
-    "blazer",
-    "railway line",
-    "wedding",
-    "shoe box",
-    "parking lot",
-    "construction",
-    "graduation ceremony",
-    "tram",
-    "telescope",
-    "copper",
-    "pain",
-    "autumn forest",
-    "guest house",
-    "partner",
-    "crayon",
-    "dip",
-    "boot",
-    "corridor",
-    "computer keyboard",
-    "hockey player",
-    "chicken coop",
-    "bus station",
-    "gathering",
-    "ankle",
-    "bunk bed",
-    "wood table",
-    "football coach",
-    "monarch",
-    "pharmacy",
-    "legging",
-    "mannequin",
-    "female",
-    "train track",
-    "stack",
-    "canopy",
-    "design element",
-    "grandmother",
-    "symbol",
-    "beach hut",
-    "zucchini",
-    "bomb",
-    "businessman",
-    "skyscraper",
-    "tongue",
-    "case",
-    "sparkle",
-    "highland",
-    "ballroom",
-    "prom",
-    "estate",
-    "customer",
-    "archipelago",
-    "cheese",
-    "debate",
-    "carriage",
-    "bulldozer",
-    "pumpkin",
-    "sitting room",
-    "gas station",
-    "wedding reception",
-    "camp",
-    "dog bed",
-    "tower",
-    "property",
-    "river bed",
-    "pop latin artist",
-    "fridge",
-    "wine glass",
-    "coast",
-    "beer",
-    "tow truck",
-    "fire truck",
-    "mountain bike",
-    "thigh",
-    "heron",
-    "boat ride",
-    "gondola",
-    "turquoise",
-    "lake",
-    "llama",
-    "kitty",
-    "tin",
-    "waiting room",
-    "coffee cup",
-    "socialite",
-    "guard",
-    "tap",
-    "waterway",
-    "forehead",
-    "list",
-    "erosion",
-    "box",
-    "sea lion",
-    "pollen",
-    "dam",
-    "wasp",
-    "salon",
-    "tennis tournament",
-    "flower box",
-    "aquarium",
-    "rain cloud",
-    "clothing store",
-    "lead singer",
-    "cupcake",
-    "tortoise",
-    "lettering",
-    "sport facility",
-    "dance",
-    "dog house",
-    "nature",
-    "football",
-    "rooster",
-    "footballer",
-    "railway track",
-    "crowd",
-    "fishing rod",
-    "silhouette",
-    "wind turbine",
-    "sari",
-    "bus window",
-    "cloud",
-    "charity",
-    "medal",
-    "yoga",
-    "event",
-    "veil",
-    "fashion menswear milan week",
-    "news",
-    "knife",
-    "print",
-    "screen tv",
-    "walnut",
-    "fungus",
-    "ice cream",
-    "computer mouse",
-    "play",
-    "tribe",
-    "picture",
-    "video game",
-    "business card",
-    "music festival",
-    "rack",
-    "envelope",
-    "shower",
-    "dirt road",
-    "mine",
-    "oyster",
-    "monarch butterfly",
-    "dude",
-    "fruit salad",
-    "podium",
-    "fork",
-    "lace",
-    "test match",
-    "boulder",
-    "cricket player",
-    "staircase",
-    "peninsula",
-    "shopping",
-    "popcorn",
-    "oak",
-    "market stall",
-    "pine tree",
-    "mountaineer",
-    "student",
-    "closet",
-    "hood",
-    "handstand",
-    "centerpiece",
-    "insect",
-    "patient",
-    "makeover",
-    "tennis player",
-    "sheet",
-    "park bench",
-    "apple",
-    "organism",
-    "hook",
-    "turkey",
-    "tangerine",
-    "sibling",
-    "shopping mall",
-    "bird",
-    "scarf",
-    "smoothie",
-    "net",
-    "grass",
-    "napkin",
-    "ray",
-    "eyebrow",
-    "laptop keyboard",
-    "motorbike",
-    "woman hand",
-    "oven",
-    "book cover",
-    "easter egg",
-    "microwave",
-    "sand",
-    "snapshot",
-    "soccer ball",
-    "makeup",
-    "knight",
-    "bowling ball",
-    "shower curtain",
-    "flame",
-    "lightning",
-    "running",
-    "power plant",
-    "crib",
-    "cartoon",
-    "moat",
-    "fashion girl",
-    "wedding invitation",
-    "bottle",
-    "cliff",
-    "monastery",
-    "file photo",
-    "apartment",
-    "casino",
-    "cream",
-    "sweatshirt",
-    "storm",
-    "cruise",
-    "teddy bear",
-    "shovel",
-    "wind farm",
-    "writer",
-    "dock",
-    "professional",
-    "hotel room",
-    "job",
-    "monitor",
-    "donkey",
-    "pass",
-    "interview",
-    "duchess",
-    "mark",
-    "plank",
-    "beard",
-    "zombie",
-    "trio",
-    "channel",
-    "cricket team",
-    "windmill",
-    "vest",
-    "diagram",
-    "cable",
-    "winter scene",
-    "golden gate bridge",
-    "buffalo",
-    "studio portrait",
-    "pagoda",
-    "whiskey",
-    "freight train",
-    "kite",
-    "future",
-    "steam train",
-    "phone box",
-    "headset",
-    "wood",
-    "snowboarder",
-    "paper bag",
-    "slide",
-    "grapefruit",
-    "seating",
-    "morning",
-    "bronze sculpture",
-    "theatre actor",
-    "stump",
-    "jean",
-    "landmark",
-    "jam",
-    "waist",
-    "watercolor",
-    "hammock",
-    "light fixture",
-    "ice",
-    "basin",
-    "beverage",
-    "shelter",
-    "premiere",
-    "mound",
-    "ear",
-    "bronze",
-    "sunlight",
-    "street",
-    "energy",
-    "barn door",
-    "hike",
-    "fleet",
-    "claw",
-    "beach",
-    "pepperoni",
-    "bin",
-    "trainer",
-    "buffet",
-    "archive",
-    "toddler",
-    "referee",
-    "bay window",
-    "dove",
-    "production company",
-    "evening light",
-    "gate",
-    "farm",
-    "reed",
-    "fruit stand",
-    "explorer",
-    "snow storm",
-    "throw pillow",
-    "button",
-    "display case",
-    "bookcase",
-    "lead",
-    "lipstick",
-    "basketball court",
-    "cargo",
-    "ensemble",
-    "pope",
-    "clock tower",
-    "teen",
-    "speaker",
-    "rat",
-    "laptop",
-    "ski",
-    "mess",
-    "stadium",
-    "ferry boat",
-    "bunny",
-    "waterfront",
-    "downtown",
-    "sink",
-    "press conference",
-    "dinner",
-    "condiment",
-    "thread",
-    "audience",
-    "grid",
-    "car",
-    "plastic",
-    "people",
-    "barbecue",
-    "pigeon",
-    "urinal",
-    "seagull",
-    "volunteer",
-    "hockey",
-    "fir tree",
-    "pollution",
-    "trial",
-    "collar",
-    "area",
-    "meeting room",
-    "circus",
-    "yogurt",
-    "orangutan",
-    "viaduct",
-    "comedian",
-    "drone",
-    "scissor",
-    "pop rock artist",
-    "biscuit",
-    "panda",
-    "water feature",
-    "air balloon",
-    "remote control",
-    "watercolor painting",
-    "show",
-    "walk",
-    "post office",
-    "bike path",
-    "rap gangsta artist",
-    "microphone",
-    "crack",
-    "sunset sky",
-    "glass",
-    "tv show",
-    "cartoon style",
-    "stripe",
-    "foyer",
-    "signal",
-    "calligraphy",
-    "bulb",
-    "gardener",
-    "coffee bean",
-    "spider",
-    "tapestry",
-    "city skyline",
-    "necklace",
-    "kitten",
-    "traveler",
-    "veteran",
-    "frosting",
-    "fry",
-    "tennis court",
-    "tank top",
-    "butterfly house",
-    "mist",
-    "drummer",
-    "water level",
-    "scale",
-    "baseball glove",
-    "music video performer",
-    "champagne",
-    "camping",
-    "clothing",
-    "water drop",
-    "telephone box",
-    "pen",
-    "morning mist",
-    "fire engine",
-    "porch",
-    "opening ceremony",
-    "style",
-    "palm tree",
-    "fashion show",
-    "universe",
-    "scratch",
-    "axe",
-    "ottoman",
-    "explosion",
-    "rib",
-    "boutique",
-    "game",
-    "cucumber",
-    "fruit",
-    "stone bridge",
-    "nature reserve",
-    "track",
-    "train window",
-    "punch",
-    "telephone pole",
-    "velvet",
-    "sauce",
-    "moon",
-    "contrast",
-    "flamingo",
-    "bat",
-    "vending machine",
-    "ship",
-    "equestrian",
-    "shade",
-    "comforter",
-    "pallet",
-    "sparrow",
-    "wii",
-    "glaze",
-    "grocery",
-    "steeple",
-    "soccer player",
-    "contract",
-    "advertising",
-    "runner",
-    "chimpanzee",
-    "world",
-    "seat",
-    "project",
-    "chihuahua",
-    "bubble",
-    "willow",
-    "pedestal",
-    "soul hip hop artist",
-    "curb",
-    "drawer",
-    "leaf",
-    "banner",
-    "launch party",
-    "coach",
-    "government",
-    "snowball",
-    "toy",
-    "portrait",
-    "doctor",
-    "whiteboard",
-    "electronic",
-    "tiger",
-    "graffiti",
-    "column",
-    "nightstand",
-    "whistle",
-    "maxi dress",
-    "bench",
-    "wetsuit",
-    "bird feeder",
-    "football game",
-    "basketball",
-    "class",
-    "bathroom door",
-    "store window",
-    "text message",
-    "wreath",
-    "street view",
-    "binocular",
-    "pet",
-    "facade",
-    "drought",
-    "lemon",
-    "new year",
-    "night view",
-    "airplane window",
-    "specie",
-    "rule",
-    "jaw",
-    "wheat field",
-    "diet",
-    "pop artist",
-    "habitat",
-    "screenshot",
-    "scoreboard",
-    "shore",
-    "mane",
-    "quilt",
-    "ski lift",
-    "orchid",
-    "turban",
-    "christmas",
-    "airport",
-    "marina",
-    "glass door",
-    "glass bottle",
-    "restaurant",
-    "conductor",
-    "logo",
-    "sleep",
-    "tape",
-    "tomato",
-    "river bank",
-    "lilac",
-    "tooth",
-    "training",
-    "pottery",
-    "shop",
-    "steam engine",
-    "mason jar",
-    "base",
-    "procession",
-    "border",
-    "shoot",
-    "footprint",
-    "hotdog",
-    "bull",
-    "stocking",
-    "recreation",
-    "automobile model",
-    "design",
-    "country pop artist",
-    "river",
-    "retriever",
-    "department store",
-    "auditorium",
-    "sport car",
-    "supermarket",
-    "belt",
-    "cricket",
-    "window box",
-    "dress shirt",
-    "letter",
-    "residence",
-    "megaphone",
-    "pant",
-    "wildfire",
-    "bird nest",
-    "crab",
-    "swimsuit",
-    "candle",
-    "funeral",
-    "mill",
-    "national park",
-    "plant",
-    "cop",
-    "power line",
-    "perch",
-    "blue",
-    "finger",
-    "ferris wheel",
-    "globe",
-    "skateboard",
-    "helmet",
-    "movie theater",
-    "uniform",
-    "hammer",
-    "material",
-    "kid",
-    "well",
-    "butterfly",
-    "sideline",
-    "fashion fall show",
-    "planet earth",
-    "lift",
-    "male",
-    "sauna",
-    "gray",
-    "flour",
-    "sand sculpture",
-    "program",
-    "cabinet",
-    "infant",
-    "wheel",
-    "aircraft model",
-    "dough",
-    "garlic",
-    "skate",
-    "arrow",
-    "wrapping paper",
-    "ripple",
-    "lamp",
-    "iron",
-    "banknote",
-    "beaver",
-    "ferry",
-    "courtyard",
-    "bassist",
-    "countryside",
-    "steak",
-    "comfort",
-    "boxer",
-    "laundry room",
-    "campsite",
-    "brick building",
-    "golf",
-    "subway",
-    "headphone",
-    "fort",
-    "handbag",
-    "drum",
-    "flood",
-    "saddle",
-    "bass",
-    "labyrinth",
-    "needle",
-    "sun ray",
-    "app",
-    "menu",
-    "president",
-    "cardigan",
-    "dandelion",
-    "wetland",
-    "ice hockey player",
-    "number",
-    "city hall",
-    "fishing",
-    "portrait session",
-    "pug",
-    "key",
-    "art print",
-    "minister",
-    "hurdle",
-    "emergency",
-    "painting artist",
-    "flag pole",
-    "evening",
-    "purse",
-    "recipe",
-    "golf ball",
-    "coloring book",
-    "mountain peak",
-    "senior",
-    "holiday",
-    "bud",
-    "cousin",
-    "pantry",
-    "lap",
-    "skin",
-    "flag",
-    "tissue paper",
-    "ridge",
-    "wire fence",
-    "surfer",
-    "climber",
-    "photograph",
-    "sewing machine",
-    "cooler",
-    "actress",
-    "apple tree",
-    "cancer",
-    "starfish",
-    "automobile make",
-    "dumbbell",
-    "brace",
-    "tunnel",
-    "window",
-    "paint artist",
-    "composition",
-    "school student",
-    "condo",
-    "convertible",
-    "cushion",
-    "selfie",
-    "territory",
-    "guide",
-    "tree",
-    "court",
-    "shrimp",
-    "stone house",
-    "dress",
-    "eyelash",
-    "juice",
-    "broccoli",
-    "chain",
-    "tourism",
-    "mountain top",
-    "concept car",
-    "film premiere",
-    "light bulb",
-    "cafeteria",
-    "badge",
-    "flower bed",
-    "theater",
-    "root",
-    "racecar driver",
-    "basketball boy game",
-    "glove",
-    "skyline",
-    "wall",
-    "glacier",
-    "airport terminal",
-    "bug",
-    "trim",
-    "railway station",
-    "briefcase",
-    "flat",
-    "fountain",
-    "person",
-    "lane",
-    "asparagus",
-    "art",
-    "lantern",
-    "dishwasher",
-    "director",
-    "snake",
-    "lecture",
-    "game controller",
-    "tree branch",
-    "pub",
-    "bathing suit",
-    "queue",
-    "belly",
-    "poppy",
-    "bow",
-    "pitcher",
-    "ice cream cone",
-    "cave",
-    "candy",
-    "road bridge",
-    "host",
-    "traffic jam",
-    "earring",
-    "file",
-    "foot",
-    "watermark overlay stamp",
-    "mailbox",
-    "supercar",
-    "railing",
-    "bedroom",
-    "seafood",
-    "waffle",
-    "bronze statue",
-    "plan",
-    "flow",
-    "marble",
-    "basketball game",
-    "automobile",
-    "scene",
-    "cypress tree",
-    "soldier",
-    "skateboarder",
-    "glass building",
-    "cherry tree",
-    "pump",
-    "grain",
-    "wildebeest",
-    "loop",
-    "frame",
-    "bathtub",
-    "saxophone",
-    "diver",
-    "stalk",
-    "lily",
-    "bead",
-    "alley",
-    "flock",
-    "family room",
-    "manufacturing",
-    "pointer",
-    "worker",
-    "navy",
-    "potato",
-    "teacher",
-    "photography",
-    "dolly",
-    "boardwalk",
-    "water fountain",
-    "athlete",
-    "side dish",
-    "bay",
-    "ice hockey",
-    "phone",
-    "hero",
-    "face",
-    "gold medal",
-    "blind",
-    "swamp",
-    "researcher",
-    "swim",
-    "meatball",
-    "iguana",
-    "leather jacket",
-    "jellyfish",
-    "site",
-    "smoke",
-    "traffic signal",
-    "melon",
-    "beetle",
-    "calculator",
-    "skirt",
-    "plantation",
-    "sculptor",
-    "barrier",
-    "catcher",
-    "security guard",
-    "sketch",
-    "awning",
-    "steering wheel",
-    "mountain view",
-    "bus stop",
-    "pool",
-    "leg",
-    "spotlight",
-    "apron",
-    "mineral",
-    "inlet",
-    "sleeve",
-    "torch",
-    "emotion",
-    "march",
-    "police officer",
-    "performance",
-    "lamp post",
-    "fishing boat",
-    "summer",
-    "presentation",
-    "saucer",
-    "suitcase",
-    "supermodel",
-    "goalkeeper",
-    "shrub",
-    "rock artist",
-    "document",
-    "beach house",
-    "man",
-    "blue artist",
-    "cigar",
-    "railroad track",
-    "gown",
-    "mosaic",
-    "bungalow",
-    "alphabet",
-    "baseball field",
-    "shed",
-    "pedestrian",
-    "rail",
-    "soap",
-    "kitchen counter",
-    "dessert",
-    "dunk",
-    "blossom",
-    "conversation",
-    "fruit market",
-    "glass jar",
-    "military",
-    "beer bottle",
-    "photographer",
-    "tennis racket",
-    "competition",
-    "escalator",
-    "bell tower",
-    "stilt",
-    "ballerina",
-    "television",
-    "feather",
-    "fence post",
-    "rear",
-    "dahlia",
-    "red carpet",
-    "tub",
-    "hole",
-    "fortress",
-    "pack",
-    "telephone",
-    "cardboard",
-    "city park",
-    "platform",
-    "college student",
-    "arch bridge",
-    "wind",
-    "blender",
-    "bloom",
-    "ice rink",
-    "birthday",
-    "raven",
-    "fairy",
-    "embankment",
-    "hall",
-    "flower shop",
-    "suburb",
-    "barrel",
-    "biker",
-    "steam",
-    "dragonfly",
-    "formation",
-    "electricity",
-    "business people",
-    "symmetry",
-    "walkway",
-    "fisherman",
-    "gas mask",
-    "loch",
-    "youth",
-    "hanger",
-    "dot",
-    "fish",
-    "street market",
-    "animation film",
-    "crime fiction film",
-    "boar",
-    "emblem",
-    "halloween costume",
-    "kangaroo",
-    "couple",
-    "spoon",
-    "squirrel",
-    "neon sign",
-    "sky",
-    "office desk",
-    "beauty salon",
-    "breakwater",
-    "fashion look",
-    "toaster",
-    "author",
-    "news conference",
-    "outdoor",
-    "canoe",
-    "dragon",
-    "tool",
-    "shopping centre",
-    "ladybug",
-    "swimming pool",
-    "landscaping",
-    "ski pole",
-    "red",
-    "truck",
-    "fly",
-    "temple",
-    "level",
-    "sunday",
-    "railroad bridge",
-    "car mirror",
-    "lawn mower",
-    "flute",
-    "aircraft carrier",
-    "fashion menswear london week",
-    "sunshine",
-    "tile floor",
-    "skull",
-    "fossil",
-    "flower arrangement",
-    "diaper",
-    "sea turtle",
-    "cherry blossom",
-    "fireman",
-    "shack",
-    "lens",
-    "waiter",
-    "animal",
-    "basement",
-    "snow",
-    "autumn park",
-    "glass box",
-    "kick",
-    "head",
-    "anniversary",
-    "vine",
-    "back",
-    "paper lantern",
-    "fish tank",
-    "cellphone",
-    "silk",
-    "coral",
-    "notebook",
-    "photo",
-    "gazebo",
-    "ketchup",
-    "driver",
-    "farmer",
-    "bonfire",
-    "chestnut",
-    "photoshoot",
-    "football field",
-    "olive tree",
-    "pheasant",
-    "sandal",
-    "toilet",
-    "fireplace",
-    "music",
-    "deity",
-    "fish market",
-    "fig",
-    "bell",
-    "neck",
-    "grave",
-    "villa",
-    "cyclist",
-    "crate",
-    "grey",
-    "asphalt road",
-    "soccer",
-    "hostel",
-    "municipality",
-    "courthouse",
-    "roof",
-    "end table",
-    "pot",
-    "sedan",
-    "structure",
-    "folk artist",
-    "sport",
-    "sport team",
-    "protest",
-    "syringe",
-    "fashion designer",
-    "jersey",
-    "heart shape",
-    "kayak",
-    "stare",
-    "sit with",
-    "direct",
-    "read",
-    "photograph",
-    "spin",
-    "teach",
-    "laugh",
-    "carve",
-    "grow on",
-    "warm",
-    "watch",
-    "stretch",
-    "smell",
-    "decorate",
-    "shine",
-    "light",
-    "dance",
-    "send",
-    "park",
-    "chase",
-    "collect",
-    "lead",
-    "kiss",
-    "lead to",
-    "lick",
-    "smile",
-    "cheer",
-    "sit",
-    "point",
-    "block",
-    "rock",
-    "drop",
-    "cut",
-    "ski",
-    "wrap",
-    "lose",
-    "serve",
-    "provide",
-    "sleep",
-    "dress",
-    "embrace",
-    "burn",
-    "pack",
-    "stir",
-    "create",
-    "touch",
-    "wash",
-    "stick",
-    "reveal",
-    "shop",
-    "train",
-    "paint",
-    "groom",
-    "hunt",
-    "bloom",
-    "play",
-    "pay",
-    "brush",
-    "shoot",
-    "hold",
-    "picture",
-    "carry",
-    "sip",
-    "contain",
-    "turn",
-    "pour",
-    "pitch",
-    "give",
-    "add",
-    "blow",
-    "look in",
-    "show",
-    "walk",
-    "illuminate",
-    "kneel",
-    "cover",
-    "drag",
-    "post",
-    "present",
-    "fit",
-    "operate",
-    "fish",
-    "race",
-    "write",
-    "deliver",
-    "peel",
-    "push",
-    "run",
-    "sit around",
-    "buy",
-    "jump",
-    "walk on",
-    "attend",
-    "clean",
-    "sell",
-    "ride on",
-    "mount",
-    "host",
-    "dry",
-    "plant",
-    "sing",
-    "row",
-    "shake",
-    "perch",
-    "ride",
-    "fight",
-    "skateboard",
-    "live",
-    "call",
-    "surround",
-    "practice",
-    "play on",
-    "work on",
-    "step",
-    "relax",
-    "hit",
-    "fall in",
-    "flow",
-    "greet",
-    "launch",
-    "wear",
-    "hang on",
-    "drive",
-    "sit in",
-    "break",
-    "learn",
-    "fly",
-    "connect",
-    "display",
-    "locate",
-    "compete",
-    "go for",
-    "sail",
-    "lift",
-    "toast",
-    "help",
-    "run on",
-    "reflect",
-    "pose",
-    "scratch",
-    "frame",
-    "dribble",
-    "herd",
-    "enter",
-    "exit",
-    "place",
-    "inspect",
-    "build",
-    "pick",
-    "fill",
-    "grind",
-    "skate",
-    "offer",
-    "float",
-    "sit by",
-    "stand",
-    "release",
-    "rest",
-    "singe",
-    "climb",
-    "tie",
-    "mark",
-    "lay",
-    "stand around",
-    "capture",
-    "set",
-    "land",
-    "swinge",
-    "run in",
-    "kick",
-    "lean",
-    "head",
-    "sign",
-    "approach",
-    "swim",
-    "close",
-    "crash",
-    "control",
-    "fall",
-    "remove",
-    "repair",
-    "open",
-    "appear",
-    "travel",
-    "load",
-    "miss",
-    "check",
-    "surf",
-    "moor",
-    "smoke",
-    "drink",
-    "board",
-    "seat",
-    "feed",
-    "rise",
-    "sit on",
-    "swing",
-    "grow",
-    "strike",
-    "date",
-    "slide",
-    "share",
-    "graze",
-    "jump in",
-    "lie",
-    "extrude",
-    "roll",
-    "move",
-    "gather",
-    "eat",
-    "pull",
-    "run through",
-    "squeeze",
-    "lay on",
-    "draw",
-    "play with",
-    "wave",
-    "assemble",
-    "perform",
-    "march",
-    "score",
-    "attach",
-    "adjust",
-    "hang",
-    "hug",
-    "sleep on",
-    "throw",
-    "live in",
-    "talk",
-    "pet",
-    "work",
-    "run with",
-    "see",
-    "flip",
-    "catch",
-    "cook",
-    "receive",
-    "celebrate",
-    "look",
-    "classic",
-    "bridal",
-    "indoor",
-    "industrial",
-    "teenage",
-    "mini",
-    "grassy",
-    "aged",
-    "long",
-    "warm",
-    "light",
-    "handsome",
-    "happy",
-    "three",
-    "pregnant",
-    "circular",
-    "urban",
-    "silver",
-    "ceramic",
-    "3d",
-    "green",
-    "blonde",
-    "golden",
-    "dark",
-    "tropical",
-    "ripe",
-    "deep",
-    "fat",
-    "musical",
-    "giant",
-    "medical",
-    "medieval",
-    "bare",
-    "stunning",
-    "bold",
-    "geographical",
-    "huge",
-    "plastic",
-    "foggy",
-    "stormy",
-    "gothic",
-    "biological",
-    "empty",
-    "clear",
-    "antique",
-    "pink",
-    "steep",
-    "brown",
-    "striped",
-    "aerial",
-    "rainy",
-    "cool",
-    "flying",
-    "commercial",
-    "purple",
-    "trendy",
-    "blank",
-    "haired",
-    "dead",
-    "wooden",
-    "flat",
-    "high",
-    "beige",
-    "panoramic",
-    "angry",
-    "dozen",
-    "rural",
-    "solar",
-    "big",
-    "small",
-    "stained",
-    "thick",
-    "many",
-    "fresh",
-    "clean",
-    "strong",
-    "abstract",
-    "crowded",
-    "retro",
-    "dry",
-    "gorgeous",
-    "martial",
-    "modern",
-    "blue",
-    "cloudy",
-    "low",
-    "four",
-    "outdoor",
-    "single",
-    "much",
-    "beautiful",
-    "snowy",
-    "pretty",
-    "new",
-    "short",
-    "sunny",
-    "closed",
-    "rocky",
-    "red",
-    "two",
-    "double",
-    "male",
-    "gray",
-    "five",
-    "colorful",
-    "automotive",
-    "various",
-    "one",
-    "old",
-    "rusty",
-    "tall",
-    "wild",
-    "narrow",
-    "natural",
-    "several",
-    "frozen",
-    "textured",
-    "lush",
-    "young",
-    "hot",
-    "mixed",
-    "white",
-    "float",
-    "quiet",
-    "round",
-    "bright",
-    "religious",
-    "female",
-    "historical",
-    "shiny",
-    "traditional",
-    "tourist",
-    "yellow",
-    "bald",
-    "coastal",
-    "lovely",
-    "little",
-    "broken",
-    "romantic",
-    "wide",
-    "royal",
-    "rich",
-    "open",
-    "cute",
-    "ancient",
-    "cold",
-    "political",
-    "elderly",
-    "gold",
-    "full",
-    "rustic",
-    "metallic",
-    "floral",
-    "sad",
-    "wet",
-    "fancy",
-    "senior",
-    "tiny",
-    "stylish",
-    "large",
-    "frosty",
-    "orange",
-    "transparent",
-    "electronic",
-    "shallow",
-    "scared",
-    "armed",
-    "dirty",
-    "historic",
-    "black",
-    "few",
-    "windy",
-    "some",
-    "square",
-    "ornamental",
-    "sandy",
-    "thin",
-]
-
-
-tra_array = np.array(tra_array)
diff --git a/eval/vbench/third_party/tag2Text/vit.py b/eval/vbench/third_party/tag2Text/vit.py
deleted file mode 100644
index 55ba038a..00000000
--- a/eval/vbench/third_party/tag2Text/vit.py
+++ /dev/null
@@ -1,427 +0,0 @@
-"""
- * Copyright (c) 2022, salesforce.com, inc.
- * All rights reserved.
- * SPDX-License-Identifier: BSD-3-Clause
- * For full license text, see LICENSE.txt file in the repo root or https://opensource.org/licenses/BSD-3-Clause
- * By Junnan Li
- * Based on timm code base
- * https://github.com/rwightman/pytorch-image-models/tree/master/timm
-"""
-
-from functools import partial
-
-import torch
-import torch.nn as nn
-import torch.nn.functional as F
-from fairscale.nn.checkpoint.checkpoint_activations import checkpoint_wrapper
-from timm.models.helpers import adapt_input_conv, named_apply
-from timm.models.layers import DropPath, trunc_normal_
-from timm.models.registry import register_model
-from timm.models.vision_transformer import PatchEmbed, _cfg
-
-
-class Mlp(nn.Module):
-    """MLP as used in Vision Transformer, MLP-Mixer and related networks"""
-
-    def __init__(
-        self,
-        in_features,
-        hidden_features=None,
-        out_features=None,
-        act_layer=nn.GELU,
-        drop=0.0,
-    ):
-        super().__init__()
-        out_features = out_features or in_features
-        hidden_features = hidden_features or in_features
-        self.fc1 = nn.Linear(in_features, hidden_features)
-        self.act = act_layer()
-        self.fc2 = nn.Linear(hidden_features, out_features)
-        self.drop = nn.Dropout(drop)
-
-    def forward(self, x):
-        x = self.fc1(x)
-        x = self.act(x)
-        x = self.drop(x)
-        x = self.fc2(x)
-        x = self.drop(x)
-        return x
-
-
-class Attention(nn.Module):
-    def __init__(
-        self,
-        dim,
-        num_heads=8,
-        qkv_bias=False,
-        qk_scale=None,
-        attn_drop=0.0,
-        proj_drop=0.0,
-    ):
-        super().__init__()
-        self.num_heads = num_heads
-        head_dim = dim // num_heads
-        # NOTE scale factor was wrong in my original version, can set manually to be compat with prev weights
-        self.scale = qk_scale or head_dim**-0.5
-        self.qkv = nn.Linear(dim, dim * 3, bias=qkv_bias)
-        self.attn_drop = nn.Dropout(attn_drop)
-        self.proj = nn.Linear(dim, dim)
-        self.proj_drop = nn.Dropout(proj_drop)
-        self.attn_gradients = None
-        self.attention_map = None
-
-    def save_attn_gradients(self, attn_gradients):
-        self.attn_gradients = attn_gradients
-
-    def get_attn_gradients(self):
-        return self.attn_gradients
-
-    def save_attention_map(self, attention_map):
-        self.attention_map = attention_map
-
-    def get_attention_map(self):
-        return self.attention_map
-
-    def forward(self, x, register_hook=False):
-        B, N, C = x.shape
-        qkv = (
-            self.qkv(x)
-            .reshape(B, N, 3, self.num_heads, C // self.num_heads)
-            .permute(2, 0, 3, 1, 4)
-        )
-        q, k, v = (
-            qkv[0],
-            qkv[1],
-            qkv[2],
-        )  # make torchscript happy (cannot use tensor as tuple)
-
-        attn = (q @ k.transpose(-2, -1)) * self.scale
-        attn = attn.softmax(dim=-1)
-        attn = self.attn_drop(attn)
-
-        if register_hook:
-            self.save_attention_map(attn)
-            attn.register_hook(self.save_attn_gradients)
-
-        x = (attn @ v).transpose(1, 2).reshape(B, N, C)
-        x = self.proj(x)
-        x = self.proj_drop(x)
-        return x
-
-
-class Block(nn.Module):
-
-    def __init__(
-        self,
-        dim,
-        num_heads,
-        mlp_ratio=4.0,
-        qkv_bias=False,
-        qk_scale=None,
-        drop=0.0,
-        attn_drop=0.0,
-        drop_path=0.0,
-        act_layer=nn.GELU,
-        norm_layer=nn.LayerNorm,
-        use_grad_checkpointing=False,
-    ):
-        super().__init__()
-        self.norm1 = norm_layer(dim)
-        self.attn = Attention(
-            dim,
-            num_heads=num_heads,
-            qkv_bias=qkv_bias,
-            qk_scale=qk_scale,
-            attn_drop=attn_drop,
-            proj_drop=drop,
-        )
-        # NOTE: drop path for stochastic depth, we shall see if this is better than dropout here
-        self.drop_path = DropPath(drop_path) if drop_path > 0.0 else nn.Identity()
-        self.norm2 = norm_layer(dim)
-        mlp_hidden_dim = int(dim * mlp_ratio)
-        self.mlp = Mlp(
-            in_features=dim,
-            hidden_features=mlp_hidden_dim,
-            act_layer=act_layer,
-            drop=drop,
-        )
-
-        if use_grad_checkpointing:
-            self.attn = checkpoint_wrapper(self.attn)
-            self.mlp = checkpoint_wrapper(self.mlp)
-
-    def forward(self, x, register_hook=False):
-        x = x + self.drop_path(self.attn(self.norm1(x), register_hook=register_hook))
-        x = x + self.drop_path(self.mlp(self.norm2(x)))
-        return x
-
-
-class VisionTransformer(nn.Module):
-    """Vision Transformer
-    A PyTorch impl of : `An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale`  -
-        https://arxiv.org/abs/2010.11929
-    """
-
-    def __init__(
-        self,
-        img_size=224,
-        patch_size=16,
-        in_chans=3,
-        num_classes=1000,
-        embed_dim=768,
-        depth=12,
-        num_heads=12,
-        mlp_ratio=4.0,
-        qkv_bias=True,
-        qk_scale=None,
-        representation_size=None,
-        drop_rate=0.0,
-        attn_drop_rate=0.0,
-        drop_path_rate=0.0,
-        norm_layer=None,
-        use_grad_checkpointing=False,
-        ckpt_layer=0,
-    ):
-        """
-        Args:
-            img_size (int, tuple): input image size
-            patch_size (int, tuple): patch size
-            in_chans (int): number of input channels
-            num_classes (int): number of classes for classification head
-            embed_dim (int): embedding dimension
-            depth (int): depth of transformer
-            num_heads (int): number of attention heads
-            mlp_ratio (int): ratio of mlp hidden dim to embedding dim
-            qkv_bias (bool): enable bias for qkv if True
-            qk_scale (float): override default qk scale of head_dim ** -0.5 if set
-            representation_size (Optional[int]): enable and set representation layer (pre-logits) to this value if set
-            drop_rate (float): dropout rate
-            attn_drop_rate (float): attention dropout rate
-            drop_path_rate (float): stochastic depth rate
-            norm_layer: (nn.Module): normalization layer
-        """
-        super().__init__()
-        self.num_features = self.embed_dim = (
-            embed_dim  # num_features for consistency with other models
-        )
-        norm_layer = norm_layer or partial(nn.LayerNorm, eps=1e-6)
-
-        self.patch_embed = PatchEmbed(
-            img_size=img_size,
-            patch_size=patch_size,
-            in_chans=in_chans,
-            embed_dim=embed_dim,
-        )
-
-        num_patches = self.patch_embed.num_patches
-
-        self.cls_token = nn.Parameter(torch.zeros(1, 1, embed_dim))
-        self.pos_embed = nn.Parameter(torch.zeros(1, num_patches + 1, embed_dim))
-        self.pos_drop = nn.Dropout(p=drop_rate)
-
-        dpr = [
-            x.item() for x in torch.linspace(0, drop_path_rate, depth)
-        ]  # stochastic depth decay rule
-        self.blocks = nn.ModuleList(
-            [
-                Block(
-                    dim=embed_dim,
-                    num_heads=num_heads,
-                    mlp_ratio=mlp_ratio,
-                    qkv_bias=qkv_bias,
-                    qk_scale=qk_scale,
-                    drop=drop_rate,
-                    attn_drop=attn_drop_rate,
-                    drop_path=dpr[i],
-                    norm_layer=norm_layer,
-                    use_grad_checkpointing=(
-                        use_grad_checkpointing and i >= depth - ckpt_layer
-                    ),
-                )
-                for i in range(depth)
-            ]
-        )
-        self.norm = norm_layer(embed_dim)
-
-        trunc_normal_(self.pos_embed, std=0.02)
-        trunc_normal_(self.cls_token, std=0.02)
-        self.apply(self._init_weights)
-
-    def _init_weights(self, m):
-        if isinstance(m, nn.Linear):
-            trunc_normal_(m.weight, std=0.02)
-            if isinstance(m, nn.Linear) and m.bias is not None:
-                nn.init.constant_(m.bias, 0)
-        elif isinstance(m, nn.LayerNorm):
-            nn.init.constant_(m.bias, 0)
-            nn.init.constant_(m.weight, 1.0)
-
-    @torch.jit.ignore
-    def no_weight_decay(self):
-        return {"pos_embed", "cls_token"}
-
-    def forward(self, x, register_blk=-1):
-        B = x.shape[0]
-        x = self.patch_embed(x)
-
-        cls_tokens = self.cls_token.expand(
-            B, -1, -1
-        )  # stole cls_tokens impl from Phil Wang, thanks
-        x = torch.cat((cls_tokens, x), dim=1)
-
-        x = x + self.pos_embed[:, : x.size(1), :]
-        x = self.pos_drop(x)
-
-        for i, blk in enumerate(self.blocks):
-            x = blk(x, register_blk == i)
-        x = self.norm(x)
-
-        return x
-
-    @torch.jit.ignore()
-    def load_pretrained(self, checkpoint_path, prefix=""):
-        _load_weights(self, checkpoint_path, prefix)
-
-
-@torch.no_grad()
-def _load_weights(model: VisionTransformer, checkpoint_path: str, prefix: str = ""):
-    """Load weights from .npz checkpoints for official Google Brain Flax implementation"""
-    import numpy as np
-
-    def _n2p(w, t=True):
-        if w.ndim == 4 and w.shape[0] == w.shape[1] == w.shape[2] == 1:
-            w = w.flatten()
-        if t:
-            if w.ndim == 4:
-                w = w.transpose([3, 2, 0, 1])
-            elif w.ndim == 3:
-                w = w.transpose([2, 0, 1])
-            elif w.ndim == 2:
-                w = w.transpose([1, 0])
-        return torch.from_numpy(w)
-
-    w = np.load(checkpoint_path)
-    if not prefix and "opt/target/embedding/kernel" in w:
-        prefix = "opt/target/"
-
-    if hasattr(model.patch_embed, "backbone"):
-        # hybrid
-        backbone = model.patch_embed.backbone
-        stem_only = not hasattr(backbone, "stem")
-        stem = backbone if stem_only else backbone.stem
-        stem.conv.weight.copy_(
-            adapt_input_conv(
-                stem.conv.weight.shape[1], _n2p(w[f"{prefix}conv_root/kernel"])
-            )
-        )
-        stem.norm.weight.copy_(_n2p(w[f"{prefix}gn_root/scale"]))
-        stem.norm.bias.copy_(_n2p(w[f"{prefix}gn_root/bias"]))
-        if not stem_only:
-            for i, stage in enumerate(backbone.stages):
-                for j, block in enumerate(stage.blocks):
-                    bp = f"{prefix}block{i + 1}/unit{j + 1}/"
-                    for r in range(3):
-                        getattr(block, f"conv{r + 1}").weight.copy_(
-                            _n2p(w[f"{bp}conv{r + 1}/kernel"])
-                        )
-                        getattr(block, f"norm{r + 1}").weight.copy_(
-                            _n2p(w[f"{bp}gn{r + 1}/scale"])
-                        )
-                        getattr(block, f"norm{r + 1}").bias.copy_(
-                            _n2p(w[f"{bp}gn{r + 1}/bias"])
-                        )
-                    if block.downsample is not None:
-                        block.downsample.conv.weight.copy_(
-                            _n2p(w[f"{bp}conv_proj/kernel"])
-                        )
-                        block.downsample.norm.weight.copy_(
-                            _n2p(w[f"{bp}gn_proj/scale"])
-                        )
-                        block.downsample.norm.bias.copy_(_n2p(w[f"{bp}gn_proj/bias"]))
-        embed_conv_w = _n2p(w[f"{prefix}embedding/kernel"])
-    else:
-        embed_conv_w = adapt_input_conv(
-            model.patch_embed.proj.weight.shape[1], _n2p(w[f"{prefix}embedding/kernel"])
-        )
-    model.patch_embed.proj.weight.copy_(embed_conv_w)
-    model.patch_embed.proj.bias.copy_(_n2p(w[f"{prefix}embedding/bias"]))
-    model.cls_token.copy_(_n2p(w[f"{prefix}cls"], t=False))
-    pos_embed_w = _n2p(w[f"{prefix}Transformer/posembed_input/pos_embedding"], t=False)
-    if pos_embed_w.shape != model.pos_embed.shape:
-        pos_embed_w = resize_pos_embed(  # resize pos embedding when different size from pretrained weights
-            pos_embed_w,
-            model.pos_embed,
-            getattr(model, "num_tokens", 1),
-            model.patch_embed.grid_size,
-        )
-    model.pos_embed.copy_(pos_embed_w)
-    model.norm.weight.copy_(_n2p(w[f"{prefix}Transformer/encoder_norm/scale"]))
-    model.norm.bias.copy_(_n2p(w[f"{prefix}Transformer/encoder_norm/bias"]))
-    #     if isinstance(model.head, nn.Linear) and model.head.bias.shape[0] == w[f'{prefix}head/bias'].shape[-1]:
-    #         model.head.weight.copy_(_n2p(w[f'{prefix}head/kernel']))
-    #         model.head.bias.copy_(_n2p(w[f'{prefix}head/bias']))
-    #     if isinstance(getattr(model.pre_logits, 'fc', None), nn.Linear) and f'{prefix}pre_logits/bias' in w:
-    #         model.pre_logits.fc.weight.copy_(_n2p(w[f'{prefix}pre_logits/kernel']))
-    #         model.pre_logits.fc.bias.copy_(_n2p(w[f'{prefix}pre_logits/bias']))
-    for i, block in enumerate(model.blocks.children()):
-        block_prefix = f"{prefix}Transformer/encoderblock_{i}/"
-        mha_prefix = block_prefix + "MultiHeadDotProductAttention_1/"
-        block.norm1.weight.copy_(_n2p(w[f"{block_prefix}LayerNorm_0/scale"]))
-        block.norm1.bias.copy_(_n2p(w[f"{block_prefix}LayerNorm_0/bias"]))
-        block.attn.qkv.weight.copy_(
-            torch.cat(
-                [
-                    _n2p(w[f"{mha_prefix}{n}/kernel"], t=False).flatten(1).T
-                    for n in ("query", "key", "value")
-                ]
-            )
-        )
-        block.attn.qkv.bias.copy_(
-            torch.cat(
-                [
-                    _n2p(w[f"{mha_prefix}{n}/bias"], t=False).reshape(-1)
-                    for n in ("query", "key", "value")
-                ]
-            )
-        )
-        block.attn.proj.weight.copy_(_n2p(w[f"{mha_prefix}out/kernel"]).flatten(1))
-        block.attn.proj.bias.copy_(_n2p(w[f"{mha_prefix}out/bias"]))
-        for r in range(2):
-            getattr(block.mlp, f"fc{r + 1}").weight.copy_(
-                _n2p(w[f"{block_prefix}MlpBlock_3/Dense_{r}/kernel"])
-            )
-            getattr(block.mlp, f"fc{r + 1}").bias.copy_(
-                _n2p(w[f"{block_prefix}MlpBlock_3/Dense_{r}/bias"])
-            )
-        block.norm2.weight.copy_(_n2p(w[f"{block_prefix}LayerNorm_2/scale"]))
-        block.norm2.bias.copy_(_n2p(w[f"{block_prefix}LayerNorm_2/bias"]))
-
-
-def interpolate_pos_embed(pos_embed_checkpoint, visual_encoder):
-    # interpolate position embedding
-    embedding_size = pos_embed_checkpoint.shape[-1]
-    num_patches = visual_encoder.patch_embed.num_patches
-    num_extra_tokens = visual_encoder.pos_embed.shape[-2] - num_patches
-    # height (== width) for the checkpoint position embedding
-    orig_size = int((pos_embed_checkpoint.shape[-2] - num_extra_tokens) ** 0.5)
-    # height (== width) for the new position embedding
-    new_size = int(num_patches**0.5)
-
-    if orig_size != new_size:
-        # class_token and dist_token are kept unchanged
-        extra_tokens = pos_embed_checkpoint[:, :num_extra_tokens]
-        # only the position tokens are interpolated
-        pos_tokens = pos_embed_checkpoint[:, num_extra_tokens:]
-        pos_tokens = pos_tokens.reshape(
-            -1, orig_size, orig_size, embedding_size
-        ).permute(0, 3, 1, 2)
-        pos_tokens = torch.nn.functional.interpolate(
-            pos_tokens, size=(new_size, new_size), mode="bicubic", align_corners=False
-        )
-        pos_tokens = pos_tokens.permute(0, 2, 3, 1).flatten(1, 2)
-        new_pos_embed = torch.cat((extra_tokens, pos_tokens), dim=1)
-        print("reshape position embedding from %d to %d" % (orig_size**2, new_size**2))
-
-        return new_pos_embed
-    else:
-        return pos_embed_checkpoint
diff --git a/eval/vbench/third_party/umt/__init__.py b/eval/vbench/third_party/umt/__init__.py
deleted file mode 100644
index e69de29b..00000000
diff --git a/eval/vbench/third_party/umt/datasets/__init__.py b/eval/vbench/third_party/umt/datasets/__init__.py
deleted file mode 100644
index 07f320f9..00000000
--- a/eval/vbench/third_party/umt/datasets/__init__.py
+++ /dev/null
@@ -1 +0,0 @@
-from .build import build_dataset, build_pretraining_dataset
diff --git a/eval/vbench/third_party/umt/datasets/build.py b/eval/vbench/third_party/umt/datasets/build.py
deleted file mode 100644
index d80ae86f..00000000
--- a/eval/vbench/third_party/umt/datasets/build.py
+++ /dev/null
@@ -1,243 +0,0 @@
-import os
-
-from torchvision import transforms
-
-from .kinetics import VideoClsDataset
-from .kinetics_sparse import VideoClsDataset_sparse
-from .mae import VideoMAE
-from .masking_generator import RandomMaskingGenerator, TubeMaskingGenerator
-from .ssv2 import SSRawFrameClsDataset, SSVideoClsDataset
-from .transforms import *
-
-
-class DataAugmentationForVideoMAE(object):
-    def __init__(self, args):
-        self.input_mean = [0.485, 0.456, 0.406]  # IMAGENET_DEFAULT_MEAN
-        self.input_std = [0.229, 0.224, 0.225]  # IMAGENET_DEFAULT_STD
-        normalize = GroupNormalize(self.input_mean, self.input_std)
-        self.train_augmentation = GroupMultiScaleCrop(
-            args.input_size, [1, 0.875, 0.75, 0.66]
-        )
-        if args.color_jitter > 0:
-            self.transform = transforms.Compose(
-                [
-                    self.train_augmentation,
-                    GroupColorJitter(args.color_jitter),
-                    GroupRandomHorizontalFlip(flip=args.flip),
-                    Stack(roll=False),
-                    ToTorchFormatTensor(div=True),
-                    normalize,
-                ]
-            )
-        else:
-            self.transform = transforms.Compose(
-                [
-                    self.train_augmentation,
-                    GroupRandomHorizontalFlip(flip=args.flip),
-                    Stack(roll=False),
-                    ToTorchFormatTensor(div=True),
-                    normalize,
-                ]
-            )
-        if args.mask_type == "tube":
-            self.masked_position_generator = TubeMaskingGenerator(
-                args.window_size, args.mask_ratio
-            )
-        elif args.mask_type == "random":
-            self.masked_position_generator = RandomMaskingGenerator(
-                args.window_size, args.mask_ratio
-            )
-        elif args.mask_type in "attention":
-            self.masked_position_generator = None
-
-    def __call__(self, images):
-        process_data, _ = self.transform(images)
-        if self.masked_position_generator is None:
-            return process_data, -1
-        else:
-            return process_data, self.masked_position_generator()
-
-    def __repr__(self):
-        repr = "(DataAugmentationForVideoMAE,\n"
-        repr += "  transform = %s,\n" % str(self.transform)
-        repr += "  Masked position generator = %s,\n" % str(
-            self.masked_position_generator
-        )
-        repr += ")"
-        return repr
-
-
-def build_pretraining_dataset(args):
-    transform = DataAugmentationForVideoMAE(args)
-    dataset = VideoMAE(
-        root=None,
-        setting=args.data_path,
-        prefix=args.prefix,
-        split=args.split,
-        video_ext="mp4",
-        is_color=True,
-        modality="rgb",
-        num_segments=args.num_segments,
-        new_length=args.num_frames,
-        new_step=args.sampling_rate,
-        transform=transform,
-        temporal_jitter=False,
-        video_loader=True,
-        use_decord=args.use_decord,
-        lazy_init=False,
-        num_sample=args.num_sample,
-    )
-    print("Data Aug = %s" % str(transform))
-    return dataset
-
-
-def build_dataset(is_train, test_mode, args):
-    print(f"Use Dataset: {args.data_set}")
-    if args.data_set in ["Kinetics", "Kinetics_sparse", "mitv1_sparse"]:
-        mode = None
-        anno_path = None
-        if is_train is True:
-            mode = "train"
-            anno_path = os.path.join(args.data_path, "train.csv")
-        elif test_mode is True:
-            mode = "test"
-            anno_path = os.path.join(args.data_path, "test.csv")
-        else:
-            mode = "validation"
-            anno_path = os.path.join(args.data_path, "val.csv")
-
-        if "sparse" in args.data_set:
-            func = VideoClsDataset_sparse
-        else:
-            func = VideoClsDataset
-
-        dataset = func(
-            anno_path=anno_path,
-            prefix=args.prefix,
-            split=args.split,
-            mode=mode,
-            clip_len=args.num_frames,
-            frame_sample_rate=args.sampling_rate,
-            num_segment=1,
-            test_num_segment=args.test_num_segment,
-            test_num_crop=args.test_num_crop,
-            num_crop=1 if not test_mode else 3,
-            keep_aspect_ratio=True,
-            crop_size=args.input_size,
-            short_side_size=args.short_side_size,
-            new_height=256,
-            new_width=320,
-            args=args,
-        )
-
-        nb_classes = args.nb_classes
-
-    elif args.data_set == "SSV2":
-        mode = None
-        anno_path = None
-        if is_train is True:
-            mode = "train"
-            anno_path = os.path.join(args.data_path, "train.csv")
-        elif test_mode is True:
-            mode = "test"
-            anno_path = os.path.join(args.data_path, "test.csv")
-        else:
-            mode = "validation"
-            anno_path = os.path.join(args.data_path, "val.csv")
-
-        if args.use_decord:
-            func = SSVideoClsDataset
-        else:
-            func = SSRawFrameClsDataset
-
-        dataset = func(
-            anno_path=anno_path,
-            prefix=args.prefix,
-            split=args.split,
-            mode=mode,
-            clip_len=1,
-            num_segment=args.num_frames,
-            test_num_segment=args.test_num_segment,
-            test_num_crop=args.test_num_crop,
-            num_crop=1 if not test_mode else 3,
-            keep_aspect_ratio=True,
-            crop_size=args.input_size,
-            short_side_size=args.short_side_size,
-            new_height=256,
-            new_width=320,
-            args=args,
-        )
-        nb_classes = 174
-
-    elif args.data_set == "UCF101":
-        mode = None
-        anno_path = None
-        if is_train is True:
-            mode = "train"
-            anno_path = os.path.join(args.data_path, "train.csv")
-        elif test_mode is True:
-            mode = "test"
-            anno_path = os.path.join(args.data_path, "test.csv")
-        else:
-            mode = "validation"
-            anno_path = os.path.join(args.data_path, "val.csv")
-
-        dataset = VideoClsDataset(
-            anno_path=anno_path,
-            prefix=args.prefix,
-            split=args.split,
-            mode=mode,
-            clip_len=args.num_frames,
-            frame_sample_rate=args.sampling_rate,
-            num_segment=1,
-            test_num_segment=args.test_num_segment,
-            test_num_crop=args.test_num_crop,
-            num_crop=1 if not test_mode else 3,
-            keep_aspect_ratio=True,
-            crop_size=args.input_size,
-            short_side_size=args.short_side_size,
-            new_height=256,
-            new_width=320,
-            args=args,
-        )
-        nb_classes = 101
-
-    elif args.data_set == "HMDB51":
-        mode = None
-        anno_path = None
-        if is_train is True:
-            mode = "train"
-            anno_path = os.path.join(args.data_path, "train.csv")
-        elif test_mode is True:
-            mode = "test"
-            anno_path = os.path.join(args.data_path, "test.csv")
-        else:
-            mode = "validation"
-            anno_path = os.path.join(args.data_path, "val.csv")
-
-        dataset = VideoClsDataset(
-            anno_path=anno_path,
-            prefix=args.prefix,
-            split=args.split,
-            mode=mode,
-            clip_len=args.num_frames,
-            frame_sample_rate=args.sampling_rate,
-            num_segment=1,
-            test_num_segment=args.test_num_segment,
-            test_num_crop=args.test_num_crop,
-            num_crop=1 if not test_mode else 3,
-            keep_aspect_ratio=True,
-            crop_size=args.input_size,
-            short_side_size=args.short_side_size,
-            new_height=256,
-            new_width=320,
-            args=args,
-        )
-        nb_classes = 51
-    else:
-        print(f"Wrong: {args.data_set}")
-        raise NotImplementedError()
-    assert nb_classes == args.nb_classes
-    print("Number of the class = %d" % args.nb_classes)
-
-    return dataset, nb_classes
diff --git a/eval/vbench/third_party/umt/datasets/kinetics.py b/eval/vbench/third_party/umt/datasets/kinetics.py
deleted file mode 100644
index b83f43c1..00000000
--- a/eval/vbench/third_party/umt/datasets/kinetics.py
+++ /dev/null
@@ -1,464 +0,0 @@
-import io
-import os
-import warnings
-
-import numpy as np
-import torch
-from decord import VideoReader, cpu
-from numpy.lib.function_base import disp
-from torch.utils.data import Dataset
-from torchvision import transforms
-
-from .random_erasing import RandomErasing
-from .video_transforms import (
-    CenterCrop,
-    Compose,
-    Normalize,
-    Resize,
-    create_random_augment,
-    horizontal_flip,
-    random_crop,
-    random_resized_crop,
-    random_resized_crop_with_shift,
-    random_short_side_scale_jitter,
-    uniform_crop,
-)
-from .volume_transforms import ClipToTensor
-
-try:
-    from petrel_client.client import Client
-
-    has_client = True
-except ImportError:
-    has_client = False
-
-
-class VideoClsDataset(Dataset):
-    """Load your own video classification dataset."""
-
-    def __init__(
-        self,
-        anno_path,
-        prefix="",
-        split=" ",
-        mode="train",
-        clip_len=8,
-        frame_sample_rate=2,
-        crop_size=224,
-        short_side_size=256,
-        new_height=256,
-        new_width=340,
-        keep_aspect_ratio=True,
-        num_segment=1,
-        num_crop=1,
-        test_num_segment=10,
-        test_num_crop=3,
-        args=None,
-    ):
-        self.anno_path = anno_path
-        self.prefix = prefix
-        self.split = split
-        self.mode = mode
-        self.clip_len = clip_len
-        self.frame_sample_rate = frame_sample_rate
-        self.crop_size = crop_size
-        self.short_side_size = short_side_size
-        self.new_height = new_height
-        self.new_width = new_width
-        self.keep_aspect_ratio = keep_aspect_ratio
-        self.num_segment = num_segment
-        self.test_num_segment = test_num_segment
-        self.num_crop = num_crop
-        self.test_num_crop = test_num_crop
-        self.args = args
-        self.aug = False
-        self.rand_erase = False
-        assert num_segment == 1
-        if self.mode in ["train"]:
-            self.aug = True
-            if self.args.reprob > 0:
-                self.rand_erase = True
-        if VideoReader is None:
-            raise ImportError(
-                "Unable to import `decord` which is required to read videos."
-            )
-
-        import pandas as pd
-
-        cleaned = pd.read_csv(self.anno_path, header=None, delimiter=self.split)
-        self.dataset_samples = list(cleaned.values[:, 0])
-        self.label_array = list(cleaned.values[:, 1])
-
-        self.client = None
-        if has_client:
-            self.client = Client("~/petreloss.conf")
-
-        if mode == "train":
-            pass
-
-        elif mode == "validation":
-            self.data_transform = Compose(
-                [
-                    Resize(self.short_side_size, interpolation="bilinear"),
-                    CenterCrop(size=(self.crop_size, self.crop_size)),
-                    ClipToTensor(),
-                    Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
-                ]
-            )
-        elif mode == "test":
-            self.data_resize = Compose(
-                [Resize(size=(short_side_size), interpolation="bilinear")]
-            )
-            self.data_transform = Compose(
-                [
-                    ClipToTensor(),
-                    Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
-                ]
-            )
-            self.test_seg = []
-            self.test_dataset = []
-            self.test_label_array = []
-            for ck in range(self.test_num_segment):
-                for cp in range(self.test_num_crop):
-                    for idx in range(len(self.label_array)):
-                        sample_label = self.label_array[idx]
-                        self.test_label_array.append(sample_label)
-                        self.test_dataset.append(self.dataset_samples[idx])
-                        self.test_seg.append((ck, cp))
-
-    def __getitem__(self, index):
-        if self.mode == "train":
-            args = self.args
-            scale_t = 1
-
-            sample = self.dataset_samples[index]
-            buffer = self.loadvideo_decord(sample, sample_rate_scale=scale_t)  # T H W C
-            if len(buffer) == 0:
-                while len(buffer) == 0:
-                    warnings.warn(
-                        "video {} not correctly loaded during training".format(sample)
-                    )
-                    index = np.random.randint(self.__len__())
-                    sample = self.dataset_samples[index]
-                    buffer = self.loadvideo_decord(sample, sample_rate_scale=scale_t)
-
-            if args.num_sample > 1:
-                frame_list = []
-                label_list = []
-                index_list = []
-                for _ in range(args.num_sample):
-                    new_frames = self._aug_frame(buffer, args)
-                    label = self.label_array[index]
-                    frame_list.append(new_frames)
-                    label_list.append(label)
-                    index_list.append(index)
-                return frame_list, label_list, index_list, {}
-            else:
-                buffer = self._aug_frame(buffer, args)
-
-            return buffer, self.label_array[index], index, {}
-
-        elif self.mode == "validation":
-            sample = self.dataset_samples[index]
-            buffer = self.loadvideo_decord(sample)
-            if len(buffer) == 0:
-                while len(buffer) == 0:
-                    warnings.warn(
-                        "video {} not correctly loaded during validation".format(sample)
-                    )
-                    index = np.random.randint(self.__len__())
-                    sample = self.dataset_samples[index]
-                    buffer = self.loadvideo_decord(sample)
-            buffer = self.data_transform(buffer)
-            return buffer, self.label_array[index], sample.split("/")[-1].split(".")[0]
-
-        elif self.mode == "test":
-            sample = self.test_dataset[index]
-            chunk_nb, split_nb = self.test_seg[index]
-            buffer = self.loadvideo_decord(sample, chunk_nb=chunk_nb)
-
-            while len(buffer) == 0:
-                warnings.warn(
-                    "video {}, temporal {}, spatial {} not found during testing".format(
-                        str(self.test_dataset[index]), chunk_nb, split_nb
-                    )
-                )
-                index = np.random.randint(self.__len__())
-                sample = self.test_dataset[index]
-                chunk_nb, split_nb = self.test_seg[index]
-                buffer = self.loadvideo_decord(sample, chunk_nb=chunk_nb)
-
-            buffer = self.data_resize(buffer)
-            if isinstance(buffer, list):
-                buffer = np.stack(buffer, 0)
-
-            if self.test_num_crop == 1:
-                spatial_step = (
-                    1.0
-                    * (max(buffer.shape[1], buffer.shape[2]) - self.short_side_size)
-                    / 2
-                )
-                spatial_start = int(spatial_step)
-            else:
-                spatial_step = (
-                    1.0
-                    * (max(buffer.shape[1], buffer.shape[2]) - self.short_side_size)
-                    / (self.test_num_crop - 1)
-                )
-                spatial_start = int(split_nb * spatial_step)
-            if buffer.shape[1] >= buffer.shape[2]:
-                buffer = buffer[
-                    :, spatial_start : spatial_start + self.short_side_size, :, :
-                ]
-            else:
-                buffer = buffer[
-                    :, :, spatial_start : spatial_start + self.short_side_size, :
-                ]
-
-            buffer = self.data_transform(buffer)
-            return (
-                buffer,
-                self.test_label_array[index],
-                sample.split("/")[-1].split(".")[0],
-                chunk_nb,
-                split_nb,
-            )
-        else:
-            raise NameError("mode {} unkown".format(self.mode))
-
-    def _aug_frame(
-        self,
-        buffer,
-        args,
-    ):
-
-        aug_transform = create_random_augment(
-            input_size=(self.crop_size, self.crop_size),
-            auto_augment=args.aa,
-            interpolation=args.train_interpolation,
-        )
-
-        buffer = [transforms.ToPILImage()(frame) for frame in buffer]
-
-        buffer = aug_transform(buffer)
-
-        buffer = [transforms.ToTensor()(img) for img in buffer]
-        buffer = torch.stack(buffer)  # T C H W
-        buffer = buffer.permute(0, 2, 3, 1)  # T H W C
-
-        # T H W C
-        buffer = tensor_normalize(buffer, [0.485, 0.456, 0.406], [0.229, 0.224, 0.225])
-        # T H W C -> C T H W.
-        buffer = buffer.permute(3, 0, 1, 2)
-        # Perform data augmentation.
-        scl, asp = (
-            [0.08, 1.0],
-            [0.75, 1.3333],
-        )
-
-        buffer = spatial_sampling(
-            buffer,
-            spatial_idx=-1,
-            min_scale=256,
-            max_scale=320,
-            crop_size=self.crop_size,
-            random_horizontal_flip=False if args.data_set == "SSV2" else True,
-            inverse_uniform_sampling=False,
-            aspect_ratio=asp,
-            scale=scl,
-            motion_shift=False,
-        )
-
-        if self.rand_erase:
-            erase_transform = RandomErasing(
-                args.reprob,
-                mode=args.remode,
-                max_count=args.recount,
-                num_splits=args.recount,
-                device="cpu",
-            )
-            buffer = buffer.permute(1, 0, 2, 3)
-            buffer = erase_transform(buffer)
-            buffer = buffer.permute(1, 0, 2, 3)
-
-        return buffer
-
-    def loadvideo_decord(self, sample, sample_rate_scale=1, chunk_nb=0):
-        """Load video content using Decord"""
-        fname = sample
-        fname = os.path.join(self.prefix, fname)
-
-        try:
-            if self.keep_aspect_ratio:
-                if fname.startswith("s3"):
-                    video_bytes = self.client.get(fname)
-                    vr = VideoReader(io.BytesIO(video_bytes), num_threads=1, ctx=cpu(0))
-                else:
-                    vr = VideoReader(fname, num_threads=1, ctx=cpu(0))
-            else:
-                if fname.startswith("s3:"):
-                    video_bytes = self.client.get(fname)
-                    vr = VideoReader(
-                        io.BytesIO(video_bytes),
-                        width=self.new_width,
-                        height=self.new_height,
-                        num_threads=1,
-                        ctx=cpu(0),
-                    )
-                else:
-                    vr = VideoReader(
-                        fname,
-                        width=self.new_width,
-                        height=self.new_height,
-                        num_threads=1,
-                        ctx=cpu(0),
-                    )
-
-            # handle temporal segments
-            converted_len = int(self.clip_len * self.frame_sample_rate)
-            seg_len = len(vr) // self.num_segment
-
-            if self.mode == "test":
-                temporal_step = max(
-                    1.0 * (len(vr) - converted_len) / (self.test_num_segment - 1), 0
-                )
-                temporal_start = int(chunk_nb * temporal_step)
-
-                bound = min(temporal_start + converted_len, len(vr))
-                all_index = [
-                    x for x in range(temporal_start, bound, self.frame_sample_rate)
-                ]
-                while len(all_index) < self.clip_len:
-                    all_index.append(all_index[-1])
-                vr.seek(0)
-                buffer = vr.get_batch(all_index).asnumpy()
-                return buffer
-
-            all_index = []
-            for i in range(self.num_segment):
-                if seg_len <= converted_len:
-                    index = np.linspace(
-                        0, seg_len, num=seg_len // self.frame_sample_rate
-                    )
-                    index = np.concatenate(
-                        (
-                            index,
-                            np.ones(self.clip_len - seg_len // self.frame_sample_rate)
-                            * seg_len,
-                        )
-                    )
-                    index = np.clip(index, 0, seg_len - 1).astype(np.int64)
-                else:
-                    if self.mode == "validation":
-                        end_idx = (seg_len - converted_len) // 2
-                    else:
-                        end_idx = np.random.randint(converted_len, seg_len)
-                    str_idx = end_idx - converted_len
-                    index = np.linspace(str_idx, end_idx, num=self.clip_len)
-                    index = np.clip(index, str_idx, end_idx - 1).astype(np.int64)
-                index = index + i * seg_len
-                all_index.extend(list(index))
-
-            all_index = all_index[:: int(sample_rate_scale)]
-            vr.seek(0)
-            buffer = vr.get_batch(all_index).asnumpy()
-            return buffer
-        except:
-            print("video cannot be loaded by decord: ", fname)
-            return []
-
-    def __len__(self):
-        if self.mode != "test":
-            return len(self.dataset_samples)
-        else:
-            return len(self.test_dataset)
-
-
-def spatial_sampling(
-    frames,
-    spatial_idx=-1,
-    min_scale=256,
-    max_scale=320,
-    crop_size=224,
-    random_horizontal_flip=True,
-    inverse_uniform_sampling=False,
-    aspect_ratio=None,
-    scale=None,
-    motion_shift=False,
-):
-    """
-    Perform spatial sampling on the given video frames. If spatial_idx is
-    -1, perform random scale, random crop, and random flip on the given
-    frames. If spatial_idx is 0, 1, or 2, perform spatial uniform sampling
-    with the given spatial_idx.
-    Args:
-        frames (tensor): frames of images sampled from the video. The
-            dimension is `num frames` x `height` x `width` x `channel`.
-        spatial_idx (int): if -1, perform random spatial sampling. If 0, 1,
-            or 2, perform left, center, right crop if width is larger than
-            height, and perform top, center, buttom crop if height is larger
-            than width.
-        min_scale (int): the minimal size of scaling.
-        max_scale (int): the maximal size of scaling.
-        crop_size (int): the size of height and width used to crop the
-            frames.
-        inverse_uniform_sampling (bool): if True, sample uniformly in
-            [1 / max_scale, 1 / min_scale] and take a reciprocal to get the
-            scale. If False, take a uniform sample from [min_scale,
-            max_scale].
-        aspect_ratio (list): Aspect ratio range for resizing.
-        scale (list): Scale range for resizing.
-        motion_shift (bool): Whether to apply motion shift for resizing.
-    Returns:
-        frames (tensor): spatially sampled frames.
-    """
-    assert spatial_idx in [-1, 0, 1, 2]
-    if spatial_idx == -1:
-        if aspect_ratio is None and scale is None:
-            frames, _ = random_short_side_scale_jitter(
-                images=frames,
-                min_size=min_scale,
-                max_size=max_scale,
-                inverse_uniform_sampling=inverse_uniform_sampling,
-            )
-            frames, _ = random_crop(frames, crop_size)
-        else:
-            transform_func = (
-                random_resized_crop_with_shift if motion_shift else random_resized_crop
-            )
-            frames = transform_func(
-                images=frames,
-                target_height=crop_size,
-                target_width=crop_size,
-                scale=scale,
-                ratio=aspect_ratio,
-            )
-        if random_horizontal_flip:
-            frames, _ = horizontal_flip(0.5, frames)
-    else:
-        # The testing is deterministic and no jitter should be performed.
-        # min_scale, max_scale, and crop_size are expect to be the same.
-        assert len({min_scale, max_scale, crop_size}) == 1
-        frames, _ = random_short_side_scale_jitter(frames, min_scale, max_scale)
-        frames, _ = uniform_crop(frames, crop_size, spatial_idx)
-    return frames
-
-
-def tensor_normalize(tensor, mean, std):
-    """
-    Normalize a given tensor by subtracting the mean and dividing the std.
-    Args:
-        tensor (tensor): tensor to normalize.
-        mean (tensor or list): mean value to subtract.
-        std (tensor or list): std to divide.
-    """
-    if tensor.dtype == torch.uint8:
-        tensor = tensor.float()
-        tensor = tensor / 255.0
-    if type(mean) == list:
-        mean = torch.tensor(mean)
-    if type(std) == list:
-        std = torch.tensor(std)
-    tensor = tensor - mean
-    tensor = tensor / std
-    return tensor
diff --git a/eval/vbench/third_party/umt/datasets/kinetics_sparse.py b/eval/vbench/third_party/umt/datasets/kinetics_sparse.py
deleted file mode 100644
index 5862e9aa..00000000
--- a/eval/vbench/third_party/umt/datasets/kinetics_sparse.py
+++ /dev/null
@@ -1,441 +0,0 @@
-import io
-import os
-import random
-import warnings
-
-import numpy as np
-import torch
-from decord import VideoReader, cpu
-from numpy.lib.function_base import disp
-from torch.utils.data import Dataset
-from torchvision import transforms
-
-from .random_erasing import RandomErasing
-from .video_transforms import (
-    CenterCrop,
-    Compose,
-    Normalize,
-    Resize,
-    create_random_augment,
-    horizontal_flip,
-    random_crop,
-    random_resized_crop,
-    random_resized_crop_with_shift,
-    random_short_side_scale_jitter,
-    uniform_crop,
-)
-from .volume_transforms import ClipToTensor
-
-try:
-    from petrel_client.client import Client
-
-    has_client = True
-except ImportError:
-    has_client = False
-
-
-class VideoClsDataset_sparse(Dataset):
-    """Load your own video classification dataset."""
-
-    def __init__(
-        self,
-        anno_path,
-        prefix="",
-        split=" ",
-        mode="train",
-        clip_len=8,
-        frame_sample_rate=2,
-        crop_size=224,
-        short_side_size=256,
-        new_height=256,
-        new_width=340,
-        keep_aspect_ratio=True,
-        num_segment=1,
-        num_crop=1,
-        test_num_segment=10,
-        test_num_crop=3,
-        args=None,
-    ):
-        self.anno_path = anno_path
-        self.prefix = prefix
-        self.split = split
-        self.mode = mode
-        self.clip_len = clip_len
-        self.frame_sample_rate = frame_sample_rate
-        self.crop_size = crop_size
-        self.short_side_size = short_side_size
-        self.new_height = new_height
-        self.new_width = new_width
-        self.keep_aspect_ratio = keep_aspect_ratio
-        self.num_segment = num_segment
-        self.test_num_segment = test_num_segment
-        self.num_crop = num_crop
-        self.test_num_crop = test_num_crop
-        self.args = args
-        self.aug = False
-        self.rand_erase = False
-        assert num_segment == 1
-        if self.mode in ["train"]:
-            self.aug = True
-            if self.args.reprob > 0:
-                self.rand_erase = True
-        if VideoReader is None:
-            raise ImportError(
-                "Unable to import `decord` which is required to read videos."
-            )
-
-        import pandas as pd
-
-        cleaned = pd.read_csv(self.anno_path, header=None, delimiter=self.split)
-        self.dataset_samples = list(cleaned.values[:, 0])
-        self.label_array = list(cleaned.values[:, 1])
-
-        self.client = None
-        if has_client:
-            self.client = Client("~/petreloss.conf")
-
-        if mode == "train":
-            pass
-
-        elif mode == "validation":
-            self.data_transform = Compose(
-                [
-                    Resize(self.short_side_size, interpolation="bilinear"),
-                    CenterCrop(size=(self.crop_size, self.crop_size)),
-                    ClipToTensor(),
-                    Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
-                ]
-            )
-        elif mode == "test":
-            self.data_resize = Compose(
-                [Resize(size=(short_side_size), interpolation="bilinear")]
-            )
-            self.data_transform = Compose(
-                [
-                    ClipToTensor(),
-                    Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
-                ]
-            )
-            self.test_seg = []
-            self.test_dataset = []
-            self.test_label_array = []
-            for ck in range(self.test_num_segment):
-                for cp in range(self.test_num_crop):
-                    for idx in range(len(self.label_array)):
-                        sample_label = self.label_array[idx]
-                        self.test_label_array.append(sample_label)
-                        self.test_dataset.append(self.dataset_samples[idx])
-                        self.test_seg.append((ck, cp))
-
-    def __getitem__(self, index):
-        if self.mode == "train":
-            args = self.args
-
-            sample = self.dataset_samples[index]
-            buffer = self.loadvideo_decord(sample, chunk_nb=-1)  # T H W C
-            if len(buffer) == 0:
-                while len(buffer) == 0:
-                    warnings.warn(
-                        "video {} not correctly loaded during training".format(sample)
-                    )
-                    index = np.random.randint(self.__len__())
-                    sample = self.dataset_samples[index]
-                    buffer = self.loadvideo_decord(sample, chunk_nb=-1)
-
-            if args.num_sample > 1:
-                frame_list = []
-                label_list = []
-                index_list = []
-                for _ in range(args.num_sample):
-                    new_frames = self._aug_frame(buffer, args)
-                    label = self.label_array[index]
-                    frame_list.append(new_frames)
-                    label_list.append(label)
-                    index_list.append(index)
-                return frame_list, label_list, index_list, {}
-            else:
-                buffer = self._aug_frame(buffer, args)
-
-            return buffer, self.label_array[index], index, {}
-
-        elif self.mode == "validation":
-            sample = self.dataset_samples[index]
-            buffer = self.loadvideo_decord(sample, chunk_nb=0)
-            if len(buffer) == 0:
-                while len(buffer) == 0:
-                    warnings.warn(
-                        "video {} not correctly loaded during validation".format(sample)
-                    )
-                    index = np.random.randint(self.__len__())
-                    sample = self.dataset_samples[index]
-                    buffer = self.loadvideo_decord(sample, chunk_nb=0)
-            buffer = self.data_transform(buffer)
-            return buffer, self.label_array[index], sample.split("/")[-1].split(".")[0]
-
-        elif self.mode == "test":
-            sample = self.test_dataset[index]
-            chunk_nb, split_nb = self.test_seg[index]
-            buffer = self.loadvideo_decord(sample, chunk_nb=chunk_nb)
-
-            while len(buffer) == 0:
-                warnings.warn(
-                    "video {}, temporal {}, spatial {} not found during testing".format(
-                        str(self.test_dataset[index]), chunk_nb, split_nb
-                    )
-                )
-                index = np.random.randint(self.__len__())
-                sample = self.test_dataset[index]
-                chunk_nb, split_nb = self.test_seg[index]
-                buffer = self.loadvideo_decord(sample, chunk_nb=chunk_nb)
-
-            buffer = self.data_resize(buffer)
-            if isinstance(buffer, list):
-                buffer = np.stack(buffer, 0)
-            if self.test_num_crop == 1:
-                spatial_step = (
-                    1.0
-                    * (max(buffer.shape[1], buffer.shape[2]) - self.short_side_size)
-                    / 2
-                )
-                spatial_start = int(spatial_step)
-            else:
-                spatial_step = (
-                    1.0
-                    * (max(buffer.shape[1], buffer.shape[2]) - self.short_side_size)
-                    / (self.test_num_crop - 1)
-                )
-                spatial_start = int(split_nb * spatial_step)
-            if buffer.shape[1] >= buffer.shape[2]:
-                buffer = buffer[
-                    :, spatial_start : spatial_start + self.short_side_size, :, :
-                ]
-            else:
-                buffer = buffer[
-                    :, :, spatial_start : spatial_start + self.short_side_size, :
-                ]
-
-            buffer = self.data_transform(buffer)
-            return (
-                buffer,
-                self.test_label_array[index],
-                sample.split("/")[-1].split(".")[0],
-                chunk_nb,
-                split_nb,
-            )
-        else:
-            raise NameError("mode {} unkown".format(self.mode))
-
-    def _aug_frame(
-        self,
-        buffer,
-        args,
-    ):
-
-        aug_transform = create_random_augment(
-            input_size=(self.crop_size, self.crop_size),
-            auto_augment=args.aa,
-            interpolation=args.train_interpolation,
-        )
-
-        buffer = [transforms.ToPILImage()(frame) for frame in buffer]
-
-        buffer = aug_transform(buffer)
-
-        buffer = [transforms.ToTensor()(img) for img in buffer]
-        buffer = torch.stack(buffer)  # T C H W
-        buffer = buffer.permute(0, 2, 3, 1)  # T H W C
-
-        # T H W C
-        buffer = tensor_normalize(buffer, [0.485, 0.456, 0.406], [0.229, 0.224, 0.225])
-        # T H W C -> C T H W.
-        buffer = buffer.permute(3, 0, 1, 2)
-        # Perform data augmentation.
-        scl, asp = (
-            [0.08, 1.0],
-            [0.75, 1.3333],
-        )
-
-        buffer = spatial_sampling(
-            buffer,
-            spatial_idx=-1,
-            min_scale=256,
-            max_scale=320,
-            crop_size=self.crop_size,
-            random_horizontal_flip=False if args.data_set == "SSV2" else True,
-            inverse_uniform_sampling=False,
-            aspect_ratio=asp,
-            scale=scl,
-            motion_shift=False,
-        )
-
-        if self.rand_erase:
-            erase_transform = RandomErasing(
-                args.reprob,
-                mode=args.remode,
-                max_count=args.recount,
-                num_splits=args.recount,
-                device="cpu",
-            )
-            buffer = buffer.permute(1, 0, 2, 3)
-            buffer = erase_transform(buffer)
-            buffer = buffer.permute(1, 0, 2, 3)
-
-        return buffer
-
-    def _get_seq_frames(self, video_size, num_frames, clip_idx=-1):
-        seg_size = max(0.0, float(video_size - 1) / num_frames)
-        max_frame = int(video_size) - 1
-        seq = []
-        # index from 1, must add 1
-        if clip_idx == -1:
-            for i in range(num_frames):
-                start = int(np.round(seg_size * i))
-                end = int(np.round(seg_size * (i + 1)))
-                idx = min(random.randint(start, end), max_frame)
-                seq.append(idx)
-        else:
-            num_segment = 1
-            if self.mode == "test":
-                num_segment = self.test_num_segment
-            duration = seg_size / (num_segment + 1)
-            for i in range(num_frames):
-                start = int(np.round(seg_size * i))
-                frame_index = start + int(duration * (clip_idx + 1))
-                idx = min(frame_index, max_frame)
-                seq.append(idx)
-        return seq
-
-    def loadvideo_decord(self, sample, chunk_nb=0):
-        """Load video content using Decord"""
-        fname = sample
-        fname = os.path.join(self.prefix, fname)
-
-        try:
-            if self.keep_aspect_ratio:
-                if fname.startswith("s3"):
-                    video_bytes = self.client.get(fname)
-                    vr = VideoReader(io.BytesIO(video_bytes), num_threads=1, ctx=cpu(0))
-                else:
-                    vr = VideoReader(fname, num_threads=1, ctx=cpu(0))
-            else:
-                if fname.startswith("s3:"):
-                    video_bytes = self.client.get(fname)
-                    vr = VideoReader(
-                        io.BytesIO(video_bytes),
-                        width=self.new_width,
-                        height=self.new_height,
-                        num_threads=1,
-                        ctx=cpu(0),
-                    )
-                else:
-                    vr = VideoReader(
-                        fname,
-                        width=self.new_width,
-                        height=self.new_height,
-                        num_threads=1,
-                        ctx=cpu(0),
-                    )
-
-            all_index = self._get_seq_frames(len(vr), self.clip_len, clip_idx=chunk_nb)
-            vr.seek(0)
-            buffer = vr.get_batch(all_index).asnumpy()
-            return buffer
-        except:
-            print("video cannot be loaded by decord: ", fname)
-            return []
-
-    def __len__(self):
-        if self.mode != "test":
-            return len(self.dataset_samples)
-        else:
-            return len(self.test_dataset)
-
-
-def spatial_sampling(
-    frames,
-    spatial_idx=-1,
-    min_scale=256,
-    max_scale=320,
-    crop_size=224,
-    random_horizontal_flip=True,
-    inverse_uniform_sampling=False,
-    aspect_ratio=None,
-    scale=None,
-    motion_shift=False,
-):
-    """
-    Perform spatial sampling on the given video frames. If spatial_idx is
-    -1, perform random scale, random crop, and random flip on the given
-    frames. If spatial_idx is 0, 1, or 2, perform spatial uniform sampling
-    with the given spatial_idx.
-    Args:
-        frames (tensor): frames of images sampled from the video. The
-            dimension is `num frames` x `height` x `width` x `channel`.
-        spatial_idx (int): if -1, perform random spatial sampling. If 0, 1,
-            or 2, perform left, center, right crop if width is larger than
-            height, and perform top, center, buttom crop if height is larger
-            than width.
-        min_scale (int): the minimal size of scaling.
-        max_scale (int): the maximal size of scaling.
-        crop_size (int): the size of height and width used to crop the
-            frames.
-        inverse_uniform_sampling (bool): if True, sample uniformly in
-            [1 / max_scale, 1 / min_scale] and take a reciprocal to get the
-            scale. If False, take a uniform sample from [min_scale,
-            max_scale].
-        aspect_ratio (list): Aspect ratio range for resizing.
-        scale (list): Scale range for resizing.
-        motion_shift (bool): Whether to apply motion shift for resizing.
-    Returns:
-        frames (tensor): spatially sampled frames.
-    """
-    assert spatial_idx in [-1, 0, 1, 2]
-    if spatial_idx == -1:
-        if aspect_ratio is None and scale is None:
-            frames, _ = random_short_side_scale_jitter(
-                images=frames,
-                min_size=min_scale,
-                max_size=max_scale,
-                inverse_uniform_sampling=inverse_uniform_sampling,
-            )
-            frames, _ = random_crop(frames, crop_size)
-        else:
-            transform_func = (
-                random_resized_crop_with_shift if motion_shift else random_resized_crop
-            )
-            frames = transform_func(
-                images=frames,
-                target_height=crop_size,
-                target_width=crop_size,
-                scale=scale,
-                ratio=aspect_ratio,
-            )
-        if random_horizontal_flip:
-            frames, _ = horizontal_flip(0.5, frames)
-    else:
-        # The testing is deterministic and no jitter should be performed.
-        # min_scale, max_scale, and crop_size are expect to be the same.
-        assert len({min_scale, max_scale, crop_size}) == 1
-        frames, _ = random_short_side_scale_jitter(frames, min_scale, max_scale)
-        frames, _ = uniform_crop(frames, crop_size, spatial_idx)
-    return frames
-
-
-def tensor_normalize(tensor, mean, std):
-    """
-    Normalize a given tensor by subtracting the mean and dividing the std.
-    Args:
-        tensor (tensor): tensor to normalize.
-        mean (tensor or list): mean value to subtract.
-        std (tensor or list): std to divide.
-    """
-    if tensor.dtype == torch.uint8:
-        tensor = tensor.float()
-        tensor = tensor / 255.0
-    if type(mean) == list:
-        mean = torch.tensor(mean)
-    if type(std) == list:
-        std = torch.tensor(std)
-    tensor = tensor - mean
-    tensor = tensor / std
-    return tensor
diff --git a/eval/vbench/third_party/umt/datasets/mae.py b/eval/vbench/third_party/umt/datasets/mae.py
deleted file mode 100644
index 4fe6a380..00000000
--- a/eval/vbench/third_party/umt/datasets/mae.py
+++ /dev/null
@@ -1,326 +0,0 @@
-import io
-import os
-import random
-
-import cv2
-import decord
-import numpy as np
-import torch
-from decord import VideoReader, cpu
-from PIL import Image
-
-try:
-    from petrel_client.client import Client
-
-    has_client = True
-except ImportError:
-    has_client = False
-
-
-class VideoMAE(torch.utils.data.Dataset):
-    """Load your own video classification dataset.
-    Parameters
-    ----------
-    root : str, required.
-        Path to the root folder storing the dataset.
-    setting : str, required.
-        A text file describing the dataset, each line per video sample.
-        There are three items in each line: (1) video path; (2) video length and (3) video label.
-    prefix : str, required.
-        The prefix for loading data.
-    split : str, required.
-        The split character for metadata.
-    train : bool, default True.
-        Whether to load the training or validation set.
-    test_mode : bool, default False.
-        Whether to perform evaluation on the test set.
-        Usually there is three-crop or ten-crop evaluation strategy involved.
-    name_pattern : str, default None.
-        The naming pattern of the decoded video frames.
-        For example, img_00012.jpg.
-    video_ext : str, default 'mp4'.
-        If video_loader is set to True, please specify the video format accordinly.
-    is_color : bool, default True.
-        Whether the loaded image is color or grayscale.
-    modality : str, default 'rgb'.
-        Input modalities, we support only rgb video frames for now.
-        Will add support for rgb difference image and optical flow image later.
-    num_segments : int, default 1.
-        Number of segments to evenly divide the video into clips.
-        A useful technique to obtain global video-level information.
-        Limin Wang, etal, Temporal Segment Networks: Towards Good Practices for Deep Action Recognition, ECCV 2016.
-    num_crop : int, default 1.
-        Number of crops for each image. default is 1.
-        Common choices are three crops and ten crops during evaluation.
-    new_length : int, default 1.
-        The length of input video clip. Default is a single image, but it can be multiple video frames.
-        For example, new_length=16 means we will extract a video clip of consecutive 16 frames.
-    new_step : int, default 1.
-        Temporal sampling rate. For example, new_step=1 means we will extract a video clip of consecutive frames.
-        new_step=2 means we will extract a video clip of every other frame.
-    temporal_jitter : bool, default False.
-        Whether to temporally jitter if new_step > 1.
-    video_loader : bool, default False.
-        Whether to use video loader to load data.
-    use_decord : bool, default True.
-        Whether to use Decord video loader to load data. Otherwise load image.
-    transform : function, default None.
-        A function that takes data and label and transforms them.
-    data_aug : str, default 'v1'.
-        Different types of data augmentation auto. Supports v1, v2, v3 and v4.
-    lazy_init : bool, default False.
-        If set to True, build a dataset instance without loading any dataset.
-    """
-
-    def __init__(
-        self,
-        root,
-        setting,
-        prefix="",
-        split=" ",
-        train=True,
-        test_mode=False,
-        name_pattern="img_%05d.jpg",
-        video_ext="mp4",
-        is_color=True,
-        modality="rgb",
-        num_segments=1,
-        num_crop=1,
-        new_length=1,
-        new_step=1,
-        transform=None,
-        temporal_jitter=False,
-        video_loader=False,
-        use_decord=True,
-        lazy_init=False,
-        num_sample=1,
-    ):
-
-        super(VideoMAE, self).__init__()
-        self.root = root
-        self.setting = setting
-        self.prefix = prefix
-        self.split = split
-        self.train = train
-        self.test_mode = test_mode
-        self.is_color = is_color
-        self.modality = modality
-        self.num_segments = num_segments
-        self.num_crop = num_crop
-        self.new_length = new_length
-        self.new_step = new_step
-        self.skip_length = self.new_length * self.new_step
-        self.temporal_jitter = temporal_jitter
-        self.name_pattern = name_pattern
-        self.video_loader = video_loader
-        self.video_ext = video_ext
-        self.use_decord = use_decord
-        self.transform = transform
-        self.lazy_init = lazy_init
-        self.num_sample = num_sample
-
-        # sparse sampling, num_segments != 1
-        if self.num_segments != 1:
-            print("Use sparse sampling, change frame and stride")
-            self.new_length = self.num_segments
-            self.skip_length = 1
-
-        self.client = None
-        if has_client:
-            self.client = Client("~/petreloss.conf")
-
-        if not self.lazy_init:
-            self.clips = self._make_dataset(root, setting)
-            if len(self.clips) == 0:
-                raise (
-                    RuntimeError(
-                        "Found 0 video clips in subfolders of: " + root + "\n"
-                        "Check your data directory (opt.data-dir)."
-                    )
-                )
-
-    def __getitem__(self, index):
-        while True:
-            try:
-                images = None
-                if self.use_decord:
-                    directory, target = self.clips[index]
-                    if self.video_loader:
-                        if "." in directory.split("/")[-1]:
-                            # data in the "setting" file already have extension, e.g., demo.mp4
-                            video_name = directory
-                        else:
-                            # data in the "setting" file do not have extension, e.g., demo
-                            # So we need to provide extension (i.e., .mp4) to complete the file name.
-                            video_name = "{}.{}".format(directory, self.video_ext)
-
-                        video_name = os.path.join(self.prefix, video_name)
-                        if video_name.startswith("s3"):
-                            video_bytes = self.client.get(video_name)
-                            decord_vr = VideoReader(
-                                io.BytesIO(video_bytes), num_threads=1, ctx=cpu(0)
-                            )
-                        else:
-                            decord_vr = decord.VideoReader(
-                                video_name, num_threads=1, ctx=cpu(0)
-                            )
-                        duration = len(decord_vr)
-
-                    segment_indices, skip_offsets = self._sample_train_indices(duration)
-                    images = self._video_TSN_decord_batch_loader(
-                        directory, decord_vr, duration, segment_indices, skip_offsets
-                    )
-
-                else:
-                    video_name, total_frame, target = self.clips[index]
-                    video_name = os.path.join(self.prefix, video_name)
-
-                    segment_indices, skip_offsets = self._sample_train_indices(
-                        total_frame
-                    )
-                    frame_id_list = self._get_frame_id_list(
-                        total_frame, segment_indices, skip_offsets
-                    )
-                    images = []
-                    for idx in frame_id_list:
-                        frame_fname = os.path.join(
-                            video_name, self.name_pattern.format(idx)
-                        )
-                        img_bytes = self.client.get(frame_fname)
-                        img_np = np.frombuffer(img_bytes, np.uint8)
-                        img = cv2.imdecode(img_np, cv2.IMREAD_COLOR)
-                        cv2.cvtColor(img, cv2.COLOR_BGR2RGB, img)
-                        images.append(Image.fromarray(img))
-                if images is not None:
-                    break
-            except Exception as e:
-                print(
-                    "Failed to load video from {} with error {}".format(video_name, e)
-                )
-            index = random.randint(0, len(self.clips) - 1)
-
-        if self.num_sample > 1:
-            process_data_list = []
-            mask_list = []
-            for _ in range(self.num_sample):
-                process_data, mask = self.transform((images, None))
-                process_data = process_data.view(
-                    (self.new_length, 3) + process_data.size()[-2:]
-                ).transpose(0, 1)
-                process_data_list.append(process_data)
-                mask_list.append(mask)
-            return process_data_list, mask_list
-        else:
-            process_data, mask = self.transform((images, None))  # T*C,H,W
-            process_data = process_data.view(
-                (self.new_length, 3) + process_data.size()[-2:]
-            ).transpose(
-                0, 1
-            )  # T*C,H,W -> T,C,H,W -> C,T,H,W
-            return (process_data, mask)
-
-    def __len__(self):
-        return len(self.clips)
-
-    def _make_dataset(self, directory, setting):
-        if not os.path.exists(setting):
-            raise (
-                RuntimeError(
-                    "Setting file %s doesn't exist. Check opt.train-list and opt.val-list. "
-                    % (setting)
-                )
-            )
-        clips = []
-
-        print(f"Load dataset using decord: {self.use_decord}")
-        with open(setting) as split_f:
-            data = split_f.readlines()
-            for line in data:
-                line_info = line.split(self.split)
-                if len(line_info) < 2:
-                    raise (
-                        RuntimeError(
-                            "Video input format is not correct, missing one or more element. %s"
-                            % line
-                        )
-                    )
-                if self.use_decord:
-                    # line format: video_path, video_label
-                    clip_path = os.path.join(line_info[0])
-                    target = int(line_info[1])
-                    item = (clip_path, target)
-                else:
-                    # line format: video_path, video_duration, video_label
-                    clip_path = os.path.join(line_info[0])
-                    total_frame = int(line_info[1])
-                    target = int(line_info[2])
-                    item = (clip_path, total_frame, target)
-                clips.append(item)
-        return clips
-
-    def _sample_train_indices(self, num_frames):
-        average_duration = (num_frames - self.skip_length + 1) // self.num_segments
-        if average_duration > 0:
-            offsets = np.multiply(list(range(self.num_segments)), average_duration)
-            offsets = offsets + np.random.randint(
-                average_duration, size=self.num_segments
-            )
-        elif num_frames > max(self.num_segments, self.skip_length):
-            offsets = np.sort(
-                np.random.randint(
-                    num_frames - self.skip_length + 1, size=self.num_segments
-                )
-            )
-        else:
-            offsets = np.zeros((self.num_segments,))
-
-        if self.temporal_jitter:
-            skip_offsets = np.random.randint(
-                self.new_step, size=self.skip_length // self.new_step
-            )
-        else:
-            skip_offsets = np.zeros(self.skip_length // self.new_step, dtype=int)
-        return offsets + 1, skip_offsets
-
-    def _get_frame_id_list(self, duration, indices, skip_offsets):
-        frame_id_list = []
-        for seg_ind in indices:
-            offset = int(seg_ind)
-            for i, _ in enumerate(range(0, self.skip_length, self.new_step)):
-                if offset + skip_offsets[i] <= duration:
-                    frame_id = offset + skip_offsets[i] - 1
-                else:
-                    frame_id = offset - 1
-                frame_id_list.append(frame_id)
-                if offset + self.new_step < duration:
-                    offset += self.new_step
-        return frame_id_list
-
-    def _video_TSN_decord_batch_loader(
-        self, directory, video_reader, duration, indices, skip_offsets
-    ):
-        sampled_list = []
-        frame_id_list = []
-        for seg_ind in indices:
-            offset = int(seg_ind)
-            for i, _ in enumerate(range(0, self.skip_length, self.new_step)):
-                if offset + skip_offsets[i] <= duration:
-                    frame_id = offset + skip_offsets[i] - 1
-                else:
-                    frame_id = offset - 1
-                frame_id_list.append(frame_id)
-                if offset + self.new_step < duration:
-                    offset += self.new_step
-        try:
-            video_data = video_reader.get_batch(frame_id_list).asnumpy()
-            sampled_list = [
-                Image.fromarray(video_data[vid, :, :, :]).convert("RGB")
-                for vid, _ in enumerate(frame_id_list)
-            ]
-        except:
-            raise RuntimeError(
-                "Error occured in reading frames {} from video {} of duration {}.".format(
-                    frame_id_list, directory, duration
-                )
-            )
-        return sampled_list
diff --git a/eval/vbench/third_party/umt/datasets/masking_generator.py b/eval/vbench/third_party/umt/datasets/masking_generator.py
deleted file mode 100644
index 58877478..00000000
--- a/eval/vbench/third_party/umt/datasets/masking_generator.py
+++ /dev/null
@@ -1,54 +0,0 @@
-import numpy as np
-
-
-class TubeMaskingGenerator:
-    def __init__(self, input_size, mask_ratio):
-        self.frames, self.height, self.width = input_size
-        self.num_patches_per_frame = self.height * self.width
-        self.total_patches = self.frames * self.num_patches_per_frame
-        self.num_masks_per_frame = int(mask_ratio * self.num_patches_per_frame)
-        self.total_masks = self.frames * self.num_masks_per_frame
-
-    def __repr__(self):
-        repr_str = "Maks: total patches {}, mask patches {}".format(
-            self.total_patches, self.total_masks
-        )
-        return repr_str
-
-    def __call__(self):
-        mask_per_frame = np.hstack(
-            [
-                np.zeros(self.num_patches_per_frame - self.num_masks_per_frame),
-                np.ones(self.num_masks_per_frame),
-            ]
-        )
-        np.random.shuffle(mask_per_frame)
-        mask = np.tile(mask_per_frame, (self.frames, 1)).flatten()
-        return mask
-
-
-class RandomMaskingGenerator:
-    def __init__(self, input_size, mask_ratio):
-        if not isinstance(input_size, tuple):
-            input_size = (input_size,) * 3
-
-        self.frames, self.height, self.width = input_size
-
-        self.num_patches = self.frames * self.height * self.width  # 8x14x14
-        self.num_mask = int(mask_ratio * self.num_patches)
-
-    def __repr__(self):
-        repr_str = "Maks: total patches {}, mask patches {}".format(
-            self.num_patches, self.num_mask
-        )
-        return repr_str
-
-    def __call__(self):
-        mask = np.hstack(
-            [
-                np.zeros(self.num_patches - self.num_mask),
-                np.ones(self.num_mask),
-            ]
-        )
-        np.random.shuffle(mask)
-        return mask  # [196*8]
diff --git a/eval/vbench/third_party/umt/datasets/mixup.py b/eval/vbench/third_party/umt/datasets/mixup.py
deleted file mode 100644
index 51cab7fc..00000000
--- a/eval/vbench/third_party/umt/datasets/mixup.py
+++ /dev/null
@@ -1,402 +0,0 @@
-""" Mixup and Cutmix
-
-Papers:
-mixup: Beyond Empirical Risk Minimization (https://arxiv.org/abs/1710.09412)
-
-CutMix: Regularization Strategy to Train Strong Classifiers with Localizable Features (https://arxiv.org/abs/1905.04899)
-
-Code Reference:
-CutMix: https://github.com/clovaai/CutMix-PyTorch
-
-Hacked together by / Copyright 2019, Ross Wightman
-"""
-
-import numpy as np
-import torch
-
-
-def one_hot(x, num_classes, on_value=1.0, off_value=0.0, device="cuda"):
-    x = x.long().view(-1, 1)
-    return torch.full((x.size()[0], num_classes), off_value, device=device).scatter_(
-        1, x, on_value
-    )
-
-
-def mixup_target(target, num_classes, lam=1.0, smoothing=0.0, device="cuda"):
-    off_value = smoothing / num_classes
-    on_value = 1.0 - smoothing + off_value
-    y1 = one_hot(
-        target, num_classes, on_value=on_value, off_value=off_value, device=device
-    )
-    y2 = one_hot(
-        target.flip(0),
-        num_classes,
-        on_value=on_value,
-        off_value=off_value,
-        device=device,
-    )
-    return y1 * lam + y2 * (1.0 - lam)
-
-
-def rand_bbox(img_shape, lam, margin=0.0, count=None):
-    """Standard CutMix bounding-box
-    Generates a random square bbox based on lambda value. This impl includes
-    support for enforcing a border margin as percent of bbox dimensions.
-
-    Args:
-        img_shape (tuple): Image shape as tuple
-        lam (float): Cutmix lambda value
-        margin (float): Percentage of bbox dimension to enforce as margin (reduce amount of box outside image)
-        count (int): Number of bbox to generate
-    """
-    ratio = np.sqrt(1 - lam)
-    img_h, img_w = img_shape[-2:]
-    cut_h, cut_w = int(img_h * ratio), int(img_w * ratio)
-    margin_y, margin_x = int(margin * cut_h), int(margin * cut_w)
-    cy = np.random.randint(0 + margin_y, img_h - margin_y, size=count)
-    cx = np.random.randint(0 + margin_x, img_w - margin_x, size=count)
-    yl = np.clip(cy - cut_h // 2, 0, img_h)
-    yh = np.clip(cy + cut_h // 2, 0, img_h)
-    xl = np.clip(cx - cut_w // 2, 0, img_w)
-    xh = np.clip(cx + cut_w // 2, 0, img_w)
-    return yl, yh, xl, xh
-
-
-def rand_bbox_minmax(img_shape, minmax, count=None):
-    """Min-Max CutMix bounding-box
-    Inspired by Darknet cutmix impl, generates a random rectangular bbox
-    based on min/max percent values applied to each dimension of the input image.
-
-    Typical defaults for minmax are usually in the  .2-.3 for min and .8-.9 range for max.
-
-    Args:
-        img_shape (tuple): Image shape as tuple
-        minmax (tuple or list): Min and max bbox ratios (as percent of image size)
-        count (int): Number of bbox to generate
-    """
-    assert len(minmax) == 2
-    img_h, img_w = img_shape[-2:]
-    cut_h = np.random.randint(
-        int(img_h * minmax[0]), int(img_h * minmax[1]), size=count
-    )
-    cut_w = np.random.randint(
-        int(img_w * minmax[0]), int(img_w * minmax[1]), size=count
-    )
-    yl = np.random.randint(0, img_h - cut_h, size=count)
-    xl = np.random.randint(0, img_w - cut_w, size=count)
-    yu = yl + cut_h
-    xu = xl + cut_w
-    return yl, yu, xl, xu
-
-
-def cutmix_bbox_and_lam(
-    img_shape, lam, ratio_minmax=None, correct_lam=True, count=None
-):
-    """Generate bbox and apply lambda correction."""
-    if ratio_minmax is not None:
-        yl, yu, xl, xu = rand_bbox_minmax(img_shape, ratio_minmax, count=count)
-    else:
-        yl, yu, xl, xu = rand_bbox(img_shape, lam, count=count)
-    if correct_lam or ratio_minmax is not None:
-        bbox_area = (yu - yl) * (xu - xl)
-        lam = 1.0 - bbox_area / float(img_shape[-2] * img_shape[-1])
-    return (yl, yu, xl, xu), lam
-
-
-class Mixup:
-    """Mixup/Cutmix that applies different params to each element or whole batch
-
-    Args:
-        mixup_alpha (float): mixup alpha value, mixup is active if > 0.
-        cutmix_alpha (float): cutmix alpha value, cutmix is active if > 0.
-        cutmix_minmax (List[float]): cutmix min/max image ratio, cutmix is active and uses this vs alpha if not None.
-        prob (float): probability of applying mixup or cutmix per batch or element
-        switch_prob (float): probability of switching to cutmix instead of mixup when both are active
-        mode (str): how to apply mixup/cutmix params (per 'batch', 'pair' (pair of elements), 'elem' (element)
-        correct_lam (bool): apply lambda correction when cutmix bbox clipped by image borders
-        label_smoothing (float): apply label smoothing to the mixed target tensor
-        num_classes (int): number of classes for target
-    """
-
-    def __init__(
-        self,
-        mixup_alpha=1.0,
-        cutmix_alpha=0.0,
-        cutmix_minmax=None,
-        prob=1.0,
-        switch_prob=0.5,
-        mode="batch",
-        correct_lam=True,
-        label_smoothing=0.1,
-        num_classes=1000,
-    ):
-        self.mixup_alpha = mixup_alpha
-        self.cutmix_alpha = cutmix_alpha
-        self.cutmix_minmax = cutmix_minmax
-        if self.cutmix_minmax is not None:
-            assert len(self.cutmix_minmax) == 2
-            # force cutmix alpha == 1.0 when minmax active to keep logic simple & safe
-            self.cutmix_alpha = 1.0
-        self.mix_prob = prob
-        self.switch_prob = switch_prob
-        self.label_smoothing = label_smoothing
-        self.num_classes = num_classes
-        self.mode = mode
-        self.correct_lam = (
-            correct_lam  # correct lambda based on clipped area for cutmix
-        )
-        self.mixup_enabled = (
-            True  # set to false to disable mixing (intended tp be set by train loop)
-        )
-
-    def _params_per_elem(self, batch_size):
-        lam = np.ones(batch_size, dtype=np.float32)
-        use_cutmix = np.zeros(batch_size, dtype=np.bool)
-        if self.mixup_enabled:
-            if self.mixup_alpha > 0.0 and self.cutmix_alpha > 0.0:
-                use_cutmix = np.random.rand(batch_size) < self.switch_prob
-                lam_mix = np.where(
-                    use_cutmix,
-                    np.random.beta(
-                        self.cutmix_alpha, self.cutmix_alpha, size=batch_size
-                    ),
-                    np.random.beta(self.mixup_alpha, self.mixup_alpha, size=batch_size),
-                )
-            elif self.mixup_alpha > 0.0:
-                lam_mix = np.random.beta(
-                    self.mixup_alpha, self.mixup_alpha, size=batch_size
-                )
-            elif self.cutmix_alpha > 0.0:
-                use_cutmix = np.ones(batch_size, dtype=np.bool)
-                lam_mix = np.random.beta(
-                    self.cutmix_alpha, self.cutmix_alpha, size=batch_size
-                )
-            else:
-                assert (
-                    False
-                ), "One of mixup_alpha > 0., cutmix_alpha > 0., cutmix_minmax not None should be true."
-            lam = np.where(
-                np.random.rand(batch_size) < self.mix_prob,
-                lam_mix.astype(np.float32),
-                lam,
-            )
-        return lam, use_cutmix
-
-    def _params_per_batch(self):
-        lam = 1.0
-        use_cutmix = False
-        if self.mixup_enabled and np.random.rand() < self.mix_prob:
-            if self.mixup_alpha > 0.0 and self.cutmix_alpha > 0.0:
-                use_cutmix = np.random.rand() < self.switch_prob
-                lam_mix = (
-                    np.random.beta(self.cutmix_alpha, self.cutmix_alpha)
-                    if use_cutmix
-                    else np.random.beta(self.mixup_alpha, self.mixup_alpha)
-                )
-            elif self.mixup_alpha > 0.0:
-                lam_mix = np.random.beta(self.mixup_alpha, self.mixup_alpha)
-            elif self.cutmix_alpha > 0.0:
-                use_cutmix = True
-                lam_mix = np.random.beta(self.cutmix_alpha, self.cutmix_alpha)
-            else:
-                assert (
-                    False
-                ), "One of mixup_alpha > 0., cutmix_alpha > 0., cutmix_minmax not None should be true."
-            lam = float(lam_mix)
-        return lam, use_cutmix
-
-    def _mix_elem(self, x):
-        batch_size = len(x)
-        lam_batch, use_cutmix = self._params_per_elem(batch_size)
-        x_orig = x.clone()  # need to keep an unmodified original for mixing source
-        for i in range(batch_size):
-            j = batch_size - i - 1
-            lam = lam_batch[i]
-            if lam != 1.0:
-                if use_cutmix[i]:
-                    (yl, yh, xl, xh), lam = cutmix_bbox_and_lam(
-                        x[i].shape,
-                        lam,
-                        ratio_minmax=self.cutmix_minmax,
-                        correct_lam=self.correct_lam,
-                    )
-                    x[i][..., yl:yh, xl:xh] = x_orig[j][..., yl:yh, xl:xh]
-                    lam_batch[i] = lam
-                else:
-                    x[i] = x[i] * lam + x_orig[j] * (1 - lam)
-        return torch.tensor(lam_batch, device=x.device, dtype=x.dtype).unsqueeze(1)
-
-    def _mix_pair(self, x):
-        batch_size = len(x)
-        lam_batch, use_cutmix = self._params_per_elem(batch_size // 2)
-        x_orig = x.clone()  # need to keep an unmodified original for mixing source
-        for i in range(batch_size // 2):
-            j = batch_size - i - 1
-            lam = lam_batch[i]
-            if lam != 1.0:
-                if use_cutmix[i]:
-                    (yl, yh, xl, xh), lam = cutmix_bbox_and_lam(
-                        x[i].shape,
-                        lam,
-                        ratio_minmax=self.cutmix_minmax,
-                        correct_lam=self.correct_lam,
-                    )
-                    x[i][:, yl:yh, xl:xh] = x_orig[j][:, yl:yh, xl:xh]
-                    x[j][:, yl:yh, xl:xh] = x_orig[i][:, yl:yh, xl:xh]
-                    lam_batch[i] = lam
-                else:
-                    x[i] = x[i] * lam + x_orig[j] * (1 - lam)
-                    x[j] = x[j] * lam + x_orig[i] * (1 - lam)
-        lam_batch = np.concatenate((lam_batch, lam_batch[::-1]))
-        return torch.tensor(lam_batch, device=x.device, dtype=x.dtype).unsqueeze(1)
-
-    def _mix_batch(self, x):
-        lam, use_cutmix = self._params_per_batch()
-        if lam == 1.0:
-            return 1.0
-        if use_cutmix:
-            (yl, yh, xl, xh), lam = cutmix_bbox_and_lam(
-                x.shape,
-                lam,
-                ratio_minmax=self.cutmix_minmax,
-                correct_lam=self.correct_lam,
-            )
-            x[..., yl:yh, xl:xh] = x.flip(0)[..., yl:yh, xl:xh]
-        else:
-            x_flipped = x.flip(0).mul_(1.0 - lam)
-            x.mul_(lam).add_(x_flipped)
-        return lam
-
-    def __call__(self, x, target):
-        assert len(x) % 2 == 0, "Batch size should be even when using this"
-        if self.mode == "elem":
-            lam = self._mix_elem(x)
-        elif self.mode == "pair":
-            lam = self._mix_pair(x)
-        else:
-            lam = self._mix_batch(x)
-        target = mixup_target(
-            target, self.num_classes, lam, self.label_smoothing, x.device
-        )
-        return x, target
-
-
-class FastCollateMixup(Mixup):
-    """Fast Collate w/ Mixup/Cutmix that applies different params to each element or whole batch
-
-    A Mixup impl that's performed while collating the batches.
-    """
-
-    def _mix_elem_collate(self, output, batch, half=False):
-        batch_size = len(batch)
-        num_elem = batch_size // 2 if half else batch_size
-        assert len(output) == num_elem
-        lam_batch, use_cutmix = self._params_per_elem(num_elem)
-        for i in range(num_elem):
-            j = batch_size - i - 1
-            lam = lam_batch[i]
-            mixed = batch[i][0]
-            if lam != 1.0:
-                if use_cutmix[i]:
-                    if not half:
-                        mixed = mixed.copy()
-                    (yl, yh, xl, xh), lam = cutmix_bbox_and_lam(
-                        output.shape,
-                        lam,
-                        ratio_minmax=self.cutmix_minmax,
-                        correct_lam=self.correct_lam,
-                    )
-                    mixed[:, yl:yh, xl:xh] = batch[j][0][:, yl:yh, xl:xh]
-                    lam_batch[i] = lam
-                else:
-                    mixed = mixed.astype(np.float32) * lam + batch[j][0].astype(
-                        np.float32
-                    ) * (1 - lam)
-                    np.rint(mixed, out=mixed)
-            output[i] += torch.from_numpy(mixed.astype(np.uint8))
-        if half:
-            lam_batch = np.concatenate((lam_batch, np.ones(num_elem)))
-        return torch.tensor(lam_batch).unsqueeze(1)
-
-    def _mix_pair_collate(self, output, batch):
-        batch_size = len(batch)
-        lam_batch, use_cutmix = self._params_per_elem(batch_size // 2)
-        for i in range(batch_size // 2):
-            j = batch_size - i - 1
-            lam = lam_batch[i]
-            mixed_i = batch[i][0]
-            mixed_j = batch[j][0]
-            assert 0 <= lam <= 1.0
-            if lam < 1.0:
-                if use_cutmix[i]:
-                    (yl, yh, xl, xh), lam = cutmix_bbox_and_lam(
-                        output.shape,
-                        lam,
-                        ratio_minmax=self.cutmix_minmax,
-                        correct_lam=self.correct_lam,
-                    )
-                    patch_i = mixed_i[:, yl:yh, xl:xh].copy()
-                    mixed_i[:, yl:yh, xl:xh] = mixed_j[:, yl:yh, xl:xh]
-                    mixed_j[:, yl:yh, xl:xh] = patch_i
-                    lam_batch[i] = lam
-                else:
-                    mixed_temp = mixed_i.astype(np.float32) * lam + mixed_j.astype(
-                        np.float32
-                    ) * (1 - lam)
-                    mixed_j = mixed_j.astype(np.float32) * lam + mixed_i.astype(
-                        np.float32
-                    ) * (1 - lam)
-                    mixed_i = mixed_temp
-                    np.rint(mixed_j, out=mixed_j)
-                    np.rint(mixed_i, out=mixed_i)
-            output[i] += torch.from_numpy(mixed_i.astype(np.uint8))
-            output[j] += torch.from_numpy(mixed_j.astype(np.uint8))
-        lam_batch = np.concatenate((lam_batch, lam_batch[::-1]))
-        return torch.tensor(lam_batch).unsqueeze(1)
-
-    def _mix_batch_collate(self, output, batch):
-        batch_size = len(batch)
-        lam, use_cutmix = self._params_per_batch()
-        if use_cutmix:
-            (yl, yh, xl, xh), lam = cutmix_bbox_and_lam(
-                output.shape,
-                lam,
-                ratio_minmax=self.cutmix_minmax,
-                correct_lam=self.correct_lam,
-            )
-        for i in range(batch_size):
-            j = batch_size - i - 1
-            mixed = batch[i][0]
-            if lam != 1.0:
-                if use_cutmix:
-                    mixed = (
-                        mixed.copy()
-                    )  # don't want to modify the original while iterating
-                    mixed[..., yl:yh, xl:xh] = batch[j][0][..., yl:yh, xl:xh]
-                else:
-                    mixed = mixed.astype(np.float32) * lam + batch[j][0].astype(
-                        np.float32
-                    ) * (1 - lam)
-                    np.rint(mixed, out=mixed)
-            output[i] += torch.from_numpy(mixed.astype(np.uint8))
-        return lam
-
-    def __call__(self, batch, _=None):
-        batch_size = len(batch)
-        assert batch_size % 2 == 0, "Batch size should be even when using this"
-        half = "half" in self.mode
-        if half:
-            batch_size //= 2
-        output = torch.zeros((batch_size, *batch[0][0].shape), dtype=torch.uint8)
-        if self.mode == "elem" or self.mode == "half":
-            lam = self._mix_elem_collate(output, batch, half=half)
-        elif self.mode == "pair":
-            lam = self._mix_pair_collate(output, batch)
-        else:
-            lam = self._mix_batch_collate(output, batch)
-        target = torch.tensor([b[1] for b in batch], dtype=torch.int64)
-        target = mixup_target(
-            target, self.num_classes, lam, self.label_smoothing, device="cpu"
-        )
-        target = target[:batch_size]
-        return output, target
diff --git a/eval/vbench/third_party/umt/datasets/rand_augment.py b/eval/vbench/third_party/umt/datasets/rand_augment.py
deleted file mode 100644
index aec7e919..00000000
--- a/eval/vbench/third_party/umt/datasets/rand_augment.py
+++ /dev/null
@@ -1,514 +0,0 @@
-"""
-This implementation is based on
-https://github.com/rwightman/pytorch-image-models/blob/master/timm/data/auto_augment.py
-pulished under an Apache License 2.0.
-
-COMMENT FROM ORIGINAL:
-AutoAugment, RandAugment, and AugMix for PyTorch
-This code implements the searched ImageNet policies with various tweaks and
-improvements and does not include any of the search code. AA and RA
-Implementation adapted from:
-    https://github.com/tensorflow/tpu/blob/master/models/official/efficientnet/autoaugment.py
-AugMix adapted from:
-    https://github.com/google-research/augmix
-Papers:
-    AutoAugment: Learning Augmentation Policies from Data
-    https://arxiv.org/abs/1805.09501
-    Learning Data Augmentation Strategies for Object Detection
-    https://arxiv.org/abs/1906.11172
-    RandAugment: Practical automated data augmentation...
-    https://arxiv.org/abs/1909.13719
-    AugMix: A Simple Data Processing Method to Improve Robustness and
-    Uncertainty https://arxiv.org/abs/1912.02781
-
-Hacked together by / Copyright 2020 Ross Wightman
-"""
-
-import math
-import random
-import re
-
-import numpy as np
-import PIL
-from PIL import Image, ImageEnhance, ImageOps
-
-_PIL_VER = tuple([int(x) for x in PIL.__version__.split(".")[:2]])
-
-_FILL = (128, 128, 128)
-
-# This signifies the max integer that the controller RNN could predict for the
-# augmentation scheme.
-_MAX_LEVEL = 10.0
-
-_HPARAMS_DEFAULT = {
-    "translate_const": 250,
-    "img_mean": _FILL,
-}
-
-_RANDOM_INTERPOLATION = (Image.BILINEAR, Image.BICUBIC)
-
-
-def _interpolation(kwargs):
-    interpolation = kwargs.pop("resample", Image.BILINEAR)
-    if isinstance(interpolation, (list, tuple)):
-        return random.choice(interpolation)
-    else:
-        return interpolation
-
-
-def _check_args_tf(kwargs):
-    if "fillcolor" in kwargs and _PIL_VER < (5, 0):
-        kwargs.pop("fillcolor")
-    kwargs["resample"] = _interpolation(kwargs)
-
-
-def shear_x(img, factor, **kwargs):
-    _check_args_tf(kwargs)
-    return img.transform(img.size, Image.AFFINE, (1, factor, 0, 0, 1, 0), **kwargs)
-
-
-def shear_y(img, factor, **kwargs):
-    _check_args_tf(kwargs)
-    return img.transform(img.size, Image.AFFINE, (1, 0, 0, factor, 1, 0), **kwargs)
-
-
-def translate_x_rel(img, pct, **kwargs):
-    pixels = pct * img.size[0]
-    _check_args_tf(kwargs)
-    return img.transform(img.size, Image.AFFINE, (1, 0, pixels, 0, 1, 0), **kwargs)
-
-
-def translate_y_rel(img, pct, **kwargs):
-    pixels = pct * img.size[1]
-    _check_args_tf(kwargs)
-    return img.transform(img.size, Image.AFFINE, (1, 0, 0, 0, 1, pixels), **kwargs)
-
-
-def translate_x_abs(img, pixels, **kwargs):
-    _check_args_tf(kwargs)
-    return img.transform(img.size, Image.AFFINE, (1, 0, pixels, 0, 1, 0), **kwargs)
-
-
-def translate_y_abs(img, pixels, **kwargs):
-    _check_args_tf(kwargs)
-    return img.transform(img.size, Image.AFFINE, (1, 0, 0, 0, 1, pixels), **kwargs)
-
-
-def rotate(img, degrees, **kwargs):
-    _check_args_tf(kwargs)
-    if _PIL_VER >= (5, 2):
-        return img.rotate(degrees, **kwargs)
-    elif _PIL_VER >= (5, 0):
-        w, h = img.size
-        post_trans = (0, 0)
-        rotn_center = (w / 2.0, h / 2.0)
-        angle = -math.radians(degrees)
-        matrix = [
-            round(math.cos(angle), 15),
-            round(math.sin(angle), 15),
-            0.0,
-            round(-math.sin(angle), 15),
-            round(math.cos(angle), 15),
-            0.0,
-        ]
-
-        def transform(x, y, matrix):
-            (a, b, c, d, e, f) = matrix
-            return a * x + b * y + c, d * x + e * y + f
-
-        matrix[2], matrix[5] = transform(
-            -rotn_center[0] - post_trans[0],
-            -rotn_center[1] - post_trans[1],
-            matrix,
-        )
-        matrix[2] += rotn_center[0]
-        matrix[5] += rotn_center[1]
-        return img.transform(img.size, Image.AFFINE, matrix, **kwargs)
-    else:
-        return img.rotate(degrees, resample=kwargs["resample"])
-
-
-def auto_contrast(img, **__):
-    return ImageOps.autocontrast(img)
-
-
-def invert(img, **__):
-    return ImageOps.invert(img)
-
-
-def equalize(img, **__):
-    return ImageOps.equalize(img)
-
-
-def solarize(img, thresh, **__):
-    return ImageOps.solarize(img, thresh)
-
-
-def solarize_add(img, add, thresh=128, **__):
-    lut = []
-    for i in range(256):
-        if i < thresh:
-            lut.append(min(255, i + add))
-        else:
-            lut.append(i)
-    if img.mode in ("L", "RGB"):
-        if img.mode == "RGB" and len(lut) == 256:
-            lut = lut + lut + lut
-        return img.point(lut)
-    else:
-        return img
-
-
-def posterize(img, bits_to_keep, **__):
-    if bits_to_keep >= 8:
-        return img
-    return ImageOps.posterize(img, bits_to_keep)
-
-
-def contrast(img, factor, **__):
-    return ImageEnhance.Contrast(img).enhance(factor)
-
-
-def color(img, factor, **__):
-    return ImageEnhance.Color(img).enhance(factor)
-
-
-def brightness(img, factor, **__):
-    return ImageEnhance.Brightness(img).enhance(factor)
-
-
-def sharpness(img, factor, **__):
-    return ImageEnhance.Sharpness(img).enhance(factor)
-
-
-def _randomly_negate(v):
-    """With 50% prob, negate the value"""
-    return -v if random.random() > 0.5 else v
-
-
-def _rotate_level_to_arg(level, _hparams):
-    # range [-30, 30]
-    level = (level / _MAX_LEVEL) * 30.0
-    level = _randomly_negate(level)
-    return (level,)
-
-
-def _enhance_level_to_arg(level, _hparams):
-    # range [0.1, 1.9]
-    return ((level / _MAX_LEVEL) * 1.8 + 0.1,)
-
-
-def _enhance_increasing_level_to_arg(level, _hparams):
-    # the 'no change' level is 1.0, moving away from that towards 0. or 2.0 increases the enhancement blend
-    # range [0.1, 1.9]
-    level = (level / _MAX_LEVEL) * 0.9
-    level = 1.0 + _randomly_negate(level)
-    return (level,)
-
-
-def _shear_level_to_arg(level, _hparams):
-    # range [-0.3, 0.3]
-    level = (level / _MAX_LEVEL) * 0.3
-    level = _randomly_negate(level)
-    return (level,)
-
-
-def _translate_abs_level_to_arg(level, hparams):
-    translate_const = hparams["translate_const"]
-    level = (level / _MAX_LEVEL) * float(translate_const)
-    level = _randomly_negate(level)
-    return (level,)
-
-
-def _translate_rel_level_to_arg(level, hparams):
-    # default range [-0.45, 0.45]
-    translate_pct = hparams.get("translate_pct", 0.45)
-    level = (level / _MAX_LEVEL) * translate_pct
-    level = _randomly_negate(level)
-    return (level,)
-
-
-def _posterize_level_to_arg(level, _hparams):
-    # As per Tensorflow TPU EfficientNet impl
-    # range [0, 4], 'keep 0 up to 4 MSB of original image'
-    # intensity/severity of augmentation decreases with level
-    return (int((level / _MAX_LEVEL) * 4),)
-
-
-def _posterize_increasing_level_to_arg(level, hparams):
-    # As per Tensorflow models research and UDA impl
-    # range [4, 0], 'keep 4 down to 0 MSB of original image',
-    # intensity/severity of augmentation increases with level
-    return (4 - _posterize_level_to_arg(level, hparams)[0],)
-
-
-def _posterize_original_level_to_arg(level, _hparams):
-    # As per original AutoAugment paper description
-    # range [4, 8], 'keep 4 up to 8 MSB of image'
-    # intensity/severity of augmentation decreases with level
-    return (int((level / _MAX_LEVEL) * 4) + 4,)
-
-
-def _solarize_level_to_arg(level, _hparams):
-    # range [0, 256]
-    # intensity/severity of augmentation decreases with level
-    return (int((level / _MAX_LEVEL) * 256),)
-
-
-def _solarize_increasing_level_to_arg(level, _hparams):
-    # range [0, 256]
-    # intensity/severity of augmentation increases with level
-    return (256 - _solarize_level_to_arg(level, _hparams)[0],)
-
-
-def _solarize_add_level_to_arg(level, _hparams):
-    # range [0, 110]
-    return (int((level / _MAX_LEVEL) * 110),)
-
-
-LEVEL_TO_ARG = {
-    "AutoContrast": None,
-    "Equalize": None,
-    "Invert": None,
-    "Rotate": _rotate_level_to_arg,
-    # There are several variations of the posterize level scaling in various Tensorflow/Google repositories/papers
-    "Posterize": _posterize_level_to_arg,
-    "PosterizeIncreasing": _posterize_increasing_level_to_arg,
-    "PosterizeOriginal": _posterize_original_level_to_arg,
-    "Solarize": _solarize_level_to_arg,
-    "SolarizeIncreasing": _solarize_increasing_level_to_arg,
-    "SolarizeAdd": _solarize_add_level_to_arg,
-    "Color": _enhance_level_to_arg,
-    "ColorIncreasing": _enhance_increasing_level_to_arg,
-    "Contrast": _enhance_level_to_arg,
-    "ContrastIncreasing": _enhance_increasing_level_to_arg,
-    "Brightness": _enhance_level_to_arg,
-    "BrightnessIncreasing": _enhance_increasing_level_to_arg,
-    "Sharpness": _enhance_level_to_arg,
-    "SharpnessIncreasing": _enhance_increasing_level_to_arg,
-    "ShearX": _shear_level_to_arg,
-    "ShearY": _shear_level_to_arg,
-    "TranslateX": _translate_abs_level_to_arg,
-    "TranslateY": _translate_abs_level_to_arg,
-    "TranslateXRel": _translate_rel_level_to_arg,
-    "TranslateYRel": _translate_rel_level_to_arg,
-}
-
-
-NAME_TO_OP = {
-    "AutoContrast": auto_contrast,
-    "Equalize": equalize,
-    "Invert": invert,
-    "Rotate": rotate,
-    "Posterize": posterize,
-    "PosterizeIncreasing": posterize,
-    "PosterizeOriginal": posterize,
-    "Solarize": solarize,
-    "SolarizeIncreasing": solarize,
-    "SolarizeAdd": solarize_add,
-    "Color": color,
-    "ColorIncreasing": color,
-    "Contrast": contrast,
-    "ContrastIncreasing": contrast,
-    "Brightness": brightness,
-    "BrightnessIncreasing": brightness,
-    "Sharpness": sharpness,
-    "SharpnessIncreasing": sharpness,
-    "ShearX": shear_x,
-    "ShearY": shear_y,
-    "TranslateX": translate_x_abs,
-    "TranslateY": translate_y_abs,
-    "TranslateXRel": translate_x_rel,
-    "TranslateYRel": translate_y_rel,
-}
-
-
-class AugmentOp:
-    """
-    Apply for video.
-    """
-
-    def __init__(self, name, prob=0.5, magnitude=10, hparams=None):
-        hparams = hparams or _HPARAMS_DEFAULT
-        self.aug_fn = NAME_TO_OP[name]
-        self.level_fn = LEVEL_TO_ARG[name]
-        self.prob = prob
-        self.magnitude = magnitude
-        self.hparams = hparams.copy()
-        self.kwargs = {
-            "fillcolor": hparams["img_mean"] if "img_mean" in hparams else _FILL,
-            "resample": (
-                hparams["interpolation"]
-                if "interpolation" in hparams
-                else _RANDOM_INTERPOLATION
-            ),
-        }
-
-        # If magnitude_std is > 0, we introduce some randomness
-        # in the usually fixed policy and sample magnitude from a normal distribution
-        # with mean `magnitude` and std-dev of `magnitude_std`.
-        # NOTE This is my own hack, being tested, not in papers or reference impls.
-        self.magnitude_std = self.hparams.get("magnitude_std", 0)
-
-    def __call__(self, img_list):
-        if self.prob < 1.0 and random.random() > self.prob:
-            return img_list
-        magnitude = self.magnitude
-        if self.magnitude_std and self.magnitude_std > 0:
-            magnitude = random.gauss(magnitude, self.magnitude_std)
-        magnitude = min(_MAX_LEVEL, max(0, magnitude))  # clip to valid range
-        level_args = (
-            self.level_fn(magnitude, self.hparams) if self.level_fn is not None else ()
-        )
-
-        if isinstance(img_list, list):
-            return [self.aug_fn(img, *level_args, **self.kwargs) for img in img_list]
-        else:
-            return self.aug_fn(img_list, *level_args, **self.kwargs)
-
-
-_RAND_TRANSFORMS = [
-    "AutoContrast",
-    "Equalize",
-    "Invert",
-    "Rotate",
-    "Posterize",
-    "Solarize",
-    "SolarizeAdd",
-    "Color",
-    "Contrast",
-    "Brightness",
-    "Sharpness",
-    "ShearX",
-    "ShearY",
-    "TranslateXRel",
-    "TranslateYRel",
-]
-
-
-_RAND_INCREASING_TRANSFORMS = [
-    "AutoContrast",
-    "Equalize",
-    "Invert",
-    "Rotate",
-    "PosterizeIncreasing",
-    "SolarizeIncreasing",
-    "SolarizeAdd",
-    "ColorIncreasing",
-    "ContrastIncreasing",
-    "BrightnessIncreasing",
-    "SharpnessIncreasing",
-    "ShearX",
-    "ShearY",
-    "TranslateXRel",
-    "TranslateYRel",
-]
-
-
-# These experimental weights are based loosely on the relative improvements mentioned in paper.
-# They may not result in increased performance, but could likely be tuned to so.
-_RAND_CHOICE_WEIGHTS_0 = {
-    "Rotate": 0.3,
-    "ShearX": 0.2,
-    "ShearY": 0.2,
-    "TranslateXRel": 0.1,
-    "TranslateYRel": 0.1,
-    "Color": 0.025,
-    "Sharpness": 0.025,
-    "AutoContrast": 0.025,
-    "Solarize": 0.005,
-    "SolarizeAdd": 0.005,
-    "Contrast": 0.005,
-    "Brightness": 0.005,
-    "Equalize": 0.005,
-    "Posterize": 0,
-    "Invert": 0,
-}
-
-
-def _select_rand_weights(weight_idx=0, transforms=None):
-    transforms = transforms or _RAND_TRANSFORMS
-    assert weight_idx == 0  # only one set of weights currently
-    rand_weights = _RAND_CHOICE_WEIGHTS_0
-    probs = [rand_weights[k] for k in transforms]
-    probs /= np.sum(probs)
-    return probs
-
-
-def rand_augment_ops(magnitude=10, hparams=None, transforms=None):
-    hparams = hparams or _HPARAMS_DEFAULT
-    transforms = transforms or _RAND_TRANSFORMS
-    return [
-        AugmentOp(name, prob=0.5, magnitude=magnitude, hparams=hparams)
-        for name in transforms
-    ]
-
-
-class RandAugment:
-    def __init__(self, ops, num_layers=2, choice_weights=None):
-        self.ops = ops
-        self.num_layers = num_layers
-        self.choice_weights = choice_weights
-
-    def __call__(self, img):
-        # no replacement when using weighted choice
-        ops = np.random.choice(
-            self.ops,
-            self.num_layers,
-            replace=self.choice_weights is None,
-            p=self.choice_weights,
-        )
-        for op in ops:
-            img = op(img)
-        return img
-
-
-def rand_augment_transform(config_str, hparams):
-    """
-    RandAugment: Practical automated data augmentation... - https://arxiv.org/abs/1909.13719
-
-    Create a RandAugment transform
-    :param config_str: String defining configuration of random augmentation. Consists of multiple sections separated by
-    dashes ('-'). The first section defines the specific variant of rand augment (currently only 'rand'). The remaining
-    sections, not order sepecific determine
-        'm' - integer magnitude of rand augment
-        'n' - integer num layers (number of transform ops selected per image)
-        'w' - integer probabiliy weight index (index of a set of weights to influence choice of op)
-        'mstd' -  float std deviation of magnitude noise applied
-        'inc' - integer (bool), use augmentations that increase in severity with magnitude (default: 0)
-    Ex 'rand-m9-n3-mstd0.5' results in RandAugment with magnitude 9, num_layers 3, magnitude_std 0.5
-    'rand-mstd1-w0' results in magnitude_std 1.0, weights 0, default magnitude of 10 and num_layers 2
-    :param hparams: Other hparams (kwargs) for the RandAugmentation scheme
-    :return: A PyTorch compatible Transform
-    """
-    magnitude = _MAX_LEVEL  # default to _MAX_LEVEL for magnitude (currently 10)
-    num_layers = 2  # default to 2 ops per image
-    weight_idx = None  # default to no probability weights for op choice
-    transforms = _RAND_TRANSFORMS
-    config = config_str.split("-")
-    assert config[0] == "rand"
-    config = config[1:]
-    for c in config:
-        cs = re.split(r"(\d.*)", c)
-        if len(cs) < 2:
-            continue
-        key, val = cs[:2]
-        if key == "mstd":
-            # noise param injected via hparams for now
-            hparams.setdefault("magnitude_std", float(val))
-        elif key == "inc":
-            if bool(val):
-                transforms = _RAND_INCREASING_TRANSFORMS
-        elif key == "m":
-            magnitude = int(val)
-        elif key == "n":
-            num_layers = int(val)
-        elif key == "w":
-            weight_idx = int(val)
-        else:
-            assert NotImplementedError
-    ra_ops = rand_augment_ops(
-        magnitude=magnitude, hparams=hparams, transforms=transforms
-    )
-    choice_weights = None if weight_idx is None else _select_rand_weights(weight_idx)
-    return RandAugment(ra_ops, num_layers, choice_weights=choice_weights)
diff --git a/eval/vbench/third_party/umt/datasets/random_erasing.py b/eval/vbench/third_party/umt/datasets/random_erasing.py
deleted file mode 100644
index 4c4f96c8..00000000
--- a/eval/vbench/third_party/umt/datasets/random_erasing.py
+++ /dev/null
@@ -1,167 +0,0 @@
-"""
-This implementation is based on
-https://github.com/rwightman/pytorch-image-models/blob/master/timm/data/random_erasing.py
-pulished under an Apache License 2.0.
-"""
-
-import math
-import random
-
-import torch
-
-
-def _get_pixels(per_pixel, rand_color, patch_size, dtype=torch.float32, device="cuda"):
-    # NOTE I've seen CUDA illegal memory access errors being caused by the normal_()
-    # paths, flip the order so normal is run on CPU if this becomes a problem
-    # Issue has been fixed in master https://github.com/pytorch/pytorch/issues/19508
-    if per_pixel:
-        return torch.empty(patch_size, dtype=dtype, device=device).normal_()
-    elif rand_color:
-        return torch.empty((patch_size[0], 1, 1), dtype=dtype, device=device).normal_()
-    else:
-        return torch.zeros((patch_size[0], 1, 1), dtype=dtype, device=device)
-
-
-class RandomErasing:
-    """Randomly selects a rectangle region in an image and erases its pixels.
-        'Random Erasing Data Augmentation' by Zhong et al.
-        See https://arxiv.org/pdf/1708.04896.pdf
-        This variant of RandomErasing is intended to be applied to either a batch
-        or single image tensor after it has been normalized by dataset mean and std.
-    Args:
-         probability: Probability that the Random Erasing operation will be performed.
-         min_area: Minimum percentage of erased area wrt input image area.
-         max_area: Maximum percentage of erased area wrt input image area.
-         min_aspect: Minimum aspect ratio of erased area.
-         mode: pixel color mode, one of 'const', 'rand', or 'pixel'
-            'const' - erase block is constant color of 0 for all channels
-            'rand'  - erase block is same per-channel random (normal) color
-            'pixel' - erase block is per-pixel random (normal) color
-        max_count: maximum number of erasing blocks per image, area per box is scaled by count.
-            per-image count is randomly chosen between 1 and this value.
-    """
-
-    def __init__(
-        self,
-        probability=0.5,
-        min_area=0.02,
-        max_area=1 / 3,
-        min_aspect=0.3,
-        max_aspect=None,
-        mode="const",
-        min_count=1,
-        max_count=None,
-        num_splits=0,
-        device="cuda",
-        cube=True,
-    ):
-        self.probability = probability
-        self.min_area = min_area
-        self.max_area = max_area
-        max_aspect = max_aspect or 1 / min_aspect
-        self.log_aspect_ratio = (math.log(min_aspect), math.log(max_aspect))
-        self.min_count = min_count
-        self.max_count = max_count or min_count
-        self.num_splits = num_splits
-        mode = mode.lower()
-        self.rand_color = False
-        self.per_pixel = False
-        self.cube = cube
-        if mode == "rand":
-            self.rand_color = True  # per block random normal
-        elif mode == "pixel":
-            self.per_pixel = True  # per pixel random normal
-        else:
-            assert not mode or mode == "const"
-        self.device = device
-
-    def _erase(self, img, chan, img_h, img_w, dtype):
-        if random.random() > self.probability:
-            return
-        area = img_h * img_w
-        count = (
-            self.min_count
-            if self.min_count == self.max_count
-            else random.randint(self.min_count, self.max_count)
-        )
-        for _ in range(count):
-            for _ in range(10):
-                target_area = (
-                    random.uniform(self.min_area, self.max_area) * area / count
-                )
-                aspect_ratio = math.exp(random.uniform(*self.log_aspect_ratio))
-                h = int(round(math.sqrt(target_area * aspect_ratio)))
-                w = int(round(math.sqrt(target_area / aspect_ratio)))
-                if w < img_w and h < img_h:
-                    top = random.randint(0, img_h - h)
-                    left = random.randint(0, img_w - w)
-                    img[:, top : top + h, left : left + w] = _get_pixels(
-                        self.per_pixel,
-                        self.rand_color,
-                        (chan, h, w),
-                        dtype=dtype,
-                        device=self.device,
-                    )
-                    break
-
-    def _erase_cube(
-        self,
-        img,
-        batch_start,
-        batch_size,
-        chan,
-        img_h,
-        img_w,
-        dtype,
-    ):
-        if random.random() > self.probability:
-            return
-        area = img_h * img_w
-        count = (
-            self.min_count
-            if self.min_count == self.max_count
-            else random.randint(self.min_count, self.max_count)
-        )
-        for _ in range(count):
-            for _ in range(100):
-                target_area = (
-                    random.uniform(self.min_area, self.max_area) * area / count
-                )
-                aspect_ratio = math.exp(random.uniform(*self.log_aspect_ratio))
-                h = int(round(math.sqrt(target_area * aspect_ratio)))
-                w = int(round(math.sqrt(target_area / aspect_ratio)))
-                if w < img_w and h < img_h:
-                    top = random.randint(0, img_h - h)
-                    left = random.randint(0, img_w - w)
-                    for i in range(batch_start, batch_size):
-                        img_instance = img[i]
-                        img_instance[:, top : top + h, left : left + w] = _get_pixels(
-                            self.per_pixel,
-                            self.rand_color,
-                            (chan, h, w),
-                            dtype=dtype,
-                            device=self.device,
-                        )
-                    break
-
-    def __call__(self, input):
-        if len(input.size()) == 3:
-            self._erase(input, *input.size(), input.dtype)
-        else:
-            batch_size, chan, img_h, img_w = input.size()
-            # skip first slice of batch if num_splits is set (for clean portion of samples)
-            batch_start = batch_size // self.num_splits if self.num_splits > 1 else 0
-            if self.cube:
-                self._erase_cube(
-                    input,
-                    batch_start,
-                    batch_size,
-                    chan,
-                    img_h,
-                    img_w,
-                    input.dtype,
-                )
-            else:
-                for i in range(batch_start, batch_size):
-                    self._erase(input[i], chan, img_h, img_w, input.dtype)
-        return input
diff --git a/eval/vbench/third_party/umt/datasets/ssv2.py b/eval/vbench/third_party/umt/datasets/ssv2.py
deleted file mode 100644
index 9d9a72dd..00000000
--- a/eval/vbench/third_party/umt/datasets/ssv2.py
+++ /dev/null
@@ -1,777 +0,0 @@
-import io
-import os
-import warnings
-
-import cv2
-import numpy as np
-import torch
-from decord import VideoReader, cpu
-from torch.utils.data import Dataset
-from torchvision import transforms
-
-from .random_erasing import RandomErasing
-from .video_transforms import (
-    CenterCrop,
-    Compose,
-    Normalize,
-    Resize,
-    create_random_augment,
-    horizontal_flip,
-    random_crop,
-    random_resized_crop,
-    random_resized_crop_with_shift,
-    random_short_side_scale_jitter,
-    uniform_crop,
-)
-from .volume_transforms import ClipToTensor
-
-try:
-    from petrel_client.client import Client
-
-    has_client = True
-except ImportError:
-    has_client = False
-
-
-class SSRawFrameClsDataset(Dataset):
-    """Load your own raw frame classification dataset."""
-
-    def __init__(
-        self,
-        anno_path,
-        prefix="",
-        split=" ",
-        mode="train",
-        clip_len=8,
-        crop_size=224,
-        short_side_size=256,
-        new_height=256,
-        new_width=340,
-        keep_aspect_ratio=True,
-        num_segment=1,
-        num_crop=1,
-        test_num_segment=10,
-        test_num_crop=3,
-        filename_tmpl="img_{:05}.jpg",
-        args=None,
-    ):
-        self.anno_path = anno_path
-        self.prefix = prefix
-        self.split = split
-        self.mode = mode
-        self.clip_len = clip_len
-        self.crop_size = crop_size
-        self.short_side_size = short_side_size
-        self.new_height = new_height
-        self.new_width = new_width
-        self.keep_aspect_ratio = keep_aspect_ratio
-        self.num_segment = num_segment
-        self.test_num_segment = test_num_segment
-        self.num_crop = num_crop
-        self.test_num_crop = test_num_crop
-        self.filename_tmpl = filename_tmpl
-        self.args = args
-        self.aug = False
-        self.rand_erase = False
-
-        self.client = None
-        if has_client:
-            self.client = Client("~/petreloss.conf")
-
-        if self.mode in ["train"]:
-            self.aug = True
-            if self.args.reprob > 0:
-                self.rand_erase = True
-        if VideoReader is None:
-            raise ImportError(
-                "Unable to import `decord` which is required to read videos."
-            )
-
-        import pandas as pd
-
-        cleaned = pd.read_csv(self.anno_path, header=None, delimiter=self.split)
-        self.dataset_samples = list(cleaned.values[:, 0])
-        self.total_frames = list(cleaned.values[:, 1])
-        self.label_array = list(cleaned.values[:, -1])
-
-        if mode == "train":
-            pass
-
-        elif mode == "validation":
-            self.data_transform = Compose(
-                [
-                    Resize(self.short_side_size, interpolation="bilinear"),
-                    CenterCrop(size=(self.crop_size, self.crop_size)),
-                    ClipToTensor(),
-                    Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
-                ]
-            )
-        elif mode == "test":
-            self.data_resize = Compose(
-                [Resize(size=(short_side_size), interpolation="bilinear")]
-            )
-            self.data_transform = Compose(
-                [
-                    ClipToTensor(),
-                    Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
-                ]
-            )
-            self.test_seg = []
-            self.test_dataset = []
-            self.test_total_frames = []
-            self.test_label_array = []
-            for ck in range(self.test_num_segment):
-                for cp in range(self.test_num_crop):
-                    for idx in range(len(self.label_array)):
-                        self.test_seg.append((ck, cp))
-                        self.test_dataset.append(self.dataset_samples[idx])
-                        self.test_total_frames.append(self.total_frames[idx])
-                        self.test_label_array.append(self.label_array[idx])
-
-    def __getitem__(self, index):
-        if self.mode == "train":
-            args = self.args
-            scale_t = 1
-
-            sample = self.dataset_samples[index]
-            total_frame = self.total_frames[index]
-            buffer = self.load_frame(
-                sample, total_frame, sample_rate_scale=scale_t
-            )  # T H W C
-            if len(buffer) == 0:
-                while len(buffer) == 0:
-                    warnings.warn(
-                        "video {} not correctly loaded during training".format(sample)
-                    )
-                    index = np.random.randint(self.__len__())
-                    sample = self.dataset_samples[index]
-                    total_frame = self.total_frames[index]
-                    buffer = self.load_frame(
-                        sample, total_frame, sample_rate_scale=scale_t
-                    )
-
-            if args.num_sample > 1:
-                frame_list = []
-                label_list = []
-                index_list = []
-                for _ in range(args.num_sample):
-                    new_frames = self._aug_frame(buffer, args)
-                    label = self.label_array[index]
-                    frame_list.append(new_frames)
-                    label_list.append(label)
-                    index_list.append(index)
-                return frame_list, label_list, index_list, {}
-            else:
-                buffer = self._aug_frame(buffer, args)
-
-            return buffer, self.label_array[index], index, {}
-
-        elif self.mode == "validation":
-            sample = self.dataset_samples[index]
-            total_frame = self.total_frames[index]
-            buffer = self.load_frame(sample, total_frame)
-            if len(buffer) == 0:
-                while len(buffer) == 0:
-                    warnings.warn(
-                        "video {} not correctly loaded during validation".format(sample)
-                    )
-                    index = np.random.randint(self.__len__())
-                    sample = self.dataset_samples[index]
-                    buffer = self.load_frame(sample, total_frame)
-            buffer = self.data_transform(buffer)
-            return buffer, self.label_array[index], sample.split("/")[-1].split(".")[0]
-
-        elif self.mode == "test":
-            sample = self.test_dataset[index]
-            total_frame = self.test_total_frames[index]
-            chunk_nb, split_nb = self.test_seg[index]
-            buffer = self.load_frame(sample, total_frame)
-
-            while len(buffer) == 0:
-                warnings.warn(
-                    "video {}, temporal {}, spatial {} not found during testing".format(
-                        str(self.test_dataset[index]), chunk_nb, split_nb
-                    )
-                )
-                index = np.random.randint(self.__len__())
-                sample = self.test_dataset[index]
-                total_frame = self.test_total_frames[index]
-                chunk_nb, split_nb = self.test_seg[index]
-                buffer = self.load_frame(sample, total_frame)
-
-            buffer = self.data_resize(buffer)
-            if isinstance(buffer, list):
-                buffer = np.stack(buffer, 0)
-
-            spatial_step = (
-                1.0
-                * (max(buffer.shape[1], buffer.shape[2]) - self.short_side_size)
-                / (self.test_num_crop - 1)
-            )
-            temporal_start = chunk_nb
-            spatial_start = int(split_nb * spatial_step)
-            if buffer.shape[1] >= buffer.shape[2]:
-                buffer = buffer[
-                    temporal_start :: self.test_num_segment,
-                    spatial_start : spatial_start + self.short_side_size,
-                    :,
-                    :,
-                ]
-            else:
-                buffer = buffer[
-                    temporal_start :: self.test_num_segment,
-                    :,
-                    spatial_start : spatial_start + self.short_side_size,
-                    :,
-                ]
-
-            buffer = self.data_transform(buffer)
-            return (
-                buffer,
-                self.test_label_array[index],
-                sample.split("/")[-1].split(".")[0],
-                chunk_nb,
-                split_nb,
-            )
-        else:
-            raise NameError("mode {} unkown".format(self.mode))
-
-    def _aug_frame(
-        self,
-        buffer,
-        args,
-    ):
-
-        aug_transform = create_random_augment(
-            input_size=(self.crop_size, self.crop_size),
-            auto_augment=args.aa,
-            interpolation=args.train_interpolation,
-        )
-
-        buffer = [transforms.ToPILImage()(frame) for frame in buffer]
-
-        buffer = aug_transform(buffer)
-
-        buffer = [transforms.ToTensor()(img) for img in buffer]
-        buffer = torch.stack(buffer)  # T C H W
-        buffer = buffer.permute(0, 2, 3, 1)  # T H W C
-
-        # T H W C
-        buffer = tensor_normalize(buffer, [0.485, 0.456, 0.406], [0.229, 0.224, 0.225])
-        # T H W C -> C T H W.
-        buffer = buffer.permute(3, 0, 1, 2)
-        # Perform data augmentation.
-        scl, asp = (
-            [0.08, 1.0],
-            [0.75, 1.3333],
-        )
-
-        buffer = spatial_sampling(
-            buffer,
-            spatial_idx=-1,
-            min_scale=256,
-            max_scale=320,
-            crop_size=self.crop_size,
-            random_horizontal_flip=False if args.data_set == "SSV2" else True,
-            inverse_uniform_sampling=False,
-            aspect_ratio=asp,
-            scale=scl,
-            motion_shift=False,
-        )
-
-        if self.rand_erase:
-            erase_transform = RandomErasing(
-                args.reprob,
-                mode=args.remode,
-                max_count=args.recount,
-                num_splits=args.recount,
-                device="cpu",
-            )
-            buffer = buffer.permute(1, 0, 2, 3)
-            buffer = erase_transform(buffer)
-            buffer = buffer.permute(1, 0, 2, 3)
-
-        return buffer
-
-    def load_frame(self, sample, num_frames, sample_rate_scale=1):
-        """Load video content using Decord"""
-        fname = sample
-        fname = os.path.join(self.prefix, fname)
-
-        if self.mode == "test":
-            tick = num_frames / float(self.num_segment)
-            all_index = []
-            for t_seg in range(self.test_num_segment):
-                tmp_index = [
-                    int(t_seg * tick / self.test_num_segment + tick * x)
-                    for x in range(self.num_segment)
-                ]
-                all_index.extend(tmp_index)
-            all_index = list(np.sort(np.array(all_index)))
-            imgs = []
-            for idx in all_index:
-                frame_fname = os.path.join(fname, self.filename_tmpl.format(idx + 1))
-                img_bytes = self.client.get(frame_fname)
-                img_np = np.frombuffer(img_bytes, np.uint8)
-                img = cv2.imdecode(img_np, cv2.IMREAD_COLOR)
-                cv2.cvtColor(img, cv2.COLOR_BGR2RGB, img)
-                imgs.append(img)
-            buffer = np.array(imgs)
-            return buffer
-
-        # handle temporal segments
-        average_duration = num_frames // self.num_segment
-        all_index = []
-        if average_duration > 0:
-            if self.mode == "validation":
-                all_index = list(
-                    np.multiply(list(range(self.num_segment)), average_duration)
-                    + np.ones(self.num_segment, dtype=int) * (average_duration // 2)
-                )
-            else:
-                all_index = list(
-                    np.multiply(list(range(self.num_segment)), average_duration)
-                    + np.random.randint(average_duration, size=self.num_segment)
-                )
-        elif num_frames > self.num_segment:
-            if self.mode == "validation":
-                all_index = list(range(self.num_segment))
-            else:
-                all_index = list(
-                    np.sort(np.random.randint(num_frames, size=self.num_segment))
-                )
-        else:
-            all_index = [0] * (self.num_segment - num_frames) + list(range(num_frames))
-        all_index = list(np.array(all_index))
-        imgs = []
-        for idx in all_index:
-            frame_fname = os.path.join(fname, self.filename_tmpl.format(idx + 1))
-            img_bytes = self.client.get(frame_fname)
-            img_np = np.frombuffer(img_bytes, np.uint8)
-            img = cv2.imdecode(img_np, cv2.IMREAD_COLOR)
-            cv2.cvtColor(img, cv2.COLOR_BGR2RGB, img)
-            imgs.append(img)
-        buffer = np.array(imgs)
-        return buffer
-
-    def __len__(self):
-        if self.mode != "test":
-            return len(self.dataset_samples)
-        else:
-            return len(self.test_dataset)
-
-
-class SSVideoClsDataset(Dataset):
-    """Load your own video classification dataset."""
-
-    def __init__(
-        self,
-        anno_path,
-        prefix="",
-        split=" ",
-        mode="train",
-        clip_len=8,
-        crop_size=224,
-        short_side_size=256,
-        new_height=256,
-        new_width=340,
-        keep_aspect_ratio=True,
-        num_segment=1,
-        num_crop=1,
-        test_num_segment=10,
-        test_num_crop=3,
-        args=None,
-    ):
-        self.anno_path = anno_path
-        self.prefix = prefix
-        self.split = split
-        self.mode = mode
-        self.clip_len = clip_len
-        self.crop_size = crop_size
-        self.short_side_size = short_side_size
-        self.new_height = new_height
-        self.new_width = new_width
-        self.keep_aspect_ratio = keep_aspect_ratio
-        self.num_segment = num_segment
-        self.test_num_segment = test_num_segment
-        self.num_crop = num_crop
-        self.test_num_crop = test_num_crop
-        self.args = args
-        self.aug = False
-        self.rand_erase = False
-
-        self.client = None
-        if has_client:
-            self.client = Client("~/petreloss.conf")
-
-        if self.mode in ["train"]:
-            self.aug = True
-            if self.args.reprob > 0:
-                self.rand_erase = True
-        if VideoReader is None:
-            raise ImportError(
-                "Unable to import `decord` which is required to read videos."
-            )
-
-        import pandas as pd
-
-        cleaned = pd.read_csv(self.anno_path, header=None, delimiter=self.split)
-        self.dataset_samples = list(cleaned.values[:, 0])
-        self.label_array = list(cleaned.values[:, 1])
-
-        if mode == "train":
-            pass
-
-        elif mode == "validation":
-            self.data_transform = Compose(
-                [
-                    Resize(self.short_side_size, interpolation="bilinear"),
-                    CenterCrop(size=(self.crop_size, self.crop_size)),
-                    ClipToTensor(),
-                    Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
-                ]
-            )
-        elif mode == "test":
-            self.data_resize = Compose(
-                [Resize(size=(short_side_size), interpolation="bilinear")]
-            )
-            self.data_transform = Compose(
-                [
-                    ClipToTensor(),
-                    Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
-                ]
-            )
-            self.test_seg = []
-            self.test_dataset = []
-            self.test_label_array = []
-            for ck in range(self.test_num_segment):
-                for cp in range(self.test_num_crop):
-                    for idx in range(len(self.label_array)):
-                        sample_label = self.label_array[idx]
-                        self.test_label_array.append(sample_label)
-                        self.test_dataset.append(self.dataset_samples[idx])
-                        self.test_seg.append((ck, cp))
-
-    def __getitem__(self, index):
-        if self.mode == "train":
-            args = self.args
-            scale_t = 1
-
-            sample = self.dataset_samples[index]
-            buffer = self.loadvideo_decord(sample, sample_rate_scale=scale_t)  # T H W C
-            if len(buffer) == 0:
-                while len(buffer) == 0:
-                    warnings.warn(
-                        "video {} not correctly loaded during training".format(sample)
-                    )
-                    index = np.random.randint(self.__len__())
-                    sample = self.dataset_samples[index]
-                    buffer = self.loadvideo_decord(sample, sample_rate_scale=scale_t)
-
-            if args.num_sample > 1:
-                frame_list = []
-                label_list = []
-                index_list = []
-                for _ in range(args.num_sample):
-                    new_frames = self._aug_frame(buffer, args)
-                    label = self.label_array[index]
-                    frame_list.append(new_frames)
-                    label_list.append(label)
-                    index_list.append(index)
-                return frame_list, label_list, index_list, {}
-            else:
-                buffer = self._aug_frame(buffer, args)
-
-            return buffer, self.label_array[index], index, {}
-
-        elif self.mode == "validation":
-            sample = self.dataset_samples[index]
-            buffer = self.loadvideo_decord(sample)
-            if len(buffer) == 0:
-                while len(buffer) == 0:
-                    warnings.warn(
-                        "video {} not correctly loaded during validation".format(sample)
-                    )
-                    index = np.random.randint(self.__len__())
-                    sample = self.dataset_samples[index]
-                    buffer = self.loadvideo_decord(sample)
-            buffer = self.data_transform(buffer)
-            return buffer, self.label_array[index], sample.split("/")[-1].split(".")[0]
-
-        elif self.mode == "test":
-            sample = self.test_dataset[index]
-            chunk_nb, split_nb = self.test_seg[index]
-            buffer = self.loadvideo_decord(sample)
-
-            while len(buffer) == 0:
-                warnings.warn(
-                    "video {}, temporal {}, spatial {} not found during testing".format(
-                        str(self.test_dataset[index]), chunk_nb, split_nb
-                    )
-                )
-                index = np.random.randint(self.__len__())
-                sample = self.test_dataset[index]
-                chunk_nb, split_nb = self.test_seg[index]
-                buffer = self.loadvideo_decord(sample)
-
-            buffer = self.data_resize(buffer)
-            if isinstance(buffer, list):
-                buffer = np.stack(buffer, 0)
-
-            spatial_step = (
-                1.0
-                * (max(buffer.shape[1], buffer.shape[2]) - self.short_side_size)
-                / (self.test_num_crop - 1)
-            )
-            temporal_start = chunk_nb  # 0/1
-            spatial_start = int(split_nb * spatial_step)
-            if buffer.shape[1] >= buffer.shape[2]:
-                buffer = buffer[
-                    temporal_start::2,
-                    spatial_start : spatial_start + self.short_side_size,
-                    :,
-                    :,
-                ]
-            else:
-                buffer = buffer[
-                    temporal_start::2,
-                    :,
-                    spatial_start : spatial_start + self.short_side_size,
-                    :,
-                ]
-
-            buffer = self.data_transform(buffer)
-            return (
-                buffer,
-                self.test_label_array[index],
-                sample.split("/")[-1].split(".")[0],
-                chunk_nb,
-                split_nb,
-            )
-        else:
-            raise NameError("mode {} unkown".format(self.mode))
-
-    def _aug_frame(
-        self,
-        buffer,
-        args,
-    ):
-
-        aug_transform = create_random_augment(
-            input_size=(self.crop_size, self.crop_size),
-            auto_augment=args.aa,
-            interpolation=args.train_interpolation,
-        )
-
-        buffer = [transforms.ToPILImage()(frame) for frame in buffer]
-
-        buffer = aug_transform(buffer)
-
-        buffer = [transforms.ToTensor()(img) for img in buffer]
-        buffer = torch.stack(buffer)  # T C H W
-        buffer = buffer.permute(0, 2, 3, 1)  # T H W C
-
-        # T H W C
-        buffer = tensor_normalize(buffer, [0.485, 0.456, 0.406], [0.229, 0.224, 0.225])
-        # T H W C -> C T H W.
-        buffer = buffer.permute(3, 0, 1, 2)
-        # Perform data augmentation.
-        scl, asp = (
-            [0.08, 1.0],
-            [0.75, 1.3333],
-        )
-
-        buffer = spatial_sampling(
-            buffer,
-            spatial_idx=-1,
-            min_scale=256,
-            max_scale=320,
-            crop_size=self.crop_size,
-            random_horizontal_flip=False if args.data_set == "SSV2" else True,
-            inverse_uniform_sampling=False,
-            aspect_ratio=asp,
-            scale=scl,
-            motion_shift=False,
-        )
-
-        if self.rand_erase:
-            erase_transform = RandomErasing(
-                args.reprob,
-                mode=args.remode,
-                max_count=args.recount,
-                num_splits=args.recount,
-                device="cpu",
-            )
-            buffer = buffer.permute(1, 0, 2, 3)
-            buffer = erase_transform(buffer)
-            buffer = buffer.permute(1, 0, 2, 3)
-
-        return buffer
-
-    def loadvideo_decord(self, sample, sample_rate_scale=1):
-        """Load video content using Decord"""
-        fname = sample
-        fname = os.path.join(self.prefix, fname)
-
-        try:
-            if self.keep_aspect_ratio:
-                if fname.startswith("s3"):
-                    video_bytes = self.client.get(fname)
-                    vr = VideoReader(io.BytesIO(video_bytes), num_threads=1, ctx=cpu(0))
-                else:
-                    vr = VideoReader(fname, num_threads=1, ctx=cpu(0))
-            else:
-                if fname.startswith("s3:"):
-                    video_bytes = self.client.get(fname)
-                    vr = VideoReader(
-                        io.BytesIO(video_bytes),
-                        width=self.new_width,
-                        height=self.new_height,
-                        num_threads=1,
-                        ctx=cpu(0),
-                    )
-                else:
-                    vr = VideoReader(
-                        fname,
-                        width=self.new_width,
-                        height=self.new_height,
-                        num_threads=1,
-                        ctx=cpu(0),
-                    )
-        except:
-            print("video cannot be loaded by decord: ", fname)
-            return []
-
-        if self.mode == "test":
-            tick = len(vr) / float(self.num_segment)
-            all_index = list(
-                np.array(
-                    [int(tick / 2.0 + tick * x) for x in range(self.num_segment)]
-                    + [int(tick * x) for x in range(self.num_segment)]
-                )
-            )
-            while len(all_index) < (self.num_segment * self.test_num_segment):
-                all_index.append(all_index[-1])
-            all_index = np.sort(np.array(all_index))
-            vr.seek(0)
-            buffer = vr.get_batch(all_index).asnumpy()
-            return buffer
-        elif self.mode == "validation":
-            tick = len(vr) / float(self.num_segment)
-            all_index = np.array(
-                [int(tick / 2.0 + tick * x) for x in range(self.num_segment)]
-            )
-            vr.seek(0)
-            buffer = vr.get_batch(all_index).asnumpy()
-            return buffer
-
-        # handle temporal segments
-        average_duration = len(vr) // self.num_segment
-        if average_duration > 0:
-            all_index = list(
-                np.multiply(list(range(self.num_segment)), average_duration)
-                + np.random.randint(average_duration, size=self.num_segment)
-            )
-        elif len(vr) > self.num_segment:
-            all_index = list(np.sort(np.random.randint(len(vr), size=self.num_segment)))
-        else:
-            all_index = list(np.zeros((self.num_segment,)))
-        vr.seek(0)
-        buffer = vr.get_batch(all_index).asnumpy()
-        return buffer
-
-    def __len__(self):
-        if self.mode != "test":
-            return len(self.dataset_samples)
-        else:
-            return len(self.test_dataset)
-
-
-def spatial_sampling(
-    frames,
-    spatial_idx=-1,
-    min_scale=256,
-    max_scale=320,
-    crop_size=224,
-    random_horizontal_flip=True,
-    inverse_uniform_sampling=False,
-    aspect_ratio=None,
-    scale=None,
-    motion_shift=False,
-):
-    """
-    Perform spatial sampling on the given video frames. If spatial_idx is
-    -1, perform random scale, random crop, and random flip on the given
-    frames. If spatial_idx is 0, 1, or 2, perform spatial uniform sampling
-    with the given spatial_idx.
-    Args:
-        frames (tensor): frames of images sampled from the video. The
-            dimension is `num frames` x `height` x `width` x `channel`.
-        spatial_idx (int): if -1, perform random spatial sampling. If 0, 1,
-            or 2, perform left, center, right crop if width is larger than
-            height, and perform top, center, buttom crop if height is larger
-            than width.
-        min_scale (int): the minimal size of scaling.
-        max_scale (int): the maximal size of scaling.
-        crop_size (int): the size of height and width used to crop the
-            frames.
-        inverse_uniform_sampling (bool): if True, sample uniformly in
-            [1 / max_scale, 1 / min_scale] and take a reciprocal to get the
-            scale. If False, take a uniform sample from [min_scale,
-            max_scale].
-        aspect_ratio (list): Aspect ratio range for resizing.
-        scale (list): Scale range for resizing.
-        motion_shift (bool): Whether to apply motion shift for resizing.
-    Returns:
-        frames (tensor): spatially sampled frames.
-    """
-    assert spatial_idx in [-1, 0, 1, 2]
-    if spatial_idx == -1:
-        if aspect_ratio is None and scale is None:
-            frames, _ = random_short_side_scale_jitter(
-                images=frames,
-                min_size=min_scale,
-                max_size=max_scale,
-                inverse_uniform_sampling=inverse_uniform_sampling,
-            )
-            frames, _ = random_crop(frames, crop_size)
-        else:
-            transform_func = (
-                random_resized_crop_with_shift if motion_shift else random_resized_crop
-            )
-            frames = transform_func(
-                images=frames,
-                target_height=crop_size,
-                target_width=crop_size,
-                scale=scale,
-                ratio=aspect_ratio,
-            )
-        if random_horizontal_flip:
-            frames, _ = horizontal_flip(0.5, frames)
-    else:
-        # The testing is deterministic and no jitter should be performed.
-        # min_scale, max_scale, and crop_size are expect to be the same.
-        assert len({min_scale, max_scale, crop_size}) == 1
-        frames, _ = random_short_side_scale_jitter(frames, min_scale, max_scale)
-        frames, _ = uniform_crop(frames, crop_size, spatial_idx)
-    return frames
-
-
-def tensor_normalize(tensor, mean, std):
-    """
-    Normalize a given tensor by subtracting the mean and dividing the std.
-    Args:
-        tensor (tensor): tensor to normalize.
-        mean (tensor or list): mean value to subtract.
-        std (tensor or list): std to divide.
-    """
-    if tensor.dtype == torch.uint8:
-        tensor = tensor.float()
-        tensor = tensor / 255.0
-    if type(mean) == list:
-        mean = torch.tensor(mean)
-    if type(std) == list:
-        std = torch.tensor(std)
-    tensor = tensor - mean
-    tensor = tensor / std
-    return tensor
diff --git a/eval/vbench/third_party/umt/datasets/transforms.py b/eval/vbench/third_party/umt/datasets/transforms.py
deleted file mode 100644
index 0adffc8f..00000000
--- a/eval/vbench/third_party/umt/datasets/transforms.py
+++ /dev/null
@@ -1,261 +0,0 @@
-import numbers
-import random
-import warnings
-
-import numpy as np
-import torch
-import torchvision
-import torchvision.transforms.functional as F
-from PIL import Image, ImageOps
-
-
-class GroupRandomCrop(object):
-    def __init__(self, size):
-        if isinstance(size, numbers.Number):
-            self.size = (int(size), int(size))
-        else:
-            self.size = size
-
-    def __call__(self, img_tuple):
-        img_group, label = img_tuple
-
-        w, h = img_group[0].size
-        th, tw = self.size
-
-        out_images = list()
-
-        x1 = random.randint(0, w - tw)
-        y1 = random.randint(0, h - th)
-
-        for img in img_group:
-            assert img.size[0] == w and img.size[1] == h
-            if w == tw and h == th:
-                out_images.append(img)
-            else:
-                out_images.append(img.crop((x1, y1, x1 + tw, y1 + th)))
-
-        return (out_images, label)
-
-
-class GroupCenterCrop(object):
-    def __init__(self, size):
-        self.worker = torchvision.transforms.CenterCrop(size)
-
-    def __call__(self, img_tuple):
-        img_group, label = img_tuple
-        return ([self.worker(img) for img in img_group], label)
-
-
-class GroupRandomHorizontalFlip(object):
-    def __init__(self, flip=False):
-        self.flip = flip
-
-    def __call__(self, img_tuple):
-        v = random.random()
-        if self.flip and v < 0.5:
-            img_group, label = img_tuple
-            ret = [img.transpose(Image.FLIP_LEFT_RIGHT) for img in img_group]
-            return (ret, label)
-        else:
-            return img_tuple
-
-
-class GroupNormalize(object):
-    def __init__(self, mean, std):
-        self.mean = mean
-        self.std = std
-
-    def __call__(self, tensor_tuple):
-        tensor, label = tensor_tuple
-        rep_mean = self.mean * (tensor.size()[0] // len(self.mean))
-        rep_std = self.std * (tensor.size()[0] // len(self.std))
-
-        # TODO: make efficient
-        for t, m, s in zip(tensor, rep_mean, rep_std):
-            t.sub_(m).div_(s)
-
-        return (tensor, label)
-
-
-class GroupGrayScale(object):
-    def __init__(self, size):
-        self.worker = torchvision.transforms.Grayscale(size)
-
-    def __call__(self, img_tuple):
-        img_group, label = img_tuple
-        return ([self.worker(img) for img in img_group], label)
-
-
-class GroupColorJitter(object):
-    def __init__(self, size):
-        self.worker = torchvision.transforms.ColorJitter(
-            brightness=size, contrast=size, saturation=size
-        )
-
-    def __call__(self, img_tuple):
-        img_group, label = img_tuple
-        return ([self.worker(img) for img in img_group], label)
-
-
-class GroupScale(object):
-    """Rescales the input PIL.Image to the given 'size'.
-    'size' will be the size of the smaller edge.
-    For example, if height > width, then image will be
-    rescaled to (size * height / width, size)
-    size: size of the smaller edge
-    interpolation: Default: PIL.Image.BILINEAR
-    """
-
-    def __init__(self, size, interpolation=Image.BILINEAR):
-        self.worker = torchvision.transforms.Resize(size, interpolation)
-
-    def __call__(self, img_tuple):
-        img_group, label = img_tuple
-        return ([self.worker(img) for img in img_group], label)
-
-
-class GroupMultiScaleCrop(object):
-
-    def __init__(
-        self, input_size, scales=None, max_distort=1, fix_crop=True, more_fix_crop=True
-    ):
-        self.scales = scales if scales is not None else [1, 875, 0.75, 0.66]
-        self.max_distort = max_distort
-        self.fix_crop = fix_crop
-        self.more_fix_crop = more_fix_crop
-        self.input_size = (
-            input_size if not isinstance(input_size, int) else [input_size, input_size]
-        )
-        self.interpolation = Image.BILINEAR
-
-    def __call__(self, img_tuple):
-        img_group, label = img_tuple
-
-        im_size = img_group[0].size
-
-        crop_w, crop_h, offset_w, offset_h = self._sample_crop_size(im_size)
-        crop_img_group = [
-            img.crop((offset_w, offset_h, offset_w + crop_w, offset_h + crop_h))
-            for img in img_group
-        ]
-        ret_img_group = [
-            img.resize((self.input_size[0], self.input_size[1]), self.interpolation)
-            for img in crop_img_group
-        ]
-        return (ret_img_group, label)
-
-    def _sample_crop_size(self, im_size):
-        image_w, image_h = im_size[0], im_size[1]
-
-        # find a crop size
-        base_size = min(image_w, image_h)
-        crop_sizes = [int(base_size * x) for x in self.scales]
-        crop_h = [
-            self.input_size[1] if abs(x - self.input_size[1]) < 3 else x
-            for x in crop_sizes
-        ]
-        crop_w = [
-            self.input_size[0] if abs(x - self.input_size[0]) < 3 else x
-            for x in crop_sizes
-        ]
-
-        pairs = []
-        for i, h in enumerate(crop_h):
-            for j, w in enumerate(crop_w):
-                if abs(i - j) <= self.max_distort:
-                    pairs.append((w, h))
-
-        crop_pair = random.choice(pairs)
-        if not self.fix_crop:
-            w_offset = random.randint(0, image_w - crop_pair[0])
-            h_offset = random.randint(0, image_h - crop_pair[1])
-        else:
-            w_offset, h_offset = self._sample_fix_offset(
-                image_w, image_h, crop_pair[0], crop_pair[1]
-            )
-
-        return crop_pair[0], crop_pair[1], w_offset, h_offset
-
-    def _sample_fix_offset(self, image_w, image_h, crop_w, crop_h):
-        offsets = self.fill_fix_offset(
-            self.more_fix_crop, image_w, image_h, crop_w, crop_h
-        )
-        return random.choice(offsets)
-
-    @staticmethod
-    def fill_fix_offset(more_fix_crop, image_w, image_h, crop_w, crop_h):
-        w_step = (image_w - crop_w) // 4
-        h_step = (image_h - crop_h) // 4
-
-        ret = list()
-        ret.append((0, 0))  # upper left
-        ret.append((4 * w_step, 0))  # upper right
-        ret.append((0, 4 * h_step))  # lower left
-        ret.append((4 * w_step, 4 * h_step))  # lower right
-        ret.append((2 * w_step, 2 * h_step))  # center
-
-        if more_fix_crop:
-            ret.append((0, 2 * h_step))  # center left
-            ret.append((4 * w_step, 2 * h_step))  # center right
-            ret.append((2 * w_step, 4 * h_step))  # lower center
-            ret.append((2 * w_step, 0 * h_step))  # upper center
-
-            ret.append((1 * w_step, 1 * h_step))  # upper left quarter
-            ret.append((3 * w_step, 1 * h_step))  # upper right quarter
-            ret.append((1 * w_step, 3 * h_step))  # lower left quarter
-            ret.append((3 * w_step, 3 * h_step))  # lower righ quarter
-        return ret
-
-
-class Stack(object):
-
-    def __init__(self, roll=False):
-        self.roll = roll
-
-    def __call__(self, img_tuple):
-        img_group, label = img_tuple
-
-        if img_group[0].mode == "L":
-            return (
-                np.concatenate([np.expand_dims(x, 2) for x in img_group], axis=2),
-                label,
-            )
-        elif img_group[0].mode == "RGB":
-            if self.roll:
-                return (
-                    np.concatenate(
-                        [np.array(x)[:, :, ::-1] for x in img_group], axis=2
-                    ),
-                    label,
-                )
-            else:
-                return (np.concatenate(img_group, axis=2), label)
-
-
-class ToTorchFormatTensor(object):
-    """Converts a PIL.Image (RGB) or numpy.ndarray (H x W x C) in the range [0, 255]
-    to a torch.FloatTensor of shape (C x H x W) in the range [0.0, 1.0]"""
-
-    def __init__(self, div=True):
-        self.div = div
-
-    def __call__(self, pic_tuple):
-        pic, label = pic_tuple
-
-        if isinstance(pic, np.ndarray):
-            # handle numpy array
-            img = torch.from_numpy(pic).permute(2, 0, 1).contiguous()
-        else:
-            # handle PIL Image
-            img = torch.ByteTensor(torch.ByteStorage.from_buffer(pic.tobytes()))
-            img = img.view(pic.size[1], pic.size[0], len(pic.mode))
-            # put it from HWC to CHW format
-            # yikes, this transpose takes 80% of the loading time/CPU
-            img = img.transpose(0, 1).transpose(0, 2).contiguous()
-        return (img.float().div(255.0) if self.div else img.float(), label)
-
-
-class IdentityTransform(object):
-
-    def __call__(self, data):
-        return data
diff --git a/eval/vbench/third_party/umt/datasets/video_transforms.py b/eval/vbench/third_party/umt/datasets/video_transforms.py
deleted file mode 100644
index 93ddf56a..00000000
--- a/eval/vbench/third_party/umt/datasets/video_transforms.py
+++ /dev/null
@@ -1,1269 +0,0 @@
-#!/usr/bin/env python3
-import math
-import numbers
-import random
-
-import numpy as np
-import PIL
-import torch
-import torchvision
-import torchvision.transforms.functional as F
-import vbench.third_party.umt.functional as FF
-from PIL import Image
-from torchvision import transforms
-
-from .rand_augment import rand_augment_transform
-from .random_erasing import RandomErasing
-
-_pil_interpolation_to_str = {
-    Image.NEAREST: "PIL.Image.NEAREST",
-    Image.BILINEAR: "PIL.Image.BILINEAR",
-    Image.BICUBIC: "PIL.Image.BICUBIC",
-    Image.LANCZOS: "PIL.Image.LANCZOS",
-    Image.HAMMING: "PIL.Image.HAMMING",
-    Image.BOX: "PIL.Image.BOX",
-}
-
-
-_RANDOM_INTERPOLATION = (Image.BILINEAR, Image.BICUBIC)
-
-
-def _pil_interp(method):
-    if method == "bicubic":
-        return Image.BICUBIC
-    elif method == "lanczos":
-        return Image.LANCZOS
-    elif method == "hamming":
-        return Image.HAMMING
-    else:
-        return Image.BILINEAR
-
-
-def random_short_side_scale_jitter(
-    images, min_size, max_size, boxes=None, inverse_uniform_sampling=False
-):
-    """
-    Perform a spatial short scale jittering on the given images and
-    corresponding boxes.
-    Args:
-        images (tensor): images to perform scale jitter. Dimension is
-            `num frames` x `channel` x `height` x `width`.
-        min_size (int): the minimal size to scale the frames.
-        max_size (int): the maximal size to scale the frames.
-        boxes (ndarray): optional. Corresponding boxes to images.
-            Dimension is `num boxes` x 4.
-        inverse_uniform_sampling (bool): if True, sample uniformly in
-            [1 / max_scale, 1 / min_scale] and take a reciprocal to get the
-            scale. If False, take a uniform sample from [min_scale, max_scale].
-    Returns:
-        (tensor): the scaled images with dimension of
-            `num frames` x `channel` x `new height` x `new width`.
-        (ndarray or None): the scaled boxes with dimension of
-            `num boxes` x 4.
-    """
-    if inverse_uniform_sampling:
-        size = int(round(1.0 / np.random.uniform(1.0 / max_size, 1.0 / min_size)))
-    else:
-        size = int(round(np.random.uniform(min_size, max_size)))
-
-    height = images.shape[2]
-    width = images.shape[3]
-    if (width <= height and width == size) or (height <= width and height == size):
-        return images, boxes
-    new_width = size
-    new_height = size
-    if width < height:
-        new_height = int(math.floor((float(height) / width) * size))
-        if boxes is not None:
-            boxes = boxes * float(new_height) / height
-    else:
-        new_width = int(math.floor((float(width) / height) * size))
-        if boxes is not None:
-            boxes = boxes * float(new_width) / width
-
-    return (
-        torch.nn.functional.interpolate(
-            images,
-            size=(new_height, new_width),
-            mode="bilinear",
-            align_corners=False,
-        ),
-        boxes,
-    )
-
-
-def crop_boxes(boxes, x_offset, y_offset):
-    """
-    Peform crop on the bounding boxes given the offsets.
-    Args:
-        boxes (ndarray or None): bounding boxes to peform crop. The dimension
-            is `num boxes` x 4.
-        x_offset (int): cropping offset in the x axis.
-        y_offset (int): cropping offset in the y axis.
-    Returns:
-        cropped_boxes (ndarray or None): the cropped boxes with dimension of
-            `num boxes` x 4.
-    """
-    cropped_boxes = boxes.copy()
-    cropped_boxes[:, [0, 2]] = boxes[:, [0, 2]] - x_offset
-    cropped_boxes[:, [1, 3]] = boxes[:, [1, 3]] - y_offset
-
-    return cropped_boxes
-
-
-def random_crop(images, size, boxes=None):
-    """
-    Perform random spatial crop on the given images and corresponding boxes.
-    Args:
-        images (tensor): images to perform random crop. The dimension is
-            `num frames` x `channel` x `height` x `width`.
-        size (int): the size of height and width to crop on the image.
-        boxes (ndarray or None): optional. Corresponding boxes to images.
-            Dimension is `num boxes` x 4.
-    Returns:
-        cropped (tensor): cropped images with dimension of
-            `num frames` x `channel` x `size` x `size`.
-        cropped_boxes (ndarray or None): the cropped boxes with dimension of
-            `num boxes` x 4.
-    """
-    if images.shape[2] == size and images.shape[3] == size:
-        return images
-    height = images.shape[2]
-    width = images.shape[3]
-    y_offset = 0
-    if height > size:
-        y_offset = int(np.random.randint(0, height - size))
-    x_offset = 0
-    if width > size:
-        x_offset = int(np.random.randint(0, width - size))
-    cropped = images[:, :, y_offset : y_offset + size, x_offset : x_offset + size]
-
-    cropped_boxes = crop_boxes(boxes, x_offset, y_offset) if boxes is not None else None
-
-    return cropped, cropped_boxes
-
-
-def horizontal_flip(prob, images, boxes=None):
-    """
-    Perform horizontal flip on the given images and corresponding boxes.
-    Args:
-        prob (float): probility to flip the images.
-        images (tensor): images to perform horizontal flip, the dimension is
-            `num frames` x `channel` x `height` x `width`.
-        boxes (ndarray or None): optional. Corresponding boxes to images.
-            Dimension is `num boxes` x 4.
-    Returns:
-        images (tensor): images with dimension of
-            `num frames` x `channel` x `height` x `width`.
-        flipped_boxes (ndarray or None): the flipped boxes with dimension of
-            `num boxes` x 4.
-    """
-    if boxes is None:
-        flipped_boxes = None
-    else:
-        flipped_boxes = boxes.copy()
-
-    if np.random.uniform() < prob:
-        images = images.flip((-1))
-
-        if len(images.shape) == 3:
-            width = images.shape[2]
-        elif len(images.shape) == 4:
-            width = images.shape[3]
-        else:
-            raise NotImplementedError("Dimension does not supported")
-        if boxes is not None:
-            flipped_boxes[:, [0, 2]] = width - boxes[:, [2, 0]] - 1
-
-    return images, flipped_boxes
-
-
-def uniform_crop(images, size, spatial_idx, boxes=None, scale_size=None):
-    """
-    Perform uniform spatial sampling on the images and corresponding boxes.
-    Args:
-        images (tensor): images to perform uniform crop. The dimension is
-            `num frames` x `channel` x `height` x `width`.
-        size (int): size of height and weight to crop the images.
-        spatial_idx (int): 0, 1, or 2 for left, center, and right crop if width
-            is larger than height. Or 0, 1, or 2 for top, center, and bottom
-            crop if height is larger than width.
-        boxes (ndarray or None): optional. Corresponding boxes to images.
-            Dimension is `num boxes` x 4.
-        scale_size (int): optinal. If not None, resize the images to scale_size before
-            performing any crop.
-    Returns:
-        cropped (tensor): images with dimension of
-            `num frames` x `channel` x `size` x `size`.
-        cropped_boxes (ndarray or None): the cropped boxes with dimension of
-            `num boxes` x 4.
-    """
-    assert spatial_idx in [0, 1, 2]
-    ndim = len(images.shape)
-    if ndim == 3:
-        images = images.unsqueeze(0)
-    height = images.shape[2]
-    width = images.shape[3]
-
-    if scale_size is not None:
-        if width <= height:
-            width, height = scale_size, int(height / width * scale_size)
-        else:
-            width, height = int(width / height * scale_size), scale_size
-        images = torch.nn.functional.interpolate(
-            images,
-            size=(height, width),
-            mode="bilinear",
-            align_corners=False,
-        )
-
-    y_offset = int(math.ceil((height - size) / 2))
-    x_offset = int(math.ceil((width - size) / 2))
-
-    if height > width:
-        if spatial_idx == 0:
-            y_offset = 0
-        elif spatial_idx == 2:
-            y_offset = height - size
-    else:
-        if spatial_idx == 0:
-            x_offset = 0
-        elif spatial_idx == 2:
-            x_offset = width - size
-    cropped = images[:, :, y_offset : y_offset + size, x_offset : x_offset + size]
-    cropped_boxes = crop_boxes(boxes, x_offset, y_offset) if boxes is not None else None
-    if ndim == 3:
-        cropped = cropped.squeeze(0)
-    return cropped, cropped_boxes
-
-
-def clip_boxes_to_image(boxes, height, width):
-    """
-    Clip an array of boxes to an image with the given height and width.
-    Args:
-        boxes (ndarray): bounding boxes to perform clipping.
-            Dimension is `num boxes` x 4.
-        height (int): given image height.
-        width (int): given image width.
-    Returns:
-        clipped_boxes (ndarray): the clipped boxes with dimension of
-            `num boxes` x 4.
-    """
-    clipped_boxes = boxes.copy()
-    clipped_boxes[:, [0, 2]] = np.minimum(
-        width - 1.0, np.maximum(0.0, boxes[:, [0, 2]])
-    )
-    clipped_boxes[:, [1, 3]] = np.minimum(
-        height - 1.0, np.maximum(0.0, boxes[:, [1, 3]])
-    )
-    return clipped_boxes
-
-
-def blend(images1, images2, alpha):
-    """
-    Blend two images with a given weight alpha.
-    Args:
-        images1 (tensor): the first images to be blended, the dimension is
-            `num frames` x `channel` x `height` x `width`.
-        images2 (tensor): the second images to be blended, the dimension is
-            `num frames` x `channel` x `height` x `width`.
-        alpha (float): the blending weight.
-    Returns:
-        (tensor): blended images, the dimension is
-            `num frames` x `channel` x `height` x `width`.
-    """
-    return images1 * alpha + images2 * (1 - alpha)
-
-
-def grayscale(images):
-    """
-    Get the grayscale for the input images. The channels of images should be
-    in order BGR.
-    Args:
-        images (tensor): the input images for getting grayscale. Dimension is
-            `num frames` x `channel` x `height` x `width`.
-    Returns:
-        img_gray (tensor): blended images, the dimension is
-            `num frames` x `channel` x `height` x `width`.
-    """
-    # R -> 0.299, G -> 0.587, B -> 0.114.
-    img_gray = torch.tensor(images)
-    gray_channel = 0.299 * images[:, 2] + 0.587 * images[:, 1] + 0.114 * images[:, 0]
-    img_gray[:, 0] = gray_channel
-    img_gray[:, 1] = gray_channel
-    img_gray[:, 2] = gray_channel
-    return img_gray
-
-
-def color_jitter(images, img_brightness=0, img_contrast=0, img_saturation=0):
-    """
-    Perfrom a color jittering on the input images. The channels of images
-    should be in order BGR.
-    Args:
-        images (tensor): images to perform color jitter. Dimension is
-            `num frames` x `channel` x `height` x `width`.
-        img_brightness (float): jitter ratio for brightness.
-        img_contrast (float): jitter ratio for contrast.
-        img_saturation (float): jitter ratio for saturation.
-    Returns:
-        images (tensor): the jittered images, the dimension is
-            `num frames` x `channel` x `height` x `width`.
-    """
-
-    jitter = []
-    if img_brightness != 0:
-        jitter.append("brightness")
-    if img_contrast != 0:
-        jitter.append("contrast")
-    if img_saturation != 0:
-        jitter.append("saturation")
-
-    if len(jitter) > 0:
-        order = np.random.permutation(np.arange(len(jitter)))
-        for idx in range(0, len(jitter)):
-            if jitter[order[idx]] == "brightness":
-                images = brightness_jitter(img_brightness, images)
-            elif jitter[order[idx]] == "contrast":
-                images = contrast_jitter(img_contrast, images)
-            elif jitter[order[idx]] == "saturation":
-                images = saturation_jitter(img_saturation, images)
-    return images
-
-
-def brightness_jitter(var, images):
-    """
-    Perfrom brightness jittering on the input images. The channels of images
-    should be in order BGR.
-    Args:
-        var (float): jitter ratio for brightness.
-        images (tensor): images to perform color jitter. Dimension is
-            `num frames` x `channel` x `height` x `width`.
-    Returns:
-        images (tensor): the jittered images, the dimension is
-            `num frames` x `channel` x `height` x `width`.
-    """
-    alpha = 1.0 + np.random.uniform(-var, var)
-
-    img_bright = torch.zeros(images.shape)
-    images = blend(images, img_bright, alpha)
-    return images
-
-
-def contrast_jitter(var, images):
-    """
-    Perfrom contrast jittering on the input images. The channels of images
-    should be in order BGR.
-    Args:
-        var (float): jitter ratio for contrast.
-        images (tensor): images to perform color jitter. Dimension is
-            `num frames` x `channel` x `height` x `width`.
-    Returns:
-        images (tensor): the jittered images, the dimension is
-            `num frames` x `channel` x `height` x `width`.
-    """
-    alpha = 1.0 + np.random.uniform(-var, var)
-
-    img_gray = grayscale(images)
-    img_gray[:] = torch.mean(img_gray, dim=(1, 2, 3), keepdim=True)
-    images = blend(images, img_gray, alpha)
-    return images
-
-
-def saturation_jitter(var, images):
-    """
-    Perfrom saturation jittering on the input images. The channels of images
-    should be in order BGR.
-    Args:
-        var (float): jitter ratio for saturation.
-        images (tensor): images to perform color jitter. Dimension is
-            `num frames` x `channel` x `height` x `width`.
-    Returns:
-        images (tensor): the jittered images, the dimension is
-            `num frames` x `channel` x `height` x `width`.
-    """
-    alpha = 1.0 + np.random.uniform(-var, var)
-    img_gray = grayscale(images)
-    images = blend(images, img_gray, alpha)
-
-    return images
-
-
-def lighting_jitter(images, alphastd, eigval, eigvec):
-    """
-    Perform AlexNet-style PCA jitter on the given images.
-    Args:
-        images (tensor): images to perform lighting jitter. Dimension is
-            `num frames` x `channel` x `height` x `width`.
-        alphastd (float): jitter ratio for PCA jitter.
-        eigval (list): eigenvalues for PCA jitter.
-        eigvec (list[list]): eigenvectors for PCA jitter.
-    Returns:
-        out_images (tensor): the jittered images, the dimension is
-            `num frames` x `channel` x `height` x `width`.
-    """
-    if alphastd == 0:
-        return images
-    # generate alpha1, alpha2, alpha3.
-    alpha = np.random.normal(0, alphastd, size=(1, 3))
-    eig_vec = np.array(eigvec)
-    eig_val = np.reshape(eigval, (1, 3))
-    rgb = np.sum(
-        eig_vec * np.repeat(alpha, 3, axis=0) * np.repeat(eig_val, 3, axis=0),
-        axis=1,
-    )
-    out_images = torch.zeros_like(images)
-    if len(images.shape) == 3:
-        # C H W
-        channel_dim = 0
-    elif len(images.shape) == 4:
-        # T C H W
-        channel_dim = 1
-    else:
-        raise NotImplementedError(f"Unsupported dimension {len(images.shape)}")
-
-    for idx in range(images.shape[channel_dim]):
-        # C H W
-        if len(images.shape) == 3:
-            out_images[idx] = images[idx] + rgb[2 - idx]
-        # T C H W
-        elif len(images.shape) == 4:
-            out_images[:, idx] = images[:, idx] + rgb[2 - idx]
-        else:
-            raise NotImplementedError(f"Unsupported dimension {len(images.shape)}")
-
-    return out_images
-
-
-def color_normalization(images, mean, stddev):
-    """
-    Perform color nomration on the given images.
-    Args:
-        images (tensor): images to perform color normalization. Dimension is
-            `num frames` x `channel` x `height` x `width`.
-        mean (list): mean values for normalization.
-        stddev (list): standard deviations for normalization.
-
-    Returns:
-        out_images (tensor): the noramlized images, the dimension is
-            `num frames` x `channel` x `height` x `width`.
-    """
-    if len(images.shape) == 3:
-        assert len(mean) == images.shape[0], "channel mean not computed properly"
-        assert len(stddev) == images.shape[0], "channel stddev not computed properly"
-    elif len(images.shape) == 4:
-        assert len(mean) == images.shape[1], "channel mean not computed properly"
-        assert len(stddev) == images.shape[1], "channel stddev not computed properly"
-    else:
-        raise NotImplementedError(f"Unsupported dimension {len(images.shape)}")
-
-    out_images = torch.zeros_like(images)
-    for idx in range(len(mean)):
-        # C H W
-        if len(images.shape) == 3:
-            out_images[idx] = (images[idx] - mean[idx]) / stddev[idx]
-        elif len(images.shape) == 4:
-            out_images[:, idx] = (images[:, idx] - mean[idx]) / stddev[idx]
-        else:
-            raise NotImplementedError(f"Unsupported dimension {len(images.shape)}")
-    return out_images
-
-
-def _get_param_spatial_crop(
-    scale, ratio, height, width, num_repeat=10, log_scale=True, switch_hw=False
-):
-    """
-    Given scale, ratio, height and width, return sampled coordinates of the videos.
-    """
-    for _ in range(num_repeat):
-        area = height * width
-        target_area = random.uniform(*scale) * area
-        if log_scale:
-            log_ratio = (math.log(ratio[0]), math.log(ratio[1]))
-            aspect_ratio = math.exp(random.uniform(*log_ratio))
-        else:
-            aspect_ratio = random.uniform(*ratio)
-
-        w = int(round(math.sqrt(target_area * aspect_ratio)))
-        h = int(round(math.sqrt(target_area / aspect_ratio)))
-
-        if np.random.uniform() < 0.5 and switch_hw:
-            w, h = h, w
-
-        if 0 < w <= width and 0 < h <= height:
-            i = random.randint(0, height - h)
-            j = random.randint(0, width - w)
-            return i, j, h, w
-
-    # Fallback to central crop
-    in_ratio = float(width) / float(height)
-    if in_ratio < min(ratio):
-        w = width
-        h = int(round(w / min(ratio)))
-    elif in_ratio > max(ratio):
-        h = height
-        w = int(round(h * max(ratio)))
-    else:  # whole image
-        w = width
-        h = height
-    i = (height - h) // 2
-    j = (width - w) // 2
-    return i, j, h, w
-
-
-def random_resized_crop(
-    images,
-    target_height,
-    target_width,
-    scale=(0.8, 1.0),
-    ratio=(3.0 / 4.0, 4.0 / 3.0),
-):
-    """
-    Crop the given images to random size and aspect ratio. A crop of random
-    size (default: of 0.08 to 1.0) of the original size and a random aspect
-    ratio (default: of 3/4 to 4/3) of the original aspect ratio is made. This
-    crop is finally resized to given size. This is popularly used to train the
-    Inception networks.
-
-    Args:
-        images: Images to perform resizing and cropping.
-        target_height: Desired height after cropping.
-        target_width: Desired width after cropping.
-        scale: Scale range of Inception-style area based random resizing.
-        ratio: Aspect ratio range of Inception-style area based random resizing.
-    """
-
-    height = images.shape[2]
-    width = images.shape[3]
-
-    i, j, h, w = _get_param_spatial_crop(scale, ratio, height, width)
-    cropped = images[:, :, i : i + h, j : j + w]
-    return torch.nn.functional.interpolate(
-        cropped,
-        size=(target_height, target_width),
-        mode="bilinear",
-        align_corners=False,
-    )
-
-
-def random_resized_crop_with_shift(
-    images,
-    target_height,
-    target_width,
-    scale=(0.8, 1.0),
-    ratio=(3.0 / 4.0, 4.0 / 3.0),
-):
-    """
-    This is similar to random_resized_crop. However, it samples two different
-    boxes (for cropping) for the first and last frame. It then linearly
-    interpolates the two boxes for other frames.
-
-    Args:
-        images: Images to perform resizing and cropping.
-        target_height: Desired height after cropping.
-        target_width: Desired width after cropping.
-        scale: Scale range of Inception-style area based random resizing.
-        ratio: Aspect ratio range of Inception-style area based random resizing.
-    """
-    t = images.shape[1]
-    height = images.shape[2]
-    width = images.shape[3]
-
-    i, j, h, w = _get_param_spatial_crop(scale, ratio, height, width)
-    i_, j_, h_, w_ = _get_param_spatial_crop(scale, ratio, height, width)
-    i_s = [int(i) for i in torch.linspace(i, i_, steps=t).tolist()]
-    j_s = [int(i) for i in torch.linspace(j, j_, steps=t).tolist()]
-    h_s = [int(i) for i in torch.linspace(h, h_, steps=t).tolist()]
-    w_s = [int(i) for i in torch.linspace(w, w_, steps=t).tolist()]
-    out = torch.zeros((3, t, target_height, target_width))
-    for ind in range(t):
-        out[:, ind : ind + 1, :, :] = torch.nn.functional.interpolate(
-            images[
-                :,
-                ind : ind + 1,
-                i_s[ind] : i_s[ind] + h_s[ind],
-                j_s[ind] : j_s[ind] + w_s[ind],
-            ],
-            size=(target_height, target_width),
-            mode="bilinear",
-            align_corners=False,
-        )
-    return out
-
-
-def create_random_augment(
-    input_size,
-    auto_augment=None,
-    interpolation="bilinear",
-):
-    """
-    Get video randaug transform.
-
-    Args:
-        input_size: The size of the input video in tuple.
-        auto_augment: Parameters for randaug. An example:
-            "rand-m7-n4-mstd0.5-inc1" (m is the magnitude and n is the number
-            of operations to apply).
-        interpolation: Interpolation method.
-    """
-    if isinstance(input_size, tuple):
-        img_size = input_size[-2:]
-    else:
-        img_size = input_size
-
-    if auto_augment:
-        assert isinstance(auto_augment, str)
-        if isinstance(img_size, tuple):
-            img_size_min = min(img_size)
-        else:
-            img_size_min = img_size
-        aa_params = {"translate_const": int(img_size_min * 0.45)}
-        if interpolation and interpolation != "random":
-            aa_params["interpolation"] = _pil_interp(interpolation)
-        if auto_augment.startswith("rand"):
-            return transforms.Compose([rand_augment_transform(auto_augment, aa_params)])
-    raise NotImplementedError
-
-
-def random_sized_crop_img(
-    im,
-    size,
-    jitter_scale=(0.08, 1.0),
-    jitter_aspect=(3.0 / 4.0, 4.0 / 3.0),
-    max_iter=10,
-):
-    """
-    Performs Inception-style cropping (used for training).
-    """
-    assert len(im.shape) == 3, "Currently only support image for random_sized_crop"
-    h, w = im.shape[1:3]
-    i, j, h, w = _get_param_spatial_crop(
-        scale=jitter_scale,
-        ratio=jitter_aspect,
-        height=h,
-        width=w,
-        num_repeat=max_iter,
-        log_scale=False,
-        switch_hw=True,
-    )
-    cropped = im[:, i : i + h, j : j + w]
-    return torch.nn.functional.interpolate(
-        cropped.unsqueeze(0),
-        size=(size, size),
-        mode="bilinear",
-        align_corners=False,
-    ).squeeze(0)
-
-
-# The following code are modified based on timm lib, we will replace the following
-# contents with dependency from PyTorchVideo.
-# https://github.com/facebookresearch/pytorchvideo
-class RandomResizedCropAndInterpolation:
-    """Crop the given PIL Image to random size and aspect ratio with random interpolation.
-    A crop of random size (default: of 0.08 to 1.0) of the original size and a random
-    aspect ratio (default: of 3/4 to 4/3) of the original aspect ratio is made. This crop
-    is finally resized to given size.
-    This is popularly used to train the Inception networks.
-    Args:
-        size: expected output size of each edge
-        scale: range of size of the origin size cropped
-        ratio: range of aspect ratio of the origin aspect ratio cropped
-        interpolation: Default: PIL.Image.BILINEAR
-    """
-
-    def __init__(
-        self,
-        size,
-        scale=(0.08, 1.0),
-        ratio=(3.0 / 4.0, 4.0 / 3.0),
-        interpolation="bilinear",
-    ):
-        if isinstance(size, tuple):
-            self.size = size
-        else:
-            self.size = (size, size)
-        if (scale[0] > scale[1]) or (ratio[0] > ratio[1]):
-            print("range should be of kind (min, max)")
-
-        if interpolation == "random":
-            self.interpolation = _RANDOM_INTERPOLATION
-        else:
-            self.interpolation = _pil_interp(interpolation)
-        self.scale = scale
-        self.ratio = ratio
-
-    @staticmethod
-    def get_params(img, scale, ratio):
-        """Get parameters for ``crop`` for a random sized crop.
-        Args:
-            img (PIL Image): Image to be cropped.
-            scale (tuple): range of size of the origin size cropped
-            ratio (tuple): range of aspect ratio of the origin aspect ratio cropped
-        Returns:
-            tuple: params (i, j, h, w) to be passed to ``crop`` for a random
-                sized crop.
-        """
-        area = img.size[0] * img.size[1]
-
-        for _ in range(10):
-            target_area = random.uniform(*scale) * area
-            log_ratio = (math.log(ratio[0]), math.log(ratio[1]))
-            aspect_ratio = math.exp(random.uniform(*log_ratio))
-
-            w = int(round(math.sqrt(target_area * aspect_ratio)))
-            h = int(round(math.sqrt(target_area / aspect_ratio)))
-
-            if w <= img.size[0] and h <= img.size[1]:
-                i = random.randint(0, img.size[1] - h)
-                j = random.randint(0, img.size[0] - w)
-                return i, j, h, w
-
-        # Fallback to central crop
-        in_ratio = img.size[0] / img.size[1]
-        if in_ratio < min(ratio):
-            w = img.size[0]
-            h = int(round(w / min(ratio)))
-        elif in_ratio > max(ratio):
-            h = img.size[1]
-            w = int(round(h * max(ratio)))
-        else:  # whole image
-            w = img.size[0]
-            h = img.size[1]
-        i = (img.size[1] - h) // 2
-        j = (img.size[0] - w) // 2
-        return i, j, h, w
-
-    def __call__(self, img):
-        """
-        Args:
-            img (PIL Image): Image to be cropped and resized.
-        Returns:
-            PIL Image: Randomly cropped and resized image.
-        """
-        i, j, h, w = self.get_params(img, self.scale, self.ratio)
-        if isinstance(self.interpolation, (tuple, list)):
-            interpolation = random.choice(self.interpolation)
-        else:
-            interpolation = self.interpolation
-        return F.resized_crop(img, i, j, h, w, self.size, interpolation)
-
-    def __repr__(self):
-        if isinstance(self.interpolation, (tuple, list)):
-            interpolate_str = " ".join(
-                [_pil_interpolation_to_str[x] for x in self.interpolation]
-            )
-        else:
-            interpolate_str = _pil_interpolation_to_str[self.interpolation]
-        format_string = self.__class__.__name__ + "(size={0}".format(self.size)
-        format_string += ", scale={0}".format(tuple(round(s, 4) for s in self.scale))
-        format_string += ", ratio={0}".format(tuple(round(r, 4) for r in self.ratio))
-        format_string += ", interpolation={0})".format(interpolate_str)
-        return format_string
-
-
-def transforms_imagenet_train(
-    img_size=224,
-    scale=None,
-    ratio=None,
-    hflip=0.5,
-    vflip=0.0,
-    color_jitter=0.4,
-    auto_augment=None,
-    interpolation="random",
-    use_prefetcher=False,
-    mean=(0.485, 0.456, 0.406),
-    std=(0.229, 0.224, 0.225),
-    re_prob=0.0,
-    re_mode="const",
-    re_count=1,
-    re_num_splits=0,
-    separate=False,
-):
-    """
-    If separate==True, the transforms are returned as a tuple of 3 separate transforms
-    for use in a mixing dataset that passes
-     * all data through the first (primary) transform, called the 'clean' data
-     * a portion of the data through the secondary transform
-     * normalizes and converts the branches above with the third, final transform
-    """
-    if isinstance(img_size, tuple):
-        img_size = img_size[-2:]
-    else:
-        img_size = img_size
-
-    scale = tuple(scale or (0.08, 1.0))  # default imagenet scale range
-    ratio = tuple(ratio or (3.0 / 4.0, 4.0 / 3.0))  # default imagenet ratio range
-    primary_tfl = [
-        RandomResizedCropAndInterpolation(
-            img_size, scale=scale, ratio=ratio, interpolation=interpolation
-        )
-    ]
-    if hflip > 0.0:
-        primary_tfl += [transforms.RandomHorizontalFlip(p=hflip)]
-    if vflip > 0.0:
-        primary_tfl += [transforms.RandomVerticalFlip(p=vflip)]
-
-    secondary_tfl = []
-    if auto_augment:
-        assert isinstance(auto_augment, str)
-        if isinstance(img_size, tuple):
-            img_size_min = min(img_size)
-        else:
-            img_size_min = img_size
-        aa_params = dict(
-            translate_const=int(img_size_min * 0.45),
-            img_mean=tuple([min(255, round(255 * x)) for x in mean]),
-        )
-        if interpolation and interpolation != "random":
-            aa_params["interpolation"] = _pil_interp(interpolation)
-        if auto_augment.startswith("rand"):
-            secondary_tfl += [rand_augment_transform(auto_augment, aa_params)]
-        elif auto_augment.startswith("augmix"):
-            raise NotImplementedError("Augmix not implemented")
-        else:
-            raise NotImplementedError("Auto aug not implemented")
-    elif color_jitter is not None:
-        # color jitter is enabled when not using AA
-        if isinstance(color_jitter, (list, tuple)):
-            # color jitter should be a 3-tuple/list if spec brightness/contrast/saturation
-            # or 4 if also augmenting hue
-            assert len(color_jitter) in (3, 4)
-        else:
-            # if it's a scalar, duplicate for brightness, contrast, and saturation, no hue
-            color_jitter = (float(color_jitter),) * 3
-        secondary_tfl += [transforms.ColorJitter(*color_jitter)]
-
-    final_tfl = []
-    final_tfl += [
-        transforms.ToTensor(),
-        transforms.Normalize(mean=torch.tensor(mean), std=torch.tensor(std)),
-    ]
-    if re_prob > 0.0:
-        final_tfl.append(
-            RandomErasing(
-                re_prob,
-                mode=re_mode,
-                max_count=re_count,
-                num_splits=re_num_splits,
-                device="cpu",
-                cube=False,
-            )
-        )
-
-    if separate:
-        return (
-            transforms.Compose(primary_tfl),
-            transforms.Compose(secondary_tfl),
-            transforms.Compose(final_tfl),
-        )
-    else:
-        return transforms.Compose(primary_tfl + secondary_tfl + final_tfl)
-
-
-############################################################################################################
-############################################################################################################
-
-
-class Compose(object):
-    """Composes several transforms
-    Args:
-    transforms (list of ``Transform`` objects): list of transforms
-    to compose
-    """
-
-    def __init__(self, transforms):
-        self.transforms = transforms
-
-    def __call__(self, clip):
-        for t in self.transforms:
-            clip = t(clip)
-        return clip
-
-
-class RandomHorizontalFlip(object):
-    """Horizontally flip the list of given images randomly
-    with a probability 0.5
-    """
-
-    def __call__(self, clip):
-        """
-        Args:
-        img (PIL.Image or numpy.ndarray): List of images to be cropped
-        in format (h, w, c) in numpy.ndarray
-        Returns:
-        PIL.Image or numpy.ndarray: Randomly flipped clip
-        """
-        if random.random() < 0.5:
-            if isinstance(clip[0], np.ndarray):
-                return [np.fliplr(img) for img in clip]
-            elif isinstance(clip[0], PIL.Image.Image):
-                return [img.transpose(PIL.Image.FLIP_LEFT_RIGHT) for img in clip]
-            else:
-                raise TypeError(
-                    "Expected numpy.ndarray or PIL.Image"
-                    + " but got list of {0}".format(type(clip[0]))
-                )
-        return clip
-
-
-class RandomResize(object):
-    """Resizes a list of (H x W x C) numpy.ndarray to the final size
-    The larger the original image is, the more times it takes to
-    interpolate
-    Args:
-    interpolation (str): Can be one of 'nearest', 'bilinear'
-    defaults to nearest
-    size (tuple): (widht, height)
-    """
-
-    def __init__(self, ratio=(3.0 / 4.0, 4.0 / 3.0), interpolation="nearest"):
-        self.ratio = ratio
-        self.interpolation = interpolation
-
-    def __call__(self, clip):
-        scaling_factor = random.uniform(self.ratio[0], self.ratio[1])
-
-        if isinstance(clip[0], np.ndarray):
-            im_h, im_w, im_c = clip[0].shape
-        elif isinstance(clip[0], PIL.Image.Image):
-            im_w, im_h = clip[0].size
-
-        new_w = int(im_w * scaling_factor)
-        new_h = int(im_h * scaling_factor)
-        new_size = (new_w, new_h)
-        resized = FF.resize_clip(clip, new_size, interpolation=self.interpolation)
-        return resized
-
-
-class Resize(object):
-    """Resizes a list of (H x W x C) numpy.ndarray to the final size
-    The larger the original image is, the more times it takes to
-    interpolate
-    Args:
-    interpolation (str): Can be one of 'nearest', 'bilinear'
-    defaults to nearest
-    size (tuple): (widht, height)
-    """
-
-    def __init__(self, size, interpolation="nearest"):
-        self.size = size
-        self.interpolation = interpolation
-
-    def __call__(self, clip):
-        resized = FF.resize_clip(clip, self.size, interpolation=self.interpolation)
-        return resized
-
-
-class RandomCrop(object):
-    """Extract random crop at the same location for a list of images
-    Args:
-    size (sequence or int): Desired output size for the
-    crop in format (h, w)
-    """
-
-    def __init__(self, size):
-        if isinstance(size, numbers.Number):
-            size = (size, size)
-
-        self.size = size
-
-    def __call__(self, clip):
-        """
-        Args:
-        img (PIL.Image or numpy.ndarray): List of images to be cropped
-        in format (h, w, c) in numpy.ndarray
-        Returns:
-        PIL.Image or numpy.ndarray: Cropped list of images
-        """
-        h, w = self.size
-        if isinstance(clip[0], np.ndarray):
-            im_h, im_w, im_c = clip[0].shape
-        elif isinstance(clip[0], PIL.Image.Image):
-            im_w, im_h = clip[0].size
-        else:
-            raise TypeError(
-                "Expected numpy.ndarray or PIL.Image"
-                + "but got list of {0}".format(type(clip[0]))
-            )
-        if w > im_w or h > im_h:
-            error_msg = (
-                "Initial image size should be larger then "
-                "cropped size but got cropped sizes : ({w}, {h}) while "
-                "initial image is ({im_w}, {im_h})".format(
-                    im_w=im_w, im_h=im_h, w=w, h=h
-                )
-            )
-            raise ValueError(error_msg)
-
-        x1 = random.randint(0, im_w - w)
-        y1 = random.randint(0, im_h - h)
-        cropped = FF.crop_clip(clip, y1, x1, h, w)
-
-        return cropped
-
-
-class ThreeCrop(object):
-    """Extract random crop at the same location for a list of images
-    Args:
-    size (sequence or int): Desired output size for the
-    crop in format (h, w)
-    """
-
-    def __init__(self, size):
-        if isinstance(size, numbers.Number):
-            size = (size, size)
-
-        self.size = size
-
-    def __call__(self, clip):
-        """
-        Args:
-        img (PIL.Image or numpy.ndarray): List of images to be cropped
-        in format (h, w, c) in numpy.ndarray
-        Returns:
-        PIL.Image or numpy.ndarray: Cropped list of images
-        """
-        h, w = self.size
-        if isinstance(clip[0], np.ndarray):
-            im_h, im_w, im_c = clip[0].shape
-        elif isinstance(clip[0], PIL.Image.Image):
-            im_w, im_h = clip[0].size
-        else:
-            raise TypeError(
-                "Expected numpy.ndarray or PIL.Image"
-                + "but got list of {0}".format(type(clip[0]))
-            )
-        if w != im_w and h != im_h:
-            clip = FF.resize_clip(clip, self.size, interpolation="bilinear")
-            im_h, im_w, im_c = clip[0].shape
-
-        step = np.max((np.max((im_w, im_h)) - self.size[0]) // 2, 0)
-        cropped = []
-        for i in range(3):
-            if im_h > self.size[0]:
-                x1 = 0
-                y1 = i * step
-                cropped.extend(FF.crop_clip(clip, y1, x1, h, w))
-            else:
-                x1 = i * step
-                y1 = 0
-                cropped.extend(FF.crop_clip(clip, y1, x1, h, w))
-        return cropped
-
-
-class RandomRotation(object):
-    """Rotate entire clip randomly by a random angle within
-    given bounds
-    Args:
-    degrees (sequence or int): Range of degrees to select from
-    If degrees is a number instead of sequence like (min, max),
-    the range of degrees, will be (-degrees, +degrees).
-    """
-
-    def __init__(self, degrees):
-        if isinstance(degrees, numbers.Number):
-            if degrees < 0:
-                raise ValueError("If degrees is a single number," "must be positive")
-            degrees = (-degrees, degrees)
-        else:
-            if len(degrees) != 2:
-                raise ValueError("If degrees is a sequence," "it must be of len 2.")
-
-        self.degrees = degrees
-
-    def __call__(self, clip):
-        """
-        Args:
-        img (PIL.Image or numpy.ndarray): List of images to be cropped
-        in format (h, w, c) in numpy.ndarray
-        Returns:
-        PIL.Image or numpy.ndarray: Cropped list of images
-        """
-        import skimage
-
-        angle = random.uniform(self.degrees[0], self.degrees[1])
-        if isinstance(clip[0], np.ndarray):
-            rotated = [skimage.transform.rotate(img, angle) for img in clip]
-        elif isinstance(clip[0], PIL.Image.Image):
-            rotated = [img.rotate(angle) for img in clip]
-        else:
-            raise TypeError(
-                "Expected numpy.ndarray or PIL.Image"
-                + "but got list of {0}".format(type(clip[0]))
-            )
-
-        return rotated
-
-
-class CenterCrop(object):
-    """Extract center crop at the same location for a list of images
-    Args:
-    size (sequence or int): Desired output size for the
-    crop in format (h, w)
-    """
-
-    def __init__(self, size):
-        if isinstance(size, numbers.Number):
-            size = (size, size)
-
-        self.size = size
-
-    def __call__(self, clip):
-        """
-        Args:
-        img (PIL.Image or numpy.ndarray): List of images to be cropped
-        in format (h, w, c) in numpy.ndarray
-        Returns:
-        PIL.Image or numpy.ndarray: Cropped list of images
-        """
-        h, w = self.size
-        if isinstance(clip[0], np.ndarray):
-            im_h, im_w, im_c = clip[0].shape
-        elif isinstance(clip[0], PIL.Image.Image):
-            im_w, im_h = clip[0].size
-        else:
-            raise TypeError(
-                "Expected numpy.ndarray or PIL.Image"
-                + "but got list of {0}".format(type(clip[0]))
-            )
-        if w > im_w or h > im_h:
-            error_msg = (
-                "Initial image size should be larger then "
-                "cropped size but got cropped sizes : ({w}, {h}) while "
-                "initial image is ({im_w}, {im_h})".format(
-                    im_w=im_w, im_h=im_h, w=w, h=h
-                )
-            )
-            raise ValueError(error_msg)
-
-        x1 = int(round((im_w - w) / 2.0))
-        y1 = int(round((im_h - h) / 2.0))
-        cropped = FF.crop_clip(clip, y1, x1, h, w)
-
-        return cropped
-
-
-class ColorJitter(object):
-    """Randomly change the brightness, contrast and saturation and hue of the clip
-    Args:
-    brightness (float): How much to jitter brightness. brightness_factor
-    is chosen uniformly from [max(0, 1 - brightness), 1 + brightness].
-    contrast (float): How much to jitter contrast. contrast_factor
-    is chosen uniformly from [max(0, 1 - contrast), 1 + contrast].
-    saturation (float): How much to jitter saturation. saturation_factor
-    is chosen uniformly from [max(0, 1 - saturation), 1 + saturation].
-    hue(float): How much to jitter hue. hue_factor is chosen uniformly from
-    [-hue, hue]. Should be >=0 and <= 0.5.
-    """
-
-    def __init__(self, brightness=0, contrast=0, saturation=0, hue=0):
-        self.brightness = brightness
-        self.contrast = contrast
-        self.saturation = saturation
-        self.hue = hue
-
-    def get_params(self, brightness, contrast, saturation, hue):
-        if brightness > 0:
-            brightness_factor = random.uniform(max(0, 1 - brightness), 1 + brightness)
-        else:
-            brightness_factor = None
-
-        if contrast > 0:
-            contrast_factor = random.uniform(max(0, 1 - contrast), 1 + contrast)
-        else:
-            contrast_factor = None
-
-        if saturation > 0:
-            saturation_factor = random.uniform(max(0, 1 - saturation), 1 + saturation)
-        else:
-            saturation_factor = None
-
-        if hue > 0:
-            hue_factor = random.uniform(-hue, hue)
-        else:
-            hue_factor = None
-        return brightness_factor, contrast_factor, saturation_factor, hue_factor
-
-    def __call__(self, clip):
-        """
-        Args:
-        clip (list): list of PIL.Image
-        Returns:
-        list PIL.Image : list of transformed PIL.Image
-        """
-        if isinstance(clip[0], np.ndarray):
-            raise TypeError("Color jitter not yet implemented for numpy arrays")
-        elif isinstance(clip[0], PIL.Image.Image):
-            brightness, contrast, saturation, hue = self.get_params(
-                self.brightness, self.contrast, self.saturation, self.hue
-            )
-
-            # Create img transform function sequence
-            img_transforms = []
-            if brightness is not None:
-                img_transforms.append(
-                    lambda img: torchvision.transforms.functional.adjust_brightness(
-                        img, brightness
-                    )
-                )
-            if saturation is not None:
-                img_transforms.append(
-                    lambda img: torchvision.transforms.functional.adjust_saturation(
-                        img, saturation
-                    )
-                )
-            if hue is not None:
-                img_transforms.append(
-                    lambda img: torchvision.transforms.functional.adjust_hue(img, hue)
-                )
-            if contrast is not None:
-                img_transforms.append(
-                    lambda img: torchvision.transforms.functional.adjust_contrast(
-                        img, contrast
-                    )
-                )
-            random.shuffle(img_transforms)
-
-            # Apply to all images
-            jittered_clip = []
-            for img in clip:
-                for func in img_transforms:
-                    jittered_img = func(img)
-                jittered_clip.append(jittered_img)
-
-        else:
-            raise TypeError(
-                "Expected numpy.ndarray or PIL.Image"
-                + "but got list of {0}".format(type(clip[0]))
-            )
-        return jittered_clip
-
-
-class Normalize(object):
-    """Normalize a clip with mean and standard deviation.
-    Given mean: ``(M1,...,Mn)`` and std: ``(S1,..,Sn)`` for ``n`` channels, this transform
-    will normalize each channel of the input ``torch.*Tensor`` i.e.
-    ``input[channel] = (input[channel] - mean[channel]) / std[channel]``
-    .. note::
-        This transform acts out of place, i.e., it does not mutates the input tensor.
-    Args:
-        mean (sequence): Sequence of means for each channel.
-        std (sequence): Sequence of standard deviations for each channel.
-    """
-
-    def __init__(self, mean, std):
-        self.mean = mean
-        self.std = std
-
-    def __call__(self, clip):
-        """
-        Args:
-            clip (Tensor): Tensor clip of size (T, C, H, W) to be normalized.
-        Returns:
-            Tensor: Normalized Tensor clip.
-        """
-        return FF.normalize(clip, self.mean, self.std)
-
-    def __repr__(self):
-        return self.__class__.__name__ + "(mean={0}, std={1})".format(
-            self.mean, self.std
-        )
diff --git a/eval/vbench/third_party/umt/datasets/volume_transforms.py b/eval/vbench/third_party/umt/datasets/volume_transforms.py
deleted file mode 100644
index ae040391..00000000
--- a/eval/vbench/third_party/umt/datasets/volume_transforms.py
+++ /dev/null
@@ -1,143 +0,0 @@
-import numpy as np
-import torch
-from PIL import Image
-
-
-def convert_img(img):
-    """Converts (H, W, C) numpy.ndarray to (C, W, H) format"""
-    if len(img.shape) == 3:
-        img = img.transpose(2, 0, 1)
-    if len(img.shape) == 2:
-        img = np.expand_dims(img, 0)
-    return img
-
-
-class ClipToTensor(object):
-    """Convert a list of m (H x W x C) numpy.ndarrays in the range [0, 255]
-    to a torch.FloatTensor of shape (C x m x H x W) in the range [0, 1.0]
-    """
-
-    def __init__(self, channel_nb=3, div_255=True, numpy=False):
-        self.channel_nb = channel_nb
-        self.div_255 = div_255
-        self.numpy = numpy
-
-    def __call__(self, clip):
-        """
-        Args: clip (list of numpy.ndarray): clip (list of images)
-        to be converted to tensor.
-        """
-        # Retrieve shape
-        if isinstance(clip[0], np.ndarray):
-            h, w, ch = clip[0].shape
-            assert ch == self.channel_nb, "Got {0} instead of 3 channels".format(ch)
-        elif isinstance(clip[0], Image.Image):
-            w, h = clip[0].size
-        else:
-            raise TypeError(
-                "Expected numpy.ndarray or PIL.Image\
-            but got list of {0}".format(
-                    type(clip[0])
-                )
-            )
-
-        np_clip = np.zeros([self.channel_nb, len(clip), int(h), int(w)])
-
-        # Convert
-        for img_idx, img in enumerate(clip):
-            if isinstance(img, np.ndarray):
-                pass
-            elif isinstance(img, Image.Image):
-                img = np.array(img, copy=False)
-            else:
-                raise TypeError(
-                    "Expected numpy.ndarray or PIL.Image\
-                but got list of {0}".format(
-                        type(clip[0])
-                    )
-                )
-            img = convert_img(img)
-            np_clip[:, img_idx, :, :] = img
-        if self.numpy:
-            if self.div_255:
-                np_clip = np_clip / 255.0
-            return np_clip
-
-        else:
-            tensor_clip = torch.from_numpy(np_clip)
-
-            if not isinstance(tensor_clip, torch.FloatTensor):
-                tensor_clip = tensor_clip.float()
-            if self.div_255:
-                tensor_clip = torch.div(tensor_clip, 255)
-            return tensor_clip
-
-
-# Note this norms data to -1/1
-class ClipToTensor_K(object):
-    """Convert a list of m (H x W x C) numpy.ndarrays in the range [0, 255]
-    to a torch.FloatTensor of shape (C x m x H x W) in the range [0, 1.0]
-    """
-
-    def __init__(self, channel_nb=3, div_255=True, numpy=False):
-        self.channel_nb = channel_nb
-        self.div_255 = div_255
-        self.numpy = numpy
-
-    def __call__(self, clip):
-        """
-        Args: clip (list of numpy.ndarray): clip (list of images)
-        to be converted to tensor.
-        """
-        # Retrieve shape
-        if isinstance(clip[0], np.ndarray):
-            h, w, ch = clip[0].shape
-            assert ch == self.channel_nb, "Got {0} instead of 3 channels".format(ch)
-        elif isinstance(clip[0], Image.Image):
-            w, h = clip[0].size
-        else:
-            raise TypeError(
-                "Expected numpy.ndarray or PIL.Image\
-            but got list of {0}".format(
-                    type(clip[0])
-                )
-            )
-
-        np_clip = np.zeros([self.channel_nb, len(clip), int(h), int(w)])
-
-        # Convert
-        for img_idx, img in enumerate(clip):
-            if isinstance(img, np.ndarray):
-                pass
-            elif isinstance(img, Image.Image):
-                img = np.array(img, copy=False)
-            else:
-                raise TypeError(
-                    "Expected numpy.ndarray or PIL.Image\
-                but got list of {0}".format(
-                        type(clip[0])
-                    )
-                )
-            img = convert_img(img)
-            np_clip[:, img_idx, :, :] = img
-        if self.numpy:
-            if self.div_255:
-                np_clip = (np_clip - 127.5) / 127.5
-            return np_clip
-
-        else:
-            tensor_clip = torch.from_numpy(np_clip)
-
-            if not isinstance(tensor_clip, torch.FloatTensor):
-                tensor_clip = tensor_clip.float()
-            if self.div_255:
-                tensor_clip = torch.div(torch.sub(tensor_clip, 127.5), 127.5)
-            return tensor_clip
-
-
-class ToTensor(object):
-    """Converts numpy array to tensor"""
-
-    def __call__(self, array):
-        tensor = torch.from_numpy(array)
-        return tensor
diff --git a/eval/vbench/third_party/umt/functional.py b/eval/vbench/third_party/umt/functional.py
deleted file mode 100644
index 21b34fd6..00000000
--- a/eval/vbench/third_party/umt/functional.py
+++ /dev/null
@@ -1,88 +0,0 @@
-import numbers
-
-import cv2
-import numpy as np
-import PIL
-import torch
-
-
-def _is_tensor_clip(clip):
-    return torch.is_tensor(clip) and clip.ndimension() == 4
-
-
-def crop_clip(clip, min_h, min_w, h, w):
-    if isinstance(clip[0], np.ndarray):
-        cropped = [img[min_h : min_h + h, min_w : min_w + w, :] for img in clip]
-
-    elif isinstance(clip[0], PIL.Image.Image):
-        cropped = [img.crop((min_w, min_h, min_w + w, min_h + h)) for img in clip]
-    else:
-        raise TypeError(
-            "Expected numpy.ndarray or PIL.Image"
-            + "but got list of {0}".format(type(clip[0]))
-        )
-    return cropped
-
-
-def resize_clip(clip, size, interpolation="bilinear"):
-    if isinstance(clip[0], np.ndarray):
-        if isinstance(size, numbers.Number):
-            im_h, im_w, im_c = clip[0].shape
-            # Min spatial dim already matches minimal size
-            if (im_w <= im_h and im_w == size) or (im_h <= im_w and im_h == size):
-                return clip
-            new_h, new_w = get_resize_sizes(im_h, im_w, size)
-            size = (new_w, new_h)
-        else:
-            size = size[0], size[1]
-        if interpolation == "bilinear":
-            np_inter = cv2.INTER_LINEAR
-        else:
-            np_inter = cv2.INTER_NEAREST
-        scaled = [cv2.resize(img, size, interpolation=np_inter) for img in clip]
-    elif isinstance(clip[0], PIL.Image.Image):
-        if isinstance(size, numbers.Number):
-            im_w, im_h = clip[0].size
-            # Min spatial dim already matches minimal size
-            if (im_w <= im_h and im_w == size) or (im_h <= im_w and im_h == size):
-                return clip
-            new_h, new_w = get_resize_sizes(im_h, im_w, size)
-            size = (new_w, new_h)
-        else:
-            size = size[1], size[0]
-        if interpolation == "bilinear":
-            pil_inter = PIL.Image.BILINEAR
-        else:
-            pil_inter = PIL.Image.NEAREST
-        scaled = [img.resize(size, pil_inter) for img in clip]
-    else:
-        raise TypeError(
-            "Expected numpy.ndarray or PIL.Image"
-            + "but got list of {0}".format(type(clip[0]))
-        )
-    return scaled
-
-
-def get_resize_sizes(im_h, im_w, size):
-    if im_w < im_h:
-        ow = size
-        oh = int(size * im_h / im_w)
-    else:
-        oh = size
-        ow = int(size * im_w / im_h)
-    return oh, ow
-
-
-def normalize(clip, mean, std, inplace=False):
-    if not _is_tensor_clip(clip):
-        raise TypeError("tensor is not a torch clip.")
-
-    if not inplace:
-        clip = clip.clone()
-
-    dtype = clip.dtype
-    mean = torch.as_tensor(mean, dtype=dtype, device=clip.device)
-    std = torch.as_tensor(std, dtype=dtype, device=clip.device)
-    clip.sub_(mean[:, None, None, None]).div_(std[:, None, None, None])
-
-    return clip
diff --git a/eval/vbench/third_party/umt/kinetics_400_categories.txt b/eval/vbench/third_party/umt/kinetics_400_categories.txt
deleted file mode 100644
index 06fc9968..00000000
--- a/eval/vbench/third_party/umt/kinetics_400_categories.txt
+++ /dev/null
@@ -1,400 +0,0 @@
-riding a bike	0
-marching	1
-dodgeball	2
-playing cymbals	3
-checking tires	4
-roller skating	5
-tasting beer	6
-clapping	7
-drawing	8
-juggling fire	9
-bobsledding	10
-petting animal (not cat)	11
-spray painting	12
-training dog	13
-eating watermelon	14
-building cabinet	15
-applauding	16
-playing harp	17
-balloon blowing	18
-sled dog racing	19
-wrestling	20
-pole vault	21
-hurling (sport)	22
-riding scooter	23
-shearing sheep	24
-sweeping floor	25
-eating carrots	26
-skateboarding	27
-dunking basketball	28
-disc golfing	29
-eating spaghetti	30
-playing flute	31
-riding mechanical bull	32
-making sushi	33
-trapezing	34
-picking fruit	35
-stretching leg	36
-playing ukulele	37
-tying tie	38
-skydiving	39
-playing cello	40
-jumping into pool	41
-shooting goal (soccer)	42
-trimming trees	43
-bookbinding	44
-ski jumping	45
-walking the dog	46
-riding unicycle	47
-shaving head	48
-hopscotch	49
-playing piano	50
-parasailing	51
-bartending	52
-kicking field goal	53
-finger snapping	54
-dining	55
-yawning	56
-peeling potatoes	57
-canoeing or kayaking	58
-front raises	59
-laughing	60
-dancing macarena	61
-digging	62
-reading newspaper	63
-hitting baseball	64
-clay pottery making	65
-exercising with an exercise ball	66
-playing saxophone	67
-shooting basketball	68
-washing hair	69
-lunge	70
-brushing hair	71
-curling hair	72
-kitesurfing	73
-tapping guitar	74
-bending back	75
-skipping rope	76
-situp	77
-folding paper	78
-cracking neck	79
-assembling computer	80
-cleaning gutters	81
-blowing out candles	82
-shaking hands	83
-dancing gangnam style	84
-windsurfing	85
-tap dancing	86
-skiing (not slalom or crosscountry)	87
-bandaging	88
-push up	89
-doing nails	90
-punching person (boxing)	91
-bouncing on trampoline	92
-scrambling eggs	93
-singing	94
-cleaning floor	95
-krumping	96
-drumming fingers	97
-snowmobiling	98
-gymnastics tumbling	99
-headbanging	100
-catching or throwing frisbee	101
-riding elephant	102
-bee keeping	103
-feeding birds	104
-snatch weight lifting	105
-mowing lawn	106
-fixing hair	107
-playing trumpet	108
-flying kite	109
-crossing river	110
-swinging legs	111
-sanding floor	112
-belly dancing	113
-sneezing	114
-clean and jerk	115
-side kick	116
-filling eyebrows	117
-shuffling cards	118
-recording music	119
-cartwheeling	120
-feeding fish	121
-folding clothes	122
-water skiing	123
-tobogganing	124
-blowing leaves	125
-smoking	126
-unboxing	127
-tai chi	128
-waxing legs	129
-riding camel	130
-slapping	131
-tossing salad	132
-capoeira	133
-playing cards	134
-playing organ	135
-playing violin	136
-playing drums	137
-tapping pen	138
-vault	139
-shoveling snow	140
-playing tennis	141
-getting a tattoo	142
-making a sandwich	143
-making tea	144
-grinding meat	145
-squat	146
-eating doughnuts	147
-ice fishing	148
-snowkiting	149
-kicking soccer ball	150
-playing controller	151
-giving or receiving award	152
-welding	153
-throwing discus	154
-throwing axe	155
-ripping paper	156
-swimming butterfly stroke	157
-air drumming	158
-blowing nose	159
-hockey stop	160
-taking a shower	161
-bench pressing	162
-planting trees	163
-pumping fist	164
-climbing tree	165
-tickling	166
-high kick	167
-waiting in line	168
-slacklining	169
-tango dancing	170
-hurdling	171
-carrying baby	172
-celebrating	173
-sharpening knives	174
-passing American football (in game)	175
-headbutting	176
-playing recorder	177
-brush painting	178
-garbage collecting	179
-robot dancing	180
-shredding paper	181
-pumping gas	182
-rock climbing	183
-hula hooping	184
-braiding hair	185
-opening present	186
-texting	187
-decorating the christmas tree	188
-answering questions	189
-playing keyboard	190
-writing	191
-bungee jumping	192
-sniffing	193
-eating burger	194
-playing accordion	195
-making pizza	196
-playing volleyball	197
-tasting food	198
-pushing cart	199
-spinning poi	200
-cleaning windows	201
-arm wrestling	202
-changing oil	203
-swimming breast stroke	204
-tossing coin	205
-deadlifting	206
-hoverboarding	207
-cutting watermelon	208
-cheerleading	209
-snorkeling	210
-washing hands	211
-eating cake	212
-pull ups	213
-surfing water	214
-eating hotdog	215
-holding snake	216
-playing harmonica	217
-ironing	218
-cutting nails	219
-golf chipping	220
-shot put	221
-hugging	222
-playing clarinet	223
-faceplanting	224
-trimming or shaving beard	225
-drinking shots	226
-riding mountain bike	227
-tying bow tie	228
-swinging on something	229
-skiing crosscountry	230
-unloading truck	231
-cleaning pool	232
-jogging	233
-ice climbing	234
-mopping floor	235
-making bed	236
-diving cliff	237
-washing dishes	238
-grooming dog	239
-weaving basket	240
-frying vegetables	241
-stomping grapes	242
-moving furniture	243
-cooking sausages	244
-doing laundry	245
-dying hair	246
-knitting	247
-reading book	248
-baby waking up	249
-punching bag	250
-surfing crowd	251
-cooking chicken	252
-pushing car	253
-springboard diving	254
-swing dancing	255
-massaging legs	256
-beatboxing	257
-breading or breadcrumbing	258
-somersaulting	259
-brushing teeth	260
-stretching arm	261
-juggling balls	262
-massaging person's head	263
-eating ice cream	264
-extinguishing fire	265
-hammer throw	266
-whistling	267
-crawling baby	268
-using remote controller (not gaming)	269
-playing cricket	270
-opening bottle	271
-playing xylophone	272
-motorcycling	273
-driving car	274
-exercising arm	275
-passing American football (not in game)	276
-playing kickball	277
-sticking tongue out	278
-flipping pancake	279
-catching fish	280
-eating chips	281
-shaking head	282
-sword fighting	283
-playing poker	284
-cooking on campfire	285
-doing aerobics	286
-paragliding	287
-using segway	288
-folding napkins	289
-playing bagpipes	290
-gargling	291
-skiing slalom	292
-strumming guitar	293
-javelin throw	294
-waxing back	295
-riding or walking with horse	296
-plastering	297
-long jump	298
-parkour	299
-wrapping present	300
-egg hunting	301
-archery	302
-cleaning toilet	303
-swimming backstroke	304
-snowboarding	305
-catching or throwing baseball	306
-massaging back	307
-blowing glass	308
-playing guitar	309
-playing chess	310
-golf driving	311
-presenting weather forecast	312
-rock scissors paper	313
-high jump	314
-baking cookies	315
-using computer	316
-washing feet	317
-arranging flowers	318
-playing bass guitar	319
-spraying	320
-cutting pineapple	321
-waxing chest	322
-auctioning	323
-jetskiing	324
-drinking	325
-busking	326
-playing monopoly	327
-salsa dancing	328
-waxing eyebrows	329
-watering plants	330
-zumba	331
-chopping wood	332
-pushing wheelchair	333
-carving pumpkin	334
-building shed	335
-making jewelry	336
-catching or throwing softball	337
-bending metal	338
-ice skating	339
-dancing charleston	340
-abseiling	341
-climbing a rope	342
-crying	343
-cleaning shoes	344
-dancing ballet	345
-driving tractor	346
-triple jump	347
-throwing ball	348
-getting a haircut	349
-running on treadmill	350
-climbing ladder	351
-blasting sand	352
-playing trombone	353
-drop kicking	354
-country line dancing	355
-changing wheel	356
-feeding goats	357
-tying knot (not on a tie)	358
-setting table	359
-shaving legs	360
-kissing	361
-riding mule	362
-counting money	363
-laying bricks	364
-barbequing	365
-news anchoring	366
-smoking hookah	367
-cooking egg	368
-peeling apples	369
-yoga	370
-sharpening pencil	371
-dribbling basketball	372
-petting cat	373
-playing ice hockey	374
-milking cow	375
-shining shoes	376
-juggling soccer ball	377
-scuba diving	378
-playing squash or racquetball	379
-drinking beer	380
-sign language interpreting	381
-playing basketball	382
-breakdancing	383
-testifying	384
-making snowman	385
-golf putting	386
-playing didgeridoo	387
-biking through snow	388
-sailing	389
-jumpstyle dancing	390
-water sliding	391
-grooming horse	392
-massaging feet	393
-playing paintball	394
-making a cake	395
-bowling	396
-contact juggling	397
-applying cream	398
-playing badminton	399
diff --git a/eval/vbench/third_party/umt/models/__init__.py b/eval/vbench/third_party/umt/models/__init__.py
deleted file mode 100644
index fbe97398..00000000
--- a/eval/vbench/third_party/umt/models/__init__.py
+++ /dev/null
@@ -1,13 +0,0 @@
-from .clip import clip_b16, clip_l14, clip_l14_336
-
-# from .modeling_finetune import vit_base_patch16_224, vit_base_patch16_384, vit_large_patch16_224, vit_large_patch16_384
-from .modeling_finetune import vit_large_patch16_224
-from .modeling_pretrain import (
-    pretrain_videomae_base_patch16_224,
-    pretrain_videomae_huge_patch16_224,
-    pretrain_videomae_large_patch16_224,
-)
-from .modeling_pretrain_umt import (
-    pretrain_umt_base_patch16_224,
-    pretrain_umt_large_patch16_224,
-)
diff --git a/eval/vbench/third_party/umt/models/clip.py b/eval/vbench/third_party/umt/models/clip.py
deleted file mode 100644
index 7a38a2a0..00000000
--- a/eval/vbench/third_party/umt/models/clip.py
+++ /dev/null
@@ -1,393 +0,0 @@
-#!/usr/bin/env python
-import os
-from collections import OrderedDict
-
-import torch
-from torch import nn
-
-MODEL_PATH = "your_model_path/clip_visual_encoder"
-_MODELS = {
-    # extracted from OpenAI, see extract_clip
-    "ViT-B/16": os.path.join(MODEL_PATH, "vit_b16.pth"),
-    "ViT-L/14": os.path.join(MODEL_PATH, "vit_l14.pth"),
-    "ViT-L/14_336": os.path.join(MODEL_PATH, "vit_l14_336.pth"),
-}
-
-
-class LayerNorm(nn.LayerNorm):
-    """Subclass torch's LayerNorm to handle fp16."""
-
-    def forward(self, x):
-        orig_type = x.dtype
-        ret = super().forward(x.type(torch.float32))
-        return ret.type(orig_type)
-
-
-class QuickGELU(nn.Module):
-    def forward(self, x):
-        return x * torch.sigmoid(1.702 * x)
-
-
-class ResidualAttentionBlock(nn.Module):
-    def __init__(self, d_model, n_head, attn_mask=None):
-        super().__init__()
-
-        self.attn = nn.MultiheadAttention(d_model, n_head)
-        self.ln_1 = LayerNorm(d_model)
-        self.mlp = nn.Sequential(
-            OrderedDict(
-                [
-                    ("c_fc", nn.Linear(d_model, d_model * 4)),
-                    ("gelu", QuickGELU()),
-                    ("c_proj", nn.Linear(d_model * 4, d_model)),
-                ]
-            )
-        )
-        self.ln_2 = LayerNorm(d_model)
-        self.attn_mask = attn_mask
-
-    def attention(self, x, return_attn=False):
-        self.attn_mask = (
-            self.attn_mask.to(dtype=x.dtype, device=x.device)
-            if self.attn_mask is not None
-            else None
-        )
-        if return_attn:
-            return self.attn(x, x, x, need_weights=True, attn_mask=self.attn_mask)
-        else:
-            return self.attn(x, x, x, need_weights=False, attn_mask=self.attn_mask)[0]
-
-    def forward(self, x, return_attn=False):
-        if return_attn:
-            x_, attn = self.attention(self.ln_1(x), return_attn=True)
-            x = x + x_
-            x = x + self.mlp(self.ln_2(x))
-            return x, attn
-        else:
-            x = x + self.attention(self.ln_1(x))
-            x = x + self.mlp(self.ln_2(x))
-            return x
-
-
-class Transformer(nn.Module):
-    def __init__(
-        self,
-        width,
-        layers,
-        heads,
-        return_attn=False,
-        clip_return_layer=1,
-        clip_return_interval=1,
-    ):
-        super().__init__()
-        self.layers = layers
-        self.return_attn = return_attn
-        self.resblocks = nn.ModuleList()
-        for _ in range(layers):
-            self.resblocks.append(
-                ResidualAttentionBlock(
-                    width,
-                    heads,
-                )
-            )
-        self.return_index = []
-        for i in range(clip_return_layer):
-            self.return_index.append(layers - int(i * clip_return_interval) - 1)
-        print(f"Teacher return index: {self.return_index}")
-
-    def forward(self, x):
-        attn = None
-        z = []
-        for idx, blk in enumerate(self.resblocks):
-            if idx == self.layers - 1 and self.return_attn:
-                x, attn = blk(x, return_attn=True)
-            else:
-                x = blk(x)
-            if idx in self.return_index:
-                z.append(x)
-        x = torch.stack(z)
-        return x, attn
-
-
-class VisionTransformer(nn.Module):
-    def __init__(
-        self,
-        input_resolution,
-        patch_size,
-        width,
-        layers,
-        heads,
-        output_dim,
-        clip_norm_type="l2",
-        kernel_size=1,
-        return_attn=False,
-        clip_return_layer=1,
-        clip_return_interval=1,
-    ):
-        super().__init__()
-        self.clip_norm_type = clip_norm_type
-        self.return_attn = return_attn
-        print(f"Normalization Type: {clip_norm_type}")
-        print(f"Return Attention: {return_attn}")
-        print(f"Return Layer: {clip_return_layer}")
-        print(f"Return Interval: {clip_return_interval}")
-
-        self.output_dim = output_dim
-        self.conv1 = nn.Conv3d(
-            3,
-            width,
-            (kernel_size, patch_size, patch_size),
-            (kernel_size, patch_size, patch_size),
-            (0, 0, 0),
-            bias=False,
-        )
-
-        scale = width**-0.5
-        self.class_embedding = nn.Parameter(scale * torch.randn(width))
-        self.positional_embedding = nn.Parameter(
-            scale * torch.randn((input_resolution // patch_size) ** 2 + 1, width)
-        )
-        self.ln_pre = LayerNorm(width)
-
-        self.transformer = Transformer(
-            width,
-            layers,
-            heads,
-            return_attn=return_attn,
-            clip_return_layer=clip_return_layer,
-            clip_return_interval=clip_return_interval,
-        )
-
-        self.ln_post = LayerNorm(width)
-        self.proj = nn.Parameter(scale * torch.randn(width, output_dim))
-
-    def forward(self, x, mask=None):
-        x = self.conv1(x)  # shape = [*, width, grid, grid]
-        N, C, T, H, W = x.shape
-        x = x.permute(0, 2, 3, 4, 1).reshape(N * T, H * W, C)
-
-        x = torch.cat(
-            [
-                self.class_embedding.to(x.dtype)
-                + torch.zeros(
-                    x.shape[0], 1, x.shape[-1], dtype=x.dtype, device=x.device
-                ),
-                x,
-            ],
-            dim=1,
-        )  # shape = [*, grid ** 2 + 1, width]
-        x = x + self.positional_embedding.to(x.dtype)
-        x = self.ln_pre(x)
-
-        if mask is not None:
-            cls_tokens = x[:, :1, :]
-            x = x[:, 1:]
-            x = x.reshape(N, T * H * W, C)
-            x = x[~mask].view(N * T, -1, C)
-            HW = x.shape[1]
-            x = torch.cat([cls_tokens, x], dim=1)
-        else:
-            HW = H * W
-
-        x = x.permute(1, 0, 2)  # NLD -> LND
-        x, attn = self.transformer(x)
-
-        K = x.shape[0]
-        x = self.ln_post(x[:, 1:, :, :])  # [HW, NT, C]
-        x = (
-            x.view(K, HW, N, T, C).permute(0, 2, 3, 1, 4).reshape(K, N, T * HW, C)
-        )  # [K, N, THW, C]
-        x = x @ self.proj
-
-        if self.clip_norm_type == "l2":
-            x = x / x.norm(dim=-1, keepdim=True)
-        elif self.clip_norm_type == "none":
-            pass
-        else:
-            raise NotImplementedError
-
-        if self.return_attn:
-            return x, attn[:, 0, 1:]
-        else:
-            return x
-
-
-def inflate_weight(weight_2d, time_dim, center=True):
-    print(f"Init center: {center}")
-    if center:
-        weight_3d = torch.zeros(*weight_2d.shape)
-        weight_3d = weight_3d.unsqueeze(2).repeat(1, 1, time_dim, 1, 1)
-        middle_idx = time_dim // 2
-        weight_3d[:, :, middle_idx, :, :] = weight_2d
-    else:
-        weight_3d = weight_2d.unsqueeze(2).repeat(1, 1, time_dim, 1, 1)
-        weight_3d = weight_3d / time_dim
-    return weight_3d
-
-
-def load_state_dict(
-    model, state_dict, input_resolution=224, patch_size=16, center=True
-):
-    state_dict_3d = model.state_dict()
-    for k in state_dict.keys():
-        if k in state_dict_3d.keys() and state_dict[k].shape != state_dict_3d[k].shape:
-            if len(state_dict_3d[k].shape) <= 2:
-                print(f"Ignore: {k}")
-                continue
-            print(f"Inflate: {k}, {state_dict[k].shape} => {state_dict_3d[k].shape}")
-            time_dim = state_dict_3d[k].shape[2]
-            state_dict[k] = inflate_weight(state_dict[k], time_dim, center=center)
-
-    pos_embed_checkpoint = state_dict["positional_embedding"]
-    embedding_size = pos_embed_checkpoint.shape[-1]
-    num_patches = (input_resolution // patch_size) ** 2
-    orig_size = int((pos_embed_checkpoint.shape[-2] - 1) ** 0.5)
-    new_size = int(num_patches**0.5)
-    if orig_size != new_size:
-        print(f"Pos_emb from {orig_size} to {new_size}")
-        extra_tokens = pos_embed_checkpoint[:1]
-        pos_tokens = pos_embed_checkpoint[1:]
-        pos_tokens = pos_tokens.reshape(
-            -1, orig_size, orig_size, embedding_size
-        ).permute(0, 3, 1, 2)
-        pos_tokens = torch.nn.functional.interpolate(
-            pos_tokens, size=(new_size, new_size), mode="bicubic", align_corners=False
-        )
-        pos_tokens = pos_tokens.permute(0, 2, 3, 1).flatten(0, 2)
-        new_pos_embed = torch.cat((extra_tokens, pos_tokens), dim=0)
-        state_dict["positional_embedding"] = new_pos_embed
-
-    model.load_state_dict(state_dict, strict=True)
-
-
-def clip_b16(
-    pretrained=True,
-    clip_norm_type="l2",
-    input_resolution=224,
-    kernel_size=1,
-    return_attn=False,
-    center=True,
-    clip_return_layer=1,
-    clip_return_interval=1,
-):
-    model = VisionTransformer(
-        input_resolution=input_resolution,
-        patch_size=16,
-        width=768,
-        layers=12,
-        heads=12,
-        output_dim=512,
-        clip_norm_type=clip_norm_type,
-        kernel_size=kernel_size,
-        return_attn=return_attn,
-        clip_return_layer=clip_return_layer,
-        clip_return_interval=clip_return_interval,
-    )
-    if pretrained:
-        print("load pretrained weights")
-        state_dict = torch.load(_MODELS["ViT-B/16"], map_location="cpu")
-        load_state_dict(
-            model,
-            state_dict,
-            input_resolution=input_resolution,
-            patch_size=16,
-            center=center,
-        )
-    return model.eval()
-
-
-def clip_l14(
-    pretrained=True,
-    clip_norm_type="l2",
-    input_resolution=224,
-    kernel_size=1,
-    return_attn=False,
-    center=True,
-    clip_return_layer=1,
-    clip_return_interval=1,
-):
-    model = VisionTransformer(
-        input_resolution=input_resolution,
-        patch_size=14,
-        width=1024,
-        layers=24,
-        heads=16,
-        output_dim=768,
-        clip_norm_type=clip_norm_type,
-        kernel_size=kernel_size,
-        return_attn=return_attn,
-        clip_return_layer=clip_return_layer,
-        clip_return_interval=clip_return_interval,
-    )
-    if pretrained:
-        print("load pretrained weights")
-        state_dict = torch.load(_MODELS["ViT-L/14"], map_location="cpu")
-        load_state_dict(
-            model,
-            state_dict,
-            input_resolution=input_resolution,
-            patch_size=14,
-            center=center,
-        )
-    return model.eval()
-
-
-def clip_l14_336(
-    pretrained=True,
-    clip_norm_type="l2",
-    input_resolution=336,
-    kernel_size=1,
-    return_attn=False,
-    center=True,
-    clip_return_layer=1,
-    clip_return_interval=1,
-):
-    model = VisionTransformer(
-        input_resolution=input_resolution,
-        patch_size=14,
-        width=1024,
-        layers=24,
-        heads=16,
-        output_dim=768,
-        clip_norm_type=clip_norm_type,
-        kernel_size=kernel_size,
-        return_attn=return_attn,
-        clip_return_layer=clip_return_layer,
-        clip_return_interval=clip_return_interval,
-    )
-    if pretrained:
-        print("load pretrained weights")
-        state_dict = torch.load(_MODELS["ViT-L/14_336"], map_location="cpu")
-        load_state_dict(
-            model,
-            state_dict,
-            input_resolution=input_resolution,
-            patch_size=14,
-            center=center,
-        )
-    return model.eval()
-
-
-if __name__ == "__main__":
-    import time
-
-    import numpy as np
-    from fvcore.nn import FlopCountAnalysis, flop_count_table
-
-    seed = 4217
-    np.random.seed(seed)
-    torch.manual_seed(seed)
-    torch.cuda.manual_seed(seed)
-    torch.cuda.manual_seed_all(seed)
-    num_frames = 8
-
-    model = clip_ml_b16(
-        pretrained=True, kernel_size=1, return_attn=False, clip_return_layer=1
-    )
-    # print(model)
-
-    # flops = FlopCountAnalysis(model, torch.rand(1, 3, num_frames, 224, 224))
-    # s = time.time()
-    # print(flop_count_table(flops, max_depth=1))
-    # print(time.time()-s)
-    print(model(torch.rand(1, 3, num_frames, 224, 224)).shape)
diff --git a/eval/vbench/third_party/umt/models/extract_clip/extract.ipynb b/eval/vbench/third_party/umt/models/extract_clip/extract.ipynb
deleted file mode 100644
index 3826677c..00000000
--- a/eval/vbench/third_party/umt/models/extract_clip/extract.ipynb
+++ /dev/null
@@ -1,101 +0,0 @@
-{
- "cells": [
-  {
-   "cell_type": "code",
-   "execution_count": 9,
-   "metadata": {},
-   "outputs": [],
-   "source": [
-    "import clip.clip as clip\n",
-    "import os\n",
-    "import torch\n",
-    "from collections import OrderedDict"
-   ]
-  },
-  {
-   "cell_type": "code",
-   "execution_count": 10,
-   "metadata": {},
-   "outputs": [],
-   "source": [
-    "path = 'your_model_path/clip_visual_encoder'"
-   ]
-  },
-  {
-   "cell_type": "code",
-   "execution_count": 14,
-   "metadata": {},
-   "outputs": [],
-   "source": [
-    "model, _ = clip.load(\"ViT-B/16\", device='cpu')\n",
-    "new_state_dict = OrderedDict()\n",
-    "for k, v in model.state_dict().items():\n",
-    "    if 'visual.' in k:\n",
-    "        new_state_dict[k[7:]] = v\n",
-    "torch.save(new_state_dict, os.path.join(path, 'vit_b16.pth'))"
-   ]
-  },
-  {
-   "cell_type": "code",
-   "execution_count": 15,
-   "metadata": {},
-   "outputs": [],
-   "source": [
-    "model, _ = clip.load(\"ViT-L/14\", device='cpu')\n",
-    "new_state_dict = OrderedDict()\n",
-    "for k, v in model.state_dict().items():\n",
-    "    if 'visual.' in k:\n",
-    "        new_state_dict[k[7:]] = v\n",
-    "torch.save(new_state_dict, os.path.join(path, 'vit_l14.pth'))"
-   ]
-  },
-  {
-   "cell_type": "code",
-   "execution_count": null,
-   "metadata": {},
-   "outputs": [],
-   "source": [
-    "model, _ = clip.load(\"ViT-L/14@336px\", device='cpu')\n",
-    "new_state_dict = OrderedDict()\n",
-    "for k, v in model.state_dict().items():\n",
-    "    if 'visual.' in k:\n",
-    "        new_state_dict[k[7:]] = v\n",
-    "torch.save(new_state_dict, os.path.join(path, 'vit_l14_336.pth'))"
-   ]
-  },
-  {
-   "cell_type": "code",
-   "execution_count": null,
-   "metadata": {},
-   "outputs": [],
-   "source": []
-  }
- ],
- "metadata": {
-  "kernelspec": {
-   "display_name": "Python 3.7.13 ('torch1.9')",
-   "language": "python",
-   "name": "python3"
-  },
-  "language_info": {
-   "codemirror_mode": {
-    "name": "ipython",
-    "version": 3
-   },
-   "file_extension": ".py",
-   "mimetype": "text/x-python",
-   "name": "python",
-   "nbconvert_exporter": "python",
-   "pygments_lexer": "ipython3",
-   "version": "3.7.13"
-  },
-  "orig_nbformat": 4,
-  "vscode": {
-   "interpreter": {
-    "hash": "c30e0be9d1dabfc31a056b9daab5ce1d15284c0e9e5af7f56f8931344ec84c24"
-   }
-  }
- },
- "nbformat": 4,
- "nbformat_minor": 2
-}
diff --git a/eval/vbench/third_party/umt/models/modeling_finetune.py b/eval/vbench/third_party/umt/models/modeling_finetune.py
deleted file mode 100644
index eb2def53..00000000
--- a/eval/vbench/third_party/umt/models/modeling_finetune.py
+++ /dev/null
@@ -1,525 +0,0 @@
-from functools import partial
-
-import numpy as np
-import torch
-import torch.nn as nn
-import torch.nn.functional as F
-import torch.utils.checkpoint as checkpoint
-from timm.models.layers import drop_path, to_2tuple, trunc_normal_
-from timm.models.registry import register_model
-
-
-def _cfg(url="", **kwargs):
-    return {
-        "url": url,
-        "num_classes": 400,
-        "input_size": (3, 224, 224),
-        "pool_size": None,
-        "crop_pct": 0.9,
-        "interpolation": "bicubic",
-        "mean": (0.5, 0.5, 0.5),
-        "std": (0.5, 0.5, 0.5),
-        **kwargs,
-    }
-
-
-class DropPath(nn.Module):
-    """Drop paths (Stochastic Depth) per sample  (when applied in main path of residual blocks)."""
-
-    def __init__(self, drop_prob=None):
-        super(DropPath, self).__init__()
-        self.drop_prob = drop_prob
-
-    def forward(self, x):
-        return drop_path(x, self.drop_prob, self.training)
-
-    def extra_repr(self) -> str:
-        return "p={}".format(self.drop_prob)
-
-
-class Mlp(nn.Module):
-    def __init__(
-        self,
-        in_features,
-        hidden_features=None,
-        out_features=None,
-        act_layer=nn.GELU,
-        drop=0.0,
-    ):
-        super().__init__()
-        out_features = out_features or in_features
-        hidden_features = hidden_features or in_features
-        self.fc1 = nn.Linear(in_features, hidden_features)
-        self.act = act_layer()
-        self.fc2 = nn.Linear(hidden_features, out_features)
-        self.drop = nn.Dropout(drop)
-
-    def forward(self, x):
-        x = self.fc1(x)
-        x = self.act(x)
-        # x = self.drop(x)
-        # commit this for the orignal BERT implement
-        x = self.fc2(x)
-        x = self.drop(x)
-        return x
-
-
-class Attention(nn.Module):
-    def __init__(
-        self,
-        dim,
-        num_heads=8,
-        qkv_bias=False,
-        qk_scale=None,
-        attn_drop=0.0,
-        proj_drop=0.0,
-        attn_head_dim=None,
-    ):
-        super().__init__()
-        self.num_heads = num_heads
-        head_dim = dim // num_heads
-        if attn_head_dim is not None:
-            head_dim = attn_head_dim
-        all_head_dim = head_dim * self.num_heads
-        self.scale = qk_scale or head_dim**-0.5
-
-        self.qkv = nn.Linear(dim, all_head_dim * 3, bias=False)
-        if qkv_bias:
-            self.q_bias = nn.Parameter(torch.zeros(all_head_dim))
-            self.v_bias = nn.Parameter(torch.zeros(all_head_dim))
-        else:
-            self.q_bias = None
-            self.v_bias = None
-
-        self.attn_drop = nn.Dropout(attn_drop)
-        self.proj = nn.Linear(all_head_dim, dim)
-        self.proj_drop = nn.Dropout(proj_drop)
-
-    def forward(self, x):
-        B, N, C = x.shape
-        qkv_bias = None
-        if self.q_bias is not None:
-            qkv_bias = torch.cat(
-                (
-                    self.q_bias,
-                    torch.zeros_like(self.v_bias, requires_grad=False),
-                    self.v_bias,
-                )
-            )
-        # qkv = self.qkv(x).reshape(B, N, 3, self.num_heads, C // self.num_heads).permute(2, 0, 3, 1, 4)
-        qkv = F.linear(input=x, weight=self.qkv.weight, bias=qkv_bias)
-        qkv = qkv.reshape(B, N, 3, self.num_heads, -1).permute(2, 0, 3, 1, 4)
-        q, k, v = (
-            qkv[0],
-            qkv[1],
-            qkv[2],
-        )  # make torchscript happy (cannot use tensor as tuple)
-
-        q = q * self.scale
-        attn = q @ k.transpose(-2, -1)
-
-        attn = attn.softmax(dim=-1)
-        attn = self.attn_drop(attn)
-
-        x = (attn @ v).transpose(1, 2).reshape(B, N, -1)
-        x = self.proj(x)
-        x = self.proj_drop(x)
-        return x
-
-
-class Block(nn.Module):
-    def __init__(
-        self,
-        dim,
-        num_heads,
-        mlp_ratio=4.0,
-        qkv_bias=False,
-        qk_scale=None,
-        drop=0.0,
-        attn_drop=0.0,
-        drop_path=0.0,
-        init_values=None,
-        act_layer=nn.GELU,
-        norm_layer=nn.LayerNorm,
-        attn_head_dim=None,
-    ):
-        super().__init__()
-        self.norm1 = norm_layer(dim)
-        self.attn = Attention(
-            dim,
-            num_heads=num_heads,
-            qkv_bias=qkv_bias,
-            qk_scale=qk_scale,
-            attn_drop=attn_drop,
-            proj_drop=drop,
-            attn_head_dim=attn_head_dim,
-        )
-        # NOTE: drop path for stochastic depth, we shall see if this is better than dropout here
-        self.drop_path = DropPath(drop_path) if drop_path > 0.0 else nn.Identity()
-        self.norm2 = norm_layer(dim)
-        mlp_hidden_dim = int(dim * mlp_ratio)
-        self.mlp = Mlp(
-            in_features=dim,
-            hidden_features=mlp_hidden_dim,
-            act_layer=act_layer,
-            drop=drop,
-        )
-
-        if init_values > 0:
-            self.gamma_1 = nn.Parameter(
-                init_values * torch.ones((dim)), requires_grad=True
-            )
-            self.gamma_2 = nn.Parameter(
-                init_values * torch.ones((dim)), requires_grad=True
-            )
-        else:
-            self.gamma_1, self.gamma_2 = None, None
-
-    def forward(self, x):
-        if self.gamma_1 is None:
-            x = x + self.drop_path(self.attn(self.norm1(x)))
-            x = x + self.drop_path(self.mlp(self.norm2(x)))
-        else:
-            x = x + self.drop_path(self.gamma_1 * self.attn(self.norm1(x)))
-            x = x + self.drop_path(self.gamma_2 * self.mlp(self.norm2(x)))
-        return x
-
-
-class PatchEmbed(nn.Module):
-    """Image to Patch Embedding"""
-
-    def __init__(
-        self,
-        img_size=224,
-        patch_size=16,
-        in_chans=3,
-        embed_dim=768,
-        num_frames=16,
-        tubelet_size=2,
-    ):
-        super().__init__()
-        img_size = to_2tuple(img_size)
-        patch_size = to_2tuple(patch_size)
-        self.tubelet_size = int(tubelet_size)
-        num_patches = (
-            (img_size[1] // patch_size[1])
-            * (img_size[0] // patch_size[0])
-            * (num_frames // self.tubelet_size)
-        )
-        self.img_size = img_size
-        self.patch_size = patch_size
-        self.num_patches = num_patches
-        self.proj = nn.Conv3d(
-            in_channels=in_chans,
-            out_channels=embed_dim,
-            kernel_size=(self.tubelet_size, patch_size[0], patch_size[1]),
-            stride=(self.tubelet_size, patch_size[0], patch_size[1]),
-        )
-
-    def forward(self, x, **kwargs):
-        B, C, T, H, W = x.shape
-        # FIXME look at relaxing size constraints
-        assert (
-            H == self.img_size[0] and W == self.img_size[1]
-        ), f"Input image size ({H}*{W}) doesn't match model ({self.img_size[0]}*{self.img_size[1]})."
-        x = self.proj(x).flatten(2).transpose(1, 2)
-        return x
-
-
-# sin-cos position encoding
-# https://github.com/jadore801120/attention-is-all-you-need-pytorch/blob/master/transformer/Models.py#L31
-def get_sinusoid_encoding_table(n_position, d_hid, cur_frame=-1, pre_n_position=1568):
-    """Sinusoid position encoding table"""
-
-    # TODO: make it with torch instead of numpy
-    def get_position_angle_vec(position):
-        return [
-            position / np.power(10000, 2 * (hid_j // 2) / d_hid)
-            for hid_j in range(d_hid)
-        ]
-
-    # generate checkpoint position embedding
-    sinusoid_table = np.array(
-        [get_position_angle_vec(pos_i) for pos_i in range(pre_n_position)]
-    )
-    sinusoid_table[:, 0::2] = np.sin(sinusoid_table[:, 0::2])  # dim 2i
-    sinusoid_table[:, 1::2] = np.cos(sinusoid_table[:, 1::2])  # dim 2i+1
-    sinusoid_table = torch.tensor(
-        sinusoid_table, dtype=torch.float, requires_grad=False
-    ).unsqueeze(0)
-    print(f"n_position: {n_position}")
-    print(f"pre_n_position: {pre_n_position}")
-    if n_position // cur_frame * 8 != pre_n_position and cur_frame != -1:
-        T = 8  # checkpoint frame
-        P = 14  # checkpoint size
-        C = d_hid
-        new_P = int((n_position // cur_frame) ** 0.5)  # testing size
-        print(f"Pretraining uses 14x14, but current version is {new_P}x{new_P}")
-        print(f"Interpolate the position embedding")
-        sinusoid_table = sinusoid_table.reshape(-1, T, P, P, C)
-        sinusoid_table = sinusoid_table.reshape(-1, P, P, C).permute(0, 3, 1, 2)
-        sinusoid_table = torch.nn.functional.interpolate(
-            sinusoid_table, size=(new_P, new_P), mode="bicubic", align_corners=False
-        )
-        # BT, C, H, W -> BT, H, W, C ->  B, T, H, W, C
-        sinusoid_table = sinusoid_table.permute(0, 2, 3, 1).reshape(
-            -1, T, new_P, new_P, C
-        )
-        sinusoid_table = sinusoid_table.flatten(1, 3)  # B, THW, C
-    if cur_frame != -1 and cur_frame != 8:
-        print(f"Pretraining uses 8 frames, but current frame is {cur_frame}")
-        print(f"Interpolate the position embedding")
-        T = 8  # checkpoint frame
-        new_T = cur_frame  # testing frame
-        # interpolate
-        P = int((n_position // cur_frame) ** 0.5)  # testing size
-        C = d_hid
-        sinusoid_table = sinusoid_table.reshape(-1, T, P, P, C)
-        sinusoid_table = sinusoid_table.permute(0, 2, 3, 4, 1).reshape(
-            -1, C, T
-        )  # BHW, C, T
-        sinusoid_table = torch.nn.functional.interpolate(
-            sinusoid_table, size=new_T, mode="linear"
-        )
-        sinusoid_table = sinusoid_table.reshape(1, P, P, C, new_T).permute(
-            0, 4, 1, 2, 3
-        )  # B, T, H, W, C
-        sinusoid_table = sinusoid_table.flatten(1, 3)  # B, THW, C
-    if n_position == pre_n_position:
-        return sinusoid_table
-    else:
-        print("Use learnable position embedding")
-        return nn.Parameter(sinusoid_table, requires_grad=True)
-
-
-class VisionTransformer(nn.Module):
-    """Vision Transformer with support for patch or hybrid CNN input stage"""
-
-    def __init__(
-        self,
-        img_size=224,
-        patch_size=16,
-        in_chans=3,
-        num_classes=1000,
-        embed_dim=768,
-        depth=12,
-        num_heads=12,
-        mlp_ratio=4.0,
-        qkv_bias=False,
-        qk_scale=None,
-        fc_drop_rate=0.0,
-        drop_rate=0.0,
-        attn_drop_rate=0.0,
-        drop_path_rate=0.0,
-        norm_layer=nn.LayerNorm,
-        init_values=0.0,
-        use_learnable_pos_emb=False,
-        init_scale=0.0,
-        all_frames=16,
-        tubelet_size=2,
-        use_checkpoint=False,
-        checkpoint_num=0,
-        use_mean_pooling=True,
-    ):
-        super().__init__()
-        self.num_classes = num_classes
-        self.num_features = self.embed_dim = (
-            embed_dim  # num_features for consistency with other models
-        )
-        self.tubelet_size = tubelet_size
-        self.patch_embed = PatchEmbed(
-            img_size=img_size,
-            patch_size=patch_size,
-            in_chans=in_chans,
-            embed_dim=embed_dim,
-            num_frames=all_frames,
-            tubelet_size=self.tubelet_size,
-        )
-        num_patches = self.patch_embed.num_patches
-        self.use_checkpoint = use_checkpoint
-        self.checkpoint_num = checkpoint_num
-        print(f"Use checkpoint: {use_checkpoint}")
-        print(f"Checkpoint number: {checkpoint_num}")
-
-        if use_learnable_pos_emb:
-            self.pos_embed = nn.Parameter(torch.zeros(1, num_patches, embed_dim))
-        else:
-            # sine-cosine positional embeddings is on the way
-            if patch_size == 14:
-                pre_n_position = 2048
-            else:
-                pre_n_position = 1568
-            self.pos_embed = get_sinusoid_encoding_table(
-                num_patches,
-                embed_dim,
-                all_frames // tubelet_size,
-                pre_n_position=pre_n_position,
-            )
-
-        self.pos_drop = nn.Dropout(p=drop_rate)
-
-        dpr = [
-            x.item() for x in torch.linspace(0, drop_path_rate, depth)
-        ]  # stochastic depth decay rule
-        self.blocks = nn.ModuleList(
-            [
-                Block(
-                    dim=embed_dim,
-                    num_heads=num_heads,
-                    mlp_ratio=mlp_ratio,
-                    qkv_bias=qkv_bias,
-                    qk_scale=qk_scale,
-                    drop=drop_rate,
-                    attn_drop=attn_drop_rate,
-                    drop_path=dpr[i],
-                    norm_layer=norm_layer,
-                    init_values=init_values,
-                )
-                for i in range(depth)
-            ]
-        )
-        self.norm = nn.Identity() if use_mean_pooling else norm_layer(embed_dim)
-        self.fc_norm = norm_layer(embed_dim) if use_mean_pooling else None
-        self.fc_dropout = (
-            nn.Dropout(p=fc_drop_rate) if fc_drop_rate > 0 else nn.Identity()
-        )
-        self.head = (
-            nn.Linear(embed_dim, num_classes) if num_classes > 0 else nn.Identity()
-        )
-
-        if use_learnable_pos_emb:
-            trunc_normal_(self.pos_embed, std=0.02)
-
-        trunc_normal_(self.head.weight, std=0.02)
-        self.apply(self._init_weights)
-
-        self.head.weight.data.mul_(init_scale)
-        self.head.bias.data.mul_(init_scale)
-
-    def _init_weights(self, m):
-        if isinstance(m, nn.Linear):
-            trunc_normal_(m.weight, std=0.02)
-            if isinstance(m, nn.Linear) and m.bias is not None:
-                nn.init.constant_(m.bias, 0)
-        elif isinstance(m, nn.LayerNorm):
-            nn.init.constant_(m.bias, 0)
-            nn.init.constant_(m.weight, 1.0)
-
-    def get_num_layers(self):
-        return len(self.blocks)
-
-    @torch.jit.ignore
-    def no_weight_decay(self):
-        return {"pos_embed", "cls_token"}
-
-    def get_classifier(self):
-        return self.head
-
-    def reset_classifier(self, num_classes, global_pool=""):
-        self.num_classes = num_classes
-        self.head = (
-            nn.Linear(self.embed_dim, num_classes) if num_classes > 0 else nn.Identity()
-        )
-
-    def forward_features(self, x):
-        x = self.patch_embed(x)
-        B, _, _ = x.size()
-
-        if self.pos_embed is not None:
-            x = (
-                x
-                + self.pos_embed.expand(B, -1, -1)
-                .type_as(x)
-                .to(x.device)
-                .clone()
-                .detach()
-            )
-        x = self.pos_drop(x)
-
-        for idx, blk in enumerate(self.blocks):
-            if self.use_checkpoint and idx < self.checkpoint_num:
-                x = checkpoint.checkpoint(blk, x)
-            else:
-                x = blk(x)
-
-        x = self.norm(x)
-        if self.fc_norm is not None:
-            return self.fc_norm(x.mean(1))
-        else:
-            return x[:, 0]
-
-    def forward(self, x):
-        x = self.forward_features(x)
-        x = self.head(self.fc_dropout(x))
-        return x
-
-
-# @register_model
-# def vit_base_patch16_224(pretrained=False, **kwargs):
-#     model = VisionTransformer(
-#         patch_size=16, embed_dim=768, depth=12, num_heads=12, mlp_ratio=4, qkv_bias=True,
-#         norm_layer=partial(nn.LayerNorm, eps=1e-6), **kwargs)
-#     model.default_cfg = _cfg()
-#     return model
-#
-#
-# # @register_model
-# def vit_base_patch16_384(pretrained=False, **kwargs):
-#     model = VisionTransformer(
-#         img_size=384, patch_size=16, embed_dim=768, depth=12, num_heads=12, mlp_ratio=4, qkv_bias=True,
-#         norm_layer=partial(nn.LayerNorm, eps=1e-6), **kwargs)
-#     model.default_cfg = _cfg()
-#     return model
-
-
-@register_model
-def vit_large_patch16_224(pretrained=False, **kwargs):
-    kwargs.pop("pretrained_cfg", None)  # added by Ziqi to accommodate timm=0.9.12
-    kwargs.pop(
-        "pretrained_cfg_overlay", None
-    )  # added by Ziqi to accommodate timm=0.9.12
-    model = VisionTransformer(
-        patch_size=16,
-        embed_dim=1024,
-        depth=24,
-        num_heads=16,
-        mlp_ratio=4,
-        qkv_bias=True,
-        norm_layer=partial(nn.LayerNorm, eps=1e-6),
-        **kwargs,
-    )
-    model.default_cfg = _cfg()
-    return model
-
-
-# @register_model
-# def vit_large_patch16_384(pretrained=False, **kwargs):
-#     model = VisionTransformer(
-#         img_size=384, patch_size=16, embed_dim=1024, depth=24, num_heads=16, mlp_ratio=4, qkv_bias=True,
-#         norm_layer=partial(nn.LayerNorm, eps=1e-6), **kwargs)
-#     model.default_cfg = _cfg()
-#     return model
-
-
-if __name__ == "__main__":
-    import time
-
-    import numpy as np
-    from fvcore.nn import FlopCountAnalysis, flop_count_table
-
-    seed = 4217
-    np.random.seed(seed)
-    torch.manual_seed(seed)
-    torch.cuda.manual_seed(seed)
-    torch.cuda.manual_seed_all(seed)
-    num_frames = 8
-
-    # model = vit_base_patch16_384(all_frames=num_frames, tubelet_size=1)
-    # model = vit_large_patch16_384(all_frames=num_frames, tubelet_size=1)
-    # print(model)
-
-    flops = FlopCountAnalysis(model, torch.rand(1, 3, num_frames, 384, 384))
-    s = time.time()
-    print(flop_count_table(flops, max_depth=1))
-    print(time.time() - s)
-    # print(model(torch.rand(1, 3, num_frames, 224, 224)).shape)
diff --git a/eval/vbench/third_party/umt/models/modeling_pretrain.py b/eval/vbench/third_party/umt/models/modeling_pretrain.py
deleted file mode 100644
index e8d487b4..00000000
--- a/eval/vbench/third_party/umt/models/modeling_pretrain.py
+++ /dev/null
@@ -1,437 +0,0 @@
-import math
-from functools import partial
-
-import torch
-import torch.nn as nn
-import torch.nn.functional as F
-import torch.utils.checkpoint as checkpoint
-from timm.models.layers import trunc_normal_ as __call_trunc_normal_
-from timm.models.registry import register_model
-
-from .modeling_finetune import Block, PatchEmbed, _cfg, get_sinusoid_encoding_table
-
-
-def trunc_normal_(tensor, mean=0.0, std=1.0):
-    __call_trunc_normal_(tensor, mean=mean, std=std, a=-std, b=std)
-
-
-class PretrainVisionTransformerEncoder(nn.Module):
-    """Vision Transformer with support for patch or hybrid CNN input stage"""
-
-    def __init__(
-        self,
-        img_size=224,
-        patch_size=16,
-        in_chans=3,
-        num_classes=0,
-        embed_dim=768,
-        depth=12,
-        num_heads=12,
-        mlp_ratio=4.0,
-        qkv_bias=False,
-        qk_scale=None,
-        drop_rate=0.0,
-        attn_drop_rate=0.0,
-        drop_path_rate=0.0,
-        norm_layer=nn.LayerNorm,
-        init_values=None,
-        num_frames=16,
-        tubelet_size=2,
-        use_checkpoint=False,
-        use_learnable_pos_emb=False,
-    ):
-        super().__init__()
-        self.num_classes = num_classes
-        self.num_features = self.embed_dim = (
-            embed_dim  # num_features for consistency with other models
-        )
-        self.patch_embed = PatchEmbed(
-            img_size=img_size,
-            patch_size=patch_size,
-            in_chans=in_chans,
-            embed_dim=embed_dim,
-            num_frames=num_frames,
-            tubelet_size=tubelet_size,
-        )
-        num_patches = self.patch_embed.num_patches
-        self.use_checkpoint = use_checkpoint
-
-        # TODO: Add the cls token
-        if use_learnable_pos_emb:
-            self.pos_embed = nn.Parameter(torch.zeros(1, num_patches + 1, embed_dim))
-        else:
-            # sine-cosine positional embeddings
-            self.pos_embed = get_sinusoid_encoding_table(num_patches, embed_dim)
-
-        dpr = [
-            x.item() for x in torch.linspace(0, drop_path_rate, depth)
-        ]  # stochastic depth decay rule
-        self.blocks = nn.ModuleList(
-            [
-                Block(
-                    dim=embed_dim,
-                    num_heads=num_heads,
-                    mlp_ratio=mlp_ratio,
-                    qkv_bias=qkv_bias,
-                    qk_scale=qk_scale,
-                    drop=drop_rate,
-                    attn_drop=attn_drop_rate,
-                    drop_path=dpr[i],
-                    norm_layer=norm_layer,
-                    init_values=init_values,
-                )
-                for i in range(depth)
-            ]
-        )
-        self.norm = norm_layer(embed_dim)
-        self.head = (
-            nn.Linear(embed_dim, num_classes) if num_classes > 0 else nn.Identity()
-        )
-
-        if use_learnable_pos_emb:
-            trunc_normal_(self.pos_embed, std=0.02)
-
-        self.apply(self._init_weights)
-
-    def _init_weights(self, m):
-        if isinstance(m, nn.Linear):
-            nn.init.xavier_uniform_(m.weight)
-            if isinstance(m, nn.Linear) and m.bias is not None:
-                nn.init.constant_(m.bias, 0)
-        elif isinstance(m, nn.LayerNorm):
-            nn.init.constant_(m.bias, 0)
-            nn.init.constant_(m.weight, 1.0)
-
-    def get_num_layers(self):
-        return len(self.blocks)
-
-    @torch.jit.ignore
-    def no_weight_decay(self):
-        return {"pos_embed", "cls_token"}
-
-    def get_classifier(self):
-        return self.head
-
-    def reset_classifier(self, num_classes, global_pool=""):
-        self.num_classes = num_classes
-        self.head = (
-            nn.Linear(self.embed_dim, num_classes) if num_classes > 0 else nn.Identity()
-        )
-
-    def forward_features(self, x, mask):
-        _, _, T, _, _ = x.shape
-        x = self.patch_embed(x)
-
-        x = x + self.pos_embed.type_as(x).to(x.device).clone().detach()
-
-        B, _, C = x.shape
-        x_vis = x[~mask].reshape(B, -1, C)  # ~mask means visible
-
-        if self.use_checkpoint:
-            for blk in self.blocks:
-                x_vis = checkpoint.checkpoint(blk, x_vis)
-        else:
-            for blk in self.blocks:
-                x_vis = blk(x_vis)
-
-        x_vis = self.norm(x_vis)
-        return x_vis
-
-    def forward(self, x, mask):
-        x = self.forward_features(x, mask)
-        x = self.head(x)
-        return x
-
-
-class PretrainVisionTransformerDecoder(nn.Module):
-    """Vision Transformer with support for patch or hybrid CNN input stage"""
-
-    def __init__(
-        self,
-        patch_size=16,
-        num_classes=768,
-        embed_dim=768,
-        depth=12,
-        num_heads=12,
-        mlp_ratio=4.0,
-        qkv_bias=False,
-        qk_scale=None,
-        drop_rate=0.0,
-        attn_drop_rate=0.0,
-        drop_path_rate=0.0,
-        norm_layer=nn.LayerNorm,
-        init_values=None,
-        num_patches=196,
-        tubelet_size=2,
-        use_checkpoint=False,
-    ):
-        super().__init__()
-        self.num_classes = num_classes
-        assert num_classes == 3 * tubelet_size * patch_size**2
-        self.num_features = self.embed_dim = (
-            embed_dim  # num_features for consistency with other models
-        )
-        self.patch_size = patch_size
-        self.use_checkpoint = use_checkpoint
-
-        dpr = [
-            x.item() for x in torch.linspace(0, drop_path_rate, depth)
-        ]  # stochastic depth decay rule
-        self.blocks = nn.ModuleList(
-            [
-                Block(
-                    dim=embed_dim,
-                    num_heads=num_heads,
-                    mlp_ratio=mlp_ratio,
-                    qkv_bias=qkv_bias,
-                    qk_scale=qk_scale,
-                    drop=drop_rate,
-                    attn_drop=attn_drop_rate,
-                    drop_path=dpr[i],
-                    norm_layer=norm_layer,
-                    init_values=init_values,
-                )
-                for i in range(depth)
-            ]
-        )
-        self.norm = norm_layer(embed_dim)
-        self.head = (
-            nn.Linear(embed_dim, num_classes) if num_classes > 0 else nn.Identity()
-        )
-
-        self.apply(self._init_weights)
-
-    def _init_weights(self, m):
-        if isinstance(m, nn.Linear):
-            nn.init.xavier_uniform_(m.weight)
-            if isinstance(m, nn.Linear) and m.bias is not None:
-                nn.init.constant_(m.bias, 0)
-        elif isinstance(m, nn.LayerNorm):
-            nn.init.constant_(m.bias, 0)
-            nn.init.constant_(m.weight, 1.0)
-
-    def get_num_layers(self):
-        return len(self.blocks)
-
-    @torch.jit.ignore
-    def no_weight_decay(self):
-        return {"pos_embed", "cls_token"}
-
-    def get_classifier(self):
-        return self.head
-
-    def reset_classifier(self, num_classes, global_pool=""):
-        self.num_classes = num_classes
-        self.head = (
-            nn.Linear(self.embed_dim, num_classes) if num_classes > 0 else nn.Identity()
-        )
-
-    def forward(self, x, return_token_num):
-        if self.use_checkpoint:
-            for blk in self.blocks:
-                x = checkpoint.checkpoint(blk, x)
-        else:
-            for blk in self.blocks:
-                x = blk(x)
-
-        if return_token_num > 0:
-            x = self.head(
-                self.norm(x[:, -return_token_num:])
-            )  # only return the mask tokens predict pixels
-        else:
-            x = self.head(self.norm(x))
-
-        return x
-
-
-class PretrainVisionTransformer(nn.Module):
-    """Vision Transformer with support for patch or hybrid CNN input stage"""
-
-    def __init__(
-        self,
-        img_size=224,
-        patch_size=16,
-        encoder_in_chans=3,
-        encoder_num_classes=0,
-        encoder_embed_dim=768,
-        encoder_depth=12,
-        encoder_num_heads=12,
-        decoder_num_classes=1536,  #  decoder_num_classes=768,
-        decoder_embed_dim=512,
-        decoder_depth=8,
-        decoder_num_heads=8,
-        mlp_ratio=4.0,
-        qkv_bias=False,
-        qk_scale=None,
-        drop_rate=0.0,
-        attn_drop_rate=0.0,
-        drop_path_rate=0.0,
-        norm_layer=nn.LayerNorm,
-        init_values=0.0,
-        use_learnable_pos_emb=False,
-        use_checkpoint=False,
-        num_frames=16,
-        tubelet_size=2,
-        num_classes=0,  # avoid the error from create_fn in timm
-        in_chans=0,  # avoid the error from create_fn in timm
-    ):
-        super().__init__()
-        self.encoder = PretrainVisionTransformerEncoder(
-            img_size=img_size,
-            patch_size=patch_size,
-            in_chans=encoder_in_chans,
-            num_classes=encoder_num_classes,
-            embed_dim=encoder_embed_dim,
-            depth=encoder_depth,
-            num_heads=encoder_num_heads,
-            mlp_ratio=mlp_ratio,
-            qkv_bias=qkv_bias,
-            qk_scale=qk_scale,
-            drop_rate=drop_rate,
-            attn_drop_rate=attn_drop_rate,
-            drop_path_rate=drop_path_rate,
-            norm_layer=norm_layer,
-            init_values=init_values,
-            num_frames=num_frames,
-            tubelet_size=tubelet_size,
-            use_checkpoint=use_checkpoint,
-            use_learnable_pos_emb=use_learnable_pos_emb,
-        )
-
-        self.decoder = PretrainVisionTransformerDecoder(
-            patch_size=patch_size,
-            num_patches=self.encoder.patch_embed.num_patches,
-            num_classes=decoder_num_classes,
-            embed_dim=decoder_embed_dim,
-            depth=decoder_depth,
-            num_heads=decoder_num_heads,
-            mlp_ratio=mlp_ratio,
-            qkv_bias=qkv_bias,
-            qk_scale=qk_scale,
-            drop_rate=drop_rate,
-            attn_drop_rate=attn_drop_rate,
-            drop_path_rate=drop_path_rate,
-            norm_layer=norm_layer,
-            init_values=init_values,
-            tubelet_size=tubelet_size,
-            use_checkpoint=use_checkpoint,
-        )
-
-        self.encoder_to_decoder = nn.Linear(
-            encoder_embed_dim, decoder_embed_dim, bias=False
-        )
-
-        self.mask_token = nn.Parameter(torch.zeros(1, 1, decoder_embed_dim))
-
-        self.pos_embed = get_sinusoid_encoding_table(
-            self.encoder.patch_embed.num_patches, decoder_embed_dim
-        )
-
-        trunc_normal_(self.mask_token, std=0.02)
-
-    def _init_weights(self, m):
-        if isinstance(m, nn.Linear):
-            nn.init.xavier_uniform_(m.weight)
-            if isinstance(m, nn.Linear) and m.bias is not None:
-                nn.init.constant_(m.bias, 0)
-        elif isinstance(m, nn.LayerNorm):
-            nn.init.constant_(m.bias, 0)
-            nn.init.constant_(m.weight, 1.0)
-
-    def get_num_layers(self):
-        return len(self.blocks)
-
-    @torch.jit.ignore
-    def no_weight_decay(self):
-        return {"pos_embed", "cls_token", "mask_token"}
-
-    def forward(self, x, mask):
-        _, _, T, _, _ = x.shape
-        x_vis = self.encoder(x, mask)  # [B, N_vis, C_e]
-        x_vis = self.encoder_to_decoder(x_vis)  # [B, N_vis, C_d]
-        B, N, C = x_vis.shape
-        # we don't unshuffle the correct visible token order,
-        # but shuffle the pos embedding accorddingly.
-        expand_pos_embed = (
-            self.pos_embed.expand(B, -1, -1).type_as(x).to(x.device).clone().detach()
-        )
-        pos_emd_vis = expand_pos_embed[~mask].reshape(B, -1, C)
-        pos_emd_mask = expand_pos_embed[mask].reshape(B, -1, C)
-        x_full = torch.cat(
-            [x_vis + pos_emd_vis, self.mask_token + pos_emd_mask], dim=1
-        )  # [B, N, C_d]
-        x = self.decoder(x_full, pos_emd_mask.shape[1])  # [B, N_mask, 3 * 16 * 16]
-
-        return x
-
-
-@register_model
-def pretrain_videomae_base_patch16_224(pretrained=False, **kwargs):
-    model = PretrainVisionTransformer(
-        img_size=224,
-        patch_size=16,
-        encoder_embed_dim=768,
-        encoder_depth=12,
-        encoder_num_heads=12,
-        encoder_num_classes=0,
-        decoder_num_classes=1536,
-        decoder_embed_dim=384,
-        decoder_num_heads=6,
-        mlp_ratio=4,
-        qkv_bias=True,
-        norm_layer=partial(nn.LayerNorm, eps=1e-6),
-        **kwargs,
-    )
-    model.default_cfg = _cfg()
-    if pretrained:
-        checkpoint = torch.load(kwargs["init_ckpt"], map_location="cpu")
-        model.load_state_dict(checkpoint["model"])
-    return model
-
-
-@register_model
-def pretrain_videomae_large_patch16_224(pretrained=False, **kwargs):
-    model = PretrainVisionTransformer(
-        img_size=224,
-        patch_size=16,
-        encoder_embed_dim=1024,
-        encoder_depth=24,
-        encoder_num_heads=16,
-        encoder_num_classes=0,
-        decoder_num_classes=1536,
-        decoder_embed_dim=512,
-        decoder_num_heads=8,
-        mlp_ratio=4,
-        qkv_bias=True,
-        norm_layer=partial(nn.LayerNorm, eps=1e-6),
-        **kwargs,
-    )
-    model.default_cfg = _cfg()
-    if pretrained:
-        checkpoint = torch.load(kwargs["init_ckpt"], map_location="cpu")
-        model.load_state_dict(checkpoint["model"])
-    return model
-
-
-@register_model
-def pretrain_videomae_huge_patch16_224(pretrained=False, **kwargs):
-    model = PretrainVisionTransformer(
-        img_size=224,
-        patch_size=16,
-        encoder_embed_dim=1280,
-        encoder_depth=32,
-        encoder_num_heads=16,
-        encoder_num_classes=0,
-        decoder_num_classes=1536,
-        decoder_embed_dim=640,
-        decoder_num_heads=8,
-        mlp_ratio=4,
-        qkv_bias=True,
-        norm_layer=partial(nn.LayerNorm, eps=1e-6),
-        **kwargs,
-    )
-    model.default_cfg = _cfg()
-    if pretrained:
-        checkpoint = torch.load(kwargs["init_ckpt"], map_location="cpu")
-        model.load_state_dict(checkpoint["model"])
-    return model
diff --git a/eval/vbench/third_party/umt/models/modeling_pretrain_umt.py b/eval/vbench/third_party/umt/models/modeling_pretrain_umt.py
deleted file mode 100644
index b9e91592..00000000
--- a/eval/vbench/third_party/umt/models/modeling_pretrain_umt.py
+++ /dev/null
@@ -1,415 +0,0 @@
-import math
-from functools import partial
-
-import numpy as np
-import torch
-import torch.nn as nn
-import torch.nn.functional as F
-import torch.utils.checkpoint as checkpoint
-from timm.models.layers import trunc_normal_ as __call_trunc_normal_
-from timm.models.registry import register_model
-
-from .modeling_finetune import Block, DropPath, Mlp, PatchEmbed, _cfg
-
-
-def trunc_normal_(tensor, mean=0.0, std=1.0):
-    __call_trunc_normal_(tensor, mean=mean, std=std, a=-std, b=std)
-
-
-# sin-cos position encoding
-# https://github.com/jadore801120/attention-is-all-you-need-pytorch/blob/master/transformer/Models.py#L31
-def get_sinusoid_encoding_table(n_position, d_hid):
-    """Sinusoid position encoding table"""
-
-    # TODO: make it with torch instead of numpy
-    def get_position_angle_vec(position):
-        return [
-            position / np.power(10000, 2 * (hid_j // 2) / d_hid)
-            for hid_j in range(d_hid)
-        ]
-
-    sinusoid_table = np.array(
-        [get_position_angle_vec(pos_i) for pos_i in range(n_position)]
-    )
-    sinusoid_table[:, 0::2] = np.sin(sinusoid_table[:, 0::2])  # dim 2i
-    sinusoid_table[:, 1::2] = np.cos(sinusoid_table[:, 1::2])  # dim 2i+1
-
-    return torch.tensor(
-        sinusoid_table, dtype=torch.float, requires_grad=False
-    ).unsqueeze(0)
-
-
-class PretrainVisionTransformerEncoder(nn.Module):
-    """Vision Transformer with support for patch or hybrid CNN input stage"""
-
-    def __init__(
-        self,
-        img_size=224,
-        patch_size=16,
-        in_chans=3,
-        num_classes=0,
-        embed_dim=768,
-        depth=12,
-        num_heads=12,
-        mlp_ratio=4.0,
-        qkv_bias=False,
-        qk_scale=None,
-        drop_rate=0.0,
-        attn_drop_rate=0.0,
-        drop_path_rate=0.0,
-        norm_layer=nn.LayerNorm,
-        init_values=None,
-        num_frames=16,
-        tubelet_size=2,
-        use_checkpoint=False,
-        checkpoint_num=0,
-        use_learnable_pos_emb=False,
-        clip_return_layer=1,
-        clip_student_return_interval=1,
-    ):
-        super().__init__()
-        self.num_classes = num_classes
-        self.num_features = self.embed_dim = (
-            embed_dim  # num_features for consistency with other models
-        )
-        self.patch_embed = PatchEmbed(
-            img_size=img_size,
-            patch_size=patch_size,
-            in_chans=in_chans,
-            embed_dim=embed_dim,
-            num_frames=num_frames,
-            tubelet_size=tubelet_size,
-        )
-        num_patches = self.patch_embed.num_patches
-        self.use_checkpoint = use_checkpoint
-        self.checkpoint_num = checkpoint_num
-        print(f"Use checkpoint: {use_checkpoint}")
-        print(f"Checkpoint number: {checkpoint_num}")
-        self.return_index = []
-        for i in range(clip_return_layer):
-            self.return_index.append(depth - int(i * clip_student_return_interval) - 1)
-        print(f"Student return index: {self.return_index}")
-
-        self.use_learnable_pos_emb = use_learnable_pos_emb
-        if use_learnable_pos_emb:
-            print("Use learnable position embedding")
-            self.pos_embed = nn.Parameter(torch.zeros(1, num_patches, embed_dim))
-        else:
-            # sine-cosine positional embeddings
-            self.pos_embed = get_sinusoid_encoding_table(num_patches, embed_dim)
-
-        dpr = [
-            x.item() for x in torch.linspace(0, drop_path_rate, depth)
-        ]  # stochastic depth decay rule
-        self.blocks = nn.ModuleList(
-            [
-                Block(
-                    dim=embed_dim,
-                    num_heads=num_heads,
-                    mlp_ratio=mlp_ratio,
-                    qkv_bias=qkv_bias,
-                    qk_scale=qk_scale,
-                    drop=drop_rate,
-                    attn_drop=attn_drop_rate,
-                    drop_path=dpr[i],
-                    norm_layer=norm_layer,
-                    init_values=init_values,
-                )
-                for i in range(depth)
-            ]
-        )
-        self.norm = norm_layer(embed_dim)
-        self.head = (
-            nn.Linear(embed_dim, num_classes) if num_classes > 0 else nn.Identity()
-        )
-
-        if use_learnable_pos_emb:
-            trunc_normal_(self.pos_embed, std=0.02)
-
-        self.apply(self._init_weights)
-
-    def _init_weights(self, m):
-        if isinstance(m, nn.Linear):
-            nn.init.xavier_uniform_(m.weight)
-            if isinstance(m, nn.Linear) and m.bias is not None:
-                nn.init.constant_(m.bias, 0)
-        elif isinstance(m, nn.LayerNorm):
-            nn.init.constant_(m.bias, 0)
-            nn.init.constant_(m.weight, 1.0)
-
-    def get_num_layers(self):
-        return len(self.blocks)
-
-    @torch.jit.ignore
-    def no_weight_decay(self):
-        return {"pos_embed", "cls_token"}
-
-    def get_classifier(self):
-        return self.head
-
-    def reset_classifier(self, num_classes, global_pool=""):
-        self.num_classes = num_classes
-        self.head = (
-            nn.Linear(self.embed_dim, num_classes) if num_classes > 0 else nn.Identity()
-        )
-
-    def forward_features(self, x, mask):
-        x = self.patch_embed(x)
-
-        if self.use_learnable_pos_emb:
-            x = x + self.pos_embed.type_as(x).to(x.device)
-        else:
-            x = x + self.pos_embed.type_as(x).to(x.device).clone().detach()
-
-        B, _, C = x.shape
-        x_vis = x[~mask].reshape(B, -1, C)  # ~mask means visible
-        x_clip_vis = []
-
-        for idx, blk in enumerate(self.blocks):
-            if self.use_checkpoint and idx < self.checkpoint_num:
-                x_vis = checkpoint.checkpoint(blk, x_vis)
-            else:
-                x_vis = blk(x_vis)
-            if idx in self.return_index:
-                x_clip_vis.append(x_vis)
-
-        x_vis = self.norm(x_vis)
-        x_clip_vis = self.norm(torch.stack(x_clip_vis))
-        return x_vis, x_clip_vis
-
-    def forward(self, x, mask):
-        x, x_clip_vis = self.forward_features(x, mask)
-        x = self.head(x)
-        x_clip_vis = self.head(x_clip_vis)
-        return x_clip_vis
-
-
-class Linear_Decoder(nn.Module):
-    def __init__(
-        self,
-        num_classes=768,
-        embed_dim=768,
-        norm_layer=nn.LayerNorm,
-        clip_norm_type="l2",
-    ):
-        super().__init__()
-        self.clip_norm_type = clip_norm_type
-        print(f"Normalization Type: {clip_norm_type}")
-
-        self.head = nn.Linear(embed_dim, num_classes)
-        self.norm = norm_layer(num_classes)
-
-        self.apply(self._init_weights)
-
-    def _init_weights(self, m):
-        if isinstance(m, nn.Linear):
-            nn.init.xavier_uniform_(m.weight)
-            if isinstance(m, nn.Linear) and m.bias is not None:
-                nn.init.constant_(m.bias, 0)
-        elif isinstance(m, nn.LayerNorm):
-            nn.init.constant_(m.bias, 0)
-            nn.init.constant_(m.weight, 1.0)
-
-    def forward(self, x):
-        x = self.norm(self.head(x))
-
-        if self.clip_norm_type == "l2":
-            x = x / x.norm(dim=-1, keepdim=True)
-        elif self.clip_norm_type == "none":
-            pass
-        else:
-            raise NotImplementedError
-
-        return x
-
-
-class PretrainVisionTransformer(nn.Module):
-    """Vision Transformer with support for patch or hybrid CNN input stage"""
-
-    def __init__(
-        self,
-        img_size=224,
-        patch_size=16,
-        encoder_in_chans=3,
-        encoder_num_classes=0,
-        encoder_embed_dim=768,
-        encoder_depth=12,
-        encoder_num_heads=12,
-        mlp_ratio=4.0,
-        qkv_bias=False,
-        qk_scale=None,
-        drop_rate=0.0,
-        attn_drop_rate=0.0,
-        drop_path_rate=0.0,
-        norm_layer=nn.LayerNorm,
-        init_values=0.0,
-        use_learnable_pos_emb=False,
-        use_checkpoint=False,
-        checkpoint_num=0,
-        num_frames=16,
-        tubelet_size=2,
-        # clip,
-        clip_decoder_embed_dim=768,
-        clip_output_dim=512,
-        clip_norm_type="l2",
-        clip_return_layer=1,
-        clip_student_return_interval=1,
-    ):
-        super().__init__()
-
-        self.encoder = PretrainVisionTransformerEncoder(
-            img_size=img_size,
-            patch_size=patch_size,
-            in_chans=encoder_in_chans,
-            num_classes=encoder_num_classes,
-            embed_dim=encoder_embed_dim,
-            depth=encoder_depth,
-            num_heads=encoder_num_heads,
-            mlp_ratio=mlp_ratio,
-            qkv_bias=qkv_bias,
-            qk_scale=qk_scale,
-            drop_rate=drop_rate,
-            attn_drop_rate=attn_drop_rate,
-            drop_path_rate=drop_path_rate,
-            norm_layer=norm_layer,
-            init_values=init_values,
-            num_frames=num_frames,
-            tubelet_size=tubelet_size,
-            use_checkpoint=use_checkpoint,
-            checkpoint_num=checkpoint_num,
-            use_learnable_pos_emb=use_learnable_pos_emb,
-            clip_return_layer=clip_return_layer,
-            clip_student_return_interval=clip_student_return_interval,
-        )
-
-        # CLIP decoder
-        self.clip_decoder = nn.ModuleList(
-            [
-                Linear_Decoder(
-                    num_classes=clip_output_dim,
-                    embed_dim=clip_decoder_embed_dim,
-                    norm_layer=norm_layer,
-                    clip_norm_type=clip_norm_type,
-                )
-                for _ in range(clip_return_layer)
-            ]
-        )
-
-        self.clip_pos_embed = get_sinusoid_encoding_table(
-            self.encoder.patch_embed.num_patches, clip_decoder_embed_dim
-        )
-
-    def _init_weights(self, m):
-        if isinstance(m, nn.Linear):
-            nn.init.xavier_uniform_(m.weight)
-            if isinstance(m, nn.Linear) and m.bias is not None:
-                nn.init.constant_(m.bias, 0)
-        elif isinstance(m, nn.LayerNorm):
-            nn.init.constant_(m.bias, 0)
-            nn.init.constant_(m.weight, 1.0)
-
-    def get_num_layers(self):
-        return len(self.blocks)
-
-    @torch.jit.ignore
-    def no_weight_decay(self):
-        return {
-            "pos_embed",
-            "cls_token",
-            "mask_token",
-            "clip_mask_token",
-            "clip_pos_embed",
-        }
-
-    def forward(self, x, mask):
-        x_clip_vis = self.encoder(x, mask)  # [B, N_vis, C_e]
-
-        # align CLIP
-        K, B, _, C_CLIP = x_clip_vis.shape
-        expand_clip_pos_embed = (
-            self.clip_pos_embed.repeat(B, 1, 1).type_as(x).to(x.device).clone().detach()
-        )
-        clip_pos_emd_vis = (
-            expand_clip_pos_embed[~mask]
-            .view(B, -1, C_CLIP)
-            .unsqueeze(0)
-            .repeat(K, 1, 1, 1)
-        )
-        x_clip_full = x_clip_vis + clip_pos_emd_vis  # [K, B, N, C_d_clip]
-
-        x_clip = []
-        for idx, clip_decoder in enumerate(self.clip_decoder):
-            x_clip.append(clip_decoder(x_clip_full[idx]))
-        x_clip = torch.stack(x_clip)  # align and normalize
-
-        return x_clip
-
-
-@register_model
-def pretrain_umt_base_patch16_224(pretrained=False, **kwargs):
-    model = PretrainVisionTransformer(
-        img_size=224,
-        patch_size=16,
-        encoder_embed_dim=768,
-        encoder_depth=12,
-        encoder_num_heads=12,
-        encoder_num_classes=0,
-        mlp_ratio=4,
-        qkv_bias=True,
-        norm_layer=partial(nn.LayerNorm, eps=1e-6),
-        **kwargs,
-    )
-    model.default_cfg = _cfg()
-    if pretrained:
-        checkpoint = torch.load(kwargs["init_ckpt"], map_location="cpu")
-        model.load_state_dict(checkpoint["model"])
-    return model
-
-
-@register_model
-def pretrain_umt_large_patch16_224(pretrained=False, **kwargs):
-    model = PretrainVisionTransformer(
-        img_size=224,
-        patch_size=16,
-        encoder_embed_dim=1024,
-        encoder_depth=24,
-        encoder_num_heads=16,
-        encoder_num_classes=0,
-        mlp_ratio=4,
-        qkv_bias=True,
-        norm_layer=partial(nn.LayerNorm, eps=1e-6),
-        **kwargs,
-    )
-    model.default_cfg = _cfg()
-    if pretrained:
-        checkpoint = torch.load(kwargs["init_ckpt"], map_location="cpu")
-        model.load_state_dict(checkpoint["model"])
-    return model
-
-
-if __name__ == "__main__":
-    import time
-
-    import numpy as np
-    from fvcore.nn import FlopCountAnalysis, flop_count_table
-
-    seed = 4217
-    np.random.seed(seed)
-    torch.manual_seed(seed)
-    torch.cuda.manual_seed(seed)
-    torch.cuda.manual_seed_all(seed)
-
-    model = pretrain_umt_base_patch16_224()
-
-    # flops = FlopCountAnalysis(model, torch.rand(1, 3, 16, 224, 224))
-    # s = time.time()
-    # print(flop_count_table(flops, max_depth=1))
-    # print(time.time()-s)
-    mask = torch.cat(
-        [
-            torch.ones(1, 8 * int(14 * 14 * 0.75)),
-            torch.zeros(1, 8 * int(14 * 14 * 0.25)),
-        ],
-        dim=-1,
-    ).to(torch.bool)
-    print(model(torch.rand(1, 3, 16, 224, 224), mask)[1].shape)
diff --git a/eval/vbench/utils.py b/eval/vbench/utils.py
deleted file mode 100644
index ff15a1d3..00000000
--- a/eval/vbench/utils.py
+++ /dev/null
@@ -1,540 +0,0 @@
-import json
-import logging
-import os
-import re
-import subprocess
-from pathlib import Path
-
-import numpy as np
-import torch
-from decord import VideoReader, cpu
-from PIL import Image, ImageSequence
-from torchvision import transforms
-from torchvision.transforms import (
-    CenterCrop,
-    Compose,
-    Normalize,
-    Resize,
-    ToPILImage,
-    ToTensor,
-)
-
-try:
-    from torchvision.transforms import InterpolationMode
-
-    BICUBIC = InterpolationMode.BICUBIC
-    BILINEAR = InterpolationMode.BILINEAR
-except ImportError:
-    BICUBIC = Image.BICUBIC
-    BILINEAR = Image.BILINEAR
-
-CACHE_DIR = os.environ.get("VBENCH_CACHE_DIR")
-if CACHE_DIR is None:
-    CACHE_DIR = os.path.join(os.path.expanduser("~"), ".cache", "vbench")
-
-logging.basicConfig(
-    level=logging.INFO, format="%(asctime)s - %(name)s - %(levelname)s - %(message)s"
-)
-logger = logging.getLogger(__name__)
-
-
-def clip_transform(n_px):
-    return Compose(
-        [
-            Resize(n_px, interpolation=BICUBIC),
-            CenterCrop(n_px),
-            transforms.Lambda(lambda x: x.float().div(255.0)),
-            Normalize(
-                (0.48145466, 0.4578275, 0.40821073),
-                (0.26862954, 0.26130258, 0.27577711),
-            ),
-        ]
-    )
-
-
-def clip_transform_Image(n_px):
-    return Compose(
-        [
-            Resize(n_px, interpolation=BICUBIC),
-            CenterCrop(n_px),
-            ToTensor(),
-            Normalize(
-                (0.48145466, 0.4578275, 0.40821073),
-                (0.26862954, 0.26130258, 0.27577711),
-            ),
-        ]
-    )
-
-
-def dino_transform(n_px):
-    return Compose(
-        [
-            Resize(size=n_px),
-            transforms.Lambda(lambda x: x.float().div(255.0)),
-            Normalize((0.485, 0.456, 0.406), (0.229, 0.224, 0.225)),
-        ]
-    )
-
-
-def dino_transform_Image(n_px):
-    return Compose(
-        [
-            Resize(size=n_px),
-            ToTensor(),
-            Normalize((0.485, 0.456, 0.406), (0.229, 0.224, 0.225)),
-        ]
-    )
-
-
-def tag2text_transform(n_px):
-    normalize = Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225])
-    return Compose([ToPILImage(), Resize((n_px, n_px)), ToTensor(), normalize])
-
-
-def get_frame_indices(
-    num_frames, vlen, sample="rand", fix_start=None, input_fps=1, max_num_frames=-1
-):
-    if sample in ["rand", "middle"]:  # uniform sampling
-        acc_samples = min(num_frames, vlen)
-        # split the video into `acc_samples` intervals, and sample from each interval.
-        intervals = np.linspace(start=0, stop=vlen, num=acc_samples + 1).astype(int)
-        ranges = []
-        for idx, interv in enumerate(intervals[:-1]):
-            ranges.append((interv, intervals[idx + 1] - 1))
-        if sample == "rand":
-            try:
-                frame_indices = [random.choice(range(x[0], x[1])) for x in ranges]
-            except:
-                frame_indices = np.random.permutation(vlen)[:acc_samples]
-                frame_indices.sort()
-                frame_indices = list(frame_indices)
-        elif fix_start is not None:
-            frame_indices = [x[0] + fix_start for x in ranges]
-        elif sample == "middle":
-            frame_indices = [(x[0] + x[1]) // 2 for x in ranges]
-        else:
-            raise NotImplementedError
-
-        if len(frame_indices) < num_frames:  # padded with last frame
-            padded_frame_indices = [frame_indices[-1]] * num_frames
-            padded_frame_indices[: len(frame_indices)] = frame_indices
-            frame_indices = padded_frame_indices
-    elif "fps" in sample:  # fps0.5, sequentially sample frames at 0.5 fps
-        output_fps = float(sample[3:])
-        duration = float(vlen) / input_fps
-        delta = (
-            1 / output_fps
-        )  # gap between frames, this is also the clip length each frame represents
-        frame_seconds = np.arange(0 + delta / 2, duration + delta / 2, delta)
-        frame_indices = np.around(frame_seconds * input_fps).astype(int)
-        frame_indices = [e for e in frame_indices if e < vlen]
-        if max_num_frames > 0 and len(frame_indices) > max_num_frames:
-            frame_indices = frame_indices[:max_num_frames]
-            # frame_indices = np.linspace(0 + delta / 2, duration + delta / 2, endpoint=False, num=max_num_frames)
-    else:
-        raise ValueError
-    return frame_indices
-
-
-def load_video(
-    video_path,
-    data_transform=None,
-    num_frames=None,
-    return_tensor=True,
-    width=None,
-    height=None,
-):
-    """
-    Load a video from a given path and apply optional data transformations.
-
-    The function supports loading video in GIF (.gif), PNG (.png), and MP4 (.mp4) formats.
-    Depending on the format, it processes and extracts frames accordingly.
-
-    Parameters:
-    - video_path (str): The file path to the video or image to be loaded.
-    - data_transform (callable, optional): A function that applies transformations to the video data.
-
-    Returns:
-    - frames (torch.Tensor): A tensor containing the video frames with shape (T, C, H, W),
-      where T is the number of frames, C is the number of channels, H is the height, and W is the width.
-
-    Raises:
-    - NotImplementedError: If the video format is not supported.
-
-    The function first determines the format of the video file by its extension.
-    For GIFs, it iterates over each frame and converts them to RGB.
-    For PNGs, it reads the single frame, converts it to RGB.
-    For MP4s, it reads the frames using the VideoReader class and converts them to NumPy arrays.
-    If a data_transform is provided, it is applied to the buffer before converting it to a tensor.
-    Finally, the tensor is permuted to match the expected (T, C, H, W) format.
-    """
-    if video_path.endswith(".gif"):
-        frame_ls = []
-        img = Image.open(video_path)
-        for frame in ImageSequence.Iterator(img):
-            frame = frame.convert("RGB")
-            frame = np.array(frame).astype(np.uint8)
-            frame_ls.append(frame)
-        buffer = np.array(frame_ls).astype(np.uint8)
-    elif video_path.endswith(".png"):
-        frame = Image.open(video_path)
-        frame = frame.convert("RGB")
-        frame = np.array(frame).astype(np.uint8)
-        frame_ls = [frame]
-        buffer = np.array(frame_ls)
-    elif video_path.endswith(".mp4"):
-        import decord
-
-        decord.bridge.set_bridge("native")
-        if width:
-            video_reader = VideoReader(
-                video_path, width=width, height=height, num_threads=1
-            )
-        else:
-            video_reader = VideoReader(video_path, num_threads=1)
-        frames = video_reader.get_batch(
-            range(len(video_reader))
-        )  # (T, H, W, C), torch.uint8
-
-        buffer = frames.asnumpy().astype(np.uint8)
-    else:
-        raise NotImplementedError
-
-    frames = buffer
-    if num_frames:
-        frame_indices = get_frame_indices(num_frames, len(frames), sample="middle")
-        frames = frames[frame_indices]
-
-    if data_transform:
-        frames = data_transform(frames)
-    elif return_tensor:
-        frames = torch.Tensor(frames)
-        frames = frames.permute(0, 3, 1, 2)  # (T, C, H, W), torch.uint8
-
-    return frames
-
-
-def read_frames_decord_by_fps(
-    video_path,
-    sample_fps=2,
-    sample="rand",
-    fix_start=None,
-    max_num_frames=-1,
-    trimmed30=False,
-    num_frames=8,
-):
-    import decord
-
-    decord.bridge.set_bridge("torch")
-    video_reader = VideoReader(video_path, num_threads=1)
-    vlen = len(video_reader)
-    fps = video_reader.get_avg_fps()
-    duration = vlen / float(fps)
-
-    if trimmed30 and duration > 30:
-        duration = 30
-        vlen = int(30 * float(fps))
-
-    frame_indices = get_frame_indices(
-        num_frames,
-        vlen,
-        sample=sample,
-        fix_start=fix_start,
-        input_fps=fps,
-        max_num_frames=max_num_frames,
-    )
-    frames = video_reader.get_batch(frame_indices)  # (T, H, W, C), torch.uint8
-    frames = frames.permute(0, 3, 1, 2)  # (T, C, H, W), torch.uint8
-    return frames
-
-
-def load_dimension_info(json_dir, dimension, lang):
-    """
-    Load video list and prompt information based on a specified dimension and language from a JSON file.
-
-    Parameters:
-    - json_dir (str): The directory path where the JSON file is located.
-    - dimension (str): The dimension for evaluation to filter the video prompts.
-    - lang (str): The language key used to retrieve the appropriate prompt text.
-
-    Returns:
-    - video_list (list): A list of video file paths that match the specified dimension.
-    - prompt_dict_ls (list): A list of dictionaries, each containing a prompt and its corresponding video list.
-
-    The function reads the JSON file to extract video information. It filters the prompts based on the specified
-    dimension and compiles a list of video paths and associated prompts in the specified language.
-
-    Notes:
-    - The JSON file is expected to contain a list of dictionaries with keys 'dimension', 'video_list', and language-based prompts.
-    - The function assumes that the 'video_list' key in the JSON can either be a list or a single string value.
-    """
-    video_list = []
-    prompt_dict_ls = []
-    full_prompt_list = load_json(json_dir)
-    for prompt_dict in full_prompt_list:
-        if dimension in prompt_dict["dimension"] and "video_list" in prompt_dict:
-            prompt = prompt_dict[f"prompt_{lang}"]
-            cur_video_list = (
-                prompt_dict["video_list"]
-                if isinstance(prompt_dict["video_list"], list)
-                else [prompt_dict["video_list"]]
-            )
-            video_list += cur_video_list
-            if (
-                "auxiliary_info" in prompt_dict
-                and dimension in prompt_dict["auxiliary_info"]
-            ):
-                prompt_dict_ls += [
-                    {
-                        "prompt": prompt,
-                        "video_list": cur_video_list,
-                        "auxiliary_info": prompt_dict["auxiliary_info"][dimension],
-                    }
-                ]
-            else:
-                prompt_dict_ls += [{"prompt": prompt, "video_list": cur_video_list}]
-    return video_list, prompt_dict_ls
-
-
-def init_submodules(dimension_list, local=False, read_frame=False):
-    submodules_dict = {}
-    if local:
-        logger.info(
-            "\x1b[32m[Local Mode]\x1b[0m Working in local mode, please make sure that the pre-trained model has been fully downloaded."
-        )
-    for dimension in dimension_list:
-        os.makedirs(CACHE_DIR, exist_ok=True)
-        if dimension == "background_consistency":
-            # read_frame = False
-            if local:
-                vit_b_path = f"{CACHE_DIR}/clip_model/ViT-B-32.pt"
-                if not os.path.isfile(vit_b_path):
-                    wget_command = [
-                        "wget",
-                        "https://openaipublic.azureedge.net/clip/models/40d365715913c9da98579312b702a82c18be219cc2a73407c4526f58eba950af/ViT-B-32.pt",
-                        "-P",
-                        os.path.dirname(vit_b_path),
-                    ]
-                    subprocess.run(wget_command, check=True)
-            else:
-                vit_b_path = "ViT-B/32"
-
-            submodules_dict[dimension] = [vit_b_path, read_frame]
-        elif dimension == "human_action":
-            umt_path = f"{CACHE_DIR}/umt_model/l16_ptk710_ftk710_ftk400_f16_res224.pth"
-            if not os.path.isfile(umt_path):
-                wget_command = [
-                    "wget",
-                    "https://pjlab-gvm-data.oss-cn-shanghai.aliyuncs.com/umt/single_modality/l16_ptk710_ftk710_ftk400_f16_res224.pth",
-                    "-P",
-                    os.path.dirname(umt_path),
-                ]
-                subprocess.run(wget_command, check=True)
-            submodules_dict[dimension] = [
-                umt_path,
-            ]
-        elif dimension == "temporal_flickering":
-            submodules_dict[dimension] = []
-        elif dimension == "motion_smoothness":
-            CUR_DIR = os.path.dirname(os.path.abspath(__file__))
-            submodules_dict[dimension] = {
-                "config": f"{CUR_DIR}/third_party/amt/cfgs/AMT-S.yaml",
-                "ckpt": f"{CACHE_DIR}/amt_model/amt-s.pth",
-            }
-            details = submodules_dict[dimension]
-            # Check if the file exists, if not, download it with wget
-            if not os.path.isfile(details["ckpt"]):
-                print(f"File {details['ckpt']} does not exist. Downloading...")
-                wget_command = [
-                    "wget",
-                    "-P",
-                    os.path.dirname(details["ckpt"]),
-                    "https://huggingface.co/lalala125/AMT/resolve/main/amt-s.pth",
-                ]
-                subprocess.run(wget_command, check=True)
-
-        elif dimension == "dynamic_degree":
-            submodules_dict[dimension] = {
-                "model": f"{CACHE_DIR}/raft_model/models/raft-things.pth"
-            }
-            details = submodules_dict[dimension]
-            if not os.path.isfile(details["model"]):
-                # raise NotImplementedError
-                print(f"File {details['model']} does not exist. Downloading...")
-                wget_command = [
-                    "wget",
-                    "-P",
-                    f"{CACHE_DIR}/raft_model/",
-                    "https://dl.dropboxusercontent.com/s/4j4z58wuv8o0mfz/models.zip",
-                ]
-                unzip_command = [
-                    "unzip",
-                    "-d",
-                    f"{CACHE_DIR}/raft_model/",
-                    f"{CACHE_DIR}/raft_model/models.zip",
-                ]
-                remove_command = ["rm", "-r", f"{CACHE_DIR}/raft_model/models.zip"]
-                try:
-                    subprocess.run(wget_command, check=True)
-                    subprocess.run(unzip_command, check=True)
-                    subprocess.run(remove_command, check=True)
-                except subprocess.CalledProcessError as err:
-                    print(f"Error during downloading RAFT model: {err}")
-        # Assign the DINO model path for subject consistency dimension
-        elif dimension == "subject_consistency":
-            if local:
-                submodules_dict[dimension] = {
-                    "repo_or_dir": f"{CACHE_DIR}/dino_model/facebookresearch_dino_main/",
-                    "path": f"{CACHE_DIR}/dino_model/dino_vitbase16_pretrain.pth",
-                    "model": "dino_vitb16",
-                    "source": "local",
-                    "read_frame": read_frame,
-                }
-                details = submodules_dict[dimension]
-                # Check if the file exists, if not, download it with wget
-                if not os.path.isdir(details["repo_or_dir"]):
-                    print(
-                        f"Directory {details['repo_or_dir']} does not exist. Cloning repository..."
-                    )
-                    subprocess.run(
-                        [
-                            "git",
-                            "clone",
-                            "https://github.com/facebookresearch/dino",
-                            details["repo_or_dir"],
-                        ],
-                        check=True,
-                    )
-
-                if not os.path.isfile(details["path"]):
-                    print(f"File {details['path']} does not exist. Downloading...")
-                    wget_command = [
-                        "wget",
-                        "-P",
-                        os.path.dirname(details["path"]),
-                        "https://dl.fbaipublicfiles.com/dino/dino_vitbase16_pretrain/dino_vitbase16_pretrain.pth",
-                    ]
-                    subprocess.run(wget_command, check=True)
-            else:
-                submodules_dict[dimension] = {
-                    "repo_or_dir": "facebookresearch/dino:main",
-                    "source": "github",
-                    "model": "dino_vitb16",
-                    "read_frame": read_frame,
-                }
-        elif dimension == "aesthetic_quality":
-            aes_path = f"{CACHE_DIR}/aesthetic_model/emb_reader"
-            if local:
-                vit_l_path = f"{CACHE_DIR}/clip_model/ViT-L-14.pt"
-                if not os.path.isfile(vit_l_path):
-                    wget_command = [
-                        "wget",
-                        "https://openaipublic.azureedge.net/clip/models/b8cca3fd41ae0c99ba7e8951adf17d267cdb84cd88be6f7c2e0eca1737a03836/ViT-L-14.pt",
-                        "-P",
-                        os.path.dirname(vit_l_path),
-                    ]
-                    subprocess.run(wget_command, check=True)
-            else:
-                vit_l_path = "ViT-L/14"
-            submodules_dict[dimension] = [vit_l_path, aes_path]
-        elif dimension == "imaging_quality":
-            musiq_spaq_path = f"{CACHE_DIR}/pyiqa_model/musiq_spaq_ckpt-358bb6af.pth"
-            if not os.path.isfile(musiq_spaq_path):
-                wget_command = [
-                    "wget",
-                    "https://github.com/chaofengc/IQA-PyTorch/releases/download/v0.1-weights/musiq_spaq_ckpt-358bb6af.pth",
-                    "-P",
-                    os.path.dirname(musiq_spaq_path),
-                ]
-                subprocess.run(wget_command, check=True)
-            submodules_dict[dimension] = {"model_path": musiq_spaq_path}
-        elif dimension in [
-            "object_class",
-            "multiple_objects",
-            "color",
-            "spatial_relationship",
-        ]:
-            submodules_dict[dimension] = {
-                "model_weight": f"{CACHE_DIR}/grit_model/grit_b_densecap_objectdet.pth"
-            }
-            if not os.path.exists(submodules_dict[dimension]["model_weight"]):
-                wget_command = [
-                    "wget",
-                    "https://datarelease.blob.core.windows.net/grit/models/grit_b_densecap_objectdet.pth",
-                    "-P",
-                    os.path.dirname(submodules_dict[dimension]["model_weight"]),
-                ]
-                subprocess.run(wget_command, check=True)
-        elif dimension == "scene":
-            submodules_dict[dimension] = {
-                "pretrained": f"{CACHE_DIR}/caption_model/tag2text_swin_14m.pth",
-                "image_size": 384,
-                "vit": "swin_b",
-            }
-            if not os.path.exists(submodules_dict[dimension]["pretrained"]):
-                wget_command = [
-                    "wget",
-                    "https://huggingface.co/spaces/xinyu1205/recognize-anything/resolve/main/tag2text_swin_14m.pth",
-                    "-P",
-                    os.path.dirname(submodules_dict[dimension]["pretrained"]),
-                ]
-                subprocess.run(wget_command, check=True)
-        elif dimension == "appearance_style":
-            if local:
-                submodules_dict[dimension] = {
-                    "name": f"{CACHE_DIR}/clip_model/ViT-B-32.pt"
-                }
-                if not os.path.isfile(submodules_dict[dimension]["name"]):
-                    wget_command = [
-                        "wget",
-                        "https://openaipublic.azureedge.net/clip/models/40d365715913c9da98579312b702a82c18be219cc2a73407c4526f58eba950af/ViT-B-32.pt",
-                        "-P",
-                        os.path.dirname(submodules_dict[dimension]["name"]),
-                    ]
-                    subprocess.run(wget_command, check=True)
-            else:
-                submodules_dict[dimension] = {"name": "ViT-B/32"}
-        elif dimension in ["temporal_style", "overall_consistency"]:
-            submodules_dict[dimension] = {
-                "pretrain": f"{CACHE_DIR}/ViCLIP/ViClip-InternVid-10M-FLT.pth",
-            }
-            if not os.path.exists(submodules_dict[dimension]["pretrain"]):
-                wget_command = [
-                    "wget",
-                    "https://pjlab-gvm-data.oss-cn-shanghai.aliyuncs.com/internvideo/viclip/ViClip-InternVid-10M-FLT.pth",
-                    "-P",
-                    os.path.dirname(submodules_dict[dimension]["pretrain"]),
-                ]
-                subprocess.run(wget_command, check=True)
-    return submodules_dict
-
-
-def get_prompt_from_filename(path: str):
-    """
-    1. prompt-0.suffix -> prompt
-    2. prompt.suffix -> prompt
-    """
-    prompt = Path(path).stem
-    number_ending = r"-\d+$"  # checks ending with -<number>
-    if re.search(number_ending, prompt):
-        return re.sub(number_ending, "", prompt)
-    return prompt
-
-
-def save_json(data, path, indent=4):
-    with open(path, "w", encoding="utf-8") as f:
-        json.dump(data, f, indent=indent)
-
-
-def load_json(path):
-    """
-    Load a JSON file from the given file path.
-
-    Parameters:
-    - file_path (str): The path to the JSON file.
-
-    Returns:
-    - data (dict or list): The data loaded from the JSON file, which could be a dictionary or a list.
-    """
-    with open(path, "r", encoding="utf-8") as f:
-        return json.load(f)
diff --git a/poetry.lock b/poetry.lock
index 90406d9b..a2cfa8cb 100644
--- a/poetry.lock
+++ b/poetry.lock
@@ -1,186 +1,231 @@
-# This file is automatically @generated by Poetry 2.1.1 and should not be changed by hand.
+# This file is automatically @generated by Poetry 2.4.1 and should not be changed by hand.
 
 [[package]]
 name = "absl-py"
-version = "2.2.2"
+version = "2.4.0"
 description = "Abseil Python Common Libraries, see https://github.com/abseil/abseil-py."
 optional = false
-python-versions = ">=3.8"
-groups = ["main"]
+python-versions = ">=3.10"
+groups = ["training"]
 files = [
-    {file = "absl_py-2.2.2-py3-none-any.whl", hash = "sha256:e5797bc6abe45f64fd95dc06394ca3f2bedf3b5d895e9da691c9ee3397d70092"},
-    {file = "absl_py-2.2.2.tar.gz", hash = "sha256:bf25b2c2eed013ca456918c453d687eab4e8309fba81ee2f4c1a6aa2494175eb"},
+    {file = "absl_py-2.4.0-py3-none-any.whl", hash = "sha256:88476fd881ca8aab94ffa78b7b6c632a782ab3ba1cd19c9bd423abc4fb4cd28d"},
+    {file = "absl_py-2.4.0.tar.gz", hash = "sha256:8c6af82722b35cf71e0f4d1d47dcaebfff286e27110a99fc359349b247dfb5d4"},
 ]
 
 [[package]]
 name = "accelerate"
-version = "0.33.0"
+version = "1.14.0"
 description = "Accelerate"
 optional = false
-python-versions = ">=3.8.0"
+python-versions = ">=3.10.0"
 groups = ["main"]
 files = [
-    {file = "accelerate-0.33.0-py3-none-any.whl", hash = "sha256:0a7f33d60ba09afabd028d4f0856dd19c5a734b7a596d637d9dd6e3d0eadbaf3"},
-    {file = "accelerate-0.33.0.tar.gz", hash = "sha256:11ba481ed6ea09191775df55ce464aeeba67a024bd0261a44b77b30fb439e26a"},
+    {file = "accelerate-1.14.0-py3-none-any.whl", hash = "sha256:e94390c2863b873be18f623f9df48a0d8fe5eff13ea7f1a00092b0a7904888c6"},
+    {file = "accelerate-1.14.0.tar.gz", hash = "sha256:41b9c4377a54e0b460a959b0defa1b736e4ca0a2373252d9a539964c2afe3c8d"},
 ]
 
 [package.dependencies]
-huggingface-hub = ">=0.21.0"
-numpy = ">=1.17,<2.0.0"
+huggingface_hub = ">=0.21.0"
+numpy = ">=1.17"
 packaging = ">=20.0"
 psutil = "*"
 pyyaml = "*"
-safetensors = ">=0.3.1"
-torch = ">=1.10.0"
+safetensors = ">=0.4.3"
+torch = ">=2.0.0"
 
 [package.extras]
-deepspeed = ["deepspeed (<=0.14.0)"]
-dev = ["bitsandbytes", "black (>=23.1,<24.0)", "datasets", "diffusers", "evaluate", "hf-doc-builder (>=0.3.0)", "parameterized", "pytest (>=7.2.0,<=8.0.0)", "pytest-subtests", "pytest-xdist", "rich", "ruff (>=0.2.1,<0.3.0)", "scikit-learn", "scipy", "timm", "torchpippy (>=0.2.0)", "tqdm", "transformers"]
-quality = ["black (>=23.1,<24.0)", "hf-doc-builder (>=0.3.0)", "ruff (>=0.2.1,<0.3.0)"]
+deepspeed = ["deepspeed"]
+dev = ["bitsandbytes", "datasets", "diffusers", "evaluate", "parameterized", "pytest (>=7.2.0)", "pytest-order", "pytest-subtests", "pytest-xdist", "rich", "ruff (==0.13.1)", "scikit-learn", "scipy", "timm", "torchdata (>=0.8.0)", "torchpippy (>=0.2.0)", "tqdm", "transformers"]
+quality = ["ruff (==0.13.1)"]
 rich = ["rich"]
 sagemaker = ["sagemaker"]
-test-dev = ["bitsandbytes", "datasets", "diffusers", "evaluate", "scikit-learn", "scipy", "timm", "torchpippy (>=0.2.0)", "tqdm", "transformers"]
-test-prod = ["parameterized", "pytest (>=7.2.0,<=8.0.0)", "pytest-subtests", "pytest-xdist"]
-test-trackers = ["comet-ml", "dvclive", "tensorboard", "wandb"]
-testing = ["bitsandbytes", "datasets", "diffusers", "evaluate", "parameterized", "pytest (>=7.2.0,<=8.0.0)", "pytest-subtests", "pytest-xdist", "scikit-learn", "scipy", "timm", "torchpippy (>=0.2.0)", "tqdm", "transformers"]
-
-[[package]]
-name = "addict"
-version = "2.4.0"
-description = "Addict is a dictionary whose items can be set using both attribute and item syntax."
-optional = false
-python-versions = "*"
-groups = ["main"]
-files = [
-    {file = "addict-2.4.0-py3-none-any.whl", hash = "sha256:249bb56bbfd3cdc2a004ea0ff4c2b6ddc84d53bc2194761636eb314d5cfa5dfc"},
-    {file = "addict-2.4.0.tar.gz", hash = "sha256:b3b2210e0e067a281f5646c8c5db92e99b7231ea8b0eb5f74dbdf9e259d4e494"},
-]
+test-dev = ["bitsandbytes", "datasets", "diffusers", "evaluate", "scikit-learn", "scipy", "timm", "torchdata (>=0.8.0)", "torchpippy (>=0.2.0)", "tqdm", "transformers"]
+test-fp8 = ["torchao"]
+test-prod = ["parameterized", "pytest (>=7.2.0)", "pytest-order", "pytest-subtests", "pytest-xdist"]
+test-trackers = ["dvclive", "matplotlib", "swanlab[dashboard]", "tensorboard", "trackio", "wandb"]
+testing = ["bitsandbytes", "datasets", "diffusers", "evaluate", "parameterized", "pytest (>=7.2.0)", "pytest-order", "pytest-subtests", "pytest-xdist", "scikit-learn", "scipy", "timm", "torchdata (>=0.8.0)", "torchpippy (>=0.2.0)", "tqdm", "transformers"]
 
 [[package]]
 name = "aiohappyeyeballs"
-version = "2.4.4"
+version = "2.6.2"
 description = "Happy Eyeballs for asyncio"
 optional = false
-python-versions = ">=3.8"
-groups = ["main"]
+python-versions = ">=3.10"
+groups = ["main", "training"]
 files = [
-    {file = "aiohappyeyeballs-2.4.4-py3-none-any.whl", hash = "sha256:a980909d50efcd44795c4afeca523296716d50cd756ddca6af8c65b996e27de8"},
-    {file = "aiohappyeyeballs-2.4.4.tar.gz", hash = "sha256:5fdd7d87889c63183afc18ce9271f9b0a7d32c2303e394468dd45d514a757745"},
+    {file = "aiohappyeyeballs-2.6.2-py3-none-any.whl", hash = "sha256:4708045e2d7a6c6bdf8aafa8ed39649eaf926a4543b54560659129e3365953c4"},
+    {file = "aiohappyeyeballs-2.6.2.tar.gz", hash = "sha256:e202810ee718bd01fc6ef49e8ea53d023d5cb6b581076d7925aa499fa55dbe64"},
 ]
 
 [[package]]
 name = "aiohttp"
-version = "3.11.11"
+version = "3.14.1"
 description = "Async http client/server framework (asyncio)"
 optional = false
-python-versions = ">=3.9"
-groups = ["main"]
-files = [
-    {file = "aiohttp-3.11.11-cp310-cp310-macosx_10_9_universal2.whl", hash = "sha256:a60804bff28662cbcf340a4d61598891f12eea3a66af48ecfdc975ceec21e3c8"},
-    {file = "aiohttp-3.11.11-cp310-cp310-macosx_10_9_x86_64.whl", hash = "sha256:4b4fa1cb5f270fb3eab079536b764ad740bb749ce69a94d4ec30ceee1b5940d5"},
-    {file = "aiohttp-3.11.11-cp310-cp310-macosx_11_0_arm64.whl", hash = "sha256:731468f555656767cda219ab42e033355fe48c85fbe3ba83a349631541715ba2"},
-    {file = "aiohttp-3.11.11-cp310-cp310-manylinux_2_17_aarch64.manylinux2014_aarch64.whl", hash = "sha256:cb23d8bb86282b342481cad4370ea0853a39e4a32a0042bb52ca6bdde132df43"},
-    {file = "aiohttp-3.11.11-cp310-cp310-manylinux_2_17_ppc64le.manylinux2014_ppc64le.whl", hash = "sha256:f047569d655f81cb70ea5be942ee5d4421b6219c3f05d131f64088c73bb0917f"},
-    {file = "aiohttp-3.11.11-cp310-cp310-manylinux_2_17_s390x.manylinux2014_s390x.whl", hash = "sha256:dd7659baae9ccf94ae5fe8bfaa2c7bc2e94d24611528395ce88d009107e00c6d"},
-    {file = "aiohttp-3.11.11-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl", hash = "sha256:af01e42ad87ae24932138f154105e88da13ce7d202a6de93fafdafb2883a00ef"},
-    {file = "aiohttp-3.11.11-cp310-cp310-manylinux_2_5_i686.manylinux1_i686.manylinux_2_17_i686.manylinux2014_i686.whl", hash = "sha256:5854be2f3e5a729800bac57a8d76af464e160f19676ab6aea74bde18ad19d438"},
-    {file = "aiohttp-3.11.11-cp310-cp310-musllinux_1_2_aarch64.whl", hash = "sha256:6526e5fb4e14f4bbf30411216780c9967c20c5a55f2f51d3abd6de68320cc2f3"},
-    {file = "aiohttp-3.11.11-cp310-cp310-musllinux_1_2_i686.whl", hash = "sha256:85992ee30a31835fc482468637b3e5bd085fa8fe9392ba0bdcbdc1ef5e9e3c55"},
-    {file = "aiohttp-3.11.11-cp310-cp310-musllinux_1_2_ppc64le.whl", hash = "sha256:88a12ad8ccf325a8a5ed80e6d7c3bdc247d66175afedbe104ee2aaca72960d8e"},
-    {file = "aiohttp-3.11.11-cp310-cp310-musllinux_1_2_s390x.whl", hash = "sha256:0a6d3fbf2232e3a08c41eca81ae4f1dff3d8f1a30bae415ebe0af2d2458b8a33"},
-    {file = "aiohttp-3.11.11-cp310-cp310-musllinux_1_2_x86_64.whl", hash = "sha256:84a585799c58b795573c7fa9b84c455adf3e1d72f19a2bf498b54a95ae0d194c"},
-    {file = "aiohttp-3.11.11-cp310-cp310-win32.whl", hash = "sha256:bfde76a8f430cf5c5584553adf9926534352251d379dcb266ad2b93c54a29745"},
-    {file = "aiohttp-3.11.11-cp310-cp310-win_amd64.whl", hash = "sha256:0fd82b8e9c383af11d2b26f27a478640b6b83d669440c0a71481f7c865a51da9"},
-    {file = "aiohttp-3.11.11-cp311-cp311-macosx_10_9_universal2.whl", hash = "sha256:ba74ec819177af1ef7f59063c6d35a214a8fde6f987f7661f4f0eecc468a8f76"},
-    {file = "aiohttp-3.11.11-cp311-cp311-macosx_10_9_x86_64.whl", hash = "sha256:4af57160800b7a815f3fe0eba9b46bf28aafc195555f1824555fa2cfab6c1538"},
-    {file = "aiohttp-3.11.11-cp311-cp311-macosx_11_0_arm64.whl", hash = "sha256:ffa336210cf9cd8ed117011085817d00abe4c08f99968deef0013ea283547204"},
-    {file = "aiohttp-3.11.11-cp311-cp311-manylinux_2_17_aarch64.manylinux2014_aarch64.whl", hash = "sha256:81b8fe282183e4a3c7a1b72f5ade1094ed1c6345a8f153506d114af5bf8accd9"},
-    {file = "aiohttp-3.11.11-cp311-cp311-manylinux_2_17_ppc64le.manylinux2014_ppc64le.whl", hash = "sha256:3af41686ccec6a0f2bdc66686dc0f403c41ac2089f80e2214a0f82d001052c03"},
-    {file = "aiohttp-3.11.11-cp311-cp311-manylinux_2_17_s390x.manylinux2014_s390x.whl", hash = "sha256:70d1f9dde0e5dd9e292a6d4d00058737052b01f3532f69c0c65818dac26dc287"},
-    {file = "aiohttp-3.11.11-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl", hash = "sha256:249cc6912405917344192b9f9ea5cd5b139d49e0d2f5c7f70bdfaf6b4dbf3a2e"},
-    {file = "aiohttp-3.11.11-cp311-cp311-manylinux_2_5_i686.manylinux1_i686.manylinux_2_17_i686.manylinux2014_i686.whl", hash = "sha256:0eb98d90b6690827dcc84c246811feeb4e1eea683c0eac6caed7549be9c84665"},
-    {file = "aiohttp-3.11.11-cp311-cp311-musllinux_1_2_aarch64.whl", hash = "sha256:ec82bf1fda6cecce7f7b915f9196601a1bd1a3079796b76d16ae4cce6d0ef89b"},
-    {file = "aiohttp-3.11.11-cp311-cp311-musllinux_1_2_i686.whl", hash = "sha256:9fd46ce0845cfe28f108888b3ab17abff84ff695e01e73657eec3f96d72eef34"},
-    {file = "aiohttp-3.11.11-cp311-cp311-musllinux_1_2_ppc64le.whl", hash = "sha256:bd176afcf8f5d2aed50c3647d4925d0db0579d96f75a31e77cbaf67d8a87742d"},
-    {file = "aiohttp-3.11.11-cp311-cp311-musllinux_1_2_s390x.whl", hash = "sha256:ec2aa89305006fba9ffb98970db6c8221541be7bee4c1d027421d6f6df7d1ce2"},
-    {file = "aiohttp-3.11.11-cp311-cp311-musllinux_1_2_x86_64.whl", hash = "sha256:92cde43018a2e17d48bb09c79e4d4cb0e236de5063ce897a5e40ac7cb4878773"},
-    {file = "aiohttp-3.11.11-cp311-cp311-win32.whl", hash = "sha256:aba807f9569455cba566882c8938f1a549f205ee43c27b126e5450dc9f83cc62"},
-    {file = "aiohttp-3.11.11-cp311-cp311-win_amd64.whl", hash = "sha256:ae545f31489548c87b0cced5755cfe5a5308d00407000e72c4fa30b19c3220ac"},
-    {file = "aiohttp-3.11.11-cp312-cp312-macosx_10_13_universal2.whl", hash = "sha256:e595c591a48bbc295ebf47cb91aebf9bd32f3ff76749ecf282ea7f9f6bb73886"},
-    {file = "aiohttp-3.11.11-cp312-cp312-macosx_10_13_x86_64.whl", hash = "sha256:3ea1b59dc06396b0b424740a10a0a63974c725b1c64736ff788a3689d36c02d2"},
-    {file = "aiohttp-3.11.11-cp312-cp312-macosx_11_0_arm64.whl", hash = "sha256:8811f3f098a78ffa16e0ea36dffd577eb031aea797cbdba81be039a4169e242c"},
-    {file = "aiohttp-3.11.11-cp312-cp312-manylinux_2_17_aarch64.manylinux2014_aarch64.whl", hash = "sha256:bd7227b87a355ce1f4bf83bfae4399b1f5bb42e0259cb9405824bd03d2f4336a"},
-    {file = "aiohttp-3.11.11-cp312-cp312-manylinux_2_17_ppc64le.manylinux2014_ppc64le.whl", hash = "sha256:d40f9da8cabbf295d3a9dae1295c69975b86d941bc20f0a087f0477fa0a66231"},
-    {file = "aiohttp-3.11.11-cp312-cp312-manylinux_2_17_s390x.manylinux2014_s390x.whl", hash = "sha256:ffb3dc385f6bb1568aa974fe65da84723210e5d9707e360e9ecb51f59406cd2e"},
-    {file = "aiohttp-3.11.11-cp312-cp312-manylinux_2_17_x86_64.manylinux2014_x86_64.whl", hash = "sha256:a8f5f7515f3552d899c61202d99dcb17d6e3b0de777900405611cd747cecd1b8"},
-    {file = "aiohttp-3.11.11-cp312-cp312-manylinux_2_5_i686.manylinux1_i686.manylinux_2_17_i686.manylinux2014_i686.whl", hash = "sha256:3499c7ffbfd9c6a3d8d6a2b01c26639da7e43d47c7b4f788016226b1e711caa8"},
-    {file = "aiohttp-3.11.11-cp312-cp312-musllinux_1_2_aarch64.whl", hash = "sha256:8e2bf8029dbf0810c7bfbc3e594b51c4cc9101fbffb583a3923aea184724203c"},
-    {file = "aiohttp-3.11.11-cp312-cp312-musllinux_1_2_i686.whl", hash = "sha256:b6212a60e5c482ef90f2d788835387070a88d52cf6241d3916733c9176d39eab"},
-    {file = "aiohttp-3.11.11-cp312-cp312-musllinux_1_2_ppc64le.whl", hash = "sha256:d119fafe7b634dbfa25a8c597718e69a930e4847f0b88e172744be24515140da"},
-    {file = "aiohttp-3.11.11-cp312-cp312-musllinux_1_2_s390x.whl", hash = "sha256:6fba278063559acc730abf49845d0e9a9e1ba74f85f0ee6efd5803f08b285853"},
-    {file = "aiohttp-3.11.11-cp312-cp312-musllinux_1_2_x86_64.whl", hash = "sha256:92fc484e34b733704ad77210c7957679c5c3877bd1e6b6d74b185e9320cc716e"},
-    {file = "aiohttp-3.11.11-cp312-cp312-win32.whl", hash = "sha256:9f5b3c1ed63c8fa937a920b6c1bec78b74ee09593b3f5b979ab2ae5ef60d7600"},
-    {file = "aiohttp-3.11.11-cp312-cp312-win_amd64.whl", hash = "sha256:1e69966ea6ef0c14ee53ef7a3d68b564cc408121ea56c0caa2dc918c1b2f553d"},
-    {file = "aiohttp-3.11.11-cp313-cp313-macosx_10_13_universal2.whl", hash = "sha256:541d823548ab69d13d23730a06f97460f4238ad2e5ed966aaf850d7c369782d9"},
-    {file = "aiohttp-3.11.11-cp313-cp313-macosx_10_13_x86_64.whl", hash = "sha256:929f3ed33743a49ab127c58c3e0a827de0664bfcda566108989a14068f820194"},
-    {file = "aiohttp-3.11.11-cp313-cp313-macosx_11_0_arm64.whl", hash = "sha256:0882c2820fd0132240edbb4a51eb8ceb6eef8181db9ad5291ab3332e0d71df5f"},
-    {file = "aiohttp-3.11.11-cp313-cp313-manylinux_2_17_aarch64.manylinux2014_aarch64.whl", hash = "sha256:b63de12e44935d5aca7ed7ed98a255a11e5cb47f83a9fded7a5e41c40277d104"},
-    {file = "aiohttp-3.11.11-cp313-cp313-manylinux_2_17_ppc64le.manylinux2014_ppc64le.whl", hash = "sha256:aa54f8ef31d23c506910c21163f22b124facb573bff73930735cf9fe38bf7dff"},
-    {file = "aiohttp-3.11.11-cp313-cp313-manylinux_2_17_s390x.manylinux2014_s390x.whl", hash = "sha256:a344d5dc18074e3872777b62f5f7d584ae4344cd6006c17ba12103759d407af3"},
-    {file = "aiohttp-3.11.11-cp313-cp313-manylinux_2_17_x86_64.manylinux2014_x86_64.whl", hash = "sha256:0b7fb429ab1aafa1f48578eb315ca45bd46e9c37de11fe45c7f5f4138091e2f1"},
-    {file = "aiohttp-3.11.11-cp313-cp313-manylinux_2_5_i686.manylinux1_i686.manylinux_2_17_i686.manylinux2014_i686.whl", hash = "sha256:c341c7d868750e31961d6d8e60ff040fb9d3d3a46d77fd85e1ab8e76c3e9a5c4"},
-    {file = "aiohttp-3.11.11-cp313-cp313-musllinux_1_2_aarch64.whl", hash = "sha256:ed9ee95614a71e87f1a70bc81603f6c6760128b140bc4030abe6abaa988f1c3d"},
-    {file = "aiohttp-3.11.11-cp313-cp313-musllinux_1_2_i686.whl", hash = "sha256:de8d38f1c2810fa2a4f1d995a2e9c70bb8737b18da04ac2afbf3971f65781d87"},
-    {file = "aiohttp-3.11.11-cp313-cp313-musllinux_1_2_ppc64le.whl", hash = "sha256:a9b7371665d4f00deb8f32208c7c5e652059b0fda41cf6dbcac6114a041f1cc2"},
-    {file = "aiohttp-3.11.11-cp313-cp313-musllinux_1_2_s390x.whl", hash = "sha256:620598717fce1b3bd14dd09947ea53e1ad510317c85dda2c9c65b622edc96b12"},
-    {file = "aiohttp-3.11.11-cp313-cp313-musllinux_1_2_x86_64.whl", hash = "sha256:bf8d9bfee991d8acc72d060d53860f356e07a50f0e0d09a8dfedea1c554dd0d5"},
-    {file = "aiohttp-3.11.11-cp313-cp313-win32.whl", hash = "sha256:9d73ee3725b7a737ad86c2eac5c57a4a97793d9f442599bea5ec67ac9f4bdc3d"},
-    {file = "aiohttp-3.11.11-cp313-cp313-win_amd64.whl", hash = "sha256:c7a06301c2fb096bdb0bd25fe2011531c1453b9f2c163c8031600ec73af1cc99"},
-    {file = "aiohttp-3.11.11-cp39-cp39-macosx_10_9_universal2.whl", hash = "sha256:3e23419d832d969f659c208557de4a123e30a10d26e1e14b73431d3c13444c2e"},
-    {file = "aiohttp-3.11.11-cp39-cp39-macosx_10_9_x86_64.whl", hash = "sha256:21fef42317cf02e05d3b09c028712e1d73a9606f02467fd803f7c1f39cc59add"},
-    {file = "aiohttp-3.11.11-cp39-cp39-macosx_11_0_arm64.whl", hash = "sha256:1f21bb8d0235fc10c09ce1d11ffbd40fc50d3f08a89e4cf3a0c503dc2562247a"},
-    {file = "aiohttp-3.11.11-cp39-cp39-manylinux_2_17_aarch64.manylinux2014_aarch64.whl", hash = "sha256:1642eceeaa5ab6c9b6dfeaaa626ae314d808188ab23ae196a34c9d97efb68350"},
-    {file = "aiohttp-3.11.11-cp39-cp39-manylinux_2_17_ppc64le.manylinux2014_ppc64le.whl", hash = "sha256:2170816e34e10f2fd120f603e951630f8a112e1be3b60963a1f159f5699059a6"},
-    {file = "aiohttp-3.11.11-cp39-cp39-manylinux_2_17_s390x.manylinux2014_s390x.whl", hash = "sha256:8be8508d110d93061197fd2d6a74f7401f73b6d12f8822bbcd6d74f2b55d71b1"},
-    {file = "aiohttp-3.11.11-cp39-cp39-manylinux_2_17_x86_64.manylinux2014_x86_64.whl", hash = "sha256:4eed954b161e6b9b65f6be446ed448ed3921763cc432053ceb606f89d793927e"},
-    {file = "aiohttp-3.11.11-cp39-cp39-manylinux_2_5_i686.manylinux1_i686.manylinux_2_17_i686.manylinux2014_i686.whl", hash = "sha256:d6c9af134da4bc9b3bd3e6a70072509f295d10ee60c697826225b60b9959acdd"},
-    {file = "aiohttp-3.11.11-cp39-cp39-musllinux_1_2_aarch64.whl", hash = "sha256:44167fc6a763d534a6908bdb2592269b4bf30a03239bcb1654781adf5e49caf1"},
-    {file = "aiohttp-3.11.11-cp39-cp39-musllinux_1_2_i686.whl", hash = "sha256:479b8c6ebd12aedfe64563b85920525d05d394b85f166b7873c8bde6da612f9c"},
-    {file = "aiohttp-3.11.11-cp39-cp39-musllinux_1_2_ppc64le.whl", hash = "sha256:10b4ff0ad793d98605958089fabfa350e8e62bd5d40aa65cdc69d6785859f94e"},
-    {file = "aiohttp-3.11.11-cp39-cp39-musllinux_1_2_s390x.whl", hash = "sha256:b540bd67cfb54e6f0865ceccd9979687210d7ed1a1cc8c01f8e67e2f1e883d28"},
-    {file = "aiohttp-3.11.11-cp39-cp39-musllinux_1_2_x86_64.whl", hash = "sha256:1dac54e8ce2ed83b1f6b1a54005c87dfed139cf3f777fdc8afc76e7841101226"},
-    {file = "aiohttp-3.11.11-cp39-cp39-win32.whl", hash = "sha256:568c1236b2fde93b7720f95a890741854c1200fba4a3471ff48b2934d2d93fd3"},
-    {file = "aiohttp-3.11.11-cp39-cp39-win_amd64.whl", hash = "sha256:943a8b052e54dfd6439fd7989f67fc6a7f2138d0a2cf0a7de5f18aa4fe7eb3b1"},
-    {file = "aiohttp-3.11.11.tar.gz", hash = "sha256:bb49c7f1e6ebf3821a42d81d494f538107610c3a705987f53068546b0e90303e"},
+python-versions = ">=3.10"
+groups = ["main", "training"]
+files = [
+    {file = "aiohttp-3.14.1-cp310-cp310-macosx_10_9_universal2.whl", hash = "sha256:8f6bb621e5863cfe8fe5ff5468002d200ec31f30f1280b259dc505b02595099e"},
+    {file = "aiohttp-3.14.1-cp310-cp310-macosx_10_9_x86_64.whl", hash = "sha256:4f7215cb3933784f79ed20e5f050e15984f390424339b22375d5a53c933a0491"},
+    {file = "aiohttp-3.14.1-cp310-cp310-macosx_11_0_arm64.whl", hash = "sha256:d9d4e294455b23a68c9b8f042d0e8e377a265bcb15332753695f6e5b6819e0ce"},
+    {file = "aiohttp-3.14.1-cp310-cp310-manylinux2014_aarch64.manylinux_2_17_aarch64.manylinux_2_28_aarch64.whl", hash = "sha256:b238af795833d5731d049d82bc84b768ae6f8f97f0495963b3ed9935c5901cc3"},
+    {file = "aiohttp-3.14.1-cp310-cp310-manylinux2014_armv7l.manylinux_2_17_armv7l.manylinux_2_31_armv7l.whl", hash = "sha256:e4e5e0ae56914ecdbf446493addefc0159053dd53962cef37d7839f37f73d505"},
+    {file = "aiohttp-3.14.1-cp310-cp310-manylinux2014_ppc64le.manylinux_2_17_ppc64le.manylinux_2_28_ppc64le.whl", hash = "sha256:092e4ce3619a7c6dee52a6bdabda973d9b34b66781f840ce93c7e0cec30cf521"},
+    {file = "aiohttp-3.14.1-cp310-cp310-manylinux2014_s390x.manylinux_2_17_s390x.manylinux_2_28_s390x.whl", hash = "sha256:bb33777ea21e8b7ecde0e6fc84f598be0a1192eab1a63bc746d75aa75d38e7bd"},
+    {file = "aiohttp-3.14.1-cp310-cp310-manylinux2014_x86_64.manylinux_2_17_x86_64.manylinux_2_28_x86_64.whl", hash = "sha256:23119f8fd4f5d16902ed459b63b100bcd269628075162bddac56cc7b5273b3fb"},
+    {file = "aiohttp-3.14.1-cp310-cp310-manylinux_2_31_riscv64.manylinux_2_39_riscv64.whl", hash = "sha256:57fc6745a4b7d0f5a9eb4f40a69718be6c0bc1b8368cc9fe89e90118719f4f42"},
+    {file = "aiohttp-3.14.1-cp310-cp310-musllinux_1_2_aarch64.whl", hash = "sha256:6fd35beba67c4183b09375c5fff9accb47524191a244a99f95fd4472f5402c2b"},
+    {file = "aiohttp-3.14.1-cp310-cp310-musllinux_1_2_armv7l.whl", hash = "sha256:672b9d65f42eb877f5c3f234a4547e4e1a226ca8c2eed879bb34670a0ce51192"},
+    {file = "aiohttp-3.14.1-cp310-cp310-musllinux_1_2_ppc64le.whl", hash = "sha256:24ba13339fed9251d9b1a1bec8c7ab84c0d1675d79d33501e11f94f8b9a84e05"},
+    {file = "aiohttp-3.14.1-cp310-cp310-musllinux_1_2_riscv64.whl", hash = "sha256:94da27378da0610e341c4d30de29a191672683cc82b8f9556e8f7c7212a020fe"},
+    {file = "aiohttp-3.14.1-cp310-cp310-musllinux_1_2_s390x.whl", hash = "sha256:52cdac9432d8b4a719f35094a818d95adcae0f0b4fe9b9b921909e0c87de9e7d"},
+    {file = "aiohttp-3.14.1-cp310-cp310-musllinux_1_2_x86_64.whl", hash = "sha256:672ac254412a24d0d0cf00a9e6c238877e4be5e5fa2d188832c1244f45f31966"},
+    {file = "aiohttp-3.14.1-cp310-cp310-win32.whl", hash = "sha256:2fe3607e71acc6ebb0ec8e492a247bf7a291226192dc0084236dfc12478916f6"},
+    {file = "aiohttp-3.14.1-cp310-cp310-win_amd64.whl", hash = "sha256:30099eda75a53c32efb0920e9c33c195314d2cc1c680fbfd30894932ac5f27df"},
+    {file = "aiohttp-3.14.1-cp310-cp310-win_arm64.whl", hash = "sha256:5a837f49d901f9e368651b676912bff1104ed8c1a83b280bcd7b29adccef5c9c"},
+    {file = "aiohttp-3.14.1-cp311-cp311-macosx_10_9_universal2.whl", hash = "sha256:aa00140699487bd435fde4342d85c94cb256b7cd3a5b9c3396c67f19922afda2"},
+    {file = "aiohttp-3.14.1-cp311-cp311-macosx_10_9_x86_64.whl", hash = "sha256:1c1af67559445498b502030c35c59db59966f47041ca9de5b4e707f86bd10b5f"},
+    {file = "aiohttp-3.14.1-cp311-cp311-macosx_11_0_arm64.whl", hash = "sha256:d44ec478e713ee7f29b439f7eb8dc2b9d4079e11ae114d2c2ac3d5daf30516c8"},
+    {file = "aiohttp-3.14.1-cp311-cp311-manylinux2014_aarch64.manylinux_2_17_aarch64.manylinux_2_28_aarch64.whl", hash = "sha256:d3b1a184a9a8f548a6b73f1e26b96b052193e4b3175ed7342aaf1151a1f00a04"},
+    {file = "aiohttp-3.14.1-cp311-cp311-manylinux2014_armv7l.manylinux_2_17_armv7l.manylinux_2_31_armv7l.whl", hash = "sha256:5f2504bc0322437c9a1ff6d3333ca56c7477b727c995f036b976ae17b98372c8"},
+    {file = "aiohttp-3.14.1-cp311-cp311-manylinux2014_ppc64le.manylinux_2_17_ppc64le.manylinux_2_28_ppc64le.whl", hash = "sha256:73f05ea02013e02512c3bf42714f1208c57168c779cc6fe23516e4543089d0a6"},
+    {file = "aiohttp-3.14.1-cp311-cp311-manylinux2014_s390x.manylinux_2_17_s390x.manylinux_2_28_s390x.whl", hash = "sha256:797457503c2d426bee06eef808d07b31ede30b65e054444e7de64cad0061b7af"},
+    {file = "aiohttp-3.14.1-cp311-cp311-manylinux2014_x86_64.manylinux_2_17_x86_64.manylinux_2_28_x86_64.whl", hash = "sha256:b821a1f7dedf7e37450654e620038ac3b2e81e8fa6ea269337e97101978ec730"},
+    {file = "aiohttp-3.14.1-cp311-cp311-manylinux_2_31_riscv64.manylinux_2_39_riscv64.whl", hash = "sha256:4cd96b5ba05d67ed0cf00b5b405c8cd99586d8e3481e8ee0a831057591af7621"},
+    {file = "aiohttp-3.14.1-cp311-cp311-musllinux_1_2_aarch64.whl", hash = "sha256:1d459b98a932296c6f0e94f87511a0b1b90a8a02c30a50e60a297619cd5a58ee"},
+    {file = "aiohttp-3.14.1-cp311-cp311-musllinux_1_2_armv7l.whl", hash = "sha256:764457a7be60825fb770a644852ff717bcbb5042f189f2bd16df61a81b3f6573"},
+    {file = "aiohttp-3.14.1-cp311-cp311-musllinux_1_2_ppc64le.whl", hash = "sha256:f7a16ef45b081454ef844502d87a848876c490c4cb5c650c230f6ec79ed2c1e7"},
+    {file = "aiohttp-3.14.1-cp311-cp311-musllinux_1_2_riscv64.whl", hash = "sha256:2fbc3ed048b3475b9f0cbcb9978e9d2d3511acd91ead203af26ed9f0056004cf"},
+    {file = "aiohttp-3.14.1-cp311-cp311-musllinux_1_2_s390x.whl", hash = "sha256:bedb0cd073cc2dc035e30aeb99444389d3cd2113afe4ef9fcd23d439f5bade85"},
+    {file = "aiohttp-3.14.1-cp311-cp311-musllinux_1_2_x86_64.whl", hash = "sha256:b6feea921016eb3d4e04d65fc4e9ca402d1a3801f562aef94989f54694917af3"},
+    {file = "aiohttp-3.14.1-cp311-cp311-win32.whl", hash = "sha256:313701e488100074ce99850404ee36e741abf6330179fec908a1944ecf570126"},
+    {file = "aiohttp-3.14.1-cp311-cp311-win_amd64.whl", hash = "sha256:03ab4530fdcb3a543a122ba4b65ac9919da9fe9f78a03d328a6e38ff962f7aa5"},
+    {file = "aiohttp-3.14.1-cp311-cp311-win_arm64.whl", hash = "sha256:486f7d16ed54c39c2cbd7ca71fd8ba2b8bb7860df65bd7b6ed640bab96a38a8b"},
+    {file = "aiohttp-3.14.1-cp312-cp312-macosx_10_13_universal2.whl", hash = "sha256:d35143e27778b4bb0fb189562d7f275bff79c62ab8e98459717c0ea617ff2480"},
+    {file = "aiohttp-3.14.1-cp312-cp312-macosx_10_13_x86_64.whl", hash = "sha256:bcfb80a2cc36fba2534e5e5b5264dc7ae6fcd9bf15256da3e53d2f499e6fa29d"},
+    {file = "aiohttp-3.14.1-cp312-cp312-macosx_11_0_arm64.whl", hash = "sha256:27fd7c91e51729b4f7e1577865fa6d34c9adccbc39aabe9000285b48af9f0ec2"},
+    {file = "aiohttp-3.14.1-cp312-cp312-manylinux2014_aarch64.manylinux_2_17_aarch64.manylinux_2_28_aarch64.whl", hash = "sha256:64c567bf9eaf664280116a8688f63016e6b32db2505908e2bdaca1b6438142f2"},
+    {file = "aiohttp-3.14.1-cp312-cp312-manylinux2014_armv7l.manylinux_2_17_armv7l.manylinux_2_31_armv7l.whl", hash = "sha256:f5e6ff2bdbb8f4cd3fbe41f99e25bbcd58e3bf9f13d3dd31a11e7917251cc77a"},
+    {file = "aiohttp-3.14.1-cp312-cp312-manylinux2014_ppc64le.manylinux_2_17_ppc64le.manylinux_2_28_ppc64le.whl", hash = "sha256:2f73e01dc37122325caf079982621262f96d74823c179038a82fddfc50359264"},
+    {file = "aiohttp-3.14.1-cp312-cp312-manylinux2014_s390x.manylinux_2_17_s390x.manylinux_2_28_s390x.whl", hash = "sha256:bb2c0c80d431c0d03f2c7dbf125150fedd4f0de17366a7ca33f7ccb822391842"},
+    {file = "aiohttp-3.14.1-cp312-cp312-manylinux2014_x86_64.manylinux_2_17_x86_64.manylinux_2_28_x86_64.whl", hash = "sha256:3e6fc1a85fa7194a1a7d19f44e8609180f4a8eb5fa4c7ed8b4355f080fad235c"},
+    {file = "aiohttp-3.14.1-cp312-cp312-manylinux_2_31_riscv64.manylinux_2_39_riscv64.whl", hash = "sha256:686b6c0d3911ec387b444ddf5dc62fb7f7c0a7d5186a7861626496a5ab4aff95"},
+    {file = "aiohttp-3.14.1-cp312-cp312-musllinux_1_2_aarch64.whl", hash = "sha256:c6fa4dc7ad6f8109c70bb1499e589f76b0b792baf39f9b017eb92c8a81d0a199"},
+    {file = "aiohttp-3.14.1-cp312-cp312-musllinux_1_2_armv7l.whl", hash = "sha256:87a5eea1b2a5e21e1ebdbb33ad4165359189327e63fc4e4894693e7f821ac817"},
+    {file = "aiohttp-3.14.1-cp312-cp312-musllinux_1_2_ppc64le.whl", hash = "sha256:1c1421eb01d4fd608d88cc8290211d177a58532b55ad94076fb349c5bf467f0a"},
+    {file = "aiohttp-3.14.1-cp312-cp312-musllinux_1_2_riscv64.whl", hash = "sha256:34b257ec41345c1e8f2df68fa908a7952f5de932723871eb633ecbbff396c9a4"},
+    {file = "aiohttp-3.14.1-cp312-cp312-musllinux_1_2_s390x.whl", hash = "sha256:de538791a80e5d862addbc183f70f0158ac9b9bb872bb147f1fd2a683691e087"},
+    {file = "aiohttp-3.14.1-cp312-cp312-musllinux_1_2_x86_64.whl", hash = "sha256:6f71173be42d3241d428f760122febb748de0623f44308a6f120d0dd9ec572e3"},
+    {file = "aiohttp-3.14.1-cp312-cp312-win32.whl", hash = "sha256:ec8dc383ee57ea3e883477dcca3f11b65d58199f1080acaf4cd6ad9a99698be4"},
+    {file = "aiohttp-3.14.1-cp312-cp312-win_amd64.whl", hash = "sha256:2aa92c87868cd13674989f9ee83e5f9f7ea4237589b728048e1f0c8f6caa3271"},
+    {file = "aiohttp-3.14.1-cp312-cp312-win_arm64.whl", hash = "sha256:2c840c90759922cb5e6dda94596e079a30fb5a5ba548e7e0dc00574703940847"},
+    {file = "aiohttp-3.14.1-cp313-cp313-android_21_arm64_v8a.whl", hash = "sha256:b3a03285a7f9c7b016324574a6d92a1c895da6b978cb8f1deee3ac72bc6da178"},
+    {file = "aiohttp-3.14.1-cp313-cp313-android_21_x86_64.whl", hash = "sha256:2a73f487ab8ef5abbb24b7aa9b73e98eaba9e9e031804ff2416f02eca315ccaf"},
+    {file = "aiohttp-3.14.1-cp313-cp313-ios_13_0_arm64_iphoneos.whl", hash = "sha256:915fbb7b41b115192259f8c9ae58f3ddc444d2b5579917270211858e606a4afd"},
+    {file = "aiohttp-3.14.1-cp313-cp313-ios_13_0_arm64_iphonesimulator.whl", hash = "sha256:7fb4bdf95b0561a79f259f9d28fbc109728c5ee7f27aff6391f0ca703a329abe"},
+    {file = "aiohttp-3.14.1-cp313-cp313-ios_13_0_x86_64_iphonesimulator.whl", hash = "sha256:1b9748363260121d2927704f5d4fc498150669ca3ae93625986ee89c8f80dcd4"},
+    {file = "aiohttp-3.14.1-cp313-cp313-macosx_10_13_universal2.whl", hash = "sha256:86a6dab78b0e43e2897a3bbe15745aa60dc5423ca437b7b0b164c069bf91b876"},
+    {file = "aiohttp-3.14.1-cp313-cp313-macosx_10_13_x86_64.whl", hash = "sha256:4dfd6e47d3c44c2279907607f73a4240b88c69eb8b90da7e2441a8045dfd21da"},
+    {file = "aiohttp-3.14.1-cp313-cp313-macosx_11_0_arm64.whl", hash = "sha256:317acd9f8602858dc7d59679812c376c7f0b97bcbbf16e0d6237f54141d8a8a6"},
+    {file = "aiohttp-3.14.1-cp313-cp313-manylinux2014_aarch64.manylinux_2_17_aarch64.manylinux_2_28_aarch64.whl", hash = "sha256:bd869c427324e5cb15195793de951295710db28be7d818247f3097b4ab5d4b96"},
+    {file = "aiohttp-3.14.1-cp313-cp313-manylinux2014_armv7l.manylinux_2_17_armv7l.manylinux_2_31_armv7l.whl", hash = "sha256:93b032b5ec3255473c143627d21a69ac74ae12f7f33974cb587c564d11b1066f"},
+    {file = "aiohttp-3.14.1-cp313-cp313-manylinux2014_ppc64le.manylinux_2_17_ppc64le.manylinux_2_28_ppc64le.whl", hash = "sha256:f234b4deb12f3ad59127e037bc57c40c21e45b45282df7d3a55a0f409f595296"},
+    {file = "aiohttp-3.14.1-cp313-cp313-manylinux2014_s390x.manylinux_2_17_s390x.manylinux_2_28_s390x.whl", hash = "sha256:9af6779bfb46abf124068327abcdf9ce95c9ef8287a3e8da76ccf2d0f16c28fa"},
+    {file = "aiohttp-3.14.1-cp313-cp313-manylinux2014_x86_64.manylinux_2_17_x86_64.manylinux_2_28_x86_64.whl", hash = "sha256:faccab372e66bc76d5731525e7f1143c922271725b9d38c9f97edcc66266b451"},
+    {file = "aiohttp-3.14.1-cp313-cp313-manylinux_2_31_riscv64.manylinux_2_39_riscv64.whl", hash = "sha256:f380468b09d2a81633ee863b0ec5648d364bd17bb8ecfb8c2f387f7ac1faf42c"},
+    {file = "aiohttp-3.14.1-cp313-cp313-musllinux_1_2_aarch64.whl", hash = "sha256:97e704dcd26271f5bda3fa07c3ce0fb76d6d3f8659f4baa1a24442cc9ba177ca"},
+    {file = "aiohttp-3.14.1-cp313-cp313-musllinux_1_2_armv7l.whl", hash = "sha256:269b76ac5394092b95bc4a098f4fc6c191c083c3bd12775d1e30e663132f6a09"},
+    {file = "aiohttp-3.14.1-cp313-cp313-musllinux_1_2_ppc64le.whl", hash = "sha256:5c0b3e614340c889d575451696374c9d17affd54cd607ca0babed8f8c37b9397"},
+    {file = "aiohttp-3.14.1-cp313-cp313-musllinux_1_2_riscv64.whl", hash = "sha256:5663ee9257cfa1add7253a7da3035a02f31b6600ec48261585e1800a81533080"},
+    {file = "aiohttp-3.14.1-cp313-cp313-musllinux_1_2_s390x.whl", hash = "sha256:603a2c834142172ffddc054067f5ec0ca65d57a0aa98a71bc81952573208e345"},
+    {file = "aiohttp-3.14.1-cp313-cp313-musllinux_1_2_x86_64.whl", hash = "sha256:cb21957bb8aca671c1765e32f58164cf0c50e6bf41c0bbbd16da20732ecaf588"},
+    {file = "aiohttp-3.14.1-cp313-cp313-win32.whl", hash = "sha256:e509a55f681e6158c20f70f102f9cf61fb20fbc382272bc6d94b7343f2582780"},
+    {file = "aiohttp-3.14.1-cp313-cp313-win_amd64.whl", hash = "sha256:1ac8531b638959718e18c2207fbfe297819875da46a740b29dfa29beba64355a"},
+    {file = "aiohttp-3.14.1-cp313-cp313-win_arm64.whl", hash = "sha256:250d14af67f6b6a1a4a811049b1afa69d61d617fca6bf33149b3ab1a6dbcf7b8"},
+    {file = "aiohttp-3.14.1-cp314-cp314-android_24_arm64_v8a.whl", hash = "sha256:7c106c26852ca1c2047c6b80384f17100b4e439af276f21ef3d4e2f450ae7e15"},
+    {file = "aiohttp-3.14.1-cp314-cp314-android_24_x86_64.whl", hash = "sha256:20205f7f5ade7aaec9f4b500549bbc071b046453aed72f9c06dcab87896a83e8"},
+    {file = "aiohttp-3.14.1-cp314-cp314-ios_13_0_arm64_iphoneos.whl", hash = "sha256:62a759436b29e677181a9e76bab8b8f689a29cb9c535f45f7c48c9c830d3f8c3"},
+    {file = "aiohttp-3.14.1-cp314-cp314-ios_13_0_arm64_iphonesimulator.whl", hash = "sha256:2964cbf553df4d7a57348da44d961d871895fc1ee4e8c322b2a95612c7b17fba"},
+    {file = "aiohttp-3.14.1-cp314-cp314-ios_13_0_x86_64_iphonesimulator.whl", hash = "sha256:237651caadc3a59badd39319c54642b5299e9cc98a3a194310e55d5bb9f5e397"},
+    {file = "aiohttp-3.14.1-cp314-cp314-macosx_10_15_universal2.whl", hash = "sha256:896e12dfdbbab9d8f7e16d2b28c6769a60126fa92095d1ebf9473d02593a2448"},
+    {file = "aiohttp-3.14.1-cp314-cp314-macosx_10_15_x86_64.whl", hash = "sha256:d03f281ed22579314ba00821ce20115a7c0ac430660b4cc05704a3f818b3e004"},
+    {file = "aiohttp-3.14.1-cp314-cp314-macosx_11_0_arm64.whl", hash = "sha256:07eabb979d236335fed927e137a928c9adfb7df3b9ec7aa31726f133a62be983"},
+    {file = "aiohttp-3.14.1-cp314-cp314-manylinux2014_aarch64.manylinux_2_17_aarch64.manylinux_2_28_aarch64.whl", hash = "sha256:4fe1f1087cbadb280b5e1bb054a4f00d1423c74d6626c5e48400d871d34ecefe"},
+    {file = "aiohttp-3.14.1-cp314-cp314-manylinux2014_armv7l.manylinux_2_17_armv7l.manylinux_2_31_armv7l.whl", hash = "sha256:367a9314fdc79dab0fac96e216cb41dd73c85bdca85306ce8999118ba7e0f333"},
+    {file = "aiohttp-3.14.1-cp314-cp314-manylinux2014_ppc64le.manylinux_2_17_ppc64le.manylinux_2_28_ppc64le.whl", hash = "sha256:a24f677ebe83749039e7bdf862ff0bbb16818ae4193d4ef96505e269375bcce0"},
+    {file = "aiohttp-3.14.1-cp314-cp314-manylinux2014_s390x.manylinux_2_17_s390x.manylinux_2_28_s390x.whl", hash = "sha256:c83afe0ba876be7e943d2e0ba645809ad441575d2840c895c21ee5de93b9377a"},
+    {file = "aiohttp-3.14.1-cp314-cp314-manylinux2014_x86_64.manylinux_2_17_x86_64.manylinux_2_28_x86_64.whl", hash = "sha256:634e385930fb6d2d479cf3aa66515955863b77a5e3c2b5894ca259a25b308602"},
+    {file = "aiohttp-3.14.1-cp314-cp314-manylinux_2_31_riscv64.manylinux_2_39_riscv64.whl", hash = "sha256:eeea07c4397bbc57719c4eed8f9c284874d4f175f9b6d57f7a1546b976d455ca"},
+    {file = "aiohttp-3.14.1-cp314-cp314-musllinux_1_2_aarch64.whl", hash = "sha256:335c0cc3e3545ce98dcb9cfcb836f40c3411f43fa03dab757597d80c89af8a35"},
+    {file = "aiohttp-3.14.1-cp314-cp314-musllinux_1_2_armv7l.whl", hash = "sha256:ae6be797afdef264e8a84864a85b196ca06045586481b3df8a967322fd2fa844"},
+    {file = "aiohttp-3.14.1-cp314-cp314-musllinux_1_2_ppc64le.whl", hash = "sha256:8560b4d712474335d08907db7973f71912d3a9a8f1dee992ec06b5d2fe359496"},
+    {file = "aiohttp-3.14.1-cp314-cp314-musllinux_1_2_riscv64.whl", hash = "sha256:2b7edd08e0a5deb1e8564a2fcd8f4561014a3f05252334671bbf55ddd47db0e5"},
+    {file = "aiohttp-3.14.1-cp314-cp314-musllinux_1_2_s390x.whl", hash = "sha256:b6ff7fcee63287ae57b5df3e4f5957ce032122802509246dec1a5bcc55904c95"},
+    {file = "aiohttp-3.14.1-cp314-cp314-musllinux_1_2_x86_64.whl", hash = "sha256:6ffbb2f4ec1ceaff7e07d43922954da26b223d188bf30658e561b98e23089444"},
+    {file = "aiohttp-3.14.1-cp314-cp314-win32.whl", hash = "sha256:a9875b46d910cff3ea2f5962f9d266b465459fe634e22556ab9bd6fc1192eea0"},
+    {file = "aiohttp-3.14.1-cp314-cp314-win_amd64.whl", hash = "sha256:af8b4b81a960eeaf1234971ac3cd0ba5901f3cd42eae42a46b4d089a8b492719"},
+    {file = "aiohttp-3.14.1-cp314-cp314-win_arm64.whl", hash = "sha256:cf4491381b1b57425c315a56a439251b1bdac07b2275f19a8c44bc57744532ec"},
+    {file = "aiohttp-3.14.1-cp314-cp314t-macosx_10_15_universal2.whl", hash = "sha256:819c054312f1af92947e6a55883d1b66feefab11531a7fc45e0fb9b63880b5c2"},
+    {file = "aiohttp-3.14.1-cp314-cp314t-macosx_10_15_x86_64.whl", hash = "sha256:10ee9c1753a8f706345b22496c79fbddb5be0599e0823f3738b1534058e25340"},
+    {file = "aiohttp-3.14.1-cp314-cp314t-macosx_11_0_arm64.whl", hash = "sha256:1601cc37baf5750ccacae618ec2daf020769581695550e3b654a911f859c563d"},
+    {file = "aiohttp-3.14.1-cp314-cp314t-manylinux2014_aarch64.manylinux_2_17_aarch64.manylinux_2_28_aarch64.whl", hash = "sha256:4d6e0ac9da31c9c04c84e1c0182ad8d6df35965a85cae29cd71d089621b3ae94"},
+    {file = "aiohttp-3.14.1-cp314-cp314t-manylinux2014_armv7l.manylinux_2_17_armv7l.manylinux_2_31_armv7l.whl", hash = "sha256:9e8f2d660c350b3d0e259c7a7e3d9b7fc8b41210cbcc3d4a7076ff0a5e5c2fdc"},
+    {file = "aiohttp-3.14.1-cp314-cp314t-manylinux2014_ppc64le.manylinux_2_17_ppc64le.manylinux_2_28_ppc64le.whl", hash = "sha256:4691802dda97be727f79d86818acaad7eb8e9252626a1d6b519fedbb92d5e251"},
+    {file = "aiohttp-3.14.1-cp314-cp314t-manylinux2014_s390x.manylinux_2_17_s390x.manylinux_2_28_s390x.whl", hash = "sha256:c389c482a7e9b9dc3ee2701ac46c4125297a3818875b9c305ddb603c04828fd1"},
+    {file = "aiohttp-3.14.1-cp314-cp314t-manylinux2014_x86_64.manylinux_2_17_x86_64.manylinux_2_28_x86_64.whl", hash = "sha256:fc0cacab7ba4e56f0f81c82a98c09bed2f39c940107b03a34b168bdf7597edd3"},
+    {file = "aiohttp-3.14.1-cp314-cp314t-manylinux_2_31_riscv64.manylinux_2_39_riscv64.whl", hash = "sha256:979ed4717f59b8bb12e3963378fa285d93d367e15bcd66c721311826d3c44a6c"},
+    {file = "aiohttp-3.14.1-cp314-cp314t-musllinux_1_2_aarch64.whl", hash = "sha256:38e1e7daaea81df51c952e18483f323d878499a1e2bfe564790e0f9701d6f203"},
+    {file = "aiohttp-3.14.1-cp314-cp314t-musllinux_1_2_armv7l.whl", hash = "sha256:4132e72c608fe9fecb8f409113567605915b83e9bdd3ea56538d2f9cd35002f1"},
+    {file = "aiohttp-3.14.1-cp314-cp314t-musllinux_1_2_ppc64le.whl", hash = "sha256:eefd9cc9b6d4a2db5f00a26bc3e4f9acf71926a6ec557cd56c9c6f27c290b665"},
+    {file = "aiohttp-3.14.1-cp314-cp314t-musllinux_1_2_riscv64.whl", hash = "sha256:b165790117eea512d7f3fb22f1f6dad3d55a7189571993eb015591c1401276d1"},
+    {file = "aiohttp-3.14.1-cp314-cp314t-musllinux_1_2_s390x.whl", hash = "sha256:ed09c7eb1c391271c2ed0314a51903e72a3acb653d5ccfc264cdf3ef11f8269d"},
+    {file = "aiohttp-3.14.1-cp314-cp314t-musllinux_1_2_x86_64.whl", hash = "sha256:99abd37084b82f5830c635fddd0b4993b9742a66eb746dacf433c8590e8f9e3c"},
+    {file = "aiohttp-3.14.1-cp314-cp314t-win32.whl", hash = "sha256:47ddf841cdecc810749921d25606dee45857d12d2ad5ddb7b5bd7eab12e4b365"},
+    {file = "aiohttp-3.14.1-cp314-cp314t-win_amd64.whl", hash = "sha256:5e78b522b7a6e27e0b25d19b247b75039ac4c94f99823e3c9e53ae1603a9f7e9"},
+    {file = "aiohttp-3.14.1-cp314-cp314t-win_arm64.whl", hash = "sha256:90d53f1609c29ccc2193945ef732428382a28f78d0456ae4d3daf0d48b74f0f6"},
+    {file = "aiohttp-3.14.1.tar.gz", hash = "sha256:307f2cff90a764d329e77040603fa032db89c5c24fdad50c4c15334cba744035"},
 ]
 
 [package.dependencies]
-aiohappyeyeballs = ">=2.3.0"
-aiosignal = ">=1.1.2"
-async-timeout = {version = ">=4.0,<6.0", markers = "python_version < \"3.11\""}
+aiohappyeyeballs = ">=2.5.0"
+aiosignal = ">=1.4.0"
 attrs = ">=17.3.0"
 frozenlist = ">=1.1.1"
 multidict = ">=4.5,<7.0"
 propcache = ">=0.2.0"
+typing_extensions = {version = ">=4.4", markers = "python_version < \"3.13\""}
 yarl = ">=1.17.0,<2.0"
 
 [package.extras]
-speedups = ["Brotli ; platform_python_implementation == \"CPython\"", "aiodns (>=3.2.0) ; sys_platform == \"linux\" or sys_platform == \"darwin\"", "brotlicffi ; platform_python_implementation != \"CPython\""]
+speedups = ["Brotli (>=1.2) ; platform_python_implementation == \"CPython\" and sys_platform != \"android\" and sys_platform != \"ios\"", "aiodns (>=3.3.0) ; sys_platform != \"android\" and sys_platform != \"ios\"", "backports.zstd ; platform_python_implementation == \"CPython\" and python_version < \"3.14\" and sys_platform != \"android\" and sys_platform != \"ios\"", "brotlicffi (>=1.2) ; platform_python_implementation != \"CPython\""]
 
 [[package]]
 name = "aiosignal"
-version = "1.3.2"
+version = "1.4.0"
 description = "aiosignal: a list of registered asynchronous callbacks"
 optional = false
 python-versions = ">=3.9"
-groups = ["main"]
+groups = ["main", "training"]
 files = [
-    {file = "aiosignal-1.3.2-py2.py3-none-any.whl", hash = "sha256:45cde58e409a301715980c2b01d0c28bdde3770d8290b5eb2173759d9acb31a5"},
-    {file = "aiosignal-1.3.2.tar.gz", hash = "sha256:a8c255c66fafb1e499c9351d0bf32ff2d8a0321595ebac3b93713656d2436f54"},
+    {file = "aiosignal-1.4.0-py3-none-any.whl", hash = "sha256:053243f8b92b990551949e63930a839ff0cf0b0ebbe0597b0f3fb19e1a0fe82e"},
+    {file = "aiosignal-1.4.0.tar.gz", hash = "sha256:f47eecd9468083c2029cc99945502cb7708b082c232f9aca65da147157b251c7"},
 ]
 
 [package.dependencies]
 frozenlist = ">=1.1.0"
+typing-extensions = {version = ">=4.2", markers = "python_version < \"3.13\""}
+
+[[package]]
+name = "annotated-doc"
+version = "0.0.4"
+description = "Document parameters, class attributes, return types, and variables inline, with Annotated."
+optional = false
+python-versions = ">=3.8"
+groups = ["main"]
+files = [
+    {file = "annotated_doc-0.0.4-py3-none-any.whl", hash = "sha256:571ac1dc6991c450b25a9c2d84a3705e2ae7a53467b5d111c24fa8baabbed320"},
+    {file = "annotated_doc-0.0.4.tar.gz", hash = "sha256:fbcda96e87e9c92ad167c2e53839e57503ecfda18804ea28102353485033faa4"},
+]
 
 [[package]]
 name = "annotated-types"
@@ -188,7 +233,7 @@ version = "0.7.0"
 description = "Reusable constraint types to use with typing.Annotated"
 optional = false
 python-versions = ">=3.8"
-groups = ["main"]
+groups = ["main", "training"]
 files = [
     {file = "annotated_types-0.7.0-py3-none-any.whl", hash = "sha256:1f02e8b43a8fbbc3f3e0d4f0f4bfc8131bcb4eebe8849b8e5c773f3a1c582a53"},
     {file = "annotated_types-0.7.0.tar.gz", hash = "sha256:aff07c09a53a08bc8cfccb9c85b05f1aa9a2a6f23728d790723543408344ce89"},
@@ -206,48 +251,112 @@ files = [
 ]
 
 [[package]]
-name = "args"
-version = "0.1.0"
-description = "Command Arguments for Humans."
+name = "anyio"
+version = "4.14.0"
+description = "High-level concurrency and networking framework on top of asyncio or Trio"
 optional = false
-python-versions = "*"
+python-versions = ">=3.10"
 groups = ["main"]
 files = [
-    {file = "args-0.1.0.tar.gz", hash = "sha256:a785b8d837625e9b61c39108532d95b85274acd679693b71ebb5156848fcf814"},
+    {file = "anyio-4.14.0-py3-none-any.whl", hash = "sha256:dd9b7a2a9799ed6552fde617b2c5df02b7fdd7d88392fc48101e51bae46164d9"},
+    {file = "anyio-4.14.0.tar.gz", hash = "sha256:b47c1f9ccf73e67021df785332508f99379c68fa7d0684e8e3492cb1d4b23f89"},
 ]
 
-[[package]]
-name = "async-timeout"
-version = "5.0.1"
-description = "Timeout context manager for asyncio programs"
-optional = false
-python-versions = ">=3.8"
-groups = ["main"]
-markers = "python_version == \"3.10\""
-files = [
-    {file = "async_timeout-5.0.1-py3-none-any.whl", hash = "sha256:39e3809566ff85354557ec2398b55e096c8364bacac9405a7a1fa429e77fe76c"},
-    {file = "async_timeout-5.0.1.tar.gz", hash = "sha256:d9321a7a3d5a6a5e187e824d2fa0793ce379a202935782d555d6e9d2735677d3"},
-]
+[package.dependencies]
+idna = ">=2.8"
+typing_extensions = {version = ">=4.5", markers = "python_version < \"3.13\""}
+
+[package.extras]
+trio = ["trio (>=0.32.0)"]
 
 [[package]]
 name = "attrs"
-version = "24.3.0"
+version = "26.1.0"
 description = "Classes Without Boilerplate"
 optional = false
-python-versions = ">=3.8"
-groups = ["main", "dev"]
+python-versions = ">=3.9"
+groups = ["main", "training"]
+files = [
+    {file = "attrs-26.1.0-py3-none-any.whl", hash = "sha256:c647aa4a12dfbad9333ca4e71fe62ddc36f4e63b2d260a37a8b83d2f043ac309"},
+    {file = "attrs-26.1.0.tar.gz", hash = "sha256:d03ceb89cb322a8fd706d4fb91940737b6642aa36998fe130a9bc96c985eff32"},
+]
+
+[[package]]
+name = "audioop-lts"
+version = "0.2.2"
+description = "LTS Port of Python audioop"
+optional = true
+python-versions = ">=3.13"
+groups = ["main"]
+markers = "python_version >= \"3.13\" and extra == \"trackio\""
+files = [
+    {file = "audioop_lts-0.2.2-cp313-abi3-macosx_10_13_universal2.whl", hash = "sha256:fd3d4602dc64914d462924a08c1a9816435a2155d74f325853c1f1ac3b2d9800"},
+    {file = "audioop_lts-0.2.2-cp313-abi3-macosx_10_13_x86_64.whl", hash = "sha256:550c114a8df0aafe9a05442a1162dfc8fec37e9af1d625ae6060fed6e756f303"},
+    {file = "audioop_lts-0.2.2-cp313-abi3-macosx_11_0_arm64.whl", hash = "sha256:9a13dc409f2564de15dd68be65b462ba0dde01b19663720c68c1140c782d1d75"},
+    {file = "audioop_lts-0.2.2-cp313-abi3-manylinux1_x86_64.manylinux_2_28_x86_64.manylinux_2_5_x86_64.whl", hash = "sha256:51c916108c56aa6e426ce611946f901badac950ee2ddaf302b7ed35d9958970d"},
+    {file = "audioop_lts-0.2.2-cp313-abi3-manylinux2014_aarch64.manylinux_2_17_aarch64.manylinux_2_28_aarch64.whl", hash = "sha256:47eba38322370347b1c47024defbd36374a211e8dd5b0dcbce7b34fdb6f8847b"},
+    {file = "audioop_lts-0.2.2-cp313-abi3-manylinux2014_ppc64le.manylinux_2_17_ppc64le.manylinux_2_28_ppc64le.whl", hash = "sha256:ba7c3a7e5f23e215cb271516197030c32aef2e754252c4c70a50aaff7031a2c8"},
+    {file = "audioop_lts-0.2.2-cp313-abi3-manylinux2014_s390x.manylinux_2_17_s390x.manylinux_2_28_s390x.whl", hash = "sha256:def246fe9e180626731b26e89816e79aae2276f825420a07b4a647abaa84becc"},
+    {file = "audioop_lts-0.2.2-cp313-abi3-manylinux_2_31_riscv64.manylinux_2_39_riscv64.whl", hash = "sha256:e160bf9df356d841bb6c180eeeea1834085464626dc1b68fa4e1d59070affdc3"},
+    {file = "audioop_lts-0.2.2-cp313-abi3-musllinux_1_2_aarch64.whl", hash = "sha256:4b4cd51a57b698b2d06cb9993b7ac8dfe89a3b2878e96bc7948e9f19ff51dba6"},
+    {file = "audioop_lts-0.2.2-cp313-abi3-musllinux_1_2_ppc64le.whl", hash = "sha256:4a53aa7c16a60a6857e6b0b165261436396ef7293f8b5c9c828a3a203147ed4a"},
+    {file = "audioop_lts-0.2.2-cp313-abi3-musllinux_1_2_riscv64.whl", hash = "sha256:3fc38008969796f0f689f1453722a0f463da1b8a6fbee11987830bfbb664f623"},
+    {file = "audioop_lts-0.2.2-cp313-abi3-musllinux_1_2_s390x.whl", hash = "sha256:15ab25dd3e620790f40e9ead897f91e79c0d3ce65fe193c8ed6c26cffdd24be7"},
+    {file = "audioop_lts-0.2.2-cp313-abi3-musllinux_1_2_x86_64.whl", hash = "sha256:03f061a1915538fd96272bac9551841859dbb2e3bf73ebe4a23ef043766f5449"},
+    {file = "audioop_lts-0.2.2-cp313-abi3-win32.whl", hash = "sha256:3bcddaaf6cc5935a300a8387c99f7a7fbbe212a11568ec6cf6e4bc458c048636"},
+    {file = "audioop_lts-0.2.2-cp313-abi3-win_amd64.whl", hash = "sha256:a2c2a947fae7d1062ef08c4e369e0ba2086049a5e598fda41122535557012e9e"},
+    {file = "audioop_lts-0.2.2-cp313-abi3-win_arm64.whl", hash = "sha256:5f93a5db13927a37d2d09637ccca4b2b6b48c19cd9eda7b17a2e9f77edee6a6f"},
+    {file = "audioop_lts-0.2.2-cp313-cp313t-macosx_10_13_universal2.whl", hash = "sha256:73f80bf4cd5d2ca7814da30a120de1f9408ee0619cc75da87d0641273d202a09"},
+    {file = "audioop_lts-0.2.2-cp313-cp313t-macosx_10_13_x86_64.whl", hash = "sha256:106753a83a25ee4d6f473f2be6b0966fc1c9af7e0017192f5531a3e7463dce58"},
+    {file = "audioop_lts-0.2.2-cp313-cp313t-macosx_11_0_arm64.whl", hash = "sha256:fbdd522624141e40948ab3e8cdae6e04c748d78710e9f0f8d4dae2750831de19"},
+    {file = "audioop_lts-0.2.2-cp313-cp313t-manylinux1_x86_64.manylinux_2_28_x86_64.manylinux_2_5_x86_64.whl", hash = "sha256:143fad0311e8209ece30a8dbddab3b65ab419cbe8c0dde6e8828da25999be911"},
+    {file = "audioop_lts-0.2.2-cp313-cp313t-manylinux2014_aarch64.manylinux_2_17_aarch64.manylinux_2_28_aarch64.whl", hash = "sha256:dfbbc74ec68a0fd08cfec1f4b5e8cca3d3cd7de5501b01c4b5d209995033cde9"},
+    {file = "audioop_lts-0.2.2-cp313-cp313t-manylinux2014_ppc64le.manylinux_2_17_ppc64le.manylinux_2_28_ppc64le.whl", hash = "sha256:cfcac6aa6f42397471e4943e0feb2244549db5c5d01efcd02725b96af417f3fe"},
+    {file = "audioop_lts-0.2.2-cp313-cp313t-manylinux2014_s390x.manylinux_2_17_s390x.manylinux_2_28_s390x.whl", hash = "sha256:752d76472d9804ac60f0078c79cdae8b956f293177acd2316cd1e15149aee132"},
+    {file = "audioop_lts-0.2.2-cp313-cp313t-manylinux_2_31_riscv64.manylinux_2_39_riscv64.whl", hash = "sha256:83c381767e2cc10e93e40281a04852facc4cd9334550e0f392f72d1c0a9c5753"},
+    {file = "audioop_lts-0.2.2-cp313-cp313t-musllinux_1_2_aarch64.whl", hash = "sha256:c0022283e9556e0f3643b7c3c03f05063ca72b3063291834cca43234f20c60bb"},
+    {file = "audioop_lts-0.2.2-cp313-cp313t-musllinux_1_2_ppc64le.whl", hash = "sha256:a2d4f1513d63c795e82948e1305f31a6d530626e5f9f2605408b300ae6095093"},
+    {file = "audioop_lts-0.2.2-cp313-cp313t-musllinux_1_2_riscv64.whl", hash = "sha256:c9c8e68d8b4a56fda8c025e538e639f8c5953f5073886b596c93ec9b620055e7"},
+    {file = "audioop_lts-0.2.2-cp313-cp313t-musllinux_1_2_s390x.whl", hash = "sha256:96f19de485a2925314f5020e85911fb447ff5fbef56e8c7c6927851b95533a1c"},
+    {file = "audioop_lts-0.2.2-cp313-cp313t-musllinux_1_2_x86_64.whl", hash = "sha256:e541c3ef484852ef36545f66209444c48b28661e864ccadb29daddb6a4b8e5f5"},
+    {file = "audioop_lts-0.2.2-cp313-cp313t-win32.whl", hash = "sha256:d5e73fa573e273e4f2e5ff96f9043858a5e9311e94ffefd88a3186a910c70917"},
+    {file = "audioop_lts-0.2.2-cp313-cp313t-win_amd64.whl", hash = "sha256:9191d68659eda01e448188f60364c7763a7ca6653ed3f87ebb165822153a8547"},
+    {file = "audioop_lts-0.2.2-cp313-cp313t-win_arm64.whl", hash = "sha256:c174e322bb5783c099aaf87faeb240c8d210686b04bd61dfd05a8e5a83d88969"},
+    {file = "audioop_lts-0.2.2-cp314-cp314t-macosx_10_13_universal2.whl", hash = "sha256:f9ee9b52f5f857fbaf9d605a360884f034c92c1c23021fb90b2e39b8e64bede6"},
+    {file = "audioop_lts-0.2.2-cp314-cp314t-macosx_10_13_x86_64.whl", hash = "sha256:49ee1a41738a23e98d98b937a0638357a2477bc99e61b0f768a8f654f45d9b7a"},
+    {file = "audioop_lts-0.2.2-cp314-cp314t-macosx_11_0_arm64.whl", hash = "sha256:5b00be98ccd0fc123dcfad31d50030d25fcf31488cde9e61692029cd7394733b"},
+    {file = "audioop_lts-0.2.2-cp314-cp314t-manylinux1_x86_64.manylinux_2_28_x86_64.manylinux_2_5_x86_64.whl", hash = "sha256:a6d2e0f9f7a69403e388894d4ca5ada5c47230716a03f2847cfc7bd1ecb589d6"},
+    {file = "audioop_lts-0.2.2-cp314-cp314t-manylinux2014_aarch64.manylinux_2_17_aarch64.manylinux_2_28_aarch64.whl", hash = "sha256:f9b0b8a03ef474f56d1a842af1a2e01398b8f7654009823c6d9e0ecff4d5cfbf"},
+    {file = "audioop_lts-0.2.2-cp314-cp314t-manylinux2014_ppc64le.manylinux_2_17_ppc64le.manylinux_2_28_ppc64le.whl", hash = "sha256:2b267b70747d82125f1a021506565bdc5609a2b24bcb4773c16d79d2bb260bbd"},
+    {file = "audioop_lts-0.2.2-cp314-cp314t-manylinux2014_s390x.manylinux_2_17_s390x.manylinux_2_28_s390x.whl", hash = "sha256:0337d658f9b81f4cd0fdb1f47635070cc084871a3d4646d9de74fdf4e7c3d24a"},
+    {file = "audioop_lts-0.2.2-cp314-cp314t-manylinux_2_31_riscv64.manylinux_2_39_riscv64.whl", hash = "sha256:167d3b62586faef8b6b2275c3218796b12621a60e43f7e9d5845d627b9c9b80e"},
+    {file = "audioop_lts-0.2.2-cp314-cp314t-musllinux_1_2_aarch64.whl", hash = "sha256:0d9385e96f9f6da847f4d571ce3cb15b5091140edf3db97276872647ce37efd7"},
+    {file = "audioop_lts-0.2.2-cp314-cp314t-musllinux_1_2_ppc64le.whl", hash = "sha256:48159d96962674eccdca9a3df280e864e8ac75e40a577cc97c5c42667ffabfc5"},
+    {file = "audioop_lts-0.2.2-cp314-cp314t-musllinux_1_2_riscv64.whl", hash = "sha256:8fefe5868cd082db1186f2837d64cfbfa78b548ea0d0543e9b28935ccce81ce9"},
+    {file = "audioop_lts-0.2.2-cp314-cp314t-musllinux_1_2_s390x.whl", hash = "sha256:58cf54380c3884fb49fdd37dfb7a772632b6701d28edd3e2904743c5e1773602"},
+    {file = "audioop_lts-0.2.2-cp314-cp314t-musllinux_1_2_x86_64.whl", hash = "sha256:088327f00488cdeed296edd9215ca159f3a5a5034741465789cad403fcf4bec0"},
+    {file = "audioop_lts-0.2.2-cp314-cp314t-win32.whl", hash = "sha256:068aa17a38b4e0e7de771c62c60bbca2455924b67a8814f3b0dee92b5820c0b3"},
+    {file = "audioop_lts-0.2.2-cp314-cp314t-win_amd64.whl", hash = "sha256:a5bf613e96f49712073de86f20dbdd4014ca18efd4d34ed18c75bd808337851b"},
+    {file = "audioop_lts-0.2.2-cp314-cp314t-win_arm64.whl", hash = "sha256:b492c3b040153e68b9fdaff5913305aaaba5bb433d8a7f73d5cf6a64ed3cc1dd"},
+    {file = "audioop_lts-0.2.2.tar.gz", hash = "sha256:64d0c62d88e67b98a1a5e71987b7aa7b5bcffc7dcee65b635823dbdd0a8dbbd0"},
+]
+
+[[package]]
+name = "authlib"
+version = "1.7.2"
+description = "The ultimate Python library in building OAuth and OpenID Connect servers and clients."
+optional = true
+python-versions = ">=3.10"
+groups = ["main"]
+markers = "extra == \"trackio\""
 files = [
-    {file = "attrs-24.3.0-py3-none-any.whl", hash = "sha256:ac96cd038792094f438ad1f6ff80837353805ac950cd2aa0e0625ef19850c308"},
-    {file = "attrs-24.3.0.tar.gz", hash = "sha256:8f5c07333d543103541ba7be0e2ce16eeee8130cb0b3f9238ab904ce1e85baff"},
+    {file = "authlib-1.7.2-py2.py3-none-any.whl", hash = "sha256:3e1faedc9d87e7d56a164eca3ccb6ace0d61b94abe83e92242f8dc8bba9b4a9f"},
+    {file = "authlib-1.7.2.tar.gz", hash = "sha256:2cea25fefcd4e7173bdf1372c0afc265c8034b23a8cd5dcb6a9164b826c64231"},
 ]
 
-[package.extras]
-benchmark = ["cloudpickle ; platform_python_implementation == \"CPython\"", "hypothesis", "mypy (>=1.11.1) ; platform_python_implementation == \"CPython\" and python_version >= \"3.10\"", "pympler", "pytest (>=4.3.0)", "pytest-codspeed", "pytest-mypy-plugins ; platform_python_implementation == \"CPython\" and python_version >= \"3.10\"", "pytest-xdist[psutil]"]
-cov = ["cloudpickle ; platform_python_implementation == \"CPython\"", "coverage[toml] (>=5.3)", "hypothesis", "mypy (>=1.11.1) ; platform_python_implementation == \"CPython\" and python_version >= \"3.10\"", "pympler", "pytest (>=4.3.0)", "pytest-mypy-plugins ; platform_python_implementation == \"CPython\" and python_version >= \"3.10\"", "pytest-xdist[psutil]"]
-dev = ["cloudpickle ; platform_python_implementation == \"CPython\"", "hypothesis", "mypy (>=1.11.1) ; platform_python_implementation == \"CPython\" and python_version >= \"3.10\"", "pre-commit-uv", "pympler", "pytest (>=4.3.0)", "pytest-mypy-plugins ; platform_python_implementation == \"CPython\" and python_version >= \"3.10\"", "pytest-xdist[psutil]"]
-docs = ["cogapp", "furo", "myst-parser", "sphinx", "sphinx-notfound-page", "sphinxcontrib-towncrier", "towncrier (<24.7)"]
-tests = ["cloudpickle ; platform_python_implementation == \"CPython\"", "hypothesis", "mypy (>=1.11.1) ; platform_python_implementation == \"CPython\" and python_version >= \"3.10\"", "pympler", "pytest (>=4.3.0)", "pytest-mypy-plugins ; platform_python_implementation == \"CPython\" and python_version >= \"3.10\"", "pytest-xdist[psutil]"]
-tests-mypy = ["mypy (>=1.11.1) ; platform_python_implementation == \"CPython\" and python_version >= \"3.10\"", "pytest-mypy-plugins ; platform_python_implementation == \"CPython\" and python_version >= \"3.10\""]
+[package.dependencies]
+cryptography = "*"
+joserfc = ">=1.6.0"
 
 [[package]]
 name = "av"
@@ -324,64 +433,6 @@ files = [
 docs = ["furo", "jaraco.packaging (>=9.3)", "rst.linker (>=1.9)", "sphinx (>=3.5)", "sphinx-lint"]
 testing = ["jaraco.test", "pytest (!=8.0.*)", "pytest (>=6,!=8.1.*)", "pytest-checkdocs (>=2.4)", "pytest-cov", "pytest-enabler (>=2.2)"]
 
-[[package]]
-name = "bcrypt"
-version = "4.2.1"
-description = "Modern password hashing for your software and your servers"
-optional = false
-python-versions = ">=3.7"
-groups = ["main"]
-files = [
-    {file = "bcrypt-4.2.1-cp37-abi3-macosx_10_12_universal2.whl", hash = "sha256:1340411a0894b7d3ef562fb233e4b6ed58add185228650942bdc885362f32c17"},
-    {file = "bcrypt-4.2.1-cp37-abi3-manylinux_2_17_aarch64.manylinux2014_aarch64.whl", hash = "sha256:b1ee315739bc8387aa36ff127afc99120ee452924e0df517a8f3e4c0187a0f5f"},
-    {file = "bcrypt-4.2.1-cp37-abi3-manylinux_2_17_x86_64.manylinux2014_x86_64.whl", hash = "sha256:8dbd0747208912b1e4ce730c6725cb56c07ac734b3629b60d4398f082ea718ad"},
-    {file = "bcrypt-4.2.1-cp37-abi3-manylinux_2_28_aarch64.whl", hash = "sha256:aaa2e285be097050dba798d537b6efd9b698aa88eef52ec98d23dcd6d7cf6fea"},
-    {file = "bcrypt-4.2.1-cp37-abi3-manylinux_2_28_x86_64.whl", hash = "sha256:76d3e352b32f4eeb34703370e370997065d28a561e4a18afe4fef07249cb4396"},
-    {file = "bcrypt-4.2.1-cp37-abi3-musllinux_1_1_aarch64.whl", hash = "sha256:b7703ede632dc945ed1172d6f24e9f30f27b1b1a067f32f68bf169c5f08d0425"},
-    {file = "bcrypt-4.2.1-cp37-abi3-musllinux_1_1_x86_64.whl", hash = "sha256:89df2aea2c43be1e1fa066df5f86c8ce822ab70a30e4c210968669565c0f4685"},
-    {file = "bcrypt-4.2.1-cp37-abi3-musllinux_1_2_aarch64.whl", hash = "sha256:04e56e3fe8308a88b77e0afd20bec516f74aecf391cdd6e374f15cbed32783d6"},
-    {file = "bcrypt-4.2.1-cp37-abi3-musllinux_1_2_x86_64.whl", hash = "sha256:cfdf3d7530c790432046c40cda41dfee8c83e29482e6a604f8930b9930e94139"},
-    {file = "bcrypt-4.2.1-cp37-abi3-win32.whl", hash = "sha256:adadd36274510a01f33e6dc08f5824b97c9580583bd4487c564fc4617b328005"},
-    {file = "bcrypt-4.2.1-cp37-abi3-win_amd64.whl", hash = "sha256:8c458cd103e6c5d1d85cf600e546a639f234964d0228909d8f8dbeebff82d526"},
-    {file = "bcrypt-4.2.1-cp39-abi3-macosx_10_12_universal2.whl", hash = "sha256:8ad2f4528cbf0febe80e5a3a57d7a74e6635e41af1ea5675282a33d769fba413"},
-    {file = "bcrypt-4.2.1-cp39-abi3-manylinux_2_17_aarch64.manylinux2014_aarch64.whl", hash = "sha256:909faa1027900f2252a9ca5dfebd25fc0ef1417943824783d1c8418dd7d6df4a"},
-    {file = "bcrypt-4.2.1-cp39-abi3-manylinux_2_17_x86_64.manylinux2014_x86_64.whl", hash = "sha256:cde78d385d5e93ece5479a0a87f73cd6fa26b171c786a884f955e165032b262c"},
-    {file = "bcrypt-4.2.1-cp39-abi3-manylinux_2_28_aarch64.whl", hash = "sha256:533e7f3bcf2f07caee7ad98124fab7499cb3333ba2274f7a36cf1daee7409d99"},
-    {file = "bcrypt-4.2.1-cp39-abi3-manylinux_2_28_x86_64.whl", hash = "sha256:687cf30e6681eeda39548a93ce9bfbb300e48b4d445a43db4298d2474d2a1e54"},
-    {file = "bcrypt-4.2.1-cp39-abi3-musllinux_1_1_aarch64.whl", hash = "sha256:041fa0155c9004eb98a232d54da05c0b41d4b8e66b6fc3cb71b4b3f6144ba837"},
-    {file = "bcrypt-4.2.1-cp39-abi3-musllinux_1_1_x86_64.whl", hash = "sha256:f85b1ffa09240c89aa2e1ae9f3b1c687104f7b2b9d2098da4e923f1b7082d331"},
-    {file = "bcrypt-4.2.1-cp39-abi3-musllinux_1_2_aarch64.whl", hash = "sha256:c6f5fa3775966cca251848d4d5393ab016b3afed251163c1436fefdec3b02c84"},
-    {file = "bcrypt-4.2.1-cp39-abi3-musllinux_1_2_x86_64.whl", hash = "sha256:807261df60a8b1ccd13e6599c779014a362ae4e795f5c59747f60208daddd96d"},
-    {file = "bcrypt-4.2.1-cp39-abi3-win32.whl", hash = "sha256:b588af02b89d9fad33e5f98f7838bf590d6d692df7153647724a7f20c186f6bf"},
-    {file = "bcrypt-4.2.1-cp39-abi3-win_amd64.whl", hash = "sha256:e84e0e6f8e40a242b11bce56c313edc2be121cec3e0ec2d76fce01f6af33c07c"},
-    {file = "bcrypt-4.2.1-pp310-pypy310_pp73-manylinux_2_28_aarch64.whl", hash = "sha256:76132c176a6d9953cdc83c296aeaed65e1a708485fd55abf163e0d9f8f16ce0e"},
-    {file = "bcrypt-4.2.1-pp310-pypy310_pp73-manylinux_2_28_x86_64.whl", hash = "sha256:e158009a54c4c8bc91d5e0da80920d048f918c61a581f0a63e4e93bb556d362f"},
-    {file = "bcrypt-4.2.1.tar.gz", hash = "sha256:6765386e3ab87f569b276988742039baab087b2cdb01e809d74e74503c2faafe"},
-]
-
-[package.extras]
-tests = ["pytest (>=3.2.1,!=3.3.0)"]
-typecheck = ["mypy"]
-
-[[package]]
-name = "beartype"
-version = "0.18.5"
-description = "Unbearably fast runtime type checking in pure Python."
-optional = false
-python-versions = ">=3.8.0"
-groups = ["main"]
-files = [
-    {file = "beartype-0.18.5-py3-none-any.whl", hash = "sha256:5301a14f2a9a5540fe47ec6d34d758e9cd8331d36c4760fc7a5499ab86310089"},
-    {file = "beartype-0.18.5.tar.gz", hash = "sha256:264ddc2f1da9ec94ff639141fbe33d22e12a9f75aa863b83b7046ffff1381927"},
-]
-
-[package.extras]
-all = ["typing-extensions (>=3.10.0.0)"]
-dev = ["autoapi (>=0.9.0)", "coverage (>=5.5)", "equinox", "mypy (>=0.800) ; platform_python_implementation != \"PyPy\"", "numpy ; sys_platform != \"darwin\" and platform_python_implementation != \"PyPy\"", "pandera", "pydata-sphinx-theme (<=0.7.2)", "pytest (>=4.0.0)", "sphinx (>=4.2.0,<6.0.0)", "sphinx ; python_version >= \"3.8.0\"", "sphinxext-opengraph (>=0.7.5)", "tox (>=3.20.1)", "typing-extensions (>=3.10.0.0)"]
-doc-rtd = ["autoapi (>=0.9.0)", "pydata-sphinx-theme (<=0.7.2)", "sphinx (>=4.2.0,<6.0.0)", "sphinxext-opengraph (>=0.7.5)"]
-test-tox = ["equinox", "mypy (>=0.800) ; platform_python_implementation != \"PyPy\"", "numpy ; sys_platform != \"darwin\" and platform_python_implementation != \"PyPy\"", "pandera", "pytest (>=4.0.0)", "sphinx ; python_version >= \"3.8.0\"", "typing-extensions (>=3.10.0.0)"]
-test-tox-coverage = ["coverage (>=5.5)"]
-
 [[package]]
 name = "beautifulsoup4"
 version = "4.12.3"
@@ -406,14 +457,15 @@ lxml = ["lxml"]
 
 [[package]]
 name = "bitsandbytes"
-version = "0.45.2"
+version = "0.45.5"
 description = "k-bit optimizers and matrix multiplication routines."
-optional = false
+optional = true
 python-versions = ">=3.8"
 groups = ["main"]
+markers = "extra == \"cuda\""
 files = [
-    {file = "bitsandbytes-0.45.2-py3-none-manylinux_2_24_x86_64.whl", hash = "sha256:ba3a720187f518b172ebce4081049c682ae3fd8284947e22499b256ff99a2bc3"},
-    {file = "bitsandbytes-0.45.2-py3-none-win_amd64.whl", hash = "sha256:e1893170455422924156760bae9e210ae26e8bd2ce7b21065d24b47decbe6963"},
+    {file = "bitsandbytes-0.45.5-py3-none-manylinux_2_24_x86_64.whl", hash = "sha256:a5453f30cc6aab6ccaac364e6bf51a7808d3da5f71763dffeb6d9694c59136e4"},
+    {file = "bitsandbytes-0.45.5-py3-none-win_amd64.whl", hash = "sha256:ed1c61b91d989d6a33fd05737d6edbf5086d8ebc89235ee632c7a19144085da2"},
 ]
 
 [package.dependencies]
@@ -422,343 +474,397 @@ torch = ">=2.0,<3"
 
 [package.extras]
 benchmark = ["matplotlib", "pandas"]
-dev = ["bitsandbytes[test]", "build (>=1.0.0,<2)", "pre-commit (>=3.5.0,<4)", "ruff (==0.6.9)", "wheel (>=0.42,<1)"]
+dev = ["bitsandbytes[test]", "build (>=1.0.0,<2)", "pre-commit (>=3.5.0,<4)", "ruff (==0.9.6)", "wheel (>=0.42,<1)"]
 docs = ["hf-doc-builder (==0.5.0)"]
 test = ["einops (>=0.8.0,<0.9.0)", "lion-pytorch (==0.2.3)", "pytest (>=8.3,<9.0)", "scipy (>=1.10.1,<2) ; python_version < \"3.9\"", "scipy (>=1.11.4,<2) ; python_version >= \"3.9\"", "transformers (>=4.30.1,<5)"]
 
 [[package]]
-name = "black"
-version = "24.10.0"
-description = "The uncompromising code formatter."
-optional = false
-python-versions = ">=3.9"
-groups = ["dev"]
-files = [
-    {file = "black-24.10.0-cp310-cp310-macosx_10_9_x86_64.whl", hash = "sha256:e6668650ea4b685440857138e5fe40cde4d652633b1bdffc62933d0db4ed9812"},
-    {file = "black-24.10.0-cp310-cp310-macosx_11_0_arm64.whl", hash = "sha256:1c536fcf674217e87b8cc3657b81809d3c085d7bf3ef262ead700da345bfa6ea"},
-    {file = "black-24.10.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.manylinux_2_28_x86_64.whl", hash = "sha256:649fff99a20bd06c6f727d2a27f401331dc0cc861fb69cde910fe95b01b5928f"},
-    {file = "black-24.10.0-cp310-cp310-win_amd64.whl", hash = "sha256:fe4d6476887de70546212c99ac9bd803d90b42fc4767f058a0baa895013fbb3e"},
-    {file = "black-24.10.0-cp311-cp311-macosx_10_9_x86_64.whl", hash = "sha256:5a2221696a8224e335c28816a9d331a6c2ae15a2ee34ec857dcf3e45dbfa99ad"},
-    {file = "black-24.10.0-cp311-cp311-macosx_11_0_arm64.whl", hash = "sha256:f9da3333530dbcecc1be13e69c250ed8dfa67f43c4005fb537bb426e19200d50"},
-    {file = "black-24.10.0-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.manylinux_2_28_x86_64.whl", hash = "sha256:4007b1393d902b48b36958a216c20c4482f601569d19ed1df294a496eb366392"},
-    {file = "black-24.10.0-cp311-cp311-win_amd64.whl", hash = "sha256:394d4ddc64782e51153eadcaaca95144ac4c35e27ef9b0a42e121ae7e57a9175"},
-    {file = "black-24.10.0-cp312-cp312-macosx_10_13_x86_64.whl", hash = "sha256:b5e39e0fae001df40f95bd8cc36b9165c5e2ea88900167bddf258bacef9bbdc3"},
-    {file = "black-24.10.0-cp312-cp312-macosx_11_0_arm64.whl", hash = "sha256:d37d422772111794b26757c5b55a3eade028aa3fde43121ab7b673d050949d65"},
-    {file = "black-24.10.0-cp312-cp312-manylinux_2_17_x86_64.manylinux2014_x86_64.manylinux_2_28_x86_64.whl", hash = "sha256:14b3502784f09ce2443830e3133dacf2c0110d45191ed470ecb04d0f5f6fcb0f"},
-    {file = "black-24.10.0-cp312-cp312-win_amd64.whl", hash = "sha256:30d2c30dc5139211dda799758559d1b049f7f14c580c409d6ad925b74a4208a8"},
-    {file = "black-24.10.0-cp313-cp313-macosx_10_13_x86_64.whl", hash = "sha256:1cbacacb19e922a1d75ef2b6ccaefcd6e93a2c05ede32f06a21386a04cedb981"},
-    {file = "black-24.10.0-cp313-cp313-macosx_11_0_arm64.whl", hash = "sha256:1f93102e0c5bb3907451063e08b9876dbeac810e7da5a8bfb7aeb5a9ef89066b"},
-    {file = "black-24.10.0-cp313-cp313-manylinux_2_17_x86_64.manylinux2014_x86_64.manylinux_2_28_x86_64.whl", hash = "sha256:ddacb691cdcdf77b96f549cf9591701d8db36b2f19519373d60d31746068dbf2"},
-    {file = "black-24.10.0-cp313-cp313-win_amd64.whl", hash = "sha256:680359d932801c76d2e9c9068d05c6b107f2584b2a5b88831c83962eb9984c1b"},
-    {file = "black-24.10.0-cp39-cp39-macosx_10_9_x86_64.whl", hash = "sha256:17374989640fbca88b6a448129cd1745c5eb8d9547b464f281b251dd00155ccd"},
-    {file = "black-24.10.0-cp39-cp39-macosx_11_0_arm64.whl", hash = "sha256:63f626344343083322233f175aaf372d326de8436f5928c042639a4afbbf1d3f"},
-    {file = "black-24.10.0-cp39-cp39-manylinux_2_17_x86_64.manylinux2014_x86_64.manylinux_2_28_x86_64.whl", hash = "sha256:ccfa1d0cb6200857f1923b602f978386a3a2758a65b52e0950299ea014be6800"},
-    {file = "black-24.10.0-cp39-cp39-win_amd64.whl", hash = "sha256:2cd9c95431d94adc56600710f8813ee27eea544dd118d45896bb734e9d7a0dc7"},
-    {file = "black-24.10.0-py3-none-any.whl", hash = "sha256:3bb2b7a1f7b685f85b11fed1ef10f8a9148bceb49853e47a294a3dd963c1dd7d"},
-    {file = "black-24.10.0.tar.gz", hash = "sha256:846ea64c97afe3bc677b761787993be4991810ecc7a4a937816dd6bddedc4875"},
-]
-
-[package.dependencies]
-click = ">=8.0.0"
-mypy-extensions = ">=0.4.3"
-packaging = ">=22.0"
-pathspec = ">=0.9.0"
-platformdirs = ">=2"
-tomli = {version = ">=1.1.0", markers = "python_version < \"3.11\""}
-typing-extensions = {version = ">=4.0.1", markers = "python_version < \"3.11\""}
-
-[package.extras]
-colorama = ["colorama (>=0.4.3)"]
-d = ["aiohttp (>=3.10)"]
-jupyter = ["ipython (>=7.8.0)", "tokenize-rt (>=3.2.0)"]
-uvloop = ["uvloop (>=0.15.2)"]
-
-[[package]]
-name = "boto3"
-version = "1.36.5"
-description = "The AWS SDK for Python"
-optional = false
-python-versions = ">=3.8"
-groups = ["main"]
-files = [
-    {file = "boto3-1.36.5-py3-none-any.whl", hash = "sha256:a404ad5ec94ff40c176215a991bf62f0db5514a93a3dd361b7b2ab9660f811f4"},
-    {file = "boto3-1.36.5.tar.gz", hash = "sha256:58a6b7c3d5145b3ac04d4b6caa76223b8ef88004b4237444e553041e29581a11"},
-]
-
-[package.dependencies]
-botocore = ">=1.36.5,<1.37.0"
-jmespath = ">=0.7.1,<2.0.0"
-s3transfer = ">=0.11.0,<0.12.0"
-
-[package.extras]
-crt = ["botocore[crt] (>=1.21.0,<2.0a0)"]
-
-[[package]]
-name = "botocore"
-version = "1.36.5"
-description = "Low-level, data-driven core of boto 3."
-optional = false
-python-versions = ">=3.8"
-groups = ["main"]
-files = [
-    {file = "botocore-1.36.5-py3-none-any.whl", hash = "sha256:6d9f70afa9bf9d21407089dc22b8cc8ec6fa44866d4660858c062c74fc8555eb"},
-    {file = "botocore-1.36.5.tar.gz", hash = "sha256:234ed3d29a8954c37a551c933453bf14c6ae44a69a4f273ffef377a2612ca6a6"},
-]
-
-[package.dependencies]
-jmespath = ">=0.7.1,<2.0.0"
-python-dateutil = ">=2.1,<3.0.0"
-urllib3 = {version = ">=1.25.4,<2.2.0 || >2.2.0,<3", markers = "python_version >= \"3.10\""}
-
-[package.extras]
-crt = ["awscrt (==0.23.4)"]
-
-[[package]]
-name = "braceexpand"
-version = "0.1.7"
-description = "Bash-style brace expansion for Python"
-optional = false
+name = "brotli"
+version = "1.2.0"
+description = "Python bindings for the Brotli compression library"
+optional = true
 python-versions = "*"
 groups = ["main"]
-files = [
-    {file = "braceexpand-0.1.7-py2.py3-none-any.whl", hash = "sha256:91332d53de7828103dcae5773fb43bc34950b0c8160e35e0f44c4427a3b85014"},
-    {file = "braceexpand-0.1.7.tar.gz", hash = "sha256:e6e539bd20eaea53547472ff94f4fb5c3d3bf9d0a89388c4b56663aba765f705"},
+markers = "extra == \"trackio\""
+files = [
+    {file = "brotli-1.2.0-cp27-cp27m-macosx_10_9_x86_64.whl", hash = "sha256:99cfa69813d79492f0e5d52a20fd18395bc82e671d5d40bd5a91d13e75e468e8"},
+    {file = "brotli-1.2.0-cp27-cp27m-manylinux1_i686.whl", hash = "sha256:3ebe801e0f4e56d17cd386ca6600573e3706ce1845376307f5d2cbd32149b69a"},
+    {file = "brotli-1.2.0-cp27-cp27m-manylinux1_x86_64.whl", hash = "sha256:a387225a67f619bf16bd504c37655930f910eb03675730fc2ad69d3d8b5e7e92"},
+    {file = "brotli-1.2.0-cp27-cp27m-win32.whl", hash = "sha256:b908d1a7b28bc72dfb743be0d4d3f8931f8309f810af66c906ae6cd4127c93cb"},
+    {file = "brotli-1.2.0-cp27-cp27m-win_amd64.whl", hash = "sha256:d206a36b4140fbb5373bf1eb73fb9de589bb06afd0d22376de23c5e91d0ab35f"},
+    {file = "brotli-1.2.0-cp27-cp27mu-manylinux1_i686.whl", hash = "sha256:7e9053f5fb4e0dfab89243079b3e217f2aea4085e4d58c5c06115fc34823707f"},
+    {file = "brotli-1.2.0-cp27-cp27mu-manylinux1_x86_64.whl", hash = "sha256:4735a10f738cb5516905a121f32b24ce196ab82cfc1e4ba2e3ad1b371085fd46"},
+    {file = "brotli-1.2.0-cp310-cp310-macosx_10_9_universal2.whl", hash = "sha256:3b90b767916ac44e93a8e28ce6adf8d551e43affb512f2377c732d486ac6514e"},
+    {file = "brotli-1.2.0-cp310-cp310-macosx_10_9_x86_64.whl", hash = "sha256:6be67c19e0b0c56365c6a76e393b932fb0e78b3b56b711d180dd7013cb1fd984"},
+    {file = "brotli-1.2.0-cp310-cp310-manylinux2014_aarch64.manylinux_2_17_aarch64.manylinux_2_28_aarch64.whl", hash = "sha256:0bbd5b5ccd157ae7913750476d48099aaf507a79841c0d04a9db4415b14842de"},
+    {file = "brotli-1.2.0-cp310-cp310-manylinux2014_ppc64le.manylinux_2_17_ppc64le.manylinux_2_28_ppc64le.whl", hash = "sha256:3f3c908bcc404c90c77d5a073e55271a0a498f4e0756e48127c35d91cf155947"},
+    {file = "brotli-1.2.0-cp310-cp310-manylinux2014_x86_64.manylinux_2_17_x86_64.whl", hash = "sha256:1b557b29782a643420e08d75aea889462a4a8796e9a6cf5621ab05a3f7da8ef2"},
+    {file = "brotli-1.2.0-cp310-cp310-musllinux_1_2_aarch64.whl", hash = "sha256:81da1b229b1889f25adadc929aeb9dbc4e922bd18561b65b08dd9343cfccca84"},
+    {file = "brotli-1.2.0-cp310-cp310-musllinux_1_2_ppc64le.whl", hash = "sha256:ff09cd8c5eec3b9d02d2408db41be150d8891c5566addce57513bf546e3d6c6d"},
+    {file = "brotli-1.2.0-cp310-cp310-musllinux_1_2_x86_64.whl", hash = "sha256:a1778532b978d2536e79c05dac2d8cd857f6c55cd0c95ace5b03740824e0e2f1"},
+    {file = "brotli-1.2.0-cp310-cp310-win32.whl", hash = "sha256:b232029d100d393ae3c603c8ffd7e3fe6f798c5e28ddca5feabb8e8fdb732997"},
+    {file = "brotli-1.2.0-cp310-cp310-win_amd64.whl", hash = "sha256:ef87b8ab2704da227e83a246356a2b179ef826f550f794b2c52cddb4efbd0196"},
+    {file = "brotli-1.2.0-cp311-cp311-macosx_10_9_universal2.whl", hash = "sha256:15b33fe93cedc4caaff8a0bd1eb7e3dab1c61bb22a0bf5bdfdfd97cd7da79744"},
+    {file = "brotli-1.2.0-cp311-cp311-macosx_10_9_x86_64.whl", hash = "sha256:898be2be399c221d2671d29eed26b6b2713a02c2119168ed914e7d00ceadb56f"},
+    {file = "brotli-1.2.0-cp311-cp311-manylinux2014_aarch64.manylinux_2_17_aarch64.manylinux_2_28_aarch64.whl", hash = "sha256:350c8348f0e76fff0a0fd6c26755d2653863279d086d3aa2c290a6a7251135dd"},
+    {file = "brotli-1.2.0-cp311-cp311-manylinux2014_ppc64le.manylinux_2_17_ppc64le.manylinux_2_28_ppc64le.whl", hash = "sha256:2e1ad3fda65ae0d93fec742a128d72e145c9c7a99ee2fcd667785d99eb25a7fe"},
+    {file = "brotli-1.2.0-cp311-cp311-manylinux2014_x86_64.manylinux_2_17_x86_64.whl", hash = "sha256:40d918bce2b427a0c4ba189df7a006ac0c7277c180aee4617d99e9ccaaf59e6a"},
+    {file = "brotli-1.2.0-cp311-cp311-musllinux_1_2_aarch64.whl", hash = "sha256:2a7f1d03727130fc875448b65b127a9ec5d06d19d0148e7554384229706f9d1b"},
+    {file = "brotli-1.2.0-cp311-cp311-musllinux_1_2_ppc64le.whl", hash = "sha256:9c79f57faa25d97900bfb119480806d783fba83cd09ee0b33c17623935b05fa3"},
+    {file = "brotli-1.2.0-cp311-cp311-musllinux_1_2_x86_64.whl", hash = "sha256:844a8ceb8483fefafc412f85c14f2aae2fb69567bf2a0de53cdb88b73e7c43ae"},
+    {file = "brotli-1.2.0-cp311-cp311-win32.whl", hash = "sha256:aa47441fa3026543513139cb8926a92a8e305ee9c71a6209ef7a97d91640ea03"},
+    {file = "brotli-1.2.0-cp311-cp311-win_amd64.whl", hash = "sha256:022426c9e99fd65d9475dce5c195526f04bb8be8907607e27e747893f6ee3e24"},
+    {file = "brotli-1.2.0-cp312-cp312-macosx_10_13_universal2.whl", hash = "sha256:35d382625778834a7f3061b15423919aa03e4f5da34ac8e02c074e4b75ab4f84"},
+    {file = "brotli-1.2.0-cp312-cp312-macosx_10_13_x86_64.whl", hash = "sha256:7a61c06b334bd99bc5ae84f1eeb36bfe01400264b3c352f968c6e30a10f9d08b"},
+    {file = "brotli-1.2.0-cp312-cp312-manylinux2014_aarch64.manylinux_2_17_aarch64.manylinux_2_28_aarch64.whl", hash = "sha256:acec55bb7c90f1dfc476126f9711a8e81c9af7fb617409a9ee2953115343f08d"},
+    {file = "brotli-1.2.0-cp312-cp312-manylinux2014_ppc64le.manylinux_2_17_ppc64le.manylinux_2_28_ppc64le.whl", hash = "sha256:260d3692396e1895c5034f204f0db022c056f9e2ac841593a4cf9426e2a3faca"},
+    {file = "brotli-1.2.0-cp312-cp312-manylinux2014_x86_64.manylinux_2_17_x86_64.whl", hash = "sha256:072e7624b1fc4d601036ab3f4f27942ef772887e876beff0301d261210bca97f"},
+    {file = "brotli-1.2.0-cp312-cp312-musllinux_1_2_aarch64.whl", hash = "sha256:adedc4a67e15327dfdd04884873c6d5a01d3e3b6f61406f99b1ed4865a2f6d28"},
+    {file = "brotli-1.2.0-cp312-cp312-musllinux_1_2_ppc64le.whl", hash = "sha256:7a47ce5c2288702e09dc22a44d0ee6152f2c7eda97b3c8482d826a1f3cfc7da7"},
+    {file = "brotli-1.2.0-cp312-cp312-musllinux_1_2_x86_64.whl", hash = "sha256:af43b8711a8264bb4e7d6d9a6d004c3a2019c04c01127a868709ec29962b6036"},
+    {file = "brotli-1.2.0-cp312-cp312-win32.whl", hash = "sha256:e99befa0b48f3cd293dafeacdd0d191804d105d279e0b387a32054c1180f3161"},
+    {file = "brotli-1.2.0-cp312-cp312-win_amd64.whl", hash = "sha256:b35c13ce241abdd44cb8ca70683f20c0c079728a36a996297adb5334adfc1c44"},
+    {file = "brotli-1.2.0-cp313-cp313-macosx_10_13_universal2.whl", hash = "sha256:9e5825ba2c9998375530504578fd4d5d1059d09621a02065d1b6bfc41a8e05ab"},
+    {file = "brotli-1.2.0-cp313-cp313-macosx_10_13_x86_64.whl", hash = "sha256:0cf8c3b8ba93d496b2fae778039e2f5ecc7cff99df84df337ca31d8f2252896c"},
+    {file = "brotli-1.2.0-cp313-cp313-manylinux2014_aarch64.manylinux_2_17_aarch64.manylinux_2_28_aarch64.whl", hash = "sha256:c8565e3cdc1808b1a34714b553b262c5de5fbda202285782173ec137fd13709f"},
+    {file = "brotli-1.2.0-cp313-cp313-manylinux2014_ppc64le.manylinux_2_17_ppc64le.manylinux_2_28_ppc64le.whl", hash = "sha256:26e8d3ecb0ee458a9804f47f21b74845cc823fd1bb19f02272be70774f56e2a6"},
+    {file = "brotli-1.2.0-cp313-cp313-manylinux2014_x86_64.manylinux_2_17_x86_64.whl", hash = "sha256:67a91c5187e1eec76a61625c77a6c8c785650f5b576ca732bd33ef58b0dff49c"},
+    {file = "brotli-1.2.0-cp313-cp313-musllinux_1_2_aarch64.whl", hash = "sha256:4ecdb3b6dc36e6d6e14d3a1bdc6c1057c8cbf80db04031d566eb6080ce283a48"},
+    {file = "brotli-1.2.0-cp313-cp313-musllinux_1_2_ppc64le.whl", hash = "sha256:3e1b35d56856f3ed326b140d3c6d9db91740f22e14b06e840fe4bb1923439a18"},
+    {file = "brotli-1.2.0-cp313-cp313-musllinux_1_2_x86_64.whl", hash = "sha256:54a50a9dad16b32136b2241ddea9e4df159b41247b2ce6aac0b3276a66a8f1e5"},
+    {file = "brotli-1.2.0-cp313-cp313-win32.whl", hash = "sha256:1b1d6a4efedd53671c793be6dd760fcf2107da3a52331ad9ea429edf0902f27a"},
+    {file = "brotli-1.2.0-cp313-cp313-win_amd64.whl", hash = "sha256:b63daa43d82f0cdabf98dee215b375b4058cce72871fd07934f179885aad16e8"},
+    {file = "brotli-1.2.0-cp314-cp314-macosx_10_15_universal2.whl", hash = "sha256:6c12dad5cd04530323e723787ff762bac749a7b256a5bece32b2243dd5c27b21"},
+    {file = "brotli-1.2.0-cp314-cp314-macosx_10_15_x86_64.whl", hash = "sha256:3219bd9e69868e57183316ee19c84e03e8f8b5a1d1f2667e1aa8c2f91cb061ac"},
+    {file = "brotli-1.2.0-cp314-cp314-manylinux2014_aarch64.manylinux_2_17_aarch64.manylinux_2_28_aarch64.whl", hash = "sha256:963a08f3bebd8b75ac57661045402da15991468a621f014be54e50f53a58d19e"},
+    {file = "brotli-1.2.0-cp314-cp314-manylinux2014_ppc64le.manylinux_2_17_ppc64le.manylinux_2_28_ppc64le.whl", hash = "sha256:9322b9f8656782414b37e6af884146869d46ab85158201d82bab9abbcb971dc7"},
+    {file = "brotli-1.2.0-cp314-cp314-manylinux2014_x86_64.manylinux_2_17_x86_64.whl", hash = "sha256:cf9cba6f5b78a2071ec6fb1e7bd39acf35071d90a81231d67e92d637776a6a63"},
+    {file = "brotli-1.2.0-cp314-cp314-musllinux_1_2_aarch64.whl", hash = "sha256:7547369c4392b47d30a3467fe8c3330b4f2e0f7730e45e3103d7d636678a808b"},
+    {file = "brotli-1.2.0-cp314-cp314-musllinux_1_2_ppc64le.whl", hash = "sha256:fc1530af5c3c275b8524f2e24841cbe2599d74462455e9bae5109e9ff42e9361"},
+    {file = "brotli-1.2.0-cp314-cp314-musllinux_1_2_x86_64.whl", hash = "sha256:d2d085ded05278d1c7f65560aae97b3160aeb2ea2c0b3e26204856beccb60888"},
+    {file = "brotli-1.2.0-cp314-cp314-win32.whl", hash = "sha256:832c115a020e463c2f67664560449a7bea26b0c1fdd690352addad6d0a08714d"},
+    {file = "brotli-1.2.0-cp314-cp314-win_amd64.whl", hash = "sha256:e7c0af964e0b4e3412a0ebf341ea26ec767fa0b4cf81abb5e897c9338b5ad6a3"},
+    {file = "brotli-1.2.0-cp36-cp36m-macosx_10_9_x86_64.whl", hash = "sha256:82676c2781ecf0ab23833796062786db04648b7aae8be139f6b8065e5e7b1518"},
+    {file = "brotli-1.2.0-cp36-cp36m-manylinux_2_17_aarch64.manylinux2014_aarch64.whl", hash = "sha256:c16ab1ef7bb55651f5836e8e62db1f711d55b82ea08c3b8083ff037157171a69"},
+    {file = "brotli-1.2.0-cp36-cp36m-manylinux_2_17_ppc64le.manylinux2014_ppc64le.whl", hash = "sha256:e85190da223337a6b7431d92c799fca3e2982abd44e7b8dec69938dcc81c8e9e"},
+    {file = "brotli-1.2.0-cp36-cp36m-manylinux_2_5_i686.manylinux1_i686.manylinux_2_12_i686.manylinux2010_i686.whl", hash = "sha256:d8c05b1dfb61af28ef37624385b0029df902ca896a639881f594060b30ffc9a7"},
+    {file = "brotli-1.2.0-cp36-cp36m-manylinux_2_5_x86_64.manylinux1_x86_64.manylinux_2_12_x86_64.manylinux2010_x86_64.whl", hash = "sha256:465a0d012b3d3e4f1d6146ea019b5c11e3e87f03d1676da1cc3833462e672fb0"},
+    {file = "brotli-1.2.0-cp36-cp36m-musllinux_1_2_aarch64.whl", hash = "sha256:96fbe82a58cdb2f872fa5d87dedc8477a12993626c446de794ea025bbda625ea"},
+    {file = "brotli-1.2.0-cp36-cp36m-musllinux_1_2_i686.whl", hash = "sha256:1b71754d5b6eda54d16fbbed7fce2d8bc6c052a1b91a35c320247946ee103502"},
+    {file = "brotli-1.2.0-cp36-cp36m-musllinux_1_2_ppc64le.whl", hash = "sha256:66c02c187ad250513c2f4fce973ef402d22f80e0adce734ee4e4efd657b6cb64"},
+    {file = "brotli-1.2.0-cp36-cp36m-musllinux_1_2_x86_64.whl", hash = "sha256:ba76177fd318ab7b3b9bf6522be5e84c2ae798754b6cc028665490f6e66b5533"},
+    {file = "brotli-1.2.0-cp36-cp36m-win32.whl", hash = "sha256:c1702888c9f3383cc2f09eb3e88b8babf5965a54afb79649458ec7c3c7a63e96"},
+    {file = "brotli-1.2.0-cp36-cp36m-win_amd64.whl", hash = "sha256:f8d635cafbbb0c61327f942df2e3f474dde1cff16c3cd0580564774eaba1ee13"},
+    {file = "brotli-1.2.0-cp37-cp37m-macosx_10_9_x86_64.whl", hash = "sha256:e80a28f2b150774844c8b454dd288be90d76ba6109670fe33d7ff54d96eb5cb8"},
+    {file = "brotli-1.2.0-cp37-cp37m-manylinux_2_17_aarch64.manylinux2014_aarch64.whl", hash = "sha256:50b1b799f45da91292ffaa21a473ab3a3054fa78560e8ff67082a185274431c8"},
+    {file = "brotli-1.2.0-cp37-cp37m-manylinux_2_17_ppc64le.manylinux2014_ppc64le.whl", hash = "sha256:29b7e6716ee4ea0c59e3b241f682204105f7da084d6254ec61886508efeb43bc"},
+    {file = "brotli-1.2.0-cp37-cp37m-manylinux_2_5_i686.manylinux1_i686.manylinux_2_12_i686.manylinux2010_i686.whl", hash = "sha256:640fe199048f24c474ec6f3eae67c48d286de12911110437a36a87d7c89573a6"},
+    {file = "brotli-1.2.0-cp37-cp37m-manylinux_2_5_x86_64.manylinux1_x86_64.manylinux_2_12_x86_64.manylinux2010_x86_64.whl", hash = "sha256:92edab1e2fd6cd5ca605f57d4545b6599ced5dea0fd90b2bcdf8b247a12bd190"},
+    {file = "brotli-1.2.0-cp37-cp37m-musllinux_1_2_aarch64.whl", hash = "sha256:7274942e69b17f9cef76691bcf38f2b2d4c8a5f5dba6ec10958363dcb3308a0a"},
+    {file = "brotli-1.2.0-cp37-cp37m-musllinux_1_2_i686.whl", hash = "sha256:a56ef534b66a749759ebd091c19c03ef81eb8cd96f0d1d16b59127eaf1b97a12"},
+    {file = "brotli-1.2.0-cp37-cp37m-musllinux_1_2_ppc64le.whl", hash = "sha256:5732eff8973dd995549a18ecbd8acd692ac611c5c0bb3f59fa3541ae27b33be3"},
+    {file = "brotli-1.2.0-cp37-cp37m-musllinux_1_2_x86_64.whl", hash = "sha256:598e88c736f63a0efec8363f9eb34e5b5536b7b6b1821e401afcb501d881f59a"},
+    {file = "brotli-1.2.0-cp37-cp37m-win32.whl", hash = "sha256:7ad8cec81f34edf44a1c6a7edf28e7b7806dfb8886e371d95dcf789ccd4e4982"},
+    {file = "brotli-1.2.0-cp37-cp37m-win_amd64.whl", hash = "sha256:865cedc7c7c303df5fad14a57bc5db1d4f4f9b2b4d0a7523ddd206f00c121a16"},
+    {file = "brotli-1.2.0-cp38-cp38-macosx_10_9_universal2.whl", hash = "sha256:ac27a70bda257ae3f380ec8310b0a06680236bea547756c277b5dfe55a2452a8"},
+    {file = "brotli-1.2.0-cp38-cp38-macosx_10_9_x86_64.whl", hash = "sha256:e813da3d2d865e9793ef681d3a6b66fa4b7c19244a45b817d0cceda67e615990"},
+    {file = "brotli-1.2.0-cp38-cp38-manylinux2014_aarch64.manylinux_2_17_aarch64.manylinux_2_28_aarch64.whl", hash = "sha256:9fe11467c42c133f38d42289d0861b6b4f9da31e8087ca2c0d7ebb4543625526"},
+    {file = "brotli-1.2.0-cp38-cp38-manylinux2014_ppc64le.manylinux_2_17_ppc64le.manylinux_2_28_ppc64le.whl", hash = "sha256:c0d6770111d1879881432f81c369de5cde6e9467be7c682a983747ec800544e2"},
+    {file = "brotli-1.2.0-cp38-cp38-manylinux2014_x86_64.manylinux_2_17_x86_64.whl", hash = "sha256:eda5a6d042c698e28bda2507a89b16555b9aa954ef1d750e1c20473481aff675"},
+    {file = "brotli-1.2.0-cp38-cp38-musllinux_1_2_aarch64.whl", hash = "sha256:3173e1e57cebb6d1de186e46b5680afbd82fd4301d7b2465beebe83ed317066d"},
+    {file = "brotli-1.2.0-cp38-cp38-musllinux_1_2_ppc64le.whl", hash = "sha256:71a66c1c9be66595d628467401d5976158c97888c2c9379c034e1e2312c5b4f5"},
+    {file = "brotli-1.2.0-cp38-cp38-musllinux_1_2_x86_64.whl", hash = "sha256:1e68cdf321ad05797ee41d1d09169e09d40fdf51a725bb148bff892ce04583d7"},
+    {file = "brotli-1.2.0-cp38-cp38-win32.whl", hash = "sha256:f16dace5e4d3596eaeb8af334b4d2c820d34b8278da633ce4a00020b2eac981c"},
+    {file = "brotli-1.2.0-cp38-cp38-win_amd64.whl", hash = "sha256:14ef29fc5f310d34fc7696426071067462c9292ed98b5ff5a27ac70a200e5470"},
+    {file = "brotli-1.2.0-cp39-cp39-macosx_10_9_universal2.whl", hash = "sha256:8d4f47f284bdd28629481c97b5f29ad67544fa258d9091a6ed1fda47c7347cd1"},
+    {file = "brotli-1.2.0-cp39-cp39-macosx_10_9_x86_64.whl", hash = "sha256:2881416badd2a88a7a14d981c103a52a23a276a553a8aacc1346c2ff47c8dc17"},
+    {file = "brotli-1.2.0-cp39-cp39-manylinux2014_aarch64.manylinux_2_17_aarch64.manylinux_2_28_aarch64.whl", hash = "sha256:2d39b54b968f4b49b5e845758e202b1035f948b0561ff5e6385e855c96625971"},
+    {file = "brotli-1.2.0-cp39-cp39-manylinux2014_ppc64le.manylinux_2_17_ppc64le.manylinux_2_28_ppc64le.whl", hash = "sha256:95db242754c21a88a79e01504912e537808504465974ebb92931cfca2510469e"},
+    {file = "brotli-1.2.0-cp39-cp39-manylinux2014_x86_64.manylinux_2_17_x86_64.whl", hash = "sha256:bba6e7e6cfe1e6cb6eb0b7c2736a6059461de1fa2c0ad26cf845de6c078d16c8"},
+    {file = "brotli-1.2.0-cp39-cp39-musllinux_1_2_aarch64.whl", hash = "sha256:88ef7d55b7bcf3331572634c3fd0ed327d237ceb9be6066810d39020a3ebac7a"},
+    {file = "brotli-1.2.0-cp39-cp39-musllinux_1_2_ppc64le.whl", hash = "sha256:7fa18d65a213abcfbb2f6cafbb4c58863a8bd6f2103d65203c520ac117d1944b"},
+    {file = "brotli-1.2.0-cp39-cp39-musllinux_1_2_x86_64.whl", hash = "sha256:09ac247501d1909e9ee47d309be760c89c990defbb2e0240845c892ea5ff0de4"},
+    {file = "brotli-1.2.0-cp39-cp39-win32.whl", hash = "sha256:c25332657dee6052ca470626f18349fc1fe8855a56218e19bd7a8c6ad4952c49"},
+    {file = "brotli-1.2.0-cp39-cp39-win_amd64.whl", hash = "sha256:1ce223652fd4ed3eb2b7f78fbea31c52314baecfac68db44037bb4167062a937"},
+    {file = "brotli-1.2.0.tar.gz", hash = "sha256:e310f77e41941c13340a95976fe66a8a95b01e783d430eeaf7a2f87e0a57dd0a"},
 ]
 
 [[package]]
 name = "certifi"
-version = "2024.12.14"
+version = "2026.6.17"
 description = "Python package for providing Mozilla's CA Bundle."
 optional = false
-python-versions = ">=3.6"
+python-versions = ">=3.7"
 groups = ["main"]
 files = [
-    {file = "certifi-2024.12.14-py3-none-any.whl", hash = "sha256:1275f7a45be9464efc1173084eaa30f866fe2e47d389406136d332ed4967ec56"},
-    {file = "certifi-2024.12.14.tar.gz", hash = "sha256:b650d30f370c2b724812bee08008be0c4163b163ddaec3f2546c1caf65f191db"},
+    {file = "certifi-2026.6.17-py3-none-any.whl", hash = "sha256:2227dcbaafe0d2f59279d1762ddddc37783ed4354594f194ffc31d20f41fc3db"},
+    {file = "certifi-2026.6.17.tar.gz", hash = "sha256:024c88eeec92ca068db80f02b8b07c9cef7b9fe261d1d535abfd5abd6f6af432"},
 ]
 
 [[package]]
 name = "cffi"
-version = "1.17.1"
+version = "2.0.0"
 description = "Foreign Function Interface for Python calling C code."
 optional = false
-python-versions = ">=3.8"
+python-versions = ">=3.9"
 groups = ["main"]
-files = [
-    {file = "cffi-1.17.1-cp310-cp310-macosx_10_9_x86_64.whl", hash = "sha256:df8b1c11f177bc2313ec4b2d46baec87a5f3e71fc8b45dab2ee7cae86d9aba14"},
-    {file = "cffi-1.17.1-cp310-cp310-macosx_11_0_arm64.whl", hash = "sha256:8f2cdc858323644ab277e9bb925ad72ae0e67f69e804f4898c070998d50b1a67"},
-    {file = "cffi-1.17.1-cp310-cp310-manylinux_2_12_i686.manylinux2010_i686.manylinux_2_17_i686.manylinux2014_i686.whl", hash = "sha256:edae79245293e15384b51f88b00613ba9f7198016a5948b5dddf4917d4d26382"},
-    {file = "cffi-1.17.1-cp310-cp310-manylinux_2_17_aarch64.manylinux2014_aarch64.whl", hash = "sha256:45398b671ac6d70e67da8e4224a065cec6a93541bb7aebe1b198a61b58c7b702"},
-    {file = "cffi-1.17.1-cp310-cp310-manylinux_2_17_ppc64le.manylinux2014_ppc64le.whl", hash = "sha256:ad9413ccdeda48c5afdae7e4fa2192157e991ff761e7ab8fdd8926f40b160cc3"},
-    {file = "cffi-1.17.1-cp310-cp310-manylinux_2_17_s390x.manylinux2014_s390x.whl", hash = "sha256:5da5719280082ac6bd9aa7becb3938dc9f9cbd57fac7d2871717b1feb0902ab6"},
-    {file = "cffi-1.17.1-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl", hash = "sha256:2bb1a08b8008b281856e5971307cc386a8e9c5b625ac297e853d36da6efe9c17"},
-    {file = "cffi-1.17.1-cp310-cp310-musllinux_1_1_aarch64.whl", hash = "sha256:045d61c734659cc045141be4bae381a41d89b741f795af1dd018bfb532fd0df8"},
-    {file = "cffi-1.17.1-cp310-cp310-musllinux_1_1_i686.whl", hash = "sha256:6883e737d7d9e4899a8a695e00ec36bd4e5e4f18fabe0aca0efe0a4b44cdb13e"},
-    {file = "cffi-1.17.1-cp310-cp310-musllinux_1_1_x86_64.whl", hash = "sha256:6b8b4a92e1c65048ff98cfe1f735ef8f1ceb72e3d5f0c25fdb12087a23da22be"},
-    {file = "cffi-1.17.1-cp310-cp310-win32.whl", hash = "sha256:c9c3d058ebabb74db66e431095118094d06abf53284d9c81f27300d0e0d8bc7c"},
-    {file = "cffi-1.17.1-cp310-cp310-win_amd64.whl", hash = "sha256:0f048dcf80db46f0098ccac01132761580d28e28bc0f78ae0d58048063317e15"},
-    {file = "cffi-1.17.1-cp311-cp311-macosx_10_9_x86_64.whl", hash = "sha256:a45e3c6913c5b87b3ff120dcdc03f6131fa0065027d0ed7ee6190736a74cd401"},
-    {file = "cffi-1.17.1-cp311-cp311-macosx_11_0_arm64.whl", hash = "sha256:30c5e0cb5ae493c04c8b42916e52ca38079f1b235c2f8ae5f4527b963c401caf"},
-    {file = "cffi-1.17.1-cp311-cp311-manylinux_2_12_i686.manylinux2010_i686.manylinux_2_17_i686.manylinux2014_i686.whl", hash = "sha256:f75c7ab1f9e4aca5414ed4d8e5c0e303a34f4421f8a0d47a4d019ceff0ab6af4"},
-    {file = "cffi-1.17.1-cp311-cp311-manylinux_2_17_aarch64.manylinux2014_aarch64.whl", hash = "sha256:a1ed2dd2972641495a3ec98445e09766f077aee98a1c896dcb4ad0d303628e41"},
-    {file = "cffi-1.17.1-cp311-cp311-manylinux_2_17_ppc64le.manylinux2014_ppc64le.whl", hash = "sha256:46bf43160c1a35f7ec506d254e5c890f3c03648a4dbac12d624e4490a7046cd1"},
-    {file = "cffi-1.17.1-cp311-cp311-manylinux_2_17_s390x.manylinux2014_s390x.whl", hash = "sha256:a24ed04c8ffd54b0729c07cee15a81d964e6fee0e3d4d342a27b020d22959dc6"},
-    {file = "cffi-1.17.1-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl", hash = "sha256:610faea79c43e44c71e1ec53a554553fa22321b65fae24889706c0a84d4ad86d"},
-    {file = "cffi-1.17.1-cp311-cp311-musllinux_1_1_aarch64.whl", hash = "sha256:a9b15d491f3ad5d692e11f6b71f7857e7835eb677955c00cc0aefcd0669adaf6"},
-    {file = "cffi-1.17.1-cp311-cp311-musllinux_1_1_i686.whl", hash = "sha256:de2ea4b5833625383e464549fec1bc395c1bdeeb5f25c4a3a82b5a8c756ec22f"},
-    {file = "cffi-1.17.1-cp311-cp311-musllinux_1_1_x86_64.whl", hash = "sha256:fc48c783f9c87e60831201f2cce7f3b2e4846bf4d8728eabe54d60700b318a0b"},
-    {file = "cffi-1.17.1-cp311-cp311-win32.whl", hash = "sha256:85a950a4ac9c359340d5963966e3e0a94a676bd6245a4b55bc43949eee26a655"},
-    {file = "cffi-1.17.1-cp311-cp311-win_amd64.whl", hash = "sha256:caaf0640ef5f5517f49bc275eca1406b0ffa6aa184892812030f04c2abf589a0"},
-    {file = "cffi-1.17.1-cp312-cp312-macosx_10_9_x86_64.whl", hash = "sha256:805b4371bf7197c329fcb3ead37e710d1bca9da5d583f5073b799d5c5bd1eee4"},
-    {file = "cffi-1.17.1-cp312-cp312-macosx_11_0_arm64.whl", hash = "sha256:733e99bc2df47476e3848417c5a4540522f234dfd4ef3ab7fafdf555b082ec0c"},
-    {file = "cffi-1.17.1-cp312-cp312-manylinux_2_12_i686.manylinux2010_i686.manylinux_2_17_i686.manylinux2014_i686.whl", hash = "sha256:1257bdabf294dceb59f5e70c64a3e2f462c30c7ad68092d01bbbfb1c16b1ba36"},
-    {file = "cffi-1.17.1-cp312-cp312-manylinux_2_17_aarch64.manylinux2014_aarch64.whl", hash = "sha256:da95af8214998d77a98cc14e3a3bd00aa191526343078b530ceb0bd710fb48a5"},
-    {file = "cffi-1.17.1-cp312-cp312-manylinux_2_17_ppc64le.manylinux2014_ppc64le.whl", hash = "sha256:d63afe322132c194cf832bfec0dc69a99fb9bb6bbd550f161a49e9e855cc78ff"},
-    {file = "cffi-1.17.1-cp312-cp312-manylinux_2_17_s390x.manylinux2014_s390x.whl", hash = "sha256:f79fc4fc25f1c8698ff97788206bb3c2598949bfe0fef03d299eb1b5356ada99"},
-    {file = "cffi-1.17.1-cp312-cp312-manylinux_2_17_x86_64.manylinux2014_x86_64.whl", hash = "sha256:b62ce867176a75d03a665bad002af8e6d54644fad99a3c70905c543130e39d93"},
-    {file = "cffi-1.17.1-cp312-cp312-musllinux_1_1_aarch64.whl", hash = "sha256:386c8bf53c502fff58903061338ce4f4950cbdcb23e2902d86c0f722b786bbe3"},
-    {file = "cffi-1.17.1-cp312-cp312-musllinux_1_1_x86_64.whl", hash = "sha256:4ceb10419a9adf4460ea14cfd6bc43d08701f0835e979bf821052f1805850fe8"},
-    {file = "cffi-1.17.1-cp312-cp312-win32.whl", hash = "sha256:a08d7e755f8ed21095a310a693525137cfe756ce62d066e53f502a83dc550f65"},
-    {file = "cffi-1.17.1-cp312-cp312-win_amd64.whl", hash = "sha256:51392eae71afec0d0c8fb1a53b204dbb3bcabcb3c9b807eedf3e1e6ccf2de903"},
-    {file = "cffi-1.17.1-cp313-cp313-macosx_10_13_x86_64.whl", hash = "sha256:f3a2b4222ce6b60e2e8b337bb9596923045681d71e5a082783484d845390938e"},
-    {file = "cffi-1.17.1-cp313-cp313-macosx_11_0_arm64.whl", hash = "sha256:0984a4925a435b1da406122d4d7968dd861c1385afe3b45ba82b750f229811e2"},
-    {file = "cffi-1.17.1-cp313-cp313-manylinux_2_12_i686.manylinux2010_i686.manylinux_2_17_i686.manylinux2014_i686.whl", hash = "sha256:d01b12eeeb4427d3110de311e1774046ad344f5b1a7403101878976ecd7a10f3"},
-    {file = "cffi-1.17.1-cp313-cp313-manylinux_2_17_aarch64.manylinux2014_aarch64.whl", hash = "sha256:706510fe141c86a69c8ddc029c7910003a17353970cff3b904ff0686a5927683"},
-    {file = "cffi-1.17.1-cp313-cp313-manylinux_2_17_ppc64le.manylinux2014_ppc64le.whl", hash = "sha256:de55b766c7aa2e2a3092c51e0483d700341182f08e67c63630d5b6f200bb28e5"},
-    {file = "cffi-1.17.1-cp313-cp313-manylinux_2_17_s390x.manylinux2014_s390x.whl", hash = "sha256:c59d6e989d07460165cc5ad3c61f9fd8f1b4796eacbd81cee78957842b834af4"},
-    {file = "cffi-1.17.1-cp313-cp313-manylinux_2_17_x86_64.manylinux2014_x86_64.whl", hash = "sha256:dd398dbc6773384a17fe0d3e7eeb8d1a21c2200473ee6806bb5e6a8e62bb73dd"},
-    {file = "cffi-1.17.1-cp313-cp313-musllinux_1_1_aarch64.whl", hash = "sha256:3edc8d958eb099c634dace3c7e16560ae474aa3803a5df240542b305d14e14ed"},
-    {file = "cffi-1.17.1-cp313-cp313-musllinux_1_1_x86_64.whl", hash = "sha256:72e72408cad3d5419375fc87d289076ee319835bdfa2caad331e377589aebba9"},
-    {file = "cffi-1.17.1-cp313-cp313-win32.whl", hash = "sha256:e03eab0a8677fa80d646b5ddece1cbeaf556c313dcfac435ba11f107ba117b5d"},
-    {file = "cffi-1.17.1-cp313-cp313-win_amd64.whl", hash = "sha256:f6a16c31041f09ead72d69f583767292f750d24913dadacf5756b966aacb3f1a"},
-    {file = "cffi-1.17.1-cp38-cp38-macosx_10_9_x86_64.whl", hash = "sha256:636062ea65bd0195bc012fea9321aca499c0504409f413dc88af450b57ffd03b"},
-    {file = "cffi-1.17.1-cp38-cp38-manylinux_2_12_i686.manylinux2010_i686.manylinux_2_17_i686.manylinux2014_i686.whl", hash = "sha256:c7eac2ef9b63c79431bc4b25f1cd649d7f061a28808cbc6c47b534bd789ef964"},
-    {file = "cffi-1.17.1-cp38-cp38-manylinux_2_17_aarch64.manylinux2014_aarch64.whl", hash = "sha256:e221cf152cff04059d011ee126477f0d9588303eb57e88923578ace7baad17f9"},
-    {file = "cffi-1.17.1-cp38-cp38-manylinux_2_17_ppc64le.manylinux2014_ppc64le.whl", hash = "sha256:31000ec67d4221a71bd3f67df918b1f88f676f1c3b535a7eb473255fdc0b83fc"},
-    {file = "cffi-1.17.1-cp38-cp38-manylinux_2_17_s390x.manylinux2014_s390x.whl", hash = "sha256:6f17be4345073b0a7b8ea599688f692ac3ef23ce28e5df79c04de519dbc4912c"},
-    {file = "cffi-1.17.1-cp38-cp38-manylinux_2_17_x86_64.manylinux2014_x86_64.whl", hash = "sha256:0e2b1fac190ae3ebfe37b979cc1ce69c81f4e4fe5746bb401dca63a9062cdaf1"},
-    {file = "cffi-1.17.1-cp38-cp38-win32.whl", hash = "sha256:7596d6620d3fa590f677e9ee430df2958d2d6d6de2feeae5b20e82c00b76fbf8"},
-    {file = "cffi-1.17.1-cp38-cp38-win_amd64.whl", hash = "sha256:78122be759c3f8a014ce010908ae03364d00a1f81ab5c7f4a7a5120607ea56e1"},
-    {file = "cffi-1.17.1-cp39-cp39-macosx_10_9_x86_64.whl", hash = "sha256:b2ab587605f4ba0bf81dc0cb08a41bd1c0a5906bd59243d56bad7668a6fc6c16"},
-    {file = "cffi-1.17.1-cp39-cp39-macosx_11_0_arm64.whl", hash = "sha256:28b16024becceed8c6dfbc75629e27788d8a3f9030691a1dbf9821a128b22c36"},
-    {file = "cffi-1.17.1-cp39-cp39-manylinux_2_12_i686.manylinux2010_i686.manylinux_2_17_i686.manylinux2014_i686.whl", hash = "sha256:1d599671f396c4723d016dbddb72fe8e0397082b0a77a4fab8028923bec050e8"},
-    {file = "cffi-1.17.1-cp39-cp39-manylinux_2_17_aarch64.manylinux2014_aarch64.whl", hash = "sha256:ca74b8dbe6e8e8263c0ffd60277de77dcee6c837a3d0881d8c1ead7268c9e576"},
-    {file = "cffi-1.17.1-cp39-cp39-manylinux_2_17_ppc64le.manylinux2014_ppc64le.whl", hash = "sha256:f7f5baafcc48261359e14bcd6d9bff6d4b28d9103847c9e136694cb0501aef87"},
-    {file = "cffi-1.17.1-cp39-cp39-manylinux_2_17_s390x.manylinux2014_s390x.whl", hash = "sha256:98e3969bcff97cae1b2def8ba499ea3d6f31ddfdb7635374834cf89a1a08ecf0"},
-    {file = "cffi-1.17.1-cp39-cp39-manylinux_2_17_x86_64.manylinux2014_x86_64.whl", hash = "sha256:cdf5ce3acdfd1661132f2a9c19cac174758dc2352bfe37d98aa7512c6b7178b3"},
-    {file = "cffi-1.17.1-cp39-cp39-musllinux_1_1_aarch64.whl", hash = "sha256:9755e4345d1ec879e3849e62222a18c7174d65a6a92d5b346b1863912168b595"},
-    {file = "cffi-1.17.1-cp39-cp39-musllinux_1_1_i686.whl", hash = "sha256:f1e22e8c4419538cb197e4dd60acc919d7696e5ef98ee4da4e01d3f8cfa4cc5a"},
-    {file = "cffi-1.17.1-cp39-cp39-musllinux_1_1_x86_64.whl", hash = "sha256:c03e868a0b3bc35839ba98e74211ed2b05d2119be4e8a0f224fba9384f1fe02e"},
-    {file = "cffi-1.17.1-cp39-cp39-win32.whl", hash = "sha256:e31ae45bc2e29f6b2abd0de1cc3b9d5205aa847cafaecb8af1476a609a2f6eb7"},
-    {file = "cffi-1.17.1-cp39-cp39-win_amd64.whl", hash = "sha256:d016c76bdd850f3c626af19b0542c9677ba156e4ee4fccfdd7848803533ef662"},
-    {file = "cffi-1.17.1.tar.gz", hash = "sha256:1c39c6016c32bc48dd54561950ebd6836e1670f2ae46128f67cf49e789c52824"},
+markers = "platform_python_implementation != \"PyPy\""
+files = [
+    {file = "cffi-2.0.0-cp310-cp310-macosx_10_13_x86_64.whl", hash = "sha256:0cf2d91ecc3fcc0625c2c530fe004f82c110405f101548512cce44322fa8ac44"},
+    {file = "cffi-2.0.0-cp310-cp310-macosx_11_0_arm64.whl", hash = "sha256:f73b96c41e3b2adedc34a7356e64c8eb96e03a3782b535e043a986276ce12a49"},
+    {file = "cffi-2.0.0-cp310-cp310-manylinux1_i686.manylinux2014_i686.manylinux_2_17_i686.manylinux_2_5_i686.whl", hash = "sha256:53f77cbe57044e88bbd5ed26ac1d0514d2acf0591dd6bb02a3ae37f76811b80c"},
+    {file = "cffi-2.0.0-cp310-cp310-manylinux2014_aarch64.manylinux_2_17_aarch64.whl", hash = "sha256:3e837e369566884707ddaf85fc1744b47575005c0a229de3327f8f9a20f4efeb"},
+    {file = "cffi-2.0.0-cp310-cp310-manylinux2014_ppc64le.manylinux_2_17_ppc64le.whl", hash = "sha256:5eda85d6d1879e692d546a078b44251cdd08dd1cfb98dfb77b670c97cee49ea0"},
+    {file = "cffi-2.0.0-cp310-cp310-manylinux2014_s390x.manylinux_2_17_s390x.whl", hash = "sha256:9332088d75dc3241c702d852d4671613136d90fa6881da7d770a483fd05248b4"},
+    {file = "cffi-2.0.0-cp310-cp310-manylinux2014_x86_64.manylinux_2_17_x86_64.whl", hash = "sha256:fc7de24befaeae77ba923797c7c87834c73648a05a4bde34b3b7e5588973a453"},
+    {file = "cffi-2.0.0-cp310-cp310-musllinux_1_2_aarch64.whl", hash = "sha256:cf364028c016c03078a23b503f02058f1814320a56ad535686f90565636a9495"},
+    {file = "cffi-2.0.0-cp310-cp310-musllinux_1_2_i686.whl", hash = "sha256:e11e82b744887154b182fd3e7e8512418446501191994dbf9c9fc1f32cc8efd5"},
+    {file = "cffi-2.0.0-cp310-cp310-musllinux_1_2_x86_64.whl", hash = "sha256:8ea985900c5c95ce9db1745f7933eeef5d314f0565b27625d9a10ec9881e1bfb"},
+    {file = "cffi-2.0.0-cp310-cp310-win32.whl", hash = "sha256:1f72fb8906754ac8a2cc3f9f5aaa298070652a0ffae577e0ea9bd480dc3c931a"},
+    {file = "cffi-2.0.0-cp310-cp310-win_amd64.whl", hash = "sha256:b18a3ed7d5b3bd8d9ef7a8cb226502c6bf8308df1525e1cc676c3680e7176739"},
+    {file = "cffi-2.0.0-cp311-cp311-macosx_10_13_x86_64.whl", hash = "sha256:b4c854ef3adc177950a8dfc81a86f5115d2abd545751a304c5bcf2c2c7283cfe"},
+    {file = "cffi-2.0.0-cp311-cp311-macosx_11_0_arm64.whl", hash = "sha256:2de9a304e27f7596cd03d16f1b7c72219bd944e99cc52b84d0145aefb07cbd3c"},
+    {file = "cffi-2.0.0-cp311-cp311-manylinux1_i686.manylinux2014_i686.manylinux_2_17_i686.manylinux_2_5_i686.whl", hash = "sha256:baf5215e0ab74c16e2dd324e8ec067ef59e41125d3eade2b863d294fd5035c92"},
+    {file = "cffi-2.0.0-cp311-cp311-manylinux2014_aarch64.manylinux_2_17_aarch64.whl", hash = "sha256:730cacb21e1bdff3ce90babf007d0a0917cc3e6492f336c2f0134101e0944f93"},
+    {file = "cffi-2.0.0-cp311-cp311-manylinux2014_ppc64le.manylinux_2_17_ppc64le.whl", hash = "sha256:6824f87845e3396029f3820c206e459ccc91760e8fa24422f8b0c3d1731cbec5"},
+    {file = "cffi-2.0.0-cp311-cp311-manylinux2014_s390x.manylinux_2_17_s390x.whl", hash = "sha256:9de40a7b0323d889cf8d23d1ef214f565ab154443c42737dfe52ff82cf857664"},
+    {file = "cffi-2.0.0-cp311-cp311-manylinux2014_x86_64.manylinux_2_17_x86_64.whl", hash = "sha256:8941aaadaf67246224cee8c3803777eed332a19d909b47e29c9842ef1e79ac26"},
+    {file = "cffi-2.0.0-cp311-cp311-musllinux_1_2_aarch64.whl", hash = "sha256:a05d0c237b3349096d3981b727493e22147f934b20f6f125a3eba8f994bec4a9"},
+    {file = "cffi-2.0.0-cp311-cp311-musllinux_1_2_i686.whl", hash = "sha256:94698a9c5f91f9d138526b48fe26a199609544591f859c870d477351dc7b2414"},
+    {file = "cffi-2.0.0-cp311-cp311-musllinux_1_2_x86_64.whl", hash = "sha256:5fed36fccc0612a53f1d4d9a816b50a36702c28a2aa880cb8a122b3466638743"},
+    {file = "cffi-2.0.0-cp311-cp311-win32.whl", hash = "sha256:c649e3a33450ec82378822b3dad03cc228b8f5963c0c12fc3b1e0ab940f768a5"},
+    {file = "cffi-2.0.0-cp311-cp311-win_amd64.whl", hash = "sha256:66f011380d0e49ed280c789fbd08ff0d40968ee7b665575489afa95c98196ab5"},
+    {file = "cffi-2.0.0-cp311-cp311-win_arm64.whl", hash = "sha256:c6638687455baf640e37344fe26d37c404db8b80d037c3d29f58fe8d1c3b194d"},
+    {file = "cffi-2.0.0-cp312-cp312-macosx_10_13_x86_64.whl", hash = "sha256:6d02d6655b0e54f54c4ef0b94eb6be0607b70853c45ce98bd278dc7de718be5d"},
+    {file = "cffi-2.0.0-cp312-cp312-macosx_11_0_arm64.whl", hash = "sha256:8eca2a813c1cb7ad4fb74d368c2ffbbb4789d377ee5bb8df98373c2cc0dee76c"},
+    {file = "cffi-2.0.0-cp312-cp312-manylinux1_i686.manylinux2014_i686.manylinux_2_17_i686.manylinux_2_5_i686.whl", hash = "sha256:21d1152871b019407d8ac3985f6775c079416c282e431a4da6afe7aefd2bccbe"},
+    {file = "cffi-2.0.0-cp312-cp312-manylinux2014_aarch64.manylinux_2_17_aarch64.whl", hash = "sha256:b21e08af67b8a103c71a250401c78d5e0893beff75e28c53c98f4de42f774062"},
+    {file = "cffi-2.0.0-cp312-cp312-manylinux2014_ppc64le.manylinux_2_17_ppc64le.whl", hash = "sha256:1e3a615586f05fc4065a8b22b8152f0c1b00cdbc60596d187c2a74f9e3036e4e"},
+    {file = "cffi-2.0.0-cp312-cp312-manylinux2014_s390x.manylinux_2_17_s390x.whl", hash = "sha256:81afed14892743bbe14dacb9e36d9e0e504cd204e0b165062c488942b9718037"},
+    {file = "cffi-2.0.0-cp312-cp312-manylinux2014_x86_64.manylinux_2_17_x86_64.whl", hash = "sha256:3e17ed538242334bf70832644a32a7aae3d83b57567f9fd60a26257e992b79ba"},
+    {file = "cffi-2.0.0-cp312-cp312-musllinux_1_2_aarch64.whl", hash = "sha256:3925dd22fa2b7699ed2617149842d2e6adde22b262fcbfada50e3d195e4b3a94"},
+    {file = "cffi-2.0.0-cp312-cp312-musllinux_1_2_x86_64.whl", hash = "sha256:2c8f814d84194c9ea681642fd164267891702542f028a15fc97d4674b6206187"},
+    {file = "cffi-2.0.0-cp312-cp312-win32.whl", hash = "sha256:da902562c3e9c550df360bfa53c035b2f241fed6d9aef119048073680ace4a18"},
+    {file = "cffi-2.0.0-cp312-cp312-win_amd64.whl", hash = "sha256:da68248800ad6320861f129cd9c1bf96ca849a2771a59e0344e88681905916f5"},
+    {file = "cffi-2.0.0-cp312-cp312-win_arm64.whl", hash = "sha256:4671d9dd5ec934cb9a73e7ee9676f9362aba54f7f34910956b84d727b0d73fb6"},
+    {file = "cffi-2.0.0-cp313-cp313-macosx_10_13_x86_64.whl", hash = "sha256:00bdf7acc5f795150faa6957054fbbca2439db2f775ce831222b66f192f03beb"},
+    {file = "cffi-2.0.0-cp313-cp313-macosx_11_0_arm64.whl", hash = "sha256:45d5e886156860dc35862657e1494b9bae8dfa63bf56796f2fb56e1679fc0bca"},
+    {file = "cffi-2.0.0-cp313-cp313-manylinux1_i686.manylinux2014_i686.manylinux_2_17_i686.manylinux_2_5_i686.whl", hash = "sha256:07b271772c100085dd28b74fa0cd81c8fb1a3ba18b21e03d7c27f3436a10606b"},
+    {file = "cffi-2.0.0-cp313-cp313-manylinux2014_aarch64.manylinux_2_17_aarch64.whl", hash = "sha256:d48a880098c96020b02d5a1f7d9251308510ce8858940e6fa99ece33f610838b"},
+    {file = "cffi-2.0.0-cp313-cp313-manylinux2014_ppc64le.manylinux_2_17_ppc64le.whl", hash = "sha256:f93fd8e5c8c0a4aa1f424d6173f14a892044054871c771f8566e4008eaa359d2"},
+    {file = "cffi-2.0.0-cp313-cp313-manylinux2014_s390x.manylinux_2_17_s390x.whl", hash = "sha256:dd4f05f54a52fb558f1ba9f528228066954fee3ebe629fc1660d874d040ae5a3"},
+    {file = "cffi-2.0.0-cp313-cp313-manylinux2014_x86_64.manylinux_2_17_x86_64.whl", hash = "sha256:c8d3b5532fc71b7a77c09192b4a5a200ea992702734a2e9279a37f2478236f26"},
+    {file = "cffi-2.0.0-cp313-cp313-musllinux_1_2_aarch64.whl", hash = "sha256:d9b29c1f0ae438d5ee9acb31cadee00a58c46cc9c0b2f9038c6b0b3470877a8c"},
+    {file = "cffi-2.0.0-cp313-cp313-musllinux_1_2_x86_64.whl", hash = "sha256:6d50360be4546678fc1b79ffe7a66265e28667840010348dd69a314145807a1b"},
+    {file = "cffi-2.0.0-cp313-cp313-win32.whl", hash = "sha256:74a03b9698e198d47562765773b4a8309919089150a0bb17d829ad7b44b60d27"},
+    {file = "cffi-2.0.0-cp313-cp313-win_amd64.whl", hash = "sha256:19f705ada2530c1167abacb171925dd886168931e0a7b78f5bffcae5c6b5be75"},
+    {file = "cffi-2.0.0-cp313-cp313-win_arm64.whl", hash = "sha256:256f80b80ca3853f90c21b23ee78cd008713787b1b1e93eae9f3d6a7134abd91"},
+    {file = "cffi-2.0.0-cp314-cp314-macosx_10_13_x86_64.whl", hash = "sha256:fc33c5141b55ed366cfaad382df24fe7dcbc686de5be719b207bb248e3053dc5"},
+    {file = "cffi-2.0.0-cp314-cp314-macosx_11_0_arm64.whl", hash = "sha256:c654de545946e0db659b3400168c9ad31b5d29593291482c43e3564effbcee13"},
+    {file = "cffi-2.0.0-cp314-cp314-manylinux2014_aarch64.manylinux_2_17_aarch64.whl", hash = "sha256:24b6f81f1983e6df8db3adc38562c83f7d4a0c36162885ec7f7b77c7dcbec97b"},
+    {file = "cffi-2.0.0-cp314-cp314-manylinux2014_ppc64le.manylinux_2_17_ppc64le.whl", hash = "sha256:12873ca6cb9b0f0d3a0da705d6086fe911591737a59f28b7936bdfed27c0d47c"},
+    {file = "cffi-2.0.0-cp314-cp314-manylinux2014_s390x.manylinux_2_17_s390x.whl", hash = "sha256:d9b97165e8aed9272a6bb17c01e3cc5871a594a446ebedc996e2397a1c1ea8ef"},
+    {file = "cffi-2.0.0-cp314-cp314-manylinux2014_x86_64.manylinux_2_17_x86_64.whl", hash = "sha256:afb8db5439b81cf9c9d0c80404b60c3cc9c3add93e114dcae767f1477cb53775"},
+    {file = "cffi-2.0.0-cp314-cp314-musllinux_1_2_aarch64.whl", hash = "sha256:737fe7d37e1a1bffe70bd5754ea763a62a066dc5913ca57e957824b72a85e205"},
+    {file = "cffi-2.0.0-cp314-cp314-musllinux_1_2_x86_64.whl", hash = "sha256:38100abb9d1b1435bc4cc340bb4489635dc2f0da7456590877030c9b3d40b0c1"},
+    {file = "cffi-2.0.0-cp314-cp314-win32.whl", hash = "sha256:087067fa8953339c723661eda6b54bc98c5625757ea62e95eb4898ad5e776e9f"},
+    {file = "cffi-2.0.0-cp314-cp314-win_amd64.whl", hash = "sha256:203a48d1fb583fc7d78a4c6655692963b860a417c0528492a6bc21f1aaefab25"},
+    {file = "cffi-2.0.0-cp314-cp314-win_arm64.whl", hash = "sha256:dbd5c7a25a7cb98f5ca55d258b103a2054f859a46ae11aaf23134f9cc0d356ad"},
+    {file = "cffi-2.0.0-cp314-cp314t-macosx_10_13_x86_64.whl", hash = "sha256:9a67fc9e8eb39039280526379fb3a70023d77caec1852002b4da7e8b270c4dd9"},
+    {file = "cffi-2.0.0-cp314-cp314t-macosx_11_0_arm64.whl", hash = "sha256:7a66c7204d8869299919db4d5069a82f1561581af12b11b3c9f48c584eb8743d"},
+    {file = "cffi-2.0.0-cp314-cp314t-manylinux2014_aarch64.manylinux_2_17_aarch64.whl", hash = "sha256:7cc09976e8b56f8cebd752f7113ad07752461f48a58cbba644139015ac24954c"},
+    {file = "cffi-2.0.0-cp314-cp314t-manylinux2014_ppc64le.manylinux_2_17_ppc64le.whl", hash = "sha256:92b68146a71df78564e4ef48af17551a5ddd142e5190cdf2c5624d0c3ff5b2e8"},
+    {file = "cffi-2.0.0-cp314-cp314t-manylinux2014_s390x.manylinux_2_17_s390x.whl", hash = "sha256:b1e74d11748e7e98e2f426ab176d4ed720a64412b6a15054378afdb71e0f37dc"},
+    {file = "cffi-2.0.0-cp314-cp314t-manylinux2014_x86_64.manylinux_2_17_x86_64.whl", hash = "sha256:28a3a209b96630bca57cce802da70c266eb08c6e97e5afd61a75611ee6c64592"},
+    {file = "cffi-2.0.0-cp314-cp314t-musllinux_1_2_aarch64.whl", hash = "sha256:7553fb2090d71822f02c629afe6042c299edf91ba1bf94951165613553984512"},
+    {file = "cffi-2.0.0-cp314-cp314t-musllinux_1_2_x86_64.whl", hash = "sha256:6c6c373cfc5c83a975506110d17457138c8c63016b563cc9ed6e056a82f13ce4"},
+    {file = "cffi-2.0.0-cp314-cp314t-win32.whl", hash = "sha256:1fc9ea04857caf665289b7a75923f2c6ed559b8298a1b8c49e59f7dd95c8481e"},
+    {file = "cffi-2.0.0-cp314-cp314t-win_amd64.whl", hash = "sha256:d68b6cef7827e8641e8ef16f4494edda8b36104d79773a334beaa1e3521430f6"},
+    {file = "cffi-2.0.0-cp314-cp314t-win_arm64.whl", hash = "sha256:0a1527a803f0a659de1af2e1fd700213caba79377e27e4693648c2923da066f9"},
+    {file = "cffi-2.0.0-cp39-cp39-macosx_10_13_x86_64.whl", hash = "sha256:fe562eb1a64e67dd297ccc4f5addea2501664954f2692b69a76449ec7913ecbf"},
+    {file = "cffi-2.0.0-cp39-cp39-macosx_11_0_arm64.whl", hash = "sha256:de8dad4425a6ca6e4e5e297b27b5c824ecc7581910bf9aee86cb6835e6812aa7"},
+    {file = "cffi-2.0.0-cp39-cp39-manylinux1_i686.manylinux2014_i686.manylinux_2_17_i686.manylinux_2_5_i686.whl", hash = "sha256:4647afc2f90d1ddd33441e5b0e85b16b12ddec4fca55f0d9671fef036ecca27c"},
+    {file = "cffi-2.0.0-cp39-cp39-manylinux2014_aarch64.manylinux_2_17_aarch64.whl", hash = "sha256:3f4d46d8b35698056ec29bca21546e1551a205058ae1a181d871e278b0b28165"},
+    {file = "cffi-2.0.0-cp39-cp39-manylinux2014_ppc64le.manylinux_2_17_ppc64le.whl", hash = "sha256:e6e73b9e02893c764e7e8d5bb5ce277f1a009cd5243f8228f75f842bf937c534"},
+    {file = "cffi-2.0.0-cp39-cp39-manylinux2014_s390x.manylinux_2_17_s390x.whl", hash = "sha256:cb527a79772e5ef98fb1d700678fe031e353e765d1ca2d409c92263c6d43e09f"},
+    {file = "cffi-2.0.0-cp39-cp39-manylinux2014_x86_64.manylinux_2_17_x86_64.whl", hash = "sha256:61d028e90346df14fedc3d1e5441df818d095f3b87d286825dfcbd6459b7ef63"},
+    {file = "cffi-2.0.0-cp39-cp39-musllinux_1_2_aarch64.whl", hash = "sha256:0f6084a0ea23d05d20c3edcda20c3d006f9b6f3fefeac38f59262e10cef47ee2"},
+    {file = "cffi-2.0.0-cp39-cp39-musllinux_1_2_i686.whl", hash = "sha256:1cd13c99ce269b3ed80b417dcd591415d3372bcac067009b6e0f59c7d4015e65"},
+    {file = "cffi-2.0.0-cp39-cp39-musllinux_1_2_x86_64.whl", hash = "sha256:89472c9762729b5ae1ad974b777416bfda4ac5642423fa93bd57a09204712322"},
+    {file = "cffi-2.0.0-cp39-cp39-win32.whl", hash = "sha256:2081580ebb843f759b9f617314a24ed5738c51d2aee65d31e02f6f7a2b97707a"},
+    {file = "cffi-2.0.0-cp39-cp39-win_amd64.whl", hash = "sha256:b882b3df248017dba09d6b16defe9b5c407fe32fc7c65a9c69798e6175601be9"},
+    {file = "cffi-2.0.0.tar.gz", hash = "sha256:44d1b5909021139fe36001ae048dbdde8214afa20200eda0f64c068cac5d5529"},
 ]
 
 [package.dependencies]
-pycparser = "*"
+pycparser = {version = "*", markers = "implementation_name != \"PyPy\""}
 
 [[package]]
 name = "cfgv"
-version = "3.4.0"
+version = "3.5.0"
 description = "Validate configuration and produce human readable error messages."
 optional = false
-python-versions = ">=3.8"
-groups = ["main", "dev"]
+python-versions = ">=3.10"
+groups = ["dev"]
 files = [
-    {file = "cfgv-3.4.0-py2.py3-none-any.whl", hash = "sha256:b7265b1f29fd3316bfcd2b330d63d024f2bfd8bcb8b0272f8e19a504856c48f9"},
-    {file = "cfgv-3.4.0.tar.gz", hash = "sha256:e52591d4c5f5dead8e0f673fb16db7949d2cfb3f7da4582893288f0ded8fe560"},
+    {file = "cfgv-3.5.0-py2.py3-none-any.whl", hash = "sha256:a8dc6b26ad22ff227d2634a65cb388215ce6cc96bbcc5cfde7641ae87e8dacc0"},
+    {file = "cfgv-3.5.0.tar.gz", hash = "sha256:d5b1034354820651caa73ede66a6294d6e95c1b00acc5e9b098e917404669132"},
 ]
 
 [[package]]
 name = "charset-normalizer"
-version = "3.4.1"
+version = "3.4.7"
 description = "The Real First Universal Charset Detector. Open, modern and actively maintained alternative to Chardet."
 optional = false
 python-versions = ">=3.7"
 groups = ["main"]
 files = [
-    {file = "charset_normalizer-3.4.1-cp310-cp310-macosx_10_9_universal2.whl", hash = "sha256:91b36a978b5ae0ee86c394f5a54d6ef44db1de0815eb43de826d41d21e4af3de"},
-    {file = "charset_normalizer-3.4.1-cp310-cp310-manylinux_2_17_aarch64.manylinux2014_aarch64.whl", hash = "sha256:7461baadb4dc00fd9e0acbe254e3d7d2112e7f92ced2adc96e54ef6501c5f176"},
-    {file = "charset_normalizer-3.4.1-cp310-cp310-manylinux_2_17_ppc64le.manylinux2014_ppc64le.whl", hash = "sha256:e218488cd232553829be0664c2292d3af2eeeb94b32bea483cf79ac6a694e037"},
-    {file = "charset_normalizer-3.4.1-cp310-cp310-manylinux_2_17_s390x.manylinux2014_s390x.whl", hash = "sha256:80ed5e856eb7f30115aaf94e4a08114ccc8813e6ed1b5efa74f9f82e8509858f"},
-    {file = "charset_normalizer-3.4.1-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl", hash = "sha256:b010a7a4fd316c3c484d482922d13044979e78d1861f0e0650423144c616a46a"},
-    {file = "charset_normalizer-3.4.1-cp310-cp310-manylinux_2_5_i686.manylinux1_i686.manylinux_2_17_i686.manylinux2014_i686.whl", hash = "sha256:4532bff1b8421fd0a320463030c7520f56a79c9024a4e88f01c537316019005a"},
-    {file = "charset_normalizer-3.4.1-cp310-cp310-musllinux_1_2_aarch64.whl", hash = "sha256:d973f03c0cb71c5ed99037b870f2be986c3c05e63622c017ea9816881d2dd247"},
-    {file = "charset_normalizer-3.4.1-cp310-cp310-musllinux_1_2_i686.whl", hash = "sha256:3a3bd0dcd373514dcec91c411ddb9632c0d7d92aed7093b8c3bbb6d69ca74408"},
-    {file = "charset_normalizer-3.4.1-cp310-cp310-musllinux_1_2_ppc64le.whl", hash = "sha256:d9c3cdf5390dcd29aa8056d13e8e99526cda0305acc038b96b30352aff5ff2bb"},
-    {file = "charset_normalizer-3.4.1-cp310-cp310-musllinux_1_2_s390x.whl", hash = "sha256:2bdfe3ac2e1bbe5b59a1a63721eb3b95fc9b6817ae4a46debbb4e11f6232428d"},
-    {file = "charset_normalizer-3.4.1-cp310-cp310-musllinux_1_2_x86_64.whl", hash = "sha256:eab677309cdb30d047996b36d34caeda1dc91149e4fdca0b1a039b3f79d9a807"},
-    {file = "charset_normalizer-3.4.1-cp310-cp310-win32.whl", hash = "sha256:c0429126cf75e16c4f0ad00ee0eae4242dc652290f940152ca8c75c3a4b6ee8f"},
-    {file = "charset_normalizer-3.4.1-cp310-cp310-win_amd64.whl", hash = "sha256:9f0b8b1c6d84c8034a44893aba5e767bf9c7a211e313a9605d9c617d7083829f"},
-    {file = "charset_normalizer-3.4.1-cp311-cp311-macosx_10_9_universal2.whl", hash = "sha256:8bfa33f4f2672964266e940dd22a195989ba31669bd84629f05fab3ef4e2d125"},
-    {file = "charset_normalizer-3.4.1-cp311-cp311-manylinux_2_17_aarch64.manylinux2014_aarch64.whl", hash = "sha256:28bf57629c75e810b6ae989f03c0828d64d6b26a5e205535585f96093e405ed1"},
-    {file = "charset_normalizer-3.4.1-cp311-cp311-manylinux_2_17_ppc64le.manylinux2014_ppc64le.whl", hash = "sha256:f08ff5e948271dc7e18a35641d2f11a4cd8dfd5634f55228b691e62b37125eb3"},
-    {file = "charset_normalizer-3.4.1-cp311-cp311-manylinux_2_17_s390x.manylinux2014_s390x.whl", hash = "sha256:234ac59ea147c59ee4da87a0c0f098e9c8d169f4dc2a159ef720f1a61bbe27cd"},
-    {file = "charset_normalizer-3.4.1-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl", hash = "sha256:fd4ec41f914fa74ad1b8304bbc634b3de73d2a0889bd32076342a573e0779e00"},
-    {file = "charset_normalizer-3.4.1-cp311-cp311-manylinux_2_5_i686.manylinux1_i686.manylinux_2_17_i686.manylinux2014_i686.whl", hash = "sha256:eea6ee1db730b3483adf394ea72f808b6e18cf3cb6454b4d86e04fa8c4327a12"},
-    {file = "charset_normalizer-3.4.1-cp311-cp311-musllinux_1_2_aarch64.whl", hash = "sha256:c96836c97b1238e9c9e3fe90844c947d5afbf4f4c92762679acfe19927d81d77"},
-    {file = "charset_normalizer-3.4.1-cp311-cp311-musllinux_1_2_i686.whl", hash = "sha256:4d86f7aff21ee58f26dcf5ae81a9addbd914115cdebcbb2217e4f0ed8982e146"},
-    {file = "charset_normalizer-3.4.1-cp311-cp311-musllinux_1_2_ppc64le.whl", hash = "sha256:09b5e6733cbd160dcc09589227187e242a30a49ca5cefa5a7edd3f9d19ed53fd"},
-    {file = "charset_normalizer-3.4.1-cp311-cp311-musllinux_1_2_s390x.whl", hash = "sha256:5777ee0881f9499ed0f71cc82cf873d9a0ca8af166dfa0af8ec4e675b7df48e6"},
-    {file = "charset_normalizer-3.4.1-cp311-cp311-musllinux_1_2_x86_64.whl", hash = "sha256:237bdbe6159cff53b4f24f397d43c6336c6b0b42affbe857970cefbb620911c8"},
-    {file = "charset_normalizer-3.4.1-cp311-cp311-win32.whl", hash = "sha256:8417cb1f36cc0bc7eaba8ccb0e04d55f0ee52df06df3ad55259b9a323555fc8b"},
-    {file = "charset_normalizer-3.4.1-cp311-cp311-win_amd64.whl", hash = "sha256:d7f50a1f8c450f3925cb367d011448c39239bb3eb4117c36a6d354794de4ce76"},
-    {file = "charset_normalizer-3.4.1-cp312-cp312-macosx_10_13_universal2.whl", hash = "sha256:73d94b58ec7fecbc7366247d3b0b10a21681004153238750bb67bd9012414545"},
-    {file = "charset_normalizer-3.4.1-cp312-cp312-manylinux_2_17_aarch64.manylinux2014_aarch64.whl", hash = "sha256:dad3e487649f498dd991eeb901125411559b22e8d7ab25d3aeb1af367df5efd7"},
-    {file = "charset_normalizer-3.4.1-cp312-cp312-manylinux_2_17_ppc64le.manylinux2014_ppc64le.whl", hash = "sha256:c30197aa96e8eed02200a83fba2657b4c3acd0f0aa4bdc9f6c1af8e8962e0757"},
-    {file = "charset_normalizer-3.4.1-cp312-cp312-manylinux_2_17_s390x.manylinux2014_s390x.whl", hash = "sha256:2369eea1ee4a7610a860d88f268eb39b95cb588acd7235e02fd5a5601773d4fa"},
-    {file = "charset_normalizer-3.4.1-cp312-cp312-manylinux_2_17_x86_64.manylinux2014_x86_64.whl", hash = "sha256:bc2722592d8998c870fa4e290c2eec2c1569b87fe58618e67d38b4665dfa680d"},
-    {file = "charset_normalizer-3.4.1-cp312-cp312-manylinux_2_5_i686.manylinux1_i686.manylinux_2_17_i686.manylinux2014_i686.whl", hash = "sha256:ffc9202a29ab3920fa812879e95a9e78b2465fd10be7fcbd042899695d75e616"},
-    {file = "charset_normalizer-3.4.1-cp312-cp312-musllinux_1_2_aarch64.whl", hash = "sha256:804a4d582ba6e5b747c625bf1255e6b1507465494a40a2130978bda7b932c90b"},
-    {file = "charset_normalizer-3.4.1-cp312-cp312-musllinux_1_2_i686.whl", hash = "sha256:0f55e69f030f7163dffe9fd0752b32f070566451afe180f99dbeeb81f511ad8d"},
-    {file = "charset_normalizer-3.4.1-cp312-cp312-musllinux_1_2_ppc64le.whl", hash = "sha256:c4c3e6da02df6fa1410a7680bd3f63d4f710232d3139089536310d027950696a"},
-    {file = "charset_normalizer-3.4.1-cp312-cp312-musllinux_1_2_s390x.whl", hash = "sha256:5df196eb874dae23dcfb968c83d4f8fdccb333330fe1fc278ac5ceeb101003a9"},
-    {file = "charset_normalizer-3.4.1-cp312-cp312-musllinux_1_2_x86_64.whl", hash = "sha256:e358e64305fe12299a08e08978f51fc21fac060dcfcddd95453eabe5b93ed0e1"},
-    {file = "charset_normalizer-3.4.1-cp312-cp312-win32.whl", hash = "sha256:9b23ca7ef998bc739bf6ffc077c2116917eabcc901f88da1b9856b210ef63f35"},
-    {file = "charset_normalizer-3.4.1-cp312-cp312-win_amd64.whl", hash = "sha256:6ff8a4a60c227ad87030d76e99cd1698345d4491638dfa6673027c48b3cd395f"},
-    {file = "charset_normalizer-3.4.1-cp313-cp313-macosx_10_13_universal2.whl", hash = "sha256:aabfa34badd18f1da5ec1bc2715cadc8dca465868a4e73a0173466b688f29dda"},
-    {file = "charset_normalizer-3.4.1-cp313-cp313-manylinux_2_17_aarch64.manylinux2014_aarch64.whl", hash = "sha256:22e14b5d70560b8dd51ec22863f370d1e595ac3d024cb8ad7d308b4cd95f8313"},
-    {file = "charset_normalizer-3.4.1-cp313-cp313-manylinux_2_17_ppc64le.manylinux2014_ppc64le.whl", hash = "sha256:8436c508b408b82d87dc5f62496973a1805cd46727c34440b0d29d8a2f50a6c9"},
-    {file = "charset_normalizer-3.4.1-cp313-cp313-manylinux_2_17_s390x.manylinux2014_s390x.whl", hash = "sha256:2d074908e1aecee37a7635990b2c6d504cd4766c7bc9fc86d63f9c09af3fa11b"},
-    {file = "charset_normalizer-3.4.1-cp313-cp313-manylinux_2_17_x86_64.manylinux2014_x86_64.whl", hash = "sha256:955f8851919303c92343d2f66165294848d57e9bba6cf6e3625485a70a038d11"},
-    {file = "charset_normalizer-3.4.1-cp313-cp313-manylinux_2_5_i686.manylinux1_i686.manylinux_2_17_i686.manylinux2014_i686.whl", hash = "sha256:44ecbf16649486d4aebafeaa7ec4c9fed8b88101f4dd612dcaf65d5e815f837f"},
-    {file = "charset_normalizer-3.4.1-cp313-cp313-musllinux_1_2_aarch64.whl", hash = "sha256:0924e81d3d5e70f8126529951dac65c1010cdf117bb75eb02dd12339b57749dd"},
-    {file = "charset_normalizer-3.4.1-cp313-cp313-musllinux_1_2_i686.whl", hash = "sha256:2967f74ad52c3b98de4c3b32e1a44e32975e008a9cd2a8cc8966d6a5218c5cb2"},
-    {file = "charset_normalizer-3.4.1-cp313-cp313-musllinux_1_2_ppc64le.whl", hash = "sha256:c75cb2a3e389853835e84a2d8fb2b81a10645b503eca9bcb98df6b5a43eb8886"},
-    {file = "charset_normalizer-3.4.1-cp313-cp313-musllinux_1_2_s390x.whl", hash = "sha256:09b26ae6b1abf0d27570633b2b078a2a20419c99d66fb2823173d73f188ce601"},
-    {file = "charset_normalizer-3.4.1-cp313-cp313-musllinux_1_2_x86_64.whl", hash = "sha256:fa88b843d6e211393a37219e6a1c1df99d35e8fd90446f1118f4216e307e48cd"},
-    {file = "charset_normalizer-3.4.1-cp313-cp313-win32.whl", hash = "sha256:eb8178fe3dba6450a3e024e95ac49ed3400e506fd4e9e5c32d30adda88cbd407"},
-    {file = "charset_normalizer-3.4.1-cp313-cp313-win_amd64.whl", hash = "sha256:b1ac5992a838106edb89654e0aebfc24f5848ae2547d22c2c3f66454daa11971"},
-    {file = "charset_normalizer-3.4.1-cp37-cp37m-manylinux_2_17_aarch64.manylinux2014_aarch64.whl", hash = "sha256:f30bf9fd9be89ecb2360c7d94a711f00c09b976258846efe40db3d05828e8089"},
-    {file = "charset_normalizer-3.4.1-cp37-cp37m-manylinux_2_17_ppc64le.manylinux2014_ppc64le.whl", hash = "sha256:97f68b8d6831127e4787ad15e6757232e14e12060bec17091b85eb1486b91d8d"},
-    {file = "charset_normalizer-3.4.1-cp37-cp37m-manylinux_2_17_s390x.manylinux2014_s390x.whl", hash = "sha256:7974a0b5ecd505609e3b19742b60cee7aa2aa2fb3151bc917e6e2646d7667dcf"},
-    {file = "charset_normalizer-3.4.1-cp37-cp37m-manylinux_2_17_x86_64.manylinux2014_x86_64.whl", hash = "sha256:fc54db6c8593ef7d4b2a331b58653356cf04f67c960f584edb7c3d8c97e8f39e"},
-    {file = "charset_normalizer-3.4.1-cp37-cp37m-manylinux_2_5_i686.manylinux1_i686.manylinux_2_17_i686.manylinux2014_i686.whl", hash = "sha256:311f30128d7d333eebd7896965bfcfbd0065f1716ec92bd5638d7748eb6f936a"},
-    {file = "charset_normalizer-3.4.1-cp37-cp37m-musllinux_1_2_aarch64.whl", hash = "sha256:7d053096f67cd1241601111b698f5cad775f97ab25d81567d3f59219b5f1adbd"},
-    {file = "charset_normalizer-3.4.1-cp37-cp37m-musllinux_1_2_i686.whl", hash = "sha256:807f52c1f798eef6cf26beb819eeb8819b1622ddfeef9d0977a8502d4db6d534"},
-    {file = "charset_normalizer-3.4.1-cp37-cp37m-musllinux_1_2_ppc64le.whl", hash = "sha256:dccbe65bd2f7f7ec22c4ff99ed56faa1e9f785482b9bbd7c717e26fd723a1d1e"},
-    {file = "charset_normalizer-3.4.1-cp37-cp37m-musllinux_1_2_s390x.whl", hash = "sha256:2fb9bd477fdea8684f78791a6de97a953c51831ee2981f8e4f583ff3b9d9687e"},
-    {file = "charset_normalizer-3.4.1-cp37-cp37m-musllinux_1_2_x86_64.whl", hash = "sha256:01732659ba9b5b873fc117534143e4feefecf3b2078b0a6a2e925271bb6f4cfa"},
-    {file = "charset_normalizer-3.4.1-cp37-cp37m-win32.whl", hash = "sha256:7a4f97a081603d2050bfaffdefa5b02a9ec823f8348a572e39032caa8404a487"},
-    {file = "charset_normalizer-3.4.1-cp37-cp37m-win_amd64.whl", hash = "sha256:7b1bef6280950ee6c177b326508f86cad7ad4dff12454483b51d8b7d673a2c5d"},
-    {file = "charset_normalizer-3.4.1-cp38-cp38-macosx_10_9_universal2.whl", hash = "sha256:ecddf25bee22fe4fe3737a399d0d177d72bc22be6913acfab364b40bce1ba83c"},
-    {file = "charset_normalizer-3.4.1-cp38-cp38-manylinux_2_17_aarch64.manylinux2014_aarch64.whl", hash = "sha256:8c60ca7339acd497a55b0ea5d506b2a2612afb2826560416f6894e8b5770d4a9"},
-    {file = "charset_normalizer-3.4.1-cp38-cp38-manylinux_2_17_ppc64le.manylinux2014_ppc64le.whl", hash = "sha256:b7b2d86dd06bfc2ade3312a83a5c364c7ec2e3498f8734282c6c3d4b07b346b8"},
-    {file = "charset_normalizer-3.4.1-cp38-cp38-manylinux_2_17_s390x.manylinux2014_s390x.whl", hash = "sha256:dd78cfcda14a1ef52584dbb008f7ac81c1328c0f58184bf9a84c49c605002da6"},
-    {file = "charset_normalizer-3.4.1-cp38-cp38-manylinux_2_17_x86_64.manylinux2014_x86_64.whl", hash = "sha256:6e27f48bcd0957c6d4cb9d6fa6b61d192d0b13d5ef563e5f2ae35feafc0d179c"},
-    {file = "charset_normalizer-3.4.1-cp38-cp38-manylinux_2_5_i686.manylinux1_i686.manylinux_2_17_i686.manylinux2014_i686.whl", hash = "sha256:01ad647cdd609225c5350561d084b42ddf732f4eeefe6e678765636791e78b9a"},
-    {file = "charset_normalizer-3.4.1-cp38-cp38-musllinux_1_2_aarch64.whl", hash = "sha256:619a609aa74ae43d90ed2e89bdd784765de0a25ca761b93e196d938b8fd1dbbd"},
-    {file = "charset_normalizer-3.4.1-cp38-cp38-musllinux_1_2_i686.whl", hash = "sha256:89149166622f4db9b4b6a449256291dc87a99ee53151c74cbd82a53c8c2f6ccd"},
-    {file = "charset_normalizer-3.4.1-cp38-cp38-musllinux_1_2_ppc64le.whl", hash = "sha256:7709f51f5f7c853f0fb938bcd3bc59cdfdc5203635ffd18bf354f6967ea0f824"},
-    {file = "charset_normalizer-3.4.1-cp38-cp38-musllinux_1_2_s390x.whl", hash = "sha256:345b0426edd4e18138d6528aed636de7a9ed169b4aaf9d61a8c19e39d26838ca"},
-    {file = "charset_normalizer-3.4.1-cp38-cp38-musllinux_1_2_x86_64.whl", hash = "sha256:0907f11d019260cdc3f94fbdb23ff9125f6b5d1039b76003b5b0ac9d6a6c9d5b"},
-    {file = "charset_normalizer-3.4.1-cp38-cp38-win32.whl", hash = "sha256:ea0d8d539afa5eb2728aa1932a988a9a7af94f18582ffae4bc10b3fbdad0626e"},
-    {file = "charset_normalizer-3.4.1-cp38-cp38-win_amd64.whl", hash = "sha256:329ce159e82018d646c7ac45b01a430369d526569ec08516081727a20e9e4af4"},
-    {file = "charset_normalizer-3.4.1-cp39-cp39-macosx_10_9_universal2.whl", hash = "sha256:b97e690a2118911e39b4042088092771b4ae3fc3aa86518f84b8cf6888dbdb41"},
-    {file = "charset_normalizer-3.4.1-cp39-cp39-manylinux_2_17_aarch64.manylinux2014_aarch64.whl", hash = "sha256:78baa6d91634dfb69ec52a463534bc0df05dbd546209b79a3880a34487f4b84f"},
-    {file = "charset_normalizer-3.4.1-cp39-cp39-manylinux_2_17_ppc64le.manylinux2014_ppc64le.whl", hash = "sha256:1a2bc9f351a75ef49d664206d51f8e5ede9da246602dc2d2726837620ea034b2"},
-    {file = "charset_normalizer-3.4.1-cp39-cp39-manylinux_2_17_s390x.manylinux2014_s390x.whl", hash = "sha256:75832c08354f595c760a804588b9357d34ec00ba1c940c15e31e96d902093770"},
-    {file = "charset_normalizer-3.4.1-cp39-cp39-manylinux_2_17_x86_64.manylinux2014_x86_64.whl", hash = "sha256:0af291f4fe114be0280cdd29d533696a77b5b49cfde5467176ecab32353395c4"},
-    {file = "charset_normalizer-3.4.1-cp39-cp39-manylinux_2_5_i686.manylinux1_i686.manylinux_2_17_i686.manylinux2014_i686.whl", hash = "sha256:0167ddc8ab6508fe81860a57dd472b2ef4060e8d378f0cc555707126830f2537"},
-    {file = "charset_normalizer-3.4.1-cp39-cp39-musllinux_1_2_aarch64.whl", hash = "sha256:2a75d49014d118e4198bcee5ee0a6f25856b29b12dbf7cd012791f8a6cc5c496"},
-    {file = "charset_normalizer-3.4.1-cp39-cp39-musllinux_1_2_i686.whl", hash = "sha256:363e2f92b0f0174b2f8238240a1a30142e3db7b957a5dd5689b0e75fb717cc78"},
-    {file = "charset_normalizer-3.4.1-cp39-cp39-musllinux_1_2_ppc64le.whl", hash = "sha256:ab36c8eb7e454e34e60eb55ca5d241a5d18b2c6244f6827a30e451c42410b5f7"},
-    {file = "charset_normalizer-3.4.1-cp39-cp39-musllinux_1_2_s390x.whl", hash = "sha256:4c0907b1928a36d5a998d72d64d8eaa7244989f7aaaf947500d3a800c83a3fd6"},
-    {file = "charset_normalizer-3.4.1-cp39-cp39-musllinux_1_2_x86_64.whl", hash = "sha256:04432ad9479fa40ec0f387795ddad4437a2b50417c69fa275e212933519ff294"},
-    {file = "charset_normalizer-3.4.1-cp39-cp39-win32.whl", hash = "sha256:3bed14e9c89dcb10e8f3a29f9ccac4955aebe93c71ae803af79265c9ca5644c5"},
-    {file = "charset_normalizer-3.4.1-cp39-cp39-win_amd64.whl", hash = "sha256:49402233c892a461407c512a19435d1ce275543138294f7ef013f0b63d5d3765"},
-    {file = "charset_normalizer-3.4.1-py3-none-any.whl", hash = "sha256:d98b1668f06378c6dbefec3b92299716b931cd4e6061f3c875a71ced1780ab85"},
-    {file = "charset_normalizer-3.4.1.tar.gz", hash = "sha256:44251f18cd68a75b56585dd00dae26183e102cd5e0f9f1466e6df5da2ed64ea3"},
+    {file = "charset_normalizer-3.4.7-cp310-cp310-macosx_10_9_universal2.whl", hash = "sha256:cdd68a1fb318e290a2077696b7eb7a21a49163c455979c639bf5a5dcdc46617d"},
+    {file = "charset_normalizer-3.4.7-cp310-cp310-manylinux2014_aarch64.manylinux_2_17_aarch64.manylinux_2_28_aarch64.whl", hash = "sha256:e17b8d5d6a8c47c85e68ca8379def1303fd360c3e22093a807cd34a71cd082b8"},
+    {file = "charset_normalizer-3.4.7-cp310-cp310-manylinux2014_ppc64le.manylinux_2_17_ppc64le.manylinux_2_28_ppc64le.whl", hash = "sha256:511ef87c8aec0783e08ac18565a16d435372bc1ac25a91e6ac7f5ef2b0bff790"},
+    {file = "charset_normalizer-3.4.7-cp310-cp310-manylinux2014_s390x.manylinux_2_17_s390x.manylinux_2_28_s390x.whl", hash = "sha256:007d05ec7321d12a40227aae9e2bc6dca73f3cb21058999a1df9e193555a9dcc"},
+    {file = "charset_normalizer-3.4.7-cp310-cp310-manylinux2014_x86_64.manylinux_2_17_x86_64.manylinux_2_28_x86_64.whl", hash = "sha256:cf29836da5119f3c8a8a70667b0ef5fdca3bb12f80fd06487cfa575b3909b393"},
+    {file = "charset_normalizer-3.4.7-cp310-cp310-manylinux_2_31_armv7l.whl", hash = "sha256:12d8baf840cc7889b37c7c770f478adea7adce3dcb3944d02ec87508e2dcf153"},
+    {file = "charset_normalizer-3.4.7-cp310-cp310-manylinux_2_31_riscv64.manylinux_2_39_riscv64.whl", hash = "sha256:d560742f3c0d62afaccf9f41fe485ed69bd7661a241f86a3ef0f0fb8b1a397af"},
+    {file = "charset_normalizer-3.4.7-cp310-cp310-musllinux_1_2_aarch64.whl", hash = "sha256:b14b2d9dac08e28bb8046a1a0434b1750eb221c8f5b87a68f4fa11a6f97b5e34"},
+    {file = "charset_normalizer-3.4.7-cp310-cp310-musllinux_1_2_armv7l.whl", hash = "sha256:bc17a677b21b3502a21f66a8cc64f5bfad4df8a0b8434d661666f8ce90ac3af1"},
+    {file = "charset_normalizer-3.4.7-cp310-cp310-musllinux_1_2_ppc64le.whl", hash = "sha256:750e02e074872a3fad7f233b47734166440af3cdea0add3e95163110816d6752"},
+    {file = "charset_normalizer-3.4.7-cp310-cp310-musllinux_1_2_riscv64.whl", hash = "sha256:4e5163c14bffd570ef2affbfdd77bba66383890797df43dc8b4cc7d6f500bf53"},
+    {file = "charset_normalizer-3.4.7-cp310-cp310-musllinux_1_2_s390x.whl", hash = "sha256:6ed74185b2db44f41ef35fd1617c5888e59792da9bbc9190d6c7300617182616"},
+    {file = "charset_normalizer-3.4.7-cp310-cp310-musllinux_1_2_x86_64.whl", hash = "sha256:94e1885b270625a9a828c9793b4d52a64445299baa1fea5a173bf1d3dd9a1a5a"},
+    {file = "charset_normalizer-3.4.7-cp310-cp310-win32.whl", hash = "sha256:6785f414ae0f3c733c437e0f3929197934f526d19dfaa75e18fdb4f94c6fb374"},
+    {file = "charset_normalizer-3.4.7-cp310-cp310-win_amd64.whl", hash = "sha256:6696b7688f54f5af4462118f0bfa7c1621eeb87154f77fa04b9295ce7a8f2943"},
+    {file = "charset_normalizer-3.4.7-cp310-cp310-win_arm64.whl", hash = "sha256:66671f93accb62ed07da56613636f3641f1a12c13046ce91ffc923721f23c008"},
+    {file = "charset_normalizer-3.4.7-cp311-cp311-macosx_10_9_universal2.whl", hash = "sha256:7641bb8895e77f921102f72833904dcd9901df5d6d72a2ab8f31d04b7e51e4e7"},
+    {file = "charset_normalizer-3.4.7-cp311-cp311-manylinux2014_aarch64.manylinux_2_17_aarch64.manylinux_2_28_aarch64.whl", hash = "sha256:202389074300232baeb53ae2569a60901f7efadd4245cf3a3bf0617d60b439d7"},
+    {file = "charset_normalizer-3.4.7-cp311-cp311-manylinux2014_ppc64le.manylinux_2_17_ppc64le.manylinux_2_28_ppc64le.whl", hash = "sha256:30b8d1d8c52a48c2c5690e152c169b673487a2a58de1ec7393196753063fcd5e"},
+    {file = "charset_normalizer-3.4.7-cp311-cp311-manylinux2014_s390x.manylinux_2_17_s390x.manylinux_2_28_s390x.whl", hash = "sha256:532bc9bf33a68613fd7d65e4b1c71a6a38d7d42604ecf239c77392e9b4e8998c"},
+    {file = "charset_normalizer-3.4.7-cp311-cp311-manylinux2014_x86_64.manylinux_2_17_x86_64.manylinux_2_28_x86_64.whl", hash = "sha256:2fe249cb4651fd12605b7288b24751d8bfd46d35f12a20b1ba33dea122e690df"},
+    {file = "charset_normalizer-3.4.7-cp311-cp311-manylinux_2_31_armv7l.whl", hash = "sha256:65bcd23054beab4d166035cabbc868a09c1a49d1efe458fe8e4361215df40265"},
+    {file = "charset_normalizer-3.4.7-cp311-cp311-manylinux_2_31_riscv64.manylinux_2_39_riscv64.whl", hash = "sha256:08e721811161356f97b4059a9ba7bafb23ea5ee2255402c42881c214e173c6b4"},
+    {file = "charset_normalizer-3.4.7-cp311-cp311-musllinux_1_2_aarch64.whl", hash = "sha256:e060d01aec0a910bdccb8be71faf34e7799ce36950f8294c8bf612cba65a2c9e"},
+    {file = "charset_normalizer-3.4.7-cp311-cp311-musllinux_1_2_armv7l.whl", hash = "sha256:38c0109396c4cfc574d502df99742a45c72c08eff0a36158b6f04000043dbf38"},
+    {file = "charset_normalizer-3.4.7-cp311-cp311-musllinux_1_2_ppc64le.whl", hash = "sha256:1c2a768fdd44ee4a9339a9b0b130049139b8ce3c01d2ce09f67f5a68048d477c"},
+    {file = "charset_normalizer-3.4.7-cp311-cp311-musllinux_1_2_riscv64.whl", hash = "sha256:1a87ca9d5df6fe460483d9a5bbf2b18f620cbed41b432e2bddb686228282d10b"},
+    {file = "charset_normalizer-3.4.7-cp311-cp311-musllinux_1_2_s390x.whl", hash = "sha256:d635aab80466bc95771bb78d5370e74d36d1fe31467b6b29b8b57b2a3cd7d22c"},
+    {file = "charset_normalizer-3.4.7-cp311-cp311-musllinux_1_2_x86_64.whl", hash = "sha256:ae196f021b5e7c78e918242d217db021ed2a6ace2bc6ae94c0fc596221c7f58d"},
+    {file = "charset_normalizer-3.4.7-cp311-cp311-win32.whl", hash = "sha256:adb2597b428735679446b46c8badf467b4ca5f5056aae4d51a19f9570301b1ad"},
+    {file = "charset_normalizer-3.4.7-cp311-cp311-win_amd64.whl", hash = "sha256:8e385e4267ab76874ae30db04c627faaaf0b509e1ccc11a95b3fc3e83f855c00"},
+    {file = "charset_normalizer-3.4.7-cp311-cp311-win_arm64.whl", hash = "sha256:d4a48e5b3c2a489fae013b7589308a40146ee081f6f509e047e0e096084ceca1"},
+    {file = "charset_normalizer-3.4.7-cp312-cp312-macosx_10_13_universal2.whl", hash = "sha256:eca9705049ad3c7345d574e3510665cb2cf844c2f2dcfe675332677f081cbd46"},
+    {file = "charset_normalizer-3.4.7-cp312-cp312-manylinux2014_aarch64.manylinux_2_17_aarch64.manylinux_2_28_aarch64.whl", hash = "sha256:6178f72c5508bfc5fd446a5905e698c6212932f25bcdd4b47a757a50605a90e2"},
+    {file = "charset_normalizer-3.4.7-cp312-cp312-manylinux2014_ppc64le.manylinux_2_17_ppc64le.manylinux_2_28_ppc64le.whl", hash = "sha256:e1421b502d83040e6d7fb2fb18dff63957f720da3d77b2fbd3187ceb63755d7b"},
+    {file = "charset_normalizer-3.4.7-cp312-cp312-manylinux2014_s390x.manylinux_2_17_s390x.manylinux_2_28_s390x.whl", hash = "sha256:edac0f1ab77644605be2cbba52e6b7f630731fc42b34cb0f634be1a6eface56a"},
+    {file = "charset_normalizer-3.4.7-cp312-cp312-manylinux2014_x86_64.manylinux_2_17_x86_64.manylinux_2_28_x86_64.whl", hash = "sha256:5649fd1c7bade02f320a462fdefd0b4bd3ce036065836d4f42e0de958038e116"},
+    {file = "charset_normalizer-3.4.7-cp312-cp312-manylinux_2_31_armv7l.whl", hash = "sha256:203104ed3e428044fd943bc4bf45fa73c0730391f9621e37fe39ecf477b128cb"},
+    {file = "charset_normalizer-3.4.7-cp312-cp312-manylinux_2_31_riscv64.manylinux_2_39_riscv64.whl", hash = "sha256:298930cec56029e05497a76988377cbd7457ba864beeea92ad7e844fe74cd1f1"},
+    {file = "charset_normalizer-3.4.7-cp312-cp312-musllinux_1_2_aarch64.whl", hash = "sha256:708838739abf24b2ceb208d0e22403dd018faeef86ddac04319a62ae884c4f15"},
+    {file = "charset_normalizer-3.4.7-cp312-cp312-musllinux_1_2_armv7l.whl", hash = "sha256:0f7eb884681e3938906ed0434f20c63046eacd0111c4ba96f27b76084cd679f5"},
+    {file = "charset_normalizer-3.4.7-cp312-cp312-musllinux_1_2_ppc64le.whl", hash = "sha256:4dc1e73c36828f982bfe79fadf5919923f8a6f4df2860804db9a98c48824ce8d"},
+    {file = "charset_normalizer-3.4.7-cp312-cp312-musllinux_1_2_riscv64.whl", hash = "sha256:aed52fea0513bac0ccde438c188c8a471c4e0f457c2dd20cdbf6ea7a450046c7"},
+    {file = "charset_normalizer-3.4.7-cp312-cp312-musllinux_1_2_s390x.whl", hash = "sha256:fea24543955a6a729c45a73fe90e08c743f0b3334bbf3201e6c4bc1b0c7fa464"},
+    {file = "charset_normalizer-3.4.7-cp312-cp312-musllinux_1_2_x86_64.whl", hash = "sha256:bb6d88045545b26da47aa879dd4a89a71d1dce0f0e549b1abcb31dfe4a8eac49"},
+    {file = "charset_normalizer-3.4.7-cp312-cp312-win32.whl", hash = "sha256:2257141f39fe65a3fdf38aeccae4b953e5f3b3324f4ff0daf9f15b8518666a2c"},
+    {file = "charset_normalizer-3.4.7-cp312-cp312-win_amd64.whl", hash = "sha256:5ed6ab538499c8644b8a3e18debabcd7ce684f3fa91cf867521a7a0279cab2d6"},
+    {file = "charset_normalizer-3.4.7-cp312-cp312-win_arm64.whl", hash = "sha256:56be790f86bfb2c98fb742ce566dfb4816e5a83384616ab59c49e0604d49c51d"},
+    {file = "charset_normalizer-3.4.7-cp313-cp313-macosx_10_13_universal2.whl", hash = "sha256:f496c9c3cc02230093d8330875c4c3cdfc3b73612a5fd921c65d39cbcef08063"},
+    {file = "charset_normalizer-3.4.7-cp313-cp313-manylinux2014_aarch64.manylinux_2_17_aarch64.manylinux_2_28_aarch64.whl", hash = "sha256:0ea948db76d31190bf08bd371623927ee1339d5f2a0b4b1b4a4439a65298703c"},
+    {file = "charset_normalizer-3.4.7-cp313-cp313-manylinux2014_ppc64le.manylinux_2_17_ppc64le.manylinux_2_28_ppc64le.whl", hash = "sha256:a277ab8928b9f299723bc1a2dabb1265911b1a76341f90a510368ca44ad9ab66"},
+    {file = "charset_normalizer-3.4.7-cp313-cp313-manylinux2014_s390x.manylinux_2_17_s390x.manylinux_2_28_s390x.whl", hash = "sha256:3bec022aec2c514d9cf199522a802bd007cd588ab17ab2525f20f9c34d067c18"},
+    {file = "charset_normalizer-3.4.7-cp313-cp313-manylinux2014_x86_64.manylinux_2_17_x86_64.manylinux_2_28_x86_64.whl", hash = "sha256:e044c39e41b92c845bc815e5ae4230804e8e7bc29e399b0437d64222d92809dd"},
+    {file = "charset_normalizer-3.4.7-cp313-cp313-manylinux_2_31_armv7l.whl", hash = "sha256:f495a1652cf3fbab2eb0639776dad966c2fb874d79d87ca07f9d5f059b8bd215"},
+    {file = "charset_normalizer-3.4.7-cp313-cp313-manylinux_2_31_riscv64.manylinux_2_39_riscv64.whl", hash = "sha256:e712b419df8ba5e42b226c510472b37bd57b38e897d3eca5e8cfd410a29fa859"},
+    {file = "charset_normalizer-3.4.7-cp313-cp313-musllinux_1_2_aarch64.whl", hash = "sha256:7804338df6fcc08105c7745f1502ba68d900f45fd770d5bdd5288ddccb8a42d8"},
+    {file = "charset_normalizer-3.4.7-cp313-cp313-musllinux_1_2_armv7l.whl", hash = "sha256:481551899c856c704d58119b5025793fa6730adda3571971af568f66d2424bb5"},
+    {file = "charset_normalizer-3.4.7-cp313-cp313-musllinux_1_2_ppc64le.whl", hash = "sha256:f59099f9b66f0d7145115e6f80dd8b1d847176df89b234a5a6b3f00437aa0832"},
+    {file = "charset_normalizer-3.4.7-cp313-cp313-musllinux_1_2_riscv64.whl", hash = "sha256:f59ad4c0e8f6bba240a9bb85504faa1ab438237199d4cce5f622761507b8f6a6"},
+    {file = "charset_normalizer-3.4.7-cp313-cp313-musllinux_1_2_s390x.whl", hash = "sha256:3dedcc22d73ec993f42055eff4fcfed9318d1eeb9a6606c55892a26964964e48"},
+    {file = "charset_normalizer-3.4.7-cp313-cp313-musllinux_1_2_x86_64.whl", hash = "sha256:64f02c6841d7d83f832cd97ccf8eb8a906d06eb95d5276069175c696b024b60a"},
+    {file = "charset_normalizer-3.4.7-cp313-cp313-win32.whl", hash = "sha256:4042d5c8f957e15221d423ba781e85d553722fc4113f523f2feb7b188cc34c5e"},
+    {file = "charset_normalizer-3.4.7-cp313-cp313-win_amd64.whl", hash = "sha256:3946fa46a0cf3e4c8cb1cc52f56bb536310d34f25f01ca9b6c16afa767dab110"},
+    {file = "charset_normalizer-3.4.7-cp313-cp313-win_arm64.whl", hash = "sha256:80d04837f55fc81da168b98de4f4b797ef007fc8a79ab71c6ec9bc4dd662b15b"},
+    {file = "charset_normalizer-3.4.7-cp314-cp314-macosx_10_15_universal2.whl", hash = "sha256:c36c333c39be2dbca264d7803333c896ab8fa7d4d6f0ab7edb7dfd7aea6e98c0"},
+    {file = "charset_normalizer-3.4.7-cp314-cp314-manylinux2014_aarch64.manylinux_2_17_aarch64.manylinux_2_28_aarch64.whl", hash = "sha256:1c2aed2e5e41f24ea8ef1590b8e848a79b56f3a5564a65ceec43c9d692dc7d8a"},
+    {file = "charset_normalizer-3.4.7-cp314-cp314-manylinux2014_ppc64le.manylinux_2_17_ppc64le.manylinux_2_28_ppc64le.whl", hash = "sha256:54523e136b8948060c0fa0bc7b1b50c32c186f2fceee897a495406bb6e311d2b"},
+    {file = "charset_normalizer-3.4.7-cp314-cp314-manylinux2014_s390x.manylinux_2_17_s390x.manylinux_2_28_s390x.whl", hash = "sha256:715479b9a2802ecac752a3b0efa2b0b60285cf962ee38414211abdfccc233b41"},
+    {file = "charset_normalizer-3.4.7-cp314-cp314-manylinux2014_x86_64.manylinux_2_17_x86_64.manylinux_2_28_x86_64.whl", hash = "sha256:bd6c2a1c7573c64738d716488d2cdd3c00e340e4835707d8fdb8dc1a66ef164e"},
+    {file = "charset_normalizer-3.4.7-cp314-cp314-manylinux_2_31_armv7l.whl", hash = "sha256:c45e9440fb78f8ddabcf714b68f936737a121355bf59f3907f4e17721b9d1aae"},
+    {file = "charset_normalizer-3.4.7-cp314-cp314-manylinux_2_31_riscv64.manylinux_2_39_riscv64.whl", hash = "sha256:3534e7dcbdcf757da6b85a0bbf5b6868786d5982dd959b065e65481644817a18"},
+    {file = "charset_normalizer-3.4.7-cp314-cp314-musllinux_1_2_aarch64.whl", hash = "sha256:e8ac484bf18ce6975760921bb6148041faa8fef0547200386ea0b52b5d27bf7b"},
+    {file = "charset_normalizer-3.4.7-cp314-cp314-musllinux_1_2_armv7l.whl", hash = "sha256:a5fe03b42827c13cdccd08e6c0247b6a6d4b5e3cdc53fd1749f5896adcdc2356"},
+    {file = "charset_normalizer-3.4.7-cp314-cp314-musllinux_1_2_ppc64le.whl", hash = "sha256:2d6eb928e13016cea4f1f21d1e10c1cebd5a421bc57ddf5b1142ae3f86824fab"},
+    {file = "charset_normalizer-3.4.7-cp314-cp314-musllinux_1_2_riscv64.whl", hash = "sha256:e74327fb75de8986940def6e8dee4f127cc9752bee7355bb323cc5b2659b6d46"},
+    {file = "charset_normalizer-3.4.7-cp314-cp314-musllinux_1_2_s390x.whl", hash = "sha256:d6038d37043bced98a66e68d3aa2b6a35505dc01328cd65217cefe82f25def44"},
+    {file = "charset_normalizer-3.4.7-cp314-cp314-musllinux_1_2_x86_64.whl", hash = "sha256:7579e913a5339fb8fa133f6bbcfd8e6749696206cf05acdbdca71a1b436d8e72"},
+    {file = "charset_normalizer-3.4.7-cp314-cp314-win32.whl", hash = "sha256:5b77459df20e08151cd6f8b9ef8ef1f961ef73d85c21a555c7eed5b79410ec10"},
+    {file = "charset_normalizer-3.4.7-cp314-cp314-win_amd64.whl", hash = "sha256:92a0a01ead5e668468e952e4238cccd7c537364eb7d851ab144ab6627dbbe12f"},
+    {file = "charset_normalizer-3.4.7-cp314-cp314-win_arm64.whl", hash = "sha256:67f6279d125ca0046a7fd386d01b311c6363844deac3e5b069b514ba3e63c246"},
+    {file = "charset_normalizer-3.4.7-cp314-cp314t-macosx_10_15_universal2.whl", hash = "sha256:effc3f449787117233702311a1b7d8f59cba9ced946ba727bdc329ec69028e24"},
+    {file = "charset_normalizer-3.4.7-cp314-cp314t-manylinux2014_aarch64.manylinux_2_17_aarch64.manylinux_2_28_aarch64.whl", hash = "sha256:fbccdc05410c9ee21bbf16a35f4c1d16123dcdeb8a1d38f33654fa21d0234f79"},
+    {file = "charset_normalizer-3.4.7-cp314-cp314t-manylinux2014_ppc64le.manylinux_2_17_ppc64le.manylinux_2_28_ppc64le.whl", hash = "sha256:733784b6d6def852c814bce5f318d25da2ee65dd4839a0718641c696e09a2960"},
+    {file = "charset_normalizer-3.4.7-cp314-cp314t-manylinux2014_s390x.manylinux_2_17_s390x.manylinux_2_28_s390x.whl", hash = "sha256:a89c23ef8d2c6b27fd200a42aa4ac72786e7c60d40efdc76e6011260b6e949c4"},
+    {file = "charset_normalizer-3.4.7-cp314-cp314t-manylinux2014_x86_64.manylinux_2_17_x86_64.manylinux_2_28_x86_64.whl", hash = "sha256:6c114670c45346afedc0d947faf3c7f701051d2518b943679c8ff88befe14f8e"},
+    {file = "charset_normalizer-3.4.7-cp314-cp314t-manylinux_2_31_armv7l.whl", hash = "sha256:a180c5e59792af262bf263b21a3c49353f25945d8d9f70628e73de370d55e1e1"},
+    {file = "charset_normalizer-3.4.7-cp314-cp314t-manylinux_2_31_riscv64.manylinux_2_39_riscv64.whl", hash = "sha256:3c9a494bc5ec77d43cea229c4f6db1e4d8fe7e1bbffa8b6f0f0032430ff8ab44"},
+    {file = "charset_normalizer-3.4.7-cp314-cp314t-musllinux_1_2_aarch64.whl", hash = "sha256:8d828b6667a32a728a1ad1d93957cdf37489c57b97ae6c4de2860fa749b8fc1e"},
+    {file = "charset_normalizer-3.4.7-cp314-cp314t-musllinux_1_2_armv7l.whl", hash = "sha256:cf1493cd8607bec4d8a7b9b004e699fcf8f9103a9284cc94962cb73d20f9d4a3"},
+    {file = "charset_normalizer-3.4.7-cp314-cp314t-musllinux_1_2_ppc64le.whl", hash = "sha256:0c96c3b819b5c3e9e165495db84d41914d6894d55181d2d108cc1a69bfc9cce0"},
+    {file = "charset_normalizer-3.4.7-cp314-cp314t-musllinux_1_2_riscv64.whl", hash = "sha256:752a45dc4a6934060b3b0dab47e04edc3326575f82be64bc4fc293914566503e"},
+    {file = "charset_normalizer-3.4.7-cp314-cp314t-musllinux_1_2_s390x.whl", hash = "sha256:8778f0c7a52e56f75d12dae53ae320fae900a8b9b4164b981b9c5ce059cd1fcb"},
+    {file = "charset_normalizer-3.4.7-cp314-cp314t-musllinux_1_2_x86_64.whl", hash = "sha256:ce3412fbe1e31eb81ea42f4169ed94861c56e643189e1e75f0041f3fe7020abe"},
+    {file = "charset_normalizer-3.4.7-cp314-cp314t-win32.whl", hash = "sha256:c03a41a8784091e67a39648f70c5f97b5b6a37f216896d44d2cdcb82615339a0"},
+    {file = "charset_normalizer-3.4.7-cp314-cp314t-win_amd64.whl", hash = "sha256:03853ed82eeebbce3c2abfdbc98c96dc205f32a79627688ac9a27370ea61a49c"},
+    {file = "charset_normalizer-3.4.7-cp314-cp314t-win_arm64.whl", hash = "sha256:c35abb8bfff0185efac5878da64c45dafd2b37fb0383add1be155a763c1f083d"},
+    {file = "charset_normalizer-3.4.7-cp38-cp38-macosx_10_9_universal2.whl", hash = "sha256:e5f4d355f0a2b1a31bc3edec6795b46324349c9cb25eed068049e4f472fb4259"},
+    {file = "charset_normalizer-3.4.7-cp38-cp38-manylinux2014_aarch64.manylinux_2_17_aarch64.manylinux_2_28_aarch64.whl", hash = "sha256:16d971e29578a5e97d7117866d15889a4a07befe0e87e703ed63cd90cb348c01"},
+    {file = "charset_normalizer-3.4.7-cp38-cp38-manylinux2014_ppc64le.manylinux_2_17_ppc64le.manylinux_2_28_ppc64le.whl", hash = "sha256:dca4bbc466a95ba9c0234ef56d7dd9509f63da22274589ebd4ed7f1f4d4c54e3"},
+    {file = "charset_normalizer-3.4.7-cp38-cp38-manylinux2014_s390x.manylinux_2_17_s390x.manylinux_2_28_s390x.whl", hash = "sha256:e80c8378d8f3d83cd3164da1ad2df9e37a666cdde7b1cb2298ed0b558064be30"},
+    {file = "charset_normalizer-3.4.7-cp38-cp38-manylinux2014_x86_64.manylinux_2_17_x86_64.manylinux_2_28_x86_64.whl", hash = "sha256:36836d6ff945a00b88ba1e4572d721e60b5b8c98c155d465f56ad19d68f23734"},
+    {file = "charset_normalizer-3.4.7-cp38-cp38-manylinux_2_31_armv7l.whl", hash = "sha256:bd9b23791fe793e4968dba0c447e12f78e425c59fc0e3b97f6450f4781f3ee60"},
+    {file = "charset_normalizer-3.4.7-cp38-cp38-manylinux_2_31_riscv64.manylinux_2_39_riscv64.whl", hash = "sha256:aef65cd602a6d0e0ff6f9930fcb1c8fec60dd2cfcb6facaf4bdb0e5873042db0"},
+    {file = "charset_normalizer-3.4.7-cp38-cp38-musllinux_1_2_aarch64.whl", hash = "sha256:82b271f5137d07749f7bf32f70b17ab6eaabedd297e75dce75081a24f76eb545"},
+    {file = "charset_normalizer-3.4.7-cp38-cp38-musllinux_1_2_armv7l.whl", hash = "sha256:1efde3cae86c8c273f1eb3b287be7d8499420cf2fe7585c41d370d3e790054a5"},
+    {file = "charset_normalizer-3.4.7-cp38-cp38-musllinux_1_2_ppc64le.whl", hash = "sha256:c593052c465475e64bbfe5dbd81680f64a67fdc752c56d7a0ae205dc8aeefe0f"},
+    {file = "charset_normalizer-3.4.7-cp38-cp38-musllinux_1_2_riscv64.whl", hash = "sha256:af21eb4409a119e365397b2adbaca4c9ccab56543a65d5dbd9f920d6ac29f686"},
+    {file = "charset_normalizer-3.4.7-cp38-cp38-musllinux_1_2_s390x.whl", hash = "sha256:84c018e49c3bf790f9c2771c45e9313a08c2c2a6342b162cd650258b57817706"},
+    {file = "charset_normalizer-3.4.7-cp38-cp38-musllinux_1_2_x86_64.whl", hash = "sha256:dd915403e231e6b1809fe9b6d9fc55cf8fb5e02765ac625d9cd623342a7905d7"},
+    {file = "charset_normalizer-3.4.7-cp38-cp38-win32.whl", hash = "sha256:320ade88cfb846b8cd6b4ddf5ee9e80ee0c1f52401f2456b84ae1ae6a1a5f207"},
+    {file = "charset_normalizer-3.4.7-cp38-cp38-win_amd64.whl", hash = "sha256:1dc8b0ea451d6e69735094606991f32867807881400f808a106ee1d963c46a83"},
+    {file = "charset_normalizer-3.4.7-cp39-cp39-macosx_10_9_universal2.whl", hash = "sha256:177a0ba5f0211d488e295aaf82707237e331c24788d8d76c96c5a41594723217"},
+    {file = "charset_normalizer-3.4.7-cp39-cp39-manylinux2014_aarch64.manylinux_2_17_aarch64.manylinux_2_28_aarch64.whl", hash = "sha256:6e0d51f618228538a3e8f46bd246f87a6cd030565e015803691603f55e12afb5"},
+    {file = "charset_normalizer-3.4.7-cp39-cp39-manylinux2014_ppc64le.manylinux_2_17_ppc64le.manylinux_2_28_ppc64le.whl", hash = "sha256:14265bfe1f09498b9d8ec91e9ec9fa52775edf90fcbde092b25f4a33d444fea9"},
+    {file = "charset_normalizer-3.4.7-cp39-cp39-manylinux2014_s390x.manylinux_2_17_s390x.manylinux_2_28_s390x.whl", hash = "sha256:87fad7d9ba98c86bcb41b2dc8dbb326619be2562af1f8ff50776a39e55721c5a"},
+    {file = "charset_normalizer-3.4.7-cp39-cp39-manylinux2014_x86_64.manylinux_2_17_x86_64.manylinux_2_28_x86_64.whl", hash = "sha256:f22dec1690b584cea26fade98b2435c132c1b5f68e39f5a0b7627cd7ae31f1dc"},
+    {file = "charset_normalizer-3.4.7-cp39-cp39-manylinux_2_31_armv7l.whl", hash = "sha256:d61f00a0869d77422d9b2aba989e2d24afa6ffd552af442e0e58de4f35ea6d00"},
+    {file = "charset_normalizer-3.4.7-cp39-cp39-manylinux_2_31_riscv64.manylinux_2_39_riscv64.whl", hash = "sha256:6370e8686f662e6a3941ee48ed4742317cafbe5707e36406e9df792cdb535776"},
+    {file = "charset_normalizer-3.4.7-cp39-cp39-musllinux_1_2_aarch64.whl", hash = "sha256:a6c5863edfbe888d9eff9c8b8087354e27618d9da76425c119293f11712a6319"},
+    {file = "charset_normalizer-3.4.7-cp39-cp39-musllinux_1_2_armv7l.whl", hash = "sha256:ed065083d0898c9d5b4bbec7b026fd755ff7454e6e8b73a67f8c744b13986e24"},
+    {file = "charset_normalizer-3.4.7-cp39-cp39-musllinux_1_2_ppc64le.whl", hash = "sha256:2cd4a60d0e2fb04537162c62bbbb4182f53541fe0ede35cdf270a1c1e723cc42"},
+    {file = "charset_normalizer-3.4.7-cp39-cp39-musllinux_1_2_riscv64.whl", hash = "sha256:813c0e0132266c08eb87469a642cb30aaff57c5f426255419572aaeceeaa7bf4"},
+    {file = "charset_normalizer-3.4.7-cp39-cp39-musllinux_1_2_s390x.whl", hash = "sha256:07d9e39b01743c3717745f4c530a6349eadbfa043c7577eef86c502c15df2c67"},
+    {file = "charset_normalizer-3.4.7-cp39-cp39-musllinux_1_2_x86_64.whl", hash = "sha256:c0f081d69a6e58272819b70288d3221a6ee64b98df852631c80f293514d3b274"},
+    {file = "charset_normalizer-3.4.7-cp39-cp39-win32.whl", hash = "sha256:8751d2787c9131302398b11e6c8068053dcb55d5a8964e114b6e196cf16cb366"},
+    {file = "charset_normalizer-3.4.7-cp39-cp39-win_amd64.whl", hash = "sha256:12a6fff75f6bc66711b73a2f0addfc4c8c15a20e805146a02d147a318962c444"},
+    {file = "charset_normalizer-3.4.7-cp39-cp39-win_arm64.whl", hash = "sha256:bb8cc7534f51d9a017b93e3e85b260924f909601c3df002bcdb58ddb4dc41a5c"},
+    {file = "charset_normalizer-3.4.7-py3-none-any.whl", hash = "sha256:3dce51d0f5e7951f8bb4900c257dad282f49190fdbebecd4ba99bcc41fef404d"},
+    {file = "charset_normalizer-3.4.7.tar.gz", hash = "sha256:ae89db9e5f98a11a4bf50407d4363e7b09b31e55bc117b4f7d80aab97ba009e5"},
 ]
 
 [[package]]
 name = "click"
-version = "8.1.8"
+version = "8.4.1"
 description = "Composable command line interface toolkit"
-optional = false
-python-versions = ">=3.7"
-groups = ["main", "dev"]
-files = [
-    {file = "click-8.1.8-py3-none-any.whl", hash = "sha256:63c132bbbed01578a06712a2d1f497bb62d9c1c0d329b7903a866228027263b2"},
-    {file = "click-8.1.8.tar.gz", hash = "sha256:ed53c9d8990d83c2a27deae68e4ee337473f6330c040a31d4225c9574d16096a"},
-]
-
-[package.dependencies]
-colorama = {version = "*", markers = "platform_system == \"Windows\""}
-
-[[package]]
-name = "clint"
-version = "0.5.1"
-description = "Python Command Line Interface Tools"
-optional = false
-python-versions = "*"
+optional = true
+python-versions = ">=3.10"
 groups = ["main"]
+markers = "extra == \"trackio\""
 files = [
-    {file = "clint-0.5.1.tar.gz", hash = "sha256:05224c32b1075563d0b16d0015faaf9da43aa214e4a2140e51f08789e7a4c5aa"},
+    {file = "click-8.4.1-py3-none-any.whl", hash = "sha256:482be17c6991b8c19c5429a1e995d9b0efdbb63172824c41f99965dc0ade8ec2"},
+    {file = "click-8.4.1.tar.gz", hash = "sha256:918b5633eddf6b41c32d4f454bf0de810065c74e3f7dbf8ee5452f8be88d3e96"},
 ]
 
 [package.dependencies]
-args = "*"
+colorama = {version = "*", markers = "platform_system == \"Windows\""}
 
 [[package]]
 name = "colorama"
@@ -766,342 +872,225 @@ version = "0.4.6"
 description = "Cross-platform colored terminal text."
 optional = false
 python-versions = "!=3.0.*,!=3.1.*,!=3.2.*,!=3.3.*,!=3.4.*,!=3.5.*,!=3.6.*,>=2.7"
-groups = ["main", "dev"]
+groups = ["main", "dev", "training"]
 files = [
     {file = "colorama-0.4.6-py2.py3-none-any.whl", hash = "sha256:4f1d9991f5acc0ca119f9d443620b77f9d6b33703e51011c16baf57afb285fc6"},
     {file = "colorama-0.4.6.tar.gz", hash = "sha256:08695f5cb7ed6e0531a20572697297273c47b8cae5a63ffc6d6ed5c201be6e44"},
 ]
-
-[[package]]
-name = "colossalai"
-version = "0.3.6"
-description = "An integrated large-scale model training system with efficient parallelization techniques"
-optional = false
-python-versions = ">=3.6"
-groups = ["main"]
-files = [
-    {file = "colossalai-0.3.6.tar.gz", hash = "sha256:a3454e50ec53a701eed56144bf1b25bae4a221e003fe8af799dff17884b12018"},
-]
-
-[package.dependencies]
-click = "*"
-contexttimer = "*"
-einops = "*"
-fabric = "*"
-google = "*"
-ninja = "*"
-numpy = "*"
-packaging = "*"
-pre-commit = "*"
-protobuf = "*"
-psutil = "*"
-pydantic = "*"
-ray = "*"
-rich = "*"
-safetensors = "*"
-sentencepiece = "*"
-torch = ">=1.12"
-tqdm = "*"
-
-[[package]]
-name = "contexttimer"
-version = "0.3.3"
-description = "A timer context manager measuring the clock wall time of the code block it contains."
-optional = false
-python-versions = "*"
-groups = ["main"]
-files = [
-    {file = "contexttimer-0.3.3.tar.gz", hash = "sha256:35a1efd389af3f1ca509f33ff23e17d98b66c8fde5ba2a4eb8a8b7fa456598a5"},
-]
-
-[[package]]
-name = "contourpy"
-version = "1.3.1"
-description = "Python library for calculating contours of 2D quadrilateral grids"
-optional = false
-python-versions = ">=3.10"
-groups = ["main"]
-files = [
-    {file = "contourpy-1.3.1-cp310-cp310-macosx_10_9_x86_64.whl", hash = "sha256:a045f341a77b77e1c5de31e74e966537bba9f3c4099b35bf4c2e3939dd54cdab"},
-    {file = "contourpy-1.3.1-cp310-cp310-macosx_11_0_arm64.whl", hash = "sha256:500360b77259914f7805af7462e41f9cb7ca92ad38e9f94d6c8641b089338124"},
-    {file = "contourpy-1.3.1-cp310-cp310-manylinux_2_17_aarch64.manylinux2014_aarch64.whl", hash = "sha256:b2f926efda994cdf3c8d3fdb40b9962f86edbc4457e739277b961eced3d0b4c1"},
-    {file = "contourpy-1.3.1-cp310-cp310-manylinux_2_17_ppc64le.manylinux2014_ppc64le.whl", hash = "sha256:adce39d67c0edf383647a3a007de0a45fd1b08dedaa5318404f1a73059c2512b"},
-    {file = "contourpy-1.3.1-cp310-cp310-manylinux_2_17_s390x.manylinux2014_s390x.whl", hash = "sha256:abbb49fb7dac584e5abc6636b7b2a7227111c4f771005853e7d25176daaf8453"},
-    {file = "contourpy-1.3.1-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl", hash = "sha256:a0cffcbede75c059f535725c1680dfb17b6ba8753f0c74b14e6a9c68c29d7ea3"},
-    {file = "contourpy-1.3.1-cp310-cp310-musllinux_1_2_aarch64.whl", hash = "sha256:ab29962927945d89d9b293eabd0d59aea28d887d4f3be6c22deaefbb938a7277"},
-    {file = "contourpy-1.3.1-cp310-cp310-musllinux_1_2_x86_64.whl", hash = "sha256:974d8145f8ca354498005b5b981165b74a195abfae9a8129df3e56771961d595"},
-    {file = "contourpy-1.3.1-cp310-cp310-win32.whl", hash = "sha256:ac4578ac281983f63b400f7fe6c101bedc10651650eef012be1ccffcbacf3697"},
-    {file = "contourpy-1.3.1-cp310-cp310-win_amd64.whl", hash = "sha256:174e758c66bbc1c8576992cec9599ce8b6672b741b5d336b5c74e35ac382b18e"},
-    {file = "contourpy-1.3.1-cp311-cp311-macosx_10_9_x86_64.whl", hash = "sha256:3e8b974d8db2c5610fb4e76307e265de0edb655ae8169e8b21f41807ccbeec4b"},
-    {file = "contourpy-1.3.1-cp311-cp311-macosx_11_0_arm64.whl", hash = "sha256:20914c8c973f41456337652a6eeca26d2148aa96dd7ac323b74516988bea89fc"},
-    {file = "contourpy-1.3.1-cp311-cp311-manylinux_2_17_aarch64.manylinux2014_aarch64.whl", hash = "sha256:19d40d37c1c3a4961b4619dd9d77b12124a453cc3d02bb31a07d58ef684d3d86"},
-    {file = "contourpy-1.3.1-cp311-cp311-manylinux_2_17_ppc64le.manylinux2014_ppc64le.whl", hash = "sha256:113231fe3825ebf6f15eaa8bc1f5b0ddc19d42b733345eae0934cb291beb88b6"},
-    {file = "contourpy-1.3.1-cp311-cp311-manylinux_2_17_s390x.manylinux2014_s390x.whl", hash = "sha256:4dbbc03a40f916a8420e420d63e96a1258d3d1b58cbdfd8d1f07b49fcbd38e85"},
-    {file = "contourpy-1.3.1-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl", hash = "sha256:3a04ecd68acbd77fa2d39723ceca4c3197cb2969633836ced1bea14e219d077c"},
-    {file = "contourpy-1.3.1-cp311-cp311-musllinux_1_2_aarch64.whl", hash = "sha256:c414fc1ed8ee1dbd5da626cf3710c6013d3d27456651d156711fa24f24bd1291"},
-    {file = "contourpy-1.3.1-cp311-cp311-musllinux_1_2_x86_64.whl", hash = "sha256:31c1b55c1f34f80557d3830d3dd93ba722ce7e33a0b472cba0ec3b6535684d8f"},
-    {file = "contourpy-1.3.1-cp311-cp311-win32.whl", hash = "sha256:f611e628ef06670df83fce17805c344710ca5cde01edfdc72751311da8585375"},
-    {file = "contourpy-1.3.1-cp311-cp311-win_amd64.whl", hash = "sha256:b2bdca22a27e35f16794cf585832e542123296b4687f9fd96822db6bae17bfc9"},
-    {file = "contourpy-1.3.1-cp312-cp312-macosx_10_13_x86_64.whl", hash = "sha256:0ffa84be8e0bd33410b17189f7164c3589c229ce5db85798076a3fa136d0e509"},
-    {file = "contourpy-1.3.1-cp312-cp312-macosx_11_0_arm64.whl", hash = "sha256:805617228ba7e2cbbfb6c503858e626ab528ac2a32a04a2fe88ffaf6b02c32bc"},
-    {file = "contourpy-1.3.1-cp312-cp312-manylinux_2_17_aarch64.manylinux2014_aarch64.whl", hash = "sha256:ade08d343436a94e633db932e7e8407fe7de8083967962b46bdfc1b0ced39454"},
-    {file = "contourpy-1.3.1-cp312-cp312-manylinux_2_17_ppc64le.manylinux2014_ppc64le.whl", hash = "sha256:47734d7073fb4590b4a40122b35917cd77be5722d80683b249dac1de266aac80"},
-    {file = "contourpy-1.3.1-cp312-cp312-manylinux_2_17_s390x.manylinux2014_s390x.whl", hash = "sha256:2ba94a401342fc0f8b948e57d977557fbf4d515f03c67682dd5c6191cb2d16ec"},
-    {file = "contourpy-1.3.1-cp312-cp312-manylinux_2_17_x86_64.manylinux2014_x86_64.whl", hash = "sha256:efa874e87e4a647fd2e4f514d5e91c7d493697127beb95e77d2f7561f6905bd9"},
-    {file = "contourpy-1.3.1-cp312-cp312-musllinux_1_2_aarch64.whl", hash = "sha256:1bf98051f1045b15c87868dbaea84f92408337d4f81d0e449ee41920ea121d3b"},
-    {file = "contourpy-1.3.1-cp312-cp312-musllinux_1_2_x86_64.whl", hash = "sha256:61332c87493b00091423e747ea78200659dc09bdf7fd69edd5e98cef5d3e9a8d"},
-    {file = "contourpy-1.3.1-cp312-cp312-win32.whl", hash = "sha256:e914a8cb05ce5c809dd0fe350cfbb4e881bde5e2a38dc04e3afe1b3e58bd158e"},
-    {file = "contourpy-1.3.1-cp312-cp312-win_amd64.whl", hash = "sha256:08d9d449a61cf53033612cb368f3a1b26cd7835d9b8cd326647efe43bca7568d"},
-    {file = "contourpy-1.3.1-cp313-cp313-macosx_10_13_x86_64.whl", hash = "sha256:a761d9ccfc5e2ecd1bf05534eda382aa14c3e4f9205ba5b1684ecfe400716ef2"},
-    {file = "contourpy-1.3.1-cp313-cp313-macosx_11_0_arm64.whl", hash = "sha256:523a8ee12edfa36f6d2a49407f705a6ef4c5098de4f498619787e272de93f2d5"},
-    {file = "contourpy-1.3.1-cp313-cp313-manylinux_2_17_aarch64.manylinux2014_aarch64.whl", hash = "sha256:ece6df05e2c41bd46776fbc712e0996f7c94e0d0543af1656956d150c4ca7c81"},
-    {file = "contourpy-1.3.1-cp313-cp313-manylinux_2_17_ppc64le.manylinux2014_ppc64le.whl", hash = "sha256:573abb30e0e05bf31ed067d2f82500ecfdaec15627a59d63ea2d95714790f5c2"},
-    {file = "contourpy-1.3.1-cp313-cp313-manylinux_2_17_s390x.manylinux2014_s390x.whl", hash = "sha256:a9fa36448e6a3a1a9a2ba23c02012c43ed88905ec80163f2ffe2421c7192a5d7"},
-    {file = "contourpy-1.3.1-cp313-cp313-manylinux_2_17_x86_64.manylinux2014_x86_64.whl", hash = "sha256:3ea9924d28fc5586bf0b42d15f590b10c224117e74409dd7a0be3b62b74a501c"},
-    {file = "contourpy-1.3.1-cp313-cp313-musllinux_1_2_aarch64.whl", hash = "sha256:5b75aa69cb4d6f137b36f7eb2ace9280cfb60c55dc5f61c731fdf6f037f958a3"},
-    {file = "contourpy-1.3.1-cp313-cp313-musllinux_1_2_x86_64.whl", hash = "sha256:041b640d4ec01922083645a94bb3b2e777e6b626788f4095cf21abbe266413c1"},
-    {file = "contourpy-1.3.1-cp313-cp313-win32.whl", hash = "sha256:36987a15e8ace5f58d4d5da9dca82d498c2bbb28dff6e5d04fbfcc35a9cb3a82"},
-    {file = "contourpy-1.3.1-cp313-cp313-win_amd64.whl", hash = "sha256:a7895f46d47671fa7ceec40f31fae721da51ad34bdca0bee83e38870b1f47ffd"},
-    {file = "contourpy-1.3.1-cp313-cp313t-macosx_10_13_x86_64.whl", hash = "sha256:9ddeb796389dadcd884c7eb07bd14ef12408aaae358f0e2ae24114d797eede30"},
-    {file = "contourpy-1.3.1-cp313-cp313t-macosx_11_0_arm64.whl", hash = "sha256:19c1555a6801c2f084c7ddc1c6e11f02eb6a6016ca1318dd5452ba3f613a1751"},
-    {file = "contourpy-1.3.1-cp313-cp313t-manylinux_2_17_aarch64.manylinux2014_aarch64.whl", hash = "sha256:841ad858cff65c2c04bf93875e384ccb82b654574a6d7f30453a04f04af71342"},
-    {file = "contourpy-1.3.1-cp313-cp313t-manylinux_2_17_ppc64le.manylinux2014_ppc64le.whl", hash = "sha256:4318af1c925fb9a4fb190559ef3eec206845f63e80fb603d47f2d6d67683901c"},
-    {file = "contourpy-1.3.1-cp313-cp313t-manylinux_2_17_s390x.manylinux2014_s390x.whl", hash = "sha256:14c102b0eab282427b662cb590f2e9340a9d91a1c297f48729431f2dcd16e14f"},
-    {file = "contourpy-1.3.1-cp313-cp313t-manylinux_2_17_x86_64.manylinux2014_x86_64.whl", hash = "sha256:05e806338bfeaa006acbdeba0ad681a10be63b26e1b17317bfac3c5d98f36cda"},
-    {file = "contourpy-1.3.1-cp313-cp313t-musllinux_1_2_aarch64.whl", hash = "sha256:4d76d5993a34ef3df5181ba3c92fabb93f1eaa5729504fb03423fcd9f3177242"},
-    {file = "contourpy-1.3.1-cp313-cp313t-musllinux_1_2_x86_64.whl", hash = "sha256:89785bb2a1980c1bd87f0cb1517a71cde374776a5f150936b82580ae6ead44a1"},
-    {file = "contourpy-1.3.1-cp313-cp313t-win32.whl", hash = "sha256:8eb96e79b9f3dcadbad2a3891672f81cdcab7f95b27f28f1c67d75f045b6b4f1"},
-    {file = "contourpy-1.3.1-cp313-cp313t-win_amd64.whl", hash = "sha256:287ccc248c9e0d0566934e7d606201abd74761b5703d804ff3df8935f523d546"},
-    {file = "contourpy-1.3.1-pp310-pypy310_pp73-macosx_10_15_x86_64.whl", hash = "sha256:b457d6430833cee8e4b8e9b6f07aa1c161e5e0d52e118dc102c8f9bd7dd060d6"},
-    {file = "contourpy-1.3.1-pp310-pypy310_pp73-manylinux_2_17_x86_64.manylinux2014_x86_64.whl", hash = "sha256:cb76c1a154b83991a3cbbf0dfeb26ec2833ad56f95540b442c73950af2013750"},
-    {file = "contourpy-1.3.1-pp310-pypy310_pp73-win_amd64.whl", hash = "sha256:44a29502ca9c7b5ba389e620d44f2fbe792b1fb5734e8b931ad307071ec58c53"},
-    {file = "contourpy-1.3.1.tar.gz", hash = "sha256:dfd97abd83335045a913e3bcc4a09c0ceadbe66580cf573fe961f4a825efa699"},
-]
-
-[package.dependencies]
-numpy = ">=1.23"
-
-[package.extras]
-bokeh = ["bokeh", "selenium"]
-docs = ["furo", "sphinx (>=7.2)", "sphinx-copybutton"]
-mypy = ["contourpy[bokeh,docs]", "docutils-stubs", "mypy (==1.11.1)", "types-Pillow"]
-test = ["Pillow", "contourpy[test-no-images]", "matplotlib"]
-test-no-images = ["pytest", "pytest-cov", "pytest-rerunfailures", "pytest-xdist", "wurlitzer"]
+markers = {dev = "sys_platform == \"win32\"", training = "platform_system == \"Windows\""}
 
 [[package]]
 name = "coverage"
-version = "7.6.10"
+version = "7.14.2"
 description = "Code coverage measurement for Python"
 optional = false
-python-versions = ">=3.9"
+python-versions = ">=3.10"
 groups = ["dev"]
 files = [
-    {file = "coverage-7.6.10-cp310-cp310-macosx_10_9_x86_64.whl", hash = "sha256:5c912978f7fbf47ef99cec50c4401340436d200d41d714c7a4766f377c5b7b78"},
-    {file = "coverage-7.6.10-cp310-cp310-macosx_11_0_arm64.whl", hash = "sha256:a01ec4af7dfeb96ff0078ad9a48810bb0cc8abcb0115180c6013a6b26237626c"},
-    {file = "coverage-7.6.10-cp310-cp310-manylinux_2_17_aarch64.manylinux2014_aarch64.whl", hash = "sha256:a3b204c11e2b2d883946fe1d97f89403aa1811df28ce0447439178cc7463448a"},
-    {file = "coverage-7.6.10-cp310-cp310-manylinux_2_5_i686.manylinux1_i686.manylinux_2_17_i686.manylinux2014_i686.whl", hash = "sha256:32ee6d8491fcfc82652a37109f69dee9a830e9379166cb73c16d8dc5c2915165"},
-    {file = "coverage-7.6.10-cp310-cp310-manylinux_2_5_x86_64.manylinux1_x86_64.manylinux_2_17_x86_64.manylinux2014_x86_64.whl", hash = "sha256:675cefc4c06e3b4c876b85bfb7c59c5e2218167bbd4da5075cbe3b5790a28988"},
-    {file = "coverage-7.6.10-cp310-cp310-musllinux_1_2_aarch64.whl", hash = "sha256:f4f620668dbc6f5e909a0946a877310fb3d57aea8198bde792aae369ee1c23b5"},
-    {file = "coverage-7.6.10-cp310-cp310-musllinux_1_2_i686.whl", hash = "sha256:4eea95ef275de7abaef630c9b2c002ffbc01918b726a39f5a4353916ec72d2f3"},
-    {file = "coverage-7.6.10-cp310-cp310-musllinux_1_2_x86_64.whl", hash = "sha256:e2f0280519e42b0a17550072861e0bc8a80a0870de260f9796157d3fca2733c5"},
-    {file = "coverage-7.6.10-cp310-cp310-win32.whl", hash = "sha256:bc67deb76bc3717f22e765ab3e07ee9c7a5e26b9019ca19a3b063d9f4b874244"},
-    {file = "coverage-7.6.10-cp310-cp310-win_amd64.whl", hash = "sha256:0f460286cb94036455e703c66988851d970fdfd8acc2a1122ab7f4f904e4029e"},
-    {file = "coverage-7.6.10-cp311-cp311-macosx_10_9_x86_64.whl", hash = "sha256:ea3c8f04b3e4af80e17bab607c386a830ffc2fb88a5484e1df756478cf70d1d3"},
-    {file = "coverage-7.6.10-cp311-cp311-macosx_11_0_arm64.whl", hash = "sha256:507a20fc863cae1d5720797761b42d2d87a04b3e5aeb682ef3b7332e90598f43"},
-    {file = "coverage-7.6.10-cp311-cp311-manylinux_2_17_aarch64.manylinux2014_aarch64.whl", hash = "sha256:d37a84878285b903c0fe21ac8794c6dab58150e9359f1aaebbeddd6412d53132"},
-    {file = "coverage-7.6.10-cp311-cp311-manylinux_2_5_i686.manylinux1_i686.manylinux_2_17_i686.manylinux2014_i686.whl", hash = "sha256:a534738b47b0de1995f85f582d983d94031dffb48ab86c95bdf88dc62212142f"},
-    {file = "coverage-7.6.10-cp311-cp311-manylinux_2_5_x86_64.manylinux1_x86_64.manylinux_2_17_x86_64.manylinux2014_x86_64.whl", hash = "sha256:0d7a2bf79378d8fb8afaa994f91bfd8215134f8631d27eba3e0e2c13546ce994"},
-    {file = "coverage-7.6.10-cp311-cp311-musllinux_1_2_aarch64.whl", hash = "sha256:6713ba4b4ebc330f3def51df1d5d38fad60b66720948112f114968feb52d3f99"},
-    {file = "coverage-7.6.10-cp311-cp311-musllinux_1_2_i686.whl", hash = "sha256:ab32947f481f7e8c763fa2c92fd9f44eeb143e7610c4ca9ecd6a36adab4081bd"},
-    {file = "coverage-7.6.10-cp311-cp311-musllinux_1_2_x86_64.whl", hash = "sha256:7bbd8c8f1b115b892e34ba66a097b915d3871db7ce0e6b9901f462ff3a975377"},
-    {file = "coverage-7.6.10-cp311-cp311-win32.whl", hash = "sha256:299e91b274c5c9cdb64cbdf1b3e4a8fe538a7a86acdd08fae52301b28ba297f8"},
-    {file = "coverage-7.6.10-cp311-cp311-win_amd64.whl", hash = "sha256:489a01f94aa581dbd961f306e37d75d4ba16104bbfa2b0edb21d29b73be83609"},
-    {file = "coverage-7.6.10-cp312-cp312-macosx_10_13_x86_64.whl", hash = "sha256:27c6e64726b307782fa5cbe531e7647aee385a29b2107cd87ba7c0105a5d3853"},
-    {file = "coverage-7.6.10-cp312-cp312-macosx_11_0_arm64.whl", hash = "sha256:c56e097019e72c373bae32d946ecf9858fda841e48d82df7e81c63ac25554078"},
-    {file = "coverage-7.6.10-cp312-cp312-manylinux_2_17_aarch64.manylinux2014_aarch64.whl", hash = "sha256:c7827a5bc7bdb197b9e066cdf650b2887597ad124dd99777332776f7b7c7d0d0"},
-    {file = "coverage-7.6.10-cp312-cp312-manylinux_2_5_i686.manylinux1_i686.manylinux_2_17_i686.manylinux2014_i686.whl", hash = "sha256:204a8238afe787323a8b47d8be4df89772d5c1e4651b9ffa808552bdf20e1d50"},
-    {file = "coverage-7.6.10-cp312-cp312-manylinux_2_5_x86_64.manylinux1_x86_64.manylinux_2_17_x86_64.manylinux2014_x86_64.whl", hash = "sha256:e67926f51821b8e9deb6426ff3164870976fe414d033ad90ea75e7ed0c2e5022"},
-    {file = "coverage-7.6.10-cp312-cp312-musllinux_1_2_aarch64.whl", hash = "sha256:e78b270eadb5702938c3dbe9367f878249b5ef9a2fcc5360ac7bff694310d17b"},
-    {file = "coverage-7.6.10-cp312-cp312-musllinux_1_2_i686.whl", hash = "sha256:714f942b9c15c3a7a5fe6876ce30af831c2ad4ce902410b7466b662358c852c0"},
-    {file = "coverage-7.6.10-cp312-cp312-musllinux_1_2_x86_64.whl", hash = "sha256:abb02e2f5a3187b2ac4cd46b8ced85a0858230b577ccb2c62c81482ca7d18852"},
-    {file = "coverage-7.6.10-cp312-cp312-win32.whl", hash = "sha256:55b201b97286cf61f5e76063f9e2a1d8d2972fc2fcfd2c1272530172fd28c359"},
-    {file = "coverage-7.6.10-cp312-cp312-win_amd64.whl", hash = "sha256:e4ae5ac5e0d1e4edfc9b4b57b4cbecd5bc266a6915c500f358817a8496739247"},
-    {file = "coverage-7.6.10-cp313-cp313-macosx_10_13_x86_64.whl", hash = "sha256:05fca8ba6a87aabdd2d30d0b6c838b50510b56cdcfc604d40760dae7153b73d9"},
-    {file = "coverage-7.6.10-cp313-cp313-macosx_11_0_arm64.whl", hash = "sha256:9e80eba8801c386f72e0712a0453431259c45c3249f0009aff537a517b52942b"},
-    {file = "coverage-7.6.10-cp313-cp313-manylinux_2_17_aarch64.manylinux2014_aarch64.whl", hash = "sha256:a372c89c939d57abe09e08c0578c1d212e7a678135d53aa16eec4430adc5e690"},
-    {file = "coverage-7.6.10-cp313-cp313-manylinux_2_5_i686.manylinux1_i686.manylinux_2_17_i686.manylinux2014_i686.whl", hash = "sha256:ec22b5e7fe7a0fa8509181c4aac1db48f3dd4d3a566131b313d1efc102892c18"},
-    {file = "coverage-7.6.10-cp313-cp313-manylinux_2_5_x86_64.manylinux1_x86_64.manylinux_2_17_x86_64.manylinux2014_x86_64.whl", hash = "sha256:26bcf5c4df41cad1b19c84af71c22cbc9ea9a547fc973f1f2cc9a290002c8b3c"},
-    {file = "coverage-7.6.10-cp313-cp313-musllinux_1_2_aarch64.whl", hash = "sha256:4e4630c26b6084c9b3cb53b15bd488f30ceb50b73c35c5ad7871b869cb7365fd"},
-    {file = "coverage-7.6.10-cp313-cp313-musllinux_1_2_i686.whl", hash = "sha256:2396e8116db77789f819d2bc8a7e200232b7a282c66e0ae2d2cd84581a89757e"},
-    {file = "coverage-7.6.10-cp313-cp313-musllinux_1_2_x86_64.whl", hash = "sha256:79109c70cc0882e4d2d002fe69a24aa504dec0cc17169b3c7f41a1d341a73694"},
-    {file = "coverage-7.6.10-cp313-cp313-win32.whl", hash = "sha256:9e1747bab246d6ff2c4f28b4d186b205adced9f7bd9dc362051cc37c4a0c7bd6"},
-    {file = "coverage-7.6.10-cp313-cp313-win_amd64.whl", hash = "sha256:254f1a3b1eef5f7ed23ef265eaa89c65c8c5b6b257327c149db1ca9d4a35f25e"},
-    {file = "coverage-7.6.10-cp313-cp313t-macosx_10_13_x86_64.whl", hash = "sha256:2ccf240eb719789cedbb9fd1338055de2761088202a9a0b73032857e53f612fe"},
-    {file = "coverage-7.6.10-cp313-cp313t-macosx_11_0_arm64.whl", hash = "sha256:0c807ca74d5a5e64427c8805de15b9ca140bba13572d6d74e262f46f50b13273"},
-    {file = "coverage-7.6.10-cp313-cp313t-manylinux_2_17_aarch64.manylinux2014_aarch64.whl", hash = "sha256:2bcfa46d7709b5a7ffe089075799b902020b62e7ee56ebaed2f4bdac04c508d8"},
-    {file = "coverage-7.6.10-cp313-cp313t-manylinux_2_5_i686.manylinux1_i686.manylinux_2_17_i686.manylinux2014_i686.whl", hash = "sha256:4e0de1e902669dccbf80b0415fb6b43d27edca2fbd48c74da378923b05316098"},
-    {file = "coverage-7.6.10-cp313-cp313t-manylinux_2_5_x86_64.manylinux1_x86_64.manylinux_2_17_x86_64.manylinux2014_x86_64.whl", hash = "sha256:3f7b444c42bbc533aaae6b5a2166fd1a797cdb5eb58ee51a92bee1eb94a1e1cb"},
-    {file = "coverage-7.6.10-cp313-cp313t-musllinux_1_2_aarch64.whl", hash = "sha256:b330368cb99ef72fcd2dc3ed260adf67b31499584dc8a20225e85bfe6f6cfed0"},
-    {file = "coverage-7.6.10-cp313-cp313t-musllinux_1_2_i686.whl", hash = "sha256:9a7cfb50515f87f7ed30bc882f68812fd98bc2852957df69f3003d22a2aa0abf"},
-    {file = "coverage-7.6.10-cp313-cp313t-musllinux_1_2_x86_64.whl", hash = "sha256:6f93531882a5f68c28090f901b1d135de61b56331bba82028489bc51bdd818d2"},
-    {file = "coverage-7.6.10-cp313-cp313t-win32.whl", hash = "sha256:89d76815a26197c858f53c7f6a656686ec392b25991f9e409bcef020cd532312"},
-    {file = "coverage-7.6.10-cp313-cp313t-win_amd64.whl", hash = "sha256:54a5f0f43950a36312155dae55c505a76cd7f2b12d26abeebbe7a0b36dbc868d"},
-    {file = "coverage-7.6.10-cp39-cp39-macosx_10_9_x86_64.whl", hash = "sha256:656c82b8a0ead8bba147de9a89bda95064874c91a3ed43a00e687f23cc19d53a"},
-    {file = "coverage-7.6.10-cp39-cp39-macosx_11_0_arm64.whl", hash = "sha256:ccc2b70a7ed475c68ceb548bf69cec1e27305c1c2606a5eb7c3afff56a1b3b27"},
-    {file = "coverage-7.6.10-cp39-cp39-manylinux_2_17_aarch64.manylinux2014_aarch64.whl", hash = "sha256:a5e37dc41d57ceba70956fa2fc5b63c26dba863c946ace9705f8eca99daecdc4"},
-    {file = "coverage-7.6.10-cp39-cp39-manylinux_2_5_i686.manylinux1_i686.manylinux_2_17_i686.manylinux2014_i686.whl", hash = "sha256:0aa9692b4fdd83a4647eeb7db46410ea1322b5ed94cd1715ef09d1d5922ba87f"},
-    {file = "coverage-7.6.10-cp39-cp39-manylinux_2_5_x86_64.manylinux1_x86_64.manylinux_2_17_x86_64.manylinux2014_x86_64.whl", hash = "sha256:aa744da1820678b475e4ba3dfd994c321c5b13381d1041fe9c608620e6676e25"},
-    {file = "coverage-7.6.10-cp39-cp39-musllinux_1_2_aarch64.whl", hash = "sha256:c0b1818063dc9e9d838c09e3a473c1422f517889436dd980f5d721899e66f315"},
-    {file = "coverage-7.6.10-cp39-cp39-musllinux_1_2_i686.whl", hash = "sha256:59af35558ba08b758aec4d56182b222976330ef8d2feacbb93964f576a7e7a90"},
-    {file = "coverage-7.6.10-cp39-cp39-musllinux_1_2_x86_64.whl", hash = "sha256:7ed2f37cfce1ce101e6dffdfd1c99e729dd2ffc291d02d3e2d0af8b53d13840d"},
-    {file = "coverage-7.6.10-cp39-cp39-win32.whl", hash = "sha256:4bcc276261505d82f0ad426870c3b12cb177752834a633e737ec5ee79bbdff18"},
-    {file = "coverage-7.6.10-cp39-cp39-win_amd64.whl", hash = "sha256:457574f4599d2b00f7f637a0700a6422243b3565509457b2dbd3f50703e11f59"},
-    {file = "coverage-7.6.10-pp39.pp310-none-any.whl", hash = "sha256:fd34e7b3405f0cc7ab03d54a334c17a9e802897580d964bd8c2001f4b9fd488f"},
-    {file = "coverage-7.6.10.tar.gz", hash = "sha256:7fb105327c8f8f0682e29843e2ff96af9dcbe5bab8eeb4b398c6a33a16d80a23"},
+    {file = "coverage-7.14.2-cp310-cp310-macosx_10_9_x86_64.whl", hash = "sha256:59b75818e3046e9319143157f3dc4b43679a550c2060a17cbf3e39cc0b552925"},
+    {file = "coverage-7.14.2-cp310-cp310-macosx_11_0_arm64.whl", hash = "sha256:66b08ba4c5cbf0eaa2e9692b203073f198d5d469d8b15d1c7a4854ce7032b2e2"},
+    {file = "coverage-7.14.2-cp310-cp310-manylinux1_i686.manylinux_2_28_i686.manylinux_2_5_i686.whl", hash = "sha256:70f266b536c590060b707dddfb6cf9f17e24fd30b992242e774543d256265c43"},
+    {file = "coverage-7.14.2-cp310-cp310-manylinux1_x86_64.manylinux_2_28_x86_64.manylinux_2_5_x86_64.whl", hash = "sha256:fb40cac5b1a6378fdccc99268f1033112ee4636e4fd9aaf240f6930d1fcea12c"},
+    {file = "coverage-7.14.2-cp310-cp310-manylinux2014_aarch64.manylinux_2_17_aarch64.manylinux_2_28_aarch64.whl", hash = "sha256:9c301fe9990cb5c081bf4881cb498743807c8e0e93fad7b85c02788456492ef8"},
+    {file = "coverage-7.14.2-cp310-cp310-manylinux2014_ppc64le.manylinux_2_17_ppc64le.manylinux_2_28_ppc64le.whl", hash = "sha256:d67b0462c8a3c3d93033e7c79cacdfc57d08e5220d9115bcb24a23edf5a5900d"},
+    {file = "coverage-7.14.2-cp310-cp310-manylinux_2_31_riscv64.manylinux_2_39_riscv64.whl", hash = "sha256:0e763087828ee9644f0c89c57f9b75f0a50fdf3e8f5d8fac5cfc351337e89a99"},
+    {file = "coverage-7.14.2-cp310-cp310-musllinux_1_2_aarch64.whl", hash = "sha256:6d4da2baab6d96ceedd9176b3c142e1198b0310bc8dc04e18a3caab65c3a322c"},
+    {file = "coverage-7.14.2-cp310-cp310-musllinux_1_2_i686.whl", hash = "sha256:ab565a405bfdea61260145d8cc987aa66d1998fd0e0ccd4348008f4e6a39ee33"},
+    {file = "coverage-7.14.2-cp310-cp310-musllinux_1_2_ppc64le.whl", hash = "sha256:c13230b688fbb9122251b74daa092175811eb64cb7bd1c98e2c8193dfa2b0bd5"},
+    {file = "coverage-7.14.2-cp310-cp310-musllinux_1_2_riscv64.whl", hash = "sha256:014c83ba1ec97993cfe94e77fe6b56daa76bc0c218b86938971574c28942d044"},
+    {file = "coverage-7.14.2-cp310-cp310-musllinux_1_2_x86_64.whl", hash = "sha256:6caf54ffbf84b30470a8118f275afee9234e616572e4e41bae1dc19198c37294"},
+    {file = "coverage-7.14.2-cp310-cp310-win32.whl", hash = "sha256:4bf9d8a35f77df5638c61b5012ba5225109ec1cc15bc5eb097036b3c3cc939f3"},
+    {file = "coverage-7.14.2-cp310-cp310-win_amd64.whl", hash = "sha256:c1f17a8caebe0facd4556b1e0adfe0987c17feebed88e7bb6b5365c45c84c5d6"},
+    {file = "coverage-7.14.2-cp311-cp311-macosx_10_9_x86_64.whl", hash = "sha256:909f265c8c41f04c824bf741b2601fdcb56cab4bf56e018996b6494192ba0f58"},
+    {file = "coverage-7.14.2-cp311-cp311-macosx_11_0_arm64.whl", hash = "sha256:c8102deaf911938233f760426e6a5e287388521de95111d5c8de26c8a1028924"},
+    {file = "coverage-7.14.2-cp311-cp311-manylinux1_i686.manylinux_2_28_i686.manylinux_2_5_i686.whl", hash = "sha256:851f49e7bd7d1cdaf328f3133942b252d5e3d3380690131f423cba8e435b87f5"},
+    {file = "coverage-7.14.2-cp311-cp311-manylinux1_x86_64.manylinux_2_28_x86_64.manylinux_2_5_x86_64.whl", hash = "sha256:04cb445bed86aaf00aaa97d41a8b6e30f100f21e81c34caaec4efc684cb57768"},
+    {file = "coverage-7.14.2-cp311-cp311-manylinux2014_aarch64.manylinux_2_17_aarch64.manylinux_2_28_aarch64.whl", hash = "sha256:7471bc920d97c51c37ea8127f13b2adca43c3d78c53313b26a1f428e99d2c254"},
+    {file = "coverage-7.14.2-cp311-cp311-manylinux2014_ppc64le.manylinux_2_17_ppc64le.manylinux_2_28_ppc64le.whl", hash = "sha256:da5057e1bb257c967feee8ba67f3ebf379e801c7717f238b3d8c9caf00fc8f93"},
+    {file = "coverage-7.14.2-cp311-cp311-manylinux_2_31_riscv64.manylinux_2_39_riscv64.whl", hash = "sha256:33c0da852e8a40246cd8e20cf3b2fc17ca52a45e9b5f7983c93db26f5d24b87b"},
+    {file = "coverage-7.14.2-cp311-cp311-musllinux_1_2_aarch64.whl", hash = "sha256:f48a85bb437fab7782021c40bfee6b15146928b96960d008ace41b6901a0f21d"},
+    {file = "coverage-7.14.2-cp311-cp311-musllinux_1_2_i686.whl", hash = "sha256:f44e7579a769a21d5b5e3166916bfe30ee175aaffff750324cbb11be2dbec5ad"},
+    {file = "coverage-7.14.2-cp311-cp311-musllinux_1_2_ppc64le.whl", hash = "sha256:78853ca3c6ca2f012daa2b07dbabbb8db0f09d4dbe8ee828d294b3445d3f4cd8"},
+    {file = "coverage-7.14.2-cp311-cp311-musllinux_1_2_riscv64.whl", hash = "sha256:c9c2795ee3692097ff226ab806005d36bb9691fca9b35353542b57ea749cc830"},
+    {file = "coverage-7.14.2-cp311-cp311-musllinux_1_2_x86_64.whl", hash = "sha256:2f5cc48a845d755b6db236f8c29c2b54773eb4c7e4ee2ead43812d73718784b0"},
+    {file = "coverage-7.14.2-cp311-cp311-win32.whl", hash = "sha256:9c61cb7eaabcfa609c5bc0f5ff5869d72a2f02f17994e5fba5f971de516f3c82"},
+    {file = "coverage-7.14.2-cp311-cp311-win_amd64.whl", hash = "sha256:e715909b0966d1774d8a26e14e2f4a3ae75909dca526901c6306286b2dcbfbdc"},
+    {file = "coverage-7.14.2-cp311-cp311-win_arm64.whl", hash = "sha256:9193f7150937a4fd836b10eaa123e15d98e961d1fabac07e60adf2d4785f888a"},
+    {file = "coverage-7.14.2-cp312-cp312-macosx_10_13_x86_64.whl", hash = "sha256:37c94712e533ea06f0b1e4d934811c520b1914ce0e4da3916220717aa7a86bc6"},
+    {file = "coverage-7.14.2-cp312-cp312-macosx_11_0_arm64.whl", hash = "sha256:c050bbc7bba94c77e4ed7438f4fda1babe98ab145691d80aa6f60df934a1468b"},
+    {file = "coverage-7.14.2-cp312-cp312-manylinux1_i686.manylinux_2_28_i686.manylinux_2_5_i686.whl", hash = "sha256:a7af571767a2ee342a171c16fc1b1a07a0bf511606d381703fb7cf397fe49d46"},
+    {file = "coverage-7.14.2-cp312-cp312-manylinux1_x86_64.manylinux_2_28_x86_64.manylinux_2_5_x86_64.whl", hash = "sha256:8b4910cce599cd2438f8da65f5ef199a70a1cdb6ab314926df78271ca5954240"},
+    {file = "coverage-7.14.2-cp312-cp312-manylinux2014_aarch64.manylinux_2_17_aarch64.manylinux_2_28_aarch64.whl", hash = "sha256:c33e9e4878972f430b0cc06de3bf2a28d054a9efb4f8426d27de0d9cb81396ff"},
+    {file = "coverage-7.14.2-cp312-cp312-manylinux2014_ppc64le.manylinux_2_17_ppc64le.manylinux_2_28_ppc64le.whl", hash = "sha256:e7967ea55c6dea6becba4d5870e2fa0aa4915a8be7ebff1bb79e6207aa75ce8d"},
+    {file = "coverage-7.14.2-cp312-cp312-manylinux_2_31_riscv64.manylinux_2_39_riscv64.whl", hash = "sha256:d1322f237c2979b84096f4239c17828ff17fea6b3bbe96c44381c5f587c44c26"},
+    {file = "coverage-7.14.2-cp312-cp312-musllinux_1_2_aarch64.whl", hash = "sha256:77849525340c99f516d793dddbcee16b18d50af892ac43c8de1a6f343d41e3b5"},
+    {file = "coverage-7.14.2-cp312-cp312-musllinux_1_2_i686.whl", hash = "sha256:ef11695493ec3f06f7b2678ca274bcabb4ca04057317df268ddbfd8b05f661a8"},
+    {file = "coverage-7.14.2-cp312-cp312-musllinux_1_2_ppc64le.whl", hash = "sha256:8134f0e0723e080d1c27bbe8fc149f0162e429fa1852482150015d0fce83eaf1"},
+    {file = "coverage-7.14.2-cp312-cp312-musllinux_1_2_riscv64.whl", hash = "sha256:914eead2b843fc357f733b3fe39cc94f1b53d466e8cfe03080b1ed9d24ccfc73"},
+    {file = "coverage-7.14.2-cp312-cp312-musllinux_1_2_x86_64.whl", hash = "sha256:e4b2d5e847fb7958583b74910cc19e5ec4ece514487385677b26433b2546116e"},
+    {file = "coverage-7.14.2-cp312-cp312-win32.whl", hash = "sha256:e753db9e40dda7302e0ac3e1e6e1325fb7f7b4694f87a7314ab15dd5d57911a7"},
+    {file = "coverage-7.14.2-cp312-cp312-win_amd64.whl", hash = "sha256:d32e5ca5f16dafb269ee50b60d32b00c704b3f6f78e238105f1d94a3a5f24bf5"},
+    {file = "coverage-7.14.2-cp312-cp312-win_arm64.whl", hash = "sha256:dc366f158e2fb2add9d4e57338ca48f12611024278688ee657eb0b853fcb5de5"},
+    {file = "coverage-7.14.2-cp313-cp313-macosx_10_13_x86_64.whl", hash = "sha256:e5f077641a6713ce9d38df9e85d4fb9e008677fc0775cbaeb32ddfc3b319d4ca"},
+    {file = "coverage-7.14.2-cp313-cp313-macosx_11_0_arm64.whl", hash = "sha256:0907f39b49ae818fe8af50aaa0f19afbc8ca164aea0865181ca7af17a3ac690b"},
+    {file = "coverage-7.14.2-cp313-cp313-manylinux1_i686.manylinux_2_28_i686.manylinux_2_5_i686.whl", hash = "sha256:5734d47669118d75c28981e562d4530ceb77342d31ffef6def5edd5ad4f05d7b"},
+    {file = "coverage-7.14.2-cp313-cp313-manylinux1_x86_64.manylinux_2_28_x86_64.manylinux_2_5_x86_64.whl", hash = "sha256:1d9a1b5813d00ea6151f6ccf64d1fa16892771dfdda12ba87162d15ec4ea3e1e"},
+    {file = "coverage-7.14.2-cp313-cp313-manylinux2014_aarch64.manylinux_2_17_aarch64.manylinux_2_28_aarch64.whl", hash = "sha256:9f0a80f4c8ac3f774210b1cc1bc0e31e75502f2818dda9a144ff90e702c4d91d"},
+    {file = "coverage-7.14.2-cp313-cp313-manylinux2014_ppc64le.manylinux_2_17_ppc64le.manylinux_2_28_ppc64le.whl", hash = "sha256:c2e66f3f22d6c1515ce70f2e7c3e9c6f3ff0ff33480125c9f9c53e8f6508e30f"},
+    {file = "coverage-7.14.2-cp313-cp313-manylinux_2_31_riscv64.manylinux_2_39_riscv64.whl", hash = "sha256:6a2c37c3114f87ca7f10113756026eecb49656514debad600dcbec21f355ccea"},
+    {file = "coverage-7.14.2-cp313-cp313-musllinux_1_2_aarch64.whl", hash = "sha256:3b16a7959d04b1497281c062c180413565c3f3469211d78799ad5b9a75f67796"},
+    {file = "coverage-7.14.2-cp313-cp313-musllinux_1_2_i686.whl", hash = "sha256:6466c6999545cf00c4c142dfcbbf2db396dc735f005dcf8f91d57e351a79472b"},
+    {file = "coverage-7.14.2-cp313-cp313-musllinux_1_2_ppc64le.whl", hash = "sha256:5c60915ebb8f562317ba5ff6b8c32e25c0882289b201a9f2fb2987f91efd95d8"},
+    {file = "coverage-7.14.2-cp313-cp313-musllinux_1_2_riscv64.whl", hash = "sha256:33b830850488acbcd358c78a4fecfafe7031667b4da8ddff5546295dc962cdeb"},
+    {file = "coverage-7.14.2-cp313-cp313-musllinux_1_2_x86_64.whl", hash = "sha256:d0f845539230b8269aec902bc978b0cc403f52f002d18a04492efc943404d0bc"},
+    {file = "coverage-7.14.2-cp313-cp313-win32.whl", hash = "sha256:a8ac51a2e441e9119b9395f4d893fbc4934c64c8ba58be9b9eaa85591249e548"},
+    {file = "coverage-7.14.2-cp313-cp313-win_amd64.whl", hash = "sha256:039b264cdb31c44b48f9821e2afbf8f37df49e0fb837e24a942918b36c567e31"},
+    {file = "coverage-7.14.2-cp313-cp313-win_arm64.whl", hash = "sha256:7f2ef591e381cc36b8e53334e1b842c760c520c8a52d01e8626209400e93fe6a"},
+    {file = "coverage-7.14.2-cp314-cp314-macosx_10_15_x86_64.whl", hash = "sha256:7a0d1f026b72d627fa5c8a57cbc86ad209b64aa2a65833c83b290ace5cbee126"},
+    {file = "coverage-7.14.2-cp314-cp314-macosx_11_0_arm64.whl", hash = "sha256:4d2b86f81c1c9310a7e774e3cc9e927a3d0bf583ecbfa01498dd626930025428"},
+    {file = "coverage-7.14.2-cp314-cp314-manylinux1_i686.manylinux_2_28_i686.manylinux_2_5_i686.whl", hash = "sha256:d76bdc1f9396ae70a55d050cf9743d88141c62ce0a22a3f627fab1d11c2f8bc6"},
+    {file = "coverage-7.14.2-cp314-cp314-manylinux1_x86_64.manylinux_2_28_x86_64.manylinux_2_5_x86_64.whl", hash = "sha256:cda36d8e7bfd63b3e44e75163265429caa5d935b672b00f71bccc8c010518c64"},
+    {file = "coverage-7.14.2-cp314-cp314-manylinux2014_aarch64.manylinux_2_17_aarch64.manylinux_2_28_aarch64.whl", hash = "sha256:0904f3b79d7b845bef0715afe1900da634d12b97f05b9479cb472880ca07cb9c"},
+    {file = "coverage-7.14.2-cp314-cp314-manylinux2014_ppc64le.manylinux_2_17_ppc64le.manylinux_2_28_ppc64le.whl", hash = "sha256:b6795ca4198d6cb7fc2c6163214f6555a6bc5f0ae1e268e76139dec4b37c4499"},
+    {file = "coverage-7.14.2-cp314-cp314-manylinux_2_31_riscv64.manylinux_2_39_riscv64.whl", hash = "sha256:c41e9b60fc0fa57f5d73306417d2f9d668202cca6944f9435878c55a5e7ae213"},
+    {file = "coverage-7.14.2-cp314-cp314-musllinux_1_2_aarch64.whl", hash = "sha256:419d2aadd5746efc2e9df0f33c05570d8192e6f6a6098ab05acce586f44ce8a5"},
+    {file = "coverage-7.14.2-cp314-cp314-musllinux_1_2_i686.whl", hash = "sha256:1c5d273c5f1411c0d26c4f066c398d4a434b1f97bb5fa409189bedce86d4add4"},
+    {file = "coverage-7.14.2-cp314-cp314-musllinux_1_2_ppc64le.whl", hash = "sha256:5fe465bc691264adce601527a972990c1174075d86bcbe9968fd20c95e0b1948"},
+    {file = "coverage-7.14.2-cp314-cp314-musllinux_1_2_riscv64.whl", hash = "sha256:6fbb61617af1c56f95d53170ae9fa6c9aef6de1abd02fcc50064bfc672efb18d"},
+    {file = "coverage-7.14.2-cp314-cp314-musllinux_1_2_x86_64.whl", hash = "sha256:e1eff22b831dfd5694989cc1f0789980f18391f614ac67c851af9a8e6d25e9ba"},
+    {file = "coverage-7.14.2-cp314-cp314-win32.whl", hash = "sha256:58e91be0a233adef698d3e6be54f10401bb91fd7854c0d4c4d50e0d3711e72f1"},
+    {file = "coverage-7.14.2-cp314-cp314-win_amd64.whl", hash = "sha256:d8429bf97906bfe6c61f9dbfb3342e0d88120da61939da8bd04f830cc3eab3b8"},
+    {file = "coverage-7.14.2-cp314-cp314-win_arm64.whl", hash = "sha256:13609d9d77249447aa73357b14831b0f3b95f275026c9ff20dd105f981f53a0c"},
+    {file = "coverage-7.14.2-cp314-cp314t-macosx_10_15_x86_64.whl", hash = "sha256:9818486c2bac88ae931df7e04905ee29bef49fd218c00f5f02bed4855254a101"},
+    {file = "coverage-7.14.2-cp314-cp314t-macosx_11_0_arm64.whl", hash = "sha256:58055adffabfa243516a197aa9f85f0dd56d905b0fba1a10193269759c29ccb0"},
+    {file = "coverage-7.14.2-cp314-cp314t-manylinux1_i686.manylinux_2_28_i686.manylinux_2_5_i686.whl", hash = "sha256:535747dbc200349d7fb434cffcb28e770f0290f69b225f56dc3803aa7210cdea"},
+    {file = "coverage-7.14.2-cp314-cp314t-manylinux1_x86_64.manylinux_2_28_x86_64.manylinux_2_5_x86_64.whl", hash = "sha256:420c66e35d85c0ca5dc6a38147d83ef239762542900e5921ebbdb89333c540ea"},
+    {file = "coverage-7.14.2-cp314-cp314t-manylinux2014_aarch64.manylinux_2_17_aarch64.manylinux_2_28_aarch64.whl", hash = "sha256:f2cf17b33773be446a588551ea6a746b2d70dd0bc90dc31f1dd7648975a63c6b"},
+    {file = "coverage-7.14.2-cp314-cp314t-manylinux2014_ppc64le.manylinux_2_17_ppc64le.manylinux_2_28_ppc64le.whl", hash = "sha256:adb4a5fef041f7179bb264203add873c147d169cf2f8d0adae89ff2e51271bac"},
+    {file = "coverage-7.14.2-cp314-cp314t-manylinux_2_31_riscv64.manylinux_2_39_riscv64.whl", hash = "sha256:9c012ec357dec9408a83dad5541172a63c5cfa1421709f2e5811480d31ae1b28"},
+    {file = "coverage-7.14.2-cp314-cp314t-musllinux_1_2_aarch64.whl", hash = "sha256:dacd0ecd08fda3cb2f85b60cabea7da326dcb2fc15fbb23a88830a80144cc9f2"},
+    {file = "coverage-7.14.2-cp314-cp314t-musllinux_1_2_i686.whl", hash = "sha256:f27e980f2feba5dfe7a32b22b125470de69c0bd113c75e16165de909a777f512"},
+    {file = "coverage-7.14.2-cp314-cp314t-musllinux_1_2_ppc64le.whl", hash = "sha256:105c00efb65c863630b2b63cbf7b8267e4da2d44b62284efbb19a03b04c337d4"},
+    {file = "coverage-7.14.2-cp314-cp314t-musllinux_1_2_riscv64.whl", hash = "sha256:571173fa04c8e8d6235ab32ae67fecca97777e2e1b4a1a30f3022c34e397c1c1"},
+    {file = "coverage-7.14.2-cp314-cp314t-musllinux_1_2_x86_64.whl", hash = "sha256:e532f34d42d1a421fa00ed6b7735d14ac2e340256c1bad26a5e1dc1252b0bed7"},
+    {file = "coverage-7.14.2-cp314-cp314t-win32.whl", hash = "sha256:243971550fb46c3039257f75e65610002d84304c505f609bbd9779e20a653a0a"},
+    {file = "coverage-7.14.2-cp314-cp314t-win_amd64.whl", hash = "sha256:60fb0ca084a92da96474b8b405a7ea76dfecac3c68db54383e7934b6f3871169"},
+    {file = "coverage-7.14.2-cp314-cp314t-win_arm64.whl", hash = "sha256:36a0a3f42ed7dfdbca2a69a541519ffd5064a5692152fc0018109e74370d7345"},
+    {file = "coverage-7.14.2-py3-none-any.whl", hash = "sha256:04d92589e481a8b68a005a5a1e0646a91c76f322c397c4635298c57cf63699b5"},
+    {file = "coverage-7.14.2.tar.gz", hash = "sha256:7a2da3d81cfe17c18038c6d98e6592aa9147d596d056119b0ee612c3c8bd5230"},
 ]
 
 [package.extras]
 toml = ["tomli ; python_full_version <= \"3.11.0a6\""]
 
-[[package]]
-name = "cpm-kernels"
-version = "1.0.11"
-description = "CPM CUDA kernels"
-optional = false
-python-versions = "*"
-groups = ["main"]
-files = [
-    {file = "cpm_kernels-1.0.11-py3-none-any.whl", hash = "sha256:eab7f211f3b3f6a0686ded4c15cd7d9158393cdf69a931fa5b96a5fbcd366822"},
-]
-
 [[package]]
 name = "cryptography"
-version = "44.0.0"
+version = "49.0.0"
 description = "cryptography is a package which provides cryptographic recipes and primitives to Python developers."
 optional = false
-python-versions = "!=3.9.0,!=3.9.1,>=3.7"
-groups = ["main"]
-files = [
-    {file = "cryptography-44.0.0-cp37-abi3-macosx_10_9_universal2.whl", hash = "sha256:84111ad4ff3f6253820e6d3e58be2cc2a00adb29335d4cacb5ab4d4d34f2a123"},
-    {file = "cryptography-44.0.0-cp37-abi3-manylinux_2_17_aarch64.manylinux2014_aarch64.whl", hash = "sha256:b15492a11f9e1b62ba9d73c210e2416724633167de94607ec6069ef724fad092"},
-    {file = "cryptography-44.0.0-cp37-abi3-manylinux_2_17_x86_64.manylinux2014_x86_64.whl", hash = "sha256:831c3c4d0774e488fdc83a1923b49b9957d33287de923d58ebd3cec47a0ae43f"},
-    {file = "cryptography-44.0.0-cp37-abi3-manylinux_2_28_aarch64.whl", hash = "sha256:761817a3377ef15ac23cd7834715081791d4ec77f9297ee694ca1ee9c2c7e5eb"},
-    {file = "cryptography-44.0.0-cp37-abi3-manylinux_2_28_x86_64.whl", hash = "sha256:3c672a53c0fb4725a29c303be906d3c1fa99c32f58abe008a82705f9ee96f40b"},
-    {file = "cryptography-44.0.0-cp37-abi3-manylinux_2_34_aarch64.whl", hash = "sha256:4ac4c9f37eba52cb6fbeaf5b59c152ea976726b865bd4cf87883a7e7006cc543"},
-    {file = "cryptography-44.0.0-cp37-abi3-musllinux_1_2_aarch64.whl", hash = "sha256:ed3534eb1090483c96178fcb0f8893719d96d5274dfde98aa6add34614e97c8e"},
-    {file = "cryptography-44.0.0-cp37-abi3-musllinux_1_2_x86_64.whl", hash = "sha256:f3f6fdfa89ee2d9d496e2c087cebef9d4fcbb0ad63c40e821b39f74bf48d9c5e"},
-    {file = "cryptography-44.0.0-cp37-abi3-win32.whl", hash = "sha256:eb33480f1bad5b78233b0ad3e1b0be21e8ef1da745d8d2aecbb20671658b9053"},
-    {file = "cryptography-44.0.0-cp37-abi3-win_amd64.whl", hash = "sha256:abc998e0c0eee3c8a1904221d3f67dcfa76422b23620173e28c11d3e626c21bd"},
-    {file = "cryptography-44.0.0-cp39-abi3-macosx_10_9_universal2.whl", hash = "sha256:660cb7312a08bc38be15b696462fa7cc7cd85c3ed9c576e81f4dc4d8b2b31591"},
-    {file = "cryptography-44.0.0-cp39-abi3-manylinux_2_17_aarch64.manylinux2014_aarch64.whl", hash = "sha256:1923cb251c04be85eec9fda837661c67c1049063305d6be5721643c22dd4e2b7"},
-    {file = "cryptography-44.0.0-cp39-abi3-manylinux_2_17_x86_64.manylinux2014_x86_64.whl", hash = "sha256:404fdc66ee5f83a1388be54300ae978b2efd538018de18556dde92575e05defc"},
-    {file = "cryptography-44.0.0-cp39-abi3-manylinux_2_28_aarch64.whl", hash = "sha256:c5eb858beed7835e5ad1faba59e865109f3e52b3783b9ac21e7e47dc5554e289"},
-    {file = "cryptography-44.0.0-cp39-abi3-manylinux_2_28_x86_64.whl", hash = "sha256:f53c2c87e0fb4b0c00fa9571082a057e37690a8f12233306161c8f4b819960b7"},
-    {file = "cryptography-44.0.0-cp39-abi3-manylinux_2_34_aarch64.whl", hash = "sha256:9e6fc8a08e116fb7c7dd1f040074c9d7b51d74a8ea40d4df2fc7aa08b76b9e6c"},
-    {file = "cryptography-44.0.0-cp39-abi3-musllinux_1_2_aarch64.whl", hash = "sha256:d2436114e46b36d00f8b72ff57e598978b37399d2786fd39793c36c6d5cb1c64"},
-    {file = "cryptography-44.0.0-cp39-abi3-musllinux_1_2_x86_64.whl", hash = "sha256:a01956ddfa0a6790d594f5b34fc1bfa6098aca434696a03cfdbe469b8ed79285"},
-    {file = "cryptography-44.0.0-cp39-abi3-win32.whl", hash = "sha256:eca27345e1214d1b9f9490d200f9db5a874479be914199194e746c893788d417"},
-    {file = "cryptography-44.0.0-cp39-abi3-win_amd64.whl", hash = "sha256:708ee5f1bafe76d041b53a4f95eb28cdeb8d18da17e597d46d7833ee59b97ede"},
-    {file = "cryptography-44.0.0-pp310-pypy310_pp73-macosx_10_9_x86_64.whl", hash = "sha256:37d76e6863da3774cd9db5b409a9ecfd2c71c981c38788d3fcfaf177f447b731"},
-    {file = "cryptography-44.0.0-pp310-pypy310_pp73-manylinux_2_28_aarch64.whl", hash = "sha256:f677e1268c4e23420c3acade68fac427fffcb8d19d7df95ed7ad17cdef8404f4"},
-    {file = "cryptography-44.0.0-pp310-pypy310_pp73-manylinux_2_28_x86_64.whl", hash = "sha256:f5e7cb1e5e56ca0933b4873c0220a78b773b24d40d186b6738080b73d3d0a756"},
-    {file = "cryptography-44.0.0-pp310-pypy310_pp73-manylinux_2_34_aarch64.whl", hash = "sha256:8b3e6eae66cf54701ee7d9c83c30ac0a1e3fa17be486033000f2a73a12ab507c"},
-    {file = "cryptography-44.0.0-pp310-pypy310_pp73-manylinux_2_34_x86_64.whl", hash = "sha256:be4ce505894d15d5c5037167ffb7f0ae90b7be6f2a98f9a5c3442395501c32fa"},
-    {file = "cryptography-44.0.0-pp310-pypy310_pp73-win_amd64.whl", hash = "sha256:62901fb618f74d7d81bf408c8719e9ec14d863086efe4185afd07c352aee1d2c"},
-    {file = "cryptography-44.0.0.tar.gz", hash = "sha256:cd4e834f340b4293430701e772ec543b0fbe6c2dea510a5286fe0acabe153a02"},
+python-versions = "!=3.9.0,!=3.9.1,>=3.9"
+groups = ["main"]
+files = [
+    {file = "cryptography-49.0.0-cp311-abi3-macosx_11_0_arm64.whl", hash = "sha256:966fe0e9c67490071f14c0d2b1cb2dfb3023c5ce39457343931415f08382f2db"},
+    {file = "cryptography-49.0.0-cp311-abi3-manylinux2014_aarch64.manylinux_2_17_aarch64.whl", hash = "sha256:36d1709f992593689b45bda411498d62c6e365f2ca00b84657d4dadd24de16db"},
+    {file = "cryptography-49.0.0-cp311-abi3-manylinux2014_x86_64.manylinux_2_17_x86_64.whl", hash = "sha256:0e959b578856a3924bc0cbb710fc12c387b9412a951389f3ca61704a9e25f325"},
+    {file = "cryptography-49.0.0-cp311-abi3-manylinux_2_28_aarch64.whl", hash = "sha256:53ecee2e23f7169b6117e99fc8a944e5e50f79e69758a83b52a00cb98ab2b2d2"},
+    {file = "cryptography-49.0.0-cp311-abi3-manylinux_2_28_ppc64le.whl", hash = "sha256:2eda353d8a27bcbcaa4cbed18994a74ab4d19a2ca897db188ea269ab9b71419b"},
+    {file = "cryptography-49.0.0-cp311-abi3-manylinux_2_28_x86_64.whl", hash = "sha256:2afe9051da7ae7bd5905da5a949280c7d2bb75682e188f650a9d0f2756b834c6"},
+    {file = "cryptography-49.0.0-cp311-abi3-manylinux_2_31_armv7l.whl", hash = "sha256:0b82e28ee398a386f0807bba7884d30f25218855690f45115831bcce5d90822c"},
+    {file = "cryptography-49.0.0-cp311-abi3-manylinux_2_34_aarch64.whl", hash = "sha256:ccac2bfebc306b862133e3bb71f3f6ee8bb525240089b2d952e4144b3a6d5da7"},
+    {file = "cryptography-49.0.0-cp311-abi3-manylinux_2_34_ppc64le.whl", hash = "sha256:d0527ce944105f257f605a827d6ebead966c752038b6e8656abb9c5edee6fc68"},
+    {file = "cryptography-49.0.0-cp311-abi3-manylinux_2_34_x86_64.whl", hash = "sha256:cbc77da8c523d5abd028635ba850a6966fcee2c82e2bf65a41d1d8afe0f98be9"},
+    {file = "cryptography-49.0.0-cp311-abi3-musllinux_1_2_aarch64.whl", hash = "sha256:b87e65d263b3e5d3bb92a57e2a6638e2f31110fa7aa890c7b2dbba42248d0a3f"},
+    {file = "cryptography-49.0.0-cp311-abi3-musllinux_1_2_x86_64.whl", hash = "sha256:66ec79c3904820572d7e987abdf304281f141d37ad9a489b8e97066e7b9b6459"},
+    {file = "cryptography-49.0.0-cp311-abi3-win_amd64.whl", hash = "sha256:e5dfc1e64de5677cec922ffa8da89c546d0415bf6efdf081842e5d44c84e1f0e"},
+    {file = "cryptography-49.0.0-cp314-cp314t-macosx_11_0_arm64.whl", hash = "sha256:73a205dce83953d131a4aa1e0fd917a2fd1c5b1eef251e9d7152efefcbf5caf7"},
+    {file = "cryptography-49.0.0-cp314-cp314t-manylinux2014_aarch64.manylinux_2_17_aarch64.whl", hash = "sha256:196ecd6a36e4e9aa10270393bb98d8df88fccee0bf1e5128b91ae4eb4375896d"},
+    {file = "cryptography-49.0.0-cp314-cp314t-manylinux2014_x86_64.manylinux_2_17_x86_64.whl", hash = "sha256:7abcee80084cda3f7691f3eb1ce480d8df49cec637b429aa35986c1de71738aa"},
+    {file = "cryptography-49.0.0-cp314-cp314t-manylinux_2_28_aarch64.whl", hash = "sha256:4ae387c9cb68ea569ca17e490d66d8142b81c3cc814bf179974b7d146e490bbb"},
+    {file = "cryptography-49.0.0-cp314-cp314t-manylinux_2_28_ppc64le.whl", hash = "sha256:f37d847238971164fdbc68ade6f6574aecc9c0af714190e2083429ff68f4ce9d"},
+    {file = "cryptography-49.0.0-cp314-cp314t-manylinux_2_28_x86_64.whl", hash = "sha256:c2bc30226390d60ea19d9f82b19db005fe0452154a23c1c410c12ea801e43561"},
+    {file = "cryptography-49.0.0-cp314-cp314t-manylinux_2_31_armv7l.whl", hash = "sha256:07cab27cc7b7e0fd28e5e26bb9eeedde5c135c868b46de4a27845abe94af6122"},
+    {file = "cryptography-49.0.0-cp314-cp314t-manylinux_2_34_aarch64.whl", hash = "sha256:b20133d204d2bb56ba047642199603876c872026ca53e79c35b83772ab2cc505"},
+    {file = "cryptography-49.0.0-cp314-cp314t-manylinux_2_34_ppc64le.whl", hash = "sha256:b970c6da94d5bb18629db453d14f2a1300f6bf59b61e9b82377931ef95504866"},
+    {file = "cryptography-49.0.0-cp314-cp314t-manylinux_2_34_x86_64.whl", hash = "sha256:d8ecde755e2e91bf773fc94e8c9d730cd7f2007004cb492263a794ec3899a1c8"},
+    {file = "cryptography-49.0.0-cp314-cp314t-musllinux_1_2_aarch64.whl", hash = "sha256:e3fb64c420688e5319ae25113a354015abbd8dffbfbc41781a1ea66fc7622ac3"},
+    {file = "cryptography-49.0.0-cp314-cp314t-musllinux_1_2_x86_64.whl", hash = "sha256:32703d93296f5c1f4b53349ad3a250c2cae0fdecd3a3dd5d47e616d8d616af27"},
+    {file = "cryptography-49.0.0-cp314-cp314t-win_amd64.whl", hash = "sha256:33cd0565932807baddb67b96dbee92f2c374b5c89dee09fd74079aeb8c8dba61"},
+    {file = "cryptography-49.0.0-cp39-abi3-macosx_11_0_arm64.whl", hash = "sha256:ec5e529fb80935c94fe7b729f9972b50e351a0e6b50aa294fd5cabb109fcc29a"},
+    {file = "cryptography-49.0.0-cp39-abi3-manylinux2014_aarch64.manylinux_2_17_aarch64.whl", hash = "sha256:f78ff2c9ed8dc2d036b0f4d640e22522213d047c1b14e61205a7e55c80a494d4"},
+    {file = "cryptography-49.0.0-cp39-abi3-manylinux2014_x86_64.manylinux_2_17_x86_64.whl", hash = "sha256:35b151772baff2c74cba7fa290ceaff4c3b11c0c881eb93eb5dbc05a7cfbba18"},
+    {file = "cryptography-49.0.0-cp39-abi3-manylinux_2_28_aarch64.whl", hash = "sha256:0f21641cf4b30fca7aee061ced0ec7ad7b073518088b7c9969a297c0ae796c69"},
+    {file = "cryptography-49.0.0-cp39-abi3-manylinux_2_28_ppc64le.whl", hash = "sha256:9e82dcc8e56052715fb18b2429e3bca4823b1629136a2084fc45a9a5cecb9b64"},
+    {file = "cryptography-49.0.0-cp39-abi3-manylinux_2_28_x86_64.whl", hash = "sha256:6f2debedf9ca60cf1d5bd466475638af5130f89965605cd818484d19987d3a21"},
+    {file = "cryptography-49.0.0-cp39-abi3-manylinux_2_31_armv7l.whl", hash = "sha256:8c25ceb16df5b9435f3f6a9829204985b0e0cbee3b48aacd432c7d2c850b44d9"},
+    {file = "cryptography-49.0.0-cp39-abi3-manylinux_2_34_aarch64.whl", hash = "sha256:28d8b15e6275f12c8a207dc309dfa957903c927d08d0cc937ee3f63f200693cc"},
+    {file = "cryptography-49.0.0-cp39-abi3-manylinux_2_34_ppc64le.whl", hash = "sha256:6fc361c34fb6aac015ce19435876635e5c6d21db31998b0920f675f131e043b8"},
+    {file = "cryptography-49.0.0-cp39-abi3-manylinux_2_34_x86_64.whl", hash = "sha256:2400ef9c9e2299a25614eb1dea3db54a69b1349efd043bfac9c67630d136df36"},
+    {file = "cryptography-49.0.0-cp39-abi3-musllinux_1_2_aarch64.whl", hash = "sha256:67e1d20ad9ef3a563c59ef22e7a8a0b8210bd26604369ea4a30a7c66aefe504e"},
+    {file = "cryptography-49.0.0-cp39-abi3-musllinux_1_2_x86_64.whl", hash = "sha256:42b0684e0e40cf26122427802486f6d93aea593612603a94fbf260c7eb1e9c1b"},
+    {file = "cryptography-49.0.0-cp39-abi3-win_amd64.whl", hash = "sha256:026ac7423e6fa66872d3bf889be5974507da3944f866f704fa200eadacd00001"},
+    {file = "cryptography-49.0.0-pp311-pypy311_pp73-macosx_11_0_arm64.whl", hash = "sha256:fc1e275c2f1d97b1a6450b8b0ea3ebfa6e087a611c2b26cb2404d48588abab7b"},
+    {file = "cryptography-49.0.0-pp311-pypy311_pp73-manylinux_2_28_aarch64.whl", hash = "sha256:c83782480a4a9da4d0feb51950131ba32e12e70813848b3343f6e18c28a66838"},
+    {file = "cryptography-49.0.0-pp311-pypy311_pp73-manylinux_2_28_x86_64.whl", hash = "sha256:b39efa323140595abd3ecca8529d321ae50f55f3aa3ba9cc81ea56a6011953d5"},
+    {file = "cryptography-49.0.0-pp311-pypy311_pp73-manylinux_2_34_aarch64.whl", hash = "sha256:b47db11c2c3525083296069b98ac5221907455e989ae0c2e3008bde851921615"},
+    {file = "cryptography-49.0.0-pp311-pypy311_pp73-manylinux_2_34_x86_64.whl", hash = "sha256:084ef1af862eb07ec46d25f68689f2102a9fc0e05ce7b80f14f5fe51e4eef0f6"},
+    {file = "cryptography-49.0.0-pp311-pypy311_pp73-win_amd64.whl", hash = "sha256:be9fcb48a55f023493482827d4f459bd263cc20efde64f204b97c123201850c6"},
+    {file = "cryptography-49.0.0.tar.gz", hash = "sha256:f89660a348f4f78a92366240a61404e337586ef7f5909a2fef59ca88ef505493"},
 ]
 
 [package.dependencies]
-cffi = {version = ">=1.12", markers = "platform_python_implementation != \"PyPy\""}
+cffi = {version = ">=2.0.0", markers = "platform_python_implementation != \"PyPy\""}
 
 [package.extras]
-docs = ["sphinx (>=5.3.0)", "sphinx-rtd-theme (>=3.0.0) ; python_version >= \"3.8\""]
-docstest = ["pyenchant (>=3)", "readme-renderer (>=30.0)", "sphinxcontrib-spelling (>=7.3.1)"]
-nox = ["nox (>=2024.4.15)", "nox[uv] (>=2024.3.2) ; python_version >= \"3.8\""]
-pep8test = ["check-sdist ; python_version >= \"3.8\"", "click (>=8.0.1)", "mypy (>=1.4)", "ruff (>=0.3.6)"]
-sdist = ["build (>=1.0.0)"]
 ssh = ["bcrypt (>=3.1.5)"]
-test = ["certifi (>=2024)", "cryptography-vectors (==44.0.0)", "pretend (>=0.7)", "pytest (>=7.4.0)", "pytest-benchmark (>=4.0)", "pytest-cov (>=2.10.1)", "pytest-xdist (>=3.5.0)"]
-test-randomorder = ["pytest-randomly"]
 
 [[package]]
-name = "cycler"
-version = "0.12.1"
-description = "Composable style cycles"
+name = "cyclopts"
+version = "3.24.0"
+description = "Intuitive, easy CLIs based on type hints."
 optional = false
-python-versions = ">=3.8"
+python-versions = ">=3.9"
 groups = ["main"]
 files = [
-    {file = "cycler-0.12.1-py3-none-any.whl", hash = "sha256:85cef7cff222d8644161529808465972e51340599459b8ac3ccbac5a854e0d30"},
-    {file = "cycler-0.12.1.tar.gz", hash = "sha256:88bb128f02ba341da8ef447245a9e138fae777f6a23943da4540077d3601eb1c"},
+    {file = "cyclopts-3.24.0-py3-none-any.whl", hash = "sha256:809d04cde9108617106091140c3964ee6fceb33cecdd537f7ffa360bde13ed71"},
+    {file = "cyclopts-3.24.0.tar.gz", hash = "sha256:de6964a041dfb3c57bf043b41e68c43548227a17de1bad246e3a0bfc5c4b7417"},
 ]
 
+[package.dependencies]
+attrs = ">=23.1.0"
+docstring-parser = {version = ">=0.15", markers = "python_version < \"4.0\""}
+rich = ">=13.6.0"
+rich-rst = ">=1.3.1,<2.0.0"
+
 [package.extras]
-docs = ["ipython", "matplotlib", "numpydoc", "sphinx"]
-tests = ["pytest", "pytest-cov", "pytest-xdist"]
+toml = ["tomli (>=2.0.0) ; python_version < \"3.11\""]
+trio = ["trio (>=0.10.0)"]
+yaml = ["pyyaml (>=6.0.1)"]
 
 [[package]]
 name = "dashscope"
-version = "1.23.0"
+version = "1.25.23"
 description = "dashscope client sdk library"
 optional = false
 python-versions = ">=3.8.0"
 groups = ["main"]
 files = [
-    {file = "dashscope-1.23.0-py3-none-any.whl", hash = "sha256:887a238e970ccca035b1554fbb2606662a8b557d8533b8afc6a532580c12a099"},
+    {file = "dashscope-1.25.23-py3-none-any.whl", hash = "sha256:7cb2cc48fd82536202e3100e93730e92c952435bb8cbe16d45de9700e1546bcc"},
 ]
 
 [package.dependencies]
 aiohttp = "*"
+certifi = "*"
+cryptography = "*"
 requests = "*"
+rich = ">=13.0.0"
+typer = ">=0.9.0"
 websocket-client = "*"
 
 [package.extras]
 tokenizer = ["tiktoken"]
 
-[[package]]
-name = "datasets"
-version = "3.2.0"
-description = "HuggingFace community-driven open-source library of datasets"
-optional = false
-python-versions = ">=3.9.0"
-groups = ["main"]
-files = [
-    {file = "datasets-3.2.0-py3-none-any.whl", hash = "sha256:f3d2ba2698b7284a4518019658596a6a8bc79f31e51516524249d6c59cf0fe2a"},
-    {file = "datasets-3.2.0.tar.gz", hash = "sha256:9a6e1a356052866b5dbdd9c9eedb000bf3fc43d986e3584d9b028f4976937229"},
-]
-
-[package.dependencies]
-aiohttp = "*"
-dill = ">=0.3.0,<0.3.9"
-filelock = "*"
-fsspec = {version = ">=2023.1.0,<=2024.9.0", extras = ["http"]}
-huggingface-hub = ">=0.23.0"
-multiprocess = "<0.70.17"
-numpy = ">=1.17"
-packaging = "*"
-pandas = "*"
-pyarrow = ">=15.0.0"
-pyyaml = ">=5.1"
-requests = ">=2.32.2"
-tqdm = ">=4.66.3"
-xxhash = "*"
-
-[package.extras]
-audio = ["librosa", "soundfile (>=0.12.1)", "soxr (>=0.4.0) ; python_version >= \"3.9\""]
-benchmarks = ["tensorflow (==2.12.0)", "torch (==2.0.1)", "transformers (==4.30.1)"]
-dev = ["Pillow (>=9.4.0)", "absl-py", "decorator", "decord (==0.6.0)", "elasticsearch (>=7.17.12,<8.0.0)", "faiss-cpu (>=1.8.0.post1)", "jax (>=0.3.14) ; sys_platform != \"win32\"", "jaxlib (>=0.3.14) ; sys_platform != \"win32\"", "joblib (<1.3.0)", "joblibspark", "librosa", "lz4", "moto[server]", "polars[timezone] (>=0.20.0)", "protobuf (<4.0.0)", "py7zr", "pyspark (>=3.4)", "pytest", "pytest-datadir", "pytest-xdist", "rarfile (>=4.0)", "ruff (>=0.3.0)", "s3fs", "s3fs (>=2021.11.1)", "soundfile (>=0.12.1)", "soxr (>=0.4.0) ; python_version >= \"3.9\"", "sqlalchemy", "tensorflow (>=2.16.0) ; python_version >= \"3.10\"", "tensorflow (>=2.6.0)", "tensorflow (>=2.6.0) ; python_version < \"3.10\"", "tiktoken", "torch", "torch (>=2.0.0)", "torchdata", "transformers", "transformers (>=4.42.0)", "zstandard"]
-docs = ["s3fs", "tensorflow (>=2.6.0)", "torch", "transformers"]
-jax = ["jax (>=0.3.14)", "jaxlib (>=0.3.14)"]
-quality = ["ruff (>=0.3.0)"]
-s3 = ["s3fs"]
-tensorflow = ["tensorflow (>=2.6.0)"]
-tensorflow-gpu = ["tensorflow (>=2.6.0)"]
-tests = ["Pillow (>=9.4.0)", "absl-py", "decorator", "decord (==0.6.0)", "elasticsearch (>=7.17.12,<8.0.0)", "faiss-cpu (>=1.8.0.post1)", "jax (>=0.3.14) ; sys_platform != \"win32\"", "jaxlib (>=0.3.14) ; sys_platform != \"win32\"", "joblib (<1.3.0)", "joblibspark", "librosa", "lz4", "moto[server]", "polars[timezone] (>=0.20.0)", "protobuf (<4.0.0)", "py7zr", "pyspark (>=3.4)", "pytest", "pytest-datadir", "pytest-xdist", "rarfile (>=4.0)", "s3fs (>=2021.11.1)", "soundfile (>=0.12.1)", "soxr (>=0.4.0) ; python_version >= \"3.9\"", "sqlalchemy", "tensorflow (>=2.16.0) ; python_version >= \"3.10\"", "tensorflow (>=2.6.0) ; python_version < \"3.10\"", "tiktoken", "torch (>=2.0.0)", "torchdata", "transformers (>=4.42.0)", "zstandard"]
-tests-numpy2 = ["Pillow (>=9.4.0)", "absl-py", "decorator", "decord (==0.6.0)", "elasticsearch (>=7.17.12,<8.0.0)", "jax (>=0.3.14) ; sys_platform != \"win32\"", "jaxlib (>=0.3.14) ; sys_platform != \"win32\"", "joblib (<1.3.0)", "joblibspark", "lz4", "moto[server]", "polars[timezone] (>=0.20.0)", "protobuf (<4.0.0)", "py7zr", "pyspark (>=3.4)", "pytest", "pytest-datadir", "pytest-xdist", "rarfile (>=4.0)", "s3fs (>=2021.11.1)", "soundfile (>=0.12.1)", "soxr (>=0.4.0) ; python_version >= \"3.9\"", "sqlalchemy", "tiktoken", "torch (>=2.0.0)", "torchdata", "transformers (>=4.42.0)", "zstandard"]
-torch = ["torch"]
-vision = ["Pillow (>=9.4.0)"]
-
 [[package]]
 name = "decorator"
 version = "4.4.2"
@@ -1114,36 +1103,15 @@ files = [
     {file = "decorator-4.4.2.tar.gz", hash = "sha256:e3a62f0520172440ca0dcc823749319382e377f37f140a0b99ef45fecb84bfe7"},
 ]
 
-[[package]]
-name = "decord"
-version = "0.6.0"
-description = "Decord Video Loader"
-optional = false
-python-versions = "*"
-groups = ["main"]
-files = [
-    {file = "decord-0.6.0-cp36-cp36m-macosx_10_15_x86_64.whl", hash = "sha256:85ef90d2f872384657d7774cc486c237c5b12df62d4ac5cb5c8d6001fa611323"},
-    {file = "decord-0.6.0-cp37-cp37m-macosx_10_15_x86_64.whl", hash = "sha256:9c20674964fb1490c677bd911d2023d2a09fec7a58a4bb0b7ddf1ccc269f107a"},
-    {file = "decord-0.6.0-cp38-cp38-macosx_10_15_x86_64.whl", hash = "sha256:a0eb1258beade34dceb29d97856a7764d179db1b5182899b61874f3418a1abc8"},
-    {file = "decord-0.6.0-py3-none-manylinux2010_x86_64.whl", hash = "sha256:51997f20be8958e23b7c4061ba45d0efcd86bffd5fe81c695d0befee0d442976"},
-    {file = "decord-0.6.0-py3-none-win_amd64.whl", hash = "sha256:02665d7c4f1193a330205a791bc128f7e108eb6ae5b67144437a02f700943bad"},
-]
-
-[package.dependencies]
-numpy = ">=1.14.0"
-
 [[package]]
 name = "deepspeed"
-version = "0.16.5"
+version = "0.19.2"
 description = "DeepSpeed library"
 optional = false
 python-versions = "*"
-groups = ["main"]
+groups = ["training"]
 files = [
-    {file = "deepspeed-0.16.5-cp310-cp310-win_amd64.whl", hash = "sha256:e58738e86cfa4358c7f14f033557612e9084b817d31f979e974b958f6486c2f6"},
-    {file = "deepspeed-0.16.5-cp311-cp311-win_amd64.whl", hash = "sha256:8a08ffb4710094b9897f71068179298794dce0886d051b302273a8ab14ba2dc0"},
-    {file = "deepspeed-0.16.5-cp312-cp312-win_amd64.whl", hash = "sha256:3c89784151826a57ceb97a5512a859378bcd8755b9e4a907f1cb8461d67165b0"},
-    {file = "deepspeed-0.16.5.tar.gz", hash = "sha256:29e007a2bdafb1431b7a021126dace0126ce53c57eb79db1ba85a1484c0b770e"},
+    {file = "deepspeed-0.19.2.tar.gz", hash = "sha256:7e854b6ebe3d2bfa239f82958372927631c74e5324c7f08f17ce7ff5f6b06969"},
 ]
 
 [package.dependencies]
@@ -1156,17 +1124,18 @@ packaging = ">=20.0"
 psutil = "*"
 py-cpuinfo = "*"
 pydantic = ">=2.0.0"
-torch = "*"
+torch = ">=2.0.0"
 tqdm = "*"
 
 [package.extras]
 1bit-mpi = ["mpi4py"]
-all = ["accelerate", "autodoc_pydantic (>=2.0.0)", "clang-format (==18.1.3)", "comet_ml (>=3.41.0)", "deepspeed-kernels ; sys_platform == \"linux\"", "diffusers (>=0.25.0)", "docutils (<0.18)", "future", "google", "hjson", "importlib-metadata (>=4)", "lm-eval (==0.3.0)", "mpi4py", "mup", "neural-compressor (==2.1.0)", "packaging", "pre-commit (>=3.2.0)", "protobuf", "psutil", "py-cpuinfo", "pydantic (>=2.0.0)", "pytest (>=7.2.0)", "pytest-forked", "pytest-randomly", "pytest-xdist", "qtorch", "qtorch (==0.3.0)", "recommonmark", "safetensors", "sentencepiece", "sphinx", "sphinx-rtd-theme", "sphinx_rtd_theme", "tabulate", "tensorboard", "torch", "torchvision", "tqdm", "transformers (>=4.32.1)", "transformers (>=4.39.0)", "triton (==1.0.0)", "triton (==2.1.0)", "triton (>=2.1.0)", "wandb", "xgboost"]
+all = ["accelerate", "autodoc_pydantic (>=2.0.0)", "clang-format (==18.1.3)", "comet_ml (>=3.41.0)", "deepspeed-kernels ; sys_platform == \"linux\"", "diffusers (>=0.25.0)", "docutils (<0.18)", "future", "google", "hjson", "importlib-metadata (>=4)", "lm-eval (==0.3.0)", "mpi4py", "mup", "neural-compressor (==2.1.0)", "packaging", "pre-commit (>=3.2.0)", "protobuf", "psutil", "py-cpuinfo", "pydantic (>=2.0.0)", "pytest (>=7.2.0,<8.4.0)", "pytest-forked", "pytest-randomly", "pytest-xdist", "qtorch", "qtorch (==0.3.0)", "recommonmark", "safetensors", "scipy", "sentencepiece", "sphinx", "sphinx-rtd-theme", "sphinx_rtd_theme", "tabulate", "tensorboard", "torch (>=2.0.0)", "torchvision", "tqdm", "transformers (>=4.32.1)", "transformers (>=4.51.3)", "triton (==1.0.0)", "triton (==2.1.0)", "triton (>=2.1.0)", "wandb", "xgboost"]
 autotuning = ["tabulate"]
 autotuning-ml = ["hjson", "tabulate", "xgboost"]
-dev = ["accelerate", "clang-format (==18.1.3)", "comet_ml (>=3.41.0)", "deepspeed-kernels ; sys_platform == \"linux\"", "docutils (<0.18)", "future", "importlib-metadata (>=4)", "mup", "pre-commit (>=3.2.0)", "pytest (>=7.2.0)", "pytest-forked", "pytest-randomly", "pytest-xdist", "qtorch (==0.3.0)", "recommonmark", "sphinx", "sphinx-rtd-theme", "tensorboard", "torchvision", "transformers (>=4.39.0)", "wandb"]
+deepcompile = ["scipy"]
+dev = ["accelerate", "clang-format (==18.1.3)", "comet_ml (>=3.41.0)", "deepspeed-kernels ; sys_platform == \"linux\"", "docutils (<0.18)", "future", "importlib-metadata (>=4)", "mup", "pre-commit (>=3.2.0)", "pytest (>=7.2.0,<8.4.0)", "pytest-forked", "pytest-randomly", "pytest-xdist", "qtorch (==0.3.0)", "recommonmark", "sphinx", "sphinx-rtd-theme", "tensorboard", "torchvision", "transformers (>=4.51.3)", "wandb"]
 inf = ["google", "lm-eval (==0.3.0)", "protobuf", "qtorch", "safetensors", "sentencepiece", "transformers (>=4.32.1)"]
-readthedocs = ["autodoc_pydantic (>=2.0.0)", "docutils (<0.18)", "hjson", "packaging", "psutil", "py-cpuinfo", "pydantic (>=2.0.0)", "recommonmark", "sphinx_rtd_theme", "torch", "tqdm"]
+readthedocs = ["autodoc_pydantic (>=2.0.0)", "docutils (<0.18)", "hjson", "packaging", "psutil", "py-cpuinfo", "pydantic (>=2.0.0)", "recommonmark", "sphinx_rtd_theme", "torch (>=2.0.0)", "tqdm"]
 sd = ["diffusers (>=0.25.0)", "triton (>=2.1.0)"]
 sparse = ["neural-compressor (==2.1.0)"]
 sparse-attn = ["triton (==1.0.0)"]
@@ -1174,69 +1143,62 @@ triton = ["triton (==2.1.0)"]
 
 [[package]]
 name = "diffusers"
-version = "0.32.2"
+version = "0.38.0"
 description = "State-of-the-art diffusion in PyTorch and JAX."
 optional = false
-python-versions = ">=3.8.0"
+python-versions = ">=3.10.0"
 groups = ["main"]
 files = [
-    {file = "diffusers-0.32.2-py3-none-any.whl", hash = "sha256:d7f182b49c7f428737ee3bf6397d463ec03b85f4f3b2c9470bd1d73292b609ff"},
-    {file = "diffusers-0.32.2.tar.gz", hash = "sha256:eb1e36b326aabb0675729af7c626caf7a76ce7ced3a126e879331790b1eaa230"},
+    {file = "diffusers-0.38.0-py3-none-any.whl", hash = "sha256:18e53f9e539096320470f62c6360a6fd5727ff28cffda566265316e13fcdb612"},
+    {file = "diffusers-0.38.0.tar.gz", hash = "sha256:1e094ec5c16f18c42fb89d37f07a94cf9aab3ebbe527ab059c609597b8857626"},
 ]
 
 [package.dependencies]
 filelock = "*"
-huggingface-hub = ">=0.23.2"
-importlib-metadata = "*"
+httpx = "<1.0.0"
+huggingface-hub = ">=0.34.0,<2.0"
+importlib_metadata = "*"
 numpy = "*"
 Pillow = "*"
 regex = "!=2019.12.17"
 requests = "*"
-safetensors = ">=0.3.1"
+safetensors = ">=0.8.0rc0"
 
 [package.extras]
-dev = ["GitPython (<3.1.19)", "Jinja2", "accelerate (>=0.31.0)", "compel (==0.1.8)", "datasets", "flax (>=0.4.1)", "hf-doc-builder (>=0.3.0)", "invisible-watermark (>=0.2.0)", "isort (>=5.5.4)", "jax (>=0.4.1)", "jaxlib (>=0.4.1)", "k-diffusion (>=0.0.12)", "librosa", "parameterized", "peft (>=0.6.0)", "protobuf (>=3.20.3,<4)", "pytest", "pytest-timeout", "pytest-xdist", "requests-mock (==1.10.0)", "ruff (==0.1.5)", "safetensors (>=0.3.1)", "scipy", "sentencepiece (>=0.1.91,!=0.1.92)", "tensorboard", "torch (>=1.4)", "torchvision", "transformers (>=4.41.2)", "urllib3 (<=2.0.0)"]
+bitsandbytes = ["accelerate (>=0.31.0)", "bitsandbytes (>=0.43.3)"]
+dev = ["GitPython (<3.1.19)", "Jinja2", "Jinja2", "accelerate (>=0.31.0)", "accelerate (>=0.31.0)", "compel (==0.1.8)", "datasets", "datasets", "flax (>=0.4.1)", "ftfy", "hf-doc-builder (>=0.3.0)", "hf-doc-builder (>=0.3.0)", "invisible-watermark (>=0.2.0)", "isort (>=5.5.4)", "jax (>=0.4.1)", "jaxlib (>=0.4.1)", "librosa", "parameterized", "peft (>=0.17.0)", "phonemizer", "protobuf (>=3.20.3,<4)", "protobuf (>=3.20.3,<4)", "pytest", "pytest-timeout", "pytest-xdist", "requests-mock (==1.10.0)", "ruff (==0.9.10)", "safetensors (>=0.8.0rc0)", "scipy", "sentencepiece (>=0.1.91,!=0.1.92)", "tensorboard", "tiktoken (>=0.7.0)", "timm", "torch (>=1.4)", "torchsde", "torchvision", "transformers (>=4.41.2)", "urllib3 (<=2.0.0)"]
 docs = ["hf-doc-builder (>=0.3.0)"]
+flashpack = ["flashpack"]
 flax = ["flax (>=0.4.1)", "jax (>=0.4.1)", "jaxlib (>=0.4.1)"]
-quality = ["hf-doc-builder (>=0.3.0)", "isort (>=5.5.4)", "ruff (==0.1.5)", "urllib3 (<=2.0.0)"]
-test = ["GitPython (<3.1.19)", "Jinja2", "compel (==0.1.8)", "datasets", "invisible-watermark (>=0.2.0)", "k-diffusion (>=0.0.12)", "librosa", "parameterized", "pytest", "pytest-timeout", "pytest-xdist", "requests-mock (==1.10.0)", "safetensors (>=0.3.1)", "scipy", "sentencepiece (>=0.1.91,!=0.1.92)", "torchvision", "transformers (>=4.41.2)"]
+gguf = ["accelerate (>=0.31.0)", "gguf (>=0.10.0)"]
+nvidia-modelopt = ["nvidia_modelopt[hf] (>=0.33.1)"]
+optimum-quanto = ["accelerate (>=0.31.0)", "optimum_quanto (>=0.2.6)"]
+quality = ["hf-doc-builder (>=0.3.0)", "isort (>=5.5.4)", "ruff (==0.9.10)", "urllib3 (<=2.0.0)"]
+test = ["GitPython (<3.1.19)", "Jinja2", "compel (==0.1.8)", "datasets", "ftfy", "invisible-watermark (>=0.2.0)", "librosa", "parameterized", "phonemizer", "protobuf (>=3.20.3,<4)", "pytest", "pytest-timeout", "pytest-xdist", "requests-mock (==1.10.0)", "safetensors (>=0.8.0rc0)", "scipy", "sentencepiece (>=0.1.91,!=0.1.92)", "tiktoken (>=0.7.0)", "torchsde", "torchvision", "transformers (>=4.41.2)"]
 torch = ["accelerate (>=0.31.0)", "torch (>=1.4)"]
-training = ["Jinja2", "accelerate (>=0.31.0)", "datasets", "peft (>=0.6.0)", "protobuf (>=3.20.3,<4)", "tensorboard"]
-
-[[package]]
-name = "dill"
-version = "0.3.7"
-description = "serialize all of Python"
-optional = false
-python-versions = ">=3.7"
-groups = ["main"]
-files = [
-    {file = "dill-0.3.7-py3-none-any.whl", hash = "sha256:76b122c08ef4ce2eedcd4d1abd8e641114bfc6c2867f49f3c41facf65bf19f5e"},
-    {file = "dill-0.3.7.tar.gz", hash = "sha256:cc1c8b182eb3013e24bd475ff2e9295af86c1a38eb1aff128dac8962a9ce3c03"},
-]
-
-[package.extras]
-graph = ["objgraph (>=1.7.2)"]
+torchao = ["accelerate (>=0.31.0)", "torchao (>=0.7.0)"]
+training = ["Jinja2", "accelerate (>=0.31.0)", "datasets", "peft (>=0.17.0)", "protobuf (>=3.20.3,<4)", "tensorboard", "timm"]
 
 [[package]]
 name = "distlib"
-version = "0.3.9"
+version = "0.4.3"
 description = "Distribution utilities"
 optional = false
 python-versions = "*"
-groups = ["main", "dev"]
+groups = ["dev"]
 files = [
-    {file = "distlib-0.3.9-py2.py3-none-any.whl", hash = "sha256:47f8c22fd27c27e25a65601af709b38e4f0a45ea4fc2e710f65755fa8caaaf87"},
-    {file = "distlib-0.3.9.tar.gz", hash = "sha256:a60f20dea646b8a33f3e7772f74dc0b2d0772d2837ee1342a00645c81edf9403"},
+    {file = "distlib-0.4.3-py2.py3-none-any.whl", hash = "sha256:4b0ce306c966eb73bc3a7b6abad017c556dadd92c44701562cd528ac7fde4d5b"},
+    {file = "distlib-0.4.3.tar.gz", hash = "sha256:f152097224a0ae24be5a0f6bae1b9359af82133bce63f98a95f86cae1aede9ed"},
 ]
 
 [[package]]
 name = "distvae"
 version = "0.0.0b5"
 description = "DistVAE: Patch Parallelism Distributed VAE for High-Resolution image generation"
-optional = false
+optional = true
 python-versions = ">=3.10"
 groups = ["main"]
+markers = "extra == \"cuda\""
 files = [
     {file = "DistVAE-0.0.0b5-py3-none-any.whl", hash = "sha256:e0a7e302ba0935251086b788a95cc971602698b9743eb36ed4ca910ddaf286bb"},
     {file = "distvae-0.0.0b5.tar.gz", hash = "sha256:4376467ed2b7d6e9e7cab0bc174e49f1771535d07eaa8d2a86ef6f537e2977f6"},
@@ -1248,19 +1210,33 @@ torch = ">=2.2"
 transformers = "*"
 
 [[package]]
-name = "docker-pycreds"
-version = "0.4.0"
-description = "Python bindings for the docker credentials store API"
+name = "docstring-parser"
+version = "0.18.0"
+description = "Parse Python docstrings in reST, Google and Numpydoc format"
 optional = false
-python-versions = "*"
+python-versions = ">=3.8"
 groups = ["main"]
 files = [
-    {file = "docker-pycreds-0.4.0.tar.gz", hash = "sha256:6ce3270bcaf404cc4c3e27e4b6c70d3521deae82fb508767870fdbf772d584d4"},
-    {file = "docker_pycreds-0.4.0-py2.py3-none-any.whl", hash = "sha256:7266112468627868005106ec19cd0d722702d2b7d5912a28e19b826c3d37af49"},
+    {file = "docstring_parser-0.18.0-py3-none-any.whl", hash = "sha256:b3fcbed555c47d8479be0796ef7e19c2670d428d72e96da63f3a40122860374b"},
+    {file = "docstring_parser-0.18.0.tar.gz", hash = "sha256:292510982205c12b1248696f44959db3cdd1740237a968ea1e2e7a900eeb2015"},
 ]
 
-[package.dependencies]
-six = ">=1.4.0"
+[package.extras]
+dev = ["pre-commit (>=2.16.0) ; python_version >= \"3.9\"", "pydoctor (>=25.4.0)", "pytest"]
+docs = ["pydoctor (>=25.4.0)"]
+test = ["pytest"]
+
+[[package]]
+name = "docutils"
+version = "0.23"
+description = "Docutils -- Python Documentation Utilities"
+optional = false
+python-versions = ">=3.9"
+groups = ["main"]
+files = [
+    {file = "docutils-0.23-py3-none-any.whl", hash = "sha256:25d013af9bf23bc1c7b2b093dff4208166c53a94786c9e447808335ef1185fea"},
+    {file = "docutils-0.23.tar.gz", hash = "sha256:746f5060322511280a1e50eb76846ed6bf2342984b2ac04dc42caa1a8d78799e"},
+]
 
 [[package]]
 name = "easydict"
@@ -1280,265 +1256,199 @@ version = "0.8.0"
 description = "A new flavour of deep learning operations"
 optional = false
 python-versions = ">=3.8"
-groups = ["main"]
+groups = ["main", "training"]
 files = [
     {file = "einops-0.8.0-py3-none-any.whl", hash = "sha256:9572fb63046264a862693b0a87088af3bdc8c068fde03de63453cbbde245465f"},
     {file = "einops-0.8.0.tar.gz", hash = "sha256:63486517fed345712a8385c100cb279108d9d47e6ae59099b07657e983deae85"},
 ]
 
 [[package]]
-name = "exceptiongroup"
-version = "1.2.2"
-description = "Backport of PEP 654 (exception groups)"
-optional = false
-python-versions = ">=3.7"
-groups = ["main", "dev"]
-markers = "python_version == \"3.10\""
-files = [
-    {file = "exceptiongroup-1.2.2-py3-none-any.whl", hash = "sha256:3111b9d131c238bec2f8f516e123e14ba243563fb135d3fe885990585aa7795b"},
-    {file = "exceptiongroup-1.2.2.tar.gz", hash = "sha256:47c2edf7c6738fafb49fd34290706d1a1a2f4d1c6df275526b62cbb4aa5393cc"},
-]
-
-[package.extras]
-test = ["pytest (>=6)"]
-
-[[package]]
-name = "fabric"
-version = "3.0.1"
-description = "High level SSH command execution"
-optional = false
-python-versions = "*"
+name = "fastapi"
+version = "0.138.0"
+description = "FastAPI framework, high performance, easy to learn, fast to code, ready for production"
+optional = true
+python-versions = ">=3.10"
 groups = ["main"]
+markers = "extra == \"trackio\""
 files = [
-    {file = "fabric-3.0.1-py3-none-any.whl", hash = "sha256:a616a47b0e929c46c0c85619634a8a7522aa03378e1aea275c0a548385653ddf"},
-    {file = "fabric-3.0.1.tar.gz", hash = "sha256:65af8199f3e90c226db0aa03984989084099b9758315d9a4001f5e32c8599a84"},
+    {file = "fastapi-0.138.0-py3-none-any.whl", hash = "sha256:b6f54fd1bd72c80b0f899f172c61a600f6f7af9b43d4d772a018f35624048cb0"},
+    {file = "fastapi-0.138.0.tar.gz", hash = "sha256:d445a4877636ad191e7053e08c9bf98cb921a6756776848400bb773d1740c061"},
 ]
 
 [package.dependencies]
-invoke = ">=2.0"
-paramiko = ">=2.4"
+annotated-doc = ">=0.0.2"
+pydantic = ">=2.9.0"
+starlette = ">=0.46.0"
+typing-extensions = ">=4.8.0"
+typing-inspection = ">=0.4.2"
 
 [package.extras]
-pytest = ["pytest (>=7)"]
+all = ["email-validator (>=2.0.0)", "fastapi-cli[standard] (>=0.0.8)", "httpx (>=0.23.0,<1.0.0)", "itsdangerous (>=1.1.0)", "jinja2 (>=3.1.5)", "pydantic-extra-types (>=2.0.0)", "pydantic-settings (>=2.0.0)", "python-multipart (>=0.0.18)", "pyyaml (>=5.3.1)", "uvicorn[standard] (>=0.12.0)"]
+standard = ["email-validator (>=2.0.0)", "fastapi-cli[standard] (>=0.0.8)", "fastar (>=0.9.0)", "httpx (>=0.23.0,<1.0.0)", "jinja2 (>=3.1.5)", "pydantic-extra-types (>=2.0.0)", "pydantic-settings (>=2.0.0)", "python-multipart (>=0.0.18)", "uvicorn[standard] (>=0.12.0)"]
+standard-no-fastapi-cloud-cli = ["email-validator (>=2.0.0)", "fastapi-cli[standard-no-fastapi-cloud-cli] (>=0.0.8)", "httpx (>=0.23.0,<1.0.0)", "jinja2 (>=3.1.5)", "pydantic-extra-types (>=2.0.0)", "pydantic-settings (>=2.0.0)", "python-multipart (>=0.0.18)", "uvicorn[standard] (>=0.12.0)"]
 
 [[package]]
 name = "filelock"
-version = "3.16.1"
+version = "3.29.4"
 description = "A platform independent file lock."
 optional = false
-python-versions = ">=3.8"
-groups = ["main", "dev"]
-files = [
-    {file = "filelock-3.16.1-py3-none-any.whl", hash = "sha256:2082e5703d51fbf98ea75855d9d5527e33d8ff23099bec374a134febee6946b0"},
-    {file = "filelock-3.16.1.tar.gz", hash = "sha256:c249fbfcd5db47e5e2d6d62198e565475ee65e4831e2561c8e313fa7eb961435"},
-]
-
-[package.extras]
-docs = ["furo (>=2024.8.6)", "sphinx (>=8.0.2)", "sphinx-autodoc-typehints (>=2.4.1)"]
-testing = ["covdefaults (>=2.3)", "coverage (>=7.6.1)", "diff-cover (>=9.2)", "pytest (>=8.3.3)", "pytest-asyncio (>=0.24)", "pytest-cov (>=5)", "pytest-mock (>=3.14)", "pytest-timeout (>=2.3.1)", "virtualenv (>=20.26.4)"]
-typing = ["typing-extensions (>=4.12.2) ; python_version < \"3.11\""]
-
-[[package]]
-name = "fire"
-version = "0.6.0"
-description = "A library for automatically generating command line interfaces."
-optional = false
-python-versions = "*"
-groups = ["main"]
-files = [
-    {file = "fire-0.6.0.tar.gz", hash = "sha256:54ec5b996ecdd3c0309c800324a0703d6da512241bc73b553db959d98de0aa66"},
-]
-
-[package.dependencies]
-six = "*"
-termcolor = "*"
-
-[[package]]
-name = "fonttools"
-version = "4.55.3"
-description = "Tools to manipulate font files"
-optional = false
-python-versions = ">=3.8"
-groups = ["main"]
+python-versions = ">=3.10"
+groups = ["main", "dev", "training"]
 files = [
-    {file = "fonttools-4.55.3-cp310-cp310-macosx_10_9_universal2.whl", hash = "sha256:1dcc07934a2165ccdc3a5a608db56fb3c24b609658a5b340aee4ecf3ba679dc0"},
-    {file = "fonttools-4.55.3-cp310-cp310-macosx_10_9_x86_64.whl", hash = "sha256:f7d66c15ba875432a2d2fb419523f5d3d347f91f48f57b8b08a2dfc3c39b8a3f"},
-    {file = "fonttools-4.55.3-cp310-cp310-manylinux_2_17_aarch64.manylinux2014_aarch64.whl", hash = "sha256:27e4ae3592e62eba83cd2c4ccd9462dcfa603ff78e09110680a5444c6925d841"},
-    {file = "fonttools-4.55.3-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl", hash = "sha256:62d65a3022c35e404d19ca14f291c89cc5890032ff04f6c17af0bd1927299674"},
-    {file = "fonttools-4.55.3-cp310-cp310-musllinux_1_2_aarch64.whl", hash = "sha256:d342e88764fb201286d185093781bf6628bbe380a913c24adf772d901baa8276"},
-    {file = "fonttools-4.55.3-cp310-cp310-musllinux_1_2_x86_64.whl", hash = "sha256:dd68c87a2bfe37c5b33bcda0fba39b65a353876d3b9006fde3adae31f97b3ef5"},
-    {file = "fonttools-4.55.3-cp310-cp310-win32.whl", hash = "sha256:1bc7ad24ff98846282eef1cbeac05d013c2154f977a79886bb943015d2b1b261"},
-    {file = "fonttools-4.55.3-cp310-cp310-win_amd64.whl", hash = "sha256:b54baf65c52952db65df39fcd4820668d0ef4766c0ccdf32879b77f7c804d5c5"},
-    {file = "fonttools-4.55.3-cp311-cp311-macosx_10_9_universal2.whl", hash = "sha256:8c4491699bad88efe95772543cd49870cf756b019ad56294f6498982408ab03e"},
-    {file = "fonttools-4.55.3-cp311-cp311-macosx_10_9_x86_64.whl", hash = "sha256:5323a22eabddf4b24f66d26894f1229261021dacd9d29e89f7872dd8c63f0b8b"},
-    {file = "fonttools-4.55.3-cp311-cp311-manylinux_2_17_aarch64.manylinux2014_aarch64.whl", hash = "sha256:5480673f599ad410695ca2ddef2dfefe9df779a9a5cda89503881e503c9c7d90"},
-    {file = "fonttools-4.55.3-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl", hash = "sha256:da9da6d65cd7aa6b0f806556f4985bcbf603bf0c5c590e61b43aa3e5a0f822d0"},
-    {file = "fonttools-4.55.3-cp311-cp311-musllinux_1_2_aarch64.whl", hash = "sha256:e894b5bd60d9f473bed7a8f506515549cc194de08064d829464088d23097331b"},
-    {file = "fonttools-4.55.3-cp311-cp311-musllinux_1_2_x86_64.whl", hash = "sha256:aee3b57643827e237ff6ec6d28d9ff9766bd8b21e08cd13bff479e13d4b14765"},
-    {file = "fonttools-4.55.3-cp311-cp311-win32.whl", hash = "sha256:eb6ca911c4c17eb51853143624d8dc87cdcdf12a711fc38bf5bd21521e79715f"},
-    {file = "fonttools-4.55.3-cp311-cp311-win_amd64.whl", hash = "sha256:6314bf82c54c53c71805318fcf6786d986461622dd926d92a465199ff54b1b72"},
-    {file = "fonttools-4.55.3-cp312-cp312-macosx_10_13_universal2.whl", hash = "sha256:f9e736f60f4911061235603a6119e72053073a12c6d7904011df2d8fad2c0e35"},
-    {file = "fonttools-4.55.3-cp312-cp312-macosx_10_13_x86_64.whl", hash = "sha256:7a8aa2c5e5b8b3bcb2e4538d929f6589a5c6bdb84fd16e2ed92649fb5454f11c"},
-    {file = "fonttools-4.55.3-cp312-cp312-manylinux_2_17_aarch64.manylinux2014_aarch64.whl", hash = "sha256:07f8288aacf0a38d174445fc78377a97fb0b83cfe352a90c9d9c1400571963c7"},
-    {file = "fonttools-4.55.3-cp312-cp312-manylinux_2_5_x86_64.manylinux1_x86_64.manylinux_2_17_x86_64.manylinux2014_x86_64.whl", hash = "sha256:b8d5e8916c0970fbc0f6f1bece0063363bb5857a7f170121a4493e31c3db3314"},
-    {file = "fonttools-4.55.3-cp312-cp312-musllinux_1_2_aarch64.whl", hash = "sha256:ae3b6600565b2d80b7c05acb8e24d2b26ac407b27a3f2e078229721ba5698427"},
-    {file = "fonttools-4.55.3-cp312-cp312-musllinux_1_2_x86_64.whl", hash = "sha256:54153c49913f45065c8d9e6d0c101396725c5621c8aee744719300f79771d75a"},
-    {file = "fonttools-4.55.3-cp312-cp312-win32.whl", hash = "sha256:827e95fdbbd3e51f8b459af5ea10ecb4e30af50221ca103bea68218e9615de07"},
-    {file = "fonttools-4.55.3-cp312-cp312-win_amd64.whl", hash = "sha256:e6e8766eeeb2de759e862004aa11a9ea3d6f6d5ec710551a88b476192b64fd54"},
-    {file = "fonttools-4.55.3-cp313-cp313-macosx_10_13_universal2.whl", hash = "sha256:a430178ad3e650e695167cb53242dae3477b35c95bef6525b074d87493c4bf29"},
-    {file = "fonttools-4.55.3-cp313-cp313-macosx_10_13_x86_64.whl", hash = "sha256:529cef2ce91dc44f8e407cc567fae6e49a1786f2fefefa73a294704c415322a4"},
-    {file = "fonttools-4.55.3-cp313-cp313-manylinux_2_17_aarch64.manylinux2014_aarch64.whl", hash = "sha256:8e75f12c82127486fac2d8bfbf5bf058202f54bf4f158d367e41647b972342ca"},
-    {file = "fonttools-4.55.3-cp313-cp313-manylinux_2_5_x86_64.manylinux1_x86_64.manylinux_2_17_x86_64.manylinux2014_x86_64.whl", hash = "sha256:859c358ebf41db18fb72342d3080bce67c02b39e86b9fbcf1610cca14984841b"},
-    {file = "fonttools-4.55.3-cp313-cp313-musllinux_1_2_aarch64.whl", hash = "sha256:546565028e244a701f73df6d8dd6be489d01617863ec0c6a42fa25bf45d43048"},
-    {file = "fonttools-4.55.3-cp313-cp313-musllinux_1_2_x86_64.whl", hash = "sha256:aca318b77f23523309eec4475d1fbbb00a6b133eb766a8bdc401faba91261abe"},
-    {file = "fonttools-4.55.3-cp313-cp313-win32.whl", hash = "sha256:8c5ec45428edaa7022f1c949a632a6f298edc7b481312fc7dc258921e9399628"},
-    {file = "fonttools-4.55.3-cp313-cp313-win_amd64.whl", hash = "sha256:11e5de1ee0d95af4ae23c1a138b184b7f06e0b6abacabf1d0db41c90b03d834b"},
-    {file = "fonttools-4.55.3-cp38-cp38-macosx_10_9_universal2.whl", hash = "sha256:caf8230f3e10f8f5d7593eb6d252a37caf58c480b19a17e250a63dad63834cf3"},
-    {file = "fonttools-4.55.3-cp38-cp38-macosx_10_9_x86_64.whl", hash = "sha256:b586ab5b15b6097f2fb71cafa3c98edfd0dba1ad8027229e7b1e204a58b0e09d"},
-    {file = "fonttools-4.55.3-cp38-cp38-manylinux_2_17_aarch64.manylinux2014_aarch64.whl", hash = "sha256:a8c2794ded89399cc2169c4d0bf7941247b8d5932b2659e09834adfbb01589aa"},
-    {file = "fonttools-4.55.3-cp38-cp38-manylinux_2_17_x86_64.manylinux2014_x86_64.whl", hash = "sha256:cf4fe7c124aa3f4e4c1940880156e13f2f4d98170d35c749e6b4f119a872551e"},
-    {file = "fonttools-4.55.3-cp38-cp38-musllinux_1_2_aarch64.whl", hash = "sha256:86721fbc389ef5cc1e2f477019e5069e8e4421e8d9576e9c26f840dbb04678de"},
-    {file = "fonttools-4.55.3-cp38-cp38-musllinux_1_2_x86_64.whl", hash = "sha256:89bdc5d88bdeec1b15af790810e267e8332d92561dce4f0748c2b95c9bdf3926"},
-    {file = "fonttools-4.55.3-cp38-cp38-win32.whl", hash = "sha256:bc5dbb4685e51235ef487e4bd501ddfc49be5aede5e40f4cefcccabc6e60fb4b"},
-    {file = "fonttools-4.55.3-cp38-cp38-win_amd64.whl", hash = "sha256:cd70de1a52a8ee2d1877b6293af8a2484ac82514f10b1c67c1c5762d38073e56"},
-    {file = "fonttools-4.55.3-cp39-cp39-macosx_10_9_universal2.whl", hash = "sha256:bdcc9f04b36c6c20978d3f060e5323a43f6222accc4e7fcbef3f428e216d96af"},
-    {file = "fonttools-4.55.3-cp39-cp39-macosx_10_9_x86_64.whl", hash = "sha256:c3ca99e0d460eff46e033cd3992a969658c3169ffcd533e0a39c63a38beb6831"},
-    {file = "fonttools-4.55.3-cp39-cp39-manylinux_2_17_aarch64.manylinux2014_aarch64.whl", hash = "sha256:22f38464daa6cdb7b6aebd14ab06609328fe1e9705bb0fcc7d1e69de7109ee02"},
-    {file = "fonttools-4.55.3-cp39-cp39-manylinux_2_17_x86_64.manylinux2014_x86_64.whl", hash = "sha256:ed63959d00b61959b035c7d47f9313c2c1ece090ff63afea702fe86de00dbed4"},
-    {file = "fonttools-4.55.3-cp39-cp39-musllinux_1_2_aarch64.whl", hash = "sha256:5e8d657cd7326eeaba27de2740e847c6b39dde2f8d7cd7cc56f6aad404ddf0bd"},
-    {file = "fonttools-4.55.3-cp39-cp39-musllinux_1_2_x86_64.whl", hash = "sha256:fb594b5a99943042c702c550d5494bdd7577f6ef19b0bc73877c948a63184a32"},
-    {file = "fonttools-4.55.3-cp39-cp39-win32.whl", hash = "sha256:dc5294a3d5c84226e3dbba1b6f61d7ad813a8c0238fceea4e09aa04848c3d851"},
-    {file = "fonttools-4.55.3-cp39-cp39-win_amd64.whl", hash = "sha256:aedbeb1db64496d098e6be92b2e63b5fac4e53b1b92032dfc6988e1ea9134a4d"},
-    {file = "fonttools-4.55.3-py3-none-any.whl", hash = "sha256:f412604ccbeee81b091b420272841e5ec5ef68967a9790e80bffd0e30b8e2977"},
-    {file = "fonttools-4.55.3.tar.gz", hash = "sha256:3983313c2a04d6cc1fe9251f8fc647754cf49a61dac6cb1e7249ae67afaafc45"},
+    {file = "filelock-3.29.4-py3-none-any.whl", hash = "sha256:dac1648087d5115554850d113e7dd8c83ab2d38e3435dde2d4f163847e57b767"},
+    {file = "filelock-3.29.4.tar.gz", hash = "sha256:10cdb3656fc44541cdf30652a93fb10ec6b05325620eb316bd26893e4201538a"},
 ]
 
-[package.extras]
-all = ["brotli (>=1.0.1) ; platform_python_implementation == \"CPython\"", "brotlicffi (>=0.8.0) ; platform_python_implementation != \"CPython\"", "fs (>=2.2.0,<3)", "lxml (>=4.0)", "lz4 (>=1.7.4.2)", "matplotlib", "munkres ; platform_python_implementation == \"PyPy\"", "pycairo", "scipy ; platform_python_implementation != \"PyPy\"", "skia-pathops (>=0.5.0)", "sympy", "uharfbuzz (>=0.23.0)", "unicodedata2 (>=15.1.0) ; python_version <= \"3.12\"", "xattr ; sys_platform == \"darwin\"", "zopfli (>=0.1.4)"]
-graphite = ["lz4 (>=1.7.4.2)"]
-interpolatable = ["munkres ; platform_python_implementation == \"PyPy\"", "pycairo", "scipy ; platform_python_implementation != \"PyPy\""]
-lxml = ["lxml (>=4.0)"]
-pathops = ["skia-pathops (>=0.5.0)"]
-plot = ["matplotlib"]
-repacker = ["uharfbuzz (>=0.23.0)"]
-symfont = ["sympy"]
-type1 = ["xattr ; sys_platform == \"darwin\""]
-ufo = ["fs (>=2.2.0,<3)"]
-unicode = ["unicodedata2 (>=15.1.0) ; python_version <= \"3.12\""]
-woff = ["brotli (>=1.0.1) ; platform_python_implementation == \"CPython\"", "brotlicffi (>=0.8.0) ; platform_python_implementation != \"CPython\"", "zopfli (>=0.1.4)"]
-
 [[package]]
 name = "frozenlist"
-version = "1.5.0"
+version = "1.8.0"
 description = "A list-like structure which implements collections.abc.MutableSequence"
 optional = false
-python-versions = ">=3.8"
-groups = ["main"]
-files = [
-    {file = "frozenlist-1.5.0-cp310-cp310-macosx_10_9_universal2.whl", hash = "sha256:5b6a66c18b5b9dd261ca98dffcb826a525334b2f29e7caa54e182255c5f6a65a"},
-    {file = "frozenlist-1.5.0-cp310-cp310-macosx_10_9_x86_64.whl", hash = "sha256:d1b3eb7b05ea246510b43a7e53ed1653e55c2121019a97e60cad7efb881a97bb"},
-    {file = "frozenlist-1.5.0-cp310-cp310-macosx_11_0_arm64.whl", hash = "sha256:15538c0cbf0e4fa11d1e3a71f823524b0c46299aed6e10ebb4c2089abd8c3bec"},
-    {file = "frozenlist-1.5.0-cp310-cp310-manylinux_2_17_aarch64.manylinux2014_aarch64.whl", hash = "sha256:e79225373c317ff1e35f210dd5f1344ff31066ba8067c307ab60254cd3a78ad5"},
-    {file = "frozenlist-1.5.0-cp310-cp310-manylinux_2_17_ppc64le.manylinux2014_ppc64le.whl", hash = "sha256:9272fa73ca71266702c4c3e2d4a28553ea03418e591e377a03b8e3659d94fa76"},
-    {file = "frozenlist-1.5.0-cp310-cp310-manylinux_2_17_s390x.manylinux2014_s390x.whl", hash = "sha256:498524025a5b8ba81695761d78c8dd7382ac0b052f34e66939c42df860b8ff17"},
-    {file = "frozenlist-1.5.0-cp310-cp310-manylinux_2_5_i686.manylinux1_i686.manylinux_2_17_i686.manylinux2014_i686.whl", hash = "sha256:92b5278ed9d50fe610185ecd23c55d8b307d75ca18e94c0e7de328089ac5dcba"},
-    {file = "frozenlist-1.5.0-cp310-cp310-manylinux_2_5_x86_64.manylinux1_x86_64.manylinux_2_17_x86_64.manylinux2014_x86_64.whl", hash = "sha256:7f3c8c1dacd037df16e85227bac13cca58c30da836c6f936ba1df0c05d046d8d"},
-    {file = "frozenlist-1.5.0-cp310-cp310-musllinux_1_2_aarch64.whl", hash = "sha256:f2ac49a9bedb996086057b75bf93538240538c6d9b38e57c82d51f75a73409d2"},
-    {file = "frozenlist-1.5.0-cp310-cp310-musllinux_1_2_i686.whl", hash = "sha256:e66cc454f97053b79c2ab09c17fbe3c825ea6b4de20baf1be28919460dd7877f"},
-    {file = "frozenlist-1.5.0-cp310-cp310-musllinux_1_2_ppc64le.whl", hash = "sha256:5a3ba5f9a0dfed20337d3e966dc359784c9f96503674c2faf015f7fe8e96798c"},
-    {file = "frozenlist-1.5.0-cp310-cp310-musllinux_1_2_s390x.whl", hash = "sha256:6321899477db90bdeb9299ac3627a6a53c7399c8cd58d25da094007402b039ab"},
-    {file = "frozenlist-1.5.0-cp310-cp310-musllinux_1_2_x86_64.whl", hash = "sha256:76e4753701248476e6286f2ef492af900ea67d9706a0155335a40ea21bf3b2f5"},
-    {file = "frozenlist-1.5.0-cp310-cp310-win32.whl", hash = "sha256:977701c081c0241d0955c9586ffdd9ce44f7a7795df39b9151cd9a6fd0ce4cfb"},
-    {file = "frozenlist-1.5.0-cp310-cp310-win_amd64.whl", hash = "sha256:189f03b53e64144f90990d29a27ec4f7997d91ed3d01b51fa39d2dbe77540fd4"},
-    {file = "frozenlist-1.5.0-cp311-cp311-macosx_10_9_universal2.whl", hash = "sha256:fd74520371c3c4175142d02a976aee0b4cb4a7cc912a60586ffd8d5929979b30"},
-    {file = "frozenlist-1.5.0-cp311-cp311-macosx_10_9_x86_64.whl", hash = "sha256:2f3f7a0fbc219fb4455264cae4d9f01ad41ae6ee8524500f381de64ffaa077d5"},
-    {file = "frozenlist-1.5.0-cp311-cp311-macosx_11_0_arm64.whl", hash = "sha256:f47c9c9028f55a04ac254346e92977bf0f166c483c74b4232bee19a6697e4778"},
-    {file = "frozenlist-1.5.0-cp311-cp311-manylinux_2_17_aarch64.manylinux2014_aarch64.whl", hash = "sha256:0996c66760924da6e88922756d99b47512a71cfd45215f3570bf1e0b694c206a"},
-    {file = "frozenlist-1.5.0-cp311-cp311-manylinux_2_17_ppc64le.manylinux2014_ppc64le.whl", hash = "sha256:a2fe128eb4edeabe11896cb6af88fca5346059f6c8d807e3b910069f39157869"},
-    {file = "frozenlist-1.5.0-cp311-cp311-manylinux_2_17_s390x.manylinux2014_s390x.whl", hash = "sha256:1a8ea951bbb6cacd492e3948b8da8c502a3f814f5d20935aae74b5df2b19cf3d"},
-    {file = "frozenlist-1.5.0-cp311-cp311-manylinux_2_5_i686.manylinux1_i686.manylinux_2_17_i686.manylinux2014_i686.whl", hash = "sha256:de537c11e4aa01d37db0d403b57bd6f0546e71a82347a97c6a9f0dcc532b3a45"},
-    {file = "frozenlist-1.5.0-cp311-cp311-manylinux_2_5_x86_64.manylinux1_x86_64.manylinux_2_17_x86_64.manylinux2014_x86_64.whl", hash = "sha256:9c2623347b933fcb9095841f1cc5d4ff0b278addd743e0e966cb3d460278840d"},
-    {file = "frozenlist-1.5.0-cp311-cp311-musllinux_1_2_aarch64.whl", hash = "sha256:cee6798eaf8b1416ef6909b06f7dc04b60755206bddc599f52232606e18179d3"},
-    {file = "frozenlist-1.5.0-cp311-cp311-musllinux_1_2_i686.whl", hash = "sha256:f5f9da7f5dbc00a604fe74aa02ae7c98bcede8a3b8b9666f9f86fc13993bc71a"},
-    {file = "frozenlist-1.5.0-cp311-cp311-musllinux_1_2_ppc64le.whl", hash = "sha256:90646abbc7a5d5c7c19461d2e3eeb76eb0b204919e6ece342feb6032c9325ae9"},
-    {file = "frozenlist-1.5.0-cp311-cp311-musllinux_1_2_s390x.whl", hash = "sha256:bdac3c7d9b705d253b2ce370fde941836a5f8b3c5c2b8fd70940a3ea3af7f4f2"},
-    {file = "frozenlist-1.5.0-cp311-cp311-musllinux_1_2_x86_64.whl", hash = "sha256:03d33c2ddbc1816237a67f66336616416e2bbb6beb306e5f890f2eb22b959cdf"},
-    {file = "frozenlist-1.5.0-cp311-cp311-win32.whl", hash = "sha256:237f6b23ee0f44066219dae14c70ae38a63f0440ce6750f868ee08775073f942"},
-    {file = "frozenlist-1.5.0-cp311-cp311-win_amd64.whl", hash = "sha256:0cc974cc93d32c42e7b0f6cf242a6bd941c57c61b618e78b6c0a96cb72788c1d"},
-    {file = "frozenlist-1.5.0-cp312-cp312-macosx_10_13_universal2.whl", hash = "sha256:31115ba75889723431aa9a4e77d5f398f5cf976eea3bdf61749731f62d4a4a21"},
-    {file = "frozenlist-1.5.0-cp312-cp312-macosx_10_13_x86_64.whl", hash = "sha256:7437601c4d89d070eac8323f121fcf25f88674627505334654fd027b091db09d"},
-    {file = "frozenlist-1.5.0-cp312-cp312-macosx_11_0_arm64.whl", hash = "sha256:7948140d9f8ece1745be806f2bfdf390127cf1a763b925c4a805c603df5e697e"},
-    {file = "frozenlist-1.5.0-cp312-cp312-manylinux_2_17_aarch64.manylinux2014_aarch64.whl", hash = "sha256:feeb64bc9bcc6b45c6311c9e9b99406660a9c05ca8a5b30d14a78555088b0b3a"},
-    {file = "frozenlist-1.5.0-cp312-cp312-manylinux_2_17_ppc64le.manylinux2014_ppc64le.whl", hash = "sha256:683173d371daad49cffb8309779e886e59c2f369430ad28fe715f66d08d4ab1a"},
-    {file = "frozenlist-1.5.0-cp312-cp312-manylinux_2_17_s390x.manylinux2014_s390x.whl", hash = "sha256:7d57d8f702221405a9d9b40f9da8ac2e4a1a8b5285aac6100f3393675f0a85ee"},
-    {file = "frozenlist-1.5.0-cp312-cp312-manylinux_2_5_i686.manylinux1_i686.manylinux_2_17_i686.manylinux2014_i686.whl", hash = "sha256:30c72000fbcc35b129cb09956836c7d7abf78ab5416595e4857d1cae8d6251a6"},
-    {file = "frozenlist-1.5.0-cp312-cp312-manylinux_2_5_x86_64.manylinux1_x86_64.manylinux_2_17_x86_64.manylinux2014_x86_64.whl", hash = "sha256:000a77d6034fbad9b6bb880f7ec073027908f1b40254b5d6f26210d2dab1240e"},
-    {file = "frozenlist-1.5.0-cp312-cp312-musllinux_1_2_aarch64.whl", hash = "sha256:5d7f5a50342475962eb18b740f3beecc685a15b52c91f7d975257e13e029eca9"},
-    {file = "frozenlist-1.5.0-cp312-cp312-musllinux_1_2_i686.whl", hash = "sha256:87f724d055eb4785d9be84e9ebf0f24e392ddfad00b3fe036e43f489fafc9039"},
-    {file = "frozenlist-1.5.0-cp312-cp312-musllinux_1_2_ppc64le.whl", hash = "sha256:6e9080bb2fb195a046e5177f10d9d82b8a204c0736a97a153c2466127de87784"},
-    {file = "frozenlist-1.5.0-cp312-cp312-musllinux_1_2_s390x.whl", hash = "sha256:9b93d7aaa36c966fa42efcaf716e6b3900438632a626fb09c049f6a2f09fc631"},
-    {file = "frozenlist-1.5.0-cp312-cp312-musllinux_1_2_x86_64.whl", hash = "sha256:52ef692a4bc60a6dd57f507429636c2af8b6046db8b31b18dac02cbc8f507f7f"},
-    {file = "frozenlist-1.5.0-cp312-cp312-win32.whl", hash = "sha256:29d94c256679247b33a3dc96cce0f93cbc69c23bf75ff715919332fdbb6a32b8"},
-    {file = "frozenlist-1.5.0-cp312-cp312-win_amd64.whl", hash = "sha256:8969190d709e7c48ea386db202d708eb94bdb29207a1f269bab1196ce0dcca1f"},
-    {file = "frozenlist-1.5.0-cp313-cp313-macosx_10_13_universal2.whl", hash = "sha256:7a1a048f9215c90973402e26c01d1cff8a209e1f1b53f72b95c13db61b00f953"},
-    {file = "frozenlist-1.5.0-cp313-cp313-macosx_10_13_x86_64.whl", hash = "sha256:dd47a5181ce5fcb463b5d9e17ecfdb02b678cca31280639255ce9d0e5aa67af0"},
-    {file = "frozenlist-1.5.0-cp313-cp313-macosx_11_0_arm64.whl", hash = "sha256:1431d60b36d15cda188ea222033eec8e0eab488f39a272461f2e6d9e1a8e63c2"},
-    {file = "frozenlist-1.5.0-cp313-cp313-manylinux_2_17_aarch64.manylinux2014_aarch64.whl", hash = "sha256:6482a5851f5d72767fbd0e507e80737f9c8646ae7fd303def99bfe813f76cf7f"},
-    {file = "frozenlist-1.5.0-cp313-cp313-manylinux_2_17_ppc64le.manylinux2014_ppc64le.whl", hash = "sha256:44c49271a937625619e862baacbd037a7ef86dd1ee215afc298a417ff3270608"},
-    {file = "frozenlist-1.5.0-cp313-cp313-manylinux_2_17_s390x.manylinux2014_s390x.whl", hash = "sha256:12f78f98c2f1c2429d42e6a485f433722b0061d5c0b0139efa64f396efb5886b"},
-    {file = "frozenlist-1.5.0-cp313-cp313-manylinux_2_5_i686.manylinux1_i686.manylinux_2_17_i686.manylinux2014_i686.whl", hash = "sha256:ce3aa154c452d2467487765e3adc730a8c153af77ad84096bc19ce19a2400840"},
-    {file = "frozenlist-1.5.0-cp313-cp313-manylinux_2_5_x86_64.manylinux1_x86_64.manylinux_2_17_x86_64.manylinux2014_x86_64.whl", hash = "sha256:9b7dc0c4338e6b8b091e8faf0db3168a37101943e687f373dce00959583f7439"},
-    {file = "frozenlist-1.5.0-cp313-cp313-musllinux_1_2_aarch64.whl", hash = "sha256:45e0896250900b5aa25180f9aec243e84e92ac84bd4a74d9ad4138ef3f5c97de"},
-    {file = "frozenlist-1.5.0-cp313-cp313-musllinux_1_2_i686.whl", hash = "sha256:561eb1c9579d495fddb6da8959fd2a1fca2c6d060d4113f5844b433fc02f2641"},
-    {file = "frozenlist-1.5.0-cp313-cp313-musllinux_1_2_ppc64le.whl", hash = "sha256:df6e2f325bfee1f49f81aaac97d2aa757c7646534a06f8f577ce184afe2f0a9e"},
-    {file = "frozenlist-1.5.0-cp313-cp313-musllinux_1_2_s390x.whl", hash = "sha256:140228863501b44b809fb39ec56b5d4071f4d0aa6d216c19cbb08b8c5a7eadb9"},
-    {file = "frozenlist-1.5.0-cp313-cp313-musllinux_1_2_x86_64.whl", hash = "sha256:7707a25d6a77f5d27ea7dc7d1fc608aa0a478193823f88511ef5e6b8a48f9d03"},
-    {file = "frozenlist-1.5.0-cp313-cp313-win32.whl", hash = "sha256:31a9ac2b38ab9b5a8933b693db4939764ad3f299fcaa931a3e605bc3460e693c"},
-    {file = "frozenlist-1.5.0-cp313-cp313-win_amd64.whl", hash = "sha256:11aabdd62b8b9c4b84081a3c246506d1cddd2dd93ff0ad53ede5defec7886b28"},
-    {file = "frozenlist-1.5.0-cp38-cp38-macosx_10_9_universal2.whl", hash = "sha256:dd94994fc91a6177bfaafd7d9fd951bc8689b0a98168aa26b5f543868548d3ca"},
-    {file = "frozenlist-1.5.0-cp38-cp38-macosx_10_9_x86_64.whl", hash = "sha256:2d0da8bbec082bf6bf18345b180958775363588678f64998c2b7609e34719b10"},
-    {file = "frozenlist-1.5.0-cp38-cp38-macosx_11_0_arm64.whl", hash = "sha256:73f2e31ea8dd7df61a359b731716018c2be196e5bb3b74ddba107f694fbd7604"},
-    {file = "frozenlist-1.5.0-cp38-cp38-manylinux_2_17_aarch64.manylinux2014_aarch64.whl", hash = "sha256:828afae9f17e6de596825cf4228ff28fbdf6065974e5ac1410cecc22f699d2b3"},
-    {file = "frozenlist-1.5.0-cp38-cp38-manylinux_2_17_ppc64le.manylinux2014_ppc64le.whl", hash = "sha256:f1577515d35ed5649d52ab4319db757bb881ce3b2b796d7283e6634d99ace307"},
-    {file = "frozenlist-1.5.0-cp38-cp38-manylinux_2_17_s390x.manylinux2014_s390x.whl", hash = "sha256:2150cc6305a2c2ab33299453e2968611dacb970d2283a14955923062c8d00b10"},
-    {file = "frozenlist-1.5.0-cp38-cp38-manylinux_2_5_i686.manylinux1_i686.manylinux_2_17_i686.manylinux2014_i686.whl", hash = "sha256:a72b7a6e3cd2725eff67cd64c8f13335ee18fc3c7befc05aed043d24c7b9ccb9"},
-    {file = "frozenlist-1.5.0-cp38-cp38-manylinux_2_5_x86_64.manylinux1_x86_64.manylinux_2_17_x86_64.manylinux2014_x86_64.whl", hash = "sha256:c16d2fa63e0800723139137d667e1056bee1a1cf7965153d2d104b62855e9b99"},
-    {file = "frozenlist-1.5.0-cp38-cp38-musllinux_1_2_aarch64.whl", hash = "sha256:17dcc32fc7bda7ce5875435003220a457bcfa34ab7924a49a1c19f55b6ee185c"},
-    {file = "frozenlist-1.5.0-cp38-cp38-musllinux_1_2_i686.whl", hash = "sha256:97160e245ea33d8609cd2b8fd997c850b56db147a304a262abc2b3be021a9171"},
-    {file = "frozenlist-1.5.0-cp38-cp38-musllinux_1_2_ppc64le.whl", hash = "sha256:f1e6540b7fa044eee0bb5111ada694cf3dc15f2b0347ca125ee9ca984d5e9e6e"},
-    {file = "frozenlist-1.5.0-cp38-cp38-musllinux_1_2_s390x.whl", hash = "sha256:91d6c171862df0a6c61479d9724f22efb6109111017c87567cfeb7b5d1449fdf"},
-    {file = "frozenlist-1.5.0-cp38-cp38-musllinux_1_2_x86_64.whl", hash = "sha256:c1fac3e2ace2eb1052e9f7c7db480818371134410e1f5c55d65e8f3ac6d1407e"},
-    {file = "frozenlist-1.5.0-cp38-cp38-win32.whl", hash = "sha256:b97f7b575ab4a8af9b7bc1d2ef7f29d3afee2226bd03ca3875c16451ad5a7723"},
-    {file = "frozenlist-1.5.0-cp38-cp38-win_amd64.whl", hash = "sha256:374ca2dabdccad8e2a76d40b1d037f5bd16824933bf7bcea3e59c891fd4a0923"},
-    {file = "frozenlist-1.5.0-cp39-cp39-macosx_10_9_universal2.whl", hash = "sha256:9bbcdfaf4af7ce002694a4e10a0159d5a8d20056a12b05b45cea944a4953f972"},
-    {file = "frozenlist-1.5.0-cp39-cp39-macosx_10_9_x86_64.whl", hash = "sha256:1893f948bf6681733aaccf36c5232c231e3b5166d607c5fa77773611df6dc336"},
-    {file = "frozenlist-1.5.0-cp39-cp39-macosx_11_0_arm64.whl", hash = "sha256:2b5e23253bb709ef57a8e95e6ae48daa9ac5f265637529e4ce6b003a37b2621f"},
-    {file = "frozenlist-1.5.0-cp39-cp39-manylinux_2_17_aarch64.manylinux2014_aarch64.whl", hash = "sha256:0f253985bb515ecd89629db13cb58d702035ecd8cfbca7d7a7e29a0e6d39af5f"},
-    {file = "frozenlist-1.5.0-cp39-cp39-manylinux_2_17_ppc64le.manylinux2014_ppc64le.whl", hash = "sha256:04a5c6babd5e8fb7d3c871dc8b321166b80e41b637c31a995ed844a6139942b6"},
-    {file = "frozenlist-1.5.0-cp39-cp39-manylinux_2_17_s390x.manylinux2014_s390x.whl", hash = "sha256:a9fe0f1c29ba24ba6ff6abf688cb0b7cf1efab6b6aa6adc55441773c252f7411"},
-    {file = "frozenlist-1.5.0-cp39-cp39-manylinux_2_5_i686.manylinux1_i686.manylinux_2_17_i686.manylinux2014_i686.whl", hash = "sha256:226d72559fa19babe2ccd920273e767c96a49b9d3d38badd7c91a0fdeda8ea08"},
-    {file = "frozenlist-1.5.0-cp39-cp39-manylinux_2_5_x86_64.manylinux1_x86_64.manylinux_2_17_x86_64.manylinux2014_x86_64.whl", hash = "sha256:15b731db116ab3aedec558573c1a5eec78822b32292fe4f2f0345b7f697745c2"},
-    {file = "frozenlist-1.5.0-cp39-cp39-musllinux_1_2_aarch64.whl", hash = "sha256:366d8f93e3edfe5a918c874702f78faac300209a4d5bf38352b2c1bdc07a766d"},
-    {file = "frozenlist-1.5.0-cp39-cp39-musllinux_1_2_i686.whl", hash = "sha256:1b96af8c582b94d381a1c1f51ffaedeb77c821c690ea5f01da3d70a487dd0a9b"},
-    {file = "frozenlist-1.5.0-cp39-cp39-musllinux_1_2_ppc64le.whl", hash = "sha256:c03eff4a41bd4e38415cbed054bbaff4a075b093e2394b6915dca34a40d1e38b"},
-    {file = "frozenlist-1.5.0-cp39-cp39-musllinux_1_2_s390x.whl", hash = "sha256:50cf5e7ee9b98f22bdecbabf3800ae78ddcc26e4a435515fc72d97903e8488e0"},
-    {file = "frozenlist-1.5.0-cp39-cp39-musllinux_1_2_x86_64.whl", hash = "sha256:1e76bfbc72353269c44e0bc2cfe171900fbf7f722ad74c9a7b638052afe6a00c"},
-    {file = "frozenlist-1.5.0-cp39-cp39-win32.whl", hash = "sha256:666534d15ba8f0fda3f53969117383d5dc021266b3c1a42c9ec4855e4b58b9d3"},
-    {file = "frozenlist-1.5.0-cp39-cp39-win_amd64.whl", hash = "sha256:5c28f4b5dbef8a0d8aad0d4de24d1e9e981728628afaf4ea0792f5d0939372f0"},
-    {file = "frozenlist-1.5.0-py3-none-any.whl", hash = "sha256:d994863bba198a4a518b467bb971c56e1db3f180a25c6cf7bb1949c267f748c3"},
-    {file = "frozenlist-1.5.0.tar.gz", hash = "sha256:81d5af29e61b9c8348e876d442253723928dce6433e0e76cd925cd83f1b4b817"},
+python-versions = ">=3.9"
+groups = ["main", "training"]
+files = [
+    {file = "frozenlist-1.8.0-cp310-cp310-macosx_10_9_universal2.whl", hash = "sha256:b37f6d31b3dcea7deb5e9696e529a6aa4a898adc33db82da12e4c60a7c4d2011"},
+    {file = "frozenlist-1.8.0-cp310-cp310-macosx_10_9_x86_64.whl", hash = "sha256:ef2b7b394f208233e471abc541cc6991f907ffd47dc72584acee3147899d6565"},
+    {file = "frozenlist-1.8.0-cp310-cp310-macosx_11_0_arm64.whl", hash = "sha256:a88f062f072d1589b7b46e951698950e7da00442fc1cacbe17e19e025dc327ad"},
+    {file = "frozenlist-1.8.0-cp310-cp310-manylinux1_x86_64.manylinux_2_28_x86_64.manylinux_2_5_x86_64.whl", hash = "sha256:f57fb59d9f385710aa7060e89410aeb5058b99e62f4d16b08b91986b9a2140c2"},
+    {file = "frozenlist-1.8.0-cp310-cp310-manylinux2014_aarch64.manylinux_2_17_aarch64.manylinux_2_28_aarch64.whl", hash = "sha256:799345ab092bee59f01a915620b5d014698547afd011e691a208637312db9186"},
+    {file = "frozenlist-1.8.0-cp310-cp310-manylinux2014_armv7l.manylinux_2_17_armv7l.manylinux_2_31_armv7l.whl", hash = "sha256:c23c3ff005322a6e16f71bf8692fcf4d5a304aaafe1e262c98c6d4adc7be863e"},
+    {file = "frozenlist-1.8.0-cp310-cp310-manylinux2014_ppc64le.manylinux_2_17_ppc64le.manylinux_2_28_ppc64le.whl", hash = "sha256:8a76ea0f0b9dfa06f254ee06053d93a600865b3274358ca48a352ce4f0798450"},
+    {file = "frozenlist-1.8.0-cp310-cp310-manylinux2014_s390x.manylinux_2_17_s390x.manylinux_2_28_s390x.whl", hash = "sha256:c7366fe1418a6133d5aa824ee53d406550110984de7637d65a178010f759c6ef"},
+    {file = "frozenlist-1.8.0-cp310-cp310-musllinux_1_2_aarch64.whl", hash = "sha256:13d23a45c4cebade99340c4165bd90eeb4a56c6d8a9d8aa49568cac19a6d0dc4"},
+    {file = "frozenlist-1.8.0-cp310-cp310-musllinux_1_2_armv7l.whl", hash = "sha256:e4a3408834f65da56c83528fb52ce7911484f0d1eaf7b761fc66001db1646eff"},
+    {file = "frozenlist-1.8.0-cp310-cp310-musllinux_1_2_ppc64le.whl", hash = "sha256:42145cd2748ca39f32801dad54aeea10039da6f86e303659db90db1c4b614c8c"},
+    {file = "frozenlist-1.8.0-cp310-cp310-musllinux_1_2_s390x.whl", hash = "sha256:e2de870d16a7a53901e41b64ffdf26f2fbb8917b3e6ebf398098d72c5b20bd7f"},
+    {file = "frozenlist-1.8.0-cp310-cp310-musllinux_1_2_x86_64.whl", hash = "sha256:20e63c9493d33ee48536600d1a5c95eefc870cd71e7ab037763d1fbb89cc51e7"},
+    {file = "frozenlist-1.8.0-cp310-cp310-win32.whl", hash = "sha256:adbeebaebae3526afc3c96fad434367cafbfd1b25d72369a9e5858453b1bb71a"},
+    {file = "frozenlist-1.8.0-cp310-cp310-win_amd64.whl", hash = "sha256:667c3777ca571e5dbeb76f331562ff98b957431df140b54c85fd4d52eea8d8f6"},
+    {file = "frozenlist-1.8.0-cp310-cp310-win_arm64.whl", hash = "sha256:80f85f0a7cc86e7a54c46d99c9e1318ff01f4687c172ede30fd52d19d1da1c8e"},
+    {file = "frozenlist-1.8.0-cp311-cp311-macosx_10_9_universal2.whl", hash = "sha256:09474e9831bc2b2199fad6da3c14c7b0fbdd377cce9d3d77131be28906cb7d84"},
+    {file = "frozenlist-1.8.0-cp311-cp311-macosx_10_9_x86_64.whl", hash = "sha256:17c883ab0ab67200b5f964d2b9ed6b00971917d5d8a92df149dc2c9779208ee9"},
+    {file = "frozenlist-1.8.0-cp311-cp311-macosx_11_0_arm64.whl", hash = "sha256:fa47e444b8ba08fffd1c18e8cdb9a75db1b6a27f17507522834ad13ed5922b93"},
+    {file = "frozenlist-1.8.0-cp311-cp311-manylinux1_x86_64.manylinux_2_28_x86_64.manylinux_2_5_x86_64.whl", hash = "sha256:2552f44204b744fba866e573be4c1f9048d6a324dfe14475103fd51613eb1d1f"},
+    {file = "frozenlist-1.8.0-cp311-cp311-manylinux2014_aarch64.manylinux_2_17_aarch64.manylinux_2_28_aarch64.whl", hash = "sha256:957e7c38f250991e48a9a73e6423db1bb9dd14e722a10f6b8bb8e16a0f55f695"},
+    {file = "frozenlist-1.8.0-cp311-cp311-manylinux2014_armv7l.manylinux_2_17_armv7l.manylinux_2_31_armv7l.whl", hash = "sha256:8585e3bb2cdea02fc88ffa245069c36555557ad3609e83be0ec71f54fd4abb52"},
+    {file = "frozenlist-1.8.0-cp311-cp311-manylinux2014_ppc64le.manylinux_2_17_ppc64le.manylinux_2_28_ppc64le.whl", hash = "sha256:edee74874ce20a373d62dc28b0b18b93f645633c2943fd90ee9d898550770581"},
+    {file = "frozenlist-1.8.0-cp311-cp311-manylinux2014_s390x.manylinux_2_17_s390x.manylinux_2_28_s390x.whl", hash = "sha256:c9a63152fe95756b85f31186bddf42e4c02c6321207fd6601a1c89ebac4fe567"},
+    {file = "frozenlist-1.8.0-cp311-cp311-musllinux_1_2_aarch64.whl", hash = "sha256:b6db2185db9be0a04fecf2f241c70b63b1a242e2805be291855078f2b404dd6b"},
+    {file = "frozenlist-1.8.0-cp311-cp311-musllinux_1_2_armv7l.whl", hash = "sha256:f4be2e3d8bc8aabd566f8d5b8ba7ecc09249d74ba3c9ed52e54dc23a293f0b92"},
+    {file = "frozenlist-1.8.0-cp311-cp311-musllinux_1_2_ppc64le.whl", hash = "sha256:c8d1634419f39ea6f5c427ea2f90ca85126b54b50837f31497f3bf38266e853d"},
+    {file = "frozenlist-1.8.0-cp311-cp311-musllinux_1_2_s390x.whl", hash = "sha256:1a7fa382a4a223773ed64242dbe1c9c326ec09457e6b8428efb4118c685c3dfd"},
+    {file = "frozenlist-1.8.0-cp311-cp311-musllinux_1_2_x86_64.whl", hash = "sha256:11847b53d722050808926e785df837353bd4d75f1d494377e59b23594d834967"},
+    {file = "frozenlist-1.8.0-cp311-cp311-win32.whl", hash = "sha256:27c6e8077956cf73eadd514be8fb04d77fc946a7fe9f7fe167648b0b9085cc25"},
+    {file = "frozenlist-1.8.0-cp311-cp311-win_amd64.whl", hash = "sha256:ac913f8403b36a2c8610bbfd25b8013488533e71e62b4b4adce9c86c8cea905b"},
+    {file = "frozenlist-1.8.0-cp311-cp311-win_arm64.whl", hash = "sha256:d4d3214a0f8394edfa3e303136d0575eece0745ff2b47bd2cb2e66dd92d4351a"},
+    {file = "frozenlist-1.8.0-cp312-cp312-macosx_10_13_universal2.whl", hash = "sha256:78f7b9e5d6f2fdb88cdde9440dc147259b62b9d3b019924def9f6478be254ac1"},
+    {file = "frozenlist-1.8.0-cp312-cp312-macosx_10_13_x86_64.whl", hash = "sha256:229bf37d2e4acdaf808fd3f06e854a4a7a3661e871b10dc1f8f1896a3b05f18b"},
+    {file = "frozenlist-1.8.0-cp312-cp312-macosx_11_0_arm64.whl", hash = "sha256:f833670942247a14eafbb675458b4e61c82e002a148f49e68257b79296e865c4"},
+    {file = "frozenlist-1.8.0-cp312-cp312-manylinux1_x86_64.manylinux_2_28_x86_64.manylinux_2_5_x86_64.whl", hash = "sha256:494a5952b1c597ba44e0e78113a7266e656b9794eec897b19ead706bd7074383"},
+    {file = "frozenlist-1.8.0-cp312-cp312-manylinux2014_aarch64.manylinux_2_17_aarch64.manylinux_2_28_aarch64.whl", hash = "sha256:96f423a119f4777a4a056b66ce11527366a8bb92f54e541ade21f2374433f6d4"},
+    {file = "frozenlist-1.8.0-cp312-cp312-manylinux2014_armv7l.manylinux_2_17_armv7l.manylinux_2_31_armv7l.whl", hash = "sha256:3462dd9475af2025c31cc61be6652dfa25cbfb56cbbf52f4ccfe029f38decaf8"},
+    {file = "frozenlist-1.8.0-cp312-cp312-manylinux2014_ppc64le.manylinux_2_17_ppc64le.manylinux_2_28_ppc64le.whl", hash = "sha256:c4c800524c9cd9bac5166cd6f55285957fcfc907db323e193f2afcd4d9abd69b"},
+    {file = "frozenlist-1.8.0-cp312-cp312-manylinux2014_s390x.manylinux_2_17_s390x.manylinux_2_28_s390x.whl", hash = "sha256:d6a5df73acd3399d893dafc71663ad22534b5aa4f94e8a2fabfe856c3c1b6a52"},
+    {file = "frozenlist-1.8.0-cp312-cp312-musllinux_1_2_aarch64.whl", hash = "sha256:405e8fe955c2280ce66428b3ca55e12b3c4e9c336fb2103a4937e891c69a4a29"},
+    {file = "frozenlist-1.8.0-cp312-cp312-musllinux_1_2_armv7l.whl", hash = "sha256:908bd3f6439f2fef9e85031b59fd4f1297af54415fb60e4254a95f75b3cab3f3"},
+    {file = "frozenlist-1.8.0-cp312-cp312-musllinux_1_2_ppc64le.whl", hash = "sha256:294e487f9ec720bd8ffcebc99d575f7eff3568a08a253d1ee1a0378754b74143"},
+    {file = "frozenlist-1.8.0-cp312-cp312-musllinux_1_2_s390x.whl", hash = "sha256:74c51543498289c0c43656701be6b077f4b265868fa7f8a8859c197006efb608"},
+    {file = "frozenlist-1.8.0-cp312-cp312-musllinux_1_2_x86_64.whl", hash = "sha256:776f352e8329135506a1d6bf16ac3f87bc25b28e765949282dcc627af36123aa"},
+    {file = "frozenlist-1.8.0-cp312-cp312-win32.whl", hash = "sha256:433403ae80709741ce34038da08511d4a77062aa924baf411ef73d1146e74faf"},
+    {file = "frozenlist-1.8.0-cp312-cp312-win_amd64.whl", hash = "sha256:34187385b08f866104f0c0617404c8eb08165ab1272e884abc89c112e9c00746"},
+    {file = "frozenlist-1.8.0-cp312-cp312-win_arm64.whl", hash = "sha256:fe3c58d2f5db5fbd18c2987cba06d51b0529f52bc3a6cdc33d3f4eab725104bd"},
+    {file = "frozenlist-1.8.0-cp313-cp313-macosx_10_13_universal2.whl", hash = "sha256:8d92f1a84bb12d9e56f818b3a746f3efba93c1b63c8387a73dde655e1e42282a"},
+    {file = "frozenlist-1.8.0-cp313-cp313-macosx_10_13_x86_64.whl", hash = "sha256:96153e77a591c8adc2ee805756c61f59fef4cf4073a9275ee86fe8cba41241f7"},
+    {file = "frozenlist-1.8.0-cp313-cp313-macosx_11_0_arm64.whl", hash = "sha256:f21f00a91358803399890ab167098c131ec2ddd5f8f5fd5fe9c9f2c6fcd91e40"},
+    {file = "frozenlist-1.8.0-cp313-cp313-manylinux1_x86_64.manylinux_2_28_x86_64.manylinux_2_5_x86_64.whl", hash = "sha256:fb30f9626572a76dfe4293c7194a09fb1fe93ba94c7d4f720dfae3b646b45027"},
+    {file = "frozenlist-1.8.0-cp313-cp313-manylinux2014_aarch64.manylinux_2_17_aarch64.manylinux_2_28_aarch64.whl", hash = "sha256:eaa352d7047a31d87dafcacbabe89df0aa506abb5b1b85a2fb91bc3faa02d822"},
+    {file = "frozenlist-1.8.0-cp313-cp313-manylinux2014_armv7l.manylinux_2_17_armv7l.manylinux_2_31_armv7l.whl", hash = "sha256:03ae967b4e297f58f8c774c7eabcce57fe3c2434817d4385c50661845a058121"},
+    {file = "frozenlist-1.8.0-cp313-cp313-manylinux2014_ppc64le.manylinux_2_17_ppc64le.manylinux_2_28_ppc64le.whl", hash = "sha256:f6292f1de555ffcc675941d65fffffb0a5bcd992905015f85d0592201793e0e5"},
+    {file = "frozenlist-1.8.0-cp313-cp313-manylinux2014_s390x.manylinux_2_17_s390x.manylinux_2_28_s390x.whl", hash = "sha256:29548f9b5b5e3460ce7378144c3010363d8035cea44bc0bf02d57f5a685e084e"},
+    {file = "frozenlist-1.8.0-cp313-cp313-musllinux_1_2_aarch64.whl", hash = "sha256:ec3cc8c5d4084591b4237c0a272cc4f50a5b03396a47d9caaf76f5d7b38a4f11"},
+    {file = "frozenlist-1.8.0-cp313-cp313-musllinux_1_2_armv7l.whl", hash = "sha256:517279f58009d0b1f2e7c1b130b377a349405da3f7621ed6bfae50b10adf20c1"},
+    {file = "frozenlist-1.8.0-cp313-cp313-musllinux_1_2_ppc64le.whl", hash = "sha256:db1e72ede2d0d7ccb213f218df6a078a9c09a7de257c2fe8fcef16d5925230b1"},
+    {file = "frozenlist-1.8.0-cp313-cp313-musllinux_1_2_s390x.whl", hash = "sha256:b4dec9482a65c54a5044486847b8a66bf10c9cb4926d42927ec4e8fd5db7fed8"},
+    {file = "frozenlist-1.8.0-cp313-cp313-musllinux_1_2_x86_64.whl", hash = "sha256:21900c48ae04d13d416f0e1e0c4d81f7931f73a9dfa0b7a8746fb2fe7dd970ed"},
+    {file = "frozenlist-1.8.0-cp313-cp313-win32.whl", hash = "sha256:8b7b94a067d1c504ee0b16def57ad5738701e4ba10cec90529f13fa03c833496"},
+    {file = "frozenlist-1.8.0-cp313-cp313-win_amd64.whl", hash = "sha256:878be833caa6a3821caf85eb39c5ba92d28e85df26d57afb06b35b2efd937231"},
+    {file = "frozenlist-1.8.0-cp313-cp313-win_arm64.whl", hash = "sha256:44389d135b3ff43ba8cc89ff7f51f5a0bb6b63d829c8300f79a2fe4fe61bcc62"},
+    {file = "frozenlist-1.8.0-cp313-cp313t-macosx_10_13_universal2.whl", hash = "sha256:e25ac20a2ef37e91c1b39938b591457666a0fa835c7783c3a8f33ea42870db94"},
+    {file = "frozenlist-1.8.0-cp313-cp313t-macosx_10_13_x86_64.whl", hash = "sha256:07cdca25a91a4386d2e76ad992916a85038a9b97561bf7a3fd12d5d9ce31870c"},
+    {file = "frozenlist-1.8.0-cp313-cp313t-macosx_11_0_arm64.whl", hash = "sha256:4e0c11f2cc6717e0a741f84a527c52616140741cd812a50422f83dc31749fb52"},
+    {file = "frozenlist-1.8.0-cp313-cp313t-manylinux1_x86_64.manylinux_2_28_x86_64.manylinux_2_5_x86_64.whl", hash = "sha256:b3210649ee28062ea6099cfda39e147fa1bc039583c8ee4481cb7811e2448c51"},
+    {file = "frozenlist-1.8.0-cp313-cp313t-manylinux2014_aarch64.manylinux_2_17_aarch64.manylinux_2_28_aarch64.whl", hash = "sha256:581ef5194c48035a7de2aefc72ac6539823bb71508189e5de01d60c9dcd5fa65"},
+    {file = "frozenlist-1.8.0-cp313-cp313t-manylinux2014_armv7l.manylinux_2_17_armv7l.manylinux_2_31_armv7l.whl", hash = "sha256:3ef2d026f16a2b1866e1d86fc4e1291e1ed8a387b2c333809419a2f8b3a77b82"},
+    {file = "frozenlist-1.8.0-cp313-cp313t-manylinux2014_ppc64le.manylinux_2_17_ppc64le.manylinux_2_28_ppc64le.whl", hash = "sha256:5500ef82073f599ac84d888e3a8c1f77ac831183244bfd7f11eaa0289fb30714"},
+    {file = "frozenlist-1.8.0-cp313-cp313t-manylinux2014_s390x.manylinux_2_17_s390x.manylinux_2_28_s390x.whl", hash = "sha256:50066c3997d0091c411a66e710f4e11752251e6d2d73d70d8d5d4c76442a199d"},
+    {file = "frozenlist-1.8.0-cp313-cp313t-musllinux_1_2_aarch64.whl", hash = "sha256:5c1c8e78426e59b3f8005e9b19f6ff46e5845895adbde20ece9218319eca6506"},
+    {file = "frozenlist-1.8.0-cp313-cp313t-musllinux_1_2_armv7l.whl", hash = "sha256:eefdba20de0d938cec6a89bd4d70f346a03108a19b9df4248d3cf0d88f1b0f51"},
+    {file = "frozenlist-1.8.0-cp313-cp313t-musllinux_1_2_ppc64le.whl", hash = "sha256:cf253e0e1c3ceb4aaff6df637ce033ff6535fb8c70a764a8f46aafd3d6ab798e"},
+    {file = "frozenlist-1.8.0-cp313-cp313t-musllinux_1_2_s390x.whl", hash = "sha256:032efa2674356903cd0261c4317a561a6850f3ac864a63fc1583147fb05a79b0"},
+    {file = "frozenlist-1.8.0-cp313-cp313t-musllinux_1_2_x86_64.whl", hash = "sha256:6da155091429aeba16851ecb10a9104a108bcd32f6c1642867eadaee401c1c41"},
+    {file = "frozenlist-1.8.0-cp313-cp313t-win32.whl", hash = "sha256:0f96534f8bfebc1a394209427d0f8a63d343c9779cda6fc25e8e121b5fd8555b"},
+    {file = "frozenlist-1.8.0-cp313-cp313t-win_amd64.whl", hash = "sha256:5d63a068f978fc69421fb0e6eb91a9603187527c86b7cd3f534a5b77a592b888"},
+    {file = "frozenlist-1.8.0-cp313-cp313t-win_arm64.whl", hash = "sha256:bf0a7e10b077bf5fb9380ad3ae8ce20ef919a6ad93b4552896419ac7e1d8e042"},
+    {file = "frozenlist-1.8.0-cp314-cp314-macosx_10_13_universal2.whl", hash = "sha256:cee686f1f4cadeb2136007ddedd0aaf928ab95216e7691c63e50a8ec066336d0"},
+    {file = "frozenlist-1.8.0-cp314-cp314-macosx_10_13_x86_64.whl", hash = "sha256:119fb2a1bd47307e899c2fac7f28e85b9a543864df47aa7ec9d3c1b4545f096f"},
+    {file = "frozenlist-1.8.0-cp314-cp314-macosx_11_0_arm64.whl", hash = "sha256:4970ece02dbc8c3a92fcc5228e36a3e933a01a999f7094ff7c23fbd2beeaa67c"},
+    {file = "frozenlist-1.8.0-cp314-cp314-manylinux1_x86_64.manylinux_2_28_x86_64.manylinux_2_5_x86_64.whl", hash = "sha256:cba69cb73723c3f329622e34bdbf5ce1f80c21c290ff04256cff1cd3c2036ed2"},
+    {file = "frozenlist-1.8.0-cp314-cp314-manylinux2014_aarch64.manylinux_2_17_aarch64.manylinux_2_28_aarch64.whl", hash = "sha256:778a11b15673f6f1df23d9586f83c4846c471a8af693a22e066508b77d201ec8"},
+    {file = "frozenlist-1.8.0-cp314-cp314-manylinux2014_armv7l.manylinux_2_17_armv7l.manylinux_2_31_armv7l.whl", hash = "sha256:0325024fe97f94c41c08872db482cf8ac4800d80e79222c6b0b7b162d5b13686"},
+    {file = "frozenlist-1.8.0-cp314-cp314-manylinux2014_ppc64le.manylinux_2_17_ppc64le.manylinux_2_28_ppc64le.whl", hash = "sha256:97260ff46b207a82a7567b581ab4190bd4dfa09f4db8a8b49d1a958f6aa4940e"},
+    {file = "frozenlist-1.8.0-cp314-cp314-manylinux2014_s390x.manylinux_2_17_s390x.manylinux_2_28_s390x.whl", hash = "sha256:54b2077180eb7f83dd52c40b2750d0a9f175e06a42e3213ce047219de902717a"},
+    {file = "frozenlist-1.8.0-cp314-cp314-musllinux_1_2_aarch64.whl", hash = "sha256:2f05983daecab868a31e1da44462873306d3cbfd76d1f0b5b69c473d21dbb128"},
+    {file = "frozenlist-1.8.0-cp314-cp314-musllinux_1_2_armv7l.whl", hash = "sha256:33f48f51a446114bc5d251fb2954ab0164d5be02ad3382abcbfe07e2531d650f"},
+    {file = "frozenlist-1.8.0-cp314-cp314-musllinux_1_2_ppc64le.whl", hash = "sha256:154e55ec0655291b5dd1b8731c637ecdb50975a2ae70c606d100750a540082f7"},
+    {file = "frozenlist-1.8.0-cp314-cp314-musllinux_1_2_s390x.whl", hash = "sha256:4314debad13beb564b708b4a496020e5306c7333fa9a3ab90374169a20ffab30"},
+    {file = "frozenlist-1.8.0-cp314-cp314-musllinux_1_2_x86_64.whl", hash = "sha256:073f8bf8becba60aa931eb3bc420b217bb7d5b8f4750e6f8b3be7f3da85d38b7"},
+    {file = "frozenlist-1.8.0-cp314-cp314-win32.whl", hash = "sha256:bac9c42ba2ac65ddc115d930c78d24ab8d4f465fd3fc473cdedfccadb9429806"},
+    {file = "frozenlist-1.8.0-cp314-cp314-win_amd64.whl", hash = "sha256:3e0761f4d1a44f1d1a47996511752cf3dcec5bbdd9cc2b4fe595caf97754b7a0"},
+    {file = "frozenlist-1.8.0-cp314-cp314-win_arm64.whl", hash = "sha256:d1eaff1d00c7751b7c6662e9c5ba6eb2c17a2306ba5e2a37f24ddf3cc953402b"},
+    {file = "frozenlist-1.8.0-cp314-cp314t-macosx_10_13_universal2.whl", hash = "sha256:d3bb933317c52d7ea5004a1c442eef86f426886fba134ef8cf4226ea6ee1821d"},
+    {file = "frozenlist-1.8.0-cp314-cp314t-macosx_10_13_x86_64.whl", hash = "sha256:8009897cdef112072f93a0efdce29cd819e717fd2f649ee3016efd3cd885a7ed"},
+    {file = "frozenlist-1.8.0-cp314-cp314t-macosx_11_0_arm64.whl", hash = "sha256:2c5dcbbc55383e5883246d11fd179782a9d07a986c40f49abe89ddf865913930"},
+    {file = "frozenlist-1.8.0-cp314-cp314t-manylinux1_x86_64.manylinux_2_28_x86_64.manylinux_2_5_x86_64.whl", hash = "sha256:39ecbc32f1390387d2aa4f5a995e465e9e2f79ba3adcac92d68e3e0afae6657c"},
+    {file = "frozenlist-1.8.0-cp314-cp314t-manylinux2014_aarch64.manylinux_2_17_aarch64.manylinux_2_28_aarch64.whl", hash = "sha256:92db2bf818d5cc8d9c1f1fc56b897662e24ea5adb36ad1f1d82875bd64e03c24"},
+    {file = "frozenlist-1.8.0-cp314-cp314t-manylinux2014_armv7l.manylinux_2_17_armv7l.manylinux_2_31_armv7l.whl", hash = "sha256:2dc43a022e555de94c3b68a4ef0b11c4f747d12c024a520c7101709a2144fb37"},
+    {file = "frozenlist-1.8.0-cp314-cp314t-manylinux2014_ppc64le.manylinux_2_17_ppc64le.manylinux_2_28_ppc64le.whl", hash = "sha256:cb89a7f2de3602cfed448095bab3f178399646ab7c61454315089787df07733a"},
+    {file = "frozenlist-1.8.0-cp314-cp314t-manylinux2014_s390x.manylinux_2_17_s390x.manylinux_2_28_s390x.whl", hash = "sha256:33139dc858c580ea50e7e60a1b0ea003efa1fd42e6ec7fdbad78fff65fad2fd2"},
+    {file = "frozenlist-1.8.0-cp314-cp314t-musllinux_1_2_aarch64.whl", hash = "sha256:168c0969a329b416119507ba30b9ea13688fafffac1b7822802537569a1cb0ef"},
+    {file = "frozenlist-1.8.0-cp314-cp314t-musllinux_1_2_armv7l.whl", hash = "sha256:28bd570e8e189d7f7b001966435f9dac6718324b5be2990ac496cf1ea9ddb7fe"},
+    {file = "frozenlist-1.8.0-cp314-cp314t-musllinux_1_2_ppc64le.whl", hash = "sha256:b2a095d45c5d46e5e79ba1e5b9cb787f541a8dee0433836cea4b96a2c439dcd8"},
+    {file = "frozenlist-1.8.0-cp314-cp314t-musllinux_1_2_s390x.whl", hash = "sha256:eab8145831a0d56ec9c4139b6c3e594c7a83c2c8be25d5bcf2d86136a532287a"},
+    {file = "frozenlist-1.8.0-cp314-cp314t-musllinux_1_2_x86_64.whl", hash = "sha256:974b28cf63cc99dfb2188d8d222bc6843656188164848c4f679e63dae4b0708e"},
+    {file = "frozenlist-1.8.0-cp314-cp314t-win32.whl", hash = "sha256:342c97bf697ac5480c0a7ec73cd700ecfa5a8a40ac923bd035484616efecc2df"},
+    {file = "frozenlist-1.8.0-cp314-cp314t-win_amd64.whl", hash = "sha256:06be8f67f39c8b1dc671f5d83aaefd3358ae5cdcf8314552c57e7ed3e6475bdd"},
+    {file = "frozenlist-1.8.0-cp314-cp314t-win_arm64.whl", hash = "sha256:102e6314ca4da683dca92e3b1355490fed5f313b768500084fbe6371fddfdb79"},
+    {file = "frozenlist-1.8.0-cp39-cp39-macosx_10_9_universal2.whl", hash = "sha256:d8b7138e5cd0647e4523d6685b0eac5d4be9a184ae9634492f25c6eb38c12a47"},
+    {file = "frozenlist-1.8.0-cp39-cp39-macosx_10_9_x86_64.whl", hash = "sha256:a6483e309ca809f1efd154b4d37dc6d9f61037d6c6a81c2dc7a15cb22c8c5dca"},
+    {file = "frozenlist-1.8.0-cp39-cp39-macosx_11_0_arm64.whl", hash = "sha256:1b9290cf81e95e93fdf90548ce9d3c1211cf574b8e3f4b3b7cb0537cf2227068"},
+    {file = "frozenlist-1.8.0-cp39-cp39-manylinux1_x86_64.manylinux_2_28_x86_64.manylinux_2_5_x86_64.whl", hash = "sha256:59a6a5876ca59d1b63af8cd5e7ffffb024c3dc1e9cf9301b21a2e76286505c95"},
+    {file = "frozenlist-1.8.0-cp39-cp39-manylinux2014_aarch64.manylinux_2_17_aarch64.manylinux_2_28_aarch64.whl", hash = "sha256:6dc4126390929823e2d2d9dc79ab4046ed74680360fc5f38b585c12c66cdf459"},
+    {file = "frozenlist-1.8.0-cp39-cp39-manylinux2014_armv7l.manylinux_2_17_armv7l.manylinux_2_31_armv7l.whl", hash = "sha256:332db6b2563333c5671fecacd085141b5800cb866be16d5e3eb15a2086476675"},
+    {file = "frozenlist-1.8.0-cp39-cp39-manylinux2014_ppc64le.manylinux_2_17_ppc64le.manylinux_2_28_ppc64le.whl", hash = "sha256:9ff15928d62a0b80bb875655c39bf517938c7d589554cbd2669be42d97c2cb61"},
+    {file = "frozenlist-1.8.0-cp39-cp39-manylinux2014_s390x.manylinux_2_17_s390x.manylinux_2_28_s390x.whl", hash = "sha256:7bf6cdf8e07c8151fba6fe85735441240ec7f619f935a5205953d58009aef8c6"},
+    {file = "frozenlist-1.8.0-cp39-cp39-musllinux_1_2_aarch64.whl", hash = "sha256:48e6d3f4ec5c7273dfe83ff27c91083c6c9065af655dc2684d2c200c94308bb5"},
+    {file = "frozenlist-1.8.0-cp39-cp39-musllinux_1_2_armv7l.whl", hash = "sha256:1a7607e17ad33361677adcd1443edf6f5da0ce5e5377b798fba20fae194825f3"},
+    {file = "frozenlist-1.8.0-cp39-cp39-musllinux_1_2_ppc64le.whl", hash = "sha256:5a3a935c3a4e89c733303a2d5a7c257ea44af3a56c8202df486b7f5de40f37e1"},
+    {file = "frozenlist-1.8.0-cp39-cp39-musllinux_1_2_s390x.whl", hash = "sha256:940d4a017dbfed9daf46a3b086e1d2167e7012ee297fef9e1c545c4d022f5178"},
+    {file = "frozenlist-1.8.0-cp39-cp39-musllinux_1_2_x86_64.whl", hash = "sha256:b9be22a69a014bc47e78072d0ecae716f5eb56c15238acca0f43d6eb8e4a5bda"},
+    {file = "frozenlist-1.8.0-cp39-cp39-win32.whl", hash = "sha256:1aa77cb5697069af47472e39612976ed05343ff2e84a3dcf15437b232cbfd087"},
+    {file = "frozenlist-1.8.0-cp39-cp39-win_amd64.whl", hash = "sha256:7398c222d1d405e796970320036b1b563892b65809d9e5261487bb2c7f7b5c6a"},
+    {file = "frozenlist-1.8.0-cp39-cp39-win_arm64.whl", hash = "sha256:b4f3b365f31c6cd4af24545ca0a244a53688cad8834e32f56831c4923b50a103"},
+    {file = "frozenlist-1.8.0-py3-none-any.whl", hash = "sha256:0c18a16eab41e82c295618a77502e17b195883241c563b00f0aa5106fc4eaa0d"},
+    {file = "frozenlist-1.8.0.tar.gz", hash = "sha256:3ede829ed8d842f6cd48fc7081d7a41001a56f1f38603f9d49bf3020d59a31ad"},
 ]
 
 [[package]]
 name = "fsspec"
-version = "2024.9.0"
+version = "2026.4.0"
 description = "File-system specification"
 optional = false
-python-versions = ">=3.8"
-groups = ["main"]
+python-versions = ">=3.10"
+groups = ["main", "training"]
 files = [
-    {file = "fsspec-2024.9.0-py3-none-any.whl", hash = "sha256:a0947d552d8a6efa72cc2c730b12c41d043509156966cca4fb157b0f2a0c574b"},
-    {file = "fsspec-2024.9.0.tar.gz", hash = "sha256:4b0afb90c2f21832df142f292649035d80b421f60a9e1c027802e5a0da2b04e8"},
+    {file = "fsspec-2026.4.0-py3-none-any.whl", hash = "sha256:11ef7bb35dab8a394fde6e608221d5cf3e8499401c249bebaeaad760a1a8dec2"},
+    {file = "fsspec-2026.4.0.tar.gz", hash = "sha256:301d8ac70ae90ef3ad05dcf94d6c3754a097f9b5fe4667d2787aa359ec7df7e4"},
 ]
 
 [package.dependencies]
@@ -1549,12 +1459,12 @@ abfs = ["adlfs"]
 adl = ["adlfs"]
 arrow = ["pyarrow (>=1)"]
 dask = ["dask", "distributed"]
-dev = ["pre-commit", "ruff"]
+dev = ["pre-commit", "ruff (>=0.5)"]
 doc = ["numpydoc", "sphinx", "sphinx-design", "sphinx-rtd-theme", "yarl"]
 dropbox = ["dropbox", "dropboxdrivefs", "requests"]
-full = ["adlfs", "aiohttp (!=4.0.0a0,!=4.0.0a1)", "dask", "distributed", "dropbox", "dropboxdrivefs", "fusepy", "gcsfs", "libarchive-c", "ocifs", "panel", "paramiko", "pyarrow (>=1)", "pygit2", "requests", "s3fs", "smbprotocol", "tqdm"]
+full = ["adlfs", "aiohttp (!=4.0.0a0,!=4.0.0a1)", "dask", "distributed", "dropbox", "dropboxdrivefs", "fusepy", "gcsfs (>2024.2.0)", "libarchive-c", "ocifs", "panel", "paramiko", "pyarrow (>=1)", "pygit2", "requests", "s3fs (>2024.2.0)", "smbprotocol", "tqdm"]
 fuse = ["fusepy"]
-gcs = ["gcsfs"]
+gcs = ["gcsfs (>2024.2.0)"]
 git = ["pygit2"]
 github = ["requests"]
 gs = ["gcsfs"]
@@ -1563,13 +1473,13 @@ hdfs = ["pyarrow (>=1)"]
 http = ["aiohttp (!=4.0.0a0,!=4.0.0a1)"]
 libarchive = ["libarchive-c"]
 oci = ["ocifs"]
-s3 = ["s3fs"]
+s3 = ["s3fs (>2024.2.0)"]
 sftp = ["paramiko"]
 smb = ["smbprotocol"]
 ssh = ["paramiko"]
 test = ["aiohttp (!=4.0.0a0,!=4.0.0a1)", "numpy", "pytest", "pytest-asyncio (!=0.22.0)", "pytest-benchmark", "pytest-cov", "pytest-mock", "pytest-recording", "pytest-rerunfailures", "requests"]
-test-downstream = ["aiobotocore (>=2.5.4,<3.0.0)", "dask-expr", "dask[dataframe,test]", "moto[server] (>4,<5)", "pytest-timeout", "xarray"]
-test-full = ["adlfs", "aiohttp (!=4.0.0a0,!=4.0.0a1)", "cloudpickle", "dask", "distributed", "dropbox", "dropboxdrivefs", "fastparquet", "fusepy", "gcsfs", "jinja2", "kerchunk", "libarchive-c", "lz4", "notebook", "numpy", "ocifs", "pandas", "panel", "paramiko", "pyarrow", "pyarrow (>=1)", "pyftpdlib", "pygit2", "pytest", "pytest-asyncio (!=0.22.0)", "pytest-benchmark", "pytest-cov", "pytest-mock", "pytest-recording", "pytest-rerunfailures", "python-snappy", "requests", "smbprotocol", "tqdm", "urllib3", "zarr", "zstandard"]
+test-downstream = ["aiobotocore (>=2.5.4,<3.0.0)", "dask[dataframe,test]", "moto[server] (>4,<5)", "pytest-timeout", "xarray"]
+test-full = ["adlfs", "aiohttp (!=4.0.0a0,!=4.0.0a1)", "backports-zstd ; python_version < \"3.14\"", "cloudpickle", "dask", "distributed", "dropbox", "dropboxdrivefs", "fastparquet", "fusepy", "gcsfs", "jinja2", "kerchunk", "libarchive-c", "lz4", "notebook", "numpy", "ocifs", "pandas (<3.0.0)", "panel", "paramiko", "pyarrow", "pyarrow (>=1)", "pyftpdlib", "pygit2", "pytest", "pytest-asyncio (!=0.22.0)", "pytest-benchmark", "pytest-cov", "pytest-mock", "pytest-recording", "pytest-rerunfailures", "python-snappy", "requests", "smbprotocol", "tqdm", "urllib3", "zarr", "zstandard ; python_version < \"3.14\""]
 tqdm = ["tqdm"]
 
 [[package]]
@@ -1588,117 +1498,227 @@ files = [
 wcwidth = ">=0.2.12,<0.3.0"
 
 [[package]]
-name = "gitdb"
-version = "4.0.12"
-description = "Git Object Database"
-optional = false
-python-versions = ">=3.7"
+name = "gradio"
+version = "6.17.3"
+description = "Python library for easily interacting with trained machine learning models"
+optional = true
+python-versions = ">=3.10"
 groups = ["main"]
+markers = "extra == \"trackio\""
 files = [
-    {file = "gitdb-4.0.12-py3-none-any.whl", hash = "sha256:67073e15955400952c6565cc3e707c554a4eea2e428946f7a4c162fab9bd9bcf"},
-    {file = "gitdb-4.0.12.tar.gz", hash = "sha256:5ef71f855d191a3326fcfbc0d5da835f26b13fbcba60c32c21091c349ffdb571"},
+    {file = "gradio-6.17.3-py3-none-any.whl", hash = "sha256:7e52c65bfbb7bd75ac1c28cb38f93b01e5f6a2ff013224e6213533451bfee517"},
+    {file = "gradio-6.17.3.tar.gz", hash = "sha256:3822c3ac3e2a5fcbde7821cf6437a01c88592e484efd5f4cd369581b0ce258fb"},
 ]
 
 [package.dependencies]
-smmap = ">=3.0.1,<6"
+anyio = ">=3.0,<5.0"
+audioop-lts = {version = "<=0.2.2", markers = "python_version >= \"3.13\""}
+authlib = {version = "*", optional = true, markers = "extra == \"oauth\""}
+brotli = ">=1.1.0"
+fastapi = ">=0.115.2,<1.0"
+gradio-client = "2.5.0"
+groovy = ">=0.1,<1.0"
+hf-gradio = ">=0.4.1,<1.0"
+httpx = ">=0.24.1,<1.0"
+huggingface-hub = ">=0.33.5,<2.0"
+itsdangerous = {version = "*", optional = true, markers = "extra == \"oauth\""}
+jinja2 = "<4.0"
+markupsafe = ">=2.0,<4.0"
+numpy = ">=1.0,<3.0"
+orjson = ">=3.0,<4.0"
+packaging = "*"
+pandas = ">=1.0,<4.0"
+pillow = ">=8.0,<13.0"
+pydantic = ">=2.0,<=3.0"
+pydub = "<1.0"
+python-multipart = ">=0.0.18,<1.0"
+pytz = ">=2017.2"
+pyyaml = ">=5.0,<7.0"
+safehttpx = ">=0.1.7,<0.2.0"
+semantic-version = ">=2.0,<3.0"
+starlette = ">=1.0.1,<2.0"
+tomlkit = ">=0.12.0,<0.15.0"
+typer = ">=0.12,<1.0"
+typing-extensions = ">=4.0,<5.0"
+uvicorn = ">=0.14.0"
+
+[package.extras]
+mcp = ["mcp (>=1.21.0,<2.0.0)", "pydantic (>=2.11.10,<=2.12.5)"]
+oauth = ["authlib", "itsdangerous"]
 
 [[package]]
-name = "gitpython"
-version = "3.1.44"
-description = "GitPython is a Python library used to interact with Git repositories"
-optional = false
-python-versions = ">=3.7"
+name = "gradio-client"
+version = "2.5.0"
+description = "Python library for easily interacting with trained machine learning models"
+optional = true
+python-versions = ">=3.10"
 groups = ["main"]
+markers = "extra == \"trackio\""
 files = [
-    {file = "GitPython-3.1.44-py3-none-any.whl", hash = "sha256:9e0e10cda9bed1ee64bc9a6de50e7e38a9c9943241cd7f585f6df3ed28011110"},
-    {file = "gitpython-3.1.44.tar.gz", hash = "sha256:c87e30b26253bf5418b01b0660f818967f3c503193838337fe5e573331249269"},
+    {file = "gradio_client-2.5.0-py3-none-any.whl", hash = "sha256:d43e2179c29076292a76485ad7ed2e6eaa19d14ac58283bd7f5beabfe4ca958c"},
+    {file = "gradio_client-2.5.0.tar.gz", hash = "sha256:4cde99bad62149595c30c90876ca2e405e3a13687ecf895474f3412cb476673d"},
 ]
 
 [package.dependencies]
-gitdb = ">=4.0.1,<5"
+fsspec = "*"
+httpx = ">=0.24.1"
+huggingface-hub = ">=0.19.3,<2.0"
+packaging = "*"
+typing-extensions = ">=4.0,<5.0"
+
+[[package]]
+name = "groovy"
+version = "0.1.2"
+description = "A small Python library created to help developers protect their applications from Server Side Request Forgery (SSRF) attacks."
+optional = true
+python-versions = ">3.9"
+groups = ["main"]
+markers = "extra == \"trackio\""
+files = [
+    {file = "groovy-0.1.2-py3-none-any.whl", hash = "sha256:7f7975bab18c729a257a8b1ae9dcd70b7cafb1720481beae47719af57c35fa64"},
+    {file = "groovy-0.1.2.tar.gz", hash = "sha256:25c1dc09b3f9d7e292458aa762c6beb96ea037071bf5e917fc81fb78d2231083"},
+]
 
 [package.extras]
-doc = ["sphinx (>=7.1.2,<7.2)", "sphinx-autodoc-typehints", "sphinx_rtd_theme"]
-test = ["coverage[toml]", "ddt (>=1.1.1,!=1.4.3)", "mock ; python_version < \"3.8\"", "mypy", "pre-commit", "pytest (>=7.3.1)", "pytest-cov", "pytest-instafail", "pytest-mock", "pytest-sugar", "typing-extensions ; python_version < \"3.11\""]
+dev = ["pytest", "ruff (==0.9.3)"]
 
 [[package]]
-name = "google"
-version = "3.0.0"
-description = "Python bindings to the Google search engine."
+name = "grpcio"
+version = "1.81.1"
+description = "HTTP/2-based RPC framework"
 optional = false
-python-versions = "*"
+python-versions = ">=3.10"
+groups = ["training"]
+files = [
+    {file = "grpcio-1.81.1-cp310-cp310-linux_armv7l.whl", hash = "sha256:6f9a0c9c1cc15c112d1c053064fd032b64917062292c3d70aea280e02ae10b77"},
+    {file = "grpcio-1.81.1-cp310-cp310-macosx_11_0_universal2.whl", hash = "sha256:69ef28e54fc85397f91b8c19592b8ef3d81952080366914823bd8572a2958120"},
+    {file = "grpcio-1.81.1-cp310-cp310-manylinux2014_aarch64.manylinux_2_17_aarch64.whl", hash = "sha256:15641444eca4a29358107b3dceb74c1c6305c55c822fd199b458aaea4068a7fb"},
+    {file = "grpcio-1.81.1-cp310-cp310-manylinux2014_i686.manylinux_2_17_i686.whl", hash = "sha256:d4b2dddfc219f54f956ccd53cf76a1d338ffe68fc7f2849ec9c7feb9927ff692"},
+    {file = "grpcio-1.81.1-cp310-cp310-manylinux2014_x86_64.manylinux_2_17_x86_64.whl", hash = "sha256:ca1cc11d82677b9662082e5478b7528e2b7db7beaa6bdff42bd62789d81be399"},
+    {file = "grpcio-1.81.1-cp310-cp310-musllinux_1_2_aarch64.whl", hash = "sha256:aa2ba7d2ad6df4d80127cea65e5b8d5e2c3adbf153ff4804452836328aca7c54"},
+    {file = "grpcio-1.81.1-cp310-cp310-musllinux_1_2_i686.whl", hash = "sha256:592b5fee597faa91cce2dd294dd7d9a1c83d76c4dbf877e33ec1adb866b2fbed"},
+    {file = "grpcio-1.81.1-cp310-cp310-musllinux_1_2_x86_64.whl", hash = "sha256:62481553b1793a27e9b9c3cf9e5bd483ef045ca72462592074b46d42b0c4d9b9"},
+    {file = "grpcio-1.81.1-cp310-cp310-win32.whl", hash = "sha256:bb693b1e3d9a2f3fd228e2110daf4b5aeedb36761ca1e4282f74725f6d89f611"},
+    {file = "grpcio-1.81.1-cp310-cp310-win_amd64.whl", hash = "sha256:88268ca418cacea64cecb0d1d600d3c6b3a8038fcba02e1e205178c5b1f47661"},
+    {file = "grpcio-1.81.1-cp311-cp311-linux_armv7l.whl", hash = "sha256:d71d30f2d92f67d944631c523713934fee37292469e182ebcd2c1dd8a64ce53f"},
+    {file = "grpcio-1.81.1-cp311-cp311-macosx_11_0_universal2.whl", hash = "sha256:b137f4bf3ada9dc44d411478decc6ff09a79ed30b306cd2abaa98408c3588137"},
+    {file = "grpcio-1.81.1-cp311-cp311-manylinux2014_aarch64.manylinux_2_17_aarch64.whl", hash = "sha256:a3acb384427816dd5d470f47e62137b87f74da694faa8a50147012cf40df276a"},
+    {file = "grpcio-1.81.1-cp311-cp311-manylinux2014_i686.manylinux_2_17_i686.whl", hash = "sha256:f9a0ebbe45c29b5e5866593c12b78bd9035f0f0f0d4bc8361680cd580d99db49"},
+    {file = "grpcio-1.81.1-cp311-cp311-manylinux2014_x86_64.manylinux_2_17_x86_64.whl", hash = "sha256:0a37165cc80b1a368384b383e63a4c38116a10467ae44c904d2d7468c4470ec2"},
+    {file = "grpcio-1.81.1-cp311-cp311-musllinux_1_2_aarch64.whl", hash = "sha256:6282caffb41ec326d4cb67ca9cf53b739d1b2f975a2acb498c7418e9f7d9a416"},
+    {file = "grpcio-1.81.1-cp311-cp311-musllinux_1_2_i686.whl", hash = "sha256:a35009284d0d3d5c2c9601c164a911b8b4331608d98a9a66d47d97bb2f522b70"},
+    {file = "grpcio-1.81.1-cp311-cp311-musllinux_1_2_x86_64.whl", hash = "sha256:1b22c80559854b789a01fd89e8929b3798a156c0829b5282a8939f33ad4115ad"},
+    {file = "grpcio-1.81.1-cp311-cp311-win32.whl", hash = "sha256:428bec0161b48d8cf583c068591bc0016d0d9cfff52462b72b3884861ea768c5"},
+    {file = "grpcio-1.81.1-cp311-cp311-win_amd64.whl", hash = "sha256:30e825f6848d9f18bba350ed6c75c1b02a0b5184474a31db9a32b1fa66fd8c79"},
+    {file = "grpcio-1.81.1-cp312-cp312-linux_armv7l.whl", hash = "sha256:8b39472beafc0bdcafc4c8c73ad082ebfdb449d566897a61e7acb4fa88089115"},
+    {file = "grpcio-1.81.1-cp312-cp312-macosx_11_0_universal2.whl", hash = "sha256:12b7524c88d4026d3dcb7b0ebe16b6714f3b4af402ddd0f0639ab064a00c87c3"},
+    {file = "grpcio-1.81.1-cp312-cp312-manylinux2014_aarch64.manylinux_2_17_aarch64.whl", hash = "sha256:1e123f9b37edb8375fd74130d1f69c944bbf0a7b06761ae7211154b8759e94d2"},
+    {file = "grpcio-1.81.1-cp312-cp312-manylinux2014_i686.manylinux_2_17_i686.whl", hash = "sha256:2c2e2ae6867c2966b8daccc836d54a13218e0007e9a490aeb81dd05be64d22d7"},
+    {file = "grpcio-1.81.1-cp312-cp312-manylinux2014_x86_64.manylinux_2_17_x86_64.whl", hash = "sha256:766bc7c9a9c340342f4c864ccbda8e78111e4751f13b895812b9c148fb79e9d0"},
+    {file = "grpcio-1.81.1-cp312-cp312-musllinux_1_2_aarch64.whl", hash = "sha256:b259a04a737cb3496be0901328eb8b7552ed8df4865d8c8f1cf1bffcfc0776a3"},
+    {file = "grpcio-1.81.1-cp312-cp312-musllinux_1_2_i686.whl", hash = "sha256:85b10a45b8993d195c4f3ff57025b8d1e11834909ee475c403bfa60cb4caefaf"},
+    {file = "grpcio-1.81.1-cp312-cp312-musllinux_1_2_x86_64.whl", hash = "sha256:8ea1936c26b99999b27479853039a7f34713f56c49375ad52b38535ec93a796c"},
+    {file = "grpcio-1.81.1-cp312-cp312-win32.whl", hash = "sha256:a185a04039df6cae8648bc8ab6d6fde7bf94f7188ecf7828e76ac52eef1e41d6"},
+    {file = "grpcio-1.81.1-cp312-cp312-win_amd64.whl", hash = "sha256:3ad74f8bb1a18963914c5452d289422830b39459e8776ebbcd207be1fbfb1d94"},
+    {file = "grpcio-1.81.1-cp313-cp313-linux_armv7l.whl", hash = "sha256:b10e1ff4756ed27d5a29d7fc79cfce7ef1ff56ad20025b89bac7cf79e09abbbe"},
+    {file = "grpcio-1.81.1-cp313-cp313-macosx_11_0_universal2.whl", hash = "sha256:819edbdcb42ab8598b494bcf0222684bbb7a3c772bd1b1f0be7e029a6063c28e"},
+    {file = "grpcio-1.81.1-cp313-cp313-manylinux2014_aarch64.manylinux_2_17_aarch64.whl", hash = "sha256:c5bf2dc311127d91230cc79b92188c082634a06cf66c5234db49a43b910183b0"},
+    {file = "grpcio-1.81.1-cp313-cp313-manylinux2014_i686.manylinux_2_17_i686.whl", hash = "sha256:e8ca6a1fcdb2943c9cbc1804a1baf3acb6071d72a471591678ded84218006e14"},
+    {file = "grpcio-1.81.1-cp313-cp313-manylinux2014_x86_64.manylinux_2_17_x86_64.whl", hash = "sha256:e64dd101d380a115cc5a0c7856788adb535f1a4e21fc543775602f8be95180ae"},
+    {file = "grpcio-1.81.1-cp313-cp313-musllinux_1_2_aarch64.whl", hash = "sha256:98a07f9bf591e3a8919797bee1c53f026ba4acd587e5a4404c8e57c9ec36b2a5"},
+    {file = "grpcio-1.81.1-cp313-cp313-musllinux_1_2_i686.whl", hash = "sha256:c261d74b1a945cf895a9d6eccd1685a8e837531beaab782da4d630a8d12deffb"},
+    {file = "grpcio-1.81.1-cp313-cp313-musllinux_1_2_x86_64.whl", hash = "sha256:58ad1131c300d3c9b933802b3cc4dc69d380822935ba50b28703156ea826fbf7"},
+    {file = "grpcio-1.81.1-cp313-cp313-win32.whl", hash = "sha256:78e29211f26da2fdd0e9c6d2b79f489476140cf7029b6a64808ade7ca4156a42"},
+    {file = "grpcio-1.81.1-cp313-cp313-win_amd64.whl", hash = "sha256:edb59506291b647a30884b1d51a599d605f40b20af4a7dc3d33786a47a31de60"},
+    {file = "grpcio-1.81.1-cp314-cp314-linux_armv7l.whl", hash = "sha256:506f48f2f9c29b143fca3dad7b0d518c188b6c9648c75a2ae6e2d9f2c13a060b"},
+    {file = "grpcio-1.81.1-cp314-cp314-macosx_11_0_universal2.whl", hash = "sha256:d865db4a6318e1c1bea83292e0ed231090538fc4ca45425b0f0480eb338bbc6e"},
+    {file = "grpcio-1.81.1-cp314-cp314-manylinux2014_aarch64.manylinux_2_17_aarch64.whl", hash = "sha256:e2aa72e3ce1770317ef534f63d397b55e130725f5149bd36077c3b539019db27"},
+    {file = "grpcio-1.81.1-cp314-cp314-manylinux2014_i686.manylinux_2_17_i686.whl", hash = "sha256:0490c30c261eded63f3f354979f9dc4502a9fb944cccb60cd9dc85f5a7349854"},
+    {file = "grpcio-1.81.1-cp314-cp314-manylinux2014_x86_64.manylinux_2_17_x86_64.whl", hash = "sha256:410482da976329fe5f4067270401b12cf2bd552ff8020f054ecfaddb5475f9d6"},
+    {file = "grpcio-1.81.1-cp314-cp314-musllinux_1_2_aarch64.whl", hash = "sha256:e3657301562ac3cb8018d30d0d3ebfa39932239f7b5703422057ef14b69949f5"},
+    {file = "grpcio-1.81.1-cp314-cp314-musllinux_1_2_i686.whl", hash = "sha256:24c8e57504c8f45b237e40b99262d181071e5099a07053695b75d97bb53053a0"},
+    {file = "grpcio-1.81.1-cp314-cp314-musllinux_1_2_x86_64.whl", hash = "sha256:b427c19380991a4eaab2f6144b64b99b412043314c6bf4ab544f97bb31ee4190"},
+    {file = "grpcio-1.81.1-cp314-cp314-win32.whl", hash = "sha256:61233fe8951e5c85dff81c2458b6528624760166946b5b47ea150a589168411f"},
+    {file = "grpcio-1.81.1-cp314-cp314-win_amd64.whl", hash = "sha256:3768a5ff1b2125e6f552e561b6b2dca0e64982d8949689b4df145cf8b98d7821"},
+    {file = "grpcio-1.81.1.tar.gz", hash = "sha256:6fa10a767143a5e82e8eaab53918af0cd8909a57a27f8cb2288b80a613ac671b"},
+]
+
+[package.dependencies]
+typing-extensions = ">=4.12,<5.0"
+
+[package.extras]
+protobuf = ["grpcio-tools (>=1.81.1)"]
+
+[[package]]
+name = "h11"
+version = "0.16.0"
+description = "A pure-Python, bring-your-own-I/O implementation of HTTP/1.1"
+optional = false
+python-versions = ">=3.8"
+groups = ["main"]
+files = [
+    {file = "h11-0.16.0-py3-none-any.whl", hash = "sha256:63cf8bbe7522de3bf65932fda1d9c2772064ffb3dae62d55932da54b31cb6c86"},
+    {file = "h11-0.16.0.tar.gz", hash = "sha256:4e35b956cf45792e4caa5885e69fba00bdbc6ffafbfa020300e549b208ee5ff1"},
+]
+
+[[package]]
+name = "hf-gradio"
+version = "0.4.1"
+description = "An extension of the Hugging Face CLI for interacting with Gradio Spaces and Apps."
+optional = true
+python-versions = ">=3.10"
 groups = ["main"]
+markers = "extra == \"trackio\""
 files = [
-    {file = "google-3.0.0-py2.py3-none-any.whl", hash = "sha256:889cf695f84e4ae2c55fbc0cfdaf4c1e729417fa52ab1db0485202ba173e4935"},
-    {file = "google-3.0.0.tar.gz", hash = "sha256:143530122ee5130509ad5e989f0512f7cb218b2d4eddbafbad40fd10e8d8ccbe"},
+    {file = "hf_gradio-0.4.1-py3-none-any.whl", hash = "sha256:76b8cb8be6abe62d74c1ad2d35b42f0629db89aa9e1a8d033cecfe7c856eeab3"},
+    {file = "hf_gradio-0.4.1.tar.gz", hash = "sha256:a017d942618f0d495a58ee4563047fa04bef614c00e0cb789a9a6d0633cffa7b"},
 ]
 
 [package.dependencies]
-beautifulsoup4 = "*"
+gradio-client = ">=2.0,<3.0"
+typer = "*"
+
+[package.extras]
+dev = ["ruff", "ty"]
 
 [[package]]
-name = "grpcio"
-version = "1.71.0"
-description = "HTTP/2-based RPC framework"
+name = "hf-xet"
+version = "1.5.1"
+description = "Fast transfer of large files with the Hugging Face Hub."
 optional = false
-python-versions = ">=3.9"
+python-versions = ">=3.8"
 groups = ["main"]
-files = [
-    {file = "grpcio-1.71.0-cp310-cp310-linux_armv7l.whl", hash = "sha256:c200cb6f2393468142eb50ab19613229dcc7829b5ccee8b658a36005f6669fdd"},
-    {file = "grpcio-1.71.0-cp310-cp310-macosx_12_0_universal2.whl", hash = "sha256:b2266862c5ad664a380fbbcdbdb8289d71464c42a8c29053820ee78ba0119e5d"},
-    {file = "grpcio-1.71.0-cp310-cp310-manylinux_2_17_aarch64.whl", hash = "sha256:0ab8b2864396663a5b0b0d6d79495657ae85fa37dcb6498a2669d067c65c11ea"},
-    {file = "grpcio-1.71.0-cp310-cp310-manylinux_2_17_i686.manylinux2014_i686.whl", hash = "sha256:c30f393f9d5ff00a71bb56de4aa75b8fe91b161aeb61d39528db6b768d7eac69"},
-    {file = "grpcio-1.71.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl", hash = "sha256:f250ff44843d9a0615e350c77f890082102a0318d66a99540f54769c8766ab73"},
-    {file = "grpcio-1.71.0-cp310-cp310-musllinux_1_1_aarch64.whl", hash = "sha256:e6d8de076528f7c43a2f576bc311799f89d795aa6c9b637377cc2b1616473804"},
-    {file = "grpcio-1.71.0-cp310-cp310-musllinux_1_1_i686.whl", hash = "sha256:9b91879d6da1605811ebc60d21ab6a7e4bae6c35f6b63a061d61eb818c8168f6"},
-    {file = "grpcio-1.71.0-cp310-cp310-musllinux_1_1_x86_64.whl", hash = "sha256:f71574afdf944e6652203cd1badcda195b2a27d9c83e6d88dc1ce3cfb73b31a5"},
-    {file = "grpcio-1.71.0-cp310-cp310-win32.whl", hash = "sha256:8997d6785e93308f277884ee6899ba63baafa0dfb4729748200fcc537858a509"},
-    {file = "grpcio-1.71.0-cp310-cp310-win_amd64.whl", hash = "sha256:7d6ac9481d9d0d129224f6d5934d5832c4b1cddb96b59e7eba8416868909786a"},
-    {file = "grpcio-1.71.0-cp311-cp311-linux_armv7l.whl", hash = "sha256:d6aa986318c36508dc1d5001a3ff169a15b99b9f96ef5e98e13522c506b37eef"},
-    {file = "grpcio-1.71.0-cp311-cp311-macosx_10_14_universal2.whl", hash = "sha256:d2c170247315f2d7e5798a22358e982ad6eeb68fa20cf7a820bb74c11f0736e7"},
-    {file = "grpcio-1.71.0-cp311-cp311-manylinux_2_17_aarch64.whl", hash = "sha256:e6f83a583ed0a5b08c5bc7a3fe860bb3c2eac1f03f1f63e0bc2091325605d2b7"},
-    {file = "grpcio-1.71.0-cp311-cp311-manylinux_2_17_i686.manylinux2014_i686.whl", hash = "sha256:4be74ddeeb92cc87190e0e376dbc8fc7736dbb6d3d454f2fa1f5be1dee26b9d7"},
-    {file = "grpcio-1.71.0-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl", hash = "sha256:4dd0dfbe4d5eb1fcfec9490ca13f82b089a309dc3678e2edabc144051270a66e"},
-    {file = "grpcio-1.71.0-cp311-cp311-musllinux_1_1_aarch64.whl", hash = "sha256:a2242d6950dc892afdf9e951ed7ff89473aaf744b7d5727ad56bdaace363722b"},
-    {file = "grpcio-1.71.0-cp311-cp311-musllinux_1_1_i686.whl", hash = "sha256:0fa05ee31a20456b13ae49ad2e5d585265f71dd19fbd9ef983c28f926d45d0a7"},
-    {file = "grpcio-1.71.0-cp311-cp311-musllinux_1_1_x86_64.whl", hash = "sha256:3d081e859fb1ebe176de33fc3adb26c7d46b8812f906042705346b314bde32c3"},
-    {file = "grpcio-1.71.0-cp311-cp311-win32.whl", hash = "sha256:d6de81c9c00c8a23047136b11794b3584cdc1460ed7cbc10eada50614baa1444"},
-    {file = "grpcio-1.71.0-cp311-cp311-win_amd64.whl", hash = "sha256:24e867651fc67717b6f896d5f0cac0ec863a8b5fb7d6441c2ab428f52c651c6b"},
-    {file = "grpcio-1.71.0-cp312-cp312-linux_armv7l.whl", hash = "sha256:0ff35c8d807c1c7531d3002be03221ff9ae15712b53ab46e2a0b4bb271f38537"},
-    {file = "grpcio-1.71.0-cp312-cp312-macosx_10_14_universal2.whl", hash = "sha256:b78a99cd1ece4be92ab7c07765a0b038194ded2e0a26fd654591ee136088d8d7"},
-    {file = "grpcio-1.71.0-cp312-cp312-manylinux_2_17_aarch64.whl", hash = "sha256:dc1a1231ed23caac1de9f943d031f1bc38d0f69d2a3b243ea0d664fc1fbd7fec"},
-    {file = "grpcio-1.71.0-cp312-cp312-manylinux_2_17_i686.manylinux2014_i686.whl", hash = "sha256:e6beeea5566092c5e3c4896c6d1d307fb46b1d4bdf3e70c8340b190a69198594"},
-    {file = "grpcio-1.71.0-cp312-cp312-manylinux_2_17_x86_64.manylinux2014_x86_64.whl", hash = "sha256:d5170929109450a2c031cfe87d6716f2fae39695ad5335d9106ae88cc32dc84c"},
-    {file = "grpcio-1.71.0-cp312-cp312-musllinux_1_1_aarch64.whl", hash = "sha256:5b08d03ace7aca7b2fadd4baf291139b4a5f058805a8327bfe9aece7253b6d67"},
-    {file = "grpcio-1.71.0-cp312-cp312-musllinux_1_1_i686.whl", hash = "sha256:f903017db76bf9cc2b2d8bdd37bf04b505bbccad6be8a81e1542206875d0e9db"},
-    {file = "grpcio-1.71.0-cp312-cp312-musllinux_1_1_x86_64.whl", hash = "sha256:469f42a0b410883185eab4689060a20488a1a0a00f8bbb3cbc1061197b4c5a79"},
-    {file = "grpcio-1.71.0-cp312-cp312-win32.whl", hash = "sha256:ad9f30838550695b5eb302add33f21f7301b882937460dd24f24b3cc5a95067a"},
-    {file = "grpcio-1.71.0-cp312-cp312-win_amd64.whl", hash = "sha256:652350609332de6dac4ece254e5d7e1ff834e203d6afb769601f286886f6f3a8"},
-    {file = "grpcio-1.71.0-cp313-cp313-linux_armv7l.whl", hash = "sha256:cebc1b34ba40a312ab480ccdb396ff3c529377a2fce72c45a741f7215bfe8379"},
-    {file = "grpcio-1.71.0-cp313-cp313-macosx_10_14_universal2.whl", hash = "sha256:85da336e3649a3d2171e82f696b5cad2c6231fdd5bad52616476235681bee5b3"},
-    {file = "grpcio-1.71.0-cp313-cp313-manylinux_2_17_aarch64.whl", hash = "sha256:f9a412f55bb6e8f3bb000e020dbc1e709627dcb3a56f6431fa7076b4c1aab0db"},
-    {file = "grpcio-1.71.0-cp313-cp313-manylinux_2_17_i686.manylinux2014_i686.whl", hash = "sha256:47be9584729534660416f6d2a3108aaeac1122f6b5bdbf9fd823e11fe6fbaa29"},
-    {file = "grpcio-1.71.0-cp313-cp313-manylinux_2_17_x86_64.manylinux2014_x86_64.whl", hash = "sha256:7c9c80ac6091c916db81131d50926a93ab162a7e97e4428ffc186b6e80d6dda4"},
-    {file = "grpcio-1.71.0-cp313-cp313-musllinux_1_1_aarch64.whl", hash = "sha256:789d5e2a3a15419374b7b45cd680b1e83bbc1e52b9086e49308e2c0b5bbae6e3"},
-    {file = "grpcio-1.71.0-cp313-cp313-musllinux_1_1_i686.whl", hash = "sha256:1be857615e26a86d7363e8a163fade914595c81fec962b3d514a4b1e8760467b"},
-    {file = "grpcio-1.71.0-cp313-cp313-musllinux_1_1_x86_64.whl", hash = "sha256:a76d39b5fafd79ed604c4be0a869ec3581a172a707e2a8d7a4858cb05a5a7637"},
-    {file = "grpcio-1.71.0-cp313-cp313-win32.whl", hash = "sha256:74258dce215cb1995083daa17b379a1a5a87d275387b7ffe137f1d5131e2cfbb"},
-    {file = "grpcio-1.71.0-cp313-cp313-win_amd64.whl", hash = "sha256:22c3bc8d488c039a199f7a003a38cb7635db6656fa96437a8accde8322ce2366"},
-    {file = "grpcio-1.71.0-cp39-cp39-linux_armv7l.whl", hash = "sha256:c6a0a28450c16809f94e0b5bfe52cabff63e7e4b97b44123ebf77f448534d07d"},
-    {file = "grpcio-1.71.0-cp39-cp39-macosx_10_14_universal2.whl", hash = "sha256:a371e6b6a5379d3692cc4ea1cb92754d2a47bdddeee755d3203d1f84ae08e03e"},
-    {file = "grpcio-1.71.0-cp39-cp39-manylinux_2_17_aarch64.whl", hash = "sha256:39983a9245d37394fd59de71e88c4b295eb510a3555e0a847d9965088cdbd033"},
-    {file = "grpcio-1.71.0-cp39-cp39-manylinux_2_17_i686.manylinux2014_i686.whl", hash = "sha256:9182e0063112e55e74ee7584769ec5a0b4f18252c35787f48738627e23a62b97"},
-    {file = "grpcio-1.71.0-cp39-cp39-manylinux_2_17_x86_64.manylinux2014_x86_64.whl", hash = "sha256:693bc706c031aeb848849b9d1c6b63ae6bcc64057984bb91a542332b75aa4c3d"},
-    {file = "grpcio-1.71.0-cp39-cp39-musllinux_1_1_aarch64.whl", hash = "sha256:20e8f653abd5ec606be69540f57289274c9ca503ed38388481e98fa396ed0b41"},
-    {file = "grpcio-1.71.0-cp39-cp39-musllinux_1_1_i686.whl", hash = "sha256:8700a2a57771cc43ea295296330daaddc0d93c088f0a35cc969292b6db959bf3"},
-    {file = "grpcio-1.71.0-cp39-cp39-musllinux_1_1_x86_64.whl", hash = "sha256:d35a95f05a8a2cbe8e02be137740138b3b2ea5f80bd004444e4f9a1ffc511e32"},
-    {file = "grpcio-1.71.0-cp39-cp39-win32.whl", hash = "sha256:f9c30c464cb2ddfbc2ddf9400287701270fdc0f14be5f08a1e3939f1e749b455"},
-    {file = "grpcio-1.71.0-cp39-cp39-win_amd64.whl", hash = "sha256:63e41b91032f298b3e973b3fa4093cbbc620c875e2da7b93e249d4728b54559a"},
-    {file = "grpcio-1.71.0.tar.gz", hash = "sha256:2b85f7820475ad3edec209d3d89a7909ada16caab05d3f2e08a7e8ae3200a55c"},
+markers = "platform_machine == \"x86_64\" or platform_machine == \"amd64\" or platform_machine == \"arm64\" or platform_machine == \"aarch64\""
+files = [
+    {file = "hf_xet-1.5.1-cp313-cp313t-macosx_10_12_x86_64.whl", hash = "sha256:dbf48c0d02cf0b2e568944330c60d9120c272dabe013bd892d48e25bc6797577"},
+    {file = "hf_xet-1.5.1-cp313-cp313t-macosx_11_0_arm64.whl", hash = "sha256:e78e4e5192ad2b674c2e1160b651cb9134db974f8ae1835bdfbfb0166b894a43"},
+    {file = "hf_xet-1.5.1-cp313-cp313t-manylinux2014_x86_64.manylinux_2_17_x86_64.whl", hash = "sha256:6f7a04a8ad962422e225bc49fbbac99dc1806764b1f3e54dbd154bffa7593947"},
+    {file = "hf_xet-1.5.1-cp313-cp313t-manylinux_2_28_aarch64.whl", hash = "sha256:d48199c2bf4f8df0adc55d31d1368b6ec0e4d4f45bc86b08038089c23db0bed8"},
+    {file = "hf_xet-1.5.1-cp313-cp313t-musllinux_1_2_aarch64.whl", hash = "sha256:97f212a88d14bbf573619a74b7fecb238de77d08fc702e54dec6f78276ca3283"},
+    {file = "hf_xet-1.5.1-cp313-cp313t-musllinux_1_2_x86_64.whl", hash = "sha256:f61e3665892a6c8c5e765395838b8ddf36185da835253d4bc4509a81e49fb342"},
+    {file = "hf_xet-1.5.1-cp313-cp313t-win_amd64.whl", hash = "sha256:f4ad3ebd4c32dd2b27099d69dc7b2df821e30767e46fb6ee6a0713778243b8ff"},
+    {file = "hf_xet-1.5.1-cp313-cp313t-win_arm64.whl", hash = "sha256:8298485c1e36e7e67cbd01eeb1376619b7af43d4f1ec245caae306f890a8a32d"},
+    {file = "hf_xet-1.5.1-cp314-cp314t-macosx_10_12_x86_64.whl", hash = "sha256:3474760d10e3bb6f92ff3f024fcb00c0b3e4001e9b035c7483e49a5dd17aa70f"},
+    {file = "hf_xet-1.5.1-cp314-cp314t-macosx_11_0_arm64.whl", hash = "sha256:6762d89b9e3267dfd502b29b2a327b4525f33b17e7b509a78d94e2151a30ce30"},
+    {file = "hf_xet-1.5.1-cp314-cp314t-manylinux2014_x86_64.manylinux_2_17_x86_64.whl", hash = "sha256:bf67e6ed10260cef62e852789dc91ebb03f382d5bdc4b1dbeb64763ea275e7d6"},
+    {file = "hf_xet-1.5.1-cp314-cp314t-manylinux_2_28_aarch64.whl", hash = "sha256:c6b6cd08ca095058780b50b8ce4d6cbf6787bcf27841705d58a9d32246e3e47a"},
+    {file = "hf_xet-1.5.1-cp314-cp314t-musllinux_1_2_aarch64.whl", hash = "sha256:e1af0de8ca6f190d4294a28b88023db64a1e2d1d719cab044baf75bec569e7a9"},
+    {file = "hf_xet-1.5.1-cp314-cp314t-musllinux_1_2_x86_64.whl", hash = "sha256:4f561cbbb92f80960772059864b7fb07eae879adde1b2e781ec6f86f6ac26c59"},
+    {file = "hf_xet-1.5.1-cp314-cp314t-win_amd64.whl", hash = "sha256:e7dbb40617410f432182d918e37c12303fe6700fd6aa6c5964e30a535a4461d6"},
+    {file = "hf_xet-1.5.1-cp314-cp314t-win_arm64.whl", hash = "sha256:6071d5ccb4d8d2cbd5fea5cc798da4f0ba3f44e25369591c4e89a4987050e61d"},
+    {file = "hf_xet-1.5.1-cp37-abi3-macosx_10_12_x86_64.whl", hash = "sha256:6abd35c3221eff63836618ddfb954dcf84798603f71d8e33e3ed7b04acfdbe6e"},
+    {file = "hf_xet-1.5.1-cp37-abi3-macosx_11_0_arm64.whl", hash = "sha256:94e761bbd266bf4c03cee73753916062665ce8365aa40ed321f45afcb934b41e"},
+    {file = "hf_xet-1.5.1-cp37-abi3-manylinux2014_x86_64.manylinux_2_17_x86_64.whl", hash = "sha256:892e3a3a3aecc12aded8b93cf4f9cd059282c7de0732f7d55026f3abdf474350"},
+    {file = "hf_xet-1.5.1-cp37-abi3-manylinux_2_28_aarch64.whl", hash = "sha256:a93df2039190502835b1db8cd7e178b0b7b889fe9ab51299d5ced26e0dd879a4"},
+    {file = "hf_xet-1.5.1-cp37-abi3-musllinux_1_2_aarch64.whl", hash = "sha256:0c97106032ef70467b4f6bc2d0ccc266d7613ee076afc56516c502f87ce1c4a6"},
+    {file = "hf_xet-1.5.1-cp37-abi3-musllinux_1_2_x86_64.whl", hash = "sha256:6208adb15d192b90e4c2ad2a27ed864359b2cb0f2494eb6d7c7f3699ac02e2bf"},
+    {file = "hf_xet-1.5.1-cp37-abi3-win_amd64.whl", hash = "sha256:f7b3002f95d1c13e24bcb4537baa8f0eb3838957067c91bb4959bc004a6435f5"},
+    {file = "hf_xet-1.5.1-cp37-abi3-win_arm64.whl", hash = "sha256:93d090b57b211133f6c0dab0205ef5cb6d89162979ba75a74845045cc3063b8e"},
+    {file = "hf_xet-1.5.1.tar.gz", hash = "sha256:51ef4500dab3764b41135ee1381a4b62ce56fc54d4c92b719b59e597d6df5bf6"},
 ]
 
 [package.extras]
-protobuf = ["grpcio-tools (>=1.71.0)"]
+tests = ["pytest"]
 
 [[package]]
 name = "hjson"
@@ -1706,65 +1726,75 @@ version = "3.1.0"
 description = "Hjson, a user interface for JSON."
 optional = false
 python-versions = "*"
-groups = ["main"]
+groups = ["training"]
 files = [
     {file = "hjson-3.1.0-py3-none-any.whl", hash = "sha256:65713cdcf13214fb554eb8b4ef803419733f4f5e551047c9b711098ab7186b89"},
     {file = "hjson-3.1.0.tar.gz", hash = "sha256:55af475a27cf83a7969c808399d7bccdec8fb836a07ddbd574587593b9cdcf75"},
 ]
 
 [[package]]
-name = "hpsv2"
-version = "1.2.0"
-description = "Human Preference Score v2: A Solid Benchmark for Evaluating Human Preferences of Text-to-Image Synthesis."
+name = "httpcore"
+version = "1.0.9"
+description = "A minimal low-level HTTP client."
+optional = false
+python-versions = ">=3.8"
+groups = ["main"]
+files = [
+    {file = "httpcore-1.0.9-py3-none-any.whl", hash = "sha256:2d400746a40668fc9dec9810239072b40b4484b640a8c38fd654a024c7a1bf55"},
+    {file = "httpcore-1.0.9.tar.gz", hash = "sha256:6e34463af53fd2ab5d807f399a9b45ea31c3dfa2276f15a2c3f00afff6e176e8"},
+]
+
+[package.dependencies]
+certifi = "*"
+h11 = ">=0.16"
+
+[package.extras]
+asyncio = ["anyio (>=4.0,<5.0)"]
+http2 = ["h2 (>=3,<5)"]
+socks = ["socksio (==1.*)"]
+trio = ["trio (>=0.22.0,<1.0)"]
+
+[[package]]
+name = "httpx"
+version = "0.28.1"
+description = "The next generation HTTP client."
 optional = false
-python-versions = ">=3.6.0"
+python-versions = ">=3.8"
 groups = ["main"]
-files = []
-develop = false
+files = [
+    {file = "httpx-0.28.1-py3-none-any.whl", hash = "sha256:d909fcccc110f8c7faf814ca82a9a4d816bc5a6dbfea25d6591d6985b8ba59ad"},
+    {file = "httpx-0.28.1.tar.gz", hash = "sha256:75e98c5f16b0f35b567856f597f06ff2270a374470a5c2392242528e3e3e42fc"},
+]
 
 [package.dependencies]
-braceexpand = "*"
-clint = "*"
-einops = "*"
-fsspec = "*"
-ftfy = "*"
-huggingface_hub = "*"
-pandas = "*"
-protobuf = "<4"
-pyarrow = "*"
-pytest = "7.2.0"
-pytest-split = "0.8.0"
-regex = "*"
-requests = "*"
-sentencepiece = "*"
-timm = "*"
-torch = ">=1.9.0"
-torchvision = "*"
-tqdm = "*"
-transformers = "*"
-webdataset = "*"
+anyio = "*"
+certifi = "*"
+httpcore = "==1.*"
+idna = "*"
 
-[package.source]
-type = "git"
-url = "https://github.com/tgxs002/HPSv2.git"
-reference = "HEAD"
-resolved_reference = "866735ecaae999fa714bd9edfa05aa2672669ee3"
+[package.extras]
+brotli = ["brotli ; platform_python_implementation == \"CPython\"", "brotlicffi ; platform_python_implementation != \"CPython\""]
+cli = ["click (==8.*)", "pygments (==2.*)", "rich (>=10,<14)"]
+http2 = ["h2 (>=3,<5)"]
+socks = ["socksio (==1.*)"]
+zstd = ["zstandard (>=0.18.0)"]
 
 [[package]]
 name = "huggingface-hub"
-version = "0.24.6"
+version = "0.34.6"
 description = "Client library to download and publish models, datasets and other repos on the huggingface.co hub"
 optional = false
 python-versions = ">=3.8.0"
 groups = ["main"]
 files = [
-    {file = "huggingface_hub-0.24.6-py3-none-any.whl", hash = "sha256:a990f3232aa985fe749bc9474060cbad75e8b2f115f6665a9fda5b9c97818970"},
-    {file = "huggingface_hub-0.24.6.tar.gz", hash = "sha256:cc2579e761d070713eaa9c323e3debe39d5b464ae3a7261c39a9195b27bb8000"},
+    {file = "huggingface_hub-0.34.6-py3-none-any.whl", hash = "sha256:3387ec9045f9dc5b5715e4e7392c25b0d23fd539eb925111a1b301e60f2b4883"},
+    {file = "huggingface_hub-0.34.6.tar.gz", hash = "sha256:d0824eb012e37594357bb1790dfbe26c8f45eed7e701c1cdae02539e0c06f3f8"},
 ]
 
 [package.dependencies]
 filelock = "*"
 fsspec = ">=2023.5.0"
+hf-xet = {version = ">=1.1.3,<2.0.0", markers = "platform_machine == \"x86_64\" or platform_machine == \"amd64\" or platform_machine == \"arm64\" or platform_machine == \"aarch64\""}
 packaging = ">=20.9"
 pyyaml = ">=5.1"
 requests = "*"
@@ -1772,29 +1802,32 @@ tqdm = ">=4.42.1"
 typing-extensions = ">=3.7.4.3"
 
 [package.extras]
-all = ["InquirerPy (==0.3.4)", "Jinja2", "Pillow", "aiohttp", "fastapi", "gradio", "jedi", "minijinja (>=1.0)", "mypy (==1.5.1)", "numpy", "pytest (>=8.1.1,<8.2.2)", "pytest-asyncio", "pytest-cov", "pytest-env", "pytest-mock", "pytest-rerunfailures", "pytest-vcr", "pytest-xdist", "ruff (>=0.5.0)", "soundfile", "types-PyYAML", "types-requests", "types-simplejson", "types-toml", "types-tqdm", "types-urllib3", "typing-extensions (>=4.8.0)", "urllib3 (<2.0)"]
+all = ["InquirerPy (==0.3.4)", "Jinja2", "Pillow", "aiohttp", "authlib (>=1.3.2)", "fastapi", "gradio (>=4.0.0)", "httpx", "itsdangerous", "jedi", "libcst (>=1.4.0)", "mypy (==1.15.0) ; python_version >= \"3.9\"", "mypy (>=1.14.1,<1.15.0) ; python_version == \"3.8\"", "numpy", "pytest (>=8.1.1,<8.2.2)", "pytest-asyncio", "pytest-cov", "pytest-env", "pytest-mock", "pytest-rerunfailures", "pytest-vcr", "pytest-xdist", "ruff (>=0.9.0)", "soundfile", "types-PyYAML", "types-requests", "types-simplejson", "types-toml", "types-tqdm", "types-urllib3", "typing-extensions (>=4.8.0)", "urllib3 (<2.0)"]
 cli = ["InquirerPy (==0.3.4)"]
-dev = ["InquirerPy (==0.3.4)", "Jinja2", "Pillow", "aiohttp", "fastapi", "gradio", "jedi", "minijinja (>=1.0)", "mypy (==1.5.1)", "numpy", "pytest (>=8.1.1,<8.2.2)", "pytest-asyncio", "pytest-cov", "pytest-env", "pytest-mock", "pytest-rerunfailures", "pytest-vcr", "pytest-xdist", "ruff (>=0.5.0)", "soundfile", "types-PyYAML", "types-requests", "types-simplejson", "types-toml", "types-tqdm", "types-urllib3", "typing-extensions (>=4.8.0)", "urllib3 (<2.0)"]
+dev = ["InquirerPy (==0.3.4)", "Jinja2", "Pillow", "aiohttp", "authlib (>=1.3.2)", "fastapi", "gradio (>=4.0.0)", "httpx", "itsdangerous", "jedi", "libcst (>=1.4.0)", "mypy (==1.15.0) ; python_version >= \"3.9\"", "mypy (>=1.14.1,<1.15.0) ; python_version == \"3.8\"", "numpy", "pytest (>=8.1.1,<8.2.2)", "pytest-asyncio", "pytest-cov", "pytest-env", "pytest-mock", "pytest-rerunfailures", "pytest-vcr", "pytest-xdist", "ruff (>=0.9.0)", "soundfile", "types-PyYAML", "types-requests", "types-simplejson", "types-toml", "types-tqdm", "types-urllib3", "typing-extensions (>=4.8.0)", "urllib3 (<2.0)"]
 fastai = ["fastai (>=2.4)", "fastcore (>=1.3.27)", "toml"]
 hf-transfer = ["hf-transfer (>=0.1.4)"]
-inference = ["aiohttp", "minijinja (>=1.0)"]
-quality = ["mypy (==1.5.1)", "ruff (>=0.5.0)"]
+hf-xet = ["hf-xet (>=1.1.2,<2.0.0)"]
+inference = ["aiohttp"]
+mcp = ["aiohttp", "mcp (>=1.8.0)", "typer"]
+oauth = ["authlib (>=1.3.2)", "fastapi", "httpx", "itsdangerous"]
+quality = ["libcst (>=1.4.0)", "mypy (==1.15.0) ; python_version >= \"3.9\"", "mypy (>=1.14.1,<1.15.0) ; python_version == \"3.8\"", "ruff (>=0.9.0)"]
 tensorflow = ["graphviz", "pydot", "tensorflow"]
 tensorflow-testing = ["keras (<3.0)", "tensorflow"]
-testing = ["InquirerPy (==0.3.4)", "Jinja2", "Pillow", "aiohttp", "fastapi", "gradio", "jedi", "minijinja (>=1.0)", "numpy", "pytest (>=8.1.1,<8.2.2)", "pytest-asyncio", "pytest-cov", "pytest-env", "pytest-mock", "pytest-rerunfailures", "pytest-vcr", "pytest-xdist", "soundfile", "urllib3 (<2.0)"]
+testing = ["InquirerPy (==0.3.4)", "Jinja2", "Pillow", "aiohttp", "authlib (>=1.3.2)", "fastapi", "gradio (>=4.0.0)", "httpx", "itsdangerous", "jedi", "numpy", "pytest (>=8.1.1,<8.2.2)", "pytest-asyncio", "pytest-cov", "pytest-env", "pytest-mock", "pytest-rerunfailures", "pytest-vcr", "pytest-xdist", "soundfile", "urllib3 (<2.0)"]
 torch = ["safetensors[torch]", "torch"]
 typing = ["types-PyYAML", "types-requests", "types-simplejson", "types-toml", "types-tqdm", "types-urllib3", "typing-extensions (>=4.8.0)"]
 
 [[package]]
 name = "identify"
-version = "2.6.6"
+version = "2.6.19"
 description = "File identification library for Python"
 optional = false
-python-versions = ">=3.9"
-groups = ["main", "dev"]
+python-versions = ">=3.10"
+groups = ["dev"]
 files = [
-    {file = "identify-2.6.6-py2.py3-none-any.whl", hash = "sha256:cbd1810bce79f8b671ecb20f53ee0ae8e86ae84b557de31d89709dc2a48ba881"},
-    {file = "identify-2.6.6.tar.gz", hash = "sha256:7bec12768ed44ea4761efb47806f0a41f86e7c0a5fdf5950d4648c90eca7e251"},
+    {file = "identify-2.6.19-py2.py3-none-any.whl", hash = "sha256:20e6a87f786f768c092a721ad107fc9df0eb89347be9396cadf3f4abbd1fb78a"},
+    {file = "identify-2.6.19.tar.gz", hash = "sha256:6be5020c38fcb07da56c53733538a3081ea5aa70d36a156f83044bfbf9173842"},
 ]
 
 [package.extras]
@@ -1802,18 +1835,18 @@ license = ["ukkonen"]
 
 [[package]]
 name = "idna"
-version = "3.10"
+version = "3.18"
 description = "Internationalized Domain Names in Applications (IDNA)"
 optional = false
-python-versions = ">=3.6"
-groups = ["main"]
+python-versions = ">=3.9"
+groups = ["main", "training"]
 files = [
-    {file = "idna-3.10-py3-none-any.whl", hash = "sha256:946d195a0d259cbba61165e88e65941f16e9b36ea6ddb97f00452bae8b1287d3"},
-    {file = "idna-3.10.tar.gz", hash = "sha256:12f65c9b470abda6dc35cf8e63cc574b1c52b11df2c86030af0ac09b01b13ea9"},
+    {file = "idna-3.18-py3-none-any.whl", hash = "sha256:7f952cbe720b688055e3f87de14f5c3e5fdaa8bc3928985c4077ca689de849a2"},
+    {file = "idna-3.18.tar.gz", hash = "sha256:ffb385a7e039654cef1ab9ef32c6fafe283c0c0467bba1d9029738ce4a14a848"},
 ]
 
 [package.extras]
-all = ["flake8 (>=7.1.1)", "mypy (>=1.11.2)", "pytest (>=8.3.2)", "ruff (>=0.6.2)"]
+all = ["mypy (>=1.11.2)", "pytest (>=8.3.2)", "ruff (>=0.6.2)"]
 
 [[package]]
 name = "imageio"
@@ -1885,27 +1918,27 @@ numpy = "*"
 
 [[package]]
 name = "importlib-metadata"
-version = "8.6.1"
+version = "9.0.0"
 description = "Read metadata from Python packages"
 optional = false
-python-versions = ">=3.9"
+python-versions = ">=3.10"
 groups = ["main"]
 files = [
-    {file = "importlib_metadata-8.6.1-py3-none-any.whl", hash = "sha256:02a89390c1e15fdfdc0d7c6b25cb3e62650d0494005c97d6f148bf5b9787525e"},
-    {file = "importlib_metadata-8.6.1.tar.gz", hash = "sha256:310b41d755445d74569f993ccfc22838295d9fe005425094fad953d7f15c8580"},
+    {file = "importlib_metadata-9.0.0-py3-none-any.whl", hash = "sha256:2d21d1cc5a017bd0559e36150c21c830ab1dc304dedd1b7ea85d20f45ef3edd7"},
+    {file = "importlib_metadata-9.0.0.tar.gz", hash = "sha256:a4f57ab599e6a2e3016d7595cfd72eb4661a5106e787a95bcc90c7105b831efc"},
 ]
 
 [package.dependencies]
 zipp = ">=3.20"
 
 [package.extras]
-check = ["pytest-checkdocs (>=2.4)", "pytest-ruff (>=0.2.1) ; sys_platform != \"cygwin\""]
+check = ["pytest-checkdocs (>=2.14)", "pytest-ruff (>=0.2.1) ; sys_platform != \"cygwin\""]
 cover = ["pytest-cov"]
 doc = ["furo", "jaraco.packaging (>=9.3)", "jaraco.tidelift (>=1.4)", "rst.linker (>=1.9)", "sphinx (>=3.5)", "sphinx-lint"]
-enabler = ["pytest-enabler (>=2.2)"]
+enabler = ["pytest-enabler (>=3.4)"]
 perf = ["ipython"]
-test = ["flufl.flake8", "importlib_resources (>=1.3) ; python_version < \"3.9\"", "jaraco.test (>=5.4)", "packaging", "pyfakefs", "pytest (>=6,!=8.1.*)", "pytest-perf (>=0.9.2)"]
-type = ["pytest-mypy"]
+test = ["packaging", "pyfakefs", "pytest (>=6,!=8.1.*)", "pytest-perf (>=0.9.2)"]
+type = ["pytest-mypy (>=1.0.1) ; platform_python_implementation != \"PyPy\""]
 
 [[package]]
 name = "imwatermark"
@@ -1925,73 +1958,58 @@ numpy = "*"
 
 [[package]]
 name = "iniconfig"
-version = "2.0.0"
+version = "2.3.0"
 description = "brain-dead simple config-ini parsing"
 optional = false
-python-versions = ">=3.7"
-groups = ["main", "dev"]
+python-versions = ">=3.10"
+groups = ["dev"]
 files = [
-    {file = "iniconfig-2.0.0-py3-none-any.whl", hash = "sha256:b6a85871a79d2e3b22d2d1b94ac2824226a63c6b741c88f7ae975f18b6778374"},
-    {file = "iniconfig-2.0.0.tar.gz", hash = "sha256:2d91e135bf72d31a410b17c16da610a82cb55f6b0477d1a902134b24a455b8b3"},
+    {file = "iniconfig-2.3.0-py3-none-any.whl", hash = "sha256:f631c04d2c48c52b84d0d0549c99ff3859c98df65b3101406327ecc7d53fbf12"},
+    {file = "iniconfig-2.3.0.tar.gz", hash = "sha256:c76315c77db068650d49c5b56314774a7804df16fee4402c1f19d6d15d8c4730"},
 ]
 
 [[package]]
-name = "invoke"
+name = "itsdangerous"
 version = "2.2.0"
-description = "Pythonic task execution"
-optional = false
-python-versions = ">=3.6"
+description = "Safely pass data to untrusted environments and back."
+optional = true
+python-versions = ">=3.8"
 groups = ["main"]
+markers = "extra == \"trackio\""
 files = [
-    {file = "invoke-2.2.0-py3-none-any.whl", hash = "sha256:6ea924cc53d4f78e3d98bc436b08069a03077e6f85ad1ddaa8a116d7dad15820"},
-    {file = "invoke-2.2.0.tar.gz", hash = "sha256:ee6cbb101af1a859c7fe84f2a264c059020b0cb7fe3535f9424300ab568f6bd5"},
+    {file = "itsdangerous-2.2.0-py3-none-any.whl", hash = "sha256:c6242fc49e35958c8b15141343aa660db5fc54d4f13a1db01a3f5891b98700ef"},
+    {file = "itsdangerous-2.2.0.tar.gz", hash = "sha256:e0050c0b7da1eea53ffaf149c0cfbb5c6e2e2b69c4bef22c81fa6eb73e5f6173"},
 ]
 
-[[package]]
-name = "isort"
-version = "5.13.2"
-description = "A Python utility / library to sort Python imports."
-optional = false
-python-versions = ">=3.8.0"
-groups = ["dev"]
-files = [
-    {file = "isort-5.13.2-py3-none-any.whl", hash = "sha256:8ca5e72a8d85860d5a3fa69b8745237f2939afe12dbf656afbcb47fe72d947a6"},
-    {file = "isort-5.13.2.tar.gz", hash = "sha256:48fdfcb9face5d58a4f6dde2e72a1fb8dcaf8ab26f95ab49fab84c2ddefb0109"},
-]
-
-[package.extras]
-colors = ["colorama (>=0.4.6)"]
-
 [[package]]
 name = "jedi"
-version = "0.19.2"
+version = "0.20.0"
 description = "An autocompletion tool for Python that can be used for text editors."
 optional = false
-python-versions = ">=3.6"
-groups = ["main"]
+python-versions = ">=3.10"
+groups = ["dev"]
 files = [
-    {file = "jedi-0.19.2-py2.py3-none-any.whl", hash = "sha256:a8ef22bde8490f57fe5c7681a3c83cb58874daf72b4784de3cce5b6ef6edb5b9"},
-    {file = "jedi-0.19.2.tar.gz", hash = "sha256:4770dc3de41bde3966b02eb84fbcf557fb33cce26ad23da12c742fb50ecb11f0"},
+    {file = "jedi-0.20.0-py2.py3-none-any.whl", hash = "sha256:7bdd9c2634f56713299976f4cbd59cb3fa92165cc5e05ea811fb253480728b67"},
+    {file = "jedi-0.20.0.tar.gz", hash = "sha256:c3f4ccbd276696f4b19c54618d4fb18f9fc24b0aef02acf704b23f487daa1011"},
 ]
 
 [package.dependencies]
-parso = ">=0.8.4,<0.9.0"
+parso = ">=0.8.6,<0.9.0"
 
 [package.extras]
-docs = ["Jinja2 (==2.11.3)", "MarkupSafe (==1.1.1)", "Pygments (==2.8.1)", "alabaster (==0.7.12)", "babel (==2.9.1)", "chardet (==4.0.0)", "commonmark (==0.8.1)", "docutils (==0.17.1)", "future (==0.18.2)", "idna (==2.10)", "imagesize (==1.2.0)", "mock (==1.0.1)", "packaging (==20.9)", "pyparsing (==2.4.7)", "pytz (==2021.1)", "readthedocs-sphinx-ext (==2.1.4)", "recommonmark (==0.5.0)", "requests (==2.25.1)", "six (==1.15.0)", "snowballstemmer (==2.1.0)", "sphinx (==1.8.5)", "sphinx-rtd-theme (==0.4.3)", "sphinxcontrib-serializinghtml (==1.1.4)", "sphinxcontrib-websupport (==1.2.4)", "urllib3 (==1.26.4)"]
-qa = ["flake8 (==5.0.4)", "mypy (==0.971)", "types-setuptools (==67.2.0.1)"]
-testing = ["Django", "attrs", "colorama", "docopt", "pytest (<9.0.0)"]
+dev = ["Django", "attrs", "colorama", "docopt", "flake8 (==7.1.2)", "pytest (<9.0.0)", "types-setuptools (==80.9.0.20250529)", "typing-extensions", "zuban (==0.7.0)"]
+docs = ["Jinja2 (==3.1.6)", "MarkupSafe (==3.0.3)", "Pygments (==2.20.0)", "Sphinx (==9.1.0)", "alabaster (==1.0.0)", "babel (==2.18.0)", "certifi (==2026.4.22)", "charset-normalizer (==3.4.7)", "docutils (==0.22.4)", "idna (==3.13)", "imagesize (==2.0.0)", "iniconfig (==2.3.0)", "packaging (==26.2)", "pluggy (==1.6.0)", "pytest (==9.0.3)", "requests (==2.33.1)", "roman-numerals (==4.1.0)", "snowballstemmer (==3.0.1)", "sphinx-rtd-theme (==3.1.0)", "sphinxcontrib-applehelp (==2.0.0)", "sphinxcontrib-devhelp (==2.0.0)", "sphinxcontrib-htmlhelp (==2.1.0)", "sphinxcontrib-jquery (==4.1)", "sphinxcontrib-jsmath (==1.0.1)", "sphinxcontrib-qthelp (==2.0.0)", "sphinxcontrib-serializinghtml (==2.0.0)", "urllib3 (==2.6.3)"]
 
 [[package]]
 name = "jinja2"
-version = "3.1.5"
+version = "3.1.6"
 description = "A very fast and expressive template engine."
 optional = false
 python-versions = ">=3.7"
-groups = ["main"]
+groups = ["main", "training"]
 files = [
-    {file = "jinja2-3.1.5-py3-none-any.whl", hash = "sha256:aba0f4dc9ed8013c424088f68a5c226f7d6097ed89b246d7749c2ec4175c6adb"},
-    {file = "jinja2-3.1.5.tar.gz", hash = "sha256:8fefff8dc3034e27bb80d67c671eb8a9bc424c0ef4c0826edbff304cceff43bb"},
+    {file = "jinja2-3.1.6-py3-none-any.whl", hash = "sha256:85ece4451f492d0c13c5dd7c13a64681a86afae63a5f347908daf103ce6d2f67"},
+    {file = "jinja2-3.1.6.tar.gz", hash = "sha256:0137fb05990d35f1275a587e9aee6d56da821fc83491a0fb838183be43f66d6d"},
 ]
 
 [package.dependencies]
@@ -2000,156 +2018,36 @@ MarkupSafe = ">=2.0"
 [package.extras]
 i18n = ["Babel (>=2.7)"]
 
-[[package]]
-name = "jmespath"
-version = "1.0.1"
-description = "JSON Matching Expressions"
-optional = false
-python-versions = ">=3.7"
-groups = ["main"]
-files = [
-    {file = "jmespath-1.0.1-py3-none-any.whl", hash = "sha256:02e2e4cc71b5bcab88332eebf907519190dd9e6e82107fa7f83b1003a6252980"},
-    {file = "jmespath-1.0.1.tar.gz", hash = "sha256:90261b206d6defd58fdd5e85f478bf633a2901798906be2ad389150c5c60edbe"},
-]
-
 [[package]]
 name = "joblib"
-version = "1.4.2"
+version = "1.5.3"
 description = "Lightweight pipelining with Python functions"
 optional = false
-python-versions = ">=3.8"
-groups = ["main"]
+python-versions = ">=3.9"
+groups = ["training"]
 files = [
-    {file = "joblib-1.4.2-py3-none-any.whl", hash = "sha256:06d478d5674cbc267e7496a410ee875abd68e4340feff4490bcb7afb88060ae6"},
-    {file = "joblib-1.4.2.tar.gz", hash = "sha256:2382c5816b2636fbd20a09e0f4e9dad4736765fdfb7dca582943b9c1366b3f0e"},
+    {file = "joblib-1.5.3-py3-none-any.whl", hash = "sha256:5fc3c5039fc5ca8c0276333a188bbd59d6b7ab37fe6632daa76bc7f9ec18e713"},
+    {file = "joblib-1.5.3.tar.gz", hash = "sha256:8561a3269e6801106863fd0d6d84bb737be9e7631e33aaed3fb9ce5953688da3"},
 ]
 
 [[package]]
-name = "jsonschema"
-version = "4.23.0"
-description = "An implementation of JSON Schema validation for Python"
-optional = false
-python-versions = ">=3.8"
+name = "joserfc"
+version = "1.7.1"
+description = "The ultimate Python library for JOSE RFCs, including JWS, JWE, JWK, JWA, JWT"
+optional = true
+python-versions = ">=3.10"
 groups = ["main"]
+markers = "extra == \"trackio\""
 files = [
-    {file = "jsonschema-4.23.0-py3-none-any.whl", hash = "sha256:fbadb6f8b144a8f8cf9f0b89ba94501d143e50411a1278633f56a7acf7fd5566"},
-    {file = "jsonschema-4.23.0.tar.gz", hash = "sha256:d71497fef26351a33265337fa77ffeb82423f3ea21283cd9467bb03999266bc4"},
+    {file = "joserfc-1.7.1-py3-none-any.whl", hash = "sha256:b3e3d655612e2e1ef67b2600f2f420e12e537b020208fab1761fad647319c164"},
+    {file = "joserfc-1.7.1.tar.gz", hash = "sha256:77d0b76514879c68c6f433bc5b7357a4ab72008ff1e33d8379fd11d72bd8ca81"},
 ]
 
 [package.dependencies]
-attrs = ">=22.2.0"
-jsonschema-specifications = ">=2023.03.6"
-referencing = ">=0.28.4"
-rpds-py = ">=0.7.1"
+cryptography = ">=45.0.1"
 
 [package.extras]
-format = ["fqdn", "idna", "isoduration", "jsonpointer (>1.13)", "rfc3339-validator", "rfc3987", "uri-template", "webcolors (>=1.11)"]
-format-nongpl = ["fqdn", "idna", "isoduration", "jsonpointer (>1.13)", "rfc3339-validator", "rfc3986-validator (>0.1.0)", "uri-template", "webcolors (>=24.6.0)"]
-
-[[package]]
-name = "jsonschema-specifications"
-version = "2024.10.1"
-description = "The JSON Schema meta-schemas and vocabularies, exposed as a Registry"
-optional = false
-python-versions = ">=3.9"
-groups = ["main"]
-files = [
-    {file = "jsonschema_specifications-2024.10.1-py3-none-any.whl", hash = "sha256:a09a0680616357d9a0ecf05c12ad234479f549239d0f5b55f3deea67475da9bf"},
-    {file = "jsonschema_specifications-2024.10.1.tar.gz", hash = "sha256:0f38b83639958ce1152d02a7f062902c41c8fd20d558b0c34344292d417ae272"},
-]
-
-[package.dependencies]
-referencing = ">=0.31.0"
-
-[[package]]
-name = "kiwisolver"
-version = "1.4.8"
-description = "A fast implementation of the Cassowary constraint solver"
-optional = false
-python-versions = ">=3.10"
-groups = ["main"]
-files = [
-    {file = "kiwisolver-1.4.8-cp310-cp310-macosx_10_9_universal2.whl", hash = "sha256:88c6f252f6816a73b1f8c904f7bbe02fd67c09a69f7cb8a0eecdbf5ce78e63db"},
-    {file = "kiwisolver-1.4.8-cp310-cp310-macosx_10_9_x86_64.whl", hash = "sha256:c72941acb7b67138f35b879bbe85be0f6c6a70cab78fe3ef6db9c024d9223e5b"},
-    {file = "kiwisolver-1.4.8-cp310-cp310-macosx_11_0_arm64.whl", hash = "sha256:ce2cf1e5688edcb727fdf7cd1bbd0b6416758996826a8be1d958f91880d0809d"},
-    {file = "kiwisolver-1.4.8-cp310-cp310-manylinux_2_12_i686.manylinux2010_i686.whl", hash = "sha256:c8bf637892dc6e6aad2bc6d4d69d08764166e5e3f69d469e55427b6ac001b19d"},
-    {file = "kiwisolver-1.4.8-cp310-cp310-manylinux_2_12_x86_64.manylinux2010_x86_64.whl", hash = "sha256:034d2c891f76bd3edbdb3ea11140d8510dca675443da7304205a2eaa45d8334c"},
-    {file = "kiwisolver-1.4.8-cp310-cp310-manylinux_2_17_aarch64.manylinux2014_aarch64.whl", hash = "sha256:d47b28d1dfe0793d5e96bce90835e17edf9a499b53969b03c6c47ea5985844c3"},
-    {file = "kiwisolver-1.4.8-cp310-cp310-manylinux_2_17_ppc64le.manylinux2014_ppc64le.whl", hash = "sha256:eb158fe28ca0c29f2260cca8c43005329ad58452c36f0edf298204de32a9a3ed"},
-    {file = "kiwisolver-1.4.8-cp310-cp310-manylinux_2_17_s390x.manylinux2014_s390x.whl", hash = "sha256:d5536185fce131780ebd809f8e623bf4030ce1b161353166c49a3c74c287897f"},
-    {file = "kiwisolver-1.4.8-cp310-cp310-musllinux_1_2_aarch64.whl", hash = "sha256:369b75d40abedc1da2c1f4de13f3482cb99e3237b38726710f4a793432b1c5ff"},
-    {file = "kiwisolver-1.4.8-cp310-cp310-musllinux_1_2_i686.whl", hash = "sha256:641f2ddf9358c80faa22e22eb4c9f54bd3f0e442e038728f500e3b978d00aa7d"},
-    {file = "kiwisolver-1.4.8-cp310-cp310-musllinux_1_2_ppc64le.whl", hash = "sha256:d561d2d8883e0819445cfe58d7ddd673e4015c3c57261d7bdcd3710d0d14005c"},
-    {file = "kiwisolver-1.4.8-cp310-cp310-musllinux_1_2_s390x.whl", hash = "sha256:1732e065704b47c9afca7ffa272f845300a4eb959276bf6970dc07265e73b605"},
-    {file = "kiwisolver-1.4.8-cp310-cp310-musllinux_1_2_x86_64.whl", hash = "sha256:bcb1ebc3547619c3b58a39e2448af089ea2ef44b37988caf432447374941574e"},
-    {file = "kiwisolver-1.4.8-cp310-cp310-win_amd64.whl", hash = "sha256:89c107041f7b27844179ea9c85d6da275aa55ecf28413e87624d033cf1f6b751"},
-    {file = "kiwisolver-1.4.8-cp310-cp310-win_arm64.whl", hash = "sha256:b5773efa2be9eb9fcf5415ea3ab70fc785d598729fd6057bea38d539ead28271"},
-    {file = "kiwisolver-1.4.8-cp311-cp311-macosx_10_9_universal2.whl", hash = "sha256:a4d3601908c560bdf880f07d94f31d734afd1bb71e96585cace0e38ef44c6d84"},
-    {file = "kiwisolver-1.4.8-cp311-cp311-macosx_10_9_x86_64.whl", hash = "sha256:856b269c4d28a5c0d5e6c1955ec36ebfd1651ac00e1ce0afa3e28da95293b561"},
-    {file = "kiwisolver-1.4.8-cp311-cp311-macosx_11_0_arm64.whl", hash = "sha256:c2b9a96e0f326205af81a15718a9073328df1173a2619a68553decb7097fd5d7"},
-    {file = "kiwisolver-1.4.8-cp311-cp311-manylinux_2_12_i686.manylinux2010_i686.manylinux_2_17_i686.manylinux2014_i686.whl", hash = "sha256:c5020c83e8553f770cb3b5fc13faac40f17e0b205bd237aebd21d53d733adb03"},
-    {file = "kiwisolver-1.4.8-cp311-cp311-manylinux_2_17_aarch64.manylinux2014_aarch64.whl", hash = "sha256:dace81d28c787956bfbfbbfd72fdcef014f37d9b48830829e488fdb32b49d954"},
-    {file = "kiwisolver-1.4.8-cp311-cp311-manylinux_2_17_ppc64le.manylinux2014_ppc64le.whl", hash = "sha256:11e1022b524bd48ae56c9b4f9296bce77e15a2e42a502cceba602f804b32bb79"},
-    {file = "kiwisolver-1.4.8-cp311-cp311-manylinux_2_17_s390x.manylinux2014_s390x.whl", hash = "sha256:3b9b4d2892fefc886f30301cdd80debd8bb01ecdf165a449eb6e78f79f0fabd6"},
-    {file = "kiwisolver-1.4.8-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl", hash = "sha256:3a96c0e790ee875d65e340ab383700e2b4891677b7fcd30a699146f9384a2bb0"},
-    {file = "kiwisolver-1.4.8-cp311-cp311-musllinux_1_2_aarch64.whl", hash = "sha256:23454ff084b07ac54ca8be535f4174170c1094a4cff78fbae4f73a4bcc0d4dab"},
-    {file = "kiwisolver-1.4.8-cp311-cp311-musllinux_1_2_i686.whl", hash = "sha256:87b287251ad6488e95b4f0b4a79a6d04d3ea35fde6340eb38fbd1ca9cd35bbbc"},
-    {file = "kiwisolver-1.4.8-cp311-cp311-musllinux_1_2_ppc64le.whl", hash = "sha256:b21dbe165081142b1232a240fc6383fd32cdd877ca6cc89eab93e5f5883e1c25"},
-    {file = "kiwisolver-1.4.8-cp311-cp311-musllinux_1_2_s390x.whl", hash = "sha256:768cade2c2df13db52475bd28d3a3fac8c9eff04b0e9e2fda0f3760f20b3f7fc"},
-    {file = "kiwisolver-1.4.8-cp311-cp311-musllinux_1_2_x86_64.whl", hash = "sha256:d47cfb2650f0e103d4bf68b0b5804c68da97272c84bb12850d877a95c056bd67"},
-    {file = "kiwisolver-1.4.8-cp311-cp311-win_amd64.whl", hash = "sha256:ed33ca2002a779a2e20eeb06aea7721b6e47f2d4b8a8ece979d8ba9e2a167e34"},
-    {file = "kiwisolver-1.4.8-cp311-cp311-win_arm64.whl", hash = "sha256:16523b40aab60426ffdebe33ac374457cf62863e330a90a0383639ce14bf44b2"},
-    {file = "kiwisolver-1.4.8-cp312-cp312-macosx_10_13_universal2.whl", hash = "sha256:d6af5e8815fd02997cb6ad9bbed0ee1e60014438ee1a5c2444c96f87b8843502"},
-    {file = "kiwisolver-1.4.8-cp312-cp312-macosx_10_13_x86_64.whl", hash = "sha256:bade438f86e21d91e0cf5dd7c0ed00cda0f77c8c1616bd83f9fc157fa6760d31"},
-    {file = "kiwisolver-1.4.8-cp312-cp312-macosx_11_0_arm64.whl", hash = "sha256:b83dc6769ddbc57613280118fb4ce3cd08899cc3369f7d0e0fab518a7cf37fdb"},
-    {file = "kiwisolver-1.4.8-cp312-cp312-manylinux_2_12_i686.manylinux2010_i686.manylinux_2_17_i686.manylinux2014_i686.whl", hash = "sha256:111793b232842991be367ed828076b03d96202c19221b5ebab421ce8bcad016f"},
-    {file = "kiwisolver-1.4.8-cp312-cp312-manylinux_2_17_aarch64.manylinux2014_aarch64.whl", hash = "sha256:257af1622860e51b1a9d0ce387bf5c2c4f36a90594cb9514f55b074bcc787cfc"},
-    {file = "kiwisolver-1.4.8-cp312-cp312-manylinux_2_17_ppc64le.manylinux2014_ppc64le.whl", hash = "sha256:69b5637c3f316cab1ec1c9a12b8c5f4750a4c4b71af9157645bf32830e39c03a"},
-    {file = "kiwisolver-1.4.8-cp312-cp312-manylinux_2_17_s390x.manylinux2014_s390x.whl", hash = "sha256:782bb86f245ec18009890e7cb8d13a5ef54dcf2ebe18ed65f795e635a96a1c6a"},
-    {file = "kiwisolver-1.4.8-cp312-cp312-manylinux_2_17_x86_64.manylinux2014_x86_64.whl", hash = "sha256:cc978a80a0db3a66d25767b03688f1147a69e6237175c0f4ffffaaedf744055a"},
-    {file = "kiwisolver-1.4.8-cp312-cp312-musllinux_1_2_aarch64.whl", hash = "sha256:36dbbfd34838500a31f52c9786990d00150860e46cd5041386f217101350f0d3"},
-    {file = "kiwisolver-1.4.8-cp312-cp312-musllinux_1_2_i686.whl", hash = "sha256:eaa973f1e05131de5ff3569bbba7f5fd07ea0595d3870ed4a526d486fe57fa1b"},
-    {file = "kiwisolver-1.4.8-cp312-cp312-musllinux_1_2_ppc64le.whl", hash = "sha256:a66f60f8d0c87ab7f59b6fb80e642ebb29fec354a4dfad687ca4092ae69d04f4"},
-    {file = "kiwisolver-1.4.8-cp312-cp312-musllinux_1_2_s390x.whl", hash = "sha256:858416b7fb777a53f0c59ca08190ce24e9abbd3cffa18886a5781b8e3e26f65d"},
-    {file = "kiwisolver-1.4.8-cp312-cp312-musllinux_1_2_x86_64.whl", hash = "sha256:085940635c62697391baafaaeabdf3dd7a6c3643577dde337f4d66eba021b2b8"},
-    {file = "kiwisolver-1.4.8-cp312-cp312-win_amd64.whl", hash = "sha256:01c3d31902c7db5fb6182832713d3b4122ad9317c2c5877d0539227d96bb2e50"},
-    {file = "kiwisolver-1.4.8-cp312-cp312-win_arm64.whl", hash = "sha256:a3c44cb68861de93f0c4a8175fbaa691f0aa22550c331fefef02b618a9dcb476"},
-    {file = "kiwisolver-1.4.8-cp313-cp313-macosx_10_13_universal2.whl", hash = "sha256:1c8ceb754339793c24aee1c9fb2485b5b1f5bb1c2c214ff13368431e51fc9a09"},
-    {file = "kiwisolver-1.4.8-cp313-cp313-macosx_10_13_x86_64.whl", hash = "sha256:54a62808ac74b5e55a04a408cda6156f986cefbcf0ada13572696b507cc92fa1"},
-    {file = "kiwisolver-1.4.8-cp313-cp313-macosx_11_0_arm64.whl", hash = "sha256:68269e60ee4929893aad82666821aaacbd455284124817af45c11e50a4b42e3c"},
-    {file = "kiwisolver-1.4.8-cp313-cp313-manylinux_2_12_i686.manylinux2010_i686.manylinux_2_17_i686.manylinux2014_i686.whl", hash = "sha256:34d142fba9c464bc3bbfeff15c96eab0e7310343d6aefb62a79d51421fcc5f1b"},
-    {file = "kiwisolver-1.4.8-cp313-cp313-manylinux_2_17_aarch64.manylinux2014_aarch64.whl", hash = "sha256:3ddc373e0eef45b59197de815b1b28ef89ae3955e7722cc9710fb91cd77b7f47"},
-    {file = "kiwisolver-1.4.8-cp313-cp313-manylinux_2_17_ppc64le.manylinux2014_ppc64le.whl", hash = "sha256:77e6f57a20b9bd4e1e2cedda4d0b986ebd0216236f0106e55c28aea3d3d69b16"},
-    {file = "kiwisolver-1.4.8-cp313-cp313-manylinux_2_17_s390x.manylinux2014_s390x.whl", hash = "sha256:08e77738ed7538f036cd1170cbed942ef749137b1311fa2bbe2a7fda2f6bf3cc"},
-    {file = "kiwisolver-1.4.8-cp313-cp313-manylinux_2_17_x86_64.manylinux2014_x86_64.whl", hash = "sha256:a5ce1e481a74b44dd5e92ff03ea0cb371ae7a0268318e202be06c8f04f4f1246"},
-    {file = "kiwisolver-1.4.8-cp313-cp313-musllinux_1_2_aarch64.whl", hash = "sha256:fc2ace710ba7c1dfd1a3b42530b62b9ceed115f19a1656adefce7b1782a37794"},
-    {file = "kiwisolver-1.4.8-cp313-cp313-musllinux_1_2_i686.whl", hash = "sha256:3452046c37c7692bd52b0e752b87954ef86ee2224e624ef7ce6cb21e8c41cc1b"},
-    {file = "kiwisolver-1.4.8-cp313-cp313-musllinux_1_2_ppc64le.whl", hash = "sha256:7e9a60b50fe8b2ec6f448fe8d81b07e40141bfced7f896309df271a0b92f80f3"},
-    {file = "kiwisolver-1.4.8-cp313-cp313-musllinux_1_2_s390x.whl", hash = "sha256:918139571133f366e8362fa4a297aeba86c7816b7ecf0bc79168080e2bd79957"},
-    {file = "kiwisolver-1.4.8-cp313-cp313-musllinux_1_2_x86_64.whl", hash = "sha256:e063ef9f89885a1d68dd8b2e18f5ead48653176d10a0e324e3b0030e3a69adeb"},
-    {file = "kiwisolver-1.4.8-cp313-cp313-win_amd64.whl", hash = "sha256:a17b7c4f5b2c51bb68ed379defd608a03954a1845dfed7cc0117f1cc8a9b7fd2"},
-    {file = "kiwisolver-1.4.8-cp313-cp313-win_arm64.whl", hash = "sha256:3cd3bc628b25f74aedc6d374d5babf0166a92ff1317f46267f12d2ed54bc1d30"},
-    {file = "kiwisolver-1.4.8-cp313-cp313t-macosx_10_13_universal2.whl", hash = "sha256:370fd2df41660ed4e26b8c9d6bbcad668fbe2560462cba151a721d49e5b6628c"},
-    {file = "kiwisolver-1.4.8-cp313-cp313t-macosx_10_13_x86_64.whl", hash = "sha256:84a2f830d42707de1d191b9490ac186bf7997a9495d4e9072210a1296345f7dc"},
-    {file = "kiwisolver-1.4.8-cp313-cp313t-macosx_11_0_arm64.whl", hash = "sha256:7a3ad337add5148cf51ce0b55642dc551c0b9d6248458a757f98796ca7348712"},
-    {file = "kiwisolver-1.4.8-cp313-cp313t-manylinux_2_12_i686.manylinux2010_i686.manylinux_2_17_i686.manylinux2014_i686.whl", hash = "sha256:7506488470f41169b86d8c9aeff587293f530a23a23a49d6bc64dab66bedc71e"},
-    {file = "kiwisolver-1.4.8-cp313-cp313t-manylinux_2_17_aarch64.manylinux2014_aarch64.whl", hash = "sha256:2f0121b07b356a22fb0414cec4666bbe36fd6d0d759db3d37228f496ed67c880"},
-    {file = "kiwisolver-1.4.8-cp313-cp313t-manylinux_2_17_ppc64le.manylinux2014_ppc64le.whl", hash = "sha256:d6d6bd87df62c27d4185de7c511c6248040afae67028a8a22012b010bc7ad062"},
-    {file = "kiwisolver-1.4.8-cp313-cp313t-manylinux_2_17_s390x.manylinux2014_s390x.whl", hash = "sha256:291331973c64bb9cce50bbe871fb2e675c4331dab4f31abe89f175ad7679a4d7"},
-    {file = "kiwisolver-1.4.8-cp313-cp313t-manylinux_2_17_x86_64.manylinux2014_x86_64.whl", hash = "sha256:893f5525bb92d3d735878ec00f781b2de998333659507d29ea4466208df37bed"},
-    {file = "kiwisolver-1.4.8-cp313-cp313t-musllinux_1_2_aarch64.whl", hash = "sha256:b47a465040146981dc9db8647981b8cb96366fbc8d452b031e4f8fdffec3f26d"},
-    {file = "kiwisolver-1.4.8-cp313-cp313t-musllinux_1_2_i686.whl", hash = "sha256:99cea8b9dd34ff80c521aef46a1dddb0dcc0283cf18bde6d756f1e6f31772165"},
-    {file = "kiwisolver-1.4.8-cp313-cp313t-musllinux_1_2_ppc64le.whl", hash = "sha256:151dffc4865e5fe6dafce5480fab84f950d14566c480c08a53c663a0020504b6"},
-    {file = "kiwisolver-1.4.8-cp313-cp313t-musllinux_1_2_s390x.whl", hash = "sha256:577facaa411c10421314598b50413aa1ebcf5126f704f1e5d72d7e4e9f020d90"},
-    {file = "kiwisolver-1.4.8-cp313-cp313t-musllinux_1_2_x86_64.whl", hash = "sha256:be4816dc51c8a471749d664161b434912eee82f2ea66bd7628bd14583a833e85"},
-    {file = "kiwisolver-1.4.8-pp310-pypy310_pp73-macosx_10_15_x86_64.whl", hash = "sha256:e7a019419b7b510f0f7c9dceff8c5eae2392037eae483a7f9162625233802b0a"},
-    {file = "kiwisolver-1.4.8-pp310-pypy310_pp73-macosx_11_0_arm64.whl", hash = "sha256:286b18e86682fd2217a48fc6be6b0f20c1d0ed10958d8dc53453ad58d7be0bf8"},
-    {file = "kiwisolver-1.4.8-pp310-pypy310_pp73-manylinux_2_12_i686.manylinux2010_i686.manylinux_2_17_i686.manylinux2014_i686.whl", hash = "sha256:4191ee8dfd0be1c3666ccbac178c5a05d5f8d689bbe3fc92f3c4abec817f8fe0"},
-    {file = "kiwisolver-1.4.8-pp310-pypy310_pp73-manylinux_2_17_aarch64.manylinux2014_aarch64.whl", hash = "sha256:7cd2785b9391f2873ad46088ed7599a6a71e762e1ea33e87514b1a441ed1da1c"},
-    {file = "kiwisolver-1.4.8-pp310-pypy310_pp73-manylinux_2_17_x86_64.manylinux2014_x86_64.whl", hash = "sha256:c07b29089b7ba090b6f1a669f1411f27221c3662b3a1b7010e67b59bb5a6f10b"},
-    {file = "kiwisolver-1.4.8-pp310-pypy310_pp73-win_amd64.whl", hash = "sha256:65ea09a5a3faadd59c2ce96dc7bf0f364986a315949dc6374f04396b0d60e09b"},
-    {file = "kiwisolver-1.4.8.tar.gz", hash = "sha256:23d5f023bdc8c7e54eb65f03ca5d5bb25b601eac4d7f1a042888a1f45237987e"},
-]
+drafts = ["pycryptodome"]
 
 [[package]]
 name = "kornia"
@@ -2175,82 +2073,182 @@ x = ["accelerate", "onnxruntime-gpu (>=1.16) ; sys_platform != \"darwin\""]
 
 [[package]]
 name = "kornia-rs"
-version = "0.1.8"
+version = "0.1.14"
 description = "Low level implementations for computer vision in Rust"
 optional = false
 python-versions = ">=3.8"
 groups = ["main"]
 files = [
-    {file = "kornia_rs-0.1.8-cp310-cp310-macosx_10_12_x86_64.whl", hash = "sha256:1380edbbb841f9579bc8677d388e326b7363e1d0d49e8bab567ec9ef1aec782f"},
-    {file = "kornia_rs-0.1.8-cp310-cp310-macosx_11_0_arm64.whl", hash = "sha256:b82cf759df6f5fd935c1afd25aa3a145fd47f14af3650ad37c71189f49171bd8"},
-    {file = "kornia_rs-0.1.8-cp310-cp310-manylinux_2_17_aarch64.manylinux2014_aarch64.whl", hash = "sha256:1f12aeaf672493b456f2d35b4b3c88eda3dd8284807430d0b173cb3272c7ef61"},
-    {file = "kornia_rs-0.1.8-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl", hash = "sha256:3b57fd6262ef932a3131dd211764bf184380742a2aea0a12c54949af7c61c2ac"},
-    {file = "kornia_rs-0.1.8-cp310-cp310-win_amd64.whl", hash = "sha256:06f60ff032ce9824b5fe746d1e1cca06ea3f5ba72b71a907a1c48f0e27094333"},
-    {file = "kornia_rs-0.1.8-cp311-cp311-macosx_10_12_x86_64.whl", hash = "sha256:61b9822a68556198c5b526da939ddc3f9c630cab37c2d6bcf613c2de1bb3d088"},
-    {file = "kornia_rs-0.1.8-cp311-cp311-macosx_11_0_arm64.whl", hash = "sha256:2dc98296aeeccf2536c1f8efa99d3c273962c7a07a8ae7c088de09ecc19543c4"},
-    {file = "kornia_rs-0.1.8-cp311-cp311-manylinux_2_17_aarch64.manylinux2014_aarch64.whl", hash = "sha256:4968efcd26ca190977cfe84d38492a912ad95f13222473dbeb90f330aab51d82"},
-    {file = "kornia_rs-0.1.8-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl", hash = "sha256:b64be28fbac1f2e1bab3903b5016e1a957968fe43141ee7866c2ec5ebafc71ab"},
-    {file = "kornia_rs-0.1.8-cp311-cp311-win_amd64.whl", hash = "sha256:2886f3a586728fe4a3586b3cc1df1dbea5d8984c74f77e23f5ab198441ec6e3c"},
-    {file = "kornia_rs-0.1.8-cp312-cp312-macosx_10_12_x86_64.whl", hash = "sha256:983200f2b336dd832d81154295ff152195ade0228054ecbe7ac9ed7d5bf3b031"},
-    {file = "kornia_rs-0.1.8-cp312-cp312-macosx_11_0_arm64.whl", hash = "sha256:bf8a78b1fac32fe05974272c5659c6a2f8754d1c15372aa529e0b5802ea2daed"},
-    {file = "kornia_rs-0.1.8-cp312-cp312-manylinux_2_17_aarch64.manylinux2014_aarch64.whl", hash = "sha256:4ca82f982d92d3b90f462848557ebd1500ea02d65b38b032305d1966c3bbc153"},
-    {file = "kornia_rs-0.1.8-cp312-cp312-manylinux_2_17_x86_64.manylinux2014_x86_64.whl", hash = "sha256:297e48f800c93e7cc8b089e472b77a272f9887509ce9d8756fab0fa7714f8439"},
-    {file = "kornia_rs-0.1.8-cp312-cp312-win_amd64.whl", hash = "sha256:dba6d86df9d3bb3e99f2d6017b9939b9e2683929277e959d11ea86fb3153eaec"},
-    {file = "kornia_rs-0.1.8-cp313-cp313-macosx_10_12_x86_64.whl", hash = "sha256:9197fc690b79562ff745a9ebda05c1408b9938045aecbbdafeaa8aed1f238b31"},
-    {file = "kornia_rs-0.1.8-cp313-cp313-macosx_11_0_arm64.whl", hash = "sha256:1014eac46dd75c8ba9ca61579593d77b84918236877fcae9dca362ff5d6960e4"},
-    {file = "kornia_rs-0.1.8-cp313-cp313-manylinux_2_17_aarch64.manylinux2014_aarch64.whl", hash = "sha256:c7d7c90c6244a37e0d1994e532ddf3484b3e7f767c54121d514feda83974a934"},
-    {file = "kornia_rs-0.1.8-cp313-cp313-manylinux_2_17_x86_64.manylinux2014_x86_64.whl", hash = "sha256:2ef0c4a19103ff9c3c7e7acb2a7db0a276a0ab1ea1c19fe151aea384a98cd63c"},
-    {file = "kornia_rs-0.1.8-cp313-cp313-win_amd64.whl", hash = "sha256:434fb087e2caef5b2ecd5222ea54cc443e907851b708be15142bc65ae82cef63"},
-    {file = "kornia_rs-0.1.8-cp37-cp37m-macosx_10_12_x86_64.whl", hash = "sha256:db56ba011f96cb15139a00828370b587e0a0a4287c7d8f004bf1b97e7581e341"},
-    {file = "kornia_rs-0.1.8-cp37-cp37m-macosx_11_0_arm64.whl", hash = "sha256:58f8b6ed43e08d04d77a09573f7904d62046b9b8df53b537ffd3ff94a495b746"},
-    {file = "kornia_rs-0.1.8-cp37-cp37m-manylinux_2_17_aarch64.manylinux2014_aarch64.whl", hash = "sha256:992d04a63f382185424127f29ad8db8e258a6d906c6d9c29529e46ca59d4ab43"},
-    {file = "kornia_rs-0.1.8-cp37-cp37m-manylinux_2_17_x86_64.manylinux2014_x86_64.whl", hash = "sha256:8d42a21858fbc416669bc6fd3a31ad1082733a288e03c906cad44945bccd6d60"},
-    {file = "kornia_rs-0.1.8-cp37-cp37m-win_amd64.whl", hash = "sha256:4d846492d6651c3e04205c04cbc21e3b37122c0ce5208fe40f1ed367d07257e1"},
-    {file = "kornia_rs-0.1.8-cp38-cp38-macosx_10_12_x86_64.whl", hash = "sha256:420f89bbe13d9a83dc82e71cb543182b7104dcf7ab40da36c5bbfca1683d7ccc"},
-    {file = "kornia_rs-0.1.8-cp38-cp38-macosx_11_0_arm64.whl", hash = "sha256:66590d87b75ff38656c5976718c875536a1526549041fc29114db31202574114"},
-    {file = "kornia_rs-0.1.8-cp38-cp38-manylinux_2_17_aarch64.manylinux2014_aarch64.whl", hash = "sha256:9405009b248221c01c124c6c3d48c6a3e624fad4103a5b006a4289b0fbfad9cd"},
-    {file = "kornia_rs-0.1.8-cp38-cp38-manylinux_2_17_x86_64.manylinux2014_x86_64.whl", hash = "sha256:77af8b8758db1edf59fdbf1c1e2a62bda79f76317e7f61854be4ada38d8a96cc"},
-    {file = "kornia_rs-0.1.8-cp38-cp38-win_amd64.whl", hash = "sha256:d053bfbf4ef05c5225b5bcb04aca7ef03cd3e0bfbbeae4f08f8465577f196880"},
-    {file = "kornia_rs-0.1.8-cp39-cp39-macosx_10_12_x86_64.whl", hash = "sha256:c7555eb7f5586a5ad4e0cf528d972b06335cc9cde429a8bb0115ef876d9e105e"},
-    {file = "kornia_rs-0.1.8-cp39-cp39-macosx_11_0_arm64.whl", hash = "sha256:23b4aed00ee6d34300e6e2406ddb130a3ef07af7698a6aaf86a08b64cfe149b5"},
-    {file = "kornia_rs-0.1.8-cp39-cp39-manylinux_2_17_aarch64.manylinux2014_aarch64.whl", hash = "sha256:283aa6203c3217734d02696877b455081d14eeb8b0cfa4740919078f90a6da74"},
-    {file = "kornia_rs-0.1.8-cp39-cp39-manylinux_2_17_x86_64.manylinux2014_x86_64.whl", hash = "sha256:21a303660b5e66b1cb9dd30033d075790d2e8b879e65db073a3d87c7710e0bda"},
-    {file = "kornia_rs-0.1.8-cp39-cp39-win_amd64.whl", hash = "sha256:ff12844b8e92ff5805827cb04f1d5130c07798d023d9c17f33d4eab7bc72dbdf"},
-    {file = "kornia_rs-0.1.8.tar.gz", hash = "sha256:519e05f51deb4c8e849889292b9c109e0ea0943ae5024685781c35018effafd9"},
-]
-
-[[package]]
-name = "legacy-cgi"
-version = "2.6.2"
-description = "Fork of the standard library cgi and cgitb modules, being deprecated in PEP-594"
+    {file = "kornia_rs-0.1.14-cp310-cp310-macosx_10_12_x86_64.whl", hash = "sha256:34f56c024b9216b6c407a3352491c3fe6608ee3ff49bc811f9ac5f75b0dd0e6d"},
+    {file = "kornia_rs-0.1.14-cp310-cp310-macosx_11_0_arm64.whl", hash = "sha256:5de2ce1415472e2447a1fab7012d89a03682d13b63b138628d656cfaf815ef7b"},
+    {file = "kornia_rs-0.1.14-cp310-cp310-manylinux_2_17_aarch64.manylinux2014_aarch64.whl", hash = "sha256:bdd3a557f11fdf0fd7d7b3a6dd0871664255176bbb5ee96a19b3c34c68188c5a"},
+    {file = "kornia_rs-0.1.14-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl", hash = "sha256:b7faddb0f7077a208917ba7c245bf7f87e663b62bd1236bde83beba72dc99dc5"},
+    {file = "kornia_rs-0.1.14-cp310-cp310-win_amd64.whl", hash = "sha256:8a9d946555a0df9558b4c1535b19e21f2c38b37c7bd2eb1c6371b22726ca40bc"},
+    {file = "kornia_rs-0.1.14-cp311-cp311-macosx_10_12_x86_64.whl", hash = "sha256:378ea4dd5aa82a8d754d48713da4f6794ceacc6fe6e429aead9095a75faff01c"},
+    {file = "kornia_rs-0.1.14-cp311-cp311-macosx_11_0_arm64.whl", hash = "sha256:e9d694526266252418084dca90814753eec43ff0194557b7824334c1e49bb9eb"},
+    {file = "kornia_rs-0.1.14-cp311-cp311-manylinux_2_17_aarch64.manylinux2014_aarch64.whl", hash = "sha256:7ecf81b642e6f770e2212a888935c18dfcc8cf00e65474262e77b5acf5409648"},
+    {file = "kornia_rs-0.1.14-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl", hash = "sha256:371ba151de638150554af9fad53a351d5c41ed80f50a73ae376b58622e0a3430"},
+    {file = "kornia_rs-0.1.14-cp311-cp311-win_amd64.whl", hash = "sha256:9175b704be9d2de5f1aefc6516eefa46835f71bb93605db67936996d2be42684"},
+    {file = "kornia_rs-0.1.14-cp312-cp312-macosx_10_12_x86_64.whl", hash = "sha256:76faf5389b1ea53452fc08561622ccad8ce81c8ff1857c4742be6ae4e82bf078"},
+    {file = "kornia_rs-0.1.14-cp312-cp312-macosx_11_0_arm64.whl", hash = "sha256:0025db9854f3a34c66123c2646d52e71a534678d9343f3c897192136b2c3ddaf"},
+    {file = "kornia_rs-0.1.14-cp312-cp312-manylinux_2_17_aarch64.manylinux2014_aarch64.whl", hash = "sha256:747b26a3ce0cad76aa1047ed65f95dcd649286a2d5417d8ad93f03bb1909238d"},
+    {file = "kornia_rs-0.1.14-cp312-cp312-manylinux_2_17_x86_64.manylinux2014_x86_64.whl", hash = "sha256:396f84661fcf260885c3f9db717caf6904eafd44857dca17be09a835bd7da8d9"},
+    {file = "kornia_rs-0.1.14-cp312-cp312-win_amd64.whl", hash = "sha256:ac4bbd0a8fd73b5058a39707c790fecec4c5204a42d1f5af17f1fa57cc83d406"},
+    {file = "kornia_rs-0.1.14-cp313-cp313-macosx_10_12_x86_64.whl", hash = "sha256:a703ec79a33b76115386dfef02fd36bed17715a1209fed858dd0c1adf7482421"},
+    {file = "kornia_rs-0.1.14-cp313-cp313-macosx_11_0_arm64.whl", hash = "sha256:ea26534e04f937f2f4d445e12dcbf0c291c4afbb91b3d659b03c1841b0a445d7"},
+    {file = "kornia_rs-0.1.14-cp313-cp313-manylinux_2_17_aarch64.manylinux2014_aarch64.whl", hash = "sha256:9f0312afdaf27fb4579d07fdf6b457b2c75e1323a4d3b1d5812a86fef0a2316e"},
+    {file = "kornia_rs-0.1.14-cp313-cp313-manylinux_2_17_x86_64.manylinux2014_x86_64.whl", hash = "sha256:65ba9214fc10cca816b7f6653f59a2bb74f343dce163adceba10926480d7a2b6"},
+    {file = "kornia_rs-0.1.14-cp313-cp313-win_amd64.whl", hash = "sha256:4d3312002012fd0189e762b62b24d882e97e4ea9fe3a3834f01d7e17e911201c"},
+    {file = "kornia_rs-0.1.14-cp313-cp313t-macosx_10_12_x86_64.whl", hash = "sha256:45866a0691ecb491a6af3c779b25fd76dc65792710070d0673181a7f9dc38a08"},
+    {file = "kornia_rs-0.1.14-cp313-cp313t-macosx_11_0_arm64.whl", hash = "sha256:2b329376d01a03e5a76a381efaaafa6fe1e54a5932eace1de95760564643ca4d"},
+    {file = "kornia_rs-0.1.14-cp313-cp313t-manylinux_2_17_aarch64.manylinux2014_aarch64.whl", hash = "sha256:603f56ffa0ffe2de50e5c3c4c606e5a37c98c0277a2ad752feac0e25920880f4"},
+    {file = "kornia_rs-0.1.14-cp313-cp313t-manylinux_2_17_x86_64.manylinux2014_x86_64.whl", hash = "sha256:496301c800afea6867220d0f02344f44a90b50c1da22d5511c25df0c0c2b4d75"},
+    {file = "kornia_rs-0.1.14-cp313-cp313t-win_amd64.whl", hash = "sha256:29cfb7b179ba0b98772bd459f6e74da67f93b290491a5c03deb9197955dfa684"},
+    {file = "kornia_rs-0.1.14-cp314-cp314-macosx_10_12_x86_64.whl", hash = "sha256:816dd1d1713b13f3b39831d20097cb2aa69c2863c9a98555b1b32df0e5b9e309"},
+    {file = "kornia_rs-0.1.14-cp314-cp314-macosx_11_0_arm64.whl", hash = "sha256:27b23edda1f847ee4532ae2f008b16da535b947e2cb261be1865f7faff6c9fe7"},
+    {file = "kornia_rs-0.1.14-cp314-cp314-manylinux_2_17_aarch64.manylinux2014_aarch64.whl", hash = "sha256:f1f5798b209c5e0cd6ec2629aac5b70c2b7c6c628a432a1b6a7414aca5805f9d"},
+    {file = "kornia_rs-0.1.14-cp314-cp314-manylinux_2_17_x86_64.manylinux2014_x86_64.whl", hash = "sha256:b93a70df2ce65269de1f1e9c1fbe14e1fb2cdda6c3a39a31621b68a09cdba01d"},
+    {file = "kornia_rs-0.1.14-cp314-cp314-win_amd64.whl", hash = "sha256:ff5ab2ede8eee7c05c6b55318ca96118785c40e9320e30c3fbb7f2b68b6fbe2b"},
+    {file = "kornia_rs-0.1.14-cp314-cp314t-macosx_10_12_x86_64.whl", hash = "sha256:7726f27690cf471e8df967d71ee6c937adce764a0de0fea02aeac216b71770fc"},
+    {file = "kornia_rs-0.1.14-cp314-cp314t-macosx_11_0_arm64.whl", hash = "sha256:7835328d5bf1565c42ce405db125811653a207c6e5dc16e937cf4527a04d8710"},
+    {file = "kornia_rs-0.1.14-cp314-cp314t-manylinux_2_17_aarch64.manylinux2014_aarch64.whl", hash = "sha256:4cb1a72ea7ce13a2971af16f28409c080560aa332431b3552c633197316e0869"},
+    {file = "kornia_rs-0.1.14-cp314-cp314t-manylinux_2_17_x86_64.manylinux2014_x86_64.whl", hash = "sha256:2dc1942acd2e6cbf28f1e056518db751264550f9aaa61760ce01ede266e42b61"},
+    {file = "kornia_rs-0.1.14-cp314-cp314t-win_amd64.whl", hash = "sha256:26b13fbf0a22c133a1957defca8460faceeb22c7ce1ab37a6f4a658944682c58"},
+    {file = "kornia_rs-0.1.14-cp38-cp38-macosx_10_12_x86_64.whl", hash = "sha256:83ab6270fdac7a2c8d6a14763cce70b0c05194d17441bbd4ff255d7ecf37482d"},
+    {file = "kornia_rs-0.1.14-cp38-cp38-macosx_11_0_arm64.whl", hash = "sha256:132330ca0766a6393c7b0553074765c731774903c43d193a32d995896fe7128d"},
+    {file = "kornia_rs-0.1.14-cp38-cp38-manylinux_2_17_aarch64.manylinux2014_aarch64.whl", hash = "sha256:1d11232a54a3fab5f9a738c27a8dcc28c6a57dc4447e7c35f6f8c98c4715b365"},
+    {file = "kornia_rs-0.1.14-cp38-cp38-manylinux_2_17_x86_64.manylinux2014_x86_64.whl", hash = "sha256:4f08d54aefabb096cdd3c75659e4598c1c3c6e9633aa58089f51f2b87f138d6f"},
+    {file = "kornia_rs-0.1.14-cp38-cp38-win_amd64.whl", hash = "sha256:d40d2dfc86b1446e6d4b6b03a65eb5f2d5d40261929fdf0ac0482037935eab04"},
+    {file = "kornia_rs-0.1.14-cp39-cp39-macosx_10_12_x86_64.whl", hash = "sha256:e8bd44b1c7a8c3082dad46f2d0c6c5b411a5998b3c1e3e36bf6e5a4532d5b9a1"},
+    {file = "kornia_rs-0.1.14-cp39-cp39-macosx_11_0_arm64.whl", hash = "sha256:de924f73f9df9bed21d527f57bb4250e3b946c873ec2869b7bf2eb9e60631dba"},
+    {file = "kornia_rs-0.1.14-cp39-cp39-manylinux_2_17_aarch64.manylinux2014_aarch64.whl", hash = "sha256:d65f4746fc63339696fb8ee0def95c9f8c13ce7687d1e8df8174d03a1bc3a364"},
+    {file = "kornia_rs-0.1.14-cp39-cp39-manylinux_2_17_x86_64.manylinux2014_x86_64.whl", hash = "sha256:f7b48a83638a37f0a01e8a27f6b326c000cdadee86e6c3b4f3c43409dc1fd202"},
+    {file = "kornia_rs-0.1.14-cp39-cp39-win_amd64.whl", hash = "sha256:46796827b4bd428956d172e0915f45a3efb71aa8dd5655e6609acee4576c562b"},
+    {file = "kornia_rs-0.1.14.tar.gz", hash = "sha256:7584f654a9db2b41bee05c9aaf865608b665e2f7195096372e001b6f220de1d2"},
+]
+
+[package.extras]
+dev = ["numpy", "pytest", "pytest-run-parallel", "torch"]
+
+[[package]]
+name = "librt"
+version = "0.11.0"
+description = "Mypyc runtime library"
 optional = false
-python-versions = ">=3.10"
-groups = ["main"]
-markers = "python_version >= \"3.13\""
-files = [
-    {file = "legacy_cgi-2.6.2-py3-none-any.whl", hash = "sha256:a7b83afb1baf6ebeb56522537c5943ef9813cf933f6715e88a803f7edbce0bff"},
-    {file = "legacy_cgi-2.6.2.tar.gz", hash = "sha256:9952471ceb304043b104c22d00b4f333cac27a6abe446d8a528fc437cf13c85f"},
+python-versions = ">=3.9"
+groups = ["dev"]
+markers = "platform_python_implementation != \"PyPy\""
+files = [
+    {file = "librt-0.11.0-cp310-cp310-macosx_10_9_x86_64.whl", hash = "sha256:6e94ebfcfa2d5e9926d6c3b9aa4617ffc42a845b4321fb84021b872358c82a0f"},
+    {file = "librt-0.11.0-cp310-cp310-macosx_11_0_arm64.whl", hash = "sha256:ae627397a2f351560440d872d6f7c8dbb4072e57868e7b2fc5b8b430fe489d45"},
+    {file = "librt-0.11.0-cp310-cp310-manylinux2014_aarch64.manylinux_2_17_aarch64.manylinux_2_28_aarch64.whl", hash = "sha256:dc329359321b67d24efdf4bc69012b0597001649544db662c001db5a0184794c"},
+    {file = "librt-0.11.0-cp310-cp310-manylinux2014_i686.manylinux_2_17_i686.manylinux_2_28_i686.whl", hash = "sha256:7e82e642ab0f7608ce2fe53d76ca2280a9ee33a1b06556142c7c6fe80a86fc33"},
+    {file = "librt-0.11.0-cp310-cp310-manylinux2014_x86_64.manylinux_2_17_x86_64.manylinux_2_28_x86_64.whl", hash = "sha256:88145c15c67731d54283d135b03244028c750cc9edc334a96a4f5950ebdb2884"},
+    {file = "librt-0.11.0-cp310-cp310-manylinux_2_34_riscv64.manylinux_2_39_riscv64.whl", hash = "sha256:9d36a51b3d93320b686588e27123f4995804dbf1bce81df78c02fc3c6eea9280"},
+    {file = "librt-0.11.0-cp310-cp310-musllinux_1_2_aarch64.whl", hash = "sha256:d00f3ac06a2a8b246327f11e186a53a100a4d5c7ed52346367e5ec751d51586c"},
+    {file = "librt-0.11.0-cp310-cp310-musllinux_1_2_i686.whl", hash = "sha256:461bbceede621f1ffb8839755f8663e886087ee7af16294cab7fb4d782c62eeb"},
+    {file = "librt-0.11.0-cp310-cp310-musllinux_1_2_riscv64.whl", hash = "sha256:0cad8a4d6a8ff03c9b76f9414caccd78e7cfbc8a2e12fa334d8e1d9932753783"},
+    {file = "librt-0.11.0-cp310-cp310-musllinux_1_2_x86_64.whl", hash = "sha256:f37aa505b3cf60701562eddb32df74b12a9e380c207fd8b06dd157a943ac7ea0"},
+    {file = "librt-0.11.0-cp310-cp310-win32.whl", hash = "sha256:94663a21534637f0e787ec2a2a756022df6e5b7b2335a5cdd7d8e33d68a2af89"},
+    {file = "librt-0.11.0-cp310-cp310-win_amd64.whl", hash = "sha256:dec7db73758c2b54953fd8b7fe348c45188fe26b39ee18446196edd08453a5d4"},
+    {file = "librt-0.11.0-cp311-cp311-macosx_10_9_x86_64.whl", hash = "sha256:93d95bd45b7d58343d8b90d904450a545144eec19a002511163426f8ab1fae29"},
+    {file = "librt-0.11.0-cp311-cp311-macosx_11_0_arm64.whl", hash = "sha256:4ee278c769a713638cdacd4c0436d72156e75df3ebc0166ab2b9dc43acc386c9"},
+    {file = "librt-0.11.0-cp311-cp311-manylinux2014_aarch64.manylinux_2_17_aarch64.manylinux_2_28_aarch64.whl", hash = "sha256:f230cb1cbc9faaa616f9a678f530ebcf186e414b6bcbd88b960e4ba1b92428d5"},
+    {file = "librt-0.11.0-cp311-cp311-manylinux2014_i686.manylinux_2_17_i686.manylinux_2_28_i686.whl", hash = "sha256:5d63c855d86938d9de93e265c9bd8c705b51ec494de5738340ee93767a686e4b"},
+    {file = "librt-0.11.0-cp311-cp311-manylinux2014_x86_64.manylinux_2_17_x86_64.manylinux_2_28_x86_64.whl", hash = "sha256:993f028be9e96a08d31df3479ac80d99be374d17f3b78e4796b3fd3c913d4e89"},
+    {file = "librt-0.11.0-cp311-cp311-manylinux_2_34_riscv64.manylinux_2_39_riscv64.whl", hash = "sha256:258d73a0aa66a055e65b2e4d1b8cdb23b9d132c5bb915d9547d804fcaed116cc"},
+    {file = "librt-0.11.0-cp311-cp311-musllinux_1_2_aarch64.whl", hash = "sha256:0827efe7854718f04aaddf6496e96960a956e676fe1d0f04eb41511fd8ad06d5"},
+    {file = "librt-0.11.0-cp311-cp311-musllinux_1_2_i686.whl", hash = "sha256:7753e57d6e12d019c0d8786f1c09c709f4c3fcc57c3887b24e36e6c06ec938b7"},
+    {file = "librt-0.11.0-cp311-cp311-musllinux_1_2_riscv64.whl", hash = "sha256:11bd19822431cc21af9f27374e7ae2e58103c7d98bda823536a6c47f6bb2bb3d"},
+    {file = "librt-0.11.0-cp311-cp311-musllinux_1_2_x86_64.whl", hash = "sha256:22bdf239b219d3993761a148ffa134b19e52e9989c84f845d5d7b71d70a17412"},
+    {file = "librt-0.11.0-cp311-cp311-win32.whl", hash = "sha256:46c60b61e308eb535fbd6fa622b1ee1bb2815691c1ad9c98bf7b84952ec3bc8d"},
+    {file = "librt-0.11.0-cp311-cp311-win_amd64.whl", hash = "sha256:902e546ff044f579ff1c953ff5fce97b636fe9e3943996b2177710c6ef076f73"},
+    {file = "librt-0.11.0-cp311-cp311-win_arm64.whl", hash = "sha256:65ac3bc20f78aa0ee5ae84baa68917f89fef4af63e941084dd019a0d0e749f0c"},
+    {file = "librt-0.11.0-cp312-cp312-macosx_10_13_x86_64.whl", hash = "sha256:b87504f1690a23b9a2cca841191a04f83895d4fc2dd04df91d82b1a04ca2ad46"},
+    {file = "librt-0.11.0-cp312-cp312-macosx_11_0_arm64.whl", hash = "sha256:40071fc5fe0ce8daa6de616702314a01e1250711682b0523d6ab8d4525910cb3"},
+    {file = "librt-0.11.0-cp312-cp312-manylinux2014_aarch64.manylinux_2_17_aarch64.manylinux_2_28_aarch64.whl", hash = "sha256:137e79445c896a0ea7b265f52d23954e05b64222ee1af69e2cb34219067cbb67"},
+    {file = "librt-0.11.0-cp312-cp312-manylinux2014_i686.manylinux_2_17_i686.manylinux_2_28_i686.whl", hash = "sha256:cca6644054e78746d8d4ef238681f9c34ff8b584fe6b988ecebb8db3b15e622a"},
+    {file = "librt-0.11.0-cp312-cp312-manylinux2014_x86_64.manylinux_2_17_x86_64.manylinux_2_28_x86_64.whl", hash = "sha256:d5b0eea49f5562861ee8d757a32ef7d559c1d35be2aaaa1ec28941d74c9ffc8a"},
+    {file = "librt-0.11.0-cp312-cp312-manylinux_2_34_riscv64.manylinux_2_39_riscv64.whl", hash = "sha256:0d1029d7e1ae1a7e647ed6fb5df8c4ce2dffefb7a9f5fd1376a4554d96dac09f"},
+    {file = "librt-0.11.0-cp312-cp312-musllinux_1_2_aarch64.whl", hash = "sha256:bc3ce6b33c5828d9e80592011a5c584cb2ce86edbc4088405f70da47dc1d1b3b"},
+    {file = "librt-0.11.0-cp312-cp312-musllinux_1_2_i686.whl", hash = "sha256:936c5995f3514a42111f20099397d8177c79b4d7e70961e396c6f5a0a3566766"},
+    {file = "librt-0.11.0-cp312-cp312-musllinux_1_2_riscv64.whl", hash = "sha256:9bc0ca6ad9381cbe8e4aa6e5726e4c80c78115a6e9723c599ed1d73e092bc49d"},
+    {file = "librt-0.11.0-cp312-cp312-musllinux_1_2_x86_64.whl", hash = "sha256:070aa8c26c0a74774317a72df8851facc7f0f012a5b406557ac56992d92e1ec8"},
+    {file = "librt-0.11.0-cp312-cp312-win32.whl", hash = "sha256:6bf14feb84b05ae945277395451998c89c54d0def4070eb5c08de544930b245a"},
+    {file = "librt-0.11.0-cp312-cp312-win_amd64.whl", hash = "sha256:75672f0bc524ede266287d532d7923dbce94c7514ad07627bac3d0c6d92cc4d9"},
+    {file = "librt-0.11.0-cp312-cp312-win_arm64.whl", hash = "sha256:2f10cf143e4a9bb0f4f5af568a00df94a2d69ef41c2579584454bb0fe5cc642c"},
+    {file = "librt-0.11.0-cp313-cp313-macosx_10_13_x86_64.whl", hash = "sha256:78dc31f7fdfe9c9d0eb0e8f42d139db230e826415bbcabd9f0e9faaaee909894"},
+    {file = "librt-0.11.0-cp313-cp313-macosx_11_0_arm64.whl", hash = "sha256:fa475675db22290c3158e1d42326d0f5a65f04f44a0e68c3630a25b53560fb9c"},
+    {file = "librt-0.11.0-cp313-cp313-manylinux2014_aarch64.manylinux_2_17_aarch64.manylinux_2_28_aarch64.whl", hash = "sha256:621db29691044bdeda22e789e482e1b0f3a985d90e3426c9c6d17606416205ea"},
+    {file = "librt-0.11.0-cp313-cp313-manylinux2014_i686.manylinux_2_17_i686.manylinux_2_28_i686.whl", hash = "sha256:a9010e2ed5b3a9e158c5fd966b3ab7e834bb3d3aacc8f66c91dd4b57a3799230"},
+    {file = "librt-0.11.0-cp313-cp313-manylinux2014_x86_64.manylinux_2_17_x86_64.manylinux_2_28_x86_64.whl", hash = "sha256:7c39513d8b7477a2e1ed8c43fc21c524e8d5a0f8d4e8b7b074dbdbe7820a08e2"},
+    {file = "librt-0.11.0-cp313-cp313-manylinux_2_34_riscv64.manylinux_2_39_riscv64.whl", hash = "sha256:7aef3cf1d5af86e770ab04bfd993dfc4ae8b8c17f66fb77dd4a7d50de7bbb1a3"},
+    {file = "librt-0.11.0-cp313-cp313-musllinux_1_2_aarch64.whl", hash = "sha256:557183ddc36babe46b27dd60facbd5adb4492181a5be887587d57cda6e092f21"},
+    {file = "librt-0.11.0-cp313-cp313-musllinux_1_2_i686.whl", hash = "sha256:83d3e1f72bd42f6c5c0b7daec530c3f829bd02db42c70b8ddf0c2d90a2459930"},
+    {file = "librt-0.11.0-cp313-cp313-musllinux_1_2_riscv64.whl", hash = "sha256:4ce1f21fbe589bc1afd7872dece84fb0e1144f794a288e58a10d2c54a55c43be"},
+    {file = "librt-0.11.0-cp313-cp313-musllinux_1_2_x86_64.whl", hash = "sha256:970b09f7044ea2b64c9da42fd3d335666518cfd1c6e8a182c95da73d0214b41e"},
+    {file = "librt-0.11.0-cp313-cp313-win32.whl", hash = "sha256:78fddc31cd4d3caa897ad5d31f856b1faadc9474021ad6cb182b9018793e254e"},
+    {file = "librt-0.11.0-cp313-cp313-win_amd64.whl", hash = "sha256:8ca8aa88751a775870b764e93bad5135385f563cb8dcee399abf034ea4d3cb47"},
+    {file = "librt-0.11.0-cp313-cp313-win_arm64.whl", hash = "sha256:96f044bb325fd9cf1a723015638c219e9143f0dfbc0ca54c565df2b7fc748b44"},
+    {file = "librt-0.11.0-cp314-cp314-macosx_10_13_x86_64.whl", hash = "sha256:4a017a95e5837dc15a8c5661d60e05daa96b90908b1aa6b7acdf443cd25c8ebd"},
+    {file = "librt-0.11.0-cp314-cp314-macosx_11_0_arm64.whl", hash = "sha256:b1ecbd9819deccc39b7542bf4d2a740d8a620694d39989e58661d3763458f8d4"},
+    {file = "librt-0.11.0-cp314-cp314-manylinux2014_aarch64.manylinux_2_17_aarch64.manylinux_2_28_aarch64.whl", hash = "sha256:7da327dacd7be8f8ec36547373550744a3cc0e536d54665cd83f8bcd961200e8"},
+    {file = "librt-0.11.0-cp314-cp314-manylinux2014_i686.manylinux_2_17_i686.manylinux_2_28_i686.whl", hash = "sha256:0dc56b1f8d06e60db362cc3fdae206681817f86ce4725d34511473487f12a34b"},
+    {file = "librt-0.11.0-cp314-cp314-manylinux2014_x86_64.manylinux_2_17_x86_64.manylinux_2_28_x86_64.whl", hash = "sha256:05fb8fb2ab90e21c8d12ea240d744ad514da9baf381ebfa70d91d20d21713175"},
+    {file = "librt-0.11.0-cp314-cp314-manylinux_2_34_riscv64.manylinux_2_39_riscv64.whl", hash = "sha256:cae74872be221df4374d10fec61f93ed1513b9546ea84f2c0bf73ab3e9bd0b03"},
+    {file = "librt-0.11.0-cp314-cp314-musllinux_1_2_aarch64.whl", hash = "sha256:32bcc918c0148eb7e3d57385125bac7e5f9e4359d05f07448b09f6f778c2f31c"},
+    {file = "librt-0.11.0-cp314-cp314-musllinux_1_2_i686.whl", hash = "sha256:f9743fc99135d5f78d2454435615f6dec0473ca507c26ce9d92b10b562a280d3"},
+    {file = "librt-0.11.0-cp314-cp314-musllinux_1_2_riscv64.whl", hash = "sha256:5ba067f4aadae8fda802d91d2124c90c42195ff32d9161d3549e6d05cfe26f96"},
+    {file = "librt-0.11.0-cp314-cp314-musllinux_1_2_x86_64.whl", hash = "sha256:de3bf945454d032f9e390b85c4072e0a0570bf825421c8be0e71209fa65e1abe"},
+    {file = "librt-0.11.0-cp314-cp314-win32.whl", hash = "sha256:d2277a05f6dcb9fd13db9566aac4fabd68c3ea1ea46ee5567d4eef8efa495a2f"},
+    {file = "librt-0.11.0-cp314-cp314-win_amd64.whl", hash = "sha256:ab73e8db5e3f564d812c1f5c3a175930a5f9bc96ccb5e3b22a34d7858b401cf7"},
+    {file = "librt-0.11.0-cp314-cp314-win_arm64.whl", hash = "sha256:aea3caa317752e3a466fa8af45d91ee0ea8c7fdd96e42b0a8dd9b76a7931eba1"},
+    {file = "librt-0.11.0-cp314-cp314t-macosx_10_13_x86_64.whl", hash = "sha256:d1b36540d7aaf9b9101b3a6f376c8d8e9f7a9aec93ed05918f2c69d493ffef72"},
+    {file = "librt-0.11.0-cp314-cp314t-macosx_11_0_arm64.whl", hash = "sha256:efbb343ab2ce3540f4ecbe6315d677ed70f37cd9a72b1e58066c918ca83acbaa"},
+    {file = "librt-0.11.0-cp314-cp314t-manylinux2014_aarch64.manylinux_2_17_aarch64.manylinux_2_28_aarch64.whl", hash = "sha256:aa0dd688aab3f7914d3e6e5e3554978e0383312fb8e771d84be008a35b9ee548"},
+    {file = "librt-0.11.0-cp314-cp314t-manylinux2014_i686.manylinux_2_17_i686.manylinux_2_28_i686.whl", hash = "sha256:f5fb36b8c6c63fdcbb1d526d94c0d1331610d43f4118cc1beb4efef4f3faacb2"},
+    {file = "librt-0.11.0-cp314-cp314t-manylinux2014_x86_64.manylinux_2_17_x86_64.manylinux_2_28_x86_64.whl", hash = "sha256:4a9a237d13addb93715b6fee74023d5ee3469b53fce527626c0e088aa585805f"},
+    {file = "librt-0.11.0-cp314-cp314t-manylinux_2_34_riscv64.manylinux_2_39_riscv64.whl", hash = "sha256:5ddd17bd87b2c56ddd60e546a7984a2e64c4e8eab92fb4cf3830a48ad5469d51"},
+    {file = "librt-0.11.0-cp314-cp314t-musllinux_1_2_aarch64.whl", hash = "sha256:bd43992b4473d42f12ff9e68326079f0696d9d4e6000e8f39a0238d482ba6ee2"},
+    {file = "librt-0.11.0-cp314-cp314t-musllinux_1_2_i686.whl", hash = "sha256:f8e3e8056dd674e279741485e2e512d6e9a751c7455809d0114e6ebf8d781085"},
+    {file = "librt-0.11.0-cp314-cp314t-musllinux_1_2_riscv64.whl", hash = "sha256:c1f708d8ae9c56cf38a903c44297243d2ec83fd82b396b977e0144a3e76217e3"},
+    {file = "librt-0.11.0-cp314-cp314t-musllinux_1_2_x86_64.whl", hash = "sha256:0add982e0e7b9fc14cf4b33789d5f13f66581889b88c2f58099f6ce8f92617bd"},
+    {file = "librt-0.11.0-cp314-cp314t-win32.whl", hash = "sha256:2b481d846ac894c4e8403c5fd0e87c5d11d6499e404b474602508a224ff531c8"},
+    {file = "librt-0.11.0-cp314-cp314t-win_amd64.whl", hash = "sha256:28edb433edde181112a908c78907af28f964eabc15f4dd16c9d66c834302677c"},
+    {file = "librt-0.11.0-cp314-cp314t-win_arm64.whl", hash = "sha256:dee008f20b542e3cd162ba338a7f9ec0f6d23d395f66fe8aeeec3c9d067ea253"},
+    {file = "librt-0.11.0-cp39-cp39-macosx_10_9_x86_64.whl", hash = "sha256:6bd72d903911d995ab666dbd1871f8b1e80925a699af8063fbf50053329fb05f"},
+    {file = "librt-0.11.0-cp39-cp39-macosx_11_0_arm64.whl", hash = "sha256:0ef69ac715f3cd8e5cd252cb2aebfa72c015492aacc339d5d7bf8fef3c62c677"},
+    {file = "librt-0.11.0-cp39-cp39-manylinux2014_aarch64.manylinux_2_17_aarch64.manylinux_2_28_aarch64.whl", hash = "sha256:624a40c4a4ad7773315c287276cd024509b2c66ff5904f504bfc08d2c70293ab"},
+    {file = "librt-0.11.0-cp39-cp39-manylinux2014_i686.manylinux_2_17_i686.manylinux_2_28_i686.whl", hash = "sha256:41dc19fe150b69716c8ece4f76773a9e8813fe3e35e032a58b4d46423fb8d7c0"},
+    {file = "librt-0.11.0-cp39-cp39-manylinux2014_x86_64.manylinux_2_17_x86_64.manylinux_2_28_x86_64.whl", hash = "sha256:4e8bd98ea9c47ae90b319a087ab28dac493f1ffbc1ecd1f28fcdbf3b7e1108d1"},
+    {file = "librt-0.11.0-cp39-cp39-manylinux_2_34_riscv64.manylinux_2_39_riscv64.whl", hash = "sha256:84308fc49423ce6475d1c5d1985cd69a8ca9f0325fc7d5f81bb690a3f3625d4e"},
+    {file = "librt-0.11.0-cp39-cp39-musllinux_1_2_aarch64.whl", hash = "sha256:ff0fbaf5f44a21beeb0110f2ab64f45135a9536a834b79c0d1ef018f2786bbfa"},
+    {file = "librt-0.11.0-cp39-cp39-musllinux_1_2_i686.whl", hash = "sha256:9c028a9442a18e266955d364ce42259136e79a7ba14d773e0d778d5f70cd56f1"},
+    {file = "librt-0.11.0-cp39-cp39-musllinux_1_2_riscv64.whl", hash = "sha256:9f1692105a02bcf853f355032a5fdc5494358ef83d8fd22d16de375c85cec3f5"},
+    {file = "librt-0.11.0-cp39-cp39-musllinux_1_2_x86_64.whl", hash = "sha256:7a80a71e1fda83cc752a9141e87aae7fef279538597564d670e9ce513f286192"},
+    {file = "librt-0.11.0-cp39-cp39-win32.whl", hash = "sha256:140695816ddf3c86eb972981a26f35efd871c44b0c3aed44c8cd01749386617f"},
+    {file = "librt-0.11.0-cp39-cp39-win_amd64.whl", hash = "sha256:92f7ff819c197fc30473190a12c2856f325ac90aabfccbeb2072d28cc2e234e3"},
+    {file = "librt-0.11.0.tar.gz", hash = "sha256:075dc3ef4458a278e0195cbf6ac9d38808d9b906c5a6c7f7f79c3888276a3fb1"},
 ]
 
 [[package]]
 name = "lightning-utilities"
-version = "0.11.9"
+version = "0.15.3"
 description = "Lightning toolbox for across the our ecosystem."
 optional = false
-python-versions = ">=3.8"
-groups = ["main"]
+python-versions = ">=3.10"
+groups = ["training"]
 files = [
-    {file = "lightning_utilities-0.11.9-py3-none-any.whl", hash = "sha256:ac6d4e9e28faf3ff4be997876750fee10dc604753dbc429bf3848a95c5d7e0d2"},
-    {file = "lightning_utilities-0.11.9.tar.gz", hash = "sha256:f5052b81344cc2684aa9afd74b7ce8819a8f49a858184ec04548a5a109dfd053"},
+    {file = "lightning_utilities-0.15.3-py3-none-any.whl", hash = "sha256:6c55f1bee70084a1cbeaa41ada96e4b3a0fea5909e844dd335bd80f5a73c5f91"},
+    {file = "lightning_utilities-0.15.3.tar.gz", hash = "sha256:792ae0204c79f6859721ac7f386c237a33b0ed06ba775009cb894e010a842033"},
 ]
 
 [package.dependencies]
-packaging = ">=17.1"
-setuptools = "*"
-typing-extensions = "*"
+packaging = ">=22"
+typing_extensions = "*"
 
 [package.extras]
-cli = ["fire"]
+cli = ["jsonargparse[signatures] (>=4.38.0)", "tomlkit"]
 docs = ["requests (>=2.0.0)"]
 typing = ["mypy (>=1.0.0)", "types-setuptools"]
 
@@ -2271,34 +2269,34 @@ colorama = {version = ">=0.3.4", markers = "sys_platform == \"win32\""}
 win32-setctime = {version = ">=1.0.0", markers = "sys_platform == \"win32\""}
 
 [package.extras]
-dev = ["Sphinx (==7.2.5) ; python_version >= \"3.9\"", "colorama (==0.4.5) ; python_version < \"3.8\"", "colorama (==0.4.6) ; python_version >= \"3.8\"", "exceptiongroup (==1.1.3) ; python_version >= \"3.7\" and python_version < \"3.11\"", "freezegun (==1.1.0) ; python_version < \"3.8\"", "freezegun (==1.2.2) ; python_version >= \"3.8\"", "mypy (==v0.910) ; python_version < \"3.6\"", "mypy (==v0.971) ; python_version == \"3.6\"", "mypy (==v1.4.1) ; python_version == \"3.7\"", "mypy (==v1.5.1) ; python_version >= \"3.8\"", "pre-commit (==3.4.0) ; python_version >= \"3.8\"", "pytest (==6.1.2) ; python_version < \"3.8\"", "pytest (==7.4.0) ; python_version >= \"3.8\"", "pytest-cov (==2.12.1) ; python_version < \"3.8\"", "pytest-cov (==4.1.0) ; python_version >= \"3.8\"", "pytest-mypy-plugins (==1.9.3) ; python_version >= \"3.6\" and python_version < \"3.8\"", "pytest-mypy-plugins (==3.0.0) ; python_version >= \"3.8\"", "sphinx-autobuild (==2021.3.14) ; python_version >= \"3.9\"", "sphinx-rtd-theme (==1.3.0) ; python_version >= \"3.9\"", "tox (==3.27.1) ; python_version < \"3.8\"", "tox (==4.11.0) ; python_version >= \"3.8\""]
+dev = ["Sphinx (==7.2.5) ; python_version >= \"3.9\"", "colorama (==0.4.5) ; python_version < \"3.8\"", "colorama (==0.4.6) ; python_version >= \"3.8\"", "exceptiongroup (==1.1.3) ; python_version >= \"3.7\" and python_version < \"3.11\"", "freezegun (==1.1.0) ; python_version < \"3.8\"", "freezegun (==1.2.2) ; python_version >= \"3.8\"", "mypy (==0.910) ; python_version < \"3.6\"", "mypy (==0.971) ; python_version == \"3.6\"", "mypy (==1.4.1) ; python_version == \"3.7\"", "mypy (==1.5.1) ; python_version >= \"3.8\"", "pre-commit (==3.4.0) ; python_version >= \"3.8\"", "pytest (==6.1.2) ; python_version < \"3.8\"", "pytest (==7.4.0) ; python_version >= \"3.8\"", "pytest-cov (==2.12.1) ; python_version < \"3.8\"", "pytest-cov (==4.1.0) ; python_version >= \"3.8\"", "pytest-mypy-plugins (==1.9.3) ; python_version >= \"3.6\" and python_version < \"3.8\"", "pytest-mypy-plugins (==3.0.0) ; python_version >= \"3.8\"", "sphinx-autobuild (==2021.3.14) ; python_version >= \"3.9\"", "sphinx-rtd-theme (==1.3.0) ; python_version >= \"3.9\"", "tox (==3.27.1) ; python_version < \"3.8\"", "tox (==4.11.0) ; python_version >= \"3.8\""]
 
 [[package]]
 name = "markdown"
-version = "3.8"
+version = "3.10.2"
 description = "Python implementation of John Gruber's Markdown."
 optional = false
-python-versions = ">=3.9"
-groups = ["main"]
+python-versions = ">=3.10"
+groups = ["training"]
 files = [
-    {file = "markdown-3.8-py3-none-any.whl", hash = "sha256:794a929b79c5af141ef5ab0f2f642d0f7b1872981250230e72682346f7cc90dc"},
-    {file = "markdown-3.8.tar.gz", hash = "sha256:7df81e63f0df5c4b24b7d156eb81e4690595239b7d70937d0409f1b0de319c6f"},
+    {file = "markdown-3.10.2-py3-none-any.whl", hash = "sha256:e91464b71ae3ee7afd3017d9f358ef0baf158fd9a298db92f1d4761133824c36"},
+    {file = "markdown-3.10.2.tar.gz", hash = "sha256:994d51325d25ad8aa7ce4ebaec003febcce822c3f8c911e3b17c52f7f589f950"},
 ]
 
 [package.extras]
-docs = ["mdx_gh_links (>=0.2)", "mkdocs (>=1.6)", "mkdocs-gen-files", "mkdocs-literate-nav", "mkdocs-nature (>=0.6)", "mkdocs-section-index", "mkdocstrings[python]"]
+docs = ["mdx_gh_links (>=0.2)", "mkdocs (>=1.6)", "mkdocs-gen-files", "mkdocs-literate-nav", "mkdocs-nature (>=0.6)", "mkdocs-section-index", "mkdocstrings[python] (>=0.28.3)"]
 testing = ["coverage", "pyyaml"]
 
 [[package]]
 name = "markdown-it-py"
-version = "3.0.0"
+version = "4.2.0"
 description = "Python port of markdown-it. Markdown parsing, done right!"
 optional = false
-python-versions = ">=3.8"
+python-versions = ">=3.10"
 groups = ["main"]
 files = [
-    {file = "markdown-it-py-3.0.0.tar.gz", hash = "sha256:e3f60a94fa066dc52ec76661e37c851cb232d92f9886b15cb560aaada2df8feb"},
-    {file = "markdown_it_py-3.0.0-py3-none-any.whl", hash = "sha256:355216845c60bd96232cd8d8c40e8f9765cc86f46880e43a8fd22dc1a1a8cab1"},
+    {file = "markdown_it_py-4.2.0-py3-none-any.whl", hash = "sha256:9f7ebbcd14fe59494226453aed97c1070d83f8d24b6fc3a3bcf9a38092641c4a"},
+    {file = "markdown_it_py-4.2.0.tar.gz", hash = "sha256:04a21681d6fbb623de53f6f364d352309d4094dd4194040a10fd51833e418d49"},
 ]
 
 [package.dependencies]
@@ -2306,143 +2304,112 @@ mdurl = ">=0.1,<1.0"
 
 [package.extras]
 benchmarking = ["psutil", "pytest", "pytest-benchmark"]
-code-style = ["pre-commit (>=3.0,<4.0)"]
-compare = ["commonmark (>=0.9,<1.0)", "markdown (>=3.4,<4.0)", "mistletoe (>=1.0,<2.0)", "mistune (>=2.0,<3.0)", "panflute (>=2.3,<3.0)"]
+compare = ["commonmark (>=0.9,<1.0)", "markdown (>=3.4,<4.0)", "markdown-it-pyrs", "mistletoe (>=1.0,<2.0)", "mistune (>=3.0,<4.0)", "panflute (>=2.3,<3.0)"]
 linkify = ["linkify-it-py (>=1,<3)"]
-plugins = ["mdit-py-plugins"]
+plugins = ["mdit-py-plugins (>=0.5.0)"]
 profiling = ["gprof2dot"]
-rtd = ["jupyter_sphinx", "mdit-py-plugins", "myst-parser", "pyyaml", "sphinx", "sphinx-copybutton", "sphinx-design", "sphinx_book_theme"]
-testing = ["coverage", "pytest", "pytest-cov", "pytest-regressions"]
+rtd = ["ipykernel", "jupyter_sphinx", "mdit-py-plugins (>=0.5.0)", "myst-parser", "pyyaml", "sphinx", "sphinx-book-theme (>=1.0,<2.0)", "sphinx-copybutton", "sphinx-design"]
+testing = ["coverage", "pytest", "pytest-cov", "pytest-regressions", "pytest-timeout", "requests"]
 
 [[package]]
 name = "markupsafe"
-version = "3.0.2"
+version = "3.0.3"
 description = "Safely add untrusted strings to HTML/XML markup."
 optional = false
 python-versions = ">=3.9"
-groups = ["main"]
-files = [
-    {file = "MarkupSafe-3.0.2-cp310-cp310-macosx_10_9_universal2.whl", hash = "sha256:7e94c425039cde14257288fd61dcfb01963e658efbc0ff54f5306b06054700f8"},
-    {file = "MarkupSafe-3.0.2-cp310-cp310-macosx_11_0_arm64.whl", hash = "sha256:9e2d922824181480953426608b81967de705c3cef4d1af983af849d7bd619158"},
-    {file = "MarkupSafe-3.0.2-cp310-cp310-manylinux_2_17_aarch64.manylinux2014_aarch64.whl", hash = "sha256:38a9ef736c01fccdd6600705b09dc574584b89bea478200c5fbf112a6b0d5579"},
-    {file = "MarkupSafe-3.0.2-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl", hash = "sha256:bbcb445fa71794da8f178f0f6d66789a28d7319071af7a496d4d507ed566270d"},
-    {file = "MarkupSafe-3.0.2-cp310-cp310-manylinux_2_5_i686.manylinux1_i686.manylinux_2_17_i686.manylinux2014_i686.whl", hash = "sha256:57cb5a3cf367aeb1d316576250f65edec5bb3be939e9247ae594b4bcbc317dfb"},
-    {file = "MarkupSafe-3.0.2-cp310-cp310-musllinux_1_2_aarch64.whl", hash = "sha256:3809ede931876f5b2ec92eef964286840ed3540dadf803dd570c3b7e13141a3b"},
-    {file = "MarkupSafe-3.0.2-cp310-cp310-musllinux_1_2_i686.whl", hash = "sha256:e07c3764494e3776c602c1e78e298937c3315ccc9043ead7e685b7f2b8d47b3c"},
-    {file = "MarkupSafe-3.0.2-cp310-cp310-musllinux_1_2_x86_64.whl", hash = "sha256:b424c77b206d63d500bcb69fa55ed8d0e6a3774056bdc4839fc9298a7edca171"},
-    {file = "MarkupSafe-3.0.2-cp310-cp310-win32.whl", hash = "sha256:fcabf5ff6eea076f859677f5f0b6b5c1a51e70a376b0579e0eadef8db48c6b50"},
-    {file = "MarkupSafe-3.0.2-cp310-cp310-win_amd64.whl", hash = "sha256:6af100e168aa82a50e186c82875a5893c5597a0c1ccdb0d8b40240b1f28b969a"},
-    {file = "MarkupSafe-3.0.2-cp311-cp311-macosx_10_9_universal2.whl", hash = "sha256:9025b4018f3a1314059769c7bf15441064b2207cb3f065e6ea1e7359cb46db9d"},
-    {file = "MarkupSafe-3.0.2-cp311-cp311-macosx_11_0_arm64.whl", hash = "sha256:93335ca3812df2f366e80509ae119189886b0f3c2b81325d39efdb84a1e2ae93"},
-    {file = "MarkupSafe-3.0.2-cp311-cp311-manylinux_2_17_aarch64.manylinux2014_aarch64.whl", hash = "sha256:2cb8438c3cbb25e220c2ab33bb226559e7afb3baec11c4f218ffa7308603c832"},
-    {file = "MarkupSafe-3.0.2-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl", hash = "sha256:a123e330ef0853c6e822384873bef7507557d8e4a082961e1defa947aa59ba84"},
-    {file = "MarkupSafe-3.0.2-cp311-cp311-manylinux_2_5_i686.manylinux1_i686.manylinux_2_17_i686.manylinux2014_i686.whl", hash = "sha256:1e084f686b92e5b83186b07e8a17fc09e38fff551f3602b249881fec658d3eca"},
-    {file = "MarkupSafe-3.0.2-cp311-cp311-musllinux_1_2_aarch64.whl", hash = "sha256:d8213e09c917a951de9d09ecee036d5c7d36cb6cb7dbaece4c71a60d79fb9798"},
-    {file = "MarkupSafe-3.0.2-cp311-cp311-musllinux_1_2_i686.whl", hash = "sha256:5b02fb34468b6aaa40dfc198d813a641e3a63b98c2b05a16b9f80b7ec314185e"},
-    {file = "MarkupSafe-3.0.2-cp311-cp311-musllinux_1_2_x86_64.whl", hash = "sha256:0bff5e0ae4ef2e1ae4fdf2dfd5b76c75e5c2fa4132d05fc1b0dabcd20c7e28c4"},
-    {file = "MarkupSafe-3.0.2-cp311-cp311-win32.whl", hash = "sha256:6c89876f41da747c8d3677a2b540fb32ef5715f97b66eeb0c6b66f5e3ef6f59d"},
-    {file = "MarkupSafe-3.0.2-cp311-cp311-win_amd64.whl", hash = "sha256:70a87b411535ccad5ef2f1df5136506a10775d267e197e4cf531ced10537bd6b"},
-    {file = "MarkupSafe-3.0.2-cp312-cp312-macosx_10_13_universal2.whl", hash = "sha256:9778bd8ab0a994ebf6f84c2b949e65736d5575320a17ae8984a77fab08db94cf"},
-    {file = "MarkupSafe-3.0.2-cp312-cp312-macosx_11_0_arm64.whl", hash = "sha256:846ade7b71e3536c4e56b386c2a47adf5741d2d8b94ec9dc3e92e5e1ee1e2225"},
-    {file = "MarkupSafe-3.0.2-cp312-cp312-manylinux_2_17_aarch64.manylinux2014_aarch64.whl", hash = "sha256:1c99d261bd2d5f6b59325c92c73df481e05e57f19837bdca8413b9eac4bd8028"},
-    {file = "MarkupSafe-3.0.2-cp312-cp312-manylinux_2_17_x86_64.manylinux2014_x86_64.whl", hash = "sha256:e17c96c14e19278594aa4841ec148115f9c7615a47382ecb6b82bd8fea3ab0c8"},
-    {file = "MarkupSafe-3.0.2-cp312-cp312-manylinux_2_5_i686.manylinux1_i686.manylinux_2_17_i686.manylinux2014_i686.whl", hash = "sha256:88416bd1e65dcea10bc7569faacb2c20ce071dd1f87539ca2ab364bf6231393c"},
-    {file = "MarkupSafe-3.0.2-cp312-cp312-musllinux_1_2_aarch64.whl", hash = "sha256:2181e67807fc2fa785d0592dc2d6206c019b9502410671cc905d132a92866557"},
-    {file = "MarkupSafe-3.0.2-cp312-cp312-musllinux_1_2_i686.whl", hash = "sha256:52305740fe773d09cffb16f8ed0427942901f00adedac82ec8b67752f58a1b22"},
-    {file = "MarkupSafe-3.0.2-cp312-cp312-musllinux_1_2_x86_64.whl", hash = "sha256:ad10d3ded218f1039f11a75f8091880239651b52e9bb592ca27de44eed242a48"},
-    {file = "MarkupSafe-3.0.2-cp312-cp312-win32.whl", hash = "sha256:0f4ca02bea9a23221c0182836703cbf8930c5e9454bacce27e767509fa286a30"},
-    {file = "MarkupSafe-3.0.2-cp312-cp312-win_amd64.whl", hash = "sha256:8e06879fc22a25ca47312fbe7c8264eb0b662f6db27cb2d3bbbc74b1df4b9b87"},
-    {file = "MarkupSafe-3.0.2-cp313-cp313-macosx_10_13_universal2.whl", hash = "sha256:ba9527cdd4c926ed0760bc301f6728ef34d841f405abf9d4f959c478421e4efd"},
-    {file = "MarkupSafe-3.0.2-cp313-cp313-macosx_11_0_arm64.whl", hash = "sha256:f8b3d067f2e40fe93e1ccdd6b2e1d16c43140e76f02fb1319a05cf2b79d99430"},
-    {file = "MarkupSafe-3.0.2-cp313-cp313-manylinux_2_17_aarch64.manylinux2014_aarch64.whl", hash = "sha256:569511d3b58c8791ab4c2e1285575265991e6d8f8700c7be0e88f86cb0672094"},
-    {file = "MarkupSafe-3.0.2-cp313-cp313-manylinux_2_17_x86_64.manylinux2014_x86_64.whl", hash = "sha256:15ab75ef81add55874e7ab7055e9c397312385bd9ced94920f2802310c930396"},
-    {file = "MarkupSafe-3.0.2-cp313-cp313-manylinux_2_5_i686.manylinux1_i686.manylinux_2_17_i686.manylinux2014_i686.whl", hash = "sha256:f3818cb119498c0678015754eba762e0d61e5b52d34c8b13d770f0719f7b1d79"},
-    {file = "MarkupSafe-3.0.2-cp313-cp313-musllinux_1_2_aarch64.whl", hash = "sha256:cdb82a876c47801bb54a690c5ae105a46b392ac6099881cdfb9f6e95e4014c6a"},
-    {file = "MarkupSafe-3.0.2-cp313-cp313-musllinux_1_2_i686.whl", hash = "sha256:cabc348d87e913db6ab4aa100f01b08f481097838bdddf7c7a84b7575b7309ca"},
-    {file = "MarkupSafe-3.0.2-cp313-cp313-musllinux_1_2_x86_64.whl", hash = "sha256:444dcda765c8a838eaae23112db52f1efaf750daddb2d9ca300bcae1039adc5c"},
-    {file = "MarkupSafe-3.0.2-cp313-cp313-win32.whl", hash = "sha256:bcf3e58998965654fdaff38e58584d8937aa3096ab5354d493c77d1fdd66d7a1"},
-    {file = "MarkupSafe-3.0.2-cp313-cp313-win_amd64.whl", hash = "sha256:e6a2a455bd412959b57a172ce6328d2dd1f01cb2135efda2e4576e8a23fa3b0f"},
-    {file = "MarkupSafe-3.0.2-cp313-cp313t-macosx_10_13_universal2.whl", hash = "sha256:b5a6b3ada725cea8a5e634536b1b01c30bcdcd7f9c6fff4151548d5bf6b3a36c"},
-    {file = "MarkupSafe-3.0.2-cp313-cp313t-macosx_11_0_arm64.whl", hash = "sha256:a904af0a6162c73e3edcb969eeeb53a63ceeb5d8cf642fade7d39e7963a22ddb"},
-    {file = "MarkupSafe-3.0.2-cp313-cp313t-manylinux_2_17_aarch64.manylinux2014_aarch64.whl", hash = "sha256:4aa4e5faecf353ed117801a068ebab7b7e09ffb6e1d5e412dc852e0da018126c"},
-    {file = "MarkupSafe-3.0.2-cp313-cp313t-manylinux_2_17_x86_64.manylinux2014_x86_64.whl", hash = "sha256:c0ef13eaeee5b615fb07c9a7dadb38eac06a0608b41570d8ade51c56539e509d"},
-    {file = "MarkupSafe-3.0.2-cp313-cp313t-manylinux_2_5_i686.manylinux1_i686.manylinux_2_17_i686.manylinux2014_i686.whl", hash = "sha256:d16a81a06776313e817c951135cf7340a3e91e8c1ff2fac444cfd75fffa04afe"},
-    {file = "MarkupSafe-3.0.2-cp313-cp313t-musllinux_1_2_aarch64.whl", hash = "sha256:6381026f158fdb7c72a168278597a5e3a5222e83ea18f543112b2662a9b699c5"},
-    {file = "MarkupSafe-3.0.2-cp313-cp313t-musllinux_1_2_i686.whl", hash = "sha256:3d79d162e7be8f996986c064d1c7c817f6df3a77fe3d6859f6f9e7be4b8c213a"},
-    {file = "MarkupSafe-3.0.2-cp313-cp313t-musllinux_1_2_x86_64.whl", hash = "sha256:131a3c7689c85f5ad20f9f6fb1b866f402c445b220c19fe4308c0b147ccd2ad9"},
-    {file = "MarkupSafe-3.0.2-cp313-cp313t-win32.whl", hash = "sha256:ba8062ed2cf21c07a9e295d5b8a2a5ce678b913b45fdf68c32d95d6c1291e0b6"},
-    {file = "MarkupSafe-3.0.2-cp313-cp313t-win_amd64.whl", hash = "sha256:e444a31f8db13eb18ada366ab3cf45fd4b31e4db1236a4448f68778c1d1a5a2f"},
-    {file = "MarkupSafe-3.0.2-cp39-cp39-macosx_10_9_universal2.whl", hash = "sha256:eaa0a10b7f72326f1372a713e73c3f739b524b3af41feb43e4921cb529f5929a"},
-    {file = "MarkupSafe-3.0.2-cp39-cp39-macosx_11_0_arm64.whl", hash = "sha256:48032821bbdf20f5799ff537c7ac3d1fba0ba032cfc06194faffa8cda8b560ff"},
-    {file = "MarkupSafe-3.0.2-cp39-cp39-manylinux_2_17_aarch64.manylinux2014_aarch64.whl", hash = "sha256:1a9d3f5f0901fdec14d8d2f66ef7d035f2157240a433441719ac9a3fba440b13"},
-    {file = "MarkupSafe-3.0.2-cp39-cp39-manylinux_2_17_x86_64.manylinux2014_x86_64.whl", hash = "sha256:88b49a3b9ff31e19998750c38e030fc7bb937398b1f78cfa599aaef92d693144"},
-    {file = "MarkupSafe-3.0.2-cp39-cp39-manylinux_2_5_i686.manylinux1_i686.manylinux_2_17_i686.manylinux2014_i686.whl", hash = "sha256:cfad01eed2c2e0c01fd0ecd2ef42c492f7f93902e39a42fc9ee1692961443a29"},
-    {file = "MarkupSafe-3.0.2-cp39-cp39-musllinux_1_2_aarch64.whl", hash = "sha256:1225beacc926f536dc82e45f8a4d68502949dc67eea90eab715dea3a21c1b5f0"},
-    {file = "MarkupSafe-3.0.2-cp39-cp39-musllinux_1_2_i686.whl", hash = "sha256:3169b1eefae027567d1ce6ee7cae382c57fe26e82775f460f0b2778beaad66c0"},
-    {file = "MarkupSafe-3.0.2-cp39-cp39-musllinux_1_2_x86_64.whl", hash = "sha256:eb7972a85c54febfb25b5c4b4f3af4dcc731994c7da0d8a0b4a6eb0640e1d178"},
-    {file = "MarkupSafe-3.0.2-cp39-cp39-win32.whl", hash = "sha256:8c4e8c3ce11e1f92f6536ff07154f9d49677ebaaafc32db9db4620bc11ed480f"},
-    {file = "MarkupSafe-3.0.2-cp39-cp39-win_amd64.whl", hash = "sha256:6e296a513ca3d94054c2c881cc913116e90fd030ad1c656b3869762b754f5f8a"},
-    {file = "markupsafe-3.0.2.tar.gz", hash = "sha256:ee55d3edf80167e48ea11a923c7386f4669df67d7994554387f84e7d8b0a2bf0"},
-]
-
-[[package]]
-name = "matplotlib"
-version = "3.10.0"
-description = "Python plotting package"
-optional = false
-python-versions = ">=3.10"
-groups = ["main"]
-files = [
-    {file = "matplotlib-3.10.0-cp310-cp310-macosx_10_12_x86_64.whl", hash = "sha256:2c5829a5a1dd5a71f0e31e6e8bb449bc0ee9dbfb05ad28fc0c6b55101b3a4be6"},
-    {file = "matplotlib-3.10.0-cp310-cp310-macosx_11_0_arm64.whl", hash = "sha256:a2a43cbefe22d653ab34bb55d42384ed30f611bcbdea1f8d7f431011a2e1c62e"},
-    {file = "matplotlib-3.10.0-cp310-cp310-manylinux_2_17_aarch64.manylinux2014_aarch64.whl", hash = "sha256:607b16c8a73943df110f99ee2e940b8a1cbf9714b65307c040d422558397dac5"},
-    {file = "matplotlib-3.10.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl", hash = "sha256:01d2b19f13aeec2e759414d3bfe19ddfb16b13a1250add08d46d5ff6f9be83c6"},
-    {file = "matplotlib-3.10.0-cp310-cp310-musllinux_1_2_x86_64.whl", hash = "sha256:5e6c6461e1fc63df30bf6f80f0b93f5b6784299f721bc28530477acd51bfc3d1"},
-    {file = "matplotlib-3.10.0-cp310-cp310-win_amd64.whl", hash = "sha256:994c07b9d9fe8d25951e3202a68c17900679274dadfc1248738dcfa1bd40d7f3"},
-    {file = "matplotlib-3.10.0-cp311-cp311-macosx_10_12_x86_64.whl", hash = "sha256:fd44fc75522f58612ec4a33958a7e5552562b7705b42ef1b4f8c0818e304a363"},
-    {file = "matplotlib-3.10.0-cp311-cp311-macosx_11_0_arm64.whl", hash = "sha256:c58a9622d5dbeb668f407f35f4e6bfac34bb9ecdcc81680c04d0258169747997"},
-    {file = "matplotlib-3.10.0-cp311-cp311-manylinux_2_17_aarch64.manylinux2014_aarch64.whl", hash = "sha256:845d96568ec873be63f25fa80e9e7fae4be854a66a7e2f0c8ccc99e94a8bd4ef"},
-    {file = "matplotlib-3.10.0-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl", hash = "sha256:5439f4c5a3e2e8eab18e2f8c3ef929772fd5641876db71f08127eed95ab64683"},
-    {file = "matplotlib-3.10.0-cp311-cp311-musllinux_1_2_x86_64.whl", hash = "sha256:4673ff67a36152c48ddeaf1135e74ce0d4bce1bbf836ae40ed39c29edf7e2765"},
-    {file = "matplotlib-3.10.0-cp311-cp311-win_amd64.whl", hash = "sha256:7e8632baebb058555ac0cde75db885c61f1212e47723d63921879806b40bec6a"},
-    {file = "matplotlib-3.10.0-cp312-cp312-macosx_10_13_x86_64.whl", hash = "sha256:4659665bc7c9b58f8c00317c3c2a299f7f258eeae5a5d56b4c64226fca2f7c59"},
-    {file = "matplotlib-3.10.0-cp312-cp312-macosx_11_0_arm64.whl", hash = "sha256:d44cb942af1693cced2604c33a9abcef6205601c445f6d0dc531d813af8a2f5a"},
-    {file = "matplotlib-3.10.0-cp312-cp312-manylinux_2_17_aarch64.manylinux2014_aarch64.whl", hash = "sha256:a994f29e968ca002b50982b27168addfd65f0105610b6be7fa515ca4b5307c95"},
-    {file = "matplotlib-3.10.0-cp312-cp312-manylinux_2_17_x86_64.manylinux2014_x86_64.whl", hash = "sha256:9b0558bae37f154fffda54d779a592bc97ca8b4701f1c710055b609a3bac44c8"},
-    {file = "matplotlib-3.10.0-cp312-cp312-musllinux_1_2_x86_64.whl", hash = "sha256:503feb23bd8c8acc75541548a1d709c059b7184cde26314896e10a9f14df5f12"},
-    {file = "matplotlib-3.10.0-cp312-cp312-win_amd64.whl", hash = "sha256:c40ba2eb08b3f5de88152c2333c58cee7edcead0a2a0d60fcafa116b17117adc"},
-    {file = "matplotlib-3.10.0-cp313-cp313-macosx_10_13_x86_64.whl", hash = "sha256:96f2886f5c1e466f21cc41b70c5a0cd47bfa0015eb2d5793c88ebce658600e25"},
-    {file = "matplotlib-3.10.0-cp313-cp313-macosx_11_0_arm64.whl", hash = "sha256:12eaf48463b472c3c0f8dbacdbf906e573013df81a0ab82f0616ea4b11281908"},
-    {file = "matplotlib-3.10.0-cp313-cp313-manylinux_2_17_aarch64.manylinux2014_aarch64.whl", hash = "sha256:2fbbabc82fde51391c4da5006f965e36d86d95f6ee83fb594b279564a4c5d0d2"},
-    {file = "matplotlib-3.10.0-cp313-cp313-manylinux_2_17_x86_64.manylinux2014_x86_64.whl", hash = "sha256:ad2e15300530c1a94c63cfa546e3b7864bd18ea2901317bae8bbf06a5ade6dcf"},
-    {file = "matplotlib-3.10.0-cp313-cp313-musllinux_1_2_x86_64.whl", hash = "sha256:3547d153d70233a8496859097ef0312212e2689cdf8d7ed764441c77604095ae"},
-    {file = "matplotlib-3.10.0-cp313-cp313-win_amd64.whl", hash = "sha256:c55b20591ced744aa04e8c3e4b7543ea4d650b6c3c4b208c08a05b4010e8b442"},
-    {file = "matplotlib-3.10.0-cp313-cp313t-macosx_10_13_x86_64.whl", hash = "sha256:9ade1003376731a971e398cc4ef38bb83ee8caf0aee46ac6daa4b0506db1fd06"},
-    {file = "matplotlib-3.10.0-cp313-cp313t-macosx_11_0_arm64.whl", hash = "sha256:95b710fea129c76d30be72c3b38f330269363fbc6e570a5dd43580487380b5ff"},
-    {file = "matplotlib-3.10.0-cp313-cp313t-manylinux_2_17_aarch64.manylinux2014_aarch64.whl", hash = "sha256:5cdbaf909887373c3e094b0318d7ff230b2ad9dcb64da7ade654182872ab2593"},
-    {file = "matplotlib-3.10.0-cp313-cp313t-manylinux_2_17_x86_64.manylinux2014_x86_64.whl", hash = "sha256:d907fddb39f923d011875452ff1eca29a9e7f21722b873e90db32e5d8ddff12e"},
-    {file = "matplotlib-3.10.0-cp313-cp313t-musllinux_1_2_x86_64.whl", hash = "sha256:3b427392354d10975c1d0f4ee18aa5844640b512d5311ef32efd4dd7db106ede"},
-    {file = "matplotlib-3.10.0-cp313-cp313t-win_amd64.whl", hash = "sha256:5fd41b0ec7ee45cd960a8e71aea7c946a28a0b8a4dcee47d2856b2af051f334c"},
-    {file = "matplotlib-3.10.0-pp310-pypy310_pp73-macosx_10_15_x86_64.whl", hash = "sha256:81713dd0d103b379de4516b861d964b1d789a144103277769238c732229d7f03"},
-    {file = "matplotlib-3.10.0-pp310-pypy310_pp73-macosx_11_0_arm64.whl", hash = "sha256:359f87baedb1f836ce307f0e850d12bb5f1936f70d035561f90d41d305fdacea"},
-    {file = "matplotlib-3.10.0-pp310-pypy310_pp73-manylinux_2_17_x86_64.manylinux2014_x86_64.whl", hash = "sha256:ae80dc3a4add4665cf2faa90138384a7ffe2a4e37c58d83e115b54287c4f06ef"},
-    {file = "matplotlib-3.10.0.tar.gz", hash = "sha256:b886d02a581b96704c9d1ffe55709e49b4d2d52709ccebc4be42db856e511278"},
+groups = ["main", "training"]
+files = [
+    {file = "markupsafe-3.0.3-cp310-cp310-macosx_10_9_x86_64.whl", hash = "sha256:2f981d352f04553a7171b8e44369f2af4055f888dfb147d55e42d29e29e74559"},
+    {file = "markupsafe-3.0.3-cp310-cp310-macosx_11_0_arm64.whl", hash = "sha256:e1c1493fb6e50ab01d20a22826e57520f1284df32f2d8601fdd90b6304601419"},
+    {file = "markupsafe-3.0.3-cp310-cp310-manylinux2014_aarch64.manylinux_2_17_aarch64.manylinux_2_28_aarch64.whl", hash = "sha256:1ba88449deb3de88bd40044603fafffb7bc2b055d626a330323a9ed736661695"},
+    {file = "markupsafe-3.0.3-cp310-cp310-manylinux2014_x86_64.manylinux_2_17_x86_64.manylinux_2_28_x86_64.whl", hash = "sha256:f42d0984e947b8adf7dd6dde396e720934d12c506ce84eea8476409563607591"},
+    {file = "markupsafe-3.0.3-cp310-cp310-manylinux_2_31_riscv64.manylinux_2_39_riscv64.whl", hash = "sha256:c0c0b3ade1c0b13b936d7970b1d37a57acde9199dc2aecc4c336773e1d86049c"},
+    {file = "markupsafe-3.0.3-cp310-cp310-musllinux_1_2_aarch64.whl", hash = "sha256:0303439a41979d9e74d18ff5e2dd8c43ed6c6001fd40e5bf2e43f7bd9bbc523f"},
+    {file = "markupsafe-3.0.3-cp310-cp310-musllinux_1_2_riscv64.whl", hash = "sha256:d2ee202e79d8ed691ceebae8e0486bd9a2cd4794cec4824e1c99b6f5009502f6"},
+    {file = "markupsafe-3.0.3-cp310-cp310-musllinux_1_2_x86_64.whl", hash = "sha256:177b5253b2834fe3678cb4a5f0059808258584c559193998be2601324fdeafb1"},
+    {file = "markupsafe-3.0.3-cp310-cp310-win32.whl", hash = "sha256:2a15a08b17dd94c53a1da0438822d70ebcd13f8c3a95abe3a9ef9f11a94830aa"},
+    {file = "markupsafe-3.0.3-cp310-cp310-win_amd64.whl", hash = "sha256:c4ffb7ebf07cfe8931028e3e4c85f0357459a3f9f9490886198848f4fa002ec8"},
+    {file = "markupsafe-3.0.3-cp310-cp310-win_arm64.whl", hash = "sha256:e2103a929dfa2fcaf9bb4e7c091983a49c9ac3b19c9061b6d5427dd7d14d81a1"},
+    {file = "markupsafe-3.0.3-cp311-cp311-macosx_10_9_x86_64.whl", hash = "sha256:1cc7ea17a6824959616c525620e387f6dd30fec8cb44f649e31712db02123dad"},
+    {file = "markupsafe-3.0.3-cp311-cp311-macosx_11_0_arm64.whl", hash = "sha256:4bd4cd07944443f5a265608cc6aab442e4f74dff8088b0dfc8238647b8f6ae9a"},
+    {file = "markupsafe-3.0.3-cp311-cp311-manylinux2014_aarch64.manylinux_2_17_aarch64.manylinux_2_28_aarch64.whl", hash = "sha256:6b5420a1d9450023228968e7e6a9ce57f65d148ab56d2313fcd589eee96a7a50"},
+    {file = "markupsafe-3.0.3-cp311-cp311-manylinux2014_x86_64.manylinux_2_17_x86_64.manylinux_2_28_x86_64.whl", hash = "sha256:0bf2a864d67e76e5c9a34dc26ec616a66b9888e25e7b9460e1c76d3293bd9dbf"},
+    {file = "markupsafe-3.0.3-cp311-cp311-manylinux_2_31_riscv64.manylinux_2_39_riscv64.whl", hash = "sha256:bc51efed119bc9cfdf792cdeaa4d67e8f6fcccab66ed4bfdd6bde3e59bfcbb2f"},
+    {file = "markupsafe-3.0.3-cp311-cp311-musllinux_1_2_aarch64.whl", hash = "sha256:068f375c472b3e7acbe2d5318dea141359e6900156b5b2ba06a30b169086b91a"},
+    {file = "markupsafe-3.0.3-cp311-cp311-musllinux_1_2_riscv64.whl", hash = "sha256:7be7b61bb172e1ed687f1754f8e7484f1c8019780f6f6b0786e76bb01c2ae115"},
+    {file = "markupsafe-3.0.3-cp311-cp311-musllinux_1_2_x86_64.whl", hash = "sha256:f9e130248f4462aaa8e2552d547f36ddadbeaa573879158d721bbd33dfe4743a"},
+    {file = "markupsafe-3.0.3-cp311-cp311-win32.whl", hash = "sha256:0db14f5dafddbb6d9208827849fad01f1a2609380add406671a26386cdf15a19"},
+    {file = "markupsafe-3.0.3-cp311-cp311-win_amd64.whl", hash = "sha256:de8a88e63464af587c950061a5e6a67d3632e36df62b986892331d4620a35c01"},
+    {file = "markupsafe-3.0.3-cp311-cp311-win_arm64.whl", hash = "sha256:3b562dd9e9ea93f13d53989d23a7e775fdfd1066c33494ff43f5418bc8c58a5c"},
+    {file = "markupsafe-3.0.3-cp312-cp312-macosx_10_13_x86_64.whl", hash = "sha256:d53197da72cc091b024dd97249dfc7794d6a56530370992a5e1a08983ad9230e"},
+    {file = "markupsafe-3.0.3-cp312-cp312-macosx_11_0_arm64.whl", hash = "sha256:1872df69a4de6aead3491198eaf13810b565bdbeec3ae2dc8780f14458ec73ce"},
+    {file = "markupsafe-3.0.3-cp312-cp312-manylinux2014_aarch64.manylinux_2_17_aarch64.manylinux_2_28_aarch64.whl", hash = "sha256:3a7e8ae81ae39e62a41ec302f972ba6ae23a5c5396c8e60113e9066ef893da0d"},
+    {file = "markupsafe-3.0.3-cp312-cp312-manylinux2014_x86_64.manylinux_2_17_x86_64.manylinux_2_28_x86_64.whl", hash = "sha256:d6dd0be5b5b189d31db7cda48b91d7e0a9795f31430b7f271219ab30f1d3ac9d"},
+    {file = "markupsafe-3.0.3-cp312-cp312-manylinux_2_31_riscv64.manylinux_2_39_riscv64.whl", hash = "sha256:94c6f0bb423f739146aec64595853541634bde58b2135f27f61c1ffd1cd4d16a"},
+    {file = "markupsafe-3.0.3-cp312-cp312-musllinux_1_2_aarch64.whl", hash = "sha256:be8813b57049a7dc738189df53d69395eba14fb99345e0a5994914a3864c8a4b"},
+    {file = "markupsafe-3.0.3-cp312-cp312-musllinux_1_2_riscv64.whl", hash = "sha256:83891d0e9fb81a825d9a6d61e3f07550ca70a076484292a70fde82c4b807286f"},
+    {file = "markupsafe-3.0.3-cp312-cp312-musllinux_1_2_x86_64.whl", hash = "sha256:77f0643abe7495da77fb436f50f8dab76dbc6e5fd25d39589a0f1fe6548bfa2b"},
+    {file = "markupsafe-3.0.3-cp312-cp312-win32.whl", hash = "sha256:d88b440e37a16e651bda4c7c2b930eb586fd15ca7406cb39e211fcff3bf3017d"},
+    {file = "markupsafe-3.0.3-cp312-cp312-win_amd64.whl", hash = "sha256:26a5784ded40c9e318cfc2bdb30fe164bdb8665ded9cd64d500a34fb42067b1c"},
+    {file = "markupsafe-3.0.3-cp312-cp312-win_arm64.whl", hash = "sha256:35add3b638a5d900e807944a078b51922212fb3dedb01633a8defc4b01a3c85f"},
+    {file = "markupsafe-3.0.3-cp313-cp313-macosx_10_13_x86_64.whl", hash = "sha256:e1cf1972137e83c5d4c136c43ced9ac51d0e124706ee1c8aa8532c1287fa8795"},
+    {file = "markupsafe-3.0.3-cp313-cp313-macosx_11_0_arm64.whl", hash = "sha256:116bb52f642a37c115f517494ea5feb03889e04df47eeff5b130b1808ce7c219"},
+    {file = "markupsafe-3.0.3-cp313-cp313-manylinux2014_aarch64.manylinux_2_17_aarch64.manylinux_2_28_aarch64.whl", hash = "sha256:133a43e73a802c5562be9bbcd03d090aa5a1fe899db609c29e8c8d815c5f6de6"},
+    {file = "markupsafe-3.0.3-cp313-cp313-manylinux2014_x86_64.manylinux_2_17_x86_64.manylinux_2_28_x86_64.whl", hash = "sha256:ccfcd093f13f0f0b7fdd0f198b90053bf7b2f02a3927a30e63f3ccc9df56b676"},
+    {file = "markupsafe-3.0.3-cp313-cp313-manylinux_2_31_riscv64.manylinux_2_39_riscv64.whl", hash = "sha256:509fa21c6deb7a7a273d629cf5ec029bc209d1a51178615ddf718f5918992ab9"},
+    {file = "markupsafe-3.0.3-cp313-cp313-musllinux_1_2_aarch64.whl", hash = "sha256:a4afe79fb3de0b7097d81da19090f4df4f8d3a2b3adaa8764138aac2e44f3af1"},
+    {file = "markupsafe-3.0.3-cp313-cp313-musllinux_1_2_riscv64.whl", hash = "sha256:795e7751525cae078558e679d646ae45574b47ed6e7771863fcc079a6171a0fc"},
+    {file = "markupsafe-3.0.3-cp313-cp313-musllinux_1_2_x86_64.whl", hash = "sha256:8485f406a96febb5140bfeca44a73e3ce5116b2501ac54fe953e488fb1d03b12"},
+    {file = "markupsafe-3.0.3-cp313-cp313-win32.whl", hash = "sha256:bdd37121970bfd8be76c5fb069c7751683bdf373db1ed6c010162b2a130248ed"},
+    {file = "markupsafe-3.0.3-cp313-cp313-win_amd64.whl", hash = "sha256:9a1abfdc021a164803f4d485104931fb8f8c1efd55bc6b748d2f5774e78b62c5"},
+    {file = "markupsafe-3.0.3-cp313-cp313-win_arm64.whl", hash = "sha256:7e68f88e5b8799aa49c85cd116c932a1ac15caaa3f5db09087854d218359e485"},
+    {file = "markupsafe-3.0.3-cp313-cp313t-macosx_10_13_x86_64.whl", hash = "sha256:218551f6df4868a8d527e3062d0fb968682fe92054e89978594c28e642c43a73"},
+    {file = "markupsafe-3.0.3-cp313-cp313t-macosx_11_0_arm64.whl", hash = "sha256:3524b778fe5cfb3452a09d31e7b5adefeea8c5be1d43c4f810ba09f2ceb29d37"},
+    {file = "markupsafe-3.0.3-cp313-cp313t-manylinux2014_aarch64.manylinux_2_17_aarch64.manylinux_2_28_aarch64.whl", hash = "sha256:4e885a3d1efa2eadc93c894a21770e4bc67899e3543680313b09f139e149ab19"},
+    {file = "markupsafe-3.0.3-cp313-cp313t-manylinux2014_x86_64.manylinux_2_17_x86_64.manylinux_2_28_x86_64.whl", hash = "sha256:8709b08f4a89aa7586de0aadc8da56180242ee0ada3999749b183aa23df95025"},
+    {file = "markupsafe-3.0.3-cp313-cp313t-manylinux_2_31_riscv64.manylinux_2_39_riscv64.whl", hash = "sha256:b8512a91625c9b3da6f127803b166b629725e68af71f8184ae7e7d54686a56d6"},
+    {file = "markupsafe-3.0.3-cp313-cp313t-musllinux_1_2_aarch64.whl", hash = "sha256:9b79b7a16f7fedff2495d684f2b59b0457c3b493778c9eed31111be64d58279f"},
+    {file = "markupsafe-3.0.3-cp313-cp313t-musllinux_1_2_riscv64.whl", hash = "sha256:12c63dfb4a98206f045aa9563db46507995f7ef6d83b2f68eda65c307c6829eb"},
+    {file = "markupsafe-3.0.3-cp313-cp313t-musllinux_1_2_x86_64.whl", hash = "sha256:8f71bc33915be5186016f675cd83a1e08523649b0e33efdb898db577ef5bb009"},
+    {file = "markupsafe-3.0.3-cp313-cp313t-win32.whl", hash = "sha256:69c0b73548bc525c8cb9a251cddf1931d1db4d2258e9599c28c07ef3580ef354"},
+    {file = "markupsafe-3.0.3-cp313-cp313t-win_amd64.whl", hash = "sha256:1b4b79e8ebf6b55351f0d91fe80f893b4743f104bff22e90697db1590e47a218"},
+    {file = "markupsafe-3.0.3-cp313-cp313t-win_arm64.whl", hash = "sha256:ad2cf8aa28b8c020ab2fc8287b0f823d0a7d8630784c31e9ee5edea20f406287"},
+    {file = "markupsafe-3.0.3-cp314-cp314-macosx_10_13_x86_64.whl", hash = "sha256:eaa9599de571d72e2daf60164784109f19978b327a3910d3e9de8c97b5b70cfe"},
+    {file = "markupsafe-3.0.3-cp314-cp314-macosx_11_0_arm64.whl", hash = "sha256:c47a551199eb8eb2121d4f0f15ae0f923d31350ab9280078d1e5f12b249e0026"},
+    {file = "markupsafe-3.0.3-cp314-cp314-manylinux2014_aarch64.manylinux_2_17_aarch64.manylinux_2_28_aarch64.whl", hash = "sha256:f34c41761022dd093b4b6896d4810782ffbabe30f2d443ff5f083e0cbbb8c737"},
+    {file = "markupsafe-3.0.3-cp314-cp314-manylinux2014_x86_64.manylinux_2_17_x86_64.manylinux_2_28_x86_64.whl", hash = "sha256:457a69a9577064c05a97c41f4e65148652db078a3a509039e64d3467b9e7ef97"},
+    {file = "markupsafe-3.0.3-cp314-cp314-manylinux_2_31_riscv64.manylinux_2_39_riscv64.whl", hash = "sha256:e8afc3f2ccfa24215f8cb28dcf43f0113ac3c37c2f0f0806d8c70e4228c5cf4d"},
+    {file = "markupsafe-3.0.3-cp314-cp314-musllinux_1_2_aarch64.whl", hash = "sha256:ec15a59cf5af7be74194f7ab02d0f59a62bdcf1a537677ce67a2537c9b87fcda"},
+    {file = "markupsafe-3.0.3-cp314-cp314-musllinux_1_2_riscv64.whl", hash = "sha256:0eb9ff8191e8498cca014656ae6b8d61f39da5f95b488805da4bb029cccbfbaf"},
+    {file = "markupsafe-3.0.3-cp314-cp314-musllinux_1_2_x86_64.whl", hash = "sha256:2713baf880df847f2bece4230d4d094280f4e67b1e813eec43b4c0e144a34ffe"},
+    {file = "markupsafe-3.0.3-cp314-cp314-win32.whl", hash = "sha256:729586769a26dbceff69f7a7dbbf59ab6572b99d94576a5592625d5b411576b9"},
+    {file = "markupsafe-3.0.3-cp314-cp314-win_amd64.whl", hash = "sha256:bdc919ead48f234740ad807933cdf545180bfbe9342c2bb451556db2ed958581"},
+    {file = "markupsafe-3.0.3-cp314-cp314-win_arm64.whl", hash = "sha256:5a7d5dc5140555cf21a6fefbdbf8723f06fcd2f63ef108f2854de715e4422cb4"},
+    {file = "markupsafe-3.0.3-cp314-cp314t-macosx_10_13_x86_64.whl", hash = "sha256:1353ef0c1b138e1907ae78e2f6c63ff67501122006b0f9abad68fda5f4ffc6ab"},
+    {file = "markupsafe-3.0.3-cp314-cp314t-macosx_11_0_arm64.whl", hash = "sha256:1085e7fbddd3be5f89cc898938f42c0b3c711fdcb37d75221de2666af647c175"},
+    {file = "markupsafe-3.0.3-cp314-cp314t-manylinux2014_aarch64.manylinux_2_17_aarch64.manylinux_2_28_aarch64.whl", hash = "sha256:1b52b4fb9df4eb9ae465f8d0c228a00624de2334f216f178a995ccdcf82c4634"},
+    {file = "markupsafe-3.0.3-cp314-cp314t-manylinux2014_x86_64.manylinux_2_17_x86_64.manylinux_2_28_x86_64.whl", hash = "sha256:fed51ac40f757d41b7c48425901843666a6677e3e8eb0abcff09e4ba6e664f50"},
+    {file = "markupsafe-3.0.3-cp314-cp314t-manylinux_2_31_riscv64.manylinux_2_39_riscv64.whl", hash = "sha256:f190daf01f13c72eac4efd5c430a8de82489d9cff23c364c3ea822545032993e"},
+    {file = "markupsafe-3.0.3-cp314-cp314t-musllinux_1_2_aarch64.whl", hash = "sha256:e56b7d45a839a697b5eb268c82a71bd8c7f6c94d6fd50c3d577fa39a9f1409f5"},
+    {file = "markupsafe-3.0.3-cp314-cp314t-musllinux_1_2_riscv64.whl", hash = "sha256:f3e98bb3798ead92273dc0e5fd0f31ade220f59a266ffd8a4f6065e0a3ce0523"},
+    {file = "markupsafe-3.0.3-cp314-cp314t-musllinux_1_2_x86_64.whl", hash = "sha256:5678211cb9333a6468fb8d8be0305520aa073f50d17f089b5b4b477ea6e67fdc"},
+    {file = "markupsafe-3.0.3-cp314-cp314t-win32.whl", hash = "sha256:915c04ba3851909ce68ccc2b8e2cd691618c4dc4c4232fb7982bca3f41fd8c3d"},
+    {file = "markupsafe-3.0.3-cp314-cp314t-win_amd64.whl", hash = "sha256:4faffd047e07c38848ce017e8725090413cd80cbc23d86e55c587bf979e579c9"},
+    {file = "markupsafe-3.0.3-cp314-cp314t-win_arm64.whl", hash = "sha256:32001d6a8fc98c8cb5c947787c5d08b0a50663d139f1305bac5885d98d9b40fa"},
+    {file = "markupsafe-3.0.3-cp39-cp39-macosx_10_9_x86_64.whl", hash = "sha256:15d939a21d546304880945ca1ecb8a039db6b4dc49b2c5a400387cdae6a62e26"},
+    {file = "markupsafe-3.0.3-cp39-cp39-macosx_11_0_arm64.whl", hash = "sha256:f71a396b3bf33ecaa1626c255855702aca4d3d9fea5e051b41ac59a9c1c41edc"},
+    {file = "markupsafe-3.0.3-cp39-cp39-manylinux2014_aarch64.manylinux_2_17_aarch64.manylinux_2_28_aarch64.whl", hash = "sha256:0f4b68347f8c5eab4a13419215bdfd7f8c9b19f2b25520968adfad23eb0ce60c"},
+    {file = "markupsafe-3.0.3-cp39-cp39-manylinux2014_x86_64.manylinux_2_17_x86_64.manylinux_2_28_x86_64.whl", hash = "sha256:e8fc20152abba6b83724d7ff268c249fa196d8259ff481f3b1476383f8f24e42"},
+    {file = "markupsafe-3.0.3-cp39-cp39-manylinux_2_31_riscv64.manylinux_2_39_riscv64.whl", hash = "sha256:949b8d66bc381ee8b007cd945914c721d9aba8e27f71959d750a46f7c282b20b"},
+    {file = "markupsafe-3.0.3-cp39-cp39-musllinux_1_2_aarch64.whl", hash = "sha256:3537e01efc9d4dccdf77221fb1cb3b8e1a38d5428920e0657ce299b20324d758"},
+    {file = "markupsafe-3.0.3-cp39-cp39-musllinux_1_2_riscv64.whl", hash = "sha256:591ae9f2a647529ca990bc681daebdd52c8791ff06c2bfa05b65163e28102ef2"},
+    {file = "markupsafe-3.0.3-cp39-cp39-musllinux_1_2_x86_64.whl", hash = "sha256:a320721ab5a1aba0a233739394eb907f8c8da5c98c9181d1161e77a0c8e36f2d"},
+    {file = "markupsafe-3.0.3-cp39-cp39-win32.whl", hash = "sha256:df2449253ef108a379b8b5d6b43f4b1a8e81a061d6537becd5582fba5f9196d7"},
+    {file = "markupsafe-3.0.3-cp39-cp39-win_amd64.whl", hash = "sha256:7c3fb7d25180895632e5d3148dbdc29ea38ccb7fd210aa27acbd1201a1902c6e"},
+    {file = "markupsafe-3.0.3-cp39-cp39-win_arm64.whl", hash = "sha256:38664109c14ffc9e7437e86b4dceb442b0096dfe3541d7864d9cbe1da4cf36c8"},
+    {file = "markupsafe-3.0.3.tar.gz", hash = "sha256:722695808f4b6457b320fdc131280796bdceb04ab50fe1795cd540799ebe1698"},
 ]
 
-[package.dependencies]
-contourpy = ">=1.0.1"
-cycler = ">=0.10"
-fonttools = ">=4.22.0"
-kiwisolver = ">=1.3.1"
-numpy = ">=1.23"
-packaging = ">=20.0"
-pillow = ">=8"
-pyparsing = ">=2.3.1"
-python-dateutil = ">=2.7"
-
-[package.extras]
-dev = ["meson-python (>=0.13.1,<0.17.0)", "pybind11 (>=2.13.2,!=2.13.3)", "setuptools (>=64)", "setuptools_scm (>=7)"]
-
 [[package]]
 name = "mdurl"
 version = "0.1.2"
@@ -2455,57 +2422,31 @@ files = [
     {file = "mdurl-0.1.2.tar.gz", hash = "sha256:bb413d29f5eea38f31dd4754dd7377d4465116fb207585f97bf925588687c1ba"},
 ]
 
-[[package]]
-name = "mmengine"
-version = "0.10.4"
-description = "Engine of OpenMMLab projects"
-optional = false
-python-versions = ">=3.7"
-groups = ["main"]
-files = [
-    {file = "mmengine-0.10.4-py3-none-any.whl", hash = "sha256:18b681ef36b00dc6f5cc1912031e82814dcc39b9f22f82cb63be0af321fcf7b5"},
-    {file = "mmengine-0.10.4.tar.gz", hash = "sha256:d3ee2148935826fd08c2541d3a23120805884341d0fafe85185327cdc9bf07b7"},
-]
-
-[package.dependencies]
-addict = "*"
-matplotlib = "*"
-numpy = "*"
-opencv-python = ">=3"
-pyyaml = "*"
-regex = {version = "*", markers = "sys_platform == \"win32\""}
-rich = "*"
-termcolor = "*"
-yapf = "*"
-
-[package.extras]
-all = ["addict", "aim (<=3.17.5) ; sys_platform != \"win32\"", "bitsandbytes", "clearml", "coverage", "dadaptation", "dvclive", "lion-pytorch", "lmdb", "matplotlib", "mlflow", "numpy", "parameterized", "pydantic (==1.10.9)", "pytest", "pyyaml", "regex ; sys_platform == \"win32\"", "rich", "termcolor", "transformers", "yapf"]
-tests = ["aim (<=3.17.5) ; sys_platform != \"win32\"", "bitsandbytes", "clearml", "coverage", "dadaptation", "dvclive", "lion-pytorch", "lmdb", "mlflow", "parameterized", "pydantic (==1.10.9)", "pytest", "transformers"]
-
 [[package]]
 name = "moviepy"
-version = "1.0.3"
+version = "2.2.1"
 description = "Video editing with Python"
 optional = false
 python-versions = "*"
 groups = ["main"]
 files = [
-    {file = "moviepy-1.0.3.tar.gz", hash = "sha256:2884e35d1788077db3ff89e763c5ba7bfddbd7ae9108c9bc809e7ba58fa433f5"},
+    {file = "moviepy-2.2.1-py3-none-any.whl", hash = "sha256:6b56803fec2ac54b557404126ac1160e65448e03798fa282bd23e8fab3795060"},
+    {file = "moviepy-2.2.1.tar.gz", hash = "sha256:c80cb56815ece94e5e3e2d361aa40070eeb30a09d23a24c4e684d03e16deacb1"},
 ]
 
 [package.dependencies]
-decorator = ">=4.0.2,<5.0"
-imageio = {version = ">=2.5,<3.0", markers = "python_version >= \"3.4\""}
-imageio_ffmpeg = {version = ">=0.2.0", markers = "python_version >= \"3.4\""}
-numpy = {version = ">=1.17.3", markers = "python_version > \"2.7\""}
+decorator = ">=4.0.2,<6.0"
+imageio = ">=2.5,<3.0"
+imageio_ffmpeg = ">=0.2.0"
+numpy = ">=1.25.0"
+pillow = ">=9.2.0,<12.0"
 proglog = "<=1.0.0"
-requests = ">=2.8.1,<3.0"
-tqdm = ">=4.11.2,<5.0"
+python-dotenv = ">=0.10"
 
 [package.extras]
-doc = ["Sphinx (>=1.5.2,<2.0)", "numpydoc (>=0.6.0,<1.0)", "pygame (>=1.9.3,<2.0) ; python_version < \"3.8\"", "sphinx_rtd_theme (>=0.1.10b0,<1.0)"]
-optional = ["matplotlib (>=2.0.0,<3.0) ; python_version >= \"3.4\"", "opencv-python (>=3.0,<4.0) ; python_version != \"2.7\"", "scikit-image (>=0.13.0,<1.0) ; python_version >= \"3.4\"", "scikit-learn ; python_version >= \"3.4\"", "scipy (>=0.19.0,<1.5) ; python_version != \"3.3\"", "youtube_dl"]
-test = ["coverage (<5.0)", "coveralls (>=1.1,<2.0)", "pytest (>=3.0.0,<4.0)", "pytest-cov (>=2.5.1,<3.0)", "requests (>=2.8.1,<3.0)"]
+doc = ["Sphinx (==6.*)", "numpydoc (<2.0)", "pydata-sphinx-theme (==0.13)", "sphinx_design"]
+lint = ["black (>=23.7.0)", "flake8 (>=6.0.0)", "flake8-absolute-import (>=1.0)", "flake8-docstrings (>=1.7.0)", "flake8-implicit-str-concat (==0.4.0)", "flake8-rst-docstrings (>=0.3)", "isort (>=5.12)", "pre-commit (>=3.3)"]
+test = ["coveralls (>=3.0,<4.0)", "pytest (>=3.0.0,<7.0.0)", "pytest-cov (>=2.5.1,<3.0)"]
 
 [[package]]
 name = "mpmath"
@@ -2513,7 +2454,7 @@ version = "1.3.0"
 description = "Python library for arbitrary-precision floating-point arithmetic"
 optional = false
 python-versions = "*"
-groups = ["main"]
+groups = ["main", "training"]
 files = [
     {file = "mpmath-1.3.0-py3-none-any.whl", hash = "sha256:a0b2b9fe80bbcd81a6647ff13108738cfb482d481d826cc0e02f5b35e5c88d2c"},
     {file = "mpmath-1.3.0.tar.gz", hash = "sha256:7a28eb2a9774d00c7bc92411c19a89209d5da7c4c9a9e227be8330a23a25b91f"},
@@ -2527,454 +2468,573 @@ tests = ["pytest (>=4.6)"]
 
 [[package]]
 name = "msgpack"
-version = "1.1.0"
+version = "1.2.1"
 description = "MessagePack serializer"
 optional = false
-python-versions = ">=3.8"
-groups = ["main"]
-files = [
-    {file = "msgpack-1.1.0-cp310-cp310-macosx_10_9_universal2.whl", hash = "sha256:7ad442d527a7e358a469faf43fda45aaf4ac3249c8310a82f0ccff9164e5dccd"},
-    {file = "msgpack-1.1.0-cp310-cp310-macosx_10_9_x86_64.whl", hash = "sha256:74bed8f63f8f14d75eec75cf3d04ad581da6b914001b474a5d3cd3372c8cc27d"},
-    {file = "msgpack-1.1.0-cp310-cp310-macosx_11_0_arm64.whl", hash = "sha256:914571a2a5b4e7606997e169f64ce53a8b1e06f2cf2c3a7273aa106236d43dd5"},
-    {file = "msgpack-1.1.0-cp310-cp310-manylinux_2_17_aarch64.manylinux2014_aarch64.whl", hash = "sha256:c921af52214dcbb75e6bdf6a661b23c3e6417f00c603dd2070bccb5c3ef499f5"},
-    {file = "msgpack-1.1.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl", hash = "sha256:d8ce0b22b890be5d252de90d0e0d119f363012027cf256185fc3d474c44b1b9e"},
-    {file = "msgpack-1.1.0-cp310-cp310-manylinux_2_5_i686.manylinux1_i686.manylinux_2_17_i686.manylinux2014_i686.whl", hash = "sha256:73322a6cc57fcee3c0c57c4463d828e9428275fb85a27aa2aa1a92fdc42afd7b"},
-    {file = "msgpack-1.1.0-cp310-cp310-musllinux_1_2_aarch64.whl", hash = "sha256:e1f3c3d21f7cf67bcf2da8e494d30a75e4cf60041d98b3f79875afb5b96f3a3f"},
-    {file = "msgpack-1.1.0-cp310-cp310-musllinux_1_2_i686.whl", hash = "sha256:64fc9068d701233effd61b19efb1485587560b66fe57b3e50d29c5d78e7fef68"},
-    {file = "msgpack-1.1.0-cp310-cp310-musllinux_1_2_x86_64.whl", hash = "sha256:42f754515e0f683f9c79210a5d1cad631ec3d06cea5172214d2176a42e67e19b"},
-    {file = "msgpack-1.1.0-cp310-cp310-win32.whl", hash = "sha256:3df7e6b05571b3814361e8464f9304c42d2196808e0119f55d0d3e62cd5ea044"},
-    {file = "msgpack-1.1.0-cp310-cp310-win_amd64.whl", hash = "sha256:685ec345eefc757a7c8af44a3032734a739f8c45d1b0ac45efc5d8977aa4720f"},
-    {file = "msgpack-1.1.0-cp311-cp311-macosx_10_9_universal2.whl", hash = "sha256:3d364a55082fb2a7416f6c63ae383fbd903adb5a6cf78c5b96cc6316dc1cedc7"},
-    {file = "msgpack-1.1.0-cp311-cp311-macosx_10_9_x86_64.whl", hash = "sha256:79ec007767b9b56860e0372085f8504db5d06bd6a327a335449508bbee9648fa"},
-    {file = "msgpack-1.1.0-cp311-cp311-macosx_11_0_arm64.whl", hash = "sha256:6ad622bf7756d5a497d5b6836e7fc3752e2dd6f4c648e24b1803f6048596f701"},
-    {file = "msgpack-1.1.0-cp311-cp311-manylinux_2_17_aarch64.manylinux2014_aarch64.whl", hash = "sha256:8e59bca908d9ca0de3dc8684f21ebf9a690fe47b6be93236eb40b99af28b6ea6"},
-    {file = "msgpack-1.1.0-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl", hash = "sha256:5e1da8f11a3dd397f0a32c76165cf0c4eb95b31013a94f6ecc0b280c05c91b59"},
-    {file = "msgpack-1.1.0-cp311-cp311-manylinux_2_5_i686.manylinux1_i686.manylinux_2_17_i686.manylinux2014_i686.whl", hash = "sha256:452aff037287acb1d70a804ffd022b21fa2bb7c46bee884dbc864cc9024128a0"},
-    {file = "msgpack-1.1.0-cp311-cp311-musllinux_1_2_aarch64.whl", hash = "sha256:8da4bf6d54ceed70e8861f833f83ce0814a2b72102e890cbdfe4b34764cdd66e"},
-    {file = "msgpack-1.1.0-cp311-cp311-musllinux_1_2_i686.whl", hash = "sha256:41c991beebf175faf352fb940bf2af9ad1fb77fd25f38d9142053914947cdbf6"},
-    {file = "msgpack-1.1.0-cp311-cp311-musllinux_1_2_x86_64.whl", hash = "sha256:a52a1f3a5af7ba1c9ace055b659189f6c669cf3657095b50f9602af3a3ba0fe5"},
-    {file = "msgpack-1.1.0-cp311-cp311-win32.whl", hash = "sha256:58638690ebd0a06427c5fe1a227bb6b8b9fdc2bd07701bec13c2335c82131a88"},
-    {file = "msgpack-1.1.0-cp311-cp311-win_amd64.whl", hash = "sha256:fd2906780f25c8ed5d7b323379f6138524ba793428db5d0e9d226d3fa6aa1788"},
-    {file = "msgpack-1.1.0-cp312-cp312-macosx_10_9_universal2.whl", hash = "sha256:d46cf9e3705ea9485687aa4001a76e44748b609d260af21c4ceea7f2212a501d"},
-    {file = "msgpack-1.1.0-cp312-cp312-macosx_10_9_x86_64.whl", hash = "sha256:5dbad74103df937e1325cc4bfeaf57713be0b4f15e1c2da43ccdd836393e2ea2"},
-    {file = "msgpack-1.1.0-cp312-cp312-macosx_11_0_arm64.whl", hash = "sha256:58dfc47f8b102da61e8949708b3eafc3504509a5728f8b4ddef84bd9e16ad420"},
-    {file = "msgpack-1.1.0-cp312-cp312-manylinux_2_17_aarch64.manylinux2014_aarch64.whl", hash = "sha256:4676e5be1b472909b2ee6356ff425ebedf5142427842aa06b4dfd5117d1ca8a2"},
-    {file = "msgpack-1.1.0-cp312-cp312-manylinux_2_17_x86_64.manylinux2014_x86_64.whl", hash = "sha256:17fb65dd0bec285907f68b15734a993ad3fc94332b5bb21b0435846228de1f39"},
-    {file = "msgpack-1.1.0-cp312-cp312-manylinux_2_5_i686.manylinux1_i686.manylinux_2_17_i686.manylinux2014_i686.whl", hash = "sha256:a51abd48c6d8ac89e0cfd4fe177c61481aca2d5e7ba42044fd218cfd8ea9899f"},
-    {file = "msgpack-1.1.0-cp312-cp312-musllinux_1_2_aarch64.whl", hash = "sha256:2137773500afa5494a61b1208619e3871f75f27b03bcfca7b3a7023284140247"},
-    {file = "msgpack-1.1.0-cp312-cp312-musllinux_1_2_i686.whl", hash = "sha256:398b713459fea610861c8a7b62a6fec1882759f308ae0795b5413ff6a160cf3c"},
-    {file = "msgpack-1.1.0-cp312-cp312-musllinux_1_2_x86_64.whl", hash = "sha256:06f5fd2f6bb2a7914922d935d3b8bb4a7fff3a9a91cfce6d06c13bc42bec975b"},
-    {file = "msgpack-1.1.0-cp312-cp312-win32.whl", hash = "sha256:ad33e8400e4ec17ba782f7b9cf868977d867ed784a1f5f2ab46e7ba53b6e1e1b"},
-    {file = "msgpack-1.1.0-cp312-cp312-win_amd64.whl", hash = "sha256:115a7af8ee9e8cddc10f87636767857e7e3717b7a2e97379dc2054712693e90f"},
-    {file = "msgpack-1.1.0-cp313-cp313-macosx_10_13_universal2.whl", hash = "sha256:071603e2f0771c45ad9bc65719291c568d4edf120b44eb36324dcb02a13bfddf"},
-    {file = "msgpack-1.1.0-cp313-cp313-macosx_10_13_x86_64.whl", hash = "sha256:0f92a83b84e7c0749e3f12821949d79485971f087604178026085f60ce109330"},
-    {file = "msgpack-1.1.0-cp313-cp313-macosx_11_0_arm64.whl", hash = "sha256:4a1964df7b81285d00a84da4e70cb1383f2e665e0f1f2a7027e683956d04b734"},
-    {file = "msgpack-1.1.0-cp313-cp313-manylinux_2_17_aarch64.manylinux2014_aarch64.whl", hash = "sha256:59caf6a4ed0d164055ccff8fe31eddc0ebc07cf7326a2aaa0dbf7a4001cd823e"},
-    {file = "msgpack-1.1.0-cp313-cp313-manylinux_2_17_x86_64.manylinux2014_x86_64.whl", hash = "sha256:0907e1a7119b337971a689153665764adc34e89175f9a34793307d9def08e6ca"},
-    {file = "msgpack-1.1.0-cp313-cp313-manylinux_2_5_i686.manylinux1_i686.manylinux_2_17_i686.manylinux2014_i686.whl", hash = "sha256:65553c9b6da8166e819a6aa90ad15288599b340f91d18f60b2061f402b9a4915"},
-    {file = "msgpack-1.1.0-cp313-cp313-musllinux_1_2_aarch64.whl", hash = "sha256:7a946a8992941fea80ed4beae6bff74ffd7ee129a90b4dd5cf9c476a30e9708d"},
-    {file = "msgpack-1.1.0-cp313-cp313-musllinux_1_2_i686.whl", hash = "sha256:4b51405e36e075193bc051315dbf29168d6141ae2500ba8cd80a522964e31434"},
-    {file = "msgpack-1.1.0-cp313-cp313-musllinux_1_2_x86_64.whl", hash = "sha256:b4c01941fd2ff87c2a934ee6055bda4ed353a7846b8d4f341c428109e9fcde8c"},
-    {file = "msgpack-1.1.0-cp313-cp313-win32.whl", hash = "sha256:7c9a35ce2c2573bada929e0b7b3576de647b0defbd25f5139dcdaba0ae35a4cc"},
-    {file = "msgpack-1.1.0-cp313-cp313-win_amd64.whl", hash = "sha256:bce7d9e614a04d0883af0b3d4d501171fbfca038f12c77fa838d9f198147a23f"},
-    {file = "msgpack-1.1.0-cp38-cp38-manylinux_2_17_aarch64.manylinux2014_aarch64.whl", hash = "sha256:c40ffa9a15d74e05ba1fe2681ea33b9caffd886675412612d93ab17b58ea2fec"},
-    {file = "msgpack-1.1.0-cp38-cp38-manylinux_2_17_x86_64.manylinux2014_x86_64.whl", hash = "sha256:f1ba6136e650898082d9d5a5217d5906d1e138024f836ff48691784bbe1adf96"},
-    {file = "msgpack-1.1.0-cp38-cp38-manylinux_2_5_i686.manylinux1_i686.manylinux_2_17_i686.manylinux2014_i686.whl", hash = "sha256:e0856a2b7e8dcb874be44fea031d22e5b3a19121be92a1e098f46068a11b0870"},
-    {file = "msgpack-1.1.0-cp38-cp38-musllinux_1_2_aarch64.whl", hash = "sha256:471e27a5787a2e3f974ba023f9e265a8c7cfd373632247deb225617e3100a3c7"},
-    {file = "msgpack-1.1.0-cp38-cp38-musllinux_1_2_i686.whl", hash = "sha256:646afc8102935a388ffc3914b336d22d1c2d6209c773f3eb5dd4d6d3b6f8c1cb"},
-    {file = "msgpack-1.1.0-cp38-cp38-musllinux_1_2_x86_64.whl", hash = "sha256:13599f8829cfbe0158f6456374e9eea9f44eee08076291771d8ae93eda56607f"},
-    {file = "msgpack-1.1.0-cp38-cp38-win32.whl", hash = "sha256:8a84efb768fb968381e525eeeb3d92857e4985aacc39f3c47ffd00eb4509315b"},
-    {file = "msgpack-1.1.0-cp38-cp38-win_amd64.whl", hash = "sha256:879a7b7b0ad82481c52d3c7eb99bf6f0645dbdec5134a4bddbd16f3506947feb"},
-    {file = "msgpack-1.1.0-cp39-cp39-macosx_10_9_universal2.whl", hash = "sha256:53258eeb7a80fc46f62fd59c876957a2d0e15e6449a9e71842b6d24419d88ca1"},
-    {file = "msgpack-1.1.0-cp39-cp39-macosx_10_9_x86_64.whl", hash = "sha256:7e7b853bbc44fb03fbdba34feb4bd414322180135e2cb5164f20ce1c9795ee48"},
-    {file = "msgpack-1.1.0-cp39-cp39-macosx_11_0_arm64.whl", hash = "sha256:f3e9b4936df53b970513eac1758f3882c88658a220b58dcc1e39606dccaaf01c"},
-    {file = "msgpack-1.1.0-cp39-cp39-manylinux_2_17_aarch64.manylinux2014_aarch64.whl", hash = "sha256:46c34e99110762a76e3911fc923222472c9d681f1094096ac4102c18319e6468"},
-    {file = "msgpack-1.1.0-cp39-cp39-manylinux_2_17_x86_64.manylinux2014_x86_64.whl", hash = "sha256:8a706d1e74dd3dea05cb54580d9bd8b2880e9264856ce5068027eed09680aa74"},
-    {file = "msgpack-1.1.0-cp39-cp39-manylinux_2_5_i686.manylinux1_i686.manylinux_2_17_i686.manylinux2014_i686.whl", hash = "sha256:534480ee5690ab3cbed89d4c8971a5c631b69a8c0883ecfea96c19118510c846"},
-    {file = "msgpack-1.1.0-cp39-cp39-musllinux_1_2_aarch64.whl", hash = "sha256:8cf9e8c3a2153934a23ac160cc4cba0ec035f6867c8013cc6077a79823370346"},
-    {file = "msgpack-1.1.0-cp39-cp39-musllinux_1_2_i686.whl", hash = "sha256:3180065ec2abbe13a4ad37688b61b99d7f9e012a535b930e0e683ad6bc30155b"},
-    {file = "msgpack-1.1.0-cp39-cp39-musllinux_1_2_x86_64.whl", hash = "sha256:c5a91481a3cc573ac8c0d9aace09345d989dc4a0202b7fcb312c88c26d4e71a8"},
-    {file = "msgpack-1.1.0-cp39-cp39-win32.whl", hash = "sha256:f80bc7d47f76089633763f952e67f8214cb7b3ee6bfa489b3cb6a84cfac114cd"},
-    {file = "msgpack-1.1.0-cp39-cp39-win_amd64.whl", hash = "sha256:4d1b7ff2d6146e16e8bd665ac726a89c74163ef8cd39fa8c1087d4e52d3a2325"},
-    {file = "msgpack-1.1.0.tar.gz", hash = "sha256:dd432ccc2c72b914e4cb77afce64aab761c1137cc698be3984eee260bcb2896e"},
+python-versions = ">=3.10"
+groups = ["training"]
+files = [
+    {file = "msgpack-1.2.1-cp310-cp310-macosx_10_9_x86_64.whl", hash = "sha256:8c7b398c56ff125feae96c2737abfec5595f1fa0aa186df60c56040b8accb95c"},
+    {file = "msgpack-1.2.1-cp310-cp310-macosx_11_0_arm64.whl", hash = "sha256:1548006a91aa93c5da81f3bdcebc1a0d10cea2d25969754fbe848da622b2b895"},
+    {file = "msgpack-1.2.1-cp310-cp310-manylinux2014_aarch64.manylinux_2_17_aarch64.manylinux_2_28_aarch64.whl", hash = "sha256:1dabedcd0f23559f3596428c6589c1cd8c6eaed3a0d720795b07b0225d769203"},
+    {file = "msgpack-1.2.1-cp310-cp310-manylinux2014_x86_64.manylinux_2_17_x86_64.manylinux_2_28_x86_64.whl", hash = "sha256:83efa1c898e0fc5380fc0cabbf75164c52e3b5cbb45973710d75821928380c73"},
+    {file = "msgpack-1.2.1-cp310-cp310-manylinux_2_31_riscv64.manylinux_2_39_riscv64.whl", hash = "sha256:01e2dd6c9b19d333a00282330cc8a73d38d8dabc306dc5b42cd668c3ac82e833"},
+    {file = "msgpack-1.2.1-cp310-cp310-musllinux_1_2_aarch64.whl", hash = "sha256:350cb813d0af6e65d2f7ef0d729f7ff5be5a8bce03665892f43e5883d4ecc1b8"},
+    {file = "msgpack-1.2.1-cp310-cp310-musllinux_1_2_riscv64.whl", hash = "sha256:ee1d9ed27d0497b848923746cf762ed2e7db24f4be7eec8e5cbe8c766aa707b7"},
+    {file = "msgpack-1.2.1-cp310-cp310-musllinux_1_2_x86_64.whl", hash = "sha256:633727297ed063441fd1cda2288865487f33ad14eeb8831afb5f0c396a62cfce"},
+    {file = "msgpack-1.2.1-cp310-cp310-win32.whl", hash = "sha256:298872ecf9e61950f1c6af4ca969b859ee91783bb920ef6e6172697d0c8aad74"},
+    {file = "msgpack-1.2.1-cp310-cp310-win_amd64.whl", hash = "sha256:2ff164c1b0bcb740b073b99e945234d0212852fa378e44a208c425379140dbeb"},
+    {file = "msgpack-1.2.1-cp311-cp311-macosx_10_9_x86_64.whl", hash = "sha256:29a3f6e9667868429d8240dfd063ea5ffdc1321c13d783aa23827a38de0dcb22"},
+    {file = "msgpack-1.2.1-cp311-cp311-macosx_11_0_arm64.whl", hash = "sha256:aded5bdf32609dc7987a49bbbd15a8ef096193f96dd8bbeb791de729e650acf5"},
+    {file = "msgpack-1.2.1-cp311-cp311-manylinux2014_aarch64.manylinux_2_17_aarch64.manylinux_2_28_aarch64.whl", hash = "sha256:146ee4e9ce80b365c6d4c47073da9da7bcec473e58194ceee5dd7620ace77e06"},
+    {file = "msgpack-1.2.1-cp311-cp311-manylinux2014_x86_64.manylinux_2_17_x86_64.manylinux_2_28_x86_64.whl", hash = "sha256:a28d076ca7c82b9c8728ad90b7147489449557038bed50e4241eb832395169b4"},
+    {file = "msgpack-1.2.1-cp311-cp311-manylinux_2_31_riscv64.manylinux_2_39_riscv64.whl", hash = "sha256:7d31c0ac0c640f877804c67cb2bc9f4e23dc2db97e96c2e67fa27d38283b41f8"},
+    {file = "msgpack-1.2.1-cp311-cp311-musllinux_1_2_aarch64.whl", hash = "sha256:8ff92d7feeaf5bc26c51495b69e2f99ed97ab79346fb6555f44be7dd2ac6503b"},
+    {file = "msgpack-1.2.1-cp311-cp311-musllinux_1_2_riscv64.whl", hash = "sha256:779197a6513bab3c3632265e3d0f7cb3227e62510841a6f34f1eaa37efbb345e"},
+    {file = "msgpack-1.2.1-cp311-cp311-musllinux_1_2_x86_64.whl", hash = "sha256:67f6dd22fa72a93752643f07889796d62739a13415ee630169a8ce764f86cf9f"},
+    {file = "msgpack-1.2.1-cp311-cp311-win32.whl", hash = "sha256:91054a783328e0ea7954b8771095705c8d2243b814743fbaadf14552c9c52c5d"},
+    {file = "msgpack-1.2.1-cp311-cp311-win_amd64.whl", hash = "sha256:2eda0b7ebb1283a98d3e4492ac933c8af6aff59fd3df1c3ed024f536af4b1dc8"},
+    {file = "msgpack-1.2.1-cp311-cp311-win_arm64.whl", hash = "sha256:6ee967f7c7e1df2890c671ff2ee51a28ded0efc95da3e507176dee881ce36c66"},
+    {file = "msgpack-1.2.1-cp312-cp312-macosx_10_13_x86_64.whl", hash = "sha256:2ef59c659f289eddf8aa6623823f19fa2f40a4029266889eac7a2505dd210c35"},
+    {file = "msgpack-1.2.1-cp312-cp312-macosx_11_0_arm64.whl", hash = "sha256:d3567748a5107cb40cdf66a275430c2f87c07777698f4bfd25c35f44d533258c"},
+    {file = "msgpack-1.2.1-cp312-cp312-manylinux2014_aarch64.manylinux_2_17_aarch64.manylinux_2_28_aarch64.whl", hash = "sha256:60926b75d00c8e816ef98f3034f484a8bc64242d66839cef4cf7e503142316a0"},
+    {file = "msgpack-1.2.1-cp312-cp312-manylinux2014_x86_64.manylinux_2_17_x86_64.manylinux_2_28_x86_64.whl", hash = "sha256:020e881a764b20d8d7ca1a54fc01b8175519d108e3c3f194fddc200bda95951a"},
+    {file = "msgpack-1.2.1-cp312-cp312-manylinux_2_31_riscv64.manylinux_2_39_riscv64.whl", hash = "sha256:4202c74688ca06591f78cb18988228bd4cca2cc75d57b60008372892d2f1e6e6"},
+    {file = "msgpack-1.2.1-cp312-cp312-musllinux_1_2_aarch64.whl", hash = "sha256:8b267ce94efb76fbd1b3373511420074ee3187f0f7811bf394531de13294735a"},
+    {file = "msgpack-1.2.1-cp312-cp312-musllinux_1_2_riscv64.whl", hash = "sha256:e4f1d0f8f98ade9634e01fb704a408f9336c0a8f1117b369f5db83dc7551d8b1"},
+    {file = "msgpack-1.2.1-cp312-cp312-musllinux_1_2_x86_64.whl", hash = "sha256:f02cf17a6ca1abe29b5f980644f7551f94d71f2011509b26d8625ce038f0df64"},
+    {file = "msgpack-1.2.1-cp312-cp312-win32.whl", hash = "sha256:0c0d9802354507bcba62af19c17918e3eb437cc25e6f50657d511b5856a77aac"},
+    {file = "msgpack-1.2.1-cp312-cp312-win_amd64.whl", hash = "sha256:5c24aa15d5963051e1a5c62b12c50cd705992502b5ec1f3bece6046f33c9fc24"},
+    {file = "msgpack-1.2.1-cp312-cp312-win_arm64.whl", hash = "sha256:4227224aaec8f7fbcbfbd4272319347b2bb4030366502600f8c45588c5187b07"},
+    {file = "msgpack-1.2.1-cp313-cp313-macosx_10_13_x86_64.whl", hash = "sha256:0a70e3cf2804a300d921bb0940426e35f4e489a23adfb77a808892241db0a064"},
+    {file = "msgpack-1.2.1-cp313-cp313-macosx_11_0_arm64.whl", hash = "sha256:491cc39455ca765fad51fb451bf2915eb2cf41192ab5801ce8d67c1d614fe056"},
+    {file = "msgpack-1.2.1-cp313-cp313-manylinux2014_aarch64.manylinux_2_17_aarch64.manylinux_2_28_aarch64.whl", hash = "sha256:f310233ef7fb9c14e201c93639fe5f5260b005f56f0b29048e999c30935596cc"},
+    {file = "msgpack-1.2.1-cp313-cp313-manylinux2014_x86_64.manylinux_2_17_x86_64.manylinux_2_28_x86_64.whl", hash = "sha256:787c9bebb5833e8f6fc8abca3c0597683d8d87f56a8842b6b89c75a5f3176e2d"},
+    {file = "msgpack-1.2.1-cp313-cp313-manylinux_2_31_riscv64.manylinux_2_39_riscv64.whl", hash = "sha256:dc871b997a9370d855b7394465f2f350e847a5b806dd38dcc9c989e7d87da155"},
+    {file = "msgpack-1.2.1-cp313-cp313-musllinux_1_2_aarch64.whl", hash = "sha256:85f57e960d877f2977f6430896191b04a21f8901b3b4baf2e4604329f4db5402"},
+    {file = "msgpack-1.2.1-cp313-cp313-musllinux_1_2_riscv64.whl", hash = "sha256:1233ee2dd0cefba127583de50ea654677277047d238303521db35def3d7b2e7c"},
+    {file = "msgpack-1.2.1-cp313-cp313-musllinux_1_2_x86_64.whl", hash = "sha256:e3dc2feb0876209d9c38aa56cb1de169bd6c4348f1aa48271f241226590993e6"},
+    {file = "msgpack-1.2.1-cp313-cp313-win32.whl", hash = "sha256:6d09badf350af2be9d189184e04e64cf54ad93569ab3d96fca58bd3e84aad707"},
+    {file = "msgpack-1.2.1-cp313-cp313-win_amd64.whl", hash = "sha256:33f14fba63278b714efe6ad07e50ea5f03d91537aa6a1c5f1ceca4cf44013ca9"},
+    {file = "msgpack-1.2.1-cp313-cp313-win_arm64.whl", hash = "sha256:afc5febcd4c99effbc02b528e49d6fd0760b2b7d48c05239e345a5fa6e743d9a"},
+    {file = "msgpack-1.2.1-cp314-cp314-macosx_10_15_x86_64.whl", hash = "sha256:05f340e47e7e47d2da8db9b53e1bb1d294369e9ef45a747441309f6650b8351d"},
+    {file = "msgpack-1.2.1-cp314-cp314-macosx_11_0_arm64.whl", hash = "sha256:810b916696c86ef0deb3b74588480224df4c1b071136c34183e4a2a4284d7ac7"},
+    {file = "msgpack-1.2.1-cp314-cp314-manylinux2014_aarch64.manylinux_2_17_aarch64.manylinux_2_28_aarch64.whl", hash = "sha256:ca0dacff965c47afdc3749a8469d7302a8f801d6a28758d55120d75e66ce6889"},
+    {file = "msgpack-1.2.1-cp314-cp314-manylinux2014_x86_64.manylinux_2_17_x86_64.manylinux_2_28_x86_64.whl", hash = "sha256:0e2bf9280bceb5efca998435904b5d3e9fdbcc11d90dc9df30aec7973252b720"},
+    {file = "msgpack-1.2.1-cp314-cp314-manylinux_2_31_riscv64.manylinux_2_39_riscv64.whl", hash = "sha256:aa6c4be5d1c02a42b066ca6ddb71adf36432868fdcdb6ee87e634e86e0674190"},
+    {file = "msgpack-1.2.1-cp314-cp314-musllinux_1_2_aarch64.whl", hash = "sha256:ec0e675d59150a6269ddc9139087c722292664a37d071a849c05c473350f1f2d"},
+    {file = "msgpack-1.2.1-cp314-cp314-musllinux_1_2_riscv64.whl", hash = "sha256:dd3bfe82d53edfe4b7fc9a7ec9761e23a7a5b1dac22264505af428253c29ed24"},
+    {file = "msgpack-1.2.1-cp314-cp314-musllinux_1_2_x86_64.whl", hash = "sha256:5ad5467fc3f68b5468e06c5f788d712e9f8ffc8b0cd1bcb160c105c1ee92dae7"},
+    {file = "msgpack-1.2.1-cp314-cp314-win32.whl", hash = "sha256:98b58bdb89c46190e4609bb36abe17c6d4105ad13f9c5f8f6f64d320f8ced3fb"},
+    {file = "msgpack-1.2.1-cp314-cp314-win_amd64.whl", hash = "sha256:74847557e28ce71bd3c438a447ca90e4b507e997ddbdef8a12a7b283b86c156b"},
+    {file = "msgpack-1.2.1-cp314-cp314-win_arm64.whl", hash = "sha256:b50b727bd652bdc37d950336c848ef20ec54a4cafc38dce19b1cd86ad625d0f7"},
+    {file = "msgpack-1.2.1-cp314-cp314t-macosx_10_15_x86_64.whl", hash = "sha256:8d00f177ca88a77c1cf848d204a38f249751650b601cb6532acc68805d8a8273"},
+    {file = "msgpack-1.2.1-cp314-cp314t-macosx_11_0_arm64.whl", hash = "sha256:5bb9c386f0a329c035ddbab4b72d1028bf9627add8dda41070288563d57ed1b1"},
+    {file = "msgpack-1.2.1-cp314-cp314t-manylinux2014_aarch64.manylinux_2_17_aarch64.manylinux_2_28_aarch64.whl", hash = "sha256:20466cca18c49c7292a8984bc15d65857b171e7264bdcb5f96baf8be238791fc"},
+    {file = "msgpack-1.2.1-cp314-cp314t-manylinux2014_x86_64.manylinux_2_17_x86_64.manylinux_2_28_x86_64.whl", hash = "sha256:196300e7e5d6e74d50f1607ab9c06c4a1484c383cd22defd727902591f7e8dde"},
+    {file = "msgpack-1.2.1-cp314-cp314t-manylinux_2_31_riscv64.manylinux_2_39_riscv64.whl", hash = "sha256:575957e79cd51903a4e8495a242442949641e08f1efd5197b43bebd3ea7682b4"},
+    {file = "msgpack-1.2.1-cp314-cp314t-musllinux_1_2_aarch64.whl", hash = "sha256:8c2ed1e48cc0f460bf3c7780e7137ff21a4e18433451916f2442c1b21036cd7d"},
+    {file = "msgpack-1.2.1-cp314-cp314t-musllinux_1_2_riscv64.whl", hash = "sha256:5f6277e5f783c36786a145e0247fc189a03f35f84b251646e53592d2bc12b355"},
+    {file = "msgpack-1.2.1-cp314-cp314t-musllinux_1_2_x86_64.whl", hash = "sha256:f9389552ecf4784886345ead0647e4edc96bee37cbab05b75540f542f766c48c"},
+    {file = "msgpack-1.2.1-cp314-cp314t-win32.whl", hash = "sha256:c1c79a604a2969a868a78b6ebd27a887e00c624f14f66b3038e0590cb23332d1"},
+    {file = "msgpack-1.2.1-cp314-cp314t-win_amd64.whl", hash = "sha256:f12038a35fabd52e56a3547bab42401af49a45caa6dd00b34c44de235bc93ee2"},
+    {file = "msgpack-1.2.1-cp314-cp314t-win_arm64.whl", hash = "sha256:0adcf06ffde0777c0e1a9b771a2b1c4226ba1bbf748c8efcc02fcdeca3299107"},
+    {file = "msgpack-1.2.1.tar.gz", hash = "sha256:04c721c2c7448767e9e3f2520a475663d8ee0f09c31890f6d2bd70fd636a9647"},
 ]
 
 [[package]]
 name = "multidict"
-version = "6.1.0"
+version = "6.7.1"
 description = "multidict implementation"
 optional = false
-python-versions = ">=3.8"
-groups = ["main"]
-files = [
-    {file = "multidict-6.1.0-cp310-cp310-macosx_10_9_universal2.whl", hash = "sha256:3380252550e372e8511d49481bd836264c009adb826b23fefcc5dd3c69692f60"},
-    {file = "multidict-6.1.0-cp310-cp310-macosx_10_9_x86_64.whl", hash = "sha256:99f826cbf970077383d7de805c0681799491cb939c25450b9b5b3ced03ca99f1"},
-    {file = "multidict-6.1.0-cp310-cp310-macosx_11_0_arm64.whl", hash = "sha256:a114d03b938376557927ab23f1e950827c3b893ccb94b62fd95d430fd0e5cf53"},
-    {file = "multidict-6.1.0-cp310-cp310-manylinux_2_17_aarch64.manylinux2014_aarch64.whl", hash = "sha256:b1c416351ee6271b2f49b56ad7f308072f6f44b37118d69c2cad94f3fa8a40d5"},
-    {file = "multidict-6.1.0-cp310-cp310-manylinux_2_17_ppc64le.manylinux2014_ppc64le.whl", hash = "sha256:6b5d83030255983181005e6cfbac1617ce9746b219bc2aad52201ad121226581"},
-    {file = "multidict-6.1.0-cp310-cp310-manylinux_2_17_s390x.manylinux2014_s390x.whl", hash = "sha256:3e97b5e938051226dc025ec80980c285b053ffb1e25a3db2a3aa3bc046bf7f56"},
-    {file = "multidict-6.1.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl", hash = "sha256:d618649d4e70ac6efcbba75be98b26ef5078faad23592f9b51ca492953012429"},
-    {file = "multidict-6.1.0-cp310-cp310-manylinux_2_5_i686.manylinux1_i686.manylinux_2_17_i686.manylinux2014_i686.whl", hash = "sha256:10524ebd769727ac77ef2278390fb0068d83f3acb7773792a5080f2b0abf7748"},
-    {file = "multidict-6.1.0-cp310-cp310-musllinux_1_2_aarch64.whl", hash = "sha256:ff3827aef427c89a25cc96ded1759271a93603aba9fb977a6d264648ebf989db"},
-    {file = "multidict-6.1.0-cp310-cp310-musllinux_1_2_i686.whl", hash = "sha256:06809f4f0f7ab7ea2cabf9caca7d79c22c0758b58a71f9d32943ae13c7ace056"},
-    {file = "multidict-6.1.0-cp310-cp310-musllinux_1_2_ppc64le.whl", hash = "sha256:f179dee3b863ab1c59580ff60f9d99f632f34ccb38bf67a33ec6b3ecadd0fd76"},
-    {file = "multidict-6.1.0-cp310-cp310-musllinux_1_2_s390x.whl", hash = "sha256:aaed8b0562be4a0876ee3b6946f6869b7bcdb571a5d1496683505944e268b160"},
-    {file = "multidict-6.1.0-cp310-cp310-musllinux_1_2_x86_64.whl", hash = "sha256:3c8b88a2ccf5493b6c8da9076fb151ba106960a2df90c2633f342f120751a9e7"},
-    {file = "multidict-6.1.0-cp310-cp310-win32.whl", hash = "sha256:4a9cb68166a34117d6646c0023c7b759bf197bee5ad4272f420a0141d7eb03a0"},
-    {file = "multidict-6.1.0-cp310-cp310-win_amd64.whl", hash = "sha256:20b9b5fbe0b88d0bdef2012ef7dee867f874b72528cf1d08f1d59b0e3850129d"},
-    {file = "multidict-6.1.0-cp311-cp311-macosx_10_9_universal2.whl", hash = "sha256:3efe2c2cb5763f2f1b275ad2bf7a287d3f7ebbef35648a9726e3b69284a4f3d6"},
-    {file = "multidict-6.1.0-cp311-cp311-macosx_10_9_x86_64.whl", hash = "sha256:c7053d3b0353a8b9de430a4f4b4268ac9a4fb3481af37dfe49825bf45ca24156"},
-    {file = "multidict-6.1.0-cp311-cp311-macosx_11_0_arm64.whl", hash = "sha256:27e5fc84ccef8dfaabb09d82b7d179c7cf1a3fbc8a966f8274fcb4ab2eb4cadb"},
-    {file = "multidict-6.1.0-cp311-cp311-manylinux_2_17_aarch64.manylinux2014_aarch64.whl", hash = "sha256:0e2b90b43e696f25c62656389d32236e049568b39320e2735d51f08fd362761b"},
-    {file = "multidict-6.1.0-cp311-cp311-manylinux_2_17_ppc64le.manylinux2014_ppc64le.whl", hash = "sha256:d83a047959d38a7ff552ff94be767b7fd79b831ad1cd9920662db05fec24fe72"},
-    {file = "multidict-6.1.0-cp311-cp311-manylinux_2_17_s390x.manylinux2014_s390x.whl", hash = "sha256:d1a9dd711d0877a1ece3d2e4fea11a8e75741ca21954c919406b44e7cf971304"},
-    {file = "multidict-6.1.0-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl", hash = "sha256:ec2abea24d98246b94913b76a125e855eb5c434f7c46546046372fe60f666351"},
-    {file = "multidict-6.1.0-cp311-cp311-manylinux_2_5_i686.manylinux1_i686.manylinux_2_17_i686.manylinux2014_i686.whl", hash = "sha256:4867cafcbc6585e4b678876c489b9273b13e9fff9f6d6d66add5e15d11d926cb"},
-    {file = "multidict-6.1.0-cp311-cp311-musllinux_1_2_aarch64.whl", hash = "sha256:5b48204e8d955c47c55b72779802b219a39acc3ee3d0116d5080c388970b76e3"},
-    {file = "multidict-6.1.0-cp311-cp311-musllinux_1_2_i686.whl", hash = "sha256:d8fff389528cad1618fb4b26b95550327495462cd745d879a8c7c2115248e399"},
-    {file = "multidict-6.1.0-cp311-cp311-musllinux_1_2_ppc64le.whl", hash = "sha256:a7a9541cd308eed5e30318430a9c74d2132e9a8cb46b901326272d780bf2d423"},
-    {file = "multidict-6.1.0-cp311-cp311-musllinux_1_2_s390x.whl", hash = "sha256:da1758c76f50c39a2efd5e9859ce7d776317eb1dd34317c8152ac9251fc574a3"},
-    {file = "multidict-6.1.0-cp311-cp311-musllinux_1_2_x86_64.whl", hash = "sha256:c943a53e9186688b45b323602298ab727d8865d8c9ee0b17f8d62d14b56f0753"},
-    {file = "multidict-6.1.0-cp311-cp311-win32.whl", hash = "sha256:90f8717cb649eea3504091e640a1b8568faad18bd4b9fcd692853a04475a4b80"},
-    {file = "multidict-6.1.0-cp311-cp311-win_amd64.whl", hash = "sha256:82176036e65644a6cc5bd619f65f6f19781e8ec2e5330f51aa9ada7504cc1926"},
-    {file = "multidict-6.1.0-cp312-cp312-macosx_10_9_universal2.whl", hash = "sha256:b04772ed465fa3cc947db808fa306d79b43e896beb677a56fb2347ca1a49c1fa"},
-    {file = "multidict-6.1.0-cp312-cp312-macosx_10_9_x86_64.whl", hash = "sha256:6180c0ae073bddeb5a97a38c03f30c233e0a4d39cd86166251617d1bbd0af436"},
-    {file = "multidict-6.1.0-cp312-cp312-macosx_11_0_arm64.whl", hash = "sha256:071120490b47aa997cca00666923a83f02c7fbb44f71cf7f136df753f7fa8761"},
-    {file = "multidict-6.1.0-cp312-cp312-manylinux_2_17_aarch64.manylinux2014_aarch64.whl", hash = "sha256:50b3a2710631848991d0bf7de077502e8994c804bb805aeb2925a981de58ec2e"},
-    {file = "multidict-6.1.0-cp312-cp312-manylinux_2_17_ppc64le.manylinux2014_ppc64le.whl", hash = "sha256:b58c621844d55e71c1b7f7c498ce5aa6985d743a1a59034c57a905b3f153c1ef"},
-    {file = "multidict-6.1.0-cp312-cp312-manylinux_2_17_s390x.manylinux2014_s390x.whl", hash = "sha256:55b6d90641869892caa9ca42ff913f7ff1c5ece06474fbd32fb2cf6834726c95"},
-    {file = "multidict-6.1.0-cp312-cp312-manylinux_2_17_x86_64.manylinux2014_x86_64.whl", hash = "sha256:4b820514bfc0b98a30e3d85462084779900347e4d49267f747ff54060cc33925"},
-    {file = "multidict-6.1.0-cp312-cp312-manylinux_2_5_i686.manylinux1_i686.manylinux_2_17_i686.manylinux2014_i686.whl", hash = "sha256:10a9b09aba0c5b48c53761b7c720aaaf7cf236d5fe394cd399c7ba662d5f9966"},
-    {file = "multidict-6.1.0-cp312-cp312-musllinux_1_2_aarch64.whl", hash = "sha256:1e16bf3e5fc9f44632affb159d30a437bfe286ce9e02754759be5536b169b305"},
-    {file = "multidict-6.1.0-cp312-cp312-musllinux_1_2_i686.whl", hash = "sha256:76f364861c3bfc98cbbcbd402d83454ed9e01a5224bb3a28bf70002a230f73e2"},
-    {file = "multidict-6.1.0-cp312-cp312-musllinux_1_2_ppc64le.whl", hash = "sha256:820c661588bd01a0aa62a1283f20d2be4281b086f80dad9e955e690c75fb54a2"},
-    {file = "multidict-6.1.0-cp312-cp312-musllinux_1_2_s390x.whl", hash = "sha256:0e5f362e895bc5b9e67fe6e4ded2492d8124bdf817827f33c5b46c2fe3ffaca6"},
-    {file = "multidict-6.1.0-cp312-cp312-musllinux_1_2_x86_64.whl", hash = "sha256:3ec660d19bbc671e3a6443325f07263be452c453ac9e512f5eb935e7d4ac28b3"},
-    {file = "multidict-6.1.0-cp312-cp312-win32.whl", hash = "sha256:58130ecf8f7b8112cdb841486404f1282b9c86ccb30d3519faf301b2e5659133"},
-    {file = "multidict-6.1.0-cp312-cp312-win_amd64.whl", hash = "sha256:188215fc0aafb8e03341995e7c4797860181562380f81ed0a87ff455b70bf1f1"},
-    {file = "multidict-6.1.0-cp313-cp313-macosx_10_13_universal2.whl", hash = "sha256:d569388c381b24671589335a3be6e1d45546c2988c2ebe30fdcada8457a31008"},
-    {file = "multidict-6.1.0-cp313-cp313-macosx_10_13_x86_64.whl", hash = "sha256:052e10d2d37810b99cc170b785945421141bf7bb7d2f8799d431e7db229c385f"},
-    {file = "multidict-6.1.0-cp313-cp313-macosx_11_0_arm64.whl", hash = "sha256:f90c822a402cb865e396a504f9fc8173ef34212a342d92e362ca498cad308e28"},
-    {file = "multidict-6.1.0-cp313-cp313-manylinux_2_17_aarch64.manylinux2014_aarch64.whl", hash = "sha256:b225d95519a5bf73860323e633a664b0d85ad3d5bede6d30d95b35d4dfe8805b"},
-    {file = "multidict-6.1.0-cp313-cp313-manylinux_2_17_ppc64le.manylinux2014_ppc64le.whl", hash = "sha256:23bfd518810af7de1116313ebd9092cb9aa629beb12f6ed631ad53356ed6b86c"},
-    {file = "multidict-6.1.0-cp313-cp313-manylinux_2_17_s390x.manylinux2014_s390x.whl", hash = "sha256:5c09fcfdccdd0b57867577b719c69e347a436b86cd83747f179dbf0cc0d4c1f3"},
-    {file = "multidict-6.1.0-cp313-cp313-manylinux_2_17_x86_64.manylinux2014_x86_64.whl", hash = "sha256:bf6bea52ec97e95560af5ae576bdac3aa3aae0b6758c6efa115236d9e07dae44"},
-    {file = "multidict-6.1.0-cp313-cp313-manylinux_2_5_i686.manylinux1_i686.manylinux_2_17_i686.manylinux2014_i686.whl", hash = "sha256:57feec87371dbb3520da6192213c7d6fc892d5589a93db548331954de8248fd2"},
-    {file = "multidict-6.1.0-cp313-cp313-musllinux_1_2_aarch64.whl", hash = "sha256:0c3f390dc53279cbc8ba976e5f8035eab997829066756d811616b652b00a23a3"},
-    {file = "multidict-6.1.0-cp313-cp313-musllinux_1_2_i686.whl", hash = "sha256:59bfeae4b25ec05b34f1956eaa1cb38032282cd4dfabc5056d0a1ec4d696d3aa"},
-    {file = "multidict-6.1.0-cp313-cp313-musllinux_1_2_ppc64le.whl", hash = "sha256:b2f59caeaf7632cc633b5cf6fc449372b83bbdf0da4ae04d5be36118e46cc0aa"},
-    {file = "multidict-6.1.0-cp313-cp313-musllinux_1_2_s390x.whl", hash = "sha256:37bb93b2178e02b7b618893990941900fd25b6b9ac0fa49931a40aecdf083fe4"},
-    {file = "multidict-6.1.0-cp313-cp313-musllinux_1_2_x86_64.whl", hash = "sha256:4e9f48f58c2c523d5a06faea47866cd35b32655c46b443f163d08c6d0ddb17d6"},
-    {file = "multidict-6.1.0-cp313-cp313-win32.whl", hash = "sha256:3a37ffb35399029b45c6cc33640a92bef403c9fd388acce75cdc88f58bd19a81"},
-    {file = "multidict-6.1.0-cp313-cp313-win_amd64.whl", hash = "sha256:e9aa71e15d9d9beaad2c6b9319edcdc0a49a43ef5c0a4c8265ca9ee7d6c67774"},
-    {file = "multidict-6.1.0-cp38-cp38-macosx_10_9_universal2.whl", hash = "sha256:db7457bac39421addd0c8449933ac32d8042aae84a14911a757ae6ca3eef1392"},
-    {file = "multidict-6.1.0-cp38-cp38-macosx_10_9_x86_64.whl", hash = "sha256:d094ddec350a2fb899fec68d8353c78233debde9b7d8b4beeafa70825f1c281a"},
-    {file = "multidict-6.1.0-cp38-cp38-macosx_11_0_arm64.whl", hash = "sha256:5845c1fd4866bb5dd3125d89b90e57ed3138241540897de748cdf19de8a2fca2"},
-    {file = "multidict-6.1.0-cp38-cp38-manylinux_2_17_aarch64.manylinux2014_aarch64.whl", hash = "sha256:9079dfc6a70abe341f521f78405b8949f96db48da98aeb43f9907f342f627cdc"},
-    {file = "multidict-6.1.0-cp38-cp38-manylinux_2_17_ppc64le.manylinux2014_ppc64le.whl", hash = "sha256:3914f5aaa0f36d5d60e8ece6a308ee1c9784cd75ec8151062614657a114c4478"},
-    {file = "multidict-6.1.0-cp38-cp38-manylinux_2_17_s390x.manylinux2014_s390x.whl", hash = "sha256:c08be4f460903e5a9d0f76818db3250f12e9c344e79314d1d570fc69d7f4eae4"},
-    {file = "multidict-6.1.0-cp38-cp38-manylinux_2_17_x86_64.manylinux2014_x86_64.whl", hash = "sha256:d093be959277cb7dee84b801eb1af388b6ad3ca6a6b6bf1ed7585895789d027d"},
-    {file = "multidict-6.1.0-cp38-cp38-manylinux_2_5_i686.manylinux1_i686.manylinux_2_17_i686.manylinux2014_i686.whl", hash = "sha256:3702ea6872c5a2a4eeefa6ffd36b042e9773f05b1f37ae3ef7264b1163c2dcf6"},
-    {file = "multidict-6.1.0-cp38-cp38-musllinux_1_2_aarch64.whl", hash = "sha256:2090f6a85cafc5b2db085124d752757c9d251548cedabe9bd31afe6363e0aff2"},
-    {file = "multidict-6.1.0-cp38-cp38-musllinux_1_2_i686.whl", hash = "sha256:f67f217af4b1ff66c68a87318012de788dd95fcfeb24cc889011f4e1c7454dfd"},
-    {file = "multidict-6.1.0-cp38-cp38-musllinux_1_2_ppc64le.whl", hash = "sha256:189f652a87e876098bbc67b4da1049afb5f5dfbaa310dd67c594b01c10388db6"},
-    {file = "multidict-6.1.0-cp38-cp38-musllinux_1_2_s390x.whl", hash = "sha256:6bb5992037f7a9eff7991ebe4273ea7f51f1c1c511e6a2ce511d0e7bdb754492"},
-    {file = "multidict-6.1.0-cp38-cp38-musllinux_1_2_x86_64.whl", hash = "sha256:ac10f4c2b9e770c4e393876e35a7046879d195cd123b4f116d299d442b335bcd"},
-    {file = "multidict-6.1.0-cp38-cp38-win32.whl", hash = "sha256:e27bbb6d14416713a8bd7aaa1313c0fc8d44ee48d74497a0ff4c3a1b6ccb5167"},
-    {file = "multidict-6.1.0-cp38-cp38-win_amd64.whl", hash = "sha256:22f3105d4fb15c8f57ff3959a58fcab6ce36814486500cd7485651230ad4d4ef"},
-    {file = "multidict-6.1.0-cp39-cp39-macosx_10_9_universal2.whl", hash = "sha256:4e18b656c5e844539d506a0a06432274d7bd52a7487e6828c63a63d69185626c"},
-    {file = "multidict-6.1.0-cp39-cp39-macosx_10_9_x86_64.whl", hash = "sha256:a185f876e69897a6f3325c3f19f26a297fa058c5e456bfcff8015e9a27e83ae1"},
-    {file = "multidict-6.1.0-cp39-cp39-macosx_11_0_arm64.whl", hash = "sha256:ab7c4ceb38d91570a650dba194e1ca87c2b543488fe9309b4212694174fd539c"},
-    {file = "multidict-6.1.0-cp39-cp39-manylinux_2_17_aarch64.manylinux2014_aarch64.whl", hash = "sha256:e617fb6b0b6953fffd762669610c1c4ffd05632c138d61ac7e14ad187870669c"},
-    {file = "multidict-6.1.0-cp39-cp39-manylinux_2_17_ppc64le.manylinux2014_ppc64le.whl", hash = "sha256:16e5f4bf4e603eb1fdd5d8180f1a25f30056f22e55ce51fb3d6ad4ab29f7d96f"},
-    {file = "multidict-6.1.0-cp39-cp39-manylinux_2_17_s390x.manylinux2014_s390x.whl", hash = "sha256:f4c035da3f544b1882bac24115f3e2e8760f10a0107614fc9839fd232200b875"},
-    {file = "multidict-6.1.0-cp39-cp39-manylinux_2_17_x86_64.manylinux2014_x86_64.whl", hash = "sha256:957cf8e4b6e123a9eea554fa7ebc85674674b713551de587eb318a2df3e00255"},
-    {file = "multidict-6.1.0-cp39-cp39-manylinux_2_5_i686.manylinux1_i686.manylinux_2_17_i686.manylinux2014_i686.whl", hash = "sha256:483a6aea59cb89904e1ceabd2b47368b5600fb7de78a6e4a2c2987b2d256cf30"},
-    {file = "multidict-6.1.0-cp39-cp39-musllinux_1_2_aarch64.whl", hash = "sha256:87701f25a2352e5bf7454caa64757642734da9f6b11384c1f9d1a8e699758057"},
-    {file = "multidict-6.1.0-cp39-cp39-musllinux_1_2_i686.whl", hash = "sha256:682b987361e5fd7a139ed565e30d81fd81e9629acc7d925a205366877d8c8657"},
-    {file = "multidict-6.1.0-cp39-cp39-musllinux_1_2_ppc64le.whl", hash = "sha256:ce2186a7df133a9c895dea3331ddc5ddad42cdd0d1ea2f0a51e5d161e4762f28"},
-    {file = "multidict-6.1.0-cp39-cp39-musllinux_1_2_s390x.whl", hash = "sha256:9f636b730f7e8cb19feb87094949ba54ee5357440b9658b2a32a5ce4bce53972"},
-    {file = "multidict-6.1.0-cp39-cp39-musllinux_1_2_x86_64.whl", hash = "sha256:73eae06aa53af2ea5270cc066dcaf02cc60d2994bbb2c4ef5764949257d10f43"},
-    {file = "multidict-6.1.0-cp39-cp39-win32.whl", hash = "sha256:1ca0083e80e791cffc6efce7660ad24af66c8d4079d2a750b29001b53ff59ada"},
-    {file = "multidict-6.1.0-cp39-cp39-win_amd64.whl", hash = "sha256:aa466da5b15ccea564bdab9c89175c762bc12825f4659c11227f515cee76fa4a"},
-    {file = "multidict-6.1.0-py3-none-any.whl", hash = "sha256:48e171e52d1c4d33888e529b999e5900356b9ae588c2f09a52dcefb158b27506"},
-    {file = "multidict-6.1.0.tar.gz", hash = "sha256:22ae2ebf9b0c69d206c003e2f6a914ea33f0a932d4aa16f236afc049d9958f4a"},
-]
-
-[package.dependencies]
-typing-extensions = {version = ">=4.1.0", markers = "python_version < \"3.11\""}
-
-[[package]]
-name = "multiprocess"
-version = "0.70.15"
-description = "better multiprocessing and multithreading in Python"
-optional = false
-python-versions = ">=3.7"
-groups = ["main"]
-files = [
-    {file = "multiprocess-0.70.15-pp310-pypy310_pp73-macosx_10_9_x86_64.whl", hash = "sha256:aa36c7ed16f508091438687fe9baa393a7a8e206731d321e443745e743a0d4e5"},
-    {file = "multiprocess-0.70.15-pp37-pypy37_pp73-macosx_10_9_x86_64.whl", hash = "sha256:20e024018c46d0d1602024c613007ac948f9754659e3853b0aa705e83f6931d8"},
-    {file = "multiprocess-0.70.15-pp37-pypy37_pp73-manylinux_2_24_i686.whl", hash = "sha256:e576062981c91f0fe8a463c3d52506e598dfc51320a8dd8d78b987dfca91c5db"},
-    {file = "multiprocess-0.70.15-pp37-pypy37_pp73-manylinux_2_24_x86_64.whl", hash = "sha256:e73f497e6696a0f5433ada2b3d599ae733b87a6e8b008e387c62ac9127add177"},
-    {file = "multiprocess-0.70.15-pp38-pypy38_pp73-macosx_10_9_x86_64.whl", hash = "sha256:73db2e7b32dcc7f9b0f075c2ffa45c90b6729d3f1805f27e88534c8d321a1be5"},
-    {file = "multiprocess-0.70.15-pp38-pypy38_pp73-manylinux_2_24_i686.whl", hash = "sha256:4271647bd8a49c28ecd6eb56a7fdbd3c212c45529ad5303b40b3c65fc6928e5f"},
-    {file = "multiprocess-0.70.15-pp38-pypy38_pp73-manylinux_2_24_x86_64.whl", hash = "sha256:cf981fb998d6ec3208cb14f0cf2e9e80216e834f5d51fd09ebc937c32b960902"},
-    {file = "multiprocess-0.70.15-pp39-pypy39_pp73-macosx_10_9_x86_64.whl", hash = "sha256:18f9f2c7063346d1617bd1684fdcae8d33380ae96b99427260f562e1a1228b67"},
-    {file = "multiprocess-0.70.15-pp39-pypy39_pp73-manylinux_2_24_i686.whl", hash = "sha256:0eac53214d664c49a34695e5824872db4006b1a465edd7459a251809c3773370"},
-    {file = "multiprocess-0.70.15-pp39-pypy39_pp73-manylinux_2_24_x86_64.whl", hash = "sha256:1a51dd34096db47fb21fa2b839e615b051d51b97af9a67afbcdaa67186b44883"},
-    {file = "multiprocess-0.70.15-py310-none-any.whl", hash = "sha256:7dd58e33235e83cf09d625e55cffd7b0f0eede7ee9223cdd666a87624f60c21a"},
-    {file = "multiprocess-0.70.15-py311-none-any.whl", hash = "sha256:134f89053d82c9ed3b73edd3a2531eb791e602d4f4156fc92a79259590bd9670"},
-    {file = "multiprocess-0.70.15-py37-none-any.whl", hash = "sha256:f7d4a1629bccb433114c3b4885f69eccc200994323c80f6feee73b0edc9199c5"},
-    {file = "multiprocess-0.70.15-py38-none-any.whl", hash = "sha256:bee9afba476c91f9ebee7beeee0601face9eff67d822e893f9a893725fbd6316"},
-    {file = "multiprocess-0.70.15-py39-none-any.whl", hash = "sha256:3e0953f5d52b4c76f1c973eaf8214554d146f2be5decb48e928e55c7a2d19338"},
-    {file = "multiprocess-0.70.15.tar.gz", hash = "sha256:f20eed3036c0ef477b07a4177cf7c1ba520d9a2677870a4f47fe026f0cd6787e"},
+python-versions = ">=3.9"
+groups = ["main", "training"]
+files = [
+    {file = "multidict-6.7.1-cp310-cp310-macosx_10_9_universal2.whl", hash = "sha256:c93c3db7ea657dd4637d57e74ab73de31bccefe144d3d4ce370052035bc85fb5"},
+    {file = "multidict-6.7.1-cp310-cp310-macosx_10_9_x86_64.whl", hash = "sha256:974e72a2474600827abaeda71af0c53d9ebbc3c2eb7da37b37d7829ae31232d8"},
+    {file = "multidict-6.7.1-cp310-cp310-macosx_11_0_arm64.whl", hash = "sha256:cdea2e7b2456cfb6694fb113066fd0ec7ea4d67e3a35e1f4cbeea0b448bf5872"},
+    {file = "multidict-6.7.1-cp310-cp310-manylinux1_i686.manylinux_2_28_i686.manylinux_2_5_i686.whl", hash = "sha256:17207077e29342fdc2c9a82e4b306f1127bf1ea91f8b71e02d4798a70bb99991"},
+    {file = "multidict-6.7.1-cp310-cp310-manylinux2014_aarch64.manylinux_2_17_aarch64.manylinux_2_28_aarch64.whl", hash = "sha256:d4f49cb5661344764e4c7c7973e92a47a59b8fc19b6523649ec9dc4960e58a03"},
+    {file = "multidict-6.7.1-cp310-cp310-manylinux2014_armv7l.manylinux_2_17_armv7l.manylinux_2_31_armv7l.whl", hash = "sha256:a9fc4caa29e2e6ae408d1c450ac8bf19892c5fca83ee634ecd88a53332c59981"},
+    {file = "multidict-6.7.1-cp310-cp310-manylinux2014_ppc64le.manylinux_2_17_ppc64le.manylinux_2_28_ppc64le.whl", hash = "sha256:c5f0c21549ab432b57dcc82130f388d84ad8179824cc3f223d5e7cfbfd4143f6"},
+    {file = "multidict-6.7.1-cp310-cp310-manylinux2014_s390x.manylinux_2_17_s390x.manylinux_2_28_s390x.whl", hash = "sha256:7dfb78d966b2c906ae1d28ccf6e6712a3cd04407ee5088cd276fe8cb42186190"},
+    {file = "multidict-6.7.1-cp310-cp310-manylinux2014_x86_64.manylinux_2_17_x86_64.manylinux_2_28_x86_64.whl", hash = "sha256:9b0d9b91d1aa44db9c1f1ecd0d9d2ae610b2f4f856448664e01a3b35899f3f92"},
+    {file = "multidict-6.7.1-cp310-cp310-musllinux_1_2_aarch64.whl", hash = "sha256:dd96c01a9dcd4889dcfcf9eb5544ca0c77603f239e3ffab0524ec17aea9a93ee"},
+    {file = "multidict-6.7.1-cp310-cp310-musllinux_1_2_armv7l.whl", hash = "sha256:067343c68cd6612d375710f895337b3a98a033c94f14b9a99eff902f205424e2"},
+    {file = "multidict-6.7.1-cp310-cp310-musllinux_1_2_i686.whl", hash = "sha256:5884a04f4ff56c6120f6ccf703bdeb8b5079d808ba604d4d53aec0d55dc33568"},
+    {file = "multidict-6.7.1-cp310-cp310-musllinux_1_2_ppc64le.whl", hash = "sha256:8affcf1c98b82bc901702eb73b6947a1bfa170823c153fe8a47b5f5f02e48e40"},
+    {file = "multidict-6.7.1-cp310-cp310-musllinux_1_2_s390x.whl", hash = "sha256:0d17522c37d03e85c8098ec8431636309b2682cf12e58f4dbc76121fb50e4962"},
+    {file = "multidict-6.7.1-cp310-cp310-musllinux_1_2_x86_64.whl", hash = "sha256:24c0cf81544ca5e17cfcb6e482e7a82cd475925242b308b890c9452a074d4505"},
+    {file = "multidict-6.7.1-cp310-cp310-win32.whl", hash = "sha256:d82dd730a95e6643802f4454b8fdecdf08667881a9c5670db85bc5a56693f122"},
+    {file = "multidict-6.7.1-cp310-cp310-win_amd64.whl", hash = "sha256:cf37cbe5ced48d417ba045aca1b21bafca67489452debcde94778a576666a1df"},
+    {file = "multidict-6.7.1-cp310-cp310-win_arm64.whl", hash = "sha256:59bc83d3f66b41dac1e7460aac1d196edc70c9ba3094965c467715a70ecb46db"},
+    {file = "multidict-6.7.1-cp311-cp311-macosx_10_9_universal2.whl", hash = "sha256:7ff981b266af91d7b4b3793ca3382e53229088d193a85dfad6f5f4c27fc73e5d"},
+    {file = "multidict-6.7.1-cp311-cp311-macosx_10_9_x86_64.whl", hash = "sha256:844c5bca0b5444adb44a623fb0a1310c2f4cd41f402126bb269cd44c9b3f3e1e"},
+    {file = "multidict-6.7.1-cp311-cp311-macosx_11_0_arm64.whl", hash = "sha256:f2a0a924d4c2e9afcd7ec64f9de35fcd96915149b2216e1cb2c10a56df483855"},
+    {file = "multidict-6.7.1-cp311-cp311-manylinux1_i686.manylinux_2_28_i686.manylinux_2_5_i686.whl", hash = "sha256:8be1802715a8e892c784c0197c2ace276ea52702a0ede98b6310c8f255a5afb3"},
+    {file = "multidict-6.7.1-cp311-cp311-manylinux2014_aarch64.manylinux_2_17_aarch64.manylinux_2_28_aarch64.whl", hash = "sha256:2e2d2ed645ea29f31c4c7ea1552fcfd7cb7ba656e1eafd4134a6620c9f5fdd9e"},
+    {file = "multidict-6.7.1-cp311-cp311-manylinux2014_armv7l.manylinux_2_17_armv7l.manylinux_2_31_armv7l.whl", hash = "sha256:95922cee9a778659e91db6497596435777bd25ed116701a4c034f8e46544955a"},
+    {file = "multidict-6.7.1-cp311-cp311-manylinux2014_ppc64le.manylinux_2_17_ppc64le.manylinux_2_28_ppc64le.whl", hash = "sha256:6b83cabdc375ffaaa15edd97eb7c0c672ad788e2687004990074d7d6c9b140c8"},
+    {file = "multidict-6.7.1-cp311-cp311-manylinux2014_s390x.manylinux_2_17_s390x.manylinux_2_28_s390x.whl", hash = "sha256:38fb49540705369bab8484db0689d86c0a33a0a9f2c1b197f506b71b4b6c19b0"},
+    {file = "multidict-6.7.1-cp311-cp311-manylinux2014_x86_64.manylinux_2_17_x86_64.manylinux_2_28_x86_64.whl", hash = "sha256:439cbebd499f92e9aa6793016a8acaa161dfa749ae86d20960189f5398a19144"},
+    {file = "multidict-6.7.1-cp311-cp311-musllinux_1_2_aarch64.whl", hash = "sha256:6d3bc717b6fe763b8be3f2bee2701d3c8eb1b2a8ae9f60910f1b2860c82b6c49"},
+    {file = "multidict-6.7.1-cp311-cp311-musllinux_1_2_armv7l.whl", hash = "sha256:619e5a1ac57986dbfec9f0b301d865dddf763696435e2962f6d9cf2fdff2bb71"},
+    {file = "multidict-6.7.1-cp311-cp311-musllinux_1_2_i686.whl", hash = "sha256:0b38ebffd9be37c1170d33bc0f36f4f262e0a09bc1aac1c34c7aa51a7293f0b3"},
+    {file = "multidict-6.7.1-cp311-cp311-musllinux_1_2_ppc64le.whl", hash = "sha256:10ae39c9cfe6adedcdb764f5e8411d4a92b055e35573a2eaa88d3323289ef93c"},
+    {file = "multidict-6.7.1-cp311-cp311-musllinux_1_2_s390x.whl", hash = "sha256:25167cc263257660290fba06b9318d2026e3c910be240a146e1f66dd114af2b0"},
+    {file = "multidict-6.7.1-cp311-cp311-musllinux_1_2_x86_64.whl", hash = "sha256:128441d052254f42989ef98b7b6a6ecb1e6f708aa962c7984235316db59f50fa"},
+    {file = "multidict-6.7.1-cp311-cp311-win32.whl", hash = "sha256:d62b7f64ffde3b99d06b707a280db04fb3855b55f5a06df387236051d0668f4a"},
+    {file = "multidict-6.7.1-cp311-cp311-win_amd64.whl", hash = "sha256:bdbf9f3b332abd0cdb306e7c2113818ab1e922dc84b8f8fd06ec89ed2a19ab8b"},
+    {file = "multidict-6.7.1-cp311-cp311-win_arm64.whl", hash = "sha256:b8c990b037d2fff2f4e33d3f21b9b531c5745b33a49a7d6dbe7a177266af44f6"},
+    {file = "multidict-6.7.1-cp312-cp312-macosx_10_13_universal2.whl", hash = "sha256:a90f75c956e32891a4eda3639ce6dd86e87105271f43d43442a3aedf3cddf172"},
+    {file = "multidict-6.7.1-cp312-cp312-macosx_10_13_x86_64.whl", hash = "sha256:3fccb473e87eaa1382689053e4a4618e7ba7b9b9b8d6adf2027ee474597128cd"},
+    {file = "multidict-6.7.1-cp312-cp312-macosx_11_0_arm64.whl", hash = "sha256:b0fa96985700739c4c7853a43c0b3e169360d6855780021bfc6d0f1ce7c123e7"},
+    {file = "multidict-6.7.1-cp312-cp312-manylinux1_i686.manylinux_2_28_i686.manylinux_2_5_i686.whl", hash = "sha256:cb2a55f408c3043e42b40cc8eecd575afa27b7e0b956dfb190de0f8499a57a53"},
+    {file = "multidict-6.7.1-cp312-cp312-manylinux2014_aarch64.manylinux_2_17_aarch64.manylinux_2_28_aarch64.whl", hash = "sha256:eb0ce7b2a32d09892b3dd6cc44877a0d02a33241fafca5f25c8b6b62374f8b75"},
+    {file = "multidict-6.7.1-cp312-cp312-manylinux2014_armv7l.manylinux_2_17_armv7l.manylinux_2_31_armv7l.whl", hash = "sha256:c3a32d23520ee37bf327d1e1a656fec76a2edd5c038bf43eddfa0572ec49c60b"},
+    {file = "multidict-6.7.1-cp312-cp312-manylinux2014_ppc64le.manylinux_2_17_ppc64le.manylinux_2_28_ppc64le.whl", hash = "sha256:9c90fed18bffc0189ba814749fdcc102b536e83a9f738a9003e569acd540a733"},
+    {file = "multidict-6.7.1-cp312-cp312-manylinux2014_s390x.manylinux_2_17_s390x.manylinux_2_28_s390x.whl", hash = "sha256:da62917e6076f512daccfbbde27f46fed1c98fee202f0559adec8ee0de67f71a"},
+    {file = "multidict-6.7.1-cp312-cp312-manylinux2014_x86_64.manylinux_2_17_x86_64.manylinux_2_28_x86_64.whl", hash = "sha256:bfde23ef6ed9db7eaee6c37dcec08524cb43903c60b285b172b6c094711b3961"},
+    {file = "multidict-6.7.1-cp312-cp312-musllinux_1_2_aarch64.whl", hash = "sha256:3758692429e4e32f1ba0df23219cd0b4fc0a52f476726fff9337d1a57676a582"},
+    {file = "multidict-6.7.1-cp312-cp312-musllinux_1_2_armv7l.whl", hash = "sha256:398c1478926eca669f2fd6a5856b6de9c0acf23a2cb59a14c0ba5844fa38077e"},
+    {file = "multidict-6.7.1-cp312-cp312-musllinux_1_2_i686.whl", hash = "sha256:c102791b1c4f3ab36ce4101154549105a53dc828f016356b3e3bcae2e3a039d3"},
+    {file = "multidict-6.7.1-cp312-cp312-musllinux_1_2_ppc64le.whl", hash = "sha256:a088b62bd733e2ad12c50dad01b7d0166c30287c166e137433d3b410add807a6"},
+    {file = "multidict-6.7.1-cp312-cp312-musllinux_1_2_s390x.whl", hash = "sha256:3d51ff4785d58d3f6c91bdbffcb5e1f7ddfda557727043aa20d20ec4f65e324a"},
+    {file = "multidict-6.7.1-cp312-cp312-musllinux_1_2_x86_64.whl", hash = "sha256:fc5907494fccf3e7d3f94f95c91d6336b092b5fc83811720fae5e2765890dfba"},
+    {file = "multidict-6.7.1-cp312-cp312-win32.whl", hash = "sha256:28ca5ce2fd9716631133d0e9a9b9a745ad7f60bac2bccafb56aa380fc0b6c511"},
+    {file = "multidict-6.7.1-cp312-cp312-win_amd64.whl", hash = "sha256:fcee94dfbd638784645b066074b338bc9cc155d4b4bffa4adce1615c5a426c19"},
+    {file = "multidict-6.7.1-cp312-cp312-win_arm64.whl", hash = "sha256:ba0a9fb644d0c1a2194cf7ffb043bd852cea63a57f66fbd33959f7dae18517bf"},
+    {file = "multidict-6.7.1-cp313-cp313-macosx_10_13_universal2.whl", hash = "sha256:2b41f5fed0ed563624f1c17630cb9941cf2309d4df00e494b551b5f3e3d67a23"},
+    {file = "multidict-6.7.1-cp313-cp313-macosx_10_13_x86_64.whl", hash = "sha256:84e61e3af5463c19b67ced91f6c634effb89ef8bfc5ca0267f954451ed4bb6a2"},
+    {file = "multidict-6.7.1-cp313-cp313-macosx_11_0_arm64.whl", hash = "sha256:935434b9853c7c112eee7ac891bc4cb86455aa631269ae35442cb316790c1445"},
+    {file = "multidict-6.7.1-cp313-cp313-manylinux1_i686.manylinux_2_28_i686.manylinux_2_5_i686.whl", hash = "sha256:432feb25a1cb67fe82a9680b4d65fb542e4635cb3166cd9c01560651ad60f177"},
+    {file = "multidict-6.7.1-cp313-cp313-manylinux2014_aarch64.manylinux_2_17_aarch64.manylinux_2_28_aarch64.whl", hash = "sha256:e82d14e3c948952a1a85503817e038cba5905a3352de76b9a465075d072fba23"},
+    {file = "multidict-6.7.1-cp313-cp313-manylinux2014_armv7l.manylinux_2_17_armv7l.manylinux_2_31_armv7l.whl", hash = "sha256:4cfb48c6ea66c83bcaaf7e4dfa7ec1b6bbcf751b7db85a328902796dfde4c060"},
+    {file = "multidict-6.7.1-cp313-cp313-manylinux2014_ppc64le.manylinux_2_17_ppc64le.manylinux_2_28_ppc64le.whl", hash = "sha256:1d540e51b7e8e170174555edecddbd5538105443754539193e3e1061864d444d"},
+    {file = "multidict-6.7.1-cp313-cp313-manylinux2014_s390x.manylinux_2_17_s390x.manylinux_2_28_s390x.whl", hash = "sha256:273d23f4b40f3dce4d6c8a821c741a86dec62cded82e1175ba3d99be128147ed"},
+    {file = "multidict-6.7.1-cp313-cp313-manylinux2014_x86_64.manylinux_2_17_x86_64.manylinux_2_28_x86_64.whl", hash = "sha256:9d624335fd4fa1c08a53f8b4be7676ebde19cd092b3895c421045ca87895b429"},
+    {file = "multidict-6.7.1-cp313-cp313-musllinux_1_2_aarch64.whl", hash = "sha256:12fad252f8b267cc75b66e8fc51b3079604e8d43a75428ffe193cd9e2195dfd6"},
+    {file = "multidict-6.7.1-cp313-cp313-musllinux_1_2_armv7l.whl", hash = "sha256:03ede2a6ffbe8ef936b92cb4529f27f42be7f56afcdab5ab739cd5f27fb1cbf9"},
+    {file = "multidict-6.7.1-cp313-cp313-musllinux_1_2_i686.whl", hash = "sha256:90efbcf47dbe33dcf643a1e400d67d59abeac5db07dc3f27d6bdeae497a2198c"},
+    {file = "multidict-6.7.1-cp313-cp313-musllinux_1_2_ppc64le.whl", hash = "sha256:5c4b9bfc148f5a91be9244d6264c53035c8a0dcd2f51f1c3c6e30e30ebaa1c84"},
+    {file = "multidict-6.7.1-cp313-cp313-musllinux_1_2_s390x.whl", hash = "sha256:401c5a650f3add2472d1d288c26deebc540f99e2fb83e9525007a74cd2116f1d"},
+    {file = "multidict-6.7.1-cp313-cp313-musllinux_1_2_x86_64.whl", hash = "sha256:97891f3b1b3ffbded884e2916cacf3c6fc87b66bb0dde46f7357404750559f33"},
+    {file = "multidict-6.7.1-cp313-cp313-win32.whl", hash = "sha256:e1c5988359516095535c4301af38d8a8838534158f649c05dd1050222321bcb3"},
+    {file = "multidict-6.7.1-cp313-cp313-win_amd64.whl", hash = "sha256:960c83bf01a95b12b08fd54324a4eb1d5b52c88932b5cba5d6e712bb3ed12eb5"},
+    {file = "multidict-6.7.1-cp313-cp313-win_arm64.whl", hash = "sha256:563fe25c678aaba333d5399408f5ec3c383ca5b663e7f774dd179a520b8144df"},
+    {file = "multidict-6.7.1-cp313-cp313t-macosx_10_13_universal2.whl", hash = "sha256:c76c4bec1538375dad9d452d246ca5368ad6e1c9039dadcf007ae59c70619ea1"},
+    {file = "multidict-6.7.1-cp313-cp313t-macosx_10_13_x86_64.whl", hash = "sha256:57b46b24b5d5ebcc978da4ec23a819a9402b4228b8a90d9c656422b4bdd8a963"},
+    {file = "multidict-6.7.1-cp313-cp313t-macosx_11_0_arm64.whl", hash = "sha256:e954b24433c768ce78ab7929e84ccf3422e46deb45a4dc9f93438f8217fa2d34"},
+    {file = "multidict-6.7.1-cp313-cp313t-manylinux1_i686.manylinux_2_28_i686.manylinux_2_5_i686.whl", hash = "sha256:3bd231490fa7217cc832528e1cd8752a96f0125ddd2b5749390f7c3ec8721b65"},
+    {file = "multidict-6.7.1-cp313-cp313t-manylinux2014_aarch64.manylinux_2_17_aarch64.manylinux_2_28_aarch64.whl", hash = "sha256:253282d70d67885a15c8a7716f3a73edf2d635793ceda8173b9ecc21f2fb8292"},
+    {file = "multidict-6.7.1-cp313-cp313t-manylinux2014_armv7l.manylinux_2_17_armv7l.manylinux_2_31_armv7l.whl", hash = "sha256:0b4c48648d7649c9335cf1927a8b87fa692de3dcb15faa676c6a6f1f1aabda43"},
+    {file = "multidict-6.7.1-cp313-cp313t-manylinux2014_ppc64le.manylinux_2_17_ppc64le.manylinux_2_28_ppc64le.whl", hash = "sha256:98bc624954ec4d2c7cb074b8eefc2b5d0ce7d482e410df446414355d158fe4ca"},
+    {file = "multidict-6.7.1-cp313-cp313t-manylinux2014_s390x.manylinux_2_17_s390x.manylinux_2_28_s390x.whl", hash = "sha256:1b99af4d9eec0b49927b4402bcbb58dea89d3e0db8806a4086117019939ad3dd"},
+    {file = "multidict-6.7.1-cp313-cp313t-manylinux2014_x86_64.manylinux_2_17_x86_64.manylinux_2_28_x86_64.whl", hash = "sha256:6aac4f16b472d5b7dc6f66a0d49dd57b0e0902090be16594dc9ebfd3d17c47e7"},
+    {file = "multidict-6.7.1-cp313-cp313t-musllinux_1_2_aarch64.whl", hash = "sha256:21f830fe223215dffd51f538e78c172ed7c7f60c9b96a2bf05c4848ad49921c3"},
+    {file = "multidict-6.7.1-cp313-cp313t-musllinux_1_2_armv7l.whl", hash = "sha256:f5dd81c45b05518b9aa4da4aa74e1c93d715efa234fd3e8a179df611cc85e5f4"},
+    {file = "multidict-6.7.1-cp313-cp313t-musllinux_1_2_i686.whl", hash = "sha256:eb304767bca2bb92fb9c5bd33cedc95baee5bb5f6c88e63706533a1c06ad08c8"},
+    {file = "multidict-6.7.1-cp313-cp313t-musllinux_1_2_ppc64le.whl", hash = "sha256:c9035dde0f916702850ef66460bc4239d89d08df4d02023a5926e7446724212c"},
+    {file = "multidict-6.7.1-cp313-cp313t-musllinux_1_2_s390x.whl", hash = "sha256:af959b9beeb66c822380f222f0e0a1889331597e81f1ded7f374f3ecb0fd6c52"},
+    {file = "multidict-6.7.1-cp313-cp313t-musllinux_1_2_x86_64.whl", hash = "sha256:41f2952231456154ee479651491e94118229844dd7226541788be783be2b5108"},
+    {file = "multidict-6.7.1-cp313-cp313t-win32.whl", hash = "sha256:df9f19c28adcb40b6aae30bbaa1478c389efd50c28d541d76760199fc1037c32"},
+    {file = "multidict-6.7.1-cp313-cp313t-win_amd64.whl", hash = "sha256:d54ecf9f301853f2c5e802da559604b3e95bb7a3b01a9c295c6ee591b9882de8"},
+    {file = "multidict-6.7.1-cp313-cp313t-win_arm64.whl", hash = "sha256:5a37ca18e360377cfda1d62f5f382ff41f2b8c4ccb329ed974cc2e1643440118"},
+    {file = "multidict-6.7.1-cp314-cp314-macosx_10_15_universal2.whl", hash = "sha256:8f333ec9c5eb1b7105e3b84b53141e66ca05a19a605368c55450b6ba208cb9ee"},
+    {file = "multidict-6.7.1-cp314-cp314-macosx_10_15_x86_64.whl", hash = "sha256:a407f13c188f804c759fc6a9f88286a565c242a76b27626594c133b82883b5c2"},
+    {file = "multidict-6.7.1-cp314-cp314-macosx_11_0_arm64.whl", hash = "sha256:0e161ddf326db5577c3a4cc2d8648f81456e8a20d40415541587a71620d7a7d1"},
+    {file = "multidict-6.7.1-cp314-cp314-manylinux1_i686.manylinux_2_28_i686.manylinux_2_5_i686.whl", hash = "sha256:1e3a8bb24342a8201d178c3b4984c26ba81a577c80d4d525727427460a50c22d"},
+    {file = "multidict-6.7.1-cp314-cp314-manylinux2014_aarch64.manylinux_2_17_aarch64.manylinux_2_28_aarch64.whl", hash = "sha256:97231140a50f5d447d3164f994b86a0bed7cd016e2682f8650d6a9158e14fd31"},
+    {file = "multidict-6.7.1-cp314-cp314-manylinux2014_armv7l.manylinux_2_17_armv7l.manylinux_2_31_armv7l.whl", hash = "sha256:6b10359683bd8806a200fd2909e7c8ca3a7b24ec1d8132e483d58e791d881048"},
+    {file = "multidict-6.7.1-cp314-cp314-manylinux2014_ppc64le.manylinux_2_17_ppc64le.manylinux_2_28_ppc64le.whl", hash = "sha256:283ddac99f7ac25a4acadbf004cb5ae34480bbeb063520f70ce397b281859362"},
+    {file = "multidict-6.7.1-cp314-cp314-manylinux2014_s390x.manylinux_2_17_s390x.manylinux_2_28_s390x.whl", hash = "sha256:538cec1e18c067d0e6103aa9a74f9e832904c957adc260e61cd9d8cf0c3b3d37"},
+    {file = "multidict-6.7.1-cp314-cp314-manylinux2014_x86_64.manylinux_2_17_x86_64.manylinux_2_28_x86_64.whl", hash = "sha256:7eee46ccb30ff48a1e35bb818cc90846c6be2b68240e42a78599166722cea709"},
+    {file = "multidict-6.7.1-cp314-cp314-musllinux_1_2_aarch64.whl", hash = "sha256:fa263a02f4f2dd2d11a7b1bb4362aa7cb1049f84a9235d31adf63f30143469a0"},
+    {file = "multidict-6.7.1-cp314-cp314-musllinux_1_2_armv7l.whl", hash = "sha256:2e1425e2f99ec5bd36c15a01b690a1a2456209c5deed58f95469ffb46039ccbb"},
+    {file = "multidict-6.7.1-cp314-cp314-musllinux_1_2_i686.whl", hash = "sha256:497394b3239fc6f0e13a78a3e1b61296e72bf1c5f94b4c4eb80b265c37a131cd"},
+    {file = "multidict-6.7.1-cp314-cp314-musllinux_1_2_ppc64le.whl", hash = "sha256:233b398c29d3f1b9676b4b6f75c518a06fcb2ea0b925119fb2c1bc35c05e1601"},
+    {file = "multidict-6.7.1-cp314-cp314-musllinux_1_2_s390x.whl", hash = "sha256:93b1818e4a6e0930454f0f2af7dfce69307ca03cdcfb3739bf4d91241967b6c1"},
+    {file = "multidict-6.7.1-cp314-cp314-musllinux_1_2_x86_64.whl", hash = "sha256:f33dc2a3abe9249ea5d8360f969ec7f4142e7ac45ee7014d8f8d5acddf178b7b"},
+    {file = "multidict-6.7.1-cp314-cp314-win32.whl", hash = "sha256:3ab8b9d8b75aef9df299595d5388b14530839f6422333357af1339443cff777d"},
+    {file = "multidict-6.7.1-cp314-cp314-win_amd64.whl", hash = "sha256:5e01429a929600e7dab7b166062d9bb54a5eed752384c7384c968c2afab8f50f"},
+    {file = "multidict-6.7.1-cp314-cp314-win_arm64.whl", hash = "sha256:4885cb0e817aef5d00a2e8451d4665c1808378dc27c2705f1bf4ef8505c0d2e5"},
+    {file = "multidict-6.7.1-cp314-cp314t-macosx_10_15_universal2.whl", hash = "sha256:0458c978acd8e6ea53c81eefaddbbee9c6c5e591f41b3f5e8e194780fe026581"},
+    {file = "multidict-6.7.1-cp314-cp314t-macosx_10_15_x86_64.whl", hash = "sha256:c0abd12629b0af3cf590982c0b413b1e7395cd4ec026f30986818ab95bfaa94a"},
+    {file = "multidict-6.7.1-cp314-cp314t-macosx_11_0_arm64.whl", hash = "sha256:14525a5f61d7d0c94b368a42cff4c9a4e7ba2d52e2672a7b23d84dc86fb02b0c"},
+    {file = "multidict-6.7.1-cp314-cp314t-manylinux1_i686.manylinux_2_28_i686.manylinux_2_5_i686.whl", hash = "sha256:17307b22c217b4cf05033dabefe68255a534d637c6c9b0cc8382718f87be4262"},
+    {file = "multidict-6.7.1-cp314-cp314t-manylinux2014_aarch64.manylinux_2_17_aarch64.manylinux_2_28_aarch64.whl", hash = "sha256:7a7e590ff876a3eaf1c02a4dfe0724b6e69a9e9de6d8f556816f29c496046e59"},
+    {file = "multidict-6.7.1-cp314-cp314t-manylinux2014_armv7l.manylinux_2_17_armv7l.manylinux_2_31_armv7l.whl", hash = "sha256:5fa6a95dfee63893d80a34758cd0e0c118a30b8dcb46372bf75106c591b77889"},
+    {file = "multidict-6.7.1-cp314-cp314t-manylinux2014_ppc64le.manylinux_2_17_ppc64le.manylinux_2_28_ppc64le.whl", hash = "sha256:a0543217a6a017692aa6ae5cc39adb75e587af0f3a82288b1492eb73dd6cc2a4"},
+    {file = "multidict-6.7.1-cp314-cp314t-manylinux2014_s390x.manylinux_2_17_s390x.manylinux_2_28_s390x.whl", hash = "sha256:f99fe611c312b3c1c0ace793f92464d8cd263cc3b26b5721950d977b006b6c4d"},
+    {file = "multidict-6.7.1-cp314-cp314t-manylinux2014_x86_64.manylinux_2_17_x86_64.manylinux_2_28_x86_64.whl", hash = "sha256:9004d8386d133b7e6135679424c91b0b854d2d164af6ea3f289f8f2761064609"},
+    {file = "multidict-6.7.1-cp314-cp314t-musllinux_1_2_aarch64.whl", hash = "sha256:e628ef0e6859ffd8273c69412a2465c4be4a9517d07261b33334b5ec6f3c7489"},
+    {file = "multidict-6.7.1-cp314-cp314t-musllinux_1_2_armv7l.whl", hash = "sha256:841189848ba629c3552035a6a7f5bf3b02eb304e9fea7492ca220a8eda6b0e5c"},
+    {file = "multidict-6.7.1-cp314-cp314t-musllinux_1_2_i686.whl", hash = "sha256:ce1bbd7d780bb5a0da032e095c951f7014d6b0a205f8318308140f1a6aba159e"},
+    {file = "multidict-6.7.1-cp314-cp314t-musllinux_1_2_ppc64le.whl", hash = "sha256:b26684587228afed0d50cf804cc71062cc9c1cdf55051c4c6345d372947b268c"},
+    {file = "multidict-6.7.1-cp314-cp314t-musllinux_1_2_s390x.whl", hash = "sha256:9f9af11306994335398293f9958071019e3ab95e9a707dc1383a35613f6abcb9"},
+    {file = "multidict-6.7.1-cp314-cp314t-musllinux_1_2_x86_64.whl", hash = "sha256:b4938326284c4f1224178a560987b6cf8b4d38458b113d9b8c1db1a836e640a2"},
+    {file = "multidict-6.7.1-cp314-cp314t-win32.whl", hash = "sha256:98655c737850c064a65e006a3df7c997cd3b220be4ec8fe26215760b9697d4d7"},
+    {file = "multidict-6.7.1-cp314-cp314t-win_amd64.whl", hash = "sha256:497bde6223c212ba11d462853cfa4f0ae6ef97465033e7dc9940cdb3ab5b48e5"},
+    {file = "multidict-6.7.1-cp314-cp314t-win_arm64.whl", hash = "sha256:2bbd113e0d4af5db41d5ebfe9ccaff89de2120578164f86a5d17d5a576d1e5b2"},
+    {file = "multidict-6.7.1-cp39-cp39-macosx_10_9_universal2.whl", hash = "sha256:65573858d27cdeaca41893185677dc82395159aa28875a8867af66532d413a8f"},
+    {file = "multidict-6.7.1-cp39-cp39-macosx_10_9_x86_64.whl", hash = "sha256:c524c6fb8fc342793708ab111c4dbc90ff9abd568de220432500e47e990c0358"},
+    {file = "multidict-6.7.1-cp39-cp39-macosx_11_0_arm64.whl", hash = "sha256:aa23b001d968faef416ff70dc0f1ab045517b9b42a90edd3e9bcdb06479e31d5"},
+    {file = "multidict-6.7.1-cp39-cp39-manylinux1_i686.manylinux_2_28_i686.manylinux_2_5_i686.whl", hash = "sha256:6704fa2b7453b2fb121740555fa1ee20cd98c4d011120caf4d2b8d4e7c76eec0"},
+    {file = "multidict-6.7.1-cp39-cp39-manylinux2014_aarch64.manylinux_2_17_aarch64.manylinux_2_28_aarch64.whl", hash = "sha256:121a34e5bfa410cdf2c8c49716de160de3b1dbcd86b49656f5681e4543bcd1a8"},
+    {file = "multidict-6.7.1-cp39-cp39-manylinux2014_armv7l.manylinux_2_17_armv7l.manylinux_2_31_armv7l.whl", hash = "sha256:026d264228bcd637d4e060844e39cdc60f86c479e463d49075dedc21b18fbbe0"},
+    {file = "multidict-6.7.1-cp39-cp39-manylinux2014_ppc64le.manylinux_2_17_ppc64le.manylinux_2_28_ppc64le.whl", hash = "sha256:0e697826df7eb63418ee190fd06ce9f1803593bb4b9517d08c60d9b9a7f69d8f"},
+    {file = "multidict-6.7.1-cp39-cp39-manylinux2014_s390x.manylinux_2_17_s390x.manylinux_2_28_s390x.whl", hash = "sha256:bb08271280173720e9fea9ede98e5231defcbad90f1624bea26f32ec8a956e2f"},
+    {file = "multidict-6.7.1-cp39-cp39-manylinux2014_x86_64.manylinux_2_17_x86_64.manylinux_2_28_x86_64.whl", hash = "sha256:c6b3228e1d80af737b72925ce5fb4daf5a335e49cd7ab77ed7b9fdfbf58c526e"},
+    {file = "multidict-6.7.1-cp39-cp39-musllinux_1_2_aarch64.whl", hash = "sha256:3943debf0fbb57bdde5901695c11094a9a36723e5c03875f87718ee15ca2f4d2"},
+    {file = "multidict-6.7.1-cp39-cp39-musllinux_1_2_armv7l.whl", hash = "sha256:98c5787b0a0d9a41d9311eae44c3b76e6753def8d8870ab501320efe75a6a5f8"},
+    {file = "multidict-6.7.1-cp39-cp39-musllinux_1_2_i686.whl", hash = "sha256:08ccb2a6dc72009093ebe7f3f073e5ec5964cba9a706fa94b1a1484039b87941"},
+    {file = "multidict-6.7.1-cp39-cp39-musllinux_1_2_ppc64le.whl", hash = "sha256:eb351f72c26dc9abe338ca7294661aa22969ad8ffe7ef7d5541d19f368dc854a"},
+    {file = "multidict-6.7.1-cp39-cp39-musllinux_1_2_s390x.whl", hash = "sha256:ac1c665bad8b5d762f5f85ebe4d94130c26965f11de70c708c75671297c776de"},
+    {file = "multidict-6.7.1-cp39-cp39-musllinux_1_2_x86_64.whl", hash = "sha256:1fa6609d0364f4f6f58351b4659a1f3e0e898ba2a8c5cac04cb2c7bc556b0bc5"},
+    {file = "multidict-6.7.1-cp39-cp39-win32.whl", hash = "sha256:6f77ce314a29263e67adadc7e7c1bc699fcb3a305059ab973d038f87caa42ed0"},
+    {file = "multidict-6.7.1-cp39-cp39-win_amd64.whl", hash = "sha256:f537b55778cd3cbee430abe3131255d3a78202e0f9ea7ffc6ada893a4bcaeea4"},
+    {file = "multidict-6.7.1-cp39-cp39-win_arm64.whl", hash = "sha256:749aa54f578f2e5f439538706a475aa844bfa8ef75854b1401e6e528e4937cf9"},
+    {file = "multidict-6.7.1-py3-none-any.whl", hash = "sha256:55d97cc6dae627efa6a6e548885712d4864b81110ac76fa4e534c03819fa4a56"},
+    {file = "multidict-6.7.1.tar.gz", hash = "sha256:ec6652a1bee61c53a3e5776b6049172c53b6aaba34f18c9ad04f82712bac623d"},
 ]
 
-[package.dependencies]
-dill = ">=0.3.7"
-
 [[package]]
 name = "mypy"
-version = "1.14.1"
+version = "1.20.2"
 description = "Optional static typing for Python"
 optional = false
-python-versions = ">=3.8"
+python-versions = ">=3.10"
 groups = ["dev"]
 files = [
-    {file = "mypy-1.14.1-cp310-cp310-macosx_10_9_x86_64.whl", hash = "sha256:52686e37cf13d559f668aa398dd7ddf1f92c5d613e4f8cb262be2fb4fedb0fcb"},
-    {file = "mypy-1.14.1-cp310-cp310-macosx_11_0_arm64.whl", hash = "sha256:1fb545ca340537d4b45d3eecdb3def05e913299ca72c290326be19b3804b39c0"},
-    {file = "mypy-1.14.1-cp310-cp310-manylinux_2_17_aarch64.manylinux2014_aarch64.manylinux_2_28_aarch64.whl", hash = "sha256:90716d8b2d1f4cd503309788e51366f07c56635a3309b0f6a32547eaaa36a64d"},
-    {file = "mypy-1.14.1-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.manylinux_2_28_x86_64.whl", hash = "sha256:2ae753f5c9fef278bcf12e1a564351764f2a6da579d4a81347e1d5a15819997b"},
-    {file = "mypy-1.14.1-cp310-cp310-musllinux_1_2_x86_64.whl", hash = "sha256:e0fe0f5feaafcb04505bcf439e991c6d8f1bf8b15f12b05feeed96e9e7bf1427"},
-    {file = "mypy-1.14.1-cp310-cp310-win_amd64.whl", hash = "sha256:7d54bd85b925e501c555a3227f3ec0cfc54ee8b6930bd6141ec872d1c572f81f"},
-    {file = "mypy-1.14.1-cp311-cp311-macosx_10_9_x86_64.whl", hash = "sha256:f995e511de847791c3b11ed90084a7a0aafdc074ab88c5a9711622fe4751138c"},
-    {file = "mypy-1.14.1-cp311-cp311-macosx_11_0_arm64.whl", hash = "sha256:d64169ec3b8461311f8ce2fd2eb5d33e2d0f2c7b49116259c51d0d96edee48d1"},
-    {file = "mypy-1.14.1-cp311-cp311-manylinux_2_17_aarch64.manylinux2014_aarch64.manylinux_2_28_aarch64.whl", hash = "sha256:ba24549de7b89b6381b91fbc068d798192b1b5201987070319889e93038967a8"},
-    {file = "mypy-1.14.1-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.manylinux_2_28_x86_64.whl", hash = "sha256:183cf0a45457d28ff9d758730cd0210419ac27d4d3f285beda038c9083363b1f"},
-    {file = "mypy-1.14.1-cp311-cp311-musllinux_1_2_x86_64.whl", hash = "sha256:f2a0ecc86378f45347f586e4163d1769dd81c5a223d577fe351f26b179e148b1"},
-    {file = "mypy-1.14.1-cp311-cp311-win_amd64.whl", hash = "sha256:ad3301ebebec9e8ee7135d8e3109ca76c23752bac1e717bc84cd3836b4bf3eae"},
-    {file = "mypy-1.14.1-cp312-cp312-macosx_10_13_x86_64.whl", hash = "sha256:30ff5ef8519bbc2e18b3b54521ec319513a26f1bba19a7582e7b1f58a6e69f14"},
-    {file = "mypy-1.14.1-cp312-cp312-macosx_11_0_arm64.whl", hash = "sha256:cb9f255c18052343c70234907e2e532bc7e55a62565d64536dbc7706a20b78b9"},
-    {file = "mypy-1.14.1-cp312-cp312-manylinux_2_17_aarch64.manylinux2014_aarch64.manylinux_2_28_aarch64.whl", hash = "sha256:8b4e3413e0bddea671012b063e27591b953d653209e7a4fa5e48759cda77ca11"},
-    {file = "mypy-1.14.1-cp312-cp312-manylinux_2_17_x86_64.manylinux2014_x86_64.manylinux_2_28_x86_64.whl", hash = "sha256:553c293b1fbdebb6c3c4030589dab9fafb6dfa768995a453d8a5d3b23784af2e"},
-    {file = "mypy-1.14.1-cp312-cp312-musllinux_1_2_x86_64.whl", hash = "sha256:fad79bfe3b65fe6a1efaed97b445c3d37f7be9fdc348bdb2d7cac75579607c89"},
-    {file = "mypy-1.14.1-cp312-cp312-win_amd64.whl", hash = "sha256:8fa2220e54d2946e94ab6dbb3ba0a992795bd68b16dc852db33028df2b00191b"},
-    {file = "mypy-1.14.1-cp313-cp313-macosx_10_13_x86_64.whl", hash = "sha256:92c3ed5afb06c3a8e188cb5da4984cab9ec9a77ba956ee419c68a388b4595255"},
-    {file = "mypy-1.14.1-cp313-cp313-macosx_11_0_arm64.whl", hash = "sha256:dbec574648b3e25f43d23577309b16534431db4ddc09fda50841f1e34e64ed34"},
-    {file = "mypy-1.14.1-cp313-cp313-manylinux_2_17_aarch64.manylinux2014_aarch64.manylinux_2_28_aarch64.whl", hash = "sha256:8c6d94b16d62eb3e947281aa7347d78236688e21081f11de976376cf010eb31a"},
-    {file = "mypy-1.14.1-cp313-cp313-manylinux_2_17_x86_64.manylinux2014_x86_64.manylinux_2_28_x86_64.whl", hash = "sha256:d4b19b03fdf54f3c5b2fa474c56b4c13c9dbfb9a2db4370ede7ec11a2c5927d9"},
-    {file = "mypy-1.14.1-cp313-cp313-musllinux_1_2_x86_64.whl", hash = "sha256:0c911fde686394753fff899c409fd4e16e9b294c24bfd5e1ea4675deae1ac6fd"},
-    {file = "mypy-1.14.1-cp313-cp313-win_amd64.whl", hash = "sha256:8b21525cb51671219f5307be85f7e646a153e5acc656e5cebf64bfa076c50107"},
-    {file = "mypy-1.14.1-cp38-cp38-macosx_10_9_x86_64.whl", hash = "sha256:7084fb8f1128c76cd9cf68fe5971b37072598e7c31b2f9f95586b65c741a9d31"},
-    {file = "mypy-1.14.1-cp38-cp38-macosx_11_0_arm64.whl", hash = "sha256:8f845a00b4f420f693f870eaee5f3e2692fa84cc8514496114649cfa8fd5e2c6"},
-    {file = "mypy-1.14.1-cp38-cp38-manylinux_2_17_aarch64.manylinux2014_aarch64.manylinux_2_28_aarch64.whl", hash = "sha256:44bf464499f0e3a2d14d58b54674dee25c031703b2ffc35064bd0df2e0fac319"},
-    {file = "mypy-1.14.1-cp38-cp38-manylinux_2_17_x86_64.manylinux2014_x86_64.manylinux_2_28_x86_64.whl", hash = "sha256:c99f27732c0b7dc847adb21c9d47ce57eb48fa33a17bc6d7d5c5e9f9e7ae5bac"},
-    {file = "mypy-1.14.1-cp38-cp38-musllinux_1_2_x86_64.whl", hash = "sha256:bce23c7377b43602baa0bd22ea3265c49b9ff0b76eb315d6c34721af4cdf1d9b"},
-    {file = "mypy-1.14.1-cp38-cp38-win_amd64.whl", hash = "sha256:8edc07eeade7ebc771ff9cf6b211b9a7d93687ff892150cb5692e4f4272b0837"},
-    {file = "mypy-1.14.1-cp39-cp39-macosx_10_9_x86_64.whl", hash = "sha256:3888a1816d69f7ab92092f785a462944b3ca16d7c470d564165fe703b0970c35"},
-    {file = "mypy-1.14.1-cp39-cp39-macosx_11_0_arm64.whl", hash = "sha256:46c756a444117c43ee984bd055db99e498bc613a70bbbc120272bd13ca579fbc"},
-    {file = "mypy-1.14.1-cp39-cp39-manylinux_2_17_aarch64.manylinux2014_aarch64.manylinux_2_28_aarch64.whl", hash = "sha256:27fc248022907e72abfd8e22ab1f10e903915ff69961174784a3900a8cba9ad9"},
-    {file = "mypy-1.14.1-cp39-cp39-manylinux_2_17_x86_64.manylinux2014_x86_64.manylinux_2_28_x86_64.whl", hash = "sha256:499d6a72fb7e5de92218db961f1a66d5f11783f9ae549d214617edab5d4dbdbb"},
-    {file = "mypy-1.14.1-cp39-cp39-musllinux_1_2_x86_64.whl", hash = "sha256:57961db9795eb566dc1d1b4e9139ebc4c6b0cb6e7254ecde69d1552bf7613f60"},
-    {file = "mypy-1.14.1-cp39-cp39-win_amd64.whl", hash = "sha256:07ba89fdcc9451f2ebb02853deb6aaaa3d2239a236669a63ab3801bbf923ef5c"},
-    {file = "mypy-1.14.1-py3-none-any.whl", hash = "sha256:b66a60cc4073aeb8ae00057f9c1f64d49e90f918fbcef9a977eb121da8b8f1d1"},
-    {file = "mypy-1.14.1.tar.gz", hash = "sha256:7ec88144fe9b510e8475ec2f5f251992690fcf89ccb4500b214b4226abcd32d6"},
+    {file = "mypy-1.20.2-cp310-cp310-macosx_10_9_x86_64.whl", hash = "sha256:cf5a4db6dca263010e2c7bff081c89383c72d187ba2cf4c44759aac970e2f0c4"},
+    {file = "mypy-1.20.2-cp310-cp310-macosx_11_0_arm64.whl", hash = "sha256:7b0e817b518bff7facd7f85ea05b643ad8bdcce684cf29784987b0a7c8e1f997"},
+    {file = "mypy-1.20.2-cp310-cp310-manylinux2014_aarch64.manylinux_2_17_aarch64.manylinux_2_28_aarch64.whl", hash = "sha256:97d7b9a485b40f8ca425460e89bf1da2814625b2da627c0dcc6aa46c92631d14"},
+    {file = "mypy-1.20.2-cp310-cp310-manylinux2014_x86_64.manylinux_2_17_x86_64.manylinux_2_28_x86_64.whl", hash = "sha256:1e1c12f6d2db3d78b909b5f77513c11eb7f2dd2782b96a3ab6dffc7d44575c99"},
+    {file = "mypy-1.20.2-cp310-cp310-musllinux_1_2_x86_64.whl", hash = "sha256:89dce27e142d25ffbc154c1819383b69f2e9234dc4ed4766f42e0e8cb264ab5c"},
+    {file = "mypy-1.20.2-cp310-cp310-win_amd64.whl", hash = "sha256:f376e37f9bf2a946872fc5fd1199c99310748e3c26c7a26683f13f8bdb756cbd"},
+    {file = "mypy-1.20.2-cp310-cp310-win_arm64.whl", hash = "sha256:6e2b469efd811707bc530fd1effef0f5d6eebcb7fe376affae69025da4b979a2"},
+    {file = "mypy-1.20.2-cp311-cp311-macosx_10_9_x86_64.whl", hash = "sha256:4077797a273e56e8843d001e9dfe4ba10e33323d6ade647ff260e5cd97d9758c"},
+    {file = "mypy-1.20.2-cp311-cp311-macosx_11_0_arm64.whl", hash = "sha256:cdecf62abcc4292500d7858aeae87a1f8f1150f4c4dd08fb0b336ee79b2a6df3"},
+    {file = "mypy-1.20.2-cp311-cp311-manylinux2014_aarch64.manylinux_2_17_aarch64.manylinux_2_28_aarch64.whl", hash = "sha256:c566c3a88b6ece59b3d70f65bedef17304f48eb52ff040a6a18214e1917b3254"},
+    {file = "mypy-1.20.2-cp311-cp311-manylinux2014_x86_64.manylinux_2_17_x86_64.manylinux_2_28_x86_64.whl", hash = "sha256:0deb80d062b2479f2c87ae568f89845afc71d11bc41b04179e58165fd9f31e98"},
+    {file = "mypy-1.20.2-cp311-cp311-musllinux_1_2_x86_64.whl", hash = "sha256:bba9ad231e92a3e424b3e56b65aa17704993425bba97e302c832f9466bb85bac"},
+    {file = "mypy-1.20.2-cp311-cp311-win_amd64.whl", hash = "sha256:baf593f2765fa3a6b1ef95807dbaa3d25b594f6a52adcc506a6b9cb115e1be67"},
+    {file = "mypy-1.20.2-cp311-cp311-win_arm64.whl", hash = "sha256:20175a1c0f49863946ec20b7f63255768058ac4f07d2b9ded6a6b46cfb5a9100"},
+    {file = "mypy-1.20.2-cp312-cp312-macosx_10_13_x86_64.whl", hash = "sha256:4dbfcf869f6b0517f70cf0030ba6ea1d6645e132337a7d5204a18d8d5636c02b"},
+    {file = "mypy-1.20.2-cp312-cp312-macosx_11_0_arm64.whl", hash = "sha256:4b6481b228d072315b053210b01ac320e1be243dc17f9e5887ef167f23f5fae4"},
+    {file = "mypy-1.20.2-cp312-cp312-manylinux2014_aarch64.manylinux_2_17_aarch64.manylinux_2_28_aarch64.whl", hash = "sha256:34397cdced6b90b836e38182076049fdb41424322e0b0728c946b0939ebdf9f6"},
+    {file = "mypy-1.20.2-cp312-cp312-manylinux2014_x86_64.manylinux_2_17_x86_64.manylinux_2_28_x86_64.whl", hash = "sha256:a5da6976f20cae27059ea8d0c86e7cef3de720e04c4bb9ee18e3690fdb792066"},
+    {file = "mypy-1.20.2-cp312-cp312-musllinux_1_2_x86_64.whl", hash = "sha256:56908d7e08318d39f85b1f0c6cfd47b0cac1a130da677630dac0de3e0623e102"},
+    {file = "mypy-1.20.2-cp312-cp312-win_amd64.whl", hash = "sha256:d52ad8d78522da1d308789df651ee5379088e77c76cb1994858d40a426b343b9"},
+    {file = "mypy-1.20.2-cp312-cp312-win_arm64.whl", hash = "sha256:785b08db19c9f214dc37d65f7c165d19a30fcecb48abfa30f31b01b5acaabb58"},
+    {file = "mypy-1.20.2-cp313-cp313-macosx_10_13_x86_64.whl", hash = "sha256:edfbfca868cdd6bd8d974a60f8a3682f5565d3f5c99b327640cedd24c4264026"},
+    {file = "mypy-1.20.2-cp313-cp313-macosx_11_0_arm64.whl", hash = "sha256:e2877a02380adfcdbc69071a0f74d6e9dbbf593c0dc9d174e1f223ffd5281943"},
+    {file = "mypy-1.20.2-cp313-cp313-manylinux2014_aarch64.manylinux_2_17_aarch64.manylinux_2_28_aarch64.whl", hash = "sha256:7488448de6007cd5177c6cea0517ac33b4c0f5ee9b5e9f2be51ce75511a85517"},
+    {file = "mypy-1.20.2-cp313-cp313-manylinux2014_x86_64.manylinux_2_17_x86_64.manylinux_2_28_x86_64.whl", hash = "sha256:bb9c2fa06887e21d6a3a868762acb82aec34e2c6fd0174064f27c93ede68ad15"},
+    {file = "mypy-1.20.2-cp313-cp313-musllinux_1_2_x86_64.whl", hash = "sha256:9d56a78b646f2e3daa865bc70cd5ec5a46c50045801ca8ff17a0c43abc97e3ee"},
+    {file = "mypy-1.20.2-cp313-cp313-win_amd64.whl", hash = "sha256:2a4102b03bb7481d9a91a6da8d174740c9c8c4401024684b9ca3b7cc5e49852f"},
+    {file = "mypy-1.20.2-cp313-cp313-win_arm64.whl", hash = "sha256:a95a9248b0c6fd933a442c03c3b113c3b61320086b88e2c444676d3fd1ca3330"},
+    {file = "mypy-1.20.2-cp314-cp314-macosx_10_15_x86_64.whl", hash = "sha256:419413398fe250aae057fd2fe50166b61077083c9b82754c341cf4fd73038f30"},
+    {file = "mypy-1.20.2-cp314-cp314-macosx_11_0_arm64.whl", hash = "sha256:e73c07f23009962885c197ccb9b41356a30cc0e5a1d0c2ea8fd8fb1362d7f924"},
+    {file = "mypy-1.20.2-cp314-cp314-manylinux2014_aarch64.manylinux_2_17_aarch64.manylinux_2_28_aarch64.whl", hash = "sha256:0c64e5973df366b747646fc98da921f9d6eba9716d57d1db94a83c026a08e0fb"},
+    {file = "mypy-1.20.2-cp314-cp314-manylinux2014_x86_64.manylinux_2_17_x86_64.manylinux_2_28_x86_64.whl", hash = "sha256:5a65aa591af023864fd08a97da9974e919452cfe19cb146c8a5dc692626445dc"},
+    {file = "mypy-1.20.2-cp314-cp314-musllinux_1_2_x86_64.whl", hash = "sha256:4fef51b01e638974a6e69885687e9bd40c8d1e09a6cd291cca0619625cf1f558"},
+    {file = "mypy-1.20.2-cp314-cp314-win_amd64.whl", hash = "sha256:913485a03f1bcf5d279409a9d2b9ed565c151f61c09f29991e5faa14033da4c8"},
+    {file = "mypy-1.20.2-cp314-cp314-win_arm64.whl", hash = "sha256:c3bae4f855d965b5453784300c12ffc63a548304ac7f99e55d4dc7c898673aa3"},
+    {file = "mypy-1.20.2-cp314-cp314t-macosx_10_15_x86_64.whl", hash = "sha256:2de3dcea53babc1c3237a19002bc3d228ce1833278f093b8d619e06e7cc79609"},
+    {file = "mypy-1.20.2-cp314-cp314t-macosx_11_0_arm64.whl", hash = "sha256:52b176444e2e5054dfcbcb8c75b0b719865c96247b37407184bbfca5c353f2c2"},
+    {file = "mypy-1.20.2-cp314-cp314t-manylinux2014_aarch64.manylinux_2_17_aarch64.manylinux_2_28_aarch64.whl", hash = "sha256:688c3312e5dadb573a2c69c82af3a298d43ecf9e6d264e0f95df960b5f6ac19c"},
+    {file = "mypy-1.20.2-cp314-cp314t-manylinux2014_x86_64.manylinux_2_17_x86_64.manylinux_2_28_x86_64.whl", hash = "sha256:29752dbbf8cc53f89f6ac096d363314333045c257c9c75cbd189ca2de0455744"},
+    {file = "mypy-1.20.2-cp314-cp314t-musllinux_1_2_x86_64.whl", hash = "sha256:803203d2b6ea644982c644895c2f78b28d0e208bba7b27d9b921e0ec5eb207c6"},
+    {file = "mypy-1.20.2-cp314-cp314t-win_amd64.whl", hash = "sha256:9bcb8aa397ff0093c824182fd76a935a9ba7ad097fcbef80ae89bf6c1731d8ec"},
+    {file = "mypy-1.20.2-cp314-cp314t-win_arm64.whl", hash = "sha256:e061b58443f1736f8a37c48978d7ab581636d6ab03e3d4f99e3fa90463bb9382"},
+    {file = "mypy-1.20.2-py3-none-any.whl", hash = "sha256:a94c5a76ab46c5e6257c7972b6c8cff0574201ca7dc05647e33e795d78680563"},
+    {file = "mypy-1.20.2.tar.gz", hash = "sha256:e8222c26daaafd9e8626dec58ae36029f82585890589576f769a650dd20fd665"},
 ]
 
 [package.dependencies]
+librt = {version = ">=0.8.0", markers = "platform_python_implementation != \"PyPy\""}
 mypy_extensions = ">=1.0.0"
-tomli = {version = ">=1.1.0", markers = "python_version < \"3.11\""}
-typing_extensions = ">=4.6.0"
+pathspec = ">=1.0.0"
+typing_extensions = [
+    {version = ">=4.6.0", markers = "python_version < \"3.15\""},
+    {version = ">=4.14.0", markers = "python_version >= \"3.15\""},
+]
 
 [package.extras]
 dmypy = ["psutil (>=4.0)"]
 faster-cache = ["orjson"]
 install-types = ["pip"]
 mypyc = ["setuptools (>=50)"]
+native-parser = ["ast-serialize (>=0.1.1,<1.0.0)"]
 reports = ["lxml"]
 
 [[package]]
 name = "mypy-extensions"
-version = "1.0.0"
+version = "1.1.0"
 description = "Type system extensions for programs checked with the mypy type checker."
 optional = false
-python-versions = ">=3.5"
+python-versions = ">=3.8"
 groups = ["dev"]
 files = [
-    {file = "mypy_extensions-1.0.0-py3-none-any.whl", hash = "sha256:4392f6c0eb8a5668a69e23d168ffa70f0be9ccfd32b5cc2d26a34ae5b844552d"},
-    {file = "mypy_extensions-1.0.0.tar.gz", hash = "sha256:75dbf8955dc00442a438fc4d0666508a9a97b6bd41aa2f0ffe9d2f2725af0782"},
+    {file = "mypy_extensions-1.1.0-py3-none-any.whl", hash = "sha256:1be4cccdb0f2482337c4743e60421de3a356cd97508abadd57d47403e94f5505"},
+    {file = "mypy_extensions-1.1.0.tar.gz", hash = "sha256:52e68efc3284861e772bbcd66823fde5ae21fd2fdb51c62a211403730b916558"},
+]
+
+[[package]]
+name = "narwhals"
+version = "2.22.1"
+description = "Extremely lightweight compatibility layer between dataframe libraries"
+optional = false
+python-versions = ">=3.10"
+groups = ["main", "training"]
+files = [
+    {file = "narwhals-2.22.1-py3-none-any.whl", hash = "sha256:60567d774edf77db53906f89d9fbd164e66e56d66d388e1e6990f17ac33cfb53"},
+    {file = "narwhals-2.22.1.tar.gz", hash = "sha256:d62920805a0a43b7ff8b54b0c0d3142d796f8a9301836ada37e573d6a33cbcd9"},
 ]
+markers = {main = "extra == \"trackio\""}
+
+[package.extras]
+cudf = ["cudf-cu12 (>=24.10.0) ; sys_platform == \"linux\""]
+dask = ["dask[dataframe] (>=2024.8)"]
+duckdb = ["duckdb (>=1.1)"]
+ibis = ["ibis-framework (>=6.0.0)", "packaging (>=21.3)", "pyarrow-hotfix (>=0.7)", "rich (>=12.4.4)"]
+modin = ["modin (>=0.22.0)"]
+pandas = ["pandas (>=1.3.4)"]
+polars = ["polars (>=0.20.4)"]
+pyarrow = ["pyarrow (>=13.0.0)"]
+pyspark = ["pyspark (>=3.5.0)"]
+pyspark-connect = ["pyspark[connect] (>=3.5.0)"]
+sql = ["narwhals[duckdb]", "sqlparse (>=0.5.5)"]
+sqlframe = ["sqlframe (>=3.22.0,!=3.39.3)"]
 
 [[package]]
 name = "networkx"
-version = "3.4.2"
+version = "3.6"
 description = "Python package for creating and manipulating graphs and networks"
 optional = false
-python-versions = ">=3.10"
-groups = ["main"]
+python-versions = ">=3.11"
+groups = ["main", "training"]
+markers = "python_version < \"3.15\" and python_version >= \"3.12\""
 files = [
-    {file = "networkx-3.4.2-py3-none-any.whl", hash = "sha256:df5d4365b724cf81b8c6a7312509d0c22386097011ad1abe274afd5e9d3bbc5f"},
-    {file = "networkx-3.4.2.tar.gz", hash = "sha256:307c3669428c5362aab27c8a1260aa8f47c4e91d3891f48be0141738d8d053e1"},
+    {file = "networkx-3.6-py3-none-any.whl", hash = "sha256:cdb395b105806062473d3be36458d8f1459a4e4b98e236a66c3a48996e07684f"},
+    {file = "networkx-3.6.tar.gz", hash = "sha256:285276002ad1f7f7da0f7b42f004bcba70d381e936559166363707fdad3d72ad"},
 ]
 
 [package.extras]
-default = ["matplotlib (>=3.7)", "numpy (>=1.24)", "pandas (>=2.0)", "scipy (>=1.10,!=1.11.0,!=1.11.1)"]
-developer = ["changelist (==0.5)", "mypy (>=1.1)", "pre-commit (>=3.2)", "rtoml"]
-doc = ["intersphinx-registry", "myst-nb (>=1.1)", "numpydoc (>=1.8.0)", "pillow (>=9.4)", "pydata-sphinx-theme (>=0.15)", "sphinx (>=7.3)", "sphinx-gallery (>=0.16)", "texext (>=0.6.7)"]
-example = ["cairocffi (>=1.7)", "contextily (>=1.6)", "igraph (>=0.11)", "momepy (>=0.7.2)", "osmnx (>=1.9)", "scikit-learn (>=1.5)", "seaborn (>=0.13)"]
+benchmarking = ["asv", "virtualenv"]
+default = ["matplotlib (>=3.8)", "numpy (>=1.25)", "pandas (>=2.0)", "scipy (>=1.11.2)"]
+developer = ["mypy (>=1.15)", "pre-commit (>=4.1)"]
+doc = ["intersphinx-registry", "myst-nb (>=1.1)", "numpydoc (>=1.8.0)", "pillow (>=10)", "pydata-sphinx-theme (>=0.16)", "sphinx (>=8.0)", "sphinx-gallery (>=0.18)", "texext (>=0.6.7)"]
+example = ["cairocffi (>=1.7)", "contextily (>=1.6)", "igraph (>=0.11)", "iplotx (>=0.9.0)", "momepy (>=0.7.2)", "osmnx (>=2.0.0)", "scikit-learn (>=1.5)", "seaborn (>=0.13)"]
 extra = ["lxml (>=4.6)", "pydot (>=3.0.1)", "pygraphviz (>=1.14)", "sympy (>=1.10)"]
-test = ["pytest (>=7.2)", "pytest-cov (>=4.0)"]
+release = ["build (>=0.10)", "changelist (==0.5)", "twine (>=4.0)", "wheel (>=0.40)"]
+test = ["pytest (>=7.2)", "pytest-cov (>=4.0)", "pytest-xdist (>=3.0)"]
+test-extras = ["pytest-mpl", "pytest-randomly"]
 
 [[package]]
-name = "ninja"
-version = "1.11.1.3"
-description = "Ninja is a small build system with a focus on speed"
+name = "networkx"
+version = "3.6.1"
+description = "Python package for creating and manipulating graphs and networks"
 optional = false
-python-versions = ">=3.7"
-groups = ["main"]
+python-versions = "!=3.14.1,>=3.11"
+groups = ["main", "training"]
+markers = "python_version == \"3.11\" or python_version >= \"3.15\""
 files = [
-    {file = "ninja-1.11.1.3-py3-none-macosx_10_9_universal2.whl", hash = "sha256:2b4879ea3f1169f3d855182c57dcc84d1b5048628c8b7be0d702b81882a37237"},
-    {file = "ninja-1.11.1.3-py3-none-manylinux_2_12_i686.manylinux2010_i686.whl", hash = "sha256:bc3ebc8b2e47716149f3541742b5cd8e0b08f51013b825c05baca3e34854370d"},
-    {file = "ninja-1.11.1.3-py3-none-manylinux_2_12_x86_64.manylinux2010_x86_64.whl", hash = "sha256:a27e78ca71316c8654965ee94b286a98c83877bfebe2607db96897bbfe458af0"},
-    {file = "ninja-1.11.1.3-py3-none-manylinux_2_17_aarch64.manylinux2014_aarch64.whl", hash = "sha256:2883ea46b3c5079074f56820f9989c6261fcc6fd873d914ee49010ecf283c3b2"},
-    {file = "ninja-1.11.1.3-py3-none-manylinux_2_17_ppc64le.manylinux2014_ppc64le.whl", hash = "sha256:8c4bdb9fd2d0c06501ae15abfd23407660e95659e384acd36e013b6dd7d8a8e4"},
-    {file = "ninja-1.11.1.3-py3-none-manylinux_2_17_s390x.manylinux2014_s390x.whl", hash = "sha256:114ed5c61c8474df6a69ab89097a20749b769e2c219a452cb2fadc49b0d581b0"},
-    {file = "ninja-1.11.1.3-py3-none-manylinux_2_28_armv7l.manylinux_2_31_armv7l.whl", hash = "sha256:7fa2247fce98f683bc712562d82b22b8a0a5c000738a13147ca2d1b68c122298"},
-    {file = "ninja-1.11.1.3-py3-none-musllinux_1_1_aarch64.whl", hash = "sha256:a38c6c6c8032bed68b70c3b065d944c35e9f903342875d3a3218c1607987077c"},
-    {file = "ninja-1.11.1.3-py3-none-musllinux_1_1_i686.whl", hash = "sha256:56ada5d33b8741d298836644042faddebc83ee669782d661e21563034beb5aba"},
-    {file = "ninja-1.11.1.3-py3-none-musllinux_1_1_ppc64le.whl", hash = "sha256:53409151da081f3c198bb0bfc220a7f4e821e022c5b7d29719adda892ddb31bb"},
-    {file = "ninja-1.11.1.3-py3-none-musllinux_1_1_s390x.whl", hash = "sha256:1ad2112c2b0159ed7c4ae3731595191b1546ba62316fc40808edecd0306fefa3"},
-    {file = "ninja-1.11.1.3-py3-none-musllinux_1_1_x86_64.whl", hash = "sha256:28aea3c1c280cba95b8608d50797169f3a34280e3e9a6379b6e340f0c9eaeeb0"},
-    {file = "ninja-1.11.1.3-py3-none-musllinux_1_2_armv7l.whl", hash = "sha256:b6966f83064a88a51693073eea3decd47e08c3965241e09578ef7aa3a7738329"},
-    {file = "ninja-1.11.1.3-py3-none-win32.whl", hash = "sha256:a4a3b71490557e18c010cbb26bd1ea9a0c32ee67e8f105e9731515b6e0af792e"},
-    {file = "ninja-1.11.1.3-py3-none-win_amd64.whl", hash = "sha256:04d48d14ea7ba11951c156599ab526bdda575450797ff57c6fdf99b2554d09c7"},
-    {file = "ninja-1.11.1.3-py3-none-win_arm64.whl", hash = "sha256:17978ad611d8ead578d83637f5ae80c2261b033db0b493a7ce94f88623f29e1b"},
-    {file = "ninja-1.11.1.3.tar.gz", hash = "sha256:edfa0d2e9d7ead1635b03e40a32ad56cc8f56798b6e2e9848d8300b174897076"},
+    {file = "networkx-3.6.1-py3-none-any.whl", hash = "sha256:d47fbf302e7d9cbbb9e2555a0d267983d2aa476bac30e90dfbe5669bd57f3762"},
+    {file = "networkx-3.6.1.tar.gz", hash = "sha256:26b7c357accc0c8cde558ad486283728b65b6a95d85ee1cd66bafab4c8168509"},
 ]
 
 [package.extras]
-test = ["coverage (>=4.2)", "importlib_metadata (>=2.0)", "pytest (>=6.0)", "pytest-cov (>=3)"]
+benchmarking = ["asv", "virtualenv"]
+default = ["matplotlib (>=3.8)", "numpy (>=1.25)", "pandas (>=2.0)", "scipy (>=1.11.2)"]
+developer = ["mypy (>=1.15)", "pre-commit (>=4.1)"]
+doc = ["intersphinx-registry", "myst-nb (>=1.1)", "numpydoc (>=1.8.0)", "pillow (>=10)", "pydata-sphinx-theme (>=0.16)", "sphinx (>=8.0)", "sphinx-gallery (>=0.18)", "texext (>=0.6.7)"]
+example = ["cairocffi (>=1.7)", "contextily (>=1.6)", "igraph (>=0.11)", "iplotx (>=0.9.0)", "momepy (>=0.7.2)", "osmnx (>=2.0.0)", "scikit-learn (>=1.5)", "seaborn (>=0.13)"]
+extra = ["lxml (>=4.6)", "pydot (>=3.0.1)", "pygraphviz (>=1.14)", "sympy (>=1.10)"]
+release = ["build (>=0.10)", "changelist (==0.5)", "twine (>=4.0)", "wheel (>=0.40)"]
+test = ["pytest (>=7.2)", "pytest-cov (>=4.0)", "pytest-xdist (>=3.0)"]
+test-extras = ["pytest-mpl", "pytest-randomly"]
+
+[[package]]
+name = "ninja"
+version = "1.13.0"
+description = "Ninja is a small build system with a focus on speed"
+optional = false
+python-versions = ">=3.8"
+groups = ["main", "training"]
+files = [
+    {file = "ninja-1.13.0-py3-none-macosx_10_9_universal2.whl", hash = "sha256:fa2a8bfc62e31b08f83127d1613d10821775a0eb334197154c4d6067b7068ff1"},
+    {file = "ninja-1.13.0-py3-none-manylinux2014_aarch64.manylinux_2_17_aarch64.whl", hash = "sha256:3d00c692fb717fd511abeb44b8c5d00340c36938c12d6538ba989fe764e79630"},
+    {file = "ninja-1.13.0-py3-none-manylinux2014_i686.manylinux_2_17_i686.whl", hash = "sha256:be7f478ff9f96a128b599a964fc60a6a87b9fa332ee1bd44fa243ac88d50291c"},
+    {file = "ninja-1.13.0-py3-none-manylinux2014_ppc64le.manylinux_2_17_ppc64le.whl", hash = "sha256:60056592cf495e9a6a4bea3cd178903056ecb0943e4de45a2ea825edb6dc8d3e"},
+    {file = "ninja-1.13.0-py3-none-manylinux2014_s390x.manylinux_2_17_s390x.whl", hash = "sha256:1c97223cdda0417f414bf864cfb73b72d8777e57ebb279c5f6de368de0062988"},
+    {file = "ninja-1.13.0-py3-none-manylinux2014_x86_64.manylinux_2_17_x86_64.whl", hash = "sha256:fb46acf6b93b8dd0322adc3a4945452a4e774b75b91293bafcc7b7f8e6517dfa"},
+    {file = "ninja-1.13.0-py3-none-manylinux_2_28_armv7l.manylinux_2_31_armv7l.whl", hash = "sha256:4be9c1b082d244b1ad7ef41eb8ab088aae8c109a9f3f0b3e56a252d3e00f42c1"},
+    {file = "ninja-1.13.0-py3-none-manylinux_2_31_riscv64.whl", hash = "sha256:6739d3352073341ad284246f81339a384eec091d9851a886dfa5b00a6d48b3e2"},
+    {file = "ninja-1.13.0-py3-none-musllinux_1_2_aarch64.whl", hash = "sha256:11be2d22027bde06f14c343f01d31446747dbb51e72d00decca2eb99be911e2f"},
+    {file = "ninja-1.13.0-py3-none-musllinux_1_2_armv7l.whl", hash = "sha256:aa45b4037b313c2f698bc13306239b8b93b4680eb47e287773156ac9e9304714"},
+    {file = "ninja-1.13.0-py3-none-musllinux_1_2_i686.whl", hash = "sha256:5f8e1e8a1a30835eeb51db05cf5a67151ad37542f5a4af2a438e9490915e5b72"},
+    {file = "ninja-1.13.0-py3-none-musllinux_1_2_ppc64le.whl", hash = "sha256:3d7d7779d12cb20c6d054c61b702139fd23a7a964ec8f2c823f1ab1b084150db"},
+    {file = "ninja-1.13.0-py3-none-musllinux_1_2_riscv64.whl", hash = "sha256:d741a5e6754e0bda767e3274a0f0deeef4807f1fec6c0d7921a0244018926ae5"},
+    {file = "ninja-1.13.0-py3-none-musllinux_1_2_s390x.whl", hash = "sha256:e8bad11f8a00b64137e9b315b137d8bb6cbf3086fbdc43bf1f90fd33324d2e96"},
+    {file = "ninja-1.13.0-py3-none-musllinux_1_2_x86_64.whl", hash = "sha256:b4f2a072db3c0f944c32793e91532d8948d20d9ab83da9c0c7c15b5768072200"},
+    {file = "ninja-1.13.0-py3-none-win32.whl", hash = "sha256:8cfbb80b4a53456ae8a39f90ae3d7a2129f45ea164f43fadfa15dc38c4aef1c9"},
+    {file = "ninja-1.13.0-py3-none-win_amd64.whl", hash = "sha256:fb8ee8719f8af47fed145cced4a85f0755dd55d45b2bddaf7431fa89803c5f3e"},
+    {file = "ninja-1.13.0-py3-none-win_arm64.whl", hash = "sha256:3c0b40b1f0bba764644385319028650087b4c1b18cdfa6f45cb39a3669b81aa9"},
+    {file = "ninja-1.13.0.tar.gz", hash = "sha256:4a40ce995ded54d9dc24f8ea37ff3bf62ad192b547f6c7126e7e25045e76f978"},
+]
+markers = {main = "extra == \"quanto\""}
 
 [[package]]
 name = "nodeenv"
-version = "1.9.1"
+version = "1.10.0"
 description = "Node.js virtual environment builder"
 optional = false
 python-versions = "!=3.0.*,!=3.1.*,!=3.2.*,!=3.3.*,!=3.4.*,!=3.5.*,!=3.6.*,>=2.7"
-groups = ["main", "dev"]
+groups = ["dev"]
 files = [
-    {file = "nodeenv-1.9.1-py2.py3-none-any.whl", hash = "sha256:ba11c9782d29c27c70ffbdda2d7415098754709be8a7056d79a737cd901155c9"},
-    {file = "nodeenv-1.9.1.tar.gz", hash = "sha256:6ec12890a2dab7946721edbfbcd91f3319c6ccc9aec47be7c7e6b7011ee6645f"},
+    {file = "nodeenv-1.10.0-py2.py3-none-any.whl", hash = "sha256:5bb13e3eed2923615535339b3c620e76779af4cb4c6a90deccc9e36b274d3827"},
+    {file = "nodeenv-1.10.0.tar.gz", hash = "sha256:996c191ad80897d076bdfba80a41994c2b47c68e224c542b48feba42ba00f8bb"},
 ]
 
 [[package]]
 name = "numpy"
-version = "1.26.4"
+version = "2.2.6"
 description = "Fundamental package for array computing in Python"
 optional = false
-python-versions = ">=3.9"
-groups = ["main"]
-files = [
-    {file = "numpy-1.26.4-cp310-cp310-macosx_10_9_x86_64.whl", hash = "sha256:9ff0f4f29c51e2803569d7a51c2304de5554655a60c5d776e35b4a41413830d0"},
-    {file = "numpy-1.26.4-cp310-cp310-macosx_11_0_arm64.whl", hash = "sha256:2e4ee3380d6de9c9ec04745830fd9e2eccb3e6cf790d39d7b98ffd19b0dd754a"},
-    {file = "numpy-1.26.4-cp310-cp310-manylinux_2_17_aarch64.manylinux2014_aarch64.whl", hash = "sha256:d209d8969599b27ad20994c8e41936ee0964e6da07478d6c35016bc386b66ad4"},
-    {file = "numpy-1.26.4-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl", hash = "sha256:ffa75af20b44f8dba823498024771d5ac50620e6915abac414251bd971b4529f"},
-    {file = "numpy-1.26.4-cp310-cp310-musllinux_1_1_aarch64.whl", hash = "sha256:62b8e4b1e28009ef2846b4c7852046736bab361f7aeadeb6a5b89ebec3c7055a"},
-    {file = "numpy-1.26.4-cp310-cp310-musllinux_1_1_x86_64.whl", hash = "sha256:a4abb4f9001ad2858e7ac189089c42178fcce737e4169dc61321660f1a96c7d2"},
-    {file = "numpy-1.26.4-cp310-cp310-win32.whl", hash = "sha256:bfe25acf8b437eb2a8b2d49d443800a5f18508cd811fea3181723922a8a82b07"},
-    {file = "numpy-1.26.4-cp310-cp310-win_amd64.whl", hash = "sha256:b97fe8060236edf3662adfc2c633f56a08ae30560c56310562cb4f95500022d5"},
-    {file = "numpy-1.26.4-cp311-cp311-macosx_10_9_x86_64.whl", hash = "sha256:4c66707fabe114439db9068ee468c26bbdf909cac0fb58686a42a24de1760c71"},
-    {file = "numpy-1.26.4-cp311-cp311-macosx_11_0_arm64.whl", hash = "sha256:edd8b5fe47dab091176d21bb6de568acdd906d1887a4584a15a9a96a1dca06ef"},
-    {file = "numpy-1.26.4-cp311-cp311-manylinux_2_17_aarch64.manylinux2014_aarch64.whl", hash = "sha256:7ab55401287bfec946ced39700c053796e7cc0e3acbef09993a9ad2adba6ca6e"},
-    {file = "numpy-1.26.4-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl", hash = "sha256:666dbfb6ec68962c033a450943ded891bed2d54e6755e35e5835d63f4f6931d5"},
-    {file = "numpy-1.26.4-cp311-cp311-musllinux_1_1_aarch64.whl", hash = "sha256:96ff0b2ad353d8f990b63294c8986f1ec3cb19d749234014f4e7eb0112ceba5a"},
-    {file = "numpy-1.26.4-cp311-cp311-musllinux_1_1_x86_64.whl", hash = "sha256:60dedbb91afcbfdc9bc0b1f3f402804070deed7392c23eb7a7f07fa857868e8a"},
-    {file = "numpy-1.26.4-cp311-cp311-win32.whl", hash = "sha256:1af303d6b2210eb850fcf03064d364652b7120803a0b872f5211f5234b399f20"},
-    {file = "numpy-1.26.4-cp311-cp311-win_amd64.whl", hash = "sha256:cd25bcecc4974d09257ffcd1f098ee778f7834c3ad767fe5db785be9a4aa9cb2"},
-    {file = "numpy-1.26.4-cp312-cp312-macosx_10_9_x86_64.whl", hash = "sha256:b3ce300f3644fb06443ee2222c2201dd3a89ea6040541412b8fa189341847218"},
-    {file = "numpy-1.26.4-cp312-cp312-macosx_11_0_arm64.whl", hash = "sha256:03a8c78d01d9781b28a6989f6fa1bb2c4f2d51201cf99d3dd875df6fbd96b23b"},
-    {file = "numpy-1.26.4-cp312-cp312-manylinux_2_17_aarch64.manylinux2014_aarch64.whl", hash = "sha256:9fad7dcb1aac3c7f0584a5a8133e3a43eeb2fe127f47e3632d43d677c66c102b"},
-    {file = "numpy-1.26.4-cp312-cp312-manylinux_2_17_x86_64.manylinux2014_x86_64.whl", hash = "sha256:675d61ffbfa78604709862923189bad94014bef562cc35cf61d3a07bba02a7ed"},
-    {file = "numpy-1.26.4-cp312-cp312-musllinux_1_1_aarch64.whl", hash = "sha256:ab47dbe5cc8210f55aa58e4805fe224dac469cde56b9f731a4c098b91917159a"},
-    {file = "numpy-1.26.4-cp312-cp312-musllinux_1_1_x86_64.whl", hash = "sha256:1dda2e7b4ec9dd512f84935c5f126c8bd8b9f2fc001e9f54af255e8c5f16b0e0"},
-    {file = "numpy-1.26.4-cp312-cp312-win32.whl", hash = "sha256:50193e430acfc1346175fcbdaa28ffec49947a06918b7b92130744e81e640110"},
-    {file = "numpy-1.26.4-cp312-cp312-win_amd64.whl", hash = "sha256:08beddf13648eb95f8d867350f6a018a4be2e5ad54c8d8caed89ebca558b2818"},
-    {file = "numpy-1.26.4-cp39-cp39-macosx_10_9_x86_64.whl", hash = "sha256:7349ab0fa0c429c82442a27a9673fc802ffdb7c7775fad780226cb234965e53c"},
-    {file = "numpy-1.26.4-cp39-cp39-macosx_11_0_arm64.whl", hash = "sha256:52b8b60467cd7dd1e9ed082188b4e6bb35aa5cdd01777621a1658910745b90be"},
-    {file = "numpy-1.26.4-cp39-cp39-manylinux_2_17_aarch64.manylinux2014_aarch64.whl", hash = "sha256:d5241e0a80d808d70546c697135da2c613f30e28251ff8307eb72ba696945764"},
-    {file = "numpy-1.26.4-cp39-cp39-manylinux_2_17_x86_64.manylinux2014_x86_64.whl", hash = "sha256:f870204a840a60da0b12273ef34f7051e98c3b5961b61b0c2c1be6dfd64fbcd3"},
-    {file = "numpy-1.26.4-cp39-cp39-musllinux_1_1_aarch64.whl", hash = "sha256:679b0076f67ecc0138fd2ede3a8fd196dddc2ad3254069bcb9faf9a79b1cebcd"},
-    {file = "numpy-1.26.4-cp39-cp39-musllinux_1_1_x86_64.whl", hash = "sha256:47711010ad8555514b434df65f7d7b076bb8261df1ca9bb78f53d3b2db02e95c"},
-    {file = "numpy-1.26.4-cp39-cp39-win32.whl", hash = "sha256:a354325ee03388678242a4d7ebcd08b5c727033fcff3b2f536aea978e15ee9e6"},
-    {file = "numpy-1.26.4-cp39-cp39-win_amd64.whl", hash = "sha256:3373d5d70a5fe74a2c1bb6d2cfd9609ecf686d47a2d7b1d37a8f3b6bf6003aea"},
-    {file = "numpy-1.26.4-pp39-pypy39_pp73-macosx_10_9_x86_64.whl", hash = "sha256:afedb719a9dcfc7eaf2287b839d8198e06dcd4cb5d276a3df279231138e83d30"},
-    {file = "numpy-1.26.4-pp39-pypy39_pp73-manylinux_2_17_x86_64.manylinux2014_x86_64.whl", hash = "sha256:95a7476c59002f2f6c590b9b7b998306fba6a5aa646b1e22ddfeaf8f78c3a29c"},
-    {file = "numpy-1.26.4-pp39-pypy39_pp73-win_amd64.whl", hash = "sha256:7e50d0a0cc3189f9cb0aeb3a6a6af18c16f59f004b866cd2be1c14b36134a4a0"},
-    {file = "numpy-1.26.4.tar.gz", hash = "sha256:2a02aba9ed12e4ac4eb3ea9421c420301a0c6460d9830d74a9df87efa4912010"},
+python-versions = ">=3.10"
+groups = ["main", "training"]
+files = [
+    {file = "numpy-2.2.6-cp310-cp310-macosx_10_9_x86_64.whl", hash = "sha256:b412caa66f72040e6d268491a59f2c43bf03eb6c96dd8f0307829feb7fa2b6fb"},
+    {file = "numpy-2.2.6-cp310-cp310-macosx_11_0_arm64.whl", hash = "sha256:8e41fd67c52b86603a91c1a505ebaef50b3314de0213461c7a6e99c9a3beff90"},
+    {file = "numpy-2.2.6-cp310-cp310-macosx_14_0_arm64.whl", hash = "sha256:37e990a01ae6ec7fe7fa1c26c55ecb672dd98b19c3d0e1d1f326fa13cb38d163"},
+    {file = "numpy-2.2.6-cp310-cp310-macosx_14_0_x86_64.whl", hash = "sha256:5a6429d4be8ca66d889b7cf70f536a397dc45ba6faeb5f8c5427935d9592e9cf"},
+    {file = "numpy-2.2.6-cp310-cp310-manylinux_2_17_aarch64.manylinux2014_aarch64.whl", hash = "sha256:efd28d4e9cd7d7a8d39074a4d44c63eda73401580c5c76acda2ce969e0a38e83"},
+    {file = "numpy-2.2.6-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl", hash = "sha256:fc7b73d02efb0e18c000e9ad8b83480dfcd5dfd11065997ed4c6747470ae8915"},
+    {file = "numpy-2.2.6-cp310-cp310-musllinux_1_2_aarch64.whl", hash = "sha256:74d4531beb257d2c3f4b261bfb0fc09e0f9ebb8842d82a7b4209415896adc680"},
+    {file = "numpy-2.2.6-cp310-cp310-musllinux_1_2_x86_64.whl", hash = "sha256:8fc377d995680230e83241d8a96def29f204b5782f371c532579b4f20607a289"},
+    {file = "numpy-2.2.6-cp310-cp310-win32.whl", hash = "sha256:b093dd74e50a8cba3e873868d9e93a85b78e0daf2e98c6797566ad8044e8363d"},
+    {file = "numpy-2.2.6-cp310-cp310-win_amd64.whl", hash = "sha256:f0fd6321b839904e15c46e0d257fdd101dd7f530fe03fd6359c1ea63738703f3"},
+    {file = "numpy-2.2.6-cp311-cp311-macosx_10_9_x86_64.whl", hash = "sha256:f9f1adb22318e121c5c69a09142811a201ef17ab257a1e66ca3025065b7f53ae"},
+    {file = "numpy-2.2.6-cp311-cp311-macosx_11_0_arm64.whl", hash = "sha256:c820a93b0255bc360f53eca31a0e676fd1101f673dda8da93454a12e23fc5f7a"},
+    {file = "numpy-2.2.6-cp311-cp311-macosx_14_0_arm64.whl", hash = "sha256:3d70692235e759f260c3d837193090014aebdf026dfd167834bcba43e30c2a42"},
+    {file = "numpy-2.2.6-cp311-cp311-macosx_14_0_x86_64.whl", hash = "sha256:481b49095335f8eed42e39e8041327c05b0f6f4780488f61286ed3c01368d491"},
+    {file = "numpy-2.2.6-cp311-cp311-manylinux_2_17_aarch64.manylinux2014_aarch64.whl", hash = "sha256:b64d8d4d17135e00c8e346e0a738deb17e754230d7e0810ac5012750bbd85a5a"},
+    {file = "numpy-2.2.6-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl", hash = "sha256:ba10f8411898fc418a521833e014a77d3ca01c15b0c6cdcce6a0d2897e6dbbdf"},
+    {file = "numpy-2.2.6-cp311-cp311-musllinux_1_2_aarch64.whl", hash = "sha256:bd48227a919f1bafbdda0583705e547892342c26fb127219d60a5c36882609d1"},
+    {file = "numpy-2.2.6-cp311-cp311-musllinux_1_2_x86_64.whl", hash = "sha256:9551a499bf125c1d4f9e250377c1ee2eddd02e01eac6644c080162c0c51778ab"},
+    {file = "numpy-2.2.6-cp311-cp311-win32.whl", hash = "sha256:0678000bb9ac1475cd454c6b8c799206af8107e310843532b04d49649c717a47"},
+    {file = "numpy-2.2.6-cp311-cp311-win_amd64.whl", hash = "sha256:e8213002e427c69c45a52bbd94163084025f533a55a59d6f9c5b820774ef3303"},
+    {file = "numpy-2.2.6-cp312-cp312-macosx_10_13_x86_64.whl", hash = "sha256:41c5a21f4a04fa86436124d388f6ed60a9343a6f767fced1a8a71c3fbca038ff"},
+    {file = "numpy-2.2.6-cp312-cp312-macosx_11_0_arm64.whl", hash = "sha256:de749064336d37e340f640b05f24e9e3dd678c57318c7289d222a8a2f543e90c"},
+    {file = "numpy-2.2.6-cp312-cp312-macosx_14_0_arm64.whl", hash = "sha256:894b3a42502226a1cac872f840030665f33326fc3dac8e57c607905773cdcde3"},
+    {file = "numpy-2.2.6-cp312-cp312-macosx_14_0_x86_64.whl", hash = "sha256:71594f7c51a18e728451bb50cc60a3ce4e6538822731b2933209a1f3614e9282"},
+    {file = "numpy-2.2.6-cp312-cp312-manylinux_2_17_aarch64.manylinux2014_aarch64.whl", hash = "sha256:f2618db89be1b4e05f7a1a847a9c1c0abd63e63a1607d892dd54668dd92faf87"},
+    {file = "numpy-2.2.6-cp312-cp312-manylinux_2_17_x86_64.manylinux2014_x86_64.whl", hash = "sha256:fd83c01228a688733f1ded5201c678f0c53ecc1006ffbc404db9f7a899ac6249"},
+    {file = "numpy-2.2.6-cp312-cp312-musllinux_1_2_aarch64.whl", hash = "sha256:37c0ca431f82cd5fa716eca9506aefcabc247fb27ba69c5062a6d3ade8cf8f49"},
+    {file = "numpy-2.2.6-cp312-cp312-musllinux_1_2_x86_64.whl", hash = "sha256:fe27749d33bb772c80dcd84ae7e8df2adc920ae8297400dabec45f0dedb3f6de"},
+    {file = "numpy-2.2.6-cp312-cp312-win32.whl", hash = "sha256:4eeaae00d789f66c7a25ac5f34b71a7035bb474e679f410e5e1a94deb24cf2d4"},
+    {file = "numpy-2.2.6-cp312-cp312-win_amd64.whl", hash = "sha256:c1f9540be57940698ed329904db803cf7a402f3fc200bfe599334c9bd84a40b2"},
+    {file = "numpy-2.2.6-cp313-cp313-macosx_10_13_x86_64.whl", hash = "sha256:0811bb762109d9708cca4d0b13c4f67146e3c3b7cf8d34018c722adb2d957c84"},
+    {file = "numpy-2.2.6-cp313-cp313-macosx_11_0_arm64.whl", hash = "sha256:287cc3162b6f01463ccd86be154f284d0893d2b3ed7292439ea97eafa8170e0b"},
+    {file = "numpy-2.2.6-cp313-cp313-macosx_14_0_arm64.whl", hash = "sha256:f1372f041402e37e5e633e586f62aa53de2eac8d98cbfb822806ce4bbefcb74d"},
+    {file = "numpy-2.2.6-cp313-cp313-macosx_14_0_x86_64.whl", hash = "sha256:55a4d33fa519660d69614a9fad433be87e5252f4b03850642f88993f7b2ca566"},
+    {file = "numpy-2.2.6-cp313-cp313-manylinux_2_17_aarch64.manylinux2014_aarch64.whl", hash = "sha256:f92729c95468a2f4f15e9bb94c432a9229d0d50de67304399627a943201baa2f"},
+    {file = "numpy-2.2.6-cp313-cp313-manylinux_2_17_x86_64.manylinux2014_x86_64.whl", hash = "sha256:1bc23a79bfabc5d056d106f9befb8d50c31ced2fbc70eedb8155aec74a45798f"},
+    {file = "numpy-2.2.6-cp313-cp313-musllinux_1_2_aarch64.whl", hash = "sha256:e3143e4451880bed956e706a3220b4e5cf6172ef05fcc397f6f36a550b1dd868"},
+    {file = "numpy-2.2.6-cp313-cp313-musllinux_1_2_x86_64.whl", hash = "sha256:b4f13750ce79751586ae2eb824ba7e1e8dba64784086c98cdbbcc6a42112ce0d"},
+    {file = "numpy-2.2.6-cp313-cp313-win32.whl", hash = "sha256:5beb72339d9d4fa36522fc63802f469b13cdbe4fdab4a288f0c441b74272ebfd"},
+    {file = "numpy-2.2.6-cp313-cp313-win_amd64.whl", hash = "sha256:b0544343a702fa80c95ad5d3d608ea3599dd54d4632df855e4c8d24eb6ecfa1c"},
+    {file = "numpy-2.2.6-cp313-cp313t-macosx_10_13_x86_64.whl", hash = "sha256:0bca768cd85ae743b2affdc762d617eddf3bcf8724435498a1e80132d04879e6"},
+    {file = "numpy-2.2.6-cp313-cp313t-macosx_11_0_arm64.whl", hash = "sha256:fc0c5673685c508a142ca65209b4e79ed6740a4ed6b2267dbba90f34b0b3cfda"},
+    {file = "numpy-2.2.6-cp313-cp313t-macosx_14_0_arm64.whl", hash = "sha256:5bd4fc3ac8926b3819797a7c0e2631eb889b4118a9898c84f585a54d475b7e40"},
+    {file = "numpy-2.2.6-cp313-cp313t-macosx_14_0_x86_64.whl", hash = "sha256:fee4236c876c4e8369388054d02d0e9bb84821feb1a64dd59e137e6511a551f8"},
+    {file = "numpy-2.2.6-cp313-cp313t-manylinux_2_17_aarch64.manylinux2014_aarch64.whl", hash = "sha256:e1dda9c7e08dc141e0247a5b8f49cf05984955246a327d4c48bda16821947b2f"},
+    {file = "numpy-2.2.6-cp313-cp313t-manylinux_2_17_x86_64.manylinux2014_x86_64.whl", hash = "sha256:f447e6acb680fd307f40d3da4852208af94afdfab89cf850986c3ca00562f4fa"},
+    {file = "numpy-2.2.6-cp313-cp313t-musllinux_1_2_aarch64.whl", hash = "sha256:389d771b1623ec92636b0786bc4ae56abafad4a4c513d36a55dce14bd9ce8571"},
+    {file = "numpy-2.2.6-cp313-cp313t-musllinux_1_2_x86_64.whl", hash = "sha256:8e9ace4a37db23421249ed236fdcdd457d671e25146786dfc96835cd951aa7c1"},
+    {file = "numpy-2.2.6-cp313-cp313t-win32.whl", hash = "sha256:038613e9fb8c72b0a41f025a7e4c3f0b7a1b5d768ece4796b674c8f3fe13efff"},
+    {file = "numpy-2.2.6-cp313-cp313t-win_amd64.whl", hash = "sha256:6031dd6dfecc0cf9f668681a37648373bddd6421fff6c66ec1624eed0180ee06"},
+    {file = "numpy-2.2.6-pp310-pypy310_pp73-macosx_10_15_x86_64.whl", hash = "sha256:0b605b275d7bd0c640cad4e5d30fa701a8d59302e127e5f79138ad62762c3e3d"},
+    {file = "numpy-2.2.6-pp310-pypy310_pp73-macosx_14_0_x86_64.whl", hash = "sha256:7befc596a7dc9da8a337f79802ee8adb30a552a94f792b9c9d18c840055907db"},
+    {file = "numpy-2.2.6-pp310-pypy310_pp73-manylinux_2_17_x86_64.manylinux2014_x86_64.whl", hash = "sha256:ce47521a4754c8f4593837384bd3424880629f718d87c5d44f8ed763edd63543"},
+    {file = "numpy-2.2.6-pp310-pypy310_pp73-win_amd64.whl", hash = "sha256:d042d24c90c41b54fd506da306759e06e568864df8ec17ccc17e9e884634fd00"},
+    {file = "numpy-2.2.6.tar.gz", hash = "sha256:e29554e2bef54a90aa5cc07da6ce955accb83f21ab5de01a62c8478897b264fd"},
 ]
 
 [[package]]
 name = "nvidia-cublas-cu12"
-version = "12.1.3.1"
+version = "12.6.4.1"
 description = "CUBLAS native runtime libraries"
-optional = false
+optional = true
 python-versions = ">=3"
 groups = ["main"]
-markers = "platform_system == \"Linux\" and platform_machine == \"x86_64\""
+markers = "platform_machine == \"x86_64\" and extra == \"cuda\" and platform_system == \"Linux\""
 files = [
-    {file = "nvidia_cublas_cu12-12.1.3.1-py3-none-manylinux1_x86_64.whl", hash = "sha256:ee53ccca76a6fc08fb9701aa95b6ceb242cdaab118c3bb152af4e579af792728"},
-    {file = "nvidia_cublas_cu12-12.1.3.1-py3-none-win_amd64.whl", hash = "sha256:2b964d60e8cf11b5e1073d179d85fa340c120e99b3067558f3cf98dd69d02906"},
+    {file = "nvidia_cublas_cu12-12.6.4.1-py3-none-manylinux2014_x86_64.manylinux_2_17_x86_64.whl", hash = "sha256:08ed2686e9875d01b58e3cb379c6896df8e76c75e0d4a7f7dace3d7b6d9ef8eb"},
+    {file = "nvidia_cublas_cu12-12.6.4.1-py3-none-manylinux_2_27_aarch64.whl", hash = "sha256:235f728d6e2a409eddf1df58d5b0921cf80cfa9e72b9f2775ccb7b4a87984668"},
+    {file = "nvidia_cublas_cu12-12.6.4.1-py3-none-win_amd64.whl", hash = "sha256:9e4fa264f4d8a4eb0cdbd34beadc029f453b3bafae02401e999cf3d5a5af75f8"},
 ]
 
 [[package]]
 name = "nvidia-cuda-cupti-cu12"
-version = "12.1.105"
+version = "12.6.80"
 description = "CUDA profiling tools runtime libs."
-optional = false
+optional = true
 python-versions = ">=3"
 groups = ["main"]
-markers = "platform_system == \"Linux\" and platform_machine == \"x86_64\""
+markers = "platform_machine == \"x86_64\" and extra == \"cuda\" and platform_system == \"Linux\""
 files = [
-    {file = "nvidia_cuda_cupti_cu12-12.1.105-py3-none-manylinux1_x86_64.whl", hash = "sha256:e54fde3983165c624cb79254ae9818a456eb6e87a7fd4d56a2352c24ee542d7e"},
-    {file = "nvidia_cuda_cupti_cu12-12.1.105-py3-none-win_amd64.whl", hash = "sha256:bea8236d13a0ac7190bd2919c3e8e6ce1e402104276e6f9694479e48bb0eb2a4"},
+    {file = "nvidia_cuda_cupti_cu12-12.6.80-py3-none-manylinux2014_aarch64.manylinux_2_17_aarch64.whl", hash = "sha256:166ee35a3ff1587f2490364f90eeeb8da06cd867bd5b701bf7f9a02b78bc63fc"},
+    {file = "nvidia_cuda_cupti_cu12-12.6.80-py3-none-manylinux2014_aarch64.whl", hash = "sha256:358b4a1d35370353d52e12f0a7d1769fc01ff74a191689d3870b2123156184c4"},
+    {file = "nvidia_cuda_cupti_cu12-12.6.80-py3-none-manylinux2014_x86_64.manylinux_2_17_x86_64.whl", hash = "sha256:6768bad6cab4f19e8292125e5f1ac8aa7d1718704012a0e3272a6f61c4bce132"},
+    {file = "nvidia_cuda_cupti_cu12-12.6.80-py3-none-manylinux2014_x86_64.whl", hash = "sha256:a3eff6cdfcc6a4c35db968a06fcadb061cbc7d6dde548609a941ff8701b98b73"},
+    {file = "nvidia_cuda_cupti_cu12-12.6.80-py3-none-win_amd64.whl", hash = "sha256:bbe6ae76e83ce5251b56e8c8e61a964f757175682bbad058b170b136266ab00a"},
 ]
 
 [[package]]
 name = "nvidia-cuda-nvrtc-cu12"
-version = "12.1.105"
+version = "12.6.77"
 description = "NVRTC native runtime libraries"
-optional = false
+optional = true
 python-versions = ">=3"
 groups = ["main"]
-markers = "platform_system == \"Linux\" and platform_machine == \"x86_64\""
+markers = "platform_machine == \"x86_64\" and extra == \"cuda\" and platform_system == \"Linux\""
 files = [
-    {file = "nvidia_cuda_nvrtc_cu12-12.1.105-py3-none-manylinux1_x86_64.whl", hash = "sha256:339b385f50c309763ca65456ec75e17bbefcbbf2893f462cb8b90584cd27a1c2"},
-    {file = "nvidia_cuda_nvrtc_cu12-12.1.105-py3-none-win_amd64.whl", hash = "sha256:0a98a522d9ff138b96c010a65e145dc1b4850e9ecb75a0172371793752fd46ed"},
+    {file = "nvidia_cuda_nvrtc_cu12-12.6.77-py3-none-manylinux2014_aarch64.whl", hash = "sha256:5847f1d6e5b757f1d2b3991a01082a44aad6f10ab3c5c0213fa3e25bddc25a13"},
+    {file = "nvidia_cuda_nvrtc_cu12-12.6.77-py3-none-manylinux2014_x86_64.whl", hash = "sha256:35b0cc6ee3a9636d5409133e79273ce1f3fd087abb0532d2d2e8fff1fe9efc53"},
+    {file = "nvidia_cuda_nvrtc_cu12-12.6.77-py3-none-win_amd64.whl", hash = "sha256:f7007dbd914c56bd80ea31bc43e8e149da38f68158f423ba845fc3292684e45a"},
 ]
 
 [[package]]
 name = "nvidia-cuda-runtime-cu12"
-version = "12.1.105"
+version = "12.6.77"
 description = "CUDA Runtime native Libraries"
-optional = false
+optional = true
 python-versions = ">=3"
 groups = ["main"]
-markers = "platform_system == \"Linux\" and platform_machine == \"x86_64\""
+markers = "platform_machine == \"x86_64\" and extra == \"cuda\" and platform_system == \"Linux\""
 files = [
-    {file = "nvidia_cuda_runtime_cu12-12.1.105-py3-none-manylinux1_x86_64.whl", hash = "sha256:6e258468ddf5796e25f1dc591a31029fa317d97a0a94ed93468fc86301d61e40"},
-    {file = "nvidia_cuda_runtime_cu12-12.1.105-py3-none-win_amd64.whl", hash = "sha256:dfb46ef84d73fababab44cf03e3b83f80700d27ca300e537f85f636fac474344"},
+    {file = "nvidia_cuda_runtime_cu12-12.6.77-py3-none-manylinux2014_aarch64.manylinux_2_17_aarch64.whl", hash = "sha256:6116fad3e049e04791c0256a9778c16237837c08b27ed8c8401e2e45de8d60cd"},
+    {file = "nvidia_cuda_runtime_cu12-12.6.77-py3-none-manylinux2014_aarch64.whl", hash = "sha256:d461264ecb429c84c8879a7153499ddc7b19b5f8d84c204307491989a365588e"},
+    {file = "nvidia_cuda_runtime_cu12-12.6.77-py3-none-manylinux2014_x86_64.manylinux_2_17_x86_64.whl", hash = "sha256:ba3b56a4f896141e25e19ab287cd71e52a6a0f4b29d0d31609f60e3b4d5219b7"},
+    {file = "nvidia_cuda_runtime_cu12-12.6.77-py3-none-manylinux2014_x86_64.whl", hash = "sha256:a84d15d5e1da416dd4774cb42edf5e954a3e60cc945698dc1d5be02321c44dc8"},
+    {file = "nvidia_cuda_runtime_cu12-12.6.77-py3-none-win_amd64.whl", hash = "sha256:86c58044c824bf3c173c49a2dbc7a6c8b53cb4e4dca50068be0bf64e9dab3f7f"},
 ]
 
 [[package]]
 name = "nvidia-cudnn-cu12"
-version = "8.9.2.26"
+version = "9.5.1.17"
 description = "cuDNN runtime libraries"
-optional = false
+optional = true
 python-versions = ">=3"
 groups = ["main"]
-markers = "platform_system == \"Linux\" and platform_machine == \"x86_64\""
+markers = "platform_machine == \"x86_64\" and extra == \"cuda\" and platform_system == \"Linux\""
 files = [
-    {file = "nvidia_cudnn_cu12-8.9.2.26-py3-none-manylinux1_x86_64.whl", hash = "sha256:5ccb288774fdfb07a7e7025ffec286971c06d8d7b4fb162525334616d7629ff9"},
+    {file = "nvidia_cudnn_cu12-9.5.1.17-py3-none-manylinux_2_28_aarch64.whl", hash = "sha256:9fd4584468533c61873e5fda8ca41bac3a38bcb2d12350830c69b0a96a7e4def"},
+    {file = "nvidia_cudnn_cu12-9.5.1.17-py3-none-manylinux_2_28_x86_64.whl", hash = "sha256:30ac3869f6db17d170e0e556dd6cc5eee02647abc31ca856634d5a40f82c15b2"},
+    {file = "nvidia_cudnn_cu12-9.5.1.17-py3-none-win_amd64.whl", hash = "sha256:d7af0f8a4f3b4b9dbb3122f2ef553b45694ed9c384d5a75bab197b8eefb79ab8"},
 ]
 
 [package.dependencies]
@@ -2982,41 +3042,53 @@ nvidia-cublas-cu12 = "*"
 
 [[package]]
 name = "nvidia-cufft-cu12"
-version = "11.0.2.54"
+version = "11.3.0.4"
 description = "CUFFT native runtime libraries"
-optional = false
+optional = true
 python-versions = ">=3"
 groups = ["main"]
-markers = "platform_system == \"Linux\" and platform_machine == \"x86_64\""
+markers = "platform_machine == \"x86_64\" and extra == \"cuda\" and platform_system == \"Linux\""
 files = [
-    {file = "nvidia_cufft_cu12-11.0.2.54-py3-none-manylinux1_x86_64.whl", hash = "sha256:794e3948a1aa71fd817c3775866943936774d1c14e7628c74f6f7417224cdf56"},
-    {file = "nvidia_cufft_cu12-11.0.2.54-py3-none-win_amd64.whl", hash = "sha256:d9ac353f78ff89951da4af698f80870b1534ed69993f10a4cf1d96f21357e253"},
+    {file = "nvidia_cufft_cu12-11.3.0.4-py3-none-manylinux2014_aarch64.manylinux_2_17_aarch64.whl", hash = "sha256:d16079550df460376455cba121db6564089176d9bac9e4f360493ca4741b22a6"},
+    {file = "nvidia_cufft_cu12-11.3.0.4-py3-none-manylinux2014_aarch64.whl", hash = "sha256:8510990de9f96c803a051822618d42bf6cb8f069ff3f48d93a8486efdacb48fb"},
+    {file = "nvidia_cufft_cu12-11.3.0.4-py3-none-manylinux2014_x86_64.manylinux_2_17_x86_64.whl", hash = "sha256:ccba62eb9cef5559abd5e0d54ceed2d9934030f51163df018532142a8ec533e5"},
+    {file = "nvidia_cufft_cu12-11.3.0.4-py3-none-manylinux2014_x86_64.whl", hash = "sha256:768160ac89f6f7b459bee747e8d175dbf53619cfe74b2a5636264163138013ca"},
+    {file = "nvidia_cufft_cu12-11.3.0.4-py3-none-win_amd64.whl", hash = "sha256:6048ebddfb90d09d2707efb1fd78d4e3a77cb3ae4dc60e19aab6be0ece2ae464"},
 ]
 
+[package.dependencies]
+nvidia-nvjitlink-cu12 = "*"
+
 [[package]]
 name = "nvidia-curand-cu12"
-version = "10.3.2.106"
+version = "10.3.7.77"
 description = "CURAND native runtime libraries"
-optional = false
+optional = true
 python-versions = ">=3"
 groups = ["main"]
-markers = "platform_system == \"Linux\" and platform_machine == \"x86_64\""
+markers = "platform_machine == \"x86_64\" and extra == \"cuda\" and platform_system == \"Linux\""
 files = [
-    {file = "nvidia_curand_cu12-10.3.2.106-py3-none-manylinux1_x86_64.whl", hash = "sha256:9d264c5036dde4e64f1de8c50ae753237c12e0b1348738169cd0f8a536c0e1e0"},
-    {file = "nvidia_curand_cu12-10.3.2.106-py3-none-win_amd64.whl", hash = "sha256:75b6b0c574c0037839121317e17fd01f8a69fd2ef8e25853d826fec30bdba74a"},
+    {file = "nvidia_curand_cu12-10.3.7.77-py3-none-manylinux2014_aarch64.whl", hash = "sha256:6e82df077060ea28e37f48a3ec442a8f47690c7499bff392a5938614b56c98d8"},
+    {file = "nvidia_curand_cu12-10.3.7.77-py3-none-manylinux2014_x86_64.manylinux_2_17_x86_64.whl", hash = "sha256:a42cd1344297f70b9e39a1e4f467a4e1c10f1da54ff7a85c12197f6c652c8bdf"},
+    {file = "nvidia_curand_cu12-10.3.7.77-py3-none-manylinux2014_x86_64.whl", hash = "sha256:99f1a32f1ac2bd134897fc7a203f779303261268a65762a623bf30cc9fe79117"},
+    {file = "nvidia_curand_cu12-10.3.7.77-py3-none-manylinux_2_27_aarch64.whl", hash = "sha256:7b2ed8e95595c3591d984ea3603dd66fe6ce6812b886d59049988a712ed06b6e"},
+    {file = "nvidia_curand_cu12-10.3.7.77-py3-none-win_amd64.whl", hash = "sha256:6d6d935ffba0f3d439b7cd968192ff068fafd9018dbf1b85b37261b13cfc9905"},
 ]
 
 [[package]]
 name = "nvidia-cusolver-cu12"
-version = "11.4.5.107"
+version = "11.7.1.2"
 description = "CUDA solver native runtime libraries"
-optional = false
+optional = true
 python-versions = ">=3"
 groups = ["main"]
-markers = "platform_system == \"Linux\" and platform_machine == \"x86_64\""
+markers = "platform_machine == \"x86_64\" and extra == \"cuda\" and platform_system == \"Linux\""
 files = [
-    {file = "nvidia_cusolver_cu12-11.4.5.107-py3-none-manylinux1_x86_64.whl", hash = "sha256:8a7ec542f0412294b15072fa7dab71d31334014a69f953004ea7a118206fe0dd"},
-    {file = "nvidia_cusolver_cu12-11.4.5.107-py3-none-win_amd64.whl", hash = "sha256:74e0c3a24c78612192a74fcd90dd117f1cf21dea4822e66d89e8ea80e3cd2da5"},
+    {file = "nvidia_cusolver_cu12-11.7.1.2-py3-none-manylinux2014_aarch64.whl", hash = "sha256:0ce237ef60acde1efc457335a2ddadfd7610b892d94efee7b776c64bb1cac9e0"},
+    {file = "nvidia_cusolver_cu12-11.7.1.2-py3-none-manylinux2014_x86_64.manylinux_2_17_x86_64.whl", hash = "sha256:e9e49843a7707e42022babb9bcfa33c29857a93b88020c4e4434656a655b698c"},
+    {file = "nvidia_cusolver_cu12-11.7.1.2-py3-none-manylinux2014_x86_64.whl", hash = "sha256:6cf28f17f64107a0c4d7802be5ff5537b2130bfc112f25d5a30df227058ca0e6"},
+    {file = "nvidia_cusolver_cu12-11.7.1.2-py3-none-manylinux_2_27_aarch64.whl", hash = "sha256:dbbe4fc38ec1289c7e5230e16248365e375c3673c9c8bac5796e2e20db07f56e"},
+    {file = "nvidia_cusolver_cu12-11.7.1.2-py3-none-win_amd64.whl", hash = "sha256:6813f9d8073f555444a8705f3ab0296d3e1cb37a16d694c5fc8b862a0d8706d7"},
 ]
 
 [package.dependencies]
@@ -3026,40 +3098,57 @@ nvidia-nvjitlink-cu12 = "*"
 
 [[package]]
 name = "nvidia-cusparse-cu12"
-version = "12.1.0.106"
+version = "12.5.4.2"
 description = "CUSPARSE native runtime libraries"
-optional = false
+optional = true
 python-versions = ">=3"
 groups = ["main"]
-markers = "platform_system == \"Linux\" and platform_machine == \"x86_64\""
+markers = "platform_machine == \"x86_64\" and extra == \"cuda\" and platform_system == \"Linux\""
 files = [
-    {file = "nvidia_cusparse_cu12-12.1.0.106-py3-none-manylinux1_x86_64.whl", hash = "sha256:f3b50f42cf363f86ab21f720998517a659a48131e8d538dc02f8768237bd884c"},
-    {file = "nvidia_cusparse_cu12-12.1.0.106-py3-none-win_amd64.whl", hash = "sha256:b798237e81b9719373e8fae8d4f091b70a0cf09d9d85c95a557e11df2d8e9a5a"},
+    {file = "nvidia_cusparse_cu12-12.5.4.2-py3-none-manylinux2014_aarch64.manylinux_2_17_aarch64.whl", hash = "sha256:d25b62fb18751758fe3c93a4a08eff08effedfe4edf1c6bb5afd0890fe88f887"},
+    {file = "nvidia_cusparse_cu12-12.5.4.2-py3-none-manylinux2014_aarch64.whl", hash = "sha256:7aa32fa5470cf754f72d1116c7cbc300b4e638d3ae5304cfa4a638a5b87161b1"},
+    {file = "nvidia_cusparse_cu12-12.5.4.2-py3-none-manylinux2014_x86_64.manylinux_2_17_x86_64.whl", hash = "sha256:7556d9eca156e18184b94947ade0fba5bb47d69cec46bf8660fd2c71a4b48b73"},
+    {file = "nvidia_cusparse_cu12-12.5.4.2-py3-none-manylinux2014_x86_64.whl", hash = "sha256:23749a6571191a215cb74d1cdbff4a86e7b19f1200c071b3fcf844a5bea23a2f"},
+    {file = "nvidia_cusparse_cu12-12.5.4.2-py3-none-win_amd64.whl", hash = "sha256:4acb8c08855a26d737398cba8fb6f8f5045d93f82612b4cfd84645a2332ccf20"},
 ]
 
 [package.dependencies]
 nvidia-nvjitlink-cu12 = "*"
 
+[[package]]
+name = "nvidia-cusparselt-cu12"
+version = "0.6.3"
+description = "NVIDIA cuSPARSELt"
+optional = true
+python-versions = "*"
+groups = ["main"]
+markers = "platform_machine == \"x86_64\" and extra == \"cuda\" and platform_system == \"Linux\""
+files = [
+    {file = "nvidia_cusparselt_cu12-0.6.3-py3-none-manylinux2014_aarch64.whl", hash = "sha256:8371549623ba601a06322af2133c4a44350575f5a3108fb75f3ef20b822ad5f1"},
+    {file = "nvidia_cusparselt_cu12-0.6.3-py3-none-manylinux2014_x86_64.whl", hash = "sha256:e5c8a26c36445dd2e6812f1177978a24e2d37cacce7e090f297a688d1ec44f46"},
+    {file = "nvidia_cusparselt_cu12-0.6.3-py3-none-win_amd64.whl", hash = "sha256:3b325bcbd9b754ba43df5a311488fca11a6b5dc3d11df4d190c000cf1a0765c7"},
+]
+
 [[package]]
 name = "nvidia-nccl-cu12"
-version = "2.19.3"
+version = "2.21.5"
 description = "NVIDIA Collective Communication Library (NCCL) Runtime"
-optional = false
+optional = true
 python-versions = ">=3"
 groups = ["main"]
-markers = "platform_system == \"Linux\" and platform_machine == \"x86_64\""
+markers = "platform_machine == \"x86_64\" and extra == \"cuda\" and platform_system == \"Linux\""
 files = [
-    {file = "nvidia_nccl_cu12-2.19.3-py3-none-manylinux1_x86_64.whl", hash = "sha256:a9734707a2c96443331c1e48c717024aa6678a0e2a4cb66b2c364d18cee6b48d"},
+    {file = "nvidia_nccl_cu12-2.21.5-py3-none-manylinux2014_x86_64.whl", hash = "sha256:8579076d30a8c24988834445f8d633c697d42397e92ffc3f63fa26766d25e0a0"},
 ]
 
 [[package]]
 name = "nvidia-nvjitlink-cu12"
 version = "12.6.85"
 description = "Nvidia JIT LTO Library"
-optional = false
+optional = true
 python-versions = ">=3"
 groups = ["main"]
-markers = "platform_system == \"Linux\" and platform_machine == \"x86_64\""
+markers = "platform_machine == \"x86_64\" and extra == \"cuda\" and platform_system == \"Linux\""
 files = [
     {file = "nvidia_nvjitlink_cu12-12.6.85-py3-none-manylinux2010_x86_64.manylinux_2_12_x86_64.whl", hash = "sha256:eedc36df9e88b682efe4309aa16b5b4e78c2407eac59e8c10a6a47535164369a"},
     {file = "nvidia_nvjitlink_cu12-12.6.85-py3-none-manylinux2014_aarch64.manylinux_2_17_aarch64.whl", hash = "sha256:cf4eaa7d4b6b543ffd69d6abfb11efdeb2db48270d94dfd3a452c24150829e41"},
@@ -3068,15 +3157,18 @@ files = [
 
 [[package]]
 name = "nvidia-nvtx-cu12"
-version = "12.1.105"
+version = "12.6.77"
 description = "NVIDIA Tools Extension"
-optional = false
+optional = true
 python-versions = ">=3"
 groups = ["main"]
-markers = "platform_system == \"Linux\" and platform_machine == \"x86_64\""
+markers = "platform_machine == \"x86_64\" and extra == \"cuda\" and platform_system == \"Linux\""
 files = [
-    {file = "nvidia_nvtx_cu12-12.1.105-py3-none-manylinux1_x86_64.whl", hash = "sha256:dc21cf308ca5691e7c04d962e213f8a4aa9bbfa23d95412f452254c2caeb09e5"},
-    {file = "nvidia_nvtx_cu12-12.1.105-py3-none-win_amd64.whl", hash = "sha256:65f4d98982b31b60026e0e6de73fbdfc09d08a96f4656dd3665ca616a11e1e82"},
+    {file = "nvidia_nvtx_cu12-12.6.77-py3-none-manylinux2014_aarch64.manylinux_2_17_aarch64.whl", hash = "sha256:f44f8d86bb7d5629988d61c8d3ae61dddb2015dee142740536bc7481b022fe4b"},
+    {file = "nvidia_nvtx_cu12-12.6.77-py3-none-manylinux2014_aarch64.whl", hash = "sha256:adcaabb9d436c9761fca2b13959a2d237c5f9fd406c8e4b723c695409ff88059"},
+    {file = "nvidia_nvtx_cu12-12.6.77-py3-none-manylinux2014_x86_64.manylinux_2_17_x86_64.whl", hash = "sha256:b90bed3df379fa79afbd21be8e04a0314336b8ae16768b58f2d34cb1d04cd7d2"},
+    {file = "nvidia_nvtx_cu12-12.6.77-py3-none-manylinux2014_x86_64.whl", hash = "sha256:6574241a3ec5fdc9334353ab8c479fe75841dbe8f4532a8fc97ce63503330ba1"},
+    {file = "nvidia_nvtx_cu12-12.6.77-py3-none-win_amd64.whl", hash = "sha256:2fb11a4af04a5e6c84073e6404d26588a34afd35379f0855a99797897efa75c0"},
 ]
 
 [[package]]
@@ -3140,19 +3232,126 @@ files = [
 
 [package.dependencies]
 numpy = [
-    {version = ">=1.21.4", markers = "python_version >= \"3.10\" and platform_system == \"Darwin\""},
-    {version = ">=1.21.2", markers = "platform_system != \"Darwin\" and python_version >= \"3.10\""},
     {version = ">=1.23.5", markers = "python_version >= \"3.11\""},
     {version = ">=1.26.0", markers = "python_version >= \"3.12\""},
 ]
 
+[[package]]
+name = "optimum-quanto"
+version = "0.2.7"
+description = "A pytorch quantization backend for optimum."
+optional = true
+python-versions = ">=3.9.0"
+groups = ["main"]
+markers = "extra == \"quanto\""
+files = [
+    {file = "optimum_quanto-0.2.7-py3-none-any.whl", hash = "sha256:1369b1d9a4a197f88c0d1c67e8d950694e5b86ce4c9f3878e178d5be35339f61"},
+    {file = "optimum_quanto-0.2.7.tar.gz", hash = "sha256:91b5c2dc8a9100297dc7924a93747fb77ab010784b5e1f6d0208976ba054dade"},
+]
+
+[package.dependencies]
+huggingface_hub = "*"
+ninja = "*"
+numpy = "*"
+safetensors = "*"
+torch = ">=2.6.0"
+
+[package.extras]
+dev = ["pytest", "ruff"]
+examples = ["accelerate", "datasets", "diffusers", "scipy", "sentencepiece", "torchvision", "transformers"]
+
+[[package]]
+name = "orjson"
+version = "3.11.9"
+description = "Fast, correct Python JSON library supporting dataclasses, datetimes, and numpy"
+optional = true
+python-versions = ">=3.10"
+groups = ["main"]
+markers = "extra == \"trackio\""
+files = [
+    {file = "orjson-3.11.9-cp310-cp310-macosx_10_15_x86_64.macosx_11_0_arm64.macosx_10_15_universal2.whl", hash = "sha256:135869ef917b8704ea0a94e01620e0c05021c15c52036e4663baffe75e72f8ce"},
+    {file = "orjson-3.11.9-cp310-cp310-manylinux_2_17_aarch64.manylinux2014_aarch64.whl", hash = "sha256:115ab5f5f4a0f203cc2a5f0fb09aee503a3f771aa08392949ab5ca230c4fbdbd"},
+    {file = "orjson-3.11.9-cp310-cp310-manylinux_2_17_armv7l.manylinux2014_armv7l.whl", hash = "sha256:4da3c38a2083ca4aaf9c2a36776cce3e9328e6647b10d118948f3cfb4913ffe4"},
+    {file = "orjson-3.11.9-cp310-cp310-manylinux_2_17_i686.manylinux2014_i686.whl", hash = "sha256:53b50b0e14084b8f7e29c5ce84c5af0f1160169b30d8a6914231d97d2fe297d4"},
+    {file = "orjson-3.11.9-cp310-cp310-manylinux_2_17_ppc64le.manylinux2014_ppc64le.whl", hash = "sha256:231742b4a11dad8d5380a435962c57e91b7c37b79be858f4ef1c0df1a259897e"},
+    {file = "orjson-3.11.9-cp310-cp310-manylinux_2_17_s390x.manylinux2014_s390x.whl", hash = "sha256:34fd2317602587321faab75ab76c623a0117e80841a6413654f04e47f339a8fb"},
+    {file = "orjson-3.11.9-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl", hash = "sha256:71f3db16e69b667b132e0f305a833d5497da302d801508cbb051ed9a9819da47"},
+    {file = "orjson-3.11.9-cp310-cp310-musllinux_1_2_aarch64.whl", hash = "sha256:0b34789fa0da61cf7bef0546b09c738fb195331e017e477096d129e9105ab03d"},
+    {file = "orjson-3.11.9-cp310-cp310-musllinux_1_2_armv7l.whl", hash = "sha256:87e4d4ab280b0c87424d47695bec2182caf8cfc17879ea78dab76680194abc13"},
+    {file = "orjson-3.11.9-cp310-cp310-musllinux_1_2_i686.whl", hash = "sha256:ace6c58523302d3b97b6ac5c38a5298a54b473762b6be82726b4265c41029f92"},
+    {file = "orjson-3.11.9-cp310-cp310-musllinux_1_2_x86_64.whl", hash = "sha256:97d0d932803c1b164fde11cb542a9efcb1e0f63b184537cca65887147906ff48"},
+    {file = "orjson-3.11.9-cp310-cp310-win32.whl", hash = "sha256:b3afcf569c15577a9fe64627292daa3e6b3a70f4fb77a5df246a87ec21681b94"},
+    {file = "orjson-3.11.9-cp310-cp310-win_amd64.whl", hash = "sha256:8697ab6a080a5c46edaad50e2bc5bd8c7ca5c66442d24104fa44ec74910a8244"},
+    {file = "orjson-3.11.9-cp311-cp311-macosx_10_15_x86_64.macosx_11_0_arm64.macosx_10_15_universal2.whl", hash = "sha256:f01c4818b3fc9b0da8e096722a84318071eaa118df35f6ed2344da0e73a5444f"},
+    {file = "orjson-3.11.9-cp311-cp311-macosx_15_0_arm64.whl", hash = "sha256:3ebca4179031ee716ed076ffadc29428e900512f6fccee8614c9983157fcf19c"},
+    {file = "orjson-3.11.9-cp311-cp311-manylinux_2_17_aarch64.manylinux2014_aarch64.whl", hash = "sha256:48ee05097750de0ff69ed5b7bbcf0732182fd57a24043dcc2a1da780a5ead3a5"},
+    {file = "orjson-3.11.9-cp311-cp311-manylinux_2_17_armv7l.manylinux2014_armv7l.whl", hash = "sha256:a6082706765a95a6680d812e1daf1c0cfe8adec7831b3ff3b625693f3b461b1c"},
+    {file = "orjson-3.11.9-cp311-cp311-manylinux_2_17_i686.manylinux2014_i686.whl", hash = "sha256:277fefe9d76ee17eb14debf399e3533d4d63b5f677a4d3719eb763536af1f4bd"},
+    {file = "orjson-3.11.9-cp311-cp311-manylinux_2_17_ppc64le.manylinux2014_ppc64le.whl", hash = "sha256:03db380e3780fa0015ed776a90f20e8e20bb11dde13b216ce19e5718e3dfba62"},
+    {file = "orjson-3.11.9-cp311-cp311-manylinux_2_17_s390x.manylinux2014_s390x.whl", hash = "sha256:33d7d766701847dc6729846362dc27895d2f2d2251264f9d10e7cb9878194877"},
+    {file = "orjson-3.11.9-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl", hash = "sha256:147302878da387104b66bb4a8b0227d1d487e976ce41a8501916161072ed87b1"},
+    {file = "orjson-3.11.9-cp311-cp311-musllinux_1_2_aarch64.whl", hash = "sha256:3513550321f8c8c811a7c3297b8a630e82dc08e4c10216d07703c997776236cd"},
+    {file = "orjson-3.11.9-cp311-cp311-musllinux_1_2_armv7l.whl", hash = "sha256:c5d001196b89fa9cf0a4ab79766cd835b991a166e4b621ba95089edc50c429ff"},
+    {file = "orjson-3.11.9-cp311-cp311-musllinux_1_2_i686.whl", hash = "sha256:16969c9d369c98eb084889c6e4d2d39b77c7eb38ceccf8da2a9fff62ae908980"},
+    {file = "orjson-3.11.9-cp311-cp311-musllinux_1_2_x86_64.whl", hash = "sha256:63e0efbc991250c0b3143488fa57d95affcabbfc63c99c48d625dd37779aafe2"},
+    {file = "orjson-3.11.9-cp311-cp311-win32.whl", hash = "sha256:14ed654580c1ed2bc217352ec82f91b047aef82951aa71c7f64e0dcb03c0e180"},
+    {file = "orjson-3.11.9-cp311-cp311-win_amd64.whl", hash = "sha256:57ea77fb70a448ce87d18fca050193202a3da5e54598f6501ca5476fb66cfe02"},
+    {file = "orjson-3.11.9-cp311-cp311-win_arm64.whl", hash = "sha256:19b72ed11572a2ee51a67a903afbe5af504f84ed6f529c0fe44b0ab3fb5cc697"},
+    {file = "orjson-3.11.9-cp312-cp312-macosx_10_15_x86_64.macosx_11_0_arm64.macosx_10_15_universal2.whl", hash = "sha256:9ef6fe90aadef185c7b128859f40beb24720b4ecea95379fc9000931179c3a49"},
+    {file = "orjson-3.11.9-cp312-cp312-macosx_15_0_arm64.whl", hash = "sha256:e5c9b8f28e726e97d97696c826bc7bea5d71cecd63576dba92924a32c1961291"},
+    {file = "orjson-3.11.9-cp312-cp312-manylinux_2_17_aarch64.manylinux2014_aarch64.whl", hash = "sha256:26a473dbb4162108b27901492546f83c76fdcea3d0eadff00ae7a07e18dcce09"},
+    {file = "orjson-3.11.9-cp312-cp312-manylinux_2_17_armv7l.manylinux2014_armv7l.whl", hash = "sha256:011382e2a60fda9d46f1cdee31068cfc52ffe952b587d683ec0463002802a0f4"},
+    {file = "orjson-3.11.9-cp312-cp312-manylinux_2_17_i686.manylinux2014_i686.whl", hash = "sha256:c2d3dc759490128c5c1711a53eeaa8ee1d437fd0038ffd2b6008abf46db3f882"},
+    {file = "orjson-3.11.9-cp312-cp312-manylinux_2_17_ppc64le.manylinux2014_ppc64le.whl", hash = "sha256:d8ea516b3726d190e1b4297e6f4e7a8650347ae053868a18163b4dd3641d1fff"},
+    {file = "orjson-3.11.9-cp312-cp312-manylinux_2_17_s390x.manylinux2014_s390x.whl", hash = "sha256:380cdce7ba24989af81d0a7013d0aaec5d0e2a21734c0e2681b1bc4f141957fe"},
+    {file = "orjson-3.11.9-cp312-cp312-manylinux_2_17_x86_64.manylinux2014_x86_64.whl", hash = "sha256:be4fa4f0af7fa18951f7ab3fc2148e223af211bf03f59e1c6034ec3f97f21d61"},
+    {file = "orjson-3.11.9-cp312-cp312-musllinux_1_2_aarch64.whl", hash = "sha256:a8f5f8bc7ce7d59f08d9f99fa510c06496164a24cb5f3d34537dbd9ca30132e2"},
+    {file = "orjson-3.11.9-cp312-cp312-musllinux_1_2_armv7l.whl", hash = "sha256:4d7fde5501b944f83b3e665e1b31343ff6e154b15560a16b7130ea1e594a4206"},
+    {file = "orjson-3.11.9-cp312-cp312-musllinux_1_2_i686.whl", hash = "sha256:cde1a448023ba7d5bb4c01c5afb48894380b5e4956e0627266526587ef4e535f"},
+    {file = "orjson-3.11.9-cp312-cp312-musllinux_1_2_x86_64.whl", hash = "sha256:71e63adb0e1f1ed5d9e168f50a91ceb93ae6420731d222dc7da5c69409aa47aa"},
+    {file = "orjson-3.11.9-cp312-cp312-win32.whl", hash = "sha256:2d057a602cdd19a0ad680417527c45b6961a095081c0f46fe0e03e304aac6470"},
+    {file = "orjson-3.11.9-cp312-cp312-win_amd64.whl", hash = "sha256:59e403b1cc5a676da8eaf31f6254801b7341b3e29efa85f92b48d272637e77be"},
+    {file = "orjson-3.11.9-cp312-cp312-win_arm64.whl", hash = "sha256:9af678d6488357948f1f84c6cd1c1d397c014e1ae2f98ae082a44eb48f602624"},
+    {file = "orjson-3.11.9-cp313-cp313-macosx_10_15_x86_64.macosx_11_0_arm64.macosx_10_15_universal2.whl", hash = "sha256:4bab1b2d6141fe7b32ae71dac905666ece4f94936efbfb13d55bb7739a3a6021"},
+    {file = "orjson-3.11.9-cp313-cp313-macosx_15_0_arm64.whl", hash = "sha256:844417969855fc7a41be124aafe83dc424592a7f77cd4501900c67307122b92c"},
+    {file = "orjson-3.11.9-cp313-cp313-manylinux_2_17_aarch64.manylinux2014_aarch64.whl", hash = "sha256:ffe02797b5e9f3a9d8292ddcd289b474ad13e81ad83cd1891a240811f1d2cb81"},
+    {file = "orjson-3.11.9-cp313-cp313-manylinux_2_17_armv7l.manylinux2014_armv7l.whl", hash = "sha256:0e4eed3b200023042814d2fc8a5d2e880f13b52e1ed2485e83da4f3962f7dc1a"},
+    {file = "orjson-3.11.9-cp313-cp313-manylinux_2_17_i686.manylinux2014_i686.whl", hash = "sha256:8aff7da9952a5ad1cef8e68017724d96c7b9a66e99e91d6252e1b133d67a7b10"},
+    {file = "orjson-3.11.9-cp313-cp313-manylinux_2_17_ppc64le.manylinux2014_ppc64le.whl", hash = "sha256:4d4e98d6f3b8afed8bc8cd9718ec0cdf46661826beefb53fe8eafb37f2bf0362"},
+    {file = "orjson-3.11.9-cp313-cp313-manylinux_2_17_s390x.manylinux2014_s390x.whl", hash = "sha256:3a81d52442a7c99b3662333235b3adf96a1715864658b35bb797212be7bddb97"},
+    {file = "orjson-3.11.9-cp313-cp313-manylinux_2_17_x86_64.manylinux2014_x86_64.whl", hash = "sha256:4e39364e726a8fff737309aff059ff67d8a8c8d5b677be7bb49a8b3e84b7e218"},
+    {file = "orjson-3.11.9-cp313-cp313-musllinux_1_2_aarch64.whl", hash = "sha256:4fd66214623f1b17501df9f0543bef0b833979ab5b6ded1e1d123222866aa8c9"},
+    {file = "orjson-3.11.9-cp313-cp313-musllinux_1_2_armv7l.whl", hash = "sha256:8ecc30f10465fa1e0ce13fd01d9e22c316e5053a719a8d915d4545a09a5ff677"},
+    {file = "orjson-3.11.9-cp313-cp313-musllinux_1_2_i686.whl", hash = "sha256:97db4c94a7db398a5bd636273324f0b3fd58b350bbbac8bb380ceb825a9b40f4"},
+    {file = "orjson-3.11.9-cp313-cp313-musllinux_1_2_x86_64.whl", hash = "sha256:9f78cf8fec5bd627f4082b8dfeac7871b43d7f3274904492a43dab39f18a19a0"},
+    {file = "orjson-3.11.9-cp313-cp313-win32.whl", hash = "sha256:d4087e5c0209a0a8efe4de3303c234b9c44d1174161dcd851e8eea07c7560b32"},
+    {file = "orjson-3.11.9-cp313-cp313-win_amd64.whl", hash = "sha256:051b102c93b4f634e89f3866b07b9a9a98915ada541f4ec30f177067b2694979"},
+    {file = "orjson-3.11.9-cp313-cp313-win_arm64.whl", hash = "sha256:cce9127885941bd28f080cecf1f1d288336b7e0d812c345b08be88b572796254"},
+    {file = "orjson-3.11.9-cp314-cp314-macosx_10_15_x86_64.macosx_11_0_arm64.macosx_10_15_universal2.whl", hash = "sha256:b6ef1979adc4bc243523f1a2ba91418030a8e29b0a99cbe7e0e2d6807d4dce6e"},
+    {file = "orjson-3.11.9-cp314-cp314-macosx_15_0_arm64.whl", hash = "sha256:f36b7f32c7c0db4a719f1fc5824db4a9c6f8bd1a354debb91faf26ebf3a4c71e"},
+    {file = "orjson-3.11.9-cp314-cp314-manylinux_2_17_aarch64.manylinux2014_aarch64.whl", hash = "sha256:08f4d8ebb44925c794e535b2bebc507cebf32209df81de22ae285fb0d8d66de0"},
+    {file = "orjson-3.11.9-cp314-cp314-manylinux_2_17_armv7l.manylinux2014_armv7l.whl", hash = "sha256:6cc7923789694fd58f001cbcac7e47abc13af4d560ebbfcf3b41a8b1a0748124"},
+    {file = "orjson-3.11.9-cp314-cp314-manylinux_2_17_i686.manylinux2014_i686.whl", hash = "sha256:ea5c46eb2d3af39e806b986f4b09d5c2706a1f5afde3cbf7544ce6616127173c"},
+    {file = "orjson-3.11.9-cp314-cp314-manylinux_2_17_ppc64le.manylinux2014_ppc64le.whl", hash = "sha256:f5d89a2ed90731df3be64bab0aa44f78bff39fdc9d71c291f4a8023aa46425b7"},
+    {file = "orjson-3.11.9-cp314-cp314-manylinux_2_17_s390x.manylinux2014_s390x.whl", hash = "sha256:25e4aed0312d292c09f61af25bba34e0b2c88546041472b09088c39a4d828af1"},
+    {file = "orjson-3.11.9-cp314-cp314-manylinux_2_17_x86_64.manylinux2014_x86_64.whl", hash = "sha256:aaea64f3f467d22e70eeed68bdccb3bc4f83f650446c4a03c59f2cba28a108db"},
+    {file = "orjson-3.11.9-cp314-cp314-musllinux_1_2_aarch64.whl", hash = "sha256:a028425d1b440c5d92a6be1e1a020739dfe67ea87d96c6dbe828c1b30041728b"},
+    {file = "orjson-3.11.9-cp314-cp314-musllinux_1_2_armv7l.whl", hash = "sha256:5b192c6cf397e4455b11523c5cf2b18ed084c1bbd61b6c0926344d2129481972"},
+    {file = "orjson-3.11.9-cp314-cp314-musllinux_1_2_i686.whl", hash = "sha256:ea407d4ccf5891d667d045fecae97a7a1e5e87b3b97f97ae1803c2e741130be0"},
+    {file = "orjson-3.11.9-cp314-cp314-musllinux_1_2_x86_64.whl", hash = "sha256:5f63aaf97afd9f6dec5b1a68e1b8da12bfccb4cb9a9a65c3e0b6c847849e7586"},
+    {file = "orjson-3.11.9-cp314-cp314-win32.whl", hash = "sha256:e30ab17845bb9fa54ccf67fa4f9f5282652d54faa6d17452f47d0f369d038673"},
+    {file = "orjson-3.11.9-cp314-cp314-win_amd64.whl", hash = "sha256:32ef5f4283a3be81913947d19608eacb7c6608026851123790cd9cc8982af34b"},
+    {file = "orjson-3.11.9-cp314-cp314-win_arm64.whl", hash = "sha256:eebdbdeef0094e4f5aefa20dcd4eb2368ab5e7a3b4edea27f1e7b2892e009cf9"},
+    {file = "orjson-3.11.9.tar.gz", hash = "sha256:4fef17e1f8722c11587a6ef18e35902450221da0028e65dbaaa543619e68e48f"},
+]
+
 [[package]]
 name = "packaging"
 version = "24.1"
 description = "Core utilities for Python packages"
 optional = false
 python-versions = ">=3.8"
-groups = ["main", "dev"]
+groups = ["main", "dev", "training"]
 files = [
     {file = "packaging-24.1-py3-none-any.whl", hash = "sha256:5b8f2217dbdbd2f7f384c41c628544e6d52f2d0f53c6d0c3ea61aa5d1d7ff124"},
     {file = "packaging-24.1.tar.gz", hash = "sha256:026ed72c8ed3fcce5bf8950572258698927fd1dbda10a5e981cdf0ac37f4f002"},
@@ -3164,7 +3363,7 @@ version = "2.2.2"
 description = "Powerful data structures for data analysis, time series, and statistics"
 optional = false
 python-versions = ">=3.9"
-groups = ["main"]
+groups = ["main", "training"]
 files = [
     {file = "pandas-2.2.2-cp310-cp310-macosx_10_9_x86_64.whl", hash = "sha256:90c6fca2acf139569e74e8781709dccb6fe25940488755716d1d354d6bc58bce"},
     {file = "pandas-2.2.2-cp310-cp310-macosx_11_0_arm64.whl", hash = "sha256:c7adfc142dac335d8c1e0dcbd37eb8617eac386596eb9e1a1b77791cf2498238"},
@@ -3196,10 +3395,10 @@ files = [
     {file = "pandas-2.2.2-cp39-cp39-win_amd64.whl", hash = "sha256:640cef9aa381b60e296db324337a554aeeb883ead99dc8f6c18e81a93942f5f4"},
     {file = "pandas-2.2.2.tar.gz", hash = "sha256:9e79019aba43cb4fda9e4d983f8e88ca0373adbb697ae9c6c43093218de28b54"},
 ]
+markers = {main = "extra == \"trackio\""}
 
 [package.dependencies]
 numpy = [
-    {version = ">=1.22.4", markers = "python_version < \"3.11\""},
     {version = ">=1.23.2", markers = "python_version == \"3.11\""},
     {version = ">=1.26.0", markers = "python_version >= \"3.12\""},
 ]
@@ -3228,92 +3427,58 @@ plot = ["matplotlib (>=3.6.3)"]
 postgresql = ["SQLAlchemy (>=2.0.0)", "adbc-driver-postgresql (>=0.8.0)", "psycopg2 (>=2.9.6)"]
 pyarrow = ["pyarrow (>=10.0.1)"]
 spss = ["pyreadstat (>=1.2.0)"]
-sql-other = ["SQLAlchemy (>=2.0.0)", "adbc-driver-postgresql (>=0.8.0)", "adbc-driver-sqlite (>=0.8.0)"]
-test = ["hypothesis (>=6.46.1)", "pytest (>=7.3.2)", "pytest-xdist (>=2.2.0)"]
-xml = ["lxml (>=4.9.2)"]
-
-[[package]]
-name = "paramiko"
-version = "3.5.0"
-description = "SSH2 protocol library"
-optional = false
-python-versions = ">=3.6"
-groups = ["main"]
-files = [
-    {file = "paramiko-3.5.0-py3-none-any.whl", hash = "sha256:1fedf06b085359051cd7d0d270cebe19e755a8a921cc2ddbfa647fb0cd7d68f9"},
-    {file = "paramiko-3.5.0.tar.gz", hash = "sha256:ad11e540da4f55cedda52931f1a3f812a8238a7af7f62a60de538cd80bb28124"},
-]
-
-[package.dependencies]
-bcrypt = ">=3.2"
-cryptography = ">=3.3"
-pynacl = ">=1.5"
-
-[package.extras]
-all = ["gssapi (>=1.4.1) ; platform_system != \"Windows\"", "invoke (>=2.0)", "pyasn1 (>=0.1.7)", "pywin32 (>=2.1.8) ; platform_system == \"Windows\""]
-gssapi = ["gssapi (>=1.4.1) ; platform_system != \"Windows\"", "pyasn1 (>=0.1.7)", "pywin32 (>=2.1.8) ; platform_system == \"Windows\""]
-invoke = ["invoke (>=2.0)"]
-
-[[package]]
-name = "parso"
-version = "0.8.4"
-description = "A Python Parser"
-optional = false
-python-versions = ">=3.6"
-groups = ["main"]
-files = [
-    {file = "parso-0.8.4-py2.py3-none-any.whl", hash = "sha256:a418670a20291dacd2dddc80c377c5c3791378ee1e8d12bffc35420643d43f18"},
-    {file = "parso-0.8.4.tar.gz", hash = "sha256:eb3a7b58240fb99099a345571deecc0f9540ea5f4dd2fe14c2a99d6b281ab92d"},
-]
-
-[package.extras]
-qa = ["flake8 (==5.0.4)", "mypy (==0.971)", "types-setuptools (==67.2.0.1)"]
-testing = ["docopt", "pytest"]
+sql-other = ["SQLAlchemy (>=2.0.0)", "adbc-driver-postgresql (>=0.8.0)", "adbc-driver-sqlite (>=0.8.0)"]
+test = ["hypothesis (>=6.46.1)", "pytest (>=7.3.2)", "pytest-xdist (>=2.2.0)"]
+xml = ["lxml (>=4.9.2)"]
 
 [[package]]
-name = "pastedeploy"
-version = "3.1.0"
-description = "Load, configure, and compose WSGI applications and servers"
+name = "parso"
+version = "0.8.7"
+description = "A Python Parser"
 optional = false
-python-versions = ">=3.7"
-groups = ["main"]
+python-versions = ">=3.6"
+groups = ["dev"]
 files = [
-    {file = "PasteDeploy-3.1.0-py3-none-any.whl", hash = "sha256:76388ad53a661448d436df28c798063108f70e994ddc749540d733cdbd1b38cf"},
-    {file = "PasteDeploy-3.1.0.tar.gz", hash = "sha256:9ddbaf152f8095438a9fe81f82c78a6714b92ae8e066bed418b6a7ff6a095a95"},
+    {file = "parso-0.8.7-py2.py3-none-any.whl", hash = "sha256:a8926eb2a1b915486941fdbd31e86a4baf88fe8c210f25f2f35ecec5b574ca1c"},
+    {file = "parso-0.8.7.tar.gz", hash = "sha256:eaaac4c9fdd5e9e8852dc778d2d7405897ec510f2a298071453e5e3a07914bb1"},
 ]
 
 [package.extras]
-docs = ["Sphinx (>=1.7.5)", "pylons-sphinx-themes"]
-paste = ["Paste"]
-testing = ["Paste", "pytest", "pytest-cov"]
+qa = ["flake8 (==5.0.4)", "types-setuptools (==67.2.0.1)", "zuban (==0.5.1)"]
+testing = ["docopt", "pytest"]
 
 [[package]]
 name = "pathspec"
-version = "0.12.1"
+version = "1.1.1"
 description = "Utility library for gitignore style pattern matching of file paths."
 optional = false
-python-versions = ">=3.8"
+python-versions = ">=3.9"
 groups = ["dev"]
 files = [
-    {file = "pathspec-0.12.1-py3-none-any.whl", hash = "sha256:a0d503e138a4c123b27490a4f7beda6a01c6f288df0e4a8b79c7eb0dc7b4cc08"},
-    {file = "pathspec-0.12.1.tar.gz", hash = "sha256:a482d51503a1ab33b1c67a6c3813a26953dbdc71c31dacaef9a838c4e29f5712"},
+    {file = "pathspec-1.1.1-py3-none-any.whl", hash = "sha256:a00ce642f577bf7f473932318056212bc4f8bfdf53128c78bbd5af0b9b20b189"},
+    {file = "pathspec-1.1.1.tar.gz", hash = "sha256:17db5ecd524104a120e173814c90367a96a98d07c45b2e10c2f3919fff91bf5a"},
 ]
 
+[package.extras]
+hyperscan = ["hyperscan (>=0.7)"]
+optional = ["typing-extensions (>=4)"]
+re2 = ["google-re2 (>=1.1)"]
+
 [[package]]
 name = "peft"
-version = "0.12.0"
+version = "0.17.1"
 description = "Parameter-Efficient Fine-Tuning (PEFT)"
 optional = false
-python-versions = ">=3.8.0"
+python-versions = ">=3.9.0"
 groups = ["main"]
 files = [
-    {file = "peft-0.12.0-py3-none-any.whl", hash = "sha256:a47915efb08af50e9fda267b7bf1b5b6eff33ccbb08791bdb544dccb8788f674"},
-    {file = "peft-0.12.0.tar.gz", hash = "sha256:253205bd478e985ccdc7f04804aab9c95f479130c517bf6e474b8d509db5f4a4"},
+    {file = "peft-0.17.1-py3-none-any.whl", hash = "sha256:3d129d64def3d74779c32a080d2567e5f7b674e77d546e3585138216d903f99e"},
+    {file = "peft-0.17.1.tar.gz", hash = "sha256:e6002b42517976c290b3b8bbb9829a33dd5d470676b2dec7cb4df8501b77eb9f"},
 ]
 
 [package.dependencies]
 accelerate = ">=0.21.0"
-huggingface-hub = ">=0.17.0"
+huggingface_hub = ">=0.25.0"
 numpy = ">=1.17"
 packaging = ">=20.0"
 psutil = "*"
@@ -3324,10 +3489,10 @@ tqdm = "*"
 transformers = "*"
 
 [package.extras]
-dev = ["black", "hf-doc-builder", "ruff (>=0.4.8,<0.5.0)"]
+dev = ["black", "black", "hf-doc-builder", "hf-doc-builder", "ruff (>=0.9.2,<0.10.0)"]
 docs-specific = ["black", "hf-doc-builder"]
-quality = ["black", "hf-doc-builder", "ruff (>=0.4.8,<0.5.0)"]
-test = ["black", "datasets", "diffusers (<0.21.0)", "hf-doc-builder", "parameterized", "pytest", "pytest-cov", "pytest-xdist", "ruff (>=0.4.8,<0.5.0)", "scipy"]
+quality = ["black", "hf-doc-builder", "ruff (>=0.9.2,<0.10.0)"]
+test = ["black", "black", "datasets", "diffusers", "hf-doc-builder", "hf-doc-builder", "parameterized", "protobuf", "pytest", "pytest-cov", "pytest-xdist", "ruff (>=0.9.2,<0.10.0)", "scipy", "sentencepiece"]
 
 [[package]]
 name = "pillow"
@@ -3335,7 +3500,7 @@ version = "10.4.0"
 description = "Python Imaging Library (Fork)"
 optional = false
 python-versions = ">=3.8"
-groups = ["main"]
+groups = ["main", "training"]
 files = [
     {file = "pillow-10.4.0-cp310-cp310-macosx_10_10_x86_64.whl", hash = "sha256:4d9667937cfa347525b319ae34375c37b9ee6b525440f3ef48542fcf66f2731e"},
     {file = "pillow-10.4.0-cp310-cp310-macosx_11_0_arm64.whl", hash = "sha256:543f3dc61c18dafb755773efc89aae60d06b6596a63914107f75459cf984164d"},
@@ -3429,47 +3594,70 @@ xmp = ["defusedxml"]
 
 [[package]]
 name = "platformdirs"
-version = "4.3.6"
+version = "4.10.0"
 description = "A small Python package for determining appropriate platform-specific dirs, e.g. a `user data dir`."
 optional = false
+python-versions = ">=3.10"
+groups = ["dev"]
+files = [
+    {file = "platformdirs-4.10.0-py3-none-any.whl", hash = "sha256:fb516cdb12eb0d857d0cd85a7c57cea4d060bee4578d6cf5a14dfdf8cbf8784a"},
+    {file = "platformdirs-4.10.0.tar.gz", hash = "sha256:31e761a6a0ca04faf7353ea759bdba55652be214725111e5aac52dfa29d4bef7"},
+]
+
+[[package]]
+name = "plotly"
+version = "6.8.0"
+description = "An open-source interactive data visualization library for Python"
+optional = true
 python-versions = ">=3.8"
-groups = ["main", "dev"]
+groups = ["main"]
+markers = "extra == \"trackio\""
 files = [
-    {file = "platformdirs-4.3.6-py3-none-any.whl", hash = "sha256:73e575e1408ab8103900836b97580d5307456908a03e92031bab39e4554cc3fb"},
-    {file = "platformdirs-4.3.6.tar.gz", hash = "sha256:357fb2acbc885b0419afd3ce3ed34564c13c9b95c89360cd9563f73aa5e2b907"},
+    {file = "plotly-6.8.0-py3-none-any.whl", hash = "sha256:13c5c4a0f70b74cab1913eda0de49b826df5931708eb6f9c3010040614700ec8"},
+    {file = "plotly-6.8.0.tar.gz", hash = "sha256:e088e7ddc68d4f70e3d66659224727a45296d71d2b8284181862d3d8f1f0d88f"},
 ]
 
+[package.dependencies]
+narwhals = ">=1.15.1"
+packaging = "*"
+
 [package.extras]
-docs = ["furo (>=2024.8.6)", "proselint (>=0.14)", "sphinx (>=8.0.2)", "sphinx-autodoc-typehints (>=2.4)"]
-test = ["appdirs (==1.4.4)", "covdefaults (>=2.3)", "pytest (>=8.3.2)", "pytest-cov (>=5)", "pytest-mock (>=3.14)"]
-type = ["mypy (>=1.11.2)"]
+dev = ["anywidget", "build", "colorcet", "fiona (<=1.9.6) ; python_version <= \"3.8\"", "geopandas", "inflect", "jupyterlab", "kaleido (>=1.3.0)", "numpy (>=1.22)", "orjson", "pandas", "pdfrw", "pillow", "plotly-geo", "polars[timezone]", "pyarrow", "pyshp", "pytest", "pytz", "requests", "ruff (==0.11.12)", "scikit-image", "scipy", "shapely", "statsmodels", "vaex ; python_version <= \"3.9\"", "xarray"]
+dev-build = ["build", "jupyterlab", "pytest", "requests", "ruff (==0.11.12)"]
+dev-core = ["pytest", "requests", "ruff (==0.11.12)"]
+dev-optional = ["anywidget", "build", "colorcet", "fiona (<=1.9.6) ; python_version <= \"3.8\"", "geopandas", "inflect", "jupyterlab", "kaleido (>=1.3.0)", "numpy (>=1.22)", "orjson", "pandas", "pdfrw", "pillow", "plotly-geo", "polars[timezone]", "pyarrow", "pyshp", "pytest", "pytz", "requests", "ruff (==0.11.12)", "scikit-image", "scipy", "shapely", "statsmodels", "vaex ; python_version <= \"3.9\"", "xarray"]
+dev-pandas1 = ["numpy (>=1,<2)", "pandas (>=1,<2)", "setuptools (<82)"]
+dev-pandas2 = ["pandas (>=2,<3)"]
+dev-pandas3 = ["pandas (>=3) ; python_version >= \"3.11\""]
+express = ["numpy (>=1.22)"]
+kaleido = ["kaleido (>=1.3.0)"]
 
 [[package]]
 name = "pluggy"
-version = "1.5.0"
+version = "1.6.0"
 description = "plugin and hook calling mechanisms for python"
 optional = false
-python-versions = ">=3.8"
-groups = ["main", "dev"]
+python-versions = ">=3.9"
+groups = ["dev"]
 files = [
-    {file = "pluggy-1.5.0-py3-none-any.whl", hash = "sha256:44e1ad92c8ca002de6377e165f3e0f1be63266ab4d554740532335b9d75ea669"},
-    {file = "pluggy-1.5.0.tar.gz", hash = "sha256:2cffa88e94fdc978c4c574f15f9e59b7f4201d439195c3715ca9e2486f1d0cf1"},
+    {file = "pluggy-1.6.0-py3-none-any.whl", hash = "sha256:e920276dd6813095e9377c0bc5566d94c932c33b27a3e3945d8389c374dd4746"},
+    {file = "pluggy-1.6.0.tar.gz", hash = "sha256:7dcc130b76258d33b90f61b658791dede3486c3e6bfb003ee5c9bfb396dd22f3"},
 ]
 
 [package.extras]
 dev = ["pre-commit", "tox"]
-testing = ["pytest", "pytest-benchmark"]
+testing = ["coverage", "pytest", "pytest-benchmark"]
 
 [[package]]
 name = "pre-commit"
-version = "4.1.0"
+version = "4.6.0"
 description = "A framework for managing and maintaining multi-language pre-commit hooks."
 optional = false
-python-versions = ">=3.9"
-groups = ["main", "dev"]
+python-versions = ">=3.10"
+groups = ["dev"]
 files = [
-    {file = "pre_commit-4.1.0-py2.py3-none-any.whl", hash = "sha256:d29e7cb346295bcc1cc75fc3e92e343495e3ea0196c9ec6ba53f49f10ab6ae7b"},
-    {file = "pre_commit-4.1.0.tar.gz", hash = "sha256:ae3f018575a588e30dfddfab9a05448bfbd6b73d78709617b5a2b853549716d4"},
+    {file = "pre_commit-4.6.0-py2.py3-none-any.whl", hash = "sha256:e2cf246f7299edcabcf15f9b0571fdce06058527f0a06535068a86d38089f29b"},
+    {file = "pre_commit-4.6.0.tar.gz", hash = "sha256:718d2208cef53fdc38206e40524a6d4d9576d103eb16f0fec11c875e7716e9d9"},
 ]
 
 [package.dependencies]
@@ -3481,14 +3669,14 @@ virtualenv = ">=20.10.0"
 
 [[package]]
 name = "proglog"
-version = "0.1.10"
+version = "0.1.12"
 description = "Log and progress bar manager for console, notebooks, web..."
 optional = false
 python-versions = "*"
 groups = ["main"]
 files = [
-    {file = "proglog-0.1.10-py3-none-any.whl", hash = "sha256:19d5da037e8c813da480b741e3fa71fb1ac0a5b02bf21c41577c7f327485ec50"},
-    {file = "proglog-0.1.10.tar.gz", hash = "sha256:658c28c9c82e4caeb2f25f488fff9ceace22f8d69b15d0c1c86d64275e4ddab4"},
+    {file = "proglog-0.1.12-py3-none-any.whl", hash = "sha256:ccaafce51e80a81c65dc907a460c07ccb8ec1f78dc660cfd8f9ec3a22f01b84c"},
+    {file = "proglog-0.1.12.tar.gz", hash = "sha256:361ee074721c277b89b75c061336cb8c5f287c92b043efa562ccf7866cda931c"},
 ]
 
 [package.dependencies]
@@ -3496,94 +3684,133 @@ tqdm = "*"
 
 [[package]]
 name = "propcache"
-version = "0.2.1"
+version = "0.5.2"
 description = "Accelerated property cache"
 optional = false
-python-versions = ">=3.9"
-groups = ["main"]
-files = [
-    {file = "propcache-0.2.1-cp310-cp310-macosx_10_9_universal2.whl", hash = "sha256:6b3f39a85d671436ee3d12c017f8fdea38509e4f25b28eb25877293c98c243f6"},
-    {file = "propcache-0.2.1-cp310-cp310-macosx_10_9_x86_64.whl", hash = "sha256:39d51fbe4285d5db5d92a929e3e21536ea3dd43732c5b177c7ef03f918dff9f2"},
-    {file = "propcache-0.2.1-cp310-cp310-macosx_11_0_arm64.whl", hash = "sha256:6445804cf4ec763dc70de65a3b0d9954e868609e83850a47ca4f0cb64bd79fea"},
-    {file = "propcache-0.2.1-cp310-cp310-manylinux_2_17_aarch64.manylinux2014_aarch64.whl", hash = "sha256:f9479aa06a793c5aeba49ce5c5692ffb51fcd9a7016e017d555d5e2b0045d212"},
-    {file = "propcache-0.2.1-cp310-cp310-manylinux_2_17_ppc64le.manylinux2014_ppc64le.whl", hash = "sha256:d9631c5e8b5b3a0fda99cb0d29c18133bca1e18aea9effe55adb3da1adef80d3"},
-    {file = "propcache-0.2.1-cp310-cp310-manylinux_2_17_s390x.manylinux2014_s390x.whl", hash = "sha256:3156628250f46a0895f1f36e1d4fbe062a1af8718ec3ebeb746f1d23f0c5dc4d"},
-    {file = "propcache-0.2.1-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl", hash = "sha256:6b6fb63ae352e13748289f04f37868099e69dba4c2b3e271c46061e82c745634"},
-    {file = "propcache-0.2.1-cp310-cp310-manylinux_2_5_i686.manylinux1_i686.manylinux_2_17_i686.manylinux2014_i686.whl", hash = "sha256:887d9b0a65404929641a9fabb6452b07fe4572b269d901d622d8a34a4e9043b2"},
-    {file = "propcache-0.2.1-cp310-cp310-musllinux_1_2_aarch64.whl", hash = "sha256:a96dc1fa45bd8c407a0af03b2d5218392729e1822b0c32e62c5bf7eeb5fb3958"},
-    {file = "propcache-0.2.1-cp310-cp310-musllinux_1_2_armv7l.whl", hash = "sha256:a7e65eb5c003a303b94aa2c3852ef130230ec79e349632d030e9571b87c4698c"},
-    {file = "propcache-0.2.1-cp310-cp310-musllinux_1_2_i686.whl", hash = "sha256:999779addc413181912e984b942fbcc951be1f5b3663cd80b2687758f434c583"},
-    {file = "propcache-0.2.1-cp310-cp310-musllinux_1_2_ppc64le.whl", hash = "sha256:19a0f89a7bb9d8048d9c4370c9c543c396e894c76be5525f5e1ad287f1750ddf"},
-    {file = "propcache-0.2.1-cp310-cp310-musllinux_1_2_s390x.whl", hash = "sha256:1ac2f5fe02fa75f56e1ad473f1175e11f475606ec9bd0be2e78e4734ad575034"},
-    {file = "propcache-0.2.1-cp310-cp310-musllinux_1_2_x86_64.whl", hash = "sha256:574faa3b79e8ebac7cb1d7930f51184ba1ccf69adfdec53a12f319a06030a68b"},
-    {file = "propcache-0.2.1-cp310-cp310-win32.whl", hash = "sha256:03ff9d3f665769b2a85e6157ac8b439644f2d7fd17615a82fa55739bc97863f4"},
-    {file = "propcache-0.2.1-cp310-cp310-win_amd64.whl", hash = "sha256:2d3af2e79991102678f53e0dbf4c35de99b6b8b58f29a27ca0325816364caaba"},
-    {file = "propcache-0.2.1-cp311-cp311-macosx_10_9_universal2.whl", hash = "sha256:1ffc3cca89bb438fb9c95c13fc874012f7b9466b89328c3c8b1aa93cdcfadd16"},
-    {file = "propcache-0.2.1-cp311-cp311-macosx_10_9_x86_64.whl", hash = "sha256:f174bbd484294ed9fdf09437f889f95807e5f229d5d93588d34e92106fbf6717"},
-    {file = "propcache-0.2.1-cp311-cp311-macosx_11_0_arm64.whl", hash = "sha256:70693319e0b8fd35dd863e3e29513875eb15c51945bf32519ef52927ca883bc3"},
-    {file = "propcache-0.2.1-cp311-cp311-manylinux_2_17_aarch64.manylinux2014_aarch64.whl", hash = "sha256:b480c6a4e1138e1aa137c0079b9b6305ec6dcc1098a8ca5196283e8a49df95a9"},
-    {file = "propcache-0.2.1-cp311-cp311-manylinux_2_17_ppc64le.manylinux2014_ppc64le.whl", hash = "sha256:d27b84d5880f6d8aa9ae3edb253c59d9f6642ffbb2c889b78b60361eed449787"},
-    {file = "propcache-0.2.1-cp311-cp311-manylinux_2_17_s390x.manylinux2014_s390x.whl", hash = "sha256:857112b22acd417c40fa4595db2fe28ab900c8c5fe4670c7989b1c0230955465"},
-    {file = "propcache-0.2.1-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl", hash = "sha256:cf6c4150f8c0e32d241436526f3c3f9cbd34429492abddbada2ffcff506c51af"},
-    {file = "propcache-0.2.1-cp311-cp311-manylinux_2_5_i686.manylinux1_i686.manylinux_2_17_i686.manylinux2014_i686.whl", hash = "sha256:66d4cfda1d8ed687daa4bc0274fcfd5267873db9a5bc0418c2da19273040eeb7"},
-    {file = "propcache-0.2.1-cp311-cp311-musllinux_1_2_aarch64.whl", hash = "sha256:c2f992c07c0fca81655066705beae35fc95a2fa7366467366db627d9f2ee097f"},
-    {file = "propcache-0.2.1-cp311-cp311-musllinux_1_2_armv7l.whl", hash = "sha256:4a571d97dbe66ef38e472703067021b1467025ec85707d57e78711c085984e54"},
-    {file = "propcache-0.2.1-cp311-cp311-musllinux_1_2_i686.whl", hash = "sha256:bb6178c241278d5fe853b3de743087be7f5f4c6f7d6d22a3b524d323eecec505"},
-    {file = "propcache-0.2.1-cp311-cp311-musllinux_1_2_ppc64le.whl", hash = "sha256:ad1af54a62ffe39cf34db1aa6ed1a1873bd548f6401db39d8e7cd060b9211f82"},
-    {file = "propcache-0.2.1-cp311-cp311-musllinux_1_2_s390x.whl", hash = "sha256:e7048abd75fe40712005bcfc06bb44b9dfcd8e101dda2ecf2f5aa46115ad07ca"},
-    {file = "propcache-0.2.1-cp311-cp311-musllinux_1_2_x86_64.whl", hash = "sha256:160291c60081f23ee43d44b08a7e5fb76681221a8e10b3139618c5a9a291b84e"},
-    {file = "propcache-0.2.1-cp311-cp311-win32.whl", hash = "sha256:819ce3b883b7576ca28da3861c7e1a88afd08cc8c96908e08a3f4dd64a228034"},
-    {file = "propcache-0.2.1-cp311-cp311-win_amd64.whl", hash = "sha256:edc9fc7051e3350643ad929df55c451899bb9ae6d24998a949d2e4c87fb596d3"},
-    {file = "propcache-0.2.1-cp312-cp312-macosx_10_13_universal2.whl", hash = "sha256:081a430aa8d5e8876c6909b67bd2d937bfd531b0382d3fdedb82612c618bc41a"},
-    {file = "propcache-0.2.1-cp312-cp312-macosx_10_13_x86_64.whl", hash = "sha256:d2ccec9ac47cf4e04897619c0e0c1a48c54a71bdf045117d3a26f80d38ab1fb0"},
-    {file = "propcache-0.2.1-cp312-cp312-macosx_11_0_arm64.whl", hash = "sha256:14d86fe14b7e04fa306e0c43cdbeebe6b2c2156a0c9ce56b815faacc193e320d"},
-    {file = "propcache-0.2.1-cp312-cp312-manylinux_2_17_aarch64.manylinux2014_aarch64.whl", hash = "sha256:049324ee97bb67285b49632132db351b41e77833678432be52bdd0289c0e05e4"},
-    {file = "propcache-0.2.1-cp312-cp312-manylinux_2_17_ppc64le.manylinux2014_ppc64le.whl", hash = "sha256:1cd9a1d071158de1cc1c71a26014dcdfa7dd3d5f4f88c298c7f90ad6f27bb46d"},
-    {file = "propcache-0.2.1-cp312-cp312-manylinux_2_17_s390x.manylinux2014_s390x.whl", hash = "sha256:98110aa363f1bb4c073e8dcfaefd3a5cea0f0834c2aab23dda657e4dab2f53b5"},
-    {file = "propcache-0.2.1-cp312-cp312-manylinux_2_17_x86_64.manylinux2014_x86_64.whl", hash = "sha256:647894f5ae99c4cf6bb82a1bb3a796f6e06af3caa3d32e26d2350d0e3e3faf24"},
-    {file = "propcache-0.2.1-cp312-cp312-manylinux_2_5_i686.manylinux1_i686.manylinux_2_17_i686.manylinux2014_i686.whl", hash = "sha256:bfd3223c15bebe26518d58ccf9a39b93948d3dcb3e57a20480dfdd315356baff"},
-    {file = "propcache-0.2.1-cp312-cp312-musllinux_1_2_aarch64.whl", hash = "sha256:d71264a80f3fcf512eb4f18f59423fe82d6e346ee97b90625f283df56aee103f"},
-    {file = "propcache-0.2.1-cp312-cp312-musllinux_1_2_armv7l.whl", hash = "sha256:e73091191e4280403bde6c9a52a6999d69cdfde498f1fdf629105247599b57ec"},
-    {file = "propcache-0.2.1-cp312-cp312-musllinux_1_2_i686.whl", hash = "sha256:3935bfa5fede35fb202c4b569bb9c042f337ca4ff7bd540a0aa5e37131659348"},
-    {file = "propcache-0.2.1-cp312-cp312-musllinux_1_2_ppc64le.whl", hash = "sha256:f508b0491767bb1f2b87fdfacaba5f7eddc2f867740ec69ece6d1946d29029a6"},
-    {file = "propcache-0.2.1-cp312-cp312-musllinux_1_2_s390x.whl", hash = "sha256:1672137af7c46662a1c2be1e8dc78cb6d224319aaa40271c9257d886be4363a6"},
-    {file = "propcache-0.2.1-cp312-cp312-musllinux_1_2_x86_64.whl", hash = "sha256:b74c261802d3d2b85c9df2dfb2fa81b6f90deeef63c2db9f0e029a3cac50b518"},
-    {file = "propcache-0.2.1-cp312-cp312-win32.whl", hash = "sha256:d09c333d36c1409d56a9d29b3a1b800a42c76a57a5a8907eacdbce3f18768246"},
-    {file = "propcache-0.2.1-cp312-cp312-win_amd64.whl", hash = "sha256:c214999039d4f2a5b2073ac506bba279945233da8c786e490d411dfc30f855c1"},
-    {file = "propcache-0.2.1-cp313-cp313-macosx_10_13_universal2.whl", hash = "sha256:aca405706e0b0a44cc6bfd41fbe89919a6a56999157f6de7e182a990c36e37bc"},
-    {file = "propcache-0.2.1-cp313-cp313-macosx_10_13_x86_64.whl", hash = "sha256:12d1083f001ace206fe34b6bdc2cb94be66d57a850866f0b908972f90996b3e9"},
-    {file = "propcache-0.2.1-cp313-cp313-macosx_11_0_arm64.whl", hash = "sha256:d93f3307ad32a27bda2e88ec81134b823c240aa3abb55821a8da553eed8d9439"},
-    {file = "propcache-0.2.1-cp313-cp313-manylinux_2_17_aarch64.manylinux2014_aarch64.whl", hash = "sha256:ba278acf14471d36316159c94a802933d10b6a1e117b8554fe0d0d9b75c9d536"},
-    {file = "propcache-0.2.1-cp313-cp313-manylinux_2_17_ppc64le.manylinux2014_ppc64le.whl", hash = "sha256:4e6281aedfca15301c41f74d7005e6e3f4ca143584ba696ac69df4f02f40d629"},
-    {file = "propcache-0.2.1-cp313-cp313-manylinux_2_17_s390x.manylinux2014_s390x.whl", hash = "sha256:5b750a8e5a1262434fb1517ddf64b5de58327f1adc3524a5e44c2ca43305eb0b"},
-    {file = "propcache-0.2.1-cp313-cp313-manylinux_2_17_x86_64.manylinux2014_x86_64.whl", hash = "sha256:bf72af5e0fb40e9babf594308911436c8efde3cb5e75b6f206c34ad18be5c052"},
-    {file = "propcache-0.2.1-cp313-cp313-manylinux_2_5_i686.manylinux1_i686.manylinux_2_17_i686.manylinux2014_i686.whl", hash = "sha256:b2d0a12018b04f4cb820781ec0dffb5f7c7c1d2a5cd22bff7fb055a2cb19ebce"},
-    {file = "propcache-0.2.1-cp313-cp313-musllinux_1_2_aarch64.whl", hash = "sha256:e800776a79a5aabdb17dcc2346a7d66d0777e942e4cd251defeb084762ecd17d"},
-    {file = "propcache-0.2.1-cp313-cp313-musllinux_1_2_armv7l.whl", hash = "sha256:4160d9283bd382fa6c0c2b5e017acc95bc183570cd70968b9202ad6d8fc48dce"},
-    {file = "propcache-0.2.1-cp313-cp313-musllinux_1_2_i686.whl", hash = "sha256:30b43e74f1359353341a7adb783c8f1b1c676367b011709f466f42fda2045e95"},
-    {file = "propcache-0.2.1-cp313-cp313-musllinux_1_2_ppc64le.whl", hash = "sha256:58791550b27d5488b1bb52bc96328456095d96206a250d28d874fafe11b3dfaf"},
-    {file = "propcache-0.2.1-cp313-cp313-musllinux_1_2_s390x.whl", hash = "sha256:0f022d381747f0dfe27e99d928e31bc51a18b65bb9e481ae0af1380a6725dd1f"},
-    {file = "propcache-0.2.1-cp313-cp313-musllinux_1_2_x86_64.whl", hash = "sha256:297878dc9d0a334358f9b608b56d02e72899f3b8499fc6044133f0d319e2ec30"},
-    {file = "propcache-0.2.1-cp313-cp313-win32.whl", hash = "sha256:ddfab44e4489bd79bda09d84c430677fc7f0a4939a73d2bba3073036f487a0a6"},
-    {file = "propcache-0.2.1-cp313-cp313-win_amd64.whl", hash = "sha256:556fc6c10989f19a179e4321e5d678db8eb2924131e64652a51fe83e4c3db0e1"},
-    {file = "propcache-0.2.1-cp39-cp39-macosx_10_9_universal2.whl", hash = "sha256:6a9a8c34fb7bb609419a211e59da8887eeca40d300b5ea8e56af98f6fbbb1541"},
-    {file = "propcache-0.2.1-cp39-cp39-macosx_10_9_x86_64.whl", hash = "sha256:ae1aa1cd222c6d205853b3013c69cd04515f9d6ab6de4b0603e2e1c33221303e"},
-    {file = "propcache-0.2.1-cp39-cp39-macosx_11_0_arm64.whl", hash = "sha256:accb6150ce61c9c4b7738d45550806aa2b71c7668c6942f17b0ac182b6142fd4"},
-    {file = "propcache-0.2.1-cp39-cp39-manylinux_2_17_aarch64.manylinux2014_aarch64.whl", hash = "sha256:5eee736daafa7af6d0a2dc15cc75e05c64f37fc37bafef2e00d77c14171c2097"},
-    {file = "propcache-0.2.1-cp39-cp39-manylinux_2_17_ppc64le.manylinux2014_ppc64le.whl", hash = "sha256:f7a31fc1e1bd362874863fdeed71aed92d348f5336fd84f2197ba40c59f061bd"},
-    {file = "propcache-0.2.1-cp39-cp39-manylinux_2_17_s390x.manylinux2014_s390x.whl", hash = "sha256:cba4cfa1052819d16699e1d55d18c92b6e094d4517c41dd231a8b9f87b6fa681"},
-    {file = "propcache-0.2.1-cp39-cp39-manylinux_2_17_x86_64.manylinux2014_x86_64.whl", hash = "sha256:f089118d584e859c62b3da0892b88a83d611c2033ac410e929cb6754eec0ed16"},
-    {file = "propcache-0.2.1-cp39-cp39-manylinux_2_5_i686.manylinux1_i686.manylinux_2_17_i686.manylinux2014_i686.whl", hash = "sha256:781e65134efaf88feb447e8c97a51772aa75e48b794352f94cb7ea717dedda0d"},
-    {file = "propcache-0.2.1-cp39-cp39-musllinux_1_2_aarch64.whl", hash = "sha256:31f5af773530fd3c658b32b6bdc2d0838543de70eb9a2156c03e410f7b0d3aae"},
-    {file = "propcache-0.2.1-cp39-cp39-musllinux_1_2_armv7l.whl", hash = "sha256:a7a078f5d37bee6690959c813977da5291b24286e7b962e62a94cec31aa5188b"},
-    {file = "propcache-0.2.1-cp39-cp39-musllinux_1_2_i686.whl", hash = "sha256:cea7daf9fc7ae6687cf1e2c049752f19f146fdc37c2cc376e7d0032cf4f25347"},
-    {file = "propcache-0.2.1-cp39-cp39-musllinux_1_2_ppc64le.whl", hash = "sha256:8b3489ff1ed1e8315674d0775dc7d2195fb13ca17b3808721b54dbe9fd020faf"},
-    {file = "propcache-0.2.1-cp39-cp39-musllinux_1_2_s390x.whl", hash = "sha256:9403db39be1393618dd80c746cb22ccda168efce239c73af13c3763ef56ffc04"},
-    {file = "propcache-0.2.1-cp39-cp39-musllinux_1_2_x86_64.whl", hash = "sha256:5d97151bc92d2b2578ff7ce779cdb9174337390a535953cbb9452fb65164c587"},
-    {file = "propcache-0.2.1-cp39-cp39-win32.whl", hash = "sha256:9caac6b54914bdf41bcc91e7eb9147d331d29235a7c967c150ef5df6464fd1bb"},
-    {file = "propcache-0.2.1-cp39-cp39-win_amd64.whl", hash = "sha256:92fc4500fcb33899b05ba73276dfb684a20d31caa567b7cb5252d48f896a91b1"},
-    {file = "propcache-0.2.1-py3-none-any.whl", hash = "sha256:52277518d6aae65536e9cea52d4e7fd2f7a66f4aa2d30ed3f2fcea620ace3c54"},
-    {file = "propcache-0.2.1.tar.gz", hash = "sha256:3f77ce728b19cb537714499928fe800c3dda29e8d9428778fc7c186da4c09a64"},
+python-versions = ">=3.10"
+groups = ["main", "training"]
+files = [
+    {file = "propcache-0.5.2-cp310-cp310-macosx_10_9_universal2.whl", hash = "sha256:d5a81be28596d6559f6131ef33e10200de6e17643b3c74ce03f9eb103be6ae8b"},
+    {file = "propcache-0.5.2-cp310-cp310-macosx_10_9_x86_64.whl", hash = "sha256:29cbaac5ea0212663e6845e04b5e188d5a6ae6dd919810ac835bf1d3b42c3f4c"},
+    {file = "propcache-0.5.2-cp310-cp310-macosx_11_0_arm64.whl", hash = "sha256:6bf3be92233808fcd338eba0fb4d0b59ec5772af4f4ecfcec450d1bfc0f8b5eb"},
+    {file = "propcache-0.5.2-cp310-cp310-manylinux2014_aarch64.manylinux_2_17_aarch64.manylinux_2_28_aarch64.whl", hash = "sha256:2f8ea531c794b9d6274acd4e8d2c2ebcac590a4361d27482edd3010b79f1325e"},
+    {file = "propcache-0.5.2-cp310-cp310-manylinux2014_ppc64le.manylinux_2_17_ppc64le.manylinux_2_28_ppc64le.whl", hash = "sha256:decfca4c79dd53ebab484b00cc4b6717d8c369f86e74aa4ca395a64ac651495e"},
+    {file = "propcache-0.5.2-cp310-cp310-manylinux2014_s390x.manylinux_2_17_s390x.manylinux_2_28_s390x.whl", hash = "sha256:4621064bbf28fa77ff64dd5d94367c04684c67d3a5bf1dff25f0cd0d98a38f3b"},
+    {file = "propcache-0.5.2-cp310-cp310-manylinux2014_x86_64.manylinux_2_17_x86_64.manylinux_2_28_x86_64.whl", hash = "sha256:b96db7141a592cbc968daf1feea83a118e6ab378af4abbc72b248c895414c22d"},
+    {file = "propcache-0.5.2-cp310-cp310-manylinux_2_31_riscv64.manylinux_2_39_riscv64.whl", hash = "sha256:1ca071adabaab6e9219924bbe00af821f1ee7de113a9eca1cdc292de3d120f4d"},
+    {file = "propcache-0.5.2-cp310-cp310-musllinux_1_2_aarch64.whl", hash = "sha256:e4294d04a94dcab1b3bccd8b66d962dcad411a1d19414b2a41d1445f1de32ad0"},
+    {file = "propcache-0.5.2-cp310-cp310-musllinux_1_2_armv7l.whl", hash = "sha256:a0e399a2eccb91ed18721f86aa85757727400b6865c89e88934781deb9c8498b"},
+    {file = "propcache-0.5.2-cp310-cp310-musllinux_1_2_ppc64le.whl", hash = "sha256:823581fd5cb08b12a48bfa11fe962a7916766b6170c17b028fbdf762b85eb9bf"},
+    {file = "propcache-0.5.2-cp310-cp310-musllinux_1_2_riscv64.whl", hash = "sha256:949c91d1a990cf3b2e8188dfcfb25005e0b834a06c63fa4ef9f360878ce21ecf"},
+    {file = "propcache-0.5.2-cp310-cp310-musllinux_1_2_s390x.whl", hash = "sha256:cc1177027eda740fdb152706bd215a3f124e3eea15afc39f2cb9fe351b50619e"},
+    {file = "propcache-0.5.2-cp310-cp310-musllinux_1_2_x86_64.whl", hash = "sha256:b05d643f944a8c3c4bd86d65ffd87bf3264b617f87791940302bc474d2ff5274"},
+    {file = "propcache-0.5.2-cp310-cp310-win32.whl", hash = "sha256:8114f28879e0904748e831c3a7774261bd9e75f49be089f389a76f959dcd13fe"},
+    {file = "propcache-0.5.2-cp310-cp310-win_amd64.whl", hash = "sha256:5fcb98e7598b1ee0addab320d90f65b530297a867dbfe9de52ea838077e16e3d"},
+    {file = "propcache-0.5.2-cp310-cp310-win_arm64.whl", hash = "sha256:04dc2390d9edbbaef7461f33322555976ffddf0b650a038649d026358714e6c5"},
+    {file = "propcache-0.5.2-cp311-cp311-macosx_10_9_universal2.whl", hash = "sha256:74b70780220e2dd89175ca24b81b68b67c83db499ae611e7f2313cb329801c78"},
+    {file = "propcache-0.5.2-cp311-cp311-macosx_10_9_x86_64.whl", hash = "sha256:a4840ab0ae0216d952f4b53dc6d0b992bfc2bedbfe360bdd9b548bc184c08959"},
+    {file = "propcache-0.5.2-cp311-cp311-macosx_11_0_arm64.whl", hash = "sha256:c6844ba6364fb12f403928a82cfd295ab103a2b315c77c747b2dbe4a41894ea7"},
+    {file = "propcache-0.5.2-cp311-cp311-manylinux2014_aarch64.manylinux_2_17_aarch64.manylinux_2_28_aarch64.whl", hash = "sha256:2293949b855ce597f2826452d17c2d545fb5622379c4ea6fdf525e9b8e8a2511"},
+    {file = "propcache-0.5.2-cp311-cp311-manylinux2014_ppc64le.manylinux_2_17_ppc64le.manylinux_2_28_ppc64le.whl", hash = "sha256:0fd59b5af35f74da48d905dcbad55449ba13be91823cb05a9bd590bbf5b61660"},
+    {file = "propcache-0.5.2-cp311-cp311-manylinux2014_s390x.manylinux_2_17_s390x.manylinux_2_28_s390x.whl", hash = "sha256:29f9309a2e42b0d273be006fdb4be2d6c39a47f6f57d8fb1cf9f81481df81b66"},
+    {file = "propcache-0.5.2-cp311-cp311-manylinux2014_x86_64.manylinux_2_17_x86_64.manylinux_2_28_x86_64.whl", hash = "sha256:5aaa2b923c1944ac8febd6609cb373540a5563e7cbcb0fd770f75dace2eb817b"},
+    {file = "propcache-0.5.2-cp311-cp311-manylinux_2_31_riscv64.manylinux_2_39_riscv64.whl", hash = "sha256:66ea454f095ddf5b6b14f56c064c0941c4788be11e18d2464cf643bf7203ff67"},
+    {file = "propcache-0.5.2-cp311-cp311-musllinux_1_2_aarch64.whl", hash = "sha256:95f1e3f4760d404b13c9976c0229b2b49a3c8e2c62a9ce92efdd2b11ada75e3f"},
+    {file = "propcache-0.5.2-cp311-cp311-musllinux_1_2_armv7l.whl", hash = "sha256:85341b12b9d55bad0bded24cac341bb34289469e03a11f3f583ea1cc1db0326c"},
+    {file = "propcache-0.5.2-cp311-cp311-musllinux_1_2_ppc64le.whl", hash = "sha256:26a4dca084132874e639895c3135dfad5eb20bae209f62d1aeb31b03e601c3c0"},
+    {file = "propcache-0.5.2-cp311-cp311-musllinux_1_2_riscv64.whl", hash = "sha256:3b199b9b2b3d6a7edf3183ba8a9a137a22b97f7df525feb5ae1eccf026d2a9c6"},
+    {file = "propcache-0.5.2-cp311-cp311-musllinux_1_2_s390x.whl", hash = "sha256:e59bc9e66329185b93dab73f210f1a37f81cb40f321501db8017c9aea15dba27"},
+    {file = "propcache-0.5.2-cp311-cp311-musllinux_1_2_x86_64.whl", hash = "sha256:552ffadf6ad409844bc5919c42a0a83d88314cedddaea0e41e80a8b8fffe881f"},
+    {file = "propcache-0.5.2-cp311-cp311-win32.whl", hash = "sha256:cd416c1de191973c52ff1a12a57446bfc7642797b282d7caf2162d7d1b8aa9a0"},
+    {file = "propcache-0.5.2-cp311-cp311-win_amd64.whl", hash = "sha256:44e488ef40dbb452700b2b1f8188934121f6648f52c295055662d2191959ff82"},
+    {file = "propcache-0.5.2-cp311-cp311-win_arm64.whl", hash = "sha256:54adaa85a22078d1e306304a40984dc5be99d599bf3dc0a24dc98f7daeab89ab"},
+    {file = "propcache-0.5.2-cp312-cp312-macosx_10_13_universal2.whl", hash = "sha256:806719138ecd720339a12410fb9614ac9b2b2d3a5fdf8235d56981c36f4039ba"},
+    {file = "propcache-0.5.2-cp312-cp312-macosx_10_13_x86_64.whl", hash = "sha256:db2b80ea58eab4f86b2beec3cc8b39e8ff9276ac20e96b7cce43c8ae84cd6b5a"},
+    {file = "propcache-0.5.2-cp312-cp312-macosx_11_0_arm64.whl", hash = "sha256:e5cbfac9f61484f7e9f3597775500cd3ebe8274e9b050c38f9525c77c97520bf"},
+    {file = "propcache-0.5.2-cp312-cp312-manylinux2014_aarch64.manylinux_2_17_aarch64.manylinux_2_28_aarch64.whl", hash = "sha256:5dbc581d2814337da56222fab8dc5f161cd798a434e49bac27930aaef798e144"},
+    {file = "propcache-0.5.2-cp312-cp312-manylinux2014_ppc64le.manylinux_2_17_ppc64le.manylinux_2_28_ppc64le.whl", hash = "sha256:857187f381f88c8e2fa2fe56ab94879d011b883d5a2ee5a1b60a8cd2a06846d9"},
+    {file = "propcache-0.5.2-cp312-cp312-manylinux2014_s390x.manylinux_2_17_s390x.manylinux_2_28_s390x.whl", hash = "sha256:178b4a2cdaac1818e2bf1c5a99b94383fa73ea5382e032a48dec07dc5668dc42"},
+    {file = "propcache-0.5.2-cp312-cp312-manylinux2014_x86_64.manylinux_2_17_x86_64.manylinux_2_28_x86_64.whl", hash = "sha256:6f328175a2cde1f0ff2c4ed8ce968b9dcfb55f3a7153f39e2957ed994da13476"},
+    {file = "propcache-0.5.2-cp312-cp312-manylinux_2_31_riscv64.manylinux_2_39_riscv64.whl", hash = "sha256:5671d09a36b06d0fd4a3da0fccbcae360e9b1570924171a15e9e0997f0249fba"},
+    {file = "propcache-0.5.2-cp312-cp312-musllinux_1_2_aarch64.whl", hash = "sha256:80168e2ebe4d3ec6599d10ad8f520304ae1cad9b6c5a95372aef1b66b7bfb53a"},
+    {file = "propcache-0.5.2-cp312-cp312-musllinux_1_2_armv7l.whl", hash = "sha256:45f11346f884bc47444f6e6647131055844134c3175b629f84952e2b5cd62b64"},
+    {file = "propcache-0.5.2-cp312-cp312-musllinux_1_2_ppc64le.whl", hash = "sha256:8e778ebd44ef4f66ed60a0416b06b489687db264a9c0b3620362f26489492913"},
+    {file = "propcache-0.5.2-cp312-cp312-musllinux_1_2_riscv64.whl", hash = "sha256:c0cb9ed24c8964e172768d455a38254c2dd8a552905729ce006cad3d3dda59b1"},
+    {file = "propcache-0.5.2-cp312-cp312-musllinux_1_2_s390x.whl", hash = "sha256:1d1ad32d9d4355e2be65574fd0bfd3677e7066b009cd5b9b2dee8aa6a6393b33"},
+    {file = "propcache-0.5.2-cp312-cp312-musllinux_1_2_x86_64.whl", hash = "sha256:c80f4ba3e8f00189165999a742ee526ebeccedf6c3f7beb0c7df821e9772435a"},
+    {file = "propcache-0.5.2-cp312-cp312-win32.whl", hash = "sha256:8c7972d8f193740d9175f0998ab38717e6cd322d5935c5b0fef8c0d323fd9031"},
+    {file = "propcache-0.5.2-cp312-cp312-win_amd64.whl", hash = "sha256:d9ee8826a7d47863a08ac44e1a5f611a462eefc3a194b492da242128bec75b42"},
+    {file = "propcache-0.5.2-cp312-cp312-win_arm64.whl", hash = "sha256:2800a4a8ead6b28cccd1ec54b59346f0def7922ee1c7598e8499c733cfbb7c84"},
+    {file = "propcache-0.5.2-cp313-cp313-macosx_10_13_universal2.whl", hash = "sha256:099aaf4b4d1a02265b92a977edf00b5c4f63b3b17ac6de39b0d637c9cac0188a"},
+    {file = "propcache-0.5.2-cp313-cp313-macosx_10_13_x86_64.whl", hash = "sha256:68ce1c44c7a813a7f71ea04315a8c7b330b63db99d059a797a4651bb6f69f117"},
+    {file = "propcache-0.5.2-cp313-cp313-macosx_11_0_arm64.whl", hash = "sha256:fc299c129490f55f254cd90be0deca4764e36e9a7c08b4aa588479a3bbed3098"},
+    {file = "propcache-0.5.2-cp313-cp313-manylinux2014_aarch64.manylinux_2_17_aarch64.manylinux_2_28_aarch64.whl", hash = "sha256:a6ae2198be502c10f09b2516e7b5d019816924bc3183a43ce792a7bd6625e6f4"},
+    {file = "propcache-0.5.2-cp313-cp313-manylinux2014_ppc64le.manylinux_2_17_ppc64le.manylinux_2_28_ppc64le.whl", hash = "sha256:6041d31504dc1779d700e1edcfb08eea334b357620b06681a4eabb57a74e574e"},
+    {file = "propcache-0.5.2-cp313-cp313-manylinux2014_s390x.manylinux_2_17_s390x.manylinux_2_28_s390x.whl", hash = "sha256:f7eabc04151c78a9f4d5bbb5f1faf571e4defeb4b585e0fe95b60ff2dbe4d3d7"},
+    {file = "propcache-0.5.2-cp313-cp313-manylinux2014_x86_64.manylinux_2_17_x86_64.manylinux_2_28_x86_64.whl", hash = "sha256:4db0ba63d693afd40d249bd93f842b5f144f8fcbb83de05660373bcf30517b1d"},
+    {file = "propcache-0.5.2-cp313-cp313-manylinux_2_31_riscv64.manylinux_2_39_riscv64.whl", hash = "sha256:1dbcf7675229b35d31abb6547d8ebc8c27a830ac3f9a794edff6254873ec7c0a"},
+    {file = "propcache-0.5.2-cp313-cp313-musllinux_1_2_aarch64.whl", hash = "sha256:d310c013aad2c72f1c3f2f8dd3279d460a858c551f97aeb8c63e4693cca7b4d2"},
+    {file = "propcache-0.5.2-cp313-cp313-musllinux_1_2_armv7l.whl", hash = "sha256:06187263ddad280d05b4d8a8b3bb7d164cbebd469236544a42e6d9b28ac6a4fa"},
+    {file = "propcache-0.5.2-cp313-cp313-musllinux_1_2_ppc64le.whl", hash = "sha256:3115559b8effafd63b142ea5ed53d63a16ea6469cbc63dce4ee194b42db5d853"},
+    {file = "propcache-0.5.2-cp313-cp313-musllinux_1_2_riscv64.whl", hash = "sha256:c60462af8e6dc30c35407c7237ea908d777b22862bbee27bc4699c0d8bcdc45a"},
+    {file = "propcache-0.5.2-cp313-cp313-musllinux_1_2_s390x.whl", hash = "sha256:40314bca9ac559716fe374094fc81c11dcc34b64fd6c585360f5775690505704"},
+    {file = "propcache-0.5.2-cp313-cp313-musllinux_1_2_x86_64.whl", hash = "sha256:cfa21e036ce1e1db2be04ba3b85d2df1bb1702fa01932d984c5464c665228ff4"},
+    {file = "propcache-0.5.2-cp313-cp313-win32.whl", hash = "sha256:f156a3529f38063b6dbaf356e15602a7f95f8055b1295a438433a6386f10463d"},
+    {file = "propcache-0.5.2-cp313-cp313-win_amd64.whl", hash = "sha256:dfed59d0a5aeb01e242e66ff0300bc4a265a7c05f612d30016f0b60b1017d757"},
+    {file = "propcache-0.5.2-cp313-cp313-win_arm64.whl", hash = "sha256:ba338430e87ceb9c8f0cf754de38a9860560261e56c00376debd628698a7364f"},
+    {file = "propcache-0.5.2-cp313-cp313t-macosx_10_13_universal2.whl", hash = "sha256:a592f5f3da71c8691c788c13cb6734b6d17663d2e1cb8caddf0673d01ef8847d"},
+    {file = "propcache-0.5.2-cp313-cp313t-macosx_10_13_x86_64.whl", hash = "sha256:6a997d0489e9668a384fcfd5061b857aa5361de73191cac204d04b889cfbbafa"},
+    {file = "propcache-0.5.2-cp313-cp313t-macosx_11_0_arm64.whl", hash = "sha256:10734b5484ea113152ee25a91dccedf81631791805d2c9ccb054958e51842c94"},
+    {file = "propcache-0.5.2-cp313-cp313t-manylinux2014_aarch64.manylinux_2_17_aarch64.manylinux_2_28_aarch64.whl", hash = "sha256:cafca7e56c12bb02ae16d283742bef25a61122e9dab2b5b3f2ccbe589ce32164"},
+    {file = "propcache-0.5.2-cp313-cp313t-manylinux2014_ppc64le.manylinux_2_17_ppc64le.manylinux_2_28_ppc64le.whl", hash = "sha256:f064f8d2b59177878b7615df1735cd8fe3462ed6be8c7b217d17a276489c2b7f"},
+    {file = "propcache-0.5.2-cp313-cp313t-manylinux2014_s390x.manylinux_2_17_s390x.manylinux_2_28_s390x.whl", hash = "sha256:f78abfa8dfc32376fd1aacf597b2f2fbbe0ea751419aee718af5d4f82537ef8c"},
+    {file = "propcache-0.5.2-cp313-cp313t-manylinux2014_x86_64.manylinux_2_17_x86_64.manylinux_2_28_x86_64.whl", hash = "sha256:f7467da8a9822bf1a55336f877340c5bcbd3c482afc43a99771169f74a26dedc"},
+    {file = "propcache-0.5.2-cp313-cp313t-manylinux_2_31_riscv64.manylinux_2_39_riscv64.whl", hash = "sha256:a6ddc6ac9e25de626c1f129c1b467d7ecd33ce2237d3fd0c4e429feef0a7ee1f"},
+    {file = "propcache-0.5.2-cp313-cp313t-musllinux_1_2_aarch64.whl", hash = "sha256:2f22cbbac9e26a8e864c0985ff1268d5d939d53d9d9411a9824279097e03a2cb"},
+    {file = "propcache-0.5.2-cp313-cp313t-musllinux_1_2_armv7l.whl", hash = "sha256:fc76378c62a0f04d0cd82fbb1a2cd2d7e28fcb40d5873f28a6c44e388aaa2751"},
+    {file = "propcache-0.5.2-cp313-cp313t-musllinux_1_2_ppc64le.whl", hash = "sha256:acd2c8edba48e31e58a363b8cf4e5c7db3b04b3f9e371f601df30d9b0d244836"},
+    {file = "propcache-0.5.2-cp313-cp313t-musllinux_1_2_riscv64.whl", hash = "sha256:452b5065457eb9991ec5eb38ff41d6cd4c991c9ac7c531c4d5849ae473a9a13f"},
+    {file = "propcache-0.5.2-cp313-cp313t-musllinux_1_2_s390x.whl", hash = "sha256:3430bb2bfe1331885c427745a751e774ee679fd4344f80b97bf879815fe8fa55"},
+    {file = "propcache-0.5.2-cp313-cp313t-musllinux_1_2_x86_64.whl", hash = "sha256:cef6cea3922890dd6c9654971001fa797b526c16ab5e1e46c05fd6f877be7568"},
+    {file = "propcache-0.5.2-cp313-cp313t-win32.whl", hash = "sha256:72d61e16dd78228b58c5d47be830ff3da7e5f139abdf0aef9d86cde1c5cf2191"},
+    {file = "propcache-0.5.2-cp313-cp313t-win_amd64.whl", hash = "sha256:0958834041a0166d343b8d2cedcd8bcbaeb4fdbe0cf08320c5379f143c3be6e7"},
+    {file = "propcache-0.5.2-cp313-cp313t-win_arm64.whl", hash = "sha256:6de8bd93ddde9b992cf2b2e0d796d501a19026b5b9fd87356d7d0779531a8d96"},
+    {file = "propcache-0.5.2-cp314-cp314-macosx_10_15_universal2.whl", hash = "sha256:46088abff4cba581dea21ae0467a480526cb25aa5f3c269e909f800328bc3999"},
+    {file = "propcache-0.5.2-cp314-cp314-macosx_10_15_x86_64.whl", hash = "sha256:fc88b26f08d634f7bc819a7852e5214f5802641ab8d9fd5326892292eee1993e"},
+    {file = "propcache-0.5.2-cp314-cp314-macosx_11_0_arm64.whl", hash = "sha256:97797ebb098e670a2f92dd66f32897e30d7615b14e7f59711de23e30a9072539"},
+    {file = "propcache-0.5.2-cp314-cp314-manylinux2014_aarch64.manylinux_2_17_aarch64.manylinux_2_28_aarch64.whl", hash = "sha256:ba57fffe4ac99c5d30076161b5866336d97600769bad35cc68f7774b15298a4e"},
+    {file = "propcache-0.5.2-cp314-cp314-manylinux2014_ppc64le.manylinux_2_17_ppc64le.manylinux_2_28_ppc64le.whl", hash = "sha256:583c19759d9eec1e5b69e2fbef36a7d9c326041be9746cb822d335c8cedc2979"},
+    {file = "propcache-0.5.2-cp314-cp314-manylinux2014_s390x.manylinux_2_17_s390x.manylinux_2_28_s390x.whl", hash = "sha256:d0326e2e5e1f3163fa306c834e48e8d490e5fae607a097a40c0648109b47ba80"},
+    {file = "propcache-0.5.2-cp314-cp314-manylinux2014_x86_64.manylinux_2_17_x86_64.manylinux_2_28_x86_64.whl", hash = "sha256:e00820e192c8dbebcafb383ebbf99030895f09905e7a0eb2e0340a0bcc2bc825"},
+    {file = "propcache-0.5.2-cp314-cp314-manylinux_2_31_riscv64.manylinux_2_39_riscv64.whl", hash = "sha256:c66afea89b1e43725731d2004732a046fe6fe955d51f952c3e95a7314a284a39"},
+    {file = "propcache-0.5.2-cp314-cp314-musllinux_1_2_aarch64.whl", hash = "sha256:d4dc37dec6c6cdad0b57881a5658fd14fbf53e333b1a86cf86559f190e1d9ec4"},
+    {file = "propcache-0.5.2-cp314-cp314-musllinux_1_2_armv7l.whl", hash = "sha256:5570dbcc97571c15f68068e529c92715a12f8d54030e272d264b377e22bd17a5"},
+    {file = "propcache-0.5.2-cp314-cp314-musllinux_1_2_ppc64le.whl", hash = "sha256:f814362777a9f841adddb200ecdf8f5cb1e5a3c4b7a86378edbd6ccb26edd702"},
+    {file = "propcache-0.5.2-cp314-cp314-musllinux_1_2_riscv64.whl", hash = "sha256:196913dea116aeb5a2ba95af4ddcb7ea85559ae07d8eee8751688310d09168c3"},
+    {file = "propcache-0.5.2-cp314-cp314-musllinux_1_2_s390x.whl", hash = "sha256:6e7b8719005dd1175be4ab1cd25e9b98659a5e0347331506ec6760d2773a7fb5"},
+    {file = "propcache-0.5.2-cp314-cp314-musllinux_1_2_x86_64.whl", hash = "sha256:51f96d685ab16e88cab128cd37a52c5da540809c8b879fa047731bfcb4ad35a4"},
+    {file = "propcache-0.5.2-cp314-cp314-win32.whl", hash = "sha256:cc6fc3cc62e8501d3ed62894425040d2728ecddb1ed072737a5c70bd537aa9f0"},
+    {file = "propcache-0.5.2-cp314-cp314-win_amd64.whl", hash = "sha256:81e3a30b0bb60caa22033dd0f8a3618d1d67356212514f62c57db75cb0ef410c"},
+    {file = "propcache-0.5.2-cp314-cp314-win_arm64.whl", hash = "sha256:0d2c9bf8528f135dbb805ce027567e09164f7efa51a2be07458a2c0420f292d0"},
+    {file = "propcache-0.5.2-cp314-cp314t-macosx_10_15_universal2.whl", hash = "sha256:4bc8ff1feffc6a61c7002ffe84634c41b822e104990ae009f44a0834430070bb"},
+    {file = "propcache-0.5.2-cp314-cp314t-macosx_10_15_x86_64.whl", hash = "sha256:79aa3ff0a9b566633b642fa9caf7e21ed1c13d6feca718187873f199e1514078"},
+    {file = "propcache-0.5.2-cp314-cp314t-macosx_11_0_arm64.whl", hash = "sha256:1b31822f4474c4036bae62de9402710051d431a606d6a0f907fec79935a071aa"},
+    {file = "propcache-0.5.2-cp314-cp314t-manylinux2014_aarch64.manylinux_2_17_aarch64.manylinux_2_28_aarch64.whl", hash = "sha256:13fef48778b5a2a756523fdb781326b028ca75e32858b04f2cdd19f394564917"},
+    {file = "propcache-0.5.2-cp314-cp314t-manylinux2014_ppc64le.manylinux_2_17_ppc64le.manylinux_2_28_ppc64le.whl", hash = "sha256:8b73ab70f1a3351fbc71f663b3e645af6dd0329100c353081cf69c37433fc6fe"},
+    {file = "propcache-0.5.2-cp314-cp314t-manylinux2014_s390x.manylinux_2_17_s390x.manylinux_2_28_s390x.whl", hash = "sha256:5538d2c13d93e4698af7e092b57bc7298fd35d1d58e656ae18f23ee0d0378e03"},
+    {file = "propcache-0.5.2-cp314-cp314t-manylinux2014_x86_64.manylinux_2_17_x86_64.manylinux_2_28_x86_64.whl", hash = "sha256:cd645f03898405cabe694fb8bc35241e3a9c332ec85627584fe3de201452b335"},
+    {file = "propcache-0.5.2-cp314-cp314t-manylinux_2_31_riscv64.manylinux_2_39_riscv64.whl", hash = "sha256:a473b3440261e0c60706e732b2ed2f517857344fc21bf48fdfe211e2d98eb285"},
+    {file = "propcache-0.5.2-cp314-cp314t-musllinux_1_2_aarch64.whl", hash = "sha256:7afa37062e6650640e932e4cc9297d81f9f42d9944029cc386b8247dea4da837"},
+    {file = "propcache-0.5.2-cp314-cp314t-musllinux_1_2_armv7l.whl", hash = "sha256:8a90efd5777e996e42d568db9ac740b944d691e565cbfd31b2f7832f9184b2b8"},
+    {file = "propcache-0.5.2-cp314-cp314t-musllinux_1_2_ppc64le.whl", hash = "sha256:f19bb891234d72535764d703bfed1153cc34f4214d5bd7150aee1eec9e8f4366"},
+    {file = "propcache-0.5.2-cp314-cp314t-musllinux_1_2_riscv64.whl", hash = "sha256:32775082acd2d807ee3db715c7770d38767b817870acfa08c29e057f3c4d5b56"},
+    {file = "propcache-0.5.2-cp314-cp314t-musllinux_1_2_s390x.whl", hash = "sha256:9282fb1a3bccd038da9f768b927b24a0c753e466c086b7c4f3c6982851eefb2d"},
+    {file = "propcache-0.5.2-cp314-cp314t-musllinux_1_2_x86_64.whl", hash = "sha256:cc49723e2f60d6b32a0f0b08a3fd6d13203c07f1cd9566cfce0f12a917c967a2"},
+    {file = "propcache-0.5.2-cp314-cp314t-win32.whl", hash = "sha256:2d7aa89ebca5acc98cba9d1472d976e394782f587bad6661003602a619fd1821"},
+    {file = "propcache-0.5.2-cp314-cp314t-win_amd64.whl", hash = "sha256:d447bb0b3054be5818458fbb171208b1d9ff11eba14e18ca18b90cbb45767370"},
+    {file = "propcache-0.5.2-cp314-cp314t-win_arm64.whl", hash = "sha256:fe67a3d11cd9b4efabfa45c3d00ffba2b26811442a73a581a94b67c2b5faccf6"},
+    {file = "propcache-0.5.2-py3-none-any.whl", hash = "sha256:be1ddfcbb376e3de5d2e2db1d58d6d67463e6b4f9f040c000de8e300295465fe"},
+    {file = "propcache-0.5.2.tar.gz", hash = "sha256:01c4fc7480cd0598bb4b57022df55b9ca296da7fc5a8760bd8451a7e63a7d427"},
 ]
 
 [[package]]
@@ -3592,7 +3819,7 @@ version = "3.20.3"
 description = "Protocol Buffers"
 optional = false
 python-versions = ">=3.7"
-groups = ["main"]
+groups = ["main", "training"]
 files = [
     {file = "protobuf-3.20.3-cp310-cp310-manylinux2014_aarch64.whl", hash = "sha256:f4bd856d702e5b0d96a00ec6b307b0f51c1982c2bf9c0052cf9019e9a544ba99"},
     {file = "protobuf-3.20.3-cp310-cp310-manylinux_2_12_x86_64.manylinux2010_x86_64.whl", hash = "sha256:9aae4406ea63d825636cc11ffb34ad3379335803216ee3a856787bcf5ccc751e"},
@@ -3620,34 +3847,38 @@ files = [
 
 [[package]]
 name = "psutil"
-version = "6.1.1"
-description = "Cross-platform lib for process and system monitoring in Python."
+version = "7.2.2"
+description = "Cross-platform lib for process and system monitoring."
 optional = false
-python-versions = "!=3.0.*,!=3.1.*,!=3.2.*,!=3.3.*,!=3.4.*,!=3.5.*,>=2.7"
-groups = ["main"]
-files = [
-    {file = "psutil-6.1.1-cp27-cp27m-macosx_10_9_x86_64.whl", hash = "sha256:9ccc4316f24409159897799b83004cb1e24f9819b0dcf9c0b68bdcb6cefee6a8"},
-    {file = "psutil-6.1.1-cp27-cp27m-manylinux2010_i686.whl", hash = "sha256:ca9609c77ea3b8481ab005da74ed894035936223422dc591d6772b147421f777"},
-    {file = "psutil-6.1.1-cp27-cp27m-manylinux2010_x86_64.whl", hash = "sha256:8df0178ba8a9e5bc84fed9cfa61d54601b371fbec5c8eebad27575f1e105c0d4"},
-    {file = "psutil-6.1.1-cp27-cp27mu-manylinux2010_i686.whl", hash = "sha256:1924e659d6c19c647e763e78670a05dbb7feaf44a0e9c94bf9e14dfc6ba50468"},
-    {file = "psutil-6.1.1-cp27-cp27mu-manylinux2010_x86_64.whl", hash = "sha256:018aeae2af92d943fdf1da6b58665124897cfc94faa2ca92098838f83e1b1bca"},
-    {file = "psutil-6.1.1-cp27-none-win32.whl", hash = "sha256:6d4281f5bbca041e2292be3380ec56a9413b790579b8e593b1784499d0005dac"},
-    {file = "psutil-6.1.1-cp27-none-win_amd64.whl", hash = "sha256:c777eb75bb33c47377c9af68f30e9f11bc78e0f07fbf907be4a5d70b2fe5f030"},
-    {file = "psutil-6.1.1-cp36-abi3-macosx_10_9_x86_64.whl", hash = "sha256:fc0ed7fe2231a444fc219b9c42d0376e0a9a1a72f16c5cfa0f68d19f1a0663e8"},
-    {file = "psutil-6.1.1-cp36-abi3-macosx_11_0_arm64.whl", hash = "sha256:0bdd4eab935276290ad3cb718e9809412895ca6b5b334f5a9111ee6d9aff9377"},
-    {file = "psutil-6.1.1-cp36-abi3-manylinux_2_12_i686.manylinux2010_i686.manylinux_2_17_i686.manylinux2014_i686.whl", hash = "sha256:b6e06c20c05fe95a3d7302d74e7097756d4ba1247975ad6905441ae1b5b66003"},
-    {file = "psutil-6.1.1-cp36-abi3-manylinux_2_12_x86_64.manylinux2010_x86_64.manylinux_2_17_x86_64.manylinux2014_x86_64.whl", hash = "sha256:97f7cb9921fbec4904f522d972f0c0e1f4fabbdd4e0287813b21215074a0f160"},
-    {file = "psutil-6.1.1-cp36-abi3-manylinux_2_17_aarch64.manylinux2014_aarch64.whl", hash = "sha256:33431e84fee02bc84ea36d9e2c4a6d395d479c9dd9bba2376c1f6ee8f3a4e0b3"},
-    {file = "psutil-6.1.1-cp36-cp36m-win32.whl", hash = "sha256:384636b1a64b47814437d1173be1427a7c83681b17a450bfc309a1953e329603"},
-    {file = "psutil-6.1.1-cp36-cp36m-win_amd64.whl", hash = "sha256:8be07491f6ebe1a693f17d4f11e69d0dc1811fa082736500f649f79df7735303"},
-    {file = "psutil-6.1.1-cp37-abi3-win32.whl", hash = "sha256:eaa912e0b11848c4d9279a93d7e2783df352b082f40111e078388701fd479e53"},
-    {file = "psutil-6.1.1-cp37-abi3-win_amd64.whl", hash = "sha256:f35cfccb065fff93529d2afb4a2e89e363fe63ca1e4a5da22b603a85833c2649"},
-    {file = "psutil-6.1.1.tar.gz", hash = "sha256:cf8496728c18f2d0b45198f06895be52f36611711746b7f30c464b422b50e2f5"},
+python-versions = ">=3.6"
+groups = ["main", "training"]
+files = [
+    {file = "psutil-7.2.2-cp313-cp313t-macosx_10_13_x86_64.whl", hash = "sha256:2edccc433cbfa046b980b0df0171cd25bcaeb3a68fe9022db0979e7aa74a826b"},
+    {file = "psutil-7.2.2-cp313-cp313t-macosx_11_0_arm64.whl", hash = "sha256:e78c8603dcd9a04c7364f1a3e670cea95d51ee865e4efb3556a3a63adef958ea"},
+    {file = "psutil-7.2.2-cp313-cp313t-manylinux2010_x86_64.manylinux_2_12_x86_64.manylinux_2_28_x86_64.whl", hash = "sha256:1a571f2330c966c62aeda00dd24620425d4b0cc86881c89861fbc04549e5dc63"},
+    {file = "psutil-7.2.2-cp313-cp313t-manylinux2014_aarch64.manylinux_2_17_aarch64.manylinux_2_28_aarch64.whl", hash = "sha256:917e891983ca3c1887b4ef36447b1e0873e70c933afc831c6b6da078ba474312"},
+    {file = "psutil-7.2.2-cp313-cp313t-win_amd64.whl", hash = "sha256:ab486563df44c17f5173621c7b198955bd6b613fb87c71c161f827d3fb149a9b"},
+    {file = "psutil-7.2.2-cp313-cp313t-win_arm64.whl", hash = "sha256:ae0aefdd8796a7737eccea863f80f81e468a1e4cf14d926bd9b6f5f2d5f90ca9"},
+    {file = "psutil-7.2.2-cp314-cp314t-macosx_10_15_x86_64.whl", hash = "sha256:eed63d3b4d62449571547b60578c5b2c4bcccc5387148db46e0c2313dad0ee00"},
+    {file = "psutil-7.2.2-cp314-cp314t-macosx_11_0_arm64.whl", hash = "sha256:7b6d09433a10592ce39b13d7be5a54fbac1d1228ed29abc880fb23df7cb694c9"},
+    {file = "psutil-7.2.2-cp314-cp314t-manylinux2010_x86_64.manylinux_2_12_x86_64.manylinux_2_28_x86_64.whl", hash = "sha256:1fa4ecf83bcdf6e6c8f4449aff98eefb5d0604bf88cb883d7da3d8d2d909546a"},
+    {file = "psutil-7.2.2-cp314-cp314t-manylinux2014_aarch64.manylinux_2_17_aarch64.manylinux_2_28_aarch64.whl", hash = "sha256:e452c464a02e7dc7822a05d25db4cde564444a67e58539a00f929c51eddda0cf"},
+    {file = "psutil-7.2.2-cp314-cp314t-win_amd64.whl", hash = "sha256:c7663d4e37f13e884d13994247449e9f8f574bc4655d509c3b95e9ec9e2b9dc1"},
+    {file = "psutil-7.2.2-cp314-cp314t-win_arm64.whl", hash = "sha256:11fe5a4f613759764e79c65cf11ebdf26e33d6dd34336f8a337aa2996d71c841"},
+    {file = "psutil-7.2.2-cp36-abi3-macosx_10_9_x86_64.whl", hash = "sha256:ed0cace939114f62738d808fdcecd4c869222507e266e574799e9c0faa17d486"},
+    {file = "psutil-7.2.2-cp36-abi3-macosx_11_0_arm64.whl", hash = "sha256:1a7b04c10f32cc88ab39cbf606e117fd74721c831c98a27dc04578deb0c16979"},
+    {file = "psutil-7.2.2-cp36-abi3-manylinux2010_x86_64.manylinux_2_12_x86_64.manylinux_2_28_x86_64.whl", hash = "sha256:076a2d2f923fd4821644f5ba89f059523da90dc9014e85f8e45a5774ca5bc6f9"},
+    {file = "psutil-7.2.2-cp36-abi3-manylinux2014_aarch64.manylinux_2_17_aarch64.manylinux_2_28_aarch64.whl", hash = "sha256:b0726cecd84f9474419d67252add4ac0cd9811b04d61123054b9fb6f57df6e9e"},
+    {file = "psutil-7.2.2-cp36-abi3-musllinux_1_2_aarch64.whl", hash = "sha256:fd04ef36b4a6d599bbdb225dd1d3f51e00105f6d48a28f006da7f9822f2606d8"},
+    {file = "psutil-7.2.2-cp36-abi3-musllinux_1_2_x86_64.whl", hash = "sha256:b58fabe35e80b264a4e3bb23e6b96f9e45a3df7fb7eed419ac0e5947c61e47cc"},
+    {file = "psutil-7.2.2-cp37-abi3-win_amd64.whl", hash = "sha256:eb7e81434c8d223ec4a219b5fc1c47d0417b12be7ea866e24fb5ad6e84b3d988"},
+    {file = "psutil-7.2.2-cp37-abi3-win_arm64.whl", hash = "sha256:8c233660f575a5a89e6d4cb65d9f938126312bca76d8fe087b947b3a1aaac9ee"},
+    {file = "psutil-7.2.2.tar.gz", hash = "sha256:0746f5f8d406af344fd547f1c8daa5f5c33dbc293bb8d6a16d80b4bb88f59372"},
 ]
 
 [package.extras]
-dev = ["abi3audit", "black", "check-manifest", "coverage", "packaging", "pylint", "pyperf", "pypinfo", "pytest-cov", "requests", "rstcheck", "ruff", "sphinx", "sphinx_rtd_theme", "toml-sort", "twine", "virtualenv", "vulture", "wheel"]
-test = ["pytest", "pytest-xdist", "setuptools"]
+dev = ["abi3audit", "black", "check-manifest", "colorama ; os_name == \"nt\"", "coverage", "packaging", "psleak", "pylint", "pyperf", "pypinfo", "pyreadline3 ; os_name == \"nt\"", "pytest", "pytest-cov", "pytest-instafail", "pytest-xdist", "pywin32 ; os_name == \"nt\" and implementation_name != \"pypy\"", "requests", "rstcheck", "ruff", "setuptools", "sphinx", "sphinx_rtd_theme", "toml-sort", "twine", "validate-pyproject[all]", "virtualenv", "vulture", "wheel", "wheel ; os_name == \"nt\" and implementation_name != \"pypy\"", "wmi ; os_name == \"nt\" and implementation_name != \"pypy\""]
+test = ["psleak", "pytest", "pytest-instafail", "pytest-xdist", "pywin32 ; os_name == \"nt\" and implementation_name != \"pypy\"", "setuptools", "wheel ; os_name == \"nt\" and implementation_name != \"pypy\"", "wmi ; os_name == \"nt\" and implementation_name != \"pypy\""]
 
 [[package]]
 name = "pudb"
@@ -3655,7 +3886,7 @@ version = "2024.1.2"
 description = "A full-screen, console-based Python debugger"
 optional = false
 python-versions = "~=3.8"
-groups = ["main"]
+groups = ["dev"]
 files = [
     {file = "pudb-2024.1.2-py3-none-any.whl", hash = "sha256:4726c288d9f57845b8dba706c70eb6faaddff9d86e5208eda82216ef5e79cc2e"},
     {file = "pudb-2024.1.2.tar.gz", hash = "sha256:adc9b00042ba8367117df0a6c0dc62fa9609abd21c3bf8e5b73d620907c5b43e"},
@@ -3677,95 +3908,42 @@ version = "9.0.0"
 description = "Get CPU info with pure Python"
 optional = false
 python-versions = "*"
-groups = ["main"]
+groups = ["training"]
 files = [
     {file = "py-cpuinfo-9.0.0.tar.gz", hash = "sha256:3cdbbf3fac90dc6f118bfd64384f309edeadd902d7c8fb17f02ffa1fc3f49690"},
     {file = "py_cpuinfo-9.0.0-py3-none-any.whl", hash = "sha256:859625bc251f64e21f077d099d4162689c762b5d6a4c3c97553d56241c9674d5"},
 ]
 
-[[package]]
-name = "pyarrow"
-version = "19.0.0"
-description = "Python library for Apache Arrow"
-optional = false
-python-versions = ">=3.9"
-groups = ["main"]
-files = [
-    {file = "pyarrow-19.0.0-cp310-cp310-macosx_12_0_arm64.whl", hash = "sha256:c318eda14f6627966997a7d8c374a87d084a94e4e38e9abbe97395c215830e0c"},
-    {file = "pyarrow-19.0.0-cp310-cp310-macosx_12_0_x86_64.whl", hash = "sha256:62ef8360ff256e960f57ce0299090fb86423afed5e46f18f1225f960e05aae3d"},
-    {file = "pyarrow-19.0.0-cp310-cp310-manylinux_2_17_aarch64.manylinux2014_aarch64.whl", hash = "sha256:2795064647add0f16563e57e3d294dbfc067b723f0fd82ecd80af56dad15f503"},
-    {file = "pyarrow-19.0.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl", hash = "sha256:a218670b26fb1bc74796458d97bcab072765f9b524f95b2fccad70158feb8b17"},
-    {file = "pyarrow-19.0.0-cp310-cp310-manylinux_2_28_aarch64.whl", hash = "sha256:66732e39eaa2247996a6b04c8aa33e3503d351831424cdf8d2e9a0582ac54b34"},
-    {file = "pyarrow-19.0.0-cp310-cp310-manylinux_2_28_x86_64.whl", hash = "sha256:e675a3ad4732b92d72e4d24009707e923cab76b0d088e5054914f11a797ebe44"},
-    {file = "pyarrow-19.0.0-cp310-cp310-win_amd64.whl", hash = "sha256:f094742275586cdd6b1a03655ccff3b24b2610c3af76f810356c4c71d24a2a6c"},
-    {file = "pyarrow-19.0.0-cp311-cp311-macosx_12_0_arm64.whl", hash = "sha256:8e3a839bf36ec03b4315dc924d36dcde5444a50066f1c10f8290293c0427b46a"},
-    {file = "pyarrow-19.0.0-cp311-cp311-macosx_12_0_x86_64.whl", hash = "sha256:ce42275097512d9e4e4a39aade58ef2b3798a93aa3026566b7892177c266f735"},
-    {file = "pyarrow-19.0.0-cp311-cp311-manylinux_2_17_aarch64.manylinux2014_aarch64.whl", hash = "sha256:9348a0137568c45601b031a8d118275069435f151cbb77e6a08a27e8125f59d4"},
-    {file = "pyarrow-19.0.0-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl", hash = "sha256:2a0144a712d990d60f7f42b7a31f0acaccf4c1e43e957f7b1ad58150d6f639c1"},
-    {file = "pyarrow-19.0.0-cp311-cp311-manylinux_2_28_aarch64.whl", hash = "sha256:2a1a109dfda558eb011e5f6385837daffd920d54ca00669f7a11132d0b1e6042"},
-    {file = "pyarrow-19.0.0-cp311-cp311-manylinux_2_28_x86_64.whl", hash = "sha256:be686bf625aa7b9bada18defb3a3ea3981c1099697239788ff111d87f04cd263"},
-    {file = "pyarrow-19.0.0-cp311-cp311-win_amd64.whl", hash = "sha256:239ca66d9a05844bdf5af128861af525e14df3c9591bcc05bac25918e650d3a2"},
-    {file = "pyarrow-19.0.0-cp312-cp312-macosx_12_0_arm64.whl", hash = "sha256:a7bbe7109ab6198688b7079cbad5a8c22de4d47c4880d8e4847520a83b0d1b68"},
-    {file = "pyarrow-19.0.0-cp312-cp312-macosx_12_0_x86_64.whl", hash = "sha256:4624c89d6f777c580e8732c27bb8e77fd1433b89707f17c04af7635dd9638351"},
-    {file = "pyarrow-19.0.0-cp312-cp312-manylinux_2_17_aarch64.manylinux2014_aarch64.whl", hash = "sha256:2b6d3ce4288793350dc2d08d1e184fd70631ea22a4ff9ea5c4ff182130249d9b"},
-    {file = "pyarrow-19.0.0-cp312-cp312-manylinux_2_17_x86_64.manylinux2014_x86_64.whl", hash = "sha256:450a7d27e840e4d9a384b5c77199d489b401529e75a3b7a3799d4cd7957f2f9c"},
-    {file = "pyarrow-19.0.0-cp312-cp312-manylinux_2_28_aarch64.whl", hash = "sha256:a08e2a8a039a3f72afb67a6668180f09fddaa38fe0d21f13212b4aba4b5d2451"},
-    {file = "pyarrow-19.0.0-cp312-cp312-manylinux_2_28_x86_64.whl", hash = "sha256:f43f5aef2a13d4d56adadae5720d1fed4c1356c993eda8b59dace4b5983843c1"},
-    {file = "pyarrow-19.0.0-cp312-cp312-win_amd64.whl", hash = "sha256:2f672f5364b2d7829ef7c94be199bb88bf5661dd485e21d2d37de12ccb78a136"},
-    {file = "pyarrow-19.0.0-cp313-cp313-macosx_12_0_arm64.whl", hash = "sha256:cf3bf0ce511b833f7bc5f5bb3127ba731e97222023a444b7359f3a22e2a3b463"},
-    {file = "pyarrow-19.0.0-cp313-cp313-macosx_12_0_x86_64.whl", hash = "sha256:4d8b0c0de0a73df1f1bf439af1b60f273d719d70648e898bc077547649bb8352"},
-    {file = "pyarrow-19.0.0-cp313-cp313-manylinux_2_17_aarch64.manylinux2014_aarch64.whl", hash = "sha256:a92aff08e23d281c69835e4a47b80569242a504095ef6a6223c1f6bb8883431d"},
-    {file = "pyarrow-19.0.0-cp313-cp313-manylinux_2_17_x86_64.manylinux2014_x86_64.whl", hash = "sha256:c3b78eff5968a1889a0f3bc81ca57e1e19b75f664d9c61a42a604bf9d8402aae"},
-    {file = "pyarrow-19.0.0-cp313-cp313-manylinux_2_28_aarch64.whl", hash = "sha256:b34d3bde38eba66190b215bae441646330f8e9da05c29e4b5dd3e41bde701098"},
-    {file = "pyarrow-19.0.0-cp313-cp313-manylinux_2_28_x86_64.whl", hash = "sha256:5418d4d0fab3a0ed497bad21d17a7973aad336d66ad4932a3f5f7480d4ca0c04"},
-    {file = "pyarrow-19.0.0-cp313-cp313-win_amd64.whl", hash = "sha256:e82c3d5e44e969c217827b780ed8faf7ac4c53f934ae9238872e749fa531f7c9"},
-    {file = "pyarrow-19.0.0-cp313-cp313t-macosx_12_0_arm64.whl", hash = "sha256:f208c3b58a6df3b239e0bb130e13bc7487ed14f39a9ff357b6415e3f6339b560"},
-    {file = "pyarrow-19.0.0-cp313-cp313t-macosx_12_0_x86_64.whl", hash = "sha256:c751c1c93955b7a84c06794df46f1cec93e18610dcd5ab7d08e89a81df70a849"},
-    {file = "pyarrow-19.0.0-cp313-cp313t-manylinux_2_17_aarch64.manylinux2014_aarch64.whl", hash = "sha256:b903afaa5df66d50fc38672ad095806443b05f202c792694f3a604ead7c6ea6e"},
-    {file = "pyarrow-19.0.0-cp313-cp313t-manylinux_2_17_x86_64.manylinux2014_x86_64.whl", hash = "sha256:a22a4bc0937856263df8b94f2f2781b33dd7f876f787ed746608e06902d691a5"},
-    {file = "pyarrow-19.0.0-cp313-cp313t-manylinux_2_28_aarch64.whl", hash = "sha256:5e8a28b918e2e878c918f6d89137386c06fe577cd08d73a6be8dafb317dc2d73"},
-    {file = "pyarrow-19.0.0-cp313-cp313t-manylinux_2_28_x86_64.whl", hash = "sha256:29cd86c8001a94f768f79440bf83fee23963af5e7bc68ce3a7e5f120e17edf89"},
-    {file = "pyarrow-19.0.0-cp39-cp39-macosx_12_0_arm64.whl", hash = "sha256:c0423393e4a07ff6fea08feb44153302dd261d0551cc3b538ea7a5dc853af43a"},
-    {file = "pyarrow-19.0.0-cp39-cp39-macosx_12_0_x86_64.whl", hash = "sha256:718947fb6d82409013a74b176bf93e0f49ef952d8a2ecd068fecd192a97885b7"},
-    {file = "pyarrow-19.0.0-cp39-cp39-manylinux_2_17_aarch64.manylinux2014_aarch64.whl", hash = "sha256:3c1c162c4660e0978411a4761f91113dde8da3433683efa473501254563dcbe8"},
-    {file = "pyarrow-19.0.0-cp39-cp39-manylinux_2_17_x86_64.manylinux2014_x86_64.whl", hash = "sha256:c73268cf557e688efb60f1ccbc7376f7e18cd8e2acae9e663e98b194c40c1a2d"},
-    {file = "pyarrow-19.0.0-cp39-cp39-manylinux_2_28_aarch64.whl", hash = "sha256:edfe6d3916e915ada9acc4e48f6dafca7efdbad2e6283db6fd9385a1b23055f1"},
-    {file = "pyarrow-19.0.0-cp39-cp39-manylinux_2_28_x86_64.whl", hash = "sha256:da410b70a7ab8eb524112f037a7a35da7128b33d484f7671a264a4c224ac131d"},
-    {file = "pyarrow-19.0.0-cp39-cp39-win_amd64.whl", hash = "sha256:597360ffc71fc8cceea1aec1fb60cb510571a744fffc87db33d551d5de919bec"},
-    {file = "pyarrow-19.0.0.tar.gz", hash = "sha256:8d47c691765cf497aaeed4954d226568563f1b3b74ff61139f2d77876717084b"},
-]
-
-[package.extras]
-test = ["cffi", "hypothesis", "pandas", "pytest", "pytz"]
-
 [[package]]
 name = "pycparser"
-version = "2.22"
+version = "3.0"
 description = "C parser in Python"
 optional = false
-python-versions = ">=3.8"
+python-versions = ">=3.10"
 groups = ["main"]
+markers = "implementation_name != \"PyPy\" and platform_python_implementation != \"PyPy\""
 files = [
-    {file = "pycparser-2.22-py3-none-any.whl", hash = "sha256:c3702b6d3dd8c7abc1afa565d7e63d53a1d0bd86cdc24edd75470f4de499cfcc"},
-    {file = "pycparser-2.22.tar.gz", hash = "sha256:491c8be9c040f5390f5bf44a5b07752bd07f56edf992381b05c701439eec10f6"},
+    {file = "pycparser-3.0-py3-none-any.whl", hash = "sha256:b727414169a36b7d524c1c3e31839a521725078d7b2ff038656844266160a992"},
+    {file = "pycparser-3.0.tar.gz", hash = "sha256:600f49d217304a5902ac3c37e1281c9fe94e4d0489de643a9504c5cdfdfc6b29"},
 ]
 
 [[package]]
 name = "pydantic"
-version = "2.10.5"
+version = "2.13.4"
 description = "Data validation using Python type hints"
 optional = false
-python-versions = ">=3.8"
-groups = ["main"]
+python-versions = ">=3.9"
+groups = ["main", "training"]
 files = [
-    {file = "pydantic-2.10.5-py3-none-any.whl", hash = "sha256:4dd4e322dbe55472cb7ca7e73f4b63574eecccf2835ffa2af9021ce113c83c53"},
-    {file = "pydantic-2.10.5.tar.gz", hash = "sha256:278b38dbbaec562011d659ee05f63346951b3a248a6f3642e1bc68894ea2b4ff"},
+    {file = "pydantic-2.13.4-py3-none-any.whl", hash = "sha256:45a282cde31d808236fd7ea9d919b128653c8b38b393d1c4ab335c62924d9aba"},
+    {file = "pydantic-2.13.4.tar.gz", hash = "sha256:c40756b57adaa8b1efeeced5c196f3f3b7c435f90e84ea7f443901bec8099ef6"},
 ]
 
 [package.dependencies]
 annotated-types = ">=0.6.0"
-pydantic-core = "2.27.2"
-typing-extensions = ">=4.12.2"
+pydantic-core = "2.46.4"
+typing-extensions = ">=4.14.1"
+typing-inspection = ">=0.4.2"
 
 [package.extras]
 email = ["email-validator (>=2.0.0)"]
@@ -3773,258 +3951,210 @@ timezone = ["tzdata ; python_version >= \"3.9\" and platform_system == \"Windows
 
 [[package]]
 name = "pydantic-core"
-version = "2.27.2"
+version = "2.46.4"
 description = "Core functionality for Pydantic validation and serialization"
 optional = false
-python-versions = ">=3.8"
-groups = ["main"]
-files = [
-    {file = "pydantic_core-2.27.2-cp310-cp310-macosx_10_12_x86_64.whl", hash = "sha256:2d367ca20b2f14095a8f4fa1210f5a7b78b8a20009ecced6b12818f455b1e9fa"},
-    {file = "pydantic_core-2.27.2-cp310-cp310-macosx_11_0_arm64.whl", hash = "sha256:491a2b73db93fab69731eaee494f320faa4e093dbed776be1a829c2eb222c34c"},
-    {file = "pydantic_core-2.27.2-cp310-cp310-manylinux_2_17_aarch64.manylinux2014_aarch64.whl", hash = "sha256:7969e133a6f183be60e9f6f56bfae753585680f3b7307a8e555a948d443cc05a"},
-    {file = "pydantic_core-2.27.2-cp310-cp310-manylinux_2_17_armv7l.manylinux2014_armv7l.whl", hash = "sha256:3de9961f2a346257caf0aa508a4da705467f53778e9ef6fe744c038119737ef5"},
-    {file = "pydantic_core-2.27.2-cp310-cp310-manylinux_2_17_ppc64le.manylinux2014_ppc64le.whl", hash = "sha256:e2bb4d3e5873c37bb3dd58714d4cd0b0e6238cebc4177ac8fe878f8b3aa8e74c"},
-    {file = "pydantic_core-2.27.2-cp310-cp310-manylinux_2_17_s390x.manylinux2014_s390x.whl", hash = "sha256:280d219beebb0752699480fe8f1dc61ab6615c2046d76b7ab7ee38858de0a4e7"},
-    {file = "pydantic_core-2.27.2-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl", hash = "sha256:47956ae78b6422cbd46f772f1746799cbb862de838fd8d1fbd34a82e05b0983a"},
-    {file = "pydantic_core-2.27.2-cp310-cp310-manylinux_2_5_i686.manylinux1_i686.whl", hash = "sha256:14d4a5c49d2f009d62a2a7140d3064f686d17a5d1a268bc641954ba181880236"},
-    {file = "pydantic_core-2.27.2-cp310-cp310-musllinux_1_1_aarch64.whl", hash = "sha256:337b443af21d488716f8d0b6164de833e788aa6bd7e3a39c005febc1284f4962"},
-    {file = "pydantic_core-2.27.2-cp310-cp310-musllinux_1_1_armv7l.whl", hash = "sha256:03d0f86ea3184a12f41a2d23f7ccb79cdb5a18e06993f8a45baa8dfec746f0e9"},
-    {file = "pydantic_core-2.27.2-cp310-cp310-musllinux_1_1_x86_64.whl", hash = "sha256:7041c36f5680c6e0f08d922aed302e98b3745d97fe1589db0a3eebf6624523af"},
-    {file = "pydantic_core-2.27.2-cp310-cp310-win32.whl", hash = "sha256:50a68f3e3819077be2c98110c1f9dcb3817e93f267ba80a2c05bb4f8799e2ff4"},
-    {file = "pydantic_core-2.27.2-cp310-cp310-win_amd64.whl", hash = "sha256:e0fd26b16394ead34a424eecf8a31a1f5137094cabe84a1bcb10fa6ba39d3d31"},
-    {file = "pydantic_core-2.27.2-cp311-cp311-macosx_10_12_x86_64.whl", hash = "sha256:8e10c99ef58cfdf2a66fc15d66b16c4a04f62bca39db589ae8cba08bc55331bc"},
-    {file = "pydantic_core-2.27.2-cp311-cp311-macosx_11_0_arm64.whl", hash = "sha256:26f32e0adf166a84d0cb63be85c562ca8a6fa8de28e5f0d92250c6b7e9e2aff7"},
-    {file = "pydantic_core-2.27.2-cp311-cp311-manylinux_2_17_aarch64.manylinux2014_aarch64.whl", hash = "sha256:8c19d1ea0673cd13cc2f872f6c9ab42acc4e4f492a7ca9d3795ce2b112dd7e15"},
-    {file = "pydantic_core-2.27.2-cp311-cp311-manylinux_2_17_armv7l.manylinux2014_armv7l.whl", hash = "sha256:5e68c4446fe0810e959cdff46ab0a41ce2f2c86d227d96dc3847af0ba7def306"},
-    {file = "pydantic_core-2.27.2-cp311-cp311-manylinux_2_17_ppc64le.manylinux2014_ppc64le.whl", hash = "sha256:d9640b0059ff4f14d1f37321b94061c6db164fbe49b334b31643e0528d100d99"},
-    {file = "pydantic_core-2.27.2-cp311-cp311-manylinux_2_17_s390x.manylinux2014_s390x.whl", hash = "sha256:40d02e7d45c9f8af700f3452f329ead92da4c5f4317ca9b896de7ce7199ea459"},
-    {file = "pydantic_core-2.27.2-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl", hash = "sha256:1c1fd185014191700554795c99b347d64f2bb637966c4cfc16998a0ca700d048"},
-    {file = "pydantic_core-2.27.2-cp311-cp311-manylinux_2_5_i686.manylinux1_i686.whl", hash = "sha256:d81d2068e1c1228a565af076598f9e7451712700b673de8f502f0334f281387d"},
-    {file = "pydantic_core-2.27.2-cp311-cp311-musllinux_1_1_aarch64.whl", hash = "sha256:1a4207639fb02ec2dbb76227d7c751a20b1a6b4bc52850568e52260cae64ca3b"},
-    {file = "pydantic_core-2.27.2-cp311-cp311-musllinux_1_1_armv7l.whl", hash = "sha256:3de3ce3c9ddc8bbd88f6e0e304dea0e66d843ec9de1b0042b0911c1663ffd474"},
-    {file = "pydantic_core-2.27.2-cp311-cp311-musllinux_1_1_x86_64.whl", hash = "sha256:30c5f68ded0c36466acede341551106821043e9afaad516adfb6e8fa80a4e6a6"},
-    {file = "pydantic_core-2.27.2-cp311-cp311-win32.whl", hash = "sha256:c70c26d2c99f78b125a3459f8afe1aed4d9687c24fd677c6a4436bc042e50d6c"},
-    {file = "pydantic_core-2.27.2-cp311-cp311-win_amd64.whl", hash = "sha256:08e125dbdc505fa69ca7d9c499639ab6407cfa909214d500897d02afb816e7cc"},
-    {file = "pydantic_core-2.27.2-cp311-cp311-win_arm64.whl", hash = "sha256:26f0d68d4b235a2bae0c3fc585c585b4ecc51382db0e3ba402a22cbc440915e4"},
-    {file = "pydantic_core-2.27.2-cp312-cp312-macosx_10_12_x86_64.whl", hash = "sha256:9e0c8cfefa0ef83b4da9588448b6d8d2a2bf1a53c3f1ae5fca39eb3061e2f0b0"},
-    {file = "pydantic_core-2.27.2-cp312-cp312-macosx_11_0_arm64.whl", hash = "sha256:83097677b8e3bd7eaa6775720ec8e0405f1575015a463285a92bfdfe254529ef"},
-    {file = "pydantic_core-2.27.2-cp312-cp312-manylinux_2_17_aarch64.manylinux2014_aarch64.whl", hash = "sha256:172fce187655fece0c90d90a678424b013f8fbb0ca8b036ac266749c09438cb7"},
-    {file = "pydantic_core-2.27.2-cp312-cp312-manylinux_2_17_armv7l.manylinux2014_armv7l.whl", hash = "sha256:519f29f5213271eeeeb3093f662ba2fd512b91c5f188f3bb7b27bc5973816934"},
-    {file = "pydantic_core-2.27.2-cp312-cp312-manylinux_2_17_ppc64le.manylinux2014_ppc64le.whl", hash = "sha256:05e3a55d124407fffba0dd6b0c0cd056d10e983ceb4e5dbd10dda135c31071d6"},
-    {file = "pydantic_core-2.27.2-cp312-cp312-manylinux_2_17_s390x.manylinux2014_s390x.whl", hash = "sha256:9c3ed807c7b91de05e63930188f19e921d1fe90de6b4f5cd43ee7fcc3525cb8c"},
-    {file = "pydantic_core-2.27.2-cp312-cp312-manylinux_2_17_x86_64.manylinux2014_x86_64.whl", hash = "sha256:6fb4aadc0b9a0c063206846d603b92030eb6f03069151a625667f982887153e2"},
-    {file = "pydantic_core-2.27.2-cp312-cp312-manylinux_2_5_i686.manylinux1_i686.whl", hash = "sha256:28ccb213807e037460326424ceb8b5245acb88f32f3d2777427476e1b32c48c4"},
-    {file = "pydantic_core-2.27.2-cp312-cp312-musllinux_1_1_aarch64.whl", hash = "sha256:de3cd1899e2c279b140adde9357c4495ed9d47131b4a4eaff9052f23398076b3"},
-    {file = "pydantic_core-2.27.2-cp312-cp312-musllinux_1_1_armv7l.whl", hash = "sha256:220f892729375e2d736b97d0e51466252ad84c51857d4d15f5e9692f9ef12be4"},
-    {file = "pydantic_core-2.27.2-cp312-cp312-musllinux_1_1_x86_64.whl", hash = "sha256:a0fcd29cd6b4e74fe8ddd2c90330fd8edf2e30cb52acda47f06dd615ae72da57"},
-    {file = "pydantic_core-2.27.2-cp312-cp312-win32.whl", hash = "sha256:1e2cb691ed9834cd6a8be61228471d0a503731abfb42f82458ff27be7b2186fc"},
-    {file = "pydantic_core-2.27.2-cp312-cp312-win_amd64.whl", hash = "sha256:cc3f1a99a4f4f9dd1de4fe0312c114e740b5ddead65bb4102884b384c15d8bc9"},
-    {file = "pydantic_core-2.27.2-cp312-cp312-win_arm64.whl", hash = "sha256:3911ac9284cd8a1792d3cb26a2da18f3ca26c6908cc434a18f730dc0db7bfa3b"},
-    {file = "pydantic_core-2.27.2-cp313-cp313-macosx_10_12_x86_64.whl", hash = "sha256:7d14bd329640e63852364c306f4d23eb744e0f8193148d4044dd3dacdaacbd8b"},
-    {file = "pydantic_core-2.27.2-cp313-cp313-macosx_11_0_arm64.whl", hash = "sha256:82f91663004eb8ed30ff478d77c4d1179b3563df6cdb15c0817cd1cdaf34d154"},
-    {file = "pydantic_core-2.27.2-cp313-cp313-manylinux_2_17_aarch64.manylinux2014_aarch64.whl", hash = "sha256:71b24c7d61131bb83df10cc7e687433609963a944ccf45190cfc21e0887b08c9"},
-    {file = "pydantic_core-2.27.2-cp313-cp313-manylinux_2_17_armv7l.manylinux2014_armv7l.whl", hash = "sha256:fa8e459d4954f608fa26116118bb67f56b93b209c39b008277ace29937453dc9"},
-    {file = "pydantic_core-2.27.2-cp313-cp313-manylinux_2_17_ppc64le.manylinux2014_ppc64le.whl", hash = "sha256:ce8918cbebc8da707ba805b7fd0b382816858728ae7fe19a942080c24e5b7cd1"},
-    {file = "pydantic_core-2.27.2-cp313-cp313-manylinux_2_17_s390x.manylinux2014_s390x.whl", hash = "sha256:eda3f5c2a021bbc5d976107bb302e0131351c2ba54343f8a496dc8783d3d3a6a"},
-    {file = "pydantic_core-2.27.2-cp313-cp313-manylinux_2_17_x86_64.manylinux2014_x86_64.whl", hash = "sha256:bd8086fa684c4775c27f03f062cbb9eaa6e17f064307e86b21b9e0abc9c0f02e"},
-    {file = "pydantic_core-2.27.2-cp313-cp313-manylinux_2_5_i686.manylinux1_i686.whl", hash = "sha256:8d9b3388db186ba0c099a6d20f0604a44eabdeef1777ddd94786cdae158729e4"},
-    {file = "pydantic_core-2.27.2-cp313-cp313-musllinux_1_1_aarch64.whl", hash = "sha256:7a66efda2387de898c8f38c0cf7f14fca0b51a8ef0b24bfea5849f1b3c95af27"},
-    {file = "pydantic_core-2.27.2-cp313-cp313-musllinux_1_1_armv7l.whl", hash = "sha256:18a101c168e4e092ab40dbc2503bdc0f62010e95d292b27827871dc85450d7ee"},
-    {file = "pydantic_core-2.27.2-cp313-cp313-musllinux_1_1_x86_64.whl", hash = "sha256:ba5dd002f88b78a4215ed2f8ddbdf85e8513382820ba15ad5ad8955ce0ca19a1"},
-    {file = "pydantic_core-2.27.2-cp313-cp313-win32.whl", hash = "sha256:1ebaf1d0481914d004a573394f4be3a7616334be70261007e47c2a6fe7e50130"},
-    {file = "pydantic_core-2.27.2-cp313-cp313-win_amd64.whl", hash = "sha256:953101387ecf2f5652883208769a79e48db18c6df442568a0b5ccd8c2723abee"},
-    {file = "pydantic_core-2.27.2-cp313-cp313-win_arm64.whl", hash = "sha256:ac4dbfd1691affb8f48c2c13241a2e3b60ff23247cbcf981759c768b6633cf8b"},
-    {file = "pydantic_core-2.27.2-cp38-cp38-macosx_10_12_x86_64.whl", hash = "sha256:d3e8d504bdd3f10835468f29008d72fc8359d95c9c415ce6e767203db6127506"},
-    {file = "pydantic_core-2.27.2-cp38-cp38-macosx_11_0_arm64.whl", hash = "sha256:521eb9b7f036c9b6187f0b47318ab0d7ca14bd87f776240b90b21c1f4f149320"},
-    {file = "pydantic_core-2.27.2-cp38-cp38-manylinux_2_17_aarch64.manylinux2014_aarch64.whl", hash = "sha256:85210c4d99a0114f5a9481b44560d7d1e35e32cc5634c656bc48e590b669b145"},
-    {file = "pydantic_core-2.27.2-cp38-cp38-manylinux_2_17_armv7l.manylinux2014_armv7l.whl", hash = "sha256:d716e2e30c6f140d7560ef1538953a5cd1a87264c737643d481f2779fc247fe1"},
-    {file = "pydantic_core-2.27.2-cp38-cp38-manylinux_2_17_ppc64le.manylinux2014_ppc64le.whl", hash = "sha256:f66d89ba397d92f840f8654756196d93804278457b5fbede59598a1f9f90b228"},
-    {file = "pydantic_core-2.27.2-cp38-cp38-manylinux_2_17_s390x.manylinux2014_s390x.whl", hash = "sha256:669e193c1c576a58f132e3158f9dfa9662969edb1a250c54d8fa52590045f046"},
-    {file = "pydantic_core-2.27.2-cp38-cp38-manylinux_2_17_x86_64.manylinux2014_x86_64.whl", hash = "sha256:9fdbe7629b996647b99c01b37f11170a57ae675375b14b8c13b8518b8320ced5"},
-    {file = "pydantic_core-2.27.2-cp38-cp38-manylinux_2_5_i686.manylinux1_i686.whl", hash = "sha256:d262606bf386a5ba0b0af3b97f37c83d7011439e3dc1a9298f21efb292e42f1a"},
-    {file = "pydantic_core-2.27.2-cp38-cp38-musllinux_1_1_aarch64.whl", hash = "sha256:cabb9bcb7e0d97f74df8646f34fc76fbf793b7f6dc2438517d7a9e50eee4f14d"},
-    {file = "pydantic_core-2.27.2-cp38-cp38-musllinux_1_1_armv7l.whl", hash = "sha256:d2d63f1215638d28221f664596b1ccb3944f6e25dd18cd3b86b0a4c408d5ebb9"},
-    {file = "pydantic_core-2.27.2-cp38-cp38-musllinux_1_1_x86_64.whl", hash = "sha256:bca101c00bff0adb45a833f8451b9105d9df18accb8743b08107d7ada14bd7da"},
-    {file = "pydantic_core-2.27.2-cp38-cp38-win32.whl", hash = "sha256:f6f8e111843bbb0dee4cb6594cdc73e79b3329b526037ec242a3e49012495b3b"},
-    {file = "pydantic_core-2.27.2-cp38-cp38-win_amd64.whl", hash = "sha256:fd1aea04935a508f62e0d0ef1f5ae968774a32afc306fb8545e06f5ff5cdf3ad"},
-    {file = "pydantic_core-2.27.2-cp39-cp39-macosx_10_12_x86_64.whl", hash = "sha256:c10eb4f1659290b523af58fa7cffb452a61ad6ae5613404519aee4bfbf1df993"},
-    {file = "pydantic_core-2.27.2-cp39-cp39-macosx_11_0_arm64.whl", hash = "sha256:ef592d4bad47296fb11f96cd7dc898b92e795032b4894dfb4076cfccd43a9308"},
-    {file = "pydantic_core-2.27.2-cp39-cp39-manylinux_2_17_aarch64.manylinux2014_aarch64.whl", hash = "sha256:c61709a844acc6bf0b7dce7daae75195a10aac96a596ea1b776996414791ede4"},
-    {file = "pydantic_core-2.27.2-cp39-cp39-manylinux_2_17_armv7l.manylinux2014_armv7l.whl", hash = "sha256:42c5f762659e47fdb7b16956c71598292f60a03aa92f8b6351504359dbdba6cf"},
-    {file = "pydantic_core-2.27.2-cp39-cp39-manylinux_2_17_ppc64le.manylinux2014_ppc64le.whl", hash = "sha256:4c9775e339e42e79ec99c441d9730fccf07414af63eac2f0e48e08fd38a64d76"},
-    {file = "pydantic_core-2.27.2-cp39-cp39-manylinux_2_17_s390x.manylinux2014_s390x.whl", hash = "sha256:57762139821c31847cfb2df63c12f725788bd9f04bc2fb392790959b8f70f118"},
-    {file = "pydantic_core-2.27.2-cp39-cp39-manylinux_2_17_x86_64.manylinux2014_x86_64.whl", hash = "sha256:0d1e85068e818c73e048fe28cfc769040bb1f475524f4745a5dc621f75ac7630"},
-    {file = "pydantic_core-2.27.2-cp39-cp39-manylinux_2_5_i686.manylinux1_i686.whl", hash = "sha256:097830ed52fd9e427942ff3b9bc17fab52913b2f50f2880dc4a5611446606a54"},
-    {file = "pydantic_core-2.27.2-cp39-cp39-musllinux_1_1_aarch64.whl", hash = "sha256:044a50963a614ecfae59bb1eaf7ea7efc4bc62f49ed594e18fa1e5d953c40e9f"},
-    {file = "pydantic_core-2.27.2-cp39-cp39-musllinux_1_1_armv7l.whl", hash = "sha256:4e0b4220ba5b40d727c7f879eac379b822eee5d8fff418e9d3381ee45b3b0362"},
-    {file = "pydantic_core-2.27.2-cp39-cp39-musllinux_1_1_x86_64.whl", hash = "sha256:5e4f4bb20d75e9325cc9696c6802657b58bc1dbbe3022f32cc2b2b632c3fbb96"},
-    {file = "pydantic_core-2.27.2-cp39-cp39-win32.whl", hash = "sha256:cca63613e90d001b9f2f9a9ceb276c308bfa2a43fafb75c8031c4f66039e8c6e"},
-    {file = "pydantic_core-2.27.2-cp39-cp39-win_amd64.whl", hash = "sha256:77d1bca19b0f7021b3a982e6f903dcd5b2b06076def36a652e3907f596e29f67"},
-    {file = "pydantic_core-2.27.2-pp310-pypy310_pp73-macosx_10_12_x86_64.whl", hash = "sha256:2bf14caea37e91198329b828eae1618c068dfb8ef17bb33287a7ad4b61ac314e"},
-    {file = "pydantic_core-2.27.2-pp310-pypy310_pp73-macosx_11_0_arm64.whl", hash = "sha256:b0cb791f5b45307caae8810c2023a184c74605ec3bcbb67d13846c28ff731ff8"},
-    {file = "pydantic_core-2.27.2-pp310-pypy310_pp73-manylinux_2_17_aarch64.manylinux2014_aarch64.whl", hash = "sha256:688d3fd9fcb71f41c4c015c023d12a79d1c4c0732ec9eb35d96e3388a120dcf3"},
-    {file = "pydantic_core-2.27.2-pp310-pypy310_pp73-manylinux_2_17_x86_64.manylinux2014_x86_64.whl", hash = "sha256:3d591580c34f4d731592f0e9fe40f9cc1b430d297eecc70b962e93c5c668f15f"},
-    {file = "pydantic_core-2.27.2-pp310-pypy310_pp73-manylinux_2_5_i686.manylinux1_i686.whl", hash = "sha256:82f986faf4e644ffc189a7f1aafc86e46ef70372bb153e7001e8afccc6e54133"},
-    {file = "pydantic_core-2.27.2-pp310-pypy310_pp73-musllinux_1_1_aarch64.whl", hash = "sha256:bec317a27290e2537f922639cafd54990551725fc844249e64c523301d0822fc"},
-    {file = "pydantic_core-2.27.2-pp310-pypy310_pp73-musllinux_1_1_armv7l.whl", hash = "sha256:0296abcb83a797db256b773f45773da397da75a08f5fcaef41f2044adec05f50"},
-    {file = "pydantic_core-2.27.2-pp310-pypy310_pp73-musllinux_1_1_x86_64.whl", hash = "sha256:0d75070718e369e452075a6017fbf187f788e17ed67a3abd47fa934d001863d9"},
-    {file = "pydantic_core-2.27.2-pp310-pypy310_pp73-win_amd64.whl", hash = "sha256:7e17b560be3c98a8e3aa66ce828bdebb9e9ac6ad5466fba92eb74c4c95cb1151"},
-    {file = "pydantic_core-2.27.2-pp39-pypy39_pp73-macosx_10_12_x86_64.whl", hash = "sha256:c33939a82924da9ed65dab5a65d427205a73181d8098e79b6b426bdf8ad4e656"},
-    {file = "pydantic_core-2.27.2-pp39-pypy39_pp73-macosx_11_0_arm64.whl", hash = "sha256:00bad2484fa6bda1e216e7345a798bd37c68fb2d97558edd584942aa41b7d278"},
-    {file = "pydantic_core-2.27.2-pp39-pypy39_pp73-manylinux_2_17_aarch64.manylinux2014_aarch64.whl", hash = "sha256:c817e2b40aba42bac6f457498dacabc568c3b7a986fc9ba7c8d9d260b71485fb"},
-    {file = "pydantic_core-2.27.2-pp39-pypy39_pp73-manylinux_2_17_x86_64.manylinux2014_x86_64.whl", hash = "sha256:251136cdad0cb722e93732cb45ca5299fb56e1344a833640bf93b2803f8d1bfd"},
-    {file = "pydantic_core-2.27.2-pp39-pypy39_pp73-manylinux_2_5_i686.manylinux1_i686.whl", hash = "sha256:d2088237af596f0a524d3afc39ab3b036e8adb054ee57cbb1dcf8e09da5b29cc"},
-    {file = "pydantic_core-2.27.2-pp39-pypy39_pp73-musllinux_1_1_aarch64.whl", hash = "sha256:d4041c0b966a84b4ae7a09832eb691a35aec90910cd2dbe7a208de59be77965b"},
-    {file = "pydantic_core-2.27.2-pp39-pypy39_pp73-musllinux_1_1_armv7l.whl", hash = "sha256:8083d4e875ebe0b864ffef72a4304827015cff328a1be6e22cc850753bfb122b"},
-    {file = "pydantic_core-2.27.2-pp39-pypy39_pp73-musllinux_1_1_x86_64.whl", hash = "sha256:f141ee28a0ad2123b6611b6ceff018039df17f32ada8b534e6aa039545a3efb2"},
-    {file = "pydantic_core-2.27.2-pp39-pypy39_pp73-win_amd64.whl", hash = "sha256:7d0c8399fcc1848491f00e0314bd59fb34a9c008761bcb422a057670c3f65e35"},
-    {file = "pydantic_core-2.27.2.tar.gz", hash = "sha256:eb026e5a4c1fee05726072337ff51d1efb6f59090b7da90d30ea58625b1ffb39"},
+python-versions = ">=3.9"
+groups = ["main", "training"]
+files = [
+    {file = "pydantic_core-2.46.4-cp310-cp310-macosx_10_12_x86_64.whl", hash = "sha256:a396dcc17e5a0b164dbe026896245a4fa9ff402edca1dff0be3d53a517f74de4"},
+    {file = "pydantic_core-2.46.4-cp310-cp310-macosx_11_0_arm64.whl", hash = "sha256:da4b951fe36dc7c3a1ccb4e3cd1747c3542b8c9ceede8fc86cae054e764485f5"},
+    {file = "pydantic_core-2.46.4-cp310-cp310-manylinux_2_17_aarch64.manylinux2014_aarch64.whl", hash = "sha256:bb63e0198ca18aad131c089b9204c23079c3afa95487e561f4c522d519e55aba"},
+    {file = "pydantic_core-2.46.4-cp310-cp310-manylinux_2_17_armv7l.manylinux2014_armv7l.whl", hash = "sha256:f47286a97f0bc9b8859519809077b91b2cefe4ae47fcbf5e466a009c1c5d742b"},
+    {file = "pydantic_core-2.46.4-cp310-cp310-manylinux_2_17_ppc64le.manylinux2014_ppc64le.whl", hash = "sha256:905a0ed8ea6f2d61c1738835f99b699348d7857379083e5fc497fa0c967a407c"},
+    {file = "pydantic_core-2.46.4-cp310-cp310-manylinux_2_17_s390x.manylinux2014_s390x.whl", hash = "sha256:ea793e075b70290d89d8142074262885d3f7da19634845135751bd6344f73b50"},
+    {file = "pydantic_core-2.46.4-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl", hash = "sha256:395aebd9183f9d112f569aeb5b2214d1a10a33bec8456447f7fbdfa51d38d4cd"},
+    {file = "pydantic_core-2.46.4-cp310-cp310-manylinux_2_31_riscv64.whl", hash = "sha256:b078afbc25f3a1436c7a1d2cd3e322497ee99615ba97c563566fdf46aff1ee01"},
+    {file = "pydantic_core-2.46.4-cp310-cp310-manylinux_2_5_i686.manylinux1_i686.whl", hash = "sha256:f747929cf940cddb5b3668a390056ddd5ba2e5010615ea2dcf4f9c4f3ab8791d"},
+    {file = "pydantic_core-2.46.4-cp310-cp310-musllinux_1_1_aarch64.whl", hash = "sha256:daa27d92c36f24388fe3ad306b174781c747627f134452e4f128ea00ce1fe8c4"},
+    {file = "pydantic_core-2.46.4-cp310-cp310-musllinux_1_1_armv7l.whl", hash = "sha256:19e51f073cd3df251856a8a4189fbdf1de4012c3ebacfb1884f94f1eb406079f"},
+    {file = "pydantic_core-2.46.4-cp310-cp310-musllinux_1_1_x86_64.whl", hash = "sha256:c1747f85cee84c26985853c6f3d9bd3e75da5212912443fa111c113b9c246f39"},
+    {file = "pydantic_core-2.46.4-cp310-cp310-win32.whl", hash = "sha256:2f84c03c8607173d16b5a854ec68a2f9079ae03237a54fb506d13af47e1d018d"},
+    {file = "pydantic_core-2.46.4-cp310-cp310-win_amd64.whl", hash = "sha256:8358a950c8909158e3df31538a7e4edc2d7265a7c54b47f0864d9e5bae9dcebf"},
+    {file = "pydantic_core-2.46.4-cp311-cp311-macosx_10_12_x86_64.whl", hash = "sha256:0e96592440881c74a213e5ad528e2b24d3d4f940de2766bed9010ab1d9e51594"},
+    {file = "pydantic_core-2.46.4-cp311-cp311-macosx_11_0_arm64.whl", hash = "sha256:e0d65b8c354be7fb5f720c3caa8bc940bc2d20ce749c8e06135f07f8ed95dd7c"},
+    {file = "pydantic_core-2.46.4-cp311-cp311-manylinux_2_17_aarch64.manylinux2014_aarch64.whl", hash = "sha256:7bfb192b3f4b9e8a89b6277b6ce787564f62cfd272055f6e685726b111dc7826"},
+    {file = "pydantic_core-2.46.4-cp311-cp311-manylinux_2_17_armv7l.manylinux2014_armv7l.whl", hash = "sha256:9037063db01f09b09e237c282b6792bd4da634b5402c4e7f0c61effed7701a04"},
+    {file = "pydantic_core-2.46.4-cp311-cp311-manylinux_2_17_ppc64le.manylinux2014_ppc64le.whl", hash = "sha256:fc010ab034c8c7452522748bf937df58020d256ccae0874463d1f4d01758af8e"},
+    {file = "pydantic_core-2.46.4-cp311-cp311-manylinux_2_17_s390x.manylinux2014_s390x.whl", hash = "sha256:8c5dac79fa1614d1e06ca695109c6105923bd9c7d1d6c918d4e637b7e6b32fd3"},
+    {file = "pydantic_core-2.46.4-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl", hash = "sha256:f9fa868638bf362d3d138ea55829cefb3d5f4b0d7f142234382a15e2485dbec4"},
+    {file = "pydantic_core-2.46.4-cp311-cp311-manylinux_2_31_riscv64.whl", hash = "sha256:17299feefe090f2caa5b8e37222bb5f663e4935a8bfa6931d4102e5df1a9f398"},
+    {file = "pydantic_core-2.46.4-cp311-cp311-manylinux_2_5_i686.manylinux1_i686.whl", hash = "sha256:4c63ebc82684aa89d9a3bcbd13d515b3be44250dc68dd3bd81526c1cb31286c3"},
+    {file = "pydantic_core-2.46.4-cp311-cp311-musllinux_1_1_aarch64.whl", hash = "sha256:aaa2a54443eff1950ba5ddc6b6ccda0d9c84a364276a62f969bdf2a390650848"},
+    {file = "pydantic_core-2.46.4-cp311-cp311-musllinux_1_1_armv7l.whl", hash = "sha256:18e5ceec2ab67e6d5f1a9085e5a24c9c4e2ac4545730bfe668680bca05e555f3"},
+    {file = "pydantic_core-2.46.4-cp311-cp311-musllinux_1_1_x86_64.whl", hash = "sha256:a0f62d0a58f4e7da165457e995725421e0064f2255d8eccebc49f41bbc23b109"},
+    {file = "pydantic_core-2.46.4-cp311-cp311-win32.whl", hash = "sha256:041bde0a48fd37cf71cab1c9d56d3e8625a3793fef1f7dd232b3ff37e978ecda"},
+    {file = "pydantic_core-2.46.4-cp311-cp311-win_amd64.whl", hash = "sha256:6f2eeda33a839975441c86a4119e1383c50b47faf0cbb5176985565c6bb02c33"},
+    {file = "pydantic_core-2.46.4-cp311-cp311-win_arm64.whl", hash = "sha256:14f4c5d6db102bd796a627bbb3a17b4cf4574b9ae861d8b7c9a9661c6dd3362d"},
+    {file = "pydantic_core-2.46.4-cp312-cp312-macosx_10_12_x86_64.whl", hash = "sha256:3245406455a5d98187ec35530fd772b1d799b26667980872c8d4614991e2c4a2"},
+    {file = "pydantic_core-2.46.4-cp312-cp312-macosx_11_0_arm64.whl", hash = "sha256:962ccbab7b642487b1d8b7df90ef677e03134cf1fd8880bf698649b22a69371f"},
+    {file = "pydantic_core-2.46.4-cp312-cp312-manylinux_2_17_aarch64.manylinux2014_aarch64.whl", hash = "sha256:8233f2947cf85404441fd7e0085f53b10c93e0ee78611099b5c7237e36aacbf7"},
+    {file = "pydantic_core-2.46.4-cp312-cp312-manylinux_2_17_armv7l.manylinux2014_armv7l.whl", hash = "sha256:3a233125ac121aa3ffba9a2b59edfc4a985a76092dc8279586ab4b71390875e7"},
+    {file = "pydantic_core-2.46.4-cp312-cp312-manylinux_2_17_ppc64le.manylinux2014_ppc64le.whl", hash = "sha256:5b712b53160b79a5850310b912a5ef8e57e56947c8ad690c227f5c9d7e561712"},
+    {file = "pydantic_core-2.46.4-cp312-cp312-manylinux_2_17_s390x.manylinux2014_s390x.whl", hash = "sha256:9401557acd873c3a7f3eb9383edef8ac4968f9510e340f4808d427e75667e7b4"},
+    {file = "pydantic_core-2.46.4-cp312-cp312-manylinux_2_17_x86_64.manylinux2014_x86_64.whl", hash = "sha256:926c9541b14b12b1681dca8a0b75feb510b06c6341b70a8e500c2fdcff837cce"},
+    {file = "pydantic_core-2.46.4-cp312-cp312-manylinux_2_31_riscv64.whl", hash = "sha256:56cb4851bcaf3d117eddcef4fe66afd750a50274b0da8e22be256d10e5611987"},
+    {file = "pydantic_core-2.46.4-cp312-cp312-manylinux_2_5_i686.manylinux1_i686.whl", hash = "sha256:c68fcd102d71ea85c5b2dfac3f4f8476eff42a9e078fd5faefff6d145063536b"},
+    {file = "pydantic_core-2.46.4-cp312-cp312-musllinux_1_1_aarch64.whl", hash = "sha256:b2f69dec1725e79a012d920df1707de5caf7ed5e08f3be4435e25803efc47458"},
+    {file = "pydantic_core-2.46.4-cp312-cp312-musllinux_1_1_armv7l.whl", hash = "sha256:8d0820e8192167f80d88d64038e609c31452eeca865b4e1d9950a27a4609b00b"},
+    {file = "pydantic_core-2.46.4-cp312-cp312-musllinux_1_1_x86_64.whl", hash = "sha256:fbdb89b3e1c94a30cc5edfce477c6e6a5dc4d8f84665b455c27582f211a1c72c"},
+    {file = "pydantic_core-2.46.4-cp312-cp312-win32.whl", hash = "sha256:9aa768456404a8bf48a4406685ac2bec8e72b62c69313734fa3b73cf33b3a894"},
+    {file = "pydantic_core-2.46.4-cp312-cp312-win_amd64.whl", hash = "sha256:e9c26f834c65f5752f3f06cb08cb86a913ceb7274d0db6e267808a708b46bc89"},
+    {file = "pydantic_core-2.46.4-cp312-cp312-win_arm64.whl", hash = "sha256:4fc73cb559bdb54b1134a706a2802a4cddd27a0633f5abb7e53056268751ac6a"},
+    {file = "pydantic_core-2.46.4-cp313-cp313-macosx_10_12_x86_64.whl", hash = "sha256:5d5902252db0d3cedf8d4a1bc68f70eeb430f7e4c7104c8c476753519b423008"},
+    {file = "pydantic_core-2.46.4-cp313-cp313-macosx_11_0_arm64.whl", hash = "sha256:c94f0688e7b8d0a67abf40e57a7eaaecd17cc9586706a31b76c031f63df052b4"},
+    {file = "pydantic_core-2.46.4-cp313-cp313-manylinux_2_17_aarch64.manylinux2014_aarch64.whl", hash = "sha256:f027324c56cd5406ca49c124b0db10e56c69064fec039acc571c29020cc87c76"},
+    {file = "pydantic_core-2.46.4-cp313-cp313-manylinux_2_17_armv7l.manylinux2014_armv7l.whl", hash = "sha256:e739fee756ba1010f8bcccb534252e85a35fe45ae92c295a06059ce58b74ccd3"},
+    {file = "pydantic_core-2.46.4-cp313-cp313-manylinux_2_17_ppc64le.manylinux2014_ppc64le.whl", hash = "sha256:9d56801be94b86a9da183e5f3766e6310752b99ff647e38b09a9500d88e46e76"},
+    {file = "pydantic_core-2.46.4-cp313-cp313-manylinux_2_17_s390x.manylinux2014_s390x.whl", hash = "sha256:2412e734dcb48da14d4e4006b82b46b74f2518b8a26ee7e58c6844a6cd6d03c4"},
+    {file = "pydantic_core-2.46.4-cp313-cp313-manylinux_2_17_x86_64.manylinux2014_x86_64.whl", hash = "sha256:9551187363ffc0de2a00b2e47c25aeaeb1020b69b668762966df15fc5659dd5a"},
+    {file = "pydantic_core-2.46.4-cp313-cp313-manylinux_2_31_riscv64.whl", hash = "sha256:0186750b482eefa11d7f435892b09c5c606193ef3375bcf94aa00ae6bfb66262"},
+    {file = "pydantic_core-2.46.4-cp313-cp313-manylinux_2_5_i686.manylinux1_i686.whl", hash = "sha256:5855698a4856556d86e8e6cd8434bc3ac0314ee8e12089ae0e143f64c6256e4e"},
+    {file = "pydantic_core-2.46.4-cp313-cp313-musllinux_1_1_aarch64.whl", hash = "sha256:cbaf13819775b7f769bf4a1f066cb6df7a28d4480081a589828ef190226881cd"},
+    {file = "pydantic_core-2.46.4-cp313-cp313-musllinux_1_1_armv7l.whl", hash = "sha256:633147d34cf4550417f12e2b1a0383973bdf5cdfde212cb09e9a581cf10820be"},
+    {file = "pydantic_core-2.46.4-cp313-cp313-musllinux_1_1_x86_64.whl", hash = "sha256:82cf5301172168103724d49a1444d3378cb20cdee30b116a1bd6031236298a5d"},
+    {file = "pydantic_core-2.46.4-cp313-cp313-win32.whl", hash = "sha256:9fa8ae11da9e2b3126c6426f147e0fba88d96d65921799bb30c6abd1cb2c97fb"},
+    {file = "pydantic_core-2.46.4-cp313-cp313-win_amd64.whl", hash = "sha256:6b3ace8194b0e5204818c92802dcdca7fc6d88aabbb799d7c795540d9cd6d292"},
+    {file = "pydantic_core-2.46.4-cp313-cp313-win_arm64.whl", hash = "sha256:184c081504d17f1c1066e430e117142b2c77d9448a97f7b65c6ac9fd9aee238d"},
+    {file = "pydantic_core-2.46.4-cp314-cp314-macosx_10_12_x86_64.whl", hash = "sha256:428e04521a40150c85216fc8b85e8d39fece235a9cf5e383761238c7fa9b96fb"},
+    {file = "pydantic_core-2.46.4-cp314-cp314-macosx_11_0_arm64.whl", hash = "sha256:23ace664830ee0bfe014a0c7bc248b1f7f25ed7ad103852c317624a1083af462"},
+    {file = "pydantic_core-2.46.4-cp314-cp314-manylinux_2_17_aarch64.manylinux2014_aarch64.whl", hash = "sha256:ce5c1d2a8b27468f433ca974829c44060b8097eedc39933e3c206a90ee49c4a9"},
+    {file = "pydantic_core-2.46.4-cp314-cp314-manylinux_2_17_armv7l.manylinux2014_armv7l.whl", hash = "sha256:7283d57845ecf5a163403eb0702dfc220cc4fbdd18919cb5ccea4f95ee1cdab4"},
+    {file = "pydantic_core-2.46.4-cp314-cp314-manylinux_2_17_ppc64le.manylinux2014_ppc64le.whl", hash = "sha256:8daafc69c93ee8a0204506a3b6b30f586ef54028f52aeeeb5c4cfc5184fd5914"},
+    {file = "pydantic_core-2.46.4-cp314-cp314-manylinux_2_17_s390x.manylinux2014_s390x.whl", hash = "sha256:cd2213145bcc2ba85884d0ac63d222fece9209678f77b9b4d76f054c561adb28"},
+    {file = "pydantic_core-2.46.4-cp314-cp314-manylinux_2_17_x86_64.manylinux2014_x86_64.whl", hash = "sha256:7a5f930472650a82629163023e630d160863fce524c616f4e5186e5de9d9a49b"},
+    {file = "pydantic_core-2.46.4-cp314-cp314-manylinux_2_31_riscv64.whl", hash = "sha256:c1b3f518abeca3aa13c712fd202306e145abf59a18b094a6bafb2d2bbf59192c"},
+    {file = "pydantic_core-2.46.4-cp314-cp314-manylinux_2_5_i686.manylinux1_i686.whl", hash = "sha256:1a7dd0b3ee80d90150e3495a3a13ac34dbcbfd4f012996a6a1d8900e91b5c0fb"},
+    {file = "pydantic_core-2.46.4-cp314-cp314-musllinux_1_1_aarch64.whl", hash = "sha256:3fb702cd90b0446a3a1c5e470bfa0dd23c0233b676a9099ddcc964fa6ca13898"},
+    {file = "pydantic_core-2.46.4-cp314-cp314-musllinux_1_1_armv7l.whl", hash = "sha256:b8458003118a712e66286df6a707db01c52c0f52f7db8e4a38f0da1d3b94fc4e"},
+    {file = "pydantic_core-2.46.4-cp314-cp314-musllinux_1_1_x86_64.whl", hash = "sha256:372429a130e469c9cd698925ce5fc50940b7a1336b0d82038e63d5bbc4edc519"},
+    {file = "pydantic_core-2.46.4-cp314-cp314-win32.whl", hash = "sha256:85bb3611ff1802f3ee7fdd7dbff26b56f343fb432d57a4728fdd49b6ef35e2f4"},
+    {file = "pydantic_core-2.46.4-cp314-cp314-win_amd64.whl", hash = "sha256:811ff8e9c313ab425368bcbb36e5c4ebd7108c2bbf4e4089cfbb0b01eff63fac"},
+    {file = "pydantic_core-2.46.4-cp314-cp314-win_arm64.whl", hash = "sha256:bfec22eab3c8cc2ceec0248aec886624116dc079afa027ecc8ad4a7e62010f8a"},
+    {file = "pydantic_core-2.46.4-cp314-cp314t-macosx_10_12_x86_64.whl", hash = "sha256:af8244b2bef6aaad6d92cda81372de7f8c8d36c9f0c3ea36e827c60e7d9467a0"},
+    {file = "pydantic_core-2.46.4-cp314-cp314t-macosx_11_0_arm64.whl", hash = "sha256:5a4330cdbc57162e4b3aa303f588ba752257694c9c9be3e7ebb11b4aca659b5d"},
+    {file = "pydantic_core-2.46.4-cp314-cp314t-manylinux_2_17_aarch64.manylinux2014_aarch64.whl", hash = "sha256:29c61fc04a3d840155ff08e475a04809278972fe6aef51e2720554e96367e34b"},
+    {file = "pydantic_core-2.46.4-cp314-cp314t-manylinux_2_17_armv7l.manylinux2014_armv7l.whl", hash = "sha256:c50f2528cf200c5eed56faf3f4e22fcd5f38c157a8b78576e6ba3168ec35f000"},
+    {file = "pydantic_core-2.46.4-cp314-cp314t-manylinux_2_17_ppc64le.manylinux2014_ppc64le.whl", hash = "sha256:0cbe8b01f948de4286c74cdd6c667aceb38f5c1e26f0693b3983d9d74887c65e"},
+    {file = "pydantic_core-2.46.4-cp314-cp314t-manylinux_2_17_s390x.manylinux2014_s390x.whl", hash = "sha256:617d7e2ca7dcb8c5cf6bcb8c59b8832c94b36196bbf1cbd1bfb56ed341905edd"},
+    {file = "pydantic_core-2.46.4-cp314-cp314t-manylinux_2_17_x86_64.manylinux2014_x86_64.whl", hash = "sha256:7027560ee92211647d0d34e3f7cd6f50da56399d26a9c8ad0da286d3869a53f3"},
+    {file = "pydantic_core-2.46.4-cp314-cp314t-manylinux_2_31_riscv64.whl", hash = "sha256:f99626688942fb746e545232e7726926f3be91b5975f8b55327665fafda991c7"},
+    {file = "pydantic_core-2.46.4-cp314-cp314t-manylinux_2_5_i686.manylinux1_i686.whl", hash = "sha256:fc3e9034a63de20e15e8ade85358bc6efc614008cab72898b4b4952bea0509ff"},
+    {file = "pydantic_core-2.46.4-cp314-cp314t-musllinux_1_1_aarch64.whl", hash = "sha256:97e7cf2be5c77b7d1a9713a05605d49460d02c6078d38d8bef3cbe323c548424"},
+    {file = "pydantic_core-2.46.4-cp314-cp314t-musllinux_1_1_armv7l.whl", hash = "sha256:3bf92c5d0e00fefaab325a4d27828fe6b6e2a21848686b5b60d2d9eeb09d76c6"},
+    {file = "pydantic_core-2.46.4-cp314-cp314t-musllinux_1_1_x86_64.whl", hash = "sha256:3ecbc122d18468d06ca279dc26a8c2e2d5acb10943bb35e36ae92096dc3b5565"},
+    {file = "pydantic_core-2.46.4-cp314-cp314t-win32.whl", hash = "sha256:e846ae7835bf0703ae43f534ab79a867146dadd59dc9ca5c8b53d5c8f7c9ef02"},
+    {file = "pydantic_core-2.46.4-cp314-cp314t-win_amd64.whl", hash = "sha256:2108ba5c1c1eca18030634489dc544844144ee36357f2f9f780b93e7ddbb44b5"},
+    {file = "pydantic_core-2.46.4-cp314-cp314t-win_arm64.whl", hash = "sha256:4fcbe087dbc2068af7eda3aa87634eba216dbda64d1ae73c8684b621d33f6596"},
+    {file = "pydantic_core-2.46.4-cp39-cp39-macosx_10_12_x86_64.whl", hash = "sha256:fd8b3d9fd264be37976686c7f65cd52a83f5e84f4bfd2adf9c1d469676bbb6ae"},
+    {file = "pydantic_core-2.46.4-cp39-cp39-macosx_11_0_arm64.whl", hash = "sha256:9f444c499b3eefd3a92e348059471ea0c3a6e303d9c1cec09fa748fd9f895201"},
+    {file = "pydantic_core-2.46.4-cp39-cp39-manylinux_2_17_aarch64.manylinux2014_aarch64.whl", hash = "sha256:3447661d99f75a3683a4cf5c87da72f2161964611864dbbeac7fbb118bb4bfc0"},
+    {file = "pydantic_core-2.46.4-cp39-cp39-manylinux_2_17_armv7l.manylinux2014_armv7l.whl", hash = "sha256:8b9bab013d1c7a79d3501ff86d0bc9c31bf587db4551677b96bec07df78c6b15"},
+    {file = "pydantic_core-2.46.4-cp39-cp39-manylinux_2_17_ppc64le.manylinux2014_ppc64le.whl", hash = "sha256:d995260fdf4e1db774581b4900e0f832abe3c7c84996726bbc161b19c8f29e76"},
+    {file = "pydantic_core-2.46.4-cp39-cp39-manylinux_2_17_s390x.manylinux2014_s390x.whl", hash = "sha256:f13a646d65d09fbf1bc6b3a9635d30095c8e7e5cc419ff35ecc563c5fd04cd49"},
+    {file = "pydantic_core-2.46.4-cp39-cp39-manylinux_2_17_x86_64.manylinux2014_x86_64.whl", hash = "sha256:432c179df7874eeb73307aad2df0755e1ae0efa61ff0ea89b93e194411ae3928"},
+    {file = "pydantic_core-2.46.4-cp39-cp39-manylinux_2_31_riscv64.whl", hash = "sha256:e68b7a074f65a2fd746c52a7ce6142ab7006074ac269ace0c25cd8ba171f8066"},
+    {file = "pydantic_core-2.46.4-cp39-cp39-manylinux_2_5_i686.manylinux1_i686.whl", hash = "sha256:4a05d69cba51d852c5c3e92758653245a50c0b646ced0cf05bd793ed592839d6"},
+    {file = "pydantic_core-2.46.4-cp39-cp39-musllinux_1_1_aarch64.whl", hash = "sha256:228ee9bae8bef5b1e97ec58302f80357c37199e0d0a99174e138d28e6957b9d9"},
+    {file = "pydantic_core-2.46.4-cp39-cp39-musllinux_1_1_armv7l.whl", hash = "sha256:10e17cbb10a330363733efc4d7c4d0dd827ac0909b8f6a6542298fed1ea62f29"},
+    {file = "pydantic_core-2.46.4-cp39-cp39-musllinux_1_1_x86_64.whl", hash = "sha256:91a06d2e259ecfbd8c901d70c3c507900458498142b3026a296b7de4d1322cc9"},
+    {file = "pydantic_core-2.46.4-cp39-cp39-win32.whl", hash = "sha256:d80ee3d731373b24cebbc10d689ca4ee1875caf0d5703a245db18efd4dd37fc1"},
+    {file = "pydantic_core-2.46.4-cp39-cp39-win_amd64.whl", hash = "sha256:3be77f45df024d789a672ae34f8b06fb346c4f9f46ea714956660ea4862e89ac"},
+    {file = "pydantic_core-2.46.4-graalpy311-graalpy242_311_native-macosx_10_12_x86_64.whl", hash = "sha256:14d4edf427bdcf950a8a02d7cb44a08614388dd6e1bdcbf4f67504fa7887da9c"},
+    {file = "pydantic_core-2.46.4-graalpy311-graalpy242_311_native-macosx_11_0_arm64.whl", hash = "sha256:0ce40cd7b21210e99342afafbd4d0f76d784eb5b1d60f3bdc566be4983c6c73b"},
+    {file = "pydantic_core-2.46.4-graalpy311-graalpy242_311_native-manylinux_2_17_aarch64.manylinux2014_aarch64.whl", hash = "sha256:90884113d8b48f760e9587002789ddd741e76ab9f89518cd1e43b1f1a52ec44b"},
+    {file = "pydantic_core-2.46.4-graalpy311-graalpy242_311_native-manylinux_2_17_x86_64.manylinux2014_x86_64.whl", hash = "sha256:66ce7632c22d837c95301830e111ad0128a32b8207533b60896a96c4915192ea"},
+    {file = "pydantic_core-2.46.4-graalpy312-graalpy250_312_native-macosx_10_12_x86_64.whl", hash = "sha256:1d8ba486450b14f3b1d63bc521d410ec7565e52f887b9fb671791886436a42f7"},
+    {file = "pydantic_core-2.46.4-graalpy312-graalpy250_312_native-macosx_11_0_arm64.whl", hash = "sha256:3009f12e4e90b7f88b4f9adb1b0c4a3d58fe7820f3238c190047209d148026df"},
+    {file = "pydantic_core-2.46.4-graalpy312-graalpy250_312_native-manylinux_2_17_aarch64.manylinux2014_aarch64.whl", hash = "sha256:ad785e92e6dc634c21555edc8bd6b64957ab844541bcb96a1366c202951ae526"},
+    {file = "pydantic_core-2.46.4-graalpy312-graalpy250_312_native-manylinux_2_17_x86_64.manylinux2014_x86_64.whl", hash = "sha256:00c603d540afdd6b80eb39f078f33ebd46211f02f33e34a32d9f053bba711de0"},
+    {file = "pydantic_core-2.46.4-pp311-pypy311_pp73-macosx_10_12_x86_64.whl", hash = "sha256:0c563b08bca408dc7f65f700633d8442fffb2421fc47b8101377e9fd65051ff0"},
+    {file = "pydantic_core-2.46.4-pp311-pypy311_pp73-macosx_11_0_arm64.whl", hash = "sha256:db06ffe51636ffe9ca531fe9023dd64bdd794be8754cb5df57c5498ae5b518a7"},
+    {file = "pydantic_core-2.46.4-pp311-pypy311_pp73-manylinux_2_17_x86_64.manylinux2014_x86_64.whl", hash = "sha256:133878133d271ade3d41d1bfb2a45ec38dbdbda40bc065921c6b04e4630127e2"},
+    {file = "pydantic_core-2.46.4-pp311-pypy311_pp73-manylinux_2_5_i686.manylinux1_i686.whl", hash = "sha256:9bc519fbf2b7578398853d815009ae5e4d4603d12f4e3f91da8c06852d3da3e9"},
+    {file = "pydantic_core-2.46.4-pp311-pypy311_pp73-musllinux_1_1_aarch64.whl", hash = "sha256:c7a7bd4e39e8e4c12c39cd480356842b6a8a06e41b23a55a5e3e191718838ddf"},
+    {file = "pydantic_core-2.46.4-pp311-pypy311_pp73-musllinux_1_1_armv7l.whl", hash = "sha256:d396ec2b979760aaf3218e76c24e65bd0aca24983298653b3a9d7a45f9e47b30"},
+    {file = "pydantic_core-2.46.4-pp311-pypy311_pp73-musllinux_1_1_x86_64.whl", hash = "sha256:86e1a4418c6cd97d60c95c71164158eaf7324fae7b0923264016baa993eba6fc"},
+    {file = "pydantic_core-2.46.4-pp311-pypy311_pp73-win_amd64.whl", hash = "sha256:d51026d73fcfd93610abc7b27789c26b313920fcfb20e27462d74a7f8b06e983"},
+    {file = "pydantic_core-2.46.4.tar.gz", hash = "sha256:62f875393d7f270851f20523dd2e29f082bcc82292d66db2b64ea71f64b6e1c1"},
 ]
 
 [package.dependencies]
-typing-extensions = ">=4.6.0,<4.7.0 || >4.7.0"
+typing-extensions = ">=4.14.1"
 
 [[package]]
 name = "pydantic-settings"
-version = "2.8.0"
+version = "2.14.2"
 description = "Settings management using Pydantic"
 optional = false
-python-versions = ">=3.8"
+python-versions = ">=3.10"
 groups = ["main"]
 files = [
-    {file = "pydantic_settings-2.8.0-py3-none-any.whl", hash = "sha256:c782c7dc3fb40e97b238e713c25d26f64314aece2e91abcff592fcac15f71820"},
-    {file = "pydantic_settings-2.8.0.tar.gz", hash = "sha256:88e2ca28f6e68ea102c99c3c401d6c9078e68a5df600e97b43891c34e089500a"},
+    {file = "pydantic_settings-2.14.2-py3-none-any.whl", hash = "sha256:a20c97b37910b6550d5ea50fbcc2d4187defe58cd57070b73863d069419c9440"},
+    {file = "pydantic_settings-2.14.2.tar.gz", hash = "sha256:c19dd64b19097f1de80184f0cc7b0272a13ae6e170cbf240a3e27e381ed14a5f"},
 ]
 
 [package.dependencies]
 pydantic = ">=2.7.0"
 python-dotenv = ">=0.21.0"
+typing-inspection = ">=0.4.0"
 
 [package.extras]
+aws-secrets-manager = ["boto3 (>=1.35.0)", "types-boto3[secretsmanager]"]
 azure-key-vault = ["azure-identity (>=1.16.0)", "azure-keyvault-secrets (>=4.8.0)"]
+gcp-secret-manager = ["google-cloud-secret-manager (>=2.23.1)"]
 toml = ["tomli (>=2.0.1)"]
 yaml = ["pyyaml (>=6.0.1)"]
 
 [[package]]
-name = "pygments"
-version = "2.19.1"
-description = "Pygments is a syntax highlighting package written in Python."
-optional = false
-python-versions = ">=3.8"
-groups = ["main"]
-files = [
-    {file = "pygments-2.19.1-py3-none-any.whl", hash = "sha256:9ea1544ad55cecf4b8242fab6dd35a93bbce657034b0611ee383099054ab6d8c"},
-    {file = "pygments-2.19.1.tar.gz", hash = "sha256:61c16d2a8576dc0649d9f39e089b5f02bcd27fba10d8fb4dcc28173f7a45151f"},
-]
-
-[package.extras]
-windows-terminal = ["colorama (>=0.4.6)"]
-
-[[package]]
-name = "pynacl"
-version = "1.5.0"
-description = "Python binding to the Networking and Cryptography (NaCl) library"
-optional = false
-python-versions = ">=3.6"
+name = "pydub"
+version = "0.25.1"
+description = "Manipulate audio with an simple and easy high level interface"
+optional = true
+python-versions = "*"
 groups = ["main"]
+markers = "extra == \"trackio\""
 files = [
-    {file = "PyNaCl-1.5.0-cp36-abi3-macosx_10_10_universal2.whl", hash = "sha256:401002a4aaa07c9414132aaed7f6836ff98f59277a234704ff66878c2ee4a0d1"},
-    {file = "PyNaCl-1.5.0-cp36-abi3-manylinux_2_17_aarch64.manylinux2014_aarch64.manylinux_2_24_aarch64.whl", hash = "sha256:52cb72a79269189d4e0dc537556f4740f7f0a9ec41c1322598799b0bdad4ef92"},
-    {file = "PyNaCl-1.5.0-cp36-abi3-manylinux_2_17_aarch64.manylinux2014_aarch64.whl", hash = "sha256:a36d4a9dda1f19ce6e03c9a784a2921a4b726b02e1c736600ca9c22029474394"},
-    {file = "PyNaCl-1.5.0-cp36-abi3-manylinux_2_17_x86_64.manylinux2014_x86_64.manylinux_2_24_x86_64.whl", hash = "sha256:0c84947a22519e013607c9be43706dd42513f9e6ae5d39d3613ca1e142fba44d"},
-    {file = "PyNaCl-1.5.0-cp36-abi3-manylinux_2_17_x86_64.manylinux2014_x86_64.whl", hash = "sha256:06b8f6fa7f5de8d5d2f7573fe8c863c051225a27b61e6860fd047b1775807858"},
-    {file = "PyNaCl-1.5.0-cp36-abi3-musllinux_1_1_aarch64.whl", hash = "sha256:a422368fc821589c228f4c49438a368831cb5bbc0eab5ebe1d7fac9dded6567b"},
-    {file = "PyNaCl-1.5.0-cp36-abi3-musllinux_1_1_x86_64.whl", hash = "sha256:61f642bf2378713e2c2e1de73444a3778e5f0a38be6fee0fe532fe30060282ff"},
-    {file = "PyNaCl-1.5.0-cp36-abi3-win32.whl", hash = "sha256:e46dae94e34b085175f8abb3b0aaa7da40767865ac82c928eeb9e57e1ea8a543"},
-    {file = "PyNaCl-1.5.0-cp36-abi3-win_amd64.whl", hash = "sha256:20f42270d27e1b6a29f54032090b972d97f0a1b0948cc52392041ef7831fee93"},
-    {file = "PyNaCl-1.5.0.tar.gz", hash = "sha256:8ac7448f09ab85811607bdd21ec2464495ac8b7c66d146bf545b0f08fb9220ba"},
+    {file = "pydub-0.25.1-py2.py3-none-any.whl", hash = "sha256:65617e33033874b59d87db603aa1ed450633288aefead953b30bded59cb599a6"},
+    {file = "pydub-0.25.1.tar.gz", hash = "sha256:980a33ce9949cab2a569606b65674d748ecbca4f0796887fd6f46173a7b0d30f"},
 ]
 
-[package.dependencies]
-cffi = ">=1.4.1"
-
-[package.extras]
-docs = ["sphinx (>=1.6.5)", "sphinx-rtd-theme"]
-tests = ["hypothesis (>=3.27.0)", "pytest (>=3.2.1,!=3.3.0)"]
-
 [[package]]
-name = "pyparsing"
-version = "3.2.1"
-description = "pyparsing module - Classes and methods to define and execute parsing grammars"
+name = "pygments"
+version = "2.20.0"
+description = "Pygments is a syntax highlighting package written in Python."
 optional = false
 python-versions = ">=3.9"
-groups = ["main"]
-files = [
-    {file = "pyparsing-3.2.1-py3-none-any.whl", hash = "sha256:506ff4f4386c4cec0590ec19e6302d3aedb992fdc02c761e90416f158dacf8e1"},
-    {file = "pyparsing-3.2.1.tar.gz", hash = "sha256:61980854fd66de3a90028d679a954d5f2623e83144b5afe5ee86f43d762e5f0a"},
-]
-
-[package.extras]
-diagrams = ["jinja2", "railroad-diagrams"]
-
-[[package]]
-name = "pyramid"
-version = "1.5"
-description = "The Pyramid Web Framework, a Pylons project"
-optional = false
-python-versions = "*"
-groups = ["main"]
+groups = ["main", "dev"]
 files = [
-    {file = "pyramid-1.5.tar.gz", hash = "sha256:db3216f61d9dbb5358fcb3f9eb2d772948c5b2bc608eb2f643159b4abd993621"},
+    {file = "pygments-2.20.0-py3-none-any.whl", hash = "sha256:81a9e26dd42fd28a23a2d169d86d7ac03b46e2f8b59ed4698fb4785f946d0176"},
+    {file = "pygments-2.20.0.tar.gz", hash = "sha256:6757cd03768053ff99f3039c1a36d6c0aa0b263438fcab17520b30a303a82b5f"},
 ]
 
-[package.dependencies]
-PasteDeploy = ">=1.5.0"
-"repoze.lru" = ">=0.4"
-setuptools = "*"
-translationstring = ">=0.4"
-venusian = ">=1.0a3"
-WebOb = ">=1.3.1"
-"zope.deprecation" = ">=3.5.0"
-"zope.interface" = ">=3.8.0"
-
 [package.extras]
-docs = ["Sphinx", "docutils", "repoze.sphinx.autointerface"]
-testing = ["WebTest (>=1.3.1)", "coverage", "nose", "virtualenv", "zope.component (>=3.11.0)"]
+windows-terminal = ["colorama (>=0.4.6)"]
 
 [[package]]
 name = "pytest"
-version = "7.2.0"
+version = "9.1.1"
 description = "pytest: simple powerful testing with Python"
 optional = false
-python-versions = ">=3.7"
-groups = ["main", "dev"]
+python-versions = ">=3.10"
+groups = ["dev"]
 files = [
-    {file = "pytest-7.2.0-py3-none-any.whl", hash = "sha256:892f933d339f068883b6fd5a459f03d85bfcb355e4981e146d2c7616c21fef71"},
-    {file = "pytest-7.2.0.tar.gz", hash = "sha256:c4014eb40e10f11f355ad4e3c2fb2c6c6d1919c73f3b5a433de4708202cade59"},
+    {file = "pytest-9.1.1-py3-none-any.whl", hash = "sha256:37a86b45efb9a47a61a36449063e8e18d0cab3161329fc099eb21783169c4f0c"},
+    {file = "pytest-9.1.1.tar.gz", hash = "sha256:1088fbde8f2b49d95a549a195707afa7a76a3ce9bcadc26b6d71f0ffda5fe313"},
 ]
 
 [package.dependencies]
-attrs = ">=19.2.0"
-colorama = {version = "*", markers = "sys_platform == \"win32\""}
-exceptiongroup = {version = ">=1.0.0rc8", markers = "python_version < \"3.11\""}
-iniconfig = "*"
-packaging = "*"
-pluggy = ">=0.12,<2.0"
-tomli = {version = ">=1.0.0", markers = "python_version < \"3.11\""}
+colorama = {version = ">=0.4", markers = "sys_platform == \"win32\""}
+iniconfig = ">=1.0.1"
+packaging = ">=22"
+pluggy = ">=1.5,<2"
+pygments = ">=2.7.2"
 
 [package.extras]
-testing = ["argcomplete", "hypothesis (>=3.56)", "mock", "nose", "pygments (>=2.7.2)", "requests", "xmlschema"]
-
-[[package]]
-name = "pytest-split"
-version = "0.8.0"
-description = "Pytest plugin which splits the test suite to equally sized sub suites based on test execution time."
-optional = false
-python-versions = ">=3.7.1,<4.0"
-groups = ["main"]
-files = [
-    {file = "pytest-split-0.8.0.tar.gz", hash = "sha256:8571a3f60ca8656c698ed86b0a3212bb9e79586ecb201daef9988c336ff0e6ff"},
-    {file = "pytest_split-0.8.0-py3-none-any.whl", hash = "sha256:2e06b8b1ab7ceb19d0b001548271abaf91d12415a8687086cf40581c555d309f"},
-]
-
-[package.dependencies]
-pytest = ">=5,<8"
+dev = ["argcomplete", "attrs (>=19.2)", "hypothesis (>=3.56)", "mock", "requests", "setuptools", "xmlschema"]
 
 [[package]]
 name = "python-dateutil"
@@ -4032,37 +4162,71 @@ version = "2.9.0.post0"
 description = "Extensions to the standard Python datetime module"
 optional = false
 python-versions = "!=3.0.*,!=3.1.*,!=3.2.*,>=2.7"
-groups = ["main"]
+groups = ["main", "training"]
 files = [
     {file = "python-dateutil-2.9.0.post0.tar.gz", hash = "sha256:37dd54208da7e1cd875388217d5e00ebd4179249f90fb72437e91a35459a0ad3"},
     {file = "python_dateutil-2.9.0.post0-py2.py3-none-any.whl", hash = "sha256:a8b2bc7bffae282281c8140a97d3aa9c14da0b136dfe83f850eea9a5f7470427"},
 ]
+markers = {main = "extra == \"trackio\""}
 
 [package.dependencies]
 six = ">=1.5"
 
+[[package]]
+name = "python-discovery"
+version = "1.4.2"
+description = "Python interpreter discovery"
+optional = false
+python-versions = ">=3.8"
+groups = ["dev"]
+files = [
+    {file = "python_discovery-1.4.2-py3-none-any.whl", hash = "sha256:475803f53b7b2ed6e490e27373f9d8340f7d2eebf9acdaf645d7d714c97bb500"},
+    {file = "python_discovery-1.4.2.tar.gz", hash = "sha256:8f3746c4b4968d22afbb97d36e1a0e5b66e6c0f297290f2e95f05b9b8bf18690"},
+]
+
+[package.dependencies]
+filelock = ">=3.15.4"
+platformdirs = ">=4.3.6,<5"
+
+[package.extras]
+docs = ["furo (>=2025.12.19)", "sphinx (>=9.1)", "sphinx-autodoc-typehints (>=3.6.3)", "sphinxcontrib-mermaid (>=2)", "sphinxcontrib-towncrier (>=0.4)", "towncrier (>=25.8)"]
+testing = ["covdefaults (>=2.3)", "coverage (>=7.5.4)", "pytest (>=8.3.5)", "pytest-mock (>=3.14)", "setuptools (>=75.1)"]
+
 [[package]]
 name = "python-dotenv"
-version = "1.0.1"
+version = "1.2.2"
 description = "Read key-value pairs from a .env file and set them as environment variables"
 optional = false
-python-versions = ">=3.8"
+python-versions = ">=3.10"
 groups = ["main"]
 files = [
-    {file = "python-dotenv-1.0.1.tar.gz", hash = "sha256:e324ee90a023d808f1959c46bcbc04446a10ced277783dc6ee09987c37ec10ca"},
-    {file = "python_dotenv-1.0.1-py3-none-any.whl", hash = "sha256:f7b63ef50f1b690dddf550d03497b66d609393b40b564ed0d674909a68ebf16a"},
+    {file = "python_dotenv-1.2.2-py3-none-any.whl", hash = "sha256:1d8214789a24de455a8b8bd8ae6fe3c6b69a5e3d64aa8a8e5d68e694bbcb285a"},
+    {file = "python_dotenv-1.2.2.tar.gz", hash = "sha256:2c371a91fbd7ba082c2c1dc1f8bf89ca22564a087c2c287cd9b662adde799cf3"},
 ]
 
 [package.extras]
 cli = ["click (>=5.0)"]
 
+[[package]]
+name = "python-multipart"
+version = "0.0.32"
+description = "A streaming multipart parser for Python"
+optional = true
+python-versions = ">=3.10"
+groups = ["main"]
+markers = "extra == \"trackio\""
+files = [
+    {file = "python_multipart-0.0.32-py3-none-any.whl", hash = "sha256:ff6d3f776f16878c894e52e107296ffc890e913c611b1a4ec6c44e2821fe2e23"},
+    {file = "python_multipart-0.0.32.tar.gz", hash = "sha256:be54b7f3fa167bb83e4fcd936b887b708f4e57fe75911c02aebf53efaf8d938e"},
+]
+
 [[package]]
 name = "pytorch-lightning"
 version = "2.4.0"
 description = "PyTorch Lightning is the lightweight PyTorch wrapper for ML researchers. Scale your models. Write less boilerplate."
 optional = false
 python-versions = ">=3.8"
-groups = ["main"]
+groups = ["training"]
 files = [
     {file = "pytorch-lightning-2.4.0.tar.gz", hash = "sha256:6aa897fd9d6dfa7b7b49f37c2f04e13592861831d08deae584dfda423fdb71c8"},
     {file = "pytorch_lightning-2.4.0-py3-none-any.whl", hash = "sha256:9ac7935229ac022ef06994c928217ed37f525ac6700f7d4fc57009624570e655"},
@@ -4089,15 +4253,16 @@ test = ["cloudpickle (>=1.3)", "coverage (==7.3.1)", "fastapi", "numpy (>=1.17.2
 
 [[package]]
 name = "pytz"
-version = "2024.2"
+version = "2026.2"
 description = "World timezone definitions, modern and historical"
 optional = false
 python-versions = "*"
-groups = ["main"]
+groups = ["main", "training"]
 files = [
-    {file = "pytz-2024.2-py2.py3-none-any.whl", hash = "sha256:31c7c1817eb7fae7ca4b8c7ee50c72f93aa2dd863de768e1ef4245d426aa0725"},
-    {file = "pytz-2024.2.tar.gz", hash = "sha256:2aa355083c50a0f93fa581709deac0c9ad65cca8a9e9beac660adcbd493c798a"},
+    {file = "pytz-2026.2-py2.py3-none-any.whl", hash = "sha256:04156e608bee23d3792fd45c94ae47fae1036688e75032eea2e3bf0323d1f126"},
+    {file = "pytz-2026.2.tar.gz", hash = "sha256:0e60b47b29f21574376f218fe21abc009894a2321ea16c6754f3cad6eb7cdd6a"},
 ]
+markers = {main = "extra == \"trackio\""}
 
 [[package]]
 name = "pyyaml"
@@ -4105,7 +4270,7 @@ version = "6.0.2"
 description = "YAML parser and emitter for Python"
 optional = false
 python-versions = ">=3.8"
-groups = ["main", "dev"]
+groups = ["main", "dev", "training"]
 files = [
     {file = "PyYAML-6.0.2-cp310-cp310-macosx_10_9_x86_64.whl", hash = "sha256:0a9a2848a5b7feac301353437eb7d5957887edbf81d56e903999a75a3d743086"},
     {file = "PyYAML-6.0.2-cp310-cp310-macosx_11_0_arm64.whl", hash = "sha256:29717114e51c84ddfba879543fb232a6ed60086602313ca38cce623c1d62cfbf"},
@@ -4162,81 +4327,6 @@ files = [
     {file = "pyyaml-6.0.2.tar.gz", hash = "sha256:d584d9ec91ad65861cc08d42e834324ef890a082e591037abe114850ff7bbc3e"},
 ]
 
-[[package]]
-name = "ray"
-version = "2.40.0"
-description = "Ray provides a simple, universal API for building distributed applications."
-optional = false
-python-versions = ">=3.9"
-groups = ["main"]
-files = [
-    {file = "ray-2.40.0-cp310-cp310-macosx_10_15_x86_64.whl", hash = "sha256:064af8bc52cc988c82470b8e76e5df417737fa7c1d87f597a892c69eb4ec3caa"},
-    {file = "ray-2.40.0-cp310-cp310-macosx_11_0_arm64.whl", hash = "sha256:45beb4019cd20b6cb10572d8012c771bccd623f544a669da6797ccf993c4bb33"},
-    {file = "ray-2.40.0-cp310-cp310-manylinux2014_aarch64.whl", hash = "sha256:6cede5fbf7de4fae22cebe2c6977aaf3c85fde6f7de2aa10c46992cf24ea8bda"},
-    {file = "ray-2.40.0-cp310-cp310-manylinux2014_x86_64.whl", hash = "sha256:f6eab11dc8490f88e78e06aa645905b259cde1fa03b15e8426155c4782ba0bbe"},
-    {file = "ray-2.40.0-cp310-cp310-win_amd64.whl", hash = "sha256:f83cda1ecceb7abe021cd377f0c503596f26d2d66cdff13c1089a06c8b780c23"},
-    {file = "ray-2.40.0-cp311-cp311-macosx_10_15_x86_64.whl", hash = "sha256:dac89bb2cb889c19549a4ac0383492e7550f3e63b78b629a3118e8b91e4e82f3"},
-    {file = "ray-2.40.0-cp311-cp311-macosx_11_0_arm64.whl", hash = "sha256:3e4efdf8aebff6e71391c2d5dd66bb45835f2d6d629ac03a3e21e2d4283e2311"},
-    {file = "ray-2.40.0-cp311-cp311-manylinux2014_aarch64.whl", hash = "sha256:c776f131e5d0a169a98ab8021c5796f52bf48fcfc6c44ffbd2a9d090fe10748a"},
-    {file = "ray-2.40.0-cp311-cp311-manylinux2014_x86_64.whl", hash = "sha256:71711cbf2c156213fd49b0f9cc93180a7ba424110070a34bdea3dc09527f31df"},
-    {file = "ray-2.40.0-cp311-cp311-win_amd64.whl", hash = "sha256:532321132618983366e39aeb4cc7867cf7241b0b1e49ee44b01d2aee9923e422"},
-    {file = "ray-2.40.0-cp312-cp312-macosx_10_15_x86_64.whl", hash = "sha256:6992922fe91a90b5cc97d9f05ca51b64d72cd644db7ad55caa936be9a6098cce"},
-    {file = "ray-2.40.0-cp312-cp312-macosx_11_0_arm64.whl", hash = "sha256:28329e7a7471610a475d3bb09a4c1b31abcf3596cee25c4254f8d01ad161ba84"},
-    {file = "ray-2.40.0-cp312-cp312-manylinux2014_aarch64.whl", hash = "sha256:8ea05221fa48e32c652c29498d320e90134b3a012421006af98965097dd1cc3b"},
-    {file = "ray-2.40.0-cp312-cp312-manylinux2014_x86_64.whl", hash = "sha256:674755814f5692306c554cadbc24015af823dc0516e34bdef24ccac9d7a656e3"},
-    {file = "ray-2.40.0-cp312-cp312-win_amd64.whl", hash = "sha256:bbc01d773cbc43e3efa462ec28ee4c0cacc50f098078332fb45b1ab38eaf9b5d"},
-    {file = "ray-2.40.0-cp39-cp39-macosx_10_15_x86_64.whl", hash = "sha256:27292bf8921dd69757e7581644afcd3ccae13d6f10f3841f5523ae82b6612f4b"},
-    {file = "ray-2.40.0-cp39-cp39-macosx_11_0_arm64.whl", hash = "sha256:2b74ca43d0c4ccdcaefbf1e7d26aabb1c0d20f825688a9fd7134ba918bda8442"},
-    {file = "ray-2.40.0-cp39-cp39-manylinux2014_aarch64.whl", hash = "sha256:5eb7a203f58defedff0dc53f78a4e1431d040b2b8458548704979c0113f3b892"},
-    {file = "ray-2.40.0-cp39-cp39-manylinux2014_x86_64.whl", hash = "sha256:a36a20a3b936b36d14fab031222f92e3c5e731d7db6bb183ca4fba6d0ce3f52a"},
-    {file = "ray-2.40.0-cp39-cp39-win_amd64.whl", hash = "sha256:fbe9cd3e076dea676afd57caf19b2897a67ecdf14a542c03864800966cf2aec9"},
-]
-
-[package.dependencies]
-aiosignal = "*"
-click = ">=7.0"
-filelock = "*"
-frozenlist = "*"
-jsonschema = "*"
-msgpack = ">=1.0.0,<2.0.0"
-packaging = "*"
-protobuf = ">=3.15.3,<3.19.5 || >3.19.5"
-pyyaml = "*"
-requests = "*"
-
-[package.extras]
-adag = ["cupy-cuda12x ; sys_platform != \"darwin\""]
-air = ["aiohttp (>=3.7)", "aiohttp-cors", "colorful", "fastapi", "fsspec", "grpcio (>=1.32.0) ; python_version < \"3.10\"", "grpcio (>=1.42.0) ; python_version >= \"3.10\"", "memray ; sys_platform != \"win32\"", "numpy (>=1.20)", "opencensus", "pandas", "pandas (>=1.3)", "prometheus-client (>=0.7.1)", "py-spy (>=0.2.0)", "pyarrow (<18) ; sys_platform == \"darwin\" and platform_machine == \"x86_64\"", "pyarrow (>=9.0.0)", "pydantic (<2.0.dev0 || >=2.5.dev0,<3)", "requests", "smart-open", "starlette", "tensorboardX (>=1.9)", "uvicorn[standard]", "virtualenv (>=20.0.24,!=20.21.1)", "watchfiles"]
-all = ["aiohttp (>=3.7)", "aiohttp-cors", "colorful", "cupy-cuda12x ; sys_platform != \"darwin\"", "dm-tree", "fastapi", "fsspec", "grpcio (!=1.56.0)", "grpcio (>=1.32.0) ; python_version < \"3.10\"", "grpcio (>=1.42.0) ; python_version >= \"3.10\"", "gymnasium (==1.0.0)", "lz4", "memray ; sys_platform != \"win32\"", "numpy (>=1.20)", "opencensus", "opentelemetry-api", "opentelemetry-exporter-otlp", "opentelemetry-sdk", "pandas", "pandas (>=1.3)", "prometheus-client (>=0.7.1)", "py-spy (>=0.2.0)", "pyOpenSSL", "pyarrow (<18) ; sys_platform == \"darwin\" and platform_machine == \"x86_64\"", "pyarrow (>=9.0.0)", "pydantic (<2.0.dev0 || >=2.5.dev0,<3)", "pyyaml", "requests", "rich", "scikit-image", "scipy", "smart-open", "starlette", "tensorboardX (>=1.9)", "typer", "uvicorn[standard]", "virtualenv (>=20.0.24,!=20.21.1)", "watchfiles"]
-all-cpp = ["aiohttp (>=3.7)", "aiohttp-cors", "colorful", "cupy-cuda12x ; sys_platform != \"darwin\"", "dm-tree", "fastapi", "fsspec", "grpcio (!=1.56.0)", "grpcio (>=1.32.0) ; python_version < \"3.10\"", "grpcio (>=1.42.0) ; python_version >= \"3.10\"", "gymnasium (==1.0.0)", "lz4", "memray ; sys_platform != \"win32\"", "numpy (>=1.20)", "opencensus", "opentelemetry-api", "opentelemetry-exporter-otlp", "opentelemetry-sdk", "pandas", "pandas (>=1.3)", "prometheus-client (>=0.7.1)", "py-spy (>=0.2.0)", "pyOpenSSL", "pyarrow (<18) ; sys_platform == \"darwin\" and platform_machine == \"x86_64\"", "pyarrow (>=9.0.0)", "pydantic (<2.0.dev0 || >=2.5.dev0,<3)", "pyyaml", "ray-cpp (==2.40.0)", "requests", "rich", "scikit-image", "scipy", "smart-open", "starlette", "tensorboardX (>=1.9)", "typer", "uvicorn[standard]", "virtualenv (>=20.0.24,!=20.21.1)", "watchfiles"]
-client = ["grpcio (!=1.56.0)"]
-cpp = ["ray-cpp (==2.40.0)"]
-data = ["fsspec", "numpy (>=1.20)", "pandas (>=1.3)", "pyarrow (<18) ; sys_platform == \"darwin\" and platform_machine == \"x86_64\"", "pyarrow (>=9.0.0)"]
-default = ["aiohttp (>=3.7)", "aiohttp-cors", "colorful", "grpcio (>=1.32.0) ; python_version < \"3.10\"", "grpcio (>=1.42.0) ; python_version >= \"3.10\"", "memray ; sys_platform != \"win32\"", "opencensus", "prometheus-client (>=0.7.1)", "py-spy (>=0.2.0)", "pydantic (<2.0.dev0 || >=2.5.dev0,<3)", "requests", "smart-open", "virtualenv (>=20.0.24,!=20.21.1)"]
-observability = ["opentelemetry-api", "opentelemetry-exporter-otlp", "opentelemetry-sdk"]
-rllib = ["dm-tree", "fsspec", "gymnasium (==1.0.0)", "lz4", "pandas", "pyarrow (<18) ; sys_platform == \"darwin\" and platform_machine == \"x86_64\"", "pyarrow (>=9.0.0)", "pyyaml", "requests", "rich", "scikit-image", "scipy", "tensorboardX (>=1.9)", "typer"]
-serve = ["aiohttp (>=3.7)", "aiohttp-cors", "colorful", "fastapi", "grpcio (>=1.32.0) ; python_version < \"3.10\"", "grpcio (>=1.42.0) ; python_version >= \"3.10\"", "memray ; sys_platform != \"win32\"", "opencensus", "prometheus-client (>=0.7.1)", "py-spy (>=0.2.0)", "pydantic (<2.0.dev0 || >=2.5.dev0,<3)", "requests", "smart-open", "starlette", "uvicorn[standard]", "virtualenv (>=20.0.24,!=20.21.1)", "watchfiles"]
-serve-grpc = ["aiohttp (>=3.7)", "aiohttp-cors", "colorful", "fastapi", "grpcio (>=1.32.0) ; python_version < \"3.10\"", "grpcio (>=1.42.0) ; python_version >= \"3.10\"", "memray ; sys_platform != \"win32\"", "opencensus", "prometheus-client (>=0.7.1)", "py-spy (>=0.2.0)", "pyOpenSSL", "pydantic (<2.0.dev0 || >=2.5.dev0,<3)", "requests", "smart-open", "starlette", "uvicorn[standard]", "virtualenv (>=20.0.24,!=20.21.1)", "watchfiles"]
-train = ["fsspec", "pandas", "pyarrow (<18) ; sys_platform == \"darwin\" and platform_machine == \"x86_64\"", "pyarrow (>=9.0.0)", "requests", "tensorboardX (>=1.9)"]
-tune = ["fsspec", "pandas", "pyarrow (<18) ; sys_platform == \"darwin\" and platform_machine == \"x86_64\"", "pyarrow (>=9.0.0)", "requests", "tensorboardX (>=1.9)"]
-
-[[package]]
-name = "referencing"
-version = "0.36.1"
-description = "JSON Referencing + Python"
-optional = false
-python-versions = ">=3.9"
-groups = ["main"]
-files = [
-    {file = "referencing-0.36.1-py3-none-any.whl", hash = "sha256:363d9c65f080d0d70bc41c721dce3c7f3e77fc09f269cd5c8813da18069a6794"},
-    {file = "referencing-0.36.1.tar.gz", hash = "sha256:ca2e6492769e3602957e9b831b94211599d2aade9477f5d44110d2530cf9aade"},
-]
-
-[package.dependencies]
-attrs = ">=22.2.0"
-rpds-py = ">=0.7.0"
-typing-extensions = {version = ">=4.4.0", markers = "python_version < \"3.13\""}
-
 [[package]]
 name = "regex"
 version = "2024.11.6"
@@ -4341,22 +4431,6 @@ files = [
     {file = "regex-2024.11.6.tar.gz", hash = "sha256:7ab159b063c52a0333c884e4679f8d7a85112ee3078fe3d9004b2dd875585519"},
 ]
 
-[[package]]
-name = "repoze-lru"
-version = "0.7"
-description = "A tiny LRU cache implementation and decorator"
-optional = false
-python-versions = "*"
-groups = ["main"]
-files = [
-    {file = "repoze.lru-0.7-py3-none-any.whl", hash = "sha256:f77bf0e1096ea445beadd35f3479c5cff2aa1efe604a133e67150bc8630a62ea"},
-    {file = "repoze.lru-0.7.tar.gz", hash = "sha256:0429a75e19380e4ed50c0694e26ac8819b4ea7851ee1fc7583c8572db80aff77"},
-]
-
-[package.extras]
-docs = ["Sphinx"]
-testing = ["coverage", "nose"]
-
 [[package]]
 name = "requests"
 version = "2.32.3"
@@ -4381,24 +4455,42 @@ use-chardet-on-py3 = ["chardet (>=3.0.2,<6)"]
 
 [[package]]
 name = "rich"
-version = "13.9.4"
+version = "15.0.0"
 description = "Render rich text, tables, progress bars, syntax highlighting, markdown and more to the terminal"
 optional = false
-python-versions = ">=3.8.0"
+python-versions = ">=3.9.0"
 groups = ["main"]
 files = [
-    {file = "rich-13.9.4-py3-none-any.whl", hash = "sha256:6049d5e6ec054bf2779ab3358186963bac2ea89175919d699e378b99738c2a90"},
-    {file = "rich-13.9.4.tar.gz", hash = "sha256:439594978a49a09530cff7ebc4b5c7103ef57baf48d5ea3184f21d9a2befa098"},
+    {file = "rich-15.0.0-py3-none-any.whl", hash = "sha256:33bd4ef74232fb73fe9279a257718407f169c09b78a87ad3d296f548e27de0bb"},
+    {file = "rich-15.0.0.tar.gz", hash = "sha256:edd07a4824c6b40189fb7ac9bc4c52536e9780fbbfbddf6f1e2502c31b068c36"},
 ]
 
 [package.dependencies]
 markdown-it-py = ">=2.2.0"
 pygments = ">=2.13.0,<3.0.0"
-typing-extensions = {version = ">=4.0.0,<5.0", markers = "python_version < \"3.11\""}
 
 [package.extras]
 jupyter = ["ipywidgets (>=7.5.1,<9)"]
 
+[[package]]
+name = "rich-rst"
+version = "1.3.2"
+description = "A beautiful reStructuredText renderer for rich"
+optional = false
+python-versions = "*"
+groups = ["main"]
+files = [
+    {file = "rich_rst-1.3.2-py3-none-any.whl", hash = "sha256:a99b4907cbe118cf9d18b0b44de272efa61f15117c61e39ebdc431baf5df722a"},
+    {file = "rich_rst-1.3.2.tar.gz", hash = "sha256:a1196fdddf1e364b02ec68a05e8ff8f6914fee10fbca2e6b6735f166bb0da8d4"},
+]
+
+[package.dependencies]
+docutils = "*"
+rich = ">=12.0.0"
+
+[package.extras]
+docs = ["sphinx"]
+
 [[package]]
 name = "rotary-embedding-torch"
 version = "0.6.5"
@@ -4415,119 +4507,6 @@ files = [
 einops = ">=0.7"
 torch = ">=2.0"
 
-[[package]]
-name = "rpds-py"
-version = "0.22.3"
-description = "Python bindings to Rust's persistent data structures (rpds)"
-optional = false
-python-versions = ">=3.9"
-groups = ["main"]
-files = [
-    {file = "rpds_py-0.22.3-cp310-cp310-macosx_10_12_x86_64.whl", hash = "sha256:6c7b99ca52c2c1752b544e310101b98a659b720b21db00e65edca34483259967"},
-    {file = "rpds_py-0.22.3-cp310-cp310-macosx_11_0_arm64.whl", hash = "sha256:be2eb3f2495ba669d2a985f9b426c1797b7d48d6963899276d22f23e33d47e37"},
-    {file = "rpds_py-0.22.3-cp310-cp310-manylinux_2_17_aarch64.manylinux2014_aarch64.whl", hash = "sha256:70eb60b3ae9245ddea20f8a4190bd79c705a22f8028aaf8bbdebe4716c3fab24"},
-    {file = "rpds_py-0.22.3-cp310-cp310-manylinux_2_17_armv7l.manylinux2014_armv7l.whl", hash = "sha256:4041711832360a9b75cfb11b25a6a97c8fb49c07b8bd43d0d02b45d0b499a4ff"},
-    {file = "rpds_py-0.22.3-cp310-cp310-manylinux_2_17_ppc64le.manylinux2014_ppc64le.whl", hash = "sha256:64607d4cbf1b7e3c3c8a14948b99345eda0e161b852e122c6bb71aab6d1d798c"},
-    {file = "rpds_py-0.22.3-cp310-cp310-manylinux_2_17_s390x.manylinux2014_s390x.whl", hash = "sha256:81e69b0a0e2537f26d73b4e43ad7bc8c8efb39621639b4434b76a3de50c6966e"},
-    {file = "rpds_py-0.22.3-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl", hash = "sha256:bc27863442d388870c1809a87507727b799c8460573cfbb6dc0eeaef5a11b5ec"},
-    {file = "rpds_py-0.22.3-cp310-cp310-manylinux_2_5_i686.manylinux1_i686.whl", hash = "sha256:e79dd39f1e8c3504be0607e5fc6e86bb60fe3584bec8b782578c3b0fde8d932c"},
-    {file = "rpds_py-0.22.3-cp310-cp310-musllinux_1_2_aarch64.whl", hash = "sha256:e0fa2d4ec53dc51cf7d3bb22e0aa0143966119f42a0c3e4998293a3dd2856b09"},
-    {file = "rpds_py-0.22.3-cp310-cp310-musllinux_1_2_i686.whl", hash = "sha256:fda7cb070f442bf80b642cd56483b5548e43d366fe3f39b98e67cce780cded00"},
-    {file = "rpds_py-0.22.3-cp310-cp310-musllinux_1_2_x86_64.whl", hash = "sha256:cff63a0272fcd259dcc3be1657b07c929c466b067ceb1c20060e8d10af56f5bf"},
-    {file = "rpds_py-0.22.3-cp310-cp310-win32.whl", hash = "sha256:9bd7228827ec7bb817089e2eb301d907c0d9827a9e558f22f762bb690b131652"},
-    {file = "rpds_py-0.22.3-cp310-cp310-win_amd64.whl", hash = "sha256:9beeb01d8c190d7581a4d59522cd3d4b6887040dcfc744af99aa59fef3e041a8"},
-    {file = "rpds_py-0.22.3-cp311-cp311-macosx_10_12_x86_64.whl", hash = "sha256:d20cfb4e099748ea39e6f7b16c91ab057989712d31761d3300d43134e26e165f"},
-    {file = "rpds_py-0.22.3-cp311-cp311-macosx_11_0_arm64.whl", hash = "sha256:68049202f67380ff9aa52f12e92b1c30115f32e6895cd7198fa2a7961621fc5a"},
-    {file = "rpds_py-0.22.3-cp311-cp311-manylinux_2_17_aarch64.manylinux2014_aarch64.whl", hash = "sha256:fb4f868f712b2dd4bcc538b0a0c1f63a2b1d584c925e69a224d759e7070a12d5"},
-    {file = "rpds_py-0.22.3-cp311-cp311-manylinux_2_17_armv7l.manylinux2014_armv7l.whl", hash = "sha256:bc51abd01f08117283c5ebf64844a35144a0843ff7b2983e0648e4d3d9f10dbb"},
-    {file = "rpds_py-0.22.3-cp311-cp311-manylinux_2_17_ppc64le.manylinux2014_ppc64le.whl", hash = "sha256:0f3cec041684de9a4684b1572fe28c7267410e02450f4561700ca5a3bc6695a2"},
-    {file = "rpds_py-0.22.3-cp311-cp311-manylinux_2_17_s390x.manylinux2014_s390x.whl", hash = "sha256:7ef9d9da710be50ff6809fed8f1963fecdfecc8b86656cadfca3bc24289414b0"},
-    {file = "rpds_py-0.22.3-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl", hash = "sha256:59f4a79c19232a5774aee369a0c296712ad0e77f24e62cad53160312b1c1eaa1"},
-    {file = "rpds_py-0.22.3-cp311-cp311-manylinux_2_5_i686.manylinux1_i686.whl", hash = "sha256:1a60bce91f81ddaac922a40bbb571a12c1070cb20ebd6d49c48e0b101d87300d"},
-    {file = "rpds_py-0.22.3-cp311-cp311-musllinux_1_2_aarch64.whl", hash = "sha256:e89391e6d60251560f0a8f4bd32137b077a80d9b7dbe6d5cab1cd80d2746f648"},
-    {file = "rpds_py-0.22.3-cp311-cp311-musllinux_1_2_i686.whl", hash = "sha256:e3fb866d9932a3d7d0c82da76d816996d1667c44891bd861a0f97ba27e84fc74"},
-    {file = "rpds_py-0.22.3-cp311-cp311-musllinux_1_2_x86_64.whl", hash = "sha256:1352ae4f7c717ae8cba93421a63373e582d19d55d2ee2cbb184344c82d2ae55a"},
-    {file = "rpds_py-0.22.3-cp311-cp311-win32.whl", hash = "sha256:b0b4136a252cadfa1adb705bb81524eee47d9f6aab4f2ee4fa1e9d3cd4581f64"},
-    {file = "rpds_py-0.22.3-cp311-cp311-win_amd64.whl", hash = "sha256:8bd7c8cfc0b8247c8799080fbff54e0b9619e17cdfeb0478ba7295d43f635d7c"},
-    {file = "rpds_py-0.22.3-cp312-cp312-macosx_10_12_x86_64.whl", hash = "sha256:27e98004595899949bd7a7b34e91fa7c44d7a97c40fcaf1d874168bb652ec67e"},
-    {file = "rpds_py-0.22.3-cp312-cp312-macosx_11_0_arm64.whl", hash = "sha256:1978d0021e943aae58b9b0b196fb4895a25cc53d3956b8e35e0b7682eefb6d56"},
-    {file = "rpds_py-0.22.3-cp312-cp312-manylinux_2_17_aarch64.manylinux2014_aarch64.whl", hash = "sha256:655ca44a831ecb238d124e0402d98f6212ac527a0ba6c55ca26f616604e60a45"},
-    {file = "rpds_py-0.22.3-cp312-cp312-manylinux_2_17_armv7l.manylinux2014_armv7l.whl", hash = "sha256:feea821ee2a9273771bae61194004ee2fc33f8ec7db08117ef9147d4bbcbca8e"},
-    {file = "rpds_py-0.22.3-cp312-cp312-manylinux_2_17_ppc64le.manylinux2014_ppc64le.whl", hash = "sha256:22bebe05a9ffc70ebfa127efbc429bc26ec9e9b4ee4d15a740033efda515cf3d"},
-    {file = "rpds_py-0.22.3-cp312-cp312-manylinux_2_17_s390x.manylinux2014_s390x.whl", hash = "sha256:3af6e48651c4e0d2d166dc1b033b7042ea3f871504b6805ba5f4fe31581d8d38"},
-    {file = "rpds_py-0.22.3-cp312-cp312-manylinux_2_17_x86_64.manylinux2014_x86_64.whl", hash = "sha256:e67ba3c290821343c192f7eae1d8fd5999ca2dc99994114643e2f2d3e6138b15"},
-    {file = "rpds_py-0.22.3-cp312-cp312-manylinux_2_5_i686.manylinux1_i686.whl", hash = "sha256:02fbb9c288ae08bcb34fb41d516d5eeb0455ac35b5512d03181d755d80810059"},
-    {file = "rpds_py-0.22.3-cp312-cp312-musllinux_1_2_aarch64.whl", hash = "sha256:f56a6b404f74ab372da986d240e2e002769a7d7102cc73eb238a4f72eec5284e"},
-    {file = "rpds_py-0.22.3-cp312-cp312-musllinux_1_2_i686.whl", hash = "sha256:0a0461200769ab3b9ab7e513f6013b7a97fdeee41c29b9db343f3c5a8e2b9e61"},
-    {file = "rpds_py-0.22.3-cp312-cp312-musllinux_1_2_x86_64.whl", hash = "sha256:8633e471c6207a039eff6aa116e35f69f3156b3989ea3e2d755f7bc41754a4a7"},
-    {file = "rpds_py-0.22.3-cp312-cp312-win32.whl", hash = "sha256:593eba61ba0c3baae5bc9be2f5232430453fb4432048de28399ca7376de9c627"},
-    {file = "rpds_py-0.22.3-cp312-cp312-win_amd64.whl", hash = "sha256:d115bffdd417c6d806ea9069237a4ae02f513b778e3789a359bc5856e0404cc4"},
-    {file = "rpds_py-0.22.3-cp313-cp313-macosx_10_12_x86_64.whl", hash = "sha256:ea7433ce7e4bfc3a85654aeb6747babe3f66eaf9a1d0c1e7a4435bbdf27fea84"},
-    {file = "rpds_py-0.22.3-cp313-cp313-macosx_11_0_arm64.whl", hash = "sha256:6dd9412824c4ce1aca56c47b0991e65bebb7ac3f4edccfd3f156150c96a7bf25"},
-    {file = "rpds_py-0.22.3-cp313-cp313-manylinux_2_17_aarch64.manylinux2014_aarch64.whl", hash = "sha256:20070c65396f7373f5df4005862fa162db5d25d56150bddd0b3e8214e8ef45b4"},
-    {file = "rpds_py-0.22.3-cp313-cp313-manylinux_2_17_armv7l.manylinux2014_armv7l.whl", hash = "sha256:0b09865a9abc0ddff4e50b5ef65467cd94176bf1e0004184eb915cbc10fc05c5"},
-    {file = "rpds_py-0.22.3-cp313-cp313-manylinux_2_17_ppc64le.manylinux2014_ppc64le.whl", hash = "sha256:3453e8d41fe5f17d1f8e9c383a7473cd46a63661628ec58e07777c2fff7196dc"},
-    {file = "rpds_py-0.22.3-cp313-cp313-manylinux_2_17_s390x.manylinux2014_s390x.whl", hash = "sha256:f5d36399a1b96e1a5fdc91e0522544580dbebeb1f77f27b2b0ab25559e103b8b"},
-    {file = "rpds_py-0.22.3-cp313-cp313-manylinux_2_17_x86_64.manylinux2014_x86_64.whl", hash = "sha256:009de23c9c9ee54bf11303a966edf4d9087cd43a6003672e6aa7def643d06518"},
-    {file = "rpds_py-0.22.3-cp313-cp313-manylinux_2_5_i686.manylinux1_i686.whl", hash = "sha256:1aef18820ef3e4587ebe8b3bc9ba6e55892a6d7b93bac6d29d9f631a3b4befbd"},
-    {file = "rpds_py-0.22.3-cp313-cp313-musllinux_1_2_aarch64.whl", hash = "sha256:f60bd8423be1d9d833f230fdbccf8f57af322d96bcad6599e5a771b151398eb2"},
-    {file = "rpds_py-0.22.3-cp313-cp313-musllinux_1_2_i686.whl", hash = "sha256:62d9cfcf4948683a18a9aff0ab7e1474d407b7bab2ca03116109f8464698ab16"},
-    {file = "rpds_py-0.22.3-cp313-cp313-musllinux_1_2_x86_64.whl", hash = "sha256:9253fc214112405f0afa7db88739294295f0e08466987f1d70e29930262b4c8f"},
-    {file = "rpds_py-0.22.3-cp313-cp313-win32.whl", hash = "sha256:fb0ba113b4983beac1a2eb16faffd76cb41e176bf58c4afe3e14b9c681f702de"},
-    {file = "rpds_py-0.22.3-cp313-cp313-win_amd64.whl", hash = "sha256:c58e2339def52ef6b71b8f36d13c3688ea23fa093353f3a4fee2556e62086ec9"},
-    {file = "rpds_py-0.22.3-cp313-cp313t-macosx_10_12_x86_64.whl", hash = "sha256:f82a116a1d03628a8ace4859556fb39fd1424c933341a08ea3ed6de1edb0283b"},
-    {file = "rpds_py-0.22.3-cp313-cp313t-macosx_11_0_arm64.whl", hash = "sha256:3dfcbc95bd7992b16f3f7ba05af8a64ca694331bd24f9157b49dadeeb287493b"},
-    {file = "rpds_py-0.22.3-cp313-cp313t-manylinux_2_17_aarch64.manylinux2014_aarch64.whl", hash = "sha256:59259dc58e57b10e7e18ce02c311804c10c5a793e6568f8af4dead03264584d1"},
-    {file = "rpds_py-0.22.3-cp313-cp313t-manylinux_2_17_armv7l.manylinux2014_armv7l.whl", hash = "sha256:5725dd9cc02068996d4438d397e255dcb1df776b7ceea3b9cb972bdb11260a83"},
-    {file = "rpds_py-0.22.3-cp313-cp313t-manylinux_2_17_ppc64le.manylinux2014_ppc64le.whl", hash = "sha256:99b37292234e61325e7a5bb9689e55e48c3f5f603af88b1642666277a81f1fbd"},
-    {file = "rpds_py-0.22.3-cp313-cp313t-manylinux_2_17_s390x.manylinux2014_s390x.whl", hash = "sha256:27b1d3b3915a99208fee9ab092b8184c420f2905b7d7feb4aeb5e4a9c509b8a1"},
-    {file = "rpds_py-0.22.3-cp313-cp313t-manylinux_2_17_x86_64.manylinux2014_x86_64.whl", hash = "sha256:f612463ac081803f243ff13cccc648578e2279295048f2a8d5eb430af2bae6e3"},
-    {file = "rpds_py-0.22.3-cp313-cp313t-manylinux_2_5_i686.manylinux1_i686.whl", hash = "sha256:f73d3fef726b3243a811121de45193c0ca75f6407fe66f3f4e183c983573e130"},
-    {file = "rpds_py-0.22.3-cp313-cp313t-musllinux_1_2_aarch64.whl", hash = "sha256:3f21f0495edea7fdbaaa87e633a8689cd285f8f4af5c869f27bc8074638ad69c"},
-    {file = "rpds_py-0.22.3-cp313-cp313t-musllinux_1_2_i686.whl", hash = "sha256:1e9663daaf7a63ceccbbb8e3808fe90415b0757e2abddbfc2e06c857bf8c5e2b"},
-    {file = "rpds_py-0.22.3-cp313-cp313t-musllinux_1_2_x86_64.whl", hash = "sha256:a76e42402542b1fae59798fab64432b2d015ab9d0c8c47ba7addddbaf7952333"},
-    {file = "rpds_py-0.22.3-cp313-cp313t-win32.whl", hash = "sha256:69803198097467ee7282750acb507fba35ca22cc3b85f16cf45fb01cb9097730"},
-    {file = "rpds_py-0.22.3-cp313-cp313t-win_amd64.whl", hash = "sha256:f5cf2a0c2bdadf3791b5c205d55a37a54025c6e18a71c71f82bb536cf9a454bf"},
-    {file = "rpds_py-0.22.3-cp39-cp39-macosx_10_12_x86_64.whl", hash = "sha256:378753b4a4de2a7b34063d6f95ae81bfa7b15f2c1a04a9518e8644e81807ebea"},
-    {file = "rpds_py-0.22.3-cp39-cp39-macosx_11_0_arm64.whl", hash = "sha256:3445e07bf2e8ecfeef6ef67ac83de670358abf2996916039b16a218e3d95e97e"},
-    {file = "rpds_py-0.22.3-cp39-cp39-manylinux_2_17_aarch64.manylinux2014_aarch64.whl", hash = "sha256:7b2513ba235829860b13faa931f3b6846548021846ac808455301c23a101689d"},
-    {file = "rpds_py-0.22.3-cp39-cp39-manylinux_2_17_armv7l.manylinux2014_armv7l.whl", hash = "sha256:eaf16ae9ae519a0e237a0f528fd9f0197b9bb70f40263ee57ae53c2b8d48aeb3"},
-    {file = "rpds_py-0.22.3-cp39-cp39-manylinux_2_17_ppc64le.manylinux2014_ppc64le.whl", hash = "sha256:583f6a1993ca3369e0f80ba99d796d8e6b1a3a2a442dd4e1a79e652116413091"},
-    {file = "rpds_py-0.22.3-cp39-cp39-manylinux_2_17_s390x.manylinux2014_s390x.whl", hash = "sha256:4617e1915a539a0d9a9567795023de41a87106522ff83fbfaf1f6baf8e85437e"},
-    {file = "rpds_py-0.22.3-cp39-cp39-manylinux_2_17_x86_64.manylinux2014_x86_64.whl", hash = "sha256:0c150c7a61ed4a4f4955a96626574e9baf1adf772c2fb61ef6a5027e52803543"},
-    {file = "rpds_py-0.22.3-cp39-cp39-manylinux_2_5_i686.manylinux1_i686.whl", hash = "sha256:2fa4331c200c2521512595253f5bb70858b90f750d39b8cbfd67465f8d1b596d"},
-    {file = "rpds_py-0.22.3-cp39-cp39-musllinux_1_2_aarch64.whl", hash = "sha256:214b7a953d73b5e87f0ebece4a32a5bd83c60a3ecc9d4ec8f1dca968a2d91e99"},
-    {file = "rpds_py-0.22.3-cp39-cp39-musllinux_1_2_i686.whl", hash = "sha256:f47ad3d5f3258bd7058d2d506852217865afefe6153a36eb4b6928758041d831"},
-    {file = "rpds_py-0.22.3-cp39-cp39-musllinux_1_2_x86_64.whl", hash = "sha256:f276b245347e6e36526cbd4a266a417796fc531ddf391e43574cf6466c492520"},
-    {file = "rpds_py-0.22.3-cp39-cp39-win32.whl", hash = "sha256:bbb232860e3d03d544bc03ac57855cd82ddf19c7a07651a7c0fdb95e9efea8b9"},
-    {file = "rpds_py-0.22.3-cp39-cp39-win_amd64.whl", hash = "sha256:cfbc454a2880389dbb9b5b398e50d439e2e58669160f27b60e5eca11f68ae17c"},
-    {file = "rpds_py-0.22.3-pp310-pypy310_pp73-macosx_10_12_x86_64.whl", hash = "sha256:d48424e39c2611ee1b84ad0f44fb3b2b53d473e65de061e3f460fc0be5f1939d"},
-    {file = "rpds_py-0.22.3-pp310-pypy310_pp73-macosx_11_0_arm64.whl", hash = "sha256:24e8abb5878e250f2eb0d7859a8e561846f98910326d06c0d51381fed59357bd"},
-    {file = "rpds_py-0.22.3-pp310-pypy310_pp73-manylinux_2_17_aarch64.manylinux2014_aarch64.whl", hash = "sha256:4b232061ca880db21fa14defe219840ad9b74b6158adb52ddf0e87bead9e8493"},
-    {file = "rpds_py-0.22.3-pp310-pypy310_pp73-manylinux_2_17_armv7l.manylinux2014_armv7l.whl", hash = "sha256:ac0a03221cdb5058ce0167ecc92a8c89e8d0decdc9e99a2ec23380793c4dcb96"},
-    {file = "rpds_py-0.22.3-pp310-pypy310_pp73-manylinux_2_17_ppc64le.manylinux2014_ppc64le.whl", hash = "sha256:eb0c341fa71df5a4595f9501df4ac5abfb5a09580081dffbd1ddd4654e6e9123"},
-    {file = "rpds_py-0.22.3-pp310-pypy310_pp73-manylinux_2_17_s390x.manylinux2014_s390x.whl", hash = "sha256:bf9db5488121b596dbfc6718c76092fda77b703c1f7533a226a5a9f65248f8ad"},
-    {file = "rpds_py-0.22.3-pp310-pypy310_pp73-manylinux_2_17_x86_64.manylinux2014_x86_64.whl", hash = "sha256:0b8db6b5b2d4491ad5b6bdc2bc7c017eec108acbf4e6785f42a9eb0ba234f4c9"},
-    {file = "rpds_py-0.22.3-pp310-pypy310_pp73-manylinux_2_5_i686.manylinux1_i686.whl", hash = "sha256:b3d504047aba448d70cf6fa22e06cb09f7cbd761939fdd47604f5e007675c24e"},
-    {file = "rpds_py-0.22.3-pp310-pypy310_pp73-musllinux_1_2_aarch64.whl", hash = "sha256:e61b02c3f7a1e0b75e20c3978f7135fd13cb6cf551bf4a6d29b999a88830a338"},
-    {file = "rpds_py-0.22.3-pp310-pypy310_pp73-musllinux_1_2_i686.whl", hash = "sha256:e35ba67d65d49080e8e5a1dd40101fccdd9798adb9b050ff670b7d74fa41c566"},
-    {file = "rpds_py-0.22.3-pp310-pypy310_pp73-musllinux_1_2_x86_64.whl", hash = "sha256:26fd7cac7dd51011a245f29a2cc6489c4608b5a8ce8d75661bb4a1066c52dfbe"},
-    {file = "rpds_py-0.22.3-pp310-pypy310_pp73-win_amd64.whl", hash = "sha256:177c7c0fce2855833819c98e43c262007f42ce86651ffbb84f37883308cb0e7d"},
-    {file = "rpds_py-0.22.3-pp39-pypy39_pp73-macosx_10_12_x86_64.whl", hash = "sha256:bb47271f60660803ad11f4c61b42242b8c1312a31c98c578f79ef9387bbde21c"},
-    {file = "rpds_py-0.22.3-pp39-pypy39_pp73-macosx_11_0_arm64.whl", hash = "sha256:70fb28128acbfd264eda9bf47015537ba3fe86e40d046eb2963d75024be4d055"},
-    {file = "rpds_py-0.22.3-pp39-pypy39_pp73-manylinux_2_17_aarch64.manylinux2014_aarch64.whl", hash = "sha256:44d61b4b7d0c2c9ac019c314e52d7cbda0ae31078aabd0f22e583af3e0d79723"},
-    {file = "rpds_py-0.22.3-pp39-pypy39_pp73-manylinux_2_17_armv7l.manylinux2014_armv7l.whl", hash = "sha256:5f0e260eaf54380380ac3808aa4ebe2d8ca28b9087cf411649f96bad6900c728"},
-    {file = "rpds_py-0.22.3-pp39-pypy39_pp73-manylinux_2_17_ppc64le.manylinux2014_ppc64le.whl", hash = "sha256:b25bc607423935079e05619d7de556c91fb6adeae9d5f80868dde3468657994b"},
-    {file = "rpds_py-0.22.3-pp39-pypy39_pp73-manylinux_2_17_s390x.manylinux2014_s390x.whl", hash = "sha256:fb6116dfb8d1925cbdb52595560584db42a7f664617a1f7d7f6e32f138cdf37d"},
-    {file = "rpds_py-0.22.3-pp39-pypy39_pp73-manylinux_2_17_x86_64.manylinux2014_x86_64.whl", hash = "sha256:a63cbdd98acef6570c62b92a1e43266f9e8b21e699c363c0fef13bd530799c11"},
-    {file = "rpds_py-0.22.3-pp39-pypy39_pp73-manylinux_2_5_i686.manylinux1_i686.whl", hash = "sha256:2b8f60e1b739a74bab7e01fcbe3dddd4657ec685caa04681df9d562ef15b625f"},
-    {file = "rpds_py-0.22.3-pp39-pypy39_pp73-musllinux_1_2_aarch64.whl", hash = "sha256:2e8b55d8517a2fda8d95cb45d62a5a8bbf9dd0ad39c5b25c8833efea07b880ca"},
-    {file = "rpds_py-0.22.3-pp39-pypy39_pp73-musllinux_1_2_i686.whl", hash = "sha256:2de29005e11637e7a2361fa151f780ff8eb2543a0da1413bb951e9f14b699ef3"},
-    {file = "rpds_py-0.22.3-pp39-pypy39_pp73-musllinux_1_2_x86_64.whl", hash = "sha256:666ecce376999bf619756a24ce15bb14c5bfaf04bf00abc7e663ce17c3f34fe7"},
-    {file = "rpds_py-0.22.3-pp39-pypy39_pp73-win_amd64.whl", hash = "sha256:5246b14ca64a8675e0a7161f7af68fe3e910e6b90542b4bfb5439ba752191df6"},
-    {file = "rpds_py-0.22.3.tar.gz", hash = "sha256:e32fee8ab45d3c2db6da19a5323bc3362237c8b653c70194414b892fd06a080d"},
-]
-
 [[package]]
 name = "ruff"
 version = "0.6.9"
@@ -4557,210 +4536,122 @@ files = [
 ]
 
 [[package]]
-name = "s3transfer"
-version = "0.11.2"
-description = "An Amazon S3 Transfer Manager"
-optional = false
-python-versions = ">=3.8"
+name = "safehttpx"
+version = "0.1.7"
+description = "A small Python library created to help developers protect their applications from Server Side Request Forgery (SSRF) attacks."
+optional = true
+python-versions = ">3.9"
 groups = ["main"]
+markers = "extra == \"trackio\""
 files = [
-    {file = "s3transfer-0.11.2-py3-none-any.whl", hash = "sha256:be6ecb39fadd986ef1701097771f87e4d2f821f27f6071c872143884d2950fbc"},
-    {file = "s3transfer-0.11.2.tar.gz", hash = "sha256:3b39185cb72f5acc77db1a58b6e25b977f28d20496b6e58d6813d75f464d632f"},
+    {file = "safehttpx-0.1.7-py3-none-any.whl", hash = "sha256:c4f4a162db6993464d7ca3d7cc4af0ffc6515a606dfd220b9f82c6945d869cde"},
+    {file = "safehttpx-0.1.7.tar.gz", hash = "sha256:db201c0978c41eddb8bb480f3eee59dd67304fdd91646035e9d9a720049a9d23"},
 ]
 
 [package.dependencies]
-botocore = ">=1.36.0,<2.0a.0"
+httpx = "*"
 
 [package.extras]
-crt = ["botocore[crt] (>=1.36.0,<2.0a.0)"]
+dev = ["pytest"]
 
 [[package]]
 name = "safetensors"
-version = "0.4.4"
+version = "0.8.0"
 description = ""
 optional = false
-python-versions = ">=3.7"
+python-versions = ">=3.10"
 groups = ["main"]
 files = [
-    {file = "safetensors-0.4.4-cp310-cp310-macosx_10_12_x86_64.whl", hash = "sha256:2adb497ada13097f30e386e88c959c0fda855a5f6f98845710f5bb2c57e14f12"},
-    {file = "safetensors-0.4.4-cp310-cp310-macosx_11_0_arm64.whl", hash = "sha256:7db7fdc2d71fd1444d85ca3f3d682ba2df7d61a637dfc6d80793f439eae264ab"},
-    {file = "safetensors-0.4.4-cp310-cp310-manylinux_2_17_aarch64.manylinux2014_aarch64.whl", hash = "sha256:8d4f0eed76b430f009fbefca1a0028ddb112891b03cb556d7440d5cd68eb89a9"},
-    {file = "safetensors-0.4.4-cp310-cp310-manylinux_2_17_armv7l.manylinux2014_armv7l.whl", hash = "sha256:57d216fab0b5c432aabf7170883d7c11671622bde8bd1436c46d633163a703f6"},
-    {file = "safetensors-0.4.4-cp310-cp310-manylinux_2_17_ppc64le.manylinux2014_ppc64le.whl", hash = "sha256:7d9b76322e49c056bcc819f8bdca37a2daa5a6d42c07f30927b501088db03309"},
-    {file = "safetensors-0.4.4-cp310-cp310-manylinux_2_17_s390x.manylinux2014_s390x.whl", hash = "sha256:32f0d1f6243e90ee43bc6ee3e8c30ac5b09ca63f5dd35dbc985a1fc5208c451a"},
-    {file = "safetensors-0.4.4-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl", hash = "sha256:44d464bdc384874601a177375028012a5f177f1505279f9456fea84bbc575c7f"},
-    {file = "safetensors-0.4.4-cp310-cp310-manylinux_2_5_i686.manylinux1_i686.whl", hash = "sha256:63144e36209ad8e4e65384dbf2d52dd5b1866986079c00a72335402a38aacdc5"},
-    {file = "safetensors-0.4.4-cp310-cp310-musllinux_1_1_aarch64.whl", hash = "sha256:051d5ecd490af7245258000304b812825974d5e56f14a3ff7e1b8b2ba6dc2ed4"},
-    {file = "safetensors-0.4.4-cp310-cp310-musllinux_1_1_x86_64.whl", hash = "sha256:51bc8429d9376224cd3cf7e8ce4f208b4c930cd10e515b6ac6a72cbc3370f0d9"},
-    {file = "safetensors-0.4.4-cp310-none-win32.whl", hash = "sha256:fb7b54830cee8cf9923d969e2df87ce20e625b1af2fd194222ab902d3adcc29c"},
-    {file = "safetensors-0.4.4-cp310-none-win_amd64.whl", hash = "sha256:4b3e8aa8226d6560de8c2b9d5ff8555ea482599c670610758afdc97f3e021e9c"},
-    {file = "safetensors-0.4.4-cp311-cp311-macosx_10_12_x86_64.whl", hash = "sha256:bbaa31f2cb49013818bde319232ccd72da62ee40f7d2aa532083eda5664e85ff"},
-    {file = "safetensors-0.4.4-cp311-cp311-macosx_11_0_arm64.whl", hash = "sha256:9fdcb80f4e9fbb33b58e9bf95e7dbbedff505d1bcd1c05f7c7ce883632710006"},
-    {file = "safetensors-0.4.4-cp311-cp311-manylinux_2_17_aarch64.manylinux2014_aarch64.whl", hash = "sha256:55c14c20be247b8a1aeaf3ab4476265e3ca83096bb8e09bb1a7aa806088def4f"},
-    {file = "safetensors-0.4.4-cp311-cp311-manylinux_2_17_armv7l.manylinux2014_armv7l.whl", hash = "sha256:949aaa1118660f992dbf0968487b3e3cfdad67f948658ab08c6b5762e90cc8b6"},
-    {file = "safetensors-0.4.4-cp311-cp311-manylinux_2_17_ppc64le.manylinux2014_ppc64le.whl", hash = "sha256:c11a4ab7debc456326a2bac67f35ee0ac792bcf812c7562a4a28559a5c795e27"},
-    {file = "safetensors-0.4.4-cp311-cp311-manylinux_2_17_s390x.manylinux2014_s390x.whl", hash = "sha256:c0cea44bba5c5601b297bc8307e4075535b95163402e4906b2e9b82788a2a6df"},
-    {file = "safetensors-0.4.4-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl", hash = "sha256:a9d752c97f6bbe327352f76e5b86442d776abc789249fc5e72eacb49e6916482"},
-    {file = "safetensors-0.4.4-cp311-cp311-manylinux_2_5_i686.manylinux1_i686.whl", hash = "sha256:03f2bb92e61b055ef6cc22883ad1ae898010a95730fa988c60a23800eb742c2c"},
-    {file = "safetensors-0.4.4-cp311-cp311-musllinux_1_1_aarch64.whl", hash = "sha256:87bf3f91a9328a941acc44eceffd4e1f5f89b030985b2966637e582157173b98"},
-    {file = "safetensors-0.4.4-cp311-cp311-musllinux_1_1_x86_64.whl", hash = "sha256:20d218ec2b6899d29d6895419a58b6e44cc5ff8f0cc29fac8d236a8978ab702e"},
-    {file = "safetensors-0.4.4-cp311-none-win32.whl", hash = "sha256:8079486118919f600c603536e2490ca37b3dbd3280e3ad6eaacfe6264605ac8a"},
-    {file = "safetensors-0.4.4-cp311-none-win_amd64.whl", hash = "sha256:2f8c2eb0615e2e64ee27d478c7c13f51e5329d7972d9e15528d3e4cfc4a08f0d"},
-    {file = "safetensors-0.4.4-cp312-cp312-macosx_10_12_x86_64.whl", hash = "sha256:baec5675944b4a47749c93c01c73d826ef7d42d36ba8d0dba36336fa80c76426"},
-    {file = "safetensors-0.4.4-cp312-cp312-macosx_11_0_arm64.whl", hash = "sha256:f15117b96866401825f3e94543145028a2947d19974429246ce59403f49e77c6"},
-    {file = "safetensors-0.4.4-cp312-cp312-manylinux_2_17_aarch64.manylinux2014_aarch64.whl", hash = "sha256:6a13a9caea485df164c51be4eb0c87f97f790b7c3213d635eba2314d959fe929"},
-    {file = "safetensors-0.4.4-cp312-cp312-manylinux_2_17_armv7l.manylinux2014_armv7l.whl", hash = "sha256:6b54bc4ca5f9b9bba8cd4fb91c24b2446a86b5ae7f8975cf3b7a277353c3127c"},
-    {file = "safetensors-0.4.4-cp312-cp312-manylinux_2_17_ppc64le.manylinux2014_ppc64le.whl", hash = "sha256:08332c22e03b651c8eb7bf5fc2de90044f3672f43403b3d9ac7e7e0f4f76495e"},
-    {file = "safetensors-0.4.4-cp312-cp312-manylinux_2_17_s390x.manylinux2014_s390x.whl", hash = "sha256:bb62841e839ee992c37bb75e75891c7f4904e772db3691c59daaca5b4ab960e1"},
-    {file = "safetensors-0.4.4-cp312-cp312-manylinux_2_17_x86_64.manylinux2014_x86_64.whl", hash = "sha256:8e5b927acc5f2f59547270b0309a46d983edc44be64e1ca27a7fcb0474d6cd67"},
-    {file = "safetensors-0.4.4-cp312-cp312-manylinux_2_5_i686.manylinux1_i686.whl", hash = "sha256:2a69c71b1ae98a8021a09a0b43363b0143b0ce74e7c0e83cacba691b62655fb8"},
-    {file = "safetensors-0.4.4-cp312-cp312-musllinux_1_1_aarch64.whl", hash = "sha256:23654ad162c02a5636f0cd520a0310902c4421aab1d91a0b667722a4937cc445"},
-    {file = "safetensors-0.4.4-cp312-cp312-musllinux_1_1_x86_64.whl", hash = "sha256:0677c109d949cf53756859160b955b2e75b0eefe952189c184d7be30ecf7e858"},
-    {file = "safetensors-0.4.4-cp312-none-win32.whl", hash = "sha256:a51d0ddd4deb8871c6de15a772ef40b3dbd26a3c0451bb9e66bc76fc5a784e5b"},
-    {file = "safetensors-0.4.4-cp312-none-win_amd64.whl", hash = "sha256:2d065059e75a798bc1933c293b68d04d79b586bb7f8c921e0ca1e82759d0dbb1"},
-    {file = "safetensors-0.4.4-cp313-cp313-macosx_10_12_x86_64.whl", hash = "sha256:9d625692578dd40a112df30c02a1adf068027566abd8e6a74893bb13d441c150"},
-    {file = "safetensors-0.4.4-cp313-cp313-macosx_11_0_arm64.whl", hash = "sha256:7cabcf39c81e5b988d0adefdaea2eb9b4fd9bd62d5ed6559988c62f36bfa9a89"},
-    {file = "safetensors-0.4.4-cp313-cp313-manylinux_2_17_aarch64.manylinux2014_aarch64.whl", hash = "sha256:8359bef65f49d51476e9811d59c015f0ddae618ee0e44144f5595278c9f8268c"},
-    {file = "safetensors-0.4.4-cp313-cp313-manylinux_2_17_armv7l.manylinux2014_armv7l.whl", hash = "sha256:1a32c662e7df9226fd850f054a3ead0e4213a96a70b5ce37b2d26ba27004e013"},
-    {file = "safetensors-0.4.4-cp313-cp313-manylinux_2_17_ppc64le.manylinux2014_ppc64le.whl", hash = "sha256:c329a4dcc395364a1c0d2d1574d725fe81a840783dda64c31c5a60fc7d41472c"},
-    {file = "safetensors-0.4.4-cp313-cp313-manylinux_2_17_s390x.manylinux2014_s390x.whl", hash = "sha256:239ee093b1db877c9f8fe2d71331a97f3b9c7c0d3ab9f09c4851004a11f44b65"},
-    {file = "safetensors-0.4.4-cp313-cp313-manylinux_2_17_x86_64.manylinux2014_x86_64.whl", hash = "sha256:bd574145d930cf9405a64f9923600879a5ce51d9f315443a5f706374841327b6"},
-    {file = "safetensors-0.4.4-cp313-cp313-manylinux_2_5_i686.manylinux1_i686.whl", hash = "sha256:f6784eed29f9e036acb0b7769d9e78a0dc2c72c2d8ba7903005350d817e287a4"},
-    {file = "safetensors-0.4.4-cp313-cp313-musllinux_1_1_aarch64.whl", hash = "sha256:65a4a6072436bf0a4825b1c295d248cc17e5f4651e60ee62427a5bcaa8622a7a"},
-    {file = "safetensors-0.4.4-cp313-cp313-musllinux_1_1_x86_64.whl", hash = "sha256:df81e3407630de060ae8313da49509c3caa33b1a9415562284eaf3d0c7705f9f"},
-    {file = "safetensors-0.4.4-cp37-cp37m-macosx_10_12_x86_64.whl", hash = "sha256:e4a0f374200e8443d9746e947ebb346c40f83a3970e75a685ade0adbba5c48d9"},
-    {file = "safetensors-0.4.4-cp37-cp37m-macosx_11_0_arm64.whl", hash = "sha256:181fb5f3dee78dae7fd7ec57d02e58f7936498d587c6b7c1c8049ef448c8d285"},
-    {file = "safetensors-0.4.4-cp37-cp37m-manylinux_2_17_aarch64.manylinux2014_aarch64.whl", hash = "sha256:2cb4ac1d8f6b65ec84ddfacd275079e89d9df7c92f95675ba96c4f790a64df6e"},
-    {file = "safetensors-0.4.4-cp37-cp37m-manylinux_2_17_armv7l.manylinux2014_armv7l.whl", hash = "sha256:76897944cd9239e8a70955679b531b9a0619f76e25476e57ed373322d9c2075d"},
-    {file = "safetensors-0.4.4-cp37-cp37m-manylinux_2_17_ppc64le.manylinux2014_ppc64le.whl", hash = "sha256:2a9e9d1a27e51a0f69e761a3d581c3af46729ec1c988fa1f839e04743026ae35"},
-    {file = "safetensors-0.4.4-cp37-cp37m-manylinux_2_17_s390x.manylinux2014_s390x.whl", hash = "sha256:005ef9fc0f47cb9821c40793eb029f712e97278dae84de91cb2b4809b856685d"},
-    {file = "safetensors-0.4.4-cp37-cp37m-manylinux_2_17_x86_64.manylinux2014_x86_64.whl", hash = "sha256:26987dac3752688c696c77c3576f951dbbdb8c57f0957a41fb6f933cf84c0b62"},
-    {file = "safetensors-0.4.4-cp37-cp37m-manylinux_2_5_i686.manylinux1_i686.whl", hash = "sha256:c05270b290acd8d249739f40d272a64dd597d5a4b90f27d830e538bc2549303c"},
-    {file = "safetensors-0.4.4-cp37-cp37m-musllinux_1_1_aarch64.whl", hash = "sha256:068d3a33711fc4d93659c825a04480ff5a3854e1d78632cdc8f37fee917e8a60"},
-    {file = "safetensors-0.4.4-cp37-cp37m-musllinux_1_1_x86_64.whl", hash = "sha256:063421ef08ca1021feea8b46951251b90ae91f899234dd78297cbe7c1db73b99"},
-    {file = "safetensors-0.4.4-cp37-none-win32.whl", hash = "sha256:d52f5d0615ea83fd853d4e1d8acf93cc2e0223ad4568ba1e1f6ca72e94ea7b9d"},
-    {file = "safetensors-0.4.4-cp37-none-win_amd64.whl", hash = "sha256:88a5ac3280232d4ed8e994cbc03b46a1807ce0aa123867b40c4a41f226c61f94"},
-    {file = "safetensors-0.4.4-cp38-cp38-macosx_10_12_x86_64.whl", hash = "sha256:3467ab511bfe3360967d7dc53b49f272d59309e57a067dd2405b4d35e7dcf9dc"},
-    {file = "safetensors-0.4.4-cp38-cp38-macosx_11_0_arm64.whl", hash = "sha256:2ab4c96d922e53670ce25fbb9b63d5ea972e244de4fa1dd97b590d9fd66aacef"},
-    {file = "safetensors-0.4.4-cp38-cp38-manylinux_2_17_aarch64.manylinux2014_aarch64.whl", hash = "sha256:87df18fce4440477c3ef1fd7ae17c704a69a74a77e705a12be135ee0651a0c2d"},
-    {file = "safetensors-0.4.4-cp38-cp38-manylinux_2_17_armv7l.manylinux2014_armv7l.whl", hash = "sha256:0e5fe345b2bc7d88587149ac11def1f629d2671c4c34f5df38aed0ba59dc37f8"},
-    {file = "safetensors-0.4.4-cp38-cp38-manylinux_2_17_ppc64le.manylinux2014_ppc64le.whl", hash = "sha256:9f1a3e01dce3cd54060791e7e24588417c98b941baa5974700eeb0b8eb65b0a0"},
-    {file = "safetensors-0.4.4-cp38-cp38-manylinux_2_17_s390x.manylinux2014_s390x.whl", hash = "sha256:1c6bf35e9a8998d8339fd9a05ac4ce465a4d2a2956cc0d837b67c4642ed9e947"},
-    {file = "safetensors-0.4.4-cp38-cp38-manylinux_2_17_x86_64.manylinux2014_x86_64.whl", hash = "sha256:166c0c52f6488b8538b2a9f3fbc6aad61a7261e170698779b371e81b45f0440d"},
-    {file = "safetensors-0.4.4-cp38-cp38-manylinux_2_5_i686.manylinux1_i686.whl", hash = "sha256:87e9903b8668a16ef02c08ba4ebc91e57a49c481e9b5866e31d798632805014b"},
-    {file = "safetensors-0.4.4-cp38-cp38-musllinux_1_1_aarch64.whl", hash = "sha256:a9c421153aa23c323bd8483d4155b4eee82c9a50ac11cccd83539104a8279c64"},
-    {file = "safetensors-0.4.4-cp38-cp38-musllinux_1_1_x86_64.whl", hash = "sha256:a4b8617499b2371c7353302c5116a7e0a3a12da66389ce53140e607d3bf7b3d3"},
-    {file = "safetensors-0.4.4-cp38-none-win32.whl", hash = "sha256:c6280f5aeafa1731f0a3709463ab33d8e0624321593951aefada5472f0b313fd"},
-    {file = "safetensors-0.4.4-cp38-none-win_amd64.whl", hash = "sha256:6ceed6247fc2d33b2a7b7d25d8a0fe645b68798856e0bc7a9800c5fd945eb80f"},
-    {file = "safetensors-0.4.4-cp39-cp39-macosx_10_12_x86_64.whl", hash = "sha256:5cf6c6f6193797372adf50c91d0171743d16299491c75acad8650107dffa9269"},
-    {file = "safetensors-0.4.4-cp39-cp39-macosx_11_0_arm64.whl", hash = "sha256:419010156b914a3e5da4e4adf992bee050924d0fe423c4b329e523e2c14c3547"},
-    {file = "safetensors-0.4.4-cp39-cp39-manylinux_2_17_aarch64.manylinux2014_aarch64.whl", hash = "sha256:88f6fd5a5c1302ce79993cc5feeadcc795a70f953c762544d01fb02b2db4ea33"},
-    {file = "safetensors-0.4.4-cp39-cp39-manylinux_2_17_armv7l.manylinux2014_armv7l.whl", hash = "sha256:d468cffb82d90789696d5b4d8b6ab8843052cba58a15296691a7a3df55143cd2"},
-    {file = "safetensors-0.4.4-cp39-cp39-manylinux_2_17_ppc64le.manylinux2014_ppc64le.whl", hash = "sha256:9353c2af2dd467333d4850a16edb66855e795561cd170685178f706c80d2c71e"},
-    {file = "safetensors-0.4.4-cp39-cp39-manylinux_2_17_s390x.manylinux2014_s390x.whl", hash = "sha256:83c155b4a33368d9b9c2543e78f2452090fb030c52401ca608ef16fa58c98353"},
-    {file = "safetensors-0.4.4-cp39-cp39-manylinux_2_17_x86_64.manylinux2014_x86_64.whl", hash = "sha256:9850754c434e636ce3dc586f534bb23bcbd78940c304775bee9005bf610e98f1"},
-    {file = "safetensors-0.4.4-cp39-cp39-manylinux_2_5_i686.manylinux1_i686.whl", hash = "sha256:275f500b4d26f67b6ec05629a4600645231bd75e4ed42087a7c1801bff04f4b3"},
-    {file = "safetensors-0.4.4-cp39-cp39-musllinux_1_1_aarch64.whl", hash = "sha256:5c2308de665b7130cd0e40a2329278226e4cf083f7400c51ca7e19ccfb3886f3"},
-    {file = "safetensors-0.4.4-cp39-cp39-musllinux_1_1_x86_64.whl", hash = "sha256:e06a9ebc8656e030ccfe44634f2a541b4b1801cd52e390a53ad8bacbd65f8518"},
-    {file = "safetensors-0.4.4-cp39-none-win32.whl", hash = "sha256:ef73df487b7c14b477016947c92708c2d929e1dee2bacdd6fff5a82ed4539537"},
-    {file = "safetensors-0.4.4-cp39-none-win_amd64.whl", hash = "sha256:83d054818a8d1198d8bd8bc3ea2aac112a2c19def2bf73758321976788706398"},
-    {file = "safetensors-0.4.4-pp310-pypy310_pp73-macosx_10_12_x86_64.whl", hash = "sha256:1d1f34c71371f0e034004a0b583284b45d233dd0b5f64a9125e16b8a01d15067"},
-    {file = "safetensors-0.4.4-pp310-pypy310_pp73-macosx_11_0_arm64.whl", hash = "sha256:1a8043a33d58bc9b30dfac90f75712134ca34733ec3d8267b1bd682afe7194f5"},
-    {file = "safetensors-0.4.4-pp310-pypy310_pp73-manylinux_2_17_aarch64.manylinux2014_aarch64.whl", hash = "sha256:8db8f0c59c84792c12661f8efa85de160f80efe16b87a9d5de91b93f9e0bce3c"},
-    {file = "safetensors-0.4.4-pp310-pypy310_pp73-manylinux_2_17_x86_64.manylinux2014_x86_64.whl", hash = "sha256:cfc1fc38e37630dd12d519bdec9dcd4b345aec9930bb9ce0ed04461f49e58b52"},
-    {file = "safetensors-0.4.4-pp310-pypy310_pp73-manylinux_2_5_i686.manylinux1_i686.whl", hash = "sha256:e5c9d86d9b13b18aafa88303e2cd21e677f5da2a14c828d2c460fe513af2e9a5"},
-    {file = "safetensors-0.4.4-pp310-pypy310_pp73-musllinux_1_1_aarch64.whl", hash = "sha256:43251d7f29a59120a26f5a0d9583b9e112999e500afabcfdcb91606d3c5c89e3"},
-    {file = "safetensors-0.4.4-pp310-pypy310_pp73-musllinux_1_1_x86_64.whl", hash = "sha256:2c42e9b277513b81cf507e6121c7b432b3235f980cac04f39f435b7902857f91"},
-    {file = "safetensors-0.4.4-pp37-pypy37_pp73-macosx_10_12_x86_64.whl", hash = "sha256:3daacc9a4e3f428a84dd56bf31f20b768eb0b204af891ed68e1f06db9edf546f"},
-    {file = "safetensors-0.4.4-pp37-pypy37_pp73-manylinux_2_17_aarch64.manylinux2014_aarch64.whl", hash = "sha256:218bbb9b883596715fc9997bb42470bf9f21bb832c3b34c2bf744d6fa8f2bbba"},
-    {file = "safetensors-0.4.4-pp37-pypy37_pp73-manylinux_2_17_x86_64.manylinux2014_x86_64.whl", hash = "sha256:7bd5efc26b39f7fc82d4ab1d86a7f0644c8e34f3699c33f85bfa9a717a030e1b"},
-    {file = "safetensors-0.4.4-pp37-pypy37_pp73-manylinux_2_5_i686.manylinux1_i686.whl", hash = "sha256:56ad9776b65d8743f86698a1973292c966cf3abff627efc44ed60e66cc538ddd"},
-    {file = "safetensors-0.4.4-pp37-pypy37_pp73-musllinux_1_1_aarch64.whl", hash = "sha256:30f23e6253c5f43a809dea02dc28a9f5fa747735dc819f10c073fe1b605e97d4"},
-    {file = "safetensors-0.4.4-pp37-pypy37_pp73-musllinux_1_1_x86_64.whl", hash = "sha256:5512078d00263de6cb04e9d26c9ae17611098f52357fea856213e38dc462f81f"},
-    {file = "safetensors-0.4.4-pp38-pypy38_pp73-macosx_10_12_x86_64.whl", hash = "sha256:b96c3d9266439d17f35fc2173111d93afc1162f168e95aed122c1ca517b1f8f1"},
-    {file = "safetensors-0.4.4-pp38-pypy38_pp73-macosx_11_0_arm64.whl", hash = "sha256:08d464aa72a9a13826946b4fb9094bb4b16554bbea2e069e20bd903289b6ced9"},
-    {file = "safetensors-0.4.4-pp38-pypy38_pp73-manylinux_2_17_aarch64.manylinux2014_aarch64.whl", hash = "sha256:210160816d5a36cf41f48f38473b6f70d7bcb4b0527bedf0889cc0b4c3bb07db"},
-    {file = "safetensors-0.4.4-pp38-pypy38_pp73-manylinux_2_17_x86_64.manylinux2014_x86_64.whl", hash = "sha256:eb276a53717f2bcfb6df0bcf284d8a12069002508d4c1ca715799226024ccd45"},
-    {file = "safetensors-0.4.4-pp38-pypy38_pp73-manylinux_2_5_i686.manylinux1_i686.whl", hash = "sha256:a2c28c6487f17d8db0089e8b2cdc13de859366b94cc6cdc50e1b0a4147b56551"},
-    {file = "safetensors-0.4.4-pp38-pypy38_pp73-musllinux_1_1_aarch64.whl", hash = "sha256:7915f0c60e4e6e65d90f136d85dd3b429ae9191c36b380e626064694563dbd9f"},
-    {file = "safetensors-0.4.4-pp38-pypy38_pp73-musllinux_1_1_x86_64.whl", hash = "sha256:00eea99ae422fbfa0b46065acbc58b46bfafadfcec179d4b4a32d5c45006af6c"},
-    {file = "safetensors-0.4.4-pp39-pypy39_pp73-macosx_10_12_x86_64.whl", hash = "sha256:bb1ed4fcb0b3c2f3ea2c5767434622fe5d660e5752f21ac2e8d737b1e5e480bb"},
-    {file = "safetensors-0.4.4-pp39-pypy39_pp73-macosx_11_0_arm64.whl", hash = "sha256:73fc9a0a4343188bdb421783e600bfaf81d0793cd4cce6bafb3c2ed567a74cd5"},
-    {file = "safetensors-0.4.4-pp39-pypy39_pp73-manylinux_2_17_aarch64.manylinux2014_aarch64.whl", hash = "sha256:2c37e6b714200824c73ca6eaf007382de76f39466a46e97558b8dc4cf643cfbf"},
-    {file = "safetensors-0.4.4-pp39-pypy39_pp73-manylinux_2_17_x86_64.manylinux2014_x86_64.whl", hash = "sha256:f75698c5c5c542417ac4956acfc420f7d4a2396adca63a015fd66641ea751759"},
-    {file = "safetensors-0.4.4-pp39-pypy39_pp73-manylinux_2_5_i686.manylinux1_i686.whl", hash = "sha256:ca1a209157f242eb183e209040097118472e169f2e069bfbd40c303e24866543"},
-    {file = "safetensors-0.4.4-pp39-pypy39_pp73-musllinux_1_1_aarch64.whl", hash = "sha256:177f2b60a058f92a3cec7a1786c9106c29eca8987ecdfb79ee88126e5f47fa31"},
-    {file = "safetensors-0.4.4-pp39-pypy39_pp73-musllinux_1_1_x86_64.whl", hash = "sha256:ee9622e84fe6e4cd4f020e5fda70d6206feff3157731df7151d457fdae18e541"},
-    {file = "safetensors-0.4.4.tar.gz", hash = "sha256:5fe3e9b705250d0172ed4e100a811543108653fb2b66b9e702a088ad03772a07"},
+    {file = "safetensors-0.8.0-cp310-abi3-macosx_10_12_x86_64.whl", hash = "sha256:c554f85858e05226d3c2828e32395e677434685d6d94594a41643361c5e837f0"},
+    {file = "safetensors-0.8.0-cp310-abi3-macosx_11_0_arm64.whl", hash = "sha256:c80201d22cbf405b80647a60ada77bba06c8fba2da2743ba1e89cdcc39a81f25"},
+    {file = "safetensors-0.8.0-cp310-abi3-manylinux_2_17_aarch64.manylinux2014_aarch64.whl", hash = "sha256:7a46e5ff292c356d6991e60942ba7f79817682d3a2cef0702136448cb9c4d235"},
+    {file = "safetensors-0.8.0-cp310-abi3-manylinux_2_17_armv7l.manylinux2014_armv7l.whl", hash = "sha256:4124502b78f03534117c848f87a39b8f31e577b15eff423bf8bfb95f2a8c30d0"},
+    {file = "safetensors-0.8.0-cp310-abi3-manylinux_2_17_ppc64le.manylinux2014_ppc64le.whl", hash = "sha256:7bc0a787ba8a35be368ee3574edfa2b1ad389eebd0a72e482ae275490e3f6c98"},
+    {file = "safetensors-0.8.0-cp310-abi3-manylinux_2_17_s390x.manylinux2014_s390x.whl", hash = "sha256:040070828e36dc8e122178bbbd5830ff9e97920affb84cbe0f46442497bed358"},
+    {file = "safetensors-0.8.0-cp310-abi3-manylinux_2_17_x86_64.manylinux2014_x86_64.whl", hash = "sha256:fd6f3f93c9a0a7cc2788ee63fb763353d4bd2e89b0751bc78fcf7dda00bea774"},
+    {file = "safetensors-0.8.0-cp310-abi3-manylinux_2_31_riscv64.whl", hash = "sha256:fcdd41ec4628fee5799f807c73c353629130fbd942aa23d83c623dd6c9d52d78"},
+    {file = "safetensors-0.8.0-cp310-abi3-manylinux_2_5_i686.manylinux1_i686.whl", hash = "sha256:8e9f537aa183a38ace122d27303dcd986b26bd2a7591f9181d7f0c396f4677ca"},
+    {file = "safetensors-0.8.0-cp310-abi3-musllinux_1_2_aarch64.whl", hash = "sha256:87eec7ffed2b809f05a398a8becb7d013f19f7837cd15d9748580d6cf30dbaf4"},
+    {file = "safetensors-0.8.0-cp310-abi3-musllinux_1_2_armv7l.whl", hash = "sha256:4a95ae2b05d7726d751da4ebf626a2ca782b706e101bd894c95bc2450b1cffcc"},
+    {file = "safetensors-0.8.0-cp310-abi3-musllinux_1_2_i686.whl", hash = "sha256:3ae091f16662658bdc019a4ff6cb4c085bb7d725eb5978b183ffd265863b6d2d"},
+    {file = "safetensors-0.8.0-cp310-abi3-musllinux_1_2_x86_64.whl", hash = "sha256:8e080062fcde23be189565e1c3305d16751a218ecf9412c8601e64204eb6f846"},
+    {file = "safetensors-0.8.0-cp310-abi3-win32.whl", hash = "sha256:2ddf52eac562eda224f99acfa7889d02968c1fd59a5b011ae7d8137c37e9c02d"},
+    {file = "safetensors-0.8.0-cp310-abi3-win_amd64.whl", hash = "sha256:096ec1a98435df7beb08853bb5aa9081a84f23d0adc67ed1a0a10550f608373f"},
+    {file = "safetensors-0.8.0-cp310-abi3-win_arm64.whl", hash = "sha256:f7838e5135a406ad3e02efdcb8cf2e5397d368b0154537c4fec682dbc544d452"},
+    {file = "safetensors-0.8.0.tar.gz", hash = "sha256:fabaf3e0f18a6618d9b36560682562157f77c2b71fcffc7b432be2baed9d753d"},
 ]
 
 [package.extras]
-all = ["safetensors[jax]", "safetensors[numpy]", "safetensors[paddlepaddle]", "safetensors[pinned-tf]", "safetensors[quality]", "safetensors[testing]", "safetensors[torch]"]
-dev = ["safetensors[all]"]
+all = ["safetensors[convert]", "safetensors[jax]", "safetensors[numpy]", "safetensors[paddlepaddle]", "safetensors[quality]", "safetensors[testing]", "safetensors[torch]"]
+convert = ["huggingface-hub (>=1.4)", "safetensors[torch]"]
+dev = ["safetensors[all]", "safetensors[pinned-tf]"]
 jax = ["flax (>=0.6.3)", "jax (>=0.3.25)", "jaxlib (>=0.3.25)", "safetensors[numpy]"]
 mlx = ["mlx (>=0.0.9)"]
-numpy = ["numpy (>=1.21.6)"]
+numpy = ["numpy (>=1.24.6)"]
 paddlepaddle = ["paddlepaddle (>=2.4.1)", "safetensors[numpy]"]
-pinned-tf = ["safetensors[numpy]", "tensorflow (==2.11.0)"]
-quality = ["black (==22.3)", "click (==8.0.4)", "flake8 (>=3.8.3)", "isort (>=5.5.4)"]
+pinned-tf = ["safetensors[numpy]", "tensorflow (==2.18.0)"]
+quality = ["ruff"]
 tensorflow = ["safetensors[numpy]", "tensorflow (>=2.11.0)"]
-testing = ["h5py (>=3.7.0)", "huggingface-hub (>=0.12.1)", "hypothesis (>=6.70.2)", "pytest (>=7.2.0)", "pytest-benchmark (>=4.0.0)", "safetensors[numpy]", "setuptools-rust (>=1.5.2)"]
-torch = ["safetensors[numpy]", "torch (>=1.10)"]
+testing = ["fsspec (>=2024.6.0)", "h5py (>=3.7.0)", "hypothesis (>=6.70.2)", "pytest (>=9.0)", "pytest-benchmark (>=5.2)", "s3fs (>=2024.6.0)", "safetensors[numpy]", "setuptools-rust (>=1.12.0)"]
+tf-nightly = ["safetensors[numpy]", "tf-nightly"]
+torch = ["safetensors[numpy]", "torch (>=2.4)"]
 
 [[package]]
 name = "scikit-learn"
-version = "1.6.1"
+version = "1.9.0"
 description = "A set of python modules for machine learning and data mining"
 optional = false
-python-versions = ">=3.9"
-groups = ["main"]
-files = [
-    {file = "scikit_learn-1.6.1-cp310-cp310-macosx_10_9_x86_64.whl", hash = "sha256:d056391530ccd1e501056160e3c9673b4da4805eb67eb2bdf4e983e1f9c9204e"},
-    {file = "scikit_learn-1.6.1-cp310-cp310-macosx_12_0_arm64.whl", hash = "sha256:0c8d036eb937dbb568c6242fa598d551d88fb4399c0344d95c001980ec1c7d36"},
-    {file = "scikit_learn-1.6.1-cp310-cp310-manylinux_2_17_aarch64.manylinux2014_aarch64.whl", hash = "sha256:8634c4bd21a2a813e0a7e3900464e6d593162a29dd35d25bdf0103b3fce60ed5"},
-    {file = "scikit_learn-1.6.1-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl", hash = "sha256:775da975a471c4f6f467725dff0ced5c7ac7bda5e9316b260225b48475279a1b"},
-    {file = "scikit_learn-1.6.1-cp310-cp310-win_amd64.whl", hash = "sha256:8a600c31592bd7dab31e1c61b9bbd6dea1b3433e67d264d17ce1017dbdce8002"},
-    {file = "scikit_learn-1.6.1-cp311-cp311-macosx_10_9_x86_64.whl", hash = "sha256:72abc587c75234935e97d09aa4913a82f7b03ee0b74111dcc2881cba3c5a7b33"},
-    {file = "scikit_learn-1.6.1-cp311-cp311-macosx_12_0_arm64.whl", hash = "sha256:b3b00cdc8f1317b5f33191df1386c0befd16625f49d979fe77a8d44cae82410d"},
-    {file = "scikit_learn-1.6.1-cp311-cp311-manylinux_2_17_aarch64.manylinux2014_aarch64.whl", hash = "sha256:dc4765af3386811c3ca21638f63b9cf5ecf66261cc4815c1db3f1e7dc7b79db2"},
-    {file = "scikit_learn-1.6.1-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl", hash = "sha256:25fc636bdaf1cc2f4a124a116312d837148b5e10872147bdaf4887926b8c03d8"},
-    {file = "scikit_learn-1.6.1-cp311-cp311-win_amd64.whl", hash = "sha256:fa909b1a36e000a03c382aade0bd2063fd5680ff8b8e501660c0f59f021a6415"},
-    {file = "scikit_learn-1.6.1-cp312-cp312-macosx_10_13_x86_64.whl", hash = "sha256:926f207c804104677af4857b2c609940b743d04c4c35ce0ddc8ff4f053cddc1b"},
-    {file = "scikit_learn-1.6.1-cp312-cp312-macosx_12_0_arm64.whl", hash = "sha256:2c2cae262064e6a9b77eee1c8e768fc46aa0b8338c6a8297b9b6759720ec0ff2"},
-    {file = "scikit_learn-1.6.1-cp312-cp312-manylinux_2_17_aarch64.manylinux2014_aarch64.whl", hash = "sha256:1061b7c028a8663fb9a1a1baf9317b64a257fcb036dae5c8752b2abef31d136f"},
-    {file = "scikit_learn-1.6.1-cp312-cp312-manylinux_2_17_x86_64.manylinux2014_x86_64.whl", hash = "sha256:2e69fab4ebfc9c9b580a7a80111b43d214ab06250f8a7ef590a4edf72464dd86"},
-    {file = "scikit_learn-1.6.1-cp312-cp312-win_amd64.whl", hash = "sha256:70b1d7e85b1c96383f872a519b3375f92f14731e279a7b4c6cfd650cf5dffc52"},
-    {file = "scikit_learn-1.6.1-cp313-cp313-macosx_10_13_x86_64.whl", hash = "sha256:2ffa1e9e25b3d93990e74a4be2c2fc61ee5af85811562f1288d5d055880c4322"},
-    {file = "scikit_learn-1.6.1-cp313-cp313-macosx_12_0_arm64.whl", hash = "sha256:dc5cf3d68c5a20ad6d571584c0750ec641cc46aeef1c1507be51300e6003a7e1"},
-    {file = "scikit_learn-1.6.1-cp313-cp313-manylinux_2_17_aarch64.manylinux2014_aarch64.whl", hash = "sha256:c06beb2e839ecc641366000ca84f3cf6fa9faa1777e29cf0c04be6e4d096a348"},
-    {file = "scikit_learn-1.6.1-cp313-cp313-manylinux_2_17_x86_64.manylinux2014_x86_64.whl", hash = "sha256:e8ca8cb270fee8f1f76fa9bfd5c3507d60c6438bbee5687f81042e2bb98e5a97"},
-    {file = "scikit_learn-1.6.1-cp313-cp313-win_amd64.whl", hash = "sha256:7a1c43c8ec9fde528d664d947dc4c0789be4077a3647f232869f41d9bf50e0fb"},
-    {file = "scikit_learn-1.6.1-cp313-cp313t-macosx_10_13_x86_64.whl", hash = "sha256:a17c1dea1d56dcda2fac315712f3651a1fea86565b64b48fa1bc090249cbf236"},
-    {file = "scikit_learn-1.6.1-cp313-cp313t-macosx_12_0_arm64.whl", hash = "sha256:6a7aa5f9908f0f28f4edaa6963c0a6183f1911e63a69aa03782f0d924c830a35"},
-    {file = "scikit_learn-1.6.1-cp313-cp313t-manylinux_2_17_x86_64.manylinux2014_x86_64.whl", hash = "sha256:0650e730afb87402baa88afbf31c07b84c98272622aaba002559b614600ca691"},
-    {file = "scikit_learn-1.6.1-cp313-cp313t-win_amd64.whl", hash = "sha256:3f59fe08dc03ea158605170eb52b22a105f238a5d512c4470ddeca71feae8e5f"},
-    {file = "scikit_learn-1.6.1-cp39-cp39-macosx_10_9_x86_64.whl", hash = "sha256:6849dd3234e87f55dce1db34c89a810b489ead832aaf4d4550b7ea85628be6c1"},
-    {file = "scikit_learn-1.6.1-cp39-cp39-macosx_12_0_arm64.whl", hash = "sha256:e7be3fa5d2eb9be7d77c3734ff1d599151bb523674be9b834e8da6abe132f44e"},
-    {file = "scikit_learn-1.6.1-cp39-cp39-manylinux_2_17_aarch64.manylinux2014_aarch64.whl", hash = "sha256:44a17798172df1d3c1065e8fcf9019183f06c87609b49a124ebdf57ae6cb0107"},
-    {file = "scikit_learn-1.6.1-cp39-cp39-manylinux_2_17_x86_64.manylinux2014_x86_64.whl", hash = "sha256:b8b7a3b86e411e4bce21186e1c180d792f3d99223dcfa3b4f597ecc92fa1a422"},
-    {file = "scikit_learn-1.6.1-cp39-cp39-win_amd64.whl", hash = "sha256:7a73d457070e3318e32bdb3aa79a8d990474f19035464dfd8bede2883ab5dc3b"},
-    {file = "scikit_learn-1.6.1.tar.gz", hash = "sha256:b4fc2525eca2c69a59260f583c56a7557c6ccdf8deafdba6e060f94c1c59738e"},
+python-versions = ">=3.11"
+groups = ["training"]
+files = [
+    {file = "scikit_learn-1.9.0-cp311-cp311-macosx_10_9_x86_64.whl", hash = "sha256:9db6f4d34e68c8899e4cab27fdf8eafe6ed21f2ba52ceb25ea250cd237f8e47b"},
+    {file = "scikit_learn-1.9.0-cp311-cp311-macosx_12_0_arm64.whl", hash = "sha256:f401448645a3e7bc115aa3c094097865155b34bff1cba8101857d9104e99074c"},
+    {file = "scikit_learn-1.9.0-cp311-cp311-manylinux_2_27_aarch64.manylinux_2_28_aarch64.whl", hash = "sha256:fd3a8ef0c758555a3b23c03adaa858af32f7736785ded50ad5991f59c4ed03fa"},
+    {file = "scikit_learn-1.9.0-cp311-cp311-manylinux_2_27_x86_64.manylinux_2_28_x86_64.whl", hash = "sha256:f7e254636164090da847715a27f8e5478feb98c40a9e0ee90cbd277de9e5ceb8"},
+    {file = "scikit_learn-1.9.0-cp311-cp311-win_amd64.whl", hash = "sha256:5dc1818c77575d149e25fce9ef82dd7b7263ae372f03494158668ad632a69759"},
+    {file = "scikit_learn-1.9.0-cp311-cp311-win_arm64.whl", hash = "sha256:366652351f092b219c248f1e72821e841960a63d8f358f1dcfd54dc1cbdbbc28"},
+    {file = "scikit_learn-1.9.0-cp312-cp312-macosx_10_13_x86_64.whl", hash = "sha256:2bd41b0d201bc81575531b96b713d3eb5e5f50fb0b82101ff0f92294fdc236ac"},
+    {file = "scikit_learn-1.9.0-cp312-cp312-macosx_12_0_arm64.whl", hash = "sha256:5be45aa4a42a68a533913a6ed736cf309de2226411c79ef8d609a5456f1939b1"},
+    {file = "scikit_learn-1.9.0-cp312-cp312-manylinux_2_27_aarch64.manylinux_2_28_aarch64.whl", hash = "sha256:5e50ed4da51974e86e940690e9a3d82e729b62b5a49f7c9bac534d515d39d86f"},
+    {file = "scikit_learn-1.9.0-cp312-cp312-manylinux_2_27_x86_64.manylinux_2_28_x86_64.whl", hash = "sha256:056c92bb67ad4c28463c2f2653d9701449201e7e7a9e94e321be0f71c4fef2b8"},
+    {file = "scikit_learn-1.9.0-cp312-cp312-win_amd64.whl", hash = "sha256:4306775fad04cc4b472a1b15af1ae9cede1540fbfcc17fbce3767cd8dc7ae283"},
+    {file = "scikit_learn-1.9.0-cp312-cp312-win_arm64.whl", hash = "sha256:26e22435f63bcdcf396b574273f29f13dd531f5ea035801f5be10ba1540a4e60"},
+    {file = "scikit_learn-1.9.0-cp313-cp313-macosx_10_13_x86_64.whl", hash = "sha256:80746d63bd4b6eaca54d36fe5feaf4d28bb38dc6f9470f81c7cad7c40155f119"},
+    {file = "scikit_learn-1.9.0-cp313-cp313-macosx_12_0_arm64.whl", hash = "sha256:5b934c45c252844a91d69fda3a34cff5e7307e1db10d77cb10a3980312c74713"},
+    {file = "scikit_learn-1.9.0-cp313-cp313-manylinux_2_27_aarch64.manylinux_2_28_aarch64.whl", hash = "sha256:38c3dcb9a1ffb85505ec53d54c7b4aea0cff70050425a7760c2af661ac85df05"},
+    {file = "scikit_learn-1.9.0-cp313-cp313-manylinux_2_27_x86_64.manylinux_2_28_x86_64.whl", hash = "sha256:da76d09304a4706db7cc1e3ebaa3b6b98a67365cc11d2996c4f1e58ba47df714"},
+    {file = "scikit_learn-1.9.0-cp313-cp313-win_amd64.whl", hash = "sha256:5808d98f15c6bf6d9d96d2348c1997392a5888ce7097e664105f930c4bca1277"},
+    {file = "scikit_learn-1.9.0-cp313-cp313-win_arm64.whl", hash = "sha256:d77f54c017633791bc0225a43e2f8d03745fdcfe4880268fcc4df15f505dec2e"},
+    {file = "scikit_learn-1.9.0-cp314-cp314-macosx_10_15_x86_64.whl", hash = "sha256:9656acd4e93f74e0b66c8a36c88830a99252dfa900044d36bc2212ae89a47162"},
+    {file = "scikit_learn-1.9.0-cp314-cp314-macosx_12_0_arm64.whl", hash = "sha256:24360002ae845e7866522b0a5bbf690802e7bc388cac8663502e78aa98598aa2"},
+    {file = "scikit_learn-1.9.0-cp314-cp314-manylinux_2_27_aarch64.manylinux_2_28_aarch64.whl", hash = "sha256:5162ad10a418c8a282dde04c9aa06965de3e9a65f33c1440c0ae69bb1a09d913"},
+    {file = "scikit_learn-1.9.0-cp314-cp314-manylinux_2_27_x86_64.manylinux_2_28_x86_64.whl", hash = "sha256:1fea2cc5677ab49d6f5bade978c866da44957b712d92e9635e8b4f723013c3cb"},
+    {file = "scikit_learn-1.9.0-cp314-cp314-win_amd64.whl", hash = "sha256:64fa347efc1c839c487433e40c5144d38c336e8a2b59c81aa8660373945c2673"},
+    {file = "scikit_learn-1.9.0-cp314-cp314-win_arm64.whl", hash = "sha256:1b944b6db288f6b926e3650026ddafb988929de95d11fc2cc5fa117773c9ba42"},
+    {file = "scikit_learn-1.9.0-cp314-cp314t-macosx_10_15_x86_64.whl", hash = "sha256:4ccacf04ca5f4b492158a5f28afe0ace43f81b2571e4b9a66d34848b46128949"},
+    {file = "scikit_learn-1.9.0-cp314-cp314t-macosx_12_0_arm64.whl", hash = "sha256:ee1a8db2c18c08e34c7412d4b10be1cac214cd4ea7dc9715a6a327eb49a37c96"},
+    {file = "scikit_learn-1.9.0-cp314-cp314t-manylinux_2_27_aarch64.manylinux_2_28_aarch64.whl", hash = "sha256:147e9329ef0e39f75d4cffa02b2aa48d827832684926cd5210d9a2cb5c57246b"},
+    {file = "scikit_learn-1.9.0-cp314-cp314t-manylinux_2_27_x86_64.manylinux_2_28_x86_64.whl", hash = "sha256:5bad8f8b9950321b54c965fdcbac6c6c55e79e16646b49977bcf3668d3870a1a"},
+    {file = "scikit_learn-1.9.0-cp314-cp314t-win_amd64.whl", hash = "sha256:78fc56eafd4edb9575d2d8950d1dd152061abb573341a1cb7e099fc40f6c6666"},
+    {file = "scikit_learn-1.9.0-cp314-cp314t-win_arm64.whl", hash = "sha256:051075bda8b7aab87b1906ab3d4740a1e1224a19d7b3781a576736edc94e76aa"},
+    {file = "scikit_learn-1.9.0.tar.gz", hash = "sha256:8833266989d3a5110178a9fae30783675460724d0e1efb13b14901d2c660c557"},
 ]
 
 [package.dependencies]
-joblib = ">=1.2.0"
-numpy = ">=1.19.5"
-scipy = ">=1.6.0"
-threadpoolctl = ">=3.1.0"
+joblib = ">=1.4.0"
+narwhals = ">=2.0.1"
+numpy = ">=1.24.1"
+scipy = ">=1.10.0"
+threadpoolctl = ">=3.5.0"
 
 [package.extras]
-benchmark = ["matplotlib (>=3.3.4)", "memory_profiler (>=0.57.0)", "pandas (>=1.1.5)"]
-build = ["cython (>=3.0.10)", "meson-python (>=0.16.0)", "numpy (>=1.19.5)", "scipy (>=1.6.0)"]
-docs = ["Pillow (>=7.1.2)", "matplotlib (>=3.3.4)", "memory_profiler (>=0.57.0)", "numpydoc (>=1.2.0)", "pandas (>=1.1.5)", "plotly (>=5.14.0)", "polars (>=0.20.30)", "pooch (>=1.6.0)", "pydata-sphinx-theme (>=0.15.3)", "scikit-image (>=0.17.2)", "seaborn (>=0.9.0)", "sphinx (>=7.3.7)", "sphinx-copybutton (>=0.5.2)", "sphinx-design (>=0.5.0)", "sphinx-design (>=0.6.0)", "sphinx-gallery (>=0.17.1)", "sphinx-prompt (>=1.4.0)", "sphinx-remove-toctrees (>=1.0.0.post1)", "sphinxcontrib-sass (>=0.3.4)", "sphinxext-opengraph (>=0.9.1)", "towncrier (>=24.8.0)"]
-examples = ["matplotlib (>=3.3.4)", "pandas (>=1.1.5)", "plotly (>=5.14.0)", "pooch (>=1.6.0)", "scikit-image (>=0.17.2)", "seaborn (>=0.9.0)"]
-install = ["joblib (>=1.2.0)", "numpy (>=1.19.5)", "scipy (>=1.6.0)", "threadpoolctl (>=3.1.0)"]
-maintenance = ["conda-lock (==2.5.6)"]
-tests = ["black (>=24.3.0)", "matplotlib (>=3.3.4)", "mypy (>=1.9)", "numpydoc (>=1.2.0)", "pandas (>=1.1.5)", "polars (>=0.20.30)", "pooch (>=1.6.0)", "pyamg (>=4.0.0)", "pyarrow (>=12.0.0)", "pytest (>=7.1.2)", "pytest-cov (>=2.9.0)", "ruff (>=0.5.1)", "scikit-image (>=0.17.2)"]
+benchmark = ["matplotlib (>=3.6.1)", "memory_profiler (>=0.57.0)", "pandas (>=1.5.0)"]
+build = ["cython (>=3.1.2)", "meson-python (>=0.17.1)", "numpy (>=1.24.1)", "scipy (>=1.10.0)"]
+docs = ["Pillow (>=12.1.1)", "matplotlib (>=3.6.1)", "memory_profiler (>=0.57.0)", "numpydoc (>=1.2.0)", "pandas (>=1.5.0)", "plotly (>=5.22.0)", "polars (>=0.20.30)", "pooch (>=1.8.0)", "pydata-sphinx-theme (>=0.15.3)", "rich (>=14.1.0)", "scikit-image (>=0.22.0)", "seaborn (>=0.13.0)", "sphinx (>=7.3.7)", "sphinx-copybutton (>=0.5.2)", "sphinx-design (>=0.6.0)", "sphinx-gallery (>=0.17.1)", "sphinx-prompt (>=1.4.0)", "sphinx-remove-toctrees (>=1.0.0.post1)", "sphinxcontrib-sass (>=0.3.4)", "sphinxext-opengraph (>=0.9.1)", "towncrier (>=24.8.0)"]
+examples = ["matplotlib (>=3.6.1)", "pandas (>=1.5.0)", "plotly (>=5.22.0)", "pooch (>=1.8.0)", "rich (>=14.1.0)", "scikit-image (>=0.22.0)", "seaborn (>=0.13.0)"]
+install = ["joblib (>=1.4.0)", "narwhals (>=2.0.1)", "numpy (>=1.24.1)", "scipy (>=1.10.0)", "threadpoolctl (>=3.5.0)"]
+maintenance = ["conda-lock (==3.0.1)"]
+tests = ["matplotlib (>=3.6.1)", "mypy (>=1.15)", "numpydoc (>=1.2.0)", "pandas (>=1.5.0)", "polars (>=0.20.30)", "pooch (>=1.8.0)", "pyamg (>=5.0.0)", "pyarrow (>=13.0.0)", "pytest (>=7.1.2)", "pytest-cov (>=2.9.0)", "rich (>=14.1.0)", "ruff (>=0.12.2)"]
 
 [[package]]
 name = "scipy"
@@ -4768,7 +4659,7 @@ version = "1.14.1"
 description = "Fundamental algorithms for scientific computing in Python"
 optional = false
 python-versions = ">=3.10"
-groups = ["main"]
+groups = ["main", "training"]
 files = [
     {file = "scipy-1.14.1-cp310-cp310-macosx_10_13_x86_64.whl", hash = "sha256:b28d2ca4add7ac16ae8bb6632a3c86e4b9e4d52d3e34267f6e1b0c1f8d87e389"},
     {file = "scipy-1.14.1-cp310-cp310-macosx_12_0_arm64.whl", hash = "sha256:d0d2821003174de06b69e58cef2316a6622b60ee613121199cb2852a873f8cf3"},
@@ -4814,317 +4705,189 @@ doc = ["jupyterlite-pyodide-kernel", "jupyterlite-sphinx (>=0.13.1)", "jupytext"
 test = ["Cython", "array-api-strict (>=2.0)", "asv", "gmpy2", "hypothesis (>=6.30)", "meson", "mpmath", "ninja ; sys_platform != \"emscripten\"", "pooch", "pytest", "pytest-cov", "pytest-timeout", "pytest-xdist", "scikit-umfpack", "threadpoolctl"]
 
 [[package]]
-name = "sentencepiece"
-version = "0.2.0"
-description = "SentencePiece python wrapper"
-optional = false
-python-versions = "*"
-groups = ["main"]
-files = [
-    {file = "sentencepiece-0.2.0-cp310-cp310-macosx_10_9_universal2.whl", hash = "sha256:188779e1298a1c8b8253c7d3ad729cb0a9891e5cef5e5d07ce4592c54869e227"},
-    {file = "sentencepiece-0.2.0-cp310-cp310-macosx_10_9_x86_64.whl", hash = "sha256:bed9cf85b296fa2b76fc2547b9cbb691a523864cebaee86304c43a7b4cb1b452"},
-    {file = "sentencepiece-0.2.0-cp310-cp310-macosx_11_0_arm64.whl", hash = "sha256:d7b67e724bead13f18db6e1d10b6bbdc454af574d70efbb36f27d90387be1ca3"},
-    {file = "sentencepiece-0.2.0-cp310-cp310-manylinux_2_17_aarch64.manylinux2014_aarch64.whl", hash = "sha256:2fde4b08cfe237be4484c6c7c2e2c75fb862cfeab6bd5449ce4caeafd97b767a"},
-    {file = "sentencepiece-0.2.0-cp310-cp310-manylinux_2_17_i686.manylinux2014_i686.whl", hash = "sha256:4c378492056202d1c48a4979650981635fd97875a00eabb1f00c6a236b013b5e"},
-    {file = "sentencepiece-0.2.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl", hash = "sha256:1380ce6540a368de2ef6d7e6ba14ba8f3258df650d39ba7d833b79ee68a52040"},
-    {file = "sentencepiece-0.2.0-cp310-cp310-win32.whl", hash = "sha256:a1151d6a6dd4b43e552394aed0edfe9292820272f0194bd56c7c1660a0c06c3d"},
-    {file = "sentencepiece-0.2.0-cp310-cp310-win_amd64.whl", hash = "sha256:d490142b0521ef22bc1085f061d922a2a6666175bb6b42e588ff95c0db6819b2"},
-    {file = "sentencepiece-0.2.0-cp311-cp311-macosx_10_9_universal2.whl", hash = "sha256:17982700c4f6dbb55fa3594f3d7e5dd1c8659a274af3738e33c987d2a27c9d5c"},
-    {file = "sentencepiece-0.2.0-cp311-cp311-macosx_10_9_x86_64.whl", hash = "sha256:7c867012c0e8bcd5bdad0f791609101cb5c66acb303ab3270218d6debc68a65e"},
-    {file = "sentencepiece-0.2.0-cp311-cp311-macosx_11_0_arm64.whl", hash = "sha256:7fd6071249c74f779c5b27183295b9202f8dedb68034e716784364443879eaa6"},
-    {file = "sentencepiece-0.2.0-cp311-cp311-manylinux_2_17_aarch64.manylinux2014_aarch64.whl", hash = "sha256:27f90c55a65013cbb8f4d7aab0599bf925cde4adc67ae43a0d323677b5a1c6cb"},
-    {file = "sentencepiece-0.2.0-cp311-cp311-manylinux_2_17_i686.manylinux2014_i686.whl", hash = "sha256:b293734059ef656dcd65be62ff771507bea8fed0a711b6733976e1ed3add4553"},
-    {file = "sentencepiece-0.2.0-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl", hash = "sha256:e58b47f933aca74c6a60a79dcb21d5b9e47416256c795c2d58d55cec27f9551d"},
-    {file = "sentencepiece-0.2.0-cp311-cp311-win32.whl", hash = "sha256:c581258cf346b327c62c4f1cebd32691826306f6a41d8c4bec43b010dee08e75"},
-    {file = "sentencepiece-0.2.0-cp311-cp311-win_amd64.whl", hash = "sha256:0993dbc665f4113017892f1b87c3904a44d0640eda510abcacdfb07f74286d36"},
-    {file = "sentencepiece-0.2.0-cp312-cp312-macosx_10_9_universal2.whl", hash = "sha256:ea5f536e32ea8ec96086ee00d7a4a131ce583a1b18d130711707c10e69601cb2"},
-    {file = "sentencepiece-0.2.0-cp312-cp312-macosx_10_9_x86_64.whl", hash = "sha256:d0cb51f53b6aae3c36bafe41e86167c71af8370a039f542c43b0cce5ef24a68c"},
-    {file = "sentencepiece-0.2.0-cp312-cp312-macosx_11_0_arm64.whl", hash = "sha256:3212121805afc58d8b00ab4e7dd1f8f76c203ddb9dc94aa4079618a31cf5da0f"},
-    {file = "sentencepiece-0.2.0-cp312-cp312-manylinux_2_17_aarch64.manylinux2014_aarch64.whl", hash = "sha256:2a3149e3066c2a75e0d68a43eb632d7ae728c7925b517f4c05c40f6f7280ce08"},
-    {file = "sentencepiece-0.2.0-cp312-cp312-manylinux_2_17_i686.manylinux2014_i686.whl", hash = "sha256:632f3594d3e7ac8b367bca204cb3fd05a01d5b21455acd097ea4c0e30e2f63d7"},
-    {file = "sentencepiece-0.2.0-cp312-cp312-manylinux_2_17_x86_64.manylinux2014_x86_64.whl", hash = "sha256:f295105c6bdbb05bd5e1b0cafbd78ff95036f5d3641e7949455a3f4e5e7c3109"},
-    {file = "sentencepiece-0.2.0-cp312-cp312-win32.whl", hash = "sha256:fb89f811e5efd18bab141afc3fea3de141c3f69f3fe9e898f710ae7fe3aab251"},
-    {file = "sentencepiece-0.2.0-cp312-cp312-win_amd64.whl", hash = "sha256:7a673a72aab81fef5ebe755c6e0cc60087d1f3a4700835d40537183c1703a45f"},
-    {file = "sentencepiece-0.2.0-cp36-cp36m-macosx_10_9_x86_64.whl", hash = "sha256:4547683f330289ec4f093027bfeb87f9ef023b2eb6f879fdc4a8187c7e0ffb90"},
-    {file = "sentencepiece-0.2.0-cp36-cp36m-manylinux_2_17_aarch64.manylinux2014_aarch64.whl", hash = "sha256:7cd6175f7eaec7142d2bf6f6597ce7db4c9ac89acf93fcdb17410c3a8b781eeb"},
-    {file = "sentencepiece-0.2.0-cp36-cp36m-manylinux_2_17_i686.manylinux2014_i686.whl", hash = "sha256:859ba1acde782609a0910a26a60e16c191a82bf39b5621107552c0cd79fad00f"},
-    {file = "sentencepiece-0.2.0-cp36-cp36m-manylinux_2_17_x86_64.manylinux2014_x86_64.whl", hash = "sha256:bcbbef6cc277f8f18f36959e305f10b1c620442d75addc79c21d7073ae581b50"},
-    {file = "sentencepiece-0.2.0-cp36-cp36m-win32.whl", hash = "sha256:536b934e244829e3fe6c4f198652cd82da48adb9aa145c9f00889542726dee3d"},
-    {file = "sentencepiece-0.2.0-cp36-cp36m-win_amd64.whl", hash = "sha256:0a91aaa3c769b52440df56fafda683b3aa48e3f2169cf7ee5b8c8454a7f3ae9b"},
-    {file = "sentencepiece-0.2.0-cp37-cp37m-macosx_10_9_x86_64.whl", hash = "sha256:787e480ca4c1d08c9985a7eb1eae4345c107729c99e9b5a9a00f2575fc7d4b4b"},
-    {file = "sentencepiece-0.2.0-cp37-cp37m-manylinux_2_17_aarch64.manylinux2014_aarch64.whl", hash = "sha256:f4d158189eb2ecffea3a51edf6d25e110b3678ec47f1a40f2d541eafbd8f6250"},
-    {file = "sentencepiece-0.2.0-cp37-cp37m-manylinux_2_17_i686.manylinux2014_i686.whl", hash = "sha256:d1e5ca43013e8935f25457a4fca47e315780172c3e821b4b13a890668911c792"},
-    {file = "sentencepiece-0.2.0-cp37-cp37m-manylinux_2_17_x86_64.manylinux2014_x86_64.whl", hash = "sha256:7140d9e5a74a0908493bb4a13f1f16a401297bd755ada4c707e842fbf6f0f5bf"},
-    {file = "sentencepiece-0.2.0-cp37-cp37m-win32.whl", hash = "sha256:6cf333625234f247ab357b0bd9836638405ea9082e1543d5b8408f014979dcbf"},
-    {file = "sentencepiece-0.2.0-cp37-cp37m-win_amd64.whl", hash = "sha256:ff88712338b01031910e8e61e7239aff3ce8869ee31a47df63cb38aadd591bea"},
-    {file = "sentencepiece-0.2.0-cp38-cp38-macosx_10_9_universal2.whl", hash = "sha256:20813a68d4c221b1849c62c30e1281ea81687894d894b8d4a0f4677d9311e0f5"},
-    {file = "sentencepiece-0.2.0-cp38-cp38-macosx_10_9_x86_64.whl", hash = "sha256:926ef920ae2e8182db31d3f5d081ada57804e3e1d3a8c4ef8b117f9d9fb5a945"},
-    {file = "sentencepiece-0.2.0-cp38-cp38-macosx_11_0_arm64.whl", hash = "sha256:89f65f69636b7e9c015b79dff9c9985a9bc7d19ded6f79ef9f1ec920fdd73ecf"},
-    {file = "sentencepiece-0.2.0-cp38-cp38-manylinux_2_17_aarch64.manylinux2014_aarch64.whl", hash = "sha256:0f67eae0dbe6f2d7d6ba50a354623d787c99965f068b81e145d53240198021b0"},
-    {file = "sentencepiece-0.2.0-cp38-cp38-manylinux_2_17_i686.manylinux2014_i686.whl", hash = "sha256:98501e075f35dd1a1d5a20f65be26839fcb1938752ec61539af008a5aa6f510b"},
-    {file = "sentencepiece-0.2.0-cp38-cp38-manylinux_2_17_x86_64.manylinux2014_x86_64.whl", hash = "sha256:e3d1d2cc4882e8d6a1adf9d5927d7716f80617fc693385661caff21888972269"},
-    {file = "sentencepiece-0.2.0-cp38-cp38-win32.whl", hash = "sha256:b99a308a2e5e569031ab164b74e6fab0b6f37dfb493c32f7816225f4d411a6dd"},
-    {file = "sentencepiece-0.2.0-cp38-cp38-win_amd64.whl", hash = "sha256:cdb701eec783d3ec86b7cd4c763adad8eaf6b46db37ee1c36e5e6c44b3fe1b5f"},
-    {file = "sentencepiece-0.2.0-cp39-cp39-macosx_10_9_universal2.whl", hash = "sha256:1e0f9c4d0a6b0af59b613175f019916e28ade076e21242fd5be24340d8a2f64a"},
-    {file = "sentencepiece-0.2.0-cp39-cp39-macosx_10_9_x86_64.whl", hash = "sha256:298f21cc1366eb60311aedba3169d30f885c363ddbf44214b0a587d2908141ad"},
-    {file = "sentencepiece-0.2.0-cp39-cp39-macosx_11_0_arm64.whl", hash = "sha256:3f1ec95aa1e5dab11f37ac7eff190493fd87770f7a8b81ebc9dd768d1a3c8704"},
-    {file = "sentencepiece-0.2.0-cp39-cp39-manylinux_2_17_aarch64.manylinux2014_aarch64.whl", hash = "sha256:7b06b70af54daa4b4904cbb90b4eb6d35c9f3252fdc86c9c32d5afd4d30118d8"},
-    {file = "sentencepiece-0.2.0-cp39-cp39-manylinux_2_17_i686.manylinux2014_i686.whl", hash = "sha256:22e37bac44dd6603388cb598c64ff7a76e41ca774646f21c23aadfbf5a2228ab"},
-    {file = "sentencepiece-0.2.0-cp39-cp39-manylinux_2_17_x86_64.manylinux2014_x86_64.whl", hash = "sha256:0461324897735512a32d222e3d886e24ad6a499761952b6bda2a9ee6e4313ea5"},
-    {file = "sentencepiece-0.2.0-cp39-cp39-win32.whl", hash = "sha256:38aed822fb76435fa1f12185f10465a94ab9e51d5e8a9159e9a540ce926f0ffd"},
-    {file = "sentencepiece-0.2.0-cp39-cp39-win_amd64.whl", hash = "sha256:d8cf876516548b5a1d6ac4745d8b554f5c07891d55da557925e5c13ff0b4e6ad"},
-    {file = "sentencepiece-0.2.0.tar.gz", hash = "sha256:a52c19171daaf2e697dc6cbe67684e0fa341b1248966f6aebb541de654d15843"},
-]
-
-[[package]]
-name = "sentry-sdk"
-version = "2.20.0"
-description = "Python client for Sentry (https://sentry.io)"
-optional = false
-python-versions = ">=3.6"
+name = "semantic-version"
+version = "2.10.0"
+description = "A library implementing the 'SemVer' scheme."
+optional = true
+python-versions = ">=2.7"
 groups = ["main"]
+markers = "extra == \"trackio\""
 files = [
-    {file = "sentry_sdk-2.20.0-py2.py3-none-any.whl", hash = "sha256:c359a1edf950eb5e80cffd7d9111f3dbeef57994cb4415df37d39fda2cf22364"},
-    {file = "sentry_sdk-2.20.0.tar.gz", hash = "sha256:afa82713a92facf847df3c6f63cec71eb488d826a50965def3d7722aa6f0fdab"},
+    {file = "semantic_version-2.10.0-py2.py3-none-any.whl", hash = "sha256:de78a3b8e0feda74cabc54aab2da702113e33ac9d9eb9d2389bcf1f58b7d9177"},
+    {file = "semantic_version-2.10.0.tar.gz", hash = "sha256:bdabb6d336998cbb378d4b9db3a4b56a1e3235701dc05ea2690d9a997ed5041c"},
 ]
 
-[package.dependencies]
-certifi = "*"
-urllib3 = ">=1.26.11"
-
 [package.extras]
-aiohttp = ["aiohttp (>=3.5)"]
-anthropic = ["anthropic (>=0.16)"]
-arq = ["arq (>=0.23)"]
-asyncpg = ["asyncpg (>=0.23)"]
-beam = ["apache-beam (>=2.12)"]
-bottle = ["bottle (>=0.12.13)"]
-celery = ["celery (>=3)"]
-celery-redbeat = ["celery-redbeat (>=2)"]
-chalice = ["chalice (>=1.16.0)"]
-clickhouse-driver = ["clickhouse-driver (>=0.2.0)"]
-django = ["django (>=1.8)"]
-falcon = ["falcon (>=1.4)"]
-fastapi = ["fastapi (>=0.79.0)"]
-flask = ["blinker (>=1.1)", "flask (>=0.11)", "markupsafe"]
-grpcio = ["grpcio (>=1.21.1)", "protobuf (>=3.8.0)"]
-http2 = ["httpcore[http2] (==1.*)"]
-httpx = ["httpx (>=0.16.0)"]
-huey = ["huey (>=2)"]
-huggingface-hub = ["huggingface_hub (>=0.22)"]
-langchain = ["langchain (>=0.0.210)"]
-launchdarkly = ["launchdarkly-server-sdk (>=9.8.0)"]
-litestar = ["litestar (>=2.0.0)"]
-loguru = ["loguru (>=0.5)"]
-openai = ["openai (>=1.0.0)", "tiktoken (>=0.3.0)"]
-openfeature = ["openfeature-sdk (>=0.7.1)"]
-opentelemetry = ["opentelemetry-distro (>=0.35b0)"]
-opentelemetry-experimental = ["opentelemetry-distro"]
-pure-eval = ["asttokens", "executing", "pure_eval"]
-pymongo = ["pymongo (>=3.1)"]
-pyspark = ["pyspark (>=2.4.4)"]
-quart = ["blinker (>=1.1)", "quart (>=0.16.1)"]
-rq = ["rq (>=0.6)"]
-sanic = ["sanic (>=0.8)"]
-sqlalchemy = ["sqlalchemy (>=1.2)"]
-starlette = ["starlette (>=0.19.1)"]
-starlite = ["starlite (>=1.48)"]
-tornado = ["tornado (>=6)"]
-unleash = ["UnleashClient (>=6.0.1)"]
-
-[[package]]
-name = "setproctitle"
-version = "1.3.4"
-description = "A Python module to customize the process title"
+dev = ["Django (>=1.11)", "check-manifest", "colorama (<=0.4.1) ; python_version == \"3.4\"", "coverage", "flake8", "nose2", "readme-renderer (<25.0) ; python_version == \"3.4\"", "tox", "wheel", "zest.releaser[recommended]"]
+doc = ["Sphinx", "sphinx-rtd-theme"]
+
+[[package]]
+name = "sentencepiece"
+version = "0.2.1"
+description = "Unsupervised text tokenizer and detokenizer."
 optional = false
-python-versions = ">=3.8"
+python-versions = ">=3.9"
 groups = ["main"]
 files = [
-    {file = "setproctitle-1.3.4-cp310-cp310-macosx_10_9_universal2.whl", hash = "sha256:0f6661a69c68349172ba7b4d5dd65fec2b0917abc99002425ad78c3e58cf7595"},
-    {file = "setproctitle-1.3.4-cp310-cp310-macosx_11_0_arm64.whl", hash = "sha256:754bac5e470adac7f7ec2239c485cd0b75f8197ca8a5b86ffb20eb3a3676cc42"},
-    {file = "setproctitle-1.3.4-cp310-cp310-manylinux_2_17_aarch64.manylinux2014_aarch64.whl", hash = "sha256:f7bc7088c15150745baf66db62a4ced4507d44419eb66207b609f91b64a682af"},
-    {file = "setproctitle-1.3.4-cp310-cp310-manylinux_2_17_ppc64le.manylinux2014_ppc64le.whl", hash = "sha256:a46ef3ecf61e4840fbc1145fdd38acf158d0da7543eda7b773ed2b30f75c2830"},
-    {file = "setproctitle-1.3.4-cp310-cp310-manylinux_2_5_i686.manylinux1_i686.manylinux_2_17_i686.manylinux2014_i686.whl", hash = "sha256:ffcb09d5c0ffa043254ec9a734a73f3791fec8bf6333592f906bb2e91ed2af1a"},
-    {file = "setproctitle-1.3.4-cp310-cp310-manylinux_2_5_x86_64.manylinux1_x86_64.manylinux_2_17_x86_64.manylinux2014_x86_64.whl", hash = "sha256:06c16b7a91cdc5d700271899e4383384a61aae83a3d53d0e2e5a266376083342"},
-    {file = "setproctitle-1.3.4-cp310-cp310-musllinux_1_2_aarch64.whl", hash = "sha256:9f9732e59863eaeedd3feef94b2b216cb86d40dda4fad2d0f0aaec3b31592716"},
-    {file = "setproctitle-1.3.4-cp310-cp310-musllinux_1_2_i686.whl", hash = "sha256:e152f4ab9ea1632b5fecdd87cee354f2b2eb6e2dfc3aceb0eb36a01c1e12f94c"},
-    {file = "setproctitle-1.3.4-cp310-cp310-musllinux_1_2_ppc64le.whl", hash = "sha256:020ea47a79b2bbd7bd7b94b85ca956ba7cb026e82f41b20d2e1dac4008cead25"},
-    {file = "setproctitle-1.3.4-cp310-cp310-musllinux_1_2_x86_64.whl", hash = "sha256:8c52b12b10e4057fc302bd09cb3e3f28bb382c30c044eb3396e805179a8260e4"},
-    {file = "setproctitle-1.3.4-cp310-cp310-win32.whl", hash = "sha256:a65a147f545f3fac86f11acb2d0b316d3e78139a9372317b7eb50561b2817ba0"},
-    {file = "setproctitle-1.3.4-cp310-cp310-win_amd64.whl", hash = "sha256:66821fada6426998762a3650a37fba77e814a249a95b1183011070744aff47f6"},
-    {file = "setproctitle-1.3.4-cp311-cp311-macosx_10_9_universal2.whl", hash = "sha256:f0f749f07002c2d6fecf37cedc43207a88e6c651926a470a5f229070cf791879"},
-    {file = "setproctitle-1.3.4-cp311-cp311-macosx_11_0_arm64.whl", hash = "sha256:90ea8d302a5d30b948451d146e94674a3c5b020cc0ced9a1c28f8ddb0f203a5d"},
-    {file = "setproctitle-1.3.4-cp311-cp311-manylinux_2_17_aarch64.manylinux2014_aarch64.whl", hash = "sha256:f859c88193ed466bee4eb9d45fbc29d2253e6aa3ccd9119c9a1d8d95f409a60d"},
-    {file = "setproctitle-1.3.4-cp311-cp311-manylinux_2_17_ppc64le.manylinux2014_ppc64le.whl", hash = "sha256:b3afa5a0ed08a477ded239c05db14c19af585975194a00adf594d48533b23701"},
-    {file = "setproctitle-1.3.4-cp311-cp311-manylinux_2_5_i686.manylinux1_i686.manylinux_2_17_i686.manylinux2014_i686.whl", hash = "sha256:10a78fce9018cc3e9a772b6537bbe3fe92380acf656c9f86db2f45e685af376e"},
-    {file = "setproctitle-1.3.4-cp311-cp311-manylinux_2_5_x86_64.manylinux1_x86_64.manylinux_2_17_x86_64.manylinux2014_x86_64.whl", hash = "sha256:5d758e2eed2643afac5f2881542fbb5aa97640b54be20d0a5ed0691d02f0867d"},
-    {file = "setproctitle-1.3.4-cp311-cp311-musllinux_1_2_aarch64.whl", hash = "sha256:ef133a1a2ee378d549048a12d56f4ef0e2b9113b0b25b6b77821e9af94d50634"},
-    {file = "setproctitle-1.3.4-cp311-cp311-musllinux_1_2_i686.whl", hash = "sha256:1d2a154b79d5fb42d1eff06e05e22f0e8091261d877dd47b37d31352b74ecc37"},
-    {file = "setproctitle-1.3.4-cp311-cp311-musllinux_1_2_ppc64le.whl", hash = "sha256:202eae632815571297833876a0f407d0d9c7ad9d843b38adbe687fe68c5192ee"},
-    {file = "setproctitle-1.3.4-cp311-cp311-musllinux_1_2_x86_64.whl", hash = "sha256:2b0080819859e80a7776ac47cf6accb4b7ad313baf55fabac89c000480dcd103"},
-    {file = "setproctitle-1.3.4-cp311-cp311-win32.whl", hash = "sha256:9c9d7d1267dee8c6627963d9376efa068858cfc8f573c083b1b6a2d297a8710f"},
-    {file = "setproctitle-1.3.4-cp311-cp311-win_amd64.whl", hash = "sha256:475986ddf6df65d619acd52188336a20f616589403f5a5ceb3fc70cdc137037a"},
-    {file = "setproctitle-1.3.4-cp312-cp312-macosx_10_13_universal2.whl", hash = "sha256:d06990dcfcd41bb3543c18dd25c8476fbfe1f236757f42fef560f6aa03ac8dfc"},
-    {file = "setproctitle-1.3.4-cp312-cp312-macosx_11_0_arm64.whl", hash = "sha256:317218c9d8b17a010ab2d2f0851e8ef584077a38b1ba2b7c55c9e44e79a61e73"},
-    {file = "setproctitle-1.3.4-cp312-cp312-manylinux_2_17_aarch64.manylinux2014_aarch64.whl", hash = "sha256:cb5fefb53b9d9f334a5d9ec518a36b92a10b936011ac8a6b6dffd60135f16459"},
-    {file = "setproctitle-1.3.4-cp312-cp312-manylinux_2_17_ppc64le.manylinux2014_ppc64le.whl", hash = "sha256:0855006261635e8669646c7c304b494b6df0a194d2626683520103153ad63cc9"},
-    {file = "setproctitle-1.3.4-cp312-cp312-manylinux_2_5_i686.manylinux1_i686.manylinux_2_17_i686.manylinux2014_i686.whl", hash = "sha256:1a88e466fcaee659679c1d64dcb2eddbcb4bfadffeb68ba834d9c173a25b6184"},
-    {file = "setproctitle-1.3.4-cp312-cp312-manylinux_2_5_x86_64.manylinux1_x86_64.manylinux_2_17_x86_64.manylinux2014_x86_64.whl", hash = "sha256:f963b6ed8ba33eda374a98d979e8a0eaf21f891b6e334701693a2c9510613c4c"},
-    {file = "setproctitle-1.3.4-cp312-cp312-musllinux_1_2_aarch64.whl", hash = "sha256:122c2e05697fa91f5d23f00bbe98a9da1bd457b32529192e934095fadb0853f1"},
-    {file = "setproctitle-1.3.4-cp312-cp312-musllinux_1_2_i686.whl", hash = "sha256:1bba0a866f5895d5b769d8c36b161271c7fd407e5065862ab80ff91c29fbe554"},
-    {file = "setproctitle-1.3.4-cp312-cp312-musllinux_1_2_ppc64le.whl", hash = "sha256:97f1f861998e326e640708488c442519ad69046374b2c3fe9bcc9869b387f23c"},
-    {file = "setproctitle-1.3.4-cp312-cp312-musllinux_1_2_x86_64.whl", hash = "sha256:726aee40357d4bdb70115442cb85ccc8e8bc554fc0bbbaa3a57cbe81df42287d"},
-    {file = "setproctitle-1.3.4-cp312-cp312-win32.whl", hash = "sha256:04d6ba8b816dbb0bfd62000b0c3e583160893e6e8c4233e1dca1a9ae4d95d924"},
-    {file = "setproctitle-1.3.4-cp312-cp312-win_amd64.whl", hash = "sha256:9c76e43cb351ba8887371240b599925cdf3ecececc5dfb7125c71678e7722c55"},
-    {file = "setproctitle-1.3.4-cp313-cp313-macosx_10_13_universal2.whl", hash = "sha256:d6e3b177e634aa6bbbfbf66d097b6d1cdb80fc60e912c7d8bace2e45699c07dd"},
-    {file = "setproctitle-1.3.4-cp313-cp313-macosx_11_0_arm64.whl", hash = "sha256:6b17655a5f245b416e127e02087ea6347a48821cc4626bc0fd57101bfcd88afc"},
-    {file = "setproctitle-1.3.4-cp313-cp313-manylinux_2_17_aarch64.manylinux2014_aarch64.whl", hash = "sha256:fa5057a86df920faab8ee83960b724bace01a3231eb8e3f2c93d78283504d598"},
-    {file = "setproctitle-1.3.4-cp313-cp313-manylinux_2_17_ppc64le.manylinux2014_ppc64le.whl", hash = "sha256:149fdfb8a26a555780c4ce53c92e6d3c990ef7b30f90a675eca02e83c6d5f76d"},
-    {file = "setproctitle-1.3.4-cp313-cp313-manylinux_2_5_i686.manylinux1_i686.manylinux_2_17_i686.manylinux2014_i686.whl", hash = "sha256:ded03546938a987f463c68ab98d683af87a83db7ac8093bbc179e77680be5ba2"},
-    {file = "setproctitle-1.3.4-cp313-cp313-manylinux_2_5_x86_64.manylinux1_x86_64.manylinux_2_17_x86_64.manylinux2014_x86_64.whl", hash = "sha256:8ab9f5b7f2bbc1754bc6292d9a7312071058e5a891b0391e6d13b226133f36aa"},
-    {file = "setproctitle-1.3.4-cp313-cp313-musllinux_1_2_aarch64.whl", hash = "sha256:0b19813c852566fa031902124336fa1f080c51e262fc90266a8c3d65ca47b74c"},
-    {file = "setproctitle-1.3.4-cp313-cp313-musllinux_1_2_i686.whl", hash = "sha256:db78b645dc63c0ccffca367a498f3b13492fb106a2243a1e998303ba79c996e2"},
-    {file = "setproctitle-1.3.4-cp313-cp313-musllinux_1_2_ppc64le.whl", hash = "sha256:b669aaac70bd9f03c070270b953f78d9ee56c4af6f0ff9f9cd3e6d1878c10b40"},
-    {file = "setproctitle-1.3.4-cp313-cp313-musllinux_1_2_x86_64.whl", hash = "sha256:6dc3d656702791565994e64035a208be56b065675a5bc87b644c657d6d9e2232"},
-    {file = "setproctitle-1.3.4-cp313-cp313-win32.whl", hash = "sha256:091f682809a4d12291cf0205517619d2e7014986b7b00ebecfde3d76f8ae5a8f"},
-    {file = "setproctitle-1.3.4-cp313-cp313-win_amd64.whl", hash = "sha256:adcd6ba863a315702184d92d3d3bbff290514f24a14695d310f02ae5e28bd1f7"},
-    {file = "setproctitle-1.3.4-cp38-cp38-macosx_10_9_universal2.whl", hash = "sha256:acf41cf91bbc5a36d1fa4455a818bb02bf2a4ccfed2f892ba166ba2fcbb0ec8a"},
-    {file = "setproctitle-1.3.4-cp38-cp38-macosx_11_0_arm64.whl", hash = "sha256:ceb3ce3262b0e8e088e4117175591b7a82b3bdc5e52e33b1e74778b5fb53fd38"},
-    {file = "setproctitle-1.3.4-cp38-cp38-manylinux_2_17_aarch64.manylinux2014_aarch64.whl", hash = "sha256:2b2ef636a6a25fe7f3d5a064bea0116b74a4c8c7df9646b17dc7386c439a26cf"},
-    {file = "setproctitle-1.3.4-cp38-cp38-manylinux_2_17_ppc64le.manylinux2014_ppc64le.whl", hash = "sha256:28b8614de08679ae95bc4e8d6daaef6b61afdf027fa0d23bf13d619000286b3c"},
-    {file = "setproctitle-1.3.4-cp38-cp38-manylinux_2_5_i686.manylinux1_i686.manylinux_2_17_i686.manylinux2014_i686.whl", hash = "sha256:24f3c8be826a7d44181eac2269b15b748b76d98cd9a539d4c69f09321dcb5c12"},
-    {file = "setproctitle-1.3.4-cp38-cp38-manylinux_2_5_x86_64.manylinux1_x86_64.manylinux_2_17_x86_64.manylinux2014_x86_64.whl", hash = "sha256:fc9d79b1bf833af63b7c720a6604eb16453ac1ad4e718eb8b59d1f97d986b98c"},
-    {file = "setproctitle-1.3.4-cp38-cp38-musllinux_1_2_aarch64.whl", hash = "sha256:fb693000b65842c85356b667d057ae0d0bac6519feca7e1c437cc2cfeb0afc59"},
-    {file = "setproctitle-1.3.4-cp38-cp38-musllinux_1_2_i686.whl", hash = "sha256:a166251b8fbc6f2755e2ce9d3c11e9edb0c0c7d2ed723658ff0161fbce26ac1c"},
-    {file = "setproctitle-1.3.4-cp38-cp38-musllinux_1_2_ppc64le.whl", hash = "sha256:0361428e6378911a378841509c56ba472d991cbed1a7e3078ec0cacc103da44a"},
-    {file = "setproctitle-1.3.4-cp38-cp38-musllinux_1_2_x86_64.whl", hash = "sha256:62d66e0423e3bd520b4c897063506b309843a8d07343fbfad04197e91a4edd28"},
-    {file = "setproctitle-1.3.4-cp38-cp38-win32.whl", hash = "sha256:5edd01909348f3b0b2da329836d6b5419cd4869fec2e118e8ff3275b38af6267"},
-    {file = "setproctitle-1.3.4-cp38-cp38-win_amd64.whl", hash = "sha256:59e0dda9ad245921af0328035a961767026e1fa94bb65957ab0db0a0491325d6"},
-    {file = "setproctitle-1.3.4-cp39-cp39-macosx_10_9_universal2.whl", hash = "sha256:bdaaa81a6e95a0a19fba0285f10577377f3503ae4e9988b403feba79da3e2f80"},
-    {file = "setproctitle-1.3.4-cp39-cp39-macosx_11_0_arm64.whl", hash = "sha256:4ee5b19a2d794463bcc19153dfceede7beec784b4cf7967dec0bc0fc212ab3a3"},
-    {file = "setproctitle-1.3.4-cp39-cp39-manylinux_2_17_aarch64.manylinux2014_aarch64.whl", hash = "sha256:3058a1bb0c767b3a6ccbb38b27ef870af819923eb732e21e44a3f300370fe159"},
-    {file = "setproctitle-1.3.4-cp39-cp39-manylinux_2_17_ppc64le.manylinux2014_ppc64le.whl", hash = "sha256:5a97d37ee4fe0d1c6e87d2a97229c27a88787a8f4ebfbdeee95f91b818e52efe"},
-    {file = "setproctitle-1.3.4-cp39-cp39-manylinux_2_5_i686.manylinux1_i686.manylinux_2_17_i686.manylinux2014_i686.whl", hash = "sha256:6e61dd7d05da11fc69bb86d51f1e0ee08f74dccf3ecf884c94de41135ffdc75d"},
-    {file = "setproctitle-1.3.4-cp39-cp39-manylinux_2_5_x86_64.manylinux1_x86_64.manylinux_2_17_x86_64.manylinux2014_x86_64.whl", hash = "sha256:1eb115d53dc2a1299ae72f1119c96a556db36073bacb6da40c47ece5db0d9587"},
-    {file = "setproctitle-1.3.4-cp39-cp39-musllinux_1_2_aarch64.whl", hash = "sha256:342570716e2647a51ea859b8a9126da9dc1a96a0153c9c0a3514effd60ab57ad"},
-    {file = "setproctitle-1.3.4-cp39-cp39-musllinux_1_2_i686.whl", hash = "sha256:0ad212ae2b03951367a69584af034579b34e1e4199a75d377ef9f8e08ee299b1"},
-    {file = "setproctitle-1.3.4-cp39-cp39-musllinux_1_2_ppc64le.whl", hash = "sha256:4afcb38e22122465013f4621b7e9ff8d42a7a48ae0ffeb94133a806cb91b4aad"},
-    {file = "setproctitle-1.3.4-cp39-cp39-musllinux_1_2_x86_64.whl", hash = "sha256:30bb223e6c3f95ad9e9bb2a113292759e947d1cfd60dbd4adb55851c370006b2"},
-    {file = "setproctitle-1.3.4-cp39-cp39-win32.whl", hash = "sha256:5f0521ed3bb9f02e9486573ea95e2062cd6bf036fa44e640bd54a06f22d85f35"},
-    {file = "setproctitle-1.3.4-cp39-cp39-win_amd64.whl", hash = "sha256:0baadeb27f9e97e65922b4151f818b19c311d30b9efdb62af0e53b3db4006ce2"},
-    {file = "setproctitle-1.3.4-pp310-pypy310_pp73-macosx_11_0_arm64.whl", hash = "sha256:939d364a187b2adfbf6ae488664277e717d56c7951a4ddeb4f23b281bc50bfe5"},
-    {file = "setproctitle-1.3.4-pp310-pypy310_pp73-manylinux_2_5_i686.manylinux1_i686.manylinux_2_17_i686.manylinux2014_i686.whl", hash = "sha256:cb8a6a19be0cbf6da6fcbf3698b76c8af03fe83e4bd77c96c3922be3b88bf7da"},
-    {file = "setproctitle-1.3.4-pp310-pypy310_pp73-manylinux_2_5_x86_64.manylinux1_x86_64.manylinux_2_17_x86_64.manylinux2014_x86_64.whl", hash = "sha256:779006f9e1aade9522a40e8d9635115ab15dd82b7af8e655967162e9c01e2573"},
-    {file = "setproctitle-1.3.4-pp310-pypy310_pp73-win_amd64.whl", hash = "sha256:5519f2a7b8c535b0f1f77b30441476571373add72008230c81211ee17b423b57"},
-    {file = "setproctitle-1.3.4-pp38-pypy38_pp73-macosx_11_0_arm64.whl", hash = "sha256:743836d484151334ebba1490d6907ca9e718fe815dcd5756f2a01bc3067d099c"},
-    {file = "setproctitle-1.3.4-pp38-pypy38_pp73-manylinux_2_5_i686.manylinux1_i686.manylinux_2_17_i686.manylinux2014_i686.whl", hash = "sha256:abda20aff8d1751e48d7967fa8945fef38536b82366c49be39b83678d4be3893"},
-    {file = "setproctitle-1.3.4-pp38-pypy38_pp73-manylinux_2_5_x86_64.manylinux1_x86_64.manylinux_2_17_x86_64.manylinux2014_x86_64.whl", hash = "sha256:1a2041b5788ce52f218b5be94af458e04470f997ab46fdebd57cf0b8374cc20e"},
-    {file = "setproctitle-1.3.4-pp38-pypy38_pp73-win_amd64.whl", hash = "sha256:2c3b1ce68746557aa6e6f4547e76883925cdc7f8d7c7a9f518acd203f1265ca5"},
-    {file = "setproctitle-1.3.4-pp39-pypy39_pp73-macosx_11_0_arm64.whl", hash = "sha256:0b6a4cbabf024cb263a45bdef425760f14470247ff223f0ec51699ca9046c0fe"},
-    {file = "setproctitle-1.3.4-pp39-pypy39_pp73-manylinux_2_5_i686.manylinux1_i686.manylinux_2_17_i686.manylinux2014_i686.whl", hash = "sha256:3e55d7ecc68bdc80de5a553691a3ed260395d5362c19a266cf83cbb4e046551f"},
-    {file = "setproctitle-1.3.4-pp39-pypy39_pp73-manylinux_2_5_x86_64.manylinux1_x86_64.manylinux_2_17_x86_64.manylinux2014_x86_64.whl", hash = "sha256:02ca3802902d91a89957f79da3ec44b25b5804c88026362cb85eea7c1fbdefd1"},
-    {file = "setproctitle-1.3.4-pp39-pypy39_pp73-win_amd64.whl", hash = "sha256:47669fc8ed8b27baa2d698104732234b5389f6a59c37c046f6bcbf9150f7a94e"},
-    {file = "setproctitle-1.3.4.tar.gz", hash = "sha256:3b40d32a3e1f04e94231ed6dfee0da9e43b4f9c6b5450d53e6dd7754c34e0c50"},
+    {file = "sentencepiece-0.2.1-cp310-cp310-macosx_10_9_universal2.whl", hash = "sha256:e10fa50bdbaa5e2445dbd387979980d391760faf0ec99a09bd7780ff37eaec44"},
+    {file = "sentencepiece-0.2.1-cp310-cp310-macosx_10_9_x86_64.whl", hash = "sha256:2f27ae6deea72efdb6f361750c92f6c21fd0ad087445082770cc34015213c526"},
+    {file = "sentencepiece-0.2.1-cp310-cp310-macosx_11_0_arm64.whl", hash = "sha256:60937c959e6f44159fdd9f56fbdd302501f96114a5ba436829496d5f32d8de3f"},
+    {file = "sentencepiece-0.2.1-cp310-cp310-manylinux_2_27_aarch64.manylinux_2_28_aarch64.whl", hash = "sha256:d8b1d91545578852f128650b8cce4ec20f93d39b378ff554ebe66290f2dabb92"},
+    {file = "sentencepiece-0.2.1-cp310-cp310-manylinux_2_27_x86_64.manylinux_2_28_x86_64.whl", hash = "sha256:27e38eee653abc3d387862e67bc5c8b6f428cd604e688b85d29170b7e725c26c"},
+    {file = "sentencepiece-0.2.1-cp310-cp310-win32.whl", hash = "sha256:251874d720ac7f28024a168501f3c7bb15d1802245f6e66de565f18bbb9b5eaa"},
+    {file = "sentencepiece-0.2.1-cp310-cp310-win_amd64.whl", hash = "sha256:e52144670738b4b477fade6c2a9b6af71a8d0094514c9853ac9f6fc1fcfabae7"},
+    {file = "sentencepiece-0.2.1-cp310-cp310-win_arm64.whl", hash = "sha256:9076430ac25dfa7147d9d05751dbc66a04bc1aaac371c07f84952979ea59f0d0"},
+    {file = "sentencepiece-0.2.1-cp311-cp311-macosx_10_9_universal2.whl", hash = "sha256:6356d0986b8b8dc351b943150fcd81a1c6e6e4d439772e8584c64230e58ca987"},
+    {file = "sentencepiece-0.2.1-cp311-cp311-macosx_10_9_x86_64.whl", hash = "sha256:8f8ba89a3acb3dc1ae90f65ec1894b0b9596fdb98ab003ff38e058f898b39bc7"},
+    {file = "sentencepiece-0.2.1-cp311-cp311-macosx_11_0_arm64.whl", hash = "sha256:02593eca45440ef39247cee8c47322a34bdcc1d8ae83ad28ba5a899a2cf8d79a"},
+    {file = "sentencepiece-0.2.1-cp311-cp311-manylinux_2_27_aarch64.manylinux_2_28_aarch64.whl", hash = "sha256:0a0d15781a171d188b661ae4bde1d998c303f6bd8621498c50c671bd45a4798e"},
+    {file = "sentencepiece-0.2.1-cp311-cp311-manylinux_2_27_x86_64.manylinux_2_28_x86_64.whl", hash = "sha256:4f5a3e0d9f445ed9d66c0fec47d4b23d12cfc858b407a03c194c1b26c2ac2a63"},
+    {file = "sentencepiece-0.2.1-cp311-cp311-win32.whl", hash = "sha256:6d297a1748d429ba8534eebe5535448d78b8acc32d00a29b49acf28102eeb094"},
+    {file = "sentencepiece-0.2.1-cp311-cp311-win_amd64.whl", hash = "sha256:82d9ead6591015f009cb1be1cb1c015d5e6f04046dbb8c9588b931e869a29728"},
+    {file = "sentencepiece-0.2.1-cp311-cp311-win_arm64.whl", hash = "sha256:39f8651bd10974eafb9834ce30d9bcf5b73e1fc798a7f7d2528f9820ca86e119"},
+    {file = "sentencepiece-0.2.1-cp312-cp312-macosx_10_13_universal2.whl", hash = "sha256:57cae326c8727de58c85977b175af132a7138d84c764635d7e71bbee7e774133"},
+    {file = "sentencepiece-0.2.1-cp312-cp312-macosx_10_13_x86_64.whl", hash = "sha256:56dd39a3c4d6493db3cdca7e8cc68c6b633f0d4195495cbadfcf5af8a22d05a6"},
+    {file = "sentencepiece-0.2.1-cp312-cp312-macosx_11_0_arm64.whl", hash = "sha256:d9381351182ff9888cc80e41c632e7e274b106f450de33d67a9e8f6043da6f76"},
+    {file = "sentencepiece-0.2.1-cp312-cp312-manylinux_2_27_aarch64.manylinux_2_28_aarch64.whl", hash = "sha256:99f955df238021bf11f0fc37cdb54fd5e5b5f7fd30ecc3d93fb48b6815437167"},
+    {file = "sentencepiece-0.2.1-cp312-cp312-manylinux_2_27_x86_64.manylinux_2_28_x86_64.whl", hash = "sha256:0cdfecef430d985f1c2bcbfff3defd1d95dae876fbd0173376012d2d7d24044b"},
+    {file = "sentencepiece-0.2.1-cp312-cp312-win32.whl", hash = "sha256:a483fd29a34c3e34c39ac5556b0a90942bec253d260235729e50976f5dba1068"},
+    {file = "sentencepiece-0.2.1-cp312-cp312-win_amd64.whl", hash = "sha256:4cdc7c36234fda305e85c32949c5211faaf8dd886096c7cea289ddc12a2d02de"},
+    {file = "sentencepiece-0.2.1-cp312-cp312-win_arm64.whl", hash = "sha256:daeb5e9e9fcad012324807856113708614d534f596d5008638eb9b40112cd9e4"},
+    {file = "sentencepiece-0.2.1-cp313-cp313-macosx_10_13_universal2.whl", hash = "sha256:dcd8161eee7b41aae57ded06272905dbd680a0a04b91edd0f64790c796b2f706"},
+    {file = "sentencepiece-0.2.1-cp313-cp313-macosx_10_13_x86_64.whl", hash = "sha256:c6c8f42949f419ff8c7e9960dbadcfbc982d7b5efc2f6748210d3dd53a7de062"},
+    {file = "sentencepiece-0.2.1-cp313-cp313-macosx_11_0_arm64.whl", hash = "sha256:097f3394e99456e9e4efba1737c3749d7e23563dd1588ce71a3d007f25475fff"},
+    {file = "sentencepiece-0.2.1-cp313-cp313-manylinux_2_27_aarch64.manylinux_2_28_aarch64.whl", hash = "sha256:d7b670879c370d350557edabadbad1f6561a9e6968126e6debca4029e5547820"},
+    {file = "sentencepiece-0.2.1-cp313-cp313-manylinux_2_27_x86_64.manylinux_2_28_x86_64.whl", hash = "sha256:c7f0fd2f2693309e6628aeeb2e2faf6edd221134dfccac3308ca0de01f8dab47"},
+    {file = "sentencepiece-0.2.1-cp313-cp313-win32.whl", hash = "sha256:92b3816aa2339355fda2c8c4e021a5de92180b00aaccaf5e2808972e77a4b22f"},
+    {file = "sentencepiece-0.2.1-cp313-cp313-win_amd64.whl", hash = "sha256:10ed3dab2044c47f7a2e7b4969b0c430420cdd45735d78c8f853191fa0e3148b"},
+    {file = "sentencepiece-0.2.1-cp313-cp313-win_arm64.whl", hash = "sha256:ac650534e2251083c5f75dde4ff28896ce7c8904133dc8fef42780f4d5588fcd"},
+    {file = "sentencepiece-0.2.1-cp313-cp313t-macosx_10_13_universal2.whl", hash = "sha256:8dd4b477a7b069648d19363aad0cab9bad2f4e83b2d179be668efa672500dc94"},
+    {file = "sentencepiece-0.2.1-cp313-cp313t-macosx_10_13_x86_64.whl", hash = "sha256:0c0f672da370cc490e4c59d89e12289778310a0e71d176c541e4834759e1ae07"},
+    {file = "sentencepiece-0.2.1-cp313-cp313t-macosx_11_0_arm64.whl", hash = "sha256:ad8493bea8432dae8d6830365352350f3b4144415a1d09c4c8cb8d30cf3b6c3c"},
+    {file = "sentencepiece-0.2.1-cp313-cp313t-manylinux_2_27_aarch64.manylinux_2_28_aarch64.whl", hash = "sha256:b81a24733726e3678d2db63619acc5a8dccd074f7aa7a54ecd5ca33ca6d2d596"},
+    {file = "sentencepiece-0.2.1-cp313-cp313t-manylinux_2_27_x86_64.manylinux_2_28_x86_64.whl", hash = "sha256:0a81799d0a68d618e89063fb423c3001a034c893069135ffe51fee439ae474d6"},
+    {file = "sentencepiece-0.2.1-cp313-cp313t-win32.whl", hash = "sha256:89a3ea015517c42c0341d0d962f3e6aaf2cf10d71b1932d475c44ba48d00aa2b"},
+    {file = "sentencepiece-0.2.1-cp313-cp313t-win_amd64.whl", hash = "sha256:33f068c9382dc2e7c228eedfd8163b52baa86bb92f50d0488bf2b7da7032e484"},
+    {file = "sentencepiece-0.2.1-cp313-cp313t-win_arm64.whl", hash = "sha256:b3616ad246f360e52c85781e47682d31abfb6554c779e42b65333d4b5f44ecc0"},
+    {file = "sentencepiece-0.2.1-cp314-cp314-macosx_10_13_universal2.whl", hash = "sha256:5d0350b686c320068702116276cfb26c066dc7e65cfef173980b11bb4d606719"},
+    {file = "sentencepiece-0.2.1-cp314-cp314-macosx_10_13_x86_64.whl", hash = "sha256:c7f54a31cde6fa5cb030370566f68152a742f433f8d2be458463d06c208aef33"},
+    {file = "sentencepiece-0.2.1-cp314-cp314-macosx_11_0_arm64.whl", hash = "sha256:c83b85ab2d6576607f31df77ff86f28182be4a8de6d175d2c33ca609925f5da1"},
+    {file = "sentencepiece-0.2.1-cp314-cp314-manylinux_2_27_aarch64.manylinux_2_28_aarch64.whl", hash = "sha256:1855f57db07b51fb51ed6c9c452f570624d2b169b36f0f79ef71a6e6c618cd8b"},
+    {file = "sentencepiece-0.2.1-cp314-cp314-manylinux_2_27_x86_64.manylinux_2_28_x86_64.whl", hash = "sha256:01e6912125cb45d3792f530a4d38f8e21bf884d6b4d4ade1b2de5cf7a8d2a52b"},
+    {file = "sentencepiece-0.2.1-cp314-cp314-win32.whl", hash = "sha256:c415c9de1447e0a74ae3fdb2e52f967cb544113a3a5ce3a194df185cbc1f962f"},
+    {file = "sentencepiece-0.2.1-cp314-cp314-win_amd64.whl", hash = "sha256:881b2e44b14fc19feade3cbed314be37de639fc415375cefaa5bc81a4be137fd"},
+    {file = "sentencepiece-0.2.1-cp314-cp314-win_arm64.whl", hash = "sha256:2005242a16d2dc3ac5fe18aa7667549134d37854823df4c4db244752453b78a8"},
+    {file = "sentencepiece-0.2.1-cp314-cp314t-macosx_10_13_universal2.whl", hash = "sha256:a19adcec27c524cb7069a1c741060add95f942d1cbf7ad0d104dffa0a7d28a2b"},
+    {file = "sentencepiece-0.2.1-cp314-cp314t-macosx_10_13_x86_64.whl", hash = "sha256:e37e4b4c4a11662b5db521def4e44d4d30ae69a1743241412a93ae40fdcab4bb"},
+    {file = "sentencepiece-0.2.1-cp314-cp314t-macosx_11_0_arm64.whl", hash = "sha256:477c81505db072b3ab627e7eab972ea1025331bd3a92bacbf798df2b75ea86ec"},
+    {file = "sentencepiece-0.2.1-cp314-cp314t-manylinux_2_27_aarch64.manylinux_2_28_aarch64.whl", hash = "sha256:010f025a544ef770bb395091d57cb94deb9652d8972e0d09f71d85d5a0816c8c"},
+    {file = "sentencepiece-0.2.1-cp314-cp314t-manylinux_2_27_x86_64.manylinux_2_28_x86_64.whl", hash = "sha256:733e59ff1794d26db706cd41fc2d7ca5f6c64a820709cb801dc0ea31780d64ab"},
+    {file = "sentencepiece-0.2.1-cp314-cp314t-win32.whl", hash = "sha256:d3233770f78e637dc8b1fda2cd7c3b99ec77e7505041934188a4e7fe751de3b0"},
+    {file = "sentencepiece-0.2.1-cp314-cp314t-win_amd64.whl", hash = "sha256:5e4366c97b68218fd30ea72d70c525e6e78a6c0a88650f57ac4c43c63b234a9d"},
+    {file = "sentencepiece-0.2.1-cp314-cp314t-win_arm64.whl", hash = "sha256:105e36e75cbac1292642045458e8da677b2342dcd33df503e640f0b457cb6751"},
+    {file = "sentencepiece-0.2.1-cp39-cp39-macosx_10_9_universal2.whl", hash = "sha256:afefe50a0cdcb4f2fd9733cb52001a2c164181ee2d82c32d38f5b1b326a8528c"},
+    {file = "sentencepiece-0.2.1-cp39-cp39-macosx_10_9_x86_64.whl", hash = "sha256:891ade6503dd93d418c03993f7d6a8aa20260c422cefff5096b9068185e67642"},
+    {file = "sentencepiece-0.2.1-cp39-cp39-macosx_11_0_arm64.whl", hash = "sha256:814978ac05130dd5812b4b03215c766bc6abaef13e7bd72bc534e4d1e12e9a4c"},
+    {file = "sentencepiece-0.2.1-cp39-cp39-manylinux_2_27_aarch64.manylinux_2_28_aarch64.whl", hash = "sha256:017f97b274d4b0baa84b2dc743bf4517be81156f413bb24f12aacacde378e5ab"},
+    {file = "sentencepiece-0.2.1-cp39-cp39-manylinux_2_27_x86_64.manylinux_2_28_x86_64.whl", hash = "sha256:22c4ebcb3c6ab1496ab1c37c79ef7bb563b8726f29548c30773b7a4cb152df1a"},
+    {file = "sentencepiece-0.2.1-cp39-cp39-win32.whl", hash = "sha256:caa4e560c72c151da80036aecc2159e51a7fd8ae9efebefd96860460ce6bd025"},
+    {file = "sentencepiece-0.2.1-cp39-cp39-win_amd64.whl", hash = "sha256:2af5a1fb05013332ad94343b8b5f3973e006a2dde2dfba55a819549e054e2f0f"},
+    {file = "sentencepiece-0.2.1-cp39-cp39-win_arm64.whl", hash = "sha256:3d165fbb9bf8fba35f1946ba2617c3f9995679f07438325f07c026d53f33e746"},
+    {file = "sentencepiece-0.2.1.tar.gz", hash = "sha256:8138cec27c2f2282f4a34d9a016e3374cd40e5c6e9cb335063db66a0a3b71fad"},
 ]
 
 [package.extras]
 test = ["pytest"]
+testpaths = ["test"]
 
 [[package]]
 name = "setuptools"
-version = "75.8.0"
-description = "Easily download, build, install, upgrade, and uninstall Python packages"
+version = "82.0.1"
+description = "Most extensible Python build backend with support for C/C++ extension modules"
 optional = false
 python-versions = ">=3.9"
-groups = ["main"]
+groups = ["main", "training"]
 files = [
-    {file = "setuptools-75.8.0-py3-none-any.whl", hash = "sha256:e3982f444617239225d675215d51f6ba05f845d4eec313da4418fdbb56fb27e3"},
-    {file = "setuptools-75.8.0.tar.gz", hash = "sha256:c5afc8f407c626b8313a86e10311dd3f661c6cd9c09d4bf8c15c0e11f9f2b0e6"},
+    {file = "setuptools-82.0.1-py3-none-any.whl", hash = "sha256:a59e362652f08dcd477c78bb6e7bd9d80a7995bc73ce773050228a348ce2e5bb"},
+    {file = "setuptools-82.0.1.tar.gz", hash = "sha256:7d872682c5d01cfde07da7bccc7b65469d3dca203318515ada1de5eda35efbf9"},
 ]
 
 [package.extras]
-check = ["pytest-checkdocs (>=2.4)", "pytest-ruff (>=0.2.1) ; sys_platform != \"cygwin\"", "ruff (>=0.8.0) ; sys_platform != \"cygwin\""]
-core = ["importlib_metadata (>=6) ; python_version < \"3.10\"", "jaraco.collections", "jaraco.functools (>=4)", "jaraco.text (>=3.7)", "more_itertools", "more_itertools (>=8.8)", "packaging", "packaging (>=24.2)", "platformdirs (>=4.2.2)", "tomli (>=2.0.1) ; python_version < \"3.11\"", "wheel (>=0.43.0)"]
+check = ["pytest-checkdocs (>=2.4)", "pytest-ruff (>=0.2.1) ; sys_platform != \"cygwin\"", "ruff (>=0.13.0) ; sys_platform != \"cygwin\""]
+core = ["importlib_metadata (>=6) ; python_version < \"3.10\"", "jaraco.functools (>=4)", "jaraco.text (>=3.7)", "more_itertools", "more_itertools (>=8.8)", "packaging (>=24.2)", "tomli (>=2.0.1) ; python_version < \"3.11\"", "wheel (>=0.43.0)"]
 cover = ["pytest-cov"]
 doc = ["furo", "jaraco.packaging (>=9.3)", "jaraco.tidelift (>=1.4)", "pygments-github-lexers (==0.0.5)", "pyproject-hooks (!=1.1)", "rst.linker (>=1.9)", "sphinx (>=3.5)", "sphinx-favicon", "sphinx-inline-tabs", "sphinx-lint", "sphinx-notfound-page (>=1,<2)", "sphinx-reredirects", "sphinxcontrib-towncrier", "towncrier (<24.7)"]
 enabler = ["pytest-enabler (>=2.2)"]
 test = ["build[virtualenv] (>=1.0.3)", "filelock (>=3.4.0)", "ini2toml[lite] (>=0.14)", "jaraco.develop (>=7.21) ; python_version >= \"3.9\" and sys_platform != \"cygwin\"", "jaraco.envs (>=2.2)", "jaraco.path (>=3.7.2)", "jaraco.test (>=5.5)", "packaging (>=24.2)", "pip (>=19.1)", "pyproject-hooks (!=1.1)", "pytest (>=6,!=8.1.*)", "pytest-home (>=0.5)", "pytest-perf ; sys_platform != \"cygwin\"", "pytest-subprocess", "pytest-timeout", "pytest-xdist (>=3)", "tomli-w (>=1.0.0)", "virtualenv (>=13.0.0)", "wheel (>=0.44.0)"]
-type = ["importlib_metadata (>=7.0.2) ; python_version < \"3.10\"", "jaraco.develop (>=7.21) ; sys_platform != \"cygwin\"", "mypy (==1.14.*)", "pytest-mypy"]
+type = ["importlib_metadata (>=7.0.2) ; python_version < \"3.10\"", "jaraco.develop (>=7.21) ; sys_platform != \"cygwin\"", "mypy (==1.18.*)", "pytest-mypy"]
 
 [[package]]
-name = "six"
-version = "1.17.0"
-description = "Python 2 and 3 compatibility utilities"
+name = "shellingham"
+version = "1.5.4"
+description = "Tool to Detect Surrounding Shell"
 optional = false
-python-versions = "!=3.0.*,!=3.1.*,!=3.2.*,>=2.7"
+python-versions = ">=3.7"
 groups = ["main"]
 files = [
-    {file = "six-1.17.0-py2.py3-none-any.whl", hash = "sha256:4721f391ed90541fddacab5acf947aa0d3dc7d27b2e1e8eda2be8970586c3274"},
-    {file = "six-1.17.0.tar.gz", hash = "sha256:ff70335d468e7eb6ec65b95b99d3a2836546063f63acc5171de367e834932a81"},
+    {file = "shellingham-1.5.4-py2.py3-none-any.whl", hash = "sha256:7ecfff8f2fd72616f7481040475a65b2bf8af90a56c89140852d1120324e8686"},
+    {file = "shellingham-1.5.4.tar.gz", hash = "sha256:8dbca0739d487e5bd35ab3ca4b36e11c4078f3a234bfce294b0a0291363404de"},
 ]
 
 [[package]]
-name = "smmap"
-version = "5.0.2"
-description = "A pure Python implementation of a sliding window memory map manager"
+name = "six"
+version = "1.17.0"
+description = "Python 2 and 3 compatibility utilities"
 optional = false
-python-versions = ">=3.7"
-groups = ["main"]
+python-versions = "!=3.0.*,!=3.1.*,!=3.2.*,>=2.7"
+groups = ["main", "training"]
 files = [
-    {file = "smmap-5.0.2-py3-none-any.whl", hash = "sha256:b30115f0def7d7531d22a0fb6502488d879e75b260a9db4d0819cfb25403af5e"},
-    {file = "smmap-5.0.2.tar.gz", hash = "sha256:26ea65a03958fa0c8a1c7e8c7a58fdc77221b8910f6be2131affade476898ad5"},
+    {file = "six-1.17.0-py2.py3-none-any.whl", hash = "sha256:4721f391ed90541fddacab5acf947aa0d3dc7d27b2e1e8eda2be8970586c3274"},
+    {file = "six-1.17.0.tar.gz", hash = "sha256:ff70335d468e7eb6ec65b95b99d3a2836546063f63acc5171de367e834932a81"},
 ]
+markers = {main = "extra == \"trackio\""}
 
 [[package]]
 name = "soupsieve"
-version = "2.6"
+version = "2.8.4"
 description = "A modern CSS selector implementation for Beautiful Soup."
 optional = false
-python-versions = ">=3.8"
+python-versions = ">=3.9"
 groups = ["main"]
 files = [
-    {file = "soupsieve-2.6-py3-none-any.whl", hash = "sha256:e72c4ff06e4fb6e4b5a9f0f55fe6e81514581fca1515028625d0f299c602ccc9"},
-    {file = "soupsieve-2.6.tar.gz", hash = "sha256:e2e68417777af359ec65daac1057404a3c8a5455bb8abc36f1a9866ab1a51abb"},
+    {file = "soupsieve-2.8.4-py3-none-any.whl", hash = "sha256:e7e6b0769c8f51ed59acab6e994b00621096cfb1c640a7509295987388fbaf65"},
+    {file = "soupsieve-2.8.4.tar.gz", hash = "sha256:e121fd02e975c695e4e9e8774a5ee35d74714b59307868dcc5319ad2d9e3328e"},
 ]
 
 [[package]]
-name = "SwissArmyTransformer"
-version = "0.4.12"
-description = "A transformer-based framework with finetuning as the first class citizen."
-optional = false
-python-versions = ">=3.5"
+name = "starlette"
+version = "1.3.1"
+description = "The little ASGI library that shines."
+optional = true
+python-versions = ">=3.10"
 groups = ["main"]
-files = []
-develop = false
+markers = "extra == \"trackio\""
+files = [
+    {file = "starlette-1.3.1-py3-none-any.whl", hash = "sha256:c7372aae11c3c3f26a42df7bd626cec2f47d03483d261d369516a615a53714c6"},
+    {file = "starlette-1.3.1.tar.gz", hash = "sha256:05d0213193f2fbaae60e2ecb593b4add4262ad4e46536b54abe36f11a71724e0"},
+]
 
-[package.dependencies]
-boto3 = "*"
-cpm_kernels = "*"
-datasets = "*"
-deepspeed = "*"
-einops = "*"
-sentencepiece = "*"
-tensorboardX = "*"
-torch = "*"
-transformers = "*"
-webdataset = "*"
+[package.dependencies]
+anyio = ">=3.6.2,<5"
+typing-extensions = {version = ">=4.10.0", markers = "python_version < \"3.13\""}
 
-[package.source]
-type = "git"
-url = "https://github.com/JingyeChen/SwissArmyTransformer"
-reference = "HEAD"
-resolved_reference = "982455404afea07503e6dc9ffafafad1a22c4302"
+[package.extras]
+full = ["httpx (>=0.27.0,<0.29.0)", "httpx2 (>=2.0.0)", "itsdangerous", "jinja2", "python-multipart (>=0.0.18)", "pyyaml"]
 
 [[package]]
 name = "sympy"
-version = "1.13.3"
+version = "1.13.1"
 description = "Computer algebra system (CAS) in Python"
 optional = false
 python-versions = ">=3.8"
-groups = ["main"]
+groups = ["main", "training"]
 files = [
-    {file = "sympy-1.13.3-py3-none-any.whl", hash = "sha256:54612cf55a62755ee71824ce692986f23c88ffa77207b30c1368eda4a7060f73"},
-    {file = "sympy-1.13.3.tar.gz", hash = "sha256:b27fd2c6530e0ab39e275fc9b683895367e51d5da91baa8d3d64db2565fec4d9"},
+    {file = "sympy-1.13.1-py3-none-any.whl", hash = "sha256:db36cdc64bf61b9b24578b6f7bab1ecdd2452cf008f34faa33776680c26d66f8"},
+    {file = "sympy-1.13.1.tar.gz", hash = "sha256:9cebf7e04ff162015ce31c9c6c9144daa34a93bd082f54fd8f12deca4f47515f"},
 ]
 
 [package.dependencies]
@@ -5135,13 +4898,13 @@ dev = ["hypothesis (>=6.70.0)", "pytest (>=7.1.0)"]
 
 [[package]]
 name = "tensorboard"
-version = "2.19.0"
+version = "2.20.0"
 description = "TensorBoard lets you watch Tensors Flow"
 optional = false
 python-versions = ">=3.9"
-groups = ["main"]
+groups = ["training"]
 files = [
-    {file = "tensorboard-2.19.0-py3-none-any.whl", hash = "sha256:5e71b98663a641a7ce8a6e70b0be8e1a4c0c45d48760b076383ac4755c35b9a0"},
+    {file = "tensorboard-2.20.0-py3-none-any.whl", hash = "sha256:9dc9f978cb84c0723acf9a345d96c184f0293d18f166bb8d59ee098e6cfaaba6"},
 ]
 
 [package.dependencies]
@@ -5150,9 +4913,9 @@ grpcio = ">=1.48.2"
 markdown = ">=2.6.8"
 numpy = ">=1.12.0"
 packaging = "*"
+pillow = "*"
 protobuf = ">=3.19.6,<4.24.0 || >4.24.0"
 setuptools = ">=41.0.0"
-six = ">1.9"
 tensorboard-data-server = ">=0.7.0,<0.8.0"
 werkzeug = ">=1.0.1"
 
@@ -5162,52 +4925,20 @@ version = "0.7.2"
 description = "Fast data loading for TensorBoard"
 optional = false
 python-versions = ">=3.7"
-groups = ["main"]
+groups = ["training"]
 files = [
     {file = "tensorboard_data_server-0.7.2-py3-none-any.whl", hash = "sha256:7e0610d205889588983836ec05dc098e80f97b7e7bbff7e994ebb78f578d0ddb"},
     {file = "tensorboard_data_server-0.7.2-py3-none-macosx_10_9_x86_64.whl", hash = "sha256:9fe5d24221b29625dbc7328b0436ca7fc1c23de4acf4d272f1180856e32f9f60"},
     {file = "tensorboard_data_server-0.7.2-py3-none-manylinux_2_31_x86_64.whl", hash = "sha256:ef687163c24185ae9754ed5650eb5bc4d84ff257aabdc33f0cc6f74d8ba54530"},
 ]
 
-[[package]]
-name = "tensorboardx"
-version = "2.6.2.2"
-description = "TensorBoardX lets you watch Tensors Flow without Tensorflow"
-optional = false
-python-versions = "*"
-groups = ["main"]
-files = [
-    {file = "tensorboardX-2.6.2.2-py2.py3-none-any.whl", hash = "sha256:160025acbf759ede23fd3526ae9d9bfbfd8b68eb16c38a010ebe326dc6395db8"},
-    {file = "tensorboardX-2.6.2.2.tar.gz", hash = "sha256:c6476d7cd0d529b0b72f4acadb1269f9ed8b22f441e87a84f2a3b940bb87b666"},
-]
-
-[package.dependencies]
-numpy = "*"
-packaging = "*"
-protobuf = ">=3.20"
-
-[[package]]
-name = "termcolor"
-version = "2.5.0"
-description = "ANSI color formatting for output in terminal"
-optional = false
-python-versions = ">=3.9"
-groups = ["main"]
-files = [
-    {file = "termcolor-2.5.0-py3-none-any.whl", hash = "sha256:37b17b5fc1e604945c2642c872a3764b5d547a48009871aea3edd3afa180afb8"},
-    {file = "termcolor-2.5.0.tar.gz", hash = "sha256:998d8d27da6d48442e8e1f016119076b690d962507531df4890fcd2db2ef8a6f"},
-]
-
-[package.extras]
-tests = ["pytest", "pytest-cov"]
-
 [[package]]
 name = "threadpoolctl"
 version = "3.6.0"
 description = "threadpoolctl"
 optional = false
 python-versions = ">=3.9"
-groups = ["main"]
+groups = ["training"]
 files = [
     {file = "threadpoolctl-3.6.0-py3-none-any.whl", hash = "sha256:43a0b8fd5a2928500110039e43a5eed8480b918967083ea48dc3ab9f13c4a7fb"},
     {file = "threadpoolctl-3.6.0.tar.gz", hash = "sha256:8ab8b4aa3491d812b623328249fab5302a68d2d71745c8a4c719a2fcaba9f44e"},
@@ -5234,222 +4965,84 @@ torchvision = "*"
 
 [[package]]
 name = "tokenizers"
-version = "0.20.3"
+version = "0.22.2"
 description = ""
 optional = false
-python-versions = ">=3.7"
+python-versions = ">=3.9"
 groups = ["main"]
 files = [
-    {file = "tokenizers-0.20.3-cp310-cp310-macosx_10_12_x86_64.whl", hash = "sha256:31ccab28dbb1a9fe539787210b0026e22debeab1662970f61c2d921f7557f7e4"},
-    {file = "tokenizers-0.20.3-cp310-cp310-macosx_11_0_arm64.whl", hash = "sha256:c6361191f762bda98c773da418cf511cbaa0cb8d0a1196f16f8c0119bde68ff8"},
-    {file = "tokenizers-0.20.3-cp310-cp310-manylinux_2_17_aarch64.manylinux2014_aarch64.whl", hash = "sha256:f128d5da1202b78fa0a10d8d938610472487da01b57098d48f7e944384362514"},
-    {file = "tokenizers-0.20.3-cp310-cp310-manylinux_2_17_armv7l.manylinux2014_armv7l.whl", hash = "sha256:79c4121a2e9433ad7ef0769b9ca1f7dd7fa4c0cd501763d0a030afcbc6384481"},
-    {file = "tokenizers-0.20.3-cp310-cp310-manylinux_2_17_i686.manylinux2014_i686.whl", hash = "sha256:b7850fde24197fe5cd6556e2fdba53a6d3bae67c531ea33a3d7c420b90904141"},
-    {file = "tokenizers-0.20.3-cp310-cp310-manylinux_2_17_ppc64le.manylinux2014_ppc64le.whl", hash = "sha256:b357970c095dc134978a68c67d845a1e3803ab7c4fbb39195bde914e7e13cf8b"},
-    {file = "tokenizers-0.20.3-cp310-cp310-manylinux_2_17_s390x.manylinux2014_s390x.whl", hash = "sha256:a333d878c4970b72d6c07848b90c05f6b045cf9273fc2bc04a27211721ad6118"},
-    {file = "tokenizers-0.20.3-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl", hash = "sha256:1fd9fee817f655a8f50049f685e224828abfadd436b8ff67979fc1d054b435f1"},
-    {file = "tokenizers-0.20.3-cp310-cp310-musllinux_1_1_aarch64.whl", hash = "sha256:9e7816808b402129393a435ea2a509679b41246175d6e5e9f25b8692bfaa272b"},
-    {file = "tokenizers-0.20.3-cp310-cp310-musllinux_1_1_x86_64.whl", hash = "sha256:ba96367db9d8a730d3a1d5996b4b7babb846c3994b8ef14008cd8660f55db59d"},
-    {file = "tokenizers-0.20.3-cp310-none-win32.whl", hash = "sha256:ee31ba9d7df6a98619426283e80c6359f167e2e9882d9ce1b0254937dbd32f3f"},
-    {file = "tokenizers-0.20.3-cp310-none-win_amd64.whl", hash = "sha256:a845c08fdad554fe0871d1255df85772f91236e5fd6b9287ef8b64f5807dbd0c"},
-    {file = "tokenizers-0.20.3-cp311-cp311-macosx_10_12_x86_64.whl", hash = "sha256:585b51e06ca1f4839ce7759941e66766d7b060dccfdc57c4ca1e5b9a33013a90"},
-    {file = "tokenizers-0.20.3-cp311-cp311-macosx_11_0_arm64.whl", hash = "sha256:61cbf11954f3b481d08723ebd048ba4b11e582986f9be74d2c3bdd9293a4538d"},
-    {file = "tokenizers-0.20.3-cp311-cp311-manylinux_2_17_aarch64.manylinux2014_aarch64.whl", hash = "sha256:ef820880d5e4e8484e2fa54ff8d297bb32519eaa7815694dc835ace9130a3eea"},
-    {file = "tokenizers-0.20.3-cp311-cp311-manylinux_2_17_armv7l.manylinux2014_armv7l.whl", hash = "sha256:67ef4dcb8841a4988cd00dd288fb95dfc8e22ed021f01f37348fd51c2b055ba9"},
-    {file = "tokenizers-0.20.3-cp311-cp311-manylinux_2_17_i686.manylinux2014_i686.whl", hash = "sha256:ff1ef8bd47a02b0dc191688ccb4da53600df5d4c9a05a4b68e1e3de4823e78eb"},
-    {file = "tokenizers-0.20.3-cp311-cp311-manylinux_2_17_ppc64le.manylinux2014_ppc64le.whl", hash = "sha256:444d188186eab3148baf0615b522461b41b1f0cd58cd57b862ec94b6ac9780f1"},
-    {file = "tokenizers-0.20.3-cp311-cp311-manylinux_2_17_s390x.manylinux2014_s390x.whl", hash = "sha256:37c04c032c1442740b2c2d925f1857885c07619224a533123ac7ea71ca5713da"},
-    {file = "tokenizers-0.20.3-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl", hash = "sha256:453c7769d22231960ee0e883d1005c93c68015025a5e4ae56275406d94a3c907"},
-    {file = "tokenizers-0.20.3-cp311-cp311-musllinux_1_1_aarch64.whl", hash = "sha256:4bb31f7b2847e439766aaa9cc7bccf7ac7088052deccdb2275c952d96f691c6a"},
-    {file = "tokenizers-0.20.3-cp311-cp311-musllinux_1_1_x86_64.whl", hash = "sha256:843729bf0f991b29655a069a2ff58a4c24375a553c70955e15e37a90dd4e045c"},
-    {file = "tokenizers-0.20.3-cp311-none-win32.whl", hash = "sha256:efcce3a927b1e20ca694ba13f7a68c59b0bd859ef71e441db68ee42cf20c2442"},
-    {file = "tokenizers-0.20.3-cp311-none-win_amd64.whl", hash = "sha256:88301aa0801f225725b6df5dea3d77c80365ff2362ca7e252583f2b4809c4cc0"},
-    {file = "tokenizers-0.20.3-cp312-cp312-macosx_10_12_x86_64.whl", hash = "sha256:49d12a32e190fad0e79e5bdb788d05da2f20d8e006b13a70859ac47fecf6ab2f"},
-    {file = "tokenizers-0.20.3-cp312-cp312-macosx_11_0_arm64.whl", hash = "sha256:282848cacfb9c06d5e51489f38ec5aa0b3cd1e247a023061945f71f41d949d73"},
-    {file = "tokenizers-0.20.3-cp312-cp312-manylinux_2_17_aarch64.manylinux2014_aarch64.whl", hash = "sha256:abe4e08c7d0cd6154c795deb5bf81d2122f36daf075e0c12a8b050d824ef0a64"},
-    {file = "tokenizers-0.20.3-cp312-cp312-manylinux_2_17_armv7l.manylinux2014_armv7l.whl", hash = "sha256:ca94fc1b73b3883c98f0c88c77700b13d55b49f1071dfd57df2b06f3ff7afd64"},
-    {file = "tokenizers-0.20.3-cp312-cp312-manylinux_2_17_i686.manylinux2014_i686.whl", hash = "sha256:ef279c7e239f95c8bdd6ff319d9870f30f0d24915b04895f55b1adcf96d6c60d"},
-    {file = "tokenizers-0.20.3-cp312-cp312-manylinux_2_17_ppc64le.manylinux2014_ppc64le.whl", hash = "sha256:16384073973f6ccbde9852157a4fdfe632bb65208139c9d0c0bd0176a71fd67f"},
-    {file = "tokenizers-0.20.3-cp312-cp312-manylinux_2_17_s390x.manylinux2014_s390x.whl", hash = "sha256:312d522caeb8a1a42ebdec87118d99b22667782b67898a76c963c058a7e41d4f"},
-    {file = "tokenizers-0.20.3-cp312-cp312-manylinux_2_17_x86_64.manylinux2014_x86_64.whl", hash = "sha256:f2b7cb962564785a83dafbba0144ecb7f579f1d57d8c406cdaa7f32fe32f18ad"},
-    {file = "tokenizers-0.20.3-cp312-cp312-musllinux_1_1_aarch64.whl", hash = "sha256:124c5882ebb88dadae1fc788a582299fcd3a8bd84fc3e260b9918cf28b8751f5"},
-    {file = "tokenizers-0.20.3-cp312-cp312-musllinux_1_1_x86_64.whl", hash = "sha256:2b6e54e71f84c4202111a489879005cb14b92616a87417f6c102c833af961ea2"},
-    {file = "tokenizers-0.20.3-cp312-none-win32.whl", hash = "sha256:83d9bfbe9af86f2d9df4833c22e94d94750f1d0cd9bfb22a7bb90a86f61cdb1c"},
-    {file = "tokenizers-0.20.3-cp312-none-win_amd64.whl", hash = "sha256:44def74cee574d609a36e17c8914311d1b5dbcfe37c55fd29369d42591b91cf2"},
-    {file = "tokenizers-0.20.3-cp313-cp313-macosx_10_12_x86_64.whl", hash = "sha256:e0b630e0b536ef0e3c8b42c685c1bc93bd19e98c0f1543db52911f8ede42cf84"},
-    {file = "tokenizers-0.20.3-cp313-cp313-macosx_11_0_arm64.whl", hash = "sha256:a02d160d2b19bcbfdf28bd9a4bf11be4cb97d0499c000d95d4c4b1a4312740b6"},
-    {file = "tokenizers-0.20.3-cp313-cp313-manylinux_2_17_aarch64.manylinux2014_aarch64.whl", hash = "sha256:0e3d80d89b068bc30034034b5319218c7c0a91b00af19679833f55f3becb6945"},
-    {file = "tokenizers-0.20.3-cp313-cp313-manylinux_2_17_armv7l.manylinux2014_armv7l.whl", hash = "sha256:174a54910bed1b089226512b4458ea60d6d6fd93060254734d3bc3540953c51c"},
-    {file = "tokenizers-0.20.3-cp313-cp313-manylinux_2_17_i686.manylinux2014_i686.whl", hash = "sha256:098b8a632b8656aa5802c46689462c5c48f02510f24029d71c208ec2c822e771"},
-    {file = "tokenizers-0.20.3-cp313-cp313-manylinux_2_17_ppc64le.manylinux2014_ppc64le.whl", hash = "sha256:78c8c143e3ae41e718588281eb3e212c2b31623c9d6d40410ec464d7d6221fb5"},
-    {file = "tokenizers-0.20.3-cp313-cp313-manylinux_2_17_s390x.manylinux2014_s390x.whl", hash = "sha256:2b26b0aadb18cd8701077362ba359a06683662d5cafe3e8e8aba10eb05c037f1"},
-    {file = "tokenizers-0.20.3-cp313-cp313-manylinux_2_17_x86_64.manylinux2014_x86_64.whl", hash = "sha256:07d7851a72717321022f3774e84aa9d595a041d643fafa2e87fbc9b18711dac0"},
-    {file = "tokenizers-0.20.3-cp313-cp313-musllinux_1_1_aarch64.whl", hash = "sha256:bd44e48a430ada902c6266a8245f5036c4fe744fcb51f699999fbe82aa438797"},
-    {file = "tokenizers-0.20.3-cp313-cp313-musllinux_1_1_x86_64.whl", hash = "sha256:a4c186bb006ccbe1f5cc4e0380d1ce7806f5955c244074fd96abc55e27b77f01"},
-    {file = "tokenizers-0.20.3-cp313-none-win32.whl", hash = "sha256:6e19e0f1d854d6ab7ea0c743d06e764d1d9a546932be0a67f33087645f00fe13"},
-    {file = "tokenizers-0.20.3-cp313-none-win_amd64.whl", hash = "sha256:d50ede425c7e60966a9680d41b58b3a0950afa1bb570488e2972fa61662c4273"},
-    {file = "tokenizers-0.20.3-cp37-cp37m-macosx_10_12_x86_64.whl", hash = "sha256:9adda1ff5fb9dcdf899ceca672a4e2ce9e797adb512a6467305ca3d8bfcfbdd0"},
-    {file = "tokenizers-0.20.3-cp37-cp37m-macosx_11_0_arm64.whl", hash = "sha256:6dde2cae6004ba7a3badff4a11911cae03ebf23e97eebfc0e71fef2530e5074f"},
-    {file = "tokenizers-0.20.3-cp37-cp37m-manylinux_2_17_aarch64.manylinux2014_aarch64.whl", hash = "sha256:c4a7fd678b35614fca708579eb95b7587a5e8a6d328171bd2488fd9f27d82be4"},
-    {file = "tokenizers-0.20.3-cp37-cp37m-manylinux_2_17_armv7l.manylinux2014_armv7l.whl", hash = "sha256:1b80e3c7283a01a356bd2210f53d1a4a5d32b269c2024389ed0173137708d50e"},
-    {file = "tokenizers-0.20.3-cp37-cp37m-manylinux_2_17_i686.manylinux2014_i686.whl", hash = "sha256:a8cc0e8176b762973758a77f0d9c4467d310e33165fb74173418ca3734944da4"},
-    {file = "tokenizers-0.20.3-cp37-cp37m-manylinux_2_17_ppc64le.manylinux2014_ppc64le.whl", hash = "sha256:d5634b2e2f5f3d2b4439d2d74066e22eb4b1f04f3fea05cb2a3c12d89b5a3bcd"},
-    {file = "tokenizers-0.20.3-cp37-cp37m-manylinux_2_17_s390x.manylinux2014_s390x.whl", hash = "sha256:b4ba635165bc1ea46f2da8e5d80b5f70f6ec42161e38d96dbef33bb39df73964"},
-    {file = "tokenizers-0.20.3-cp37-cp37m-manylinux_2_17_x86_64.manylinux2014_x86_64.whl", hash = "sha256:18e4c7c64172e7789bd8b07aa3087ea87c4c4de7e90937a2aa036b5d92332536"},
-    {file = "tokenizers-0.20.3-cp37-cp37m-musllinux_1_1_aarch64.whl", hash = "sha256:1f74909ef7675c26d4095a817ec3393d67f3158ca4836c233212e5613ef640c4"},
-    {file = "tokenizers-0.20.3-cp37-cp37m-musllinux_1_1_x86_64.whl", hash = "sha256:0e9b81321a1e05b16487d312b4264984513f8b4a7556229cafac6e88c2036b09"},
-    {file = "tokenizers-0.20.3-cp37-none-win32.whl", hash = "sha256:ab48184cd58b4a03022a2ec75b54c9f600ffea9a733612c02325ed636f353729"},
-    {file = "tokenizers-0.20.3-cp37-none-win_amd64.whl", hash = "sha256:60ac483cebee1c12c71878523e768df02fa17e4c54412966cb3ac862c91b36c1"},
-    {file = "tokenizers-0.20.3-cp38-cp38-macosx_10_12_x86_64.whl", hash = "sha256:3229ef103c89583d10b9378afa5d601b91e6337530a0988e17ca8d635329a996"},
-    {file = "tokenizers-0.20.3-cp38-cp38-macosx_11_0_arm64.whl", hash = "sha256:6ac52cc24bad3de865c7e65b1c4e7b70d00938a8ae09a92a453b8f676e714ad5"},
-    {file = "tokenizers-0.20.3-cp38-cp38-manylinux_2_17_aarch64.manylinux2014_aarch64.whl", hash = "sha256:04627b7b502fa6a2a005e1bd446fa4247d89abcb1afaa1b81eb90e21aba9a60f"},
-    {file = "tokenizers-0.20.3-cp38-cp38-manylinux_2_17_armv7l.manylinux2014_armv7l.whl", hash = "sha256:c27ceb887f0e81a3c377eb4605dca7a95a81262761c0fba308d627b2abb98f2b"},
-    {file = "tokenizers-0.20.3-cp38-cp38-manylinux_2_17_i686.manylinux2014_i686.whl", hash = "sha256:65ab780194da4e1fcf5670523a2f377c4838ebf5249efe41fa1eddd2a84fb49d"},
-    {file = "tokenizers-0.20.3-cp38-cp38-manylinux_2_17_ppc64le.manylinux2014_ppc64le.whl", hash = "sha256:98d343134f47159e81f7f242264b0eb222e6b802f37173c8d7d7b64d5c9d1388"},
-    {file = "tokenizers-0.20.3-cp38-cp38-manylinux_2_17_s390x.manylinux2014_s390x.whl", hash = "sha256:f2475bb004ab2009d29aff13b5047bfdb3d4b474f0aa9d4faa13a7f34dbbbb43"},
-    {file = "tokenizers-0.20.3-cp38-cp38-manylinux_2_17_x86_64.manylinux2014_x86_64.whl", hash = "sha256:7b6583a65c01db1197c1eb36857ceba8ec329d53afadd268b42a6b04f4965724"},
-    {file = "tokenizers-0.20.3-cp38-cp38-musllinux_1_1_aarch64.whl", hash = "sha256:62d00ba208358c037eeab7bfc00a905adc67b2d31b68ab40ed09d75881e114ea"},
-    {file = "tokenizers-0.20.3-cp38-cp38-musllinux_1_1_x86_64.whl", hash = "sha256:0fc7a39e5bedc817bda395a798dfe2d9c5f7c71153c90d381b5135a0328d9520"},
-    {file = "tokenizers-0.20.3-cp38-none-win32.whl", hash = "sha256:84d40ee0f8550d64d3ea92dd7d24a8557a9172165bdb986c9fb2503b4fe4e3b6"},
-    {file = "tokenizers-0.20.3-cp38-none-win_amd64.whl", hash = "sha256:205a45246ed7f1718cf3785cff88450ba603352412aaf220ace026384aa3f1c0"},
-    {file = "tokenizers-0.20.3-cp39-cp39-macosx_10_12_x86_64.whl", hash = "sha256:93e37f0269a11dc3b1a953f1fca9707f0929ebf8b4063c591c71a0664219988e"},
-    {file = "tokenizers-0.20.3-cp39-cp39-macosx_11_0_arm64.whl", hash = "sha256:f4cb0c614b0135e781de96c2af87e73da0389ac1458e2a97562ed26e29490d8d"},
-    {file = "tokenizers-0.20.3-cp39-cp39-manylinux_2_17_aarch64.manylinux2014_aarch64.whl", hash = "sha256:7eb2fb1c432f5746b22f8a7f09fc18c4156cb0031c77f53cb19379d82d43297a"},
-    {file = "tokenizers-0.20.3-cp39-cp39-manylinux_2_17_armv7l.manylinux2014_armv7l.whl", hash = "sha256:bfa8d029bb156181b006643309d6b673615a24e4ed24cf03aa191d599b996f51"},
-    {file = "tokenizers-0.20.3-cp39-cp39-manylinux_2_17_i686.manylinux2014_i686.whl", hash = "sha256:6f90549622de3bf476ad9f1dd6f3f952ec3ed6ab8615ae88ef060d0c5bfad55d"},
-    {file = "tokenizers-0.20.3-cp39-cp39-manylinux_2_17_ppc64le.manylinux2014_ppc64le.whl", hash = "sha256:a1d469c74eebf5c43fd61cd9b030e271d17198edd7bd45392e03a3c091d7d6d4"},
-    {file = "tokenizers-0.20.3-cp39-cp39-manylinux_2_17_s390x.manylinux2014_s390x.whl", hash = "sha256:bee8f53b2594749f4460d53253bae55d718f04e9b633efa0f5df8938bd98e4f0"},
-    {file = "tokenizers-0.20.3-cp39-cp39-manylinux_2_17_x86_64.manylinux2014_x86_64.whl", hash = "sha256:938441babf3e5720e4459e306ef2809fb267680df9d1ff2873458b22aef60248"},
-    {file = "tokenizers-0.20.3-cp39-cp39-musllinux_1_1_aarch64.whl", hash = "sha256:7310ab23d7b0caebecc0e8be11a1146f320f5f07284000f6ea54793e83de1b75"},
-    {file = "tokenizers-0.20.3-cp39-cp39-musllinux_1_1_x86_64.whl", hash = "sha256:16121eb030a2b13094cfec936b0c12e8b4063c5f839591ea7d0212336d8f9921"},
-    {file = "tokenizers-0.20.3-cp39-none-win32.whl", hash = "sha256:401cc21ef642ee235985d747f65e18f639464d377c70836c9003df208d582064"},
-    {file = "tokenizers-0.20.3-cp39-none-win_amd64.whl", hash = "sha256:7498f3ea7746133335a6adb67a77cf77227a8b82c8483f644a2e5f86fea42b8d"},
-    {file = "tokenizers-0.20.3-pp310-pypy310_pp73-macosx_10_12_x86_64.whl", hash = "sha256:e919f2e3e68bb51dc31de4fcbbeff3bdf9c1cad489044c75e2b982a91059bd3c"},
-    {file = "tokenizers-0.20.3-pp310-pypy310_pp73-macosx_11_0_arm64.whl", hash = "sha256:b8e9608f2773996cc272156e305bd79066163a66b0390fe21750aff62df1ac07"},
-    {file = "tokenizers-0.20.3-pp310-pypy310_pp73-manylinux_2_17_aarch64.manylinux2014_aarch64.whl", hash = "sha256:39270a7050deaf50f7caff4c532c01b3c48f6608d42b3eacdebdc6795478c8df"},
-    {file = "tokenizers-0.20.3-pp310-pypy310_pp73-manylinux_2_17_i686.manylinux2014_i686.whl", hash = "sha256:e005466632b1c5d2d2120f6de8aa768cc9d36cd1ab7d51d0c27a114c91a1e6ee"},
-    {file = "tokenizers-0.20.3-pp310-pypy310_pp73-manylinux_2_17_x86_64.manylinux2014_x86_64.whl", hash = "sha256:a07962340b36189b6c8feda552ea1bfeee6cf067ff922a1d7760662c2ee229e5"},
-    {file = "tokenizers-0.20.3-pp310-pypy310_pp73-musllinux_1_1_aarch64.whl", hash = "sha256:55046ad3dd5f2b3c67501fcc8c9cbe3e901d8355f08a3b745e9b57894855f85b"},
-    {file = "tokenizers-0.20.3-pp310-pypy310_pp73-musllinux_1_1_x86_64.whl", hash = "sha256:efcf0eb939988b627558aaf2b9dc3e56d759cad2e0cfa04fcab378e4b48fc4fd"},
-    {file = "tokenizers-0.20.3-pp37-pypy37_pp73-macosx_10_12_x86_64.whl", hash = "sha256:f3558a7ae6a6d38a77dfce12172a1e2e1bf3e8871e744a1861cd7591ea9ebe24"},
-    {file = "tokenizers-0.20.3-pp37-pypy37_pp73-manylinux_2_17_aarch64.manylinux2014_aarch64.whl", hash = "sha256:4d53029fe44bc70c3ff14ef512460a0cf583495a0f8e2f4b70e26eb9438e38a9"},
-    {file = "tokenizers-0.20.3-pp37-pypy37_pp73-manylinux_2_17_i686.manylinux2014_i686.whl", hash = "sha256:57a2a56397b2bec5a629b516b23f0f8a3e4f978c7488d4a299980f8375954b85"},
-    {file = "tokenizers-0.20.3-pp37-pypy37_pp73-manylinux_2_17_x86_64.manylinux2014_x86_64.whl", hash = "sha256:b1e5bfaae740ef9ece000f8a07e78ac0e2b085c5ce9648f8593ddf0243c9f76d"},
-    {file = "tokenizers-0.20.3-pp37-pypy37_pp73-musllinux_1_1_aarch64.whl", hash = "sha256:fbaf3ea28fedfb2283da60e710aff25492e795a7397cad8a50f1e079b65a5a70"},
-    {file = "tokenizers-0.20.3-pp37-pypy37_pp73-musllinux_1_1_x86_64.whl", hash = "sha256:c47c037116310dc976eb96b008e41b9cfaba002ed8005848d4d632ee0b7ba9ae"},
-    {file = "tokenizers-0.20.3-pp38-pypy38_pp73-macosx_10_12_x86_64.whl", hash = "sha256:c31751f0721f58f5e19bb27c1acc259aeff860d8629c4e1a900b26a1979ada8e"},
-    {file = "tokenizers-0.20.3-pp38-pypy38_pp73-macosx_11_0_arm64.whl", hash = "sha256:c697cbd3be7a79ea250ea5f380d6f12e534c543cfb137d5c734966b3ee4f34cc"},
-    {file = "tokenizers-0.20.3-pp38-pypy38_pp73-manylinux_2_17_aarch64.manylinux2014_aarch64.whl", hash = "sha256:b48971b88ef9130bf35b41b35fd857c3c4dae4a9cd7990ebc7fc03e59cc92438"},
-    {file = "tokenizers-0.20.3-pp38-pypy38_pp73-manylinux_2_17_i686.manylinux2014_i686.whl", hash = "sha256:4e615de179bbe060ab33773f0d98a8a8572b5883dd7dac66c1de8c056c7e748c"},
-    {file = "tokenizers-0.20.3-pp38-pypy38_pp73-manylinux_2_17_x86_64.manylinux2014_x86_64.whl", hash = "sha256:da1ec842035ed9999c62e45fbe0ff14b7e8a7e02bb97688cc6313cf65e5cd755"},
-    {file = "tokenizers-0.20.3-pp38-pypy38_pp73-musllinux_1_1_aarch64.whl", hash = "sha256:6ee4954c1dd23aadc27958dad759006e71659d497dcb0ef0c7c87ea992c16ebd"},
-    {file = "tokenizers-0.20.3-pp38-pypy38_pp73-musllinux_1_1_x86_64.whl", hash = "sha256:3eda46ca402751ec82553a321bf35a617b76bbed7586e768c02ccacbdda94d6d"},
-    {file = "tokenizers-0.20.3-pp39-pypy39_pp73-macosx_10_12_x86_64.whl", hash = "sha256:de082392a85eb0055cc055c535bff2f0cc15d7a000bdc36fbf601a0f3cf8507a"},
-    {file = "tokenizers-0.20.3-pp39-pypy39_pp73-macosx_11_0_arm64.whl", hash = "sha256:c3db46cc0647bfd88263afdb739b92017a02a87ee30945cb3e86c7e25c7c9917"},
-    {file = "tokenizers-0.20.3-pp39-pypy39_pp73-manylinux_2_17_aarch64.manylinux2014_aarch64.whl", hash = "sha256:a292392f24ab9abac5cfa8197e5a6208f2e43723420217e1ceba0b4ec77816ac"},
-    {file = "tokenizers-0.20.3-pp39-pypy39_pp73-manylinux_2_17_i686.manylinux2014_i686.whl", hash = "sha256:8dcd91f4e60f62b20d83a87a84fe062035a1e3ff49a8c2bbdeb2d441c8e311f4"},
-    {file = "tokenizers-0.20.3-pp39-pypy39_pp73-manylinux_2_17_x86_64.manylinux2014_x86_64.whl", hash = "sha256:900991a2b8ee35961b1095db7e265342e0e42a84c1a594823d5ee9f8fb791958"},
-    {file = "tokenizers-0.20.3-pp39-pypy39_pp73-musllinux_1_1_aarch64.whl", hash = "sha256:5a8d8261ca2133d4f98aa9627c748189502b3787537ba3d7e2beb4f7cfc5d627"},
-    {file = "tokenizers-0.20.3-pp39-pypy39_pp73-musllinux_1_1_x86_64.whl", hash = "sha256:c4fd4d71e6deb6ddf99d8d0eab87d1d16f635898906e631914a9bae8ae9f2cfb"},
-    {file = "tokenizers-0.20.3.tar.gz", hash = "sha256:2278b34c5d0dd78e087e1ca7f9b1dcbf129d80211afa645f214bd6e051037539"},
+    {file = "tokenizers-0.22.2-cp39-abi3-macosx_10_12_x86_64.whl", hash = "sha256:544dd704ae7238755d790de45ba8da072e9af3eea688f698b137915ae959281c"},
+    {file = "tokenizers-0.22.2-cp39-abi3-macosx_11_0_arm64.whl", hash = "sha256:1e418a55456beedca4621dbab65a318981467a2b188e982a23e117f115ce5001"},
+    {file = "tokenizers-0.22.2-cp39-abi3-manylinux_2_17_aarch64.manylinux2014_aarch64.whl", hash = "sha256:2249487018adec45d6e3554c71d46eb39fa8ea67156c640f7513eb26f318cec7"},
+    {file = "tokenizers-0.22.2-cp39-abi3-manylinux_2_17_armv7l.manylinux2014_armv7l.whl", hash = "sha256:25b85325d0815e86e0bac263506dd114578953b7b53d7de09a6485e4a160a7dd"},
+    {file = "tokenizers-0.22.2-cp39-abi3-manylinux_2_17_i686.manylinux2014_i686.whl", hash = "sha256:bfb88f22a209ff7b40a576d5324bf8286b519d7358663db21d6246fb17eea2d5"},
+    {file = "tokenizers-0.22.2-cp39-abi3-manylinux_2_17_ppc64le.manylinux2014_ppc64le.whl", hash = "sha256:1c774b1276f71e1ef716e5486f21e76333464f47bece56bbd554485982a9e03e"},
+    {file = "tokenizers-0.22.2-cp39-abi3-manylinux_2_17_s390x.manylinux2014_s390x.whl", hash = "sha256:df6c4265b289083bf710dff49bc51ef252f9d5be33a45ee2bed151114a56207b"},
+    {file = "tokenizers-0.22.2-cp39-abi3-manylinux_2_17_x86_64.manylinux2014_x86_64.whl", hash = "sha256:369cc9fc8cc10cb24143873a0d95438bb8ee257bb80c71989e3ee290e8d72c67"},
+    {file = "tokenizers-0.22.2-cp39-abi3-musllinux_1_2_aarch64.whl", hash = "sha256:29c30b83d8dcd061078b05ae0cb94d3c710555fbb44861139f9f83dcca3dc3e4"},
+    {file = "tokenizers-0.22.2-cp39-abi3-musllinux_1_2_armv7l.whl", hash = "sha256:37ae80a28c1d3265bb1f22464c856bd23c02a05bb211e56d0c5301a435be6c1a"},
+    {file = "tokenizers-0.22.2-cp39-abi3-musllinux_1_2_i686.whl", hash = "sha256:791135ee325f2336f498590eb2f11dc5c295232f288e75c99a36c5dbce63088a"},
+    {file = "tokenizers-0.22.2-cp39-abi3-musllinux_1_2_x86_64.whl", hash = "sha256:38337540fbbddff8e999d59970f3c6f35a82de10053206a7562f1ea02d046fa5"},
+    {file = "tokenizers-0.22.2-cp39-abi3-win32.whl", hash = "sha256:a6bf3f88c554a2b653af81f3204491c818ae2ac6fbc09e76ef4773351292bc92"},
+    {file = "tokenizers-0.22.2-cp39-abi3-win_amd64.whl", hash = "sha256:c9ea31edff2968b44a88f97d784c2f16dc0729b8b143ed004699ebca91f05c48"},
+    {file = "tokenizers-0.22.2-cp39-abi3-win_arm64.whl", hash = "sha256:9ce725d22864a1e965217204946f830c37876eee3b2ba6fc6255e8e903d5fcbc"},
+    {file = "tokenizers-0.22.2-pp310-pypy310_pp73-manylinux_2_17_aarch64.manylinux2014_aarch64.whl", hash = "sha256:753d47ebd4542742ef9261d9da92cd545b2cacbb48349a1225466745bb866ec4"},
+    {file = "tokenizers-0.22.2-pp310-pypy310_pp73-manylinux_2_17_armv7l.manylinux2014_armv7l.whl", hash = "sha256:e10bf9113d209be7cd046d40fbabbaf3278ff6d18eb4da4c500443185dc1896c"},
+    {file = "tokenizers-0.22.2-pp310-pypy310_pp73-manylinux_2_17_ppc64le.manylinux2014_ppc64le.whl", hash = "sha256:64d94e84f6660764e64e7e0b22baa72f6cd942279fdbb21d46abd70d179f0195"},
+    {file = "tokenizers-0.22.2-pp310-pypy310_pp73-manylinux_2_17_s390x.manylinux2014_s390x.whl", hash = "sha256:f01a9c019878532f98927d2bacb79bbb404b43d3437455522a00a30718cdedb5"},
+    {file = "tokenizers-0.22.2-pp39-pypy39_pp73-manylinux_2_17_aarch64.manylinux2014_aarch64.whl", hash = "sha256:319f659ee992222f04e58f84cbf407cfa66a65fe3a8de44e8ad2bc53e7d99012"},
+    {file = "tokenizers-0.22.2-pp39-pypy39_pp73-manylinux_2_17_armv7l.manylinux2014_armv7l.whl", hash = "sha256:1e50f8554d504f617d9e9d6e4c2c2884a12b388a97c5c77f0bc6cf4cd032feee"},
+    {file = "tokenizers-0.22.2-pp39-pypy39_pp73-manylinux_2_17_ppc64le.manylinux2014_ppc64le.whl", hash = "sha256:1a62ba2c5faa2dd175aaeed7b15abf18d20266189fb3406c5d0550dd34dd5f37"},
+    {file = "tokenizers-0.22.2-pp39-pypy39_pp73-manylinux_2_17_s390x.manylinux2014_s390x.whl", hash = "sha256:143b999bdc46d10febb15cbffb4207ddd1f410e2c755857b5a0797961bbdc113"},
+    {file = "tokenizers-0.22.2.tar.gz", hash = "sha256:473b83b915e547aa366d1eee11806deaf419e17be16310ac0a14077f1e28f917"},
 ]
 
 [package.dependencies]
-huggingface-hub = ">=0.16.4,<1.0"
+huggingface-hub = ">=0.16.4,<2.0"
 
 [package.extras]
 dev = ["tokenizers[testing]"]
 docs = ["setuptools-rust", "sphinx", "sphinx-rtd-theme"]
-testing = ["black (==22.3)", "datasets", "numpy", "pytest", "requests", "ruff"]
+testing = ["datasets", "numpy", "pytest", "pytest-asyncio", "requests", "ruff", "ty"]
 
 [[package]]
-name = "toml"
-version = "0.10.2"
-description = "Python Library for Tom's Obvious, Minimal Language"
-optional = false
-python-versions = ">=2.6, !=3.0.*, !=3.1.*, !=3.2.*"
+name = "tomlkit"
+version = "0.14.0"
+description = "Style preserving TOML library"
+optional = true
+python-versions = ">=3.9"
 groups = ["main"]
+markers = "extra == \"trackio\""
 files = [
-    {file = "toml-0.10.2-py2.py3-none-any.whl", hash = "sha256:806143ae5bfb6a3c6e736a764057db0e6a0e05e338b5630894a5f779cabb4f9b"},
-    {file = "toml-0.10.2.tar.gz", hash = "sha256:b3bda1d108d5dd99f4a20d24d9c348e91c4db7ab1b749200bded2f839ccbe68f"},
-]
-
-[[package]]
-name = "tomli"
-version = "2.2.1"
-description = "A lil' TOML parser"
-optional = false
-python-versions = ">=3.8"
-groups = ["main", "dev"]
-markers = "python_version == \"3.10\""
-files = [
-    {file = "tomli-2.2.1-cp311-cp311-macosx_10_9_x86_64.whl", hash = "sha256:678e4fa69e4575eb77d103de3df8a895e1591b48e740211bd1067378c69e8249"},
-    {file = "tomli-2.2.1-cp311-cp311-macosx_11_0_arm64.whl", hash = "sha256:023aa114dd824ade0100497eb2318602af309e5a55595f76b626d6d9f3b7b0a6"},
-    {file = "tomli-2.2.1-cp311-cp311-manylinux_2_17_aarch64.manylinux2014_aarch64.whl", hash = "sha256:ece47d672db52ac607a3d9599a9d48dcb2f2f735c6c2d1f34130085bb12b112a"},
-    {file = "tomli-2.2.1-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl", hash = "sha256:6972ca9c9cc9f0acaa56a8ca1ff51e7af152a9f87fb64623e31d5c83700080ee"},
-    {file = "tomli-2.2.1-cp311-cp311-manylinux_2_5_i686.manylinux1_i686.manylinux_2_17_i686.manylinux2014_i686.whl", hash = "sha256:c954d2250168d28797dd4e3ac5cf812a406cd5a92674ee4c8f123c889786aa8e"},
-    {file = "tomli-2.2.1-cp311-cp311-musllinux_1_2_aarch64.whl", hash = "sha256:8dd28b3e155b80f4d54beb40a441d366adcfe740969820caf156c019fb5c7ec4"},
-    {file = "tomli-2.2.1-cp311-cp311-musllinux_1_2_i686.whl", hash = "sha256:e59e304978767a54663af13c07b3d1af22ddee3bb2fb0618ca1593e4f593a106"},
-    {file = "tomli-2.2.1-cp311-cp311-musllinux_1_2_x86_64.whl", hash = "sha256:33580bccab0338d00994d7f16f4c4ec25b776af3ffaac1ed74e0b3fc95e885a8"},
-    {file = "tomli-2.2.1-cp311-cp311-win32.whl", hash = "sha256:465af0e0875402f1d226519c9904f37254b3045fc5084697cefb9bdde1ff99ff"},
-    {file = "tomli-2.2.1-cp311-cp311-win_amd64.whl", hash = "sha256:2d0f2fdd22b02c6d81637a3c95f8cd77f995846af7414c5c4b8d0545afa1bc4b"},
-    {file = "tomli-2.2.1-cp312-cp312-macosx_10_13_x86_64.whl", hash = "sha256:4a8f6e44de52d5e6c657c9fe83b562f5f4256d8ebbfe4ff922c495620a7f6cea"},
-    {file = "tomli-2.2.1-cp312-cp312-macosx_11_0_arm64.whl", hash = "sha256:8d57ca8095a641b8237d5b079147646153d22552f1c637fd3ba7f4b0b29167a8"},
-    {file = "tomli-2.2.1-cp312-cp312-manylinux_2_17_aarch64.manylinux2014_aarch64.whl", hash = "sha256:4e340144ad7ae1533cb897d406382b4b6fede8890a03738ff1683af800d54192"},
-    {file = "tomli-2.2.1-cp312-cp312-manylinux_2_17_x86_64.manylinux2014_x86_64.whl", hash = "sha256:db2b95f9de79181805df90bedc5a5ab4c165e6ec3fe99f970d0e302f384ad222"},
-    {file = "tomli-2.2.1-cp312-cp312-manylinux_2_5_i686.manylinux1_i686.manylinux_2_17_i686.manylinux2014_i686.whl", hash = "sha256:40741994320b232529c802f8bc86da4e1aa9f413db394617b9a256ae0f9a7f77"},
-    {file = "tomli-2.2.1-cp312-cp312-musllinux_1_2_aarch64.whl", hash = "sha256:400e720fe168c0f8521520190686ef8ef033fb19fc493da09779e592861b78c6"},
-    {file = "tomli-2.2.1-cp312-cp312-musllinux_1_2_i686.whl", hash = "sha256:02abe224de6ae62c19f090f68da4e27b10af2b93213d36cf44e6e1c5abd19fdd"},
-    {file = "tomli-2.2.1-cp312-cp312-musllinux_1_2_x86_64.whl", hash = "sha256:b82ebccc8c8a36f2094e969560a1b836758481f3dc360ce9a3277c65f374285e"},
-    {file = "tomli-2.2.1-cp312-cp312-win32.whl", hash = "sha256:889f80ef92701b9dbb224e49ec87c645ce5df3fa2cc548664eb8a25e03127a98"},
-    {file = "tomli-2.2.1-cp312-cp312-win_amd64.whl", hash = "sha256:7fc04e92e1d624a4a63c76474610238576942d6b8950a2d7f908a340494e67e4"},
-    {file = "tomli-2.2.1-cp313-cp313-macosx_10_13_x86_64.whl", hash = "sha256:f4039b9cbc3048b2416cc57ab3bda989a6fcf9b36cf8937f01a6e731b64f80d7"},
-    {file = "tomli-2.2.1-cp313-cp313-macosx_11_0_arm64.whl", hash = "sha256:286f0ca2ffeeb5b9bd4fcc8d6c330534323ec51b2f52da063b11c502da16f30c"},
-    {file = "tomli-2.2.1-cp313-cp313-manylinux_2_17_aarch64.manylinux2014_aarch64.whl", hash = "sha256:a92ef1a44547e894e2a17d24e7557a5e85a9e1d0048b0b5e7541f76c5032cb13"},
-    {file = "tomli-2.2.1-cp313-cp313-manylinux_2_17_x86_64.manylinux2014_x86_64.whl", hash = "sha256:9316dc65bed1684c9a98ee68759ceaed29d229e985297003e494aa825ebb0281"},
-    {file = "tomli-2.2.1-cp313-cp313-manylinux_2_5_i686.manylinux1_i686.manylinux_2_17_i686.manylinux2014_i686.whl", hash = "sha256:e85e99945e688e32d5a35c1ff38ed0b3f41f43fad8df0bdf79f72b2ba7bc5272"},
-    {file = "tomli-2.2.1-cp313-cp313-musllinux_1_2_aarch64.whl", hash = "sha256:ac065718db92ca818f8d6141b5f66369833d4a80a9d74435a268c52bdfa73140"},
-    {file = "tomli-2.2.1-cp313-cp313-musllinux_1_2_i686.whl", hash = "sha256:d920f33822747519673ee656a4b6ac33e382eca9d331c87770faa3eef562aeb2"},
-    {file = "tomli-2.2.1-cp313-cp313-musllinux_1_2_x86_64.whl", hash = "sha256:a198f10c4d1b1375d7687bc25294306e551bf1abfa4eace6650070a5c1ae2744"},
-    {file = "tomli-2.2.1-cp313-cp313-win32.whl", hash = "sha256:d3f5614314d758649ab2ab3a62d4f2004c825922f9e370b29416484086b264ec"},
-    {file = "tomli-2.2.1-cp313-cp313-win_amd64.whl", hash = "sha256:a38aa0308e754b0e3c67e344754dff64999ff9b513e691d0e786265c93583c69"},
-    {file = "tomli-2.2.1-py3-none-any.whl", hash = "sha256:cb55c73c5f4408779d0cf3eef9f762b9c9f147a77de7b258bef0a5628adc85cc"},
-    {file = "tomli-2.2.1.tar.gz", hash = "sha256:cd45e1dc79c835ce60f7404ec8119f2eb06d38b1deba146f07ced3bbc44505ff"},
+    {file = "tomlkit-0.14.0-py3-none-any.whl", hash = "sha256:592064ed85b40fa213469f81ac584f67a4f2992509a7c3ea2d632208623a3680"},
+    {file = "tomlkit-0.14.0.tar.gz", hash = "sha256:cf00efca415dbd57575befb1f6634c4f42d2d87dbba376128adb42c121b87064"},
 ]
 
 [[package]]
 name = "torch"
-version = "2.2.2"
+version = "2.6.0+cu126"
 description = "Tensors and Dynamic neural networks in Python with strong GPU acceleration"
 optional = false
-python-versions = ">=3.8.0"
-groups = ["main"]
-files = [
-    {file = "torch-2.2.2-cp310-cp310-manylinux1_x86_64.whl", hash = "sha256:bc889d311a855dd2dfd164daf8cc903a6b7273a747189cebafdd89106e4ad585"},
-    {file = "torch-2.2.2-cp310-cp310-manylinux2014_aarch64.whl", hash = "sha256:15dffa4cc3261fa73d02f0ed25f5fa49ecc9e12bf1ae0a4c1e7a88bbfaad9030"},
-    {file = "torch-2.2.2-cp310-cp310-win_amd64.whl", hash = "sha256:11e8fe261233aeabd67696d6b993eeb0896faa175c6b41b9a6c9f0334bdad1c5"},
-    {file = "torch-2.2.2-cp310-none-macosx_10_9_x86_64.whl", hash = "sha256:b2e2200b245bd9f263a0d41b6a2dab69c4aca635a01b30cca78064b0ef5b109e"},
-    {file = "torch-2.2.2-cp310-none-macosx_11_0_arm64.whl", hash = "sha256:877b3e6593b5e00b35bbe111b7057464e76a7dd186a287280d941b564b0563c2"},
-    {file = "torch-2.2.2-cp311-cp311-manylinux1_x86_64.whl", hash = "sha256:ad4c03b786e074f46606f4151c0a1e3740268bcf29fbd2fdf6666d66341c1dcb"},
-    {file = "torch-2.2.2-cp311-cp311-manylinux2014_aarch64.whl", hash = "sha256:32827fa1fbe5da8851686256b4cd94cc7b11be962862c2293811c94eea9457bf"},
-    {file = "torch-2.2.2-cp311-cp311-win_amd64.whl", hash = "sha256:f9ef0a648310435511e76905f9b89612e45ef2c8b023bee294f5e6f7e73a3e7c"},
-    {file = "torch-2.2.2-cp311-none-macosx_10_9_x86_64.whl", hash = "sha256:95b9b44f3bcebd8b6cd8d37ec802048c872d9c567ba52c894bba90863a439059"},
-    {file = "torch-2.2.2-cp311-none-macosx_11_0_arm64.whl", hash = "sha256:49aa4126ede714c5aeef7ae92969b4b0bbe67f19665106463c39f22e0a1860d1"},
-    {file = "torch-2.2.2-cp312-cp312-manylinux1_x86_64.whl", hash = "sha256:cf12cdb66c9c940227ad647bc9cf5dba7e8640772ae10dfe7569a0c1e2a28aca"},
-    {file = "torch-2.2.2-cp312-cp312-manylinux2014_aarch64.whl", hash = "sha256:89ddac2a8c1fb6569b90890955de0c34e1724f87431cacff4c1979b5f769203c"},
-    {file = "torch-2.2.2-cp312-cp312-win_amd64.whl", hash = "sha256:451331406b760f4b1ab298ddd536486ab3cfb1312614cfe0532133535be60bea"},
-    {file = "torch-2.2.2-cp312-none-macosx_10_9_x86_64.whl", hash = "sha256:eb4d6e9d3663e26cd27dc3ad266b34445a16b54908e74725adb241aa56987533"},
-    {file = "torch-2.2.2-cp312-none-macosx_11_0_arm64.whl", hash = "sha256:bf9558da7d2bf7463390b3b2a61a6a3dbb0b45b161ee1dd5ec640bf579d479fc"},
-    {file = "torch-2.2.2-cp38-cp38-manylinux1_x86_64.whl", hash = "sha256:cd2bf7697c9e95fb5d97cc1d525486d8cf11a084c6af1345c2c2c22a6b0029d0"},
-    {file = "torch-2.2.2-cp38-cp38-manylinux2014_aarch64.whl", hash = "sha256:b421448d194496e1114d87a8b8d6506bce949544e513742b097e2ab8f7efef32"},
-    {file = "torch-2.2.2-cp38-cp38-win_amd64.whl", hash = "sha256:3dbcd563a9b792161640c0cffe17e3270d85e8f4243b1f1ed19cca43d28d235b"},
-    {file = "torch-2.2.2-cp38-none-macosx_10_9_x86_64.whl", hash = "sha256:31f4310210e7dda49f1fb52b0ec9e59382cfcb938693f6d5378f25b43d7c1d29"},
-    {file = "torch-2.2.2-cp38-none-macosx_11_0_arm64.whl", hash = "sha256:c795feb7e8ce2e0ef63f75f8e1ab52e7fd5e1a4d7d0c31367ade1e3de35c9e95"},
-    {file = "torch-2.2.2-cp39-cp39-manylinux1_x86_64.whl", hash = "sha256:a6e5770d68158d07456bfcb5318b173886f579fdfbf747543901ce718ea94782"},
-    {file = "torch-2.2.2-cp39-cp39-manylinux2014_aarch64.whl", hash = "sha256:67dcd726edff108e2cd6c51ff0e416fd260c869904de95750e80051358680d24"},
-    {file = "torch-2.2.2-cp39-cp39-win_amd64.whl", hash = "sha256:539d5ef6c4ce15bd3bd47a7b4a6e7c10d49d4d21c0baaa87c7d2ef8698632dfb"},
-    {file = "torch-2.2.2-cp39-none-macosx_10_9_x86_64.whl", hash = "sha256:dff696de90d6f6d1e8200e9892861fd4677306d0ef604cb18f2134186f719f82"},
-    {file = "torch-2.2.2-cp39-none-macosx_11_0_arm64.whl", hash = "sha256:3a4dd910663fd7a124c056c878a52c2b0be4a5a424188058fe97109d4436ee42"},
+python-versions = ">=3.9.0"
+groups = ["main", "training"]
+files = [
+    {file = "torch-2.6.0+cu126-cp310-cp310-linux_aarch64.whl", hash = "sha256:48775b8544e6705aa72256117f33c5f0c3c1ab51cb7abef1989dcfc3cf2e6500"},
+    {file = "torch-2.6.0+cu126-cp310-cp310-manylinux_2_28_x86_64.whl", hash = "sha256:c55280b4da58e565d8a25e0e844dc27d0c96aaada7b90b4de70a45397faf604e"},
+    {file = "torch-2.6.0+cu126-cp310-cp310-win_amd64.whl", hash = "sha256:eda7768f0a2ad9da3513abf60ff5c13049e7e2ec74ed4cfcd4736a8523ab1f89"},
+    {file = "torch-2.6.0+cu126-cp311-cp311-linux_aarch64.whl", hash = "sha256:d4809b188f5c9b9753f7578085b79ae1f5d9c36a3fffc122e83e446ecf251325"},
+    {file = "torch-2.6.0+cu126-cp311-cp311-manylinux_2_28_x86_64.whl", hash = "sha256:cd3b15819315bd44d34e6fa56a8f6f64192608de17da112ec0cd6cd5fc1781f3"},
+    {file = "torch-2.6.0+cu126-cp311-cp311-win_amd64.whl", hash = "sha256:5ddca43b81c64df8ce0c59260566e648ee46b2622ab6a718e38dea3c0ca059a1"},
+    {file = "torch-2.6.0+cu126-cp312-cp312-linux_aarch64.whl", hash = "sha256:993e0e99c472df1d2746c3233ef8e88d992904fe75b8996a2c15439c43ff46c4"},
+    {file = "torch-2.6.0+cu126-cp312-cp312-manylinux_2_28_x86_64.whl", hash = "sha256:6bc5b9126daa3ac1e4d920b731da9f9503ff1f56204796de124e080f5cc3570e"},
+    {file = "torch-2.6.0+cu126-cp312-cp312-win_amd64.whl", hash = "sha256:b10c39c83e5d1afd639b5c9f5683b351e97e41390a93f59c59187004a9949924"},
+    {file = "torch-2.6.0+cu126-cp313-cp313-linux_aarch64.whl", hash = "sha256:e7913d9dcca60d352b296adf566ae9bb84c9e4d27414cf070b78a84c0a0ceb20"},
+    {file = "torch-2.6.0+cu126-cp313-cp313-manylinux_2_28_x86_64.whl", hash = "sha256:2356c759696f4e296a7a08e8146c6381ccf2da40990fe400264b189a8a6c4bab"},
+    {file = "torch-2.6.0+cu126-cp313-cp313-win_amd64.whl", hash = "sha256:a1ce724eb9813fcd05b99cb8b652b2d02f447caba65f1469abd7d50af5e5323f"},
+    {file = "torch-2.6.0+cu126-cp313-cp313t-linux_aarch64.whl", hash = "sha256:e38a2564b15fba3fd8cb24d03d165b86a80fe3681b7207be5e500b100e19893c"},
+    {file = "torch-2.6.0+cu126-cp313-cp313t-manylinux_2_28_x86_64.whl", hash = "sha256:90d9c64ab8961595e05d4816e7190f38d8a1cd9931909a669da7bc398b9bc26b"},
+    {file = "torch-2.6.0+cu126-cp39-cp39-linux_aarch64.whl", hash = "sha256:2eea662d2d4ba57db2117d510c1baa47f49b1f327f9e91cf3a29d38f298d7f21"},
+    {file = "torch-2.6.0+cu126-cp39-cp39-manylinux_2_28_x86_64.whl", hash = "sha256:eccdaa0908f91321f34d37d7286843ff7b32a8e187fdc61c97f8a895e636b19f"},
+    {file = "torch-2.6.0+cu126-cp39-cp39-win_amd64.whl", hash = "sha256:57ce9f680a4fe2ea0ecc0085e165fdedd2b333b34b6099b054b966d2ba169787"},
 ]
 
 [package.dependencies]
@@ -5457,24 +5050,18 @@ filelock = "*"
 fsspec = "*"
 jinja2 = "*"
 networkx = "*"
-nvidia-cublas-cu12 = {version = "12.1.3.1", markers = "platform_system == \"Linux\" and platform_machine == \"x86_64\""}
-nvidia-cuda-cupti-cu12 = {version = "12.1.105", markers = "platform_system == \"Linux\" and platform_machine == \"x86_64\""}
-nvidia-cuda-nvrtc-cu12 = {version = "12.1.105", markers = "platform_system == \"Linux\" and platform_machine == \"x86_64\""}
-nvidia-cuda-runtime-cu12 = {version = "12.1.105", markers = "platform_system == \"Linux\" and platform_machine == \"x86_64\""}
-nvidia-cudnn-cu12 = {version = "8.9.2.26", markers = "platform_system == \"Linux\" and platform_machine == \"x86_64\""}
-nvidia-cufft-cu12 = {version = "11.0.2.54", markers = "platform_system == \"Linux\" and platform_machine == \"x86_64\""}
-nvidia-curand-cu12 = {version = "10.3.2.106", markers = "platform_system == \"Linux\" and platform_machine == \"x86_64\""}
-nvidia-cusolver-cu12 = {version = "11.4.5.107", markers = "platform_system == \"Linux\" and platform_machine == \"x86_64\""}
-nvidia-cusparse-cu12 = {version = "12.1.0.106", markers = "platform_system == \"Linux\" and platform_machine == \"x86_64\""}
-nvidia-nccl-cu12 = {version = "2.19.3", markers = "platform_system == \"Linux\" and platform_machine == \"x86_64\""}
-nvidia-nvtx-cu12 = {version = "12.1.105", markers = "platform_system == \"Linux\" and platform_machine == \"x86_64\""}
-sympy = "*"
-triton = {version = "2.2.0", markers = "platform_system == \"Linux\" and platform_machine == \"x86_64\" and python_version < \"3.12\""}
-typing-extensions = ">=4.8.0"
+setuptools = {version = "*", markers = "python_version >= \"3.12\""}
+sympy = {version = "1.13.1", markers = "python_version >= \"3.9\""}
+typing-extensions = ">=4.10.0"
 
 [package.extras]
 opt-einsum = ["opt-einsum (>=3.3)"]
-optree = ["optree (>=0.9.1)"]
+optree = ["optree (>=0.13.0)"]
+
+[package.source]
+type = "legacy"
+url = "https://download.pytorch.org/whl/cu126"
+reference = "pytorch-cu126"
 
 [[package]]
 name = "torch-optimi"
@@ -5499,46 +5086,72 @@ test = ["numpy (>=1.23)", "pytest (>=8.1.1)", "pytest-md (>=0.2.0)", "ruff (>=0.
 
 [[package]]
 name = "torchao"
-version = "0.8.0"
+version = "0.15.0"
 description = "Package for applying ao techniques to GPU models"
 optional = false
 python-versions = "*"
 groups = ["main"]
 files = [
-    {file = "torchao-0.8.0-cp39-abi3-manylinux_2_17_x86_64.manylinux2014_x86_64.manylinux_2_28_x86_64.whl", hash = "sha256:372c071869d4ebca527055bb2e5d889ca6762fcadee1c562609a47ff48fc14da"},
-    {file = "torchao-0.8.0-py3-none-any.whl", hash = "sha256:ae0640aae719f041eb3a814d0a03fcfe504cf40a9de58daca656933136ca70f4"},
+    {file = "torchao-0.15.0-cp310-abi3-manylinux_2_24_x86_64.manylinux_2_28_x86_64.whl", hash = "sha256:1cbe813201314ba6329a650a76944502f3e8ec4b1b44523f3f48676810d8d1f6"},
+    {file = "torchao-0.15.0-py3-none-any.whl", hash = "sha256:3f3812676048ef8a2a0e9d492d12d8971ba7a7ebb16f54aa56f690414e130d2c"},
+]
+
+[package.extras]
+dev = ["bitsandbytes", "blobfile", "cmake (>=3.19.0,<4.0.0)", "diskcache", "expecttest", "fire", "hypothesis", "importlib_metadata", "lm_eval", "matplotlib", "ninja", "packaging", "pandas", "parameterized", "pre-commit", "pycocotools", "pytest (==8.4.2)", "ruff (==0.11.6)", "sentencepiece", "tabulate", "tiktoken", "tqdm", "transformers", "unittest-xml-reporting"]
+
+[[package]]
+name = "torchcodec"
+version = "0.2.1"
+description = "A video decoder for PyTorch"
+optional = true
+python-versions = ">=3.8"
+groups = ["main"]
+markers = "extra == \"video-fast\""
+files = [
+    {file = "TorchCodec-0.2.1-cp310-cp310-macosx_11_0_arm64.whl", hash = "sha256:939d6ab021d3d43c2b5d439f3e5b69ef5a061cbaccffc7f4a2c9d2eedd026087"},
+    {file = "TorchCodec-0.2.1-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl", hash = "sha256:5c2d42f923701f54d642e44442d77b70f19ade53b92381264213ec6036da8fae"},
+    {file = "TorchCodec-0.2.1-cp311-cp311-macosx_11_0_arm64.whl", hash = "sha256:3621786b6eb1e9129c08012a2be32ded828e9c71614a267b874c9932f75874d8"},
+    {file = "TorchCodec-0.2.1-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl", hash = "sha256:540fa2acfcd0411b7925d95b9cb179f4eb9a08f0547fc4fb2f17f596dd834321"},
+    {file = "TorchCodec-0.2.1-cp312-cp312-macosx_11_0_arm64.whl", hash = "sha256:d4b194bfd3f8cc77986e327c8a13d4eb86ef1eba860096e81117cd6b9cc64960"},
+    {file = "TorchCodec-0.2.1-cp312-cp312-manylinux_2_17_x86_64.manylinux2014_x86_64.whl", hash = "sha256:0735917480efe7c7b7ce3f1a7ccc9832faf43d085cce75cdd031fd7f8f14cbb9"},
+    {file = "TorchCodec-0.2.1-cp313-cp313-macosx_11_0_arm64.whl", hash = "sha256:c3a001722d1e9f11d867a4ebf7e428efbb0f1f699909c22f99f91a5faba35e9f"},
+    {file = "TorchCodec-0.2.1-cp313-cp313-manylinux_2_17_x86_64.manylinux2014_x86_64.whl", hash = "sha256:deba4158f7be4744cfb9bd2b178230f20c45a5901c2c2b2fd08d0b2a180d3b65"},
+    {file = "TorchCodec-0.2.1-cp39-cp39-macosx_11_0_arm64.whl", hash = "sha256:7e9700bbe8d5cf479676e7c86ba1d391882d5038cea8cff9be5f2eb3e6061a28"},
+    {file = "TorchCodec-0.2.1-cp39-cp39-manylinux_2_17_x86_64.manylinux2014_x86_64.whl", hash = "sha256:3e566a61f9f2158c0e7bd7d7e7607053befa62664fb8ba967ca5ecdf38cb4783"},
 ]
 
 [package.extras]
-dev = ["bitsandbytes", "blobfile", "diskcache", "expecttest", "fire", "hypothesis", "importlib-metadata", "lm-eval", "matplotlib", "ninja", "packaging", "pandas", "parameterized", "pre-commit", "pycocotools", "pytest (==7.4.0)", "ruff (==0.6.8)", "sentencepiece", "tabulate", "tiktoken", "tqdm", "transformers", "unittest-xml-reporting"]
+dev = ["numpy", "pillow", "pytest"]
 
 [[package]]
 name = "torchmetrics"
-version = "1.6.1"
+version = "1.9.0"
 description = "PyTorch native Metrics"
 optional = false
-python-versions = ">=3.9"
-groups = ["main"]
+python-versions = ">=3.10"
+groups = ["training"]
 files = [
-    {file = "torchmetrics-1.6.1-py3-none-any.whl", hash = "sha256:c3090aa2341129e994c0a659abb6d4140ae75169a6ebf45bffc16c5cb553b38e"},
-    {file = "torchmetrics-1.6.1.tar.gz", hash = "sha256:a5dc236694b392180949fdd0a0fcf2b57135c8b600e557c725e077eb41e53e64"},
+    {file = "torchmetrics-1.9.0-py3-none-any.whl", hash = "sha256:bfdcbff3dd1d96b3374bb2496eb39f23c4b28b8a845b6a18c313688e0d2d9ca1"},
+    {file = "torchmetrics-1.9.0.tar.gz", hash = "sha256:a488609948600df52d3db4fcdab02e62aab2a85ef34da67037dc3e65b8512faa"},
 ]
 
 [package.dependencies]
-lightning-utilities = ">=0.8.0"
+lightning-utilities = ">=0.15.3"
 numpy = ">1.20.0"
 packaging = ">17.1"
 torch = ">=2.0.0"
 
 [package.extras]
-all = ["SciencePlots (>=2.0.0)", "gammatone (>=1.0.0)", "ipadic (>=1.0.0)", "librosa (>=0.10.0)", "matplotlib (>=3.6.0)", "mecab-python3 (>=1.0.6)", "mypy (==1.14.0)", "nltk (>3.8.1)", "numpy (<2.0)", "onnxruntime (>=1.12.0)", "pesq (>=0.0.4)", "piq (<=0.8.0)", "pycocotools (>2.0.0)", "pystoi (>=0.4.0)", "regex (>=2021.9.24)", "requests (>=2.19.0)", "scipy (>1.0.0)", "sentencepiece (>=0.2.0)", "torch (==2.5.1)", "torch-fidelity (<=0.4.0)", "torchaudio (>=2.0.1)", "torchvision (>=0.15.1)", "torchvision (>=0.15.1)", "tqdm (<4.68.0)", "transformers (>4.4.0)", "transformers (>=4.42.3)", "types-PyYAML", "types-emoji", "types-protobuf", "types-requests", "types-setuptools", "types-six", "types-tabulate"]
-audio = ["gammatone (>=1.0.0)", "librosa (>=0.10.0)", "numpy (<2.0)", "onnxruntime (>=1.12.0)", "pesq (>=0.0.4)", "pystoi (>=0.4.0)", "requests (>=2.19.0)", "torchaudio (>=2.0.1)"]
+all = ["SciencePlots (>=2.0.0)", "einops (>=0.7.0)", "einops (>=0.7.0)", "gammatone (>=1.0.0)", "ipadic (>=1.0.0)", "librosa (>=0.10.0)", "matplotlib (>=3.6.0)", "mecab-python3 (>=1.0.6)", "mypy (==1.17.1)", "nltk (>3.8.1)", "onnxruntime (>=1.12.0)", "pesq (>=0.0.4)", "piq (<=0.8.0)", "pycocotools (>2.0.0)", "pystoi (>=0.4.0)", "regex (>=2021.9.24)", "requests (>=2.22.0)", "scipy (>1.0.0)", "sentencepiece (>=0.2.0)", "timm (>=0.9.0)", "torch (==2.8.0)", "torch-fidelity (<=0.4.0)", "torch_linear_assignment (>=0.0.2)", "torchaudio (>=2.0.1)", "torchvision (>=0.15.1)", "torchvision (>=0.15.1)", "tqdm (<4.68.0)", "transformers (>=4.43.0)", "transformers (>=4.43.0)", "types-PyYAML", "types-emoji", "types-protobuf", "types-requests", "types-setuptools", "types-six", "types-tabulate", "vmaf-torch (>=1.1.0)"]
+audio = ["gammatone (>=1.0.0)", "librosa (>=0.10.0)", "onnxruntime (>=1.12.0)", "pesq (>=0.0.4)", "pystoi (>=0.4.0)", "requests (>=2.22.0)", "torchaudio (>=2.0.1)"]
+clustering = ["torch_linear_assignment (>=0.0.2)"]
 detection = ["pycocotools (>2.0.0)", "torchvision (>=0.15.1)"]
-dev = ["PyTDC (==0.4.1) ; python_version < \"3.12\"", "SciencePlots (>=2.0.0)", "bert_score (==0.3.13)", "dython (==0.7.6) ; python_version < \"3.9\"", "dython (>=0.7.8,<0.8.0) ; python_version > \"3.8\"", "fairlearn", "fast-bss-eval (>=0.1.0)", "faster-coco-eval (>=1.6.3)", "gammatone (>=1.0.0)", "huggingface-hub (<0.28)", "ipadic (>=1.0.0)", "jiwer (>=2.3.0)", "kornia (>=0.6.7)", "librosa (>=0.10.0)", "lpips (<=0.1.4)", "matplotlib (>=3.6.0)", "mecab-ko (>=1.0.0,<1.1.0) ; python_version < \"3.12\"", "mecab-ko-dic (>=1.0.0) ; python_version < \"3.12\"", "mecab-python3 (>=1.0.6)", "mir-eval (>=0.6)", "monai (==1.3.2) ; python_version < \"3.9\"", "monai (==1.4.0) ; python_version > \"3.8\"", "mypy (==1.14.0)", "netcal (>1.0.0)", "nltk (>3.8.1)", "numpy (<2.0)", "numpy (<2.3.0)", "onnxruntime (>=1.12.0)", "pandas (>1.4.0)", "permetrics (==2.0.0)", "pesq (>=0.0.4)", "piq (<=0.8.0)", "pycocotools (>2.0.0)", "pystoi (>=0.4.0)", "pytorch-msssim (==1.0.0)", "regex (>=2021.9.24)", "requests (>=2.19.0)", "rouge-score (>0.1.0)", "sacrebleu (>=2.3.0)", "scikit-image (>=0.19.0)", "scipy (>1.0.0)", "scipy (>1.0.0)", "sentencepiece (>=0.2.0)", "sewar (>=0.4.4)", "statsmodels (>0.13.5)", "torch (==2.5.1)", "torch-fidelity (<=0.4.0)", "torch_complex (<0.5.0)", "torchaudio (>=2.0.1)", "torchvision (>=0.15.1)", "torchvision (>=0.15.1)", "tqdm (<4.68.0)", "transformers (>4.4.0)", "transformers (>=4.42.3)", "types-PyYAML", "types-emoji", "types-protobuf", "types-requests", "types-setuptools", "types-six", "types-tabulate"]
+dev = ["PyTDC (==0.4.1) ; platform_system == \"Windows\" and python_version < \"3.12\"", "SciencePlots (>=2.0.0)", "aeon (>=1.0.0) ; python_version > \"3.10\"", "bert_score (==0.3.13)", "dists-pytorch (==0.1)", "dython (==0.7.9)", "einops (>=0.7.0)", "einops (>=0.7.0)", "fairlearn", "fast-bss-eval (>=0.1.0)", "faster-coco-eval (>=1.6.3)", "gammatone (>=1.0.0)", "huggingface-hub (<0.35)", "ipadic (>=1.0.0)", "jiwer (>=2.3.0)", "kornia (>=0.6.7)", "librosa (>=0.10.0)", "lpips (<=0.1.4)", "matplotlib (>=3.6.0)", "mecab-ko (>=1.0.0,<1.1.0) ; python_version < \"3.12\"", "mecab-ko-dic (>=1.0.0) ; python_version < \"3.12\"", "mecab-python3 (>=1.0.6)", "mir-eval (>=0.6)", "monai (==1.4.0)", "mypy (==1.17.1)", "netcal (>1.0.0)", "nltk (>3.8.1)", "numpy (<2.4.0)", "onnxruntime (>=1.12.0)", "pandas (>1.4.0)", "permetrics (==2.0.0)", "pesq (>=0.0.4)", "piq (<=0.8.0)", "properscoring (==0.1)", "pycocotools (>2.0.0)", "pystoi (>=0.4.0)", "pytorch-msssim (==1.0.0)", "regex (>=2021.9.24)", "requests (>=2.22.0)", "rouge-score (>0.1.0)", "sacrebleu (>=2.3.0)", "scikit-image (>=0.19.0)", "scipy (>1.0.0)", "scipy (>1.0.0)", "sentencepiece (>=0.2.0)", "setuptools (<82.0.0)", "sewar (>=0.4.4)", "statsmodels (>0.13.5)", "timm (>=0.9.0)", "torch (==2.8.0)", "torch-fidelity (<=0.4.0)", "torch_complex (<0.5.0)", "torch_linear_assignment (>=0.0.2)", "torchaudio (>=2.0.1)", "torchvision (>=0.15.1)", "torchvision (>=0.15.1)", "tqdm (<4.68.0)", "transformers (>=4.43.0)", "transformers (>=4.43.0)", "types-PyYAML", "types-emoji", "types-protobuf", "types-requests", "types-setuptools", "types-six", "types-tabulate", "vmaf-torch (>=1.1.0)"]
 image = ["scipy (>1.0.0)", "torch-fidelity (<=0.4.0)", "torchvision (>=0.15.1)"]
-multimodal = ["piq (<=0.8.0)", "transformers (>=4.42.3)"]
-text = ["ipadic (>=1.0.0)", "mecab-python3 (>=1.0.6)", "nltk (>3.8.1)", "regex (>=2021.9.24)", "sentencepiece (>=0.2.0)", "tqdm (<4.68.0)", "transformers (>4.4.0)"]
-typing = ["mypy (==1.14.0)", "torch (==2.5.1)", "types-PyYAML", "types-emoji", "types-protobuf", "types-requests", "types-setuptools", "types-six", "types-tabulate"]
+multimodal = ["einops (>=0.7.0)", "piq (<=0.8.0)", "timm (>=0.9.0)", "transformers (>=4.43.0)"]
+text = ["ipadic (>=1.0.0)", "mecab-python3 (>=1.0.6)", "nltk (>3.8.1)", "regex (>=2021.9.24)", "sentencepiece (>=0.2.0)", "tqdm (<4.68.0)", "transformers (>=4.43.0)"]
+typing = ["mypy (==1.17.1)", "torch (==2.8.0)", "types-PyYAML", "types-emoji", "types-protobuf", "types-requests", "types-setuptools", "types-six", "types-tabulate"]
+video = ["einops (>=0.7.0)", "vmaf-torch (>=1.1.0)"]
 visual = ["SciencePlots (>=2.0.0)", "matplotlib (>=3.6.0)"]
 
 [[package]]
@@ -5561,54 +5174,45 @@ trampoline = ">=0.1.2"
 
 [[package]]
 name = "torchvision"
-version = "0.17.2"
+version = "0.21.0+cu126"
 description = "image and video datasets and models for torch deep learning"
 optional = false
-python-versions = ">=3.8"
+python-versions = ">=3.9"
 groups = ["main"]
 files = [
-    {file = "torchvision-0.17.2-cp310-cp310-macosx_10_13_x86_64.whl", hash = "sha256:1f2910fe3c21ad6875b2720d46fad835b2e4b336e9553d31ca364d24c90b1d4f"},
-    {file = "torchvision-0.17.2-cp310-cp310-macosx_11_0_arm64.whl", hash = "sha256:ecc1c503fa8a54fbab777e06a7c228032b8ab78efebf35b28bc8f22f544f51f1"},
-    {file = "torchvision-0.17.2-cp310-cp310-manylinux1_x86_64.whl", hash = "sha256:f400145fc108833e7c2fc28486a04989ca742146d7a2a2cc48878ebbb40cdbbd"},
-    {file = "torchvision-0.17.2-cp310-cp310-manylinux2014_aarch64.whl", hash = "sha256:e9e4bed404af33dfc92eecc2b513d21ddc4c242a7fd8708b3b09d3a26aa6f444"},
-    {file = "torchvision-0.17.2-cp310-cp310-win_amd64.whl", hash = "sha256:ba2e62f233eab3d42b648c122a3a29c47cc108ca314dfd5cbb59cd3a143fd623"},
-    {file = "torchvision-0.17.2-cp311-cp311-macosx_10_13_x86_64.whl", hash = "sha256:9b83e55ee7d0a1704f52b9c0ac87388e7a6d1d98a6bde7b0b35f9ab54d7bda54"},
-    {file = "torchvision-0.17.2-cp311-cp311-macosx_11_0_arm64.whl", hash = "sha256:e031004a1bc432c980a7bd642f6c189a3efc316e423fc30b5569837166a4e28d"},
-    {file = "torchvision-0.17.2-cp311-cp311-manylinux1_x86_64.whl", hash = "sha256:3bbc24b7713e8f22766992562547d8b4b10001208d372fe599255af84bfd1a69"},
-    {file = "torchvision-0.17.2-cp311-cp311-manylinux2014_aarch64.whl", hash = "sha256:833fd2e4216ced924c8aca0525733fe727f9a1af66dfad7c5be7257e97c39678"},
-    {file = "torchvision-0.17.2-cp311-cp311-win_amd64.whl", hash = "sha256:6835897df852fad1015e6a106c167c83848114cbcc7d86112384a973404e4431"},
-    {file = "torchvision-0.17.2-cp312-cp312-macosx_10_13_x86_64.whl", hash = "sha256:14fd1d4a033c325bdba2d03a69c3450cab6d3a625f85cc375781d9237ca5d04d"},
-    {file = "torchvision-0.17.2-cp312-cp312-macosx_11_0_arm64.whl", hash = "sha256:9c3acbebbe379af112b62b535820174277b1f3eed30df264a4e458d58ee4e5b2"},
-    {file = "torchvision-0.17.2-cp312-cp312-manylinux1_x86_64.whl", hash = "sha256:77d680adf6ce367166a186d2c7fda3a73807ab9a03b2c31a03fa8812c8c5335b"},
-    {file = "torchvision-0.17.2-cp312-cp312-manylinux2014_aarch64.whl", hash = "sha256:f1c9ab3152cfb27f83aca072cac93a3a4c4e4ab0261cf0f2d516b9868a4e96f3"},
-    {file = "torchvision-0.17.2-cp312-cp312-win_amd64.whl", hash = "sha256:3f784381419f3ed3f2ec2aa42fb4aeec5bf4135e298d1631e41c926e6f1a0dff"},
-    {file = "torchvision-0.17.2-cp38-cp38-macosx_10_13_x86_64.whl", hash = "sha256:b83aac8d78f48981146d582168d75b6c947cfb0a7693f76e219f1926f6e595a3"},
-    {file = "torchvision-0.17.2-cp38-cp38-macosx_11_0_arm64.whl", hash = "sha256:1ece40557e122d79975860a005aa7e2a9e2e6c350a03e78a00ec1450083312fd"},
-    {file = "torchvision-0.17.2-cp38-cp38-manylinux1_x86_64.whl", hash = "sha256:32dbeba3987e20f2dc1bce8d1504139fff582898346dfe8ad98d649f97ca78fa"},
-    {file = "torchvision-0.17.2-cp38-cp38-manylinux2014_aarch64.whl", hash = "sha256:35ba5c1600c3203549d2316422a659bd20c0cfda1b6085eec94fb9f35f55ca43"},
-    {file = "torchvision-0.17.2-cp38-cp38-win_amd64.whl", hash = "sha256:2f69570f50b1d195e51bc03feffb7b7728207bc36efcfb1f0813712b2379d881"},
-    {file = "torchvision-0.17.2-cp39-cp39-macosx_10_13_x86_64.whl", hash = "sha256:4868bbfa55758c8107e69a0e7dd5e77b89056035cd38b767ad5b98cdb71c0f0d"},
-    {file = "torchvision-0.17.2-cp39-cp39-macosx_11_0_arm64.whl", hash = "sha256:efd6d0dd0668e15d01a2cffadc74068433b32cbcf5692e0c4aa15fc5cb250ce7"},
-    {file = "torchvision-0.17.2-cp39-cp39-manylinux1_x86_64.whl", hash = "sha256:7dc85b397f6c6d9ef12716ce0d6e11ac2b803f5cccff6fe3966db248e7774478"},
-    {file = "torchvision-0.17.2-cp39-cp39-manylinux2014_aarch64.whl", hash = "sha256:d506854c5acd69b20a8b6641f01fe841685a21c5406b56813184f1c9fc94279e"},
-    {file = "torchvision-0.17.2-cp39-cp39-win_amd64.whl", hash = "sha256:067095e87a020a7a251ac1d38483aa591c5ccb81e815527c54db88a982fc9267"},
+    {file = "torchvision-0.21.0+cu126-cp310-cp310-linux_x86_64.whl", hash = "sha256:db4369a89b866b319c8dd73931c3e5f314aa535f7035ae2336ce9a26d7ace15a"},
+    {file = "torchvision-0.21.0+cu126-cp310-cp310-win_amd64.whl", hash = "sha256:d6b23af252e8f4fc923d57efeab5aad7a33b6e15a72a119d576aa48ec1e0d924"},
+    {file = "torchvision-0.21.0+cu126-cp311-cp311-linux_x86_64.whl", hash = "sha256:bce6bff7ad759a4c924214af08c04a6c1f6f2d2901031bfcf67fcbaa79c08432"},
+    {file = "torchvision-0.21.0+cu126-cp311-cp311-win_amd64.whl", hash = "sha256:ddbf4516fbb7624ac42934b877dcf6a3b295d9914ab89643b55dedb9c9773ce4"},
+    {file = "torchvision-0.21.0+cu126-cp312-cp312-linux_x86_64.whl", hash = "sha256:ec1887ed3c842aa48308ea00f1442c683f7d351fb14e94b76c2072678d06ac92"},
+    {file = "torchvision-0.21.0+cu126-cp312-cp312-win_amd64.whl", hash = "sha256:600c18579cd6eae8f6bbfcc43a088bc512bfde1fa4de0587a4db1d44eaf411f9"},
+    {file = "torchvision-0.21.0+cu126-cp313-cp313-linux_x86_64.whl", hash = "sha256:ed7912ed64c110792401273ee8a9dda81fc2ef53a66a3f7b25238bc52900a987"},
+    {file = "torchvision-0.21.0+cu126-cp313-cp313-win_amd64.whl", hash = "sha256:1112ebe400eca7af30060909ceec422708b2bb5ce470489c5ffb5cf93664779b"},
+    {file = "torchvision-0.21.0+cu126-cp39-cp39-linux_x86_64.whl", hash = "sha256:a73248e1620ca08842837955efb206019c9057b05c448806eed4fd269ca29f2d"},
+    {file = "torchvision-0.21.0+cu126-cp39-cp39-win_amd64.whl", hash = "sha256:783a78d0c52545df8c6f00e1048794526681680fe66ad60145010f0b2e1049ae"},
 ]
 
 [package.dependencies]
 numpy = "*"
 pillow = ">=5.3.0,<8.3.dev0 || >=8.4.dev0"
-torch = "2.2.2"
+torch = "2.6.0"
 
 [package.extras]
+gdown = ["gdown (>=4.7.3)"]
 scipy = ["scipy"]
 
+[package.source]
+type = "legacy"
+url = "https://download.pytorch.org/whl/cu126"
+reference = "pytorch-cu126"
+
 [[package]]
 name = "tqdm"
 version = "4.66.5"
 description = "Fast, Extensible Progress Meter"
 optional = false
 python-versions = ">=3.7"
-groups = ["main"]
+groups = ["main", "training"]
 files = [
     {file = "tqdm-4.66.5-py3-none-any.whl", hash = "sha256:90279a3770753eafc9194a0364852159802111925aa30eb3f9d85b0e805ac7cd"},
     {file = "tqdm-4.66.5.tar.gz", hash = "sha256:e1020aef2e5096702d8a025ac7d16b1577279c9d63f8375b63083e9a5f0fcbad"},
@@ -5623,6 +5227,35 @@ notebook = ["ipywidgets (>=6)"]
 slack = ["slack-sdk"]
 telegram = ["requests"]
 
+[[package]]
+name = "trackio"
+version = "0.20.2"
+description = "A lightweight, local-first, and free experiment tracking library built on top of Hugging Face Datasets and Spaces."
+optional = true
+python-versions = ">=3.10"
+groups = ["main"]
+markers = "extra == \"trackio\""
+files = [
+    {file = "trackio-0.20.2-py3-none-any.whl", hash = "sha256:8934715dca1e3ab8f4e9fe0dcf837a4393ab24f7d576e5f22dd65dc139482098"},
+]
+
+[package.dependencies]
+gradio = {version = ">=6.10.0,<7.0.0", extras = ["oauth"]}
+huggingface-hub = "<2.0.0"
+numpy = "<3.0.0"
+orjson = ">=3.0,<4.0.0"
+pandas = "<3.0.0"
+pillow = "<12.0.0"
+plotly = ">=6.0.0,<7.0.0"
+pydub = "<1.0.0"
+
+[package.extras]
+apple-gpu = ["psutil (>=5.9.0)"]
+dev = ["playwright (>=1.40.0,<2.0.0)", "pytest (>=8.0.0,<9.0.0)", "pytest-playwright (>=0.7.0,<1.0.0)", "ruff (==0.9.3)"]
+gpu = ["nvidia-ml-py (>=12.0.0)", "psutil (>=5.9.0)"]
+spaces = ["pyarrow (>=21.0)"]
+tensorboard = ["tbparse (==0.0.9)", "tensorboardx (>=2.0.0,<3.0.0)"]
+
 [[package]]
 name = "trampoline"
 version = "0.1.2"
@@ -5636,170 +5269,243 @@ files = [
 
 [[package]]
 name = "transformers"
-version = "4.46.2"
+version = "4.57.6"
 description = "State-of-the-art Machine Learning for JAX, PyTorch and TensorFlow"
 optional = false
-python-versions = ">=3.8.0"
+python-versions = ">=3.9.0"
 groups = ["main"]
 files = [
-    {file = "transformers-4.46.2-py3-none-any.whl", hash = "sha256:c921f4406b78e6518c97b618c5acd1cf8a4f2315b6b727f4bf9e01496eef849c"},
-    {file = "transformers-4.46.2.tar.gz", hash = "sha256:3d85410881e1c074be767877bf33c83231ec11529f274a6044ecb20c157ba14e"},
+    {file = "transformers-4.57.6-py3-none-any.whl", hash = "sha256:4c9e9de11333ddfe5114bc872c9f370509198acf0b87a832a0ab9458e2bd0550"},
+    {file = "transformers-4.57.6.tar.gz", hash = "sha256:55e44126ece9dc0a291521b7e5492b572e6ef2766338a610b9ab5afbb70689d3"},
 ]
 
 [package.dependencies]
 filelock = "*"
-huggingface-hub = ">=0.23.2,<1.0"
+huggingface-hub = ">=0.34.0,<1.0"
 numpy = ">=1.17"
 packaging = ">=20.0"
 pyyaml = ">=5.1"
 regex = "!=2019.12.17"
 requests = "*"
-safetensors = ">=0.4.1"
-tokenizers = ">=0.20,<0.21"
+safetensors = ">=0.4.3"
+tokenizers = ">=0.22.0,<=0.23.0"
 tqdm = ">=4.27"
 
 [package.extras]
 accelerate = ["accelerate (>=0.26.0)"]
-agents = ["Pillow (>=10.0.1,<=15.0)", "accelerate (>=0.26.0)", "datasets (!=2.5.0)", "diffusers", "opencv-python", "sentencepiece (>=0.1.91,!=0.1.92)", "torch"]
-all = ["Pillow (>=10.0.1,<=15.0)", "accelerate (>=0.26.0)", "av (==9.2.0)", "codecarbon (==1.2.0)", "flax (>=0.4.1,<=0.7.0)", "jax (>=0.4.1,<=0.4.13)", "jaxlib (>=0.4.1,<=0.4.13)", "kenlm", "keras-nlp (>=0.3.1,<0.14.0)", "librosa", "onnxconverter-common", "optax (>=0.0.8,<=0.1.4)", "optuna", "phonemizer", "protobuf", "pyctcdecode (>=0.4.0)", "ray[tune] (>=2.7.0)", "scipy (<1.13.0)", "sentencepiece (>=0.1.91,!=0.1.92)", "sigopt", "tensorflow (>2.9,<2.16)", "tensorflow-text (<2.16)", "tf2onnx", "timm (<=0.9.16)", "tokenizers (>=0.20,<0.21)", "torch", "torchaudio", "torchvision"]
+all = ["Pillow (>=10.0.1,<=15.0)", "Pillow (>=10.0.1,<=15.0)", "accelerate (>=0.26.0)", "accelerate (>=0.26.0)", "av", "codecarbon (>=2.8.1)", "flax (>=0.4.1,<=0.7.0)", "jax (>=0.4.1,<=0.4.13)", "jaxlib (>=0.4.1,<=0.4.13)", "jinja2 (>=3.1.0)", "kenlm", "keras-nlp (>=0.3.1,<0.14.0)", "kernels (>=0.6.1,<=0.9)", "librosa", "mistral-common[opencv] (>=1.6.3)", "num2words", "onnxconverter-common", "optax (>=0.0.8,<=0.1.4)", "optuna", "phonemizer", "protobuf", "pyctcdecode (>=0.4.0)", "ray[tune] (>=2.7.0)", "scipy (<1.13.0)", "sentencepiece (>=0.1.91,!=0.1.92)", "tensorflow (>2.9,<2.16)", "tensorflow-text (<2.16)", "tf2onnx", "timm (!=1.0.18,<=1.0.19)", "tokenizers (>=0.22.0,<=0.23.0)", "torch (>=2.2)", "torchaudio", "torchvision"]
 audio = ["kenlm", "librosa", "phonemizer", "pyctcdecode (>=0.4.0)"]
 benchmark = ["optimum-benchmark (>=0.3.0)"]
-codecarbon = ["codecarbon (==1.2.0)"]
+chat-template = ["jinja2 (>=3.1.0)"]
+codecarbon = ["codecarbon (>=2.8.1)"]
 deepspeed = ["accelerate (>=0.26.0)", "deepspeed (>=0.9.3)"]
-deepspeed-testing = ["GitPython (<3.1.19)", "accelerate (>=0.26.0)", "beautifulsoup4", "cookiecutter (==1.7.3)", "datasets (!=2.5.0)", "deepspeed (>=0.9.3)", "dill (<0.3.5)", "evaluate (>=0.2.0)", "faiss-cpu", "nltk (<=3.8.1)", "optuna", "parameterized", "protobuf", "psutil", "pydantic", "pytest (>=7.2.0,<8.0.0)", "pytest-rich", "pytest-timeout", "pytest-xdist", "rjieba", "rouge-score (!=0.0.7,!=0.0.8,!=0.1,!=0.1.1)", "ruff (==0.5.1)", "sacrebleu (>=1.4.12,<2.0.0)", "sacremoses", "sentencepiece (>=0.1.91,!=0.1.92)", "tensorboard", "timeout-decorator"]
-dev = ["GitPython (<3.1.19)", "Pillow (>=10.0.1,<=15.0)", "accelerate (>=0.26.0)", "av (==9.2.0)", "beautifulsoup4", "codecarbon (==1.2.0)", "cookiecutter (==1.7.3)", "datasets (!=2.5.0)", "dill (<0.3.5)", "evaluate (>=0.2.0)", "faiss-cpu", "flax (>=0.4.1,<=0.7.0)", "fugashi (>=1.0)", "ipadic (>=1.0.0,<2.0)", "isort (>=5.5.4)", "jax (>=0.4.1,<=0.4.13)", "jaxlib (>=0.4.1,<=0.4.13)", "kenlm", "keras-nlp (>=0.3.1,<0.14.0)", "libcst", "librosa", "nltk (<=3.8.1)", "onnxconverter-common", "optax (>=0.0.8,<=0.1.4)", "optuna", "parameterized", "phonemizer", "protobuf", "psutil", "pyctcdecode (>=0.4.0)", "pydantic", "pytest (>=7.2.0,<8.0.0)", "pytest-rich", "pytest-timeout", "pytest-xdist", "ray[tune] (>=2.7.0)", "rhoknp (>=1.1.0,<1.3.1)", "rich", "rjieba", "rouge-score (!=0.0.7,!=0.0.8,!=0.1,!=0.1.1)", "ruff (==0.5.1)", "sacrebleu (>=1.4.12,<2.0.0)", "sacremoses", "scikit-learn", "scipy (<1.13.0)", "sentencepiece (>=0.1.91,!=0.1.92)", "sigopt", "sudachidict-core (>=20220729)", "sudachipy (>=0.6.6)", "tensorboard", "tensorflow (>2.9,<2.16)", "tensorflow-text (<2.16)", "tf2onnx", "timeout-decorator", "timm (<=0.9.16)", "tokenizers (>=0.20,<0.21)", "torch", "torchaudio", "torchvision", "unidic (>=1.0.2)", "unidic-lite (>=1.0.7)", "urllib3 (<2.0.0)"]
-dev-tensorflow = ["GitPython (<3.1.19)", "Pillow (>=10.0.1,<=15.0)", "beautifulsoup4", "cookiecutter (==1.7.3)", "datasets (!=2.5.0)", "dill (<0.3.5)", "evaluate (>=0.2.0)", "faiss-cpu", "isort (>=5.5.4)", "kenlm", "keras-nlp (>=0.3.1,<0.14.0)", "libcst", "librosa", "nltk (<=3.8.1)", "onnxconverter-common", "onnxruntime (>=1.4.0)", "onnxruntime-tools (>=1.4.2)", "parameterized", "phonemizer", "protobuf", "psutil", "pyctcdecode (>=0.4.0)", "pydantic", "pytest (>=7.2.0,<8.0.0)", "pytest-rich", "pytest-timeout", "pytest-xdist", "rich", "rjieba", "rouge-score (!=0.0.7,!=0.0.8,!=0.1,!=0.1.1)", "ruff (==0.5.1)", "sacrebleu (>=1.4.12,<2.0.0)", "sacremoses", "scikit-learn", "sentencepiece (>=0.1.91,!=0.1.92)", "tensorboard", "tensorflow (>2.9,<2.16)", "tensorflow-text (<2.16)", "tf2onnx", "timeout-decorator", "tokenizers (>=0.20,<0.21)", "urllib3 (<2.0.0)"]
-dev-torch = ["GitPython (<3.1.19)", "Pillow (>=10.0.1,<=15.0)", "accelerate (>=0.26.0)", "beautifulsoup4", "codecarbon (==1.2.0)", "cookiecutter (==1.7.3)", "datasets (!=2.5.0)", "dill (<0.3.5)", "evaluate (>=0.2.0)", "faiss-cpu", "fugashi (>=1.0)", "ipadic (>=1.0.0,<2.0)", "isort (>=5.5.4)", "kenlm", "libcst", "librosa", "nltk (<=3.8.1)", "onnxruntime (>=1.4.0)", "onnxruntime-tools (>=1.4.2)", "optuna", "parameterized", "phonemizer", "protobuf", "psutil", "pyctcdecode (>=0.4.0)", "pydantic", "pytest (>=7.2.0,<8.0.0)", "pytest-rich", "pytest-timeout", "pytest-xdist", "ray[tune] (>=2.7.0)", "rhoknp (>=1.1.0,<1.3.1)", "rich", "rjieba", "rouge-score (!=0.0.7,!=0.0.8,!=0.1,!=0.1.1)", "ruff (==0.5.1)", "sacrebleu (>=1.4.12,<2.0.0)", "sacremoses", "scikit-learn", "sentencepiece (>=0.1.91,!=0.1.92)", "sigopt", "sudachidict-core (>=20220729)", "sudachipy (>=0.6.6)", "tensorboard", "timeout-decorator", "timm (<=0.9.16)", "tokenizers (>=0.20,<0.21)", "torch", "torchaudio", "torchvision", "unidic (>=1.0.2)", "unidic-lite (>=1.0.7)", "urllib3 (<2.0.0)"]
+deepspeed-testing = ["GitPython (<3.1.19)", "accelerate (>=0.26.0)", "accelerate (>=0.26.0)", "beautifulsoup4", "cookiecutter (==1.7.3)", "datasets (>=2.15.0)", "datasets (>=2.15.0)", "deepspeed (>=0.9.3)", "dill (<0.3.5)", "evaluate (>=0.2.0)", "faiss-cpu", "fastapi", "libcst", "mistral-common[opencv] (>=1.6.3)", "nltk (<=3.8.1)", "openai (>=1.98.0)", "optuna", "parameterized (>=0.9)", "protobuf", "psutil", "pydantic (>=2)", "pydantic (>=2)", "pytest (>=7.2.0)", "pytest-asyncio", "pytest-order", "pytest-rerunfailures (<16.0)", "pytest-rich", "pytest-timeout", "pytest-xdist", "rjieba", "rouge-score (!=0.0.7,!=0.0.8,!=0.1,!=0.1.1)", "ruff (==0.13.1)", "sacrebleu (>=1.4.12,<2.0.0)", "sacremoses", "sentencepiece (>=0.1.91,!=0.1.92)", "sentencepiece (>=0.1.91,!=0.1.92)", "starlette", "tensorboard", "timeout-decorator", "torch (>=2.2)", "uvicorn"]
+dev = ["GitPython (<3.1.19)", "GitPython (<3.1.19)", "Pillow (>=10.0.1,<=15.0)", "Pillow (>=10.0.1,<=15.0)", "accelerate (>=0.26.0)", "accelerate (>=0.26.0)", "accelerate (>=0.26.0)", "av", "beautifulsoup4", "codecarbon (>=2.8.1)", "cookiecutter (==1.7.3)", "cookiecutter (==1.7.3)", "datasets (>=2.15.0)", "datasets (>=2.15.0)", "datasets (>=2.15.0)", "dill (<0.3.5)", "evaluate (>=0.2.0)", "faiss-cpu", "fastapi", "flax (>=0.4.1,<=0.7.0)", "fugashi (>=1.0)", "ipadic (>=1.0.0,<2.0)", "jax (>=0.4.1,<=0.4.13)", "jaxlib (>=0.4.1,<=0.4.13)", "jinja2 (>=3.1.0)", "kenlm", "keras-nlp (>=0.3.1,<0.14.0)", "kernels (>=0.6.1,<=0.9)", "libcst", "libcst", "librosa", "mistral-common[opencv] (>=1.6.3)", "mistral-common[opencv] (>=1.6.3)", "nltk (<=3.8.1)", "num2words", "onnxconverter-common", "openai (>=1.98.0)", "optax (>=0.0.8,<=0.1.4)", "optuna", "pandas (<2.3.0)", "parameterized (>=0.9)", "phonemizer", "protobuf", "psutil", "pyctcdecode (>=0.4.0)", "pydantic (>=2)", "pydantic (>=2)", "pytest (>=7.2.0)", "pytest-asyncio", "pytest-order", "pytest-rerunfailures (<16.0)", "pytest-rich", "pytest-timeout", "pytest-xdist", "ray[tune] (>=2.7.0)", "rhoknp (>=1.1.0,<1.3.1)", "rich", "rjieba", "rouge-score (!=0.0.7,!=0.0.8,!=0.1,!=0.1.1)", "ruff (==0.13.1)", "ruff (==0.13.1)", "sacrebleu (>=1.4.12,<2.0.0)", "sacremoses", "scikit-learn", "scipy (<1.13.0)", "sentencepiece (>=0.1.91,!=0.1.92)", "sentencepiece (>=0.1.91,!=0.1.92)", "starlette", "sudachidict_core (>=20220729)", "sudachipy (>=0.6.6)", "tensorboard", "tensorflow (>2.9,<2.16)", "tensorflow-text (<2.16)", "tf2onnx", "timeout-decorator", "timm (!=1.0.18,<=1.0.19)", "tokenizers (>=0.22.0,<=0.23.0)", "torch (>=2.2)", "torch (>=2.2)", "torchaudio", "torchvision", "unidic (>=1.0.2)", "unidic_lite (>=1.0.7)", "urllib3 (<2.0.0)", "uvicorn"]
+dev-tensorflow = ["GitPython (<3.1.19)", "GitPython (<3.1.19)", "Pillow (>=10.0.1,<=15.0)", "accelerate (>=0.26.0)", "beautifulsoup4", "cookiecutter (==1.7.3)", "cookiecutter (==1.7.3)", "datasets (>=2.15.0)", "datasets (>=2.15.0)", "datasets (>=2.15.0)", "dill (<0.3.5)", "evaluate (>=0.2.0)", "faiss-cpu", "fastapi", "kenlm", "keras-nlp (>=0.3.1,<0.14.0)", "libcst", "libcst", "librosa", "mistral-common[opencv] (>=1.6.3)", "nltk (<=3.8.1)", "onnxconverter-common", "onnxconverter-common", "onnxruntime (>=1.4.0)", "onnxruntime-tools (>=1.4.2)", "openai (>=1.98.0)", "pandas (<2.3.0)", "parameterized (>=0.9)", "phonemizer", "protobuf", "psutil", "pyctcdecode (>=0.4.0)", "pydantic (>=2)", "pydantic (>=2)", "pytest (>=7.2.0)", "pytest-asyncio", "pytest-order", "pytest-rerunfailures (<16.0)", "pytest-rich", "pytest-timeout", "pytest-xdist", "rich", "rjieba", "rouge-score (!=0.0.7,!=0.0.8,!=0.1,!=0.1.1)", "ruff (==0.13.1)", "ruff (==0.13.1)", "sacrebleu (>=1.4.12,<2.0.0)", "sacremoses", "scikit-learn", "sentencepiece (>=0.1.91,!=0.1.92)", "sentencepiece (>=0.1.91,!=0.1.92)", "starlette", "tensorboard", "tensorflow (>2.9,<2.16)", "tensorflow-text (<2.16)", "tf2onnx", "tf2onnx", "timeout-decorator", "tokenizers (>=0.22.0,<=0.23.0)", "torch (>=2.2)", "urllib3 (<2.0.0)", "uvicorn"]
+dev-torch = ["GitPython (<3.1.19)", "GitPython (<3.1.19)", "Pillow (>=10.0.1,<=15.0)", "Pillow (>=10.0.1,<=15.0)", "accelerate (>=0.26.0)", "accelerate (>=0.26.0)", "beautifulsoup4", "codecarbon (>=2.8.1)", "cookiecutter (==1.7.3)", "cookiecutter (==1.7.3)", "datasets (>=2.15.0)", "datasets (>=2.15.0)", "datasets (>=2.15.0)", "dill (<0.3.5)", "evaluate (>=0.2.0)", "faiss-cpu", "fastapi", "fugashi (>=1.0)", "ipadic (>=1.0.0,<2.0)", "kenlm", "kernels (>=0.6.1,<=0.9)", "libcst", "libcst", "librosa", "mistral-common[opencv] (>=1.6.3)", "nltk (<=3.8.1)", "num2words", "onnxruntime (>=1.4.0)", "onnxruntime-tools (>=1.4.2)", "openai (>=1.98.0)", "optuna", "pandas (<2.3.0)", "parameterized (>=0.9)", "phonemizer", "protobuf", "psutil", "pyctcdecode (>=0.4.0)", "pydantic (>=2)", "pydantic (>=2)", "pytest (>=7.2.0)", "pytest-asyncio", "pytest-order", "pytest-rerunfailures (<16.0)", "pytest-rich", "pytest-timeout", "pytest-xdist", "ray[tune] (>=2.7.0)", "rhoknp (>=1.1.0,<1.3.1)", "rich", "rjieba", "rouge-score (!=0.0.7,!=0.0.8,!=0.1,!=0.1.1)", "ruff (==0.13.1)", "ruff (==0.13.1)", "sacrebleu (>=1.4.12,<2.0.0)", "sacremoses", "scikit-learn", "sentencepiece (>=0.1.91,!=0.1.92)", "sentencepiece (>=0.1.91,!=0.1.92)", "starlette", "sudachidict_core (>=20220729)", "sudachipy (>=0.6.6)", "tensorboard", "timeout-decorator", "timm (!=1.0.18,<=1.0.19)", "tokenizers (>=0.22.0,<=0.23.0)", "torch (>=2.2)", "torch (>=2.2)", "torchaudio", "torchvision", "unidic (>=1.0.2)", "unidic_lite (>=1.0.7)", "urllib3 (<2.0.0)", "uvicorn"]
 flax = ["flax (>=0.4.1,<=0.7.0)", "jax (>=0.4.1,<=0.4.13)", "jaxlib (>=0.4.1,<=0.4.13)", "optax (>=0.0.8,<=0.1.4)", "scipy (<1.13.0)"]
 flax-speech = ["kenlm", "librosa", "phonemizer", "pyctcdecode (>=0.4.0)"]
 ftfy = ["ftfy"]
-integrations = ["optuna", "ray[tune] (>=2.7.0)", "sigopt"]
-ja = ["fugashi (>=1.0)", "ipadic (>=1.0.0,<2.0)", "rhoknp (>=1.1.0,<1.3.1)", "sudachidict-core (>=20220729)", "sudachipy (>=0.6.6)", "unidic (>=1.0.2)", "unidic-lite (>=1.0.7)"]
+hf-xet = ["hf_xet"]
+hub-kernels = ["kernels (>=0.6.1,<=0.9)"]
+integrations = ["kernels (>=0.6.1,<=0.9)", "optuna", "ray[tune] (>=2.7.0)"]
+ja = ["fugashi (>=1.0)", "ipadic (>=1.0.0,<2.0)", "rhoknp (>=1.1.0,<1.3.1)", "sudachidict_core (>=20220729)", "sudachipy (>=0.6.6)", "unidic (>=1.0.2)", "unidic_lite (>=1.0.7)"]
+mistral-common = ["mistral-common[opencv] (>=1.6.3)"]
 modelcreation = ["cookiecutter (==1.7.3)"]
 natten = ["natten (>=0.14.6,<0.15.0)"]
+num2words = ["num2words"]
 onnx = ["onnxconverter-common", "onnxruntime (>=1.4.0)", "onnxruntime-tools (>=1.4.2)", "tf2onnx"]
 onnxruntime = ["onnxruntime (>=1.4.0)", "onnxruntime-tools (>=1.4.2)"]
+open-telemetry = ["opentelemetry-api", "opentelemetry-exporter-otlp", "opentelemetry-sdk"]
 optuna = ["optuna"]
-quality = ["GitPython (<3.1.19)", "datasets (!=2.5.0)", "isort (>=5.5.4)", "libcst", "rich", "ruff (==0.5.1)", "urllib3 (<2.0.0)"]
+quality = ["GitPython (<3.1.19)", "datasets (>=2.15.0)", "libcst", "pandas (<2.3.0)", "rich", "ruff (==0.13.1)", "urllib3 (<2.0.0)"]
 ray = ["ray[tune] (>=2.7.0)"]
-retrieval = ["datasets (!=2.5.0)", "faiss-cpu"]
-ruff = ["ruff (==0.5.1)"]
+retrieval = ["datasets (>=2.15.0)", "faiss-cpu"]
+ruff = ["ruff (==0.13.1)"]
 sagemaker = ["sagemaker (>=2.31.0)"]
 sentencepiece = ["protobuf", "sentencepiece (>=0.1.91,!=0.1.92)"]
-serving = ["fastapi", "pydantic", "starlette", "uvicorn"]
+serving = ["accelerate (>=0.26.0)", "fastapi", "openai (>=1.98.0)", "pydantic (>=2)", "starlette", "torch (>=2.2)", "uvicorn"]
 sigopt = ["sigopt"]
 sklearn = ["scikit-learn"]
 speech = ["kenlm", "librosa", "phonemizer", "pyctcdecode (>=0.4.0)", "torchaudio"]
-testing = ["GitPython (<3.1.19)", "beautifulsoup4", "cookiecutter (==1.7.3)", "datasets (!=2.5.0)", "dill (<0.3.5)", "evaluate (>=0.2.0)", "faiss-cpu", "nltk (<=3.8.1)", "parameterized", "psutil", "pydantic", "pytest (>=7.2.0,<8.0.0)", "pytest-rich", "pytest-timeout", "pytest-xdist", "rjieba", "rouge-score (!=0.0.7,!=0.0.8,!=0.1,!=0.1.1)", "ruff (==0.5.1)", "sacrebleu (>=1.4.12,<2.0.0)", "sacremoses", "sentencepiece (>=0.1.91,!=0.1.92)", "tensorboard", "timeout-decorator"]
+testing = ["GitPython (<3.1.19)", "accelerate (>=0.26.0)", "beautifulsoup4", "cookiecutter (==1.7.3)", "datasets (>=2.15.0)", "datasets (>=2.15.0)", "dill (<0.3.5)", "evaluate (>=0.2.0)", "faiss-cpu", "fastapi", "libcst", "mistral-common[opencv] (>=1.6.3)", "nltk (<=3.8.1)", "openai (>=1.98.0)", "parameterized (>=0.9)", "psutil", "pydantic (>=2)", "pydantic (>=2)", "pytest (>=7.2.0)", "pytest-asyncio", "pytest-order", "pytest-rerunfailures (<16.0)", "pytest-rich", "pytest-timeout", "pytest-xdist", "rjieba", "rouge-score (!=0.0.7,!=0.0.8,!=0.1,!=0.1.1)", "ruff (==0.13.1)", "sacrebleu (>=1.4.12,<2.0.0)", "sacremoses", "sentencepiece (>=0.1.91,!=0.1.92)", "starlette", "tensorboard", "timeout-decorator", "torch (>=2.2)", "uvicorn"]
 tf = ["keras-nlp (>=0.3.1,<0.14.0)", "onnxconverter-common", "tensorflow (>2.9,<2.16)", "tensorflow-text (<2.16)", "tf2onnx"]
 tf-cpu = ["keras (>2.9,<2.16)", "keras-nlp (>=0.3.1,<0.14.0)", "onnxconverter-common", "tensorflow-cpu (>2.9,<2.16)", "tensorflow-probability (<0.24)", "tensorflow-text (<2.16)", "tf2onnx"]
 tf-speech = ["kenlm", "librosa", "phonemizer", "pyctcdecode (>=0.4.0)"]
 tiktoken = ["blobfile", "tiktoken"]
-timm = ["timm (<=0.9.16)"]
-tokenizers = ["tokenizers (>=0.20,<0.21)"]
-torch = ["accelerate (>=0.26.0)", "torch"]
+timm = ["timm (!=1.0.18,<=1.0.19)"]
+tokenizers = ["tokenizers (>=0.22.0,<=0.23.0)"]
+torch = ["accelerate (>=0.26.0)", "torch (>=2.2)"]
 torch-speech = ["kenlm", "librosa", "phonemizer", "pyctcdecode (>=0.4.0)", "torchaudio"]
 torch-vision = ["Pillow (>=10.0.1,<=15.0)", "torchvision"]
-torchhub = ["filelock", "huggingface-hub (>=0.23.2,<1.0)", "importlib-metadata", "numpy (>=1.17)", "packaging (>=20.0)", "protobuf", "regex (!=2019.12.17)", "requests", "sentencepiece (>=0.1.91,!=0.1.92)", "tokenizers (>=0.20,<0.21)", "torch", "tqdm (>=4.27)"]
-video = ["av (==9.2.0)"]
+torchhub = ["filelock", "huggingface-hub (>=0.34.0,<1.0)", "importlib_metadata", "numpy (>=1.17)", "packaging (>=20.0)", "protobuf", "regex (!=2019.12.17)", "requests", "sentencepiece (>=0.1.91,!=0.1.92)", "tokenizers (>=0.22.0,<=0.23.0)", "torch (>=2.2)", "tqdm (>=4.27)"]
+video = ["av"]
 vision = ["Pillow (>=10.0.1,<=15.0)"]
 
 [[package]]
-name = "translationstring"
-version = "1.4"
-description = "Utility library for i18n relied on by various Repoze and Pyramid packages"
-optional = false
+name = "triton"
+version = "3.2.0"
+description = "A language and compiler for custom Deep Learning operations"
+optional = true
 python-versions = "*"
 groups = ["main"]
+markers = "platform_machine == \"x86_64\" and extra == \"cuda\" and platform_system == \"Linux\""
 files = [
-    {file = "translationstring-1.4-py2.py3-none-any.whl", hash = "sha256:5f4dc4d939573db851c8d840551e1a0fb27b946afe3b95aafc22577eed2d6262"},
-    {file = "translationstring-1.4.tar.gz", hash = "sha256:bf947538d76e69ba12ab17283b10355a9ecfbc078e6123443f43f2107f6376f3"},
+    {file = "triton-3.2.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl", hash = "sha256:b3e54983cd51875855da7c68ec05c05cf8bb08df361b1d5b69e05e40b0c9bd62"},
+    {file = "triton-3.2.0-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl", hash = "sha256:8009a1fb093ee8546495e96731336a33fb8856a38e45bb4ab6affd6dbc3ba220"},
+    {file = "triton-3.2.0-cp312-cp312-manylinux_2_17_x86_64.manylinux2014_x86_64.whl", hash = "sha256:8d9b215efc1c26fa7eefb9a157915c92d52e000d2bf83e5f69704047e63f125c"},
+    {file = "triton-3.2.0-cp313-cp313-manylinux_2_17_x86_64.manylinux2014_x86_64.whl", hash = "sha256:e5dfa23ba84541d7c0a531dfce76d8bcd19159d50a4a8b14ad01e91734a5c1b0"},
+    {file = "triton-3.2.0-cp39-cp39-manylinux_2_17_x86_64.manylinux2014_x86_64.whl", hash = "sha256:30ceed0eff2c4a73b14eb63e052992f44bbdf175f3fad21e1ac8097a772de7ee"},
 ]
 
 [package.extras]
-docs = ["Sphinx (>=1.3.1)", "docutils", "pylons-sphinx-themes"]
+build = ["cmake (>=3.20)", "lit"]
+tests = ["autopep8", "flake8", "isort", "llnl-hatchet", "numpy", "pytest", "scipy (>=1.7.1)"]
+tutorials = ["matplotlib", "pandas", "tabulate"]
 
 [[package]]
-name = "triton"
-version = "2.2.0"
-description = "A language and compiler for custom Deep Learning operations"
+name = "typer"
+version = "0.26.7"
+description = "Typer, build great CLIs. Easy to code. Based on Python type hints."
 optional = false
-python-versions = "*"
+python-versions = ">=3.10"
 groups = ["main"]
-markers = "platform_system == \"Linux\" and platform_machine == \"x86_64\" and (python_version == \"3.10\" or python_version == \"3.11\")"
 files = [
-    {file = "triton-2.2.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl", hash = "sha256:a2294514340cfe4e8f4f9e5c66c702744c4a117d25e618bd08469d0bfed1e2e5"},
-    {file = "triton-2.2.0-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl", hash = "sha256:da58a152bddb62cafa9a857dd2bc1f886dbf9f9c90a2b5da82157cd2b34392b0"},
-    {file = "triton-2.2.0-cp312-cp312-manylinux_2_17_x86_64.manylinux2014_x86_64.whl", hash = "sha256:0af58716e721460a61886668b205963dc4d1e4ac20508cc3f623aef0d70283d5"},
-    {file = "triton-2.2.0-cp37-cp37m-manylinux_2_17_x86_64.manylinux2014_x86_64.whl", hash = "sha256:e8fe46d3ab94a8103e291bd44c741cc294b91d1d81c1a2888254cbf7ff846dab"},
-    {file = "triton-2.2.0-cp38-cp38-manylinux_2_17_x86_64.manylinux2014_x86_64.whl", hash = "sha256:b8ce26093e539d727e7cf6f6f0d932b1ab0574dc02567e684377630d86723ace"},
-    {file = "triton-2.2.0-cp39-cp39-manylinux_2_17_x86_64.manylinux2014_x86_64.whl", hash = "sha256:227cc6f357c5efcb357f3867ac2a8e7ecea2298cd4606a8ba1e931d1d5a947df"},
+    {file = "typer-0.26.7-py3-none-any.whl", hash = "sha256:5c87cfbc5d34491c5346ebf49c23e18d56ccb863268d3a8d592b26087c2f5e58"},
+    {file = "typer-0.26.7.tar.gz", hash = "sha256:e314a34c617e419c091b2830dda3ea1f257134ff593061a8f5b9717ab8dddb3a"},
 ]
 
 [package.dependencies]
-filelock = "*"
+annotated-doc = ">=0.0.2"
+colorama = {version = "*", markers = "platform_system == \"Windows\""}
+rich = ">=13.8.0"
+shellingham = ">=1.3.0"
 
-[package.extras]
-build = ["cmake (>=3.20)", "lit"]
-tests = ["autopep8", "flake8", "isort", "numpy", "pytest", "scipy (>=1.7.1)", "torch"]
-tutorials = ["matplotlib", "pandas", "tabulate", "torch"]
+[[package]]
+name = "types-colorama"
+version = "0.4.15.20260508"
+description = "Typing stubs for colorama"
+optional = false
+python-versions = ">=3.10"
+groups = ["dev"]
+files = [
+    {file = "types_colorama-0.4.15.20260508-py3-none-any.whl", hash = "sha256:b0c39908a3e5171ef1f8bf3d59fae082e9eaff3a19ca49b6d640b83f78cff61c"},
+    {file = "types_colorama-0.4.15.20260508.tar.gz", hash = "sha256:3a8916039e57452bd21f57e674e1f221ca9e4f319893c5e3bbd37b845c27d8e6"},
+]
+
+[[package]]
+name = "types-psutil"
+version = "7.2.2.20260518"
+description = "Typing stubs for psutil"
+optional = false
+python-versions = ">=3.10"
+groups = ["dev"]
+files = [
+    {file = "types_psutil-7.2.2.20260518-py3-none-any.whl", hash = "sha256:6a3d697665754a60d7b5a41d5a2cff12b53f5e0676d77810cd28ba5e14cb4049"},
+    {file = "types_psutil-7.2.2.20260518.tar.gz", hash = "sha256:9f825f631463a5b4d26f19f63aebc9ec25f01140d655026f3ad8a67841f9b331"},
+]
+
+[[package]]
+name = "types-requests"
+version = "2.33.0.20260518"
+description = "Typing stubs for requests"
+optional = false
+python-versions = ">=3.10"
+groups = ["dev"]
+files = [
+    {file = "types_requests-2.33.0.20260518-py3-none-any.whl", hash = "sha256:626d697d1adaaff76e2044dc8c5c051d8f21abc157bdfe204a75558076fe0bf0"},
+    {file = "types_requests-2.33.0.20260518.tar.gz", hash = "sha256:df7bd3bfe0ca8402dfb841e7d9be714bb5578203283d66d7dc4ef69343449a5e"},
+]
+
+[package.dependencies]
+urllib3 = ">=2"
+
+[[package]]
+name = "types-tqdm"
+version = "4.68.0.20260608"
+description = "Typing stubs for tqdm"
+optional = false
+python-versions = ">=3.10"
+groups = ["dev"]
+files = [
+    {file = "types_tqdm-4.68.0.20260608-py3-none-any.whl", hash = "sha256:450a6e7e9e9b604928968927c414b32970e40091213c4180e1ed470905b13eff"},
+    {file = "types_tqdm-4.68.0.20260608.tar.gz", hash = "sha256:e1dfddf8770fbc30ecaf95ae57c286397831235064308f7dfc2b1d6684a76107"},
+]
+
+[package.dependencies]
+types-requests = "*"
 
 [[package]]
 name = "typing-extensions"
-version = "4.12.2"
-description = "Backported and Experimental Type Hints for Python 3.8+"
+version = "4.15.0"
+description = "Backported and Experimental Type Hints for Python 3.9+"
 optional = false
-python-versions = ">=3.8"
-groups = ["main", "dev"]
+python-versions = ">=3.9"
+groups = ["main", "dev", "training"]
+files = [
+    {file = "typing_extensions-4.15.0-py3-none-any.whl", hash = "sha256:f0fa19c6845758ab08074a0cfa8b7aecb71c999ca73d62883bc25cc018c4e548"},
+    {file = "typing_extensions-4.15.0.tar.gz", hash = "sha256:0cea48d173cc12fa28ecabc3b837ea3cf6f38c6d1136f85cbaaf598984861466"},
+]
+
+[[package]]
+name = "typing-inspection"
+version = "0.4.2"
+description = "Runtime typing introspection tools"
+optional = false
+python-versions = ">=3.9"
+groups = ["main", "training"]
 files = [
-    {file = "typing_extensions-4.12.2-py3-none-any.whl", hash = "sha256:04e5ca0351e0f3f85c6853954072df659d0d13fac324d0072316b67d7794700d"},
-    {file = "typing_extensions-4.12.2.tar.gz", hash = "sha256:1a7ead55c7e559dd4dee8856e3a88b41225abfe1ce8df57b7c13915fe121ffb8"},
+    {file = "typing_inspection-0.4.2-py3-none-any.whl", hash = "sha256:4ed1cacbdc298c220f1bd249ed5287caa16f34d44ef4e9c3d0cbad5b521545e7"},
+    {file = "typing_inspection-0.4.2.tar.gz", hash = "sha256:ba561c48a67c5958007083d386c3295464928b01faa735ab8547c5692e87f464"},
 ]
 
+[package.dependencies]
+typing-extensions = ">=4.12.0"
+
 [[package]]
 name = "tzdata"
-version = "2024.2"
+version = "2026.2"
 description = "Provider of IANA time zone data"
 optional = false
 python-versions = ">=2"
-groups = ["main"]
+groups = ["main", "training"]
 files = [
-    {file = "tzdata-2024.2-py2.py3-none-any.whl", hash = "sha256:a48093786cdcde33cad18c2555e8532f34422074448fbc874186f0abd79565cd"},
-    {file = "tzdata-2024.2.tar.gz", hash = "sha256:7d85cc416e9382e69095b7bdf4afd9e3880418a2413feec7069d533d6b4e31cc"},
+    {file = "tzdata-2026.2-py2.py3-none-any.whl", hash = "sha256:bbe9af844f658da81a5f95019480da3a89415801f6cc966806612cc7169bffe7"},
+    {file = "tzdata-2026.2.tar.gz", hash = "sha256:9173fde7d80d9018e02a662e168e5a2d04f87c41ea174b139fbef642eda62d10"},
 ]
+markers = {main = "extra == \"trackio\""}
 
 [[package]]
 name = "urllib3"
-version = "2.3.0"
+version = "2.7.0"
 description = "HTTP library with thread-safe connection pooling, file post, and more."
 optional = false
-python-versions = ">=3.9"
-groups = ["main"]
+python-versions = ">=3.10"
+groups = ["main", "dev"]
 files = [
-    {file = "urllib3-2.3.0-py3-none-any.whl", hash = "sha256:1cee9ad369867bfdbbb48b7dd50374c0967a0bb7710050facf0dd6911440e3df"},
-    {file = "urllib3-2.3.0.tar.gz", hash = "sha256:f8c5449b3cf0861679ce7e0503c7b44b5ec981bec0d1d3795a07f1ba96f0204d"},
+    {file = "urllib3-2.7.0-py3-none-any.whl", hash = "sha256:9fb4c81ebbb1ce9531cce37674bbc6f1360472bc18ca9a553ede278ef7276897"},
+    {file = "urllib3-2.7.0.tar.gz", hash = "sha256:231e0ec3b63ceb14667c67be60f2f2c40a518cb38b03af60abc813da26505f4c"},
 ]
 
 [package.extras]
-brotli = ["brotli (>=1.0.9) ; platform_python_implementation == \"CPython\"", "brotlicffi (>=0.8.0) ; platform_python_implementation != \"CPython\""]
+brotli = ["brotli (>=1.2.0) ; platform_python_implementation == \"CPython\"", "brotlicffi (>=1.2.0.0) ; platform_python_implementation != \"CPython\""]
 h2 = ["h2 (>=4,<5)"]
 socks = ["pysocks (>=1.5.6,!=1.5.7,<2.0)"]
-zstd = ["zstandard (>=0.18.0)"]
+zstd = ["backports-zstd (>=1.0.0) ; python_version < \"3.14\""]
 
 [[package]]
 name = "urwid"
-version = "2.6.16"
+version = "3.0.3"
 description = "A full-featured console (xterm et al.) user interface library"
 optional = false
-python-versions = ">3.7"
-groups = ["main"]
+python-versions = ">=3.9.0"
+groups = ["dev"]
 files = [
-    {file = "urwid-2.6.16-py3-none-any.whl", hash = "sha256:de14896c6df9eb759ed1fd93e0384a5279e51e0dde8f621e4083f7a8368c0797"},
-    {file = "urwid-2.6.16.tar.gz", hash = "sha256:93ad239939e44c385e64aa00027878b9e5c486d59e855ec8ab5b1e1adcdb32a2"},
+    {file = "urwid-3.0.3-py3-none-any.whl", hash = "sha256:ede36ecc99a293bbb4b5e5072c7b7bb943eb3bed17decf89b808209ed2dead15"},
+    {file = "urwid-3.0.3.tar.gz", hash = "sha256:300804dd568cda5aa1c5b204227bd0cfe7a62cef2d00987c5eb2e4e64294ed9b"},
 ]
 
 [package.dependencies]
-typing-extensions = "*"
 wcwidth = "*"
 
 [package.extras]
@@ -5808,7 +5514,7 @@ glib = ["PyGObject"]
 lcd = ["pyserial"]
 serial = ["pyserial"]
 tornado = ["tornado (>=5.0)"]
-trio = ["exceptiongroup", "trio (>=0.22.0)"]
+trio = ["exceptiongroup ; python_version < \"3.11\"", "trio (>=0.24.0)"]
 twisted = ["twisted"]
 zmq = ["zmq"]
 
@@ -5818,7 +5524,7 @@ version = "0.15.1"
 description = "A textbox edit widget for urwid that supports readline shortcuts"
 optional = false
 python-versions = "*"
-groups = ["main"]
+groups = ["dev"]
 files = [
     {file = "urwid_readline-0.15.1.tar.gz", hash = "sha256:9301444b86d58f7d26388506b704f142cefd193888488b4070d3a0fdfcfc0f84"},
 ]
@@ -5830,165 +5536,86 @@ urwid = "*"
 dev = ["black", "pytest"]
 
 [[package]]
-name = "venusian"
-version = "3.1.1"
-description = "A library for deferring decorator actions"
-optional = false
-python-versions = ">=3.7"
+name = "uvicorn"
+version = "0.49.0"
+description = "The lightning-fast ASGI server."
+optional = true
+python-versions = ">=3.10"
 groups = ["main"]
+markers = "extra == \"trackio\""
 files = [
-    {file = "venusian-3.1.1-py3-none-any.whl", hash = "sha256:0845808a985976acbceaa1fbb871c7fac4fb28ae75453232970e9c2c2866dbf4"},
-    {file = "venusian-3.1.1.tar.gz", hash = "sha256:534fb3b355669283eb3954581931e5d1d071fce61d029d58f3219a5e3a6f0c41"},
+    {file = "uvicorn-0.49.0-py3-none-any.whl", hash = "sha256:ba3d14c3ee7e41c6c654c46c9eb489d33213cdd30aa1696eab1374337c13f68f"},
+    {file = "uvicorn-0.49.0.tar.gz", hash = "sha256:ebf4271aa580d9de97f93192d4595176df6e91f9aae919ca73e4fc07df1e66a3"},
 ]
 
+[package.dependencies]
+click = ">=7.0"
+h11 = ">=0.8"
+
 [package.extras]
-docs = ["Sphinx (>=4.3.2)", "pylons-sphinx-themes", "repoze.sphinx.autointerface", "sphinx-copybutton"]
-testing = ["coverage", "pytest", "pytest-cov"]
+standard = ["colorama (>=0.4) ; sys_platform == \"win32\"", "httptools (>=0.8.0)", "python-dotenv (>=0.13)", "pyyaml (>=5.1)", "uvloop (>=0.15.1) ; sys_platform != \"win32\" and sys_platform != \"cygwin\" and platform_python_implementation != \"PyPy\"", "watchfiles (>=0.20)", "websockets (>=10.4)"]
 
 [[package]]
 name = "virtualenv"
-version = "20.29.1"
+version = "21.5.1"
 description = "Virtual Python Environment builder"
 optional = false
-python-versions = ">=3.8"
-groups = ["main", "dev"]
+python-versions = ">=3.9"
+groups = ["dev"]
 files = [
-    {file = "virtualenv-20.29.1-py3-none-any.whl", hash = "sha256:4e4cb403c0b0da39e13b46b1b2476e505cb0046b25f242bee80f62bf990b2779"},
-    {file = "virtualenv-20.29.1.tar.gz", hash = "sha256:b8b8970138d32fb606192cb97f6cd4bb644fa486be9308fb9b63f81091b5dc35"},
+    {file = "virtualenv-21.5.1-py3-none-any.whl", hash = "sha256:55aa670b67bbfb991b03fda39bd3276d92c419d702376e98c5df1c9989a26783"},
+    {file = "virtualenv-21.5.1.tar.gz", hash = "sha256:dca3bf98275a59c652b69d68e73433e597d977c2da9198882479d1a7188009c8"},
 ]
 
 [package.dependencies]
 distlib = ">=0.3.7,<1"
-filelock = ">=3.12.2,<4"
+filelock = {version = ">=3.24.2,<4", markers = "python_version >= \"3.10\""}
 platformdirs = ">=3.9.1,<5"
-
-[package.extras]
-docs = ["furo (>=2023.7.26)", "proselint (>=0.13)", "sphinx (>=7.1.2,!=7.3)", "sphinx-argparse (>=0.4)", "sphinxcontrib-towncrier (>=0.2.1a0)", "towncrier (>=23.6)"]
-test = ["covdefaults (>=2.3)", "coverage (>=7.2.7)", "coverage-enable-subprocess (>=1)", "flaky (>=3.7)", "packaging (>=23.1)", "pytest (>=7.4)", "pytest-env (>=0.8.2)", "pytest-freezer (>=0.4.8) ; platform_python_implementation == \"PyPy\" or platform_python_implementation == \"CPython\" and sys_platform == \"win32\" and python_version >= \"3.13\"", "pytest-mock (>=3.11.1)", "pytest-randomly (>=3.12)", "pytest-timeout (>=2.1)", "setuptools (>=68)", "time-machine (>=2.10) ; platform_python_implementation == \"CPython\""]
-
-[[package]]
-name = "wandb"
-version = "0.17.8"
-description = "A CLI and library for interacting with the Weights & Biases API."
-optional = false
-python-versions = ">=3.7"
-groups = ["main"]
-files = [
-    {file = "wandb-0.17.8-py3-none-any.whl", hash = "sha256:0e240d9e92c2557fba8415266ee6e124420cb80353e40d702a597f3cb609fad6"},
-    {file = "wandb-0.17.8-py3-none-macosx_10_14_x86_64.whl", hash = "sha256:a1f8a032776bea9a9aec9c6c3671142a31ed962cc40a20988805cedea57fc16c"},
-    {file = "wandb-0.17.8-py3-none-macosx_11_0_arm64.whl", hash = "sha256:c6e60534f21e9a322df6e9ebc3e4188d06ed3413985828130508f06c2393116e"},
-    {file = "wandb-0.17.8-py3-none-manylinux_2_17_aarch64.manylinux2014_aarch64.whl", hash = "sha256:5e0edcb0eee9a392a7115d349e790c8df10ae2d488e525ace2f8d1589ddda6de"},
-    {file = "wandb-0.17.8-py3-none-manylinux_2_17_x86_64.manylinux2014_x86_64.whl", hash = "sha256:1762ecc98c38d7a040531d0a01e5090efcaf594ebac87d6929316884828c6393"},
-    {file = "wandb-0.17.8-py3-none-win32.whl", hash = "sha256:200ee7c887181db2c879be0d5f0ee6a1d6199ea97b7a2dbca73dcedf5a4cfd32"},
-    {file = "wandb-0.17.8-py3-none-win_amd64.whl", hash = "sha256:325ce529e3af7dc9eaea889ba2c2d9af7e19a761136300ae5a4c1b5df0c9f02d"},
-    {file = "wandb-0.17.8.tar.gz", hash = "sha256:d3d0ae27e85366d8ed48e79873d409eb43ad5fa43792506a6240b875b1d44c87"},
-]
-
-[package.dependencies]
-click = ">=7.1,<8.0.0 || >8.0.0"
-docker-pycreds = ">=0.4.0"
-gitpython = ">=1.0.0,<3.1.29 || >3.1.29"
-platformdirs = "*"
-protobuf = {version = ">=3.19.0,<4.21.0 || >4.21.0,<6", markers = "python_version > \"3.9\" or sys_platform != \"linux\""}
-psutil = ">=5.0.0"
-pyyaml = "*"
-requests = ">=2.0.0,<3"
-sentry-sdk = ">=1.0.0"
-setproctitle = "*"
-setuptools = "*"
-
-[package.extras]
-aws = ["boto3"]
-azure = ["azure-identity", "azure-storage-blob"]
-gcp = ["google-cloud-storage"]
-importers = ["filelock", "mlflow", "polars (<=1.2.1)", "rich", "tenacity"]
-kubeflow = ["google-cloud-storage", "kubernetes", "minio", "sh"]
-launch = ["awscli", "azure-containerregistry", "azure-identity", "azure-storage-blob", "boto3", "botocore", "chardet", "google-auth", "google-cloud-aiplatform", "google-cloud-artifact-registry", "google-cloud-compute", "google-cloud-storage", "iso8601", "jsonschema", "kubernetes", "kubernetes-asyncio", "nbconvert", "nbformat", "optuna", "pydantic", "pyyaml (>=6.0.0)", "tomli", "typing-extensions"]
-media = ["bokeh", "moviepy", "numpy", "pillow", "plotly (>=5.18.0)", "rdkit-pypi", "soundfile"]
-models = ["cloudpickle"]
-perf = ["orjson"]
-sweeps = ["sweeps (>=0.2.0)"]
-workspaces = ["wandb-workspaces"]
+python-discovery = ">=1.4.2"
 
 [[package]]
 name = "wcwidth"
-version = "0.2.13"
+version = "0.2.14"
 description = "Measures the displayed width of unicode strings in a terminal"
 optional = false
-python-versions = "*"
-groups = ["main"]
-files = [
-    {file = "wcwidth-0.2.13-py2.py3-none-any.whl", hash = "sha256:3da69048e4540d84af32131829ff948f1e022c1c6bdb8d6102117aac784f6859"},
-    {file = "wcwidth-0.2.13.tar.gz", hash = "sha256:72ea0c06399eb286d978fdedb6923a9eb47e1c486ce63e9b4e64fc18303972b5"},
-]
-
-[[package]]
-name = "webdataset"
-version = "0.2.100"
-description = "Record sequential storage for deep learning."
-optional = false
 python-versions = ">=3.6"
-groups = ["main"]
-files = [
-    {file = "webdataset-0.2.100-py3-none-any.whl", hash = "sha256:f70a8e1f6d4f5268b364bd6f77fe8a1168ea14e7e9ed455d71f8d29585fd86af"},
-    {file = "webdataset-0.2.100.tar.gz", hash = "sha256:798e30ff700277f0b963dc0395f3b9de4971a67cffc7cb6d0cb9225df7b68e42"},
-]
-
-[package.dependencies]
-braceexpand = "*"
-numpy = "*"
-pyyaml = "*"
-
-[[package]]
-name = "webob"
-version = "1.8.9"
-description = "WSGI request and response object"
-optional = false
-python-versions = "!=3.0.*,!=3.1.*,!=3.2.*,>=2.7"
-groups = ["main"]
+groups = ["main", "dev"]
 files = [
-    {file = "WebOb-1.8.9-py2.py3-none-any.whl", hash = "sha256:45e34c58ed0c7e2ecd238ffd34432487ff13d9ad459ddfd77895e67abba7c1f9"},
-    {file = "webob-1.8.9.tar.gz", hash = "sha256:ad6078e2edb6766d1334ec3dee072ac6a7f95b1e32ce10def8ff7f0f02d56589"},
+    {file = "wcwidth-0.2.14-py2.py3-none-any.whl", hash = "sha256:a7bb560c8aee30f9957e5f9895805edd20602f2d7f720186dfd906e82b4982e1"},
+    {file = "wcwidth-0.2.14.tar.gz", hash = "sha256:4d478375d31bc5395a3c55c40ccdf3354688364cd61c4f6adacaa9215d0b3605"},
 ]
 
-[package.dependencies]
-legacy-cgi = {version = ">=2.6", markers = "python_version >= \"3.13\""}
-
-[package.extras]
-docs = ["Sphinx (>=1.7.5)", "pylons-sphinx-themes"]
-testing = ["coverage", "pytest (>=3.1.0)", "pytest-cov", "pytest-xdist"]
-
 [[package]]
 name = "websocket-client"
-version = "1.8.0"
+version = "1.9.0"
 description = "WebSocket client for Python with low level API options"
 optional = false
-python-versions = ">=3.8"
+python-versions = ">=3.9"
 groups = ["main"]
 files = [
-    {file = "websocket_client-1.8.0-py3-none-any.whl", hash = "sha256:17b44cc997f5c498e809b22cdf2d9c7a9e71c02c8cc2b6c56e7c2d1239bfa526"},
-    {file = "websocket_client-1.8.0.tar.gz", hash = "sha256:3239df9f44da632f96012472805d40a23281a991027ce11d2f45a6f24ac4c3da"},
+    {file = "websocket_client-1.9.0-py3-none-any.whl", hash = "sha256:af248a825037ef591efbf6ed20cc5faa03d3b47b9e5a2230a529eeee1c1fc3ef"},
+    {file = "websocket_client-1.9.0.tar.gz", hash = "sha256:9e813624b6eb619999a97dc7958469217c3176312b3a16a4bd1bc7e08a46ec98"},
 ]
 
 [package.extras]
-docs = ["Sphinx (>=6.0)", "myst-parser (>=2.0.0)", "sphinx-rtd-theme (>=1.1.0)"]
+docs = ["Sphinx (>=6.0)", "myst-parser (>=2.0.0)", "sphinx_rtd_theme (>=1.1.0)"]
 optional = ["python-socks", "wsaccel"]
-test = ["websockets"]
+test = ["pytest", "websockets"]
 
 [[package]]
 name = "werkzeug"
-version = "3.1.3"
+version = "3.1.8"
 description = "The comprehensive WSGI web application library."
 optional = false
 python-versions = ">=3.9"
-groups = ["main"]
+groups = ["training"]
 files = [
-    {file = "werkzeug-3.1.3-py3-none-any.whl", hash = "sha256:54b78bf3716d19a65be4fceccc0d1d7b89e608834989dfae50ea87564639213e"},
-    {file = "werkzeug-3.1.3.tar.gz", hash = "sha256:60723ce945c19328679790e3282cc758aa4a6040e4bb330f53d30fa546d44746"},
+    {file = "werkzeug-3.1.8-py3-none-any.whl", hash = "sha256:63a77fb8892bf28ebc3178683445222aa500e48ebad5ec77b0ad80f8726b1f50"},
+    {file = "werkzeug-3.1.8.tar.gz", hash = "sha256:9bad61a4268dac112f1c5cd4630a56ede601b6ed420300677a869083d70a4c44"},
 ]
 
 [package.dependencies]
-MarkupSafe = ">=2.1.1"
+markupsafe = ">=2.1.1"
 
 [package.extras]
 watchdog = ["watchdog (>=2.3)"]
@@ -6011,416 +5638,233 @@ dev = ["black (>=19.3b0) ; python_version >= \"3.6\"", "pytest (>=4.6.2)"]
 
 [[package]]
 name = "xformers"
-version = "0.0.25.post1"
+version = "0.0.29.post3"
 description = "XFormers: A collection of composable Transformer building blocks."
-optional = false
-python-versions = ">=3.7"
+optional = true
+python-versions = ">=3.9"
 groups = ["main"]
+markers = "extra == \"cuda\""
 files = [
-    {file = "xformers-0.0.25.post1-cp310-cp310-manylinux2014_x86_64.whl", hash = "sha256:cdfe9560848fa5ba75fc04d3da8803658e35997adc6075ee6bbf6d67c1f0fa5e"},
-    {file = "xformers-0.0.25.post1-cp310-cp310-win_amd64.whl", hash = "sha256:ddc22273f2ff06b886d9e86f17997e4f1f3074fdeb5d46bcdf50b704430df528"},
-    {file = "xformers-0.0.25.post1-cp311-cp311-manylinux2014_x86_64.whl", hash = "sha256:bbe8e83043f761d701baaac16f57d0de7d9b53e5111e15e324a3bfedbc94e3eb"},
-    {file = "xformers-0.0.25.post1-cp311-cp311-win_amd64.whl", hash = "sha256:3eaf21f437c1e1a8aa126310e33b186cb6d90906b06f90759672ba9e1f61893c"},
-    {file = "xformers-0.0.25.post1-cp38-cp38-manylinux2014_x86_64.whl", hash = "sha256:3c82a2c9180d87591a0306113b62248818ceee7176aad35a79557e70841432a4"},
-    {file = "xformers-0.0.25.post1-cp38-cp38-win_amd64.whl", hash = "sha256:45646a9877c6376800cb5ed4124e2f3d7baf418f75d9e21840589cf1f4fe1f8e"},
-    {file = "xformers-0.0.25.post1-cp39-cp39-manylinux2014_x86_64.whl", hash = "sha256:1ccc5f2b9370f97fb5e646cd76228106a872a4b96b05e209c595d05141abff70"},
-    {file = "xformers-0.0.25.post1-cp39-cp39-win_amd64.whl", hash = "sha256:f48bbc04a916d1010b752a005d4a27c54fec181210b63d7879534455e3b53169"},
-    {file = "xformers-0.0.25.post1.tar.gz", hash = "sha256:397430bd0162fd5a75eb8bc50b0ba242200881e48fd6404a19376f853f8c0444"},
+    {file = "xformers-0.0.29.post3-cp310-cp310-manylinux_2_28_x86_64.whl", hash = "sha256:982f6049307905bd437b1dc95b372679e366ded1fadd672fb7e60756f2103d00"},
+    {file = "xformers-0.0.29.post3-cp310-cp310-win_amd64.whl", hash = "sha256:0c95e6fdb60e360801bc851a0e2b5b1fcfa8056d547a074a8823a49db01ba3b0"},
+    {file = "xformers-0.0.29.post3-cp311-cp311-manylinux_2_28_x86_64.whl", hash = "sha256:57ac478f6ac3cf79e7298276f1fb7cd8db579125e99cb523bb10e4ce20e9cfc6"},
+    {file = "xformers-0.0.29.post3-cp311-cp311-win_amd64.whl", hash = "sha256:5128b3a90c305a506cc95b5879d39c5bd931137108e2cadc2c3c54ef5c8a2390"},
+    {file = "xformers-0.0.29.post3-cp312-cp312-manylinux_2_28_x86_64.whl", hash = "sha256:9c47adb3643b9e9cc18d767c2d403b0edf0c24b3eaaae0be46ef004069fba94c"},
+    {file = "xformers-0.0.29.post3-cp312-cp312-win_amd64.whl", hash = "sha256:2e658e2b2c45229e5c74d527055b51616d24f8bac243dbf1b2817ce525b9d8ba"},
+    {file = "xformers-0.0.29.post3-cp39-cp39-manylinux_2_28_x86_64.whl", hash = "sha256:3ffc1021d578c114ad79a1192da4ffb08bec9801bfb1aa5528df9b49e3f4e181"},
+    {file = "xformers-0.0.29.post3-cp39-cp39-win_amd64.whl", hash = "sha256:800972a992d107ae835d5ac921a2ab4314db724b9063e6c9bd8920589973acfa"},
 ]
 
 [package.dependencies]
 numpy = "*"
-torch = "2.2.2"
+torch = "2.6.0"
+
+[package.source]
+type = "legacy"
+url = "https://download.pytorch.org/whl/cu126"
+reference = "pytorch-cu126"
 
 [[package]]
 name = "xfuser"
-version = "0.4.3.post2"
+version = "0.4.5"
 description = "A Scalable Inference Engine for Diffusion Transformers (DiTs) on Multiple Computing Devices"
-optional = false
+optional = true
 python-versions = ">=3.10"
 groups = ["main"]
+markers = "extra == \"cuda\""
 files = [
-    {file = "xfuser-0.4.3.post2-py3-none-any.whl", hash = "sha256:4e2a431c343393b3a98b2ff0e83126ab3feb5c92280f0752af383622e8ff2c1c"},
-    {file = "xfuser-0.4.3.post2.tar.gz", hash = "sha256:9c8a0a92b4df4352ad54b0c54938715e4758ed729dee9ec3506ab6cb5e5c7cd2"},
+    {file = "xfuser-0.4.5-py3-none-any.whl", hash = "sha256:6d660a733f9f96c06be48dd9d4cbb17b29d133a0288939c95adeb1a2beff5b5d"},
+    {file = "xfuser-0.4.5.tar.gz", hash = "sha256:bfd985b9a2f27bc541fc71e6a224bcdbfd945a25ead7e89976ced1ea63bc3e64"},
 ]
 
 [package.dependencies]
 accelerate = ">=0.33.0"
 beautifulsoup4 = ">=4.12.3"
+diffusers = ">=0.33.0"
 distvae = "*"
 einops = "*"
-imageio = "*"
-imageio-ffmpeg = "*"
-opencv-python = "*"
-pytest = "*"
 sentencepiece = ">=0.1.99"
-torch = ">=2.1.0"
+torch = ">=2.4.1"
 transformers = ">=4.39.1"
 yunchang = ">=0.6.0"
 
 [package.extras]
-diffusers = ["diffusers (>=0.31.0)"]
 flash-attn = ["flash-attn (>=2.6.0)"]
 flask = ["flask"]
+opencv-python = ["opencv-python-headless"]
 optimum-quanto = ["optimum-quanto"]
 ray = ["ray"]
-
-[[package]]
-name = "xxhash"
-version = "3.5.0"
-description = "Python binding for xxHash"
-optional = false
-python-versions = ">=3.7"
-groups = ["main"]
-files = [
-    {file = "xxhash-3.5.0-cp310-cp310-macosx_10_9_x86_64.whl", hash = "sha256:ece616532c499ee9afbb83078b1b952beffef121d989841f7f4b3dc5ac0fd212"},
-    {file = "xxhash-3.5.0-cp310-cp310-macosx_11_0_arm64.whl", hash = "sha256:3171f693dbc2cef6477054a665dc255d996646b4023fe56cb4db80e26f4cc520"},
-    {file = "xxhash-3.5.0-cp310-cp310-manylinux_2_17_aarch64.manylinux2014_aarch64.whl", hash = "sha256:7c5d3e570ef46adaf93fc81b44aca6002b5a4d8ca11bd0580c07eac537f36680"},
-    {file = "xxhash-3.5.0-cp310-cp310-manylinux_2_17_ppc64le.manylinux2014_ppc64le.whl", hash = "sha256:7cb29a034301e2982df8b1fe6328a84f4b676106a13e9135a0d7e0c3e9f806da"},
-    {file = "xxhash-3.5.0-cp310-cp310-manylinux_2_17_s390x.manylinux2014_s390x.whl", hash = "sha256:5d0d307d27099bb0cbeea7260eb39ed4fdb99c5542e21e94bb6fd29e49c57a23"},
-    {file = "xxhash-3.5.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl", hash = "sha256:c0342aafd421795d740e514bc9858ebddfc705a75a8c5046ac56d85fe97bf196"},
-    {file = "xxhash-3.5.0-cp310-cp310-manylinux_2_5_i686.manylinux1_i686.manylinux_2_17_i686.manylinux2014_i686.whl", hash = "sha256:3dbbd9892c5ebffeca1ed620cf0ade13eb55a0d8c84e0751a6653adc6ac40d0c"},
-    {file = "xxhash-3.5.0-cp310-cp310-musllinux_1_2_aarch64.whl", hash = "sha256:4cc2d67fdb4d057730c75a64c5923abfa17775ae234a71b0200346bfb0a7f482"},
-    {file = "xxhash-3.5.0-cp310-cp310-musllinux_1_2_i686.whl", hash = "sha256:ec28adb204b759306a3d64358a5e5c07d7b1dd0ccbce04aa76cb9377b7b70296"},
-    {file = "xxhash-3.5.0-cp310-cp310-musllinux_1_2_ppc64le.whl", hash = "sha256:1328f6d8cca2b86acb14104e381225a3d7b42c92c4b86ceae814e5c400dbb415"},
-    {file = "xxhash-3.5.0-cp310-cp310-musllinux_1_2_s390x.whl", hash = "sha256:8d47ebd9f5d9607fd039c1fbf4994e3b071ea23eff42f4ecef246ab2b7334198"},
-    {file = "xxhash-3.5.0-cp310-cp310-musllinux_1_2_x86_64.whl", hash = "sha256:b96d559e0fcddd3343c510a0fe2b127fbff16bf346dd76280b82292567523442"},
-    {file = "xxhash-3.5.0-cp310-cp310-win32.whl", hash = "sha256:61c722ed8d49ac9bc26c7071eeaa1f6ff24053d553146d5df031802deffd03da"},
-    {file = "xxhash-3.5.0-cp310-cp310-win_amd64.whl", hash = "sha256:9bed5144c6923cc902cd14bb8963f2d5e034def4486ab0bbe1f58f03f042f9a9"},
-    {file = "xxhash-3.5.0-cp310-cp310-win_arm64.whl", hash = "sha256:893074d651cf25c1cc14e3bea4fceefd67f2921b1bb8e40fcfeba56820de80c6"},
-    {file = "xxhash-3.5.0-cp311-cp311-macosx_10_9_x86_64.whl", hash = "sha256:02c2e816896dc6f85922ced60097bcf6f008dedfc5073dcba32f9c8dd786f3c1"},
-    {file = "xxhash-3.5.0-cp311-cp311-macosx_11_0_arm64.whl", hash = "sha256:6027dcd885e21581e46d3c7f682cfb2b870942feeed58a21c29583512c3f09f8"},
-    {file = "xxhash-3.5.0-cp311-cp311-manylinux_2_17_aarch64.manylinux2014_aarch64.whl", hash = "sha256:1308fa542bbdbf2fa85e9e66b1077eea3a88bef38ee8a06270b4298a7a62a166"},
-    {file = "xxhash-3.5.0-cp311-cp311-manylinux_2_17_ppc64le.manylinux2014_ppc64le.whl", hash = "sha256:c28b2fdcee797e1c1961cd3bcd3d545cab22ad202c846235197935e1df2f8ef7"},
-    {file = "xxhash-3.5.0-cp311-cp311-manylinux_2_17_s390x.manylinux2014_s390x.whl", hash = "sha256:924361811732ddad75ff23e90efd9ccfda4f664132feecb90895bade6a1b4623"},
-    {file = "xxhash-3.5.0-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl", hash = "sha256:89997aa1c4b6a5b1e5b588979d1da048a3c6f15e55c11d117a56b75c84531f5a"},
-    {file = "xxhash-3.5.0-cp311-cp311-manylinux_2_5_i686.manylinux1_i686.manylinux_2_17_i686.manylinux2014_i686.whl", hash = "sha256:685c4f4e8c59837de103344eb1c8a3851f670309eb5c361f746805c5471b8c88"},
-    {file = "xxhash-3.5.0-cp311-cp311-musllinux_1_2_aarch64.whl", hash = "sha256:dbd2ecfbfee70bc1a4acb7461fa6af7748ec2ab08ac0fa298f281c51518f982c"},
-    {file = "xxhash-3.5.0-cp311-cp311-musllinux_1_2_i686.whl", hash = "sha256:25b5a51dc3dfb20a10833c8eee25903fd2e14059e9afcd329c9da20609a307b2"},
-    {file = "xxhash-3.5.0-cp311-cp311-musllinux_1_2_ppc64le.whl", hash = "sha256:a8fb786fb754ef6ff8c120cb96629fb518f8eb5a61a16aac3a979a9dbd40a084"},
-    {file = "xxhash-3.5.0-cp311-cp311-musllinux_1_2_s390x.whl", hash = "sha256:a905ad00ad1e1c34fe4e9d7c1d949ab09c6fa90c919860c1534ff479f40fd12d"},
-    {file = "xxhash-3.5.0-cp311-cp311-musllinux_1_2_x86_64.whl", hash = "sha256:963be41bcd49f53af6d795f65c0da9b4cc518c0dd9c47145c98f61cb464f4839"},
-    {file = "xxhash-3.5.0-cp311-cp311-win32.whl", hash = "sha256:109b436096d0a2dd039c355fa3414160ec4d843dfecc64a14077332a00aeb7da"},
-    {file = "xxhash-3.5.0-cp311-cp311-win_amd64.whl", hash = "sha256:b702f806693201ad6c0a05ddbbe4c8f359626d0b3305f766077d51388a6bac58"},
-    {file = "xxhash-3.5.0-cp311-cp311-win_arm64.whl", hash = "sha256:c4dcb4120d0cc3cc448624147dba64e9021b278c63e34a38789b688fd0da9bf3"},
-    {file = "xxhash-3.5.0-cp312-cp312-macosx_10_9_x86_64.whl", hash = "sha256:14470ace8bd3b5d51318782cd94e6f94431974f16cb3b8dc15d52f3b69df8e00"},
-    {file = "xxhash-3.5.0-cp312-cp312-macosx_11_0_arm64.whl", hash = "sha256:59aa1203de1cb96dbeab595ded0ad0c0056bb2245ae11fac11c0ceea861382b9"},
-    {file = "xxhash-3.5.0-cp312-cp312-manylinux_2_17_aarch64.manylinux2014_aarch64.whl", hash = "sha256:08424f6648526076e28fae6ea2806c0a7d504b9ef05ae61d196d571e5c879c84"},
-    {file = "xxhash-3.5.0-cp312-cp312-manylinux_2_17_ppc64le.manylinux2014_ppc64le.whl", hash = "sha256:61a1ff00674879725b194695e17f23d3248998b843eb5e933007ca743310f793"},
-    {file = "xxhash-3.5.0-cp312-cp312-manylinux_2_17_s390x.manylinux2014_s390x.whl", hash = "sha256:f2f2c61bee5844d41c3eb015ac652a0229e901074951ae48581d58bfb2ba01be"},
-    {file = "xxhash-3.5.0-cp312-cp312-manylinux_2_17_x86_64.manylinux2014_x86_64.whl", hash = "sha256:9d32a592cac88d18cc09a89172e1c32d7f2a6e516c3dfde1b9adb90ab5df54a6"},
-    {file = "xxhash-3.5.0-cp312-cp312-manylinux_2_5_i686.manylinux1_i686.manylinux_2_17_i686.manylinux2014_i686.whl", hash = "sha256:70dabf941dede727cca579e8c205e61121afc9b28516752fd65724be1355cc90"},
-    {file = "xxhash-3.5.0-cp312-cp312-musllinux_1_2_aarch64.whl", hash = "sha256:e5d0ddaca65ecca9c10dcf01730165fd858533d0be84c75c327487c37a906a27"},
-    {file = "xxhash-3.5.0-cp312-cp312-musllinux_1_2_i686.whl", hash = "sha256:3e5b5e16c5a480fe5f59f56c30abdeba09ffd75da8d13f6b9b6fd224d0b4d0a2"},
-    {file = "xxhash-3.5.0-cp312-cp312-musllinux_1_2_ppc64le.whl", hash = "sha256:149b7914451eb154b3dfaa721315117ea1dac2cc55a01bfbd4df7c68c5dd683d"},
-    {file = "xxhash-3.5.0-cp312-cp312-musllinux_1_2_s390x.whl", hash = "sha256:eade977f5c96c677035ff39c56ac74d851b1cca7d607ab3d8f23c6b859379cab"},
-    {file = "xxhash-3.5.0-cp312-cp312-musllinux_1_2_x86_64.whl", hash = "sha256:fa9f547bd98f5553d03160967866a71056a60960be00356a15ecc44efb40ba8e"},
-    {file = "xxhash-3.5.0-cp312-cp312-win32.whl", hash = "sha256:f7b58d1fd3551b8c80a971199543379be1cee3d0d409e1f6d8b01c1a2eebf1f8"},
-    {file = "xxhash-3.5.0-cp312-cp312-win_amd64.whl", hash = "sha256:fa0cafd3a2af231b4e113fba24a65d7922af91aeb23774a8b78228e6cd785e3e"},
-    {file = "xxhash-3.5.0-cp312-cp312-win_arm64.whl", hash = "sha256:586886c7e89cb9828bcd8a5686b12e161368e0064d040e225e72607b43858ba2"},
-    {file = "xxhash-3.5.0-cp313-cp313-macosx_10_13_x86_64.whl", hash = "sha256:37889a0d13b0b7d739cfc128b1c902f04e32de17b33d74b637ad42f1c55101f6"},
-    {file = "xxhash-3.5.0-cp313-cp313-macosx_11_0_arm64.whl", hash = "sha256:97a662338797c660178e682f3bc180277b9569a59abfb5925e8620fba00b9fc5"},
-    {file = "xxhash-3.5.0-cp313-cp313-manylinux_2_17_aarch64.manylinux2014_aarch64.whl", hash = "sha256:7f85e0108d51092bdda90672476c7d909c04ada6923c14ff9d913c4f7dc8a3bc"},
-    {file = "xxhash-3.5.0-cp313-cp313-manylinux_2_17_ppc64le.manylinux2014_ppc64le.whl", hash = "sha256:cd2fd827b0ba763ac919440042302315c564fdb797294d86e8cdd4578e3bc7f3"},
-    {file = "xxhash-3.5.0-cp313-cp313-manylinux_2_17_s390x.manylinux2014_s390x.whl", hash = "sha256:82085c2abec437abebf457c1d12fccb30cc8b3774a0814872511f0f0562c768c"},
-    {file = "xxhash-3.5.0-cp313-cp313-manylinux_2_17_x86_64.manylinux2014_x86_64.whl", hash = "sha256:07fda5de378626e502b42b311b049848c2ef38784d0d67b6f30bb5008642f8eb"},
-    {file = "xxhash-3.5.0-cp313-cp313-manylinux_2_5_i686.manylinux1_i686.manylinux_2_17_i686.manylinux2014_i686.whl", hash = "sha256:c279f0d2b34ef15f922b77966640ade58b4ccdfef1c4d94b20f2a364617a493f"},
-    {file = "xxhash-3.5.0-cp313-cp313-musllinux_1_2_aarch64.whl", hash = "sha256:89e66ceed67b213dec5a773e2f7a9e8c58f64daeb38c7859d8815d2c89f39ad7"},
-    {file = "xxhash-3.5.0-cp313-cp313-musllinux_1_2_i686.whl", hash = "sha256:bcd51708a633410737111e998ceb3b45d3dbc98c0931f743d9bb0a209033a326"},
-    {file = "xxhash-3.5.0-cp313-cp313-musllinux_1_2_ppc64le.whl", hash = "sha256:3ff2c0a34eae7df88c868be53a8dd56fbdf592109e21d4bfa092a27b0bf4a7bf"},
-    {file = "xxhash-3.5.0-cp313-cp313-musllinux_1_2_s390x.whl", hash = "sha256:4e28503dccc7d32e0b9817aa0cbfc1f45f563b2c995b7a66c4c8a0d232e840c7"},
-    {file = "xxhash-3.5.0-cp313-cp313-musllinux_1_2_x86_64.whl", hash = "sha256:a6c50017518329ed65a9e4829154626f008916d36295b6a3ba336e2458824c8c"},
-    {file = "xxhash-3.5.0-cp313-cp313-win32.whl", hash = "sha256:53a068fe70301ec30d868ece566ac90d873e3bb059cf83c32e76012c889b8637"},
-    {file = "xxhash-3.5.0-cp313-cp313-win_amd64.whl", hash = "sha256:80babcc30e7a1a484eab952d76a4f4673ff601f54d5142c26826502740e70b43"},
-    {file = "xxhash-3.5.0-cp313-cp313-win_arm64.whl", hash = "sha256:4811336f1ce11cac89dcbd18f3a25c527c16311709a89313c3acaf771def2d4b"},
-    {file = "xxhash-3.5.0-cp37-cp37m-macosx_10_9_x86_64.whl", hash = "sha256:6e5f70f6dca1d3b09bccb7daf4e087075ff776e3da9ac870f86ca316736bb4aa"},
-    {file = "xxhash-3.5.0-cp37-cp37m-manylinux_2_17_aarch64.manylinux2014_aarch64.whl", hash = "sha256:2e76e83efc7b443052dd1e585a76201e40b3411fe3da7af4fe434ec51b2f163b"},
-    {file = "xxhash-3.5.0-cp37-cp37m-manylinux_2_17_ppc64le.manylinux2014_ppc64le.whl", hash = "sha256:33eac61d0796ca0591f94548dcfe37bb193671e0c9bcf065789b5792f2eda644"},
-    {file = "xxhash-3.5.0-cp37-cp37m-manylinux_2_17_s390x.manylinux2014_s390x.whl", hash = "sha256:0ec70a89be933ea49222fafc3999987d7899fc676f688dd12252509434636622"},
-    {file = "xxhash-3.5.0-cp37-cp37m-manylinux_2_17_x86_64.manylinux2014_x86_64.whl", hash = "sha256:dd86b8e7f703ec6ff4f351cfdb9f428955859537125904aa8c963604f2e9d3e7"},
-    {file = "xxhash-3.5.0-cp37-cp37m-manylinux_2_5_i686.manylinux1_i686.manylinux_2_17_i686.manylinux2014_i686.whl", hash = "sha256:0adfbd36003d9f86c8c97110039f7539b379f28656a04097e7434d3eaf9aa131"},
-    {file = "xxhash-3.5.0-cp37-cp37m-musllinux_1_2_aarch64.whl", hash = "sha256:63107013578c8a730419adc05608756c3fa640bdc6abe806c3123a49fb829f43"},
-    {file = "xxhash-3.5.0-cp37-cp37m-musllinux_1_2_i686.whl", hash = "sha256:683b94dbd1ca67557850b86423318a2e323511648f9f3f7b1840408a02b9a48c"},
-    {file = "xxhash-3.5.0-cp37-cp37m-musllinux_1_2_ppc64le.whl", hash = "sha256:5d2a01dcce81789cf4b12d478b5464632204f4c834dc2d064902ee27d2d1f0ee"},
-    {file = "xxhash-3.5.0-cp37-cp37m-musllinux_1_2_s390x.whl", hash = "sha256:a9d360a792cbcce2fe7b66b8d51274ec297c53cbc423401480e53b26161a290d"},
-    {file = "xxhash-3.5.0-cp37-cp37m-musllinux_1_2_x86_64.whl", hash = "sha256:f0b48edbebea1b7421a9c687c304f7b44d0677c46498a046079d445454504737"},
-    {file = "xxhash-3.5.0-cp37-cp37m-win32.whl", hash = "sha256:7ccb800c9418e438b44b060a32adeb8393764da7441eb52aa2aa195448935306"},
-    {file = "xxhash-3.5.0-cp37-cp37m-win_amd64.whl", hash = "sha256:c3bc7bf8cb8806f8d1c9bf149c18708cb1c406520097d6b0a73977460ea03602"},
-    {file = "xxhash-3.5.0-cp38-cp38-macosx_10_9_x86_64.whl", hash = "sha256:74752ecaa544657d88b1d1c94ae68031e364a4d47005a90288f3bab3da3c970f"},
-    {file = "xxhash-3.5.0-cp38-cp38-macosx_11_0_arm64.whl", hash = "sha256:dee1316133c9b463aa81aca676bc506d3f80d8f65aeb0bba2b78d0b30c51d7bd"},
-    {file = "xxhash-3.5.0-cp38-cp38-manylinux_2_17_aarch64.manylinux2014_aarch64.whl", hash = "sha256:602d339548d35a8579c6b013339fb34aee2df9b4e105f985443d2860e4d7ffaa"},
-    {file = "xxhash-3.5.0-cp38-cp38-manylinux_2_17_ppc64le.manylinux2014_ppc64le.whl", hash = "sha256:695735deeddfb35da1677dbc16a083445360e37ff46d8ac5c6fcd64917ff9ade"},
-    {file = "xxhash-3.5.0-cp38-cp38-manylinux_2_17_s390x.manylinux2014_s390x.whl", hash = "sha256:1030a39ba01b0c519b1a82f80e8802630d16ab95dc3f2b2386a0b5c8ed5cbb10"},
-    {file = "xxhash-3.5.0-cp38-cp38-manylinux_2_17_x86_64.manylinux2014_x86_64.whl", hash = "sha256:a5bc08f33c4966f4eb6590d6ff3ceae76151ad744576b5fc6c4ba8edd459fdec"},
-    {file = "xxhash-3.5.0-cp38-cp38-manylinux_2_5_i686.manylinux1_i686.manylinux_2_17_i686.manylinux2014_i686.whl", hash = "sha256:160e0c19ee500482ddfb5d5570a0415f565d8ae2b3fd69c5dcfce8a58107b1c3"},
-    {file = "xxhash-3.5.0-cp38-cp38-musllinux_1_2_aarch64.whl", hash = "sha256:f1abffa122452481a61c3551ab3c89d72238e279e517705b8b03847b1d93d738"},
-    {file = "xxhash-3.5.0-cp38-cp38-musllinux_1_2_i686.whl", hash = "sha256:d5e9db7ef3ecbfc0b4733579cea45713a76852b002cf605420b12ef3ef1ec148"},
-    {file = "xxhash-3.5.0-cp38-cp38-musllinux_1_2_ppc64le.whl", hash = "sha256:23241ff6423378a731d84864bf923a41649dc67b144debd1077f02e6249a0d54"},
-    {file = "xxhash-3.5.0-cp38-cp38-musllinux_1_2_s390x.whl", hash = "sha256:82b833d5563fefd6fceafb1aed2f3f3ebe19f84760fdd289f8b926731c2e6e91"},
-    {file = "xxhash-3.5.0-cp38-cp38-musllinux_1_2_x86_64.whl", hash = "sha256:0a80ad0ffd78bef9509eee27b4a29e56f5414b87fb01a888353e3d5bda7038bd"},
-    {file = "xxhash-3.5.0-cp38-cp38-win32.whl", hash = "sha256:50ac2184ffb1b999e11e27c7e3e70cc1139047e7ebc1aa95ed12f4269abe98d4"},
-    {file = "xxhash-3.5.0-cp38-cp38-win_amd64.whl", hash = "sha256:392f52ebbb932db566973693de48f15ce787cabd15cf6334e855ed22ea0be5b3"},
-    {file = "xxhash-3.5.0-cp39-cp39-macosx_10_9_x86_64.whl", hash = "sha256:bfc8cdd7f33d57f0468b0614ae634cc38ab9202c6957a60e31d285a71ebe0301"},
-    {file = "xxhash-3.5.0-cp39-cp39-macosx_11_0_arm64.whl", hash = "sha256:e0c48b6300cd0b0106bf49169c3e0536408dfbeb1ccb53180068a18b03c662ab"},
-    {file = "xxhash-3.5.0-cp39-cp39-manylinux_2_17_aarch64.manylinux2014_aarch64.whl", hash = "sha256:fe1a92cfbaa0a1253e339ccec42dbe6db262615e52df591b68726ab10338003f"},
-    {file = "xxhash-3.5.0-cp39-cp39-manylinux_2_17_ppc64le.manylinux2014_ppc64le.whl", hash = "sha256:33513d6cc3ed3b559134fb307aae9bdd94d7e7c02907b37896a6c45ff9ce51bd"},
-    {file = "xxhash-3.5.0-cp39-cp39-manylinux_2_17_s390x.manylinux2014_s390x.whl", hash = "sha256:eefc37f6138f522e771ac6db71a6d4838ec7933939676f3753eafd7d3f4c40bc"},
-    {file = "xxhash-3.5.0-cp39-cp39-manylinux_2_17_x86_64.manylinux2014_x86_64.whl", hash = "sha256:a606c8070ada8aa2a88e181773fa1ef17ba65ce5dd168b9d08038e2a61b33754"},
-    {file = "xxhash-3.5.0-cp39-cp39-manylinux_2_5_i686.manylinux1_i686.manylinux_2_17_i686.manylinux2014_i686.whl", hash = "sha256:42eca420c8fa072cc1dd62597635d140e78e384a79bb4944f825fbef8bfeeef6"},
-    {file = "xxhash-3.5.0-cp39-cp39-musllinux_1_2_aarch64.whl", hash = "sha256:604253b2143e13218ff1ef0b59ce67f18b8bd1c4205d2ffda22b09b426386898"},
-    {file = "xxhash-3.5.0-cp39-cp39-musllinux_1_2_i686.whl", hash = "sha256:6e93a5ad22f434d7876665444a97e713a8f60b5b1a3521e8df11b98309bff833"},
-    {file = "xxhash-3.5.0-cp39-cp39-musllinux_1_2_ppc64le.whl", hash = "sha256:7a46e1d6d2817ba8024de44c4fd79913a90e5f7265434cef97026215b7d30df6"},
-    {file = "xxhash-3.5.0-cp39-cp39-musllinux_1_2_s390x.whl", hash = "sha256:30eb2efe6503c379b7ab99c81ba4a779748e3830241f032ab46bd182bf5873af"},
-    {file = "xxhash-3.5.0-cp39-cp39-musllinux_1_2_x86_64.whl", hash = "sha256:c8aa771ff2c13dd9cda8166d685d7333d389fae30a4d2bb39d63ab5775de8606"},
-    {file = "xxhash-3.5.0-cp39-cp39-win32.whl", hash = "sha256:5ed9ebc46f24cf91034544b26b131241b699edbfc99ec5e7f8f3d02d6eb7fba4"},
-    {file = "xxhash-3.5.0-cp39-cp39-win_amd64.whl", hash = "sha256:220f3f896c6b8d0316f63f16c077d52c412619e475f9372333474ee15133a558"},
-    {file = "xxhash-3.5.0-cp39-cp39-win_arm64.whl", hash = "sha256:a7b1d8315d9b5e9f89eb2933b73afae6ec9597a258d52190944437158b49d38e"},
-    {file = "xxhash-3.5.0-pp310-pypy310_pp73-macosx_10_15_x86_64.whl", hash = "sha256:2014c5b3ff15e64feecb6b713af12093f75b7926049e26a580e94dcad3c73d8c"},
-    {file = "xxhash-3.5.0-pp310-pypy310_pp73-manylinux_2_17_aarch64.manylinux2014_aarch64.whl", hash = "sha256:fab81ef75003eda96239a23eda4e4543cedc22e34c373edcaf744e721a163986"},
-    {file = "xxhash-3.5.0-pp310-pypy310_pp73-manylinux_2_17_x86_64.manylinux2014_x86_64.whl", hash = "sha256:4e2febf914ace002132aa09169cc572e0d8959d0f305f93d5828c4836f9bc5a6"},
-    {file = "xxhash-3.5.0-pp310-pypy310_pp73-manylinux_2_5_i686.manylinux1_i686.manylinux_2_17_i686.manylinux2014_i686.whl", hash = "sha256:5d3a10609c51da2a1c0ea0293fc3968ca0a18bd73838455b5bca3069d7f8e32b"},
-    {file = "xxhash-3.5.0-pp310-pypy310_pp73-win_amd64.whl", hash = "sha256:5a74f23335b9689b66eb6dbe2a931a88fcd7a4c2cc4b1cb0edba8ce381c7a1da"},
-    {file = "xxhash-3.5.0-pp37-pypy37_pp73-macosx_10_9_x86_64.whl", hash = "sha256:2b4154c00eb22e4d543f472cfca430e7962a0f1d0f3778334f2e08a7ba59363c"},
-    {file = "xxhash-3.5.0-pp37-pypy37_pp73-manylinux_2_17_aarch64.manylinux2014_aarch64.whl", hash = "sha256:d30bbc1644f726b825b3278764240f449d75f1a8bdda892e641d4a688b1494ae"},
-    {file = "xxhash-3.5.0-pp37-pypy37_pp73-manylinux_2_17_x86_64.manylinux2014_x86_64.whl", hash = "sha256:6fa0b72f2423e2aa53077e54a61c28e181d23effeaafd73fcb9c494e60930c8e"},
-    {file = "xxhash-3.5.0-pp37-pypy37_pp73-manylinux_2_5_i686.manylinux1_i686.manylinux_2_17_i686.manylinux2014_i686.whl", hash = "sha256:13de2b76c1835399b2e419a296d5b38dc4855385d9e96916299170085ef72f57"},
-    {file = "xxhash-3.5.0-pp37-pypy37_pp73-win_amd64.whl", hash = "sha256:0691bfcc4f9c656bcb96cc5db94b4d75980b9d5589f2e59de790091028580837"},
-    {file = "xxhash-3.5.0-pp38-pypy38_pp73-macosx_10_9_x86_64.whl", hash = "sha256:297595fe6138d4da2c8ce9e72a04d73e58725bb60f3a19048bc96ab2ff31c692"},
-    {file = "xxhash-3.5.0-pp38-pypy38_pp73-manylinux_2_17_aarch64.manylinux2014_aarch64.whl", hash = "sha256:cc1276d369452040cbb943300dc8abeedab14245ea44056a2943183822513a18"},
-    {file = "xxhash-3.5.0-pp38-pypy38_pp73-manylinux_2_17_x86_64.manylinux2014_x86_64.whl", hash = "sha256:2061188a1ba352fc699c82bff722f4baacb4b4b8b2f0c745d2001e56d0dfb514"},
-    {file = "xxhash-3.5.0-pp38-pypy38_pp73-manylinux_2_5_i686.manylinux1_i686.manylinux_2_17_i686.manylinux2014_i686.whl", hash = "sha256:38c384c434021e4f62b8d9ba0bc9467e14d394893077e2c66d826243025e1f81"},
-    {file = "xxhash-3.5.0-pp38-pypy38_pp73-win_amd64.whl", hash = "sha256:e6a4dd644d72ab316b580a1c120b375890e4c52ec392d4aef3c63361ec4d77d1"},
-    {file = "xxhash-3.5.0-pp39-pypy39_pp73-macosx_10_15_x86_64.whl", hash = "sha256:531af8845aaadcadf951b7e0c1345c6b9c68a990eeb74ff9acd8501a0ad6a1c9"},
-    {file = "xxhash-3.5.0-pp39-pypy39_pp73-manylinux_2_17_aarch64.manylinux2014_aarch64.whl", hash = "sha256:7ce379bcaa9fcc00f19affa7773084dd09f5b59947b3fb47a1ceb0179f91aaa1"},
-    {file = "xxhash-3.5.0-pp39-pypy39_pp73-manylinux_2_17_x86_64.manylinux2014_x86_64.whl", hash = "sha256:fd1b2281d01723f076df3c8188f43f2472248a6b63118b036e641243656b1b0f"},
-    {file = "xxhash-3.5.0-pp39-pypy39_pp73-manylinux_2_5_i686.manylinux1_i686.manylinux_2_17_i686.manylinux2014_i686.whl", hash = "sha256:9c770750cc80e8694492244bca7251385188bc5597b6a39d98a9f30e8da984e0"},
-    {file = "xxhash-3.5.0-pp39-pypy39_pp73-win_amd64.whl", hash = "sha256:b150b8467852e1bd844387459aa6fbe11d7f38b56e901f9f3b3e6aba0d660240"},
-    {file = "xxhash-3.5.0.tar.gz", hash = "sha256:84f2caddf951c9cbf8dc2e22a89d4ccf5d86391ac6418fe81e3c67d0cf60b45f"},
-]
-
-[[package]]
-name = "yapf"
-version = "0.43.0"
-description = "A formatter for Python code"
-optional = false
-python-versions = ">=3.7"
-groups = ["main"]
-files = [
-    {file = "yapf-0.43.0-py3-none-any.whl", hash = "sha256:224faffbc39c428cb095818cf6ef5511fdab6f7430a10783fdfb292ccf2852ca"},
-    {file = "yapf-0.43.0.tar.gz", hash = "sha256:00d3aa24bfedff9420b2e0d5d9f5ab6d9d4268e72afbf59bb3fa542781d5218e"},
-]
-
-[package.dependencies]
-platformdirs = ">=3.5.1"
-tomli = {version = ">=2.0.1", markers = "python_version < \"3.11\""}
+test = ["imageio", "imageio-ffmpeg", "pytest"]
 
 [[package]]
 name = "yarl"
-version = "1.18.3"
+version = "1.24.2"
 description = "Yet another URL library"
 optional = false
-python-versions = ">=3.9"
-groups = ["main"]
-files = [
-    {file = "yarl-1.18.3-cp310-cp310-macosx_10_9_universal2.whl", hash = "sha256:7df647e8edd71f000a5208fe6ff8c382a1de8edfbccdbbfe649d263de07d8c34"},
-    {file = "yarl-1.18.3-cp310-cp310-macosx_10_9_x86_64.whl", hash = "sha256:c69697d3adff5aa4f874b19c0e4ed65180ceed6318ec856ebc423aa5850d84f7"},
-    {file = "yarl-1.18.3-cp310-cp310-macosx_11_0_arm64.whl", hash = "sha256:602d98f2c2d929f8e697ed274fbadc09902c4025c5a9963bf4e9edfc3ab6f7ed"},
-    {file = "yarl-1.18.3-cp310-cp310-manylinux_2_17_aarch64.manylinux2014_aarch64.whl", hash = "sha256:c654d5207c78e0bd6d749f6dae1dcbbfde3403ad3a4b11f3c5544d9906969dde"},
-    {file = "yarl-1.18.3-cp310-cp310-manylinux_2_17_ppc64le.manylinux2014_ppc64le.whl", hash = "sha256:5094d9206c64181d0f6e76ebd8fb2f8fe274950a63890ee9e0ebfd58bf9d787b"},
-    {file = "yarl-1.18.3-cp310-cp310-manylinux_2_17_s390x.manylinux2014_s390x.whl", hash = "sha256:35098b24e0327fc4ebdc8ffe336cee0a87a700c24ffed13161af80124b7dc8e5"},
-    {file = "yarl-1.18.3-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl", hash = "sha256:3236da9272872443f81fedc389bace88408f64f89f75d1bdb2256069a8730ccc"},
-    {file = "yarl-1.18.3-cp310-cp310-manylinux_2_5_i686.manylinux1_i686.manylinux_2_17_i686.manylinux2014_i686.whl", hash = "sha256:e2c08cc9b16f4f4bc522771d96734c7901e7ebef70c6c5c35dd0f10845270bcd"},
-    {file = "yarl-1.18.3-cp310-cp310-musllinux_1_2_aarch64.whl", hash = "sha256:80316a8bd5109320d38eef8833ccf5f89608c9107d02d2a7f985f98ed6876990"},
-    {file = "yarl-1.18.3-cp310-cp310-musllinux_1_2_armv7l.whl", hash = "sha256:c1e1cc06da1491e6734f0ea1e6294ce00792193c463350626571c287c9a704db"},
-    {file = "yarl-1.18.3-cp310-cp310-musllinux_1_2_i686.whl", hash = "sha256:fea09ca13323376a2fdfb353a5fa2e59f90cd18d7ca4eaa1fd31f0a8b4f91e62"},
-    {file = "yarl-1.18.3-cp310-cp310-musllinux_1_2_ppc64le.whl", hash = "sha256:e3b9fd71836999aad54084906f8663dffcd2a7fb5cdafd6c37713b2e72be1760"},
-    {file = "yarl-1.18.3-cp310-cp310-musllinux_1_2_s390x.whl", hash = "sha256:757e81cae69244257d125ff31663249b3013b5dc0a8520d73694aed497fb195b"},
-    {file = "yarl-1.18.3-cp310-cp310-musllinux_1_2_x86_64.whl", hash = "sha256:b1771de9944d875f1b98a745bc547e684b863abf8f8287da8466cf470ef52690"},
-    {file = "yarl-1.18.3-cp310-cp310-win32.whl", hash = "sha256:8874027a53e3aea659a6d62751800cf6e63314c160fd607489ba5c2edd753cf6"},
-    {file = "yarl-1.18.3-cp310-cp310-win_amd64.whl", hash = "sha256:93b2e109287f93db79210f86deb6b9bbb81ac32fc97236b16f7433db7fc437d8"},
-    {file = "yarl-1.18.3-cp311-cp311-macosx_10_9_universal2.whl", hash = "sha256:8503ad47387b8ebd39cbbbdf0bf113e17330ffd339ba1144074da24c545f0069"},
-    {file = "yarl-1.18.3-cp311-cp311-macosx_10_9_x86_64.whl", hash = "sha256:02ddb6756f8f4517a2d5e99d8b2f272488e18dd0bfbc802f31c16c6c20f22193"},
-    {file = "yarl-1.18.3-cp311-cp311-macosx_11_0_arm64.whl", hash = "sha256:67a283dd2882ac98cc6318384f565bffc751ab564605959df4752d42483ad889"},
-    {file = "yarl-1.18.3-cp311-cp311-manylinux_2_17_aarch64.manylinux2014_aarch64.whl", hash = "sha256:d980e0325b6eddc81331d3f4551e2a333999fb176fd153e075c6d1c2530aa8a8"},
-    {file = "yarl-1.18.3-cp311-cp311-manylinux_2_17_ppc64le.manylinux2014_ppc64le.whl", hash = "sha256:b643562c12680b01e17239be267bc306bbc6aac1f34f6444d1bded0c5ce438ca"},
-    {file = "yarl-1.18.3-cp311-cp311-manylinux_2_17_s390x.manylinux2014_s390x.whl", hash = "sha256:c017a3b6df3a1bd45b9fa49a0f54005e53fbcad16633870104b66fa1a30a29d8"},
-    {file = "yarl-1.18.3-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl", hash = "sha256:75674776d96d7b851b6498f17824ba17849d790a44d282929c42dbb77d4f17ae"},
-    {file = "yarl-1.18.3-cp311-cp311-manylinux_2_5_i686.manylinux1_i686.manylinux_2_17_i686.manylinux2014_i686.whl", hash = "sha256:ccaa3a4b521b780a7e771cc336a2dba389a0861592bbce09a476190bb0c8b4b3"},
-    {file = "yarl-1.18.3-cp311-cp311-musllinux_1_2_aarch64.whl", hash = "sha256:2d06d3005e668744e11ed80812e61efd77d70bb7f03e33c1598c301eea20efbb"},
-    {file = "yarl-1.18.3-cp311-cp311-musllinux_1_2_armv7l.whl", hash = "sha256:9d41beda9dc97ca9ab0b9888cb71f7539124bc05df02c0cff6e5acc5a19dcc6e"},
-    {file = "yarl-1.18.3-cp311-cp311-musllinux_1_2_i686.whl", hash = "sha256:ba23302c0c61a9999784e73809427c9dbedd79f66a13d84ad1b1943802eaaf59"},
-    {file = "yarl-1.18.3-cp311-cp311-musllinux_1_2_ppc64le.whl", hash = "sha256:6748dbf9bfa5ba1afcc7556b71cda0d7ce5f24768043a02a58846e4a443d808d"},
-    {file = "yarl-1.18.3-cp311-cp311-musllinux_1_2_s390x.whl", hash = "sha256:0b0cad37311123211dc91eadcb322ef4d4a66008d3e1bdc404808992260e1a0e"},
-    {file = "yarl-1.18.3-cp311-cp311-musllinux_1_2_x86_64.whl", hash = "sha256:0fb2171a4486bb075316ee754c6d8382ea6eb8b399d4ec62fde2b591f879778a"},
-    {file = "yarl-1.18.3-cp311-cp311-win32.whl", hash = "sha256:61b1a825a13bef4a5f10b1885245377d3cd0bf87cba068e1d9a88c2ae36880e1"},
-    {file = "yarl-1.18.3-cp311-cp311-win_amd64.whl", hash = "sha256:b9d60031cf568c627d028239693fd718025719c02c9f55df0a53e587aab951b5"},
-    {file = "yarl-1.18.3-cp312-cp312-macosx_10_13_universal2.whl", hash = "sha256:1dd4bdd05407ced96fed3d7f25dbbf88d2ffb045a0db60dbc247f5b3c5c25d50"},
-    {file = "yarl-1.18.3-cp312-cp312-macosx_10_13_x86_64.whl", hash = "sha256:7c33dd1931a95e5d9a772d0ac5e44cac8957eaf58e3c8da8c1414de7dd27c576"},
-    {file = "yarl-1.18.3-cp312-cp312-macosx_11_0_arm64.whl", hash = "sha256:25b411eddcfd56a2f0cd6a384e9f4f7aa3efee14b188de13048c25b5e91f1640"},
-    {file = "yarl-1.18.3-cp312-cp312-manylinux_2_17_aarch64.manylinux2014_aarch64.whl", hash = "sha256:436c4fc0a4d66b2badc6c5fc5ef4e47bb10e4fd9bf0c79524ac719a01f3607c2"},
-    {file = "yarl-1.18.3-cp312-cp312-manylinux_2_17_ppc64le.manylinux2014_ppc64le.whl", hash = "sha256:e35ef8683211db69ffe129a25d5634319a677570ab6b2eba4afa860f54eeaf75"},
-    {file = "yarl-1.18.3-cp312-cp312-manylinux_2_17_s390x.manylinux2014_s390x.whl", hash = "sha256:84b2deecba4a3f1a398df819151eb72d29bfeb3b69abb145a00ddc8d30094512"},
-    {file = "yarl-1.18.3-cp312-cp312-manylinux_2_17_x86_64.manylinux2014_x86_64.whl", hash = "sha256:00e5a1fea0fd4f5bfa7440a47eff01d9822a65b4488f7cff83155a0f31a2ecba"},
-    {file = "yarl-1.18.3-cp312-cp312-manylinux_2_5_i686.manylinux1_i686.manylinux_2_17_i686.manylinux2014_i686.whl", hash = "sha256:d0e883008013c0e4aef84dcfe2a0b172c4d23c2669412cf5b3371003941f72bb"},
-    {file = "yarl-1.18.3-cp312-cp312-musllinux_1_2_aarch64.whl", hash = "sha256:5a3f356548e34a70b0172d8890006c37be92995f62d95a07b4a42e90fba54272"},
-    {file = "yarl-1.18.3-cp312-cp312-musllinux_1_2_armv7l.whl", hash = "sha256:ccd17349166b1bee6e529b4add61727d3f55edb7babbe4069b5764c9587a8cc6"},
-    {file = "yarl-1.18.3-cp312-cp312-musllinux_1_2_i686.whl", hash = "sha256:b958ddd075ddba5b09bb0be8a6d9906d2ce933aee81100db289badbeb966f54e"},
-    {file = "yarl-1.18.3-cp312-cp312-musllinux_1_2_ppc64le.whl", hash = "sha256:c7d79f7d9aabd6011004e33b22bc13056a3e3fb54794d138af57f5ee9d9032cb"},
-    {file = "yarl-1.18.3-cp312-cp312-musllinux_1_2_s390x.whl", hash = "sha256:4891ed92157e5430874dad17b15eb1fda57627710756c27422200c52d8a4e393"},
-    {file = "yarl-1.18.3-cp312-cp312-musllinux_1_2_x86_64.whl", hash = "sha256:ce1af883b94304f493698b00d0f006d56aea98aeb49d75ec7d98cd4a777e9285"},
-    {file = "yarl-1.18.3-cp312-cp312-win32.whl", hash = "sha256:f91c4803173928a25e1a55b943c81f55b8872f0018be83e3ad4938adffb77dd2"},
-    {file = "yarl-1.18.3-cp312-cp312-win_amd64.whl", hash = "sha256:7e2ee16578af3b52ac2f334c3b1f92262f47e02cc6193c598502bd46f5cd1477"},
-    {file = "yarl-1.18.3-cp313-cp313-macosx_10_13_universal2.whl", hash = "sha256:90adb47ad432332d4f0bc28f83a5963f426ce9a1a8809f5e584e704b82685dcb"},
-    {file = "yarl-1.18.3-cp313-cp313-macosx_10_13_x86_64.whl", hash = "sha256:913829534200eb0f789d45349e55203a091f45c37a2674678744ae52fae23efa"},
-    {file = "yarl-1.18.3-cp313-cp313-macosx_11_0_arm64.whl", hash = "sha256:ef9f7768395923c3039055c14334ba4d926f3baf7b776c923c93d80195624782"},
-    {file = "yarl-1.18.3-cp313-cp313-manylinux_2_17_aarch64.manylinux2014_aarch64.whl", hash = "sha256:88a19f62ff30117e706ebc9090b8ecc79aeb77d0b1f5ec10d2d27a12bc9f66d0"},
-    {file = "yarl-1.18.3-cp313-cp313-manylinux_2_17_ppc64le.manylinux2014_ppc64le.whl", hash = "sha256:e17c9361d46a4d5addf777c6dd5eab0715a7684c2f11b88c67ac37edfba6c482"},
-    {file = "yarl-1.18.3-cp313-cp313-manylinux_2_17_s390x.manylinux2014_s390x.whl", hash = "sha256:1a74a13a4c857a84a845505fd2d68e54826a2cd01935a96efb1e9d86c728e186"},
-    {file = "yarl-1.18.3-cp313-cp313-manylinux_2_17_x86_64.manylinux2014_x86_64.whl", hash = "sha256:41f7ce59d6ee7741af71d82020346af364949314ed3d87553763a2df1829cc58"},
-    {file = "yarl-1.18.3-cp313-cp313-manylinux_2_5_i686.manylinux1_i686.manylinux_2_17_i686.manylinux2014_i686.whl", hash = "sha256:f52a265001d830bc425f82ca9eabda94a64a4d753b07d623a9f2863fde532b53"},
-    {file = "yarl-1.18.3-cp313-cp313-musllinux_1_2_aarch64.whl", hash = "sha256:82123d0c954dc58db301f5021a01854a85bf1f3bb7d12ae0c01afc414a882ca2"},
-    {file = "yarl-1.18.3-cp313-cp313-musllinux_1_2_armv7l.whl", hash = "sha256:2ec9bbba33b2d00999af4631a3397d1fd78290c48e2a3e52d8dd72db3a067ac8"},
-    {file = "yarl-1.18.3-cp313-cp313-musllinux_1_2_i686.whl", hash = "sha256:fbd6748e8ab9b41171bb95c6142faf068f5ef1511935a0aa07025438dd9a9bc1"},
-    {file = "yarl-1.18.3-cp313-cp313-musllinux_1_2_ppc64le.whl", hash = "sha256:877d209b6aebeb5b16c42cbb377f5f94d9e556626b1bfff66d7b0d115be88d0a"},
-    {file = "yarl-1.18.3-cp313-cp313-musllinux_1_2_s390x.whl", hash = "sha256:b464c4ab4bfcb41e3bfd3f1c26600d038376c2de3297760dfe064d2cb7ea8e10"},
-    {file = "yarl-1.18.3-cp313-cp313-musllinux_1_2_x86_64.whl", hash = "sha256:8d39d351e7faf01483cc7ff7c0213c412e38e5a340238826be7e0e4da450fdc8"},
-    {file = "yarl-1.18.3-cp313-cp313-win32.whl", hash = "sha256:61ee62ead9b68b9123ec24bc866cbef297dd266175d53296e2db5e7f797f902d"},
-    {file = "yarl-1.18.3-cp313-cp313-win_amd64.whl", hash = "sha256:578e281c393af575879990861823ef19d66e2b1d0098414855dd367e234f5b3c"},
-    {file = "yarl-1.18.3-cp39-cp39-macosx_10_9_universal2.whl", hash = "sha256:61e5e68cb65ac8f547f6b5ef933f510134a6bf31bb178be428994b0cb46c2a04"},
-    {file = "yarl-1.18.3-cp39-cp39-macosx_10_9_x86_64.whl", hash = "sha256:fe57328fbc1bfd0bd0514470ac692630f3901c0ee39052ae47acd1d90a436719"},
-    {file = "yarl-1.18.3-cp39-cp39-macosx_11_0_arm64.whl", hash = "sha256:a440a2a624683108a1b454705ecd7afc1c3438a08e890a1513d468671d90a04e"},
-    {file = "yarl-1.18.3-cp39-cp39-manylinux_2_17_aarch64.manylinux2014_aarch64.whl", hash = "sha256:09c7907c8548bcd6ab860e5f513e727c53b4a714f459b084f6580b49fa1b9cee"},
-    {file = "yarl-1.18.3-cp39-cp39-manylinux_2_17_ppc64le.manylinux2014_ppc64le.whl", hash = "sha256:b4f6450109834af88cb4cc5ecddfc5380ebb9c228695afc11915a0bf82116789"},
-    {file = "yarl-1.18.3-cp39-cp39-manylinux_2_17_s390x.manylinux2014_s390x.whl", hash = "sha256:a9ca04806f3be0ac6d558fffc2fdf8fcef767e0489d2684a21912cc4ed0cd1b8"},
-    {file = "yarl-1.18.3-cp39-cp39-manylinux_2_17_x86_64.manylinux2014_x86_64.whl", hash = "sha256:77a6e85b90a7641d2e07184df5557132a337f136250caafc9ccaa4a2a998ca2c"},
-    {file = "yarl-1.18.3-cp39-cp39-manylinux_2_5_i686.manylinux1_i686.manylinux_2_17_i686.manylinux2014_i686.whl", hash = "sha256:6333c5a377c8e2f5fae35e7b8f145c617b02c939d04110c76f29ee3676b5f9a5"},
-    {file = "yarl-1.18.3-cp39-cp39-musllinux_1_2_aarch64.whl", hash = "sha256:0b3c92fa08759dbf12b3a59579a4096ba9af8dd344d9a813fc7f5070d86bbab1"},
-    {file = "yarl-1.18.3-cp39-cp39-musllinux_1_2_armv7l.whl", hash = "sha256:4ac515b860c36becb81bb84b667466885096b5fc85596948548b667da3bf9f24"},
-    {file = "yarl-1.18.3-cp39-cp39-musllinux_1_2_i686.whl", hash = "sha256:045b8482ce9483ada4f3f23b3774f4e1bf4f23a2d5c912ed5170f68efb053318"},
-    {file = "yarl-1.18.3-cp39-cp39-musllinux_1_2_ppc64le.whl", hash = "sha256:a4bb030cf46a434ec0225bddbebd4b89e6471814ca851abb8696170adb163985"},
-    {file = "yarl-1.18.3-cp39-cp39-musllinux_1_2_s390x.whl", hash = "sha256:54d6921f07555713b9300bee9c50fb46e57e2e639027089b1d795ecd9f7fa910"},
-    {file = "yarl-1.18.3-cp39-cp39-musllinux_1_2_x86_64.whl", hash = "sha256:1d407181cfa6e70077df3377938c08012d18893f9f20e92f7d2f314a437c30b1"},
-    {file = "yarl-1.18.3-cp39-cp39-win32.whl", hash = "sha256:ac36703a585e0929b032fbaab0707b75dc12703766d0b53486eabd5139ebadd5"},
-    {file = "yarl-1.18.3-cp39-cp39-win_amd64.whl", hash = "sha256:ba87babd629f8af77f557b61e49e7c7cac36f22f871156b91e10a6e9d4f829e9"},
-    {file = "yarl-1.18.3-py3-none-any.whl", hash = "sha256:b57f4f58099328dfb26c6a771d09fb20dbbae81d20cfb66141251ea063bd101b"},
-    {file = "yarl-1.18.3.tar.gz", hash = "sha256:ac1801c45cbf77b6c99242eeff4fffb5e4e73a800b5c4ad4fc0be5def634d2e1"},
+python-versions = ">=3.10"
+groups = ["main", "training"]
+files = [
+    {file = "yarl-1.24.2-cp310-cp310-macosx_10_9_universal2.whl", hash = "sha256:5249a113065c2b7a958bc699759e359cd61cfc81e3069662208f48f191b7ed12"},
+    {file = "yarl-1.24.2-cp310-cp310-macosx_10_9_x86_64.whl", hash = "sha256:7f4425fa244fbf530b006d0c5f79ce920114cfff5b4f5f6056e669f8e160fdc0"},
+    {file = "yarl-1.24.2-cp310-cp310-macosx_11_0_arm64.whl", hash = "sha256:15c0b5e49d3c44e2a0b93e6a49476c5edad0a7686b92c395765a7ea775572a75"},
+    {file = "yarl-1.24.2-cp310-cp310-manylinux2014_aarch64.manylinux_2_17_aarch64.manylinux_2_28_aarch64.whl", hash = "sha256:246d32a53a947c8f0189f5d699cbd4c7036de45d9359e13ba238d1239678c727"},
+    {file = "yarl-1.24.2-cp310-cp310-manylinux2014_armv7l.manylinux_2_17_armv7l.manylinux_2_31_armv7l.whl", hash = "sha256:64480fb3e4d4ed9ed71c48a91a477384fc342a50ca30071d2f8a88d51d9c9413"},
+    {file = "yarl-1.24.2-cp310-cp310-manylinux2014_ppc64le.manylinux_2_17_ppc64le.manylinux_2_28_ppc64le.whl", hash = "sha256:349de4701dc3760b6e876628423a8f147ef4f5599d10aba1e10702075d424ed9"},
+    {file = "yarl-1.24.2-cp310-cp310-manylinux2014_s390x.manylinux_2_17_s390x.manylinux_2_28_s390x.whl", hash = "sha256:d162677af8d5d3d6ebab8394b021f4d041ac107a4b705873148a77a49dc9e1b2"},
+    {file = "yarl-1.24.2-cp310-cp310-manylinux2014_x86_64.manylinux_2_17_x86_64.manylinux_2_28_x86_64.whl", hash = "sha256:f5f5c6ec23a9043f2d139cc072f53dd23168d202a334b9b2fda8de4c3e890d90"},
+    {file = "yarl-1.24.2-cp310-cp310-manylinux_2_31_riscv64.manylinux_2_39_riscv64.whl", hash = "sha256:60de6742447fbbf697f16f070b8a443f1b5fe6ca3826fbef9fe70ecd5328e643"},
+    {file = "yarl-1.24.2-cp310-cp310-musllinux_1_2_aarch64.whl", hash = "sha256:acf93187c3710e422368eb768aee98db551ec7c85adc250207a95c16548ab7ac"},
+    {file = "yarl-1.24.2-cp310-cp310-musllinux_1_2_armv7l.whl", hash = "sha256:f4b0352fd41fd34b6651934606268816afd6914d09626f9bcbbf018edb0afb3f"},
+    {file = "yarl-1.24.2-cp310-cp310-musllinux_1_2_ppc64le.whl", hash = "sha256:6b208bb939099b4b297438da4e9b25357f0b1c791888669b963e45b203ea9f36"},
+    {file = "yarl-1.24.2-cp310-cp310-musllinux_1_2_riscv64.whl", hash = "sha256:4b85b8825e631295ff4bc8943f7471d54c533a9360bbe15ebb38e018b555bb8a"},
+    {file = "yarl-1.24.2-cp310-cp310-musllinux_1_2_s390x.whl", hash = "sha256:e26acf20c26cb4fefc631fdb75aca2a6b8fa8b7b5d7f204fb6a8f1e63c706f53"},
+    {file = "yarl-1.24.2-cp310-cp310-musllinux_1_2_x86_64.whl", hash = "sha256:819ca24f8eafcfb683c1bd5f44f2f488cea1274eb8944731ffd2e1f10f619342"},
+    {file = "yarl-1.24.2-cp310-cp310-win_amd64.whl", hash = "sha256:5cb0f995a901c36be096ccbf4c673591c2faabbe96279598ffaec8c030f85bf4"},
+    {file = "yarl-1.24.2-cp310-cp310-win_arm64.whl", hash = "sha256:f408eace7e22a68b467a0562e0d27d322f91fe3eaaa6f466b962c6cfaea9fa39"},
+    {file = "yarl-1.24.2-cp311-cp311-macosx_10_9_universal2.whl", hash = "sha256:36348bebb147b83818b9d7e673ea4debc75970afc6ffdc7e3975ad05ce5a58c1"},
+    {file = "yarl-1.24.2-cp311-cp311-macosx_10_9_x86_64.whl", hash = "sha256:1a97e42c8a2233f2f279ecadd9e4a037bcb5d813b78435e8eedd4db5a9e9708c"},
+    {file = "yarl-1.24.2-cp311-cp311-macosx_11_0_arm64.whl", hash = "sha256:8d027d56f1035e339d1001ac33eceab5b2ec8e42e449787bb75e289fb9a5cd1d"},
+    {file = "yarl-1.24.2-cp311-cp311-manylinux2014_aarch64.manylinux_2_17_aarch64.manylinux_2_28_aarch64.whl", hash = "sha256:0a6377060e7927187a42b7eb202090cbe2b34933a4eeaf90e3bd9e33432e5cae"},
+    {file = "yarl-1.24.2-cp311-cp311-manylinux2014_armv7l.manylinux_2_17_armv7l.manylinux_2_31_armv7l.whl", hash = "sha256:17076578bce0049a5ce57d14ad1bded391b68a3b213e9b81b0097b090244999a"},
+    {file = "yarl-1.24.2-cp311-cp311-manylinux2014_ppc64le.manylinux_2_17_ppc64le.manylinux_2_28_ppc64le.whl", hash = "sha256:50713f1d4d6be6375bb178bb43d140ee1acb8abe589cd723320b7925a275be1e"},
+    {file = "yarl-1.24.2-cp311-cp311-manylinux2014_s390x.manylinux_2_17_s390x.manylinux_2_28_s390x.whl", hash = "sha256:34263e2fa8fb5bb63a0d97706cda38edbad62fddb58c7f12d6acbc092812aa50"},
+    {file = "yarl-1.24.2-cp311-cp311-manylinux2014_x86_64.manylinux_2_17_x86_64.manylinux_2_28_x86_64.whl", hash = "sha256:49016d82f032b1bd1e10b01078a7d29ae71bf468eeae0ea22df8bab691e60003"},
+    {file = "yarl-1.24.2-cp311-cp311-manylinux_2_31_riscv64.manylinux_2_39_riscv64.whl", hash = "sha256:3f6d2c216318f8f32038ca3f72501ba08536f0fd18a36e858836b121b2deed9f"},
+    {file = "yarl-1.24.2-cp311-cp311-musllinux_1_2_aarch64.whl", hash = "sha256:08d3a33218e0c64393e7610284e770409a9c31c429b078bcb24096ed0a783b8f"},
+    {file = "yarl-1.24.2-cp311-cp311-musllinux_1_2_armv7l.whl", hash = "sha256:5d699376c4ca3cba49bbfae3a05b5b70ded572937171ce1e0b8d87118e2ba294"},
+    {file = "yarl-1.24.2-cp311-cp311-musllinux_1_2_ppc64le.whl", hash = "sha256:a1cab588b4fa14bea2e55ebea27478adfb05372f47573738e1acc4a36c0b05d2"},
+    {file = "yarl-1.24.2-cp311-cp311-musllinux_1_2_riscv64.whl", hash = "sha256:ec87ccc31bd21db7ad009d8572c127c1000f268517618a4cc09adba3c2a7f21c"},
+    {file = "yarl-1.24.2-cp311-cp311-musllinux_1_2_s390x.whl", hash = "sha256:d1dd47a22843b212baa8d74f37796815d43bd046b42a0f41e9da433386c3136b"},
+    {file = "yarl-1.24.2-cp311-cp311-musllinux_1_2_x86_64.whl", hash = "sha256:7b54b9c67c2b06bd7b9a77253d242124b9c95d2c02def5a1144001ee547dd9d5"},
+    {file = "yarl-1.24.2-cp311-cp311-win_amd64.whl", hash = "sha256:f8fdbcff8b2c7c9284e60c196f693588598ddcee31e11c18e14949ce44519d45"},
+    {file = "yarl-1.24.2-cp311-cp311-win_arm64.whl", hash = "sha256:b32c37a7a337e90822c45797bf3d79d60875cfcccd3ecc80e9f453d87026c122"},
+    {file = "yarl-1.24.2-cp312-cp312-macosx_10_13_universal2.whl", hash = "sha256:b975866c184564c827e0877380f0dae57dcca7e52782128381b72feff6dfceb8"},
+    {file = "yarl-1.24.2-cp312-cp312-macosx_10_13_x86_64.whl", hash = "sha256:3b075301a2836a0e297b1b658cb6d6135df535d62efefdd60366bd589c2c82f2"},
+    {file = "yarl-1.24.2-cp312-cp312-macosx_11_0_arm64.whl", hash = "sha256:8ae44649b00947634ab0dab2a374a638f52923a6e67083f2c156cd5cbd1a881d"},
+    {file = "yarl-1.24.2-cp312-cp312-manylinux2014_aarch64.manylinux_2_17_aarch64.manylinux_2_28_aarch64.whl", hash = "sha256:507cc19f0b45454e2d6dcd62ff7d062b9f77a2812404e62dbdaec05b50faa035"},
+    {file = "yarl-1.24.2-cp312-cp312-manylinux2014_armv7l.manylinux_2_17_armv7l.manylinux_2_31_armv7l.whl", hash = "sha256:c4c17bad5a530912d2111825d3f05e89bab2dd376aaa8cbc77e449e6db63e576"},
+    {file = "yarl-1.24.2-cp312-cp312-manylinux2014_ppc64le.manylinux_2_17_ppc64le.manylinux_2_28_ppc64le.whl", hash = "sha256:f5f0cbb112838a4a293985b6ed73948a547dadcc1ba6d2089938e7abdedceef8"},
+    {file = "yarl-1.24.2-cp312-cp312-manylinux2014_s390x.manylinux_2_17_s390x.manylinux_2_28_s390x.whl", hash = "sha256:5ec8356b8a6afcf81fc7aeeef13b1ff7a49dec00f313394bbb9e83830d32ccd7"},
+    {file = "yarl-1.24.2-cp312-cp312-manylinux2014_x86_64.manylinux_2_17_x86_64.manylinux_2_28_x86_64.whl", hash = "sha256:7e7ebcdef69dec6c6451e616f32b622a6d4a2e92b445c992f7c8e5274a6bbc4c"},
+    {file = "yarl-1.24.2-cp312-cp312-manylinux_2_31_riscv64.manylinux_2_39_riscv64.whl", hash = "sha256:47a55d6cf6db2f401017a9e96e5288844e5051911fb4e0c8311a3980f5e59a7d"},
+    {file = "yarl-1.24.2-cp312-cp312-musllinux_1_2_aarch64.whl", hash = "sha256:3065657c80a2321225e804048597ad55658a7e76b32d6f5ee4074d04c50401db"},
+    {file = "yarl-1.24.2-cp312-cp312-musllinux_1_2_armv7l.whl", hash = "sha256:cb84b80d88e19ede158619b80813968713d8d008b0e2497a576e6a0557d50712"},
+    {file = "yarl-1.24.2-cp312-cp312-musllinux_1_2_ppc64le.whl", hash = "sha256:990de4f680b1c217e77ff0d6aa0029f9eb79889c11fb3e9a3942c7eba29c1996"},
+    {file = "yarl-1.24.2-cp312-cp312-musllinux_1_2_riscv64.whl", hash = "sha256:abb8ec0323b80161e3802da3150ef660b41d0e9be2048b76a363d93eee992c2b"},
+    {file = "yarl-1.24.2-cp312-cp312-musllinux_1_2_s390x.whl", hash = "sha256:e7977781f83638a4c73e0f88425563d70173e0dfd90ac006a45c65036293ee3c"},
+    {file = "yarl-1.24.2-cp312-cp312-musllinux_1_2_x86_64.whl", hash = "sha256:e30dd55825dc554ec5b66a94953b8eda8745926514c5089dfcacecb9c99b5bd1"},
+    {file = "yarl-1.24.2-cp312-cp312-win_amd64.whl", hash = "sha256:7dafe10c12ddd4d120d528c4b5599c953bd7b12845347d507b95451195bb6cad"},
+    {file = "yarl-1.24.2-cp312-cp312-win_arm64.whl", hash = "sha256:044a09d8401fcf8681977faef6d286b8ade1e2d2e9dceda175d1cfa5ca496f30"},
+    {file = "yarl-1.24.2-cp313-cp313-macosx_10_13_universal2.whl", hash = "sha256:491ac9141decf49ee8030199e1ee251cdff0e131f25678817ff6aa5f837a3536"},
+    {file = "yarl-1.24.2-cp313-cp313-macosx_10_13_x86_64.whl", hash = "sha256:e89418f65eda18f99030386305bd44d7d504e328a7945db1ead514fbe03a0607"},
+    {file = "yarl-1.24.2-cp313-cp313-macosx_11_0_arm64.whl", hash = "sha256:cdfcce633b4a4bb8281913c57fcafd4b5933fbc19111a5e3930bbd299d6102f1"},
+    {file = "yarl-1.24.2-cp313-cp313-manylinux2014_aarch64.manylinux_2_17_aarch64.manylinux_2_28_aarch64.whl", hash = "sha256:863297ddede92ee49024e9a9b11ecb59f310ca85b60d8537f56bed9bbb5b1986"},
+    {file = "yarl-1.24.2-cp313-cp313-manylinux2014_armv7l.manylinux_2_17_armv7l.manylinux_2_31_armv7l.whl", hash = "sha256:374423f70754a2c96942ede36a29d37dc6b0cb8f92f8d009ddf3ed78d3da5488"},
+    {file = "yarl-1.24.2-cp313-cp313-manylinux2014_ppc64le.manylinux_2_17_ppc64le.manylinux_2_28_ppc64le.whl", hash = "sha256:33a29b5d00ccbf3219bb3e351d7875739c19481e030779f48cc46a7a71681a9b"},
+    {file = "yarl-1.24.2-cp313-cp313-manylinux2014_s390x.manylinux_2_17_s390x.manylinux_2_28_s390x.whl", hash = "sha256:a9532c57211730c515341af11fef6e9b61d157487272a096d0c04da445642592"},
+    {file = "yarl-1.24.2-cp313-cp313-manylinux2014_x86_64.manylinux_2_17_x86_64.manylinux_2_28_x86_64.whl", hash = "sha256:91e72cf093fd833483a97ee648e0c053c7c629f51ff4a0e7edd84f806b0c5617"},
+    {file = "yarl-1.24.2-cp313-cp313-manylinux_2_31_riscv64.manylinux_2_39_riscv64.whl", hash = "sha256:b3177bc0a768ef3bacceb4f272632990b7bea352f1b2f1eee9d6d6ff16516f92"},
+    {file = "yarl-1.24.2-cp313-cp313-musllinux_1_2_aarch64.whl", hash = "sha256:e196952aacaf3b232e265ff02980b64d483dc0972bd49bcb061171ff22ac203a"},
+    {file = "yarl-1.24.2-cp313-cp313-musllinux_1_2_armv7l.whl", hash = "sha256:204e7a61ce99919c0de1bf904ab5d7aa188a129ea8f690a8f76cfb6e2844dc44"},
+    {file = "yarl-1.24.2-cp313-cp313-musllinux_1_2_ppc64le.whl", hash = "sha256:4b156914620f0b9d78dc1adb3751141daee561cfec796088abb89ed49d220f1a"},
+    {file = "yarl-1.24.2-cp313-cp313-musllinux_1_2_riscv64.whl", hash = "sha256:8372a2b976cf70654b2be6619ab6068acabb35f724c0fda7b277fbf53d66a5cf"},
+    {file = "yarl-1.24.2-cp313-cp313-musllinux_1_2_s390x.whl", hash = "sha256:f9a1e9b622ca284143aab5d885848686dcd85453bb1ca9abcdb7503e64dc0056"},
+    {file = "yarl-1.24.2-cp313-cp313-musllinux_1_2_x86_64.whl", hash = "sha256:810e19b685c8c3c5862f6a38160a1f4e4c0916c9390024ec347b6157a45a0992"},
+    {file = "yarl-1.24.2-cp313-cp313-win_amd64.whl", hash = "sha256:7d37fb7c38f2b6edab0f845c4f85148d4c44204f52bc127021bd2bc9fdbf1656"},
+    {file = "yarl-1.24.2-cp313-cp313-win_arm64.whl", hash = "sha256:1e831894be7c2954240e49791fa4b50c05a0dc881de2552cfe3ffd8631c7f461"},
+    {file = "yarl-1.24.2-cp314-cp314-macosx_10_15_universal2.whl", hash = "sha256:f9312b3c02d9b3d23840f67952913c9c8721d7f1b7db305289faefa878f364c2"},
+    {file = "yarl-1.24.2-cp314-cp314-macosx_10_15_x86_64.whl", hash = "sha256:a4f4d6cd615823bfc7fb7e9b5987c3f41666371d870d51058f77e2680fbe9630"},
+    {file = "yarl-1.24.2-cp314-cp314-macosx_11_0_arm64.whl", hash = "sha256:0c3063e5c0a8e8e62fae6c2596fa01da1561e4cd1da6fec5789f5cf99a8aefd8"},
+    {file = "yarl-1.24.2-cp314-cp314-manylinux2014_aarch64.manylinux_2_17_aarch64.manylinux_2_28_aarch64.whl", hash = "sha256:fecd17873a096036c1c87ab3486f1aef7f269ada7f23f7f856f93b1cc7744f14"},
+    {file = "yarl-1.24.2-cp314-cp314-manylinux2014_armv7l.manylinux_2_17_armv7l.manylinux_2_31_armv7l.whl", hash = "sha256:a46d1ab4ba4d32e6dc80daf8a28ce0bd83d08df52fbc32f3e288663427734535"},
+    {file = "yarl-1.24.2-cp314-cp314-manylinux2014_ppc64le.manylinux_2_17_ppc64le.manylinux_2_28_ppc64le.whl", hash = "sha256:73e68edf6dfd5f73f9ca127d84e2a6f9213c65bdffb736bda19524c0564fcd14"},
+    {file = "yarl-1.24.2-cp314-cp314-manylinux2014_s390x.manylinux_2_17_s390x.manylinux_2_28_s390x.whl", hash = "sha256:a296ca617f2d25fbceafb962b88750d627e5984e75732c712154d058ae8d79a3"},
+    {file = "yarl-1.24.2-cp314-cp314-manylinux2014_x86_64.manylinux_2_17_x86_64.manylinux_2_28_x86_64.whl", hash = "sha256:e51b2cf5ec89a8b8470177641ed62a3ba22d74e1e898e06ad53aa77972487208"},
+    {file = "yarl-1.24.2-cp314-cp314-manylinux_2_31_riscv64.manylinux_2_39_riscv64.whl", hash = "sha256:310fc687f7b2044ec54e372c8cbe923bb88f5c37bded0d3079e5791c2fc3cf50"},
+    {file = "yarl-1.24.2-cp314-cp314-musllinux_1_2_aarch64.whl", hash = "sha256:297a2fe352ecf858b30a98f87948746ec16f001d279f84aebdbd3bd965e2f1bd"},
+    {file = "yarl-1.24.2-cp314-cp314-musllinux_1_2_armv7l.whl", hash = "sha256:2a263e76b97bc42bdcd7c5f4953dec1f7cd62a1112fa7f869e57255229390d67"},
+    {file = "yarl-1.24.2-cp314-cp314-musllinux_1_2_ppc64le.whl", hash = "sha256:822519b64cf0b474f1a0aaef1dc621438ea46bb77c94df97a5b4d213a7d8a8b1"},
+    {file = "yarl-1.24.2-cp314-cp314-musllinux_1_2_riscv64.whl", hash = "sha256:b6067060d9dc594899ba83e6db6c48c68d1e494a6dab158156ed86977ca7bcb1"},
+    {file = "yarl-1.24.2-cp314-cp314-musllinux_1_2_s390x.whl", hash = "sha256:0063adad533e57171b79db3943b229d40dfafeeee579767f96541f106bac5f1b"},
+    {file = "yarl-1.24.2-cp314-cp314-musllinux_1_2_x86_64.whl", hash = "sha256:ee8e3fb34513e8dc082b586ef4910c98335d43a6fab688cd44d4851bacfce3e8"},
+    {file = "yarl-1.24.2-cp314-cp314-win_amd64.whl", hash = "sha256:afb00d7fd8e0f285ca29a44cc50df2d622ff2f7a6d933fa641577b5f9d5f3db0"},
+    {file = "yarl-1.24.2-cp314-cp314-win_arm64.whl", hash = "sha256:68cf6eacd6028ef1142bc4b48376b81566385ca6f9e7dde3b0fa91be08ffcb57"},
+    {file = "yarl-1.24.2-cp314-cp314t-macosx_10_15_universal2.whl", hash = "sha256:221ce1dd921ac4f603957f17d7c18c5cc0797fbb52f156941f92e04605d1d67b"},
+    {file = "yarl-1.24.2-cp314-cp314t-macosx_10_15_x86_64.whl", hash = "sha256:5f3224db28173a00d7afacdee07045cc4673dfab2b15492c7ae10deddbece761"},
+    {file = "yarl-1.24.2-cp314-cp314t-macosx_11_0_arm64.whl", hash = "sha256:c557165320d6244ebe3a02431b2a201a20080e02f41f0cfa0ccc47a183765da8"},
+    {file = "yarl-1.24.2-cp314-cp314t-manylinux2014_aarch64.manylinux_2_17_aarch64.manylinux_2_28_aarch64.whl", hash = "sha256:904065e6e85b1fa54d0d87438bd58c14c0bad97aad654ad1077fd9d87e8478ed"},
+    {file = "yarl-1.24.2-cp314-cp314t-manylinux2014_armv7l.manylinux_2_17_armv7l.manylinux_2_31_armv7l.whl", hash = "sha256:8cec2a38d70edc10e0e856ceda886af5327a017ccbde8e1de1bd44d300357543"},
+    {file = "yarl-1.24.2-cp314-cp314t-manylinux2014_ppc64le.manylinux_2_17_ppc64le.manylinux_2_28_ppc64le.whl", hash = "sha256:e7484b9361ed222ee1ca5b4337aa4cbdcc4618ce5aff57d9ef1582fd95893fc0"},
+    {file = "yarl-1.24.2-cp314-cp314t-manylinux2014_s390x.manylinux_2_17_s390x.manylinux_2_28_s390x.whl", hash = "sha256:84f9670b89f34db07f81e53aee83e0b938a3412329d51c8f922488be7fcc4024"},
+    {file = "yarl-1.24.2-cp314-cp314t-manylinux2014_x86_64.manylinux_2_17_x86_64.manylinux_2_28_x86_64.whl", hash = "sha256:abb2759733d63a28b4956500a5dd57140f26486c92b2caedfb964ab7d9b79dbf"},
+    {file = "yarl-1.24.2-cp314-cp314t-manylinux_2_31_riscv64.manylinux_2_39_riscv64.whl", hash = "sha256:081c2bf54efe03774d0311172bc04fedf9ca01e644d4cd8c805688e527209bdc"},
+    {file = "yarl-1.24.2-cp314-cp314t-musllinux_1_2_aarch64.whl", hash = "sha256:86746bef442aa479107fe28132e1277237f9c24c2f00b0b0cf22b3ee0904f2bb"},
+    {file = "yarl-1.24.2-cp314-cp314t-musllinux_1_2_armv7l.whl", hash = "sha256:2d07d21d0bc4b17558e8de0b02fbfdf1e347d3bb3699edd00bb92e7c57925420"},
+    {file = "yarl-1.24.2-cp314-cp314t-musllinux_1_2_ppc64le.whl", hash = "sha256:4fb1ac3fc5fecd8ae7453ea237e4d22b49befa70266dfe1629924245c21a0c7f"},
+    {file = "yarl-1.24.2-cp314-cp314t-musllinux_1_2_riscv64.whl", hash = "sha256:4da31a5512ed1729ca8d8aacde3f7faeb8843cde3165d6bcf7f88f74f17bb8aa"},
+    {file = "yarl-1.24.2-cp314-cp314t-musllinux_1_2_s390x.whl", hash = "sha256:533ded4dceb5f1f3da7906244f4e82cf46cfd40d84c69a1faf5ac506aa65ecbe"},
+    {file = "yarl-1.24.2-cp314-cp314t-musllinux_1_2_x86_64.whl", hash = "sha256:7b3a85525f6e7eeabcfdd372862b21ee1915db1b498a04e8bf0e389b607ff0bd"},
+    {file = "yarl-1.24.2-cp314-cp314t-win_amd64.whl", hash = "sha256:a7624b1ca46ca5d7b864ef0d2f8efe3091454085ee1855b4e992314529972215"},
+    {file = "yarl-1.24.2-cp314-cp314t-win_arm64.whl", hash = "sha256:e434a45ce2e7a947f951fc5a8944c8cc080b7e59f9c50ae80fd39107cf88126d"},
+    {file = "yarl-1.24.2-py3-none-any.whl", hash = "sha256:2783d9226db8797636cd6896e4de81feed252d1db72265686c9558d97a4d94b9"},
+    {file = "yarl-1.24.2.tar.gz", hash = "sha256:9ac374123c6fd7abf64d1fec93962b0bd4ee2c19751755a762a72dd96c0378f8"},
 ]
 
 [package.dependencies]
 idna = ">=2.0"
 multidict = ">=4.0"
-propcache = ">=0.2.0"
+propcache = ">=0.2.1"
 
 [[package]]
 name = "yunchang"
-version = "0.6.1"
+version = "0.6.4"
 description = "a package for long context attention"
-optional = false
+optional = true
 python-versions = ">=3.7"
 groups = ["main"]
+markers = "extra == \"cuda\""
 files = [
-    {file = "yunchang-0.6.1-py3-none-any.whl", hash = "sha256:20dfc6502a02cace5eee60dd97f03ac6bb496b34a35e9e4e6fd8ce566f7444b1"},
-    {file = "yunchang-0.6.1.tar.gz", hash = "sha256:068471774038b8bb846335b025ab9cc1ddb8d1660c108115bbdb02318d358b45"},
+    {file = "yunchang-0.6.4-py3-none-any.whl", hash = "sha256:cce4295058e9de2c0592d69cfe1e3e679711def5f315fd1b68dbed2d54bc9255"},
+    {file = "yunchang-0.6.4.tar.gz", hash = "sha256:9493ba28cd0f0daa3871f0c80a4876b866ead5db48e475708569d476804736f4"},
 ]
 
+[package.dependencies]
+torch = ">=2.3.0"
+
 [package.extras]
 flash = ["flash-attn (>=2.6.0)"]
 
 [[package]]
 name = "zipp"
-version = "3.21.0"
+version = "4.1.0"
 description = "Backport of pathlib-compatible object wrapper for zip files"
 optional = false
-python-versions = ">=3.9"
+python-versions = ">=3.10"
 groups = ["main"]
 files = [
-    {file = "zipp-3.21.0-py3-none-any.whl", hash = "sha256:ac1bbe05fd2991f160ebce24ffbac5f6d11d83dc90891255885223d42b3cd931"},
-    {file = "zipp-3.21.0.tar.gz", hash = "sha256:2c9958f6430a2040341a52eb608ed6dd93ef4392e02ffe219417c1b28b5dd1f4"},
+    {file = "zipp-4.1.0-py3-none-any.whl", hash = "sha256:25ad4e16390cd314347dd8f1de67a2ac538ae658ed4ab9db16029c07c188e97f"},
+    {file = "zipp-4.1.0.tar.gz", hash = "sha256:4cb57381f544315db7688e976e922a2b18cdb513d21cc194eb42232ba2a3e602"},
 ]
 
 [package.extras]
-check = ["pytest-checkdocs (>=2.4)", "pytest-ruff (>=0.2.1) ; sys_platform != \"cygwin\""]
+check = ["pytest-checkdocs (>=2.14)", "pytest-ruff (>=0.2.1) ; sys_platform != \"cygwin\""]
 cover = ["pytest-cov"]
 doc = ["furo", "jaraco.packaging (>=9.3)", "jaraco.tidelift (>=1.4)", "rst.linker (>=1.9)", "sphinx (>=3.5)", "sphinx-lint"]
-enabler = ["pytest-enabler (>=2.2)"]
-test = ["big-O", "importlib-resources ; python_version < \"3.9\"", "jaraco.functools", "jaraco.itertools", "jaraco.test", "more-itertools", "pytest (>=6,!=8.1.*)", "pytest-ignore-flaky"]
-type = ["pytest-mypy"]
-
-[[package]]
-name = "zope-deprecation"
-version = "5.1"
-description = "Zope Deprecation Infrastructure"
-optional = false
-python-versions = ">=3.9"
-groups = ["main"]
-files = [
-    {file = "zope.deprecation-5.1-py3-none-any.whl", hash = "sha256:60f957b964d8f947a4a592c647d51ce0f4f844d1f041657956ddde0d9fa9a76a"},
-    {file = "zope_deprecation-5.1.tar.gz", hash = "sha256:46bed4611fb53edc731aadeb64b28308bcb848f4cc150c60c948d078f7108721"},
-]
-
-[package.dependencies]
-setuptools = "*"
-
-[package.extras]
-docs = ["Sphinx"]
-test = ["zope.testrunner"]
-
-[[package]]
-name = "zope-interface"
-version = "7.2"
-description = "Interfaces for Python"
-optional = false
-python-versions = ">=3.8"
-groups = ["main"]
-files = [
-    {file = "zope.interface-7.2-cp310-cp310-macosx_10_9_x86_64.whl", hash = "sha256:ce290e62229964715f1011c3dbeab7a4a1e4971fd6f31324c4519464473ef9f2"},
-    {file = "zope.interface-7.2-cp310-cp310-macosx_11_0_arm64.whl", hash = "sha256:05b910a5afe03256b58ab2ba6288960a2892dfeef01336dc4be6f1b9ed02ab0a"},
-    {file = "zope.interface-7.2-cp310-cp310-manylinux_2_17_aarch64.manylinux2014_aarch64.whl", hash = "sha256:550f1c6588ecc368c9ce13c44a49b8d6b6f3ca7588873c679bd8fd88a1b557b6"},
-    {file = "zope.interface-7.2-cp310-cp310-manylinux_2_5_i686.manylinux1_i686.manylinux_2_17_i686.manylinux2014_i686.whl", hash = "sha256:0ef9e2f865721553c6f22a9ff97da0f0216c074bd02b25cf0d3af60ea4d6931d"},
-    {file = "zope.interface-7.2-cp310-cp310-manylinux_2_5_x86_64.manylinux1_x86_64.manylinux_2_17_x86_64.manylinux2014_x86_64.whl", hash = "sha256:27f926f0dcb058211a3bb3e0e501c69759613b17a553788b2caeb991bed3b61d"},
-    {file = "zope.interface-7.2-cp310-cp310-win_amd64.whl", hash = "sha256:144964649eba4c5e4410bb0ee290d338e78f179cdbfd15813de1a664e7649b3b"},
-    {file = "zope.interface-7.2-cp311-cp311-macosx_10_9_x86_64.whl", hash = "sha256:1909f52a00c8c3dcab6c4fad5d13de2285a4b3c7be063b239b8dc15ddfb73bd2"},
-    {file = "zope.interface-7.2-cp311-cp311-macosx_11_0_arm64.whl", hash = "sha256:80ecf2451596f19fd607bb09953f426588fc1e79e93f5968ecf3367550396b22"},
-    {file = "zope.interface-7.2-cp311-cp311-manylinux_2_17_aarch64.manylinux2014_aarch64.whl", hash = "sha256:033b3923b63474800b04cba480b70f6e6243a62208071fc148354f3f89cc01b7"},
-    {file = "zope.interface-7.2-cp311-cp311-manylinux_2_5_i686.manylinux1_i686.manylinux_2_17_i686.manylinux2014_i686.whl", hash = "sha256:a102424e28c6b47c67923a1f337ede4a4c2bba3965b01cf707978a801fc7442c"},
-    {file = "zope.interface-7.2-cp311-cp311-manylinux_2_5_x86_64.manylinux1_x86_64.manylinux_2_17_x86_64.manylinux2014_x86_64.whl", hash = "sha256:25e6a61dcb184453bb00eafa733169ab6d903e46f5c2ace4ad275386f9ab327a"},
-    {file = "zope.interface-7.2-cp311-cp311-win_amd64.whl", hash = "sha256:3f6771d1647b1fc543d37640b45c06b34832a943c80d1db214a37c31161a93f1"},
-    {file = "zope.interface-7.2-cp312-cp312-macosx_10_9_x86_64.whl", hash = "sha256:086ee2f51eaef1e4a52bd7d3111a0404081dadae87f84c0ad4ce2649d4f708b7"},
-    {file = "zope.interface-7.2-cp312-cp312-macosx_11_0_arm64.whl", hash = "sha256:21328fcc9d5b80768bf051faa35ab98fb979080c18e6f84ab3f27ce703bce465"},
-    {file = "zope.interface-7.2-cp312-cp312-manylinux_2_17_aarch64.manylinux2014_aarch64.whl", hash = "sha256:f6dd02ec01f4468da0f234da9d9c8545c5412fef80bc590cc51d8dd084138a89"},
-    {file = "zope.interface-7.2-cp312-cp312-manylinux_2_5_i686.manylinux1_i686.manylinux_2_17_i686.manylinux2014_i686.whl", hash = "sha256:8e7da17f53e25d1a3bde5da4601e026adc9e8071f9f6f936d0fe3fe84ace6d54"},
-    {file = "zope.interface-7.2-cp312-cp312-manylinux_2_5_x86_64.manylinux1_x86_64.manylinux_2_17_x86_64.manylinux2014_x86_64.whl", hash = "sha256:cab15ff4832580aa440dc9790b8a6128abd0b88b7ee4dd56abacbc52f212209d"},
-    {file = "zope.interface-7.2-cp312-cp312-win_amd64.whl", hash = "sha256:29caad142a2355ce7cfea48725aa8bcf0067e2b5cc63fcf5cd9f97ad12d6afb5"},
-    {file = "zope.interface-7.2-cp313-cp313-macosx_10_9_x86_64.whl", hash = "sha256:3e0350b51e88658d5ad126c6a57502b19d5f559f6cb0a628e3dc90442b53dd98"},
-    {file = "zope.interface-7.2-cp313-cp313-macosx_11_0_arm64.whl", hash = "sha256:15398c000c094b8855d7d74f4fdc9e73aa02d4d0d5c775acdef98cdb1119768d"},
-    {file = "zope.interface-7.2-cp313-cp313-manylinux_2_17_aarch64.manylinux2014_aarch64.whl", hash = "sha256:802176a9f99bd8cc276dcd3b8512808716492f6f557c11196d42e26c01a69a4c"},
-    {file = "zope.interface-7.2-cp313-cp313-manylinux_2_5_i686.manylinux1_i686.manylinux_2_17_i686.manylinux2014_i686.whl", hash = "sha256:eb23f58a446a7f09db85eda09521a498e109f137b85fb278edb2e34841055398"},
-    {file = "zope.interface-7.2-cp313-cp313-manylinux_2_5_x86_64.manylinux1_x86_64.manylinux_2_17_x86_64.manylinux2014_x86_64.whl", hash = "sha256:a71a5b541078d0ebe373a81a3b7e71432c61d12e660f1d67896ca62d9628045b"},
-    {file = "zope.interface-7.2-cp313-cp313-win_amd64.whl", hash = "sha256:4893395d5dd2ba655c38ceb13014fd65667740f09fa5bb01caa1e6284e48c0cd"},
-    {file = "zope.interface-7.2-cp38-cp38-macosx_10_9_x86_64.whl", hash = "sha256:d3a8ffec2a50d8ec470143ea3d15c0c52d73df882eef92de7537e8ce13475e8a"},
-    {file = "zope.interface-7.2-cp38-cp38-macosx_11_0_arm64.whl", hash = "sha256:31d06db13a30303c08d61d5fb32154be51dfcbdb8438d2374ae27b4e069aac40"},
-    {file = "zope.interface-7.2-cp38-cp38-manylinux_2_17_aarch64.manylinux2014_aarch64.whl", hash = "sha256:e204937f67b28d2dca73ca936d3039a144a081fc47a07598d44854ea2a106239"},
-    {file = "zope.interface-7.2-cp38-cp38-manylinux_2_5_i686.manylinux1_i686.manylinux_2_17_i686.manylinux2014_i686.whl", hash = "sha256:224b7b0314f919e751f2bca17d15aad00ddbb1eadf1cb0190fa8175edb7ede62"},
-    {file = "zope.interface-7.2-cp38-cp38-manylinux_2_5_x86_64.manylinux1_x86_64.manylinux_2_17_x86_64.manylinux2014_x86_64.whl", hash = "sha256:baf95683cde5bc7d0e12d8e7588a3eb754d7c4fa714548adcd96bdf90169f021"},
-    {file = "zope.interface-7.2-cp38-cp38-win_amd64.whl", hash = "sha256:7dc5016e0133c1a1ec212fc87a4f7e7e562054549a99c73c8896fa3a9e80cbc7"},
-    {file = "zope.interface-7.2-cp39-cp39-macosx_10_9_x86_64.whl", hash = "sha256:7bd449c306ba006c65799ea7912adbbfed071089461a19091a228998b82b1fdb"},
-    {file = "zope.interface-7.2-cp39-cp39-macosx_11_0_arm64.whl", hash = "sha256:a19a6cc9c6ce4b1e7e3d319a473cf0ee989cbbe2b39201d7c19e214d2dfb80c7"},
-    {file = "zope.interface-7.2-cp39-cp39-manylinux_2_17_aarch64.manylinux2014_aarch64.whl", hash = "sha256:72cd1790b48c16db85d51fbbd12d20949d7339ad84fd971427cf00d990c1f137"},
-    {file = "zope.interface-7.2-cp39-cp39-manylinux_2_5_i686.manylinux1_i686.manylinux_2_17_i686.manylinux2014_i686.whl", hash = "sha256:52e446f9955195440e787596dccd1411f543743c359eeb26e9b2c02b077b0519"},
-    {file = "zope.interface-7.2-cp39-cp39-manylinux_2_5_x86_64.manylinux1_x86_64.manylinux_2_17_x86_64.manylinux2014_x86_64.whl", hash = "sha256:2ad9913fd858274db8dd867012ebe544ef18d218f6f7d1e3c3e6d98000f14b75"},
-    {file = "zope.interface-7.2-cp39-cp39-win_amd64.whl", hash = "sha256:1090c60116b3da3bfdd0c03406e2f14a1ff53e5771aebe33fec1edc0a350175d"},
-    {file = "zope.interface-7.2.tar.gz", hash = "sha256:8b49f1a3d1ee4cdaf5b32d2e738362c7f5e40ac8b46dd7d1a65e82a4872728fe"},
-]
-
-[package.dependencies]
-setuptools = "*"
-
-[package.extras]
-docs = ["Sphinx", "furo", "repoze.sphinx.autointerface"]
-test = ["coverage[toml]", "zope.event", "zope.testing"]
-testing = ["coverage[toml]", "zope.event", "zope.testing"]
+enabler = ["pytest-enabler (>=3.4)"]
+test = ["big-O", "jaraco.functools", "jaraco.itertools", "jaraco.test", "more_itertools", "pytest (>=6,!=8.1.*)", "pytest-ignore-flaky"]
+type = ["pytest-mypy (>=1.0.1) ; platform_python_implementation != \"PyPy\""]
+
+[extras]
+cpu = []
+cuda = ["bitsandbytes", "nvidia-cublas-cu12", "nvidia-cuda-cupti-cu12", "nvidia-cuda-nvrtc-cu12", "nvidia-cuda-runtime-cu12", "nvidia-cudnn-cu12", "nvidia-cufft-cu12", "nvidia-curand-cu12", "nvidia-cusolver-cu12", "nvidia-cusparse-cu12", "nvidia-cusparselt-cu12", "nvidia-nccl-cu12", "nvidia-nvjitlink-cu12", "nvidia-nvtx-cu12", "triton", "xformers", "xfuser"]
+cuda124 = []
+cuda126 = []
+quanto = ["optimum-quanto"]
+rocm = []
+trackio = ["trackio"]
+video-fast = ["torchcodec"]
 
 [metadata]
 lock-version = "2.1"
-python-versions = "^3.10"
-content-hash = "3ca5bef3677ebc11d61746f39f79dea20a564e7dbbaef0432881512436bb111f"
+python-versions = "^3.11"
+content-hash = "d067b5b3214aaf7a2d8789492b7ce4a1da881092ec7ca79ce0cdc0f0e60d95cb"
diff --git a/pyproject.toml b/pyproject.toml
index 8b162c19..b3bc7246 100644
--- a/pyproject.toml
+++ b/pyproject.toml
@@ -1,147 +1,262 @@
 [tool.poetry]
-name = "videotuna"
-version = "0.1.0"
-description = "Videotuna is a useful codebase for text-to-video applications"
-authors = ["Yingqing He <yhebm@connect.ust.hk>", "Yazhou Xing <yxingag@connect.ust.hk>"]
+name = "privtune"
+version = "0.2.0"
+description = "PrivTune is a private-domain LoRA training platform for still-image and short-video generation — Flux T2I style training, Wan 2.1 T2V LoRA training, and Wan 2.2 Diffusers validation inference."
+authors = [
+    "Yingqing He <yhebm@connect.ust.hk>",
+    "Yazhou Xing <yxingag@connect.ust.hk>",
+]
 readme = "README.md"
+packages = [{ include = "videotuna" }]
 
 [build-system]
 requires = ["poetry-core"]
 build-backend = "poetry.core.masonry.api"
 
+# Default install (`poetry install -E cuda --with training`) = NVIDIA CUDA + domain LoRA training stack.
+# AMD ROCm: `poetry install -E rocm` then `poetry run install-rocm`
+# CPU dev:  `poetry install -E cpu` then `poetry run install-cpu-torch` (see docs/install-cpu.md)
+# Note: `cpu` extra is a marker; CPU torch wheels come from `install-cpu-torch`.
+# Training: `poetry install -E cuda --with training`
+# Optional faster video decode (Wan dataloading):
+#   `poetry install -E cuda --with training -E video-fast`
+# Optional Wan transformer quant (quanto backend):
+#   `poetry install -E cuda -E quanto`
+# Optional Flux Trackio experiment tracking:
+#   `poetry install -E cuda --with training -E trackio`
+# Dev:      `poetry install --with dev`
+
 [tool.poetry.dependencies]
-python = "^3.10"
-deepspeed = "0.16.5"
+python = "^3.11"
 av = "12.3.0"
 beautifulsoup4 = "4.12.3"
-colossalai = "0.3.6"
-peft = "^0.12.0"
-bitsandbytes = "^0.45.0"
-decord = "0.6.0"
+peft = "^0.17.0"
 einops = "0.8.0"
-fire = "0.6.0"
-torch = "2.2.2"
+cyclopts = "^3.0.0"
+torch = { version = "^2.6.0", source = "pytorch-cu126" }
+torchvision = { version = "^0.21.0", source = "pytorch-cu126" }
+triton = { version = "3.2.0", markers = "platform_system == 'Linux' and platform_machine == 'x86_64'", optional = true }
+nvidia-cublas-cu12 = { version = "12.6.4.1", markers = "platform_system == 'Linux' and platform_machine == 'x86_64'", optional = true }
+nvidia-cuda-cupti-cu12 = { version = "12.6.80", markers = "platform_system == 'Linux' and platform_machine == 'x86_64'", optional = true }
+nvidia-cuda-nvrtc-cu12 = { version = "12.6.77", markers = "platform_system == 'Linux' and platform_machine == 'x86_64'", optional = true }
+nvidia-cuda-runtime-cu12 = { version = "12.6.77", markers = "platform_system == 'Linux' and platform_machine == 'x86_64'", optional = true }
+nvidia-cudnn-cu12 = { version = "9.5.1.17", markers = "platform_system == 'Linux' and platform_machine == 'x86_64'", optional = true }
+nvidia-cufft-cu12 = { version = "11.3.0.4", markers = "platform_system == 'Linux' and platform_machine == 'x86_64'", optional = true }
+nvidia-curand-cu12 = { version = "10.3.7.77", markers = "platform_system == 'Linux' and platform_machine == 'x86_64'", optional = true }
+nvidia-cusolver-cu12 = { version = "11.7.1.2", markers = "platform_system == 'Linux' and platform_machine == 'x86_64'", optional = true }
+nvidia-cusparse-cu12 = { version = "12.5.4.2", markers = "platform_system == 'Linux' and platform_machine == 'x86_64'", optional = true }
+nvidia-cusparselt-cu12 = { version = "0.6.3", markers = "platform_system == 'Linux' and platform_machine == 'x86_64'", optional = true }
+nvidia-nccl-cu12 = { version = "2.21.5", markers = "platform_system == 'Linux' and platform_machine == 'x86_64'", optional = true }
+nvidia-nvjitlink-cu12 = { version = "12.6.85", markers = "platform_system == 'Linux' and platform_machine == 'x86_64'", optional = true }
+nvidia-nvtx-cu12 = { version = "12.6.77", markers = "platform_system == 'Linux' and platform_machine == 'x86_64'", optional = true }
+xformers = { version = "0.0.29.post3", source = "pytorch-cu126", optional = true }
+bitsandbytes = { version = "^0.45.0", optional = true }
+xfuser = { version = "^0.4.4", optional = true }
 ftfy = "6.2.3"
-huggingface-hub = "0.24.6"
+huggingface-hub = "^0.34.0"
 loguru = "0.7.2"
 imwatermark = "0.0.2"
 kornia = "0.7.3"
-mmengine = "0.10.4"
 omegaconf = "2.3.0"
 opencv-python = "4.10.0.84"
 packaging = "24.1"
-pandas = "2.2.2"
 pillow = "10.4.0"
-pudb = "2024.1.2"
-pytorch-lightning = "2.4.0"
 pyyaml = "6.0.2"
 rotary-embedding-torch = "0.6.5"
 requests = "2.32.3"
-safetensors = "0.4.4"
+safetensors = "^0.8.0"
 timm = "1.0.8"
-torchvision = "0.17.2"
 tqdm = "4.66.5"
-transformers = "4.46.2"
-xformers = "0.0.25.post1"
+transformers = "^4.48.0"
 imageio = "2.35.1"
 imageio-ffmpeg = "0.5.1"
-pyramid = "1.5"
-wandb = "0.17.8"
 scipy = "1.14.1"
-beartype = "0.18.5"
-moviepy = "1.0.3"
+moviepy = "^2.2.1"
 open-clip-torch = "2.12.0"
-numpy = "==1.*"
-diffusers = "^0.32.2"
+numpy = ">=1.26,<2.3"
+diffusers = "^0.38.0"
 torchsde = "0.2.6"
 colorama = "0.4.6"
 torch-optimi = "^0.2.1"
-accelerate = "^0.33.0"
-torchao = "0.8.0"
-toml = "0.10.2"
-hpsv2 = {git = "https://github.com/tgxs002/HPSv2.git"}
+accelerate = "^1.14.0"
+torchao = "^0.15.0"
 backports-tarfile = "^1.2.0"
-swissarmytransformer = {git = "https://github.com/JingyeChen/SwissArmyTransformer"}
+pydantic = "^2.0"
 pydantic-settings = "^2.8.0"
-xfuser = "^0.4.1"
 dashscope = "^1.23.0"
-tensorboard = "^2.19.0"
+# Wan vendor only — videotuna/models/wan/wan/configs/
 easydict = "^1.13"
+torchcodec = { version = "~0.2.1", optional = true }
+optimum-quanto = { version = ">=0.2.6", optional = true }
+# Trackio 0.21+ requires huggingface-hub>=1.10; transformers pins hub<1.0.
+trackio = { version = "~0.20.0", optional = true }
+
+[tool.poetry.extras]
+cuda = [
+    "triton",
+    "nvidia-cublas-cu12",
+    "nvidia-cuda-cupti-cu12",
+    "nvidia-cuda-nvrtc-cu12",
+    "nvidia-cuda-runtime-cu12",
+    "nvidia-cudnn-cu12",
+    "nvidia-cufft-cu12",
+    "nvidia-curand-cu12",
+    "nvidia-cusolver-cu12",
+    "nvidia-cusparse-cu12",
+    "nvidia-cusparselt-cu12",
+    "nvidia-nccl-cu12",
+    "nvidia-nvjitlink-cu12",
+    "nvidia-nvtx-cu12",
+    "xformers",
+    "bitsandbytes",
+    "xfuser",
+]
+rocm = []
+cpu = []
+cuda124 = []
+cuda126 = ["cuda"]
+video-fast = ["torchcodec"]
+quanto = ["optimum-quanto"]
+trackio = ["trackio"]
+
+[tool.poetry.group.training]
+optional = true
+
+[tool.poetry.group.training.dependencies]
+deepspeed = "0.19.2"
+pytorch-lightning = "2.4.0"
+tensorboard = "^2.19.0"
+pandas = "2.2.2"
 scikit-learn = "^1.6.1"
 
+[tool.poetry.group.dev]
+optional = true
+
 [tool.poetry.group.dev.dependencies]
-black = "^24.0.0"
-isort = "^5.12.0"
 mypy = "^1.11.2"
-pytest = "7.2.0"
+pytest = "^9.1"
 pre-commit = "^4.1.0"
 coverage = "^7.6.1"
 ruff = "^0.6.8"
+pudb = "2024.1.2"
+types-colorama = "*"
+types-tqdm = "*"
+types-psutil = "*"
+
+[tool.uv]
+package = true
+
+[dependency-groups]
+training = [
+    "deepspeed==0.19.2",
+    "pytorch-lightning==2.4.0",
+    "tensorboard>=2.19.0",
+    "pandas==2.2.2",
+    "scikit-learn>=1.6.1",
+]
+dev = [
+    "mypy>=1.11.2",
+    "pytest>=9.1,<10",
+    "pre-commit>=4.1.0",
+    "coverage>=7.6.1",
+    "ruff>=0.6.8",
+    "pudb==2024.1.2",
+    "types-colorama",
+    "types-tqdm",
+    "types-psutil",
+]
+
+[[tool.poetry.source]]
+name = "pytorch-cu124"
+url = "https://download.pytorch.org/whl/cu124"
+priority = "explicit"
+
+[[tool.poetry.source]]
+name = "pytorch-cu126"
+url = "https://download.pytorch.org/whl/cu126"
+priority = "explicit"
+
+[[tool.poetry.source]]
+name = "pytorch-rocm642"
+url = "https://download.pytorch.org/whl/rocm6.2.4"
+priority = "explicit"
 
 [[tool.poetry.source]]
-name = "modelscope"
-url = "https://modelscope.oss-cn-beijing.aliyuncs.com/releases/repo.html"
-priority = "supplemental"
+name = "pytorch-cpu"
+url = "https://download.pytorch.org/whl/cpu"
+priority = "explicit"
 
 [tool.poetry.scripts]
 install-deepspeed = 'scripts:install_deepspeed'
 install-flash-attn = 'scripts:install_flash_attn'
+install-rocm = 'scripts:install_rocm'
+install-cpu-torch = 'scripts:install_cpu_torch'
+install-flash-attn-rocm = 'scripts:install_flash_attn_rocm'
 coverage-report = 'scripts:coverage_report'
+coverage-gate = 'scripts:coverage_gate'
 format = 'scripts:code_format'
 format-check = 'scripts:code_format_check'
 lint = 'scripts:lint'
+benchmark-attn-backends = 'scripts:benchmark_attn_backends'
+verify-cuda-extras = 'scripts.verify_cuda_extras:main'
+verify-cpu-torch = 'scripts.verify_cpu_torch:main'
 test = 'scripts:test'
 type-check = 'scripts:type_check'
-inference-stepvideo-t2v-544x992 = 'scripts:inference_stepvideo_t2v_544x992'
-inference-wanvideo-i2v-720p = 'scripts:inference_wanvideo_i2v_720p'
-inference-wanvideo-t2v-720p = 'scripts:inference_wanvideo_t2v_720p'
-inference-hunyuan-i2v-720p = 'scripts:inference_hunyuan_i2v_720p'
-inference-cogvideo-i2v-diffusers = 'scripts:inference_cogvideo_i2v_diffusers'
-inference-cogvideo-i2v-lora = 'scripts:inference_cogvideo_i2v_lora'
-inference-cogvideo-lora = 'scripts:inference_cogvideo_lora'
-inference-cogvideo-t2v-diffusers = 'scripts:inference_cogvideo_t2v_diffusers'
-inference-cogvideox-15-5b-i2v = 'scripts:inference_cogvideox1_5_5b_i2v'
-inference-cogvideox-15-5b-t2v = 'scripts:inference_cogvideox1_5_5b_t2v'
-inference-dc-i2v-576x1024 = 'scripts:inference_dc_i2v_576x1024'
-inference-flux-schnell = 'scripts:inference_flux_schnell'
-inference-flux-dev = 'scripts:inference_flux_dev'
-inference-flux-lora = 'scripts:inference_flux_lora'
-inference-hunyuan-t2v = 'scripts:inference_hunyuan_t2v'
-inference-mochi = 'scripts:inference_mochi'
-inference-opensora-v10-16x256x256 = 'scripts:inference_opensora_v10_16x256x256'
-inference-v2v-ms = 'scripts:inference_v2v_ms'
-inference-vc1-i2v-320x512 = 'scripts:inference_vc1_i2v_320x512'
-inference-vc1-t2v-576x1024 = 'scripts:inference_vc1_t2v_576x1024'
-inference-vc2-t2v-320x512 = 'scripts:inference_vc2_t2v_320x512'
-inference-vc2-t2v-320x512-lora = 'scripts:inference_vc2_t2v_320x512_lora'
-train-cogvideox-i2v-lora = 'scripts:train_cogvideox_i2v_lora'
-train-cogvideox-i2v-fullft = 'scripts:train_cogvideox_i2v_fullft'
-train-cogvideox-t2v-lora = 'scripts:train_cogvideox_t2v_lora'
-train-cogvideox-t2v-fullft = 'scripts:train_cogvideox_t2v_fullft'
-train-dynamicrafter = 'scripts:train_dynamicrafter'
+inference-flux-lora = 'videotuna.cli.inference_app:inference_flux_lora_entry'
+"inference-wan2.2-t2v-720p" = 'videotuna.cli.inference_app:inference_wan2_2_t2v_720p_entry'
 train-flux-lora = 'scripts:train_flux_lora'
-train-opensorav10 = 'scripts:train_opensorav10'
-train-videocrafter-lora = 'scripts:train_videocrafter_lora'
-train-videocrafter-v2 = 'scripts:train_videocrafter_v2'
-train-hunyuan-t2v-lora = 'scripts:train_hunyuan_t2v_lora'
-train-wan2-1-i2v-lora = 'scripts:train_wan2_1_i2v_lora'
-train-wan2-1-i2v-fullft = 'scripts:train_wan2_1_i2v_fullft'
 train-wan2-1-t2v-lora = 'scripts:train_wan2_1_t2v_lora'
-train-wan2-1-t2v-fullft = 'scripts:train_wan2_1_t2v_fullft'
-
-[tool.black]
-line-length = 88
-target-version = ['py310']
-include = '\.pyi?$'
+train-domain-t2i = 'scripts:train_domain_t2i'
+train-domain-t2v = 'scripts:train_domain_t2v'
+train-wan2-1-i2v-lora = 'scripts:train_wan2_1_i2v_lora'
+train-domain-i2v = 'scripts:train_domain_i2v'
+inference-domain-t2i = 'videotuna.cli.inference_app:inference_domain_t2i_entry'
+validate-domain-t2v = 'videotuna.cli.inference_app:validate_domain_t2v_entry'
+validate-domain-i2v = 'videotuna.cli.inference_app:validate_domain_i2v_entry'
+"inference-wan2.2-i2v-720p" = 'videotuna.cli.inference_app:inference_wan2_2_i2v_720p_entry'
 
-[tool.isort]
-profile = "black"
+[tool.pytest.ini_options]
+testpaths = ["tests"]
+markers = [
+    "gpu: tests that require an NVIDIA/ROCm GPU (skipped when torch.cuda.is_available() is False)",
+    "rocm: tests that require an AMD ROCm GPU",
+    "cpu_smoke: slow CPU integration tests (optional nightly)",
+]
 
 [[tool.mypy.overrides]]
-module = [
-]
+module = []
 ignore_missing_imports = true
 
+[tool.mypy]
+mypy_path = "typings"
+
 [tool.ruff]
-select = ["E", "F", "C90"]
-ignore = []
+line-length = 88
+target-version = "py311"
+exclude = ["tools/data_process"]
+
+[tool.ruff.lint]
+select = ["E", "F", "C90", "I"]
+
+[tool.ruff.lint.mccabe]
+max-complexity = 19
+
+[tool.ruff.lint.per-file-ignores]
+"videotuna/models/wan/wan/**" = ["E501", "E722", "F403", "F405"]
+
+[tool.ruff.lint.isort]
+known-first-party = ["videotuna"]
+
+[tool.coverage.run]
+source = ["videotuna"]
+relative_files = true
+
+[tool.coverage.report]
+precision = 1
+show_missing = true
+skip_empty = true
+
+[tool.pyrefly]
+search-path = ["typings"]
+project-includes = ["videotuna/**", "scripts/**", "tests/**"]
+project-excludes = []
diff --git a/scripts/__init__.py b/scripts/__init__.py
index d5c7e248..64aa3315 100644
--- a/scripts/__init__.py
+++ b/scripts/__init__.py
@@ -3,70 +3,417 @@
 """
 
 import os
+import shutil
 import subprocess
 import sys
 from datetime import datetime
 
 current_time = datetime.now().strftime("%Y%m%d%H%M%S")
 
+FLUX_T2I_CONFIG = "configs/domain/flux_t2i.json"
+FLUX_T2I_DATA_CONFIG = "configs/domain/flux_t2i_data.json"
+FLUX_T2I_CLOUD_SMOKE = "configs/domain/flux_t2i_cloud_smoke.json"
+WAN_T2V_LORA_CONFIG = "configs/domain/wan_t2v_lora.yaml"
+WAN_T2V_LORA_CLOUD_SMOKE = "configs/domain/wan_t2v_lora_cloud_smoke.yaml"
+WAN_I2V_LORA_CONFIG = "configs/domain/wan_i2v_lora.yaml"
+
+# Single source of truth for CPU CI smoke tests (see .github/workflows/cpu.yml).
+CI_SMOKE_TESTS = [
+    "tests/test_import_smoke.py",
+    "tests/test_domain_finetune_configs.py",
+    "tests/test_flux_lora_train_smoke.py",
+    "tests/test_wan_lora_bridge.py",
+    "tests/test_wan_i2v_lora_bridge.py",
+    "tests/test_wan_domain_lora_smoke_22_config.py",
+    "tests/test_wan_domain_i2v_smoke_22_config.py",
+    "tests/test_wan_i2v_dataset.py",
+    "tests/test_wan_training_step.py",
+    "tests/test_poetry_scripts.py",
+    "tests/test_diffusers_quantization.py",
+]
+
+# Line-coverage floor for videotuna/training/ + videotuna/utils/
+# (CI smoke baseline ~36%).
+COVERAGE_GATE_FAIL_UNDER = 33
+
+
+def _require_cuda_backend(installer_name: str) -> None:
+    """Abort when the active PyTorch build is ROCm (CUDA-only installer)."""
+    try:
+        from videotuna.utils.device_utils import detect_compute_backend
+
+        if detect_compute_backend() == "rocm":
+            print(
+                f"{installer_name} is not supported on AMD ROCm.\n"
+                "Use VIDEOTUNA_ATTN_BACKEND=sdpa for attention on ROCm.\n"
+                "See docs/install-rocm.md.",
+                file=sys.stderr,
+            )
+            sys.exit(1)
+    except ImportError:
+        pass
+
+
 def install_deepspeed():
     """
-    Install the flash attention package
+    Install DeepSpeed with CUDA 12.6 toolkit support
+    (rebuilds against the active torch).
+
+    When conda is unavailable, skips the CUDA toolkit step and installs via pip.
+    If deepspeed>=0.19.2 is already importable, exits successfully without rebuilding.
     """
-    command_install_cuda_toolkit = [
-        "conda", "install", "cuda-toolkit=12.1", "-c", "conda-forge", "-c", "nvidia", "-y"
-    ] + sys.argv[1:]
-    command_uninstall_deepspeed = [
-        "pip", "uninstall", "deepspeed", "-y"
-    ]
-    command_install_deepspeed = [
-        "pip", "install", "deepspeed==0.16.5"
-    ]
-    result_cuda_toolkit = subprocess.run(command_install_cuda_toolkit, check=False)
-    if result_cuda_toolkit.returncode != 0:
-        exit(result_cuda_toolkit.returncode)
-
-    
-    result_uninstall_deepspeed = subprocess.run(command_uninstall_deepspeed, check=False)
-    if result_uninstall_deepspeed.returncode != 0:
-        exit(result_uninstall_deepspeed.returncode)
+    _require_cuda_backend("install-deepspeed")
+    try:
+        import deepspeed
+        from packaging.version import Version
+
+        if Version(deepspeed.__version__) >= Version("0.19.2"):
+            print(
+                f"deepspeed {deepspeed.__version__} already installed "
+                "(>= 0.19.2); skipping rebuild."
+            )
+            return
+    except ImportError:
+        pass
+
+    if shutil.which("conda"):
+        command_install_cuda_toolkit = [
+            "conda",
+            "install",
+            "cuda-toolkit=12.6",
+            "-c",
+            "conda-forge",
+            "-c",
+            "nvidia",
+            "-y",
+        ] + sys.argv[1:]
+        result_cuda_toolkit = subprocess.run(command_install_cuda_toolkit, check=False)
+        if result_cuda_toolkit.returncode != 0:
+            print(
+                "conda cuda-toolkit install failed; continuing with pip-only "
+                "deepspeed install.",
+                file=sys.stderr,
+            )
+    else:
+        print(
+            "conda not found; skipping cuda-toolkit install. "
+            "If the pip build fails, install CUDA/nvcc or use conda.",
+            file=sys.stderr,
+        )
+
+    pip = [sys.executable, "-m", "pip"]
+    subprocess.run([*pip, "uninstall", "deepspeed", "-y"], check=False)
 
     env = os.environ.copy()
     env["DS_BUILD_CPU_ADAM"] = "1"
     env["BUILD_UTILS"] = "1"
-    result_deepspeed = subprocess.run(command_install_deepspeed, check=False, env=env)
+    result_deepspeed = subprocess.run(
+        [*pip, "install", "deepspeed==0.19.2"],
+        check=False,
+        env=env,
+    )
     exit(result_deepspeed.returncode)
 
+
+def _python_wheel_tag() -> str:
+    major, minor = sys.version_info[:2]
+    return f"cp{major}{minor}"
+
+
+def _torch_cuda_wheel_tag() -> str:
+    """Map torch.version.cuda to flash-attn wheel tag (e.g. cu126)."""
+    try:
+        import torch
+        import torch.version
+
+        cuda = getattr(torch.version, "cuda", None)
+        if cuda is None:
+            return "cu12"
+        parts = str(cuda).split(".")
+        if len(parts) >= 2:
+            return f"cu{parts[0]}{parts[1]}"
+    except ImportError:
+        pass
+    return "cu126"
+
+
+def _torch_minor_for_flash() -> str:
+    import torch
+
+    return ".".join(torch.__version__.split(".")[:2])
+
+
+def _flash_attn_wheel_url() -> str:
+    wheel_tag = _python_wheel_tag()
+    cuda_tag = _torch_cuda_wheel_tag()
+    torch_minor = _torch_minor_for_flash()
+    return (
+        "https://github.com/Dao-AILab/flash-attention/releases/download/"
+        f"v2.7.4.post1/flash_attn-2.7.4.post1+{cuda_tag}torch{torch_minor}cxx11abiTRUE-"
+        f"{wheel_tag}-{wheel_tag}-linux_x86_64.whl"
+    )
+
+
 def install_flash_attn():
     """
-    Install the flash attention package
+    Install flash-attn for PyTorch 2.6 + CUDA 12.6 (cxx11 ABI wheels).
+
+    Tries a prebuilt wheel first (no compiler or conda required). Falls back to a
+    source build only when the wheel is unavailable.
     """
-    command_install_cuda_nvcc = [
-        "conda", "install", "-c", "nvidia", "cuda-nvcc=12.1", "-y"
-    ] + sys.argv[1:]
-    command_install_flash_attn = [
-        "pip", "install", "flash-attn==2.7.3", "--no-build-isolation"
-    ]
-    result_nvcc = subprocess.run(command_install_cuda_nvcc, check=False)
-    if result_nvcc.returncode != 0:
-        exit(result_nvcc.returncode)
-
-    result_flash = subprocess.run(command_install_flash_attn, check=False)
+    _require_cuda_backend("install-flash-attn")
+    try:
+        import torch
+        import torch.version
+
+        if getattr(torch.version, "hip", None) is not None:
+            print(
+                "install-flash-attn requires an NVIDIA CUDA PyTorch build. "
+                "Detected ROCm/HIP. See docs/install-rocm.md.",
+                file=sys.stderr,
+            )
+            sys.exit(1)
+        if getattr(torch.version, "cuda", None) is None:
+            print(
+                "install-flash-attn requires a CUDA PyTorch build. "
+                "Run: poetry run install-cpu-torch is not compatible.",
+                file=sys.stderr,
+            )
+            sys.exit(1)
+    except ImportError:
+        pass
+
+    subprocess.run([sys.executable, "-m", "pip", "install", "ninja"], check=False)
+
+    flash_attn_wheel = _flash_attn_wheel_url()
+    result_wheel = subprocess.run(
+        [
+            sys.executable,
+            "-m",
+            "pip",
+            "install",
+            flash_attn_wheel,
+            "--no-build-isolation",
+        ],
+        check=False,
+    )
+    if result_wheel.returncode == 0:
+        exit(0)
+
+    if shutil.which("conda"):
+        result_nvcc = subprocess.run(
+            [
+                "conda",
+                "install",
+                "-c",
+                "nvidia",
+                "cuda-nvcc=12.6",
+                "-y",
+            ]
+            + sys.argv[1:],
+            check=False,
+        )
+        if result_nvcc.returncode != 0:
+            exit(result_nvcc.returncode)
+    elif shutil.which("nvcc") is None:
+        print(
+            "Prebuilt flash-attn wheel install failed and nvcc was not found.\n"
+            "Install the CUDA toolkit (nvcc), or use conda:\n"
+            "  conda install -c nvidia cuda-nvcc=12.6",
+            file=sys.stderr,
+        )
+        exit(result_wheel.returncode)
+
+    result_flash = subprocess.run(
+        [
+            sys.executable,
+            "-m",
+            "pip",
+            "install",
+            "flash-attn==2.7.4.post1",
+            "--no-build-isolation",
+        ],
+        check=False,
+    )
     exit(result_flash.returncode)
 
 
+_ROCM_TORCH_INDEX = "https://download.pytorch.org/whl/rocm6.2.4"
+_CPU_TORCH_INDEX = "https://download.pytorch.org/whl/cpu"
+# Re-pin after pip torch installs (ROCm/CPU indexes may upgrade these transitively).
+_POETRY_PINNED_DEPS = ("pillow==10.4.0", "numpy>=1.26,<2.3")
+_CUDA_ONLY_PACKAGES = (
+    "xformers",
+    "bitsandbytes",
+    "xfuser",
+    "triton",
+    "nvidia-cublas-cu12",
+    "nvidia-cuda-cupti-cu12",
+    "nvidia-cuda-nvrtc-cu12",
+    "nvidia-cuda-runtime-cu12",
+    "nvidia-cudnn-cu12",
+    "nvidia-cufft-cu12",
+    "nvidia-curand-cu12",
+    "nvidia-cusolver-cu12",
+    "nvidia-cusparse-cu12",
+    "nvidia-cusparselt-cu12",
+    "nvidia-nccl-cu12",
+    "nvidia-nvjitlink-cu12",
+    "nvidia-nvtx-cu12",
+)
+# Keep triton on CPU installs — torchao/diffusers import torch._inductor which needs it.
+_CPU_UNINSTALL_PACKAGES = tuple(p for p in _CUDA_ONLY_PACKAGES if p != "triton")
+
+
+def _reconcile_poetry_pinned_deps(pip: list[str]) -> None:
+    """Restore numpy/pillow versions required by videotuna and scipy."""
+    subprocess.run([*pip, "install", *_POETRY_PINNED_DEPS], check=False)
+
+
+def install_rocm():
+    """
+    Install PyTorch 2.6 + torchvision 0.21 for ROCm 6.2.4 and remove CUDA-only wheels.
+
+    Uninstalls existing torch/torchvision first so pip does not keep a mismatched
+    CUDA torchvision wheel (e.g. 0.21.0+cu126) alongside ROCm torch.
+
+    Run after: poetry install -E rocm
+    Re-run after any plain `poetry install` on AMD machines (lockfile pins CUDA torch).
+    """
+    pip = [sys.executable, "-m", "pip"]
+    for pkg in (*_CUDA_ONLY_PACKAGES, "torch", "torchvision"):
+        subprocess.run([*pip, "uninstall", pkg, "-y"], check=False)
+    result = subprocess.run(
+        [
+            *pip,
+            "install",
+            "torch==2.6.0",
+            "torchvision==0.21.0",
+            "--index-url",
+            _ROCM_TORCH_INDEX,
+            "--force-reinstall",
+            "--no-deps",
+            "--no-cache-dir",
+        ],
+        check=False,
+    )
+    if result.returncode != 0:
+        exit(result.returncode)
+    subprocess.run(
+        [
+            *pip,
+            "install",
+            "pytorch-triton-rocm==3.2.0",
+            "--index-url",
+            _ROCM_TORCH_INDEX,
+            "--no-deps",
+            "--no-cache-dir",
+        ],
+        check=False,
+    )
+    _reconcile_poetry_pinned_deps(pip)
+
+    import torch
+    import torch.version
+    import torchvision
+
+    torch_build = torch.__version__
+    tv_build = torchvision.__version__
+    hip = getattr(torch.version, "hip", None)
+    if hip is None:
+        print(
+            "WARNING: torch installed but torch.version.hip is None. "
+            "Expected a ROCm wheel from the rocm6.2.4 index.",
+            file=sys.stderr,
+        )
+    if "+cu" in tv_build:
+        print(
+            f"ERROR: torch/torchvision build mismatch: torch={torch_build}, "
+            f"torchvision={tv_build}. Re-run: poetry run install-rocm",
+            file=sys.stderr,
+        )
+        exit(1)
+
+    print(f"torch {torch_build}, torchvision {tv_build}, HIP {hip}")
+    try:
+        from videotuna.utils.device_utils import describe_compute_environment
+
+        print(describe_compute_environment())
+    except ImportError:
+        print(f"torch.cuda.is_available()={torch.cuda.is_available()}, " f"hip={hip}")
+    exit(0)
+
+
+def install_cpu_torch():
+    """Install CPU-only PyTorch 2.6 wheels (no CUDA/ROCm)."""
+    pip = [sys.executable, "-m", "pip"]
+    for pkg in (*_CPU_UNINSTALL_PACKAGES, "torch", "torchvision"):
+        subprocess.run([*pip, "uninstall", pkg, "-y"], check=False)
+    result = subprocess.run(
+        [
+            *pip,
+            "install",
+            "torch==2.6.0",
+            "torchvision==0.21.0",
+            "--index-url",
+            _CPU_TORCH_INDEX,
+            "--force-reinstall",
+            "--no-deps",
+            "--no-cache-dir",
+        ],
+        check=False,
+    )
+    if result.returncode != 0:
+        exit(result.returncode)
+    subprocess.run(
+        [*pip, "install", "triton==3.2.0", "--no-cache-dir"],
+        check=False,
+    )
+    _reconcile_poetry_pinned_deps(pip)
+    exit(result.returncode)
+
+
+def install_flash_attn_rocm():
+    """
+    flash-attn is not officially supported on ROCm in VideoTuna.
+
+    Use VIDEOTUNA_ATTN_BACKEND=sdpa instead. See docs/install-rocm.md.
+    """
+    print(
+        "flash-attn is not supported on AMD ROCm in VideoTuna.\n"
+        "Use: export VIDEOTUNA_ATTN_BACKEND=sdpa\n"
+        "For experimental upstream builds, see "
+        "https://github.com/Dao-AILab/flash-attention",
+        file=sys.stderr,
+    )
+    sys.exit(1)
+
+
+_FORMAT_TARGETS = ["."]
+_LINT_TARGETS = ["videotuna", "tests", "scripts", "tools"]
+
+
 def code_format(check=False):
     """
     Run the code formatting
     """
-    commands = [["isort", "."], ["black", "."]]
+    cmds = (
+        [
+            ["ruff", "format", "--check", *_FORMAT_TARGETS],
+            ["ruff", "check", "--select", "I", *_FORMAT_TARGETS],
+        ]
+        if check
+        else [
+            ["ruff", "check", "--fix", *_LINT_TARGETS],
+            ["ruff", "check", "--select", "I", "--fix", *_FORMAT_TARGETS],
+            ["ruff", "format", *_FORMAT_TARGETS],
+        ]
+    )
     return_code = 0
 
-    for command in commands:
-        if check:
-            command.append("--check")
+    for command in cmds:
         process = subprocess.run(command, check=False)
-        if process.returncode > 0:
+        if process.returncode != 0:
             return_code = process.returncode
             break
 
@@ -85,8 +432,7 @@ def lint():
     Run the linter
     """
     result = subprocess.run(
-        ["ruff", "check", "videotuna", "tests"] + sys.argv[1:], 
-        check=False
+        ["ruff", "check", *_LINT_TARGETS] + sys.argv[1:], check=False
     )
     exit(result.returncode)
 
@@ -96,7 +442,14 @@ def test():  # pragma: no cover
     Run all unittests
     """
     os.environ["ENV"] = "test"
-    result = subprocess.run(["pytest", "."] + sys.argv[1:], check=False)
+    args = sys.argv[1:]
+    has_targets = any(
+        not a.startswith("-")
+        and (a.endswith(".py") or "::" in a or a.startswith("tests"))
+        for a in args
+    )
+    cmd = ["pytest"] + (args if has_targets else ["tests"] + args)
+    result = subprocess.run(cmd, check=False)
     exit(result.returncode)
 
 
@@ -106,8 +459,7 @@ def coverage_report():
     """
     os.environ["ENV"] = "test"
     result = subprocess.run(
-        ["coverage", "run", "-m", "pytest", "--junitxml", "report.xml"], 
-        check=False
+        ["coverage", "run", "-m", "pytest", "--junitxml", "report.xml"], check=False
     )
     if result.returncode > 0:
         exit(result.returncode)
@@ -115,839 +467,172 @@ def coverage_report():
     exit(result.returncode)
 
 
-def type_check():
+def coverage_gate():
     """
-    Run the type checking
+    Run CI smoke tests with coverage and enforce a modest floor on core modules.
     """
-    result = subprocess.run(["mypy", "videotuna", "tests"], check=False)
-    exit(result.returncode)
-
-
-def inference_cogvideo_i2v_diffusers():
-    result = subprocess.run(
-        ["python", "scripts/inference_cogVideo_diffusers.py", 
-         "--generate_type", "i2v", 
-         "--model_input", "inputs/i2v/576x1024", 
-         "--model_path", "checkpoints/cogvideo/CogVideoX-5b-I2V", 
-         "--output_path", "results/i2v/cogvideox5b", 
-         "--num_inference_steps", "50", 
-         "--guidance_scale", "3.5", 
-         "--num_videos_per_prompt", "1", 
-         "--dtype", "float16"
-        ] + sys.argv[1:], 
-        check=False
-    )
-    exit(result.returncode)
-
-
-def inference_cogvideo_i2v_lora():
-    config = "configs/004_cogvideox/cogvideo5b-i2v.yaml"
-    ckpt = "results/train/cogvideox_i2v_5b/{YOUR_CKPT_PATH}.ckpt"
-    prompt_dir = "{YOUR_PROMPT_DIR}"
-    savedir = f"results/inference/i2v/cogvideox-i2v-lora-{current_time}"
-
-    result = subprocess.run(
-        ["python3", "scripts/inference_cogvideo.py", 
-         "--config", config, 
-         "--ckpt_path", ckpt, 
-         "--prompt_dir", prompt_dir, 
-         "--savedir", savedir, 
-         "--bs", "1", 
-         "--height", "480", 
-         "--width", "720", 
-         "--fps", "16", 
-         "--seed", "6666", 
-         "--mode", "i2v", 
-         "--denoiser_precision", "bf16"
-        ] + sys.argv[1:], 
-        check=False
-    )
-    exit(result.returncode)
-
-
-def inference_cogvideo_lora():
-    config = "configs/004_cogvideox/cogvideo5b.yaml"
-    prompt_file = "inputs/t2v/prompts.txt"
-    savedir = f"results/t2v/{current_time}-cogvideo"
-    ckpt = "{YOUR_CKPT_PATH}"
-    result = subprocess.run(
-        ["python3", "scripts/inference_cogvideo.py", 
-         "--ckpt_path", ckpt, 
-         "--config", config, 
-         "--prompt_file", prompt_file, 
-         "--savedir", savedir, 
-         "--bs", "1", 
-         "--height", "480", 
-         "--width", "720", 
-         "--fps", "16", 
-         "--seed", "6666", 
-         "--denoiser_precision", "bf16"
-        ] + sys.argv[1:], 
-        check=False
-    )
-    exit(result.returncode)
-
-
-def inference_cogvideo_t2v_diffusers():
-    result = subprocess.run(
-        ["python", "scripts/inference_cogVideo_diffusers.py", 
-         "--model_input", "inputs/t2v/prompts.txt", 
-         "--model_path", "checkpoints/cogvideo/CogVideoX-2b", 
-         "--output_path", "results/t2v/cogvideox5b", 
-         "--num_inference_steps", "50", 
-         "--guidance_scale", "3.5", 
-         "--num_videos_per_prompt", "1", 
-         "--dtype", "float16"
-        ] + sys.argv[1:], 
-        check=False
+    os.environ["ENV"] = "test"
+    run = subprocess.run(
+        ["coverage", "run", "-m", "pytest", "-q", *CI_SMOKE_TESTS],
+        check=False,
     )
-    exit(result.returncode)
-
-
-def inference_cogvideox1_5_5b_i2v():
-    load_transformer = "checkpoints/cogvideo/CogVideoX1.5-5B-SAT/transformer_i2v"
-    input_file = "inputs/i2v/576x1024/test_prompts.txt"
-    output_dir = "results/i2v/cogvideox1.5"
-    base = "configs/005_cogvideox1.5/cogvideox1.5_5b.yaml"
-    image_folder = "inputs/i2v/576x1024/"
-
-    result = subprocess.run(
-        ["python", "scripts/inference_cogVideo_sat_refactor.py", 
-         "--load_transformer", load_transformer, 
-         "--input_file", input_file, 
-         "--output_dir", output_dir, 
-         "--base", base, 
-         "--mode_type", "i2v", 
-         "--sampling_num_frames", "22", 
-         "--image_folder", image_folder
-        ] + sys.argv[1:], 
-        check=False
+    if run.returncode != 0:
+        exit(run.returncode)
+    report = subprocess.run(
+        [
+            "coverage",
+            "report",
+            "--include=videotuna/training/*,videotuna/utils/*",
+            f"--fail-under={COVERAGE_GATE_FAIL_UNDER}",
+        ],
+        check=False,
     )
-    exit(result.returncode)
-
+    exit(report.returncode)
 
-def inference_cogvideox1_5_5b_t2v():
-    load_transformer = "checkpoints/cogvideo/CogVideoX1.5-5B-SAT/transformer_t2v"
-    input_file = "inputs/t2v/prompts.txt"
-    output_dir = "results/t2v/"
-    base = "configs/005_cogvideox1.5/cogvideox1.5_5b.yaml"
 
-    result = subprocess.run(
-        ["python", "scripts/inference_cogVideo_sat_refactor.py", 
-         "--load_transformer", load_transformer, 
-         "--input_file", input_file, 
-         "--output_dir", output_dir, 
-         "--base", base, 
-         "--mode_type", "t2v", 
-         "--sampling_num_frames", "22"
-        ] + sys.argv[1:], 
-        check=False
-    )
-    exit(result.returncode)
-
-
-def inference_dc_i2v_576x1024():
-    ckpt = "checkpoints/dynamicrafter/i2v_576x1024/model.ckpt"
-    config = "configs/002_dynamicrafter/dc_i2v_1024.yaml"
-    prompt_dir = "inputs/i2v/576x1024"
-    savedir = "results/dc-i2v-576x1024"
-
-    result = subprocess.run(
-        ["python3", "scripts/inference.py", 
-         "--mode", "i2v", 
-         "--ckpt_path", ckpt, 
-         "--config", config, 
-         "--prompt_dir", prompt_dir, 
-         "--savedir", savedir, 
-         "--bs", "1", 
-         "--height", "576", 
-         "--width", "1024", 
-         "--fps", "10", 
-         "--seed", "123"
-        ] + sys.argv[1:], 
-        check=False
-    )
+def type_check():
+    """
+    Run the type checking
+    """
+    result = subprocess.run(["mypy", "videotuna", "tests"], check=False)
     exit(result.returncode)
 
 
-def inference_flux_schnell():
-    prompt = "inputs/t2v/prompts.txt"
-    width = 1360
-    height = 768
-
-    command_schnell = [
-        "python", "scripts/inference_flux.py", 
-        "--model_type", "schnell", 
-        "--prompt", prompt, 
-        "--out_path", "results/flux-schnell/", 
-        "--width", str(width), 
-        "--height", str(height), 
-        "--num_inference_steps", "4", 
-        "--guidance_scale", "0."
-    ] + sys.argv[1:]
-
-    result_schnell = subprocess.run(command_schnell, check=False)
-    exit(result_schnell.returncode)
-
-
-def inference_flux_dev():
-    prompt = "inputs/t2v/prompts.txt"
-    width = 1360
-    height = 768
-
-    command_dev = [
-        "python", "scripts/inference_flux.py", 
-        "--model_type", "dev", 
-        "--prompt", prompt, 
-        "--out_path", "results/t2i/flux-dev/", 
-        "--width", str(width), 
-        "--height", str(height), 
-        "--num_inference_steps", "50", 
-        "--guidance_scale", "0."
-    ] + sys.argv[1:]
-
-    result_dev = subprocess.run(command_dev, check=False)
-    exit(result_dev.returncode)
-
-
 def inference_flux_lora():
-    os.environ["lora_ckpt"] = "{YOUR_CORA_CKPT_PATH}"
-    result = subprocess.run(
-        ["python", "scripts/inference_flux_lora.py", 
-         "--model_type", "dev", 
-         "--prompt", "inputs/t2v/prompts.txt", 
-         "--out_path", "results/t2i/flux-lora/", 
-         "--lora_path", os.environ["lora_ckpt"], 
-         "--width", "1360", 
-         "--height", "768", 
-         "--num_inference_steps", "50", 
-         "--guidance_scale", "3.5"
-        ] + sys.argv[1:], 
-        check=False
-    )
-    exit(result.returncode)
+    from videotuna.cli.inference_app import inference_flux_lora_entry
 
+    inference_flux_lora_entry()
 
-def inference_hunyuan_t2v():
-    result = subprocess.run(
-        ["python", "scripts/inference_cogvideo.py", 
-         "--ckpt_path", "checkpoints/hunyuanvideo/HunyuanVideo", 
-         "--config", "configs/007_hunyuanvideo/hunyuanvideo_t2v_diffuser.yaml", 
-         "--prompt_file", "inputs/t2v/hunyuanvideo/tyler_swift_video/labels.txt", 
-         "--savedir", f"results/t2v/hunyuanvideo-{current_time}", 
-         "--bs", "1", 
-         "--height", "256", 
-         "--width", "256", 
-         "--fps", "16", 
-         "--seed", "6666"
-        ] + sys.argv[1:], 
-        check=False
-    )
-    exit(result.returncode)
 
+def inference_wan2_2_t2v_720p():
+    from videotuna.cli.inference_app import inference_wan2_2_t2v_720p_entry
 
-def inference_mochi():
-    ckpt = "checkpoints/mochi-1-preview"
-    prompt_file = "inputs/t2v/prompts.txt"
-    savedir = "results/t2v/mochi2"
-    height = 480
-    width = 848
-    result = subprocess.run(
-        ["python3", "scripts/inference_mochi.py", 
-         "--ckpt_path", ckpt, 
-         "--prompt_file", prompt_file, 
-         "--savedir", savedir, 
-         "--bs", "1", 
-         "--height", str(height), 
-         "--width", str(width), 
-         "--fps", "28", 
-         "--seed", "124"
-        ] + sys.argv[1:], 
-        check=False
-    )
-    exit(result.returncode)
-
-
-def inference_opensora_v10_16x256x256():
-    ckpt = "checkpoints/open-sora/t2v_v10/OpenSora-v1-HQ-16x256x256.pth"
-    config = "configs/003_opensora/opensorav10_256x256.yaml"
-    prompt_file = "inputs/t2v/prompts.txt"
-    res_dir = f"results/t2v/{current_time}-opensorav10-HQ-16x256x256"
-    result = subprocess.run(
-        ["python3", "scripts/inference.py", 
-         "--seed", "123", 
-         "--mode", "t2v", 
-         "--ckpt_path", ckpt, 
-         "--config", config, 
-         "--savedir", res_dir, 
-         "--n_samples", "3", 
-         "--bs", "2", 
-         "--height", "256", 
-         "--width", "256", 
-         "--unconditional_guidance_scale", "7.0", 
-         "--ddim_steps", "50", 
-         "--ddim_eta", "1.0", 
-         "--prompt_file", prompt_file, 
-         "--fps", "8", 
-         "--frames", "16"
-        ] + sys.argv[1:], 
-        check=False
-    )
-    exit(result.returncode)
-
-
-def inference_v2v_ms():
-    from .inference_v2v_ms import Settings, inference_v2v_ms
-
-    settings = Settings(
-        input_dir="inputs/v2v/001",
-        output_dir=f"results/v2v/{current_time}-v2v-modelscope-001",
-    )
-    inference_v2v_ms(settings=settings)
-
-
-def inference_vc1_i2v_320x512():
-    ckpt = "checkpoints/videocrafter/i2v_v1_512/model.ckpt"
-    config = "configs/000_videocrafter/vc1_i2v_512.yaml"
-    prompt_dir = "inputs/i2v/576x1024"
-    savedir = "results/i2v/vc1-i2v-320x512"
-    result = subprocess.run(
-        ["python3", "scripts/inference.py", 
-         "--mode", "i2v", 
-         "--ckpt_path", ckpt, 
-         "--config", config, 
-         "--prompt_dir", prompt_dir, 
-         "--savedir", savedir, 
-         "--bs", "1", 
-         "--height", "320", 
-         "--width", "512", 
-         "--fps", "8", 
-         "--seed", "123"
-        ] + sys.argv[1:], 
-        check=False
-    )
-    exit(result.returncode)
-
-def inference_stepvideo_t2v_544x992():
-    ckpt = "checkpoints/stepvideo/stepvideo-t2v/"
-    config = "configs/009_stepvideo/stepvideo_t2v.yaml"
-    prompt_file = "inputs/t2v/prompts.txt"
-    savedir = "results/t2v/stepvideo"
-    result = subprocess.run(
-        ["python3", "scripts/inference_new.py", 
-         "--ckpt_path", ckpt, 
-         "--config", config, 
-         "--prompt_file", prompt_file, 
-         "--savedir", savedir, 
-         "--height", "544", 
-         "--width", "992", 
-         "--frames", "51", 
-         "--seed", "44", 
-         "--num_inference_steps", "50"
-        ] + sys.argv[1:], 
-        check=False
-    )
-    exit(result.returncode)
-
-
-def inference_wanvideo_i2v_720p():
-    ckpt = "checkpoints/wan/Wan2.1-I2V-14B-720P/"
-    config = "configs/008_wanvideo/wan2_1_i2v_14B_720P.yaml"
-    prompt_dir = "inputs/i2v/576x1024"
-    savedir = "results/i2v/wanvideo/720P"
-    result = subprocess.run(
-        ["python3", "scripts/inference_new.py", 
-         "--ckpt_path", ckpt, 
-         "--config", config, 
-         "--prompt_dir", prompt_dir, 
-         "--savedir", savedir, 
-         "--height", "720", 
-         "--width", "1280", 
-         "--frames", "81", 
-         "--seed", "44", 
-         "--num_inference_steps", "40", 
-         "--time_shift", "5.0", 
-         "--enable_model_cpu_offload"
-        ] + sys.argv[1:], 
-        check=False
-    )
-    exit(result.returncode)
-
-
-def inference_wanvideo_t2v_720p():
-    ckpt = "checkpoints/wan/Wan2.1-T2V-14B/"
-    config = "configs/008_wanvideo/wan2_1_t2v_14B.yaml"
-    prompt_file = "inputs/t2v/prompts.txt"
-    savedir = "results/t2v/wanvideo/720P"
-    result = subprocess.run(
-        ["python3", "scripts/inference_new.py", 
-         "--ckpt_path", ckpt, 
-         "--config", config, 
-         "--prompt_file", prompt_file, 
-         "--savedir", savedir, 
-         "--height", "720", 
-         "--width", "1280", 
-         "--frames", "81", 
-         "--seed", "44", 
-         "--time_shift", "5.0", 
-         "--num_inference_steps", "50", 
-         "--enable_model_cpu_offload"
-        ] + sys.argv[1:], 
-        check=False
-    )
-    exit(result.returncode)
-
-def inference_hunyuan_i2v_720p():
-    ckpt = "checkpoints/hunyuanvideo/HunyuanVideo-I2V"
-    dit_weight = "checkpoints/hunyuanvideo/HunyuanVideo-I2V/hunyuan-video-i2v-720p/transformers/mp_rank_00_model_states.pt"
-    config = "configs/007_hunyuanvideo/hunyuanvideo_i2v.yaml"
-    prompt_dir = "inputs/i2v/576x1024"
-    savedir = "results/i2v/hunyuan"
-    
-    result = subprocess.run(
-        ["python3", "scripts/inference_new.py", 
-         "--ckpt_path", ckpt, 
-         "--dit_weight", dit_weight, 
-         "--config", config, 
-         "--prompt_dir", prompt_dir, 
-         "--savedir", savedir, 
-         "--height", "720", 
-         "--width", "1280", 
-         "--i2v_resolution", "720p", 
-         "--frames", "129", 
-         "--seed", "44", 
-         "--num_inference_steps", "50"
-        ] + sys.argv[1:], 
-        check=False
-    )
-    exit(result.returncode)
-
-
-def inference_vc1_t2v_576x1024():
-    ckpt = "checkpoints/videocrafter/t2v_v1_1024/model.ckpt"
-    config = "configs/000_videocrafter/vc1_t2v_1024.yaml"
-    prompt_file = "inputs/t2v/prompts.txt"
-    res_dir = "results/t2v/videocrafter1-576x1024"
-    result = subprocess.run(
-        ["python3", "scripts/inference.py", 
-         "--ckpt_path", ckpt, 
-         "--config", config, 
-         "--prompt_file", prompt_file, 
-         "--savedir", res_dir, 
-         "--bs", "1", 
-         "--height", "576", 
-         "--width", "1024", 
-         "--fps", "28", 
-         "--seed", "123"
-        ] + sys.argv[1:], 
-        check=False
-    )
-    exit(result.returncode)
-
-
-def inference_vc2_t2v_320x512():
-    # Dependencies
-    ckpt = "checkpoints/videocrafter/t2v_v2_512_split"
-    config = "configs/001_videocrafter2/vc2_t2v_320x512.yaml"
-    prompt_file = "inputs/t2v/prompts.txt"
-    result = subprocess.run(
-        ["python3", "scripts/inference_new.py", 
-         "--ckpt_path", ckpt, 
-         "--config", config, 
-         "--prompt_file", prompt_file, 
-         "--savefps", "30"
-        ] + sys.argv[1:], 
-        check=False
-    )
-    exit(result.returncode)
-
-def inference_vc2_t2v_320x512_lora():
-    # Dependencies
-    ckpt = "checkpoints/videocrafter/t2v_v2_512/model.ckpt"
-    config = "configs/001_videocrafter2/vc2_t2v_lora.yaml"
-    lorackpt = "YOUR_LORA_CKPT"
-    prompt_file = "inputs/t2v/prompts.txt"
-    res_dir = "results/train/003_vc2_lora_ft"
-    result = subprocess.run(
-        ["python3", "scripts/inference.py", 
-         "--seed", "123", 
-         "--mode", "t2v", 
-         "--ckpt_path", ckpt, 
-         "--lorackpt", lorackpt, 
-         "--config", config, 
-         "--savedir", res_dir, 
-         "--n_samples", "1", 
-         "--bs", "1", 
-         "--height", "320", 
-         "--width", "512", 
-         "--unconditional_guidance_scale", "12.0", 
-         "--ddim_steps", "50", 
-         "--ddim_eta", "1.0", 
-         "--prompt_file", prompt_file, 
-         "--fps", "28"
-        ] + sys.argv[1:], 
-        check=False
-    )
-    exit(result.returncode)
-
-
-def train_cogvideox_i2v_lora():
-    # Set environment variables
-    os.environ["TOKENIZERS_PARALLELISM"] = "false"
-
-    # Dependencies
-    config = "configs/004_cogvideox/cogvideo5b-i2v.yaml"  # Experiment config
-
-    # Experiment settings
-    resroot = "results/train"  # Experiment saving directory
-    expname = "cogvideox_i2v_5b"  # Experiment name
-    datapath="data/apply_lipstick/metadata.csv"
-
-    result = subprocess.run(
-        ["python", "scripts/train.py", 
-         "-t", 
-         "--base", config, 
-         "--logdir", resroot, 
-         "--name", f"{current_time}_{expname}", 
-         "--devices", "0,", 
-         "lightning.trainer.num_nodes=1", 
-         f"data.params.train.params.csv_path={datapath}", 
-         f"data.params.validation.params.csv_path={datapath}", 
-         "--auto_resume"
-        ] + sys.argv[1:], 
-        check=False
-    )
-    exit(result.returncode)
-
-
-def train_cogvideox_i2v_fullft():
-    # Set environment variables
-    os.environ["TOKENIZERS_PARALLELISM"] = "false"
-
-    # Dependencies
-    config = "configs/004_cogvideox/cogvideo5b-i2v-fullft.yaml"  # Experiment config
-
-    # Experiment settings
-    resroot = "results/train"  # Experiment saving directory
-    expname = "cogvideox_i2v_5b_fullft"  # Experiment name
-    datapath="data/apply_lipstick/metadata.csv"
-
-    result = subprocess.run(
-        ["python", "scripts/train.py", 
-         "-t", 
-         "--base", config, 
-         "--logdir", resroot, 
-         "--name", f"{current_time}_{expname}", 
-         "--devices", "0,1,2,3", 
-         "lightning.trainer.num_nodes=1", 
-         f"data.params.train.params.csv_path={datapath}", 
-         f"data.params.validation.params.csv_path={datapath}", 
-         "--auto_resume"
-        ] + sys.argv[1:], 
-        check=False
-    )
-    exit(result.returncode)
-
-
-def train_cogvideox_t2v_lora():
-    # Set environment variables
-    os.environ["TOKENIZERS_PARALLELISM"] = "false"
-
-    # Dependencies
-    config = "configs/004_cogvideox/cogvideo5b.yaml"  # Experiment config
-    datapath = "data/apply_lipstick/metadata.csv"
-
-    # Experiment settings
-    resroot = "results/train"  # Experiment saving directory
-    expname = "cogvideox_t2v_5b"  # Experiment name
-    result = subprocess.run(
-        ["python", "scripts/train.py", 
-         "-t", 
-         "--base", config, 
-         "--logdir", resroot, 
-         "--name", f"{current_time}_{expname}", 
-         "--devices", "0,", 
-         "lightning.trainer.num_nodes=1", 
-         f"data.params.train.params.csv_path={datapath}", 
-         f"data.params.validation.params.csv_path={datapath}", 
-         "--auto_resume"
-        ] + sys.argv[1:], 
-        check=False
-    )
-    exit(result.returncode)
-
-
-def train_cogvideox_t2v_fullft():
-    # Set environment variables
-    os.environ["TOKENIZERS_PARALLELISM"] = "false"
-
-    # Dependencies
-    config = "configs/004_cogvideox/cogvideo5b-t2v-fullft.yaml"  # Experiment config
-    datapath = "data/apply_lipstick/metadata.csv"
-
-    # Experiment settings
-    resroot = "results/train"  # Experiment saving directory
-    expname = "cogvideox_t2v_5b_fullft"  # Experiment name
-    result = subprocess.run(
-        ["python", "scripts/train.py", 
-         "-t", 
-         "--base", config, 
-         "--logdir", resroot, 
-         "--name", f"{current_time}_{expname}", 
-         "--devices", "0,1,2,3", 
-         "lightning.trainer.num_nodes=1", 
-         f"data.params.train.params.csv_path={datapath}", 
-         f"data.params.validation.params.csv_path={datapath}", 
-         "--auto_resume"
-        ] + sys.argv[1:], 
-        check=False
-    )
-    exit(result.returncode)
-
-def train_dynamicrafter():
-    # Dependencies
-    sdckpt = "checkpoints/stablediffusion/v2-1_512-ema/model.ckpt"
-    dcckpt = "checkpoints/dynamicrafter/i2v_576x1024/model_converted.ckpt"
-
-    # Experiment settings
-    expname = "002_dynamicrafterft_1024"  # Experiment name
-    config = "configs/002_dynamicrafter/dc_i2v_1024.yaml"  # Experiment config
-    resroot = "results/train"  # Experiment saving directory
-    result = subprocess.run(
-        ["python", "scripts/train.py", 
-         "-t", 
-         "--name", f"{current_time}_{expname}", 
-         "--base", config, 
-         "--logdir", resroot, 
-         "--sdckpt", sdckpt, 
-         "--ckpt", dcckpt, 
-         "--devices", "0,", 
-         "lightning.trainer.num_nodes=1", 
-         "--auto_resume"
-        ] + sys.argv[1:], 
-        check=False
-    )
-    exit(result.returncode)
+    inference_wan2_2_t2v_720p_entry()
 
 
 def train_flux_lora():
     os.environ["TOKENIZERS_PARALLELISM"] = "false"
-    os.environ["CONFIG_PATH"] = "configs/006_flux/config"
-    os.environ["DATACONFIG_PATH"] = "configs/006_flux/multidatabackend"
-    os.environ["CONFIG_BACKEND"] = "json"
+    config_path = FLUX_T2I_CONFIG
+    data_config_path = FLUX_T2I_DATA_CONFIG
     result = subprocess.run(
-        ["accelerate", "launch", 
-         "--mixed_precision=bf16", 
-         "--num_processes=1", 
-         "--num_machines=1", 
-         "scripts/train_flux_lora.py", 
-         "--config_path", f"{os.environ['CONFIG_PATH']}.{os.environ['CONFIG_BACKEND']}", 
-         "--data_config_path", f"{os.environ['DATACONFIG_PATH']}.{os.environ['CONFIG_BACKEND']}"
-        ] + sys.argv[1:], 
-        check=False
+        [
+            "accelerate",
+            "launch",
+            "--mixed_precision=bf16",
+            "--num_processes=1",
+            "--num_machines=1",
+            "scripts/train_flux_lora.py",
+            "--config_path",
+            config_path,
+            "--data_config_path",
+            data_config_path,
+        ]
+        + sys.argv[1:],
+        check=False,
     )
     exit(result.returncode)
 
 
-def train_opensorav10():
-    # Experiment settings
-    expname = "opensora"  # Experiment name
-    config = "configs/003_opensora/opensorav10_256x256.yaml"  # Experiment config
-    logdir = "results/train"  # Experiment saving directory
-    result = subprocess.run(
-        ["python", "scripts/train.py", 
-         "-t", 
-         "--devices", "0,", 
-         "lightning.trainer.num_nodes=1", 
-         "--base", config, 
-         "--name", f"{current_time}_{expname}", 
-         "--logdir", logdir, 
-         "--auto_resume"
-        ] + sys.argv[1:], 
-        check=False
-    )
-    exit(result.returncode)
-
-
-def train_videocrafter_lora():
-    # Set environment variables
+def train_wan2_1_t2v_lora():
     os.environ["TOKENIZERS_PARALLELISM"] = "false"
-
-    # Dependencies
-    vc2_ckpt = "checkpoints/videocrafter/t2v_v2_512/model.ckpt"
-
-    # Experiment settings
-    expname = "videocrafter2_t2v_lora"  # Experiment name
-    config = "configs/001_videocrafter2/vc2_t2v_lora.yaml"  # Experiment config
-    resroot = "results/train"  # Experiment saving directory
-
-    # Generate current time
+    ckpt = "checkpoints/wan/Wan2.1-T2V-14B"
+    config = WAN_T2V_LORA_CONFIG
+    resroot = "results/train"
+    expname = "train_wan_domain_t2v_lora"
     result = subprocess.run(
-        ["python", "scripts/train.py", 
-         "-t", 
-         "--name", f"{current_time}_{expname}", 
-         "--base", config, 
-         "--logdir", resroot, 
-         "--ckpt", vc2_ckpt, 
-         "--devices", "0,", 
-         "lightning.trainer.num_nodes=1", 
-         "--auto_resume"
-        ] + sys.argv[1:], 
-        check=False
+        [
+            sys.executable,
+            "scripts/train_new.py",
+            "-t",
+            "--ckpt",
+            ckpt,
+            "--base",
+            config,
+            "--logdir",
+            resroot,
+            "--name",
+            f"{expname}_{current_time}",
+            "--devices",
+            "0,",
+            "--auto_resume",
+        ]
+        + sys.argv[1:],
+        check=False,
     )
     exit(result.returncode)
 
 
-def train_videocrafter_v2():
-    # Set environment variables
-    os.environ["TOKENIZERS_PARALLELISM"] = "false"
+def train_domain_t2i():
+    """Canonical alias for Flux T2I domain LoRA training."""
+    train_flux_lora()
 
-    # Dependencies
-    vc2_ckpt = "checkpoints/videocrafter/t2v_v2_512_split"  # pretrained checkpoint of videocrafter2
-    config = "configs/001_videocrafter2/vc2_t2v_320x512.yaml"  # experiment config: model+data+training
 
-    # Experiment saving directory and parameters
-    resroot = "results/train"  # root directory for saving multiple experiments
-    expname = "videocrafter2_320x512"  # experiment name
-    result = subprocess.run(
-        ["python", "scripts/train_new.py", 
-         "-t", 
-         "--ckpt", vc2_ckpt, 
-         "--base", config, 
-         "--logdir", resroot, 
-         "--name", f"{current_time}_{expname}", 
-         "--devices", "0,", 
-         "--auto_resume"
-        ] + sys.argv[1:], 
-        check=False
-    )
-    exit(result.returncode)
+def train_domain_t2v():
+    """Canonical alias for Wan 2.1 T2V domain LoRA training."""
+    train_wan2_1_t2v_lora()
 
 
-def train_hunyuan_t2v_lora():
-    # Set environment variables
+def train_wan2_1_i2v_lora():
     os.environ["TOKENIZERS_PARALLELISM"] = "false"
-
-    # Dependencies
-    config = "configs/007_hunyuanvideo/hunyuanvideo_t2v_diffuser_lora.yaml"  # Experiment config
-
-    # Experiment settings
-    resroot = "results/train"  # Experiment saving directory
-    expname = "hunyuanvideo_t2v_lora"  # Experiment name
+    ckpt = "checkpoints/wan/Wan2.1-I2V-14B-480P"
+    config = WAN_I2V_LORA_CONFIG
+    resroot = "results/train"
+    expname = "train_wan_domain_i2v_lora"
     result = subprocess.run(
-        ["python", "scripts/train.py", 
-         "-t", 
-         "--base", config, 
-         "--logdir", resroot, 
-         "--name", f"{current_time}_{expname}", 
-         "--devices", "0,1", 
-         "lightning.trainer.num_nodes=1", 
-         "--auto_resume"
-        ] + sys.argv[1:], 
-        check=False
+        [
+            sys.executable,
+            "scripts/train_new.py",
+            "-t",
+            "--ckpt",
+            ckpt,
+            "--base",
+            config,
+            "--logdir",
+            resroot,
+            "--name",
+            f"{expname}_{current_time}",
+            "--devices",
+            "0,",
+            "--auto_resume",
+        ]
+        + sys.argv[1:],
+        check=False,
     )
     exit(result.returncode)
 
 
+def train_domain_i2v():
+    """Canonical alias for Wan 2.1 I2V domain LoRA training."""
+    train_wan2_1_i2v_lora()
 
-def train_wan2_1_t2v_fullft():
-    # Set environment variables
-    os.environ["TOKENIZERS_PARALLELISM"] = "false"
 
-    # Dependencies
-    ckpt = "checkpoints/wan/Wan2.1-T2V-14B" 
-    config = "configs/008_wanvideo/wan2_1_t2v_14B_fullft.yaml" 
+def inference_domain_t2i():
+    """Canonical alias for Flux domain LoRA smoke inference."""
+    inference_flux_lora()
 
-    # Experiment saving directory and parameters
-    resroot = "results/train"  # root directory for saving multiple experiments
-    expname = "train_wanvideo_t2v_fullft"  # experiment name
-    result = subprocess.run(
-        ["python", "scripts/train_new.py", 
-         "-t", 
-         "--ckpt", ckpt, 
-         "--base", config, 
-         "--logdir", resroot, 
-         "--name", f"{expname}_{current_time}", 
-         "--devices", "0,", 
-         "--auto_resume"
-        ] + sys.argv[1:], 
-        check=False
-    )
-    exit(result.returncode)
 
+def validate_domain_t2v():
+    """Canonical Wan 2.2 domain LoRA validation after training."""
+    from videotuna.cli.inference_app import validate_domain_t2v_entry
 
-def train_wan2_1_t2v_lora():
-    # Set environment variables
-    os.environ["TOKENIZERS_PARALLELISM"] = "false"
+    validate_domain_t2v_entry()
 
-    # Dependencies
-    ckpt = "checkpoints/wan/Wan2.1-T2V-14B" 
-    config = "configs/008_wanvideo/wan2_1_t2v_14B_lora.yaml" 
 
-    # Experiment saving directory and parameters
-    resroot = "results/train"  # root directory for saving multiple experiments
-    expname = "train_wanvideo_t2v_lora"  # experiment name
-    result = subprocess.run(
-        ["python", "scripts/train_new.py", 
-         "-t", 
-         "--ckpt", ckpt, 
-         "--base", config, 
-         "--logdir", resroot, 
-         "--name", f"{expname}_{current_time}", 
-         "--devices", "0,", 
-         "--auto_resume"
-        ] + sys.argv[1:], 
-        check=False
-    )
-    exit(result.returncode)
+def validate_domain_i2v():
+    """Canonical Wan 2.2 domain I2V LoRA validation after training."""
+    from videotuna.cli.inference_app import validate_domain_i2v_entry
 
+    validate_domain_i2v_entry()
 
-def train_wan2_1_i2v_fullft():
-    # Set environment variables
-    os.environ["TOKENIZERS_PARALLELISM"] = "false"
 
-    # Dependencies
-    ckpt = "checkpoints/wan/Wan2.1-I2V-14B-480P" 
-    config = "configs/008_wanvideo/wan2_1_i2v_14B_480P_fullft.yaml" 
+def inference_wan2_2_i2v_720p():
+    from videotuna.cli.inference_app import inference_wan2_2_i2v_720p_entry
 
-    # Experiment saving directory and parameters
-    resroot = "results/train"  # root directory for saving multiple experiments
-    expname = "train_wanvideo_i2v_fullft"  # experiment name
-    result = subprocess.run(
-        ["python", "scripts/train_new.py", 
-         "-t", 
-         "--ckpt", ckpt, 
-         "--base", config, 
-         "--logdir", resroot, 
-         "--name", f"{expname}_{current_time}", 
-         "--devices", "0,", 
-         "--auto_resume"
-        ] + sys.argv[1:], 
-        check=False
-    )
-    exit(result.returncode)
+    inference_wan2_2_i2v_720p_entry()
 
-def train_wan2_1_i2v_lora():
-    # Set environment variables
-    os.environ["TOKENIZERS_PARALLELISM"] = "false"
 
-    # Dependencies
-    ckpt = "checkpoints/wan/Wan2.1-I2V-14B-480P" 
-    config = "configs/008_wanvideo/wan2_1_i2v_14B_480P_lora.yaml" 
+def benchmark_attn_backends():
+    """Benchmark eager vs sdpa vs flash on Wan Diffusers inference."""
+    from scripts.benchmark_attn_backends import main
 
-    # Experiment saving directory and parameters
-    resroot = "results/train"  # root directory for saving multiple experiments
-    expname = "train_wanvideo_i2v_lora"  # experiment name
-    result = subprocess.run(
-        ["python", "scripts/train_new.py", 
-         "-t", 
-         "--ckpt", ckpt, 
-         "--base", config, 
-         "--logdir", resroot, 
-         "--name", f"{expname}_{current_time}", 
-         "--devices", "0,", 
-         "--auto_resume"
-        ] + sys.argv[1:], 
-        check=False
-    )
-    exit(result.returncode)
\ No newline at end of file
+    raise SystemExit(main())
diff --git a/scripts/benchmark_attn_backends.py b/scripts/benchmark_attn_backends.py
new file mode 100644
index 00000000..5ae4fcd7
--- /dev/null
+++ b/scripts/benchmark_attn_backends.py
@@ -0,0 +1,296 @@
+#!/usr/bin/env python3
+"""
+Benchmark attention backends on Wan 2.2 Diffusers inference smoke runs.
+
+Example:
+    poetry run benchmark-attn-backends
+    poetry run benchmark-attn-backends --json-out results/bench_attn.json
+    poetry run benchmark-attn-backends --resolutions 480
+    VIDEOTUNA_ATTN_BACKEND=sdpa poetry run benchmark-attn-backends --json
+"""
+
+from __future__ import annotations
+
+import argparse
+import json
+import os
+import sys
+import time
+from pathlib import Path
+from typing import Any, Dict, List
+
+import torch
+from diffusers import WanPipeline
+
+from videotuna.settings import ENV_ATTN_BACKEND, get_settings
+from videotuna.utils.attention import (
+    apply_diffusers_attention_backend,
+    is_flash_attn_available,
+)
+from videotuna.utils.device_utils import (
+    detect_compute_backend,
+    empty_accelerator_cache,
+    gpu_is_available,
+    resolve_inference_device,
+    synchronize_accelerator,
+)
+
+DEFAULT_MODEL = "Wan-AI/Wan2.2-T2V-A14B-Diffusers"
+DEFAULT_HEIGHTS = [480]
+DEFAULT_NUM_FRAMES = 17
+
+
+def _verify_torch_vision_stack() -> None:
+    import torch.version
+    import torchvision
+
+    torch_build = torch.__version__
+    tv_build = torchvision.__version__
+    hip = getattr(torch.version, "hip", None)
+    if hip is not None and "+cu" in tv_build:
+        raise RuntimeError(
+            f"torch/torchvision build mismatch: torch={torch_build} (ROCm), "
+            f"torchvision={tv_build} (CUDA). Run: poetry run install-rocm"
+        )
+
+
+def _compute_capability() -> str | None:
+    if not gpu_is_available():
+        return None
+    major, minor = torch.cuda.get_device_capability()
+    return f"{major}.{minor}"
+
+
+def _load_pipeline(model_path: str, *, enable_offload: bool) -> WanPipeline:
+    loaded = WanPipeline.from_pretrained(model_path, torch_dtype=torch.bfloat16)
+    if enable_offload:
+        loaded.enable_model_cpu_offload()
+        return loaded
+    device = resolve_inference_device()
+    return loaded.to(device)
+
+
+def _run_backend(
+    backend: str,
+    model_path: str,
+    prompt: str,
+    num_inference_steps: int,
+    seed: int,
+    compute_backend: str,
+    height: int | None = None,
+    width: int | None = None,
+    num_frames: int = DEFAULT_NUM_FRAMES,
+    enable_offload: bool = False,
+) -> Dict[str, Any]:
+    os.environ[ENV_ATTN_BACKEND] = backend
+
+    if not gpu_is_available():
+        raise RuntimeError(
+            "A GPU accelerator (NVIDIA CUDA or AMD ROCm) is required for benchmarks."
+        )
+
+    device = resolve_inference_device()
+    empty_accelerator_cache()
+    torch.cuda.reset_peak_memory_stats()
+
+    pipe = _load_pipeline(model_path, enable_offload=enable_offload)
+
+    transformer = getattr(pipe, "transformer", None)
+    if transformer is not None:
+        apply_diffusers_attention_backend(transformer)
+
+    generator_device = device if not enable_offload else resolve_inference_device()
+    generator = torch.Generator(device=generator_device).manual_seed(seed)
+
+    pipe_kwargs: Dict[str, Any] = {
+        "prompt": prompt,
+        "num_inference_steps": 1,
+        "generator": generator,
+        "output_type": "latent",
+    }
+    if height is not None:
+        pipe_kwargs["height"] = height
+        pipe_kwargs["width"] = width
+        pipe_kwargs["num_frames"] = num_frames
+
+    _ = pipe(**pipe_kwargs)
+
+    synchronize_accelerator()
+    torch.cuda.reset_peak_memory_stats()
+
+    generator = torch.Generator(device=generator_device).manual_seed(seed)
+    start = time.perf_counter()
+    _ = pipe(
+        prompt=prompt,
+        num_inference_steps=num_inference_steps,
+        generator=generator,
+        output_type="latent",
+        **{
+            k: v
+            for k, v in pipe_kwargs.items()
+            if k not in ("prompt", "num_inference_steps", "generator", "output_type")
+        },
+    )
+    synchronize_accelerator()
+    elapsed = time.perf_counter() - start
+
+    peak_vram_gb = torch.cuda.max_memory_allocated() / (1024**3)
+    frames_per_sec = round(num_frames / elapsed, 3) if elapsed > 0 and height else None
+
+    del pipe
+    empty_accelerator_cache()
+
+    result: Dict[str, Any] = {
+        "backend": backend,
+        "compute_backend": compute_backend,
+        "pipeline": "wan",
+        "seconds": round(elapsed, 3),
+        "peak_vram_gb": round(peak_vram_gb, 3),
+        "num_inference_steps": num_inference_steps,
+        "model_path": model_path,
+        "compute_capability": _compute_capability(),
+        "enable_offload": enable_offload,
+    }
+    if height is not None:
+        result["height"] = height
+        result["width"] = width
+        result["num_frames"] = num_frames
+        result["frames_per_sec"] = frames_per_sec
+    return result
+
+
+def main(argv: List[str] | None = None) -> int:  # noqa: C901
+    parser = argparse.ArgumentParser(
+        description="Benchmark PrivTune attention backends on Wan 2.2 Diffusers."
+    )
+    parser.add_argument(
+        "--model-path",
+        default=None,
+        help="Hugging Face model id or local path.",
+    )
+    parser.add_argument(
+        "--prompt",
+        default="A cat riding a bicycle through a sunny park.",
+        help="Short prompt for the smoke benchmark.",
+    )
+    parser.add_argument(
+        "--num-inference-steps",
+        type=int,
+        default=4,
+        help="Diffusion steps for the timed run (after warm-up).",
+    )
+    parser.add_argument(
+        "--num-frames",
+        type=int,
+        default=DEFAULT_NUM_FRAMES,
+        help="Frame count when benchmarking with explicit resolution.",
+    )
+    parser.add_argument(
+        "--enable-offload",
+        action="store_true",
+        help="Use enable_model_cpu_offload during the benchmark.",
+    )
+    parser.add_argument("--seed", type=int, default=42)
+    parser.add_argument(
+        "--backends",
+        nargs="+",
+        default=None,
+        help="Backends to test (default: eager sdpa; flash on CUDA when available).",
+    )
+    parser.add_argument(
+        "--resolutions",
+        default=None,
+        help="Comma-separated heights (width keeps 16:9 aspect).",
+    )
+    parser.add_argument(
+        "--json-out",
+        default=None,
+        help="Write JSON results to this file path.",
+    )
+    parser.add_argument(
+        "--json", action="store_true", help="Print JSON instead of a table."
+    )
+    args = parser.parse_args(argv)
+
+    _verify_torch_vision_stack()
+    model_path = args.model_path or get_settings().bench_model or DEFAULT_MODEL
+
+    compute_backend = detect_compute_backend()
+    backends = args.backends or ["eager", "sdpa"]
+    if (
+        compute_backend == "cuda"
+        and is_flash_attn_available()
+        and "flash" not in backends
+    ):
+        backends.append("flash")
+
+    if args.resolutions:
+        heights: List[int | None] = [
+            int(h.strip()) for h in args.resolutions.split(",") if h.strip()
+        ]
+    else:
+        heights = list(DEFAULT_HEIGHTS)
+
+    results: List[Dict[str, Any]] = []
+    for height in heights:
+        width = int(height * 16 / 9) if height else None
+        for backend in backends:
+            label = backend if height is None else f"{backend}@{height}p"
+            print(
+                f"Running wan backend={label} ({compute_backend}) ...",
+                file=sys.stderr,
+            )
+            try:
+                results.append(
+                    _run_backend(
+                        backend=backend,
+                        model_path=model_path,
+                        prompt=args.prompt,
+                        num_inference_steps=args.num_inference_steps,
+                        seed=args.seed,
+                        compute_backend=compute_backend,
+                        height=height,
+                        width=width,
+                        num_frames=args.num_frames,
+                        enable_offload=args.enable_offload,
+                    )
+                )
+            except Exception as exc:
+                results.append(
+                    {
+                        "backend": backend,
+                        "compute_backend": compute_backend,
+                        "pipeline": "wan",
+                        "height": height,
+                        "error": str(exc),
+                    }
+                )
+
+    if args.json_out:
+        out_path = Path(args.json_out)
+        out_path.parent.mkdir(parents=True, exist_ok=True)
+        out_path.write_text(json.dumps(results, indent=2))
+
+    if args.json:
+        print(json.dumps(results, indent=2))
+    else:
+        print(f"\nCompute backend: {compute_backend}  pipeline: wan\n")
+        print("| Backend | Seconds | Peak VRAM (GB) | Frames/s |")
+        print("| --- | ---: | ---: | ---: |")
+        for row in results:
+            if "error" in row:
+                print(f"| {row['backend']} | ERROR | {row['error']} | |")
+            else:
+                vram = row["peak_vram_gb"]
+                fps = row.get("frames_per_sec")
+                label = row["backend"]
+                if row.get("height"):
+                    label = f"{label} ({row['height']}p)"
+                fps_str = f"{fps:.3f}" if fps is not None else "n/a"
+                print(f"| {label} | {row['seconds']:.3f} | {vram:.3f} | {fps_str} |")
+
+    return 0
+
+
+if __name__ == "__main__":
+    raise SystemExit(main())
diff --git a/scripts/inference.py b/scripts/inference.py
deleted file mode 100644
index 734263b0..00000000
--- a/scripts/inference.py
+++ /dev/null
@@ -1,336 +0,0 @@
-import os
-import sys
-import time
-import json
-import argparse
-
-
-import torch
-from tqdm import trange
-from omegaconf import OmegaConf
-from pytorch_lightning import seed_everything
-
-sys.path.insert(0, os.getcwd())
-sys.path.insert(1, f"{os.getcwd()}/src")
-from videotuna.schedulers.ddim import DDIMSampler
-from videotuna.schedulers.ddim_multiplecond import DDIMSampler as DDIMSampler_multicond
-from videotuna.utils.common_utils import instantiate_from_config
-from videotuna.utils.inference_utils import (
-    load_inputs_i2v,
-    load_model_checkpoint,
-    load_prompts_from_txt,
-    sample_batch_i2v,
-    sample_batch_t2v,
-    save_videos,
-    save_videos_vbench,
-)
-
-
-def get_parser():
-    parser = argparse.ArgumentParser()
-    parser.add_argument(
-        "--mode",
-        default="t2v",
-        type=str,
-        help="inference mode: t2v/i2v",
-        choices=["t2v", "i2v"],
-    )
-    #
-    parser.add_argument("--ckpt_path", type=str, default=None, help="checkpoint path")
-    parser.add_argument("--config", type=str, help="model config (yaml) path")
-    parser.add_argument(
-        "--prompt_file",
-        type=str,
-        default=None,
-        help="a text file containing many prompts for text-to-video",
-    )
-    parser.add_argument(
-        "--prompt_dir",
-        type=str,
-        default=None,
-        help="a input dir containing images and prompts for image-to-video/interpolation",
-    )
-    parser.add_argument("--savedir", type=str, default=None, help="results saving path")
-    parser.add_argument(
-        "--standard_vbench",
-        action="store_true",
-        default=False,
-        help="inference standard vbench prompts",
-    )
-    #
-    parser.add_argument("--seed", type=int, default=123, help="random seed")
-    #
-    parser.add_argument(
-        "--height", type=int, default=320, help="video height, in pixel space"
-    )
-    parser.add_argument(
-        "--width", type=int, default=512, help="video width, in pixel space"
-    )
-    parser.add_argument(
-        "--frames", type=int, default=None, help="video frame number, in pixel space"
-    )
-    parser.add_argument(
-        "--fps",
-        type=int,
-        default=24,
-        help="video motion speed. 512 or 1024 model: large value -> slow motion; 256 model: large value -> large motion;",
-    )
-    parser.add_argument(
-        "--n_samples_prompt",
-        type=int,
-        default=1,
-        help="num of samples per prompt",
-    )
-    #
-    parser.add_argument("--bs", type=int, default=1, help="batch size for inference")
-    parser.add_argument(
-        "--ddim_steps",
-        type=int,
-        default=50,
-        help="steps of ddim if positive, otherwise use DDPM",
-    )
-    parser.add_argument(
-        "--ddim_eta",
-        type=float,
-        default=1.0,
-        help="eta for ddim sampling (0.0 yields deterministic sampling)",
-    )
-    parser.add_argument(
-        "--uncond_prompt",
-        type=str,
-        default="",
-        help="unconditional prompts, or negative prompts",
-    )
-    parser.add_argument(
-        "--unconditional_guidance_scale",
-        type=float,
-        default=12.0,
-        help="prompt classifier-free guidance",
-    )
-    parser.add_argument(
-        "--unconditional_guidance_scale_temporal",
-        type=float,
-        default=None,
-        help="temporal consistency guidance",
-    )
-    # dc args
-    parser.add_argument(
-        "--multiple_cond_cfg",
-        action="store_true",
-        default=False,
-        help="i2v: use multi-condition cfg or not",
-    )
-    parser.add_argument(
-        "--cfg_img",
-        type=float,
-        default=None,
-        help="guidance scale for image conditioning",
-    )
-    parser.add_argument(
-        "--timestep_spacing",
-        type=str,
-        default="uniform",
-        help="The way the timesteps should be scaled. Refer to Table 2 of the [Common Diffusion Noise Schedules and Sample Steps are Flawed](https://huggingface.co/papers/2305.08891) for more information.",
-    )
-    parser.add_argument(
-        "--guidance_rescale",
-        type=float,
-        default=0.0,
-        help="guidance rescale in [Common Diffusion Noise Schedules and Sample Steps are Flawed](https://huggingface.co/papers/2305.08891)",
-    )
-    parser.add_argument(
-        "--loop",
-        action="store_true",
-        default=False,
-        help="generate looping videos or not",
-    )
-    parser.add_argument(
-        "--gfi",
-        action="store_true",
-        default=False,
-        help="generate generative frame interpolation (gfi) or not",
-    )
-    # lora args
-    parser.add_argument(
-        "--lorackpt",
-        type=str,
-        default=None,
-        help="[Optional] checkpoint path for lora model. ",
-    )
-    #
-    parser.add_argument("--savefps", type=str, default=10, help="video fps to generate")
-    return parser
-
-
-def load_model(args, cuda_idx=0):
-    """
-    Create model and load weight.
-    """
-    # build model
-    config = OmegaConf.load(args.config)
-    model_config = config.pop("model", OmegaConf.create())
-    if args.lorackpt is not None:
-        model_config["params"]["lora_args"] = {"lora_ckpt": args.lorackpt}
-    model = instantiate_from_config(model_config)
-    model = model.cuda(cuda_idx)
-    # load weights
-    assert os.path.exists(
-        args.ckpt_path
-    ), f"Error: checkpoint [{args.ckpt_path}] Not Found!"
-    model = load_model_checkpoint(model, args.ckpt_path)
-    # load lora weights
-    if hasattr(model, "lora_args") and len(model.lora_args) != 0:
-        model.inject_lora()
-
-    model.eval()
-    return model
-
-
-def load_inputs(args):
-    """
-    load inputs:
-        t2v: prompts
-        i2v: prompts + images
-    """
-    assert (
-        args.prompt_file is not None or args.prompt_dir is not None
-    ), "Error: input file/dir NOT Found!"
-
-    if args.prompt_file is not None:
-        assert os.path.exists(args.prompt_file)
-        # load inputs for t2v
-        prompt_list = load_prompts_from_txt(args.prompt_file)
-        num_prompts = len(prompt_list)
-        filename_list = [f"prompt-{idx+1:04d}" for idx in range(num_prompts)]
-        image_list = None
-    elif args.prompt_dir is not None:
-        assert os.path.exists(args.prompt_dir)
-        # load inputs for i2v
-        filename_list, image_list, prompt_list = load_inputs_i2v(
-            args.prompt_dir,
-            video_size=(args.height, args.width),
-            video_frames=args.frames,
-        )
-    return prompt_list, image_list, filename_list
-
-
-def run_inference(args, gpu_num=1, rank=0, **kwargs):
-    """
-    Inference t2v/i2v models
-    """
-    assert (args.height % 16 == 0) and (
-        args.width % 16 == 0
-    ), "Error: image size [h,w] should be multiples of 16!"
-
-    seed_everything(args.seed)
-    os.makedirs(args.savedir, exist_ok=True)
-
-    # load model, sampler, inputs
-    model = load_model(args)
-    if args.mode == "i2v" and args.multiple_cond_cfg:
-        ddim_sampler = DDIMSampler_multicond(model)
-    else:
-        ddim_sampler = DDIMSampler(model)
-    args.frames = model.temporal_length if args.frames is None else args.frames
-    prompt_list, image_list, filename_list = load_inputs(args)
-
-    # split across multiple gpus
-    num_samples = len(prompt_list)
-    num_samples_rank = num_samples // gpu_num
-    remainder = num_samples % gpu_num
-    indices_rank = list(range(num_samples_rank * rank, num_samples_rank * (rank + 1)))
-    if rank == 0 and remainder != 0:
-        indices_rank = indices_rank + list(range(num_samples - remainder, num_samples))
-    #
-    prompt_list_rank = [prompt_list[i] for i in indices_rank]
-    filename_list_rank = [filename_list[i] for i in indices_rank]
-    if args.mode == "i2v":
-        image_list_rank = [image_list[i] for i in indices_rank]
-
-    # noise shape
-    h, w, frames, channels = (
-        args.height // 8,
-        args.width // 8,
-        args.frames,
-        model.channels,
-    )
-
-    # -----------------------------------------------------------------
-    # inference iters
-    format_file = {}
-    start = time.time()
-    n_iters = len(prompt_list_rank) // args.bs + (
-        1 if len(prompt_list_rank) % args.bs else 0
-    )
-    with torch.no_grad():
-        for idx in trange(0, n_iters, desc="Sample Iters"):
-
-            prompts = prompt_list_rank[idx * args.bs : (idx + 1) * args.bs]
-            filenames = filename_list_rank[idx * args.bs : (idx + 1) * args.bs]
-
-            if args.mode == "i2v":
-                images = image_list_rank[idx * args.bs : (idx + 1) * args.bs]
-                if isinstance(images, list):
-                    images = torch.stack(images, dim=0).to("cuda")
-                else:
-                    images = images.unsqueeze(0).to("cuda")
-
-            # inference batch
-            bs = args.bs if args.bs == len(prompts) else len(prompts)
-            noise_shape = [bs, channels, frames, h, w]
-            if args.mode == "t2v":
-                batch_samples = sample_batch_t2v(
-                    model,
-                    ddim_sampler,
-                    prompts,
-                    noise_shape,
-                    args.fps,
-                    args.n_samples_prompt,
-                    args.ddim_steps,
-                    args.ddim_eta,
-                    args.unconditional_guidance_scale,
-                    args.unconditional_guidance_scale_temporal,
-                    args.uncond_prompt,
-                )
-            elif args.mode == "i2v":
-                batch_samples = sample_batch_i2v(
-                    model,
-                    ddim_sampler,
-                    prompts,
-                    images,
-                    noise_shape,
-                    args.n_samples_prompt,
-                    args.ddim_steps,
-                    args.ddim_eta,
-                    args.unconditional_guidance_scale,
-                    args.cfg_img,
-                    args.fps,
-                    args.uncond_prompt,
-                    args.multiple_cond_cfg,
-                    args.loop,
-                    args.gfi,
-                    args.timestep_spacing,
-                    args.guidance_rescale,
-                )
-            else:
-                raise ValueError
-
-            if args.standard_vbench:
-                save_videos_vbench(
-                    batch_samples, args.savedir, prompts, format_file, fps=args.savefps
-                )
-                print("test")
-            else:
-                save_videos(batch_samples, args.savedir, filenames, fps=args.savefps)
-
-    if args.standard_vbench:
-        with open(os.path.join(args.savedir, "info.json"), "w") as f:
-            json.dump(format_file, f)
-
-    print(f"Saved in {args.savedir}. Time used: {(time.time() - start):.2f} seconds")
-
-
-if __name__ == "__main__":
-
-    args = get_parser().parse_args()
-    run_inference(args)
diff --git a/scripts/inference_cogVideo_diffusers.py b/scripts/inference_cogVideo_diffusers.py
deleted file mode 100644
index b5eee52f..00000000
--- a/scripts/inference_cogVideo_diffusers.py
+++ /dev/null
@@ -1,333 +0,0 @@
-"""
-This script demonstrates how to generate a video using the CogVideoX model with the Hugging Face `diffusers` pipeline.
-The script supports different types of video generation, including text-to-video (t2v), image-to-video (i2v),
-and video-to-video (v2v), depending on the input data and different weight.
-
-- text-to-video: THUDM/CogVideoX-5b or THUDM/CogVideoX-2b
-- video-to-video: THUDM/CogVideoX-5b or THUDM/CogVideoX-2b
-- image-to-video: THUDM/CogVideoX-5b-I2V
-
-Running the Script:
-To run the script, use the following command with appropriate arguments:
-
-```bash
-$ python cli_demo.py --prompt "A girl riding a bike." --model_path THUDM/CogVideoX-5b --generate_type "t2v"
-```
-
-Additional options are available to specify the model path, guidance scale, number of inference steps, video generation type, and output paths.
-"""
-
-import argparse
-import glob
-import os
-import sys
-import time
-from typing import Literal
-
-import torch
-from diffusers import (
-    CogVideoXDDIMScheduler,
-    CogVideoXDPMScheduler,
-    CogVideoXImageToVideoPipeline,
-    CogVideoXPipeline,
-    CogVideoXVideoToVideoPipeline,
-)
-
-sys.path.insert(0, os.getcwd())
-from diffusers.utils import export_to_video, load_image, load_video
-
-from videotuna.utils.inference_utils import get_target_filelist, load_prompts_from_txt
-from videotuna.utils.common_utils import monitor_resources, save_metrics
-
-
-def generate_video(
-    model_input: str,
-    model_path: str,
-    lora_path: str = None,
-    lora_rank: int = 128,
-    output_path: str = "./output.mp4",
-    image_or_video_path: str = "",
-    num_inference_steps: int = 50,
-    guidance_scale: float = 6.0,
-    num_videos_per_prompt: int = 1,
-    dtype: torch.dtype = torch.bfloat16,
-    generate_type: str = Literal[
-        "t2v", "i2v", "v2v"
-    ],  # i2v: image to video, v2v: video to video
-    seed: int = 42,
-    enable_sequential_cpu_offload: bool = False,
-    enable_model_cpu_offload: bool = False,
-    enable_vae_slicing: bool = False,
-    enable_vae_tiling: bool = False
-):
-    """
-    Generates a video based on the given input and saves it to the specified path.
-
-    Parameters:
-    - model_input (str): can be a string prompt or a path to a prompt file for t2v, or a directory containing images or videos for i2v and v2v.
-    - model_path (str): The path of the pre-trained model to be used.
-    - lora_path (str): The path of the LoRA weights to be used.
-    - lora_rank (int): The rank of the LoRA weights.
-    - output_path (str): The path or directory where the generated video will be saved.
-    - num_inference_steps (int): Number of steps for the inference process. More steps can result in better quality.
-    - guidance_scale (float): The scale for classifier-free guidance. Higher values can lead to better alignment with the prompt.
-    - num_videos_per_prompt (int): Number of videos to generate per prompt.
-    - dtype (torch.dtype): The data type for computation (default is torch.bfloat16).
-    - generate_type (str): The type of video generation (e.g., 't2v', 'i2v', 'v2v').·
-    - seed (int): The seed for reproducibility.
-    """
-    if not output_path.endswith(".mp4"):  # output_path is a directory
-        os.makedirs(output_path, exist_ok=True)
-
-    if model_input.endswith(".txt"):
-        # model_input is a file for t2v
-        prompts = load_prompts_from_txt(prompt_file=model_input)
-        image_or_video_paths = [None] * len(prompts)
-    elif os.path.isdir(model_input):
-        if generate_type == "i2v":
-            # model_input is a directory for i2v
-            prompt_file = get_target_filelist(model_input, ext="txt")[0]
-            prompts = load_prompts_from_txt(prompt_file=prompt_file)
-            images = get_target_filelist(model_input, ext="png,jpg,webp,jpeg")
-            image_or_video_paths = images
-        elif generate_type == "v2v":
-            # model_input is a directory for v2v
-            prompt_file = get_target_filelist(model_input, ext="txt")[0]
-            prompts = load_prompts_from_txt(prompt_file=prompt_file)
-            videos = [
-                os.path.join(model_input, f)
-                for f in os.listdir(model_input)
-                if f.endswith(".mp4")
-            ]
-            image_or_video_paths = videos
-    else:
-        assert isinstance(model_input, str)
-        prompts = [model_input]
-        image_or_video_paths = [None]
-
-    # 1.  Load the pre-trained CogVideoX pipeline with the specified precision (bfloat16).
-    # add device_map="balanced" in the from_pretrained function and remove the enable_model_cpu_offload()
-    # function to use Multi GPUs.
-
-    if generate_type == "i2v":
-        pipe = CogVideoXImageToVideoPipeline.from_pretrained(
-            model_path, torch_dtype=dtype
-        )
-    elif generate_type == "t2v":
-        pipe = CogVideoXPipeline.from_pretrained(model_path, torch_dtype=dtype)
-    else:
-        pipe = CogVideoXVideoToVideoPipeline.from_pretrained(
-            model_path, torch_dtype=dtype
-        )
-
-    # If you're using with lora, add this code
-    if lora_path:
-        pipe.load_lora_weights(
-            lora_path,
-            weight_name="pytorch_lora_weights.safetensors",
-            adapter_name="test_1",
-        )
-        pipe.fuse_lora(lora_scale=1 / lora_rank)
-
-    # 2. Set Scheduler.
-    # Can be changed to `CogVideoXDPMScheduler` or `CogVideoXDDIMScheduler`.
-    # We recommend using `CogVideoXDDIMScheduler` for CogVideoX-2B.
-    # using `CogVideoXDPMScheduler` for CogVideoX-5B / CogVideoX-5B-I2V.
-
-    # pipe.scheduler = CogVideoXDDIMScheduler.from_config(pipe.scheduler.config, timestep_spacing="trailing")
-    pipe.scheduler = CogVideoXDPMScheduler.from_config(
-        pipe.scheduler.config, timestep_spacing="trailing"
-    )
-
-    # 3. Enable CPU offload for the model.
-    # turn off if you have multiple GPUs or enough GPU memory(such as H100) and it will cost less time in inference
-    # and enable to("cuda")
-
-    if enable_sequential_cpu_offload:
-        pipe.enable_sequential_cpu_offload()
-    elif enable_model_cpu_offload:
-        pipe.enable_model_cpu_offload()
-    else:
-        pipe.to("cuda")
-
-    if enable_vae_slicing:
-        pipe.vae.enable_slicing()
-    if enable_vae_tiling:
-        pipe.vae.enable_tiling()
-
-    start_time = time.time()
-    # 4. Generate the video frames based on the prompt.
-    # `num_frames` is the Number of frames to generate.
-    # This is the default value for 6 seconds video and 8 fps and will plus 1 frame for the first frame and 49 frames.
-    gpu_metrics = []
-    time_metrics = []
-    for i, (prompt, image_or_video_path) in enumerate(
-        zip(prompts, image_or_video_paths)
-    ):
-        output_path_ = (
-            os.path.join(output_path, f"{i:03d}-{prompt}.mp4")
-            if os.path.isdir(output_path)
-            else output_path
-        )
-        result_with_metrics = inference(image_or_video_path, num_inference_steps, guidance_scale, num_videos_per_prompt, generate_type, seed, pipe, prompt)
-        video_generate = result_with_metrics['result']
-        gpu_metrics.append(result_with_metrics.get('gpu', -1.0))
-        time_metrics.append(result_with_metrics.get('time', -1.0))
-        # 5. Export the generated frames to a video file. fps must be 8 for original video.
-        export_to_video(video_generate, output_path_, fps=8)
-    save_metrics(gpu=gpu_metrics, time=time_metrics, config=None, savedir=output_path)
-
-    print(f"Total time taken: {time.time() - start_time:.2f}s")
-    avg_time = (time.time() - start_time) / len(prompts) / num_videos_per_prompt
-    print(f"Average time taken per prompt: {avg_time:.2f}s")
-
-@monitor_resources(return_metrics=True)
-def inference(image_or_video_path, num_inference_steps, guidance_scale, num_videos_per_prompt, generate_type, seed, pipe, prompt):
-    if generate_type == "i2v":
-        image = load_image(image=image_or_video_path)
-        video_generate = pipe(
-                prompt=prompt,
-                image=image,  # The path of the image to be used as the background of the video
-                num_videos_per_prompt=num_videos_per_prompt,  # Number of videos to generate per prompt
-                num_inference_steps=num_inference_steps,  # Number of inference steps
-                num_frames=49,  # Number of frames to generate，changed to 49 for diffusers version `0.30.3` and after.
-                use_dynamic_cfg=True,  # This id used for DPM Sechduler, for DDIM scheduler, it should be False
-                guidance_scale=guidance_scale,
-                generator=torch.Generator().manual_seed(
-                    seed
-                ),  # Set the seed for reproducibility
-            ).frames[0]
-    elif generate_type == "t2v":
-        video_generate = pipe(
-                prompt=prompt,
-                num_videos_per_prompt=num_videos_per_prompt,
-                num_inference_steps=num_inference_steps,
-                num_frames=49,
-                use_dynamic_cfg=True,
-                guidance_scale=guidance_scale,
-                generator=torch.Generator().manual_seed(seed),
-            ).frames[0]
-    else:
-            # v2v
-        video = load_video(image_or_video_path)
-        video_generate = pipe(
-                prompt=prompt,
-                video=video,  # The path of the video to be used as the background of the video
-                num_videos_per_prompt=num_videos_per_prompt,
-                num_inference_steps=num_inference_steps,
-                # num_frames=49,
-                use_dynamic_cfg=True,
-                guidance_scale=guidance_scale,
-                generator=torch.Generator().manual_seed(
-                    seed
-                ),  # Set the seed for reproducibility
-            ).frames[0]
-        
-    return video_generate
-
-
-if __name__ == "__main__":
-    parser = argparse.ArgumentParser(
-        description="Generate a video from a text prompt using CogVideoX"
-    )
-    parser.add_argument(
-        "--generate_type",
-        type=str,
-        default="t2v",
-        help="The type of video generation (e.g., 't2v', 'i2v', 'v2v')",
-    )
-    parser.add_argument(
-        "--model_input",
-        type=str,
-        default="",
-        help="The description of the video to be generated",
-    )
-    parser.add_argument(
-        "--image_or_video_path",
-        type=str,
-        default=None,
-        help="The path of the image to be used as the background of the video",
-    )
-    parser.add_argument(
-        "--model_path",
-        type=str,
-        default="THUDM/CogVideoX-5b",
-        help="The path of the pre-trained model to be used",
-    )
-    parser.add_argument(
-        "--lora_path",
-        type=str,
-        default=None,
-        help="The path of the LoRA weights to be used",
-    )
-    parser.add_argument(
-        "--lora_rank", type=int, default=128, help="The rank of the LoRA weights"
-    )
-    parser.add_argument(
-        "--output_path",
-        type=str,
-        default="./output.mp4",
-        help="The path where the generated video will be saved",
-    )
-    parser.add_argument(
-        "--guidance_scale",
-        type=float,
-        default=6.0,
-        help="The scale for classifier-free guidance",
-    )
-    parser.add_argument(
-        "--num_inference_steps",
-        type=int,
-        default=50,
-        help="Number of steps for the inference process",
-    )
-    parser.add_argument(
-        "--num_videos_per_prompt",
-        type=int,
-        default=1,
-        help="Number of videos to generate per prompt",
-    )
-
-    parser.add_argument(
-        "--dtype",
-        type=str,
-        default="bfloat16",
-        help="The data type for computation (e.g., 'float16' or 'bfloat16')",
-    )
-    parser.add_argument(
-        "--seed", type=int, default=42, help="The seed for reproducibility"
-    )
-    parser.add_argument(
-        "--enable_vae_tiling", action="store_true", help="enable vae tiling"
-    )
-    parser.add_argument(
-        "--enable_vae_slicing", action="store_true", help="enable vae slicing"
-    )
-    parser.add_argument(
-        "--enable_sequential_cpu_offload", action="store_true", help="enable sequential cpu offload"
-    )
-    parser.add_argument(
-        "--enable_model_cpu_offload", action="store_true", help="enable model cpu offload"
-    )
-
-
-    args = parser.parse_args()
-    dtype = torch.float16 if args.dtype == "float16" else torch.bfloat16
-    generate_video(
-        model_input=args.model_input,
-        model_path=args.model_path,
-        lora_path=args.lora_path,
-        lora_rank=args.lora_rank,
-        output_path=args.output_path,
-        image_or_video_path=args.image_or_video_path,
-        num_inference_steps=args.num_inference_steps,
-        guidance_scale=args.guidance_scale,
-        num_videos_per_prompt=args.num_videos_per_prompt,
-        dtype=dtype,
-        generate_type=args.generate_type,
-        seed=args.seed,
-        enable_model_cpu_offload=args.enable_model_cpu_offload,
-        enable_sequential_cpu_offload=args.enable_sequential_cpu_offload,
-        enable_vae_slicing=args.enable_vae_slicing,
-        enable_vae_tiling=args.enable_vae_tiling,
-    )
diff --git a/scripts/inference_cogVideo_sat_refactor.py b/scripts/inference_cogVideo_sat_refactor.py
deleted file mode 100644
index cdc7edb4..00000000
--- a/scripts/inference_cogVideo_sat_refactor.py
+++ /dev/null
@@ -1,304 +0,0 @@
-import argparse
-import math
-import os
-import sys
-from typing import List, Union
-
-import imageio
-import numpy as np
-import omegaconf
-import torch
-import torchvision.transforms as TT
-from einops import rearrange, repeat
-from omegaconf import ListConfig, OmegaConf
-from PIL import Image
-from sat import mpu
-from sat.arguments import (
-    add_data_args,
-    add_evaluation_args,
-    add_training_args,
-    set_random_seed,
-)
-from sat.model.base_model import get_model
-from sat.training.model_io import load_checkpoint
-from tqdm import tqdm
-
-sys.path.append(os.path.join(os.path.dirname(__file__), "../videotuna/models/cogvideo_sat"))
-import datetime
-
-from arguments import getArgs
-
-# from cogvideo_sat import diffusion_video
-from diffusion_video import SATVideoDiffusionEngine
-
-current_time = datetime.datetime.now().strftime("%Y%m%d%H%M%S")
-
-
-def read_from_file(p, rank=0, world_size=1):
-    with open(p, "r") as fin:
-        cnt = -1
-        for l in fin:
-            cnt += 1
-            if cnt % world_size != rank:
-                continue
-            yield l.strip(), cnt
-
-
-def get_batch(keys, value_dict, N: Union[List, ListConfig], T=None, device="cuda"):
-    batch = {}
-    batch_uc = {}
-
-    for key in keys:
-        if key == "txt":
-            batch["txt"] = (
-                np.repeat([value_dict["prompt"]], repeats=math.prod(N))
-                .reshape(N)
-                .tolist()
-            )
-            batch_uc["txt"] = (
-                np.repeat([value_dict["negative_prompt"]], repeats=math.prod(N))
-                .reshape(N)
-                .tolist()
-            )
-        elif key == "original_size_as_tuple":
-            batch["original_size_as_tuple"] = (
-                torch.tensor([value_dict["orig_height"], value_dict["orig_width"]])
-                .to(device)
-                .repeat(*N, 1)
-            )
-        elif key == "crop_coords_top_left":
-            batch["crop_coords_top_left"] = (
-                torch.tensor(
-                    [value_dict["crop_coords_top"], value_dict["crop_coords_left"]]
-                )
-                .to(device)
-                .repeat(*N, 1)
-            )
-        elif key == "aesthetic_score":
-            batch["aesthetic_score"] = (
-                torch.tensor([value_dict["aesthetic_score"]]).to(device).repeat(*N, 1)
-            )
-            batch_uc["aesthetic_score"] = (
-                torch.tensor([value_dict["negative_aesthetic_score"]])
-                .to(device)
-                .repeat(*N, 1)
-            )
-
-        elif key == "target_size_as_tuple":
-            batch["target_size_as_tuple"] = (
-                torch.tensor([value_dict["target_height"], value_dict["target_width"]])
-                .to(device)
-                .repeat(*N, 1)
-            )
-        elif key == "fps":
-            batch[key] = (
-                torch.tensor([value_dict["fps"]]).to(device).repeat(math.prod(N))
-            )
-        elif key == "fps_id":
-            batch[key] = (
-                torch.tensor([value_dict["fps_id"]]).to(device).repeat(math.prod(N))
-            )
-        elif key == "motion_bucket_id":
-            batch[key] = (
-                torch.tensor([value_dict["motion_bucket_id"]])
-                .to(device)
-                .repeat(math.prod(N))
-            )
-        elif key == "pool_image":
-            batch[key] = repeat(value_dict[key], "1 ... -> b ...", b=math.prod(N)).to(
-                device, dtype=torch.half
-            )
-        elif key == "cond_aug":
-            batch[key] = repeat(
-                torch.tensor([value_dict["cond_aug"]]).to("cuda"),
-                "1 -> b",
-                b=math.prod(N),
-            )
-        elif key == "cond_frames":
-            batch[key] = repeat(value_dict["cond_frames"], "1 ... -> b ...", b=N[0])
-        elif key == "cond_frames_without_noise":
-            batch[key] = repeat(
-                value_dict["cond_frames_without_noise"], "1 ... -> b ...", b=N[0]
-            )
-        else:
-            batch[key] = value_dict[key]
-
-    if T is not None:
-        batch["num_video_frames"] = T
-
-    for key in batch.keys():
-        if key not in batch_uc and isinstance(batch[key], torch.Tensor):
-            batch_uc[key] = torch.clone(batch[key])
-    return batch, batch_uc
-
-
-def save_video_as_grid_and_mp4(
-    video_batch: torch.Tensor, save_path: str, fps: int = 5, args=None, key=None
-):
-    os.makedirs(save_path, exist_ok=True)
-
-    for i, vid in enumerate(video_batch):
-        gif_frames = []
-        for frame in vid:
-            frame = rearrange(frame, "c h w -> h w c")
-            frame = (255.0 * frame).cpu().numpy().astype(np.uint8)
-            gif_frames.append(frame)
-        now_save_path = os.path.join(save_path, f"prompt-{key:04d}.mp4")
-        with imageio.get_writer(now_save_path, fps=fps) as writer:
-            for frame in gif_frames:
-                writer.append_data(frame)
-
-
-def main(args, model_cls):
-    model = get_model(args, model_cls) if isinstance(model_cls, type) else model_cls
-    load_checkpoint(model, args)
-    model.eval()
-
-    if args.input_type == "txt":
-        rank, world_size = (
-            mpu.get_data_parallel_rank(),
-            mpu.get_data_parallel_world_size(),
-        )
-        data_iter = read_from_file(args.input_file, rank=rank, world_size=world_size)
-    else:
-        raise NotImplementedError("Only 'txt' input_type is supported.")
-
-    sample_func = model.sample
-    num_samples = [1]
-    force_uc_zero_embeddings = ["txt"]
-    T, C = args.sampling_num_frames, args.latent_channels
-    counter = 0
-
-    def get_images_in_list(folder_path, extensions=("jpg", "png")):
-        files = sorted(
-            f for f in os.listdir(folder_path) if f.lower().endswith(extensions)
-        )
-        return [os.path.join(folder_path, file) for file in files]
-
-    def nearest_multiple_of_16(n):
-        return int(min(((n // 16) * 16, (n // 16 + 1) * 16), key=lambda x: abs(n - x)))
-
-    images = get_images_in_list(args.image_folder) if args.image2video else None
-
-    with torch.no_grad():
-        for text, cnt in tqdm(data_iter):
-            if args.image2video:
-                image_path = images[counter]
-                counter += 1
-                assert os.path.exists(
-                    image_path
-                ), f"Image path does not exist: {image_path}"
-
-                image = Image.open(image_path).convert("RGB")
-                img_W, img_H = image.size
-                H, W = (
-                    (96, nearest_multiple_of_16(img_W / img_H * 96 * 8) // 8)
-                    if img_H < img_W
-                    else (nearest_multiple_of_16(img_H / img_W * 96 * 8) // 8, 96)
-                )
-
-                transform = TT.Compose(
-                    [
-                        TT.Resize(size=[int(H * 8), int(W * 8)], interpolation=1),
-                        TT.ToTensor(),
-                    ]
-                )
-                image = transform(image).unsqueeze(0).to("cuda") * 2.0 - 1.0
-                image = image.unsqueeze(2).to(torch.bfloat16)
-                image = model.encode_first_stage(image, None) / model.scale_factor
-                image = image.permute(0, 2, 1, 3, 4).contiguous()
-                pad_shape = (image.shape[0], T - 1, C, H, W)
-                image = torch.cat(
-                    [
-                        image,
-                        torch.zeros(pad_shape, device=image.device, dtype=image.dtype),
-                    ],
-                    dim=1,
-                )
-            else:
-                image, H, W = None, *args.sampling_image_size
-
-            text_cast = [text]
-            mp_size = mpu.get_model_parallel_world_size()
-            global_rank = torch.distributed.get_rank() // mp_size
-            src = global_rank * mp_size
-            torch.distributed.broadcast_object_list(
-                text_cast, src=src, group=mpu.get_model_parallel_group()
-            )
-            text = text_cast[0]
-
-            value_dict = {
-                "prompt": text,
-                "negative_prompt": "",
-                "num_frames": torch.tensor(T).unsqueeze(0),
-            }
-            # batch, batch_uc = get_batch(
-            #     get_unique_embedder_keys_from_conditioner(model.conditioner), value_dict, num_samples
-            # )
-            conditioner_keys = list(
-                set([x.input_key for x in model.conditioner.embedders])
-            )
-            batch, batch_uc = get_batch(conditioner_keys, value_dict, num_samples, T=T)
-            c, uc = model.conditioner.get_unconditional_conditioning(
-                batch,
-                batch_uc=batch_uc,
-                force_uc_zero_embeddings=force_uc_zero_embeddings,
-            )
-            for key in c:
-                if key != "crossattn":
-                    c[key], uc[key] = map(
-                        lambda y: y[key][: math.prod(num_samples)].to("cuda"), (c, uc)
-                    )
-            if args.image2video:
-                c["concat"] = uc["concat"] = image
-
-            for index in range(args.batch_size):
-                shape = (T, C, H, W) if args.image2video else (T, C, H // 8, W // 8)
-                set_random_seed(args.seed)
-                samples_z = (
-                    sample_func(c, uc=uc, batch_size=1, shape=shape)
-                    .permute(0, 2, 1, 3, 4)
-                    .contiguous()
-                )
-
-                # save_path = os.path.join(
-                #     args.output_dir, f"{cnt}_{text.replace(' ', '_').replace('/', '')[:120]}", str(index)
-                # )
-
-                save_path = os.path.join(
-                    args.output_dir, f"{current_time}-cogvideox1.5"
-                )
-                os.makedirs(save_path, exist_ok=True)
-
-                if args.only_save_latents:
-                    torch.save(
-                        samples_z / model.scale_factor,
-                        os.path.join(save_path, "latent.pt"),
-                    )
-                    with open(os.path.join(save_path, "text.txt"), "w") as f:
-                        f.write(text)
-                else:
-                    samples_x = (
-                        torch.clamp(
-                            (
-                                model.decode_first_stage(samples_z).permute(
-                                    0, 2, 1, 3, 4
-                                )
-                                + 1.0
-                            )
-                            / 2.0,
-                            0.0,
-                            1.0,
-                        )
-                        .to(torch.float32)
-                        .cpu()
-                    )
-                    if mpu.get_model_parallel_rank() == 0:
-                        save_video_as_grid_and_mp4(
-                            samples_x, save_path, fps=args.sampling_fps, key=cnt
-                        )
-
-
-if __name__ == "__main__":
-    args = getArgs()
-    main(args, model_cls=SATVideoDiffusionEngine)
diff --git a/scripts/inference_cogvideo.py b/scripts/inference_cogvideo.py
deleted file mode 100644
index 9eec5fea..00000000
--- a/scripts/inference_cogvideo.py
+++ /dev/null
@@ -1,355 +0,0 @@
-import argparse
-import json
-import os
-import sys
-import time
-from typing import List
-
-import torch
-import torchvision.transforms as transforms
-from einops import repeat
-from omegaconf import OmegaConf
-from PIL import Image
-from pytorch_lightning import seed_everything
-from tqdm import trange
-
-sys.path.insert(0, os.getcwd())
-sys.path.insert(1, f"{os.getcwd()}/src")
-from videotuna.utils.common_utils import instantiate_from_config
-from videotuna.utils.inference_utils import (
-    get_target_filelist,
-    load_model_checkpoint,
-    load_prompts_from_txt,
-    save_videos,
-    save_videos_vbench,
-)
-
-
-def get_parser():
-    parser = argparse.ArgumentParser()
-    parser.add_argument(
-        "--mode",
-        default="t2v",
-        type=str,
-        help="inference mode: t2v/i2v",
-        choices=["t2v", "i2v"],
-    )
-    #
-    parser.add_argument("--ckpt_path", type=str, default=None, help="checkpoint path")
-    parser.add_argument("--config", type=str, help="model config (yaml) path")
-    parser.add_argument(
-        "--prompt_file",
-        type=str,
-        default=None,
-        help="a text file containing many prompts for text-to-video",
-    )
-    parser.add_argument(
-        "--prompt_dir",
-        type=str,
-        default=None,
-        help="a input dir containing images and prompts for image-to-video/interpolation",
-    )
-    parser.add_argument("--savedir", type=str, default=None, help="results saving path")
-    parser.add_argument(
-        "--standard_vbench",
-        action="store_true",
-        default=False,
-        help="inference standard vbench prompts",
-    )
-    #
-    parser.add_argument("--seed", type=int, default=123, help="random seed")
-    #
-    parser.add_argument(
-        "--height", type=int, default=512, help="video height, in pixel space"
-    )
-    parser.add_argument(
-        "--width", type=int, default=512, help="video width, in pixel space"
-    )
-    parser.add_argument(
-        "--frames", type=int, default=49, help="video frame number, in pixel space"
-    )
-    parser.add_argument(
-        "--fps",
-        type=int,
-        default=24,
-        help="video motion speed. 512 or 1024 model: large value -> slow motion; 256 model: large value -> large motion;",
-    )
-    parser.add_argument(
-        "--n_samples_prompt",
-        type=int,
-        default=1,
-        help="num of samples per prompt",
-    )
-    #
-    parser.add_argument("--bs", type=int, default=1, help="batch size for inference")
-    parser.add_argument(
-        "--ddim_steps",
-        type=int,
-        default=50,
-        help="steps of ddim if positive, otherwise use DDPM",
-    )
-    parser.add_argument(
-        "--ddim_eta",
-        type=float,
-        default=1.0,
-        help="eta for ddim sampling (0.0 yields deterministic sampling)",
-    )
-    parser.add_argument(
-        "--uncond_prompt",
-        type=str,
-        default="",
-        help="unconditional prompts, or negative prompts",
-    )
-    parser.add_argument(
-        "--unconditional_guidance_scale",
-        type=float,
-        default=6.0,
-        help="prompt classifier-free guidance",
-    )
-    parser.add_argument(
-        "--unconditional_guidance_scale_temporal",
-        type=float,
-        default=None,
-        help="temporal consistency guidance",
-    )
-    # dc args
-    parser.add_argument(
-        "--multiple_cond_cfg",
-        action="store_true",
-        default=False,
-        help="i2v: use multi-condition cfg or not",
-    )
-    parser.add_argument(
-        "--cfg_img",
-        type=float,
-        default=None,
-        help="guidance scale for image conditioning",
-    )
-    parser.add_argument(
-        "--timestep_spacing",
-        type=str,
-        default="uniform",
-        help="The way the timesteps should be scaled. Refer to Table 2 of the [Common Diffusion Noise Schedules and Sample Steps are Flawed](https://huggingface.co/papers/2305.08891) for more information.",
-    )
-    parser.add_argument(
-        "--guidance_rescale",
-        type=float,
-        default=0.0,
-        help="guidance rescale in [Common Diffusion Noise Schedules and Sample Steps are Flawed](https://huggingface.co/papers/2305.08891)",
-    )
-    parser.add_argument(
-        "--loop",
-        action="store_true",
-        default=False,
-        help="generate looping videos or not",
-    )
-    parser.add_argument(
-        "--gfi",
-        action="store_true",
-        default=False,
-        help="generate generative frame interpolation (gfi) or not",
-    )
-    # lora args
-    parser.add_argument(
-        "--lorackpt",
-        type=str,
-        default=None,
-        help="[Optional] checkpoint path for lora model. ",
-    )
-    #
-    parser.add_argument("--savefps", type=str, default=10, help="video fps to generate")
-    parser.add_argument("--denoiser_precision", type=str, default="fp32", help="precision of denoiser model")
-    return parser
-
-
-def load_model(args, cuda_idx=0):
-    """
-    Create model and load weight.
-    """
-    # build model
-    config = OmegaConf.load(args.config)
-    model_config = config.pop("model", OmegaConf.create())
-    if args.lorackpt is not None:
-        model_config["params"]["lora_args"] = {"lora_ckpt": args.lorackpt}
-    
-    model_config["params"]["denoiser_config"]["params"]["load_dtype"] = args.denoiser_precision
-    model = instantiate_from_config(model_config)
-    model = model.cuda(cuda_idx)
-    # load weights
-    skip_loading_weight = hasattr(model_config, "skip_loading_weight") and model_config.skip_loading_weight
-    if not skip_loading_weight:
-        assert os.path.exists(
-            args.ckpt_path
-        ), f"Error: checkpoint [{args.ckpt_path}] Not Found!"
-        model = load_model_checkpoint(model, args.ckpt_path)
-    # load lora weights
-    if hasattr(model, "lora_args") and len(model.lora_args) != 0:
-        model.inject_lora()
-    model.eval()
-    return model
-
-
-def load_inputs(args):
-    """
-    load inputs:
-        t2v: prompts
-        i2v: prompts + images
-    """
-    assert (
-        args.prompt_file is not None or args.prompt_dir is not None
-    ), "Error: input file/dir NOT Found!"
-
-    if args.prompt_file is not None:
-        assert os.path.exists(args.prompt_file)
-        # load inputs for t2v
-        prompt_list = load_prompts_from_txt(args.prompt_file)
-        num_prompts = len(prompt_list)
-        filename_list = [f"prompt-{idx+1:04d}" for idx in range(num_prompts)]
-        image_list = None
-    elif args.prompt_dir is not None:
-        assert os.path.exists(args.prompt_dir)
-        # load inputs for i2v
-        filename_list, image_list, prompt_list = load_inputs_i2v(
-            args.prompt_dir,
-            video_size=(args.height, args.width),
-            video_frames=args.frames,
-        )
-    return prompt_list, image_list, filename_list
-
-
-def load_inputs_i2v(input_dir, video_size=(480, 720), video_frames=49):
-    """
-    Load prompt list and conditional images for i2v from input_dir.
-    """
-    # load prompt files
-    prompt_files = get_target_filelist(input_dir, ext="txt")
-    if len(prompt_files) > 1:
-        # only use the first one (sorted by name) if multiple exist
-        print(
-            f"Warning: multiple prompt files exist. The one {os.path.split(prompt_files[0])[1]} is used."
-        )
-        prompt_file = prompt_files[0]
-    elif len(prompt_files) == 1:
-        prompt_file = prompt_files[0]
-    elif len(prompt_files) == 0:
-        print(prompt_files)
-        raise ValueError(f"Error: found NO prompt file in {input_dir}")
-    prompt_list = load_prompts_from_txt(prompt_file)
-    n_samples = len(prompt_list)
-
-    ## load images
-    img_paths = get_target_filelist(input_dir, ext="png,jpg,webp,jpeg")
-    print(f"Found {n_samples} prompts and {len(img_paths)} images in {input_dir}")
-    # image transforms
-    transform = transforms.Compose(
-        [
-            transforms.Resize(min(video_size)),
-            transforms.CenterCrop(video_size),
-            transforms.ToTensor(),
-            transforms.Normalize(mean=(0.5, 0.5, 0.5), std=(0.5, 0.5, 0.5)),
-        ]
-    )
-
-    image_list = []
-    filename_list = []
-    for idx in range(n_samples):
-        image = Image.open(img_paths[idx]).convert("RGB")
-        # image_tensor = transform(image).unsqueeze(0) # [c,h,w]
-        # frame_tensor = repeat(image_tensor, 'c t h w -> c (repeat t) h w', repeat=video_frames)
-        image_list.append(image)
-
-        _, filename = os.path.split(img_paths[idx])
-        filename_list.append(filename.split(".")[0])
-
-    return filename_list, image_list, prompt_list
-
-
-def run_inference_cogvideo(args, gpu_num=1, rank=0, **kwargs):
-    """
-    ideally this should be merged to run_inference()
-
-    """
-    assert (args.height % 16 == 0) and (
-        args.width % 16 == 0
-    ), "Error: image size [h,w] should be multiples of 16!"
-
-    seed_everything(args.seed)
-    os.makedirs(args.savedir, exist_ok=True)
-    prompt_list, image_list, filename_list = load_inputs(args)
-    print(args)
-    model = load_model(args)
-    # split across multiple gpus
-    num_samples = len(prompt_list)
-    num_samples_rank = num_samples // gpu_num
-    remainder = num_samples % gpu_num
-    indices_rank = list(range(num_samples_rank * rank, num_samples_rank * (rank + 1)))
-    if rank == 0 and remainder != 0:
-        indices_rank = indices_rank + list(range(num_samples - remainder, num_samples))
-    #
-    prompt_list_rank = [prompt_list[i] for i in indices_rank]
-    filename_list_rank = [filename_list[i] for i in indices_rank]
-    if args.mode == "i2v":
-        image_list_rank = [image_list[i] for i in indices_rank]
-
-    # -----------------------------------------------------------------
-    # inference
-    format_file = {}
-    start = time.time()
-    n_iters = len(prompt_list_rank) // args.bs + (
-        1 if len(prompt_list_rank) % args.bs else 0
-    )
-    with torch.no_grad():
-        for idx in trange(0, n_iters, desc="Sample Iters"):
-            # split batch
-            prompts = prompt_list_rank[idx * args.bs : (idx + 1) * args.bs]
-            filenames = filename_list_rank[idx * args.bs : (idx + 1) * args.bs]
-            if args.mode == "i2v":
-                images = image_list_rank[idx * args.bs : (idx + 1) * args.bs]
-
-            ## inference
-            bs = args.bs if args.bs == len(prompts) else len(prompts)
-            if args.mode == "t2v":
-                batch_samples = model.sample(
-                    prompts,
-                    None,
-                    height=args.height,
-                    width=args.width,
-                    num_frames=49,
-                    num_videos_per_prompt=args.n_samples_prompt,
-                    guidance_scale=args.unconditional_guidance_scale,
-                )
-            elif args.mode == "i2v":
-                batch_samples = model.sample(
-                    images,
-                    prompts,
-                    None,
-                    height=args.height,
-                    width=args.width,
-                    num_frames=49,
-                    num_inference_steps=args.ddim_steps,
-                    num_videos_per_prompt=args.n_samples_prompt,
-                    guidance_scale=args.unconditional_guidance_scale,
-                )
-            else:
-                raise ValueError
-
-            if args.standard_vbench:
-                save_videos_vbench(
-                    batch_samples, args.savedir, prompts, format_file, fps=args.savefps
-                )
-                print("test")
-            else:
-                save_videos(batch_samples, args.savedir, filenames, fps=args.savefps)
-
-    if args.standard_vbench:
-        with open(os.path.join(args.savedir, "info.json"), "w") as f:
-            json.dump(format_file, f)
-
-    print(f"Saved in {args.savedir}. Time used: {(time.time() - start):.2f} seconds")
-
-
-if __name__ == "__main__":
-
-    args = get_parser().parse_args()
-    # run_inference(args)
-    run_inference_cogvideo(args)
diff --git a/scripts/inference_ddp.py b/scripts/inference_ddp.py
deleted file mode 100644
index ef882b50..00000000
--- a/scripts/inference_ddp.py
+++ /dev/null
@@ -1,48 +0,0 @@
-import argparse
-import datetime
-import importlib
-
-import torch
-import torch.distributed as dist
-from pytorch_lightning import seed_everything
-
-
-def setup_dist(local_rank):
-    if dist.is_initialized():
-        return
-    torch.cuda.set_device(local_rank)
-    torch.distributed.init_process_group("nccl", init_method="env://")
-
-
-def get_dist_info():
-    if dist.is_available():
-        initialized = dist.is_initialized()
-    else:
-        initialized = False
-    if initialized:
-        rank = dist.get_rank()
-        world_size = dist.get_world_size()
-    else:
-        rank = 0
-        world_size = 1
-    return rank, world_size
-
-
-if __name__ == "__main__":
-    now = datetime.datetime.now().strftime("%Y-%m-%d-%H-%M-%S")
-    parser = argparse.ArgumentParser()
-    parser.add_argument("--module", type=str, help="module name", default="inference")
-    parser.add_argument("--local_rank", type=int, nargs="?", help="for ddp", default=0)
-    args, unknown = parser.parse_known_args()
-    inference_api = importlib.import_module(args.module, package=None)
-
-    inference_parser = inference_api.get_parser()
-    inference_args, unknown = inference_parser.parse_known_args()
-
-    seed_everything(inference_args.seed)
-    setup_dist(args.local_rank)
-    torch.backends.cudnn.benchmark = True
-    rank, gpu_num = get_dist_info()
-
-    print("@CoLVDM Inference [rank%d]: %s" % (rank, now))
-    inference_api.run_inference(inference_args, gpu_num, rank)
diff --git a/scripts/inference_flux.py b/scripts/inference_flux.py
deleted file mode 100644
index aa443165..00000000
--- a/scripts/inference_flux.py
+++ /dev/null
@@ -1,94 +0,0 @@
-import argparse
-import os
-
-import torch
-from diffusers import FluxPipeline
-
-from videotuna.utils.inference_utils import load_prompts_from_txt
-from videotuna.utils.common_utils import monitor_resources, save_metrics
-
-def inference(args):
-    if args.model_type == "dev":
-        pipe = FluxPipeline.from_pretrained(
-            "black-forest-labs/FLUX.1-dev", torch_dtype=torch.bfloat16
-        )
-    elif args.model_type == "schnell":
-        pipe = FluxPipeline.from_pretrained(
-            "black-forest-labs/FLUX.1-schnell", torch_dtype=torch.bfloat16
-        )
-    else:
-        raise ValueError("model_type must be either 'dev' or 'schnell'")
-
-    if args.enable_sequential_cpu_offload:
-        pipe.enable_sequential_cpu_offload()
-    elif args.enable_model_cpu_offload:
-        pipe.enable_model_cpu_offload()
-    else:
-        pipe.to("cuda")
-
-    if args.enable_vae_slicing:
-        pipe.vae.enable_slicing()
-    if args.enable_vae_tiling:
-        pipe.vae.enable_tiling()
-    pipe.to(torch.float16)
-    if args.prompt.endswith(".txt"):
-        # model_input is a file for t2i
-        prompts = load_prompts_from_txt(prompt_file=args.prompt)
-        os.makedirs(args.out_path, exist_ok=True)
-        out_paths = [
-            os.path.join(args.out_path, f"{i:05d}_{prompts[i]}.jpg")
-            for i in range(len(prompts))
-        ]
-    else:
-        prompts = [prompt]
-        out_paths = [args.out_path]
-    gpu_metrics = []
-    time_metrics = []
-    for prompt, out_path in zip(prompts, out_paths):
-        result_with_metrics = generate(args, pipe, prompt)
-        out = result_with_metrics['result']
-        gpu_metrics.append(result_with_metrics.get('gpu', -1.0))
-        time_metrics.append(result_with_metrics.get('time', -1.0))
-        out.save(out_path)
-    save_metrics(gpu=gpu_metrics, time=time_metrics, config=args, savedir=args.out_path)
-
-@monitor_resources(return_metrics=True)
-def generate(args, pipe, prompt):
-    out = pipe(
-            prompt=prompt,
-            guidance_scale=args.guidance_scale,
-            height=args.height,
-            width=args.width,
-            num_inference_steps=args.num_inference_steps,
-            max_sequence_length=256,
-        ).images[0]
-    return out
-
-
-if __name__ == "__main__":
-    parser = argparse.ArgumentParser()
-    parser.add_argument(
-        "--model_type", type=str, default="dev", choices=["dev", "schnell"]
-    )
-    parser.add_argument(
-        "--prompt", type=str, default="A cat holding a sign that says hello world"
-    )
-    parser.add_argument("--out_path", type=str, default="./image.png")
-    parser.add_argument("--width", type=int, default=1360)
-    parser.add_argument("--height", type=int, default=768)
-    parser.add_argument("--num_inference_steps", type=int, default=4)
-    parser.add_argument("--guidance_scale", type=float, default=0.0)
-    parser.add_argument(
-        "--enable_vae_tiling", action="store_true", help="enable vae tiling"
-    )
-    parser.add_argument(
-        "--enable_vae_slicing", action="store_true", help="enable vae slicing"
-    )
-    parser.add_argument(
-        "--enable_sequential_cpu_offload", action="store_true", help="enable sequential cpu offload"
-    )
-    parser.add_argument(
-        "--enable_model_cpu_offload", action="store_true", help="enable model cpu offload"
-    )
-    args = parser.parse_args()
-    inference(args)
diff --git a/scripts/inference_flux_lora.py b/scripts/inference_flux_lora.py
deleted file mode 100644
index 0ba28216..00000000
--- a/scripts/inference_flux_lora.py
+++ /dev/null
@@ -1,69 +0,0 @@
-import argparse
-import os
-
-import torch
-from diffusers import FluxPipeline
-from videotuna.utils.inference_utils import load_prompts_from_txt
-
-
-def inference(args):
-    if args.model_type == "dev":
-        pipe = FluxPipeline.from_pretrained(
-            "black-forest-labs/FLUX.1-dev", torch_dtype=torch.bfloat16
-        )
-    else:
-        raise ValueError("model_type must be either 'dev'.")
-
-    # load lora weights
-    if args.lora_path is not None:
-        pipe.load_lora_weights(args.lora_path)
-        print("Load lora weights.")
-    else:
-        print("No lora weights.")
-
-    pipe.enable_sequential_cpu_offload()
-    pipe.vae.enable_slicing()
-    pipe.vae.enable_tiling()
-    pipe.to(torch.float16)
-
-    # prompt preprocessing
-    if args.prompt.endswith(".txt"):
-        # model_input is a file for t2i
-        prompts = load_prompts_from_txt(prompt_file=args.prompt)
-        os.makedirs(args.out_path, exist_ok=True)
-        out_paths = [
-            os.path.join(args.out_path, f"{i:05d}_{prompts[i]}.jpg")
-            for i in range(len(prompts))
-        ]
-    else:
-        prompts = [args.prompt]
-        out_paths = [args.out_path]
-
-    for prompt, out_path in zip(prompts, out_paths):
-        out = pipe(
-            prompt=prompt,
-            guidance_scale=args.guidance_scale,
-            height=args.height,
-            width=args.width,
-            num_inference_steps=args.num_inference_steps,
-            max_sequence_length=256,
-            generator=torch.Generator().manual_seed(args.seed)
-        ).images[0]
-        out.save(out_path)
-
-
-if __name__ == "__main__":
-    parser = argparse.ArgumentParser()
-    parser.add_argument(
-        "--model_type", type=str, default="dev", choices=["dev", "schnell"]
-    )
-    parser.add_argument("--prompt", type=str, default="A photo of a cat", help="Inference prompt, string or path to a .txt file")
-    parser.add_argument("--out_path", type=str, default="./results/t2i/image.png")
-    parser.add_argument("--lora_path", type=str, default=None, help="Full path to lora weights")
-    parser.add_argument("--width", type=int, default=1360)
-    parser.add_argument("--height", type=int, default=768)
-    parser.add_argument("--num_inference_steps", type=int, default=4)
-    parser.add_argument("--guidance_scale", type=float, default=0.0)
-    parser.add_argument("--seed", type=int, default=42)
-    args = parser.parse_args()
-    inference(args)
diff --git a/scripts/inference_mochi.py b/scripts/inference_mochi.py
deleted file mode 100644
index e328e08b..00000000
--- a/scripts/inference_mochi.py
+++ /dev/null
@@ -1,42 +0,0 @@
-import argparse
-import os
-
-import torch
-from diffusers import MochiPipeline
-from diffusers.utils import export_to_video
-
-# create arg parser
-parser = argparse.ArgumentParser()
-parser.add_argument("--ckpt_path", type=str, default="genmo/mochi-1-preview")
-parser.add_argument("--prompt_file", type=str, default="inputs/t2v/prompts.txt")
-parser.add_argument("--savedir", type=str, default="results/t2v/")
-parser.add_argument("--height", type=int, default=480)
-parser.add_argument("--width", type=int, default=848)
-parser.add_argument("--bs", type=int, default=1)
-parser.add_argument("--fps", type=int, default=28)
-parser.add_argument("--seed", type=int, default=123)
-
-args = parser.parse_args()
-
-os.makedirs(args.savedir, exist_ok=True)
-
-pipe = MochiPipeline.from_pretrained(
-    "genmo/mochi-1-preview", variant="bf16", torch_dtype=torch.bfloat16
-)
-# Enable memory savings
-pipe.enable_model_cpu_offload()
-pipe.enable_vae_tiling()
-
-# there are many prompts in the prompt_file, we need to read them all
-with open(args.prompt_file, "r") as file:
-    prompts = file.readlines()
-
-# set seed
-torch.manual_seed(args.seed)
-
-for index, prompt in enumerate(prompts):
-
-    with torch.autocast("cuda", torch.bfloat16, cache_enabled=False):
-        frames = pipe(prompt, num_frames=84).frames[0]
-
-    export_to_video(frames, f"{args.savedir}/mochi_{index}.mp4", fps=30)
diff --git a/scripts/inference_new.py b/scripts/inference_new.py
index b180c776..479d85ea 100644
--- a/scripts/inference_new.py
+++ b/scripts/inference_new.py
@@ -1,233 +1,183 @@
-import argparse
-import json
-import os
-import sys
-import time
-from functools import partial
 from pathlib import Path
+from typing import cast
 
-import numpy as np
-import torch
-from einops import rearrange, repeat
-from omegaconf import OmegaConf
+from loguru import logger
+from omegaconf import DictConfig, OmegaConf
 from pytorch_lightning import seed_everything
-from tqdm import tqdm, trange
 
-sys.path.insert(0, os.getcwd())
-sys.path.insert(1, f"{os.getcwd()}/src")
-
-from videotuna.utils.args_utils import prepare_inference_args
-from videotuna.utils.common_utils import instantiate_from_config
 from videotuna.base.generation_base import GenerationBase
-from videotuna.utils.common_utils import monitor_resources
-
-def get_parser():
-    parser = argparse.ArgumentParser()
-    parser.add_argument(
-        "--mode",
-        default=None,
-        type=str,
-        help="inference mode: t2v/i2v",
-    )
-    #
-    parser.add_argument("--ckpt_path", type=str, default=None, help="checkpoint path")
-    parser.add_argument(
-        "--lorackpt",
-        type=str,
-        default=None,
-        help="[Optional] checkpoint path for lora model. ",
-    )
-    parser.add_argument(
-        "--trained_ckpt", type=str, default=None, help="denoiser full checkpoint"
-    )
-    parser.add_argument("--config", type=str, default=None, help="model config (yaml) path")
-    parser.add_argument(
-        "--prompt_file",
-        type=str,
-        default=None,
-        help="a text file containing many prompts for text-to-video",
-    )
-    parser.add_argument(
-        "--prompt_dir",
-        type=str,
-        default=None,
-        help="a input dir containing images and prompts for image-to-video/interpolation",
-    )
-    parser.add_argument("--savedir", type=str, default=None, help="results saving path")
-    parser.add_argument(
-        "--standard_vbench",
-        action="store_true",
-        default=None,
-        help="inference standard vbench prompts",
-    )
-    #
-    parser.add_argument("--seed", type=int, default=None, help="random seed")
-    #
-    parser.add_argument(
-        "--height", type=int, default=None, help="video height, in pixel space"
-    )
-    parser.add_argument(
-        "--width", type=int, default=None, help="video width, in pixel space"
-    )
-    parser.add_argument(
-        "--frames", type=int, default=None, help="video frame number, in pixel space"
-    )
-    parser.add_argument(
-        "--fps",
-        type=int,
-        default=None,
-        help="video motion speed. 512 or 1024 model: large value -> slow motion; 256 model: large value -> large motion;",
-    )
-    parser.add_argument(
-        "--n_samples_prompt",
-        type=int,
-        default=None,
-        help="num of samples per prompt",
-    )
-    #
-    parser.add_argument("--bs", type=int, default=None, help="batch size for inference")
-    parser.add_argument(
-        "--ddim_steps",
-        type=int,
-        default=None,
-        help="steps of ddim if positive, otherwise use DDPM",
-    )
-    parser.add_argument(
-        "--ddim_eta",
-        type=float,
-        default=None,
-        help="eta for ddim sampling (0.0 yields deterministic sampling)",
-    )
-    parser.add_argument(
-        "--uncond_prompt",
-        type=str,
-        default=None,
-        help="unconditional prompts, or negative prompts",
-    )
-    parser.add_argument(
-        "--unconditional_guidance_scale",
-        type=float,
-        default=None,
-        help="prompt classifier-free guidance",
-    )
-    parser.add_argument(
-        "--unconditional_guidance_scale_temporal",
-        type=float,
-        default=None,
-        help="temporal consistency guidance",
-    )
-    # dc args
-    parser.add_argument(
-        "--multiple_cond_cfg",
-        action="store_true",
-        default=None,
-        help="i2v: use multi-condition cfg or not",
-    )
-    parser.add_argument(
-        "--cfg_img",
-        type=float,
-        default=None,
-        help="guidance scale for image conditioning",
-    )
-    parser.add_argument(
-        "--timestep_spacing",
-        type=str,
-        default=None,
-        help="The way the timesteps should be scaled. Refer to Table 2 of the [Common Diffusion Noise Schedules and Sample Steps are Flawed](https://huggingface.co/papers/2305.08891) for more information.",
-    )
-    parser.add_argument(
-        "--guidance_rescale",
-        type=float,
-        default=None,
-        help="guidance rescale in [Common Diffusion Noise Schedules and Sample Steps are Flawed](https://huggingface.co/papers/2305.08891)",
-    )
-    parser.add_argument(
-        "--loop",
-        action="store_true",
-        default=None,
-        help="generate looping videos or not",
-    )
-    parser.add_argument(
-        "--gfi",
-        action="store_true",
-        default=None,
-        help="generate generative frame interpolation (gfi) or not",
-    )
-    parser.add_argument("--savefps", type=str, default=None, help="video fps to generate")
-    parser.add_argument(
-        "--time_shift", 
-        type=float, 
-        default=None, 
-        help="time shift",
-    )
-    parser.add_argument(
-        "--num_inference_steps", 
-        type=int, 
-        default=None, 
-        help="sampling steps",
-    )
-    parser.add_argument(
-        "--dit_weight", 
-        type=str, 
-        default=None, 
-        help="hunyuan dit weight",
-    )
-    parser.add_argument(
-        "--i2v_resolution", 
-        type=str, 
-        default=None, 
-        help="target resolution",
-    )
-    parser.add_argument(
-        "--enable_model_cpu_offload", 
-        action="store_true",
-        help="model cpu offload",
-    )
-    parser.add_argument(
-        "--enable_sequential_cpu_offload", 
-        action="store_true",
-        help="seqeuential cpu offload",
-    )
-    parser.add_argument(
-        "--enable_vae_tiling", 
-        action="store_true",
-        help="vae tiling",
-    )
-    parser.add_argument(
-        "--enable_vae_slicing", 
-        action="store_true",
-        help="vae slicing",
-    )
-    return parser
+from videotuna.settings import get_settings
+from videotuna.utils.args_utils import prepare_inference_args
+from videotuna.utils.attention import (
+    get_attn_backend_requested,
+    get_resolved_attn_backend,
+    get_torch_compile_mode,
+)
+from videotuna.utils.common_utils import (
+    instantiate_from_config,
+    monitor_resources,
+    save_metrics,
+)
+from videotuna.utils.device_utils import (
+    checkpoint_available,
+    describe_compute_environment,
+    log_startup_device_summary,
+    require_accelerator_for_flow,
+    require_min_vram,
+    resolve_cpu_mode,
+    resolve_inference_device,
+    snapshot_nvidia_smi,
+)
+from videotuna.utils.diffusers_optimizations import apply_flow_memory_config
+from videotuna.utils.diffusers_quantization import (
+    maybe_adjust_offload_for_quant,
+    validate_transformer_quant,
+)
+from videotuna.utils.inference_cli import apply_compile_env, apply_cpu_smoke_limits
+from videotuna.utils.inference_profile import resolve_inference_profile
 
 
 def run_inference(args, gpu_num=1, rank=0, **kwargs):
     """
     Inference t2v/i2v models
     """
+    try:
+        _run_inference_impl(args, gpu_num=gpu_num, rank=rank, **kwargs)
+    except RuntimeError as exc:
+        smi = snapshot_nvidia_smi()
+        if smi:
+            logger.error("nvidia-smi snapshot:\n{}", smi)
+        raise exc
+
+
+def _prepare_inference_quant(
+    args,
+    inference_config,
+) -> None:
+    """Validate transformer quant settings before model load."""
+    has_lora = bool(
+        getattr(inference_config, "trained_ckpt", None)
+        or getattr(inference_config, "lorackpt", None)
+    )
+    profile = resolve_inference_profile(inference_config, apply_preset=False)
+    transformer_quant = validate_transformer_quant(
+        transformer_quant=getattr(inference_config, "transformer_quant", None),
+        quant_backend=getattr(inference_config, "quant_backend", None),
+        offload_mode=profile.offload_mode,
+        compile_enabled=bool(getattr(args, "compile", False)),
+        has_lora=has_lora,
+    )
+    if transformer_quant != "none":
+        maybe_adjust_offload_for_quant(inference_config, transformer_quant)
+
+
+def _run_inference_impl(args, gpu_num=1, rank=0, **kwargs):
     # load and replace inference args with user agrgument
     assert Path(args.config).exists(), f"Error: config file {args.config} NOT Found!"
     config = OmegaConf.load(args.config)
+    if not isinstance(config, DictConfig):
+        raise TypeError(f"Expected YAML mapping config, got {type(config).__name__}")
     config = prepare_inference_args(args, config)
-    
-    inference_config = config.pop("inference", OmegaConf.create(flags={"allow_objects": True}))
+
+    inference_config = config.pop(
+        "inference", OmegaConf.create(flags={"allow_objects": True})
+    )
     seed_everything(inference_config.seed)
 
-    # 1. create flow
-    # 1.1 init class on meta
-    # 1.2 load weight to cpu
-    # 1.3 vram management (default to cuda)
     flow_config = config.pop("flow", OmegaConf.create(flags={"allow_objects": True}))
-    flow : GenerationBase = instantiate_from_config(flow_config, resolve=True)
-    flow.from_pretrained(inference_config.ckpt_path, inference_config.trained_ckpt, inference_config.lorackpt)
+    flow_target = flow_config.get("target", "")
+    flow_params = flow_config.get("params", OmegaConf.create())
+
+    cpu_mode = resolve_cpu_mode(cli_smoke=bool(getattr(args, "cpu_smoke", False)))
+    if cpu_mode == "smoke":
+        apply_cpu_smoke_limits(inference_config, flow_config)
+
+    device_prefer = getattr(inference_config, "device", None) or getattr(
+        args, "device", None
+    )
+    if device_prefer is None and cpu_mode in ("smoke", "force"):
+        device_prefer = "cpu"
+    device = resolve_inference_device(device_prefer)
+    inference_config.device = str(device)
+
+    logger.info("Compute environment: {}", describe_compute_environment())
+
+    apply_compile_env(bool(getattr(args, "compile", False)))
+    _prepare_inference_quant(args, inference_config)
+
+    require_accelerator_for_flow(
+        flow_target,
+        cpu_mode=cpu_mode,
+        min_vram_gb=getattr(inference_config, "min_vram_gb", None),
+        model_family=OmegaConf.select(flow_params, "model_family"),
+        model_variant=OmegaConf.select(flow_params, "model_variant"),
+        height=getattr(inference_config, "height", None),
+        width=getattr(inference_config, "width", None),
+        frames=getattr(inference_config, "frames", None),
+    )
+
+    min_vram = getattr(inference_config, "min_vram_gb", None)
+    if min_vram is not None:
+        require_min_vram(
+            float(min_vram),
+            device=device,
+            context=f"Flow: {flow_target}",
+        )
+
+    profile = resolve_inference_profile(inference_config, apply_preset=False)
+    log_startup_device_summary(
+        device,
+        profile.dtype,
+        get_resolved_attn_backend(),
+        profile.offload_mode,
+        attn_backend_requested=get_attn_backend_requested(),
+        memory_preset=profile.memory_preset,
+        compile_enabled=get_settings().torch_compile,
+        compile_mode=get_torch_compile_mode(),
+    )
+
+    ckpt_path = getattr(inference_config, "ckpt_path", None)
+    if ckpt_path and not checkpoint_available(ckpt_path, flow_target=flow_target):
+        raise FileNotFoundError(
+            f"Checkpoint path not found: {ckpt_path}\n"
+            "Download model weights into checkpoints/ or pass a Hugging Face model id "
+            "(org/model). See docs/checkpoints.md for setup."
+        )
+
+    # 1. create flow
+    flow = cast(GenerationBase, instantiate_from_config(flow_config, resolve=True))
+    flow.from_pretrained(
+        inference_config.ckpt_path,
+        inference_config.trained_ckpt,
+        inference_config.lorackpt,
+        device=str(device),
+    )
+    apply_flow_memory_config(flow, inference_config)
     flow.enable_vram_management()
     flow.eval()
 
     # 2. flow inference
-    decorated_inference = monitor_resources(return_metrics=True)(flow.inference)
-    metrics = decorated_inference(inference_config) 
+    num_frames = int(getattr(inference_config, "frames", 1) or 1)
+    device_index = (
+        device.index if device.type == "cuda" and device.index is not None else 0
+    )
+    decorated_inference = monitor_resources(
+        frames=num_frames,
+        return_metrics=True,
+        inference_config=inference_config,
+        device_index=device_index,
+    )(flow.inference)
+    metrics = decorated_inference(inference_config)
+    if metrics and inference_config.savedir:
+        if get_settings().metrics_owner == "script":
+            save_metrics(
+                metrics=metrics,
+                savedir=inference_config.savedir,
+                config=inference_config,
+            )
 
 
 if __name__ == "__main__":
-    args = get_parser().parse_args()
-    run_inference(args)
+    from videotuna.cli.inference_app import generic_inference_entry
+
+    generic_inference_entry()
diff --git a/scripts/inference_v2v_ms.py b/scripts/inference_v2v_ms.py
deleted file mode 100644
index 8a4c35b2..00000000
--- a/scripts/inference_v2v_ms.py
+++ /dev/null
@@ -1,69 +0,0 @@
-import argparse
-import os
-import sys
-
-sys.path.insert(0, os.getcwd())
-
-from modelscope.models import Model
-from modelscope.outputs import OutputKeys
-from modelscope.pipelines import pipeline
-from pydantic import Field
-from pydantic_core import ValidationError
-from pydantic_settings import BaseSettings, CliApp, SettingsConfigDict, SettingsError
-
-from videotuna.utils.inference_utils import load_inputs_v2v
-
-
-class Settings(BaseSettings, cli_parse_args=True, cli_prog_name="inference_v2v_ms"):
-    ckpt_path: str = Field(
-        "checkpoints/Video-to-Video", description="Checkpoint path of the model"
-    )
-    input_dir: str = Field(
-        ...,
-        description="A input directory containing videos and prompts for video-to-video enhancement",
-    )
-    output_dir: str = Field(..., description="Results saving directory")
-
-
-def inference_v2v_ms(settings: Settings):
-    # prepare arguments, model, pipeline.
-    model = Model.from_pretrained(settings.ckpt_path)
-    pipe = pipeline(
-        task="video-to-video", model=model, model_revision="v1.1.0", device="cuda:0"
-    )
-    print(f"Successfully loaded model from {settings.ckpt_path}")
-
-    os.makedirs(settings.output_dir, exist_ok=True)
-
-    # load input prompts, video paths, video filenames
-    prompt_list, video_filepaths, video_filenames = load_inputs_v2v(
-        input_dir=settings.input_dir
-    )
-
-    # video-to-video enhancement
-    for i, (prompt, video_filepath, video_filename) in enumerate(
-        zip(prompt_list, video_filepaths, video_filenames)
-    ):
-        print(f"[{i}:03d] input path: {video_filepath}")
-        print(f"[{i}:03d] input name: {video_filename}")
-        print(f"[{i}:03d] prompt: {prompt}")
-        p_input = {"video_path": video_filepath, "text": prompt}
-        output_video_path = pipe(
-            p_input, output_video=os.path.join(settings.output_dir, video_filename)
-        )[OutputKeys.OUTPUT_VIDEO]
-        print(
-            f"Successfully processed {video_filename} and saved to {output_video_path}"
-        )
-
-
-if __name__ == "__main__":
-    try:
-        settings = CliApp.run(
-            Settings,
-        )
-        inference_v2v_ms(settings)
-    except SystemExit as e:
-        print(e)
-    except ValidationError as e:
-        print(e)
-        print("Use --help for more info")
diff --git a/scripts/train.py b/scripts/train.py
deleted file mode 100644
index de7b5ddd..00000000
--- a/scripts/train.py
+++ /dev/null
@@ -1,291 +0,0 @@
-import argparse
-import datetime
-import os
-import sys
-
-import pytorch_lightning as pl
-import torch
-from omegaconf import OmegaConf
-from pytorch_lightning import Trainer, seed_everything
-from pytorch_lightning.cli import LightningCLI
-from deepspeed.utils.zero_to_fp32 import get_fp32_state_dict_from_zero_checkpoint
-from transformers import logging as transf_logging
-
-sys.path.insert(0, os.getcwd())
-from videotuna.utils.common_utils import instantiate_from_config
-from videotuna.utils.lightning_utils import add_trainer_args_to_parser
-from videotuna.utils.train_utils import (
-    check_config_attribute,
-    get_autoresume_path,
-    get_empty_params_comparedwith_sd,
-    get_trainer_callbacks,
-    get_trainer_logger,
-    get_trainer_strategy,
-    init_workspace,
-    load_checkpoints,
-    set_logger,
-)
-
-
-def get_parser(**parser_kwargs):
-    parser = argparse.ArgumentParser(**parser_kwargs)
-    parser.add_argument(
-        "--seed", "-s", type=int, default=20230211, help="seed for seed_everything"
-    )
-    parser.add_argument(
-        "--name", "-n", type=str, default="", help="experiment name, as saving folder"
-    )
-
-    parser.add_argument(
-        "--base",
-        "-b",
-        nargs="*",
-        metavar="base_config.yaml",
-        help="paths to base configs. Loaded from left-to-right. "
-        "Parameters can be overwritten or added with command-line options of the form `--key value`.",
-        default=list(),
-    )
-
-    parser.add_argument(
-        "--train", "-t", action="store_true", default=False, help="train"
-    )
-    parser.add_argument("--val", "-v", action="store_true", default=False, help="val")
-    parser.add_argument("--test", action="store_true", default=False, help="test")
-
-    parser.add_argument(
-        "--logdir",
-        "-l",
-        type=str,
-        default="logs",
-        help="directory for logging dat shit",
-    )
-    parser.add_argument(
-        "--auto_resume",
-        action="store_true",
-        default=False,
-        help="resume from full-info checkpoint",
-    )
-    parser.add_argument(
-        "--debug",
-        "-d",
-        action="store_true",
-        default=False,
-        help="enable post-mortem debugging",
-    )
-
-    parser.add_argument(
-        "--sdckpt",
-        type=str,
-        default=None,
-        help="pretrained stable diffusion checkpoint",
-    )
-    parser.add_argument(
-        "--ckpt", type=str, default=None, help="pretrained current model checkpoint"
-    )
-    parser.add_argument(
-        "--lorackpt", type=str, default=None, help="pretrained current model checkpoint"
-    )
-    return parser
-
-
-def get_nondefault_trainer_args(args):
-    parser = argparse.ArgumentParser()
-    parser = add_trainer_args_to_parser(Trainer, parser)
-
-    default_trainer_args = parser.parse_args([])
-    return sorted(
-        k
-        for k in vars(default_trainer_args)
-        if getattr(args, k) != getattr(default_trainer_args, k)
-    )
-
-
-if __name__ == "__main__":
-    now = datetime.datetime.now().strftime("%Y-%m-%dT%H-%M-%S")
-    try:
-        local_rank = int(os.environ.get("LOCAL_RANK"))
-        global_rank = int(os.environ.get("RANK"))
-        num_rank = int(os.environ.get("WORLD_SIZE"))
-    except:
-        local_rank, global_rank, num_rank = 0, 0, 1
-
-    parser = get_parser()
-    ## Extends existing argparse by default Trainer attributes
-    parser = add_trainer_args_to_parser(Trainer, parser)
-
-    parser = add_trainer_args_to_parser(Trainer, parser)
-
-    args, unknown = parser.parse_known_args()
-    ## disable transformer warning
-    transf_logging.set_verbosity_error()
-    seed_everything(args.seed)
-
-    ## yaml configs: "model" | "data" | "lightning"
-    configs = [OmegaConf.load(cfg) for cfg in args.base]
-    cli = OmegaConf.from_dotlist(unknown)
-    config = OmegaConf.merge(*configs, cli)
-
-    if args.sdckpt is not None:
-        config["model"]["sd_checkpoint"] = args.sdckpt
-    if args.ckpt is not None:
-        config["model"]["pretrained_checkpoint"] = args.ckpt
-    if args.lorackpt is not None:
-        config["model"]["params"]["lora_args"]["lora_ckpt"] = args.lorackpt
-
-    lightning_config = config.pop("lightning", OmegaConf.create())
-    trainer_config = lightning_config.get("trainer", OmegaConf.create())
-
-    ## setup workspace directories
-    workdir, ckptdir, cfgdir, loginfo = init_workspace(
-        args.name, args.logdir, config, lightning_config, global_rank
-    )
-    logger = set_logger(
-        logfile=os.path.join(loginfo, "log_%d:%s.txt" % (global_rank, now))
-    )
-    logger.info("@lightning version: %s [>=2.0 required]" % pl.__version__)
-
-    ## MODEL CONFIG >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
-    logger.info("***** Configuring Model *****")
-    config.model.params.logdir = workdir
-
-    model = instantiate_from_config(config.model)
-    if args.auto_resume:
-        ## the saved checkpoint must be: full-info checkpoint
-        resume_ckpt_path = get_autoresume_path(workdir)
-        if resume_ckpt_path is not None:
-            args.resume_from_checkpoint = resume_ckpt_path
-            logger.info("Resuming from checkpoint: %s" % args.resume_from_checkpoint)
-            ## just in case train empy parameters only
-            if check_config_attribute(config.model.params, "empty_params_only"):
-                _, model.empty_paras = get_empty_params_comparedwith_sd(
-                    model, config.model
-                )
-        else:
-            model = load_checkpoints(model, config.model)
-            logger.warning("Auto-resuming skipped as No checkpoint found!")
-    else:
-        model = load_checkpoints(model, config.model)
-    if len(model.lora_args) != 0:
-        model.inject_lora()
-    ## update trainer config
-    for k in get_nondefault_trainer_args(args):
-        trainer_config[k] = getattr(args, k)
-
-    print(f"trainer_config: {trainer_config}")
-    num_nodes = trainer_config.num_nodes
-    ngpu_per_node = trainer_config.devices
-    logger.info(f"Running on {num_rank}={num_nodes}x{ngpu_per_node} GPUs")
-
-    ## setup learning rate
-    base_lr = config.model.base_learning_rate
-    bs = config.data.params.batch_size
-    if getattr(config.model, "scale_lr", True):
-        model.learning_rate = num_rank * bs * base_lr
-    else:
-        model.learning_rate = base_lr
-
-    ## DATA CONFIG >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
-    logger.info("***** Configuring Data *****")
-    data = instantiate_from_config(config.data)
-    data.setup()
-    for k in data.datasets:
-        logger.info(
-            f"{k}, {data.datasets[k].__class__.__name__}, {len(data.datasets[k])}"
-        )
-
-    ## TRAINER CONFIG >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
-    logger.info("***** Configuring Trainer *****")
-    if "accelerator" not in trainer_config:
-        trainer_config["accelerator"] = "gpu"
-
-    ## setup trainer args: pl-logger and callbacks
-    trainer_kwargs = dict()
-    trainer_kwargs["num_sanity_val_steps"] = 0
-    logger_cfg = get_trainer_logger(lightning_config, workdir, args.debug)
-    trainer_kwargs["logger"] = instantiate_from_config(logger_cfg)
-    print(f"logger save_dir: {trainer_kwargs['logger'].save_dir}")
-    ## setup callbacks
-    callbacks_cfg = get_trainer_callbacks(
-        lightning_config, workdir, ckptdir
-    )
-    callbacks_cfg["image_logger"]["params"]["save_dir"] = workdir
-    trainer_kwargs["callbacks"] = [
-        instantiate_from_config(callbacks_cfg[k]) for k in callbacks_cfg
-    ]
-    strategy_cfg = get_trainer_strategy(lightning_config)
-    print('strategy cfg: ', strategy_cfg)
-    trainer_kwargs["strategy"] = strategy_cfg if type(strategy_cfg) == str else instantiate_from_config(OmegaConf.to_container(strategy_cfg))
-
-    trainer_kwargs["sync_batchnorm"] = False
-
-    ## trainer config: others
-    if (
-        "train" in config.data.params
-        and config.data.params.train.target == "lvdm.data.hdvila.HDVila"
-        or (
-            "validation" in config.data.params
-            and config.data.params.validation.target == "lvdm.data.hdvila.HDVila"
-        )
-    ):
-        trainer_kwargs["replace_sampler_ddp"] = False
-
-    ## for debug
-    # trainer_kwargs["fast_dev_run"] = 10
-    # trainer_kwargs["limit_train_batches"] = 1./32
-    # trainer_kwargs["limit_val_batches"] = 0.01
-    # trainer_kwargs["val_check_interval"] = 20  #float: epoch ratio | integer: batch num
-
-    # merge args for trainer
-    print(f"trainer_kwargs: {trainer_kwargs}")
-    trainer = Trainer(**trainer_config, **trainer_kwargs)
-
-    ## allow checkpointing via USR1
-    def melk(*args, **kwargs):
-        ## run all checkpoint hooks
-        if trainer.global_rank == 0:
-            print("Summoning checkpoint.")
-            ckpt_path = os.path.join(ckptdir, "last_summoning.ckpt")
-            trainer.save_checkpoint(ckpt_path)
-
-    def divein(*args, **kwargs):
-        if trainer.global_rank == 0:
-            import pudb
-
-            pudb.set_trace()
-
-    import signal
-
-    signal.signal(signal.SIGUSR1, melk)
-    signal.signal(signal.SIGUSR2, divein)
-
-    ## Running LOOP >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
-    logger.info("***** Running the Loop *****")
-
-    if args.train:
-        try:
-            # Strategy is automatically managed, no need to manually check it here
-            logger.info(f"<Training in {trainer.strategy.__class__.__name__} Mode>")
-            if trainer.strategy.__class__.__name__ == 'DeepSpeedStrategy':
-                logger.info(f"Make parameter contiguous in case deepseed does not allow non contigouous data")
-                for param in model.parameters(): param.data = param.data.contiguous()
-            # Please refer to https://lightning.ai/docs/pytorch/stable/api/lightning.pytorch.plugins.precision.MixedPrecision.html for Automatic Mixed Precision (AMP) training
-            if trainer.strategy == "deepspeed":
-                with torch.cuda.amp.autocast():
-                    trainer.fit(model, data)
-            else:
-                # import pdb
-                # pdb.set_trace()
-                trainer.fit(model, data)
-        except Exception as e:
-            logger.error(f"Training failed: {str(e)}")
-            raise
-    
-    logger.info("***** Converting deepspeed checkpoint into correct format *****")
-
-    if args.val:
-        # Directly call validation
-        trainer.validate(model, data)
-
-    # Ensure test runs either after training finishes or if explicitly requested
-    if args.test or not trainer.interrupted:
-        trainer.test(model, data)
diff --git a/scripts/train_flux_lora.py b/scripts/train_flux_lora.py
index 3ded7d19..fd6affb2 100644
--- a/scripts/train_flux_lora.py
+++ b/scripts/train_flux_lora.py
@@ -1,118 +1,35 @@
-import os
-import sys
-
-import yaml
-
-sys.path.insert(0, os.getcwd())
+"""Train Flux LoRA adapters using the first-party Diffusers trainer."""
 
 import argparse
-import json
-import logging
-import time
-from os import environ
-from pathlib import Path
-
-import torch.distributed as dist
-from pytorch_lightning import Trainer
-
-from videotuna.third_party.flux import log_format
-from videotuna.third_party.flux.training.model import Model
-from videotuna.third_party.flux.training.model_data import ModelData
-from videotuna.third_party.flux.training.state_tracker import StateTracker
-
-logger = logging.getLogger("SimpleTuner")
-logger.setLevel(environ.get("SIMPLETUNER_LOG_LEVEL", "INFO"))
-
-def add_timestamp_to_output_dir(output_dir):
-    time_str = time.strftime("%Y%m%d%H%M%S")
-    folder_name = output_dir.stem
-    name_list = folder_name.split("_")
-    if len(name_list[-1]) == 14:
-        folder_name = "_".join(name_list[:-1])
-    folder_name = f"{folder_name}_{time_str}"
-    output_dir = output_dir.parent / folder_name
-    return str(output_dir)
-
-def config_process(config):
-    # add timestamp to the output_dir
-    output_dir = Path(config["--output_dir"])
-    config["--output_dir"] = add_timestamp_to_output_dir(output_dir)
-    # rewrite the config file
-    with open(args.config_path, "w") as f:
-        json.dump(config, f, indent=4)
-    return config
-
-def load_yaml_config(config_path):
-    with open(config_path) as f:
-        config = yaml.safe_load(f)
-    data_config = config["data"]
-    data_config_json = json.dumps(data_config, indent=2)
-    config = config["train"]
+import multiprocessing
 
-    new_config = {}
-    for key, value in config.items():
-        new_key = "--" + key
-        new_config[new_key] = value
-    config = new_config
-    config["--data_backend_config"] = "configs/006_flux/multidatabackend.json"
+from videotuna.settings import get_settings
+from videotuna.training.flux_lora.train import run_training
+from videotuna.utils.logging_config import bound_logger, configure_logging
 
-    return config, data_config_json
+logger = bound_logger(phase="t2i", flow="flux_lora")
 
-def load_json_config(config_path, data_config_path):
-    # load config files
-    with open(config_path) as f:
-        config = json.load(f)
-    with open(data_config_path) as f:
-        data_config = json.load(f)
-    # process config
-    config = config_process(config)
-    return config, data_config
 
-def main(args):
+def main(args: argparse.Namespace) -> None:
+    configure_logging(level=get_settings().log_level)
     try:
-        import multiprocessing
-
         multiprocessing.set_start_method("fork")
-    except Exception as e:
-        logger.error(
-            "Failed to set the multiprocessing start method to 'fork'. Unexpected behaviour such as high memory overhead or poor performance may result."
-            f"\nError: {e}"
-        )
-    try:
-        config, data_config = load_json_config(args.config_path, args.data_config_path)
-        data_dir = data_config[0]["instance_data_dir"]
-        dm = ModelData(data_dir)
-        dm.create_dataset()
-        dm.setup()
-        print("dataset setup done!")
-        model = Model()
-        model.run()
-        print("loaded model")
-        trainer = Trainer(
-            accelerator="gpu",
-            max_epochs=config["--num_train_epochs"],
-            max_steps=config["--max_train_steps"],
-            strategy="ddp",
-            limit_train_batches=1490,
-            logger=False,
-        )
-        print("loaded Trainer, training...")
-
-        if dist.is_available() and dist.is_initialized():
-            dist.barrier()
-        trainer.fit(model, datamodule=dm)
-        print("train finished")
-
-    except Exception as e:
-        raise e
+    except Exception as exc:
+        logger.warning("Could not set multiprocessing start method to 'fork': {}", exc)
+    run_training(args.config_path, args.data_config_path)
 
 
 if __name__ == "__main__":
-    parser = argparse.ArgumentParser()
-    parser.add_argument("--config_path", type=str, help="Path to the config file")
+    parser = argparse.ArgumentParser(
+        description="Fine-tune Flux LoRA (Diffusers + PEFT)"
+    )
     parser.add_argument(
-        "--data_config_path", type=str, help="Path to the config of data file"
+        "--config_path", type=str, required=True, help="Training config JSON"
     )
-    args = parser.parse_args()
-
-    main(args)
+    parser.add_argument(
+        "--data_config_path",
+        type=str,
+        required=True,
+        help="Path to multidatabackend JSON",
+    )
+    main(parser.parse_args())
diff --git a/scripts/train_new.py b/scripts/train_new.py
index d8d24916..26227851 100644
--- a/scripts/train_new.py
+++ b/scripts/train_new.py
@@ -1,31 +1,19 @@
 import argparse
 import datetime
 import os
-import sys
 
 import pytorch_lightning as pl
 import torch
-from omegaconf import OmegaConf, DictConfig
-from pytorch_lightning import Trainer, seed_everything
-from pytorch_lightning.cli import LightningCLI
+from omegaconf import DictConfig, OmegaConf
+from pytorch_lightning import seed_everything
 from transformers import logging as transf_logging
 
-# sys.path.insert(1, os.path.join(sys.path[0], '..'))
-sys.path.insert(0, os.getcwd())
 from videotuna.base.generation_base import GenerationBase
 from videotuna.utils.args_utils import prepare_train_args
-from videotuna.utils.common_utils import instantiate_from_config, get_dist_info
-from videotuna.utils.lightning_utils import add_trainer_args_to_parser
+from videotuna.utils.common_utils import get_dist_info, instantiate_from_config
+from videotuna.utils.logging_config import bound_logger, configure_logging
 from videotuna.utils.train_utils import (
-    check_config_attribute,
-    get_autoresume_path,
-    get_empty_params_comparedwith_sd,
-    get_trainer_callbacks,
-    get_trainer_logger,
-    get_trainer_strategy,
     init_workspace,
-    load_checkpoints,
-    set_logger,
 )
 
 
@@ -44,7 +32,8 @@ def get_parser(**parser_kwargs):
         nargs="*",
         metavar="base_config.yaml",
         help="paths to base configs. Loaded from left-to-right. "
-        "Parameters can be overwritten or added with command-line options of the form `--key value`.",
+        "Parameters can be overwritten or added with command-line options "
+        "of the form `--key value`.",
         default=list(),
     )
     parser.add_argument(
@@ -84,8 +73,8 @@ def setup_logger(config: DictConfig):
     local_rank, global_rank, num_rank = get_dist_info()
 
     ## 2. config
-    train_config : DictConfig = config.get("train", OmegaConf.create())
-    lightning_config : DictConfig = train_config.get("lightning", OmegaConf.create())
+    train_config: DictConfig = config.get("train", OmegaConf.create())
+    lightning_config: DictConfig = train_config.get("lightning", OmegaConf.create())
 
     ## 3. init logger
     seed_everything(train_config.seed)
@@ -94,27 +83,31 @@ def setup_logger(config: DictConfig):
     workdir, ckptdir, cfgdir, loginfo = init_workspace(
         train_config.name, train_config.logdir, config, lightning_config, global_rank
     )
-    logger = set_logger(
-        logfile=os.path.join(loginfo, "log_%d:%s.txt" % (global_rank, now))
+    configure_logging(
+        log_file=os.path.join(loginfo, "log_%d:%s.txt" % (global_rank, now))
     )
-    train_config['workdir'] = workdir
-    train_config['ckptdir'] = ckptdir
+    logger = bound_logger(phase="t2v", flow="wanvideo")
+    train_config["workdir"] = workdir
+    train_config["ckptdir"] = ckptdir
     return logger
 
+
 if __name__ == "__main__":
     ## prepare args and logger
     local_rank, global_rank, num_rank = get_dist_info()
     parser = get_parser()
     config = prepare_train_args(parser)
-    logger = setup_logger(config)   
+    logger = setup_logger(config)
 
     ## load flow
     logger.info("@lightning version: %s [>=2.0 required]" % pl.__version__)
     logger.info("***** Configuring Model *****")
-    train_config: DictConfig = config['train']
-    flow_config: DictConfig = config['flow']
-    flow : GenerationBase = instantiate_from_config(flow_config, resolve=True)
-    flow.from_pretrained(train_config['ckpt'], train_config['trained_ckpt'], train_config['lorackpt'])
+    train_config: DictConfig = config["train"]
+    flow_config: DictConfig = config["flow"]
+    flow: GenerationBase = instantiate_from_config(flow_config, resolve=True)
+    flow.from_pretrained(
+        train_config["ckpt"], train_config["trained_ckpt"], train_config["lorackpt"]
+    )
 
     ## load trainer
     flow.init_trainer(train_config)
@@ -125,7 +118,7 @@ def setup_logger(config: DictConfig):
     logger.info("***** Running the Loop *****")
     try:
         logger.info(f"<Training in {trainer.strategy.__class__.__name__} Mode>")
-        if trainer.strategy.__class__.__name__  == "DeepSpeedStrategy":
+        if trainer.strategy.__class__.__name__ == "DeepSpeedStrategy":
             logger.info("deepspeed needs autocast")
             with torch.cuda.amp.autocast():
                 trainer.fit(flow, data, ckpt_path=train_config.resume_ckpt)
diff --git a/scripts/train_pl_v18.py b/scripts/train_pl_v18.py
deleted file mode 100644
index 173f33be..00000000
--- a/scripts/train_pl_v18.py
+++ /dev/null
@@ -1,291 +0,0 @@
-import argparse
-import datetime
-import os
-import sys
-
-import pytorch_lightning as pl
-import torch
-from omegaconf import OmegaConf
-from pytorch_lightning import seed_everything
-from pytorch_lightning.cli import LightningCLI
-from pytorch_lightning.trainer import Trainer
-from transformers import logging as transf_logging
-
-# sys.path.insert(1, os.path.join(sys.path[0], '..'))
-sys.path.insert(0, os.getcwd())
-from utils.common_utils import instantiate_from_config
-from utils.lightning_utils import add_trainer_args_to_parser
-
-from scripts.train_utils import (
-    check_config_attribute,
-    get_autoresume_path,
-    get_empty_params_comparedwith_sd,
-    get_trainer_callbacks,
-    get_trainer_logger,
-    get_trainer_strategy,
-    init_workspace,
-    load_checkpoints,
-    set_logger,
-)
-
-
-def get_parser(**parser_kwargs):
-    parser = argparse.ArgumentParser(**parser_kwargs)
-    parser.add_argument(
-        "--seed", "-s", type=int, default=20230211, help="seed for seed_everything"
-    )
-    parser.add_argument(
-        "--name", "-n", type=str, default="", help="experiment name, as saving folder"
-    )
-
-    parser.add_argument(
-        "--base",
-        "-b",
-        nargs="*",
-        metavar="base_config.yaml",
-        help="paths to base configs. Loaded from left-to-right. "
-        "Parameters can be overwritten or added with command-line options of the form `--key value`.",
-        default=list(),
-    )
-
-    parser.add_argument(
-        "--train", "-t", action="store_true", default=False, help="train"
-    )
-    parser.add_argument("--val", "-v", action="store_true", default=False, help="val")
-    parser.add_argument("--test", action="store_true", default=False, help="test")
-
-    parser.add_argument(
-        "--logdir",
-        "-l",
-        type=str,
-        default="logs",
-        help="directory for logging dat shit",
-    )
-    parser.add_argument(
-        "--auto_resume",
-        action="store_true",
-        default=False,
-        help="resume from full-info checkpoint",
-    )
-    parser.add_argument(
-        "--debug",
-        "-d",
-        action="store_true",
-        default=False,
-        help="enable post-mortem debugging",
-    )
-
-    parser.add_argument(
-        "--sdckpt",
-        type=str,
-        default=None,
-        help="pretrained stable diffusion checkpoint",
-    )
-    parser.add_argument(
-        "--ckpt", type=str, default=None, help="pretrained current model checkpoint"
-    )
-    parser.add_argument(
-        "--lorackpt", type=str, default=None, help="pretrained current model checkpoint"
-    )
-    return parser
-
-
-def get_nondefault_trainer_args(args):
-    parser = argparse.ArgumentParser()
-    try:
-        parser = Trainer.add_argparse_args(parser)
-    except:
-        parser = add_trainer_args_to_parser(Trainer, parser)
-
-    default_trainer_args = parser.parse_args([])
-    return sorted(
-        k
-        for k in vars(default_trainer_args)
-        if getattr(args, k) != getattr(default_trainer_args, k)
-    )
-
-
-if __name__ == "__main__":
-    now = datetime.datetime.now().strftime("%Y-%m-%dT%H-%M-%S")
-    try:
-        local_rank = int(os.environ.get("LOCAL_RANK"))
-        global_rank = int(os.environ.get("RANK"))
-        num_rank = int(os.environ.get("WORLD_SIZE"))
-    except:
-        local_rank, global_rank, num_rank = 0, 0, 1
-    # print(f'local_rank: {local_rank} | global_rank:{global_rank} | num_rank:{num_rank}')
-
-    parser = get_parser()
-    ## Extends existing argparse by default Trainer attributes
-
-    try:
-        parser = Trainer.add_argparse_args(parser)
-    except:
-        parser = add_trainer_args_to_parser(Trainer, parser)
-
-    args, unknown = parser.parse_known_args()
-    ## disable transformer warning
-    transf_logging.set_verbosity_error()
-    seed_everything(args.seed)
-
-    ## yaml configs: "model" | "data" | "lightning"
-    configs = [OmegaConf.load(cfg) for cfg in args.base]
-    cli = OmegaConf.from_dotlist(unknown)
-    config = OmegaConf.merge(*configs, cli)
-
-    if args.sdckpt is not None:
-        config["model"]["sd_checkpoint"] = args.sdckpt
-    if args.ckpt is not None:
-        config["model"]["pretrained_checkpoint"] = args.ckpt
-    if args.lorackpt is not None:
-        config["model"]["params"]["lora_args"]["lora_ckpt"] = args.lorackpt
-
-    lightning_config = config.pop("lightning", OmegaConf.create())
-    trainer_config = lightning_config.get("trainer", OmegaConf.create())
-
-    ## setup workspace directories
-    workdir, ckptdir, cfgdir, loginfo = init_workspace(
-        args.name, args.logdir, config, lightning_config, global_rank
-    )
-    logger = set_logger(
-        logfile=os.path.join(loginfo, "log_%d:%s.txt" % (global_rank, now))
-    )
-    logger.info("@lightning version: %s [>=1.8 required]" % (pl.__version__))
-
-    ## MODEL CONFIG >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
-    logger.info("***** Configing Model *****")
-    config.model.params.logdir = workdir
-
-    model = instantiate_from_config(config.model)
-    if args.auto_resume:
-        ## the saved checkpoint must be: full-info checkpoint
-        resume_ckpt_path = get_autoresume_path(workdir)
-        if resume_ckpt_path is not None:
-            args.resume_from_checkpoint = resume_ckpt_path
-            logger.info("Resuming from checkpoint: %s" % args.resume_from_checkpoint)
-            ## just in case train empy parameters only
-            if check_config_attribute(config.model.params, "empty_params_only"):
-                _, model.empty_paras = get_empty_params_comparedwith_sd(
-                    model, config.model
-                )
-        else:
-            model = load_checkpoints(model, config.model)
-            logger.warning("Auto-resuming skipped as No checkpoit found!")
-    else:
-        model = load_checkpoints(model, config.model)
-    if len(model.lora_args) != 0:
-        model.inject_lora()
-    ## update trainer config
-    for k in get_nondefault_trainer_args(args):
-        trainer_config[k] = getattr(args, k)
-
-    print(trainer_config)
-    num_nodes = trainer_config.num_nodes
-    ngpu_per_node = trainer_config.devices
-    logger.info(f"Running on {num_rank}={num_nodes}x{ngpu_per_node} GPUs")
-
-    ## setup learning rate
-    base_lr = config.model.base_learning_rate
-    bs = config.data.params.batch_size
-    if getattr(config.model, "scale_lr", True):
-        model.learning_rate = num_rank * bs * base_lr
-    else:
-        model.learning_rate = base_lr
-
-    ## DATA CONFIG >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
-    logger.info("***** Configing Data *****")
-    data = instantiate_from_config(config.data)
-    data.setup()
-    for k in data.datasets:
-        logger.info(
-            f"{k}, {data.datasets[k].__class__.__name__}, {len(data.datasets[k])}"
-        )
-
-    ## TRAINER CONFIG >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
-    logger.info("***** Configing Trainer *****")
-    if "accelerator" not in trainer_config:
-        trainer_config["accelerator"] = "gpu"
-
-    ## setup trainer args: pl-logger and callbacks
-    trainer_kwargs = dict()
-    trainer_kwargs["num_sanity_val_steps"] = 0
-    logger_cfg = get_trainer_logger(lightning_config, workdir, args.debug)
-    trainer_kwargs["logger"] = instantiate_from_config(logger_cfg)
-    print(trainer_kwargs["logger"].save_dir)
-    ## setup callbacks
-    callbacks_cfg = get_trainer_callbacks(
-        lightning_config, config, workdir, ckptdir, logger
-    )
-    callbacks_cfg["image_logger"]["params"]["save_dir"] = workdir
-    trainer_kwargs["callbacks"] = [
-        instantiate_from_config(callbacks_cfg[k]) for k in callbacks_cfg
-    ]
-    strategy_cfg = get_trainer_strategy(lightning_config)
-    trainer_kwargs["strategy"] = (
-        strategy_cfg
-        if type(strategy_cfg) == str
-        else instantiate_from_config(strategy_cfg)
-    )
-    trainer_kwargs["sync_batchnorm"] = False
-
-    ## trainer config: others
-    if (
-        "train" in config.data.params
-        and config.data.params.train.target == "lvdm.data.hdvila.HDVila"
-        or (
-            "validation" in config.data.params
-            and config.data.params.validation.target == "lvdm.data.hdvila.HDVila"
-        )
-    ):
-        trainer_kwargs["replace_sampler_ddp"] = False
-
-    ## for debug
-    # trainer_kwargs["fast_dev_run"] = 10
-    # trainer_kwargs["limit_train_batches"] = 1./32
-    # trainer_kwargs["limit_val_batches"] = 0.01
-    # trainer_kwargs["val_check_interval"] = 20  #float: epoch ratio | integer: batch num
-
-    # merge args for trainer
-    trainer_args = argparse.Namespace(**trainer_config)
-    trainer = Trainer.from_argparse_args(trainer_args, **trainer_kwargs)
-    print(trainer_args, trainer_kwargs)
-
-    ## allow checkpointing via USR1
-    def melk(*args, **kwargs):
-        ## run all checkpoint hooks
-        if trainer.global_rank == 0:
-            print("Summoning checkpoint.")
-            ckpt_path = os.path.join(ckptdir, "last_summoning.ckpt")
-            trainer.save_checkpoint(ckpt_path)
-
-    def divein(*args, **kwargs):
-        if trainer.global_rank == 0:
-            import pudb
-
-            pudb.set_trace()
-
-    import signal
-
-    signal.signal(signal.SIGUSR1, melk)
-    signal.signal(signal.SIGUSR2, divein)
-
-    ## Running LOOP >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
-    logger.info("***** Running the Loop *****")
-    if args.train:
-        try:
-            if "strategy" in lightning_config:
-                logger.info("<Training in DeepSpeed Mode>")
-                ## deepspeed
-                with torch.cuda.amp.autocast():
-                    trainer.fit(model, data)
-            else:
-                logger.info("<Training in DDPShare Mode>")
-                ## ddpshare
-                trainer.fit(model, data)
-        except Exception:
-            # melk()
-            raise
-    if args.val:
-        trainer.validate(model, data)
-    if args.test or not trainer.interrupted:
-        trainer.test(model, data)
diff --git a/scripts/verify_cpu_torch.py b/scripts/verify_cpu_torch.py
new file mode 100644
index 00000000..66abc2fb
--- /dev/null
+++ b/scripts/verify_cpu_torch.py
@@ -0,0 +1,78 @@
+#!/usr/bin/env python3
+"""Verify CPU-only PyTorch install for VideoTuna dev/CI."""
+
+from __future__ import annotations
+
+import argparse
+import importlib
+import sys
+
+import torch
+from torch import version as torch_version
+
+from videotuna.utils.device_utils import (
+    describe_compute_environment,
+    detect_compute_backend,
+)
+
+
+def _check_import(name: str) -> tuple[bool, str]:
+    try:
+        mod = importlib.import_module(name)
+        version = getattr(mod, "__version__", "installed")
+        return True, str(version)
+    except ImportError:
+        return False, "not installed"
+
+
+def main(argv: list[str] | None = None) -> int:
+    parser = argparse.ArgumentParser(
+        description="Verify CPU-only PyTorch install for VideoTuna."
+    )
+    parser.parse_args(argv)
+
+    errors: list[str] = []
+    backend = detect_compute_backend()
+
+    print(f"Compute backend: {backend}")
+    print(describe_compute_environment())
+    print(f"PyTorch: {torch.__version__}")
+    print(f"CUDA build: {getattr(torch_version, 'cuda', None)}")
+    print(f"HIP build: {getattr(torch_version, 'hip', None)}")
+
+    if getattr(torch_version, "cuda", None) is not None:
+        errors.append("PyTorch was built with CUDA; run: poetry run install-cpu-torch")
+    if getattr(torch_version, "hip", None) is not None:
+        errors.append("PyTorch reports HIP (ROCm); expected CPU-only wheel.")
+
+    if backend != "cpu":
+        errors.append(
+            f"Expected detect_compute_backend()=cpu, got {backend!r}. "
+            "Set VIDEOTUNA_COMPUTE_BACKEND=cpu or use a CPU torch wheel."
+        )
+
+    cuda_only = ["xformers", "xfuser", "bitsandbytes"]
+    for pkg in cuda_only:
+        ok, detail = _check_import(pkg)
+        status = "PRESENT" if ok else "absent"
+        print(f"  {pkg}: {status} ({detail})")
+        if ok:
+            errors.append(
+                f"CUDA-only package {pkg} is installed; "
+                "re-run poetry run install-cpu-torch"
+            )
+
+    triton_ok, triton_detail = _check_import("triton")
+    print(f"  triton: {'PRESENT' if triton_ok else 'absent'} ({triton_detail})")
+
+    if errors:
+        for err in errors:
+            print(f"ERROR: {err}", file=sys.stderr)
+        return 1
+
+    print("CPU torch verification OK")
+    return 0
+
+
+if __name__ == "__main__":
+    raise SystemExit(main())
diff --git a/scripts/verify_cuda_extras.py b/scripts/verify_cuda_extras.py
new file mode 100644
index 00000000..0a5df02f
--- /dev/null
+++ b/scripts/verify_cuda_extras.py
@@ -0,0 +1,86 @@
+#!/usr/bin/env python3
+"""Verify NVIDIA CUDA optional dependencies and runtime environment."""
+
+from __future__ import annotations
+
+import argparse
+import importlib
+import sys
+
+import torch
+import torch.version
+
+from videotuna.utils.device_utils import (
+    _driver_version,
+    describe_compute_environment,
+    detect_compute_backend,
+    get_visible_gpus,
+)
+
+
+def _check_import(name: str) -> tuple[bool, str]:
+    try:
+        mod = importlib.import_module(name)
+        version = getattr(mod, "__version__", "unknown")
+        return True, str(version)
+    except ImportError as exc:
+        return False, str(exc)
+
+
+def main(argv: list[str] | None = None) -> int:
+    parser = argparse.ArgumentParser(
+        description="Verify NVIDIA CUDA extras for VideoTuna."
+    )
+    parser.add_argument(
+        "--expect-flash",
+        action="store_true",
+        help="Fail when flash-attn is not importable.",
+    )
+    args = parser.parse_args(argv)
+
+    errors: list[str] = []
+    backend = detect_compute_backend()
+
+    print(f"Compute backend: {backend}")
+    print(describe_compute_environment())
+    print(f"Driver: {_driver_version()}")
+    print(f"CUDA runtime (torch): {getattr(torch.version, 'cuda', 'n/a')}")
+    print(f"PyTorch: {torch.__version__}")
+
+    if backend == "rocm":
+        errors.append(
+            "Active backend is ROCm; run verify on an NVIDIA CUDA install "
+            "(poetry install -E cuda)."
+        )
+    elif backend == "cpu":
+        errors.append("No GPU visible; CUDA verification requires an NVIDIA GPU.")
+
+    gpus = get_visible_gpus()
+    for gpu in gpus:
+        print(
+            f"  [{gpu.index}] {gpu.name}: "
+            f"{gpu.total_vram_gb:.1f} GB total, "
+            f"{gpu.free_vram_gb:.1f} GB free, "
+            f"sm {gpu.compute_capability[0]}.{gpu.compute_capability[1]}, "
+            f"bf16={gpu.supports_bf16}"
+        )
+
+    optional = ["xformers", "flash_attn", "triton", "xfuser", "bitsandbytes"]
+    for pkg in optional:
+        ok, detail = _check_import(pkg)
+        status = "OK" if ok else "MISSING"
+        print(f"  {pkg}: {status} ({detail})")
+        if args.expect_flash and pkg == "flash_attn" and not ok:
+            errors.append("flash-attn not installed (--expect-flash)")
+
+    if errors:
+        for err in errors:
+            print(f"ERROR: {err}", file=sys.stderr)
+        return 1
+
+    print("CUDA extras verification OK")
+    return 0
+
+
+if __name__ == "__main__":
+    raise SystemExit(main())
diff --git a/scripts/verify_rocm_extras.py b/scripts/verify_rocm_extras.py
new file mode 100644
index 00000000..5cdd8d25
--- /dev/null
+++ b/scripts/verify_rocm_extras.py
@@ -0,0 +1,66 @@
+#!/usr/bin/env python3
+"""Verify pyproject.toml ROCm extra excludes CUDA-only packages."""
+
+from __future__ import annotations
+
+import sys
+import tomllib
+from pathlib import Path
+
+ROOT = Path(__file__).resolve().parents[1]
+PYPROJECT = ROOT / "pyproject.toml"
+
+CUDA_ONLY_IN_ROCM = {
+    "xformers",
+    "bitsandbytes",
+    "xfuser",
+    "triton",
+}
+CUDA_ONLY_PREFIXES = ("nvidia-",)
+
+
+def main() -> int:
+    data = tomllib.loads(PYPROJECT.read_text())
+    poetry = data.get("tool", {}).get("poetry", {})
+    extras = poetry.get("extras", {})
+    rocm_extra = set(extras.get("rocm", []))
+    cuda_extra = set(extras.get("cuda", []))
+
+    errors: list[str] = []
+
+    overlap = rocm_extra & cuda_extra
+    if overlap:
+        errors.append(f"rocm and cuda extras overlap: {sorted(overlap)}")
+
+    for pkg in rocm_extra:
+        if pkg in CUDA_ONLY_IN_ROCM or pkg.startswith(CUDA_ONLY_PREFIXES):
+            errors.append(f"CUDA-only package {pkg!r} listed in rocm extra")
+
+    # torch uses install-rocm script; rocm extra is intentionally empty
+    if "pytorch-rocm642" not in {
+        s["name"] for s in data.get("tool", {}).get("poetry", {}).get("source", [])
+    }:
+        # sources are top-level in pyproject
+        pass
+
+    sources = data.get("tool", {}).get("poetry", {}).get("source", [])
+    if not any(s.get("name") == "pytorch-rocm642" for s in sources):
+        errors.append("missing pytorch-rocm642 poetry source")
+
+    cuda_has_torch = "triton" in cuda_extra or "xformers" in cuda_extra
+    if not cuda_has_torch:
+        errors.append(
+            "cuda extra should include CUDA accelerator packages (e.g. xformers)"
+        )
+
+    if errors:
+        for err in errors:
+            print(f"ERROR: {err}", file=sys.stderr)
+        return 1
+
+    print("ROCm extras configuration OK")
+    return 0
+
+
+if __name__ == "__main__":
+    raise SystemExit(main())
diff --git a/shscripts/inference_cogVideo_i2v_diffusers.sh b/shscripts/inference_cogVideo_i2v_diffusers.sh
deleted file mode 100644
index 15cd59ff..00000000
--- a/shscripts/inference_cogVideo_i2v_diffusers.sh
+++ /dev/null
@@ -1,9 +0,0 @@
-python scripts/inference_cogVideo_diffusers.py \
-    --generate_type i2v \
-    --model_input "inputs/i2v/576x1024" \
-    --model_path checkpoints/cogvideo/CogVideoX-5b-I2V \
-    --output_path results/cogvideo-test-i2v \
-    --num_inference_steps 50 \
-    --guidance_scale 3.5 \
-    --num_videos_per_prompt 1 \
-    --dtype float16
diff --git a/shscripts/inference_cogVideo_t2v_diffusers.sh b/shscripts/inference_cogVideo_t2v_diffusers.sh
deleted file mode 100644
index 75688a92..00000000
--- a/shscripts/inference_cogVideo_t2v_diffusers.sh
+++ /dev/null
@@ -1,20 +0,0 @@
-
-# sample a single video
-python scripts/inference_cogVideo_diffusers.py \
-    --model_input "A cat playing with a ball" \
-    --model_path checkpoints/cogvideo/CogVideoX-2b \
-    --output_path results/output.mp4 \
-    --num_inference_steps 50 \
-    --guidance_scale 3.5 \
-    --num_videos_per_prompt 1 \
-    --dtype float16
-
-# sample multiple videos
-# python scripts/inference_cogVideo_diffusers.py \
-    # --model_input "inputs/t2v/prompts.txt" \
-    # --model_path checkpoints/cogvideo/CogVideoX-2b \
-    # --output_path results/cogvideo-test \
-    # --num_inference_steps 50 \
-    # --guidance_scale 3.5 \
-    # --num_videos_per_prompt 1 \
-    # --dtype float16
diff --git a/shscripts/inference_cogVideox1.5_5b_i2v.sh b/shscripts/inference_cogVideox1.5_5b_i2v.sh
deleted file mode 100644
index d3afbbaf..00000000
--- a/shscripts/inference_cogVideox1.5_5b_i2v.sh
+++ /dev/null
@@ -1,16 +0,0 @@
-load_transformer="checkpoints/cogvideo/CogVideoX1.5-5B-SAT/transformer_i2v"
-input_type="txt"
-input_file="inputs/i2v/576x1024/test_prompts.txt"
-output_dir="results/i2v/"
-base="configs/005_cogvideox1.5/cogvideox1.5_5b.yaml"
-image_folder="inputs/i2v/576x1024/"
-
-python scripts/inference_cogVideo_sat_refactor.py \
---load_transformer $load_transformer \
---input_file $input_file \
---output_dir $output_dir \
---base $base    \
---mode_type "i2v"   \
---sampling_num_frames 22    \
---image_folder $image_folder \
---seed 42
diff --git a/shscripts/inference_cogVideox1.5_5b_t2v.sh b/shscripts/inference_cogVideox1.5_5b_t2v.sh
deleted file mode 100644
index 4039d860..00000000
--- a/shscripts/inference_cogVideox1.5_5b_t2v.sh
+++ /dev/null
@@ -1,13 +0,0 @@
-load_transformer="checkpoints/cogvideo/CogVideoX1.5-5B-SAT/transformer_t2v"
-input_type="txt"
-input_file="inputs/t2v/prompts.txt"
-output_dir="results/t2v/"
-base="configs/005_cogvideox1.5/cogvideox1.5_5b.yaml"
-
-python scripts/inference_cogVideo_sat_refactor.py \
---load_transformer $load_transformer \
---input_file $input_file \
---output_dir $output_dir \
---base $base    \
---mode_type "t2v"   \
---sampling_num_frames 22    \
diff --git a/shscripts/inference_cogvideo_i2v_fullft.sh b/shscripts/inference_cogvideo_i2v_fullft.sh
deleted file mode 100644
index ea871745..00000000
--- a/shscripts/inference_cogvideo_i2v_fullft.sh
+++ /dev/null
@@ -1,17 +0,0 @@
-config=configs/004_cogvideox/cogvideo5b-i2v-fullft.yaml
-ckpt=${YOUR_CKPT_PATH}
-prompt_dir=inputs/i2v/576x1024
-
-current_time=$(date +%Y%m%d%H%M%S)
-savedir="results/inference/i2v/cogvideox-i2v-fullft-$current_time"
-
-python3 scripts/inference_cogvideo.py \
-    --config $config \
-    --ckpt_path $ckpt \
-    --prompt_dir $prompt_dir \
-    --savedir $savedir \
-    --bs 1 --height 480 --width 720 \
-    --fps 16 \
-    --seed 6666 \
-    --mode i2v \
-    --denoiser_precision bf16
diff --git a/shscripts/inference_cogvideo_i2v_lora.sh b/shscripts/inference_cogvideo_i2v_lora.sh
deleted file mode 100644
index 110fdc47..00000000
--- a/shscripts/inference_cogvideo_i2v_lora.sh
+++ /dev/null
@@ -1,17 +0,0 @@
-config=configs/004_cogvideox/cogvideo5b-i2v.yaml
-ckpt=${YOUR_CKPT_PATH} # TODO
-prompt_dir=inputs/i2v/576x1024
-
-current_time=$(date +%Y%m%d%H%M%S)
-savedir="results/inference/i2v/cogvideox-i2v-lora-$current_time"
-
-python3 scripts/inference_cogvideo.py \
-    --config $config \
-    --ckpt_path $ckpt \
-    --prompt_dir $prompt_dir \
-    --savedir $savedir \
-    --bs 1 --height 480 --width 720 \
-    --fps 16 \
-    --seed 6666 \
-    --mode i2v \
-    --denoiser_precision bf16
diff --git a/shscripts/inference_cogvideo_t2v_fullft.sh b/shscripts/inference_cogvideo_t2v_fullft.sh
deleted file mode 100644
index 54af0b77..00000000
--- a/shscripts/inference_cogvideo_t2v_fullft.sh
+++ /dev/null
@@ -1,16 +0,0 @@
-# ----------------------diffusers based pl inference ----------------------
-config='configs/004_cogvideox/cogvideo5b-t2v-fullft.yaml' # or configs/004_cogvideox/cogvideo2b.yaml
-prompt_file="inputs/t2v/prompts.txt"
-current_time=$(date +%Y%m%d%H%M%S)
-savedir="results/inference/t2v/cogvideox-t2v-fullft-$current_time"
-ckpt="{YOUR_CKPT_PATH}"
-
-python3 scripts/inference_cogvideo.py \
---ckpt_path $ckpt \
---config $config \
---prompt_file $prompt_file \
---savedir $savedir \
---bs 1 --height 480 --width 720 \
---fps 16 \
---seed 6666 \
---denoiser_precision bf16
diff --git a/shscripts/inference_cogvideo_t2v_lora.sh b/shscripts/inference_cogvideo_t2v_lora.sh
deleted file mode 100644
index 99165c7a..00000000
--- a/shscripts/inference_cogvideo_t2v_lora.sh
+++ /dev/null
@@ -1,17 +0,0 @@
-# ----------------------diffusers based pl inference ----------------------
-# ‘configs/004_cogvideox/cogvideo2b.yaml’ or 'configs/004_cogvideox/cogvideo5b.yaml'
-config='configs/004_cogvideox/cogvideo5b.yaml'
-prompt_file="inputs/t2v/prompts.txt"
-current_time=$(date +%Y%m%d%H%M%S)
-savedir="results/inference/t2v/cogvideox-t2v-lora-$current_time"
-ckpt="{YOUR_CKPT_PATH}"
-
-python3 scripts/inference_cogvideo.py \
---ckpt_path $ckpt \
---config $config \
---prompt_file $prompt_file \
---savedir $savedir \
---bs 1 --height 480 --width 720 \
---fps 16 \
---seed 6666 \
---denoiser_precision bf16
diff --git a/shscripts/inference_dc_i2v_576x1024.sh b/shscripts/inference_dc_i2v_576x1024.sh
deleted file mode 100644
index 710e7ae7..00000000
--- a/shscripts/inference_dc_i2v_576x1024.sh
+++ /dev/null
@@ -1,15 +0,0 @@
-
-ckpt=checkpoints/dynamicrafter/i2v_576x1024/model.ckpt
-config=configs/002_dynamicrafter/dc_i2v_1024.yaml
-prompt_dir=inputs/i2v/576x1024
-savedir=results/dc-i2v-576x1024
-
-python3 scripts/inference.py \
---mode 'i2v' \
---ckpt_path $ckpt \
---config $config \
---prompt_dir $prompt_dir \
---savedir $savedir \
---bs 1 --height 576 --width 1024 \
---fps 10 \
---seed 123
diff --git a/shscripts/inference_flux.sh b/shscripts/inference_flux.sh
deleted file mode 100644
index 8c5bd93c..00000000
--- a/shscripts/inference_flux.sh
+++ /dev/null
@@ -1,21 +0,0 @@
-#!/bin/bash
-# inference with a file of prompts or a single prompt
-# default inference with dev model
-python scripts/inference_flux.py \
-    --model_type dev \
-    --prompt inputs/t2v/prompts.txt \
-    --out_path results/flux-dev/ \
-    --width 1360 \
-    --height 768 \
-    --num_inference_steps 50 \
-    --guidance_scale 0.
-
-# default inference with schell model
-python scripts/inference_flux.py \
-    --model_type schnell \
-    --prompt inputs/t2v/prompts.txt \
-    --out_path results/flux-schnell/ \
-    --width 1360 \
-    --height 768 \
-    --num_inference_steps 4 \
-    --guidance_scale 0.
diff --git a/shscripts/inference_hunyuanvideo_i2v.sh b/shscripts/inference_hunyuanvideo_i2v.sh
deleted file mode 100644
index df36408d..00000000
--- a/shscripts/inference_hunyuanvideo_i2v.sh
+++ /dev/null
@@ -1,18 +0,0 @@
-ckpt='checkpoints/hunyuanvideo/HunyuanVideo-I2V'
-dit_weight='checkpoints/hunyuanvideo/HunyuanVideo-I2V/hunyuan-video-i2v-720p/transformers/mp_rank_00_model_states.pt'
-config='configs/007_hunyuanvideo/hunyuanvideo_i2v.yaml'
-prompt_dir="inputs/i2v/576x1024"
-savedir="results/i2v/hunyuan"
-
-python3 scripts/inference_new.py \
-    --ckpt_path "$ckpt" \
-    --dit_weight "$dit_weight" \
-    --config "$config" \
-    --prompt_dir "$prompt_dir" \
-    --savedir "$savedir" \
-    --height 720 \
-    --width 1280 \
-    --i2v_resolution "720p" \
-    --frames 129 \
-    --seed 44 \
-    --num_inference_steps 50 
diff --git a/shscripts/inference_hunyuanvideo_t2v.sh b/shscripts/inference_hunyuanvideo_t2v.sh
deleted file mode 100644
index e74181fe..00000000
--- a/shscripts/inference_hunyuanvideo_t2v.sh
+++ /dev/null
@@ -1,14 +0,0 @@
-config='configs/007_hunyuanvideo/hunyuanvideo_t2v_diffuser.yaml'
-prompt_file="inputs/t2v/hunyuanvideo/tyler_swift_video/labels.txt"
-current_time=$(date +%Y%m%d%H%M%S)
-savedir="results/t2v/$current_time-hunyuanvideo"
-ckpt="checkpoints/hunyuanvideo/HunyuanVideo"
-
-python3 scripts/inference_cogvideo.py \
---ckpt_path $ckpt \
---config $config \
---prompt_file $prompt_file \
---savedir $savedir \
---bs 1 --height 256 --width 256 \
---fps 16 \
---seed 6666 \
\ No newline at end of file
diff --git a/shscripts/inference_hunyuanvideo_t2v_lora.sh b/shscripts/inference_hunyuanvideo_t2v_lora.sh
deleted file mode 100644
index 991bcdc6..00000000
--- a/shscripts/inference_hunyuanvideo_t2v_lora.sh
+++ /dev/null
@@ -1,17 +0,0 @@
-# ----------------------diffusers based pl inference ----------------------
-# ‘configs/004_cogvideox/cogvideo2b.yaml’ or 'configs/004_cogvideox/cogvideo5b.yaml'
-config='configs/007_hunyuanvideo/hunyuanvideo_t2v_diffuser_lora.yaml'
-prompt_file="inputs/t2v/hunyuanvideo/tyler_swift_video/labels.txt"
-current_time=$(date +%Y%m%d%H%M%S)
-savedir="results/t2v/$current_time-hunyuanvideo"
-# ckpt="{YOUR_CKPT_PATH}"
-ckpt="results/train/20250228203955_hunyuanvideo_t2v_lora/checkpoints/epoch=430.ckpt"
-
-python3 scripts/inference_cogvideo.py \
---ckpt_path $ckpt \
---config $config \
---prompt_file $prompt_file \
---savedir $savedir \
---bs 1 --height 256 --width 256 \
---fps 16 \
---seed 6666 \
diff --git a/shscripts/inference_mochi.sh b/shscripts/inference_mochi.sh
deleted file mode 100644
index 8e701b4e..00000000
--- a/shscripts/inference_mochi.sh
+++ /dev/null
@@ -1,13 +0,0 @@
-ckpt='checkpoints/mochi-1-preview'
-prompt_file="inputs/t2v/prompts.txt"
-savedir="results/t2v/mochi2"
-height=480
-width=848
-
-python3 scripts/inference_mochi.py \
-    --ckpt_path $ckpt \
-    --prompt_file $prompt_file \
-    --savedir $savedir \
-    --bs 1 --height $height --width $width \
-    --fps 28 \
-    --seed 124
diff --git a/shscripts/inference_opensora_v10_16x256x256.sh b/shscripts/inference_opensora_v10_16x256x256.sh
deleted file mode 100644
index bec260fb..00000000
--- a/shscripts/inference_opensora_v10_16x256x256.sh
+++ /dev/null
@@ -1,22 +0,0 @@
-current_time=$(date +%Y%m%d%H%M%S)
-
-ckpt="checkpoints/open-sora/t2v_v10/OpenSora-v1-HQ-16x256x256.pth"
-config='configs/003_opensora/opensorav10_256x256.yaml'
-
-prompt_file="inputs/t2v/prompts.txt"
-res_dir="results/t2v/$current_time-opensorav10-HQ-16x256x256"
-
-python3 scripts/inference.py \
-    --seed 123 \
-    --mode 't2v' \
-    --ckpt_path $ckpt \
-    --config $config \
-    --savedir $res_dir \
-    --n_samples 3 \
-    --bs 2 --height 256 --width 256 \
-    --unconditional_guidance_scale 7.0 \
-    --ddim_steps 50 \
-    --ddim_eta 1.0 \
-    --prompt_file $prompt_file \
-    --fps 8 \
-    --frames 16
diff --git a/shscripts/inference_stepvideo_t2v.sh b/shscripts/inference_stepvideo_t2v.sh
deleted file mode 100644
index da619181..00000000
--- a/shscripts/inference_stepvideo_t2v.sh
+++ /dev/null
@@ -1,16 +0,0 @@
-ckpt='checkpoints/stepvideo/stepvideo-t2v/'
-config='configs/009_stepvideo/stepvideo_t2v.yaml'
-prompt_file="inputs/t2v/prompts.txt"
-savedir="results/t2v/stepvideo"
-
-python3 scripts/inference_new.py \
-    --ckpt_path "$ckpt" \
-    --config "$config" \
-    --prompt_file "$prompt_file" \
-    --savedir "$savedir" \
-    --height 544 \
-    --width 992 \
-    --frames 51 \
-    --seed 44 \
-    --num_inference_steps 50 \
-    --enable_model_cpu_offload
diff --git a/shscripts/inference_v2v_ms.sh b/shscripts/inference_v2v_ms.sh
deleted file mode 100644
index 8ab9fc2e..00000000
--- a/shscripts/inference_v2v_ms.sh
+++ /dev/null
@@ -1,6 +0,0 @@
-input_dir="inputs/v2v/001"
-current_time=$(date +%Y%m%d%H%M%S)
-output_dir="results/v2v/$current_time-v2v-modelscope-001"
-
-python3 scripts/inference_v2v_ms.py \
-    --input_dir $input_dir --output_dir $output_dir
diff --git a/shscripts/inference_vc1_i2v_320x512.sh b/shscripts/inference_vc1_i2v_320x512.sh
deleted file mode 100644
index 0f18db99..00000000
--- a/shscripts/inference_vc1_i2v_320x512.sh
+++ /dev/null
@@ -1,14 +0,0 @@
-ckpt='checkpoints/videocrafter/i2v_v1_512/model.ckpt'
-config='configs/000_videocrafter/vc1_i2v_512.yaml'
-prompt_dir="inputs/i2v/576x1024"
-savedir="results/i2v/vc1-i2v-320x512"
-
-python3 scripts/inference.py \
---mode 'i2v' \
---ckpt_path $ckpt \
---config $config \
---prompt_dir $prompt_dir \
---savedir $savedir \
---bs 1 --height 320 --width 512 \
---fps 8 \
---seed 123
diff --git a/shscripts/inference_vc1_t2v_576x1024.sh b/shscripts/inference_vc1_t2v_576x1024.sh
deleted file mode 100644
index 23563823..00000000
--- a/shscripts/inference_vc1_t2v_576x1024.sh
+++ /dev/null
@@ -1,13 +0,0 @@
-ckpt=checkpoints/videocrafter/t2v_v1_1024/model.ckpt
-config=configs/000_videocrafter/vc1_t2v_1024.yaml
-prompt_file=inputs/t2v/prompts.txt
-res_dir="results/t2v/videocrafter1-576x1024"
-
-python3 scripts/inference.py \
-    --ckpt_path $ckpt \
-    --config $config \
-    --prompt_file $prompt_file \
-    --savedir $res_dir \
-    --bs 1 --height 576 --width 1024 \
-    --fps 28 \
-    --seed 123
diff --git a/shscripts/inference_vc2_t2v_320x512.sh b/shscripts/inference_vc2_t2v_320x512.sh
deleted file mode 100644
index b759b6d6..00000000
--- a/shscripts/inference_vc2_t2v_320x512.sh
+++ /dev/null
@@ -1,9 +0,0 @@
-ckpt='checkpoints/videocrafter/t2v_v2_512_split/'
-config='configs/001_videocrafter2/vc2_t2v_320x512.yaml'
-prompt_file="inputs/t2v/prompts.txt"
-
-python3 scripts/inference_new.py \
-    --ckpt_path $ckpt \
-    --config $config \
-    --prompt_file $prompt_file \
-    --savefps 30
diff --git a/shscripts/inference_vc2_t2v_320x512_lora.sh b/shscripts/inference_vc2_t2v_320x512_lora.sh
deleted file mode 100644
index 9620ac4b..00000000
--- a/shscripts/inference_vc2_t2v_320x512_lora.sh
+++ /dev/null
@@ -1,20 +0,0 @@
-ckpt=checkpoints/videocrafter/t2v_v2_512/model.ckpt
-config=configs/001_videocrafter2/vc2_t2v_lora.yaml
-LORACKPT=YOUR_LORA_CKPT
-prompt_file=inputs/t2v/prompts.txt
-res_dir=results/train/003_vc2_lora_ft
-
-python3 scripts/inference.py \
-    --seed 123 \
-    --mode 't2v' \
-    --ckpt_path $ckpt \
-    --lorackpt $LORACKPT \
-    --config $config \
-    --savedir $res_dir \
-    --n_samples 1 \
-    --bs 1 --height 320 --width 512 \
-    --unconditional_guidance_scale 12.0 \
-    --ddim_steps 50 \
-    --ddim_eta 1.0 \
-    --prompt_file $prompt_file \
-    --fps 28
diff --git a/shscripts/inference_wanvideo_i2v.sh b/shscripts/inference_wanvideo_i2v.sh
deleted file mode 100644
index c26c25ae..00000000
--- a/shscripts/inference_wanvideo_i2v.sh
+++ /dev/null
@@ -1,44 +0,0 @@
-resolution="720P"
-
-if [ "$resolution" = "480P" ]; then
-    ckpt='checkpoints/wan/Wan2.1-I2V-14B-480P/'
-    config='configs/008_wanvideo/wan2_1_i2v_14B_480P.yaml'
-    prompt_dir="inputs/i2v/576x1024"
-    savedir="results/i2v/wanvideo/480P"
-
-    python3 scripts/inference_new.py \
-        --ckpt_path "$ckpt" \
-        --config "$config" \
-        --prompt_dir "$prompt_dir" \
-        --savedir "$savedir" \
-        --height 480 \
-        --width 832 \
-        --frames 81 \
-        --seed 44 \
-        --num_inference_steps 40 \
-        --time_shift 3.0 \
-        --enable_model_cpu_offload
-        
-elif [ "$resolution" = "720P" ]; then
-    #720P
-    ckpt='checkpoints/wan/Wan2.1-I2V-14B-720P/'
-    config='configs/008_wanvideo/wan2_1_i2v_14B_720P.yaml'
-    prompt_dir="inputs/i2v/576x1024"
-    savedir="results/i2v/wanvideo/720P"
-
-    python3 scripts/inference_new.py \
-        --ckpt_path "$ckpt" \
-        --config "$config" \
-        --prompt_dir "$prompt_dir" \
-        --savedir "$savedir" \
-        --height 720 \
-        --width 1280 \
-        --frames 81 \
-        --seed 44 \
-        --num_inference_steps 40 \
-        --time_shift 5.0 \
-        --enable_model_cpu_offload
-else
-    echo "Unsupported resolution: $resolution"
-    exit 1
-fi
\ No newline at end of file
diff --git a/shscripts/inference_wanvideo_i2v_fullft.sh b/shscripts/inference_wanvideo_i2v_fullft.sh
deleted file mode 100644
index b041001a..00000000
--- a/shscripts/inference_wanvideo_i2v_fullft.sh
+++ /dev/null
@@ -1,22 +0,0 @@
-ckpt='checkpoints/wan/Wan2.1-I2V-14B-480P/'
-config='configs/008_wanvideo/wan2_1_i2v_14B_480P_fullft.yaml'
-prompt_dir="inputs/i2v/576x1024"
-savedir="results/i2v/wanvideo/480P"
-
-#replace your trained checkpoint
-trained_ckpt="results/train/train_wanvideo_i2v_fullft_20250427220943/checkpoints/only_trained_model/denoiser-000-000000002.ckpt"
-
-
-python3 scripts/inference_new.py \
-    --ckpt_path "$ckpt" \
-    --trained_ckpt "$trained_ckpt" \
-    --config "$config" \
-    --prompt_dir "$prompt_dir" \
-    --savedir "$savedir" \
-    --height 480 \
-    --width 832 \
-    --frames 81 \
-    --seed 44 \
-    --num_inference_steps 40 \
-    --time_shift 3.0 \
-    --enable_model_cpu_offload
\ No newline at end of file
diff --git a/shscripts/inference_wanvideo_i2v_lora.sh b/shscripts/inference_wanvideo_i2v_lora.sh
deleted file mode 100644
index 8ffae5ed..00000000
--- a/shscripts/inference_wanvideo_i2v_lora.sh
+++ /dev/null
@@ -1,22 +0,0 @@
-ckpt='checkpoints/wan/Wan2.1-I2V-14B-480P/'
-config='configs/008_wanvideo/wan2_1_i2v_14B_480P_lora.yaml'
-prompt_dir="checkpoints/benchmark/i2v"
-savedir="results/i2v/wanvideo/480P"
-
-#replace your trained checkpoint
-trained_ckpt="results/train/train_wanvideo_i2v_lora_20250429045426/checkpoints/only_trained_model/denoiser-000-000000050.ckpt"
-
-
-python3 scripts/inference_new.py \
-    --ckpt_path "$ckpt" \
-    --trained_ckpt "$trained_ckpt" \
-    --config "$config" \
-    --prompt_dir "$prompt_dir" \
-    --savedir "$savedir" \
-    --height 480 \
-    --width 832 \
-    --frames 81 \
-    --seed 44 \
-    --num_inference_steps 40 \
-    --time_shift 3.0 \
-    --enable_model_cpu_offload
\ No newline at end of file
diff --git a/shscripts/inference_wanvideo_t2v.sh b/shscripts/inference_wanvideo_t2v.sh
deleted file mode 100644
index 4e4066fb..00000000
--- a/shscripts/inference_wanvideo_t2v.sh
+++ /dev/null
@@ -1,41 +0,0 @@
-resolution="720P"
-
-if [ "$resolution" = "480P" ]; then
-    ckpt='checkpoints/wan/Wan2.1-T2V-14B/'
-    config='configs/008_wanvideo/wan2_1_t2v_14B.yaml'
-    prompt_file="inputs/t2v/prompts.txt"
-    savedir="results/t2v/wanvideo/480P"
-
-    python3 scripts/inference_new.py \
-        --ckpt_path "$ckpt" \
-        --config "$config" \
-        --prompt_file "$prompt_file" \
-        --savedir "$savedir" \
-        --height 480 \
-        --width 832 \
-        --frames 81 \
-        --seed 44 \
-        --time_shift 3.0 \
-        --num_inference_steps 50 \
-        --enable_model_cpu_offload
-elif [ "$resolution" = "720P" ]; then
-    ckpt='checkpoints/wan/Wan2.1-T2V-14B/'
-    config='configs/008_wanvideo/wan2_1_t2v_14B.yaml'
-    prompt_file="inputs/t2v/prompts.txt"
-    savedir="results/t2v/wanvideo/720P"
-
-    python3 scripts/inference_new.py \
-        --ckpt_path "$ckpt" \
-        --config "$config" \
-        --prompt_file "$prompt_file" \
-        --savedir "$savedir" \
-        --height 720 \
-        --width 1280 \
-        --frames 81 \
-        --seed 44 \
-        --time_shift 5.0 \
-        --num_inference_steps 50 \
-else
-    echo "Unsupported resolution: $resolution"
-    exit 1
-fi
\ No newline at end of file
diff --git a/shscripts/inference_wanvideo_t2v_fullft.sh b/shscripts/inference_wanvideo_t2v_fullft.sh
deleted file mode 100644
index 703ec91d..00000000
--- a/shscripts/inference_wanvideo_t2v_fullft.sh
+++ /dev/null
@@ -1,21 +0,0 @@
-ckpt='checkpoints/wan/Wan2.1-T2V-14B/'
-config='configs/008_wanvideo/wan2_1_t2v_14B_fullft.yaml'
-prompt_file="inputs/t2v/prompts.txt"
-savedir="results/t2v/wanvideo/480P"
-
-#replace your trained checkpoint
-trained_ckpt="results/train/train_wanvideo_t2v_fullft_20250429045309/checkpoints/only_trained_model/denoiser-000-000000050.ckpt"
-
-python3 scripts/inference_new.py \
-    --ckpt_path "$ckpt" \
-    --trained_ckpt "$trained_ckpt" \
-    --config "$config" \
-    --prompt_file "$prompt_file" \
-    --savedir "$savedir" \
-    --height 480 \
-    --width 832 \
-    --frames 81 \
-    --seed 44 \
-    --time_shift 3.0 \
-    --num_inference_steps 50 \
-    --enable_model_cpu_offload
\ No newline at end of file
diff --git a/shscripts/inference_wanvideo_t2v_lora.sh b/shscripts/inference_wanvideo_t2v_lora.sh
index da6d3227..e6eb7bd4 100644
--- a/shscripts/inference_wanvideo_t2v_lora.sh
+++ b/shscripts/inference_wanvideo_t2v_lora.sh
@@ -1,5 +1,5 @@
 ckpt='checkpoints/wan/Wan2.1-T2V-14B/'
-config='configs/008_wanvideo/wan2_1_t2v_14B_lora.yaml'
+config='configs/inference/presets/wan_domain_lora_smoke.yaml'
 prompt_file="inputs/t2v/prompts.txt"
 savedir="results/t2v/wanvideo/480P"
 
diff --git a/shscripts/train_cogvideox_i2v_fullft.sh b/shscripts/train_cogvideox_i2v_fullft.sh
deleted file mode 100644
index 80556a94..00000000
--- a/shscripts/train_cogvideox_i2v_fullft.sh
+++ /dev/null
@@ -1,22 +0,0 @@
-export TOKENIZERS_PARALLELISM=false
-
-# dependencies
-CONFIG="configs/004_cogvideox/cogvideo5b-i2v-fullft.yaml"   # experiment config
-
-# exp saving directory: ${RESROOT}/${CURRENT_TIME}_${EXPNAME}
-RESROOT="results/train"             # experiment saving directory
-EXPNAME="cogvideox_i2v_5b_fullft"   # experiment name
-CURRENT_TIME=$(date +%Y%m%d%H%M%S)  # current time
-DATAPATH="data/apply_lipstick/metadata.csv"
-
-# run
-python scripts/train.py \
--t \
---base $CONFIG \
---logdir $RESROOT \
---name "$CURRENT_TIME"_$EXPNAME \
---devices '0,1,2,3' \
-lightning.trainer.num_nodes=1 \
-data.params.train.params.csv_path=$DATAPATH \
-data.params.validation.params.csv_path=$DATAPATH \
---auto_resume
diff --git a/shscripts/train_cogvideox_i2v_lora.sh b/shscripts/train_cogvideox_i2v_lora.sh
deleted file mode 100644
index a27b2cbc..00000000
--- a/shscripts/train_cogvideox_i2v_lora.sh
+++ /dev/null
@@ -1,22 +0,0 @@
-export TOKENIZERS_PARALLELISM=false
-
-# dependencies
-CONFIG="configs/004_cogvideox/cogvideo5b-i2v.yaml"   # experiment config
-
-# exp saving directory: ${RESROOT}/${CURRENT_TIME}_${EXPNAME}
-RESROOT="results/train"             # experiment saving directory
-EXPNAME="cogvideox_i2v_5b"          # experiment name
-CURRENT_TIME=$(date +%Y%m%d%H%M%S)  # current time
-DATAPATH="data/apply_lipstick/metadata.csv"
-
-# run
-python scripts/train.py \
--t \
---base $CONFIG \
---logdir $RESROOT \
---name "$CURRENT_TIME"_$EXPNAME \
---devices '0,' \
-lightning.trainer.num_nodes=1 \
-data.params.train.params.csv_path=$DATAPATH \
-data.params.validation.params.csv_path=$DATAPATH \
---auto_resume
diff --git a/shscripts/train_cogvideox_t2v_fullft.sh b/shscripts/train_cogvideox_t2v_fullft.sh
deleted file mode 100644
index 4a8adbdc..00000000
--- a/shscripts/train_cogvideox_t2v_fullft.sh
+++ /dev/null
@@ -1,22 +0,0 @@
-export TOKENIZERS_PARALLELISM=false
-
-# dependencies
-CONFIG="configs/004_cogvideox/cogvideo5b-t2v-fullft.yaml"   # experiment config: ‘configs/004_cogvideox/cogvideo2b.yaml’ or 'configs/004_cogvideox/cogvideo5b.yaml'
-
-# exp saving directory: ${RESROOT}/${CURRENT_TIME}_${EXPNAME}
-RESROOT="results/train"             # experiment saving directory
-EXPNAME="cogvideox_t2v_5b_fullft"          # experiment name
-CURRENT_TIME=$(date +%Y%m%d%H%M%S)  # current time
-DATAPATH="data/apply_lipstick/metadata.csv"
-
-# run
-python scripts/train.py \
--t \
---base $CONFIG \
---logdir $RESROOT \
---name "$CURRENT_TIME"_$EXPNAME \
---devices '0,1,2,3' \
-lightning.trainer.num_nodes=1 \
-data.params.train.params.csv_path=$DATAPATH \
-data.params.validation.params.csv_path=$DATAPATH \
---auto_resume
diff --git a/shscripts/train_cogvideox_t2v_lora.sh b/shscripts/train_cogvideox_t2v_lora.sh
deleted file mode 100644
index b0d8eeea..00000000
--- a/shscripts/train_cogvideox_t2v_lora.sh
+++ /dev/null
@@ -1,22 +0,0 @@
-export TOKENIZERS_PARALLELISM=false
-
-# dependencies
-CONFIG='configs/004_cogvideox/cogvideo5b.yaml'   # experiment config: ‘configs/004_cogvideox/cogvideo2b.yaml’ or 'configs/004_cogvideox/cogvideo5b.yaml'
-
-# exp saving directory: ${RESROOT}/${CURRENT_TIME}_${EXPNAME}
-RESROOT="results/train"             # experiment saving directory
-EXPNAME="cogvideox_t2v_5b"          # experiment name
-CURRENT_TIME=$(date +%Y%m%d%H%M%S)  # current time
-DATAPATH="data/apply_lipstick/metadata.csv"
-
-# run
-python scripts/train.py \
--t \
---base $CONFIG \
---logdir $RESROOT \
---name "$CURRENT_TIME"_$EXPNAME \
---devices '0,' \
-lightning.trainer.num_nodes=1 \
-data.params.train.params.csv_path=$DATAPATH \
-data.params.validation.params.csv_path=$DATAPATH \
---auto_resume
diff --git a/shscripts/train_dynamicrafter.sh b/shscripts/train_dynamicrafter.sh
deleted file mode 100644
index 325ca3c3..00000000
--- a/shscripts/train_dynamicrafter.sh
+++ /dev/null
@@ -1,24 +0,0 @@
-export TOKENIZERS_PARALLELISM=false
-
-
-# dependencies
-SDCKPT="checkpoints/stablediffusion/v2-1_512-ema/model.ckpt"
-# DCCKPT="checkpoints/dynamicrafter/i2v_576x1024/model.ckpt"
-DCCKPT="checkpoints/dynamicrafter/i2v_576x1024/model_converted.ckpt"
-
-EXPNAME="002_dynamicrafterft_1024"                            # experiment name
-CONFIG='configs/002_dynamicrafter/dc_i2v_1024.yaml' # experiment config
-RESROOT="results/train"                               # experiment saving directory
-
-# run
-current_time=$(date +%Y%m%d%H%M%S)
-python scripts/train.py \
--t \
---name "$current_time"_$EXPNAME \
---base $CONFIG \
---logdir $RESROOT \
---sdckpt $SDCKPT \
---ckpt $DCCKPT \
---devices '0,' \
-lightning.trainer.num_nodes=1 \
---auto_resume
diff --git a/shscripts/train_flux.sh b/shscripts/train_flux.sh
index d47461a2..ce0cbeca 100755
--- a/shscripts/train_flux.sh
+++ b/shscripts/train_flux.sh
@@ -1,12 +1,8 @@
+#!/usr/bin/env bash
+# Flux LoRA fine-tuning via first-party Diffusers trainer (replaces legacy SimpleTuner train_flux.py).
 export TOKENIZERS_PARALLELISM=false
-export CONFIG_PATH="configs/006_flux/config"
-export DATACONFIG_PATH="configs/006_flux/multidatabackend"
-export CONFIG_BACKEND="json"
 
-accelerate launch \
---mixed_precision="bf16" \
---num_processes="1" \
---num_machines="1" \
-scripts/train_flux.py \
---config_path="$CONFIG_PATH.$CONFIG_BACKEND" \
---data_config_path="$DATACONFIG_PATH.$CONFIG_BACKEND" \
+poetry run train-flux-lora \
+  --config_path configs/domain/flux_t2i.json \
+  --data_config_path configs/domain/flux_t2i_data.json \
+  "$@"
diff --git a/shscripts/train_hunyuanvideo_t2v_lora.sh b/shscripts/train_hunyuanvideo_t2v_lora.sh
deleted file mode 100644
index 2d716dd0..00000000
--- a/shscripts/train_hunyuanvideo_t2v_lora.sh
+++ /dev/null
@@ -1,18 +0,0 @@
-export TOKENIZERS_PARALLELISM=false
-
-# dependencies
-CONFIG="configs/007_hunyuanvideo/hunyuanvideo_t2v_diffuser_lora.yaml"   # experiment config 
-
-# exp saving directory: ${RESROOT}/${CURRENT_TIME}_${EXPNAME}
-RESROOT="results/train"             # experiment saving directory
-EXPNAME="hunyuanvideo_t2v_lora"          # experiment name 
-CURRENT_TIME=$(date +%Y%m%d%H%M%S)  # current time
-
-python scripts/train.py \
--t \
---base $CONFIG \
---logdir $RESROOT \
---name "$CURRENT_TIME"_$EXPNAME \
---devices '0,1' \
-lightning.trainer.num_nodes=1 \
---auto_resume
\ No newline at end of file
diff --git a/shscripts/train_opensorav10.sh b/shscripts/train_opensorav10.sh
deleted file mode 100644
index 325a10ee..00000000
--- a/shscripts/train_opensorav10.sh
+++ /dev/null
@@ -1,16 +0,0 @@
-export TOKENIZERS_PARALLELISM=false
-
-current_time=$(date +%Y%m%d%H%M%S)
-
-EXPNAME="train_opensora_t2v512"                            # experiment name
-CONFIG='configs/003_opensora/opensorav10_256x256.yaml' # experiment config
-LOGDIR="./results"                                     # experiment saving directory
-
-# run
-python scripts/train.py \
--t --devices '0,' \
-lightning.trainer.num_nodes=1 \
---base $CONFIG \
---name "$current_time"_$EXPNAME \
---logdir $LOGDIR \
---auto_resume
diff --git a/shscripts/train_videocrafter_lora.sh b/shscripts/train_videocrafter_lora.sh
deleted file mode 100644
index 300e3d58..00000000
--- a/shscripts/train_videocrafter_lora.sh
+++ /dev/null
@@ -1,24 +0,0 @@
-export TOKENIZERS_PARALLELISM=false
-
-
-# dependencies
-SDCKPT="checkpoints/stablediffusion/v2-1_512-ema/model.ckpt"
-VC2CKPT="checkpoints/videocrafter/t2v_v2_512/model.ckpt"
-# LORACKPT="checkpoints/lora/512/lora.ckpt"
-
-# exp settings
-EXPNAME="videocrafter2_t2v_lora"                            # experiment name
-CONFIG='configs/001_videocrafter2/vc2_t2v_lora.yaml' # experiment config
-RESROOT="results/train"                               # experiment saving directory
-
-# run
-current_time=$(date +%Y%m%d%H%M%S)
-python scripts/train.py \
--t \
---name "$current_time"_$EXPNAME \
---base $CONFIG \
---logdir $RESROOT \
---ckpt $VC2CKPT \
---devices '0,' \
-lightning.trainer.num_nodes=1 \
---auto_resume
diff --git a/shscripts/train_videocrafter_v2.sh b/shscripts/train_videocrafter_v2.sh
deleted file mode 100644
index a6e01c05..00000000
--- a/shscripts/train_videocrafter_v2.sh
+++ /dev/null
@@ -1,20 +0,0 @@
-export TOKENIZERS_PARALLELISM=false
-
-VC2CKPT="checkpoints/videocrafter/t2v_v2_512_split/"  # pretrained checkpoint of videocrafter2
-CONFIG='configs/001_videocrafter2/vc2_t2v_320x512.yaml'             # experiment config: model+data+training
-
-# exp saving directory: ${RESROOT}/${CURRENT_TIME}_${EXPNAME}
-RESROOT="results/train"                                             # root directory for saving multiple experiments
-EXPNAME="videocrafter2_320x512"                                     # experiment name 
-CURRENT_TIME=$(date +%Y%m%d%H%M%S)                                  # current time
-
-# run
-python scripts/train_new.py \
--t \
---ckpt $VC2CKPT \
---base $CONFIG \
---logdir $RESROOT \
---name ${CURRENT_TIME}_${EXPNAME} \
---devices '0,' \
---auto_resume
-
diff --git a/shscripts/train_wanvideo_i2v_fullft.sh b/shscripts/train_wanvideo_i2v_fullft.sh
deleted file mode 100644
index 7684693b..00000000
--- a/shscripts/train_wanvideo_i2v_fullft.sh
+++ /dev/null
@@ -1,17 +0,0 @@
-export TOKENIZERS_PARALLELISM=false
-
-CKPT="checkpoints/wan/Wan2.1-I2V-14B-480P"
-CONFIG='configs/008_wanvideo/wan2_1_i2v_14B_480P_fullft.yaml'        
-
-RESROOT="results/train"                                             
-EXPNAME="train_wanvideo_i2v_fullft"                                     
-CURRENT_TIME=$(date +%Y%m%d%H%M%S)                                  
-
-
-python scripts/train_new.py -t \
---ckpt $CKPT \
---base $CONFIG \
---logdir $RESROOT \
---name "$EXPNAME"_"$CURRENT_TIME" \
---devices 0, \
---auto_resume
\ No newline at end of file
diff --git a/shscripts/train_wanvideo_i2v_lora.sh b/shscripts/train_wanvideo_i2v_lora.sh
deleted file mode 100644
index 6ba1b958..00000000
--- a/shscripts/train_wanvideo_i2v_lora.sh
+++ /dev/null
@@ -1,17 +0,0 @@
-export TOKENIZERS_PARALLELISM=false
-
-CKPT="checkpoints/wan/Wan2.1-I2V-14B-480P"
-CONFIG='configs/008_wanvideo/wan2_1_i2v_14B_480P_lora.yaml'        
-
-RESROOT="results/train"                                             
-EXPNAME="train_wanvideo_i2v_lora"                                     
-CURRENT_TIME=$(date +%Y%m%d%H%M%S)                                  
-
-
-python scripts/train_new.py -t \
---ckpt $CKPT \
---base $CONFIG \
---logdir $RESROOT \
---name "$EXPNAME"_"$CURRENT_TIME" \
---devices 0, \
---auto_resume
\ No newline at end of file
diff --git a/shscripts/train_wanvideo_t2v_fullft.sh b/shscripts/train_wanvideo_t2v_fullft.sh
deleted file mode 100644
index f9c777e5..00000000
--- a/shscripts/train_wanvideo_t2v_fullft.sh
+++ /dev/null
@@ -1,17 +0,0 @@
-export TOKENIZERS_PARALLELISM=false
-
-CKPT="checkpoints/wan/Wan2.1-T2V-14B"
-CONFIG='configs/008_wanvideo/wan2_1_t2v_14B_fullft.yaml'        
-
-RESROOT="results/train"                                             
-EXPNAME="train_wanvideo_t2v_fullft"                                     
-CURRENT_TIME=$(date +%Y%m%d%H%M%S)                                  
-
-
-python scripts/train_new.py -t \
---ckpt $CKPT \
---base $CONFIG \
---logdir $RESROOT \
---name "$EXPNAME"_"$CURRENT_TIME" \
---devices 0, \
---auto_resume
\ No newline at end of file
diff --git a/shscripts/train_wanvideo_t2v_lora.sh b/shscripts/train_wanvideo_t2v_lora.sh
deleted file mode 100644
index efd4c684..00000000
--- a/shscripts/train_wanvideo_t2v_lora.sh
+++ /dev/null
@@ -1,17 +0,0 @@
-export TOKENIZERS_PARALLELISM=false
-
-CKPT="checkpoints/wan_test/Wan2.1-T2V-14B"
-CONFIG='configs/008_wanvideo/wan2_1_t2v_14B_lora.yaml'        
-
-RESROOT="results/train"                                             
-EXPNAME="train_wanvideo_t2v_lora"                                     
-CURRENT_TIME=$(date +%Y%m%d%H%M%S)                                  
-
-
-python scripts/train_new.py -t \
---ckpt $CKPT \
---base $CONFIG \
---logdir $RESROOT \
---name "$EXPNAME"_"$CURRENT_TIME" \
---devices 0, \
---auto_resume
\ No newline at end of file
diff --git a/tests/conftest.py b/tests/conftest.py
new file mode 100644
index 00000000..a9989a6f
--- /dev/null
+++ b/tests/conftest.py
@@ -0,0 +1,71 @@
+import warnings
+
+import pytest
+
+try:
+    from sentry_sdk.hub import SentryHubDeprecationWarning
+except ImportError:
+    SentryHubDeprecationWarning = DeprecationWarning  # type: ignore[misc,assignment]
+
+
+@pytest.fixture(autouse=True)
+def _suppress_third_party_import_warnings():
+    """Optional third-party deps emit noisy warnings on import-only smoke tests."""
+    with warnings.catch_warnings():
+        warnings.filterwarnings("ignore", category=SentryHubDeprecationWarning)
+        warnings.filterwarnings(
+            "ignore",
+            message="builtin type SwigPyPacked has no __module__ attribute",
+            category=DeprecationWarning,
+        )
+        warnings.filterwarnings(
+            "ignore",
+            message="builtin type SwigPyObject has no __module__ attribute",
+            category=DeprecationWarning,
+        )
+        warnings.filterwarnings(
+            "ignore",
+            message="User provided device_type of 'cuda', but CUDA is not available",
+            category=UserWarning,
+        )
+        warnings.filterwarnings(
+            "ignore",
+            message="`torch.cuda.amp.autocast",
+            category=FutureWarning,
+        )
+        warnings.filterwarnings(
+            "ignore",
+            message="`torch.utils._pytree._register_pytree_node` is deprecated",
+            category=FutureWarning,
+            module=r"colossalai\..*",
+        )
+        warnings.filterwarnings(
+            "ignore",
+            message="is already registered as pytree node",
+            category=UserWarning,
+            module=r"torch\..*",
+        )
+        yield
+
+
+def pytest_collection_modifyitems(config, items):
+    try:
+        import torch
+        from torch import version as torch_version
+    except (ImportError, OSError, ValueError):
+        return
+
+    if not torch.cuda.is_available():
+        skip_gpu = pytest.mark.skip(reason="GPU not available")
+        for item in items:
+            if "gpu" in item.keywords:
+                item.add_marker(skip_gpu)
+
+    is_rocm = (
+        torch.cuda.is_available() and getattr(torch_version, "hip", None) is not None
+    )
+    if not is_rocm:
+        skip_rocm = pytest.mark.skip(reason="ROCm not available")
+        for item in items:
+            if "rocm" in item.keywords:
+                item.add_marker(skip_rocm)
diff --git a/tests/datasets/test_dataset_from_csv.py b/tests/datasets/test_dataset_from_csv.py
index 3d0fa1c1..5ab3f522 100644
--- a/tests/datasets/test_dataset_from_csv.py
+++ b/tests/datasets/test_dataset_from_csv.py
@@ -1,25 +1,43 @@
 import sys
+from pathlib import Path
 
 sys.path.append(".")
 
-import os
 import unittest
 
 import videotuna.data.transforms as transforms
 from videotuna.data.datasets import DatasetFromCSV
 
+REPO_ROOT = Path(__file__).resolve().parents[1]
+TOY_VIDEO_CSV = REPO_ROOT / "videotuna/data/anno_files/toy_video_dataset.csv"
+TOY_IMAGE_CSV = REPO_ROOT / "videotuna/data/anno_files/toy_image_dataset.csv"
+TOY_VIDEOS_DIR = REPO_ROOT / "videotuna/data/toy_videos"
+TOY_IMAGES_DIR = REPO_ROOT / "videotuna/data/toy_images"
+
+
+def _use_dummy_video(transform_video):
+    if not TOY_VIDEOS_DIR.exists():
+        transform_video.transforms[0] = transforms.LoadDummyVideo(
+            (100, 100), probs_fail=0.5
+        )
+
+
+def _use_dummy_image(transform_image):
+    if not TOY_IMAGES_DIR.exists():
+        transform_image.transforms[0] = transforms.LoadDummyImage(probs_fail=0.5)
+
+
+def _has_toy_images():
+    return TOY_IMAGE_CSV.is_file()
 
-class TestDatasets(unittest.TestCase):
 
+class TestDatasets(unittest.TestCase):
     def test_video_dataset_from_csv(self):
         transform_video = transforms.get_transforms_video()
-        if not os.path.exists("videotuna/data/toy_videos"):
-            transform_video.transforms[0] = transforms.LoadDummyVideo(
-                (100, 100), probs_fail=0.5
-            )
+        _use_dummy_video(transform_video)
         dataset = DatasetFromCSV(
-            "videotuna/data/anno_files/toy_video_dataset.csv",
-            "videotuna/data/toy_videos",
+            TOY_VIDEO_CSV,
+            str(TOY_VIDEOS_DIR),
             transform={"video": transform_video},
         )
         for i in range(min(5, len(dataset))):
@@ -34,7 +52,8 @@ def test_video_dataset_from_csv(self):
 
         transform_video.transforms[0] = transforms.LoadDummyVideo(probs_fail=0.4)
         dataset = DatasetFromCSV(
-            "videotuna/data/anno_files/toy_video_dataset.csv",
+            TOY_VIDEO_CSV,
+            str(TOY_VIDEOS_DIR),
             transform={"video": transform_video},
         )
         for i in range(min(5, len(dataset))):
@@ -45,14 +64,14 @@ def test_video_dataset_from_csv(self):
 
     def test_video_dataset_wo_transforms_from_csv(self):
         dataset = DatasetFromCSV(
-            "videotuna/data/anno_files/toy_video_dataset.csv",
-            "videotuna/data/toy_videos",
+            TOY_VIDEO_CSV,
+            str(TOY_VIDEOS_DIR),
         )
-        if not os.path.exists("videotuna/data/toy_videos"):
+        if not TOY_VIDEOS_DIR.exists():
             transform_video = dataset.transform["video"]
             transform_video.transforms[0] = transforms.LoadDummyVideo(probs_fail=0.5)
             dataset = DatasetFromCSV(
-                "videotuna/data/anno_files/toy_video_dataset.csv",
+                TOY_VIDEO_CSV,
                 transform={"video": transform_video},
             )
         for i in range(min(5, len(dataset))):
@@ -65,13 +84,13 @@ def test_video_dataset_wo_transforms_from_csv(self):
         self.assertEqual(len(dataset), 128)
         self.assertEqual(dataset[0]["video"].shape[2], 256)
 
+    @unittest.skipUnless(_has_toy_images(), "toy image annotations not available")
     def test_image_dataset_from_csv(self):
         transform_image = transforms.get_transforms_image()
-        if not os.path.exists("videotuna/data/toy_images"):
-            transform_image.transforms[0] = transforms.LoadDummyImage(probs_fail=0.5)
+        _use_dummy_image(transform_image)
         dataset = DatasetFromCSV(
-            "videotuna/data/anno_files/toy_image_dataset.csv",
-            "videotuna/data/toy_images",
+            TOY_IMAGE_CSV,
+            str(TOY_IMAGES_DIR),
             transform={"image": transform_image},
         )
         for i in range(min(5, len(dataset))):
@@ -84,14 +103,14 @@ def test_image_dataset_from_csv(self):
         self.assertEqual(len(dataset), 16)
         self.assertEqual(dataset[0]["video"].shape[2], 256)
 
+    @unittest.skipUnless(_has_toy_images(), "toy image annotations not available")
     def test_multi_res(self):
         # Test Video
         transform_video = transforms.get_transforms_video()
-        if not os.path.exists("videotuna/data/toy_videos"):
-            transform_video.transforms[0] = transforms.LoadDummyVideo(probs_fail=0.5)
+        _use_dummy_video(transform_video)
         dataset = DatasetFromCSV(
-            "videotuna/data/anno_files/toy_video_dataset.csv",
-            "videotuna/data/toy_videos",
+            TOY_VIDEO_CSV,
+            str(TOY_VIDEOS_DIR),
             transform={"video": transform_video},
             use_multi_res=True,
         )
@@ -107,11 +126,10 @@ def test_multi_res(self):
 
         # Test Image
         transform_image = transforms.get_transforms_image()
-        if not os.path.exists("videotuna/data/toy_images"):
-            transform_image.transforms[0] = transforms.LoadDummyImage(probs_fail=0.5)
+        _use_dummy_image(transform_image)
         dataset = DatasetFromCSV(
-            "videotuna/data/anno_files/toy_image_dataset.csv",
-            "videotuna/data/toy_images",
+            TOY_IMAGE_CSV,
+            str(TOY_IMAGES_DIR),
             transform={"image": transform_image},
             use_multi_res=True,
         )
@@ -125,20 +143,19 @@ def test_multi_res(self):
         self.assertEqual(len(dataset), 16)
         self.assertEqual(dataset[0]["video"].shape[2], 256)
 
+    @unittest.skipUnless(_has_toy_images(), "toy image annotations not available")
     def test_concat_dataset_from_csv(self):
         transform_video = transforms.get_transforms_video()
-        if not os.path.exists("videotuna/data/toy_videos"):
-            transform_video.transforms[0] = transforms.LoadDummyVideo(probs_fail=0.5)
+        _use_dummy_video(transform_video)
 
         transform_image = transforms.get_transforms_image()
-        if not os.path.exists("videotuna/data/toy_images"):
-            transform_image.transforms[0] = transforms.LoadDummyImage(probs_fail=0.5)
+        _use_dummy_image(transform_image)
         dataset = DatasetFromCSV(
             [
-                "videotuna/data/anno_files/toy_video_dataset.csv",
-                "videotuna/data/anno_files/toy_image_dataset.csv",
+                TOY_VIDEO_CSV,
+                TOY_IMAGE_CSV,
             ],
-            ["videotuna/data/toy_videos", "videotuna/data/toy_images"],
+            [str(TOY_VIDEOS_DIR), str(TOY_IMAGES_DIR)],
             transform={"video": transform_video, "image": transform_image},
         )
         for i in range(min(5, len(dataset))):
@@ -153,11 +170,10 @@ def test_concat_dataset_from_csv(self):
 
     def test_anno_wo_meta_info(self):
         transform_video = transforms.get_transforms_video()
-        if not os.path.exists("videotuna/data/toy_videos"):
-            transform_video.transforms[0] = transforms.LoadDummyVideo(probs_fail=0.5)
+        _use_dummy_video(transform_video)
         dataset = DatasetFromCSV(
-            "videotuna/data/anno_files/toy_video_dataset.csv",
-            "videotuna/data/toy_videos",
+            TOY_VIDEO_CSV,
+            str(TOY_VIDEOS_DIR),
             transform={"video": transform_video},
             use_multi_res=True,
         )
@@ -178,11 +194,10 @@ def test_anno_wo_meta_info(self):
 
     def test_anno_wo_meta_info_wo_multi_res(self):
         transform_video = transforms.get_transforms_video()
-        if not os.path.exists("videotuna/data/toy_videos"):
-            transform_video.transforms[0] = transforms.LoadDummyVideo(probs_fail=0.5)
+        _use_dummy_video(transform_video)
         dataset = DatasetFromCSV(
-            "videotuna/data/anno_files/toy_video_dataset.csv",
-            "videotuna/data/toy_videos",
+            TOY_VIDEO_CSV,
+            str(TOY_VIDEOS_DIR),
             transform={"video": transform_video},
             use_multi_res=False,
         )
@@ -203,13 +218,12 @@ def test_anno_wo_meta_info_wo_multi_res(self):
 
     def test_video_dataset_from_csv_with_split(self):
         transform_video = transforms.get_transforms_video()
-        if not os.path.exists("videotuna/data/toy_videos"):
-            transform_video.transforms[0] = transforms.LoadDummyVideo(probs_fail=0.5)
+        _use_dummy_video(transform_video)
 
         # Test Training Dataset
         train_dataset = DatasetFromCSV(
-            "videotuna/data/anno_files/toy_video_dataset.csv",
-            "videotuna/data/toy_videos",
+            TOY_VIDEO_CSV,
+            str(TOY_VIDEOS_DIR),
             transform={"video": transform_video},
             split_val=True,
         )
@@ -225,8 +239,8 @@ def test_video_dataset_from_csv_with_split(self):
 
         # Test Validation Dataset
         val_dataset = DatasetFromCSV(
-            "videotuna/data/anno_files/toy_video_dataset.csv",
-            "videotuna/data/toy_videos",
+            TOY_VIDEO_CSV,
+            str(TOY_VIDEOS_DIR),
             transform={"video": transform_video},
             train=False,
             split_val=True,
@@ -240,7 +254,7 @@ def test_video_dataset_from_csv_with_split(self):
         print(f"len(dataset): {len(val_dataset)}")
         self.assertLessEqual(len(val_dataset), 128)
         self.assertEqual(val_dataset[0]["video"].shape[2], 256)
-        # Check if the sum of the lengths of the training and validation datasets is equal to the total number of samples
+        # Check train + validation lengths sum to total sample count
         self.assertEqual(len(train_dataset) + len(val_dataset), 128)
 
 
diff --git a/tests/test_attention_backend.py b/tests/test_attention_backend.py
new file mode 100644
index 00000000..5ac666fe
--- /dev/null
+++ b/tests/test_attention_backend.py
@@ -0,0 +1,77 @@
+"""Tests for ROCm-safe attention backend selection."""
+
+import os
+from unittest import mock
+
+import pytest
+
+from videotuna.utils import attention
+
+
+def test_auto_backend_cpu_fallback_eager():
+    with mock.patch.object(attention, "detect_compute_backend", return_value="cpu"):
+        with mock.patch.object(attention, "gpu_is_available", return_value=False):
+            with mock.patch.dict(os.environ, {"VIDEOTUNA_ATTN_BACKEND": "auto"}):
+                assert attention.get_attn_backend() == "eager"
+
+
+def test_flash_rejected_on_cpu():
+    with mock.patch.object(attention, "detect_compute_backend", return_value="cpu"):
+        with mock.patch.dict(os.environ, {"VIDEOTUNA_ATTN_BACKEND": "flash"}):
+            with pytest.raises(RuntimeError, match="not supported on CPU"):
+                attention.get_attn_backend()
+
+
+def test_maybe_compile_noop_without_gpu():
+    import torch.nn as nn
+
+    mod = nn.Linear(4, 4)
+    with mock.patch.object(attention, "gpu_is_available", return_value=False):
+        with mock.patch.dict(os.environ, {"VIDEOTUNA_TORCH_COMPILE": "1"}):
+            assert attention.maybe_compile_denoiser(mod) is mod
+
+
+def test_auto_backend_rocm_prefers_sdpa():
+    with mock.patch.object(attention, "detect_compute_backend", return_value="rocm"):
+        with mock.patch.object(attention, "gpu_is_available", return_value=True):
+            with mock.patch.dict(os.environ, {"VIDEOTUNA_ATTN_BACKEND": "auto"}):
+                assert attention.get_attn_backend() == "sdpa"
+
+
+def test_auto_backend_rocm_cpu_fallback_eager():
+    with mock.patch.object(attention, "detect_compute_backend", return_value="rocm"):
+        with mock.patch.object(attention, "gpu_is_available", return_value=False):
+            with mock.patch.dict(os.environ, {"VIDEOTUNA_ATTN_BACKEND": "auto"}):
+                assert attention.get_attn_backend() == "eager"
+
+
+def test_flash_rejected_on_rocm():
+    with mock.patch.object(attention, "detect_compute_backend", return_value="rocm"):
+        with mock.patch.object(attention, "_FLASH_ATTN_AVAILABLE", True):
+            with mock.patch.dict(os.environ, {"VIDEOTUNA_ATTN_BACKEND": "flash"}):
+                with pytest.raises(RuntimeError, match="not supported on AMD ROCm"):
+                    attention.get_attn_backend()
+
+
+def test_auto_backend_cuda_uses_flash_when_available():
+    with mock.patch.object(attention, "detect_compute_backend", return_value="cuda"):
+        with mock.patch.object(attention, "gpu_is_available", return_value=True):
+            with mock.patch.object(attention, "_FLASH_ATTN_AVAILABLE", True):
+                with mock.patch.dict(os.environ, {"VIDEOTUNA_ATTN_BACKEND": "auto"}):
+                    assert attention.get_attn_backend() == "flash"
+
+
+def test_sdpa_context_rocm_excludes_flash_kernel():
+    with mock.patch.object(attention, "gpu_is_available", return_value=True):
+        with mock.patch.object(
+            attention, "detect_compute_backend", return_value="rocm"
+        ):
+            with mock.patch("torch.nn.attention.sdpa_kernel") as mock_sdpa:
+                mock_sdpa.return_value.__enter__ = mock.Mock(return_value=None)
+                mock_sdpa.return_value.__exit__ = mock.Mock(return_value=False)
+                with attention._sdpa_context():
+                    pass
+                backends = mock_sdpa.call_args[0][0]
+                backend_names = [b.name for b in backends]
+                assert "FLASH_ATTENTION" not in backend_names
+                assert "EFFICIENT_ATTENTION" in backend_names
diff --git a/tests/test_cloud_provisioning_scripts.py b/tests/test_cloud_provisioning_scripts.py
new file mode 100644
index 00000000..38705213
--- /dev/null
+++ b/tests/test_cloud_provisioning_scripts.py
@@ -0,0 +1,180 @@
+"""CPU tests for cloud/vast provisioning scripts and configs (no GPU)."""
+
+import os
+import re
+from pathlib import Path
+
+import yaml
+from omegaconf import OmegaConf
+
+from videotuna.training.flux_lora.config import load_train_config
+
+REPO_ROOT = Path(__file__).resolve().parents[1]
+CLOUD_VAST = REPO_ROOT / "cloud" / "vast"
+
+EXECUTABLE_SCRIPTS = [
+    CLOUD_VAST / "bootstrap.sh",
+    CLOUD_VAST / "run-train.sh",
+    CLOUD_VAST / "run-smoke-train.sh",
+]
+
+PROVISIONING_FILES = [
+    CLOUD_VAST / "provisioning.yaml",
+    CLOUD_VAST / "bootstrap.sh",
+    CLOUD_VAST / "run-train.sh",
+    CLOUD_VAST / "run-smoke-train.sh",
+]
+
+REQUIRED_POETRY_COMMANDS = [
+    "poetry install",
+    "train-flux-lora",
+    "train-wan2-1-t2v-lora",
+    "install-deepspeed",
+    "test tests/test_import_smoke.py",
+]
+
+SECRET_PATTERNS = [
+    re.compile(r"hf_[a-zA-Z0-9]{20,}"),
+    re.compile(r"sk-[a-zA-Z0-9]{20,}"),
+]
+
+FLUX_CLOUD_SMOKE = REPO_ROOT / "configs" / "domain" / "flux_t2i_cloud_smoke.json"
+WAN_CLOUD_SMOKE = REPO_ROOT / "configs" / "domain" / "wan_t2v_lora_cloud_smoke.yaml"
+FLUX_CLOUD_SMOKE_LEGACY = REPO_ROOT / "configs" / "006_flux" / "cloud_smoke.json"
+WAN_CLOUD_SMOKE_LEGACY = (
+    REPO_ROOT / "configs" / "008_wanvideo" / "wan2_1_t2v_14B_lora_cloud_smoke.yaml"
+)
+FLUX_DATA_CONFIG = REPO_ROOT / "configs" / "domain" / "flux_t2i_data.json"
+
+
+def test_cloud_scripts_exist_and_are_executable():
+    for path in EXECUTABLE_SCRIPTS:
+        assert path.is_file(), f"missing {path}"
+        assert os.access(path, os.X_OK), f"not executable: {path}"
+
+
+def test_provisioning_references_valid_poetry_commands():
+    combined = ""
+    for path in PROVISIONING_FILES:
+        combined += path.read_text(encoding="utf-8") + "\n"
+    for cmd in REQUIRED_POETRY_COMMANDS:
+        assert cmd in combined, f"expected poetry command not found: {cmd}"
+
+
+def test_no_hardcoded_secrets_in_cloud_vast():
+    for path in CLOUD_VAST.rglob("*"):
+        if not path.is_file():
+            continue
+        if path.name.endswith(".example"):
+            continue
+        if "__pycache__" in path.parts or path.suffix == ".pyc":
+            continue
+        text = path.read_text(encoding="utf-8")
+        for pattern in SECRET_PATTERNS:
+            match = pattern.search(text)
+            assert match is None, (
+                f"possible secret in {path.relative_to(REPO_ROOT)}: "
+                f"{match.group()[:12]}..."
+            )
+
+
+def test_provisioning_yaml_structure():
+    prov_path = CLOUD_VAST / "provisioning.yaml"
+    data = yaml.safe_load(prov_path.read_text(encoding="utf-8"))
+    assert data["version"] == 1
+    assert "git_repos" in data
+    assert any(
+        "PrivTune" in r.get("dest", "") or "VideoTuna" in r.get("dest", "")
+        for r in data["git_repos"]
+    )
+    assert "post_commands" in data
+    assert any("bootstrap.sh" in c for c in data["post_commands"])
+
+
+def test_flux_cloud_smoke_config_loads():
+    train_cfg, data_cfg = load_train_config(FLUX_CLOUD_SMOKE, FLUX_DATA_CONFIG)
+    assert train_cfg.max_train_steps == 50
+    assert train_cfg.checkpointing_steps == 25
+    assert train_cfg.output_dir == "results/train/flux-cloud-smoke"
+    assert train_cfg.pretrained_model_name_or_path == "checkpoints/flux/FLUX.1-dev"
+    assert data_cfg.instance_data_dir == "data/t2i/domain"
+
+
+def test_wan_cloud_smoke_yaml_parses():
+    cfg = OmegaConf.load(WAN_CLOUD_SMOKE)
+    assert cfg.train.name == "train_wan_cloud_smoke"
+    assert cfg.train.lightning.trainer.max_epochs == 1
+    ckpt_cb = cfg.train.lightning.callbacks.model_checkpoint.params
+    assert ckpt_cb.every_n_train_steps == 5
+
+
+def test_legacy_cloud_smoke_symlinks_resolve():
+    assert FLUX_CLOUD_SMOKE_LEGACY.is_file()
+    assert FLUX_CLOUD_SMOKE_LEGACY.resolve() == FLUX_CLOUD_SMOKE.resolve()
+    assert WAN_CLOUD_SMOKE_LEGACY.is_file()
+    assert WAN_CLOUD_SMOKE_LEGACY.resolve() == WAN_CLOUD_SMOKE.resolve()
+
+
+def test_env_cloud_example_exists():
+    example = CLOUD_VAST / ".env.cloud.example"
+    assert example.is_file()
+    text = example.read_text(encoding="utf-8")
+    assert "VIDEOTUNA_COMPUTE_BACKEND=cuda" in text
+    assert "TRAIN_PROFILE=" in text
+    assert "HF_TOKEN=" in text
+
+
+def test_supervisor_config_exists():
+    conf = CLOUD_VAST / "supervisor" / "videotuna-train.conf"
+    assert conf.is_file()
+    text = conf.read_text(encoding="utf-8")
+    assert "videotuna-train" in text
+    assert "run-train.sh" in text
+
+
+def test_env_cloud_example_documents_fast_hf_download():
+    example = CLOUD_VAST / ".env.cloud.example"
+    text = example.read_text(encoding="utf-8")
+    assert "VIDEOTUNA_FAST_HF_DOWNLOAD" in text
+    assert "HF_XET_HIGH_PERFORMANCE" in text
+
+
+def test_bootstrap_enables_hf_xet_high_performance():
+    bootstrap = CLOUD_VAST / "bootstrap.sh"
+    text = bootstrap.read_text(encoding="utf-8")
+    assert "VIDEOTUNA_FAST_HF_DOWNLOAD" in text
+    assert "HF_XET_HIGH_PERFORMANCE" in text
+    assert "enable_fast_hf_download" in text
+    assert "provision_retry.py" in text
+
+
+def test_bootstrap_retry_artifacts_exist():
+    assert (CLOUD_VAST / "provision_retry.py").is_file()
+    assert (CLOUD_VAST / "bootstrap-requirements.txt").is_file()
+    reqs = (CLOUD_VAST / "bootstrap-requirements.txt").read_text(encoding="utf-8")
+    assert "tenacity" in reqs
+    assert "pyyaml" in reqs
+
+
+def test_provisioning_includes_wan_train_and_validate_hub_ids():
+    combined = ""
+    for path in PROVISIONING_FILES:
+        combined += path.read_text(encoding="utf-8") + "\n"
+    for repo in (
+        "Wan-AI/Wan2.1-T2V-14B",
+        "Wan-AI/Wan2.1-I2V-14B-480P",
+        "Wan-AI/Wan2.2-T2V-A14B-Diffusers",
+        "Wan-AI/Wan2.2-I2V-A14B-Diffusers",
+    ):
+        assert repo in combined, f"expected hub id in cloud bundle: {repo}"
+
+
+def test_provisioning_yaml_documents_retry_layers():
+    prov_path = CLOUD_VAST / "provisioning.yaml"
+    text = prov_path.read_text(encoding="utf-8")
+    assert "settings:" in text
+    assert "on_failure:" in text
+    assert "provision_retry.py" in text
+    data = yaml.safe_load(text)
+    assert data["settings"]["retry"]["max_attempts"] == 5
+    assert data["on_failure"]["action"] == "continue"
diff --git a/tests/test_config_mapping.py b/tests/test_config_mapping.py
new file mode 100644
index 00000000..fc352aad
--- /dev/null
+++ b/tests/test_config_mapping.py
@@ -0,0 +1,150 @@
+"""Tests for config path mapping validation and application."""
+
+from pathlib import Path
+
+import pytest
+from omegaconf import OmegaConf
+from pydantic import ValidationError
+
+from videotuna.utils.config_mapping import (
+    ConfigMappingError,
+    ConfigPathMappings,
+    apply_config_mappings,
+    config_path_exists,
+    get_config_path,
+)
+
+REPO_ROOT = Path(__file__).resolve().parents[1]
+WAN_T2V_CONFIG = REPO_ROOT / "configs" / "domain" / "wan_t2v_lora.yaml"
+
+
+def _minimal_cfg(*, include_mapping: bool = True) -> OmegaConf:
+    cfg = OmegaConf.create(
+        {
+            "train": {
+                "ckpt": "checkpoints/wan/Wan2.1-T2V-14B",
+            },
+            "flow": {
+                "params": {
+                    "ckpt_path": "checkpoints/wan/old-path",
+                }
+            },
+        }
+    )
+    if include_mapping:
+        cfg.train.mapping = {"train.ckpt": "flow.params.ckpt_path"}
+    return cfg
+
+
+def test_config_path_exists_valid_and_missing():
+    cfg = _minimal_cfg(include_mapping=False)
+
+    assert config_path_exists(cfg, "train.ckpt") is True
+    assert config_path_exists(cfg, "flow.params.ckpt_path") is True
+    assert config_path_exists(cfg, "flow.params.missing") is False
+    assert config_path_exists(cfg, "missing.top") is False
+    assert config_path_exists(cfg, "train.ckpt.bad") is False
+    assert config_path_exists(cfg, "") is False
+
+
+def test_get_config_path_raises_for_missing_path():
+    cfg = _minimal_cfg(include_mapping=False)
+
+    assert get_config_path(cfg, "train.ckpt") == "checkpoints/wan/Wan2.1-T2V-14B"
+    with pytest.raises(ConfigMappingError, match="does not exist"):
+        get_config_path(cfg, "train.missing")
+
+
+def test_apply_train_mapping_valid():
+    cfg = _minimal_cfg()
+
+    apply_config_mappings(cfg, section="train")
+
+    assert cfg.flow.params.ckpt_path == "checkpoints/wan/Wan2.1-T2V-14B"
+
+
+def test_apply_train_mapping_missing_source():
+    cfg = _minimal_cfg()
+    cfg.train.mapping = {"train.no_such": "flow.params.ckpt_path"}
+
+    with pytest.raises(ConfigMappingError) as exc_info:
+        apply_config_mappings(cfg, section="train")
+
+    message = str(exc_info.value)
+    assert "train.mapping" in message
+    assert "source path" in message
+    assert "train.no_such" in message
+    assert "flow.params.ckpt_path" in message
+
+
+def test_apply_train_mapping_missing_target():
+    cfg = _minimal_cfg()
+    cfg.train.mapping = {"train.ckpt": "flow.params.no_such"}
+
+    with pytest.raises(ConfigMappingError) as exc_info:
+        apply_config_mappings(cfg, section="train")
+
+    message = str(exc_info.value)
+    assert "train.mapping" in message
+    assert "target path" in message
+    assert "flow.params.no_such" in message
+    assert "train.ckpt" in message
+
+
+def test_apply_train_mapping_invalid_shape():
+    cfg = _minimal_cfg()
+    cfg.train.mapping = ["train.ckpt", "flow.params.ckpt_path"]
+
+    with pytest.raises(ConfigMappingError, match="must be a mapping"):
+        apply_config_mappings(cfg, section="train")
+
+
+def test_apply_train_mapping_invalid_dot_path():
+    cfg = _minimal_cfg()
+    cfg.train.mapping = {"train..ckpt": "flow.params.ckpt_path"}
+
+    with pytest.raises(ConfigMappingError, match="invalid dot paths"):
+        apply_config_mappings(cfg, section="train")
+
+
+def test_config_path_mappings_rejects_invalid_dot_path():
+    with pytest.raises(ValidationError, match="invalid source path"):
+        ConfigPathMappings(root={"train..ckpt": "flow.params.ckpt_path"})
+
+
+def test_apply_inference_mapping_missing_source():
+    cfg = OmegaConf.create(
+        {
+            "inference": {
+                "ckpt_path": "checkpoints/wan/Wan2.1-T2V-14B",
+                "mapping": {"inference.no_such": "flow.params.ckpt_path"},
+            },
+            "flow": {"params": {"ckpt_path": "checkpoints/wan/Wan2.1-T2V-14B"}},
+        }
+    )
+
+    with pytest.raises(ConfigMappingError) as exc_info:
+        apply_config_mappings(cfg, section="inference")
+
+    message = str(exc_info.value)
+    assert "inference.mapping" in message
+    assert "source path" in message
+    assert "inference.no_such" in message
+
+
+def test_apply_mapping_noop_when_absent():
+    cfg = _minimal_cfg(include_mapping=False)
+    original_ckpt_path = cfg.flow.params.ckpt_path
+
+    apply_config_mappings(cfg, section="train")
+
+    assert cfg.flow.params.ckpt_path == original_ckpt_path
+
+
+def test_wan_domain_yaml_mapping_paths_exist():
+    cfg = OmegaConf.load(WAN_T2V_CONFIG)
+
+    apply_config_mappings(cfg, section="train")
+    apply_config_mappings(cfg, section="inference")
+
+    assert cfg.flow.params.ckpt_path == cfg.train.ckpt
diff --git a/tests/test_coverage_gate.py b/tests/test_coverage_gate.py
new file mode 100644
index 00000000..71a3036b
--- /dev/null
+++ b/tests/test_coverage_gate.py
@@ -0,0 +1,17 @@
+from pathlib import Path
+
+from scripts import CI_SMOKE_TESTS, COVERAGE_GATE_FAIL_UNDER, coverage_gate
+
+
+def test_ci_smoke_tests_exist():
+    repo_root = Path(__file__).resolve().parents[1]
+    for rel_path in CI_SMOKE_TESTS:
+        assert (repo_root / rel_path).is_file(), f"missing CI smoke test: {rel_path}"
+
+
+def test_coverage_gate_entrypoint():
+    assert callable(coverage_gate)
+
+
+def test_coverage_gate_threshold_is_modest():
+    assert 0 < COVERAGE_GATE_FAIL_UNDER < 50
diff --git a/tests/test_device_utils.py b/tests/test_device_utils.py
new file mode 100644
index 00000000..a26b1e12
--- /dev/null
+++ b/tests/test_device_utils.py
@@ -0,0 +1,304 @@
+"""Tests for unified compute backend detection."""
+
+from unittest import mock
+
+import pytest
+import torch
+
+from videotuna.utils import device_utils
+
+
+def test_gpu_is_available_alias():
+    assert device_utils.cuda_is_available() == device_utils.gpu_is_available()
+
+
+def test_normalize_device_prefer():
+    assert device_utils.normalize_device_prefer(None) is None
+    assert device_utils.normalize_device_prefer("cuda") == "cuda"
+    assert device_utils.normalize_device_prefer("cuda:1") == "cuda:1"
+    assert device_utils.normalize_device_prefer(1) == "cuda:1"
+    assert device_utils.normalize_device_prefer("0") == "cuda:0"
+    assert device_utils.normalize_device_prefer("cpu") == "cpu"
+    with pytest.raises(ValueError, match="MPS is not supported"):
+        device_utils.normalize_device_prefer("mps")
+
+
+def test_normalize_device_prefer_invalid():
+    with pytest.raises(ValueError, match="Invalid device"):
+        device_utils.normalize_device_prefer("invalid")
+
+
+def test_resolve_inference_device_explicit_cpu():
+    with mock.patch.object(device_utils, "gpu_is_available", return_value=True):
+        dev = device_utils.resolve_inference_device("cpu")
+    assert dev == torch.device("cpu")
+
+
+def test_resolve_inference_device_cpu_when_no_gpu():
+    with mock.patch.object(device_utils, "gpu_is_available", return_value=False):
+        assert device_utils.resolve_inference_device() == torch.device("cpu")
+
+
+def test_resolve_inference_device_cuda_when_gpu():
+    with mock.patch.object(device_utils, "gpu_is_available", return_value=True):
+        with mock.patch.object(
+            device_utils, "detect_compute_backend", return_value="cuda"
+        ):
+            with mock.patch.object(device_utils.torch.cuda, "set_device"):
+                with mock.patch.object(
+                    device_utils.torch.cuda, "device_count", return_value=2
+                ):
+                    dev = device_utils.resolve_inference_device()
+    assert dev == torch.device("cuda", 0)
+
+
+def test_resolve_inference_device_indexed():
+    with mock.patch.object(device_utils, "gpu_is_available", return_value=True):
+        with mock.patch.object(device_utils.torch.cuda, "set_device") as set_dev:
+            with mock.patch.object(
+                device_utils.torch.cuda, "device_count", return_value=2
+            ):
+                dev = device_utils.resolve_inference_device("cuda:1")
+    assert dev == torch.device("cuda", 1)
+    set_dev.assert_called_with(1)
+
+
+def test_resolve_inference_device_rejects_cuda_without_gpu():
+    with mock.patch.object(device_utils, "gpu_is_available", return_value=False):
+        with pytest.raises(RuntimeError, match="no GPU accelerator"):
+            device_utils.resolve_inference_device("cuda")
+
+
+def test_recommend_dtype_cpu():
+    dev = torch.device("cpu")
+    assert device_utils.recommend_dtype(dev) == "fp32"
+
+
+def test_recommend_dtype_ampere():
+    dev = torch.device("cuda", 0)
+    with mock.patch.object(device_utils, "gpu_is_available", return_value=True):
+        with mock.patch.object(
+            device_utils.torch.cuda, "get_device_capability", return_value=(8, 6)
+        ):
+            assert device_utils.recommend_dtype(dev) == "bf16"
+
+
+def test_recommend_dtype_turing():
+    dev = torch.device("cuda", 0)
+    with mock.patch.object(device_utils, "gpu_is_available", return_value=True):
+        with mock.patch.object(
+            device_utils.torch.cuda, "get_device_capability", return_value=(7, 5)
+        ):
+            assert device_utils.recommend_dtype(dev) == "fp16"
+
+
+def test_require_min_vram_raises():
+    dev = torch.device("cuda", 0)
+    props = mock.Mock()
+    props.total_memory = 8 * 1024**3
+    with mock.patch.object(device_utils, "gpu_is_available", return_value=True):
+        with mock.patch.object(
+            device_utils.torch.cuda, "get_device_properties", return_value=props
+        ):
+            with mock.patch.object(
+                device_utils, "_format_hardware_context", return_value=""
+            ):
+                with pytest.raises(RuntimeError, match="below required"):
+                    device_utils.require_min_vram(16.0, device=dev, context="test")
+
+
+def test_get_visible_gpus_empty_without_gpu():
+    with mock.patch.object(device_utils, "gpu_is_available", return_value=False):
+        assert device_utils.get_visible_gpus() == []
+
+
+def test_get_visible_gpus_mocked():
+    props = mock.Mock()
+    props.name = "RTX 4090"
+    props.major = 8
+    props.minor = 9
+    props.total_memory = 24 * 1024**3
+    with mock.patch.object(device_utils, "gpu_is_available", return_value=True):
+        with mock.patch.object(device_utils.torch.cuda, "device_count", return_value=1):
+            with mock.patch.object(
+                device_utils.torch.cuda, "get_device_properties", return_value=props
+            ):
+                with mock.patch.object(
+                    device_utils.torch.cuda,
+                    "mem_get_info",
+                    return_value=(8 * 1024**3, 24 * 1024**3),
+                ):
+                    gpus = device_utils.get_visible_gpus()
+    assert len(gpus) == 1
+    assert gpus[0].name == "RTX 4090"
+    assert gpus[0].supports_bf16 is True
+
+
+def test_empty_cache_aliases():
+    assert device_utils.empty_cache is device_utils.empty_accelerator_cache
+    assert device_utils.synchronize_device is device_utils.synchronize_accelerator
+
+
+def test_detect_compute_backend_cpu():
+    with mock.patch.object(device_utils.torch.cuda, "is_available", return_value=False):
+        assert device_utils.detect_compute_backend() == "cpu"
+
+
+def test_detect_compute_backend_cuda():
+    with mock.patch.object(device_utils.torch.cuda, "is_available", return_value=True):
+        with mock.patch.object(device_utils, "_torch_hip_version", return_value=None):
+            assert device_utils.detect_compute_backend() == "cuda"
+
+
+def test_detect_compute_backend_rocm():
+    with mock.patch.object(device_utils.torch.cuda, "is_available", return_value=True):
+        with mock.patch.object(
+            device_utils, "_torch_hip_version", return_value="6.2.4"
+        ):
+            assert device_utils.detect_compute_backend() == "rocm"
+
+
+def test_describe_compute_environment_rocm():
+    with mock.patch.object(
+        device_utils, "_detect_compute_backend_raw", return_value="rocm"
+    ):
+        with mock.patch.object(
+            device_utils.torch.cuda, "get_device_name", return_value="gfx1100"
+        ):
+            with mock.patch.object(
+                device_utils, "_torch_hip_version", return_value="6.2.4"
+            ):
+                with mock.patch.object(device_utils.torch, "__version__", "2.6.0"):
+                    desc = device_utils.describe_compute_environment()
+    assert "ROCm available" in desc
+    assert "gfx1100" in desc
+    assert "HIP 6.2.4" in desc
+
+
+def test_require_accelerator_for_flow_raises_without_gpu():
+    with mock.patch.object(device_utils, "gpu_is_available", return_value=False):
+        with pytest.raises(RuntimeError, match="requires a GPU"):
+            device_utils.require_accelerator_for_flow(
+                "videotuna.flow.wanvideo.WanVideoModelFlow"
+            )
+
+
+def test_get_flow_tier_wan_720p_gpu_required():
+    tier = device_utils.get_flow_tier(
+        device_utils._DIFFUSERS_FLOW,
+        model_family="wan",
+        height=720,
+        width=1280,
+    )
+    assert tier == "gpu_required"
+
+
+def test_get_flow_tier_flux_schnell_cpu_smoke():
+    tier = device_utils.get_flow_tier(
+        device_utils._DIFFUSERS_FLOW,
+        model_family="flux",
+        model_variant="schnell",
+        height=768,
+        width=1360,
+    )
+    assert tier == "cpu_smoke"
+
+
+def test_get_flow_tier_flux_dev_cpu_smoke_at_512():
+    tier = device_utils.get_flow_tier(
+        device_utils._DIFFUSERS_FLOW,
+        model_family="flux",
+        model_variant="1-dev",
+        height=256,
+        width=256,
+    )
+    assert tier == "cpu_smoke"
+
+
+def test_get_flow_tier_flux_dev_gpu_required_above_512():
+    tier = device_utils.get_flow_tier(
+        device_utils._DIFFUSERS_FLOW,
+        model_family="flux",
+        model_variant="dev",
+        height=768,
+        width=512,
+    )
+    assert tier == "gpu_required"
+
+
+def test_get_flow_tier_unknown_family_defaults():
+    tier = device_utils.get_flow_tier(
+        device_utils._DIFFUSERS_FLOW,
+        model_family="unknown",
+        height=256,
+        width=256,
+    )
+    assert tier == "cpu_smoke"
+
+
+def test_require_accelerator_wan_native_720p_blocks_cpu_smoke():
+    with mock.patch.object(device_utils, "gpu_is_available", return_value=False):
+        with pytest.raises(RuntimeError, match="requires a GPU"):
+            device_utils.require_accelerator_for_flow(
+                device_utils._WAN_FLOW,
+                cpu_mode="smoke",
+                height=720,
+                width=1280,
+                frames=81,
+            )
+
+
+def test_resolve_cpu_mode_smoke_cli():
+    with mock.patch.dict("os.environ", {}, clear=True):
+        assert device_utils.resolve_cpu_mode(cli_smoke=True) == "smoke"
+
+
+def test_resolve_cpu_mode_legacy_allow_cpu():
+    with mock.patch.dict(
+        "os.environ",
+        {"VIDEOTUNA_ALLOW_CPU_INFERENCE": "1"},
+        clear=True,
+    ):
+        assert device_utils.resolve_cpu_mode() == "force"
+
+
+def test_require_accelerator_for_flow_allow_cpu():
+    device_utils.require_accelerator_for_flow(
+        "videotuna.flow.wanvideo.WanVideoModelFlow",
+        allow_cpu=True,
+    )
+
+
+def test_compute_backend_env_rocm_mismatch():
+    with mock.patch.dict("os.environ", {"VIDEOTUNA_COMPUTE_BACKEND": "rocm"}):
+        with mock.patch.object(device_utils, "_torch_hip_version", return_value=None):
+            with pytest.raises(RuntimeError, match="not built with HIP"):
+                device_utils.detect_compute_backend()
+
+
+def test_compute_backend_env_mps_rejected():
+    with mock.patch.dict("os.environ", {"VIDEOTUNA_COMPUTE_BACKEND": "mps"}):
+        with pytest.raises(ValueError, match="mps is not supported"):
+            device_utils.detect_compute_backend()
+
+
+def test_require_xfuser_sequence_parallel_on_rocm():
+    with mock.patch.object(device_utils, "detect_compute_backend", return_value="rocm"):
+        with pytest.raises(RuntimeError, match="xfuser requires NVIDIA CUDA"):
+            device_utils.require_xfuser_sequence_parallel("TestFlow")
+
+
+def test_validate_sequence_parallel_degrees_mismatch():
+    with pytest.raises(ValueError, match="WORLD_SIZE"):
+        device_utils.validate_sequence_parallel_degrees(2, 2, world_size=3)
+
+
+def test_validate_sequence_parallel_degrees_ok():
+    device_utils.validate_sequence_parallel_degrees(2, 2, world_size=4)
+
+
+def test_accelerator_helpers_noop_on_cpu():
+    with mock.patch.object(device_utils, "gpu_is_available", return_value=False):
+        assert device_utils.accelerator_device_string() == "cpu"
+        device_utils.empty_accelerator_cache()
+        device_utils.synchronize_accelerator()
diff --git a/tests/test_diffusers_quantization.py b/tests/test_diffusers_quantization.py
new file mode 100644
index 00000000..97f61ccd
--- /dev/null
+++ b/tests/test_diffusers_quantization.py
@@ -0,0 +1,224 @@
+"""Tests for Diffusers pipeline quantization helpers."""
+
+from unittest import mock
+
+import pytest
+import torch
+
+from videotuna.utils.diffusers_quantization import (
+    _build_torchao_component_config,
+    build_pipeline_quantization_config,
+    normalize_quant_backend,
+    normalize_transformer_quant,
+    resolve_quant_components,
+    validate_transformer_quant,
+)
+
+
+def test_resolve_quant_components_wan_22():
+    components = resolve_quant_components("wan", "2.2", "t2v")
+    assert components == ["transformer", "transformer_2"]
+
+
+def test_resolve_quant_components_wan_22_i2v():
+    components = resolve_quant_components("wan", "2.2", "i2v")
+    assert components == ["transformer", "transformer_2"]
+
+
+def test_resolve_quant_components_flux():
+    assert resolve_quant_components("flux", None, "t2i") == ["transformer"]
+
+
+def test_normalize_transformer_quant_defaults():
+    assert normalize_transformer_quant(None) == "none"
+    assert normalize_transformer_quant("") == "none"
+    assert normalize_transformer_quant("int8_wo") == "int8_wo"
+
+
+def test_normalize_transformer_quant_invalid():
+    with pytest.raises(ValueError, match="Unsupported transformer_quant"):
+        normalize_transformer_quant("uint7")
+
+
+def test_normalize_quant_backend_defaults():
+    assert normalize_quant_backend(None) == "torchao"
+    assert normalize_quant_backend("quanto") == "quanto"
+
+
+def test_normalize_quant_backend_invalid():
+    with pytest.raises(ValueError, match="Unsupported quant_backend"):
+        normalize_quant_backend("bitsandbytes")
+
+
+def test_build_pipeline_quantization_config_none():
+    assert (
+        build_pipeline_quantization_config(
+            transformer_quant="none",
+            quant_backend="torchao",
+            components=["transformer"],
+        )
+        is None
+    )
+
+
+def test_build_torchao_int8_config():
+    from torchao.quantization import Int8WeightOnlyConfig, PerGroup
+
+    cfg = _build_torchao_component_config("int8_wo")
+    assert isinstance(cfg.quant_type, Int8WeightOnlyConfig)
+    assert isinstance(cfg.quant_type.granularity, PerGroup)
+    assert cfg.quant_type.granularity.group_size == 128
+    assert cfg.quant_type.version == 2
+
+
+def test_build_torchao_int4_config():
+    from torchao.quantization import Int4WeightOnlyConfig
+
+    cfg = _build_torchao_component_config("int4_wo")
+    assert isinstance(cfg.quant_type, Int4WeightOnlyConfig)
+    assert cfg.quant_type.group_size == 128
+    assert cfg.quant_type.version == 2
+
+
+def test_build_torchao_fp8_config():
+    from torchao.quantization import Float8WeightOnlyConfig
+
+    cfg = _build_torchao_component_config("fp8_wo")
+    assert isinstance(cfg.quant_type, Float8WeightOnlyConfig)
+    assert cfg.quant_type.weight_dtype == torch.float8_e4m3fn
+
+
+def test_build_pipeline_quantization_config_torchao_integration():
+    from diffusers import PipelineQuantizationConfig, TorchAoConfig
+
+    cfg = build_pipeline_quantization_config(
+        transformer_quant="int8_wo",
+        quant_backend="torchao",
+        components=["transformer", "transformer_2"],
+    )
+    assert isinstance(cfg, PipelineQuantizationConfig)
+    assert set(cfg.quant_mapping.keys()) == {"transformer", "transformer_2"}
+    for component_cfg in cfg.quant_mapping.values():
+        assert isinstance(component_cfg, TorchAoConfig)
+
+
+def test_validate_transformer_quant_rejects_cpu():
+    with mock.patch(
+        "videotuna.utils.diffusers_quantization.detect_compute_backend",
+        return_value="cpu",
+    ):
+        with pytest.raises(RuntimeError, match="not supported on CPU"):
+            validate_transformer_quant(
+                transformer_quant="int8_wo",
+                quant_backend="torchao",
+                offload_mode="sequential",
+            )
+
+
+def test_validate_transformer_quant_rejects_rocm():
+    with mock.patch(
+        "videotuna.utils.diffusers_quantization.detect_compute_backend",
+        return_value="rocm",
+    ):
+        with pytest.raises(RuntimeError, match="not supported on AMD ROCm"):
+            validate_transformer_quant(
+                transformer_quant="int8_wo",
+                quant_backend="torchao",
+                offload_mode="model",
+            )
+
+
+def test_validate_transformer_quant_fp8_requires_ada():
+    with mock.patch(
+        "videotuna.utils.diffusers_quantization.detect_compute_backend",
+        return_value="cuda",
+    ):
+        with mock.patch(
+            "videotuna.utils.diffusers_quantization.gpu_is_available",
+            return_value=True,
+        ):
+            with mock.patch(
+                "videotuna.utils.diffusers_quantization.torch.cuda.get_device_capability",
+                return_value=(8, 0),
+            ):
+                with pytest.raises(RuntimeError, match="fp8_wo requires"):
+                    validate_transformer_quant(
+                        transformer_quant="fp8_wo",
+                        quant_backend="torchao",
+                        offload_mode="none",
+                    )
+
+
+def test_validate_transformer_quant_fp8_passes_on_ada():
+    with mock.patch(
+        "videotuna.utils.diffusers_quantization.detect_compute_backend",
+        return_value="cuda",
+    ):
+        with mock.patch(
+            "videotuna.utils.diffusers_quantization.gpu_is_available",
+            return_value=True,
+        ):
+            with mock.patch(
+                "videotuna.utils.diffusers_quantization.torch.cuda.get_device_capability",
+                return_value=(8, 9),
+            ):
+                quant = validate_transformer_quant(
+                    transformer_quant="fp8_wo",
+                    quant_backend="torchao",
+                    offload_mode="model",
+                )
+    assert quant == "fp8_wo"
+
+
+def test_validate_transformer_quant_torchao_version_guard():
+    with mock.patch(
+        "videotuna.utils.diffusers_quantization.detect_compute_backend",
+        return_value="cuda",
+    ):
+        with mock.patch(
+            "videotuna.utils.diffusers_quantization.importlib.metadata.version",
+            return_value="0.9.0",
+        ):
+            with pytest.raises(ImportError, match="torchao>=0.15.0"):
+                validate_transformer_quant(
+                    transformer_quant="int8_wo",
+                    quant_backend="torchao",
+                    offload_mode="model",
+                )
+
+
+def test_validate_transformer_quant_quanto_import_guard():
+    with mock.patch(
+        "videotuna.utils.diffusers_quantization.detect_compute_backend",
+        return_value="cuda",
+    ):
+        import builtins
+
+        real_import = builtins.__import__
+
+        def fake_import(name, *args, **kwargs):
+            if name == "optimum.quanto" or name.startswith("optimum"):
+                raise ImportError("no quanto")
+            return real_import(name, *args, **kwargs)
+
+        with mock.patch("builtins.__import__", side_effect=fake_import):
+            with pytest.raises(ImportError, match="poetry install -E quanto"):
+                validate_transformer_quant(
+                    transformer_quant="int8_wo",
+                    quant_backend="quanto",
+                    offload_mode="model",
+                )
+
+
+def test_maybe_adjust_offload_for_quant():
+    from argparse import Namespace
+
+    from videotuna.utils.diffusers_quantization import maybe_adjust_offload_for_quant
+
+    args = Namespace(
+        enable_sequential_cpu_offload=True,
+        enable_model_cpu_offload=False,
+    )
+    maybe_adjust_offload_for_quant(args, "int8_wo")
+    assert args.enable_sequential_cpu_offload is False
+    assert args.enable_model_cpu_offload is True
diff --git a/tests/test_diffusers_video_flow.py b/tests/test_diffusers_video_flow.py
new file mode 100644
index 00000000..697e6fa9
--- /dev/null
+++ b/tests/test_diffusers_video_flow.py
@@ -0,0 +1,120 @@
+"""Unit tests for the PrivTune Diffusers inference flow (Flux + Wan 2.2)."""
+
+from __future__ import annotations
+
+import argparse
+from types import SimpleNamespace
+from unittest import mock
+
+import torch
+from omegaconf import OmegaConf
+
+from videotuna.flow.diffusers_video import (
+    MODEL_REGISTRY,
+    DiffusersVideoFlow,
+    resolve_model_id,
+    resolve_torch_dtype,
+)
+from videotuna.utils.diffusers_optimizations import (
+    apply_diffusers_optimizations,
+    transformer_cache_context,
+)
+
+
+def test_resolve_model_id_defaults():
+    assert (
+        resolve_model_id("flux", "t2i", None, model_variant="1-dev")
+        == "black-forest-labs/FLUX.1-dev"
+    )
+    assert (
+        resolve_model_id("wan", "t2v", None, model_variant="2.2")
+        == "Wan-AI/Wan2.2-T2V-A14B-Diffusers"
+    )
+    assert resolve_model_id("wan", "t2v", "custom/model") == "custom/model"
+
+
+def test_resolve_torch_dtype():
+    assert resolve_torch_dtype("fp16") == torch.float16
+    assert resolve_torch_dtype("bf16") == torch.bfloat16
+    assert resolve_torch_dtype(None) == torch.bfloat16
+
+
+def test_model_registry_covers_domain_families():
+    assert ("flux", "t2i") in MODEL_REGISTRY
+    assert ("wan", "t2v") in MODEL_REGISTRY
+    assert ("cogvideox", "t2v") not in MODEL_REGISTRY
+
+
+def test_apply_diffusers_optimizations_mock_pipe():
+    pipe = mock.MagicMock()
+    pipe.vae = mock.MagicMock()
+    del pipe.enable_vae_tiling
+    args = argparse.Namespace(
+        enable_sequential_cpu_offload=False,
+        enable_model_cpu_offload=True,
+        enable_vae_slicing=True,
+        enable_vae_tiling=True,
+        fuse_qkv=True,
+        enable_attention_cache=False,
+    )
+    apply_diffusers_optimizations(pipe, args)
+    pipe.enable_model_cpu_offload.assert_called_once()
+    pipe.vae.enable_slicing.assert_called_once()
+    pipe.vae.enable_tiling.assert_called_once()
+    pipe.fuse_qkv_projections.assert_called_once()
+    pipe.set_progress_bar_config.assert_called_once()
+
+
+def test_transformer_cache_context_noop_without_transformer():
+    pipe = SimpleNamespace(transformer=None)
+    with transformer_cache_context(pipe):
+        pass
+
+
+def test_diffusers_video_flow_instantiate_pipeline_only():
+    flow = DiffusersVideoFlow(
+        model_family="wan",
+        mode="t2v",
+        pretrained_model_name_or_path="Wan-AI/Wan2.2-T2V-A14B-Diffusers",
+    )
+    assert flow.pipeline_only is True
+    assert flow.pipeline is None
+
+
+@mock.patch("videotuna.flow.diffusers_video.export_to_video")
+@mock.patch.object(DiffusersVideoFlow, "_generate_sample")
+def test_inference_t2v_saves_video(mock_generate, mock_export):
+    mock_generate.return_value = {
+        "result": [{"frame": 0}],
+        "peak_vram_gb": 1.0,
+        "wall_time_s": 2.0,
+    }
+    flow = DiffusersVideoFlow(model_family="wan", mode="t2v")
+    flow.pipeline = mock.MagicMock()
+    args = OmegaConf.create(
+        {
+            "savedir": "/tmp/privtune-test",
+            "prompt_file": "inputs/t2v/prompts.txt",
+            "frames": 49,
+            "num_inference_steps": 4,
+            "unconditional_guidance_scale": 6.0,
+            "seed": 1,
+            "savefps": 8,
+        }
+    )
+    with mock.patch.object(
+        DiffusersVideoFlow, "load_inference_inputs", return_value=["hello"]
+    ):
+        with mock.patch.object(flow, "save_metrics"):
+            metrics = flow.inference(args)
+    assert len(metrics["per_sample"]) == 1
+    mock_export.assert_called_once()
+
+
+def test_yaml_wan22_instantiates_flow():
+    from videotuna.utils.common_utils import instantiate_from_config
+
+    cfg = OmegaConf.load("configs/inference/presets/balanced_wan2_2_720p.yaml")
+    flow = instantiate_from_config(cfg.flow, resolve=True)
+    assert isinstance(flow, DiffusersVideoFlow)
+    assert flow.model_variant == "2.2"
diff --git a/tests/test_domain_finetune_configs.py b/tests/test_domain_finetune_configs.py
new file mode 100644
index 00000000..73dc0875
--- /dev/null
+++ b/tests/test_domain_finetune_configs.py
@@ -0,0 +1,118 @@
+"""CPU smoke tests for domain adult fine-tuning configs (no GPU weights)."""
+
+import json
+from pathlib import Path
+
+import yaml
+from omegaconf import OmegaConf
+
+from videotuna.training.flux_lora.config import load_train_config
+from videotuna.training.wan_lora.config import load_wan_lora_config
+
+REPO_ROOT = Path(__file__).resolve().parents[1]
+
+FLUX_TRAIN_CONFIG = REPO_ROOT / "configs" / "domain" / "flux_t2i.json"
+FLUX_DATA_CONFIG = REPO_ROOT / "configs" / "domain" / "flux_t2i_data.json"
+FLUX_CLOUD_SMOKE_CONFIG = REPO_ROOT / "configs" / "domain" / "flux_t2i_cloud_smoke.json"
+WAN_DOMAIN_CONFIG = REPO_ROOT / "configs" / "domain" / "wan_t2v_lora.yaml"
+WAN_CLOUD_SMOKE_CONFIG = (
+    REPO_ROOT / "configs" / "domain" / "wan_t2v_lora_cloud_smoke.yaml"
+)
+WAN_I2V_DOMAIN_CONFIG = REPO_ROOT / "configs" / "domain" / "wan_i2v_lora.yaml"
+FLUX_INFER_SMOKE = (
+    REPO_ROOT / "configs" / "inference" / "presets" / "flux_domain_lora_smoke.yaml"
+)
+WAN_INFER_SMOKE = (
+    REPO_ROOT / "configs" / "inference" / "presets" / "wan_domain_lora_smoke.yaml"
+)
+WAN_INFER_SMOKE_22 = (
+    REPO_ROOT / "configs" / "inference" / "presets" / "wan_domain_lora_smoke_22.yaml"
+)
+
+
+def test_flux_domain_train_config_loads():
+    train_cfg, data_cfg = load_train_config(FLUX_TRAIN_CONFIG, FLUX_DATA_CONFIG)
+    assert train_cfg.pretrained_model_name_or_path == "black-forest-labs/FLUX.1-dev"
+    assert train_cfg.output_dir == "results/train/flux-domain-adult"
+    assert train_cfg.max_train_steps == 2000
+    assert train_cfg.checkpointing_steps == 250
+    assert train_cfg.validation_prompt is not None
+    assert train_cfg.validation_steps == 40
+    assert train_cfg.gradient_checkpointing is True
+    assert "sks_style" in train_cfg.validation_prompt
+    assert data_cfg.instance_data_dir == "data/t2i/domain"
+    assert data_cfg.caption_strategy == "filename"
+    assert data_cfg.text_embeds is not None
+
+
+def test_flux_domain_data_backend_json():
+    backends = json.loads(FLUX_DATA_CONFIG.read_text(encoding="utf-8"))
+    image_backend = next(b for b in backends if b.get("dataset_type") != "text_embeds")
+    assert image_backend["instance_data_dir"] == "data/t2i/domain"
+    assert "caption" not in image_backend
+
+
+def test_wan_domain_yaml_parses():
+    cfg = load_wan_lora_config(WAN_DOMAIN_CONFIG)
+    assert cfg.train.name == "train_wan_domain_t2v_lora"
+    csv_path = cfg.train.data.params["train"]["params"]["csv_path"]
+    assert csv_path == "data/t2v/domain/metadata.csv"
+    assert cfg.train.lightning.trainer.max_epochs == 50
+    ckpt_cb = cfg.train.lightning.callbacks.model_checkpoint.params
+    assert ckpt_cb["every_n_train_steps"] == 25
+    assert cfg.flow.params.ckpt_path == "checkpoints/wan/Wan2.1-T2V-14B"
+
+
+def test_flux_cloud_smoke_train_config_loads():
+    train_cfg, data_cfg = load_train_config(FLUX_CLOUD_SMOKE_CONFIG, FLUX_DATA_CONFIG)
+    assert train_cfg.max_train_steps == 50
+    assert train_cfg.checkpointing_steps == 25
+    assert train_cfg.output_dir == "results/train/flux-cloud-smoke"
+    assert train_cfg.pretrained_model_name_or_path == "checkpoints/flux/FLUX.1-dev"
+    assert data_cfg.instance_data_dir == "data/t2i/domain"
+
+
+def test_wan_cloud_smoke_yaml_parses():
+    cfg = load_wan_lora_config(WAN_CLOUD_SMOKE_CONFIG)
+    assert cfg.train.name == "train_wan_cloud_smoke"
+    assert cfg.train.lightning.trainer.max_epochs == 1
+    ckpt_cb = cfg.train.lightning.callbacks.model_checkpoint.params
+    assert ckpt_cb["every_n_train_steps"] == 5
+
+
+def test_wan_i2v_domain_yaml_parses():
+    cfg = load_wan_lora_config(WAN_I2V_DOMAIN_CONFIG)
+    assert cfg.train.name == "train_wan_domain_i2v_lora"
+    assert cfg.flow.params.task == "i2v-14B"
+    csv_path = cfg.train.data.params["train"]["params"]["csv_path"]
+    assert csv_path == "data/i2v/domain/metadata.csv"
+    assert cfg.train.data.params["train"]["params"]["image_to_video"] is False
+    assert cfg.inference.mode == "i2v"
+    assert cfg.flow.params.ckpt_path == "checkpoints/wan/Wan2.1-I2V-14B-480P"
+
+
+def test_flux_domain_inference_smoke_yaml():
+    cfg = yaml.safe_load(FLUX_INFER_SMOKE.read_text(encoding="utf-8"))
+    assert cfg["flow"]["params"]["model_variant"] == "1-dev"
+    assert cfg["inference"]["height"] == 512
+    assert cfg["inference"]["num_inference_steps"] == 8
+    assert cfg["inference"]["enable_model_cpu_offload"] is True
+
+
+def test_wan_domain_inference_smoke_yaml():
+    cfg = OmegaConf.load(WAN_INFER_SMOKE)
+    assert cfg.inference.height == 480
+    assert cfg.inference.width == 832
+    assert cfg.inference.frames == 81
+    assert cfg.inference.num_inference_steps == 20
+    assert cfg.flow.params.offload_model is True
+
+
+def test_wan_domain_inference_smoke_22_yaml():
+    cfg = OmegaConf.load(WAN_INFER_SMOKE_22)
+    assert cfg.inference.height == 720
+    assert cfg.inference.width == 1280
+    assert cfg.inference.frames == 81
+    assert cfg.inference.num_inference_steps == 4
+    assert cfg.flow.params.model_variant == "2.2"
+    assert "DiffusersVideoFlow" in cfg.flow.target
diff --git a/tests/test_entry_scripts.py b/tests/test_entry_scripts.py
new file mode 100644
index 00000000..b091c951
--- /dev/null
+++ b/tests/test_entry_scripts.py
@@ -0,0 +1,20 @@
+"""Regression tests for Poetry entry scripts (no sys.path hacks)."""
+
+from pathlib import Path
+
+REPO_ROOT = Path(__file__).resolve().parents[1]
+ENTRY_SCRIPTS = (
+    REPO_ROOT / "scripts" / "inference_new.py",
+    REPO_ROOT / "scripts" / "train_new.py",
+)
+
+
+def test_entry_scripts_do_not_manipulate_sys_path():
+    for script_path in ENTRY_SCRIPTS:
+        source = script_path.read_text(encoding="utf-8")
+        assert (
+            "sys.path.insert" not in source
+        ), f"{script_path.name} must not use sys.path.insert"
+        assert (
+            "/src" not in source
+        ), f"{script_path.name} must not reference dead src/ path"
diff --git a/tests/test_flux_lora_dataset.py b/tests/test_flux_lora_dataset.py
new file mode 100644
index 00000000..7d2f7b16
--- /dev/null
+++ b/tests/test_flux_lora_dataset.py
@@ -0,0 +1,66 @@
+"""Dataset behavior for Flux LoRA bucketing and caption dropout."""
+
+from __future__ import annotations
+
+from pathlib import Path
+
+import pytest
+from PIL import Image
+
+from videotuna.training.flux_lora.config import FluxLoraDataConfig
+from videotuna.training.flux_lora.dataset import FluxLoraImageDataset
+
+REPO_ROOT = Path(__file__).resolve().parents[1]
+
+
+@pytest.fixture
+def tiny_image_dataset(tmp_path):
+    img = Image.new("RGB", (640, 480), color=(128, 64, 32))
+    img.save(tmp_path / "sample.png")
+    (tmp_path / "sample.txt").write_text("a photo of sample", encoding="utf-8")
+    return FluxLoraDataConfig(
+        instance_data_dir=str(tmp_path),
+        caption_strategy="filename",
+        resolution=512,
+        resolution_type="pixel_area",
+        aspect_bucket_rounding=2,
+        minimum_image_size=0,
+    )
+
+
+def test_dataset_loads_local_images(tiny_image_dataset):
+    dataset = FluxLoraImageDataset(tiny_image_dataset, seed=0)
+    assert len(dataset) == 1
+    sample = dataset[0]
+    assert sample["caption"] == "a photo of sample"
+    height, width = sample["pixel_values"].shape[1:]
+    assert width % 64 == 0
+    assert height % 64 == 0
+
+
+def test_dataset_filters_small_images(tmp_path):
+    img = Image.new("RGB", (128, 128), color=(10, 20, 30))
+    img.save(tmp_path / "small.png")
+    (tmp_path / "small.txt").write_text("small", encoding="utf-8")
+    data_cfg = FluxLoraDataConfig(
+        instance_data_dir=str(tmp_path),
+        caption_strategy="filename",
+        resolution=512,
+        minimum_image_size=512,
+    )
+    with pytest.raises(ValueError, match="No training images found"):
+        FluxLoraImageDataset(data_cfg)
+
+
+def test_caption_dropout(tmp_path):
+    img = Image.new("RGB", (512, 512), color=(1, 2, 3))
+    img.save(tmp_path / "sample.png")
+    (tmp_path / "sample.txt").write_text("drop me", encoding="utf-8")
+    data_cfg = FluxLoraDataConfig(
+        instance_data_dir=str(tmp_path),
+        caption_strategy="filename",
+        resolution=512,
+        caption_dropout_probability=1.0,
+    )
+    dataset = FluxLoraImageDataset(data_cfg, seed=0)
+    assert dataset[0]["caption"] == ""
diff --git a/tests/test_flux_lora_features.py b/tests/test_flux_lora_features.py
new file mode 100644
index 00000000..375af266
--- /dev/null
+++ b/tests/test_flux_lora_features.py
@@ -0,0 +1,485 @@
+"""Unit tests for Flux LoRA bucketing, checkpoints, cache, and strict config."""
+
+from __future__ import annotations
+
+from pathlib import Path
+from unittest import mock
+
+import pytest
+import torch
+from PIL import Image
+
+from videotuna.training.flux_lora.bucketing import (
+    bucket_dimensions,
+    bucket_dimensions_for_image,
+    meets_minimum_size,
+    round_aspect_ratio,
+    target_pixel_area,
+)
+from videotuna.training.flux_lora.checkpoint import (
+    find_latest_checkpoint,
+    load_lora_checkpoint,
+    prune_checkpoints,
+)
+from videotuna.training.flux_lora.config import FluxLoraTrainConfig, load_train_config
+from videotuna.training.flux_lora.text_embed_cache import build_or_load_cache
+from videotuna.training.flux_lora.train import (
+    _resolve_resume_checkpoint,
+    _run_validation,
+    create_flux_accelerator,
+    run_training,
+    train,
+)
+
+REPO_ROOT = Path(__file__).resolve().parents[1]
+FLUX_CONFIG = REPO_ROOT / "configs" / "domain" / "flux_t2i.json"
+FLUX_DATA = REPO_ROOT / "configs" / "domain" / "flux_t2i_data.json"
+
+
+def test_round_aspect_ratio():
+    assert round_aspect_ratio(1920, 1080, 2) == 1.78
+
+
+def test_target_pixel_area():
+    assert target_pixel_area(512) == 512 * 512
+
+
+def test_bucket_dimensions_align_to_64():
+    width, height = bucket_dimensions(1.78, target_pixel_area(512))
+    assert width % 64 == 0
+    assert height % 64 == 0
+
+
+def test_bucket_dimensions_square():
+    width, height = bucket_dimensions_for_image(1000, 1000, 512, "pixel_area", 2)
+    assert width == height
+
+
+def test_meets_minimum_size_disabled():
+    assert meets_minimum_size(64, 64, 0, "pixel_area")
+
+
+def test_meets_minimum_size_pixel_area():
+    assert meets_minimum_size(512, 512, 512, "pixel_area")
+    assert not meets_minimum_size(256, 256, 512, "pixel_area")
+
+
+def test_flux_domain_config_loads_all_fields():
+    train_cfg, data_cfg = load_train_config(FLUX_CONFIG, FLUX_DATA)
+    assert train_cfg.gradient_checkpointing is True
+    assert train_cfg.disable_tf32 is True
+    assert train_cfg.disable_benchmark is False
+    assert train_cfg.validation_steps == 40
+    assert train_cfg.validation_num_inference_steps == 10
+    assert train_cfg.checkpoints_total_limit == 20
+    assert train_cfg.resume_from_checkpoint == "latest"
+    assert train_cfg.gradient_accumulation_steps == 1
+    assert train_cfg.write_batch_size == 1
+    assert train_cfg.aspect_bucket_rounding == 2
+    assert train_cfg.resolution_type == "pixel_area"
+    assert data_cfg.text_embeds is not None
+    assert data_cfg.text_embeds.cache_dir == "cache/text/flux/domain-adult"
+
+
+def test_unknown_config_key_raises(tmp_path):
+    data_path = tmp_path / "backends.json"
+    data_path.write_text(
+        '[{"type":"local","instance_data_dir":"data","caption_strategy":"filename"}]',
+        encoding="utf-8",
+    )
+    config_path = tmp_path / "config.json"
+    config_path.write_text(
+        '{"--pretrained_model_name_or_path":"black-forest-labs/FLUX.1-dev",'
+        '"--output_dir":"results/train/test",'
+        '"--max_train_steps":10,'
+        '"--unknown_key":1}',
+        encoding="utf-8",
+    )
+    with pytest.raises(ValueError, match="Unsupported Flux training config keys"):
+        load_train_config(config_path, data_path)
+
+
+def test_invalid_lora_type_raises(tmp_path):
+    data_path = tmp_path / "backends.json"
+    data_path.write_text(
+        '[{"type":"local","instance_data_dir":"data","caption_strategy":"filename"}]',
+        encoding="utf-8",
+    )
+    config_path = tmp_path / "config.json"
+    config_path.write_text(
+        '{"--pretrained_model_name_or_path":"black-forest-labs/FLUX.1-dev",'
+        '"--output_dir":"results/train/test",'
+        '"--max_train_steps":10,'
+        '"--lora_type":"dora"}',
+        encoding="utf-8",
+    )
+    with pytest.raises(ValueError, match="lora_type"):
+        load_train_config(config_path, data_path)
+
+
+def test_invalid_num_train_epochs_raises(tmp_path):
+    data_path = tmp_path / "backends.json"
+    data_path.write_text(
+        '[{"type":"local","instance_data_dir":"data","caption_strategy":"filename"}]',
+        encoding="utf-8",
+    )
+    config_path = tmp_path / "config.json"
+    config_path.write_text(
+        '{"--pretrained_model_name_or_path":"black-forest-labs/FLUX.1-dev",'
+        '"--output_dir":"results/train/test",'
+        '"--max_train_steps":10,'
+        '"--num_train_epochs":5}',
+        encoding="utf-8",
+    )
+    with pytest.raises(ValueError, match="num_train_epochs"):
+        load_train_config(config_path, data_path)
+
+
+def test_find_latest_checkpoint(tmp_path):
+    (tmp_path / "checkpoint-10").mkdir()
+    (tmp_path / "checkpoint-250").mkdir()
+    (tmp_path / "checkpoint-50").mkdir()
+    latest = find_latest_checkpoint(tmp_path)
+    assert latest is not None
+    assert latest.name == "checkpoint-250"
+
+
+def test_prune_checkpoints(tmp_path):
+    for step in (10, 50, 100, 250):
+        (tmp_path / f"checkpoint-{step}").mkdir()
+    prune_checkpoints(tmp_path, 2)
+    remaining = sorted(path.name for path in tmp_path.iterdir() if path.is_dir())
+    assert remaining == ["checkpoint-100", "checkpoint-250"]
+
+
+def test_build_or_load_cache_batches_encode_prompt(tmp_path):
+    cache_dir = tmp_path / "cache"
+    captions = ["caption-a", "caption-b", "caption-c"]
+    calls: list[int] = []
+
+    def fake_encode_prompt(*, prompt, **_kwargs):
+        calls.append(len(prompt))
+        batch = len(prompt)
+        return (
+            torch.zeros(batch, 2, 4),
+            torch.zeros(batch, 3),
+            torch.zeros(batch, 2, 2),
+        )
+
+    pipeline = mock.MagicMock()
+    pipeline.encode_prompt.side_effect = fake_encode_prompt
+
+    lookup = build_or_load_cache(
+        pipeline,
+        captions,
+        cache_dir,
+        write_batch_size=2,
+        device=torch.device("cpu"),
+    )
+    assert set(lookup) == set(captions)
+    assert calls == [2, 1]
+    assert len(list(cache_dir.glob("*.pt"))) == 3
+
+    pipeline.encode_prompt.reset_mock()
+    second_lookup = build_or_load_cache(
+        pipeline,
+        captions,
+        cache_dir,
+        write_batch_size=2,
+        device=torch.device("cpu"),
+    )
+    assert second_lookup.keys() == lookup.keys()
+    pipeline.encode_prompt.assert_not_called()
+
+
+def _minimal_flux_data_backend(tmp_path: Path) -> Path:
+    data_path = tmp_path / "backends.json"
+    data_path.write_text(
+        '[{"type":"local","instance_data_dir":"data","caption_strategy":"filename"}]',
+        encoding="utf-8",
+    )
+    return data_path
+
+
+def test_gradient_checkpointing_false_parses_from_json(tmp_path):
+    data_path = _minimal_flux_data_backend(tmp_path)
+    config_path = tmp_path / "config.json"
+    config_path.write_text(
+        '{"--pretrained_model_name_or_path":"black-forest-labs/FLUX.1-dev",'
+        '"--output_dir":"results/train/test",'
+        '"--max_train_steps":10,'
+        '"--gradient_checkpointing":"false"}',
+        encoding="utf-8",
+    )
+    train_cfg, _ = load_train_config(config_path, data_path)
+    assert train_cfg.gradient_checkpointing is False
+
+
+def test_load_flux_training_models_respects_gradient_checkpointing_false():
+    pytest.importorskip("peft")
+    from videotuna.training.flux_lora import model_utils
+
+    mock_peft_model = mock.MagicMock()
+    mock_peft_model.enable_gradient_checkpointing = mock.MagicMock()
+    mock_peft_model.disable_gradient_checkpointing = mock.MagicMock()
+    mock_component = mock.MagicMock()
+
+    with (
+        mock.patch.object(model_utils, "CLIPTokenizer") as tokenizer_cls,
+        mock.patch.object(model_utils, "T5TokenizerFast") as tokenizer_two_cls,
+        mock.patch.object(model_utils, "CLIPTextModel") as text_encoder_one_cls,
+        mock.patch.object(model_utils, "T5EncoderModel") as text_encoder_two_cls,
+        mock.patch.object(model_utils, "AutoencoderKL") as vae_cls,
+        mock.patch.object(model_utils, "FluxTransformer2DModel") as transformer_cls,
+        mock.patch.object(model_utils, "get_peft_model", return_value=mock_peft_model),
+    ):
+        tokenizer_cls.from_pretrained.return_value = mock_component
+        tokenizer_two_cls.from_pretrained.return_value = mock_component
+        text_encoder_one_cls.from_pretrained.return_value = mock_component
+        text_encoder_two_cls.from_pretrained.return_value = mock_component
+        vae_cls.from_pretrained.return_value = mock_component
+        transformer_cls.from_pretrained.return_value = mock_component
+
+        model_utils.load_flux_training_models(
+            "black-forest-labs/FLUX.1-dev",
+            lora_rank=4,
+            gradient_checkpointing=False,
+        )
+
+    mock_peft_model.enable_gradient_checkpointing.assert_not_called()
+    mock_peft_model.disable_gradient_checkpointing.assert_called_once()
+
+
+def test_run_validation_writes_preview(tmp_path):
+    config = FluxLoraTrainConfig(
+        pretrained_model_name_or_path="black-forest-labs/FLUX.1-dev",
+        output_dir=str(tmp_path),
+        instance_data_dir="data",
+        validation_prompt="sks_style, portrait",
+        validation_steps=40,
+        validation_resolution="512x512",
+        validation_num_inference_steps=4,
+    )
+    mock_image = Image.new("RGB", (64, 64), color=(128, 64, 32))
+    mock_pipeline = mock.MagicMock()
+    mock_pipeline.return_value = mock.MagicMock(images=[mock_image])
+    mock_pipeline.transformer = mock.MagicMock()
+
+    mock_writer = mock.MagicMock()
+    mock_tracker = mock.MagicMock(writer=mock_writer)
+    mock_accelerator = mock.MagicMock(
+        is_main_process=True,
+        device=torch.device("cpu"),
+        trackers=[mock_tracker],
+    )
+    mock_log = mock.MagicMock()
+
+    _run_validation(
+        mock_pipeline,
+        config,
+        tmp_path,
+        global_step=40,
+        accelerator=mock_accelerator,
+        weight_dtype=torch.bfloat16,
+        log=mock_log,
+    )
+
+    image_path = tmp_path / "validation" / "step-000040.png"
+    assert image_path.is_file()
+    mock_writer.add_image.assert_called_once()
+    mock_pipeline.transformer.train.assert_called_once()
+
+
+def test_run_validation_skips_when_step_not_divides(tmp_path):
+    config = FluxLoraTrainConfig(
+        pretrained_model_name_or_path="black-forest-labs/FLUX.1-dev",
+        output_dir=str(tmp_path),
+        instance_data_dir="data",
+        validation_prompt="sks_style, portrait",
+        validation_steps=40,
+    )
+    mock_pipeline = mock.MagicMock()
+    mock_accelerator = mock.MagicMock(
+        is_main_process=True,
+        device=torch.device("cpu"),
+        trackers=[],
+    )
+
+    _run_validation(
+        mock_pipeline,
+        config,
+        tmp_path,
+        global_step=39,
+        accelerator=mock_accelerator,
+        weight_dtype=torch.bfloat16,
+        log=mock.MagicMock(),
+    )
+
+    mock_pipeline.assert_not_called()
+    assert not (tmp_path / "validation").exists()
+
+
+def test_resolve_resume_checkpoint_latest(tmp_path):
+    (tmp_path / "checkpoint-10").mkdir()
+    (tmp_path / "checkpoint-50").mkdir()
+    config = FluxLoraTrainConfig(
+        pretrained_model_name_or_path="black-forest-labs/FLUX.1-dev",
+        output_dir=str(tmp_path),
+        instance_data_dir="data",
+        resume_from_checkpoint="latest",
+    )
+    resolved = _resolve_resume_checkpoint(config, tmp_path)
+    assert resolved is not None
+    assert resolved.name == "checkpoint-50"
+
+
+def test_resolve_resume_checkpoint_relative_path(tmp_path):
+    (tmp_path / "checkpoint-100").mkdir()
+    config = FluxLoraTrainConfig(
+        pretrained_model_name_or_path="black-forest-labs/FLUX.1-dev",
+        output_dir=str(tmp_path),
+        instance_data_dir="data",
+        resume_from_checkpoint="checkpoint-100",
+    )
+    resolved = _resolve_resume_checkpoint(config, tmp_path)
+    assert resolved is not None
+    assert resolved.name == "checkpoint-100"
+
+
+def test_resolve_resume_checkpoint_missing_returns_none(tmp_path):
+    config = FluxLoraTrainConfig(
+        pretrained_model_name_or_path="black-forest-labs/FLUX.1-dev",
+        output_dir=str(tmp_path),
+        instance_data_dir="data",
+        resume_from_checkpoint="latest",
+    )
+    assert _resolve_resume_checkpoint(config, tmp_path) is None
+
+
+def test_invalid_resume_from_checkpoint_raises(tmp_path):
+    data_path = _minimal_flux_data_backend(tmp_path)
+    config_path = tmp_path / "config.json"
+    config_path.write_text(
+        '{"--pretrained_model_name_or_path":"black-forest-labs/FLUX.1-dev",'
+        '"--output_dir":"results/train/test",'
+        '"--max_train_steps":10,'
+        '"--resume_from_checkpoint":""}',
+        encoding="utf-8",
+    )
+    with pytest.raises(ValueError, match="resume_from_checkpoint"):
+        load_train_config(config_path, data_path)
+
+
+def test_gradient_accumulation_steps_parses_from_json(tmp_path):
+    data_path = _minimal_flux_data_backend(tmp_path)
+    config_path = tmp_path / "config.json"
+    config_path.write_text(
+        '{"--pretrained_model_name_or_path":"black-forest-labs/FLUX.1-dev",'
+        '"--output_dir":"results/train/test",'
+        '"--max_train_steps":10,'
+        '"--gradient_accumulation_steps":4}',
+        encoding="utf-8",
+    )
+    train_cfg, _ = load_train_config(config_path, data_path)
+    assert train_cfg.gradient_accumulation_steps == 4
+
+
+def test_invalid_gradient_accumulation_steps_raises(tmp_path):
+    data_path = _minimal_flux_data_backend(tmp_path)
+    config_path = tmp_path / "config.json"
+    config_path.write_text(
+        '{"--pretrained_model_name_or_path":"black-forest-labs/FLUX.1-dev",'
+        '"--output_dir":"results/train/test",'
+        '"--max_train_steps":10,'
+        '"--gradient_accumulation_steps":0}',
+        encoding="utf-8",
+    )
+    with pytest.raises(ValueError, match="gradient_accumulation_steps"):
+        load_train_config(config_path, data_path)
+
+
+def test_create_flux_accelerator_passes_gradient_accumulation_steps(tmp_path):
+    target = "videotuna.training.flux_lora.train.Accelerator"
+    with mock.patch(target) as accelerator_cls:
+        create_flux_accelerator(
+            tmp_path,
+            mixed_precision="bf16",
+            gradient_accumulation_steps=4,
+        )
+    accelerator_cls.assert_called_once()
+    assert accelerator_cls.call_args.kwargs["gradient_accumulation_steps"] == 4
+
+
+def test_run_training_skips_stamp_when_resuming(tmp_path):
+    data_path = _minimal_flux_data_backend(tmp_path)
+    config_path = tmp_path / "config.json"
+    config_path.write_text("{}", encoding="utf-8")
+    train_cfg = FluxLoraTrainConfig(
+        pretrained_model_name_or_path="black-forest-labs/FLUX.1-dev",
+        output_dir=str(tmp_path / "run"),
+        instance_data_dir="data",
+        resume_from_checkpoint="latest",
+    )
+    data_cfg = mock.MagicMock()
+    with (
+        mock.patch(
+            "videotuna.training.flux_lora.train.load_train_config",
+            return_value=(train_cfg, data_cfg),
+        ),
+        mock.patch("videotuna.training.flux_lora.train.stamp_output_dir") as stamp,
+        mock.patch("videotuna.training.flux_lora.train.train") as train_fn,
+    ):
+        run_training(str(config_path), str(data_path))
+    stamp.assert_not_called()
+    train_fn.assert_called_once_with(train_cfg, data_cfg)
+
+
+def test_train_raises_when_resume_checkpoint_missing(tmp_path):
+    config = FluxLoraTrainConfig(
+        pretrained_model_name_or_path="black-forest-labs/FLUX.1-dev",
+        output_dir=str(tmp_path),
+        instance_data_dir="data",
+        resume_from_checkpoint="latest",
+    )
+    data_config = mock.MagicMock()
+    mock_transformer = mock.MagicMock()
+    with (
+        mock.patch(
+            "videotuna.training.flux_lora.train.load_flux_training_models",
+            return_value={
+                "weight_dtype": torch.bfloat16,
+                "transformer": mock_transformer,
+                "vae": mock.MagicMock(),
+                "text_encoder_one": mock.MagicMock(),
+                "text_encoder_two": mock.MagicMock(),
+                "tokenizer_one": mock.MagicMock(),
+                "tokenizer_two": mock.MagicMock(),
+            },
+        ),
+        mock.patch(
+            "videotuna.training.flux_lora.train.create_flux_accelerator",
+            return_value=mock.MagicMock(
+                device=torch.device("cpu"),
+                is_main_process=True,
+            ),
+        ),
+    ):
+        with pytest.raises(ValueError, match="No checkpoint found"):
+            train(config, data_config)
+
+
+def test_load_lora_checkpoint_roundtrip():
+    pytest.importorskip("peft")
+    mock_transformer = mock.MagicMock()
+    lora_state = {"layer.lora_A.weight": torch.zeros(4, 8)}
+    with (
+        mock.patch(
+            "videotuna.training.flux_lora.checkpoint.FluxPipeline.lora_state_dict",
+            return_value=lora_state,
+        ),
+        mock.patch(
+            "videotuna.training.flux_lora.checkpoint.set_peft_model_state_dict",
+        ) as set_state,
+    ):
+        load_lora_checkpoint(mock_transformer, "/tmp/checkpoint-100")
+    set_state.assert_called_once_with(mock_transformer, lora_state)
diff --git a/tests/test_flux_lora_train_smoke.py b/tests/test_flux_lora_train_smoke.py
new file mode 100644
index 00000000..48e2f8ed
--- /dev/null
+++ b/tests/test_flux_lora_train_smoke.py
@@ -0,0 +1,239 @@
+"""CPU smoke tests for the first-party Flux LoRA trainer."""
+
+from pathlib import Path
+
+import pytest
+import torch
+from PIL import Image
+
+from videotuna.training.flux_lora.config import FluxLoraDataConfig, load_train_config
+from videotuna.training.flux_lora.dataset import FluxLoraImageDataset
+
+REPO_ROOT = Path(__file__).resolve().parents[1]
+FLUX_CONFIG = REPO_ROOT / "configs" / "domain" / "flux_t2i.json"
+FLUX_DATA = REPO_ROOT / "configs" / "domain" / "flux_t2i_data.json"
+
+
+@pytest.fixture
+def tiny_image_dataset(tmp_path):
+    img = Image.new("RGB", (64, 64), color=(128, 64, 32))
+    img.save(tmp_path / "sample.png")
+    (tmp_path / "sample.txt").write_text("a photo of sample", encoding="utf-8")
+    return FluxLoraDataConfig(
+        instance_data_dir=str(tmp_path),
+        caption_strategy="filename",
+        resolution=64,
+    )
+
+
+def test_dataset_loads_local_images(tiny_image_dataset):
+    dataset = FluxLoraImageDataset(tiny_image_dataset, seed=0)
+    assert len(dataset) == 1
+    sample = dataset[0]
+    assert sample["caption"] == "a photo of sample"
+    assert sample["pixel_values"].shape[0] == 3
+    assert sample["pixel_values"].shape[1] % 64 == 0
+    assert sample["pixel_values"].shape[2] % 64 == 0
+
+
+def test_text_embeds_backend_requires_cache_dir(tmp_path):
+    data_path = tmp_path / "backends.json"
+    data_path.write_text(
+        '[{"type":"local","instance_data_dir":"data","caption_strategy":"filename"},'
+        '{"type":"local","dataset_type":"text_embeds","disabled":false}]',
+        encoding="utf-8",
+    )
+    config_path = tmp_path / "config.json"
+    config_path.write_text(
+        '{"--pretrained_model_name_or_path":"black-forest-labs/FLUX.1-dev",'
+        '"--output_dir":"results/train/test",'
+        '"--max_train_steps":10}',
+        encoding="utf-8",
+    )
+    with pytest.raises(ValueError, match="cache_dir"):
+        load_train_config(config_path, data_path)
+
+
+def test_text_embeds_backend_parses_with_cache_dir(tmp_path):
+    data_path = tmp_path / "backends.json"
+    data_path.write_text(
+        '[{"type":"local","instance_data_dir":"data","caption_strategy":"filename"},'
+        '{"type":"local","dataset_type":"text_embeds","cache_dir":"cache/text",'
+        '"disabled":false}]',
+        encoding="utf-8",
+    )
+    config_path = tmp_path / "config.json"
+    config_path.write_text(
+        '{"--pretrained_model_name_or_path":"black-forest-labs/FLUX.1-dev",'
+        '"--output_dir":"results/train/test",'
+        '"--max_train_steps":10}',
+        encoding="utf-8",
+    )
+    train_cfg, data_cfg = load_train_config(config_path, data_path)
+    assert train_cfg.max_train_steps == 10
+    assert data_cfg.text_embeds is not None
+    assert data_cfg.text_embeds.cache_dir == "cache/text"
+
+
+def test_load_train_config_from_repo_defaults():
+    train_cfg, data_cfg = load_train_config(FLUX_CONFIG, FLUX_DATA)
+    assert train_cfg.pretrained_model_name_or_path == "black-forest-labs/FLUX.1-dev"
+    assert data_cfg.resolution == 512
+    assert train_cfg.num_workers == 0
+    assert train_cfg.gradient_checkpointing is True
+    assert train_cfg.validation_steps == 40
+
+
+def test_load_train_config_num_workers_from_json(tmp_path):
+    data_path = tmp_path / "backends.json"
+    data_path.write_text(
+        '[{"type":"local","instance_data_dir":"data","caption_strategy":"filename"}]',
+        encoding="utf-8",
+    )
+    config_path = tmp_path / "config.json"
+    config_path.write_text(
+        '{"--pretrained_model_name_or_path":"black-forest-labs/FLUX.1-dev",'
+        '"--output_dir":"results/train/test",'
+        '"--num_workers": 4}',
+        encoding="utf-8",
+    )
+    train_cfg, _ = load_train_config(config_path, data_path)
+    assert train_cfg.num_workers == 4
+
+
+def test_flux_lora_target_modules():
+    from videotuna.training.flux_lora.model_utils import FLUX_LORA_TARGET_MODULES
+
+    assert "to_q" in FLUX_LORA_TARGET_MODULES
+    assert len(FLUX_LORA_TARGET_MODULES) == 4
+
+
+def test_checkpoint_save_with_mock_transformer(tmp_path):
+    pytest.importorskip("peft")
+    from typing import Any, cast
+
+    from diffusers import FluxTransformer2DModel
+    from peft import LoraConfig, get_peft_model
+
+    try:
+        transformer = FluxTransformer2DModel(
+            in_channels=64,
+            out_channels=64,
+            num_layers=1,
+            num_single_layers=1,
+            attention_head_dim=64,
+            num_attention_heads=4,
+            joint_attention_dim=64,
+            pooled_projection_dim=64,
+            guidance_embeds=True,
+        )
+    except Exception as exc:
+        pytest.skip(f"Could not construct FluxTransformer2DModel stub: {exc}")
+
+    lora_config = LoraConfig(r=4, lora_alpha=4, target_modules=["to_q"])
+    transformer = get_peft_model(cast(Any, transformer), lora_config)
+    from videotuna.training.flux_lora.checkpoint import save_lora_checkpoint
+
+    path = save_lora_checkpoint(transformer, tmp_path, step=1)
+    assert path.is_dir()
+    assert any(path.iterdir())
+
+
+def test_create_optimizer_adamw_uses_torch(monkeypatch):
+    from unittest import mock
+
+    import torch.nn as nn
+
+    from videotuna.training.flux_lora.config import FluxLoraTrainConfig
+    from videotuna.training.flux_lora.train import _create_optimizer
+
+    model = nn.Linear(4, 4)
+    config = FluxLoraTrainConfig(
+        pretrained_model_name_or_path="black-forest-labs/FLUX.1-dev",
+        output_dir="results/train/test",
+        instance_data_dir="data/t2i/domain",
+        optimizer="adamw",
+    )
+    optimi_mock = mock.MagicMock()
+    monkeypatch.setitem(
+        __import__("sys").modules,
+        "optimi",
+        mock.MagicMock(AdamW=optimi_mock),
+    )
+    opt = _create_optimizer(model, config)
+    assert isinstance(opt, torch.optim.AdamW)
+    optimi_mock.assert_not_called()
+
+
+def test_create_optimizer_adamw_bf16_uses_optimi(monkeypatch):
+    from unittest import mock
+
+    import torch.nn as nn
+
+    from videotuna.training.flux_lora.config import FluxLoraTrainConfig
+    from videotuna.training.flux_lora.train import _create_optimizer
+
+    model = nn.Linear(4, 4)
+    config = FluxLoraTrainConfig(
+        pretrained_model_name_or_path="black-forest-labs/FLUX.1-dev",
+        output_dir="results/train/test",
+        instance_data_dir="data/t2i/domain",
+        optimizer="adamw_bf16",
+    )
+    optimi_cls = mock.MagicMock(return_value=mock.MagicMock())
+    monkeypatch.setitem(
+        __import__("sys").modules,
+        "optimi",
+        mock.MagicMock(AdamW=optimi_cls),
+    )
+    opt = _create_optimizer(model, config)
+    optimi_cls.assert_called_once()
+    assert opt is optimi_cls.return_value
+
+
+def test_create_flux_accelerator_uses_tensorboard(tmp_path, monkeypatch):
+    from unittest import mock
+
+    from videotuna.training.flux_lora.train import create_flux_accelerator
+
+    captured: dict = {}
+
+    def fake_accelerator(**kwargs):
+        captured.update(kwargs)
+        return mock.MagicMock()
+
+    monkeypatch.setattr(
+        "videotuna.training.flux_lora.train.Accelerator",
+        fake_accelerator,
+    )
+    create_flux_accelerator(
+        tmp_path,
+        mixed_precision="bf16",
+        metrics_backend="tensorboard",
+    )
+    assert captured["log_with"] == "tensorboard"
+    assert captured["project_config"].logging_dir == str(tmp_path / "tensorboard")
+    assert captured["project_config"].project_dir == str(tmp_path)
+
+
+def test_create_flux_accelerator_uses_trackio_dual_logging(tmp_path, monkeypatch):
+    from unittest import mock
+
+    from videotuna.training.flux_lora.train import create_flux_accelerator
+
+    captured: dict = {}
+
+    def fake_accelerator(**kwargs):
+        captured.update(kwargs)
+        return mock.MagicMock()
+
+    monkeypatch.setattr(
+        "videotuna.training.flux_lora.train.Accelerator",
+        fake_accelerator,
+    )
+    monkeypatch.setattr(
+        "videotuna.utils.training_metrics.trackio_available",
+        lambda: True,
+    )
+    create_flux_accelerator(tmp_path, mixed_precision="bf16", metrics_backend="trackio")
+    assert captured["log_with"] == ["tensorboard", "trackio"]
diff --git a/tests/test_flux_training_config.py b/tests/test_flux_training_config.py
new file mode 100644
index 00000000..8182ed56
--- /dev/null
+++ b/tests/test_flux_training_config.py
@@ -0,0 +1,52 @@
+"""Flux LoRA training config loading (no GPU)."""
+
+import json
+from pathlib import Path
+
+REPO_ROOT = Path(__file__).resolve().parents[1]
+FLUX_CONFIG = REPO_ROOT / "configs" / "domain" / "flux_t2i.json"
+FLUX_DATA = REPO_ROOT / "configs" / "domain" / "flux_t2i_data.json"
+
+
+def test_flux_training_config_json_loads():
+    with open(FLUX_CONFIG) as f:
+        config = json.load(f)
+    assert config["--model_family"] == "flux"
+    assert config["--pretrained_model_name_or_path"] == "black-forest-labs/FLUX.1-dev"
+    assert config["--data_backend_config"] == "configs/domain/flux_t2i_data.json"
+
+
+def test_flux_multidatabackend_json_loads():
+    with open(FLUX_DATA) as f:
+        backends = json.load(f)
+    assert isinstance(backends, list)
+    assert backends[0]["type"] == "local"
+    assert backends[0]["instance_data_dir"]
+
+
+def test_flux_training_config_loader():
+    from videotuna.training.flux_lora.config import load_train_config
+
+    train_cfg, data_cfg = load_train_config(FLUX_CONFIG, FLUX_DATA)
+    assert train_cfg.model_family == "flux"
+    assert train_cfg.lora_rank == 4
+    assert train_cfg.max_train_steps == 2000
+    assert train_cfg.checkpoints_total_limit == 20
+    assert train_cfg.write_batch_size == 1
+    assert train_cfg.optimizer == "adamw_bf16"
+    assert data_cfg.caption_strategy == "filename"
+
+
+def test_train_flux_lora_yaml_loader():
+    """Exercise the YAML→JSON bridge used by some docs (no training run)."""
+    import yaml
+
+    sample = {
+        "data": [{"id": "test", "type": "local", "instance_data_dir": "data/"}],
+        "train": {
+            "model_family": "flux",
+            "pretrained_model_name_or_path": "black-forest-labs/FLUX.1-dev",
+        },
+    }
+    parsed = yaml.safe_load(yaml.dump(sample))
+    assert parsed["train"]["model_family"] == "flux"
diff --git a/tests/test_import_smoke.py b/tests/test_import_smoke.py
new file mode 100644
index 00000000..bacf8a2d
--- /dev/null
+++ b/tests/test_import_smoke.py
@@ -0,0 +1,67 @@
+import importlib
+
+import pytest
+import torch
+from packaging.version import Version
+
+INFERENCE_BACKENDS = [
+    "videotuna.flow.diffusers_video",
+    "videotuna.flow.wanvideo",
+]
+
+TRAINING_BACKENDS = [
+    ("videotuna.training.flux_lora.config", None),
+]
+
+GPU_BACKENDS = [
+    "videotuna.flow.wanvideo",
+]
+
+
+def test_wan_t5_encoder_no_cuda_default_arg():
+    """T5EncoderModel must not use torch.cuda at class definition time."""
+    from pathlib import Path
+
+    root = Path(__file__).resolve().parents[1]
+    source = (root / "videotuna/models/wan/wan/modules/t5.py").read_text()
+    assert "device=torch.cuda.current_device()" not in source
+    assert "device=None" in source
+
+
+@pytest.mark.parametrize("module", INFERENCE_BACKENDS)
+def test_inference_backend_import(module):
+    importlib.import_module(module)
+
+
+@pytest.mark.parametrize("module,extra", TRAINING_BACKENDS)
+def test_training_backend_import(module, extra):
+    if extra is not None:
+        pytest.importorskip(extra)
+    importlib.import_module(module)
+
+
+@pytest.mark.parametrize("module", GPU_BACKENDS)
+def test_gpu_backend_import(module):
+    importlib.import_module(module)
+
+
+def test_core_ml_stack_versions():
+    import accelerate
+    import diffusers
+    import peft
+    import transformers
+
+    assert (
+        Version(torch.__version__).major == 2 and Version(torch.__version__).minor >= 6
+    )
+    assert Version(diffusers.__version__) >= Version("0.38.0")
+    assert Version(transformers.__version__) >= Version("4.48.0")
+    assert Version(accelerate.__version__) >= Version("1.14.0")
+    assert Version(peft.__version__) >= Version("0.17.0")
+
+
+def test_training_stack_versions():
+    pytest.importorskip("deepspeed")
+    import deepspeed
+
+    assert Version(deepspeed.__version__) >= Version("0.19.0")
diff --git a/tests/test_inference_cli_cyclopts.py b/tests/test_inference_cli_cyclopts.py
new file mode 100644
index 00000000..86d44b06
--- /dev/null
+++ b/tests/test_inference_cli_cyclopts.py
@@ -0,0 +1,136 @@
+"""Tests for cyclopts inference CLI groups and Poetry entrypoints."""
+
+from __future__ import annotations
+
+import subprocess
+import sys
+
+import pytest
+
+from videotuna.cli.inference_app import (
+    PRESET_DOMAIN_T2I,
+    PRESET_VALIDATE_T2V,
+    PRESET_WAN2_2_T2V_720P,
+    app,
+)
+from videotuna.cli.inference_options import (
+    InferenceRunOptions,
+    StandardInferenceOptions,
+    inference_options_to_namespace,
+    validate_preset_requirements,
+)
+
+
+def _help_text(command: list[str]) -> str:
+    result = subprocess.run(
+        [
+            sys.executable,
+            "-c",
+            ("from videotuna.cli.inference_app import app; " f"app({command!r})"),
+        ],
+        capture_output=True,
+        text=True,
+        check=False,
+    )
+    return result.stdout + result.stderr
+
+
+@pytest.mark.parametrize(
+    "command",
+    [
+        ["inference-domain-t2i", "--help"],
+        ["validate-domain-t2v", "--help"],
+        ["inference-wan2.2-t2v-720p", "--help"],
+    ],
+)
+def test_inference_help_lists_shared_flags(command: list[str]) -> None:
+    help_text = _help_text(command)
+    for flag in (
+        "--lorackpt",
+        "--memory-preset",
+        "--enable_vae_tiling",
+        "--num-inference-steps",
+    ):
+        assert flag in help_text
+
+
+@pytest.mark.parametrize(
+    "command",
+    [
+        ["inference-domain-t2i", "--help"],
+        ["validate-domain-t2v", "--help"],
+    ],
+)
+def test_inference_help_omits_legacy_hunyuan_fp8(command: list[str]) -> None:
+    help_text = _help_text(command).lower()
+    for forbidden in ("enable_fp8", "enable-fp8", "dit-weight", "hunyuan"):
+        assert forbidden not in help_text
+
+
+def test_flag_parity_across_presets() -> None:
+    run = InferenceRunOptions(lorackpt="/tmp/lora", num_inference_steps=8)
+    standard = StandardInferenceOptions(memory_preset="balanced", device="cuda:0")
+    t2i = vars(
+        inference_options_to_namespace(
+            run=run,
+            standard=standard,
+            preset=PRESET_DOMAIN_T2I,
+        )
+    )
+    wan = vars(
+        inference_options_to_namespace(
+            run=run,
+            standard=standard,
+            preset=PRESET_WAN2_2_T2V_720P,
+        )
+    )
+
+    shared_keys = {k for k in t2i if k not in {"config", "enable_model_cpu_offload"}}
+    for key in shared_keys:
+        assert t2i[key] == wan[key], key
+
+    assert t2i["config"] == PRESET_DOMAIN_T2I.config
+    assert wan["config"] == PRESET_WAN2_2_T2V_720P.config
+    assert t2i["enable_model_cpu_offload"] is True
+    assert wan["enable_model_cpu_offload"] is False
+
+
+def test_domain_t2i_preset_applies_without_user_config() -> None:
+    args = inference_options_to_namespace(preset=PRESET_DOMAIN_T2I)
+    assert args.config == PRESET_DOMAIN_T2I.config
+    assert args.enable_model_cpu_offload is True
+
+
+def test_validate_domain_t2v_requires_checkpoint() -> None:
+    with pytest.raises(SystemExit) as exc:
+        validate_preset_requirements(InferenceRunOptions(), PRESET_VALIDATE_T2V)
+    assert exc.value.code == 2
+
+
+def test_cyclopts_parses_standard_flags() -> None:
+    captured: dict[str, object] = {}
+
+    def handler(
+        run: InferenceRunOptions | None = None,
+        *,
+        standard: StandardInferenceOptions | None = None,
+    ) -> None:
+        captured["args"] = inference_options_to_namespace(run=run, standard=standard)
+
+    probe = app.__class__(name="probe")
+    probe.command(name="probe")(handler)
+    probe(
+        [
+            "probe",
+            "--lorackpt",
+            "/tmp/lora",
+            "--memory-preset",
+            "low_vram",
+            "--enable_vae_tiling",
+        ]
+    )
+    args = captured["args"]
+    assert isinstance(args, object)
+    assert getattr(args, "lorackpt") == "/tmp/lora"
+    assert getattr(args, "memory_preset") == "low_vram"
+    assert getattr(args, "enable_vae_tiling") is True
diff --git a/tests/test_inference_optimization.py b/tests/test_inference_optimization.py
new file mode 100644
index 00000000..569251a7
--- /dev/null
+++ b/tests/test_inference_optimization.py
@@ -0,0 +1,417 @@
+"""Tests for inference CLI, metrics, and optimization."""
+
+import argparse
+import json
+import os
+import tempfile
+from unittest import mock
+
+import pytest
+
+from videotuna.cli.inference_options import (
+    StandardInferenceOptions,
+    inference_options_to_namespace,
+)
+from videotuna.utils.common_utils import monitor_resources, save_metrics
+from videotuna.utils.inference_cli import (
+    apply_compile_env,
+    prepare_cli_inference_args,
+)
+from videotuna.utils.inference_profile import resolve_inference_profile
+
+
+def test_standard_inference_options_to_namespace():
+    standard = StandardInferenceOptions(
+        device="cuda:1",
+        min_vram_gb=24.0,
+        memory_preset="low_vram",
+        enable_vae_tiling=True,
+        enable_sequential_cpu_offload=True,
+        dtype="bf16",
+        ulysses_degree=2,
+        ring_degree=2,
+        compile=True,
+        transformer_quant="int8_wo",
+        quant_backend="torchao",
+    )
+    args = inference_options_to_namespace(standard=standard)
+    assert args.device == "cuda:1"
+    assert args.min_vram_gb == 24.0
+    assert args.memory_preset == "low_vram"
+    assert args.enable_vae_tiling is True
+    assert args.enable_sequential_cpu_offload is True
+    assert args.dtype == "bf16"
+    assert args.ulysses_degree == 2
+    assert args.compile is True
+    assert args.transformer_quant == "int8_wo"
+    assert args.quant_backend == "torchao"
+    assert not hasattr(args, "enable_fp8")
+
+
+def test_resolve_inference_profile():
+    args = argparse.Namespace(
+        memory_preset="low_vram",
+        enable_model_cpu_offload=False,
+        enable_sequential_cpu_offload=False,
+        enable_vae_tiling=False,
+        dtype=None,
+    )
+    profile = resolve_inference_profile(args)
+    assert profile.offload_mode == "sequential"
+    assert profile.enable_sequential_cpu_offload is True
+    assert profile.enable_model_cpu_offload is False
+    assert profile.enable_vae_tiling is True
+    assert profile.dtype == "fp16"
+    assert profile.memory_preset == "low_vram"
+    assert args.enable_sequential_cpu_offload is True
+    assert args.enable_vae_tiling is True
+    assert args.dtype == "fp16"
+
+    args = argparse.Namespace(
+        memory_preset=None,
+        enable_model_cpu_offload=True,
+        enable_sequential_cpu_offload=False,
+        enable_vae_tiling=False,
+        dtype="bf16",
+    )
+    profile = resolve_inference_profile(args, apply_preset=False)
+    assert profile.offload_mode == "model"
+    assert profile.dtype == "bf16"
+
+
+def test_prepare_cli_inference_args_validates_parallel():
+    args = argparse.Namespace(
+        memory_preset=None,
+        ulysses_degree=2,
+        ring_degree=2,
+        cpu_smoke=False,
+        device=None,
+        enable_sequential_cpu_offload=False,
+        enable_model_cpu_offload=False,
+    )
+    with mock.patch.dict(os.environ, {"WORLD_SIZE": "3"}):
+        with pytest.raises(ValueError, match="ulysses_degree"):
+            prepare_cli_inference_args(args)
+
+
+def test_validate_cpu_offload_rejected_on_cpu_smoke():
+    from videotuna.utils.inference_cli import validate_cpu_offload_flags
+
+    args = argparse.Namespace(
+        cpu_smoke=True,
+        device=None,
+        enable_sequential_cpu_offload=True,
+        enable_model_cpu_offload=False,
+        memory_preset=None,
+    )
+    with pytest.raises(RuntimeError, match="CPU offload flags"):
+        validate_cpu_offload_flags(args)
+
+
+def test_validate_cpu_offload_both_flags_sequential_wins():
+    from videotuna.utils.inference_cli import validate_cpu_offload_flags
+
+    args = argparse.Namespace(
+        cpu_smoke=False,
+        device="cuda:0",
+        enable_sequential_cpu_offload=True,
+        enable_model_cpu_offload=True,
+        memory_preset=None,
+    )
+    with (
+        mock.patch("videotuna.utils.device_utils.gpu_is_available", return_value=True),
+        mock.patch(
+            "videotuna.utils.device_utils.detect_compute_backend", return_value="cuda"
+        ),
+        mock.patch("videotuna.utils.device_utils.resolve_cpu_mode", return_value="off"),
+        mock.patch("videotuna.utils.inference_cli.logger.warning") as warn,
+    ):
+        validate_cpu_offload_flags(args)
+
+    assert args.enable_sequential_cpu_offload is True
+    assert args.enable_model_cpu_offload is False
+    warn.assert_called_once()
+    assert "sequential" in warn.call_args[0][0].lower()
+    profile = resolve_inference_profile(args, apply_preset=False)
+    assert profile.offload_mode == "sequential"
+
+
+def test_apply_cpu_smoke_env():
+    from videotuna.utils.inference_cli import apply_cpu_smoke_env
+
+    args = argparse.Namespace(cpu_smoke=True)
+    with mock.patch.dict(os.environ, {}, clear=True):
+        apply_cpu_smoke_env(args)
+        assert os.environ["VIDEOTUNA_CPU_MODE"] == "smoke"
+        assert os.environ["VIDEOTUNA_ATTN_BACKEND"] == "eager"
+        assert os.environ["VIDEOTUNA_TORCH_COMPILE"] == "0"
+
+
+@mock.patch.dict(
+    os.environ,
+    {"VIDEOTUNA_ATTN_BACKEND": "flash", "VIDEOTUNA_ATTN_BACKEND_STRICT": "0"},
+)
+def test_attn_flash_fallback_to_sdpa():
+    from videotuna.utils import attention
+
+    with mock.patch.object(attention, "_FLASH_ATTN_AVAILABLE", False):
+        with mock.patch.object(
+            attention, "detect_compute_backend", return_value="cuda"
+        ):
+            with mock.patch.object(attention, "gpu_is_available", return_value=True):
+                assert attention.get_attn_backend() == "sdpa"
+
+
+@mock.patch.dict(
+    os.environ,
+    {"VIDEOTUNA_ATTN_BACKEND": "flash", "VIDEOTUNA_ATTN_BACKEND_STRICT": "1"},
+)
+def test_attn_flash_strict_raises():
+    from videotuna.utils import attention
+
+    with mock.patch.object(attention, "_FLASH_ATTN_AVAILABLE", False):
+        with mock.patch.object(
+            attention, "detect_compute_backend", return_value="cuda"
+        ):
+            with pytest.raises(RuntimeError, match="flash-attn"):
+                attention.get_attn_backend()
+
+
+def test_attn_auto_resolves():
+    import torch
+
+    from videotuna.utils.attention import get_attn_backend
+
+    backend = get_attn_backend()
+    if torch.cuda.is_available():
+        assert backend in ("flash", "sdpa", "eager")
+    else:
+        assert backend in ("sdpa", "eager")
+
+
+def test_apply_compile_env():
+    apply_compile_env(True)
+    assert os.environ["VIDEOTUNA_TORCH_COMPILE"] == "1"
+    apply_compile_env(False)
+    assert os.environ["VIDEOTUNA_TORCH_COMPILE"] == "0"
+
+
+@mock.patch.dict(os.environ, {"VIDEOTUNA_ATTN_BACKEND": "eager"})
+def test_monitor_resources_returns_extended_keys():
+    @monitor_resources(return_metrics=True, frames=10)
+    def dummy():
+        return "ok"
+
+    out = dummy()
+    assert out["result"] == "ok"
+    assert "peak_vram_gb" in out
+    assert "seconds_per_frame" in out
+    assert out["attention_backend"] == "eager"
+    assert "torch_compile" in out
+
+
+def test_benchmark_defaults_wan_only():
+    from scripts.benchmark_attn_backends import DEFAULT_MODEL, DEFAULT_NUM_FRAMES
+
+    assert DEFAULT_MODEL == "Wan-AI/Wan2.2-T2V-A14B-Diffusers"
+    assert DEFAULT_NUM_FRAMES == 17
+
+
+def test_inference_new_imports_argparse():
+    import torch
+
+    from videotuna.utils.device_utils import require_accelerator_for_flow
+
+    if torch.cuda.is_available():
+        require_accelerator_for_flow("videotuna.flow.wanvideo.WanVideoModelFlow")
+        return
+    with pytest.raises(RuntimeError, match="requires a GPU"):
+        require_accelerator_for_flow("videotuna.flow.wanvideo.WanVideoModelFlow")
+
+
+def test_require_nvidia_cuda_alias_raises_without_gpu():
+    import torch
+
+    from videotuna.utils.device_utils import require_nvidia_cuda_for_flow
+
+    if torch.cuda.is_available():
+        require_nvidia_cuda_for_flow("videotuna.flow.wanvideo.WanVideoModelFlow")
+        return
+    with pytest.raises(RuntimeError, match="requires a GPU"):
+        require_nvidia_cuda_for_flow("videotuna.flow.wanvideo.WanVideoModelFlow")
+
+
+def test_save_metrics_writes_metrics_json():
+    with tempfile.TemporaryDirectory() as tmp:
+        save_metrics(
+            savedir=tmp,
+            gpu=[1.5],
+            time=[10.0],
+            frames=5,
+        )
+        path = os.path.join(tmp, "metrics.json")
+        assert os.path.exists(path)
+        with open(path) as f:
+            data = json.load(f)
+        assert "per_sample" in data
+        assert os.path.exists(os.path.join(tmp, "metric.json"))
+
+
+def test_apply_diffusers_optimizations_compiles_when_no_offload():
+    from unittest.mock import MagicMock
+
+    from videotuna.utils import diffusers_optimizations
+
+    transformer = MagicMock(name="transformer")
+    compiled = MagicMock(name="compiled_transformer")
+    pipe = MagicMock()
+    pipe.transformer = transformer
+    args = argparse.Namespace(
+        enable_sequential_cpu_offload=False,
+        enable_model_cpu_offload=False,
+        enable_vae_slicing=False,
+        enable_vae_tiling=False,
+        fuse_qkv=False,
+        enable_attention_cache=False,
+        device=None,
+        device_map=None,
+    )
+    with mock.patch.object(
+        diffusers_optimizations, "maybe_compile_denoiser", return_value=compiled
+    ) as compile_mock:
+        with mock.patch.object(
+            diffusers_optimizations, "apply_diffusers_attention_backend"
+        ):
+            with mock.patch.object(diffusers_optimizations, "resolve_inference_device"):
+                with mock.patch.object(pipe, "to"):
+                    diffusers_optimizations.apply_diffusers_optimizations(pipe, args)
+    compile_mock.assert_called_once_with(transformer)
+    assert pipe.transformer is compiled
+
+
+def test_apply_diffusers_optimizations_skips_compile_with_offload():
+    from unittest.mock import MagicMock
+
+    from videotuna.utils import diffusers_optimizations
+
+    pipe = MagicMock()
+    pipe.transformer = MagicMock(name="transformer")
+    args = argparse.Namespace(
+        enable_sequential_cpu_offload=False,
+        enable_model_cpu_offload=True,
+        enable_vae_slicing=False,
+        enable_vae_tiling=False,
+        fuse_qkv=False,
+        enable_attention_cache=False,
+        device=None,
+        device_map=None,
+    )
+    with mock.patch.object(
+        diffusers_optimizations, "maybe_compile_denoiser"
+    ) as compile_mock:
+        with mock.patch.object(
+            diffusers_optimizations, "apply_diffusers_attention_backend"
+        ):
+            diffusers_optimizations.apply_diffusers_optimizations(pipe, args)
+    compile_mock.assert_not_called()
+
+
+def test_require_accelerator_for_flow_raises_without_gpu():
+    from pathlib import Path
+
+    source = (
+        Path(__file__).resolve().parents[1] / "scripts" / "inference_new.py"
+    ).read_text(encoding="utf-8")
+    assert "generic_inference_entry" in source
+    assert "import argparse" not in source
+
+
+@pytest.mark.gpu
+def test_wan_domain_low_vram_quanto_int4_gpu_smoke():
+    """GPU integration: optimum-quanto int4_wo on Wan 2.2 low-VRAM preset."""
+    pytest.importorskip("optimum.quanto")
+
+    from pathlib import Path
+
+    import torch
+    import yaml
+    from diffusers import WanPipeline
+
+    from videotuna.utils.device_utils import detect_compute_backend
+    from videotuna.utils.diffusers_optimizations import apply_diffusers_optimizations
+    from videotuna.utils.diffusers_quantization import (
+        build_pipeline_quantization_config,
+        maybe_adjust_offload_for_quant,
+        resolve_quant_components,
+        validate_transformer_quant,
+    )
+
+    if not torch.cuda.is_available():
+        pytest.skip("GPU not available")
+    if detect_compute_backend() == "rocm":
+        pytest.skip("quanto quant is CUDA-only")
+
+    repo_root = Path(__file__).resolve().parents[1]
+    preset_path = (
+        repo_root
+        / "configs"
+        / "inference"
+        / "presets"
+        / "wan_domain_lora_smoke_22_low_vram.yaml"
+    )
+    cfg = yaml.safe_load(preset_path.read_text(encoding="utf-8"))
+    flow_params = cfg["flow"]["params"]
+    inference = cfg["inference"]
+    model_id = flow_params["pretrained_model_name_or_path"]
+    model_variant = flow_params["model_variant"]
+
+    args = argparse.Namespace(
+        transformer_quant="int4_wo",
+        quant_backend="quanto",
+        enable_sequential_cpu_offload=inference["enable_sequential_cpu_offload"],
+        enable_model_cpu_offload=False,
+        enable_vae_tiling=inference["enable_vae_tiling"],
+        enable_vae_slicing=False,
+        fuse_qkv=False,
+        enable_attention_cache=False,
+        device=None,
+        device_map=None,
+    )
+    maybe_adjust_offload_for_quant(args, "int4_wo")
+    assert args.enable_sequential_cpu_offload is False
+    assert args.enable_model_cpu_offload is True
+
+    validate_transformer_quant(
+        transformer_quant="int4_wo",
+        quant_backend="quanto",
+        offload_mode="model",
+    )
+    components = resolve_quant_components("wan", model_variant, "t2v")
+    assert components == ["transformer", "transformer_2"]
+    quant_config = build_pipeline_quantization_config(
+        transformer_quant="int4_wo",
+        quant_backend="quanto",
+        components=components,
+    )
+    assert quant_config is not None
+    assert set(quant_config.quant_mapping.keys()) == {"transformer", "transformer_2"}
+
+    pipe = WanPipeline.from_pretrained(
+        model_id,
+        torch_dtype=torch.float16,
+        quantization_config=quant_config,
+    )
+    apply_diffusers_optimizations(pipe, args, model_family="wan")
+
+    generator = torch.Generator(device="cuda").manual_seed(42)
+    output = pipe(
+        prompt="sks_style, slow camera push-in, soft lighting",
+        num_frames=5,
+        height=256,
+        width=448,
+        num_inference_steps=2,
+        guidance_scale=5.0,
+        generator=generator,
+    )
+    assert output is not None
+    assert len(output.frames[0]) == 5
diff --git a/tests/test_lint_autofix_smoke.py b/tests/test_lint_autofix_smoke.py
new file mode 100644
index 00000000..18993eea
--- /dev/null
+++ b/tests/test_lint_autofix_smoke.py
@@ -0,0 +1,8 @@
+"""Temporary file for CI lint-autofix verification; remove after merge."""
+
+import os
+import sys
+
+
+def _lint_autofix_smoke() -> tuple[str, str]:
+    return (os.getcwd(), sys.version)
diff --git a/tests/test_logging_config.py b/tests/test_logging_config.py
new file mode 100644
index 00000000..f6721bd4
--- /dev/null
+++ b/tests/test_logging_config.py
@@ -0,0 +1,112 @@
+"""Tests for central loguru logging configuration."""
+
+from __future__ import annotations
+
+import io
+
+import pytest
+from loguru import logger
+
+from videotuna.utils import logging_config
+from videotuna.utils.logging_config import (
+    bound_logger,
+    configure_logging,
+    phase_from_wan_task,
+    resolve_device_label,
+)
+
+
+@pytest.fixture(autouse=True)
+def reset_loguru_handlers():
+    logger.remove()
+    logging_config._configured = False
+    logging_config._stderr_handler_id = None
+    logging_config._file_handler_id = None
+    yield
+    logger.remove()
+    logging_config._configured = False
+    logging_config._stderr_handler_id = None
+    logging_config._file_handler_id = None
+
+
+def test_configure_logging_respects_log_level(monkeypatch, capsys):
+    monkeypatch.setenv("VIDEOTUNA_LOG_LEVEL", "WARNING")
+    configure_logging()
+
+    bound_logger(phase="t2i", flow="flux_lora").debug("hidden")
+    bound_logger(phase="t2i", flow="flux_lora").warning("visible")
+
+    captured = capsys.readouterr()
+    assert "hidden" not in captured.err
+    assert "visible" in captured.err
+
+
+def test_bound_logger_emits_structured_context():
+    configure_logging()
+
+    output = io.StringIO()
+    handler_id = logger.add(
+        output,
+        level="INFO",
+        format=(
+            "phase={extra[phase]} flow={extra[flow]} "
+            "device={extra[device]} | {message}"
+        ),
+    )
+
+    bound_logger(phase="t2v", flow="wanvideo", device="cuda:0").info("hello")
+
+    logger.remove(handler_id)
+    text = output.getvalue()
+    assert "phase=t2v" in text
+    assert "flow=wanvideo" in text
+    assert "device=cuda:0" in text
+    assert "hello" in text
+
+
+def test_configure_logging_is_idempotent():
+    configure_logging()
+    first_count = len(logger._core.handlers)  # noqa: SLF001
+
+    configure_logging()
+    second_count = len(logger._core.handlers)  # noqa: SLF001
+
+    assert first_count == second_count
+
+
+def test_configure_logging_adds_file_sink(tmp_path):
+    log_file = tmp_path / "train.log"
+    configure_logging(log_file=log_file)
+
+    bound_logger(phase="t2i", flow="flux_lora").info("file test")
+
+    text = log_file.read_text(encoding="utf-8")
+    assert text
+    assert "file test" in text
+
+
+def test_resolve_device_label():
+    import torch
+
+    assert resolve_device_label(None) == "-"
+    assert resolve_device_label("cuda:1") == "cuda:1"
+    assert resolve_device_label(torch.device("cpu")) == "cpu"
+    assert resolve_device_label(torch.device("cuda", 0)) == "cuda:0"
+
+
+@pytest.mark.parametrize(
+    ("task", "expected"),
+    [
+        ("t2v-14B", "t2v"),
+        ("i2v-14B", "i2v"),
+        ("t2i-14B", "t2i"),
+        ("unknown", "inference"),
+    ],
+)
+def test_phase_from_wan_task(task, expected):
+    assert phase_from_wan_task(task) == expected
+
+
+def test_configure_logging_adds_stderr_handler():
+    configure_logging()
+    assert len(logger._core.handlers) >= 1  # noqa: SLF001
diff --git a/tests/test_lora_utils.py b/tests/test_lora_utils.py
new file mode 100644
index 00000000..8639331c
--- /dev/null
+++ b/tests/test_lora_utils.py
@@ -0,0 +1,40 @@
+"""Tests for PEFT LoRA helpers."""
+
+import pytest
+import torch.nn as nn
+
+from videotuna.utils.lora_utils import (
+    parameter_matches_lora_target,
+    resolve_lora_target_modules,
+)
+
+
+class _TinyModel(nn.Module):
+    def __init__(self):
+        super().__init__()
+        self.linear = nn.Linear(4, 4)
+
+
+def test_resolve_all_linear():
+    model = _TinyModel()
+    assert resolve_lora_target_modules(model, "all-linear") == "all-linear"
+
+
+def test_resolve_explicit_list():
+    model = _TinyModel()
+    targets = resolve_lora_target_modules(model, ["linear"])
+    assert targets == ["linear"]
+
+
+@pytest.mark.parametrize(
+    ("param_name", "targets", "expected"),
+    [
+        ("blocks.0.self_attn.q.weight", ["q"], True),
+        ("blocks.0.self_attn.to_q.weight", ["q"], False),
+        ("blocks.0.ffn.0.weight", ["ffn.0"], True),
+        ("blocks.0.ffn.00.weight", ["ffn.0"], False),
+        ("blocks.0.unique.weight", ["q"], False),
+    ],
+)
+def test_parameter_matches_lora_target_edge_cases(param_name, targets, expected):
+    assert parameter_matches_lora_target(param_name, targets) is expected
diff --git a/tests/test_poetry_scripts.py b/tests/test_poetry_scripts.py
new file mode 100644
index 00000000..b9c1e735
--- /dev/null
+++ b/tests/test_poetry_scripts.py
@@ -0,0 +1,19 @@
+import importlib
+import tomllib
+from pathlib import Path
+
+
+def _poetry_script_entrypoints():
+    pyproject = Path(__file__).resolve().parents[1] / "pyproject.toml"
+    data = tomllib.loads(pyproject.read_text())
+    scripts = data.get("tool", {}).get("poetry", {}).get("scripts", {})
+    return scripts
+
+
+def test_poetry_scripts_resolve():
+    for name, target in _poetry_script_entrypoints().items():
+        module_name, _, attr_name = target.partition(":")
+        assert module_name, f"{name} has invalid target {target!r}"
+        assert attr_name, f"{name} has invalid target {target!r}"
+        module = importlib.import_module(module_name)
+        assert hasattr(module, attr_name), f"{name} -> {target} not found"
diff --git a/tests/test_provision_retry.py b/tests/test_provision_retry.py
new file mode 100644
index 00000000..3e2c22a6
--- /dev/null
+++ b/tests/test_provision_retry.py
@@ -0,0 +1,206 @@
+"""CPU tests for cloud/vast/provision_retry.py (no network)."""
+
+from __future__ import annotations
+
+import subprocess
+import sys
+from pathlib import Path
+from unittest.mock import MagicMock, patch
+
+import pytest
+
+REPO_ROOT = Path(__file__).resolve().parents[1]
+CLOUD_VAST = REPO_ROOT / "cloud" / "vast"
+
+# Import from cloud/vast without installing as package
+sys.path.insert(0, str(CLOUD_VAST))
+import provision_retry  # noqa: E402
+
+
+@pytest.fixture
+def manifest_yaml(tmp_path: Path) -> Path:
+    path = tmp_path / "provisioning.yaml"
+    path.write_text(
+        """
+version: 1
+settings:
+  retry:
+    max_attempts: 5
+    initial_delay: 2
+    backoff_multiplier: 2
+""",
+        encoding="utf-8",
+    )
+    return path
+
+
+def test_load_retry_settings_from_manifest(manifest_yaml: Path):
+    settings = provision_retry.load_retry_settings(manifest_yaml)
+    assert settings.max_attempts == 5
+    assert settings.initial_delay == 2
+    assert settings.backoff_multiplier == 2
+
+
+def test_load_retry_settings_defaults_when_missing():
+    settings = provision_retry.load_retry_settings(
+        Path("/nonexistent/provisioning.yaml")
+    )
+    assert settings.max_attempts == 5
+    assert settings.initial_delay == 2
+    assert settings.backoff_multiplier == 2
+
+
+def test_wait_seconds_matches_exponential_backoff():
+    settings = provision_retry.RetrySettings(
+        max_attempts=5, initial_delay=2, backoff_multiplier=2
+    )
+    assert provision_retry._wait_seconds(settings, 1) == 2
+    assert provision_retry._wait_seconds(settings, 2) == 4
+    assert provision_retry._wait_seconds(settings, 3) == 8
+
+
+def _simple_retry_decorator(settings: provision_retry.RetrySettings):
+    """Test double for tenacity when tenacity is not installed in Poetry env."""
+
+    def decorator(fn):
+        def wrapper(*args, **kwargs):
+            last_exc: BaseException | None = None
+            for attempt in range(1, settings.max_attempts + 1):
+                try:
+                    return fn(*args, **kwargs)
+                except (subprocess.CalledProcessError, OSError) as exc:
+                    last_exc = exc
+            assert last_exc is not None
+            raise last_exc
+
+        return wrapper
+
+    return decorator
+
+
+@patch("provision_retry.subprocess.run")
+@patch("provision_retry._make_retry_decorator", side_effect=_simple_retry_decorator)
+def test_run_command_retries_then_succeeds(
+    _mock_decorator: MagicMock, mock_run: MagicMock
+):
+    mock_run.side_effect = [
+        subprocess.CalledProcessError(1, ["false"]),
+        subprocess.CalledProcessError(1, ["false"]),
+        MagicMock(returncode=0),
+    ]
+    settings = provision_retry.RetrySettings(
+        max_attempts=5, initial_delay=0, backoff_multiplier=2
+    )
+    provision_retry.run_command(["echo", "ok"], settings=settings)
+    assert mock_run.call_count == 3
+
+
+@patch("provision_retry.subprocess.run")
+@patch("provision_retry._make_retry_decorator", side_effect=_simple_retry_decorator)
+def test_run_command_exhausts_retries(_mock_decorator: MagicMock, mock_run: MagicMock):
+    mock_run.side_effect = subprocess.CalledProcessError(1, ["false"])
+    settings = provision_retry.RetrySettings(
+        max_attempts=3, initial_delay=0, backoff_multiplier=2
+    )
+    with pytest.raises(subprocess.CalledProcessError):
+        provision_retry.run_command(["false"], settings=settings)
+    assert mock_run.call_count == 3
+
+
+@patch("provision_retry.run_command")
+def test_hf_download_skips_when_sentinel_exists(mock_run: MagicMock, tmp_path: Path):
+    local_dir = tmp_path / "model"
+    local_dir.mkdir()
+    (local_dir / provision_retry.DOWNLOAD_OK_SENTINEL).write_text(
+        "ok\n", encoding="utf-8"
+    )
+    provision_retry.hf_download("org/model", local_dir)
+    mock_run.assert_not_called()
+
+
+@patch("provision_retry.run_command")
+def test_hf_download_invokes_hf_and_writes_sentinel(
+    mock_run: MagicMock, tmp_path: Path
+):
+    local_dir = tmp_path / "model"
+    with patch("provision_retry.shutil.which", return_value="/usr/bin/hf"):
+        provision_retry.hf_download("org/model", local_dir)
+    mock_run.assert_called_once()
+    argv = mock_run.call_args[0][0]
+    assert argv == [
+        "hf",
+        "download",
+        "org/model",
+        "--local-dir",
+        str(local_dir.resolve()),
+    ]
+    assert (local_dir / provision_retry.DOWNLOAD_OK_SENTINEL).is_file()
+
+
+@patch("provision_retry.time.sleep")
+@patch("provision_retry.subprocess.run")
+def test_install_bootstrap_deps_retries_without_tenacity(
+    mock_run: MagicMock,
+    mock_sleep: MagicMock,
+    tmp_path: Path,
+    monkeypatch: pytest.MonkeyPatch,
+):
+    req = tmp_path / "bootstrap-requirements.txt"
+    req.write_text("tenacity>=9.0.0\n", encoding="utf-8")
+    monkeypatch.setattr(provision_retry, "BOOTSTRAP_REQUIREMENTS", req)
+    mock_run.side_effect = [
+        subprocess.CalledProcessError(1, ["pip"]),
+        MagicMock(returncode=0),
+    ]
+    settings = provision_retry.RetrySettings(
+        max_attempts=5, initial_delay=0, backoff_multiplier=2
+    )
+    provision_retry.install_bootstrap_deps(settings)
+    assert mock_run.call_count == 2
+    mock_sleep.assert_called_once()
+
+
+@patch("provision_retry.run_command")
+def test_hf_download_hub_cache_skips_when_sentinel_exists(
+    mock_run: MagicMock, tmp_path: Path, monkeypatch: pytest.MonkeyPatch
+):
+    repo_id = "Wan-AI/Wan2.2-T2V-A14B-Diffusers"
+    sentinel_dir = (
+        tmp_path / ".privtune_hub_cache" / "Wan-AI--Wan2.2-T2V-A14B-Diffusers"
+    )
+    sentinel_dir.mkdir(parents=True)
+    (sentinel_dir / provision_retry.DOWNLOAD_OK_SENTINEL).write_text(
+        "ok\n", encoding="utf-8"
+    )
+    monkeypatch.setenv("HF_HOME", str(tmp_path))
+    provision_retry.hf_download_hub_cache(repo_id)
+    mock_run.assert_not_called()
+
+
+@patch("provision_retry.run_command")
+def test_hf_download_hub_cache_invokes_hf_without_local_dir(
+    mock_run: MagicMock, tmp_path: Path, monkeypatch: pytest.MonkeyPatch
+):
+    repo_id = "Wan-AI/Wan2.2-T2V-A14B-Diffusers"
+    monkeypatch.setenv("HF_HOME", str(tmp_path))
+    with patch("provision_retry.shutil.which", return_value="/usr/bin/hf"):
+        provision_retry.hf_download_hub_cache(repo_id)
+    mock_run.assert_called_once()
+    argv = mock_run.call_args[0][0]
+    assert argv == ["hf", "download", repo_id]
+    sentinel = provision_retry._hub_cache_sentinel_path(repo_id)
+    assert sentinel.is_file()
+
+
+def test_main_run_subcommand_exit_code():
+    with patch("provision_retry.run_command") as mock_run:
+        code = provision_retry.main(["run", "--", "echo", "hi"])
+    assert code == 0
+    mock_run.assert_called_once_with(
+        ["echo", "hi"], settings=mock_run.call_args[1]["settings"]
+    )
+
+
+def test_main_run_missing_command_returns_error():
+    with pytest.raises(SystemExit):
+        provision_retry.main(["run"])
diff --git a/tests/test_training_dataloader.py b/tests/test_training_dataloader.py
new file mode 100644
index 00000000..2de8e509
--- /dev/null
+++ b/tests/test_training_dataloader.py
@@ -0,0 +1,58 @@
+"""Tests for training DataLoader configuration."""
+
+import pytest
+import torch
+from torch.utils.data import Dataset
+
+from videotuna.data.lightningdata import DataModuleFromConfig
+
+
+class _TinyDataset(Dataset):
+    def __len__(self):
+        return 4
+
+    def __getitem__(self, idx):
+        return {"video": torch.zeros(3, 2, 8, 8), "caption": f"item-{idx}"}
+
+
+@pytest.fixture
+def tiny_datamodule_config():
+    return {
+        "batch_size": 2,
+        "num_workers": 0,
+        "pin_memory": True,
+        "persistent_workers": False,
+        "prefetch_factor": 2,
+        "train": {
+            "target": "tests.test_training_dataloader._TinyDataset",
+            "params": {},
+        },
+    }
+
+
+def test_datamodule_dataloader_kwargs(tiny_datamodule_config):
+    dm = DataModuleFromConfig(**tiny_datamodule_config)
+    dm.setup()
+    loader = dm.train_dataloader()
+    assert loader.batch_size == 2
+    assert loader.pin_memory is True
+    assert loader.num_workers == 0
+
+
+def test_datamodule_collate_default_batch(tiny_datamodule_config):
+    dm = DataModuleFromConfig(**tiny_datamodule_config)
+    dm.setup()
+    batch = next(iter(dm.train_dataloader()))
+    assert batch["video"].shape[0] == 2
+    assert len(batch["caption"]) == 2
+
+
+def test_default_num_workers_not_batch_scaled():
+    dm = DataModuleFromConfig(
+        batch_size=8,
+        train={
+            "target": "tests.test_training_dataloader._TinyDataset",
+            "params": {},
+        },
+    )
+    assert dm.num_workers == 4
diff --git a/tests/test_training_metrics.py b/tests/test_training_metrics.py
new file mode 100644
index 00000000..404fde59
--- /dev/null
+++ b/tests/test_training_metrics.py
@@ -0,0 +1,77 @@
+"""Tests for training metrics backend resolution."""
+
+from __future__ import annotations
+
+from unittest import mock
+
+import pytest
+
+from videotuna.utils.training_metrics import (
+    build_trackio_init_kwargs,
+    describe_metrics_backend,
+    log_validation_image_to_trackio,
+    require_trackio,
+    resolve_accelerate_log_with,
+    trackio_enabled,
+)
+
+
+def test_resolve_accelerate_log_with_tensorboard_default():
+    assert resolve_accelerate_log_with("tensorboard") == "tensorboard"
+
+
+def test_resolve_accelerate_log_with_trackio_dual_mode(monkeypatch):
+    monkeypatch.setattr(
+        "videotuna.utils.training_metrics.trackio_available",
+        lambda: True,
+    )
+    assert resolve_accelerate_log_with("trackio") == ["tensorboard", "trackio"]
+
+
+def test_require_trackio_raises_when_missing(monkeypatch):
+    monkeypatch.setattr(
+        "videotuna.utils.training_metrics.trackio_available",
+        lambda: False,
+    )
+    with pytest.raises(ImportError, match="poetry install -E trackio"):
+        require_trackio()
+
+
+def test_trackio_enabled():
+    assert trackio_enabled("trackio") is True
+    assert trackio_enabled("tensorboard") is False
+
+
+def test_describe_metrics_backend():
+    assert describe_metrics_backend("tensorboard") == "tensorboard"
+    assert describe_metrics_backend("trackio") == "tensorboard + trackio"
+
+
+def test_build_trackio_init_kwargs_without_space_id():
+    assert build_trackio_init_kwargs(space_id=None) is None
+
+
+def test_build_trackio_init_kwargs_with_space_id():
+    assert build_trackio_init_kwargs(space_id="user/privtune-trackio") == {
+        "trackio": {"space_id": "user/privtune-trackio"}
+    }
+
+
+def test_log_validation_image_to_trackio(monkeypatch):
+    fake_image = mock.MagicMock()
+    fake_trackio = mock.MagicMock()
+    fake_trackio.Image = mock.MagicMock(side_effect=lambda img: ("image", img))
+
+    monkeypatch.setattr(
+        "videotuna.utils.training_metrics.trackio_available",
+        lambda: True,
+    )
+    monkeypatch.setitem(__import__("sys").modules, "trackio", fake_trackio)
+
+    log_validation_image_to_trackio(fake_image, step=42)
+
+    fake_trackio.Image.assert_called_once_with(fake_image)
+    fake_trackio.log.assert_called_once_with(
+        {"validation/sample": ("image", fake_image)},
+        step=42,
+    )
diff --git a/tests/test_training_metrics_callback.py b/tests/test_training_metrics_callback.py
new file mode 100644
index 00000000..6892feba
--- /dev/null
+++ b/tests/test_training_metrics_callback.py
@@ -0,0 +1,36 @@
+"""Tests for training metrics callback."""
+
+import json
+import os
+import tempfile
+from unittest import mock
+
+from videotuna.utils.callbacks import TrainingMetricsCallback
+
+
+def test_training_metrics_callback_writes_metrics_json():
+    callback = TrainingMetricsCallback()
+    trainer = mock.MagicMock()
+    trainer.current_epoch = 0
+    trainer.global_rank = 0
+    trainer.global_step = 42
+    pl_module = mock.MagicMock()
+    pl_module.logdir = None
+    trainer.logger = mock.MagicMock()
+
+    with tempfile.TemporaryDirectory() as tmpdir:
+        callback.save_dir = tmpdir
+        callback.on_train_epoch_start(trainer, pl_module)
+        callback.on_train_epoch_end(trainer, pl_module)
+
+        metrics_path = os.path.join(tmpdir, "metrics.json")
+        assert os.path.isfile(metrics_path)
+        with open(metrics_path) as f:
+            data = json.load(f)
+        assert len(data["epochs"]) == 1
+        assert "epoch_time_s" in data["epochs"][0]
+        assert "peak_vram_gb" in data["epochs"][0]
+        trainer.logger.log_metrics.assert_called_once()
+        logged = trainer.logger.log_metrics.call_args[0][0]
+        assert "epoch_time_s" in logged
+        assert "peak_vram_gb" in logged
diff --git a/tests/test_vendor_import_boundary.py b/tests/test_vendor_import_boundary.py
new file mode 100644
index 00000000..d39407ea
--- /dev/null
+++ b/tests/test_vendor_import_boundary.py
@@ -0,0 +1,44 @@
+"""Guard vendor-only third-party imports stay inside vendored trees."""
+
+from __future__ import annotations
+
+from pathlib import Path
+
+REPO_ROOT = Path(__file__).resolve().parents[1]
+VIDEOTUNA_ROOT = REPO_ROOT / "videotuna"
+WAN_VENDOR_ROOT = VIDEOTUNA_ROOT / "models" / "wan"
+WAN_CONFIGS_ROOT = WAN_VENDOR_ROOT / "wan" / "configs"
+
+_EASYDICT_MARKERS = ("from easydict", "import easydict")
+
+
+def _iter_py_files(root: Path) -> list[Path]:
+    return sorted(root.rglob("*.py"))
+
+
+def _file_mentions_easydict(path: Path) -> bool:
+    text = path.read_text(encoding="utf-8")
+    return any(marker in text for marker in _EASYDICT_MARKERS)
+
+
+def test_easydict_only_in_wan_vendor_tree():
+    """easydict imports must stay inside videotuna/models/wan/."""
+    outside_wan = [
+        path
+        for path in _iter_py_files(VIDEOTUNA_ROOT)
+        if WAN_VENDOR_ROOT not in path.parents and _file_mentions_easydict(path)
+    ]
+    assert outside_wan == [], (
+        "easydict is Wan-vendor-only; remove imports from: "
+        + ", ".join(str(p.relative_to(REPO_ROOT)) for p in outside_wan)
+    )
+
+
+def test_wan_configs_still_use_easydict():
+    """Sanity: upstream Wan config modules still depend on easydict."""
+    config_files = _iter_py_files(WAN_CONFIGS_ROOT)
+    assert (
+        config_files
+    ), "expected Wan config modules under videotuna/models/wan/wan/configs/"
+    users = [path for path in config_files if _file_mentions_easydict(path)]
+    assert users, "expected at least one Wan config module to import easydict"
diff --git a/tests/test_video_io.py b/tests/test_video_io.py
new file mode 100644
index 00000000..bfa62d57
--- /dev/null
+++ b/tests/test_video_io.py
@@ -0,0 +1,181 @@
+"""Tests for videotuna.utils.video_io."""
+
+import pytest
+import torch
+from torchvision.io import write_video
+
+from videotuna.utils.video_io import (
+    AvVideoReader,
+    get_video_fps,
+    get_video_frame_count,
+    read_video_frames,
+    sample_frame_indices,
+)
+
+
+@pytest.fixture
+def tiny_mp4(tmp_path):
+    path = tmp_path / "test.mp4"
+    num_frames = 24
+    frames = torch.randint(0, 255, (num_frames, 3, 64, 64), dtype=torch.uint8)
+    write_video(
+        str(path),
+        frames.permute(0, 2, 3, 1),
+        fps=8,
+        video_codec="h264",
+    )
+    return path, num_frames
+
+
+def test_sample_frame_indices_length():
+    indices = sample_frame_indices(100, num_frames=16, frame_interval=1, begin_index=0)
+    assert len(indices) == 16
+    assert indices[0] == 0
+    assert indices[-1] <= 99
+
+
+def test_sample_frame_indices_with_interval():
+    indices = sample_frame_indices(200, num_frames=8, frame_interval=4, begin_index=10)
+    assert len(indices) == 8
+    assert indices[0] == 10
+    assert indices[-1] <= 10 + 8 * 4
+
+
+def test_sample_frame_indices_rejects_short_video():
+    with pytest.raises(ValueError):
+        sample_frame_indices(10, num_frames=16, frame_interval=1)
+
+
+def test_sample_frame_indices_random_begin():
+    runs = [sample_frame_indices(120, 16, 1)[0] for _ in range(20)]
+    assert min(runs) >= 0
+    assert max(runs) <= 120 - 16
+
+
+def test_get_video_frame_count(tiny_mp4):
+    path, num_frames = tiny_mp4
+    count = get_video_frame_count(str(path))
+    assert count >= num_frames - 2
+    assert count <= num_frames + 2
+
+
+def test_get_video_fps(tiny_mp4):
+    path, _ = tiny_mp4
+    fps = get_video_fps(str(path))
+    assert fps > 0
+
+
+def test_read_video_frames_subset(tiny_mp4):
+    path, num_frames = tiny_mp4
+    indices = [0, num_frames // 2, num_frames - 1]
+    frames = read_video_frames(str(path), indices, backend="av")
+    assert frames.shape == (3, 3, 64, 64)
+    assert frames.dtype == torch.uint8
+
+
+def test_read_video_frames_rejects_missing_index(tiny_mp4):
+    path, num_frames = tiny_mp4
+    with pytest.raises(RuntimeError, match="Failed to decode"):
+        read_video_frames(str(path), [num_frames + 50], backend="av")
+
+
+def test_av_video_reader_batch_shape(tiny_mp4):
+    path, num_frames = tiny_mp4
+    reader = AvVideoReader(str(path))
+    batch = reader.get_batch([0, num_frames - 1])
+    arr = batch.asnumpy()
+    assert arr.shape == (2, 64, 64, 3)
+    assert arr.dtype.name == "uint8"
+    assert len(reader) >= num_frames - 2
+    assert reader.get_avg_fps() > 0
+
+
+def test_auto_backend_prefers_torchcodec_when_available(tiny_mp4, monkeypatch):
+    path, num_frames = tiny_mp4
+    indices = [0, 1]
+    expected = torch.zeros((2, 3, 64, 64), dtype=torch.uint8)
+    calls: list[str] = []
+
+    monkeypatch.setattr("videotuna.utils.video_io._torchcodec_available", lambda: True)
+
+    def fake_torchcodec(video_path, idx):
+        calls.append("torchcodec")
+        assert video_path == str(path)
+        assert list(idx) == indices
+        return expected
+
+    def fake_av(*args, **kwargs):
+        calls.append("av")
+        raise AssertionError("av should not be called when torchcodec succeeds")
+
+    monkeypatch.setattr("videotuna.utils.video_io._read_torchcodec", fake_torchcodec)
+    monkeypatch.setattr("videotuna.utils.video_io._read_av", fake_av)
+
+    frames = read_video_frames(str(path), indices, backend="auto")
+    assert torch.equal(frames, expected)
+    assert calls == ["torchcodec"]
+
+
+def test_auto_backend_skips_torchcodec_when_unavailable(tiny_mp4, monkeypatch):
+    path, num_frames = tiny_mp4
+    indices = [0]
+    calls: list[str] = []
+
+    monkeypatch.setattr("videotuna.utils.video_io._torchcodec_available", lambda: False)
+
+    def fake_torchcodec(*args, **kwargs):
+        calls.append("torchcodec")
+        raise AssertionError("torchcodec should not be called when unavailable")
+
+    def fake_av(video_path, idx, **kwargs):
+        calls.append("av")
+        return torch.zeros((1, 3, 64, 64), dtype=torch.uint8)
+
+    monkeypatch.setattr("videotuna.utils.video_io._read_torchcodec", fake_torchcodec)
+    monkeypatch.setattr("videotuna.utils.video_io._read_av", fake_av)
+
+    read_video_frames(str(path), indices, backend="auto")
+    assert calls == ["av"]
+
+
+def test_auto_backend_falls_back_to_av_on_torchcodec_failure(tiny_mp4, monkeypatch):
+    path, num_frames = tiny_mp4
+    indices = [0]
+    expected = torch.ones((1, 3, 64, 64), dtype=torch.uint8)
+    calls: list[str] = []
+
+    monkeypatch.setattr("videotuna.utils.video_io._torchcodec_available", lambda: True)
+
+    def fake_torchcodec(*args, **kwargs):
+        calls.append("torchcodec")
+        raise RuntimeError("torchcodec decode failed")
+
+    def fake_av(video_path, idx, **kwargs):
+        calls.append("av")
+        return expected
+
+    monkeypatch.setattr("videotuna.utils.video_io._read_torchcodec", fake_torchcodec)
+    monkeypatch.setattr("videotuna.utils.video_io._read_av", fake_av)
+
+    frames = read_video_frames(str(path), indices, backend="auto")
+    assert torch.equal(frames, expected)
+    assert calls == ["torchcodec", "av"]
+
+
+def test_read_video_frames_auto_with_torchcodec_installed(tiny_mp4):
+    pytest.importorskip("torchcodec")
+    path, num_frames = tiny_mp4
+    indices = [0, num_frames // 2]
+    frames = read_video_frames(str(path), indices, backend="auto")
+    assert frames.shape[0] == 2
+    assert frames.dtype == torch.uint8
+    assert frames.ndim == 4
+
+
+def test_read_video_frames_torchcodec_backend(tiny_mp4):
+    pytest.importorskip("torchcodec")
+    path, num_frames = tiny_mp4
+    indices = [0, num_frames - 1]
+    frames = read_video_frames(str(path), indices, backend="torchcodec")
+    assert frames.shape[0] == 2
+    assert frames.dtype == torch.uint8
diff --git a/tests/test_wan_checkpoint.py b/tests/test_wan_checkpoint.py
new file mode 100644
index 00000000..5c2f8605
--- /dev/null
+++ b/tests/test_wan_checkpoint.py
@@ -0,0 +1,10 @@
+"""Tests for Wan checkpoint loading."""
+
+import pytest
+
+from videotuna.models.wan.wan.modules.model import WanModel
+
+
+def test_wan_from_pretrained_missing_dir():
+    with pytest.raises(FileNotFoundError, match="Wan checkpoint directory not found"):
+        WanModel.from_pretrained("/nonexistent/wan/checkpoint")
diff --git a/tests/test_wan_domain_i2v_smoke_22_config.py b/tests/test_wan_domain_i2v_smoke_22_config.py
new file mode 100644
index 00000000..2bb6e139
--- /dev/null
+++ b/tests/test_wan_domain_i2v_smoke_22_config.py
@@ -0,0 +1,21 @@
+"""CPU smoke tests for Wan 2.2 domain I2V validation preset."""
+
+from pathlib import Path
+
+from omegaconf import OmegaConf
+
+REPO_ROOT = Path(__file__).resolve().parents[1]
+WAN_I2V_SMOKE_22 = (
+    REPO_ROOT / "configs" / "inference" / "presets" / "wan_domain_i2v_smoke_22.yaml"
+)
+
+
+def test_wan_domain_i2v_smoke_22_yaml():
+    cfg = OmegaConf.load(WAN_I2V_SMOKE_22)
+    assert cfg.flow.params.mode == "i2v"
+    assert cfg.flow.params.model_variant == "2.2"
+    assert cfg.inference.height == 720
+    assert cfg.inference.width == 1280
+    assert cfg.inference.frames == 81
+    assert cfg.inference.num_inference_steps == 4
+    assert "Wan2.2-I2V" in cfg.flow.params.pretrained_model_name_or_path
diff --git a/tests/test_wan_domain_lora_smoke_22_config.py b/tests/test_wan_domain_lora_smoke_22_config.py
new file mode 100644
index 00000000..639ca52a
--- /dev/null
+++ b/tests/test_wan_domain_lora_smoke_22_config.py
@@ -0,0 +1,40 @@
+"""CPU smoke tests for Wan 2.2 domain LoRA validation preset."""
+
+from pathlib import Path
+
+import yaml
+
+REPO_ROOT = Path(__file__).resolve().parents[1]
+
+WAN_DOMAIN_SMOKE_22 = (
+    REPO_ROOT / "configs" / "inference" / "presets" / "wan_domain_lora_smoke_22.yaml"
+)
+WAN_DOMAIN_SMOKE_22_LOW_VRAM = (
+    REPO_ROOT
+    / "configs"
+    / "inference"
+    / "presets"
+    / "wan_domain_lora_smoke_22_low_vram.yaml"
+)
+
+
+def test_wan_domain_lora_smoke_22_yaml():
+    cfg = yaml.safe_load(WAN_DOMAIN_SMOKE_22.read_text(encoding="utf-8"))
+    assert cfg["flow"]["target"] == "videotuna.flow.diffusers_video.DiffusersVideoFlow"
+    assert cfg["flow"]["params"]["model_variant"] == "2.2"
+    inf = cfg["inference"]
+    assert inf["height"] == 720
+    assert inf["width"] == 1280
+    assert inf["frames"] == 81
+    assert inf["num_inference_steps"] == 4
+    assert inf["savefps"] == 16
+    assert inf["enable_model_cpu_offload"] is True
+    assert "sks_style" in inf["prompt_file"]
+
+
+def test_wan_domain_lora_smoke_22_low_vram_yaml():
+    cfg = yaml.safe_load(WAN_DOMAIN_SMOKE_22_LOW_VRAM.read_text(encoding="utf-8"))
+    inf = cfg["inference"]
+    assert inf["height"] == 720
+    assert inf["enable_sequential_cpu_offload"] is True
+    assert inf["min_vram_gb"] == 10
diff --git a/tests/test_wan_domain_surface.py b/tests/test_wan_domain_surface.py
new file mode 100644
index 00000000..d434d8f8
--- /dev/null
+++ b/tests/test_wan_domain_surface.py
@@ -0,0 +1,60 @@
+"""Guard: domain entrypoints must not reference pruned Wan variants."""
+
+from __future__ import annotations
+
+from pathlib import Path
+
+REPO_ROOT = Path(__file__).resolve().parents[1]
+
+FORBIDDEN_TOKENS = (
+    "s2v",
+    "animate",
+    "ti2v",
+    "speech2video",
+    "textimage2video",
+    "WanS2V",
+    "WanAnimate",
+    "WanTI2V",
+)
+
+DOMAIN_PATHS = (
+    REPO_ROOT / "configs" / "domain",
+    REPO_ROOT / "scripts" / "__init__.py",
+    REPO_ROOT / "scripts" / "train_new.py",
+    REPO_ROOT / "videotuna" / "flow" / "wanvideo.py",
+    REPO_ROOT / "videotuna" / "training" / "wan_lora",
+    REPO_ROOT / "videotuna" / "utils" / "wan_training.py",
+)
+
+
+def _iter_text_files(path: Path) -> list[Path]:
+    if path.is_file():
+        return [path]
+    return sorted(path.rglob("*")) if path.is_dir() else []
+
+
+def _find_forbidden_references() -> list[str]:
+    violations: list[str] = []
+    for domain_path in DOMAIN_PATHS:
+        for file_path in _iter_text_files(domain_path):
+            if not file_path.is_file() or file_path.suffix not in {
+                ".py",
+                ".yaml",
+                ".yml",
+                ".json",
+            }:
+                continue
+            text = file_path.read_text(encoding="utf-8")
+            for token in FORBIDDEN_TOKENS:
+                if token in text:
+                    rel = file_path.relative_to(REPO_ROOT)
+                    violations.append(f"{rel}: contains {token!r}")
+    return violations
+
+
+def test_domain_entrypoints_exclude_pruned_wan_variants():
+    """Domain scripts and configs must not reference s2v/animate/ti2v."""
+    violations = _find_forbidden_references()
+    assert violations == [], "Pruned Wan variant references found:\n" + "\n".join(
+        violations
+    )
diff --git a/tests/test_wan_i2v_dataset.py b/tests/test_wan_i2v_dataset.py
new file mode 100644
index 00000000..5d4a2b74
--- /dev/null
+++ b/tests/test_wan_i2v_dataset.py
@@ -0,0 +1,164 @@
+"""CPU tests for Wan I2V pair dataset loading."""
+
+from unittest.mock import patch
+
+import pytest
+import torch
+from PIL import Image
+from torchvision.io import write_video
+
+from videotuna.data.datasets import DatasetFromCSV
+
+
+def _write_test_mp4(path, num_frames: int = 90, height: int = 480, width: int = 832):
+    frames = torch.randint(0, 255, (num_frames, 3, height, width), dtype=torch.uint8)
+    write_video(
+        str(path),
+        frames.permute(0, 2, 3, 1),
+        fps=24,
+        video_codec="h264",
+    )
+
+
+@pytest.fixture
+def i2v_pair_dataset(tmp_path):
+    images_dir = tmp_path / "images"
+    videos_dir = tmp_path / "videos"
+    images_dir.mkdir()
+    videos_dir.mkdir()
+    image_path = images_dir / "ref001.jpg"
+    video_path = videos_dir / "clip001.mp4"
+    Image.new("RGB", (832, 480), color=(10, 20, 30)).save(image_path)
+    video_path.write_bytes(b"placeholder")
+    csv_path = tmp_path / "metadata.csv"
+    csv_path.write_text(
+        "image_path,video_path,caption\n"
+        f"{image_path},{video_path},"
+        '"sks_style, slow pan"\n',
+        encoding="utf-8",
+    )
+    fake_video = torch.randint(0, 255, (81, 3, 480, 832), dtype=torch.uint8)
+    with (
+        patch("videotuna.data.datasets.get_video_frame_count", return_value=100),
+        patch("videotuna.data.datasets.read_video_frames", return_value=fake_video),
+    ):
+        yield DatasetFromCSV(
+            str(csv_path),
+            height=480,
+            width=832,
+            num_frames=81,
+            frame_interval=1,
+            image_to_video=False,
+            train=True,
+        )
+
+
+def test_i2v_pair_dataset_emits_image_and_video(i2v_pair_dataset):
+    with (
+        patch("videotuna.data.datasets.get_video_frame_count", return_value=100),
+        patch(
+            "videotuna.data.datasets.read_video_frames",
+            return_value=torch.randint(0, 255, (81, 3, 480, 832), dtype=torch.uint8),
+        ),
+    ):
+        sample = i2v_pair_dataset[0]
+    assert sample["caption"] == "sks_style, slow pan"
+    assert sample["video"].shape == (3, 81, 480, 832)
+    assert sample["image"].shape == (3, 1, 480, 832)
+
+
+def test_image_to_video_clones_first_frame(tmp_path):
+    videos_dir = tmp_path / "videos"
+    videos_dir.mkdir()
+    video_path = videos_dir / "clip001.mp4"
+    video_path.write_bytes(b"placeholder")
+    csv_path = tmp_path / "metadata.csv"
+    csv_path.write_text(
+        f"path,caption\n{video_path},sks_style clip\n",
+        encoding="utf-8",
+    )
+    dataset = DatasetFromCSV(
+        str(csv_path),
+        height=480,
+        width=832,
+        num_frames=81,
+        image_to_video=True,
+        train=True,
+    )
+    with (
+        patch("videotuna.data.datasets.get_video_frame_count", return_value=100),
+        patch(
+            "videotuna.data.datasets.read_video_frames",
+            return_value=torch.randint(0, 255, (81, 3, 480, 832), dtype=torch.uint8),
+        ),
+    ):
+        sample = dataset[0]
+    assert torch.allclose(sample["image"], sample["video"][:, :1])
+
+
+def test_pair_csv_requires_image_path_column(tmp_path):
+    csv_path = tmp_path / "bad.csv"
+    csv_path.write_text("video_path,caption\na.mp4,cap\n", encoding="utf-8")
+    with pytest.raises(ValueError, match="image_path"):
+        DatasetFromCSV(str(csv_path), height=480, width=832, num_frames=16)
+
+
+def test_image_to_video_true_rejects_pair_csv(tmp_path):
+    csv_path = tmp_path / "pair.csv"
+    csv_path.write_text(
+        "image_path,video_path,caption\nimg.jpg,vid.mp4,caption\n",
+        encoding="utf-8",
+    )
+    with pytest.raises(ValueError, match="image_to_video=true"):
+        DatasetFromCSV(
+            str(csv_path),
+            height=480,
+            width=832,
+            num_frames=16,
+            image_to_video=True,
+        )
+
+
+def test_image_to_video_true_requires_path_column(tmp_path):
+    csv_path = tmp_path / "no_path.csv"
+    csv_path.write_text("video_path,caption\nvid.mp4,caption\n", encoding="utf-8")
+    with pytest.raises(ValueError, match="'path' column"):
+        DatasetFromCSV(
+            str(csv_path),
+            height=480,
+            width=832,
+            num_frames=16,
+            image_to_video=True,
+        )
+
+
+def test_i2v_pair_dataset_integration_with_pyav(tmp_path):
+    images_dir = tmp_path / "images"
+    videos_dir = tmp_path / "videos"
+    images_dir.mkdir()
+    videos_dir.mkdir()
+    image_path = images_dir / "ref001.jpg"
+    video_path = videos_dir / "clip001.mp4"
+    Image.new("RGB", (832, 480), color=(10, 20, 30)).save(image_path)
+    _write_test_mp4(video_path, num_frames=90)
+    csv_path = tmp_path / "metadata.csv"
+    csv_path.write_text(
+        "image_path,video_path,caption\n"
+        f"{image_path},{video_path},"
+        '"sks_style, slow pan"\n',
+        encoding="utf-8",
+    )
+    dataset = DatasetFromCSV(
+        str(csv_path),
+        height=480,
+        width=832,
+        num_frames=81,
+        frame_interval=1,
+        image_to_video=False,
+        train=True,
+        video_backend="av",
+    )
+    sample = dataset[0]
+    assert sample["caption"] == "sks_style, slow pan"
+    assert sample["video"].shape == (3, 81, 480, 832)
+    assert sample["image"].shape == (3, 1, 480, 832)
diff --git a/tests/test_wan_i2v_lora_bridge.py b/tests/test_wan_i2v_lora_bridge.py
new file mode 100644
index 00000000..22341638
--- /dev/null
+++ b/tests/test_wan_i2v_lora_bridge.py
@@ -0,0 +1,19 @@
+"""Tests for Wan 2.1 native I2V LoRA → Wan 2.2 I2V Diffusers bridge."""
+
+from unittest.mock import MagicMock, patch
+
+from videotuna.utils.wan_lora_bridge import apply_native_wan_lora_to_i2v_pipeline
+
+
+def test_i2v_bridge_delegates_to_t2v_bridge():
+    pipeline = MagicMock()
+    with patch(
+        "videotuna.utils.wan_lora_bridge.apply_native_wan_lora_to_pipeline"
+    ) as mock_apply:
+        mock_apply.return_value = []
+        apply_native_wan_lora_to_i2v_pipeline(
+            pipeline, "/tmp/denoiser.ckpt", lora_scale=0.8
+        )
+        mock_apply.assert_called_once_with(
+            pipeline, "/tmp/denoiser.ckpt", lora_scale=0.8, lora_scale_2=None
+        )
diff --git a/tests/test_wan_inference_presets.py b/tests/test_wan_inference_presets.py
new file mode 100644
index 00000000..fbe6a8ca
--- /dev/null
+++ b/tests/test_wan_inference_presets.py
@@ -0,0 +1,97 @@
+"""CPU smoke tests for Wan 2.2 Diffusers inference preset YAMLs."""
+
+from pathlib import Path
+
+import yaml
+from omegaconf import OmegaConf
+
+REPO_ROOT = Path(__file__).resolve().parents[1]
+
+LOW_VRAM_PRESET = (
+    REPO_ROOT / "configs" / "inference" / "presets" / "low_vram_wan2_2_720p.yaml"
+)
+BALANCED_PRESET = (
+    REPO_ROOT / "configs" / "inference" / "presets" / "balanced_wan2_2_720p.yaml"
+)
+MAX_SPEED_PRESET = (
+    REPO_ROOT / "configs" / "inference" / "presets" / "max_speed_wan2_2_720p.yaml"
+)
+CPU_SMOKE_PRESET = (
+    REPO_ROOT / "configs" / "inference" / "presets" / "wan2_2_cpu_smoke.yaml"
+)
+
+WAN_MODEL_ID = "Wan-AI/Wan2.2-T2V-A14B-Diffusers"
+
+
+def _load_yaml(path: Path) -> dict:
+    return yaml.safe_load(path.read_text(encoding="utf-8"))
+
+
+def test_wan_low_vram_preset():
+    cfg = _load_yaml(LOW_VRAM_PRESET)
+    assert cfg["flow"]["params"]["model_variant"] == "2.2"
+    assert cfg["inference"]["memory_preset"] == "low_vram"
+    assert cfg["inference"]["enable_sequential_cpu_offload"] is True
+    assert cfg["inference"]["dtype"] == "fp16"
+    assert cfg["inference"]["min_vram_gb"] == 10
+    assert cfg["inference"]["height"] == 720
+    assert cfg["inference"]["width"] == 1280
+
+
+def test_wan_low_vram_int8_quant_preset():
+    path = (
+        REPO_ROOT
+        / "configs"
+        / "inference"
+        / "presets"
+        / "low_vram_wan2_2_720p_int8.yaml"
+    )
+    cfg = _load_yaml(path)
+    assert cfg["inference"]["transformer_quant"] == "int8_wo"
+    assert cfg["inference"]["quant_backend"] == "torchao"
+    assert cfg["inference"]["enable_model_cpu_offload"] is True
+
+
+def test_wan_low_vram_fp8_quant_preset():
+    path = (
+        REPO_ROOT
+        / "configs"
+        / "inference"
+        / "presets"
+        / "low_vram_wan2_2_720p_fp8.yaml"
+    )
+    cfg = _load_yaml(path)
+    assert cfg["inference"]["transformer_quant"] == "fp8_wo"
+    assert cfg["inference"]["quant_backend"] == "torchao"
+    assert cfg["inference"]["enable_model_cpu_offload"] is True
+    assert cfg["inference"]["dtype"] == "bf16"
+
+
+def test_wan_balanced_preset():
+    cfg = _load_yaml(BALANCED_PRESET)
+    assert cfg["flow"]["params"]["pretrained_model_name_or_path"] == WAN_MODEL_ID
+    assert cfg["inference"]["memory_preset"] == "balanced"
+    assert cfg["inference"]["enable_model_cpu_offload"] is True
+    assert cfg["inference"]["enable_vae_tiling"] is True
+    assert cfg["inference"]["dtype"] == "bf16"
+    assert cfg["inference"]["min_vram_gb"] == 20
+
+
+def test_wan_max_speed_preset():
+    cfg = _load_yaml(MAX_SPEED_PRESET)
+    assert cfg["inference"]["memory_preset"] == "max_speed"
+    assert cfg["inference"]["dtype"] == "bf16"
+    assert cfg["inference"]["min_vram_gb"] == 38
+    assert "enable_model_cpu_offload" not in cfg["inference"]
+    assert "enable_sequential_cpu_offload" not in cfg["inference"]
+
+
+def test_wan_cpu_smoke_preset():
+    cfg = OmegaConf.load(CPU_SMOKE_PRESET)
+    assert cfg.flow.params.model_family == "wan"
+    assert cfg.inference.device == "cpu"
+    assert cfg.inference.frames == 2
+    assert cfg.inference.height == 256
+    assert cfg.inference.width == 448
+    assert cfg.inference.num_inference_steps == 4
+    assert cfg.inference.dtype == "fp32"
diff --git a/tests/test_wan_lora_bridge.py b/tests/test_wan_lora_bridge.py
new file mode 100644
index 00000000..fef66b4e
--- /dev/null
+++ b/tests/test_wan_lora_bridge.py
@@ -0,0 +1,228 @@
+"""Tests for Wan 2.1 native LoRA → Wan 2.2 Diffusers bridge."""
+
+from __future__ import annotations
+
+from types import SimpleNamespace
+from unittest.mock import MagicMock, patch
+
+import pytest
+import torch
+from diffusers.models.transformers.transformer_wan import WanTransformer3DModel
+
+from videotuna.testing.wan_lora_ckpt import build_synthetic_wan_lora_ckpt
+from videotuna.utils.wan_lora_bridge import (
+    WAN_DIFFUSERS_LORA_TARGETS,
+    _infer_lora_rank,
+    _remap_native_to_diffusers_keys,
+    _remap_single_native_key,
+    analyze_native_wan_lora_ckpt,
+    apply_native_wan_lora_to_pipeline,
+    compute_remap_coverage,
+    export_diffusers_lora_state_dicts,
+    is_native_wan_lora_ckpt,
+    load_native_wan_lora_state_dict,
+)
+
+
+def _production_native_keys(
+    *, block: int = 0, rank: int = 16
+) -> dict[str, torch.Tensor]:
+    dim_in, dim_mid, dim_out = 5120, 13824, 5120
+    state: dict[str, torch.Tensor] = {}
+    for p in ("q", "k", "v", "o"):
+        out_dim = dim_in if p != "o" else dim_in
+        state[f"blocks.{block}.self_attn.{p}.lora_A.weight"] = torch.zeros(rank, dim_in)
+        state[f"blocks.{block}.self_attn.{p}.lora_B.weight"] = torch.zeros(
+            out_dim, rank
+        )
+    state[f"blocks.{block}.ffn.0.lora_A.weight"] = torch.zeros(rank, dim_in)
+    state[f"blocks.{block}.ffn.0.lora_B.weight"] = torch.zeros(dim_mid, rank)
+    state[f"blocks.{block}.ffn.2.lora_A.weight"] = torch.zeros(rank, dim_mid)
+    state[f"blocks.{block}.ffn.2.lora_B.weight"] = torch.zeros(dim_out, rank)
+    return state
+
+
+def _tiny_transformer() -> WanTransformer3DModel:
+    cfg = WanTransformer3DModel.load_config(
+        "Wan-AI/Wan2.2-T2V-A14B-Diffusers", subfolder="transformer"
+    )
+    cfg["num_layers"] = 1
+    return WanTransformer3DModel.from_config(cfg)
+
+
+def test_load_native_wan_lora_state_dict_filters_non_lora(tmp_path):
+    ckpt = tmp_path / "denoiser.ckpt"
+    state = {
+        "denoiser.blocks.0.self_attn.q.lora_A.weight": torch.zeros(16, 5120),
+        "denoiser.blocks.0.self_attn.q.lora_B.weight": torch.zeros(5120, 16),
+        "denoiser.blocks.0.self_attn.q.weight": torch.zeros(4, 4),
+    }
+    torch.save({"state_dict": state}, ckpt)
+    loaded = load_native_wan_lora_state_dict(ckpt)
+    assert len(loaded) == 2
+    assert all("lora" in k for k in loaded)
+    assert loaded["blocks.0.self_attn.q.lora_A.weight"].shape == (16, 5120)
+
+
+def test_is_native_wan_lora_ckpt(tmp_path):
+    ckpt = tmp_path / "lora.ckpt"
+    torch.save(
+        {"state_dict": {"blocks.0.self_attn.q.lora_A.weight": torch.zeros(16, 5120)}},
+        ckpt,
+    )
+    assert is_native_wan_lora_ckpt(ckpt)
+    assert not is_native_wan_lora_ckpt(tmp_path / "missing.ckpt")
+
+
+def test_infer_lora_rank():
+    state = {"blocks.0.self_attn.q.lora_A.weight": torch.zeros(16, 8)}
+    assert _infer_lora_rank(state) == 16
+
+
+def test_remap_production_self_attn_keys():
+    assert (
+        _remap_single_native_key("blocks.0.self_attn.q.lora_A.weight")
+        == "blocks.0.attn1.to_q.lora_A.weight"
+    )
+    assert (
+        _remap_single_native_key("blocks.3.self_attn.o.lora_B.weight")
+        == "blocks.3.attn1.to_out.0.lora_B.weight"
+    )
+
+
+def test_remap_production_ffn_keys():
+    assert (
+        _remap_single_native_key("blocks.1.ffn.0.lora_A.weight")
+        == "blocks.1.ffn.net.0.proj.lora_A.weight"
+    )
+    assert (
+        _remap_single_native_key("blocks.1.ffn.2.lora_B.weight")
+        == "blocks.1.ffn.net.2.lora_B.weight"
+    )
+
+
+def test_remap_legacy_attn_shorthand():
+    native = {"blocks.0.attn.q.lora_A.weight": torch.zeros(1)}
+    remapped = _remap_native_to_diffusers_keys(native)
+    assert "blocks.0.attn1.to_q.lora_A.weight" in remapped
+
+
+def test_analyze_native_wan_lora_ckpt(tmp_path):
+    ckpt = tmp_path / "denoiser.ckpt"
+    state = _production_native_keys()
+    torch.save({"state_dict": {f"denoiser.{k}": v for k, v in state.items()}}, ckpt)
+    info = analyze_native_wan_lora_ckpt(ckpt)
+    assert info["native_key_count"] == 12
+    assert info["rank"] == 16
+
+
+def test_export_diffusers_lora_state_dicts(tmp_path):
+    ckpt = tmp_path / "denoiser.ckpt"
+    state = _production_native_keys()
+    torch.save({"state_dict": state}, ckpt)
+    exports = export_diffusers_lora_state_dicts(ckpt)
+    assert "high_noise" in exports and "low_noise" in exports
+    assert "blocks.0.attn1.to_q.lora_A.weight" in exports["high_noise"]
+
+
+def test_exported_lora_loads_via_diffusers_adapter(tmp_path):
+    """Offline export path: safetensors → WanTransformer3DModel.load_lora_adapter."""
+    from safetensors.torch import save_file
+
+    ckpt = tmp_path / "denoiser.ckpt"
+    state = _production_native_keys()
+    torch.save({"state_dict": state}, ckpt)
+    exports = export_diffusers_lora_state_dicts(ckpt)
+    lora_path = tmp_path / "high_noise.safetensors"
+    save_file(exports["high_noise"], lora_path)
+
+    transformer = _tiny_transformer()
+    transformer.load_lora_adapter(str(lora_path), adapter_name="exported", prefix=None)
+    transformer.set_adapters(["exported"], weights=[1.0])
+    assert _count_lora(transformer) == 12
+
+
+def test_remap_coverage_on_production_fixture():
+    native = _production_native_keys()
+    transformed, total, coverage = compute_remap_coverage(native)
+    assert total == 12
+    assert transformed == 12
+    assert coverage >= 0.9
+
+
+def test_apply_native_wan_lora_to_single_transformer():
+    ckpt_state = _production_native_keys()
+    ckpt_path = MagicMock()
+    transformer = _tiny_transformer()
+    pipeline = SimpleNamespace(transformer=transformer, transformer_2=None)
+
+    with patch(
+        "videotuna.utils.wan_lora_bridge.load_native_wan_lora_state_dict",
+        return_value=ckpt_state,
+    ):
+        reports = apply_native_wan_lora_to_pipeline(pipeline, ckpt_path)
+
+    assert len(reports) == 1
+    assert reports[0].expert == "transformer"
+    assert reports[0].remap_ratio >= 0.9
+    assert reports[0].missing_keys == []
+    assert reports[0].loaded_lora_params == 12
+    assert _count_lora(pipeline.transformer) == 12
+
+
+def test_apply_native_wan_lora_to_dual_transformer():
+    ckpt_state = _production_native_keys()
+    ckpt_path = MagicMock()
+    pipeline = SimpleNamespace(
+        transformer=_tiny_transformer(),
+        transformer_2=_tiny_transformer(),
+        set_adapters=MagicMock(),
+    )
+
+    with patch(
+        "videotuna.utils.wan_lora_bridge.load_native_wan_lora_state_dict",
+        return_value=ckpt_state,
+    ):
+        reports = apply_native_wan_lora_to_pipeline(pipeline, ckpt_path)
+
+    assert len(reports) == 2
+    assert {r.expert for r in reports} == {"transformer", "transformer_2"}
+    for report in reports:
+        assert report.remap_ratio >= 0.9
+        assert report.missing_keys == []
+    pipeline.set_adapters.assert_called_once()
+    call = pipeline.set_adapters.call_args
+    adapters = call.args[0] if call.args else call.kwargs["adapter_names"]
+    weights = call.kwargs.get("adapter_weights")
+    if weights is None and len(call.args) > 1:
+        weights = call.args[1]
+    assert len(adapters) == 2
+    assert weights == [1.0, 1.0]
+
+
+def _count_lora(module: WanTransformer3DModel) -> int:
+    return sum(1 for n, _ in module.named_parameters() if "lora" in n.lower())
+
+
+@pytest.mark.gpu
+def test_validate_domain_t2v_gpu_smoke(tmp_path):
+    """GPU integration: bridge + pipeline load (skipped without CUDA/ROCm)."""
+    if not torch.cuda.is_available():
+        pytest.skip("GPU not available")
+
+    ckpt = build_synthetic_wan_lora_ckpt(tmp_path / "denoiser.ckpt", num_blocks=2)
+
+    from diffusers import WanPipeline
+
+    pipeline = WanPipeline.from_pretrained(
+        "Wan-AI/Wan2.2-T2V-A14B-Diffusers",
+        torch_dtype=torch.bfloat16,
+    )
+    reports = apply_native_wan_lora_to_pipeline(pipeline, ckpt)
+    assert len(reports) == 2
+    assert {r.expert for r in reports} == {"transformer", "transformer_2"}
+    for report in reports:
+        assert report.remap_ratio >= 0.9
+        assert report.missing_keys == []
+        assert report.loaded_lora_params > 0
+    assert WAN_DIFFUSERS_LORA_TARGETS
diff --git a/tests/test_wan_lora_config.py b/tests/test_wan_lora_config.py
new file mode 100644
index 00000000..276da722
--- /dev/null
+++ b/tests/test_wan_lora_config.py
@@ -0,0 +1,26 @@
+"""Unit tests for Wan domain LoRA Pydantic config loading."""
+
+from pathlib import Path
+
+import pytest
+from pydantic import ValidationError
+
+from videotuna.training.wan_lora.config import WanLoraTrainConfig, load_wan_lora_config
+
+REPO_ROOT = Path(__file__).resolve().parents[1]
+WAN_T2V_CONFIG = REPO_ROOT / "configs" / "domain" / "wan_t2v_lora.yaml"
+
+
+def test_invalid_task_raises_validation_error():
+    cfg = load_wan_lora_config(WAN_T2V_CONFIG)
+    payload = cfg.model_dump(mode="json")
+    payload["flow"]["params"]["task"] = "t2v-A14B"
+    with pytest.raises(ValidationError):
+        WanLoraTrainConfig.model_validate(payload)
+
+
+def test_round_trip_revalidates():
+    cfg = load_wan_lora_config(WAN_T2V_CONFIG)
+    round_tripped = WanLoraTrainConfig.model_validate(cfg.model_dump(mode="json"))
+    assert round_tripped.train.name == cfg.train.name
+    assert round_tripped.flow.params.ckpt_path == cfg.flow.params.ckpt_path
diff --git a/tests/test_wan_train_smoke.py b/tests/test_wan_train_smoke.py
new file mode 100644
index 00000000..865280cf
--- /dev/null
+++ b/tests/test_wan_train_smoke.py
@@ -0,0 +1,44 @@
+"""Smoke benchmark for training data path (CPU-only, no checkpoints)."""
+
+import time
+
+import torch
+from torch.utils.data import Dataset
+
+from videotuna.data.lightningdata import DataModuleFromConfig
+
+
+class _SmokeDataset(Dataset):
+    def __len__(self):
+        return 8
+
+    def __getitem__(self, idx):
+        return {
+            "video": torch.randn(3, 8, 64, 64),
+            "caption": f"cap-{idx}",
+        }
+
+
+def test_dataloader_epoch_smoke_benchmark():
+    """Pseudo-epoch iteration with hardened DataLoader settings."""
+    dm = DataModuleFromConfig(
+        batch_size=2,
+        num_workers=2,
+        pin_memory=False,
+        persistent_workers=True,
+        prefetch_factor=2,
+        train={
+            "target": "tests.test_wan_train_smoke._SmokeDataset",
+            "params": {},
+        },
+    )
+    dm.setup()
+    loader = dm.train_dataloader()
+
+    start = time.perf_counter()
+    batches = list(loader)
+    elapsed = time.perf_counter() - start
+
+    assert len(batches) == 4
+    assert batches[0]["video"].shape == (2, 3, 8, 64, 64)
+    assert elapsed < 30.0
diff --git a/tests/test_wan_training_step.py b/tests/test_wan_training_step.py
new file mode 100644
index 00000000..22eb7e3c
--- /dev/null
+++ b/tests/test_wan_training_step.py
@@ -0,0 +1,91 @@
+"""CPU tests for Wan flow-matching training helpers."""
+
+from types import SimpleNamespace
+from unittest.mock import MagicMock
+
+import pytest
+import torch
+
+from videotuna.utils.wan_training import (
+    build_i2v_mask_and_latent,
+    compute_wan_flow_matching_loss,
+    is_i2v_task,
+)
+
+
+def test_is_i2v_task():
+    assert is_i2v_task("i2v-14B")
+    assert not is_i2v_task("t2v-14B")
+
+
+def test_build_i2v_mask_shape():
+    image = torch.zeros(3, 48, 64)
+    msk, video = build_i2v_mask_and_latent(
+        image,
+        num_frames=9,
+        lat_h=6,
+        lat_w=8,
+        vae_stride=(4, 8, 8),
+        device=torch.device("cpu"),
+        dtype=torch.float32,
+    )
+    assert video.shape == (3, 9, 48, 64)
+    assert msk.shape[0] == 4
+
+
+def test_compute_wan_flow_matching_loss_t2v_mock():
+    device = torch.device("cpu")
+    latent = torch.randn(16, 21, 60, 104)
+
+    class _TinyDenoiser(torch.nn.Module):
+        def forward(self, x, t, context, seq_len, y=None):
+            return [torch.randn_like(x[0])]
+
+    denoiser = _TinyDenoiser()
+    flow = SimpleNamespace(
+        task="t2v-14B",
+        device=device,
+        cfg=SimpleNamespace(
+            boundary=0.875,
+            num_train_timesteps=1000,
+            param_dtype=torch.float32,
+            sample_shift=3.0,
+        ),
+        low_denoiser=denoiser,
+        high_denoiser=denoiser,
+    )
+    vae = MagicMock()
+    vae.encode = lambda videos: [latent.clone() for _ in videos]
+    flow.wan_t2v = SimpleNamespace(
+        vae=vae,
+        vae_stride=(4, 8, 8),
+        patch_size=(1, 2, 2),
+        text_encoder=lambda texts, dev: [torch.randn(8, 4096)],
+    )
+    batch = {
+        "video": torch.randn(1, 3, 81, 480, 832),
+        "caption": ["sks_style test"],
+    }
+    loss = compute_wan_flow_matching_loss(flow, batch)
+    assert loss.ndim == 0
+    assert torch.isfinite(loss)
+
+
+def test_compute_wan_flow_matching_loss_i2v_requires_image():
+    flow = SimpleNamespace(task="i2v-14B", device=torch.device("cpu"))
+    flow.wan_i2v = SimpleNamespace(
+        vae=MagicMock(encode=lambda v: [torch.randn(16, 21, 60, 104)]),
+        vae_stride=(4, 8, 8),
+        patch_size=(1, 2, 2),
+        text_encoder=lambda texts, dev: [torch.randn(8, 4096)],
+    )
+    flow.cfg = SimpleNamespace(
+        boundary=0.9,
+        num_train_timesteps=1000,
+        param_dtype=torch.float32,
+        sample_shift=3.0,
+    )
+    flow.low_denoiser = flow.high_denoiser = MagicMock()
+    batch = {"video": torch.randn(1, 3, 81, 480, 832), "caption": ["cap"]}
+    with pytest.raises(ValueError, match="image"):
+        compute_wan_flow_matching_loss(flow, batch)
diff --git a/tools/convert_checkpoint.py b/tools/convert_checkpoint.py
index 248a405b..c511edec 100644
--- a/tools/convert_checkpoint.py
+++ b/tools/convert_checkpoint.py
@@ -4,20 +4,26 @@
 import torch
 
 """
-This script is used to convert the key of diffusion scheduler to match the format in this repo.
+Convert diffusion scheduler keys to match the format in this repo.
+
 The conversion is as follows:
-    betas                                 -->  diffusion_scheduler.betas
-    alphas_cumprod                        -->  diffusion_scheduler.alphas_cumprod
-    alphas_cumprod_prev                   -->  diffusion_scheduler.alphas_cumprod_prev
-    sqrt_alphas_cumprod                   -->  diffusion_scheduler.sqrt_alphas_cumprod
-    sqrt_one_minus_alphas_cumprod        -->  diffusion_scheduler.sqrt_one_minus_alphas_cumprod
-    log_one_minus_alphas_cumprod         -->  diffusion_scheduler.log_one_minus_alphas_cumprod
-    sqrt_recip_alphas_cumprod            -->  diffusion_scheduler.sqrt_recip_alphas_cumprod
-    sqrt_recipm1_alphas_cumprod          -->  diffusion_scheduler.sqrt_recipm1_alphas_cumprod
-    posterior_variance                   -->  diffusion_scheduler.posterior_variance
-    posterior_log_variance_clipped       -->  diffusion_scheduler.posterior_log_variance_clipped
-    posterior_mean_coef1                 -->  diffusion_scheduler.posterior_mean_coef1
-    posterior_mean_coef2                 -->  diffusion_scheduler.posterior_mean_coef2
+    betas --> diffusion_scheduler.betas
+    alphas_cumprod --> diffusion_scheduler.alphas_cumprod
+    alphas_cumprod_prev --> diffusion_scheduler.alphas_cumprod_prev
+    sqrt_alphas_cumprod --> diffusion_scheduler.sqrt_alphas_cumprod
+    sqrt_one_minus_alphas_cumprod -->
+        diffusion_scheduler.sqrt_one_minus_alphas_cumprod
+    log_one_minus_alphas_cumprod -->
+        diffusion_scheduler.log_one_minus_alphas_cumprod
+    sqrt_recip_alphas_cumprod -->
+        diffusion_scheduler.sqrt_recip_alphas_cumprod
+    sqrt_recipm1_alphas_cumprod -->
+        diffusion_scheduler.sqrt_recipm1_alphas_cumprod
+    posterior_variance --> diffusion_scheduler.posterior_variance
+    posterior_log_variance_clipped -->
+        diffusion_scheduler.posterior_log_variance_clipped
+    posterior_mean_coef1 --> diffusion_scheduler.posterior_mean_coef1
+    posterior_mean_coef2 --> diffusion_scheduler.posterior_mean_coef2
 """
 
 parser = argparse.ArgumentParser()
@@ -25,7 +31,10 @@
     "--input_path",
     type=str,
     required=True,
-    help="Path to the old checkpoint, e.g., checkpoints/dynamicrafter/i2v_576x1024/model.ckpt",
+    help=(
+        "Path to the old checkpoint, e.g., "
+        "checkpoints/dynamicrafter/i2v_576x1024/model.ckpt"
+    ),
 )
 parser.add_argument(
     "--save_key",
diff --git a/tools/convert_wan_lora_21_to_22.py b/tools/convert_wan_lora_21_to_22.py
new file mode 100644
index 00000000..8bd2333d
--- /dev/null
+++ b/tools/convert_wan_lora_21_to_22.py
@@ -0,0 +1,94 @@
+#!/usr/bin/env python3
+"""
+Export Wan 2.1 native Lightning LoRA checkpoints to Diffusers safetensors.
+
+Known limitations:
+- Best-effort key remap (2.1 single denoiser → 2.2 dual-expert). Same tensors are
+  written for high-noise and low-noise exports; load low-noise with
+  load_into_transformer_2=True.
+- Architecture deltas between Wan 2.1 native and Wan 2.2 Diffusers may reduce
+  visual fidelity; run validate-domain-t2v for QA.
+"""
+
+from __future__ import annotations
+
+import argparse
+import json
+import sys
+from pathlib import Path
+
+from safetensors.torch import save_file
+
+REPO_ROOT = Path(__file__).resolve().parents[1]
+if str(REPO_ROOT) not in sys.path:
+    sys.path.insert(0, str(REPO_ROOT))
+
+from videotuna.utils.wan_lora_bridge import (  # noqa: E402
+    analyze_native_wan_lora_ckpt,
+    export_diffusers_lora_state_dicts,
+    is_native_wan_lora_ckpt,
+)
+
+
+def main() -> int:
+    parser = argparse.ArgumentParser(
+        description=__doc__,
+        formatter_class=argparse.RawDescriptionHelpFormatter,
+    )
+    parser.add_argument(
+        "--input",
+        required=True,
+        type=Path,
+        help="Native denoiser .ckpt from train-domain-t2v",
+    )
+    parser.add_argument(
+        "--output-dir",
+        required=True,
+        type=Path,
+        help="Directory for high_noise.safetensors and low_noise.safetensors",
+    )
+    parser.add_argument(
+        "--mode",
+        choices=("t2v", "i2v"),
+        default="t2v",
+        help="Wan task mode (T2V default; I2V uses same block remap)",
+    )
+    args = parser.parse_args()
+
+    if not is_native_wan_lora_ckpt(args.input):
+        print(f"Error: not a native Wan LoRA checkpoint: {args.input}", file=sys.stderr)
+        return 1
+
+    args.output_dir.mkdir(parents=True, exist_ok=True)
+    exports = export_diffusers_lora_state_dicts(args.input)
+    meta = analyze_native_wan_lora_ckpt(args.input)
+
+    high_path = args.output_dir / "high_noise.safetensors"
+    low_path = args.output_dir / "low_noise.safetensors"
+    save_file(exports["high_noise"], high_path)
+    save_file(exports["low_noise"], low_path)
+
+    manifest = {
+        "source": str(args.input),
+        "mode": args.mode,
+        "high_noise": str(high_path),
+        "low_noise": str(low_path),
+        "analysis": meta,
+        "load_hint": (
+            "pipeline.load_lora_weights("
+            "output_dir, weight_name='high_noise.safetensors'); "
+            "pipeline.load_lora_weights(..., weight_name='low_noise.safetensors', "
+            "load_into_transformer_2=True)"
+        ),
+    }
+    manifest_path = args.output_dir / "manifest.json"
+    manifest_path.write_text(json.dumps(manifest, indent=2), encoding="utf-8")
+
+    print(f"Exported high-noise LoRA: {high_path}")
+    print(f"Exported low-noise LoRA:  {low_path}")
+    print(f"Manifest: {manifest_path}")
+    return 0
+
+
+if __name__ == "__main__":
+    raise SystemExit(main())
diff --git a/tools/data_process/caption/caption.py b/tools/data_process/caption/caption.py
index f405285c..2c1c1151 100644
--- a/tools/data_process/caption/caption.py
+++ b/tools/data_process/caption/caption.py
@@ -4,25 +4,16 @@
 import json
 import os
 import warnings
-from operator import attrgetter
 
-import cv2
 import numpy as np
-import requests
-import torch
 import tqdm
-from decord import VideoReader, cpu
+from videotuna.utils.video_io import AvVideoReader as VideoReader
 from llava.constants import (
-    DEFAULT_IM_END_TOKEN,
-    DEFAULT_IM_START_TOKEN,
     DEFAULT_IMAGE_TOKEN,
-    IGNORE_INDEX,
     IMAGE_TOKEN_INDEX,
 )
-from llava.conversation import SeparatorStyle, conv_templates
+from llava.conversation import conv_templates
 from llava.mm_utils import (
-    get_model_name_from_path,
-    process_images,
     tokenizer_image_token,
 )
 from llava.model.builder import load_pretrained_model
@@ -59,7 +50,7 @@ def get_inf1(folder_path):
 
 
 def get_inf(video_path):
-    vr = VideoReader(video_path, ctx=cpu(0))
+    vr = VideoReader(video_path)
     video_length = len(vr)
     video_fps = vr.get_avg_fps()
     width = vr[0].shape[0]
@@ -71,9 +62,9 @@ def get_inf(video_path):
 # Function to extract frames from video
 def load_video(video_path, max_frames_num):
     if type(video_path) == str:
-        vr = VideoReader(video_path, ctx=cpu(0))
+        vr = VideoReader(video_path)
     else:
-        vr = VideoReader(video_path[0], ctx=cpu(0))
+        vr = VideoReader(video_path[0])
     total_frame_num = len(vr)
     uniform_sampled_frames = np.linspace(
         0, total_frame_num - 1, max_frames_num, dtype=int
diff --git a/tools/data_process/caption/llava/conversation.py b/tools/data_process/caption/llava/conversation.py
index b761c080..1f4a7a7f 100644
--- a/tools/data_process/caption/llava/conversation.py
+++ b/tools/data_process/caption/llava/conversation.py
@@ -3,7 +3,7 @@
 import re
 from enum import Enum, auto
 from io import BytesIO
-from typing import Any, Dict, List, Tuple, Union
+from typing import Any, List, Union
 
 from PIL import Image
 from transformers import AutoTokenizer
diff --git a/tools/data_process/caption/llava/eval/evaluate_interleave.py b/tools/data_process/caption/llava/eval/evaluate_interleave.py
index eecb8f79..0ddec86c 100644
--- a/tools/data_process/caption/llava/eval/evaluate_interleave.py
+++ b/tools/data_process/caption/llava/eval/evaluate_interleave.py
@@ -5,8 +5,6 @@
 
 import numpy as np
 from rouge import Rouge
-from sklearn.feature_extraction.text import TfidfVectorizer
-from sklearn.metrics.pairwise import cosine_similarity
 
 spot_the_diff = ["Spot-the-Diff", "Birds-to-Words", "CLEVR-Change"]
 image_edit_instruct = ["IEdit", "HQ-Edit", "MagicBrush"]
@@ -201,7 +199,6 @@ def evaluate_multi_choice_image(self, preditions):
     eval_result_list_detail = dict()
 
     for dataset in preds_all_dict:
-
         preds = preds_all_dict[dataset]
         question_type = preds[0]["question_type"]
 
diff --git a/tools/data_process/caption/llava/eval/model_vqa.py b/tools/data_process/caption/llava/eval/model_vqa.py
index 067fa3d4..2785a947 100644
--- a/tools/data_process/caption/llava/eval/model_vqa.py
+++ b/tools/data_process/caption/llava/eval/model_vqa.py
@@ -3,14 +3,12 @@
 import math
 import os
 import re
-from typing import Dict, List, Optional, Sequence
+from typing import Dict
 
 import shortuuid
 import torch
 import transformers
 from llava.constants import (
-    DEFAULT_IM_END_TOKEN,
-    DEFAULT_IM_START_TOKEN,
     DEFAULT_IMAGE_TOKEN,
     IGNORE_INDEX,
     IMAGE_TOKEN_INDEX,
@@ -19,7 +17,6 @@
 from llava.mm_utils import (
     KeywordsStoppingCriteria,
     get_model_name_from_path,
-    tokenizer_image_token,
 )
 from llava.model.builder import load_pretrained_model
 from llava.utils import disable_torch_init
@@ -126,7 +123,6 @@ def preprocess_qwen(
 
 
 def eval_model(args):
-
     # Model
     disable_torch_init()
     model_path = os.path.expanduser(args.model_path)
@@ -218,7 +214,6 @@ def eval_model(args):
         ans_file.flush()
 
         if len(line["conversations"]) > 2:
-
             for i in range(2, len(line["conversations"]), 2):
                 input_ids = torch.cat((input_ids, output_ids), dim=1)
 
diff --git a/tools/data_process/caption/llava/mm_utils.py b/tools/data_process/caption/llava/mm_utils.py
index 428d7351..fdeec5b9 100644
--- a/tools/data_process/caption/llava/mm_utils.py
+++ b/tools/data_process/caption/llava/mm_utils.py
@@ -5,10 +5,11 @@
 from io import BytesIO
 
 import torch
-from llava.constants import IMAGE_TOKEN_INDEX
 from PIL import Image
 from transformers import StoppingCriteria
 
+from llava.constants import IMAGE_TOKEN_INDEX
+
 
 def resize_and_center_crop(image, shortest_edge_length):
     # Calculate new dimensions and resize
@@ -159,8 +160,9 @@ def select_best_resolution(original_size, possible_resolutions):
     for width, height in possible_resolutions:
         # Calculate the downscaled size to keep the aspect ratio
         scale = min(width / original_width, height / original_height)
-        downscaled_width, downscaled_height = int(original_width * scale), int(
-            original_height * scale
+        downscaled_width, downscaled_height = (
+            int(original_width * scale),
+            int(original_height * scale),
         )
 
         # Calculate effective and wasted resolutions
@@ -297,7 +299,7 @@ def process_anyres_image(image, processor, grid_pinpoints):
     if isinstance(grid_pinpoints, str) and "x" in grid_pinpoints:
         try:
             patch_size = processor.size[0]
-        except Exception as e:
+        except Exception:
             patch_size = processor.size["shortest_edge"]
         assert patch_size in [
             224,
diff --git a/tools/data_process/caption/llava/model/apply_delta.py b/tools/data_process/caption/llava/model/apply_delta.py
index 3fbf46c2..a2353cb5 100644
--- a/tools/data_process/caption/llava/model/apply_delta.py
+++ b/tools/data_process/caption/llava/model/apply_delta.py
@@ -6,10 +6,11 @@
 import argparse
 
 import torch
-from llava import LlavaLlamaForCausalLM
 from tqdm import tqdm
 from transformers import AutoModelForCausalLM, AutoTokenizer
 
+from llava import LlavaLlamaForCausalLM
+
 
 def apply_delta(base_model_path, target_model_path, delta_path):
     print("Loading base model")
@@ -34,10 +35,13 @@ def apply_delta(base_model_path, target_model_path, delta_path):
         if param.data.shape == base.state_dict()[name].shape:
             param.data += base.state_dict()[name]
         else:
-            assert name in [
-                "model.embed_tokens.weight",
-                "lm_head.weight",
-            ], f"{name} dimension mismatch: {param.data.shape} vs {base.state_dict()[name].shape}"
+            assert (
+                name
+                in [
+                    "model.embed_tokens.weight",
+                    "lm_head.weight",
+                ]
+            ), f"{name} dimension mismatch: {param.data.shape} vs {base.state_dict()[name].shape}"
             bparam = base.state_dict()[name]
             param.data[: bparam.shape[0], : bparam.shape[1]] += bparam
 
diff --git a/tools/data_process/caption/llava/model/builder.py b/tools/data_process/caption/llava/model/builder.py
index ee79190e..b197260c 100644
--- a/tools/data_process/caption/llava/model/builder.py
+++ b/tools/data_process/caption/llava/model/builder.py
@@ -14,10 +14,16 @@
 
 
 import os
-import shutil
 import warnings
 
 import torch
+from transformers import (
+    AutoConfig,
+    AutoModelForCausalLM,
+    AutoTokenizer,
+    BitsAndBytesConfig,
+)
+
 from llava.constants import (
     DEFAULT_IM_END_TOKEN,
     DEFAULT_IM_START_TOKEN,
@@ -25,12 +31,6 @@
 )
 from llava.model import *
 from llava.utils import rank0_print
-from transformers import (
-    AutoConfig,
-    AutoModelForCausalLM,
-    AutoTokenizer,
-    BitsAndBytesConfig,
-)
 
 
 def load_pretrained_model(
@@ -421,7 +421,7 @@ def load_from_hf(repo_id, filename, subfolder=None):
             )
             print(f"Loading LoRA weights from {model_path}")
             model = PeftModel.from_pretrained(model, model_path)
-            print(f"Merging weights")
+            print("Merging weights")
             model = model.merge_and_unload()
             print("Convert to FP16...")
             model.to(torch.float16)
diff --git a/tools/data_process/caption/llava/model/consolidate.py b/tools/data_process/caption/llava/model/consolidate.py
index dd92066a..26273a2a 100644
--- a/tools/data_process/caption/llava/model/consolidate.py
+++ b/tools/data_process/caption/llava/model/consolidate.py
@@ -6,9 +6,10 @@
 import argparse
 
 import torch
+from transformers import AutoModelForCausalLM, AutoTokenizer
+
 from llava.model import *
 from llava.model.utils import auto_upgrade
-from transformers import AutoModelForCausalLM, AutoTokenizer
 
 
 def consolidate_ckpt(src_path, dst_path):
diff --git a/tools/data_process/caption/llava/model/language_model/llava_gemma.py b/tools/data_process/caption/llava/model/language_model/llava_gemma.py
index 764babc9..dfb3d95f 100644
--- a/tools/data_process/caption/llava/model/language_model/llava_gemma.py
+++ b/tools/data_process/caption/llava/model/language_model/llava_gemma.py
@@ -17,7 +17,6 @@
 
 import torch
 import torch.nn as nn
-from torch.nn import CrossEntropyLoss
 from transformers import (
     AutoConfig,
     AutoModelForCausalLM,
@@ -73,7 +72,6 @@ def forward(
         return_dict: Optional[bool] = None,
         cache_position: Optional[torch.LongTensor] = None,
     ) -> Union[Tuple, CausalLMOutputWithPast]:
-
         if inputs_embeds is None:
             (
                 input_ids,
diff --git a/tools/data_process/caption/llava/model/language_model/llava_llama.py b/tools/data_process/caption/llava/model/language_model/llava_llama.py
index a51febe6..a5074655 100644
--- a/tools/data_process/caption/llava/model/language_model/llava_llama.py
+++ b/tools/data_process/caption/llava/model/language_model/llava_llama.py
@@ -18,7 +18,6 @@
 import torch
 import torch.nn as nn
 from llava.model.llava_arch import LlavaMetaForCausalLM, LlavaMetaModel
-from torch.nn import CrossEntropyLoss
 
 # , LlamaModel, LlamaForCausalLM, GenerationConfig
 # from .modeling_llama import LlamaModel, LlamaForCausalLM
@@ -85,7 +84,6 @@ def forward(
         dpo_forward: Optional[bool] = None,
         cache_position=None,
     ) -> Union[Tuple, CausalLMOutputWithPast]:
-
         if inputs_embeds is None:
             (
                 input_ids,
diff --git a/tools/data_process/caption/llava/model/language_model/llava_mistral.py b/tools/data_process/caption/llava/model/language_model/llava_mistral.py
index 6d0da8ea..e57154c3 100644
--- a/tools/data_process/caption/llava/model/language_model/llava_mistral.py
+++ b/tools/data_process/caption/llava/model/language_model/llava_mistral.py
@@ -17,11 +17,9 @@
 
 import torch
 import torch.nn as nn
-from torch.nn import CrossEntropyLoss
 from transformers import (
     AutoConfig,
     AutoModelForCausalLM,
-    GenerationConfig,
     MistralConfig,
     MistralForCausalLM,
     MistralModel,
@@ -80,7 +78,6 @@ def forward(
         return_dict: Optional[bool] = None,
         cache_position=None,
     ) -> Union[Tuple, CausalLMOutputWithPast]:
-
         if inputs_embeds is None:
             (
                 input_ids,
diff --git a/tools/data_process/caption/llava/model/language_model/llava_mixtral.py b/tools/data_process/caption/llava/model/language_model/llava_mixtral.py
index e85705eb..ad632dde 100644
--- a/tools/data_process/caption/llava/model/language_model/llava_mixtral.py
+++ b/tools/data_process/caption/llava/model/language_model/llava_mixtral.py
@@ -17,11 +17,9 @@
 
 import torch
 import torch.nn as nn
-from torch.nn import CrossEntropyLoss
 from transformers import (
     AutoConfig,
     AutoModelForCausalLM,
-    GenerationConfig,
     MixtralConfig,
     MixtralForCausalLM,
     MixtralModel,
@@ -77,7 +75,6 @@ def forward(
         dpo_forward: Optional[bool] = None,
         cache_position=None,
     ) -> Union[Tuple, CausalLMOutputWithPast]:
-
         if inputs_embeds is None:
             (
                 input_ids,
diff --git a/tools/data_process/caption/llava/model/language_model/llava_mpt.py b/tools/data_process/caption/llava/model/language_model/llava_mpt.py
index 81970417..4d27fff2 100644
--- a/tools/data_process/caption/llava/model/language_model/llava_mpt.py
+++ b/tools/data_process/caption/llava/model/language_model/llava_mpt.py
@@ -87,7 +87,6 @@ def forward(
         cache_position=None,
         images=None,
     ):
-
         input_ids, attention_mask, past_key_values, inputs_embeds, labels = (
             self.prepare_inputs_labels_for_multimodal(
                 input_ids, attention_mask, past_key_values, labels, images
diff --git a/tools/data_process/caption/llava/model/language_model/llava_qwen.py b/tools/data_process/caption/llava/model/language_model/llava_qwen.py
index 6f651851..26e4fd0f 100644
--- a/tools/data_process/caption/llava/model/language_model/llava_qwen.py
+++ b/tools/data_process/caption/llava/model/language_model/llava_qwen.py
@@ -13,21 +13,16 @@
 #    limitations under the License.
 
 
-from typing import Dict, List, Optional, Tuple, Union
+from typing import List, Optional, Tuple, Union
 
 import torch
 import torch.nn as nn
-import transformers
 
 # from ...constants import IGNORE_INDEX, IMAGE_TOKEN_INDEX, DEFAULT_IMAGE_TOKEN, DEFAULT_IM_START_TOKEN, DEFAULT_IM_END_TOKEN
 from llava.model.llava_arch import LlavaMetaForCausalLM, LlavaMetaModel
-from torch.nn import CrossEntropyLoss
 from transformers import (
     AutoConfig,
     AutoModelForCausalLM,
-    LlamaConfig,
-    LlamaForCausalLM,
-    LlamaModel,
     Qwen2Config,
     Qwen2ForCausalLM,
     Qwen2Model,
@@ -85,7 +80,6 @@ def forward(
         dpo_forward: Optional[bool] = False,
         cache_position=None,
     ) -> Union[Tuple, CausalLMOutputWithPast]:
-
         if inputs_embeds is None:
             (
                 input_ids,
diff --git a/tools/data_process/caption/llava/model/language_model/llava_qwen_moe.py b/tools/data_process/caption/llava/model/language_model/llava_qwen_moe.py
index 44606752..f1420759 100644
--- a/tools/data_process/caption/llava/model/language_model/llava_qwen_moe.py
+++ b/tools/data_process/caption/llava/model/language_model/llava_qwen_moe.py
@@ -13,15 +13,13 @@
 #    limitations under the License.
 
 
-from typing import Dict, List, Optional, Tuple, Union
+from typing import List, Optional, Tuple, Union
 
 import torch
 import torch.nn as nn
-import transformers
 
 # from ...constants import IGNORE_INDEX, IMAGE_TOKEN_INDEX, DEFAULT_IMAGE_TOKEN, DEFAULT_IM_START_TOKEN, DEFAULT_IM_END_TOKEN
 from llava.model.llava_arch import LlavaMetaForCausalLM, LlavaMetaModel
-from torch.nn import CrossEntropyLoss
 from transformers import (
     AutoConfig,
     AutoModelForCausalLM,
@@ -82,7 +80,6 @@ def forward(
         dpo_forward: Optional[bool] = False,
         cache_position=None,
     ) -> Union[Tuple, CausalLMOutputWithPast]:
-
         if inputs_embeds is None:
             (
                 input_ids,
diff --git a/tools/data_process/caption/llava/model/language_model/modeling_llama.py b/tools/data_process/caption/llava/model/language_model/modeling_llama.py
index e2029bd0..08960a49 100644
--- a/tools/data_process/caption/llava/model/language_model/modeling_llama.py
+++ b/tools/data_process/caption/llava/model/language_model/modeling_llama.py
@@ -17,7 +17,8 @@
 # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 # See the License for the specific language governing permissions and
 # limitations under the License.
-""" PyTorch LLaMA model."""
+"""PyTorch LLaMA model."""
+
 import math
 import warnings
 from typing import List, Optional, Tuple, Union
diff --git a/tools/data_process/caption/llava/model/llava_arch.py b/tools/data_process/caption/llava/model/llava_arch.py
index 3417847d..285c0e84 100644
--- a/tools/data_process/caption/llava/model/llava_arch.py
+++ b/tools/data_process/caption/llava/model/llava_arch.py
@@ -16,11 +16,11 @@
 import math
 import random
 import re
-import time
 from abc import ABC, abstractmethod
 
 import torch
 import torch.nn as nn
+
 from llava.constants import (
     DEFAULT_IM_END_TOKEN,
     DEFAULT_IM_START_TOKEN,
@@ -29,7 +29,7 @@
     IMAGE_TOKEN_INDEX,
 )
 from llava.mm_utils import get_anyres_image_grid_shape
-from llava.utils import rank0_print, rank_print
+from llava.utils import rank0_print
 
 from .multimodal_encoder.builder import build_vision_tower
 from .multimodal_projector.builder import build_vision_projector
@@ -37,7 +37,6 @@
 
 
 class LlavaMetaModel:
-
     def __init__(self, config):
         super(LlavaMetaModel, self).__init__(config)
 
@@ -191,7 +190,6 @@ def unpad_image(tensor, original_size):
 
 
 class LlavaMetaForCausalLM(ABC):
-
     @abstractmethod
     def get_model(self):
         pass
diff --git a/tools/data_process/caption/llava/model/make_delta.py b/tools/data_process/caption/llava/model/make_delta.py
index 9b0cf0d5..d36c8cc6 100644
--- a/tools/data_process/caption/llava/model/make_delta.py
+++ b/tools/data_process/caption/llava/model/make_delta.py
@@ -6,10 +6,11 @@
 import argparse
 
 import torch
-from llava.model.utils import auto_upgrade
 from tqdm import tqdm
 from transformers import AutoModelForCausalLM, AutoTokenizer
 
+from llava.model.utils import auto_upgrade
+
 
 def make_delta(base_model_path, target_model_path, delta_path, hub_repo_id):
     print("Loading base model")
@@ -34,10 +35,13 @@ def make_delta(base_model_path, target_model_path, delta_path, hub_repo_id):
         if param.data.shape == base.state_dict()[name].shape:
             param.data -= base.state_dict()[name]
         else:
-            assert name in [
-                "model.embed_tokens.weight",
-                "lm_head.weight",
-            ], f"{name} dimension mismatch: {param.data.shape} vs {base.state_dict()[name].shape}"
+            assert (
+                name
+                in [
+                    "model.embed_tokens.weight",
+                    "lm_head.weight",
+                ]
+            ), f"{name} dimension mismatch: {param.data.shape} vs {base.state_dict()[name].shape}"
             bparam = base.state_dict()[name]
             param.data[: bparam.shape[0], : bparam.shape[1]] -= bparam
 
diff --git a/tools/data_process/caption/llava/model/multimodal_encoder/clip_encoder.py b/tools/data_process/caption/llava/model/multimodal_encoder/clip_encoder.py
index d705e5fd..c6d6c812 100644
--- a/tools/data_process/caption/llava/model/multimodal_encoder/clip_encoder.py
+++ b/tools/data_process/caption/llava/model/multimodal_encoder/clip_encoder.py
@@ -25,7 +25,7 @@ def __init__(self, vision_tower, args, delay_load=False):
         elif getattr(args, "unfreeze_mm_vision_tower", False):
             # TODO: better detector is needed.
             rank0_print(
-                f"The checkpoint seems to contain `vision_tower` weights: `unfreeze_mm_vision_tower`: True."
+                "The checkpoint seems to contain `vision_tower` weights: `unfreeze_mm_vision_tower`: True."
             )
             self.load_model()
         elif (
@@ -33,7 +33,7 @@ def __init__(self, vision_tower, args, delay_load=False):
             and "mm_vision_tower" in args.mm_tunable_parts
         ):
             rank0_print(
-                f"The checkpoint seems to contain `vision_tower` weights: `mm_tunable_parts` contains `mm_vision_tower`."
+                "The checkpoint seems to contain `vision_tower` weights: `mm_tunable_parts` contains `mm_vision_tower`."
             )
             self.load_model()
         else:
@@ -160,7 +160,6 @@ def image_size(self):
 
 class CLIPVisionTowerS2(CLIPVisionTower):
     def __init__(self, vision_tower, args, delay_load=False):
-
         self.s2_scales = getattr(args, "s2_scales", "336,672,1008")
         self.s2_scales = list(map(int, self.s2_scales.split(",")))
         self.s2_scales.sort()
diff --git a/tools/data_process/caption/llava/model/multimodal_encoder/dev_eva_clip/eva_clip/eva_vit_model.py b/tools/data_process/caption/llava/model/multimodal_encoder/dev_eva_clip/eva_clip/eva_vit_model.py
index f73e262d..6c2c8fe3 100644
--- a/tools/data_process/caption/llava/model/multimodal_encoder/dev_eva_clip/eva_clip/eva_vit_model.py
+++ b/tools/data_process/caption/llava/model/multimodal_encoder/dev_eva_clip/eva_clip/eva_vit_model.py
@@ -13,7 +13,7 @@
 except:
     from timm.layers import drop_path, to_2tuple, trunc_normal_
 
-from .rope import VisionRotaryEmbedding, VisionRotaryEmbeddingFast
+from .rope import VisionRotaryEmbeddingFast
 from .transformer import PatchDropout
 
 if os.getenv("ENV_TYPE") == "deepspeed":
@@ -214,7 +214,6 @@ def forward(self, x, rel_pos_bias=None, attn_mask=None):
             k = k.reshape(B, N, self.num_heads, -1).permute(0, 2, 1, 3)
             v = v.reshape(B, N, self.num_heads, -1).permute(0, 2, 1, 3)
         else:
-
             qkv_bias = None
             if self.q_bias is not None:
                 qkv_bias = torch.cat(
@@ -292,7 +291,6 @@ def forward(self, x, rel_pos_bias=None, attn_mask=None):
 
 
 class Block(nn.Module):
-
     def __init__(
         self,
         dim,
@@ -427,7 +425,6 @@ def forward(self, x, **kwargs):
 
 
 class RelativePositionBias(nn.Module):
-
     def __init__(self, window_size, num_heads):
         super().__init__()
         self.window_size = window_size
@@ -662,7 +659,6 @@ def reset_classifier(self, num_classes, global_pool=""):
         )
 
     def forward_features(self, x, return_all_features=False):
-
         x = self.patch_embed(x)
         batch_size, seq_len, _ = x.size()
 
diff --git a/tools/data_process/caption/llava/model/multimodal_encoder/dev_eva_clip/eva_clip/factory.py b/tools/data_process/caption/llava/model/multimodal_encoder/dev_eva_clip/eva_clip/factory.py
index d0c86765..a8f0d9b9 100644
--- a/tools/data_process/caption/llava/model/multimodal_encoder/dev_eva_clip/eva_clip/factory.py
+++ b/tools/data_process/caption/llava/model/multimodal_encoder/dev_eva_clip/eva_clip/factory.py
@@ -1,11 +1,10 @@
 import json
 import logging
 import os
-import pathlib
 import re
 from copy import deepcopy
 from pathlib import Path
-from typing import Any, Dict, Optional, Tuple, Union
+from typing import Optional, Tuple, Union
 
 import torch
 
@@ -19,7 +18,6 @@
     CLIP,
     CustomCLIP,
     convert_to_custom_text_state_dict,
-    convert_weights_to_lp,
     get_cast_dtype,
 )
 from .openai import load_openai_model
@@ -38,7 +36,7 @@
     resize_visual_pos_embed,
 )
 
-_MODEL_CONFIG_PATHS = [Path(__file__).parent / f"model_configs/"]
+_MODEL_CONFIG_PATHS = [Path(__file__).parent / "model_configs/"]
 _MODEL_CONFIGS = {}  # directory (model_name: config) of model architecture configs
 
 
diff --git a/tools/data_process/caption/llava/model/multimodal_encoder/dev_eva_clip/eva_clip/hf_model.py b/tools/data_process/caption/llava/model/multimodal_encoder/dev_eva_clip/eva_clip/hf_model.py
index 368cdcf7..83401f9d 100644
--- a/tools/data_process/caption/llava/model/multimodal_encoder/dev_eva_clip/eva_clip/hf_model.py
+++ b/tools/data_process/caption/llava/model/multimodal_encoder/dev_eva_clip/eva_clip/hf_model.py
@@ -1,4 +1,4 @@
-""" huggingface model adapter
+"""huggingface model adapter
 
 Wraps HuggingFace transformers (https://github.com/huggingface/transformers) models for use as a text tower in CLIP model.
 """
@@ -8,7 +8,6 @@
 import torch
 import torch.nn as nn
 from torch import TensorType
-from torch.nn import functional as F
 
 try:
     import transformers
@@ -24,7 +23,7 @@
         BaseModelOutputWithPooling,
         BaseModelOutputWithPoolingAndCrossAttentions,
     )
-except ImportError as e:
+except ImportError:
     transformers = None
 
     class BaseModelOutput:
@@ -82,7 +81,6 @@ def __init__(self, use_pooler_output=True):
         self.use_pooler_output = use_pooler_output
 
     def forward(self, x: BaseModelOutput, attention_mask: TensorType):
-
         if (
             self.use_pooler_output
             and isinstance(
diff --git a/tools/data_process/caption/llava/model/multimodal_encoder/dev_eva_clip/eva_clip/loss.py b/tools/data_process/caption/llava/model/multimodal_encoder/dev_eva_clip/eva_clip/loss.py
index 11b0a476..7445b31a 100644
--- a/tools/data_process/caption/llava/model/multimodal_encoder/dev_eva_clip/eva_clip/loss.py
+++ b/tools/data_process/caption/llava/model/multimodal_encoder/dev_eva_clip/eva_clip/loss.py
@@ -1,5 +1,3 @@
-import math
-
 import torch
 import torch.nn as nn
 from torch.nn import functional as F
@@ -29,9 +27,7 @@ def gather_features(
     world_size=1,
     use_horovod=False,
 ):
-    assert (
-        has_distributed
-    ), "torch.distributed did not import correctly, please use a PyTorch version with support."
+    assert has_distributed, "torch.distributed did not import correctly, please use a PyTorch version with support."
     if use_horovod:
         assert hvd is not None, "Please install horovod"
         if gather_with_grad:
@@ -84,7 +80,6 @@ def gather_features(
 
 
 class ClipLoss(nn.Module):
-
     def __init__(
         self,
         local_loss=False,
diff --git a/tools/data_process/caption/llava/model/multimodal_encoder/dev_eva_clip/eva_clip/model.py b/tools/data_process/caption/llava/model/multimodal_encoder/dev_eva_clip/eva_clip/model.py
index 46efcdfc..a9c25694 100644
--- a/tools/data_process/caption/llava/model/multimodal_encoder/dev_eva_clip/eva_clip/model.py
+++ b/tools/data_process/caption/llava/model/multimodal_encoder/dev_eva_clip/eva_clip/model.py
@@ -1,9 +1,8 @@
-""" CLIP Model
+"""CLIP Model
 
 Adapted from https://github.com/openai/CLIP. Originally MIT License, Copyright (c) 2021 OpenAI.
 """
 
-import os
 from dataclasses import dataclass
 from functools import partial
 from typing import Optional, Tuple, Union
@@ -79,12 +78,8 @@ class CLIPVisionCfg:
     patch_size: int = 16
     image_size: Union[Tuple[int, int], int] = 224
     ls_init_value: Optional[float] = None  # layer scale initial value
-    patch_dropout: float = (
-        0.0  # what fraction of patches to dropout during training (0 would mean disabled and no patches dropped) - 0.5 to 0.75 recommended in the paper for optimal results
-    )
-    global_average_pool: bool = (
-        False  # whether to global average pool the last embedding layer, instead of using CLS token (https://arxiv.org/abs/2205.01580)
-    )
+    patch_dropout: float = 0.0  # what fraction of patches to dropout during training (0 would mean disabled and no patches dropped) - 0.5 to 0.75 recommended in the paper for optimal results
+    global_average_pool: bool = False  # whether to global average pool the last embedding layer, instead of using CLS token (https://arxiv.org/abs/2205.01580)
     drop_path_rate: Optional[float] = None  # drop path rate
     timm_model_name: str = (
         None  # a valid model name overrides layers, width, patch_size
@@ -381,7 +376,6 @@ def convert_weights_to_lp(model: nn.Module, dtype=torch.float16):
     """Convert applicable model parameters to low-precision (bf16 or fp16)"""
 
     def _convert_weights(l):
-
         if isinstance(l, (nn.Conv1d, nn.Conv2d, nn.Linear)):
             l.weight.data = l.weight.data.to(dtype)
             if l.bias is not None:
@@ -487,9 +481,7 @@ def build_model_from_openai_state_dict(
     transformer_heads = transformer_width // 64
     transformer_layers = len(
         set(
-            k.split(".")[2]
-            for k in state_dict
-            if k.startswith(f"transformer.resblocks")
+            k.split(".")[2] for k in state_dict if k.startswith("transformer.resblocks")
         )
     )
 
diff --git a/tools/data_process/caption/llava/model/multimodal_encoder/dev_eva_clip/eva_clip/openai.py b/tools/data_process/caption/llava/model/multimodal_encoder/dev_eva_clip/eva_clip/openai.py
index 1793d58f..e067b8d0 100644
--- a/tools/data_process/caption/llava/model/multimodal_encoder/dev_eva_clip/eva_clip/openai.py
+++ b/tools/data_process/caption/llava/model/multimodal_encoder/dev_eva_clip/eva_clip/openai.py
@@ -1,4 +1,4 @@
-""" OpenAI pretrained model functions
+"""OpenAI pretrained model functions
 
 Adapted from https://github.com/openai/CLIP. Originally MIT License, Copyright (c) 2021 OpenAI.
 """
diff --git a/tools/data_process/caption/llava/model/multimodal_encoder/dev_eva_clip/eva_clip/pretrained.py b/tools/data_process/caption/llava/model/multimodal_encoder/dev_eva_clip/eva_clip/pretrained.py
index 173d72bb..7cd77a29 100644
--- a/tools/data_process/caption/llava/model/multimodal_encoder/dev_eva_clip/eva_clip/pretrained.py
+++ b/tools/data_process/caption/llava/model/multimodal_encoder/dev_eva_clip/eva_clip/pretrained.py
@@ -314,7 +314,7 @@ def download_pretrained_from_url(
         open(download_target, "rb").read()
     ).hexdigest().startswith(expected_sha256):
         raise RuntimeError(
-            f"Model has been downloaded but the SHA256 checksum does not not match"
+            "Model has been downloaded but the SHA256 checksum does not not match"
         )
 
     return download_target
diff --git a/tools/data_process/caption/llava/model/multimodal_encoder/dev_eva_clip/eva_clip/timm_model.py b/tools/data_process/caption/llava/model/multimodal_encoder/dev_eva_clip/eva_clip/timm_model.py
index 82a0c15b..2f7df03c 100644
--- a/tools/data_process/caption/llava/model/multimodal_encoder/dev_eva_clip/eva_clip/timm_model.py
+++ b/tools/data_process/caption/llava/model/multimodal_encoder/dev_eva_clip/eva_clip/timm_model.py
@@ -1,4 +1,4 @@
-""" timm model adapter
+"""timm model adapter
 
 Wraps timm (https://github.com/rwightman/pytorch-image-models) models for use as a vision tower in CLIP model.
 """
@@ -123,7 +123,7 @@ def lock(self, unlocked_groups=0, freeze_bn_stats=False):
     def set_grad_checkpointing(self, enable=True):
         try:
             self.trunk.set_grad_checkpointing(enable)
-        except Exception as e:
+        except Exception:
             logging.warning(
                 "grad checkpointing not supported for this timm image tower, continuing without..."
             )
diff --git a/tools/data_process/caption/llava/model/multimodal_encoder/dev_eva_clip/eva_clip/tokenizer.py b/tools/data_process/caption/llava/model/multimodal_encoder/dev_eva_clip/eva_clip/tokenizer.py
index cc34e49d..12cd2f49 100644
--- a/tools/data_process/caption/llava/model/multimodal_encoder/dev_eva_clip/eva_clip/tokenizer.py
+++ b/tools/data_process/caption/llava/model/multimodal_encoder/dev_eva_clip/eva_clip/tokenizer.py
@@ -1,4 +1,4 @@
-""" CLIP tokenizer
+"""CLIP tokenizer
 
 Copied from https://github.com/openai/CLIP. Originally MIT License, Copyright (c) 2021 OpenAI.
 """
diff --git a/tools/data_process/caption/llava/model/multimodal_encoder/dev_eva_clip/eva_clip/transform.py b/tools/data_process/caption/llava/model/multimodal_encoder/dev_eva_clip/eva_clip/transform.py
index 355dfe93..3143cc6c 100644
--- a/tools/data_process/caption/llava/model/multimodal_encoder/dev_eva_clip/eva_clip/transform.py
+++ b/tools/data_process/caption/llava/model/multimodal_encoder/dev_eva_clip/eva_clip/transform.py
@@ -1,4 +1,4 @@
-from typing import Optional, Sequence, Tuple
+from typing import Optional, Tuple
 
 import torch
 import torch.nn as nn
@@ -17,7 +17,6 @@
 
 
 class ResizeMaxSize(nn.Module):
-
     def __init__(
         self, max_size, interpolation=InterpolationMode.BICUBIC, fn="max", fill=0
     ):
diff --git a/tools/data_process/caption/llava/model/multimodal_encoder/dev_eva_clip/eva_clip/transformer.py b/tools/data_process/caption/llava/model/multimodal_encoder/dev_eva_clip/eva_clip/transformer.py
index 9600e2ea..ec9d49d1 100644
--- a/tools/data_process/caption/llava/model/multimodal_encoder/dev_eva_clip/eva_clip/transformer.py
+++ b/tools/data_process/caption/llava/model/multimodal_encoder/dev_eva_clip/eva_clip/transformer.py
@@ -4,17 +4,15 @@
 from collections import OrderedDict
 from typing import Callable, Optional, Sequence
 
-import numpy as np
 import torch
 from torch import nn
 from torch.nn import functional as F
 
 try:
-    from timm.models.layers import trunc_normal_
+    pass
 except:
-    from timm.layers import trunc_normal_
+    pass
 
-from .rope import VisionRotaryEmbedding, VisionRotaryEmbeddingFast
 from .utils import to_2tuple
 
 if os.getenv("ENV_TYPE") == "deepspeed":
diff --git a/tools/data_process/caption/llava/model/multimodal_encoder/dev_eva_clip/eva_vit.py b/tools/data_process/caption/llava/model/multimodal_encoder/dev_eva_clip/eva_vit.py
index 64bd62be..35f2e986 100644
--- a/tools/data_process/caption/llava/model/multimodal_encoder/dev_eva_clip/eva_vit.py
+++ b/tools/data_process/caption/llava/model/multimodal_encoder/dev_eva_clip/eva_vit.py
@@ -6,16 +6,12 @@
 # https://github.com/facebookresearch/dino
 # --------------------------------------------------------'
 # not tested yet
-import math
 import time
 
 import torch
 import torch.nn as nn
-import torch.nn.functional as F
-import torch.utils.checkpoint as checkpoint
 import torchvision
 from llava.utils import rank0_print
-from timm.models.layers import drop_path, to_2tuple, trunc_normal_
 from transformers import CLIPImageProcessor
 
 from .eva_clip import create_model_and_transforms, get_model_config
@@ -43,7 +39,7 @@ def __init__(self, vision_tower, args, delay_load=False):
         elif getattr(args, "unfreeze_mm_vision_tower", False):
             # TODO: better detector is needed.
             rank0_print(
-                f"The checkpoint seems to contain `vision_tower` weights: `unfreeze_mm_vision_tower`: True."
+                "The checkpoint seems to contain `vision_tower` weights: `unfreeze_mm_vision_tower`: True."
             )
             self.load_model()
         elif (
@@ -51,7 +47,7 @@ def __init__(self, vision_tower, args, delay_load=False):
             and "mm_vision_tower" in args.mm_tunable_parts
         ):
             rank0_print(
-                f"The checkpoint seems to contain `vision_tower` weights: `mm_tunable_parts` contains `mm_vision_tower`."
+                "The checkpoint seems to contain `vision_tower` weights: `mm_tunable_parts` contains `mm_vision_tower`."
             )
             self.load_model()
 
diff --git a/tools/data_process/caption/llava/model/multimodal_encoder/eva_clip/eva_clip_encoder.py b/tools/data_process/caption/llava/model/multimodal_encoder/eva_clip/eva_clip_encoder.py
index 5beb57f8..5b7d7de5 100644
--- a/tools/data_process/caption/llava/model/multimodal_encoder/eva_clip/eva_clip_encoder.py
+++ b/tools/data_process/caption/llava/model/multimodal_encoder/eva_clip/eva_clip_encoder.py
@@ -1,10 +1,9 @@
-import torch
 import torch.nn as nn
 from llava.utils import rank0_print
 
 from .eva_clip_processors import EvaClipImageTrainProcessor
 from .eva_vit import EVAEncoderWrapper
-from .factory import add_model_config, get_model_config, list_models
+from .factory import get_model_config
 
 
 class EvaClipVisionTower(nn.Module):
@@ -22,7 +21,7 @@ def __init__(self, vision_tower, args, delay_load=False):
         elif getattr(args, "unfreeze_mm_vision_tower", False):
             # TODO: better detector is needed.
             rank0_print(
-                f"The checkpoint seems to contain `vision_tower` weights: `unfreeze_mm_vision_tower`: True."
+                "The checkpoint seems to contain `vision_tower` weights: `unfreeze_mm_vision_tower`: True."
             )
             self.load_model()
         elif (
@@ -30,7 +29,7 @@ def __init__(self, vision_tower, args, delay_load=False):
             and "mm_vision_tower" in args.mm_tunable_parts
         ):
             rank0_print(
-                f"The checkpoint seems to contain `vision_tower` weights: `mm_tunable_parts` contains `mm_vision_tower`."
+                "The checkpoint seems to contain `vision_tower` weights: `mm_tunable_parts` contains `mm_vision_tower`."
             )
             self.load_model()
         else:
diff --git a/tools/data_process/caption/llava/model/multimodal_encoder/eva_clip/eva_vit.py b/tools/data_process/caption/llava/model/multimodal_encoder/eva_clip/eva_vit.py
index ddd4ba50..36d8190c 100644
--- a/tools/data_process/caption/llava/model/multimodal_encoder/eva_clip/eva_vit.py
+++ b/tools/data_process/caption/llava/model/multimodal_encoder/eva_clip/eva_vit.py
@@ -370,7 +370,6 @@ def forward(self, x, rel_pos_bias=None, attn_mask=None):
             k = k.reshape(B, N, self.num_heads, -1).permute(0, 2, 1, 3)
             v = v.reshape(B, N, self.num_heads, -1).permute(0, 2, 1, 3)
         else:
-
             qkv_bias = None
             if self.q_bias is not None:
                 qkv_bias = torch.cat(
@@ -448,7 +447,6 @@ def forward(self, x, rel_pos_bias=None, attn_mask=None):
 
 
 class Block(nn.Module):
-
     def __init__(
         self,
         dim,
@@ -583,7 +581,6 @@ def forward(self, x, **kwargs):
 
 
 class RelativePositionBias(nn.Module):
-
     def __init__(self, window_size, num_heads):
         super().__init__()
         self.window_size = window_size
@@ -816,7 +813,6 @@ def reset_classifier(self, num_classes, global_pool=""):
         )
 
     def forward_features(self, x, return_all_features=False):
-
         x = self.patch_embed(x)
         batch_size, seq_len, _ = x.size()
 
@@ -943,12 +939,8 @@ class CLIPVisionCfg:
     patch_size: int = 16
     image_size: Union[Tuple[int, int], int] = 224
     ls_init_value: Optional[float] = None  # layer scale initial value
-    patch_dropout: float = (
-        0.0  # what fraction of patches to dropout during training (0 would mean disabled and no patches dropped) - 0.5 to 0.75 recommended in the paper for optimal results
-    )
-    global_average_pool: bool = (
-        False  # whether to global average pool the last embedding layer, instead of using CLS token (https://arxiv.org/abs/2205.01580)
-    )
+    patch_dropout: float = 0.0  # what fraction of patches to dropout during training (0 would mean disabled and no patches dropped) - 0.5 to 0.75 recommended in the paper for optimal results
+    global_average_pool: bool = False  # whether to global average pool the last embedding layer, instead of using CLS token (https://arxiv.org/abs/2205.01580)
     drop_path_rate: Optional[float] = None  # drop path rate
     timm_model_name: str = (
         None  # a valid model name overrides layers, width, patch_size
diff --git a/tools/data_process/caption/llava/model/multimodal_encoder/eva_clip/factory.py b/tools/data_process/caption/llava/model/multimodal_encoder/eva_clip/factory.py
index 132d0367..9c4626d1 100644
--- a/tools/data_process/caption/llava/model/multimodal_encoder/eva_clip/factory.py
+++ b/tools/data_process/caption/llava/model/multimodal_encoder/eva_clip/factory.py
@@ -1,15 +1,9 @@
 import json
-import logging
-import os
-import pathlib
 import re
 from copy import deepcopy
 from pathlib import Path
-from typing import Any, Dict, Optional, Tuple, Union
 
-import torch
-
-_MODEL_CONFIG_PATHS = [Path(__file__).parent / f"model_configs/"]
+_MODEL_CONFIG_PATHS = [Path(__file__).parent / "model_configs/"]
 _MODEL_CONFIGS = {}  # directory (model_name: config) of model architecture configs
 
 
diff --git a/tools/data_process/caption/llava/model/multimodal_encoder/hf_vision.py b/tools/data_process/caption/llava/model/multimodal_encoder/hf_vision.py
index a373f7b3..7153bec7 100644
--- a/tools/data_process/caption/llava/model/multimodal_encoder/hf_vision.py
+++ b/tools/data_process/caption/llava/model/multimodal_encoder/hf_vision.py
@@ -24,7 +24,7 @@ def load_model(self):
             self.image_processor = AutoImageProcessor.from_pretrained(
                 self.vision_tower_name
             )
-        except Exception as e:
+        except Exception:
             if "448" in self.vision_tower_name:
                 image_size = 448
                 # use image processor with conig
diff --git a/tools/data_process/caption/llava/model/multimodal_encoder/open_clip_encoder.py b/tools/data_process/caption/llava/model/multimodal_encoder/open_clip_encoder.py
index ba37512e..b849c992 100644
--- a/tools/data_process/caption/llava/model/multimodal_encoder/open_clip_encoder.py
+++ b/tools/data_process/caption/llava/model/multimodal_encoder/open_clip_encoder.py
@@ -32,7 +32,7 @@ def __init__(self, vision_tower, args, delay_load=False):
         elif getattr(args, "unfreeze_mm_vision_tower", False):
             # TODO: better detector is needed.
             rank0_print(
-                f"The checkpoint seems to contain `vision_tower` weights: `unfreeze_mm_vision_tower`: True."
+                "The checkpoint seems to contain `vision_tower` weights: `unfreeze_mm_vision_tower`: True."
             )
             self.load_model()
         elif (
@@ -40,7 +40,7 @@ def __init__(self, vision_tower, args, delay_load=False):
             and "mm_vision_tower" in args.mm_tunable_parts
         ):
             rank0_print(
-                f"The checkpoint seems to contain `vision_tower` weights: `mm_tunable_parts` contains `mm_vision_tower`."
+                "The checkpoint seems to contain `vision_tower` weights: `mm_tunable_parts` contains `mm_vision_tower`."
             )
             self.load_model()
 
diff --git a/tools/data_process/caption/llava/model/multimodal_encoder/siglip_encoder.py b/tools/data_process/caption/llava/model/multimodal_encoder/siglip_encoder.py
index a5ea6caa..0d83c0d4 100644
--- a/tools/data_process/caption/llava/model/multimodal_encoder/siglip_encoder.py
+++ b/tools/data_process/caption/llava/model/multimodal_encoder/siglip_encoder.py
@@ -655,7 +655,7 @@ def __init__(self, vision_tower, vision_tower_cfg, delay_load=False):
         elif getattr(vision_tower_cfg, "unfreeze_mm_vision_tower", False):
             # TODO: better detector is needed.
             rank0_print(
-                f"The checkpoint seems to contain `vision_tower` weights: `unfreeze_mm_vision_tower`: True."
+                "The checkpoint seems to contain `vision_tower` weights: `unfreeze_mm_vision_tower`: True."
             )
             self.load_model()
         elif (
@@ -663,7 +663,7 @@ def __init__(self, vision_tower, vision_tower_cfg, delay_load=False):
             and "mm_vision_tower" in vision_tower_cfg.mm_tunable_parts
         ):
             rank0_print(
-                f"The checkpoint seems to contain `vision_tower` weights: `mm_tunable_parts` contains `mm_vision_tower`."
+                "The checkpoint seems to contain `vision_tower` weights: `mm_tunable_parts` contains `mm_vision_tower`."
             )
             self.load_model()
         else:
diff --git a/tools/data_process/caption/llava/model/multimodal_projector/builder.py b/tools/data_process/caption/llava/model/multimodal_projector/builder.py
index afb3e21b..7acf9b9d 100644
--- a/tools/data_process/caption/llava/model/multimodal_projector/builder.py
+++ b/tools/data_process/caption/llava/model/multimodal_projector/builder.py
@@ -1,6 +1,5 @@
 import re
 
-import torch
 import torch.nn as nn
 
 from .pooler_projector import PoolerProjector
diff --git a/tools/data_process/caption/llava/model/multimodal_projector/pooler_projector.py b/tools/data_process/caption/llava/model/multimodal_projector/pooler_projector.py
index df0f95c2..7ce1597d 100644
--- a/tools/data_process/caption/llava/model/multimodal_projector/pooler_projector.py
+++ b/tools/data_process/caption/llava/model/multimodal_projector/pooler_projector.py
@@ -1,8 +1,4 @@
-import math
-
-import torch
 import torch.nn as nn
-from transformers.models.clip.modeling_clip import CLIPVisionModel
 
 
 class PoolerProjector(nn.Module):
diff --git a/tools/data_process/caption/llava/model/multimodal_resampler/masked_drop.py b/tools/data_process/caption/llava/model/multimodal_resampler/masked_drop.py
index 827dde33..3a45fbf9 100644
--- a/tools/data_process/caption/llava/model/multimodal_resampler/masked_drop.py
+++ b/tools/data_process/caption/llava/model/multimodal_resampler/masked_drop.py
@@ -15,7 +15,6 @@ def __init__(self, model_args):
         self.ratio_lower = model_args.mm_mask_drop_ratio_lower
 
     def forward(self, image_features, *args, **kwargs):
-
         if not self.training:
             return image_features
 
diff --git a/tools/data_process/caption/llava/model/multimodal_resampler/qformer.py b/tools/data_process/caption/llava/model/multimodal_resampler/qformer.py
index 5e8e33b4..97a7c347 100644
--- a/tools/data_process/caption/llava/model/multimodal_resampler/qformer.py
+++ b/tools/data_process/caption/llava/model/multimodal_resampler/qformer.py
@@ -1,36 +1,26 @@
 """
- * Copyright (c) 2023, salesforce.com, inc.
- * All rights reserved.
- * SPDX-License-Identifier: BSD-3-Clause
- * For full license text, see LICENSE.txt file in the repo root or https://opensource.org/licenses/BSD-3-Clause
- * By Junnan Li
- * Based on huggingface code base
- * https://github.com/huggingface/transformers/blob/v4.15.0/src/transformers/models/bert
+* Copyright (c) 2023, salesforce.com, inc.
+* All rights reserved.
+* SPDX-License-Identifier: BSD-3-Clause
+* For full license text, see LICENSE.txt file in the repo root or https://opensource.org/licenses/BSD-3-Clause
+* By Junnan Li
+* Based on huggingface code base
+* https://github.com/huggingface/transformers/blob/v4.15.0/src/transformers/models/bert
 """
 
 import math
-import os
-import warnings
-from dataclasses import dataclass
-from typing import Any, Dict, Optional, Tuple
+from typing import Tuple
 
 import torch
-import torch.nn.functional as F
 import torch.utils.checkpoint
-from torch import Tensor, device, dtype, nn
+from torch import Tensor, device, nn
 from torch.nn import CrossEntropyLoss
 from transformers.activations import ACT2FN
-from transformers.file_utils import ModelOutput
 from transformers.modeling_outputs import (
     BaseModelOutputWithPastAndCrossAttentions,
     BaseModelOutputWithPoolingAndCrossAttentions,
     CausalLMOutputWithCrossAttentions,
     MaskedLMOutput,
-    MultipleChoiceModelOutput,
-    NextSentencePredictorOutput,
-    QuestionAnsweringModelOutput,
-    SequenceClassifierOutput,
-    TokenClassifierOutput,
 )
 from transformers.modeling_utils import (
     PreTrainedModel,
@@ -178,7 +168,6 @@ def forward(
         past_key_value=None,
         output_attentions=False,
     ):
-
         # If this is instantiated as a cross-attention module, the keys
         # and values come from an encoder; the attention mask needs to be
         # such that the encoder's padding tokens are not attended to.
@@ -525,7 +514,6 @@ def forward(
             past_key_value = past_key_values[i] if past_key_values is not None else None
 
             if getattr(self.config, "gradient_checkpointing", False) and self.training:
-
                 if use_cache:
                     logger.warning(
                         "`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`..."
@@ -968,7 +956,6 @@ def forward(
 
 
 class BertLMHeadModel(BertPreTrainedModel):
-
     _keys_to_ignore_on_load_unexpected = [r"pooler"]
     _keys_to_ignore_on_load_missing = [r"position_ids", r"predictions.decoder.bias"]
 
@@ -1131,7 +1118,6 @@ def _reorder_cache(self, past, beam_idx):
 
 
 class BertForMaskedLM(BertPreTrainedModel):
-
     _keys_to_ignore_on_load_unexpected = [r"pooler"]
     _keys_to_ignore_on_load_missing = [r"position_ids", r"predictions.decoder.bias"]
 
diff --git a/tools/data_process/caption/llava/model/multimodal_resampler/spatial_pool.py b/tools/data_process/caption/llava/model/multimodal_resampler/spatial_pool.py
index f508afd3..fd3d5c64 100644
--- a/tools/data_process/caption/llava/model/multimodal_resampler/spatial_pool.py
+++ b/tools/data_process/caption/llava/model/multimodal_resampler/spatial_pool.py
@@ -1,6 +1,5 @@
 import math
 
-import torch
 import torch.nn as nn
 
 
diff --git a/tools/data_process/caption/llava/serve/cli.py b/tools/data_process/caption/llava/serve/cli.py
index 4537d586..3b400243 100644
--- a/tools/data_process/caption/llava/serve/cli.py
+++ b/tools/data_process/caption/llava/serve/cli.py
@@ -3,6 +3,9 @@
 
 import requests
 import torch
+from PIL import Image
+from transformers import TextStreamer
+
 from llava.constants import (
     DEFAULT_IM_END_TOKEN,
     DEFAULT_IM_START_TOKEN,
@@ -17,8 +20,6 @@
 )
 from llava.model.builder import load_pretrained_model
 from llava.utils import disable_torch_init
-from PIL import Image
-from transformers import TextStreamer
 
 
 def load_image(image_file):
diff --git a/tools/data_process/caption/llava/serve/controller.py b/tools/data_process/caption/llava/serve/controller.py
index 764161b8..997ff924 100644
--- a/tools/data_process/caption/llava/serve/controller.py
+++ b/tools/data_process/caption/llava/serve/controller.py
@@ -4,20 +4,19 @@
 """
 
 import argparse
-import asyncio
 import dataclasses
 import json
-import logging
 import threading
 import time
 from enum import Enum, auto
-from typing import List, Union
+from typing import List
 
 import numpy as np
 import requests
 import uvicorn
 from fastapi import FastAPI, Request
 from fastapi.responses import StreamingResponse
+
 from llava.constants import CONTROLLER_HEART_BEAT_EXPIRATION
 from llava.utils import build_logger, server_error_msg
 
@@ -35,7 +34,7 @@ def from_str(cls, name):
         elif name == "shortest_queue":
             return cls.SHORTEST_QUEUE
         else:
-            raise ValueError(f"Invalid dispatch method")
+            raise ValueError("Invalid dispatch method")
 
 
 @dataclasses.dataclass
@@ -215,7 +214,7 @@ def worker_api_generate_stream(self, params):
             for chunk in response.iter_lines(decode_unicode=False, delimiter=b"\0"):
                 if chunk:
                     yield chunk + b"\0"
-        except requests.exceptions.RequestException as e:
+        except requests.exceptions.RequestException:
             logger.info(f"worker timeout: {worker_addr}")
             ret = {
                 "text": server_error_msg,
diff --git a/tools/data_process/caption/llava/serve/gradio_multi_image.py b/tools/data_process/caption/llava/serve/gradio_multi_image.py
index fbfc18ac..48be4d6e 100644
--- a/tools/data_process/caption/llava/serve/gradio_multi_image.py
+++ b/tools/data_process/caption/llava/serve/gradio_multi_image.py
@@ -7,6 +7,7 @@
 
 import gradio as gr
 import requests
+
 from llava.constants import LOGDIR
 from llava.conversation import SeparatorStyle, conv_templates, default_conversation
 from llava.utils import (
@@ -328,7 +329,7 @@ def http_bot(
                     )
                     return
                 time.sleep(0.03)
-    except requests.exceptions.RequestException as e:
+    except requests.exceptions.RequestException:
         state.messages[-1][-1] = server_error_msg
         yield (state, state.to_gradio_chatbot()) + (
             disable_btn,
diff --git a/tools/data_process/caption/llava/serve/gradio_web_server.py b/tools/data_process/caption/llava/serve/gradio_web_server.py
index 8c2bb0b4..b19d2696 100644
--- a/tools/data_process/caption/llava/serve/gradio_web_server.py
+++ b/tools/data_process/caption/llava/serve/gradio_web_server.py
@@ -7,6 +7,7 @@
 
 import gradio as gr
 import requests
+
 from llava.constants import LOGDIR
 from llava.conversation import SeparatorStyle, conv_templates, default_conversation
 from llava.utils import (
@@ -319,7 +320,7 @@ def http_bot(
                     )
                     return
                 time.sleep(0.03)
-    except requests.exceptions.RequestException as e:
+    except requests.exceptions.RequestException:
         state.messages[-1][-1] = server_error_msg
         yield (state, state.to_gradio_chatbot()) + (
             disable_btn,
diff --git a/tools/data_process/caption/llava/serve/model_worker.py b/tools/data_process/caption/llava/serve/model_worker.py
index 06a664e6..b7fb1363 100644
--- a/tools/data_process/caption/llava/serve/model_worker.py
+++ b/tools/data_process/caption/llava/serve/model_worker.py
@@ -16,6 +16,8 @@
 import uvicorn
 from fastapi import BackgroundTasks, FastAPI, Request
 from fastapi.responses import StreamingResponse
+from transformers import TextIteratorStreamer
+
 from llava.constants import (
     DEFAULT_IM_END_TOKEN,
     DEFAULT_IM_START_TOKEN,
@@ -31,7 +33,6 @@
 )
 from llava.model.builder import load_pretrained_model
 from llava.utils import build_logger, pretty_print_semaphore, server_error_msg
-from transformers import TextIteratorStreamer
 
 GB = 1 << 30
 
@@ -43,7 +44,6 @@
 
 
 def heart_beat_worker(controller):
-
     while True:
         time.sleep(WORKER_HEART_BEAT_INTERVAL)
         controller.send_heart_beat()
@@ -226,13 +226,16 @@ def generate_stream(self, params):
         )
 
         if max_new_tokens < 1:
-            yield json.dumps(
-                {
-                    "text": ori_prompt
-                    + "Exceeds max token length. Please start a new conversation, thanks.",
-                    "error_code": 0,
-                }
-            ).encode() + b"\0"
+            yield (
+                json.dumps(
+                    {
+                        "text": ori_prompt
+                        + "Exceeds max token length. Please start a new conversation, thanks.",
+                        "error_code": 0,
+                    }
+                ).encode()
+                + b"\0"
+            )
             return
 
         thread = Thread(
diff --git a/tools/data_process/caption/llava/serve/sglang_worker.py b/tools/data_process/caption/llava/serve/sglang_worker.py
index 620e6195..7fb1c5ce 100644
--- a/tools/data_process/caption/llava/serve/sglang_worker.py
+++ b/tools/data_process/caption/llava/serve/sglang_worker.py
@@ -5,11 +5,9 @@
 import argparse
 import asyncio
 import json
-import re
 import threading
 import time
 import uuid
-from concurrent.futures import ThreadPoolExecutor
 from functools import partial
 
 import requests
@@ -17,29 +15,17 @@
 import uvicorn
 from fastapi import BackgroundTasks, FastAPI, Request
 from fastapi.responses import StreamingResponse
+from sglang.backend.runtime_endpoint import RuntimeEndpoint
+
 from llava.constants import (
-    DEFAULT_IM_END_TOKEN,
-    DEFAULT_IM_START_TOKEN,
     DEFAULT_IMAGE_TOKEN,
-    IMAGE_TOKEN_INDEX,
     WORKER_HEART_BEAT_INTERVAL,
 )
 from llava.mm_utils import (
     expand2square,
     load_image_from_base64,
-    process_images,
-    tokenizer_image_token,
 )
-from llava.model.builder import load_pretrained_model
 from llava.utils import build_logger, pretty_print_semaphore, server_error_msg
-from sglang.backend.runtime_endpoint import RuntimeEndpoint
-from sglang.lang.interpreter import ProgramState
-from sglang.test.test_utils import (
-    add_common_sglang_args_and_parse,
-    select_sglang_backend,
-)
-from sglang.utils import dump_state_text, read_jsonl
-from transformers import AutoTokenizer
 
 GB = 1 << 30
 
@@ -215,13 +201,16 @@ async def generate_stream(self, params):
         stop_str = [stop_str] if stop_str is not None else None
 
         if max_new_tokens < 1:
-            yield json.dumps(
-                {
-                    "text": ori_prompt
-                    + "Exceeds max token length. Please start a new conversation, thanks.",
-                    "error_code": 0,
-                }
-            ).encode() + b"\0"
+            yield (
+                json.dumps(
+                    {
+                        "text": ori_prompt
+                        + "Exceeds max token length. Please start a new conversation, thanks.",
+                        "error_code": 0,
+                    }
+                ).encode()
+                + b"\0"
+            )
             return
 
         # print(prompt)
diff --git a/tools/data_process/caption/llava/serve/test_message.py b/tools/data_process/caption/llava/serve/test_message.py
index 3d5c2576..34ca8883 100644
--- a/tools/data_process/caption/llava/serve/test_message.py
+++ b/tools/data_process/caption/llava/serve/test_message.py
@@ -2,6 +2,7 @@
 import json
 
 import requests
+
 from llava.conversation import default_conversation
 
 
diff --git a/tools/data_process/caption/llava/train/llama_flash_attn_monkey_patch.py b/tools/data_process/caption/llava/train/llama_flash_attn_monkey_patch.py
index dcb1a2de..c16c8cbc 100644
--- a/tools/data_process/caption/llava/train/llama_flash_attn_monkey_patch.py
+++ b/tools/data_process/caption/llava/train/llama_flash_attn_monkey_patch.py
@@ -112,7 +112,5 @@ def replace_llama_attn_with_flash_attn():
             "Flash attention is only supported on A100 or H100 GPU during training due to head dim > 64 backward."
             "ref: https://github.com/HazyResearch/flash-attention/issues/190#issuecomment-1523359593"
         )
-    transformers.models.llama.modeling_llama.LlamaModel._prepare_decoder_attention_mask = (
-        _prepare_decoder_attention_mask
-    )
+    transformers.models.llama.modeling_llama.LlamaModel._prepare_decoder_attention_mask = _prepare_decoder_attention_mask
     transformers.models.llama.modeling_llama.LlamaAttention.forward = forward
diff --git a/tools/data_process/caption/llava/train/llava_trainer.py b/tools/data_process/caption/llava/train/llava_trainer.py
index ce9af2ca..3ab07f68 100644
--- a/tools/data_process/caption/llava/train/llava_trainer.py
+++ b/tools/data_process/caption/llava/train/llava_trainer.py
@@ -1,4 +1,3 @@
-import datetime
 import os
 from datetime import timedelta
 from typing import List, Optional
@@ -7,7 +6,7 @@
 import torch.nn as nn
 from accelerate import Accelerator
 from accelerate.utils import GradientAccumulationPlugin, InitProcessGroupKwargs
-from torch.utils.data import DataLoader, Dataset, Sampler
+from torch.utils.data import DataLoader, Sampler
 from transformers import Trainer
 from transformers.trainer import (
     ALL_LAYERNORM_LAYERS,
@@ -19,16 +18,14 @@
     is_sagemaker_mp_enabled,
     logger,
 )
-from transformers.trainer_pt_utils import AcceleratorConfig
 from transformers.trainer_pt_utils import (
     get_length_grouped_indices as get_length_grouped_indices_hf,
 )
 from transformers.trainer_utils import seed_worker
 from trl.trainer import DPOTrainer
-from trl.trainer.utils import DPODataCollatorWithPadding
 
 if is_accelerate_available():
-    from accelerate import Accelerator, InitProcessGroupKwargs, skip_first_batches
+    from accelerate import Accelerator, InitProcessGroupKwargs
 
 if is_datasets_available():
     import datasets
@@ -347,7 +344,6 @@ def __iter__(self):
 
 
 class LLaVATrainer(Trainer):
-
     def create_accelerator_and_postprocess(self):
         grad_acc_kwargs = {"num_steps": self.args.gradient_accumulation_steps}
         grad_acc_kwargs["sync_with_dataloader"] = False
@@ -664,9 +660,7 @@ def _save_checkpoint(self, model, trial, metrics=None):
 
             if self.args.local_rank == 0 or self.args.local_rank == -1:
                 self.model.config.save_pretrained(output_dir)
-                torch.save(
-                    weight_to_save, os.path.join(output_dir, f"mm_projector.bin")
-                )
+                torch.save(weight_to_save, os.path.join(output_dir, "mm_projector.bin"))
         else:
             super(LLaVATrainer, self)._save_checkpoint(model, trial, metrics)
 
@@ -723,9 +717,7 @@ def _save_checkpoint(self, model, trial, metrics=None):
 
             if self.args.local_rank == 0 or self.args.local_rank == -1:
                 self.model.config.save_pretrained(output_dir)
-                torch.save(
-                    weight_to_save, os.path.join(output_dir, f"mm_projector.bin")
-                )
+                torch.save(weight_to_save, os.path.join(output_dir, "mm_projector.bin"))
         else:
             # super(LLaVADPOTrainer, self)._save_checkpoint(model, trial, metrics)
             # print(type(model))
diff --git a/tools/data_process/caption/llava/train/llava_trainer_eval.py b/tools/data_process/caption/llava/train/llava_trainer_eval.py
index ac4d2186..ac65d6ae 100644
--- a/tools/data_process/caption/llava/train/llava_trainer_eval.py
+++ b/tools/data_process/caption/llava/train/llava_trainer_eval.py
@@ -20,18 +20,18 @@ def evaluate(self, evaluate_args):
         if evaluate_args.gen_kwargs != "":
             cmd += f" --gen_kwargs {evaluate_args.gen_kwargs}"
         if evaluate_args.log_samples:
-            cmd += f" --log_samples"
+            cmd += " --log_samples"
         else:
             assert False, "Please log samples so that the result can be parsed"
         results = subprocess.run([cmd], shell=True, capture_output=True, text=True)
         try:
             result_file_index_start = results.stdout.index("Saved samples to ")
-            result_file_index_end = results.stdout.index(f".json")
+            result_file_index_end = results.stdout.index(".json")
             result_file_index_start += len("Saved samples to ")
             file = results.stdout[result_file_index_start:result_file_index_end]
         except:
             result_file_index_start = results.stderr.index("Saved samples to ")
-            result_file_index_end = results.stderr.index(f".json")
+            result_file_index_end = results.stderr.index(".json")
             result_file_index_start += len("Saved samples to ")
             file = results.stderr[result_file_index_start:result_file_index_end]
         file = file.split("/")[:-1]
diff --git a/tools/data_process/caption/llava/train/train.py b/tools/data_process/caption/llava/train/train.py
index d282c022..a7f6ee19 100644
--- a/tools/data_process/caption/llava/train/train.py
+++ b/tools/data_process/caption/llava/train/train.py
@@ -49,7 +49,7 @@
 )
 from llava.model import *
 from llava.train.llava_trainer import LLaVATrainer
-from llava.utils import process_video_with_decord, process_video_with_pyav, rank0_print
+from llava.utils import process_video_with_av, rank0_print
 from packaging import version
 from PIL import Image, ImageFile
 from torch.utils.data import Dataset
@@ -344,9 +344,7 @@ def safe_save_model_for_hf_trainer(trainer: transformers.Trainer, output_dir: st
                     os.path.join(mm_projector_folder, f"{current_folder}.bin"),
                 )
             else:
-                torch.save(
-                    weight_to_save, os.path.join(output_dir, f"mm_projector.bin")
-                )
+                torch.save(weight_to_save, os.path.join(output_dir, "mm_projector.bin"))
         return
 
     if trainer.deepspeed:
@@ -1399,7 +1397,7 @@ def _get_item(self, i) -> Dict[str, torch.Tensor]:
                         except IOError:
                             print(f"Failed to read frame at path: {frame_path}")
                 else:
-                    video = process_video_with_decord(video_file, self.data_args)
+                    video = process_video_with_av(video_file, self.data_args)
 
                 processor = self.data_args.image_processor
                 image = processor.preprocess(video, return_tensors="pt")["pixel_values"]
@@ -1720,7 +1718,7 @@ def train(attn_implementation=None):
     model_args, data_args, training_args = parser.parse_args_into_dataclasses()
 
     if training_args.verbose_logging:
-        rank0_print(f"Inspecting experiment hyperparameters:\n")
+        rank0_print("Inspecting experiment hyperparameters:\n")
         rank0_print(f"model_args = {vars(model_args)}\n\n")
         rank0_print(f"data_args = {vars(data_args)}\n\n")
         rank0_print(f"training_args = {vars(training_args)}\n\n")
@@ -1887,7 +1885,7 @@ def make_inputs_require_grad(module, input, output):
             ):
                 try:
                     patch_size = data_args.image_processor.size[0]
-                except Exception as e:
+                except Exception:
                     patch_size = data_args.image_processor.size["shortest_edge"]
 
                 assert patch_size in [
diff --git a/tools/data_process/caption/llava/train/train_dpo.py b/tools/data_process/caption/llava/train/train_dpo.py
index c1712d09..2598ca83 100644
--- a/tools/data_process/caption/llava/train/train_dpo.py
+++ b/tools/data_process/caption/llava/train/train_dpo.py
@@ -16,7 +16,6 @@
 
 import ast
 import copy
-import json
 import logging
 import math
 import os
@@ -34,7 +33,7 @@
 import transformers
 import yaml
 from data_processing.utils import load_json, load_jsonl
-from decord import VideoReader, cpu
+from videotuna.utils.video_io import AvVideoReader as VideoReader
 from llava import conversation as conversation_lib
 from llava.constants import (
     DEFAULT_IM_END_TOKEN,
@@ -335,9 +334,7 @@ def safe_save_model_for_hf_trainer(trainer: transformers.Trainer, output_dir: st
                     os.path.join(mm_projector_folder, f"{current_folder}.bin"),
                 )
             else:
-                torch.save(
-                    weight_to_save, os.path.join(output_dir, f"mm_projector.bin")
-                )
+                torch.save(weight_to_save, os.path.join(output_dir, "mm_projector.bin"))
         return
 
     if trainer.deepspeed:
@@ -1397,7 +1394,7 @@ def _get_item(self, i) -> Dict[str, torch.Tensor]:
                 )
             else:  # using videoreader
                 if "shareVideoGPTV" not in video_file and "liangke" not in video_file:
-                    vr = VideoReader(video_file, ctx=cpu(0))
+                    vr = VideoReader(video_file)
                     total_frame_num = len(vr)
                     avg_fps = round(vr.get_avg_fps() / self.data_args.video_fps)
                     frame_idx = [i for i in range(0, total_frame_num, avg_fps)]
@@ -1589,7 +1586,6 @@ def tokenize_batch_element(
         return batch
 
     def __call__(self, features: List[Dict[str, Any]]) -> Dict[str, Any]:
-
         tokenized_batch = []
         Xs, keys = [], []
         for feature in features:
@@ -1852,7 +1848,7 @@ def train(attn_implementation=None):
     model_args, data_args, training_args = parser.parse_args_into_dataclasses()
 
     if training_args.verbose_logging:
-        rank0_print(f"Inspecting experiment hyperparameters:\n")
+        rank0_print("Inspecting experiment hyperparameters:\n")
         rank0_print(f"model_args = {vars(model_args)}\n\n")
         rank0_print(f"data_args = {vars(data_args)}\n\n")
         rank0_print(f"training_args = {vars(training_args)}\n\n")
diff --git a/tools/data_process/caption/llava/utils.py b/tools/data_process/caption/llava/utils.py
index c8f03b31..747cfcf4 100644
--- a/tools/data_process/caption/llava/utils.py
+++ b/tools/data_process/caption/llava/utils.py
@@ -1,4 +1,3 @@
-import datetime
 import logging
 import logging.handlers
 import os
@@ -6,6 +5,7 @@
 
 import numpy as np
 import requests
+
 from llava.constants import LOGDIR
 
 server_error_msg = (
@@ -19,13 +19,13 @@
 
 try:
     import av
-    from decord import VideoReader, cpu
+    from videotuna.utils.video_io import AvVideoReader as VideoReader
 except ImportError:
     print("Please install pyav to use video processing functions.")
 
 
-def process_video_with_decord(video_file, data_args):
-    vr = VideoReader(video_file, ctx=cpu(0), num_threads=1)
+def process_video_with_av(video_file, data_args):
+    vr = VideoReader(video_file, num_threads=1)
     total_frame_num = len(vr)
     avg_fps = round(vr.get_avg_fps() / data_args.video_fps)
     frame_idx = [i for i in range(0, total_frame_num, avg_fps)]
@@ -38,11 +38,15 @@ def process_video_with_decord(video_file, data_args):
             frame_idx = uniform_sampled_frames.tolist()
 
     video = vr.get_batch(frame_idx).asnumpy()
-    # https://github.com/dmlc/decord/issues/208
     vr.seek(0)
     return video
 
 
+def process_video_with_decord(video_file, data_args):
+    """Deprecated alias for process_video_with_av."""
+    return process_video_with_av(video_file, data_args)
+
+
 def process_video_with_pyav(video_file, data_args):
     container = av.open(video_file)
     # !!! This is the only difference. Using auto threading
diff --git a/tools/data_process/scenecut.py b/tools/data_process/scenecut.py
index 0e3ab434..0ba29fa9 100644
--- a/tools/data_process/scenecut.py
+++ b/tools/data_process/scenecut.py
@@ -1,18 +1,17 @@
 import argparse
 import json
-import multiprocessing as mp
 import os
 import subprocess
-from typing import Any, List, Tuple, Union
+from typing import Any, List, Union
 
 import tqdm
 from joblib import Parallel, delayed
 
 # Standard PySceneDetect imports:
-from scenedetect import SceneManager, VideoManager, open_video
+from scenedetect import SceneManager, open_video
 
 # For content-aware scene detection:
-from scenedetect.detectors import AdaptiveDetector, ContentDetector
+from scenedetect.detectors import ContentDetector
 
 # Standard PySceneDetect imports:
 from scenedetect.frame_timecode import FrameTimecode
diff --git a/tools/deepspeed_checkpoint_converter.py b/tools/deepspeed_checkpoint_converter.py
index 21ef8246..188bb3cc 100644
--- a/tools/deepspeed_checkpoint_converter.py
+++ b/tools/deepspeed_checkpoint_converter.py
@@ -1,18 +1,16 @@
 # Please refer to https://deepspeed.readthedocs.io/en/latest/model-checkpointing.html#saving-training-checkpoints
-from deepspeed.utils.zero_to_fp32 import get_fp32_state_dict_from_zero_checkpoint
 import torch
+from deepspeed.utils.zero_to_fp32 import get_fp32_state_dict_from_zero_checkpoint
 
-# The dir contains your deepspeed checkpoints. The dir should contains "lattest" file. 
-# One example file path is: results/train/xxxxxxx_hunyuanvideo_t2v_lora/checkpoints/epoch=161.ckpt
+# The dir contains your deepspeed checkpoints. The dir should contains "lattest" file.
+# Example: results/train/train_wan_domain_t2v_lora/checkpoints/epoch=161.ckpt
 checkpoint_dir = "path/to/your/checkpoint_dir"
 
-# Path to save your converted checkpoint. The checkpoint can be directly loaded with "torch.load" function
+# Path to save your converted checkpoint. Load with torch.load().
 save_path = "path/to/save/your/checkpoint_dir"
 
 
-state_dict = get_fp32_state_dict_from_zero_checkpoint(
-    checkpoint_dir
-)
+state_dict = get_fp32_state_dict_from_zero_checkpoint(checkpoint_dir)
 
 checkpoint = {"state_dict": state_dict}
 
diff --git a/tools/spike_wan_lora_bridge.py b/tools/spike_wan_lora_bridge.py
new file mode 100644
index 00000000..cc3eb79f
--- /dev/null
+++ b/tools/spike_wan_lora_bridge.py
@@ -0,0 +1,107 @@
+#!/usr/bin/env python3
+"""GPU/CPU spike: inventory native Wan LoRA keys and optional smoke inference."""
+
+from __future__ import annotations
+
+import argparse
+import json
+import sys
+from pathlib import Path
+
+REPO_ROOT = Path(__file__).resolve().parents[1]
+if str(REPO_ROOT) not in sys.path:
+    sys.path.insert(0, str(REPO_ROOT))
+
+from videotuna.testing.wan_lora_ckpt import build_synthetic_wan_lora_ckpt  # noqa: E402
+from videotuna.utils.wan_lora_bridge import (  # noqa: E402
+    analyze_native_wan_lora_ckpt,
+    apply_native_wan_lora_to_pipeline,
+    compute_remap_coverage,
+    is_native_wan_lora_ckpt,
+    load_native_wan_lora_state_dict,
+)
+
+
+def _build_synthetic_ckpt(path: Path, *, num_blocks: int = 2, rank: int = 16) -> None:
+    """Write a synthetic denoiser ckpt with production-style key names."""
+    build_synthetic_wan_lora_ckpt(path, num_blocks=num_blocks, rank=rank)
+    import torch
+
+    raw = torch.load(path, map_location="cpu", weights_only=False)
+    count = len(raw.get("state_dict", {}))
+    print(f"Wrote synthetic checkpoint: {path} ({count} tensors)")
+
+
+def _inventory_only(ckpt: Path) -> int:
+    info = analyze_native_wan_lora_ckpt(ckpt)
+    native = load_native_wan_lora_state_dict(ckpt)
+    transformed, total, coverage = compute_remap_coverage(native)
+    print(json.dumps(info, indent=2))
+    print(f"Remap coverage: {transformed}/{total} keys transformed ({coverage:.1%})")
+    return 0
+
+
+def _load_on_pipeline(ckpt: Path, model_id: str) -> int:
+    import torch
+    from diffusers import WanPipeline
+
+    dtype = torch.bfloat16 if torch.cuda.is_available() else torch.float32
+    print(f"Loading pipeline {model_id} (dtype={dtype})...")
+    pipeline = WanPipeline.from_pretrained(model_id, torch_dtype=dtype)
+    reports = apply_native_wan_lora_to_pipeline(pipeline, ckpt)
+    for report in reports:
+        print(json.dumps(report.as_dict(), indent=2))
+        if report.loaded_lora_params == 0:
+            return 1
+    return 0
+
+
+def main() -> int:
+    parser = argparse.ArgumentParser(
+        description=(
+            "Spike: Wan 2.1 native LoRA → 2.2 Diffusers bridge inventory/load test."
+        ),
+    )
+    parser.add_argument(
+        "--input",
+        type=Path,
+        help="Native denoiser .ckpt path (or omit with --synthetic)",
+    )
+    parser.add_argument(
+        "--synthetic",
+        type=Path,
+        metavar="PATH",
+        help="Write a synthetic production-key ckpt to PATH and use it",
+    )
+    parser.add_argument(
+        "--load-pipeline",
+        action="store_true",
+        help="Load Wan 2.2 pipeline and apply bridge (requires GPU + weights)",
+    )
+    parser.add_argument(
+        "--model-id",
+        default="Wan-AI/Wan2.2-T2V-A14B-Diffusers",
+        help="Diffusers model id for --load-pipeline",
+    )
+    args = parser.parse_args()
+
+    ckpt = args.input
+    if args.synthetic:
+        _build_synthetic_ckpt(args.synthetic)
+        ckpt = args.synthetic
+
+    if ckpt is None:
+        parser.error("Provide --input or --synthetic")
+
+    if not is_native_wan_lora_ckpt(ckpt):
+        print(f"Not a native Wan LoRA checkpoint: {ckpt}", file=sys.stderr)
+        return 1
+
+    rc = _inventory_only(ckpt)
+    if args.load_pipeline:
+        rc = _load_on_pipeline(ckpt, args.model_id) or rc
+    return rc
+
+
+if __name__ == "__main__":
+    raise SystemExit(main())
diff --git a/tools/spike_wan_quant_compare.py b/tools/spike_wan_quant_compare.py
new file mode 100644
index 00000000..f074c70f
--- /dev/null
+++ b/tools/spike_wan_quant_compare.py
@@ -0,0 +1,228 @@
+#!/usr/bin/env python3
+"""
+Compare Wan 2.2 Diffusers peak VRAM: baseline vs torchao int8/fp8 vs quanto int8.
+
+Manual rental-GPU spike — not run in CI. Documents quant evaluation per runbook.
+
+Usage:
+  poetry run python tools/spike_wan_quant_compare.py \\
+    --prompt "A cat walking on grass" \\
+    --frames 17 --height 480 --width 848
+
+  # fp8 on Ada/Hopper (sm >= 8.9):
+  poetry run python tools/spike_wan_quant_compare.py \\
+    --schemes none,fp8_wo --prompt "A cat walking on grass"
+"""
+
+from __future__ import annotations
+
+import argparse
+import json
+import sys
+from pathlib import Path
+
+REPO_ROOT = Path(__file__).resolve().parents[1]
+sys.path.insert(0, str(REPO_ROOT))
+
+_FP8_MIN_COMPUTE_CAPABILITY = (8, 9)
+
+
+def _fp8_gpu_supported() -> bool:
+    import torch
+
+    if not torch.cuda.is_available():
+        return False
+    major, minor = torch.cuda.get_device_capability(0)
+    return (major, minor) >= _FP8_MIN_COMPUTE_CAPABILITY
+
+
+def _run_variant(
+    *,
+    label: str,
+    transformer_quant: str,
+    quant_backend: str,
+    args: argparse.Namespace,
+) -> dict:
+    import torch
+    from diffusers import WanPipeline
+
+    from videotuna.utils.common_utils import monitor_resources
+    from videotuna.utils.diffusers_optimizations import apply_diffusers_optimizations
+    from videotuna.utils.diffusers_quantization import (
+        build_pipeline_quantization_config,
+        resolve_quant_components,
+        validate_transformer_quant,
+    )
+
+    validate_transformer_quant(
+        transformer_quant=transformer_quant,
+        quant_backend=quant_backend,
+        offload_mode="model",
+    )
+    dtype = torch.bfloat16 if transformer_quant == "fp8_wo" else torch.float16
+    model_id = args.model_id
+    load_kwargs: dict = {"torch_dtype": dtype}
+    if transformer_quant != "none":
+        components = resolve_quant_components("wan", "2.2", "t2v")
+        qcfg = build_pipeline_quantization_config(
+            transformer_quant=transformer_quant,
+            quant_backend=quant_backend,
+            components=components,
+        )
+        if qcfg is not None:
+            load_kwargs["quantization_config"] = qcfg
+
+    pipe = WanPipeline.from_pretrained(model_id, **load_kwargs)
+    ns = argparse.Namespace(
+        enable_sequential_cpu_offload=False,
+        enable_model_cpu_offload=True,
+        enable_vae_tiling=True,
+        enable_vae_slicing=False,
+        fuse_qkv=False,
+        enable_attention_cache=False,
+        device=None,
+        device_map=None,
+    )
+    apply_diffusers_optimizations(pipe, ns, model_family="wan")
+
+    @monitor_resources(return_metrics=True, frames=args.frames)
+    def _gen():
+        generator = torch.Generator(device="cuda").manual_seed(args.seed)
+        pipe(
+            prompt=args.prompt,
+            num_frames=args.frames,
+            height=args.height,
+            width=args.width,
+            num_inference_steps=args.steps,
+            guidance_scale=5.0,
+            generator=generator,
+        )
+        return {"ok": True}
+
+    metrics = _gen()
+    return {
+        "label": label,
+        "transformer_quant": transformer_quant,
+        "quant_backend": quant_backend,
+        "peak_vram_gb": metrics.get("peak_vram_gb", -1),
+        "wall_time_s": metrics.get("wall_time_s", -1),
+    }
+
+
+def _build_variants(cli: argparse.Namespace) -> list[tuple[str, str, str]]:
+    scheme_to_label = {
+        "none": "baseline",
+        "int8_wo": "torchao_int8",
+        "int4_wo": "torchao_int4",
+        "fp8_wo": "torchao_fp8",
+    }
+    requested = [s.strip() for s in cli.schemes.split(",") if s.strip()]
+    variants: list[tuple[str, str, str]] = []
+    for scheme in requested:
+        if scheme not in scheme_to_label:
+            raise ValueError(
+                f"Unsupported scheme {scheme!r}. "
+                f"Expected one of: {', '.join(scheme_to_label)}"
+            )
+        label = scheme_to_label[scheme]
+        backend = "torchao"
+        if scheme == "int8_wo" and cli.include_quanto:
+            variants.append((label, scheme, backend))
+            variants.append(("quanto_int8", scheme, "quanto"))
+            continue
+        if scheme == "fp8_wo" and not _fp8_gpu_supported():
+            print(
+                f"Skipping {label}: fp8_wo requires NVIDIA GPU sm >= "
+                f"{_FP8_MIN_COMPUTE_CAPABILITY[0]}.{_FP8_MIN_COMPUTE_CAPABILITY[1]}",
+                file=sys.stderr,
+            )
+            continue
+        variants.append((label, scheme, backend))
+    return variants
+
+
+def main() -> None:
+    parser = argparse.ArgumentParser(description="Wan 2.2 quant VRAM spike")
+    parser.add_argument(
+        "--model-id",
+        default="Wan-AI/Wan2.2-T2V-A14B-Diffusers",
+    )
+    parser.add_argument("--prompt", default="A cat walking on grass")
+    parser.add_argument("--frames", type=int, default=17)
+    parser.add_argument("--height", type=int, default=480)
+    parser.add_argument("--width", type=int, default=848)
+    parser.add_argument("--steps", type=int, default=4)
+    parser.add_argument("--seed", type=int, default=42)
+    parser.add_argument(
+        "--schemes",
+        default="none,int8_wo,fp8_wo",
+        help="Comma-separated transformer_quant schemes to benchmark",
+    )
+    parser.add_argument(
+        "--include-int4",
+        action="store_true",
+        help="Add torchao int4_wo variant to the run list",
+    )
+    parser.add_argument(
+        "--include-quanto",
+        action="store_true",
+        help="Also run optimum-quanto int8 when int8_wo is requested",
+    )
+    parser.add_argument(
+        "--out",
+        type=Path,
+        default=Path("results/spike_wan_quant.json"),
+    )
+    cli = parser.parse_args()
+
+    import torch
+    import torchao
+
+    if not torch.cuda.is_available():
+        print("CUDA required for spike_wan_quant_compare", file=sys.stderr)
+        sys.exit(1)
+
+    schemes = cli.schemes
+    if cli.include_int4 and "int4_wo" not in schemes:
+        schemes = f"{schemes},int4_wo"
+    cli.schemes = schemes
+
+    try:
+        variants = _build_variants(cli)
+    except ValueError as exc:
+        print(str(exc), file=sys.stderr)
+        sys.exit(2)
+
+    if cli.include_quanto:
+        try:
+            import optimum.quanto  # noqa: F401
+        except ImportError:
+            print("optimum-quanto not installed; skipping quanto variant")
+
+    results: list[dict] = [{"torchao_version": torchao.__version__}]
+    for label, quant, backend in variants:
+        if backend == "quanto":
+            try:
+                import optimum.quanto  # noqa: F401
+            except ImportError:
+                continue
+        print(f"Running {label}...")
+        try:
+            metrics = _run_variant(
+                label=label,
+                transformer_quant=quant,
+                quant_backend=backend,
+                args=cli,
+            )
+        except Exception as exc:
+            metrics = {"label": label, "error": str(exc)}
+        results.append(metrics)
+        print(json.dumps(metrics, indent=2))
+
+    cli.out.parent.mkdir(parents=True, exist_ok=True)
+    cli.out.write_text(json.dumps(results, indent=2), encoding="utf-8")
+    print(f"Wrote {cli.out}")
+
+
+if __name__ == "__main__":
+    main()
diff --git a/tools/video_comparison/combine.py b/tools/video_comparison/combine.py
index f0821a0b..967dda80 100644
--- a/tools/video_comparison/combine.py
+++ b/tools/video_comparison/combine.py
@@ -3,7 +3,7 @@
 import os
 
 import numpy as np
-from moviepy.editor import VideoFileClip, clips_array
+from moviepy import VideoFileClip, clips_array
 from PIL import Image, ImageDraw, ImageFont
 
 parser = argparse.ArgumentParser(description="Check the input directory")
@@ -37,7 +37,6 @@ def add_text_to_frame(frame, text="hi", position=(0, 0)):
 
 
 for video_index in range(num_of_videos):
-
     video_paths = []
     for method in methods:
         video_path = sorted(os.listdir(method))[video_index]
@@ -46,15 +45,15 @@ def add_text_to_frame(frame, text="hi", position=(0, 0)):
     clips = [VideoFileClip(video_path) for video_path in video_paths]
     max_fps = max([clip.fps for clip in clips])
     max_duration = max([clip.duration for clip in clips])
-    clips = [clip.set_end(max_duration).set_fps(max_fps) for clip in clips]
+    clips = [clip.with_end(max_duration).with_fps(max_fps) for clip in clips]
 
-    clips = [clip.resize(height=args.unified_height) for clip in clips]
+    clips = [clip.resized(height=args.unified_height) for clip in clips]
 
     clips_with_name = []
     for index, clip in enumerate(clips):
         method = methods[index].split("/")[-1]
         print(f"tackling {index} {method}")
-        clip = clip.fl_image(
+        clip = clip.image_transform(
             lambda frame, method=method: add_text_to_frame(
                 frame, text=method, position=(0, 0)
             )
diff --git a/tools/video_comparison/compare.sh b/tools/video_comparison/compare.sh
deleted file mode 100755
index ec561a3b..00000000
--- a/tools/video_comparison/compare.sh
+++ /dev/null
@@ -1,139 +0,0 @@
-#! /bin/bash
-
-input_dir='inputs/t2v'
-save_dir='results/compare1/'
-seed=42
-unified_visualization_height=320
-inference_methods="videocrafter2;dynamicrafter;cogvideo—t2v;cogvideo—i2v;opensora;mochi"
-
-#### check input ####
-# Check if the directory exists
-if [ ! -d "$input_dir" ]; then
-  echo "The input should be a directory, exiting..."
-  exit 1  # Exit the script with an error code
-fi
-
-# Check if the prompts.txt file exists within the directory
-if [ ! -f "${input_dir}/prompts.txt" ]; then
-  echo "The file prompts.txt does not exist in the directory, exiting..."
-  exit 1  # Exit the script with an error code
-fi
-
-# run check_input.py, will create images using t2i if there is no images in the input directory
-python tools/video_comparison/check_input.py --input_dir=$input_dir --seed=$seed
-
-
-
-################################ videocrafter2 ################################
-ckpt='checkpoints/videocrafter/t2v_v2_512_split'
-config='configs/001_videocrafter2/vc2_t2v_320x512.yaml'
-prompt_file="${input_dir}/prompts.txt"
-height=320
-width=512
-fps=28
-
-if [[ $inference_methods == *"videocrafter2"* ]]; then
-  python3 scripts/inference_new.py \
-    --ckpt_path $ckpt \
-    --config $config \
-    --prompt_file $prompt_file \
-    --savedir ${save_dir}/t2v/videocrafter2-${width}x${height}-${fps}fps \
-    --bs 1 --height 320 --width 512 \
-    --savefps ${fps} \
-    --seed ${seed}
-fi
-
-
-################################ dynamicrafter ################################
-ckpt=checkpoints/dynamicrafter/i2v_576x1024/model.ckpt
-config=configs/002_dynamicrafter/dc_i2v_1024.yaml
-prompt_dir="${input_dir}"
-height=576
-width=1024
-fps=10
-
-if [[ $inference_methods == *"dynamicrafter"* ]]; then
-  python3 scripts/inference.py \
-    --mode 'i2v' \
-    --ckpt_path $ckpt \
-    --config $config \
-    --prompt_dir $prompt_dir \
-    --savedir ${save_dir}/i2v/dynamicrafter-${width}x${height}-${fps}fps \
-    --bs 1 --height ${height} --width ${width} \
-    --fps ${fps} \
-    --seed ${seed}
-fi
-
-
-################################ cogvideo—t2v ################################
-if [[ $inference_methods == *"cogvideo—t2v"* ]]; then
-  python scripts/inference_cogVideo_diffusers.py \
-    --model_input $prompt_file \
-    --model_path checkpoints/cogvideo/CogVideoX-2b \
-    --output_path ${save_dir}/t2v/cogvideo-t2v-720x480-8fps \
-    --num_inference_steps 50 \
-    --guidance_scale 3.5 \
-    --num_videos_per_prompt 1 \
-    --dtype float16 --seed ${seed}
-fi
-
-
-################################ cogvideo—i2v ################################
-if [[ $inference_methods == *"cogvideo—i2v"* ]]; then
-  python scripts/inference_cogVideo_diffusers.py \
-      --generate_type i2v \
-      --model_input ${input_dir} \
-      --model_path checkpoints/cogvideo/CogVideoX-5b-I2V \
-      --output_path ${save_dir}/i2v/cogvideo-i2v-720x480-8fps \
-      --num_inference_steps 50 \
-      --guidance_scale 3.5 \
-      --num_videos_per_prompt 1 \
-      --dtype float16 --seed ${seed}
-fi
-
-################################ opensora ################################
-ckpt="checkpoints/open-sora/t2v_v10/OpenSora-v1-HQ-16x256x256.pth"
-config='configs/003_opensora/opensorav10_256x256.yaml'
-height=256
-width=256
-fps=8
-res_dir="${save_dir}/t2v/opensora-${width}x${height}-${fps}fps"
-
-if [[ $inference_methods == *"opensora"* ]]; then
-  python3 scripts/inference.py \
-      --seed ${seed} \
-      --mode 't2v' \
-      --ckpt_path $ckpt \
-      --config $config \
-      --savedir $res_dir \
-      --n_samples 1 \
-      --bs 1 --height ${height} --width ${width} \
-      --unconditional_guidance_scale 7.0 \
-      --ddim_steps 50 \
-      --ddim_eta 1.0 \
-      --prompt_file $prompt_file \
-      --fps ${fps} \
-      --frames 16
-fi
-
-################################ mochi ################################
-if [[ $inference_methods == *"mochi"* ]]; then
-  ckpt='genmo/mochi-1-preview'
-  prompt_file="${input_dir}/prompts.txt"
-  height=480
-  width=848
-  savedir="${save_dir}/t2v/mochi-${width}x${height}-28fps"
-
-  python3 scripts/inference_mochi.py \
-      --ckpt_path $ckpt \
-      --prompt_file $prompt_file \
-      --savedir $savedir \
-      --bs 1 --height $height --width $width \
-      --fps 28 \
-      --seed ${seed}
-fi
-
-
-
-#### combine video
-python3 tools/video_comparison/combine.py --save_dir=$save_dir --input_dir=$input_dir --unified_height=$unified_visualization_height
diff --git a/tools/video_comparison/compare_i2v.sh b/tools/video_comparison/compare_i2v.sh
deleted file mode 100644
index e773274b..00000000
--- a/tools/video_comparison/compare_i2v.sh
+++ /dev/null
@@ -1,54 +0,0 @@
-#! /bin/bash
-
-input_dir='inputs/i2v/tmp'
-save_dir='results/compare20/'
-
-#### check input ####
-# Check if the directory exists
-if [ ! -d "$input_dir" ]; then
-  echo "The input should be a directory, exiting..."
-  exit 1  # Exit the script with an error code
-fi
-
-# # Check if the prompts.txt file exists within the directory
-# if [ ! -f "${input_dir}/prompts.txt" ]; then
-#   echo "The file prompts.txt does not exist in the directory, exiting..."
-#   exit 1  # Exit the script with an error code
-# fi
-
-# # run check_input.py, will create images using t2i if there is no images in the input directory
-# python tools/video_comparison/check_input.py --input_dir=$input_dir
-
-#### run dynamicrafter ####
-ckpt=checkpoints/dynamicrafter/i2v_576x1024/model.ckpt
-config=configs/002_dynamicrafter/dc_i2v_1024.yaml
-prompt_dir="${input_dir}"
-height=576
-width=1024
-fps=10
-
-python3 scripts/inference.py \
-  --mode 'i2v' \
-  --ckpt_path $ckpt \
-  --config $config \
-  --prompt_dir $prompt_dir \
-  --savedir ${save_dir}/i2v/dynamicrafter-${height}x${width}-${fps}fps \
-  --bs 1 --height ${height} --width ${width} \
-  --fps ${fps} \
-  --seed 123
-
-
-
-#### run cogvideo—i2v
-python scripts/inference_cogVideo_diffusers.py \
-    --generate_type i2v \
-    --model_input $prompt_dir \
-    --model_path checkpoints/cogvideo/CogVideoX-5b-I2V \
-    --output_path ${save_dir}/i2v/cogvideo-i2v-720x480-8fps \
-    --num_inference_steps 50 \
-    --guidance_scale 3.5 \
-    --num_videos_per_prompt 1 \
-    --dtype float16
-
-#### combine video
-python3 tools/video_comparison/combine.py --save_dir=$save_dir --input_dir=$input_dir
diff --git a/tools/videocrafter_checkpoint_converter.py b/tools/videocrafter_checkpoint_converter.py
deleted file mode 100644
index 8571de88..00000000
--- a/tools/videocrafter_checkpoint_converter.py
+++ /dev/null
@@ -1,41 +0,0 @@
-import os
-import torch
-from collections import OrderedDict
-from videotuna.base.generation_base import Component
-
-ckpt = torch.load("checkpoints/videocrafter/t2v_v2_512/model.ckpt")
-state_dict = ckpt['state_dict']
-
-denoiser_ckpt = {'state_dict': OrderedDict()}
-first_stage_ckpt = {'state_dict': OrderedDict()}
-cond_stage_ckpt = {'state_dict': OrderedDict()}
-new_ckpt = {'state_dict': OrderedDict()}
-for k, v in state_dict.items():
-    if 'model.diffusion_model' in k:
-        key_list = k.split('.')
-        new_list = key_list[2:]
-        new_key = '.'.join(new_list)
-        denoiser_ckpt['state_dict'][new_key] = v
-        print(f'{new_key} saved to denoiser_ckpt')
-    elif 'first_stage_model' in k:
-        key_list = k.split('.')
-        new_list = key_list[1:]
-        new_key = '.'.join(new_list)
-        first_stage_ckpt['state_dict'][new_key] = v
-        print(f'{new_key} saved to first_stage_ckpt')
-    elif 'cond_stage_model' in k:
-        key_list = k.split('.')
-        new_list = key_list[1:]
-        new_key = '.'.join(new_list)
-        cond_stage_ckpt['state_dict'][new_key] = v
-        print(f'{new_key} saved to cond_stage_ckpt')
-    else:
-        new_ckpt['state_dict'][k] = v
-
-os.makedirs("checkpoints/videocrafter/t2v_v2_512_split", exist_ok=True)
-torch.save(new_ckpt, "checkpoints/videocrafter/t2v_v2_512_split/model_new.ckpt")
-torch.save(denoiser_ckpt, f"checkpoints/videocrafter/t2v_v2_512_split/{Component.DENOISER.get_component_path()}")
-torch.save(first_stage_ckpt, f"checkpoints/videocrafter/t2v_v2_512_split/{Component.FIRST_STAGE_MODEL.get_component_path()}")
-torch.save(cond_stage_ckpt, f"checkpoints/videocrafter/t2v_v2_512_split/{Component.COND_STAGE_MODEL.get_component_path()}")
-
-print('Finish!')
\ No newline at end of file
diff --git a/tools/vript_anno_converter.py b/tools/vript_anno_converter.py
index 52e049e0..1df87511 100644
--- a/tools/vript_anno_converter.py
+++ b/tools/vript_anno_converter.py
@@ -4,10 +4,12 @@
 from pathlib import Path
 
 import cv2
-import fire
 import pandas as pd
+from cyclopts import App
 from tqdm import tqdm
 
+app = App(help="Convert Vript annotation JSONL to a training CSV.")
+
 
 def read_video_meta(video_path):
     # Video fps
@@ -47,7 +49,7 @@ def get_video_data(video_root):
                     meta.update(read_video_meta(video_path))
                     video_dict[osp.splitext(clip_meta["clip_id"])[0]] = meta
 
-            except Exception as e:
+            except Exception:
                 import traceback
 
                 traceback.print_exc()
@@ -55,7 +57,8 @@ def get_video_data(video_root):
     return video_dict
 
 
-def main(input_path, output_path, video_root):
+@app.default
+def main(input_path: str, output_path: str, video_root: str):
     with open(input_path, "r") as jsonl_file:
         lines = jsonl_file.readlines()
 
@@ -74,7 +77,6 @@ def main(input_path, output_path, video_root):
         caption_data = data["caption"]
         caption = ""
         for caption_keys in caption_data.keys():
-            caption_id = caption_keys
             caption_text = caption_data[caption_keys]
             if not caption_text.endswith("."):
                 caption_text += "."
@@ -90,5 +92,7 @@ def main(input_path, output_path, video_root):
 
 
 if __name__ == "__main__":
-    # python vript_anno_converter.py --input_path {ROOT}/Vript/vript_captions/vript_short_videos_captions.jsonl --output_path ./test.csv --video_root  {ROOT}/Vript/vript_short_videos_clips
-    fire.Fire(main)
+    # Example:
+    # python vript_anno_converter.py --input-path {ROOT}/Vript/...jsonl
+    #   --output-path ./test.csv --video-root {ROOT}/Vript/...clips
+    app()
diff --git a/typings/xfuser/__init__.pyi b/typings/xfuser/__init__.pyi
new file mode 100644
index 00000000..fb7197fb
--- /dev/null
+++ b/typings/xfuser/__init__.pyi
@@ -0,0 +1 @@
+__all__: list[str]
diff --git a/eval/vbench/cli/__init__.py b/typings/xfuser/core/__init__.pyi
similarity index 100%
rename from eval/vbench/cli/__init__.py
rename to typings/xfuser/core/__init__.pyi
diff --git a/typings/xfuser/core/distributed/__init__.pyi b/typings/xfuser/core/distributed/__init__.pyi
new file mode 100644
index 00000000..82c0700c
--- /dev/null
+++ b/typings/xfuser/core/distributed/__init__.pyi
@@ -0,0 +1,8 @@
+from typing import Any
+
+def init_distributed_environment(*args: Any, **kwargs: Any) -> None: ...
+def initialize_model_parallel(*args: Any, **kwargs: Any) -> None: ...
+def get_sequence_parallel_rank(*args: Any, **kwargs: Any) -> int: ...
+def get_sequence_parallel_world_size(*args: Any, **kwargs: Any) -> int: ...
+def get_sp_group(*args: Any, **kwargs: Any) -> Any: ...
+def get_world_group(*args: Any, **kwargs: Any) -> Any: ...
diff --git a/typings/xfuser/core/distributed/parallel_state.pyi b/typings/xfuser/core/distributed/parallel_state.pyi
new file mode 100644
index 00000000..a4d1a95e
--- /dev/null
+++ b/typings/xfuser/core/distributed/parallel_state.pyi
@@ -0,0 +1,7 @@
+from typing import Any
+
+def get_tensor_model_parallel_rank(*args: Any, **kwargs: Any) -> int: ...
+def get_tensor_model_parallel_world_size(*args: Any, **kwargs: Any) -> int: ...
+def get_sequence_parallel_world_size(*args: Any, **kwargs: Any) -> int: ...
+def get_sequence_parallel_rank(*args: Any, **kwargs: Any) -> int: ...
+def get_sp_group(*args: Any, **kwargs: Any) -> Any: ...
diff --git a/typings/xfuser/core/long_ctx_attention/__init__.pyi b/typings/xfuser/core/long_ctx_attention/__init__.pyi
new file mode 100644
index 00000000..008034af
--- /dev/null
+++ b/typings/xfuser/core/long_ctx_attention/__init__.pyi
@@ -0,0 +1,4 @@
+from typing import Any
+
+class xFuserLongContextAttention:
+    def __init__(self, *args: Any, **kwargs: Any) -> None: ...
diff --git a/typings/xfuser/model_executor/models/customized/step_video_t2v/tp_applicator.pyi b/typings/xfuser/model_executor/models/customized/step_video_t2v/tp_applicator.pyi
new file mode 100644
index 00000000..04007c04
--- /dev/null
+++ b/typings/xfuser/model_executor/models/customized/step_video_t2v/tp_applicator.pyi
@@ -0,0 +1,6 @@
+from typing import Any
+
+class TensorParallelApplicator:
+    def __init__(self, *args: Any, **kwargs: Any) -> None: ...
+    def apply(self, *args: Any, **kwargs: Any) -> Any: ...
+    def apply_to_model(self, *args: Any, **kwargs: Any) -> Any: ...
diff --git a/uv.lock b/uv.lock
new file mode 100644
index 00000000..522a71c0
--- /dev/null
+++ b/uv.lock
@@ -0,0 +1,2665 @@
+version = 1
+revision = 3
+requires-python = ">=3.11"
+resolution-markers = [
+    "python_full_version >= '3.15' and sys_platform == 'linux'",
+    "python_full_version >= '3.13' and python_full_version < '3.15' and sys_platform == 'linux'",
+    "python_full_version == '3.12.*' and sys_platform == 'linux'",
+    "python_full_version >= '3.15' and sys_platform != 'linux'",
+    "python_full_version >= '3.13' and python_full_version < '3.15' and sys_platform != 'linux'",
+    "python_full_version == '3.12.*' and sys_platform != 'linux'",
+    "python_full_version < '3.12' and sys_platform == 'linux'",
+    "python_full_version < '3.12' and sys_platform != 'linux'",
+]
+
+[manifest]
+
+[manifest.dependency-groups]
+dev = [
+    { name = "coverage", specifier = ">=7.6.1" },
+    { name = "mypy", specifier = ">=1.11.2" },
+    { name = "pre-commit", specifier = ">=4.1.0" },
+    { name = "pudb", specifier = "==2024.1.2" },
+    { name = "pytest", specifier = ">=9.1,<10" },
+    { name = "ruff", specifier = ">=0.6.8" },
+    { name = "types-colorama" },
+    { name = "types-psutil" },
+    { name = "types-tqdm" },
+]
+training = [
+    { name = "deepspeed", specifier = "==0.19.2" },
+    { name = "pandas", specifier = "==2.2.2" },
+    { name = "pytorch-lightning", specifier = "==2.4.0" },
+    { name = "scikit-learn", specifier = ">=1.6.1" },
+    { name = "tensorboard", specifier = ">=2.19.0" },
+]
+
+[[package]]
+name = "absl-py"
+version = "2.4.0"
+source = { registry = "https://pypi.org/simple" }
+sdist = { url = "https://files.pythonhosted.org/packages/64/c7/8de93764ad66968d19329a7e0c147a2bb3c7054c554d4a119111b8f9440f/absl_py-2.4.0.tar.gz", hash = "sha256:8c6af82722b35cf71e0f4d1d47dcaebfff286e27110a99fc359349b247dfb5d4", size = 116543, upload-time = "2026-01-28T10:17:05.322Z" }
+wheels = [
+    { url = "https://files.pythonhosted.org/packages/18/a6/907a406bb7d359e6a63f99c313846d9eec4f7e6f7437809e03aa00fa3074/absl_py-2.4.0-py3-none-any.whl", hash = "sha256:88476fd881ca8aab94ffa78b7b6c632a782ab3ba1cd19c9bd423abc4fb4cd28d", size = 135750, upload-time = "2026-01-28T10:17:04.19Z" },
+]
+
+[[package]]
+name = "aiohappyeyeballs"
+version = "2.6.2"
+source = { registry = "https://pypi.org/simple" }
+sdist = { url = "https://files.pythonhosted.org/packages/33/c6/61a2d7b7572279226bb2e7f61d7a19ca7c90da0329c93fa0d560cbf288d8/aiohappyeyeballs-2.6.2.tar.gz", hash = "sha256:e202810ee718bd01fc6ef49e8ea53d023d5cb6b581076d7925aa499fa55dbe64", size = 22591, upload-time = "2026-05-20T15:12:24.631Z" }
+wheels = [
+    { url = "https://files.pythonhosted.org/packages/5f/fc/a7bf5b6e4e617b45f90f2d9d2a68519c249c81dd4fc2658c7a2a61c4f4b7/aiohappyeyeballs-2.6.2-py3-none-any.whl", hash = "sha256:4708045e2d7a6c6bdf8aafa8ed39649eaf926a4543b54560659129e3365953c4", size = 15062, upload-time = "2026-05-20T15:12:23.328Z" },
+]
+
+[[package]]
+name = "aiohttp"
+version = "3.14.1"
+source = { registry = "https://pypi.org/simple" }
+dependencies = [
+    { name = "aiohappyeyeballs" },
+    { name = "aiosignal" },
+    { name = "attrs" },
+    { name = "frozenlist" },
+    { name = "multidict" },
+    { name = "propcache" },
+    { name = "typing-extensions", marker = "python_full_version < '3.13'" },
+    { name = "yarl" },
+]
+sdist = { url = "https://files.pythonhosted.org/packages/82/78/8ea7308cac6934de8c74a14f3d5f65d1c89287426688be79538d0e5c013d/aiohttp-3.14.1.tar.gz", hash = "sha256:307f2cff90a764d329e77040603fa032db89c5c24fdad50c4c15334cba744035", size = 7955794, upload-time = "2026-06-07T21:09:35.529Z" }
+wheels = [
+    { url = "https://files.pythonhosted.org/packages/26/dd/bf526e6f0a1120dd6f2df2e97bacfe4d358f13d17a0ff5847301a1375a51/aiohttp-3.14.1-cp311-cp311-macosx_10_9_universal2.whl", hash = "sha256:aa00140699487bd435fde4342d85c94cb256b7cd3a5b9c3396c67f19922afda2", size = 765225, upload-time = "2026-06-07T21:06:07.957Z" },
+    { url = "https://files.pythonhosted.org/packages/8f/e1/a2872aa55495a70f61310d411541c6ee23812d9a884e000c716e1bc3edbf/aiohttp-3.14.1-cp311-cp311-macosx_10_9_x86_64.whl", hash = "sha256:1c1af67559445498b502030c35c59db59966f47041ca9de5b4e707f86bd10b5f", size = 518743, upload-time = "2026-06-07T21:06:09.749Z" },
+    { url = "https://files.pythonhosted.org/packages/5b/e7/c60c7b209e509cc787de3cea0550a518538cfc08003e1c1e14c1c63fff71/aiohttp-3.14.1-cp311-cp311-macosx_11_0_arm64.whl", hash = "sha256:d44ec478e713ee7f29b439f7eb8dc2b9d4079e11ae114d2c2ac3d5daf30516c8", size = 514139, upload-time = "2026-06-07T21:06:11.26Z" },
+    { url = "https://files.pythonhosted.org/packages/5b/8d/614ace2f579702c9840ab1e1447fd8509e35b0b904f7196418fa2f57b25d/aiohttp-3.14.1-cp311-cp311-manylinux2014_aarch64.manylinux_2_17_aarch64.manylinux_2_28_aarch64.whl", hash = "sha256:d3b1a184a9a8f548a6b73f1e26b96b052193e4b3175ed7342aaf1151a1f00a04", size = 1784088, upload-time = "2026-06-07T21:06:12.887Z" },
+    { url = "https://files.pythonhosted.org/packages/49/e0/726e90f99542bf292f81a96a12cc4847deb86f3ccf62c6f4014a201f4d33/aiohttp-3.14.1-cp311-cp311-manylinux2014_armv7l.manylinux_2_17_armv7l.manylinux_2_31_armv7l.whl", hash = "sha256:5f2504bc0322437c9a1ff6d3333ca56c7477b727c995f036b976ae17b98372c8", size = 1737835, upload-time = "2026-06-07T21:06:14.564Z" },
+    { url = "https://files.pythonhosted.org/packages/0b/4b/d176d5c4db9d33dacf0543102ea59503bc1d528af4cfd0b719949ca49389/aiohttp-3.14.1-cp311-cp311-manylinux2014_ppc64le.manylinux_2_17_ppc64le.manylinux_2_28_ppc64le.whl", hash = "sha256:73f05ea02013e02512c3bf42714f1208c57168c779cc6fe23516e4543089d0a6", size = 1842801, upload-time = "2026-06-07T21:06:16.228Z" },
+    { url = "https://files.pythonhosted.org/packages/dc/d6/5a99b563690ea0cbed912ae94a2ce33993a5709a651a3a4fe761e7dd973a/aiohttp-3.14.1-cp311-cp311-manylinux2014_s390x.manylinux_2_17_s390x.manylinux_2_28_s390x.whl", hash = "sha256:797457503c2d426bee06eef808d07b31ede30b65e054444e7de64cad0061b7af", size = 1929992, upload-time = "2026-06-07T21:06:17.947Z" },
+    { url = "https://files.pythonhosted.org/packages/76/7f/a987b14a3859094b3cea3f4825219c3e5536242564af6e3f9c2f6c994eb2/aiohttp-3.14.1-cp311-cp311-manylinux2014_x86_64.manylinux_2_17_x86_64.manylinux_2_28_x86_64.whl", hash = "sha256:b821a1f7dedf7e37450654e620038ac3b2e81e8fa6ea269337e97101978ec730", size = 1786989, upload-time = "2026-06-07T21:06:19.677Z" },
+    { url = "https://files.pythonhosted.org/packages/f1/1a/420e5c85a3e73349372ed22ce0b6af86bfa6ce16a4b20a64a2e94608c781/aiohttp-3.14.1-cp311-cp311-manylinux_2_31_riscv64.manylinux_2_39_riscv64.whl", hash = "sha256:4cd96b5ba05d67ed0cf00b5b405c8cd99586d8e3481e8ee0a831057591af7621", size = 1640129, upload-time = "2026-06-07T21:06:22.558Z" },
+    { url = "https://files.pythonhosted.org/packages/a7/80/18a592ed3be0a402cc03670bd72ee1f8563ddbe1d8d5542dbf868f274136/aiohttp-3.14.1-cp311-cp311-musllinux_1_2_aarch64.whl", hash = "sha256:1d459b98a932296c6f0e94f87511a0b1b90a8a02c30a50e60a297619cd5a58ee", size = 1756576, upload-time = "2026-06-07T21:06:24.8Z" },
+    { url = "https://files.pythonhosted.org/packages/ec/0b/8b3d5713373858ff71a617daf6e3b0e81ad63e79d09a3cf2f6b6b983939c/aiohttp-3.14.1-cp311-cp311-musllinux_1_2_armv7l.whl", hash = "sha256:764457a7be60825fb770a644852ff717bcbb5042f189f2bd16df61a81b3f6573", size = 1754668, upload-time = "2026-06-07T21:06:26.528Z" },
+    { url = "https://files.pythonhosted.org/packages/9f/49/fd564575cf225821d7ba5a117cb8bc27213d8a7e1811162afb43ae077039/aiohttp-3.14.1-cp311-cp311-musllinux_1_2_ppc64le.whl", hash = "sha256:f7a16ef45b081454ef844502d87a848876c490c4cb5c650c230f6ec79ed2c1e7", size = 1817019, upload-time = "2026-06-07T21:06:28.297Z" },
+    { url = "https://files.pythonhosted.org/packages/ed/1b/e850c9ae6fc91356552ae668bb6c51e93fa29c8aef13398a10b56678557f/aiohttp-3.14.1-cp311-cp311-musllinux_1_2_riscv64.whl", hash = "sha256:2fbc3ed048b3475b9f0cbcb9978e9d2d3511acd91ead203af26ed9f0056004cf", size = 1631638, upload-time = "2026-06-07T21:06:30.242Z" },
+    { url = "https://files.pythonhosted.org/packages/eb/94/3c337ba72451a89806ace6f75bddc92bafc5b8d53d90115a512858024b63/aiohttp-3.14.1-cp311-cp311-musllinux_1_2_s390x.whl", hash = "sha256:bedb0cd073cc2dc035e30aeb99444389d3cd2113afe4ef9fcd23d439f5bade85", size = 1835660, upload-time = "2026-06-07T21:06:31.943Z" },
+    { url = "https://files.pythonhosted.org/packages/2b/9c/9c18cf367a0498212d9ba7daf990b504a5e8ae064cda4b504e2647c89c03/aiohttp-3.14.1-cp311-cp311-musllinux_1_2_x86_64.whl", hash = "sha256:b6feea921016eb3d4e04d65fc4e9ca402d1a3801f562aef94989f54694917af3", size = 1775698, upload-time = "2026-06-07T21:06:33.72Z" },
+    { url = "https://files.pythonhosted.org/packages/b5/63/a251a9d2a6cb45065b2ddc0bde2b3dd10108740a9a42f632c66405a761a2/aiohttp-3.14.1-cp311-cp311-win32.whl", hash = "sha256:313701e488100074ce99850404ee36e741abf6330179fec908a1944ecf570126", size = 458386, upload-time = "2026-06-07T21:06:35.279Z" },
+    { url = "https://files.pythonhosted.org/packages/17/ca/69274c51dcd6e8947d77b2806cf47a4a15f2c846e2cbeb1882547d3da283/aiohttp-3.14.1-cp311-cp311-win_amd64.whl", hash = "sha256:03ab4530fdcb3a543a122ba4b65ac9919da9fe9f78a03d328a6e38ff962f7aa5", size = 483406, upload-time = "2026-06-07T21:06:36.824Z" },
+    { url = "https://files.pythonhosted.org/packages/2c/8a/c25904f77690c3688ec140f87591ef11a0cfe36bf3d5c0f1f38056fb62b3/aiohttp-3.14.1-cp311-cp311-win_arm64.whl", hash = "sha256:486f7d16ed54c39c2cbd7ca71fd8ba2b8bb7860df65bd7b6ed640bab96a38a8b", size = 452987, upload-time = "2026-06-07T21:06:38.371Z" },
+    { url = "https://files.pythonhosted.org/packages/1d/21/151624b51cd92553d95424daf4bf19f19ce9be9002d19253e7e7ce67197b/aiohttp-3.14.1-cp312-cp312-macosx_10_13_universal2.whl", hash = "sha256:d35143e27778b4bb0fb189562d7f275bff79c62ab8e98459717c0ea617ff2480", size = 757402, upload-time = "2026-06-07T21:06:40.311Z" },
+    { url = "https://files.pythonhosted.org/packages/c2/82/280619e0bd7bf2454987e19282616e84762255dd9c8468f62382e8c191f1/aiohttp-3.14.1-cp312-cp312-macosx_10_13_x86_64.whl", hash = "sha256:bcfb80a2cc36fba2534e5e5b5264dc7ae6fcd9bf15256da3e53d2f499e6fa29d", size = 512310, upload-time = "2026-06-07T21:06:42.207Z" },
+    { url = "https://files.pythonhosted.org/packages/55/b2/2aac325583aaa1353045f96dffa586d8a34e8322e14a7ba49cffeb103ab4/aiohttp-3.14.1-cp312-cp312-macosx_11_0_arm64.whl", hash = "sha256:27fd7c91e51729b4f7e1577865fa6d34c9adccbc39aabe9000285b48af9f0ec2", size = 512448, upload-time = "2026-06-07T21:06:43.813Z" },
+    { url = "https://files.pythonhosted.org/packages/8a/72/a60607cb849faa8af8a356c9329ea2eb6f395d49e82cc82ccba1fd8deb8f/aiohttp-3.14.1-cp312-cp312-manylinux2014_aarch64.manylinux_2_17_aarch64.manylinux_2_28_aarch64.whl", hash = "sha256:64c567bf9eaf664280116a8688f63016e6b32db2505908e2bdaca1b6438142f2", size = 1766854, upload-time = "2026-06-07T21:06:45.391Z" },
+    { url = "https://files.pythonhosted.org/packages/b5/d3/d9fe1c9ec7557ab4d0d82bebaa728c6418f0b93295ec2f4ab015f7710cc7/aiohttp-3.14.1-cp312-cp312-manylinux2014_armv7l.manylinux_2_17_armv7l.manylinux_2_31_armv7l.whl", hash = "sha256:f5e6ff2bdbb8f4cd3fbe41f99e25bbcd58e3bf9f13d3dd31a11e7917251cc77a", size = 1740884, upload-time = "2026-06-07T21:06:47.413Z" },
+    { url = "https://files.pythonhosted.org/packages/c1/dc/f2cecfaf9337ba3e63f181500814ff502aa3d00d9c7ec93a9d23d10a27b2/aiohttp-3.14.1-cp312-cp312-manylinux2014_ppc64le.manylinux_2_17_ppc64le.manylinux_2_28_ppc64le.whl", hash = "sha256:2f73e01dc37122325caf079982621262f96d74823c179038a82fddfc50359264", size = 1810034, upload-time = "2026-06-07T21:06:50.165Z" },
+    { url = "https://files.pythonhosted.org/packages/66/d7/2ff65c5e65c0d7476daf7e15c032e0805e36811185b9623e3238ad6c763e/aiohttp-3.14.1-cp312-cp312-manylinux2014_s390x.manylinux_2_17_s390x.manylinux_2_28_s390x.whl", hash = "sha256:bb2c0c80d431c0d03f2c7dbf125150fedd4f0de17366a7ca33f7ccb822391842", size = 1904054, upload-time = "2026-06-07T21:06:52.035Z" },
+    { url = "https://files.pythonhosted.org/packages/20/9c/d445818389df371f56d141d881153ba23183c4735a03f7356ffb43f7757d/aiohttp-3.14.1-cp312-cp312-manylinux2014_x86_64.manylinux_2_17_x86_64.manylinux_2_28_x86_64.whl", hash = "sha256:3e6fc1a85fa7194a1a7d19f44e8609180f4a8eb5fa4c7ed8b4355f080fad235c", size = 1790278, upload-time = "2026-06-07T21:06:54.049Z" },
+    { url = "https://files.pythonhosted.org/packages/4d/aa/bf04cb4d865fc6101c2229a294ad744973b72e513fdc5a6b791e6983d72a/aiohttp-3.14.1-cp312-cp312-manylinux_2_31_riscv64.manylinux_2_39_riscv64.whl", hash = "sha256:686b6c0d3911ec387b444ddf5dc62fb7f7c0a7d5186a7861626496a5ab4aff95", size = 1591795, upload-time = "2026-06-07T21:06:55.911Z" },
+    { url = "https://files.pythonhosted.org/packages/dc/b4/4dac0038960427ba832f6609dfb4ea5437d7fd80c72001b9e48f834f428b/aiohttp-3.14.1-cp312-cp312-musllinux_1_2_aarch64.whl", hash = "sha256:c6fa4dc7ad6f8109c70bb1499e589f76b0b792baf39f9b017eb92c8a81d0a199", size = 1728397, upload-time = "2026-06-07T21:06:57.777Z" },
+    { url = "https://files.pythonhosted.org/packages/2b/f9/7cd4e8ad7aa3b75f17d56bb5498dd604a93d4e6eece822ba0568c413fff0/aiohttp-3.14.1-cp312-cp312-musllinux_1_2_armv7l.whl", hash = "sha256:87a5eea1b2a5e21e1ebdbb33ad4165359189327e63fc4e4894693e7f821ac817", size = 1766504, upload-time = "2026-06-07T21:07:00.009Z" },
+    { url = "https://files.pythonhosted.org/packages/f9/df/fc01d9fcad0f73fed3f3d361f1f94f975947b50dff82919f6dc2bf4316cc/aiohttp-3.14.1-cp312-cp312-musllinux_1_2_ppc64le.whl", hash = "sha256:1c1421eb01d4fd608d88cc8290211d177a58532b55ad94076fb349c5bf467f0a", size = 1777806, upload-time = "2026-06-07T21:07:02.064Z" },
+    { url = "https://files.pythonhosted.org/packages/41/09/47e2d090bddcc8fb4ccb4c314aadc32d7c5d9bb55f50f6ad1c92fc15d501/aiohttp-3.14.1-cp312-cp312-musllinux_1_2_riscv64.whl", hash = "sha256:34b257ec41345c1e8f2df68fa908a7952f5de932723871eb633ecbbff396c9a4", size = 1580707, upload-time = "2026-06-07T21:07:03.942Z" },
+    { url = "https://files.pythonhosted.org/packages/3d/36/f1a4ce904ae0b6930cfe9afc96d0896f7ec1a620c400405d63783bb95a9c/aiohttp-3.14.1-cp312-cp312-musllinux_1_2_s390x.whl", hash = "sha256:de538791a80e5d862addbc183f70f0158ac9b9bb872bb147f1fd2a683691e087", size = 1798121, upload-time = "2026-06-07T21:07:05.987Z" },
+    { url = "https://files.pythonhosted.org/packages/70/0a/e0075ce9ca0279ee1d4f0c0b85f54fea02ebc83c3007651a72bece658fec/aiohttp-3.14.1-cp312-cp312-musllinux_1_2_x86_64.whl", hash = "sha256:6f71173be42d3241d428f760122febb748de0623f44308a6f120d0dd9ec572e3", size = 1767580, upload-time = "2026-06-07T21:07:07.873Z" },
+    { url = "https://files.pythonhosted.org/packages/3e/61/a0c0a8f327a9c52095cdd8e312391b00d3ed64ab6c72bb5c33d8ec251cf7/aiohttp-3.14.1-cp312-cp312-win32.whl", hash = "sha256:ec8dc383ee57ea3e883477dcca3f11b65d58199f1080acaf4cd6ad9a99698be4", size = 452771, upload-time = "2026-06-07T21:07:09.669Z" },
+    { url = "https://files.pythonhosted.org/packages/df/d9/ea367c75f16ac9c6cdc8febb25e8318fa21a2b1bc8d6514d4b2d890bface/aiohttp-3.14.1-cp312-cp312-win_amd64.whl", hash = "sha256:2aa92c87868cd13674989f9ee83e5f9f7ea4237589b728048e1f0c8f6caa3271", size = 479873, upload-time = "2026-06-07T21:07:11.538Z" },
+    { url = "https://files.pythonhosted.org/packages/03/64/8d96784a7851156db8a4c6c3f6f91042fdf39fb15a4cc38c8b3c14833c45/aiohttp-3.14.1-cp312-cp312-win_arm64.whl", hash = "sha256:2c840c90759922cb5e6dda94596e079a30fb5a5ba548e7e0dc00574703940847", size = 448073, upload-time = "2026-06-07T21:07:13.637Z" },
+    { url = "https://files.pythonhosted.org/packages/bc/97/bd137012dd97e1649162b099135a80e1fd59aaa807b2430fc448d1029aff/aiohttp-3.14.1-cp313-cp313-android_21_arm64_v8a.whl", hash = "sha256:b3a03285a7f9c7b016324574a6d92a1c895da6b978cb8f1deee3ac72bc6da178", size = 506882, upload-time = "2026-06-07T21:07:15.501Z" },
+    { url = "https://files.pythonhosted.org/packages/ef/79/e5cc690e9d922a66887ceeaca53a8ffd5a7b0be3816142b7abc433742d89/aiohttp-3.14.1-cp313-cp313-android_21_x86_64.whl", hash = "sha256:2a73f487ab8ef5abbb24b7aa9b73e98eaba9e9e031804ff2416f02eca315ccaf", size = 515270, upload-time = "2026-06-07T21:07:17.53Z" },
+    { url = "https://files.pythonhosted.org/packages/fe/22/a73ccbf9dbd6e26dda0b24d5fd5db7da92ee3383a79f47677ffb834c5c5b/aiohttp-3.14.1-cp313-cp313-ios_13_0_arm64_iphoneos.whl", hash = "sha256:915fbb7b41b115192259f8c9ae58f3ddc444d2b5579917270211858e606a4afd", size = 485841, upload-time = "2026-06-07T21:07:19.555Z" },
+    { url = "https://files.pythonhosted.org/packages/3b/b9/57ed8eaf596321c2ad747bd480fb1700dbd7177c60dfc9e4c187f629662e/aiohttp-3.14.1-cp313-cp313-ios_13_0_arm64_iphonesimulator.whl", hash = "sha256:7fb4bdf95b0561a79f259f9d28fbc109728c5ee7f27aff6391f0ca703a329abe", size = 492088, upload-time = "2026-06-07T21:07:21.581Z" },
+    { url = "https://files.pythonhosted.org/packages/78/c0/5ebe5270a7c140d7c6f79dcb018640225f14d406c149e4eec04a7d82fe71/aiohttp-3.14.1-cp313-cp313-ios_13_0_x86_64_iphonesimulator.whl", hash = "sha256:1b9748363260121d2927704f5d4fc498150669ca3ae93625986ee89c8f80dcd4", size = 501564, upload-time = "2026-06-07T21:07:23.388Z" },
+    { url = "https://files.pythonhosted.org/packages/75/7f/8cdaa24fc7983865e0915153b96a9ac5bcdd3548d64c5a27d17cecccad2d/aiohttp-3.14.1-cp313-cp313-macosx_10_13_universal2.whl", hash = "sha256:86a6dab78b0e43e2897a3bbe15745aa60dc5423ca437b7b0b164c069bf91b876", size = 751998, upload-time = "2026-06-07T21:07:25.046Z" },
+    { url = "https://files.pythonhosted.org/packages/b2/f4/c4227aacfacc5cb0cc2d119b65301d177912a6842cd64e120c47af76064f/aiohttp-3.14.1-cp313-cp313-macosx_10_13_x86_64.whl", hash = "sha256:4dfd6e47d3c44c2279907607f73a4240b88c69eb8b90da7e2441a8045dfd21da", size = 510918, upload-time = "2026-06-07T21:07:27.28Z" },
+    { url = "https://files.pythonhosted.org/packages/ab/01/a2d5f96cd4e74424864d30bc0a7e44d0a12dacdcfa91b5b2d1bd3dca6bf3/aiohttp-3.14.1-cp313-cp313-macosx_11_0_arm64.whl", hash = "sha256:317acd9f8602858dc7d59679812c376c7f0b97bcbbf16e0d6237f54141d8a8a6", size = 508657, upload-time = "2026-06-07T21:07:29.252Z" },
+    { url = "https://files.pythonhosted.org/packages/e8/ed/3c0fb5c500fdd8e7ebc10d1889c04384fffa1a9163eac1356088ca9da1b1/aiohttp-3.14.1-cp313-cp313-manylinux2014_aarch64.manylinux_2_17_aarch64.manylinux_2_28_aarch64.whl", hash = "sha256:bd869c427324e5cb15195793de951295710db28be7d818247f3097b4ab5d4b96", size = 1757907, upload-time = "2026-06-07T21:07:31.03Z" },
+    { url = "https://files.pythonhosted.org/packages/0b/ab/d4c924d9bd5be3050c226612413ce68cb54c70d2c31b661bfc8d9a5b6a70/aiohttp-3.14.1-cp313-cp313-manylinux2014_armv7l.manylinux_2_17_armv7l.manylinux_2_31_armv7l.whl", hash = "sha256:93b032b5ec3255473c143627d21a69ac74ae12f7f33974cb587c564d11b1066f", size = 1737565, upload-time = "2026-06-07T21:07:33.031Z" },
+    { url = "https://files.pythonhosted.org/packages/19/2a/37326821ff779084020cdc33224d20b19f42f4183a500ff92022a739eda7/aiohttp-3.14.1-cp313-cp313-manylinux2014_ppc64le.manylinux_2_17_ppc64le.manylinux_2_28_ppc64le.whl", hash = "sha256:f234b4deb12f3ad59127e037bc57c40c21e45b45282df7d3a55a0f409f595296", size = 1799018, upload-time = "2026-06-07T21:07:35.003Z" },
+    { url = "https://files.pythonhosted.org/packages/b3/4f/6e947ba73e4ce09070761c05ed3a8ceb7c21f5e46798671d8b2aac0e4626/aiohttp-3.14.1-cp313-cp313-manylinux2014_s390x.manylinux_2_17_s390x.manylinux_2_28_s390x.whl", hash = "sha256:9af6779bfb46abf124068327abcdf9ce95c9ef8287a3e8da76ccf2d0f16c28fa", size = 1894416, upload-time = "2026-06-07T21:07:36.956Z" },
+    { url = "https://files.pythonhosted.org/packages/9d/6e/dbf1d0625dc711fb2851f4f3c3055c39ed58bae92082d8c627dbe6013736/aiohttp-3.14.1-cp313-cp313-manylinux2014_x86_64.manylinux_2_17_x86_64.manylinux_2_28_x86_64.whl", hash = "sha256:faccab372e66bc76d5731525e7f1143c922271725b9d38c9f97edcc66266b451", size = 1783881, upload-time = "2026-06-07T21:07:39.063Z" },
+    { url = "https://files.pythonhosted.org/packages/44/c2/5e25098a67268ed369483ae7d1a58bd0a13d03aab860d2a0e4a6eb25b046/aiohttp-3.14.1-cp313-cp313-manylinux_2_31_riscv64.manylinux_2_39_riscv64.whl", hash = "sha256:f380468b09d2a81633ee863b0ec5648d364bd17bb8ecfb8c2f387f7ac1faf42c", size = 1587572, upload-time = "2026-06-07T21:07:41.058Z" },
+    { url = "https://files.pythonhosted.org/packages/2a/bd/cf9cee17e140f942a3de73e658a543aa8fbf35a5fc67a9d2538d52d77f0b/aiohttp-3.14.1-cp313-cp313-musllinux_1_2_aarch64.whl", hash = "sha256:97e704dcd26271f5bda3fa07c3ce0fb76d6d3f8659f4baa1a24442cc9ba177ca", size = 1722137, upload-time = "2026-06-07T21:07:43.014Z" },
+    { url = "https://files.pythonhosted.org/packages/89/6d/5684f8c59045c96f81a18cefbc1fbbd79d25b88f1c622f2a5c5c08fcb632/aiohttp-3.14.1-cp313-cp313-musllinux_1_2_armv7l.whl", hash = "sha256:269b76ac5394092b95bc4a098f4fc6c191c083c3bd12775d1e30e663132f6a09", size = 1755953, upload-time = "2026-06-07T21:07:45.933Z" },
+    { url = "https://files.pythonhosted.org/packages/a8/40/35caf3170f8359760740a7d9aa0fff2e344bef98e1d1186f5a0f6dec17e6/aiohttp-3.14.1-cp313-cp313-musllinux_1_2_ppc64le.whl", hash = "sha256:5c0b3e614340c889d575451696374c9d17affd54cd607ca0babed8f8c37b9397", size = 1766479, upload-time = "2026-06-07T21:07:48.047Z" },
+    { url = "https://files.pythonhosted.org/packages/6d/a1/b0c61e7a137f0d81de49a82023a6df73c3c16d6fefb0f8e4a93d21639002/aiohttp-3.14.1-cp313-cp313-musllinux_1_2_riscv64.whl", hash = "sha256:5663ee9257cfa1add7253a7da3035a02f31b6600ec48261585e1800a81533080", size = 1580077, upload-time = "2026-06-07T21:07:50.069Z" },
+    { url = "https://files.pythonhosted.org/packages/0b/41/194ea4623693009fcefebef7aef63c141754f153e9cd0d39d3b9e36c175c/aiohttp-3.14.1-cp313-cp313-musllinux_1_2_s390x.whl", hash = "sha256:603a2c834142172ffddc054067f5ec0ca65d57a0aa98a71bc81952573208e345", size = 1791688, upload-time = "2026-06-07T21:07:52.106Z" },
+    { url = "https://files.pythonhosted.org/packages/ba/45/4de841f005cfe1fd63e2a2fe011262c515e2a62aa6994b15947e7d717ac9/aiohttp-3.14.1-cp313-cp313-musllinux_1_2_x86_64.whl", hash = "sha256:cb21957bb8aca671c1765e32f58164cf0c50e6bf41c0bbbd16da20732ecaf588", size = 1761094, upload-time = "2026-06-07T21:07:54.113Z" },
+    { url = "https://files.pythonhosted.org/packages/e4/ae/dbce10533d3896d544d5053939ed75b7dc31a1b0973d959b1b5ae21028d6/aiohttp-3.14.1-cp313-cp313-win32.whl", hash = "sha256:e509a55f681e6158c20f70f102f9cf61fb20fbc382272bc6d94b7343f2582780", size = 452662, upload-time = "2026-06-07T21:07:56.06Z" },
+    { url = "https://files.pythonhosted.org/packages/7b/d9/0bf1a19362c32f06229da5e7ddfcec91f93474d6307f7a2d3135e9c674dc/aiohttp-3.14.1-cp313-cp313-win_amd64.whl", hash = "sha256:1ac8531b638959718e18c2207fbfe297819875da46a740b29dfa29beba64355a", size = 479748, upload-time = "2026-06-07T21:07:58.319Z" },
+    { url = "https://files.pythonhosted.org/packages/22/0a/62e7232dc9484fbec112ceb32efb6a624cc7994ec6e2b019286f17c4e8f2/aiohttp-3.14.1-cp313-cp313-win_arm64.whl", hash = "sha256:250d14af67f6b6a1a4a811049b1afa69d61d617fca6bf33149b3ab1a6dbcf7b8", size = 447723, upload-time = "2026-06-07T21:08:00.154Z" },
+    { url = "https://files.pythonhosted.org/packages/c4/a1/5fafa04e1ca91ddb47608699d60649c1c6db3cf41c99e78fc4056f9513db/aiohttp-3.14.1-cp314-cp314-android_24_arm64_v8a.whl", hash = "sha256:7c106c26852ca1c2047c6b80384f17100b4e439af276f21ef3d4e2f450ae7e15", size = 508531, upload-time = "2026-06-07T21:08:02.093Z" },
+    { url = "https://files.pythonhosted.org/packages/fa/2e/bfa02f699d87ffc86d5959270b28f1cb410add3ccaced8ed2e0b8a5238fc/aiohttp-3.14.1-cp314-cp314-android_24_x86_64.whl", hash = "sha256:20205f7f5ade7aaec9f4b500549bbc071b046453aed72f9c06dcab87896a83e8", size = 514718, upload-time = "2026-06-07T21:08:04.476Z" },
+    { url = "https://files.pythonhosted.org/packages/85/a5/9594ad6289eebbc97d167c44213d557807f90e59115caad24de21ad2c3b1/aiohttp-3.14.1-cp314-cp314-ios_13_0_arm64_iphoneos.whl", hash = "sha256:62a759436b29e677181a9e76bab8b8f689a29cb9c535f45f7c48c9c830d3f8c3", size = 487918, upload-time = "2026-06-07T21:08:06.377Z" },
+    { url = "https://files.pythonhosted.org/packages/b4/61/16a32c36c3c49edec122a3dc811f2057df2f94d3b14aa107c8017d981618/aiohttp-3.14.1-cp314-cp314-ios_13_0_arm64_iphonesimulator.whl", hash = "sha256:2964cbf553df4d7a57348da44d961d871895fc1ee4e8c322b2a95612c7b17fba", size = 494014, upload-time = "2026-06-07T21:08:08.263Z" },
+    { url = "https://files.pythonhosted.org/packages/9b/89/3ebcf96ed99c05bec9c434aaac6963fd3cbab4a786ae739908a144d9ce44/aiohttp-3.14.1-cp314-cp314-ios_13_0_x86_64_iphonesimulator.whl", hash = "sha256:237651caadc3a59badd39319c54642b5299e9cc98a3a194310e55d5bb9f5e397", size = 502398, upload-time = "2026-06-07T21:08:10.244Z" },
+    { url = "https://files.pythonhosted.org/packages/fd/3d/b74870a0c2d40c355928cd5b96c7a11fa821b8a40fc41365e64479b151fb/aiohttp-3.14.1-cp314-cp314-macosx_10_15_universal2.whl", hash = "sha256:896e12dfdbbab9d8f7e16d2b28c6769a60126fa92095d1ebf9473d02593a2448", size = 758018, upload-time = "2026-06-07T21:08:12.447Z" },
+    { url = "https://files.pythonhosted.org/packages/d3/66/f42f5c984d99e49c6cff5f26f590750f2e2f7ef1fcfb99966ab5be1b632e/aiohttp-3.14.1-cp314-cp314-macosx_10_15_x86_64.whl", hash = "sha256:d03f281ed22579314ba00821ce20115a7c0ac430660b4cc05704a3f818b3e004", size = 512462, upload-time = "2026-06-07T21:08:14.624Z" },
+    { url = "https://files.pythonhosted.org/packages/e9/a7/248e1aebe0c7810b0271e021a0f2a5eb6e78a051885b3c9df49f42a5802d/aiohttp-3.14.1-cp314-cp314-macosx_11_0_arm64.whl", hash = "sha256:07eabb979d236335fed927e137a928c9adfb7df3b9ec7aa31726f133a62be983", size = 512824, upload-time = "2026-06-07T21:08:16.572Z" },
+    { url = "https://files.pythonhosted.org/packages/26/97/2aa0e5ba0727dc3bd5aaebb7ccbc510f7dfb7fb961ec87497cd496635ab1/aiohttp-3.14.1-cp314-cp314-manylinux2014_aarch64.manylinux_2_17_aarch64.manylinux_2_28_aarch64.whl", hash = "sha256:4fe1f1087cbadb280b5e1bb054a4f00d1423c74d6626c5e48400d871d34ecefe", size = 1749898, upload-time = "2026-06-07T21:08:18.635Z" },
+    { url = "https://files.pythonhosted.org/packages/00/8d/e97f6c96c891d457c8479d92a514ba194d0412f981d72c70341ee18488ed/aiohttp-3.14.1-cp314-cp314-manylinux2014_armv7l.manylinux_2_17_armv7l.manylinux_2_31_armv7l.whl", hash = "sha256:367a9314fdc79dab0fac96e216cb41dd73c85bdca85306ce8999118ba7e0f333", size = 1710114, upload-time = "2026-06-07T21:08:20.892Z" },
+    { url = "https://files.pythonhosted.org/packages/6f/e6/aa8d7e863048c8fceb5cd6ce74017311cec3ead07847387e12265fb4444e/aiohttp-3.14.1-cp314-cp314-manylinux2014_ppc64le.manylinux_2_17_ppc64le.manylinux_2_28_ppc64le.whl", hash = "sha256:a24f677ebe83749039e7bdf862ff0bbb16818ae4193d4ef96505e269375bcce0", size = 1802541, upload-time = "2026-06-07T21:08:23.044Z" },
+    { url = "https://files.pythonhosted.org/packages/83/a8/72193137de57fda4ebfae4563182d082c8856e3b6e9871d0b46f028fb369/aiohttp-3.14.1-cp314-cp314-manylinux2014_s390x.manylinux_2_17_s390x.manylinux_2_28_s390x.whl", hash = "sha256:c83afe0ba876be7e943d2e0ba645809ad441575d2840c895c21ee5de93b9377a", size = 1875776, upload-time = "2026-06-07T21:08:25.288Z" },
+    { url = "https://files.pythonhosted.org/packages/a0/18/938441025db6769a3464596b2410af3afde0b21eb2f204c6f766f68af4bd/aiohttp-3.14.1-cp314-cp314-manylinux2014_x86_64.manylinux_2_17_x86_64.manylinux_2_28_x86_64.whl", hash = "sha256:634e385930fb6d2d479cf3aa66515955863b77a5e3c2b5894ca259a25b308602", size = 1760329, upload-time = "2026-06-07T21:08:27.363Z" },
+    { url = "https://files.pythonhosted.org/packages/60/29/bf2496b4065e76e09fe48015aaffe5ce161d8f089b06ac6982070f653076/aiohttp-3.14.1-cp314-cp314-manylinux_2_31_riscv64.manylinux_2_39_riscv64.whl", hash = "sha256:eeea07c4397bbc57719c4eed8f9c284874d4f175f9b6d57f7a1546b976d455ca", size = 1587293, upload-time = "2026-06-07T21:08:29.805Z" },
+    { url = "https://files.pythonhosted.org/packages/49/a2/2136674d52123b1354bd05dd5753c318db47dc0c927cc70b27bab3755456/aiohttp-3.14.1-cp314-cp314-musllinux_1_2_aarch64.whl", hash = "sha256:335c0cc3e3545ce98dcb9cfcb836f40c3411f43fa03dab757597d80c89af8a35", size = 1714756, upload-time = "2026-06-07T21:08:32.094Z" },
+    { url = "https://files.pythonhosted.org/packages/a7/b9/e5fd2e6f915503081c0f9b1e8540947037929c70c191da2e4d54b31a21a1/aiohttp-3.14.1-cp314-cp314-musllinux_1_2_armv7l.whl", hash = "sha256:ae6be797afdef264e8a84864a85b196ca06045586481b3df8a967322fd2fa844", size = 1721052, upload-time = "2026-06-07T21:08:34.167Z" },
+    { url = "https://files.pythonhosted.org/packages/63/5a/2833e324a2263e104e31e2e91bc5bbee81bc499afd32203faee048a883f0/aiohttp-3.14.1-cp314-cp314-musllinux_1_2_ppc64le.whl", hash = "sha256:8560b4d712474335d08907db7973f71912d3a9a8f1dee992ec06b5d2fe359496", size = 1766888, upload-time = "2026-06-07T21:08:36.95Z" },
+    { url = "https://files.pythonhosted.org/packages/57/fa/dea6511870913162f3b2e8c42a7614eb203a4540b8c2da43e0bfb0548f3c/aiohttp-3.14.1-cp314-cp314-musllinux_1_2_riscv64.whl", hash = "sha256:2b7edd08e0a5deb1e8564a2fcd8f4561014a3f05252334671bbf55ddd47db0e5", size = 1581679, upload-time = "2026-06-07T21:08:39.292Z" },
+    { url = "https://files.pythonhosted.org/packages/14/bd/3cf0d55e71784b33534e9710a67d382d900598b4787fbce6cc7317f8c42a/aiohttp-3.14.1-cp314-cp314-musllinux_1_2_s390x.whl", hash = "sha256:b6ff7fcee63287ae57b5df3e4f5957ce032122802509246dec1a5bcc55904c95", size = 1782021, upload-time = "2026-06-07T21:08:41.407Z" },
+    { url = "https://files.pythonhosted.org/packages/c1/af/14bb5843eccbe234f4dfb78ab73e549d99727247e62ae5d62cbd22eaf5b0/aiohttp-3.14.1-cp314-cp314-musllinux_1_2_x86_64.whl", hash = "sha256:6ffbb2f4ec1ceaff7e07d43922954da26b223d188bf30658e561b98e23089444", size = 1742574, upload-time = "2026-06-07T21:08:43.795Z" },
+    { url = "https://files.pythonhosted.org/packages/f2/1e/fbeb7af9210a67ac0f9c9bec0f8f4568497924e33137a3d5b48e1cf85f3f/aiohttp-3.14.1-cp314-cp314-win32.whl", hash = "sha256:a9875b46d910cff3ea2f5962f9d266b465459fe634e22556ab9bd6fc1192eea0", size = 457773, upload-time = "2026-06-07T21:08:46.168Z" },
+    { url = "https://files.pythonhosted.org/packages/f0/2b/13e8d741a9ec5db7d900c060554cf8352ab85e44e2a4469ebb9d377bda17/aiohttp-3.14.1-cp314-cp314-win_amd64.whl", hash = "sha256:af8b4b81a960eeaf1234971ac3cd0ba5901f3cd42eae42a46b4d089a8b492719", size = 485001, upload-time = "2026-06-07T21:08:48.401Z" },
+    { url = "https://files.pythonhosted.org/packages/df/30/491acfa2c4d6c3ff59c49a14fc1b50be3241e25bbb0c84c09e2da4d11395/aiohttp-3.14.1-cp314-cp314-win_arm64.whl", hash = "sha256:cf4491381b1b57425c315a56a439251b1bdac07b2275f19a8c44bc57744532ec", size = 453809, upload-time = "2026-06-07T21:08:50.7Z" },
+    { url = "https://files.pythonhosted.org/packages/34/e3/19dbe1a1f4cc6230eb9e314de7fe68053b0992f9302b27d12141a0b5db53/aiohttp-3.14.1-cp314-cp314t-macosx_10_15_universal2.whl", hash = "sha256:819c054312f1af92947e6a55883d1b66feefab11531a7fc45e0fb9b63880b5c2", size = 793320, upload-time = "2026-06-07T21:08:52.775Z" },
+    { url = "https://files.pythonhosted.org/packages/7f/20/1b7182219ba1b108430d6e4dc53d25ae02dcfcf5a045b33af4e8c5167527/aiohttp-3.14.1-cp314-cp314t-macosx_10_15_x86_64.whl", hash = "sha256:10ee9c1753a8f706345b22496c79fbddb5be0599e0823f3738b1534058e25340", size = 529077, upload-time = "2026-06-07T21:08:55Z" },
+    { url = "https://files.pythonhosted.org/packages/b9/c8/14ce60ec31a2e5f5274bb17d383a6f7a3aabca31ac04eee05585bbadab16/aiohttp-3.14.1-cp314-cp314t-macosx_11_0_arm64.whl", hash = "sha256:1601cc37baf5750ccacae618ec2daf020769581695550e3b654a911f859c563d", size = 532476, upload-time = "2026-06-07T21:08:57.176Z" },
+    { url = "https://files.pythonhosted.org/packages/7e/02/9ac85e081e53da2e061b02fa7758fe0a12d17b8ce2d1f5e6c7cb76730328/aiohttp-3.14.1-cp314-cp314t-manylinux2014_aarch64.manylinux_2_17_aarch64.manylinux_2_28_aarch64.whl", hash = "sha256:4d6e0ac9da31c9c04c84e1c0182ad8d6df35965a85cae29cd71d089621b3ae94", size = 1922347, upload-time = "2026-06-07T21:08:59.563Z" },
+    { url = "https://files.pythonhosted.org/packages/c0/3e/d3ba07a0ab38b5389e10bec4362d21e10a4f667cba2d79ba30837b3a5059/aiohttp-3.14.1-cp314-cp314t-manylinux2014_armv7l.manylinux_2_17_armv7l.manylinux_2_31_armv7l.whl", hash = "sha256:9e8f2d660c350b3d0e259c7a7e3d9b7fc8b41210cbcc3d4a7076ff0a5e5c2fdc", size = 1786465, upload-time = "2026-06-07T21:09:01.909Z" },
+    { url = "https://files.pythonhosted.org/packages/0b/cb/e2ee978a00cfb2df829704a69528b18154eba5939f45bc1efa8f33aee4c5/aiohttp-3.14.1-cp314-cp314t-manylinux2014_ppc64le.manylinux_2_17_ppc64le.manylinux_2_28_ppc64le.whl", hash = "sha256:4691802dda97be727f79d86818acaad7eb8e9252626a1d6b519fedbb92d5e251", size = 1909423, upload-time = "2026-06-07T21:09:04.357Z" },
+    { url = "https://files.pythonhosted.org/packages/73/5d/1430334858b1022b58ae50399a918f0bd6fe8fa7fa183598d657ff61e040/aiohttp-3.14.1-cp314-cp314t-manylinux2014_s390x.manylinux_2_17_s390x.manylinux_2_28_s390x.whl", hash = "sha256:c389c482a7e9b9dc3ee2701ac46c4125297a3818875b9c305ddb603c04828fd1", size = 2001906, upload-time = "2026-06-07T21:09:06.722Z" },
+    { url = "https://files.pythonhosted.org/packages/66/4e/560c7472d3d198a23aa5c8b19a5115bf6a9b77b7d3e4bb363da320430ad2/aiohttp-3.14.1-cp314-cp314t-manylinux2014_x86_64.manylinux_2_17_x86_64.manylinux_2_28_x86_64.whl", hash = "sha256:fc0cacab7ba4e56f0f81c82a98c09bed2f39c940107b03a34b168bdf7597edd3", size = 1877095, upload-time = "2026-06-07T21:09:09.011Z" },
+    { url = "https://files.pythonhosted.org/packages/0d/f1/4745806578d447db4a784a8591e2dae3afdfc2bcb96f8f81271b13df6543/aiohttp-3.14.1-cp314-cp314t-manylinux_2_31_riscv64.manylinux_2_39_riscv64.whl", hash = "sha256:979ed4717f59b8bb12e3963378fa285d93d367e15bcd66c721311826d3c44a6c", size = 1676222, upload-time = "2026-06-07T21:09:11.461Z" },
+    { url = "https://files.pythonhosted.org/packages/6a/c9/48255813cca749a229ef0ab476004ec623728ad79a9c0840616f6c076325/aiohttp-3.14.1-cp314-cp314t-musllinux_1_2_aarch64.whl", hash = "sha256:38e1e7daaea81df51c952e18483f323d878499a1e2bfe564790e0f9701d6f203", size = 1842922, upload-time = "2026-06-07T21:09:14.118Z" },
+    { url = "https://files.pythonhosted.org/packages/3d/c0/bbd054e2bee909f529523a5af3891052606af5143c09f5f183ec3b234676/aiohttp-3.14.1-cp314-cp314t-musllinux_1_2_armv7l.whl", hash = "sha256:4132e72c608fe9fecb8f409113567605915b83e9bdd3ea56538d2f9cd35002f1", size = 1825035, upload-time = "2026-06-07T21:09:16.447Z" },
+    { url = "https://files.pythonhosted.org/packages/a8/ae/90395d4376deceb74e09ec26b6adf7d2015a6f8802d6d84446af860fef04/aiohttp-3.14.1-cp314-cp314t-musllinux_1_2_ppc64le.whl", hash = "sha256:eefd9cc9b6d4a2db5f00a26bc3e4f9acf71926a6ec557cd56c9c6f27c290b665", size = 1849512, upload-time = "2026-06-07T21:09:18.742Z" },
+    { url = "https://files.pythonhosted.org/packages/93/bd/fb25f3049957553d4ce0ba6ae480aa2f592a6985497fca590837d16c1be0/aiohttp-3.14.1-cp314-cp314t-musllinux_1_2_riscv64.whl", hash = "sha256:b165790117eea512d7f3fb22f1f6dad3d55a7189571993eb015591c1401276d1", size = 1668571, upload-time = "2026-06-07T21:09:21.458Z" },
+    { url = "https://files.pythonhosted.org/packages/3f/22/7f73303d64dd567ff3addca90b556690ed1233a47b8f55d242fb90af3681/aiohttp-3.14.1-cp314-cp314t-musllinux_1_2_s390x.whl", hash = "sha256:ed09c7eb1c391271c2ed0314a51903e72a3acb653d5ccfc264cdf3ef11f8269d", size = 1881159, upload-time = "2026-06-07T21:09:23.813Z" },
+    { url = "https://files.pythonhosted.org/packages/44/be/0474c5a8b5640e1e4aa1923430a91f4151be82e511373fe764189b89aef5/aiohttp-3.14.1-cp314-cp314t-musllinux_1_2_x86_64.whl", hash = "sha256:99abd37084b82f5830c635fddd0b4993b9742a66eb746dacf433c8590e8f9e3c", size = 1841409, upload-time = "2026-06-07T21:09:26.207Z" },
+    { url = "https://files.pythonhosted.org/packages/7b/3c/bb4a7cba26956cb3da4553cc2056cf67be5b5ff6e6d8fa4fbdff73bfb7ae/aiohttp-3.14.1-cp314-cp314t-win32.whl", hash = "sha256:47ddf841cdecc810749921d25606dee45857d12d2ad5ddb7b5bd7eab12e4b365", size = 494166, upload-time = "2026-06-07T21:09:28.505Z" },
+    { url = "https://files.pythonhosted.org/packages/8a/84/ec80c2c1f66a952555a9f86df6b33af65108a6febfa0471b69013a12f807/aiohttp-3.14.1-cp314-cp314t-win_amd64.whl", hash = "sha256:5e78b522b7a6e27e0b25d19b247b75039ac4c94f99823e3c9e53ae1603a9f7e9", size = 530255, upload-time = "2026-06-07T21:09:30.843Z" },
+    { url = "https://files.pythonhosted.org/packages/2a/71/6e22be134a4061ada85a92951b842f2657f17d926b727f3f94c56ae963d6/aiohttp-3.14.1-cp314-cp314t-win_arm64.whl", hash = "sha256:90d53f1609c29ccc2193945ef732428382a28f78d0456ae4d3daf0d48b74f0f6", size = 469640, upload-time = "2026-06-07T21:09:33.028Z" },
+]
+
+[[package]]
+name = "aiosignal"
+version = "1.4.0"
+source = { registry = "https://pypi.org/simple" }
+dependencies = [
+    { name = "frozenlist" },
+    { name = "typing-extensions", marker = "python_full_version < '3.13'" },
+]
+sdist = { url = "https://files.pythonhosted.org/packages/61/62/06741b579156360248d1ec624842ad0edf697050bbaf7c3e46394e106ad1/aiosignal-1.4.0.tar.gz", hash = "sha256:f47eecd9468083c2029cc99945502cb7708b082c232f9aca65da147157b251c7", size = 25007, upload-time = "2025-07-03T22:54:43.528Z" }
+wheels = [
+    { url = "https://files.pythonhosted.org/packages/fb/76/641ae371508676492379f16e2fa48f4e2c11741bd63c48be4b12a6b09cba/aiosignal-1.4.0-py3-none-any.whl", hash = "sha256:053243f8b92b990551949e63930a839ff0cf0b0ebbe0597b0f3fb19e1a0fe82e", size = 7490, upload-time = "2025-07-03T22:54:42.156Z" },
+]
+
+[[package]]
+name = "annotated-types"
+version = "0.7.0"
+source = { registry = "https://pypi.org/simple" }
+sdist = { url = "https://files.pythonhosted.org/packages/ee/67/531ea369ba64dcff5ec9c3402f9f51bf748cec26dde048a2f973a4eea7f5/annotated_types-0.7.0.tar.gz", hash = "sha256:aff07c09a53a08bc8cfccb9c85b05f1aa9a2a6f23728d790723543408344ce89", size = 16081, upload-time = "2024-05-20T21:33:25.928Z" }
+wheels = [
+    { url = "https://files.pythonhosted.org/packages/78/b6/6307fbef88d9b5ee7421e68d78a9f162e0da4900bc5f5793f6d3d0e34fb8/annotated_types-0.7.0-py3-none-any.whl", hash = "sha256:1f02e8b43a8fbbc3f3e0d4f0f4bfc8131bcb4eebe8849b8e5c773f3a1c582a53", size = 13643, upload-time = "2024-05-20T21:33:24.1Z" },
+]
+
+[[package]]
+name = "ast-serialize"
+version = "0.5.0"
+source = { registry = "https://pypi.org/simple" }
+sdist = { url = "https://files.pythonhosted.org/packages/81/9d/09e27731bd5864a9ce04e3244074e674bb8936bf62b45e0357248717adac/ast_serialize-0.5.0.tar.gz", hash = "sha256:5880091bfe6f4f986f22866375c2e884843e7a0b6343ae41aeea659613d879b6", size = 61157, upload-time = "2026-05-17T17:48:29.429Z" }
+wheels = [
+    { url = "https://files.pythonhosted.org/packages/c0/9a/13dde51ba9e15f8b97957ab7cb0120d0e381524d651c6bd630b9c359227f/ast_serialize-0.5.0-cp314-cp314t-macosx_10_12_x86_64.whl", hash = "sha256:8f5c14f169eb0972c0c21bada5358b23d6047c76583b005234f865b11f1fa00a", size = 1183520, upload-time = "2026-05-17T17:47:30.831Z" },
+    { url = "https://files.pythonhosted.org/packages/37/de/5a7f0a9fe68944f536632a5af84676739c7d2582be42deb082634bf3a754/ast_serialize-0.5.0-cp314-cp314t-macosx_11_0_arm64.whl", hash = "sha256:7d1a2de9de5be04652f0ed60738356ef94f66db37924a9499fffe98dc491aa0b", size = 1175779, upload-time = "2026-05-17T17:47:32.551Z" },
+    { url = "https://files.pythonhosted.org/packages/9c/81/0bb853e76e4f6e9a1855d569003c59e19ffac45f7079d91505d1bb212f92/ast_serialize-0.5.0-cp314-cp314t-manylinux_2_17_aarch64.manylinux2014_aarch64.whl", hash = "sha256:be5173fb66f9b49026d9d5a2ff0fc7c7009077107c0eb285b2d60fdf1fe10bd1", size = 1233750, upload-time = "2026-05-17T17:47:34.731Z" },
+    { url = "https://files.pythonhosted.org/packages/e5/d3/4cf705beeccc08754d0bbda99aefff26110e209b9a07ac8a6b60eec48531/ast_serialize-0.5.0-cp314-cp314t-manylinux_2_17_armv7l.manylinux2014_armv7l.whl", hash = "sha256:f8015cd071ac1339924ee2b8098c93e00e155f30a16f40ec9816fcf84f4753f6", size = 1235942, upload-time = "2026-05-17T17:47:36.287Z" },
+    { url = "https://files.pythonhosted.org/packages/26/c8/ee097e437ea27dd2b8b227865c875492b585650a5802a22d82b304c8201b/ast_serialize-0.5.0-cp314-cp314t-manylinux_2_17_ppc64le.manylinux2014_ppc64le.whl", hash = "sha256:5499e8797edff2a9186aa313ed382c6b422e798e9332d9953badcee6e69a88f2", size = 1442517, upload-time = "2026-05-17T17:47:38.17Z" },
+    { url = "https://files.pythonhosted.org/packages/ff/bd/68063442838f1ba68ec72b5436430bc75b3bb17a1a3c3063f09b0c05ae2b/ast_serialize-0.5.0-cp314-cp314t-manylinux_2_17_s390x.manylinux2014_s390x.whl", hash = "sha256:6848f2a093fb5548751a9a09bff8fcd229e2bbeb0e3331f391b6ae6d26cd9903", size = 1254081, upload-time = "2026-05-17T17:47:39.826Z" },
+    { url = "https://files.pythonhosted.org/packages/50/e2/1e520793bc6a4e4524a6ab022391e827825eaa0c3811828bfdc6852eca26/ast_serialize-0.5.0-cp314-cp314t-manylinux_2_17_x86_64.manylinux2014_x86_64.whl", hash = "sha256:832d4c998e0b091fd60a6d6bceee535483c4d490de9ba85003af835225719261", size = 1259910, upload-time = "2026-05-17T17:47:41.369Z" },
+    { url = "https://files.pythonhosted.org/packages/4e/e1/49b60f467979979cfe6913b43948ff25bca971ad0591d181812f163a988e/ast_serialize-0.5.0-cp314-cp314t-manylinux_2_31_riscv64.whl", hash = "sha256:16db7c62ec0b8efe1d7afd283a388d8f74f2605d56032e5a37747d2de8dba027", size = 1250678, upload-time = "2026-05-17T17:47:43.702Z" },
+    { url = "https://files.pythonhosted.org/packages/74/ba/66ab9555de6275677566f6574e5ef6c29cb185ea866f643bc06f8280a8ee/ast_serialize-0.5.0-cp314-cp314t-manylinux_2_5_i686.manylinux1_i686.whl", hash = "sha256:baf5eb061eb5bccade4128ad42da33787d72f6013809cd1b590376ece8b3c937", size = 1301603, upload-time = "2026-05-17T17:47:46.256Z" },
+    { url = "https://files.pythonhosted.org/packages/66/42/6aca9b9abc710014b2be9059689e5dd1679339e78f567ffb4d255a9e2050/ast_serialize-0.5.0-cp314-cp314t-musllinux_1_2_aarch64.whl", hash = "sha256:104e4a35bd7c124173c41760ef9aaea17ddb3f86c65cb643671d59afbe3ee94c", size = 1410332, upload-time = "2026-05-17T17:47:47.899Z" },
+    { url = "https://files.pythonhosted.org/packages/47/68/2f76594432a22581ecf878b5e75a9b8601c24b2241cf0bbeb1e21fcf370c/ast_serialize-0.5.0-cp314-cp314t-musllinux_1_2_armv7l.whl", hash = "sha256:36be371028fc1675acb38a331bde160dbab7ff907fdf00b67eb6911aa106951b", size = 1509979, upload-time = "2026-05-17T17:47:50.942Z" },
+    { url = "https://files.pythonhosted.org/packages/40/ac/a93c9b58292653f6c595752f677a08e608f903b710594909e9231a389b3b/ast_serialize-0.5.0-cp314-cp314t-musllinux_1_2_i686.whl", hash = "sha256:061ee58bdb52341c8201a6df41182a977736bae3b7ded87ca7176ca25a8a47ab", size = 1505002, upload-time = "2026-05-17T17:47:54.093Z" },
+    { url = "https://files.pythonhosted.org/packages/14/2e/b278f68c497ee2f1d1576cbbef8db5281cd4a5f2db040537592ac9c8862e/ast_serialize-0.5.0-cp314-cp314t-musllinux_1_2_x86_64.whl", hash = "sha256:b15219e9cdc9f53f6f4cb51c009203507228226148c05c5e8fe451c28b435eb3", size = 1456231, upload-time = "2026-05-17T17:47:56.311Z" },
+    { url = "https://files.pythonhosted.org/packages/0b/43/419be1c566a4c504cd8fd60ce2f84e790f295495c0f327cfaeadf3d51012/ast_serialize-0.5.0-cp314-cp314t-win32.whl", hash = "sha256:842d1c004bb466c7df036f95fabef789570541922b10976b12f5592a69cf0b38", size = 1058668, upload-time = "2026-05-17T17:47:58.305Z" },
+    { url = "https://files.pythonhosted.org/packages/03/6f/c9d4d549295ed05111aeb8853232d1afd9d0a179fddb01eeffbb3a4a6842/ast_serialize-0.5.0-cp314-cp314t-win_amd64.whl", hash = "sha256:b0c06d760909b095cc466356dfccd05a1c7233a6ca191c020dca2c6a6f16c24c", size = 1101075, upload-time = "2026-05-17T17:48:00.35Z" },
+    { url = "https://files.pythonhosted.org/packages/d0/8e/d00c5ab30c58222e07d62956fca86c59d91b9ad32997e633c38b526623a3/ast_serialize-0.5.0-cp314-cp314t-win_arm64.whl", hash = "sha256:787baedb0262cc49e8ce37cc15c00ae818e46a165a3b36f5e21ed174998104cb", size = 1075347, upload-time = "2026-05-17T17:48:01.753Z" },
+    { url = "https://files.pythonhosted.org/packages/e0/9e/dc2530acb3a60dc6e46d65abf27d1d9f86721694757906a148d90a6860de/ast_serialize-0.5.0-cp39-abi3-macosx_10_12_x86_64.whl", hash = "sha256:0668aa9459cfa8c9c49ddd2163ebcf43088ba045ef7492af6fe22e0098303101", size = 1191380, upload-time = "2026-05-17T17:48:03.738Z" },
+    { url = "https://files.pythonhosted.org/packages/26/0a/bd3d18a582f273d6c843d16bb9e22e9e16365ff7991e92f18f798e9f1224/ast_serialize-0.5.0-cp39-abi3-macosx_11_0_arm64.whl", hash = "sha256:bf683d6363edf2b39eed6b6d4fe22d34b6203867a67e27134d9e2a2680c4bc4a", size = 1183879, upload-time = "2026-05-17T17:48:05.463Z" },
+    { url = "https://files.pythonhosted.org/packages/40/ae/1f919100f8620887af58fcc381c61a1f218cdf89c6e155f87b213e61010a/ast_serialize-0.5.0-cp39-abi3-manylinux_2_17_aarch64.manylinux2014_aarch64.whl", hash = "sha256:9cc22cf0c9be65e71cf88fda130af60d61eb4a79370ad4cfe7900d48a4aa2211", size = 1244529, upload-time = "2026-05-17T17:48:07.008Z" },
+    { url = "https://files.pythonhosted.org/packages/c6/ca/6376559dcce707cdbc1d0d9a13c8d3baaaa501e949ce0ebdc4230cd881aa/ast_serialize-0.5.0-cp39-abi3-manylinux_2_17_armv7l.manylinux2014_armv7l.whl", hash = "sha256:f66173891548c9f2726bf27957b41cabce12fa679dc6da505ddbde4d4b3b31cf", size = 1240560, upload-time = "2026-05-17T17:48:08.46Z" },
+    { url = "https://files.pythonhosted.org/packages/35/b2/a620e206b5aeb7efbf2710336df57d457cffbb3991076bbcc1147ef9abd4/ast_serialize-0.5.0-cp39-abi3-manylinux_2_17_ppc64le.manylinux2014_ppc64le.whl", hash = "sha256:e42d729ef2be96a14efbad355093284739e3670ece3e534f82cc8832790911d9", size = 1451172, upload-time = "2026-05-17T17:48:09.922Z" },
+    { url = "https://files.pythonhosted.org/packages/fa/e0/4ad5c04c24a40481b2935ce9a0ccdb6023dc8b667167d06ae530cc3512f2/ast_serialize-0.5.0-cp39-abi3-manylinux_2_17_s390x.manylinux2014_s390x.whl", hash = "sha256:b725026bafa801dbd7310eb13a75f0a2e370e7e51b2cb225f9d21fcfadf919ee", size = 1265072, upload-time = "2026-05-17T17:48:11.469Z" },
+    { url = "https://files.pythonhosted.org/packages/b2/71/4d1d479aa56d0101c40e17720c3d6ac2af7269ea0487a80b18e7bfd1a5b7/ast_serialize-0.5.0-cp39-abi3-manylinux_2_17_x86_64.manylinux2014_x86_64.whl", hash = "sha256:b54f60c1d78767a53b67eaa663f0dfac3afe606aa07f1301572f588b73d64809", size = 1270488, upload-time = "2026-05-17T17:48:13.575Z" },
+    { url = "https://files.pythonhosted.org/packages/6d/4f/0de1bbe06f6edef9fde4ed12ca8e7b3ec7e6e2bd4e672c5af487f7957665/ast_serialize-0.5.0-cp39-abi3-manylinux_2_31_riscv64.whl", hash = "sha256:27d51654fc240a1e87e742d353d98eb45b75f62f129086b3596ab53df2ac2a43", size = 1260702, upload-time = "2026-05-17T17:48:15.141Z" },
+    { url = "https://files.pythonhosted.org/packages/75/61/e00872439cfdddcc3c1b6cdaa6e5d904ba8e26a18807c67c4e14409d0ca8/ast_serialize-0.5.0-cp39-abi3-manylinux_2_5_i686.manylinux1_i686.whl", hash = "sha256:2782c36237c46dd1674542f2109740ea5ea485a169bf1431939ada0434e17934", size = 1311182, upload-time = "2026-05-17T17:48:16.779Z" },
+    { url = "https://files.pythonhosted.org/packages/76/8e/699a5b955f7926956c95e9e1d74132acad73c2fe7a426f94da89123c20aa/ast_serialize-0.5.0-cp39-abi3-musllinux_1_2_aarch64.whl", hash = "sha256:1943db345233cc7194a470f13afa9c59772c0b123dea0c9414c4d4ca54369759", size = 1421410, upload-time = "2026-05-17T17:48:18.527Z" },
+    { url = "https://files.pythonhosted.org/packages/a9/ae/d5b7626874478997adc7a29ab28accf21e596fb590c944290401dfd0b29e/ast_serialize-0.5.0-cp39-abi3-musllinux_1_2_armv7l.whl", hash = "sha256:df1c00022cbbcb064bfaa505aa9c9295362443ce5dacb459d1331d3da353f887", size = 1516587, upload-time = "2026-05-17T17:48:20.133Z" },
+    { url = "https://files.pythonhosted.org/packages/0c/ce/b59e02a82d9c4244d64cde502e0b00e83e38816abe19155ceb5437402c7f/ast_serialize-0.5.0-cp39-abi3-musllinux_1_2_i686.whl", hash = "sha256:cae65289fc456fde04af979a2be09302ef5d8ab92ef23e596d6746dc267ada27", size = 1515171, upload-time = "2026-05-17T17:48:21.921Z" },
+    { url = "https://files.pythonhosted.org/packages/8b/38/d8d90042747d05aa08d4efcf1c99035a5f670a6bf4c214d31644392afbca/ast_serialize-0.5.0-cp39-abi3-musllinux_1_2_x86_64.whl", hash = "sha256:239a4c354e8d676e9d94631d1d4a64edc6b266f86ff3a5a80aedd344f342c01d", size = 1464668, upload-time = "2026-05-17T17:48:23.544Z" },
+    { url = "https://files.pythonhosted.org/packages/dd/51/5b840c4df7334104cecffa28f23904fe81ca89ca223d2450e288de39fd3c/ast_serialize-0.5.0-cp39-abi3-win32.whl", hash = "sha256:143a4ef63285a075871908fda3672dc21864b83a8ec3ee12304aa3e4c5387b9a", size = 1068311, upload-time = "2026-05-17T17:48:25.027Z" },
+    { url = "https://files.pythonhosted.org/packages/41/11/ca5672c7d491825bc4cd6702dea106a6b60d928707712ec257c7833ae476/ast_serialize-0.5.0-cp39-abi3-win_amd64.whl", hash = "sha256:cf25572c526add400f26a4750dc6ce0c3bb93fc1f75e7ae0cad4ce4f2cd5c590", size = 1108931, upload-time = "2026-05-17T17:48:26.591Z" },
+    { url = "https://files.pythonhosted.org/packages/45/19/cc8bd127d28a43da249aa955cfd164cf8fd534e79e42cea96c4854d72fd0/ast_serialize-0.5.0-cp39-abi3-win_arm64.whl", hash = "sha256:92a31c9c20d25a076edaeec76b128a3535d74a24f340b9a8a7e96c9b86dc9642", size = 1081181, upload-time = "2026-05-17T17:48:28.122Z" },
+]
+
+[[package]]
+name = "attrs"
+version = "26.1.0"
+source = { registry = "https://pypi.org/simple" }
+sdist = { url = "https://files.pythonhosted.org/packages/9a/8e/82a0fe20a541c03148528be8cac2408564a6c9a0cc7e9171802bc1d26985/attrs-26.1.0.tar.gz", hash = "sha256:d03ceb89cb322a8fd706d4fb91940737b6642aa36998fe130a9bc96c985eff32", size = 952055, upload-time = "2026-03-19T14:22:25.026Z" }
+wheels = [
+    { url = "https://files.pythonhosted.org/packages/64/b4/17d4b0b2a2dc85a6df63d1157e028ed19f90d4cd97c36717afef2bc2f395/attrs-26.1.0-py3-none-any.whl", hash = "sha256:c647aa4a12dfbad9333ca4e71fe62ddc36f4e63b2d260a37a8b83d2f043ac309", size = 67548, upload-time = "2026-03-19T14:22:23.645Z" },
+]
+
+[[package]]
+name = "cfgv"
+version = "3.5.0"
+source = { registry = "https://pypi.org/simple" }
+sdist = { url = "https://files.pythonhosted.org/packages/4e/b5/721b8799b04bf9afe054a3899c6cf4e880fcf8563cc71c15610242490a0c/cfgv-3.5.0.tar.gz", hash = "sha256:d5b1034354820651caa73ede66a6294d6e95c1b00acc5e9b098e917404669132", size = 7334, upload-time = "2025-11-19T20:55:51.612Z" }
+wheels = [
+    { url = "https://files.pythonhosted.org/packages/db/3c/33bac158f8ab7f89b2e59426d5fe2e4f63f7ed25df84c036890172b412b5/cfgv-3.5.0-py2.py3-none-any.whl", hash = "sha256:a8dc6b26ad22ff227d2634a65cb388215ce6cc96bbcc5cfde7641ae87e8dacc0", size = 7445, upload-time = "2025-11-19T20:55:50.744Z" },
+]
+
+[[package]]
+name = "colorama"
+version = "0.4.6"
+source = { registry = "https://pypi.org/simple" }
+sdist = { url = "https://files.pythonhosted.org/packages/d8/53/6f443c9a4a8358a93a6792e2acffb9d9d5cb0a5cfd8802644b7b1c9a02e4/colorama-0.4.6.tar.gz", hash = "sha256:08695f5cb7ed6e0531a20572697297273c47b8cae5a63ffc6d6ed5c201be6e44", size = 27697, upload-time = "2022-10-25T02:36:22.414Z" }
+wheels = [
+    { url = "https://files.pythonhosted.org/packages/d1/d6/3965ed04c63042e047cb6a3e6ed1a63a35087b6a609aa3a15ed8ac56c221/colorama-0.4.6-py2.py3-none-any.whl", hash = "sha256:4f1d9991f5acc0ca119f9d443620b77f9d6b33703e51011c16baf57afb285fc6", size = 25335, upload-time = "2022-10-25T02:36:20.889Z" },
+]
+
+[[package]]
+name = "coverage"
+version = "7.14.2"
+source = { registry = "https://pypi.org/simple" }
+sdist = { url = "https://files.pythonhosted.org/packages/9c/a3/3834a5564fe8f32154cd7032400d3c2f9c565b2a373fa671f2bbdad6f634/coverage-7.14.2.tar.gz", hash = "sha256:7a2da3d81cfe17c18038c6d98e6592aa9147d596d056119b0ee612c3c8bd5230", size = 923982, upload-time = "2026-06-20T14:49:30.885Z" }
+wheels = [
+    { url = "https://files.pythonhosted.org/packages/04/d5/d0e511247f84fa88ae7da68403cbd3bf9d2a5fc48f5d6618a6846b275632/coverage-7.14.2-cp311-cp311-macosx_10_9_x86_64.whl", hash = "sha256:909f265c8c41f04c824bf741b2601fdcb56cab4bf56e018996b6494192ba0f58", size = 220352, upload-time = "2026-06-20T14:47:28.61Z" },
+    { url = "https://files.pythonhosted.org/packages/03/4a/ecaff6db72e6c1782ca51336e391393f1e9cc6e4412d6c3da8b7d5075adf/coverage-7.14.2-cp311-cp311-macosx_11_0_arm64.whl", hash = "sha256:c8102deaf911938233f760426e6a5e287388521de95111d5c8de26c8a1028924", size = 220855, upload-time = "2026-06-20T14:47:29.972Z" },
+    { url = "https://files.pythonhosted.org/packages/34/9a/cf950cd8e8df06ee5941276e69f81647005360421be523d5ca18f658e143/coverage-7.14.2-cp311-cp311-manylinux1_i686.manylinux_2_28_i686.manylinux_2_5_i686.whl", hash = "sha256:851f49e7bd7d1cdaf328f3133942b252d5e3d3380690131f423cba8e435b87f5", size = 251276, upload-time = "2026-06-20T14:47:31.413Z" },
+    { url = "https://files.pythonhosted.org/packages/9d/08/f973be32c9a095e4bb2d3a7bdcb2f9c117e39d4062471ffffae3623f6c51/coverage-7.14.2-cp311-cp311-manylinux1_x86_64.manylinux_2_28_x86_64.manylinux_2_5_x86_64.whl", hash = "sha256:04cb445bed86aaf00aaa97d41a8b6e30f100f21e81c34caaec4efc684cb57768", size = 253189, upload-time = "2026-06-20T14:47:32.727Z" },
+    { url = "https://files.pythonhosted.org/packages/96/aa/f3a50952ba553d442d94b793e5dede25d426b02e5e011e9a9dd225c002d3/coverage-7.14.2-cp311-cp311-manylinux2014_aarch64.manylinux_2_17_aarch64.manylinux_2_28_aarch64.whl", hash = "sha256:7471bc920d97c51c37ea8127f13b2adca43c3d78c53313b26a1f428e99d2c254", size = 255299, upload-time = "2026-06-20T14:47:34.019Z" },
+    { url = "https://files.pythonhosted.org/packages/e0/29/9a4c491986f4d637ed64961ae56721661fc21b6b767d280848d0c708756a/coverage-7.14.2-cp311-cp311-manylinux2014_ppc64le.manylinux_2_17_ppc64le.manylinux_2_28_ppc64le.whl", hash = "sha256:da5057e1bb257c967feee8ba67f3ebf379e801c7717f238b3d8c9caf00fc8f93", size = 257255, upload-time = "2026-06-20T14:47:35.397Z" },
+    { url = "https://files.pythonhosted.org/packages/dd/61/d2a5b48007f6a212f321c36cf5486feb80505d2d00dfb1163aad2da71197/coverage-7.14.2-cp311-cp311-manylinux_2_31_riscv64.manylinux_2_39_riscv64.whl", hash = "sha256:33c0da852e8a40246cd8e20cf3b2fc17ca52a45e9b5f7983c93db26f5d24b87b", size = 251417, upload-time = "2026-06-20T14:47:36.677Z" },
+    { url = "https://files.pythonhosted.org/packages/ea/25/8df66ae25b401d4529e1d0617af20d9695d171ea4ffec4ca9dffc5dc37b7/coverage-7.14.2-cp311-cp311-musllinux_1_2_aarch64.whl", hash = "sha256:f48a85bb437fab7782021c40bfee6b15146928b96960d008ace41b6901a0f21d", size = 252991, upload-time = "2026-06-20T14:47:38.027Z" },
+    { url = "https://files.pythonhosted.org/packages/e3/7b/16bdc9116dd8bf412a421a7227daa65ad9f12bef0685b13c1bd1c12e6d4c/coverage-7.14.2-cp311-cp311-musllinux_1_2_i686.whl", hash = "sha256:f44e7579a769a21d5b5e3166916bfe30ee175aaffff750324cbb11be2dbec5ad", size = 251051, upload-time = "2026-06-20T14:47:39.26Z" },
+    { url = "https://files.pythonhosted.org/packages/0a/f8/b7dbed84274dcc69ddb9c0fe72ec1260830473e0d6c299dcf087a0567f7c/coverage-7.14.2-cp311-cp311-musllinux_1_2_ppc64le.whl", hash = "sha256:78853ca3c6ca2f012daa2b07dbabbb8db0f09d4dbe8ee828d294b3445d3f4cd8", size = 254817, upload-time = "2026-06-20T14:47:40.995Z" },
+    { url = "https://files.pythonhosted.org/packages/c6/07/4659e6bed01a25a0effb4952e8e75fd157038fe5f2829b0f69c6811c2033/coverage-7.14.2-cp311-cp311-musllinux_1_2_riscv64.whl", hash = "sha256:c9c2795ee3692097ff226ab806005d36bb9691fca9b35353542b57ea749cc830", size = 250772, upload-time = "2026-06-20T14:47:42.306Z" },
+    { url = "https://files.pythonhosted.org/packages/26/f4/45019da4cd6cd1df3042476447449d62a76a201f6b3556aa40ac31bce20b/coverage-7.14.2-cp311-cp311-musllinux_1_2_x86_64.whl", hash = "sha256:2f5cc48a845d755b6db236f8c29c2b54773eb4c7e4ee2ead43812d73718784b0", size = 251679, upload-time = "2026-06-20T14:47:43.703Z" },
+    { url = "https://files.pythonhosted.org/packages/92/e5/76d75fa2ffe0285d3f2608d1bb241fc245cf98fe614d52118427dd6ccdaa/coverage-7.14.2-cp311-cp311-win32.whl", hash = "sha256:9c61cb7eaabcfa609c5bc0f5ff5869d72a2f02f17994e5fba5f971de516f3c82", size = 222445, upload-time = "2026-06-20T14:47:45.137Z" },
+    { url = "https://files.pythonhosted.org/packages/57/59/696c64547e5c8b9ed31532e9c7a5f9b6474054da93f8ab07f8baf7365c57/coverage-7.14.2-cp311-cp311-win_amd64.whl", hash = "sha256:e715909b0966d1774d8a26e14e2f4a3ae75909dca526901c6306286b2dcbfbdc", size = 222922, upload-time = "2026-06-20T14:47:46.67Z" },
+    { url = "https://files.pythonhosted.org/packages/63/72/646a28100462996c11b98e27d6786cd61f48100d1479804846a3e1e5bf9b/coverage-7.14.2-cp311-cp311-win_arm64.whl", hash = "sha256:9193f7150937a4fd836b10eaa123e15d98e961d1fabac07e60adf2d4785f888a", size = 222468, upload-time = "2026-06-20T14:47:48.119Z" },
+    { url = "https://files.pythonhosted.org/packages/d0/d9/bdd141aa2c605096a8ef63b8435fd4f5fec78946a3cb7b9145840ec78291/coverage-7.14.2-cp312-cp312-macosx_10_13_x86_64.whl", hash = "sha256:37c94712e533ea06f0b1e4d934811c520b1914ce0e4da3916220717aa7a86bc6", size = 220528, upload-time = "2026-06-20T14:47:49.652Z" },
+    { url = "https://files.pythonhosted.org/packages/02/97/d24ae7d2afc62c54a36313d4dedb655c9afbba3003f0f7f1ae81e97af31f/coverage-7.14.2-cp312-cp312-macosx_11_0_arm64.whl", hash = "sha256:c050bbc7bba94c77e4ed7438f4fda1babe98ab145691d80aa6f60df934a1468b", size = 220883, upload-time = "2026-06-20T14:47:51.036Z" },
+    { url = "https://files.pythonhosted.org/packages/f8/0e/d8f00efd3df0d63e6843ebcbade9e4119d60f5376753c9705d84b014c775/coverage-7.14.2-cp312-cp312-manylinux1_i686.manylinux_2_28_i686.manylinux_2_5_i686.whl", hash = "sha256:a7af571767a2ee342a171c16fc1b1a07a0bf511606d381703fb7cf397fe49d46", size = 252395, upload-time = "2026-06-20T14:47:52.627Z" },
+    { url = "https://files.pythonhosted.org/packages/1c/1c/ab9510dfe1a16a35a10f90efad0d9a9cf61b9876973752968f2ba882f73f/coverage-7.14.2-cp312-cp312-manylinux1_x86_64.manylinux_2_28_x86_64.manylinux_2_5_x86_64.whl", hash = "sha256:8b4910cce599cd2438f8da65f5ef199a70a1cdb6ab314926df78271ca5954240", size = 255131, upload-time = "2026-06-20T14:47:54.235Z" },
+    { url = "https://files.pythonhosted.org/packages/ba/dd/70171e9371003b33dc6b20f527ac216ff91bbe5c1088e754eb8950d79193/coverage-7.14.2-cp312-cp312-manylinux2014_aarch64.manylinux_2_17_aarch64.manylinux_2_28_aarch64.whl", hash = "sha256:c33e9e4878972f430b0cc06de3bf2a28d054a9efb4f8426d27de0d9cb81396ff", size = 256246, upload-time = "2026-06-20T14:47:55.61Z" },
+    { url = "https://files.pythonhosted.org/packages/0f/80/a68b1dd81d5c011e17fd6ab0d707d33297df1d0c618114b9b750a2219c80/coverage-7.14.2-cp312-cp312-manylinux2014_ppc64le.manylinux_2_17_ppc64le.manylinux_2_28_ppc64le.whl", hash = "sha256:e7967ea55c6dea6becba4d5870e2fa0aa4915a8be7ebff1bb79e6207aa75ce8d", size = 258504, upload-time = "2026-06-20T14:47:56.979Z" },
+    { url = "https://files.pythonhosted.org/packages/8e/7b/40baaa946189f5317cd77d484e39b9b0727d02ebada0a12162374f2faee2/coverage-7.14.2-cp312-cp312-manylinux_2_31_riscv64.manylinux_2_39_riscv64.whl", hash = "sha256:d1322f237c2979b84096f4239c17828ff17fea6b3bbe96c44381c5f587c44c26", size = 252808, upload-time = "2026-06-20T14:47:58.418Z" },
+    { url = "https://files.pythonhosted.org/packages/d5/05/b19517b09c43d1e8591de6c13178b0c03166c31e1adbebda378e64c66b9a/coverage-7.14.2-cp312-cp312-musllinux_1_2_aarch64.whl", hash = "sha256:77849525340c99f516d793dddbcee16b18d50af892ac43c8de1a6f343d41e3b5", size = 254166, upload-time = "2026-06-20T14:48:00.004Z" },
+    { url = "https://files.pythonhosted.org/packages/ae/f5/6e65da5957e041d2094a9b97736628dd80160f1cc007a50790bbb2668c1a/coverage-7.14.2-cp312-cp312-musllinux_1_2_i686.whl", hash = "sha256:ef11695493ec3f06f7b2678ca274bcabb4ca04057317df268ddbfd8b05f661a8", size = 252310, upload-time = "2026-06-20T14:48:01.458Z" },
+    { url = "https://files.pythonhosted.org/packages/2d/de/01b5274f0db63175b04d9354eff68d2d268b8b57a1b2db7d3dcb1f2c9dbb/coverage-7.14.2-cp312-cp312-musllinux_1_2_ppc64le.whl", hash = "sha256:8134f0e0723e080d1c27bbe8fc149f0162e429fa1852482150015d0fce83eaf1", size = 256379, upload-time = "2026-06-20T14:48:02.981Z" },
+    { url = "https://files.pythonhosted.org/packages/71/d6/9a2ffbca41e2f8f86f61e8b78b86afa433ec8cdeac4908ace93a28fe3ff0/coverage-7.14.2-cp312-cp312-musllinux_1_2_riscv64.whl", hash = "sha256:914eead2b843fc357f733b3fe39cc94f1b53d466e8cfe03080b1ed9d24ccfc73", size = 251880, upload-time = "2026-06-20T14:48:04.463Z" },
+    { url = "https://files.pythonhosted.org/packages/e3/ff/20bd54a43c88c08f474e6cb355a97e024e38412873ef0a581629abe1e26f/coverage-7.14.2-cp312-cp312-musllinux_1_2_x86_64.whl", hash = "sha256:e4b2d5e847fb7958583b74910cc19e5ec4ece514487385677b26433b2546116e", size = 253753, upload-time = "2026-06-20T14:48:05.99Z" },
+    { url = "https://files.pythonhosted.org/packages/35/2a/2b3482c30d8344f301d8df6ff232a321f2ab87d5ac97ba21891a68638131/coverage-7.14.2-cp312-cp312-win32.whl", hash = "sha256:e753db9e40dda7302e0ac3e1e6e1325fb7f7b4694f87a7314ab15dd5d57911a7", size = 222584, upload-time = "2026-06-20T14:48:07.361Z" },
+    { url = "https://files.pythonhosted.org/packages/f6/5e/83934ffff147edd313fe925db426e8f7ccad9e4663262eb5c4db4e345658/coverage-7.14.2-cp312-cp312-win_amd64.whl", hash = "sha256:d32e5ca5f16dafb269ee50b60d32b00c704b3f6f78e238105f1d94a3a5f24bf5", size = 223118, upload-time = "2026-06-20T14:48:08.837Z" },
+    { url = "https://files.pythonhosted.org/packages/bb/ee/616b4f38a34f076f3045d3eedfa764d34d82e6a6cc6b300acb0f1ff22a98/coverage-7.14.2-cp312-cp312-win_arm64.whl", hash = "sha256:dc366f158e2fb2add9d4e57338ca48f12611024278688ee657eb0b853fcb5de5", size = 222504, upload-time = "2026-06-20T14:48:10.436Z" },
+    { url = "https://files.pythonhosted.org/packages/6d/09/b5b334c27960e7aac0003b96491bada7838dc641099fa64a1a598abf33cd/coverage-7.14.2-cp313-cp313-macosx_10_13_x86_64.whl", hash = "sha256:e5f077641a6713ce9d38df9e85d4fb9e008677fc0775cbaeb32ddfc3b319d4ca", size = 220552, upload-time = "2026-06-20T14:48:11.847Z" },
+    { url = "https://files.pythonhosted.org/packages/79/20/879a000c319b4df7b50e4d688c0f7c0f6b5ac9d7b18848cbc00eabf26efe/coverage-7.14.2-cp313-cp313-macosx_11_0_arm64.whl", hash = "sha256:0907f39b49ae818fe8af50aaa0f19afbc8ca164aea0865181ca7af17a3ac690b", size = 220919, upload-time = "2026-06-20T14:48:13.397Z" },
+    { url = "https://files.pythonhosted.org/packages/f6/b7/326dded4371bab60f42215797944a356e4d81a3cee106121c7f7dd531604/coverage-7.14.2-cp313-cp313-manylinux1_i686.manylinux_2_28_i686.manylinux_2_5_i686.whl", hash = "sha256:5734d47669118d75c28981e562d4530ceb77342d31ffef6def5edd5ad4f05d7b", size = 251917, upload-time = "2026-06-20T14:48:14.931Z" },
+    { url = "https://files.pythonhosted.org/packages/eb/14/b3232ba218a0d1a70883d2675f18ff465de9e8e5e3346e81dc2b079838bd/coverage-7.14.2-cp313-cp313-manylinux1_x86_64.manylinux_2_28_x86_64.manylinux_2_5_x86_64.whl", hash = "sha256:1d9a1b5813d00ea6151f6ccf64d1fa16892771dfdda12ba87162d15ec4ea3e1e", size = 254515, upload-time = "2026-06-20T14:48:16.545Z" },
+    { url = "https://files.pythonhosted.org/packages/b7/7a/d77bcbee1cad71b42776574114b462225cc9125b4982f43da1b66adc850f/coverage-7.14.2-cp313-cp313-manylinux2014_aarch64.manylinux_2_17_aarch64.manylinux_2_28_aarch64.whl", hash = "sha256:9f0a80f4c8ac3f774210b1cc1bc0e31e75502f2818dda9a144ff90e702c4d91d", size = 255749, upload-time = "2026-06-20T14:48:18.214Z" },
+    { url = "https://files.pythonhosted.org/packages/86/86/97377937b29e9e44a1529bb20cb74dbcf80ed9006d87d7e742ff69e44b67/coverage-7.14.2-cp313-cp313-manylinux2014_ppc64le.manylinux_2_17_ppc64le.manylinux_2_28_ppc64le.whl", hash = "sha256:c2e66f3f22d6c1515ce70f2e7c3e9c6f3ff0ff33480125c9f9c53e8f6508e30f", size = 257882, upload-time = "2026-06-20T14:48:19.7Z" },
+    { url = "https://files.pythonhosted.org/packages/c1/a4/0fc8fe68bc505450bb068a2823ac7797bd8495240ccb8b4a5a1da1ee7e62/coverage-7.14.2-cp313-cp313-manylinux_2_31_riscv64.manylinux_2_39_riscv64.whl", hash = "sha256:6a2c37c3114f87ca7f10113756026eecb49656514debad600dcbec21f355ccea", size = 252144, upload-time = "2026-06-20T14:48:21.176Z" },
+    { url = "https://files.pythonhosted.org/packages/8d/4a/450094ddc41ab0d2eb4a0457b3856400ea3329568d1303696e85de099ae6/coverage-7.14.2-cp313-cp313-musllinux_1_2_aarch64.whl", hash = "sha256:3b16a7959d04b1497281c062c180413565c3f3469211d78799ad5b9a75f67796", size = 253882, upload-time = "2026-06-20T14:48:22.701Z" },
+    { url = "https://files.pythonhosted.org/packages/d0/28/2f6ae6d98265d9aa6bac311c4a93403675905b03aca95dc4373080279d75/coverage-7.14.2-cp313-cp313-musllinux_1_2_i686.whl", hash = "sha256:6466c6999545cf00c4c142dfcbbf2db396dc735f005dcf8f91d57e351a79472b", size = 251846, upload-time = "2026-06-20T14:48:24.295Z" },
+    { url = "https://files.pythonhosted.org/packages/c2/6e/707281468400794d52874e8fb5e38ff7578a0ff32ed49fe4fe85f192d0fc/coverage-7.14.2-cp313-cp313-musllinux_1_2_ppc64le.whl", hash = "sha256:5c60915ebb8f562317ba5ff6b8c32e25c0882289b201a9f2fb2987f91efd95d8", size = 256002, upload-time = "2026-06-20T14:48:26.015Z" },
+    { url = "https://files.pythonhosted.org/packages/c2/83/5e963120de4011257a950ce4cfb7fc833ddf3fee19db495268d3dec28154/coverage-7.14.2-cp313-cp313-musllinux_1_2_riscv64.whl", hash = "sha256:33b830850488acbcd358c78a4fecfafe7031667b4da8ddff5546295dc962cdeb", size = 251665, upload-time = "2026-06-20T14:48:27.654Z" },
+    { url = "https://files.pythonhosted.org/packages/e9/78/66b482cd525083bcc0bc894c16db79dabac37490065b53b07d6e8ab77202/coverage-7.14.2-cp313-cp313-musllinux_1_2_x86_64.whl", hash = "sha256:d0f845539230b8269aec902bc978b0cc403f52f002d18a04492efc943404d0bc", size = 253435, upload-time = "2026-06-20T14:48:29.354Z" },
+    { url = "https://files.pythonhosted.org/packages/e6/61/0663fb8cb530c8b11819b920109694eee95a3b22960a9495be0200f657f1/coverage-7.14.2-cp313-cp313-win32.whl", hash = "sha256:a8ac51a2e441e9119b9395f4d893fbc4934c64c8ba58be9b9eaa85591249e548", size = 222591, upload-time = "2026-06-20T14:48:31.142Z" },
+    { url = "https://files.pythonhosted.org/packages/a6/47/1536d2b009c2848c3682500f497053f4645e70911afe02f594000997831a/coverage-7.14.2-cp313-cp313-win_amd64.whl", hash = "sha256:039b264cdb31c44b48f9821e2afbf8f37df49e0fb837e24a942918b36c567e31", size = 223134, upload-time = "2026-06-20T14:48:32.696Z" },
+    { url = "https://files.pythonhosted.org/packages/28/9a/33ba4f335dd60bb34350318283d784f46018070e67b7d4df7c910ec9d9a0/coverage-7.14.2-cp313-cp313-win_arm64.whl", hash = "sha256:7f2ef591e381cc36b8e53334e1b842c760c520c8a52d01e8626209400e93fe6a", size = 222529, upload-time = "2026-06-20T14:48:34.237Z" },
+    { url = "https://files.pythonhosted.org/packages/fc/bc/120390669817ede714ab141ae0a2a73240fd7354aac992c41dc0bd19570f/coverage-7.14.2-cp314-cp314-macosx_10_15_x86_64.whl", hash = "sha256:7a0d1f026b72d627fa5c8a57cbc86ad209b64aa2a65833c83b290ace5cbee126", size = 220593, upload-time = "2026-06-20T14:48:35.755Z" },
+    { url = "https://files.pythonhosted.org/packages/4f/a3/7f1cfacd76af91e585f7ad689d7168002b444ed2a8ce59f2daaff10089b5/coverage-7.14.2-cp314-cp314-macosx_11_0_arm64.whl", hash = "sha256:4d2b86f81c1c9310a7e774e3cc9e927a3d0bf583ecbfa01498dd626930025428", size = 220925, upload-time = "2026-06-20T14:48:37.35Z" },
+    { url = "https://files.pythonhosted.org/packages/e7/10/6514b2525bb672eb8b43703e46d061d694111db21efe7609db722df2233f/coverage-7.14.2-cp314-cp314-manylinux1_i686.manylinux_2_28_i686.manylinux_2_5_i686.whl", hash = "sha256:d76bdc1f9396ae70a55d050cf9743d88141c62ce0a22a3f627fab1d11c2f8bc6", size = 251974, upload-time = "2026-06-20T14:48:39.109Z" },
+    { url = "https://files.pythonhosted.org/packages/23/b4/4533091541c6620ecd68115bbfa1c61265b775618adef3a5fd137f4582e9/coverage-7.14.2-cp314-cp314-manylinux1_x86_64.manylinux_2_28_x86_64.manylinux_2_5_x86_64.whl", hash = "sha256:cda36d8e7bfd63b3e44e75163265429caa5d935b672b00f71bccc8c010518c64", size = 254479, upload-time = "2026-06-20T14:48:40.871Z" },
+    { url = "https://files.pythonhosted.org/packages/06/af/e251a143d5d106385dbca696c553afab6b69f7f6bc376a34e089cc0b8b32/coverage-7.14.2-cp314-cp314-manylinux2014_aarch64.manylinux_2_17_aarch64.manylinux_2_28_aarch64.whl", hash = "sha256:0904f3b79d7b845bef0715afe1900da634d12b97f05b9479cb472880ca07cb9c", size = 255824, upload-time = "2026-06-20T14:48:42.608Z" },
+    { url = "https://files.pythonhosted.org/packages/9c/53/9e5876e60efbaa79d743d1948a5015ddc05b808db1cd62228acf83e87d43/coverage-7.14.2-cp314-cp314-manylinux2014_ppc64le.manylinux_2_17_ppc64le.manylinux_2_28_ppc64le.whl", hash = "sha256:b6795ca4198d6cb7fc2c6163214f6555a6bc5f0ae1e268e76139dec4b37c4499", size = 258139, upload-time = "2026-06-20T14:48:44.263Z" },
+    { url = "https://files.pythonhosted.org/packages/85/5a/d35a4f431fb594e46b81cad4a13b470b017e918f347c1c0b260f7494fa1e/coverage-7.14.2-cp314-cp314-manylinux_2_31_riscv64.manylinux_2_39_riscv64.whl", hash = "sha256:c41e9b60fc0fa57f5d73306417d2f9d668202cca6944f9435878c55a5e7ae213", size = 252002, upload-time = "2026-06-20T14:48:45.961Z" },
+    { url = "https://files.pythonhosted.org/packages/0c/e2/f5b304c8139c606c4f1b230d3a257d0c88edfbbdf06c58364f07625dc45c/coverage-7.14.2-cp314-cp314-musllinux_1_2_aarch64.whl", hash = "sha256:419d2aadd5746efc2e9df0f33c05570d8192e6f6a6098ab05acce586f44ce8a5", size = 253832, upload-time = "2026-06-20T14:48:47.582Z" },
+    { url = "https://files.pythonhosted.org/packages/86/bc/bbbd283daa6be4f68aad4ad4066fd39ae98e4174db8c03ab26c5803d6234/coverage-7.14.2-cp314-cp314-musllinux_1_2_i686.whl", hash = "sha256:1c5d273c5f1411c0d26c4f066c398d4a434b1f97bb5fa409189bedce86d4add4", size = 251799, upload-time = "2026-06-20T14:48:49.42Z" },
+    { url = "https://files.pythonhosted.org/packages/69/8d/0745fceb89c9e5f7dd8ed243d97dc8561b7a95545741e2409d2b34654824/coverage-7.14.2-cp314-cp314-musllinux_1_2_ppc64le.whl", hash = "sha256:5fe465bc691264adce601527a972990c1174075d86bcbe9968fd20c95e0b1948", size = 256075, upload-time = "2026-06-20T14:48:51.065Z" },
+    { url = "https://files.pythonhosted.org/packages/a2/a0/441d9a5255cf021ab41ee00c014a4607d1c72d5e5bef0a4fdaa5be86a907/coverage-7.14.2-cp314-cp314-musllinux_1_2_riscv64.whl", hash = "sha256:6fbb61617af1c56f95d53170ae9fa6c9aef6de1abd02fcc50064bfc672efb18d", size = 251612, upload-time = "2026-06-20T14:48:52.653Z" },
+    { url = "https://files.pythonhosted.org/packages/50/37/3d19c5e32d4a529c068eb296abfa3e455bd2c0f9311ecf26280f408ff8e0/coverage-7.14.2-cp314-cp314-musllinux_1_2_x86_64.whl", hash = "sha256:e1eff22b831dfd5694989cc1f0789980f18391f614ac67c851af9a8e6d25e9ba", size = 253270, upload-time = "2026-06-20T14:48:54.3Z" },
+    { url = "https://files.pythonhosted.org/packages/3d/b0/54dd13937297518da6d092cc2c39d9340ec2194bdfa92e0a64694d643e23/coverage-7.14.2-cp314-cp314-win32.whl", hash = "sha256:58e91be0a233adef698d3e6be54f10401bb91fd7854c0d4c4d50e0d3711e72f1", size = 222796, upload-time = "2026-06-20T14:48:56.084Z" },
+    { url = "https://files.pythonhosted.org/packages/51/45/7a10e0909919686e335fdd95869cfb222d55243ebff27dc5cf59ca259a1f/coverage-7.14.2-cp314-cp314-win_amd64.whl", hash = "sha256:d8429bf97906bfe6c61f9dbfb3342e0d88120da61939da8bd04f830cc3eab3b8", size = 223285, upload-time = "2026-06-20T14:48:57.729Z" },
+    { url = "https://files.pythonhosted.org/packages/2e/03/9cb197eb4b3d1a2eccb2537c226a93c80522c5b8afc5dd93e1993d7bb021/coverage-7.14.2-cp314-cp314-win_arm64.whl", hash = "sha256:13609d9d77249447aa73357b14831b0f3b95f275026c9ff20dd105f981f53a0c", size = 222712, upload-time = "2026-06-20T14:48:59.413Z" },
+    { url = "https://files.pythonhosted.org/packages/d6/3c/e59f498511080d20bf866b0af9eeab820feb91547dae2084cb9bb7fb0e58/coverage-7.14.2-cp314-cp314t-macosx_10_15_x86_64.whl", hash = "sha256:9818486c2bac88ae931df7e04905ee29bef49fd218c00f5f02bed4855254a101", size = 221325, upload-time = "2026-06-20T14:49:01.447Z" },
+    { url = "https://files.pythonhosted.org/packages/d3/37/8d7955f7e701e69198bd0a0132ea76518c078a635b930a4924e2ccfa70f0/coverage-7.14.2-cp314-cp314t-macosx_11_0_arm64.whl", hash = "sha256:58055adffabfa243516a197aa9f85f0dd56d905b0fba1a10193269759c29ccb0", size = 221594, upload-time = "2026-06-20T14:49:03.13Z" },
+    { url = "https://files.pythonhosted.org/packages/34/7a/6738e1e1533ce8ec4e2e472696eefdd4723864d7efaa140e433053bf576a/coverage-7.14.2-cp314-cp314t-manylinux1_i686.manylinux_2_28_i686.manylinux_2_5_i686.whl", hash = "sha256:535747dbc200349d7fb434cffcb28e770f0290f69b225f56dc3803aa7210cdea", size = 262957, upload-time = "2026-06-20T14:49:04.829Z" },
+    { url = "https://files.pythonhosted.org/packages/35/c4/d1be863cd39e0955904315fece67c5c23e046563f5eea0ceac16c547a759/coverage-7.14.2-cp314-cp314t-manylinux1_x86_64.manylinux_2_28_x86_64.manylinux_2_5_x86_64.whl", hash = "sha256:420c66e35d85c0ca5dc6a38147d83ef239762542900e5921ebbdb89333c540ea", size = 265081, upload-time = "2026-06-20T14:49:07.018Z" },
+    { url = "https://files.pythonhosted.org/packages/72/7f/412df3c3c251284a11834287fd6f7e3bb98c528c53e030589e9344a3ef80/coverage-7.14.2-cp314-cp314t-manylinux2014_aarch64.manylinux_2_17_aarch64.manylinux_2_28_aarch64.whl", hash = "sha256:f2cf17b33773be446a588551ea6a746b2d70dd0bc90dc31f1dd7648975a63c6b", size = 267500, upload-time = "2026-06-20T14:49:08.709Z" },
+    { url = "https://files.pythonhosted.org/packages/54/68/7d0764e83459455384d5c04179ce2d2a837bef01b9ba463079c6e8b31361/coverage-7.14.2-cp314-cp314t-manylinux2014_ppc64le.manylinux_2_17_ppc64le.manylinux_2_28_ppc64le.whl", hash = "sha256:adb4a5fef041f7179bb264203add873c147d169cf2f8d0adae89ff2e51271bac", size = 268619, upload-time = "2026-06-20T14:49:10.405Z" },
+    { url = "https://files.pythonhosted.org/packages/14/68/1292164ac70cbcc86ac3982da31a6fbb42bb4bcebf6e5cf73c99cfcfd50d/coverage-7.14.2-cp314-cp314t-manylinux_2_31_riscv64.manylinux_2_39_riscv64.whl", hash = "sha256:9c012ec357dec9408a83dad5541172a63c5cfa1421709f2e5811480d31ae1b28", size = 262066, upload-time = "2026-06-20T14:49:12.257Z" },
+    { url = "https://files.pythonhosted.org/packages/20/44/fd6fdf3f63b6e00a1a9230022d072ded5189576001685706aa6524187c65/coverage-7.14.2-cp314-cp314t-musllinux_1_2_aarch64.whl", hash = "sha256:dacd0ecd08fda3cb2f85b60cabea7da326dcb2fc15fbb23a88830a80144cc9f2", size = 264953, upload-time = "2026-06-20T14:49:14.13Z" },
+    { url = "https://files.pythonhosted.org/packages/39/29/e803fea3da89eaeb5b6b41b3ccd039fe9f3300a900e3803baac1a998529f/coverage-7.14.2-cp314-cp314t-musllinux_1_2_i686.whl", hash = "sha256:f27e980f2feba5dfe7a32b22b125470de69c0bd113c75e16165de909a777f512", size = 262555, upload-time = "2026-06-20T14:49:15.803Z" },
+    { url = "https://files.pythonhosted.org/packages/32/3c/b360e48ac68e3236c04cb83658382e7f5be7efbbec2e1faae3dcca432783/coverage-7.14.2-cp314-cp314t-musllinux_1_2_ppc64le.whl", hash = "sha256:105c00efb65c863630b2b63cbf7b8267e4da2d44b62284efbb19a03b04c337d4", size = 266289, upload-time = "2026-06-20T14:49:17.962Z" },
+    { url = "https://files.pythonhosted.org/packages/59/12/1ed6d9274d599c586e2d1aa9818765dcdae6bb52aa88afa2fcd868398191/coverage-7.14.2-cp314-cp314t-musllinux_1_2_riscv64.whl", hash = "sha256:571173fa04c8e8d6235ab32ae67fecca97777e2e1b4a1a30f3022c34e397c1c1", size = 261402, upload-time = "2026-06-20T14:49:19.708Z" },
+    { url = "https://files.pythonhosted.org/packages/44/17/eb6cf12a4538cda937aefbeabb15377a8a30b377b484e63d31c9da790966/coverage-7.14.2-cp314-cp314t-musllinux_1_2_x86_64.whl", hash = "sha256:e532f34d42d1a421fa00ed6b7735d14ac2e340256c1bad26a5e1dc1252b0bed7", size = 263715, upload-time = "2026-06-20T14:49:21.427Z" },
+    { url = "https://files.pythonhosted.org/packages/8a/ca/4bafdb9d372ab05d6ed3a63e7f00d3195d169d0afea00f617c026e386c19/coverage-7.14.2-cp314-cp314t-win32.whl", hash = "sha256:243971550fb46c3039257f75e65610002d84304c505f609bbd9779e20a653a0a", size = 223103, upload-time = "2026-06-20T14:49:23.24Z" },
+    { url = "https://files.pythonhosted.org/packages/35/cb/0765dbd9011d2e47315f1da31e62c5fe231f04a6ec8da213e64c4505896d/coverage-7.14.2-cp314-cp314t-win_amd64.whl", hash = "sha256:60fb0ca084a92da96474b8b405a7ea76dfecac3c68db54383e7934b6f3871169", size = 223934, upload-time = "2026-06-20T14:49:25.347Z" },
+    { url = "https://files.pythonhosted.org/packages/4e/ce/373dde027ecd0ae54511430fe7569f838d3a0376b70333ba9fd20c76b836/coverage-7.14.2-cp314-cp314t-win_arm64.whl", hash = "sha256:36a0a3f42ed7dfdbca2a69a541519ffd5064a5692152fc0018109e74370d7345", size = 223249, upload-time = "2026-06-20T14:49:27.241Z" },
+    { url = "https://files.pythonhosted.org/packages/a3/5e/a8ba14ceb014f39bd5e3f7077150718c7de61c01ce326bfe7e8eae9b19b2/coverage-7.14.2-py3-none-any.whl", hash = "sha256:04d92589e481a8b68a005a5a1e0646a91c76f322c397c4635298c57cf63699b5", size = 212325, upload-time = "2026-06-20T14:49:28.991Z" },
+]
+
+[[package]]
+name = "cuda-bindings"
+version = "13.3.1"
+source = { registry = "https://pypi.org/simple" }
+dependencies = [
+    { name = "cuda-pathfinder", marker = "sys_platform == 'linux'" },
+]
+wheels = [
+    { url = "https://files.pythonhosted.org/packages/51/6b/457ca12dad3ee9bfcc9a545cfd6b64b359ba49de40f776f6e028e678f262/cuda_bindings-13.3.1-cp311-cp311-manylinux_2_24_aarch64.manylinux_2_28_aarch64.whl", hash = "sha256:c5879712accf6e14bb01aa5e67440eb84998b8d104b509cc7a6dc0b8f656a474", size = 6053539, upload-time = "2026-05-29T23:11:43.19Z" },
+    { url = "https://files.pythonhosted.org/packages/95/7a/c5e3c34a409b148f5c0f5a4ea374158f95d488862c1dffedf9aa5c639df9/cuda_bindings-13.3.1-cp311-cp311-manylinux_2_24_x86_64.manylinux_2_28_x86_64.whl", hash = "sha256:04436a9364059c84b8f9636f359eccda1cf814341f5b670c71d80d2f79dbc708", size = 6674166, upload-time = "2026-05-29T23:11:45.478Z" },
+    { url = "https://files.pythonhosted.org/packages/ce/67/5e7dba1ba576dd73da5dee894ca076ca5e959450dfff66d6d510a255d1f7/cuda_bindings-13.3.1-cp312-cp312-manylinux_2_24_aarch64.manylinux_2_28_aarch64.whl", hash = "sha256:c7855c4868aabc0cfae28abbe83d56734bdfbd08f08fc234ac1912a12858bf49", size = 6025351, upload-time = "2026-05-29T23:11:49.685Z" },
+    { url = "https://files.pythonhosted.org/packages/39/2a/6d2e9047d1fb243dbaa364b01e0297534b9ed7fd27dba1c9f361519cf69b/cuda_bindings-13.3.1-cp312-cp312-manylinux_2_24_x86_64.manylinux_2_28_x86_64.whl", hash = "sha256:e32d08f71ebcdf00f0f41eab2eb37e8da94c8ed411cc9f7f7a019ce6b34abe3a", size = 6657965, upload-time = "2026-05-29T23:11:52.227Z" },
+    { url = "https://files.pythonhosted.org/packages/cc/6e/2394f8163360f8391f8f1b7e72d300a82724edb81a7b7084c799fbd4c91f/cuda_bindings-13.3.1-cp313-cp313-manylinux_2_24_aarch64.manylinux_2_28_aarch64.whl", hash = "sha256:9efb21c1ee64981e184b9e0ba5eb3179e5ba3d4b51665a6cb52b8ef3d01a7cbf", size = 5920504, upload-time = "2026-05-29T23:11:56.883Z" },
+    { url = "https://files.pythonhosted.org/packages/34/c2/ef9b6a63f7dc432712a462c816662e662e00d38caa9b861c8c2588195d03/cuda_bindings-13.3.1-cp313-cp313-manylinux_2_24_x86_64.manylinux_2_28_x86_64.whl", hash = "sha256:2732904099e0a4d4db774a5fc6d91ee95fae065b4d2ecabb4968c5fe2406c9d7", size = 6476660, upload-time = "2026-05-29T23:11:59.188Z" },
+    { url = "https://files.pythonhosted.org/packages/b1/81/bff68ce829999c1e4209c761bbf903b1c06ec570416ddb25020864ad5907/cuda_bindings-13.3.1-cp314-cp314-manylinux_2_24_aarch64.manylinux_2_28_aarch64.whl", hash = "sha256:1ab2f74ed65bfef4163ba07a8db16f1085e0729291db12a2423aff84ee8278b8", size = 6013639, upload-time = "2026-05-29T23:12:03.509Z" },
+    { url = "https://files.pythonhosted.org/packages/d4/e0/c8a1f0c8f9ffdea4f5fe6dbab89b326cef4d85caf489dad39e209da89416/cuda_bindings-13.3.1-cp314-cp314-manylinux_2_24_x86_64.manylinux_2_28_x86_64.whl", hash = "sha256:efd4c814d311ec08c981f6dded1dbe7d4b371067ee4f6c14cccec4bde9590f80", size = 6534419, upload-time = "2026-05-29T23:12:05.633Z" },
+    { url = "https://files.pythonhosted.org/packages/52/b8/83b1f563925b290f2d11a01a77a84013ba56052fe3653a5bef3ccfbb43d6/cuda_bindings-13.3.1-cp314-cp314t-manylinux_2_24_aarch64.manylinux_2_28_aarch64.whl", hash = "sha256:c3c772dfff49681541d59630c90f858e173ac926b9c593a2b7123f2a1043cc76", size = 5809771, upload-time = "2026-05-29T23:12:10.422Z" },
+    { url = "https://files.pythonhosted.org/packages/12/20/e79b4bfe98f075195afb6343d41c498f9dbd2d161d7021d4d28bceb83581/cuda_bindings-13.3.1-cp314-cp314t-manylinux_2_24_x86_64.manylinux_2_28_x86_64.whl", hash = "sha256:36febb7c1079d68a981dbbd8d5a67235b399802b82075c9388624719607e52b9", size = 6358584, upload-time = "2026-05-29T23:12:12.767Z" },
+]
+
+[[package]]
+name = "cuda-pathfinder"
+version = "1.5.5"
+source = { registry = "https://pypi.org/simple" }
+wheels = [
+    { url = "https://files.pythonhosted.org/packages/11/c8/26f2e4aae92f11522a96043892ba39a90eac610d5242523aa863212bc1c7/cuda_pathfinder-1.5.5-py3-none-any.whl", hash = "sha256:0228c023f95d1480f143ef5c8922d27a2ab052087a942e81dc289c9eb8f91689", size = 51671, upload-time = "2026-05-27T01:21:25.413Z" },
+]
+
+[[package]]
+name = "cuda-toolkit"
+version = "13.0.2"
+source = { registry = "https://pypi.org/simple" }
+wheels = [
+    { url = "https://files.pythonhosted.org/packages/57/b2/453099f5f3b698d7d0eab38916aac44c7f76229f451709e2eb9db6615dcd/cuda_toolkit-13.0.2-py2.py3-none-any.whl", hash = "sha256:b198824cf2f54003f50d64ada3a0f184b42ca0846c1c94192fa269ecd97a66eb", size = 2364, upload-time = "2025-12-19T23:24:07.328Z" },
+]
+
+[package.optional-dependencies]
+cudart = [
+    { name = "nvidia-cuda-runtime", marker = "sys_platform == 'linux'" },
+]
+cufft = [
+    { name = "nvidia-cufft", marker = "sys_platform == 'linux'" },
+]
+cufile = [
+    { name = "nvidia-cufile", marker = "sys_platform == 'linux'" },
+]
+cupti = [
+    { name = "nvidia-cuda-cupti", marker = "sys_platform == 'linux'" },
+]
+curand = [
+    { name = "nvidia-curand", marker = "sys_platform == 'linux'" },
+]
+cusolver = [
+    { name = "nvidia-cusolver", marker = "sys_platform == 'linux'" },
+]
+cusparse = [
+    { name = "nvidia-cusparse", marker = "sys_platform == 'linux'" },
+]
+nvjitlink = [
+    { name = "nvidia-nvjitlink", marker = "sys_platform == 'linux'" },
+]
+nvrtc = [
+    { name = "nvidia-cuda-nvrtc", marker = "sys_platform == 'linux'" },
+]
+nvtx = [
+    { name = "nvidia-nvtx", marker = "sys_platform == 'linux'" },
+]
+
+[[package]]
+name = "deepspeed"
+version = "0.19.2"
+source = { registry = "https://pypi.org/simple" }
+dependencies = [
+    { name = "einops" },
+    { name = "hjson" },
+    { name = "msgpack" },
+    { name = "ninja" },
+    { name = "numpy", version = "2.4.6", source = { registry = "https://pypi.org/simple" }, marker = "python_full_version < '3.12'" },
+    { name = "numpy", version = "2.5.0", source = { registry = "https://pypi.org/simple" }, marker = "python_full_version >= '3.12'" },
+    { name = "packaging" },
+    { name = "psutil" },
+    { name = "py-cpuinfo" },
+    { name = "pydantic" },
+    { name = "torch" },
+    { name = "tqdm" },
+]
+sdist = { url = "https://files.pythonhosted.org/packages/62/cc/1bd9a0f1545fa57a45f98597a78ef6b39ae1fac1afb3e14c70cb8b02455e/deepspeed-0.19.2.tar.gz", hash = "sha256:7e854b6ebe3d2bfa239f82958372927631c74e5324c7f08f17ce7ff5f6b06969", size = 1756950, upload-time = "2026-06-16T20:53:22.919Z" }
+
+[[package]]
+name = "distlib"
+version = "0.4.3"
+source = { registry = "https://pypi.org/simple" }
+sdist = { url = "https://files.pythonhosted.org/packages/c9/02/bd72be9134d25ed783ecbbc38a539ffaefbf90c78418c7fb7229600dbac7/distlib-0.4.3.tar.gz", hash = "sha256:f152097224a0ae24be5a0f6bae1b9359af82133bce63f98a95f86cae1aede9ed", size = 615141, upload-time = "2026-06-12T08:04:52.847Z" }
+wheels = [
+    { url = "https://files.pythonhosted.org/packages/02/08/9c41fb51ab5b43eb21674aff13df270e8ba6c4b29c8624e328dc7a9482af/distlib-0.4.3-py2.py3-none-any.whl", hash = "sha256:4b0ce306c966eb73bc3a7b6abad017c556dadd92c44701562cd528ac7fde4d5b", size = 470628, upload-time = "2026-06-12T08:04:50.506Z" },
+]
+
+[[package]]
+name = "einops"
+version = "0.8.2"
+source = { registry = "https://pypi.org/simple" }
+sdist = { url = "https://files.pythonhosted.org/packages/2c/77/850bef8d72ffb9219f0b1aac23fbc1bf7d038ee6ea666f331fa273031aa2/einops-0.8.2.tar.gz", hash = "sha256:609da665570e5e265e27283aab09e7f279ade90c4f01bcfca111f3d3e13f2827", size = 56261, upload-time = "2026-01-26T04:13:17.638Z" }
+wheels = [
+    { url = "https://files.pythonhosted.org/packages/2a/09/f8d8f8f31e4483c10a906437b4ce31bdf3d6d417b73fe33f1a8b59e34228/einops-0.8.2-py3-none-any.whl", hash = "sha256:54058201ac7087911181bfec4af6091bb59380360f069276601256a76af08193", size = 65638, upload-time = "2026-01-26T04:13:18.546Z" },
+]
+
+[[package]]
+name = "filelock"
+version = "3.29.4"
+source = { registry = "https://pypi.org/simple" }
+sdist = { url = "https://files.pythonhosted.org/packages/e6/dc/be6cbe99670cd6e4ad387123647cb08e0c32975e223f82551e914c5568a6/filelock-3.29.4.tar.gz", hash = "sha256:10cdb3656fc44541cdf30652a93fb10ec6b05325620eb316bd26893e4201538a", size = 63028, upload-time = "2026-06-13T16:12:00.744Z" }
+wheels = [
+    { url = "https://files.pythonhosted.org/packages/13/37/a065dc3bd6e49423a6532c642ca7378d3f467b1ef44c2800c937af7f9739/filelock-3.29.4-py3-none-any.whl", hash = "sha256:dac1648087d5115554850d113e7dd8c83ab2d38e3435dde2d4f163847e57b767", size = 42757, upload-time = "2026-06-13T16:11:59.582Z" },
+]
+
+[[package]]
+name = "frozenlist"
+version = "1.8.0"
+source = { registry = "https://pypi.org/simple" }
+sdist = { url = "https://files.pythonhosted.org/packages/2d/f5/c831fac6cc817d26fd54c7eaccd04ef7e0288806943f7cc5bbf69f3ac1f0/frozenlist-1.8.0.tar.gz", hash = "sha256:3ede829ed8d842f6cd48fc7081d7a41001a56f1f38603f9d49bf3020d59a31ad", size = 45875, upload-time = "2025-10-06T05:38:17.865Z" }
+wheels = [
+    { url = "https://files.pythonhosted.org/packages/bc/03/077f869d540370db12165c0aa51640a873fb661d8b315d1d4d67b284d7ac/frozenlist-1.8.0-cp311-cp311-macosx_10_9_universal2.whl", hash = "sha256:09474e9831bc2b2199fad6da3c14c7b0fbdd377cce9d3d77131be28906cb7d84", size = 86912, upload-time = "2025-10-06T05:35:45.98Z" },
+    { url = "https://files.pythonhosted.org/packages/df/b5/7610b6bd13e4ae77b96ba85abea1c8cb249683217ef09ac9e0ae93f25a91/frozenlist-1.8.0-cp311-cp311-macosx_10_9_x86_64.whl", hash = "sha256:17c883ab0ab67200b5f964d2b9ed6b00971917d5d8a92df149dc2c9779208ee9", size = 50046, upload-time = "2025-10-06T05:35:47.009Z" },
+    { url = "https://files.pythonhosted.org/packages/6e/ef/0e8f1fe32f8a53dd26bdd1f9347efe0778b0fddf62789ea683f4cc7d787d/frozenlist-1.8.0-cp311-cp311-macosx_11_0_arm64.whl", hash = "sha256:fa47e444b8ba08fffd1c18e8cdb9a75db1b6a27f17507522834ad13ed5922b93", size = 50119, upload-time = "2025-10-06T05:35:48.38Z" },
+    { url = "https://files.pythonhosted.org/packages/11/b1/71a477adc7c36e5fb628245dfbdea2166feae310757dea848d02bd0689fd/frozenlist-1.8.0-cp311-cp311-manylinux1_x86_64.manylinux_2_28_x86_64.manylinux_2_5_x86_64.whl", hash = "sha256:2552f44204b744fba866e573be4c1f9048d6a324dfe14475103fd51613eb1d1f", size = 231067, upload-time = "2025-10-06T05:35:49.97Z" },
+    { url = "https://files.pythonhosted.org/packages/45/7e/afe40eca3a2dc19b9904c0f5d7edfe82b5304cb831391edec0ac04af94c2/frozenlist-1.8.0-cp311-cp311-manylinux2014_aarch64.manylinux_2_17_aarch64.manylinux_2_28_aarch64.whl", hash = "sha256:957e7c38f250991e48a9a73e6423db1bb9dd14e722a10f6b8bb8e16a0f55f695", size = 233160, upload-time = "2025-10-06T05:35:51.729Z" },
+    { url = "https://files.pythonhosted.org/packages/a6/aa/7416eac95603ce428679d273255ffc7c998d4132cfae200103f164b108aa/frozenlist-1.8.0-cp311-cp311-manylinux2014_armv7l.manylinux_2_17_armv7l.manylinux_2_31_armv7l.whl", hash = "sha256:8585e3bb2cdea02fc88ffa245069c36555557ad3609e83be0ec71f54fd4abb52", size = 228544, upload-time = "2025-10-06T05:35:53.246Z" },
+    { url = "https://files.pythonhosted.org/packages/8b/3d/2a2d1f683d55ac7e3875e4263d28410063e738384d3adc294f5ff3d7105e/frozenlist-1.8.0-cp311-cp311-manylinux2014_ppc64le.manylinux_2_17_ppc64le.manylinux_2_28_ppc64le.whl", hash = "sha256:edee74874ce20a373d62dc28b0b18b93f645633c2943fd90ee9d898550770581", size = 243797, upload-time = "2025-10-06T05:35:54.497Z" },
+    { url = "https://files.pythonhosted.org/packages/78/1e/2d5565b589e580c296d3bb54da08d206e797d941a83a6fdea42af23be79c/frozenlist-1.8.0-cp311-cp311-manylinux2014_s390x.manylinux_2_17_s390x.manylinux_2_28_s390x.whl", hash = "sha256:c9a63152fe95756b85f31186bddf42e4c02c6321207fd6601a1c89ebac4fe567", size = 247923, upload-time = "2025-10-06T05:35:55.861Z" },
+    { url = "https://files.pythonhosted.org/packages/aa/c3/65872fcf1d326a7f101ad4d86285c403c87be7d832b7470b77f6d2ed5ddc/frozenlist-1.8.0-cp311-cp311-musllinux_1_2_aarch64.whl", hash = "sha256:b6db2185db9be0a04fecf2f241c70b63b1a242e2805be291855078f2b404dd6b", size = 230886, upload-time = "2025-10-06T05:35:57.399Z" },
+    { url = "https://files.pythonhosted.org/packages/a0/76/ac9ced601d62f6956f03cc794f9e04c81719509f85255abf96e2510f4265/frozenlist-1.8.0-cp311-cp311-musllinux_1_2_armv7l.whl", hash = "sha256:f4be2e3d8bc8aabd566f8d5b8ba7ecc09249d74ba3c9ed52e54dc23a293f0b92", size = 245731, upload-time = "2025-10-06T05:35:58.563Z" },
+    { url = "https://files.pythonhosted.org/packages/b9/49/ecccb5f2598daf0b4a1415497eba4c33c1e8ce07495eb07d2860c731b8d5/frozenlist-1.8.0-cp311-cp311-musllinux_1_2_ppc64le.whl", hash = "sha256:c8d1634419f39ea6f5c427ea2f90ca85126b54b50837f31497f3bf38266e853d", size = 241544, upload-time = "2025-10-06T05:35:59.719Z" },
+    { url = "https://files.pythonhosted.org/packages/53/4b/ddf24113323c0bbcc54cb38c8b8916f1da7165e07b8e24a717b4a12cbf10/frozenlist-1.8.0-cp311-cp311-musllinux_1_2_s390x.whl", hash = "sha256:1a7fa382a4a223773ed64242dbe1c9c326ec09457e6b8428efb4118c685c3dfd", size = 241806, upload-time = "2025-10-06T05:36:00.959Z" },
+    { url = "https://files.pythonhosted.org/packages/a7/fb/9b9a084d73c67175484ba2789a59f8eebebd0827d186a8102005ce41e1ba/frozenlist-1.8.0-cp311-cp311-musllinux_1_2_x86_64.whl", hash = "sha256:11847b53d722050808926e785df837353bd4d75f1d494377e59b23594d834967", size = 229382, upload-time = "2025-10-06T05:36:02.22Z" },
+    { url = "https://files.pythonhosted.org/packages/95/a3/c8fb25aac55bf5e12dae5c5aa6a98f85d436c1dc658f21c3ac73f9fa95e5/frozenlist-1.8.0-cp311-cp311-win32.whl", hash = "sha256:27c6e8077956cf73eadd514be8fb04d77fc946a7fe9f7fe167648b0b9085cc25", size = 39647, upload-time = "2025-10-06T05:36:03.409Z" },
+    { url = "https://files.pythonhosted.org/packages/0a/f5/603d0d6a02cfd4c8f2a095a54672b3cf967ad688a60fb9faf04fc4887f65/frozenlist-1.8.0-cp311-cp311-win_amd64.whl", hash = "sha256:ac913f8403b36a2c8610bbfd25b8013488533e71e62b4b4adce9c86c8cea905b", size = 44064, upload-time = "2025-10-06T05:36:04.368Z" },
+    { url = "https://files.pythonhosted.org/packages/5d/16/c2c9ab44e181f043a86f9a8f84d5124b62dbcb3a02c0977ec72b9ac1d3e0/frozenlist-1.8.0-cp311-cp311-win_arm64.whl", hash = "sha256:d4d3214a0f8394edfa3e303136d0575eece0745ff2b47bd2cb2e66dd92d4351a", size = 39937, upload-time = "2025-10-06T05:36:05.669Z" },
+    { url = "https://files.pythonhosted.org/packages/69/29/948b9aa87e75820a38650af445d2ef2b6b8a6fab1a23b6bb9e4ef0be2d59/frozenlist-1.8.0-cp312-cp312-macosx_10_13_universal2.whl", hash = "sha256:78f7b9e5d6f2fdb88cdde9440dc147259b62b9d3b019924def9f6478be254ac1", size = 87782, upload-time = "2025-10-06T05:36:06.649Z" },
+    { url = "https://files.pythonhosted.org/packages/64/80/4f6e318ee2a7c0750ed724fa33a4bdf1eacdc5a39a7a24e818a773cd91af/frozenlist-1.8.0-cp312-cp312-macosx_10_13_x86_64.whl", hash = "sha256:229bf37d2e4acdaf808fd3f06e854a4a7a3661e871b10dc1f8f1896a3b05f18b", size = 50594, upload-time = "2025-10-06T05:36:07.69Z" },
+    { url = "https://files.pythonhosted.org/packages/2b/94/5c8a2b50a496b11dd519f4a24cb5496cf125681dd99e94c604ccdea9419a/frozenlist-1.8.0-cp312-cp312-macosx_11_0_arm64.whl", hash = "sha256:f833670942247a14eafbb675458b4e61c82e002a148f49e68257b79296e865c4", size = 50448, upload-time = "2025-10-06T05:36:08.78Z" },
+    { url = "https://files.pythonhosted.org/packages/6a/bd/d91c5e39f490a49df14320f4e8c80161cfcce09f1e2cde1edd16a551abb3/frozenlist-1.8.0-cp312-cp312-manylinux1_x86_64.manylinux_2_28_x86_64.manylinux_2_5_x86_64.whl", hash = "sha256:494a5952b1c597ba44e0e78113a7266e656b9794eec897b19ead706bd7074383", size = 242411, upload-time = "2025-10-06T05:36:09.801Z" },
+    { url = "https://files.pythonhosted.org/packages/8f/83/f61505a05109ef3293dfb1ff594d13d64a2324ac3482be2cedc2be818256/frozenlist-1.8.0-cp312-cp312-manylinux2014_aarch64.manylinux_2_17_aarch64.manylinux_2_28_aarch64.whl", hash = "sha256:96f423a119f4777a4a056b66ce11527366a8bb92f54e541ade21f2374433f6d4", size = 243014, upload-time = "2025-10-06T05:36:11.394Z" },
+    { url = "https://files.pythonhosted.org/packages/d8/cb/cb6c7b0f7d4023ddda30cf56b8b17494eb3a79e3fda666bf735f63118b35/frozenlist-1.8.0-cp312-cp312-manylinux2014_armv7l.manylinux_2_17_armv7l.manylinux_2_31_armv7l.whl", hash = "sha256:3462dd9475af2025c31cc61be6652dfa25cbfb56cbbf52f4ccfe029f38decaf8", size = 234909, upload-time = "2025-10-06T05:36:12.598Z" },
+    { url = "https://files.pythonhosted.org/packages/31/c5/cd7a1f3b8b34af009fb17d4123c5a778b44ae2804e3ad6b86204255f9ec5/frozenlist-1.8.0-cp312-cp312-manylinux2014_ppc64le.manylinux_2_17_ppc64le.manylinux_2_28_ppc64le.whl", hash = "sha256:c4c800524c9cd9bac5166cd6f55285957fcfc907db323e193f2afcd4d9abd69b", size = 250049, upload-time = "2025-10-06T05:36:14.065Z" },
+    { url = "https://files.pythonhosted.org/packages/c0/01/2f95d3b416c584a1e7f0e1d6d31998c4a795f7544069ee2e0962a4b60740/frozenlist-1.8.0-cp312-cp312-manylinux2014_s390x.manylinux_2_17_s390x.manylinux_2_28_s390x.whl", hash = "sha256:d6a5df73acd3399d893dafc71663ad22534b5aa4f94e8a2fabfe856c3c1b6a52", size = 256485, upload-time = "2025-10-06T05:36:15.39Z" },
+    { url = "https://files.pythonhosted.org/packages/ce/03/024bf7720b3abaebcff6d0793d73c154237b85bdf67b7ed55e5e9596dc9a/frozenlist-1.8.0-cp312-cp312-musllinux_1_2_aarch64.whl", hash = "sha256:405e8fe955c2280ce66428b3ca55e12b3c4e9c336fb2103a4937e891c69a4a29", size = 237619, upload-time = "2025-10-06T05:36:16.558Z" },
+    { url = "https://files.pythonhosted.org/packages/69/fa/f8abdfe7d76b731f5d8bd217827cf6764d4f1d9763407e42717b4bed50a0/frozenlist-1.8.0-cp312-cp312-musllinux_1_2_armv7l.whl", hash = "sha256:908bd3f6439f2fef9e85031b59fd4f1297af54415fb60e4254a95f75b3cab3f3", size = 250320, upload-time = "2025-10-06T05:36:17.821Z" },
+    { url = "https://files.pythonhosted.org/packages/f5/3c/b051329f718b463b22613e269ad72138cc256c540f78a6de89452803a47d/frozenlist-1.8.0-cp312-cp312-musllinux_1_2_ppc64le.whl", hash = "sha256:294e487f9ec720bd8ffcebc99d575f7eff3568a08a253d1ee1a0378754b74143", size = 246820, upload-time = "2025-10-06T05:36:19.046Z" },
+    { url = "https://files.pythonhosted.org/packages/0f/ae/58282e8f98e444b3f4dd42448ff36fa38bef29e40d40f330b22e7108f565/frozenlist-1.8.0-cp312-cp312-musllinux_1_2_s390x.whl", hash = "sha256:74c51543498289c0c43656701be6b077f4b265868fa7f8a8859c197006efb608", size = 250518, upload-time = "2025-10-06T05:36:20.763Z" },
+    { url = "https://files.pythonhosted.org/packages/8f/96/007e5944694d66123183845a106547a15944fbbb7154788cbf7272789536/frozenlist-1.8.0-cp312-cp312-musllinux_1_2_x86_64.whl", hash = "sha256:776f352e8329135506a1d6bf16ac3f87bc25b28e765949282dcc627af36123aa", size = 239096, upload-time = "2025-10-06T05:36:22.129Z" },
+    { url = "https://files.pythonhosted.org/packages/66/bb/852b9d6db2fa40be96f29c0d1205c306288f0684df8fd26ca1951d461a56/frozenlist-1.8.0-cp312-cp312-win32.whl", hash = "sha256:433403ae80709741ce34038da08511d4a77062aa924baf411ef73d1146e74faf", size = 39985, upload-time = "2025-10-06T05:36:23.661Z" },
+    { url = "https://files.pythonhosted.org/packages/b8/af/38e51a553dd66eb064cdf193841f16f077585d4d28394c2fa6235cb41765/frozenlist-1.8.0-cp312-cp312-win_amd64.whl", hash = "sha256:34187385b08f866104f0c0617404c8eb08165ab1272e884abc89c112e9c00746", size = 44591, upload-time = "2025-10-06T05:36:24.958Z" },
+    { url = "https://files.pythonhosted.org/packages/a7/06/1dc65480ab147339fecc70797e9c2f69d9cea9cf38934ce08df070fdb9cb/frozenlist-1.8.0-cp312-cp312-win_arm64.whl", hash = "sha256:fe3c58d2f5db5fbd18c2987cba06d51b0529f52bc3a6cdc33d3f4eab725104bd", size = 40102, upload-time = "2025-10-06T05:36:26.333Z" },
+    { url = "https://files.pythonhosted.org/packages/2d/40/0832c31a37d60f60ed79e9dfb5a92e1e2af4f40a16a29abcc7992af9edff/frozenlist-1.8.0-cp313-cp313-macosx_10_13_universal2.whl", hash = "sha256:8d92f1a84bb12d9e56f818b3a746f3efba93c1b63c8387a73dde655e1e42282a", size = 85717, upload-time = "2025-10-06T05:36:27.341Z" },
+    { url = "https://files.pythonhosted.org/packages/30/ba/b0b3de23f40bc55a7057bd38434e25c34fa48e17f20ee273bbde5e0650f3/frozenlist-1.8.0-cp313-cp313-macosx_10_13_x86_64.whl", hash = "sha256:96153e77a591c8adc2ee805756c61f59fef4cf4073a9275ee86fe8cba41241f7", size = 49651, upload-time = "2025-10-06T05:36:28.855Z" },
+    { url = "https://files.pythonhosted.org/packages/0c/ab/6e5080ee374f875296c4243c381bbdef97a9ac39c6e3ce1d5f7d42cb78d6/frozenlist-1.8.0-cp313-cp313-macosx_11_0_arm64.whl", hash = "sha256:f21f00a91358803399890ab167098c131ec2ddd5f8f5fd5fe9c9f2c6fcd91e40", size = 49417, upload-time = "2025-10-06T05:36:29.877Z" },
+    { url = "https://files.pythonhosted.org/packages/d5/4e/e4691508f9477ce67da2015d8c00acd751e6287739123113a9fca6f1604e/frozenlist-1.8.0-cp313-cp313-manylinux1_x86_64.manylinux_2_28_x86_64.manylinux_2_5_x86_64.whl", hash = "sha256:fb30f9626572a76dfe4293c7194a09fb1fe93ba94c7d4f720dfae3b646b45027", size = 234391, upload-time = "2025-10-06T05:36:31.301Z" },
+    { url = "https://files.pythonhosted.org/packages/40/76/c202df58e3acdf12969a7895fd6f3bc016c642e6726aa63bd3025e0fc71c/frozenlist-1.8.0-cp313-cp313-manylinux2014_aarch64.manylinux_2_17_aarch64.manylinux_2_28_aarch64.whl", hash = "sha256:eaa352d7047a31d87dafcacbabe89df0aa506abb5b1b85a2fb91bc3faa02d822", size = 233048, upload-time = "2025-10-06T05:36:32.531Z" },
+    { url = "https://files.pythonhosted.org/packages/f9/c0/8746afb90f17b73ca5979c7a3958116e105ff796e718575175319b5bb4ce/frozenlist-1.8.0-cp313-cp313-manylinux2014_armv7l.manylinux_2_17_armv7l.manylinux_2_31_armv7l.whl", hash = "sha256:03ae967b4e297f58f8c774c7eabcce57fe3c2434817d4385c50661845a058121", size = 226549, upload-time = "2025-10-06T05:36:33.706Z" },
+    { url = "https://files.pythonhosted.org/packages/7e/eb/4c7eefc718ff72f9b6c4893291abaae5fbc0c82226a32dcd8ef4f7a5dbef/frozenlist-1.8.0-cp313-cp313-manylinux2014_ppc64le.manylinux_2_17_ppc64le.manylinux_2_28_ppc64le.whl", hash = "sha256:f6292f1de555ffcc675941d65fffffb0a5bcd992905015f85d0592201793e0e5", size = 239833, upload-time = "2025-10-06T05:36:34.947Z" },
+    { url = "https://files.pythonhosted.org/packages/c2/4e/e5c02187cf704224f8b21bee886f3d713ca379535f16893233b9d672ea71/frozenlist-1.8.0-cp313-cp313-manylinux2014_s390x.manylinux_2_17_s390x.manylinux_2_28_s390x.whl", hash = "sha256:29548f9b5b5e3460ce7378144c3010363d8035cea44bc0bf02d57f5a685e084e", size = 245363, upload-time = "2025-10-06T05:36:36.534Z" },
+    { url = "https://files.pythonhosted.org/packages/1f/96/cb85ec608464472e82ad37a17f844889c36100eed57bea094518bf270692/frozenlist-1.8.0-cp313-cp313-musllinux_1_2_aarch64.whl", hash = "sha256:ec3cc8c5d4084591b4237c0a272cc4f50a5b03396a47d9caaf76f5d7b38a4f11", size = 229314, upload-time = "2025-10-06T05:36:38.582Z" },
+    { url = "https://files.pythonhosted.org/packages/5d/6f/4ae69c550e4cee66b57887daeebe006fe985917c01d0fff9caab9883f6d0/frozenlist-1.8.0-cp313-cp313-musllinux_1_2_armv7l.whl", hash = "sha256:517279f58009d0b1f2e7c1b130b377a349405da3f7621ed6bfae50b10adf20c1", size = 243365, upload-time = "2025-10-06T05:36:40.152Z" },
+    { url = "https://files.pythonhosted.org/packages/7a/58/afd56de246cf11780a40a2c28dc7cbabbf06337cc8ddb1c780a2d97e88d8/frozenlist-1.8.0-cp313-cp313-musllinux_1_2_ppc64le.whl", hash = "sha256:db1e72ede2d0d7ccb213f218df6a078a9c09a7de257c2fe8fcef16d5925230b1", size = 237763, upload-time = "2025-10-06T05:36:41.355Z" },
+    { url = "https://files.pythonhosted.org/packages/cb/36/cdfaf6ed42e2644740d4a10452d8e97fa1c062e2a8006e4b09f1b5fd7d63/frozenlist-1.8.0-cp313-cp313-musllinux_1_2_s390x.whl", hash = "sha256:b4dec9482a65c54a5044486847b8a66bf10c9cb4926d42927ec4e8fd5db7fed8", size = 240110, upload-time = "2025-10-06T05:36:42.716Z" },
+    { url = "https://files.pythonhosted.org/packages/03/a8/9ea226fbefad669f11b52e864c55f0bd57d3c8d7eb07e9f2e9a0b39502e1/frozenlist-1.8.0-cp313-cp313-musllinux_1_2_x86_64.whl", hash = "sha256:21900c48ae04d13d416f0e1e0c4d81f7931f73a9dfa0b7a8746fb2fe7dd970ed", size = 233717, upload-time = "2025-10-06T05:36:44.251Z" },
+    { url = "https://files.pythonhosted.org/packages/1e/0b/1b5531611e83ba7d13ccc9988967ea1b51186af64c42b7a7af465dcc9568/frozenlist-1.8.0-cp313-cp313-win32.whl", hash = "sha256:8b7b94a067d1c504ee0b16def57ad5738701e4ba10cec90529f13fa03c833496", size = 39628, upload-time = "2025-10-06T05:36:45.423Z" },
+    { url = "https://files.pythonhosted.org/packages/d8/cf/174c91dbc9cc49bc7b7aab74d8b734e974d1faa8f191c74af9b7e80848e6/frozenlist-1.8.0-cp313-cp313-win_amd64.whl", hash = "sha256:878be833caa6a3821caf85eb39c5ba92d28e85df26d57afb06b35b2efd937231", size = 43882, upload-time = "2025-10-06T05:36:46.796Z" },
+    { url = "https://files.pythonhosted.org/packages/c1/17/502cd212cbfa96eb1388614fe39a3fc9ab87dbbe042b66f97acb57474834/frozenlist-1.8.0-cp313-cp313-win_arm64.whl", hash = "sha256:44389d135b3ff43ba8cc89ff7f51f5a0bb6b63d829c8300f79a2fe4fe61bcc62", size = 39676, upload-time = "2025-10-06T05:36:47.8Z" },
+    { url = "https://files.pythonhosted.org/packages/d2/5c/3bbfaa920dfab09e76946a5d2833a7cbdf7b9b4a91c714666ac4855b88b4/frozenlist-1.8.0-cp313-cp313t-macosx_10_13_universal2.whl", hash = "sha256:e25ac20a2ef37e91c1b39938b591457666a0fa835c7783c3a8f33ea42870db94", size = 89235, upload-time = "2025-10-06T05:36:48.78Z" },
+    { url = "https://files.pythonhosted.org/packages/d2/d6/f03961ef72166cec1687e84e8925838442b615bd0b8854b54923ce5b7b8a/frozenlist-1.8.0-cp313-cp313t-macosx_10_13_x86_64.whl", hash = "sha256:07cdca25a91a4386d2e76ad992916a85038a9b97561bf7a3fd12d5d9ce31870c", size = 50742, upload-time = "2025-10-06T05:36:49.837Z" },
+    { url = "https://files.pythonhosted.org/packages/1e/bb/a6d12b7ba4c3337667d0e421f7181c82dda448ce4e7ad7ecd249a16fa806/frozenlist-1.8.0-cp313-cp313t-macosx_11_0_arm64.whl", hash = "sha256:4e0c11f2cc6717e0a741f84a527c52616140741cd812a50422f83dc31749fb52", size = 51725, upload-time = "2025-10-06T05:36:50.851Z" },
+    { url = "https://files.pythonhosted.org/packages/bc/71/d1fed0ffe2c2ccd70b43714c6cab0f4188f09f8a67a7914a6b46ee30f274/frozenlist-1.8.0-cp313-cp313t-manylinux1_x86_64.manylinux_2_28_x86_64.manylinux_2_5_x86_64.whl", hash = "sha256:b3210649ee28062ea6099cfda39e147fa1bc039583c8ee4481cb7811e2448c51", size = 284533, upload-time = "2025-10-06T05:36:51.898Z" },
+    { url = "https://files.pythonhosted.org/packages/c9/1f/fb1685a7b009d89f9bf78a42d94461bc06581f6e718c39344754a5d9bada/frozenlist-1.8.0-cp313-cp313t-manylinux2014_aarch64.manylinux_2_17_aarch64.manylinux_2_28_aarch64.whl", hash = "sha256:581ef5194c48035a7de2aefc72ac6539823bb71508189e5de01d60c9dcd5fa65", size = 292506, upload-time = "2025-10-06T05:36:53.101Z" },
+    { url = "https://files.pythonhosted.org/packages/e6/3b/b991fe1612703f7e0d05c0cf734c1b77aaf7c7d321df4572e8d36e7048c8/frozenlist-1.8.0-cp313-cp313t-manylinux2014_armv7l.manylinux_2_17_armv7l.manylinux_2_31_armv7l.whl", hash = "sha256:3ef2d026f16a2b1866e1d86fc4e1291e1ed8a387b2c333809419a2f8b3a77b82", size = 274161, upload-time = "2025-10-06T05:36:54.309Z" },
+    { url = "https://files.pythonhosted.org/packages/ca/ec/c5c618767bcdf66e88945ec0157d7f6c4a1322f1473392319b7a2501ded7/frozenlist-1.8.0-cp313-cp313t-manylinux2014_ppc64le.manylinux_2_17_ppc64le.manylinux_2_28_ppc64le.whl", hash = "sha256:5500ef82073f599ac84d888e3a8c1f77ac831183244bfd7f11eaa0289fb30714", size = 294676, upload-time = "2025-10-06T05:36:55.566Z" },
+    { url = "https://files.pythonhosted.org/packages/7c/ce/3934758637d8f8a88d11f0585d6495ef54b2044ed6ec84492a91fa3b27aa/frozenlist-1.8.0-cp313-cp313t-manylinux2014_s390x.manylinux_2_17_s390x.manylinux_2_28_s390x.whl", hash = "sha256:50066c3997d0091c411a66e710f4e11752251e6d2d73d70d8d5d4c76442a199d", size = 300638, upload-time = "2025-10-06T05:36:56.758Z" },
+    { url = "https://files.pythonhosted.org/packages/fc/4f/a7e4d0d467298f42de4b41cbc7ddaf19d3cfeabaf9ff97c20c6c7ee409f9/frozenlist-1.8.0-cp313-cp313t-musllinux_1_2_aarch64.whl", hash = "sha256:5c1c8e78426e59b3f8005e9b19f6ff46e5845895adbde20ece9218319eca6506", size = 283067, upload-time = "2025-10-06T05:36:57.965Z" },
+    { url = "https://files.pythonhosted.org/packages/dc/48/c7b163063d55a83772b268e6d1affb960771b0e203b632cfe09522d67ea5/frozenlist-1.8.0-cp313-cp313t-musllinux_1_2_armv7l.whl", hash = "sha256:eefdba20de0d938cec6a89bd4d70f346a03108a19b9df4248d3cf0d88f1b0f51", size = 292101, upload-time = "2025-10-06T05:36:59.237Z" },
+    { url = "https://files.pythonhosted.org/packages/9f/d0/2366d3c4ecdc2fd391e0afa6e11500bfba0ea772764d631bbf82f0136c9d/frozenlist-1.8.0-cp313-cp313t-musllinux_1_2_ppc64le.whl", hash = "sha256:cf253e0e1c3ceb4aaff6df637ce033ff6535fb8c70a764a8f46aafd3d6ab798e", size = 289901, upload-time = "2025-10-06T05:37:00.811Z" },
+    { url = "https://files.pythonhosted.org/packages/b8/94/daff920e82c1b70e3618a2ac39fbc01ae3e2ff6124e80739ce5d71c9b920/frozenlist-1.8.0-cp313-cp313t-musllinux_1_2_s390x.whl", hash = "sha256:032efa2674356903cd0261c4317a561a6850f3ac864a63fc1583147fb05a79b0", size = 289395, upload-time = "2025-10-06T05:37:02.115Z" },
+    { url = "https://files.pythonhosted.org/packages/e3/20/bba307ab4235a09fdcd3cc5508dbabd17c4634a1af4b96e0f69bfe551ebd/frozenlist-1.8.0-cp313-cp313t-musllinux_1_2_x86_64.whl", hash = "sha256:6da155091429aeba16851ecb10a9104a108bcd32f6c1642867eadaee401c1c41", size = 283659, upload-time = "2025-10-06T05:37:03.711Z" },
+    { url = "https://files.pythonhosted.org/packages/fd/00/04ca1c3a7a124b6de4f8a9a17cc2fcad138b4608e7a3fc5877804b8715d7/frozenlist-1.8.0-cp313-cp313t-win32.whl", hash = "sha256:0f96534f8bfebc1a394209427d0f8a63d343c9779cda6fc25e8e121b5fd8555b", size = 43492, upload-time = "2025-10-06T05:37:04.915Z" },
+    { url = "https://files.pythonhosted.org/packages/59/5e/c69f733a86a94ab10f68e496dc6b7e8bc078ebb415281d5698313e3af3a1/frozenlist-1.8.0-cp313-cp313t-win_amd64.whl", hash = "sha256:5d63a068f978fc69421fb0e6eb91a9603187527c86b7cd3f534a5b77a592b888", size = 48034, upload-time = "2025-10-06T05:37:06.343Z" },
+    { url = "https://files.pythonhosted.org/packages/16/6c/be9d79775d8abe79b05fa6d23da99ad6e7763a1d080fbae7290b286093fd/frozenlist-1.8.0-cp313-cp313t-win_arm64.whl", hash = "sha256:bf0a7e10b077bf5fb9380ad3ae8ce20ef919a6ad93b4552896419ac7e1d8e042", size = 41749, upload-time = "2025-10-06T05:37:07.431Z" },
+    { url = "https://files.pythonhosted.org/packages/f1/c8/85da824b7e7b9b6e7f7705b2ecaf9591ba6f79c1177f324c2735e41d36a2/frozenlist-1.8.0-cp314-cp314-macosx_10_13_universal2.whl", hash = "sha256:cee686f1f4cadeb2136007ddedd0aaf928ab95216e7691c63e50a8ec066336d0", size = 86127, upload-time = "2025-10-06T05:37:08.438Z" },
+    { url = "https://files.pythonhosted.org/packages/8e/e8/a1185e236ec66c20afd72399522f142c3724c785789255202d27ae992818/frozenlist-1.8.0-cp314-cp314-macosx_10_13_x86_64.whl", hash = "sha256:119fb2a1bd47307e899c2fac7f28e85b9a543864df47aa7ec9d3c1b4545f096f", size = 49698, upload-time = "2025-10-06T05:37:09.48Z" },
+    { url = "https://files.pythonhosted.org/packages/a1/93/72b1736d68f03fda5fdf0f2180fb6caaae3894f1b854d006ac61ecc727ee/frozenlist-1.8.0-cp314-cp314-macosx_11_0_arm64.whl", hash = "sha256:4970ece02dbc8c3a92fcc5228e36a3e933a01a999f7094ff7c23fbd2beeaa67c", size = 49749, upload-time = "2025-10-06T05:37:10.569Z" },
+    { url = "https://files.pythonhosted.org/packages/a7/b2/fabede9fafd976b991e9f1b9c8c873ed86f202889b864756f240ce6dd855/frozenlist-1.8.0-cp314-cp314-manylinux1_x86_64.manylinux_2_28_x86_64.manylinux_2_5_x86_64.whl", hash = "sha256:cba69cb73723c3f329622e34bdbf5ce1f80c21c290ff04256cff1cd3c2036ed2", size = 231298, upload-time = "2025-10-06T05:37:11.993Z" },
+    { url = "https://files.pythonhosted.org/packages/3a/3b/d9b1e0b0eed36e70477ffb8360c49c85c8ca8ef9700a4e6711f39a6e8b45/frozenlist-1.8.0-cp314-cp314-manylinux2014_aarch64.manylinux_2_17_aarch64.manylinux_2_28_aarch64.whl", hash = "sha256:778a11b15673f6f1df23d9586f83c4846c471a8af693a22e066508b77d201ec8", size = 232015, upload-time = "2025-10-06T05:37:13.194Z" },
+    { url = "https://files.pythonhosted.org/packages/dc/94/be719d2766c1138148564a3960fc2c06eb688da592bdc25adcf856101be7/frozenlist-1.8.0-cp314-cp314-manylinux2014_armv7l.manylinux_2_17_armv7l.manylinux_2_31_armv7l.whl", hash = "sha256:0325024fe97f94c41c08872db482cf8ac4800d80e79222c6b0b7b162d5b13686", size = 225038, upload-time = "2025-10-06T05:37:14.577Z" },
+    { url = "https://files.pythonhosted.org/packages/e4/09/6712b6c5465f083f52f50cf74167b92d4ea2f50e46a9eea0523d658454ae/frozenlist-1.8.0-cp314-cp314-manylinux2014_ppc64le.manylinux_2_17_ppc64le.manylinux_2_28_ppc64le.whl", hash = "sha256:97260ff46b207a82a7567b581ab4190bd4dfa09f4db8a8b49d1a958f6aa4940e", size = 240130, upload-time = "2025-10-06T05:37:15.781Z" },
+    { url = "https://files.pythonhosted.org/packages/f8/d4/cd065cdcf21550b54f3ce6a22e143ac9e4836ca42a0de1022da8498eac89/frozenlist-1.8.0-cp314-cp314-manylinux2014_s390x.manylinux_2_17_s390x.manylinux_2_28_s390x.whl", hash = "sha256:54b2077180eb7f83dd52c40b2750d0a9f175e06a42e3213ce047219de902717a", size = 242845, upload-time = "2025-10-06T05:37:17.037Z" },
+    { url = "https://files.pythonhosted.org/packages/62/c3/f57a5c8c70cd1ead3d5d5f776f89d33110b1addae0ab010ad774d9a44fb9/frozenlist-1.8.0-cp314-cp314-musllinux_1_2_aarch64.whl", hash = "sha256:2f05983daecab868a31e1da44462873306d3cbfd76d1f0b5b69c473d21dbb128", size = 229131, upload-time = "2025-10-06T05:37:18.221Z" },
+    { url = "https://files.pythonhosted.org/packages/6c/52/232476fe9cb64f0742f3fde2b7d26c1dac18b6d62071c74d4ded55e0ef94/frozenlist-1.8.0-cp314-cp314-musllinux_1_2_armv7l.whl", hash = "sha256:33f48f51a446114bc5d251fb2954ab0164d5be02ad3382abcbfe07e2531d650f", size = 240542, upload-time = "2025-10-06T05:37:19.771Z" },
+    { url = "https://files.pythonhosted.org/packages/5f/85/07bf3f5d0fb5414aee5f47d33c6f5c77bfe49aac680bfece33d4fdf6a246/frozenlist-1.8.0-cp314-cp314-musllinux_1_2_ppc64le.whl", hash = "sha256:154e55ec0655291b5dd1b8731c637ecdb50975a2ae70c606d100750a540082f7", size = 237308, upload-time = "2025-10-06T05:37:20.969Z" },
+    { url = "https://files.pythonhosted.org/packages/11/99/ae3a33d5befd41ac0ca2cc7fd3aa707c9c324de2e89db0e0f45db9a64c26/frozenlist-1.8.0-cp314-cp314-musllinux_1_2_s390x.whl", hash = "sha256:4314debad13beb564b708b4a496020e5306c7333fa9a3ab90374169a20ffab30", size = 238210, upload-time = "2025-10-06T05:37:22.252Z" },
+    { url = "https://files.pythonhosted.org/packages/b2/60/b1d2da22f4970e7a155f0adde9b1435712ece01b3cd45ba63702aea33938/frozenlist-1.8.0-cp314-cp314-musllinux_1_2_x86_64.whl", hash = "sha256:073f8bf8becba60aa931eb3bc420b217bb7d5b8f4750e6f8b3be7f3da85d38b7", size = 231972, upload-time = "2025-10-06T05:37:23.5Z" },
+    { url = "https://files.pythonhosted.org/packages/3f/ab/945b2f32de889993b9c9133216c068b7fcf257d8595a0ac420ac8677cab0/frozenlist-1.8.0-cp314-cp314-win32.whl", hash = "sha256:bac9c42ba2ac65ddc115d930c78d24ab8d4f465fd3fc473cdedfccadb9429806", size = 40536, upload-time = "2025-10-06T05:37:25.581Z" },
+    { url = "https://files.pythonhosted.org/packages/59/ad/9caa9b9c836d9ad6f067157a531ac48b7d36499f5036d4141ce78c230b1b/frozenlist-1.8.0-cp314-cp314-win_amd64.whl", hash = "sha256:3e0761f4d1a44f1d1a47996511752cf3dcec5bbdd9cc2b4fe595caf97754b7a0", size = 44330, upload-time = "2025-10-06T05:37:26.928Z" },
+    { url = "https://files.pythonhosted.org/packages/82/13/e6950121764f2676f43534c555249f57030150260aee9dcf7d64efda11dd/frozenlist-1.8.0-cp314-cp314-win_arm64.whl", hash = "sha256:d1eaff1d00c7751b7c6662e9c5ba6eb2c17a2306ba5e2a37f24ddf3cc953402b", size = 40627, upload-time = "2025-10-06T05:37:28.075Z" },
+    { url = "https://files.pythonhosted.org/packages/c0/c7/43200656ecc4e02d3f8bc248df68256cd9572b3f0017f0a0c4e93440ae23/frozenlist-1.8.0-cp314-cp314t-macosx_10_13_universal2.whl", hash = "sha256:d3bb933317c52d7ea5004a1c442eef86f426886fba134ef8cf4226ea6ee1821d", size = 89238, upload-time = "2025-10-06T05:37:29.373Z" },
+    { url = "https://files.pythonhosted.org/packages/d1/29/55c5f0689b9c0fb765055629f472c0de484dcaf0acee2f7707266ae3583c/frozenlist-1.8.0-cp314-cp314t-macosx_10_13_x86_64.whl", hash = "sha256:8009897cdef112072f93a0efdce29cd819e717fd2f649ee3016efd3cd885a7ed", size = 50738, upload-time = "2025-10-06T05:37:30.792Z" },
+    { url = "https://files.pythonhosted.org/packages/ba/7d/b7282a445956506fa11da8c2db7d276adcbf2b17d8bb8407a47685263f90/frozenlist-1.8.0-cp314-cp314t-macosx_11_0_arm64.whl", hash = "sha256:2c5dcbbc55383e5883246d11fd179782a9d07a986c40f49abe89ddf865913930", size = 51739, upload-time = "2025-10-06T05:37:32.127Z" },
+    { url = "https://files.pythonhosted.org/packages/62/1c/3d8622e60d0b767a5510d1d3cf21065b9db874696a51ea6d7a43180a259c/frozenlist-1.8.0-cp314-cp314t-manylinux1_x86_64.manylinux_2_28_x86_64.manylinux_2_5_x86_64.whl", hash = "sha256:39ecbc32f1390387d2aa4f5a995e465e9e2f79ba3adcac92d68e3e0afae6657c", size = 284186, upload-time = "2025-10-06T05:37:33.21Z" },
+    { url = "https://files.pythonhosted.org/packages/2d/14/aa36d5f85a89679a85a1d44cd7a6657e0b1c75f61e7cad987b203d2daca8/frozenlist-1.8.0-cp314-cp314t-manylinux2014_aarch64.manylinux_2_17_aarch64.manylinux_2_28_aarch64.whl", hash = "sha256:92db2bf818d5cc8d9c1f1fc56b897662e24ea5adb36ad1f1d82875bd64e03c24", size = 292196, upload-time = "2025-10-06T05:37:36.107Z" },
+    { url = "https://files.pythonhosted.org/packages/05/23/6bde59eb55abd407d34f77d39a5126fb7b4f109a3f611d3929f14b700c66/frozenlist-1.8.0-cp314-cp314t-manylinux2014_armv7l.manylinux_2_17_armv7l.manylinux_2_31_armv7l.whl", hash = "sha256:2dc43a022e555de94c3b68a4ef0b11c4f747d12c024a520c7101709a2144fb37", size = 273830, upload-time = "2025-10-06T05:37:37.663Z" },
+    { url = "https://files.pythonhosted.org/packages/d2/3f/22cff331bfad7a8afa616289000ba793347fcd7bc275f3b28ecea2a27909/frozenlist-1.8.0-cp314-cp314t-manylinux2014_ppc64le.manylinux_2_17_ppc64le.manylinux_2_28_ppc64le.whl", hash = "sha256:cb89a7f2de3602cfed448095bab3f178399646ab7c61454315089787df07733a", size = 294289, upload-time = "2025-10-06T05:37:39.261Z" },
+    { url = "https://files.pythonhosted.org/packages/a4/89/5b057c799de4838b6c69aa82b79705f2027615e01be996d2486a69ca99c4/frozenlist-1.8.0-cp314-cp314t-manylinux2014_s390x.manylinux_2_17_s390x.manylinux_2_28_s390x.whl", hash = "sha256:33139dc858c580ea50e7e60a1b0ea003efa1fd42e6ec7fdbad78fff65fad2fd2", size = 300318, upload-time = "2025-10-06T05:37:43.213Z" },
+    { url = "https://files.pythonhosted.org/packages/30/de/2c22ab3eb2a8af6d69dc799e48455813bab3690c760de58e1bf43b36da3e/frozenlist-1.8.0-cp314-cp314t-musllinux_1_2_aarch64.whl", hash = "sha256:168c0969a329b416119507ba30b9ea13688fafffac1b7822802537569a1cb0ef", size = 282814, upload-time = "2025-10-06T05:37:45.337Z" },
+    { url = "https://files.pythonhosted.org/packages/59/f7/970141a6a8dbd7f556d94977858cfb36fa9b66e0892c6dd780d2219d8cd8/frozenlist-1.8.0-cp314-cp314t-musllinux_1_2_armv7l.whl", hash = "sha256:28bd570e8e189d7f7b001966435f9dac6718324b5be2990ac496cf1ea9ddb7fe", size = 291762, upload-time = "2025-10-06T05:37:46.657Z" },
+    { url = "https://files.pythonhosted.org/packages/c1/15/ca1adae83a719f82df9116d66f5bb28bb95557b3951903d39135620ef157/frozenlist-1.8.0-cp314-cp314t-musllinux_1_2_ppc64le.whl", hash = "sha256:b2a095d45c5d46e5e79ba1e5b9cb787f541a8dee0433836cea4b96a2c439dcd8", size = 289470, upload-time = "2025-10-06T05:37:47.946Z" },
+    { url = "https://files.pythonhosted.org/packages/ac/83/dca6dc53bf657d371fbc88ddeb21b79891e747189c5de990b9dfff2ccba1/frozenlist-1.8.0-cp314-cp314t-musllinux_1_2_s390x.whl", hash = "sha256:eab8145831a0d56ec9c4139b6c3e594c7a83c2c8be25d5bcf2d86136a532287a", size = 289042, upload-time = "2025-10-06T05:37:49.499Z" },
+    { url = "https://files.pythonhosted.org/packages/96/52/abddd34ca99be142f354398700536c5bd315880ed0a213812bc491cff5e4/frozenlist-1.8.0-cp314-cp314t-musllinux_1_2_x86_64.whl", hash = "sha256:974b28cf63cc99dfb2188d8d222bc6843656188164848c4f679e63dae4b0708e", size = 283148, upload-time = "2025-10-06T05:37:50.745Z" },
+    { url = "https://files.pythonhosted.org/packages/af/d3/76bd4ed4317e7119c2b7f57c3f6934aba26d277acc6309f873341640e21f/frozenlist-1.8.0-cp314-cp314t-win32.whl", hash = "sha256:342c97bf697ac5480c0a7ec73cd700ecfa5a8a40ac923bd035484616efecc2df", size = 44676, upload-time = "2025-10-06T05:37:52.222Z" },
+    { url = "https://files.pythonhosted.org/packages/89/76/c615883b7b521ead2944bb3480398cbb07e12b7b4e4d073d3752eb721558/frozenlist-1.8.0-cp314-cp314t-win_amd64.whl", hash = "sha256:06be8f67f39c8b1dc671f5d83aaefd3358ae5cdcf8314552c57e7ed3e6475bdd", size = 49451, upload-time = "2025-10-06T05:37:53.425Z" },
+    { url = "https://files.pythonhosted.org/packages/e0/a3/5982da14e113d07b325230f95060e2169f5311b1017ea8af2a29b374c289/frozenlist-1.8.0-cp314-cp314t-win_arm64.whl", hash = "sha256:102e6314ca4da683dca92e3b1355490fed5f313b768500084fbe6371fddfdb79", size = 42507, upload-time = "2025-10-06T05:37:54.513Z" },
+    { url = "https://files.pythonhosted.org/packages/9a/9a/e35b4a917281c0b8419d4207f4334c8e8c5dbf4f3f5f9ada73958d937dcc/frozenlist-1.8.0-py3-none-any.whl", hash = "sha256:0c18a16eab41e82c295618a77502e17b195883241c563b00f0aa5106fc4eaa0d", size = 13409, upload-time = "2025-10-06T05:38:16.721Z" },
+]
+
+[[package]]
+name = "fsspec"
+version = "2026.6.0"
+source = { registry = "https://pypi.org/simple" }
+sdist = { url = "https://files.pythonhosted.org/packages/10/a1/ae4e3e5003468d6391d2c77b6fa1cd73bd5d13511d81c642d7b28ac90ed4/fsspec-2026.6.0.tar.gz", hash = "sha256:f5bac145310fe30e16e1471bd6840b2d990d609e872251d7e674241822abf01a", size = 313646, upload-time = "2026-06-16T01:57:28.105Z" }
+wheels = [
+    { url = "https://files.pythonhosted.org/packages/e5/22/4222d7ddf3da30f363edaa98e329c2bce6c65497c9cb2810931c8b2c0fbc/fsspec-2026.6.0-py3-none-any.whl", hash = "sha256:02e0b71817df9b2169dc30a16832045764def1191b43dcff5bb85bdee212d2a1", size = 203949, upload-time = "2026-06-16T01:57:26.358Z" },
+]
+
+[package.optional-dependencies]
+http = [
+    { name = "aiohttp" },
+]
+
+[[package]]
+name = "grpcio"
+version = "1.81.1"
+source = { registry = "https://pypi.org/simple" }
+dependencies = [
+    { name = "typing-extensions" },
+]
+sdist = { url = "https://files.pythonhosted.org/packages/b0/b5/1ff353970a87eda4c98251e34d2dfd214abd4982dc89119c9252a2a482d2/grpcio-1.81.1.tar.gz", hash = "sha256:6fa10a767143a5e82e8eaab53918af0cd8909a57a27f8cb2288b80a613ac671b", size = 13026582, upload-time = "2026-06-11T12:46:51.673Z" }
+wheels = [
+    { url = "https://files.pythonhosted.org/packages/52/ea/1c2fa386b718ff493225e61cfc052ef400b4d6ffc54cbe261026432624b5/grpcio-1.81.1-cp311-cp311-linux_armv7l.whl", hash = "sha256:d71d30f2d92f67d944631c523713934fee37292469e182ebcd2c1dd8a64ce53f", size = 6093112, upload-time = "2026-06-11T12:44:52.131Z" },
+    { url = "https://files.pythonhosted.org/packages/2b/18/acf45fa8bd1bc5d7b0c2fd3dc4c209379fbd5bb396b440b68a83342226b7/grpcio-1.81.1-cp311-cp311-macosx_11_0_universal2.whl", hash = "sha256:b137f4bf3ada9dc44d411478decc6ff09a79ed30b306cd2abaa98408c3588137", size = 12074277, upload-time = "2026-06-11T12:44:55.354Z" },
+    { url = "https://files.pythonhosted.org/packages/48/d7/ee86a60699b7db039f772a2c4a7e4facc7138984ff42c0130933a0063884/grpcio-1.81.1-cp311-cp311-manylinux2014_aarch64.manylinux_2_17_aarch64.whl", hash = "sha256:a3acb384427816dd5d470f47e62137b87f74da694faa8a50147012cf40df276a", size = 6640348, upload-time = "2026-06-11T12:44:59.223Z" },
+    { url = "https://files.pythonhosted.org/packages/26/ee/d2de5e47378ffc207d476c230fea3be4d2601edbce9995f4fe45535d4896/grpcio-1.81.1-cp311-cp311-manylinux2014_i686.manylinux_2_17_i686.whl", hash = "sha256:f9a0ebbe45c29b5e5866593c12b78bd9035f0f0f0d4bc8361680cd580d99db49", size = 7331842, upload-time = "2026-06-11T12:45:02.001Z" },
+    { url = "https://files.pythonhosted.org/packages/23/d6/abeda5c2b896a0b341584fe5ac411bbf72e197a9a374c355fb90965e08d2/grpcio-1.81.1-cp311-cp311-manylinux2014_x86_64.manylinux_2_17_x86_64.whl", hash = "sha256:0a37165cc80b1a368384b383e63a4c38116a10467ae44c904d2d7468c4470ec2", size = 6842229, upload-time = "2026-06-11T12:45:04.76Z" },
+    { url = "https://files.pythonhosted.org/packages/10/1c/1f0da7d590b4aeee006826ba568d0e419ca14b23e18f901a3da3e9fba613/grpcio-1.81.1-cp311-cp311-musllinux_1_2_aarch64.whl", hash = "sha256:6282caffb41ec326d4cb67ca9cf53b739d1b2f975a2acb498c7418e9f7d9a416", size = 7446096, upload-time = "2026-06-11T12:45:07.499Z" },
+    { url = "https://files.pythonhosted.org/packages/6a/81/5c505d508f7c887aa7982d21443a4126597c80d34b0bcf40f9cec576d7f3/grpcio-1.81.1-cp311-cp311-musllinux_1_2_i686.whl", hash = "sha256:a35009284d0d3d5c2c9601c164a911b8b4331608d98a9a66d47d97bb2f522b70", size = 8445238, upload-time = "2026-06-11T12:45:10.243Z" },
+    { url = "https://files.pythonhosted.org/packages/f7/b2/524847365122ee509ca17bcc4e092198b700e94af7bfd5bb5e6dd9f3ee66/grpcio-1.81.1-cp311-cp311-musllinux_1_2_x86_64.whl", hash = "sha256:1b22c80559854b789a01fd89e8929b3798a156c0829b5282a8939f33ad4115ad", size = 7873989, upload-time = "2026-06-11T12:45:13.102Z" },
+    { url = "https://files.pythonhosted.org/packages/18/fa/07c037c50b006909d1d13a5848774f8aa7b242f70dc03a035c64eea0e6db/grpcio-1.81.1-cp311-cp311-win32.whl", hash = "sha256:428bec0161b48d8cf583c068591bc0016d0d9cfff52462b72b3884861ea768c5", size = 4202223, upload-time = "2026-06-11T12:45:16.166Z" },
+    { url = "https://files.pythonhosted.org/packages/41/ed/6bff15376920942fac6b95b9802752b837437172c9e8fc2d3170546b89cc/grpcio-1.81.1-cp311-cp311-win_amd64.whl", hash = "sha256:30e825f6848d9f18bba350ed6c75c1b02a0b5184474a31db9a32b1fa66fd8c79", size = 4941303, upload-time = "2026-06-11T12:45:18.724Z" },
+    { url = "https://files.pythonhosted.org/packages/85/07/9a979c81738863a738dc23d65177056e71fbb2db817740ed870b33434e7a/grpcio-1.81.1-cp312-cp312-linux_armv7l.whl", hash = "sha256:8b39472beafc0bdcafc4c8c73ad082ebfdb449d566897a61e7acb4fa88089115", size = 6053264, upload-time = "2026-06-11T12:45:21.017Z" },
+    { url = "https://files.pythonhosted.org/packages/75/95/539706ca0d3bd40dbad583dc56fd883da941f37556b629132da5762781b9/grpcio-1.81.1-cp312-cp312-macosx_11_0_universal2.whl", hash = "sha256:12b7524c88d4026d3dcb7b0ebe16b6714f3b4af402ddd0f0639ab064a00c87c3", size = 12052560, upload-time = "2026-06-11T12:45:23.652Z" },
+    { url = "https://files.pythonhosted.org/packages/e0/44/f257b7e0bd69c93b06c6cb8ac8d1b901ccb42bedabd83c1a4c77a71f8810/grpcio-1.81.1-cp312-cp312-manylinux2014_aarch64.manylinux_2_17_aarch64.whl", hash = "sha256:1e123f9b37edb8375fd74130d1f69c944bbf0a7b06761ae7211154b8759e94d2", size = 6595983, upload-time = "2026-06-11T12:45:26.963Z" },
+    { url = "https://files.pythonhosted.org/packages/b9/f3/19782aa04c960968bef8c5539329d8e3bbc3364e2e46d19eb5e5cc5e43b7/grpcio-1.81.1-cp312-cp312-manylinux2014_i686.manylinux_2_17_i686.whl", hash = "sha256:2c2e2ae6867c2966b8daccc836d54a13218e0007e9a490aeb81dd05be64d22d7", size = 7303455, upload-time = "2026-06-11T12:45:29.707Z" },
+    { url = "https://files.pythonhosted.org/packages/eb/8c/dea020b6d91508cd84463917a63149ec196ee7db505d032ae43fcb3303b9/grpcio-1.81.1-cp312-cp312-manylinux2014_x86_64.manylinux_2_17_x86_64.whl", hash = "sha256:766bc7c9a9c340342f4c864ccbda8e78111e4751f13b895812b9c148fb79e9d0", size = 6809167, upload-time = "2026-06-11T12:45:32.52Z" },
+    { url = "https://files.pythonhosted.org/packages/1c/c7/3030dd940408083bd32cd95d634777a71605ade4887154d93e8a89244946/grpcio-1.81.1-cp312-cp312-musllinux_1_2_aarch64.whl", hash = "sha256:b259a04a737cb3496be0901328eb8b7552ed8df4865d8c8f1cf1bffcfc0776a3", size = 7412536, upload-time = "2026-06-11T12:45:35.403Z" },
+    { url = "https://files.pythonhosted.org/packages/e0/dd/1172a9e42b168edcafefad6115346ef619a3fc02158bb170e66ced24bcdd/grpcio-1.81.1-cp312-cp312-musllinux_1_2_i686.whl", hash = "sha256:85b10a45b8993d195c4f3ff57025b8d1e11834909ee475c403bfa60cb4caefaf", size = 8408276, upload-time = "2026-06-11T12:45:37.78Z" },
+    { url = "https://files.pythonhosted.org/packages/25/7a/71437c7f3596e5246155c515852795a85a1a8d228190212432b13b97a95d/grpcio-1.81.1-cp312-cp312-musllinux_1_2_x86_64.whl", hash = "sha256:8ea1936c26b99999b27479853039a7f34713f56c49375ad52b38535ec93a796c", size = 7849660, upload-time = "2026-06-11T12:45:40.627Z" },
+    { url = "https://files.pythonhosted.org/packages/65/40/7debc0da45d2efebafb82da75644be347497fe4ee250514b8cd3b86ae8bf/grpcio-1.81.1-cp312-cp312-win32.whl", hash = "sha256:a185a04039df6cae8648bc8ab6d6fde7bf94f7188ecf7828e76ac52eef1e41d6", size = 4185819, upload-time = "2026-06-11T12:45:43.027Z" },
+    { url = "https://files.pythonhosted.org/packages/2e/b9/8fe3ba5ed462067774ebc1f9c7f26aa7ebcc280ddd476be107153de1339e/grpcio-1.81.1-cp312-cp312-win_amd64.whl", hash = "sha256:3ad74f8bb1a18963914c5452d289422830b39459e8776ebbcd207be1fbfb1d94", size = 4930461, upload-time = "2026-06-11T12:45:45.775Z" },
+    { url = "https://files.pythonhosted.org/packages/7a/42/dcc2e4b600538ef18327c0839d56b7d3c3812337c5d710df5877dbb39b1e/grpcio-1.81.1-cp313-cp313-linux_armv7l.whl", hash = "sha256:b10e1ff4756ed27d5a29d7fc79cfce7ef1ff56ad20025b89bac7cf79e09abbbe", size = 6054466, upload-time = "2026-06-11T12:45:48.43Z" },
+    { url = "https://files.pythonhosted.org/packages/7b/4a/a36e03210183a8a7d4c80c3936acee679f4bd77d5861f369db47b2cc5f05/grpcio-1.81.1-cp313-cp313-macosx_11_0_universal2.whl", hash = "sha256:819edbdcb42ab8598b494bcf0222684bbb7a3c772bd1b1f0be7e029a6063c28e", size = 12048795, upload-time = "2026-06-11T12:45:54.011Z" },
+    { url = "https://files.pythonhosted.org/packages/b0/d5/d68e30b29098f63beab6fe501100fe82674ff142b32c672532da86a99b3a/grpcio-1.81.1-cp313-cp313-manylinux2014_aarch64.manylinux_2_17_aarch64.whl", hash = "sha256:c5bf2dc311127d91230cc79b92188c082634a06cf66c5234db49a43b910183b0", size = 6599094, upload-time = "2026-06-11T12:45:57.799Z" },
+    { url = "https://files.pythonhosted.org/packages/3d/b3/e837954d279754f638a11cca5dcf6b24a005efb398984cefaf7735945a54/grpcio-1.81.1-cp313-cp313-manylinux2014_i686.manylinux_2_17_i686.whl", hash = "sha256:e8ca6a1fcdb2943c9cbc1804a1baf3acb6071d72a471591678ded84218006e14", size = 7307182, upload-time = "2026-06-11T12:46:00.568Z" },
+    { url = "https://files.pythonhosted.org/packages/0d/1e/b47957057e729adc6cdf519a47f8be2562b7140e280f1418443eb4022192/grpcio-1.81.1-cp313-cp313-manylinux2014_x86_64.manylinux_2_17_x86_64.whl", hash = "sha256:e64dd101d380a115cc5a0c7856788adb535f1a4e21fc543775602f8be95180ae", size = 6810962, upload-time = "2026-06-11T12:46:03.312Z" },
+    { url = "https://files.pythonhosted.org/packages/40/26/569868e364e05b19ec8f969da53d230bcd89c962cd198f7c29943155c4d3/grpcio-1.81.1-cp313-cp313-musllinux_1_2_aarch64.whl", hash = "sha256:98a07f9bf591e3a8919797bee1c53f026ba4acd587e5a4404c8e57c9ec36b2a5", size = 7415698, upload-time = "2026-06-11T12:46:06.005Z" },
+    { url = "https://files.pythonhosted.org/packages/36/0c/5440a0582cb5653fc42a6e262eeb22700943313f8076f9dc927491b20a59/grpcio-1.81.1-cp313-cp313-musllinux_1_2_i686.whl", hash = "sha256:c261d74b1a945cf895a9d6eccd1685a8e837531beaab782da4d630a8d12deffb", size = 8407779, upload-time = "2026-06-11T12:46:08.84Z" },
+    { url = "https://files.pythonhosted.org/packages/ff/aa/66fe9f39871d766987d869a03ee0842a026f499c7b1e62decb9e78a8088e/grpcio-1.81.1-cp313-cp313-musllinux_1_2_x86_64.whl", hash = "sha256:58ad1131c300d3c9b933802b3cc4dc69d380822935ba50b28703156ea826fbf7", size = 7844521, upload-time = "2026-06-11T12:46:12.171Z" },
+    { url = "https://files.pythonhosted.org/packages/f0/9e/69bb7194861bcd28fb3193261d4f9c3831b4446993f002cf59068943e7ab/grpcio-1.81.1-cp313-cp313-win32.whl", hash = "sha256:78e29211f26da2fdd0e9c6d2b79f489476140cf7029b6a64808ade7ca4156a42", size = 4182786, upload-time = "2026-06-11T12:46:15.192Z" },
+    { url = "https://files.pythonhosted.org/packages/0d/20/3da8bb0d637feccdc3e1e419bb511ce93651ce7d54164f95de22cc0b8b34/grpcio-1.81.1-cp313-cp313-win_amd64.whl", hash = "sha256:edb59506291b647a30884b1d51a599d605f40b20af4a7dc3d33786a47a31de60", size = 4928648, upload-time = "2026-06-11T12:46:17.823Z" },
+    { url = "https://files.pythonhosted.org/packages/b6/58/19414622b1bf6981bc9c05a365bd548e71876c89000083b3af489251e9c0/grpcio-1.81.1-cp314-cp314-linux_armv7l.whl", hash = "sha256:506f48f2f9c29b143fca3dad7b0d518c188b6c9648c75a2ae6e2d9f2c13a060b", size = 6055336, upload-time = "2026-06-11T12:46:20.557Z" },
+    { url = "https://files.pythonhosted.org/packages/32/f1/2ec88adb92b0eba970dd0e0e7dd086341daa3c75eba4f735f9e44bf684b0/grpcio-1.81.1-cp314-cp314-macosx_11_0_universal2.whl", hash = "sha256:d865db4a6318e1c1bea83292e0ed231090538fc4ca45425b0f0480eb338bbc6e", size = 12056279, upload-time = "2026-06-11T12:46:24.255Z" },
+    { url = "https://files.pythonhosted.org/packages/41/36/e8c5f8c6ec71de73733695ebc809e98b178b534ec6d8eaa31a7ebab4ad4c/grpcio-1.81.1-cp314-cp314-manylinux2014_aarch64.manylinux_2_17_aarch64.whl", hash = "sha256:e2aa72e3ce1770317ef534f63d397b55e130725f5149bd36077c3b539019db27", size = 6608225, upload-time = "2026-06-11T12:46:27.601Z" },
+    { url = "https://files.pythonhosted.org/packages/30/22/96fc577a845ab093326d9ab1adb874bd4936c8cf98ac8ed2f3db13a0a2fb/grpcio-1.81.1-cp314-cp314-manylinux2014_i686.manylinux_2_17_i686.whl", hash = "sha256:0490c30c261eded63f3f354979f9dc4502a9fb944cccb60cd9dc85f5a7349854", size = 7306576, upload-time = "2026-06-11T12:46:30.514Z" },
+    { url = "https://files.pythonhosted.org/packages/76/7b/61dab5d5969f28d97fb1009cead1df0a5cd987d3315e1b37f18a4449f8bc/grpcio-1.81.1-cp314-cp314-manylinux2014_x86_64.manylinux_2_17_x86_64.whl", hash = "sha256:410482da976329fe5f4067270401b12cf2bd552ff8020f054ecfaddb5475f9d6", size = 6812165, upload-time = "2026-06-11T12:46:33.699Z" },
+    { url = "https://files.pythonhosted.org/packages/82/78/6e501929d4f5f96462fd82fd9f0f06e5f9612207582b862868d68757b27d/grpcio-1.81.1-cp314-cp314-musllinux_1_2_aarch64.whl", hash = "sha256:e3657301562ac3cb8018d30d0d3ebfa39932239f7b5703422057ef14b69949f5", size = 7422962, upload-time = "2026-06-11T12:46:36.511Z" },
+    { url = "https://files.pythonhosted.org/packages/2a/7e/f2157589e66daa78ebb3165942d05a08bdea93b9d11c2bc1e172aef89685/grpcio-1.81.1-cp314-cp314-musllinux_1_2_i686.whl", hash = "sha256:24c8e57504c8f45b237e40b99262d181071e5099a07053695b75d97bb53053a0", size = 8408176, upload-time = "2026-06-11T12:46:39.803Z" },
+    { url = "https://files.pythonhosted.org/packages/da/df/c6717fef716e00d235ffb96123baf6dce76d6004f6233fa767c502861460/grpcio-1.81.1-cp314-cp314-musllinux_1_2_x86_64.whl", hash = "sha256:b427c19380991a4eaab2f6144b64b99b412043314c6bf4ab544f97bb31ee4190", size = 7846681, upload-time = "2026-06-11T12:46:43.013Z" },
+    { url = "https://files.pythonhosted.org/packages/36/84/3502e9f210a6a5c4438c8aca3f88edd2e04f6a27f3d41b26cf0a0024b096/grpcio-1.81.1-cp314-cp314-win32.whl", hash = "sha256:61233fe8951e5c85dff81c2458b6528624760166946b5b47ea150a589168411f", size = 4264615, upload-time = "2026-06-11T12:46:45.741Z" },
+    { url = "https://files.pythonhosted.org/packages/ff/b0/4af731ff7492c68a96e4c71bfd0f4590acde92b31c6fe4894e6465c10ff6/grpcio-1.81.1-cp314-cp314-win_amd64.whl", hash = "sha256:3768a5ff1b2125e6f552e561b6b2dca0e64982d8949689b4df145cf8b98d7821", size = 5070275, upload-time = "2026-06-11T12:46:48.486Z" },
+]
+
+[[package]]
+name = "hjson"
+version = "3.1.0"
+source = { registry = "https://pypi.org/simple" }
+sdist = { url = "https://files.pythonhosted.org/packages/82/e5/0b56d723a76ca67abadbf7fb71609fb0ea7e6926e94fcca6c65a85b36a0e/hjson-3.1.0.tar.gz", hash = "sha256:55af475a27cf83a7969c808399d7bccdec8fb836a07ddbd574587593b9cdcf75", size = 40541, upload-time = "2022-08-13T02:53:01.919Z" }
+wheels = [
+    { url = "https://files.pythonhosted.org/packages/1f/7f/13cd798d180af4bf4c0ceddeefba2b864a63c71645abc0308b768d67bb81/hjson-3.1.0-py3-none-any.whl", hash = "sha256:65713cdcf13214fb554eb8b4ef803419733f4f5e551047c9b711098ab7186b89", size = 54018, upload-time = "2022-08-13T02:52:59.899Z" },
+]
+
+[[package]]
+name = "identify"
+version = "2.6.19"
+source = { registry = "https://pypi.org/simple" }
+sdist = { url = "https://files.pythonhosted.org/packages/52/63/51723b5f116cc04b061cb6f5a561790abf249d25931d515cd375e063e0f4/identify-2.6.19.tar.gz", hash = "sha256:6be5020c38fcb07da56c53733538a3081ea5aa70d36a156f83044bfbf9173842", size = 99567, upload-time = "2026-04-17T18:39:50.265Z" }
+wheels = [
+    { url = "https://files.pythonhosted.org/packages/94/84/d9273cd09688070a6523c4aee4663a8538721b2b755c4962aafae0011e72/identify-2.6.19-py2.py3-none-any.whl", hash = "sha256:20e6a87f786f768c092a721ad107fc9df0eb89347be9396cadf3f4abbd1fb78a", size = 99397, upload-time = "2026-04-17T18:39:49.221Z" },
+]
+
+[[package]]
+name = "idna"
+version = "3.18"
+source = { registry = "https://pypi.org/simple" }
+sdist = { url = "https://files.pythonhosted.org/packages/cd/63/9496c57188a2ee585e0f1db071d75089a11e98aa86eb99d9d7618fc1edce/idna-3.18.tar.gz", hash = "sha256:ffb385a7e039654cef1ab9ef32c6fafe283c0c0467bba1d9029738ce4a14a848", size = 196711, upload-time = "2026-06-02T14:34:07.794Z" }
+wheels = [
+    { url = "https://files.pythonhosted.org/packages/1e/5e/d4e9f1a599fb8e573b7b87160658329fbf28d19eac2718f51fc3def3aa5a/idna-3.18-py3-none-any.whl", hash = "sha256:7f952cbe720b688055e3f87de14f5c3e5fdaa8bc3928985c4077ca689de849a2", size = 65455, upload-time = "2026-06-02T14:34:06.319Z" },
+]
+
+[[package]]
+name = "iniconfig"
+version = "2.3.0"
+source = { registry = "https://pypi.org/simple" }
+sdist = { url = "https://files.pythonhosted.org/packages/72/34/14ca021ce8e5dfedc35312d08ba8bf51fdd999c576889fc2c24cb97f4f10/iniconfig-2.3.0.tar.gz", hash = "sha256:c76315c77db068650d49c5b56314774a7804df16fee4402c1f19d6d15d8c4730", size = 20503, upload-time = "2025-10-18T21:55:43.219Z" }
+wheels = [
+    { url = "https://files.pythonhosted.org/packages/cb/b1/3846dd7f199d53cb17f49cba7e651e9ce294d8497c8c150530ed11865bb8/iniconfig-2.3.0-py3-none-any.whl", hash = "sha256:f631c04d2c48c52b84d0d0549c99ff3859c98df65b3101406327ecc7d53fbf12", size = 7484, upload-time = "2025-10-18T21:55:41.639Z" },
+]
+
+[[package]]
+name = "jedi"
+version = "0.20.0"
+source = { registry = "https://pypi.org/simple" }
+dependencies = [
+    { name = "parso" },
+]
+sdist = { url = "https://files.pythonhosted.org/packages/46/b7/a3635f6a2d7cf5b5dd98064fc1d5fbbafcb25477bcea204a3a92145d158b/jedi-0.20.0.tar.gz", hash = "sha256:c3f4ccbd276696f4b19c54618d4fb18f9fc24b0aef02acf704b23f487daa1011", size = 3119416, upload-time = "2026-05-01T23:38:47.814Z" }
+wheels = [
+    { url = "https://files.pythonhosted.org/packages/9a/93/242e2eab5fe682ffcb8b0084bde703a41d51e17ee0f3a31ff0d9d813620a/jedi-0.20.0-py2.py3-none-any.whl", hash = "sha256:7bdd9c2634f56713299976f4cbd59cb3fa92165cc5e05ea811fb253480728b67", size = 4884812, upload-time = "2026-05-01T23:38:43.919Z" },
+]
+
+[[package]]
+name = "jinja2"
+version = "3.1.6"
+source = { registry = "https://pypi.org/simple" }
+dependencies = [
+    { name = "markupsafe" },
+]
+sdist = { url = "https://files.pythonhosted.org/packages/df/bf/f7da0350254c0ed7c72f3e33cef02e048281fec7ecec5f032d4aac52226b/jinja2-3.1.6.tar.gz", hash = "sha256:0137fb05990d35f1275a587e9aee6d56da821fc83491a0fb838183be43f66d6d", size = 245115, upload-time = "2025-03-05T20:05:02.478Z" }
+wheels = [
+    { url = "https://files.pythonhosted.org/packages/62/a1/3d680cbfd5f4b8f15abc1d571870c5fc3e594bb582bc3b64ea099db13e56/jinja2-3.1.6-py3-none-any.whl", hash = "sha256:85ece4451f492d0c13c5dd7c13a64681a86afae63a5f347908daf103ce6d2f67", size = 134899, upload-time = "2025-03-05T20:05:00.369Z" },
+]
+
+[[package]]
+name = "joblib"
+version = "1.5.3"
+source = { registry = "https://pypi.org/simple" }
+sdist = { url = "https://files.pythonhosted.org/packages/41/f2/d34e8b3a08a9cc79a50b2208a93dce981fe615b64d5a4d4abee421d898df/joblib-1.5.3.tar.gz", hash = "sha256:8561a3269e6801106863fd0d6d84bb737be9e7631e33aaed3fb9ce5953688da3", size = 331603, upload-time = "2025-12-15T08:41:46.427Z" }
+wheels = [
+    { url = "https://files.pythonhosted.org/packages/7b/91/984aca2ec129e2757d1e4e3c81c3fcda9d0f85b74670a094cc443d9ee949/joblib-1.5.3-py3-none-any.whl", hash = "sha256:5fc3c5039fc5ca8c0276333a188bbd59d6b7ab37fe6632daa76bc7f9ec18e713", size = 309071, upload-time = "2025-12-15T08:41:44.973Z" },
+]
+
+[[package]]
+name = "librt"
+version = "0.11.0"
+source = { registry = "https://pypi.org/simple" }
+sdist = { url = "https://files.pythonhosted.org/packages/40/08/9e7f6b5d2b5bed6ad055cdd5925f192bb403a51280f86b56554d9d0699a2/librt-0.11.0.tar.gz", hash = "sha256:075dc3ef4458a278e0195cbf6ac9d38808d9b906c5a6c7f7f79c3888276a3fb1", size = 200139, upload-time = "2026-05-10T18:17:25.138Z" }
+wheels = [
+    { url = "https://files.pythonhosted.org/packages/fe/87/2bf31fe17587b29e3f93ec31421e2b1e1c3e349b8bf6c7c313dbad1d5340/librt-0.11.0-cp311-cp311-macosx_10_9_x86_64.whl", hash = "sha256:93d95bd45b7d58343d8b90d904450a545144eec19a002511163426f8ab1fae29", size = 141092, upload-time = "2026-05-10T18:15:34.795Z" },
+    { url = "https://files.pythonhosted.org/packages/cf/08/5c5bf772920b7ebac6e32bc91a643e0ab3870199c0b542356d3baa83970a/librt-0.11.0-cp311-cp311-macosx_11_0_arm64.whl", hash = "sha256:4ee278c769a713638cdacd4c0436d72156e75df3ebc0166ab2b9dc43acc386c9", size = 142035, upload-time = "2026-05-10T18:15:36.242Z" },
+    { url = "https://files.pythonhosted.org/packages/06/20/662a03d254e5b000d838e8b345d83303ddb768c080fd488e40634c0fa66b/librt-0.11.0-cp311-cp311-manylinux2014_aarch64.manylinux_2_17_aarch64.manylinux_2_28_aarch64.whl", hash = "sha256:f230cb1cbc9faaa616f9a678f530ebcf186e414b6bcbd88b960e4ba1b92428d5", size = 475022, upload-time = "2026-05-10T18:15:37.56Z" },
+    { url = "https://files.pythonhosted.org/packages/de/f3/aa81523e45184c6ec23dc7f63263362ec55f80a09d424c012359ecbe7e35/librt-0.11.0-cp311-cp311-manylinux2014_i686.manylinux_2_17_i686.manylinux_2_28_i686.whl", hash = "sha256:5d63c855d86938d9de93e265c9bd8c705b51ec494de5738340ee93767a686e4b", size = 467273, upload-time = "2026-05-10T18:15:39.182Z" },
+    { url = "https://files.pythonhosted.org/packages/6b/6f/59c74b560ca8853834d5501d589c8a2519f4184f273a085ffd0f37a1cc47/librt-0.11.0-cp311-cp311-manylinux2014_x86_64.manylinux_2_17_x86_64.manylinux_2_28_x86_64.whl", hash = "sha256:993f028be9e96a08d31df3479ac80d99be374d17f3b78e4796b3fd3c913d4e89", size = 497083, upload-time = "2026-05-10T18:15:40.634Z" },
+    { url = "https://files.pythonhosted.org/packages/fe/7b/5aa4d2c9600a719401160bf7055417df0b2a47439b9d88286ce45e56b65f/librt-0.11.0-cp311-cp311-manylinux_2_34_riscv64.manylinux_2_39_riscv64.whl", hash = "sha256:258d73a0aa66a055e65b2e4d1b8cdb23b9d132c5bb915d9547d804fcaed116cc", size = 489139, upload-time = "2026-05-10T18:15:41.934Z" },
+    { url = "https://files.pythonhosted.org/packages/d6/31/9143803d7da6856a69153785768c4936864430eec0fd9461c3ea527d9922/librt-0.11.0-cp311-cp311-musllinux_1_2_aarch64.whl", hash = "sha256:0827efe7854718f04aaddf6496e96960a956e676fe1d0f04eb41511fd8ad06d5", size = 508442, upload-time = "2026-05-10T18:15:43.206Z" },
+    { url = "https://files.pythonhosted.org/packages/2f/5a/bce08184488426bda4ccc2c4964ac048c8f68ae89bd7120082eef4233cfd/librt-0.11.0-cp311-cp311-musllinux_1_2_i686.whl", hash = "sha256:7753e57d6e12d019c0d8786f1c09c709f4c3fcc57c3887b24e36e6c06ec938b7", size = 514230, upload-time = "2026-05-10T18:15:44.761Z" },
+    { url = "https://files.pythonhosted.org/packages/89/8c/bb5e213d254b7505a0e658da199d8ab719086632ce09eef311ab27976523/librt-0.11.0-cp311-cp311-musllinux_1_2_riscv64.whl", hash = "sha256:11bd19822431cc21af9f27374e7ae2e58103c7d98bda823536a6c47f6bb2bb3d", size = 494231, upload-time = "2026-05-10T18:15:46.308Z" },
+    { url = "https://files.pythonhosted.org/packages/9d/fb/541cdad5b1ab1300398c74c4c9a497b88e5074c21b1244c8f49731d3a284/librt-0.11.0-cp311-cp311-musllinux_1_2_x86_64.whl", hash = "sha256:22bdf239b219d3993761a148ffa134b19e52e9989c84f845d5d7b71d70a17412", size = 537585, upload-time = "2026-05-10T18:15:47.629Z" },
+    { url = "https://files.pythonhosted.org/packages/8f/f2/464bb69295c320cb06bddb4f14a4ec67934ee14b2bffb12b19fb7ab287ba/librt-0.11.0-cp311-cp311-win32.whl", hash = "sha256:46c60b61e308eb535fbd6fa622b1ee1bb2815691c1ad9c98bf7b84952ec3bc8d", size = 100509, upload-time = "2026-05-10T18:15:49.157Z" },
+    { url = "https://files.pythonhosted.org/packages/6d/e7/a17ee1788f9e4fbf548c19f4afa07c92089b9e24fef6cb2410863781ef4c/librt-0.11.0-cp311-cp311-win_amd64.whl", hash = "sha256:902e546ff044f579ff1c953ff5fce97b636fe9e3943996b2177710c6ef076f73", size = 118628, upload-time = "2026-05-10T18:15:50.345Z" },
+    { url = "https://files.pythonhosted.org/packages/cc/c7/6c766214f9f9903bcfcfbef97d807af8d8f5aa3502d247858ab17582d212/librt-0.11.0-cp311-cp311-win_arm64.whl", hash = "sha256:65ac3bc20f78aa0ee5ae84baa68917f89fef4af63e941084dd019a0d0e749f0c", size = 103122, upload-time = "2026-05-10T18:15:52.068Z" },
+    { url = "https://files.pythonhosted.org/packages/8b/d0/07c77e067f0838949b43bd89232c29d72efebb9d2801a9750184eb706b71/librt-0.11.0-cp312-cp312-macosx_10_13_x86_64.whl", hash = "sha256:b87504f1690a23b9a2cca841191a04f83895d4fc2dd04df91d82b1a04ca2ad46", size = 144147, upload-time = "2026-05-10T18:15:53.227Z" },
+    { url = "https://files.pythonhosted.org/packages/7a/24/8493538fa4f62f982686398a5b8f68008138a75086abdea19ade64bf4255/librt-0.11.0-cp312-cp312-macosx_11_0_arm64.whl", hash = "sha256:40071fc5fe0ce8daa6de616702314a01e1250711682b0523d6ab8d4525910cb3", size = 143614, upload-time = "2026-05-10T18:15:54.657Z" },
+    { url = "https://files.pythonhosted.org/packages/ff/1e/f8bad050810d9171f34a1648ed910e56814c2ba61639f2bd53c6377ae24b/librt-0.11.0-cp312-cp312-manylinux2014_aarch64.manylinux_2_17_aarch64.manylinux_2_28_aarch64.whl", hash = "sha256:137e79445c896a0ea7b265f52d23954e05b64222ee1af69e2cb34219067cbb67", size = 485538, upload-time = "2026-05-10T18:15:56.117Z" },
+    { url = "https://files.pythonhosted.org/packages/c0/fe/3594ebfbaf03084ba4b120c9ba5c3183fd938a48725e9bbe6ff0a5159ad8/librt-0.11.0-cp312-cp312-manylinux2014_i686.manylinux_2_17_i686.manylinux_2_28_i686.whl", hash = "sha256:cca6644054e78746d8d4ef238681f9c34ff8b584fe6b988ecebb8db3b15e622a", size = 479623, upload-time = "2026-05-10T18:15:57.544Z" },
+    { url = "https://files.pythonhosted.org/packages/b0/da/5d1876984b3746c85dbd219dbfcb73c85f54ee263fd32e5b2a632ec14571/librt-0.11.0-cp312-cp312-manylinux2014_x86_64.manylinux_2_17_x86_64.manylinux_2_28_x86_64.whl", hash = "sha256:d5b0eea49f5562861ee8d757a32ef7d559c1d35be2aaaa1ec28941d74c9ffc8a", size = 513082, upload-time = "2026-05-10T18:15:58.805Z" },
+    { url = "https://files.pythonhosted.org/packages/19/6e/55bdf5d5ca00c3e18430690bf2c953d8d3ffd3c337418173d33dec985dc9/librt-0.11.0-cp312-cp312-manylinux_2_34_riscv64.manylinux_2_39_riscv64.whl", hash = "sha256:0d1029d7e1ae1a7e647ed6fb5df8c4ce2dffefb7a9f5fd1376a4554d96dac09f", size = 508105, upload-time = "2026-05-10T18:16:00.2Z" },
+    { url = "https://files.pythonhosted.org/packages/07/10/f1f23a7c595ee90ece4d35c851e5d104b1311a887ed1b4ac4c35bbd13da8/librt-0.11.0-cp312-cp312-musllinux_1_2_aarch64.whl", hash = "sha256:bc3ce6b33c5828d9e80592011a5c584cb2ce86edbc4088405f70da47dc1d1b3b", size = 522268, upload-time = "2026-05-10T18:16:01.708Z" },
+    { url = "https://files.pythonhosted.org/packages/b6/02/5720f5697a7f54b78b3aefbe20df3a48cedcff1276618c4aa481177942ed/librt-0.11.0-cp312-cp312-musllinux_1_2_i686.whl", hash = "sha256:936c5995f3514a42111f20099397d8177c79b4d7e70961e396c6f5a0a3566766", size = 527348, upload-time = "2026-05-10T18:16:03.496Z" },
+    { url = "https://files.pythonhosted.org/packages/50/db/b4a47c6f91db4ff76348a0b3dd0cc65e090a078b765a810a62ff9434c3d3/librt-0.11.0-cp312-cp312-musllinux_1_2_riscv64.whl", hash = "sha256:9bc0ca6ad9381cbe8e4aa6e5726e4c80c78115a6e9723c599ed1d73e092bc49d", size = 516294, upload-time = "2026-05-10T18:16:05.173Z" },
+    { url = "https://files.pythonhosted.org/packages/9e/58/9384b2f4eb1ed1d273d40948a7c5c4b2360213b402ef3be4641c06299f9c/librt-0.11.0-cp312-cp312-musllinux_1_2_x86_64.whl", hash = "sha256:070aa8c26c0a74774317a72df8851facc7f0f012a5b406557ac56992d92e1ec8", size = 553608, upload-time = "2026-05-10T18:16:06.839Z" },
+    { url = "https://files.pythonhosted.org/packages/21/7b/5aa8848a7c6a9278c79375146da1812e695754ceec5f005e6043461a7315/librt-0.11.0-cp312-cp312-win32.whl", hash = "sha256:6bf14feb84b05ae945277395451998c89c54d0def4070eb5c08de544930b245a", size = 101879, upload-time = "2026-05-10T18:16:08.103Z" },
+    { url = "https://files.pythonhosted.org/packages/37/33/8a745436944947575b584231750a41417de1a38cf6a2e9251d1065651c09/librt-0.11.0-cp312-cp312-win_amd64.whl", hash = "sha256:75672f0bc524ede266287d532d7923dbce94c7514ad07627bac3d0c6d92cc4d9", size = 119831, upload-time = "2026-05-10T18:16:09.174Z" },
+    { url = "https://files.pythonhosted.org/packages/59/67/a6739ac96e28b7855808bdb0370e250606104a859750d209e5a0716fe7ab/librt-0.11.0-cp312-cp312-win_arm64.whl", hash = "sha256:2f10cf143e4a9bb0f4f5af568a00df94a2d69ef41c2579584454bb0fe5cc642c", size = 103470, upload-time = "2026-05-10T18:16:10.369Z" },
+    { url = "https://files.pythonhosted.org/packages/82/61/e59168d4d0bf2bf90f4f0caf7a001bfc60254c3af4586013b04dc3ef517b/librt-0.11.0-cp313-cp313-macosx_10_13_x86_64.whl", hash = "sha256:78dc31f7fdfe9c9d0eb0e8f42d139db230e826415bbcabd9f0e9faaaee909894", size = 144119, upload-time = "2026-05-10T18:16:11.771Z" },
+    { url = "https://files.pythonhosted.org/packages/61/fd/caa1d60b12f7dd79ccea23054e06eeaebe266a5f52c40a6b651069200ce5/librt-0.11.0-cp313-cp313-macosx_11_0_arm64.whl", hash = "sha256:fa475675db22290c3158e1d42326d0f5a65f04f44a0e68c3630a25b53560fb9c", size = 143565, upload-time = "2026-05-10T18:16:13.334Z" },
+    { url = "https://files.pythonhosted.org/packages/b8/a9/dc744f5c2b4978d48db970be29f22716d3413d28b14ad99740817315cf2c/librt-0.11.0-cp313-cp313-manylinux2014_aarch64.manylinux_2_17_aarch64.manylinux_2_28_aarch64.whl", hash = "sha256:621db29691044bdeda22e789e482e1b0f3a985d90e3426c9c6d17606416205ea", size = 485395, upload-time = "2026-05-10T18:16:14.729Z" },
+    { url = "https://files.pythonhosted.org/packages/8f/21/7f8e97a1e4dae952a5a95948f6f8507a173bc1e669f54340bba6ca1ca31b/librt-0.11.0-cp313-cp313-manylinux2014_i686.manylinux_2_17_i686.manylinux_2_28_i686.whl", hash = "sha256:a9010e2ed5b3a9e158c5fd966b3ab7e834bb3d3aacc8f66c91dd4b57a3799230", size = 479383, upload-time = "2026-05-10T18:16:16.321Z" },
+    { url = "https://files.pythonhosted.org/packages/a6/6d/d8ee9c114bebf2c50e29ec2aa940826fccb62a645c3e4c18760987d0e16d/librt-0.11.0-cp313-cp313-manylinux2014_x86_64.manylinux_2_17_x86_64.manylinux_2_28_x86_64.whl", hash = "sha256:7c39513d8b7477a2e1ed8c43fc21c524e8d5a0f8d4e8b7b074dbdbe7820a08e2", size = 513010, upload-time = "2026-05-10T18:16:17.647Z" },
+    { url = "https://files.pythonhosted.org/packages/f0/43/0b5708af2bd30a46400e72ba6bdaa8f066f15fb9a688527e34220e8d6c06/librt-0.11.0-cp313-cp313-manylinux_2_34_riscv64.manylinux_2_39_riscv64.whl", hash = "sha256:7aef3cf1d5af86e770ab04bfd993dfc4ae8b8c17f66fb77dd4a7d50de7bbb1a3", size = 508433, upload-time = "2026-05-10T18:16:19.309Z" },
+    { url = "https://files.pythonhosted.org/packages/4a/50/356187247d09013490481033183b3532b58acf8028bcb34b2b56a375c9b2/librt-0.11.0-cp313-cp313-musllinux_1_2_aarch64.whl", hash = "sha256:557183ddc36babe46b27dd60facbd5adb4492181a5be887587d57cda6e092f21", size = 522595, upload-time = "2026-05-10T18:16:20.642Z" },
+    { url = "https://files.pythonhosted.org/packages/40/e7/c6ac4240899c7f3248079d5a9900debe0dadb3fdeaf856684c987105ba47/librt-0.11.0-cp313-cp313-musllinux_1_2_i686.whl", hash = "sha256:83d3e1f72bd42f6c5c0b7daec530c3f829bd02db42c70b8ddf0c2d90a2459930", size = 527255, upload-time = "2026-05-10T18:16:22.352Z" },
+    { url = "https://files.pythonhosted.org/packages/eb/b5/a81322dbeedeeaf9c1ee6f001734d28a09d8383ac9e6779bc24bbd0743c6/librt-0.11.0-cp313-cp313-musllinux_1_2_riscv64.whl", hash = "sha256:4ce1f21fbe589bc1afd7872dece84fb0e1144f794a288e58a10d2c54a55c43be", size = 516847, upload-time = "2026-05-10T18:16:23.627Z" },
+    { url = "https://files.pythonhosted.org/packages/ae/66/6e6323787d592b55204a42595ff1102da5115601b53a7e9ddebc889a6da5/librt-0.11.0-cp313-cp313-musllinux_1_2_x86_64.whl", hash = "sha256:970b09f7044ea2b64c9da42fd3d335666518cfd1c6e8a182c95da73d0214b41e", size = 553920, upload-time = "2026-05-10T18:16:25.025Z" },
+    { url = "https://files.pythonhosted.org/packages/9c/21/623f8ca230857102066d9ca8c6c1734995908c4d0d1bee7bb2ef0021cb33/librt-0.11.0-cp313-cp313-win32.whl", hash = "sha256:78fddc31cd4d3caa897ad5d31f856b1faadc9474021ad6cb182b9018793e254e", size = 101898, upload-time = "2026-05-10T18:16:26.649Z" },
+    { url = "https://files.pythonhosted.org/packages/b3/1d/b4ebd44dd723f768469007515cb92251e0ae286c94c140f374801140fa74/librt-0.11.0-cp313-cp313-win_amd64.whl", hash = "sha256:8ca8aa88751a775870b764e93bad5135385f563cb8dcee399abf034ea4d3cb47", size = 119812, upload-time = "2026-05-10T18:16:27.859Z" },
+    { url = "https://files.pythonhosted.org/packages/3b/e4/b2f4ca7965ca373b491cdb4bc25cdb30c1649ca81a8782056a83850292a9/librt-0.11.0-cp313-cp313-win_arm64.whl", hash = "sha256:96f044bb325fd9cf1a723015638c219e9143f0dfbc0ca54c565df2b7fc748b44", size = 103448, upload-time = "2026-05-10T18:16:29.066Z" },
+    { url = "https://files.pythonhosted.org/packages/29/eb/dbce197da4e227779e56b5735f2decc3eb36e55a1cdbf1bd65d6639d76c1/librt-0.11.0-cp314-cp314-macosx_10_13_x86_64.whl", hash = "sha256:4a017a95e5837dc15a8c5661d60e05daa96b90908b1aa6b7acdf443cd25c8ebd", size = 143345, upload-time = "2026-05-10T18:16:30.674Z" },
+    { url = "https://files.pythonhosted.org/packages/76/a3/254bebd0c11c8ba684018efb8006ff22e466abce445215cca6c778e7d9de/librt-0.11.0-cp314-cp314-macosx_11_0_arm64.whl", hash = "sha256:b1ecbd9819deccc39b7542bf4d2a740d8a620694d39989e58661d3763458f8d4", size = 143131, upload-time = "2026-05-10T18:16:32.037Z" },
+    { url = "https://files.pythonhosted.org/packages/f1/3f/f77d6122d21ac7bf6ae8a7dfced1bd2a7ac545d3273ebdcaf8042f6d619f/librt-0.11.0-cp314-cp314-manylinux2014_aarch64.manylinux_2_17_aarch64.manylinux_2_28_aarch64.whl", hash = "sha256:7da327dacd7be8f8ec36547373550744a3cc0e536d54665cd83f8bcd961200e8", size = 477024, upload-time = "2026-05-10T18:16:33.493Z" },
+    { url = "https://files.pythonhosted.org/packages/ac/0a/2c996dadebaa7d9bbbd43ef2d4f3e66b6da545f838a41694ef6172cebec8/librt-0.11.0-cp314-cp314-manylinux2014_i686.manylinux_2_17_i686.manylinux_2_28_i686.whl", hash = "sha256:0dc56b1f8d06e60db362cc3fdae206681817f86ce4725d34511473487f12a34b", size = 474221, upload-time = "2026-05-10T18:16:34.864Z" },
+    { url = "https://files.pythonhosted.org/packages/0a/7e/f5d92af8486b8272c23b3e686b46ff72d89c8169585eb61eef01a2ac7147/librt-0.11.0-cp314-cp314-manylinux2014_x86_64.manylinux_2_17_x86_64.manylinux_2_28_x86_64.whl", hash = "sha256:05fb8fb2ab90e21c8d12ea240d744ad514da9baf381ebfa70d91d20d21713175", size = 505174, upload-time = "2026-05-10T18:16:36.705Z" },
+    { url = "https://files.pythonhosted.org/packages/af/1a/cb0734fe86398eb33193ab753b7326255c74cac5eb09e76b9b16536e7adb/librt-0.11.0-cp314-cp314-manylinux_2_34_riscv64.manylinux_2_39_riscv64.whl", hash = "sha256:cae74872be221df4374d10fec61f93ed1513b9546ea84f2c0bf73ab3e9bd0b03", size = 497216, upload-time = "2026-05-10T18:16:38.418Z" },
+    { url = "https://files.pythonhosted.org/packages/18/06/094820f91558b66e29943c0ec41c9914f460f48dd51fc503c3101e10842d/librt-0.11.0-cp314-cp314-musllinux_1_2_aarch64.whl", hash = "sha256:32bcc918c0148eb7e3d57385125bac7e5f9e4359d05f07448b09f6f778c2f31c", size = 513921, upload-time = "2026-05-10T18:16:39.848Z" },
+    { url = "https://files.pythonhosted.org/packages/0b/c2/00de9018871a282f530cacb457d5ec0428f6ac7e6fedde9aff7468d9fb04/librt-0.11.0-cp314-cp314-musllinux_1_2_i686.whl", hash = "sha256:f9743fc99135d5f78d2454435615f6dec0473ca507c26ce9d92b10b562a280d3", size = 520850, upload-time = "2026-05-10T18:16:41.471Z" },
+    { url = "https://files.pythonhosted.org/packages/51/9d/64631832348fd1834fb3a61b996434edddaaf25a31d03b0a76273159d2cf/librt-0.11.0-cp314-cp314-musllinux_1_2_riscv64.whl", hash = "sha256:5ba067f4aadae8fda802d91d2124c90c42195ff32d9161d3549e6d05cfe26f96", size = 504237, upload-time = "2026-05-10T18:16:43.15Z" },
+    { url = "https://files.pythonhosted.org/packages/a5/ec/ae5525eb16edc827a044e7bb8777a455ff95d4bca9379e7e6bddd7383647/librt-0.11.0-cp314-cp314-musllinux_1_2_x86_64.whl", hash = "sha256:de3bf945454d032f9e390b85c4072e0a0570bf825421c8be0e71209fa65e1abe", size = 546261, upload-time = "2026-05-10T18:16:44.408Z" },
+    { url = "https://files.pythonhosted.org/packages/5a/09/adce371f27ca039411da9659f7430fcc2ba6cd0c7b3e4467a0f091be7fa9/librt-0.11.0-cp314-cp314-win32.whl", hash = "sha256:d2277a05f6dcb9fd13db9566aac4fabd68c3ea1ea46ee5567d4eef8efa495a2f", size = 96965, upload-time = "2026-05-10T18:16:46.039Z" },
+    { url = "https://files.pythonhosted.org/packages/d6/ee/8ac720d98548f173c7ce2e632a7ca94673f74cacd5c8162a84af5b35958a/librt-0.11.0-cp314-cp314-win_amd64.whl", hash = "sha256:ab73e8db5e3f564d812c1f5c3a175930a5f9bc96ccb5e3b22a34d7858b401cf7", size = 115151, upload-time = "2026-05-10T18:16:47.133Z" },
+    { url = "https://files.pythonhosted.org/packages/94/20/c900cf14efeb09b6bef2b2dff20779f73464b97fd58d1c6bccc379588ae3/librt-0.11.0-cp314-cp314-win_arm64.whl", hash = "sha256:aea3caa317752e3a466fa8af45d91ee0ea8c7fdd96e42b0a8dd9b76a7931eba1", size = 98850, upload-time = "2026-05-10T18:16:48.597Z" },
+    { url = "https://files.pythonhosted.org/packages/0c/71/944bfe4b64e12abffcd3c15e1cce07f72f3d55655083786285f4dedeb532/librt-0.11.0-cp314-cp314t-macosx_10_13_x86_64.whl", hash = "sha256:d1b36540d7aaf9b9101b3a6f376c8d8e9f7a9aec93ed05918f2c69d493ffef72", size = 151138, upload-time = "2026-05-10T18:16:49.839Z" },
+    { url = "https://files.pythonhosted.org/packages/b6/10/99e64a5c86989357fda078c8143c533389585f6473b7439172dd8f3b3b2d/librt-0.11.0-cp314-cp314t-macosx_11_0_arm64.whl", hash = "sha256:efbb343ab2ce3540f4ecbe6315d677ed70f37cd9a72b1e58066c918ca83acbaa", size = 151976, upload-time = "2026-05-10T18:16:51.062Z" },
+    { url = "https://files.pythonhosted.org/packages/21/31/5072ad880946d83e5ea4147d6d018c78eefce85b77819b19bdd0ee229435/librt-0.11.0-cp314-cp314t-manylinux2014_aarch64.manylinux_2_17_aarch64.manylinux_2_28_aarch64.whl", hash = "sha256:aa0dd688aab3f7914d3e6e5e3554978e0383312fb8e771d84be008a35b9ee548", size = 557927, upload-time = "2026-05-10T18:16:52.632Z" },
+    { url = "https://files.pythonhosted.org/packages/5e/8d/70b5fb7cfbab60edbe7381614ab985da58e144fbf465c86d44c95f43cdca/librt-0.11.0-cp314-cp314t-manylinux2014_i686.manylinux_2_17_i686.manylinux_2_28_i686.whl", hash = "sha256:f5fb36b8c6c63fdcbb1d526d94c0d1331610d43f4118cc1beb4efef4f3faacb2", size = 539698, upload-time = "2026-05-10T18:16:53.934Z" },
+    { url = "https://files.pythonhosted.org/packages/fa/a3/ba3495a0b3edbd24a4cae0d1d3c64f39a9fc45d06e812101289b50c1a619/librt-0.11.0-cp314-cp314t-manylinux2014_x86_64.manylinux_2_17_x86_64.manylinux_2_28_x86_64.whl", hash = "sha256:4a9a237d13addb93715b6fee74023d5ee3469b53fce527626c0e088aa585805f", size = 577162, upload-time = "2026-05-10T18:16:55.589Z" },
+    { url = "https://files.pythonhosted.org/packages/f7/db/36e25fb81f99937ff1b96612a1dc9fd66f039cb9cc3aee12c01fac31aab9/librt-0.11.0-cp314-cp314t-manylinux_2_34_riscv64.manylinux_2_39_riscv64.whl", hash = "sha256:5ddd17bd87b2c56ddd60e546a7984a2e64c4e8eab92fb4cf3830a48ad5469d51", size = 566494, upload-time = "2026-05-10T18:16:56.975Z" },
+    { url = "https://files.pythonhosted.org/packages/33/0d/3f622b47f0b013eeb9cf4cc07ae9bfe378d832a4eec998b2b209fe84244d/librt-0.11.0-cp314-cp314t-musllinux_1_2_aarch64.whl", hash = "sha256:bd43992b4473d42f12ff9e68326079f0696d9d4e6000e8f39a0238d482ba6ee2", size = 596858, upload-time = "2026-05-10T18:16:58.374Z" },
+    { url = "https://files.pythonhosted.org/packages/a9/02/71b90bc93039c46a2000651f6ad60122b114c8f54c4ad306e0e96f5b75ad/librt-0.11.0-cp314-cp314t-musllinux_1_2_i686.whl", hash = "sha256:f8e3e8056dd674e279741485e2e512d6e9a751c7455809d0114e6ebf8d781085", size = 590318, upload-time = "2026-05-10T18:16:59.676Z" },
+    { url = "https://files.pythonhosted.org/packages/04/04/418cb3f75621e2b761fb1ab0f017f4d70a1a72a6e7c74ee4f7e8d198c2f3/librt-0.11.0-cp314-cp314t-musllinux_1_2_riscv64.whl", hash = "sha256:c1f708d8ae9c56cf38a903c44297243d2ec83fd82b396b977e0144a3e76217e3", size = 575115, upload-time = "2026-05-10T18:17:01.007Z" },
+    { url = "https://files.pythonhosted.org/packages/cc/2c/5a2183ac58dd911f26b5d7e7d7d8f1d87fcecdddd99d6c12169a258ff62c/librt-0.11.0-cp314-cp314t-musllinux_1_2_x86_64.whl", hash = "sha256:0add982e0e7b9fc14cf4b33789d5f13f66581889b88c2f58099f6ce8f92617bd", size = 617918, upload-time = "2026-05-10T18:17:02.682Z" },
+    { url = "https://files.pythonhosted.org/packages/15/1f/dc6771a52592a4451be6effa200cbfc9cec61e4393d3033d81a9d307961d/librt-0.11.0-cp314-cp314t-win32.whl", hash = "sha256:2b481d846ac894c4e8403c5fd0e87c5d11d6499e404b474602508a224ff531c8", size = 103562, upload-time = "2026-05-10T18:17:03.99Z" },
+    { url = "https://files.pythonhosted.org/packages/62/4a/7d1415567027286a75ba1093ec4aca11f073e0f559c530cf3e0a757ad55c/librt-0.11.0-cp314-cp314t-win_amd64.whl", hash = "sha256:28edb433edde181112a908c78907af28f964eabc15f4dd16c9d66c834302677c", size = 124327, upload-time = "2026-05-10T18:17:05.465Z" },
+    { url = "https://files.pythonhosted.org/packages/ce/62/b40b382fa0c66fee1478073eb8db352a4a6beda4a1adccf1df911d8c289c/librt-0.11.0-cp314-cp314t-win_arm64.whl", hash = "sha256:dee008f20b542e3cd162ba338a7f9ec0f6d23d395f66fe8aeeec3c9d067ea253", size = 102572, upload-time = "2026-05-10T18:17:06.809Z" },
+]
+
+[[package]]
+name = "lightning-utilities"
+version = "0.15.3"
+source = { registry = "https://pypi.org/simple" }
+dependencies = [
+    { name = "packaging" },
+    { name = "typing-extensions" },
+]
+sdist = { url = "https://files.pythonhosted.org/packages/f1/45/7fa8f56b17dc0f0a41ec70dd307ecd6787254483549843bef4c30ab5adce/lightning_utilities-0.15.3.tar.gz", hash = "sha256:792ae0204c79f6859721ac7f386c237a33b0ed06ba775009cb894e010a842033", size = 33553, upload-time = "2026-02-22T14:48:53.348Z" }
+wheels = [
+    { url = "https://files.pythonhosted.org/packages/25/f4/ead6e0e37209b07c9baa3e984ccdb0348ca370b77cea3aaea8ddbb097e00/lightning_utilities-0.15.3-py3-none-any.whl", hash = "sha256:6c55f1bee70084a1cbeaa41ada96e4b3a0fea5909e844dd335bd80f5a73c5f91", size = 31906, upload-time = "2026-02-22T14:48:52.488Z" },
+]
+
+[[package]]
+name = "markdown"
+version = "3.10.2"
+source = { registry = "https://pypi.org/simple" }
+sdist = { url = "https://files.pythonhosted.org/packages/2b/f4/69fa6ed85ae003c2378ffa8f6d2e3234662abd02c10d216c0ba96081a238/markdown-3.10.2.tar.gz", hash = "sha256:994d51325d25ad8aa7ce4ebaec003febcce822c3f8c911e3b17c52f7f589f950", size = 368805, upload-time = "2026-02-09T14:57:26.942Z" }
+wheels = [
+    { url = "https://files.pythonhosted.org/packages/de/1f/77fa3081e4f66ca3576c896ae5d31c3002ac6607f9747d2e3aa49227e464/markdown-3.10.2-py3-none-any.whl", hash = "sha256:e91464b71ae3ee7afd3017d9f358ef0baf158fd9a298db92f1d4761133824c36", size = 108180, upload-time = "2026-02-09T14:57:25.787Z" },
+]
+
+[[package]]
+name = "markupsafe"
+version = "3.0.3"
+source = { registry = "https://pypi.org/simple" }
+sdist = { url = "https://files.pythonhosted.org/packages/7e/99/7690b6d4034fffd95959cbe0c02de8deb3098cc577c67bb6a24fe5d7caa7/markupsafe-3.0.3.tar.gz", hash = "sha256:722695808f4b6457b320fdc131280796bdceb04ab50fe1795cd540799ebe1698", size = 80313, upload-time = "2025-09-27T18:37:40.426Z" }
+wheels = [
+    { url = "https://files.pythonhosted.org/packages/08/db/fefacb2136439fc8dd20e797950e749aa1f4997ed584c62cfb8ef7c2be0e/markupsafe-3.0.3-cp311-cp311-macosx_10_9_x86_64.whl", hash = "sha256:1cc7ea17a6824959616c525620e387f6dd30fec8cb44f649e31712db02123dad", size = 11631, upload-time = "2025-09-27T18:36:18.185Z" },
+    { url = "https://files.pythonhosted.org/packages/e1/2e/5898933336b61975ce9dc04decbc0a7f2fee78c30353c5efba7f2d6ff27a/markupsafe-3.0.3-cp311-cp311-macosx_11_0_arm64.whl", hash = "sha256:4bd4cd07944443f5a265608cc6aab442e4f74dff8088b0dfc8238647b8f6ae9a", size = 12058, upload-time = "2025-09-27T18:36:19.444Z" },
+    { url = "https://files.pythonhosted.org/packages/1d/09/adf2df3699d87d1d8184038df46a9c80d78c0148492323f4693df54e17bb/markupsafe-3.0.3-cp311-cp311-manylinux2014_aarch64.manylinux_2_17_aarch64.manylinux_2_28_aarch64.whl", hash = "sha256:6b5420a1d9450023228968e7e6a9ce57f65d148ab56d2313fcd589eee96a7a50", size = 24287, upload-time = "2025-09-27T18:36:20.768Z" },
+    { url = "https://files.pythonhosted.org/packages/30/ac/0273f6fcb5f42e314c6d8cd99effae6a5354604d461b8d392b5ec9530a54/markupsafe-3.0.3-cp311-cp311-manylinux2014_x86_64.manylinux_2_17_x86_64.manylinux_2_28_x86_64.whl", hash = "sha256:0bf2a864d67e76e5c9a34dc26ec616a66b9888e25e7b9460e1c76d3293bd9dbf", size = 22940, upload-time = "2025-09-27T18:36:22.249Z" },
+    { url = "https://files.pythonhosted.org/packages/19/ae/31c1be199ef767124c042c6c3e904da327a2f7f0cd63a0337e1eca2967a8/markupsafe-3.0.3-cp311-cp311-manylinux_2_31_riscv64.manylinux_2_39_riscv64.whl", hash = "sha256:bc51efed119bc9cfdf792cdeaa4d67e8f6fcccab66ed4bfdd6bde3e59bfcbb2f", size = 21887, upload-time = "2025-09-27T18:36:23.535Z" },
+    { url = "https://files.pythonhosted.org/packages/b2/76/7edcab99d5349a4532a459e1fe64f0b0467a3365056ae550d3bcf3f79e1e/markupsafe-3.0.3-cp311-cp311-musllinux_1_2_aarch64.whl", hash = "sha256:068f375c472b3e7acbe2d5318dea141359e6900156b5b2ba06a30b169086b91a", size = 23692, upload-time = "2025-09-27T18:36:24.823Z" },
+    { url = "https://files.pythonhosted.org/packages/a4/28/6e74cdd26d7514849143d69f0bf2399f929c37dc2b31e6829fd2045b2765/markupsafe-3.0.3-cp311-cp311-musllinux_1_2_riscv64.whl", hash = "sha256:7be7b61bb172e1ed687f1754f8e7484f1c8019780f6f6b0786e76bb01c2ae115", size = 21471, upload-time = "2025-09-27T18:36:25.95Z" },
+    { url = "https://files.pythonhosted.org/packages/62/7e/a145f36a5c2945673e590850a6f8014318d5577ed7e5920a4b3448e0865d/markupsafe-3.0.3-cp311-cp311-musllinux_1_2_x86_64.whl", hash = "sha256:f9e130248f4462aaa8e2552d547f36ddadbeaa573879158d721bbd33dfe4743a", size = 22923, upload-time = "2025-09-27T18:36:27.109Z" },
+    { url = "https://files.pythonhosted.org/packages/0f/62/d9c46a7f5c9adbeeeda52f5b8d802e1094e9717705a645efc71b0913a0a8/markupsafe-3.0.3-cp311-cp311-win32.whl", hash = "sha256:0db14f5dafddbb6d9208827849fad01f1a2609380add406671a26386cdf15a19", size = 14572, upload-time = "2025-09-27T18:36:28.045Z" },
+    { url = "https://files.pythonhosted.org/packages/83/8a/4414c03d3f891739326e1783338e48fb49781cc915b2e0ee052aa490d586/markupsafe-3.0.3-cp311-cp311-win_amd64.whl", hash = "sha256:de8a88e63464af587c950061a5e6a67d3632e36df62b986892331d4620a35c01", size = 15077, upload-time = "2025-09-27T18:36:29.025Z" },
+    { url = "https://files.pythonhosted.org/packages/35/73/893072b42e6862f319b5207adc9ae06070f095b358655f077f69a35601f0/markupsafe-3.0.3-cp311-cp311-win_arm64.whl", hash = "sha256:3b562dd9e9ea93f13d53989d23a7e775fdfd1066c33494ff43f5418bc8c58a5c", size = 13876, upload-time = "2025-09-27T18:36:29.954Z" },
+    { url = "https://files.pythonhosted.org/packages/5a/72/147da192e38635ada20e0a2e1a51cf8823d2119ce8883f7053879c2199b5/markupsafe-3.0.3-cp312-cp312-macosx_10_13_x86_64.whl", hash = "sha256:d53197da72cc091b024dd97249dfc7794d6a56530370992a5e1a08983ad9230e", size = 11615, upload-time = "2025-09-27T18:36:30.854Z" },
+    { url = "https://files.pythonhosted.org/packages/9a/81/7e4e08678a1f98521201c3079f77db69fb552acd56067661f8c2f534a718/markupsafe-3.0.3-cp312-cp312-macosx_11_0_arm64.whl", hash = "sha256:1872df69a4de6aead3491198eaf13810b565bdbeec3ae2dc8780f14458ec73ce", size = 12020, upload-time = "2025-09-27T18:36:31.971Z" },
+    { url = "https://files.pythonhosted.org/packages/1e/2c/799f4742efc39633a1b54a92eec4082e4f815314869865d876824c257c1e/markupsafe-3.0.3-cp312-cp312-manylinux2014_aarch64.manylinux_2_17_aarch64.manylinux_2_28_aarch64.whl", hash = "sha256:3a7e8ae81ae39e62a41ec302f972ba6ae23a5c5396c8e60113e9066ef893da0d", size = 24332, upload-time = "2025-09-27T18:36:32.813Z" },
+    { url = "https://files.pythonhosted.org/packages/3c/2e/8d0c2ab90a8c1d9a24f0399058ab8519a3279d1bd4289511d74e909f060e/markupsafe-3.0.3-cp312-cp312-manylinux2014_x86_64.manylinux_2_17_x86_64.manylinux_2_28_x86_64.whl", hash = "sha256:d6dd0be5b5b189d31db7cda48b91d7e0a9795f31430b7f271219ab30f1d3ac9d", size = 22947, upload-time = "2025-09-27T18:36:33.86Z" },
+    { url = "https://files.pythonhosted.org/packages/2c/54/887f3092a85238093a0b2154bd629c89444f395618842e8b0c41783898ea/markupsafe-3.0.3-cp312-cp312-manylinux_2_31_riscv64.manylinux_2_39_riscv64.whl", hash = "sha256:94c6f0bb423f739146aec64595853541634bde58b2135f27f61c1ffd1cd4d16a", size = 21962, upload-time = "2025-09-27T18:36:35.099Z" },
+    { url = "https://files.pythonhosted.org/packages/c9/2f/336b8c7b6f4a4d95e91119dc8521402461b74a485558d8f238a68312f11c/markupsafe-3.0.3-cp312-cp312-musllinux_1_2_aarch64.whl", hash = "sha256:be8813b57049a7dc738189df53d69395eba14fb99345e0a5994914a3864c8a4b", size = 23760, upload-time = "2025-09-27T18:36:36.001Z" },
+    { url = "https://files.pythonhosted.org/packages/32/43/67935f2b7e4982ffb50a4d169b724d74b62a3964bc1a9a527f5ac4f1ee2b/markupsafe-3.0.3-cp312-cp312-musllinux_1_2_riscv64.whl", hash = "sha256:83891d0e9fb81a825d9a6d61e3f07550ca70a076484292a70fde82c4b807286f", size = 21529, upload-time = "2025-09-27T18:36:36.906Z" },
+    { url = "https://files.pythonhosted.org/packages/89/e0/4486f11e51bbba8b0c041098859e869e304d1c261e59244baa3d295d47b7/markupsafe-3.0.3-cp312-cp312-musllinux_1_2_x86_64.whl", hash = "sha256:77f0643abe7495da77fb436f50f8dab76dbc6e5fd25d39589a0f1fe6548bfa2b", size = 23015, upload-time = "2025-09-27T18:36:37.868Z" },
+    { url = "https://files.pythonhosted.org/packages/2f/e1/78ee7a023dac597a5825441ebd17170785a9dab23de95d2c7508ade94e0e/markupsafe-3.0.3-cp312-cp312-win32.whl", hash = "sha256:d88b440e37a16e651bda4c7c2b930eb586fd15ca7406cb39e211fcff3bf3017d", size = 14540, upload-time = "2025-09-27T18:36:38.761Z" },
+    { url = "https://files.pythonhosted.org/packages/aa/5b/bec5aa9bbbb2c946ca2733ef9c4ca91c91b6a24580193e891b5f7dbe8e1e/markupsafe-3.0.3-cp312-cp312-win_amd64.whl", hash = "sha256:26a5784ded40c9e318cfc2bdb30fe164bdb8665ded9cd64d500a34fb42067b1c", size = 15105, upload-time = "2025-09-27T18:36:39.701Z" },
+    { url = "https://files.pythonhosted.org/packages/e5/f1/216fc1bbfd74011693a4fd837e7026152e89c4bcf3e77b6692fba9923123/markupsafe-3.0.3-cp312-cp312-win_arm64.whl", hash = "sha256:35add3b638a5d900e807944a078b51922212fb3dedb01633a8defc4b01a3c85f", size = 13906, upload-time = "2025-09-27T18:36:40.689Z" },
+    { url = "https://files.pythonhosted.org/packages/38/2f/907b9c7bbba283e68f20259574b13d005c121a0fa4c175f9bed27c4597ff/markupsafe-3.0.3-cp313-cp313-macosx_10_13_x86_64.whl", hash = "sha256:e1cf1972137e83c5d4c136c43ced9ac51d0e124706ee1c8aa8532c1287fa8795", size = 11622, upload-time = "2025-09-27T18:36:41.777Z" },
+    { url = "https://files.pythonhosted.org/packages/9c/d9/5f7756922cdd676869eca1c4e3c0cd0df60ed30199ffd775e319089cb3ed/markupsafe-3.0.3-cp313-cp313-macosx_11_0_arm64.whl", hash = "sha256:116bb52f642a37c115f517494ea5feb03889e04df47eeff5b130b1808ce7c219", size = 12029, upload-time = "2025-09-27T18:36:43.257Z" },
+    { url = "https://files.pythonhosted.org/packages/00/07/575a68c754943058c78f30db02ee03a64b3c638586fba6a6dd56830b30a3/markupsafe-3.0.3-cp313-cp313-manylinux2014_aarch64.manylinux_2_17_aarch64.manylinux_2_28_aarch64.whl", hash = "sha256:133a43e73a802c5562be9bbcd03d090aa5a1fe899db609c29e8c8d815c5f6de6", size = 24374, upload-time = "2025-09-27T18:36:44.508Z" },
+    { url = "https://files.pythonhosted.org/packages/a9/21/9b05698b46f218fc0e118e1f8168395c65c8a2c750ae2bab54fc4bd4e0e8/markupsafe-3.0.3-cp313-cp313-manylinux2014_x86_64.manylinux_2_17_x86_64.manylinux_2_28_x86_64.whl", hash = "sha256:ccfcd093f13f0f0b7fdd0f198b90053bf7b2f02a3927a30e63f3ccc9df56b676", size = 22980, upload-time = "2025-09-27T18:36:45.385Z" },
+    { url = "https://files.pythonhosted.org/packages/7f/71/544260864f893f18b6827315b988c146b559391e6e7e8f7252839b1b846a/markupsafe-3.0.3-cp313-cp313-manylinux_2_31_riscv64.manylinux_2_39_riscv64.whl", hash = "sha256:509fa21c6deb7a7a273d629cf5ec029bc209d1a51178615ddf718f5918992ab9", size = 21990, upload-time = "2025-09-27T18:36:46.916Z" },
+    { url = "https://files.pythonhosted.org/packages/c2/28/b50fc2f74d1ad761af2f5dcce7492648b983d00a65b8c0e0cb457c82ebbe/markupsafe-3.0.3-cp313-cp313-musllinux_1_2_aarch64.whl", hash = "sha256:a4afe79fb3de0b7097d81da19090f4df4f8d3a2b3adaa8764138aac2e44f3af1", size = 23784, upload-time = "2025-09-27T18:36:47.884Z" },
+    { url = "https://files.pythonhosted.org/packages/ed/76/104b2aa106a208da8b17a2fb72e033a5a9d7073c68f7e508b94916ed47a9/markupsafe-3.0.3-cp313-cp313-musllinux_1_2_riscv64.whl", hash = "sha256:795e7751525cae078558e679d646ae45574b47ed6e7771863fcc079a6171a0fc", size = 21588, upload-time = "2025-09-27T18:36:48.82Z" },
+    { url = "https://files.pythonhosted.org/packages/b5/99/16a5eb2d140087ebd97180d95249b00a03aa87e29cc224056274f2e45fd6/markupsafe-3.0.3-cp313-cp313-musllinux_1_2_x86_64.whl", hash = "sha256:8485f406a96febb5140bfeca44a73e3ce5116b2501ac54fe953e488fb1d03b12", size = 23041, upload-time = "2025-09-27T18:36:49.797Z" },
+    { url = "https://files.pythonhosted.org/packages/19/bc/e7140ed90c5d61d77cea142eed9f9c303f4c4806f60a1044c13e3f1471d0/markupsafe-3.0.3-cp313-cp313-win32.whl", hash = "sha256:bdd37121970bfd8be76c5fb069c7751683bdf373db1ed6c010162b2a130248ed", size = 14543, upload-time = "2025-09-27T18:36:51.584Z" },
+    { url = "https://files.pythonhosted.org/packages/05/73/c4abe620b841b6b791f2edc248f556900667a5a1cf023a6646967ae98335/markupsafe-3.0.3-cp313-cp313-win_amd64.whl", hash = "sha256:9a1abfdc021a164803f4d485104931fb8f8c1efd55bc6b748d2f5774e78b62c5", size = 15113, upload-time = "2025-09-27T18:36:52.537Z" },
+    { url = "https://files.pythonhosted.org/packages/f0/3a/fa34a0f7cfef23cf9500d68cb7c32dd64ffd58a12b09225fb03dd37d5b80/markupsafe-3.0.3-cp313-cp313-win_arm64.whl", hash = "sha256:7e68f88e5b8799aa49c85cd116c932a1ac15caaa3f5db09087854d218359e485", size = 13911, upload-time = "2025-09-27T18:36:53.513Z" },
+    { url = "https://files.pythonhosted.org/packages/e4/d7/e05cd7efe43a88a17a37b3ae96e79a19e846f3f456fe79c57ca61356ef01/markupsafe-3.0.3-cp313-cp313t-macosx_10_13_x86_64.whl", hash = "sha256:218551f6df4868a8d527e3062d0fb968682fe92054e89978594c28e642c43a73", size = 11658, upload-time = "2025-09-27T18:36:54.819Z" },
+    { url = "https://files.pythonhosted.org/packages/99/9e/e412117548182ce2148bdeacdda3bb494260c0b0184360fe0d56389b523b/markupsafe-3.0.3-cp313-cp313t-macosx_11_0_arm64.whl", hash = "sha256:3524b778fe5cfb3452a09d31e7b5adefeea8c5be1d43c4f810ba09f2ceb29d37", size = 12066, upload-time = "2025-09-27T18:36:55.714Z" },
+    { url = "https://files.pythonhosted.org/packages/bc/e6/fa0ffcda717ef64a5108eaa7b4f5ed28d56122c9a6d70ab8b72f9f715c80/markupsafe-3.0.3-cp313-cp313t-manylinux2014_aarch64.manylinux_2_17_aarch64.manylinux_2_28_aarch64.whl", hash = "sha256:4e885a3d1efa2eadc93c894a21770e4bc67899e3543680313b09f139e149ab19", size = 25639, upload-time = "2025-09-27T18:36:56.908Z" },
+    { url = "https://files.pythonhosted.org/packages/96/ec/2102e881fe9d25fc16cb4b25d5f5cde50970967ffa5dddafdb771237062d/markupsafe-3.0.3-cp313-cp313t-manylinux2014_x86_64.manylinux_2_17_x86_64.manylinux_2_28_x86_64.whl", hash = "sha256:8709b08f4a89aa7586de0aadc8da56180242ee0ada3999749b183aa23df95025", size = 23569, upload-time = "2025-09-27T18:36:57.913Z" },
+    { url = "https://files.pythonhosted.org/packages/4b/30/6f2fce1f1f205fc9323255b216ca8a235b15860c34b6798f810f05828e32/markupsafe-3.0.3-cp313-cp313t-manylinux_2_31_riscv64.manylinux_2_39_riscv64.whl", hash = "sha256:b8512a91625c9b3da6f127803b166b629725e68af71f8184ae7e7d54686a56d6", size = 23284, upload-time = "2025-09-27T18:36:58.833Z" },
+    { url = "https://files.pythonhosted.org/packages/58/47/4a0ccea4ab9f5dcb6f79c0236d954acb382202721e704223a8aafa38b5c8/markupsafe-3.0.3-cp313-cp313t-musllinux_1_2_aarch64.whl", hash = "sha256:9b79b7a16f7fedff2495d684f2b59b0457c3b493778c9eed31111be64d58279f", size = 24801, upload-time = "2025-09-27T18:36:59.739Z" },
+    { url = "https://files.pythonhosted.org/packages/6a/70/3780e9b72180b6fecb83a4814d84c3bf4b4ae4bf0b19c27196104149734c/markupsafe-3.0.3-cp313-cp313t-musllinux_1_2_riscv64.whl", hash = "sha256:12c63dfb4a98206f045aa9563db46507995f7ef6d83b2f68eda65c307c6829eb", size = 22769, upload-time = "2025-09-27T18:37:00.719Z" },
+    { url = "https://files.pythonhosted.org/packages/98/c5/c03c7f4125180fc215220c035beac6b9cb684bc7a067c84fc69414d315f5/markupsafe-3.0.3-cp313-cp313t-musllinux_1_2_x86_64.whl", hash = "sha256:8f71bc33915be5186016f675cd83a1e08523649b0e33efdb898db577ef5bb009", size = 23642, upload-time = "2025-09-27T18:37:01.673Z" },
+    { url = "https://files.pythonhosted.org/packages/80/d6/2d1b89f6ca4bff1036499b1e29a1d02d282259f3681540e16563f27ebc23/markupsafe-3.0.3-cp313-cp313t-win32.whl", hash = "sha256:69c0b73548bc525c8cb9a251cddf1931d1db4d2258e9599c28c07ef3580ef354", size = 14612, upload-time = "2025-09-27T18:37:02.639Z" },
+    { url = "https://files.pythonhosted.org/packages/2b/98/e48a4bfba0a0ffcf9925fe2d69240bfaa19c6f7507b8cd09c70684a53c1e/markupsafe-3.0.3-cp313-cp313t-win_amd64.whl", hash = "sha256:1b4b79e8ebf6b55351f0d91fe80f893b4743f104bff22e90697db1590e47a218", size = 15200, upload-time = "2025-09-27T18:37:03.582Z" },
+    { url = "https://files.pythonhosted.org/packages/0e/72/e3cc540f351f316e9ed0f092757459afbc595824ca724cbc5a5d4263713f/markupsafe-3.0.3-cp313-cp313t-win_arm64.whl", hash = "sha256:ad2cf8aa28b8c020ab2fc8287b0f823d0a7d8630784c31e9ee5edea20f406287", size = 13973, upload-time = "2025-09-27T18:37:04.929Z" },
+    { url = "https://files.pythonhosted.org/packages/33/8a/8e42d4838cd89b7dde187011e97fe6c3af66d8c044997d2183fbd6d31352/markupsafe-3.0.3-cp314-cp314-macosx_10_13_x86_64.whl", hash = "sha256:eaa9599de571d72e2daf60164784109f19978b327a3910d3e9de8c97b5b70cfe", size = 11619, upload-time = "2025-09-27T18:37:06.342Z" },
+    { url = "https://files.pythonhosted.org/packages/b5/64/7660f8a4a8e53c924d0fa05dc3a55c9cee10bbd82b11c5afb27d44b096ce/markupsafe-3.0.3-cp314-cp314-macosx_11_0_arm64.whl", hash = "sha256:c47a551199eb8eb2121d4f0f15ae0f923d31350ab9280078d1e5f12b249e0026", size = 12029, upload-time = "2025-09-27T18:37:07.213Z" },
+    { url = "https://files.pythonhosted.org/packages/da/ef/e648bfd021127bef5fa12e1720ffed0c6cbb8310c8d9bea7266337ff06de/markupsafe-3.0.3-cp314-cp314-manylinux2014_aarch64.manylinux_2_17_aarch64.manylinux_2_28_aarch64.whl", hash = "sha256:f34c41761022dd093b4b6896d4810782ffbabe30f2d443ff5f083e0cbbb8c737", size = 24408, upload-time = "2025-09-27T18:37:09.572Z" },
+    { url = "https://files.pythonhosted.org/packages/41/3c/a36c2450754618e62008bf7435ccb0f88053e07592e6028a34776213d877/markupsafe-3.0.3-cp314-cp314-manylinux2014_x86_64.manylinux_2_17_x86_64.manylinux_2_28_x86_64.whl", hash = "sha256:457a69a9577064c05a97c41f4e65148652db078a3a509039e64d3467b9e7ef97", size = 23005, upload-time = "2025-09-27T18:37:10.58Z" },
+    { url = "https://files.pythonhosted.org/packages/bc/20/b7fdf89a8456b099837cd1dc21974632a02a999ec9bf7ca3e490aacd98e7/markupsafe-3.0.3-cp314-cp314-manylinux_2_31_riscv64.manylinux_2_39_riscv64.whl", hash = "sha256:e8afc3f2ccfa24215f8cb28dcf43f0113ac3c37c2f0f0806d8c70e4228c5cf4d", size = 22048, upload-time = "2025-09-27T18:37:11.547Z" },
+    { url = "https://files.pythonhosted.org/packages/9a/a7/591f592afdc734f47db08a75793a55d7fbcc6902a723ae4cfbab61010cc5/markupsafe-3.0.3-cp314-cp314-musllinux_1_2_aarch64.whl", hash = "sha256:ec15a59cf5af7be74194f7ab02d0f59a62bdcf1a537677ce67a2537c9b87fcda", size = 23821, upload-time = "2025-09-27T18:37:12.48Z" },
+    { url = "https://files.pythonhosted.org/packages/7d/33/45b24e4f44195b26521bc6f1a82197118f74df348556594bd2262bda1038/markupsafe-3.0.3-cp314-cp314-musllinux_1_2_riscv64.whl", hash = "sha256:0eb9ff8191e8498cca014656ae6b8d61f39da5f95b488805da4bb029cccbfbaf", size = 21606, upload-time = "2025-09-27T18:37:13.485Z" },
+    { url = "https://files.pythonhosted.org/packages/ff/0e/53dfaca23a69fbfbbf17a4b64072090e70717344c52eaaaa9c5ddff1e5f0/markupsafe-3.0.3-cp314-cp314-musllinux_1_2_x86_64.whl", hash = "sha256:2713baf880df847f2bece4230d4d094280f4e67b1e813eec43b4c0e144a34ffe", size = 23043, upload-time = "2025-09-27T18:37:14.408Z" },
+    { url = "https://files.pythonhosted.org/packages/46/11/f333a06fc16236d5238bfe74daccbca41459dcd8d1fa952e8fbd5dccfb70/markupsafe-3.0.3-cp314-cp314-win32.whl", hash = "sha256:729586769a26dbceff69f7a7dbbf59ab6572b99d94576a5592625d5b411576b9", size = 14747, upload-time = "2025-09-27T18:37:15.36Z" },
+    { url = "https://files.pythonhosted.org/packages/28/52/182836104b33b444e400b14f797212f720cbc9ed6ba34c800639d154e821/markupsafe-3.0.3-cp314-cp314-win_amd64.whl", hash = "sha256:bdc919ead48f234740ad807933cdf545180bfbe9342c2bb451556db2ed958581", size = 15341, upload-time = "2025-09-27T18:37:16.496Z" },
+    { url = "https://files.pythonhosted.org/packages/6f/18/acf23e91bd94fd7b3031558b1f013adfa21a8e407a3fdb32745538730382/markupsafe-3.0.3-cp314-cp314-win_arm64.whl", hash = "sha256:5a7d5dc5140555cf21a6fefbdbf8723f06fcd2f63ef108f2854de715e4422cb4", size = 14073, upload-time = "2025-09-27T18:37:17.476Z" },
+    { url = "https://files.pythonhosted.org/packages/3c/f0/57689aa4076e1b43b15fdfa646b04653969d50cf30c32a102762be2485da/markupsafe-3.0.3-cp314-cp314t-macosx_10_13_x86_64.whl", hash = "sha256:1353ef0c1b138e1907ae78e2f6c63ff67501122006b0f9abad68fda5f4ffc6ab", size = 11661, upload-time = "2025-09-27T18:37:18.453Z" },
+    { url = "https://files.pythonhosted.org/packages/89/c3/2e67a7ca217c6912985ec766c6393b636fb0c2344443ff9d91404dc4c79f/markupsafe-3.0.3-cp314-cp314t-macosx_11_0_arm64.whl", hash = "sha256:1085e7fbddd3be5f89cc898938f42c0b3c711fdcb37d75221de2666af647c175", size = 12069, upload-time = "2025-09-27T18:37:19.332Z" },
+    { url = "https://files.pythonhosted.org/packages/f0/00/be561dce4e6ca66b15276e184ce4b8aec61fe83662cce2f7d72bd3249d28/markupsafe-3.0.3-cp314-cp314t-manylinux2014_aarch64.manylinux_2_17_aarch64.manylinux_2_28_aarch64.whl", hash = "sha256:1b52b4fb9df4eb9ae465f8d0c228a00624de2334f216f178a995ccdcf82c4634", size = 25670, upload-time = "2025-09-27T18:37:20.245Z" },
+    { url = "https://files.pythonhosted.org/packages/50/09/c419f6f5a92e5fadde27efd190eca90f05e1261b10dbd8cbcb39cd8ea1dc/markupsafe-3.0.3-cp314-cp314t-manylinux2014_x86_64.manylinux_2_17_x86_64.manylinux_2_28_x86_64.whl", hash = "sha256:fed51ac40f757d41b7c48425901843666a6677e3e8eb0abcff09e4ba6e664f50", size = 23598, upload-time = "2025-09-27T18:37:21.177Z" },
+    { url = "https://files.pythonhosted.org/packages/22/44/a0681611106e0b2921b3033fc19bc53323e0b50bc70cffdd19f7d679bb66/markupsafe-3.0.3-cp314-cp314t-manylinux_2_31_riscv64.manylinux_2_39_riscv64.whl", hash = "sha256:f190daf01f13c72eac4efd5c430a8de82489d9cff23c364c3ea822545032993e", size = 23261, upload-time = "2025-09-27T18:37:22.167Z" },
+    { url = "https://files.pythonhosted.org/packages/5f/57/1b0b3f100259dc9fffe780cfb60d4be71375510e435efec3d116b6436d43/markupsafe-3.0.3-cp314-cp314t-musllinux_1_2_aarch64.whl", hash = "sha256:e56b7d45a839a697b5eb268c82a71bd8c7f6c94d6fd50c3d577fa39a9f1409f5", size = 24835, upload-time = "2025-09-27T18:37:23.296Z" },
+    { url = "https://files.pythonhosted.org/packages/26/6a/4bf6d0c97c4920f1597cc14dd720705eca0bf7c787aebc6bb4d1bead5388/markupsafe-3.0.3-cp314-cp314t-musllinux_1_2_riscv64.whl", hash = "sha256:f3e98bb3798ead92273dc0e5fd0f31ade220f59a266ffd8a4f6065e0a3ce0523", size = 22733, upload-time = "2025-09-27T18:37:24.237Z" },
+    { url = "https://files.pythonhosted.org/packages/14/c7/ca723101509b518797fedc2fdf79ba57f886b4aca8a7d31857ba3ee8281f/markupsafe-3.0.3-cp314-cp314t-musllinux_1_2_x86_64.whl", hash = "sha256:5678211cb9333a6468fb8d8be0305520aa073f50d17f089b5b4b477ea6e67fdc", size = 23672, upload-time = "2025-09-27T18:37:25.271Z" },
+    { url = "https://files.pythonhosted.org/packages/fb/df/5bd7a48c256faecd1d36edc13133e51397e41b73bb77e1a69deab746ebac/markupsafe-3.0.3-cp314-cp314t-win32.whl", hash = "sha256:915c04ba3851909ce68ccc2b8e2cd691618c4dc4c4232fb7982bca3f41fd8c3d", size = 14819, upload-time = "2025-09-27T18:37:26.285Z" },
+    { url = "https://files.pythonhosted.org/packages/1a/8a/0402ba61a2f16038b48b39bccca271134be00c5c9f0f623208399333c448/markupsafe-3.0.3-cp314-cp314t-win_amd64.whl", hash = "sha256:4faffd047e07c38848ce017e8725090413cd80cbc23d86e55c587bf979e579c9", size = 15426, upload-time = "2025-09-27T18:37:27.316Z" },
+    { url = "https://files.pythonhosted.org/packages/70/bc/6f1c2f612465f5fa89b95bead1f44dcb607670fd42891d8fdcd5d039f4f4/markupsafe-3.0.3-cp314-cp314t-win_arm64.whl", hash = "sha256:32001d6a8fc98c8cb5c947787c5d08b0a50663d139f1305bac5885d98d9b40fa", size = 14146, upload-time = "2025-09-27T18:37:28.327Z" },
+]
+
+[[package]]
+name = "mpmath"
+version = "1.3.0"
+source = { registry = "https://pypi.org/simple" }
+sdist = { url = "https://files.pythonhosted.org/packages/e0/47/dd32fa426cc72114383ac549964eecb20ecfd886d1e5ccf5340b55b02f57/mpmath-1.3.0.tar.gz", hash = "sha256:7a28eb2a9774d00c7bc92411c19a89209d5da7c4c9a9e227be8330a23a25b91f", size = 508106, upload-time = "2023-03-07T16:47:11.061Z" }
+wheels = [
+    { url = "https://files.pythonhosted.org/packages/43/e3/7d92a15f894aa0c9c4b49b8ee9ac9850d6e63b03c9c32c0367a13ae62209/mpmath-1.3.0-py3-none-any.whl", hash = "sha256:a0b2b9fe80bbcd81a6647ff13108738cfb482d481d826cc0e02f5b35e5c88d2c", size = 536198, upload-time = "2023-03-07T16:47:09.197Z" },
+]
+
+[[package]]
+name = "msgpack"
+version = "1.2.1"
+source = { registry = "https://pypi.org/simple" }
+sdist = { url = "https://files.pythonhosted.org/packages/31/f9/c0a1c127f9049db9155afc316952ea571720dd01833ff5e4d7e8e6352dbb/msgpack-1.2.1.tar.gz", hash = "sha256:04c721c2c7448767e9e3f2520a475663d8ee0f09c31890f6d2bd70fd636a9647", size = 183960, upload-time = "2026-06-18T16:13:52.594Z" }
+wheels = [
+    { url = "https://files.pythonhosted.org/packages/f4/6b/e9b1cdc042c4458801d2545ed782a95f3d6ba8e270cce8745b8603c7f748/msgpack-1.2.1-cp311-cp311-macosx_10_9_x86_64.whl", hash = "sha256:29a3f6e9667868429d8240dfd063ea5ffdc1321c13d783aa23827a38de0dcb22", size = 82812, upload-time = "2026-06-18T16:12:45.022Z" },
+    { url = "https://files.pythonhosted.org/packages/0c/3a/dd518a1bf78ed1e9ad8afe57307c079a00eafe4b3068932a27ca1ea56b4f/msgpack-1.2.1-cp311-cp311-macosx_11_0_arm64.whl", hash = "sha256:aded5bdf32609dc7987a49bbbd15a8ef096193f96dd8bbeb791de729e650acf5", size = 82739, upload-time = "2026-06-18T16:12:46.025Z" },
+    { url = "https://files.pythonhosted.org/packages/70/e0/7ba9e1542bf0771a27b8b37c1316e3f95ae9d748fd765284655c476ad4ef/msgpack-1.2.1-cp311-cp311-manylinux2014_aarch64.manylinux_2_17_aarch64.manylinux_2_28_aarch64.whl", hash = "sha256:146ee4e9ce80b365c6d4c47073da9da7bcec473e58194ceee5dd7620ace77e06", size = 414233, upload-time = "2026-06-18T16:12:47.029Z" },
+    { url = "https://files.pythonhosted.org/packages/03/8d/671d81534ea0e2b0e8a121be100020da09eb78861fe3aa8f3ef7dcd3bed1/msgpack-1.2.1-cp311-cp311-manylinux2014_x86_64.manylinux_2_17_x86_64.manylinux_2_28_x86_64.whl", hash = "sha256:a28d076ca7c82b9c8728ad90b7147489449557038bed50e4241eb832395169b4", size = 423843, upload-time = "2026-06-18T16:12:48.19Z" },
+    { url = "https://files.pythonhosted.org/packages/d2/b6/e5c737515ed1f166664b87601b532f58cbb73d8aa6a90b99f7c2c5037e8e/msgpack-1.2.1-cp311-cp311-manylinux_2_31_riscv64.manylinux_2_39_riscv64.whl", hash = "sha256:7d31c0ac0c640f877804c67cb2bc9f4e23dc2db97e96c2e67fa27d38283b41f8", size = 390772, upload-time = "2026-06-18T16:12:49.624Z" },
+    { url = "https://files.pythonhosted.org/packages/a8/46/62ed8c2e87d7021eab19921594d961ef3aa3794eec76c716dc30f3bfd433/msgpack-1.2.1-cp311-cp311-musllinux_1_2_aarch64.whl", hash = "sha256:8ff92d7feeaf5bc26c51495b69e2f99ed97ab79346fb6555f44be7dd2ac6503b", size = 409559, upload-time = "2026-06-18T16:12:50.936Z" },
+    { url = "https://files.pythonhosted.org/packages/70/ff/59aa3887b860bbf43532835e192b1c388a17590d6068ae4f8b2bc74c906e/msgpack-1.2.1-cp311-cp311-musllinux_1_2_riscv64.whl", hash = "sha256:779197a6513bab3c3632265e3d0f7cb3227e62510841a6f34f1eaa37efbb345e", size = 387838, upload-time = "2026-06-18T16:12:52.161Z" },
+    { url = "https://files.pythonhosted.org/packages/09/11/f8563e471093420cf6478cb3271a0175d8402b82d879783d4035d2d03360/msgpack-1.2.1-cp311-cp311-musllinux_1_2_x86_64.whl", hash = "sha256:67f6dd22fa72a93752643f07889796d62739a13415ee630169a8ce764f86cf9f", size = 421732, upload-time = "2026-06-18T16:12:53.556Z" },
+    { url = "https://files.pythonhosted.org/packages/57/cf/e673683c4c6c90c1022b24c65af4b03eda72b182a1176ef6449069d66acc/msgpack-1.2.1-cp311-cp311-win32.whl", hash = "sha256:91054a783328e0ea7954b8771095705c8d2243b814743fbaadf14552c9c52c5d", size = 64091, upload-time = "2026-06-18T16:12:54.821Z" },
+    { url = "https://files.pythonhosted.org/packages/3f/07/ca212739d179f9083bff2c7c08c24101c3555a334fadc2b876b18768a3ae/msgpack-1.2.1-cp311-cp311-win_amd64.whl", hash = "sha256:2eda0b7ebb1283a98d3e4492ac933c8af6aff59fd3df1c3ed024f536af4b1dc8", size = 70462, upload-time = "2026-06-18T16:12:55.898Z" },
+    { url = "https://files.pythonhosted.org/packages/6d/be/6798347b425e26f35db82e69dd83c09716c856a3714e7bffc4c0860fd830/msgpack-1.2.1-cp311-cp311-win_arm64.whl", hash = "sha256:6ee967f7c7e1df2890c671ff2ee51a28ded0efc95da3e507176dee881ce36c66", size = 65059, upload-time = "2026-06-18T16:12:57.053Z" },
+    { url = "https://files.pythonhosted.org/packages/bc/dd/9e8cbd8f5582ca4b590336f2b91ee5662f6a6ca562b565abaf696a0f81ff/msgpack-1.2.1-cp312-cp312-macosx_10_13_x86_64.whl", hash = "sha256:2ef59c659f289eddf8aa6623823f19fa2f40a4029266889eac7a2505dd210c35", size = 83531, upload-time = "2026-06-18T16:12:58.249Z" },
+    { url = "https://files.pythonhosted.org/packages/50/2e/ebdb85a8da151397a2790363676b7ed7c125924fe618e4c6d8befb0cc62c/msgpack-1.2.1-cp312-cp312-macosx_11_0_arm64.whl", hash = "sha256:d3567748a5107cb40cdf66a275430c2f87c07777698f4bfd25c35f44d533258c", size = 82657, upload-time = "2026-06-18T16:12:59.396Z" },
+    { url = "https://files.pythonhosted.org/packages/26/aa/753ad8b007b464e1d8aa0c8e650b9c5f4f725e658fc5ac8a7635c55b7f6e/msgpack-1.2.1-cp312-cp312-manylinux2014_aarch64.manylinux_2_17_aarch64.manylinux_2_28_aarch64.whl", hash = "sha256:60926b75d00c8e816ef98f3034f484a8bc64242d66839cef4cf7e503142316a0", size = 410634, upload-time = "2026-06-18T16:13:00.383Z" },
+    { url = "https://files.pythonhosted.org/packages/6a/fd/6adabd4f6d5e686f97dd02ce7fce3fe4cf672cbac36b8f67ff4040e8ad8b/msgpack-1.2.1-cp312-cp312-manylinux2014_x86_64.manylinux_2_17_x86_64.manylinux_2_28_x86_64.whl", hash = "sha256:020e881a764b20d8d7ca1a54fc01b8175519d108e3c3f194fddc200bda95951a", size = 419989, upload-time = "2026-06-18T16:13:01.776Z" },
+    { url = "https://files.pythonhosted.org/packages/5a/cc/85039b7b0eb168aaad7383a23c97e291a11f08351cb45a606ce865e4e3f1/msgpack-1.2.1-cp312-cp312-manylinux_2_31_riscv64.manylinux_2_39_riscv64.whl", hash = "sha256:4202c74688ca06591f78cb18988228bd4cca2cc75d57b60008372892d2f1e6e6", size = 377544, upload-time = "2026-06-18T16:13:03.637Z" },
+    { url = "https://files.pythonhosted.org/packages/ed/bf/35963899493b32030c85fc513b723ae66144ac70c11ebc52e889e16e3d99/msgpack-1.2.1-cp312-cp312-musllinux_1_2_aarch64.whl", hash = "sha256:8b267ce94efb76fbd1b3373511420074ee3187f0f7811bf394531de13294735a", size = 400842, upload-time = "2026-06-18T16:13:05.012Z" },
+    { url = "https://files.pythonhosted.org/packages/a6/df/8e2ac970c8f99264cd9997d1c73df5466bc19da3301d7dc5500862a9b089/msgpack-1.2.1-cp312-cp312-musllinux_1_2_riscv64.whl", hash = "sha256:e4f1d0f8f98ade9634e01fb704a408f9336c0a8f1117b369f5db83dc7551d8b1", size = 374108, upload-time = "2026-06-18T16:13:06.232Z" },
+    { url = "https://files.pythonhosted.org/packages/17/dd/fa8bd265110dfa51c20cb529f9e6d240a16fafe7e645004c6af2d01353ba/msgpack-1.2.1-cp312-cp312-musllinux_1_2_x86_64.whl", hash = "sha256:f02cf17a6ca1abe29b5f980644f7551f94d71f2011509b26d8625ce038f0df64", size = 414939, upload-time = "2026-06-18T16:13:07.478Z" },
+    { url = "https://files.pythonhosted.org/packages/2e/b9/8377a5ad8953fc0437c70cc98d9ae29f27fe5ac5109fbec0812085865735/msgpack-1.2.1-cp312-cp312-win32.whl", hash = "sha256:0c0d9802354507bcba62af19c17918e3eb437cc25e6f50657d511b5856a77aac", size = 64504, upload-time = "2026-06-18T16:13:08.822Z" },
+    { url = "https://files.pythonhosted.org/packages/57/7f/ce1e377df7e62461fefd9eb23bfb93a4a523f40a517b377b8f844d836828/msgpack-1.2.1-cp312-cp312-win_amd64.whl", hash = "sha256:5c24aa15d5963051e1a5c62b12c50cd705992502b5ec1f3bece6046f33c9fc24", size = 71421, upload-time = "2026-06-18T16:13:09.828Z" },
+    { url = "https://files.pythonhosted.org/packages/8f/32/ebfe84c9929f08f188d56c7a2fd913406a9ddad76a634697c1c43b8112e6/msgpack-1.2.1-cp312-cp312-win_arm64.whl", hash = "sha256:4227224aaec8f7fbcbfbd4272319347b2bb4030366502600f8c45588c5187b07", size = 64775, upload-time = "2026-06-18T16:13:11.056Z" },
+    { url = "https://files.pythonhosted.org/packages/b0/ac/dcddcab6f6c20ecb387ca5e980371cdb3f87ff69aeca388be97eebc4c074/msgpack-1.2.1-cp313-cp313-macosx_10_13_x86_64.whl", hash = "sha256:0a70e3cf2804a300d921bb0940426e35f4e489a23adfb77a808892241db0a064", size = 83151, upload-time = "2026-06-18T16:13:12.173Z" },
+    { url = "https://files.pythonhosted.org/packages/64/71/fbcfa83a1d6a9c6091942d1cfd070962244664b87427a9a49a6897b1b219/msgpack-1.2.1-cp313-cp313-macosx_11_0_arm64.whl", hash = "sha256:491cc39455ca765fad51fb451bf2915eb2cf41192ab5801ce8d67c1d614fe056", size = 82351, upload-time = "2026-06-18T16:13:13.194Z" },
+    { url = "https://files.pythonhosted.org/packages/e3/10/ddf7b06db879e8792d13934ddda09ff20bd2a583fd84c9b59aae9b0e650b/msgpack-1.2.1-cp313-cp313-manylinux2014_aarch64.manylinux_2_17_aarch64.manylinux_2_28_aarch64.whl", hash = "sha256:f310233ef7fb9c14e201c93639fe5f5260b005f56f0b29048e999c30935596cc", size = 407518, upload-time = "2026-06-18T16:13:14.233Z" },
+    { url = "https://files.pythonhosted.org/packages/79/d3/36a46a8ed992b781acbc05928bd5bee3c810cb0c3563bf81a7b0c04a1a76/msgpack-1.2.1-cp313-cp313-manylinux2014_x86_64.manylinux_2_17_x86_64.manylinux_2_28_x86_64.whl", hash = "sha256:787c9bebb5833e8f6fc8abca3c0597683d8d87f56a8842b6b89c75a5f3176e2d", size = 416405, upload-time = "2026-06-18T16:13:15.435Z" },
+    { url = "https://files.pythonhosted.org/packages/f9/84/e8e9598b557c0ba6ddae901a73780a4c75ac667dddf59414b1e56a42fb34/msgpack-1.2.1-cp313-cp313-manylinux_2_31_riscv64.manylinux_2_39_riscv64.whl", hash = "sha256:dc871b997a9370d855b7394465f2f350e847a5b806dd38dcc9c989e7d87da155", size = 376257, upload-time = "2026-06-18T16:13:17.022Z" },
+    { url = "https://files.pythonhosted.org/packages/40/16/738fe6d875ad7e2a9429c165322a4ec088f4f273cdfae63d96a89c467961/msgpack-1.2.1-cp313-cp313-musllinux_1_2_aarch64.whl", hash = "sha256:85f57e960d877f2977f6430896191b04a21f8901b3b4baf2e4604329f4db5402", size = 397469, upload-time = "2026-06-18T16:13:18.287Z" },
+    { url = "https://files.pythonhosted.org/packages/ca/be/6d5952df75a7f24f35833af764c3a6860780364cb3a0030beb8099e1b2b4/msgpack-1.2.1-cp313-cp313-musllinux_1_2_riscv64.whl", hash = "sha256:1233ee2dd0cefba127583de50ea654677277047d238303521db35def3d7b2e7c", size = 372802, upload-time = "2026-06-18T16:13:19.685Z" },
+    { url = "https://files.pythonhosted.org/packages/e1/39/e2ef7dbf0473bcb8dc7c50bf782a892d67414877b63e47fc88eb189ef5e6/msgpack-1.2.1-cp313-cp313-musllinux_1_2_x86_64.whl", hash = "sha256:e3dc2feb0876209d9c38aa56cb1de169bd6c4348f1aa48271f241226590993e6", size = 411273, upload-time = "2026-06-18T16:13:21.028Z" },
+    { url = "https://files.pythonhosted.org/packages/ef/c5/133f4512a56e983a93445c836c9d94d88f3bc2e0980ff4b9e577bd8416ce/msgpack-1.2.1-cp313-cp313-win32.whl", hash = "sha256:6d09badf350af2be9d189184e04e64cf54ad93569ab3d96fca58bd3e84aad707", size = 64471, upload-time = "2026-06-18T16:13:22.293Z" },
+    { url = "https://files.pythonhosted.org/packages/e2/98/577e10b055096a7dd40732358cabaf7180a20c79ed1dcdbb618e4b9deac7/msgpack-1.2.1-cp313-cp313-win_amd64.whl", hash = "sha256:33f14fba63278b714efe6ad07e50ea5f03d91537aa6a1c5f1ceca4cf44013ca9", size = 71274, upload-time = "2026-06-18T16:13:23.455Z" },
+    { url = "https://files.pythonhosted.org/packages/ba/ee/0c0048e7cfbef23c6a94791b8959ab28155232e7956de8a305b5ff588f05/msgpack-1.2.1-cp313-cp313-win_arm64.whl", hash = "sha256:afc5febcd4c99effbc02b528e49d6fd0760b2b7d48c05239e345a5fa6e743d9a", size = 64795, upload-time = "2026-06-18T16:13:24.687Z" },
+    { url = "https://files.pythonhosted.org/packages/77/58/cce442852c6b9e1639c7c8ac8fd9143121cb32dab0f308df4d1426a8eb9c/msgpack-1.2.1-cp314-cp314-macosx_10_15_x86_64.whl", hash = "sha256:05f340e47e7e47d2da8db9b53e1bb1d294369e9ef45a747441309f6650b8351d", size = 83610, upload-time = "2026-06-18T16:13:25.724Z" },
+    { url = "https://files.pythonhosted.org/packages/60/5c/15b4c7a0182f75ffa90751958ba36a9c01cafee367d49a3edc10ed140b01/msgpack-1.2.1-cp314-cp314-macosx_11_0_arm64.whl", hash = "sha256:810b916696c86ef0deb3b74588480224df4c1b071136c34183e4a2a4284d7ac7", size = 83138, upload-time = "2026-06-18T16:13:26.781Z" },
+    { url = "https://files.pythonhosted.org/packages/b8/a6/99e58722feaffc5f2fbcc0c8c0d1451ab9f84097f7af87291b46af2390f4/msgpack-1.2.1-cp314-cp314-manylinux2014_aarch64.manylinux_2_17_aarch64.manylinux_2_28_aarch64.whl", hash = "sha256:ca0dacff965c47afdc3749a8469d7302a8f801d6a28758d55120d75e66ce6889", size = 406090, upload-time = "2026-06-18T16:13:28.072Z" },
+    { url = "https://files.pythonhosted.org/packages/19/03/8c63e8cf52958534ef688625965ab04c269a6cadd8caef16758b380a821a/msgpack-1.2.1-cp314-cp314-manylinux2014_x86_64.manylinux_2_17_x86_64.manylinux_2_28_x86_64.whl", hash = "sha256:0e2bf9280bceb5efca998435904b5d3e9fdbcc11d90dc9df30aec7973252b720", size = 412106, upload-time = "2026-06-18T16:13:29.427Z" },
+    { url = "https://files.pythonhosted.org/packages/63/d2/155d9e71b40e41fd934bc0c48b9b2770f22263e1ac20aad8e29fdca7be3f/msgpack-1.2.1-cp314-cp314-manylinux_2_31_riscv64.manylinux_2_39_riscv64.whl", hash = "sha256:aa6c4be5d1c02a42b066ca6ddb71adf36432868fdcdb6ee87e634e86e0674190", size = 374851, upload-time = "2026-06-18T16:13:30.631Z" },
+    { url = "https://files.pythonhosted.org/packages/98/48/deaf2326262a8d5ea3295ce9649912ecd3f551ba7ec8e33c665d2ba583f3/msgpack-1.2.1-cp314-cp314-musllinux_1_2_aarch64.whl", hash = "sha256:ec0e675d59150a6269ddc9139087c722292664a37d071a849c05c473350f1f2d", size = 396168, upload-time = "2026-06-18T16:13:31.977Z" },
+    { url = "https://files.pythonhosted.org/packages/10/2a/b4410f906c2ec0008f1608d3ab5143afc3ad3f4e6da0fed3ea2231d0bef4/msgpack-1.2.1-cp314-cp314-musllinux_1_2_riscv64.whl", hash = "sha256:dd3bfe82d53edfe4b7fc9a7ec9761e23a7a5b1dac22264505af428253c29ed24", size = 371959, upload-time = "2026-06-18T16:13:33.282Z" },
+    { url = "https://files.pythonhosted.org/packages/59/86/1edc67270099a528fa2093ea60fe191233cd238e4bd30cfacf7db79fc959/msgpack-1.2.1-cp314-cp314-musllinux_1_2_x86_64.whl", hash = "sha256:5ad5467fc3f68b5468e06c5f788d712e9f8ffc8b0cd1bcb160c105c1ee92dae7", size = 408457, upload-time = "2026-06-18T16:13:34.567Z" },
+    { url = "https://files.pythonhosted.org/packages/82/90/8b630fef07d8c5ab457b71ff2c217910c83d333c7a68472c186e87cc504a/msgpack-1.2.1-cp314-cp314-win32.whl", hash = "sha256:98b58bdb89c46190e4609bb36abe17c6d4105ad13f9c5f8f6f64d320f8ced3fb", size = 65942, upload-time = "2026-06-18T16:13:36.056Z" },
+    { url = "https://files.pythonhosted.org/packages/16/f1/467b81e98b24dd3885d7b1857728797b4ffc76a7a7483af4fb321a07de3c/msgpack-1.2.1-cp314-cp314-win_amd64.whl", hash = "sha256:74847557e28ce71bd3c438a447ca90e4b507e997ddbdef8a12a7b283b86c156b", size = 72627, upload-time = "2026-06-18T16:13:37.079Z" },
+    { url = "https://files.pythonhosted.org/packages/a7/1d/5d8c4c89985feb6acefb82a09e501c60392261856d2408d20bfe4f0360b1/msgpack-1.2.1-cp314-cp314-win_arm64.whl", hash = "sha256:b50b727bd652bdc37d950336c848ef20ec54a4cafc38dce19b1cd86ad625d0f7", size = 66908, upload-time = "2026-06-18T16:13:38.23Z" },
+    { url = "https://files.pythonhosted.org/packages/1b/02/ad2afb678b4de94496cd432b581759b756a92c1192d8c767edd6b132efdc/msgpack-1.2.1-cp314-cp314t-macosx_10_15_x86_64.whl", hash = "sha256:8d00f177ca88a77c1cf848d204a38f249751650b601cb6532acc68805d8a8273", size = 86000, upload-time = "2026-06-18T16:13:39.44Z" },
+    { url = "https://files.pythonhosted.org/packages/54/74/0b797484013128837f3b1cbb6cea019277c4de4e377dc512b4d9a0f92940/msgpack-1.2.1-cp314-cp314t-macosx_11_0_arm64.whl", hash = "sha256:5bb9c386f0a329c035ddbab4b72d1028bf9627add8dda41070288563d57ed1b1", size = 86544, upload-time = "2026-06-18T16:13:40.447Z" },
+    { url = "https://files.pythonhosted.org/packages/a9/b4/b774d7eb95561739907fec675582f83203cf41c597a418c2589b4bfb8e9d/msgpack-1.2.1-cp314-cp314t-manylinux2014_aarch64.manylinux_2_17_aarch64.manylinux_2_28_aarch64.whl", hash = "sha256:20466cca18c49c7292a8984bc15d65857b171e7264bdcb5f96baf8be238791fc", size = 427661, upload-time = "2026-06-18T16:13:41.574Z" },
+    { url = "https://files.pythonhosted.org/packages/b2/f9/3243191dc9937e00756c8bc1b0272fed8f23758e43df2a3b46f533e5090f/msgpack-1.2.1-cp314-cp314t-manylinux2014_x86_64.manylinux_2_17_x86_64.manylinux_2_28_x86_64.whl", hash = "sha256:196300e7e5d6e74d50f1607ab9c06c4a1484c383cd22defd727902591f7e8dde", size = 426375, upload-time = "2026-06-18T16:13:42.936Z" },
+    { url = "https://files.pythonhosted.org/packages/23/c7/1693111db9944ba4ad4b67a1e788400d78a0b6af7a6523dc7e4e58f8274b/msgpack-1.2.1-cp314-cp314t-manylinux_2_31_riscv64.manylinux_2_39_riscv64.whl", hash = "sha256:575957e79cd51903a4e8495a242442949641e08f1efd5197b43bebd3ea7682b4", size = 380495, upload-time = "2026-06-18T16:13:44.306Z" },
+    { url = "https://files.pythonhosted.org/packages/3e/2b/92f86956a0c13e8662f7e2ad630c4eb4db07497b967589bd5245e018b2c1/msgpack-1.2.1-cp314-cp314t-musllinux_1_2_aarch64.whl", hash = "sha256:8c2ed1e48cc0f460bf3c7780e7137ff21a4e18433451916f2442c1b21036cd7d", size = 410897, upload-time = "2026-06-18T16:13:45.629Z" },
+    { url = "https://files.pythonhosted.org/packages/da/ea/1479f72d200313a76fc2f823a79d1e07ed052ab7b8a0280640aa7b95de42/msgpack-1.2.1-cp314-cp314t-musllinux_1_2_riscv64.whl", hash = "sha256:5f6277e5f783c36786a145e0247fc189a03f35f84b251646e53592d2bc12b355", size = 378519, upload-time = "2026-06-18T16:13:46.998Z" },
+    { url = "https://files.pythonhosted.org/packages/f5/4d/fa006060ffa1011d32bfae826fe766fe73e02982183601633b7121058ab3/msgpack-1.2.1-cp314-cp314t-musllinux_1_2_x86_64.whl", hash = "sha256:f9389552ecf4784886345ead0647e4edc96bee37cbab05b75540f542f766c48c", size = 419815, upload-time = "2026-06-18T16:13:48.205Z" },
+    { url = "https://files.pythonhosted.org/packages/2f/e1/aab6c946570496b78e67804721f3d5e2d62a93081b9b37df77764ef56347/msgpack-1.2.1-cp314-cp314t-win32.whl", hash = "sha256:c1c79a604a2969a868a78b6ebd27a887e00c624f14f66b3038e0590cb23332d1", size = 70914, upload-time = "2026-06-18T16:13:49.385Z" },
+    { url = "https://files.pythonhosted.org/packages/13/0a/e608956488a2af014cfe6e3d665e090b8ee42aa14b07f8f95b8880d66b09/msgpack-1.2.1-cp314-cp314t-win_amd64.whl", hash = "sha256:f12038a35fabd52e56a3547bab42401af49a45caa6dd00b34c44de235bc93ee2", size = 77999, upload-time = "2026-06-18T16:13:50.467Z" },
+    { url = "https://files.pythonhosted.org/packages/d2/8a/27e2e57055176e366a46b85d02d68e7a5bcfbdd8474c9706375d965f24d3/msgpack-1.2.1-cp314-cp314t-win_arm64.whl", hash = "sha256:0adcf06ffde0777c0e1a9b771a2b1c4226ba1bbf748c8efcc02fcdeca3299107", size = 71160, upload-time = "2026-06-18T16:13:51.498Z" },
+]
+
+[[package]]
+name = "multidict"
+version = "6.7.1"
+source = { registry = "https://pypi.org/simple" }
+sdist = { url = "https://files.pythonhosted.org/packages/1a/c2/c2d94cbe6ac1753f3fc980da97b3d930efe1da3af3c9f5125354436c073d/multidict-6.7.1.tar.gz", hash = "sha256:ec6652a1bee61c53a3e5776b6049172c53b6aaba34f18c9ad04f82712bac623d", size = 102010, upload-time = "2026-01-26T02:46:45.979Z" }
+wheels = [
+    { url = "https://files.pythonhosted.org/packages/ce/f1/a90635c4f88fb913fbf4ce660b83b7445b7a02615bda034b2f8eb38fd597/multidict-6.7.1-cp311-cp311-macosx_10_9_universal2.whl", hash = "sha256:7ff981b266af91d7b4b3793ca3382e53229088d193a85dfad6f5f4c27fc73e5d", size = 76626, upload-time = "2026-01-26T02:43:26.485Z" },
+    { url = "https://files.pythonhosted.org/packages/a6/9b/267e64eaf6fc637a15b35f5de31a566634a2740f97d8d094a69d34f524a4/multidict-6.7.1-cp311-cp311-macosx_10_9_x86_64.whl", hash = "sha256:844c5bca0b5444adb44a623fb0a1310c2f4cd41f402126bb269cd44c9b3f3e1e", size = 44706, upload-time = "2026-01-26T02:43:27.607Z" },
+    { url = "https://files.pythonhosted.org/packages/dd/a4/d45caf2b97b035c57267791ecfaafbd59c68212004b3842830954bb4b02e/multidict-6.7.1-cp311-cp311-macosx_11_0_arm64.whl", hash = "sha256:f2a0a924d4c2e9afcd7ec64f9de35fcd96915149b2216e1cb2c10a56df483855", size = 44356, upload-time = "2026-01-26T02:43:28.661Z" },
+    { url = "https://files.pythonhosted.org/packages/fd/d2/0a36c8473f0cbaeadd5db6c8b72d15bbceeec275807772bfcd059bef487d/multidict-6.7.1-cp311-cp311-manylinux1_i686.manylinux_2_28_i686.manylinux_2_5_i686.whl", hash = "sha256:8be1802715a8e892c784c0197c2ace276ea52702a0ede98b6310c8f255a5afb3", size = 244355, upload-time = "2026-01-26T02:43:31.165Z" },
+    { url = "https://files.pythonhosted.org/packages/5d/16/8c65be997fd7dd311b7d39c7b6e71a0cb449bad093761481eccbbe4b42a2/multidict-6.7.1-cp311-cp311-manylinux2014_aarch64.manylinux_2_17_aarch64.manylinux_2_28_aarch64.whl", hash = "sha256:2e2d2ed645ea29f31c4c7ea1552fcfd7cb7ba656e1eafd4134a6620c9f5fdd9e", size = 246433, upload-time = "2026-01-26T02:43:32.581Z" },
+    { url = "https://files.pythonhosted.org/packages/01/fb/4dbd7e848d2799c6a026ec88ad39cf2b8416aa167fcc903baa55ecaa045c/multidict-6.7.1-cp311-cp311-manylinux2014_armv7l.manylinux_2_17_armv7l.manylinux_2_31_armv7l.whl", hash = "sha256:95922cee9a778659e91db6497596435777bd25ed116701a4c034f8e46544955a", size = 225376, upload-time = "2026-01-26T02:43:34.417Z" },
+    { url = "https://files.pythonhosted.org/packages/b6/8a/4a3a6341eac3830f6053062f8fbc9a9e54407c80755b3f05bc427295c2d0/multidict-6.7.1-cp311-cp311-manylinux2014_ppc64le.manylinux_2_17_ppc64le.manylinux_2_28_ppc64le.whl", hash = "sha256:6b83cabdc375ffaaa15edd97eb7c0c672ad788e2687004990074d7d6c9b140c8", size = 257365, upload-time = "2026-01-26T02:43:35.741Z" },
+    { url = "https://files.pythonhosted.org/packages/f7/a2/dd575a69c1aa206e12d27d0770cdf9b92434b48a9ef0cd0d1afdecaa93c4/multidict-6.7.1-cp311-cp311-manylinux2014_s390x.manylinux_2_17_s390x.manylinux_2_28_s390x.whl", hash = "sha256:38fb49540705369bab8484db0689d86c0a33a0a9f2c1b197f506b71b4b6c19b0", size = 254747, upload-time = "2026-01-26T02:43:36.976Z" },
+    { url = "https://files.pythonhosted.org/packages/5a/56/21b27c560c13822ed93133f08aa6372c53a8e067f11fbed37b4adcdac922/multidict-6.7.1-cp311-cp311-manylinux2014_x86_64.manylinux_2_17_x86_64.manylinux_2_28_x86_64.whl", hash = "sha256:439cbebd499f92e9aa6793016a8acaa161dfa749ae86d20960189f5398a19144", size = 246293, upload-time = "2026-01-26T02:43:38.258Z" },
+    { url = "https://files.pythonhosted.org/packages/5a/a4/23466059dc3854763423d0ad6c0f3683a379d97673b1b89ec33826e46728/multidict-6.7.1-cp311-cp311-musllinux_1_2_aarch64.whl", hash = "sha256:6d3bc717b6fe763b8be3f2bee2701d3c8eb1b2a8ae9f60910f1b2860c82b6c49", size = 242962, upload-time = "2026-01-26T02:43:40.034Z" },
+    { url = "https://files.pythonhosted.org/packages/1f/67/51dd754a3524d685958001e8fa20a0f5f90a6a856e0a9dcabff69be3dbb7/multidict-6.7.1-cp311-cp311-musllinux_1_2_armv7l.whl", hash = "sha256:619e5a1ac57986dbfec9f0b301d865dddf763696435e2962f6d9cf2fdff2bb71", size = 237360, upload-time = "2026-01-26T02:43:41.752Z" },
+    { url = "https://files.pythonhosted.org/packages/64/3f/036dfc8c174934d4b55d86ff4f978e558b0e585cef70cfc1ad01adc6bf18/multidict-6.7.1-cp311-cp311-musllinux_1_2_i686.whl", hash = "sha256:0b38ebffd9be37c1170d33bc0f36f4f262e0a09bc1aac1c34c7aa51a7293f0b3", size = 245940, upload-time = "2026-01-26T02:43:43.042Z" },
+    { url = "https://files.pythonhosted.org/packages/3d/20/6214d3c105928ebc353a1c644a6ef1408bc5794fcb4f170bb524a3c16311/multidict-6.7.1-cp311-cp311-musllinux_1_2_ppc64le.whl", hash = "sha256:10ae39c9cfe6adedcdb764f5e8411d4a92b055e35573a2eaa88d3323289ef93c", size = 253502, upload-time = "2026-01-26T02:43:44.371Z" },
+    { url = "https://files.pythonhosted.org/packages/b1/e2/c653bc4ae1be70a0f836b82172d643fcf1dade042ba2676ab08ec08bff0f/multidict-6.7.1-cp311-cp311-musllinux_1_2_s390x.whl", hash = "sha256:25167cc263257660290fba06b9318d2026e3c910be240a146e1f66dd114af2b0", size = 247065, upload-time = "2026-01-26T02:43:45.745Z" },
+    { url = "https://files.pythonhosted.org/packages/c8/11/a854b4154cd3bd8b1fd375e8a8ca9d73be37610c361543d56f764109509b/multidict-6.7.1-cp311-cp311-musllinux_1_2_x86_64.whl", hash = "sha256:128441d052254f42989ef98b7b6a6ecb1e6f708aa962c7984235316db59f50fa", size = 241870, upload-time = "2026-01-26T02:43:47.054Z" },
+    { url = "https://files.pythonhosted.org/packages/13/bf/9676c0392309b5fdae322333d22a829715b570edb9baa8016a517b55b558/multidict-6.7.1-cp311-cp311-win32.whl", hash = "sha256:d62b7f64ffde3b99d06b707a280db04fb3855b55f5a06df387236051d0668f4a", size = 41302, upload-time = "2026-01-26T02:43:48.753Z" },
+    { url = "https://files.pythonhosted.org/packages/c9/68/f16a3a8ba6f7b6dc92a1f19669c0810bd2c43fc5a02da13b1cbf8e253845/multidict-6.7.1-cp311-cp311-win_amd64.whl", hash = "sha256:bdbf9f3b332abd0cdb306e7c2113818ab1e922dc84b8f8fd06ec89ed2a19ab8b", size = 45981, upload-time = "2026-01-26T02:43:49.921Z" },
+    { url = "https://files.pythonhosted.org/packages/ac/ad/9dd5305253fa00cd3c7555dbef69d5bf4133debc53b87ab8d6a44d411665/multidict-6.7.1-cp311-cp311-win_arm64.whl", hash = "sha256:b8c990b037d2fff2f4e33d3f21b9b531c5745b33a49a7d6dbe7a177266af44f6", size = 43159, upload-time = "2026-01-26T02:43:51.635Z" },
+    { url = "https://files.pythonhosted.org/packages/8d/9c/f20e0e2cf80e4b2e4b1c365bf5fe104ee633c751a724246262db8f1a0b13/multidict-6.7.1-cp312-cp312-macosx_10_13_universal2.whl", hash = "sha256:a90f75c956e32891a4eda3639ce6dd86e87105271f43d43442a3aedf3cddf172", size = 76893, upload-time = "2026-01-26T02:43:52.754Z" },
+    { url = "https://files.pythonhosted.org/packages/fe/cf/18ef143a81610136d3da8193da9d80bfe1cb548a1e2d1c775f26b23d024a/multidict-6.7.1-cp312-cp312-macosx_10_13_x86_64.whl", hash = "sha256:3fccb473e87eaa1382689053e4a4618e7ba7b9b9b8d6adf2027ee474597128cd", size = 45456, upload-time = "2026-01-26T02:43:53.893Z" },
+    { url = "https://files.pythonhosted.org/packages/a9/65/1caac9d4cd32e8433908683446eebc953e82d22b03d10d41a5f0fefe991b/multidict-6.7.1-cp312-cp312-macosx_11_0_arm64.whl", hash = "sha256:b0fa96985700739c4c7853a43c0b3e169360d6855780021bfc6d0f1ce7c123e7", size = 43872, upload-time = "2026-01-26T02:43:55.041Z" },
+    { url = "https://files.pythonhosted.org/packages/cf/3b/d6bd75dc4f3ff7c73766e04e705b00ed6dbbaccf670d9e05a12b006f5a21/multidict-6.7.1-cp312-cp312-manylinux1_i686.manylinux_2_28_i686.manylinux_2_5_i686.whl", hash = "sha256:cb2a55f408c3043e42b40cc8eecd575afa27b7e0b956dfb190de0f8499a57a53", size = 251018, upload-time = "2026-01-26T02:43:56.198Z" },
+    { url = "https://files.pythonhosted.org/packages/fd/80/c959c5933adedb9ac15152e4067c702a808ea183a8b64cf8f31af8ad3155/multidict-6.7.1-cp312-cp312-manylinux2014_aarch64.manylinux_2_17_aarch64.manylinux_2_28_aarch64.whl", hash = "sha256:eb0ce7b2a32d09892b3dd6cc44877a0d02a33241fafca5f25c8b6b62374f8b75", size = 258883, upload-time = "2026-01-26T02:43:57.499Z" },
+    { url = "https://files.pythonhosted.org/packages/86/85/7ed40adafea3d4f1c8b916e3b5cc3a8e07dfcdcb9cd72800f4ed3ca1b387/multidict-6.7.1-cp312-cp312-manylinux2014_armv7l.manylinux_2_17_armv7l.manylinux_2_31_armv7l.whl", hash = "sha256:c3a32d23520ee37bf327d1e1a656fec76a2edd5c038bf43eddfa0572ec49c60b", size = 242413, upload-time = "2026-01-26T02:43:58.755Z" },
+    { url = "https://files.pythonhosted.org/packages/d2/57/b8565ff533e48595503c785f8361ff9a4fde4d67de25c207cd0ba3befd03/multidict-6.7.1-cp312-cp312-manylinux2014_ppc64le.manylinux_2_17_ppc64le.manylinux_2_28_ppc64le.whl", hash = "sha256:9c90fed18bffc0189ba814749fdcc102b536e83a9f738a9003e569acd540a733", size = 268404, upload-time = "2026-01-26T02:44:00.216Z" },
+    { url = "https://files.pythonhosted.org/packages/e0/50/9810c5c29350f7258180dfdcb2e52783a0632862eb334c4896ac717cebcb/multidict-6.7.1-cp312-cp312-manylinux2014_s390x.manylinux_2_17_s390x.manylinux_2_28_s390x.whl", hash = "sha256:da62917e6076f512daccfbbde27f46fed1c98fee202f0559adec8ee0de67f71a", size = 269456, upload-time = "2026-01-26T02:44:02.202Z" },
+    { url = "https://files.pythonhosted.org/packages/f3/8d/5e5be3ced1d12966fefb5c4ea3b2a5b480afcea36406559442c6e31d4a48/multidict-6.7.1-cp312-cp312-manylinux2014_x86_64.manylinux_2_17_x86_64.manylinux_2_28_x86_64.whl", hash = "sha256:bfde23ef6ed9db7eaee6c37dcec08524cb43903c60b285b172b6c094711b3961", size = 256322, upload-time = "2026-01-26T02:44:03.56Z" },
+    { url = "https://files.pythonhosted.org/packages/31/6e/d8a26d81ac166a5592782d208dd90dfdc0a7a218adaa52b45a672b46c122/multidict-6.7.1-cp312-cp312-musllinux_1_2_aarch64.whl", hash = "sha256:3758692429e4e32f1ba0df23219cd0b4fc0a52f476726fff9337d1a57676a582", size = 253955, upload-time = "2026-01-26T02:44:04.845Z" },
+    { url = "https://files.pythonhosted.org/packages/59/4c/7c672c8aad41534ba619bcd4ade7a0dc87ed6b8b5c06149b85d3dd03f0cd/multidict-6.7.1-cp312-cp312-musllinux_1_2_armv7l.whl", hash = "sha256:398c1478926eca669f2fd6a5856b6de9c0acf23a2cb59a14c0ba5844fa38077e", size = 251254, upload-time = "2026-01-26T02:44:06.133Z" },
+    { url = "https://files.pythonhosted.org/packages/7b/bd/84c24de512cbafbdbc39439f74e967f19570ce7924e3007174a29c348916/multidict-6.7.1-cp312-cp312-musllinux_1_2_i686.whl", hash = "sha256:c102791b1c4f3ab36ce4101154549105a53dc828f016356b3e3bcae2e3a039d3", size = 252059, upload-time = "2026-01-26T02:44:07.518Z" },
+    { url = "https://files.pythonhosted.org/packages/fa/ba/f5449385510825b73d01c2d4087bf6d2fccc20a2d42ac34df93191d3dd03/multidict-6.7.1-cp312-cp312-musllinux_1_2_ppc64le.whl", hash = "sha256:a088b62bd733e2ad12c50dad01b7d0166c30287c166e137433d3b410add807a6", size = 263588, upload-time = "2026-01-26T02:44:09.382Z" },
+    { url = "https://files.pythonhosted.org/packages/d7/11/afc7c677f68f75c84a69fe37184f0f82fce13ce4b92f49f3db280b7e92b3/multidict-6.7.1-cp312-cp312-musllinux_1_2_s390x.whl", hash = "sha256:3d51ff4785d58d3f6c91bdbffcb5e1f7ddfda557727043aa20d20ec4f65e324a", size = 259642, upload-time = "2026-01-26T02:44:10.73Z" },
+    { url = "https://files.pythonhosted.org/packages/2b/17/ebb9644da78c4ab36403739e0e6e0e30ebb135b9caf3440825001a0bddcb/multidict-6.7.1-cp312-cp312-musllinux_1_2_x86_64.whl", hash = "sha256:fc5907494fccf3e7d3f94f95c91d6336b092b5fc83811720fae5e2765890dfba", size = 251377, upload-time = "2026-01-26T02:44:12.042Z" },
+    { url = "https://files.pythonhosted.org/packages/ca/a4/840f5b97339e27846c46307f2530a2805d9d537d8b8bd416af031cad7fa0/multidict-6.7.1-cp312-cp312-win32.whl", hash = "sha256:28ca5ce2fd9716631133d0e9a9b9a745ad7f60bac2bccafb56aa380fc0b6c511", size = 41887, upload-time = "2026-01-26T02:44:14.245Z" },
+    { url = "https://files.pythonhosted.org/packages/80/31/0b2517913687895f5904325c2069d6a3b78f66cc641a86a2baf75a05dcbb/multidict-6.7.1-cp312-cp312-win_amd64.whl", hash = "sha256:fcee94dfbd638784645b066074b338bc9cc155d4b4bffa4adce1615c5a426c19", size = 46053, upload-time = "2026-01-26T02:44:15.371Z" },
+    { url = "https://files.pythonhosted.org/packages/0c/5b/aba28e4ee4006ae4c7df8d327d31025d760ffa992ea23812a601d226e682/multidict-6.7.1-cp312-cp312-win_arm64.whl", hash = "sha256:ba0a9fb644d0c1a2194cf7ffb043bd852cea63a57f66fbd33959f7dae18517bf", size = 43307, upload-time = "2026-01-26T02:44:16.852Z" },
+    { url = "https://files.pythonhosted.org/packages/f2/22/929c141d6c0dba87d3e1d38fbdf1ba8baba86b7776469f2bc2d3227a1e67/multidict-6.7.1-cp313-cp313-macosx_10_13_universal2.whl", hash = "sha256:2b41f5fed0ed563624f1c17630cb9941cf2309d4df00e494b551b5f3e3d67a23", size = 76174, upload-time = "2026-01-26T02:44:18.509Z" },
+    { url = "https://files.pythonhosted.org/packages/c7/75/bc704ae15fee974f8fccd871305e254754167dce5f9e42d88a2def741a1d/multidict-6.7.1-cp313-cp313-macosx_10_13_x86_64.whl", hash = "sha256:84e61e3af5463c19b67ced91f6c634effb89ef8bfc5ca0267f954451ed4bb6a2", size = 45116, upload-time = "2026-01-26T02:44:19.745Z" },
+    { url = "https://files.pythonhosted.org/packages/79/76/55cd7186f498ed080a18440c9013011eb548f77ae1b297206d030eb1180a/multidict-6.7.1-cp313-cp313-macosx_11_0_arm64.whl", hash = "sha256:935434b9853c7c112eee7ac891bc4cb86455aa631269ae35442cb316790c1445", size = 43524, upload-time = "2026-01-26T02:44:21.571Z" },
+    { url = "https://files.pythonhosted.org/packages/e9/3c/414842ef8d5a1628d68edee29ba0e5bcf235dbfb3ccd3ea303a7fe8c72ff/multidict-6.7.1-cp313-cp313-manylinux1_i686.manylinux_2_28_i686.manylinux_2_5_i686.whl", hash = "sha256:432feb25a1cb67fe82a9680b4d65fb542e4635cb3166cd9c01560651ad60f177", size = 249368, upload-time = "2026-01-26T02:44:22.803Z" },
+    { url = "https://files.pythonhosted.org/packages/f6/32/befed7f74c458b4a525e60519fe8d87eef72bb1e99924fa2b0f9d97a221e/multidict-6.7.1-cp313-cp313-manylinux2014_aarch64.manylinux_2_17_aarch64.manylinux_2_28_aarch64.whl", hash = "sha256:e82d14e3c948952a1a85503817e038cba5905a3352de76b9a465075d072fba23", size = 256952, upload-time = "2026-01-26T02:44:24.306Z" },
+    { url = "https://files.pythonhosted.org/packages/03/d6/c878a44ba877f366630c860fdf74bfb203c33778f12b6ac274936853c451/multidict-6.7.1-cp313-cp313-manylinux2014_armv7l.manylinux_2_17_armv7l.manylinux_2_31_armv7l.whl", hash = "sha256:4cfb48c6ea66c83bcaaf7e4dfa7ec1b6bbcf751b7db85a328902796dfde4c060", size = 240317, upload-time = "2026-01-26T02:44:25.772Z" },
+    { url = "https://files.pythonhosted.org/packages/68/49/57421b4d7ad2e9e60e25922b08ceb37e077b90444bde6ead629095327a6f/multidict-6.7.1-cp313-cp313-manylinux2014_ppc64le.manylinux_2_17_ppc64le.manylinux_2_28_ppc64le.whl", hash = "sha256:1d540e51b7e8e170174555edecddbd5538105443754539193e3e1061864d444d", size = 267132, upload-time = "2026-01-26T02:44:27.648Z" },
+    { url = "https://files.pythonhosted.org/packages/b7/fe/ec0edd52ddbcea2a2e89e174f0206444a61440b40f39704e64dc807a70bd/multidict-6.7.1-cp313-cp313-manylinux2014_s390x.manylinux_2_17_s390x.manylinux_2_28_s390x.whl", hash = "sha256:273d23f4b40f3dce4d6c8a821c741a86dec62cded82e1175ba3d99be128147ed", size = 268140, upload-time = "2026-01-26T02:44:29.588Z" },
+    { url = "https://files.pythonhosted.org/packages/b0/73/6e1b01cbeb458807aa0831742232dbdd1fa92bfa33f52a3f176b4ff3dc11/multidict-6.7.1-cp313-cp313-manylinux2014_x86_64.manylinux_2_17_x86_64.manylinux_2_28_x86_64.whl", hash = "sha256:9d624335fd4fa1c08a53f8b4be7676ebde19cd092b3895c421045ca87895b429", size = 254277, upload-time = "2026-01-26T02:44:30.902Z" },
+    { url = "https://files.pythonhosted.org/packages/6a/b2/5fb8c124d7561a4974c342bc8c778b471ebbeb3cc17df696f034a7e9afe7/multidict-6.7.1-cp313-cp313-musllinux_1_2_aarch64.whl", hash = "sha256:12fad252f8b267cc75b66e8fc51b3079604e8d43a75428ffe193cd9e2195dfd6", size = 252291, upload-time = "2026-01-26T02:44:32.31Z" },
+    { url = "https://files.pythonhosted.org/packages/5a/96/51d4e4e06bcce92577fcd488e22600bd38e4fd59c20cb49434d054903bd2/multidict-6.7.1-cp313-cp313-musllinux_1_2_armv7l.whl", hash = "sha256:03ede2a6ffbe8ef936b92cb4529f27f42be7f56afcdab5ab739cd5f27fb1cbf9", size = 250156, upload-time = "2026-01-26T02:44:33.734Z" },
+    { url = "https://files.pythonhosted.org/packages/db/6b/420e173eec5fba721a50e2a9f89eda89d9c98fded1124f8d5c675f7a0c0f/multidict-6.7.1-cp313-cp313-musllinux_1_2_i686.whl", hash = "sha256:90efbcf47dbe33dcf643a1e400d67d59abeac5db07dc3f27d6bdeae497a2198c", size = 249742, upload-time = "2026-01-26T02:44:35.222Z" },
+    { url = "https://files.pythonhosted.org/packages/44/a3/ec5b5bd98f306bc2aa297b8c6f11a46714a56b1e6ef5ebda50a4f5d7c5fb/multidict-6.7.1-cp313-cp313-musllinux_1_2_ppc64le.whl", hash = "sha256:5c4b9bfc148f5a91be9244d6264c53035c8a0dcd2f51f1c3c6e30e30ebaa1c84", size = 262221, upload-time = "2026-01-26T02:44:36.604Z" },
+    { url = "https://files.pythonhosted.org/packages/cd/f7/e8c0d0da0cd1e28d10e624604e1a36bcc3353aaebdfdc3a43c72bc683a12/multidict-6.7.1-cp313-cp313-musllinux_1_2_s390x.whl", hash = "sha256:401c5a650f3add2472d1d288c26deebc540f99e2fb83e9525007a74cd2116f1d", size = 258664, upload-time = "2026-01-26T02:44:38.008Z" },
+    { url = "https://files.pythonhosted.org/packages/52/da/151a44e8016dd33feed44f730bd856a66257c1ee7aed4f44b649fb7edeb3/multidict-6.7.1-cp313-cp313-musllinux_1_2_x86_64.whl", hash = "sha256:97891f3b1b3ffbded884e2916cacf3c6fc87b66bb0dde46f7357404750559f33", size = 249490, upload-time = "2026-01-26T02:44:39.386Z" },
+    { url = "https://files.pythonhosted.org/packages/87/af/a3b86bf9630b732897f6fc3f4c4714b90aa4361983ccbdcd6c0339b21b0c/multidict-6.7.1-cp313-cp313-win32.whl", hash = "sha256:e1c5988359516095535c4301af38d8a8838534158f649c05dd1050222321bcb3", size = 41695, upload-time = "2026-01-26T02:44:41.318Z" },
+    { url = "https://files.pythonhosted.org/packages/b2/35/e994121b0e90e46134673422dd564623f93304614f5d11886b1b3e06f503/multidict-6.7.1-cp313-cp313-win_amd64.whl", hash = "sha256:960c83bf01a95b12b08fd54324a4eb1d5b52c88932b5cba5d6e712bb3ed12eb5", size = 45884, upload-time = "2026-01-26T02:44:42.488Z" },
+    { url = "https://files.pythonhosted.org/packages/ca/61/42d3e5dbf661242a69c97ea363f2d7b46c567da8eadef8890022be6e2ab0/multidict-6.7.1-cp313-cp313-win_arm64.whl", hash = "sha256:563fe25c678aaba333d5399408f5ec3c383ca5b663e7f774dd179a520b8144df", size = 43122, upload-time = "2026-01-26T02:44:43.664Z" },
+    { url = "https://files.pythonhosted.org/packages/6d/b3/e6b21c6c4f314bb956016b0b3ef2162590a529b84cb831c257519e7fde44/multidict-6.7.1-cp313-cp313t-macosx_10_13_universal2.whl", hash = "sha256:c76c4bec1538375dad9d452d246ca5368ad6e1c9039dadcf007ae59c70619ea1", size = 83175, upload-time = "2026-01-26T02:44:44.894Z" },
+    { url = "https://files.pythonhosted.org/packages/fb/76/23ecd2abfe0957b234f6c960f4ade497f55f2c16aeb684d4ecdbf1c95791/multidict-6.7.1-cp313-cp313t-macosx_10_13_x86_64.whl", hash = "sha256:57b46b24b5d5ebcc978da4ec23a819a9402b4228b8a90d9c656422b4bdd8a963", size = 48460, upload-time = "2026-01-26T02:44:46.106Z" },
+    { url = "https://files.pythonhosted.org/packages/c4/57/a0ed92b23f3a042c36bc4227b72b97eca803f5f1801c1ab77c8a212d455e/multidict-6.7.1-cp313-cp313t-macosx_11_0_arm64.whl", hash = "sha256:e954b24433c768ce78ab7929e84ccf3422e46deb45a4dc9f93438f8217fa2d34", size = 46930, upload-time = "2026-01-26T02:44:47.278Z" },
+    { url = "https://files.pythonhosted.org/packages/b5/66/02ec7ace29162e447f6382c495dc95826bf931d3818799bbef11e8f7df1a/multidict-6.7.1-cp313-cp313t-manylinux1_i686.manylinux_2_28_i686.manylinux_2_5_i686.whl", hash = "sha256:3bd231490fa7217cc832528e1cd8752a96f0125ddd2b5749390f7c3ec8721b65", size = 242582, upload-time = "2026-01-26T02:44:48.604Z" },
+    { url = "https://files.pythonhosted.org/packages/58/18/64f5a795e7677670e872673aca234162514696274597b3708b2c0d276cce/multidict-6.7.1-cp313-cp313t-manylinux2014_aarch64.manylinux_2_17_aarch64.manylinux_2_28_aarch64.whl", hash = "sha256:253282d70d67885a15c8a7716f3a73edf2d635793ceda8173b9ecc21f2fb8292", size = 250031, upload-time = "2026-01-26T02:44:50.544Z" },
+    { url = "https://files.pythonhosted.org/packages/c8/ed/e192291dbbe51a8290c5686f482084d31bcd9d09af24f63358c3d42fd284/multidict-6.7.1-cp313-cp313t-manylinux2014_armv7l.manylinux_2_17_armv7l.manylinux_2_31_armv7l.whl", hash = "sha256:0b4c48648d7649c9335cf1927a8b87fa692de3dcb15faa676c6a6f1f1aabda43", size = 228596, upload-time = "2026-01-26T02:44:51.951Z" },
+    { url = "https://files.pythonhosted.org/packages/1e/7e/3562a15a60cf747397e7f2180b0a11dc0c38d9175a650e75fa1b4d325e15/multidict-6.7.1-cp313-cp313t-manylinux2014_ppc64le.manylinux_2_17_ppc64le.manylinux_2_28_ppc64le.whl", hash = "sha256:98bc624954ec4d2c7cb074b8eefc2b5d0ce7d482e410df446414355d158fe4ca", size = 257492, upload-time = "2026-01-26T02:44:53.902Z" },
+    { url = "https://files.pythonhosted.org/packages/24/02/7d0f9eae92b5249bb50ac1595b295f10e263dd0078ebb55115c31e0eaccd/multidict-6.7.1-cp313-cp313t-manylinux2014_s390x.manylinux_2_17_s390x.manylinux_2_28_s390x.whl", hash = "sha256:1b99af4d9eec0b49927b4402bcbb58dea89d3e0db8806a4086117019939ad3dd", size = 255899, upload-time = "2026-01-26T02:44:55.316Z" },
+    { url = "https://files.pythonhosted.org/packages/00/e3/9b60ed9e23e64c73a5cde95269ef1330678e9c6e34dd4eb6b431b85b5a10/multidict-6.7.1-cp313-cp313t-manylinux2014_x86_64.manylinux_2_17_x86_64.manylinux_2_28_x86_64.whl", hash = "sha256:6aac4f16b472d5b7dc6f66a0d49dd57b0e0902090be16594dc9ebfd3d17c47e7", size = 247970, upload-time = "2026-01-26T02:44:56.783Z" },
+    { url = "https://files.pythonhosted.org/packages/3e/06/538e58a63ed5cfb0bd4517e346b91da32fde409d839720f664e9a4ae4f9d/multidict-6.7.1-cp313-cp313t-musllinux_1_2_aarch64.whl", hash = "sha256:21f830fe223215dffd51f538e78c172ed7c7f60c9b96a2bf05c4848ad49921c3", size = 245060, upload-time = "2026-01-26T02:44:58.195Z" },
+    { url = "https://files.pythonhosted.org/packages/b2/2f/d743a3045a97c895d401e9bd29aaa09b94f5cbdf1bd561609e5a6c431c70/multidict-6.7.1-cp313-cp313t-musllinux_1_2_armv7l.whl", hash = "sha256:f5dd81c45b05518b9aa4da4aa74e1c93d715efa234fd3e8a179df611cc85e5f4", size = 235888, upload-time = "2026-01-26T02:44:59.57Z" },
+    { url = "https://files.pythonhosted.org/packages/38/83/5a325cac191ab28b63c52f14f1131f3b0a55ba3b9aa65a6d0bf2a9b921a0/multidict-6.7.1-cp313-cp313t-musllinux_1_2_i686.whl", hash = "sha256:eb304767bca2bb92fb9c5bd33cedc95baee5bb5f6c88e63706533a1c06ad08c8", size = 243554, upload-time = "2026-01-26T02:45:01.054Z" },
+    { url = "https://files.pythonhosted.org/packages/20/1f/9d2327086bd15da2725ef6aae624208e2ef828ed99892b17f60c344e57ed/multidict-6.7.1-cp313-cp313t-musllinux_1_2_ppc64le.whl", hash = "sha256:c9035dde0f916702850ef66460bc4239d89d08df4d02023a5926e7446724212c", size = 252341, upload-time = "2026-01-26T02:45:02.484Z" },
+    { url = "https://files.pythonhosted.org/packages/e8/2c/2a1aa0280cf579d0f6eed8ee5211c4f1730bd7e06c636ba2ee6aafda302e/multidict-6.7.1-cp313-cp313t-musllinux_1_2_s390x.whl", hash = "sha256:af959b9beeb66c822380f222f0e0a1889331597e81f1ded7f374f3ecb0fd6c52", size = 246391, upload-time = "2026-01-26T02:45:03.862Z" },
+    { url = "https://files.pythonhosted.org/packages/e5/03/7ca022ffc36c5a3f6e03b179a5ceb829be9da5783e6fe395f347c0794680/multidict-6.7.1-cp313-cp313t-musllinux_1_2_x86_64.whl", hash = "sha256:41f2952231456154ee479651491e94118229844dd7226541788be783be2b5108", size = 243422, upload-time = "2026-01-26T02:45:05.296Z" },
+    { url = "https://files.pythonhosted.org/packages/dc/1d/b31650eab6c5778aceed46ba735bd97f7c7d2f54b319fa916c0f96e7805b/multidict-6.7.1-cp313-cp313t-win32.whl", hash = "sha256:df9f19c28adcb40b6aae30bbaa1478c389efd50c28d541d76760199fc1037c32", size = 47770, upload-time = "2026-01-26T02:45:06.754Z" },
+    { url = "https://files.pythonhosted.org/packages/ac/5b/2d2d1d522e51285bd61b1e20df8f47ae1a9d80839db0b24ea783b3832832/multidict-6.7.1-cp313-cp313t-win_amd64.whl", hash = "sha256:d54ecf9f301853f2c5e802da559604b3e95bb7a3b01a9c295c6ee591b9882de8", size = 53109, upload-time = "2026-01-26T02:45:08.044Z" },
+    { url = "https://files.pythonhosted.org/packages/3d/a3/cc409ba012c83ca024a308516703cf339bdc4b696195644a7215a5164a24/multidict-6.7.1-cp313-cp313t-win_arm64.whl", hash = "sha256:5a37ca18e360377cfda1d62f5f382ff41f2b8c4ccb329ed974cc2e1643440118", size = 45573, upload-time = "2026-01-26T02:45:09.349Z" },
+    { url = "https://files.pythonhosted.org/packages/91/cc/db74228a8be41884a567e88a62fd589a913708fcf180d029898c17a9a371/multidict-6.7.1-cp314-cp314-macosx_10_15_universal2.whl", hash = "sha256:8f333ec9c5eb1b7105e3b84b53141e66ca05a19a605368c55450b6ba208cb9ee", size = 75190, upload-time = "2026-01-26T02:45:10.651Z" },
+    { url = "https://files.pythonhosted.org/packages/d5/22/492f2246bb5b534abd44804292e81eeaf835388901f0c574bac4eeec73c5/multidict-6.7.1-cp314-cp314-macosx_10_15_x86_64.whl", hash = "sha256:a407f13c188f804c759fc6a9f88286a565c242a76b27626594c133b82883b5c2", size = 44486, upload-time = "2026-01-26T02:45:11.938Z" },
+    { url = "https://files.pythonhosted.org/packages/f1/4f/733c48f270565d78b4544f2baddc2fb2a245e5a8640254b12c36ac7ac68e/multidict-6.7.1-cp314-cp314-macosx_11_0_arm64.whl", hash = "sha256:0e161ddf326db5577c3a4cc2d8648f81456e8a20d40415541587a71620d7a7d1", size = 43219, upload-time = "2026-01-26T02:45:14.346Z" },
+    { url = "https://files.pythonhosted.org/packages/24/bb/2c0c2287963f4259c85e8bcbba9182ced8d7fca65c780c38e99e61629d11/multidict-6.7.1-cp314-cp314-manylinux1_i686.manylinux_2_28_i686.manylinux_2_5_i686.whl", hash = "sha256:1e3a8bb24342a8201d178c3b4984c26ba81a577c80d4d525727427460a50c22d", size = 245132, upload-time = "2026-01-26T02:45:15.712Z" },
+    { url = "https://files.pythonhosted.org/packages/a7/f9/44d4b3064c65079d2467888794dea218d1601898ac50222ab8a9a8094460/multidict-6.7.1-cp314-cp314-manylinux2014_aarch64.manylinux_2_17_aarch64.manylinux_2_28_aarch64.whl", hash = "sha256:97231140a50f5d447d3164f994b86a0bed7cd016e2682f8650d6a9158e14fd31", size = 252420, upload-time = "2026-01-26T02:45:17.293Z" },
+    { url = "https://files.pythonhosted.org/packages/8b/13/78f7275e73fa17b24c9a51b0bd9d73ba64bb32d0ed51b02a746eb876abe7/multidict-6.7.1-cp314-cp314-manylinux2014_armv7l.manylinux_2_17_armv7l.manylinux_2_31_armv7l.whl", hash = "sha256:6b10359683bd8806a200fd2909e7c8ca3a7b24ec1d8132e483d58e791d881048", size = 233510, upload-time = "2026-01-26T02:45:19.356Z" },
+    { url = "https://files.pythonhosted.org/packages/4b/25/8167187f62ae3cbd52da7893f58cb036b47ea3fb67138787c76800158982/multidict-6.7.1-cp314-cp314-manylinux2014_ppc64le.manylinux_2_17_ppc64le.manylinux_2_28_ppc64le.whl", hash = "sha256:283ddac99f7ac25a4acadbf004cb5ae34480bbeb063520f70ce397b281859362", size = 264094, upload-time = "2026-01-26T02:45:20.834Z" },
+    { url = "https://files.pythonhosted.org/packages/a1/e7/69a3a83b7b030cf283fb06ce074a05a02322359783424d7edf0f15fe5022/multidict-6.7.1-cp314-cp314-manylinux2014_s390x.manylinux_2_17_s390x.manylinux_2_28_s390x.whl", hash = "sha256:538cec1e18c067d0e6103aa9a74f9e832904c957adc260e61cd9d8cf0c3b3d37", size = 260786, upload-time = "2026-01-26T02:45:22.818Z" },
+    { url = "https://files.pythonhosted.org/packages/fe/3b/8ec5074bcfc450fe84273713b4b0a0dd47c0249358f5d82eb8104ffe2520/multidict-6.7.1-cp314-cp314-manylinux2014_x86_64.manylinux_2_17_x86_64.manylinux_2_28_x86_64.whl", hash = "sha256:7eee46ccb30ff48a1e35bb818cc90846c6be2b68240e42a78599166722cea709", size = 248483, upload-time = "2026-01-26T02:45:24.368Z" },
+    { url = "https://files.pythonhosted.org/packages/48/5a/d5a99e3acbca0e29c5d9cba8f92ceb15dce78bab963b308ae692981e3a5d/multidict-6.7.1-cp314-cp314-musllinux_1_2_aarch64.whl", hash = "sha256:fa263a02f4f2dd2d11a7b1bb4362aa7cb1049f84a9235d31adf63f30143469a0", size = 248403, upload-time = "2026-01-26T02:45:25.982Z" },
+    { url = "https://files.pythonhosted.org/packages/35/48/e58cd31f6c7d5102f2a4bf89f96b9cf7e00b6c6f3d04ecc44417c00a5a3c/multidict-6.7.1-cp314-cp314-musllinux_1_2_armv7l.whl", hash = "sha256:2e1425e2f99ec5bd36c15a01b690a1a2456209c5deed58f95469ffb46039ccbb", size = 240315, upload-time = "2026-01-26T02:45:27.487Z" },
+    { url = "https://files.pythonhosted.org/packages/94/33/1cd210229559cb90b6786c30676bb0c58249ff42f942765f88793b41fdce/multidict-6.7.1-cp314-cp314-musllinux_1_2_i686.whl", hash = "sha256:497394b3239fc6f0e13a78a3e1b61296e72bf1c5f94b4c4eb80b265c37a131cd", size = 245528, upload-time = "2026-01-26T02:45:28.991Z" },
+    { url = "https://files.pythonhosted.org/packages/64/f2/6e1107d226278c876c783056b7db43d800bb64c6131cec9c8dfb6903698e/multidict-6.7.1-cp314-cp314-musllinux_1_2_ppc64le.whl", hash = "sha256:233b398c29d3f1b9676b4b6f75c518a06fcb2ea0b925119fb2c1bc35c05e1601", size = 258784, upload-time = "2026-01-26T02:45:30.503Z" },
+    { url = "https://files.pythonhosted.org/packages/4d/c1/11f664f14d525e4a1b5327a82d4de61a1db604ab34c6603bb3c2cc63ad34/multidict-6.7.1-cp314-cp314-musllinux_1_2_s390x.whl", hash = "sha256:93b1818e4a6e0930454f0f2af7dfce69307ca03cdcfb3739bf4d91241967b6c1", size = 251980, upload-time = "2026-01-26T02:45:32.603Z" },
+    { url = "https://files.pythonhosted.org/packages/e1/9f/75a9ac888121d0c5bbd4ecf4eead45668b1766f6baabfb3b7f66a410e231/multidict-6.7.1-cp314-cp314-musllinux_1_2_x86_64.whl", hash = "sha256:f33dc2a3abe9249ea5d8360f969ec7f4142e7ac45ee7014d8f8d5acddf178b7b", size = 243602, upload-time = "2026-01-26T02:45:34.043Z" },
+    { url = "https://files.pythonhosted.org/packages/9a/e7/50bf7b004cc8525d80dbbbedfdc7aed3e4c323810890be4413e589074032/multidict-6.7.1-cp314-cp314-win32.whl", hash = "sha256:3ab8b9d8b75aef9df299595d5388b14530839f6422333357af1339443cff777d", size = 40930, upload-time = "2026-01-26T02:45:36.278Z" },
+    { url = "https://files.pythonhosted.org/packages/e0/bf/52f25716bbe93745595800f36fb17b73711f14da59ed0bb2eba141bc9f0f/multidict-6.7.1-cp314-cp314-win_amd64.whl", hash = "sha256:5e01429a929600e7dab7b166062d9bb54a5eed752384c7384c968c2afab8f50f", size = 45074, upload-time = "2026-01-26T02:45:37.546Z" },
+    { url = "https://files.pythonhosted.org/packages/97/ab/22803b03285fa3a525f48217963da3a65ae40f6a1b6f6cf2768879e208f9/multidict-6.7.1-cp314-cp314-win_arm64.whl", hash = "sha256:4885cb0e817aef5d00a2e8451d4665c1808378dc27c2705f1bf4ef8505c0d2e5", size = 42471, upload-time = "2026-01-26T02:45:38.889Z" },
+    { url = "https://files.pythonhosted.org/packages/e0/6d/f9293baa6146ba9507e360ea0292b6422b016907c393e2f63fc40ab7b7b5/multidict-6.7.1-cp314-cp314t-macosx_10_15_universal2.whl", hash = "sha256:0458c978acd8e6ea53c81eefaddbbee9c6c5e591f41b3f5e8e194780fe026581", size = 82401, upload-time = "2026-01-26T02:45:40.254Z" },
+    { url = "https://files.pythonhosted.org/packages/7a/68/53b5494738d83558d87c3c71a486504d8373421c3e0dbb6d0db48ad42ee0/multidict-6.7.1-cp314-cp314t-macosx_10_15_x86_64.whl", hash = "sha256:c0abd12629b0af3cf590982c0b413b1e7395cd4ec026f30986818ab95bfaa94a", size = 48143, upload-time = "2026-01-26T02:45:41.635Z" },
+    { url = "https://files.pythonhosted.org/packages/37/e8/5284c53310dcdc99ce5d66563f6e5773531a9b9fe9ec7a615e9bc306b05f/multidict-6.7.1-cp314-cp314t-macosx_11_0_arm64.whl", hash = "sha256:14525a5f61d7d0c94b368a42cff4c9a4e7ba2d52e2672a7b23d84dc86fb02b0c", size = 46507, upload-time = "2026-01-26T02:45:42.99Z" },
+    { url = "https://files.pythonhosted.org/packages/e4/fc/6800d0e5b3875568b4083ecf5f310dcf91d86d52573160834fb4bfcf5e4f/multidict-6.7.1-cp314-cp314t-manylinux1_i686.manylinux_2_28_i686.manylinux_2_5_i686.whl", hash = "sha256:17307b22c217b4cf05033dabefe68255a534d637c6c9b0cc8382718f87be4262", size = 239358, upload-time = "2026-01-26T02:45:44.376Z" },
+    { url = "https://files.pythonhosted.org/packages/41/75/4ad0973179361cdf3a113905e6e088173198349131be2b390f9fa4da5fc6/multidict-6.7.1-cp314-cp314t-manylinux2014_aarch64.manylinux_2_17_aarch64.manylinux_2_28_aarch64.whl", hash = "sha256:7a7e590ff876a3eaf1c02a4dfe0724b6e69a9e9de6d8f556816f29c496046e59", size = 246884, upload-time = "2026-01-26T02:45:47.167Z" },
+    { url = "https://files.pythonhosted.org/packages/c3/9c/095bb28b5da139bd41fb9a5d5caff412584f377914bd8787c2aa98717130/multidict-6.7.1-cp314-cp314t-manylinux2014_armv7l.manylinux_2_17_armv7l.manylinux_2_31_armv7l.whl", hash = "sha256:5fa6a95dfee63893d80a34758cd0e0c118a30b8dcb46372bf75106c591b77889", size = 225878, upload-time = "2026-01-26T02:45:48.698Z" },
+    { url = "https://files.pythonhosted.org/packages/07/d0/c0a72000243756e8f5a277b6b514fa005f2c73d481b7d9e47cd4568aa2e4/multidict-6.7.1-cp314-cp314t-manylinux2014_ppc64le.manylinux_2_17_ppc64le.manylinux_2_28_ppc64le.whl", hash = "sha256:a0543217a6a017692aa6ae5cc39adb75e587af0f3a82288b1492eb73dd6cc2a4", size = 253542, upload-time = "2026-01-26T02:45:50.164Z" },
+    { url = "https://files.pythonhosted.org/packages/c0/6b/f69da15289e384ecf2a68837ec8b5ad8c33e973aa18b266f50fe55f24b8c/multidict-6.7.1-cp314-cp314t-manylinux2014_s390x.manylinux_2_17_s390x.manylinux_2_28_s390x.whl", hash = "sha256:f99fe611c312b3c1c0ace793f92464d8cd263cc3b26b5721950d977b006b6c4d", size = 252403, upload-time = "2026-01-26T02:45:51.779Z" },
+    { url = "https://files.pythonhosted.org/packages/a2/76/b9669547afa5a1a25cd93eaca91c0da1c095b06b6d2d8ec25b713588d3a1/multidict-6.7.1-cp314-cp314t-manylinux2014_x86_64.manylinux_2_17_x86_64.manylinux_2_28_x86_64.whl", hash = "sha256:9004d8386d133b7e6135679424c91b0b854d2d164af6ea3f289f8f2761064609", size = 244889, upload-time = "2026-01-26T02:45:53.27Z" },
+    { url = "https://files.pythonhosted.org/packages/7e/a9/a50d2669e506dad33cfc45b5d574a205587b7b8a5f426f2fbb2e90882588/multidict-6.7.1-cp314-cp314t-musllinux_1_2_aarch64.whl", hash = "sha256:e628ef0e6859ffd8273c69412a2465c4be4a9517d07261b33334b5ec6f3c7489", size = 241982, upload-time = "2026-01-26T02:45:54.919Z" },
+    { url = "https://files.pythonhosted.org/packages/c5/bb/1609558ad8b456b4827d3c5a5b775c93b87878fd3117ed3db3423dfbce1b/multidict-6.7.1-cp314-cp314t-musllinux_1_2_armv7l.whl", hash = "sha256:841189848ba629c3552035a6a7f5bf3b02eb304e9fea7492ca220a8eda6b0e5c", size = 232415, upload-time = "2026-01-26T02:45:56.981Z" },
+    { url = "https://files.pythonhosted.org/packages/d8/59/6f61039d2aa9261871e03ab9dc058a550d240f25859b05b67fd70f80d4b3/multidict-6.7.1-cp314-cp314t-musllinux_1_2_i686.whl", hash = "sha256:ce1bbd7d780bb5a0da032e095c951f7014d6b0a205f8318308140f1a6aba159e", size = 240337, upload-time = "2026-01-26T02:45:58.698Z" },
+    { url = "https://files.pythonhosted.org/packages/a1/29/fdc6a43c203890dc2ae9249971ecd0c41deaedfe00d25cb6564b2edd99eb/multidict-6.7.1-cp314-cp314t-musllinux_1_2_ppc64le.whl", hash = "sha256:b26684587228afed0d50cf804cc71062cc9c1cdf55051c4c6345d372947b268c", size = 248788, upload-time = "2026-01-26T02:46:00.862Z" },
+    { url = "https://files.pythonhosted.org/packages/a9/14/a153a06101323e4cf086ecee3faadba52ff71633d471f9685c42e3736163/multidict-6.7.1-cp314-cp314t-musllinux_1_2_s390x.whl", hash = "sha256:9f9af11306994335398293f9958071019e3ab95e9a707dc1383a35613f6abcb9", size = 242842, upload-time = "2026-01-26T02:46:02.824Z" },
+    { url = "https://files.pythonhosted.org/packages/41/5f/604ae839e64a4a6efc80db94465348d3b328ee955e37acb24badbcd24d83/multidict-6.7.1-cp314-cp314t-musllinux_1_2_x86_64.whl", hash = "sha256:b4938326284c4f1224178a560987b6cf8b4d38458b113d9b8c1db1a836e640a2", size = 240237, upload-time = "2026-01-26T02:46:05.898Z" },
+    { url = "https://files.pythonhosted.org/packages/5f/60/c3a5187bf66f6fb546ff4ab8fb5a077cbdd832d7b1908d4365c7f74a1917/multidict-6.7.1-cp314-cp314t-win32.whl", hash = "sha256:98655c737850c064a65e006a3df7c997cd3b220be4ec8fe26215760b9697d4d7", size = 48008, upload-time = "2026-01-26T02:46:07.468Z" },
+    { url = "https://files.pythonhosted.org/packages/0c/f7/addf1087b860ac60e6f382240f64fb99f8bfb532bb06f7c542b83c29ca61/multidict-6.7.1-cp314-cp314t-win_amd64.whl", hash = "sha256:497bde6223c212ba11d462853cfa4f0ae6ef97465033e7dc9940cdb3ab5b48e5", size = 53542, upload-time = "2026-01-26T02:46:08.809Z" },
+    { url = "https://files.pythonhosted.org/packages/4c/81/4629d0aa32302ef7b2ec65c75a728cc5ff4fa410c50096174c1632e70b3e/multidict-6.7.1-cp314-cp314t-win_arm64.whl", hash = "sha256:2bbd113e0d4af5db41d5ebfe9ccaff89de2120578164f86a5d17d5a576d1e5b2", size = 44719, upload-time = "2026-01-26T02:46:11.146Z" },
+    { url = "https://files.pythonhosted.org/packages/81/08/7036c080d7117f28a4af526d794aab6a84463126db031b007717c1a6676e/multidict-6.7.1-py3-none-any.whl", hash = "sha256:55d97cc6dae627efa6a6e548885712d4864b81110ac76fa4e534c03819fa4a56", size = 12319, upload-time = "2026-01-26T02:46:44.004Z" },
+]
+
+[[package]]
+name = "mypy"
+version = "2.1.0"
+source = { registry = "https://pypi.org/simple" }
+dependencies = [
+    { name = "ast-serialize" },
+    { name = "librt", marker = "platform_python_implementation != 'PyPy'" },
+    { name = "mypy-extensions" },
+    { name = "pathspec" },
+    { name = "typing-extensions" },
+]
+sdist = { url = "https://files.pythonhosted.org/packages/82/15/cca9d88503549ed6fedeaa1d448cdddd542ee8a490232d732e278036fbf2/mypy-2.1.0.tar.gz", hash = "sha256:81e76ad12c2d804512e9b13240d1588316531bfba07558286078bfbce9613633", size = 3898359, upload-time = "2026-05-11T18:37:36.237Z" }
+wheels = [
+    { url = "https://files.pythonhosted.org/packages/0a/a1/639f3024794a2a15899cb90707fe02e044c4412794c39c5769fd3df2e2ef/mypy-2.1.0-cp311-cp311-macosx_10_9_x86_64.whl", hash = "sha256:a683016b16fe2f572dc04c72be7ee0504ac1605a265d0200f5cea695fb788f41", size = 14691685, upload-time = "2026-05-11T18:33:27.973Z" },
+    { url = "https://files.pythonhosted.org/packages/3b/08/9a585dea4325f20d8b80dc78623fa50d1fd2173b710f6237afd6ba6ab39b/mypy-2.1.0-cp311-cp311-macosx_11_0_arm64.whl", hash = "sha256:1a293c534adb55271fef24a26da04b855540a8c13cc07bc5917b9fd2c394f2ca", size = 13555165, upload-time = "2026-05-11T18:32:16.107Z" },
+    { url = "https://files.pythonhosted.org/packages/81/dc/7c42cc9c6cb01e8eb09961f1f738741d3e9c7e9d5c5b30ec69222625cd5f/mypy-2.1.0-cp311-cp311-manylinux2014_aarch64.manylinux_2_17_aarch64.manylinux_2_28_aarch64.whl", hash = "sha256:7406f4d048e71e576f5356d317e5b0a9e666dfd966bd99f9d14ca06e1a341538", size = 13994376, upload-time = "2026-05-11T18:32:39.256Z" },
+    { url = "https://files.pythonhosted.org/packages/d4/fa/285946c33bce716e082c11dfeee9ee196eaf1f5042efb3581a31f9f205e4/mypy-2.1.0-cp311-cp311-manylinux2014_x86_64.manylinux_2_17_x86_64.manylinux_2_28_x86_64.whl", hash = "sha256:e0210d626fc8b31ccc90233754c7bc90e1f43205e85d96387f7db1285b55c398", size = 14864618, upload-time = "2026-05-11T18:34:49.765Z" },
+    { url = "https://files.pythonhosted.org/packages/2b/83/82397f48af6c27e295d57979ded8490c9829040152cf7571b2f026aeb9a0/mypy-2.1.0-cp311-cp311-musllinux_1_2_x86_64.whl", hash = "sha256:3712c20deed54e814eaaa825603bada8ea1c390670a397c95b98405347acc563", size = 15102063, upload-time = "2026-05-11T18:34:05.855Z" },
+    { url = "https://files.pythonhosted.org/packages/40/68/b02dec39057b88eb03dc0aa854732e26e8361f34f9d0e20c7614967d1eba/mypy-2.1.0-cp311-cp311-win_amd64.whl", hash = "sha256:fcaa0e479066e31f7cceb6a3bea39cb22b2ff51a6b2f24f193d19179ba17c389", size = 11060564, upload-time = "2026-05-11T18:35:36.494Z" },
+    { url = "https://files.pythonhosted.org/packages/cf/a8/ea3dcbef31f99b634f2ee23bb0321cbc8c1b388b76a861eb849f13c347dc/mypy-2.1.0-cp311-cp311-win_arm64.whl", hash = "sha256:0b1a5260c95aa443083f9ed3592662941951bca3d4ca224a5dc517c38b7cf666", size = 9966983, upload-time = "2026-05-11T18:37:14.139Z" },
+    { url = "https://files.pythonhosted.org/packages/95/b1/55861beb5c339b44f9a2ba92df9e2cb1eeb4ae1eee674cdf7772c797778b/mypy-2.1.0-cp312-cp312-macosx_10_13_x86_64.whl", hash = "sha256:244358bf1c0da7722230bce60683d52e8e9fd030554926f15b747a84efb5b3af", size = 14874381, upload-time = "2026-05-11T18:37:31.784Z" },
+    { url = "https://files.pythonhosted.org/packages/0b/b3/b7f770114b7d0ac92d0f76e8d93c2780844a70488a90e91821927850da86/mypy-2.1.0-cp312-cp312-macosx_11_0_arm64.whl", hash = "sha256:4ec7c57657493c7a75534df2751c8ae2cda383c16ecc55d2106c54476b1b16f6", size = 13665501, upload-time = "2026-05-11T18:34:23.063Z" },
+    { url = "https://files.pythonhosted.org/packages/b6/f3/8ae2037967e2126689a0c11d99e2b707134a565191e92c60ca2572aec60a/mypy-2.1.0-cp312-cp312-manylinux2014_aarch64.manylinux_2_17_aarch64.manylinux_2_28_aarch64.whl", hash = "sha256:d8161b6ff4392410023224f0969d17db93e1e154bc3e4ba62598e720723ae211", size = 14045750, upload-time = "2026-05-11T18:31:48.151Z" },
+    { url = "https://files.pythonhosted.org/packages/a0/32/615eb5911859e43d054941b0d0a7d06cfa2870eba86529cf385b052b111c/mypy-2.1.0-cp312-cp312-manylinux2014_x86_64.manylinux_2_17_x86_64.manylinux_2_28_x86_64.whl", hash = "sha256:bf03e12003084a67395184d3eb8cbd6a489dc3655b5664b28c210a9e2403ab0b", size = 15061630, upload-time = "2026-05-11T18:37:06.898Z" },
+    { url = "https://files.pythonhosted.org/packages/d4/03/4eafbfff8bfab1b87082741eae6e6a624028c984e6708b73bce2a8570c9d/mypy-2.1.0-cp312-cp312-musllinux_1_2_x86_64.whl", hash = "sha256:20509760fd791c51579d573153407d226385ec1f8bcce55d730b354f3336bc22", size = 15288831, upload-time = "2026-05-11T18:31:18.07Z" },
+    { url = "https://files.pythonhosted.org/packages/99/ee/919661478e5891a3c96e549c036e467e64563ab85995b10c53c8358e16a3/mypy-2.1.0-cp312-cp312-win_amd64.whl", hash = "sha256:6753d0c1fdd6b1a23b9e4f283ce80b2153b724adcb2653b20b85a8a28ac6436b", size = 11135228, upload-time = "2026-05-11T18:34:31.23Z" },
+    { url = "https://files.pythonhosted.org/packages/24/0a/6a12b9782ca0831a553192f351679f4548abc9d19a7cc93bb7feb02084c7/mypy-2.1.0-cp312-cp312-win_arm64.whl", hash = "sha256:98ebb6589bb3b6d0c6f0c459d53ca55b8091fbc13d277c4041c885392e8195e8", size = 10040684, upload-time = "2026-05-11T18:36:48.199Z" },
+    { url = "https://files.pythonhosted.org/packages/6e/dd/c7191469c777f07689c032a8f7326e393ea34c92d6d76eb7ce5ba57ea66d/mypy-2.1.0-cp313-cp313-macosx_10_13_x86_64.whl", hash = "sha256:35aac3bb114e03888f535d5eb51b8bafbb3266586b599da1940f9b1be3ec5bd5", size = 14852174, upload-time = "2026-05-11T18:31:38.929Z" },
+    { url = "https://files.pythonhosted.org/packages/55/8c/aed55408879043d72bb9135f4d0d19a02b886dd569631e113e3d2706cb8d/mypy-2.1.0-cp313-cp313-macosx_11_0_arm64.whl", hash = "sha256:8de55a8c861f2a49331f807be98d90caeceeef520bde13d43a160207f8af613e", size = 13651542, upload-time = "2026-05-11T18:36:04.636Z" },
+    { url = "https://files.pythonhosted.org/packages/3a/8e/f371a824b1f1fa8ea6e3dbb8703d232977d572be2329554a3bc4d960302f/mypy-2.1.0-cp313-cp313-manylinux2014_aarch64.manylinux_2_17_aarch64.manylinux_2_28_aarch64.whl", hash = "sha256:5fdf2941a07434af755837d9880f7d7d25f1dacb1af9dcd4b9b66f2220a3024e", size = 14033929, upload-time = "2026-05-11T18:35:55.742Z" },
+    { url = "https://files.pythonhosted.org/packages/94/21/f54be870d6dd53a82c674407e0f8eed7174b05ec78d42e5abd7b42e84fd5/mypy-2.1.0-cp313-cp313-manylinux2014_x86_64.manylinux_2_17_x86_64.manylinux_2_28_x86_64.whl", hash = "sha256:e195b817c13f02352a9c124301f9f30f078405444679b6753c1b96b6eed37285", size = 15039200, upload-time = "2026-05-11T18:33:10.281Z" },
+    { url = "https://files.pythonhosted.org/packages/17/99/bf21748626a40ce59fd29a39386ab46afec88b7bd2f0fa6c3a97c995523f/mypy-2.1.0-cp313-cp313-musllinux_1_2_x86_64.whl", hash = "sha256:5431d42af987ebd92ba2f71d45c85ed41d8e6ca9f5fd209a69f68f707d2469e5", size = 15272690, upload-time = "2026-05-11T18:32:07.205Z" },
+    { url = "https://files.pythonhosted.org/packages/d6/d7/9e90d2cf47100bea550ed2bc7b0d4de3a62181d84d5e37da0003e8462637/mypy-2.1.0-cp313-cp313-win_amd64.whl", hash = "sha256:767fe8c66dc3e01e19e1737d4c38ebefead16125e1b8e58ad421903b376f5c65", size = 11147435, upload-time = "2026-05-11T18:33:56.477Z" },
+    { url = "https://files.pythonhosted.org/packages/ec/46/e5c449e858798e35ffc90946282a27c62a77be743fe17480e4977374eb91/mypy-2.1.0-cp313-cp313-win_arm64.whl", hash = "sha256:ecfe70d43775ab99562ab128ce49854a362044c9f894961f68f898c23cb7429d", size = 10035052, upload-time = "2026-05-11T18:32:30.049Z" },
+    { url = "https://files.pythonhosted.org/packages/b0/ca/b279a672e874aedd5498ae25f722dacc8aa86bbffb939b3f97cbb1cf6686/mypy-2.1.0-cp314-cp314-macosx_10_15_x86_64.whl", hash = "sha256:7354c5a7f69d9345c3d6e69921d57088eea3ddeeb6b20d34c1b3855b02c36ec2", size = 14848422, upload-time = "2026-05-11T18:35:45.984Z" },
+    { url = "https://files.pythonhosted.org/packages/27/e6/3efe56c631d959b9b4454e208b0ac4b7f4f58b404c89f8bec7b49efdfc21/mypy-2.1.0-cp314-cp314-macosx_11_0_arm64.whl", hash = "sha256:49890d4f76ac9e06ec117f9e09f3174da70a620a0c300953d8595c926e80947f", size = 13677374, upload-time = "2026-05-11T18:36:57.188Z" },
+    { url = "https://files.pythonhosted.org/packages/84/7f/8107ea87a44fd1f1b59882442f033c9c3488c127201b1d1d15f1cbd6022e/mypy-2.1.0-cp314-cp314-manylinux2014_aarch64.manylinux_2_17_aarch64.manylinux_2_28_aarch64.whl", hash = "sha256:761be68e023ef5d94678772396a8af1220030f80837a3afd8d0aef3b419666f4", size = 14055743, upload-time = "2026-05-11T18:35:18.361Z" },
+    { url = "https://files.pythonhosted.org/packages/51/4d/b6d34db183133b83761b9199a82d31557cdbb70a380d8c3b3438e11882a3/mypy-2.1.0-cp314-cp314-manylinux2014_x86_64.manylinux_2_17_x86_64.manylinux_2_28_x86_64.whl", hash = "sha256:c90345fc182dc363b891350457ec69c35140858538f38b4540845afcc32b1aef", size = 15020937, upload-time = "2026-05-11T18:34:59.618Z" },
+    { url = "https://files.pythonhosted.org/packages/ff/d7/f08360c691d758acb02f45022c34d98b92892f4ea756644e1000d4b9f3d8/mypy-2.1.0-cp314-cp314-musllinux_1_2_x86_64.whl", hash = "sha256:b84802e7b5a6daf1f5e15bc9fcd7ddae77be13981ffab037f1c67bb84d67d135", size = 15253371, upload-time = "2026-05-11T18:36:41.081Z" },
+    { url = "https://files.pythonhosted.org/packages/67/1b/09460a13719530a19bce27bd3bc8449e83569dd2ba7faf51c9c3c30c0b61/mypy-2.1.0-cp314-cp314-win_amd64.whl", hash = "sha256:022c771234936ceac541ebaf836fe9e2abeb3f5e09aff21588fe543ff006fe21", size = 11326429, upload-time = "2026-05-11T18:34:13.526Z" },
+    { url = "https://files.pythonhosted.org/packages/40/62/75dbf0f82f7b6680340efc614af29dd0b3c17b8a4f1cd09b8bd2fd6bc814/mypy-2.1.0-cp314-cp314-win_arm64.whl", hash = "sha256:498207db725cec88829a6a5c2fc771205fd043719ef98bc49aba8fb9fc4e6d57", size = 10218799, upload-time = "2026-05-11T18:32:23.491Z" },
+    { url = "https://files.pythonhosted.org/packages/b2/66/caca04ed7d972fb6eb6dd1ccd6df1de5c38fae8c5b3dc1c4e8e0d85ee6b9/mypy-2.1.0-cp314-cp314t-macosx_10_15_x86_64.whl", hash = "sha256:7d5e5cad0efeba72b93cd17490cc0d69c5ac9ca132994fe3fb0314808aeeb83e", size = 15923458, upload-time = "2026-05-11T18:35:28.64Z" },
+    { url = "https://files.pythonhosted.org/packages/ed/52/2d90cbe49d014b13ed7ff337930c30bad35893fe38a1e4641e756bb62191/mypy-2.1.0-cp314-cp314t-macosx_11_0_arm64.whl", hash = "sha256:ff715050c127d724fd260a2e666e7747fdd83511c0c47d449d98238970aef780", size = 14757697, upload-time = "2026-05-11T18:36:14.208Z" },
+    { url = "https://files.pythonhosted.org/packages/ac/37/d98f4a14e081b238992d0ed96b6d39c7cc0148c9699eb71eaa68629665ea/mypy-2.1.0-cp314-cp314t-manylinux2014_aarch64.manylinux_2_17_aarch64.manylinux_2_28_aarch64.whl", hash = "sha256:82208da9e09414d520e912d3e462d454854bed0810b71540bb016dcbca7308fd", size = 15405638, upload-time = "2026-05-11T18:33:48.249Z" },
+    { url = "https://files.pythonhosted.org/packages/a3/c2/15c46613b24a84fad2aea1248bf9619b99c2767ae9071fe224c179a0b7d4/mypy-2.1.0-cp314-cp314t-manylinux2014_x86_64.manylinux_2_17_x86_64.manylinux_2_28_x86_64.whl", hash = "sha256:e79ebc1b904b84f0310dff7469655a9c36c7a68bddb37bdd42b67a332df61d08", size = 16215852, upload-time = "2026-05-11T18:32:50.296Z" },
+    { url = "https://files.pythonhosted.org/packages/5c/90/9c16a57f482c76d25f6379762b56bbf65c711d8158cf271fb2802cfb0640/mypy-2.1.0-cp314-cp314t-musllinux_1_2_x86_64.whl", hash = "sha256:e583edc957cfb0deb142079162ae826f58449b116c1d442f2d91c69d9fced081", size = 16452695, upload-time = "2026-05-11T18:33:38.182Z" },
+    { url = "https://files.pythonhosted.org/packages/0f/4c/215a4eeb63cacc5f17f516691ea7285d11e249802b942476bff15922a314/mypy-2.1.0-cp314-cp314t-win_amd64.whl", hash = "sha256:b33b6cd332695bba180d55e717a79d3038e479a2c49cc5eb3d53603409b9a5d7", size = 12866622, upload-time = "2026-05-11T18:34:39.945Z" },
+    { url = "https://files.pythonhosted.org/packages/4b/50/1043e1db5f455ffe4c9ab22747cd8ca2bc492b1e4f4e21b130a44ee2b217/mypy-2.1.0-cp314-cp314t-win_arm64.whl", hash = "sha256:4f910fe825376a7b66ef7ca8c98e5a149e8cd64c19ae71d84047a74ee060d4e6", size = 10610798, upload-time = "2026-05-11T18:36:31.444Z" },
+    { url = "https://files.pythonhosted.org/packages/0d/2a/13ca1f292f6db1b98ff495ef3467736b331621c5917cad984b7043e7348d/mypy-2.1.0-py3-none-any.whl", hash = "sha256:a663814603a5c563fb87a4f96fb473eeb30d1f5a4885afcf44f9db000a366289", size = 2693302, upload-time = "2026-05-11T18:31:29.246Z" },
+]
+
+[[package]]
+name = "mypy-extensions"
+version = "1.1.0"
+source = { registry = "https://pypi.org/simple" }
+sdist = { url = "https://files.pythonhosted.org/packages/a2/6e/371856a3fb9d31ca8dac321cda606860fa4548858c0cc45d9d1d4ca2628b/mypy_extensions-1.1.0.tar.gz", hash = "sha256:52e68efc3284861e772bbcd66823fde5ae21fd2fdb51c62a211403730b916558", size = 6343, upload-time = "2025-04-22T14:54:24.164Z" }
+wheels = [
+    { url = "https://files.pythonhosted.org/packages/79/7b/2c79738432f5c924bef5071f933bcc9efd0473bac3b4aa584a6f7c1c8df8/mypy_extensions-1.1.0-py3-none-any.whl", hash = "sha256:1be4cccdb0f2482337c4743e60421de3a356cd97508abadd57d47403e94f5505", size = 4963, upload-time = "2025-04-22T14:54:22.983Z" },
+]
+
+[[package]]
+name = "narwhals"
+version = "2.22.1"
+source = { registry = "https://pypi.org/simple" }
+sdist = { url = "https://files.pythonhosted.org/packages/62/3c/c4ef2164a71c1a63d7f1ae411c4082c5fa872405106db60a4b7114989ad7/narwhals-2.22.1.tar.gz", hash = "sha256:d62920805a0a43b7ff8b54b0c0d3142d796f8a9301836ada37e573d6a33cbcd9", size = 647493, upload-time = "2026-06-05T12:34:34.051Z" }
+wheels = [
+    { url = "https://files.pythonhosted.org/packages/48/ca/36339329c4604adbcc99c899b7eb1ce1a555c499b6a6860757dc9bfed36d/narwhals-2.22.1-py3-none-any.whl", hash = "sha256:60567d774edf77db53906f89d9fbd164e66e56d66d388e1e6990f17ac33cfb53", size = 454815, upload-time = "2026-06-05T12:34:32.289Z" },
+]
+
+[[package]]
+name = "networkx"
+version = "3.6.1"
+source = { registry = "https://pypi.org/simple" }
+sdist = { url = "https://files.pythonhosted.org/packages/6a/51/63fe664f3908c97be9d2e4f1158eb633317598cfa6e1fc14af5383f17512/networkx-3.6.1.tar.gz", hash = "sha256:26b7c357accc0c8cde558ad486283728b65b6a95d85ee1cd66bafab4c8168509", size = 2517025, upload-time = "2025-12-08T17:02:39.908Z" }
+wheels = [
+    { url = "https://files.pythonhosted.org/packages/9e/c9/b2622292ea83fbb4ec318f5b9ab867d0a28ab43c5717bb85b0a5f6b3b0a4/networkx-3.6.1-py3-none-any.whl", hash = "sha256:d47fbf302e7d9cbbb9e2555a0d267983d2aa476bac30e90dfbe5669bd57f3762", size = 2068504, upload-time = "2025-12-08T17:02:38.159Z" },
+]
+
+[[package]]
+name = "ninja"
+version = "1.13.0"
+source = { registry = "https://pypi.org/simple" }
+sdist = { url = "https://files.pythonhosted.org/packages/43/73/79a0b22fc731989c708068427579e840a6cf4e937fe7ae5c5d0b7356ac22/ninja-1.13.0.tar.gz", hash = "sha256:4a40ce995ded54d9dc24f8ea37ff3bf62ad192b547f6c7126e7e25045e76f978", size = 242558, upload-time = "2025-08-11T15:10:19.421Z" }
+wheels = [
+    { url = "https://files.pythonhosted.org/packages/3c/74/d02409ed2aa865e051b7edda22ad416a39d81a84980f544f8de717cab133/ninja-1.13.0-py3-none-macosx_10_9_universal2.whl", hash = "sha256:fa2a8bfc62e31b08f83127d1613d10821775a0eb334197154c4d6067b7068ff1", size = 310125, upload-time = "2025-08-11T15:09:50.971Z" },
+    { url = "https://files.pythonhosted.org/packages/8e/de/6e1cd6b84b412ac1ef327b76f0641aeb5dcc01e9d3f9eee0286d0c34fd93/ninja-1.13.0-py3-none-manylinux2014_aarch64.manylinux_2_17_aarch64.whl", hash = "sha256:3d00c692fb717fd511abeb44b8c5d00340c36938c12d6538ba989fe764e79630", size = 177467, upload-time = "2025-08-11T15:09:52.767Z" },
+    { url = "https://files.pythonhosted.org/packages/c8/83/49320fb6e58ae3c079381e333575fdbcf1cca3506ee160a2dcce775046fa/ninja-1.13.0-py3-none-manylinux2014_i686.manylinux_2_17_i686.whl", hash = "sha256:be7f478ff9f96a128b599a964fc60a6a87b9fa332ee1bd44fa243ac88d50291c", size = 187834, upload-time = "2025-08-11T15:09:54.115Z" },
+    { url = "https://files.pythonhosted.org/packages/56/c7/ba22748fb59f7f896b609cd3e568d28a0a367a6d953c24c461fe04fc4433/ninja-1.13.0-py3-none-manylinux2014_ppc64le.manylinux_2_17_ppc64le.whl", hash = "sha256:60056592cf495e9a6a4bea3cd178903056ecb0943e4de45a2ea825edb6dc8d3e", size = 202736, upload-time = "2025-08-11T15:09:55.745Z" },
+    { url = "https://files.pythonhosted.org/packages/79/22/d1de07632b78ac8e6b785f41fa9aad7a978ec8c0a1bf15772def36d77aac/ninja-1.13.0-py3-none-manylinux2014_s390x.manylinux_2_17_s390x.whl", hash = "sha256:1c97223cdda0417f414bf864cfb73b72d8777e57ebb279c5f6de368de0062988", size = 179034, upload-time = "2025-08-11T15:09:57.394Z" },
+    { url = "https://files.pythonhosted.org/packages/ed/de/0e6edf44d6a04dabd0318a519125ed0415ce437ad5a1ec9b9be03d9048cf/ninja-1.13.0-py3-none-manylinux2014_x86_64.manylinux_2_17_x86_64.whl", hash = "sha256:fb46acf6b93b8dd0322adc3a4945452a4e774b75b91293bafcc7b7f8e6517dfa", size = 180716, upload-time = "2025-08-11T15:09:58.696Z" },
+    { url = "https://files.pythonhosted.org/packages/54/28/938b562f9057aaa4d6bfbeaa05e81899a47aebb3ba6751e36c027a7f5ff7/ninja-1.13.0-py3-none-manylinux_2_28_armv7l.manylinux_2_31_armv7l.whl", hash = "sha256:4be9c1b082d244b1ad7ef41eb8ab088aae8c109a9f3f0b3e56a252d3e00f42c1", size = 146843, upload-time = "2025-08-11T15:10:00.046Z" },
+    { url = "https://files.pythonhosted.org/packages/2a/fb/d06a3838de4f8ab866e44ee52a797b5491df823901c54943b2adb0389fbb/ninja-1.13.0-py3-none-manylinux_2_31_riscv64.whl", hash = "sha256:6739d3352073341ad284246f81339a384eec091d9851a886dfa5b00a6d48b3e2", size = 154402, upload-time = "2025-08-11T15:10:01.657Z" },
+    { url = "https://files.pythonhosted.org/packages/31/bf/0d7808af695ceddc763cf251b84a9892cd7f51622dc8b4c89d5012779f06/ninja-1.13.0-py3-none-musllinux_1_2_aarch64.whl", hash = "sha256:11be2d22027bde06f14c343f01d31446747dbb51e72d00decca2eb99be911e2f", size = 552388, upload-time = "2025-08-11T15:10:03.349Z" },
+    { url = "https://files.pythonhosted.org/packages/9d/70/c99d0c2c809f992752453cce312848abb3b1607e56d4cd1b6cded317351a/ninja-1.13.0-py3-none-musllinux_1_2_armv7l.whl", hash = "sha256:aa45b4037b313c2f698bc13306239b8b93b4680eb47e287773156ac9e9304714", size = 472501, upload-time = "2025-08-11T15:10:04.735Z" },
+    { url = "https://files.pythonhosted.org/packages/9f/43/c217b1153f0e499652f5e0766da8523ce3480f0a951039c7af115e224d55/ninja-1.13.0-py3-none-musllinux_1_2_i686.whl", hash = "sha256:5f8e1e8a1a30835eeb51db05cf5a67151ad37542f5a4af2a438e9490915e5b72", size = 638280, upload-time = "2025-08-11T15:10:06.512Z" },
+    { url = "https://files.pythonhosted.org/packages/8c/45/9151bba2c8d0ae2b6260f71696330590de5850e5574b7b5694dce6023e20/ninja-1.13.0-py3-none-musllinux_1_2_ppc64le.whl", hash = "sha256:3d7d7779d12cb20c6d054c61b702139fd23a7a964ec8f2c823f1ab1b084150db", size = 642420, upload-time = "2025-08-11T15:10:08.35Z" },
+    { url = "https://files.pythonhosted.org/packages/3c/fb/95752eb635bb8ad27d101d71bef15bc63049de23f299e312878fc21cb2da/ninja-1.13.0-py3-none-musllinux_1_2_riscv64.whl", hash = "sha256:d741a5e6754e0bda767e3274a0f0deeef4807f1fec6c0d7921a0244018926ae5", size = 585106, upload-time = "2025-08-11T15:10:09.818Z" },
+    { url = "https://files.pythonhosted.org/packages/c1/31/aa56a1a286703800c0cbe39fb4e82811c277772dc8cd084f442dd8e2938a/ninja-1.13.0-py3-none-musllinux_1_2_s390x.whl", hash = "sha256:e8bad11f8a00b64137e9b315b137d8bb6cbf3086fbdc43bf1f90fd33324d2e96", size = 707138, upload-time = "2025-08-11T15:10:11.366Z" },
+    { url = "https://files.pythonhosted.org/packages/34/6f/5f5a54a1041af945130abdb2b8529cbef0cdcbbf9bcf3f4195378319d29a/ninja-1.13.0-py3-none-musllinux_1_2_x86_64.whl", hash = "sha256:b4f2a072db3c0f944c32793e91532d8948d20d9ab83da9c0c7c15b5768072200", size = 581758, upload-time = "2025-08-11T15:10:13.295Z" },
+    { url = "https://files.pythonhosted.org/packages/95/97/51359c77527d45943fe7a94d00a3843b81162e6c4244b3579fe8fc54cb9c/ninja-1.13.0-py3-none-win32.whl", hash = "sha256:8cfbb80b4a53456ae8a39f90ae3d7a2129f45ea164f43fadfa15dc38c4aef1c9", size = 267201, upload-time = "2025-08-11T15:10:15.158Z" },
+    { url = "https://files.pythonhosted.org/packages/29/45/c0adfbfb0b5895aa18cec400c535b4f7ff3e52536e0403602fc1a23f7de9/ninja-1.13.0-py3-none-win_amd64.whl", hash = "sha256:fb8ee8719f8af47fed145cced4a85f0755dd55d45b2bddaf7431fa89803c5f3e", size = 309975, upload-time = "2025-08-11T15:10:16.697Z" },
+    { url = "https://files.pythonhosted.org/packages/df/93/a7b983643d1253bb223234b5b226e69de6cda02b76cdca7770f684b795f5/ninja-1.13.0-py3-none-win_arm64.whl", hash = "sha256:3c0b40b1f0bba764644385319028650087b4c1b18cdfa6f45cb39a3669b81aa9", size = 290806, upload-time = "2025-08-11T15:10:18.018Z" },
+]
+
+[[package]]
+name = "nodeenv"
+version = "1.10.0"
+source = { registry = "https://pypi.org/simple" }
+sdist = { url = "https://files.pythonhosted.org/packages/24/bf/d1bda4f6168e0b2e9e5958945e01910052158313224ada5ce1fb2e1113b8/nodeenv-1.10.0.tar.gz", hash = "sha256:996c191ad80897d076bdfba80a41994c2b47c68e224c542b48feba42ba00f8bb", size = 55611, upload-time = "2025-12-20T14:08:54.006Z" }
+wheels = [
+    { url = "https://files.pythonhosted.org/packages/88/b2/d0896bdcdc8d28a7fc5717c305f1a861c26e18c05047949fb371034d98bd/nodeenv-1.10.0-py2.py3-none-any.whl", hash = "sha256:5bb13e3eed2923615535339b3c620e76779af4cb4c6a90deccc9e36b274d3827", size = 23438, upload-time = "2025-12-20T14:08:52.782Z" },
+]
+
+[[package]]
+name = "numpy"
+version = "2.4.6"
+source = { registry = "https://pypi.org/simple" }
+resolution-markers = [
+    "python_full_version < '3.12' and sys_platform == 'linux'",
+    "python_full_version < '3.12' and sys_platform != 'linux'",
+]
+sdist = { url = "https://files.pythonhosted.org/packages/d0/ad/fed0499ce6a338d2a03ebae59cd15093910c8875328855781952abf6c2fe/numpy-2.4.6.tar.gz", hash = "sha256:f3a3570c4a2a16746ac2c31a7c7c7b0c186b95ce902e33db6f28094ed7387dda", size = 20735807, upload-time = "2026-05-18T23:37:14.07Z" }
+wheels = [
+    { url = "https://files.pythonhosted.org/packages/b3/49/ec46835a70be8fa6446c495126ac84fdb28cb2558e1620ffb87a10c8b64c/numpy-2.4.6-cp311-cp311-macosx_10_9_x86_64.whl", hash = "sha256:0280e0356c0829a18d9de1cb7eee50ec22ca639878d7240307ca0943d73cd2c4", size = 16969194, upload-time = "2026-05-18T23:33:13.503Z" },
+    { url = "https://files.pythonhosted.org/packages/0e/0d/f5957185c0ee2f3e12f78715aa9e3b353fd83633316c8532b38faa37e3f6/numpy-2.4.6-cp311-cp311-macosx_11_0_arm64.whl", hash = "sha256:110f8b71aacb688ec69062bb7f6938a0f8acb01b7c1c4beb453c65b6d234584d", size = 14964111, upload-time = "2026-05-18T23:33:17.795Z" },
+    { url = "https://files.pythonhosted.org/packages/ad/40/40a40ee0ddf7ceb782c49af278894b686e586d65d8c1889c8b5da01a3d7d/numpy-2.4.6-cp311-cp311-macosx_14_0_arm64.whl", hash = "sha256:4cfe66903cc32a9921a6733d96b19bb6abf310397581bbad89c228f5abaf0ee8", size = 5469159, upload-time = "2026-05-18T23:33:20.654Z" },
+    { url = "https://files.pythonhosted.org/packages/63/13/f9a8046535cb21deae82f8d03de9617e08882d274fad2539630761888228/numpy-2.4.6-cp311-cp311-macosx_14_0_x86_64.whl", hash = "sha256:8155154c7c691289fe18f510b5d4657c68c67989f293f0535a91360392ff6538", size = 6798936, upload-time = "2026-05-18T23:33:22.987Z" },
+    { url = "https://files.pythonhosted.org/packages/33/a8/6fa8c1a345a8c85dbb21932c447bee07c30a2c2a3f31e369c0a84b300147/numpy-2.4.6-cp311-cp311-manylinux_2_27_aarch64.manylinux_2_28_aarch64.whl", hash = "sha256:0ab0a9c4ffb1a6d95ef519fe4247dba8eb6b18ad93999f76b7f657039acabd47", size = 15966692, upload-time = "2026-05-18T23:33:26.62Z" },
+    { url = "https://files.pythonhosted.org/packages/02/03/74fe2a4cb3817d94d86402f2506554130a2f01414e299b5a843e5a8a957f/numpy-2.4.6-cp311-cp311-manylinux_2_27_x86_64.manylinux_2_28_x86_64.whl", hash = "sha256:89cd468399cfd2504718f0ba50e410dca55a170b61a02ad92bb18c8a65186e93", size = 16918164, upload-time = "2026-05-18T23:33:29.955Z" },
+    { url = "https://files.pythonhosted.org/packages/c5/80/3615be3313f7e7696609bc194b9f0101da809df79e859bdb84e0cd043f46/numpy-2.4.6-cp311-cp311-musllinux_1_2_aarch64.whl", hash = "sha256:c2d37ab77531417474168eb79d6d80b14f821a966818505d03013d0833edb7a8", size = 17322877, upload-time = "2026-05-18T23:33:34.724Z" },
+    { url = "https://files.pythonhosted.org/packages/ca/ac/a691e0fe2675e370d0e08ff905adc49a1c8830e8cae03efe4477e92cd55d/numpy-2.4.6-cp311-cp311-musllinux_1_2_x86_64.whl", hash = "sha256:f407cb6b8e9d6d8c626bc73c945db1706035af8fd632295547bf1c9e46d092d6", size = 18651487, upload-time = "2026-05-18T23:33:38.217Z" },
+    { url = "https://files.pythonhosted.org/packages/15/a7/9bc1cd626d7bf6869bfedf27b91b6ab5dd607758bf8e959d6fa80c6a59cb/numpy-2.4.6-cp311-cp311-win32.whl", hash = "sha256:ddea102b48f9e339f3948bf22040944184627a30fdf7f858667673b9c5f033c8", size = 6233945, upload-time = "2026-05-18T23:33:41.331Z" },
+    { url = "https://files.pythonhosted.org/packages/c5/31/7fc6239c12bce7e931463251cca4426c465e1876ba3cc785402ef4dd8f4e/numpy-2.4.6-cp311-cp311-win_amd64.whl", hash = "sha256:1e254a00cdf42b1e4d5b3d68d33af63268d41340d8885df2ab6470f2e1500147", size = 12608406, upload-time = "2026-05-18T23:33:44.131Z" },
+    { url = "https://files.pythonhosted.org/packages/27/83/140f85a466595a16382996a1bf06b2b54bcd597488921b0c9daaeeda72af/numpy-2.4.6-cp311-cp311-win_arm64.whl", hash = "sha256:ed9749eef4cbd126da3dc1d6bcb3a57f5eb7ac6a6484146bdbf743f552dfc577", size = 10479528, upload-time = "2026-05-18T23:33:50.725Z" },
+    { url = "https://files.pythonhosted.org/packages/95/2a/3d7b5ac8aac24feaf9ad7ed58f45b0bbc06d37e4338ae84c9f2298b570f9/numpy-2.4.6-cp312-cp312-macosx_10_13_x86_64.whl", hash = "sha256:001fbb8e08d942dd57599e781f2472269ee7f2755fae407b4f67b2f0b17da3f1", size = 16689119, upload-time = "2026-05-18T23:33:54.065Z" },
+    { url = "https://files.pythonhosted.org/packages/ea/12/92c4c131527599e8288d6918e888d88726f84d805d784b771f32408aeaef/numpy-2.4.6-cp312-cp312-macosx_11_0_arm64.whl", hash = "sha256:ebfb099f8dcf083deef3ac1ca4c1503f387cf76296fcb3816b66f5ecb5f54fdb", size = 14699246, upload-time = "2026-05-18T23:33:57.621Z" },
+    { url = "https://files.pythonhosted.org/packages/ad/fe/c0a6b7b2ca128a8fb228575147073b660656734b8ebe4d76c8fd748dcc79/numpy-2.4.6-cp312-cp312-macosx_14_0_arm64.whl", hash = "sha256:3213d622a0283a39a93d188f3cf72b26862df52fbb4ca3697f51705016523d41", size = 5204410, upload-time = "2026-05-18T23:34:00.302Z" },
+    { url = "https://files.pythonhosted.org/packages/f3/d4/9770d14ba719432bb90a421bfd443872ed0f70f7264b64bec12ea363d5fd/numpy-2.4.6-cp312-cp312-macosx_14_0_x86_64.whl", hash = "sha256:357cc07a6d7b0b182ff02249616a03742827ebb1277546b5c7cd7f7620a45698", size = 6551240, upload-time = "2026-05-18T23:34:02.852Z" },
+    { url = "https://files.pythonhosted.org/packages/c9/c6/50a46a6205feba2343f1d6d17438107c5dc491ed1c736e6ea68689fd906b/numpy-2.4.6-cp312-cp312-manylinux_2_27_aarch64.manylinux_2_28_aarch64.whl", hash = "sha256:5f9fb9157b4ce2971008323afe46053787b526ef624fea915b261468a8421a0f", size = 15671012, upload-time = "2026-05-18T23:34:05.485Z" },
+    { url = "https://files.pythonhosted.org/packages/99/60/14115e6364fa676c5397c2ad3004e527e9aa487abf5d0706ec81bbd08529/numpy-2.4.6-cp312-cp312-manylinux_2_27_x86_64.manylinux_2_28_x86_64.whl", hash = "sha256:90f9849678c75fe7afa2d348ac842c168b0a4d3d61919687216dfc547976d853", size = 16645538, upload-time = "2026-05-18T23:34:09.265Z" },
+    { url = "https://files.pythonhosted.org/packages/ae/c5/693cbe59e57db94d2231fa519ca3978dc9e19da5a8f088588f5c6e947ff2/numpy-2.4.6-cp312-cp312-musllinux_1_2_aarch64.whl", hash = "sha256:c1a2af6c6ef86344a6b0db6b97834208bf598db514f2b155042439b62605601a", size = 17020706, upload-time = "2026-05-18T23:34:13.053Z" },
+    { url = "https://files.pythonhosted.org/packages/ef/fc/85b7c4eff9b4966ade25c2273cf7e7012e92366c032058653934b37de044/numpy-2.4.6-cp312-cp312-musllinux_1_2_x86_64.whl", hash = "sha256:e5805d5a22fd19c8ccff10a9561f9df94436b0545619ea579db2d3c35294bce2", size = 18368541, upload-time = "2026-05-18T23:34:17.024Z" },
+    { url = "https://files.pythonhosted.org/packages/f6/81/e1b27545deedce7f4a0b348618c6b62d74e36a4dc9ccd42f3eb2f85eee32/numpy-2.4.6-cp312-cp312-win32.whl", hash = "sha256:e3eeb0aabd6bd5ce64faae67e9935203a6991b4bc2a485a767fbafb2c5125f45", size = 5962825, upload-time = "2026-05-18T23:34:20.3Z" },
+    { url = "https://files.pythonhosted.org/packages/ab/ca/feab00bd44aa5fe1ad2c18f08b4d3bb92e26484b0b1d1443897809ed528c/numpy-2.4.6-cp312-cp312-win_amd64.whl", hash = "sha256:d8e8286dd7cea7895157318d1b91cdacac64c479f3cbc8dce548331728484751", size = 12321687, upload-time = "2026-05-18T23:34:23.095Z" },
+    { url = "https://files.pythonhosted.org/packages/63/cf/5a6d34850a39d1093558564f77ee8e8e0bee5061151b8f05a55711001ec7/numpy-2.4.6-cp312-cp312-win_arm64.whl", hash = "sha256:4081eb135ac24158bd51cdfbef16f1c64df7063b1143f24731387137c092bec8", size = 10221482, upload-time = "2026-05-18T23:34:25.876Z" },
+    { url = "https://files.pythonhosted.org/packages/fb/82/bdab26d7438c6791ca31b7c024ca37c1eab8b726ba236129005cd4a06e45/numpy-2.4.6-cp313-cp313-macosx_10_13_x86_64.whl", hash = "sha256:511dbaf848decaaaf4b4ca48032619fb3138710c4bf7da7617765edad1ef96b0", size = 16684648, upload-time = "2026-05-18T23:34:29.41Z" },
+    { url = "https://files.pythonhosted.org/packages/1b/30/a80189bcc7f5e4258b3fbc3968d909d1756f54d023299ecc39ad6fdb9ef8/numpy-2.4.6-cp313-cp313-macosx_11_0_arm64.whl", hash = "sha256:bf162abab1c1a736333192707cef898e735a5ca00f38f27eeedf44b39d9e85eb", size = 14693902, upload-time = "2026-05-18T23:34:33.013Z" },
+    { url = "https://files.pythonhosted.org/packages/97/12/70b5d0d7c15e1ebb8a6a84a8caa1d19e181d84fb58bb6d70aca29099dec1/numpy-2.4.6-cp313-cp313-macosx_14_0_arm64.whl", hash = "sha256:043191bfa8eab18c776647b62723ac9dddece59743b13f49b2016094129c2b3f", size = 5198992, upload-time = "2026-05-18T23:34:36.132Z" },
+    { url = "https://files.pythonhosted.org/packages/ba/8c/ebd2a8f8a83541f8d38cc5667e8c2b69cecfd30da6e45693e8158857d44b/numpy-2.4.6-cp313-cp313-macosx_14_0_x86_64.whl", hash = "sha256:6180d8b35af935aed8ece3a85e0a43f87393ae0ac87c8d2c8bd2c993f7270ef3", size = 6546944, upload-time = "2026-05-18T23:34:38.484Z" },
+    { url = "https://files.pythonhosted.org/packages/bb/c5/7b863a97a91671a0338f4253bd3b5a3d3852f0692dae91711c9f4a10e787/numpy-2.4.6-cp313-cp313-manylinux_2_27_aarch64.manylinux_2_28_aarch64.whl", hash = "sha256:72fbe16c6fac95aedf5937fa873445cec2110be35d8a4e9433d7501fd98dae6b", size = 15669392, upload-time = "2026-05-18T23:34:41.257Z" },
+    { url = "https://files.pythonhosted.org/packages/a5/9d/3584b9984ca4c047aea75214ce1a4c4c73d849bd71b604264b7f5653f8a8/numpy-2.4.6-cp313-cp313-manylinux_2_27_x86_64.manylinux_2_28_x86_64.whl", hash = "sha256:a7830bab239b79cda9c08c2da014761cafb48da6150e1da17ac06283f43b6089", size = 16633220, upload-time = "2026-05-18T23:34:45.075Z" },
+    { url = "https://files.pythonhosted.org/packages/05/ae/7c67fba23bd98caec7c99261f3a16072ade14813486b0282cb29846de832/numpy-2.4.6-cp313-cp313-musllinux_1_2_aarch64.whl", hash = "sha256:ef4aea96ce4d3b074422cb4f2f64e216bf9e213004bb58ecfdf50ea02ea8eb9a", size = 17020800, upload-time = "2026-05-18T23:34:49.065Z" },
+    { url = "https://files.pythonhosted.org/packages/d9/5d/3b6725cb31d983c5e66916f5d36f6d7e5521129e4c4404d64f918292a5b6/numpy-2.4.6-cp313-cp313-musllinux_1_2_x86_64.whl", hash = "sha256:dfa20cc6ca228e6b155b11da03825975ce66aea520985dbbddf0f2a5a495c605", size = 18357600, upload-time = "2026-05-18T23:34:52.709Z" },
+    { url = "https://files.pythonhosted.org/packages/f7/da/2ccc6c2fe8898dee01d90c75c5f5f914a23daf99e3e0f59516a08760c8b5/numpy-2.4.6-cp313-cp313-win32.whl", hash = "sha256:56b39e5e0622a09a25bf5baf62f4bcf0cb8a41ae6e2819cf49bbc5a74c083f91", size = 5961134, upload-time = "2026-05-18T23:34:55.618Z" },
+    { url = "https://files.pythonhosted.org/packages/b5/cd/9cc4dc876fb065d5c220aae4d5e14826b2715331bb7618ce1fb07a679d99/numpy-2.4.6-cp313-cp313-win_amd64.whl", hash = "sha256:c4fc99836233ea196540b17ab0983aff60ed07941751930f5f4d05bc3b3b7359", size = 12318598, upload-time = "2026-05-18T23:34:58.928Z" },
+    { url = "https://files.pythonhosted.org/packages/39/1e/c0bcba1f8694116485fe28fd1be698c278fcda4141c5b0e53a2aed8b12a8/numpy-2.4.6-cp313-cp313-win_arm64.whl", hash = "sha256:a7c711e21628b52034bb5ab8d1bce291f752fcc5e92accc615778acee1ff4778", size = 10222272, upload-time = "2026-05-18T23:35:02.167Z" },
+    { url = "https://files.pythonhosted.org/packages/63/6d/cc5619247c8f4204e507f5883528372e4ac4bb189e579fb859a12e480b1f/numpy-2.4.6-cp313-cp313t-macosx_11_0_arm64.whl", hash = "sha256:112b06a867b235ef466ed3508ddf0238050df9c727cafb5301ac385b899189a1", size = 14821197, upload-time = "2026-05-18T23:35:05.468Z" },
+    { url = "https://files.pythonhosted.org/packages/00/58/f1c39161c87d9e9bed660f1ed4bafc0e403d5ec9650b6dd77aead07d489b/numpy-2.4.6-cp313-cp313t-macosx_14_0_arm64.whl", hash = "sha256:eaf7fa2de5c0be8ae6ff8e9bea2ccd725e980541244521d8d4b5f3354a27babe", size = 5326287, upload-time = "2026-05-18T23:35:08.693Z" },
+    { url = "https://files.pythonhosted.org/packages/af/57/3917ab0fd97f271a8694513581b8a36c655f111c446852c302f04ccdb6fc/numpy-2.4.6-cp313-cp313t-macosx_14_0_x86_64.whl", hash = "sha256:7265a2f3d436e54ef9f2b52b5c937e6be778781bd97a590319d7348f1c1ca997", size = 6646763, upload-time = "2026-05-18T23:35:11.459Z" },
+    { url = "https://files.pythonhosted.org/packages/eb/0f/037e64c494b67581ae18193d770adef354c41f3f2c8ebf865602d949bf8f/numpy-2.4.6-cp313-cp313t-manylinux_2_27_aarch64.manylinux_2_28_aarch64.whl", hash = "sha256:f74a575920ab21fe304421a3fc28793d82e299cae9eccb37084e9fc7f3617c20", size = 15728070, upload-time = "2026-05-18T23:35:14.79Z" },
+    { url = "https://files.pythonhosted.org/packages/21/a6/5d2bae9c9542eb4df16dc9c46dc79c186e9bad53805dfa5399a6023c6db0/numpy-2.4.6-cp313-cp313t-manylinux_2_27_x86_64.manylinux_2_28_x86_64.whl", hash = "sha256:ede83e07a75dd06bc501566c1eca2afc0d61677c1472ac9ad93fdee6e638a48d", size = 16681752, upload-time = "2026-05-18T23:35:18.836Z" },
+    { url = "https://files.pythonhosted.org/packages/92/14/23d1dfb410ae362cd59ce53e936b1513d545eb40db3949ced632e19a459e/numpy-2.4.6-cp313-cp313t-musllinux_1_2_aarch64.whl", hash = "sha256:68bb27509ac1b9a3443094260f6326150663b06abe40b73a2f81160623da5b67", size = 17086024, upload-time = "2026-05-18T23:35:22.52Z" },
+    { url = "https://files.pythonhosted.org/packages/4b/6e/23595a2c642cdf3bc567877064bdd7f91c8b0038a4453cf2daf7248eafe9/numpy-2.4.6-cp313-cp313t-musllinux_1_2_x86_64.whl", hash = "sha256:a0df0043bdb289bde1f62da130d20df23d58b45429f752bc7a8fc5325a225ecd", size = 18403398, upload-time = "2026-05-18T23:35:26.398Z" },
+    { url = "https://files.pythonhosted.org/packages/8a/90/0ac3bc947217e66dec77e7cbc6a1979d1af70b6461b82f620d3bccd5e4c8/numpy-2.4.6-cp313-cp313t-win32.whl", hash = "sha256:29a287e0cf63ff528da061de6b9f64a4618da591ca1046aafc54062e40ca7eab", size = 6084971, upload-time = "2026-05-18T23:35:29.387Z" },
+    { url = "https://files.pythonhosted.org/packages/77/71/5673e351671a1d2bd6063b91b44f70c0affea7d1516fa7a6572941ba4aa1/numpy-2.4.6-cp313-cp313t-win_amd64.whl", hash = "sha256:25c692919ac5a01f170a3bfcd62d745b24fd095c353d50812637d6fcab442e75", size = 12458532, upload-time = "2026-05-18T23:35:32.175Z" },
+    { url = "https://files.pythonhosted.org/packages/3f/88/19d3503c5046e688f049274b27a3ef3d771152fa80d3ba3d01a3dff61abe/numpy-2.4.6-cp313-cp313t-win_arm64.whl", hash = "sha256:1e978ec1e8bd0e0e4de6bb75de9d30cbb74db6b6a2bb727618613703ca0167dd", size = 10291881, upload-time = "2026-05-18T23:35:35.465Z" },
+    { url = "https://files.pythonhosted.org/packages/f8/91/3ab2044d05fd16d343c5ac2e69b127f1b2854040dd20b193257c78028bd3/numpy-2.4.6-cp314-cp314-macosx_10_15_x86_64.whl", hash = "sha256:06ca2f61ec4385a07a6977c55ba998a4466c123642b4a32694d3128fce18c079", size = 16683458, upload-time = "2026-05-18T23:35:38.353Z" },
+    { url = "https://files.pythonhosted.org/packages/8e/62/764ce66fa4147ae6d73071a3abf804ffe606f174618697c571acdf26a7c9/numpy-2.4.6-cp314-cp314-macosx_11_0_arm64.whl", hash = "sha256:38efbc8de75c7a0fc1ac190162d892787f3f47b57cc291231aafee36b80982b7", size = 14704559, upload-time = "2026-05-18T23:35:42.14Z" },
+    { url = "https://files.pythonhosted.org/packages/60/61/23f27c172f022e04025b7dc2367f4d63c1a398120607ec896228649a6f48/numpy-2.4.6-cp314-cp314-macosx_14_0_arm64.whl", hash = "sha256:d581b735e177fdcdce6fed8e7e8880a3fb6ee4e3653a3ac6af01c6f4c03effc5", size = 5209716, upload-time = "2026-05-18T23:35:45.377Z" },
+    { url = "https://files.pythonhosted.org/packages/03/71/21cf70dc6ea3e3acb95fc53a265b2fc248b981f0194ceb5b475271b8809d/numpy-2.4.6-cp314-cp314-macosx_14_0_x86_64.whl", hash = "sha256:0a041d3d761dc3c35cc56ce0351506a02bcbc25f7b169f652435141a17db9096", size = 6543947, upload-time = "2026-05-18T23:35:47.926Z" },
+    { url = "https://files.pythonhosted.org/packages/d5/91/64288395ee1799bd2e0b04a305dce9666da90c961e1f3fe982a05ee1c036/numpy-2.4.6-cp314-cp314-manylinux_2_27_aarch64.manylinux_2_28_aarch64.whl", hash = "sha256:40fdc1ae7125e518ea98e53e69a4ebc27e1fd50510c47b7ea130cf21e5e1d42b", size = 15685197, upload-time = "2026-05-18T23:35:50.863Z" },
+    { url = "https://files.pythonhosted.org/packages/f3/eb/ebffaa97dc55502df69584a8f0dcf07f69a3e0b3e2323670a2722db9aa39/numpy-2.4.6-cp314-cp314-manylinux_2_27_x86_64.manylinux_2_28_x86_64.whl", hash = "sha256:a2c306dea656c12c68f51f4cea133cbe78ca7435eb28c735eac1d3ebe73be6e8", size = 16638245, upload-time = "2026-05-18T23:35:54.752Z" },
+    { url = "https://files.pythonhosted.org/packages/b8/0b/54f9da33128d7e350fab89c7455902eeae70349ee52bddb448dc4a576f45/numpy-2.4.6-cp314-cp314-musllinux_1_2_aarch64.whl", hash = "sha256:33111801a01c12a8a1e3721f0a9232f8cfc8ae2c6b7098167e6f623c6073f402", size = 17036587, upload-time = "2026-05-18T23:35:58.355Z" },
+    { url = "https://files.pythonhosted.org/packages/b6/f0/fdebc1052db1cc37c64beb22072d67cd6d1c71adca1299f53dec2b5e20d3/numpy-2.4.6-cp314-cp314-musllinux_1_2_x86_64.whl", hash = "sha256:ae506e6902902557576a26ff33eda8695e7ecb3cb36c3b573a0765dee114ebdb", size = 18363226, upload-time = "2026-05-18T23:36:02.845Z" },
+    { url = "https://files.pythonhosted.org/packages/aa/b4/298628d98c72b57e57f7165ae6a481a1deaf6f3c28262a6e4c739c275930/numpy-2.4.6-cp314-cp314-win32.whl", hash = "sha256:aaf159caa35993cb1f56fb9b8e4610d35758e7ca005412eb1daa856a78c9c4b1", size = 6010196, upload-time = "2026-05-18T23:36:05.92Z" },
+    { url = "https://files.pythonhosted.org/packages/df/ac/46de6dda46478f7942f839e094970be2d4a861e005c4b3bf07c92e291a09/numpy-2.4.6-cp314-cp314-win_amd64.whl", hash = "sha256:b507f5c4c1d508876d1819b6bf9a49d365b96320b5d4993426b33a23ca4b8261", size = 12450334, upload-time = "2026-05-18T23:36:09.107Z" },
+    { url = "https://files.pythonhosted.org/packages/78/92/b8b798ac784102c0da830d2257d59358e3d3d90d1e2b3f2575dad976c5cf/numpy-2.4.6-cp314-cp314-win_arm64.whl", hash = "sha256:6f41ae150c4e32db4f3310cdaf64b1593a03dbabe29eec77fc9b50fe64061df6", size = 10495678, upload-time = "2026-05-18T23:36:12.766Z" },
+    { url = "https://files.pythonhosted.org/packages/30/34/ec28d1aa8115971537c01469ab2011ee96827930f0a124de1000cc2a7ed7/numpy-2.4.6-cp314-cp314t-macosx_11_0_arm64.whl", hash = "sha256:ece3d2cfe132e7d51f44a832b303895e6f2d499c5e74dfbdb06ee246147a304a", size = 14823672, upload-time = "2026-05-18T23:36:16.473Z" },
+    { url = "https://files.pythonhosted.org/packages/16/bd/f6d1fede4e54e8042a7ff97bb495510f3c220f94bcd9e8b228e87c92cc0d/numpy-2.4.6-cp314-cp314t-macosx_14_0_arm64.whl", hash = "sha256:e3e5193ef5a3dc73bceee50f7fdc2c90dbb76c42df8d8fae3d1067a583df579e", size = 5328731, upload-time = "2026-05-18T23:36:19.767Z" },
+    { url = "https://files.pythonhosted.org/packages/f4/f0/e105b9e2fd728a9910103884decd6951d9dd73896b914a98d9a231de02ee/numpy-2.4.6-cp314-cp314t-macosx_14_0_x86_64.whl", hash = "sha256:17f9ade344e7d9b464a084d69bcf18fc691cb1db67c62ed80820bf4926d78f0e", size = 6649805, upload-time = "2026-05-18T23:36:22.266Z" },
+    { url = "https://files.pythonhosted.org/packages/82/dd/1206a7ca6ab15e3f02069707ca96222e202af681bb73756da7527f3cb837/numpy-2.4.6-cp314-cp314t-manylinux_2_27_aarch64.manylinux_2_28_aarch64.whl", hash = "sha256:9cd5ffd25db4e7ba6a375693b3fc0fc1791ec636c17db3720da19bde7180ec43", size = 15730496, upload-time = "2026-05-18T23:36:25.713Z" },
+    { url = "https://files.pythonhosted.org/packages/51/e7/38d3ea825dcab85a591734decb2f6c67caa7c8367d374df1a1c3842f9b07/numpy-2.4.6-cp314-cp314t-manylinux_2_27_x86_64.manylinux_2_28_x86_64.whl", hash = "sha256:7d92c3819208a60205a12a245c91ad70cb0a85336659b19b834205573ac8456e", size = 16679616, upload-time = "2026-05-18T23:36:29.652Z" },
+    { url = "https://files.pythonhosted.org/packages/93/b7/caabfdf53edf663e0b4eb74d7d405d83baef09eb5e83bcd32d601d72b93e/numpy-2.4.6-cp314-cp314t-musllinux_1_2_aarch64.whl", hash = "sha256:e85b752a1e912b70eaad4fafbd4d1238007ab221de2009b9a2f5ae7461239895", size = 17085145, upload-time = "2026-05-18T23:36:33.449Z" },
+    { url = "https://files.pythonhosted.org/packages/f9/45/68d7c33a6bcf3e5aa3bdbd57a367e6f615286dfd6482f97e8ffeb734306e/numpy-2.4.6-cp314-cp314t-musllinux_1_2_x86_64.whl", hash = "sha256:29cb7f67d10b479ff07c17d33e39f78c07f71c40ef30d63c153d340e96cd3fb4", size = 18403813, upload-time = "2026-05-18T23:36:37.369Z" },
+    { url = "https://files.pythonhosted.org/packages/9c/50/0753655aa844c99cd9e018aacf76f130f1bd81d881bb74bc0aef5d73a8ba/numpy-2.4.6-cp314-cp314t-win32.whl", hash = "sha256:260a5d70215b61ab4fadf5c7baacd64821842975eea312125ed3c39a6391b063", size = 6156982, upload-time = "2026-05-18T23:36:40.817Z" },
+    { url = "https://files.pythonhosted.org/packages/b2/d4/7c67becf668f973cb490cec3e98dfd799d866f9c989a54d355672cfa0db6/numpy-2.4.6-cp314-cp314t-win_amd64.whl", hash = "sha256:81a1cca95ed5bb92aa8b10dd2cdc9a0d3853a50fad926c28b5d7e8ea54389627", size = 12638908, upload-time = "2026-05-18T23:36:43.996Z" },
+    { url = "https://files.pythonhosted.org/packages/43/bb/e1c71a4295b1b1d1393d50dbb4f2a36283c6859d9d3892e84f00ec5a91d5/numpy-2.4.6-cp314-cp314t-win_arm64.whl", hash = "sha256:0c9136e14ed34a9e343a31c533d78a9813a69a3148332bce5e9821cb2f996e66", size = 10565867, upload-time = "2026-05-18T23:36:47.114Z" },
+    { url = "https://files.pythonhosted.org/packages/de/12/b422cc84439adc0d00de605bf4a308890ae5c26f2c71fbd73e5d08fbb0dd/numpy-2.4.6-pp311-pypy311_pp73-macosx_10_15_x86_64.whl", hash = "sha256:55cced7c52e981362f708ad635198e97a752dfba412cc03c23bbf3bd8d5cd662", size = 16847511, upload-time = "2026-05-18T23:36:50.673Z" },
+    { url = "https://files.pythonhosted.org/packages/44/53/f481bef68011740f8849418d82db07230e825013f31f4eef5ba5b805316a/numpy-2.4.6-pp311-pypy311_pp73-macosx_11_0_arm64.whl", hash = "sha256:d6da64deb6b8ed903e7560180a92f2d804ee1ba5eeb849ac2748b8c1aba1f6d7", size = 14889064, upload-time = "2026-05-18T23:36:53.879Z" },
+    { url = "https://files.pythonhosted.org/packages/7f/57/42ed575c10ced8af951d426bc4e1f8aff16fd851db33f067036215a7f860/numpy-2.4.6-pp311-pypy311_pp73-macosx_14_0_arm64.whl", hash = "sha256:68a5124b13fa6cc2086764a20005d30bc0548146f7f5322f02fce212ca14317f", size = 5394157, upload-time = "2026-05-18T23:36:57.194Z" },
+    { url = "https://files.pythonhosted.org/packages/6a/ef/f66cc724fcc36c1e364c67f51ae9146090b8b584f27d58b97fdae3edd737/numpy-2.4.6-pp311-pypy311_pp73-macosx_14_0_x86_64.whl", hash = "sha256:948424b06129ce883307e8cff868c31396d8dc7630a59c61d70d98dbe70f222c", size = 6708728, upload-time = "2026-05-18T23:36:59.575Z" },
+    { url = "https://files.pythonhosted.org/packages/1a/9c/c531f2293b91265d8b48e9b329f54fdd7ffae73cb4134ea10cca4237e9cc/numpy-2.4.6-pp311-pypy311_pp73-manylinux_2_27_aarch64.manylinux_2_28_aarch64.whl", hash = "sha256:5dbbdb29840ca3d91ee0fece42fc29278886d908280bfec0a5846c6f901a3eb0", size = 15798374, upload-time = "2026-05-18T23:37:02.674Z" },
+    { url = "https://files.pythonhosted.org/packages/1a/b0/413077f6b1153ed3cba361401c6783bbad6114804a000cc22eb71c13e190/numpy-2.4.6-pp311-pypy311_pp73-manylinux_2_27_x86_64.manylinux_2_28_x86_64.whl", hash = "sha256:8ad03c0965fb3c692200e74d458ca28c1dbb4ce96f9a479a8aa041ad5fabca02", size = 16747286, upload-time = "2026-05-18T23:37:06.327Z" },
+    { url = "https://files.pythonhosted.org/packages/15/ce/e5ec180bc41812edcd8daeb8639d205622c0e8c02259d8ab25a0201b3c2a/numpy-2.4.6-pp311-pypy311_pp73-win_amd64.whl", hash = "sha256:2803abfebfc990042cd494d8ce2d5f82e9d847af6d35ec486923aa19dbad5e73", size = 12504263, upload-time = "2026-05-18T23:37:09.715Z" },
+]
+
+[[package]]
+name = "numpy"
+version = "2.5.0"
+source = { registry = "https://pypi.org/simple" }
+resolution-markers = [
+    "python_full_version >= '3.15' and sys_platform == 'linux'",
+    "python_full_version >= '3.13' and python_full_version < '3.15' and sys_platform == 'linux'",
+    "python_full_version == '3.12.*' and sys_platform == 'linux'",
+    "python_full_version >= '3.15' and sys_platform != 'linux'",
+    "python_full_version >= '3.13' and python_full_version < '3.15' and sys_platform != 'linux'",
+    "python_full_version == '3.12.*' and sys_platform != 'linux'",
+]
+sdist = { url = "https://files.pythonhosted.org/packages/e7/05/3d27272d30698dc0ecb7fdfaa41ad70303b444f81722bb99bce1d818638a/numpy-2.5.0.tar.gz", hash = "sha256:5a129578019311b6e56bdd714250f19b518f7dceeeb8d1af5490f4942d3f891c", size = 20652461, upload-time = "2026-06-21T20:57:51.95Z" }
+wheels = [
+    { url = "https://files.pythonhosted.org/packages/fa/0a/11486d02add7b1384dff7374d124b1cfbb0ee864dcc9f6a2c0380638cf84/numpy-2.5.0-cp312-cp312-macosx_10_13_x86_64.whl", hash = "sha256:489780423903667933b4ed6197b6ec3b75ea5dd17d1d8f0f38d798feb6921561", size = 16789987, upload-time = "2026-06-21T20:56:16.657Z" },
+    { url = "https://files.pythonhosted.org/packages/55/b2/285f48640a181947b4587a3766d21ec1eaa7fea833d4b49957e09da467a2/numpy-2.5.0-cp312-cp312-macosx_11_0_arm64.whl", hash = "sha256:ece55976ced6bca95a03ae2839e2e5ccffe8eb6a3e7022415645eb154a81e4e6", size = 11760322, upload-time = "2026-06-21T20:56:19.813Z" },
+    { url = "https://files.pythonhosted.org/packages/dd/67/b032db1eb03ca30d16eda3b0c22aaa615338b9263c2fd559d0f29451aca4/numpy-2.5.0-cp312-cp312-macosx_14_0_arm64.whl", hash = "sha256:c83b664b0e6eee9594fa920cf0639d8af796606d3fad6cc70180c87e4b97c7be", size = 5319605, upload-time = "2026-06-21T20:56:22.173Z" },
+    { url = "https://files.pythonhosted.org/packages/b9/83/03fc7300c7c6b6c84c487b1dc80d322817b95fbd1f4dd57a85e23b7198de/numpy-2.5.0-cp312-cp312-macosx_14_0_x86_64.whl", hash = "sha256:bf80333980bf37f523341ddd72c783f39d6829ec7736b9eb99086388a2d52cc2", size = 6653628, upload-time = "2026-06-21T20:56:23.914Z" },
+    { url = "https://files.pythonhosted.org/packages/82/49/2ec21730bc63ccfda829323f7040a8ed4715b3852ce658689cf74ee96a8c/numpy-2.5.0-cp312-cp312-manylinux_2_27_aarch64.manylinux_2_28_aarch64.whl", hash = "sha256:a1a4874217b36d5ac8fc876f52e39df56f8182c88463e9e2dceabf7ca8b7efb8", size = 15153691, upload-time = "2026-06-21T20:56:25.631Z" },
+    { url = "https://files.pythonhosted.org/packages/bb/6b/f4a3d0637692c49da8ef99d72d52526f92e0a8d6ac4f0ca9f31441b9d9ea/numpy-2.5.0-cp312-cp312-manylinux_2_27_x86_64.manylinux_2_28_x86_64.whl", hash = "sha256:aaa760137137e8d3c920d27927748215b56014f92667dc9b6c27dfc61249255a", size = 16660066, upload-time = "2026-06-21T20:56:28.009Z" },
+    { url = "https://files.pythonhosted.org/packages/3a/2f/c354ec86d1f3f5c19649463b0d39652e160736e5b0a4cd18dff0576715c4/numpy-2.5.0-cp312-cp312-musllinux_1_2_aarch64.whl", hash = "sha256:7174ce8265fc7f7417d171c9ea8fe905220748893ea67a2a7abe726ec331c4b0", size = 16514638, upload-time = "2026-06-21T20:56:30.26Z" },
+    { url = "https://files.pythonhosted.org/packages/06/34/43efdcb319988648580f93c11f1ae82cf7e2faa74925e98e454ae3aa95f8/numpy-2.5.0-cp312-cp312-musllinux_1_2_x86_64.whl", hash = "sha256:b8c3daaf99de52415d20b42f8e8155c78642cb04207d02f9d317a0dcf1b3fb54", size = 18419647, upload-time = "2026-06-21T20:56:32.41Z" },
+    { url = "https://files.pythonhosted.org/packages/71/e2/f5d1676b1d7fb682eb5e9a1641e7ebd2414b3216c370661d1029778908b4/numpy-2.5.0-cp312-cp312-win32.whl", hash = "sha256:6206db0af545d73d068add6d992279145f158428d1da6cc49adc4b630c5d6ee5", size = 6056688, upload-time = "2026-06-21T20:56:34.657Z" },
+    { url = "https://files.pythonhosted.org/packages/8f/7c/48f115d1c58a34032facebcd51fdf2d02df2c51d4a46a81dd1197bb2ea6b/numpy-2.5.0-cp312-cp312-win_amd64.whl", hash = "sha256:6f2d6873e2940c860a309d21e25b1e69af6aaffdd80aa056b04c16380db1c4f2", size = 12419237, upload-time = "2026-06-21T20:56:36.24Z" },
+    { url = "https://files.pythonhosted.org/packages/86/26/2e0882f4044d1b1a1b63e875151fb2393389032022a8b7f5657a7996d3b2/numpy-2.5.0-cp312-cp312-win_arm64.whl", hash = "sha256:a55e1eb2bca2cfd17a16b213c99dfc8502d47b0d494224d2122277d0400935ca", size = 10339912, upload-time = "2026-06-21T20:56:38.733Z" },
+    { url = "https://files.pythonhosted.org/packages/8a/33/07675aaad7f26ea013d5e884d9a0d784b79c6bd7566c333f5a52fa3c610b/numpy-2.5.0-cp313-cp313-macosx_10_13_x86_64.whl", hash = "sha256:520e6b8be0a4b65840ac8090d4f51cef4bed66e2b0894d5a520f099adc24a9b2", size = 16784890, upload-time = "2026-06-21T20:56:40.799Z" },
+    { url = "https://files.pythonhosted.org/packages/85/4b/953118a730ee3b35e28645e0eb4cf9beec5bdbb954e1ac2f5fcefba6bbc3/numpy-2.5.0-cp313-cp313-macosx_11_0_arm64.whl", hash = "sha256:146b81cdd3967fdb6beca8ba25f00c58741d8f3cbd797f55af0fbe0bfec3469c", size = 11754584, upload-time = "2026-06-21T20:56:43.094Z" },
+    { url = "https://files.pythonhosted.org/packages/44/9b/56dd530c367c74ae17411027cea4135ca57e1e0583bf5594cee18bd83217/numpy-2.5.0-cp313-cp313-macosx_14_0_arm64.whl", hash = "sha256:126b88d95e8ff9b00c9e717aa540469f21d6180162f84c0caec51b16215d49cd", size = 5313904, upload-time = "2026-06-21T20:56:45.503Z" },
+    { url = "https://files.pythonhosted.org/packages/ce/b0/bcd672edad27ecca7da1f7bb0ce72cd1706a4f2d79ae94990afc97c13e1c/numpy-2.5.0-cp313-cp313-macosx_14_0_x86_64.whl", hash = "sha256:d4313cef1594c5ce46c31b6e54e918338f63f16ee9322304e8c9114d6d81c8bd", size = 6648504, upload-time = "2026-06-21T20:56:47.567Z" },
+    { url = "https://files.pythonhosted.org/packages/80/9e/15cdfcbd30a1544a46c9e487a00df331c4672450216538705a9e51fa6710/numpy-2.5.0-cp313-cp313-manylinux_2_27_aarch64.manylinux_2_28_aarch64.whl", hash = "sha256:750fb097caf26fa878746d9d119f6f9da12dedcbff1eea966c3e3447647c4a9e", size = 15150086, upload-time = "2026-06-21T20:56:49.352Z" },
+    { url = "https://files.pythonhosted.org/packages/32/4e/8d7656ccaab3e81e97258b8a9bc5f0c8502513a92fb4ceb0a2cbfebc17bf/numpy-2.5.0-cp313-cp313-manylinux_2_27_x86_64.manylinux_2_28_x86_64.whl", hash = "sha256:3893adc2dc7c0412ba76777db55a049215d99c9aa3113003be8f49f4f1290ab9", size = 16647250, upload-time = "2026-06-21T20:56:51.542Z" },
+    { url = "https://files.pythonhosted.org/packages/3c/81/97060281b602ed07f21b12f4ec409eac1f75a2f91fbc829ed8b2becf3ad4/numpy-2.5.0-cp313-cp313-musllinux_1_2_aarch64.whl", hash = "sha256:835e454dd99b238cdc5a3f63bce2371296f5ebc53ca1e0f8e6ddbb6d92a29aab", size = 16512864, upload-time = "2026-06-21T20:56:55.401Z" },
+    { url = "https://files.pythonhosted.org/packages/33/ab/4496208146911f8d8ddb54f68a972aafa6c8d44babcb2ea03b0e5cc87c9d/numpy-2.5.0-cp313-cp313-musllinux_1_2_x86_64.whl", hash = "sha256:6f9836778081a0a3c02a6a21493f3e9f5b311f8d2541934f31f05583dc999ea4", size = 18408407, upload-time = "2026-06-21T20:56:57.75Z" },
+    { url = "https://files.pythonhosted.org/packages/d4/9f/a4df67c181e4ee8b467aa3332dc2db10fd5c515136831302f3ca48bc0a01/numpy-2.5.0-cp313-cp313-win32.whl", hash = "sha256:0b525be4744b60bb0557ac872d53ef07d085b5f39622bc579c98d3809d05b988", size = 6054431, upload-time = "2026-06-21T20:57:00.016Z" },
+    { url = "https://files.pythonhosted.org/packages/30/53/491e1c47c55b62ccc6a63c1c5b8635c73fc2258dddeb9bda27cae4a0ae96/numpy-2.5.0-cp313-cp313-win_amd64.whl", hash = "sha256:44353e2878930039db472b99dc353d749826e4010bd4d2a7f835e94a97a5c748", size = 12414420, upload-time = "2026-06-21T20:57:01.815Z" },
+    { url = "https://files.pythonhosted.org/packages/eb/4a/25c2906f541e9d9f4c5769764db732e6627be91a13f4724fa10634d77db4/numpy-2.5.0-cp313-cp313-win_arm64.whl", hash = "sha256:48f54b00711f83a5f796b70c518e8c2b3c5848dda03a54911f23eb68519b9b60", size = 10339533, upload-time = "2026-06-21T20:57:03.961Z" },
+    { url = "https://files.pythonhosted.org/packages/86/ad/abc44aaceaf7b17ee1edde2bbb4458da591bc79574cffff50c4bb35f00d1/numpy-2.5.0-cp314-cp314-macosx_10_15_x86_64.whl", hash = "sha256:f27582c55ba4c750b7c58c8faf021d2cd9324a662b466229db8a417b41368af9", size = 16783807, upload-time = "2026-06-21T20:57:06.253Z" },
+    { url = "https://files.pythonhosted.org/packages/5d/39/b72e168daf9c00fb20c9fc996d00437ccecdef3102387775d29d7a62576d/numpy-2.5.0-cp314-cp314-macosx_11_0_arm64.whl", hash = "sha256:28e7137057d551e4a83c4ae414e3451f50568409db7569aacc7f9811ee06a446", size = 11765215, upload-time = "2026-06-21T20:57:08.547Z" },
+    { url = "https://files.pythonhosted.org/packages/f7/a0/8400a9c0e3625182347593f5e1f57da9a617a534794805c8df5518154ddc/numpy-2.5.0-cp314-cp314-macosx_14_0_arm64.whl", hash = "sha256:e1da54b53e75cd9fcfc23efcc7edab2c6aecf97b6037566d8a0fe804af8ec57c", size = 5324493, upload-time = "2026-06-21T20:57:11.012Z" },
+    { url = "https://files.pythonhosted.org/packages/f6/8c/0d104deaa0401c93395a629ec902891618a2eff76d19229139cb5a887bfc/numpy-2.5.0-cp314-cp314-macosx_14_0_x86_64.whl", hash = "sha256:694d8f74e156f7fd01179f1aa8faa2f648ab6ae0f70b6c3fe57a03249aea2303", size = 6645211, upload-time = "2026-06-21T20:57:12.919Z" },
+    { url = "https://files.pythonhosted.org/packages/6a/d9/4a4a628c812750363786afc3d33492709a5cd64b215469c16b0f6c7bb811/numpy-2.5.0-cp314-cp314-manylinux_2_27_aarch64.manylinux_2_28_aarch64.whl", hash = "sha256:1a7569a7b53c77716f036bb28cb1c91f166a26ec7d9502cd1e4bdfe502fdec22", size = 15166004, upload-time = "2026-06-21T20:57:14.717Z" },
+    { url = "https://files.pythonhosted.org/packages/a0/5e/2a902317d7fc4aa93236e80c932662dadfc459b323d758329e01775125e1/numpy-2.5.0-cp314-cp314-manylinux_2_27_x86_64.manylinux_2_28_x86_64.whl", hash = "sha256:39a0433bd4086ebd462960cf375e19195bb07b53dc1d87dd5fcf47ad78576f03", size = 16650797, upload-time = "2026-06-21T20:57:16.906Z" },
+    { url = "https://files.pythonhosted.org/packages/e9/a0/a0090e6329f4ca5992c07847bb579c5259a19953dc57255bb08793142ffb/numpy-2.5.0-cp314-cp314-musllinux_1_2_aarch64.whl", hash = "sha256:929f0c79ac38bcbd7154fe631dc907abfeddbcc5027a896bd1f7767323271e7a", size = 16524647, upload-time = "2026-06-21T20:57:19.165Z" },
+    { url = "https://files.pythonhosted.org/packages/5e/7d/6caf27734c42b65837e7461ed0dbbd6b6fc835060c9714ec59d673bb383a/numpy-2.5.0-cp314-cp314-musllinux_1_2_x86_64.whl", hash = "sha256:cc4f247a47bbf070bfd70be53ccdcf47b800af563535e7bbe172322197c30e21", size = 18411841, upload-time = "2026-06-21T20:57:21.638Z" },
+    { url = "https://files.pythonhosted.org/packages/13/dc/26edadbd812536769a82c2e9e002234e33feb5da43061d47a044f6d309b7/numpy-2.5.0-cp314-cp314-win32.whl", hash = "sha256:5dc71423499fab3f46f7a7201155ade1669ea101f2f429d332df9e72f8161731", size = 6106361, upload-time = "2026-06-21T20:57:23.844Z" },
+    { url = "https://files.pythonhosted.org/packages/f2/9e/4dd1459282229a72d92dece2ae9138e5cac94a72263a7ceb48f37434c925/numpy-2.5.0-cp314-cp314-win_amd64.whl", hash = "sha256:ebb81d9d5443e0309d6c54894c3fbed74ad7da0714352a67b6d773cd189eae73", size = 12551749, upload-time = "2026-06-21T20:57:25.945Z" },
+    { url = "https://files.pythonhosted.org/packages/05/a7/6bc6384c080b86c7f6c85c5bc5b540b24f4f679cd144791d99574e90d462/numpy-2.5.0-cp314-cp314-win_arm64.whl", hash = "sha256:3b94d0d0deceebfad3e67ae5c0e5eb87371e8f7a0581cd04a779928c2450cf1e", size = 10617072, upload-time = "2026-06-21T20:57:28.175Z" },
+    { url = "https://files.pythonhosted.org/packages/86/6b/4a2b71d66ada5608ae02b63f150dfad520f6940721cb7f029ad270befc0e/numpy-2.5.0-cp314-cp314t-macosx_11_0_arm64.whl", hash = "sha256:22f3d43e362d650bc39db1f17851302874a148ca95ba6981c1dfb5fa6862f35b", size = 11881067, upload-time = "2026-06-21T20:57:30.104Z" },
+    { url = "https://files.pythonhosted.org/packages/dc/b2/d365eb40a20efb49d67e9feb90494ed8511282ee1f5fa16006675c65397d/numpy-2.5.0-cp314-cp314t-macosx_14_0_arm64.whl", hash = "sha256:243563efb4cd7528a264567e9fd206c87826457322521d06206a00bfa316c927", size = 5440290, upload-time = "2026-06-21T20:57:32.193Z" },
+    { url = "https://files.pythonhosted.org/packages/fa/5e/e9c03188de5f9b767e46a8fe988bcfd3efad066a4a3fda8b9cb11a93f895/numpy-2.5.0-cp314-cp314t-macosx_14_0_x86_64.whl", hash = "sha256:84881d825ca75249b189bbee875fcfe3238aa5c479e6100893cda566e8e86826", size = 6748371, upload-time = "2026-06-21T20:57:33.933Z" },
+    { url = "https://files.pythonhosted.org/packages/fd/1d/68c186a38a5027bae2c4ddd5ea681fdaf8b4d30fb7301def6d8ad270390f/numpy-2.5.0-cp314-cp314t-manylinux_2_27_aarch64.manylinux_2_28_aarch64.whl", hash = "sha256:cda12aa4779d42b8771180aba759c96f527d43446d8f380ab59e2b35e8489efd", size = 15214643, upload-time = "2026-06-21T20:57:35.677Z" },
+    { url = "https://files.pythonhosted.org/packages/8c/67/73f67b7c7e20635baae9c4c3ead4ae7326a005900297a6110971abd62eb5/numpy-2.5.0-cp314-cp314t-manylinux_2_27_x86_64.manylinux_2_28_x86_64.whl", hash = "sha256:1c0121101093d2bd74981b10f8837d78e794a8ff57834eb27179f49e1ba11ac6", size = 16690128, upload-time = "2026-06-21T20:57:38.159Z" },
+    { url = "https://files.pythonhosted.org/packages/eb/05/d4c1fb0c46d02a27d6b2b8b319a78c90937acec8631c1641874670b31e6f/numpy-2.5.0-cp314-cp314t-musllinux_1_2_aarch64.whl", hash = "sha256:d371c92cfa09da00022f501ab67fafaea813d752eb30ac44336d45b1e5b0268a", size = 16577902, upload-time = "2026-06-21T20:57:40.447Z" },
+    { url = "https://files.pythonhosted.org/packages/9e/1d/771c797d50fa26e4888989cccf1d50ee51f530d4e455ad2692dcb64fa711/numpy-2.5.0-cp314-cp314t-musllinux_1_2_x86_64.whl", hash = "sha256:9990713e9c38154c6861e7547f1e3fc7a87e75ff09bab24ef1cc81d81c2835e9", size = 18452814, upload-time = "2026-06-21T20:57:42.875Z" },
+    { url = "https://files.pythonhosted.org/packages/e8/46/52fc0d2a68d7643f0f149eeea5a5d8ea2a3507056ac8afa83c9212606e8b/numpy-2.5.0-cp314-cp314t-win32.whl", hash = "sha256:edadfbd4794b1086c0d822f81863e8a68fc129d132fd0bb9e31e955d7fbbbdb7", size = 6253168, upload-time = "2026-06-21T20:57:45.101Z" },
+    { url = "https://files.pythonhosted.org/packages/2a/be/6c8d1118b5f13b2881dc095d5b345de19c6638b8959c17409b6eff84c8aa/numpy-2.5.0-cp314-cp314t-win_amd64.whl", hash = "sha256:f7e5fa4382967ae6548bd2f174219afb908e294b0d5f625af01166edd5f7d9aa", size = 12736286, upload-time = "2026-06-21T20:57:46.935Z" },
+    { url = "https://files.pythonhosted.org/packages/fd/6a/d3a169aaf8536cf228d56a09e04bcb713a2fe4410d4e2105b9419b5a9c89/numpy-2.5.0-cp314-cp314t-win_arm64.whl", hash = "sha256:016623417bb330d719d579daf2d6b9a01ddc52e41a9ed61a47f39fde46dcd865", size = 10686451, upload-time = "2026-06-21T20:57:49.313Z" },
+]
+
+[[package]]
+name = "nvidia-cublas"
+version = "13.1.1.3"
+source = { registry = "https://pypi.org/simple" }
+dependencies = [
+    { name = "nvidia-cuda-nvrtc", marker = "sys_platform == 'linux'" },
+]
+wheels = [
+    { url = "https://files.pythonhosted.org/packages/a7/a1/0bd24ee8c8d03adac032fd2909426a00c88f8c57961b1277ded97f91119f/nvidia_cublas-13.1.1.3-py3-none-manylinux_2_27_aarch64.whl", hash = "sha256:b7a210458267ac818974c53038fbec2e969d5c99f305ab15c72522fa9f001dd5", size = 542848918, upload-time = "2026-04-08T18:46:22.985Z" },
+    { url = "https://files.pythonhosted.org/packages/3b/cd/154ca20c38269e05eff77c1464e6c1da89f50a6390b565e9d82e06bc11e1/nvidia_cublas-13.1.1.3-py3-none-manylinux_2_27_x86_64.whl", hash = "sha256:37936a16db8fe4ac1f065c2139360608a543a09275cb1a1af612e08cfa065436", size = 423138758, upload-time = "2026-04-08T18:46:58.655Z" },
+]
+
+[[package]]
+name = "nvidia-cuda-cupti"
+version = "13.0.85"
+source = { registry = "https://pypi.org/simple" }
+wheels = [
+    { url = "https://files.pythonhosted.org/packages/2a/2a/80353b103fc20ce05ef51e928daed4b6015db4aaa9162ed0997090fe2250/nvidia_cuda_cupti-13.0.85-py3-none-manylinux_2_25_aarch64.whl", hash = "sha256:796bd679890ee55fb14a94629b698b6db54bcfd833d391d5e94017dd9d7d3151", size = 10310827, upload-time = "2025-09-04T08:26:42.012Z" },
+    { url = "https://files.pythonhosted.org/packages/33/6d/737d164b4837a9bbd202f5ae3078975f0525a55730fe871d8ed4e3b952b0/nvidia_cuda_cupti-13.0.85-py3-none-manylinux_2_25_x86_64.whl", hash = "sha256:4eb01c08e859bf924d222250d2e8f8b8ff6d3db4721288cf35d14252a4d933c8", size = 10715597, upload-time = "2025-09-04T08:26:51.312Z" },
+]
+
+[[package]]
+name = "nvidia-cuda-nvrtc"
+version = "13.0.88"
+source = { registry = "https://pypi.org/simple" }
+wheels = [
+    { url = "https://files.pythonhosted.org/packages/c3/68/483a78f5e8f31b08fb1bb671559968c0ca3a065ac7acabfc7cee55214fd6/nvidia_cuda_nvrtc-13.0.88-py3-none-manylinux2010_x86_64.manylinux_2_12_x86_64.whl", hash = "sha256:ad9b6d2ead2435f11cbb6868809d2adeeee302e9bb94bcf0539c7a40d80e8575", size = 90215200, upload-time = "2025-09-04T08:28:44.204Z" },
+    { url = "https://files.pythonhosted.org/packages/b7/dc/6bb80850e0b7edd6588d560758f17e0550893a1feaf436807d64d2da040f/nvidia_cuda_nvrtc-13.0.88-py3-none-manylinux2014_aarch64.manylinux_2_17_aarch64.whl", hash = "sha256:d27f20a0ca67a4bb34268a5e951033496c5b74870b868bacd046b1b8e0c3267b", size = 43015449, upload-time = "2025-09-04T08:28:20.239Z" },
+]
+
+[[package]]
+name = "nvidia-cuda-runtime"
+version = "13.0.96"
+source = { registry = "https://pypi.org/simple" }
+wheels = [
+    { url = "https://files.pythonhosted.org/packages/87/4f/17d7b9b8e285199c58ce28e31b5c5bbaa4d8271af06a89b6405258245de2/nvidia_cuda_runtime-13.0.96-py3-none-manylinux2014_aarch64.manylinux_2_17_aarch64.whl", hash = "sha256:ef9bcbe90493a2b9d810e43d249adb3d02e98dd30200d86607d8d02687c43f55", size = 2261060, upload-time = "2025-10-09T08:55:15.78Z" },
+    { url = "https://files.pythonhosted.org/packages/2e/24/d1558f3b68b1d26e706813b1d10aa1d785e4698c425af8db8edc3dced472/nvidia_cuda_runtime-13.0.96-py3-none-manylinux2014_x86_64.manylinux_2_17_x86_64.whl", hash = "sha256:7f82250d7782aa23b6cfe765ecc7db554bd3c2870c43f3d1821f1d18aebf0548", size = 2243632, upload-time = "2025-10-09T08:55:36.117Z" },
+]
+
+[[package]]
+name = "nvidia-cudnn-cu13"
+version = "9.20.0.48"
+source = { registry = "https://pypi.org/simple" }
+dependencies = [
+    { name = "nvidia-cublas", marker = "sys_platform == 'linux'" },
+]
+wheels = [
+    { url = "https://files.pythonhosted.org/packages/56/c5/83384d846b2fd17c44bd499b36c75a45ed4f095fbbb2252294e89cea5c5c/nvidia_cudnn_cu13-9.20.0.48-py3-none-manylinux_2_27_aarch64.whl", hash = "sha256:e31454ae00094b0c55319d9d15b6fa2fc50a9e1c0f5c8c80fb75258234e731e1", size = 444574296, upload-time = "2026-03-09T19:28:27.751Z" },
+    { url = "https://files.pythonhosted.org/packages/6e/5e/edb9c0ae051602c3ccaffe424256463636d639e27d7f302dde9975ef9e7a/nvidia_cudnn_cu13-9.20.0.48-py3-none-manylinux_2_27_x86_64.whl", hash = "sha256:0c45dd8eeb50b603f07995b1b300c62ffe6a1980482b82b3bcf94a4ca9d49304", size = 366173588, upload-time = "2026-03-09T19:29:34.474Z" },
+]
+
+[[package]]
+name = "nvidia-cufft"
+version = "12.0.0.61"
+source = { registry = "https://pypi.org/simple" }
+dependencies = [
+    { name = "nvidia-nvjitlink", marker = "sys_platform == 'linux'" },
+]
+wheels = [
+    { url = "https://files.pythonhosted.org/packages/8b/ae/f417a75c0259e85c1d2f83ca4e960289a5f814ed0cea74d18c353d3e989d/nvidia_cufft-12.0.0.61-py3-none-manylinux2014_aarch64.manylinux_2_17_aarch64.whl", hash = "sha256:2708c852ef8cd89d1d2068bdbece0aa188813a0c934db3779b9b1faa8442e5f5", size = 214053554, upload-time = "2025-09-04T08:31:38.196Z" },
+    { url = "https://files.pythonhosted.org/packages/a8/2f/7b57e29836ea8714f81e9898409196f47d772d5ddedddf1592eadb8ab743/nvidia_cufft-12.0.0.61-py3-none-manylinux2014_x86_64.manylinux_2_17_x86_64.whl", hash = "sha256:6c44f692dce8fd5ffd3e3df134b6cdb9c2f72d99cf40b62c32dde45eea9ddad3", size = 214085489, upload-time = "2025-09-04T08:31:56.044Z" },
+]
+
+[[package]]
+name = "nvidia-cufile"
+version = "1.15.1.6"
+source = { registry = "https://pypi.org/simple" }
+wheels = [
+    { url = "https://files.pythonhosted.org/packages/3f/70/4f193de89a48b71714e74602ee14d04e4019ad36a5a9f20c425776e72cd6/nvidia_cufile-1.15.1.6-py3-none-manylinux2014_x86_64.manylinux_2_17_x86_64.whl", hash = "sha256:08a3ecefae5a01c7f5117351c64f17c7c62efa5fffdbe24fc7d298da19cd0b44", size = 1223672, upload-time = "2025-09-04T08:32:22.779Z" },
+    { url = "https://files.pythonhosted.org/packages/ab/73/cc4a14c9813a8a0d509417cf5f4bdaba76e924d58beb9864f5a7baceefbf/nvidia_cufile-1.15.1.6-py3-none-manylinux_2_27_aarch64.whl", hash = "sha256:bdc0deedc61f548bddf7733bdc216456c2fdb101d020e1ab4b88d232d5e2f6d1", size = 1136992, upload-time = "2025-09-04T08:32:14.119Z" },
+]
+
+[[package]]
+name = "nvidia-curand"
+version = "10.4.0.35"
+source = { registry = "https://pypi.org/simple" }
+wheels = [
+    { url = "https://files.pythonhosted.org/packages/1e/72/7c2ae24fb6b63a32e6ae5d241cc65263ea18d08802aaae087d9f013335a2/nvidia_curand-10.4.0.35-py3-none-manylinux_2_27_aarch64.whl", hash = "sha256:133df5a7509c3e292aaa2b477afd0194f06ce4ea24d714d616ff36439cee349a", size = 61962106, upload-time = "2025-08-04T10:21:41.128Z" },
+    { url = "https://files.pythonhosted.org/packages/a5/9f/be0a41ca4a4917abf5cb9ae0daff1a6060cc5de950aec0396de9f3b52bc5/nvidia_curand-10.4.0.35-py3-none-manylinux_2_27_x86_64.whl", hash = "sha256:1aee33a5da6e1db083fe2b90082def8915f30f3248d5896bcec36a579d941bfc", size = 59544258, upload-time = "2025-08-04T10:22:03.992Z" },
+]
+
+[[package]]
+name = "nvidia-cusolver"
+version = "12.0.4.66"
+source = { registry = "https://pypi.org/simple" }
+dependencies = [
+    { name = "nvidia-cublas", marker = "sys_platform == 'linux'" },
+    { name = "nvidia-cusparse", marker = "sys_platform == 'linux'" },
+    { name = "nvidia-nvjitlink", marker = "sys_platform == 'linux'" },
+]
+wheels = [
+    { url = "https://files.pythonhosted.org/packages/c8/c3/b30c9e935fc01e3da443ec0116ed1b2a009bb867f5324d3f2d7e533e776b/nvidia_cusolver-12.0.4.66-py3-none-manylinux_2_27_aarch64.whl", hash = "sha256:02c2457eaa9e39de20f880f4bd8820e6a1cfb9f9a34f820eb12a155aa5bc92d2", size = 223467760, upload-time = "2025-09-04T08:33:04.222Z" },
+    { url = "https://files.pythonhosted.org/packages/5f/67/cba3777620cdacb99102da4042883709c41c709f4b6323c10781a9c3aa34/nvidia_cusolver-12.0.4.66-py3-none-manylinux_2_27_x86_64.whl", hash = "sha256:0a759da5dea5c0ea10fd307de75cdeb59e7ea4fcb8add0924859b944babf1112", size = 200941980, upload-time = "2025-09-04T08:33:22.767Z" },
+]
+
+[[package]]
+name = "nvidia-cusparse"
+version = "12.6.3.3"
+source = { registry = "https://pypi.org/simple" }
+dependencies = [
+    { name = "nvidia-nvjitlink", marker = "sys_platform == 'linux'" },
+]
+wheels = [
+    { url = "https://files.pythonhosted.org/packages/f8/94/5c26f33738ae35276672f12615a64bd008ed5be6d1ebcb23579285d960a9/nvidia_cusparse-12.6.3.3-py3-none-manylinux2014_aarch64.manylinux_2_17_aarch64.whl", hash = "sha256:80bcc4662f23f1054ee334a15c72b8940402975e0eab63178fc7e670aa59472c", size = 162155568, upload-time = "2025-09-04T08:33:42.864Z" },
+    { url = "https://files.pythonhosted.org/packages/fa/18/623c77619c31d62efd55302939756966f3ecc8d724a14dab2b75f1508850/nvidia_cusparse-12.6.3.3-py3-none-manylinux2014_x86_64.manylinux_2_17_x86_64.whl", hash = "sha256:2b3c89c88d01ee0e477cb7f82ef60a11a4bcd57b6b87c33f789350b59759360b", size = 145942937, upload-time = "2025-09-04T08:33:58.029Z" },
+]
+
+[[package]]
+name = "nvidia-cusparselt-cu13"
+version = "0.8.1"
+source = { registry = "https://pypi.org/simple" }
+wheels = [
+    { url = "https://files.pythonhosted.org/packages/46/e1/cdc1797eadf82d3a9a575a19b33fdc871a97edbec42c00b5b5e914f4aff4/nvidia_cusparselt_cu13-0.8.1-py3-none-manylinux2014_aarch64.whl", hash = "sha256:4dca476c50bf4780d46cd0bfbd82e2bc10a08e4fef7950917ce8d7578d22a23f", size = 221051344, upload-time = "2025-09-05T18:49:51.289Z" },
+    { url = "https://files.pythonhosted.org/packages/34/7d/2661f2fb3ac4302f3a246f5fc030213ac60c1fe0bce84f9783dbd831dbb7/nvidia_cusparselt_cu13-0.8.1-py3-none-manylinux2014_x86_64.whl", hash = "sha256:786ce87568c303fadb5afcc7102d454cd3040d75f6f8626f5db460d1871f4dd0", size = 170148586, upload-time = "2025-09-05T18:50:50.248Z" },
+]
+
+[[package]]
+name = "nvidia-nccl-cu13"
+version = "2.29.7"
+source = { registry = "https://pypi.org/simple" }
+wheels = [
+    { url = "https://files.pythonhosted.org/packages/72/0d/daf50d44177ee0cbc7ff0a0c91eb5ff676c82be42f9a970bc7597f440c3a/nvidia_nccl_cu13-2.29.7-py3-none-manylinux_2_18_aarch64.whl", hash = "sha256:674a12383e3c38a1bcccae7d4f3633b37852230b6047883cb2f4c2d1b36d9bf5", size = 206014712, upload-time = "2026-03-03T05:34:20.843Z" },
+    { url = "https://files.pythonhosted.org/packages/67/f4/58e4e91b6919367c7aafb8e36fce9aad1a3047e536bf7e2fd560927d3a4c/nvidia_nccl_cu13-2.29.7-py3-none-manylinux_2_18_x86_64.whl", hash = "sha256:edd81538446786ec3b73972543e53bb43bcaf0bfc8ef76cb679fcc390ffe136d", size = 205976000, upload-time = "2026-03-03T05:36:24.472Z" },
+]
+
+[[package]]
+name = "nvidia-nvjitlink"
+version = "13.0.88"
+source = { registry = "https://pypi.org/simple" }
+wheels = [
+    { url = "https://files.pythonhosted.org/packages/56/7a/123e033aaff487c77107195fa5a2b8686795ca537935a24efae476c41f05/nvidia_nvjitlink-13.0.88-py3-none-manylinux2010_x86_64.manylinux_2_12_x86_64.whl", hash = "sha256:13a74f429e23b921c1109976abefacc69835f2f433ebd323d3946e11d804e47b", size = 40713933, upload-time = "2025-09-04T08:35:43.553Z" },
+    { url = "https://files.pythonhosted.org/packages/ab/2c/93c5250e64df4f894f1cbb397c6fd71f79813f9fd79d7cd61de3f97b3c2d/nvidia_nvjitlink-13.0.88-py3-none-manylinux2014_aarch64.manylinux_2_17_aarch64.whl", hash = "sha256:e931536ccc7d467a98ba1d8b89ff7fa7f1fa3b13f2b0069118cd7f47bff07d0c", size = 38768748, upload-time = "2025-09-04T08:35:20.008Z" },
+]
+
+[[package]]
+name = "nvidia-nvshmem-cu13"
+version = "3.4.5"
+source = { registry = "https://pypi.org/simple" }
+wheels = [
+    { url = "https://files.pythonhosted.org/packages/dc/0f/05cc9c720236dcd2db9c1ab97fff629e96821be2e63103569da0c9b72f19/nvidia_nvshmem_cu13-3.4.5-py3-none-manylinux2014_aarch64.manylinux_2_17_aarch64.whl", hash = "sha256:6dc2a197f38e5d0376ad52cd1a2a3617d3cdc150fd5966f4aee9bcebb1d68fe9", size = 60215947, upload-time = "2025-09-06T00:32:20.022Z" },
+    { url = "https://files.pythonhosted.org/packages/3c/35/a9bf80a609e74e3b000fef598933235c908fcefcef9026042b8e6dfde2a9/nvidia_nvshmem_cu13-3.4.5-py3-none-manylinux2014_x86_64.manylinux_2_17_x86_64.whl", hash = "sha256:290f0a2ee94c9f3687a02502f3b9299a9f9fe826e6d0287ee18482e78d495b80", size = 60412546, upload-time = "2025-09-06T00:32:41.564Z" },
+]
+
+[[package]]
+name = "nvidia-nvtx"
+version = "13.0.85"
+source = { registry = "https://pypi.org/simple" }
+wheels = [
+    { url = "https://files.pythonhosted.org/packages/c2/f3/d86c845465a2723ad7e1e5c36dcd75ddb82898b3f53be47ebd429fb2fa5d/nvidia_nvtx-13.0.85-py3-none-manylinux1_x86_64.manylinux_2_5_x86_64.whl", hash = "sha256:4936d1d6780fbe68db454f5e72a42ff64d1fd6397df9f363ae786930fd5c1cd4", size = 148047, upload-time = "2025-09-04T08:29:01.761Z" },
+    { url = "https://files.pythonhosted.org/packages/a8/64/3708a90d1ebe202ffdeb7185f878a3c84d15c2b2c31858da2ce0583e2def/nvidia_nvtx-13.0.85-py3-none-manylinux2014_aarch64.manylinux_2_17_aarch64.whl", hash = "sha256:cb7780edb6b14107373c835bf8b72e7a178bac7367e23da7acb108f973f157a6", size = 148878, upload-time = "2025-09-04T08:28:53.627Z" },
+]
+
+[[package]]
+name = "packaging"
+version = "26.2"
+source = { registry = "https://pypi.org/simple" }
+sdist = { url = "https://files.pythonhosted.org/packages/d7/f1/e7a6dd94a8d4a5626c03e4e99c87f241ba9e350cd9e6d75123f992427270/packaging-26.2.tar.gz", hash = "sha256:ff452ff5a3e828ce110190feff1178bb1f2ea2281fa2075aadb987c2fb221661", size = 228134, upload-time = "2026-04-24T20:15:23.917Z" }
+wheels = [
+    { url = "https://files.pythonhosted.org/packages/df/b2/87e62e8c3e2f4b32e5fe99e0b86d576da1312593b39f47d8ceef365e95ed/packaging-26.2-py3-none-any.whl", hash = "sha256:5fc45236b9446107ff2415ce77c807cee2862cb6fac22b8a73826d0693b0980e", size = 100195, upload-time = "2026-04-24T20:15:22.081Z" },
+]
+
+[[package]]
+name = "pandas"
+version = "2.2.2"
+source = { registry = "https://pypi.org/simple" }
+dependencies = [
+    { name = "numpy", version = "2.4.6", source = { registry = "https://pypi.org/simple" }, marker = "python_full_version < '3.12'" },
+    { name = "numpy", version = "2.5.0", source = { registry = "https://pypi.org/simple" }, marker = "python_full_version >= '3.12'" },
+    { name = "python-dateutil" },
+    { name = "pytz" },
+    { name = "tzdata" },
+]
+sdist = { url = "https://files.pythonhosted.org/packages/88/d9/ecf715f34c73ccb1d8ceb82fc01cd1028a65a5f6dbc57bfa6ea155119058/pandas-2.2.2.tar.gz", hash = "sha256:9e79019aba43cb4fda9e4d983f8e88ca0373adbb697ae9c6c43093218de28b54", size = 4398391, upload-time = "2024-04-10T19:45:48.342Z" }
+wheels = [
+    { url = "https://files.pythonhosted.org/packages/1b/70/61704497903d43043e288017cb2b82155c0d41e15f5c17807920877b45c2/pandas-2.2.2-cp311-cp311-macosx_10_9_x86_64.whl", hash = "sha256:696039430f7a562b74fa45f540aca068ea85fa34c244d0deee539cb6d70aa288", size = 12574808, upload-time = "2024-04-10T19:44:35.516Z" },
+    { url = "https://files.pythonhosted.org/packages/16/c6/75231fd47afd6b3f89011e7077f1a3958441264aca7ae9ff596e3276a5d0/pandas-2.2.2-cp311-cp311-macosx_11_0_arm64.whl", hash = "sha256:8e90497254aacacbc4ea6ae5e7a8cd75629d6ad2b30025a4a8b09aa4faf55151", size = 11304876, upload-time = "2024-04-10T19:44:39.37Z" },
+    { url = "https://files.pythonhosted.org/packages/97/2d/7b54f80b93379ff94afb3bd9b0cd1d17b48183a0d6f98045bc01ce1e06a7/pandas-2.2.2-cp311-cp311-manylinux_2_17_aarch64.manylinux2014_aarch64.whl", hash = "sha256:58b84b91b0b9f4bafac2a0ac55002280c094dfc6402402332c0913a59654ab2b", size = 15602548, upload-time = "2024-04-10T19:44:42.902Z" },
+    { url = "https://files.pythonhosted.org/packages/fc/a5/4d82be566f069d7a9a702dcdf6f9106df0e0b042e738043c0cc7ddd7e3f6/pandas-2.2.2-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl", hash = "sha256:6d2123dc9ad6a814bcdea0f099885276b31b24f7edf40f6cdbc0912672e22eee", size = 13031332, upload-time = "2024-04-10T19:44:46.98Z" },
+    { url = "https://files.pythonhosted.org/packages/92/a2/b79c48f530673567805e607712b29814b47dcaf0d167e87145eb4b0118c6/pandas-2.2.2-cp311-cp311-musllinux_1_1_aarch64.whl", hash = "sha256:2925720037f06e89af896c70bca73459d7e6a4be96f9de79e2d440bd499fe0db", size = 16286054, upload-time = "2024-04-10T19:44:50.51Z" },
+    { url = "https://files.pythonhosted.org/packages/40/c7/47e94907f1d8fdb4868d61bd6c93d57b3784a964d52691b77ebfdb062842/pandas-2.2.2-cp311-cp311-musllinux_1_1_x86_64.whl", hash = "sha256:0cace394b6ea70c01ca1595f839cf193df35d1575986e484ad35c4aeae7266c1", size = 13879507, upload-time = "2024-04-10T19:44:54.412Z" },
+    { url = "https://files.pythonhosted.org/packages/ab/63/966db1321a0ad55df1d1fe51505d2cdae191b84c907974873817b0a6e849/pandas-2.2.2-cp311-cp311-win_amd64.whl", hash = "sha256:873d13d177501a28b2756375d59816c365e42ed8417b41665f346289adc68d24", size = 11634249, upload-time = "2024-04-10T19:44:58.183Z" },
+    { url = "https://files.pythonhosted.org/packages/dd/49/de869130028fb8d90e25da3b7d8fb13e40f5afa4c4af1781583eb1ff3839/pandas-2.2.2-cp312-cp312-macosx_10_9_x86_64.whl", hash = "sha256:9dfde2a0ddef507a631dc9dc4af6a9489d5e2e740e226ad426a05cabfbd7c8ef", size = 12500886, upload-time = "2024-04-10T19:45:01.808Z" },
+    { url = "https://files.pythonhosted.org/packages/db/7c/9a60add21b96140e22465d9adf09832feade45235cd22f4cb1668a25e443/pandas-2.2.2-cp312-cp312-macosx_11_0_arm64.whl", hash = "sha256:e9b79011ff7a0f4b1d6da6a61aa1aa604fb312d6647de5bad20013682d1429ce", size = 11340320, upload-time = "2024-04-11T18:36:14.398Z" },
+    { url = "https://files.pythonhosted.org/packages/b0/85/f95b5f322e1ae13b7ed7e97bd999160fa003424711ab4dc8344b8772c270/pandas-2.2.2-cp312-cp312-manylinux_2_17_aarch64.manylinux2014_aarch64.whl", hash = "sha256:1cb51fe389360f3b5a4d57dbd2848a5f033350336ca3b340d1c53a1fad33bcad", size = 15204346, upload-time = "2024-04-10T19:45:05.903Z" },
+    { url = "https://files.pythonhosted.org/packages/40/10/79e52ef01dfeb1c1ca47a109a01a248754ebe990e159a844ece12914de83/pandas-2.2.2-cp312-cp312-manylinux_2_17_x86_64.manylinux2014_x86_64.whl", hash = "sha256:eee3a87076c0756de40b05c5e9a6069c035ba43e8dd71c379e68cab2c20f16ad", size = 12733396, upload-time = "2024-04-10T19:45:09.282Z" },
+    { url = "https://files.pythonhosted.org/packages/35/9d/208febf8c4eb5c1d9ea3314d52d8bd415fd0ef0dd66bb24cc5bdbc8fa71a/pandas-2.2.2-cp312-cp312-musllinux_1_1_aarch64.whl", hash = "sha256:3e374f59e440d4ab45ca2fffde54b81ac3834cf5ae2cdfa69c90bc03bde04d76", size = 15858913, upload-time = "2024-04-10T19:45:12.514Z" },
+    { url = "https://files.pythonhosted.org/packages/99/d1/2d9bd05def7a9e08a92ec929b5a4c8d5556ec76fae22b0fa486cbf33ea63/pandas-2.2.2-cp312-cp312-musllinux_1_1_x86_64.whl", hash = "sha256:43498c0bdb43d55cb162cdc8c06fac328ccb5d2eabe3cadeb3529ae6f0517c32", size = 13417786, upload-time = "2024-04-10T19:45:16.275Z" },
+    { url = "https://files.pythonhosted.org/packages/22/a5/a0b255295406ed54269814bc93723cfd1a0da63fb9aaf99e1364f07923e5/pandas-2.2.2-cp312-cp312-win_amd64.whl", hash = "sha256:d187d355ecec3629624fccb01d104da7d7f391db0311145817525281e2804d23", size = 11498828, upload-time = "2024-04-10T19:45:19.85Z" },
+]
+
+[[package]]
+name = "parso"
+version = "0.8.7"
+source = { registry = "https://pypi.org/simple" }
+sdist = { url = "https://files.pythonhosted.org/packages/30/4b/90c937815137d43ce71ba043cd3566221e9df6b9c805f24b5d138c9d40a7/parso-0.8.7.tar.gz", hash = "sha256:eaaac4c9fdd5e9e8852dc778d2d7405897ec510f2a298071453e5e3a07914bb1", size = 401824, upload-time = "2026-05-01T23:13:02.138Z" }
+wheels = [
+    { url = "https://files.pythonhosted.org/packages/99/5d/8268b644392ee874ee82a635cd0df1773de230bde356c38de28e298392cc/parso-0.8.7-py2.py3-none-any.whl", hash = "sha256:a8926eb2a1b915486941fdbd31e86a4baf88fe8c210f25f2f35ecec5b574ca1c", size = 107025, upload-time = "2026-05-01T23:12:58.867Z" },
+]
+
+[[package]]
+name = "pathspec"
+version = "1.1.1"
+source = { registry = "https://pypi.org/simple" }
+sdist = { url = "https://files.pythonhosted.org/packages/5a/82/42f767fc1c1143d6fd36efb827202a2d997a375e160a71eb2888a925aac1/pathspec-1.1.1.tar.gz", hash = "sha256:17db5ecd524104a120e173814c90367a96a98d07c45b2e10c2f3919fff91bf5a", size = 135180, upload-time = "2026-04-27T01:46:08.907Z" }
+wheels = [
+    { url = "https://files.pythonhosted.org/packages/f1/d9/7fb5aa316bc299258e68c73ba3bddbc499654a07f151cba08f6153988714/pathspec-1.1.1-py3-none-any.whl", hash = "sha256:a00ce642f577bf7f473932318056212bc4f8bfdf53128c78bbd5af0b9b20b189", size = 57328, upload-time = "2026-04-27T01:46:07.06Z" },
+]
+
+[[package]]
+name = "pillow"
+version = "12.2.0"
+source = { registry = "https://pypi.org/simple" }
+sdist = { url = "https://files.pythonhosted.org/packages/8c/21/c2bcdd5906101a30244eaffc1b6e6ce71a31bd0742a01eb89e660ebfac2d/pillow-12.2.0.tar.gz", hash = "sha256:a830b1a40919539d07806aa58e1b114df53ddd43213d9c8b75847eee6c0182b5", size = 46987819, upload-time = "2026-04-01T14:46:17.687Z" }
+wheels = [
+    { url = "https://files.pythonhosted.org/packages/68/e1/748f5663efe6edcfc4e74b2b93edfb9b8b99b67f21a854c3ae416500a2d9/pillow-12.2.0-cp311-cp311-macosx_10_10_x86_64.whl", hash = "sha256:8be29e59487a79f173507c30ddf57e733a357f67881430449bb32614075a40ab", size = 5354347, upload-time = "2026-04-01T14:42:44.255Z" },
+    { url = "https://files.pythonhosted.org/packages/47/a1/d5ff69e747374c33a3b53b9f98cca7889fce1fd03d79cdc4e1bccc6c5a87/pillow-12.2.0-cp311-cp311-macosx_11_0_arm64.whl", hash = "sha256:71cde9a1e1551df7d34a25462fc60325e8a11a82cc2e2f54578e5e9a1e153d65", size = 4695873, upload-time = "2026-04-01T14:42:46.452Z" },
+    { url = "https://files.pythonhosted.org/packages/df/21/e3fbdf54408a973c7f7f89a23b2cb97a7ef30c61ab4142af31eee6aebc88/pillow-12.2.0-cp311-cp311-manylinux2014_aarch64.manylinux_2_17_aarch64.whl", hash = "sha256:f490f9368b6fc026f021db16d7ec2fbf7d89e2edb42e8ec09d2c60505f5729c7", size = 6280168, upload-time = "2026-04-01T14:42:49.228Z" },
+    { url = "https://files.pythonhosted.org/packages/d3/f1/00b7278c7dd52b17ad4329153748f87b6756ec195ff786c2bdf12518337d/pillow-12.2.0-cp311-cp311-manylinux2014_x86_64.manylinux_2_17_x86_64.whl", hash = "sha256:8bd7903a5f2a4545f6fd5935c90058b89d30045568985a71c79f5fd6edf9b91e", size = 8088188, upload-time = "2026-04-01T14:42:51.735Z" },
+    { url = "https://files.pythonhosted.org/packages/ad/cf/220a5994ef1b10e70e85748b75649d77d506499352be135a4989c957b701/pillow-12.2.0-cp311-cp311-manylinux_2_27_aarch64.manylinux_2_28_aarch64.whl", hash = "sha256:3997232e10d2920a68d25191392e3a4487d8183039e1c74c2297f00ed1c50705", size = 6394401, upload-time = "2026-04-01T14:42:54.343Z" },
+    { url = "https://files.pythonhosted.org/packages/e9/bd/e51a61b1054f09437acfbc2ff9106c30d1eb76bc1453d428399946781253/pillow-12.2.0-cp311-cp311-manylinux_2_27_x86_64.manylinux_2_28_x86_64.whl", hash = "sha256:e74473c875d78b8e9d5da2a70f7099549f9eb37ded4e2f6a463e60125bccd176", size = 7079655, upload-time = "2026-04-01T14:42:56.954Z" },
+    { url = "https://files.pythonhosted.org/packages/6b/3d/45132c57d5fb4b5744567c3817026480ac7fc3ce5d4c47902bc0e7f6f853/pillow-12.2.0-cp311-cp311-musllinux_1_2_aarch64.whl", hash = "sha256:56a3f9c60a13133a98ecff6197af34d7824de9b7b38c3654861a725c970c197b", size = 6503105, upload-time = "2026-04-01T14:42:59.847Z" },
+    { url = "https://files.pythonhosted.org/packages/7d/2e/9df2fc1e82097b1df3dce58dc43286aa01068e918c07574711fcc53e6fb4/pillow-12.2.0-cp311-cp311-musllinux_1_2_x86_64.whl", hash = "sha256:90e6f81de50ad6b534cab6e5aef77ff6e37722b2f5d908686f4a5c9eba17a909", size = 7203402, upload-time = "2026-04-01T14:43:02.664Z" },
+    { url = "https://files.pythonhosted.org/packages/bd/2e/2941e42858ebb67e50ae741473de81c2984e6eff7b397017623c676e2e8d/pillow-12.2.0-cp311-cp311-win32.whl", hash = "sha256:8c984051042858021a54926eb597d6ee3012393ce9c181814115df4c60b9a808", size = 6378149, upload-time = "2026-04-01T14:43:05.274Z" },
+    { url = "https://files.pythonhosted.org/packages/69/42/836b6f3cd7f3e5fa10a1f1a5420447c17966044c8fbf589cc0452d5502db/pillow-12.2.0-cp311-cp311-win_amd64.whl", hash = "sha256:6e6b2a0c538fc200b38ff9eb6628228b77908c319a005815f2dde585a0664b60", size = 7082626, upload-time = "2026-04-01T14:43:08.557Z" },
+    { url = "https://files.pythonhosted.org/packages/c2/88/549194b5d6f1f494b485e493edc6693c0a16f4ada488e5bd974ed1f42fad/pillow-12.2.0-cp311-cp311-win_arm64.whl", hash = "sha256:9a8a34cc89c67a65ea7437ce257cea81a9dad65b29805f3ecee8c8fe8ff25ffe", size = 2463531, upload-time = "2026-04-01T14:43:10.743Z" },
+    { url = "https://files.pythonhosted.org/packages/58/be/7482c8a5ebebbc6470b3eb791812fff7d5e0216c2be3827b30b8bb6603ed/pillow-12.2.0-cp312-cp312-macosx_10_13_x86_64.whl", hash = "sha256:2d192a155bbcec180f8564f693e6fd9bccff5a7af9b32e2e4bf8c9c69dbad6b5", size = 5308279, upload-time = "2026-04-01T14:43:13.246Z" },
+    { url = "https://files.pythonhosted.org/packages/d8/95/0a351b9289c2b5cbde0bacd4a83ebc44023e835490a727b2a3bd60ddc0f4/pillow-12.2.0-cp312-cp312-macosx_11_0_arm64.whl", hash = "sha256:f3f40b3c5a968281fd507d519e444c35f0ff171237f4fdde090dd60699458421", size = 4695490, upload-time = "2026-04-01T14:43:15.584Z" },
+    { url = "https://files.pythonhosted.org/packages/de/af/4e8e6869cbed569d43c416fad3dc4ecb944cb5d9492defaed89ddd6fe871/pillow-12.2.0-cp312-cp312-manylinux2014_aarch64.manylinux_2_17_aarch64.whl", hash = "sha256:03e7e372d5240cc23e9f07deca4d775c0817bffc641b01e9c3af208dbd300987", size = 6284462, upload-time = "2026-04-01T14:43:18.268Z" },
+    { url = "https://files.pythonhosted.org/packages/e9/9e/c05e19657fd57841e476be1ab46c4d501bffbadbafdc31a6d665f8b737b6/pillow-12.2.0-cp312-cp312-manylinux2014_x86_64.manylinux_2_17_x86_64.whl", hash = "sha256:b86024e52a1b269467a802258c25521e6d742349d760728092e1bc2d135b4d76", size = 8094744, upload-time = "2026-04-01T14:43:20.716Z" },
+    { url = "https://files.pythonhosted.org/packages/2b/54/1789c455ed10176066b6e7e6da1b01e50e36f94ba584dc68d9eebfe9156d/pillow-12.2.0-cp312-cp312-manylinux_2_27_aarch64.manylinux_2_28_aarch64.whl", hash = "sha256:7371b48c4fa448d20d2714c9a1f775a81155050d383333e0a6c15b1123dda005", size = 6398371, upload-time = "2026-04-01T14:43:23.443Z" },
+    { url = "https://files.pythonhosted.org/packages/43/e3/fdc657359e919462369869f1c9f0e973f353f9a9ee295a39b1fea8ee1a77/pillow-12.2.0-cp312-cp312-manylinux_2_27_x86_64.manylinux_2_28_x86_64.whl", hash = "sha256:62f5409336adb0663b7caa0da5c7d9e7bdbaae9ce761d34669420c2a801b2780", size = 7087215, upload-time = "2026-04-01T14:43:26.758Z" },
+    { url = "https://files.pythonhosted.org/packages/8b/f8/2f6825e441d5b1959d2ca5adec984210f1ec086435b0ed5f52c19b3b8a6e/pillow-12.2.0-cp312-cp312-musllinux_1_2_aarch64.whl", hash = "sha256:01afa7cf67f74f09523699b4e88c73fb55c13346d212a59a2db1f86b0a63e8c5", size = 6509783, upload-time = "2026-04-01T14:43:29.56Z" },
+    { url = "https://files.pythonhosted.org/packages/67/f9/029a27095ad20f854f9dba026b3ea6428548316e057e6fc3545409e86651/pillow-12.2.0-cp312-cp312-musllinux_1_2_x86_64.whl", hash = "sha256:fc3d34d4a8fbec3e88a79b92e5465e0f9b842b628675850d860b8bd300b159f5", size = 7212112, upload-time = "2026-04-01T14:43:32.091Z" },
+    { url = "https://files.pythonhosted.org/packages/be/42/025cfe05d1be22dbfdb4f264fe9de1ccda83f66e4fc3aac94748e784af04/pillow-12.2.0-cp312-cp312-win32.whl", hash = "sha256:58f62cc0f00fd29e64b29f4fd923ffdb3859c9f9e6105bfc37ba1d08994e8940", size = 6378489, upload-time = "2026-04-01T14:43:34.601Z" },
+    { url = "https://files.pythonhosted.org/packages/5d/7b/25a221d2c761c6a8ae21bfa3874988ff2583e19cf8a27bf2fee358df7942/pillow-12.2.0-cp312-cp312-win_amd64.whl", hash = "sha256:7f84204dee22a783350679a0333981df803dac21a0190d706a50475e361c93f5", size = 7084129, upload-time = "2026-04-01T14:43:37.213Z" },
+    { url = "https://files.pythonhosted.org/packages/10/e1/542a474affab20fd4a0f1836cb234e8493519da6b76899e30bcc5d990b8b/pillow-12.2.0-cp312-cp312-win_arm64.whl", hash = "sha256:af73337013e0b3b46f175e79492d96845b16126ddf79c438d7ea7ff27783a414", size = 2463612, upload-time = "2026-04-01T14:43:39.421Z" },
+    { url = "https://files.pythonhosted.org/packages/4a/01/53d10cf0dbad820a8db274d259a37ba50b88b24768ddccec07355382d5ad/pillow-12.2.0-cp313-cp313-ios_13_0_arm64_iphoneos.whl", hash = "sha256:8297651f5b5679c19968abefd6bb84d95fe30ef712eb1b2d9b2d31ca61267f4c", size = 4100837, upload-time = "2026-04-01T14:43:41.506Z" },
+    { url = "https://files.pythonhosted.org/packages/0f/98/f3a6657ecb698c937f6c76ee564882945f29b79bad496abcba0e84659ec5/pillow-12.2.0-cp313-cp313-ios_13_0_arm64_iphonesimulator.whl", hash = "sha256:50d8520da2a6ce0af445fa6d648c4273c3eeefbc32d7ce049f22e8b5c3daecc2", size = 4176528, upload-time = "2026-04-01T14:43:43.773Z" },
+    { url = "https://files.pythonhosted.org/packages/69/bc/8986948f05e3ea490b8442ea1c1d4d990b24a7e43d8a51b2c7d8b1dced36/pillow-12.2.0-cp313-cp313-ios_13_0_x86_64_iphonesimulator.whl", hash = "sha256:766cef22385fa1091258ad7e6216792b156dc16d8d3fa607e7545b2b72061f1c", size = 3640401, upload-time = "2026-04-01T14:43:45.87Z" },
+    { url = "https://files.pythonhosted.org/packages/34/46/6c717baadcd62bc8ed51d238d521ab651eaa74838291bda1f86fe1f864c9/pillow-12.2.0-cp313-cp313-macosx_10_13_x86_64.whl", hash = "sha256:5d2fd0fa6b5d9d1de415060363433f28da8b1526c1c129020435e186794b3795", size = 5308094, upload-time = "2026-04-01T14:43:48.438Z" },
+    { url = "https://files.pythonhosted.org/packages/71/43/905a14a8b17fdb1ccb58d282454490662d2cb89a6bfec26af6d3520da5ec/pillow-12.2.0-cp313-cp313-macosx_11_0_arm64.whl", hash = "sha256:56b25336f502b6ed02e889f4ece894a72612fe885889a6e8c4c80239ff6e5f5f", size = 4695402, upload-time = "2026-04-01T14:43:51.292Z" },
+    { url = "https://files.pythonhosted.org/packages/73/dd/42107efcb777b16fa0393317eac58f5b5cf30e8392e266e76e51cff28c3d/pillow-12.2.0-cp313-cp313-manylinux2014_aarch64.manylinux_2_17_aarch64.whl", hash = "sha256:f1c943e96e85df3d3478f7b691f229887e143f81fedab9b20205349ab04d73ed", size = 6280005, upload-time = "2026-04-01T14:43:54.242Z" },
+    { url = "https://files.pythonhosted.org/packages/a8/68/b93e09e5e8549019e61acf49f65b1a8530765a7f812c77a7461bca7e4494/pillow-12.2.0-cp313-cp313-manylinux2014_x86_64.manylinux_2_17_x86_64.whl", hash = "sha256:03f6fab9219220f041c74aeaa2939ff0062bd5c364ba9ce037197f4c6d498cd9", size = 8090669, upload-time = "2026-04-01T14:43:57.335Z" },
+    { url = "https://files.pythonhosted.org/packages/4b/6e/3ccb54ce8ec4ddd1accd2d89004308b7b0b21c4ac3d20fa70af4760a4330/pillow-12.2.0-cp313-cp313-manylinux_2_27_aarch64.manylinux_2_28_aarch64.whl", hash = "sha256:5cdfebd752ec52bf5bb4e35d9c64b40826bc5b40a13df7c3cda20a2c03a0f5ed", size = 6395194, upload-time = "2026-04-01T14:43:59.864Z" },
+    { url = "https://files.pythonhosted.org/packages/67/ee/21d4e8536afd1a328f01b359b4d3997b291ffd35a237c877b331c1c3b71c/pillow-12.2.0-cp313-cp313-manylinux_2_27_x86_64.manylinux_2_28_x86_64.whl", hash = "sha256:eedf4b74eda2b5a4b2b2fb4c006d6295df3bf29e459e198c90ea48e130dc75c3", size = 7082423, upload-time = "2026-04-01T14:44:02.74Z" },
+    { url = "https://files.pythonhosted.org/packages/78/5f/e9f86ab0146464e8c133fe85df987ed9e77e08b29d8d35f9f9f4d6f917ba/pillow-12.2.0-cp313-cp313-musllinux_1_2_aarch64.whl", hash = "sha256:00a2865911330191c0b818c59103b58a5e697cae67042366970a6b6f1b20b7f9", size = 6505667, upload-time = "2026-04-01T14:44:05.381Z" },
+    { url = "https://files.pythonhosted.org/packages/ed/1e/409007f56a2fdce61584fd3acbc2bbc259857d555196cedcadc68c015c82/pillow-12.2.0-cp313-cp313-musllinux_1_2_x86_64.whl", hash = "sha256:1e1757442ed87f4912397c6d35a0db6a7b52592156014706f17658ff58bbf795", size = 7208580, upload-time = "2026-04-01T14:44:08.39Z" },
+    { url = "https://files.pythonhosted.org/packages/23/c4/7349421080b12fb35414607b8871e9534546c128a11965fd4a7002ccfbee/pillow-12.2.0-cp313-cp313-win32.whl", hash = "sha256:144748b3af2d1b358d41286056d0003f47cb339b8c43a9ea42f5fea4d8c66b6e", size = 6375896, upload-time = "2026-04-01T14:44:11.197Z" },
+    { url = "https://files.pythonhosted.org/packages/3f/82/8a3739a5e470b3c6cbb1d21d315800d8e16bff503d1f16b03a4ec3212786/pillow-12.2.0-cp313-cp313-win_amd64.whl", hash = "sha256:390ede346628ccc626e5730107cde16c42d3836b89662a115a921f28440e6a3b", size = 7081266, upload-time = "2026-04-01T14:44:13.947Z" },
+    { url = "https://files.pythonhosted.org/packages/c3/25/f968f618a062574294592f668218f8af564830ccebdd1fa6200f598e65c5/pillow-12.2.0-cp313-cp313-win_arm64.whl", hash = "sha256:8023abc91fba39036dbce14a7d6535632f99c0b857807cbbbf21ecc9f4717f06", size = 2463508, upload-time = "2026-04-01T14:44:16.312Z" },
+    { url = "https://files.pythonhosted.org/packages/4d/a4/b342930964e3cb4dce5038ae34b0eab4653334995336cd486c5a8c25a00c/pillow-12.2.0-cp313-cp313t-macosx_10_13_x86_64.whl", hash = "sha256:042db20a421b9bafecc4b84a8b6e444686bd9d836c7fd24542db3e7df7baad9b", size = 5309927, upload-time = "2026-04-01T14:44:18.89Z" },
+    { url = "https://files.pythonhosted.org/packages/9f/de/23198e0a65a9cf06123f5435a5d95cea62a635697f8f03d134d3f3a96151/pillow-12.2.0-cp313-cp313t-macosx_11_0_arm64.whl", hash = "sha256:dd025009355c926a84a612fecf58bb315a3f6814b17ead51a8e48d3823d9087f", size = 4698624, upload-time = "2026-04-01T14:44:21.115Z" },
+    { url = "https://files.pythonhosted.org/packages/01/a6/1265e977f17d93ea37aa28aa81bad4fa597933879fac2520d24e021c8da3/pillow-12.2.0-cp313-cp313t-manylinux2014_aarch64.manylinux_2_17_aarch64.whl", hash = "sha256:88ddbc66737e277852913bd1e07c150cc7bb124539f94c4e2df5344494e0a612", size = 6321252, upload-time = "2026-04-01T14:44:23.663Z" },
+    { url = "https://files.pythonhosted.org/packages/3c/83/5982eb4a285967baa70340320be9f88e57665a387e3a53a7f0db8231a0cd/pillow-12.2.0-cp313-cp313t-manylinux2014_x86_64.manylinux_2_17_x86_64.whl", hash = "sha256:d362d1878f00c142b7e1a16e6e5e780f02be8195123f164edf7eddd911eefe7c", size = 8126550, upload-time = "2026-04-01T14:44:26.772Z" },
+    { url = "https://files.pythonhosted.org/packages/4e/48/6ffc514adce69f6050d0753b1a18fd920fce8cac87620d5a31231b04bfc5/pillow-12.2.0-cp313-cp313t-manylinux_2_27_aarch64.manylinux_2_28_aarch64.whl", hash = "sha256:2c727a6d53cb0018aadd8018c2b938376af27914a68a492f59dfcaca650d5eea", size = 6433114, upload-time = "2026-04-01T14:44:29.615Z" },
+    { url = "https://files.pythonhosted.org/packages/36/a3/f9a77144231fb8d40ee27107b4463e205fa4677e2ca2548e14da5cf18dce/pillow-12.2.0-cp313-cp313t-manylinux_2_27_x86_64.manylinux_2_28_x86_64.whl", hash = "sha256:efd8c21c98c5cc60653bcb311bef2ce0401642b7ce9d09e03a7da87c878289d4", size = 7115667, upload-time = "2026-04-01T14:44:32.773Z" },
+    { url = "https://files.pythonhosted.org/packages/c1/fc/ac4ee3041e7d5a565e1c4fd72a113f03b6394cc72ab7089d27608f8aaccb/pillow-12.2.0-cp313-cp313t-musllinux_1_2_aarch64.whl", hash = "sha256:9f08483a632889536b8139663db60f6724bfcb443c96f1b18855860d7d5c0fd4", size = 6538966, upload-time = "2026-04-01T14:44:35.252Z" },
+    { url = "https://files.pythonhosted.org/packages/c0/a8/27fb307055087f3668f6d0a8ccb636e7431d56ed0750e07a60547b1e083e/pillow-12.2.0-cp313-cp313t-musllinux_1_2_x86_64.whl", hash = "sha256:dac8d77255a37e81a2efcbd1fc05f1c15ee82200e6c240d7e127e25e365c39ea", size = 7238241, upload-time = "2026-04-01T14:44:37.875Z" },
+    { url = "https://files.pythonhosted.org/packages/ad/4b/926ab182c07fccae9fcb120043464e1ff1564775ec8864f21a0ebce6ac25/pillow-12.2.0-cp313-cp313t-win32.whl", hash = "sha256:ee3120ae9dff32f121610bb08e4313be87e03efeadfc6c0d18f89127e24d0c24", size = 6379592, upload-time = "2026-04-01T14:44:40.336Z" },
+    { url = "https://files.pythonhosted.org/packages/c2/c4/f9e476451a098181b30050cc4c9a3556b64c02cf6497ea421ac047e89e4b/pillow-12.2.0-cp313-cp313t-win_amd64.whl", hash = "sha256:325ca0528c6788d2a6c3d40e3568639398137346c3d6e66bb61db96b96511c98", size = 7085542, upload-time = "2026-04-01T14:44:43.251Z" },
+    { url = "https://files.pythonhosted.org/packages/00/a4/285f12aeacbe2d6dc36c407dfbbe9e96d4a80b0fb710a337f6d2ad978c75/pillow-12.2.0-cp313-cp313t-win_arm64.whl", hash = "sha256:2e5a76d03a6c6dcef67edabda7a52494afa4035021a79c8558e14af25313d453", size = 2465765, upload-time = "2026-04-01T14:44:45.996Z" },
+    { url = "https://files.pythonhosted.org/packages/bf/98/4595daa2365416a86cb0d495248a393dfc84e96d62ad080c8546256cb9c0/pillow-12.2.0-cp314-cp314-ios_13_0_arm64_iphoneos.whl", hash = "sha256:3adc9215e8be0448ed6e814966ecf3d9952f0ea40eb14e89a102b87f450660d8", size = 4100848, upload-time = "2026-04-01T14:44:48.48Z" },
+    { url = "https://files.pythonhosted.org/packages/0b/79/40184d464cf89f6663e18dfcf7ca21aae2491fff1a16127681bf1fa9b8cf/pillow-12.2.0-cp314-cp314-ios_13_0_arm64_iphonesimulator.whl", hash = "sha256:6a9adfc6d24b10f89588096364cc726174118c62130c817c2837c60cf08a392b", size = 4176515, upload-time = "2026-04-01T14:44:51.353Z" },
+    { url = "https://files.pythonhosted.org/packages/b0/63/703f86fd4c422a9cf722833670f4f71418fb116b2853ff7da722ea43f184/pillow-12.2.0-cp314-cp314-ios_13_0_x86_64_iphonesimulator.whl", hash = "sha256:6a6e67ea2e6feda684ed370f9a1c52e7a243631c025ba42149a2cc5934dec295", size = 3640159, upload-time = "2026-04-01T14:44:53.588Z" },
+    { url = "https://files.pythonhosted.org/packages/71/e0/fb22f797187d0be2270f83500aab851536101b254bfa1eae10795709d283/pillow-12.2.0-cp314-cp314-macosx_10_15_x86_64.whl", hash = "sha256:2bb4a8d594eacdfc59d9e5ad972aa8afdd48d584ffd5f13a937a664c3e7db0ed", size = 5312185, upload-time = "2026-04-01T14:44:56.039Z" },
+    { url = "https://files.pythonhosted.org/packages/ba/8c/1a9e46228571de18f8e28f16fabdfc20212a5d019f3e3303452b3f0a580d/pillow-12.2.0-cp314-cp314-macosx_11_0_arm64.whl", hash = "sha256:80b2da48193b2f33ed0c32c38140f9d3186583ce7d516526d462645fd98660ae", size = 4695386, upload-time = "2026-04-01T14:44:58.663Z" },
+    { url = "https://files.pythonhosted.org/packages/70/62/98f6b7f0c88b9addd0e87c217ded307b36be024d4ff8869a812b241d1345/pillow-12.2.0-cp314-cp314-manylinux2014_aarch64.manylinux_2_17_aarch64.whl", hash = "sha256:22db17c68434de69d8ecfc2fe821569195c0c373b25cccb9cbdacf2c6e53c601", size = 6280384, upload-time = "2026-04-01T14:45:01.5Z" },
+    { url = "https://files.pythonhosted.org/packages/5e/03/688747d2e91cfbe0e64f316cd2e8005698f76ada3130d0194664174fa5de/pillow-12.2.0-cp314-cp314-manylinux2014_x86_64.manylinux_2_17_x86_64.whl", hash = "sha256:7b14cc0106cd9aecda615dd6903840a058b4700fcb817687d0ee4fc8b6e389be", size = 8091599, upload-time = "2026-04-01T14:45:04.5Z" },
+    { url = "https://files.pythonhosted.org/packages/f6/35/577e22b936fcdd66537329b33af0b4ccfefaeabd8aec04b266528cddb33c/pillow-12.2.0-cp314-cp314-manylinux_2_27_aarch64.manylinux_2_28_aarch64.whl", hash = "sha256:8cbeb542b2ebc6fcdacabf8aca8c1a97c9b3ad3927d46b8723f9d4f033288a0f", size = 6396021, upload-time = "2026-04-01T14:45:07.117Z" },
+    { url = "https://files.pythonhosted.org/packages/11/8d/d2532ad2a603ca2b93ad9f5135732124e57811d0168155852f37fbce2458/pillow-12.2.0-cp314-cp314-manylinux_2_27_x86_64.manylinux_2_28_x86_64.whl", hash = "sha256:4bfd07bc812fbd20395212969e41931001fd59eb55a60658b0e5710872e95286", size = 7083360, upload-time = "2026-04-01T14:45:09.763Z" },
+    { url = "https://files.pythonhosted.org/packages/5e/26/d325f9f56c7e039034897e7380e9cc202b1e368bfd04d4cbe6a441f02885/pillow-12.2.0-cp314-cp314-musllinux_1_2_aarch64.whl", hash = "sha256:9aba9a17b623ef750a4d11b742cbafffeb48a869821252b30ee21b5e91392c50", size = 6507628, upload-time = "2026-04-01T14:45:12.378Z" },
+    { url = "https://files.pythonhosted.org/packages/5f/f7/769d5632ffb0988f1c5e7660b3e731e30f7f8ec4318e94d0a5d674eb65a4/pillow-12.2.0-cp314-cp314-musllinux_1_2_x86_64.whl", hash = "sha256:deede7c263feb25dba4e82ea23058a235dcc2fe1f6021025dc71f2b618e26104", size = 7209321, upload-time = "2026-04-01T14:45:15.122Z" },
+    { url = "https://files.pythonhosted.org/packages/6a/7a/c253e3c645cd47f1aceea6a8bacdba9991bf45bb7dfe927f7c893e89c93c/pillow-12.2.0-cp314-cp314-win32.whl", hash = "sha256:632ff19b2778e43162304d50da0181ce24ac5bb8180122cbe1bf4673428328c7", size = 6479723, upload-time = "2026-04-01T14:45:17.797Z" },
+    { url = "https://files.pythonhosted.org/packages/cd/8b/601e6566b957ca50e28725cb6c355c59c2c8609751efbecd980db44e0349/pillow-12.2.0-cp314-cp314-win_amd64.whl", hash = "sha256:4e6c62e9d237e9b65fac06857d511e90d8461a32adcc1b9065ea0c0fa3a28150", size = 7217400, upload-time = "2026-04-01T14:45:20.529Z" },
+    { url = "https://files.pythonhosted.org/packages/d6/94/220e46c73065c3e2951bb91c11a1fb636c8c9ad427ac3ce7d7f3359b9b2f/pillow-12.2.0-cp314-cp314-win_arm64.whl", hash = "sha256:b1c1fbd8a5a1af3412a0810d060a78b5136ec0836c8a4ef9aa11807f2a22f4e1", size = 2554835, upload-time = "2026-04-01T14:45:23.162Z" },
+    { url = "https://files.pythonhosted.org/packages/b6/ab/1b426a3974cb0e7da5c29ccff4807871d48110933a57207b5a676cccc155/pillow-12.2.0-cp314-cp314t-macosx_10_15_x86_64.whl", hash = "sha256:57850958fe9c751670e49b2cecf6294acc99e562531f4bd317fa5ddee2068463", size = 5314225, upload-time = "2026-04-01T14:45:25.637Z" },
+    { url = "https://files.pythonhosted.org/packages/19/1e/dce46f371be2438eecfee2a1960ee2a243bbe5e961890146d2dee1ff0f12/pillow-12.2.0-cp314-cp314t-macosx_11_0_arm64.whl", hash = "sha256:d5d38f1411c0ed9f97bcb49b7bd59b6b7c314e0e27420e34d99d844b9ce3b6f3", size = 4698541, upload-time = "2026-04-01T14:45:28.355Z" },
+    { url = "https://files.pythonhosted.org/packages/55/c3/7fbecf70adb3a0c33b77a300dc52e424dc22ad8cdc06557a2e49523b703d/pillow-12.2.0-cp314-cp314t-manylinux2014_aarch64.manylinux_2_17_aarch64.whl", hash = "sha256:5c0a9f29ca8e79f09de89293f82fc9b0270bb4af1d58bc98f540cc4aedf03166", size = 6322251, upload-time = "2026-04-01T14:45:30.924Z" },
+    { url = "https://files.pythonhosted.org/packages/1c/3c/7fbc17cfb7e4fe0ef1642e0abc17fc6c94c9f7a16be41498e12e2ba60408/pillow-12.2.0-cp314-cp314t-manylinux2014_x86_64.manylinux_2_17_x86_64.whl", hash = "sha256:1610dd6c61621ae1cf811bef44d77e149ce3f7b95afe66a4512f8c59f25d9ebe", size = 8127807, upload-time = "2026-04-01T14:45:33.908Z" },
+    { url = "https://files.pythonhosted.org/packages/ff/c3/a8ae14d6defd2e448493ff512fae903b1e9bd40b72efb6ec55ce0048c8ce/pillow-12.2.0-cp314-cp314t-manylinux_2_27_aarch64.manylinux_2_28_aarch64.whl", hash = "sha256:0a34329707af4f73cf1782a36cd2289c0368880654a2c11f027bcee9052d35dd", size = 6433935, upload-time = "2026-04-01T14:45:36.623Z" },
+    { url = "https://files.pythonhosted.org/packages/6e/32/2880fb3a074847ac159d8f902cb43278a61e85f681661e7419e6596803ed/pillow-12.2.0-cp314-cp314t-manylinux_2_27_x86_64.manylinux_2_28_x86_64.whl", hash = "sha256:8e9c4f5b3c546fa3458a29ab22646c1c6c787ea8f5ef51300e5a60300736905e", size = 7116720, upload-time = "2026-04-01T14:45:39.258Z" },
+    { url = "https://files.pythonhosted.org/packages/46/87/495cc9c30e0129501643f24d320076f4cc54f718341df18cc70ec94c44e1/pillow-12.2.0-cp314-cp314t-musllinux_1_2_aarch64.whl", hash = "sha256:fb043ee2f06b41473269765c2feae53fc2e2fbf96e5e22ca94fb5ad677856f06", size = 6540498, upload-time = "2026-04-01T14:45:41.879Z" },
+    { url = "https://files.pythonhosted.org/packages/18/53/773f5edca692009d883a72211b60fdaf8871cbef075eaa9d577f0a2f989e/pillow-12.2.0-cp314-cp314t-musllinux_1_2_x86_64.whl", hash = "sha256:f278f034eb75b4e8a13a54a876cc4a5ab39173d2cdd93a638e1b467fc545ac43", size = 7239413, upload-time = "2026-04-01T14:45:44.705Z" },
+    { url = "https://files.pythonhosted.org/packages/c9/e4/4b64a97d71b2a83158134abbb2f5bd3f8a2ea691361282f010998f339ec7/pillow-12.2.0-cp314-cp314t-win32.whl", hash = "sha256:6bb77b2dcb06b20f9f4b4a8454caa581cd4dd0643a08bacf821216a16d9c8354", size = 6482084, upload-time = "2026-04-01T14:45:47.568Z" },
+    { url = "https://files.pythonhosted.org/packages/ba/13/306d275efd3a3453f72114b7431c877d10b1154014c1ebbedd067770d629/pillow-12.2.0-cp314-cp314t-win_amd64.whl", hash = "sha256:6562ace0d3fb5f20ed7290f1f929cae41b25ae29528f2af1722966a0a02e2aa1", size = 7225152, upload-time = "2026-04-01T14:45:50.032Z" },
+    { url = "https://files.pythonhosted.org/packages/ff/6e/cf826fae916b8658848d7b9f38d88da6396895c676e8086fc0988073aaf8/pillow-12.2.0-cp314-cp314t-win_arm64.whl", hash = "sha256:aa88ccfe4e32d362816319ed727a004423aab09c5cea43c01a4b435643fa34eb", size = 2556579, upload-time = "2026-04-01T14:45:52.529Z" },
+    { url = "https://files.pythonhosted.org/packages/4e/b7/2437044fb910f499610356d1352e3423753c98e34f915252aafecc64889f/pillow-12.2.0-pp311-pypy311_pp73-macosx_10_15_x86_64.whl", hash = "sha256:0538bd5e05efec03ae613fd89c4ce0368ecd2ba239cc25b9f9be7ed426b0af1f", size = 5273969, upload-time = "2026-04-01T14:45:55.538Z" },
+    { url = "https://files.pythonhosted.org/packages/f6/f4/8316e31de11b780f4ac08ef3654a75555e624a98db1056ecb2122d008d5a/pillow-12.2.0-pp311-pypy311_pp73-macosx_11_0_arm64.whl", hash = "sha256:394167b21da716608eac917c60aa9b969421b5dcbbe02ae7f013e7b85811c69d", size = 4659674, upload-time = "2026-04-01T14:45:58.093Z" },
+    { url = "https://files.pythonhosted.org/packages/d4/37/664fca7201f8bb2aa1d20e2c3d5564a62e6ae5111741966c8319ca802361/pillow-12.2.0-pp311-pypy311_pp73-manylinux2014_aarch64.manylinux_2_17_aarch64.whl", hash = "sha256:5d04bfa02cc2d23b497d1e90a0f927070043f6cbf303e738300532379a4b4e0f", size = 5288479, upload-time = "2026-04-01T14:46:01.141Z" },
+    { url = "https://files.pythonhosted.org/packages/49/62/5b0ed78fce87346be7a5cfcfaaad91f6a1f98c26f86bdbafa2066c647ef6/pillow-12.2.0-pp311-pypy311_pp73-manylinux2014_x86_64.manylinux_2_17_x86_64.whl", hash = "sha256:0c838a5125cee37e68edec915651521191cef1e6aa336b855f495766e77a366e", size = 7032230, upload-time = "2026-04-01T14:46:03.874Z" },
+    { url = "https://files.pythonhosted.org/packages/c3/28/ec0fc38107fc32536908034e990c47914c57cd7c5a3ece4d8d8f7ffd7e27/pillow-12.2.0-pp311-pypy311_pp73-manylinux_2_27_aarch64.manylinux_2_28_aarch64.whl", hash = "sha256:4a6c9fa44005fa37a91ebfc95d081e8079757d2e904b27103f4f5fa6f0bf78c0", size = 5355404, upload-time = "2026-04-01T14:46:06.33Z" },
+    { url = "https://files.pythonhosted.org/packages/5e/8b/51b0eddcfa2180d60e41f06bd6d0a62202b20b59c68f5a132e615b75aecf/pillow-12.2.0-pp311-pypy311_pp73-manylinux_2_27_x86_64.manylinux_2_28_x86_64.whl", hash = "sha256:25373b66e0dd5905ed63fa3cae13c82fbddf3079f2c8bf15c6fb6a35586324c1", size = 6002215, upload-time = "2026-04-01T14:46:08.83Z" },
+    { url = "https://files.pythonhosted.org/packages/bc/60/5382c03e1970de634027cee8e1b7d39776b778b81812aaf45b694dfe9e28/pillow-12.2.0-pp311-pypy311_pp73-win_amd64.whl", hash = "sha256:bfa9c230d2fe991bed5318a5f119bd6780cda2915cca595393649fc118ab895e", size = 7080946, upload-time = "2026-04-01T14:46:11.734Z" },
+]
+
+[[package]]
+name = "platformdirs"
+version = "4.10.0"
+source = { registry = "https://pypi.org/simple" }
+sdist = { url = "https://files.pythonhosted.org/packages/d7/47/e4501f49c178ae1d9f4a75073fda4204f52647993f075a9db4d14930e0c5/platformdirs-4.10.0.tar.gz", hash = "sha256:31e761a6a0ca04faf7353ea759bdba55652be214725111e5aac52dfa29d4bef7", size = 31224, upload-time = "2026-05-28T03:32:53.587Z" }
+wheels = [
+    { url = "https://files.pythonhosted.org/packages/81/e6/cd9575ac904136b3cbf7aa7ee819ef86eedb7274e46f230e94ea4342e729/platformdirs-4.10.0-py3-none-any.whl", hash = "sha256:fb516cdb12eb0d857d0cd85a7c57cea4d060bee4578d6cf5a14dfdf8cbf8784a", size = 22743, upload-time = "2026-05-28T03:32:52.175Z" },
+]
+
+[[package]]
+name = "pluggy"
+version = "1.6.0"
+source = { registry = "https://pypi.org/simple" }
+sdist = { url = "https://files.pythonhosted.org/packages/f9/e2/3e91f31a7d2b083fe6ef3fa267035b518369d9511ffab804f839851d2779/pluggy-1.6.0.tar.gz", hash = "sha256:7dcc130b76258d33b90f61b658791dede3486c3e6bfb003ee5c9bfb396dd22f3", size = 69412, upload-time = "2025-05-15T12:30:07.975Z" }
+wheels = [
+    { url = "https://files.pythonhosted.org/packages/54/20/4d324d65cc6d9205fabedc306948156824eb9f0ee1633355a8f7ec5c66bf/pluggy-1.6.0-py3-none-any.whl", hash = "sha256:e920276dd6813095e9377c0bc5566d94c932c33b27a3e3945d8389c374dd4746", size = 20538, upload-time = "2025-05-15T12:30:06.134Z" },
+]
+
+[[package]]
+name = "pre-commit"
+version = "4.6.0"
+source = { registry = "https://pypi.org/simple" }
+dependencies = [
+    { name = "cfgv" },
+    { name = "identify" },
+    { name = "nodeenv" },
+    { name = "pyyaml" },
+    { name = "virtualenv" },
+]
+sdist = { url = "https://files.pythonhosted.org/packages/8e/22/2de9408ac81acbb8a7d05d4cc064a152ccf33b3d480ebe0cd292153db239/pre_commit-4.6.0.tar.gz", hash = "sha256:718d2208cef53fdc38206e40524a6d4d9576d103eb16f0fec11c875e7716e9d9", size = 198525, upload-time = "2026-04-21T20:31:41.613Z" }
+wheels = [
+    { url = "https://files.pythonhosted.org/packages/80/6e/4b28b62ecb6aae56769c34a8ff1d661473ec1e9519e2d5f8b2c150086b26/pre_commit-4.6.0-py2.py3-none-any.whl", hash = "sha256:e2cf246f7299edcabcf15f9b0571fdce06058527f0a06535068a86d38089f29b", size = 226472, upload-time = "2026-04-21T20:31:40.092Z" },
+]
+
+[[package]]
+name = "propcache"
+version = "0.5.2"
+source = { registry = "https://pypi.org/simple" }
+sdist = { url = "https://files.pythonhosted.org/packages/ec/44/c87281c333769159c50594f22610f77398a47ccbfbbf23074e744e86f87c/propcache-0.5.2.tar.gz", hash = "sha256:01c4fc7480cd0598bb4b57022df55b9ca296da7fc5a8760bd8451a7e63a7d427", size = 50208, upload-time = "2026-05-08T21:02:12.199Z" }
+wheels = [
+    { url = "https://files.pythonhosted.org/packages/e7/f1/8a8cc1c2c7e7934ab77e0163414f736fadbc0f5e8dd9673b952355ac175b/propcache-0.5.2-cp311-cp311-macosx_10_9_universal2.whl", hash = "sha256:74b70780220e2dd89175ca24b81b68b67c83db499ae611e7f2313cb329801c78", size = 90744, upload-time = "2026-05-08T20:59:45.799Z" },
+    { url = "https://files.pythonhosted.org/packages/c2/f4/651b1225e976bd1a2ba5cfba0c29d096581c2636b437e3a9a7ab6276270a/propcache-0.5.2-cp311-cp311-macosx_10_9_x86_64.whl", hash = "sha256:a4840ab0ae0216d952f4b53dc6d0b992bfc2bedbfe360bdd9b548bc184c08959", size = 52033, upload-time = "2026-05-08T20:59:47.408Z" },
+    { url = "https://files.pythonhosted.org/packages/15/a8/8ede85d6aa1f79fc7dc2f8fd2c8d65920b8272c3892903c8a1affde48cfb/propcache-0.5.2-cp311-cp311-macosx_11_0_arm64.whl", hash = "sha256:c6844ba6364fb12f403928a82cfd295ab103a2b315c77c747b2dbe4a41894ea7", size = 52754, upload-time = "2026-05-08T20:59:49.202Z" },
+    { url = "https://files.pythonhosted.org/packages/7d/fe/b3551b41bbc2f5b5bb088fc6920567cd43101253e68fbaa261339eb96fe1/propcache-0.5.2-cp311-cp311-manylinux2014_aarch64.manylinux_2_17_aarch64.manylinux_2_28_aarch64.whl", hash = "sha256:2293949b855ce597f2826452d17c2d545fb5622379c4ea6fdf525e9b8e8a2511", size = 57573, upload-time = "2026-05-08T20:59:50.778Z" },
+    { url = "https://files.pythonhosted.org/packages/83/27/ab851ebd1b7172e3e161f5f8d39e315d54a91bea246f01f4d872d3376aef/propcache-0.5.2-cp311-cp311-manylinux2014_ppc64le.manylinux_2_17_ppc64le.manylinux_2_28_ppc64le.whl", hash = "sha256:0fd59b5af35f74da48d905dcbad55449ba13be91823cb05a9bd590bbf5b61660", size = 60645, upload-time = "2026-05-08T20:59:52.227Z" },
+    { url = "https://files.pythonhosted.org/packages/95/7d/466b3d18022e9897cbda9c735c493c5bd747d7a4c6f5ea1480b4cec434b6/propcache-0.5.2-cp311-cp311-manylinux2014_s390x.manylinux_2_17_s390x.manylinux_2_28_s390x.whl", hash = "sha256:29f9309a2e42b0d273be006fdb4be2d6c39a47f6f57d8fb1cf9f81481df81b66", size = 61563, upload-time = "2026-05-08T20:59:53.866Z" },
+    { url = "https://files.pythonhosted.org/packages/27/1b/16ab7f2cf2041da2f60d156ba64c2484eadf9168075b4ff43c3ef60045af/propcache-0.5.2-cp311-cp311-manylinux2014_x86_64.manylinux_2_17_x86_64.manylinux_2_28_x86_64.whl", hash = "sha256:5aaa2b923c1944ac8febd6609cb373540a5563e7cbcb0fd770f75dace2eb817b", size = 58888, upload-time = "2026-05-08T20:59:55.457Z" },
+    { url = "https://files.pythonhosted.org/packages/0a/67/bb777ffd907633563bf35fd859c4ce97b0512c32f4633cf5d1eb7c33512b/propcache-0.5.2-cp311-cp311-manylinux_2_31_riscv64.manylinux_2_39_riscv64.whl", hash = "sha256:66ea454f095ddf5b6b14f56c064c0941c4788be11e18d2464cf643bf7203ff67", size = 59253, upload-time = "2026-05-08T20:59:57.075Z" },
+    { url = "https://files.pythonhosted.org/packages/b9/42/64f8d90b73fd9cdc1499b48057ff6d9cd2a98a25734c9bb62ecf07e87061/propcache-0.5.2-cp311-cp311-musllinux_1_2_aarch64.whl", hash = "sha256:95f1e3f4760d404b13c9976c0229b2b49a3c8e2c62a9ce92efdd2b11ada75e3f", size = 57558, upload-time = "2026-05-08T20:59:58.602Z" },
+    { url = "https://files.pythonhosted.org/packages/eb/02/dba5bc03c9041f2092ea55a449caf5dfe68352c6654511b29ba0654ddb69/propcache-0.5.2-cp311-cp311-musllinux_1_2_armv7l.whl", hash = "sha256:85341b12b9d55bad0bded24cac341bb34289469e03a11f3f583ea1cc1db0326c", size = 55007, upload-time = "2026-05-08T20:59:59.837Z" },
+    { url = "https://files.pythonhosted.org/packages/14/c0/43f649c7aa2a77a3b100d84e9dea3a483120ecb608bfe36ce49eaff517fe/propcache-0.5.2-cp311-cp311-musllinux_1_2_ppc64le.whl", hash = "sha256:26a4dca084132874e639895c3135dfad5eb20bae209f62d1aeb31b03e601c3c0", size = 60355, upload-time = "2026-05-08T21:00:01.144Z" },
+    { url = "https://files.pythonhosted.org/packages/83/c0/435dafd27f1cb4a495381dae60e25883ccfe4020bb72818e8184c1678092/propcache-0.5.2-cp311-cp311-musllinux_1_2_riscv64.whl", hash = "sha256:3b199b9b2b3d6a7edf3183ba8a9a137a22b97f7df525feb5ae1eccf026d2a9c6", size = 59057, upload-time = "2026-05-08T21:00:02.401Z" },
+    { url = "https://files.pythonhosted.org/packages/53/ae/6e292df9135d659944e96cb3389258e4a663e5b2b5f6c217ef0ddc8d2f73/propcache-0.5.2-cp311-cp311-musllinux_1_2_s390x.whl", hash = "sha256:e59bc9e66329185b93dab73f210f1a37f81cb40f321501db8017c9aea15dba27", size = 61938, upload-time = "2026-05-08T21:00:03.638Z" },
+    { url = "https://files.pythonhosted.org/packages/0b/42/314ebc50d8159055411fd6b0bda322ff510e4b1f7d2e4927940ad0f6af20/propcache-0.5.2-cp311-cp311-musllinux_1_2_x86_64.whl", hash = "sha256:552ffadf6ad409844bc5919c42a0a83d88314cedddaea0e41e80a8b8fffe881f", size = 59731, upload-time = "2026-05-08T21:00:04.881Z" },
+    { url = "https://files.pythonhosted.org/packages/b8/9b/2da6dee38871c3c8772fabc2758325a5c9077d6d18c597737dc04dd884cd/propcache-0.5.2-cp311-cp311-win32.whl", hash = "sha256:cd416c1de191973c52ff1a12a57446bfc7642797b282d7caf2162d7d1b8aa9a0", size = 38966, upload-time = "2026-05-08T21:00:06.511Z" },
+    { url = "https://files.pythonhosted.org/packages/42/4e/f17363fb58c0afe05b067361cb6d86ed2d29de6506779a27547c4d183075/propcache-0.5.2-cp311-cp311-win_amd64.whl", hash = "sha256:44e488ef40dbb452700b2b1f8188934121f6648f52c295055662d2191959ff82", size = 42135, upload-time = "2026-05-08T21:00:08.088Z" },
+    { url = "https://files.pythonhosted.org/packages/c6/eb/6af6685077d22e8b33358d3c548e3282706a0b3cd85044ffba4e5dd08e3b/propcache-0.5.2-cp311-cp311-win_arm64.whl", hash = "sha256:54adaa85a22078d1e306304a40984dc5be99d599bf3dc0a24dc98f7daeab89ab", size = 38381, upload-time = "2026-05-08T21:00:09.692Z" },
+    { url = "https://files.pythonhosted.org/packages/4a/cb/e27bc2b2737a0bb49962b275efa051e8f1c35a936df7d5139b6b658b7dc9/propcache-0.5.2-cp312-cp312-macosx_10_13_universal2.whl", hash = "sha256:806719138ecd720339a12410fb9614ac9b2b2d3a5fdf8235d56981c36f4039ba", size = 95887, upload-time = "2026-05-08T21:00:11.277Z" },
+    { url = "https://files.pythonhosted.org/packages/e6/13/b8ae04c59392f8d11c6cd9fb4011d1dc7c86b81225c770280300e259ffe1/propcache-0.5.2-cp312-cp312-macosx_10_13_x86_64.whl", hash = "sha256:db2b80ea58eab4f86b2beec3cc8b39e8ff9276ac20e96b7cce43c8ae84cd6b5a", size = 54654, upload-time = "2026-05-08T21:00:12.604Z" },
+    { url = "https://files.pythonhosted.org/packages/2c/7d/49777a3e20b55863d4794384a38acd460c04157b0a00f8602b0d508b8431/propcache-0.5.2-cp312-cp312-macosx_11_0_arm64.whl", hash = "sha256:e5cbfac9f61484f7e9f3597775500cd3ebe8274e9b050c38f9525c77c97520bf", size = 55190, upload-time = "2026-05-08T21:00:13.935Z" },
+    { url = "https://files.pythonhosted.org/packages/44/c7/085d0cd63062e84044e3f05797749c3f8e3938ff3aeb0eb2f69d43fafc91/propcache-0.5.2-cp312-cp312-manylinux2014_aarch64.manylinux_2_17_aarch64.manylinux_2_28_aarch64.whl", hash = "sha256:5dbc581d2814337da56222fab8dc5f161cd798a434e49bac27930aaef798e144", size = 59995, upload-time = "2026-05-08T21:00:15.526Z" },
+    { url = "https://files.pythonhosted.org/packages/9c/42/32cf8e3009e92b2645cf1e944f701e8ea4e924dffde1ee26db860bcbf7e4/propcache-0.5.2-cp312-cp312-manylinux2014_ppc64le.manylinux_2_17_ppc64le.manylinux_2_28_ppc64le.whl", hash = "sha256:857187f381f88c8e2fa2fe56ab94879d011b883d5a2ee5a1b60a8cd2a06846d9", size = 63422, upload-time = "2026-05-08T21:00:16.824Z" },
+    { url = "https://files.pythonhosted.org/packages/9e/1b/f112433f99fc979431b87a39ef169e3f8df070d99a72792c56d6937ac48b/propcache-0.5.2-cp312-cp312-manylinux2014_s390x.manylinux_2_17_s390x.manylinux_2_28_s390x.whl", hash = "sha256:178b4a2cdaac1818e2bf1c5a99b94383fa73ea5382e032a48dec07dc5668dc42", size = 64342, upload-time = "2026-05-08T21:00:18.362Z" },
+    { url = "https://files.pythonhosted.org/packages/14/15/5574111ae50dd6e879456888c0eadd4c5a869959775854e18e18a6b345f3/propcache-0.5.2-cp312-cp312-manylinux2014_x86_64.manylinux_2_17_x86_64.manylinux_2_28_x86_64.whl", hash = "sha256:6f328175a2cde1f0ff2c4ed8ce968b9dcfb55f3a7153f39e2957ed994da13476", size = 61639, upload-time = "2026-05-08T21:00:19.692Z" },
+    { url = "https://files.pythonhosted.org/packages/cc/da/4d775080b1490c0ae604acda868bd71aabe3a89ed16f2aa4339eb8a283e7/propcache-0.5.2-cp312-cp312-manylinux_2_31_riscv64.manylinux_2_39_riscv64.whl", hash = "sha256:5671d09a36b06d0fd4a3da0fccbcae360e9b1570924171a15e9e0997f0249fba", size = 61588, upload-time = "2026-05-08T21:00:21.155Z" },
+    { url = "https://files.pythonhosted.org/packages/04/ac/f076982cbe2195ee9cf32de5a1e46951d9fb399fc207f390562dd0fd8fb2/propcache-0.5.2-cp312-cp312-musllinux_1_2_aarch64.whl", hash = "sha256:80168e2ebe4d3ec6599d10ad8f520304ae1cad9b6c5a95372aef1b66b7bfb53a", size = 60029, upload-time = "2026-05-08T21:00:22.713Z" },
+    { url = "https://files.pythonhosted.org/packages/70/60/189be62e0dd898dce3b331e1b8c7a543cd3a405ac0c81fe8ee8a9d5d77e1/propcache-0.5.2-cp312-cp312-musllinux_1_2_armv7l.whl", hash = "sha256:45f11346f884bc47444f6e6647131055844134c3175b629f84952e2b5cd62b64", size = 56774, upload-time = "2026-05-08T21:00:24.001Z" },
+    { url = "https://files.pythonhosted.org/packages/ea/9e/93377b9c7939c1ffae98f878dee955efadfd638078bc86dbc21f9d52f651/propcache-0.5.2-cp312-cp312-musllinux_1_2_ppc64le.whl", hash = "sha256:8e778ebd44ef4f66ed60a0416b06b489687db264a9c0b3620362f26489492913", size = 63532, upload-time = "2026-05-08T21:00:25.545Z" },
+    { url = "https://files.pythonhosted.org/packages/14/f9/590ef6cfb9b8028d516d287812ece32bb0bc5f11fbb9c8bf6b2e6313fec8/propcache-0.5.2-cp312-cp312-musllinux_1_2_riscv64.whl", hash = "sha256:c0cb9ed24c8964e172768d455a38254c2dd8a552905729ce006cad3d3dda59b1", size = 61592, upload-time = "2026-05-08T21:00:27.186Z" },
+    { url = "https://files.pythonhosted.org/packages/b4/5e/70958b3034c297a630bba2f17ca7abc2d5f39a803ad7e370ab79d1ecd022/propcache-0.5.2-cp312-cp312-musllinux_1_2_s390x.whl", hash = "sha256:1d1ad32d9d4355e2be65574fd0bfd3677e7066b009cd5b9b2dee8aa6a6393b33", size = 64788, upload-time = "2026-05-08T21:00:28.8Z" },
+    { url = "https://files.pythonhosted.org/packages/12/fd/77fe5936d8c3086ca9048f7f415f122ed82e53884a9ec193646b42deef06/propcache-0.5.2-cp312-cp312-musllinux_1_2_x86_64.whl", hash = "sha256:c80f4ba3e8f00189165999a742ee526ebeccedf6c3f7beb0c7df821e9772435a", size = 62514, upload-time = "2026-05-08T21:00:30.098Z" },
+    { url = "https://files.pythonhosted.org/packages/cf/74/66bd798b5b3be70aa1b391f5cc9d6a0a5532d7fd3b19ec0b213e72e6ad9d/propcache-0.5.2-cp312-cp312-win32.whl", hash = "sha256:8c7972d8f193740d9175f0998ab38717e6cd322d5935c5b0fef8c0d323fd9031", size = 39018, upload-time = "2026-05-08T21:00:31.622Z" },
+    { url = "https://files.pythonhosted.org/packages/61/7c/5c0d34aa3024694d6dcb9271cdbdd08c4e47c1c0ad95ec7e7bc74cdea145/propcache-0.5.2-cp312-cp312-win_amd64.whl", hash = "sha256:d9ee8826a7d47863a08ac44e1a5f611a462eefc3a194b492da242128bec75b42", size = 42322, upload-time = "2026-05-08T21:00:32.918Z" },
+    { url = "https://files.pythonhosted.org/packages/4d/91/875812f1a3feb20ceba818ef39fbe4d92f1081e04ac815c822496d0d038b/propcache-0.5.2-cp312-cp312-win_arm64.whl", hash = "sha256:2800a4a8ead6b28cccd1ec54b59346f0def7922ee1c7598e8499c733cfbb7c84", size = 38172, upload-time = "2026-05-08T21:00:35.124Z" },
+    { url = "https://files.pythonhosted.org/packages/c5/09/f049e45385503fe67db75a6b6186a7b9f0c3930366dc960522c312a825b1/propcache-0.5.2-cp313-cp313-macosx_10_13_universal2.whl", hash = "sha256:099aaf4b4d1a02265b92a977edf00b5c4f63b3b17ac6de39b0d637c9cac0188a", size = 94457, upload-time = "2026-05-08T21:00:36.355Z" },
+    { url = "https://files.pythonhosted.org/packages/6b/65/83d1d05655baf63113731bd5a1008435e14f8d1e5a06cbe4ec5b23ad7a31/propcache-0.5.2-cp313-cp313-macosx_10_13_x86_64.whl", hash = "sha256:68ce1c44c7a813a7f71ea04315a8c7b330b63db99d059a797a4651bb6f69f117", size = 53835, upload-time = "2026-05-08T21:00:38.072Z" },
+    { url = "https://files.pythonhosted.org/packages/a9/12/a6ba6482bb5ea3260c000c9b20881c95fa11c6b30173715668259f844ed7/propcache-0.5.2-cp313-cp313-macosx_11_0_arm64.whl", hash = "sha256:fc299c129490f55f254cd90be0deca4764e36e9a7c08b4aa588479a3bbed3098", size = 54545, upload-time = "2026-05-08T21:00:39.319Z" },
+    { url = "https://files.pythonhosted.org/packages/a9/19/7fa086f5764c59ec8a8e157cd93aa8497acc00aba9dcdec56bfffb32602d/propcache-0.5.2-cp313-cp313-manylinux2014_aarch64.manylinux_2_17_aarch64.manylinux_2_28_aarch64.whl", hash = "sha256:a6ae2198be502c10f09b2516e7b5d019816924bc3183a43ce792a7bd6625e6f4", size = 59886, upload-time = "2026-05-08T21:00:40.621Z" },
+    { url = "https://files.pythonhosted.org/packages/a1/e4/5d7663dc8235956c8f5281698a3af1d351d8820341ddd890f59d9a9127f2/propcache-0.5.2-cp313-cp313-manylinux2014_ppc64le.manylinux_2_17_ppc64le.manylinux_2_28_ppc64le.whl", hash = "sha256:6041d31504dc1779d700e1edcfb08eea334b357620b06681a4eabb57a74e574e", size = 63261, upload-time = "2026-05-08T21:00:41.775Z" },
+    { url = "https://files.pythonhosted.org/packages/4a/4a/15a03adee24d6350da4292caeac44c34c033d2afe5e87eb370f38854560f/propcache-0.5.2-cp313-cp313-manylinux2014_s390x.manylinux_2_17_s390x.manylinux_2_28_s390x.whl", hash = "sha256:f7eabc04151c78a9f4d5bbb5f1faf571e4defeb4b585e0fe95b60ff2dbe4d3d7", size = 64184, upload-time = "2026-05-08T21:00:43.018Z" },
+    { url = "https://files.pythonhosted.org/packages/8b/c6/979176efdaa3d239e36d503d5af63a0a773b36662ed8f52e5b6a6d9fd40e/propcache-0.5.2-cp313-cp313-manylinux2014_x86_64.manylinux_2_17_x86_64.manylinux_2_28_x86_64.whl", hash = "sha256:4db0ba63d693afd40d249bd93f842b5f144f8fcbb83de05660373bcf30517b1d", size = 61534, upload-time = "2026-05-08T21:00:44.507Z" },
+    { url = "https://files.pythonhosted.org/packages/c8/22/63e8cd1bae4c2d2be6493b6b7d10566ddafad88137cfbc99964a1119853c/propcache-0.5.2-cp313-cp313-manylinux_2_31_riscv64.manylinux_2_39_riscv64.whl", hash = "sha256:1dbcf7675229b35d31abb6547d8ebc8c27a830ac3f9a794edff6254873ec7c0a", size = 61500, upload-time = "2026-05-08T21:00:45.796Z" },
+    { url = "https://files.pythonhosted.org/packages/60/5a/28e5d9acbac1cc9ccb67045e8c1b943aa8d79fdf39c93bd73cacd68008ea/propcache-0.5.2-cp313-cp313-musllinux_1_2_aarch64.whl", hash = "sha256:d310c013aad2c72f1c3f2f8dd3279d460a858c551f97aeb8c63e4693cca7b4d2", size = 59994, upload-time = "2026-05-08T21:00:47.093Z" },
+    { url = "https://files.pythonhosted.org/packages/f3/40/db650677f554a95b9c01a7c9d93d629e93a15562f5deb4573c9ee136fed2/propcache-0.5.2-cp313-cp313-musllinux_1_2_armv7l.whl", hash = "sha256:06187263ddad280d05b4d8a8b3bb7d164cbebd469236544a42e6d9b28ac6a4fa", size = 56884, upload-time = "2026-05-08T21:00:48.376Z" },
+    { url = "https://files.pythonhosted.org/packages/80/45/70b39b89516ff8b96bf732fa6fded8cef20f293cb1508690101c3c07ec51/propcache-0.5.2-cp313-cp313-musllinux_1_2_ppc64le.whl", hash = "sha256:3115559b8effafd63b142ea5ed53d63a16ea6469cbc63dce4ee194b42db5d853", size = 63464, upload-time = "2026-05-08T21:00:49.954Z" },
+    { url = "https://files.pythonhosted.org/packages/f9/e2/fa59d3a89eac5534293124af4f1d0d0ada091ce4a0ab4610ce03fd2bdd8d/propcache-0.5.2-cp313-cp313-musllinux_1_2_riscv64.whl", hash = "sha256:c60462af8e6dc30c35407c7237ea908d777b22862bbee27bc4699c0d8bcdc45a", size = 61588, upload-time = "2026-05-08T21:00:51.281Z" },
+    { url = "https://files.pythonhosted.org/packages/0b/97/efb547a55c4bc7381cfb202d6a2239ac621045277bc1ea5dfd3a7f0516c0/propcache-0.5.2-cp313-cp313-musllinux_1_2_s390x.whl", hash = "sha256:40314bca9ac559716fe374094fc81c11dcc34b64fd6c585360f5775690505704", size = 64667, upload-time = "2026-05-08T21:00:52.602Z" },
+    { url = "https://files.pythonhosted.org/packages/92/56/f5c7d9b4b7595d5127da38974d791b2153f3d1eae6c674af3583ace92ad3/propcache-0.5.2-cp313-cp313-musllinux_1_2_x86_64.whl", hash = "sha256:cfa21e036ce1e1db2be04ba3b85d2df1bb1702fa01932d984c5464c665228ff4", size = 62463, upload-time = "2026-05-08T21:00:54.303Z" },
+    { url = "https://files.pythonhosted.org/packages/bd/3b/484a3a65fc9f9f60c41dcd17b428bace5389544e2c680994534a20755066/propcache-0.5.2-cp313-cp313-win32.whl", hash = "sha256:f156a3529f38063b6dbaf356e15602a7f95f8055b1295a438433a6386f10463d", size = 38621, upload-time = "2026-05-08T21:00:55.808Z" },
+    { url = "https://files.pythonhosted.org/packages/1c/fd/3f0f10dba4dabad3bf53102be007abf55481067952bde0fdddff439e7c61/propcache-0.5.2-cp313-cp313-win_amd64.whl", hash = "sha256:dfed59d0a5aeb01e242e66ff0300bc4a265a7c05f612d30016f0b60b1017d757", size = 41649, upload-time = "2026-05-08T21:00:57.061Z" },
+    { url = "https://files.pythonhosted.org/packages/90/ec/6ce619cc32bb500a482f811f9cd509368b4e58e638d13f2c68f370d6b475/propcache-0.5.2-cp313-cp313-win_arm64.whl", hash = "sha256:ba338430e87ceb9c8f0cf754de38a9860560261e56c00376debd628698a7364f", size = 37636, upload-time = "2026-05-08T21:00:58.646Z" },
+    { url = "https://files.pythonhosted.org/packages/1b/82/c1d268bbbf2ef981c5bf0fbbe746db617c66e3bcefe431a1aa8943fbe23a/propcache-0.5.2-cp313-cp313t-macosx_10_13_universal2.whl", hash = "sha256:a592f5f3da71c8691c788c13cb6734b6d17663d2e1cb8caddf0673d01ef8847d", size = 98872, upload-time = "2026-05-08T21:00:59.889Z" },
+    { url = "https://files.pythonhosted.org/packages/f4/d4/52c871e73e864e6b34c0e2d58ac1ec5ccd149497ddc7ad2137ae98323a35/propcache-0.5.2-cp313-cp313t-macosx_10_13_x86_64.whl", hash = "sha256:6a997d0489e9668a384fcfd5061b857aa5361de73191cac204d04b889cfbbafa", size = 56257, upload-time = "2026-05-08T21:01:01.195Z" },
+    { url = "https://files.pythonhosted.org/packages/67/f0/9b90ca2a210b3d09bcfcd96ecd0f55545c091535abce2a45de2775cfd357/propcache-0.5.2-cp313-cp313t-macosx_11_0_arm64.whl", hash = "sha256:10734b5484ea113152ee25a91dccedf81631791805d2c9ccb054958e51842c94", size = 56696, upload-time = "2026-05-08T21:01:02.941Z" },
+    { url = "https://files.pythonhosted.org/packages/9d/0e/6e9d4ba07c8e56e21ddec1e75f12148142b21ca83a51871babce095334f4/propcache-0.5.2-cp313-cp313t-manylinux2014_aarch64.manylinux_2_17_aarch64.manylinux_2_28_aarch64.whl", hash = "sha256:cafca7e56c12bb02ae16d283742bef25a61122e9dab2b5b3f2ccbe589ce32164", size = 62378, upload-time = "2026-05-08T21:01:04.475Z" },
+    { url = "https://files.pythonhosted.org/packages/65/19/c10badaa463dde8a27ce884f8ee2ec37e6035b7c9f5ff0c8f74f06f08dac/propcache-0.5.2-cp313-cp313t-manylinux2014_ppc64le.manylinux_2_17_ppc64le.manylinux_2_28_ppc64le.whl", hash = "sha256:f064f8d2b59177878b7615df1735cd8fe3462ed6be8c7b217d17a276489c2b7f", size = 65283, upload-time = "2026-05-08T21:01:05.959Z" },
+    { url = "https://files.pythonhosted.org/packages/b0/b6/93bea99ca80e19cef6512a8580e5b7857bbe09422d9daa7fd4ef5723306c/propcache-0.5.2-cp313-cp313t-manylinux2014_s390x.manylinux_2_17_s390x.manylinux_2_28_s390x.whl", hash = "sha256:f78abfa8dfc32376fd1aacf597b2f2fbbe0ea751419aee718af5d4f82537ef8c", size = 66616, upload-time = "2026-05-08T21:01:07.228Z" },
+    { url = "https://files.pythonhosted.org/packages/83/e4/5c7462e50625f051f37fb38b8224f7639f667184bbd34424ec83819bb1b7/propcache-0.5.2-cp313-cp313t-manylinux2014_x86_64.manylinux_2_17_x86_64.manylinux_2_28_x86_64.whl", hash = "sha256:f7467da8a9822bf1a55336f877340c5bcbd3c482afc43a99771169f74a26dedc", size = 63773, upload-time = "2026-05-08T21:01:08.514Z" },
+    { url = "https://files.pythonhosted.org/packages/ca/b6/99238894047b13c823be25027e736626cd414a52a5e30d2c3347c2733529/propcache-0.5.2-cp313-cp313t-manylinux_2_31_riscv64.manylinux_2_39_riscv64.whl", hash = "sha256:a6ddc6ac9e25de626c1f129c1b467d7ecd33ce2237d3fd0c4e429feef0a7ee1f", size = 63664, upload-time = "2026-05-08T21:01:09.874Z" },
+    { url = "https://files.pythonhosted.org/packages/85/1e/a3a1a63116a2b8edb415a8bb9a6f0c34bd03830b1e18e8ce2904e1dc1cf4/propcache-0.5.2-cp313-cp313t-musllinux_1_2_aarch64.whl", hash = "sha256:2f22cbbac9e26a8e864c0985ff1268d5d939d53d9d9411a9824279097e03a2cb", size = 62643, upload-time = "2026-05-08T21:01:11.132Z" },
+    { url = "https://files.pythonhosted.org/packages/e4/03/893cf147de2fc6543c5eaa07ad833170e7e2a2385725bbebe8c0503723bb/propcache-0.5.2-cp313-cp313t-musllinux_1_2_armv7l.whl", hash = "sha256:fc76378c62a0f04d0cd82fbb1a2cd2d7e28fcb40d5873f28a6c44e388aaa2751", size = 59595, upload-time = "2026-05-08T21:01:12.387Z" },
+    { url = "https://files.pythonhosted.org/packages/86/3b/04c1a2e12c57766568ba75ba72b3bf2042818d4c1425fab6fc07155c7cff/propcache-0.5.2-cp313-cp313t-musllinux_1_2_ppc64le.whl", hash = "sha256:acd2c8edba48e31e58a363b8cf4e5c7db3b04b3f9e371f601df30d9b0d244836", size = 65711, upload-time = "2026-05-08T21:01:13.676Z" },
+    { url = "https://files.pythonhosted.org/packages/1c/34/80f8d0099f8d6bacc4de1624c85672681c8cd1149ca2da0e38fd120b817f/propcache-0.5.2-cp313-cp313t-musllinux_1_2_riscv64.whl", hash = "sha256:452b5065457eb9991ec5eb38ff41d6cd4c991c9ac7c531c4d5849ae473a9a13f", size = 64247, upload-time = "2026-05-08T21:01:14.936Z" },
+    { url = "https://files.pythonhosted.org/packages/f3/1a/8b08f3a5f1037e9e370c55883ceeeee0f6dd0416fb2d2d67b8bfc91f2a79/propcache-0.5.2-cp313-cp313t-musllinux_1_2_s390x.whl", hash = "sha256:3430bb2bfe1331885c427745a751e774ee679fd4344f80b97bf879815fe8fa55", size = 67102, upload-time = "2026-05-08T21:01:16.281Z" },
+    { url = "https://files.pythonhosted.org/packages/34/68/8bdb7bb7756d76e005490649d10e4a8369e610c74d619f71e1aedf889e9c/propcache-0.5.2-cp313-cp313t-musllinux_1_2_x86_64.whl", hash = "sha256:cef6cea3922890dd6c9654971001fa797b526c16ab5e1e46c05fd6f877be7568", size = 64964, upload-time = "2026-05-08T21:01:17.57Z" },
+    { url = "https://files.pythonhosted.org/packages/0a/aa/50fb0b5d3968b61a510926ff8b8465f1d6e976b3ab74496d7a4b9fc42515/propcache-0.5.2-cp313-cp313t-win32.whl", hash = "sha256:72d61e16dd78228b58c5d47be830ff3da7e5f139abdf0aef9d86cde1c5cf2191", size = 42546, upload-time = "2026-05-08T21:01:18.946Z" },
+    { url = "https://files.pythonhosted.org/packages/ae/4c/0ddbae64321bd4a95bcbfc19307238016b5b1fee645c84626c8d539e5b74/propcache-0.5.2-cp313-cp313t-win_amd64.whl", hash = "sha256:0958834041a0166d343b8d2cedcd8bcbaeb4fdbe0cf08320c5379f143c3be6e7", size = 46330, upload-time = "2026-05-08T21:01:20.162Z" },
+    { url = "https://files.pythonhosted.org/packages/00/d9/9cddc8efb78d8af264c5ec9f6d10b62f57c515feda8d321595f56010fb23/propcache-0.5.2-cp313-cp313t-win_arm64.whl", hash = "sha256:6de8bd93ddde9b992cf2b2e0d796d501a19026b5b9fd87356d7d0779531a8d96", size = 40521, upload-time = "2026-05-08T21:01:21.399Z" },
+    { url = "https://files.pythonhosted.org/packages/e2/ea/23ee535d90ce8bcc465a3028eb3cc0ce3bd1005f4bb27710b30587de798d/propcache-0.5.2-cp314-cp314-macosx_10_15_universal2.whl", hash = "sha256:46088abff4cba581dea21ae0467a480526cb25aa5f3c269e909f800328bc3999", size = 94662, upload-time = "2026-05-08T21:01:22.683Z" },
+    { url = "https://files.pythonhosted.org/packages/b5/06/c5a52f419b5d8972f8d46a7577476090d8e3263ff589ce40b5ca4968d5be/propcache-0.5.2-cp314-cp314-macosx_10_15_x86_64.whl", hash = "sha256:fc88b26f08d634f7bc819a7852e5214f5802641ab8d9fd5326892292eee1993e", size = 53928, upload-time = "2026-05-08T21:01:23.986Z" },
+    { url = "https://files.pythonhosted.org/packages/63/b1/4260d67d6bd85e58a66b72d54ce15d5de789b6f3870cc6bedf8ff9667401/propcache-0.5.2-cp314-cp314-macosx_11_0_arm64.whl", hash = "sha256:97797ebb098e670a2f92dd66f32897e30d7615b14e7f59711de23e30a9072539", size = 54650, upload-time = "2026-05-08T21:01:25.305Z" },
+    { url = "https://files.pythonhosted.org/packages/70/06/2f46c318e3307cd7a6a7481def374ce838c0fe20084b39dd54b0879d0e99/propcache-0.5.2-cp314-cp314-manylinux2014_aarch64.manylinux_2_17_aarch64.manylinux_2_28_aarch64.whl", hash = "sha256:ba57fffe4ac99c5d30076161b5866336d97600769bad35cc68f7774b15298a4e", size = 59912, upload-time = "2026-05-08T21:01:26.545Z" },
+    { url = "https://files.pythonhosted.org/packages/4c/29/fe1aebec2ce57ab985a9c382bded1124431f85078113aa222c5d278430d4/propcache-0.5.2-cp314-cp314-manylinux2014_ppc64le.manylinux_2_17_ppc64le.manylinux_2_28_ppc64le.whl", hash = "sha256:583c19759d9eec1e5b69e2fbef36a7d9c326041be9746cb822d335c8cedc2979", size = 63300, upload-time = "2026-05-08T21:01:27.937Z" },
+    { url = "https://files.pythonhosted.org/packages/b4/18/2334b26768b6c82be8c69e83671b767d5ef426aa09b0cba6c2ea47816774/propcache-0.5.2-cp314-cp314-manylinux2014_s390x.manylinux_2_17_s390x.manylinux_2_28_s390x.whl", hash = "sha256:d0326e2e5e1f3163fa306c834e48e8d490e5fae607a097a40c0648109b47ba80", size = 64208, upload-time = "2026-05-08T21:01:29.484Z" },
+    { url = "https://files.pythonhosted.org/packages/2b/76/7f1bfd6afff4c5e38e36a3c6d68eb5f4b7311ea80baf693db78d95b603c4/propcache-0.5.2-cp314-cp314-manylinux2014_x86_64.manylinux_2_17_x86_64.manylinux_2_28_x86_64.whl", hash = "sha256:e00820e192c8dbebcafb383ebbf99030895f09905e7a0eb2e0340a0bcc2bc825", size = 61633, upload-time = "2026-05-08T21:01:31.068Z" },
+    { url = "https://files.pythonhosted.org/packages/c4/46/b3ff8aba2b4953a3e50de2cf72f1b5748b8eca93b15f3dc2c84339084c09/propcache-0.5.2-cp314-cp314-manylinux_2_31_riscv64.manylinux_2_39_riscv64.whl", hash = "sha256:c66afea89b1e43725731d2004732a046fe6fe955d51f952c3e95a7314a284a39", size = 61724, upload-time = "2026-05-08T21:01:32.374Z" },
+    { url = "https://files.pythonhosted.org/packages/c5/01/814cfcafbcff954f94c01cf30e097ddc88a076b5440fbcf4570753437d40/propcache-0.5.2-cp314-cp314-musllinux_1_2_aarch64.whl", hash = "sha256:d4dc37dec6c6cdad0b57881a5658fd14fbf53e333b1a86cf86559f190e1d9ec4", size = 60069, upload-time = "2026-05-08T21:01:33.67Z" },
+    { url = "https://files.pythonhosted.org/packages/da/68/5c6f7622d510cc666a300687e06fd060c1a43361c0c9b20d284f06d8096a/propcache-0.5.2-cp314-cp314-musllinux_1_2_armv7l.whl", hash = "sha256:5570dbcc97571c15f68068e529c92715a12f8d54030e272d264b377e22bd17a5", size = 57099, upload-time = "2026-05-08T21:01:34.915Z" },
+    { url = "https://files.pythonhosted.org/packages/55/27/9cb0b4c679124085327957d42521c99dba04c88c90c3e55a6f0b633ebccc/propcache-0.5.2-cp314-cp314-musllinux_1_2_ppc64le.whl", hash = "sha256:f814362777a9f841adddb200ecdf8f5cb1e5a3c4b7a86378edbd6ccb26edd702", size = 63391, upload-time = "2026-05-08T21:01:36.231Z" },
+    { url = "https://files.pythonhosted.org/packages/f0/9d/7258aaa5bdf60fc6f27591eef6fe52768cb0beda7140be477c8b12c9794a/propcache-0.5.2-cp314-cp314-musllinux_1_2_riscv64.whl", hash = "sha256:196913dea116aeb5a2ba95af4ddcb7ea85559ae07d8eee8751688310d09168c3", size = 61626, upload-time = "2026-05-08T21:01:37.545Z" },
+    { url = "https://files.pythonhosted.org/packages/8e/0d/41c602003e8a9b16fe1e7eadf62c7bfba9d5474370b24200bf48b315f45f/propcache-0.5.2-cp314-cp314-musllinux_1_2_s390x.whl", hash = "sha256:6e7b8719005dd1175be4ab1cd25e9b98659a5e0347331506ec6760d2773a7fb5", size = 64781, upload-time = "2026-05-08T21:01:38.83Z" },
+    { url = "https://files.pythonhosted.org/packages/8b/f3/38e66b1856e9bd079deea015bc4a55f7767c0e4db2f7dcf69e7e680ba4ce/propcache-0.5.2-cp314-cp314-musllinux_1_2_x86_64.whl", hash = "sha256:51f96d685ab16e88cab128cd37a52c5da540809c8b879fa047731bfcb4ad35a4", size = 62570, upload-time = "2026-05-08T21:01:40.415Z" },
+    { url = "https://files.pythonhosted.org/packages/95/ca/bbfe9b910ce57dde8bb4876b4520fc02a4e89497c10de26be936758a3aaa/propcache-0.5.2-cp314-cp314-win32.whl", hash = "sha256:cc6fc3cc62e8501d3ed62894425040d2728ecddb1ed072737a5c70bd537aa9f0", size = 39436, upload-time = "2026-05-08T21:01:41.654Z" },
+    { url = "https://files.pythonhosted.org/packages/61/d2/45c9defbaa1ea297035d9d4cce9e8f80daafbf19319c6007f157c6256ea9/propcache-0.5.2-cp314-cp314-win_amd64.whl", hash = "sha256:81e3a30b0bb60caa22033dd0f8a3618d1d67356212514f62c57db75cb0ef410c", size = 42373, upload-time = "2026-05-08T21:01:43.041Z" },
+    { url = "https://files.pythonhosted.org/packages/44/68/9ea5103f41d5217d7d6ec24db90018e23aebec070c3f9a6e54d12b841fd8/propcache-0.5.2-cp314-cp314-win_arm64.whl", hash = "sha256:0d2c9bf8528f135dbb805ce027567e09164f7efa51a2be07458a2c0420f292d0", size = 38554, upload-time = "2026-05-08T21:01:44.336Z" },
+    { url = "https://files.pythonhosted.org/packages/8a/81/fadf555f42d3b762eea8a53950b0489fdc0aa9da5f8ed9e10ce0a4e01b48/propcache-0.5.2-cp314-cp314t-macosx_10_15_universal2.whl", hash = "sha256:4bc8ff1feffc6a61c7002ffe84634c41b822e104990ae009f44a0834430070bb", size = 99395, upload-time = "2026-05-08T21:01:45.883Z" },
+    { url = "https://files.pythonhosted.org/packages/f5/c9/c61e134a686949cf7971af3a390148b1156f7be81c73bc0cd12c873e2d48/propcache-0.5.2-cp314-cp314t-macosx_10_15_x86_64.whl", hash = "sha256:79aa3ff0a9b566633b642fa9caf7e21ed1c13d6feca718187873f199e1514078", size = 56653, upload-time = "2026-05-08T21:01:47.307Z" },
+    { url = "https://files.pythonhosted.org/packages/cb/73/daf935ea7048ddd7ec8eec5345b4a40b619d2d178b3c0a0900796bc3c794/propcache-0.5.2-cp314-cp314t-macosx_11_0_arm64.whl", hash = "sha256:1b31822f4474c4036bae62de9402710051d431a606d6a0f907fec79935a071aa", size = 56914, upload-time = "2026-05-08T21:01:48.573Z" },
+    { url = "https://files.pythonhosted.org/packages/79/9f/aba959b435ea18617edd7cf0a7ad0b9c574b8fc7e3d2cd55fb59cb255d33/propcache-0.5.2-cp314-cp314t-manylinux2014_aarch64.manylinux_2_17_aarch64.manylinux_2_28_aarch64.whl", hash = "sha256:13fef48778b5a2a756523fdb781326b028ca75e32858b04f2cdd19f394564917", size = 62567, upload-time = "2026-05-08T21:01:49.903Z" },
+    { url = "https://files.pythonhosted.org/packages/6c/a1/859942de9a791ff42f6141736f5b37749b8f53e65edfa49638c67dd67e6a/propcache-0.5.2-cp314-cp314t-manylinux2014_ppc64le.manylinux_2_17_ppc64le.manylinux_2_28_ppc64le.whl", hash = "sha256:8b73ab70f1a3351fbc71f663b3e645af6dd0329100c353081cf69c37433fc6fe", size = 65542, upload-time = "2026-05-08T21:01:51.204Z" },
+    { url = "https://files.pythonhosted.org/packages/b5/61/315bc0fd6c0fc7f80a528b8afd209e5fc4a875ea79571b91b8f50f442907/propcache-0.5.2-cp314-cp314t-manylinux2014_s390x.manylinux_2_17_s390x.manylinux_2_28_s390x.whl", hash = "sha256:5538d2c13d93e4698af7e092b57bc7298fd35d1d58e656ae18f23ee0d0378e03", size = 66845, upload-time = "2026-05-08T21:01:52.539Z" },
+    { url = "https://files.pythonhosted.org/packages/47/f7/9f8122e3132e8e354ac41975ef8f1099be7d5a16bc7ae562734e993665c0/propcache-0.5.2-cp314-cp314t-manylinux2014_x86_64.manylinux_2_17_x86_64.manylinux_2_28_x86_64.whl", hash = "sha256:cd645f03898405cabe694fb8bc35241e3a9c332ec85627584fe3de201452b335", size = 63985, upload-time = "2026-05-08T21:01:53.847Z" },
+    { url = "https://files.pythonhosted.org/packages/c8/54/c317819ec157cbf6f35df9df9657a6f82daf34d5faf15948b2f639c2192e/propcache-0.5.2-cp314-cp314t-manylinux_2_31_riscv64.manylinux_2_39_riscv64.whl", hash = "sha256:a473b3440261e0c60706e732b2ed2f517857344fc21bf48fdfe211e2d98eb285", size = 63999, upload-time = "2026-05-08T21:01:55.179Z" },
+    { url = "https://files.pythonhosted.org/packages/5a/56/387e3f7dfce0a9233df41fb888aa1c30222cb4bbbf09537c02dd9bd85fe2/propcache-0.5.2-cp314-cp314t-musllinux_1_2_aarch64.whl", hash = "sha256:7afa37062e6650640e932e4cc9297d81f9f42d9944029cc386b8247dea4da837", size = 62779, upload-time = "2026-05-08T21:01:57.489Z" },
+    { url = "https://files.pythonhosted.org/packages/a1/9c/596784cb5824ed61ee960d3f8655a3f0993e107c6e98ab6c818b7fb92ccb/propcache-0.5.2-cp314-cp314t-musllinux_1_2_armv7l.whl", hash = "sha256:8a90efd5777e996e42d568db9ac740b944d691e565cbfd31b2f7832f9184b2b8", size = 59796, upload-time = "2026-05-08T21:01:58.736Z" },
+    { url = "https://files.pythonhosted.org/packages/c2/3d/1a6cfa1726a48542c1e8784a0761421476a5b68e09b7f36bf95eb954aaba/propcache-0.5.2-cp314-cp314t-musllinux_1_2_ppc64le.whl", hash = "sha256:f19bb891234d72535764d703bfed1153cc34f4214d5bd7150aee1eec9e8f4366", size = 66023, upload-time = "2026-05-08T21:02:00.228Z" },
+    { url = "https://files.pythonhosted.org/packages/e4/0e/05fd6990369477076e4e280bcb970de760fddf0161a46e988bc95f7940ec/propcache-0.5.2-cp314-cp314t-musllinux_1_2_riscv64.whl", hash = "sha256:32775082acd2d807ee3db715c7770d38767b817870acfa08c29e057f3c4d5b56", size = 64448, upload-time = "2026-05-08T21:02:01.888Z" },
+    { url = "https://files.pythonhosted.org/packages/cd/86/5f8da315a4309c62c10c0b2516b17492d5d3bbe1bb862b96604db67e2a37/propcache-0.5.2-cp314-cp314t-musllinux_1_2_s390x.whl", hash = "sha256:9282fb1a3bccd038da9f768b927b24a0c753e466c086b7c4f3c6982851eefb2d", size = 67329, upload-time = "2026-05-08T21:02:03.484Z" },
+    { url = "https://files.pythonhosted.org/packages/da/d3/3368efe79ab21f0cdf86ef49895811c9cc933131d4cde1f28a624e22e712/propcache-0.5.2-cp314-cp314t-musllinux_1_2_x86_64.whl", hash = "sha256:cc49723e2f60d6b32a0f0b08a3fd6d13203c07f1cd9566cfce0f12a917c967a2", size = 65172, upload-time = "2026-05-08T21:02:04.745Z" },
+    { url = "https://files.pythonhosted.org/packages/d5/07/127e8b0bacfb325396196f9d976a22453049b89b9b2b08477cc3145faa44/propcache-0.5.2-cp314-cp314t-win32.whl", hash = "sha256:2d7aa89ebca5acc98cba9d1472d976e394782f587bad6661003602a619fd1821", size = 43813, upload-time = "2026-05-08T21:02:06.025Z" },
+    { url = "https://files.pythonhosted.org/packages/88/fb/46dad6c0ae49ed230ab1b16c890c2b6314e2403e6c412976f4a72d64a527/propcache-0.5.2-cp314-cp314t-win_amd64.whl", hash = "sha256:d447bb0b3054be5818458fbb171208b1d9ff11eba14e18ca18b90cbb45767370", size = 47764, upload-time = "2026-05-08T21:02:07.353Z" },
+    { url = "https://files.pythonhosted.org/packages/e7/c4/a47d0a63aa309d10d59ede6e9d4cff03a344a79d1f0f4cd0cd74997b53e0/propcache-0.5.2-cp314-cp314t-win_arm64.whl", hash = "sha256:fe67a3d11cd9b4efabfa45c3d00ffba2b26811442a73a581a94b67c2b5faccf6", size = 41140, upload-time = "2026-05-08T21:02:09.065Z" },
+    { url = "https://files.pythonhosted.org/packages/3a/ed/1cdcab6ba3d6ab7feca11fc14f0eeea80755bb53ef4e892079f31b10a25f/propcache-0.5.2-py3-none-any.whl", hash = "sha256:be1ddfcbb376e3de5d2e2db1d58d6d67463e6b4f9f040c000de8e300295465fe", size = 14036, upload-time = "2026-05-08T21:02:10.673Z" },
+]
+
+[[package]]
+name = "protobuf"
+version = "3.20.3"
+source = { registry = "https://pypi.org/simple" }
+sdist = { url = "https://files.pythonhosted.org/packages/55/5b/e3d951e34f8356e5feecacd12a8e3b258a1da6d9a03ad1770f28925f29bc/protobuf-3.20.3.tar.gz", hash = "sha256:2e3427429c9cffebf259491be0af70189607f365c2f41c7c3764af6f337105f2", size = 216768, upload-time = "2022-09-29T22:39:47.592Z" }
+wheels = [
+    { url = "https://files.pythonhosted.org/packages/8d/14/619e24a4c70df2901e1f4dbc50a6291eb63a759172558df326347dce1f0d/protobuf-3.20.3-py2.py3-none-any.whl", hash = "sha256:a7ca6d488aa8ff7f329d4c545b2dbad8ac31464f1d8b1c87ad1346717731e4db", size = 162128, upload-time = "2022-09-29T22:39:44.547Z" },
+]
+
+[[package]]
+name = "psutil"
+version = "7.2.2"
+source = { registry = "https://pypi.org/simple" }
+sdist = { url = "https://files.pythonhosted.org/packages/aa/c6/d1ddf4abb55e93cebc4f2ed8b5d6dbad109ecb8d63748dd2b20ab5e57ebe/psutil-7.2.2.tar.gz", hash = "sha256:0746f5f8d406af344fd547f1c8daa5f5c33dbc293bb8d6a16d80b4bb88f59372", size = 493740, upload-time = "2026-01-28T18:14:54.428Z" }
+wheels = [
+    { url = "https://files.pythonhosted.org/packages/51/08/510cbdb69c25a96f4ae523f733cdc963ae654904e8db864c07585ef99875/psutil-7.2.2-cp313-cp313t-macosx_10_13_x86_64.whl", hash = "sha256:2edccc433cbfa046b980b0df0171cd25bcaeb3a68fe9022db0979e7aa74a826b", size = 130595, upload-time = "2026-01-28T18:14:57.293Z" },
+    { url = "https://files.pythonhosted.org/packages/d6/f5/97baea3fe7a5a9af7436301f85490905379b1c6f2dd51fe3ecf24b4c5fbf/psutil-7.2.2-cp313-cp313t-macosx_11_0_arm64.whl", hash = "sha256:e78c8603dcd9a04c7364f1a3e670cea95d51ee865e4efb3556a3a63adef958ea", size = 131082, upload-time = "2026-01-28T18:14:59.732Z" },
+    { url = "https://files.pythonhosted.org/packages/37/d6/246513fbf9fa174af531f28412297dd05241d97a75911ac8febefa1a53c6/psutil-7.2.2-cp313-cp313t-manylinux2010_x86_64.manylinux_2_12_x86_64.manylinux_2_28_x86_64.whl", hash = "sha256:1a571f2330c966c62aeda00dd24620425d4b0cc86881c89861fbc04549e5dc63", size = 181476, upload-time = "2026-01-28T18:15:01.884Z" },
+    { url = "https://files.pythonhosted.org/packages/b8/b5/9182c9af3836cca61696dabe4fd1304e17bc56cb62f17439e1154f225dd3/psutil-7.2.2-cp313-cp313t-manylinux2014_aarch64.manylinux_2_17_aarch64.manylinux_2_28_aarch64.whl", hash = "sha256:917e891983ca3c1887b4ef36447b1e0873e70c933afc831c6b6da078ba474312", size = 184062, upload-time = "2026-01-28T18:15:04.436Z" },
+    { url = "https://files.pythonhosted.org/packages/16/ba/0756dca669f5a9300d0cbcbfae9a4c30e446dfc7440ffe43ded5724bfd93/psutil-7.2.2-cp313-cp313t-win_amd64.whl", hash = "sha256:ab486563df44c17f5173621c7b198955bd6b613fb87c71c161f827d3fb149a9b", size = 139893, upload-time = "2026-01-28T18:15:06.378Z" },
+    { url = "https://files.pythonhosted.org/packages/1c/61/8fa0e26f33623b49949346de05ec1ddaad02ed8ba64af45f40a147dbfa97/psutil-7.2.2-cp313-cp313t-win_arm64.whl", hash = "sha256:ae0aefdd8796a7737eccea863f80f81e468a1e4cf14d926bd9b6f5f2d5f90ca9", size = 135589, upload-time = "2026-01-28T18:15:08.03Z" },
+    { url = "https://files.pythonhosted.org/packages/81/69/ef179ab5ca24f32acc1dac0c247fd6a13b501fd5534dbae0e05a1c48b66d/psutil-7.2.2-cp314-cp314t-macosx_10_15_x86_64.whl", hash = "sha256:eed63d3b4d62449571547b60578c5b2c4bcccc5387148db46e0c2313dad0ee00", size = 130664, upload-time = "2026-01-28T18:15:09.469Z" },
+    { url = "https://files.pythonhosted.org/packages/7b/64/665248b557a236d3fa9efc378d60d95ef56dd0a490c2cd37dafc7660d4a9/psutil-7.2.2-cp314-cp314t-macosx_11_0_arm64.whl", hash = "sha256:7b6d09433a10592ce39b13d7be5a54fbac1d1228ed29abc880fb23df7cb694c9", size = 131087, upload-time = "2026-01-28T18:15:11.724Z" },
+    { url = "https://files.pythonhosted.org/packages/d5/2e/e6782744700d6759ebce3043dcfa661fb61e2fb752b91cdeae9af12c2178/psutil-7.2.2-cp314-cp314t-manylinux2010_x86_64.manylinux_2_12_x86_64.manylinux_2_28_x86_64.whl", hash = "sha256:1fa4ecf83bcdf6e6c8f4449aff98eefb5d0604bf88cb883d7da3d8d2d909546a", size = 182383, upload-time = "2026-01-28T18:15:13.445Z" },
+    { url = "https://files.pythonhosted.org/packages/57/49/0a41cefd10cb7505cdc04dab3eacf24c0c2cb158a998b8c7b1d27ee2c1f5/psutil-7.2.2-cp314-cp314t-manylinux2014_aarch64.manylinux_2_17_aarch64.manylinux_2_28_aarch64.whl", hash = "sha256:e452c464a02e7dc7822a05d25db4cde564444a67e58539a00f929c51eddda0cf", size = 185210, upload-time = "2026-01-28T18:15:16.002Z" },
+    { url = "https://files.pythonhosted.org/packages/dd/2c/ff9bfb544f283ba5f83ba725a3c5fec6d6b10b8f27ac1dc641c473dc390d/psutil-7.2.2-cp314-cp314t-win_amd64.whl", hash = "sha256:c7663d4e37f13e884d13994247449e9f8f574bc4655d509c3b95e9ec9e2b9dc1", size = 141228, upload-time = "2026-01-28T18:15:18.385Z" },
+    { url = "https://files.pythonhosted.org/packages/f2/fc/f8d9c31db14fcec13748d373e668bc3bed94d9077dbc17fb0eebc073233c/psutil-7.2.2-cp314-cp314t-win_arm64.whl", hash = "sha256:11fe5a4f613759764e79c65cf11ebdf26e33d6dd34336f8a337aa2996d71c841", size = 136284, upload-time = "2026-01-28T18:15:19.912Z" },
+    { url = "https://files.pythonhosted.org/packages/e7/36/5ee6e05c9bd427237b11b3937ad82bb8ad2752d72c6969314590dd0c2f6e/psutil-7.2.2-cp36-abi3-macosx_10_9_x86_64.whl", hash = "sha256:ed0cace939114f62738d808fdcecd4c869222507e266e574799e9c0faa17d486", size = 129090, upload-time = "2026-01-28T18:15:22.168Z" },
+    { url = "https://files.pythonhosted.org/packages/80/c4/f5af4c1ca8c1eeb2e92ccca14ce8effdeec651d5ab6053c589b074eda6e1/psutil-7.2.2-cp36-abi3-macosx_11_0_arm64.whl", hash = "sha256:1a7b04c10f32cc88ab39cbf606e117fd74721c831c98a27dc04578deb0c16979", size = 129859, upload-time = "2026-01-28T18:15:23.795Z" },
+    { url = "https://files.pythonhosted.org/packages/b5/70/5d8df3b09e25bce090399cf48e452d25c935ab72dad19406c77f4e828045/psutil-7.2.2-cp36-abi3-manylinux2010_x86_64.manylinux_2_12_x86_64.manylinux_2_28_x86_64.whl", hash = "sha256:076a2d2f923fd4821644f5ba89f059523da90dc9014e85f8e45a5774ca5bc6f9", size = 155560, upload-time = "2026-01-28T18:15:25.976Z" },
+    { url = "https://files.pythonhosted.org/packages/63/65/37648c0c158dc222aba51c089eb3bdfa238e621674dc42d48706e639204f/psutil-7.2.2-cp36-abi3-manylinux2014_aarch64.manylinux_2_17_aarch64.manylinux_2_28_aarch64.whl", hash = "sha256:b0726cecd84f9474419d67252add4ac0cd9811b04d61123054b9fb6f57df6e9e", size = 156997, upload-time = "2026-01-28T18:15:27.794Z" },
+    { url = "https://files.pythonhosted.org/packages/8e/13/125093eadae863ce03c6ffdbae9929430d116a246ef69866dad94da3bfbc/psutil-7.2.2-cp36-abi3-musllinux_1_2_aarch64.whl", hash = "sha256:fd04ef36b4a6d599bbdb225dd1d3f51e00105f6d48a28f006da7f9822f2606d8", size = 148972, upload-time = "2026-01-28T18:15:29.342Z" },
+    { url = "https://files.pythonhosted.org/packages/04/78/0acd37ca84ce3ddffaa92ef0f571e073faa6d8ff1f0559ab1272188ea2be/psutil-7.2.2-cp36-abi3-musllinux_1_2_x86_64.whl", hash = "sha256:b58fabe35e80b264a4e3bb23e6b96f9e45a3df7fb7eed419ac0e5947c61e47cc", size = 148266, upload-time = "2026-01-28T18:15:31.597Z" },
+    { url = "https://files.pythonhosted.org/packages/b4/90/e2159492b5426be0c1fef7acba807a03511f97c5f86b3caeda6ad92351a7/psutil-7.2.2-cp37-abi3-win_amd64.whl", hash = "sha256:eb7e81434c8d223ec4a219b5fc1c47d0417b12be7ea866e24fb5ad6e84b3d988", size = 137737, upload-time = "2026-01-28T18:15:33.849Z" },
+    { url = "https://files.pythonhosted.org/packages/8c/c7/7bb2e321574b10df20cbde462a94e2b71d05f9bbda251ef27d104668306a/psutil-7.2.2-cp37-abi3-win_arm64.whl", hash = "sha256:8c233660f575a5a89e6d4cb65d9f938126312bca76d8fe087b947b3a1aaac9ee", size = 134617, upload-time = "2026-01-28T18:15:36.514Z" },
+]
+
+[[package]]
+name = "pudb"
+version = "2024.1.2"
+source = { registry = "https://pypi.org/simple" }
+dependencies = [
+    { name = "jedi" },
+    { name = "packaging" },
+    { name = "pygments" },
+    { name = "urwid" },
+    { name = "urwid-readline" },
+]
+sdist = { url = "https://files.pythonhosted.org/packages/54/70/fc7d81b7ac439d5e21c8c2b51e15cdc6632b720b02219057fe098a80e766/pudb-2024.1.2.tar.gz", hash = "sha256:adc9b00042ba8367117df0a6c0dc62fa9609abd21c3bf8e5b73d620907c5b43e", size = 226551, upload-time = "2024-07-22T19:28:26.69Z" }
+wheels = [
+    { url = "https://files.pythonhosted.org/packages/2e/cd/fe7346d1a082e74e89d029524b384b74eca17c62ac033609b95782ef16af/pudb-2024.1.2-py3-none-any.whl", hash = "sha256:4726c288d9f57845b8dba706c70eb6faaddff9d86e5208eda82216ef5e79cc2e", size = 87493, upload-time = "2024-07-22T19:28:24.079Z" },
+]
+
+[[package]]
+name = "py-cpuinfo"
+version = "9.0.0"
+source = { registry = "https://pypi.org/simple" }
+sdist = { url = "https://files.pythonhosted.org/packages/37/a8/d832f7293ebb21690860d2e01d8115e5ff6f2ae8bbdc953f0eb0fa4bd2c7/py-cpuinfo-9.0.0.tar.gz", hash = "sha256:3cdbbf3fac90dc6f118bfd64384f309edeadd902d7c8fb17f02ffa1fc3f49690", size = 104716, upload-time = "2022-10-25T20:38:06.303Z" }
+wheels = [
+    { url = "https://files.pythonhosted.org/packages/e0/a9/023730ba63db1e494a271cb018dcd361bd2c917ba7004c3e49d5daf795a2/py_cpuinfo-9.0.0-py3-none-any.whl", hash = "sha256:859625bc251f64e21f077d099d4162689c762b5d6a4c3c97553d56241c9674d5", size = 22335, upload-time = "2022-10-25T20:38:27.636Z" },
+]
+
+[[package]]
+name = "pydantic"
+version = "2.13.4"
+source = { registry = "https://pypi.org/simple" }
+dependencies = [
+    { name = "annotated-types" },
+    { name = "pydantic-core" },
+    { name = "typing-extensions" },
+    { name = "typing-inspection" },
+]
+sdist = { url = "https://files.pythonhosted.org/packages/18/a5/b60d21ac674192f8ab0ba4e9fd860690f9b4a6e51ca5df118733b487d8d6/pydantic-2.13.4.tar.gz", hash = "sha256:c40756b57adaa8b1efeeced5c196f3f3b7c435f90e84ea7f443901bec8099ef6", size = 844775, upload-time = "2026-05-06T13:43:05.343Z" }
+wheels = [
+    { url = "https://files.pythonhosted.org/packages/fd/7b/122376b1fd3c62c1ed9dc80c931ace4844b3c55407b6fb2d199377c9736f/pydantic-2.13.4-py3-none-any.whl", hash = "sha256:45a282cde31d808236fd7ea9d919b128653c8b38b393d1c4ab335c62924d9aba", size = 472262, upload-time = "2026-05-06T13:43:02.641Z" },
+]
+
+[[package]]
+name = "pydantic-core"
+version = "2.46.4"
+source = { registry = "https://pypi.org/simple" }
+dependencies = [
+    { name = "typing-extensions" },
+]
+sdist = { url = "https://files.pythonhosted.org/packages/9d/56/921726b776ace8d8f5db44c4ef961006580d91dc52b803c489fafd1aa249/pydantic_core-2.46.4.tar.gz", hash = "sha256:62f875393d7f270851f20523dd2e29f082bcc82292d66db2b64ea71f64b6e1c1", size = 471464, upload-time = "2026-05-06T13:37:06.98Z" }
+wheels = [
+    { url = "https://files.pythonhosted.org/packages/5c/fa/6d7708d2cfc1a832acb6aeb0cd16e801902df8a0f583bb3b4b527fde022e/pydantic_core-2.46.4-cp311-cp311-macosx_10_12_x86_64.whl", hash = "sha256:0e96592440881c74a213e5ad528e2b24d3d4f940de2766bed9010ab1d9e51594", size = 2111872, upload-time = "2026-05-06T13:40:27.596Z" },
+    { url = "https://files.pythonhosted.org/packages/ae/6f/aa064a3e74b5745afbdf250594f38e7ead05e2d651bcb35994b9417a0d4d/pydantic_core-2.46.4-cp311-cp311-macosx_11_0_arm64.whl", hash = "sha256:e0d65b8c354be7fb5f720c3caa8bc940bc2d20ce749c8e06135f07f8ed95dd7c", size = 1948255, upload-time = "2026-05-06T13:39:12.574Z" },
+    { url = "https://files.pythonhosted.org/packages/43/3a/41114a9f7569b84b4d84e7a018c57c56347dac30c0d4a872946ec4e36c46/pydantic_core-2.46.4-cp311-cp311-manylinux_2_17_aarch64.manylinux2014_aarch64.whl", hash = "sha256:7bfb192b3f4b9e8a89b6277b6ce787564f62cfd272055f6e685726b111dc7826", size = 1972827, upload-time = "2026-05-06T13:38:19.841Z" },
+    { url = "https://files.pythonhosted.org/packages/ef/25/1ab42e8048fe551934d9884e8d64daa7e990ad386f310a15981aeb6a5b08/pydantic_core-2.46.4-cp311-cp311-manylinux_2_17_armv7l.manylinux2014_armv7l.whl", hash = "sha256:9037063db01f09b09e237c282b6792bd4da634b5402c4e7f0c61effed7701a04", size = 2041051, upload-time = "2026-05-06T13:38:10.447Z" },
+    { url = "https://files.pythonhosted.org/packages/94/c2/1a934597ddf08da410385b3b7aae91956a5a76c635effef456074fad7e88/pydantic_core-2.46.4-cp311-cp311-manylinux_2_17_ppc64le.manylinux2014_ppc64le.whl", hash = "sha256:fc010ab034c8c7452522748bf937df58020d256ccae0874463d1f4d01758af8e", size = 2221314, upload-time = "2026-05-06T13:40:13.089Z" },
+    { url = "https://files.pythonhosted.org/packages/02/6d/9e8ad178c9c4df27ad3c8f25d1fe2a7ab0d2ba0559fad4aee5d3d1f16771/pydantic_core-2.46.4-cp311-cp311-manylinux_2_17_s390x.manylinux2014_s390x.whl", hash = "sha256:8c5dac79fa1614d1e06ca695109c6105923bd9c7d1d6c918d4e637b7e6b32fd3", size = 2285146, upload-time = "2026-05-06T13:38:59.224Z" },
+    { url = "https://files.pythonhosted.org/packages/80/50/540cd3aeefc041beb111125c4bff779831a2111fc6b15a9138cda277d32c/pydantic_core-2.46.4-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl", hash = "sha256:f9fa868638bf362d3d138ea55829cefb3d5f4b0d7f142234382a15e2485dbec4", size = 2089685, upload-time = "2026-05-06T13:38:17.762Z" },
+    { url = "https://files.pythonhosted.org/packages/6b/a4/b440ad35f05f6a38f89fa0f149accb3f0e02be94ca5e15f3c449a61b4bc9/pydantic_core-2.46.4-cp311-cp311-manylinux_2_31_riscv64.whl", hash = "sha256:17299feefe090f2caa5b8e37222bb5f663e4935a8bfa6931d4102e5df1a9f398", size = 2115420, upload-time = "2026-05-06T13:37:58.195Z" },
+    { url = "https://files.pythonhosted.org/packages/99/61/de4f55db8dfd57bfdfa9a12ec90fe1b57c4f41062f7ca86f08586b3e0ac0/pydantic_core-2.46.4-cp311-cp311-manylinux_2_5_i686.manylinux1_i686.whl", hash = "sha256:4c63ebc82684aa89d9a3bcbd13d515b3be44250dc68dd3bd81526c1cb31286c3", size = 2165122, upload-time = "2026-05-06T13:37:01.167Z" },
+    { url = "https://files.pythonhosted.org/packages/f7/52/7c529d7bdb2d1068bd52f51fe32572c8301f9a4febf1948f10639f1436f5/pydantic_core-2.46.4-cp311-cp311-musllinux_1_1_aarch64.whl", hash = "sha256:aaa2a54443eff1950ba5ddc6b6ccda0d9c84a364276a62f969bdf2a390650848", size = 2182573, upload-time = "2026-05-06T13:38:45.04Z" },
+    { url = "https://files.pythonhosted.org/packages/37/b3/7c40325848ba78247f2812dcf9c7274e38cd801820ca6dd9fe63bcfb0eb4/pydantic_core-2.46.4-cp311-cp311-musllinux_1_1_armv7l.whl", hash = "sha256:18e5ceec2ab67e6d5f1a9085e5a24c9c4e2ac4545730bfe668680bca05e555f3", size = 2317139, upload-time = "2026-05-06T13:37:15.539Z" },
+    { url = "https://files.pythonhosted.org/packages/d9/37/f913f81a657c865b75da6c0dbed79876073c2a43b5bd9edbe8da785e4d49/pydantic_core-2.46.4-cp311-cp311-musllinux_1_1_x86_64.whl", hash = "sha256:a0f62d0a58f4e7da165457e995725421e0064f2255d8eccebc49f41bbc23b109", size = 2360433, upload-time = "2026-05-06T13:37:30.099Z" },
+    { url = "https://files.pythonhosted.org/packages/c4/67/6acaa1be2567f9256b056d8477158cac7240813956ce86e49deae8e173b4/pydantic_core-2.46.4-cp311-cp311-win32.whl", hash = "sha256:041bde0a48fd37cf71cab1c9d56d3e8625a3793fef1f7dd232b3ff37e978ecda", size = 1985513, upload-time = "2026-05-06T13:38:15.669Z" },
+    { url = "https://files.pythonhosted.org/packages/aa/e6/c505f83dfeda9a2e5c995cfd872949e4d05e12f7feb3dca72f633daefa94/pydantic_core-2.46.4-cp311-cp311-win_amd64.whl", hash = "sha256:6f2eeda33a839975441c86a4119e1383c50b47faf0cbb5176985565c6bb02c33", size = 2071114, upload-time = "2026-05-06T13:40:35.416Z" },
+    { url = "https://files.pythonhosted.org/packages/0f/da/7a263a96d965d9d0df5e8de8a475f33495451117035b09acb110288c381f/pydantic_core-2.46.4-cp311-cp311-win_arm64.whl", hash = "sha256:14f4c5d6db102bd796a627bbb3a17b4cf4574b9ae861d8b7c9a9661c6dd3362d", size = 2044298, upload-time = "2026-05-06T13:38:29.754Z" },
+    { url = "https://files.pythonhosted.org/packages/ce/8c/af022f0af448d7747c5154288d46b5f2bc5f17366eaa0e23e9aa04d59f3b/pydantic_core-2.46.4-cp312-cp312-macosx_10_12_x86_64.whl", hash = "sha256:3245406455a5d98187ec35530fd772b1d799b26667980872c8d4614991e2c4a2", size = 2106158, upload-time = "2026-05-06T13:38:57.215Z" },
+    { url = "https://files.pythonhosted.org/packages/19/95/6195171e385007300f0f5574592e467c568becce2d937a0b6804f218bc49/pydantic_core-2.46.4-cp312-cp312-macosx_11_0_arm64.whl", hash = "sha256:962ccbab7b642487b1d8b7df90ef677e03134cf1fd8880bf698649b22a69371f", size = 1951724, upload-time = "2026-05-06T13:37:02.697Z" },
+    { url = "https://files.pythonhosted.org/packages/8e/bc/f47d1ff9cbb1620e1b5b697eef06010035735f07820180e74178226b27b3/pydantic_core-2.46.4-cp312-cp312-manylinux_2_17_aarch64.manylinux2014_aarch64.whl", hash = "sha256:8233f2947cf85404441fd7e0085f53b10c93e0ee78611099b5c7237e36aacbf7", size = 1975742, upload-time = "2026-05-06T13:37:09.448Z" },
+    { url = "https://files.pythonhosted.org/packages/5b/11/9b9a5b0306345664a2da6410877af6e8082481b5884b3ddd78d47c6013ce/pydantic_core-2.46.4-cp312-cp312-manylinux_2_17_armv7l.manylinux2014_armv7l.whl", hash = "sha256:3a233125ac121aa3ffba9a2b59edfc4a985a76092dc8279586ab4b71390875e7", size = 2052418, upload-time = "2026-05-06T13:37:38.234Z" },
+    { url = "https://files.pythonhosted.org/packages/f1/b7/a65fec226f5d78fc39f4a13c4cc0c768c22b113438f60c14adc9d2865038/pydantic_core-2.46.4-cp312-cp312-manylinux_2_17_ppc64le.manylinux2014_ppc64le.whl", hash = "sha256:5b712b53160b79a5850310b912a5ef8e57e56947c8ad690c227f5c9d7e561712", size = 2232274, upload-time = "2026-05-06T13:38:27.753Z" },
+    { url = "https://files.pythonhosted.org/packages/68/f0/92039db98b907ef49269a8271f67db9cb78ae2fc68062ef7e4e77adb5f61/pydantic_core-2.46.4-cp312-cp312-manylinux_2_17_s390x.manylinux2014_s390x.whl", hash = "sha256:9401557acd873c3a7f3eb9383edef8ac4968f9510e340f4808d427e75667e7b4", size = 2309940, upload-time = "2026-05-06T13:38:05.353Z" },
+    { url = "https://files.pythonhosted.org/packages/5f/97/2aab507d3d00ca626e8e57c1eac6a79e4e5fbcc63eb99733ff55d1717f65/pydantic_core-2.46.4-cp312-cp312-manylinux_2_17_x86_64.manylinux2014_x86_64.whl", hash = "sha256:926c9541b14b12b1681dca8a0b75feb510b06c6341b70a8e500c2fdcff837cce", size = 2094516, upload-time = "2026-05-06T13:39:10.577Z" },
+    { url = "https://files.pythonhosted.org/packages/22/37/a8aca44d40d737dde2bc05b3c6c07dff0de07ce6f82e9f3167aeaf4d5dea/pydantic_core-2.46.4-cp312-cp312-manylinux_2_31_riscv64.whl", hash = "sha256:56cb4851bcaf3d117eddcef4fe66afd750a50274b0da8e22be256d10e5611987", size = 2136854, upload-time = "2026-05-06T13:40:22.59Z" },
+    { url = "https://files.pythonhosted.org/packages/24/99/fcef1b79238c06a8cbec70819ac722ba76e02bc8ada9b0fd66eba40da01b/pydantic_core-2.46.4-cp312-cp312-manylinux_2_5_i686.manylinux1_i686.whl", hash = "sha256:c68fcd102d71ea85c5b2dfac3f4f8476eff42a9e078fd5faefff6d145063536b", size = 2180306, upload-time = "2026-05-06T13:40:10.666Z" },
+    { url = "https://files.pythonhosted.org/packages/ae/6c/fc44000918855b42779d007ae63b0532794739027b2f417321cddbc44f6a/pydantic_core-2.46.4-cp312-cp312-musllinux_1_1_aarch64.whl", hash = "sha256:b2f69dec1725e79a012d920df1707de5caf7ed5e08f3be4435e25803efc47458", size = 2190044, upload-time = "2026-05-06T13:40:43.231Z" },
+    { url = "https://files.pythonhosted.org/packages/6b/65/d9cadc9f1920d7a127ad2edba16c1db7916e59719285cd6c94600b0080ba/pydantic_core-2.46.4-cp312-cp312-musllinux_1_1_armv7l.whl", hash = "sha256:8d0820e8192167f80d88d64038e609c31452eeca865b4e1d9950a27a4609b00b", size = 2329133, upload-time = "2026-05-06T13:39:57.365Z" },
+    { url = "https://files.pythonhosted.org/packages/d0/cf/c873d91679f3a30bcf5e7ac280ce5573483e72295307685120d0d5ad3416/pydantic_core-2.46.4-cp312-cp312-musllinux_1_1_x86_64.whl", hash = "sha256:fbdb89b3e1c94a30cc5edfce477c6e6a5dc4d8f84665b455c27582f211a1c72c", size = 2374464, upload-time = "2026-05-06T13:38:06.976Z" },
+    { url = "https://files.pythonhosted.org/packages/47/bd/6f2fc8188f31bf10590f1e98e7b306336161fac930a8c514cd7bd828c7dc/pydantic_core-2.46.4-cp312-cp312-win32.whl", hash = "sha256:9aa768456404a8bf48a4406685ac2bec8e72b62c69313734fa3b73cf33b3a894", size = 1974823, upload-time = "2026-05-06T13:40:47.985Z" },
+    { url = "https://files.pythonhosted.org/packages/40/8c/985c1d41ea1107c2534abd9870e4ed5c8e7669b5c308297835c001e7a1c4/pydantic_core-2.46.4-cp312-cp312-win_amd64.whl", hash = "sha256:e9c26f834c65f5752f3f06cb08cb86a913ceb7274d0db6e267808a708b46bc89", size = 2072919, upload-time = "2026-05-06T13:39:21.153Z" },
+    { url = "https://files.pythonhosted.org/packages/c4/ba/f463d006e0c47373ca7ec5e1a261c59dc01ef4d62b2657af925fb0deee3a/pydantic_core-2.46.4-cp312-cp312-win_arm64.whl", hash = "sha256:4fc73cb559bdb54b1134a706a2802a4cddd27a0633f5abb7e53056268751ac6a", size = 2027604, upload-time = "2026-05-06T13:39:03.753Z" },
+    { url = "https://files.pythonhosted.org/packages/51/a2/5d30b469c5267a17b39dec53208222f76a8d351dfac4af661888c5aee77d/pydantic_core-2.46.4-cp313-cp313-macosx_10_12_x86_64.whl", hash = "sha256:5d5902252db0d3cedf8d4a1bc68f70eeb430f7e4c7104c8c476753519b423008", size = 2106306, upload-time = "2026-05-06T13:37:48.029Z" },
+    { url = "https://files.pythonhosted.org/packages/c1/81/4fa520eaffa8bd7d1525e644cd6d39e7d60b1592bc5b516693c7340b50f1/pydantic_core-2.46.4-cp313-cp313-macosx_11_0_arm64.whl", hash = "sha256:c94f0688e7b8d0a67abf40e57a7eaaecd17cc9586706a31b76c031f63df052b4", size = 1951906, upload-time = "2026-05-06T13:37:17.012Z" },
+    { url = "https://files.pythonhosted.org/packages/03/d5/fd02da45b659668b05923b17ba3a0100a0a3d5541e3bd8fcc4ecb711309e/pydantic_core-2.46.4-cp313-cp313-manylinux_2_17_aarch64.manylinux2014_aarch64.whl", hash = "sha256:f027324c56cd5406ca49c124b0db10e56c69064fec039acc571c29020cc87c76", size = 1976802, upload-time = "2026-05-06T13:37:35.113Z" },
+    { url = "https://files.pythonhosted.org/packages/21/f2/95727e1368be3d3ed485eaab7adbd7dda408f33f7a36e8b48e0144002b91/pydantic_core-2.46.4-cp313-cp313-manylinux_2_17_armv7l.manylinux2014_armv7l.whl", hash = "sha256:e739fee756ba1010f8bcccb534252e85a35fe45ae92c295a06059ce58b74ccd3", size = 2052446, upload-time = "2026-05-06T13:37:12.313Z" },
+    { url = "https://files.pythonhosted.org/packages/9c/86/5d99feea3f77c7234b8718075b23db11532773c1a0dbd9b9490215dc2eeb/pydantic_core-2.46.4-cp313-cp313-manylinux_2_17_ppc64le.manylinux2014_ppc64le.whl", hash = "sha256:9d56801be94b86a9da183e5f3766e6310752b99ff647e38b09a9500d88e46e76", size = 2232757, upload-time = "2026-05-06T13:39:01.149Z" },
+    { url = "https://files.pythonhosted.org/packages/d2/3a/508ac615935ef7588cf6d9e9b91309fdc2da751af865e02a9098de88258c/pydantic_core-2.46.4-cp313-cp313-manylinux_2_17_s390x.manylinux2014_s390x.whl", hash = "sha256:2412e734dcb48da14d4e4006b82b46b74f2518b8a26ee7e58c6844a6cd6d03c4", size = 2309275, upload-time = "2026-05-06T13:37:41.406Z" },
+    { url = "https://files.pythonhosted.org/packages/07/f8/41db9de19d7987d6b04715a02b3b40aea467000275d9d758ffaa31af7d50/pydantic_core-2.46.4-cp313-cp313-manylinux_2_17_x86_64.manylinux2014_x86_64.whl", hash = "sha256:9551187363ffc0de2a00b2e47c25aeaeb1020b69b668762966df15fc5659dd5a", size = 2094467, upload-time = "2026-05-06T13:39:18.847Z" },
+    { url = "https://files.pythonhosted.org/packages/2c/e2/f35033184cb11d0052daf4416e8e10a502ea2ac006fc4f459aee872727d1/pydantic_core-2.46.4-cp313-cp313-manylinux_2_31_riscv64.whl", hash = "sha256:0186750b482eefa11d7f435892b09c5c606193ef3375bcf94aa00ae6bfb66262", size = 2134417, upload-time = "2026-05-06T13:40:17.944Z" },
+    { url = "https://files.pythonhosted.org/packages/7e/7b/6ceeb1cc90e193862f444ebe373d8fdf613f0a82572dde03fb10734c6c71/pydantic_core-2.46.4-cp313-cp313-manylinux_2_5_i686.manylinux1_i686.whl", hash = "sha256:5855698a4856556d86e8e6cd8434bc3ac0314ee8e12089ae0e143f64c6256e4e", size = 2179782, upload-time = "2026-05-06T13:40:32.618Z" },
+    { url = "https://files.pythonhosted.org/packages/5a/f2/c8d7773ede6af08036423a00ae0ceffce266c3c52a096c435d68c896083f/pydantic_core-2.46.4-cp313-cp313-musllinux_1_1_aarch64.whl", hash = "sha256:cbaf13819775b7f769bf4a1f066cb6df7a28d4480081a589828ef190226881cd", size = 2188782, upload-time = "2026-05-06T13:36:51.018Z" },
+    { url = "https://files.pythonhosted.org/packages/59/31/0c864784e31f09f05cdd87606f08923b9c9e7f6e51dd27f20f62f975ce9f/pydantic_core-2.46.4-cp313-cp313-musllinux_1_1_armv7l.whl", hash = "sha256:633147d34cf4550417f12e2b1a0383973bdf5cdfde212cb09e9a581cf10820be", size = 2328334, upload-time = "2026-05-06T13:40:37.764Z" },
+    { url = "https://files.pythonhosted.org/packages/c2/eb/4f6c8a41efa30baa755590f4141abf3a8c370fab610915733e74134a7270/pydantic_core-2.46.4-cp313-cp313-musllinux_1_1_x86_64.whl", hash = "sha256:82cf5301172168103724d49a1444d3378cb20cdee30b116a1bd6031236298a5d", size = 2372986, upload-time = "2026-05-06T13:39:34.152Z" },
+    { url = "https://files.pythonhosted.org/packages/5b/24/b375a480d53113860c299764bfe9f349a3dc9108b3adc0d7f0d786492ebf/pydantic_core-2.46.4-cp313-cp313-win32.whl", hash = "sha256:9fa8ae11da9e2b3126c6426f147e0fba88d96d65921799bb30c6abd1cb2c97fb", size = 1973693, upload-time = "2026-05-06T13:37:55.072Z" },
+    { url = "https://files.pythonhosted.org/packages/7e/e8/cff247591966f2d22ec8c003cd7587e27b7ba7b81ab2fb888e3ab75dc285/pydantic_core-2.46.4-cp313-cp313-win_amd64.whl", hash = "sha256:6b3ace8194b0e5204818c92802dcdca7fc6d88aabbb799d7c795540d9cd6d292", size = 2071819, upload-time = "2026-05-06T13:38:49.139Z" },
+    { url = "https://files.pythonhosted.org/packages/c6/1a/f4aee670d5670e9e148e0c82c7db98d780be566c6e6a97ee8035528ca0b3/pydantic_core-2.46.4-cp313-cp313-win_arm64.whl", hash = "sha256:184c081504d17f1c1066e430e117142b2c77d9448a97f7b65c6ac9fd9aee238d", size = 2027411, upload-time = "2026-05-06T13:40:45.796Z" },
+    { url = "https://files.pythonhosted.org/packages/8d/74/228a26ddad29c6672b805d9fd78e8d251cd04004fa7eed0e622096cd0250/pydantic_core-2.46.4-cp314-cp314-macosx_10_12_x86_64.whl", hash = "sha256:428e04521a40150c85216fc8b85e8d39fece235a9cf5e383761238c7fa9b96fb", size = 2102079, upload-time = "2026-05-06T13:38:41.019Z" },
+    { url = "https://files.pythonhosted.org/packages/ad/1f/8970b150a4b4365623ae00fc88603491f763c627311ae8031e3111356d6e/pydantic_core-2.46.4-cp314-cp314-macosx_11_0_arm64.whl", hash = "sha256:23ace664830ee0bfe014a0c7bc248b1f7f25ed7ad103852c317624a1083af462", size = 1952179, upload-time = "2026-05-06T13:36:59.812Z" },
+    { url = "https://files.pythonhosted.org/packages/95/30/5211a831ae054928054b2f79731661087a2bc5c01e825c672b3a4a8f1b3e/pydantic_core-2.46.4-cp314-cp314-manylinux_2_17_aarch64.manylinux2014_aarch64.whl", hash = "sha256:ce5c1d2a8b27468f433ca974829c44060b8097eedc39933e3c206a90ee49c4a9", size = 1978926, upload-time = "2026-05-06T13:37:39.933Z" },
+    { url = "https://files.pythonhosted.org/packages/57/e9/689668733b1eb67adeef047db3c2e8788fcf65a7fd9c9e2b46b7744fe245/pydantic_core-2.46.4-cp314-cp314-manylinux_2_17_armv7l.manylinux2014_armv7l.whl", hash = "sha256:7283d57845ecf5a163403eb0702dfc220cc4fbdd18919cb5ccea4f95ee1cdab4", size = 2046785, upload-time = "2026-05-06T13:38:01.995Z" },
+    { url = "https://files.pythonhosted.org/packages/60/d9/6715260422ff50a2109878fd24d948a6c3446bb2664f34ee78cd972b3acd/pydantic_core-2.46.4-cp314-cp314-manylinux_2_17_ppc64le.manylinux2014_ppc64le.whl", hash = "sha256:8daafc69c93ee8a0204506a3b6b30f586ef54028f52aeeeb5c4cfc5184fd5914", size = 2228733, upload-time = "2026-05-06T13:40:50.371Z" },
+    { url = "https://files.pythonhosted.org/packages/18/ae/fdb2f64316afca925640f8e70bb1a564b0ec2721c1389e25b8eb4bf9a299/pydantic_core-2.46.4-cp314-cp314-manylinux_2_17_s390x.manylinux2014_s390x.whl", hash = "sha256:cd2213145bcc2ba85884d0ac63d222fece9209678f77b9b4d76f054c561adb28", size = 2307534, upload-time = "2026-05-06T13:37:21.531Z" },
+    { url = "https://files.pythonhosted.org/packages/89/1d/8eff589b45bb8190a9d12c49cfad0f176a5cbd1534908a6b5125e2886239/pydantic_core-2.46.4-cp314-cp314-manylinux_2_17_x86_64.manylinux2014_x86_64.whl", hash = "sha256:7a5f930472650a82629163023e630d160863fce524c616f4e5186e5de9d9a49b", size = 2099732, upload-time = "2026-05-06T13:39:31.942Z" },
+    { url = "https://files.pythonhosted.org/packages/06/d5/ee5a3366637fee41dee51a1fc91562dcf12ddbc68fda34e6b253da2324bb/pydantic_core-2.46.4-cp314-cp314-manylinux_2_31_riscv64.whl", hash = "sha256:c1b3f518abeca3aa13c712fd202306e145abf59a18b094a6bafb2d2bbf59192c", size = 2129627, upload-time = "2026-05-06T13:37:25.033Z" },
+    { url = "https://files.pythonhosted.org/packages/94/33/2414be571d2c6a6c4d08be21f9292b6d3fdb08949a97b6dfe985017821db/pydantic_core-2.46.4-cp314-cp314-manylinux_2_5_i686.manylinux1_i686.whl", hash = "sha256:1a7dd0b3ee80d90150e3495a3a13ac34dbcbfd4f012996a6a1d8900e91b5c0fb", size = 2179141, upload-time = "2026-05-06T13:37:14.046Z" },
+    { url = "https://files.pythonhosted.org/packages/7b/79/7daa95be995be0eecc4cf75064cb33f9bbbfe3fe0158caf2f0d4a996a5c7/pydantic_core-2.46.4-cp314-cp314-musllinux_1_1_aarch64.whl", hash = "sha256:3fb702cd90b0446a3a1c5e470bfa0dd23c0233b676a9099ddcc964fa6ca13898", size = 2184325, upload-time = "2026-05-06T13:36:53.615Z" },
+    { url = "https://files.pythonhosted.org/packages/9f/cb/d0a382f5c0de8a222dc61c65348e0ce831b1f68e0a018450d31c2cace3a5/pydantic_core-2.46.4-cp314-cp314-musllinux_1_1_armv7l.whl", hash = "sha256:b8458003118a712e66286df6a707db01c52c0f52f7db8e4a38f0da1d3b94fc4e", size = 2323990, upload-time = "2026-05-06T13:40:29.971Z" },
+    { url = "https://files.pythonhosted.org/packages/05/db/d9ba624cc4a5aced1598e88c04fdbd8310c8a69b9d38b9a3d39ce3a61ed7/pydantic_core-2.46.4-cp314-cp314-musllinux_1_1_x86_64.whl", hash = "sha256:372429a130e469c9cd698925ce5fc50940b7a1336b0d82038e63d5bbc4edc519", size = 2369978, upload-time = "2026-05-06T13:37:23.027Z" },
+    { url = "https://files.pythonhosted.org/packages/f2/20/d15df15ba918c423461905802bfd2981c3af0bfa0e40d05e13edbfa48bc3/pydantic_core-2.46.4-cp314-cp314-win32.whl", hash = "sha256:85bb3611ff1802f3ee7fdd7dbff26b56f343fb432d57a4728fdd49b6ef35e2f4", size = 1966354, upload-time = "2026-05-06T13:38:03.499Z" },
+    { url = "https://files.pythonhosted.org/packages/fc/b6/6b8de4c0a7d7ab3004c439c80c5c1e0a3e8d78bbae19379b01960383d9e5/pydantic_core-2.46.4-cp314-cp314-win_amd64.whl", hash = "sha256:811ff8e9c313ab425368bcbb36e5c4ebd7108c2bbf4e4089cfbb0b01eff63fac", size = 2072238, upload-time = "2026-05-06T13:39:40.807Z" },
+    { url = "https://files.pythonhosted.org/packages/32/36/51eb763beec1f4cf59b1db243a7dcc39cbb41230f050a09b9d69faaf0a48/pydantic_core-2.46.4-cp314-cp314-win_arm64.whl", hash = "sha256:bfec22eab3c8cc2ceec0248aec886624116dc079afa027ecc8ad4a7e62010f8a", size = 2018251, upload-time = "2026-05-06T13:37:26.72Z" },
+    { url = "https://files.pythonhosted.org/packages/e8/91/855af51d625b23aa987116a19e231d2aaef9c4a415273ddc189b79a45fee/pydantic_core-2.46.4-cp314-cp314t-macosx_10_12_x86_64.whl", hash = "sha256:af8244b2bef6aaad6d92cda81372de7f8c8d36c9f0c3ea36e827c60e7d9467a0", size = 2099593, upload-time = "2026-05-06T13:39:47.682Z" },
+    { url = "https://files.pythonhosted.org/packages/fb/1b/8784a54c65edb5f49f0a14d6977cf1b209bba85a4c77445b255c2de58ab3/pydantic_core-2.46.4-cp314-cp314t-macosx_11_0_arm64.whl", hash = "sha256:5a4330cdbc57162e4b3aa303f588ba752257694c9c9be3e7ebb11b4aca659b5d", size = 1935226, upload-time = "2026-05-06T13:40:40.428Z" },
+    { url = "https://files.pythonhosted.org/packages/e8/e7/1955d28d1afc56dd4b3ad7cc0cf39df1b9852964cf16e5d13912756d6d6b/pydantic_core-2.46.4-cp314-cp314t-manylinux_2_17_aarch64.manylinux2014_aarch64.whl", hash = "sha256:29c61fc04a3d840155ff08e475a04809278972fe6aef51e2720554e96367e34b", size = 1974605, upload-time = "2026-05-06T13:37:32.029Z" },
+    { url = "https://files.pythonhosted.org/packages/93/e2/3fedbf0ba7a22850e6e9fd78117f1c0f10f950182344d8a6c535d468fdd8/pydantic_core-2.46.4-cp314-cp314t-manylinux_2_17_armv7l.manylinux2014_armv7l.whl", hash = "sha256:c50f2528cf200c5eed56faf3f4e22fcd5f38c157a8b78576e6ba3168ec35f000", size = 2030777, upload-time = "2026-05-06T13:38:55.239Z" },
+    { url = "https://files.pythonhosted.org/packages/f8/61/46be275fcaaba0b4f5b9669dd852267ce1ff616592dccf7a7845588df091/pydantic_core-2.46.4-cp314-cp314t-manylinux_2_17_ppc64le.manylinux2014_ppc64le.whl", hash = "sha256:0cbe8b01f948de4286c74cdd6c667aceb38f5c1e26f0693b3983d9d74887c65e", size = 2236641, upload-time = "2026-05-06T13:37:08.096Z" },
+    { url = "https://files.pythonhosted.org/packages/60/db/12e93e46a8bac9988be3c016860f83293daea8c716c029c9ace279036f2f/pydantic_core-2.46.4-cp314-cp314t-manylinux_2_17_s390x.manylinux2014_s390x.whl", hash = "sha256:617d7e2ca7dcb8c5cf6bcb8c59b8832c94b36196bbf1cbd1bfb56ed341905edd", size = 2286404, upload-time = "2026-05-06T13:40:20.221Z" },
+    { url = "https://files.pythonhosted.org/packages/e2/4a/4d8b19008f38d31c53b8219cfedc2e3d5de5fe99d90076b7e767de29274f/pydantic_core-2.46.4-cp314-cp314t-manylinux_2_17_x86_64.manylinux2014_x86_64.whl", hash = "sha256:7027560ee92211647d0d34e3f7cd6f50da56399d26a9c8ad0da286d3869a53f3", size = 2109219, upload-time = "2026-05-06T13:38:12.153Z" },
+    { url = "https://files.pythonhosted.org/packages/88/70/3cbc40978fefb7bb09c6708d40d4ad1a5d70fd7213c3d17f971de868ec1f/pydantic_core-2.46.4-cp314-cp314t-manylinux_2_31_riscv64.whl", hash = "sha256:f99626688942fb746e545232e7726926f3be91b5975f8b55327665fafda991c7", size = 2110594, upload-time = "2026-05-06T13:40:02.971Z" },
+    { url = "https://files.pythonhosted.org/packages/9d/20/b8d36736216e29491125531685b2f9e61aa5b4b2599893f8268551da3338/pydantic_core-2.46.4-cp314-cp314t-manylinux_2_5_i686.manylinux1_i686.whl", hash = "sha256:fc3e9034a63de20e15e8ade85358bc6efc614008cab72898b4b4952bea0509ff", size = 2159542, upload-time = "2026-05-06T13:39:27.506Z" },
+    { url = "https://files.pythonhosted.org/packages/1d/a2/367df868eb584dacf6bf82a389272406d7178e301c4ac82545ab98bc2dd9/pydantic_core-2.46.4-cp314-cp314t-musllinux_1_1_aarch64.whl", hash = "sha256:97e7cf2be5c77b7d1a9713a05605d49460d02c6078d38d8bef3cbe323c548424", size = 2168146, upload-time = "2026-05-06T13:38:31.93Z" },
+    { url = "https://files.pythonhosted.org/packages/c1/b8/4460f77f7e201893f649a29ab355dddd3beee8a97bcb1a320db414f9a06e/pydantic_core-2.46.4-cp314-cp314t-musllinux_1_1_armv7l.whl", hash = "sha256:3bf92c5d0e00fefaab325a4d27828fe6b6e2a21848686b5b60d2d9eeb09d76c6", size = 2306309, upload-time = "2026-05-06T13:37:44.717Z" },
+    { url = "https://files.pythonhosted.org/packages/64/c4/be2639293acd87dc8ddbcec41a73cee9b2ebf996fe6d892a1a74e88ad3f7/pydantic_core-2.46.4-cp314-cp314t-musllinux_1_1_x86_64.whl", hash = "sha256:3ecbc122d18468d06ca279dc26a8c2e2d5acb10943bb35e36ae92096dc3b5565", size = 2369736, upload-time = "2026-05-06T13:37:05.645Z" },
+    { url = "https://files.pythonhosted.org/packages/30/a6/9f9f380dbb301f67023bf8f707aaa75daadf84f7152d95c410fd7e81d994/pydantic_core-2.46.4-cp314-cp314t-win32.whl", hash = "sha256:e846ae7835bf0703ae43f534ab79a867146dadd59dc9ca5c8b53d5c8f7c9ef02", size = 1955575, upload-time = "2026-05-06T13:38:51.116Z" },
+    { url = "https://files.pythonhosted.org/packages/40/1f/f1eb9eb350e795d1af8586289746f5c5677d16043040d63710e22abc43c9/pydantic_core-2.46.4-cp314-cp314t-win_amd64.whl", hash = "sha256:2108ba5c1c1eca18030634489dc544844144ee36357f2f9f780b93e7ddbb44b5", size = 2051624, upload-time = "2026-05-06T13:38:21.672Z" },
+    { url = "https://files.pythonhosted.org/packages/f6/d2/42dd53d0a85c27606f316d3aa5d2869c4e8470a5ed6dec30e4a1abe19192/pydantic_core-2.46.4-cp314-cp314t-win_arm64.whl", hash = "sha256:4fcbe087dbc2068af7eda3aa87634eba216dbda64d1ae73c8684b621d33f6596", size = 2017325, upload-time = "2026-05-06T13:40:52.723Z" },
+    { url = "https://files.pythonhosted.org/packages/ee/a4/73995fd4ebbb46ba0ee51e6fa049b8f02c40daebb762208feda8a6b7894d/pydantic_core-2.46.4-graalpy311-graalpy242_311_native-macosx_10_12_x86_64.whl", hash = "sha256:14d4edf427bdcf950a8a02d7cb44a08614388dd6e1bdcbf4f67504fa7887da9c", size = 2111589, upload-time = "2026-05-06T13:37:10.817Z" },
+    { url = "https://files.pythonhosted.org/packages/fb/7f/f37d3a5e8bfcc2e403f5c57a730f2d815693fb42119e8ea48b3789335af1/pydantic_core-2.46.4-graalpy311-graalpy242_311_native-macosx_11_0_arm64.whl", hash = "sha256:0ce40cd7b21210e99342afafbd4d0f76d784eb5b1d60f3bdc566be4983c6c73b", size = 1944552, upload-time = "2026-05-06T13:36:56.717Z" },
+    { url = "https://files.pythonhosted.org/packages/15/3c/d7eb777b3ff43e8433a4efb39a17aa8fd98a4ee8561a24a67ef5db07b2d6/pydantic_core-2.46.4-graalpy311-graalpy242_311_native-manylinux_2_17_aarch64.manylinux2014_aarch64.whl", hash = "sha256:90884113d8b48f760e9587002789ddd741e76ab9f89518cd1e43b1f1a52ec44b", size = 1982984, upload-time = "2026-05-06T13:39:06.207Z" },
+    { url = "https://files.pythonhosted.org/packages/63/87/70b9f40170a81afd55ca26c9b2acb25c20d64bcfbf888fafecb3ba077d4c/pydantic_core-2.46.4-graalpy311-graalpy242_311_native-manylinux_2_17_x86_64.manylinux2014_x86_64.whl", hash = "sha256:66ce7632c22d837c95301830e111ad0128a32b8207533b60896a96c4915192ea", size = 2138417, upload-time = "2026-05-06T13:39:45.476Z" },
+    { url = "https://files.pythonhosted.org/packages/9d/1d/8987ad40f65ae1432753072f214fb5c74fe47ffbd0698bb9cbbb585664f8/pydantic_core-2.46.4-graalpy312-graalpy250_312_native-macosx_10_12_x86_64.whl", hash = "sha256:1d8ba486450b14f3b1d63bc521d410ec7565e52f887b9fb671791886436a42f7", size = 2095527, upload-time = "2026-05-06T13:39:52.283Z" },
+    { url = "https://files.pythonhosted.org/packages/64/d3/84c282a7eee1d3ac4c0377546ef5a1ea436ce26840d9ac3b7ed54a377507/pydantic_core-2.46.4-graalpy312-graalpy250_312_native-macosx_11_0_arm64.whl", hash = "sha256:3009f12e4e90b7f88b4f9adb1b0c4a3d58fe7820f3238c190047209d148026df", size = 1936024, upload-time = "2026-05-06T13:40:15.671Z" },
+    { url = "https://files.pythonhosted.org/packages/d7/ca/eac61596cdeb4d7e174d3dc0bd8a6238f14f75f97a24e7b7db4c7e7340a0/pydantic_core-2.46.4-graalpy312-graalpy250_312_native-manylinux_2_17_aarch64.manylinux2014_aarch64.whl", hash = "sha256:ad785e92e6dc634c21555edc8bd6b64957ab844541bcb96a1366c202951ae526", size = 1990696, upload-time = "2026-05-06T13:38:34.717Z" },
+    { url = "https://files.pythonhosted.org/packages/fa/c3/7c8b240552251faf6b3a957db200fcfbbcec36763c050428b601e0c9b83b/pydantic_core-2.46.4-graalpy312-graalpy250_312_native-manylinux_2_17_x86_64.manylinux2014_x86_64.whl", hash = "sha256:00c603d540afdd6b80eb39f078f33ebd46211f02f33e34a32d9f053bba711de0", size = 2147590, upload-time = "2026-05-06T13:39:29.883Z" },
+    { url = "https://files.pythonhosted.org/packages/11/cb/428de0385b6c8d44b716feba566abfacfbd23ee3c4439faa789a1456242f/pydantic_core-2.46.4-pp311-pypy311_pp73-macosx_10_12_x86_64.whl", hash = "sha256:0c563b08bca408dc7f65f700633d8442fffb2421fc47b8101377e9fd65051ff0", size = 2112782, upload-time = "2026-05-06T13:37:04.016Z" },
+    { url = "https://files.pythonhosted.org/packages/0b/b5/6a17bdadd0fc1f170adfd05a20d37c832f52b117b4d9131da1f41bb097ce/pydantic_core-2.46.4-pp311-pypy311_pp73-macosx_11_0_arm64.whl", hash = "sha256:db06ffe51636ffe9ca531fe9023dd64bdd794be8754cb5df57c5498ae5b518a7", size = 1952146, upload-time = "2026-05-06T13:39:43.092Z" },
+    { url = "https://files.pythonhosted.org/packages/2a/dc/03734d80e362cd43ef65428e9de77c730ce7f2f11c60d2b1e1b39f0fbf99/pydantic_core-2.46.4-pp311-pypy311_pp73-manylinux_2_17_x86_64.manylinux2014_x86_64.whl", hash = "sha256:133878133d271ade3d41d1bfb2a45ec38dbdbda40bc065921c6b04e4630127e2", size = 2134492, upload-time = "2026-05-06T13:36:58.124Z" },
+    { url = "https://files.pythonhosted.org/packages/de/df/5e5ffc085ed07cc22d298134d3d911c63e91f6a0eb91fe646750a3209910/pydantic_core-2.46.4-pp311-pypy311_pp73-manylinux_2_5_i686.manylinux1_i686.whl", hash = "sha256:9bc519fbf2b7578398853d815009ae5e4d4603d12f4e3f91da8c06852d3da3e9", size = 2156604, upload-time = "2026-05-06T13:37:49.88Z" },
+    { url = "https://files.pythonhosted.org/packages/81/44/6e112a4253e56f5705467cbab7ab5e91ee7398ba3d56d358635958893d3e/pydantic_core-2.46.4-pp311-pypy311_pp73-musllinux_1_1_aarch64.whl", hash = "sha256:c7a7bd4e39e8e4c12c39cd480356842b6a8a06e41b23a55a5e3e191718838ddf", size = 2183828, upload-time = "2026-05-06T13:37:43.053Z" },
+    { url = "https://files.pythonhosted.org/packages/ac/ad/5565071e937d8e752842ac241463944c9eb14c87e2d269f2658a5bd05e98/pydantic_core-2.46.4-pp311-pypy311_pp73-musllinux_1_1_armv7l.whl", hash = "sha256:d396ec2b979760aaf3218e76c24e65bd0aca24983298653b3a9d7a45f9e47b30", size = 2310000, upload-time = "2026-05-06T13:37:56.694Z" },
+    { url = "https://files.pythonhosted.org/packages/4f/c3/66883a5cec183e7fba4d024b4cbbe61851a63750ef606b0afecc46d1f2bf/pydantic_core-2.46.4-pp311-pypy311_pp73-musllinux_1_1_x86_64.whl", hash = "sha256:86e1a4418c6cd97d60c95c71164158eaf7324fae7b0923264016baa993eba6fc", size = 2361286, upload-time = "2026-05-06T13:40:05.667Z" },
+    { url = "https://files.pythonhosted.org/packages/4b/2d/69abac8f838090bbecd5df894befb2c2619e7996a98ddb949db9f3b93225/pydantic_core-2.46.4-pp311-pypy311_pp73-win_amd64.whl", hash = "sha256:d51026d73fcfd93610abc7b27789c26b313920fcfb20e27462d74a7f8b06e983", size = 2193071, upload-time = "2026-05-06T13:38:08.682Z" },
+]
+
+[[package]]
+name = "pygments"
+version = "2.20.0"
+source = { registry = "https://pypi.org/simple" }
+sdist = { url = "https://files.pythonhosted.org/packages/c3/b2/bc9c9196916376152d655522fdcebac55e66de6603a76a02bca1b6414f6c/pygments-2.20.0.tar.gz", hash = "sha256:6757cd03768053ff99f3039c1a36d6c0aa0b263438fcab17520b30a303a82b5f", size = 4955991, upload-time = "2026-03-29T13:29:33.898Z" }
+wheels = [
+    { url = "https://files.pythonhosted.org/packages/f4/7e/a72dd26f3b0f4f2bf1dd8923c85f7ceb43172af56d63c7383eb62b332364/pygments-2.20.0-py3-none-any.whl", hash = "sha256:81a9e26dd42fd28a23a2d169d86d7ac03b46e2f8b59ed4698fb4785f946d0176", size = 1231151, upload-time = "2026-03-29T13:29:30.038Z" },
+]
+
+[[package]]
+name = "pytest"
+version = "9.1.1"
+source = { registry = "https://pypi.org/simple" }
+dependencies = [
+    { name = "colorama", marker = "sys_platform == 'win32'" },
+    { name = "iniconfig" },
+    { name = "packaging" },
+    { name = "pluggy" },
+    { name = "pygments" },
+]
+sdist = { url = "https://files.pythonhosted.org/packages/e4/47/b9efed96c114afcfa3c9d3fe98a76a1d14c74a9e266d397cf6eb64be5e01/pytest-9.1.1.tar.gz", hash = "sha256:1088fbde8f2b49d95a549a195707afa7a76a3ce9bcadc26b6d71f0ffda5fe313", size = 1636369, upload-time = "2026-06-19T10:58:32.857Z" }
+wheels = [
+    { url = "https://files.pythonhosted.org/packages/24/25/1de2678b631f5a49215c6c96fff41ba892b0a34df68d6d80292b1b48aa7f/pytest-9.1.1-py3-none-any.whl", hash = "sha256:37a86b45efb9a47a61a36449063e8e18d0cab3161329fc099eb21783169c4f0c", size = 386536, upload-time = "2026-06-19T10:58:31.347Z" },
+]
+
+[[package]]
+name = "python-dateutil"
+version = "2.9.0.post0"
+source = { registry = "https://pypi.org/simple" }
+dependencies = [
+    { name = "six" },
+]
+sdist = { url = "https://files.pythonhosted.org/packages/66/c0/0c8b6ad9f17a802ee498c46e004a0eb49bc148f2fd230864601a86dcf6db/python-dateutil-2.9.0.post0.tar.gz", hash = "sha256:37dd54208da7e1cd875388217d5e00ebd4179249f90fb72437e91a35459a0ad3", size = 342432, upload-time = "2024-03-01T18:36:20.211Z" }
+wheels = [
+    { url = "https://files.pythonhosted.org/packages/ec/57/56b9bcc3c9c6a792fcbaf139543cee77261f3651ca9da0c93f5c1221264b/python_dateutil-2.9.0.post0-py2.py3-none-any.whl", hash = "sha256:a8b2bc7bffae282281c8140a97d3aa9c14da0b136dfe83f850eea9a5f7470427", size = 229892, upload-time = "2024-03-01T18:36:18.57Z" },
+]
+
+[[package]]
+name = "python-discovery"
+version = "1.4.2"
+source = { registry = "https://pypi.org/simple" }
+dependencies = [
+    { name = "filelock" },
+    { name = "platformdirs" },
+]
+sdist = { url = "https://files.pythonhosted.org/packages/0b/1a/cbbaf13b730abb0a16b964d984e19f2fe520c21a4dc664051359a3f5a9e7/python_discovery-1.4.2.tar.gz", hash = "sha256:8f3746c4b4968d22afbb97d36e1a0e5b66e6c0f297290f2e95f05b9b8bf18690", size = 70277, upload-time = "2026-06-11T16:10:42.383Z" }
+wheels = [
+    { url = "https://files.pythonhosted.org/packages/1a/82/a70006589557f267f15bd384c0642ad49f0d97b690c3a05b166b9dcbad3b/python_discovery-1.4.2-py3-none-any.whl", hash = "sha256:475803f53b7b2ed6e490e27373f9d8340f7d2eebf9acdaf645d7d714c97bb500", size = 33886, upload-time = "2026-06-11T16:10:41.192Z" },
+]
+
+[[package]]
+name = "pytorch-lightning"
+version = "2.4.0"
+source = { registry = "https://pypi.org/simple" }
+dependencies = [
+    { name = "fsspec", extra = ["http"] },
+    { name = "lightning-utilities" },
+    { name = "packaging" },
+    { name = "pyyaml" },
+    { name = "torch" },
+    { name = "torchmetrics" },
+    { name = "tqdm" },
+    { name = "typing-extensions" },
+]
+sdist = { url = "https://files.pythonhosted.org/packages/d1/f0/3207bd5019c43899efbb5444da263577497a5c4dc82719633a3bf63d8f45/pytorch-lightning-2.4.0.tar.gz", hash = "sha256:6aa897fd9d6dfa7b7b49f37c2f04e13592861831d08deae584dfda423fdb71c8", size = 625320, upload-time = "2024-08-07T09:46:42.244Z" }
+wheels = [
+    { url = "https://files.pythonhosted.org/packages/2b/d2/ecd65ff1e0b1ca79f9785dd65d5ced7ec2643a828068aaa24e47e4c84a14/pytorch_lightning-2.4.0-py3-none-any.whl", hash = "sha256:9ac7935229ac022ef06994c928217ed37f525ac6700f7d4fc57009624570e655", size = 815151, upload-time = "2024-08-07T09:46:38.943Z" },
+]
+
+[[package]]
+name = "pytz"
+version = "2026.2"
+source = { registry = "https://pypi.org/simple" }
+sdist = { url = "https://files.pythonhosted.org/packages/ff/46/dd499ec9038423421951e4fad73051febaa13d2df82b4064f87af8b8c0c3/pytz-2026.2.tar.gz", hash = "sha256:0e60b47b29f21574376f218fe21abc009894a2321ea16c6754f3cad6eb7cdd6a", size = 320861, upload-time = "2026-05-04T01:35:29.667Z" }
+wheels = [
+    { url = "https://files.pythonhosted.org/packages/ec/dd/96da98f892250475bdf2328112d7468abdd4acc7b902b6af23f4ed958ea0/pytz-2026.2-py2.py3-none-any.whl", hash = "sha256:04156e608bee23d3792fd45c94ae47fae1036688e75032eea2e3bf0323d1f126", size = 510141, upload-time = "2026-05-04T01:35:27.408Z" },
+]
+
+[[package]]
+name = "pyyaml"
+version = "6.0.3"
+source = { registry = "https://pypi.org/simple" }
+sdist = { url = "https://files.pythonhosted.org/packages/05/8e/961c0007c59b8dd7729d542c61a4d537767a59645b82a0b521206e1e25c2/pyyaml-6.0.3.tar.gz", hash = "sha256:d76623373421df22fb4cf8817020cbb7ef15c725b9d5e45f17e189bfc384190f", size = 130960, upload-time = "2025-09-25T21:33:16.546Z" }
+wheels = [
+    { url = "https://files.pythonhosted.org/packages/6d/16/a95b6757765b7b031c9374925bb718d55e0a9ba8a1b6a12d25962ea44347/pyyaml-6.0.3-cp311-cp311-macosx_10_13_x86_64.whl", hash = "sha256:44edc647873928551a01e7a563d7452ccdebee747728c1080d881d68af7b997e", size = 185826, upload-time = "2025-09-25T21:31:58.655Z" },
+    { url = "https://files.pythonhosted.org/packages/16/19/13de8e4377ed53079ee996e1ab0a9c33ec2faf808a4647b7b4c0d46dd239/pyyaml-6.0.3-cp311-cp311-macosx_11_0_arm64.whl", hash = "sha256:652cb6edd41e718550aad172851962662ff2681490a8a711af6a4d288dd96824", size = 175577, upload-time = "2025-09-25T21:32:00.088Z" },
+    { url = "https://files.pythonhosted.org/packages/0c/62/d2eb46264d4b157dae1275b573017abec435397aa59cbcdab6fc978a8af4/pyyaml-6.0.3-cp311-cp311-manylinux2014_aarch64.manylinux_2_17_aarch64.manylinux_2_28_aarch64.whl", hash = "sha256:10892704fc220243f5305762e276552a0395f7beb4dbf9b14ec8fd43b57f126c", size = 775556, upload-time = "2025-09-25T21:32:01.31Z" },
+    { url = "https://files.pythonhosted.org/packages/10/cb/16c3f2cf3266edd25aaa00d6c4350381c8b012ed6f5276675b9eba8d9ff4/pyyaml-6.0.3-cp311-cp311-manylinux2014_s390x.manylinux_2_17_s390x.manylinux_2_28_s390x.whl", hash = "sha256:850774a7879607d3a6f50d36d04f00ee69e7fc816450e5f7e58d7f17f1ae5c00", size = 882114, upload-time = "2025-09-25T21:32:03.376Z" },
+    { url = "https://files.pythonhosted.org/packages/71/60/917329f640924b18ff085ab889a11c763e0b573da888e8404ff486657602/pyyaml-6.0.3-cp311-cp311-manylinux2014_x86_64.manylinux_2_17_x86_64.manylinux_2_28_x86_64.whl", hash = "sha256:b8bb0864c5a28024fac8a632c443c87c5aa6f215c0b126c449ae1a150412f31d", size = 806638, upload-time = "2025-09-25T21:32:04.553Z" },
+    { url = "https://files.pythonhosted.org/packages/dd/6f/529b0f316a9fd167281a6c3826b5583e6192dba792dd55e3203d3f8e655a/pyyaml-6.0.3-cp311-cp311-musllinux_1_2_aarch64.whl", hash = "sha256:1d37d57ad971609cf3c53ba6a7e365e40660e3be0e5175fa9f2365a379d6095a", size = 767463, upload-time = "2025-09-25T21:32:06.152Z" },
+    { url = "https://files.pythonhosted.org/packages/f2/6a/b627b4e0c1dd03718543519ffb2f1deea4a1e6d42fbab8021936a4d22589/pyyaml-6.0.3-cp311-cp311-musllinux_1_2_x86_64.whl", hash = "sha256:37503bfbfc9d2c40b344d06b2199cf0e96e97957ab1c1b546fd4f87e53e5d3e4", size = 794986, upload-time = "2025-09-25T21:32:07.367Z" },
+    { url = "https://files.pythonhosted.org/packages/45/91/47a6e1c42d9ee337c4839208f30d9f09caa9f720ec7582917b264defc875/pyyaml-6.0.3-cp311-cp311-win32.whl", hash = "sha256:8098f252adfa6c80ab48096053f512f2321f0b998f98150cea9bd23d83e1467b", size = 142543, upload-time = "2025-09-25T21:32:08.95Z" },
+    { url = "https://files.pythonhosted.org/packages/da/e3/ea007450a105ae919a72393cb06f122f288ef60bba2dc64b26e2646fa315/pyyaml-6.0.3-cp311-cp311-win_amd64.whl", hash = "sha256:9f3bfb4965eb874431221a3ff3fdcddc7e74e3b07799e0e84ca4a0f867d449bf", size = 158763, upload-time = "2025-09-25T21:32:09.96Z" },
+    { url = "https://files.pythonhosted.org/packages/d1/33/422b98d2195232ca1826284a76852ad5a86fe23e31b009c9886b2d0fb8b2/pyyaml-6.0.3-cp312-cp312-macosx_10_13_x86_64.whl", hash = "sha256:7f047e29dcae44602496db43be01ad42fc6f1cc0d8cd6c83d342306c32270196", size = 182063, upload-time = "2025-09-25T21:32:11.445Z" },
+    { url = "https://files.pythonhosted.org/packages/89/a0/6cf41a19a1f2f3feab0e9c0b74134aa2ce6849093d5517a0c550fe37a648/pyyaml-6.0.3-cp312-cp312-macosx_11_0_arm64.whl", hash = "sha256:fc09d0aa354569bc501d4e787133afc08552722d3ab34836a80547331bb5d4a0", size = 173973, upload-time = "2025-09-25T21:32:12.492Z" },
+    { url = "https://files.pythonhosted.org/packages/ed/23/7a778b6bd0b9a8039df8b1b1d80e2e2ad78aa04171592c8a5c43a56a6af4/pyyaml-6.0.3-cp312-cp312-manylinux2014_aarch64.manylinux_2_17_aarch64.manylinux_2_28_aarch64.whl", hash = "sha256:9149cad251584d5fb4981be1ecde53a1ca46c891a79788c0df828d2f166bda28", size = 775116, upload-time = "2025-09-25T21:32:13.652Z" },
+    { url = "https://files.pythonhosted.org/packages/65/30/d7353c338e12baef4ecc1b09e877c1970bd3382789c159b4f89d6a70dc09/pyyaml-6.0.3-cp312-cp312-manylinux2014_s390x.manylinux_2_17_s390x.manylinux_2_28_s390x.whl", hash = "sha256:5fdec68f91a0c6739b380c83b951e2c72ac0197ace422360e6d5a959d8d97b2c", size = 844011, upload-time = "2025-09-25T21:32:15.21Z" },
+    { url = "https://files.pythonhosted.org/packages/8b/9d/b3589d3877982d4f2329302ef98a8026e7f4443c765c46cfecc8858c6b4b/pyyaml-6.0.3-cp312-cp312-manylinux2014_x86_64.manylinux_2_17_x86_64.manylinux_2_28_x86_64.whl", hash = "sha256:ba1cc08a7ccde2d2ec775841541641e4548226580ab850948cbfda66a1befcdc", size = 807870, upload-time = "2025-09-25T21:32:16.431Z" },
+    { url = "https://files.pythonhosted.org/packages/05/c0/b3be26a015601b822b97d9149ff8cb5ead58c66f981e04fedf4e762f4bd4/pyyaml-6.0.3-cp312-cp312-musllinux_1_2_aarch64.whl", hash = "sha256:8dc52c23056b9ddd46818a57b78404882310fb473d63f17b07d5c40421e47f8e", size = 761089, upload-time = "2025-09-25T21:32:17.56Z" },
+    { url = "https://files.pythonhosted.org/packages/be/8e/98435a21d1d4b46590d5459a22d88128103f8da4c2d4cb8f14f2a96504e1/pyyaml-6.0.3-cp312-cp312-musllinux_1_2_x86_64.whl", hash = "sha256:41715c910c881bc081f1e8872880d3c650acf13dfa8214bad49ed4cede7c34ea", size = 790181, upload-time = "2025-09-25T21:32:18.834Z" },
+    { url = "https://files.pythonhosted.org/packages/74/93/7baea19427dcfbe1e5a372d81473250b379f04b1bd3c4c5ff825e2327202/pyyaml-6.0.3-cp312-cp312-win32.whl", hash = "sha256:96b533f0e99f6579b3d4d4995707cf36df9100d67e0c8303a0c55b27b5f99bc5", size = 137658, upload-time = "2025-09-25T21:32:20.209Z" },
+    { url = "https://files.pythonhosted.org/packages/86/bf/899e81e4cce32febab4fb42bb97dcdf66bc135272882d1987881a4b519e9/pyyaml-6.0.3-cp312-cp312-win_amd64.whl", hash = "sha256:5fcd34e47f6e0b794d17de1b4ff496c00986e1c83f7ab2fb8fcfe9616ff7477b", size = 154003, upload-time = "2025-09-25T21:32:21.167Z" },
+    { url = "https://files.pythonhosted.org/packages/1a/08/67bd04656199bbb51dbed1439b7f27601dfb576fb864099c7ef0c3e55531/pyyaml-6.0.3-cp312-cp312-win_arm64.whl", hash = "sha256:64386e5e707d03a7e172c0701abfb7e10f0fb753ee1d773128192742712a98fd", size = 140344, upload-time = "2025-09-25T21:32:22.617Z" },
+    { url = "https://files.pythonhosted.org/packages/d1/11/0fd08f8192109f7169db964b5707a2f1e8b745d4e239b784a5a1dd80d1db/pyyaml-6.0.3-cp313-cp313-macosx_10_13_x86_64.whl", hash = "sha256:8da9669d359f02c0b91ccc01cac4a67f16afec0dac22c2ad09f46bee0697eba8", size = 181669, upload-time = "2025-09-25T21:32:23.673Z" },
+    { url = "https://files.pythonhosted.org/packages/b1/16/95309993f1d3748cd644e02e38b75d50cbc0d9561d21f390a76242ce073f/pyyaml-6.0.3-cp313-cp313-macosx_11_0_arm64.whl", hash = "sha256:2283a07e2c21a2aa78d9c4442724ec1eb15f5e42a723b99cb3d822d48f5f7ad1", size = 173252, upload-time = "2025-09-25T21:32:25.149Z" },
+    { url = "https://files.pythonhosted.org/packages/50/31/b20f376d3f810b9b2371e72ef5adb33879b25edb7a6d072cb7ca0c486398/pyyaml-6.0.3-cp313-cp313-manylinux2014_aarch64.manylinux_2_17_aarch64.manylinux_2_28_aarch64.whl", hash = "sha256:ee2922902c45ae8ccada2c5b501ab86c36525b883eff4255313a253a3160861c", size = 767081, upload-time = "2025-09-25T21:32:26.575Z" },
+    { url = "https://files.pythonhosted.org/packages/49/1e/a55ca81e949270d5d4432fbbd19dfea5321eda7c41a849d443dc92fd1ff7/pyyaml-6.0.3-cp313-cp313-manylinux2014_s390x.manylinux_2_17_s390x.manylinux_2_28_s390x.whl", hash = "sha256:a33284e20b78bd4a18c8c2282d549d10bc8408a2a7ff57653c0cf0b9be0afce5", size = 841159, upload-time = "2025-09-25T21:32:27.727Z" },
+    { url = "https://files.pythonhosted.org/packages/74/27/e5b8f34d02d9995b80abcef563ea1f8b56d20134d8f4e5e81733b1feceb2/pyyaml-6.0.3-cp313-cp313-manylinux2014_x86_64.manylinux_2_17_x86_64.manylinux_2_28_x86_64.whl", hash = "sha256:0f29edc409a6392443abf94b9cf89ce99889a1dd5376d94316ae5145dfedd5d6", size = 801626, upload-time = "2025-09-25T21:32:28.878Z" },
+    { url = "https://files.pythonhosted.org/packages/f9/11/ba845c23988798f40e52ba45f34849aa8a1f2d4af4b798588010792ebad6/pyyaml-6.0.3-cp313-cp313-musllinux_1_2_aarch64.whl", hash = "sha256:f7057c9a337546edc7973c0d3ba84ddcdf0daa14533c2065749c9075001090e6", size = 753613, upload-time = "2025-09-25T21:32:30.178Z" },
+    { url = "https://files.pythonhosted.org/packages/3d/e0/7966e1a7bfc0a45bf0a7fb6b98ea03fc9b8d84fa7f2229e9659680b69ee3/pyyaml-6.0.3-cp313-cp313-musllinux_1_2_x86_64.whl", hash = "sha256:eda16858a3cab07b80edaf74336ece1f986ba330fdb8ee0d6c0d68fe82bc96be", size = 794115, upload-time = "2025-09-25T21:32:31.353Z" },
+    { url = "https://files.pythonhosted.org/packages/de/94/980b50a6531b3019e45ddeada0626d45fa85cbe22300844a7983285bed3b/pyyaml-6.0.3-cp313-cp313-win32.whl", hash = "sha256:d0eae10f8159e8fdad514efdc92d74fd8d682c933a6dd088030f3834bc8e6b26", size = 137427, upload-time = "2025-09-25T21:32:32.58Z" },
+    { url = "https://files.pythonhosted.org/packages/97/c9/39d5b874e8b28845e4ec2202b5da735d0199dbe5b8fb85f91398814a9a46/pyyaml-6.0.3-cp313-cp313-win_amd64.whl", hash = "sha256:79005a0d97d5ddabfeeea4cf676af11e647e41d81c9a7722a193022accdb6b7c", size = 154090, upload-time = "2025-09-25T21:32:33.659Z" },
+    { url = "https://files.pythonhosted.org/packages/73/e8/2bdf3ca2090f68bb3d75b44da7bbc71843b19c9f2b9cb9b0f4ab7a5a4329/pyyaml-6.0.3-cp313-cp313-win_arm64.whl", hash = "sha256:5498cd1645aa724a7c71c8f378eb29ebe23da2fc0d7a08071d89469bf1d2defb", size = 140246, upload-time = "2025-09-25T21:32:34.663Z" },
+    { url = "https://files.pythonhosted.org/packages/9d/8c/f4bd7f6465179953d3ac9bc44ac1a8a3e6122cf8ada906b4f96c60172d43/pyyaml-6.0.3-cp314-cp314-macosx_10_13_x86_64.whl", hash = "sha256:8d1fab6bb153a416f9aeb4b8763bc0f22a5586065f86f7664fc23339fc1c1fac", size = 181814, upload-time = "2025-09-25T21:32:35.712Z" },
+    { url = "https://files.pythonhosted.org/packages/bd/9c/4d95bb87eb2063d20db7b60faa3840c1b18025517ae857371c4dd55a6b3a/pyyaml-6.0.3-cp314-cp314-macosx_11_0_arm64.whl", hash = "sha256:34d5fcd24b8445fadc33f9cf348c1047101756fd760b4dacb5c3e99755703310", size = 173809, upload-time = "2025-09-25T21:32:36.789Z" },
+    { url = "https://files.pythonhosted.org/packages/92/b5/47e807c2623074914e29dabd16cbbdd4bf5e9b2db9f8090fa64411fc5382/pyyaml-6.0.3-cp314-cp314-manylinux2014_aarch64.manylinux_2_17_aarch64.manylinux_2_28_aarch64.whl", hash = "sha256:501a031947e3a9025ed4405a168e6ef5ae3126c59f90ce0cd6f2bfc477be31b7", size = 766454, upload-time = "2025-09-25T21:32:37.966Z" },
+    { url = "https://files.pythonhosted.org/packages/02/9e/e5e9b168be58564121efb3de6859c452fccde0ab093d8438905899a3a483/pyyaml-6.0.3-cp314-cp314-manylinux2014_s390x.manylinux_2_17_s390x.manylinux_2_28_s390x.whl", hash = "sha256:b3bc83488de33889877a0f2543ade9f70c67d66d9ebb4ac959502e12de895788", size = 836355, upload-time = "2025-09-25T21:32:39.178Z" },
+    { url = "https://files.pythonhosted.org/packages/88/f9/16491d7ed2a919954993e48aa941b200f38040928474c9e85ea9e64222c3/pyyaml-6.0.3-cp314-cp314-manylinux2014_x86_64.manylinux_2_17_x86_64.manylinux_2_28_x86_64.whl", hash = "sha256:c458b6d084f9b935061bc36216e8a69a7e293a2f1e68bf956dcd9e6cbcd143f5", size = 794175, upload-time = "2025-09-25T21:32:40.865Z" },
+    { url = "https://files.pythonhosted.org/packages/dd/3f/5989debef34dc6397317802b527dbbafb2b4760878a53d4166579111411e/pyyaml-6.0.3-cp314-cp314-musllinux_1_2_aarch64.whl", hash = "sha256:7c6610def4f163542a622a73fb39f534f8c101d690126992300bf3207eab9764", size = 755228, upload-time = "2025-09-25T21:32:42.084Z" },
+    { url = "https://files.pythonhosted.org/packages/d7/ce/af88a49043cd2e265be63d083fc75b27b6ed062f5f9fd6cdc223ad62f03e/pyyaml-6.0.3-cp314-cp314-musllinux_1_2_x86_64.whl", hash = "sha256:5190d403f121660ce8d1d2c1bb2ef1bd05b5f68533fc5c2ea899bd15f4399b35", size = 789194, upload-time = "2025-09-25T21:32:43.362Z" },
+    { url = "https://files.pythonhosted.org/packages/23/20/bb6982b26a40bb43951265ba29d4c246ef0ff59c9fdcdf0ed04e0687de4d/pyyaml-6.0.3-cp314-cp314-win_amd64.whl", hash = "sha256:4a2e8cebe2ff6ab7d1050ecd59c25d4c8bd7e6f400f5f82b96557ac0abafd0ac", size = 156429, upload-time = "2025-09-25T21:32:57.844Z" },
+    { url = "https://files.pythonhosted.org/packages/f4/f4/a4541072bb9422c8a883ab55255f918fa378ecf083f5b85e87fc2b4eda1b/pyyaml-6.0.3-cp314-cp314-win_arm64.whl", hash = "sha256:93dda82c9c22deb0a405ea4dc5f2d0cda384168e466364dec6255b293923b2f3", size = 143912, upload-time = "2025-09-25T21:32:59.247Z" },
+    { url = "https://files.pythonhosted.org/packages/7c/f9/07dd09ae774e4616edf6cda684ee78f97777bdd15847253637a6f052a62f/pyyaml-6.0.3-cp314-cp314t-macosx_10_13_x86_64.whl", hash = "sha256:02893d100e99e03eda1c8fd5c441d8c60103fd175728e23e431db1b589cf5ab3", size = 189108, upload-time = "2025-09-25T21:32:44.377Z" },
+    { url = "https://files.pythonhosted.org/packages/4e/78/8d08c9fb7ce09ad8c38ad533c1191cf27f7ae1effe5bb9400a46d9437fcf/pyyaml-6.0.3-cp314-cp314t-macosx_11_0_arm64.whl", hash = "sha256:c1ff362665ae507275af2853520967820d9124984e0f7466736aea23d8611fba", size = 183641, upload-time = "2025-09-25T21:32:45.407Z" },
+    { url = "https://files.pythonhosted.org/packages/7b/5b/3babb19104a46945cf816d047db2788bcaf8c94527a805610b0289a01c6b/pyyaml-6.0.3-cp314-cp314t-manylinux2014_aarch64.manylinux_2_17_aarch64.manylinux_2_28_aarch64.whl", hash = "sha256:6adc77889b628398debc7b65c073bcb99c4a0237b248cacaf3fe8a557563ef6c", size = 831901, upload-time = "2025-09-25T21:32:48.83Z" },
+    { url = "https://files.pythonhosted.org/packages/8b/cc/dff0684d8dc44da4d22a13f35f073d558c268780ce3c6ba1b87055bb0b87/pyyaml-6.0.3-cp314-cp314t-manylinux2014_s390x.manylinux_2_17_s390x.manylinux_2_28_s390x.whl", hash = "sha256:a80cb027f6b349846a3bf6d73b5e95e782175e52f22108cfa17876aaeff93702", size = 861132, upload-time = "2025-09-25T21:32:50.149Z" },
+    { url = "https://files.pythonhosted.org/packages/b1/5e/f77dc6b9036943e285ba76b49e118d9ea929885becb0a29ba8a7c75e29fe/pyyaml-6.0.3-cp314-cp314t-manylinux2014_x86_64.manylinux_2_17_x86_64.manylinux_2_28_x86_64.whl", hash = "sha256:00c4bdeba853cc34e7dd471f16b4114f4162dc03e6b7afcc2128711f0eca823c", size = 839261, upload-time = "2025-09-25T21:32:51.808Z" },
+    { url = "https://files.pythonhosted.org/packages/ce/88/a9db1376aa2a228197c58b37302f284b5617f56a5d959fd1763fb1675ce6/pyyaml-6.0.3-cp314-cp314t-musllinux_1_2_aarch64.whl", hash = "sha256:66e1674c3ef6f541c35191caae2d429b967b99e02040f5ba928632d9a7f0f065", size = 805272, upload-time = "2025-09-25T21:32:52.941Z" },
+    { url = "https://files.pythonhosted.org/packages/da/92/1446574745d74df0c92e6aa4a7b0b3130706a4142b2d1a5869f2eaa423c6/pyyaml-6.0.3-cp314-cp314t-musllinux_1_2_x86_64.whl", hash = "sha256:16249ee61e95f858e83976573de0f5b2893b3677ba71c9dd36b9cf8be9ac6d65", size = 829923, upload-time = "2025-09-25T21:32:54.537Z" },
+    { url = "https://files.pythonhosted.org/packages/f0/7a/1c7270340330e575b92f397352af856a8c06f230aa3e76f86b39d01b416a/pyyaml-6.0.3-cp314-cp314t-win_amd64.whl", hash = "sha256:4ad1906908f2f5ae4e5a8ddfce73c320c2a1429ec52eafd27138b7f1cbe341c9", size = 174062, upload-time = "2025-09-25T21:32:55.767Z" },
+    { url = "https://files.pythonhosted.org/packages/f1/12/de94a39c2ef588c7e6455cfbe7343d3b2dc9d6b6b2f40c4c6565744c873d/pyyaml-6.0.3-cp314-cp314t-win_arm64.whl", hash = "sha256:ebc55a14a21cb14062aa4162f906cd962b28e2e9ea38f9b4391244cd8de4ae0b", size = 149341, upload-time = "2025-09-25T21:32:56.828Z" },
+]
+
+[[package]]
+name = "ruff"
+version = "0.15.18"
+source = { registry = "https://pypi.org/simple" }
+sdist = { url = "https://files.pythonhosted.org/packages/74/98/1295ad5a5aa9bc85bdcdfa5d82fe7b49c61af5657df4f227637ff9de0da6/ruff-0.15.18.tar.gz", hash = "sha256:2698a964c70e8bf402dcb99c8810472d270d141e7aa8c4e13599fd52033a2f33", size = 4761437, upload-time = "2026-06-18T18:25:39.224Z" }
+wheels = [
+    { url = "https://files.pythonhosted.org/packages/b9/d0/686e984941269621e2be72612d5c1e461f8f7b38415a2a7d7a81c8ae6715/ruff-0.15.18-py3-none-linux_armv6l.whl", hash = "sha256:8b6850172348c8381b8b3084c5915a4393c2373b9b54cd5b5e1ea15812bc10df", size = 10887308, upload-time = "2026-06-18T18:25:03.062Z" },
+    { url = "https://files.pythonhosted.org/packages/ed/21/bc4123e3f5515ee99f8ce1eb93a14a0628fe4d1678663cd08f933ac16931/ruff-0.15.18-py3-none-macosx_10_12_x86_64.whl", hash = "sha256:3fccc153a85417dcd976883160cacce486997b0a0058dd18f54b8aaaac7d1ce2", size = 11281305, upload-time = "2026-06-18T18:25:30.026Z" },
+    { url = "https://files.pythonhosted.org/packages/51/93/4769464c25cf7ab2acb3c7dda9cad3d867eb41c59565b3e2a9d17249c90c/ruff-0.15.18-py3-none-macosx_11_0_arm64.whl", hash = "sha256:08d4c86a68f2c3ec2c9d56380a71fb4a4f65373055cbb8caabd645e9102f38d4", size = 10641215, upload-time = "2026-06-18T18:25:15.802Z" },
+    { url = "https://files.pythonhosted.org/packages/6c/42/56926d17120db2c208d76bf60a1a019644dd9e91dc27f0f95c9caddb1366/ruff-0.15.18-py3-none-manylinux_2_17_aarch64.manylinux2014_aarch64.whl", hash = "sha256:37e5108745c2c0705da916d7d4de533ddf547051ef45f62888c31bae73f66318", size = 10957224, upload-time = "2026-06-18T18:25:36.955Z" },
+    { url = "https://files.pythonhosted.org/packages/22/4f/d43fab8d8189afde803103022d000a8ef9f230616d436d52a8b2b8d63b50/ruff-0.15.18-py3-none-manylinux_2_17_armv7l.manylinux2014_armv7l.whl", hash = "sha256:56949a6ce8b3abde54c0bcb22cebfe57e8771cadc84b407ae8b8eaf67ebdcd43", size = 10699024, upload-time = "2026-06-18T18:25:05.707Z" },
+    { url = "https://files.pythonhosted.org/packages/63/42/1e3e4c68bd408b9768cf3e439acbe2c78245225faef253f7028a0cdb63e0/ruff-0.15.18-py3-none-manylinux_2_17_i686.manylinux2014_i686.whl", hash = "sha256:01a754cd6a1b630d3f97e33eb452cf7a98040482318e870f8bc52a5a30e62657", size = 11491458, upload-time = "2026-06-18T18:25:20.275Z" },
+    { url = "https://files.pythonhosted.org/packages/20/77/47a3484bea8521e14a203d98c389c5c97846675e4f02734672da4a69b52a/ruff-0.15.18-py3-none-manylinux_2_17_ppc64le.manylinux2014_ppc64le.whl", hash = "sha256:6ba7a07e03a44dbf10bb086ee06705b173625014ec99f73a7e6836a5e5590a0c", size = 12383752, upload-time = "2026-06-18T18:25:22.535Z" },
+    { url = "https://files.pythonhosted.org/packages/0a/ca/054159590787023d83b658a1a1819c4c8910114e7015069340b71c0961cb/ruff-0.15.18-py3-none-manylinux_2_17_s390x.manylinux2014_s390x.whl", hash = "sha256:5a2c40a41a4cadbcf5897b548ab29dfe248b20c540961c0247d98a3973c70403", size = 11577923, upload-time = "2026-06-18T18:25:10.702Z" },
+    { url = "https://files.pythonhosted.org/packages/6d/ff/d353d6b7bbd73cc0ec37f4463d7540e45e894338abdd9964eee0de332708/ruff-0.15.18-py3-none-manylinux_2_17_x86_64.manylinux2014_x86_64.whl", hash = "sha256:5f0480ce690cbb6c4db6e5d08f19fce98e10ba131a8b60c1bcdac42771e3ae2d", size = 11583925, upload-time = "2026-06-18T18:25:32.391Z" },
+    { url = "https://files.pythonhosted.org/packages/c1/4a/891f89b9c296ed3e5f3ece1a5629badc989d9a8fdaa30431aaf4774bc1c2/ruff-0.15.18-py3-none-manylinux_2_31_riscv64.whl", hash = "sha256:2330215f1f393fa8733f55edce04fcf94c36a2c460fcde31f78cc84e4951e9b1", size = 11582834, upload-time = "2026-06-18T18:25:27.309Z" },
+    { url = "https://files.pythonhosted.org/packages/32/a3/ed9e370154bf85de360b93c03026157f02d4943b2d01ff4945f4429f8e8a/ruff-0.15.18-py3-none-musllinux_1_2_aarch64.whl", hash = "sha256:a6aa6a3d979e48ae617578183674bf264fbe7d0114a796a26bd678d67963c7ff", size = 10927328, upload-time = "2026-06-18T18:25:34.676Z" },
+    { url = "https://files.pythonhosted.org/packages/f5/d1/5cf5909329fedb5d39d555ee818ba5cf4638e1a301b89785d34f2905bfcb/ruff-0.15.18-py3-none-musllinux_1_2_armv7l.whl", hash = "sha256:a81beadbbff2c9c245561ae3f77b16709d87f35eec650d0501679239d3449b22", size = 10693187, upload-time = "2026-06-18T18:25:08.245Z" },
+    { url = "https://files.pythonhosted.org/packages/fd/44/ff6c635cf2c4f4e7b618b6640da057376baa36014695487d88aed4794268/ruff-0.15.18-py3-none-musllinux_1_2_i686.whl", hash = "sha256:2186d9e940ae332ab293623a75b5f4fe49565f449954d50a72a046683aa6b809", size = 11208721, upload-time = "2026-06-18T18:25:41.327Z" },
+    { url = "https://files.pythonhosted.org/packages/88/d9/5baa2a30861adfb7022cf33c1e35b2fc18085b08c16f83eff4c7b99a5f48/ruff-0.15.18-py3-none-musllinux_1_2_x86_64.whl", hash = "sha256:5c2abf140438032bc77b2284a6c9944ecd8a19e5f1c7b52b1b8e4a0a80d19a7a", size = 11678599, upload-time = "2026-06-18T18:25:13.607Z" },
+    { url = "https://files.pythonhosted.org/packages/c3/1a/0725a7cfdc32ff769efb96ee782bec882e16448c5d9e3be947ec4c04ce27/ruff-0.15.18-py3-none-win32.whl", hash = "sha256:02299e6e9fa5b297a3f6d5d10d7bcd655c925b028bb8b9d4588214549c6b9ec4", size = 10901903, upload-time = "2026-06-18T18:25:24.755Z" },
+    { url = "https://files.pythonhosted.org/packages/f3/51/805d9f6fb7970505c3504794a5ec350f605361b807fef4dcf214ebd35e72/ruff-0.15.18-py3-none-win_amd64.whl", hash = "sha256:dac80dc8d26b2257dbefabed62f5d255c3937b4ccb122da1fc634794fa3578b3", size = 12041189, upload-time = "2026-06-18T18:25:17.915Z" },
+    { url = "https://files.pythonhosted.org/packages/29/4c/67bb45e41609eb4726f1bfeb59e083cf91d14c696d4bd14c234a980be93d/ruff-0.15.18-py3-none-win_arm64.whl", hash = "sha256:b2c9257fcbd4a3e5b977a1904e6facca016bafe2edc17df24db67cfaee03b4e4", size = 11329958, upload-time = "2026-06-18T18:25:43.686Z" },
+]
+
+[[package]]
+name = "scikit-learn"
+version = "1.9.0"
+source = { registry = "https://pypi.org/simple" }
+dependencies = [
+    { name = "joblib" },
+    { name = "narwhals" },
+    { name = "numpy", version = "2.4.6", source = { registry = "https://pypi.org/simple" }, marker = "python_full_version < '3.12'" },
+    { name = "numpy", version = "2.5.0", source = { registry = "https://pypi.org/simple" }, marker = "python_full_version >= '3.12'" },
+    { name = "scipy", version = "1.17.1", source = { registry = "https://pypi.org/simple" }, marker = "python_full_version < '3.12'" },
+    { name = "scipy", version = "1.18.0", source = { registry = "https://pypi.org/simple" }, marker = "python_full_version >= '3.12'" },
+    { name = "threadpoolctl" },
+]
+sdist = { url = "https://files.pythonhosted.org/packages/fa/6f/37092bdb25f712817231799fc5674d8e704066a8a70c1d2d40517e18b4ab/scikit_learn-1.9.0.tar.gz", hash = "sha256:8833266989d3a5110178a9fae30783675460724d0e1efb13b14901d2c660c557", size = 7750767, upload-time = "2026-06-02T11:54:32.706Z" }
+wheels = [
+    { url = "https://files.pythonhosted.org/packages/f5/be/e844fd9586e66540a15b71924d17a6cbc1bb749e81ddd0a796bcdba4c055/scikit_learn-1.9.0-cp311-cp311-macosx_10_9_x86_64.whl", hash = "sha256:9db6f4d34e68c8899e4cab27fdf8eafe6ed21f2ba52ceb25ea250cd237f8e47b", size = 8789686, upload-time = "2026-06-02T11:53:05.439Z" },
+    { url = "https://files.pythonhosted.org/packages/42/e2/ff880f62677a17d035817d543cb0fc8727d01eccbee81c5f7fc733a9d856/scikit_learn-1.9.0-cp311-cp311-macosx_12_0_arm64.whl", hash = "sha256:f401448645a3e7bc115aa3c094097865155b34bff1cba8101857d9104e99074c", size = 8256782, upload-time = "2026-06-02T11:53:08.904Z" },
+    { url = "https://files.pythonhosted.org/packages/25/64/eb40435e1a508ab1b4e284ce43ae80f6a162e5be5e38ed5a6fab467a9ea4/scikit_learn-1.9.0-cp311-cp311-manylinux_2_27_aarch64.manylinux_2_28_aarch64.whl", hash = "sha256:fd3a8ef0c758555a3b23c03adaa858af32f7736785ded50ad5991f59c4ed03fa", size = 8992419, upload-time = "2026-06-02T11:53:11.551Z" },
+    { url = "https://files.pythonhosted.org/packages/8d/da/4810a28e473185429e45a57eebcc91fc991b33d889cc0676063e671db03d/scikit_learn-1.9.0-cp311-cp311-manylinux_2_27_x86_64.manylinux_2_28_x86_64.whl", hash = "sha256:f7e254636164090da847715a27f8e5478feb98c40a9e0ee90cbd277de9e5ceb8", size = 9281411, upload-time = "2026-06-02T11:53:15.063Z" },
+    { url = "https://files.pythonhosted.org/packages/3b/67/be3d369f40d8178ba3bd86635d132e08cb5329b023e4669d9426d84bc007/scikit_learn-1.9.0-cp311-cp311-win_amd64.whl", hash = "sha256:5dc1818c77575d149e25fce9ef82dd7b7263ae372f03494158668ad632a69759", size = 8272736, upload-time = "2026-06-02T11:53:18.108Z" },
+    { url = "https://files.pythonhosted.org/packages/37/79/a733f02dc2118da7e77a134b34f39f40201a353311b011d20859d2db3556/scikit_learn-1.9.0-cp311-cp311-win_arm64.whl", hash = "sha256:366652351f092b219c248f1e72821e841960a63d8f358f1dcfd54dc1cbdbbc28", size = 7919564, upload-time = "2026-06-02T11:53:21.2Z" },
+    { url = "https://files.pythonhosted.org/packages/ac/20/75f915ff375d6249e6550ac740fdbbd66159a068fd3af1400ff62036b07a/scikit_learn-1.9.0-cp312-cp312-macosx_10_13_x86_64.whl", hash = "sha256:2bd41b0d201bc81575531b96b713d3eb5e5f50fb0b82101ff0f92294fdc236ac", size = 8741122, upload-time = "2026-06-02T11:53:24.08Z" },
+    { url = "https://files.pythonhosted.org/packages/cc/d5/2b5148f2279196775e1db2aeb85d14b70ac80e7e32b3b28e7ebeafb0901d/scikit_learn-1.9.0-cp312-cp312-macosx_12_0_arm64.whl", hash = "sha256:5be45aa4a42a68a533913a6ed736cf309de2226411c79ef8d609a5456f1939b1", size = 8261512, upload-time = "2026-06-02T11:53:27.183Z" },
+    { url = "https://files.pythonhosted.org/packages/a0/ee/5adbc77656b71f9456a2f5a7a9fdb4bcf9207a6b962889f1c2f9323afa4e/scikit_learn-1.9.0-cp312-cp312-manylinux_2_27_aarch64.manylinux_2_28_aarch64.whl", hash = "sha256:5e50ed4da51974e86e940690e9a3d82e729b62b5a49f7c9bac534d515d39d86f", size = 8837603, upload-time = "2026-06-02T11:53:30.328Z" },
+    { url = "https://files.pythonhosted.org/packages/6c/c2/63fdda36c56437eeb44aaf9493c8bcd62ce230ab1598924fc626ffbfa943/scikit_learn-1.9.0-cp312-cp312-manylinux_2_27_x86_64.manylinux_2_28_x86_64.whl", hash = "sha256:056c92bb67ad4c28463c2f2653d9701449201e7e7a9e94e321be0f71c4fef2b8", size = 9132097, upload-time = "2026-06-02T11:53:33.456Z" },
+    { url = "https://files.pythonhosted.org/packages/83/a4/c8e67227c680e2259c8864ae72ff48b06e16a6f51253a22167aa02a8aa4e/scikit_learn-1.9.0-cp312-cp312-win_amd64.whl", hash = "sha256:4306775fad04cc4b472a1b15af1ae9cede1540fbfcc17fbce3767cd8dc7ae283", size = 8211173, upload-time = "2026-06-02T11:53:36.602Z" },
+    { url = "https://files.pythonhosted.org/packages/cf/fd/3c0863792e98e67e9184aa4029288a175935eb65443afcd30d4f143450cf/scikit_learn-1.9.0-cp312-cp312-win_arm64.whl", hash = "sha256:26e22435f63bcdcf396b574273f29f13dd531f5ea035801f5be10ba1540a4e60", size = 7867451, upload-time = "2026-06-02T11:53:39.075Z" },
+    { url = "https://files.pythonhosted.org/packages/3c/01/cf3310626b6d48d3e9be69a1223f9180360b5e6edb045f50fade723ce494/scikit_learn-1.9.0-cp313-cp313-macosx_10_13_x86_64.whl", hash = "sha256:80746d63bd4b6eaca54d36fe5feaf4d28bb38dc6f9470f81c7cad7c40155f119", size = 8705188, upload-time = "2026-06-02T11:53:41.964Z" },
+    { url = "https://files.pythonhosted.org/packages/3e/04/5acd7ae280c5f93b6ac5ef6cdec14eef4c8d1cd91d85b3292989c94d96b1/scikit_learn-1.9.0-cp313-cp313-macosx_12_0_arm64.whl", hash = "sha256:5b934c45c252844a91d69fda3a34cff5e7307e1db10d77cb10a3980312c74713", size = 8228299, upload-time = "2026-06-02T11:53:44.817Z" },
+    { url = "https://files.pythonhosted.org/packages/0c/39/ffe829a5b8ecb40a518724a997794657fdc354ada5e8fe8e64d998c0bac9/scikit_learn-1.9.0-cp313-cp313-manylinux_2_27_aarch64.manylinux_2_28_aarch64.whl", hash = "sha256:38c3dcb9a1ffb85505ec53d54c7b4aea0cff70050425a7760c2af661ac85df05", size = 8789690, upload-time = "2026-06-02T11:53:47.461Z" },
+    { url = "https://files.pythonhosted.org/packages/1f/88/8dab5de10c638c083772a6be83a3d8106ced492f74a928c8693638e5bb50/scikit_learn-1.9.0-cp313-cp313-manylinux_2_27_x86_64.manylinux_2_28_x86_64.whl", hash = "sha256:da76d09304a4706db7cc1e3ebaa3b6b98a67365cc11d2996c4f1e58ba47df714", size = 9087723, upload-time = "2026-06-02T11:53:50.702Z" },
+    { url = "https://files.pythonhosted.org/packages/20/3f/7917ca72464038f6240ec70c29f94862d08a34a74291ae4d4ec5eb8186a0/scikit_learn-1.9.0-cp313-cp313-win_amd64.whl", hash = "sha256:5808d98f15c6bf6d9d96d2348c1997392a5888ce7097e664105f930c4bca1277", size = 8184330, upload-time = "2026-06-02T11:53:53.396Z" },
+    { url = "https://files.pythonhosted.org/packages/78/c7/15739eb2f61fda3c54639e9942414e5a19ad8a8d1f5a3266afad7cb7df80/scikit_learn-1.9.0-cp313-cp313-win_arm64.whl", hash = "sha256:d77f54c017633791bc0225a43e2f8d03745fdcfe4880268fcc4df15f505dec2e", size = 7840653, upload-time = "2026-06-02T11:53:56.035Z" },
+    { url = "https://files.pythonhosted.org/packages/f4/7d/c9a35cf59b20a86fec24d306f1547b78dec194b08d367ce2a3e4854169d9/scikit_learn-1.9.0-cp314-cp314-macosx_10_15_x86_64.whl", hash = "sha256:9656acd4e93f74e0b66c8a36c88830a99252dfa900044d36bc2212ae89a47162", size = 8713289, upload-time = "2026-06-02T11:53:58.788Z" },
+    { url = "https://files.pythonhosted.org/packages/3c/a7/552a7821597c632b907f7bfe8f36f9f572777af8ef8a48353041cf8e091a/scikit_learn-1.9.0-cp314-cp314-macosx_12_0_arm64.whl", hash = "sha256:24360002ae845e7866522b0a5bbf690802e7bc388cac8663502e78aa98598aa2", size = 8245141, upload-time = "2026-06-02T11:54:01.694Z" },
+    { url = "https://files.pythonhosted.org/packages/7d/79/f4a0c4fe9711154cddabf913471153af79056382ddc612cfe5ee0ff4b72e/scikit_learn-1.9.0-cp314-cp314-manylinux_2_27_aarch64.manylinux_2_28_aarch64.whl", hash = "sha256:5162ad10a418c8a282dde04c9aa06965de3e9a65f33c1440c0ae69bb1a09d913", size = 8847671, upload-time = "2026-06-02T11:54:04.448Z" },
+    { url = "https://files.pythonhosted.org/packages/f0/af/4d72d9e475ac83719160c662619e4bf7b95c19507cd582e7d0167a3c3dae/scikit_learn-1.9.0-cp314-cp314-manylinux_2_27_x86_64.manylinux_2_28_x86_64.whl", hash = "sha256:1fea2cc5677ab49d6f5bade978c866da44957b712d92e9635e8b4f723013c3cb", size = 9118104, upload-time = "2026-06-02T11:54:07.205Z" },
+    { url = "https://files.pythonhosted.org/packages/a2/d5/6a58eea2cb9abbb9b3f2bb8b2cfb3243d1152d69f442d256c7af71304769/scikit_learn-1.9.0-cp314-cp314-win_amd64.whl", hash = "sha256:64fa347efc1c839c487433e40c5144d38c336e8a2b59c81aa8660373945c2673", size = 8290674, upload-time = "2026-06-02T11:54:10.087Z" },
+    { url = "https://files.pythonhosted.org/packages/65/5b/d4c879cf358f1187141cf90ced473f087183489090244f50c124a2ee478b/scikit_learn-1.9.0-cp314-cp314-win_arm64.whl", hash = "sha256:1b944b6db288f6b926e3650026ddafb988929de95d11fc2cc5fa117773c9ba42", size = 7978807, upload-time = "2026-06-02T11:54:12.769Z" },
+    { url = "https://files.pythonhosted.org/packages/8a/43/bfae3121ec67ae09150d453c442c7c1cc166e9aefe056e6ab3b7728a5cfc/scikit_learn-1.9.0-cp314-cp314t-macosx_10_15_x86_64.whl", hash = "sha256:4ccacf04ca5f4b492158a5f28afe0ace43f81b2571e4b9a66d34848b46128949", size = 9031941, upload-time = "2026-06-02T11:54:15.436Z" },
+    { url = "https://files.pythonhosted.org/packages/75/b0/20a4546eb17f3b25d3c66df15810411c14ed5065bcfab50b53c96fb627b2/scikit_learn-1.9.0-cp314-cp314t-macosx_12_0_arm64.whl", hash = "sha256:ee1a8db2c18c08e34c7412d4b10be1cac214cd4ea7dc9715a6a327eb49a37c96", size = 8613528, upload-time = "2026-06-02T11:54:18.842Z" },
+    { url = "https://files.pythonhosted.org/packages/18/3c/e440e039bb82cd19004edaaad00acbde0fb9b461083c3ecf37941c557312/scikit_learn-1.9.0-cp314-cp314t-manylinux_2_27_aarch64.manylinux_2_28_aarch64.whl", hash = "sha256:147e9329ef0e39f75d4cffa02b2aa48d827832684926cd5210d9a2cb5c57246b", size = 8855050, upload-time = "2026-06-02T11:54:21.699Z" },
+    { url = "https://files.pythonhosted.org/packages/43/26/b341b8dab5998da6270a3a42c2152c578501354d36f944b5856757035ef8/scikit_learn-1.9.0-cp314-cp314t-manylinux_2_27_x86_64.manylinux_2_28_x86_64.whl", hash = "sha256:5bad8f8b9950321b54c965fdcbac6c6c55e79e16646b49977bcf3668d3870a1a", size = 9097190, upload-time = "2026-06-02T11:54:24.454Z" },
+    { url = "https://files.pythonhosted.org/packages/fb/de/b650b4d69b84468cfa2e28a3ff7b8103743029e6446ce1a97fe060ef688c/scikit_learn-1.9.0-cp314-cp314t-win_amd64.whl", hash = "sha256:78fc56eafd4edb9575d2d8950d1dd152061abb573341a1cb7e099fc40f6c6666", size = 8963204, upload-time = "2026-06-02T11:54:27.428Z" },
+    { url = "https://files.pythonhosted.org/packages/ee/f3/ff83d76d7418112e5a61326443cdda87be3545dd8d6599c95b2481a4419e/scikit_learn-1.9.0-cp314-cp314t-win_arm64.whl", hash = "sha256:051075bda8b7aab87b1906ab3d4740a1e1224a19d7b3781a576736edc94e76aa", size = 8222661, upload-time = "2026-06-02T11:54:30.192Z" },
+]
+
+[[package]]
+name = "scipy"
+version = "1.17.1"
+source = { registry = "https://pypi.org/simple" }
+resolution-markers = [
+    "python_full_version < '3.12' and sys_platform == 'linux'",
+    "python_full_version < '3.12' and sys_platform != 'linux'",
+]
+dependencies = [
+    { name = "numpy", version = "2.4.6", source = { registry = "https://pypi.org/simple" }, marker = "python_full_version < '3.12'" },
+]
+sdist = { url = "https://files.pythonhosted.org/packages/7a/97/5a3609c4f8d58b039179648e62dd220f89864f56f7357f5d4f45c29eb2cc/scipy-1.17.1.tar.gz", hash = "sha256:95d8e012d8cb8816c226aef832200b1d45109ed4464303e997c5b13122b297c0", size = 30573822, upload-time = "2026-02-23T00:26:24.851Z" }
+wheels = [
+    { url = "https://files.pythonhosted.org/packages/df/75/b4ce781849931fef6fd529afa6b63711d5a733065722d0c3e2724af9e40a/scipy-1.17.1-cp311-cp311-macosx_10_14_x86_64.whl", hash = "sha256:1f95b894f13729334fb990162e911c9e5dc1ab390c58aa6cbecb389c5b5e28ec", size = 31613675, upload-time = "2026-02-23T00:16:00.13Z" },
+    { url = "https://files.pythonhosted.org/packages/f7/58/bccc2861b305abdd1b8663d6130c0b3d7cc22e8d86663edbc8401bfd40d4/scipy-1.17.1-cp311-cp311-macosx_12_0_arm64.whl", hash = "sha256:e18f12c6b0bc5a592ed23d3f7b891f68fd7f8241d69b7883769eb5d5dfb52696", size = 28162057, upload-time = "2026-02-23T00:16:09.456Z" },
+    { url = "https://files.pythonhosted.org/packages/6d/ee/18146b7757ed4976276b9c9819108adbc73c5aad636e5353e20746b73069/scipy-1.17.1-cp311-cp311-macosx_14_0_arm64.whl", hash = "sha256:a3472cfbca0a54177d0faa68f697d8ba4c80bbdc19908c3465556d9f7efce9ee", size = 20334032, upload-time = "2026-02-23T00:16:17.358Z" },
+    { url = "https://files.pythonhosted.org/packages/ec/e6/cef1cf3557f0c54954198554a10016b6a03b2ec9e22a4e1df734936bd99c/scipy-1.17.1-cp311-cp311-macosx_14_0_x86_64.whl", hash = "sha256:766e0dc5a616d026a3a1cffa379af959671729083882f50307e18175797b3dfd", size = 22709533, upload-time = "2026-02-23T00:16:25.791Z" },
+    { url = "https://files.pythonhosted.org/packages/4d/60/8804678875fc59362b0fb759ab3ecce1f09c10a735680318ac30da8cd76b/scipy-1.17.1-cp311-cp311-manylinux_2_27_aarch64.manylinux_2_28_aarch64.whl", hash = "sha256:744b2bf3640d907b79f3fd7874efe432d1cf171ee721243e350f55234b4cec4c", size = 33062057, upload-time = "2026-02-23T00:16:36.931Z" },
+    { url = "https://files.pythonhosted.org/packages/09/7d/af933f0f6e0767995b4e2d705a0665e454d1c19402aa7e895de3951ebb04/scipy-1.17.1-cp311-cp311-manylinux_2_27_x86_64.manylinux_2_28_x86_64.whl", hash = "sha256:43af8d1f3bea642559019edfe64e9b11192a8978efbd1539d7bc2aaa23d92de4", size = 35349300, upload-time = "2026-02-23T00:16:49.108Z" },
+    { url = "https://files.pythonhosted.org/packages/b4/3d/7ccbbdcbb54c8fdc20d3b6930137c782a163fa626f0aef920349873421ba/scipy-1.17.1-cp311-cp311-musllinux_1_2_aarch64.whl", hash = "sha256:cd96a1898c0a47be4520327e01f874acfd61fb48a9420f8aa9f6483412ffa444", size = 35127333, upload-time = "2026-02-23T00:17:01.293Z" },
+    { url = "https://files.pythonhosted.org/packages/e8/19/f926cb11c42b15ba08e3a71e376d816ac08614f769b4f47e06c3580c836a/scipy-1.17.1-cp311-cp311-musllinux_1_2_x86_64.whl", hash = "sha256:4eb6c25dd62ee8d5edf68a8e1c171dd71c292fdae95d8aeb3dd7d7de4c364082", size = 37741314, upload-time = "2026-02-23T00:17:12.576Z" },
+    { url = "https://files.pythonhosted.org/packages/95/da/0d1df507cf574b3f224ccc3d45244c9a1d732c81dcb26b1e8a766ae271a8/scipy-1.17.1-cp311-cp311-win_amd64.whl", hash = "sha256:d30e57c72013c2a4fe441c2fcb8e77b14e152ad48b5464858e07e2ad9fbfceff", size = 36607512, upload-time = "2026-02-23T00:17:23.424Z" },
+    { url = "https://files.pythonhosted.org/packages/68/7f/bdd79ceaad24b671543ffe0ef61ed8e659440eb683b66f033454dcee90eb/scipy-1.17.1-cp311-cp311-win_arm64.whl", hash = "sha256:9ecb4efb1cd6e8c4afea0daa91a87fbddbce1b99d2895d151596716c0b2e859d", size = 24599248, upload-time = "2026-02-23T00:17:34.561Z" },
+    { url = "https://files.pythonhosted.org/packages/35/48/b992b488d6f299dbe3f11a20b24d3dda3d46f1a635ede1c46b5b17a7b163/scipy-1.17.1-cp312-cp312-macosx_10_14_x86_64.whl", hash = "sha256:35c3a56d2ef83efc372eaec584314bd0ef2e2f0d2adb21c55e6ad5b344c0dcb8", size = 31610954, upload-time = "2026-02-23T00:17:49.855Z" },
+    { url = "https://files.pythonhosted.org/packages/b2/02/cf107b01494c19dc100f1d0b7ac3cc08666e96ba2d64db7626066cee895e/scipy-1.17.1-cp312-cp312-macosx_12_0_arm64.whl", hash = "sha256:fcb310ddb270a06114bb64bbe53c94926b943f5b7f0842194d585c65eb4edd76", size = 28172662, upload-time = "2026-02-23T00:18:01.64Z" },
+    { url = "https://files.pythonhosted.org/packages/cf/a9/599c28631bad314d219cf9ffd40e985b24d603fc8a2f4ccc5ae8419a535b/scipy-1.17.1-cp312-cp312-macosx_14_0_arm64.whl", hash = "sha256:cc90d2e9c7e5c7f1a482c9875007c095c3194b1cfedca3c2f3291cdc2bc7c086", size = 20344366, upload-time = "2026-02-23T00:18:12.015Z" },
+    { url = "https://files.pythonhosted.org/packages/35/f5/906eda513271c8deb5af284e5ef0206d17a96239af79f9fa0aebfe0e36b4/scipy-1.17.1-cp312-cp312-macosx_14_0_x86_64.whl", hash = "sha256:c80be5ede8f3f8eded4eff73cc99a25c388ce98e555b17d31da05287015ffa5b", size = 22704017, upload-time = "2026-02-23T00:18:21.502Z" },
+    { url = "https://files.pythonhosted.org/packages/da/34/16f10e3042d2f1d6b66e0428308ab52224b6a23049cb2f5c1756f713815f/scipy-1.17.1-cp312-cp312-manylinux_2_27_aarch64.manylinux_2_28_aarch64.whl", hash = "sha256:e19ebea31758fac5893a2ac360fedd00116cbb7628e650842a6691ba7ca28a21", size = 32927842, upload-time = "2026-02-23T00:18:35.367Z" },
+    { url = "https://files.pythonhosted.org/packages/01/8e/1e35281b8ab6d5d72ebe9911edcdffa3f36b04ed9d51dec6dd140396e220/scipy-1.17.1-cp312-cp312-manylinux_2_27_x86_64.manylinux_2_28_x86_64.whl", hash = "sha256:02ae3b274fde71c5e92ac4d54bc06c42d80e399fec704383dcd99b301df37458", size = 35235890, upload-time = "2026-02-23T00:18:49.188Z" },
+    { url = "https://files.pythonhosted.org/packages/c5/5c/9d7f4c88bea6e0d5a4f1bc0506a53a00e9fcb198de372bfe4d3652cef482/scipy-1.17.1-cp312-cp312-musllinux_1_2_aarch64.whl", hash = "sha256:8a604bae87c6195d8b1045eddece0514d041604b14f2727bbc2b3020172045eb", size = 35003557, upload-time = "2026-02-23T00:18:54.74Z" },
+    { url = "https://files.pythonhosted.org/packages/65/94/7698add8f276dbab7a9de9fb6b0e02fc13ee61d51c7c3f85ac28b65e1239/scipy-1.17.1-cp312-cp312-musllinux_1_2_x86_64.whl", hash = "sha256:f590cd684941912d10becc07325a3eeb77886fe981415660d9265c4c418d0bea", size = 37625856, upload-time = "2026-02-23T00:19:00.307Z" },
+    { url = "https://files.pythonhosted.org/packages/a2/84/dc08d77fbf3d87d3ee27f6a0c6dcce1de5829a64f2eae85a0ecc1f0daa73/scipy-1.17.1-cp312-cp312-win_amd64.whl", hash = "sha256:41b71f4a3a4cab9d366cd9065b288efc4d4f3c0b37a91a8e0947fb5bd7f31d87", size = 36549682, upload-time = "2026-02-23T00:19:07.67Z" },
+    { url = "https://files.pythonhosted.org/packages/bc/98/fe9ae9ffb3b54b62559f52dedaebe204b408db8109a8c66fdd04869e6424/scipy-1.17.1-cp312-cp312-win_arm64.whl", hash = "sha256:f4115102802df98b2b0db3cce5cb9b92572633a1197c77b7553e5203f284a5b3", size = 24547340, upload-time = "2026-02-23T00:19:12.024Z" },
+    { url = "https://files.pythonhosted.org/packages/76/27/07ee1b57b65e92645f219b37148a7e7928b82e2b5dbeccecb4dff7c64f0b/scipy-1.17.1-cp313-cp313-macosx_10_14_x86_64.whl", hash = "sha256:5e3c5c011904115f88a39308379c17f91546f77c1667cea98739fe0fccea804c", size = 31590199, upload-time = "2026-02-23T00:19:17.192Z" },
+    { url = "https://files.pythonhosted.org/packages/ec/ae/db19f8ab842e9b724bf5dbb7db29302a91f1e55bc4d04b1025d6d605a2c5/scipy-1.17.1-cp313-cp313-macosx_12_0_arm64.whl", hash = "sha256:6fac755ca3d2c3edcb22f479fceaa241704111414831ddd3bc6056e18516892f", size = 28154001, upload-time = "2026-02-23T00:19:22.241Z" },
+    { url = "https://files.pythonhosted.org/packages/5b/58/3ce96251560107b381cbd6e8413c483bbb1228a6b919fa8652b0d4090e7f/scipy-1.17.1-cp313-cp313-macosx_14_0_arm64.whl", hash = "sha256:7ff200bf9d24f2e4d5dc6ee8c3ac64d739d3a89e2326ba68aaf6c4a2b838fd7d", size = 20325719, upload-time = "2026-02-23T00:19:26.329Z" },
+    { url = "https://files.pythonhosted.org/packages/b2/83/15087d945e0e4d48ce2377498abf5ad171ae013232ae31d06f336e64c999/scipy-1.17.1-cp313-cp313-macosx_14_0_x86_64.whl", hash = "sha256:4b400bdc6f79fa02a4d86640310dde87a21fba0c979efff5248908c6f15fad1b", size = 22683595, upload-time = "2026-02-23T00:19:30.304Z" },
+    { url = "https://files.pythonhosted.org/packages/b4/e0/e58fbde4a1a594c8be8114eb4aac1a55bcd6587047efc18a61eb1f5c0d30/scipy-1.17.1-cp313-cp313-manylinux_2_27_aarch64.manylinux_2_28_aarch64.whl", hash = "sha256:2b64ca7d4aee0102a97f3ba22124052b4bd2152522355073580bf4845e2550b6", size = 32896429, upload-time = "2026-02-23T00:19:35.536Z" },
+    { url = "https://files.pythonhosted.org/packages/f5/5f/f17563f28ff03c7b6799c50d01d5d856a1d55f2676f537ca8d28c7f627cd/scipy-1.17.1-cp313-cp313-manylinux_2_27_x86_64.manylinux_2_28_x86_64.whl", hash = "sha256:581b2264fc0aa555f3f435a5944da7504ea3a065d7029ad60e7c3d1ae09c5464", size = 35203952, upload-time = "2026-02-23T00:19:42.259Z" },
+    { url = "https://files.pythonhosted.org/packages/8d/a5/9afd17de24f657fdfe4df9a3f1ea049b39aef7c06000c13db1530d81ccca/scipy-1.17.1-cp313-cp313-musllinux_1_2_aarch64.whl", hash = "sha256:beeda3d4ae615106d7094f7e7cef6218392e4465cc95d25f900bebabfded0950", size = 34979063, upload-time = "2026-02-23T00:19:47.547Z" },
+    { url = "https://files.pythonhosted.org/packages/8b/13/88b1d2384b424bf7c924f2038c1c409f8d88bb2a8d49d097861dd64a57b2/scipy-1.17.1-cp313-cp313-musllinux_1_2_x86_64.whl", hash = "sha256:6609bc224e9568f65064cfa72edc0f24ee6655b47575954ec6339534b2798369", size = 37598449, upload-time = "2026-02-23T00:19:53.238Z" },
+    { url = "https://files.pythonhosted.org/packages/35/e5/d6d0e51fc888f692a35134336866341c08655d92614f492c6860dc45bb2c/scipy-1.17.1-cp313-cp313-win_amd64.whl", hash = "sha256:37425bc9175607b0268f493d79a292c39f9d001a357bebb6b88fdfaff13f6448", size = 36510943, upload-time = "2026-02-23T00:20:50.89Z" },
+    { url = "https://files.pythonhosted.org/packages/2a/fd/3be73c564e2a01e690e19cc618811540ba5354c67c8680dce3281123fb79/scipy-1.17.1-cp313-cp313-win_arm64.whl", hash = "sha256:5cf36e801231b6a2059bf354720274b7558746f3b1a4efb43fcf557ccd484a87", size = 24545621, upload-time = "2026-02-23T00:20:55.871Z" },
+    { url = "https://files.pythonhosted.org/packages/6f/6b/17787db8b8114933a66f9dcc479a8272e4b4da75fe03b0c282f7b0ade8cd/scipy-1.17.1-cp313-cp313t-macosx_10_14_x86_64.whl", hash = "sha256:d59c30000a16d8edc7e64152e30220bfbd724c9bbb08368c054e24c651314f0a", size = 31936708, upload-time = "2026-02-23T00:19:58.694Z" },
+    { url = "https://files.pythonhosted.org/packages/38/2e/524405c2b6392765ab1e2b722a41d5da33dc5c7b7278184a8ad29b6cb206/scipy-1.17.1-cp313-cp313t-macosx_12_0_arm64.whl", hash = "sha256:010f4333c96c9bb1a4516269e33cb5917b08ef2166d5556ca2fd9f082a9e6ea0", size = 28570135, upload-time = "2026-02-23T00:20:03.934Z" },
+    { url = "https://files.pythonhosted.org/packages/fd/c3/5bd7199f4ea8556c0c8e39f04ccb014ac37d1468e6cfa6a95c6b3562b76e/scipy-1.17.1-cp313-cp313t-macosx_14_0_arm64.whl", hash = "sha256:2ceb2d3e01c5f1d83c4189737a42d9cb2fc38a6eeed225e7515eef71ad301dce", size = 20741977, upload-time = "2026-02-23T00:20:07.935Z" },
+    { url = "https://files.pythonhosted.org/packages/d9/b8/8ccd9b766ad14c78386599708eb745f6b44f08400a5fd0ade7cf89b6fc93/scipy-1.17.1-cp313-cp313t-macosx_14_0_x86_64.whl", hash = "sha256:844e165636711ef41f80b4103ed234181646b98a53c8f05da12ca5ca289134f6", size = 23029601, upload-time = "2026-02-23T00:20:12.161Z" },
+    { url = "https://files.pythonhosted.org/packages/6d/a0/3cb6f4d2fb3e17428ad2880333cac878909ad1a89f678527b5328b93c1d4/scipy-1.17.1-cp313-cp313t-manylinux_2_27_aarch64.manylinux_2_28_aarch64.whl", hash = "sha256:158dd96d2207e21c966063e1635b1063cd7787b627b6f07305315dd73d9c679e", size = 33019667, upload-time = "2026-02-23T00:20:17.208Z" },
+    { url = "https://files.pythonhosted.org/packages/f3/c3/2d834a5ac7bf3a0c806ad1508efc02dda3c8c61472a56132d7894c312dea/scipy-1.17.1-cp313-cp313t-manylinux_2_27_x86_64.manylinux_2_28_x86_64.whl", hash = "sha256:74cbb80d93260fe2ffa334efa24cb8f2f0f622a9b9febf8b483c0b865bfb3475", size = 35264159, upload-time = "2026-02-23T00:20:23.087Z" },
+    { url = "https://files.pythonhosted.org/packages/4d/77/d3ed4becfdbd217c52062fafe35a72388d1bd82c2d0ba5ca19d6fcc93e11/scipy-1.17.1-cp313-cp313t-musllinux_1_2_aarch64.whl", hash = "sha256:dbc12c9f3d185f5c737d801da555fb74b3dcfa1a50b66a1a93e09190f41fab50", size = 35102771, upload-time = "2026-02-23T00:20:28.636Z" },
+    { url = "https://files.pythonhosted.org/packages/bd/12/d19da97efde68ca1ee5538bb261d5d2c062f0c055575128f11a2730e3ac1/scipy-1.17.1-cp313-cp313t-musllinux_1_2_x86_64.whl", hash = "sha256:94055a11dfebe37c656e70317e1996dc197e1a15bbcc351bcdd4610e128fe1ca", size = 37665910, upload-time = "2026-02-23T00:20:34.743Z" },
+    { url = "https://files.pythonhosted.org/packages/06/1c/1172a88d507a4baaf72c5a09bb6c018fe2ae0ab622e5830b703a46cc9e44/scipy-1.17.1-cp313-cp313t-win_amd64.whl", hash = "sha256:e30bdeaa5deed6bc27b4cc490823cd0347d7dae09119b8803ae576ea0ce52e4c", size = 36562980, upload-time = "2026-02-23T00:20:40.575Z" },
+    { url = "https://files.pythonhosted.org/packages/70/b0/eb757336e5a76dfa7911f63252e3b7d1de00935d7705cf772db5b45ec238/scipy-1.17.1-cp313-cp313t-win_arm64.whl", hash = "sha256:a720477885a9d2411f94a93d16f9d89bad0f28ca23c3f8daa521e2dcc3f44d49", size = 24856543, upload-time = "2026-02-23T00:20:45.313Z" },
+    { url = "https://files.pythonhosted.org/packages/cf/83/333afb452af6f0fd70414dc04f898647ee1423979ce02efa75c3b0f2c28e/scipy-1.17.1-cp314-cp314-macosx_10_14_x86_64.whl", hash = "sha256:a48a72c77a310327f6a3a920092fa2b8fd03d7deaa60f093038f22d98e096717", size = 31584510, upload-time = "2026-02-23T00:21:01.015Z" },
+    { url = "https://files.pythonhosted.org/packages/ed/a6/d05a85fd51daeb2e4ea71d102f15b34fedca8e931af02594193ae4fd25f7/scipy-1.17.1-cp314-cp314-macosx_12_0_arm64.whl", hash = "sha256:45abad819184f07240d8a696117a7aacd39787af9e0b719d00285549ed19a1e9", size = 28170131, upload-time = "2026-02-23T00:21:05.888Z" },
+    { url = "https://files.pythonhosted.org/packages/db/7b/8624a203326675d7746a254083a187398090a179335b2e4a20e2ddc46e83/scipy-1.17.1-cp314-cp314-macosx_14_0_arm64.whl", hash = "sha256:3fd1fcdab3ea951b610dc4cef356d416d5802991e7e32b5254828d342f7b7e0b", size = 20342032, upload-time = "2026-02-23T00:21:09.904Z" },
+    { url = "https://files.pythonhosted.org/packages/c9/35/2c342897c00775d688d8ff3987aced3426858fd89d5a0e26e020b660b301/scipy-1.17.1-cp314-cp314-macosx_14_0_x86_64.whl", hash = "sha256:7bdf2da170b67fdf10bca777614b1c7d96ae3ca5794fd9587dce41eb2966e866", size = 22678766, upload-time = "2026-02-23T00:21:14.313Z" },
+    { url = "https://files.pythonhosted.org/packages/ef/f2/7cdb8eb308a1a6ae1e19f945913c82c23c0c442a462a46480ce487fdc0ac/scipy-1.17.1-cp314-cp314-manylinux_2_27_aarch64.manylinux_2_28_aarch64.whl", hash = "sha256:adb2642e060a6549c343603a3851ba76ef0b74cc8c079a9a58121c7ec9fe2350", size = 32957007, upload-time = "2026-02-23T00:21:19.663Z" },
+    { url = "https://files.pythonhosted.org/packages/0b/2e/7eea398450457ecb54e18e9d10110993fa65561c4f3add5e8eccd2b9cd41/scipy-1.17.1-cp314-cp314-manylinux_2_27_x86_64.manylinux_2_28_x86_64.whl", hash = "sha256:eee2cfda04c00a857206a4330f0c5e3e56535494e30ca445eb19ec624ae75118", size = 35221333, upload-time = "2026-02-23T00:21:25.278Z" },
+    { url = "https://files.pythonhosted.org/packages/d9/77/5b8509d03b77f093a0d52e606d3c4f79e8b06d1d38c441dacb1e26cacf46/scipy-1.17.1-cp314-cp314-musllinux_1_2_aarch64.whl", hash = "sha256:d2650c1fb97e184d12d8ba010493ee7b322864f7d3d00d3f9bb97d9c21de4068", size = 35042066, upload-time = "2026-02-23T00:21:31.358Z" },
+    { url = "https://files.pythonhosted.org/packages/f9/df/18f80fb99df40b4070328d5ae5c596f2f00fffb50167e31439e932f29e7d/scipy-1.17.1-cp314-cp314-musllinux_1_2_x86_64.whl", hash = "sha256:08b900519463543aa604a06bec02461558a6e1cef8fdbb8098f77a48a83c8118", size = 37612763, upload-time = "2026-02-23T00:21:37.247Z" },
+    { url = "https://files.pythonhosted.org/packages/4b/39/f0e8ea762a764a9dc52aa7dabcfad51a354819de1f0d4652b6a1122424d6/scipy-1.17.1-cp314-cp314-win_amd64.whl", hash = "sha256:3877ac408e14da24a6196de0ddcace62092bfc12a83823e92e49e40747e52c19", size = 37290984, upload-time = "2026-02-23T00:22:35.023Z" },
+    { url = "https://files.pythonhosted.org/packages/7c/56/fe201e3b0f93d1a8bcf75d3379affd228a63d7e2d80ab45467a74b494947/scipy-1.17.1-cp314-cp314-win_arm64.whl", hash = "sha256:f8885db0bc2bffa59d5c1b72fad7a6a92d3e80e7257f967dd81abb553a90d293", size = 25192877, upload-time = "2026-02-23T00:22:39.798Z" },
+    { url = "https://files.pythonhosted.org/packages/96/ad/f8c414e121f82e02d76f310f16db9899c4fcde36710329502a6b2a3c0392/scipy-1.17.1-cp314-cp314t-macosx_10_14_x86_64.whl", hash = "sha256:1cc682cea2ae55524432f3cdff9e9a3be743d52a7443d0cba9017c23c87ae2f6", size = 31949750, upload-time = "2026-02-23T00:21:42.289Z" },
+    { url = "https://files.pythonhosted.org/packages/7c/b0/c741e8865d61b67c81e255f4f0a832846c064e426636cd7de84e74d209be/scipy-1.17.1-cp314-cp314t-macosx_12_0_arm64.whl", hash = "sha256:2040ad4d1795a0ae89bfc7e8429677f365d45aa9fd5e4587cf1ea737f927b4a1", size = 28585858, upload-time = "2026-02-23T00:21:47.706Z" },
+    { url = "https://files.pythonhosted.org/packages/ed/1b/3985219c6177866628fa7c2595bfd23f193ceebbe472c98a08824b9466ff/scipy-1.17.1-cp314-cp314t-macosx_14_0_arm64.whl", hash = "sha256:131f5aaea57602008f9822e2115029b55d4b5f7c070287699fe45c661d051e39", size = 20757723, upload-time = "2026-02-23T00:21:52.039Z" },
+    { url = "https://files.pythonhosted.org/packages/c0/19/2a04aa25050d656d6f7b9e7b685cc83d6957fb101665bfd9369ca6534563/scipy-1.17.1-cp314-cp314t-macosx_14_0_x86_64.whl", hash = "sha256:9cdc1a2fcfd5c52cfb3045feb399f7b3ce822abdde3a193a6b9a60b3cb5854ca", size = 23043098, upload-time = "2026-02-23T00:21:56.185Z" },
+    { url = "https://files.pythonhosted.org/packages/86/f1/3383beb9b5d0dbddd030335bf8a8b32d4317185efe495374f134d8be6cce/scipy-1.17.1-cp314-cp314t-manylinux_2_27_aarch64.manylinux_2_28_aarch64.whl", hash = "sha256:6e3dcd57ab780c741fde8dc68619de988b966db759a3c3152e8e9142c26295ad", size = 33030397, upload-time = "2026-02-23T00:22:01.404Z" },
+    { url = "https://files.pythonhosted.org/packages/41/68/8f21e8a65a5a03f25a79165ec9d2b28c00e66dc80546cf5eb803aeeff35b/scipy-1.17.1-cp314-cp314t-manylinux_2_27_x86_64.manylinux_2_28_x86_64.whl", hash = "sha256:a9956e4d4f4a301ebf6cde39850333a6b6110799d470dbbb1e25326ac447f52a", size = 35281163, upload-time = "2026-02-23T00:22:07.024Z" },
+    { url = "https://files.pythonhosted.org/packages/84/8d/c8a5e19479554007a5632ed7529e665c315ae7492b4f946b0deb39870e39/scipy-1.17.1-cp314-cp314t-musllinux_1_2_aarch64.whl", hash = "sha256:a4328d245944d09fd639771de275701ccadf5f781ba0ff092ad141e017eccda4", size = 35116291, upload-time = "2026-02-23T00:22:12.585Z" },
+    { url = "https://files.pythonhosted.org/packages/52/52/e57eceff0e342a1f50e274264ed47497b59e6a4e3118808ee58ddda7b74a/scipy-1.17.1-cp314-cp314t-musllinux_1_2_x86_64.whl", hash = "sha256:a77cbd07b940d326d39a1d1b37817e2ee4d79cb30e7338f3d0cddffae70fcaa2", size = 37682317, upload-time = "2026-02-23T00:22:18.513Z" },
+    { url = "https://files.pythonhosted.org/packages/11/2f/b29eafe4a3fbc3d6de9662b36e028d5f039e72d345e05c250e121a230dd4/scipy-1.17.1-cp314-cp314t-win_amd64.whl", hash = "sha256:eb092099205ef62cd1782b006658db09e2fed75bffcae7cc0d44052d8aa0f484", size = 37345327, upload-time = "2026-02-23T00:22:24.442Z" },
+    { url = "https://files.pythonhosted.org/packages/07/39/338d9219c4e87f3e708f18857ecd24d22a0c3094752393319553096b98af/scipy-1.17.1-cp314-cp314t-win_arm64.whl", hash = "sha256:200e1050faffacc162be6a486a984a0497866ec54149a01270adc8a59b7c7d21", size = 25489165, upload-time = "2026-02-23T00:22:29.563Z" },
+]
+
+[[package]]
+name = "scipy"
+version = "1.18.0"
+source = { registry = "https://pypi.org/simple" }
+resolution-markers = [
+    "python_full_version >= '3.15' and sys_platform == 'linux'",
+    "python_full_version >= '3.13' and python_full_version < '3.15' and sys_platform == 'linux'",
+    "python_full_version == '3.12.*' and sys_platform == 'linux'",
+    "python_full_version >= '3.15' and sys_platform != 'linux'",
+    "python_full_version >= '3.13' and python_full_version < '3.15' and sys_platform != 'linux'",
+    "python_full_version == '3.12.*' and sys_platform != 'linux'",
+]
+dependencies = [
+    { name = "numpy", version = "2.5.0", source = { registry = "https://pypi.org/simple" }, marker = "python_full_version >= '3.12'" },
+]
+sdist = { url = "https://files.pythonhosted.org/packages/a7/25/c2700dfaf6442b4effaa91af24ebce5dc9d31bb4a69706313aae70d72cd0/scipy-1.18.0.tar.gz", hash = "sha256:67b2ad2ad54c72ca6d04975a9b2df8c3638c34ddd5b28738e94fc2b57929d378", size = 30774447, upload-time = "2026-06-19T15:01:43.456Z" }
+wheels = [
+    { url = "https://files.pythonhosted.org/packages/6a/19/ca10ead60b0acc80b2b833c2c4a4f2ff753d0f58b811f70d911c7e94a25c/scipy-1.18.0-cp312-cp312-macosx_10_15_x86_64.whl", hash = "sha256:7bd21faaf5a1a3b2eff922d02db5f191b99a6518db9078a8fb23169f6d22259a", size = 31056519, upload-time = "2026-06-19T14:59:45.203Z" },
+    { url = "https://files.pythonhosted.org/packages/96/72/1e6442a00cd2924d361aa1b642ab6373ec35c6fabf311a760be9f76e0f13/scipy-1.18.0-cp312-cp312-macosx_12_0_arm64.whl", hash = "sha256:265915e79107de9f946b855e50d7470d5893ec3f54b342e1aa6201cbdcd8bb6b", size = 28681889, upload-time = "2026-06-19T14:59:48.103Z" },
+    { url = "https://files.pythonhosted.org/packages/9b/2d/11dd93d21e147a73ba22bd75c0b9208d3a2e0ec76d53170ce7d9029b1015/scipy-1.18.0-cp312-cp312-macosx_14_0_arm64.whl", hash = "sha256:9ab7b758be6940954a713ee466e2043e9f6e2ed965c1fce5c91039f4be3d90a9", size = 20423580, upload-time = "2026-06-19T14:59:50.665Z" },
+    { url = "https://files.pythonhosted.org/packages/9c/01/93552f75e0d2a7dd115a45e59209c51e8d514daff02fc887d2623be06fe1/scipy-1.18.0-cp312-cp312-macosx_14_0_x86_64.whl", hash = "sha256:97b6cddaaee0a779ef6b5ca83c9604b27cc16b2b8fc22c142652df8793319fb8", size = 23054441, upload-time = "2026-06-19T14:59:53.564Z" },
+    { url = "https://files.pythonhosted.org/packages/3c/23/21f5e703643d66f21faa6b4c73195bfcad70c55efcb4f1ab327cd7c4101a/scipy-1.18.0-cp312-cp312-manylinux_2_27_aarch64.manylinux_2_28_aarch64.whl", hash = "sha256:52a96e21517c7292375c0e27dd796a811f03fcea5fd4d108fdfea8145dcf17ab", size = 33968720, upload-time = "2026-06-19T14:59:56.415Z" },
+    { url = "https://files.pythonhosted.org/packages/dd/aa/1b939f6c67ed68635bb538e6752d3dacc02f66535182e939a89581a44e9c/scipy-1.18.0-cp312-cp312-manylinux_2_27_x86_64.manylinux_2_28_x86_64.whl", hash = "sha256:1f55797419e16e7f30cf88ffb3113ce0467f00cfe3f70d5c281730b21769bfc2", size = 35287115, upload-time = "2026-06-19T14:59:59.411Z" },
+    { url = "https://files.pythonhosted.org/packages/b6/ff/eec46be7e9234208f801062b53e1983085eddebd693f6c9bfb03b459830d/scipy-1.18.0-cp312-cp312-musllinux_1_2_aarch64.whl", hash = "sha256:ad033410e2e0672ffdc1042110cef20e1c46f8fd0616cee1d44d8d58fad8fc11", size = 35577989, upload-time = "2026-06-19T15:00:02.235Z" },
+    { url = "https://files.pythonhosted.org/packages/84/ca/210d4759c7210bb7d269437421959b39a33434e2776b60c5cb8a763bb30a/scipy-1.18.0-cp312-cp312-musllinux_1_2_x86_64.whl", hash = "sha256:4a55985d54c769c872e64b7f4c8a81cc30ef700cc04296abbbf3705439c126de", size = 37421717, upload-time = "2026-06-19T15:00:05.102Z" },
+    { url = "https://files.pythonhosted.org/packages/2b/54/9a9edb45345bd6744da5ddfb6628e5d5185920494c6a67ec45b6381004cb/scipy-1.18.0-cp312-cp312-win_amd64.whl", hash = "sha256:71ccc8faa2dd16ac310233203474a8b5cb67f10dedd54a3116d34943f4b19132", size = 36597428, upload-time = "2026-06-19T15:00:08.112Z" },
+    { url = "https://files.pythonhosted.org/packages/99/0e/33f32a2a58987e26aec0f7df252cbbad1e90ae77bdbc76f40dd4ed0cf0ea/scipy-1.18.0-cp312-cp312-win_arm64.whl", hash = "sha256:d88363fd9d8fbd3511bd273f1a49efb2a540773ddf92a91d57498ce7dd7f3e76", size = 24351481, upload-time = "2026-06-19T15:00:11.103Z" },
+    { url = "https://files.pythonhosted.org/packages/05/52/9c0136c2de7ae0779b7b366447766cec6d9f0702c56bb8ffeb04c8fd3af4/scipy-1.18.0-cp313-cp313-macosx_10_15_x86_64.whl", hash = "sha256:09143f676d157d9f546d663504ef9c1becb819824f1afc018814176411942446", size = 31036107, upload-time = "2026-06-19T15:00:14.03Z" },
+    { url = "https://files.pythonhosted.org/packages/02/73/0291a64843270f4efb86cdcf2ee0f2048631b65ec6b405398b2b4dbf11bf/scipy-1.18.0-cp313-cp313-macosx_12_0_arm64.whl", hash = "sha256:5efe260f69417b97ddae455bfb5a95e8359f7f66ad7fa9522a60feb66f169520", size = 28663303, upload-time = "2026-06-19T15:00:16.819Z" },
+    { url = "https://files.pythonhosted.org/packages/d3/0f/10ffa0b697a572f4e0d48b92a88895d366422f019f723e7e14a84c050dac/scipy-1.18.0-cp313-cp313-macosx_14_0_arm64.whl", hash = "sha256:68363b7eaacd8b5dd426df56d782cc156468ac79a127a1b87ca597d6e2e82197", size = 20404960, upload-time = "2026-06-19T15:00:19.635Z" },
+    { url = "https://files.pythonhosted.org/packages/7e/d2/e896cea21ba8edd6c81d4c55b1ffcc717e79698dcbebf9641b4cfb4c6622/scipy-1.18.0-cp313-cp313-macosx_14_0_x86_64.whl", hash = "sha256:c5557d8be5da8e41353fcd4d21491fdbab83b062fc579e94dc09a7c8ab4f669b", size = 23034074, upload-time = "2026-06-19T15:00:22.107Z" },
+    { url = "https://files.pythonhosted.org/packages/ea/b2/e83ea34279a52c03374477c74006256ec78df65fc877baa4617d6de1d202/scipy-1.18.0-cp313-cp313-manylinux_2_27_aarch64.manylinux_2_28_aarch64.whl", hash = "sha256:0d13bca67c096d89fb95ced0d8921807300fce0275643aef9533cc63a0773468", size = 33942038, upload-time = "2026-06-19T15:00:24.964Z" },
+    { url = "https://files.pythonhosted.org/packages/f6/af/e8fe5fb136f51e2b01678b92cb4106d10d8cd68ec147ead2e7cb0ac75398/scipy-1.18.0-cp313-cp313-manylinux_2_27_x86_64.manylinux_2_28_x86_64.whl", hash = "sha256:a46f9273dbd0eb1cefba61c9b8648b4dfe3cbc14a080176f9a73e44b8336dc7f", size = 35266390, upload-time = "2026-06-19T15:00:28.059Z" },
+    { url = "https://files.pythonhosted.org/packages/3a/49/2c5cbb907b56695fc67517811d1db234dfd83381a84814ec220aded2794d/scipy-1.18.0-cp313-cp313-musllinux_1_2_aarch64.whl", hash = "sha256:5aba46108853ddfc77906b6557aac839d2b52e900c1d72a1180adaaab58d265f", size = 35551324, upload-time = "2026-06-19T15:00:31.014Z" },
+    { url = "https://files.pythonhosted.org/packages/bb/73/eda39f7a2d306ff0ffc574afd13c0bbb6d10a603d9a413998ee269487a80/scipy-1.18.0-cp313-cp313-musllinux_1_2_x86_64.whl", hash = "sha256:b6f758e35f12757b5d95c00bc6de2438e229c2664b7a92e96f205959d9f2dfa4", size = 37404785, upload-time = "2026-06-19T15:00:34.072Z" },
+    { url = "https://files.pythonhosted.org/packages/b7/d2/ae881ee28d014f38e0ccbfd974a06a919ba9af34f1f74bf42b5301891d63/scipy-1.18.0-cp313-cp313-win_amd64.whl", hash = "sha256:1afac4a847207c7ff8efd321734a50b06d0280b3b2a2c0fc2f413101747ad7c7", size = 36554943, upload-time = "2026-06-19T15:00:36.903Z" },
+    { url = "https://files.pythonhosted.org/packages/70/3a/21154e2d54eb3639c6bf4dbae2e531c68356bfe95990daa30df33b30d556/scipy-1.18.0-cp313-cp313-win_arm64.whl", hash = "sha256:c5dbddf60e58c2312316d097271a8e73d40eaf2eabfa4d95ed7d3695bbf2ce7b", size = 24350911, upload-time = "2026-06-19T15:00:40.062Z" },
+    { url = "https://files.pythonhosted.org/packages/78/b5/915a19b3de2f7430062b509653563db1633ddbb6f021b06731521115d4e2/scipy-1.18.0-cp314-cp314-macosx_10_15_x86_64.whl", hash = "sha256:4c256ee70c0d1a8a2ace807e199ccd4e3f57037433842abb3fb36bc17eaa9578", size = 31036253, upload-time = "2026-06-19T15:00:43.216Z" },
+    { url = "https://files.pythonhosted.org/packages/d7/88/b72def7262e150d16be13fca37a96481138d624e700340bc3362a7588929/scipy-1.18.0-cp314-cp314-macosx_12_0_arm64.whl", hash = "sha256:2ef3abc54a4ffc53765374b0d5728532dfdd2585ed23f6b11c206a1f0b1b9af8", size = 28673758, upload-time = "2026-06-19T15:00:46.663Z" },
+    { url = "https://files.pythonhosted.org/packages/91/02/2e636a61a525632c373cf6a9c24442a3ffb79e364d38e98b32042964ac32/scipy-1.18.0-cp314-cp314-macosx_14_0_arm64.whl", hash = "sha256:f2a6af57bd9e4a75d70e4117e78a1bbee84f79ae3fbb6d0111005d6ebcc4cb8d", size = 20415514, upload-time = "2026-06-19T15:00:49.399Z" },
+    { url = "https://files.pythonhosted.org/packages/c9/b6/2135974442f6aba159d9d39d774a1c8cb19947016725d69fecc685df45bf/scipy-1.18.0-cp314-cp314-macosx_14_0_x86_64.whl", hash = "sha256:3f1ac564d3bf6c03d861d2cd87a1bea0da2887136f7fb1bf519c05a8971452d6", size = 23034398, upload-time = "2026-06-19T15:00:51.941Z" },
+    { url = "https://files.pythonhosted.org/packages/f6/e6/ba89ec5abf6ee9257c0d1ec985573f3ae32742c24bc03e016388a40b1b15/scipy-1.18.0-cp314-cp314-manylinux_2_27_aarch64.manylinux_2_28_aarch64.whl", hash = "sha256:40395a5fcd1abee49a5c7aaa98c29db393eedc835138560a588c47ec16156690", size = 33998032, upload-time = "2026-06-19T15:00:54.838Z" },
+    { url = "https://files.pythonhosted.org/packages/7f/c4/bc41eb19b0fd0db868f4132920879019318d80cc522ad8f2bca4611af808/scipy-1.18.0-cp314-cp314-manylinux_2_27_x86_64.manylinux_2_28_x86_64.whl", hash = "sha256:8ca01e8ae69f1b18e9a58d91afead31be3cef0dd905a10249dac559ee15460a0", size = 35283333, upload-time = "2026-06-19T15:00:58.152Z" },
+    { url = "https://files.pythonhosted.org/packages/53/a4/cbdeef6eb3830a8462a9d4ada814de5fc984345cc9ecf17cbec51a036f1e/scipy-1.18.0-cp314-cp314-musllinux_1_2_aarch64.whl", hash = "sha256:7a7f3b01647384dbc3a711e8c6778e0aabbe93959249fef5c7393396bcac0867", size = 35610216, upload-time = "2026-06-19T15:01:01.155Z" },
+    { url = "https://files.pythonhosted.org/packages/80/4d/b2b82502b65f661d1b789c1665dcdf315d5f12194e06fc0b37946294ebae/scipy-1.18.0-cp314-cp314-musllinux_1_2_x86_64.whl", hash = "sha256:6aa94e78ec192a30063a5e72e561c28af769dc311190b24fe91774eff1969709", size = 37418960, upload-time = "2026-06-19T15:01:04.155Z" },
+    { url = "https://files.pythonhosted.org/packages/93/3e/902d836831474b0ab5a37d16404f7bc5fafd9efba632890e271ba952635f/scipy-1.18.0-cp314-cp314-win_amd64.whl", hash = "sha256:2d8bbdc6c817f5b4006a54d799d4f5bab6f910193cbb9a1ff310833d4d270f61", size = 37288845, upload-time = "2026-06-19T15:01:07.822Z" },
+    { url = "https://files.pythonhosted.org/packages/b6/43/8d73b337a3bdb14daa0314f0434210747c02d79d729ce1777574a817dcf6/scipy-1.18.0-cp314-cp314-win_arm64.whl", hash = "sha256:18e9575f1569b2c54174e6159d32942e03731177f63dce7975f0a0c88d102f5b", size = 24988971, upload-time = "2026-06-19T15:01:11.076Z" },
+    { url = "https://files.pythonhosted.org/packages/b4/b4/f11918b0508a2787031a0499a03fbe3546f3bb5ca05d01038c45b278c09a/scipy-1.18.0-cp314-cp314t-macosx_10_15_x86_64.whl", hash = "sha256:f351e0dd702687d12a402b867a1b4146a256923e1c38317cbc472f6372b94707", size = 31399325, upload-time = "2026-06-19T15:01:13.723Z" },
+    { url = "https://files.pythonhosted.org/packages/7b/d1/1f287b57c0ff0ee5185dff3946d92c8017d39b0e431f0ae79a3ff1859512/scipy-1.18.0-cp314-cp314t-macosx_12_0_arm64.whl", hash = "sha256:7c7a51b33ce387193c97f228320cf8e87361daa1bba750638677729598b3e677", size = 29092110, upload-time = "2026-06-19T15:01:16.908Z" },
+    { url = "https://files.pythonhosted.org/packages/ff/1a/7b74eb6c392fdcb27d414c0e7558a6d0231eb3b6d73571f479bb81ea8794/scipy-1.18.0-cp314-cp314t-macosx_14_0_arm64.whl", hash = "sha256:84031d7b052a54fae2f8632e0ec802073d385476eb9a63079bce6e23ef9283d4", size = 20833811, upload-time = "2026-06-19T15:01:20.488Z" },
+    { url = "https://files.pythonhosted.org/packages/7c/ad/f3941716320a7b9cb4d68734a903b45fe16eff5fb7da7e16f2e619304979/scipy-1.18.0-cp314-cp314t-macosx_14_0_x86_64.whl", hash = "sha256:56abf29a7c067dde59be8b9a22d606a4ea1b2f2a4b756d9d903c62818f5dacce", size = 23396644, upload-time = "2026-06-19T15:01:23.364Z" },
+    { url = "https://files.pythonhosted.org/packages/22/22/1446b62ffe07f9719b7d9b1b6a4e05a772833ae8f441fe4c22c34c9b250f/scipy-1.18.0-cp314-cp314t-manylinux_2_27_aarch64.manylinux_2_28_aarch64.whl", hash = "sha256:1ad44305cfa24b1ba5803cbbebf033590ccbac1aa5d612d727b785325ab408b0", size = 34079318, upload-time = "2026-06-19T15:01:26.002Z" },
+    { url = "https://files.pythonhosted.org/packages/56/3b/b87da667098bb470fa30c7011b0ba351ee976dd395c78798c66e941665a3/scipy-1.18.0-cp314-cp314t-manylinux_2_27_x86_64.manylinux_2_28_x86_64.whl", hash = "sha256:945c1761b93f38d7f99ae81ae80c63e621471608c7eeead563f6df025585cd58", size = 35324320, upload-time = "2026-06-19T15:01:28.881Z" },
+    { url = "https://files.pythonhosted.org/packages/f8/a1/c7932f91909759b0267f75fdea34e91309f96b895757534b76a90b6b4344/scipy-1.18.0-cp314-cp314t-musllinux_1_2_aarch64.whl", hash = "sha256:1a4441f15d620578772a49e5ab48c0ee1f7a0220e387110283062729136b2553", size = 35699541, upload-time = "2026-06-19T15:01:31.968Z" },
+    { url = "https://files.pythonhosted.org/packages/f7/86/5185061a1fcc41d18c5dc2463969b3a3964b31d9ac67b2fb05d4c7ff7670/scipy-1.18.0-cp314-cp314t-musllinux_1_2_x86_64.whl", hash = "sha256:9aac6192fac56bf2ca534389d24623f07b39ff83317d58287285e7fbd622ff76", size = 37472480, upload-time = "2026-06-19T15:01:35.136Z" },
+    { url = "https://files.pythonhosted.org/packages/31/8e/f04c68e39919a010d34f2ee1367fd705b0a25a02f609d755f0bfbc0a15fc/scipy-1.18.0-cp314-cp314t-win_amd64.whl", hash = "sha256:e40baea28ae7f5475c779741e2d90b1247c78531207b49c7030e698ff81cee3f", size = 37365390, upload-time = "2026-06-19T15:01:38.091Z" },
+    { url = "https://files.pythonhosted.org/packages/d5/19/969dc072906c84dd0a3b05dcf57ea750936087d7873549e408b35cfc3f97/scipy-1.18.0-cp314-cp314t-win_arm64.whl", hash = "sha256:368e0a705903c466aa5f08eefb39e6b1b6b2d659e7352a31fd9e2438365be0f8", size = 25279661, upload-time = "2026-06-19T15:01:40.817Z" },
+]
+
+[[package]]
+name = "setuptools"
+version = "81.0.0"
+source = { registry = "https://pypi.org/simple" }
+sdist = { url = "https://files.pythonhosted.org/packages/0d/1c/73e719955c59b8e424d015ab450f51c0af856ae46ea2da83eba51cc88de1/setuptools-81.0.0.tar.gz", hash = "sha256:487b53915f52501f0a79ccfd0c02c165ffe06631443a886740b91af4b7a5845a", size = 1198299, upload-time = "2026-02-06T21:10:39.601Z" }
+wheels = [
+    { url = "https://files.pythonhosted.org/packages/e1/e3/c164c88b2e5ce7b24d667b9bd83589cf4f3520d97cad01534cd3c4f55fdb/setuptools-81.0.0-py3-none-any.whl", hash = "sha256:fdd925d5c5d9f62e4b74b30d6dd7828ce236fd6ed998a08d81de62ce5a6310d6", size = 1062021, upload-time = "2026-02-06T21:10:37.175Z" },
+]
+
+[[package]]
+name = "six"
+version = "1.17.0"
+source = { registry = "https://pypi.org/simple" }
+sdist = { url = "https://files.pythonhosted.org/packages/94/e7/b2c673351809dca68a0e064b6af791aa332cf192da575fd474ed7d6f16a2/six-1.17.0.tar.gz", hash = "sha256:ff70335d468e7eb6ec65b95b99d3a2836546063f63acc5171de367e834932a81", size = 34031, upload-time = "2024-12-04T17:35:28.174Z" }
+wheels = [
+    { url = "https://files.pythonhosted.org/packages/b7/ce/149a00dd41f10bc29e5921b496af8b574d8413afcd5e30dfa0ed46c2cc5e/six-1.17.0-py2.py3-none-any.whl", hash = "sha256:4721f391ed90541fddacab5acf947aa0d3dc7d27b2e1e8eda2be8970586c3274", size = 11050, upload-time = "2024-12-04T17:35:26.475Z" },
+]
+
+[[package]]
+name = "sympy"
+version = "1.14.0"
+source = { registry = "https://pypi.org/simple" }
+dependencies = [
+    { name = "mpmath" },
+]
+sdist = { url = "https://files.pythonhosted.org/packages/83/d3/803453b36afefb7c2bb238361cd4ae6125a569b4db67cd9e79846ba2d68c/sympy-1.14.0.tar.gz", hash = "sha256:d3d3fe8df1e5a0b42f0e7bdf50541697dbe7d23746e894990c030e2b05e72517", size = 7793921, upload-time = "2025-04-27T18:05:01.611Z" }
+wheels = [
+    { url = "https://files.pythonhosted.org/packages/a2/09/77d55d46fd61b4a135c444fc97158ef34a095e5681d0a6c10b75bf356191/sympy-1.14.0-py3-none-any.whl", hash = "sha256:e091cc3e99d2141a0ba2847328f5479b05d94a6635cb96148ccb3f34671bd8f5", size = 6299353, upload-time = "2025-04-27T18:04:59.103Z" },
+]
+
+[[package]]
+name = "tensorboard"
+version = "2.20.0"
+source = { registry = "https://pypi.org/simple" }
+dependencies = [
+    { name = "absl-py" },
+    { name = "grpcio" },
+    { name = "markdown" },
+    { name = "numpy", version = "2.4.6", source = { registry = "https://pypi.org/simple" }, marker = "python_full_version < '3.12'" },
+    { name = "numpy", version = "2.5.0", source = { registry = "https://pypi.org/simple" }, marker = "python_full_version >= '3.12'" },
+    { name = "packaging" },
+    { name = "pillow" },
+    { name = "protobuf" },
+    { name = "setuptools" },
+    { name = "tensorboard-data-server" },
+    { name = "werkzeug" },
+]
+wheels = [
+    { url = "https://files.pythonhosted.org/packages/9c/d9/a5db55f88f258ac669a92858b70a714bbbd5acd993820b41ec4a96a4d77f/tensorboard-2.20.0-py3-none-any.whl", hash = "sha256:9dc9f978cb84c0723acf9a345d96c184f0293d18f166bb8d59ee098e6cfaaba6", size = 5525680, upload-time = "2025-07-17T19:20:49.638Z" },
+]
+
+[[package]]
+name = "tensorboard-data-server"
+version = "0.7.2"
+source = { registry = "https://pypi.org/simple" }
+wheels = [
+    { url = "https://files.pythonhosted.org/packages/7a/13/e503968fefabd4c6b2650af21e110aa8466fe21432cd7c43a84577a89438/tensorboard_data_server-0.7.2-py3-none-any.whl", hash = "sha256:7e0610d205889588983836ec05dc098e80f97b7e7bbff7e994ebb78f578d0ddb", size = 2356, upload-time = "2023-10-23T21:23:32.16Z" },
+    { url = "https://files.pythonhosted.org/packages/b7/85/dabeaf902892922777492e1d253bb7e1264cadce3cea932f7ff599e53fea/tensorboard_data_server-0.7.2-py3-none-macosx_10_9_x86_64.whl", hash = "sha256:9fe5d24221b29625dbc7328b0436ca7fc1c23de4acf4d272f1180856e32f9f60", size = 4823598, upload-time = "2023-10-23T21:23:33.714Z" },
+    { url = "https://files.pythonhosted.org/packages/73/c6/825dab04195756cf8ff2e12698f22513b3db2f64925bdd41671bfb33aaa5/tensorboard_data_server-0.7.2-py3-none-manylinux_2_31_x86_64.whl", hash = "sha256:ef687163c24185ae9754ed5650eb5bc4d84ff257aabdc33f0cc6f74d8ba54530", size = 6590363, upload-time = "2023-10-23T21:23:35.583Z" },
+]
+
+[[package]]
+name = "threadpoolctl"
+version = "3.6.0"
+source = { registry = "https://pypi.org/simple" }
+sdist = { url = "https://files.pythonhosted.org/packages/b7/4d/08c89e34946fce2aec4fbb45c9016efd5f4d7f24af8e5d93296e935631d8/threadpoolctl-3.6.0.tar.gz", hash = "sha256:8ab8b4aa3491d812b623328249fab5302a68d2d71745c8a4c719a2fcaba9f44e", size = 21274, upload-time = "2025-03-13T13:49:23.031Z" }
+wheels = [
+    { url = "https://files.pythonhosted.org/packages/32/d5/f9a850d79b0851d1d4ef6456097579a9005b31fea68726a4ae5f2d82ddd9/threadpoolctl-3.6.0-py3-none-any.whl", hash = "sha256:43a0b8fd5a2928500110039e43a5eed8480b918967083ea48dc3ab9f13c4a7fb", size = 18638, upload-time = "2025-03-13T13:49:21.846Z" },
+]
+
+[[package]]
+name = "torch"
+version = "2.12.1"
+source = { registry = "https://pypi.org/simple" }
+dependencies = [
+    { name = "cuda-bindings", marker = "sys_platform == 'linux'" },
+    { name = "cuda-toolkit", extra = ["cudart", "cufft", "cufile", "cupti", "curand", "cusolver", "cusparse", "nvjitlink", "nvrtc", "nvtx"], marker = "sys_platform == 'linux'" },
+    { name = "filelock" },
+    { name = "fsspec" },
+    { name = "jinja2" },
+    { name = "networkx" },
+    { name = "nvidia-cublas", marker = "sys_platform == 'linux'" },
+    { name = "nvidia-cudnn-cu13", marker = "sys_platform == 'linux'" },
+    { name = "nvidia-cusparselt-cu13", marker = "sys_platform == 'linux'" },
+    { name = "nvidia-nccl-cu13", marker = "sys_platform == 'linux'" },
+    { name = "nvidia-nvshmem-cu13", marker = "sys_platform == 'linux'" },
+    { name = "setuptools" },
+    { name = "sympy" },
+    { name = "triton", marker = "sys_platform == 'linux'" },
+    { name = "typing-extensions" },
+]
+wheels = [
+    { url = "https://files.pythonhosted.org/packages/59/38/7028d3be540f1dcdf41660a2b01d0c51d2cb73915fe370d84e4d277a6d47/torch-2.12.1-cp311-cp311-macosx_14_0_arm64.whl", hash = "sha256:ef81f503912effea2ce3d9b12a2e3a6ed488943e91271c90c7a829f60baf6aa2", size = 87975425, upload-time = "2026-06-17T21:08:34.094Z" },
+    { url = "https://files.pythonhosted.org/packages/5a/e3/750b3e3548635ceac03ba255daa26dbc7ed66ca3484dc4b4d955ab7f4501/torch-2.12.1-cp311-cp311-manylinux_2_28_aarch64.whl", hash = "sha256:107df6888624bdea41508f9aeb6149d9333c737a5530ceecb56c904e811369ae", size = 426379894, upload-time = "2026-06-17T21:06:55.077Z" },
+    { url = "https://files.pythonhosted.org/packages/dc/ca/ed24783da629ff3e640ba3f70a7639e9045d3d88b93ee6bc47b8a28a1f2c/torch-2.12.1-cp311-cp311-manylinux_2_28_x86_64.whl", hash = "sha256:6e29e7e74d05bda7d955c75e99459f878ebd970ef851b4057edbd3b34a5eb4a3", size = 532169264, upload-time = "2026-06-17T21:08:17.65Z" },
+    { url = "https://files.pythonhosted.org/packages/46/61/c63f0158446f3a98ea672b004d761b848911eba567ea4a624c7db5aadc04/torch-2.12.1-cp311-cp311-win_amd64.whl", hash = "sha256:a513506cfda3c1c78dabeb6574c1597538c0254b3d39af174dde35d8177f4ce3", size = 122953086, upload-time = "2026-06-17T21:08:27.69Z" },
+    { url = "https://files.pythonhosted.org/packages/f0/54/efb7ebca77970012b0cc21687a55d70eb2ba514b2c2b8e18d9fb1222f3be/torch-2.12.1-cp312-cp312-macosx_14_0_arm64.whl", hash = "sha256:d2dd0f2c5f7ccbddaf34cade0deaf476808368f902b9cdb7f36a2ab42301bc0e", size = 87991951, upload-time = "2026-06-17T21:07:49.309Z" },
+    { url = "https://files.pythonhosted.org/packages/1e/00/4210d76ca7424981f04033ebe7e48816ab83287a62538747a58825db770c/torch-2.12.1-cp312-cp312-manylinux_2_28_aarch64.whl", hash = "sha256:2de4e19b88a481482c6c75291f2d6a52eda3ce51f311b29aa9b68499c830c07c", size = 426382721, upload-time = "2026-06-17T21:06:41.842Z" },
+    { url = "https://files.pythonhosted.org/packages/76/1f/bc9f5a5aa569307076365f25afcebacb22e9c754b1bcfbaaa146627c7fda/torch-2.12.1-cp312-cp312-manylinux_2_28_x86_64.whl", hash = "sha256:649e4ced014ba646f76f8cb9c9726735a6323eb321b7919f942790a923f90921", size = 532261322, upload-time = "2026-06-17T21:06:06.673Z" },
+    { url = "https://files.pythonhosted.org/packages/9e/49/c549461daa008159d006a76a991fbc2f26fa8bac27a4030c858463dcb20f/torch-2.12.1-cp312-cp312-win_amd64.whl", hash = "sha256:e86550597877fb272ddc52db2f85b82cb601ea7bd932576a0340152cae2200b3", size = 122988095, upload-time = "2026-06-17T21:07:44.9Z" },
+    { url = "https://files.pythonhosted.org/packages/ff/4a/0300261818e1560d72cc160ac826005507e8b7ca0a35788b591436d05b4a/torch-2.12.1-cp313-cp313-macosx_14_0_arm64.whl", hash = "sha256:c75e93173c700bccd6bfcc4a9d19ce242ab6dacd1f1781483027a16239b9e650", size = 87992358, upload-time = "2026-06-17T21:07:40.299Z" },
+    { url = "https://files.pythonhosted.org/packages/30/a7/874a5ca05e8f159211dca7921060f7057acc1adb26431e119fd150623efc/torch-2.12.1-cp313-cp313-manylinux_2_28_aarch64.whl", hash = "sha256:fcb61ccd20784b62bdd78ec84238a5cfb383b4994902e03bac95505ab360884c", size = 426386134, upload-time = "2026-06-17T21:07:31.481Z" },
+    { url = "https://files.pythonhosted.org/packages/e1/75/20bb8fe9c1ad6538cce8cd0391b51927ae5af0b17ed1eab44b8824465dc1/torch-2.12.1-cp313-cp313-manylinux_2_28_x86_64.whl", hash = "sha256:f4afc8083dff08719edbea346644476e3cec0cf40ebe256be0ee5d5b7c7e8c0d", size = 532268019, upload-time = "2026-06-17T21:05:37.925Z" },
+    { url = "https://files.pythonhosted.org/packages/d1/fa/824ddb662af55b2eabc0dbb7b57c7c0b1bcd93693754a2b8509ec4d16490/torch-2.12.1-cp313-cp313-win_amd64.whl", hash = "sha256:f92609e3b3ce72f25e2eb780d043ced2480c1a86c47c852604fc7a9108648386", size = 122987777, upload-time = "2026-06-17T21:07:09.49Z" },
+    { url = "https://files.pythonhosted.org/packages/63/b7/1b49fe7086ea36839cc80abc43174c43d0ab6f676c0891c871c162f44fe3/torch-2.12.1-cp314-cp314-macosx_14_0_arm64.whl", hash = "sha256:e9b6f7d2dd66ea87a3ae620069d31335d594c06effb1a383bdd21cfe61e44ece", size = 88010025, upload-time = "2026-06-17T21:07:03.934Z" },
+    { url = "https://files.pythonhosted.org/packages/d7/06/5b44063a6545036dcc680d2d303b137d9176cfb2cc1e1863e3ef94abeb52/torch-2.12.1-cp314-cp314-manylinux_2_28_aarch64.whl", hash = "sha256:7973ccd3d2cd35c74449213f7bded199bec6c6247e705cbeda7407af79703d91", size = 426392891, upload-time = "2026-06-17T21:05:52.261Z" },
+    { url = "https://files.pythonhosted.org/packages/f8/dd/c9ce9a4b0eb3c5bb92d9ea56766e2c22559f0b45171149188494edcce80f/torch-2.12.1-cp314-cp314-manylinux_2_28_x86_64.whl", hash = "sha256:c64ac4aac16be5e296dcd912305605804b203333c690bf98c55bc09494ee92ad", size = 532272494, upload-time = "2026-06-17T21:06:22.72Z" },
+    { url = "https://files.pythonhosted.org/packages/21/7c/f3a601fc1b1f663ff269bfe553654e638651939aa6563e8daa7167c33098/torch-2.12.1-cp314-cp314-win_amd64.whl", hash = "sha256:f6dc4caf7eb4adb38a2d9f536b51db56310fdd1254e69a2d96767e1367c892b3", size = 122987254, upload-time = "2026-06-17T21:06:33.199Z" },
+    { url = "https://files.pythonhosted.org/packages/e6/8c/b8087556cf81ddd808dbeb34afb8396d7ae7a1694ab489f08b1a0004e7d0/torch-2.12.1-cp314-cp314t-macosx_14_0_arm64.whl", hash = "sha256:2afbb2bdaa8a95040e733f05492ddf133c3967c9b7ce0abd218d704b6cab437d", size = 88303173, upload-time = "2026-06-17T21:05:06.603Z" },
+    { url = "https://files.pythonhosted.org/packages/4a/07/fe09d1699fbed2afa10ebc692ff2b99d113f2605b6748cea633989e2789a/torch-2.12.1-cp314-cp314t-manylinux_2_28_aarch64.whl", hash = "sha256:97eba061fcb042fed191400b15568990073d67eaacaa6ee9b7ca01dd8b790fe9", size = 426404009, upload-time = "2026-06-17T21:04:57.557Z" },
+    { url = "https://files.pythonhosted.org/packages/2e/f7/0ce4f6c1962c60ded7270e0a9eb560fb615c92b89d332cf9e3dff36d5ecc/torch-2.12.1-cp314-cp314t-manylinux_2_28_x86_64.whl", hash = "sha256:3867b861391701012adb2df93360efb88494dca245a185e3bb7624495cfe3f33", size = 532184292, upload-time = "2026-06-17T21:05:17.526Z" },
+    { url = "https://files.pythonhosted.org/packages/70/db/e384c12aba30320ca92aaaf557456cbcb26f04b4df307728bb8f019f5000/torch-2.12.1-cp314-cp314t-win_amd64.whl", hash = "sha256:dd15595f8fc764cffde8c6361a3beb6ef69a028c851b1b3e70e077f615980d4e", size = 123231142, upload-time = "2026-06-17T21:05:27.061Z" },
+]
+
+[[package]]
+name = "torchmetrics"
+version = "1.9.0"
+source = { registry = "https://pypi.org/simple" }
+dependencies = [
+    { name = "lightning-utilities" },
+    { name = "numpy", version = "2.4.6", source = { registry = "https://pypi.org/simple" }, marker = "python_full_version < '3.12'" },
+    { name = "numpy", version = "2.5.0", source = { registry = "https://pypi.org/simple" }, marker = "python_full_version >= '3.12'" },
+    { name = "packaging" },
+    { name = "torch" },
+]
+sdist = { url = "https://files.pythonhosted.org/packages/81/34/39b8b749333db56c0585d7a11fa62a283c087bb1dfc897d69fb8cedbefb1/torchmetrics-1.9.0.tar.gz", hash = "sha256:a488609948600df52d3db4fcdab02e62aab2a85ef34da67037dc3e65b8512faa", size = 581765, upload-time = "2026-03-09T17:41:22.443Z" }
+wheels = [
+    { url = "https://files.pythonhosted.org/packages/c3/a2/c7f6ebf546f8f644edf0f999aa98ece106986a77a7b922316bf6414ff825/torchmetrics-1.9.0-py3-none-any.whl", hash = "sha256:bfdcbff3dd1d96b3374bb2496eb39f23c4b28b8a845b6a18c313688e0d2d9ca1", size = 983384, upload-time = "2026-03-09T17:41:19.756Z" },
+]
+
+[[package]]
+name = "tqdm"
+version = "4.68.3"
+source = { registry = "https://pypi.org/simple" }
+dependencies = [
+    { name = "colorama", marker = "sys_platform == 'win32'" },
+]
+sdist = { url = "https://files.pythonhosted.org/packages/87/d7/0535a28b1f5f24f6612fb3ff1e89fb1a8d160fee0f976e0aa6803862134b/tqdm-4.68.3.tar.gz", hash = "sha256:00dfa48452b6b6cfae3dd9885636c23d3422d1ec97c66d96818cbd5e0821d482", size = 170596, upload-time = "2026-06-17T07:36:52.105Z" }
+wheels = [
+    { url = "https://files.pythonhosted.org/packages/d8/8e/bb97bb0c71802080bfc8952937d174e49cfc50de5c951dd47b2496f0dcdb/tqdm-4.68.3-py3-none-any.whl", hash = "sha256:39832cc2def2789a6f29df83f172db7416cea70052c0907a57801c5f2fdccb03", size = 78337, upload-time = "2026-06-17T07:36:50.132Z" },
+]
+
+[[package]]
+name = "triton"
+version = "3.7.1"
+source = { registry = "https://pypi.org/simple" }
+wheels = [
+    { url = "https://files.pythonhosted.org/packages/7b/f9/19d842d06a08559534fa1eaab6ca551b1bcf40f06620bddec1babaa2772d/triton-3.7.1-cp311-cp311-manylinux_2_27_aarch64.manylinux_2_28_aarch64.whl", hash = "sha256:d4a0e1cd4c4a76370ed74a8432a53cea28716827d19e40ffc732233e35ceb3f6", size = 184664887, upload-time = "2026-06-17T20:03:42.913Z" },
+    { url = "https://files.pythonhosted.org/packages/cd/5e/fce69606f7f240297f163e25539906732b199530d486ce67ae319877e821/triton-3.7.1-cp311-cp311-manylinux_2_27_x86_64.manylinux_2_28_x86_64.whl", hash = "sha256:6744957e9fd610a29680ec2346057d0c86948ed3812468670719f391e94b44a5", size = 197701306, upload-time = "2026-06-17T19:53:13.673Z" },
+    { url = "https://files.pythonhosted.org/packages/94/fa/f856e24deb462d5f18bd4b5a746957862ab9b6ee5834bda60605ec348366/triton-3.7.1-cp312-cp312-manylinux_2_27_aarch64.manylinux_2_28_aarch64.whl", hash = "sha256:9497f2e696ee368862a181a90b2dcc03ca978cc4f602abd67c7d81022a6988e1", size = 184692359, upload-time = "2026-06-17T20:03:48.288Z" },
+    { url = "https://files.pythonhosted.org/packages/c4/6f/fb96d15db6f36d6eae4cafb998c2e0353bf59d7c4ea1662d7497f269134a/triton-3.7.1-cp312-cp312-manylinux_2_27_x86_64.manylinux_2_28_x86_64.whl", hash = "sha256:7e40869937a68206ec70d7f25bb7ec6433cb083f9135e1f36dbd318dc449a728", size = 197719725, upload-time = "2026-06-17T19:53:20.419Z" },
+    { url = "https://files.pythonhosted.org/packages/00/42/c5089d4d9327fcd1e862c599cc2927f39418f84dd11a84cb2ccff9d4787a/triton-3.7.1-cp313-cp313-manylinux_2_27_aarch64.manylinux_2_28_aarch64.whl", hash = "sha256:cdbfc09d9ec58bc5e68321525653220de7515c199e7a8097a97c85e62b52cd0a", size = 184694629, upload-time = "2026-06-17T20:03:53.444Z" },
+    { url = "https://files.pythonhosted.org/packages/07/42/2c3ac59253ae8892b6f307875263dd23dc875cdf732d3aea40d6d41fb7cb/triton-3.7.1-cp313-cp313-manylinux_2_27_x86_64.manylinux_2_28_x86_64.whl", hash = "sha256:58c0e131da05134a2a4788ccbcc0c1105cf0f54c8e98f19e34cd465396dc15eb", size = 197729241, upload-time = "2026-06-17T19:53:27.801Z" },
+    { url = "https://files.pythonhosted.org/packages/40/71/e01aa7ad573883ed9456f130226babdec70b005e098c4d6226a6238e761b/triton-3.7.1-cp314-cp314-manylinux_2_27_aarch64.manylinux_2_28_aarch64.whl", hash = "sha256:fe4ea396a06171f1f1f58cbd39c70b09294398f7dd7c620939bab54ad6f934fa", size = 184705764, upload-time = "2026-06-17T20:03:59.064Z" },
+    { url = "https://files.pythonhosted.org/packages/a4/09/5683146fda6a2b569deb78ccfd8fbfea8bfe55f726b081c0a6bb18dd6f28/triton-3.7.1-cp314-cp314-manylinux_2_27_x86_64.manylinux_2_28_x86_64.whl", hash = "sha256:2020153b08280415ec0da6607834e79166442147e78e144df06b508c75b186d2", size = 197729537, upload-time = "2026-06-17T19:53:35.516Z" },
+    { url = "https://files.pythonhosted.org/packages/e9/f8/448220c3092019f9fdfab39ec47985968181d67da34b44f6a7f6280a5cbb/triton-3.7.1-cp314-cp314t-manylinux_2_27_aarch64.manylinux_2_28_aarch64.whl", hash = "sha256:c58e4c61f0c73b5dba3b5d19b4a7093c32f90dc18b2a7f121a7c16ccd31107b7", size = 184814760, upload-time = "2026-06-17T20:04:04.984Z" },
+    { url = "https://files.pythonhosted.org/packages/f0/ac/229b7d4589d2e5937310e72c6d46e89599d16a4a12b479ffa1499fee8eb8/triton-3.7.1-cp314-cp314t-manylinux_2_27_x86_64.manylinux_2_28_x86_64.whl", hash = "sha256:10ba85fa2cca4a2fbdeb36bf1cb082f2c252bda55bf9fccd74f65ec5bc647e68", size = 197824404, upload-time = "2026-06-17T19:53:42.772Z" },
+]
+
+[[package]]
+name = "types-colorama"
+version = "0.4.15.20260508"
+source = { registry = "https://pypi.org/simple" }
+sdist = { url = "https://files.pythonhosted.org/packages/be/d5/4870d88c6c87adf786ef33d674cbf82b3a2d5fcbe6deec29282cff80e8b3/types_colorama-0.4.15.20260508.tar.gz", hash = "sha256:3a8916039e57452bd21f57e674e1f221ca9e4f319893c5e3bbd37b845c27d8e6", size = 10638, upload-time = "2026-05-08T04:47:14.593Z" }
+wheels = [
+    { url = "https://files.pythonhosted.org/packages/56/33/a3449e3e93c43861b92d0a93ed4ea5505756463e507625812cfb29c91c7a/types_colorama-0.4.15.20260508-py3-none-any.whl", hash = "sha256:b0c39908a3e5171ef1f8bf3d59fae082e9eaff3a19ca49b6d640b83f78cff61c", size = 10755, upload-time = "2026-05-08T04:47:13.761Z" },
+]
+
+[[package]]
+name = "types-psutil"
+version = "7.2.2.20260518"
+source = { registry = "https://pypi.org/simple" }
+sdist = { url = "https://files.pythonhosted.org/packages/cb/f1/6901857281d4e8d792492e1495eef6f4f01318a3b6a066486d81000a4511/types_psutil-7.2.2.20260518.tar.gz", hash = "sha256:9f825f631463a5b4d26f19f63aebc9ec25f01140d655026f3ad8a67841f9b331", size = 26660, upload-time = "2026-05-18T06:05:09.389Z" }
+wheels = [
+    { url = "https://files.pythonhosted.org/packages/cd/eb/f726339668879819599c74c2e1f0cab760912a4159046942bdae2ad37bd6/types_psutil-7.2.2.20260518-py3-none-any.whl", hash = "sha256:6a3d697665754a60d7b5a41d5a2cff12b53f5e0676d77810cd28ba5e14cb4049", size = 32820, upload-time = "2026-05-18T06:05:08.321Z" },
+]
+
+[[package]]
+name = "types-requests"
+version = "2.33.0.20260518"
+source = { registry = "https://pypi.org/simple" }
+dependencies = [
+    { name = "urllib3" },
+]
+sdist = { url = "https://files.pythonhosted.org/packages/e0/01/c5a19253fe1ac159159ddf9a3a07cec8bb5e486ec4d9002ad2821da0e5d2/types_requests-2.33.0.20260518.tar.gz", hash = "sha256:df7bd3bfe0ca8402dfb841e7d9be714bb5578203283d66d7dc4ef69343449a5e", size = 24752, upload-time = "2026-05-18T06:07:37.966Z" }
+wheels = [
+    { url = "https://files.pythonhosted.org/packages/1c/bc/b139710a3b6018f7fb2b9508b35c8af564e61bf2bf4fa619d088f3e16f85/types_requests-2.33.0.20260518-py3-none-any.whl", hash = "sha256:626d697d1adaaff76e2044dc8c5c051d8f21abc157bdfe204a75558076fe0bf0", size = 21391, upload-time = "2026-05-18T06:07:37.044Z" },
+]
+
+[[package]]
+name = "types-tqdm"
+version = "4.68.0.20260608"
+source = { registry = "https://pypi.org/simple" }
+dependencies = [
+    { name = "types-requests" },
+]
+sdist = { url = "https://files.pythonhosted.org/packages/dc/e0/3facccb1ff69970c73fca7a8028286c233d4c1312c475a65fb3d896f56d9/types_tqdm-4.68.0.20260608.tar.gz", hash = "sha256:e1dfddf8770fbc30ecaf95ae57c286397831235064308f7dfc2b1d6684a76107", size = 18470, upload-time = "2026-06-08T06:26:06.661Z" }
+wheels = [
+    { url = "https://files.pythonhosted.org/packages/53/e8/61d95bfd49d1609fb8e8c5e06f4a094183411988a6f448873f5de6602499/types_tqdm-4.68.0.20260608-py3-none-any.whl", hash = "sha256:450a6e7e9e9b604928968927c414b32970e40091213c4180e1ed470905b13eff", size = 24858, upload-time = "2026-06-08T06:26:05.741Z" },
+]
+
+[[package]]
+name = "typing-extensions"
+version = "4.15.0"
+source = { registry = "https://pypi.org/simple" }
+sdist = { url = "https://files.pythonhosted.org/packages/72/94/1a15dd82efb362ac84269196e94cf00f187f7ed21c242792a923cdb1c61f/typing_extensions-4.15.0.tar.gz", hash = "sha256:0cea48d173cc12fa28ecabc3b837ea3cf6f38c6d1136f85cbaaf598984861466", size = 109391, upload-time = "2025-08-25T13:49:26.313Z" }
+wheels = [
+    { url = "https://files.pythonhosted.org/packages/18/67/36e9267722cc04a6b9f15c7f3441c2363321a3ea07da7ae0c0707beb2a9c/typing_extensions-4.15.0-py3-none-any.whl", hash = "sha256:f0fa19c6845758ab08074a0cfa8b7aecb71c999ca73d62883bc25cc018c4e548", size = 44614, upload-time = "2025-08-25T13:49:24.86Z" },
+]
+
+[[package]]
+name = "typing-inspection"
+version = "0.4.2"
+source = { registry = "https://pypi.org/simple" }
+dependencies = [
+    { name = "typing-extensions" },
+]
+sdist = { url = "https://files.pythonhosted.org/packages/55/e3/70399cb7dd41c10ac53367ae42139cf4b1ca5f36bb3dc6c9d33acdb43655/typing_inspection-0.4.2.tar.gz", hash = "sha256:ba561c48a67c5958007083d386c3295464928b01faa735ab8547c5692e87f464", size = 75949, upload-time = "2025-10-01T02:14:41.687Z" }
+wheels = [
+    { url = "https://files.pythonhosted.org/packages/dc/9b/47798a6c91d8bdb567fe2698fe81e0c6b7cb7ef4d13da4114b41d239f65d/typing_inspection-0.4.2-py3-none-any.whl", hash = "sha256:4ed1cacbdc298c220f1bd249ed5287caa16f34d44ef4e9c3d0cbad5b521545e7", size = 14611, upload-time = "2025-10-01T02:14:40.154Z" },
+]
+
+[[package]]
+name = "tzdata"
+version = "2026.2"
+source = { registry = "https://pypi.org/simple" }
+sdist = { url = "https://files.pythonhosted.org/packages/ba/19/1b9b0e29f30c6d35cb345486df41110984ea67ae69dddbc0e8a100999493/tzdata-2026.2.tar.gz", hash = "sha256:9173fde7d80d9018e02a662e168e5a2d04f87c41ea174b139fbef642eda62d10", size = 198254, upload-time = "2026-04-24T15:22:08.651Z" }
+wheels = [
+    { url = "https://files.pythonhosted.org/packages/ce/e4/dccd7f47c4b64213ac01ef921a1337ee6e30e8c6466046018326977efd95/tzdata-2026.2-py2.py3-none-any.whl", hash = "sha256:bbe9af844f658da81a5f95019480da3a89415801f6cc966806612cc7169bffe7", size = 349321, upload-time = "2026-04-24T15:22:05.876Z" },
+]
+
+[[package]]
+name = "urllib3"
+version = "2.7.0"
+source = { registry = "https://pypi.org/simple" }
+sdist = { url = "https://files.pythonhosted.org/packages/53/0c/06f8b233b8fd13b9e5ee11424ef85419ba0d8ba0b3138bf360be2ff56953/urllib3-2.7.0.tar.gz", hash = "sha256:231e0ec3b63ceb14667c67be60f2f2c40a518cb38b03af60abc813da26505f4c", size = 433602, upload-time = "2026-05-07T16:13:18.596Z" }
+wheels = [
+    { url = "https://files.pythonhosted.org/packages/7f/3e/5db95bcf282c52709639744ca2a8b149baccf648e39c8cc87553df9eae0c/urllib3-2.7.0-py3-none-any.whl", hash = "sha256:9fb4c81ebbb1ce9531cce37674bbc6f1360472bc18ca9a553ede278ef7276897", size = 131087, upload-time = "2026-05-07T16:13:17.151Z" },
+]
+
+[[package]]
+name = "urwid"
+version = "4.0.2"
+source = { registry = "https://pypi.org/simple" }
+dependencies = [
+    { name = "wcwidth" },
+]
+sdist = { url = "https://files.pythonhosted.org/packages/98/b8/9ed1c288eb7e9236ee83a3f847d15dfa879841219b9a7d174c6c2ef33f53/urwid-4.0.2.tar.gz", hash = "sha256:6962bd04ab98002326b67a431c59b2fb35e8b5abe2e095feda3ee7d8ea8f1228", size = 861918, upload-time = "2026-06-02T18:20:41.867Z" }
+wheels = [
+    { url = "https://files.pythonhosted.org/packages/e5/a0/39d524fb8ed3a9facdd2aa4eeb1a2635b3b8689c300989f8cebb989624ba/urwid-4.0.2-py3-none-any.whl", hash = "sha256:ca5958eca20d55535da37810a2e62cbd81a2ce399ee2e93b04a2718a544029eb", size = 295760, upload-time = "2026-06-02T18:20:40.11Z" },
+]
+
+[[package]]
+name = "urwid-readline"
+version = "0.15.1"
+source = { registry = "https://pypi.org/simple" }
+dependencies = [
+    { name = "urwid" },
+]
+sdist = { url = "https://files.pythonhosted.org/packages/ad/70/be318554495555eba7d8ff6e489f6f74ddb225b24086ba4af62a82e723fd/urwid_readline-0.15.1.tar.gz", hash = "sha256:9301444b86d58f7d26388506b704f142cefd193888488b4070d3a0fdfcfc0f84", size = 9007, upload-time = "2024-09-22T17:51:55.144Z" }
+
+[[package]]
+name = "virtualenv"
+version = "21.5.1"
+source = { registry = "https://pypi.org/simple" }
+dependencies = [
+    { name = "distlib" },
+    { name = "filelock" },
+    { name = "platformdirs" },
+    { name = "python-discovery" },
+]
+sdist = { url = "https://files.pythonhosted.org/packages/f1/a5/81f987504738e6defeed61ec1c47e2aefab3c35d8eeb87e1b3f38cf28254/virtualenv-21.5.1.tar.gz", hash = "sha256:dca3bf98275a59c652b69d68e73433e597d977c2da9198882479d1a7188009c8", size = 4578798, upload-time = "2026-06-16T16:23:58.603Z" }
+wheels = [
+    { url = "https://files.pythonhosted.org/packages/2c/02/3623e6169bed617ed1e2d372f7c69f92ec28d54c4dfc997055c8578ec148/virtualenv-21.5.1-py3-none-any.whl", hash = "sha256:55aa670b67bbfb991b03fda39bd3276d92c419d702376e98c5df1c9989a26783", size = 4558820, upload-time = "2026-06-16T16:23:56.963Z" },
+]
+
+[[package]]
+name = "wcwidth"
+version = "0.8.1"
+source = { registry = "https://pypi.org/simple" }
+sdist = { url = "https://files.pythonhosted.org/packages/49/b4/51fe890511f0f242d07cb1ebe6a5b6db417262b9d2568b460347c57d95cc/wcwidth-0.8.1.tar.gz", hash = "sha256:faf5b4a5366a72dc49cad48cdf21f52bdf63bdda995178e483ba247ff79089b9", size = 1466072, upload-time = "2026-06-08T05:57:23.146Z" }
+wheels = [
+    { url = "https://files.pythonhosted.org/packages/bd/6e/95b0e537de1f4d4301f76f944642c6da50d1511cc7b3d64dc418a66c7509/wcwidth-0.8.1-py3-none-any.whl", hash = "sha256:f453740b1e4a4f3291faa37944c555d71056c4da08d59809b307ef4feba695c8", size = 323092, upload-time = "2026-06-08T05:57:21.413Z" },
+]
+
+[[package]]
+name = "werkzeug"
+version = "3.1.8"
+source = { registry = "https://pypi.org/simple" }
+dependencies = [
+    { name = "markupsafe" },
+]
+sdist = { url = "https://files.pythonhosted.org/packages/dd/b2/381be8cfdee792dd117872481b6e378f85c957dd7c5bca38897b08f765fd/werkzeug-3.1.8.tar.gz", hash = "sha256:9bad61a4268dac112f1c5cd4630a56ede601b6ed420300677a869083d70a4c44", size = 875852, upload-time = "2026-04-02T18:49:14.268Z" }
+wheels = [
+    { url = "https://files.pythonhosted.org/packages/93/8c/2e650f2afeb7ee576912636c23ddb621c91ac6a98e66dc8d29c3c69446e1/werkzeug-3.1.8-py3-none-any.whl", hash = "sha256:63a77fb8892bf28ebc3178683445222aa500e48ebad5ec77b0ad80f8726b1f50", size = 226459, upload-time = "2026-04-02T18:49:12.72Z" },
+]
+
+[[package]]
+name = "yarl"
+version = "1.24.2"
+source = { registry = "https://pypi.org/simple" }
+dependencies = [
+    { name = "idna" },
+    { name = "multidict" },
+    { name = "propcache" },
+]
+sdist = { url = "https://files.pythonhosted.org/packages/79/12/1e8f37460ea0f7eb59c221fdaf0ed75e7ac43e97f8093b9c6f411df50a78/yarl-1.24.2.tar.gz", hash = "sha256:9ac374123c6fd7abf64d1fec93962b0bd4ee2c19751755a762a72dd96c0378f8", size = 210798, upload-time = "2026-05-19T21:31:05.599Z" }
+wheels = [
+    { url = "https://files.pythonhosted.org/packages/c5/c5/1ce244152ff2839645e7cae92f90e7bafcb2c52bea7ff586ac714f14f5df/yarl-1.24.2-cp311-cp311-macosx_10_9_universal2.whl", hash = "sha256:36348bebb147b83818b9d7e673ea4debc75970afc6ffdc7e3975ad05ce5a58c1", size = 128971, upload-time = "2026-05-19T21:28:20.543Z" },
+    { url = "https://files.pythonhosted.org/packages/87/5a/00f36967203ed89cb3acd2c8ed526cc3fed9418eb70ce128160a911c8499/yarl-1.24.2-cp311-cp311-macosx_10_9_x86_64.whl", hash = "sha256:1a97e42c8a2233f2f279ecadd9e4a037bcb5d813b78435e8eedd4db5a9e9708c", size = 91507, upload-time = "2026-05-19T21:28:22.556Z" },
+    { url = "https://files.pythonhosted.org/packages/31/d0/1fb0c1cd27288f39f6974da4318c32768d72c9890984541fdf1e2e32a51d/yarl-1.24.2-cp311-cp311-macosx_11_0_arm64.whl", hash = "sha256:8d027d56f1035e339d1001ac33eceab5b2ec8e42e449787bb75e289fb9a5cd1d", size = 91343, upload-time = "2026-05-19T21:28:24.092Z" },
+    { url = "https://files.pythonhosted.org/packages/03/ce/d4a646508bed2f8dec6435b40166fe9308dd191262033d3f307b2bbcaecd/yarl-1.24.2-cp311-cp311-manylinux2014_aarch64.manylinux_2_17_aarch64.manylinux_2_28_aarch64.whl", hash = "sha256:0a6377060e7927187a42b7eb202090cbe2b34933a4eeaf90e3bd9e33432e5cae", size = 105704, upload-time = "2026-05-19T21:28:25.872Z" },
+    { url = "https://files.pythonhosted.org/packages/4b/07/b3278e82d8bc41485bcf6d856cd0433262593de615b1d3dc43bd3f5bead4/yarl-1.24.2-cp311-cp311-manylinux2014_armv7l.manylinux_2_17_armv7l.manylinux_2_31_armv7l.whl", hash = "sha256:17076578bce0049a5ce57d14ad1bded391b68a3b213e9b81b0097b090244999a", size = 97281, upload-time = "2026-05-19T21:28:27.352Z" },
+    { url = "https://files.pythonhosted.org/packages/17/5b/4cee6e7c92e487bebe7afc797da0aa54a248ab4e776a68fe369ec29665a5/yarl-1.24.2-cp311-cp311-manylinux2014_ppc64le.manylinux_2_17_ppc64le.manylinux_2_28_ppc64le.whl", hash = "sha256:50713f1d4d6be6375bb178bb43d140ee1acb8abe589cd723320b7925a275be1e", size = 114020, upload-time = "2026-05-19T21:28:29.458Z" },
+    { url = "https://files.pythonhosted.org/packages/5c/82/111076571545a7d4f9cca3fbd5c6f40615af58642be09f12328f48022468/yarl-1.24.2-cp311-cp311-manylinux2014_s390x.manylinux_2_17_s390x.manylinux_2_28_s390x.whl", hash = "sha256:34263e2fa8fb5bb63a0d97706cda38edbad62fddb58c7f12d6acbc092812aa50", size = 111450, upload-time = "2026-05-19T21:28:31.262Z" },
+    { url = "https://files.pythonhosted.org/packages/b6/ec/08f671f69a444d704aeecebf92af659b67b97a869942411d0a578b08c334/yarl-1.24.2-cp311-cp311-manylinux2014_x86_64.manylinux_2_17_x86_64.manylinux_2_28_x86_64.whl", hash = "sha256:49016d82f032b1bd1e10b01078a7d29ae71bf468eeae0ea22df8bab691e60003", size = 106384, upload-time = "2026-05-19T21:28:32.856Z" },
+    { url = "https://files.pythonhosted.org/packages/e5/86/ce41e7a7a199340b2330d52b60f25c4074b6636dd0e60b1a80d31a9db042/yarl-1.24.2-cp311-cp311-manylinux_2_31_riscv64.manylinux_2_39_riscv64.whl", hash = "sha256:3f6d2c216318f8f32038ca3f72501ba08536f0fd18a36e858836b121b2deed9f", size = 106153, upload-time = "2026-05-19T21:28:35.222Z" },
+    { url = "https://files.pythonhosted.org/packages/c4/5d/31be8a729531ab3e55ac3e7e5c800be8c89ea98947f418b2f6ea259fb6ee/yarl-1.24.2-cp311-cp311-musllinux_1_2_aarch64.whl", hash = "sha256:08d3a33218e0c64393e7610284e770409a9c31c429b078bcb24096ed0a783b8f", size = 105322, upload-time = "2026-05-19T21:28:36.642Z" },
+    { url = "https://files.pythonhosted.org/packages/47/9b/b57afb22b386ae87ac9940f09878b98d8c333f89113e6fc96fcf4ca9eb64/yarl-1.24.2-cp311-cp311-musllinux_1_2_armv7l.whl", hash = "sha256:5d699376c4ca3cba49bbfae3a05b5b70ded572937171ce1e0b8d87118e2ba294", size = 99057, upload-time = "2026-05-19T21:28:38.386Z" },
+    { url = "https://files.pythonhosted.org/packages/a3/4f/06348c27c8389256c313e8a57d796808fc0264c915dd5e7cfd3c0e314dc7/yarl-1.24.2-cp311-cp311-musllinux_1_2_ppc64le.whl", hash = "sha256:a1cab588b4fa14bea2e55ebea27478adfb05372f47573738e1acc4a36c0b05d2", size = 113502, upload-time = "2026-05-19T21:28:40.091Z" },
+    { url = "https://files.pythonhosted.org/packages/5f/1c/284f307b298e4a17b7943b07d9d7ecc4151537f8d137ba51f3bb6c31ca20/yarl-1.24.2-cp311-cp311-musllinux_1_2_riscv64.whl", hash = "sha256:ec87ccc31bd21db7ad009d8572c127c1000f268517618a4cc09adba3c2a7f21c", size = 105253, upload-time = "2026-05-19T21:28:41.987Z" },
+    { url = "https://files.pythonhosted.org/packages/c8/bf/0de123bec8619e45c80cbded9085f61b5b4a9eddb8abe6d25d28ee1ec866/yarl-1.24.2-cp311-cp311-musllinux_1_2_s390x.whl", hash = "sha256:d1dd47a22843b212baa8d74f37796815d43bd046b42a0f41e9da433386c3136b", size = 111345, upload-time = "2026-05-19T21:28:43.93Z" },
+    { url = "https://files.pythonhosted.org/packages/90/af/0248eb065e51129d2a9b2436cd1b5c772c19a6b04e5b6a186955671e3319/yarl-1.24.2-cp311-cp311-musllinux_1_2_x86_64.whl", hash = "sha256:7b54b9c67c2b06bd7b9a77253d242124b9c95d2c02def5a1144001ee547dd9d5", size = 106558, upload-time = "2026-05-19T21:28:45.806Z" },
+    { url = "https://files.pythonhosted.org/packages/21/3c/f960d7a65ef97d8ba9b424fb5128796a4bc710fc6df2ddbbd7dfdc3bbd20/yarl-1.24.2-cp311-cp311-win_amd64.whl", hash = "sha256:f8fdbcff8b2c7c9284e60c196f693588598ddcee31e11c18e14949ce44519d45", size = 92808, upload-time = "2026-05-19T21:28:48.465Z" },
+    { url = "https://files.pythonhosted.org/packages/03/1a/49fb03750e4de4d2284cd5b885a383133c34eef45bd59631b2bb8b7e81e8/yarl-1.24.2-cp311-cp311-win_arm64.whl", hash = "sha256:b32c37a7a337e90822c45797bf3d79d60875cfcccd3ecc80e9f453d87026c122", size = 87610, upload-time = "2026-05-19T21:28:50.07Z" },
+    { url = "https://files.pythonhosted.org/packages/f0/da/866bcb01076ba49d2b42b309867bed3826421f1c479655eb7a607b44f20b/yarl-1.24.2-cp312-cp312-macosx_10_13_universal2.whl", hash = "sha256:b975866c184564c827e0877380f0dae57dcca7e52782128381b72feff6dfceb8", size = 129957, upload-time = "2026-05-19T21:28:51.695Z" },
+    { url = "https://files.pythonhosted.org/packages/bf/1d/fcefb70922ea2268a8971d8e5874d9a8218644200fb8465f1dcad55e6851/yarl-1.24.2-cp312-cp312-macosx_10_13_x86_64.whl", hash = "sha256:3b075301a2836a0e297b1b658cb6d6135df535d62efefdd60366bd589c2c82f2", size = 92164, upload-time = "2026-05-19T21:28:53.242Z" },
+    { url = "https://files.pythonhosted.org/packages/29/b6/170e2b8d4e3bc30e6bfdcca53556537f5bf595e938632dfcb059311f3ff6/yarl-1.24.2-cp312-cp312-macosx_11_0_arm64.whl", hash = "sha256:8ae44649b00947634ab0dab2a374a638f52923a6e67083f2c156cd5cbd1a881d", size = 91688, upload-time = "2026-05-19T21:28:54.865Z" },
+    { url = "https://files.pythonhosted.org/packages/fe/a5/c9f655d5553ea0b99fdac9d6a99ad3f9b3e73b8e5758bb46f58c9831f74c/yarl-1.24.2-cp312-cp312-manylinux2014_aarch64.manylinux_2_17_aarch64.manylinux_2_28_aarch64.whl", hash = "sha256:507cc19f0b45454e2d6dcd62ff7d062b9f77a2812404e62dbdaec05b50faa035", size = 102902, upload-time = "2026-05-19T21:28:56.963Z" },
+    { url = "https://files.pythonhosted.org/packages/5d/bc/6b9664d815d79af4ee553337f9d606c56bbf269186ada9172de45f1b5f60/yarl-1.24.2-cp312-cp312-manylinux2014_armv7l.manylinux_2_17_armv7l.manylinux_2_31_armv7l.whl", hash = "sha256:c4c17bad5a530912d2111825d3f05e89bab2dd376aaa8cbc77e449e6db63e576", size = 97931, upload-time = "2026-05-19T21:28:58.56Z" },
+    { url = "https://files.pythonhosted.org/packages/98/ec/32ba48acae30fecd60928f5791188b80a9d6ee3840507ffda29fecd37b71/yarl-1.24.2-cp312-cp312-manylinux2014_ppc64le.manylinux_2_17_ppc64le.manylinux_2_28_ppc64le.whl", hash = "sha256:f5f0cbb112838a4a293985b6ed73948a547dadcc1ba6d2089938e7abdedceef8", size = 111030, upload-time = "2026-05-19T21:29:00.148Z" },
+    { url = "https://files.pythonhosted.org/packages/82/5a/6f4cd081e5f4934d2ae3a8ef4abe3afacc010d26f0035ee91b35cd7d7c37/yarl-1.24.2-cp312-cp312-manylinux2014_s390x.manylinux_2_17_s390x.manylinux_2_28_s390x.whl", hash = "sha256:5ec8356b8a6afcf81fc7aeeef13b1ff7a49dec00f313394bbb9e83830d32ccd7", size = 110392, upload-time = "2026-05-19T21:29:02.155Z" },
+    { url = "https://files.pythonhosted.org/packages/7a/da/323a01c349bd5fb01bb6652e314d9bb218cee630a736bdb810ad50e4013f/yarl-1.24.2-cp312-cp312-manylinux2014_x86_64.manylinux_2_17_x86_64.manylinux_2_28_x86_64.whl", hash = "sha256:7e7ebcdef69dec6c6451e616f32b622a6d4a2e92b445c992f7c8e5274a6bbc4c", size = 105612, upload-time = "2026-05-19T21:29:04.247Z" },
+    { url = "https://files.pythonhosted.org/packages/7c/80/264ab684f181e1a876389374519ff05d10248725535ae2ac4e8ac4e563d6/yarl-1.24.2-cp312-cp312-manylinux_2_31_riscv64.manylinux_2_39_riscv64.whl", hash = "sha256:47a55d6cf6db2f401017a9e96e5288844e5051911fb4e0c8311a3980f5e59a7d", size = 104487, upload-time = "2026-05-19T21:29:06.491Z" },
+    { url = "https://files.pythonhosted.org/packages/41/07/efabe5df87e96d7ad5959760b888344be48cd6884db127b407c6b5503adc/yarl-1.24.2-cp312-cp312-musllinux_1_2_aarch64.whl", hash = "sha256:3065657c80a2321225e804048597ad55658a7e76b32d6f5ee4074d04c50401db", size = 102333, upload-time = "2026-05-19T21:29:08.267Z" },
+    { url = "https://files.pythonhosted.org/packages/44/0c/bcf7c42603e1009295f586d8890f2ba032c8b53310e815adf0a202c73d9f/yarl-1.24.2-cp312-cp312-musllinux_1_2_armv7l.whl", hash = "sha256:cb84b80d88e19ede158619b80813968713d8d008b0e2497a576e6a0557d50712", size = 99025, upload-time = "2026-05-19T21:29:10.682Z" },
+    { url = "https://files.pythonhosted.org/packages/4f/82/84482ab1a57a0f21a08afe6a7004c61d741f8f2ecc3b05c321577c612164/yarl-1.24.2-cp312-cp312-musllinux_1_2_ppc64le.whl", hash = "sha256:990de4f680b1c217e77ff0d6aa0029f9eb79889c11fb3e9a3942c7eba29c1996", size = 110507, upload-time = "2026-05-19T21:29:12.954Z" },
+    { url = "https://files.pythonhosted.org/packages/c4/8d/a546ba1dfe1b0f290e05fef145cd07614c0f15df1a707195e512d1e39d1d/yarl-1.24.2-cp312-cp312-musllinux_1_2_riscv64.whl", hash = "sha256:abb8ec0323b80161e3802da3150ef660b41d0e9be2048b76a363d93eee992c2b", size = 103719, upload-time = "2026-05-19T21:29:14.893Z" },
+    { url = "https://files.pythonhosted.org/packages/1a/b6/267f2a09213138473adfce6b8a6e17791d7fee70bd4d9003218e4dec58b0/yarl-1.24.2-cp312-cp312-musllinux_1_2_s390x.whl", hash = "sha256:e7977781f83638a4c73e0f88425563d70173e0dfd90ac006a45c65036293ee3c", size = 110438, upload-time = "2026-05-19T21:29:16.485Z" },
+    { url = "https://files.pythonhosted.org/packages/48/2d/1c8d89c7c5f9cad9fb2902445d94e2ab1d7aa35de029afbb8ae95c42d00f/yarl-1.24.2-cp312-cp312-musllinux_1_2_x86_64.whl", hash = "sha256:e30dd55825dc554ec5b66a94953b8eda8745926514c5089dfcacecb9c99b5bd1", size = 105719, upload-time = "2026-05-19T21:29:18.367Z" },
+    { url = "https://files.pythonhosted.org/packages/a7/25/722e3b93bd687009afb2d59a35e13d30ddd8f80571445bb0c4e4ce26ec66/yarl-1.24.2-cp312-cp312-win_amd64.whl", hash = "sha256:7dafe10c12ddd4d120d528c4b5599c953bd7b12845347d507b95451195bb6cad", size = 92901, upload-time = "2026-05-19T21:29:20.014Z" },
+    { url = "https://files.pythonhosted.org/packages/39/47/4486ccfb674c04854a1ef8aa77868b6a6f765feaf69633409d7ca4f02cb8/yarl-1.24.2-cp312-cp312-win_arm64.whl", hash = "sha256:044a09d8401fcf8681977faef6d286b8ade1e2d2e9dceda175d1cfa5ca496f30", size = 87229, upload-time = "2026-05-19T21:29:22.1Z" },
+    { url = "https://files.pythonhosted.org/packages/82/62/fcf0ce677f17e5c471c06311dd25964be38a4c586993632910d2e75278bc/yarl-1.24.2-cp313-cp313-macosx_10_13_universal2.whl", hash = "sha256:491ac9141decf49ee8030199e1ee251cdff0e131f25678817ff6aa5f837a3536", size = 128978, upload-time = "2026-05-19T21:29:23.83Z" },
+    { url = "https://files.pythonhosted.org/packages/d3/58/8e63299bb71ed61a834121d9d3fe6c9fcf2a6a5d09754ff4f20f2d20baf5/yarl-1.24.2-cp313-cp313-macosx_10_13_x86_64.whl", hash = "sha256:e89418f65eda18f99030386305bd44d7d504e328a7945db1ead514fbe03a0607", size = 91733, upload-time = "2026-05-19T21:29:25.375Z" },
+    { url = "https://files.pythonhosted.org/packages/c1/24/16748d5dab6daec8b0ed81ccec639a1cded0f18dcc62a4f696b4fe366c37/yarl-1.24.2-cp313-cp313-macosx_11_0_arm64.whl", hash = "sha256:cdfcce633b4a4bb8281913c57fcafd4b5933fbc19111a5e3930bbd299d6102f1", size = 91113, upload-time = "2026-05-19T21:29:26.928Z" },
+    { url = "https://files.pythonhosted.org/packages/1b/66/b63fff7b71211e866624b21432d5943cbb633eb0c2872d9ee3070648f22c/yarl-1.24.2-cp313-cp313-manylinux2014_aarch64.manylinux_2_17_aarch64.manylinux_2_28_aarch64.whl", hash = "sha256:863297ddede92ee49024e9a9b11ecb59f310ca85b60d8537f56bed9bbb5b1986", size = 103899, upload-time = "2026-05-19T21:29:28.842Z" },
+    { url = "https://files.pythonhosted.org/packages/9d/ac/ba1974b8533909636f7733fe86cf677e3619527c3c2fa913e0ea89c48757/yarl-1.24.2-cp313-cp313-manylinux2014_armv7l.manylinux_2_17_armv7l.manylinux_2_31_armv7l.whl", hash = "sha256:374423f70754a2c96942ede36a29d37dc6b0cb8f92f8d009ddf3ed78d3da5488", size = 97862, upload-time = "2026-05-19T21:29:31.086Z" },
+    { url = "https://files.pythonhosted.org/packages/1b/a5/123ac993b5c2ba6f554a140305620cb8f150fa543711bbc49be3ec0a65a4/yarl-1.24.2-cp313-cp313-manylinux2014_ppc64le.manylinux_2_17_ppc64le.manylinux_2_28_ppc64le.whl", hash = "sha256:33a29b5d00ccbf3219bb3e351d7875739c19481e030779f48cc46a7a71681a9b", size = 111060, upload-time = "2026-05-19T21:29:32.657Z" },
+    { url = "https://files.pythonhosted.org/packages/23/37/c472d3af3509688392134a88a825276770a187f1daa4de3f6dc0a327a751/yarl-1.24.2-cp313-cp313-manylinux2014_s390x.manylinux_2_17_s390x.manylinux_2_28_s390x.whl", hash = "sha256:a9532c57211730c515341af11fef6e9b61d157487272a096d0c04da445642592", size = 110613, upload-time = "2026-05-19T21:29:34.379Z" },
+    { url = "https://files.pythonhosted.org/packages/df/88/09c28dad91e662ccfaa1b78f1c57badde74fc9d0b23e74aef644750ecd73/yarl-1.24.2-cp313-cp313-manylinux2014_x86_64.manylinux_2_17_x86_64.manylinux_2_28_x86_64.whl", hash = "sha256:91e72cf093fd833483a97ee648e0c053c7c629f51ff4a0e7edd84f806b0c5617", size = 107012, upload-time = "2026-05-19T21:29:36.216Z" },
+    { url = "https://files.pythonhosted.org/packages/07/ab/9d4f69d571a94f4d112fa7e2e007200f5a54d319f58c82ac7b7baa61f5c6/yarl-1.24.2-cp313-cp313-manylinux_2_31_riscv64.manylinux_2_39_riscv64.whl", hash = "sha256:b3177bc0a768ef3bacceb4f272632990b7bea352f1b2f1eee9d6d6ff16516f92", size = 105887, upload-time = "2026-05-19T21:29:38.746Z" },
+    { url = "https://files.pythonhosted.org/packages/8e/9a/000b2b66c0d772a499fc531d21dab92dfeb73b640a12eed6ba89f49bb2d0/yarl-1.24.2-cp313-cp313-musllinux_1_2_aarch64.whl", hash = "sha256:e196952aacaf3b232e265ff02980b64d483dc0972bd49bcb061171ff22ac203a", size = 103620, upload-time = "2026-05-19T21:29:40.368Z" },
+    { url = "https://files.pythonhosted.org/packages/41/7c/7c1050f73450fbdaa3f0c72017059f00ce5e13366692f3dba25275a1083d/yarl-1.24.2-cp313-cp313-musllinux_1_2_armv7l.whl", hash = "sha256:204e7a61ce99919c0de1bf904ab5d7aa188a129ea8f690a8f76cfb6e2844dc44", size = 100599, upload-time = "2026-05-19T21:29:42.66Z" },
+    { url = "https://files.pythonhosted.org/packages/ec/b1/29e5756b3926705f5f6089bd5b9f50a56eaac550da6e260bf713ead44d04/yarl-1.24.2-cp313-cp313-musllinux_1_2_ppc64le.whl", hash = "sha256:4b156914620f0b9d78dc1adb3751141daee561cfec796088abb89ed49d220f1a", size = 110604, upload-time = "2026-05-19T21:29:44.632Z" },
+    { url = "https://files.pythonhosted.org/packages/a3/4b/8415bc96e9b150cde942fbac9a8182985e58f40ce5c54c34ed015407d3ee/yarl-1.24.2-cp313-cp313-musllinux_1_2_riscv64.whl", hash = "sha256:8372a2b976cf70654b2be6619ab6068acabb35f724c0fda7b277fbf53d66a5cf", size = 105161, upload-time = "2026-05-19T21:29:46.755Z" },
+    { url = "https://files.pythonhosted.org/packages/8b/d4/cde059abfa229553b7298a2eadde2752e723d50aeedaef86ce59da2718ee/yarl-1.24.2-cp313-cp313-musllinux_1_2_s390x.whl", hash = "sha256:f9a1e9b622ca284143aab5d885848686dcd85453bb1ca9abcdb7503e64dc0056", size = 110619, upload-time = "2026-05-19T21:29:48.972Z" },
+    { url = "https://files.pythonhosted.org/packages/e7/2c/d6a6c9a61549f7b6c7e6dc6937d195bcf069582b47b7200dcd0e7b256acf/yarl-1.24.2-cp313-cp313-musllinux_1_2_x86_64.whl", hash = "sha256:810e19b685c8c3c5862f6a38160a1f4e4c0916c9390024ec347b6157a45a0992", size = 107362, upload-time = "2026-05-19T21:29:51Z" },
+    { url = "https://files.pythonhosted.org/packages/92/dd/3ae5fe417e9d1c353a548553326eb9935e76b6b727161563b424cc296df3/yarl-1.24.2-cp313-cp313-win_amd64.whl", hash = "sha256:7d37fb7c38f2b6edab0f845c4f85148d4c44204f52bc127021bd2bc9fdbf1656", size = 92667, upload-time = "2026-05-19T21:29:52.743Z" },
+    { url = "https://files.pythonhosted.org/packages/10/cc/a7beb239f78f27fca1b053c8e8595e4179c02e62249b4687ec218c370c50/yarl-1.24.2-cp313-cp313-win_arm64.whl", hash = "sha256:1e831894be7c2954240e49791fa4b50c05a0dc881de2552cfe3ffd8631c7f461", size = 87069, upload-time = "2026-05-19T21:29:54.442Z" },
+    { url = "https://files.pythonhosted.org/packages/40/0e/e08087695fc12789263821c5dc0f8dc52b5b17efd0887cacf419f8a43ba3/yarl-1.24.2-cp314-cp314-macosx_10_15_universal2.whl", hash = "sha256:f9312b3c02d9b3d23840f67952913c9c8721d7f1b7db305289faefa878f364c2", size = 129670, upload-time = "2026-05-19T21:29:56.631Z" },
+    { url = "https://files.pythonhosted.org/packages/3a/98/ab4b5ed1b1b5cd973c8a3eb994c3a6aefb6ce6d399e21bb5f0316c33815c/yarl-1.24.2-cp314-cp314-macosx_10_15_x86_64.whl", hash = "sha256:a4f4d6cd615823bfc7fb7e9b5987c3f41666371d870d51058f77e2680fbe9630", size = 91916, upload-time = "2026-05-19T21:29:58.645Z" },
+    { url = "https://files.pythonhosted.org/packages/ba/b1/5297bb6a7df4782f7605bffc43b31f5044070935fbbcaa6c705a07e6ac65/yarl-1.24.2-cp314-cp314-macosx_11_0_arm64.whl", hash = "sha256:0c3063e5c0a8e8e62fae6c2596fa01da1561e4cd1da6fec5789f5cf99a8aefd8", size = 91625, upload-time = "2026-05-19T21:30:00.412Z" },
+    { url = "https://files.pythonhosted.org/packages/02/a7/45baabfff76829264e623b185cff0c340d7e11bf3e1cd9ea37e7d17934bd/yarl-1.24.2-cp314-cp314-manylinux2014_aarch64.manylinux_2_17_aarch64.manylinux_2_28_aarch64.whl", hash = "sha256:fecd17873a096036c1c87ab3486f1aef7f269ada7f23f7f856f93b1cc7744f14", size = 104574, upload-time = "2026-05-19T21:30:02.544Z" },
+    { url = "https://files.pythonhosted.org/packages/f3/40/3a5ab144d3d650ca37d4f4b57e56169be8af3ca34c448793e064b30baaed/yarl-1.24.2-cp314-cp314-manylinux2014_armv7l.manylinux_2_17_armv7l.manylinux_2_31_armv7l.whl", hash = "sha256:a46d1ab4ba4d32e6dc80daf8a28ce0bd83d08df52fbc32f3e288663427734535", size = 97534, upload-time = "2026-05-19T21:30:04.319Z" },
+    { url = "https://files.pythonhosted.org/packages/9c/b5/5658fef3681fb5776b4513b052bec750009f47b3a592251c705d75375798/yarl-1.24.2-cp314-cp314-manylinux2014_ppc64le.manylinux_2_17_ppc64le.manylinux_2_28_ppc64le.whl", hash = "sha256:73e68edf6dfd5f73f9ca127d84e2a6f9213c65bdffb736bda19524c0564fcd14", size = 111481, upload-time = "2026-05-19T21:30:05.988Z" },
+    { url = "https://files.pythonhosted.org/packages/4c/06/fdcd7dde037f00866dce123ed4ba23dba94beb56fc4cf561668d27be37f2/yarl-1.24.2-cp314-cp314-manylinux2014_s390x.manylinux_2_17_s390x.manylinux_2_28_s390x.whl", hash = "sha256:a296ca617f2d25fbceafb962b88750d627e5984e75732c712154d058ae8d79a3", size = 111529, upload-time = "2026-05-19T21:30:07.738Z" },
+    { url = "https://files.pythonhosted.org/packages/c2/53/d81269aaafccea0d33396c03035de997b743f11e648e6e27a0df99c72980/yarl-1.24.2-cp314-cp314-manylinux2014_x86_64.manylinux_2_17_x86_64.manylinux_2_28_x86_64.whl", hash = "sha256:e51b2cf5ec89a8b8470177641ed62a3ba22d74e1e898e06ad53aa77972487208", size = 107338, upload-time = "2026-05-19T21:30:09.713Z" },
+    { url = "https://files.pythonhosted.org/packages/ae/04/23049463f729bd899df203a7960505a75333edd499cda8aa1d5a82b64df5/yarl-1.24.2-cp314-cp314-manylinux_2_31_riscv64.manylinux_2_39_riscv64.whl", hash = "sha256:310fc687f7b2044ec54e372c8cbe923bb88f5c37bded0d3079e5791c2fc3cf50", size = 106147, upload-time = "2026-05-19T21:30:11.365Z" },
+    { url = "https://files.pythonhosted.org/packages/14/18/04a4b5830b43ed5e4c5015b40e9f6241ad91487d71611061b4e111d6ac80/yarl-1.24.2-cp314-cp314-musllinux_1_2_aarch64.whl", hash = "sha256:297a2fe352ecf858b30a98f87948746ec16f001d279f84aebdbd3bd965e2f1bd", size = 104272, upload-time = "2026-05-19T21:30:12.978Z" },
+    { url = "https://files.pythonhosted.org/packages/5a/f7/8cffdf319aee7a7c1dbd07b61d91c3e3fda460c7a93b5f93e445f3806c4c/yarl-1.24.2-cp314-cp314-musllinux_1_2_armv7l.whl", hash = "sha256:2a263e76b97bc42bdcd7c5f4953dec1f7cd62a1112fa7f869e57255229390d67", size = 99962, upload-time = "2026-05-19T21:30:15.001Z" },
+    { url = "https://files.pythonhosted.org/packages/d7/39/b3cce3b7dbef64ac700ad4cea156a207d01bede0f507587616c364b5468e/yarl-1.24.2-cp314-cp314-musllinux_1_2_ppc64le.whl", hash = "sha256:822519b64cf0b474f1a0aaef1dc621438ea46bb77c94df97a5b4d213a7d8a8b1", size = 111063, upload-time = "2026-05-19T21:30:16.683Z" },
+    { url = "https://files.pythonhosted.org/packages/a1/ea/100818505e7ebf165c7242ff17fdf7d9fee79e27234aeca871c1082920d7/yarl-1.24.2-cp314-cp314-musllinux_1_2_riscv64.whl", hash = "sha256:b6067060d9dc594899ba83e6db6c48c68d1e494a6dab158156ed86977ca7bcb1", size = 105438, upload-time = "2026-05-19T21:30:18.769Z" },
+    { url = "https://files.pythonhosted.org/packages/8f/d2/e075a0b32aa6625087de9e653087df0759fed5de4a435fef594181102a77/yarl-1.24.2-cp314-cp314-musllinux_1_2_s390x.whl", hash = "sha256:0063adad533e57171b79db3943b229d40dfafeeee579767f96541f106bac5f1b", size = 111458, upload-time = "2026-05-19T21:30:21.024Z" },
+    { url = "https://files.pythonhosted.org/packages/e6/5c/ceea7ba98b65c8eb8d947fdc52f9bedfcd43c6a57c9e3c90c17be8f324a3/yarl-1.24.2-cp314-cp314-musllinux_1_2_x86_64.whl", hash = "sha256:ee8e3fb34513e8dc082b586ef4910c98335d43a6fab688cd44d4851bacfce3e8", size = 107589, upload-time = "2026-05-19T21:30:23.412Z" },
+    { url = "https://files.pythonhosted.org/packages/fa/d9/5582d57e2b2db9b85eb6663a22efdd78e08805f3f5389566e9fcad254d1b/yarl-1.24.2-cp314-cp314-win_amd64.whl", hash = "sha256:afb00d7fd8e0f285ca29a44cc50df2d622ff2f7a6d933fa641577b5f9d5f3db0", size = 94424, upload-time = "2026-05-19T21:30:25.425Z" },
+    { url = "https://files.pythonhosted.org/packages/92/10/7dc07a0e22806a9280f42a57361395506e800c64e22737cd7b0886feab42/yarl-1.24.2-cp314-cp314-win_arm64.whl", hash = "sha256:68cf6eacd6028ef1142bc4b48376b81566385ca6f9e7dde3b0fa91be08ffcb57", size = 88690, upload-time = "2026-05-19T21:30:27.623Z" },
+    { url = "https://files.pythonhosted.org/packages/9e/13/d5b8e2c8667db955bcb3de233f18798fefe7edf1d7429c2c9d4f9c401114/yarl-1.24.2-cp314-cp314t-macosx_10_15_universal2.whl", hash = "sha256:221ce1dd921ac4f603957f17d7c18c5cc0797fbb52f156941f92e04605d1d67b", size = 136248, upload-time = "2026-05-19T21:30:29.297Z" },
+    { url = "https://files.pythonhosted.org/packages/de/46/a4a97c05c9c9b8fd266bb2a0df12992c7fbd02391eb9640583411b6dab32/yarl-1.24.2-cp314-cp314t-macosx_10_15_x86_64.whl", hash = "sha256:5f3224db28173a00d7afacdee07045cc4673dfab2b15492c7ae10deddbece761", size = 95084, upload-time = "2026-05-19T21:30:31.031Z" },
+    { url = "https://files.pythonhosted.org/packages/95/b2/845cf2074a015e6fe0d0808cf1a2d9e868386c4220d657ebd8302b199043/yarl-1.24.2-cp314-cp314t-macosx_11_0_arm64.whl", hash = "sha256:c557165320d6244ebe3a02431b2a201a20080e02f41f0cfa0ccc47a183765da8", size = 95272, upload-time = "2026-05-19T21:30:33.062Z" },
+    { url = "https://files.pythonhosted.org/packages/fe/16/e69d4aa244aef45235ddfebc0e04036a6829842bc5a6a795aedc6c998d23/yarl-1.24.2-cp314-cp314t-manylinux2014_aarch64.manylinux_2_17_aarch64.manylinux_2_28_aarch64.whl", hash = "sha256:904065e6e85b1fa54d0d87438bd58c14c0bad97aad654ad1077fd9d87e8478ed", size = 101497, upload-time = "2026-05-19T21:30:34.842Z" },
+    { url = "https://files.pythonhosted.org/packages/15/94/c07107715d621076863ee88b3ddf183fa5e9d4aba5769623c9979828410a/yarl-1.24.2-cp314-cp314t-manylinux2014_armv7l.manylinux_2_17_armv7l.manylinux_2_31_armv7l.whl", hash = "sha256:8cec2a38d70edc10e0e856ceda886af5327a017ccbde8e1de1bd44d300357543", size = 94002, upload-time = "2026-05-19T21:30:37.724Z" },
+    { url = "https://files.pythonhosted.org/packages/a9/35/fc1bbdd895b5e4010b8fdd037f7ed3aa289d3863e08231b30231ca9a0815/yarl-1.24.2-cp314-cp314t-manylinux2014_ppc64le.manylinux_2_17_ppc64le.manylinux_2_28_ppc64le.whl", hash = "sha256:e7484b9361ed222ee1ca5b4337aa4cbdcc4618ce5aff57d9ef1582fd95893fc0", size = 106524, upload-time = "2026-05-19T21:30:40.196Z" },
+    { url = "https://files.pythonhosted.org/packages/1f/f2/32b66d0a4ba47c296cf86d03e2c67bff58399fe6d6d84d5205c04c66cc6d/yarl-1.24.2-cp314-cp314t-manylinux2014_s390x.manylinux_2_17_s390x.manylinux_2_28_s390x.whl", hash = "sha256:84f9670b89f34db07f81e53aee83e0b938a3412329d51c8f922488be7fcc4024", size = 106165, upload-time = "2026-05-19T21:30:41.888Z" },
+    { url = "https://files.pythonhosted.org/packages/95/47/37cb5ff50c5e825d4d38e81bb04d1b7e96bf960f7ab89f9850b162f3f114/yarl-1.24.2-cp314-cp314t-manylinux2014_x86_64.manylinux_2_17_x86_64.manylinux_2_28_x86_64.whl", hash = "sha256:abb2759733d63a28b4956500a5dd57140f26486c92b2caedfb964ab7d9b79dbf", size = 103010, upload-time = "2026-05-19T21:30:43.985Z" },
+    { url = "https://files.pythonhosted.org/packages/6f/d2/4597912315096f7bb359e46e13bf8b60994fcbb2db29b804c0902ef4eff5/yarl-1.24.2-cp314-cp314t-manylinux_2_31_riscv64.manylinux_2_39_riscv64.whl", hash = "sha256:081c2bf54efe03774d0311172bc04fedf9ca01e644d4cd8c805688e527209bdc", size = 101128, upload-time = "2026-05-19T21:30:46.291Z" },
+    { url = "https://files.pythonhosted.org/packages/b9/d5/c8e86e120521e646013d02a8e3b8884392e28494be8f392366e50d208efc/yarl-1.24.2-cp314-cp314t-musllinux_1_2_aarch64.whl", hash = "sha256:86746bef442aa479107fe28132e1277237f9c24c2f00b0b0cf22b3ee0904f2bb", size = 101382, upload-time = "2026-05-19T21:30:48.085Z" },
+    { url = "https://files.pythonhosted.org/packages/fa/98/70b229236118f89dbeb739b76f10225bbf53b5497725502594c9a01d699a/yarl-1.24.2-cp314-cp314t-musllinux_1_2_armv7l.whl", hash = "sha256:2d07d21d0bc4b17558e8de0b02fbfdf1e347d3bb3699edd00bb92e7c57925420", size = 95964, upload-time = "2026-05-19T21:30:49.785Z" },
+    { url = "https://files.pythonhosted.org/packages/87/f8/56c386981e3c8648d279fdef2397ffec577e8320fd5649745e34d54faeb7/yarl-1.24.2-cp314-cp314t-musllinux_1_2_ppc64le.whl", hash = "sha256:4fb1ac3fc5fecd8ae7453ea237e4d22b49befa70266dfe1629924245c21a0c7f", size = 106204, upload-time = "2026-05-19T21:30:51.862Z" },
+    { url = "https://files.pythonhosted.org/packages/1a/1e/765afe97811ca35933e2a7de70ac57b1997ea2e4ee895719ee7a231fb7e5/yarl-1.24.2-cp314-cp314t-musllinux_1_2_riscv64.whl", hash = "sha256:4da31a5512ed1729ca8d8aacde3f7faeb8843cde3165d6bcf7f88f74f17bb8aa", size = 101510, upload-time = "2026-05-19T21:30:53.62Z" },
+    { url = "https://files.pythonhosted.org/packages/ee/78/393913f4b9039e1edd09ae8a9bbb9d539be909a8abf6d8a2084585bed4b7/yarl-1.24.2-cp314-cp314t-musllinux_1_2_s390x.whl", hash = "sha256:533ded4dceb5f1f3da7906244f4e82cf46cfd40d84c69a1faf5ac506aa65ecbe", size = 105584, upload-time = "2026-05-19T21:30:55.962Z" },
+    { url = "https://files.pythonhosted.org/packages/78/87/deb17b7049bbe74ea11a713b86f8f27800cc1c8648b0b797243ebb4830ba/yarl-1.24.2-cp314-cp314t-musllinux_1_2_x86_64.whl", hash = "sha256:7b3a85525f6e7eeabcfdd372862b21ee1915db1b498a04e8bf0e389b607ff0bd", size = 103410, upload-time = "2026-05-19T21:30:57.962Z" },
+    { url = "https://files.pythonhosted.org/packages/8f/be/f9f7594e23b5b93affff0318e4593c1920331bcaefda326cabcad94296a1/yarl-1.24.2-cp314-cp314t-win_amd64.whl", hash = "sha256:a7624b1ca46ca5d7b864ef0d2f8efe3091454085ee1855b4e992314529972215", size = 102980, upload-time = "2026-05-19T21:30:59.735Z" },
+    { url = "https://files.pythonhosted.org/packages/65/a4/ba80dccd3593ff1f01051a818694d07b58cb8232677ee9a22a5a1f93a9fc/yarl-1.24.2-cp314-cp314t-win_arm64.whl", hash = "sha256:e434a45ce2e7a947f951fc5a8944c8cc080b7e59f9c50ae80fd39107cf88126d", size = 91219, upload-time = "2026-05-19T21:31:01.934Z" },
+    { url = "https://files.pythonhosted.org/packages/fd/4d/4b880086bd0d3e034d25647be1d830afc3e3f610e98c4ab3490af6b1b6d5/yarl-1.24.2-py3-none-any.whl", hash = "sha256:2783d9226db8797636cd6896e4de81feed252d1db72265686c9558d97a4d94b9", size = 53576, upload-time = "2026-05-19T21:31:03.909Z" },
+]
diff --git a/videotuna/base/checkpoint_mixin.py b/videotuna/base/checkpoint_mixin.py
new file mode 100644
index 00000000..d37c655b
--- /dev/null
+++ b/videotuna/base/checkpoint_mixin.py
@@ -0,0 +1,150 @@
+import os
+from pathlib import Path
+from typing import Optional, Union
+
+import torch
+import torch.nn as nn
+from loguru import logger
+
+from videotuna.utils.common_utils import print_green, print_yellow
+
+
+class CheckpointMixin:
+    def load_first_stage(
+        self, ckpt_path: Union[str, Path], ignore_missing_ckpts: bool = False
+    ) -> None:
+        path = os.path.join(str(ckpt_path), self.first_stage_model_path)
+        if os.path.exists(path):
+            assert self.first_stage_model is not None
+            self.first_stage_model = self.load_model(self.first_stage_model, path)
+            print_green("Successfully loaded first_stage_model from checkpoint.")
+        elif ignore_missing_ckpts:
+            print_yellow("Checkpoint of first_stage_model file not found. Ignoring.")
+        else:
+            raise FileNotFoundError("Checkpoint of first_stage_model file not found.")
+
+    def load_cond_stage(
+        self, ckpt_path: Union[str, Path], ignore_missing_ckpts: bool = False
+    ) -> None:
+        path = os.path.join(str(ckpt_path), self.cond_stage_model_path)
+        if os.path.exists(path):
+            assert self.cond_stage_model is not None
+            self.cond_stage_model = self.load_model(self.cond_stage_model, path)
+            print_green("Successfully loaded cond_stage_model from checkpoint.")
+        elif ignore_missing_ckpts:
+            print_yellow("Checkpoint of cond_stage_model file not found. Ignoring.")
+        else:
+            raise FileNotFoundError("Checkpoint of cond_stage_model file not found.")
+
+    def load_cond_stage_2(
+        self, ckpt_path: Union[str, Path], ignore_missing_ckpts: bool = False
+    ) -> None:
+        if self.cond_stage_2_model is None:
+            return
+
+        path = os.path.join(str(ckpt_path), self.cond_stage_2_model_path)
+        if os.path.exists(path):
+            self.cond_stage_2_model = self.load_model(self.cond_stage_2_model, path)
+            print_green("Successfully loaded cond_stage_2_model from checkpoint.")
+        elif ignore_missing_ckpts:
+            print_yellow("Checkpoint of cond_stage_2_model file not found. Ignoring.")
+        else:
+            raise FileNotFoundError("Checkpoint of cond_stage_2_model file not found.")
+
+    def load_denoiser(
+        self,
+        ckpt_path: Optional[Union[str, Path]] = None,
+        denoiser_ckpt_path: Optional[Union[str, Path]] = None,
+        ignore_missing_ckpts: bool = False,
+    ) -> None:
+        if ckpt_path is None and denoiser_ckpt_path is None:
+            return
+        path = os.path.join(str(ckpt_path or ""), self.denoiser_path)
+        if denoiser_ckpt_path is not None:
+            path = str(denoiser_ckpt_path)
+
+        if os.path.exists(path):
+            assert self.denoiser is not None
+            self.denoiser = self.load_model(self.denoiser, path)
+            print_green("Successfully loaded denoiser from checkpoint.")
+        elif ignore_missing_ckpts:
+            print_yellow("Checkpoint of denoiser file not found. Ignoring.")
+        else:
+            raise FileNotFoundError("Checkpoint of denoiser file not found.")
+
+    def load_lora(
+        self,
+        lora_ckpt_path: Optional[Union[str, Path]] = None,
+        ignore_missing_ckpts: bool = False,
+    ) -> None:
+        if not self.use_lora:
+            return
+
+        lora_path = self.lora_path
+        if lora_ckpt_path is not None:
+            lora_path = str(lora_ckpt_path)
+
+        if lora_path is not None and os.path.exists(lora_path):
+            assert self.denoiser is not None
+            self.load_model(self.denoiser, lora_path, strict=False)
+            print_green("Successfully loaded denoiser from checkpoint.")
+        elif ignore_missing_ckpts:
+            print_yellow("Checkpoint of denoiser file not found. Ignoring.")
+        else:
+            raise FileNotFoundError("Checkpoint of denoiser file not found.")
+
+    def from_pretrained(
+        self,
+        ckpt_path: Optional[Union[str, Path]] = None,
+        denoiser_ckpt_path: Optional[Union[str, Path]] = None,
+        lora_ckpt_path: Optional[Union[str, Path]] = None,
+        ignore_missing_ckpts: bool = False,
+        device: Optional[str] = None,
+        **kwargs,
+    ) -> None:
+        assert ckpt_path is not None, "Please provide a valid checkpoint path."
+        ckpt_str = str(ckpt_path)
+        denoiser_path = (
+            str(denoiser_ckpt_path) if denoiser_ckpt_path is not None else None
+        )
+        lora_path = str(lora_ckpt_path) if lora_ckpt_path is not None else None
+
+        # can ovrride following methods
+        self.load_first_stage(ckpt_str, ignore_missing_ckpts)
+        self.load_cond_stage(ckpt_str, ignore_missing_ckpts)
+        self.load_cond_stage_2(ckpt_str, ignore_missing_ckpts)
+        self.load_denoiser(ckpt_str, denoiser_path, ignore_missing_ckpts)
+        self.load_lora(lora_path, ignore_missing_ckpts)
+
+    @staticmethod
+    def load_model(
+        model: nn.Module, ckpt_path: Optional[Union[str, Path]] = None, strict=True
+    ):
+        """
+        Loads the weights of the model from a checkpoint file.
+
+        :param model: The model to be loaded.
+        :param ckpt_path: Path to the checkpoint file.
+        """
+        assert ckpt_path is not None, "Please provide a valid checkpoint path."
+
+        ckpt_path = Path(ckpt_path)
+        if ckpt_path.exists():
+            ckpt = torch.load(ckpt_path, map_location=torch.device("cpu"))
+            if "state_dict" in ckpt:
+                state_dict = ckpt["state_dict"]
+            else:
+                state_dict = ckpt
+            missing_keys, unexpected_keys = model.load_state_dict(
+                state_dict, strict=strict
+            )
+            all_keys = [i for i, _ in model.named_parameters()]
+            num_updated_keys = len(all_keys) - len(missing_keys)
+            num_unexpected_keys = len(unexpected_keys)
+            logger.info(
+                f"{num_updated_keys} parameters are loaded from {ckpt_path}. "
+                f"{num_unexpected_keys} parameters are unexpected."
+            )
+            return model
+        else:
+            raise FileNotFoundError("Checkpoint of model file not found.")
diff --git a/videotuna/base/component_loader.py b/videotuna/base/component_loader.py
new file mode 100644
index 00000000..7f490e08
--- /dev/null
+++ b/videotuna/base/component_loader.py
@@ -0,0 +1,134 @@
+import enum
+from typing import Any, Dict, Optional
+
+from loguru import logger
+
+from videotuna.utils.common_utils import instantiate_from_config
+
+
+class Component(str, enum.Enum):
+    DENOISER = "denoiser"
+    FIRST_STAGE_MODEL = "first_stage_model"
+    COND_STAGE_MODEL = "cond_stage_model"
+    COND_STAGE_2_MODEL = "cond_stage_2_model"
+    SCHEDULER = "scheduler"
+
+    def get_component_path(self) -> str:
+        return f"{self.value}.ckpt"
+
+
+class LoadingMethod(str, enum.Enum):
+    FIXED = "fixed"
+    CONFIG = "config"
+
+
+class ComponentLoader:
+    def instantiate_scheduler(self, config: Optional[Dict[str, Any]]) -> None:
+        if config is not None:
+            logger.info("creating scheduler")
+            self.diffusion_scheduler = self.scheduler = instantiate_from_config(config)
+            self.components.append(Component.SCHEDULER.value)
+
+    def instantiate_first_stage(self, config: Optional[Dict[str, Any]]) -> None:
+        """
+        Instantiates the first stage model of the generative process.
+
+        :param config: Dictionary containing configuration for the first stage model.
+        """
+        if config is None:
+            return
+        logger.info("creating first stage")
+        model = instantiate_from_config(config)
+        assert model is not None
+        self.first_stage_model = model.eval()
+        for param in self.first_stage_model.parameters():
+            param.requires_grad = False
+        self.components.append(Component.FIRST_STAGE_MODEL.value)
+        self.first_stage_model_path = config.get(
+            "ckpt_path", f"{Component.FIRST_STAGE_MODEL.value}.ckpt"
+        )
+        logger.info(f"self.first_stage_model_path: {self.first_stage_model_path}")
+
+    def instantiate_cond_stage(self, config: Optional[Dict[str, Any]]) -> None:
+        """
+        Instantiates the conditional stage model of the generative process.
+
+        :param config: Dictionary containing configuration for the conditional
+            stage model.
+        """
+        if config is None:
+            return
+        from videotuna.utils.quantization import apply_quantization_to_config_params
+
+        logger.info("creating cond stage")
+        cfg = config
+        if cfg is not None and isinstance(cfg, dict) and cfg.get("params"):
+            cfg = dict(cfg)
+            cfg["params"] = apply_quantization_to_config_params(dict(cfg["params"]))
+        model = instantiate_from_config(cfg)
+        assert model is not None
+        self.cond_stage_model = model.eval()
+        for param in self.cond_stage_model.parameters():
+            param.requires_grad = False
+        self.components.append(Component.COND_STAGE_MODEL.value)
+        self.cond_stage_model_path = config.get(
+            "ckpt_path", f"{Component.COND_STAGE_MODEL.value}.ckpt"
+        )
+        logger.info(f"self.cond_stage_model_path: {self.cond_stage_model_path}")
+
+    def instantiate_cond_stage_2(self, config: Optional[Dict[str, Any]]) -> None:
+        """
+        Instantiates the conditional stage model of the generative process.
+
+        :param config: Dictionary containing configuration for the conditional
+            stage model.
+        """
+        self.cond_stage_2_model = None
+        if config is not None:
+            logger.info("creating cond stage 2")
+            model = instantiate_from_config(config)
+            assert model is not None
+            self.cond_stage_2_model = model.eval()
+            for param in self.cond_stage_2_model.parameters():
+                param.requires_grad = False
+            self.components.append(Component.COND_STAGE_2_MODEL.value)
+            self.cond_stage_2_model_path = config.get(
+                "ckpt_path", f"{Component.COND_STAGE_2_MODEL.value}.ckpt"
+            )
+            logger.info(f"self.cond_stage_2_model_path: {self.cond_stage_2_model_path}")
+
+    def instantiate_denoiser(self, config: Optional[Dict[str, Any]]) -> None:
+        """
+        Instantiates the denoiser model of the generative process.
+
+        :param config: Dictionary containing configuration for the denoiser model.
+        """
+        if config is None:
+            return
+        logger.info("creating denoiser")
+        model = instantiate_from_config(config)
+        assert model is not None
+        self.denoiser = model.eval()
+        for param in self.denoiser.parameters():
+            param.requires_grad = False
+        self.components.append(Component.DENOISER.value)
+        self.denoiser_path = config.get("ckpt_path", f"{Component.DENOISER.value}.ckpt")
+        logger.info(f"self.denoiser_path: {self.denoiser_path}")
+
+    def apply_denoiser_gradient_checkpointing(self, enabled: bool = True) -> None:
+        """Enable gradient checkpointing on the denoiser only."""
+        denoiser = getattr(self, "denoiser", None)
+        if denoiser is None:
+            return
+        if hasattr(denoiser, "activation_checkpointing"):
+            denoiser.activation_checkpointing = enabled
+            logger.info(f"Wan denoiser activation_checkpointing={enabled}")
+            return
+        base_model = getattr(denoiser, "base_model", denoiser)
+        model = getattr(base_model, "model", base_model)
+        if enabled and hasattr(model, "enable_gradient_checkpointing"):
+            model.enable_gradient_checkpointing()
+            logger.info("Enabled diffusers gradient checkpointing on denoiser")
+        elif not enabled and hasattr(model, "disable_gradient_checkpointing"):
+            model.disable_gradient_checkpointing()
+            logger.info("Disabled diffusers gradient checkpointing on denoiser")
diff --git a/videotuna/base/generation_base.py b/videotuna/base/generation_base.py
index 6ce103ea..37114def 100644
--- a/videotuna/base/generation_base.py
+++ b/videotuna/base/generation_base.py
@@ -1,85 +1,107 @@
-from loguru import logger
-from pathlib import Path
-from typing import Any, Dict, List, Optional, Union
-from colorama import Fore, Style
+import os
+from typing import Any, Dict, List, Optional, Union, cast
 
 import torch
-import os
 import torch.nn as nn
-import torch.nn.functional as F
-import pytorch_lightning as pl
-from torch.optim.lr_scheduler import CosineAnnealingLR, LambdaLR
-from peft import get_peft_model
-from omegaconf import OmegaConf, DictConfig
+from loguru import logger
+from omegaconf import DictConfig, OmegaConf
 from pytorch_lightning import Trainer
-import enum
+from torch.optim.lr_scheduler import CosineAnnealingLR, LambdaLR
 
-from videotuna.base.train_base import TrainBase
+from videotuna.base.checkpoint_mixin import CheckpointMixin
+from videotuna.base.component_loader import Component, ComponentLoader, LoadingMethod
 from videotuna.base.inference_base import InferenceBase
-from videotuna.utils.common_utils import instantiate_from_config, print_green, print_yellow, get_dist_info
+from videotuna.base.lora_training_mixin import LoraTrainingMixin
+from videotuna.base.train_base import TrainBase
+from videotuna.utils.common_utils import (
+    get_dist_info,
+    instantiate_from_config,
+)
+from videotuna.utils.device_utils import (
+    empty_accelerator_cache,
+    resolve_inference_device,
+)
 from videotuna.utils.train_utils import (
-    check_config_attribute,
-    get_autoresume_path,
-    get_empty_params_comparedwith_sd,
     get_trainer_callbacks,
     get_trainer_logger,
     get_trainer_strategy,
-    init_workspace,
-    load_checkpoints,
-    set_logger,
 )
 
-class Component(str, enum.Enum):
-    DENOISER = "denoiser"
-    FIRST_STAGE_MODEL = "first_stage_model"
-    COND_STAGE_MODEL = "cond_stage_model"
-    COND_STAGE_2_MODEL = "cond_stage_2_model"
-    SCHEDULER = "scheduler"
-
-    def get_component_path(self) -> str:
-        return f"{self.value}.ckpt"
-
+__all__ = ["GenerationBase", "Component", "LoadingMethod", "ComponentLoader"]
+
+
+class GenerationBase(
+    TrainBase,
+    InferenceBase,
+    ComponentLoader,
+    LoraTrainingMixin,
+    CheckpointMixin,
+):
+    denoiser: nn.Module | None = None
+    first_stage_model: nn.Module | None = None
+    cond_stage_model: nn.Module | None = None
+    cond_stage_2_model: nn.Module | None = None
+    scheduler: Any | None = None
+    lr_config: dict[str, Any] | None = None
+    data: Any | None = None
+    pipeline: Any | None = None
 
-class LoadingMethod(str, enum.Enum):
-    FIXED = "fixed"
-    CONFIG = "config"
-
-class GenerationBase(TrainBase, InferenceBase):
     """
-    The GenerationFlow class is a generative model class that inherits from both TrainBase and InferenceBase.
-    It manages the instantiation of different stages of a generative process, including a denoiser and a scheduler.
+    The GenerationFlow class is a generative model class that inherits from both
+    TrainBase and InferenceBase.
+    It manages the instantiation of different stages of a generative process,
+    including a denoiser and a scheduler.
     It also configures optimizers and learning rate schedulers for training.
 
     The main components of the model are:
-        - `first_stage`: a VAE model that encodes the input video into a latent space and decodes it back to the original video.
-        - `cond_stage`: a conditional model that takes the latent space and the conditioning text as input and generates the output video.
-        - `denoiser`: a denoiser model that takes the noisy output of the `cond_stage` and tries to remove the noise.
+        - `first_stage`: a VAE model that encodes the input video into a latent
+          space and decodes it back to the original video.
+        - `cond_stage`: a conditional model that takes the latent space and the
+          conditioning text as input and generates the output video.
+        - `denoiser`: a denoiser model that takes the noisy output of the
+          `cond_stage` and tries to remove the noise.
         - `scheduler`: a scheduler that controls denosing and sampling.
     """
 
-    def __init__(self,
-                 first_stage_config: Dict[str, Any],
-                 cond_stage_config: Dict[str, Any],
-                 denoiser_config: Dict[str, Any],
-                 scheduler_config: Dict[str, Any] = None,
-                 cond_stage_2_config: Dict[str, Any] = None,
-                 lora_config: Dict[str, Any] = None,
-                 trainable_components: Union[str, List[str]] = [],
-                 ):
+    def __init__(
+        self,
+        first_stage_config: Optional[Dict[str, Any]] = None,
+        cond_stage_config: Optional[Dict[str, Any]] = None,
+        denoiser_config: Optional[Dict[str, Any]] = None,
+        scheduler_config: Optional[Dict[str, Any]] = None,
+        cond_stage_2_config: Optional[Dict[str, Any]] = None,
+        lora_config: Optional[Dict[str, Any]] = None,
+        trainable_components: Union[str, List[str]] = [],
+        pipeline_only: bool = False,
+    ):
         """
-        Initializes the GenerationFlow class with configurations for different stages and components.
-
-        :param first_stage_config: Dictionary containing configuration for the first stage model.
-        :param cond_stage_config: Dictionary containing configuration for the conditional stage model.
-        :param cond_stage_2_config: Dictionary containing configuration for the conditional stage model 2, can be none.
-        :param denoiser_config: Dictionary containing configuration for the denoiser model.
-        :param scheduler_config: Dictionary containing configuration for the diffusion scheduler.
-        :param trainable_components: The components of the model that should be trainable.
+        Initializes the GenerationFlow class with configurations for different
+        stages and components.
+
+        :param first_stage_config: Dictionary containing configuration for the
+            first stage model.
+        :param cond_stage_config: Dictionary containing configuration for the
+            conditional stage model.
+        :param cond_stage_2_config: Dictionary containing configuration for the
+            conditional stage model 2, can be none.
+        :param denoiser_config: Dictionary containing configuration for the
+            denoiser model.
+        :param scheduler_config: Dictionary containing configuration for the
+            diffusion scheduler.
+        :param trainable_components: The components of the model that should be
+            trainable.
+        :param pipeline_only: When True, skip stage instantiation (Diffusers
+            pipeline flows).
         """
         super().__init__()
 
         # instantiate the modules
-        self.components = []
+        self.components: list[str] = []
+        self.pipeline_only = pipeline_only
+        if pipeline_only:
+            self.use_lora = False
+            self.pipeline = None
+            return
         # 1. denoiser
         self.instantiate_denoiser(denoiser_config)
 
@@ -105,125 +127,47 @@ def __init__(self,
         self.denoiser_config = denoiser_config
         self.scheduler_config = scheduler_config
         self.lora_config = lora_config
-        
+
         # set trainable components
         # be aware: loaded weight will overide requrie_grad attribute etc
         # make sure call it again after loading weight
         self.set_trainable_components(trainable_components)
-        
-    def instantiate_scheduler(self, config: Dict[str, Any]):
-        if config is not None:
-            logger.info("creating scheduler")
-            self.diffusion_scheduler = self.scheduler = instantiate_from_config(config)
-            self.components.append(Component.SCHEDULER.value)
-        
-    def instantiate_lora(self, config: Dict[str, Any]):
-        self.use_lora = False
-        if config is not None:
-            logger.info("creating lora")
-            transformer_adapter_config = instantiate_from_config(config)   
-            self.denoiser = get_peft_model(self.denoiser, transformer_adapter_config)
-            self.lora_params = set([name for name, param in self.denoiser.named_parameters() if param.requires_grad and 'lora' in name])
-            self.denoiser.requires_grad_(False)
-            self.denosier = self.denoiser.eval()
-            self.use_lora = True
-            self.lora_path = config.get("ckpt_path")
-            logger.info(f"self.use_lora: {self.use_lora} self.lora_path: {self.lora_path} self.lora_params: {self.lora_params}")
-
-    def instantiate_first_stage(self, config: Dict[str, Any]):
-        """
-        Instantiates the first stage model of the generative process.
-
-        :param config: Dictionary containing configuration for the first stage model.
-        """
-        logger.info("creating first stage")
-        model = instantiate_from_config(config)
-        self.first_stage_model = model.eval()
-        for param in self.first_stage_model.parameters():
-            param.requires_grad = False
-        self.components.append(Component.FIRST_STAGE_MODEL.value)
-        self.first_stage_model_path = config.get("ckpt_path", f"{Component.FIRST_STAGE_MODEL.value}.ckpt")
-        logger.info(f"self.first_stage_model_path: {self.first_stage_model_path}")
-    
-    def instantiate_cond_stage(self, config: Dict[str, Any]):
-        """
-        Instantiates the conditional stage model of the generative process.
-
-        :param config: Dictionary containing configuration for the conditional stage model.
-        """
-        logger.info("creating cond stage")
-        model = instantiate_from_config(config)
-        self.cond_stage_model = model.eval()
-        for param in self.cond_stage_model.parameters():
-            param.requires_grad = False
-        self.components.append(Component.COND_STAGE_MODEL.value)
-        self.cond_stage_model_path = config.get("ckpt_path", f"{Component.COND_STAGE_MODEL.value}.ckpt")
-        logger.info(f"self.cond_stage_model_path: {self.cond_stage_model_path}")
-        
-    def instantiate_cond_stage_2(self, config: Dict[str, Any]):
-        """
-        Instantiates the conditional stage model of the generative process.
-
-        :param config: Dictionary containing configuration for the conditional stage model.
-        """
-        self.cond_stage_2_model = None
-        if config is not None:
-            logger.info("creating cond stage 2")
-            model = instantiate_from_config(config)
-            self.cond_stage_2_model = model.eval()
-            for param in self.cond_stage_2_model.parameters():
-                param.requires_grad = False
-            self.components.append(Component.COND_STAGE_2_MODEL.value)
-            self.cond_stage_2_model_path = config.get("ckpt_path", f"{Component.COND_STAGE_2_MODEL.value}.ckpt")
-            logger.info(f"self.cond_stage_2_model_path: {self.cond_stage_2_model_path}")
-    
-    def instantiate_denoiser(self, config: Dict[str, Any]):
-        """
-        Instantiates the denoiser model of the generative process.
-
-        :param config: Dictionary containing configuration for the denoiser model.
-        """
-        logger.info("creating denoiser")
-        model = instantiate_from_config(config)
-        self.denoiser = model.eval()
-        for param in self.denoiser.parameters():
-            param.requires_grad = False
-        self.components.append(Component.DENOISER.value)
-        self.denoiser_path = config.get("ckpt_path", f"{Component.DENOISER.value}.ckpt")
-        logger.info(f"self.denoiser_path: {self.denoiser_path}")
 
     def configure_lr_config(self, lr_config: Dict[str, Any], bs: int, num_rank: int):
-        base_lr = lr_config['base_learning_rate']
+        base_lr = lr_config["base_learning_rate"]
         if lr_config.get("scale_lr", True):
-            lr_config['learning_rate'] = num_rank * bs * base_lr
+            lr_config["learning_rate"] = num_rank * bs * base_lr
         else:
-            lr_config['learning_rate'] = base_lr
+            lr_config["learning_rate"] = base_lr
         self.lr_config = lr_config
 
     def configure_optimizers(self):
         """
         Configures the optimizers and learning rate schedulers for the generative model.
 
-        :return: A list containing the optimizer and optionally a list containing the learning rate scheduler.
+        :return: A list containing the optimizer and optionally a list
+            containing the learning rate scheduler.
         """
+        assert self.lr_config is not None
         lr_config = self.lr_config
-        lr = lr_config['learning_rate']
+        lr = lr_config["learning_rate"]
         params = [p for p in self.parameters() if p.requires_grad]
         logger.info(f"@Training [{len(params)}] Full Paramters.")
 
         ## optimizer
-        if self.trainer.strategy.__class__.__name__ == 'DeepSpeedStrategy':
+        if self.trainer.strategy.__class__.__name__ == "DeepSpeedStrategy":
             from deepspeed.ops.adam import DeepSpeedCPUAdam
+
             optimizer = DeepSpeedCPUAdam(params, lr=lr)
-        else: 
+        else:
             optimizer = torch.optim.AdamW(params, lr=lr)
 
         ## lr scheduler
-        if lr_config.get('lr_scheduler_config', None):
+        if lr_config.get("lr_scheduler_config", None):
             logger.info("Setting up LambdaLR scheduler...")
             lr_scheduler = self.configure_lr_schedulers(optimizer)
             return [optimizer], [lr_scheduler]
-        
+
         return optimizer
 
     def configure_lr_schedulers(self, optimizer):
@@ -233,176 +177,46 @@ def configure_lr_schedulers(self, optimizer):
         :param optimizer: The optimizer for which the scheduler is being configured.
         :return: A dictionary containing the scheduler, interval, and frequency.
         """
-        lr_scheduler_config = self.lr_config.lr_scheduler_config
-        assert 'target' in lr_scheduler_config
-        scheduler_name = lr_scheduler_config.target.split('.')[-1]
-        interval = lr_scheduler_config.interval
-        frequency = lr_scheduler_config.frequency
+        assert self.lr_config is not None
+        lr_scheduler_config = self.lr_config["lr_scheduler_config"]
+        assert "target" in lr_scheduler_config
+        scheduler_name = lr_scheduler_config["target"].split(".")[-1]
+        interval = lr_scheduler_config["interval"]
+        frequency = lr_scheduler_config["frequency"]
         if scheduler_name == "LambdaLRScheduler":
             scheduler = instantiate_from_config(lr_scheduler_config)
             scheduler.start_step = self.global_step
             lr_scheduler = {
-                            'scheduler': LambdaLR(optimizer, lr_lambda=scheduler.schedule),
-                            'interval': interval,
-                            'frequency': frequency
+                "scheduler": LambdaLR(optimizer, lr_lambda=scheduler.schedule),
+                "interval": interval,
+                "frequency": frequency,
             }
         elif scheduler_name == "CosineAnnealingLRScheduler":
             scheduler = instantiate_from_config(lr_scheduler_config)
             decay_steps = scheduler.decay_steps
             last_step = -1 if self.global_step == 0 else scheduler.start_step
             lr_scheduler = {
-                            'scheduler': CosineAnnealingLR(optimizer, T_max=decay_steps, last_epoch=last_step),
-                            'interval': interval,
-                            'frequency': frequency
+                "scheduler": CosineAnnealingLR(
+                    optimizer, T_max=decay_steps, last_epoch=last_step
+                ),
+                "interval": interval,
+                "frequency": frequency,
             }
         else:
             raise NotImplementedError
         return lr_scheduler
-    
-    def set_trainable_components(
-        self,
-        components: Union[str, List[str]] = [],
-    ):
-        """
-        Sets the components of the generative model that should be trainable.
-
-        :param components: The components to be set as trainable.
-        """
-        if isinstance(components, str):
-            components = [components]
-        
-        # eval all components
-        for component in self.components:
-            model = getattr(self, component)
-            if model is None or not isinstance(model, nn.Module):
-                logger.info(f"Skipping eval component {component} since it is not set or not module")
-                continue
 
-            model.eval()
-            model.requires_grad_(False)
-
-        # train selected components
-        for component in components:
-            model = getattr(self, component)
-            if model is None:
-                raise ValueError(f"Invalid component name: {component}")
-        
-            if not isinstance(model, nn.Module):
-                logger.info(f"Skipping train component {component} since it is not module")
-                continue
-            
-            #if denoiser lora, make sure only lora params require grad
-            if component == Component.DENOISER.value and self.use_lora:
-                ## TODO how to define lora module
-                model.train()
-                for name, param in model.named_parameters():
-                    if name in self.lora_params:
-                        param.requires_grad_(True)
-            else:
-                model.train()
-                model.requires_grad_(True)
-                
-        print_green(f"Set the following components as trainable: {components}")
-    
-    
-    def load_first_stage(self, 
-                         ckpt_path: str,
-                         ignore_missing_ckpts: bool = False):
-        path = os.path.join(ckpt_path, self.first_stage_model_path)
-        if os.path.exists(path):
-            self.first_stage_model = self.load_model(self.first_stage_model, path)
-            print_green("Successfully loaded first_stage_model from checkpoint.")
-        elif ignore_missing_ckpts:
-            print_yellow("Checkpoint of first_stage_model file not found. Ignoring.")
-        else:
-            raise FileNotFoundError("Checkpoint of first_stage_model file not found.")
-
-    
-    def load_cond_stage(self, 
-                         ckpt_path: str,
-                         ignore_missing_ckpts: bool = False):
-        path = os.path.join(ckpt_path, self.cond_stage_model_path)
-        if os.path.exists(path):
-            self.cond_stage_model = self.load_model(self.cond_stage_model, path)
-            print_green("Successfully loaded cond_stage_model from checkpoint.")
-        elif ignore_missing_ckpts:
-            print_yellow("Checkpoint of cond_stage_model file not found. Ignoring.")
-        else:
-            raise FileNotFoundError("Checkpoint of cond_stage_model file not found.")
-
-    def load_cond_stage_2(self, 
-                         ckpt_path: str,
-                         ignore_missing_ckpts: bool = False):
-        if self.cond_stage_2_model is None:
-            return
-        
-        path = os.path.join(ckpt_path, self.cond_stage_2_model_path)
-        if os.path.exists(path):
-            self.cond_stage_2_model = self.load_model(self.cond_stage_2_model, path)
-            print_green("Successfully loaded cond_stage_2_model from checkpoint.")
-        elif ignore_missing_ckpts:
-            print_yellow("Checkpoint of cond_stage_2_model file not found. Ignoring.")
-        else:
-            raise FileNotFoundError("Checkpoint of cond_stage_2_model file not found.")
-    def load_denoiser(self, 
-                    ckpt_path: str = None,
-                    denoiser_ckpt_path: str = None,
-                    ignore_missing_ckpts: bool = False):
-        path = os.path.join(ckpt_path, self.denoiser_path)
-        if denoiser_ckpt_path is not None:
-            path = denoiser_ckpt_path
-
-        if os.path.exists(path):
-            self.denoiser = self.load_model(self.denoiser, path)
-            print_green("Successfully loaded denoiser from checkpoint.")
-        elif ignore_missing_ckpts:
-            print_yellow("Checkpoint of denoiser file not found. Ignoring.")
-        else:
-            raise FileNotFoundError("Checkpoint of denoiser file not found.")
-            
-    def load_lora(self,
-                lora_ckpt_path: str = None,
-                ignore_missing_ckpts: bool = False):
-        if not self.use_lora:
-            return
-        
-        lora_path = self.lora_path
-        if lora_ckpt_path is not None:
-            lora_path = lora_ckpt_path
-        
-        if os.path.exists(lora_path):
-            self.load_model(self.denoiser, lora_path, strict=False)
-            print_green("Successfully loaded denoiser from checkpoint.")
-        elif ignore_missing_ckpts:
-            print_yellow("Checkpoint of denoiser file not found. Ignoring.")
-        else:
-            raise FileNotFoundError("Checkpoint of denoiser file not found.")
-        
-    def from_pretrained(self,
-                        ckpt_path: Optional[Union[str, Path]] = None,
-                        denoiser_ckpt_path: Optional[Union[str, Path]] = None,
-                        lora_ckpt_path: Optional[Union[str, Path]] = None,
-                        ignore_missing_ckpts: bool = False) -> None:
-        assert ckpt_path is not None, "Please provide a valid checkpoint path."
-
-        #can ovrride following methods
-        self.load_first_stage(ckpt_path, ignore_missing_ckpts)
-        self.load_cond_stage(ckpt_path, ignore_missing_ckpts)
-        self.load_cond_stage_2(ckpt_path, ignore_missing_ckpts)
-        self.load_denoiser(ckpt_path, denoiser_ckpt_path, ignore_missing_ckpts)
-        self.load_lora(lora_ckpt_path, ignore_missing_ckpts)
-    
     def enable_vram_management(self):
         logger.info("enable_vram_management: default moving to cuda")
         self.cuda()
-    
 
     def enable_cpu_offload(self):
         self.cpu_offload = True
 
-
-    def load_models_to_device(self, loadmodel_names=[], device='cuda'):
-        skip_components = ['scheduler']
+    def load_models_to_device(self, loadmodel_names=[], device=None):
+        if device is None:
+            device = str(resolve_inference_device())
+        skip_components = ["scheduler"]
         # only load models to device if cpu_offload is enabled
         if not self.cpu_offload:
             logger.info("cpu offload is closed, skipping")
@@ -416,7 +230,10 @@ def load_models_to_device(self, loadmodel_names=[], device='cuda'):
             if model_name not in loadmodel_names:
                 model = getattr(self, model_name)
                 if model is not None:
-                    if hasattr(model, "vram_management_enabled") and model.vram_management_enabled:
+                    if (
+                        hasattr(model, "vram_management_enabled")
+                        and model.vram_management_enabled
+                    ):
                         logger.info(f"{model_name} cpu offloading using offload method")
                         for module in model.modules():
                             if hasattr(module, "offload"):
@@ -429,7 +246,10 @@ def load_models_to_device(self, loadmodel_names=[], device='cuda'):
         for model_name in loadmodel_names:
             model = getattr(self, model_name)
             if model is not None:
-                if hasattr(model, "vram_management_enabled") and model.vram_management_enabled:
+                if (
+                    hasattr(model, "vram_management_enabled")
+                    and model.vram_management_enabled
+                ):
                     logger.info(f"{model_name} onloading using onload method")
                     for module in model.modules():
                         if hasattr(module, "onload"):
@@ -437,43 +257,16 @@ def load_models_to_device(self, loadmodel_names=[], device='cuda'):
                 else:
                     logger.info(f"{model_name} onloading using to device method")
                     model.to(device)
-        # fresh the cuda cache
-        torch.cuda.empty_cache()
-    
-    @staticmethod
-    def load_model(model: nn.Module, ckpt_path: Optional[Union[str, Path]] = None, strict=True):
-        """
-        Loads the weights of the model from a checkpoint file.
+        # fresh the accelerator cache
+        empty_accelerator_cache()
 
-        :param model: The model to be loaded.
-        :param ckpt_path: Path to the checkpoint file.
-        """
-        assert ckpt_path is not None, "Please provide a valid checkpoint path."
-
-        ckpt_path = Path(ckpt_path)
-        if ckpt_path.exists():
-            ckpt = torch.load(ckpt_path, map_location=torch.device('cpu'))
-            if 'state_dict' in ckpt:
-                state_dict = ckpt['state_dict']
-            else:
-                state_dict = ckpt
-            missing_keys, unexpected_keys = model.load_state_dict(state_dict, strict=strict)
-            all_keys = [i for i, _ in model.named_parameters()]
-            num_updated_keys = len(all_keys) - len(missing_keys)
-            num_unexpected_keys = len(unexpected_keys)
-            logger.info(f"{num_updated_keys} parameters are loaded from {ckpt_path}. {num_unexpected_keys} parameters are unexpected.")
-            return model
-        else:
-            raise FileNotFoundError("Checkpoint of model file not found.")
-
-    
     def init_trainer(self, train_config: DictConfig):
         # 1. basic info setup
         local_rank, global_rank, num_rank = get_dist_info()
 
-        debug = train_config['debug']
-        workdir = train_config['workdir']
-        ckptdir = train_config['ckptdir']
+        debug = train_config["debug"]
+        workdir = train_config["workdir"]
+        ckptdir = train_config["ckptdir"]
         lightning_config: DictConfig = train_config.get("lightning")
         trainer_config: DictConfig = lightning_config.get("trainer")
         self.first_stage_key = train_config.first_stage_key
@@ -482,14 +275,17 @@ def init_trainer(self, train_config: DictConfig):
 
         # 2. lr
         lr_config: DictConfig = train_config.get("lr_config")
-        bs = train_config['data']['params']['batch_size']
-        self.lr_config = OmegaConf.to_container(lr_config, resolve=True)
+        bs = train_config["data"]["params"]["batch_size"]
+        self.lr_config = cast(
+            dict[str, Any], OmegaConf.to_container(lr_config, resolve=True)
+        )
         self.configure_lr_config(self.lr_config, bs=bs, num_rank=num_rank)
-        
+
         # 3. dataset
         logger.info("***** Configuring Data *****")
-        data = instantiate_from_config(train_config['data'])
+        data = instantiate_from_config(train_config["data"])
         self.data = data
+        assert data is not None
         data.setup()
         for k in data.datasets:
             logger.info(
@@ -498,8 +294,8 @@ def init_trainer(self, train_config: DictConfig):
 
         ## 4. lightning trainer config
         logger.info(f"trainer_config: {trainer_config}")
-        num_nodes = trainer_config['num_nodes']
-        ngpu_per_node = trainer_config['devices']
+        num_nodes = trainer_config["num_nodes"]
+        ngpu_per_node = trainer_config["devices"]
         logger.info(f"Running on {num_rank}={num_nodes}x{ngpu_per_node} GPUs")
         logger.info("***** Configuring Trainer *****")
 
@@ -508,17 +304,21 @@ def init_trainer(self, train_config: DictConfig):
             trainer_config["accelerator"] = "gpu"
 
         ## 4.2 logger
-        trainer_kwargs = dict()
+        trainer_kwargs: dict[str, Any] = {}
         trainer_kwargs["num_sanity_val_steps"] = 0
         logger_cfg = get_trainer_logger(lightning_config, workdir, debug)
         trainer_kwargs["logger"] = instantiate_from_config(logger_cfg)
-        logger.info(f"logger save_dir: {trainer_kwargs['logger'].save_dir}")
+        logger_obj = trainer_kwargs["logger"]
+        if hasattr(logger_obj, "save_dir"):
+            logger.info(f"logger save_dir: {logger_obj.save_dir}")
 
         ## 4.3 callback
-        callbacks_cfg = get_trainer_callbacks(
-            lightning_config, workdir, ckptdir
+        callbacks_cfg = cast(
+            dict[str, Any], get_trainer_callbacks(lightning_config, workdir, ckptdir)
         )
         callbacks_cfg["image_logger"]["params"]["save_dir"] = workdir
+        if "training_metrics" in callbacks_cfg:
+            callbacks_cfg["training_metrics"]["params"]["save_dir"] = workdir
         trainer_kwargs["callbacks"] = [
             instantiate_from_config(callbacks_cfg[k]) for k in callbacks_cfg
         ]
@@ -527,16 +327,25 @@ def init_trainer(self, train_config: DictConfig):
         strategy_cfg = get_trainer_strategy(lightning_config)
         trainer_kwargs["strategy"] = (
             strategy_cfg
-            if type(strategy_cfg) == str
-            else instantiate_from_config(OmegaConf.to_container(strategy_cfg))
+            if isinstance(strategy_cfg, str)
+            else instantiate_from_config(
+                cast(dict[str, Any], OmegaConf.to_container(strategy_cfg))
+            )
         )
         trainer_kwargs["sync_batchnorm"] = False
 
         ## 4.5 create Trainer
         logger.info(f"trainer_kwargs: {trainer_kwargs}")
-        from pytorch_lightning.profilers import PyTorchProfiler
-        profiler = PyTorchProfiler(emit_nvtx=True)
-        trainer = Trainer(**trainer_config, **trainer_kwargs,  profiler=profiler)
+        enable_profiler = lightning_config.get("enable_profiler", False)
+        profiler = None
+        if enable_profiler:
+            from pytorch_lightning.profilers import PyTorchProfiler
+
+            profiler = PyTorchProfiler(emit_nvtx=True)
+        trainer_config_dict = cast(
+            dict[str, Any], OmegaConf.to_container(trainer_config, resolve=True)
+        )
+        trainer = Trainer(**trainer_config_dict, **trainer_kwargs, profiler=profiler)
         self.trainer = trainer
 
         ## 5. allow user
@@ -550,14 +359,20 @@ def melk(*args, **kwargs):
         def divein(*args, **kwargs):
             if trainer.global_rank == 0:
                 import pudb
+
                 pudb.set_trace()
 
         import signal
+
         signal.signal(signal.SIGUSR1, melk)
         signal.signal(signal.SIGUSR2, divein)
 
         ## since loaded weight will ovrride params, make sure it is been handled
-        if trainer.strategy.__class__.__name__ == 'DeepSpeedStrategy':
-            logger.info(f"Make parameter contiguous in case deepseed does not allow non contigouous data")
-            for param in self.parameters(): param.data = param.data.contiguous()
-        self.set_trainable_components([Component.DENOISER.value])
\ No newline at end of file
+        if trainer.strategy.__class__.__name__ == "DeepSpeedStrategy":
+            logger.info(
+                "Make parameter contiguous in case deepseed does not allow "
+                "non contigouous data"
+            )
+            for param in self.parameters():
+                param.data = param.data.contiguous()
+        self.set_trainable_components([Component.DENOISER.value])
diff --git a/videotuna/base/inference_base.py b/videotuna/base/inference_base.py
index 66002a12..5436cdad 100644
--- a/videotuna/base/inference_base.py
+++ b/videotuna/base/inference_base.py
@@ -1,15 +1,12 @@
-import torch
 import os
-from einops import rearrange
 from pathlib import Path
-from typing import Any, Dict, List, Optional, Union
-from loguru import logger
-import json
-from omegaconf import DictConfig, OmegaConf
+from typing import List, Optional, Union
 
 import torch
 import torchvision
-import torchvision.transforms as transforms
+from einops import rearrange
+from loguru import logger
+from omegaconf import DictConfig
 
 from videotuna.utils.args_utils import VideoMode
 
@@ -25,7 +22,9 @@ def __init__(self):
         pass
 
     @staticmethod
-    def process_savename(savename: List[str], n_per_prompt: int = 1, mode: str = 'default') -> List[str]:
+    def process_savename(
+        savename: List[str], n_per_prompt: int = 1, mode: str = "default"
+    ) -> List[str]:
         """
         Processes the save name to include the save path.
 
@@ -35,21 +34,21 @@ def process_savename(savename: List[str], n_per_prompt: int = 1, mode: str = 'de
         :return: The processed save name.
         """
         if n_per_prompt == 1:
-            if mode == 'default':
+            if mode == "default":
                 newnames = [f"prompt-{idx+1:04d}" for idx in range(len(savename))]
-            elif mode == 'prompt':
+            elif mode == "prompt":
                 newnames = []
                 for idx, name in enumerate(savename):
                     name = name[:100]  # limit the length of the name
                     newname = f"{name}"
                     newnames.append(newname)
         elif n_per_prompt > 1:
-            if mode == 'default':
+            if mode == "default":
                 newnames = []
                 for idx in range(len(savename)):
                     for i in range(n_per_prompt):
                         newnames.append(f"prompt-{idx+1:04d}-{i:02d}")
-            elif mode == 'prompt':
+            elif mode == "prompt":
                 newnames = []
                 for idx, name in enumerate(savename):
                     for i in range(n_per_prompt):
@@ -59,13 +58,9 @@ def process_savename(savename: List[str], n_per_prompt: int = 1, mode: str = 'de
             raise ValueError("Invalid number of samples per prompt.")
 
         return newnames
-    
+
     @staticmethod
-    def save_video(
-            vid_tensor: torch.Tensor,
-            savepath: str,
-            fps: int = 10
-        ) -> None:
+    def save_video(vid_tensor: torch.Tensor, savepath: str, fps: int = 10) -> None:
         """
         Save a video tensor to the specified path.
 
@@ -77,21 +72,21 @@ def save_video(
         assert vid_tensor.dim() == 4, "Invalid video tensor shape."
         video = vid_tensor.detach().cpu()
         video = torch.clamp(video.float(), -1.0, 1.0)
-        video = rearrange(video, 'c t h w -> t c h w')
+        video = rearrange(video, "c t h w -> t c h w")
         video = (video + 1.0) / 2.0
         video = (video * 255).to(torch.uint8).permute(0, 2, 3, 1)
-        
+
         torchvision.io.write_video(
             savepath, video, fps=fps, video_codec="h264", options={"crf": "10"}
         )
 
     def save_videos(
-            self,
-            batch_tensors: torch.Tensor, 
-            savedir: str, 
-            filenames: List[str], 
-            fps: int = 10
-        ) -> None:
+        self,
+        batch_tensors: torch.Tensor,
+        savedir: str,
+        filenames: List[str],
+        fps: int = 10,
+    ) -> None:
         """
         Save a batch of video tensors to the specified directory.
 
@@ -104,7 +99,9 @@ def save_videos(
         bs = batch_tensors.shape[0]
         n_samples = batch_tensors.shape[1]
         assert batch_tensors.dim() == 6, "Invalid batch shape."
-        assert n_samples * bs == len(filenames), "Number of filenames must match the batch size."
+        assert n_samples * bs == len(
+            filenames
+        ), "Number of filenames must match the batch size."
 
         c = 0
         for idx, vid_tensor in enumerate(batch_tensors):
@@ -113,31 +110,36 @@ def save_videos(
                 savepath = os.path.join(savedir, f"{filenames[c]}.mp4")
                 self.save_video(single_vid_tensor, savepath, fps=fps)
                 c += 1
-    
-    def save_metrics(self,
-                     gpu: List[float],
-                    time: List[float],
-                    config: DictConfig,
-                    savedir: str):
-        metrics = {
-            "gpu" : gpu,
-            "time": time,
-            "config" : OmegaConf.to_container(config, resolve=True)
-        }
-        with open(f"{savedir}/metric.json", "w") as f:
-            json.dump(metrics, f, indent=4)
-
-    
+
+    def save_metrics(
+        self,
+        gpu: List[float],
+        time: List[float],
+        config: DictConfig,
+        savedir: str,
+        frames: int = 1,
+    ):
+        from videotuna.utils.common_utils import save_metrics as write_metrics
+
+        write_metrics(
+            savedir=savedir,
+            config=config,
+            gpu=gpu,
+            time=time,
+            frames=frames,
+        )
+
     def save_videos_vbench(
-            self, 
-            batch_tensors: torch.Tensor, 
-            savedir: str, 
-            prompts: List[str], 
-            format_file: dict, 
-            fps: int = 10
-        ) -> None:
+        self,
+        batch_tensors: torch.Tensor,
+        savedir: str,
+        prompts: List[str],
+        format_file: dict,
+        fps: int = 10,
+    ) -> None:
         """
-        Save a batch of video tensors to the specified directory with filenames based on prompts.
+        Save a batch of video tensors to the specified directory with filenames
+        based on prompts.
 
         :param batch_tensors: A tensor containing the batch of video data.
         :param savedir: The directory where the videos will be saved.
@@ -158,7 +160,9 @@ def save_videos_vbench(
             for n in range(n_samples):
                 filename = f"{prompt}-{n}.mp4"
                 format_file[filename] = prompt
-                self.save_video(batch_tensors[idx, n], os.path.join(sub_savedir, filename), fps=fps)
+                self.save_video(
+                    batch_tensors[idx, n], os.path.join(sub_savedir, filename), fps=fps
+                )
 
     @staticmethod
     def load_prompts_from_txt(prompt_file: str) -> List[str]:
@@ -178,14 +182,14 @@ def load_prompts(prompts: Optional[Union[str, Path]]):
         prompt_list = []
         if prompts is None:
             return prompt_list
-        
-        if os.path.isfile(prompts) and prompts.endswith('.txt'):
+
+        if os.path.isfile(prompts) and prompts.endswith(".txt"):
             prompt_list = InferenceBase.load_prompts_from_txt(prompts)
         else:
             logger.info("Process the input path as a prompt")
             prompt_list = [prompts]
         return prompt_list
-    
+
     @staticmethod
     def get_target_filelist(data_dir: str, ext: str):
         """
@@ -207,22 +211,15 @@ def get_target_filelist(data_dir: str, ext: str):
             raise ValueError(f"No file with extensions {ext} found in {data_dir}.")
         return file_list
 
-    @staticmethod
-    def load_prompts_from_txt(prompt_file: str):
-        """Load and return a list of prompts from a text file, stripping whitespace."""
-        with open(prompt_file, "r") as f:
-            lines = f.readlines()
-        prompt_list = [line.strip() for line in lines if line.strip() != ""]
-        return prompt_list
-
     @staticmethod
     def load_prompts_images(prompt_dir: str):
-        #1. load prompts
+        # 1. load prompts
         prompt_files = InferenceBase.get_target_filelist(prompt_dir, ext="txt")
         if len(prompt_files) > 1:
             # only use the first one (sorted by name) if multiple exist
             logger.warning(
-                f"Warning: multiple prompt files exist. The one {os.path.split(prompt_files[0])[1]} is used."
+                "Warning: multiple prompt files exist. The one "
+                f"{os.path.split(prompt_files[0])[1]} is used."
             )
             prompt_file = prompt_files[0]
         elif len(prompt_files) == 1:
@@ -233,19 +230,21 @@ def load_prompts_images(prompt_dir: str):
 
         prompt_list = InferenceBase.load_prompts_from_txt(prompt_file)
 
-        #2. load images
-        image_path_list = sorted(InferenceBase.get_target_filelist(prompt_dir, ext="png,jpg,webp,jpeg"))
+        # 2. load images
+        image_path_list = sorted(
+            InferenceBase.get_target_filelist(prompt_dir, ext="png,jpg,webp,jpeg")
+        )
         return prompt_list, image_path_list
-    
-    
 
-    def load_inference_inputs(self, prompts: Optional[Union[str, Path]], mode: str = 't2v'):
+    def load_inference_inputs(
+        self, prompts: Optional[Union[str, Path]], mode: str = "t2v"
+    ):
         """
         Loads the prompts and conditions for the conditional stage model.
 
         :param prompts: List of prompts to be loaded.
         :param mode: The mode in which the prompts are loaded. `t2v` or `i2v`.
-        :return: `t2v` -> prompts; 
+        :return: `t2v` -> prompts;
                  `i2v` -> prompts + images.
         """
         assert prompts is not None, "Please provide a valid prompts or prompts path."
@@ -257,8 +256,6 @@ def load_inference_inputs(self, prompts: Optional[Union[str, Path]], mode: str =
         else:
             raise NotImplementedError("Invalid mode.")
 
-
-    
     # TODO: Add more methods as needed
     # - sample
     # - save results
diff --git a/videotuna/base/lora_training_mixin.py b/videotuna/base/lora_training_mixin.py
new file mode 100644
index 00000000..3d8c2492
--- /dev/null
+++ b/videotuna/base/lora_training_mixin.py
@@ -0,0 +1,93 @@
+from typing import Any, Dict, List, Optional, Union, cast
+
+import torch.nn as nn
+from loguru import logger
+from peft import get_peft_model
+
+from videotuna.base.component_loader import Component
+from videotuna.utils.common_utils import instantiate_from_config, print_green
+from videotuna.utils.lora_utils import (
+    collect_lora_parameter_names,
+    resolve_lora_target_modules,
+)
+
+
+class LoraTrainingMixin:
+    def instantiate_lora(self, config: Optional[Dict[str, Any]]) -> None:
+        self.use_lora = False
+        if config is not None:
+            logger.info("creating lora")
+            transformer_adapter_config = instantiate_from_config(config)
+            assert self.denoiser is not None
+            if transformer_adapter_config is not None and hasattr(
+                transformer_adapter_config, "target_modules"
+            ):
+                transformer_adapter_config.target_modules = resolve_lora_target_modules(
+                    self.denoiser, transformer_adapter_config.target_modules
+                )
+            self.denoiser = get_peft_model(
+                cast(Any, self.denoiser),
+                cast(Any, transformer_adapter_config),
+                autocast_adapter_dtype=False,
+            )
+            self.lora_params = collect_lora_parameter_names(self.denoiser)
+            self.denoiser.requires_grad_(False)
+            for name, param in self.denoiser.named_parameters():
+                if name in self.lora_params:
+                    param.requires_grad_(True)
+            self.use_lora = True
+            self.lora_path = config.get("ckpt_path")
+            logger.info(
+                f"self.use_lora: {self.use_lora} self.lora_path: {self.lora_path} "
+                f"self.lora_params: {self.lora_params}"
+            )
+
+    def set_trainable_components(
+        self,
+        components: Union[str, List[str]] = [],
+    ):
+        """
+        Sets the components of the generative model that should be trainable.
+
+        :param components: The components to be set as trainable.
+        """
+        if isinstance(components, str):
+            components = [components]
+
+        # eval all components
+        for component in self.components:
+            model = getattr(self, component)
+            if model is None or not isinstance(model, nn.Module):
+                logger.info(
+                    f"Skipping eval component {component} since it is not set or "
+                    "not module"
+                )
+                continue
+
+            model.eval()
+            model.requires_grad_(False)
+
+        # train selected components
+        for component in components:
+            model = getattr(self, component)
+            if model is None:
+                raise ValueError(f"Invalid component name: {component}")
+
+            if not isinstance(model, nn.Module):
+                logger.info(
+                    f"Skipping train component {component} since it is not module"
+                )
+                continue
+
+            # if denoiser lora, make sure only lora params require grad
+            if component == Component.DENOISER.value and self.use_lora:
+                ## TODO how to define lora module
+                model.train()
+                for name, param in model.named_parameters():
+                    if name in self.lora_params:
+                        param.requires_grad_(True)
+            else:
+                model.train()
+                model.requires_grad_(True)
+
+        print_green(f"Set the following components as trainable: {components}")
diff --git a/videotuna/base/model_base.py b/videotuna/base/model_base.py
index 93e6f894..ec61269b 100644
--- a/videotuna/base/model_base.py
+++ b/videotuna/base/model_base.py
@@ -1,49 +1,47 @@
-import torch
-import torch.nn as nn
-from typing import Any, Dict, List, Optional, Union
 from pathlib import Path
+from typing import Any, Dict, Union
 
-
-from typing import Union, Dict, Any
-from pathlib import Path
 import torch.nn as nn
 
+
 class ModelBase(nn.Module):
     """
-    A base class for all models. This class extends nn.Module from PyTorch and provides a structure
-    that all models should follow, including initialization, forward pass, and utility methods for
-    saving/loading models, getting model configuration, and counting model parameters.
+    A base class for all models. This class extends nn.Module from PyTorch and
+    provides a structure that all models should follow, including initialization,
+    forward pass, and utility methods for saving/loading models, getting model
+    configuration, and counting model parameters.
     """
-    
+
     def __init__(self):
         """
-        Initializes the ModelBase class. This method should be overridden in any subclass to
-        initialize model-specific components.
+        Initializes the ModelBase class. This method should be overridden in any
+        subclass to initialize model-specific components.
         """
         super().__init__()
-    
+
     def forward(self):
         """
-        Defines the forward pass of the model. This method should be implemented in any subclass
-        to specify how input data should be processed through the network.
+        Defines the forward pass of the model. This method should be implemented
+        in any subclass to specify how input data should be processed through the
+        network.
         """
         raise NotImplementedError("Please implement the forward method.")
-    
+
     def save_model(self, path: Union[str, Path]):
         """
-        Saves the model to a specified path. This method should be implemented in any subclass to
-        define how the model's state is saved.
-        
+        Saves the model to a specified path. This method should be implemented in
+        any subclass to define how the model's state is saved.
+
         Args:
             path (Union[str, Path]): The file path where the model will be saved.
         """
         pass
-    
+
     def load_model(self, path: Union[str, Path]):
         """
-        Loads the model from a specified path. This method should be implemented in any subclass to
-        define how the model's state is loaded.
-        
+        Loads the model from a specified path. This method should be implemented in
+        any subclass to define how the model's state is loaded.
+
         Args:
             path (Union[str, Path]): The file path from where the model will be loaded.
         """
@@ -51,9 +49,10 @@ def load_model(self, path: Union[str, Path]):
 
     def get_model_config(self) -> Dict[str, Any]:
         """
-        Returns a dictionary containing the configuration of the model. This method should be
-        implemented in any subclass to provide a way to access the model's configuration settings.
-        
+        Returns a dictionary containing the configuration of the model. This
+        method should be implemented in any subclass to provide a way to access
+        the model's configuration settings.
+
         Returns:
             Dict[str, Any]: A dictionary with model configuration details.
         """
@@ -61,7 +60,7 @@ def get_model_config(self) -> Dict[str, Any]:
 
     def get_param_count(self):
         """
-        Gets the total number of parameters in the model. This method should be implemented in
-        any subclass to provide a way to count the parameters.
+        Gets the total number of parameters in the model. This method should be
+        implemented in any subclass to provide a way to count the parameters.
         """
         pass
diff --git a/videotuna/base/train_base.py b/videotuna/base/train_base.py
index dbe0d178..3163168a 100644
--- a/videotuna/base/train_base.py
+++ b/videotuna/base/train_base.py
@@ -1,14 +1,13 @@
-import torch
 import pytorch_lightning as pl
-from typing import Any, Dict, List, Optional, Union
+import torch
 
 
 class TrainBase(pl.LightningModule):
     """
     Base class for training models using PyTorch Lightning.
-    This class extends pl.LightningModule and provides a template for implementing
-    custom training logic. Users should inherit from this class and override the necessary
-    methods to define their training process.
+    This class extends pl.LightningModule and provides a template for
+    implementing custom training logic. Users should inherit from this class and
+    override the necessary methods to define their training process.
     """
 
     def __init__(self):
@@ -17,26 +16,29 @@ def __init__(self):
         Call the parent class constructor using super().__init__().
         """
         super().__init__()
-    
+
     def configure_optimizers(self):
         """
         Configures the optimizers and learning rate schedulers.
-        This method should be overridden in the child class to define the optimizers and learning rate schedules.
+        This method should be overridden in the child class to define the
+        optimizers and learning rate schedules.
         """
         raise NotImplementedError("Please implement the configure_optimizers method")
-    
+
     def forward(self):
         """
         Defines the forward pass of the model.
-        This method should be overridden in the child class to define the model's forward pass.
+        This method should be overridden in the child class to define the model's
+        forward pass.
         """
         raise NotImplementedError("Please implement the forward method")
 
     def training_step(self, batch, batch_idx):
         """
         Defines a single training step.
-        This method should be overridden in the child class to implement the logic for a single training step.
-        
+        This method should be overridden in the child class to implement the
+        logic for a single training step.
+
         :param batch: A batch of input data.
         :param batch_idx: The index of the current batch.
         :return: A dictionary containing the loss and any other metrics to log.
@@ -47,9 +49,10 @@ def training_step(self, batch, batch_idx):
     def validation_step(self, batch, batch_idx):
         """
         Defines a single validation step.
-        This method can be overridden in the child class to implement the logic for a single validation step.
+        This method can be overridden in the child class to implement the logic
+        for a single validation step.
         If not overridden, it does nothing by default.
-        
+
         :param batch: A batch of input data.
         :param batch_idx: The index of the current batch.
         :return: A dictionary containing the loss and any other metrics to log.
@@ -140,4 +143,4 @@ def val_loop():
     # set up for train
     on_validation_model_train()  # calls `model.train()`
     torch.set_grad_enabled(True)
-"""
\ No newline at end of file
+"""
diff --git a/videotuna/cli/__init__.py b/videotuna/cli/__init__.py
new file mode 100644
index 00000000..15fe9f09
--- /dev/null
+++ b/videotuna/cli/__init__.py
@@ -0,0 +1 @@
+"""PrivTune CLI entrypoints."""
diff --git a/videotuna/cli/inference_app.py b/videotuna/cli/inference_app.py
new file mode 100644
index 00000000..b98ae545
--- /dev/null
+++ b/videotuna/cli/inference_app.py
@@ -0,0 +1,143 @@
+"""cyclopts App for PrivTune inference and validation entrypoints."""
+
+from __future__ import annotations
+
+import sys
+from collections.abc import Callable
+
+from cyclopts import App
+
+from videotuna.cli.inference_options import (
+    InferencePreset,
+    InferenceRunOptions,
+    StandardInferenceOptions,
+    inference_options_to_namespace,
+    validate_preset_requirements,
+)
+
+FLUX_DOMAIN_SMOKE_CONFIG = "configs/inference/presets/flux_domain_lora_smoke.yaml"
+WAN_DOMAIN_SMOKE_22_CONFIG = "configs/inference/presets/wan_domain_lora_smoke_22.yaml"
+WAN_DOMAIN_I2V_SMOKE_22_CONFIG = (
+    "configs/inference/presets/wan_domain_i2v_smoke_22.yaml"
+)
+WAN2_2_T2V_720P_CONFIG = "configs/inference/presets/balanced_wan2_2_720p.yaml"
+
+PRESET_DOMAIN_T2I = InferencePreset(
+    cli_name="inference-domain-t2i",
+    config=FLUX_DOMAIN_SMOKE_CONFIG,
+    enable_model_cpu_offload=True,
+)
+PRESET_VALIDATE_T2V = InferencePreset(
+    cli_name="validate-domain-t2v",
+    config=WAN_DOMAIN_SMOKE_22_CONFIG,
+    enable_model_cpu_offload=True,
+    require_checkpoint=True,
+)
+PRESET_WAN2_2_T2V_720P = InferencePreset(
+    cli_name="inference-wan2.2-t2v-720p",
+    config=WAN2_2_T2V_720P_CONFIG,
+)
+PRESET_VALIDATE_I2V = InferencePreset(
+    cli_name="validate-domain-i2v",
+    config=WAN_DOMAIN_I2V_SMOKE_22_CONFIG,
+    enable_model_cpu_offload=True,
+    require_checkpoint=True,
+    require_prompt_dir=True,
+)
+PRESET_WAN2_2_I2V_720P = InferencePreset(
+    cli_name="inference-wan2.2-i2v-720p",
+    config=WAN_DOMAIN_I2V_SMOKE_22_CONFIG,
+)
+
+
+def _run_inference_with_options(
+    run: InferenceRunOptions | None,
+    standard: StandardInferenceOptions | None,
+    *,
+    preset: InferencePreset | None = None,
+) -> None:
+    from scripts.inference_new import run_inference
+
+    run_opts = run or InferenceRunOptions()
+    if preset is not None:
+        validate_preset_requirements(run_opts, preset)
+    args = inference_options_to_namespace(
+        run=run_opts,
+        standard=standard,
+        preset=preset,
+    )
+    run_inference(args)
+
+
+def _register_inference_command(
+    app: App,
+    preset: InferencePreset | None,
+    *,
+    name: str | None = None,
+    is_default: bool = False,
+) -> None:
+    command_name = name or (preset.cli_name if preset else "run")
+
+    def handler(
+        run: InferenceRunOptions | None = None,
+        *,
+        standard: StandardInferenceOptions | None = None,
+    ) -> None:
+        _run_inference_with_options(run, standard, preset=preset)
+
+    handler.__doc__ = (
+        f"Run inference with preset {preset.config}."
+        if preset is not None
+        else "Run inference from a YAML config and CLI overrides."
+    )
+    if is_default:
+        app.default(handler)
+    else:
+        app.command(name=command_name)(handler)
+
+
+def _make_app(preset: InferencePreset | None = None, *, name: str | None = None) -> App:
+    app = App(
+        name=name or (preset.cli_name if preset else "privtune-inference"),
+        help="PrivTune domain inference and validation.",
+    )
+    _register_inference_command(app, preset, is_default=True)
+    return app
+
+
+app = App(name="privtune-inference", help="PrivTune domain inference and validation.")
+_register_inference_command(app, PRESET_DOMAIN_T2I, name="inference-domain-t2i")
+_register_inference_command(app, PRESET_VALIDATE_T2V, name="validate-domain-t2v")
+_register_inference_command(
+    app, PRESET_WAN2_2_T2V_720P, name="inference-wan2.2-t2v-720p"
+)
+_register_inference_command(app, None, name="run")
+
+
+def _entry_for_preset(preset: InferencePreset) -> Callable[[], None]:
+    cli_app = _make_app(preset)
+
+    def entry() -> None:
+        raise SystemExit(cli_app(sys.argv[1:]))
+
+    entry.__name__ = preset.cli_name.replace(".", "_").replace("-", "_")
+    entry.__doc__ = f"Entry point for {preset.cli_name}."
+    return entry
+
+
+def generic_inference_entry() -> None:
+    """Compat shim for ``python scripts/inference_new.py``."""
+    raise SystemExit(_make_app(name="inference").__call__(sys.argv[1:]))
+
+
+def main() -> None:
+    """Dispatch subcommands on the shared inference App."""
+    raise SystemExit(app(sys.argv[1:]))
+
+
+inference_domain_t2i_entry = _entry_for_preset(PRESET_DOMAIN_T2I)
+validate_domain_t2v_entry = _entry_for_preset(PRESET_VALIDATE_T2V)
+inference_wan2_2_t2v_720p_entry = _entry_for_preset(PRESET_WAN2_2_T2V_720P)
+inference_flux_lora_entry = inference_domain_t2i_entry
+validate_domain_i2v_entry = _entry_for_preset(PRESET_VALIDATE_I2V)
+inference_wan2_2_i2v_720p_entry = _entry_for_preset(PRESET_WAN2_2_I2V_720P)
diff --git a/videotuna/cli/inference_options.py b/videotuna/cli/inference_options.py
new file mode 100644
index 00000000..b83903a6
--- /dev/null
+++ b/videotuna/cli/inference_options.py
@@ -0,0 +1,174 @@
+"""Typed cyclopts option groups for inference entrypoints."""
+
+from __future__ import annotations
+
+import argparse
+from dataclasses import dataclass, fields
+from typing import Annotated, Any, Literal
+
+from cyclopts import Parameter
+
+from videotuna.utils.inference_profile import MemoryPreset
+
+DtypeChoice = Literal["bf16", "fp16", "fp32"]
+DeviceMapChoice = Literal["auto"]
+
+
+@Parameter(name="*")
+@dataclass
+class StandardInferenceOptions:
+    """Memory, device, and performance flags shared by all inference commands."""
+
+    cpu_smoke: Annotated[bool | None, Parameter(name="cpu-smoke")] = None
+    device: Annotated[str | None, Parameter(name="device", alias="--gpu-id")] = None
+    min_vram_gb: Annotated[float | None, Parameter(name="min-vram-gb")] = None
+    memory_preset: Annotated[MemoryPreset | None, Parameter(name="memory-preset")] = (
+        None
+    )
+    enable_vae_tiling: Annotated[bool | None, Parameter(name="enable_vae_tiling")] = (
+        None
+    )
+    enable_vae_slicing: Annotated[bool | None, Parameter(name="enable_vae_slicing")] = (
+        None
+    )
+    enable_model_cpu_offload: Annotated[
+        bool | None, Parameter(name="enable_model_cpu_offload")
+    ] = None
+    enable_sequential_cpu_offload: Annotated[
+        bool | None, Parameter(name="enable_sequential_cpu_offload")
+    ] = None
+    dtype: DtypeChoice | None = None
+    device_map: Annotated[DeviceMapChoice | None, Parameter(name="device-map")] = None
+    ulysses_degree: int | None = None
+    ring_degree: int | None = None
+    compile: Annotated[bool | None, Parameter(name="compile")] = None
+    fuse_qkv: bool | None = None
+    enable_attention_cache: bool | None = None
+    transformer_quant: Annotated[
+        str | None,
+        Parameter(
+            name="transformer-quant",
+            help=(
+                "Diffusers transformer weight-only quant: "
+                "none, int8_wo, int4_wo, fp8_wo (CUDA; fp8_wo needs Ada/Hopper+)."
+            ),
+        ),
+    ] = None
+    quant_backend: Annotated[str | None, Parameter(name="quant-backend")] = None
+
+
+@Parameter(name="*")
+@dataclass
+class InferenceRunOptions:
+    """Model, prompt, and sampling flags for inference."""
+
+    mode: str | None = None
+    ckpt_path: str | None = None
+    lorackpt: str | None = None
+    trained_ckpt: str | None = None
+    config: str | None = None
+    prompt_file: str | None = None
+    prompt_dir: str | None = None
+    savedir: str | None = None
+    standard_vbench: bool | None = None
+    seed: int | None = None
+    height: int | None = None
+    width: int | None = None
+    frames: int | None = None
+    fps: int | None = None
+    n_samples_prompt: int | None = None
+    bs: int | None = None
+    ddim_steps: int | None = None
+    ddim_eta: float | None = None
+    uncond_prompt: str | None = None
+    unconditional_guidance_scale: float | None = None
+    unconditional_guidance_scale_temporal: float | None = None
+    multiple_cond_cfg: bool | None = None
+    cfg_img: float | None = None
+    timestep_spacing: str | None = None
+    guidance_rescale: float | None = None
+    loop: bool | None = None
+    gfi: bool | None = None
+    savefps: str | None = None
+    time_shift: float | None = None
+    num_inference_steps: int | None = None
+    i2v_resolution: str | None = None
+    lora_rank: int | None = None
+
+
+@dataclass(frozen=True)
+class InferencePreset:
+    """Baked-in defaults for a Poetry inference entrypoint."""
+
+    cli_name: str
+    config: str
+    enable_model_cpu_offload: bool = False
+    require_checkpoint: bool = False
+    require_prompt_dir: bool = False
+
+
+def _non_null_values(options: Any) -> dict[str, Any]:
+    return {field.name: getattr(options, field.name) for field in fields(options)}
+
+
+def inference_options_to_namespace(
+    *,
+    run: InferenceRunOptions | None = None,
+    standard: StandardInferenceOptions | None = None,
+    preset: InferencePreset | None = None,
+) -> argparse.Namespace:
+    """Flatten typed option groups into a namespace for legacy inference code."""
+    merged: dict[str, Any] = {}
+
+    if preset is not None:
+        merged["config"] = preset.config
+        if preset.enable_model_cpu_offload:
+            merged["enable_model_cpu_offload"] = True
+
+    for key, value in _non_null_values(run or InferenceRunOptions()).items():
+        if value is not None:
+            merged[key] = value
+
+    for key, value in _non_null_values(standard or StandardInferenceOptions()).items():
+        if value is not None:
+            merged[key] = value
+
+    if merged.get("cpu_smoke") is None:
+        merged["cpu_smoke"] = False
+
+    bool_defaults = (
+        "enable_vae_tiling",
+        "enable_vae_slicing",
+        "enable_model_cpu_offload",
+        "enable_sequential_cpu_offload",
+        "compile",
+        "fuse_qkv",
+        "enable_attention_cache",
+    )
+    for key in bool_defaults:
+        if key not in merged:
+            merged[key] = False
+
+    return argparse.Namespace(**merged)
+
+
+def validate_preset_requirements(
+    run: InferenceRunOptions,
+    preset: InferencePreset,
+) -> None:
+    """Enforce checkpoint / prompt requirements for validation entrypoints."""
+    import sys
+
+    if preset.require_checkpoint and not (run.trained_ckpt or run.lorackpt):
+        print(
+            f"Error: {preset.cli_name} requires --trained_ckpt <denoiser.ckpt> "
+            "or --lorackpt <checkpoint-dir>",
+            file=sys.stderr,
+        )
+        raise SystemExit(2)
+    if preset.require_prompt_dir and not run.prompt_dir:
+        print(
+            f"Error: {preset.cli_name} requires --prompt_dir <image+prompt pairs>",
+            file=sys.stderr,
+        )
+        raise SystemExit(2)
diff --git a/videotuna/data/anno_files/toy_image_dataset.csv b/videotuna/data/anno_files/toy_image_dataset.csv
new file mode 100644
index 00000000..2f2b3aec
--- /dev/null
+++ b/videotuna/data/anno_files/toy_image_dataset.csv
@@ -0,0 +1,17 @@
+path,caption,fps,frames,height,width
+sample.png,"A sample image for dataset tests.",0,1,512,512
+sample2.png,"Another sample image for dataset tests.",0,1,512,512
+sample3.png,"Third sample image for dataset tests.",0,1,512,512
+sample4.png,"Fourth sample image for dataset tests.",0,1,512,512
+sample5.png,"Fifth sample image for dataset tests.",0,1,512,512
+sample6.png,"Sixth sample image for dataset tests.",0,1,512,512
+sample7.png,"Seventh sample image for dataset tests.",0,1,512,512
+sample8.png,"Eighth sample image for dataset tests.",0,1,512,512
+sample9.png,"Ninth sample image for dataset tests.",0,1,512,512
+sample10.png,"Tenth sample image for dataset tests.",0,1,512,512
+sample11.png,"Eleventh sample image for dataset tests.",0,1,512,512
+sample12.png,"Twelfth sample image for dataset tests.",0,1,512,512
+sample13.png,"Thirteenth sample image for dataset tests.",0,1,512,512
+sample14.png,"Fourteenth sample image for dataset tests.",0,1,512,512
+sample15.png,"Fifteenth sample image for dataset tests.",0,1,512,512
+sample16.png,"Sixteenth sample image for dataset tests.",0,1,512,512
diff --git a/videotuna/data/cogvideo_dataset.py b/videotuna/data/cogvideo_dataset.py
deleted file mode 100644
index 4da8e6f2..00000000
--- a/videotuna/data/cogvideo_dataset.py
+++ /dev/null
@@ -1,223 +0,0 @@
-import argparse
-import logging
-import math
-import os
-import shutil
-from pathlib import Path
-from typing import List, Optional, Tuple, Union
-
-import torch
-from torch.utils.data import DataLoader, Dataset
-from torchvision import transforms
-
-
-class VideoDataset(Dataset):
-    def __init__(
-        self,
-        instance_data_root: Optional[str] = None,
-        dataset_name: Optional[str] = None,
-        dataset_config_name: Optional[str] = None,
-        caption_column: str = "text",
-        video_column: str = "video",
-        height: int = 480,
-        width: int = 720,
-        fps: int = 8,
-        max_num_frames: int = 49,
-        skip_frames_start: int = 0,
-        skip_frames_end: int = 0,
-        cache_dir: Optional[str] = None,
-        id_token: Optional[str] = None,
-        image_to_video: bool = False,
-    ) -> None:
-        super().__init__()
-
-        self.instance_data_root = (
-            Path(instance_data_root) if instance_data_root is not None else None
-        )
-        self.dataset_name = dataset_name
-        self.dataset_config_name = dataset_config_name
-        self.caption_column = caption_column
-        self.video_column = video_column
-        self.height = height
-        self.width = width
-        self.fps = fps
-        self.max_num_frames = max_num_frames
-        self.skip_frames_start = skip_frames_start
-        self.skip_frames_end = skip_frames_end
-        self.cache_dir = cache_dir
-        self.id_token = id_token or ""
-        self.image_to_video = image_to_video
-
-        if dataset_name is not None:
-            self.instance_prompts, self.instance_video_paths = (
-                self._load_dataset_from_hub()
-            )
-        else:
-            self.instance_prompts, self.instance_video_paths = (
-                self._load_dataset_from_local_path()
-            )
-
-        self.num_instance_videos = len(self.instance_video_paths)
-        if self.num_instance_videos != len(self.instance_prompts):
-            raise ValueError(
-                f"Expected length of instance prompts and videos to be the same but found {len(self.instance_prompts)=} and {len(self.instance_video_paths)=}. Please ensure that the number of caption prompts and videos match in your dataset."
-            )
-
-        self.instance_videos = self._preprocess_data()
-
-    def __len__(self):
-        return self.num_instance_videos
-
-    def __getitem__(self, index):
-        if self.image_to_video:
-            image = self.instance_videos[index][:1].clone()
-            return {
-                "prompt": self.id_token + self.instance_prompts[index],
-                "video": self.instance_videos[index],
-                "image": image,
-            }
-        else:
-            return {
-                "prompt": self.id_token + self.instance_prompts[index],
-                "video": self.instance_videos[index],
-            }
-
-    def _load_dataset_from_hub(self):
-        try:
-            from datasets import load_dataset
-        except ImportError:
-            raise ImportError(
-                "You are trying to load your data using the datasets library. If you wish to train using custom "
-                "captions please install the datasets library: `pip install datasets`. If you wish to load a "
-                "local folder containing images only, specify --instance_data_root instead."
-            )
-
-        # Downloading and loading a dataset from the hub. See more about loading custom images at
-        # https://huggingface.co/docs/datasets/v2.0.0/en/dataset_script
-        dataset = load_dataset(
-            self.dataset_name,
-            self.dataset_config_name,
-            cache_dir=self.cache_dir,
-        )
-        column_names = dataset["train"].column_names
-
-        if self.video_column is None:
-            video_column = column_names[0]
-            logger.info(f"`video_column` defaulting to {video_column}")
-        else:
-            video_column = self.video_column
-            if video_column not in column_names:
-                raise ValueError(
-                    f"`--video_column` value '{video_column}' not found in dataset columns. Dataset columns are: {', '.join(column_names)}"
-                )
-
-        if self.caption_column is None:
-            caption_column = column_names[1]
-            logger.info(f"`caption_column` defaulting to {caption_column}")
-        else:
-            caption_column = self.caption_column
-            if self.caption_column not in column_names:
-                raise ValueError(
-                    f"`--caption_column` value '{self.caption_column}' not found in dataset columns. Dataset columns are: {', '.join(column_names)}"
-                )
-
-        instance_prompts = dataset["train"][caption_column]
-        instance_videos = [
-            Path(self.instance_data_root, filepath)
-            for filepath in dataset["train"][video_column]
-        ]
-
-        return instance_prompts, instance_videos
-
-    def _load_dataset_from_local_path(self):
-        if not self.instance_data_root.exists():
-            raise ValueError("Instance videos root folder does not exist")
-
-        prompt_path = self.instance_data_root.joinpath(self.caption_column)
-        video_path = self.instance_data_root.joinpath(self.video_column)
-
-        if not prompt_path.exists() or not prompt_path.is_file():
-            raise ValueError(
-                "Expected `--caption_column` to be path to a file in `--instance_data_root` containing line-separated text prompts."
-            )
-        if not video_path.exists() or not video_path.is_file():
-            raise ValueError(
-                "Expected `--video_column` to be path to a file in `--instance_data_root` containing line-separated paths to video data in the same directory."
-            )
-
-        with open(prompt_path, "r", encoding="utf-8") as file:
-            instance_prompts = [
-                line.strip() for line in file.readlines() if len(line.strip()) > 0
-            ]
-        with open(video_path, "r", encoding="utf-8") as file:
-            instance_videos = [
-                self.instance_data_root.joinpath(line.strip())
-                for line in file.readlines()
-                if len(line.strip()) > 0
-            ]
-
-        if any(not path.is_file() for path in instance_videos):
-            raise ValueError(
-                "Expected '--video_column' to be a path to a file in `--instance_data_root` containing line-separated paths to video data but found atleast one path that is not a valid file."
-            )
-        
-        return instance_prompts, instance_videos
-
-    def _preprocess_data(self):
-        try:
-            import decord
-        except ImportError:
-            raise ImportError(
-                "The `decord` package is required for loading the video dataset. Install with `pip install decord`"
-            )
-
-        decord.bridge.set_bridge("torch")
-
-        videos = []
-        train_transforms = transforms.Compose(
-            [
-                transforms.Lambda(lambda x: x / 255.0 * 2.0 - 1.0),
-            ]
-        )
-
-        for filename in self.instance_video_paths:
-            video_reader = decord.VideoReader(
-                uri=filename.as_posix(), width=self.width, height=self.height
-            )
-            video_num_frames = len(video_reader)
-
-            start_frame = min(self.skip_frames_start, video_num_frames)
-            end_frame = max(0, video_num_frames - self.skip_frames_end)
-            if end_frame <= start_frame:
-                frames = video_reader.get_batch([start_frame])
-            elif end_frame - start_frame <= self.max_num_frames:
-                frames = video_reader.get_batch(list(range(start_frame, end_frame)))
-            else:
-                indices = list(
-                    range(
-                        start_frame,
-                        end_frame,
-                        (end_frame - start_frame) // self.max_num_frames,
-                    )
-                )
-                frames = video_reader.get_batch(indices)
-
-            # Ensure that we don't go over the limit
-            frames = frames[: self.max_num_frames]
-            selected_num_frames = frames.shape[0]
-
-            # TODO: check this 
-            # Choose first (4k + 1) frames as this is how many is required by the VAE
-            remainder = (3 + (selected_num_frames % 4)) % 4
-            if remainder != 0:
-                frames = frames[:-remainder]
-            selected_num_frames = frames.shape[0]
-
-            assert (selected_num_frames - 1) % 4 == 0
-
-            # Training transforms
-            frames = frames.float()
-            frames = torch.stack([train_transforms(frame) for frame in frames], dim=0)
-            videos.append(frames.permute(0, 3, 1, 2).contiguous())  # [F, C, H, W]
-            # print(f"video shape end: {frames.shape}")
-        return videos
diff --git a/videotuna/data/datasets.py b/videotuna/data/datasets.py
index ba16dfc4..20314b68 100644
--- a/videotuna/data/datasets.py
+++ b/videotuna/data/datasets.py
@@ -4,18 +4,18 @@
 sys.path.append(os.getcwd())
 import copy
 import random
-from typing import Dict, List, Tuple, Union
+from typing import Dict, List, Union
 
 import pandas as pd
 import torch
 from torchvision.datasets.folder import pil_loader
 from torchvision.transforms import Compose
+from torchvision.transforms.functional import pil_to_tensor
 
 from videotuna.data.datasets_utils import (
     is_image,
     is_video,
     read_image_meta,
-    read_video,
     read_video_meta,
 )
 from videotuna.data.transforms import (
@@ -23,6 +23,11 @@
     get_transforms_image,
     get_transforms_video,
 )
+from videotuna.utils.video_io import (
+    get_video_frame_count,
+    read_video_frames,
+    sample_frame_indices,
+)
 
 
 class DatasetFromCSV(torch.utils.data.Dataset):
@@ -46,8 +51,8 @@ class DatasetFromCSV(torch.utils.data.Dataset):
             ```
 
         data_root : str or list
-            the root path of the data item. If the path in the csv file is a relative path,
-            the data_root will be added to the file path.
+            the root path of the data item. If the path in the csv file is a
+            relative path, the data_root will be added to the file path.
 
         transform : callable
             the transform function to process the video/image data.
@@ -59,7 +64,8 @@ class DatasetFromCSV(torch.utils.data.Dataset):
             the interval of the sampled frames.
 
         train : bool
-            if True, the dataset is for training. Otherwise, the dataset is for validation.
+            if True, the dataset is for training. Otherwise, the dataset is for
+            validation.
 
         split_val : bool
             if True, split the dataset into training and validation dataset.
@@ -79,13 +85,18 @@ def __init__(
         train: bool = True,
         split_val: bool = False,
         image_to_video: bool = False,
+        video_backend: str = "auto",
         **kwargs,
     ):
+        if "video_length" in kwargs:
+            num_frames = kwargs.pop("video_length")
         self.csv_path = csv_path
-        if isinstance(csv_path, str):
-            csv_path = [csv_path]
-        if data_root is None or isinstance(data_root, str):
-            data_root = [data_root]
+        if isinstance(csv_path, (str, os.PathLike)):
+            csv_path = [str(csv_path)]
+        if data_root is None:
+            data_root = [None]
+        elif isinstance(data_root, (str, os.PathLike)):
+            data_root = [str(data_root)]
 
         if len(data_root) == 1:
             data_root = data_root * len(csv_path)
@@ -96,7 +107,12 @@ def __init__(
 
         if transform is None:
             transform = dict(
-                video=get_transforms_video((height, width), num_frames, frame_interval),
+                video=get_transforms_video(
+                    (height, width),
+                    num_frames,
+                    frame_interval,
+                    temporal_crop=False,
+                ),
                 image=get_transforms_image((height, width), num_frames),
             )
 
@@ -116,6 +132,7 @@ def __init__(
         self.split_val = split_val
         self.safe_data_list = set()
         self.image_to_video = image_to_video
+        self.video_backend = video_backend
         self.check_video = CheckVideo(self.resolution, frame_interval, num_frames)
 
         self.load_annotations(csv_path, data_root)
@@ -136,11 +153,21 @@ def load_annotations(self, csv_path, data_root):
         self.data_list = []
         for i, path in enumerate(csv_path):
             df = pd.read_csv(path)
-            self.check_df(df, path)
+            self._validate_csv_schema(df, path)
+            pair_mode = (
+                "video_path" in df.columns
+                and "image_path" in df.columns
+                and "path" not in df.columns
+            )
             for _, row in df.iterrows():
-                video_path = row.get(
-                    "path", row.get("video_path", row.get("image_path"))
-                )
+                if pair_mode:
+                    video_path = row["video_path"]
+                    image_path = row["image_path"]
+                else:
+                    video_path = row.get(
+                        "path", row.get("video_path", row.get("image_path"))
+                    )
+                    image_path = row.get("image_path", None)
                 caption = row["caption"]
 
                 if not self._is_valid_data(row):
@@ -148,7 +175,11 @@ def load_annotations(self, csv_path, data_root):
 
                 if data_root[i]:
                     video_path = os.path.join(data_root[i], video_path)
+                    if image_path is not None:
+                        image_path = os.path.join(data_root[i], image_path)
                 data_dict = {"path": video_path, "caption": caption}
+                if image_path is not None:
+                    data_dict["image_path"] = image_path
                 data_dict["fps"] = (
                     row.get("fps") / self.frame_interval
                     if row.get("fps", None)
@@ -160,14 +191,21 @@ def load_annotations(self, csv_path, data_root):
 
                 self.data_list.append(data_dict)
 
-    def getitem(self, index):
+    def getitem(self, index):  # noqa: C901
         data = copy.deepcopy(self.data_list[index])
         path = data.pop("path")
+        image_path = data.pop("image_path", None)
         if is_video(path):
-            video = read_video(path)
-            video = self.check_video(
-                video, index
-            )  # filter the video with unsatisfied resolution and frames
+            total_frames = get_video_frame_count(path)
+            if total_frames < self.frame_limit:
+                raise ValueError(
+                    f"The video has not enough frames. Current frames: {total_frames}"
+                )
+            indices = sample_frame_indices(
+                total_frames, self.num_frames, self.frame_interval
+            )
+            video = read_video_frames(path, indices, backend=self.video_backend)
+            video = self.check_video(video, index)
             video = self.transform["video"](video)
         elif is_image(path):
             video = pil_loader(path)
@@ -205,8 +243,27 @@ def getitem(self, index):
 
         if self.image_to_video:
             data["image"] = data["video"][:, :1, :, :].clone()  # CTHW (3，1，H, W)
+        elif image_path is not None:
+            data["image"] = self._load_conditioning_image(image_path)
         return data
 
+    def _load_conditioning_image(self, image_path: str) -> torch.Tensor:
+        if not is_image(image_path):
+            raise ValueError(f"Unsupported conditioning image format: {image_path}")
+        frame = pil_to_tensor(pil_loader(image_path)).unsqueeze(0)
+        if "video" in self.transform:
+            spatial = [
+                t
+                for t in self.transform["video"].transforms
+                if t.__class__.__name__ != "TemporalRandomCrop"
+            ]
+            frame = Compose(spatial)(frame)
+        else:
+            frame = self.transform["image"](pil_loader(image_path))
+            if frame.dim() == 4:
+                frame = frame.unsqueeze(0)
+        return frame.permute(1, 0, 2, 3)
+
     def __getitem__(self, index):
         cnt = 100
         while cnt > 0:  # randomly get a good data, till 100 times
@@ -215,7 +272,7 @@ def __getitem__(self, index):
                 data_item = self.getitem(index)
                 self.safe_data_list.add(index)
                 return data_item
-            except (ValueError, AssertionError) as e:
+            except (ValueError, AssertionError):
                 import traceback
 
                 traceback.print_exc()
@@ -251,18 +308,65 @@ def _is_valid_data(self, row) -> bool:
 
         return True
 
+    def _validate_csv_schema(self, df, df_path):
+        columns = set(df.columns)
+        has_path = "path" in columns
+        has_video_path = "video_path" in columns
+        has_image_path = "image_path" in columns
+        pair_mode = has_video_path and has_image_path and not has_path
+
+        if not (has_path or has_video_path or has_image_path):
+            raise ValueError(
+                f"The csv file {df_path} must have a column named 'path', "
+                "'video_path', or 'image_path'."
+            )
+        if "caption" not in columns:
+            raise ValueError(
+                f"The csv file {df_path} must have a column named 'caption'."
+            )
+
+        if self.image_to_video:
+            if pair_mode:
+                raise ValueError(
+                    f"The csv file {df_path} uses image_path+video_path pair columns, "
+                    "but image_to_video=true expects first-frame mode with a 'path' "
+                    "column only — set image_to_video: false in config for pair mode "
+                    "(see docs/runbooks/domain-adult-finetune.md)."
+                )
+            if not has_path:
+                raise ValueError(
+                    f"The csv file {df_path} must have a 'path' column when "
+                    "image_to_video=true (first-frame conditioning mode)."
+                )
+            return
+
+        if has_video_path and not has_image_path:
+            raise ValueError(
+                f"The csv file {df_path} pair mode requires an 'image_path' column."
+            )
+        if has_image_path and not has_video_path and not has_path:
+            raise ValueError(
+                f"The csv file {df_path} with image_path requires 'video_path' or "
+                "'path' when image_to_video=false."
+            )
+
     @staticmethod
     def check_df(df, df_path):
-        if (
-            "path" not in df.columns
-            and "video_path" not in df.columns
-            and "image_path" not in df.columns
-        ):
+        """Backward-compatible CSV validation without image_to_video context."""
+        columns = set(df.columns)
+        has_path = "path" in columns
+        has_video_path = "video_path" in columns
+        has_image_path = "image_path" in columns
+        if not (has_path or has_video_path or has_image_path):
             raise ValueError(f"The csv file {df_path} must have a column named 'path'.")
-        elif "caption" not in df.columns:
+        if "caption" not in columns:
             raise ValueError(
                 f"The csv file {df_path} must have a column named 'caption'."
             )
+        if has_video_path and not has_image_path:
+            raise ValueError(
+                f"The csv file {df_path} pair mode requires an 'image_path' column."
+            )
 
 
 if __name__ == "__main__":
diff --git a/videotuna/data/datasets_utils.py b/videotuna/data/datasets_utils.py
index 3887bb21..a476af3e 100644
--- a/videotuna/data/datasets_utils.py
+++ b/videotuna/data/datasets_utils.py
@@ -1,14 +1,12 @@
 import cv2
-import decord
 import numpy as np
 import torch
-import torchvision.transforms as transforms
-from decord import VideoReader, cpu
-from einops import rearrange
 from PIL import Image
 from torchvision.io import write_video
 from torchvision.utils import save_image
 
+from videotuna.utils.video_io import get_video_fps, read_video_frames
+
 IMG_EXTS = {"jpg", "bmp", "png", "jpeg", "rgb", "tif"}
 VIDEO_EXTS = {"mp4", "avi", "mov", "flv", "mkv", "webm", "wmv", "mov"}
 
@@ -66,18 +64,18 @@ def center_crop_arr(pil_image, image_size):
     )
 
 
-def read_video(video_path, fps=False):
-    decord.bridge.set_bridge("torch")
-    video = VideoReader(video_path, ctx=cpu(0))
-    video_len = len(video)
-    indexes = range(0, video_len)
-    vframes = video.get_batch(indexes)
-    vframes = rearrange(vframes, "t h w c -> t c h w")
+def read_video(video_path, fps=False, indices=None):
+    from videotuna.utils.video_io import get_video_frame_count
 
-    if fps:
-        return vframes, video.get_avg_fps()
+    if indices is not None:
+        vframes = read_video_frames(video_path, indices)
     else:
-        return vframes
+        total = get_video_frame_count(video_path)
+        vframes = read_video_frames(video_path, range(total))
+
+    if fps:
+        return vframes, get_video_fps(video_path)
+    return vframes
 
 
 def read_video_meta(video_path):
@@ -112,4 +110,4 @@ def is_video(path):
 
 
 def is_image(path):
-    return path.split(".")[-1] in IMG_EXTS
\ No newline at end of file
+    return path.split(".")[-1] in IMG_EXTS
diff --git a/videotuna/data/lightningdata.py b/videotuna/data/lightningdata.py
index 8e5d9c48..df63afb0 100644
--- a/videotuna/data/lightningdata.py
+++ b/videotuna/data/lightningdata.py
@@ -1,19 +1,14 @@
-import argparse
-import glob
-import os
-import sys
-from functools import partial
 from abc import abstractmethod
+from functools import partial
 
 import numpy as np
 import pytorch_lightning as pl
 import torch
 from torch.utils.data import DataLoader, Dataset, IterableDataset
 
-os.chdir(sys.path[0])
-sys.path.append("..")
-
 from videotuna.utils.common_utils import instantiate_from_config
+from videotuna.utils.video_io import init_video_worker
+
 
 class Txt2ImgIterableBaseDataset(IterableDataset):
     """
@@ -35,9 +30,11 @@ def __len__(self):
     @abstractmethod
     def __iter__(self):
         pass
-     
+
+
 def worker_init_fn(_):
     worker_info = torch.utils.data.get_worker_info()
+    init_video_worker()
 
     dataset = worker_info.dataset
     worker_id = worker_info.id
@@ -67,6 +64,12 @@ def __getitem__(self, idx):
         return self.data[idx]
 
 
+def _default_pin_memory(pin_memory):
+    if pin_memory is not None:
+        return pin_memory
+    return torch.cuda.is_available()
+
+
 class DataModuleFromConfig(pl.LightningDataModule):
     def __init__(
         self,
@@ -83,11 +86,22 @@ def __init__(
         img_loader=None,
         train_img=None,
         test_max_n_samples=None,
+        pin_memory=None,
+        persistent_workers=None,
+        prefetch_factor=2,
+        drop_last=False,
     ):
         super().__init__()
         self.batch_size = batch_size
         self.dataset_configs = dict()
-        self.num_workers = num_workers if num_workers is not None else batch_size * 2
+        self.num_workers = 4 if num_workers is None else num_workers
+        self.pin_memory = _default_pin_memory(pin_memory)
+        if persistent_workers is None:
+            self.persistent_workers = self.num_workers > 0
+        else:
+            self.persistent_workers = persistent_workers and self.num_workers > 0
+        self.prefetch_factor = prefetch_factor if self.num_workers > 0 else None
+        self.drop_last = drop_last
         self.use_worker_init_fn = use_worker_init_fn
         if train is not None:
             self.dataset_configs["train"] = train
@@ -141,60 +155,55 @@ def setup(self, stage=None):
             for k in self.datasets:
                 self.datasets[k] = WrappedDataset(self.datasets[k])
 
+    def _resolve_worker_init_fn(self, dataset):
+        if isinstance(dataset, Txt2ImgIterableBaseDataset) or self.use_worker_init_fn:
+            return worker_init_fn
+        if self.num_workers > 0:
+            return worker_init_fn
+        return None
+
+    def _build_dataloader(self, dataset, shuffle=False):
+        loader_kwargs = dict(
+            dataset=dataset,
+            batch_size=self.batch_size,
+            num_workers=self.num_workers,
+            shuffle=shuffle,
+            worker_init_fn=self._resolve_worker_init_fn(dataset),
+            collate_fn=self.collate_fn,
+            pin_memory=self.pin_memory,
+            drop_last=self.drop_last,
+        )
+        if self.num_workers > 0:
+            loader_kwargs["persistent_workers"] = self.persistent_workers
+            if self.prefetch_factor is not None:
+                loader_kwargs["prefetch_factor"] = self.prefetch_factor
+        return DataLoader(**loader_kwargs)
+
     def _train_dataloader(self):
         is_iterable_dataset = isinstance(
             self.datasets["train"], Txt2ImgIterableBaseDataset
         )
-        if is_iterable_dataset or self.use_worker_init_fn:
-            init_fn = worker_init_fn
-        else:
-            init_fn = None
-        loader = DataLoader(
+        loader = self._build_dataloader(
             self.datasets["train"],
-            batch_size=self.batch_size,
-            num_workers=self.num_workers,
             shuffle=False if is_iterable_dataset else True,
-            worker_init_fn=init_fn,
-            collate_fn=self.collate_fn,
         )
         if self.img_loader is not None:
             return {"loader_video": loader, "loader_img": self.img_loader}
-        else:
-            return loader
+        return loader
 
     def _val_dataloader(self, shuffle=False):
-        if (
-            isinstance(self.datasets["validation"], Txt2ImgIterableBaseDataset)
-            or self.use_worker_init_fn
-        ):
-            init_fn = worker_init_fn
-        else:
-            init_fn = None
-        return DataLoader(
-            self.datasets["validation"],
-            batch_size=self.batch_size,
-            num_workers=self.num_workers,
-            worker_init_fn=init_fn,
-            shuffle=shuffle,
-            collate_fn=self.collate_fn,
-        )
+        return self._build_dataloader(self.datasets["validation"], shuffle=shuffle)
 
     def _test_dataloader(self, shuffle=False):
         try:
             is_iterable_dataset = isinstance(
                 self.datasets["train"], Txt2ImgIterableBaseDataset
             )
-        except:
+        except Exception:
             is_iterable_dataset = isinstance(
                 self.datasets["test"], Txt2ImgIterableBaseDataset
             )
 
-        if is_iterable_dataset or self.use_worker_init_fn:
-            init_fn = worker_init_fn
-        else:
-            init_fn = None
-
-        # do not shuffle dataloader for iterable dataset
         shuffle = shuffle and (not is_iterable_dataset)
         if self.test_max_n_samples is not None:
             dataset = torch.utils.data.Subset(
@@ -202,27 +211,7 @@ def _test_dataloader(self, shuffle=False):
             )
         else:
             dataset = self.datasets["test"]
-        return DataLoader(
-            dataset,
-            batch_size=self.batch_size,
-            num_workers=self.num_workers,
-            worker_init_fn=init_fn,
-            shuffle=shuffle,
-            collate_fn=self.collate_fn,
-        )
+        return self._build_dataloader(dataset, shuffle=shuffle)
 
     def _predict_dataloader(self, shuffle=False):
-        if (
-            isinstance(self.datasets["predict"], Txt2ImgIterableBaseDataset)
-            or self.use_worker_init_fn
-        ):
-            init_fn = worker_init_fn
-        else:
-            init_fn = None
-        return DataLoader(
-            self.datasets["predict"],
-            batch_size=self.batch_size,
-            num_workers=self.num_workers,
-            worker_init_fn=init_fn,
-            collate_fn=self.collate_fn,
-        )
+        return self._build_dataloader(self.datasets["predict"], shuffle=shuffle)
diff --git a/videotuna/data/transforms.py b/videotuna/data/transforms.py
index ce6832a5..a37dc200 100644
--- a/videotuna/data/transforms.py
+++ b/videotuna/data/transforms.py
@@ -18,16 +18,14 @@
 import numbers
 import random
 
-import decord
 import numpy as np
 import torch
 import torch.nn.functional as F
 import torchvision.transforms as torch_transforms
-from decord import VideoReader, cpu
-from einops import rearrange
 from PIL import Image
 from torchvision.datasets.folder import pil_loader
-from torchvision.io import write_video
+
+from videotuna.utils.video_io import get_video_frame_count, read_video_frames
 
 from .datasets_utils import IMG_EXTS, VIDEO_EXTS
 
@@ -215,18 +213,26 @@ def hflip(clip):
     return clip.flip(-1)
 
 
-def get_transforms_video(resolution=(256, 256), num_frames=16, frame_interval=1):
-    transform_video = torch_transforms.Compose(
-        [
-            TemporalRandomCrop(num_frames, frame_interval),
-            ToTensorVideo(),  # TCHW
-            RandomHorizontalFlipVideo(),
-            ResizeCenterCropVideo(resolution),
-            torch_transforms.Normalize(
-                mean=[0.5, 0.5, 0.5], std=[0.5, 0.5, 0.5], inplace=True
-            ),
-        ]
-    )
+def get_transforms_video(
+    resolution=(256, 256),
+    num_frames=16,
+    frame_interval=1,
+    temporal_crop: bool = True,
+):
+    spatial_transforms = [
+        ToTensorVideo(),  # TCHW
+        RandomHorizontalFlipVideo(),
+        ResizeCenterCropVideo(resolution),
+        torch_transforms.Normalize(
+            mean=[0.5, 0.5, 0.5], std=[0.5, 0.5, 0.5], inplace=True
+        ),
+    ]
+    if temporal_crop:
+        transform_video = torch_transforms.Compose(
+            [TemporalRandomCrop(num_frames, frame_interval)] + spatial_transforms
+        )
+    else:
+        transform_video = torch_transforms.Compose(spatial_transforms)
     return transform_video
 
 
@@ -271,7 +277,8 @@ def get_params(self, clip):
 
         if h < th or w < tw:
             raise ValueError(
-                f"Required crop size {(th, tw)} is larger than input image size {(h, w)}"
+                f"Required crop size {(th, tw)} is larger than "
+                f"input image size {(h, w)}"
             )
 
         if w == tw and h == th:
@@ -325,7 +332,10 @@ def __call__(self, clip):
         return clip_center_crop_resize
 
     def __repr__(self) -> str:
-        return f"{self.__class__.__name__}(size={self.size}, interpolation_mode={self.interpolation_mode}"
+        return (
+            f"{self.__class__.__name__}(size={self.size}, "
+            f"interpolation_mode={self.interpolation_mode}"
+        )
 
 
 class ResizeCenterCropVideo:
@@ -402,7 +412,10 @@ def center_crop(self, clip):
         return cropped_clip
 
     def __repr__(self) -> str:
-        return f"{self.__class__.__name__}(size={self.size}, interpolation_mode={self.interpolation_mode}"
+        return (
+            f"{self.__class__.__name__}(size={self.size}, "
+            f"interpolation_mode={self.interpolation_mode}"
+        )
 
 
 class UCFCenterCropVideo:
@@ -450,12 +463,16 @@ def __call__(self, clip):
         return clip_center_crop
 
     def __repr__(self) -> str:
-        return f"{self.__class__.__name__}(size={self.size}, interpolation_mode={self.interpolation_mode}"
+        return (
+            f"{self.__class__.__name__}(size={self.size}, "
+            f"interpolation_mode={self.interpolation_mode}"
+        )
 
 
 class KineticsRandomCropResizeVideo:
     """
-    Slide along the long edge, with the short edge as crop size. And resie to the desired size.
+    Slide along the long edge, with the short edge as crop size. And resie to
+    the desired size.
     """
 
     def __init__(
@@ -509,7 +526,10 @@ def __call__(self, clip):
         return clip_center_crop
 
     def __repr__(self) -> str:
-        return f"{self.__class__.__name__}(size={self.size}, interpolation_mode={self.interpolation_mode}"
+        return (
+            f"{self.__class__.__name__}(size={self.size}, "
+            f"interpolation_mode={self.interpolation_mode}"
+        )
 
 
 class NormalizeVideo:
@@ -534,7 +554,10 @@ def __call__(self, clip):
         return normalize(clip, self.mean, self.std, self.inplace)
 
     def __repr__(self) -> str:
-        return f"{self.__class__.__name__}(mean={self.mean}, std={self.std}, inplace={self.inplace})"
+        return (
+            f"{self.__class__.__name__}(mean={self.mean}, std={self.std}, "
+            f"inplace={self.inplace})"
+        )
 
 
 class ToTensorVideo:
@@ -600,18 +623,13 @@ def __init__(self, num_frames, frame_interval):
         self.sample_length = num_frames * frame_interval
 
     def __call__(self, frames):
+        from videotuna.utils.video_io import sample_frame_indices
+
         total_frames = len(frames)
-        rand_end = max(0, total_frames - self.sample_length - 1)
-        begin_index = random.randint(0, rand_end)
-        end_index = min(begin_index + self.sample_length, total_frames)
-        assert (
-            end_index - begin_index >= self.num_frames
-        ), f"The video has not enough frames. Current frames: {len(vframes)}"
-        frame_indice = np.linspace(
-            begin_index, end_index - 1, self.num_frames, dtype=int
+        frame_indice = sample_frame_indices(
+            total_frames, self.num_frames, self.frame_interval
         )
-        sample_frames = frames[frame_indice]
-        return sample_frames
+        return frames[frame_indice]
 
 
 class LoadDummyVideo:
@@ -640,13 +658,8 @@ def __init__(self):
 
     def __call__(self, video_path):
         assert video_path.split(".")[-1] in VIDEO_EXTS
-        decord.bridge.set_bridge("torch")
-        video = VideoReader(video_path, ctx=cpu(0))
-        video_len = len(video)
-        indexes = range(0, video_len)
-        vframes = video.get_batch(indexes)
-        vframes = rearrange(vframes, "t h w c -> t c h w")
-        return vframes
+        total = get_video_frame_count(video_path)
+        return read_video_frames(video_path, range(total))
 
 
 class CheckVideo:
@@ -660,8 +673,8 @@ def __init__(self, resolution=(256, 256), frame_interval=1, num_frames=32):
 
     def __call__(self, vframes, index):
         length = vframes.shape[0]  # [F, C, H, W]
-        h = vframes.shape[2]
-        w = vframes.shape[3]
+        vframes.shape[2]
+        vframes.shape[3]
         if length < self.frame_limit:
             raise ValueError(
                 f"The video has not enough frames. Current frames: {length}"
diff --git a/videotuna/flow/diffusers_video.py b/videotuna/flow/diffusers_video.py
new file mode 100644
index 00000000..6222582c
--- /dev/null
+++ b/videotuna/flow/diffusers_video.py
@@ -0,0 +1,420 @@
+"""Unified Diffusers pipeline flow for Flux T2I and Wan 2.2 T2V / I2V."""
+
+from __future__ import annotations
+
+import os
+from pathlib import Path
+from typing import Any, Dict, List, Optional, Tuple, Union
+
+import torch
+from diffusers import FluxPipeline, WanImageToVideoPipeline, WanPipeline
+from diffusers.utils import export_to_video
+from omegaconf import DictConfig
+
+from videotuna.base.generation_base import GenerationBase
+from videotuna.settings import get_settings
+from videotuna.utils.common_utils import monitor_resources
+from videotuna.utils.device_utils import resolve_inference_device
+from videotuna.utils.diffusers_optimizations import (
+    apply_diffusers_optimizations,
+    transformer_cache_context,
+)
+from videotuna.utils.diffusers_quantization import (
+    build_pipeline_quantization_config,
+    normalize_quant_backend,
+    normalize_transformer_quant,
+    resolve_quant_components,
+)
+from videotuna.utils.logging_config import bound_logger, resolve_device_label
+from videotuna.utils.wan_lora_bridge import (
+    apply_native_wan_lora_to_pipeline,
+    is_native_wan_lora_ckpt,
+)
+
+WAN_DEFAULT_NEGATIVE_PROMPT = (
+    "Bright tones, overexposed, static, blurred details, subtitles, style, works, "
+    "paintings, images, static, overall gray, worst quality, low quality, JPEG "
+    "compression residue, ugly, incomplete, extra fingers, poorly drawn hands, "
+    "poorly drawn faces, deformed, disfigured, misshapen limbs, fused fingers, "
+    "still picture, messy background, three legs, many people in the background, "
+    "walking backwards"
+)
+
+FLUX_VARIANTS = {
+    "1-dev": "black-forest-labs/FLUX.1-dev",
+    "1-schnell": "black-forest-labs/FLUX.1-schnell",
+    "dev": "black-forest-labs/FLUX.1-dev",
+    "schnell": "black-forest-labs/FLUX.1-schnell",
+}
+
+WAN_T2V_VARIANTS = {
+    "2.1": "Wan-AI/Wan2.1-T2V-14B-Diffusers",
+    "2.2": "Wan-AI/Wan2.2-T2V-A14B-Diffusers",
+}
+
+WAN_I2V_VARIANTS = {
+    "2.1": "Wan-AI/Wan2.1-I2V-14B-480P-Diffusers",
+    "2.2": "Wan-AI/Wan2.2-I2V-A14B-Diffusers",
+}
+
+MODEL_REGISTRY: Dict[Tuple[str, str], Dict[str, Any]] = {
+    ("flux", "t2i"): {
+        "pipeline_cls": FluxPipeline,
+        "default_id": "black-forest-labs/FLUX.1-dev",
+        "variants": FLUX_VARIANTS,
+    },
+    ("wan", "t2v"): {
+        "pipeline_cls": WanPipeline,
+        "default_id": "Wan-AI/Wan2.2-T2V-A14B-Diffusers",
+        "variants": WAN_T2V_VARIANTS,
+        "export_fps": 16,
+        "negative_prompt": WAN_DEFAULT_NEGATIVE_PROMPT,
+    },
+    ("wan", "i2v"): {
+        "pipeline_cls": WanImageToVideoPipeline,
+        "default_id": "Wan-AI/Wan2.2-I2V-A14B-Diffusers",
+        "variants": WAN_I2V_VARIANTS,
+        "export_fps": 16,
+        "negative_prompt": WAN_DEFAULT_NEGATIVE_PROMPT,
+    },
+}
+
+
+def resolve_model_id(
+    model_family: str,
+    mode: str,
+    pretrained_model_name_or_path: Optional[str],
+    model_variant: Optional[str] = None,
+) -> str:
+    key = (model_family.lower(), mode.lower())
+    if key not in MODEL_REGISTRY:
+        raise ValueError(f"Unsupported diffusers model: {model_family}/{mode}")
+    entry = MODEL_REGISTRY[key]
+    if pretrained_model_name_or_path:
+        return pretrained_model_name_or_path
+    variants = entry.get("variants")
+    if variants and model_variant:
+        return variants.get(model_variant, entry["default_id"])
+    return entry["default_id"]
+
+
+def resolve_torch_dtype(dtype_flag: Optional[str]) -> torch.dtype:
+    if dtype_flag in ("fp16", "float16"):
+        return torch.float16
+    return torch.bfloat16
+
+
+class DiffusersVideoFlow(GenerationBase):
+    """Diffusers-native inference for Flux T2I and Wan 2.2 T2V / I2V."""
+
+    def __init__(
+        self,
+        model_family: str,
+        mode: str,
+        pretrained_model_name_or_path: Optional[str] = None,
+        pipeline_only: bool = True,
+        model_variant: Optional[str] = None,
+        lora_rank: int = 128,
+        lora_weight_name: str = "pytorch_lora_weights.safetensors",
+        fuse_qkv: bool = False,
+        enable_attention_cache: bool = False,
+        **kwargs,
+    ):
+        super().__init__(pipeline_only=True)
+        self.model_family = model_family.lower()
+        self.mode = mode.lower()
+        self.pretrained_model_name_or_path = pretrained_model_name_or_path
+        self.model_variant = model_variant
+        self.lora_rank = lora_rank
+        self.lora_weight_name = lora_weight_name
+        self.fuse_qkv = fuse_qkv
+        self.enable_attention_cache = enable_attention_cache
+        self._model_id: Optional[str] = None
+        self._lora_path: Optional[str] = None
+        self._dtype = torch.bfloat16
+        self._inference_device: Optional[str] = None
+        self._transformer_quant = "none"
+        self._quant_backend = "torchao"
+        self._log = bound_logger(phase=self.mode, flow="diffusers_video")
+
+    def from_pretrained(
+        self,
+        ckpt_path: Optional[Union[str, Path]] = None,
+        denoiser_ckpt_path: Optional[Union[str, Path]] = None,
+        lora_ckpt_path: Optional[Union[str, Path]] = None,
+        ignore_missing_ckpts: bool = False,
+        device: Optional[str] = None,
+        **kwargs,
+    ) -> None:
+        ckpt_str = str(ckpt_path) if ckpt_path is not None else None
+        self._model_id = resolve_model_id(
+            self.model_family,
+            self.mode,
+            ckpt_str or self.pretrained_model_name_or_path,
+            self.model_variant,
+        )
+        if lora_ckpt_path is not None:
+            self._lora_path = str(lora_ckpt_path)
+        elif denoiser_ckpt_path is not None and self.model_family == "wan":
+            self._lora_path = str(denoiser_ckpt_path)
+        self._inference_device = device
+        self._log.info(
+            "DiffusersVideoFlow: model_id={} family={} mode={} lora={}",
+            self._model_id,
+            self.model_family,
+            self.mode,
+            self._lora_path,
+        )
+
+    def enable_vram_management(self):
+        """No-op; optimizations are applied in inference() from CLI flags."""
+
+    def eval(self) -> DiffusersVideoFlow:
+        if self.pipeline is not None:
+            self.pipeline.set_progress_bar_config(disable=False)
+        return self
+
+    def _require_pipeline(self) -> Any:
+        assert self.pipeline is not None, "Pipeline is not loaded"
+        return self.pipeline
+
+    def _load_pipeline(self, dtype: torch.dtype) -> None:
+        key = (self.model_family, self.mode)
+        entry = MODEL_REGISTRY[key]
+        pipeline_cls = entry["pipeline_cls"]
+        load_kwargs: Dict[str, Any] = {"torch_dtype": dtype}
+        quant = normalize_transformer_quant(self._transformer_quant)
+        if quant != "none":
+            components = resolve_quant_components(
+                self.model_family,
+                self.model_variant,
+                self.mode,
+            )
+            quant_config = build_pipeline_quantization_config(
+                transformer_quant=quant,
+                quant_backend=normalize_quant_backend(self._quant_backend),
+                components=components,
+            )
+            if quant_config is not None:
+                load_kwargs["quantization_config"] = quant_config
+        self.pipeline = pipeline_cls.from_pretrained(self._model_id, **load_kwargs)
+        self._load_lora_weights()
+
+    def _load_lora_weights(self) -> None:
+        if not self._lora_path:
+            return
+        pipeline = self._require_pipeline()
+        if self.model_family == "flux":
+            pipeline.load_lora_weights(self._lora_path)
+            self._log.info("Loaded Flux LoRA weights from {}", self._lora_path)
+            return
+        if self.model_family == "wan":
+            if is_native_wan_lora_ckpt(self._lora_path):
+                if self.mode == "i2v":
+                    from videotuna.utils.wan_lora_bridge import (
+                        apply_native_wan_lora_to_i2v_pipeline,
+                    )
+
+                    reports = apply_native_wan_lora_to_i2v_pipeline(
+                        pipeline, self._lora_path
+                    )
+                else:
+                    reports = apply_native_wan_lora_to_pipeline(
+                        pipeline, self._lora_path
+                    )
+                self._log.info(
+                    "Applied native Wan 2.1 LoRA bridge from {} ({})",
+                    self._lora_path,
+                    [r.as_dict() for r in reports],
+                )
+                return
+            pipeline.load_lora_weights(self._lora_path)
+            self._log.info("Loaded Wan Diffusers LoRA from {}", self._lora_path)
+
+    def _resolve_inputs(
+        self, args: DictConfig
+    ) -> Tuple[List[str], List[Optional[str]]]:
+        if self.mode in ("t2v", "t2i"):
+            prompts = self.load_inference_inputs(args.prompt_file, "t2v")
+            return prompts, [None] * len(prompts)
+        if self.mode == "i2v":
+            prompt_dir = getattr(args, "prompt_dir", None)
+            if not prompt_dir:
+                raise ValueError(
+                    "I2V inference requires --prompt_dir with paired images"
+                )
+            prompts, image_paths = self.load_inference_inputs(prompt_dir, "i2v")
+            return prompts, image_paths
+        raise ValueError(f"Unsupported mode: {self.mode}")
+
+    @torch.inference_mode()
+    def inference(self, args: DictConfig) -> Dict[str, Any]:
+        os.makedirs(args.savedir, exist_ok=True)
+        if getattr(args, "lora_rank", None):
+            self.lora_rank = int(args.lora_rank)
+        if getattr(args, "lorackpt", None):
+            self._lora_path = args.lorackpt
+        if getattr(args, "trained_ckpt", None) and self.model_family == "wan":
+            self._lora_path = args.trained_ckpt
+        self._transformer_quant = normalize_transformer_quant(
+            getattr(args, "transformer_quant", None)
+        )
+        self._quant_backend = normalize_quant_backend(
+            getattr(args, "quant_backend", None)
+        )
+        self._dtype = resolve_torch_dtype(getattr(args, "dtype", None))
+        inference_device = resolve_inference_device(
+            getattr(args, "device", None) or self._inference_device
+        )
+        self._log = self._log.bind(device=resolve_device_label(inference_device))
+        if self.pipeline is None:
+            self._load_pipeline(self._dtype)
+        pipeline = self._require_pipeline()
+
+        if not hasattr(args, "fuse_qkv"):
+            args.fuse_qkv = self.fuse_qkv
+        if not hasattr(args, "enable_attention_cache"):
+            args.enable_attention_cache = self.enable_attention_cache
+
+        apply_diffusers_optimizations(
+            pipeline,
+            args,
+            model_family=self.model_family,
+            disable_progress_bar=False,
+            device=inference_device,
+        )
+
+        prompts, media_paths = self._resolve_inputs(args)
+        num_steps = int(
+            getattr(args, "num_inference_steps", None)
+            or getattr(args, "ddim_steps", 50)
+            or 50
+        )
+        guidance = float(
+            getattr(args, "unconditional_guidance_scale", None)
+            or getattr(args, "guidance_scale", 6.0)
+            or 6.0
+        )
+        seed = int(getattr(args, "seed", 42) or 42)
+        frames = int(getattr(args, "frames", 49) or 49)
+        height = getattr(args, "height", None)
+        width = getattr(args, "width", None)
+        n_samples = int(getattr(args, "n_samples_prompt", 1) or 1)
+
+        per_sample: List[Dict[str, Any]] = []
+        gpu_metrics: List[float] = []
+        time_metrics: List[float] = []
+
+        for idx, (prompt, media_path) in enumerate(zip(prompts, media_paths)):
+            for sample_idx in range(n_samples):
+                sample_seed = seed + idx * n_samples + sample_idx
+                result = self._generate_sample(
+                    prompt=prompt,
+                    num_steps=num_steps,
+                    guidance=guidance,
+                    seed=sample_seed,
+                    frames=frames,
+                    height=height,
+                    width=width,
+                    args=args,
+                    image_path=media_path,
+                )
+                per_sample.append(result)
+                gpu_metrics.append(result.get("peak_vram_gb", -1.0))
+                time_metrics.append(result.get("wall_time_s", -1.0))
+                self._save_output(
+                    result["result"],
+                    args,
+                    prompt,
+                    idx,
+                    sample_idx,
+                )
+
+        if get_settings().metrics_owner == "flow":
+            self.save_metrics(
+                gpu=gpu_metrics,
+                time=time_metrics,
+                config=args,
+                savedir=args.savedir,
+                frames=frames if self.mode != "t2i" else 1,
+            )
+        return {"per_sample": per_sample, "gpu": gpu_metrics, "time": time_metrics}
+
+    @monitor_resources(return_metrics=True)
+    def _generate_sample(
+        self,
+        prompt: str,
+        num_steps: int,
+        guidance: float,
+        seed: int,
+        frames: int,
+        height: Optional[int],
+        width: Optional[int],
+        args: DictConfig,
+        image_path: Optional[str] = None,
+    ) -> Any:
+        from PIL import Image
+
+        generator = torch.Generator().manual_seed(seed)
+        pipe_kwargs: Dict[str, Any] = {
+            "prompt": prompt,
+            "num_inference_steps": num_steps,
+            "generator": generator,
+        }
+
+        entry = MODEL_REGISTRY[(self.model_family, self.mode)]
+        pipeline = self._require_pipeline()
+
+        with transformer_cache_context(pipeline):
+            if self.model_family == "flux":
+                pipe_kwargs.update(
+                    guidance_scale=guidance,
+                    height=height or 768,
+                    width=width or 1360,
+                    max_sequence_length=256,
+                )
+                output = pipeline(**pipe_kwargs).images[0]
+            elif self.model_family == "wan":
+                pipe_kwargs.update(
+                    num_frames=frames,
+                    guidance_scale=guidance,
+                )
+                if height is not None:
+                    pipe_kwargs["height"] = height
+                if width is not None:
+                    pipe_kwargs["width"] = width
+                neg = getattr(args, "uncond_prompt", None) or entry.get(
+                    "negative_prompt"
+                )
+                if neg:
+                    pipe_kwargs["negative_prompt"] = neg
+                if self.mode == "i2v":
+                    if not image_path:
+                        raise ValueError("I2V generation requires an image path")
+                    pipe_kwargs["image"] = Image.open(image_path).convert("RGB")
+                output = pipeline(**pipe_kwargs).frames[0]
+            else:
+                raise ValueError(f"Unknown model family: {self.model_family}")
+
+        return output
+
+    def _save_output(
+        self,
+        output: Any,
+        args: DictConfig,
+        prompt: str,
+        idx: int,
+        sample_idx: int,
+    ) -> None:
+        entry = MODEL_REGISTRY[(self.model_family, self.mode)]
+        safe_prompt = prompt[:80].replace("/", "_").replace(" ", "_")
+        if self.mode == "t2i":
+            filename = f"{idx:03d}_{sample_idx:02d}_{safe_prompt}.jpg"
+            out_path = os.path.join(args.savedir, filename)
+            output.save(out_path)
+            return
+
+        fps = int(getattr(args, "savefps", None) or entry.get("export_fps", 8))
+        filename = f"{idx:03d}_{sample_idx:02d}_{safe_prompt}.mp4"
+        out_path = os.path.join(args.savedir, filename)
+        export_to_video(output, out_path, fps=fps)
diff --git a/videotuna/flow/hunyuanvideo.py b/videotuna/flow/hunyuanvideo.py
deleted file mode 100644
index a6366bd5..00000000
--- a/videotuna/flow/hunyuanvideo.py
+++ /dev/null
@@ -1,772 +0,0 @@
-import os
-import time
-import random
-import functools
-from typing import List, Optional, Tuple, Union, Dict, Any
-from omegaconf import DictConfig
-
-from pathlib import Path
-from loguru import logger
-
-import torch
-import torch.distributed as dist
-from videotuna.models.hunyuan.hyvideo_i2v.constants import PROMPT_TEMPLATE, NEGATIVE_PROMPT, PRECISION_TO_TYPE, NEGATIVE_PROMPT_I2V
-from videotuna.models.hunyuan.hyvideo_i2v.vae.autoencoder_kl_causal_3d import AutoencoderKLCausal3DWrapper
-from videotuna.models.hunyuan.hyvideo_i2v.modules.models import HYVideoDiffusionTransformerWrapper
-from videotuna.models.hunyuan.hyvideo_i2v.text_encoder import TextEncoder, TextEncoderWrapper
-from videotuna.models.hunyuan.hyvideo_i2v.utils.data_utils import align_to, get_closest_ratio, generate_crop_size_list
-from videotuna.models.hunyuan.hyvideo_i2v.utils.lora_utils import load_lora_for_pipeline
-from videotuna.models.hunyuan.hyvideo_i2v.modules.posemb_layers import get_nd_rotary_pos_embed
-from videotuna.models.hunyuan.hyvideo_i2v.modules.fp8_optimization import convert_fp8_linear
-from videotuna.models.hunyuan.hyvideo_i2v.diffusion.schedulers import FlowMatchDiscreteScheduler
-from videotuna.models.hunyuan.hyvideo_i2v.diffusion.pipelines import HunyuanVideoPipeline
-from videotuna.models.hunyuan.hyvideo_i2v.utils.file_utils import save_videos_grid
-from videotuna.base.generation_base import GenerationBase
-from videotuna.utils.common_utils import monitor_resources
-import torchvision.transforms as transforms
-from PIL import Image
-import numpy as np
-from safetensors.torch import load_file
-
-try:
-    import xfuser
-    from xfuser.core.distributed import (
-        get_sequence_parallel_world_size,
-        get_sequence_parallel_rank,
-        get_sp_group,
-        initialize_model_parallel,
-        init_distributed_environment
-    )
-except:
-    xfuser = None
-    get_sequence_parallel_world_size = None
-    get_sequence_parallel_rank = None
-    get_sp_group = None
-    initialize_model_parallel = None
-    init_distributed_environment = None
-
-
-###############################################
-# 20250308 pftq: Riflex workaround to fix 192-frame-limit bug, credit to Kijai for finding it in ComfyUI and thu-ml for making it
-# https://github.com/thu-ml/RIFLEx/blob/main/riflex_utils.py
-from diffusers.models.embeddings import get_1d_rotary_pos_embed
-import numpy as np
-from typing import Union,Optional
-def get_1d_rotary_pos_embed_riflex(
-    dim: int,
-    pos: Union[np.ndarray, int],
-    theta: float = 10000.0,
-    use_real=False,
-    k: Optional[int] = None,
-    L_test: Optional[int] = None,
-):
-    """
-    RIFLEx: Precompute the frequency tensor for complex exponentials (cis) with given dimensions.
-
-    This function calculates a frequency tensor with complex exponentials using the given dimension 'dim' and the end
-    index 'end'. The 'theta' parameter scales the frequencies. The returned tensor contains complex values in complex64
-    data type.
-
-    Args:
-        dim (`int`): Dimension of the frequency tensor.
-        pos (`np.ndarray` or `int`): Position indices for the frequency tensor. [S] or scalar
-        theta (`float`, *optional*, defaults to 10000.0):
-            Scaling factor for frequency computation. Defaults to 10000.0.
-        use_real (`bool`, *optional*):
-            If True, return real part and imaginary part separately. Otherwise, return complex numbers.
-        k (`int`, *optional*, defaults to None): the index for the intrinsic frequency in RoPE
-        L_test (`int`, *optional*, defaults to None): the number of frames for inference
-    Returns:
-        `torch.Tensor`: Precomputed frequency tensor with complex exponentials. [S, D/2]
-    """
-    assert dim % 2 == 0
-
-    if isinstance(pos, int):
-        pos = torch.arange(pos)
-    if isinstance(pos, np.ndarray):
-        pos = torch.from_numpy(pos)  # type: ignore  # [S]
-
-    freqs = 1.0 / (
-            theta ** (torch.arange(0, dim, 2, device=pos.device)[: (dim // 2)].float() / dim)
-    )  # [D/2]
-
-    # === Riflex modification start ===
-    # Reduce the intrinsic frequency to stay within a single period after extrapolation (see Eq. (8)).
-    # Empirical observations show that a few videos may exhibit repetition in the tail frames.
-    # To be conservative, we multiply by 0.9 to keep the extrapolated length below 90% of a single period.
-    if k is not None:
-        freqs[k-1] = 0.9 * 2 * torch.pi / L_test
-    # === Riflex modification end ===
-
-    freqs = torch.outer(pos, freqs)  # type: ignore   # [S, D/2]
-    if use_real:
-        freqs_cos = freqs.cos().repeat_interleave(2, dim=1).float()  # [S, D]
-        freqs_sin = freqs.sin().repeat_interleave(2, dim=1).float()  # [S, D]
-        return freqs_cos, freqs_sin
-    else:
-        # lumina
-        freqs_cis = torch.polar(torch.ones_like(freqs), freqs)  # complex64     # [S, D/2]
-        return freqs_cis
-
-
-###############################################
-
-def parallelize_transformer(pipe):
-    transformer = pipe.transformer
-    original_forward = transformer.forward
-
-    @functools.wraps(transformer.__class__.forward)
-    def new_forward(
-        self,
-        x: torch.Tensor,
-        t: torch.Tensor,  # Should be in range(0, 1000).
-        text_states: torch.Tensor = None,
-        text_mask: torch.Tensor = None,  # Now we don't use it.
-        text_states_2: Optional[torch.Tensor] = None,  # Text embedding for modulation.
-        freqs_cos: Optional[torch.Tensor] = None,
-        freqs_sin: Optional[torch.Tensor] = None,
-        guidance: torch.Tensor = None,  # Guidance for modulation, should be cfg_scale x 1000.
-        return_dict: bool = True,
-    ):
-        if x.shape[-2] // 2 % get_sequence_parallel_world_size() == 0:
-            # try to split x by height
-            split_dim = -2
-        elif x.shape[-1] // 2 % get_sequence_parallel_world_size() == 0:
-            # try to split x by width
-            split_dim = -1
-        else:
-            raise ValueError(f"Cannot split video sequence into ulysses_degree x ring_degree ({get_sequence_parallel_world_size()}) parts evenly")
-
-        # patch sizes for the temporal, height, and width dimensions are 1, 2, and 2.
-        temporal_size, h, w = x.shape[2], x.shape[3] // 2, x.shape[4] // 2
-
-        x = torch.chunk(x, get_sequence_parallel_world_size(),dim=split_dim)[get_sequence_parallel_rank()]
-
-        dim_thw = freqs_cos.shape[-1]
-        freqs_cos = freqs_cos.reshape(temporal_size, h, w, dim_thw)
-        freqs_cos = torch.chunk(freqs_cos, get_sequence_parallel_world_size(),dim=split_dim - 1)[get_sequence_parallel_rank()]
-        freqs_cos = freqs_cos.reshape(-1, dim_thw)
-        dim_thw = freqs_sin.shape[-1]
-        freqs_sin = freqs_sin.reshape(temporal_size, h, w, dim_thw)
-        freqs_sin = torch.chunk(freqs_sin, get_sequence_parallel_world_size(),dim=split_dim - 1)[get_sequence_parallel_rank()]
-        freqs_sin = freqs_sin.reshape(-1, dim_thw)
-        
-        from xfuser.core.long_ctx_attention import xFuserLongContextAttention
-        
-        for block in transformer.double_blocks + transformer.single_blocks:
-            block.hybrid_seq_parallel_attn = xFuserLongContextAttention()
-
-        output = original_forward(
-            x,
-            t,
-            text_states,
-            text_mask,
-            text_states_2,
-            freqs_cos,
-            freqs_sin,
-            guidance,
-            return_dict,
-        )
-
-        return_dict = not isinstance(output, tuple)
-        sample = output["x"]
-        sample = get_sp_group().all_gather(sample, dim=split_dim)
-        output["x"] = sample
-        return output
-
-    new_forward = new_forward.__get__(transformer)
-    transformer.forward = new_forward
-
-
-class HunyuanVideoFlow(GenerationBase):
-    def __init__(
-        self,
-        first_stage_config: Dict[str, Any],
-        cond_stage_config: Dict[str, Any],
-        denoiser_config: Dict[str, Any],
-        scheduler_config: Optional[Dict[str, Any]] = None,
-        cond_stage_2_config: Optional[Dict[str, Any]] = None,
-        lora_config: Optional[Dict[str, Any]] = None,
-        use_cpu_offload=False,
-        device=0,
-        logger=None,
-        #parallel
-        ulysses_degree: int = 1,
-        ring_degree: int = 1,
-        use_fp8: bool = False,
-        #lora
-        use_lora: bool = False,
-        lora_path: str = '',
-        lora_scale: float = 1.0,
-        lora_rank: int = 64,
-        #path settings
-        ckpt_path: str = '',
-        dit_weight: str = '',
-        #vae
-        vae_type: str = '884-16c-hy',
-        vae_tiling: bool = True,
-        vae_precision: str = 'fp16',
-        #i2v settings
-        i2v_mode: bool = True,
-        i2v_condition_type: str = 'token_replace',
-        #model
-        rope_theta: int = 256,
-        precision: str = 'bf16',
-        disable_autocast: bool = False,
-        *args, **kwargs
-    ):
-        super().__init__(
-            first_stage_config=first_stage_config,
-            cond_stage_config=cond_stage_config,
-            denoiser_config=denoiser_config,
-            scheduler_config=scheduler_config,
-            cond_stage_2_config=cond_stage_2_config,
-            lora_config=lora_config,
-            trainable_components=[]
-        )
-        self.use_cpu_offload = use_cpu_offload
-        self.device_type = (
-            device
-            if device is not None
-            else "cuda"
-            if torch.cuda.is_available()
-            else "cpu"
-        )
-        self.vae_type = vae_type
-        self.vae_tiling = vae_tiling
-        self.vae_precision = vae_precision
-        self.precision = precision
-        self.disable_autocast = disable_autocast
-
-        #parallel
-        self.ulysses_degree = ulysses_degree
-        self.ring_degree = ring_degree
-        self.use_fp8 = use_fp8
-        #model !!!
-        self.dit_weight = dit_weight
-        self.ckpt_path = ckpt_path
-        self.rope_theta = rope_theta
-
-        #i2v setting
-        self.i2v_mode = i2v_mode
-        self.i2v_condition_type = i2v_condition_type
-        #lora config
-        self.use_lora = use_lora
-        self.lora_rank = lora_rank
-        self.lora_path = lora_path
-        self.lora_scale = lora_scale
-
-        text_encoder : TextEncoderWrapper = self.cond_stage_model
-        text_encoder_2 : TextEncoder = self.cond_stage_2_model
-        model: HYVideoDiffusionTransformerWrapper = self.denoiser
-        vae : AutoencoderKLCausal3DWrapper = self.first_stage_model
-        self.pipeline = HunyuanVideoPipeline(
-            vae=vae.vae,
-            text_encoder=text_encoder.text_encoder,
-            text_encoder_2=text_encoder_2,
-            transformer=model.model,
-            scheduler=self.scheduler,
-            progress_bar_config=logger,
-            precision=precision,
-            vae_precision=vae_precision,
-            disable_autocast=disable_autocast
-        )
-
-        if self.i2v_mode:
-            self.default_negative_prompt = NEGATIVE_PROMPT_I2V
-            if self.use_lora:
-                self.pipeline = load_lora_for_pipeline(
-                    self.pipeline, self.lora_path, LORA_PREFIX_TRANSFORMER="Hunyuan_video_I2V_lora", alpha=self.lora_scale,
-                    device=self.device_type,
-                    is_parallel=(self.ulysses_degree > 1 or self.ring_degree > 1))
-                logger.info(f"load lora {self.lora_path} into pipeline, lora scale is {self.lora_scale}.")
-        else:
-            self.default_negative_prompt = NEGATIVE_PROMPT
-
-    def from_pretrained(self,
-                        ckpt_path: Optional[Union[str, Path]] = None,
-                        denoiser_ckpt_path: Optional[Union[str, Path]] = None,
-                        lora_ckpt_path: Optional[Union[str, Path]] = None,
-                        ignore_missing_ckpts: bool = False,
-                        device: str = "cuda"):
-        """
-        Initialize the Inference pipeline.
-    
-        Args:
-            pretrained_model_path (str or pathlib.Path): The model path, including t2v, text encoder and vae checkpoints.
-            args (argparse.Namespace): The arguments for the pipeline.
-            device (int): The device for inference. Default is None.
-        """
-        logger.info(f"Got text-to-video model root path: {ckpt_path}")
-        
-        # ========================================================================
-        # Initialize Distributed Environment
-        # ========================================================================
-        # 20250316 pftq: Modified to extract rank and world_size early for sequential loading
-        if self.ulysses_degree > 1 or self.ring_degree > 1:
-            assert xfuser is not None, "Ulysses Attention and Ring Attention requires xfuser package."
-            assert self.use_cpu_offload is False, "Cannot enable use_cpu_offload in the distributed environment."
-            # 20250316 pftq: Set local rank and device explicitly for NCCL
-            local_rank = int(os.environ['LOCAL_RANK'])
-            device = torch.device(f"cuda:{local_rank}")
-            torch.cuda.set_device(local_rank)  # 20250316 pftq: Set CUDA device explicitly
-            dist.init_process_group("nccl")  # 20250316 pftq: Removed device_id, rely on set_device
-            rank = dist.get_rank()
-            world_size = dist.get_world_size()
-            assert world_size == self.ring_degree * self.ulysses_degree, \
-                "number of GPUs should be equal to ring_degree * ulysses_degree."
-            init_distributed_environment(rank=rank, world_size=world_size)
-            initialize_model_parallel(
-                sequence_parallel_degree=world_size,
-                ring_degree=self.ring_degree,
-                ulysses_degree=self.ulysses_degree,
-            )
-        else:
-            rank = 0  # 20250316 pftq: Default rank for single GPU
-            world_size = 1  # 20250316 pftq: Default world_size for single GPU
-            if device is None:
-                device = "cuda" if torch.cuda.is_available() else "cpu"
-    
-        torch.set_grad_enabled(False)
-    
-        # ========================================================================
-        # Build main model, VAE, and text encoder sequentially on rank 0
-        # ========================================================================
-        # 20250316 pftq: Load models only on rank 0, then broadcast
-        if rank == 0:
-            logger.info("Building model...")
-            model: HYVideoDiffusionTransformerWrapper = self.denoiser
-            self.denoiser.load_weight()
-            if self.use_fp8:
-                convert_fp8_linear(model, self.dit_weight, original_dtype=PRECISION_TO_TYPE[self.precision])
-            self.denoiser.eval()
-    
-            # VAE
-            vae : AutoencoderKLCausal3DWrapper = self.first_stage_model
-            vae.load_weight()
-            s_ratio = self.first_stage_model.vae.config.spatial_compression_ratio
-            t_ratio = self.first_stage_model.vae.config.time_compression_ratio
-            vae_kwargs = {"s_ratio": s_ratio, "t_ratio": t_ratio}
-            vae = self.first_stage_model
-
-            #encoder
-            text_encoder : TextEncoderWrapper = self.cond_stage_model
-            text_encoder_2 : TextEncoder = self.cond_stage_2_model
-        else:
-            # 20250316 pftq: Initialize as None on non-zero ranks
-            model = None
-            vae = None
-            vae_kwargs = None
-            text_encoder = None
-            text_encoder_2 = None
-    
-        # 20250316 pftq: Broadcast models to all ranks
-        if world_size > 1:
-            logger.info(f"Rank {rank}: Starting broadcast synchronization")
-            dist.barrier()  # Ensure rank 0 finishes loading before broadcasting
-            if rank != 0:
-                # Reconstruct model skeleton on non-zero ranks
-                self.denoiser : HYVideoDiffusionTransformerWrapper
-                self.denoiser.load_weight()
-                self.denoiser.eval()
-                model = self.denoiser
-
-                # VAE
-                vae : AutoencoderKLCausal3DWrapper = self.first_stage_model
-                vae.load_weight()
-                s_ratio = self.first_stage_model.vae.config.spatial_compression_ratio
-                t_ratio = self.first_stage_model.vae.config.time_compression_ratio
-                vae_kwargs = {"s_ratio": s_ratio, "t_ratio": t_ratio}
-                vae = self.first_stage_model
-                vae = vae.to(device)
-
-                #encoder
-                text_encoder : TextEncoderWrapper = self.cond_stage_model.to(device)
-                text_encoder_2 : TextEncoder = self.cond_stage_2_model.to(device)
-                
-            # Broadcast model parameters with logging
-            logger.info(f"Rank {rank}: Broadcasting model parameters")
-            for param in model.parameters():
-                dist.broadcast(param.data, src=0)
-            model.eval()
-            logger.info(f"Rank {rank}: Broadcasting VAE parameters")
-            for param in vae.parameters():
-                dist.broadcast(param.data, src=0)
-            # 20250316 pftq: Use broadcast_object_list for vae_kwargs
-            logger.info(f"Rank {rank}: Broadcasting vae_kwargs")
-            vae_kwargs_list = [vae_kwargs] if rank == 0 else [None]
-            dist.broadcast_object_list(vae_kwargs_list, src=0)
-            vae_kwargs = vae_kwargs_list[0]
-            logger.info(f"Rank {rank}: Broadcasting text_encoder parameters")
-            for param in text_encoder.parameters():
-                dist.broadcast(param.data, src=0)
-            if text_encoder_2 is not None:
-                logger.info(f"Rank {rank}: Broadcasting text_encoder_2 parameters")
-                for param in text_encoder_2.parameters():
-                    dist.broadcast(param.data, src=0)
-
-        if self.use_cpu_offload:
-            self.pipeline.enable_sequential_cpu_offload()
-        else:
-            self.pipeline = self.pipeline.to(device)
-    
-        if self.ulysses_degree > 1 or self.ring_degree > 1:
-            parallelize_transformer(self.pipeline)
-        
-
-    @staticmethod
-    def parse_size(size):
-        if isinstance(size, int):
-            size = [size]
-        if not isinstance(size, (list, tuple)):
-            raise ValueError(f"Size must be an integer or (height, width), got {size}.")
-        if len(size) == 1:
-            size = [size[0], size[0]]
-        if len(size) != 2:
-            raise ValueError(f"Size must be an integer or (height, width), got {size}.")
-        return size
-
-    # 20250317 pftq: Modified to use Riflex when >192 frames
-    def get_rotary_pos_embed(self, video_length, height, width):
-        target_ndim = 3
-        ndim = 5 - 2  # B, C, F, H, W -> F, H, W
-        model = self.pipeline.transformer
-        # Compute latent sizes based on VAE type
-        if "884" in self.vae_type:
-            latents_size = [(video_length - 1) // 4 + 1, height // 8, width // 8]
-        elif "888" in self.vae_type:
-            latents_size = [(video_length - 1) // 8 + 1, height // 8, width // 8]
-        else:
-            latents_size = [video_length, height // 8, width // 8]
-    
-        # Compute rope sizes
-        if isinstance(model.patch_size, int):
-            assert all(s % model.patch_size == 0 for s in latents_size), (
-                f"Latent size(last {ndim} dimensions) should be divisible by patch size({model.patch_size}), "
-                f"but got {latents_size}."
-            )
-            rope_sizes = [s // model.patch_size for s in latents_size]
-        elif isinstance(model.patch_size, list):
-            assert all(
-                s % model.patch_size[idx] == 0
-                for idx, s in enumerate(latents_size)
-            ), (
-                f"Latent size(last {ndim} dimensions) should be divisible by patch size({model.patch_size}), "
-                f"but got {latents_size}."
-            )
-            rope_sizes = [s // model.patch_size[idx] for idx, s in enumerate(latents_size)]
-    
-        if len(rope_sizes) != target_ndim:
-            rope_sizes = [1] * (target_ndim - len(rope_sizes)) + rope_sizes  # Pad time axis
-    
-        # 20250316 pftq: Add RIFLEx logic for > 192 frames
-        L_test = rope_sizes[0]  # Latent frames
-        L_train = 25  # Training length from HunyuanVideo
-        actual_num_frames = video_length  # Use input video_length directly
-    
-        head_dim = model.hidden_size // model.heads_num
-        rope_dim_list = model.rope_dim_list or [head_dim // target_ndim for _ in range(target_ndim)]
-        assert sum(rope_dim_list) == head_dim, "sum(rope_dim_list) must equal head_dim"
-    
-        if actual_num_frames > 192:
-            k = 2+((actual_num_frames + 3) // (4 * L_train))
-            k = max(4, min(8, k))
-            logger.debug(f"actual_num_frames = {actual_num_frames} > 192, RIFLEx applied with k = {k}")
-    
-            # Compute positional grids for RIFLEx
-            axes_grids = [torch.arange(size, device=self.device_type, dtype=torch.float32) for size in rope_sizes]
-            grid = torch.meshgrid(*axes_grids, indexing="ij")
-            grid = torch.stack(grid, dim=0)  # [3, t, h, w]
-            pos = grid.reshape(3, -1).t()  # [t * h * w, 3]
-    
-            # Apply RIFLEx to temporal dimension
-            freqs = []
-            for i in range(3):
-                if i == 0:  # Temporal with RIFLEx
-                    freqs_cos, freqs_sin = get_1d_rotary_pos_embed_riflex(
-                        rope_dim_list[i],
-                        pos[:, i],
-                        theta=self.rope_theta,
-                        use_real=True,
-                        k=k,
-                        L_test=L_test
-                    )
-                else:  # Spatial with default RoPE
-                    freqs_cos, freqs_sin = get_1d_rotary_pos_embed_riflex(
-                        rope_dim_list[i],
-                        pos[:, i],
-                        theta=self.rope_theta,
-                        use_real=True,
-                        k=None,
-                        L_test=None
-                    )
-                freqs.append((freqs_cos, freqs_sin))
-                logger.debug(f"freq[{i}] shape: {freqs_cos.shape}, device: {freqs_cos.device}")
-    
-            freqs_cos = torch.cat([f[0] for f in freqs], dim=1)
-            freqs_sin = torch.cat([f[1] for f in freqs], dim=1)
-            logger.debug(f"freqs_cos shape: {freqs_cos.shape}, device: {freqs_cos.device}")
-        else:
-            # 20250316 pftq: Original code for <= 192 frames
-            logger.debug(f"actual_num_frames = {actual_num_frames} <= 192, using original RoPE")
-            freqs_cos, freqs_sin = get_nd_rotary_pos_embed(
-                rope_dim_list,
-                rope_sizes,
-                theta=self.rope_theta,
-                use_real=True,
-                theta_rescale_factor=1,
-            )
-            logger.debug(f"freqs_cos shape: {freqs_cos.shape}, device: {freqs_cos.device}")
-    
-        return freqs_cos, freqs_sin
-
-
-    @monitor_resources(return_metrics=True)
-    def single_inference(self, 
-                         prompt, 
-                         i2v_image_path, 
-                         target_video_length,
-                         generator,
-                         config : DictConfig):
-        height=config.height
-        width=config.width
-        video_length=config.frames
-        seed=config.seed
-        negative_prompt=config.uncond_prompt
-        infer_steps=config.num_inference_steps
-        guidance_scale=config.unconditional_guidance_scale
-        flow_shift=config.time_shift
-        embedded_guidance_scale=config.embedded_guidance_scale
-        batch_size=config.bs
-        num_videos_per_prompt=config.n_samples_prompt
-        i2v_mode=config.i2v_mode
-        i2v_resolution=config.i2v_resolution
-        i2v_condition_type=config.i2v_condition_type
-        i2v_stability=config.i2v_stability
-        ulysses_degree=config.ulysses_degree
-        ring_degree=config.ring_degree
-        xdit_adaptive_size=config.xdit_adaptive_size
-        if not isinstance(prompt, str):
-            raise TypeError(f"`prompt` must be a string, but got {type(prompt)}")
-        prompt = [prompt.strip()]
-
-        if negative_prompt is None or negative_prompt == "":
-            negative_prompt = self.default_negative_prompt
-        if guidance_scale == 1.0:
-            negative_prompt = ""
-        if not isinstance(negative_prompt, str):
-            raise TypeError(
-                f"`negative_prompt` must be a string, but got {type(negative_prompt)}"
-            )
-        negative_prompt = [negative_prompt.strip()]
-
-        img_latents = None
-        semantic_images = None
-        if i2v_mode:
-            if i2v_resolution == "720p":
-                bucket_hw_base_size = 960
-            elif i2v_resolution == "540p":
-                bucket_hw_base_size = 720
-            elif i2v_resolution == "360p":
-                bucket_hw_base_size = 480
-            else:
-                raise ValueError(f"i2v_resolution: {i2v_resolution} must be in [360p, 540p, 720p]")
-
-            semantic_images = [Image.open(i2v_image_path).convert('RGB')]
-            origin_size = semantic_images[0].size
-
-            crop_size_list = generate_crop_size_list(bucket_hw_base_size, 32)
-            aspect_ratios = np.array([round(float(h)/float(w), 5) for h, w in crop_size_list])
-            closest_size, closest_ratio = get_closest_ratio(origin_size[1], origin_size[0], aspect_ratios, crop_size_list)
-
-            if ulysses_degree != 1 or ring_degree != 1:
-                closest_size = (height, width)
-                resize_param = min(closest_size)
-                center_crop_param = closest_size
-
-                if xdit_adaptive_size:
-                    original_h, original_w = origin_size[1], origin_size[0]
-                    target_h, target_w = height, width
-
-                    scale_w = target_w / original_w
-                    scale_h = target_h / original_h
-                    scale = max(scale_w, scale_h)
-
-                    new_w = int(original_w * scale)
-                    new_h = int(original_h * scale)
-                    resize_param = (new_h, new_w)
-                    center_crop_param = (target_h, target_w)
-            else:
-                resize_param = min(closest_size)
-                center_crop_param = closest_size
-
-            ref_image_transform = transforms.Compose([
-                transforms.Resize(resize_param),
-                transforms.CenterCrop(center_crop_param),
-                transforms.ToTensor(),
-                transforms.Normalize([0.5], [0.5])
-            ])
-
-            semantic_image_pixel_values = [ref_image_transform(semantic_image) for semantic_image in semantic_images]
-            semantic_image_pixel_values = torch.cat(semantic_image_pixel_values).unsqueeze(0).unsqueeze(2).to(self.device_type)
-
-            with torch.autocast(device_type="cuda", dtype=torch.float16, enabled=True):
-                img_latents = self.pipeline.vae.encode(semantic_image_pixel_values).latent_dist.mode()
-                img_latents.mul_(self.pipeline.vae.config.scaling_factor)
-
-            target_height, target_width = closest_size
-
-        freqs_cos, freqs_sin = self.get_rotary_pos_embed(
-            target_video_length, target_height, target_width
-        )
-        n_tokens = freqs_cos.shape[0]
-
-        debug_str = f"""
-                        height: {target_height}
-                        width: {target_width}
-                video_length: {target_video_length}
-                        prompt: {prompt}
-                    neg_prompt: {negative_prompt}
-                        seed: {seed}
-                infer_steps: {infer_steps}
-        num_videos_per_prompt: {num_videos_per_prompt}
-                guidance_scale: {guidance_scale}
-                    n_tokens: {n_tokens}
-                    flow_shift: {flow_shift}
-    embedded_guidance_scale: {embedded_guidance_scale}
-                i2v_stability: {i2v_stability}"""
-        if ulysses_degree != 1 or ring_degree != 1:
-            debug_str += f"""
-                ulysses_degree: {ulysses_degree}
-                ring_degree: {ring_degree}
-            xdit_adaptive_size: {xdit_adaptive_size}"""
-        logger.debug(debug_str)
-
-        samples = self.pipeline(
-            prompt=prompt,
-            height=target_height,
-            width=target_width,
-            video_length=target_video_length,
-            num_inference_steps=infer_steps,
-            guidance_scale=guidance_scale,
-            negative_prompt=negative_prompt,
-            num_videos_per_prompt=num_videos_per_prompt,
-            generator=generator,
-            output_type="pil",
-            freqs_cis=(freqs_cos, freqs_sin),
-            n_tokens=n_tokens,
-            embedded_guidance_scale=embedded_guidance_scale,
-            data_type="video" if target_video_length > 1 else "image",
-            is_progress_bar=True,
-            vae_ver=self.vae_type,
-            enable_tiling=self.vae_tiling,
-            i2v_mode=i2v_mode,
-            i2v_condition_type=i2v_condition_type,
-            i2v_stability=i2v_stability,
-            img_latents=img_latents,
-            semantic_images=semantic_images,
-        )[0]
-        return samples
-
-    @torch.no_grad()
-    def inference(
-        self,
-        config : DictConfig,
-        **kwargs,
-    ):
-        height=config.height
-        width=config.width
-        video_length=config.frames
-        seed=config.seed
-        batch_size=config.bs
-        num_videos_per_prompt=config.n_samples_prompt
-        out_dict = dict()
-
-        prompt_list, image_path_list = self.load_inference_inputs(config.prompt_dir, config.mode)
-        if len(prompt_list) > 1:
-            logger.warning("HunyuanVideo currently does not support batch inference, we will sample at a time")
-    
-        # seeds
-        seeds = self.set_seed(seed, batch_size, num_videos_per_prompt)
-        generator = [torch.Generator(self.device_type).manual_seed(seed) for seed in seeds]
-        out_dict["seeds"] = seeds
-
-        # video input
-        self.check_video_input(height, width, video_length)
-        target_height = align_to(height, 16)
-        target_width = align_to(width, 16)
-        target_video_length = video_length
-        out_dict["size"] = (target_height, target_width, target_video_length)
-        filenames = self.process_savename(prompt_list, config.n_samples_prompt)
-
-        samples = []
-        gpu = []
-        time = []
-        for i, (prompt, i2v_image_path) in enumerate(zip(prompt_list, image_path_list)):
-            result_with_metrics = self.single_inference(prompt, i2v_image_path, target_video_length, generator, config)
-            sample = result_with_metrics['result']
-            samples.append(sample)
-            gpu.append(result_with_metrics.get('gpu', -1.0))
-            time.append(result_with_metrics.get('time', -1.0))
-
-            # Save samples
-            if 'LOCAL_RANK' not in os.environ or int(os.environ['LOCAL_RANK']) == 0:
-                save_videos_grid(sample, f"{config.savedir}/{filenames[i]}.mp4", fps=24)
-        
-        self.save_metrics(gpu=gpu, time=time, config=config, savedir=config.savedir)
-        out_dict['samples'] = samples
-        out_dict['prompts'] = prompt_list
-        return out_dict
-
-    def check_video_input(self, height, width, video_length):
-        if width <= 0 or height <= 0 or video_length <= 0:
-            raise ValueError(
-                f"`height` and `width` and `video_length` must be positive integers, got height={height}, width={width}, video_length={video_length}"
-            )
-        if (video_length - 1) % 4 != 0:
-            raise ValueError(
-                f"`video_length-1` must be a multiple of 4, got {video_length}"
-            )
-
-        logger.info(
-            f"Input (height, width, video_length) = ({height}, {width}, {video_length})"
-        )
-
-    def set_seed(self, seed, batch_size, num_videos_per_prompt):
-        if isinstance(seed, torch.Tensor):
-            seed = seed.tolist()
-        if seed is None:
-            seeds = [
-                random.randint(0, 1_000_000)
-                for _ in range(batch_size * num_videos_per_prompt)
-            ]
-        elif isinstance(seed, int):
-            seeds = [
-                seed + i
-                for _ in range(batch_size)
-                for i in range(num_videos_per_prompt)
-            ]
-        elif isinstance(seed, (list, tuple)):
-            if len(seed) == batch_size:
-                seeds = [
-                    int(seed[i]) + j
-                    for i in range(batch_size)
-                    for j in range(num_videos_per_prompt)
-                ]
-            elif len(seed) == batch_size * num_videos_per_prompt:
-                seeds = [int(s) for s in seed]
-            else:
-                raise ValueError(
-                    f"Length of seed must be equal to number of prompt(batch_size) or "
-                    f"batch_size * num_videos_per_prompt ({batch_size} * {num_videos_per_prompt}), got {seed}."
-                )
-        else:
-            raise ValueError(
-                f"Seed must be an integer, a list of integers, or None, got {seed}."
-            )
-            
-        return seeds
-    
-
-    def enable_vram_management(self):
-        pass
diff --git a/videotuna/flow/stepvideo.py b/videotuna/flow/stepvideo.py
deleted file mode 100644
index 1b98db6a..00000000
--- a/videotuna/flow/stepvideo.py
+++ /dev/null
@@ -1,448 +0,0 @@
-import torch
-import logging
-import os
-import torch.distributed as dist
-from typing import Any, Dict, List, Optional, Union
-from pathlib import Path
-from PIL import Image
-from datetime import datetime
-import sys
-import asyncio
-from tqdm import tqdm
-from omegaconf import OmegaConf, DictConfig
-
-from videotuna.base.generation_base import GenerationBase
-from videotuna.utils.common_utils import instantiate_from_config
-from videotuna.schedulers.flow_matching import FlowMatchScheduler
-
-
-from typing import Any, Callable, Dict, List, Optional, Tuple, Union
-from dataclasses import dataclass
-
-from loguru import logger
-import numpy as np
-import pickle
-import torch, copy
-from transformers.models.bert.modeling_bert import BertEmbeddings
-from diffusers.pipelines.pipeline_utils import DiffusionPipeline
-from diffusers.utils import BaseOutput
-
-from ..utils.common_utils import monitor_resources
-from videotuna.utils.inference_utils import enable_vram_management, AutoWrappedModule, AutoWrappedLinear
-from videotuna.models.stepvideo.stepvideo.modules.model import StepVideoModel
-from videotuna.models.stepvideo.stepvideo.diffusion.scheduler import FlowMatchDiscreteScheduler
-from videotuna.models.stepvideo.stepvideo.utils import VideoProcessor, with_empty_init
-from videotuna.models.stepvideo.stepvideo.modules.model import RMSNorm
-from videotuna.models.stepvideo.stepvideo.vae.vae import CausalConv, CausalConvAfterNorm, Upsample2D
-from videotuna.models.stepvideo.stepvideo.parallel import initialize_parall_group, get_parallel_group
-from xfuser.model_executor.models.customized.step_video_t2v.tp_applicator import TensorParallelApplicator
-from xfuser.core.distributed.parallel_state import get_tensor_model_parallel_world_size, get_tensor_model_parallel_rank
-
-
-class StepVideoModelFlow(GenerationBase):
-    """
-    Training and inference flow for YourModel.
-    
-    This model inherits from GenerationFlow, which is a base class for all generative models.
-    """
-
-    def __init__(
-        self,
-        first_stage_config: Dict[str, Any],
-        cond_stage_config: Dict[str, Any],
-        denoiser_config: Dict[str, Any],
-        scheduler_config: Optional[Dict[str, Any]] = None,
-        cond_stage_2_config: Optional[Dict[str, Any]] = None,
-        lora_config: Optional[Dict[str, Any]] = None,
-        ring_degree: int = 1,
-        ulysses_degree: int = 1,
-        tensor_parallel_degree: int = 1,
-        scale_factor: float = 1.0,
-        num_persistent_param_in_dit: int = None,
-        torch_dtype: torch.dtype = torch.bfloat16,
-        device: str = torch.cuda.current_device(),
-        enable_model_cpu_offload: bool = True,
-        enable_sequential_cpu_offload: bool = False,
-        *args, **kwargs
-    ):
-        logger.info("StepVideoModelFlow: init workflow")
-        if tensor_parallel_degree > 1:
-            logger.info("StepVideoModelFlow: init tensor parallel group")
-            initialize_parall_group(ring_degree=ring_degree, ulysses_degree=ulysses_degree, tensor_parallel_degree=tensor_parallel_degree)
-        super().__init__(
-            first_stage_config=first_stage_config,
-            cond_stage_config=cond_stage_config,
-            denoiser_config=denoiser_config,
-            scheduler_config=scheduler_config,
-            cond_stage_2_config=cond_stage_2_config,
-            lora_config=lora_config,
-            trainable_components=[]
-        )
-        
-        self.ring_degree = ring_degree
-        self.ulysses_degree = ulysses_degree
-        self.tensor_parallel_degree = tensor_parallel_degree
-        self.torch_dtype = torch_dtype
-        self.device_type = device
-        self.vae_scale_factor_temporal = self.vae.temporal_compression_ratio if getattr(self, "vae", None) else 8
-        self.vae_scale_factor_spatial = self.vae.spatial_compression_ratio if getattr(self, "vae", None) else 16
-        self.scale_factor = scale_factor
-        self.num_persistent_param_in_dit = num_persistent_param_in_dit
-        self.enable_sequential_cpu_offload = enable_sequential_cpu_offload
-        self.enable_model_cpu_offload = enable_model_cpu_offload
-
-    def load_lib(self, ckpt_path: str):
-        logger.info(f"loading lib from {ckpt_path}")
-        accepted_version = {
-            '2.2': 'liboptimus_ths-torch2.2-cu121.cpython-310-x86_64-linux-gnu.so',
-            '2.3': 'liboptimus_ths-torch2.3-cu121.cpython-310-x86_64-linux-gnu.so',
-            '2.5': 'liboptimus_ths-torch2.5-cu124.cpython-310-x86_64-linux-gnu.so',
-        }
-        try:
-            version = '.'.join(torch.__version__.split('.')[:2])
-            if version in accepted_version:
-                logger.info(f"cur dir: {os.getcwd()}")
-                library = os.path.join(ckpt_path, f'lib/{accepted_version[version]}')
-                logger.info(f"loading lib from {library}")
-                torch.ops.load_library(library)
-                logger.info(f"{library} loaded")
-            else:
-                raise ValueError("Not supported torch version for liboptimus")
-        except Exception as err:
-            print(err)
-
-    def enable_vram_management(self):
-        logger.info("StepVideoModelFlow: start enable_vram_management")
-        dtype = next(iter(self.cond_stage_2_model.parameters())).dtype
-        logger.info(f"cond_stage_2_model param dtype: {dtype}")
-        #use enable_model_cpu_offload as default
-        onload_device = self.device_type
-        if self.enable_sequential_cpu_offload:
-            onload_device = 'cpu'
-        elif self.enable_model_cpu_offload:
-            onload_device = self.device_type
-
-        enable_vram_management(
-            self.cond_stage_2_model,
-            module_map = {
-                torch.nn.Linear: AutoWrappedLinear,
-                BertEmbeddings: AutoWrappedModule,
-                torch.nn.LayerNorm: AutoWrappedModule,
-            },
-            module_config = dict(
-                offload_dtype=dtype,
-                offload_device="cpu",
-                onload_dtype=dtype,
-                onload_device=onload_device,
-                computation_dtype=self.torch_dtype,
-                computation_device=self.device_type,
-            ),
-        )
-        dtype = next(iter(self.cond_stage_model.parameters())).dtype
-        logger.info(f"cond_stage_model param dtype: {dtype}")
-        enable_vram_management(
-            self.cond_stage_model,
-            module_map = {
-                torch.nn.Linear: AutoWrappedLinear,
-                RMSNorm: AutoWrappedModule,
-                torch.nn.Embedding: AutoWrappedModule,
-            },
-            module_config = dict(
-                offload_dtype=dtype,
-                offload_device="cpu",
-                onload_dtype=dtype,
-                onload_device=onload_device,
-                computation_dtype=self.torch_dtype,
-                computation_device=self.device_type,
-            ),
-        )
-        dtype = next(iter(self.denoiser.parameters())).dtype
-        logger.info(f"denoiser param dtype: {dtype}")
-        enable_vram_management(
-            self.denoiser,
-            module_map = {
-                torch.nn.Linear: AutoWrappedLinear,
-                torch.nn.Conv2d: AutoWrappedModule,
-                torch.nn.LayerNorm: AutoWrappedModule,
-                RMSNorm: AutoWrappedModule,
-            },
-            module_config = dict(
-                offload_dtype=dtype,
-                offload_device="cpu",
-                onload_dtype=dtype,
-                onload_device=onload_device,
-                computation_dtype=self.torch_dtype,
-                computation_device=self.device_type,
-            ),
-            max_num_param=self.num_persistent_param_in_dit,
-            overflow_module_config = dict(
-                offload_dtype=dtype,
-                offload_device="cpu",
-                onload_dtype=dtype,
-                onload_device=onload_device,
-                computation_dtype=self.torch_dtype,
-                computation_device=self.device_type,
-            ),
-        )
-        dtype = next(iter(self.first_stage_model.parameters())).dtype
-        logger.info(f"first_stage_model param dtype: {dtype}")
-        enable_vram_management(
-            self.first_stage_model,
-            module_map = {
-                torch.nn.Linear: AutoWrappedLinear,
-                torch.nn.Conv3d: AutoWrappedModule,
-                CausalConv: AutoWrappedModule,
-                CausalConvAfterNorm: AutoWrappedModule,
-                Upsample2D: AutoWrappedModule
-            },
-            module_config = dict(
-                offload_dtype=dtype,
-                offload_device="cpu",
-                onload_dtype=dtype,
-                onload_device=onload_device,
-                computation_dtype=self.torch_dtype,
-                computation_device=self.device_type,
-            ),
-        )
-        self.enable_cpu_offload()
-        logger.info("StepVideoModelFlow: end enable_vram_management")
-
-    
-    def encode_prompt(
-        self,
-        input_prompt: str,
-        neg_magic: str = '',
-        pos_magic: str = '',
-    ):
-        prompts = [input_prompt + pos_magic]
-        bs = len(prompts)
-        prompts += [neg_magic] * bs
-        
-        prompt_embeds, prompt_embeds_mask = self.cond_stage_model(prompts)
-        clip_embedding, _ = self.cond_stage_2_model(prompts)
-        
-        len_clip = clip_embedding.shape[1]
-        prompt_embeds_mask = torch.nn.functional.pad(prompt_embeds_mask, (len_clip, 0), value=1)   ## pad attention_mask with clip's length 
-
-        return prompt_embeds, clip_embedding, prompt_embeds_mask
-
-    def check_inputs(self, num_frames, width, height):
-        num_frames = max(num_frames//17*17, 1)
-        width = max(width//16*16, 16)
-        height = max(height//16*16, 16)
-        return num_frames, width, height
-
-    def prepare_latents(
-        self,
-        batch_size: int,
-        num_channels_latents: 64,
-        height: int = 544,
-        width: int = 992,
-        num_frames: int = 204,
-        dtype: Optional[torch.dtype] = None,
-        device: Optional[torch.device] = None,
-        generator: Optional[Union[torch.Generator, List[torch.Generator]]] = None,
-        latents: Optional[torch.Tensor] = None,
-    ) -> torch.Tensor:
-        if latents is not None:
-            return latents.to(device=device, dtype=dtype)
-
-        num_frames, width, height = self.check_inputs(num_frames, width, height)
-        shape = (
-            batch_size,
-            max(num_frames//17*3, 1),
-            num_channels_latents,
-            int(height) // self.vae_scale_factor_spatial,
-            int(width) // self.vae_scale_factor_spatial,
-        )   # b,f,c,h,w
-        if isinstance(generator, list) and len(generator) != batch_size:
-            raise ValueError(
-                f"You have passed a list of generators of length {len(generator)}, but requested an effective batch"
-                f" size of {batch_size}. Make sure the batch size matches the length of the generators."
-            )
-
-        if generator is None:
-            generator = torch.Generator(device=device)
-
-        latents = torch.randn(shape, generator=generator, device=device, dtype=dtype)
-        return latents
-
-
-    @torch.inference_mode()
-    def inference(self, config: DictConfig, device=torch.cuda.current_device()):
-        # init vars
-        rank = int(os.getenv("RANK", 0))
-        world_size = int(os.getenv("WORLD_SIZE", 1))
-        local_rank = int(os.getenv("LOCAL_RANK", 0))
-        device = local_rank
-    
-        # load input
-        prompt_list = self.load_inference_inputs(config.prompt_file, config.mode)
-        if len(prompt_list) > 1:
-            logger.warning("Stepvideo currently does not support batch inference, we will sample at a time")
-        
-        videos = []
-        gpu = []
-        time = []
-        for prompt in prompt_list:
-            if rank == 0:
-                result_with_metrics = self.single_inference(prompt, config)
-                video  = result_with_metrics['result']
-                videos.append(video)
-                gpu.append(result_with_metrics.get('gpu', -1.0))
-                time.append(result_with_metrics.get('time', -1.0))
-        
-        if rank == 0:
-            logger.info("Saving videos")
-            filenames = self.process_savename(prompt_list, config.n_samples_prompt)
-            processor = VideoProcessor(config.savedir)
-            for video, filename in zip(videos, filenames):
-                processor.postprocess_video(video, filename)
-            self.save_metrics(gpu=gpu, time=time, config=config, savedir=config.savedir)
-        
-    
-    @monitor_resources(return_metrics=True)
-    def single_inference(self, prompt, config: DictConfig):
-        rank = int(os.getenv("RANK", 0))
-        world_size = int(os.getenv("WORLD_SIZE", 1))
-        local_rank = int(os.getenv("LOCAL_RANK", 0))
-        device = local_rank
-
-        neg_magic = config.uncond_prompt
-        pos_magic = config.pos_prompt
-        batch_size = config.bs
-        time_shift = config.time_shift
-        num_inference_steps = config.num_inference_steps
-        unconditional_guidance_scale = config.unconditional_guidance_scale
-        do_classifier_free_guidance = unconditional_guidance_scale > 1.0
-        # 3. Encode input prompt
-        logger.info("loading cond_stage_model and cond_stage_2_model")
-        self.load_models_to_device(['cond_stage_model', 'cond_stage_2_model'])
-
-        logger.info("encoding prompt")
-        prompt_embeds, prompt_embeds_2, prompt_attention_mask = self.encode_prompt(
-            input_prompt=prompt,
-            neg_magic=neg_magic,
-            pos_magic=pos_magic
-        )
-
-        denoiser_dtype = self.denoiser.dtype
-        prompt_embeds = prompt_embeds.to(denoiser_dtype).to(device)
-        prompt_attention_mask = prompt_attention_mask.to(denoiser_dtype).to(device)
-        prompt_embeds_2 = prompt_embeds_2.to(denoiser_dtype).to(device)
-
-        # 4. Prepare timesteps
-        self.scheduler.set_timesteps(
-            num_inference_steps=num_inference_steps,
-            time_shift=time_shift,
-            device=device
-        )
-
-        # 5. Prepare latent variables
-        logger.info("preparing latents")
-        num_channels_latents = self.denoiser.config.in_channels
-        latents = self.prepare_latents(
-            batch_size * config.n_samples_prompt,
-            num_channels_latents,
-            config.height,
-            config.width,
-            config.frames,
-            torch.bfloat16,
-            device,
-            torch.Generator(device=device).manual_seed(config.seed)
-        ).to(device)
-
-        # 7. Denoising loop
-        logger.info("loading denoiser")
-        self.load_models_to_device(['denoiser'])
-
-        with tqdm(total=num_inference_steps) as progress_bar:
-            for i, t in enumerate(self.scheduler.timesteps):
-                latent_model_input = torch.cat([latents] * 2) if do_classifier_free_guidance else latents
-                latent_model_input = latent_model_input.to(denoiser_dtype)
-                # broadcast to batch dimension in a way that's compatible with ONNX/Core ML
-                timestep = t.expand(latent_model_input.shape[0]).to(latent_model_input.dtype).to(device)
-
-                noise_pred = self.denoiser(
-                    hidden_states=latent_model_input,
-                    timestep=timestep,
-                    encoder_hidden_states=prompt_embeds,
-                    encoder_attention_mask=prompt_attention_mask,
-                    encoder_hidden_states_2=prompt_embeds_2,
-                    return_dict=False,
-                )
-                # perform guidance
-                if do_classifier_free_guidance:
-                    noise_pred_text, noise_pred_uncond = noise_pred.chunk(2)
-                    noise_pred = noise_pred_uncond + unconditional_guidance_scale * (noise_pred_text - noise_pred_uncond)
-
-                # compute the previous noisy sample x_t -> x_t-1
-                latents = self.scheduler.step(
-                    model_output=noise_pred,
-                    timestep=t,
-                    sample=latents
-                )
-                
-                progress_bar.update()
-
-        if not torch.distributed.is_initialized() or int(torch.distributed.get_rank())==0:
-            self.load_models_to_device(['first_stage_model'])
-            video = self.first_stage_model.decode(latents.to(denoiser_dtype).to(device) / self.scale_factor)
-            return video
-
-
-    def from_pretrained(self,
-                        ckpt_path: Optional[Union[str, Path]] = None,
-                        denoiser_ckpt_path: Optional[Union[str, Path]] = None,
-                        lora_ckpt_path: Optional[Union[str, Path]] = None,
-                        ignore_missing_ckpts: bool = False):
-        logger.info("StepVideoModelFlow: start load weight")
-        self.load_lib(ckpt_path)
-        self.first_stage_model.load_weight()
-        self.cond_stage_2_model.load_weight()
-        logger.info("StepVideoModelFlow: end load weight")
-
-        if self.tensor_parallel_degree > 1:
-            logger.info("StepVideoModelFlow: apply tensor parallel")
-            tp_applicator = TensorParallelApplicator(get_tensor_model_parallel_world_size(), get_tensor_model_parallel_rank())
-            tp_applicator.apply_to_model(self.denoiser)    
-        
-    def training_step(self, batch, batch_idx):
-        model_offload: bool = True,
-        dtype: torch.dtype = torch.bfloat16,
-        device: str = "cuda"
-        first_stage_key = self.first_stage_key
-        cond_stage_key = self.cond_stage_key
-
-        if model_offload:
-            self.first_stage_model.to(device)
-        latents = torch.stack(self.first_stage_model.encode(batch[first_stage_key])).to(dtype=dtype, device=device).detach()
-        if model_offload:
-            self.first_stage_model.to('cpu')
-            self.cond_stage_model.to(device)
-        text_cond_embed, text_cond_embed_mask  = self.cond_stage_model(batch[cond_stage_key], device)
-        if model_offload:
-            self.cond_stage_model.to('cpu')
-
-        ## scheduler
-        self.scheduler : FlowMatchScheduler = FlowMatchScheduler(shift=5, sigma_min=0.0, extra_one_step=True)
-        self.scheduler.set_timesteps(1000, training=True)
-
-        ## noise
-        B = len(latents)
-        noise = torch.randn_like(latents)
-        timestep_id = torch.randint(0, self.scheduler.num_train_timesteps, (1,))
-        timestep = self.scheduler.timesteps[timestep_id].to(dtype=dtype, device=device)
-        noisy_latents = self.scheduler.add_noise(latents, noise, timestep).to(dtype=dtype, device=device)
-        training_target = noise.to(device) - latents
-
-        # compute loss
-        noise_pred = self.model(x=noisy_latents, t=timestep, context=text_cond_embed, seq_len=None)
-        loss = torch.nn.functional.mse_loss(torch.stack(noise_pred).float(), training_target.float())
-        loss = loss * self.scheduler.training_weight(timestep).to(device=device)
-        self.log("train_loss", loss, prog_bar=True, on_step=True)
-        return loss
-    
-    @torch.no_grad()
-    def log_images(self, batch, **kwargs):
-        pass
\ No newline at end of file
diff --git a/videotuna/flow/videocrafter.py b/videotuna/flow/videocrafter.py
deleted file mode 100644
index 6049f9d9..00000000
--- a/videotuna/flow/videocrafter.py
+++ /dev/null
@@ -1,788 +0,0 @@
-import logging
-import os
-import json
-import random
-import time
-import numpy as np
-from einops import rearrange, repeat
-from tqdm import tqdm, trange
-from contextlib import contextmanager
-from functools import partial
-from typing import Any, Dict, List, Optional, Union
-from pathlib import Path
-
-import torch
-import torch.nn as nn
-import torch.nn.functional as F
-import pytorch_lightning as pl
-from pytorch_lightning.utilities import rank_zero_only
-from torchvision.utils import make_grid
-
-from videotuna.utils.ema import LitEma
-from videotuna.models.lvdm.ddpm3d import DiffusionWrapper
-from videotuna.utils.distributions import DiagonalGaussianDistribution
-from videotuna.schedulers.ddim import DDIMSampler
-from videotuna.base.generation_base import GenerationBase
-from videotuna.utils.common_utils import instantiate_from_config, print_green, print_yellow
-from videotuna.models.lvdm.modules.utils import (
-    default,
-    disabled_train,
-    exists,
-    extract_into_tensor,
-    noise_like,
-)
-
-mainlogger = logging.getLogger("mainlogger")
-
-
-
-class VideocrafterFlow(GenerationBase):
-    """
-    Training and inference flow for VideoCrafter.
-
-    THis model inherits from GenerationFlow, which is a base class for all generative models.
-
-    The main components of the model are:
-        - `first_stage`: a VAE model that encodes the input video into a latent space and decodes it back to the original video.
-        - `cond_stage`: a conditional model that takes the latent space and the conditioning text as input and generates the output video.
-        - `denoiser`: a denoiser model that takes the noisy output of the `cond_stage` and tries to remove the noise, which is the most important part of the model.
-        - `scheduler`: a scheduler that controls denosing and sampling.
-    """
-
-    def __init__(
-        self,
-        first_stage_config: Dict[str, Any],
-        cond_stage_config: Dict[str, Any],
-        denoiser_config: Dict[str, Any],
-        scheduler_config: Optional[Dict[str, Any]] = None,
-        cond_stage_2_config: Optional[Dict[str, Any]] = None,
-        lora_config: Optional[Dict[str, Any]] = None,
-        loss_type: str = "l2",
-        ckpt_path: Optional[Union[str, Path]] = None,
-        ignore_keys: List[str] = [],
-        load_only_unet: bool = False,
-        monitor: Optional[str] = None,
-        use_ema: bool = True,
-        first_stage_key: str = "image",
-        image_size: int = 256,
-        channels: int = 3,
-        log_every_t: int = 100,
-        clip_denoised: bool = True,
-        original_elbo_weight: float = 0.0,
-        l_simple_weight: float = 1.0,
-        conditioning_key: Optional[str] = None,
-        parameterization: str = "eps",  # all assuming fixed variance schedules
-        use_positional_encodings: bool = False,
-        cond_stage_key: str = "caption",
-        cond_stage_trainable: bool = False,
-        cond_stage_forward: Optional[callable] = None,
-        uncond_prob: float = 0.2,
-        uncond_type: str = "empty_seq",
-        scale_factor: float = 1.0,
-        scale_by_std: bool = False,
-        fps_condition_type: str = 'fs',
-        # added for LVDM
-        encoder_type: str = "2d",
-        frame_cond: Optional[Dict[str, Any]] = None,
-        only_model: bool = False,
-        use_scale: bool = False,  # dynamic rescaling
-        scale_a: int = 1,
-        scale_b: float = 0.3,
-        mid_step: int = 400,
-        fix_scale_bug: bool = False,
-        interp_mode: bool = False,
-        logdir: Optional[Union[str, Path]] = None,
-        rand_cond_frame: bool = False,
-        empty_params_only: bool = False,
-        *args, **kwargs
-    ):
-        super().__init__(
-            first_stage_config=first_stage_config,
-            cond_stage_config=cond_stage_config,
-            cond_stage_2_config=cond_stage_2_config,
-            denoiser_config=denoiser_config,
-            scheduler_config=scheduler_config,
-            lora_config=lora_config,
-        )
-        # DDPMFlow related
-        assert parameterization in ["eps", "x0", "v"], 'currently only supporting "eps" and "x0" and "v"'
-        self.parameterization = parameterization
-        mainlogger.info(f"{self.__class__.__name__}: Running in {self.parameterization}-prediction mode")
-
-        self.clip_denoised = clip_denoised
-        self.log_every_t = log_every_t
-
-        # model related 
-        self.first_stage_key = first_stage_key
-        self.channels = channels
-        self.temporal_length = denoiser_config['params'].get('temporal_length', 16)
-        self.image_size = image_size
-        if isinstance(self.image_size, int):
-            self.image_size = [self.image_size, self.image_size]
-        self.use_positional_encodings = use_positional_encodings
-        self.model = DiffusionWrapper(self.denoiser, conditioning_key)
-
-        self.use_ema = use_ema
-        if self.use_ema:
-            self.model_ema = LitEma(self.model)
-            mainlogger.info(f"Keeping EMAs of {len(list(self.model_ema.buffers()))}.")
-        
-        self.original_elbo_weight = original_elbo_weight
-        self.l_simple_weight = l_simple_weight
-
-        print('scheduler config type: ', type(scheduler_config))
-        scheduler_config['parameterization'] = self.parameterization
-        self.num_timesteps = self.scheduler.num_timesteps
-
-        # others 
-        if monitor is not None:
-            self.monitor = monitor
-        
-        self.loss_type = loss_type
-
-        # LVDM related
-        self.scale_by_std = scale_by_std
-        ckpt_path = kwargs.pop("ckpt_path", None)
-        ignore_keys = kwargs.pop("ignore_keys", [])
-        conditioning_key = default(conditioning_key, 'crossattn')
-
-        self.cond_stage_trainable = cond_stage_trainable
-        self.cond_stage_key = cond_stage_key
-        self.empty_params_only = empty_params_only
-        self.fps_condition_type = fps_condition_type
-
-        # scale factor
-        self.use_scale=use_scale
-        if self.use_scale:
-            self.scale_a=scale_a
-            self.scale_b=scale_b
-            if fix_scale_bug:
-                scale_step=self.num_timesteps-mid_step
-            else: #bug
-                scale_step = self.num_timesteps
-
-            scale_arr1 = np.linspace(scale_a, scale_b, mid_step)
-            scale_arr2 = np.full(scale_step, scale_b)
-            scale_arr = np.concatenate((scale_arr1, scale_arr2))
-            scale_arr_prev = np.append(scale_a, scale_arr[:-1])
-            to_torch = partial(torch.tensor, dtype=torch.float32)
-            self.register_buffer('scale_arr', to_torch(scale_arr))
-
-        try:
-            self.num_downs = len(first_stage_config['params'].ddconfig.ch_mult) - 1
-        except:
-            self.num_downs = 0
-        if not scale_by_std:
-            self.scale_factor = scale_factor
-        else:
-            self.register_buffer('scale_factor', torch.tensor(scale_factor))
-        
-        self.clip_denoised = False
-        self.cond_stage_forward = cond_stage_forward
-        self.encoder_type = encoder_type
-        assert(encoder_type in ["2d", "3d"])
-        self.uncond_prob = uncond_prob
-        self.classifier_free_guidance = True if uncond_prob > 0 else False
-        assert(uncond_type in ["zero_embed", "empty_seq"])
-        self.uncond_type = uncond_type
-
-        # future frame prediction
-        self.frame_cond = frame_cond
-        if self.frame_cond:
-            frame_len = self.temporal_length
-            cond_mask = torch.zeros(frame_len, dtype=torch.float32)
-            cond_mask[:self.frame_cond] = 1.0
-            self.cond_mask = cond_mask[None,None,:,None,None]
-            mainlogger.info("---training for %d-frame conditoning T2V"%(self.frame_cond))
-        else:
-            self.cond_mask = None
-                
-        self.logdir = logdir
-        self.rand_cond_frame = rand_cond_frame
-        self.interp_mode = interp_mode
-    
-    @contextmanager
-    def ema_scope(self, context=None):
-        if self.use_ema:
-            self.model_ema.store(self.model.parameters())
-            self.model_ema.copy_to(self.model)
-            if context is not None:
-                mainlogger.info(f"{context}: Switched to EMA weights")
-        try:
-            yield None
-        finally:
-            if self.use_ema:
-                self.model_ema.restore(self.model.parameters())
-                if context is not None:
-                    mainlogger.info(f"{context}: Restored training weights")
-    
-    @rank_zero_only
-    @torch.no_grad()
-    def on_train_batch_start(self, batch, batch_idx, dataloader_idx=None):
-        # only for very first batch, reset the self.scale_factor
-        if self.scale_by_std and self.current_epoch == 0 and self.global_step == 0 and batch_idx == 0:
-            assert self.scale_factor == 1., 'rather not use custom rescaling and std-rescaling simultaneously'
-            # set rescale weight to 1./std of encodings
-            mainlogger.info("### USING STD-RESCALING ###")
-            x = self.get_input(batch, self.first_stage_key)
-            x = x.to(self.device)
-            encoder_posterior = self.encode_first_stage(x)
-            z = self.get_first_stage_encoding(encoder_posterior).detach()
-            del self.scale_factor
-            self.register_buffer('scale_factor', 1. / z.flatten().std())
-            mainlogger.info(f"setting self.scale_factor to {self.scale_factor}")
-            mainlogger.info("### USING STD-RESCALING ###")
-            mainlogger.info(f"std={z.flatten().std()}")
-    
-    def on_train_batch_end(self, *args, **kwargs):
-        if self.use_ema:
-            self.model_ema(self.model)
-    
-    def get_learned_conditioning(self, c):
-        if self.cond_stage_forward is None:
-            if hasattr(self.cond_stage_model, 'encode') and callable(self.cond_stage_model.encode):
-                c = self.cond_stage_model.encode(c)
-                if isinstance(c, DiagonalGaussianDistribution):
-                    c = c.mode()
-            else:
-                c = self.cond_stage_model(c)
-        else:
-            assert hasattr(self.cond_stage_model, self.cond_stage_forward)
-            c = getattr(self.cond_stage_model, self.cond_stage_forward)(c)
-        return c
-
-    def get_first_stage_encoding(self, encoder_posterior, noise=None):
-        if isinstance(encoder_posterior, DiagonalGaussianDistribution):
-            z = encoder_posterior.sample(noise=noise)
-        elif isinstance(encoder_posterior, torch.Tensor):
-            z = encoder_posterior
-        else:
-            raise NotImplementedError(f"encoder_posterior of type '{type(encoder_posterior)}' not yet implemented")
-        return self.scale_factor * z
-    
-    @torch.no_grad()
-    def encode_first_stage(self, x):
-        if self.encoder_type == "2d" and x.dim() == 5:
-            return self.encode_first_stage_2DAE(x)
-        encoder_posterior = self.first_stage_model.encode(x)
-        results = self.get_first_stage_encoding(encoder_posterior).detach()
-        return results
-    
-    def encode_first_stage_2DAE(self, x):
-        """encode frame by frame"""
-        b, _, t, _, _ = x.shape
-        results = torch.cat([self.get_first_stage_encoding(self.first_stage_model.encode(x[:,:,i])).detach().unsqueeze(2) for i in range(t)], dim=2)
-        return results
-    
-    def decode_first_stage_2DAE(self, z, **kwargs):
-        """decode frame by frame"""
-        _, _, t, _, _ = z.shape
-        results = torch.cat([self.first_stage_model.decode(z[:,:,i], **kwargs).unsqueeze(2) for i in range(t)], dim=2)
-        return results
-
-    def _decode_core(self, z, **kwargs):
-        z = 1. / self.scale_factor * z
-
-        if self.encoder_type == "2d" and z.dim() == 5:
-            return self.decode_first_stage_2DAE(z)
-        results = self.first_stage_model.decode(z, **kwargs)
-        return results
-
-    @torch.no_grad()
-    def decode_first_stage(self, z, **kwargs):
-        return self._decode_core(z, **kwargs)
-
-    def differentiable_decode_first_stage(self, z, **kwargs):
-        """same as decode_first_stage but without decorator"""
-        return self._decode_core(z, **kwargs)
-    
-    def get_input(self, batch, k):
-        x = batch[k]
-        """
-        if len(x.shape) == 3:
-            x = x[..., None]
-        x = rearrange(x, 'b h w c -> b c h w')
-        """
-        x = x.to(memory_format=torch.contiguous_format).float()
-        return x
-    
-    def get_batch_input(self, batch, random_uncond, return_first_stage_outputs=False, return_original_cond=False, is_imgbatch=False):
-        ## image/video shape: b, c, t, h, w
-        data_key = 'jpg' if is_imgbatch else self.first_stage_key
-        x = self.get_input(batch, data_key)
-        if is_imgbatch:
-            ## pack image as video
-            #x = x[:,:,None,:,:]
-            b = x.shape[0] // self.temporal_length
-            x = rearrange(x, '(b t) c h w -> b c t h w', b=b, t=self.temporal_length)
-        x_ori = x
-        ## encode video frames x to z via a 2D encoder
-        z = self.encode_first_stage(x)
-                
-        ## get caption condition
-        cond_key = 'txt' if is_imgbatch else self.cond_stage_key
-        cond = batch[cond_key]
-        if random_uncond and self.uncond_type == 'empty_seq':
-            for i, ci in enumerate(cond):
-                if random.random() < self.uncond_prob:
-                    cond[i] = ""
-        if isinstance(cond, dict) or isinstance(cond, list):
-            cond_emb = self.get_learned_conditioning(cond)
-        else:
-            cond_emb = self.get_learned_conditioning(cond.to(self.device))
-        if random_uncond and self.uncond_type == 'zero_embed':
-            for i, ci in enumerate(cond):
-                if random.random() < self.uncond_prob:
-                    cond_emb[i] = torch.zeros_like(ci)
-
-        out = [z, cond_emb]
-        ## optional output: self-reconst or caption
-        if return_first_stage_outputs:
-            xrec = self.decode_first_stage(z)
-            out.extend([x_ori, xrec])
-        if return_original_cond:
-            out.append(cond)
-
-        return out
-
-    def forward(self, x, c, **kwargs):
-        if 't' in kwargs:
-            t = kwargs.pop('t')
-        else:
-            t = torch.randint(0, self.num_timesteps, (x.shape[0],), device=self.device).long()
-        if self.use_scale:
-            x = x * extract_into_tensor(self.scale_arr, t, x.shape)
-        return self.p_losses(x, c, t, **kwargs)
-
-    def shared_step(self, batch, random_uncond, **kwargs):
-        is_imgbatch = False
-        if "loader_img" in batch.keys():
-            ratio = 10.0 / self.temporal_length
-            if random.uniform(0.,10.) < ratio:
-                is_imgbatch = True
-                batch = batch["loader_img"]
-            else:
-                batch = batch["loader_video"]
-        else:
-            pass
-
-        x, c = self.get_batch_input(batch, random_uncond=random_uncond, is_imgbatch=is_imgbatch)
-        loss, loss_dict = self(x, c, is_imgbatch=is_imgbatch, **kwargs)
-        return loss, loss_dict
-    
-    def apply_model(self, x_noisy, t, cond, **kwargs):
-        if self.model.conditioning_key == "crossattn_stdit":
-            key = "c_crossattn_stdit"
-            cond = {key: [cond["y"]], "mask": [cond["mask"]]}  # support mask for T5
-        else:
-            if isinstance(cond, dict):
-                # hybrid case, cond is exptected to be a dict
-                pass
-            else:
-                if not isinstance(cond, list):
-                    cond = [cond]
-                key = (
-                    "c_concat"
-                    if self.model.conditioning_key == "concat"
-                    else "c_crossattn"
-                )
-                cond = {key: cond}
-
-        x_recon = self.model(x_noisy, t, **cond, **kwargs)
-
-        if isinstance(x_recon, tuple):
-            return x_recon[0]
-        else:
-            return x_recon
-    
-    def get_loss(self, pred, target, mean=True):
-
-        if target.size()[1] != pred.size()[1]:
-            c = target.size()[1]
-            pred = pred[
-                :, :c, ...
-            ]  # opensora, only previous 4 channels used for calculating loss.
-
-        if self.loss_type == "l1":
-            loss = (target - pred).abs()
-            if mean:
-                loss = loss.mean()
-        elif self.loss_type == "l2":
-            if mean:
-                loss = torch.nn.functional.mse_loss(target, pred)
-            else:
-                loss = torch.nn.functional.mse_loss(target, pred, reduction="none")
-        else:
-            raise NotImplementedError("unknown loss type '{loss_type}'")
-
-        return loss
-    
-    def p_losses(self, x_start, cond, t, noise=None, **kwargs):
-        noise = default(noise, lambda: torch.randn_like(x_start))
-        x_noisy = self.scheduler.q_sample(x_start=x_start, t=t, noise=noise)
-        if self.frame_cond:
-            if self.cond_mask.device is not self.device:
-                self.cond_mask = self.cond_mask.to(self.device)
-            ## condition on fist few frames
-            x_noisy = x_start * self.cond_mask + (1.-self.cond_mask) * x_noisy
-        model_output = self.apply_model(x_noisy, t, cond, **kwargs)
-
-        loss_dict = {}
-        prefix = 'train' if self.training else 'val'
-
-        if self.parameterization == "x0":
-            target = x_start
-        elif self.parameterization == "eps":
-            target = noise
-        elif self.parameterization == "v":
-            target = self.scheduler.get_v(x_start, noise, t)
-        else:
-            raise NotImplementedError()
-        
-        if self.frame_cond:
-            ## [b,c,t,h,w]: only care about the predicted part (avoid disturbance)
-            model_output = model_output[:,:,self.frame_cond:,:,:]
-            target = target[:,:,self.frame_cond:,:,:]
-        
-        loss_simple = self.get_loss(model_output, target, mean=False).mean([1, 2, 3, 4])
-
-        if torch.isnan(loss_simple).any():
-            print(f"loss_simple exists nan: {loss_simple}")
-            for i in range(loss_simple.shape[0]):
-                if torch.isnan(loss_simple[i]).any():
-                    loss_simple[i] = torch.zeros_like(loss_simple[i])
-
-        loss_dict.update({f'{prefix}/loss_simple': loss_simple.mean()})
-
-        if self.scheduler.logvar.device is not self.device:
-            self.scheduler.logvar = self.scheduler.logvar.to(self.device)
-        logvar_t = self.scheduler.logvar[t]
-        # logvar_t = self.logvar[t.item()].to(self.device) # device conflict when ddp shared
-        loss = loss_simple / torch.exp(logvar_t) + logvar_t
-        # loss = loss_simple / torch.exp(self.logvar) + self.logvar
-        if self.scheduler.learn_logvar:
-            loss_dict.update({f'{prefix}/loss_gamma': loss.mean()})
-            loss_dict.update({'logvar': self.scheduler.logvar.data.mean()})
-
-        loss = self.l_simple_weight * loss.mean()
-
-        if self.original_elbo_weight > 0:
-            loss_vlb = self.get_loss(model_output, target, mean=False).mean(dim=(1, 2, 3, 4))
-            loss_vlb = (self.lvlb_weights[t] * loss_vlb).mean()
-            loss_dict.update({f'{prefix}/loss_vlb': loss_vlb})
-            loss += (self.original_elbo_weight * loss_vlb)
-        loss_dict.update({f'{prefix}/loss': loss})
-
-        return loss, loss_dict   
-
-    def training_step(self, batch, batch_idx):
-        loss, loss_dict = self.shared_step(batch, random_uncond=self.classifier_free_guidance)
-        self.log_dict(loss_dict, prog_bar=True, logger=True, on_step=True, on_epoch=True, sync_dist=False)
-        #self.log("epoch/global_step", self.global_step.float(), prog_bar=True, logger=True, on_step=True, on_epoch=False)
-        '''
-        if self.use_scheduler:
-            lr = self.optimizers().param_groups[0]['lr']
-            self.log('lr_abs', lr, prog_bar=True, logger=True, on_step=True, on_epoch=False, rank_zero_only=True)
-        '''
-        if (batch_idx+1) % self.log_every_t == 0:
-            mainlogger.info(f"batch:{batch_idx}|epoch:{self.current_epoch} [globalstep:{self.global_step}]: loss={loss}")
-        return loss
-    
-    def _get_denoise_row_from_list(self, samples, desc=''):
-        denoise_row = []
-        for zd in tqdm(samples, desc=desc):
-            denoise_row.append(self.decode_first_stage(zd.to(self.device)))
-        n_log_timesteps = len(denoise_row)
-
-        denoise_row = torch.stack(denoise_row)  # n_log_timesteps, b, C, H, W
-        
-        if denoise_row.dim() == 5:
-            # img, num_imgs= n_log_timesteps * bs, grid_size=[bs,n_log_timesteps]
-            # batch:col, different samples, 
-            # n:rows, different steps for one sample
-            denoise_grid = rearrange(denoise_row, 'n b c h w -> b n c h w')
-            denoise_grid = rearrange(denoise_grid, 'b n c h w -> (b n) c h w')
-            denoise_grid = make_grid(denoise_grid, nrow=n_log_timesteps)
-        elif denoise_row.dim() == 6:
-            # video, grid_size=[n_log_timesteps*bs, t]
-            video_length = denoise_row.shape[3]
-            denoise_grid = rearrange(denoise_row, 'n b c t h w -> b n c t h w')
-            denoise_grid = rearrange(denoise_grid, 'b n c t h w -> (b n) c t h w')
-            denoise_grid = rearrange(denoise_grid, 'n c t h w -> (n t) c h w')
-            denoise_grid = make_grid(denoise_grid, nrow=video_length)
-        else:
-            raise ValueError
-
-        return denoise_grid
-
-    @torch.no_grad()
-    def log_images(self, batch, sample=True, ddim_steps=200, ddim_eta=1., plot_denoise_rows=False, \
-                    unconditional_guidance_scale=1.0, **kwargs):
-        """ log images for LatentDiffusion """
-        ## TBD: currently, classifier_free_guidance sampling is only supported by DDIM
-        use_ddim = ddim_steps is not None
-        log = dict()
-        z, c, x, xrec, xc = self.get_batch_input(batch, random_uncond=False,
-                                                return_first_stage_outputs=True,
-                                                return_original_cond=True)
-        N, _, T, H, W = x.shape
-        # TODO fix data type 
-        log["inputs"] = x.to(torch.bfloat16)
-        log["reconst"] = xrec
-        log["condition"] = xc
-        
-        if sample:
-            # get uncond embedding for classifier-free guidance sampling
-            if unconditional_guidance_scale != 1.0:
-                if isinstance(c, dict):
-                    if "y" in c:
-                        c_emb = c["y"]
-                        c_cat = None # set default value is None
-                    else:
-                        c_cat, c_emb = c["c_concat"][0], c["c_crossattn"][0]
-                else:
-                    c_emb = c
-                
-                # TODO fix data type 
-                z = z.to(torch.bfloat16)
-                c_emb = c_emb.to(torch.bfloat16)
-
-                # get uc: unconditional condition for classifier-free guidance sampling
-                if self.uncond_type == "empty_seq":
-                    prompts = N * [""]
-                    uc = self.get_learned_conditioning(prompts)
-                elif self.uncond_type == "zero_embed":
-                    uc = torch.zeros_like(c_emb)
-                # make uc for hybrid condition case
-                if isinstance(c, dict) and c_cat is not None:
-                    uc = {"c_concat": [c_cat], "c_crossattn": [uc]}
-            else:
-                uc = None
-
-            with self.ema_scope("Plotting"):
-                samples, z_denoise_row = self.sample_log(cond=c, batch_size=N, ddim=use_ddim,
-                                                         ddim_steps=ddim_steps,eta=ddim_eta,
-                                                         unconditional_guidance_scale=unconditional_guidance_scale,
-                                                         unconditional_conditioning=uc, mask=self.cond_mask, x0=z, **kwargs)
-            x_samples = self.decode_first_stage(samples)
-            log["samples"] = x_samples
-            
-            if plot_denoise_rows:
-                denoise_grid = self._get_denoise_row_from_list(z_denoise_row)
-                log["denoise_row"] = denoise_grid
-
-        return log
-    
-    @torch.no_grad()
-    def p_sample_loop(self, cond, shape, return_intermediates=False, x_T=None, verbose=True, callback=None, \
-                      timesteps=None, mask=None, x0=None, img_callback=None, start_T=None, log_every_t=None, **kwargs):
-
-        if not log_every_t:
-            log_every_t = self.log_every_t
-        device = self.device
-        b = shape[0]        
-        # sample an initial noise
-        if x_T is None:
-            img = torch.randn(shape, device=device)
-        else:
-            img = x_T
-
-        intermediates = [img]
-        if timesteps is None:
-            timesteps = self.num_timesteps
-        if start_T is not None:
-            timesteps = min(timesteps, start_T)
-
-        iterator = tqdm(reversed(range(0, timesteps)), desc='Sampling t', total=timesteps) if verbose else reversed(range(0, timesteps))
-
-        if mask is not None:
-            assert x0 is not None
-            assert x0.shape[2:3] == mask.shape[2:3]  # spatial size has to match
-
-        for i in iterator:
-            ts = torch.full((b,), i, device=device, dtype=torch.long)
-            if self.scheduler.shorten_cond_schedule:
-                assert self.model.conditioning_key != 'hybrid'
-                tc = self.cond_ids[ts].to(cond.device)
-                cond = self.scheduler.q_sample(x_start=cond, t=tc, noise=torch.randn_like(cond))
-
-            img = self.scheduler.p_sample(img, cond, ts, clip_denoised=self.clip_denoised, **kwargs)
-            if mask is not None:
-                img_orig = self.scheduler.q_sample(x0, ts)
-                img = img_orig * mask + (1. - mask) * img
-
-            if i % log_every_t == 0 or i == timesteps - 1:
-                intermediates.append(img)
-            if callback: callback(i)
-            if img_callback: img_callback(img, i)
-
-        if return_intermediates:
-            return img, intermediates
-        return img
-
-    @torch.no_grad()
-    def sample(self, cond, batch_size=16, return_intermediates=False, x_T=None, \
-               verbose=True, timesteps=None, mask=None, x0=None, shape=None, **kwargs):
-        if shape is None:
-            shape = (batch_size, self.channels, self.temporal_length, *self.image_size)
-        if cond is not None:
-            if isinstance(cond, dict):
-                cond = {key: cond[key][:batch_size] if not isinstance(cond[key], list) else
-                list(map(lambda x: x[:batch_size], cond[key])) for key in cond}
-            else:
-                cond = [c[:batch_size] for c in cond] if isinstance(cond, list) else cond[:batch_size]
-        return self.p_sample_loop(cond,
-                                  shape,
-                                  return_intermediates=return_intermediates, x_T=x_T,
-                                  verbose=verbose, timesteps=timesteps,
-                                  mask=mask, x0=x0, **kwargs)
-
-    @torch.no_grad()
-    def sample_log(self, cond, batch_size, ddim, ddim_steps, **kwargs):        
-        if ddim:
-            ddim_sampler = DDIMSampler(self)
-            shape = (self.channels, self.temporal_length, *self.image_size)
-            # kwargs.update({"clean_cond": True})
-            samples, intermediates =ddim_sampler.sample(ddim_steps, batch_size, shape, cond, verbose=False, **kwargs)
-
-        else:
-            samples, intermediates = self.sample(cond=cond, batch_size=batch_size, return_intermediates=True, **kwargs)
-
-        return samples, intermediates
-    
-    @torch.no_grad()
-    def validation_step(self, batch, batch_idx):
-        _, loss_dict_no_ema = self.shared_step(batch, random_uncond=False)
-        with self.ema_scope():
-            _, loss_dict_ema = self.shared_step(batch, random_uncond=False)
-            loss_dict_ema = {key + "_ema": loss_dict_ema[key] for key in loss_dict_ema}
-        self.log_dict(
-            loss_dict_no_ema, prog_bar=False, logger=True, on_step=False, on_epoch=True
-        )
-        self.log_dict(
-            loss_dict_ema, prog_bar=False, logger=True, on_step=False, on_epoch=True
-        )
-    
-    def sample_batch_t2v(
-            self,
-            prompts: List[str],
-            fps: int,
-            noise_shape: Optional[tuple] = None,
-            n_samples_prompt: int = 1,
-            ddim_steps: int = 50,
-            ddim_eta: float = 1.0,
-            cfg_scale: float = 1.0,
-            temporal_cfg_scale: Optional[float] = None,
-            uncond_prompt: str = "",
-            **kwargs,
-        ) -> None:
-        """
-        Sample a batch of text-to-video (T2V) sequences.
-
-        :param model: The model used for generating the video.
-        :param sampler: The sampler used for sampling the video frames.
-        :param prompts: A list of text prompts for generating the video.
-        :param noise_shape: The shape of the noise input for the model.
-        :param fps: Frames per second for the generated video.
-        :param n_samples_prompt: Number of samples per prompt. Default is 1.
-        :param ddim_steps: Number of DDIM steps for the sampling process. Default is 50.
-        :param ddim_eta: The eta parameter for DDIM. Default is 1.0.
-        :param cfg_scale: The scale for classifier-free guidance. Default is 1.0.
-        :param temporal_cfg_scale: The scale for temporal classifier-free guidance. Default is None.
-        :param uncond_prompt: The unconditional prompt for classifier-free guidance. Default is an empty string.
-        :param kwargs: Additional keyword arguments.
-        """
-        # ----------------------------------------------------------------------------------
-        # make cond & uncond for t2v
-        uncond_prompt = "" if uncond_prompt is None else uncond_prompt
-        batch_size = noise_shape[0]
-        text_emb = self.get_learned_conditioning(prompts)
-        fps = torch.tensor([fps] * batch_size).to(self.device).long()
-        cond = {"c_crossattn": [text_emb], "fps": fps}
-
-        if cfg_scale != 1.0:  # unconditional guidance
-            uc_text_emb = self.get_learned_conditioning(batch_size * [uncond_prompt])
-            uncond = {k: v for k, v in cond.items()}
-            uncond.update({"c_crossattn": [uc_text_emb]})
-        else:
-            uncond = None
-
-        # ----------------------------------------------------------------------------------
-        # sampling
-        batch_samples = []
-        for _ in range(n_samples_prompt):  # iter over batch of prompts
-            samples, _ = self.ddim_sampler.sample(
-                S=ddim_steps,
-                conditioning=cond,
-                batch_size=batch_size,
-                shape=noise_shape[1:],
-                verbose=False,
-                unconditional_guidance_scale=cfg_scale,
-                unconditional_conditioning=uncond,
-                eta=ddim_eta,
-                temporal_length=noise_shape[2],
-                conditional_guidance_scale_temporal=temporal_cfg_scale,
-                **kwargs,
-            )
-            res = self.decode_first_stage(samples)
-            batch_samples.append(res)
-        batch_samples = torch.stack(batch_samples, dim=1)
-        return batch_samples
-    
-    @torch.no_grad()
-    def inference(self, args, **kwargs):
-        # create inference sampler
-        self.ddim_sampler = DDIMSampler(self)
-        # load prompt list
-        prompt_list = self.load_inference_inputs(args.prompt_file, mode=args.mode)
-
-        # TODO: inference on multiple gpus
-
-        # noise shape
-        args.frames = self.temporal_length if args.frames is None else args.frames
-        h, w, frames, channels = (
-            args.height // 8,
-            args.width // 8,
-            args.frames,
-            self.channels,
-        )
-
-        # -----------------------------------------------------------------
-        # inference
-        format_file = {}
-        start = time.time()
-        n_iters = len(prompt_list) // args.bs + (
-            1 if len(prompt_list) % args.bs else 0
-        )
-        with torch.no_grad():
-            for idx in trange(0, n_iters, desc="Sample Iters"):
-                prompts = prompt_list[idx * args.bs : (idx + 1) * args.bs]
-                filenames = self.process_savename(prompts, args.n_samples_prompt)
-                ## inference
-                bs = args.bs if args.bs == len(prompts) else len(prompts)
-                noise_shape = [bs, channels, frames, h, w]
-                if args.mode == "t2v":
-                    batch_samples = self.sample_batch_t2v(
-                        prompts,
-                        args.fps,
-                        noise_shape,
-                        args.n_samples_prompt,
-                        args.ddim_steps,
-                        args.ddim_eta,
-                        args.unconditional_guidance_scale,
-                        args.unconditional_guidance_scale_temporal,
-                        args.uncond_prompt,
-                    )
-
-                if args.standard_vbench:
-                    self.save_videos_vbench(
-                        batch_samples, args.savedir, prompts, format_file, fps=args.savefps
-                    )
-                else:
-                    self.save_videos(batch_samples, args.savedir, filenames, fps=args.savefps)
-
-        if args.standard_vbench:
-            with open(os.path.join(args.savedir, "info.json"), "w") as f:
-                json.dump(format_file, f)
-
-        print_green(f"Saved in {args.savedir}. Time used: {(time.time() - start):.2f} seconds")
\ No newline at end of file
diff --git a/videotuna/flow/wanvideo.py b/videotuna/flow/wanvideo.py
index f478597f..9de3582b 100644
--- a/videotuna/flow/wanvideo.py
+++ b/videotuna/flow/wanvideo.py
@@ -1,47 +1,88 @@
-import torch
-from loguru import logger
-import random
 import os
-import math
-import torch.distributed as dist
-from typing import Any, Dict, List, Optional, Union
 from pathlib import Path
+from typing import Any, Dict, Optional, Union, cast
+
+import torch
+import torch.distributed as dist
+from omegaconf import DictConfig
 from PIL import Image
-from datetime import datetime
-import sys
-from omegaconf import OmegaConf, DictConfig
 
+import videotuna.models.wan.wan as wan
 from videotuna.base.generation_base import GenerationBase
-from videotuna.utils.common_utils import instantiate_from_config
+from videotuna.models.wan.wan.configs import (
+    MAX_AREA_CONFIGS,
+    SIZE_CONFIGS,
+    SUPPORTED_SIZES,
+    WAN_CONFIGS,
+)
+from videotuna.models.wan.wan.utils.prompt_extend import (
+    DashScopePromptExpander,
+    QwenPromptExpander,
+)
 from videotuna.utils.args_utils import VideoMode
-import videotuna.models.wan.wan as wan
-from videotuna.models.wan.wan.configs import WAN_CONFIGS, SIZE_CONFIGS, MAX_AREA_CONFIGS, SUPPORTED_SIZES
-from videotuna.models.wan.wan.utils.prompt_extend import DashScopePromptExpander, QwenPromptExpander
-from videotuna.models.wan.wan.utils.utils import cache_video, cache_image, str2bool
+from videotuna.utils.attention import maybe_compile_denoiser
+from videotuna.utils.common_utils import monitor_resources
+from videotuna.utils.device_utils import (
+    gpu_is_available,
+    require_xfuser_sequence_parallel,
+)
+from videotuna.utils.logging_config import (
+    bound_logger,
+    phase_from_wan_task,
+    resolve_device_label,
+)
+from videotuna.utils.wan_training import (
+    compute_wan_flow_matching_loss,
+    init_wan_training_denoisers,
+)
+
+_BOXING_PROMPT = (
+    "Two anthropomorphic cats in comfy boxing gear and bright gloves "
+    "fight intensely on a spotlighted stage."
+)
+_BEACH_PROMPT = (
+    "Summer beach vacation style, a white cat wearing sunglasses sits on a "
+    "surfboard. The fluffy-furred feline gazes directly at the camera with a "
+    "relaxed expression. Blurred beach scenery forms the background featuring "
+    "crystal-clear waters, distant green hills, and a blue sky dotted with "
+    "white clouds. The cat assumes a naturally relaxed posture, as if "
+    "savoring the sea breeze and warm sunlight. A close-up shot highlights "
+    "the feline's intricate details and the refreshing atmosphere of the "
+    "seaside."
+)
 
 EXAMPLE_PROMPT = {
     "t2v-1.3B": {
-        "prompt": "Two anthropomorphic cats in comfy boxing gear and bright gloves fight intensely on a spotlighted stage.",
+        "prompt": _BOXING_PROMPT,
     },
     "t2v-14B": {
-        "prompt": "Two anthropomorphic cats in comfy boxing gear and bright gloves fight intensely on a spotlighted stage.",
+        "prompt": _BOXING_PROMPT,
+    },
+    "t2v-A14B": {
+        "prompt": _BOXING_PROMPT,
     },
     "t2i-14B": {
         "prompt": "一个朴素端庄的美人",
     },
     "i2v-14B": {
-        "prompt":
-            "Summer beach vacation style, a white cat wearing sunglasses sits on a surfboard. The fluffy-furred feline gazes directly at the camera with a relaxed expression. Blurred beach scenery forms the background featuring crystal-clear waters, distant green hills, and a blue sky dotted with white clouds. The cat assumes a naturally relaxed posture, as if savoring the sea breeze and warm sunlight. A close-up shot highlights the feline's intricate details and the refreshing atmosphere of the seaside.",
-        "image":
-            "inputs/i2v/576x1024/i2v_input.JPG",
+        "prompt": _BEACH_PROMPT,
+        "image": "inputs/i2v/576x1024/i2v_input.JPG",
+    },
+    "i2v-A14B": {
+        "prompt": _BEACH_PROMPT,
+        "image": "inputs/i2v/576x1024/i2v_input.JPG",
     },
 }
 
+
 class WanVideoModelFlow(GenerationBase):
+    prompt_expander: DashScopePromptExpander | QwenPromptExpander | None = None
+
     """
     Training and inference flow for YourModel.
-    
-    This model inherits from GenerationFlow, which is a base class for all generative models.
+
+    This model inherits from GenerationFlow, which is a base class for all
+    generative models.
     """
 
     def __init__(
@@ -52,22 +93,26 @@ def __init__(
         scheduler_config: Optional[Dict[str, Any]] = None,
         cond_stage_2_config: Optional[Dict[str, Any]] = None,
         lora_config: Optional[Dict[str, Any]] = None,
-        task: str = "t2v-14B",            
-        ckpt_path: Optional[str] = None,    
-        offload_model: Optional[bool] = None, 
-        ulysses_size: int = 1,             
-        ring_size: int = 1,               
-        t5_fsdp: bool = False,           
-        t5_cpu: bool = False,             
-        dit_fsdp: bool = False,            
-        use_prompt_extend: bool = False,   
-        prompt_extend_method: str = "local_qwen",  
-        prompt_extend_model: Optional[str] = None,  
-        prompt_extend_target_lang: str = "zh",    
+        gradient_checkpointing: bool = True,
+        task: str = "t2v-14B",
+        ckpt_path: Optional[str] = None,
+        offload_model: Optional[bool] = None,
+        ulysses_size: int = 1,
+        ring_size: int = 1,
+        t5_fsdp: bool = False,
+        t5_cpu: bool = False,
+        dit_fsdp: bool = False,
+        use_prompt_extend: bool = False,
+        prompt_extend_method: str = "local_qwen",
+        prompt_extend_model: Optional[str] = None,
+        prompt_extend_target_lang: str = "zh",
         seed: int = -1,
-        *args, **kwargs
+        *args,
+        **kwargs,
     ):
-        logger.info("WanVideo flow: starting init")
+        phase = phase_from_wan_task(task)
+        self._log = bound_logger(phase=phase, flow="wanvideo")
+        self._log.info("WanVideo flow: starting init")
         assert ckpt_path is not None, "Please specify the checkpoint directory."
         assert task in WAN_CONFIGS, f"Unsupport task: {task}"
         assert task in EXAMPLE_PROMPT, f"Unsupport task: {task}"
@@ -78,9 +123,10 @@ def __init__(
             scheduler_config=scheduler_config,
             cond_stage_2_config=cond_stage_2_config,
             lora_config=lora_config,
-            trainable_components=[]
+            trainable_components=[],
         )
-        logger.info("WanVideo flow: class init finished")
+        self.apply_denoiser_gradient_checkpointing(gradient_checkpointing)
+        self._log.info("WanVideo flow: class init finished")
         self.task = task
         self.ckpt_path = ckpt_path
         self.use_prompt_extend = use_prompt_extend
@@ -90,77 +136,103 @@ def __init__(
         self.offload_model = offload_model
         self.ulysses_size = ulysses_size
         self.ring_size = ring_size
+        self.use_sp = ulysses_size > 1 or ring_size > 1
 
         rank = int(os.getenv("RANK", 0))
         world_size = int(os.getenv("WORLD_SIZE", 1))
         local_rank = int(os.getenv("LOCAL_RANK", 0))
         device = local_rank
+        if gpu_is_available():
+            device_label = resolve_device_label(torch.device(f"cuda:{local_rank}"))
+        else:
+            device_label = "cpu"
+        self._log = self._log.bind(device=device_label)
 
         if offload_model is None:
             offload_model = False if world_size > 1 else True
-            logger.info(
-                f"offload_model is not specified, set to {offload_model}.")
+            self._log.info("offload_model is not specified, set to {}.", offload_model)
         if world_size > 1:
-            pass
-            # torch.cuda.set_device(local_rank)
-            # dist.init_process_group(
-            #     backend="nccl",
-            #     init_method="env://",
-            #     rank=rank,
-            #     world_size=world_size)
-            # logger.info("WanVideo flow: Init Process Group")
+            if gpu_is_available():
+                torch.cuda.set_device(local_rank)
+            if not dist.is_initialized():
+                dist.init_process_group(
+                    backend="nccl",
+                    init_method="env://",
+                    rank=rank,
+                    world_size=world_size,
+                )
+            self._log.info("WanVideo flow: Init Process Group")
         else:
             assert not (
                 t5_fsdp or dit_fsdp
-            ), f"t5_fsdp and dit_fsdp are not supported in non-distributed environments."
+            ), "t5_fsdp and dit_fsdp are not supported in non-distributed environments."
             assert not (
                 ulysses_size > 1 or ring_size > 1
-            ), f"context parallel are not supported in non-distributed environments."
-        
+            ), "context parallel are not supported in non-distributed environments."
+
         if ulysses_size > 1 or ring_size > 1:
-            assert ulysses_size * ring_size == world_size, f"The number of ulysses_size and ring_size should be equal to the world size."
-            from xfuser.core.distributed import (initialize_model_parallel,
-                                                init_distributed_environment)
+            require_xfuser_sequence_parallel("WanVideoModelFlow")
+            assert ulysses_size * ring_size == world_size, (
+                "The number of ulysses_size and ring_size should be equal to "
+                "the world size."
+            )
+            from xfuser.core.distributed import (
+                init_distributed_environment,
+                initialize_model_parallel,
+            )
+
             init_distributed_environment(
-                rank=dist.get_rank(), world_size=dist.get_world_size())
+                rank=dist.get_rank(), world_size=dist.get_world_size()
+            )
 
             initialize_model_parallel(
                 sequence_parallel_degree=dist.get_world_size(),
                 ring_degree=ring_size,
                 ulysses_degree=ulysses_size,
             )
-            logger.info("WanVideo flow: Init Ring/Ulysses Seqeunce Parallel Process Group")
+            self._log.info(
+                "WanVideo flow: Init Ring/Ulysses Seqeunce Parallel Process Group"
+            )
 
         if use_prompt_extend:
             if prompt_extend_method == "dashscope":
                 self.prompt_expander = DashScopePromptExpander(
-                    model_name=prompt_extend_model, is_vl="i2v" in task)
+                    model_name=prompt_extend_model, is_vl="i2v" in task
+                )
             elif prompt_extend_method == "local_qwen":
                 self.prompt_expander = QwenPromptExpander(
-                    model_name=prompt_extend_model,
-                    is_vl="i2v" in task,
-                    device=rank)
+                    model_name=prompt_extend_model, is_vl="i2v" in task, device=rank
+                )
             else:
                 raise NotImplementedError(
-                    f"Unsupport prompt_extend_method: {prompt_extend_method}")
-            logger.info("WanVideo flow: Set Prompt Extention")
+                    f"Unsupport prompt_extend_method: {prompt_extend_method}"
+                )
+            self._log.info("WanVideo flow: Set Prompt Extention")
 
         cfg = WAN_CONFIGS[task]
         self.cfg = cfg
         if ulysses_size > 1:
-            assert cfg.num_heads % ulysses_size == 0, f"`{cfg.num_heads=}` cannot be divided evenly by `{ulysses_size=}`."
+            num_heads = getattr(cfg, "num_heads", None)
+            assert num_heads is not None, "Wan config missing num_heads"
+            assert num_heads % ulysses_size == 0, (
+                f"`num_heads={num_heads}` cannot be divided evenly by "
+                f"`ulysses_size={ulysses_size}`."
+            )
 
-        logger.info(f"WanVideo flow: model config: {cfg}")
+        self._log.info(f"WanVideo flow: model config: {cfg}")
 
         if dist.is_initialized():
-            seed = [seed] if rank == 0 else [None]
-            dist.broadcast_object_list(seed, src=0)
-            seed = seed[0]
-            logger.info(f"WanVideo flow: broadcast seed")
-        
+            seed_list: list[int | None] = [seed] if rank == 0 else [None]
+            dist.broadcast_object_list(seed_list, src=0)
+            broadcast_seed = seed_list[0]
+            assert broadcast_seed is not None
+            seed = broadcast_seed
+            self.seed = seed
+            self._log.info("WanVideo flow: broadcast seed")
 
+        use_sp = self.use_sp
         if "t2v" in task or "t2i" in task:
-            logger.info("Creating WanT2V pipeline.")
+            self._log.info("Creating WanT2V pipeline.")
             self.wan_t2v = wan.WanT2V(
                 config=cfg,
                 checkpoint_dir=ckpt_path,
@@ -168,14 +240,11 @@ def __init__(
                 rank=rank,
                 t5_fsdp=t5_fsdp,
                 dit_fsdp=dit_fsdp,
-                use_usp=(ulysses_size > 1 or ring_size > 1),
+                use_sp=use_sp,
                 t5_cpu=t5_cpu,
-                first_stage_model=self.first_stage_model,
-                cond_stage_model=self.cond_stage_model,
-                denoiser=self.denoiser
             )
         else:
-            logger.info("Creating WanI2V pipeline.")
+            self._log.info("Creating WanI2V pipeline.")
             self.wan_i2v = wan.WanI2V(
                 config=cfg,
                 checkpoint_dir=ckpt_path,
@@ -183,27 +252,27 @@ def __init__(
                 rank=rank,
                 t5_fsdp=t5_fsdp,
                 dit_fsdp=dit_fsdp,
-                use_usp=(ulysses_size > 1 or ring_size > 1),
+                use_sp=use_sp,
                 t5_cpu=t5_cpu,
-                first_stage_model=self.first_stage_model,
-                cond_stage_model=self.cond_stage_model,
-                cond_stage_2_model=self.cond_stage_2_model,
-                denoiser=self.denoiser
             )
-        
-    def _validate_args(self, args):        
+
+        init_wan_training_denoisers(self)
+
+    def _validate_args(self, args):
         # Size reassign and check
         args.size = f"{args.width}*{args.height}"
-        logger.info(f"setting size = width*height == {args.size}")
-        assert args.size in SUPPORTED_SIZES[
-            self.task], f"Unsupport size {args.size} for task {self.task}, supported sizes are: {', '.join(SUPPORTED_SIZES[self.task])}"
+        self._log.info(f"setting size = width*height == {args.size}")
+        supported = ", ".join(SUPPORTED_SIZES[self.task])
+        assert args.size in SUPPORTED_SIZES[self.task], (
+            f"Unsupport size {args.size} for task {self.task}, "
+            f"supported sizes are: {supported}"
+        )
 
     def inference_t2v(self, args: DictConfig):
         # init vars
         rank = int(os.getenv("RANK", 0))
-        world_size = int(os.getenv("WORLD_SIZE", 1))
-        local_rank = int(os.getenv("LOCAL_RANK", 0))
-        device = local_rank
+        int(os.getenv("WORLD_SIZE", 1))
+        int(os.getenv("LOCAL_RANK", 0))
 
         frames = args.frames
         size = args.size
@@ -215,24 +284,26 @@ def inference_t2v(self, args: DictConfig):
         # load input
         prompt_list = self.load_inference_inputs(args.prompt_file, args.mode)
         if len(prompt_list) > 1:
-            logger.warning("WanVideo currently does not support batch inference, we will run sample at a time")
-        
+            self._log.info("Processing prompts sequentially (batch size 1 per prompt).")
+
         videos = []
         gpu = []
         time = []
         for prompt in prompt_list:
-            logger.info(f"Input prompt: {prompt}")
+            self._log.info(f"Input prompt: {prompt}")
             if self.use_prompt_extend:
-                logger.info("Extending prompt ...")
+                assert self.prompt_expander is not None
+                self._log.info("Extending prompt ...")
                 if rank == 0:
                     prompt_output = self.prompt_expander(
-                        prompt,
-                        tar_lang=self.prompt_extend_target_lang,
-                        seed=self.seed)
-                    if prompt_output.status == False:
-                        logger.info(
-                            f"Extending prompt failed: {prompt_output.message}")
-                        logger.info("Falling back to original prompt.")
+                        prompt, tar_lang=self.prompt_extend_target_lang, seed=self.seed
+                    )
+                    assert prompt_output is not None
+                    if prompt_output.status is False:
+                        self._log.info(
+                            f"Extending prompt failed: {prompt_output.message}"
+                        )
+                        self._log.info("Falling back to original prompt.")
                         input_prompt = prompt
                     else:
                         input_prompt = prompt_output.prompt
@@ -242,38 +313,57 @@ def inference_t2v(self, args: DictConfig):
                 if dist.is_initialized():
                     dist.broadcast_object_list(input_prompt, src=0)
                 prompt = input_prompt[0]
-                logger.info(f"Extended prompt: {prompt}")
-
-            logger.info(
-                f"Generating {'image' if 't2i' in self.task else 'video'} ...")
-            result_with_metrics = self.wan_t2v.generate(
-                prompt,
-                size=SIZE_CONFIGS[size],
-                frame_num=frames,
-                shift=sample_shift,
-                sample_solver=sample_solver,
-                sampling_steps=sampling_steps,
-                guide_scale=guide_scale,
-                seed=self.seed,
-                offload_model=self.offload_model)
-            video = result_with_metrics['result']
+                self._log.info(f"Extended prompt: {prompt}")
+
+            self._log.info(
+                f"Generating {'image' if 't2i' in self.task else 'video'} ..."
+            )
+
+            @monitor_resources(return_metrics=True, frames=frames)
+            def _run_generate():
+                return self.wan_t2v.generate(
+                    prompt,
+                    size=SIZE_CONFIGS[size],
+                    frame_num=frames,
+                    shift=sample_shift,
+                    sample_solver=sample_solver,
+                    sampling_steps=sampling_steps,
+                    guide_scale=guide_scale,
+                    seed=self.seed,
+                    offload_model=self.offload_model,
+                )
+
+            result_with_metrics = _run_generate()
+            video = result_with_metrics["result"]
             videos.append(video)
 
-            gpu.append(result_with_metrics.get('gpu', -1.0))
-            time.append(result_with_metrics.get('time', -1.0))
+            gpu.append(
+                result_with_metrics.get("peak_vram_gb")
+                or result_with_metrics.get("gpu", -1.0)
+            )
+            time.append(
+                result_with_metrics.get("wall_time_s")
+                or result_with_metrics.get("time", -1.0)
+            )
 
         if rank == 0:
-            logger.info("Saving videos")
+            self._log.info("Saving videos")
             filenames = self.process_savename(prompt_list, args.n_samples_prompt)
-            self.save_videos(torch.stack(videos).unsqueeze(dim=1), args.savedir, filenames, fps=args.savefps)
-            self.save_metrics(gpu=gpu, time=time, config=args, savedir=args.savedir)
+            self.save_videos(
+                torch.stack(videos).unsqueeze(dim=1),
+                args.savedir,
+                filenames,
+                fps=args.savefps,
+            )
+            self.save_metrics(
+                gpu=gpu, time=time, config=args, savedir=args.savedir, frames=frames
+            )
 
     def inference_i2v(self, args: DictConfig):
         # init vars
         rank = int(os.getenv("RANK", 0))
-        world_size = int(os.getenv("WORLD_SIZE", 1))
-        local_rank = int(os.getenv("LOCAL_RANK", 0))
-        device = local_rank
+        int(os.getenv("WORLD_SIZE", 1))
+        int(os.getenv("LOCAL_RANK", 0))
 
         frames = args.frames
         size = args.size
@@ -283,31 +373,37 @@ def inference_i2v(self, args: DictConfig):
         guide_scale = args.unconditional_guidance_scale
 
         prompt_list, image_list = self.load_inference_inputs(args.prompt_dir, args.mode)
-        assert len(prompt_list) == len(image_list), "prompt and image number should match"
-        
-        if len(prompt_list) > 0:
-            logger.warning("WanVideo currently does not support batch inference, we will run sample at a time")
-            
+        assert len(prompt_list) == len(
+            image_list
+        ), "prompt and image number should match"
+
+        if len(prompt_list) > 1:
+            self._log.info("Processing prompts sequentially (batch size 1 per prompt).")
+
         videos = []
         gpu = []
         time = []
         for prompt, image_path in zip(prompt_list, image_list):
-            logger.info(f"Input prompt: {prompt}")
-            logger.info(f"Input image: {image_path}")
+            self._log.info(f"Input prompt: {prompt}")
+            self._log.info(f"Input image: {image_path}")
 
             img = Image.open(image_path).convert("RGB")
             if self.use_prompt_extend:
-                logger.info("Extending prompt ...")
+                assert self.prompt_expander is not None
+                self._log.info("Extending prompt ...")
                 if rank == 0:
                     prompt_output = self.prompt_expander(
                         prompt,
                         tar_lang=self.prompt_extend_target_lang,
                         image=img,
-                        seed=self.seed)
-                    if prompt_output.status == False:
-                        logger.info(
-                            f"Extending prompt failed: {prompt_output.message}")
-                        logger.info("Falling back to original prompt.")
+                        seed=self.seed,
+                    )
+                    assert prompt_output is not None
+                    if prompt_output.status is False:
+                        self._log.info(
+                            f"Extending prompt failed: {prompt_output.message}"
+                        )
+                        self._log.info("Falling back to original prompt.")
                         input_prompt = prompt
                     else:
                         input_prompt = prompt_output.prompt
@@ -317,78 +413,106 @@ def inference_i2v(self, args: DictConfig):
                 if dist.is_initialized():
                     dist.broadcast_object_list(input_prompt, src=0)
                 prompt = input_prompt[0]
-                logger.info(f"Extended prompt: {prompt}")
-
-
-            logger.info("Generating video ...")
-            result_with_metrics = self.wan_i2v.generate(
-                prompt,
-                img,
-                max_area=MAX_AREA_CONFIGS[size],
-                frame_num=frames,                                  
-                shift=sample_shift,
-                sample_solver=sample_solver,
-                sampling_steps=sampling_steps,
-                guide_scale=guide_scale,
-                seed=self.seed,
-                offload_model=self.offload_model)
-            
-            video = result_with_metrics['result']
+                self._log.info(f"Extended prompt: {prompt}")
+
+            self._log.info("Generating video ...")
+
+            @monitor_resources(return_metrics=True, frames=frames)
+            def _run_generate():
+                return self.wan_i2v.generate(
+                    prompt,
+                    img,
+                    max_area=MAX_AREA_CONFIGS[size],
+                    frame_num=frames,
+                    shift=sample_shift,
+                    sample_solver=sample_solver,
+                    sampling_steps=sampling_steps,
+                    guide_scale=guide_scale,
+                    seed=self.seed,
+                    offload_model=self.offload_model,
+                )
+
+            result_with_metrics = _run_generate()
+            video = result_with_metrics["result"]
             video = video.cpu()
             videos.append(video)
-            gpu.append(result_with_metrics.get('gpu', -1.0))
-            time.append(result_with_metrics.get('time', -1.0))
+            gpu.append(
+                result_with_metrics.get("peak_vram_gb")
+                or result_with_metrics.get("gpu", -1.0)
+            )
+            time.append(
+                result_with_metrics.get("wall_time_s")
+                or result_with_metrics.get("time", -1.0)
+            )
             del result_with_metrics
-            
+
         if rank == 0:
-            logger.info("Saving videos")
+            self._log.info("Saving videos")
             filenames = self.process_savename(prompt_list, args.n_samples_prompt)
-            self.save_videos(torch.stack(videos).unsqueeze(dim=1), args.savedir, filenames, fps=args.savefps)
-            self.save_metrics(gpu=gpu, time=time, config=args, savedir=args.savedir)
+            self.save_videos(
+                torch.stack(videos).unsqueeze(dim=1),
+                args.savedir,
+                filenames,
+                fps=args.savefps,
+            )
+            self.save_metrics(
+                gpu=gpu, time=time, config=args, savedir=args.savedir, frames=frames
+            )
 
-    @torch.no_grad()
-    def inference(self, args: DictConfig): 
-        # check input  
-        self._validate_args(args) 
+    @torch.inference_mode()
+    def inference(self, args: DictConfig):
+        # check input
+        self._validate_args(args)
 
         # t2v mode
-        if args.mode == VideoMode.T2V.value:  
+        if args.mode == VideoMode.T2V.value:
             self.inference_t2v(args)
         # i2v mode
         elif args.mode == VideoMode.I2V.value:
             self.inference_i2v(args)
         else:
-            raise ValueError("Error: invalid mode, we currently only support t2v and i2v for wanvideo")
-
-    def from_pretrained(self,
-                        ckpt_path: Optional[Union[str, Path]] = None,
-                        denoiser_ckpt_path: Optional[Union[str, Path]] = None,
-                        lora_ckpt_path: Optional[Union[str, Path]] = None,
-                        ignore_missing_ckpts: bool = False):
-        if "t2v" in self.task or "t2i" in self.task:
-            self.wan_t2v.load_weight()
-            #this is only used to load trained denoiser_ckpt_path, 
-            #so we set ignore missing ckpts avoid duplicate loading
-            self.load_denoiser(ckpt_path, denoiser_ckpt_path, True)
-        else:
-            self.wan_i2v.load_weight()
+            raise ValueError(
+                "Error: invalid mode, we currently only support t2v and i2v "
+                "for wanvideo"
+            )
+
+    def from_pretrained(
+        self,
+        ckpt_path: Optional[Union[str, Path]] = None,
+        denoiser_ckpt_path: Optional[Union[str, Path]] = None,
+        lora_ckpt_path: Optional[Union[str, Path]] = None,
+        ignore_missing_ckpts: bool = False,
+        device: Optional[str] = None,
+        **kwargs,
+    ) -> None:
+        if denoiser_ckpt_path is not None or ckpt_path is not None:
             self.load_denoiser(ckpt_path, denoiser_ckpt_path, True)
-    
-    def enable_vram_management(self):
-        if "t2v" in self.task or "t2i" in self.task:
-            self.wan_t2v.enable_vram_management()
-        else:
-            self.wan_i2v.enable_vram_management()
-    
+        if not self.use_sp:
+            if "t2v" in self.task or "t2i" in self.task:
+                self.wan_t2v.low_noise_model = cast(
+                    Any, maybe_compile_denoiser(self.wan_t2v.low_noise_model)
+                )
+                self.wan_t2v.high_noise_model = cast(
+                    Any, maybe_compile_denoiser(self.wan_t2v.high_noise_model)
+                )
+            else:
+                self.wan_i2v.low_noise_model = cast(
+                    Any, maybe_compile_denoiser(self.wan_i2v.low_noise_model)
+                )
+                self.wan_i2v.high_noise_model = cast(
+                    Any, maybe_compile_denoiser(self.wan_i2v.high_noise_model)
+                )
+
+    def enable_vram_management(self) -> None:
+        self._log.info(
+            "WanVideoModelFlow: VRAM handled via offload_model in generate(); no-op"
+        )
+
     def training_step(self, batch, batch_idx):
-        #self.first_stage_model.disable_cache()
-        if "t2v" in self.task or "t2i" in self.task:
-            loss = self.wan_t2v.training_step(batch, batch_idx, self.first_stage_key, self.cond_stage_key)
-        else:
-            loss = self.wan_i2v.training_step(batch, batch_idx, self.first_stage_key, self.cond_stage_key)
-        self.log("train_loss", loss, prog_bar=True, on_step=True)
+        loss = compute_wan_flow_matching_loss(self, batch)
+        self.log("train/loss", loss, prog_bar=True, on_step=True, on_epoch=True)
         return loss
-    
+
     @torch.no_grad()
     def log_images(self, batch, **kwargs):
-        pass
\ No newline at end of file
+        pass
diff --git a/videotuna/models/cogvideo_hf/cogvideo_i2v.py b/videotuna/models/cogvideo_hf/cogvideo_i2v.py
deleted file mode 100644
index 8def368a..00000000
--- a/videotuna/models/cogvideo_hf/cogvideo_i2v.py
+++ /dev/null
@@ -1,660 +0,0 @@
-import inspect
-import math
-import random
-from tqdm import tqdm
-from typing import Callable, Dict, List, Optional, Tuple, Union
-
-import PIL
-import torch
-
-from diffusers import CogVideoXDPMScheduler
-from diffusers.callbacks import MultiPipelineCallbacks, PipelineCallback
-from diffusers.image_processor import PipelineImageInput
-from diffusers.pipelines.cogvideo.pipeline_output import CogVideoXPipelineOutput
-from diffusers.utils.torch_utils import randn_tensor
-
-from videotuna.models.cogvideo_hf.cogvideo_pl import CogVideoXWorkFlow, retrieve_timesteps
-from videotuna.utils.common_utils import precision_to_dtype
-from tqdm import tqdm
-
-def retrieve_latents(
-    encoder_output: torch.Tensor,
-    generator: Optional[torch.Generator] = None,
-    sample_mode: str = "sample",
-):
-    if hasattr(encoder_output, "latent_dist") and sample_mode == "sample":
-        return encoder_output.latent_dist.sample(generator)
-    elif hasattr(encoder_output, "latent_dist") and sample_mode == "argmax":
-        return encoder_output.latent_dist.mode()
-    elif hasattr(encoder_output, "latents"):
-        return encoder_output.latents
-    else:
-        raise AttributeError("Could not access latents of provided encoder_output")
-
-class CogVideoXI2V(CogVideoXWorkFlow):
-    _callback_tensor_inputs = [
-        "latents",
-        "prompt_embeds",
-        "negative_prompt_embeds",
-    ]
-
-    def __init__(
-        self,
-        first_stage_config,
-        cond_stage_config,
-        denoiser_config,
-        scheduler_config,
-        learning_rate: float = 6e-6,
-        adapter_config=None,
-        noised_image_input: bool = False,
-        noised_image_dropout: float = 0.05,
-        logdir=None,
-    ):
-        super().__init__(
-            first_stage_config,
-            cond_stage_config,
-            denoiser_config,
-            scheduler_config,
-            learning_rate,
-            adapter_config,
-            logdir,
-        )
-        self.noised_image_input = noised_image_input
-        self.noised_image_dropout = noised_image_dropout
-
-    def encode_image(self, image):
-        image = image.to(self.device, dtype=self.dtype).unsqueeze(0) # [3, 1, 480, 720] -> [1, 3, 1, 480, 720]
-        latent_dist = self.vae.encode(image).latent_dist
-        return latent_dist
-
-    def get_batch_input(self, batch):
-        """
-        Prepare model batch inputs
-        """
-        # prompt
-        prompts = [item for item in batch["caption"]]
-
-        # video latents
-        videos = [self.encode_video(video) for video in batch["video"]]
-        videos = [video.sample() * self.vae.config.scaling_factor for video in videos]
-        videos = torch.cat(videos, dim=0)
-        videos = videos.to(memory_format=torch.contiguous_format)
-        # image latents
-        images = [self.encode_image(image) for image in batch["image"]]
-        images = [image.sample() * self.vae.config.scaling_factor for image in images]
-        images = torch.cat(images, dim=0).to(memory_format=torch.contiguous_format)
-        
-        videos = videos.permute(0, 2, 1, 3, 4).contiguous() # [B, C, T, H, W] -> [B, T, C, H, W]
-        images = images.permute(0, 2, 1, 3, 4).contiguous() # [B, C, T, H, W] -> [B, T, C, H, W]
-
-        # pad conditional image latents
-        padding_shape = (
-            videos.shape[0],
-            videos.shape[1] - 1,
-            *videos.shape[2:],
-        )
-        latent_padding = videos.new_zeros(padding_shape)
-        images = torch.cat([images, latent_padding], dim=1)  # [B, 4, 16, 60, 90]
-        # conditional image dropout
-        if random.random() < self.noised_image_dropout:
-            images = torch.zeros_like(images)
-
-        return {
-            "videos": videos,
-            "images": images,
-            "prompts": prompts,
-        }
-
-    def training_step(self, batch, batch_idx):
-        batch = self.get_batch_input(batch)
-        model_input = batch["videos"]
-        image_latents = batch["images"]
-        prompts = batch["prompts"]
-
-        max_sequence_length = 226
-        with torch.no_grad():
-            prompt_embeds = self.encode_prompt(
-                prompts,
-                do_classifier_free_guidance=False,  # set to false for train
-                num_videos_per_prompt=1,
-                max_sequence_length=max_sequence_length,
-                device=self.device,
-            )
-
-        batch_size, num_frames, num_channels, height, width = model_input.shape
-
-        # Sample noise that will be added to the latents
-        noise = torch.randn_like(model_input)
-
-        # Sample a random timestep for each image
-        timesteps = torch.randint(
-            0,
-            self.scheduler.config.num_train_timesteps,
-            (batch_size,),
-            device=self.device,
-        )
-        timesteps = timesteps.long()
-
-        # Prepare rotary embeds
-        image_rotary_emb = (
-            # in the first place, we assume this function is the same during inference and train.
-            self._prepare_rotary_positional_embeddings(
-                height=height * self.vae_scale_factor_spatial,
-                width=width * self.vae_scale_factor_spatial,
-                num_frames=num_frames,
-                vae_scale_factor_spatial=self.vae_scale_factor_spatial,
-                patch_size=self.model.config.patch_size,
-                attention_head_dim=self.model.config.attention_head_dim,
-                device=self.device,
-            )
-            if self.model.config.use_rotary_positional_embeddings
-            else None
-        )
-
-        # Add noise to the model input according to the noise magnitude at each timestep
-        # (this is the forward diffusion process)
-        noisy_video_latents = self.scheduler.add_noise(model_input, noise, timesteps)
-        # concate conditional image
-        noisy_model_input = torch.cat([noisy_video_latents, image_latents], dim=2)
-        model_output = self.model(
-            hidden_states=noisy_model_input,
-            encoder_hidden_states=prompt_embeds,
-            timestep=timesteps,
-            image_rotary_emb=image_rotary_emb,
-            return_dict=False,
-        )[0]
-        model_pred = self.scheduler.get_velocity(
-            model_output, noisy_video_latents, timesteps
-        )
-
-        alphas_cumprod = self.scheduler.alphas_cumprod[timesteps]
-        weights = 1 / (1 - alphas_cumprod)
-        while len(weights.shape) < len(model_pred.shape):
-            weights = weights.unsqueeze(-1)
-
-        target = model_input
-        # TODO: inherent loss computation from base class.
-        loss = torch.mean(
-            (weights * (model_pred - target) ** 2).reshape(batch_size, -1), dim=1
-        )
-        loss = loss.mean()
-        return loss
-
-    def prepare_latents(
-        self,
-        image: torch.Tensor,
-        batch_size: int = 1,
-        num_channels_latents: int = 16,
-        num_frames: int = 13,
-        height: int = 60,
-        width: int = 90,
-        dtype: Optional[torch.dtype] = None,
-        device: Optional[torch.device] = None,
-        generator: Optional[torch.Generator] = None,
-        latents: Optional[torch.Tensor] = None,
-    ):
-        num_frames = (num_frames - 1) // self.vae_scale_factor_temporal + 1
-        shape = (
-            batch_size,
-            num_frames,
-            num_channels_latents,
-            height // self.vae_scale_factor_spatial,
-            width // self.vae_scale_factor_spatial,
-        )
-
-        if isinstance(generator, list) and len(generator) != batch_size:
-            raise ValueError(
-                f"You have passed a list of generators of length {len(generator)}, but requested an effective batch"
-                f" size of {batch_size}. Make sure the batch size matches the length of the generators."
-            )
-
-        if isinstance(image, torch.Tensor):
-            if image.ndim < 5:
-                image = image.unsqueeze(2)
-
-        image = image.to(self.vae.dtype)
-
-        if isinstance(generator, list):
-            image_latents = [
-                retrieve_latents(self.vae.encode(image[i].unsqueeze(0)), generator[i])
-                for i in range(batch_size)
-            ]
-        else:
-            image_latents = [
-                retrieve_latents(self.vae.encode(img.unsqueeze(0)), generator)
-                for img in image
-            ]
-
-        image_latents = (
-            torch.cat(image_latents, dim=0).to(dtype).permute(0, 2, 1, 3, 4)
-        )  # [B, T, C, H, W]
-        image_latents = self.vae.config.scaling_factor * image_latents
-
-        # pad conditional images
-        padding_shape = (
-            batch_size,
-            num_frames - 1,
-            num_channels_latents,
-            height // self.vae_scale_factor_spatial,
-            width // self.vae_scale_factor_spatial,
-        )
-        latent_padding = torch.zeros(padding_shape, device=device, dtype=dtype)
-        image_latents = torch.cat([image_latents, latent_padding], dim=1)
-
-        if latents is None:
-            latents = randn_tensor(
-                shape, generator=generator, device=device, dtype=dtype
-            )
-        else:
-            latents = latents.to(device)
-
-        # scale the initial noise by the standard deviation required by the scheduler
-        latents = latents * self.scheduler.init_noise_sigma
-        return latents, image_latents
-
-    def check_inputs(
-        self,
-        image,
-        prompt,
-        height,
-        width,
-        negative_prompt,
-        callback_on_step_end_tensor_inputs,
-        video=None,
-        latents=None,
-        prompt_embeds=None,
-        negative_prompt_embeds=None,
-    ):
-        if (
-            not isinstance(image, torch.Tensor)
-            and not isinstance(image, PIL.Image.Image)
-            and not isinstance(image, list)
-        ):
-            raise ValueError(
-                "`image` has to be of type `torch.Tensor` or `PIL.Image.Image` or `List[PIL.Image.Image]` but is"
-                f" {type(image)}"
-            )
-
-        if height % 8 != 0 or width % 8 != 0:
-            raise ValueError(
-                f"`height` and `width` have to be divisible by 8 but are {height} and {width}."
-            )
-
-        if callback_on_step_end_tensor_inputs is not None and not all(
-            k in self._callback_tensor_inputs for k in callback_on_step_end_tensor_inputs
-        ):
-            raise ValueError(
-                f"`callback_on_step_end_tensor_inputs` has to be in {self._callback_tensor_inputs}, but found {[k for k in callback_on_step_end_tensor_inputs if k not in self._callback_tensor_inputs]}"
-            )
-        if prompt is not None and prompt_embeds is not None:
-            raise ValueError(
-                f"Cannot forward both `prompt`: {prompt} and `prompt_embeds`: {prompt_embeds}. Please make sure to"
-                " only forward one of the two."
-            )
-        elif prompt is None and prompt_embeds is None:
-            raise ValueError(
-                "Provide either `prompt` or `prompt_embeds`. Cannot leave both `prompt` and `prompt_embeds` undefined."
-            )
-        elif prompt is not None and (
-            not isinstance(prompt, str) and not isinstance(prompt, list)
-        ):
-            raise ValueError(
-                f"`prompt` has to be of type `str` or `list` but is {type(prompt)}"
-            )
-
-        if prompt is not None and negative_prompt_embeds is not None:
-            raise ValueError(
-                f"Cannot forward both `prompt`: {prompt} and `negative_prompt_embeds`:"
-                f" {negative_prompt_embeds}. Please make sure to only forward one of the two."
-            )
-
-        if negative_prompt is not None and negative_prompt_embeds is not None:
-            raise ValueError(
-                f"Cannot forward both `negative_prompt`: {negative_prompt} and `negative_prompt_embeds`:"
-                f" {negative_prompt_embeds}. Please make sure to only forward one of the two."
-            )
-
-        if prompt_embeds is not None and negative_prompt_embeds is not None:
-            if prompt_embeds.shape != negative_prompt_embeds.shape:
-                raise ValueError(
-                    "`prompt_embeds` and `negative_prompt_embeds` must have the same shape when passed directly, but"
-                    f" got: `prompt_embeds` {prompt_embeds.shape} != `negative_prompt_embeds`"
-                    f" {negative_prompt_embeds.shape}."
-                )
-
-        if video is not None and latents is not None:
-            raise ValueError("Only one of `video` or `latents` should be provided")
-
-    @torch.no_grad()
-    def sample(
-        self,
-        image: PipelineImageInput,
-        prompt: Optional[Union[str, List[str]]] = None,
-        negative_prompt: Optional[Union[str, List[str]]] = None,
-        height: int = 480,
-        width: int = 720,
-        num_frames: int = 49,
-        num_inference_steps: int = 50,
-        timesteps: Optional[List[int]] = None,
-        guidance_scale: float = 6,
-        use_dynamic_cfg: bool = False,
-        num_videos_per_prompt: int = 1,
-        eta: float = 0.0,
-        generator: Optional[Union[torch.Generator, List[torch.Generator]]] = None,
-        latents: Optional[torch.FloatTensor] = None,
-        prompt_embeds: Optional[torch.FloatTensor] = None,
-        negative_prompt_embeds: Optional[torch.FloatTensor] = None,
-        output_type: str = "pil",
-        sample_precision: str = None,
-        return_dict: bool = True,
-        callback_on_step_end: Optional[
-            Union[
-                Callable[[int, int, Dict], None],
-                PipelineCallback,
-                MultiPipelineCallbacks,
-            ]
-        ] = None,
-        callback_on_step_end_tensor_inputs: List[str] = ["latents"],
-        max_sequence_length: int = 226,
-        progress_bar: bool = True,
-    ) -> Union[CogVideoXPipelineOutput, Tuple]:
-        """
-        Function invoked when calling the pipeline for generation.
-
-        Args:
-            image (`PipelineImageInput`):
-                The input video to condition the generation on. Must be an image, a list of images or a `torch.Tensor`.
-            prompt (`str` or `List[str]`, *optional*):
-                The prompt or prompts to guide the image generation. If not defined, one has to pass `prompt_embeds`.
-                instead.
-            negative_prompt (`str` or `List[str]`, *optional*):
-                The prompt or prompts not to guide the image generation. If not defined, one has to pass
-                `negative_prompt_embeds` instead. Ignored when not using guidance (i.e., ignored if `guidance_scale` is
-                less than `1`).
-            height (`int`, *optional*, defaults to self.unet.config.sample_size * self.vae_scale_factor):
-                The height in pixels of the generated image. This is set to 1024 by default for the best results.
-            width (`int`, *optional*, defaults to self.unet.config.sample_size * self.vae_scale_factor):
-                The width in pixels of the generated image. This is set to 1024 by default for the best results.
-            num_frames (`int`, defaults to `48`):
-                Number of frames to generate. Must be divisible by self.vae_scale_factor_temporal. Generated video will
-                contain 1 extra frame because CogVideoX is conditioned with (num_seconds * fps + 1) frames where
-                num_seconds is 6 and fps is 4. However, since videos can be saved at any fps, the only condition that
-                needs to be satisfied is that of divisibility mentioned above.
-            num_inference_steps (`int`, *optional*, defaults to 50):
-                The number of denoising steps. More denoising steps usually lead to a higher quality image at the
-                expense of slower inference.
-            timesteps (`List[int]`, *optional*):
-                Custom timesteps to use for the denoising process with schedulers which support a `timesteps` argument
-                in their `set_timesteps` method. If not defined, the default behavior when `num_inference_steps` is
-                passed will be used. Must be in descending order.
-            guidance_scale (`float`, *optional*, defaults to 7.0):
-                Guidance scale as defined in [Classifier-Free Diffusion Guidance](https://arxiv.org/abs/2207.12598).
-                `guidance_scale` is defined as `w` of equation 2. of [Imagen
-                Paper](https://arxiv.org/pdf/2205.11487.pdf). Guidance scale is enabled by setting `guidance_scale >
-                1`. Higher guidance scale encourages to generate images that are closely linked to the text `prompt`,
-                usually at the expense of lower image quality.
-            num_videos_per_prompt (`int`, *optional*, defaults to 1):
-                The number of videos to generate per prompt.
-            generator (`torch.Generator` or `List[torch.Generator]`, *optional*):
-                One or a list of [torch generator(s)](https://pytorch.org/docs/stable/generated/torch.Generator.html)
-                to make generation deterministic.
-            latents (`torch.FloatTensor`, *optional*):
-                Pre-generated noisy latents, sampled from a Gaussian distribution, to be used as inputs for image
-                generation. Can be used to tweak the same generation with different prompts. If not provided, a latents
-                tensor will ge generated by sampling using the supplied random `generator`.
-            prompt_embeds (`torch.FloatTensor`, *optional*):
-                Pre-generated text embeddings. Can be used to easily tweak text inputs, *e.g.* prompt weighting. If not
-                provided, text embeddings will be generated from `prompt` input argument.
-            negative_prompt_embeds (`torch.FloatTensor`, *optional*):
-                Pre-generated negative text embeddings. Can be used to easily tweak text inputs, *e.g.* prompt
-                weighting. If not provided, negative_prompt_embeds will be generated from `negative_prompt` input
-                argument.
-            output_type (`str`, *optional*, defaults to `"pil"`):
-                The output format of the generate image. Choose between
-                [PIL](https://pillow.readthedocs.io/en/stable/): `PIL.Image.Image` or `np.array`.
-            return_dict (`bool`, *optional*, defaults to `True`):
-                Whether or not to return a [`~pipelines.stable_diffusion_xl.StableDiffusionXLPipelineOutput`] instead
-                of a plain tuple.
-            callback_on_step_end (`Callable`, *optional*):
-                A function that calls at the end of each denoising steps during the inference. The function is called
-                with the following arguments: `callback_on_step_end(self: DiffusionPipeline, step: int, timestep: int,
-                callback_kwargs: Dict)`. `callback_kwargs` will include a list of all tensors as specified by
-                `callback_on_step_end_tensor_inputs`.
-            callback_on_step_end_tensor_inputs (`List`, *optional*):
-                The list of tensor inputs for the `callback_on_step_end` function. The tensors specified in the list
-                will be passed as `callback_kwargs` argument. You will only be able to include variables listed in the
-                `._callback_tensor_inputs` attribute of your pipeline class.
-            max_sequence_length (`int`, defaults to `226`):
-                Maximum sequence length in encoded prompt. Must be consistent with
-                `self.model.config.max_text_seq_length` otherwise may lead to poor results.
-
-        Examples:
-
-        Returns:
-            [`~pipelines.cogvideo.pipeline_output.CogVideoXPipelineOutput`] or `tuple`:
-            [`~pipelines.cogvideo.pipeline_output.CogVideoXPipelineOutput`] if `return_dict` is True, otherwise a
-            `tuple`. When returning a tuple, the first element is a list with the generated images.
-        """
-
-        if num_frames > 49:
-            raise ValueError(
-                "The number of frames must be less than 49 for now due to static positional embeddings. This will be updated in the future to remove this limitation."
-            )
-
-        if isinstance(callback_on_step_end, (PipelineCallback, MultiPipelineCallbacks)):
-            callback_on_step_end_tensor_inputs = callback_on_step_end.tensor_inputs
-
-        height = height or self.model.config.sample_size * self.vae_scale_factor_spatial
-        width = width or self.model.config.sample_size * self.vae_scale_factor_spatial
-        num_videos_per_prompt = 1
-
-        # 1. Check inputs. Raise error if not correct
-        self.check_inputs(
-            image,
-            prompt,
-            height,
-            width,
-            negative_prompt,
-            callback_on_step_end_tensor_inputs,
-            prompt_embeds,
-            negative_prompt_embeds,
-        )
-        self._guidance_scale = guidance_scale
-        self._interrupt = False
-
-        # 2. Default call parameters
-        if prompt is not None and isinstance(prompt, str):
-            batch_size = 1
-        elif prompt is not None and isinstance(prompt, list):
-            batch_size = len(prompt)
-        else:
-            batch_size = prompt_embeds.shape[0]
-
-        device = self.device
-        if sample_precision is not None:
-            ori_dtype = self.model.dtype
-            dtype = precision_to_dtype[sample_precision]
-            self.model.to(dtype)
-        else:
-            dtype = self.model.dtype
-
-        # here `guidance_scale` is defined analog to the guidance weight `w` of equation (2)
-        # of the Imagen paper: https://arxiv.org/pdf/2205.11487.pdf . `guidance_scale = 1`
-        # corresponds to doing no classifier free guidance.
-        do_classifier_free_guidance = guidance_scale > 1.0
-
-        # 3. Encode input prompt
-        prompt_embeds, negative_prompt_embeds = self.encode_prompt(
-            prompt=prompt,
-            negative_prompt=negative_prompt,
-            do_classifier_free_guidance=do_classifier_free_guidance,
-            num_videos_per_prompt=num_videos_per_prompt,
-            prompt_embeds=prompt_embeds,
-            negative_prompt_embeds=negative_prompt_embeds,
-            max_sequence_length=max_sequence_length,
-            device=device,
-            dtype=dtype,
-        )
-        if do_classifier_free_guidance:
-            prompt_embeds = torch.cat([negative_prompt_embeds, prompt_embeds], dim=0)
-
-        # 4. Prepare timesteps
-        timesteps, num_inference_steps = retrieve_timesteps(
-            self.scheduler, num_inference_steps, device, timesteps
-        )
-        self._num_timesteps = len(timesteps)
-
-        ## Prepare input image
-        if (isinstance(image, torch.Tensor) and image.ndim == 5):
-            pass
-        else:
-            image = self.video_processor.preprocess(image, height=height, width=width).to(
-                device, dtype=dtype
-            )
-
-        # 5. Prepare latents
-        latent_channels = self.model.config.in_channels // 2
-        latents, image_latents = self.prepare_latents(
-            image,
-            batch_size * num_videos_per_prompt,
-            latent_channels,
-            num_frames,
-            height,
-            width,
-            dtype,
-            device,
-            generator,
-            latents,
-        )
-
-        # 6. Prepare extra step kwargs. TODO: Logic should ideally just be moved out of the pipeline
-        extra_step_kwargs = self.prepare_extra_step_kwargs(generator, eta)
-
-        # 7. Create rotary embeds if required
-        image_rotary_emb = (
-            self._prepare_rotary_positional_embeddings(
-                height, width, latents.size(1), device=device
-            )
-            if self.model.config.use_rotary_positional_embeddings
-            else None
-        )
-
-        # 8. Denoising loop
-        num_warmup_steps = max(
-            len(timesteps) - num_inference_steps * self.scheduler.order, 0
-        )
-
-        # for DPM-solver++
-        old_pred_original_sample = None
-        if progress_bar:
-            iters = tqdm(enumerate(timesteps), desc="Denoising Steps", total=num_inference_steps)
-        else:
-            iters = enumerate(timesteps)
-        for i, t in iters:
-
-            latent_model_input = (
-                torch.cat([latents] * 2) if do_classifier_free_guidance else latents
-            )
-            latent_model_input = self.scheduler.scale_model_input(latent_model_input, t)
-
-            latent_image_input = (
-                torch.cat([image_latents] * 2)
-                if do_classifier_free_guidance
-                else image_latents
-            )
-            latent_model_input = torch.cat(
-                [latent_model_input, latent_image_input], dim=2
-            )
-
-            # broadcast to batch dimension in a way that's compatible with ONNX/Core ML
-            timestep = t.expand(latent_model_input.shape[0])
-
-            # predict noise model_output
-            noise_pred = self.model(
-                hidden_states=latent_model_input,
-                encoder_hidden_states=prompt_embeds,
-                timestep=timestep,
-                image_rotary_emb=image_rotary_emb,
-                return_dict=False,
-            )[0]
-
-            # perform guidance
-            if use_dynamic_cfg:
-                self._guidance_scale = 1 + guidance_scale * (
-                    (
-                        1
-                        - math.cos(
-                            math.pi
-                            * ((num_inference_steps - t.item()) / num_inference_steps)
-                            ** 5.0
-                        )
-                    )
-                    / 2
-                )
-            else:
-                self.guidance_scale = guidance_scale
-            if do_classifier_free_guidance:
-                noise_pred_uncond, noise_pred_text = noise_pred.chunk(2)
-                noise_pred = noise_pred_uncond + self.guidance_scale * (
-                    noise_pred_text - noise_pred_uncond
-                )
-
-            # compute the previous noisy sample x_t -> x_t-1
-            if not isinstance(self.scheduler, CogVideoXDPMScheduler):
-                latents = self.scheduler.step(
-                    noise_pred, t, latents, **extra_step_kwargs, return_dict=False
-                )[0]
-            else:
-                latents, old_pred_original_sample = self.scheduler.step(
-                    noise_pred,
-                    old_pred_original_sample,
-                    t,
-                    timesteps[i - 1] if i > 0 else None,
-                    latents,
-                    **extra_step_kwargs,
-                    return_dict=False,
-                )
-            latents = latents.to(dtype)
-
-            # call the callback, if provided
-            if callback_on_step_end is not None:
-                callback_kwargs = {}
-                for k in callback_on_step_end_tensor_inputs:
-                    callback_kwargs[k] = locals()[k]
-                callback_outputs = callback_on_step_end(self, i, t, callback_kwargs)
-
-                latents = callback_outputs.pop("latents", latents)
-                prompt_embeds = callback_outputs.pop("prompt_embeds", prompt_embeds)
-                negative_prompt_embeds = callback_outputs.pop(
-                    "negative_prompt_embeds", negative_prompt_embeds
-                )
-
-
-        if not output_type == "latent":
-            video = self.decode_latents(latents)
-        else:
-            video = latents
-
-        video = video[None, ...].cpu()
-
-        torch.cuda.empty_cache()
-
-        if sample_precision is not None:
-            self.model.to(ori_dtype)
-        return video
-
-
-    @torch.no_grad()
-    def log_images(self, batch, **kwargs):
-        log = dict()
-        images = batch["image"].to(dtype=self.dtype) # [B, C, T, H, W] 
-        prompts = batch["caption"]
-        batch_samples = self.sample(images, prompts, 
-                                    num_inference_steps=50,
-                                    sample_precision="bfloat16",
-        )
-        log["inputs"] = batch["image"]
-        log["prompts"] = batch["caption"]
-        log["samples"] = batch_samples
-        
-        return log
\ No newline at end of file
diff --git a/videotuna/models/cogvideo_hf/cogvideo_pl.py b/videotuna/models/cogvideo_hf/cogvideo_pl.py
deleted file mode 100644
index b39e015e..00000000
--- a/videotuna/models/cogvideo_hf/cogvideo_pl.py
+++ /dev/null
@@ -1,923 +0,0 @@
-import inspect
-import math
-from tqdm import tqdm
-from typing import Any, Callable, Dict, List, Optional, Tuple, Union
-
-import torch
-import pytorch_lightning as pl
-from diffusers import CogVideoXDPMScheduler
-from diffusers.callbacks import MultiPipelineCallbacks, PipelineCallback
-from diffusers.models.embeddings import get_3d_rotary_pos_embed
-from diffusers.utils.torch_utils import randn_tensor
-from diffusers.video_processor import VideoProcessor
-from peft import get_peft_model
-from transformers import T5EncoderModel, T5Tokenizer
-
-from videotuna.utils.common_utils import instantiate_from_config
-from videotuna.utils.common_utils import precision_to_dtype, get_resize_crop_region_for_grid
-
-
-def has_nan(tensor):
-    return torch.isnan(tensor).any()
-
-
-# Copied from diffusers.pipelines.stable_diffusion.pipeline_stable_diffusion.retrieve_timesteps
-def retrieve_timesteps(
-    scheduler,
-    num_inference_steps: Optional[int] = None,
-    device: Optional[Union[str, torch.device]] = None,
-    timesteps: Optional[List[int]] = None,
-    sigmas: Optional[List[float]] = None,
-    **kwargs,
-):
-    """
-    Calls the scheduler's `set_timesteps` method and retrieves timesteps from the scheduler after the call. Handles
-    custom timesteps. Any kwargs will be supplied to `scheduler.set_timesteps`.
-
-    Args:
-        scheduler (`SchedulerMixin`):
-            The scheduler to get timesteps from.
-        num_inference_steps (`int`):
-            The number of diffusion steps used when generating samples with a pre-trained model. If used, `timesteps`
-            must be `None`.
-        device (`str` or `torch.device`, *optional*):
-            The device to which the timesteps should be moved to. If `None`, the timesteps are not moved.
-        timesteps (`List[int]`, *optional*):
-            Custom timesteps used to override the timestep spacing strategy of the scheduler. If `timesteps` is passed,
-            `num_inference_steps` and `sigmas` must be `None`.
-        sigmas (`List[float]`, *optional*):
-            Custom sigmas used to override the timestep spacing strategy of the scheduler. If `sigmas` is passed,
-            `num_inference_steps` and `timesteps` must be `None`.
-
-    Returns:
-        `Tuple[torch.Tensor, int]`: A tuple where the first element is the timestep schedule from the scheduler and the
-        second element is the number of inference steps.
-    """
-    if timesteps is not None and sigmas is not None:
-        raise ValueError(
-            "Only one of `timesteps` or `sigmas` can be passed. Please choose one to set custom values"
-        )
-    if timesteps is not None:
-        accepts_timesteps = "timesteps" in set(
-            inspect.signature(scheduler.set_timesteps).parameters.keys()
-        )
-        if not accepts_timesteps:
-            raise ValueError(
-                f"The current scheduler class {scheduler.__class__}'s `set_timesteps` does not support custom"
-                f" timestep schedules. Please check whether you are using the correct scheduler."
-            )
-        scheduler.set_timesteps(timesteps=timesteps, device=device, **kwargs)
-        timesteps = scheduler.timesteps
-        num_inference_steps = len(timesteps)
-    elif sigmas is not None:
-        accept_sigmas = "sigmas" in set(
-            inspect.signature(scheduler.set_timesteps).parameters.keys()
-        )
-        if not accept_sigmas:
-            raise ValueError(
-                f"The current scheduler class {scheduler.__class__}'s `set_timesteps` does not support custom"
-                f" sigmas schedules. Please check whether you are using the correct scheduler."
-            )
-        scheduler.set_timesteps(sigmas=sigmas, device=device, **kwargs)
-        timesteps = scheduler.timesteps
-        num_inference_steps = len(timesteps)
-    else:
-        scheduler.set_timesteps(num_inference_steps, device=device, **kwargs)
-        timesteps = scheduler.timesteps
-    return timesteps, num_inference_steps
-
-
-class CogVideoXWorkFlow(pl.LightningModule):
-    def __init__(
-        self,
-        first_stage_config,
-        cond_stage_config,
-        denoiser_config,
-        scheduler_config,
-        learning_rate: float = 6e-6,
-        adapter_config=None,
-        logdir=None,  # notice: this is not configured in config.yaml but configured in train.py
-    ):
-        super().__init__()
-        self.logdir = logdir
-        self.learning_rate = learning_rate
-        
-        self.instantiate_first_stage(first_stage_config)
-        self.instantiate_cond_stage(cond_stage_config)
-        
-        self.vae_scale_factor_spatial = (
-            2 ** (len(self.vae.config.block_out_channels) - 1)
-            if hasattr(self, "first_stage_model") and self is not None
-            else 8
-        )
-        self.vae_scale_factor_temporal = (
-            self.vae.config.temporal_compression_ratio
-            if hasattr(self, "first_stage_model") and self.vae is not None
-            else 4
-        )
-
-        self.video_processor = VideoProcessor(
-            vae_scale_factor=self.vae_scale_factor_spatial
-        )
-
-        self.model = instantiate_from_config(denoiser_config)
-        
-        if "load_dtype" in denoiser_config.params:
-            # only used in inference
-            if denoiser_config.params.load_dtype == "fp16":
-                print("Convert denoiser to fp16")
-                self.model.half()
-            elif denoiser_config.params.load_dtype == "bf16":
-                print("Convert denoiser to bf16")
-                self.model.bfloat16()
-
-        self.scheduler = instantiate_from_config(scheduler_config)
-        
-        # add adapter config (Support Lora and HRA )
-        self.lora_args = []
-        if adapter_config is not None:
-            self.inject_adapter(adapter_config)
-
-        self.model.enable_gradient_checkpointing()
-
-    def inject_adapter(self, adapter_config):
-        self.model.requires_grad_(False)
-        print("Injecting lora adapter")
-        transformer_adapter_config = instantiate_from_config(adapter_config)
-        print(transformer_adapter_config)
-        self.model = get_peft_model(self.model, transformer_adapter_config)
-        self.model.print_trainable_parameters()
-
-    def instantiate_first_stage(self, config):
-        model = instantiate_from_config(config)
-        self.vae = model.eval()
-        self.vae.requires_grad_(False)
-
-    @torch.no_grad()
-    def encode_first_stage(self, x):
-        x = x.permute(0, 2, 1, 3, 4)  # [B, C, F, H, W]
-        latent_dist = self.vae.encode(x).latent_dist
-        return latent_dist
-
-    def _decode_core(self, z, **kwargs):
-        z = 1.0 / self.scale_factor * z
-
-        if self.encoder_type == "2d" and z.dim() == 5:
-            return self.decode_first_stage_2DAE(z)
-        results = self.vae.decode(z, **kwargs)
-        return results
-
-    @torch.no_grad()
-    def decode_first_stage(self, z, **kwargs):
-        return self._decode_core(z, **kwargs)
-
-    def differentiable_decode_first_stage(self, z, **kwargs):
-        """same as decode_first_stage but without decorator"""
-        return self._decode_core(z, **kwargs)
-
-    def instantiate_cond_stage(self, config):
-        model = instantiate_from_config(config)
-        if config.get("freeze", True):
-            self.cond_stage_model = model.eval()
-            self.cond_stage_model.requires_grad_(False)
-        else:
-            self.cond_stage_model = model
-
-    def get_learned_conditioning(self, c):
-        if self.cond_stage_forward is None:
-            if hasattr(self.cond_stage_model, "encode") and callable(
-                self.cond_stage_model.encode
-            ):
-                c = self.cond_stage_model.encode(c)
-                if isinstance(c, DiagonalGaussianDistribution):
-                    c = c.mode()
-            else:
-                c = self.cond_stage_model(c)
-        else:
-            assert hasattr(self.cond_stage_model, self.cond_stage_forward)
-            c = getattr(self.cond_stage_model, self.cond_stage_forward)(c)
-        return c
-
-
-    # Copied from diffusers.pipelines.latte.pipeline_latte.LattePipeline.check_inputs
-    def check_inputs(
-        self,
-        prompt,
-        height,
-        width,
-        negative_prompt,
-        callback_on_step_end_tensor_inputs,
-        prompt_embeds=None,
-        negative_prompt_embeds=None,
-    ):
-        if height % 8 != 0 or width % 8 != 0:
-            raise ValueError(
-                f"`height` and `width` have to be divisible by 8 but are {height} and {width}."
-            )
-
-        if prompt is not None and prompt_embeds is not None:
-            raise ValueError(
-                f"Cannot forward both `prompt`: {prompt} and `prompt_embeds`: {prompt_embeds}. Please make sure to"
-                " only forward one of the two."
-            )
-        elif prompt is None and prompt_embeds is None:
-            raise ValueError(
-                "Provide either `prompt` or `prompt_embeds`. Cannot leave both `prompt` and `prompt_embeds` undefined."
-            )
-        elif prompt is not None and (
-            not isinstance(prompt, str) and not isinstance(prompt, list)
-        ):
-            raise ValueError(
-                f"`prompt` has to be of type `str` or `list` but is {type(prompt)}"
-            )
-
-        if prompt is not None and negative_prompt_embeds is not None:
-            raise ValueError(
-                f"Cannot forward both `prompt`: {prompt} and `negative_prompt_embeds`:"
-                f" {negative_prompt_embeds}. Please make sure to only forward one of the two."
-            )
-
-        if negative_prompt is not None and negative_prompt_embeds is not None:
-            raise ValueError(
-                f"Cannot forward both `negative_prompt`: {negative_prompt} and `negative_prompt_embeds`:"
-                f" {negative_prompt_embeds}. Please make sure to only forward one of the two."
-            )
-
-        if prompt_embeds is not None and negative_prompt_embeds is not None:
-            if prompt_embeds.shape != negative_prompt_embeds.shape:
-                raise ValueError(
-                    "`prompt_embeds` and `negative_prompt_embeds` must have the same shape when passed directly, but"
-                    f" got: `prompt_embeds` {prompt_embeds.shape} != `negative_prompt_embeds`"
-                    f" {negative_prompt_embeds.shape}."
-                )
-
-    def _get_t5_prompt_embeds(
-        self,
-        prompt: Union[str, List[str]],
-        num_videos_per_prompt: int = 1,
-        max_sequence_length: int = 226,
-        device: Optional[torch.device] = None,
-        dtype: Optional[torch.dtype] = torch.float32,
-        text_input_ids=None,
-    ):
-        device = self.device
-        prompt = [prompt] if isinstance(prompt, str) else prompt
-        batch_size = len(prompt)
-
-        text_inputs = self.cond_stage_model.tokenizer(
-            prompt,
-            padding="max_length",
-            max_length=max_sequence_length,
-            truncation=True,
-            add_special_tokens=True,
-            return_tensors="pt",
-        )
-        text_input_ids = text_inputs.input_ids
-        prompt_embeds = self.cond_stage_model.transformer(text_input_ids.to(device))[0]
-        prompt_embeds = prompt_embeds.to(dtype=dtype, device=device)
-
-        # duplicate text embeddings for multiple samples per prompt
-        _, seq_len, _ = prompt_embeds.shape
-        prompt_embeds = prompt_embeds.repeat(1, num_videos_per_prompt, 1)
-        prompt_embeds = prompt_embeds.view(
-            batch_size * num_videos_per_prompt, seq_len, -1
-        )
-
-        return prompt_embeds
-
-    def encode_prompt(
-        self,
-        prompt: Union[str, List[str]],
-        negative_prompt: Optional[Union[str, List[str]]] = None,
-        do_classifier_free_guidance: bool = True,
-        num_videos_per_prompt: int = 1,
-        prompt_embeds: Optional[torch.Tensor] = None,
-        negative_prompt_embeds: Optional[torch.Tensor] = None,
-        max_sequence_length: int = 226,
-        device: Optional[torch.device] = None,
-        dtype: Optional[torch.dtype] = None,
-    ):
-        r"""
-        Encodes the prompt into text encoder hidden states.
-
-        Args:
-            prompt (`str` or `List[str]`, *optional*):
-                prompt to be encoded
-            negative_prompt (`str` or `List[str]`, *optional*):
-                The prompt or prompts not to guide the image generation. If not defined, one has to pass
-                `negative_prompt_embeds` instead. Ignored when not using guidance (i.e., ignored if `guidance_scale` is
-                less than `1`).
-            do_classifier_free_guidance (`bool`, *optional*, defaults to `True`):
-                Whether to use classifier free guidance or not.
-            num_videos_per_prompt (`int`, *optional*, defaults to 1):
-                Number of videos that should be generated per prompt. torch device to place the resulting embeddings on
-            prompt_embeds (`torch.Tensor`, *optional*):
-                Pre-generated text embeddings. Can be used to easily tweak text inputs, *e.g.* prompt weighting. If not
-                provided, text embeddings will be generated from `prompt` input argument.
-            negative_prompt_embeds (`torch.Tensor`, *optional*):
-                Pre-generated negative text embeddings. Can be used to easily tweak text inputs, *e.g.* prompt
-                weighting. If not provided, negative_prompt_embeds will be generated from `negative_prompt` input
-                argument.
-            device: (`torch.device`, *optional*):
-                torch device
-            dtype: (`torch.dtype`, *optional*):
-                torch dtype
-        """
-        device = device or self._execution_device
-
-        prompt = [prompt] if isinstance(prompt, str) else prompt
-        if prompt is not None:
-            batch_size = len(prompt)
-        else:
-            batch_size = prompt_embeds.shape[0]
-
-        if prompt_embeds is None:
-            prompt_embeds = self._get_t5_prompt_embeds(
-                prompt=prompt,
-                num_videos_per_prompt=num_videos_per_prompt,
-                max_sequence_length=max_sequence_length,
-                device=device,
-                dtype=dtype,
-            )
-
-        if do_classifier_free_guidance and negative_prompt_embeds is None:
-            negative_prompt = negative_prompt or ""
-            negative_prompt = (
-                batch_size * [negative_prompt]
-                if isinstance(negative_prompt, str)
-                else negative_prompt
-            )
-
-            if prompt is not None and type(prompt) is not type(negative_prompt):
-                raise TypeError(
-                    f"`negative_prompt` should be the same type to `prompt`, but got {type(negative_prompt)} !="
-                    f" {type(prompt)}."
-                )
-            elif batch_size != len(negative_prompt):
-                raise ValueError(
-                    f"`negative_prompt`: {negative_prompt} has batch size {len(negative_prompt)}, but `prompt`:"
-                    f" {prompt} has batch size {batch_size}. Please make sure that passed `negative_prompt` matches"
-                    " the batch size of `prompt`."
-                )
-
-            negative_prompt_embeds = self._get_t5_prompt_embeds(
-                prompt=negative_prompt,
-                num_videos_per_prompt=num_videos_per_prompt,
-                max_sequence_length=max_sequence_length,
-                device=device,
-                dtype=dtype,
-            )
-
-            return prompt_embeds, negative_prompt_embeds
-        return prompt_embeds
-
-    def prepare_latents(
-        self,
-        batch_size,
-        num_channels_latents,
-        num_frames,
-        height,
-        width,
-        dtype,
-        device,
-        generator,
-        latents=None,
-    ):
-        shape = (
-            batch_size,
-            (num_frames - 1) // self.vae_scale_factor_temporal + 1,
-            num_channels_latents,
-            height // self.vae_scale_factor_spatial,
-            width // self.vae_scale_factor_spatial,
-        )
-        if isinstance(generator, list) and len(generator) != batch_size:
-            raise ValueError(
-                f"You have passed a list of generators of length {len(generator)}, but requested an effective batch"
-                f" size of {batch_size}. Make sure the batch size matches the length of the generators."
-            )
-
-        if latents is None:
-            latents = randn_tensor(
-                shape, generator=generator, device=device, dtype=dtype
-            )
-        else:
-            latents = latents.to(device)
-
-        # scale the initial noise by the standard deviation required by the scheduler
-        latents = latents * self.scheduler.init_noise_sigma
-        return latents
-
-    def decode_latents(self, latents: torch.Tensor) -> torch.Tensor:
-        latents = latents.permute(0, 2, 1, 3, 4)  # [batch_size, num_channels, num_frames, height, width]
-        latents = 1 / self.vae.config.scaling_factor * latents # [1, 16, 13, 60, 90]
-        
-        latents = latents.to(self.vae.dtype)
-        self.model.cpu()
-        frames = self.vae.decode(latents).sample
-        self.model.cuda()
-        return frames
-
-    # Copied from diffusers.pipelines.stable_diffusion.pipeline_stable_diffusion.StableDiffusionPipeline.prepare_extra_step_kwargs
-    def prepare_extra_step_kwargs(self, generator, eta):
-        # prepare extra kwargs for the scheduler step, since not all schedulers have the same signature
-        # eta (η) is only used with the DDIMScheduler, it will be ignored for other schedulers.
-        # eta corresponds to η in DDIM paper: https://arxiv.org/abs/2010.02502
-        # and should be between [0, 1]
-
-        accepts_eta = "eta" in set(
-            inspect.signature(self.scheduler.step).parameters.keys()
-        )
-        extra_step_kwargs = {}
-        if accepts_eta:
-            extra_step_kwargs["eta"] = eta
-
-        # check if the scheduler accepts generator
-        accepts_generator = "generator" in set(
-            inspect.signature(self.scheduler.step).parameters.keys()
-        )
-        if accepts_generator:
-            extra_step_kwargs["generator"] = generator
-        return extra_step_kwargs
-
-    def _prepare_rotary_positional_embeddings(
-        self,
-        height: int,
-        width: int,
-        num_frames: int,
-        vae_scale_factor_spatial: int = 8,
-        patch_size: int = 2,
-        attention_head_dim: int = 64,
-        device: Optional[torch.device] = None,
-        base_height: int = 480,
-        base_width: int = 720,
-    ) -> Tuple[torch.Tensor, torch.Tensor]:
-
-        grid_height = height // (vae_scale_factor_spatial * patch_size)
-        grid_width = width // (vae_scale_factor_spatial * patch_size)
-        base_size_width = base_width // (vae_scale_factor_spatial * patch_size)
-        base_size_height = base_height // (vae_scale_factor_spatial * patch_size)
-
-        grid_crops_coords = get_resize_crop_region_for_grid(
-            (grid_height, grid_width), 
-            (base_size_height, base_size_width)
-        )
-        freqs_cos, freqs_sin = get_3d_rotary_pos_embed(
-            embed_dim=attention_head_dim,
-            crops_coords=grid_crops_coords,
-            grid_size=(grid_height, grid_width),
-            temporal_size=num_frames,
-        )
-
-        freqs_cos = freqs_cos.to(device=device)
-        freqs_sin = freqs_sin.to(device=device)
-        return freqs_cos, freqs_sin
-
-    @torch.no_grad()
-    def sample(
-        self,
-        prompt: Optional[Union[str, List[str]]] = None,
-        negative_prompt: Optional[Union[str, List[str]]] = None,
-        height: int = 480,
-        width: int = 720,
-        num_frames: int = 49,
-        num_inference_steps: int = 50,
-        timesteps: Optional[List[int]] = None,
-        guidance_scale: float = 6,
-        use_dynamic_cfg: bool = False,
-        num_videos_per_prompt: int = 1,
-        eta: float = 0.0,
-        generator: Optional[Union[torch.Generator, List[torch.Generator]]] = None,
-        latents: Optional[torch.FloatTensor] = None,
-        prompt_embeds: Optional[torch.FloatTensor] = None,
-        negative_prompt_embeds: Optional[torch.FloatTensor] = None,
-        output_type: str = "pil",
-        sample_precision: str = None,
-        return_dict: bool = True,
-        attention_kwargs: Optional[Dict[str, Any]] = None,
-        callback_on_step_end: Optional[
-            Union[
-                Callable[[int, int, Dict], None],
-                PipelineCallback,
-                MultiPipelineCallbacks,
-            ]
-        ] = None,
-        callback_on_step_end_tensor_inputs: List[str] = ["latents"],
-        max_sequence_length: int = 226,
-        progress_bar: bool = True,
-    ) -> Union[Tuple]:
-        """
-        Function invoked when calling the pipeline for generation.
-
-        Args:
-            prompt (`str` or `List[str]`, *optional*):
-                The prompt or prompts to guide the image generation. If not defined, one has to pass `prompt_embeds`.
-                instead.
-            negative_prompt (`str` or `List[str]`, *optional*):
-                The prompt or prompts not to guide the image generation. If not defined, one has to pass
-                `negative_prompt_embeds` instead. Ignored when not using guidance (i.e., ignored if `guidance_scale` is
-                less than `1`).
-            height (`int`, *optional*, defaults to self.unet.config.sample_size * self.vae_scale_factor):
-                The height in pixels of the generated image. This is set to 1024 by default for the best results.
-            width (`int`, *optional*, defaults to self.unet.config.sample_size * self.vae_scale_factor):
-                The width in pixels of the generated image. This is set to 1024 by default for the best results.
-            num_frames (`int`, defaults to `48`):
-                Number of frames to generate. Must be divisible by self.vae_scale_factor_temporal. Generated video will
-                contain 1 extra frame because CogVideoX is conditioned with (num_seconds * fps + 1) frames where
-                num_seconds is 6 and fps is 4. However, since videos can be saved at any fps, the only condition that
-                needs to be satisfied is that of divisibility mentioned above.
-            num_inference_steps (`int`, *optional*, defaults to 50):
-                The number of denoising steps. More denoising steps usually lead to a higher quality image at the
-                expense of slower inference.
-            timesteps (`List[int]`, *optional*):
-                Custom timesteps to use for the denoising process with schedulers which support a `timesteps` argument
-                in their `set_timesteps` method. If not defined, the default behavior when `num_inference_steps` is
-                passed will be used. Must be in descending order.
-            guidance_scale (`float`, *optional*, defaults to 7.0):
-                Guidance scale as defined in [Classifier-Free Diffusion Guidance](https://arxiv.org/abs/2207.12598).
-                `guidance_scale` is defined as `w` of equation 2. of [Imagen
-                Paper](https://arxiv.org/pdf/2205.11487.pdf). Guidance scale is enabled by setting `guidance_scale >
-                1`. Higher guidance scale encourages to generate images that are closely linked to the text `prompt`,
-                usually at the expense of lower image quality.
-            num_videos_per_prompt (`int`, *optional*, defaults to 1):
-                The number of videos to generate per prompt.
-            generator (`torch.Generator` or `List[torch.Generator]`, *optional*):
-                One or a list of [torch generator(s)](https://pytorch.org/docs/stable/generated/torch.Generator.html)
-                to make generation deterministic.
-            latents (`torch.FloatTensor`, *optional*):
-                Pre-generated noisy latents, sampled from a Gaussian distribution, to be used as inputs for image
-                generation. Can be used to tweak the same generation with different prompts. If not provided, a latents
-                tensor will ge generated by sampling using the supplied random `generator`.
-            prompt_embeds (`torch.FloatTensor`, *optional*):
-                Pre-generated text embeddings. Can be used to easily tweak text inputs, *e.g.* prompt weighting. If not
-                provided, text embeddings will be generated from `prompt` input argument.
-            negative_prompt_embeds (`torch.FloatTensor`, *optional*):
-                Pre-generated negative text embeddings. Can be used to easily tweak text inputs, *e.g.* prompt
-                weighting. If not provided, negative_prompt_embeds will be generated from `negative_prompt` input
-                argument.
-            output_type (`str`, *optional*, defaults to `"pil"`):
-                The output format of the generate image. Choose between
-                [PIL](https://pillow.readthedocs.io/en/stable/): `PIL.Image.Image` or `np.array`.
-            return_dict (`bool`, *optional*, defaults to `True`):
-                Whether or not to return a [`~pipelines.stable_diffusion_xl.StableDiffusionXLPipelineOutput`] instead
-                of a plain tuple.
-            attention_kwargs (`dict`, *optional*):
-                A kwargs dictionary that if specified is passed along to the `AttentionProcessor` as defined under
-                `self.processor` in
-                [diffusers.models.attention_processor](https://github.com/huggingface/diffusers/blob/main/src/diffusers/models/attention_processor.py).
-            callback_on_step_end (`Callable`, *optional*):
-                A function that calls at the end of each denoising steps during the inference. The function is called
-                with the following arguments: `callback_on_step_end(self: DiffusionPipeline, step: int, timestep: int,
-                callback_kwargs: Dict)`. `callback_kwargs` will include a list of all tensors as specified by
-                `callback_on_step_end_tensor_inputs`.
-            callback_on_step_end_tensor_inputs (`List`, *optional*):
-                The list of tensor inputs for the `callback_on_step_end` function. The tensors specified in the list
-                will be passed as `callback_kwargs` argument. You will only be able to include variables listed in the
-                `._callback_tensor_inputs` attribute of your pipeline class.
-            max_sequence_length (`int`, defaults to `226`):
-                Maximum sequence length in encoded prompt. Must be consistent with
-                `self.model.config.max_text_seq_length` otherwise may lead to poor results.
-        """
-
-        if num_frames > 49:
-            raise ValueError(
-                "The number of frames must be less than 49 for now due to static positional embeddings. This will be updated in the future to remove this limitation."
-            )
-
-        if isinstance(callback_on_step_end, (PipelineCallback, MultiPipelineCallbacks)):
-            callback_on_step_end_tensor_inputs = callback_on_step_end.tensor_inputs
-
-        height = height or self.model.config.sample_size * self.vae_scale_factor_spatial
-        width = width or self.model.config.sample_size * self.vae_scale_factor_spatial
-        num_videos_per_prompt = 1
-
-        # 1. Check inputs. Raise error if not correct
-        self.check_inputs(
-            prompt,
-            height,
-            width,
-            negative_prompt,
-            callback_on_step_end_tensor_inputs,
-            prompt_embeds,
-            negative_prompt_embeds,
-        )
-        self._guidance_scale = guidance_scale
-        self._attention_kwargs = attention_kwargs
-        self._interrupt = False
-
-        # 2. Default call parameters
-        if prompt is not None and isinstance(prompt, str):
-            batch_size = 1
-        elif prompt is not None and isinstance(prompt, list):
-            batch_size = len(prompt)
-        else:
-            batch_size = prompt_embeds.shape[0]
-        
-        device = self.device
-        if sample_precision is not None:
-            ori_dtype = self.model.dtype
-            dtype = precision_to_dtype[sample_precision]
-            self.model.to(dtype)
-        else:
-            dtype = self.model.dtype
-
-        # here `guidance_scale` is defined analog to the guidance weight `w` of equation (2)
-        # of the Imagen paper: https://arxiv.org/pdf/2205.11487.pdf . `guidance_scale = 1`
-        # corresponds to doing no classifier free guidance.
-        do_classifier_free_guidance = guidance_scale > 1.0
-
-        # 3. Encode input prompt
-        prompt_embeds, negative_prompt_embeds = self.encode_prompt(
-            prompt,
-            negative_prompt,
-            do_classifier_free_guidance,
-            num_videos_per_prompt=num_videos_per_prompt,
-            prompt_embeds=prompt_embeds,
-            negative_prompt_embeds=negative_prompt_embeds,
-            max_sequence_length=max_sequence_length,
-            device=device,
-            dtype=dtype,
-        )
-        if do_classifier_free_guidance:
-            prompt_embeds = torch.cat([negative_prompt_embeds, prompt_embeds], dim=0)
-
-        # 4. Prepare timesteps
-        timesteps, num_inference_steps = retrieve_timesteps(
-            self.scheduler, num_inference_steps, device, timesteps
-        )
-        self._num_timesteps = len(timesteps)
-
-        # 5. Prepare latents.
-        latent_channels = self.model.config.in_channels
-        latents = self.prepare_latents(
-            batch_size * num_videos_per_prompt,
-            latent_channels,
-            num_frames,
-            height,
-            width,
-            dtype,
-            device,
-            generator,
-            latents,
-        )
-
-        # 6. Prepare extra step kwargs. TODO: Logic should ideally just be moved out of the pipeline
-        extra_step_kwargs = self.prepare_extra_step_kwargs(generator, eta)
-
-        # 7. Create rotary embeds if required
-        image_rotary_emb = (
-            self._prepare_rotary_positional_embeddings(
-                height=height,
-                width=width,
-                num_frames=latents.shape[1],
-                vae_scale_factor_spatial=self.vae_scale_factor_spatial,
-                patch_size=self.model.config.patch_size,
-                attention_head_dim=self.model.config.attention_head_dim,
-                device=self.device,
-            )
-            if self.model.config.use_rotary_positional_embeddings
-            else None
-        )
-
-        # 8. Denoising loop
-        num_warmup_steps = max(
-            len(timesteps) - num_inference_steps * self.scheduler.order, 0
-        )
-        # for DPM-solver++
-        # self.model.cuda()
-        old_pred_original_sample = None
-        if progress_bar:
-            iters = tqdm(enumerate(timesteps), desc="Denoising Steps", total=num_inference_steps)
-        else:
-            iters = enumerate(timesteps)
-        for i, t in iters:
-
-            latent_model_input = (
-                torch.cat([latents] * 2) if do_classifier_free_guidance else latents
-            )
-            latent_model_input = self.scheduler.scale_model_input(latent_model_input, t)
-            # broadcast to batch dimension in a way that's compatible with ONNX/Core ML
-            timestep = t.expand(latent_model_input.shape[0])
-            # predict noise model_output
-            noise_pred = self.model(
-                hidden_states=latent_model_input,
-                encoder_hidden_states=prompt_embeds,
-                timestep=timestep,
-                image_rotary_emb=image_rotary_emb,
-                # attention_kwargs=attention_kwargs, # None
-                return_dict=False,
-            )[0]
-
-            # perform guidance
-            if use_dynamic_cfg:
-                self._guidance_scale = 1 + guidance_scale * (
-                    (
-                        1
-                        - math.cos(
-                            math.pi
-                            * ((num_inference_steps - t.item()) / num_inference_steps)
-                            ** 5.0
-                        )
-                    )
-                    / 2
-                )
-            else:
-                self.guidance_scale = guidance_scale
-            if do_classifier_free_guidance:
-                noise_pred_uncond, noise_pred_text = noise_pred.chunk(2)
-                noise_pred = noise_pred_uncond + self.guidance_scale * (
-                    noise_pred_text - noise_pred_uncond
-                )
-
-            # compute the previous noisy sample x_t -> x_t-1
-            if not isinstance(self.scheduler, CogVideoXDPMScheduler):
-                latents = self.scheduler.step(
-                    noise_pred, t, latents, **extra_step_kwargs, return_dict=False
-                )[0]
-            else:
-                latents, old_pred_original_sample = self.scheduler.step(
-                    noise_pred,
-                    old_pred_original_sample,
-                    t,
-                    timesteps[i - 1] if i > 0 else None,
-                    latents,
-                    **extra_step_kwargs,
-                    return_dict=False,
-                )
-            latents = latents.to(dtype)
-
-            # call the callback, if provided
-            if callback_on_step_end is not None:
-                callback_kwargs = {}
-                for k in callback_on_step_end_tensor_inputs:
-                    callback_kwargs[k] = locals()[k]
-                callback_outputs = callback_on_step_end(self, i, t, callback_kwargs)
-
-                latents = callback_outputs.pop("latents", latents)
-                prompt_embeds = callback_outputs.pop("prompt_embeds", prompt_embeds)
-                negative_prompt_embeds = callback_outputs.pop(
-                    "negative_prompt_embeds", negative_prompt_embeds
-                )
-
-        if not output_type == "latent":
-            video = self.decode_latents(latents)
-        else:
-            video = latents
-
-        video = video[None, ...].cpu()
-        
-        torch.cuda.empty_cache()
-
-        if sample_precision is not None:
-            self.model.to(ori_dtype)
-        return video
-
-    def configure_optimizers(self):
-        optimizer = torch.optim.AdamW(
-            [p for p in self.model.parameters() if p.requires_grad],
-            lr=self.learning_rate,
-        )
-        return optimizer
-
-    def on_save_checkpoint(self, checkpoint: Dict[str, Any]) -> None:
-        # filter out lora related parameters
-        new_satate_dict = checkpoint["state_dict"]
-        new_satate_dict = {k: v for k, v in new_satate_dict.items() if "lora" in k}
-        if len(new_satate_dict) > 0:
-            checkpoint["state_dict"] = new_satate_dict
-        return checkpoint
-
-    def on_load_checkpoint(self, checkpoint: Dict[str, Any]) -> None:
-        pass
-
-    def encode_video(self, video):
-        video = video.to(self.device, dtype=self.dtype).unsqueeze(0)
-        latent_dist = self.vae.encode(video).latent_dist
-        return latent_dist
-
-    def get_batch_input(self, batch):
-        """
-        Prepare model batch inputs
-        """
-        # equal to collate_fn
-        # the resonable video latents range is [-5,5], approximately.
-        # videos
-        videos = [self.encode_video(video) for video in batch["video"]]
-        videos = [video.sample() * self.vae.config.scaling_factor for video in videos]
-        videos = torch.cat(videos, dim=0)
-        videos = videos.to(memory_format=torch.contiguous_format)
-        # prompt
-        prompts = [item for item in batch["caption"]]
-        return {
-            "videos": videos,
-            "prompts": prompts,
-        }
-
-    def training_step(self, batch, batch_idx):
-        batch = self.get_batch_input(batch)
-        model_input = (
-            batch["videos"].permute(0, 2, 1, 3, 4).to(dtype=self.dtype)
-        )  # [B, F, C, H, W]
-        prompts = batch["prompts"]
-
-        max_sequence_length = 226
-        with torch.no_grad():
-            prompt_embeds = self.encode_prompt(
-                prompts,
-                do_classifier_free_guidance=False,  # set to false for train
-                num_videos_per_prompt=1,
-                max_sequence_length=max_sequence_length,
-                device=self.device,
-            )
-
-        batch_size, num_frames, num_channels, height, width = model_input.shape
-
-        # Sample noise that will be added to the latents
-        noise = torch.randn_like(model_input)
-
-        # Sample a random timestep for each image
-        timesteps = torch.randint(
-            0,
-            self.scheduler.config.num_train_timesteps,
-            (batch_size,),
-            device=self.device,
-        )
-        timesteps = timesteps.long()
-
-        # Prepare rotary embeds
-        image_rotary_emb = (
-            # in the first place, we assume this function is the same during inference and train.
-            self._prepare_rotary_positional_embeddings(
-                height=height * self.vae_scale_factor_spatial,
-                width=width * self.vae_scale_factor_spatial,
-                num_frames=num_frames,
-                vae_scale_factor_spatial=self.vae_scale_factor_spatial,
-                patch_size=self.model.config.patch_size,
-                attention_head_dim=self.model.config.attention_head_dim,
-                device=self.device,
-            )
-            if self.model.config.use_rotary_positional_embeddings
-            else None
-        )
-
-        # Add noise to the model input according to the noise magnitude at each timestep
-        # (this is the forward diffusion process)
-        noisy_model_input = self.scheduler.add_noise(model_input, noise, timesteps)
-        model_output = self.model(
-            hidden_states=noisy_model_input,
-            encoder_hidden_states=prompt_embeds,
-            timestep=timesteps,
-            image_rotary_emb=image_rotary_emb,
-            return_dict=False,
-        )[0]
-        model_pred = self.scheduler.get_velocity(
-            model_output, noisy_model_input, timesteps
-        )
-
-        alphas_cumprod = self.scheduler.alphas_cumprod[timesteps]
-        weights = 1 / (1 - alphas_cumprod)
-        while len(weights.shape) < len(model_pred.shape):
-            weights = weights.unsqueeze(-1)
-
-        target = model_input
-        # TODO: inherent loss computation from base class.
-        loss = torch.mean(
-            (weights * (model_pred - target) ** 2).reshape(batch_size, -1), dim=1
-        )
-        loss = loss.mean()
-        return loss
-    
-    @torch.no_grad()
-    def log_images(self, batch, **kwargs):
-        log = dict()
-        prompts = batch["caption"]
-        batch_samples = self.sample(prompts, 
-                                    num_inference_steps=50,
-                                    sample_precision="bfloat16",
-        )
-        log["gt"] = batch["video"]
-        log["samples"] = batch_samples
-        return log
-
-
-
-if __name__ == "__main__":
-    # test text encoder
-    prompt = ["Elon mask is talking"]
-    device = "cuda"
-    dtype = "float32"
-    tokenizer = T5Tokenizer.from_pretrained("THUDM/CogVideoX-2b", subfolder="tokenizer")
-    text_encoder = T5EncoderModel.from_pretrained(
-        "THUDM/CogVideoX-2b", subfolder="text_encoder"
-    ).to(device)
-    text_inputs = tokenizer(
-        prompt,
-        padding="max_length",
-        max_length=226,
-        truncation=True,
-        add_special_tokens=True,
-        return_tensors="pt",
-    )
-    text_input_ids = text_inputs.input_ids
-    with torch.no_grad():
-        prompt_embeds = text_encoder(text_input_ids.to(device))[0]
-    print(has_nan(prompt_embeds))
diff --git a/videotuna/models/cogvideo_sat/arguments.py b/videotuna/models/cogvideo_sat/arguments.py
deleted file mode 100644
index e5f3e951..00000000
--- a/videotuna/models/cogvideo_sat/arguments.py
+++ /dev/null
@@ -1,337 +0,0 @@
-import argparse
-import json
-import os
-import sys
-import warnings
-
-import deepspeed
-import omegaconf
-import torch
-import torch.distributed
-from omegaconf import OmegaConf
-from sat import mpu
-from sat.arguments import (
-    add_data_args,
-    add_evaluation_args,
-    add_training_args,
-    set_random_seed,
-)
-from sat.helpers import print_rank0
-
-sys.path.append(os.path.join(os.path.dirname(__file__), "../"))
-
-
-def add_sampling_config_args(parser):
-    """Sampling configurations"""
-    group = parser.add_argument_group("sampling", "Sampling Configurations")
-    group.add_argument("--input-dir", type=str, default=None)
-    group.add_argument("--sampling-image-size", type=list, default=[768, 1360])
-    group.add_argument("--final-size", type=int, default=2048)
-    group.add_argument("--sdedit", action="store_true")
-    group.add_argument("--grid-num-rows", type=int, default=1)
-    group.add_argument("--force-inference", action="store_true")
-    group.add_argument("--lcm_steps", type=int, default=None)
-    group.add_argument("--sampling-num-frames", type=int, default=22)
-    group.add_argument("--sampling-fps", type=int, default=16)
-    group.add_argument("--only-save-latents", type=bool, default=False)
-    group.add_argument("--only-log-video-latents", type=bool, default=False)
-    group.add_argument("--latent-channels", type=int, default=16)
-    group.add_argument("--image2video", action="store_true")
-    group.add_argument("--modeForScript", type=str, default="inference")
-    group.add_argument("--batch_size", type=int, default=1)
-
-    return parser
-
-
-def add_model_config_args(parser):
-    """Model arguments"""
-    group = parser.add_argument_group("model", "model configuration")
-    # group.add_argument("--base", type=str, nargs="*", help="config for input and saving", default="configs/005_cogvideox1.5/cogvideox1.5_5b_t2v.yaml")
-    group.add_argument(
-        "--model-parallel-size",
-        type=int,
-        default=1,
-        help="size of the model parallel. only use if you are an expert.",
-    )
-    group.add_argument("--force-pretrain", action="store_true", default=True)
-    group.add_argument("--device", type=int, default=-1)
-    group.add_argument("--debug", action="store_true")
-    group.add_argument("--log-image", type=bool, default=True)
-
-    return parser
-
-
-def initialize_distributed(args):
-    """Initialize torch.distributed."""
-    if torch.distributed.is_initialized():
-        if mpu.model_parallel_is_initialized():
-            if args.model_parallel_size != mpu.get_model_parallel_world_size():
-                raise ValueError(
-                    "model_parallel_size is inconsistent with prior configuration."
-                    "We currently do not support changing model_parallel_size."
-                )
-            return False
-        else:
-            if args.model_parallel_size > 1:
-                warnings.warn(
-                    "model_parallel_size > 1 but torch.distributed is not initialized via SAT."
-                    "Please carefully make sure the correctness on your own."
-                )
-            mpu.initialize_model_parallel(args.model_parallel_size)
-        return True
-    # the automatic assignment of devices has been moved to arguments.py
-    if args.device == "cpu":
-        pass
-    else:
-        torch.cuda.set_device(args.device)
-    # Call the init process
-    init_method = "tcp://"
-    args.master_ip = os.getenv("MASTER_ADDR", "localhost")
-
-    if args.world_size == 1:
-        from sat.helpers import get_free_port
-
-        default_master_port = str(get_free_port())
-    else:
-        default_master_port = "6000"
-    args.master_port = os.getenv("MASTER_PORT", default_master_port)
-    init_method += args.master_ip + ":" + args.master_port
-    torch.distributed.init_process_group(
-        backend=args.distributed_backend,
-        world_size=args.world_size,
-        rank=args.rank,
-        init_method=init_method,
-    )
-
-    # Set the model-parallel / data-parallel communicators.
-    mpu.initialize_model_parallel(args.model_parallel_size)
-
-    # Set vae context parallel group equal to model parallel group
-    from sgm.util import initialize_context_parallel, set_context_parallel_group
-
-    if args.model_parallel_size <= 2:
-        set_context_parallel_group(
-            args.model_parallel_size, mpu.get_model_parallel_group()
-        )
-    else:
-        initialize_context_parallel(2)
-    # mpu.initialize_model_parallel(1)
-    # Optional DeepSpeed Activation Checkpointing Features
-    if args.deepspeed:
-        import deepspeed
-
-        deepspeed.init_distributed(
-            dist_backend=args.distributed_backend,
-            world_size=args.world_size,
-            rank=args.rank,
-            init_method=init_method,
-        )
-        # # It seems that it has no negative influence to configure it even without using checkpointing.
-        # deepspeed.checkpointing.configure(mpu, deepspeed_config=args.deepspeed_config, num_checkpoints=args.num_layers)
-    else:
-        # in model-only mode, we don't want to init deepspeed, but we still need to init the rng tracker for model_parallel, just because we save the seed by default when dropout.
-        try:
-            import deepspeed
-            from deepspeed.runtime.activation_checkpointing.checkpointing import (
-                _CUDA_RNG_STATE_TRACKER,
-                _MODEL_PARALLEL_RNG_TRACKER_NAME,
-            )
-
-            _CUDA_RNG_STATE_TRACKER.add(
-                _MODEL_PARALLEL_RNG_TRACKER_NAME, 1
-            )  # default seed 1
-        except Exception as e:
-            from sat.helpers import print_rank0
-
-            print_rank0(str(e), level="DEBUG")
-
-    return True
-
-
-def process_config_to_args(args):
-    """Fetch args from only --base"""
-    project_dir = os.path.join(os.path.dirname(__file__), "../../../")
-
-    def extract_clean_path(base):
-        base = base[0].strip('["]')
-        clean_path = base.strip("[]").strip("'")
-        return clean_path
-
-    clean_path = extract_clean_path(args.base)
-
-    args.base = [os.path.join(project_dir, clean_path)]
-
-    configs = [OmegaConf.load(cfg) for cfg in args.base]
-    config = OmegaConf.merge(*configs)
-
-    args_config = config.pop("args", OmegaConf.create())
-    for key in args_config:
-        if isinstance(args_config[key], omegaconf.DictConfig) or isinstance(
-            args_config[key], omegaconf.ListConfig
-        ):
-            arg = OmegaConf.to_object(args_config[key])
-        else:
-            arg = args_config[key]
-        if hasattr(args, key):
-            setattr(args, key, arg)
-
-    if "model" in config:
-        model_config = config.pop("model", OmegaConf.create())
-        args.model_config = model_config
-    if "deepspeed" in config:
-        deepspeed_config = config.pop("deepspeed", OmegaConf.create())
-        args.deepspeed_config = OmegaConf.to_object(deepspeed_config)
-    if "data" in config:
-        data_config = config.pop("data", OmegaConf.create())
-        args.data_config = data_config
-
-    return args
-
-
-def getArgs():
-    parser = argparse.ArgumentParser(description="sat")
-
-    parser.add_argument(
-        "--load_transformer",
-        type=str,
-        default="checkpoints/cogvideo/CogVideoX1.5-5B-SAT/transformer_t2v",
-    )
-    parser.add_argument("--input_type", type=str, default="txt")
-    parser.add_argument(
-        "--input_file", type=str, default="configs/005_cogvideox1.5/prompt.txt"
-    )
-    parser.add_argument("--output_dir", type=str, default="outputs")
-    parser.add_argument(
-        "--base",
-        type=str,
-        nargs="*",
-        help="config for input and saving",
-        default="configs/005_cogvideox1.5/cogvideox1.5_5b_t2v.yaml",
-    )
-    parser.add_argument("--mode_type", type=str, default="t2v")
-    parser.add_argument("--sampling_num_frames", type=int, default=22)
-    parser.add_argument("--image_folder", type=str, default="inputs/i2v/576x1024")
-
-    parser = add_model_config_args(parser)
-    parser = add_sampling_config_args(parser)
-    parser = add_training_args(parser)
-    parser = add_evaluation_args(parser)
-    parser = add_data_args(parser)
-    parser = deepspeed.add_config_arguments(parser)
-    args_list = ["--base", parser.parse_args().base]
-    args = parser.parse_args(args_list)
-    args = process_config_to_args(args)
-
-    args.cuda = torch.cuda.is_available()
-    args.rank = int(os.getenv("RANK", "0"))
-    args.world_size = int(os.getenv("WORLD_SIZE", "1"))
-    if args.local_rank is None:
-        args.local_rank = int(os.getenv("LOCAL_RANK", "0"))  # torchrun
-
-    if args.device == -1:
-        if torch.cuda.device_count() == 0:
-            args.device = "cpu"
-        elif args.local_rank is not None:
-            args.device = args.local_rank
-        else:
-            args.device = args.rank % torch.cuda.device_count()
-
-    if args.local_rank != args.device and args.mode != "inference":
-        raise ValueError(
-            "LOCAL_RANK (default 0) and args.device inconsistent. "
-            "This can only happens in inference mode. "
-            "Please use CUDA_VISIBLE_DEVICES=x for single-GPU training. "
-        )
-
-    if args.rank == 0:
-        print_rank0("using world size: {}".format(args.world_size))
-
-    if args.deepspeed:
-        if args.checkpoint_activations:
-            args.deepspeed_activation_checkpointing = True
-        else:
-            args.deepspeed_activation_checkpointing = False
-        if args.deepspeed_config is not None:
-            deepspeed_config = args.deepspeed_config
-
-        if override_deepspeed_config:  # not specify deepspeed_config, use args
-            if args.fp16:
-                deepspeed_config["fp16"]["enabled"] = True
-            elif args.bf16:
-                deepspeed_config["bf16"]["enabled"] = True
-                deepspeed_config["fp16"]["enabled"] = False
-            else:
-                deepspeed_config["fp16"]["enabled"] = False
-            deepspeed_config["train_micro_batch_size_per_gpu"] = args.batch_size
-            deepspeed_config["gradient_accumulation_steps"] = (
-                args.gradient_accumulation_steps
-            )
-            optimizer_params_config = deepspeed_config["optimizer"]["params"]
-            optimizer_params_config["lr"] = args.lr
-            optimizer_params_config["weight_decay"] = args.weight_decay
-        else:  # override args with values in deepspeed_config
-            if args.rank == 0:
-                print_rank0(
-                    "Will override arguments with manually specified deepspeed_config!"
-                )
-            if "fp16" in deepspeed_config and deepspeed_config["fp16"]["enabled"]:
-                args.fp16 = True
-            else:
-                args.fp16 = False
-            if "bf16" in deepspeed_config and deepspeed_config["bf16"]["enabled"]:
-                args.bf16 = True
-            else:
-                args.bf16 = False
-            if "train_micro_batch_size_per_gpu" in deepspeed_config:
-                args.batch_size = deepspeed_config["train_micro_batch_size_per_gpu"]
-            if "gradient_accumulation_steps" in deepspeed_config:
-                args.gradient_accumulation_steps = deepspeed_config[
-                    "gradient_accumulation_steps"
-                ]
-            else:
-                args.gradient_accumulation_steps = None
-            if "optimizer" in deepspeed_config:
-                optimizer_params_config = deepspeed_config["optimizer"].get(
-                    "params", {}
-                )
-                args.lr = optimizer_params_config.get("lr", args.lr)
-                args.weight_decay = optimizer_params_config.get(
-                    "weight_decay", args.weight_decay
-                )
-        args.deepspeed_config = deepspeed_config
-
-    args.load = parser.parse_args().load_transformer
-    args.input_type = parser.parse_args().input_type
-    args.input_file = parser.parse_args().input_file
-    args.output_dir = parser.parse_args().output_dir
-    args.image_folder = parser.parse_args().image_folder
-    args.seed = parser.parse_args().seed
-    args.batch_size = 1
-    args.bf16 = True
-
-    initialize_distributed(args)
-    args.seed = args.seed + mpu.get_data_parallel_rank()
-    set_random_seed(args.seed)
-
-    del args.deepspeed_config
-    args.model_config.first_stage_config.params.cp_size = 1
-    args.model_config.network_config.params.transformer_args.model_parallel_size = 1
-    args.model_config.network_config.params.transformer_args.checkpoint_activations = (
-        False
-    )
-    args.model_config.loss_fn_config.params.sigma_sampler_config.params.uniform_sampling = (
-        False
-    )
-    args.force_inference = True
-    args.mode = "inference"
-    args.sampling_num_frames = parser.parse_args().sampling_num_frames
-
-    if parser.parse_args().mode_type == "t2v":
-        args.image2video = False
-        args.sampling_image_size = [768, 1360]
-    else:
-        args.image2video = True
-        args.model_config.network_config.params.in_channels = 32
-        args.image_path = parser.parse_args().image_folder
-
-    return args
diff --git a/videotuna/models/cogvideo_sat/data_video.py b/videotuna/models/cogvideo_sat/data_video.py
deleted file mode 100644
index 00ea3b48..00000000
--- a/videotuna/models/cogvideo_sat/data_video.py
+++ /dev/null
@@ -1,495 +0,0 @@
-import io
-import math
-import os
-import random
-import sys
-from fractions import Fraction
-from functools import partial
-from typing import Any, Dict, Optional, Tuple, Union
-
-import decord
-import numpy as np
-import torch
-import torchvision.transforms as TT
-from decord import VideoReader
-from sgm.webds import MetaDistributedWebDataset
-from torch.utils.data import Dataset
-from torchvision.io import _video_opt
-from torchvision.io.video import (
-    _align_audio_frames,
-    _check_av_available,
-    _read_from_stream,
-    av,
-)
-from torchvision.transforms import InterpolationMode
-from torchvision.transforms.functional import center_crop, resize
-
-
-def read_video(
-    filename: str,
-    start_pts: Union[float, Fraction] = 0,
-    end_pts: Optional[Union[float, Fraction]] = None,
-    pts_unit: str = "pts",
-    output_format: str = "THWC",
-) -> Tuple[torch.Tensor, torch.Tensor, Dict[str, Any]]:
-    """
-    Reads a video from a file, returning both the video frames and the audio frames
-
-    Args:
-        filename (str): path to the video file
-        start_pts (int if pts_unit = 'pts', float / Fraction if pts_unit = 'sec', optional):
-            The start presentation time of the video
-        end_pts (int if pts_unit = 'pts', float / Fraction if pts_unit = 'sec', optional):
-            The end presentation time
-        pts_unit (str, optional): unit in which start_pts and end_pts values will be interpreted,
-            either 'pts' or 'sec'. Defaults to 'pts'.
-        output_format (str, optional): The format of the output video tensors. Can be either "THWC" (default) or "TCHW".
-
-    Returns:
-        vframes (Tensor[T, H, W, C] or Tensor[T, C, H, W]): the `T` video frames
-        aframes (Tensor[K, L]): the audio frames, where `K` is the number of channels and `L` is the number of points
-        info (Dict): metadata for the video and audio. Can contain the fields video_fps (float) and audio_fps (int)
-    """
-
-    output_format = output_format.upper()
-    if output_format not in ("THWC", "TCHW"):
-        raise ValueError(
-            f"output_format should be either 'THWC' or 'TCHW', got {output_format}."
-        )
-
-    _check_av_available()
-
-    if end_pts is None:
-        end_pts = float("inf")
-
-    if end_pts < start_pts:
-        raise ValueError(
-            f"end_pts should be larger than start_pts, got start_pts={start_pts} and end_pts={end_pts}"
-        )
-
-    info = {}
-    audio_frames = []
-    audio_timebase = _video_opt.default_timebase
-
-    with av.open(filename, metadata_errors="ignore") as container:
-        if container.streams.audio:
-            audio_timebase = container.streams.audio[0].time_base
-        if container.streams.video:
-            video_frames = _read_from_stream(
-                container,
-                start_pts,
-                end_pts,
-                pts_unit,
-                container.streams.video[0],
-                {"video": 0},
-            )
-            video_fps = container.streams.video[0].average_rate
-            # guard against potentially corrupted files
-            if video_fps is not None:
-                info["video_fps"] = float(video_fps)
-
-        if container.streams.audio:
-            audio_frames = _read_from_stream(
-                container,
-                start_pts,
-                end_pts,
-                pts_unit,
-                container.streams.audio[0],
-                {"audio": 0},
-            )
-            info["audio_fps"] = container.streams.audio[0].rate
-
-    aframes_list = [frame.to_ndarray() for frame in audio_frames]
-
-    vframes = torch.empty((0, 1, 1, 3), dtype=torch.uint8)
-
-    if aframes_list:
-        aframes = np.concatenate(aframes_list, 1)
-        aframes = torch.as_tensor(aframes)
-        if pts_unit == "sec":
-            start_pts = int(math.floor(start_pts * (1 / audio_timebase)))
-            if end_pts != float("inf"):
-                end_pts = int(math.ceil(end_pts * (1 / audio_timebase)))
-        aframes = _align_audio_frames(aframes, audio_frames, start_pts, end_pts)
-    else:
-        aframes = torch.empty((1, 0), dtype=torch.float32)
-
-    if output_format == "TCHW":
-        # [T,H,W,C] --> [T,C,H,W]
-        vframes = vframes.permute(0, 3, 1, 2)
-
-    return vframes, aframes, info
-
-
-def resize_for_rectangle_crop(arr, image_size, reshape_mode="random"):
-    if arr.shape[3] / arr.shape[2] > image_size[1] / image_size[0]:
-        arr = resize(
-            arr,
-            size=[image_size[0], int(arr.shape[3] * image_size[0] / arr.shape[2])],
-            interpolation=InterpolationMode.BICUBIC,
-        )
-    else:
-        arr = resize(
-            arr,
-            size=[int(arr.shape[2] * image_size[1] / arr.shape[3]), image_size[1]],
-            interpolation=InterpolationMode.BICUBIC,
-        )
-
-    h, w = arr.shape[2], arr.shape[3]
-    arr = arr.squeeze(0)
-
-    delta_h = h - image_size[0]
-    delta_w = w - image_size[1]
-
-    if reshape_mode == "random" or reshape_mode == "none":
-        top = np.random.randint(0, delta_h + 1)
-        left = np.random.randint(0, delta_w + 1)
-    elif reshape_mode == "center":
-        top, left = delta_h // 2, delta_w // 2
-    else:
-        raise NotImplementedError
-    arr = TT.functional.crop(
-        arr, top=top, left=left, height=image_size[0], width=image_size[1]
-    )
-    return arr
-
-
-def pad_last_frame(tensor, num_frames):
-    # T, H, W, C
-    if len(tensor) < num_frames:
-        pad_length = num_frames - len(tensor)
-        # Use the last frame to pad instead of zero
-        last_frame = tensor[-1]
-        pad_tensor = last_frame.unsqueeze(0).expand(pad_length, *tensor.shape[1:])
-        padded_tensor = torch.cat([tensor, pad_tensor], dim=0)
-        return padded_tensor
-    else:
-        return tensor[:num_frames]
-
-
-def load_video(
-    video_data,
-    sampling="uniform",
-    duration=None,
-    num_frames=4,
-    wanted_fps=None,
-    actual_fps=None,
-    skip_frms_num=0.0,
-    nb_read_frames=None,
-):
-    decord.bridge.set_bridge("torch")
-    vr = VideoReader(uri=video_data, height=-1, width=-1)
-    if nb_read_frames is not None:
-        ori_vlen = nb_read_frames
-    else:
-        ori_vlen = min(int(duration * actual_fps) - 1, len(vr))
-
-    max_seek = int(ori_vlen - skip_frms_num - num_frames / wanted_fps * actual_fps)
-    start = random.randint(skip_frms_num, max_seek + 1)
-    end = int(start + num_frames / wanted_fps * actual_fps)
-    n_frms = num_frames
-
-    if sampling == "uniform":
-        indices = np.arange(start, end, (end - start) / n_frms).astype(int)
-    else:
-        raise NotImplementedError
-
-    # get_batch -> T, H, W, C
-    temp_frms = vr.get_batch(np.arange(start, end))
-    assert temp_frms is not None
-    tensor_frms = (
-        torch.from_numpy(temp_frms)
-        if type(temp_frms) is not torch.Tensor
-        else temp_frms
-    )
-    tensor_frms = tensor_frms[torch.tensor((indices - start).tolist())]
-
-    return pad_last_frame(tensor_frms, num_frames)
-
-
-import threading
-
-
-def load_video_with_timeout(*args, **kwargs):
-    video_container = {}
-
-    def target_function():
-        video = load_video(*args, **kwargs)
-        video_container["video"] = video
-
-    thread = threading.Thread(target=target_function)
-    thread.start()
-    timeout = 20
-    thread.join(timeout)
-
-    if thread.is_alive():
-        print("Loading video timed out")
-        raise TimeoutError
-    return video_container.get("video", None).contiguous()
-
-
-def process_video(
-    video_path,
-    image_size=None,
-    duration=None,
-    num_frames=4,
-    wanted_fps=None,
-    actual_fps=None,
-    skip_frms_num=0.0,
-    nb_read_frames=None,
-):
-    """
-    video_path: str or io.BytesIO
-    image_size: .
-    duration: preknow the duration to speed up by seeking to sampled start. TODO by_pass if unknown.
-    num_frames: wanted num_frames.
-    wanted_fps: .
-    skip_frms_num: ignore the first and the last xx frames, avoiding transitions.
-    """
-
-    video = load_video_with_timeout(
-        video_path,
-        duration=duration,
-        num_frames=num_frames,
-        wanted_fps=wanted_fps,
-        actual_fps=actual_fps,
-        skip_frms_num=skip_frms_num,
-        nb_read_frames=nb_read_frames,
-    )
-
-    # --- copy and modify the image process ---
-    video = video.permute(0, 3, 1, 2)  # [T, C, H, W]
-
-    # resize
-    if image_size is not None:
-        video = resize_for_rectangle_crop(video, image_size, reshape_mode="center")
-
-    return video
-
-
-def process_fn_video(
-    src, image_size, fps, num_frames, skip_frms_num=0.0, txt_key="caption"
-):
-    while True:
-        r = next(src)
-        if "mp4" in r:
-            video_data = r["mp4"]
-        elif "avi" in r:
-            video_data = r["avi"]
-        else:
-            print("No video data found")
-            continue
-
-        if txt_key not in r:
-            txt = ""
-        else:
-            txt = r[txt_key]
-
-        if isinstance(txt, bytes):
-            txt = txt.decode("utf-8")
-        else:
-            txt = str(txt)
-
-        duration = r.get("duration", None)
-        if duration is not None:
-            duration = float(duration)
-        else:
-            continue
-
-        actual_fps = r.get("fps", None)
-        if actual_fps is not None:
-            actual_fps = float(actual_fps)
-        else:
-            continue
-
-        required_frames = num_frames / fps * actual_fps + 2 * skip_frms_num
-        required_duration = num_frames / fps + 2 * skip_frms_num / actual_fps
-
-        if duration is not None and duration < required_duration:
-            continue
-
-        try:
-            frames = process_video(
-                io.BytesIO(video_data),
-                num_frames=num_frames,
-                wanted_fps=fps,
-                image_size=image_size,
-                duration=duration,
-                actual_fps=actual_fps,
-                skip_frms_num=skip_frms_num,
-            )
-            frames = (frames - 127.5) / 127.5
-        except Exception as e:
-            print(e)
-            continue
-
-        item = {
-            "mp4": frames,
-            "txt": txt,
-            "num_frames": num_frames,
-            "fps": fps,
-        }
-
-        yield item
-
-
-class VideoDataset(MetaDistributedWebDataset):
-    def __init__(
-        self,
-        path,
-        image_size,
-        num_frames,
-        fps,
-        skip_frms_num=0.0,
-        nshards=sys.maxsize,
-        seed=1,
-        meta_names=None,
-        shuffle_buffer=1000,
-        include_dirs=None,
-        txt_key="caption",
-        **kwargs,
-    ):
-        if seed == -1:
-            seed = random.randint(0, 1000000)
-        if meta_names is None:
-            meta_names = []
-
-        if path.startswith(";"):
-            path, include_dirs = path.split(";", 1)
-        super().__init__(
-            path,
-            partial(
-                process_fn_video,
-                num_frames=num_frames,
-                image_size=image_size,
-                fps=fps,
-                skip_frms_num=skip_frms_num,
-            ),
-            seed,
-            meta_names=meta_names,
-            shuffle_buffer=shuffle_buffer,
-            nshards=nshards,
-            include_dirs=include_dirs,
-        )
-
-    @classmethod
-    def create_dataset_function(cls, path, args, **kwargs):
-        return cls(path, **kwargs)
-
-
-class SFTDataset(Dataset):
-    def __init__(self, data_dir, video_size, fps, max_num_frames, skip_frms_num=3):
-        """
-        skip_frms_num: ignore the first and the last xx frames, avoiding transitions.
-        """
-        super(SFTDataset, self).__init__()
-
-        self.video_size = video_size
-        self.fps = fps
-        self.max_num_frames = max_num_frames
-        self.skip_frms_num = skip_frms_num
-
-        self.video_paths = []
-        self.captions = []
-
-        for root, dirnames, filenames in os.walk(data_dir):
-            for filename in filenames:
-                if filename.endswith(".mp4"):
-                    video_path = os.path.join(root, filename)
-                    self.video_paths.append(video_path)
-
-                    caption_path = video_path.replace(".mp4", ".txt").replace(
-                        "videos", "labels"
-                    )
-                    if os.path.exists(caption_path):
-                        caption = open(caption_path, "r").read().splitlines()[0]
-                    else:
-                        caption = ""
-                    self.captions.append(caption)
-
-    def __getitem__(self, index):
-        decord.bridge.set_bridge("torch")
-
-        video_path = self.video_paths[index]
-        vr = VideoReader(uri=video_path, height=-1, width=-1)
-        actual_fps = vr.get_avg_fps()
-        ori_vlen = len(vr)
-
-        if ori_vlen / actual_fps * self.fps > self.max_num_frames:
-            num_frames = self.max_num_frames
-            start = int(self.skip_frms_num)
-            end = int(start + num_frames / self.fps * actual_fps)
-            end_safty = min(
-                int(start + num_frames / self.fps * actual_fps), int(ori_vlen)
-            )
-            indices = np.arange(start, end, (end - start) // num_frames).astype(int)
-            temp_frms = vr.get_batch(np.arange(start, end_safty))
-            assert temp_frms is not None
-            tensor_frms = (
-                torch.from_numpy(temp_frms)
-                if type(temp_frms) is not torch.Tensor
-                else temp_frms
-            )
-            tensor_frms = tensor_frms[torch.tensor((indices - start).tolist())]
-        else:
-            if ori_vlen > self.max_num_frames:
-                num_frames = self.max_num_frames
-                start = int(self.skip_frms_num)
-                end = int(ori_vlen - self.skip_frms_num)
-                indices = np.arange(
-                    start, end, max((end - start) // num_frames, 1)
-                ).astype(int)
-                temp_frms = vr.get_batch(np.arange(start, end))
-                assert temp_frms is not None
-                tensor_frms = (
-                    torch.from_numpy(temp_frms)
-                    if type(temp_frms) is not torch.Tensor
-                    else temp_frms
-                )
-                tensor_frms = tensor_frms[torch.tensor((indices - start).tolist())]
-            else:
-
-                def nearest_smaller_4k_plus_1(n):
-                    remainder = n % 4
-                    if remainder == 0:
-                        return n - 3
-                    else:
-                        return n - remainder + 1
-
-                start = int(self.skip_frms_num)
-                end = int(ori_vlen - self.skip_frms_num)
-                num_frames = nearest_smaller_4k_plus_1(
-                    end - start
-                )  # 3D VAE requires the number of frames to be 4k+1
-                end = int(start + num_frames)
-                temp_frms = vr.get_batch(np.arange(start, end))
-                assert temp_frms is not None
-                tensor_frms = (
-                    torch.from_numpy(temp_frms)
-                    if type(temp_frms) is not torch.Tensor
-                    else temp_frms
-                )
-
-        tensor_frms = pad_last_frame(
-            tensor_frms, self.max_num_frames
-        )  # the len of indices may be less than num_frames, due to round error
-        tensor_frms = tensor_frms.permute(0, 3, 1, 2)  # [T, H, W, C] -> [T, C, H, W]
-        tensor_frms = resize_for_rectangle_crop(
-            tensor_frms, self.video_size, reshape_mode="center"
-        )
-        tensor_frms = (tensor_frms - 127.5) / 127.5
-
-        item = {
-            "mp4": tensor_frms,
-            "txt": self.captions[index],
-            "num_frames": num_frames,
-            "fps": self.fps,
-        }
-        return item
-
-    def __len__(self):
-        return len(self.video_paths)
-
-    @classmethod
-    def create_dataset_function(cls, path, args, **kwargs):
-        return cls(data_dir=path, **kwargs)
diff --git a/videotuna/models/cogvideo_sat/diffusion_video.py b/videotuna/models/cogvideo_sat/diffusion_video.py
deleted file mode 100644
index e19e03ba..00000000
--- a/videotuna/models/cogvideo_sat/diffusion_video.py
+++ /dev/null
@@ -1,421 +0,0 @@
-import gc
-import math
-import random
-from typing import Any, Dict, List, Tuple, Union
-
-import torch
-import torch.nn.functional as F
-from omegaconf import ListConfig
-from sat import mpu
-from sat.helpers import print_rank0
-from sgm.modules import UNCONDITIONAL_CONFIG
-from sgm.modules.autoencoding.temporal_ae import VideoDecoder
-from sgm.modules.diffusionmodules.wrappers import OPENAIUNETWRAPPER
-from sgm.util import (
-    default,
-    disabled_train,
-    get_obj_from_str,
-    instantiate_from_config,
-    log_txt_as_img,
-)
-from torch import nn
-
-
-class SATVideoDiffusionEngine(nn.Module):
-    def __init__(self, args, **kwargs):
-        super().__init__()
-
-        model_config = args.model_config
-        # model args preprocess
-        log_keys = model_config.get("log_keys", None)
-        input_key = model_config.get("input_key", "mp4")
-        network_config = model_config.get("network_config", None)
-        network_wrapper = model_config.get("network_wrapper", None)
-        denoiser_config = model_config.get("denoiser_config", None)
-        sampler_config = model_config.get("sampler_config", None)
-        conditioner_config = model_config.get("conditioner_config", None)
-        first_stage_config = model_config.get("first_stage_config", None)
-        loss_fn_config = model_config.get("loss_fn_config", None)
-        scale_factor = model_config.get("scale_factor", 1.0)
-        latent_input = model_config.get("latent_input", False)
-        disable_first_stage_autocast = model_config.get(
-            "disable_first_stage_autocast", False
-        )
-        no_cond_log = model_config.get("disable_first_stage_autocast", False)
-        not_trainable_prefixes = model_config.get(
-            "not_trainable_prefixes", ["first_stage_model", "conditioner"]
-        )
-        compile_model = model_config.get("compile_model", False)
-        en_and_decode_n_samples_a_time = model_config.get(
-            "en_and_decode_n_samples_a_time", None
-        )
-        lr_scale = model_config.get("lr_scale", None)
-        lora_train = model_config.get("lora_train", False)
-        self.use_pd = model_config.get("use_pd", False)  # progressive distillation
-
-        self.log_keys = log_keys
-        self.input_key = input_key
-        self.not_trainable_prefixes = not_trainable_prefixes
-        self.en_and_decode_n_samples_a_time = en_and_decode_n_samples_a_time
-        self.lr_scale = lr_scale
-        self.lora_train = lora_train
-        self.noised_image_input = model_config.get("noised_image_input", False)
-        self.noised_image_all_concat = model_config.get(
-            "noised_image_all_concat", False
-        )
-        self.noised_image_dropout = model_config.get("noised_image_dropout", 0.0)
-        if args.fp16:
-            dtype = torch.float16
-            dtype_str = "fp16"
-        elif args.bf16:
-            dtype = torch.bfloat16
-            dtype_str = "bf16"
-        else:
-            dtype = torch.float32
-            dtype_str = "fp32"
-        self.dtype = dtype
-        self.dtype_str = dtype_str
-
-        network_config["params"]["dtype"] = dtype_str
-        model = instantiate_from_config(network_config)
-        self.model = get_obj_from_str(default(network_wrapper, OPENAIUNETWRAPPER))(
-            model, compile_model=compile_model, dtype=dtype
-        )
-
-        self.denoiser = instantiate_from_config(denoiser_config)
-        self.sampler = (
-            instantiate_from_config(sampler_config)
-            if sampler_config is not None
-            else None
-        )
-        self.conditioner = instantiate_from_config(
-            default(conditioner_config, UNCONDITIONAL_CONFIG)
-        )
-
-        self._init_first_stage(first_stage_config)
-
-        self.loss_fn = (
-            instantiate_from_config(loss_fn_config)
-            if loss_fn_config is not None
-            else None
-        )
-
-        self.latent_input = latent_input
-        self.scale_factor = scale_factor
-        self.disable_first_stage_autocast = disable_first_stage_autocast
-        self.no_cond_log = no_cond_log
-        self.device = args.device
-
-    def disable_untrainable_params(self):
-        total_trainable = 0
-        for n, p in self.named_parameters():
-            if p.requires_grad == False:
-                continue
-            flag = False
-            for prefix in self.not_trainable_prefixes:
-                if n.startswith(prefix) or prefix == "all":
-                    flag = True
-                    break
-
-            lora_prefix = ["matrix_A", "matrix_B"]
-            for prefix in lora_prefix:
-                if prefix in n:
-                    flag = False
-                    break
-
-            if flag:
-                p.requires_grad_(False)
-            else:
-                total_trainable += p.numel()
-
-        print_rank0(
-            "***** Total trainable parameters: " + str(total_trainable) + " *****"
-        )
-
-    def reinit(self, parent_model=None):
-        # reload the initial params from previous trained modules
-        # you can also get access to other mixins through parent_model.get_mixin().
-        pass
-
-    def _init_first_stage(self, config):
-        model = instantiate_from_config(config).eval()
-        model.train = disabled_train
-        for param in model.parameters():
-            param.requires_grad = False
-        self.first_stage_model = model
-
-    def forward(self, x, batch):
-        loss = self.loss_fn(self.model, self.denoiser, self.conditioner, x, batch)
-        loss_mean = loss.mean()
-        loss_dict = {"loss": loss_mean}
-        return loss_mean, loss_dict
-
-    def add_noise_to_first_frame(self, image):
-        sigma = torch.normal(mean=-3.0, std=0.5, size=(image.shape[0],)).to(self.device)
-        sigma = torch.exp(sigma).to(image.dtype)
-        image_noise = torch.randn_like(image) * sigma[:, None, None, None, None]
-        image = image + image_noise
-        return image
-
-    def shared_step(self, batch: Dict) -> Any:
-        x = self.get_input(batch)
-        if self.lr_scale is not None:
-            lr_x = F.interpolate(
-                x, scale_factor=1 / self.lr_scale, mode="bilinear", align_corners=False
-            )
-            lr_x = F.interpolate(
-                lr_x, scale_factor=self.lr_scale, mode="bilinear", align_corners=False
-            )
-            lr_z = self.encode_first_stage(lr_x, batch)
-            batch["lr_input"] = lr_z
-
-        x = x.permute(0, 2, 1, 3, 4).contiguous()
-        if self.noised_image_input:
-            image = x[:, :, 0:1]
-            image = self.add_noise_to_first_frame(image)
-            image = self.encode_first_stage(image, batch)
-
-        x = self.encode_first_stage(x, batch)
-        x = x.permute(0, 2, 1, 3, 4).contiguous()
-        if self.noised_image_input:
-            image = image.permute(0, 2, 1, 3, 4).contiguous()
-            if self.noised_image_all_concat:
-                image = image.repeat(1, x.shape[1], 1, 1, 1)
-            else:
-                image = torch.concat([image, torch.zeros_like(x[:, 1:])], dim=1)
-            if random.random() < self.noised_image_dropout:
-                image = torch.zeros_like(image)
-            batch["concat_images"] = image
-
-        gc.collect()
-        torch.cuda.empty_cache()
-        loss, loss_dict = self(x, batch)
-        return loss, loss_dict
-
-    def get_input(self, batch):
-        return batch[self.input_key].to(self.dtype)
-
-    @torch.no_grad()
-    def decode_first_stage(self, z):
-        z = 1.0 / self.scale_factor * z
-        n_samples = default(self.en_and_decode_n_samples_a_time, z.shape[0])
-        n_rounds = math.ceil(z.shape[0] / n_samples)
-        all_out = []
-        for n in range(n_rounds):
-            z_now = z[n * n_samples : (n + 1) * n_samples, :, 1:]
-            latent_time = z_now.shape[2]  # check the time latent
-            temporal_compress_times = 4
-
-            fake_cp_size = min(10, latent_time // 2)
-            start_frame = 0
-
-            recons = []
-            start_frame = 0
-            for i in range(fake_cp_size):
-                end_frame = (
-                    start_frame
-                    + latent_time // fake_cp_size
-                    + (1 if i < latent_time % fake_cp_size else 0)
-                )
-
-                use_cp = True if i == 0 else False
-                clear_fake_cp_cache = True if i == fake_cp_size - 1 else False
-                with torch.no_grad():
-                    recon = self.first_stage_model.decode(
-                        z_now[:, :, start_frame:end_frame].contiguous(),
-                        clear_fake_cp_cache=clear_fake_cp_cache,
-                        use_cp=use_cp,
-                    )
-                recons.append(recon)
-                start_frame = end_frame
-            recons = torch.cat(recons, dim=2)
-            all_out.append(recons)
-        out = torch.cat(all_out, dim=0)
-        return out
-
-    @torch.no_grad()
-    def encode_first_stage(self, x, batch):
-        frame = x.shape[2]
-
-        if frame > 1 and self.latent_input:
-            x = x.permute(0, 2, 1, 3, 4).contiguous()
-            return x * self.scale_factor  # already encoded
-
-        n_samples = default(self.en_and_decode_n_samples_a_time, x.shape[0])
-        n_rounds = math.ceil(x.shape[0] / n_samples)
-        all_out = []
-        with torch.autocast("cuda", enabled=not self.disable_first_stage_autocast):
-            for n in range(n_rounds):
-                out = self.first_stage_model.encode(
-                    x[n * n_samples : (n + 1) * n_samples]
-                )
-                all_out.append(out)
-        z = torch.cat(all_out, dim=0)
-        z = self.scale_factor * z
-        return z
-
-    @torch.no_grad()
-    def sample(
-        self,
-        cond: Dict,
-        uc: Union[Dict, None] = None,
-        batch_size: int = 16,
-        shape: Union[None, Tuple, List] = None,
-        prefix=None,
-        concat_images=None,
-        ofs=None,
-        **kwargs,
-    ):
-        randn = torch.randn(batch_size, *shape).to(torch.float32).to(self.device)
-        if hasattr(self, "seeded_noise"):
-            randn = self.seeded_noise(randn)
-
-        if prefix is not None:
-            randn = torch.cat([prefix, randn[:, prefix.shape[1] :]], dim=1)
-
-        # broadcast noise
-        mp_size = mpu.get_model_parallel_world_size()
-        if mp_size > 1:
-            global_rank = torch.distributed.get_rank() // mp_size
-            src = global_rank * mp_size
-            torch.distributed.broadcast(
-                randn, src=src, group=mpu.get_model_parallel_group()
-            )
-
-        scale = None
-        scale_emb = None
-
-        denoiser = lambda input, sigma, c, **addtional_model_inputs: self.denoiser(
-            self.model,
-            input,
-            sigma,
-            c,
-            concat_images=concat_images,
-            **addtional_model_inputs,
-        )
-
-        samples = self.sampler(
-            denoiser, randn, cond, uc=uc, scale=scale, scale_emb=scale_emb, ofs=ofs
-        )
-        samples = samples.to(self.dtype)
-        return samples
-
-    @torch.no_grad()
-    def log_conditionings(self, batch: Dict, n: int) -> Dict:
-        """
-        Defines heuristics to log different conditionings.
-        These can be lists of strings (text-to-image), tensors, ints, ...
-        """
-        image_h, image_w = batch[self.input_key].shape[3:]
-        log = dict()
-
-        for embedder in self.conditioner.embedders:
-            if (
-                (self.log_keys is None) or (embedder.input_key in self.log_keys)
-            ) and not self.no_cond_log:
-                x = batch[embedder.input_key][:n]
-                if isinstance(x, torch.Tensor):
-                    if x.dim() == 1:
-                        # class-conditional, convert integer to string
-                        x = [str(x[i].item()) for i in range(x.shape[0])]
-                        xc = log_txt_as_img((image_h, image_w), x, size=image_h // 4)
-                    elif x.dim() == 2:
-                        # size and crop cond and the like
-                        x = [
-                            "x".join([str(xx) for xx in x[i].tolist()])
-                            for i in range(x.shape[0])
-                        ]
-                        xc = log_txt_as_img((image_h, image_w), x, size=image_h // 20)
-                    else:
-                        raise NotImplementedError()
-                elif isinstance(x, (List, ListConfig)):
-                    if isinstance(x[0], str):
-                        xc = log_txt_as_img((image_h, image_w), x, size=image_h // 20)
-                    else:
-                        raise NotImplementedError()
-                else:
-                    raise NotImplementedError()
-                log[embedder.input_key] = xc
-        return log
-
-    @torch.no_grad()
-    def log_video(
-        self,
-        batch: Dict,
-        N: int = 8,
-        ucg_keys: List[str] = None,
-        only_log_video_latents=False,
-        **kwargs,
-    ) -> Dict:
-        conditioner_input_keys = [e.input_key for e in self.conditioner.embedders]
-        if ucg_keys:
-            assert all(map(lambda x: x in conditioner_input_keys, ucg_keys)), (
-                "Each defined ucg key for sampling must be in the provided conditioner input keys,"
-                f"but we have {ucg_keys} vs. {conditioner_input_keys}"
-            )
-        else:
-            ucg_keys = conditioner_input_keys
-        log = dict()
-
-        x = self.get_input(batch)
-
-        c, uc = self.conditioner.get_unconditional_conditioning(
-            batch,
-            force_uc_zero_embeddings=(
-                ucg_keys if len(self.conditioner.embedders) > 0 else []
-            ),
-        )
-
-        sampling_kwargs = {}
-
-        N = min(x.shape[0], N)
-        x = x.to(self.device)[:N]
-        if not self.latent_input:
-            log["inputs"] = x.to(torch.float32)
-        x = x.permute(0, 2, 1, 3, 4).contiguous()
-        z = self.encode_first_stage(x, batch)
-        if not only_log_video_latents:
-            log["reconstructions"] = self.decode_first_stage(z).to(torch.float32)
-            log["reconstructions"] = (
-                log["reconstructions"].permute(0, 2, 1, 3, 4).contiguous()
-            )
-        z = z.permute(0, 2, 1, 3, 4).contiguous()
-
-        log.update(self.log_conditionings(batch, N))
-
-        for k in c:
-            if isinstance(c[k], torch.Tensor):
-                c[k], uc[k] = map(lambda y: y[k][:N].to(self.device), (c, uc))
-
-        if self.noised_image_input:
-            image = x[:, :, 0:1]
-            image = self.add_noise_to_first_frame(image)
-            image = self.encode_first_stage(image, batch)
-            image = image.permute(0, 2, 1, 3, 4).contiguous()
-            image = torch.concat([image, torch.zeros_like(z[:, 1:])], dim=1)
-            c["concat"] = image
-            uc["concat"] = image
-            samples = self.sample(
-                c, shape=z.shape[1:], uc=uc, batch_size=N, **sampling_kwargs
-            )  # b t c h w
-            samples = samples.permute(0, 2, 1, 3, 4).contiguous()
-            if only_log_video_latents:
-                latents = 1.0 / self.scale_factor * samples
-                log["latents"] = latents
-            else:
-                samples = self.decode_first_stage(samples).to(torch.float32)
-                samples = samples.permute(0, 2, 1, 3, 4).contiguous()
-                log["samples"] = samples
-        else:
-            samples = self.sample(
-                c, shape=z.shape[1:], uc=uc, batch_size=N, **sampling_kwargs
-            )  # b t c h w
-            samples = samples.permute(0, 2, 1, 3, 4).contiguous()
-            if only_log_video_latents:
-                latents = 1.0 / self.scale_factor * samples
-                log["latents"] = latents
-            else:
-                samples = self.decode_first_stage(samples).to(torch.float32)
-                samples = samples.permute(0, 2, 1, 3, 4).contiguous()
-                log["samples"] = samples
-        return log
diff --git a/videotuna/models/cogvideo_sat/dit_video_concat.py b/videotuna/models/cogvideo_sat/dit_video_concat.py
deleted file mode 100644
index 0654dbde..00000000
--- a/videotuna/models/cogvideo_sat/dit_video_concat.py
+++ /dev/null
@@ -1,950 +0,0 @@
-# cogvideox1.5
-from functools import partial, reduce
-from operator import mul
-
-import numpy as np
-import torch
-import torch.nn.functional as F
-from einops import rearrange, repeat
-from sat.model.base_model import BaseModel, non_conflict
-from sat.model.mixins import BaseMixin
-from sat.mpu.layers import ColumnParallelLinear
-from sat.ops.layernorm import LayerNorm, RMSNorm
-from sat.transformer_defaults import HOOKS_DEFAULT, attention_fn_default
-from sgm.modules.diffusionmodules.openaimodel import Timestep
-from sgm.modules.diffusionmodules.util import linear, timestep_embedding
-from sgm.util import instantiate_from_config
-from torch import nn
-
-
-class ImagePatchEmbeddingMixin(BaseMixin):
-    def __init__(self, in_channels, hidden_size, patch_size, text_hidden_size=None):
-        super().__init__()
-        self.patch_size = patch_size
-        self.proj = nn.Linear(in_channels * reduce(mul, patch_size), hidden_size)
-        if text_hidden_size is not None:
-            self.text_proj = nn.Linear(text_hidden_size, hidden_size)
-        else:
-            self.text_proj = None
-
-    def word_embedding_forward(self, input_ids, **kwargs):
-        images = kwargs["images"]  # (b,t,c,h,w)
-        emb = rearrange(images, "b t c h w -> b (t h w) c")
-        emb = rearrange(
-            emb,
-            "b (t o h p w q) c -> b (t h w) (c o p q)",
-            t=kwargs["rope_T"],
-            h=kwargs["rope_H"],
-            w=kwargs["rope_W"],
-            o=self.patch_size[0],
-            p=self.patch_size[1],
-            q=self.patch_size[2],
-        )
-        emb = self.proj(emb)
-
-        if self.text_proj is not None:
-            text_emb = self.text_proj(kwargs["encoder_outputs"])
-            emb = torch.cat((text_emb, emb), dim=1)  # (b,n_t+t*n_i,d)
-
-        emb = emb.contiguous()
-        return emb  # (b,n_t+t*n_i,d)
-
-    def reinit(self, parent_model=None):
-        w = self.proj.weight.data
-        nn.init.xavier_uniform_(w.view([w.shape[0], -1]))
-        nn.init.constant_(self.proj.bias, 0)
-        del self.transformer.word_embeddings
-
-
-def get_3d_sincos_pos_embed(
-    embed_dim,
-    grid_height,
-    grid_width,
-    t_size,
-    cls_token=False,
-    height_interpolation=1.0,
-    width_interpolation=1.0,
-    time_interpolation=1.0,
-):
-    """
-    grid_size: int of the grid height and width
-    t_size: int of the temporal size
-    return:
-    pos_embed: [t_size*grid_size * grid_size, embed_dim] or [1+t_size*grid_size * grid_size, embed_dim]
-    (w/ or w/o cls_token)
-    """
-    assert embed_dim % 4 == 0
-    embed_dim_spatial = embed_dim // 4 * 3
-    embed_dim_temporal = embed_dim // 4
-
-    # spatial
-    grid_h = np.arange(grid_height, dtype=np.float32) / height_interpolation
-    grid_w = np.arange(grid_width, dtype=np.float32) / width_interpolation
-    grid = np.meshgrid(grid_w, grid_h)  # here w goes first
-    grid = np.stack(grid, axis=0)
-
-    grid = grid.reshape([2, 1, grid_height, grid_width])
-    pos_embed_spatial = get_2d_sincos_pos_embed_from_grid(embed_dim_spatial, grid)
-
-    # temporal
-    grid_t = np.arange(t_size, dtype=np.float32) / time_interpolation
-    pos_embed_temporal = get_1d_sincos_pos_embed_from_grid(embed_dim_temporal, grid_t)
-
-    # concate: [T, H, W] order
-    pos_embed_temporal = pos_embed_temporal[:, np.newaxis, :]
-    pos_embed_temporal = np.repeat(
-        pos_embed_temporal, grid_height * grid_width, axis=1
-    )  # [T, H*W, D // 4]
-    pos_embed_spatial = pos_embed_spatial[np.newaxis, :, :]
-    pos_embed_spatial = np.repeat(
-        pos_embed_spatial, t_size, axis=0
-    )  # [T, H*W, D // 4 * 3]
-
-    pos_embed = np.concatenate([pos_embed_temporal, pos_embed_spatial], axis=-1)
-
-    return pos_embed  # [T, H*W, D]
-
-
-def get_2d_sincos_pos_embed(
-    embed_dim, grid_height, grid_width, cls_token=False, extra_tokens=0
-):
-    """
-    grid_size: int of the grid height and width
-    return:
-    pos_embed: [grid_size*grid_size, embed_dim] or [1+grid_size*grid_size, embed_dim] (w/ or w/o cls_token)
-    """
-    grid_h = np.arange(grid_height, dtype=np.float32)
-    grid_w = np.arange(grid_width, dtype=np.float32)
-    grid = np.meshgrid(grid_w, grid_h)  # here w goes first
-    grid = np.stack(grid, axis=0)
-
-    grid = grid.reshape([2, 1, grid_height, grid_width])
-    pos_embed = get_2d_sincos_pos_embed_from_grid(embed_dim, grid)
-    if cls_token and extra_tokens > 0:
-        pos_embed = np.concatenate(
-            [np.zeros([extra_tokens, embed_dim]), pos_embed], axis=0
-        )
-    return pos_embed
-
-
-def get_2d_sincos_pos_embed_from_grid(embed_dim, grid):
-    assert embed_dim % 2 == 0
-
-    # use half of dimensions to encode grid_h
-    emb_h = get_1d_sincos_pos_embed_from_grid(embed_dim // 2, grid[0])  # (H*W, D/2)
-    emb_w = get_1d_sincos_pos_embed_from_grid(embed_dim // 2, grid[1])  # (H*W, D/2)
-
-    emb = np.concatenate([emb_h, emb_w], axis=1)  # (H*W, D)
-    return emb
-
-
-def get_1d_sincos_pos_embed_from_grid(embed_dim, pos):
-    """
-    embed_dim: output dimension for each position
-    pos: a list of positions to be encoded: size (M,)
-    out: (M, D)
-    """
-    assert embed_dim % 2 == 0
-    omega = np.arange(embed_dim // 2, dtype=np.float64)
-    omega /= embed_dim / 2.0
-    omega = 1.0 / 10000**omega  # (D/2,)
-
-    pos = pos.reshape(-1)  # (M,)
-    out = np.einsum("m,d->md", pos, omega)  # (M, D/2), outer product
-
-    emb_sin = np.sin(out)  # (M, D/2)
-    emb_cos = np.cos(out)  # (M, D/2)
-
-    emb = np.concatenate([emb_sin, emb_cos], axis=1)  # (M, D)
-    return emb
-
-
-class Basic2DPositionEmbeddingMixin(BaseMixin):
-    def __init__(
-        self, height, width, compressed_num_frames, hidden_size, text_length=0
-    ):
-        super().__init__()
-        self.height = height
-        self.width = width
-        self.spatial_length = height * width
-        self.pos_embedding = nn.Parameter(
-            torch.zeros(1, int(text_length + self.spatial_length), int(hidden_size)),
-            requires_grad=False,
-        )
-
-    def position_embedding_forward(self, position_ids, **kwargs):
-        return self.pos_embedding
-
-    def reinit(self, parent_model=None):
-        del self.transformer.position_embeddings
-        pos_embed = get_2d_sincos_pos_embed(
-            self.pos_embedding.shape[-1], self.height, self.width
-        )
-        self.pos_embedding.data[:, -self.spatial_length :].copy_(
-            torch.from_numpy(pos_embed).float().unsqueeze(0)
-        )
-
-
-class Basic3DPositionEmbeddingMixin(BaseMixin):
-    def __init__(
-        self,
-        height,
-        width,
-        compressed_num_frames,
-        hidden_size,
-        text_length=0,
-        height_interpolation=1.0,
-        width_interpolation=1.0,
-        time_interpolation=1.0,
-    ):
-        super().__init__()
-        self.height = height
-        self.width = width
-        self.text_length = text_length
-        self.compressed_num_frames = compressed_num_frames
-        self.spatial_length = height * width
-        self.num_patches = height * width * compressed_num_frames
-        self.pos_embedding = nn.Parameter(
-            torch.zeros(1, int(text_length + self.num_patches), int(hidden_size)),
-            requires_grad=False,
-        )
-        self.height_interpolation = height_interpolation
-        self.width_interpolation = width_interpolation
-        self.time_interpolation = time_interpolation
-
-    def position_embedding_forward(self, position_ids, **kwargs):
-        if kwargs["images"].shape[1] == 1:
-            return self.pos_embedding[:, : self.text_length + self.spatial_length]
-
-        return self.pos_embedding[:, : self.text_length + kwargs["seq_length"]]
-
-    def reinit(self, parent_model=None):
-        del self.transformer.position_embeddings
-        pos_embed = get_3d_sincos_pos_embed(
-            self.pos_embedding.shape[-1],
-            self.height,
-            self.width,
-            self.compressed_num_frames,
-            height_interpolation=self.height_interpolation,
-            width_interpolation=self.width_interpolation,
-            time_interpolation=self.time_interpolation,
-        )
-        pos_embed = torch.from_numpy(pos_embed).float()
-        pos_embed = rearrange(pos_embed, "t n d -> (t n) d")
-        self.pos_embedding.data[:, -self.num_patches :].copy_(pos_embed)
-
-
-def broadcat(tensors, dim=-1):
-    num_tensors = len(tensors)
-    shape_lens = set(list(map(lambda t: len(t.shape), tensors)))
-    assert len(shape_lens) == 1, "tensors must all have the same number of dimensions"
-    shape_len = list(shape_lens)[0]
-    dim = (dim + shape_len) if dim < 0 else dim
-    dims = list(zip(*map(lambda t: list(t.shape), tensors)))
-    expandable_dims = [(i, val) for i, val in enumerate(dims) if i != dim]
-    assert all(
-        [*map(lambda t: len(set(t[1])) <= 2, expandable_dims)]
-    ), "invalid dimensions for broadcastable concatentation"
-    max_dims = list(map(lambda t: (t[0], max(t[1])), expandable_dims))
-    expanded_dims = list(map(lambda t: (t[0], (t[1],) * num_tensors), max_dims))
-    expanded_dims.insert(dim, (dim, dims[dim]))
-    expandable_shapes = list(zip(*map(lambda t: t[1], expanded_dims)))
-    tensors = list(map(lambda t: t[0].expand(*t[1]), zip(tensors, expandable_shapes)))
-    return torch.cat(tensors, dim=dim)
-
-
-def rotate_half(x):
-    x = rearrange(x, "... (d r) -> ... d r", r=2)
-    x1, x2 = x.unbind(dim=-1)
-    x = torch.stack((-x2, x1), dim=-1)
-    return rearrange(x, "... d r -> ... (d r)")
-
-
-class Rotary3DPositionEmbeddingMixin(BaseMixin):
-    def __init__(
-        self,
-        height,
-        width,
-        compressed_num_frames,
-        hidden_size,
-        hidden_size_head,
-        text_length,
-        theta=10000,
-        rot_v=False,
-        height_interpolation=1.0,
-        width_interpolation=1.0,
-        time_interpolation=1.0,
-        learnable_pos_embed=False,
-    ):
-        super().__init__()
-        self.rot_v = rot_v
-
-        dim_t = hidden_size_head // 4
-        dim_h = hidden_size_head // 8 * 3
-        dim_w = hidden_size_head // 8 * 3
-
-        freqs_t = 1.0 / (
-            theta ** (torch.arange(0, dim_t, 2)[: (dim_t // 2)].float() / dim_t)
-        )
-        freqs_h = 1.0 / (
-            theta ** (torch.arange(0, dim_h, 2)[: (dim_h // 2)].float() / dim_h)
-        )
-        freqs_w = 1.0 / (
-            theta ** (torch.arange(0, dim_w, 2)[: (dim_w // 2)].float() / dim_w)
-        )
-
-        grid_t = torch.arange(compressed_num_frames, dtype=torch.float32)
-        grid_h = torch.arange(height, dtype=torch.float32)
-        grid_w = torch.arange(width, dtype=torch.float32)
-
-        freqs_t = torch.einsum("..., f -> ... f", grid_t, freqs_t)
-        freqs_h = torch.einsum("..., f -> ... f", grid_h, freqs_h)
-        freqs_w = torch.einsum("..., f -> ... f", grid_w, freqs_w)
-
-        freqs_t = repeat(freqs_t, "... n -> ... (n r)", r=2)
-        freqs_h = repeat(freqs_h, "... n -> ... (n r)", r=2)
-        freqs_w = repeat(freqs_w, "... n -> ... (n r)", r=2)
-
-        freqs = broadcat(
-            (
-                freqs_t[:, None, None, :],
-                freqs_h[None, :, None, :],
-                freqs_w[None, None, :, :],
-            ),
-            dim=-1,
-        )
-
-        freqs = freqs.contiguous()
-        self.freqs_sin = freqs.sin().cuda()
-        self.freqs_cos = freqs.cos().cuda()
-        self.text_length = text_length
-        if learnable_pos_embed:
-            num_patches = height * width * compressed_num_frames + text_length
-            self.pos_embedding = nn.Parameter(
-                torch.zeros(1, num_patches, int(hidden_size)), requires_grad=True
-            )
-        else:
-            self.pos_embedding = None
-
-    def rotary(self, t, **kwargs):
-        def reshape_freq(freqs):
-            freqs = freqs[
-                : kwargs["rope_T"], : kwargs["rope_H"], : kwargs["rope_W"]
-            ].contiguous()
-            freqs = rearrange(freqs, "t h w d -> (t h w) d")
-            freqs = freqs.unsqueeze(0).unsqueeze(0)
-            return freqs
-
-        freqs_cos = reshape_freq(self.freqs_cos).to(t.dtype)
-        freqs_sin = reshape_freq(self.freqs_sin).to(t.dtype)
-
-        return t * freqs_cos + rotate_half(t) * freqs_sin
-
-    def position_embedding_forward(self, position_ids, **kwargs):
-        if self.pos_embedding is not None:
-            return self.pos_embedding[:, : self.text_length + kwargs["seq_length"]]
-        else:
-            return None
-
-    def attention_fn(
-        self,
-        query_layer,
-        key_layer,
-        value_layer,
-        attention_mask,
-        attention_dropout=None,
-        log_attention_weights=None,
-        scaling_attention_score=True,
-        **kwargs,
-    ):
-        attention_fn_default = HOOKS_DEFAULT["attention_fn"]
-
-        query_layer = torch.cat(
-            (
-                query_layer[
-                    :,
-                    :,
-                    : kwargs["text_length"],
-                ],
-                self.rotary(
-                    query_layer[
-                        :,
-                        :,
-                        kwargs["text_length"] :,
-                    ],
-                    **kwargs,
-                ),
-            ),
-            dim=2,
-        )
-        key_layer = torch.cat(
-            (
-                key_layer[
-                    :,
-                    :,
-                    : kwargs["text_length"],
-                ],
-                self.rotary(
-                    key_layer[
-                        :,
-                        :,
-                        kwargs["text_length"] :,
-                    ],
-                    **kwargs,
-                ),
-            ),
-            dim=2,
-        )
-        if self.rot_v:
-            value_layer = torch.cat(
-                (
-                    value_layer[
-                        :,
-                        :,
-                        : kwargs["text_length"],
-                    ],
-                    self.rotary(
-                        value_layer[
-                            :,
-                            :,
-                            kwargs["text_length"] :,
-                        ],
-                        **kwargs,
-                    ),
-                ),
-                dim=2,
-            )
-
-        return attention_fn_default(
-            query_layer,
-            key_layer,
-            value_layer,
-            attention_mask,
-            attention_dropout=attention_dropout,
-            log_attention_weights=log_attention_weights,
-            scaling_attention_score=scaling_attention_score,
-            **kwargs,
-        )
-
-
-def modulate(x, shift, scale):
-    return x * (1 + scale.unsqueeze(1)) + shift.unsqueeze(1)
-
-
-def unpatchify(x, c, patch_size, w, h, **kwargs):
-    """
-    x: (N, T/2 * S, patch_size**3 * C)
-    imgs: (N, T, H, W, C)
-
-    patch_size 被拆解为三个不同的维度 (o, p, q)，分别对应了深度（o）、高度（p）和宽度（q）。这使得 patch 大小在不同维度上可以不相等，增加了灵活性。
-    """
-
-    imgs = rearrange(
-        x,
-        "b (t h w) (c o p q) -> b (t o) c (h p) (w q)",
-        c=c,
-        o=patch_size[0],
-        p=patch_size[1],
-        q=patch_size[2],
-        t=kwargs["rope_T"],
-        h=kwargs["rope_H"],
-        w=kwargs["rope_W"],
-    )
-
-    return imgs
-
-
-class FinalLayerMixin(BaseMixin):
-    def __init__(
-        self,
-        hidden_size,
-        time_embed_dim,
-        patch_size,
-        out_channels,
-        latent_width,
-        latent_height,
-        elementwise_affine,
-    ):
-        super().__init__()
-        self.hidden_size = hidden_size
-        self.patch_size = patch_size
-        self.out_channels = out_channels
-        self.norm_final = nn.LayerNorm(
-            hidden_size, elementwise_affine=elementwise_affine, eps=1e-6
-        )
-        self.linear = nn.Linear(
-            hidden_size, reduce(mul, patch_size) * out_channels, bias=True
-        )
-        self.adaLN_modulation = nn.Sequential(
-            nn.SiLU(), nn.Linear(time_embed_dim, 2 * hidden_size, bias=True)
-        )
-
-    def final_forward(self, logits, **kwargs):
-        x, emb = (
-            logits[:, kwargs["text_length"] :, :],
-            kwargs["emb"],
-        )  # x:(b,(t n),d),只取了x中后面images的部分
-        shift, scale = self.adaLN_modulation(emb).chunk(2, dim=1)
-        x = modulate(self.norm_final(x), shift, scale)
-        x = self.linear(x)
-
-        return unpatchify(
-            x,
-            c=self.out_channels,
-            patch_size=self.patch_size,
-            w=kwargs["rope_W"],
-            h=kwargs["rope_H"],
-            **kwargs,
-        )
-
-    def reinit(self, parent_model=None):
-        nn.init.xavier_uniform_(self.linear.weight)
-        nn.init.constant_(self.linear.bias, 0)
-
-
-class SwiGLUMixin(BaseMixin):
-    def __init__(self, num_layers, in_features, hidden_features, bias=False):
-        super().__init__()
-        self.w2 = nn.ModuleList(
-            [
-                ColumnParallelLinear(
-                    in_features,
-                    hidden_features,
-                    gather_output=False,
-                    bias=bias,
-                    module=self,
-                    name="dense_h_to_4h_gate",
-                )
-                for i in range(num_layers)
-            ]
-        )
-
-    def mlp_forward(self, hidden_states, **kw_args):
-        x = hidden_states
-        origin = self.transformer.layers[kw_args["layer_id"]].mlp
-        x1 = origin.dense_h_to_4h(x)
-        x2 = self.w2[kw_args["layer_id"]](x)
-        hidden = origin.activation_func(x2) * x1
-        x = origin.dense_4h_to_h(hidden)
-        return x
-
-
-class AdaLNMixin(BaseMixin):
-    def __init__(
-        self,
-        hidden_size,
-        num_layers,
-        time_embed_dim,
-        compressed_num_frames,
-        qk_ln=True,
-        hidden_size_head=None,
-        elementwise_affine=True,
-    ):
-        super().__init__()
-        self.num_layers = num_layers
-        self.compressed_num_frames = compressed_num_frames
-
-        self.adaLN_modulations = nn.ModuleList(
-            [
-                nn.Sequential(nn.SiLU(), nn.Linear(time_embed_dim, 12 * hidden_size))
-                for _ in range(num_layers)
-            ]
-        )
-
-        self.qk_ln = qk_ln
-        if qk_ln:
-            self.query_layernorm_list = nn.ModuleList(
-                [
-                    LayerNorm(
-                        hidden_size_head,
-                        eps=1e-6,
-                        elementwise_affine=elementwise_affine,
-                    )
-                    for _ in range(num_layers)
-                ]
-            )
-            self.key_layernorm_list = nn.ModuleList(
-                [
-                    LayerNorm(
-                        hidden_size_head,
-                        eps=1e-6,
-                        elementwise_affine=elementwise_affine,
-                    )
-                    for _ in range(num_layers)
-                ]
-            )
-
-    def layer_forward(
-        self,
-        hidden_states,
-        mask,
-        *args,
-        **kwargs,
-    ):
-        text_length = kwargs["text_length"]
-        # hidden_states (b,(n_t+t*n_i),d)
-        text_hidden_states = hidden_states[:, :text_length]  # (b,n,d)
-        img_hidden_states = hidden_states[:, text_length:]  # (b,(t n),d)
-
-        layer = self.transformer.layers[kwargs["layer_id"]]
-        adaLN_modulation = self.adaLN_modulations[kwargs["layer_id"]]
-
-        (
-            shift_msa,
-            scale_msa,
-            gate_msa,
-            shift_mlp,
-            scale_mlp,
-            gate_mlp,
-            text_shift_msa,
-            text_scale_msa,
-            text_gate_msa,
-            text_shift_mlp,
-            text_scale_mlp,
-            text_gate_mlp,
-        ) = adaLN_modulation(kwargs["emb"]).chunk(12, dim=1)
-        gate_msa, gate_mlp, text_gate_msa, text_gate_mlp = (
-            gate_msa.unsqueeze(1),
-            gate_mlp.unsqueeze(1),
-            text_gate_msa.unsqueeze(1),
-            text_gate_mlp.unsqueeze(1),
-        )
-
-        # self full attention (b,(t n),d)
-        img_attention_input = layer.input_layernorm(img_hidden_states)
-        text_attention_input = layer.input_layernorm(text_hidden_states)
-        img_attention_input = modulate(img_attention_input, shift_msa, scale_msa)
-        text_attention_input = modulate(
-            text_attention_input, text_shift_msa, text_scale_msa
-        )
-
-        attention_input = torch.cat(
-            (text_attention_input, img_attention_input), dim=1
-        )  # (b,n_t+t*n_i,d)
-        attention_output = layer.attention(attention_input, mask, **kwargs)
-        text_attention_output = attention_output[:, :text_length]  # (b,n,d)
-        img_attention_output = attention_output[:, text_length:]  # (b,(t n),d)
-        if self.transformer.layernorm_order == "sandwich":
-            text_attention_output = layer.third_layernorm(text_attention_output)
-            img_attention_output = layer.third_layernorm(img_attention_output)
-        img_hidden_states = (
-            img_hidden_states + gate_msa * img_attention_output
-        )  # (b,(t n),d)
-        text_hidden_states = (
-            text_hidden_states + text_gate_msa * text_attention_output
-        )  # (b,n,d)
-
-        # mlp (b,(t n),d)
-        img_mlp_input = layer.post_attention_layernorm(
-            img_hidden_states
-        )  # vision (b,(t n),d)
-        text_mlp_input = layer.post_attention_layernorm(
-            text_hidden_states
-        )  # language (b,n,d)
-        img_mlp_input = modulate(img_mlp_input, shift_mlp, scale_mlp)
-        text_mlp_input = modulate(text_mlp_input, text_shift_mlp, text_scale_mlp)
-        mlp_input = torch.cat(
-            (text_mlp_input, img_mlp_input), dim=1
-        )  # (b,(n_t+t*n_i),d
-        mlp_output = layer.mlp(mlp_input, **kwargs)
-        img_mlp_output = mlp_output[:, text_length:]  # vision (b,(t n),d)
-        text_mlp_output = mlp_output[:, :text_length]  # language (b,n,d)
-        if self.transformer.layernorm_order == "sandwich":
-            text_mlp_output = layer.fourth_layernorm(text_mlp_output)
-            img_mlp_output = layer.fourth_layernorm(img_mlp_output)
-
-        img_hidden_states = (
-            img_hidden_states + gate_mlp * img_mlp_output
-        )  # vision (b,(t n),d)
-        text_hidden_states = (
-            text_hidden_states + text_gate_mlp * text_mlp_output
-        )  # language (b,n,d)
-
-        hidden_states = torch.cat(
-            (text_hidden_states, img_hidden_states), dim=1
-        )  # (b,(n_t+t*n_i),d)
-        return hidden_states
-
-    def reinit(self, parent_model=None):
-        for layer in self.adaLN_modulations:
-            nn.init.constant_(layer[-1].weight, 0)
-            nn.init.constant_(layer[-1].bias, 0)
-
-    @non_conflict
-    def attention_fn(
-        self,
-        query_layer,
-        key_layer,
-        value_layer,
-        attention_mask,
-        attention_dropout=None,
-        log_attention_weights=None,
-        scaling_attention_score=True,
-        old_impl=attention_fn_default,
-        **kwargs,
-    ):
-        if self.qk_ln:
-            query_layernorm = self.query_layernorm_list[kwargs["layer_id"]]
-            key_layernorm = self.key_layernorm_list[kwargs["layer_id"]]
-            query_layer = query_layernorm(query_layer)
-            key_layer = key_layernorm(key_layer)
-
-        return old_impl(
-            query_layer,
-            key_layer,
-            value_layer,
-            attention_mask,
-            attention_dropout=attention_dropout,
-            log_attention_weights=log_attention_weights,
-            scaling_attention_score=scaling_attention_score,
-            **kwargs,
-        )
-
-
-str_to_dtype = {"fp32": torch.float32, "fp16": torch.float16, "bf16": torch.bfloat16}
-
-
-class DiffusionTransformer(BaseModel):
-    def __init__(
-        self,
-        transformer_args,
-        num_frames,
-        time_compressed_rate,
-        latent_width,
-        latent_height,
-        patch_size,
-        in_channels,
-        out_channels,
-        hidden_size,
-        num_layers,
-        num_attention_heads,
-        elementwise_affine,
-        time_embed_dim=None,
-        num_classes=None,
-        modules={},
-        input_time="adaln",
-        adm_in_channels=None,
-        parallel_output=True,
-        height_interpolation=1.0,
-        width_interpolation=1.0,
-        time_interpolation=1.0,
-        use_SwiGLU=False,
-        use_RMSNorm=False,
-        ofs_embed_dim=None,
-        **kwargs,
-    ):
-        self.latent_width = latent_width
-        self.latent_height = latent_height
-        self.patch_size = patch_size
-        self.num_frames = num_frames
-        self.time_compressed_rate = time_compressed_rate
-        self.spatial_length = (
-            latent_width * latent_height // reduce(mul, patch_size[1:])
-        )
-        self.in_channels = in_channels
-        self.out_channels = out_channels
-        self.hidden_size = hidden_size
-        self.model_channels = hidden_size
-        self.time_embed_dim = (
-            time_embed_dim if time_embed_dim is not None else hidden_size
-        )
-        self.ofs_embed_dim = ofs_embed_dim
-        self.num_classes = num_classes
-        self.adm_in_channels = adm_in_channels
-        self.input_time = input_time
-        self.num_layers = num_layers
-        self.num_attention_heads = num_attention_heads
-        self.is_decoder = transformer_args.is_decoder
-        self.elementwise_affine = elementwise_affine
-        self.height_interpolation = height_interpolation
-        self.width_interpolation = width_interpolation
-        self.time_interpolation = time_interpolation
-        self.inner_hidden_size = hidden_size * 4
-        try:
-            self.dtype = str_to_dtype[kwargs.pop("dtype")]
-        except:
-            self.dtype = torch.float32
-
-        if use_SwiGLU:
-            kwargs["activation_func"] = F.silu
-        elif "activation_func" not in kwargs:
-            approx_gelu = nn.GELU(approximate="tanh")
-            kwargs["activation_func"] = approx_gelu
-
-        if use_RMSNorm:
-            kwargs["layernorm"] = RMSNorm
-        else:
-            kwargs["layernorm"] = partial(
-                LayerNorm, elementwise_affine=elementwise_affine, eps=1e-6
-            )
-
-        transformer_args.num_layers = num_layers
-        transformer_args.hidden_size = hidden_size
-        transformer_args.num_attention_heads = num_attention_heads
-        transformer_args.parallel_output = parallel_output
-        super().__init__(args=transformer_args, transformer=None, **kwargs)
-
-        module_configs = modules
-        self._build_modules(module_configs)
-
-        if use_SwiGLU:
-            self.add_mixin(
-                "swiglu",
-                SwiGLUMixin(
-                    num_layers, hidden_size, self.inner_hidden_size, bias=False
-                ),
-                reinit=True,
-            )
-
-    def _build_modules(self, module_configs):
-        model_channels = self.hidden_size
-        time_embed_dim = self.time_embed_dim
-        self.time_embed = nn.Sequential(
-            linear(model_channels, time_embed_dim),
-            nn.SiLU(),
-            linear(time_embed_dim, time_embed_dim),
-        )
-
-        if self.ofs_embed_dim is not None:
-            self.ofs_embed = nn.Sequential(
-                linear(self.ofs_embed_dim, self.ofs_embed_dim),
-                nn.SiLU(),
-                linear(self.ofs_embed_dim, self.ofs_embed_dim),
-            )
-
-        if self.num_classes is not None:
-            if isinstance(self.num_classes, int):
-                self.label_emb = nn.Embedding(self.num_classes, time_embed_dim)
-            elif self.num_classes == "continuous":
-                print("setting up linear c_adm embedding layer")
-                self.label_emb = nn.Linear(1, time_embed_dim)
-            elif self.num_classes == "timestep":
-                self.label_emb = nn.Sequential(
-                    Timestep(model_channels),
-                    nn.Sequential(
-                        linear(model_channels, time_embed_dim),
-                        nn.SiLU(),
-                        linear(time_embed_dim, time_embed_dim),
-                    ),
-                )
-            elif self.num_classes == "sequential":
-                assert self.adm_in_channels is not None
-                self.label_emb = nn.Sequential(
-                    nn.Sequential(
-                        linear(self.adm_in_channels, time_embed_dim),
-                        nn.SiLU(),
-                        linear(time_embed_dim, time_embed_dim),
-                    )
-                )
-            else:
-                raise ValueError()
-
-        pos_embed_config = module_configs["pos_embed_config"]
-        self.add_mixin(
-            "pos_embed",
-            instantiate_from_config(
-                pos_embed_config,
-                height=self.latent_height // self.patch_size[1],
-                width=self.latent_width // self.patch_size[2],
-                compressed_num_frames=(self.num_frames - 1) // self.time_compressed_rate
-                + 1,
-                hidden_size=self.hidden_size,
-                height_interpolation=self.height_interpolation,
-                width_interpolation=self.width_interpolation,
-                time_interpolation=self.time_interpolation,
-            ),
-            reinit=True,
-        )
-
-        patch_embed_config = module_configs["patch_embed_config"]
-        self.add_mixin(
-            "patch_embed",
-            instantiate_from_config(
-                patch_embed_config,
-                patch_size=self.patch_size,
-                hidden_size=self.hidden_size,
-                in_channels=self.in_channels,
-            ),
-            reinit=True,
-        )
-        if self.input_time == "adaln":
-            adaln_layer_config = module_configs["adaln_layer_config"]
-            self.add_mixin(
-                "adaln_layer",
-                instantiate_from_config(
-                    adaln_layer_config,
-                    hidden_size=self.hidden_size,
-                    num_layers=self.num_layers,
-                    compressed_num_frames=(self.num_frames - 1)
-                    // self.time_compressed_rate
-                    + 1,
-                    hidden_size_head=self.hidden_size // self.num_attention_heads,
-                    time_embed_dim=self.time_embed_dim,
-                    elementwise_affine=self.elementwise_affine,
-                ),
-            )
-        else:
-            raise NotImplementedError
-        final_layer_config = module_configs["final_layer_config"]
-        self.add_mixin(
-            "final_layer",
-            instantiate_from_config(
-                final_layer_config,
-                hidden_size=self.hidden_size,
-                patch_size=self.patch_size,
-                out_channels=self.out_channels,
-                time_embed_dim=self.time_embed_dim,
-                latent_width=self.latent_width,
-                latent_height=self.latent_height,
-                elementwise_affine=self.elementwise_affine,
-            ),
-            reinit=True,
-        )
-
-        return
-
-    def forward(self, x, timesteps=None, context=None, y=None, **kwargs):
-        b, t, d, h, w = x.shape
-        if x.dtype != self.dtype:
-            x = x.to(self.dtype)
-        if "concat_images" in kwargs and kwargs["concat_images"] is not None:
-            if kwargs["concat_images"].shape[0] != x.shape[0]:
-                concat_images = kwargs["concat_images"].repeat(2, 1, 1, 1, 1)
-            else:
-                concat_images = kwargs["concat_images"]
-            x = torch.cat([x, concat_images], dim=2)
-        assert (y is not None) == (
-            self.num_classes is not None
-        ), "must specify y if and only if the model is class-conditional"
-        t_emb = timestep_embedding(
-            timesteps, self.model_channels, repeat_only=False, dtype=self.dtype
-        )
-        emb = self.time_embed(t_emb)
-
-        if self.num_classes is not None:
-            assert x.shape[0] % y.shape[0] == 0
-            y = y.repeat_interleave(x.shape[0] // y.shape[0], dim=0)
-            emb = emb + self.label_emb(y)
-
-        if self.ofs_embed_dim is not None:
-            ofs_emb = timestep_embedding(
-                kwargs["ofs"], self.ofs_embed_dim, repeat_only=False, dtype=self.dtype
-            )
-            ofs_emb = self.ofs_embed(ofs_emb)
-            emb = emb + ofs_emb
-
-        kwargs["seq_length"] = t * h * w // reduce(mul, self.patch_size)
-        kwargs["images"] = x
-        kwargs["emb"] = emb
-        kwargs["encoder_outputs"] = context
-        kwargs["text_length"] = context.shape[1]
-
-        kwargs["rope_T"] = t // self.patch_size[0]
-        kwargs["rope_H"] = h // self.patch_size[1]
-        kwargs["rope_W"] = w // self.patch_size[2]
-
-        kwargs["input_ids"] = kwargs["position_ids"] = kwargs["attention_mask"] = (
-            torch.ones((1, 1)).to(x.dtype)
-        )
-        output = super().forward(**kwargs)[0]
-        return output
diff --git a/videotuna/models/cogvideo_sat/sgm/__init__.py b/videotuna/models/cogvideo_sat/sgm/__init__.py
deleted file mode 100644
index 1c448236..00000000
--- a/videotuna/models/cogvideo_sat/sgm/__init__.py
+++ /dev/null
@@ -1,4 +0,0 @@
-from .models import AutoencodingEngine
-from .util import get_configs_path, instantiate_from_config
-
-__version__ = "0.1.0"
diff --git a/videotuna/models/cogvideo_sat/sgm/lr_scheduler.py b/videotuna/models/cogvideo_sat/sgm/lr_scheduler.py
deleted file mode 100644
index b2f4d384..00000000
--- a/videotuna/models/cogvideo_sat/sgm/lr_scheduler.py
+++ /dev/null
@@ -1,135 +0,0 @@
-import numpy as np
-
-
-class LambdaWarmUpCosineScheduler:
-    """
-    note: use with a base_lr of 1.0
-    """
-
-    def __init__(
-        self,
-        warm_up_steps,
-        lr_min,
-        lr_max,
-        lr_start,
-        max_decay_steps,
-        verbosity_interval=0,
-    ):
-        self.lr_warm_up_steps = warm_up_steps
-        self.lr_start = lr_start
-        self.lr_min = lr_min
-        self.lr_max = lr_max
-        self.lr_max_decay_steps = max_decay_steps
-        self.last_lr = 0.0
-        self.verbosity_interval = verbosity_interval
-
-    def schedule(self, n, **kwargs):
-        if self.verbosity_interval > 0:
-            if n % self.verbosity_interval == 0:
-                print(f"current step: {n}, recent lr-multiplier: {self.last_lr}")
-        if n < self.lr_warm_up_steps:
-            lr = (
-                self.lr_max - self.lr_start
-            ) / self.lr_warm_up_steps * n + self.lr_start
-            self.last_lr = lr
-            return lr
-        else:
-            t = (n - self.lr_warm_up_steps) / (
-                self.lr_max_decay_steps - self.lr_warm_up_steps
-            )
-            t = min(t, 1.0)
-            lr = self.lr_min + 0.5 * (self.lr_max - self.lr_min) * (
-                1 + np.cos(t * np.pi)
-            )
-            self.last_lr = lr
-            return lr
-
-    def __call__(self, n, **kwargs):
-        return self.schedule(n, **kwargs)
-
-
-class LambdaWarmUpCosineScheduler2:
-    """
-    supports repeated iterations, configurable via lists
-    note: use with a base_lr of 1.0.
-    """
-
-    def __init__(
-        self, warm_up_steps, f_min, f_max, f_start, cycle_lengths, verbosity_interval=0
-    ):
-        assert (
-            len(warm_up_steps)
-            == len(f_min)
-            == len(f_max)
-            == len(f_start)
-            == len(cycle_lengths)
-        )
-        self.lr_warm_up_steps = warm_up_steps
-        self.f_start = f_start
-        self.f_min = f_min
-        self.f_max = f_max
-        self.cycle_lengths = cycle_lengths
-        self.cum_cycles = np.cumsum([0] + list(self.cycle_lengths))
-        self.last_f = 0.0
-        self.verbosity_interval = verbosity_interval
-
-    def find_in_interval(self, n):
-        interval = 0
-        for cl in self.cum_cycles[1:]:
-            if n <= cl:
-                return interval
-            interval += 1
-
-    def schedule(self, n, **kwargs):
-        cycle = self.find_in_interval(n)
-        n = n - self.cum_cycles[cycle]
-        if self.verbosity_interval > 0:
-            if n % self.verbosity_interval == 0:
-                print(
-                    f"current step: {n}, recent lr-multiplier: {self.last_f}, "
-                    f"current cycle {cycle}"
-                )
-        if n < self.lr_warm_up_steps[cycle]:
-            f = (self.f_max[cycle] - self.f_start[cycle]) / self.lr_warm_up_steps[
-                cycle
-            ] * n + self.f_start[cycle]
-            self.last_f = f
-            return f
-        else:
-            t = (n - self.lr_warm_up_steps[cycle]) / (
-                self.cycle_lengths[cycle] - self.lr_warm_up_steps[cycle]
-            )
-            t = min(t, 1.0)
-            f = self.f_min[cycle] + 0.5 * (self.f_max[cycle] - self.f_min[cycle]) * (
-                1 + np.cos(t * np.pi)
-            )
-            self.last_f = f
-            return f
-
-    def __call__(self, n, **kwargs):
-        return self.schedule(n, **kwargs)
-
-
-class LambdaLinearScheduler(LambdaWarmUpCosineScheduler2):
-    def schedule(self, n, **kwargs):
-        cycle = self.find_in_interval(n)
-        n = n - self.cum_cycles[cycle]
-        if self.verbosity_interval > 0:
-            if n % self.verbosity_interval == 0:
-                print(
-                    f"current step: {n}, recent lr-multiplier: {self.last_f}, "
-                    f"current cycle {cycle}"
-                )
-
-        if n < self.lr_warm_up_steps[cycle]:
-            f = (self.f_max[cycle] - self.f_start[cycle]) / self.lr_warm_up_steps[
-                cycle
-            ] * n + self.f_start[cycle]
-            self.last_f = f
-            return f
-        else:
-            f = self.f_min[cycle] + (self.f_max[cycle] - self.f_min[cycle]) * (
-                self.cycle_lengths[cycle] - n
-            ) / (self.cycle_lengths[cycle])
-            self.last_f = f
-            return f
diff --git a/videotuna/models/cogvideo_sat/sgm/models/__init__.py b/videotuna/models/cogvideo_sat/sgm/models/__init__.py
deleted file mode 100644
index e72b8659..00000000
--- a/videotuna/models/cogvideo_sat/sgm/models/__init__.py
+++ /dev/null
@@ -1 +0,0 @@
-from .autoencoder import AutoencodingEngine
diff --git a/videotuna/models/cogvideo_sat/sgm/models/autoencoder.py b/videotuna/models/cogvideo_sat/sgm/models/autoencoder.py
deleted file mode 100644
index 5b6241b3..00000000
--- a/videotuna/models/cogvideo_sat/sgm/models/autoencoder.py
+++ /dev/null
@@ -1,589 +0,0 @@
-import logging
-import math
-import random
-import re
-from abc import abstractmethod
-from contextlib import contextmanager
-from typing import Any, Dict, List, Optional, Tuple, Union
-
-import numpy as np
-import pytorch_lightning as pl
-import torch
-import torch.distributed
-import torch.nn as nn
-from einops import rearrange
-from packaging import version
-
-from ..modules.autoencoding.regularizers import AbstractRegularizer
-from ..modules.cp_enc_dec import _conv_gather, _conv_split
-from ..modules.ema import LitEma
-from ..util import (
-    default,
-    get_context_parallel_group,
-    get_context_parallel_group_rank,
-    get_nested_attribute,
-    get_obj_from_str,
-    initialize_context_parallel,
-    instantiate_from_config,
-    is_context_parallel_initialized,
-)
-
-logpy = logging.getLogger(__name__)
-
-
-class AbstractAutoencoder(pl.LightningModule):
-    """
-    This is the base class for all autoencoders, including image autoencoders, image autoencoders with discriminators,
-    unCLIP models, etc. Hence, it is fairly general, and specific features
-    (e.g. discriminator training, encoding, decoding) must be implemented in subclasses.
-    """
-
-    def __init__(
-        self,
-        ema_decay: Union[None, float] = None,
-        monitor: Union[None, str] = None,
-        input_key: str = "jpg",
-    ):
-        super().__init__()
-
-        self.input_key = input_key
-        self.use_ema = ema_decay is not None
-        if monitor is not None:
-            self.monitor = monitor
-
-        if self.use_ema:
-            self.model_ema = LitEma(self, decay=ema_decay)
-            logpy.info(f"Keeping EMAs of {len(list(self.model_ema.buffers()))}.")
-
-        if version.parse(torch.__version__) >= version.parse("2.0.0"):
-            self.automatic_optimization = False
-
-    def apply_ckpt(self, ckpt: Union[None, str, dict]):
-        if ckpt is None:
-            return
-        if isinstance(ckpt, str):
-            ckpt = {
-                "target": "sgm.modules.checkpoint.CheckpointEngine",
-                "params": {"ckpt_path": ckpt},
-            }
-        engine = instantiate_from_config(ckpt)
-        engine(self)
-
-    @abstractmethod
-    def get_input(self, batch) -> Any:
-        raise NotImplementedError()
-
-    def on_train_batch_end(self, *args, **kwargs):
-        # for EMA computation
-        if self.use_ema:
-            self.model_ema(self)
-
-    @contextmanager
-    def ema_scope(self, context=None):
-        if self.use_ema:
-            self.model_ema.store(self.parameters())
-            self.model_ema.copy_to(self)
-            if context is not None:
-                logpy.info(f"{context}: Switched to EMA weights")
-        try:
-            yield None
-        finally:
-            if self.use_ema:
-                self.model_ema.restore(self.parameters())
-                if context is not None:
-                    logpy.info(f"{context}: Restored training weights")
-
-    @abstractmethod
-    def encode(self, *args, **kwargs) -> torch.Tensor:
-        raise NotImplementedError("encode()-method of abstract base class called")
-
-    @abstractmethod
-    def decode(self, *args, **kwargs) -> torch.Tensor:
-        raise NotImplementedError("decode()-method of abstract base class called")
-
-    def instantiate_optimizer_from_config(self, params, lr, cfg):
-        logpy.info(f"loading >>> {cfg['target']} <<< optimizer from config")
-        return get_obj_from_str(cfg["target"])(
-            params, lr=lr, **cfg.get("params", dict())
-        )
-
-    def configure_optimizers(self) -> Any:
-        raise NotImplementedError()
-
-
-class AutoencodingEngine(AbstractAutoencoder):
-    """
-    Base class for all image autoencoders that we train, like VQGAN or AutoencoderKL
-    (we also restore them explicitly as special cases for legacy reasons).
-    Regularizations such as KL or VQ are moved to the regularizer class.
-    """
-
-    def __init__(
-        self,
-        *args,
-        encoder_config: Dict,
-        decoder_config: Dict,
-        loss_config: Dict,
-        regularizer_config: Dict,
-        optimizer_config: Union[Dict, None] = None,
-        lr_g_factor: float = 1.0,
-        trainable_ae_params: Optional[List[List[str]]] = None,
-        ae_optimizer_args: Optional[List[dict]] = None,
-        trainable_disc_params: Optional[List[List[str]]] = None,
-        disc_optimizer_args: Optional[List[dict]] = None,
-        disc_start_iter: int = 0,
-        diff_boost_factor: float = 3.0,
-        ckpt_engine: Union[None, str, dict] = None,
-        ckpt_path: Optional[str] = None,
-        additional_decode_keys: Optional[List[str]] = None,
-        **kwargs,
-    ):
-        super().__init__(*args, **kwargs)
-        self.automatic_optimization = False  # pytorch lightning
-
-        self.encoder: torch.nn.Module = instantiate_from_config(encoder_config)
-        self.decoder: torch.nn.Module = instantiate_from_config(decoder_config)
-        self.loss: torch.nn.Module = instantiate_from_config(loss_config)
-        self.regularization: AbstractRegularizer = instantiate_from_config(
-            regularizer_config
-        )
-        self.optimizer_config = default(
-            optimizer_config, {"target": "torch.optim.Adam"}
-        )
-        self.diff_boost_factor = diff_boost_factor
-        self.disc_start_iter = disc_start_iter
-        self.lr_g_factor = lr_g_factor
-        self.trainable_ae_params = trainable_ae_params
-        if self.trainable_ae_params is not None:
-            self.ae_optimizer_args = default(
-                ae_optimizer_args,
-                [{} for _ in range(len(self.trainable_ae_params))],
-            )
-            assert len(self.ae_optimizer_args) == len(self.trainable_ae_params)
-        else:
-            self.ae_optimizer_args = [{}]  # makes type consistent
-
-        self.trainable_disc_params = trainable_disc_params
-        if self.trainable_disc_params is not None:
-            self.disc_optimizer_args = default(
-                disc_optimizer_args,
-                [{} for _ in range(len(self.trainable_disc_params))],
-            )
-            assert len(self.disc_optimizer_args) == len(self.trainable_disc_params)
-        else:
-            self.disc_optimizer_args = [{}]  # makes type consistent
-
-        if ckpt_path is not None:
-            assert ckpt_engine is None, "Can't set ckpt_engine and ckpt_path"
-            logpy.warning("Checkpoint path is deprecated, use `checkpoint_egnine` instead")
-        self.apply_ckpt(default(ckpt_path, ckpt_engine))
-        self.additional_decode_keys = set(default(additional_decode_keys, []))
-
-    def get_input(self, batch: Dict) -> torch.Tensor:
-        # assuming unified data format, dataloader returns a dict.
-        # image tensors should be scaled to -1 ... 1 and in channels-first
-        # format (e.g., bchw instead if bhwc)
-        return batch[self.input_key]
-
-    def get_autoencoder_params(self) -> list:
-        params = []
-        if hasattr(self.loss, "get_trainable_autoencoder_parameters"):
-            params += list(self.loss.get_trainable_autoencoder_parameters())
-        if hasattr(self.regularization, "get_trainable_parameters"):
-            params += list(self.regularization.get_trainable_parameters())
-        params = params + list(self.encoder.parameters())
-        params = params + list(self.decoder.parameters())
-        return params
-
-    def get_discriminator_params(self) -> list:
-        if hasattr(self.loss, "get_trainable_parameters"):
-            params = list(self.loss.get_trainable_parameters())  # e.g., discriminator
-        else:
-            params = []
-        return params
-
-    def get_last_layer(self):
-        return self.decoder.get_last_layer()
-
-    def encode(
-        self,
-        x: torch.Tensor,
-        return_reg_log: bool = False,
-        unregularized: bool = False,
-        **kwargs,
-    ) -> Union[torch.Tensor, Tuple[torch.Tensor, dict]]:
-        z = self.encoder(x, **kwargs)
-        if unregularized:
-            return z, dict()
-        z, reg_log = self.regularization(z)
-        if return_reg_log:
-            return z, reg_log
-        return z
-
-    def decode(self, z: torch.Tensor, **kwargs) -> torch.Tensor:
-        x = self.decoder(z, **kwargs)
-        return x
-
-    def forward(
-        self, x: torch.Tensor, **additional_decode_kwargs
-    ) -> Tuple[torch.Tensor, torch.Tensor, dict]:
-        z, reg_log = self.encode(x, return_reg_log=True)
-        dec = self.decode(z, **additional_decode_kwargs)
-        return z, dec, reg_log
-
-    def inner_training_step(
-        self, batch: dict, batch_idx: int, optimizer_idx: int = 0
-    ) -> torch.Tensor:
-        x = self.get_input(batch)
-        additional_decode_kwargs = {
-            key: batch[key] for key in self.additional_decode_keys.intersection(batch)
-        }
-        z, xrec, regularization_log = self(x, **additional_decode_kwargs)
-        if hasattr(self.loss, "forward_keys"):
-            extra_info = {
-                "z": z,
-                "optimizer_idx": optimizer_idx,
-                "global_step": self.global_step,
-                "last_layer": self.get_last_layer(),
-                "split": "train",
-                "regularization_log": regularization_log,
-                "autoencoder": self,
-            }
-            extra_info = {k: extra_info[k] for k in self.loss.forward_keys}
-        else:
-            extra_info = dict()
-
-        if optimizer_idx == 0:
-            # autoencode
-            out_loss = self.loss(x, xrec, **extra_info)
-            if isinstance(out_loss, tuple):
-                aeloss, log_dict_ae = out_loss
-            else:
-                # simple loss function
-                aeloss = out_loss
-                log_dict_ae = {"train/loss/rec": aeloss.detach()}
-
-            self.log_dict(
-                log_dict_ae,
-                prog_bar=False,
-                logger=True,
-                on_step=True,
-                on_epoch=True,
-                sync_dist=False,
-            )
-            self.log(
-                "loss",
-                aeloss.mean().detach(),
-                prog_bar=True,
-                logger=False,
-                on_epoch=False,
-                on_step=True,
-            )
-            return aeloss
-        elif optimizer_idx == 1:
-            # discriminator
-            discloss, log_dict_disc = self.loss(x, xrec, **extra_info)
-            # -> discriminator always needs to return a tuple
-            self.log_dict(
-                log_dict_disc, prog_bar=False, logger=True, on_step=True, on_epoch=True
-            )
-            return discloss
-        else:
-            raise NotImplementedError(f"Unknown optimizer {optimizer_idx}")
-
-    def training_step(self, batch: dict, batch_idx: int):
-        opts = self.optimizers()
-        if not isinstance(opts, list):
-            # Non-adversarial case
-            opts = [opts]
-        optimizer_idx = batch_idx % len(opts)
-        if self.global_step < self.disc_start_iter:
-            optimizer_idx = 0
-        opt = opts[optimizer_idx]
-        opt.zero_grad()
-        with opt.toggle_model():
-            loss = self.inner_training_step(
-                batch, batch_idx, optimizer_idx=optimizer_idx
-            )
-            self.manual_backward(loss)
-        opt.step()
-
-    def validation_step(self, batch: dict, batch_idx: int) -> Dict:
-        log_dict = self._validation_step(batch, batch_idx)
-        with self.ema_scope():
-            log_dict_ema = self._validation_step(batch, batch_idx, postfix="_ema")
-            log_dict.update(log_dict_ema)
-        return log_dict
-
-    def _validation_step(self, batch: dict, batch_idx: int, postfix: str = "") -> Dict:
-        x = self.get_input(batch)
-
-        z, xrec, regularization_log = self(x)
-        if hasattr(self.loss, "forward_keys"):
-            extra_info = {
-                "z": z,
-                "optimizer_idx": 0,
-                "global_step": self.global_step,
-                "last_layer": self.get_last_layer(),
-                "split": "val" + postfix,
-                "regularization_log": regularization_log,
-                "autoencoder": self,
-            }
-            extra_info = {k: extra_info[k] for k in self.loss.forward_keys}
-        else:
-            extra_info = dict()
-        out_loss = self.loss(x, xrec, **extra_info)
-        if isinstance(out_loss, tuple):
-            aeloss, log_dict_ae = out_loss
-        else:
-            # simple loss function
-            aeloss = out_loss
-            log_dict_ae = {f"val{postfix}/loss/rec": aeloss.detach()}
-        full_log_dict = log_dict_ae
-
-        if "optimizer_idx" in extra_info:
-            extra_info["optimizer_idx"] = 1
-            discloss, log_dict_disc = self.loss(x, xrec, **extra_info)
-            full_log_dict.update(log_dict_disc)
-        self.log(
-            f"val{postfix}/loss/rec",
-            log_dict_ae[f"val{postfix}/loss/rec"],
-            sync_dist=True,
-        )
-        self.log_dict(full_log_dict, sync_dist=True)
-        return full_log_dict
-
-    def get_param_groups(
-        self, parameter_names: List[List[str]], optimizer_args: List[dict]
-    ) -> Tuple[List[Dict[str, Any]], int]:
-        groups = []
-        num_params = 0
-        for names, args in zip(parameter_names, optimizer_args):
-            params = []
-            for pattern_ in names:
-                pattern_params = []
-                pattern = re.compile(pattern_)
-                for p_name, param in self.named_parameters():
-                    if re.match(pattern, p_name):
-                        pattern_params.append(param)
-                        num_params += param.numel()
-                if len(pattern_params) == 0:
-                    logpy.warning(f"Did not find parameters for pattern {pattern_}")
-                params.extend(pattern_params)
-            groups.append({"params": params, **args})
-        return groups, num_params
-
-    def configure_optimizers(self) -> List[torch.optim.Optimizer]:
-        if self.trainable_ae_params is None:
-            ae_params = self.get_autoencoder_params()
-        else:
-            ae_params, num_ae_params = self.get_param_groups(
-                self.trainable_ae_params, self.ae_optimizer_args
-            )
-            logpy.info(f"Number of trainable autoencoder parameters: {num_ae_params:,}")
-        if self.trainable_disc_params is None:
-            disc_params = self.get_discriminator_params()
-        else:
-            disc_params, num_disc_params = self.get_param_groups(
-                self.trainable_disc_params, self.disc_optimizer_args
-            )
-            logpy.info(
-                f"Number of trainable discriminator parameters: {num_disc_params:,}"
-            )
-        opt_ae = self.instantiate_optimizer_from_config(
-            ae_params,
-            default(self.lr_g_factor, 1.0) * self.learning_rate,
-            self.optimizer_config,
-        )
-        opts = [opt_ae]
-        if len(disc_params) > 0:
-            opt_disc = self.instantiate_optimizer_from_config(
-                disc_params, self.learning_rate, self.optimizer_config
-            )
-            opts.append(opt_disc)
-
-        return opts
-
-    @torch.no_grad()
-    def log_images(
-        self, batch: dict, additional_log_kwargs: Optional[Dict] = None, **kwargs
-    ) -> dict:
-        log = dict()
-        additional_decode_kwargs = {}
-        x = self.get_input(batch)
-        additional_decode_kwargs.update(
-            {key: batch[key] for key in self.additional_decode_keys.intersection(batch)}
-        )
-
-        _, xrec, _ = self(x, **additional_decode_kwargs)
-        log["inputs"] = x
-        log["reconstructions"] = xrec
-        diff = 0.5 * torch.abs(torch.clamp(xrec, -1.0, 1.0) - x)
-        diff.clamp_(0, 1.0)
-        log["diff"] = 2.0 * diff - 1.0
-        # diff_boost shows location of small errors, by boosting their
-        # brightness.
-        log["diff_boost"] = (
-            2.0 * torch.clamp(self.diff_boost_factor * diff, 0.0, 1.0) - 1
-        )
-        if hasattr(self.loss, "log_images"):
-            log.update(self.loss.log_images(x, xrec))
-        with self.ema_scope():
-            _, xrec_ema, _ = self(x, **additional_decode_kwargs)
-            log["reconstructions_ema"] = xrec_ema
-            diff_ema = 0.5 * torch.abs(torch.clamp(xrec_ema, -1.0, 1.0) - x)
-            diff_ema.clamp_(0, 1.0)
-            log["diff_ema"] = 2.0 * diff_ema - 1.0
-            log["diff_boost_ema"] = (
-                2.0 * torch.clamp(self.diff_boost_factor * diff_ema, 0.0, 1.0) - 1
-            )
-        if additional_log_kwargs:
-            additional_decode_kwargs.update(additional_log_kwargs)
-            _, xrec_add, _ = self(x, **additional_decode_kwargs)
-            log_str = "reconstructions-" + "-".join(
-                [f"{key}={additional_log_kwargs[key]}" for key in additional_log_kwargs]
-            )
-            log[log_str] = xrec_add
-        return log
-
-
-class AutoencodingEngineLegacy(AutoencodingEngine):
-    def __init__(self, embed_dim: int, **kwargs):
-        self.max_batch_size = kwargs.pop("max_batch_size", None)
-        ddconfig = kwargs.pop("ddconfig")
-        ckpt_path = kwargs.pop("ckpt_path", None)
-        ckpt_engine = kwargs.pop("ckpt_engine", None)
-        super().__init__(
-            encoder_config={
-                "target": "sgm.modules.diffusionmodules.model.Encoder",
-                "params": ddconfig,
-            },
-            decoder_config={
-                "target": "sgm.modules.diffusionmodules.model.Decoder",
-                "params": ddconfig,
-            },
-            **kwargs,
-        )
-        self.quant_conv = torch.nn.Conv2d(
-            (1 + ddconfig["double_z"]) * ddconfig["z_channels"],
-            (1 + ddconfig["double_z"]) * embed_dim,
-            1,
-        )
-        self.post_quant_conv = torch.nn.Conv2d(embed_dim, ddconfig["z_channels"], 1)
-        self.embed_dim = embed_dim
-
-        self.apply_ckpt(default(ckpt_path, ckpt_engine))
-
-    def get_autoencoder_params(self) -> list:
-        params = super().get_autoencoder_params()
-        return params
-
-    def encode(
-        self, x: torch.Tensor, return_reg_log: bool = False
-    ) -> Union[torch.Tensor, Tuple[torch.Tensor, dict]]:
-        if self.max_batch_size is None:
-            z = self.encoder(x)
-            z = self.quant_conv(z)
-        else:
-            N = x.shape[0]
-            bs = self.max_batch_size
-            n_batches = int(math.ceil(N / bs))
-            z = list()
-            for i_batch in range(n_batches):
-                z_batch = self.encoder(x[i_batch * bs : (i_batch + 1) * bs])
-                z_batch = self.quant_conv(z_batch)
-                z.append(z_batch)
-            z = torch.cat(z, 0)
-
-        z, reg_log = self.regularization(z)
-        if return_reg_log:
-            return z, reg_log
-        return z
-
-    def decode(self, z: torch.Tensor, **decoder_kwargs) -> torch.Tensor:
-        if self.max_batch_size is None:
-            dec = self.post_quant_conv(z)
-            dec = self.decoder(dec, **decoder_kwargs)
-        else:
-            N = z.shape[0]
-            bs = self.max_batch_size
-            n_batches = int(math.ceil(N / bs))
-            dec = list()
-            for i_batch in range(n_batches):
-                dec_batch = self.post_quant_conv(z[i_batch * bs : (i_batch + 1) * bs])
-                dec_batch = self.decoder(dec_batch, **decoder_kwargs)
-                dec.append(dec_batch)
-            dec = torch.cat(dec, 0)
-
-        return dec
-
-
-class IdentityFirstStage(AbstractAutoencoder):
-    def __init__(self, *args, **kwargs):
-        super().__init__(*args, **kwargs)
-
-    def get_input(self, x: Any) -> Any:
-        return x
-
-    def encode(self, x: Any, *args, **kwargs) -> Any:
-        return x
-
-    def decode(self, x: Any, *args, **kwargs) -> Any:
-        return
-
-
-class VideoAutoencodingEngine(AutoencodingEngine):
-    def __init__(
-        self,
-        ckpt_path: Union[None, str] = None,
-        ignore_keys: Union[Tuple, list] = (),
-        image_video_weights=[1, 1],
-        only_train_decoder=False,
-        context_parallel_size=0,
-        **kwargs,
-    ):
-        super().__init__(**kwargs)
-        self.context_parallel_size = context_parallel_size
-        if ckpt_path is not None:
-            self.init_from_ckpt(ckpt_path, ignore_keys=ignore_keys)
-
-    def log_videos(
-        self, batch: dict, additional_log_kwargs: Optional[Dict] = None, **kwargs
-    ) -> dict:
-        return self.log_images(batch, additional_log_kwargs, **kwargs)
-
-    def get_input(self, batch: dict) -> torch.Tensor:
-        if self.context_parallel_size > 0:
-            if not is_context_parallel_initialized():
-                initialize_context_parallel(self.context_parallel_size)
-
-            batch = batch[self.input_key]
-
-            global_src_rank = (
-                get_context_parallel_group_rank() * self.context_parallel_size
-            )
-            torch.distributed.broadcast(
-                batch, src=global_src_rank, group=get_context_parallel_group()
-            )
-
-            batch = _conv_split(batch, dim=2, kernel_size=1)
-            return batch
-
-        return batch[self.input_key]
-
-    def apply_ckpt(self, ckpt: Union[None, str, dict]):
-        if ckpt is None:
-            return
-        self.init_from_ckpt(ckpt)
-
-    def init_from_ckpt(self, path, ignore_keys=list()):
-        sd = torch.load(path, map_location="cpu")["state_dict"]
-        keys = list(sd.keys())
-        for k in keys:
-            for ik in ignore_keys:
-                if k.startswith(ik):
-                    del sd[k]
-        missing_keys, unexpected_keys = self.load_state_dict(sd, strict=False)
-        print("Missing keys: ", missing_keys)
-        print("Unexpected keys: ", unexpected_keys)
-        print(f"Restored from {path}")
diff --git a/videotuna/models/cogvideo_sat/sgm/modules/__init__.py b/videotuna/models/cogvideo_sat/sgm/modules/__init__.py
deleted file mode 100644
index 0db1d771..00000000
--- a/videotuna/models/cogvideo_sat/sgm/modules/__init__.py
+++ /dev/null
@@ -1,6 +0,0 @@
-from .encoders.modules import GeneralConditioner
-
-UNCONDITIONAL_CONFIG = {
-    "target": "sgm.modules.GeneralConditioner",
-    "params": {"emb_models": []},
-}
diff --git a/videotuna/models/cogvideo_sat/sgm/modules/attention.py b/videotuna/models/cogvideo_sat/sgm/modules/attention.py
deleted file mode 100644
index b122b111..00000000
--- a/videotuna/models/cogvideo_sat/sgm/modules/attention.py
+++ /dev/null
@@ -1,633 +0,0 @@
-import math
-from inspect import isfunction
-from typing import Any, Optional
-
-import torch
-import torch.nn.functional as F
-from einops import rearrange, repeat
-from packaging import version
-from torch import nn
-
-if version.parse(torch.__version__) >= version.parse("2.0.0"):
-    SDP_IS_AVAILABLE = True
-    from torch.backends.cuda import SDPBackend, sdp_kernel
-
-    BACKEND_MAP = {
-        SDPBackend.MATH: {
-            "enable_math": True,
-            "enable_flash": False,
-            "enable_mem_efficient": False,
-        },
-        SDPBackend.FLASH_ATTENTION: {
-            "enable_math": False,
-            "enable_flash": True,
-            "enable_mem_efficient": False,
-        },
-        SDPBackend.EFFICIENT_ATTENTION: {
-            "enable_math": False,
-            "enable_flash": False,
-            "enable_mem_efficient": True,
-        },
-        None: {"enable_math": True, "enable_flash": True, "enable_mem_efficient": True},
-    }
-else:
-    from contextlib import nullcontext
-
-    SDP_IS_AVAILABLE = False
-    sdp_kernel = nullcontext
-    BACKEND_MAP = {}
-    print(
-        f"No SDP backend available, likely because you are running in pytorch versions < 2.0. In fact, "
-        f"you are using PyTorch {torch.__version__}. You might want to consider upgrading."
-    )
-
-try:
-    import xformers
-    import xformers.ops
-
-    XFORMERS_IS_AVAILABLE = True
-except:
-    XFORMERS_IS_AVAILABLE = False
-    print("no module 'xformers'. Processing without...")
-
-from .diffusionmodules.util import checkpoint
-
-
-def exists(val):
-    return val is not None
-
-
-def uniq(arr):
-    return {el: True for el in arr}.keys()
-
-
-def default(val, d):
-    if exists(val):
-        return val
-    return d() if isfunction(d) else d
-
-
-def max_neg_value(t):
-    return -torch.finfo(t.dtype).max
-
-
-def init_(tensor):
-    dim = tensor.shape[-1]
-    std = 1 / math.sqrt(dim)
-    tensor.uniform_(-std, std)
-    return tensor
-
-
-# feedforward
-class GEGLU(nn.Module):
-    def __init__(self, dim_in, dim_out):
-        super().__init__()
-        self.proj = nn.Linear(dim_in, dim_out * 2)
-
-    def forward(self, x):
-        x, gate = self.proj(x).chunk(2, dim=-1)
-        return x * F.gelu(gate)
-
-
-class FeedForward(nn.Module):
-    def __init__(self, dim, dim_out=None, mult=4, glu=False, dropout=0.0):
-        super().__init__()
-        inner_dim = int(dim * mult)
-        dim_out = default(dim_out, dim)
-        project_in = (
-            nn.Sequential(nn.Linear(dim, inner_dim), nn.GELU())
-            if not glu
-            else GEGLU(dim, inner_dim)
-        )
-
-        self.net = nn.Sequential(
-            project_in, nn.Dropout(dropout), nn.Linear(inner_dim, dim_out)
-        )
-
-    def forward(self, x):
-        return self.net(x)
-
-
-def zero_module(module):
-    """
-    Zero out the parameters of a module and return it.
-    """
-    for p in module.parameters():
-        p.detach().zero_()
-    return module
-
-
-def Normalize(in_channels):
-    return torch.nn.GroupNorm(
-        num_groups=32, num_channels=in_channels, eps=1e-6, affine=True
-    )
-
-
-class LinearAttention(nn.Module):
-    def __init__(self, dim, heads=4, dim_head=32):
-        super().__init__()
-        self.heads = heads
-        hidden_dim = dim_head * heads
-        self.to_qkv = nn.Conv2d(dim, hidden_dim * 3, 1, bias=False)
-        self.to_out = nn.Conv2d(hidden_dim, dim, 1)
-
-    def forward(self, x):
-        b, c, h, w = x.shape
-        qkv = self.to_qkv(x)
-        q, k, v = rearrange(
-            qkv, "b (qkv heads c) h w -> qkv b heads c (h w)", heads=self.heads, qkv=3
-        )
-        k = k.softmax(dim=-1)
-        context = torch.einsum("bhdn,bhen->bhde", k, v)
-        out = torch.einsum("bhde,bhdn->bhen", context, q)
-        out = rearrange(
-            out, "b heads c (h w) -> b (heads c) h w", heads=self.heads, h=h, w=w
-        )
-        return self.to_out(out)
-
-
-class SpatialSelfAttention(nn.Module):
-    def __init__(self, in_channels):
-        super().__init__()
-        self.in_channels = in_channels
-
-        self.norm = Normalize(in_channels)
-        self.q = torch.nn.Conv2d(
-            in_channels, in_channels, kernel_size=1, stride=1, padding=0
-        )
-        self.k = torch.nn.Conv2d(
-            in_channels, in_channels, kernel_size=1, stride=1, padding=0
-        )
-        self.v = torch.nn.Conv2d(
-            in_channels, in_channels, kernel_size=1, stride=1, padding=0
-        )
-        self.proj_out = torch.nn.Conv2d(
-            in_channels, in_channels, kernel_size=1, stride=1, padding=0
-        )
-
-    def forward(self, x):
-        h_ = x
-        h_ = self.norm(h_)
-        q = self.q(h_)
-        k = self.k(h_)
-        v = self.v(h_)
-
-        # compute attention
-        b, c, h, w = q.shape
-        q = rearrange(q, "b c h w -> b (h w) c")
-        k = rearrange(k, "b c h w -> b c (h w)")
-        w_ = torch.einsum("bij,bjk->bik", q, k)
-
-        w_ = w_ * (int(c) ** (-0.5))
-        w_ = torch.nn.functional.softmax(w_, dim=2)
-
-        # attend to values
-        v = rearrange(v, "b c h w -> b c (h w)")
-        w_ = rearrange(w_, "b i j -> b j i")
-        h_ = torch.einsum("bij,bjk->bik", v, w_)
-        h_ = rearrange(h_, "b c (h w) -> b c h w", h=h)
-        h_ = self.proj_out(h_)
-
-        return x + h_
-
-
-class CrossAttention(nn.Module):
-    def __init__(
-        self,
-        query_dim,
-        context_dim=None,
-        heads=8,
-        dim_head=64,
-        dropout=0.0,
-        backend=None,
-    ):
-        super().__init__()
-        inner_dim = dim_head * heads
-        context_dim = default(context_dim, query_dim)
-
-        self.scale = dim_head**-0.5
-        self.heads = heads
-
-        self.to_q = nn.Linear(query_dim, inner_dim, bias=False)
-        self.to_k = nn.Linear(context_dim, inner_dim, bias=False)
-        self.to_v = nn.Linear(context_dim, inner_dim, bias=False)
-
-        self.to_out = nn.Sequential(
-            nn.Linear(inner_dim, query_dim), nn.Dropout(dropout)
-        )
-        self.backend = backend
-
-    def forward(
-        self,
-        x,
-        context=None,
-        mask=None,
-        additional_tokens=None,
-        n_times_crossframe_attn_in_self=0,
-    ):
-        h = self.heads
-
-        if additional_tokens is not None:
-            # get the number of masked tokens at the beginning of the output sequence
-            n_tokens_to_mask = additional_tokens.shape[1]
-            # add additional token
-            x = torch.cat([additional_tokens, x], dim=1)
-
-        q = self.to_q(x)
-        context = default(context, x)
-        k = self.to_k(context)
-        v = self.to_v(context)
-
-        if n_times_crossframe_attn_in_self:
-            # reprogramming cross-frame attention as in https://arxiv.org/abs/2303.13439
-            assert x.shape[0] % n_times_crossframe_attn_in_self == 0
-            n_cp = x.shape[0] // n_times_crossframe_attn_in_self
-            k = repeat(
-                k[::n_times_crossframe_attn_in_self], "b ... -> (b n) ...", n=n_cp
-            )
-            v = repeat(
-                v[::n_times_crossframe_attn_in_self], "b ... -> (b n) ...", n=n_cp
-            )
-
-        q, k, v = map(lambda t: rearrange(t, "b n (h d) -> b h n d", h=h), (q, k, v))
-
-        # old
-        """
-        sim = einsum('b i d, b j d -> b i j', q, k) * self.scale
-        del q, k
-
-        if exists(mask):
-            mask = rearrange(mask, 'b ... -> b (...)')
-            max_neg_value = -torch.finfo(sim.dtype).max
-            mask = repeat(mask, 'b j -> (b h) () j', h=h)
-            sim.masked_fill_(~mask, max_neg_value)
-
-        # attention, what we cannot get enough of
-        sim = sim.softmax(dim=-1)
-
-        out = einsum('b i j, b j d -> b i d', sim, v)
-        """
-        # new
-        with sdp_kernel(**BACKEND_MAP[self.backend]):
-            # print("dispatching into backend", self.backend, "q/k/v shape: ", q.shape, k.shape, v.shape)
-            out = F.scaled_dot_product_attention(
-                q, k, v, attn_mask=mask
-            )  # scale is dim_head ** -0.5 per default
-
-        del q, k, v
-        out = rearrange(out, "b h n d -> b n (h d)", h=h)
-
-        if additional_tokens is not None:
-            # remove additional token
-            out = out[:, n_tokens_to_mask:]
-        return self.to_out(out)
-
-
-class MemoryEfficientCrossAttention(nn.Module):
-    # https://github.com/MatthieuTPHR/diffusers/blob/d80b531ff8060ec1ea982b65a1b8df70f73aa67c/src/diffusers/models/attention.py#L223
-    def __init__(
-        self, query_dim, context_dim=None, heads=8, dim_head=64, dropout=0.0, **kwargs
-    ):
-        super().__init__()
-        print(
-            f"Setting up {self.__class__.__name__}. Query dim is {query_dim}, context_dim is {context_dim} and using "
-            f"{heads} heads with a dimension of {dim_head}."
-        )
-        inner_dim = dim_head * heads
-        context_dim = default(context_dim, query_dim)
-
-        self.heads = heads
-        self.dim_head = dim_head
-
-        self.to_q = nn.Linear(query_dim, inner_dim, bias=False)
-        self.to_k = nn.Linear(context_dim, inner_dim, bias=False)
-        self.to_v = nn.Linear(context_dim, inner_dim, bias=False)
-
-        self.to_out = nn.Sequential(
-            nn.Linear(inner_dim, query_dim), nn.Dropout(dropout)
-        )
-        self.attention_op: Optional[Any] = None
-
-    def forward(
-        self,
-        x,
-        context=None,
-        mask=None,
-        additional_tokens=None,
-        n_times_crossframe_attn_in_self=0,
-    ):
-        if additional_tokens is not None:
-            # get the number of masked tokens at the beginning of the output sequence
-            n_tokens_to_mask = additional_tokens.shape[1]
-            # add additional token
-            x = torch.cat([additional_tokens, x], dim=1)
-        q = self.to_q(x)
-        context = default(context, x)
-        k = self.to_k(context)
-        v = self.to_v(context)
-
-        if n_times_crossframe_attn_in_self:
-            # reprogramming cross-frame attention as in https://arxiv.org/abs/2303.13439
-            assert x.shape[0] % n_times_crossframe_attn_in_self == 0
-            # n_cp = x.shape[0]//n_times_crossframe_attn_in_self
-            k = repeat(
-                k[::n_times_crossframe_attn_in_self],
-                "b ... -> (b n) ...",
-                n=n_times_crossframe_attn_in_self,
-            )
-            v = repeat(
-                v[::n_times_crossframe_attn_in_self],
-                "b ... -> (b n) ...",
-                n=n_times_crossframe_attn_in_self,
-            )
-
-        b, _, _ = q.shape
-        q, k, v = map(
-            lambda t: t.unsqueeze(3)
-            .reshape(b, t.shape[1], self.heads, self.dim_head)
-            .permute(0, 2, 1, 3)
-            .reshape(b * self.heads, t.shape[1], self.dim_head)
-            .contiguous(),
-            (q, k, v),
-        )
-
-        # actually compute the attention, what we cannot get enough of
-        out = xformers.ops.memory_efficient_attention(
-            q, k, v, attn_bias=None, op=self.attention_op
-        )
-
-        # TODO: Use this directly in the attention operation, as a bias
-        if exists(mask):
-            raise NotImplementedError
-        out = (
-            out.unsqueeze(0)
-            .reshape(b, self.heads, out.shape[1], self.dim_head)
-            .permute(0, 2, 1, 3)
-            .reshape(b, out.shape[1], self.heads * self.dim_head)
-        )
-        if additional_tokens is not None:
-            # remove additional token
-            out = out[:, n_tokens_to_mask:]
-        return self.to_out(out)
-
-
-class BasicTransformerBlock(nn.Module):
-    ATTENTION_MODES = {
-        "softmax": CrossAttention,  # vanilla attention
-        "softmax-xformers": MemoryEfficientCrossAttention,  # ampere
-    }
-
-    def __init__(
-        self,
-        dim,
-        n_heads,
-        d_head,
-        dropout=0.0,
-        context_dim=None,
-        gated_ff=True,
-        checkpoint=True,
-        disable_self_attn=False,
-        attn_mode="softmax",
-        sdp_backend=None,
-    ):
-        super().__init__()
-        assert attn_mode in self.ATTENTION_MODES
-        if attn_mode != "softmax" and not XFORMERS_IS_AVAILABLE:
-            print(
-                f"Attention mode '{attn_mode}' is not available. Falling back to native attention. "
-                f"This is not a problem in Pytorch >= 2.0. FYI, you are running with PyTorch version {torch.__version__}"
-            )
-            attn_mode = "softmax"
-        elif attn_mode == "softmax" and not SDP_IS_AVAILABLE:
-            print(
-                "We do not support vanilla attention anymore, as it is too expensive. Sorry."
-            )
-            if not XFORMERS_IS_AVAILABLE:
-                assert (
-                    False
-                ), "Please install xformers via e.g. 'pip install xformers==0.0.16'"
-            else:
-                print("Falling back to xformers efficient attention.")
-                attn_mode = "softmax-xformers"
-        attn_cls = self.ATTENTION_MODES[attn_mode]
-        if version.parse(torch.__version__) >= version.parse("2.0.0"):
-            assert sdp_backend is None or isinstance(sdp_backend, SDPBackend)
-        else:
-            assert sdp_backend is None
-        self.disable_self_attn = disable_self_attn
-        self.attn1 = attn_cls(
-            query_dim=dim,
-            heads=n_heads,
-            dim_head=d_head,
-            dropout=dropout,
-            context_dim=context_dim if self.disable_self_attn else None,
-            backend=sdp_backend,
-        )  # is a self-attention if not self.disable_self_attn
-        self.ff = FeedForward(dim, dropout=dropout, glu=gated_ff)
-        self.attn2 = attn_cls(
-            query_dim=dim,
-            context_dim=context_dim,
-            heads=n_heads,
-            dim_head=d_head,
-            dropout=dropout,
-            backend=sdp_backend,
-        )  # is self-attn if context is none
-        self.norm1 = nn.LayerNorm(dim)
-        self.norm2 = nn.LayerNorm(dim)
-        self.norm3 = nn.LayerNorm(dim)
-        self.checkpoint = checkpoint
-        if self.checkpoint:
-            print(f"{self.__class__.__name__} is using checkpointing")
-
-    def forward(
-        self, x, context=None, additional_tokens=None, n_times_crossframe_attn_in_self=0
-    ):
-        kwargs = {"x": x}
-
-        if context is not None:
-            kwargs.update({"context": context})
-
-        if additional_tokens is not None:
-            kwargs.update({"additional_tokens": additional_tokens})
-
-        if n_times_crossframe_attn_in_self:
-            kwargs.update(
-                {"n_times_crossframe_attn_in_self": n_times_crossframe_attn_in_self}
-            )
-
-        # return mixed_checkpoint(self._forward, kwargs, self.parameters(), self.checkpoint)
-        return checkpoint(
-            self._forward, (x, context), self.parameters(), self.checkpoint
-        )
-
-    def _forward(
-        self, x, context=None, additional_tokens=None, n_times_crossframe_attn_in_self=0
-    ):
-        x = (
-            self.attn1(
-                self.norm1(x),
-                context=context if self.disable_self_attn else None,
-                additional_tokens=additional_tokens,
-                n_times_crossframe_attn_in_self=(
-                    n_times_crossframe_attn_in_self if not self.disable_self_attn else 0
-                ),
-            )
-            + x
-        )
-        x = (
-            self.attn2(
-                self.norm2(x), context=context, additional_tokens=additional_tokens
-            )
-            + x
-        )
-        x = self.ff(self.norm3(x)) + x
-        return x
-
-
-class BasicTransformerSingleLayerBlock(nn.Module):
-    ATTENTION_MODES = {
-        "softmax": CrossAttention,  # vanilla attention
-        "softmax-xformers": MemoryEfficientCrossAttention,  # on the A100s not quite as fast as the above version
-        # (todo might depend on head_dim, check, falls back to semi-optimized kernels for dim!=[16,32,64,128])
-    }
-
-    def __init__(
-        self,
-        dim,
-        n_heads,
-        d_head,
-        dropout=0.0,
-        context_dim=None,
-        gated_ff=True,
-        checkpoint=True,
-        attn_mode="softmax",
-    ):
-        super().__init__()
-        assert attn_mode in self.ATTENTION_MODES
-        attn_cls = self.ATTENTION_MODES[attn_mode]
-        self.attn1 = attn_cls(
-            query_dim=dim,
-            heads=n_heads,
-            dim_head=d_head,
-            dropout=dropout,
-            context_dim=context_dim,
-        )
-        self.ff = FeedForward(dim, dropout=dropout, glu=gated_ff)
-        self.norm1 = nn.LayerNorm(dim)
-        self.norm2 = nn.LayerNorm(dim)
-        self.checkpoint = checkpoint
-
-    def forward(self, x, context=None):
-        return checkpoint(
-            self._forward, (x, context), self.parameters(), self.checkpoint
-        )
-
-    def _forward(self, x, context=None):
-        x = self.attn1(self.norm1(x), context=context) + x
-        x = self.ff(self.norm2(x)) + x
-        return x
-
-
-class SpatialTransformer(nn.Module):
-    """
-    Transformer block for image-like data.
-    First, project the input (aka embedding)
-    and reshape to b, t, d.
-    Then apply standard transformer action.
-    Finally, reshape to image
-    NEW: use_linear for more efficiency instead of the 1x1 convs
-    """
-
-    def __init__(
-        self,
-        in_channels,
-        n_heads,
-        d_head,
-        depth=1,
-        dropout=0.0,
-        context_dim=None,
-        disable_self_attn=False,
-        use_linear=False,
-        attn_type="softmax",
-        use_checkpoint=True,
-        # sdp_backend=SDPBackend.FLASH_ATTENTION
-        sdp_backend=None,
-    ):
-        super().__init__()
-        print(
-            f"constructing {self.__class__.__name__} of depth {depth} w/ {in_channels} channels and {n_heads} heads"
-        )
-        from omegaconf import ListConfig
-
-        if exists(context_dim) and not isinstance(context_dim, (list, ListConfig)):
-            context_dim = [context_dim]
-        if exists(context_dim) and isinstance(context_dim, list):
-            if depth != len(context_dim):
-                print(
-                    f"WARNING: {self.__class__.__name__}: Found context dims {context_dim} of depth {len(context_dim)}, "
-                    f"which does not match the specified 'depth' of {depth}. Setting context_dim to {depth * [context_dim[0]]} now."
-                )
-                # depth does not match context dims.
-                assert all(
-                    map(lambda x: x == context_dim[0], context_dim)
-                ), "need homogenous context_dim to match depth automatically"
-                context_dim = depth * [context_dim[0]]
-        elif context_dim is None:
-            context_dim = [None] * depth
-        self.in_channels = in_channels
-        inner_dim = n_heads * d_head
-        self.norm = Normalize(in_channels)
-        if not use_linear:
-            self.proj_in = nn.Conv2d(
-                in_channels, inner_dim, kernel_size=1, stride=1, padding=0
-            )
-        else:
-            self.proj_in = nn.Linear(in_channels, inner_dim)
-
-        self.transformer_blocks = nn.ModuleList(
-            [
-                BasicTransformerBlock(
-                    inner_dim,
-                    n_heads,
-                    d_head,
-                    dropout=dropout,
-                    context_dim=context_dim[d],
-                    disable_self_attn=disable_self_attn,
-                    attn_mode=attn_type,
-                    checkpoint=use_checkpoint,
-                    sdp_backend=sdp_backend,
-                )
-                for d in range(depth)
-            ]
-        )
-        if not use_linear:
-            self.proj_out = zero_module(
-                nn.Conv2d(inner_dim, in_channels, kernel_size=1, stride=1, padding=0)
-            )
-        else:
-            # self.proj_out = zero_module(nn.Linear(in_channels, inner_dim))
-            self.proj_out = zero_module(nn.Linear(inner_dim, in_channels))
-        self.use_linear = use_linear
-
-    def forward(self, x, context=None):
-        # note: if no context is given, cross-attention defaults to self-attention
-        if not isinstance(context, list):
-            context = [context]
-        b, c, h, w = x.shape
-        x_in = x
-        x = self.norm(x)
-        if not self.use_linear:
-            x = self.proj_in(x)
-        x = rearrange(x, "b c h w -> b (h w) c").contiguous()
-        if self.use_linear:
-            x = self.proj_in(x)
-        for i, block in enumerate(self.transformer_blocks):
-            if i > 0 and len(context) == 1:
-                i = 0  # use same context for each block
-            x = block(x, context=context[i])
-        if self.use_linear:
-            x = self.proj_out(x)
-        x = rearrange(x, "b (h w) c -> b c h w", h=h, w=w).contiguous()
-        if not self.use_linear:
-            x = self.proj_out(x)
-        return x + x_in
diff --git a/videotuna/models/cogvideo_sat/sgm/modules/autoencoding/__init__.py b/videotuna/models/cogvideo_sat/sgm/modules/autoencoding/__init__.py
deleted file mode 100644
index e69de29b..00000000
diff --git a/videotuna/models/cogvideo_sat/sgm/modules/autoencoding/losses/__init__.py b/videotuna/models/cogvideo_sat/sgm/modules/autoencoding/losses/__init__.py
deleted file mode 100644
index b3bb81d9..00000000
--- a/videotuna/models/cogvideo_sat/sgm/modules/autoencoding/losses/__init__.py
+++ /dev/null
@@ -1,8 +0,0 @@
-__all__ = [
-    "GeneralLPIPSWithDiscriminator",
-    "LatentLPIPS",
-]
-
-from .discriminator_loss import GeneralLPIPSWithDiscriminator
-from .lpips import LatentLPIPS
-from .video_loss import VideoAutoencoderLoss
diff --git a/videotuna/models/cogvideo_sat/sgm/modules/autoencoding/losses/discriminator_loss.py b/videotuna/models/cogvideo_sat/sgm/modules/autoencoding/losses/discriminator_loss.py
deleted file mode 100644
index cefcb1d9..00000000
--- a/videotuna/models/cogvideo_sat/sgm/modules/autoencoding/losses/discriminator_loss.py
+++ /dev/null
@@ -1,317 +0,0 @@
-from typing import Dict, Iterator, List, Optional, Tuple, Union
-
-import numpy as np
-import torch
-import torch.nn as nn
-import torchvision
-from einops import rearrange
-from matplotlib import colormaps
-from matplotlib import pyplot as plt
-
-from ....util import default, instantiate_from_config
-from ..lpips.loss.lpips import LPIPS
-from ..lpips.model.model import weights_init
-from ..lpips.vqperceptual import hinge_d_loss, vanilla_d_loss
-
-
-class GeneralLPIPSWithDiscriminator(nn.Module):
-    def __init__(
-        self,
-        disc_start: int,
-        logvar_init: float = 0.0,
-        disc_num_layers: int = 3,
-        disc_in_channels: int = 3,
-        disc_factor: float = 1.0,
-        disc_weight: float = 1.0,
-        perceptual_weight: float = 1.0,
-        disc_loss: str = "hinge",
-        scale_input_to_tgt_size: bool = False,
-        dims: int = 2,
-        learn_logvar: bool = False,
-        regularization_weights: Union[None, Dict[str, float]] = None,
-        additional_log_keys: Optional[List[str]] = None,
-        discriminator_config: Optional[Dict] = None,
-    ):
-        super().__init__()
-        self.dims = dims
-        if self.dims > 2:
-            print(
-                f"running with dims={dims}. This means that for perceptual loss "
-                f"calculation, the LPIPS loss will be applied to each frame "
-                f"independently."
-            )
-        self.scale_input_to_tgt_size = scale_input_to_tgt_size
-        assert disc_loss in ["hinge", "vanilla"]
-        self.perceptual_loss = LPIPS().eval()
-        self.perceptual_weight = perceptual_weight
-        # output log variance
-        self.logvar = nn.Parameter(
-            torch.full((), logvar_init), requires_grad=learn_logvar
-        )
-        self.learn_logvar = learn_logvar
-
-        discriminator_config = default(
-            discriminator_config,
-            {
-                "target": "sgm.modules.autoencoding.lpips.model.model.NLayerDiscriminator",
-                "params": {
-                    "input_nc": disc_in_channels,
-                    "n_layers": disc_num_layers,
-                    "use_actnorm": False,
-                },
-            },
-        )
-
-        self.discriminator = instantiate_from_config(discriminator_config).apply(
-            weights_init
-        )
-        self.discriminator_iter_start = disc_start
-        self.disc_loss = hinge_d_loss if disc_loss == "hinge" else vanilla_d_loss
-        self.disc_factor = disc_factor
-        self.discriminator_weight = disc_weight
-        self.regularization_weights = default(regularization_weights, {})
-
-        self.forward_keys = [
-            "optimizer_idx",
-            "global_step",
-            "last_layer",
-            "split",
-            "regularization_log",
-        ]
-
-        self.additional_log_keys = set(default(additional_log_keys, []))
-        self.additional_log_keys.update(set(self.regularization_weights.keys()))
-
-    def get_trainable_parameters(self) -> Iterator[nn.Parameter]:
-        return self.discriminator.parameters()
-
-    def get_trainable_autoencoder_parameters(self) -> Iterator[nn.Parameter]:
-        if self.learn_logvar:
-            yield self.logvar
-        yield from ()
-
-    @torch.no_grad()
-    def log_images(
-        self, inputs: torch.Tensor, reconstructions: torch.Tensor
-    ) -> Dict[str, torch.Tensor]:
-        # calc logits of real/fake
-        logits_real = self.discriminator(inputs.contiguous().detach())
-        if len(logits_real.shape) < 4:
-            # Non patch-discriminator
-            return dict()
-        logits_fake = self.discriminator(reconstructions.contiguous().detach())
-        # -> (b, 1, h, w)
-
-        # parameters for colormapping
-        high = max(logits_fake.abs().max(), logits_real.abs().max()).item()
-        cmap = colormaps["PiYG"]  # diverging colormap
-
-        def to_colormap(logits: torch.Tensor) -> torch.Tensor:
-            """(b, 1, ...) -> (b, 3, ...)"""
-            logits = (logits + high) / (2 * high)
-            logits_np = cmap(logits.cpu().numpy())[..., :3]  # truncate alpha channel
-            # -> (b, 1, ..., 3)
-            logits = torch.from_numpy(logits_np).to(logits.device)
-            return rearrange(logits, "b 1 ... c -> b c ...")
-
-        logits_real = torch.nn.functional.interpolate(
-            logits_real,
-            size=inputs.shape[-2:],
-            mode="nearest",
-            antialias=False,
-        )
-        logits_fake = torch.nn.functional.interpolate(
-            logits_fake,
-            size=reconstructions.shape[-2:],
-            mode="nearest",
-            antialias=False,
-        )
-
-        # alpha value of logits for overlay
-        alpha_real = torch.abs(logits_real) / high
-        alpha_fake = torch.abs(logits_fake) / high
-        # -> (b, 1, h, w) in range [0, 0.5]
-        # alpha value of lines don't really matter, since the values are the same
-        # for both images and logits anyway
-        grid_alpha_real = torchvision.utils.make_grid(alpha_real, nrow=4)
-        grid_alpha_fake = torchvision.utils.make_grid(alpha_fake, nrow=4)
-        grid_alpha = 0.8 * torch.cat((grid_alpha_real, grid_alpha_fake), dim=1)
-        # -> (1, h, w)
-        # blend logits and images together
-
-        # prepare logits for plotting
-        logits_real = to_colormap(logits_real)
-        logits_fake = to_colormap(logits_fake)
-        # resize logits
-        # -> (b, 3, h, w)
-
-        # make some grids
-        # add all logits to one plot
-        logits_real = torchvision.utils.make_grid(logits_real, nrow=4)
-        logits_fake = torchvision.utils.make_grid(logits_fake, nrow=4)
-        # I just love how torchvision calls the number of columns `nrow`
-        grid_logits = torch.cat((logits_real, logits_fake), dim=1)
-        # -> (3, h, w)
-
-        grid_images_real = torchvision.utils.make_grid(0.5 * inputs + 0.5, nrow=4)
-        grid_images_fake = torchvision.utils.make_grid(
-            0.5 * reconstructions + 0.5, nrow=4
-        )
-        grid_images = torch.cat((grid_images_real, grid_images_fake), dim=1)
-        # -> (3, h, w) in range [0, 1]
-
-        grid_blend = grid_alpha * grid_logits + (1 - grid_alpha) * grid_images
-
-        # Create labeled colorbar
-        dpi = 100
-        height = 128 / dpi
-        width = grid_logits.shape[2] / dpi
-        fig, ax = plt.subplots(figsize=(width, height), dpi=dpi)
-        img = ax.imshow(np.array([[-high, high]]), cmap=cmap)
-        plt.colorbar(
-            img,
-            cax=ax,
-            orientation="horizontal",
-            fraction=0.9,
-            aspect=width / height,
-            pad=0.0,
-        )
-        img.set_visible(False)
-        fig.tight_layout()
-        fig.canvas.draw()
-        # manually convert figure to numpy
-        cbar_np = np.frombuffer(fig.canvas.tostring_rgb(), dtype=np.uint8)
-        cbar_np = cbar_np.reshape(fig.canvas.get_width_height()[::-1] + (3,))
-        cbar = torch.from_numpy(cbar_np.copy()).to(grid_logits.dtype) / 255.0
-        cbar = rearrange(cbar, "h w c -> c h w").to(grid_logits.device)
-
-        # Add colorbar to plot
-        annotated_grid = torch.cat((grid_logits, cbar), dim=1)
-        blended_grid = torch.cat((grid_blend, cbar), dim=1)
-        return {
-            "vis_logits": 2 * annotated_grid[None, ...] - 1,
-            "vis_logits_blended": 2 * blended_grid[None, ...] - 1,
-        }
-
-    def calculate_adaptive_weight(
-        self, nll_loss: torch.Tensor, g_loss: torch.Tensor, last_layer: torch.Tensor
-    ) -> torch.Tensor:
-        nll_grads = torch.autograd.grad(nll_loss, last_layer, retain_graph=True)[0]
-        g_grads = torch.autograd.grad(g_loss, last_layer, retain_graph=True)[0]
-
-        d_weight = torch.norm(nll_grads) / (torch.norm(g_grads) + 1e-4)
-        d_weight = torch.clamp(d_weight, 0.0, 1e4).detach()
-        d_weight = d_weight * self.discriminator_weight
-        return d_weight
-
-    def forward(
-        self,
-        inputs: torch.Tensor,
-        reconstructions: torch.Tensor,
-        *,  # added because I changed the order here
-        regularization_log: Dict[str, torch.Tensor],
-        optimizer_idx: int,
-        global_step: int,
-        last_layer: torch.Tensor,
-        split: str = "train",
-        weights: Union[None, float, torch.Tensor] = None,
-    ) -> Tuple[torch.Tensor, dict]:
-        if self.scale_input_to_tgt_size:
-            inputs = torch.nn.functional.interpolate(
-                inputs, reconstructions.shape[2:], mode="bicubic", antialias=True
-            )
-
-        if self.dims > 2:
-            inputs, reconstructions = map(
-                lambda x: rearrange(x, "b c t h w -> (b t) c h w"),
-                (inputs, reconstructions),
-            )
-
-        rec_loss = torch.abs(inputs.contiguous() - reconstructions.contiguous())
-        if self.perceptual_weight > 0:
-            frame_indices = (
-                torch.randn((inputs.shape[0], inputs.shape[2])).topk(1, dim=-1).indices
-            )
-
-            from sgm.modules.autoencoding.losses.video_loss import pick_video_frame
-
-            input_frames = pick_video_frame(inputs, frame_indices)
-            recon_frames = pick_video_frame(reconstructions, frame_indices)
-
-            p_loss = self.perceptual_loss(
-                input_frames.contiguous(), recon_frames.contiguous()
-            ).mean()
-            rec_loss = rec_loss + self.perceptual_weight * p_loss
-
-        nll_loss, weighted_nll_loss = self.get_nll_loss(rec_loss, weights)
-
-        # now the GAN part
-        if optimizer_idx == 0:
-            # generator update
-            if global_step >= self.discriminator_iter_start or not self.training:
-                logits_fake = self.discriminator(reconstructions.contiguous())
-                g_loss = -torch.mean(logits_fake)
-                if self.training:
-                    d_weight = self.calculate_adaptive_weight(
-                        nll_loss, g_loss, last_layer=last_layer
-                    )
-                else:
-                    d_weight = torch.tensor(1.0)
-            else:
-                d_weight = torch.tensor(0.0)
-                g_loss = torch.tensor(0.0, requires_grad=True)
-
-            loss = weighted_nll_loss + d_weight * self.disc_factor * g_loss
-            log = dict()
-            for k in regularization_log:
-                if k in self.regularization_weights:
-                    loss = loss + self.regularization_weights[k] * regularization_log[k]
-                if k in self.additional_log_keys:
-                    log[f"{split}/{k}"] = regularization_log[k].detach().float().mean()
-
-            log.update(
-                {
-                    f"{split}/loss/total": loss.clone().detach().mean(),
-                    f"{split}/loss/nll": nll_loss.detach().mean(),
-                    f"{split}/loss/rec": rec_loss.detach().mean(),
-                    f"{split}/loss/percep": p_loss.detach().mean(),
-                    f"{split}/loss/rec": rec_loss.detach().mean(),
-                    f"{split}/loss/g": g_loss.detach().mean(),
-                    f"{split}/scalars/logvar": self.logvar.detach(),
-                    f"{split}/scalars/d_weight": d_weight.detach(),
-                }
-            )
-
-            return loss, log
-        elif optimizer_idx == 1:
-            # second pass for discriminator update
-            logits_real = self.discriminator(inputs.contiguous().detach())
-            logits_fake = self.discriminator(reconstructions.contiguous().detach())
-
-            if global_step >= self.discriminator_iter_start or not self.training:
-                d_loss = self.disc_factor * self.disc_loss(logits_real, logits_fake)
-            else:
-                d_loss = torch.tensor(0.0, requires_grad=True)
-
-            log = {
-                f"{split}/loss/disc": d_loss.clone().detach().mean(),
-                f"{split}/logits/real": logits_real.detach().mean(),
-                f"{split}/logits/fake": logits_fake.detach().mean(),
-            }
-            return d_loss, log
-        else:
-            raise NotImplementedError(f"Unknown optimizer_idx {optimizer_idx}")
-
-    def get_nll_loss(
-        self,
-        rec_loss: torch.Tensor,
-        weights: Optional[Union[float, torch.Tensor]] = None,
-    ) -> Tuple[torch.Tensor, torch.Tensor]:
-        nll_loss = rec_loss / torch.exp(self.logvar) + self.logvar
-        weighted_nll_loss = nll_loss
-        if weights is not None:
-            weighted_nll_loss = weights * nll_loss
-        weighted_nll_loss = torch.sum(weighted_nll_loss) / weighted_nll_loss.shape[0]
-        nll_loss = torch.sum(nll_loss) / nll_loss.shape[0]
-
-        return nll_loss, weighted_nll_loss
diff --git a/videotuna/models/cogvideo_sat/sgm/modules/autoencoding/losses/lpips.py b/videotuna/models/cogvideo_sat/sgm/modules/autoencoding/losses/lpips.py
deleted file mode 100644
index b329fcc2..00000000
--- a/videotuna/models/cogvideo_sat/sgm/modules/autoencoding/losses/lpips.py
+++ /dev/null
@@ -1,73 +0,0 @@
-import torch
-import torch.nn as nn
-
-from ....util import default, instantiate_from_config
-from ..lpips.loss.lpips import LPIPS
-
-
-class LatentLPIPS(nn.Module):
-    def __init__(
-        self,
-        decoder_config,
-        perceptual_weight=1.0,
-        latent_weight=1.0,
-        scale_input_to_tgt_size=False,
-        scale_tgt_to_input_size=False,
-        perceptual_weight_on_inputs=0.0,
-    ):
-        super().__init__()
-        self.scale_input_to_tgt_size = scale_input_to_tgt_size
-        self.scale_tgt_to_input_size = scale_tgt_to_input_size
-        self.init_decoder(decoder_config)
-        self.perceptual_loss = LPIPS().eval()
-        self.perceptual_weight = perceptual_weight
-        self.latent_weight = latent_weight
-        self.perceptual_weight_on_inputs = perceptual_weight_on_inputs
-
-    def init_decoder(self, config):
-        self.decoder = instantiate_from_config(config)
-        if hasattr(self.decoder, "encoder"):
-            del self.decoder.encoder
-
-    def forward(self, latent_inputs, latent_predictions, image_inputs, split="train"):
-        log = dict()
-        loss = (latent_inputs - latent_predictions) ** 2
-        log[f"{split}/latent_l2_loss"] = loss.mean().detach()
-        image_reconstructions = None
-        if self.perceptual_weight > 0.0:
-            image_reconstructions = self.decoder.decode(latent_predictions)
-            image_targets = self.decoder.decode(latent_inputs)
-            perceptual_loss = self.perceptual_loss(
-                image_targets.contiguous(), image_reconstructions.contiguous()
-            )
-            loss = (
-                self.latent_weight * loss.mean()
-                + self.perceptual_weight * perceptual_loss.mean()
-            )
-            log[f"{split}/perceptual_loss"] = perceptual_loss.mean().detach()
-
-        if self.perceptual_weight_on_inputs > 0.0:
-            image_reconstructions = default(
-                image_reconstructions, self.decoder.decode(latent_predictions)
-            )
-            if self.scale_input_to_tgt_size:
-                image_inputs = torch.nn.functional.interpolate(
-                    image_inputs,
-                    image_reconstructions.shape[2:],
-                    mode="bicubic",
-                    antialias=True,
-                )
-            elif self.scale_tgt_to_input_size:
-                image_reconstructions = torch.nn.functional.interpolate(
-                    image_reconstructions,
-                    image_inputs.shape[2:],
-                    mode="bicubic",
-                    antialias=True,
-                )
-
-            perceptual_loss2 = self.perceptual_loss(
-                image_inputs.contiguous(), image_reconstructions.contiguous()
-            )
-            loss = loss + self.perceptual_weight_on_inputs * perceptual_loss2.mean()
-            log[f"{split}/perceptual_loss_on_inputs"] = perceptual_loss2.mean().detach()
-        return loss, log
diff --git a/videotuna/models/cogvideo_sat/sgm/modules/autoencoding/losses/video_loss.py b/videotuna/models/cogvideo_sat/sgm/modules/autoencoding/losses/video_loss.py
deleted file mode 100644
index 93497302..00000000
--- a/videotuna/models/cogvideo_sat/sgm/modules/autoencoding/losses/video_loss.py
+++ /dev/null
@@ -1,754 +0,0 @@
-from math import log2
-from typing import Any, Union
-
-import torch
-import torch.nn as nn
-import torch.nn.functional as F
-import torchvision
-from beartype import beartype
-from einops import einsum, rearrange, repeat
-from einops.layers.torch import Rearrange
-from kornia.filters import filter3d
-from sgm.modules.autoencoding.vqvae.movq_enc_3d import CausalConv3d, DownSample3D
-from sgm.util import instantiate_from_config
-from torch import Tensor
-from torch.autograd import grad as torch_grad
-from torch.cuda.amp import autocast
-from torchvision.models import VGG16_Weights
-
-from ..magvit2_pytorch import FeedForward, LinearSpaceAttention, Residual
-from .lpips import LPIPS
-
-
-def exists(v):
-    return v is not None
-
-
-def pair(t):
-    return t if isinstance(t, tuple) else (t, t)
-
-
-def leaky_relu(p=0.1):
-    return nn.LeakyReLU(p)
-
-
-def hinge_discr_loss(fake, real):
-    return (F.relu(1 + fake) + F.relu(1 - real)).mean()
-
-
-def hinge_gen_loss(fake):
-    return -fake.mean()
-
-
-@autocast(enabled=False)
-@beartype
-def grad_layer_wrt_loss(loss: Tensor, layer: nn.Parameter):
-    return torch_grad(
-        outputs=loss,
-        inputs=layer,
-        grad_outputs=torch.ones_like(loss),
-        retain_graph=True,
-    )[0].detach()
-
-
-def pick_video_frame(video, frame_indices):
-    batch, device = video.shape[0], video.device
-    video = rearrange(video, "b c f ... -> b f c ...")
-    batch_indices = torch.arange(batch, device=device)
-    batch_indices = rearrange(batch_indices, "b -> b 1")
-    images = video[batch_indices, frame_indices]
-    images = rearrange(images, "b 1 c ... -> b c ...")
-    return images
-
-
-def gradient_penalty(images, output):
-    batch_size = images.shape[0]
-
-    gradients = torch_grad(
-        outputs=output,
-        inputs=images,
-        grad_outputs=torch.ones(output.size(), device=images.device),
-        create_graph=True,
-        retain_graph=True,
-        only_inputs=True,
-    )[0]
-
-    gradients = rearrange(gradients, "b ... -> b (...)")
-    return ((gradients.norm(2, dim=1) - 1) ** 2).mean()
-
-
-# discriminator with anti-aliased downsampling (blurpool Zhang et al.)
-
-
-class Blur(nn.Module):
-    def __init__(self):
-        super().__init__()
-        f = torch.Tensor([1, 2, 1])
-        self.register_buffer("f", f)
-
-    def forward(self, x, space_only=False, time_only=False):
-        assert not (space_only and time_only)
-
-        f = self.f
-
-        if space_only:
-            f = einsum("i, j -> i j", f, f)
-            f = rearrange(f, "... -> 1 1 ...")
-        elif time_only:
-            f = rearrange(f, "f -> 1 f 1 1")
-        else:
-            f = einsum("i, j, k -> i j k", f, f, f)
-            f = rearrange(f, "... -> 1 ...")
-
-        is_images = x.ndim == 4
-
-        if is_images:
-            x = rearrange(x, "b c h w -> b c 1 h w")
-
-        out = filter3d(x, f, normalized=True)
-
-        if is_images:
-            out = rearrange(out, "b c 1 h w -> b c h w")
-
-        return out
-
-
-class DiscriminatorBlock(nn.Module):
-    def __init__(
-        self, input_channels, filters, downsample=True, antialiased_downsample=True
-    ):
-        super().__init__()
-        self.conv_res = nn.Conv2d(
-            input_channels, filters, 1, stride=(2 if downsample else 1)
-        )
-
-        self.net = nn.Sequential(
-            nn.Conv2d(input_channels, filters, 3, padding=1),
-            leaky_relu(),
-            nn.Conv2d(filters, filters, 3, padding=1),
-            leaky_relu(),
-        )
-
-        self.maybe_blur = Blur() if antialiased_downsample else None
-
-        self.downsample = (
-            nn.Sequential(
-                Rearrange("b c (h p1) (w p2) -> b (c p1 p2) h w", p1=2, p2=2),
-                nn.Conv2d(filters * 4, filters, 1),
-            )
-            if downsample
-            else None
-        )
-
-    def forward(self, x):
-        res = self.conv_res(x)
-
-        x = self.net(x)
-
-        if exists(self.downsample):
-            if exists(self.maybe_blur):
-                x = self.maybe_blur(x, space_only=True)
-
-            x = self.downsample(x)
-
-        x = (x + res) * (2**-0.5)
-        return x
-
-
-class Discriminator(nn.Module):
-    @beartype
-    def __init__(
-        self,
-        *,
-        dim,
-        image_size,
-        channels=3,
-        max_dim=512,
-        attn_heads=8,
-        attn_dim_head=32,
-        linear_attn_dim_head=8,
-        linear_attn_heads=16,
-        ff_mult=4,
-        antialiased_downsample=False,
-    ):
-        super().__init__()
-        image_size = pair(image_size)
-        min_image_resolution = min(image_size)
-
-        num_layers = int(log2(min_image_resolution) - 2)
-
-        blocks = []
-
-        layer_dims = [channels] + [(dim * 4) * (2**i) for i in range(num_layers + 1)]
-        layer_dims = [min(layer_dim, max_dim) for layer_dim in layer_dims]
-        layer_dims_in_out = tuple(zip(layer_dims[:-1], layer_dims[1:]))
-
-        blocks = []
-        attn_blocks = []
-
-        image_resolution = min_image_resolution
-
-        for ind, (in_chan, out_chan) in enumerate(layer_dims_in_out):
-            num_layer = ind + 1
-            is_not_last = ind != (len(layer_dims_in_out) - 1)
-
-            block = DiscriminatorBlock(
-                in_chan,
-                out_chan,
-                downsample=is_not_last,
-                antialiased_downsample=antialiased_downsample,
-            )
-
-            attn_block = nn.Sequential(
-                Residual(
-                    LinearSpaceAttention(
-                        dim=out_chan,
-                        heads=linear_attn_heads,
-                        dim_head=linear_attn_dim_head,
-                    )
-                ),
-                Residual(FeedForward(dim=out_chan, mult=ff_mult, images=True)),
-            )
-
-            blocks.append(nn.ModuleList([block, attn_block]))
-
-            image_resolution //= 2
-
-        self.blocks = nn.ModuleList(blocks)
-
-        dim_last = layer_dims[-1]
-
-        downsample_factor = 2**num_layers
-        last_fmap_size = tuple(map(lambda n: n // downsample_factor, image_size))
-
-        latent_dim = last_fmap_size[0] * last_fmap_size[1] * dim_last
-
-        self.to_logits = nn.Sequential(
-            nn.Conv2d(dim_last, dim_last, 3, padding=1),
-            leaky_relu(),
-            Rearrange("b ... -> b (...)"),
-            nn.Linear(latent_dim, 1),
-            Rearrange("b 1 -> b"),
-        )
-
-    def forward(self, x):
-        for block, attn_block in self.blocks:
-            x = block(x)
-            x = attn_block(x)
-
-        return self.to_logits(x)
-
-
-class DiscriminatorBlock3D(nn.Module):
-    def __init__(
-        self,
-        input_channels,
-        filters,
-        antialiased_downsample=True,
-    ):
-        super().__init__()
-        self.conv_res = nn.Conv3d(input_channels, filters, 1, stride=2)
-
-        self.net = nn.Sequential(
-            nn.Conv3d(input_channels, filters, 3, padding=1),
-            leaky_relu(),
-            nn.Conv3d(filters, filters, 3, padding=1),
-            leaky_relu(),
-        )
-
-        self.maybe_blur = Blur() if antialiased_downsample else None
-
-        self.downsample = nn.Sequential(
-            Rearrange(
-                "b c (f p1) (h p2) (w p3) -> b (c p1 p2 p3) f h w", p1=2, p2=2, p3=2
-            ),
-            nn.Conv3d(filters * 8, filters, 1),
-        )
-
-    def forward(self, x):
-        res = self.conv_res(x)
-
-        x = self.net(x)
-
-        if exists(self.downsample):
-            if exists(self.maybe_blur):
-                x = self.maybe_blur(x, space_only=True)
-
-            x = self.downsample(x)
-
-        x = (x + res) * (2**-0.5)
-        return x
-
-
-class DiscriminatorBlock3DWithfirstframe(nn.Module):
-    def __init__(
-        self,
-        input_channels,
-        filters,
-        antialiased_downsample=True,
-        pad_mode="first",
-    ):
-        super().__init__()
-        self.downsample_res = DownSample3D(
-            in_channels=input_channels,
-            out_channels=filters,
-            with_conv=True,
-            compress_time=True,
-        )
-
-        self.net = nn.Sequential(
-            CausalConv3d(input_channels, filters, kernel_size=3, pad_mode=pad_mode),
-            leaky_relu(),
-            CausalConv3d(filters, filters, kernel_size=3, pad_mode=pad_mode),
-            leaky_relu(),
-        )
-
-        self.maybe_blur = Blur() if antialiased_downsample else None
-
-        self.downsample = DownSample3D(
-            in_channels=filters,
-            out_channels=filters,
-            with_conv=True,
-            compress_time=True,
-        )
-
-    def forward(self, x):
-        res = self.downsample_res(x)
-
-        x = self.net(x)
-
-        if exists(self.downsample):
-            if exists(self.maybe_blur):
-                x = self.maybe_blur(x, space_only=True)
-
-            x = self.downsample(x)
-
-        x = (x + res) * (2**-0.5)
-        return x
-
-
-class Discriminator3D(nn.Module):
-    @beartype
-    def __init__(
-        self,
-        *,
-        dim,
-        image_size,
-        frame_num,
-        channels=3,
-        max_dim=512,
-        linear_attn_dim_head=8,
-        linear_attn_heads=16,
-        ff_mult=4,
-        antialiased_downsample=False,
-    ):
-        super().__init__()
-        image_size = pair(image_size)
-        min_image_resolution = min(image_size)
-
-        num_layers = int(log2(min_image_resolution) - 2)
-        temporal_num_layers = int(log2(frame_num))
-        self.temporal_num_layers = temporal_num_layers
-
-        layer_dims = [channels] + [(dim * 4) * (2**i) for i in range(num_layers + 1)]
-        layer_dims = [min(layer_dim, max_dim) for layer_dim in layer_dims]
-        layer_dims_in_out = tuple(zip(layer_dims[:-1], layer_dims[1:]))
-
-        blocks = []
-
-        image_resolution = min_image_resolution
-        frame_resolution = frame_num
-
-        for ind, (in_chan, out_chan) in enumerate(layer_dims_in_out):
-            num_layer = ind + 1
-            is_not_last = ind != (len(layer_dims_in_out) - 1)
-
-            if ind < temporal_num_layers:
-                block = DiscriminatorBlock3D(
-                    in_chan,
-                    out_chan,
-                    antialiased_downsample=antialiased_downsample,
-                )
-
-                blocks.append(block)
-
-                frame_resolution //= 2
-            else:
-                block = DiscriminatorBlock(
-                    in_chan,
-                    out_chan,
-                    downsample=is_not_last,
-                    antialiased_downsample=antialiased_downsample,
-                )
-                attn_block = nn.Sequential(
-                    Residual(
-                        LinearSpaceAttention(
-                            dim=out_chan,
-                            heads=linear_attn_heads,
-                            dim_head=linear_attn_dim_head,
-                        )
-                    ),
-                    Residual(FeedForward(dim=out_chan, mult=ff_mult, images=True)),
-                )
-
-                blocks.append(nn.ModuleList([block, attn_block]))
-
-            image_resolution //= 2
-
-        self.blocks = nn.ModuleList(blocks)
-
-        dim_last = layer_dims[-1]
-
-        downsample_factor = 2**num_layers
-        last_fmap_size = tuple(map(lambda n: n // downsample_factor, image_size))
-
-        latent_dim = last_fmap_size[0] * last_fmap_size[1] * dim_last
-
-        self.to_logits = nn.Sequential(
-            nn.Conv2d(dim_last, dim_last, 3, padding=1),
-            leaky_relu(),
-            Rearrange("b ... -> b (...)"),
-            nn.Linear(latent_dim, 1),
-            Rearrange("b 1 -> b"),
-        )
-
-    def forward(self, x):
-        for i, layer in enumerate(self.blocks):
-            if i < self.temporal_num_layers:
-                x = layer(x)
-                if i == self.temporal_num_layers - 1:
-                    x = rearrange(x, "b c f h w -> (b f) c h w")
-            else:
-                block, attn_block = layer
-                x = block(x)
-                x = attn_block(x)
-
-        return self.to_logits(x)
-
-
-class Discriminator3DWithfirstframe(nn.Module):
-    @beartype
-    def __init__(
-        self,
-        *,
-        dim,
-        image_size,
-        frame_num,
-        channels=3,
-        max_dim=512,
-        linear_attn_dim_head=8,
-        linear_attn_heads=16,
-        ff_mult=4,
-        antialiased_downsample=False,
-    ):
-        super().__init__()
-        image_size = pair(image_size)
-        min_image_resolution = min(image_size)
-
-        num_layers = int(log2(min_image_resolution) - 2)
-        temporal_num_layers = int(log2(frame_num))
-        self.temporal_num_layers = temporal_num_layers
-
-        layer_dims = [channels] + [(dim * 4) * (2**i) for i in range(num_layers + 1)]
-        layer_dims = [min(layer_dim, max_dim) for layer_dim in layer_dims]
-        layer_dims_in_out = tuple(zip(layer_dims[:-1], layer_dims[1:]))
-
-        blocks = []
-
-        image_resolution = min_image_resolution
-        frame_resolution = frame_num
-
-        for ind, (in_chan, out_chan) in enumerate(layer_dims_in_out):
-            num_layer = ind + 1
-            is_not_last = ind != (len(layer_dims_in_out) - 1)
-
-            if ind < temporal_num_layers:
-                block = DiscriminatorBlock3DWithfirstframe(
-                    in_chan,
-                    out_chan,
-                    antialiased_downsample=antialiased_downsample,
-                )
-
-                blocks.append(block)
-
-                frame_resolution //= 2
-            else:
-                block = DiscriminatorBlock(
-                    in_chan,
-                    out_chan,
-                    downsample=is_not_last,
-                    antialiased_downsample=antialiased_downsample,
-                )
-                attn_block = nn.Sequential(
-                    Residual(
-                        LinearSpaceAttention(
-                            dim=out_chan,
-                            heads=linear_attn_heads,
-                            dim_head=linear_attn_dim_head,
-                        )
-                    ),
-                    Residual(FeedForward(dim=out_chan, mult=ff_mult, images=True)),
-                )
-
-                blocks.append(nn.ModuleList([block, attn_block]))
-
-            image_resolution //= 2
-
-        self.blocks = nn.ModuleList(blocks)
-
-        dim_last = layer_dims[-1]
-
-        downsample_factor = 2**num_layers
-        last_fmap_size = tuple(map(lambda n: n // downsample_factor, image_size))
-
-        latent_dim = last_fmap_size[0] * last_fmap_size[1] * dim_last
-
-        self.to_logits = nn.Sequential(
-            nn.Conv2d(dim_last, dim_last, 3, padding=1),
-            leaky_relu(),
-            Rearrange("b ... -> b (...)"),
-            nn.Linear(latent_dim, 1),
-            Rearrange("b 1 -> b"),
-        )
-
-    def forward(self, x):
-        for i, layer in enumerate(self.blocks):
-            if i < self.temporal_num_layers:
-                x = layer(x)
-                if i == self.temporal_num_layers - 1:
-                    x = x.mean(dim=2)
-                    # x = rearrange(x, "b c f h w -> (b f) c h w")
-            else:
-                block, attn_block = layer
-                x = block(x)
-                x = attn_block(x)
-
-        return self.to_logits(x)
-
-
-class VideoAutoencoderLoss(nn.Module):
-    def __init__(
-        self,
-        disc_start,
-        perceptual_weight=1,
-        adversarial_loss_weight=0,
-        multiscale_adversarial_loss_weight=0,
-        grad_penalty_loss_weight=0,
-        quantizer_aux_loss_weight=0,
-        vgg_weights=VGG16_Weights.DEFAULT,
-        discr_kwargs=None,
-        discr_3d_kwargs=None,
-    ):
-        super().__init__()
-
-        self.disc_start = disc_start
-        self.perceptual_weight = perceptual_weight
-        self.adversarial_loss_weight = adversarial_loss_weight
-        self.multiscale_adversarial_loss_weight = multiscale_adversarial_loss_weight
-        self.grad_penalty_loss_weight = grad_penalty_loss_weight
-        self.quantizer_aux_loss_weight = quantizer_aux_loss_weight
-
-        if self.perceptual_weight > 0:
-            self.perceptual_model = LPIPS().eval()
-            # self.vgg = torchvision.models.vgg16(pretrained = True)
-            # self.vgg.requires_grad_(False)
-        # if self.adversarial_loss_weight > 0:
-        #     self.discr = Discriminator(**discr_kwargs)
-        # else:
-        #     self.discr = None
-        # if self.multiscale_adversarial_loss_weight > 0:
-        #     self.multiscale_discrs = nn.ModuleList([*multiscale_discrs])
-        # else:
-        #     self.multiscale_discrs = None
-        if discr_kwargs is not None:
-            self.discr = Discriminator(**discr_kwargs)
-        else:
-            self.discr = None
-        if discr_3d_kwargs is not None:
-            # self.discr_3d = Discriminator3D(**discr_3d_kwargs)
-            self.discr_3d = instantiate_from_config(discr_3d_kwargs)
-        else:
-            self.discr_3d = None
-        # self.multiscale_discrs = nn.ModuleList([*multiscale_discrs])
-
-        self.register_buffer("zero", torch.tensor(0.0), persistent=False)
-
-    def get_trainable_params(self) -> Any:
-        params = []
-        if self.discr is not None:
-            params += list(self.discr.parameters())
-        if self.discr_3d is not None:
-            params += list(self.discr_3d.parameters())
-        # if self.multiscale_discrs is not None:
-        #     for discr in self.multiscale_discrs:
-        #         params += list(discr.parameters())
-        return params
-
-    def get_trainable_parameters(self) -> Any:
-        return self.get_trainable_params()
-
-    def forward(
-        self,
-        inputs,
-        reconstructions,
-        optimizer_idx,
-        global_step,
-        aux_losses=None,
-        last_layer=None,
-        split="train",
-    ):
-        batch, channels, frames = inputs.shape[:3]
-
-        if optimizer_idx == 0:
-            recon_loss = F.mse_loss(inputs, reconstructions)
-
-            if self.perceptual_weight > 0:
-                frame_indices = torch.randn((batch, frames)).topk(1, dim=-1).indices
-
-                input_frames = pick_video_frame(inputs, frame_indices)
-                recon_frames = pick_video_frame(reconstructions, frame_indices)
-
-                perceptual_loss = self.perceptual_model(
-                    input_frames.contiguous(), recon_frames.contiguous()
-                ).mean()
-            else:
-                perceptual_loss = self.zero
-
-            if (
-                global_step >= self.disc_start
-                or not self.training
-                or self.adversarial_loss_weight == 0
-            ):
-                gen_loss = self.zero
-                adaptive_weight = 0
-            else:
-                # frame_indices = torch.randn((batch, frames)).topk(1, dim = -1).indices
-                # recon_video_frames = pick_video_frame(reconstructions, frame_indices)
-
-                # fake_logits = self.discr(recon_video_frames)
-                fake_logits = self.discr_3d(reconstructions)
-                gen_loss = hinge_gen_loss(fake_logits)
-
-                adaptive_weight = 1
-                if self.perceptual_weight > 0 and last_layer is not None:
-                    norm_grad_wrt_perceptual_loss = grad_layer_wrt_loss(
-                        perceptual_loss, last_layer
-                    ).norm(p=2)
-                    norm_grad_wrt_gen_loss = grad_layer_wrt_loss(
-                        gen_loss, last_layer
-                    ).norm(p=2)
-                    adaptive_weight = (
-                        norm_grad_wrt_perceptual_loss
-                        / norm_grad_wrt_gen_loss.clamp(min=1e-3)
-                    )
-                    adaptive_weight.clamp_(max=1e3)
-
-                    if torch.isnan(adaptive_weight).any():
-                        adaptive_weight = 1
-
-            # multiscale discriminator losses
-
-            # multiscale_gen_losses = []
-            # multiscale_gen_adaptive_weights = []
-            # if self.multiscale_adversarial_loss_weight > 0:
-            #     if not exists(recon_video_frames):
-            #         frame_indices = torch.randn((batch, frames)).topk(1, dim = -1).indices
-            #         recon_video_frames = pick_video_frame(reconstructions, frame_indices)
-            #     for discr in self.multiscale_discrs:
-            #         fake_logits = recon_video_frames
-
-            #         multiscale_gen_loss = hinge_gen_loss(fake_logits)
-            #         multiscale_gen_losses.append(multiscale_gen_loss)
-
-            #         multiscale_adaptive_weight = 1.
-
-            #         if exists(norm_grad_wrt_perceptual_loss):
-            #             norm_grad_wrt_gen_loss = grad_layer_wrt_loss(multiscale_gen_loss, last_layer).norm(p = 2)
-            #             multiscale_adaptive_weight = norm_grad_wrt_perceptual_loss / norm_grad_wrt_gen_loss.clamp(min = 1e-5)
-            #             multiscale_adaptive_weight.clamp_(max = 1e3)
-
-            #         multiscale_gen_adaptive_weights.append(multiscale_adaptive_weight)
-            #     weighted_multiscale_gen_losses = sum(loss * weight for loss, weight in zip(multiscale_gen_losses, multiscale_gen_adaptive_weights))
-            # else:
-            #     weighted_multiscale_gen_losses = self.zero
-
-            if aux_losses is None:
-                aux_losses = self.zero
-
-            total_loss = (
-                recon_loss
-                + aux_losses * self.quantizer_aux_loss_weight
-                + perceptual_loss * self.perceptual_weight
-                + gen_loss * self.adversarial_loss_weight
-            )
-            # gen_loss * adaptive_weight * self.adversarial_loss_weight + \
-            # weighted_multiscale_gen_losses * self.multiscale_adversarial_loss_weight
-
-            log = {
-                "{}/total_loss".format(split): total_loss.detach(),
-                "{}/recon_loss".format(split): recon_loss.detach(),
-                "{}/perceptual_loss".format(split): perceptual_loss.detach(),
-                "{}/gen_loss".format(split): gen_loss.detach(),
-                "{}/aux_losses".format(split): aux_losses.detach(),
-                # "{}/weighted_multiscale_gen_losses".format(split): weighted_multiscale_gen_losses.detach(),
-                "{}/adaptive_weight".format(split): adaptive_weight,
-                # "{}/multiscale_adaptive_weights".format(split): sum(multiscale_gen_adaptive_weights),
-            }
-
-            return total_loss, log
-
-        if optimizer_idx == 1:
-            # frame_indices = torch.randn((batch, frames)).topk(1, dim = -1).indices
-
-            # real = pick_video_frame(inputs, frame_indices)
-            # fake = pick_video_frame(reconstructions, frame_indices)
-
-            # apply_gradient_penalty = self.grad_penalty_loss_weight > 0
-            # if apply_gradient_penalty:
-            #     real = real.requires_grad_()
-
-            # real_logits = self.discr(real)
-            # fake_logits = self.discr(fake.detach())
-
-            apply_gradient_penalty = self.grad_penalty_loss_weight > 0
-            if apply_gradient_penalty:
-                inputs = inputs.requires_grad_()
-            real_logits = self.discr_3d(inputs)
-            fake_logits = self.discr_3d(reconstructions.detach())
-
-            discr_loss = hinge_discr_loss(fake_logits, real_logits)
-
-            # # multiscale discriminators
-            # multiscale_discr_losses = []
-            # if self.multiscale_adversarial_loss_weight > 0:
-            #     for discr in self.multiscale_discrs:
-            #         multiscale_real_logits = discr(inputs)
-            #         multiscale_fake_logits = discr(reconstructions.detach())
-
-            #         multiscale_discr_loss = hinge_discr_loss(multiscale_fake_logits, multiscale_real_logits)
-            #         multiscale_discr_losses.append(multiscale_discr_loss)
-            # else:
-            #     multiscale_discr_losses.append(self.zero)
-
-            # gradient penalty
-            if apply_gradient_penalty:
-                # gradient_penalty_loss = gradient_penalty(real, real_logits)
-                gradient_penalty_loss = gradient_penalty(inputs, real_logits)
-            else:
-                gradient_penalty_loss = self.zero
-
-            total_loss = (
-                discr_loss + self.grad_penalty_loss_weight * gradient_penalty_loss
-            )
-            # self.grad_penalty_loss_weight * gradient_penalty_loss + \
-            # sum(multiscale_discr_losses) * self.multiscale_adversarial_loss_weight
-
-            log = {
-                "{}/total_disc_loss".format(split): total_loss.detach(),
-                "{}/discr_loss".format(split): discr_loss.detach(),
-                "{}/grad_penalty_loss".format(split): gradient_penalty_loss.detach(),
-                # "{}/multiscale_discr_loss".format(split): sum(multiscale_discr_losses).detach(),
-                "{}/logits_real".format(split): real_logits.detach().mean(),
-                "{}/logits_fake".format(split): fake_logits.detach().mean(),
-            }
-            return total_loss, log
diff --git a/videotuna/models/cogvideo_sat/sgm/modules/autoencoding/lpips/__init__.py b/videotuna/models/cogvideo_sat/sgm/modules/autoencoding/lpips/__init__.py
deleted file mode 100644
index e69de29b..00000000
diff --git a/videotuna/models/cogvideo_sat/sgm/modules/autoencoding/lpips/loss/.gitignore b/videotuna/models/cogvideo_sat/sgm/modules/autoencoding/lpips/loss/.gitignore
deleted file mode 100644
index 13960255..00000000
--- a/videotuna/models/cogvideo_sat/sgm/modules/autoencoding/lpips/loss/.gitignore
+++ /dev/null
@@ -1 +0,0 @@
-vgg.pth
diff --git a/videotuna/models/cogvideo_sat/sgm/modules/autoencoding/lpips/loss/LICENSE b/videotuna/models/cogvideo_sat/sgm/modules/autoencoding/lpips/loss/LICENSE
deleted file mode 100644
index 842c363a..00000000
--- a/videotuna/models/cogvideo_sat/sgm/modules/autoencoding/lpips/loss/LICENSE
+++ /dev/null
@@ -1,23 +0,0 @@
-Copyright (c) 2018, Richard Zhang, Phillip Isola, Alexei A. Efros, Eli Shechtman, Oliver Wang
-All rights reserved.
-
-Redistribution and use in source and binary forms, with or without
-modification, are permitted provided that the following conditions are met:
-
-* Redistributions of source code must retain the above copyright notice, this
-  list of conditions and the following disclaimer.
-
-* Redistributions in binary form must reproduce the above copyright notice,
-  this list of conditions and the following disclaimer in the documentation
-  and/or other materials provided with the distribution.
-
-THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS"
-AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
-IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE
-DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE
-FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
-DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR
-SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER
-CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY,
-OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
-OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
diff --git a/videotuna/models/cogvideo_sat/sgm/modules/autoencoding/lpips/loss/__init__.py b/videotuna/models/cogvideo_sat/sgm/modules/autoencoding/lpips/loss/__init__.py
deleted file mode 100644
index e69de29b..00000000
diff --git a/videotuna/models/cogvideo_sat/sgm/modules/autoencoding/lpips/loss/lpips.py b/videotuna/models/cogvideo_sat/sgm/modules/autoencoding/lpips/loss/lpips.py
deleted file mode 100644
index 3e34f3d0..00000000
--- a/videotuna/models/cogvideo_sat/sgm/modules/autoencoding/lpips/loss/lpips.py
+++ /dev/null
@@ -1,147 +0,0 @@
-"""Stripped version of https://github.com/richzhang/PerceptualSimilarity/tree/master/models"""
-
-from collections import namedtuple
-
-import torch
-import torch.nn as nn
-from torchvision import models
-
-from ..util import get_ckpt_path
-
-
-class LPIPS(nn.Module):
-    # Learned perceptual metric
-    def __init__(self, use_dropout=True):
-        super().__init__()
-        self.scaling_layer = ScalingLayer()
-        self.chns = [64, 128, 256, 512, 512]  # vg16 features
-        self.net = vgg16(pretrained=True, requires_grad=False)
-        self.lin0 = NetLinLayer(self.chns[0], use_dropout=use_dropout)
-        self.lin1 = NetLinLayer(self.chns[1], use_dropout=use_dropout)
-        self.lin2 = NetLinLayer(self.chns[2], use_dropout=use_dropout)
-        self.lin3 = NetLinLayer(self.chns[3], use_dropout=use_dropout)
-        self.lin4 = NetLinLayer(self.chns[4], use_dropout=use_dropout)
-        self.load_from_pretrained()
-        for param in self.parameters():
-            param.requires_grad = False
-
-    def load_from_pretrained(self, name="vgg_lpips"):
-        ckpt = get_ckpt_path(name, "sgm/modules/autoencoding/lpips/loss")
-        self.load_state_dict(
-            torch.load(ckpt, map_location=torch.device("cpu")), strict=False
-        )
-        print("loaded pretrained LPIPS loss from {}".format(ckpt))
-
-    @classmethod
-    def from_pretrained(cls, name="vgg_lpips"):
-        if name != "vgg_lpips":
-            raise NotImplementedError
-        model = cls()
-        ckpt = get_ckpt_path(name)
-        model.load_state_dict(
-            torch.load(ckpt, map_location=torch.device("cpu")), strict=False
-        )
-        return model
-
-    def forward(self, input, target):
-        in0_input, in1_input = (self.scaling_layer(input), self.scaling_layer(target))
-        outs0, outs1 = self.net(in0_input), self.net(in1_input)
-        feats0, feats1, diffs = {}, {}, {}
-        lins = [self.lin0, self.lin1, self.lin2, self.lin3, self.lin4]
-        for kk in range(len(self.chns)):
-            feats0[kk], feats1[kk] = normalize_tensor(outs0[kk]), normalize_tensor(
-                outs1[kk]
-            )
-            diffs[kk] = (feats0[kk] - feats1[kk]) ** 2
-
-        res = [
-            spatial_average(lins[kk].model(diffs[kk]), keepdim=True)
-            for kk in range(len(self.chns))
-        ]
-        val = res[0]
-        for l in range(1, len(self.chns)):
-            val += res[l]
-        return val
-
-
-class ScalingLayer(nn.Module):
-    def __init__(self):
-        super(ScalingLayer, self).__init__()
-        self.register_buffer(
-            "shift", torch.Tensor([-0.030, -0.088, -0.188])[None, :, None, None]
-        )
-        self.register_buffer(
-            "scale", torch.Tensor([0.458, 0.448, 0.450])[None, :, None, None]
-        )
-
-    def forward(self, inp):
-        return (inp - self.shift) / self.scale
-
-
-class NetLinLayer(nn.Module):
-    """A single linear layer which does a 1x1 conv"""
-
-    def __init__(self, chn_in, chn_out=1, use_dropout=False):
-        super(NetLinLayer, self).__init__()
-        layers = (
-            [
-                nn.Dropout(),
-            ]
-            if (use_dropout)
-            else []
-        )
-        layers += [
-            nn.Conv2d(chn_in, chn_out, 1, stride=1, padding=0, bias=False),
-        ]
-        self.model = nn.Sequential(*layers)
-
-
-class vgg16(torch.nn.Module):
-    def __init__(self, requires_grad=False, pretrained=True):
-        super(vgg16, self).__init__()
-        vgg_pretrained_features = models.vgg16(pretrained=pretrained).features
-        self.slice1 = torch.nn.Sequential()
-        self.slice2 = torch.nn.Sequential()
-        self.slice3 = torch.nn.Sequential()
-        self.slice4 = torch.nn.Sequential()
-        self.slice5 = torch.nn.Sequential()
-        self.N_slices = 5
-        for x in range(4):
-            self.slice1.add_module(str(x), vgg_pretrained_features[x])
-        for x in range(4, 9):
-            self.slice2.add_module(str(x), vgg_pretrained_features[x])
-        for x in range(9, 16):
-            self.slice3.add_module(str(x), vgg_pretrained_features[x])
-        for x in range(16, 23):
-            self.slice4.add_module(str(x), vgg_pretrained_features[x])
-        for x in range(23, 30):
-            self.slice5.add_module(str(x), vgg_pretrained_features[x])
-        if not requires_grad:
-            for param in self.parameters():
-                param.requires_grad = False
-
-    def forward(self, X):
-        h = self.slice1(X)
-        h_relu1_2 = h
-        h = self.slice2(h)
-        h_relu2_2 = h
-        h = self.slice3(h)
-        h_relu3_3 = h
-        h = self.slice4(h)
-        h_relu4_3 = h
-        h = self.slice5(h)
-        h_relu5_3 = h
-        vgg_outputs = namedtuple(
-            "VggOutputs", ["relu1_2", "relu2_2", "relu3_3", "relu4_3", "relu5_3"]
-        )
-        out = vgg_outputs(h_relu1_2, h_relu2_2, h_relu3_3, h_relu4_3, h_relu5_3)
-        return out
-
-
-def normalize_tensor(x, eps=1e-10):
-    norm_factor = torch.sqrt(torch.sum(x**2, dim=1, keepdim=True))
-    return x / (norm_factor + eps)
-
-
-def spatial_average(x, keepdim=True):
-    return x.mean([2, 3], keepdim=keepdim)
diff --git a/videotuna/models/cogvideo_sat/sgm/modules/autoencoding/lpips/model/LICENSE b/videotuna/models/cogvideo_sat/sgm/modules/autoencoding/lpips/model/LICENSE
deleted file mode 100644
index d75f0ee8..00000000
--- a/videotuna/models/cogvideo_sat/sgm/modules/autoencoding/lpips/model/LICENSE
+++ /dev/null
@@ -1,58 +0,0 @@
-Copyright (c) 2017, Jun-Yan Zhu and Taesung Park
-All rights reserved.
-
-Redistribution and use in source and binary forms, with or without
-modification, are permitted provided that the following conditions are met:
-
-* Redistributions of source code must retain the above copyright notice, this
-  list of conditions and the following disclaimer.
-
-* Redistributions in binary form must reproduce the above copyright notice,
-  this list of conditions and the following disclaimer in the documentation
-  and/or other materials provided with the distribution.
-
-THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS"
-AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
-IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE
-DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE
-FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
-DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR
-SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER
-CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY,
-OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
-OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
-
-
---------------------------- LICENSE FOR pix2pix --------------------------------
-BSD License
-
-For pix2pix software
-Copyright (c) 2016, Phillip Isola and Jun-Yan Zhu
-All rights reserved.
-
-Redistribution and use in source and binary forms, with or without
-modification, are permitted provided that the following conditions are met:
-
-* Redistributions of source code must retain the above copyright notice, this
-  list of conditions and the following disclaimer.
-
-* Redistributions in binary form must reproduce the above copyright notice,
-  this list of conditions and the following disclaimer in the documentation
-  and/or other materials provided with the distribution.
-
------------------------------ LICENSE FOR DCGAN --------------------------------
-BSD License
-
-For dcgan.torch software
-
-Copyright (c) 2015, Facebook, Inc. All rights reserved.
-
-Redistribution and use in source and binary forms, with or without modification, are permitted provided that the following conditions are met:
-
-Redistributions of source code must retain the above copyright notice, this list of conditions and the following disclaimer.
-
-Redistributions in binary form must reproduce the above copyright notice, this list of conditions and the following disclaimer in the documentation and/or other materials provided with the distribution.
-
-Neither the name Facebook nor the names of its contributors may be used to endorse or promote products derived from this software without specific prior written permission.
-
-THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
diff --git a/videotuna/models/cogvideo_sat/sgm/modules/autoencoding/lpips/model/__init__.py b/videotuna/models/cogvideo_sat/sgm/modules/autoencoding/lpips/model/__init__.py
deleted file mode 100644
index e69de29b..00000000
diff --git a/videotuna/models/cogvideo_sat/sgm/modules/autoencoding/lpips/model/model.py b/videotuna/models/cogvideo_sat/sgm/modules/autoencoding/lpips/model/model.py
deleted file mode 100644
index 5d767fcf..00000000
--- a/videotuna/models/cogvideo_sat/sgm/modules/autoencoding/lpips/model/model.py
+++ /dev/null
@@ -1,91 +0,0 @@
-import functools
-
-import torch.nn as nn
-
-from ..util import ActNorm
-
-
-def weights_init(m):
-    classname = m.__class__.__name__
-    if classname.find("Conv") != -1:
-        try:
-            nn.init.normal_(m.weight.data, 0.0, 0.02)
-        except:
-            nn.init.normal_(m.conv.weight.data, 0.0, 0.02)
-    elif classname.find("BatchNorm") != -1:
-        nn.init.normal_(m.weight.data, 1.0, 0.02)
-        nn.init.constant_(m.bias.data, 0)
-
-
-class NLayerDiscriminator(nn.Module):
-    """Defines a PatchGAN discriminator as in Pix2Pix
-    --> see https://github.com/junyanz/pytorch-CycleGAN-and-pix2pix/blob/master/models/networks.py
-    """
-
-    def __init__(self, input_nc=3, ndf=64, n_layers=3, use_actnorm=False):
-        """Construct a PatchGAN discriminator
-        Parameters:
-            input_nc (int)  -- the number of channels in input images
-            ndf (int)       -- the number of filters in the last conv layer
-            n_layers (int)  -- the number of conv layers in the discriminator
-            norm_layer      -- normalization layer
-        """
-        super(NLayerDiscriminator, self).__init__()
-        if not use_actnorm:
-            norm_layer = nn.BatchNorm2d
-        else:
-            norm_layer = ActNorm
-        if (
-            type(norm_layer) == functools.partial
-        ):  # no need to use bias as BatchNorm2d has affine parameters
-            use_bias = norm_layer.func != nn.BatchNorm2d
-        else:
-            use_bias = norm_layer != nn.BatchNorm2d
-
-        kw = 4
-        padw = 1
-        sequence = [
-            nn.Conv2d(input_nc, ndf, kernel_size=kw, stride=2, padding=padw),
-            nn.LeakyReLU(0.2, True),
-        ]
-        nf_mult = 1
-        nf_mult_prev = 1
-        for n in range(1, n_layers):  # gradually increase the number of filters
-            nf_mult_prev = nf_mult
-            nf_mult = min(2**n, 8)
-            sequence += [
-                nn.Conv2d(
-                    ndf * nf_mult_prev,
-                    ndf * nf_mult,
-                    kernel_size=kw,
-                    stride=2,
-                    padding=padw,
-                    bias=use_bias,
-                ),
-                norm_layer(ndf * nf_mult),
-                nn.LeakyReLU(0.2, True),
-            ]
-
-        nf_mult_prev = nf_mult
-        nf_mult = min(2**n_layers, 8)
-        sequence += [
-            nn.Conv2d(
-                ndf * nf_mult_prev,
-                ndf * nf_mult,
-                kernel_size=kw,
-                stride=1,
-                padding=padw,
-                bias=use_bias,
-            ),
-            norm_layer(ndf * nf_mult),
-            nn.LeakyReLU(0.2, True),
-        ]
-
-        sequence += [
-            nn.Conv2d(ndf * nf_mult, 1, kernel_size=kw, stride=1, padding=padw)
-        ]  # output 1 channel prediction map
-        self.main = nn.Sequential(*sequence)
-
-    def forward(self, input):
-        """Standard forward."""
-        return self.main(input)
diff --git a/videotuna/models/cogvideo_sat/sgm/modules/autoencoding/lpips/util.py b/videotuna/models/cogvideo_sat/sgm/modules/autoencoding/lpips/util.py
deleted file mode 100644
index 49c76e37..00000000
--- a/videotuna/models/cogvideo_sat/sgm/modules/autoencoding/lpips/util.py
+++ /dev/null
@@ -1,128 +0,0 @@
-import hashlib
-import os
-
-import requests
-import torch
-import torch.nn as nn
-from tqdm import tqdm
-
-URL_MAP = {"vgg_lpips": "https://heibox.uni-heidelberg.de/f/607503859c864bc1b30b/?dl=1"}
-
-CKPT_MAP = {"vgg_lpips": "vgg.pth"}
-
-MD5_MAP = {"vgg_lpips": "d507d7349b931f0638a25a48a722f98a"}
-
-
-def download(url, local_path, chunk_size=1024):
-    os.makedirs(os.path.split(local_path)[0], exist_ok=True)
-    with requests.get(url, stream=True) as r:
-        total_size = int(r.headers.get("content-length", 0))
-        with tqdm(total=total_size, unit="B", unit_scale=True) as pbar:
-            with open(local_path, "wb") as f:
-                for data in r.iter_content(chunk_size=chunk_size):
-                    if data:
-                        f.write(data)
-                        pbar.update(chunk_size)
-
-
-def md5_hash(path):
-    with open(path, "rb") as f:
-        content = f.read()
-    return hashlib.md5(content).hexdigest()
-
-
-def get_ckpt_path(name, root, check=False):
-    assert name in URL_MAP
-    path = os.path.join(root, CKPT_MAP[name])
-    if not os.path.exists(path) or (check and not md5_hash(path) == MD5_MAP[name]):
-        print("Downloading {} model from {} to {}".format(name, URL_MAP[name], path))
-        download(URL_MAP[name], path)
-        md5 = md5_hash(path)
-        assert md5 == MD5_MAP[name], md5
-    return path
-
-
-class ActNorm(nn.Module):
-    def __init__(
-        self, num_features, logdet=False, affine=True, allow_reverse_init=False
-    ):
-        assert affine
-        super().__init__()
-        self.logdet = logdet
-        self.loc = nn.Parameter(torch.zeros(1, num_features, 1, 1))
-        self.scale = nn.Parameter(torch.ones(1, num_features, 1, 1))
-        self.allow_reverse_init = allow_reverse_init
-
-        self.register_buffer("initialized", torch.tensor(0, dtype=torch.uint8))
-
-    def initialize(self, input):
-        with torch.no_grad():
-            flatten = input.permute(1, 0, 2, 3).contiguous().view(input.shape[1], -1)
-            mean = (
-                flatten.mean(1)
-                .unsqueeze(1)
-                .unsqueeze(2)
-                .unsqueeze(3)
-                .permute(1, 0, 2, 3)
-            )
-            std = (
-                flatten.std(1)
-                .unsqueeze(1)
-                .unsqueeze(2)
-                .unsqueeze(3)
-                .permute(1, 0, 2, 3)
-            )
-
-            self.loc.data.copy_(-mean)
-            self.scale.data.copy_(1 / (std + 1e-6))
-
-    def forward(self, input, reverse=False):
-        if reverse:
-            return self.reverse(input)
-        if len(input.shape) == 2:
-            input = input[:, :, None, None]
-            squeeze = True
-        else:
-            squeeze = False
-
-        _, _, height, width = input.shape
-
-        if self.training and self.initialized.item() == 0:
-            self.initialize(input)
-            self.initialized.fill_(1)
-
-        h = self.scale * (input + self.loc)
-
-        if squeeze:
-            h = h.squeeze(-1).squeeze(-1)
-
-        if self.logdet:
-            log_abs = torch.log(torch.abs(self.scale))
-            logdet = height * width * torch.sum(log_abs)
-            logdet = logdet * torch.ones(input.shape[0]).to(input)
-            return h, logdet
-
-        return h
-
-    def reverse(self, output):
-        if self.training and self.initialized.item() == 0:
-            if not self.allow_reverse_init:
-                raise RuntimeError(
-                    "Initializing ActNorm in reverse direction is "
-                    "disabled by default. Use allow_reverse_init=True to enable."
-                )
-            else:
-                self.initialize(output)
-                self.initialized.fill_(1)
-
-        if len(output.shape) == 2:
-            output = output[:, :, None, None]
-            squeeze = True
-        else:
-            squeeze = False
-
-        h = output / self.scale - self.loc
-
-        if squeeze:
-            h = h.squeeze(-1).squeeze(-1)
-        return h
diff --git a/videotuna/models/cogvideo_sat/sgm/modules/autoencoding/lpips/vqperceptual.py b/videotuna/models/cogvideo_sat/sgm/modules/autoencoding/lpips/vqperceptual.py
deleted file mode 100644
index 6195f0a6..00000000
--- a/videotuna/models/cogvideo_sat/sgm/modules/autoencoding/lpips/vqperceptual.py
+++ /dev/null
@@ -1,17 +0,0 @@
-import torch
-import torch.nn.functional as F
-
-
-def hinge_d_loss(logits_real, logits_fake):
-    loss_real = torch.mean(F.relu(1.0 - logits_real))
-    loss_fake = torch.mean(F.relu(1.0 + logits_fake))
-    d_loss = 0.5 * (loss_real + loss_fake)
-    return d_loss
-
-
-def vanilla_d_loss(logits_real, logits_fake):
-    d_loss = 0.5 * (
-        torch.mean(torch.nn.functional.softplus(-logits_real))
-        + torch.mean(torch.nn.functional.softplus(logits_fake))
-    )
-    return d_loss
diff --git a/videotuna/models/cogvideo_sat/sgm/modules/autoencoding/magvit2_pytorch.py b/videotuna/models/cogvideo_sat/sgm/modules/autoencoding/magvit2_pytorch.py
deleted file mode 100644
index 16370ad9..00000000
--- a/videotuna/models/cogvideo_sat/sgm/modules/autoencoding/magvit2_pytorch.py
+++ /dev/null
@@ -1,1968 +0,0 @@
-import copy
-import pickle
-from collections import namedtuple
-from functools import partial, wraps
-from math import ceil, log2, sqrt
-from pathlib import Path
-
-import torch
-import torch.nn.functional as F
-import torchvision
-from beartype import beartype
-from beartype.typing import List, Optional, Tuple, Union
-from einops import pack, rearrange, reduce, repeat, unpack
-from einops.layers.torch import Rearrange
-from gateloop_transformer import SimpleGateLoopLayer
-from kornia.filters import filter3d
-from magvit2_pytorch.attend import Attend
-from magvit2_pytorch.version import __version__
-from taylor_series_linear_attention import TaylorSeriesLinearAttn
-from torch import Tensor, einsum, nn
-from torch.autograd import grad as torch_grad
-from torch.cuda.amp import autocast
-from torch.nn import Module, ModuleList
-from torchvision.models import VGG16_Weights
-
-# from vector_quantize_pytorch import LFQ, FSQ
-from .regularizers.finite_scalar_quantization import FSQ
-from .regularizers.lookup_free_quantization import LFQ
-
-# helper
-
-
-def exists(v):
-    return v is not None
-
-
-def default(v, d):
-    return v if exists(v) else d
-
-
-def safe_get_index(it, ind, default=None):
-    if ind < len(it):
-        return it[ind]
-    return default
-
-
-def pair(t):
-    return t if isinstance(t, tuple) else (t, t)
-
-
-def identity(t, *args, **kwargs):
-    return t
-
-
-def divisible_by(num, den):
-    return (num % den) == 0
-
-
-def pack_one(t, pattern):
-    return pack([t], pattern)
-
-
-def unpack_one(t, ps, pattern):
-    return unpack(t, ps, pattern)[0]
-
-
-def append_dims(t, ndims: int):
-    return t.reshape(*t.shape, *((1,) * ndims))
-
-
-def is_odd(n):
-    return not divisible_by(n, 2)
-
-
-def maybe_del_attr_(o, attr):
-    if hasattr(o, attr):
-        delattr(o, attr)
-
-
-def cast_tuple(t, length=1):
-    return t if isinstance(t, tuple) else ((t,) * length)
-
-
-# tensor helpers
-
-
-def l2norm(t):
-    return F.normalize(t, dim=-1, p=2)
-
-
-def pad_at_dim(t, pad, dim=-1, value=0.0):
-    dims_from_right = (-dim - 1) if dim < 0 else (t.ndim - dim - 1)
-    zeros = (0, 0) * dims_from_right
-    return F.pad(t, (*zeros, *pad), value=value)
-
-
-def pick_video_frame(video, frame_indices):
-    batch, device = video.shape[0], video.device
-    video = rearrange(video, "b c f ... -> b f c ...")
-    batch_indices = torch.arange(batch, device=device)
-    batch_indices = rearrange(batch_indices, "b -> b 1")
-    images = video[batch_indices, frame_indices]
-    images = rearrange(images, "b 1 c ... -> b c ...")
-    return images
-
-
-# gan related
-
-
-def gradient_penalty(images, output):
-    batch_size = images.shape[0]
-
-    gradients = torch_grad(
-        outputs=output,
-        inputs=images,
-        grad_outputs=torch.ones(output.size(), device=images.device),
-        create_graph=True,
-        retain_graph=True,
-        only_inputs=True,
-    )[0]
-
-    gradients = rearrange(gradients, "b ... -> b (...)")
-    return ((gradients.norm(2, dim=1) - 1) ** 2).mean()
-
-
-def leaky_relu(p=0.1):
-    return nn.LeakyReLU(p)
-
-
-def hinge_discr_loss(fake, real):
-    return (F.relu(1 + fake) + F.relu(1 - real)).mean()
-
-
-def hinge_gen_loss(fake):
-    return -fake.mean()
-
-
-@autocast(enabled=False)
-@beartype
-def grad_layer_wrt_loss(loss: Tensor, layer: nn.Parameter):
-    return torch_grad(
-        outputs=loss,
-        inputs=layer,
-        grad_outputs=torch.ones_like(loss),
-        retain_graph=True,
-    )[0].detach()
-
-
-# helper decorators
-
-
-def remove_vgg(fn):
-    @wraps(fn)
-    def inner(self, *args, **kwargs):
-        has_vgg = hasattr(self, "vgg")
-        if has_vgg:
-            vgg = self.vgg
-            delattr(self, "vgg")
-
-        out = fn(self, *args, **kwargs)
-
-        if has_vgg:
-            self.vgg = vgg
-
-        return out
-
-    return inner
-
-
-# helper classes
-
-
-def Sequential(*modules):
-    modules = [*filter(exists, modules)]
-
-    if len(modules) == 0:
-        return nn.Identity()
-
-    return nn.Sequential(*modules)
-
-
-class Residual(Module):
-    @beartype
-    def __init__(self, fn: Module):
-        super().__init__()
-        self.fn = fn
-
-    def forward(self, x, **kwargs):
-        return self.fn(x, **kwargs) + x
-
-
-# for a bunch of tensor operations to change tensor to (batch, time, feature dimension) and back
-
-
-class ToTimeSequence(Module):
-    @beartype
-    def __init__(self, fn: Module):
-        super().__init__()
-        self.fn = fn
-
-    def forward(self, x, **kwargs):
-        x = rearrange(x, "b c f ... -> b ... f c")
-        x, ps = pack_one(x, "* n c")
-
-        o = self.fn(x, **kwargs)
-
-        o = unpack_one(o, ps, "* n c")
-        return rearrange(o, "b ... f c -> b c f ...")
-
-
-class SqueezeExcite(Module):
-    # global context network - attention-esque squeeze-excite variant (https://arxiv.org/abs/2012.13375)
-
-    def __init__(self, dim, *, dim_out=None, dim_hidden_min=16, init_bias=-10):
-        super().__init__()
-        dim_out = default(dim_out, dim)
-
-        self.to_k = nn.Conv2d(dim, 1, 1)
-        dim_hidden = max(dim_hidden_min, dim_out // 2)
-
-        self.net = nn.Sequential(
-            nn.Conv2d(dim, dim_hidden, 1),
-            nn.LeakyReLU(0.1),
-            nn.Conv2d(dim_hidden, dim_out, 1),
-            nn.Sigmoid(),
-        )
-
-        nn.init.zeros_(self.net[-2].weight)
-        nn.init.constant_(self.net[-2].bias, init_bias)
-
-    def forward(self, x):
-        orig_input, batch = x, x.shape[0]
-        is_video = x.ndim == 5
-
-        if is_video:
-            x = rearrange(x, "b c f h w -> (b f) c h w")
-
-        context = self.to_k(x)
-
-        context = rearrange(context, "b c h w -> b c (h w)").softmax(dim=-1)
-        spatial_flattened_input = rearrange(x, "b c h w -> b c (h w)")
-
-        out = einsum("b i n, b c n -> b c i", context, spatial_flattened_input)
-        out = rearrange(out, "... -> ... 1")
-        gates = self.net(out)
-
-        if is_video:
-            gates = rearrange(gates, "(b f) c h w -> b c f h w", b=batch)
-
-        return gates * orig_input
-
-
-# token shifting
-
-
-class TokenShift(Module):
-    @beartype
-    def __init__(self, fn: Module):
-        super().__init__()
-        self.fn = fn
-
-    def forward(self, x, **kwargs):
-        x, x_shift = x.chunk(2, dim=1)
-        x_shift = pad_at_dim(x_shift, (1, -1), dim=2)  # shift time dimension
-        x = torch.cat((x, x_shift), dim=1)
-        return self.fn(x, **kwargs)
-
-
-# rmsnorm
-
-
-class RMSNorm(Module):
-    def __init__(self, dim, channel_first=False, images=False, bias=False):
-        super().__init__()
-        broadcastable_dims = (1, 1, 1) if not images else (1, 1)
-        shape = (dim, *broadcastable_dims) if channel_first else (dim,)
-
-        self.channel_first = channel_first
-        self.scale = dim**0.5
-        self.gamma = nn.Parameter(torch.ones(shape))
-        self.bias = nn.Parameter(torch.zeros(shape)) if bias else 0.0
-
-    def forward(self, x):
-        return (
-            F.normalize(x, dim=(1 if self.channel_first else -1))
-            * self.scale
-            * self.gamma
-            + self.bias
-        )
-
-
-class AdaptiveRMSNorm(Module):
-    def __init__(self, dim, *, dim_cond, channel_first=False, images=False, bias=False):
-        super().__init__()
-        broadcastable_dims = (1, 1, 1) if not images else (1, 1)
-        shape = (dim, *broadcastable_dims) if channel_first else (dim,)
-
-        self.dim_cond = dim_cond
-        self.channel_first = channel_first
-        self.scale = dim**0.5
-
-        self.to_gamma = nn.Linear(dim_cond, dim)
-        self.to_bias = nn.Linear(dim_cond, dim) if bias else None
-
-        nn.init.zeros_(self.to_gamma.weight)
-        nn.init.ones_(self.to_gamma.bias)
-
-        if bias:
-            nn.init.zeros_(self.to_bias.weight)
-            nn.init.zeros_(self.to_bias.bias)
-
-    @beartype
-    def forward(self, x: Tensor, *, cond: Tensor):
-        batch = x.shape[0]
-        assert cond.shape == (batch, self.dim_cond)
-
-        gamma = self.to_gamma(cond)
-
-        bias = 0.0
-        if exists(self.to_bias):
-            bias = self.to_bias(cond)
-
-        if self.channel_first:
-            gamma = append_dims(gamma, x.ndim - 2)
-
-            if exists(self.to_bias):
-                bias = append_dims(bias, x.ndim - 2)
-
-        return (
-            F.normalize(x, dim=(1 if self.channel_first else -1)) * self.scale * gamma
-            + bias
-        )
-
-
-# attention
-
-
-class Attention(Module):
-    @beartype
-    def __init__(
-        self,
-        *,
-        dim,
-        dim_cond: Optional[int] = None,
-        causal=False,
-        dim_head=32,
-        heads=8,
-        flash=False,
-        dropout=0.0,
-        num_memory_kv=4,
-    ):
-        super().__init__()
-        dim_inner = dim_head * heads
-
-        self.need_cond = exists(dim_cond)
-
-        if self.need_cond:
-            self.norm = AdaptiveRMSNorm(dim, dim_cond=dim_cond)
-        else:
-            self.norm = RMSNorm(dim)
-
-        self.to_qkv = nn.Sequential(
-            nn.Linear(dim, dim_inner * 3, bias=False),
-            Rearrange("b n (qkv h d) -> qkv b h n d", qkv=3, h=heads),
-        )
-
-        assert num_memory_kv > 0
-        self.mem_kv = nn.Parameter(torch.randn(2, heads, num_memory_kv, dim_head))
-
-        self.attend = Attend(causal=causal, dropout=dropout, flash=flash)
-
-        self.to_out = nn.Sequential(
-            Rearrange("b h n d -> b n (h d)"), nn.Linear(dim_inner, dim, bias=False)
-        )
-
-    @beartype
-    def forward(self, x, mask: Optional[Tensor] = None, cond: Optional[Tensor] = None):
-        maybe_cond_kwargs = dict(cond=cond) if self.need_cond else dict()
-
-        x = self.norm(x, **maybe_cond_kwargs)
-
-        q, k, v = self.to_qkv(x)
-
-        mk, mv = map(lambda t: repeat(t, "h n d -> b h n d", b=q.shape[0]), self.mem_kv)
-        k = torch.cat((mk, k), dim=-2)
-        v = torch.cat((mv, v), dim=-2)
-
-        out = self.attend(q, k, v, mask=mask)
-        return self.to_out(out)
-
-
-class LinearAttention(Module):
-    """
-    using the specific linear attention proposed in https://arxiv.org/abs/2106.09681
-    """
-
-    @beartype
-    def __init__(
-        self, *, dim, dim_cond: Optional[int] = None, dim_head=8, heads=8, dropout=0.0
-    ):
-        super().__init__()
-        dim_inner = dim_head * heads
-
-        self.need_cond = exists(dim_cond)
-
-        if self.need_cond:
-            self.norm = AdaptiveRMSNorm(dim, dim_cond=dim_cond)
-        else:
-            self.norm = RMSNorm(dim)
-
-        self.attn = TaylorSeriesLinearAttn(dim=dim, dim_head=dim_head, heads=heads)
-
-    def forward(self, x, cond: Optional[Tensor] = None):
-        maybe_cond_kwargs = dict(cond=cond) if self.need_cond else dict()
-
-        x = self.norm(x, **maybe_cond_kwargs)
-
-        return self.attn(x)
-
-
-class LinearSpaceAttention(LinearAttention):
-    def forward(self, x, *args, **kwargs):
-        x = rearrange(x, "b c ... h w -> b ... h w c")
-        x, batch_ps = pack_one(x, "* h w c")
-        x, seq_ps = pack_one(x, "b * c")
-
-        x = super().forward(x, *args, **kwargs)
-
-        x = unpack_one(x, seq_ps, "b * c")
-        x = unpack_one(x, batch_ps, "* h w c")
-        return rearrange(x, "b ... h w c -> b c ... h w")
-
-
-class SpaceAttention(Attention):
-    def forward(self, x, *args, **kwargs):
-        x = rearrange(x, "b c t h w -> b t h w c")
-        x, batch_ps = pack_one(x, "* h w c")
-        x, seq_ps = pack_one(x, "b * c")
-
-        x = super().forward(x, *args, **kwargs)
-
-        x = unpack_one(x, seq_ps, "b * c")
-        x = unpack_one(x, batch_ps, "* h w c")
-        return rearrange(x, "b t h w c -> b c t h w")
-
-
-class TimeAttention(Attention):
-    def forward(self, x, *args, **kwargs):
-        x = rearrange(x, "b c t h w -> b h w t c")
-        x, batch_ps = pack_one(x, "* t c")
-
-        x = super().forward(x, *args, **kwargs)
-
-        x = unpack_one(x, batch_ps, "* t c")
-        return rearrange(x, "b h w t c -> b c t h w")
-
-
-class GEGLU(Module):
-    def forward(self, x):
-        x, gate = x.chunk(2, dim=1)
-        return F.gelu(gate) * x
-
-
-class FeedForward(Module):
-    @beartype
-    def __init__(self, dim, *, dim_cond: Optional[int] = None, mult=4, images=False):
-        super().__init__()
-        conv_klass = nn.Conv2d if images else nn.Conv3d
-
-        rmsnorm_klass = (
-            RMSNorm
-            if not exists(dim_cond)
-            else partial(AdaptiveRMSNorm, dim_cond=dim_cond)
-        )
-
-        maybe_adaptive_norm_klass = partial(
-            rmsnorm_klass, channel_first=True, images=images
-        )
-
-        dim_inner = int(dim * mult * 2 / 3)
-
-        self.norm = maybe_adaptive_norm_klass(dim)
-
-        self.net = Sequential(
-            conv_klass(dim, dim_inner * 2, 1), GEGLU(), conv_klass(dim_inner, dim, 1)
-        )
-
-    @beartype
-    def forward(self, x: Tensor, *, cond: Optional[Tensor] = None):
-        maybe_cond_kwargs = dict(cond=cond) if exists(cond) else dict()
-
-        x = self.norm(x, **maybe_cond_kwargs)
-        return self.net(x)
-
-
-# discriminator with anti-aliased downsampling (blurpool Zhang et al.)
-
-
-class Blur(Module):
-    def __init__(self):
-        super().__init__()
-        f = torch.Tensor([1, 2, 1])
-        self.register_buffer("f", f)
-
-    def forward(self, x, space_only=False, time_only=False):
-        assert not (space_only and time_only)
-
-        f = self.f
-
-        if space_only:
-            f = einsum("i, j -> i j", f, f)
-            f = rearrange(f, "... -> 1 1 ...")
-        elif time_only:
-            f = rearrange(f, "f -> 1 f 1 1")
-        else:
-            f = einsum("i, j, k -> i j k", f, f, f)
-            f = rearrange(f, "... -> 1 ...")
-
-        is_images = x.ndim == 4
-
-        if is_images:
-            x = rearrange(x, "b c h w -> b c 1 h w")
-
-        out = filter3d(x, f, normalized=True)
-
-        if is_images:
-            out = rearrange(out, "b c 1 h w -> b c h w")
-
-        return out
-
-
-class DiscriminatorBlock(Module):
-    def __init__(
-        self, input_channels, filters, downsample=True, antialiased_downsample=True
-    ):
-        super().__init__()
-        self.conv_res = nn.Conv2d(
-            input_channels, filters, 1, stride=(2 if downsample else 1)
-        )
-
-        self.net = nn.Sequential(
-            nn.Conv2d(input_channels, filters, 3, padding=1),
-            leaky_relu(),
-            nn.Conv2d(filters, filters, 3, padding=1),
-            leaky_relu(),
-        )
-
-        self.maybe_blur = Blur() if antialiased_downsample else None
-
-        self.downsample = (
-            nn.Sequential(
-                Rearrange("b c (h p1) (w p2) -> b (c p1 p2) h w", p1=2, p2=2),
-                nn.Conv2d(filters * 4, filters, 1),
-            )
-            if downsample
-            else None
-        )
-
-    def forward(self, x):
-        res = self.conv_res(x)
-
-        x = self.net(x)
-
-        if exists(self.downsample):
-            if exists(self.maybe_blur):
-                x = self.maybe_blur(x, space_only=True)
-
-            x = self.downsample(x)
-
-        x = (x + res) * (2**-0.5)
-        return x
-
-
-class Discriminator(Module):
-    @beartype
-    def __init__(
-        self,
-        *,
-        dim,
-        image_size,
-        channels=3,
-        max_dim=512,
-        attn_heads=8,
-        attn_dim_head=32,
-        linear_attn_dim_head=8,
-        linear_attn_heads=16,
-        ff_mult=4,
-        antialiased_downsample=False,
-    ):
-        super().__init__()
-        image_size = pair(image_size)
-        min_image_resolution = min(image_size)
-
-        num_layers = int(log2(min_image_resolution) - 2)
-
-        blocks = []
-
-        layer_dims = [channels] + [(dim * 4) * (2**i) for i in range(num_layers + 1)]
-        layer_dims = [min(layer_dim, max_dim) for layer_dim in layer_dims]
-        layer_dims_in_out = tuple(zip(layer_dims[:-1], layer_dims[1:]))
-
-        blocks = []
-        attn_blocks = []
-
-        image_resolution = min_image_resolution
-
-        for ind, (in_chan, out_chan) in enumerate(layer_dims_in_out):
-            num_layer = ind + 1
-            is_not_last = ind != (len(layer_dims_in_out) - 1)
-
-            block = DiscriminatorBlock(
-                in_chan,
-                out_chan,
-                downsample=is_not_last,
-                antialiased_downsample=antialiased_downsample,
-            )
-
-            attn_block = Sequential(
-                Residual(
-                    LinearSpaceAttention(
-                        dim=out_chan,
-                        heads=linear_attn_heads,
-                        dim_head=linear_attn_dim_head,
-                    )
-                ),
-                Residual(FeedForward(dim=out_chan, mult=ff_mult, images=True)),
-            )
-
-            blocks.append(ModuleList([block, attn_block]))
-
-            image_resolution //= 2
-
-        self.blocks = ModuleList(blocks)
-
-        dim_last = layer_dims[-1]
-
-        downsample_factor = 2**num_layers
-        last_fmap_size = tuple(map(lambda n: n // downsample_factor, image_size))
-
-        latent_dim = last_fmap_size[0] * last_fmap_size[1] * dim_last
-
-        self.to_logits = Sequential(
-            nn.Conv2d(dim_last, dim_last, 3, padding=1),
-            leaky_relu(),
-            Rearrange("b ... -> b (...)"),
-            nn.Linear(latent_dim, 1),
-            Rearrange("b 1 -> b"),
-        )
-
-    def forward(self, x):
-        for block, attn_block in self.blocks:
-            x = block(x)
-            x = attn_block(x)
-
-        return self.to_logits(x)
-
-
-# modulatable conv from Karras et al. Stylegan2
-# for conditioning on latents
-
-
-class Conv3DMod(Module):
-    @beartype
-    def __init__(
-        self,
-        dim,
-        *,
-        spatial_kernel,
-        time_kernel,
-        causal=True,
-        dim_out=None,
-        demod=True,
-        eps=1e-8,
-        pad_mode="zeros",
-    ):
-        super().__init__()
-        dim_out = default(dim_out, dim)
-
-        self.eps = eps
-
-        assert is_odd(spatial_kernel) and is_odd(time_kernel)
-
-        self.spatial_kernel = spatial_kernel
-        self.time_kernel = time_kernel
-
-        time_padding = (time_kernel - 1, 0) if causal else ((time_kernel // 2,) * 2)
-
-        self.pad_mode = pad_mode
-        self.padding = (*((spatial_kernel // 2,) * 4), *time_padding)
-        self.weights = nn.Parameter(
-            torch.randn((dim_out, dim, time_kernel, spatial_kernel, spatial_kernel))
-        )
-
-        self.demod = demod
-
-        nn.init.kaiming_normal_(self.weights, a=0, mode="fan_in", nonlinearity="selu")
-
-    @beartype
-    def forward(self, fmap, cond: Tensor):
-        """
-        notation
-
-        b - batch
-        n - convs
-        o - output
-        i - input
-        k - kernel
-        """
-
-        b = fmap.shape[0]
-
-        # prepare weights for modulation
-
-        weights = self.weights
-
-        # do the modulation, demodulation, as done in stylegan2
-
-        cond = rearrange(cond, "b i -> b 1 i 1 1 1")
-
-        weights = weights * (cond + 1)
-
-        if self.demod:
-            inv_norm = (
-                reduce(weights**2, "b o i k0 k1 k2 -> b o 1 1 1 1", "sum")
-                .clamp(min=self.eps)
-                .rsqrt()
-            )
-            weights = weights * inv_norm
-
-        fmap = rearrange(fmap, "b c t h w -> 1 (b c) t h w")
-
-        weights = rearrange(weights, "b o ... -> (b o) ...")
-
-        fmap = F.pad(fmap, self.padding, mode=self.pad_mode)
-        fmap = F.conv3d(fmap, weights, groups=b)
-
-        return rearrange(fmap, "1 (b o) ... -> b o ...", b=b)
-
-
-# strided conv downsamples
-
-
-class SpatialDownsample2x(Module):
-    def __init__(self, dim, dim_out=None, kernel_size=3, antialias=False):
-        super().__init__()
-        dim_out = default(dim_out, dim)
-        self.maybe_blur = Blur() if antialias else identity
-        self.conv = nn.Conv2d(
-            dim, dim_out, kernel_size, stride=2, padding=kernel_size // 2
-        )
-
-    def forward(self, x):
-        x = self.maybe_blur(x, space_only=True)
-
-        x = rearrange(x, "b c t h w -> b t c h w")
-        x, ps = pack_one(x, "* c h w")
-
-        out = self.conv(x)
-
-        out = unpack_one(out, ps, "* c h w")
-        out = rearrange(out, "b t c h w -> b c t h w")
-        return out
-
-
-class TimeDownsample2x(Module):
-    def __init__(self, dim, dim_out=None, kernel_size=3, antialias=False):
-        super().__init__()
-        dim_out = default(dim_out, dim)
-        self.maybe_blur = Blur() if antialias else identity
-        self.time_causal_padding = (kernel_size - 1, 0)
-        self.conv = nn.Conv1d(dim, dim_out, kernel_size, stride=2)
-
-    def forward(self, x):
-        x = self.maybe_blur(x, time_only=True)
-
-        x = rearrange(x, "b c t h w -> b h w c t")
-        x, ps = pack_one(x, "* c t")
-
-        x = F.pad(x, self.time_causal_padding)
-        out = self.conv(x)
-
-        out = unpack_one(out, ps, "* c t")
-        out = rearrange(out, "b h w c t -> b c t h w")
-        return out
-
-
-# depth to space upsamples
-
-
-class SpatialUpsample2x(Module):
-    def __init__(self, dim, dim_out=None):
-        super().__init__()
-        dim_out = default(dim_out, dim)
-        conv = nn.Conv2d(dim, dim_out * 4, 1)
-
-        self.net = nn.Sequential(
-            conv,
-            nn.SiLU(),
-            Rearrange("b (c p1 p2) h w -> b c (h p1) (w p2)", p1=2, p2=2),
-        )
-
-        self.init_conv_(conv)
-
-    def init_conv_(self, conv):
-        o, i, h, w = conv.weight.shape
-        conv_weight = torch.empty(o // 4, i, h, w)
-        nn.init.kaiming_uniform_(conv_weight)
-        conv_weight = repeat(conv_weight, "o ... -> (o 4) ...")
-
-        conv.weight.data.copy_(conv_weight)
-        nn.init.zeros_(conv.bias.data)
-
-    def forward(self, x):
-        x = rearrange(x, "b c t h w -> b t c h w")
-        x, ps = pack_one(x, "* c h w")
-
-        out = self.net(x)
-
-        out = unpack_one(out, ps, "* c h w")
-        out = rearrange(out, "b t c h w -> b c t h w")
-        return out
-
-
-class TimeUpsample2x(Module):
-    def __init__(self, dim, dim_out=None):
-        super().__init__()
-        dim_out = default(dim_out, dim)
-        conv = nn.Conv1d(dim, dim_out * 2, 1)
-
-        self.net = nn.Sequential(
-            conv, nn.SiLU(), Rearrange("b (c p) t -> b c (t p)", p=2)
-        )
-
-        self.init_conv_(conv)
-
-    def init_conv_(self, conv):
-        o, i, t = conv.weight.shape
-        conv_weight = torch.empty(o // 2, i, t)
-        nn.init.kaiming_uniform_(conv_weight)
-        conv_weight = repeat(conv_weight, "o ... -> (o 2) ...")
-
-        conv.weight.data.copy_(conv_weight)
-        nn.init.zeros_(conv.bias.data)
-
-    def forward(self, x):
-        x = rearrange(x, "b c t h w -> b h w c t")
-        x, ps = pack_one(x, "* c t")
-
-        out = self.net(x)
-
-        out = unpack_one(out, ps, "* c t")
-        out = rearrange(out, "b h w c t -> b c t h w")
-        return out
-
-
-# autoencoder - only best variant here offered, with causal conv 3d
-
-
-def SameConv2d(dim_in, dim_out, kernel_size):
-    kernel_size = cast_tuple(kernel_size, 2)
-    padding = [k // 2 for k in kernel_size]
-    return nn.Conv2d(dim_in, dim_out, kernel_size=kernel_size, padding=padding)
-
-
-class CausalConv3d(Module):
-    @beartype
-    def __init__(
-        self,
-        chan_in,
-        chan_out,
-        kernel_size: Union[int, Tuple[int, int, int]],
-        pad_mode="constant",
-        **kwargs,
-    ):
-        super().__init__()
-        kernel_size = cast_tuple(kernel_size, 3)
-
-        time_kernel_size, height_kernel_size, width_kernel_size = kernel_size
-
-        assert is_odd(height_kernel_size) and is_odd(width_kernel_size)
-
-        dilation = kwargs.pop("dilation", 1)
-        stride = kwargs.pop("stride", 1)
-
-        self.pad_mode = pad_mode
-        time_pad = dilation * (time_kernel_size - 1) + (1 - stride)
-        height_pad = height_kernel_size // 2
-        width_pad = width_kernel_size // 2
-
-        self.time_pad = time_pad
-        self.time_causal_padding = (
-            width_pad,
-            width_pad,
-            height_pad,
-            height_pad,
-            time_pad,
-            0,
-        )
-
-        stride = (stride, 1, 1)
-        dilation = (dilation, 1, 1)
-        self.conv = nn.Conv3d(
-            chan_in, chan_out, kernel_size, stride=stride, dilation=dilation, **kwargs
-        )
-
-    def forward(self, x):
-        pad_mode = self.pad_mode if self.time_pad < x.shape[2] else "constant"
-
-        x = F.pad(x, self.time_causal_padding, mode=pad_mode)
-        return self.conv(x)
-
-
-@beartype
-def ResidualUnit(
-    dim, kernel_size: Union[int, Tuple[int, int, int]], pad_mode: str = "constant"
-):
-    net = Sequential(
-        CausalConv3d(dim, dim, kernel_size, pad_mode=pad_mode),
-        nn.ELU(),
-        nn.Conv3d(dim, dim, 1),
-        nn.ELU(),
-        SqueezeExcite(dim),
-    )
-
-    return Residual(net)
-
-
-@beartype
-class ResidualUnitMod(Module):
-    def __init__(
-        self,
-        dim,
-        kernel_size: Union[int, Tuple[int, int, int]],
-        *,
-        dim_cond,
-        pad_mode: str = "constant",
-        demod=True,
-    ):
-        super().__init__()
-        kernel_size = cast_tuple(kernel_size, 3)
-        time_kernel_size, height_kernel_size, width_kernel_size = kernel_size
-        assert height_kernel_size == width_kernel_size
-
-        self.to_cond = nn.Linear(dim_cond, dim)
-
-        self.conv = Conv3DMod(
-            dim=dim,
-            spatial_kernel=height_kernel_size,
-            time_kernel=time_kernel_size,
-            causal=True,
-            demod=demod,
-            pad_mode=pad_mode,
-        )
-
-        self.conv_out = nn.Conv3d(dim, dim, 1)
-
-    @beartype
-    def forward(
-        self,
-        x,
-        cond: Tensor,
-    ):
-        res = x
-        cond = self.to_cond(cond)
-
-        x = self.conv(x, cond=cond)
-        x = F.elu(x)
-        x = self.conv_out(x)
-        x = F.elu(x)
-        return x + res
-
-
-class CausalConvTranspose3d(Module):
-    def __init__(
-        self,
-        chan_in,
-        chan_out,
-        kernel_size: Union[int, Tuple[int, int, int]],
-        *,
-        time_stride,
-        **kwargs,
-    ):
-        super().__init__()
-        kernel_size = cast_tuple(kernel_size, 3)
-
-        time_kernel_size, height_kernel_size, width_kernel_size = kernel_size
-
-        assert is_odd(height_kernel_size) and is_odd(width_kernel_size)
-
-        self.upsample_factor = time_stride
-
-        height_pad = height_kernel_size // 2
-        width_pad = width_kernel_size // 2
-
-        stride = (time_stride, 1, 1)
-        padding = (0, height_pad, width_pad)
-
-        self.conv = nn.ConvTranspose3d(
-            chan_in, chan_out, kernel_size, stride, padding=padding, **kwargs
-        )
-
-    def forward(self, x):
-        assert x.ndim == 5
-        t = x.shape[2]
-
-        out = self.conv(x)
-
-        out = out[..., : (t * self.upsample_factor), :, :]
-        return out
-
-
-# video tokenizer class
-
-LossBreakdown = namedtuple(
-    "LossBreakdown",
-    [
-        "recon_loss",
-        "lfq_aux_loss",
-        "quantizer_loss_breakdown",
-        "perceptual_loss",
-        "adversarial_gen_loss",
-        "adaptive_adversarial_weight",
-        "multiscale_gen_losses",
-        "multiscale_gen_adaptive_weights",
-    ],
-)
-
-DiscrLossBreakdown = namedtuple(
-    "DiscrLossBreakdown", ["discr_loss", "multiscale_discr_losses", "gradient_penalty"]
-)
-
-
-class VideoTokenizer(Module):
-    @beartype
-    def __init__(
-        self,
-        *,
-        image_size,
-        layers: Tuple[Union[str, Tuple[str, int]], ...] = (
-            "residual",
-            "residual",
-            "residual",
-        ),
-        residual_conv_kernel_size=3,
-        num_codebooks=1,
-        codebook_size: Optional[int] = None,
-        channels=3,
-        init_dim=64,
-        max_dim=float("inf"),
-        dim_cond=None,
-        dim_cond_expansion_factor=4.0,
-        input_conv_kernel_size: Tuple[int, int, int] = (7, 7, 7),
-        output_conv_kernel_size: Tuple[int, int, int] = (3, 3, 3),
-        pad_mode: str = "constant",
-        lfq_entropy_loss_weight=0.1,
-        lfq_commitment_loss_weight=1.0,
-        lfq_diversity_gamma=2.5,
-        quantizer_aux_loss_weight=1.0,
-        lfq_activation=nn.Identity(),
-        use_fsq=False,
-        fsq_levels: Optional[List[int]] = None,
-        attn_dim_head=32,
-        attn_heads=8,
-        attn_dropout=0.0,
-        linear_attn_dim_head=8,
-        linear_attn_heads=16,
-        vgg: Optional[Module] = None,
-        vgg_weights: VGG16_Weights = VGG16_Weights.DEFAULT,
-        perceptual_loss_weight=1e-1,
-        discr_kwargs: Optional[dict] = None,
-        multiscale_discrs: Tuple[Module, ...] = tuple(),
-        use_gan=True,
-        adversarial_loss_weight=1.0,
-        grad_penalty_loss_weight=10.0,
-        multiscale_adversarial_loss_weight=1.0,
-        flash_attn=True,
-        separate_first_frame_encoding=False,
-    ):
-        super().__init__()
-
-        # for autosaving the config
-
-        _locals = locals()
-        _locals.pop("self", None)
-        _locals.pop("__class__", None)
-        self._configs = pickle.dumps(_locals)
-
-        # image size
-
-        self.channels = channels
-        self.image_size = image_size
-
-        # initial encoder
-
-        self.conv_in = CausalConv3d(
-            channels, init_dim, input_conv_kernel_size, pad_mode=pad_mode
-        )
-
-        # whether to encode the first frame separately or not
-
-        self.conv_in_first_frame = nn.Identity()
-        self.conv_out_first_frame = nn.Identity()
-
-        if separate_first_frame_encoding:
-            self.conv_in_first_frame = SameConv2d(
-                channels, init_dim, input_conv_kernel_size[-2:]
-            )
-            self.conv_out_first_frame = SameConv2d(
-                init_dim, channels, output_conv_kernel_size[-2:]
-            )
-
-        self.separate_first_frame_encoding = separate_first_frame_encoding
-
-        # encoder and decoder layers
-
-        self.encoder_layers = ModuleList([])
-        self.decoder_layers = ModuleList([])
-
-        self.conv_out = CausalConv3d(
-            init_dim, channels, output_conv_kernel_size, pad_mode=pad_mode
-        )
-
-        dim = init_dim
-        dim_out = dim
-
-        layer_fmap_size = image_size
-        time_downsample_factor = 1
-        has_cond_across_layers = []
-
-        for layer_def in layers:
-            layer_type, *layer_params = cast_tuple(layer_def)
-
-            has_cond = False
-
-            if layer_type == "residual":
-                encoder_layer = ResidualUnit(dim, residual_conv_kernel_size)
-                decoder_layer = ResidualUnit(dim, residual_conv_kernel_size)
-
-            elif layer_type == "consecutive_residual":
-                (num_consecutive,) = layer_params
-                encoder_layer = Sequential(
-                    *[
-                        ResidualUnit(dim, residual_conv_kernel_size)
-                        for _ in range(num_consecutive)
-                    ]
-                )
-                decoder_layer = Sequential(
-                    *[
-                        ResidualUnit(dim, residual_conv_kernel_size)
-                        for _ in range(num_consecutive)
-                    ]
-                )
-
-            elif layer_type == "cond_residual":
-                assert exists(
-                    dim_cond
-                ), "dim_cond must be passed into VideoTokenizer, if tokenizer is to be conditioned"
-
-                has_cond = True
-
-                encoder_layer = ResidualUnitMod(
-                    dim,
-                    residual_conv_kernel_size,
-                    dim_cond=int(dim_cond * dim_cond_expansion_factor),
-                )
-                decoder_layer = ResidualUnitMod(
-                    dim,
-                    residual_conv_kernel_size,
-                    dim_cond=int(dim_cond * dim_cond_expansion_factor),
-                )
-                dim_out = dim
-
-            elif layer_type == "compress_space":
-                dim_out = safe_get_index(layer_params, 0)
-                dim_out = default(dim_out, dim * 2)
-                dim_out = min(dim_out, max_dim)
-
-                encoder_layer = SpatialDownsample2x(dim, dim_out)
-                decoder_layer = SpatialUpsample2x(dim_out, dim)
-
-                assert layer_fmap_size > 1
-                layer_fmap_size //= 2
-
-            elif layer_type == "compress_time":
-                dim_out = safe_get_index(layer_params, 0)
-                dim_out = default(dim_out, dim * 2)
-                dim_out = min(dim_out, max_dim)
-
-                encoder_layer = TimeDownsample2x(dim, dim_out)
-                decoder_layer = TimeUpsample2x(dim_out, dim)
-
-                time_downsample_factor *= 2
-
-            elif layer_type == "attend_space":
-                attn_kwargs = dict(
-                    dim=dim,
-                    dim_head=attn_dim_head,
-                    heads=attn_heads,
-                    dropout=attn_dropout,
-                    flash=flash_attn,
-                )
-
-                encoder_layer = Sequential(
-                    Residual(SpaceAttention(**attn_kwargs)), Residual(FeedForward(dim))
-                )
-
-                decoder_layer = Sequential(
-                    Residual(SpaceAttention(**attn_kwargs)), Residual(FeedForward(dim))
-                )
-
-            elif layer_type == "linear_attend_space":
-                linear_attn_kwargs = dict(
-                    dim=dim, dim_head=linear_attn_dim_head, heads=linear_attn_heads
-                )
-
-                encoder_layer = Sequential(
-                    Residual(LinearSpaceAttention(**linear_attn_kwargs)),
-                    Residual(FeedForward(dim)),
-                )
-
-                decoder_layer = Sequential(
-                    Residual(LinearSpaceAttention(**linear_attn_kwargs)),
-                    Residual(FeedForward(dim)),
-                )
-
-            elif layer_type == "gateloop_time":
-                gateloop_kwargs = dict(use_heinsen=False)
-
-                encoder_layer = ToTimeSequence(Residual(SimpleGateLoopLayer(dim=dim)))
-                decoder_layer = ToTimeSequence(Residual(SimpleGateLoopLayer(dim=dim)))
-
-            elif layer_type == "attend_time":
-                attn_kwargs = dict(
-                    dim=dim,
-                    dim_head=attn_dim_head,
-                    heads=attn_heads,
-                    dropout=attn_dropout,
-                    causal=True,
-                    flash=flash_attn,
-                )
-
-                encoder_layer = Sequential(
-                    Residual(TokenShift(TimeAttention(**attn_kwargs))),
-                    Residual(TokenShift(FeedForward(dim, dim_cond=dim_cond))),
-                )
-
-                decoder_layer = Sequential(
-                    Residual(TokenShift(TimeAttention(**attn_kwargs))),
-                    Residual(TokenShift(FeedForward(dim, dim_cond=dim_cond))),
-                )
-
-            elif layer_type == "cond_attend_space":
-                has_cond = True
-
-                attn_kwargs = dict(
-                    dim=dim,
-                    dim_cond=dim_cond,
-                    dim_head=attn_dim_head,
-                    heads=attn_heads,
-                    dropout=attn_dropout,
-                    flash=flash_attn,
-                )
-
-                encoder_layer = Sequential(
-                    Residual(SpaceAttention(**attn_kwargs)), Residual(FeedForward(dim))
-                )
-
-                decoder_layer = Sequential(
-                    Residual(SpaceAttention(**attn_kwargs)), Residual(FeedForward(dim))
-                )
-
-            elif layer_type == "cond_linear_attend_space":
-                has_cond = True
-
-                attn_kwargs = dict(
-                    dim=dim,
-                    dim_cond=dim_cond,
-                    dim_head=attn_dim_head,
-                    heads=attn_heads,
-                    dropout=attn_dropout,
-                    flash=flash_attn,
-                )
-
-                encoder_layer = Sequential(
-                    Residual(LinearSpaceAttention(**attn_kwargs)),
-                    Residual(FeedForward(dim, dim_cond=dim_cond)),
-                )
-
-                decoder_layer = Sequential(
-                    Residual(LinearSpaceAttention(**attn_kwargs)),
-                    Residual(FeedForward(dim, dim_cond=dim_cond)),
-                )
-
-            elif layer_type == "cond_attend_time":
-                has_cond = True
-
-                attn_kwargs = dict(
-                    dim=dim,
-                    dim_cond=dim_cond,
-                    dim_head=attn_dim_head,
-                    heads=attn_heads,
-                    dropout=attn_dropout,
-                    causal=True,
-                    flash=flash_attn,
-                )
-
-                encoder_layer = Sequential(
-                    Residual(TokenShift(TimeAttention(**attn_kwargs))),
-                    Residual(TokenShift(FeedForward(dim, dim_cond=dim_cond))),
-                )
-
-                decoder_layer = Sequential(
-                    Residual(TokenShift(TimeAttention(**attn_kwargs))),
-                    Residual(TokenShift(FeedForward(dim, dim_cond=dim_cond))),
-                )
-
-            else:
-                raise ValueError(f"unknown layer type {layer_type}")
-
-            self.encoder_layers.append(encoder_layer)
-            self.decoder_layers.insert(0, decoder_layer)
-
-            dim = dim_out
-            has_cond_across_layers.append(has_cond)
-
-        # add a final norm just before quantization layer
-
-        self.encoder_layers.append(
-            Sequential(
-                Rearrange("b c ... -> b ... c"),
-                nn.LayerNorm(dim),
-                Rearrange("b ... c -> b c ..."),
-            )
-        )
-
-        self.time_downsample_factor = time_downsample_factor
-        self.time_padding = time_downsample_factor - 1
-
-        self.fmap_size = layer_fmap_size
-
-        # use a MLP stem for conditioning, if needed
-
-        self.has_cond_across_layers = has_cond_across_layers
-        self.has_cond = any(has_cond_across_layers)
-
-        self.encoder_cond_in = nn.Identity()
-        self.decoder_cond_in = nn.Identity()
-
-        if has_cond:
-            self.dim_cond = dim_cond
-
-            self.encoder_cond_in = Sequential(
-                nn.Linear(dim_cond, int(dim_cond * dim_cond_expansion_factor)),
-                nn.SiLU(),
-            )
-
-            self.decoder_cond_in = Sequential(
-                nn.Linear(dim_cond, int(dim_cond * dim_cond_expansion_factor)),
-                nn.SiLU(),
-            )
-
-        # quantizer related
-
-        self.use_fsq = use_fsq
-
-        if not use_fsq:
-            assert exists(codebook_size) and not exists(
-                fsq_levels
-            ), "if use_fsq is set to False, `codebook_size` must be set (and not `fsq_levels`)"
-
-            # lookup free quantizer(s) - multiple codebooks is possible
-            # each codebook will get its own entropy regularization
-
-            self.quantizers = LFQ(
-                dim=dim,
-                codebook_size=codebook_size,
-                num_codebooks=num_codebooks,
-                entropy_loss_weight=lfq_entropy_loss_weight,
-                commitment_loss_weight=lfq_commitment_loss_weight,
-                diversity_gamma=lfq_diversity_gamma,
-            )
-
-        else:
-            assert not exists(codebook_size) and exists(
-                fsq_levels
-            ), "if use_fsq is set to True, `fsq_levels` must be set (and not `codebook_size`). the effective codebook size is the cumulative product of all the FSQ levels"
-
-            self.quantizers = FSQ(fsq_levels, dim=dim, num_codebooks=num_codebooks)
-
-        self.quantizer_aux_loss_weight = quantizer_aux_loss_weight
-
-        # dummy loss
-
-        self.register_buffer("zero", torch.tensor(0.0), persistent=False)
-
-        # perceptual loss related
-
-        use_vgg = channels in {1, 3, 4} and perceptual_loss_weight > 0.0
-
-        self.vgg = None
-        self.perceptual_loss_weight = perceptual_loss_weight
-
-        if use_vgg:
-            if not exists(vgg):
-                vgg = torchvision.models.vgg16(weights=vgg_weights)
-
-                vgg.classifier = Sequential(*vgg.classifier[:-2])
-
-            self.vgg = vgg
-
-        self.use_vgg = use_vgg
-
-        # main flag for whether to use GAN at all
-
-        self.use_gan = use_gan
-
-        # discriminator
-
-        discr_kwargs = default(
-            discr_kwargs,
-            dict(dim=dim, image_size=image_size, channels=channels, max_dim=512),
-        )
-
-        self.discr = Discriminator(**discr_kwargs)
-
-        self.adversarial_loss_weight = adversarial_loss_weight
-        self.grad_penalty_loss_weight = grad_penalty_loss_weight
-
-        self.has_gan = use_gan and adversarial_loss_weight > 0.0
-
-        # multi-scale discriminators
-
-        self.has_multiscale_gan = use_gan and multiscale_adversarial_loss_weight > 0.0
-
-        self.multiscale_discrs = ModuleList([*multiscale_discrs])
-
-        self.multiscale_adversarial_loss_weight = multiscale_adversarial_loss_weight
-
-        self.has_multiscale_discrs = (
-            use_gan
-            and multiscale_adversarial_loss_weight > 0.0
-            and len(multiscale_discrs) > 0
-        )
-
-    @property
-    def device(self):
-        return self.zero.device
-
-    @classmethod
-    def init_and_load_from(cls, path, strict=True):
-        path = Path(path)
-        assert path.exists()
-        pkg = torch.load(str(path), map_location="cpu")
-
-        assert "config" in pkg, "model configs were not found in this saved checkpoint"
-
-        config = pickle.loads(pkg["config"])
-        tokenizer = cls(**config)
-        tokenizer.load(path, strict=strict)
-        return tokenizer
-
-    def parameters(self):
-        return [
-            *self.conv_in.parameters(),
-            *self.conv_in_first_frame.parameters(),
-            *self.conv_out_first_frame.parameters(),
-            *self.conv_out.parameters(),
-            *self.encoder_layers.parameters(),
-            *self.decoder_layers.parameters(),
-            *self.encoder_cond_in.parameters(),
-            *self.decoder_cond_in.parameters(),
-            *self.quantizers.parameters(),
-        ]
-
-    def discr_parameters(self):
-        return self.discr.parameters()
-
-    def copy_for_eval(self):
-        device = self.device
-        vae_copy = copy.deepcopy(self.cpu())
-
-        maybe_del_attr_(vae_copy, "discr")
-        maybe_del_attr_(vae_copy, "vgg")
-        maybe_del_attr_(vae_copy, "multiscale_discrs")
-
-        vae_copy.eval()
-        return vae_copy.to(device)
-
-    @remove_vgg
-    def state_dict(self, *args, **kwargs):
-        return super().state_dict(*args, **kwargs)
-
-    @remove_vgg
-    def load_state_dict(self, *args, **kwargs):
-        return super().load_state_dict(*args, **kwargs)
-
-    def save(self, path, overwrite=True):
-        path = Path(path)
-        assert overwrite or not path.exists(), f"{str(path)} already exists"
-
-        pkg = dict(
-            model_state_dict=self.state_dict(),
-            version=__version__,
-            config=self._configs,
-        )
-
-        torch.save(pkg, str(path))
-
-    def load(self, path, strict=True):
-        path = Path(path)
-        assert path.exists()
-
-        pkg = torch.load(str(path))
-        state_dict = pkg.get("model_state_dict")
-        version = pkg.get("version")
-
-        assert exists(state_dict)
-
-        if exists(version):
-            print(f"loading checkpointed tokenizer from version {version}")
-
-        self.load_state_dict(state_dict, strict=strict)
-
-    @beartype
-    def encode(
-        self,
-        video: Tensor,
-        quantize=False,
-        cond: Optional[Tensor] = None,
-        video_contains_first_frame=True,
-    ):
-        encode_first_frame_separately = (
-            self.separate_first_frame_encoding and video_contains_first_frame
-        )
-
-        # whether to pad video or not
-
-        if video_contains_first_frame:
-            video_len = video.shape[2]
-
-            video = pad_at_dim(video, (self.time_padding, 0), value=0.0, dim=2)
-            video_packed_shape = [
-                torch.Size([self.time_padding]),
-                torch.Size([]),
-                torch.Size([video_len - 1]),
-            ]
-
-        # conditioning, if needed
-
-        assert (not self.has_cond) or exists(
-            cond
-        ), "`cond` must be passed into tokenizer forward method since conditionable layers were specified"
-
-        if exists(cond):
-            assert cond.shape == (video.shape[0], self.dim_cond)
-
-            cond = self.encoder_cond_in(cond)
-            cond_kwargs = dict(cond=cond)
-
-        # initial conv
-        # taking into account whether to encode first frame separately
-
-        if encode_first_frame_separately:
-            pad, first_frame, video = unpack(video, video_packed_shape, "b c * h w")
-            first_frame = self.conv_in_first_frame(first_frame)
-
-        video = self.conv_in(video)
-
-        if encode_first_frame_separately:
-            video, _ = pack([first_frame, video], "b c * h w")
-            video = pad_at_dim(video, (self.time_padding, 0), dim=2)
-
-        # encoder layers
-
-        for fn, has_cond in zip(self.encoder_layers, self.has_cond_across_layers):
-            layer_kwargs = dict()
-
-            if has_cond:
-                layer_kwargs = cond_kwargs
-
-            video = fn(video, **layer_kwargs)
-
-        maybe_quantize = identity if not quantize else self.quantizers
-
-        return maybe_quantize(video)
-
-    @beartype
-    def decode_from_code_indices(
-        self,
-        codes: Tensor,
-        cond: Optional[Tensor] = None,
-        video_contains_first_frame=True,
-    ):
-        assert codes.dtype in (torch.long, torch.int32)
-
-        if codes.ndim == 2:
-            video_code_len = codes.shape[-1]
-            assert divisible_by(
-                video_code_len, self.fmap_size**2
-            ), f"flattened video ids must have a length ({video_code_len}) that is divisible by the fmap size ({self.fmap_size}) squared ({self.fmap_size ** 2})"
-
-            codes = rearrange(
-                codes, "b (f h w) -> b f h w", h=self.fmap_size, w=self.fmap_size
-            )
-
-        quantized = self.quantizers.indices_to_codes(codes)
-
-        return self.decode(
-            quantized, cond=cond, video_contains_first_frame=video_contains_first_frame
-        )
-
-    @beartype
-    def decode(
-        self,
-        quantized: Tensor,
-        cond: Optional[Tensor] = None,
-        video_contains_first_frame=True,
-    ):
-        decode_first_frame_separately = (
-            self.separate_first_frame_encoding and video_contains_first_frame
-        )
-
-        batch = quantized.shape[0]
-
-        # conditioning, if needed
-
-        assert (not self.has_cond) or exists(
-            cond
-        ), "`cond` must be passed into tokenizer forward method since conditionable layers were specified"
-
-        if exists(cond):
-            assert cond.shape == (batch, self.dim_cond)
-
-            cond = self.decoder_cond_in(cond)
-            cond_kwargs = dict(cond=cond)
-
-        # decoder layers
-
-        x = quantized
-
-        for fn, has_cond in zip(
-            self.decoder_layers, reversed(self.has_cond_across_layers)
-        ):
-            layer_kwargs = dict()
-
-            if has_cond:
-                layer_kwargs = cond_kwargs
-
-            x = fn(x, **layer_kwargs)
-
-        # to pixels
-
-        if decode_first_frame_separately:
-            left_pad, xff, x = (
-                x[:, :, : self.time_padding],
-                x[:, :, self.time_padding],
-                x[:, :, (self.time_padding + 1) :],
-            )
-
-            out = self.conv_out(x)
-            outff = self.conv_out_first_frame(xff)
-
-            video, _ = pack([outff, out], "b c * h w")
-
-        else:
-            video = self.conv_out(x)
-
-            # if video were padded, remove padding
-
-            if video_contains_first_frame:
-                video = video[:, :, self.time_padding :]
-
-        return video
-
-    @torch.no_grad()
-    def tokenize(self, video):
-        self.eval()
-        return self.forward(video, return_codes=True)
-
-    @beartype
-    def forward(
-        self,
-        video_or_images: Tensor,
-        cond: Optional[Tensor] = None,
-        return_loss=False,
-        return_codes=False,
-        return_recon=False,
-        return_discr_loss=False,
-        return_recon_loss_only=False,
-        apply_gradient_penalty=True,
-        video_contains_first_frame=True,
-        adversarial_loss_weight=None,
-        multiscale_adversarial_loss_weight=None,
-    ):
-        adversarial_loss_weight = default(
-            adversarial_loss_weight, self.adversarial_loss_weight
-        )
-        multiscale_adversarial_loss_weight = default(
-            multiscale_adversarial_loss_weight, self.multiscale_adversarial_loss_weight
-        )
-
-        assert (return_loss + return_codes + return_discr_loss) <= 1
-        assert video_or_images.ndim in {4, 5}
-
-        assert video_or_images.shape[-2:] == (self.image_size, self.image_size)
-
-        # accept images for image pretraining (curriculum learning from images to video)
-
-        is_image = video_or_images.ndim == 4
-
-        if is_image:
-            video = rearrange(video_or_images, "b c ... -> b c 1 ...")
-            video_contains_first_frame = True
-        else:
-            video = video_or_images
-
-        batch, channels, frames = video.shape[:3]
-
-        assert divisible_by(
-            frames - int(video_contains_first_frame), self.time_downsample_factor
-        ), f"number of frames {frames} minus the first frame ({frames - int(video_contains_first_frame)}) must be divisible by the total downsample factor across time {self.time_downsample_factor}"
-
-        # encoder
-
-        x = self.encode(
-            video, cond=cond, video_contains_first_frame=video_contains_first_frame
-        )
-
-        # lookup free quantization
-
-        if self.use_fsq:
-            quantized, codes = self.quantizers(x)
-
-            aux_losses = self.zero
-            quantizer_loss_breakdown = None
-        else:
-            (quantized, codes, aux_losses), quantizer_loss_breakdown = self.quantizers(
-                x, return_loss_breakdown=True
-            )
-
-        if return_codes and not return_recon:
-            return codes
-
-        # decoder
-
-        recon_video = self.decode(
-            quantized, cond=cond, video_contains_first_frame=video_contains_first_frame
-        )
-
-        if return_codes:
-            return codes, recon_video
-
-        # reconstruction loss
-
-        if not (return_loss or return_discr_loss or return_recon_loss_only):
-            return recon_video
-
-        recon_loss = F.mse_loss(video, recon_video)
-
-        # for validation, only return recon loss
-
-        if return_recon_loss_only:
-            return recon_loss, recon_video
-
-        # gan discriminator loss
-
-        if return_discr_loss:
-            assert self.has_gan
-            assert exists(self.discr)
-
-            # pick a random frame for image discriminator
-
-            frame_indices = torch.randn((batch, frames)).topk(1, dim=-1).indices
-
-            real = pick_video_frame(video, frame_indices)
-
-            if apply_gradient_penalty:
-                real = real.requires_grad_()
-
-            fake = pick_video_frame(recon_video, frame_indices)
-
-            real_logits = self.discr(real)
-            fake_logits = self.discr(fake.detach())
-
-            discr_loss = hinge_discr_loss(fake_logits, real_logits)
-
-            # multiscale discriminators
-
-            multiscale_discr_losses = []
-
-            if self.has_multiscale_discrs:
-                for discr in self.multiscale_discrs:
-                    multiscale_real_logits = discr(video)
-                    multiscale_fake_logits = discr(recon_video.detach())
-
-                    multiscale_discr_loss = hinge_discr_loss(
-                        multiscale_fake_logits, multiscale_real_logits
-                    )
-
-                    multiscale_discr_losses.append(multiscale_discr_loss)
-            else:
-                multiscale_discr_losses.append(self.zero)
-
-            # gradient penalty
-
-            if apply_gradient_penalty:
-                gradient_penalty_loss = gradient_penalty(real, real_logits)
-            else:
-                gradient_penalty_loss = self.zero
-
-            # total loss
-
-            total_loss = (
-                discr_loss
-                + gradient_penalty_loss * self.grad_penalty_loss_weight
-                + sum(multiscale_discr_losses) * self.multiscale_adversarial_loss_weight
-            )
-
-            discr_loss_breakdown = DiscrLossBreakdown(
-                discr_loss, multiscale_discr_losses, gradient_penalty_loss
-            )
-
-            return total_loss, discr_loss_breakdown
-
-        # perceptual loss
-
-        if self.use_vgg:
-            frame_indices = torch.randn((batch, frames)).topk(1, dim=-1).indices
-
-            input_vgg_input = pick_video_frame(video, frame_indices)
-            recon_vgg_input = pick_video_frame(recon_video, frame_indices)
-
-            if channels == 1:
-                input_vgg_input = repeat(input_vgg_input, "b 1 h w -> b c h w", c=3)
-                recon_vgg_input = repeat(recon_vgg_input, "b 1 h w -> b c h w", c=3)
-
-            elif channels == 4:
-                input_vgg_input = input_vgg_input[:, :3]
-                recon_vgg_input = recon_vgg_input[:, :3]
-
-            input_vgg_feats = self.vgg(input_vgg_input)
-            recon_vgg_feats = self.vgg(recon_vgg_input)
-
-            perceptual_loss = F.mse_loss(input_vgg_feats, recon_vgg_feats)
-        else:
-            perceptual_loss = self.zero
-
-        # get gradient with respect to perceptual loss for last decoder layer
-        # needed for adaptive weighting
-
-        last_dec_layer = self.conv_out.conv.weight
-
-        norm_grad_wrt_perceptual_loss = None
-
-        if (
-            self.training
-            and self.use_vgg
-            and (self.has_gan or self.has_multiscale_discrs)
-        ):
-            norm_grad_wrt_perceptual_loss = grad_layer_wrt_loss(
-                perceptual_loss, last_dec_layer
-            ).norm(p=2)
-
-        # per-frame image discriminator
-
-        recon_video_frames = None
-
-        if self.has_gan:
-            frame_indices = torch.randn((batch, frames)).topk(1, dim=-1).indices
-            recon_video_frames = pick_video_frame(recon_video, frame_indices)
-
-            fake_logits = self.discr(recon_video_frames)
-            gen_loss = hinge_gen_loss(fake_logits)
-
-            adaptive_weight = 1.0
-
-            if exists(norm_grad_wrt_perceptual_loss):
-                norm_grad_wrt_gen_loss = grad_layer_wrt_loss(
-                    gen_loss, last_dec_layer
-                ).norm(p=2)
-                adaptive_weight = (
-                    norm_grad_wrt_perceptual_loss
-                    / norm_grad_wrt_gen_loss.clamp(min=1e-3)
-                )
-                adaptive_weight.clamp_(max=1e3)
-
-                if torch.isnan(adaptive_weight).any():
-                    adaptive_weight = 1.0
-        else:
-            gen_loss = self.zero
-            adaptive_weight = 0.0
-
-        # multiscale discriminator losses
-
-        multiscale_gen_losses = []
-        multiscale_gen_adaptive_weights = []
-
-        if self.has_multiscale_gan and self.has_multiscale_discrs:
-            if not exists(recon_video_frames):
-                recon_video_frames = pick_video_frame(recon_video, frame_indices)
-
-            for discr in self.multiscale_discrs:
-                fake_logits = recon_video_frames
-                multiscale_gen_loss = hinge_gen_loss(fake_logits)
-
-                multiscale_gen_losses.append(multiscale_gen_loss)
-
-                multiscale_adaptive_weight = 1.0
-
-                if exists(norm_grad_wrt_perceptual_loss):
-                    norm_grad_wrt_gen_loss = grad_layer_wrt_loss(
-                        multiscale_gen_loss, last_dec_layer
-                    ).norm(p=2)
-                    multiscale_adaptive_weight = (
-                        norm_grad_wrt_perceptual_loss
-                        / norm_grad_wrt_gen_loss.clamp(min=1e-5)
-                    )
-                    multiscale_adaptive_weight.clamp_(max=1e3)
-
-                multiscale_gen_adaptive_weights.append(multiscale_adaptive_weight)
-
-        # calculate total loss
-
-        total_loss = (
-            recon_loss
-            + aux_losses * self.quantizer_aux_loss_weight
-            + perceptual_loss * self.perceptual_loss_weight
-            + gen_loss * adaptive_weight * adversarial_loss_weight
-        )
-
-        if self.has_multiscale_discrs:
-            weighted_multiscale_gen_losses = sum(
-                loss * weight
-                for loss, weight in zip(
-                    multiscale_gen_losses, multiscale_gen_adaptive_weights
-                )
-            )
-
-            total_loss = (
-                total_loss
-                + weighted_multiscale_gen_losses * multiscale_adversarial_loss_weight
-            )
-
-        # loss breakdown
-
-        loss_breakdown = LossBreakdown(
-            recon_loss,
-            aux_losses,
-            quantizer_loss_breakdown,
-            perceptual_loss,
-            gen_loss,
-            adaptive_weight,
-            multiscale_gen_losses,
-            multiscale_gen_adaptive_weights,
-        )
-
-        return total_loss, loss_breakdown
-
-
-# main class
-
-
-class MagViT2(Module):
-    def __init__(self):
-        super().__init__()
-
-    def forward(self, x):
-        return x
diff --git a/videotuna/models/cogvideo_sat/sgm/modules/autoencoding/regularizers/__init__.py b/videotuna/models/cogvideo_sat/sgm/modules/autoencoding/regularizers/__init__.py
deleted file mode 100644
index 6065fb20..00000000
--- a/videotuna/models/cogvideo_sat/sgm/modules/autoencoding/regularizers/__init__.py
+++ /dev/null
@@ -1,30 +0,0 @@
-from abc import abstractmethod
-from typing import Any, Tuple
-
-import torch
-import torch.nn as nn
-import torch.nn.functional as F
-
-from ....modules.distributions.distributions import DiagonalGaussianDistribution
-from .base import AbstractRegularizer
-
-
-class DiagonalGaussianRegularizer(AbstractRegularizer):
-    def __init__(self, sample: bool = True):
-        super().__init__()
-        self.sample = sample
-
-    def get_trainable_parameters(self) -> Any:
-        yield from ()
-
-    def forward(self, z: torch.Tensor) -> Tuple[torch.Tensor, dict]:
-        log = dict()
-        posterior = DiagonalGaussianDistribution(z)
-        if self.sample:
-            z = posterior.sample()
-        else:
-            z = posterior.mode()
-        kl_loss = posterior.kl()
-        kl_loss = torch.sum(kl_loss) / kl_loss.shape[0]
-        log["kl_loss"] = kl_loss
-        return z, log
diff --git a/videotuna/models/cogvideo_sat/sgm/modules/autoencoding/regularizers/base.py b/videotuna/models/cogvideo_sat/sgm/modules/autoencoding/regularizers/base.py
deleted file mode 100644
index 7f405a10..00000000
--- a/videotuna/models/cogvideo_sat/sgm/modules/autoencoding/regularizers/base.py
+++ /dev/null
@@ -1,40 +0,0 @@
-from abc import abstractmethod
-from typing import Any, Tuple
-
-import torch
-import torch.nn.functional as F
-from torch import nn
-
-
-class AbstractRegularizer(nn.Module):
-    def __init__(self):
-        super().__init__()
-
-    def forward(self, z: torch.Tensor) -> Tuple[torch.Tensor, dict]:
-        raise NotImplementedError()
-
-    @abstractmethod
-    def get_trainable_parameters(self) -> Any:
-        raise NotImplementedError()
-
-
-class IdentityRegularizer(AbstractRegularizer):
-    def forward(self, z: torch.Tensor) -> Tuple[torch.Tensor, dict]:
-        return z, dict()
-
-    def get_trainable_parameters(self) -> Any:
-        yield from ()
-
-
-def measure_perplexity(
-    predicted_indices: torch.Tensor, num_centroids: int
-) -> Tuple[torch.Tensor, torch.Tensor]:
-    # videotuna: https://github.com/karpathy/deep-vector-quantization/blob/main/model.py
-    # eval cluster perplexity. when perplexity == num_embeddings then all clusters are used exactly equally
-    encodings = (
-        F.one_hot(predicted_indices, num_centroids).float().reshape(-1, num_centroids)
-    )
-    avg_probs = encodings.mean(0)
-    perplexity = (-(avg_probs * torch.log(avg_probs + 1e-10)).sum()).exp()
-    cluster_use = torch.sum(avg_probs > 0)
-    return perplexity, cluster_use
diff --git a/videotuna/models/cogvideo_sat/sgm/modules/autoencoding/regularizers/finite_scalar_quantization.py b/videotuna/models/cogvideo_sat/sgm/modules/autoencoding/regularizers/finite_scalar_quantization.py
deleted file mode 100644
index ca2e11b8..00000000
--- a/videotuna/models/cogvideo_sat/sgm/modules/autoencoding/regularizers/finite_scalar_quantization.py
+++ /dev/null
@@ -1,191 +0,0 @@
-"""
-Finite Scalar Quantization: VQ-VAE Made Simple - https://arxiv.org/abs/2309.15505
-Code adapted from Jax version in Appendix A.1
-"""
-
-from typing import List, Optional
-
-import torch
-import torch.nn as nn
-from einops import pack, rearrange, unpack
-from torch import Tensor, int32
-from torch.cuda.amp import autocast
-from torch.nn import Module
-
-# helper functions
-
-
-def exists(v):
-    return v is not None
-
-
-def default(*args):
-    for arg in args:
-        if exists(arg):
-            return arg
-    return None
-
-
-def pack_one(t, pattern):
-    return pack([t], pattern)
-
-
-def unpack_one(t, ps, pattern):
-    return unpack(t, ps, pattern)[0]
-
-
-# tensor helpers
-
-
-def round_ste(z: Tensor) -> Tensor:
-    """Round with straight through gradients."""
-    zhat = z.round()
-    return z + (zhat - z).detach()
-
-
-# main class
-
-
-class FSQ(Module):
-    def __init__(
-        self,
-        levels: List[int],
-        dim: Optional[int] = None,
-        num_codebooks=1,
-        keep_num_codebooks_dim: Optional[bool] = None,
-        scale: Optional[float] = None,
-    ):
-        super().__init__()
-        _levels = torch.tensor(levels, dtype=int32)
-        self.register_buffer("_levels", _levels, persistent=False)
-
-        _basis = torch.cumprod(torch.tensor([1] + levels[:-1]), dim=0, dtype=int32)
-        self.register_buffer("_basis", _basis, persistent=False)
-
-        self.scale = scale
-
-        codebook_dim = len(levels)
-        self.codebook_dim = codebook_dim
-
-        effective_codebook_dim = codebook_dim * num_codebooks
-        self.num_codebooks = num_codebooks
-        self.effective_codebook_dim = effective_codebook_dim
-
-        keep_num_codebooks_dim = default(keep_num_codebooks_dim, num_codebooks > 1)
-        assert not (num_codebooks > 1 and not keep_num_codebooks_dim)
-        self.keep_num_codebooks_dim = keep_num_codebooks_dim
-
-        self.dim = default(dim, len(_levels) * num_codebooks)
-
-        has_projections = self.dim != effective_codebook_dim
-        self.project_in = (
-            nn.Linear(self.dim, effective_codebook_dim)
-            if has_projections
-            else nn.Identity()
-        )
-        self.project_out = (
-            nn.Linear(effective_codebook_dim, self.dim)
-            if has_projections
-            else nn.Identity()
-        )
-        self.has_projections = has_projections
-
-        self.codebook_size = self._levels.prod().item()
-
-        implicit_codebook = self.indices_to_codes(
-            torch.arange(self.codebook_size), project_out=False
-        )
-        self.register_buffer("implicit_codebook", implicit_codebook, persistent=False)
-
-    def bound(self, z: Tensor, eps: float = 1e-3) -> Tensor:
-        """Bound `z`, an array of shape (..., d)."""
-        half_l = (self._levels - 1) * (1 + eps) / 2
-        offset = torch.where(self._levels % 2 == 0, 0.5, 0.0)
-        shift = (offset / half_l).atanh()
-        return (z + shift).tanh() * half_l - offset
-
-    def quantize(self, z: Tensor) -> Tensor:
-        """Quantizes z, returns quantized zhat, same shape as z."""
-        quantized = round_ste(self.bound(z))
-        half_width = self._levels // 2  # Renormalize to [-1, 1].
-        return quantized / half_width
-
-    def _scale_and_shift(self, zhat_normalized: Tensor) -> Tensor:
-        half_width = self._levels // 2
-        return (zhat_normalized * half_width) + half_width
-
-    def _scale_and_shift_inverse(self, zhat: Tensor) -> Tensor:
-        half_width = self._levels // 2
-        return (zhat - half_width) / half_width
-
-    def codes_to_indices(self, zhat: Tensor) -> Tensor:
-        """Converts a `code` to an index in the codebook."""
-        assert zhat.shape[-1] == self.codebook_dim
-        zhat = self._scale_and_shift(zhat)
-        return (zhat * self._basis).sum(dim=-1).to(int32)
-
-    def indices_to_codes(self, indices: Tensor, project_out=True) -> Tensor:
-        """Inverse of `codes_to_indices`."""
-
-        is_img_or_video = indices.ndim >= (3 + int(self.keep_num_codebooks_dim))
-
-        indices = rearrange(indices, "... -> ... 1")
-        codes_non_centered = (indices // self._basis) % self._levels
-        codes = self._scale_and_shift_inverse(codes_non_centered)
-
-        if self.keep_num_codebooks_dim:
-            codes = rearrange(codes, "... c d -> ... (c d)")
-
-        if project_out:
-            codes = self.project_out(codes)
-
-        if is_img_or_video:
-            codes = rearrange(codes, "b ... d -> b d ...")
-
-        return codes
-
-    @autocast(enabled=False)
-    def forward(self, z: Tensor) -> Tensor:
-        """
-        einstein notation
-        b - batch
-        n - sequence (or flattened spatial dimensions)
-        d - feature dimension
-        c - number of codebook dim
-        """
-
-        is_img_or_video = z.ndim >= 4
-
-        # standardize image or video into (batch, seq, dimension)
-
-        if is_img_or_video:
-            z = rearrange(z, "b d ... -> b ... d")
-            z, ps = pack_one(z, "b * d")
-
-        assert (
-            z.shape[-1] == self.dim
-        ), f"expected dimension of {self.dim} but found dimension of {z.shape[-1]}"
-
-        z = self.project_in(z)
-
-        z = rearrange(z, "b n (c d) -> b n c d", c=self.num_codebooks)
-
-        codes = self.quantize(z)
-        indices = self.codes_to_indices(codes)
-
-        codes = rearrange(codes, "b n c d -> b n (c d)")
-
-        out = self.project_out(codes)
-
-        # reconstitute image or video dimensions
-
-        if is_img_or_video:
-            out = unpack_one(out, ps, "b * d")
-            out = rearrange(out, "b ... d -> b d ...")
-
-            indices = unpack_one(indices, ps, "b * c")
-
-        if not self.keep_num_codebooks_dim:
-            indices = rearrange(indices, "... 1 -> ...")
-
-        return out, indices
diff --git a/videotuna/models/cogvideo_sat/sgm/modules/autoencoding/regularizers/lookup_free_quantization.py b/videotuna/models/cogvideo_sat/sgm/modules/autoencoding/regularizers/lookup_free_quantization.py
deleted file mode 100644
index 026c04d9..00000000
--- a/videotuna/models/cogvideo_sat/sgm/modules/autoencoding/regularizers/lookup_free_quantization.py
+++ /dev/null
@@ -1,327 +0,0 @@
-"""
-Lookup Free Quantization
-Proposed in https://arxiv.org/abs/2310.05737
-
-In the simplest setup, each dimension is quantized into {-1, 1}.
-An entropy penalty is used to encourage utilization.
-"""
-
-from collections import namedtuple
-from math import ceil, log2
-
-import torch
-import torch.nn.functional as F
-from einops import pack, rearrange, reduce, unpack
-from torch import einsum, nn
-from torch.cuda.amp import autocast
-from torch.nn import Module
-
-# constants
-
-Return = namedtuple("Return", ["quantized", "indices", "entropy_aux_loss"])
-
-LossBreakdown = namedtuple(
-    "LossBreakdown", ["per_sample_entropy", "batch_entropy", "commitment"]
-)
-
-# helper functions
-
-
-def exists(v):
-    return v is not None
-
-
-def default(*args):
-    for arg in args:
-        if exists(arg):
-            return arg() if callable(arg) else arg
-    return None
-
-
-def pack_one(t, pattern):
-    return pack([t], pattern)
-
-
-def unpack_one(t, ps, pattern):
-    return unpack(t, ps, pattern)[0]
-
-
-# entropy
-
-
-def log(t, eps=1e-5):
-    return t.clamp(min=eps).log()
-
-
-def entropy(prob):
-    return (-prob * log(prob)).sum(dim=-1)
-
-
-# class
-
-
-class LFQ(Module):
-    def __init__(
-        self,
-        *,
-        dim=None,
-        codebook_size=None,
-        entropy_loss_weight=0.1,
-        commitment_loss_weight=0.25,
-        diversity_gamma=1.0,
-        straight_through_activation=nn.Identity(),
-        num_codebooks=1,
-        keep_num_codebooks_dim=None,
-        codebook_scale=1.0,  # for residual LFQ, codebook scaled down by 2x at each layer
-        frac_per_sample_entropy=1.0,  # make less than 1. to only use a random fraction of the probs for per sample entropy
-    ):
-        super().__init__()
-
-        # some assert validations
-
-        assert exists(dim) or exists(
-            codebook_size
-        ), "either dim or codebook_size must be specified for LFQ"
-        assert (
-            not exists(codebook_size) or log2(codebook_size).is_integer()
-        ), f"your codebook size must be a power of 2 for lookup free quantization (suggested {2 ** ceil(log2(codebook_size))})"
-
-        codebook_size = default(codebook_size, lambda: 2**dim)
-        codebook_dim = int(log2(codebook_size))
-
-        codebook_dims = codebook_dim * num_codebooks
-        dim = default(dim, codebook_dims)
-
-        has_projections = dim != codebook_dims
-        self.project_in = (
-            nn.Linear(dim, codebook_dims) if has_projections else nn.Identity()
-        )
-        self.project_out = (
-            nn.Linear(codebook_dims, dim) if has_projections else nn.Identity()
-        )
-        self.has_projections = has_projections
-
-        self.dim = dim
-        self.codebook_dim = codebook_dim
-        self.num_codebooks = num_codebooks
-
-        keep_num_codebooks_dim = default(keep_num_codebooks_dim, num_codebooks > 1)
-        assert not (num_codebooks > 1 and not keep_num_codebooks_dim)
-        self.keep_num_codebooks_dim = keep_num_codebooks_dim
-
-        # straight through activation
-
-        self.activation = straight_through_activation
-
-        # entropy aux loss related weights
-
-        assert 0 < frac_per_sample_entropy <= 1.0
-        self.frac_per_sample_entropy = frac_per_sample_entropy
-
-        self.diversity_gamma = diversity_gamma
-        self.entropy_loss_weight = entropy_loss_weight
-
-        # codebook scale
-
-        self.codebook_scale = codebook_scale
-
-        # commitment loss
-
-        self.commitment_loss_weight = commitment_loss_weight
-
-        # for no auxiliary loss, during inference
-
-        self.register_buffer("mask", 2 ** torch.arange(codebook_dim - 1, -1, -1))
-        self.register_buffer("zero", torch.tensor(0.0), persistent=False)
-
-        # codes
-
-        all_codes = torch.arange(codebook_size)
-        bits = ((all_codes[..., None].int() & self.mask) != 0).float()
-        codebook = self.bits_to_codes(bits)
-
-        self.register_buffer("codebook", codebook, persistent=False)
-
-    def bits_to_codes(self, bits):
-        return bits * self.codebook_scale * 2 - self.codebook_scale
-
-    @property
-    def dtype(self):
-        return self.codebook.dtype
-
-    def indices_to_codes(self, indices, project_out=True):
-        is_img_or_video = indices.ndim >= (3 + int(self.keep_num_codebooks_dim))
-
-        if not self.keep_num_codebooks_dim:
-            indices = rearrange(indices, "... -> ... 1")
-
-        # indices to codes, which are bits of either -1 or 1
-
-        bits = ((indices[..., None].int() & self.mask) != 0).to(self.dtype)
-
-        codes = self.bits_to_codes(bits)
-
-        codes = rearrange(codes, "... c d -> ... (c d)")
-
-        # whether to project codes out to original dimensions
-        # if the input feature dimensions were not log2(codebook size)
-
-        if project_out:
-            codes = self.project_out(codes)
-
-        # rearrange codes back to original shape
-
-        if is_img_or_video:
-            codes = rearrange(codes, "b ... d -> b d ...")
-
-        return codes
-
-    @autocast(enabled=False)
-    def forward(
-        self,
-        x,
-        inv_temperature=100.0,
-        return_loss_breakdown=False,
-        mask=None,
-    ):
-        """
-        einstein notation
-        b - batch
-        n - sequence (or flattened spatial dimensions)
-        d - feature dimension, which is also log2(codebook size)
-        c - number of codebook dim
-        """
-
-        x = x.float()
-
-        is_img_or_video = x.ndim >= 4
-
-        # standardize image or video into (batch, seq, dimension)
-
-        if is_img_or_video:
-            x = rearrange(x, "b d ... -> b ... d")
-            x, ps = pack_one(x, "b * d")
-
-        assert (
-            x.shape[-1] == self.dim
-        ), f"expected dimension of {self.dim} but received {x.shape[-1]}"
-
-        x = self.project_in(x)
-
-        # split out number of codebooks
-
-        x = rearrange(x, "b n (c d) -> b n c d", c=self.num_codebooks)
-
-        # quantize by eq 3.
-
-        original_input = x
-
-        codebook_value = torch.ones_like(x) * self.codebook_scale
-        quantized = torch.where(x > 0, codebook_value, -codebook_value)
-
-        # use straight-through gradients (optionally with custom activation fn) if training
-
-        if self.training:
-            x = self.activation(x)
-            x = x + (quantized - x).detach()
-        else:
-            x = quantized
-
-        # calculate indices
-
-        indices = reduce((x > 0).int() * self.mask.int(), "b n c d -> b n c", "sum")
-
-        # entropy aux loss
-
-        if self.training:
-            # the same as euclidean distance up to a constant
-            distance = -2 * einsum(
-                "... i d, j d -> ... i j", original_input, self.codebook
-            )
-
-            prob = (-distance * inv_temperature).softmax(dim=-1)
-
-            # account for mask
-
-            if exists(mask):
-                prob = prob[mask]
-            else:
-                prob = rearrange(prob, "b n ... -> (b n) ...")
-
-            # whether to only use a fraction of probs, for reducing memory
-
-            if self.frac_per_sample_entropy < 1.0:
-                num_tokens = prob.shape[0]
-                num_sampled_tokens = int(num_tokens * self.frac_per_sample_entropy)
-                rand_mask = torch.randn(num_tokens).argsort(dim=-1) < num_sampled_tokens
-                per_sample_probs = prob[rand_mask]
-            else:
-                per_sample_probs = prob
-
-            # calculate per sample entropy
-
-            per_sample_entropy = entropy(per_sample_probs).mean()
-
-            # distribution over all available tokens in the batch
-
-            avg_prob = reduce(per_sample_probs, "... c d -> c d", "mean")
-            codebook_entropy = entropy(avg_prob).mean()
-
-            # 1. entropy will be nudged to be low for each code, to encourage the network to output confident predictions
-            # 2. codebook entropy will be nudged to be high, to encourage all codes to be uniformly used within the batch
-
-            entropy_aux_loss = (
-                per_sample_entropy - self.diversity_gamma * codebook_entropy
-            )
-        else:
-            # if not training, just return dummy 0
-            entropy_aux_loss = per_sample_entropy = codebook_entropy = self.zero
-
-        # commit loss
-
-        if self.training:
-            commit_loss = F.mse_loss(
-                original_input, quantized.detach(), reduction="none"
-            )
-
-            if exists(mask):
-                commit_loss = commit_loss[mask]
-
-            commit_loss = commit_loss.mean()
-        else:
-            commit_loss = self.zero
-
-        # merge back codebook dim
-
-        x = rearrange(x, "b n c d -> b n (c d)")
-
-        # project out to feature dimension if needed
-
-        x = self.project_out(x)
-
-        # reconstitute image or video dimensions
-
-        if is_img_or_video:
-            x = unpack_one(x, ps, "b * d")
-            x = rearrange(x, "b ... d -> b d ...")
-
-            indices = unpack_one(indices, ps, "b * c")
-
-        # whether to remove single codebook dim
-
-        if not self.keep_num_codebooks_dim:
-            indices = rearrange(indices, "... 1 -> ...")
-
-        # complete aux loss
-
-        aux_loss = (
-            entropy_aux_loss * self.entropy_loss_weight
-            + commit_loss * self.commitment_loss_weight
-        )
-
-        ret = Return(x, indices, aux_loss)
-
-        if not return_loss_breakdown:
-            return ret
-
-        return ret, LossBreakdown(per_sample_entropy, codebook_entropy, commit_loss)
diff --git a/videotuna/models/cogvideo_sat/sgm/modules/autoencoding/regularizers/quantize.py b/videotuna/models/cogvideo_sat/sgm/modules/autoencoding/regularizers/quantize.py
deleted file mode 100644
index 86a4dbdd..00000000
--- a/videotuna/models/cogvideo_sat/sgm/modules/autoencoding/regularizers/quantize.py
+++ /dev/null
@@ -1,487 +0,0 @@
-import logging
-from abc import abstractmethod
-from typing import Dict, Iterator, Literal, Optional, Tuple, Union
-
-import numpy as np
-import torch
-import torch.nn as nn
-import torch.nn.functional as F
-from einops import rearrange
-from torch import einsum
-
-from .base import AbstractRegularizer, measure_perplexity
-
-logpy = logging.getLogger(__name__)
-
-
-class AbstractQuantizer(AbstractRegularizer):
-    def __init__(self):
-        super().__init__()
-        # Define these in your init
-        # shape (N,)
-        self.used: Optional[torch.Tensor]
-        self.re_embed: int
-        self.unknown_index: Union[Literal["random"], int]
-
-    def remap_to_used(self, inds: torch.Tensor) -> torch.Tensor:
-        assert self.used is not None, "You need to define used indices for remap"
-        ishape = inds.shape
-        assert len(ishape) > 1
-        inds = inds.reshape(ishape[0], -1)
-        used = self.used.to(inds)
-        match = (inds[:, :, None] == used[None, None, ...]).long()
-        new = match.argmax(-1)
-        unknown = match.sum(2) < 1
-        if self.unknown_index == "random":
-            new[unknown] = torch.randint(0, self.re_embed, size=new[unknown].shape).to(
-                device=new.device
-            )
-        else:
-            new[unknown] = self.unknown_index
-        return new.reshape(ishape)
-
-    def unmap_to_all(self, inds: torch.Tensor) -> torch.Tensor:
-        assert self.used is not None, "You need to define used indices for remap"
-        ishape = inds.shape
-        assert len(ishape) > 1
-        inds = inds.reshape(ishape[0], -1)
-        used = self.used.to(inds)
-        if self.re_embed > self.used.shape[0]:  # extra token
-            inds[inds >= self.used.shape[0]] = 0  # simply set to zero
-        back = torch.gather(used[None, :][inds.shape[0] * [0], :], 1, inds)
-        return back.reshape(ishape)
-
-    @abstractmethod
-    def get_codebook_entry(
-        self, indices: torch.Tensor, shape: Optional[Tuple[int, ...]] = None
-    ) -> torch.Tensor:
-        raise NotImplementedError()
-
-    def get_trainable_parameters(self) -> Iterator[torch.nn.Parameter]:
-        yield from self.parameters()
-
-
-class GumbelQuantizer(AbstractQuantizer):
-    """
-    credit to @karpathy:
-    https://github.com/karpathy/deep-vector-quantization/blob/main/model.py (thanks!)
-    Gumbel Softmax trick quantizer
-    Categorical Reparameterization with Gumbel-Softmax, Jang et al. 2016
-    https://arxiv.org/abs/1611.01144
-    """
-
-    def __init__(
-        self,
-        num_hiddens: int,
-        embedding_dim: int,
-        n_embed: int,
-        straight_through: bool = True,
-        kl_weight: float = 5e-4,
-        temp_init: float = 1.0,
-        remap: Optional[str] = None,
-        unknown_index: str = "random",
-        loss_key: str = "loss/vq",
-    ) -> None:
-        super().__init__()
-
-        self.loss_key = loss_key
-        self.embedding_dim = embedding_dim
-        self.n_embed = n_embed
-
-        self.straight_through = straight_through
-        self.temperature = temp_init
-        self.kl_weight = kl_weight
-
-        self.proj = nn.Conv2d(num_hiddens, n_embed, 1)
-        self.embed = nn.Embedding(n_embed, embedding_dim)
-
-        self.remap = remap
-        if self.remap is not None:
-            self.register_buffer("used", torch.tensor(np.load(self.remap)))
-            self.re_embed = self.used.shape[0]
-        else:
-            self.used = None
-            self.re_embed = n_embed
-        if unknown_index == "extra":
-            self.unknown_index = self.re_embed
-            self.re_embed = self.re_embed + 1
-        else:
-            assert unknown_index == "random" or isinstance(
-                unknown_index, int
-            ), "unknown index needs to be 'random', 'extra' or any integer"
-            self.unknown_index = unknown_index  # "random" or "extra" or integer
-        if self.remap is not None:
-            logpy.info(
-                f"Remapping {self.n_embed} indices to {self.re_embed} indices. "
-                f"Using {self.unknown_index} for unknown indices."
-            )
-
-    def forward(
-        self, z: torch.Tensor, temp: Optional[float] = None, return_logits: bool = False
-    ) -> Tuple[torch.Tensor, Dict]:
-        # force hard = True when we are in eval mode, as we must quantize.
-        # actually, always true seems to work
-        hard = self.straight_through if self.training else True
-        temp = self.temperature if temp is None else temp
-        out_dict = {}
-        logits = self.proj(z)
-        if self.remap is not None:
-            # continue only with used logits
-            full_zeros = torch.zeros_like(logits)
-            logits = logits[:, self.used, ...]
-
-        soft_one_hot = F.gumbel_softmax(logits, tau=temp, dim=1, hard=hard)
-        if self.remap is not None:
-            # go back to all entries but unused set to zero
-            full_zeros[:, self.used, ...] = soft_one_hot
-            soft_one_hot = full_zeros
-        z_q = einsum("b n h w, n d -> b d h w", soft_one_hot, self.embed.weight)
-
-        # + kl divergence to the prior loss
-        qy = F.softmax(logits, dim=1)
-        diff = (
-            self.kl_weight
-            * torch.sum(qy * torch.log(qy * self.n_embed + 1e-10), dim=1).mean()
-        )
-        out_dict[self.loss_key] = diff
-
-        ind = soft_one_hot.argmax(dim=1)
-        out_dict["indices"] = ind
-        if self.remap is not None:
-            ind = self.remap_to_used(ind)
-
-        if return_logits:
-            out_dict["logits"] = logits
-
-        return z_q, out_dict
-
-    def get_codebook_entry(self, indices, shape):
-        # TODO: shape not yet optional
-        b, h, w, c = shape
-        assert b * h * w == indices.shape[0]
-        indices = rearrange(indices, "(b h w) -> b h w", b=b, h=h, w=w)
-        if self.remap is not None:
-            indices = self.unmap_to_all(indices)
-        one_hot = (
-            F.one_hot(indices, num_classes=self.n_embed).permute(0, 3, 1, 2).float()
-        )
-        z_q = einsum("b n h w, n d -> b d h w", one_hot, self.embed.weight)
-        return z_q
-
-
-class VectorQuantizer(AbstractQuantizer):
-    """
-    ____________________________________________
-    Discretization bottleneck part of the VQ-VAE.
-    Inputs:
-    - n_e : number of embeddings
-    - e_dim : dimension of embedding
-    - beta : commitment cost used in loss term,
-        beta * ||z_e(x)-sg[e]||^2
-    _____________________________________________
-    """
-
-    def __init__(
-        self,
-        n_e: int,
-        e_dim: int,
-        beta: float = 0.25,
-        remap: Optional[str] = None,
-        unknown_index: str = "random",
-        sane_index_shape: bool = False,
-        log_perplexity: bool = False,
-        embedding_weight_norm: bool = False,
-        loss_key: str = "loss/vq",
-    ):
-        super().__init__()
-        self.n_e = n_e
-        self.e_dim = e_dim
-        self.beta = beta
-        self.loss_key = loss_key
-
-        if not embedding_weight_norm:
-            self.embedding = nn.Embedding(self.n_e, self.e_dim)
-            self.embedding.weight.data.uniform_(-1.0 / self.n_e, 1.0 / self.n_e)
-        else:
-            self.embedding = torch.nn.utils.weight_norm(
-                nn.Embedding(self.n_e, self.e_dim), dim=1
-            )
-
-        self.remap = remap
-        if self.remap is not None:
-            self.register_buffer("used", torch.tensor(np.load(self.remap)))
-            self.re_embed = self.used.shape[0]
-        else:
-            self.used = None
-            self.re_embed = n_e
-        if unknown_index == "extra":
-            self.unknown_index = self.re_embed
-            self.re_embed = self.re_embed + 1
-        else:
-            assert unknown_index == "random" or isinstance(
-                unknown_index, int
-            ), "unknown index needs to be 'random', 'extra' or any integer"
-            self.unknown_index = unknown_index  # "random" or "extra" or integer
-        if self.remap is not None:
-            logpy.info(
-                f"Remapping {self.n_e} indices to {self.re_embed} indices. "
-                f"Using {self.unknown_index} for unknown indices."
-            )
-
-        self.sane_index_shape = sane_index_shape
-        self.log_perplexity = log_perplexity
-
-    def forward(
-        self,
-        z: torch.Tensor,
-    ) -> Tuple[torch.Tensor, Dict]:
-        do_reshape = z.ndim == 4
-        if do_reshape:
-            #     # reshape z -> (batch, height, width, channel) and flatten
-            z = rearrange(z, "b c h w -> b h w c").contiguous()
-
-        else:
-            assert z.ndim < 4, "No reshaping strategy for inputs > 4 dimensions defined"
-            z = z.contiguous()
-
-        z_flattened = z.view(-1, self.e_dim)
-        # distances from z to embeddings e_j (z - e)^2 = z^2 + e^2 - 2 e * z
-
-        d = (
-            torch.sum(z_flattened**2, dim=1, keepdim=True)
-            + torch.sum(self.embedding.weight**2, dim=1)
-            - 2
-            * torch.einsum(
-                "bd,dn->bn", z_flattened, rearrange(self.embedding.weight, "n d -> d n")
-            )
-        )
-
-        min_encoding_indices = torch.argmin(d, dim=1)
-        z_q = self.embedding(min_encoding_indices).view(z.shape)
-        loss_dict = {}
-        if self.log_perplexity:
-            perplexity, cluster_usage = measure_perplexity(
-                min_encoding_indices.detach(), self.n_e
-            )
-            loss_dict.update({"perplexity": perplexity, "cluster_usage": cluster_usage})
-
-        # compute loss for embedding
-        loss = self.beta * torch.mean((z_q.detach() - z) ** 2) + torch.mean(
-            (z_q - z.detach()) ** 2
-        )
-        loss_dict[self.loss_key] = loss
-
-        # preserve gradients
-        z_q = z + (z_q - z).detach()
-
-        # reshape back to match original input shape
-        if do_reshape:
-            z_q = rearrange(z_q, "b h w c -> b c h w").contiguous()
-
-        if self.remap is not None:
-            min_encoding_indices = min_encoding_indices.reshape(
-                z.shape[0], -1
-            )  # add batch axis
-            min_encoding_indices = self.remap_to_used(min_encoding_indices)
-            min_encoding_indices = min_encoding_indices.reshape(-1, 1)  # flatten
-
-        if self.sane_index_shape:
-            if do_reshape:
-                min_encoding_indices = min_encoding_indices.reshape(
-                    z_q.shape[0], z_q.shape[2], z_q.shape[3]
-                )
-            else:
-                min_encoding_indices = rearrange(
-                    min_encoding_indices, "(b s) 1 -> b s", b=z_q.shape[0]
-                )
-
-        loss_dict["min_encoding_indices"] = min_encoding_indices
-
-        return z_q, loss_dict
-
-    def get_codebook_entry(
-        self, indices: torch.Tensor, shape: Optional[Tuple[int, ...]] = None
-    ) -> torch.Tensor:
-        # shape specifying (batch, height, width, channel)
-        if self.remap is not None:
-            assert shape is not None, "Need to give shape for remap"
-            indices = indices.reshape(shape[0], -1)  # add batch axis
-            indices = self.unmap_to_all(indices)
-            indices = indices.reshape(-1)  # flatten again
-
-        # get quantized latent vectors
-        z_q = self.embedding(indices)
-
-        if shape is not None:
-            z_q = z_q.view(shape)
-            # reshape back to match original input shape
-            z_q = z_q.permute(0, 3, 1, 2).contiguous()
-
-        return z_q
-
-
-class EmbeddingEMA(nn.Module):
-    def __init__(self, num_tokens, codebook_dim, decay=0.99, eps=1e-5):
-        super().__init__()
-        self.decay = decay
-        self.eps = eps
-        weight = torch.randn(num_tokens, codebook_dim)
-        self.weight = nn.Parameter(weight, requires_grad=False)
-        self.cluster_size = nn.Parameter(torch.zeros(num_tokens), requires_grad=False)
-        self.embed_avg = nn.Parameter(weight.clone(), requires_grad=False)
-        self.update = True
-
-    def forward(self, embed_id):
-        return F.embedding(embed_id, self.weight)
-
-    def cluster_size_ema_update(self, new_cluster_size):
-        self.cluster_size.data.mul_(self.decay).add_(
-            new_cluster_size, alpha=1 - self.decay
-        )
-
-    def embed_avg_ema_update(self, new_embed_avg):
-        self.embed_avg.data.mul_(self.decay).add_(new_embed_avg, alpha=1 - self.decay)
-
-    def weight_update(self, num_tokens):
-        n = self.cluster_size.sum()
-        smoothed_cluster_size = (
-            (self.cluster_size + self.eps) / (n + num_tokens * self.eps) * n
-        )
-        # normalize embedding average with smoothed cluster size
-        embed_normalized = self.embed_avg / smoothed_cluster_size.unsqueeze(1)
-        self.weight.data.copy_(embed_normalized)
-
-
-class EMAVectorQuantizer(AbstractQuantizer):
-    def __init__(
-        self,
-        n_embed: int,
-        embedding_dim: int,
-        beta: float,
-        decay: float = 0.99,
-        eps: float = 1e-5,
-        remap: Optional[str] = None,
-        unknown_index: str = "random",
-        loss_key: str = "loss/vq",
-    ):
-        super().__init__()
-        self.codebook_dim = embedding_dim
-        self.num_tokens = n_embed
-        self.beta = beta
-        self.loss_key = loss_key
-
-        self.embedding = EmbeddingEMA(self.num_tokens, self.codebook_dim, decay, eps)
-
-        self.remap = remap
-        if self.remap is not None:
-            self.register_buffer("used", torch.tensor(np.load(self.remap)))
-            self.re_embed = self.used.shape[0]
-        else:
-            self.used = None
-            self.re_embed = n_embed
-        if unknown_index == "extra":
-            self.unknown_index = self.re_embed
-            self.re_embed = self.re_embed + 1
-        else:
-            assert unknown_index == "random" or isinstance(
-                unknown_index, int
-            ), "unknown index needs to be 'random', 'extra' or any integer"
-            self.unknown_index = unknown_index  # "random" or "extra" or integer
-        if self.remap is not None:
-            logpy.info(
-                f"Remapping {self.n_embed} indices to {self.re_embed} indices. "
-                f"Using {self.unknown_index} for unknown indices."
-            )
-
-    def forward(self, z: torch.Tensor) -> Tuple[torch.Tensor, Dict]:
-        # reshape z -> (batch, height, width, channel) and flatten
-        # z, 'b c h w -> b h w c'
-        z = rearrange(z, "b c h w -> b h w c")
-        z_flattened = z.reshape(-1, self.codebook_dim)
-
-        # distances from z to embeddings e_j (z - e)^2 = z^2 + e^2 - 2 e * z
-        d = (
-            z_flattened.pow(2).sum(dim=1, keepdim=True)
-            + self.embedding.weight.pow(2).sum(dim=1)
-            - 2 * torch.einsum("bd,nd->bn", z_flattened, self.embedding.weight)
-        )  # 'n d -> d n'
-
-        encoding_indices = torch.argmin(d, dim=1)
-
-        z_q = self.embedding(encoding_indices).view(z.shape)
-        encodings = F.one_hot(encoding_indices, self.num_tokens).type(z.dtype)
-        avg_probs = torch.mean(encodings, dim=0)
-        perplexity = torch.exp(-torch.sum(avg_probs * torch.log(avg_probs + 1e-10)))
-
-        if self.training and self.embedding.update:
-            # EMA cluster size
-            encodings_sum = encodings.sum(0)
-            self.embedding.cluster_size_ema_update(encodings_sum)
-            # EMA embedding average
-            embed_sum = encodings.transpose(0, 1) @ z_flattened
-            self.embedding.embed_avg_ema_update(embed_sum)
-            # normalize embed_avg and update weight
-            self.embedding.weight_update(self.num_tokens)
-
-        # compute loss for embedding
-        loss = self.beta * F.mse_loss(z_q.detach(), z)
-
-        # preserve gradients
-        z_q = z + (z_q - z).detach()
-
-        # reshape back to match original input shape
-        # z_q, 'b h w c -> b c h w'
-        z_q = rearrange(z_q, "b h w c -> b c h w")
-
-        out_dict = {
-            self.loss_key: loss,
-            "encodings": encodings,
-            "encoding_indices": encoding_indices,
-            "perplexity": perplexity,
-        }
-
-        return z_q, out_dict
-
-
-class VectorQuantizerWithInputProjection(VectorQuantizer):
-    def __init__(
-        self,
-        input_dim: int,
-        n_codes: int,
-        codebook_dim: int,
-        beta: float = 1.0,
-        output_dim: Optional[int] = None,
-        **kwargs,
-    ):
-        super().__init__(n_codes, codebook_dim, beta, **kwargs)
-        self.proj_in = nn.Linear(input_dim, codebook_dim)
-        self.output_dim = output_dim
-        if output_dim is not None:
-            self.proj_out = nn.Linear(codebook_dim, output_dim)
-        else:
-            self.proj_out = nn.Identity()
-
-    def forward(self, z: torch.Tensor) -> Tuple[torch.Tensor, Dict]:
-        rearr = False
-        in_shape = z.shape
-
-        if z.ndim > 3:
-            rearr = self.output_dim is not None
-            z = rearrange(z, "b c ... -> b (...) c")
-        z = self.proj_in(z)
-        z_q, loss_dict = super().forward(z)
-
-        z_q = self.proj_out(z_q)
-        if rearr:
-            if len(in_shape) == 4:
-                z_q = rearrange(z_q, "b (h w) c -> b c h w ", w=in_shape[-1])
-            elif len(in_shape) == 5:
-                z_q = rearrange(
-                    z_q, "b (t h w) c -> b c t h w ", w=in_shape[-1], h=in_shape[-2]
-                )
-            else:
-                raise NotImplementedError(
-                    f"rearranging not available for {len(in_shape)}-dimensional input."
-                )
-
-        return z_q, loss_dict
diff --git a/videotuna/models/cogvideo_sat/sgm/modules/autoencoding/temporal_ae.py b/videotuna/models/cogvideo_sat/sgm/modules/autoencoding/temporal_ae.py
deleted file mode 100644
index 03e30557..00000000
--- a/videotuna/models/cogvideo_sat/sgm/modules/autoencoding/temporal_ae.py
+++ /dev/null
@@ -1,348 +0,0 @@
-from typing import Callable, Iterable, Union
-
-import torch
-from einops import rearrange, repeat
-from sgm.modules.diffusionmodules.model import (
-    XFORMERS_IS_AVAILABLE,
-    AttnBlock,
-    Decoder,
-    MemoryEfficientAttnBlock,
-    ResnetBlock,
-)
-from sgm.modules.diffusionmodules.openaimodel import ResBlock, timestep_embedding
-from sgm.modules.video_attention import VideoTransformerBlock
-from sgm.util import partialclass
-
-
-class VideoResBlock(ResnetBlock):
-    def __init__(
-        self,
-        out_channels,
-        *args,
-        dropout=0.0,
-        video_kernel_size=3,
-        alpha=0.0,
-        merge_strategy="learned",
-        **kwargs,
-    ):
-        super().__init__(out_channels=out_channels, dropout=dropout, *args, **kwargs)
-        if video_kernel_size is None:
-            video_kernel_size = [3, 1, 1]
-        self.time_stack = ResBlock(
-            channels=out_channels,
-            emb_channels=0,
-            dropout=dropout,
-            dims=3,
-            use_scale_shift_norm=False,
-            use_conv=False,
-            up=False,
-            down=False,
-            kernel_size=video_kernel_size,
-            use_checkpoint=False,
-            skip_t_emb=True,
-        )
-
-        self.merge_strategy = merge_strategy
-        if self.merge_strategy == "fixed":
-            self.register_buffer("mix_factor", torch.Tensor([alpha]))
-        elif self.merge_strategy == "learned":
-            self.register_parameter(
-                "mix_factor", torch.nn.Parameter(torch.Tensor([alpha]))
-            )
-        else:
-            raise ValueError(f"unknown merge strategy {self.merge_strategy}")
-
-    def get_alpha(self, bs):
-        if self.merge_strategy == "fixed":
-            return self.mix_factor
-        elif self.merge_strategy == "learned":
-            return torch.sigmoid(self.mix_factor)
-        else:
-            raise NotImplementedError()
-
-    def forward(self, x, temb, skip_video=False, timesteps=None):
-        if timesteps is None:
-            timesteps = self.timesteps
-
-        b, c, h, w = x.shape
-
-        x = super().forward(x, temb)
-
-        if not skip_video:
-            x_mix = rearrange(x, "(b t) c h w -> b c t h w", t=timesteps)
-
-            x = rearrange(x, "(b t) c h w -> b c t h w", t=timesteps)
-
-            x = self.time_stack(x, temb)
-
-            alpha = self.get_alpha(bs=b // timesteps)
-            x = alpha * x + (1.0 - alpha) * x_mix
-
-            x = rearrange(x, "b c t h w -> (b t) c h w")
-        return x
-
-
-class AE3DConv(torch.nn.Conv2d):
-    def __init__(self, in_channels, out_channels, video_kernel_size=3, *args, **kwargs):
-        super().__init__(in_channels, out_channels, *args, **kwargs)
-        if isinstance(video_kernel_size, Iterable):
-            padding = [int(k // 2) for k in video_kernel_size]
-        else:
-            padding = int(video_kernel_size // 2)
-
-        self.time_mix_conv = torch.nn.Conv3d(
-            in_channels=out_channels,
-            out_channels=out_channels,
-            kernel_size=video_kernel_size,
-            padding=padding,
-        )
-
-    def forward(self, input, timesteps, skip_video=False):
-        x = super().forward(input)
-        if skip_video:
-            return x
-        x = rearrange(x, "(b t) c h w -> b c t h w", t=timesteps)
-        x = self.time_mix_conv(x)
-        return rearrange(x, "b c t h w -> (b t) c h w")
-
-
-class VideoBlock(AttnBlock):
-    def __init__(
-        self, in_channels: int, alpha: float = 0, merge_strategy: str = "learned"
-    ):
-        super().__init__(in_channels)
-        # no context, single headed, as in base class
-        self.time_mix_block = VideoTransformerBlock(
-            dim=in_channels,
-            n_heads=1,
-            d_head=in_channels,
-            checkpoint=False,
-            ff_in=True,
-            attn_mode="softmax",
-        )
-
-        time_embed_dim = self.in_channels * 4
-        self.video_time_embed = torch.nn.Sequential(
-            torch.nn.Linear(self.in_channels, time_embed_dim),
-            torch.nn.SiLU(),
-            torch.nn.Linear(time_embed_dim, self.in_channels),
-        )
-
-        self.merge_strategy = merge_strategy
-        if self.merge_strategy == "fixed":
-            self.register_buffer("mix_factor", torch.Tensor([alpha]))
-        elif self.merge_strategy == "learned":
-            self.register_parameter(
-                "mix_factor", torch.nn.Parameter(torch.Tensor([alpha]))
-            )
-        else:
-            raise ValueError(f"unknown merge strategy {self.merge_strategy}")
-
-    def forward(self, x, timesteps, skip_video=False):
-        if skip_video:
-            return super().forward(x)
-
-        x_in = x
-        x = self.attention(x)
-        h, w = x.shape[2:]
-        x = rearrange(x, "b c h w -> b (h w) c")
-
-        x_mix = x
-        num_frames = torch.arange(timesteps, device=x.device)
-        num_frames = repeat(num_frames, "t -> b t", b=x.shape[0] // timesteps)
-        num_frames = rearrange(num_frames, "b t -> (b t)")
-        t_emb = timestep_embedding(num_frames, self.in_channels, repeat_only=False)
-        emb = self.video_time_embed(t_emb)  # b, n_channels
-        emb = emb[:, None, :]
-        x_mix = x_mix + emb
-
-        alpha = self.get_alpha()
-        x_mix = self.time_mix_block(x_mix, timesteps=timesteps)
-        x = alpha * x + (1.0 - alpha) * x_mix  # alpha merge
-
-        x = rearrange(x, "b (h w) c -> b c h w", h=h, w=w)
-        x = self.proj_out(x)
-
-        return x_in + x
-
-    def get_alpha(
-        self,
-    ):
-        if self.merge_strategy == "fixed":
-            return self.mix_factor
-        elif self.merge_strategy == "learned":
-            return torch.sigmoid(self.mix_factor)
-        else:
-            raise NotImplementedError(f"unknown merge strategy {self.merge_strategy}")
-
-
-class MemoryEfficientVideoBlock(MemoryEfficientAttnBlock):
-    def __init__(
-        self, in_channels: int, alpha: float = 0, merge_strategy: str = "learned"
-    ):
-        super().__init__(in_channels)
-        # no context, single headed, as in base class
-        self.time_mix_block = VideoTransformerBlock(
-            dim=in_channels,
-            n_heads=1,
-            d_head=in_channels,
-            checkpoint=False,
-            ff_in=True,
-            attn_mode="softmax-xformers",
-        )
-
-        time_embed_dim = self.in_channels * 4
-        self.video_time_embed = torch.nn.Sequential(
-            torch.nn.Linear(self.in_channels, time_embed_dim),
-            torch.nn.SiLU(),
-            torch.nn.Linear(time_embed_dim, self.in_channels),
-        )
-
-        self.merge_strategy = merge_strategy
-        if self.merge_strategy == "fixed":
-            self.register_buffer("mix_factor", torch.Tensor([alpha]))
-        elif self.merge_strategy == "learned":
-            self.register_parameter(
-                "mix_factor", torch.nn.Parameter(torch.Tensor([alpha]))
-            )
-        else:
-            raise ValueError(f"unknown merge strategy {self.merge_strategy}")
-
-    def forward(self, x, timesteps, skip_time_block=False):
-        if skip_time_block:
-            return super().forward(x)
-
-        x_in = x
-        x = self.attention(x)
-        h, w = x.shape[2:]
-        x = rearrange(x, "b c h w -> b (h w) c")
-
-        x_mix = x
-        num_frames = torch.arange(timesteps, device=x.device)
-        num_frames = repeat(num_frames, "t -> b t", b=x.shape[0] // timesteps)
-        num_frames = rearrange(num_frames, "b t -> (b t)")
-        t_emb = timestep_embedding(num_frames, self.in_channels, repeat_only=False)
-        emb = self.video_time_embed(t_emb)  # b, n_channels
-        emb = emb[:, None, :]
-        x_mix = x_mix + emb
-
-        alpha = self.get_alpha()
-        x_mix = self.time_mix_block(x_mix, timesteps=timesteps)
-        x = alpha * x + (1.0 - alpha) * x_mix  # alpha merge
-
-        x = rearrange(x, "b (h w) c -> b c h w", h=h, w=w)
-        x = self.proj_out(x)
-
-        return x_in + x
-
-    def get_alpha(
-        self,
-    ):
-        if self.merge_strategy == "fixed":
-            return self.mix_factor
-        elif self.merge_strategy == "learned":
-            return torch.sigmoid(self.mix_factor)
-        else:
-            raise NotImplementedError(f"unknown merge strategy {self.merge_strategy}")
-
-
-def make_time_attn(
-    in_channels,
-    attn_type="vanilla",
-    attn_kwargs=None,
-    alpha: float = 0,
-    merge_strategy: str = "learned",
-):
-    assert attn_type in [
-        "vanilla",
-        "vanilla-xformers",
-    ], f"attn_type {attn_type} not supported for spatio-temporal attention"
-    print(
-        f"making spatial and temporal attention of type '{attn_type}' with {in_channels} in_channels"
-    )
-    if not XFORMERS_IS_AVAILABLE and attn_type == "vanilla-xformers":
-        print(
-            f"Attention mode '{attn_type}' is not available. Falling back to vanilla attention. "
-            f"This is not a problem in Pytorch >= 2.0. FYI, you are running with PyTorch version {torch.__version__}"
-        )
-        attn_type = "vanilla"
-
-    if attn_type == "vanilla":
-        assert attn_kwargs is None
-        return partialclass(
-            VideoBlock, in_channels, alpha=alpha, merge_strategy=merge_strategy
-        )
-    elif attn_type == "vanilla-xformers":
-        print(f"building MemoryEfficientAttnBlock with {in_channels} in_channels...")
-        return partialclass(
-            MemoryEfficientVideoBlock,
-            in_channels,
-            alpha=alpha,
-            merge_strategy=merge_strategy,
-        )
-    else:
-        return NotImplementedError()
-
-
-class Conv2DWrapper(torch.nn.Conv2d):
-    def forward(self, input: torch.Tensor, **kwargs) -> torch.Tensor:
-        return super().forward(input)
-
-
-class VideoDecoder(Decoder):
-    available_time_modes = ["all", "conv-only", "attn-only"]
-
-    def __init__(
-        self,
-        *args,
-        video_kernel_size: Union[int, list] = 3,
-        alpha: float = 0.0,
-        merge_strategy: str = "learned",
-        time_mode: str = "conv-only",
-        **kwargs,
-    ):
-        self.video_kernel_size = video_kernel_size
-        self.alpha = alpha
-        self.merge_strategy = merge_strategy
-        self.time_mode = time_mode
-        assert (
-            self.time_mode in self.available_time_modes
-        ), f"time_mode parameter has to be in {self.available_time_modes}"
-        super().__init__(*args, **kwargs)
-
-    def get_last_layer(self, skip_time_mix=False, **kwargs):
-        if self.time_mode == "attn-only":
-            raise NotImplementedError("TODO")
-        else:
-            return (
-                self.conv_out.time_mix_conv.weight
-                if not skip_time_mix
-                else self.conv_out.weight
-            )
-
-    def _make_attn(self) -> Callable:
-        if self.time_mode not in ["conv-only", "only-last-conv"]:
-            return partialclass(
-                make_time_attn,
-                alpha=self.alpha,
-                merge_strategy=self.merge_strategy,
-            )
-        else:
-            return super()._make_attn()
-
-    def _make_conv(self) -> Callable:
-        if self.time_mode != "attn-only":
-            return partialclass(AE3DConv, video_kernel_size=self.video_kernel_size)
-        else:
-            return Conv2DWrapper
-
-    def _make_resblock(self) -> Callable:
-        if self.time_mode not in ["attn-only", "only-last-conv"]:
-            return partialclass(
-                VideoResBlock,
-                video_kernel_size=self.video_kernel_size,
-                alpha=self.alpha,
-                merge_strategy=self.merge_strategy,
-            )
-        else:
-            return super()._make_resblock()
diff --git a/videotuna/models/cogvideo_sat/sgm/modules/autoencoding/vqvae/movq_dec_3d.py b/videotuna/models/cogvideo_sat/sgm/modules/autoencoding/vqvae/movq_dec_3d.py
deleted file mode 100644
index 9c3dc8b2..00000000
--- a/videotuna/models/cogvideo_sat/sgm/modules/autoencoding/vqvae/movq_dec_3d.py
+++ /dev/null
@@ -1,541 +0,0 @@
-# pytorch_diffusion + derived encoder decoder
-import math
-
-import numpy as np
-import torch
-import torch.nn as nn
-from einops import rearrange
-
-from .movq_enc_3d import CausalConv3d, DownSample3D, Upsample3D
-
-
-def cast_tuple(t, length=1):
-    return t if isinstance(t, tuple) else ((t,) * length)
-
-
-def divisible_by(num, den):
-    return (num % den) == 0
-
-
-def is_odd(n):
-    return not divisible_by(n, 2)
-
-
-def get_timestep_embedding(timesteps, embedding_dim):
-    """
-    This matches the implementation in Denoising Diffusion Probabilistic Models:
-    From Fairseq.
-    Build sinusoidal embeddings.
-    This matches the implementation in tensor2tensor, but differs slightly
-    from the description in Section 3.5 of "Attention Is All You Need".
-    """
-    assert len(timesteps.shape) == 1
-
-    half_dim = embedding_dim // 2
-    emb = math.log(10000) / (half_dim - 1)
-    emb = torch.exp(torch.arange(half_dim, dtype=torch.float32) * -emb)
-    emb = emb.to(device=timesteps.device)
-    emb = timesteps.float()[:, None] * emb[None, :]
-    emb = torch.cat([torch.sin(emb), torch.cos(emb)], dim=1)
-    if embedding_dim % 2 == 1:  # zero pad
-        emb = torch.nn.functional.pad(emb, (0, 1, 0, 0))
-    return emb
-
-
-def nonlinearity(x):
-    # swish
-    return x * torch.sigmoid(x)
-
-
-class SpatialNorm3D(nn.Module):
-    def __init__(
-        self,
-        f_channels,
-        zq_channels,
-        norm_layer=nn.GroupNorm,
-        freeze_norm_layer=False,
-        add_conv=False,
-        pad_mode="constant",
-        **norm_layer_params,
-    ):
-        super().__init__()
-        self.norm_layer = norm_layer(num_channels=f_channels, **norm_layer_params)
-        if freeze_norm_layer:
-            for p in self.norm_layer.parameters:
-                p.requires_grad = False
-        self.add_conv = add_conv
-        if self.add_conv:
-            self.conv = CausalConv3d(
-                zq_channels, zq_channels, kernel_size=3, pad_mode=pad_mode
-            )
-        self.conv_y = CausalConv3d(
-            zq_channels, f_channels, kernel_size=1, pad_mode=pad_mode
-        )
-        self.conv_b = CausalConv3d(
-            zq_channels, f_channels, kernel_size=1, pad_mode=pad_mode
-        )
-
-    def forward(self, f, zq):
-        if zq.shape[2] > 1:
-            f_first, f_rest = f[:, :, :1], f[:, :, 1:]
-            f_first_size, f_rest_size = f_first.shape[-3:], f_rest.shape[-3:]
-            zq_first, zq_rest = zq[:, :, :1], zq[:, :, 1:]
-            zq_first = torch.nn.functional.interpolate(
-                zq_first, size=f_first_size, mode="nearest"
-            )
-            zq_rest = torch.nn.functional.interpolate(
-                zq_rest, size=f_rest_size, mode="nearest"
-            )
-            zq = torch.cat([zq_first, zq_rest], dim=2)
-        else:
-            zq = torch.nn.functional.interpolate(zq, size=f.shape[-3:], mode="nearest")
-        if self.add_conv:
-            zq = self.conv(zq)
-        norm_f = self.norm_layer(f)
-        new_f = norm_f * self.conv_y(zq) + self.conv_b(zq)
-        return new_f
-
-
-def Normalize3D(in_channels, zq_ch, add_conv):
-    return SpatialNorm3D(
-        in_channels,
-        zq_ch,
-        norm_layer=nn.GroupNorm,
-        freeze_norm_layer=False,
-        add_conv=add_conv,
-        num_groups=32,
-        eps=1e-6,
-        affine=True,
-    )
-
-
-class ResnetBlock3D(nn.Module):
-    def __init__(
-        self,
-        *,
-        in_channels,
-        out_channels=None,
-        conv_shortcut=False,
-        dropout,
-        temb_channels=512,
-        zq_ch=None,
-        add_conv=False,
-        pad_mode="constant",
-    ):
-        super().__init__()
-        self.in_channels = in_channels
-        out_channels = in_channels if out_channels is None else out_channels
-        self.out_channels = out_channels
-        self.use_conv_shortcut = conv_shortcut
-
-        self.norm1 = Normalize3D(in_channels, zq_ch, add_conv=add_conv)
-        self.conv1 = CausalConv3d(
-            in_channels, out_channels, kernel_size=3, pad_mode=pad_mode
-        )
-        if temb_channels > 0:
-            self.temb_proj = torch.nn.Linear(temb_channels, out_channels)
-        self.norm2 = Normalize3D(out_channels, zq_ch, add_conv=add_conv)
-        self.dropout = torch.nn.Dropout(dropout)
-        self.conv2 = CausalConv3d(
-            out_channels, out_channels, kernel_size=3, pad_mode=pad_mode
-        )
-        if self.in_channels != self.out_channels:
-            if self.use_conv_shortcut:
-                self.conv_shortcut = CausalConv3d(
-                    in_channels, out_channels, kernel_size=3, pad_mode=pad_mode
-                )
-            else:
-                self.nin_shortcut = torch.nn.Conv3d(
-                    in_channels, out_channels, kernel_size=1, stride=1, padding=0
-                )
-
-    def forward(self, x, temb, zq):
-        h = x
-        h = self.norm1(h, zq)
-        h = nonlinearity(h)
-        h = self.conv1(h)
-
-        if temb is not None:
-            h = h + self.temb_proj(nonlinearity(temb))[:, :, None, None, None]
-
-        h = self.norm2(h, zq)
-        h = nonlinearity(h)
-        h = self.dropout(h)
-        h = self.conv2(h)
-
-        if self.in_channels != self.out_channels:
-            if self.use_conv_shortcut:
-                x = self.conv_shortcut(x)
-            else:
-                x = self.nin_shortcut(x)
-
-        return x + h
-
-
-class AttnBlock2D(nn.Module):
-    def __init__(self, in_channels, zq_ch=None, add_conv=False):
-        super().__init__()
-        self.in_channels = in_channels
-
-        self.norm = Normalize3D(in_channels, zq_ch, add_conv=add_conv)
-        self.q = torch.nn.Conv2d(
-            in_channels, in_channels, kernel_size=1, stride=1, padding=0
-        )
-        self.k = torch.nn.Conv2d(
-            in_channels, in_channels, kernel_size=1, stride=1, padding=0
-        )
-        self.v = torch.nn.Conv2d(
-            in_channels, in_channels, kernel_size=1, stride=1, padding=0
-        )
-        self.proj_out = torch.nn.Conv2d(
-            in_channels, in_channels, kernel_size=1, stride=1, padding=0
-        )
-
-    def forward(self, x, zq):
-        h_ = x
-        h_ = self.norm(h_, zq)
-
-        t = h_.shape[2]
-        h_ = rearrange(h_, "b c t h w -> (b t) c h w")
-
-        q = self.q(h_)
-        k = self.k(h_)
-        v = self.v(h_)
-
-        # compute attention
-        b, c, h, w = q.shape
-        q = q.reshape(b, c, h * w)
-        q = q.permute(0, 2, 1)  # b,hw,c
-        k = k.reshape(b, c, h * w)  # b,c,hw
-        w_ = torch.bmm(q, k)  # b,hw,hw    w[b,i,j]=sum_c q[b,i,c]k[b,c,j]
-        w_ = w_ * (int(c) ** (-0.5))
-        w_ = torch.nn.functional.softmax(w_, dim=2)
-
-        # attend to values
-        v = v.reshape(b, c, h * w)
-        w_ = w_.permute(0, 2, 1)  # b,hw,hw (first hw of k, second of q)
-        h_ = torch.bmm(v, w_)  # b, c,hw (hw of q) h_[b,c,j] = sum_i v[b,c,i] w_[b,i,j]
-        h_ = h_.reshape(b, c, h, w)
-
-        h_ = self.proj_out(h_)
-
-        h_ = rearrange(h_, "(b t) c h w -> b c t h w", t=t)
-
-        return x + h_
-
-
-class MOVQDecoder3D(nn.Module):
-    def __init__(
-        self,
-        *,
-        ch,
-        out_ch,
-        ch_mult=(1, 2, 4, 8),
-        num_res_blocks,
-        attn_resolutions,
-        dropout=0.0,
-        resamp_with_conv=True,
-        in_channels,
-        resolution,
-        z_channels,
-        give_pre_end=False,
-        zq_ch=None,
-        add_conv=False,
-        pad_mode="first",
-        temporal_compress_times=4,
-        **ignorekwargs,
-    ):
-        super().__init__()
-        self.ch = ch
-        self.temb_ch = 0
-        self.num_resolutions = len(ch_mult)
-        self.num_res_blocks = num_res_blocks
-        self.resolution = resolution
-        self.in_channels = in_channels
-        self.give_pre_end = give_pre_end
-
-        # log2 of temporal_compress_times
-        self.temporal_compress_level = int(np.log2(temporal_compress_times))
-
-        if zq_ch is None:
-            zq_ch = z_channels
-
-        block_in = ch * ch_mult[self.num_resolutions - 1]
-        curr_res = resolution // 2 ** (self.num_resolutions - 1)
-        self.z_shape = (1, z_channels, curr_res, curr_res)
-
-        self.conv_in = CausalConv3d(
-            z_channels, block_in, kernel_size=3, pad_mode=pad_mode
-        )
-
-        # middle
-        self.mid = nn.Module()
-        self.mid.block_1 = ResnetBlock3D(
-            in_channels=block_in,
-            out_channels=block_in,
-            temb_channels=self.temb_ch,
-            dropout=dropout,
-            zq_ch=zq_ch,
-            add_conv=add_conv,
-            pad_mode=pad_mode,
-        )
-
-        self.mid.block_2 = ResnetBlock3D(
-            in_channels=block_in,
-            out_channels=block_in,
-            temb_channels=self.temb_ch,
-            dropout=dropout,
-            zq_ch=zq_ch,
-            add_conv=add_conv,
-            pad_mode=pad_mode,
-        )
-
-        # upsampling
-        self.up = nn.ModuleList()
-        for i_level in reversed(range(self.num_resolutions)):
-            block = nn.ModuleList()
-            attn = nn.ModuleList()
-            block_out = ch * ch_mult[i_level]
-            for i_block in range(self.num_res_blocks + 1):
-                block.append(
-                    ResnetBlock3D(
-                        in_channels=block_in,
-                        out_channels=block_out,
-                        temb_channels=self.temb_ch,
-                        dropout=dropout,
-                        zq_ch=zq_ch,
-                        add_conv=add_conv,
-                        pad_mode=pad_mode,
-                    )
-                )
-                block_in = block_out
-                if curr_res in attn_resolutions:
-                    attn.append(AttnBlock2D(block_in, zq_ch, add_conv=add_conv))
-            up = nn.Module()
-            up.block = block
-            up.attn = attn
-            if i_level != 0:
-                if i_level < self.num_resolutions - self.temporal_compress_level:
-                    up.upsample = Upsample3D(
-                        block_in, resamp_with_conv, compress_time=False
-                    )
-                else:
-                    up.upsample = Upsample3D(
-                        block_in, resamp_with_conv, compress_time=True
-                    )
-                curr_res = curr_res * 2
-            self.up.insert(0, up)  # prepend to get consistent order
-
-        self.norm_out = Normalize3D(block_in, zq_ch, add_conv=add_conv)
-        self.conv_out = CausalConv3d(block_in, out_ch, kernel_size=3, pad_mode=pad_mode)
-
-    def forward(self, z, use_cp=False):
-        self.last_z_shape = z.shape
-
-        # timestep embedding
-        temb = None
-
-        t = z.shape[2]
-        # z to block_in
-
-        zq = z
-        h = self.conv_in(z)
-
-        # middle
-        h = self.mid.block_1(h, temb, zq)
-        # h = self.mid.attn_1(h, zq)
-        h = self.mid.block_2(h, temb, zq)
-
-        # upsampling
-        for i_level in reversed(range(self.num_resolutions)):
-            for i_block in range(self.num_res_blocks + 1):
-                h = self.up[i_level].block[i_block](h, temb, zq)
-                if len(self.up[i_level].attn) > 0:
-                    h = self.up[i_level].attn[i_block](h, zq)
-            if i_level != 0:
-                h = self.up[i_level].upsample(h)
-
-        # end
-        if self.give_pre_end:
-            return h
-
-        h = self.norm_out(h, zq)
-        h = nonlinearity(h)
-        h = self.conv_out(h)
-        return h
-
-    def get_last_layer(self):
-        return self.conv_out.conv.weight
-
-
-class NewDecoder3D(nn.Module):
-    def __init__(
-        self,
-        *,
-        ch,
-        out_ch,
-        ch_mult=(1, 2, 4, 8),
-        num_res_blocks,
-        attn_resolutions,
-        dropout=0.0,
-        resamp_with_conv=True,
-        in_channels,
-        resolution,
-        z_channels,
-        give_pre_end=False,
-        zq_ch=None,
-        add_conv=False,
-        pad_mode="first",
-        temporal_compress_times=4,
-        post_quant_conv=False,
-        **ignorekwargs,
-    ):
-        super().__init__()
-        self.ch = ch
-        self.temb_ch = 0
-        self.num_resolutions = len(ch_mult)
-        self.num_res_blocks = num_res_blocks
-        self.resolution = resolution
-        self.in_channels = in_channels
-        self.give_pre_end = give_pre_end
-
-        # log2 of temporal_compress_times
-        self.temporal_compress_level = int(np.log2(temporal_compress_times))
-
-        if zq_ch is None:
-            zq_ch = z_channels
-        if post_quant_conv:
-            self.post_quant_conv = CausalConv3d(
-                zq_ch, z_channels, kernel_size=3, pad_mode=pad_mode
-            )
-        else:
-            self.post_quant_conv = None
-
-        # compute in_ch_mult, block_in and curr_res at lowest res
-        in_ch_mult = (1,) + tuple(ch_mult)
-        block_in = ch * ch_mult[self.num_resolutions - 1]
-        curr_res = resolution // 2 ** (self.num_resolutions - 1)
-        self.z_shape = (1, z_channels, curr_res, curr_res)
-        print(
-            "Working with z of shape {} = {} dimensions.".format(
-                self.z_shape, np.prod(self.z_shape)
-            )
-        )
-
-        # z to block_in
-        # self.conv_in = torch.nn.Conv3d(z_channels,
-        #                                block_in,
-        #                                kernel_size=3,
-        #                                stride=1,
-        #                                padding=1)
-        self.conv_in = CausalConv3d(
-            z_channels, block_in, kernel_size=3, pad_mode=pad_mode
-        )
-
-        # middle
-        self.mid = nn.Module()
-        self.mid.block_1 = ResnetBlock3D(
-            in_channels=block_in,
-            out_channels=block_in,
-            temb_channels=self.temb_ch,
-            dropout=dropout,
-            zq_ch=zq_ch,
-            add_conv=add_conv,
-            pad_mode=pad_mode,
-        )
-        # remove attention block
-        # self.mid.attn_1 = AttnBlock2D(block_in, zq_ch, add_conv=add_conv)
-        self.mid.block_2 = ResnetBlock3D(
-            in_channels=block_in,
-            out_channels=block_in,
-            temb_channels=self.temb_ch,
-            dropout=dropout,
-            zq_ch=zq_ch,
-            add_conv=add_conv,
-            pad_mode=pad_mode,
-        )
-
-        # upsampling
-        self.up = nn.ModuleList()
-        for i_level in reversed(range(self.num_resolutions)):
-            block = nn.ModuleList()
-            attn = nn.ModuleList()
-            block_out = ch * ch_mult[i_level]
-            for i_block in range(self.num_res_blocks + 1):
-                block.append(
-                    ResnetBlock3D(
-                        in_channels=block_in,
-                        out_channels=block_out,
-                        temb_channels=self.temb_ch,
-                        dropout=dropout,
-                        zq_ch=zq_ch,
-                        add_conv=add_conv,
-                        pad_mode=pad_mode,
-                    )
-                )
-                block_in = block_out
-                if curr_res in attn_resolutions:
-                    attn.append(AttnBlock2D(block_in, zq_ch, add_conv=add_conv))
-            up = nn.Module()
-            up.block = block
-            up.attn = attn
-            if i_level != 0:
-                if i_level < self.num_resolutions - self.temporal_compress_level:
-                    up.upsample = Upsample3D(
-                        block_in, resamp_with_conv, compress_time=False
-                    )
-                else:
-                    up.upsample = Upsample3D(
-                        block_in, resamp_with_conv, compress_time=True
-                    )
-                curr_res = curr_res * 2
-            self.up.insert(0, up)  # prepend to get consistent order
-
-        self.norm_out = Normalize3D(block_in, zq_ch, add_conv=add_conv)
-        # self.conv_out = torch.nn.Conv3d(block_in,
-        #                                 out_ch,
-        #                                 kernel_size=3,
-        #                                 stride=1,
-        #                                 padding=1)
-        self.conv_out = CausalConv3d(block_in, out_ch, kernel_size=3, pad_mode=pad_mode)
-
-    def forward(self, z):
-        # assert z.shape[1:] == self.z_shape[1:]
-        self.last_z_shape = z.shape
-
-        # timestep embedding
-        temb = None
-
-        t = z.shape[2]
-        # z to block_in
-
-        zq = z
-        if self.post_quant_conv is not None:
-            z = self.post_quant_conv(z)
-        h = self.conv_in(z)
-
-        # middle
-        h = self.mid.block_1(h, temb, zq)
-        # h = self.mid.attn_1(h, zq)
-        h = self.mid.block_2(h, temb, zq)
-
-        # upsampling
-        for i_level in reversed(range(self.num_resolutions)):
-            for i_block in range(self.num_res_blocks + 1):
-                h = self.up[i_level].block[i_block](h, temb, zq)
-                if len(self.up[i_level].attn) > 0:
-                    h = self.up[i_level].attn[i_block](h, zq)
-            if i_level != 0:
-                h = self.up[i_level].upsample(h)
-
-        # end
-        if self.give_pre_end:
-            return h
-
-        h = self.norm_out(h, zq)
-        h = nonlinearity(h)
-        h = self.conv_out(h)
-        return h
-
-    def get_last_layer(self):
-        return self.conv_out.conv.weight
diff --git a/videotuna/models/cogvideo_sat/sgm/modules/autoencoding/vqvae/movq_dec_3d_dev.py b/videotuna/models/cogvideo_sat/sgm/modules/autoencoding/vqvae/movq_dec_3d_dev.py
deleted file mode 100644
index 008eab2b..00000000
--- a/videotuna/models/cogvideo_sat/sgm/modules/autoencoding/vqvae/movq_dec_3d_dev.py
+++ /dev/null
@@ -1,583 +0,0 @@
-# pytorch_diffusion + derived encoder decoder
-import math
-
-import numpy as np
-import torch
-import torch.nn as nn
-import torch.nn.functional as F
-from beartype import beartype
-from beartype.typing import List, Optional, Tuple, Union
-from einops import rearrange
-
-from .movq_enc_3d import CausalConv3d, DownSample3D, Upsample3D
-
-
-def cast_tuple(t, length=1):
-    return t if isinstance(t, tuple) else ((t,) * length)
-
-
-def divisible_by(num, den):
-    return (num % den) == 0
-
-
-def is_odd(n):
-    return not divisible_by(n, 2)
-
-
-def get_timestep_embedding(timesteps, embedding_dim):
-    """
-    This matches the implementation in Denoising Diffusion Probabilistic Models:
-    From Fairseq.
-    Build sinusoidal embeddings.
-    This matches the implementation in tensor2tensor, but differs slightly
-    from the description in Section 3.5 of "Attention Is All You Need".
-    """
-    assert len(timesteps.shape) == 1
-
-    half_dim = embedding_dim // 2
-    emb = math.log(10000) / (half_dim - 1)
-    emb = torch.exp(torch.arange(half_dim, dtype=torch.float32) * -emb)
-    emb = emb.to(device=timesteps.device)
-    emb = timesteps.float()[:, None] * emb[None, :]
-    emb = torch.cat([torch.sin(emb), torch.cos(emb)], dim=1)
-    if embedding_dim % 2 == 1:  # zero pad
-        emb = torch.nn.functional.pad(emb, (0, 1, 0, 0))
-    return emb
-
-
-def nonlinearity(x):
-    # swish
-    return x * torch.sigmoid(x)
-
-
-class SpatialNorm3D(nn.Module):
-    def __init__(
-        self,
-        f_channels,
-        zq_channels,
-        norm_layer=nn.GroupNorm,
-        freeze_norm_layer=False,
-        add_conv=False,
-        pad_mode="constant",
-        **norm_layer_params,
-    ):
-        super().__init__()
-        self.norm_layer = norm_layer(num_channels=f_channels, **norm_layer_params)
-        if freeze_norm_layer:
-            for p in self.norm_layer.parameters:
-                p.requires_grad = False
-        self.add_conv = add_conv
-        if self.add_conv:
-            # self.conv = nn.Conv3d(zq_channels, zq_channels, kernel_size=3, stride=1, padding=1)
-            self.conv = CausalConv3d(
-                zq_channels, zq_channels, kernel_size=3, pad_mode=pad_mode
-            )
-        # self.conv_y = nn.Conv3d(zq_channels, f_channels, kernel_size=1, stride=1, padding=0)
-        # self.conv_b = nn.Conv3d(zq_channels, f_channels, kernel_size=1, stride=1, padding=0)
-        self.conv_y = CausalConv3d(
-            zq_channels, f_channels, kernel_size=1, pad_mode=pad_mode
-        )
-        self.conv_b = CausalConv3d(
-            zq_channels, f_channels, kernel_size=1, pad_mode=pad_mode
-        )
-
-    def forward(self, f, zq):
-        if zq.shape[2] > 1:
-            f_first, f_rest = f[:, :, :1], f[:, :, 1:]
-            f_first_size, f_rest_size = f_first.shape[-3:], f_rest.shape[-3:]
-            zq_first, zq_rest = zq[:, :, :1], zq[:, :, 1:]
-            zq_first = torch.nn.functional.interpolate(
-                zq_first, size=f_first_size, mode="nearest"
-            )
-            zq_rest = torch.nn.functional.interpolate(
-                zq_rest, size=f_rest_size, mode="nearest"
-            )
-            zq = torch.cat([zq_first, zq_rest], dim=2)
-        else:
-            zq = torch.nn.functional.interpolate(zq, size=f.shape[-3:], mode="nearest")
-        if self.add_conv:
-            zq = self.conv(zq)
-        norm_f = self.norm_layer(f)
-        new_f = norm_f * self.conv_y(zq) + self.conv_b(zq)
-        return new_f
-
-
-def Normalize3D(in_channels, zq_ch, add_conv):
-    return SpatialNorm3D(
-        in_channels,
-        zq_ch,
-        norm_layer=nn.GroupNorm,
-        freeze_norm_layer=False,
-        add_conv=add_conv,
-        num_groups=32,
-        eps=1e-6,
-        affine=True,
-    )
-
-
-class ResnetBlock3D(nn.Module):
-    def __init__(
-        self,
-        *,
-        in_channels,
-        out_channels=None,
-        conv_shortcut=False,
-        dropout,
-        temb_channels=512,
-        zq_ch=None,
-        add_conv=False,
-        pad_mode="constant",
-    ):
-        super().__init__()
-        self.in_channels = in_channels
-        out_channels = in_channels if out_channels is None else out_channels
-        self.out_channels = out_channels
-        self.use_conv_shortcut = conv_shortcut
-
-        self.norm1 = Normalize3D(in_channels, zq_ch, add_conv=add_conv)
-        # self.conv1 = torch.nn.Conv3d(in_channels,
-        #                              out_channels,
-        #                              kernel_size=3,
-        #                              stride=1,
-        #                              padding=1)
-        self.conv1 = CausalConv3d(
-            in_channels, out_channels, kernel_size=3, pad_mode=pad_mode
-        )
-        if temb_channels > 0:
-            self.temb_proj = torch.nn.Linear(temb_channels, out_channels)
-        self.norm2 = Normalize3D(out_channels, zq_ch, add_conv=add_conv)
-        self.dropout = torch.nn.Dropout(dropout)
-        # self.conv2 = torch.nn.Conv3d(out_channels,
-        #                              out_channels,
-        #                              kernel_size=3,
-        #                              stride=1,
-        #                              padding=1)
-        self.conv2 = CausalConv3d(
-            out_channels, out_channels, kernel_size=3, pad_mode=pad_mode
-        )
-        if self.in_channels != self.out_channels:
-            if self.use_conv_shortcut:
-                # self.conv_shortcut = torch.nn.Conv3d(in_channels,
-                #                                      out_channels,
-                #                                      kernel_size=3,
-                #                                      stride=1,
-                #                                      padding=1)
-                self.conv_shortcut = CausalConv3d(
-                    in_channels, out_channels, kernel_size=3, pad_mode=pad_mode
-                )
-            else:
-                self.nin_shortcut = torch.nn.Conv3d(
-                    in_channels, out_channels, kernel_size=1, stride=1, padding=0
-                )
-                # self.nin_shortcut = CausalConv3d(in_channels, out_channels, kernel_size=1, pad_mode=pad_mode)
-
-    def forward(self, x, temb, zq):
-        h = x
-        h = self.norm1(h, zq)
-        h = nonlinearity(h)
-        h = self.conv1(h)
-
-        if temb is not None:
-            h = h + self.temb_proj(nonlinearity(temb))[:, :, None, None, None]
-
-        h = self.norm2(h, zq)
-        h = nonlinearity(h)
-        h = self.dropout(h)
-        h = self.conv2(h)
-
-        if self.in_channels != self.out_channels:
-            if self.use_conv_shortcut:
-                x = self.conv_shortcut(x)
-            else:
-                x = self.nin_shortcut(x)
-
-        return x + h
-
-
-class AttnBlock2D(nn.Module):
-    def __init__(self, in_channels, zq_ch=None, add_conv=False):
-        super().__init__()
-        self.in_channels = in_channels
-
-        self.norm = Normalize3D(in_channels, zq_ch, add_conv=add_conv)
-        self.q = torch.nn.Conv2d(
-            in_channels, in_channels, kernel_size=1, stride=1, padding=0
-        )
-        self.k = torch.nn.Conv2d(
-            in_channels, in_channels, kernel_size=1, stride=1, padding=0
-        )
-        self.v = torch.nn.Conv2d(
-            in_channels, in_channels, kernel_size=1, stride=1, padding=0
-        )
-        self.proj_out = torch.nn.Conv2d(
-            in_channels, in_channels, kernel_size=1, stride=1, padding=0
-        )
-
-    def forward(self, x, zq):
-        h_ = x
-        h_ = self.norm(h_, zq)
-
-        t = h_.shape[2]
-        h_ = rearrange(h_, "b c t h w -> (b t) c h w")
-
-        q = self.q(h_)
-        k = self.k(h_)
-        v = self.v(h_)
-
-        # compute attention
-        b, c, h, w = q.shape
-        q = q.reshape(b, c, h * w)
-        q = q.permute(0, 2, 1)  # b,hw,c
-        k = k.reshape(b, c, h * w)  # b,c,hw
-        w_ = torch.bmm(q, k)  # b,hw,hw    w[b,i,j]=sum_c q[b,i,c]k[b,c,j]
-        w_ = w_ * (int(c) ** (-0.5))
-        w_ = torch.nn.functional.softmax(w_, dim=2)
-
-        # attend to values
-        v = v.reshape(b, c, h * w)
-        w_ = w_.permute(0, 2, 1)  # b,hw,hw (first hw of k, second of q)
-        h_ = torch.bmm(v, w_)  # b, c,hw (hw of q) h_[b,c,j] = sum_i v[b,c,i] w_[b,i,j]
-        h_ = h_.reshape(b, c, h, w)
-
-        h_ = self.proj_out(h_)
-
-        h_ = rearrange(h_, "(b t) c h w -> b c t h w", t=t)
-
-        return x + h_
-
-
-class MOVQDecoder3D(nn.Module):
-    def __init__(
-        self,
-        *,
-        ch,
-        out_ch,
-        ch_mult=(1, 2, 4, 8),
-        num_res_blocks,
-        attn_resolutions,
-        dropout=0.0,
-        resamp_with_conv=True,
-        in_channels,
-        resolution,
-        z_channels,
-        give_pre_end=False,
-        zq_ch=None,
-        add_conv=False,
-        pad_mode="first",
-        temporal_compress_times=4,
-        **ignorekwargs,
-    ):
-        super().__init__()
-        self.ch = ch
-        self.temb_ch = 0
-        self.num_resolutions = len(ch_mult)
-        self.num_res_blocks = num_res_blocks
-        self.resolution = resolution
-        self.in_channels = in_channels
-        self.give_pre_end = give_pre_end
-
-        # log2 of temporal_compress_times
-        self.temporal_compress_level = int(np.log2(temporal_compress_times))
-
-        if zq_ch is None:
-            zq_ch = z_channels
-
-        # compute in_ch_mult, block_in and curr_res at lowest res
-        in_ch_mult = (1,) + tuple(ch_mult)
-        block_in = ch * ch_mult[self.num_resolutions - 1]
-        curr_res = resolution // 2 ** (self.num_resolutions - 1)
-        self.z_shape = (1, z_channels, curr_res, curr_res)
-        print(
-            "Working with z of shape {} = {} dimensions.".format(
-                self.z_shape, np.prod(self.z_shape)
-            )
-        )
-
-        # z to block_in
-        # self.conv_in = torch.nn.Conv3d(z_channels,
-        #                                block_in,
-        #                                kernel_size=3,
-        #                                stride=1,
-        #                                padding=1)
-        self.conv_in = CausalConv3d(
-            z_channels, block_in, kernel_size=3, pad_mode=pad_mode
-        )
-
-        # middle
-        self.mid = nn.Module()
-        self.mid.block_1 = ResnetBlock3D(
-            in_channels=block_in,
-            out_channels=block_in,
-            temb_channels=self.temb_ch,
-            dropout=dropout,
-            zq_ch=zq_ch,
-            add_conv=add_conv,
-            pad_mode=pad_mode,
-        )
-        # remove attention block
-        # self.mid.attn_1 = AttnBlock2D(block_in, zq_ch, add_conv=add_conv)
-        self.mid.block_2 = ResnetBlock3D(
-            in_channels=block_in,
-            out_channels=block_in,
-            temb_channels=self.temb_ch,
-            dropout=dropout,
-            zq_ch=zq_ch,
-            add_conv=add_conv,
-            pad_mode=pad_mode,
-        )
-
-        # upsampling
-        self.up = nn.ModuleList()
-        for i_level in reversed(range(self.num_resolutions)):
-            block = nn.ModuleList()
-            attn = nn.ModuleList()
-            block_out = ch * ch_mult[i_level]
-            for i_block in range(self.num_res_blocks + 1):
-                block.append(
-                    ResnetBlock3D(
-                        in_channels=block_in,
-                        out_channels=block_out,
-                        temb_channels=self.temb_ch,
-                        dropout=dropout,
-                        zq_ch=zq_ch,
-                        add_conv=add_conv,
-                        pad_mode=pad_mode,
-                    )
-                )
-                block_in = block_out
-                if curr_res in attn_resolutions:
-                    attn.append(AttnBlock2D(block_in, zq_ch, add_conv=add_conv))
-            up = nn.Module()
-            up.block = block
-            up.attn = attn
-            if i_level != 0:
-                if i_level < self.num_resolutions - self.temporal_compress_level:
-                    up.upsample = Upsample3D(
-                        block_in, resamp_with_conv, compress_time=False
-                    )
-                else:
-                    up.upsample = Upsample3D(
-                        block_in, resamp_with_conv, compress_time=True
-                    )
-                curr_res = curr_res * 2
-            self.up.insert(0, up)  # prepend to get consistent order
-
-        self.norm_out = Normalize3D(block_in, zq_ch, add_conv=add_conv)
-        # self.conv_out = torch.nn.Conv3d(block_in,
-        #                                 out_ch,
-        #                                 kernel_size=3,
-        #                                 stride=1,
-        #                                 padding=1)
-        self.conv_out = CausalConv3d(block_in, out_ch, kernel_size=3, pad_mode=pad_mode)
-
-    def forward(self, z, use_cp=False):
-        # assert z.shape[1:] == self.z_shape[1:]
-        self.last_z_shape = z.shape
-
-        # timestep embedding
-        temb = None
-
-        t = z.shape[2]
-        # z to block_in
-
-        zq = z
-        h = self.conv_in(z)
-
-        # middle
-        h = self.mid.block_1(h, temb, zq)
-        # h = self.mid.attn_1(h, zq)
-        h = self.mid.block_2(h, temb, zq)
-
-        # upsampling
-        for i_level in reversed(range(self.num_resolutions)):
-            for i_block in range(self.num_res_blocks + 1):
-                h = self.up[i_level].block[i_block](h, temb, zq)
-                if len(self.up[i_level].attn) > 0:
-                    h = self.up[i_level].attn[i_block](h, zq)
-            if i_level != 0:
-                h = self.up[i_level].upsample(h)
-
-        # end
-        if self.give_pre_end:
-            return h
-
-        h = self.norm_out(h, zq)
-        h = nonlinearity(h)
-        h = self.conv_out(h)
-        return h
-
-    def get_last_layer(self):
-        return self.conv_out.conv.weight
-
-
-class NewDecoder3D(nn.Module):
-    def __init__(
-        self,
-        *,
-        ch,
-        out_ch,
-        ch_mult=(1, 2, 4, 8),
-        num_res_blocks,
-        attn_resolutions,
-        dropout=0.0,
-        resamp_with_conv=True,
-        in_channels,
-        resolution,
-        z_channels,
-        give_pre_end=False,
-        zq_ch=None,
-        add_conv=False,
-        pad_mode="first",
-        temporal_compress_times=4,
-        post_quant_conv=False,
-        **ignorekwargs,
-    ):
-        super().__init__()
-        self.ch = ch
-        self.temb_ch = 0
-        self.num_resolutions = len(ch_mult)
-        self.num_res_blocks = num_res_blocks
-        self.resolution = resolution
-        self.in_channels = in_channels
-        self.give_pre_end = give_pre_end
-
-        # log2 of temporal_compress_times
-        self.temporal_compress_level = int(np.log2(temporal_compress_times))
-
-        if zq_ch is None:
-            zq_ch = z_channels
-        if post_quant_conv:
-            self.post_quant_conv = CausalConv3d(
-                zq_ch, z_channels, kernel_size=3, pad_mode=pad_mode
-            )
-        else:
-            self.post_quant_conv = None
-
-        # compute in_ch_mult, block_in and curr_res at lowest res
-        in_ch_mult = (1,) + tuple(ch_mult)
-        block_in = ch * ch_mult[self.num_resolutions - 1]
-        curr_res = resolution // 2 ** (self.num_resolutions - 1)
-        self.z_shape = (1, z_channels, curr_res, curr_res)
-        print(
-            "Working with z of shape {} = {} dimensions.".format(
-                self.z_shape, np.prod(self.z_shape)
-            )
-        )
-
-        # z to block_in
-        # self.conv_in = torch.nn.Conv3d(z_channels,
-        #                                block_in,
-        #                                kernel_size=3,
-        #                                stride=1,
-        #                                padding=1)
-        self.conv_in = CausalConv3d(
-            z_channels, block_in, kernel_size=3, pad_mode=pad_mode
-        )
-
-        # middle
-        self.mid = nn.Module()
-        self.mid.block_1 = ResnetBlock3D(
-            in_channels=block_in,
-            out_channels=block_in,
-            temb_channels=self.temb_ch,
-            dropout=dropout,
-            zq_ch=zq_ch,
-            add_conv=add_conv,
-            pad_mode=pad_mode,
-        )
-        # remove attention block
-        # self.mid.attn_1 = AttnBlock2D(block_in, zq_ch, add_conv=add_conv)
-        self.mid.block_2 = ResnetBlock3D(
-            in_channels=block_in,
-            out_channels=block_in,
-            temb_channels=self.temb_ch,
-            dropout=dropout,
-            zq_ch=zq_ch,
-            add_conv=add_conv,
-            pad_mode=pad_mode,
-        )
-
-        # upsampling
-        self.up = nn.ModuleList()
-        for i_level in reversed(range(self.num_resolutions)):
-            block = nn.ModuleList()
-            attn = nn.ModuleList()
-            block_out = ch * ch_mult[i_level]
-            for i_block in range(self.num_res_blocks + 1):
-                block.append(
-                    ResnetBlock3D(
-                        in_channels=block_in,
-                        out_channels=block_out,
-                        temb_channels=self.temb_ch,
-                        dropout=dropout,
-                        zq_ch=zq_ch,
-                        add_conv=add_conv,
-                        pad_mode=pad_mode,
-                    )
-                )
-                block_in = block_out
-                if curr_res in attn_resolutions:
-                    attn.append(AttnBlock2D(block_in, zq_ch, add_conv=add_conv))
-            up = nn.Module()
-            up.block = block
-            up.attn = attn
-            if i_level != 0:
-                if i_level < self.num_resolutions - self.temporal_compress_level:
-                    up.upsample = Upsample3D(
-                        block_in, resamp_with_conv, compress_time=False
-                    )
-                else:
-                    up.upsample = Upsample3D(
-                        block_in, resamp_with_conv, compress_time=True
-                    )
-                curr_res = curr_res * 2
-            self.up.insert(0, up)  # prepend to get consistent order
-
-        self.norm_out = Normalize3D(block_in, zq_ch, add_conv=add_conv)
-        # self.conv_out = torch.nn.Conv3d(block_in,
-        #                                 out_ch,
-        #                                 kernel_size=3,
-        #                                 stride=1,
-        #                                 padding=1)
-        self.conv_out = CausalConv3d(block_in, out_ch, kernel_size=3, pad_mode=pad_mode)
-
-    def forward(self, z):
-        # assert z.shape[1:] == self.z_shape[1:]
-        self.last_z_shape = z.shape
-
-        # timestep embedding
-        temb = None
-
-        t = z.shape[2]
-        # z to block_in
-
-        zq = z
-        if self.post_quant_conv is not None:
-            z = self.post_quant_conv(z)
-        h = self.conv_in(z)
-
-        # middle
-        h = self.mid.block_1(h, temb, zq)
-        # h = self.mid.attn_1(h, zq)
-        h = self.mid.block_2(h, temb, zq)
-
-        # upsampling
-        for i_level in reversed(range(self.num_resolutions)):
-            for i_block in range(self.num_res_blocks + 1):
-                h = self.up[i_level].block[i_block](h, temb, zq)
-                if len(self.up[i_level].attn) > 0:
-                    h = self.up[i_level].attn[i_block](h, zq)
-            if i_level != 0:
-                h = self.up[i_level].upsample(h)
-
-        # end
-        if self.give_pre_end:
-            return h
-
-        h = self.norm_out(h, zq)
-        h = nonlinearity(h)
-        h = self.conv_out(h)
-        return h
-
-    def get_last_layer(self):
-        return self.conv_out.conv.weight
diff --git a/videotuna/models/cogvideo_sat/sgm/modules/autoencoding/vqvae/movq_enc_3d.py b/videotuna/models/cogvideo_sat/sgm/modules/autoencoding/vqvae/movq_enc_3d.py
deleted file mode 100644
index 2596d328..00000000
--- a/videotuna/models/cogvideo_sat/sgm/modules/autoencoding/vqvae/movq_enc_3d.py
+++ /dev/null
@@ -1,497 +0,0 @@
-# pytorch_diffusion + derived encoder decoder
-import math
-
-import numpy as np
-import torch
-import torch.nn as nn
-import torch.nn.functional as F
-from beartype import beartype
-from beartype.typing import List, Optional, Tuple, Union
-from einops import rearrange
-
-
-def cast_tuple(t, length=1):
-    return t if isinstance(t, tuple) else ((t,) * length)
-
-
-def divisible_by(num, den):
-    return (num % den) == 0
-
-
-def is_odd(n):
-    return not divisible_by(n, 2)
-
-
-def get_timestep_embedding(timesteps, embedding_dim):
-    """
-    This matches the implementation in Denoising Diffusion Probabilistic Models:
-    From Fairseq.
-    Build sinusoidal embeddings.
-    This matches the implementation in tensor2tensor, but differs slightly
-    from the description in Section 3.5 of "Attention Is All You Need".
-    """
-    assert len(timesteps.shape) == 1
-
-    half_dim = embedding_dim // 2
-    emb = math.log(10000) / (half_dim - 1)
-    emb = torch.exp(torch.arange(half_dim, dtype=torch.float32) * -emb)
-    emb = emb.to(device=timesteps.device)
-    emb = timesteps.float()[:, None] * emb[None, :]
-    emb = torch.cat([torch.sin(emb), torch.cos(emb)], dim=1)
-    if embedding_dim % 2 == 1:  # zero pad
-        emb = torch.nn.functional.pad(emb, (0, 1, 0, 0))
-    return emb
-
-
-def nonlinearity(x):
-    # swish
-    return x * torch.sigmoid(x)
-
-
-class CausalConv3d(nn.Module):
-    @beartype
-    def __init__(
-        self,
-        chan_in,
-        chan_out,
-        kernel_size: Union[int, Tuple[int, int, int]],
-        pad_mode="constant",
-        **kwargs,
-    ):
-        super().__init__()
-        kernel_size = cast_tuple(kernel_size, 3)
-
-        time_kernel_size, height_kernel_size, width_kernel_size = kernel_size
-
-        assert is_odd(height_kernel_size) and is_odd(width_kernel_size)
-
-        dilation = kwargs.pop("dilation", 1)
-        stride = kwargs.pop("stride", 1)
-
-        self.pad_mode = pad_mode
-        time_pad = dilation * (time_kernel_size - 1) + (1 - stride)
-        height_pad = height_kernel_size // 2
-        width_pad = width_kernel_size // 2
-
-        self.height_pad = height_pad
-        self.width_pad = width_pad
-        self.time_pad = time_pad
-        self.time_causal_padding = (
-            width_pad,
-            width_pad,
-            height_pad,
-            height_pad,
-            time_pad,
-            0,
-        )
-
-        stride = (stride, 1, 1)
-        dilation = (dilation, 1, 1)
-        self.conv = nn.Conv3d(
-            chan_in, chan_out, kernel_size, stride=stride, dilation=dilation, **kwargs
-        )
-
-    def forward(self, x):
-        if self.pad_mode == "constant":
-            causal_padding_3d = (
-                self.time_pad,
-                0,
-                self.width_pad,
-                self.width_pad,
-                self.height_pad,
-                self.height_pad,
-            )
-            x = F.pad(x, causal_padding_3d, mode="constant", value=0)
-        elif self.pad_mode == "first":
-            pad_x = torch.cat([x[:, :, :1]] * self.time_pad, dim=2)
-            x = torch.cat([pad_x, x], dim=2)
-            causal_padding_2d = (
-                self.width_pad,
-                self.width_pad,
-                self.height_pad,
-                self.height_pad,
-            )
-            x = F.pad(x, causal_padding_2d, mode="constant", value=0)
-        elif self.pad_mode == "reflect":
-            # reflect padding
-            reflect_x = x[:, :, 1 : self.time_pad + 1, :, :].flip(dims=[2])
-            if reflect_x.shape[2] < self.time_pad:
-                reflect_x = torch.cat(
-                    [torch.zeros_like(x[:, :, :1, :, :])]
-                    * (self.time_pad - reflect_x.shape[2])
-                    + [reflect_x],
-                    dim=2,
-                )
-            x = torch.cat([reflect_x, x], dim=2)
-            causal_padding_2d = (
-                self.width_pad,
-                self.width_pad,
-                self.height_pad,
-                self.height_pad,
-            )
-            x = F.pad(x, causal_padding_2d, mode="constant", value=0)
-        else:
-            raise ValueError("Invalid pad mode")
-        return self.conv(x)
-
-
-def Normalize3D(in_channels):  # same for 3D and 2D
-    return torch.nn.GroupNorm(
-        num_groups=32, num_channels=in_channels, eps=1e-6, affine=True
-    )
-
-
-class Upsample3D(nn.Module):
-    def __init__(self, in_channels, with_conv, compress_time=False):
-        super().__init__()
-        self.with_conv = with_conv
-        if self.with_conv:
-            self.conv = torch.nn.Conv2d(
-                in_channels, in_channels, kernel_size=3, stride=1, padding=1
-            )
-        self.compress_time = compress_time
-
-    def forward(self, x):
-        if self.compress_time:
-            if x.shape[2] > 1:
-                # split first frame
-                x_first, x_rest = x[:, :, 0], x[:, :, 1:]
-
-                x_first = torch.nn.functional.interpolate(
-                    x_first, scale_factor=2.0, mode="nearest"
-                )
-                x_rest = torch.nn.functional.interpolate(
-                    x_rest, scale_factor=2.0, mode="nearest"
-                )
-                x = torch.cat([x_first[:, :, None, :, :], x_rest], dim=2)
-            else:
-                x = x.squeeze(2)
-                x = torch.nn.functional.interpolate(x, scale_factor=2.0, mode="nearest")
-                x = x[:, :, None, :, :]
-        else:
-            # only interpolate 2D
-            t = x.shape[2]
-            x = rearrange(x, "b c t h w -> (b t) c h w")
-            x = torch.nn.functional.interpolate(x, scale_factor=2.0, mode="nearest")
-            x = rearrange(x, "(b t) c h w -> b c t h w", t=t)
-
-        if self.with_conv:
-            t = x.shape[2]
-            x = rearrange(x, "b c t h w -> (b t) c h w")
-            x = self.conv(x)
-            x = rearrange(x, "(b t) c h w -> b c t h w", t=t)
-        return x
-
-
-class DownSample3D(nn.Module):
-    def __init__(self, in_channels, with_conv, compress_time=False, out_channels=None):
-        super().__init__()
-        self.with_conv = with_conv
-        if out_channels is None:
-            out_channels = in_channels
-        if self.with_conv:
-            # no asymmetric padding in torch conv, must do it ourselves
-            self.conv = torch.nn.Conv2d(
-                in_channels, out_channels, kernel_size=3, stride=2, padding=0
-            )
-        self.compress_time = compress_time
-
-    def forward(self, x):
-        if self.compress_time:
-            h, w = x.shape[-2:]
-            x = rearrange(x, "b c t h w -> (b h w) c t")
-
-            # split first frame
-            x_first, x_rest = x[..., 0], x[..., 1:]
-
-            if x_rest.shape[-1] > 0:
-                x_rest = torch.nn.functional.avg_pool1d(x_rest, kernel_size=2, stride=2)
-            x = torch.cat([x_first[..., None], x_rest], dim=-1)
-            x = rearrange(x, "(b h w) c t -> b c t h w", h=h, w=w)
-
-        if self.with_conv:
-            pad = (0, 1, 0, 1)
-            x = torch.nn.functional.pad(x, pad, mode="constant", value=0)
-            t = x.shape[2]
-            x = rearrange(x, "b c t h w -> (b t) c h w")
-            x = self.conv(x)
-            x = rearrange(x, "(b t) c h w -> b c t h w", t=t)
-        else:
-            t = x.shape[2]
-            x = rearrange(x, "b c t h w -> (b t) c h w")
-            x = torch.nn.functional.avg_pool2d(x, kernel_size=2, stride=2)
-            x = rearrange(x, "(b t) c h w -> b c t h w", t=t)
-        return x
-
-
-class ResnetBlock3D(nn.Module):
-    def __init__(
-        self,
-        *,
-        in_channels,
-        out_channels=None,
-        conv_shortcut=False,
-        dropout,
-        temb_channels=512,
-        pad_mode="constant",
-    ):
-        super().__init__()
-        self.in_channels = in_channels
-        out_channels = in_channels if out_channels is None else out_channels
-        self.out_channels = out_channels
-        self.use_conv_shortcut = conv_shortcut
-
-        self.norm1 = Normalize3D(in_channels)
-        # self.conv1 = torch.nn.Conv3d(in_channels,
-        #                              out_channels,
-        #                              kernel_size=3,
-        #                              stride=1,
-        #                              padding=1)
-        self.conv1 = CausalConv3d(
-            in_channels, out_channels, kernel_size=3, pad_mode=pad_mode
-        )
-        if temb_channels > 0:
-            self.temb_proj = torch.nn.Linear(temb_channels, out_channels)
-        self.norm2 = Normalize3D(out_channels)
-        self.dropout = torch.nn.Dropout(dropout)
-        # self.conv2 = torch.nn.Conv3d(out_channels,
-        #                              out_channels,
-        #                              kernel_size=3,
-        #                              stride=1,
-        #                              padding=1)
-        self.conv2 = CausalConv3d(
-            out_channels, out_channels, kernel_size=3, pad_mode=pad_mode
-        )
-        if self.in_channels != self.out_channels:
-            if self.use_conv_shortcut:
-                # self.conv_shortcut = torch.nn.Conv3d(in_channels,
-                #                                      out_channels,
-                #                                      kernel_size=3,
-                #                                      stride=1,
-                #                                      padding=1)
-                self.conv_shortcut = CausalConv3d(
-                    in_channels, out_channels, kernel_size=3, pad_mode=pad_mode
-                )
-            else:
-                self.nin_shortcut = torch.nn.Conv3d(
-                    in_channels, out_channels, kernel_size=1, stride=1, padding=0
-                )
-                # self.nin_shortcut = CausalConv3d(in_channels, out_channels, kernel_size=1, pad_mode=pad_mode)
-
-    def forward(self, x, temb):
-        h = x
-        h = self.norm1(h)
-        h = nonlinearity(h)
-        h = self.conv1(h)
-
-        if temb is not None:
-            h = h + self.temb_proj(nonlinearity(temb))[:, :, None, None, None]
-
-        h = self.norm2(h)
-        h = nonlinearity(h)
-        h = self.dropout(h)
-        h = self.conv2(h)
-
-        if self.in_channels != self.out_channels:
-            if self.use_conv_shortcut:
-                x = self.conv_shortcut(x)
-            else:
-                x = self.nin_shortcut(x)
-
-        return x + h
-
-
-class AttnBlock2D(nn.Module):
-    def __init__(self, in_channels):
-        super().__init__()
-        self.in_channels = in_channels
-
-        self.norm = Normalize3D(in_channels)
-        self.q = torch.nn.Conv2d(
-            in_channels, in_channels, kernel_size=1, stride=1, padding=0
-        )
-        self.k = torch.nn.Conv2d(
-            in_channels, in_channels, kernel_size=1, stride=1, padding=0
-        )
-        self.v = torch.nn.Conv2d(
-            in_channels, in_channels, kernel_size=1, stride=1, padding=0
-        )
-        self.proj_out = torch.nn.Conv2d(
-            in_channels, in_channels, kernel_size=1, stride=1, padding=0
-        )
-
-    def forward(self, x):
-        h_ = x
-        h_ = self.norm(h_)
-
-        t = h_.shape[2]
-        h_ = rearrange(h_, "b c t h w -> (b t) c h w")
-
-        q = self.q(h_)
-        k = self.k(h_)
-        v = self.v(h_)
-
-        # compute attention
-        b, c, h, w = q.shape
-        q = q.reshape(b, c, h * w)
-        q = q.permute(0, 2, 1)  # b,hw,c
-        k = k.reshape(b, c, h * w)  # b,c,hw
-
-        # # original version, nan in fp16
-        # w_ = torch.bmm(q,k)     # b,hw,hw    w[b,i,j]=sum_c q[b,i,c]k[b,c,j]
-        # w_ = w_ * (int(c)**(-0.5))
-        # # implement c**-0.5 on q
-        q = q * (int(c) ** (-0.5))
-        w_ = torch.bmm(q, k)  # b,hw,hw    w[b,i,j]=sum_c q[b,i,c]k[b,c,j]
-
-        w_ = torch.nn.functional.softmax(w_, dim=2)
-
-        # attend to values
-        v = v.reshape(b, c, h * w)
-        w_ = w_.permute(0, 2, 1)  # b,hw,hw (first hw of k, second of q)
-        h_ = torch.bmm(v, w_)  # b, c,hw (hw of q) h_[b,c,j] = sum_i v[b,c,i] w_[b,i,j]
-        h_ = h_.reshape(b, c, h, w)
-
-        h_ = self.proj_out(h_)
-
-        h_ = rearrange(h_, "(b t) c h w -> b c t h w", t=t)
-
-        return x + h_
-
-
-class Encoder3D(nn.Module):
-    def __init__(
-        self,
-        *,
-        ch,
-        out_ch,
-        ch_mult=(1, 2, 4, 8),
-        num_res_blocks,
-        attn_resolutions,
-        dropout=0.0,
-        resamp_with_conv=True,
-        in_channels,
-        resolution,
-        z_channels,
-        double_z=True,
-        pad_mode="first",
-        temporal_compress_times=4,
-        **ignore_kwargs,
-    ):
-        super().__init__()
-        self.ch = ch
-        self.temb_ch = 0
-        self.num_resolutions = len(ch_mult)
-        self.num_res_blocks = num_res_blocks
-        self.resolution = resolution
-        self.in_channels = in_channels
-
-        # log2 of temporal_compress_times
-        self.temporal_compress_level = int(np.log2(temporal_compress_times))
-
-        # downsampling
-        # self.conv_in = torch.nn.Conv3d(in_channels,
-        #                                self.ch,
-        #                                kernel_size=3,
-        #                                stride=1,
-        #                                padding=1)
-        self.conv_in = CausalConv3d(
-            in_channels, self.ch, kernel_size=3, pad_mode=pad_mode
-        )
-
-        curr_res = resolution
-        in_ch_mult = (1,) + tuple(ch_mult)
-        self.down = nn.ModuleList()
-        for i_level in range(self.num_resolutions):
-            block = nn.ModuleList()
-            attn = nn.ModuleList()
-            block_in = ch * in_ch_mult[i_level]
-            block_out = ch * ch_mult[i_level]
-            for i_block in range(self.num_res_blocks):
-                block.append(
-                    ResnetBlock3D(
-                        in_channels=block_in,
-                        out_channels=block_out,
-                        temb_channels=self.temb_ch,
-                        dropout=dropout,
-                        pad_mode=pad_mode,
-                    )
-                )
-                block_in = block_out
-                if curr_res in attn_resolutions:
-                    attn.append(AttnBlock2D(block_in))
-            down = nn.Module()
-            down.block = block
-            down.attn = attn
-            if i_level != self.num_resolutions - 1:
-                if i_level < self.temporal_compress_level:
-                    down.downsample = DownSample3D(
-                        block_in, resamp_with_conv, compress_time=True
-                    )
-                else:
-                    down.downsample = DownSample3D(
-                        block_in, resamp_with_conv, compress_time=False
-                    )
-                curr_res = curr_res // 2
-            self.down.append(down)
-
-        # middle
-        self.mid = nn.Module()
-        self.mid.block_1 = ResnetBlock3D(
-            in_channels=block_in,
-            out_channels=block_in,
-            temb_channels=self.temb_ch,
-            dropout=dropout,
-            pad_mode=pad_mode,
-        )
-        # remove attention block
-        # self.mid.attn_1 = AttnBlock2D(block_in)
-        self.mid.block_2 = ResnetBlock3D(
-            in_channels=block_in,
-            out_channels=block_in,
-            temb_channels=self.temb_ch,
-            dropout=dropout,
-            pad_mode=pad_mode,
-        )
-
-        # end
-        self.norm_out = Normalize3D(block_in)
-        # self.conv_out = torch.nn.Conv3d(block_in,
-        #                                 2*z_channels if double_z else z_channels,
-        #                                 kernel_size=3,
-        #                                 stride=1,
-        #                                 padding=1)
-        self.conv_out = CausalConv3d(
-            block_in,
-            2 * z_channels if double_z else z_channels,
-            kernel_size=3,
-            pad_mode=pad_mode,
-        )
-
-    def forward(self, x, use_cp=False):
-        # assert x.shape[2] == x.shape[3] == self.resolution, "{}, {}, {}".format(x.shape[2], x.shape[3], self.resolution)
-        # timestep embedding
-        temb = None
-
-        # downsampling
-        hs = [self.conv_in(x)]
-        for i_level in range(self.num_resolutions):
-            for i_block in range(self.num_res_blocks):
-                h = self.down[i_level].block[i_block](hs[-1], temb)
-                if len(self.down[i_level].attn) > 0:
-                    h = self.down[i_level].attn[i_block](h)
-                hs.append(h)
-            if i_level != self.num_resolutions - 1:
-                hs.append(self.down[i_level].downsample(hs[-1]))
-
-        # middle
-        h = hs[-1]
-        h = self.mid.block_1(h, temb)
-        # h = self.mid.attn_1(h)
-        h = self.mid.block_2(h, temb)
-
-        # end
-        h = self.norm_out(h)
-        h = nonlinearity(h)
-        h = self.conv_out(h)
-        return h
diff --git a/videotuna/models/cogvideo_sat/sgm/modules/autoencoding/vqvae/movq_modules.py b/videotuna/models/cogvideo_sat/sgm/modules/autoencoding/vqvae/movq_modules.py
deleted file mode 100644
index b06a1d01..00000000
--- a/videotuna/models/cogvideo_sat/sgm/modules/autoencoding/vqvae/movq_modules.py
+++ /dev/null
@@ -1,403 +0,0 @@
-# pytorch_diffusion + derived encoder decoder
-import math
-
-import numpy as np
-import torch
-import torch.nn as nn
-
-
-def get_timestep_embedding(timesteps, embedding_dim):
-    """
-    This matches the implementation in Denoising Diffusion Probabilistic Models:
-    From Fairseq.
-    Build sinusoidal embeddings.
-    This matches the implementation in tensor2tensor, but differs slightly
-    from the description in Section 3.5 of "Attention Is All You Need".
-    """
-    assert len(timesteps.shape) == 1
-
-    half_dim = embedding_dim // 2
-    emb = math.log(10000) / (half_dim - 1)
-    emb = torch.exp(torch.arange(half_dim, dtype=torch.float32) * -emb)
-    emb = emb.to(device=timesteps.device)
-    emb = timesteps.float()[:, None] * emb[None, :]
-    emb = torch.cat([torch.sin(emb), torch.cos(emb)], dim=1)
-    if embedding_dim % 2 == 1:  # zero pad
-        emb = torch.nn.functional.pad(emb, (0, 1, 0, 0))
-    return emb
-
-
-def nonlinearity(x):
-    # swish
-    return x * torch.sigmoid(x)
-
-
-class SpatialNorm(nn.Module):
-    def __init__(
-        self,
-        f_channels,
-        zq_channels,
-        norm_layer=nn.GroupNorm,
-        freeze_norm_layer=False,
-        add_conv=False,
-        **norm_layer_params,
-    ):
-        super().__init__()
-        self.norm_layer = norm_layer(num_channels=f_channels, **norm_layer_params)
-        if freeze_norm_layer:
-            for p in self.norm_layer.parameters:
-                p.requires_grad = False
-        self.add_conv = add_conv
-        if self.add_conv:
-            self.conv = nn.Conv2d(
-                zq_channels, zq_channels, kernel_size=3, stride=1, padding=1
-            )
-        self.conv_y = nn.Conv2d(
-            zq_channels, f_channels, kernel_size=1, stride=1, padding=0
-        )
-        self.conv_b = nn.Conv2d(
-            zq_channels, f_channels, kernel_size=1, stride=1, padding=0
-        )
-
-    def forward(self, f, zq):
-        f_size = f.shape[-2:]
-        zq = torch.nn.functional.interpolate(zq, size=f_size, mode="nearest")
-        if self.add_conv:
-            zq = self.conv(zq)
-        norm_f = self.norm_layer(f)
-        new_f = norm_f * self.conv_y(zq) + self.conv_b(zq)
-        return new_f
-
-
-def Normalize(in_channels, zq_ch, add_conv):
-    return SpatialNorm(
-        in_channels,
-        zq_ch,
-        norm_layer=nn.GroupNorm,
-        freeze_norm_layer=False,
-        add_conv=add_conv,
-        num_groups=32,
-        eps=1e-6,
-        affine=True,
-    )
-
-
-class Upsample(nn.Module):
-    def __init__(self, in_channels, with_conv):
-        super().__init__()
-        self.with_conv = with_conv
-        if self.with_conv:
-            self.conv = torch.nn.Conv2d(
-                in_channels, in_channels, kernel_size=3, stride=1, padding=1
-            )
-
-    def forward(self, x):
-        x = torch.nn.functional.interpolate(x, scale_factor=2.0, mode="nearest")
-        if self.with_conv:
-            x = self.conv(x)
-        return x
-
-
-class Downsample(nn.Module):
-    def __init__(self, in_channels, with_conv):
-        super().__init__()
-        self.with_conv = with_conv
-        if self.with_conv:
-            # no asymmetric padding in torch conv, must do it ourselves
-            self.conv = torch.nn.Conv2d(
-                in_channels, in_channels, kernel_size=3, stride=2, padding=0
-            )
-
-    def forward(self, x):
-        if self.with_conv:
-            pad = (0, 1, 0, 1)
-            x = torch.nn.functional.pad(x, pad, mode="constant", value=0)
-            x = self.conv(x)
-        else:
-            x = torch.nn.functional.avg_pool2d(x, kernel_size=2, stride=2)
-        return x
-
-
-class ResnetBlock(nn.Module):
-    def __init__(
-        self,
-        *,
-        in_channels,
-        out_channels=None,
-        conv_shortcut=False,
-        dropout,
-        temb_channels=512,
-        zq_ch=None,
-        add_conv=False,
-    ):
-        super().__init__()
-        self.in_channels = in_channels
-        out_channels = in_channels if out_channels is None else out_channels
-        self.out_channels = out_channels
-        self.use_conv_shortcut = conv_shortcut
-
-        self.norm1 = Normalize(in_channels, zq_ch, add_conv=add_conv)
-        self.conv1 = torch.nn.Conv2d(
-            in_channels, out_channels, kernel_size=3, stride=1, padding=1
-        )
-        if temb_channels > 0:
-            self.temb_proj = torch.nn.Linear(temb_channels, out_channels)
-        self.norm2 = Normalize(out_channels, zq_ch, add_conv=add_conv)
-        self.dropout = torch.nn.Dropout(dropout)
-        self.conv2 = torch.nn.Conv2d(
-            out_channels, out_channels, kernel_size=3, stride=1, padding=1
-        )
-        if self.in_channels != self.out_channels:
-            if self.use_conv_shortcut:
-                self.conv_shortcut = torch.nn.Conv2d(
-                    in_channels, out_channels, kernel_size=3, stride=1, padding=1
-                )
-            else:
-                self.nin_shortcut = torch.nn.Conv2d(
-                    in_channels, out_channels, kernel_size=1, stride=1, padding=0
-                )
-
-    def forward(self, x, temb, zq):
-        h = x
-        h = self.norm1(h, zq)
-        h = nonlinearity(h)
-        h = self.conv1(h)
-
-        if temb is not None:
-            h = h + self.temb_proj(nonlinearity(temb))[:, :, None, None]
-
-        h = self.norm2(h, zq)
-        h = nonlinearity(h)
-        h = self.dropout(h)
-        h = self.conv2(h)
-
-        if self.in_channels != self.out_channels:
-            if self.use_conv_shortcut:
-                x = self.conv_shortcut(x)
-            else:
-                x = self.nin_shortcut(x)
-
-        return x + h
-
-
-class AttnBlock(nn.Module):
-    def __init__(self, in_channels, zq_ch=None, add_conv=False):
-        super().__init__()
-        self.in_channels = in_channels
-
-        self.norm = Normalize(in_channels, zq_ch, add_conv=add_conv)
-        self.q = torch.nn.Conv2d(
-            in_channels, in_channels, kernel_size=1, stride=1, padding=0
-        )
-        self.k = torch.nn.Conv2d(
-            in_channels, in_channels, kernel_size=1, stride=1, padding=0
-        )
-        self.v = torch.nn.Conv2d(
-            in_channels, in_channels, kernel_size=1, stride=1, padding=0
-        )
-        self.proj_out = torch.nn.Conv2d(
-            in_channels, in_channels, kernel_size=1, stride=1, padding=0
-        )
-
-    def forward(self, x, zq):
-        h_ = x
-        h_ = self.norm(h_, zq)
-        q = self.q(h_)
-        k = self.k(h_)
-        v = self.v(h_)
-
-        # compute attention
-        b, c, h, w = q.shape
-        q = q.reshape(b, c, h * w)
-        q = q.permute(0, 2, 1)  # b,hw,c
-        k = k.reshape(b, c, h * w)  # b,c,hw
-        w_ = torch.bmm(q, k)  # b,hw,hw    w[b,i,j]=sum_c q[b,i,c]k[b,c,j]
-        w_ = w_ * (int(c) ** (-0.5))
-        w_ = torch.nn.functional.softmax(w_, dim=2)
-
-        # attend to values
-        v = v.reshape(b, c, h * w)
-        w_ = w_.permute(0, 2, 1)  # b,hw,hw (first hw of k, second of q)
-        h_ = torch.bmm(v, w_)  # b, c,hw (hw of q) h_[b,c,j] = sum_i v[b,c,i] w_[b,i,j]
-        h_ = h_.reshape(b, c, h, w)
-
-        h_ = self.proj_out(h_)
-
-        return x + h_
-
-
-class MOVQDecoder(nn.Module):
-    def __init__(
-        self,
-        *,
-        ch,
-        out_ch,
-        ch_mult=(1, 2, 4, 8),
-        num_res_blocks,
-        attn_resolutions,
-        dropout=0.0,
-        resamp_with_conv=True,
-        in_channels,
-        resolution,
-        z_channels,
-        give_pre_end=False,
-        zq_ch=None,
-        add_conv=False,
-        **ignorekwargs,
-    ):
-        super().__init__()
-        self.ch = ch
-        self.temb_ch = 0
-        self.num_resolutions = len(ch_mult)
-        self.num_res_blocks = num_res_blocks
-        self.resolution = resolution
-        self.in_channels = in_channels
-        self.give_pre_end = give_pre_end
-
-        # compute in_ch_mult, block_in and curr_res at lowest res
-        in_ch_mult = (1,) + tuple(ch_mult)
-        block_in = ch * ch_mult[self.num_resolutions - 1]
-        curr_res = resolution // 2 ** (self.num_resolutions - 1)
-        self.z_shape = (1, z_channels, curr_res, curr_res)
-        print(
-            "Working with z of shape {} = {} dimensions.".format(
-                self.z_shape, np.prod(self.z_shape)
-            )
-        )
-
-        # z to block_in
-        self.conv_in = torch.nn.Conv2d(
-            z_channels, block_in, kernel_size=3, stride=1, padding=1
-        )
-
-        # middle
-        self.mid = nn.Module()
-        self.mid.block_1 = ResnetBlock(
-            in_channels=block_in,
-            out_channels=block_in,
-            temb_channels=self.temb_ch,
-            dropout=dropout,
-            zq_ch=zq_ch,
-            add_conv=add_conv,
-        )
-        self.mid.attn_1 = AttnBlock(block_in, zq_ch, add_conv=add_conv)
-        self.mid.block_2 = ResnetBlock(
-            in_channels=block_in,
-            out_channels=block_in,
-            temb_channels=self.temb_ch,
-            dropout=dropout,
-            zq_ch=zq_ch,
-            add_conv=add_conv,
-        )
-
-        # upsampling
-        self.up = nn.ModuleList()
-        for i_level in reversed(range(self.num_resolutions)):
-            block = nn.ModuleList()
-            attn = nn.ModuleList()
-            block_out = ch * ch_mult[i_level]
-            for i_block in range(self.num_res_blocks + 1):
-                block.append(
-                    ResnetBlock(
-                        in_channels=block_in,
-                        out_channels=block_out,
-                        temb_channels=self.temb_ch,
-                        dropout=dropout,
-                        zq_ch=zq_ch,
-                        add_conv=add_conv,
-                    )
-                )
-                block_in = block_out
-                if curr_res in attn_resolutions:
-                    attn.append(AttnBlock(block_in, zq_ch, add_conv=add_conv))
-            up = nn.Module()
-            up.block = block
-            up.attn = attn
-            if i_level != 0:
-                up.upsample = Upsample(block_in, resamp_with_conv)
-                curr_res = curr_res * 2
-            self.up.insert(0, up)  # prepend to get consistent order
-
-        # end
-        self.norm_out = Normalize(block_in, zq_ch, add_conv=add_conv)
-        self.conv_out = torch.nn.Conv2d(
-            block_in, out_ch, kernel_size=3, stride=1, padding=1
-        )
-
-    def forward(self, z, zq):
-        # assert z.shape[1:] == self.z_shape[1:]
-        self.last_z_shape = z.shape
-
-        # timestep embedding
-        temb = None
-
-        # z to block_in
-        h = self.conv_in(z)
-
-        # middle
-        h = self.mid.block_1(h, temb, zq)
-        h = self.mid.attn_1(h, zq)
-        h = self.mid.block_2(h, temb, zq)
-
-        # upsampling
-        for i_level in reversed(range(self.num_resolutions)):
-            for i_block in range(self.num_res_blocks + 1):
-                h = self.up[i_level].block[i_block](h, temb, zq)
-                if len(self.up[i_level].attn) > 0:
-                    h = self.up[i_level].attn[i_block](h, zq)
-            if i_level != 0:
-                h = self.up[i_level].upsample(h)
-
-        # end
-        if self.give_pre_end:
-            return h
-
-        h = self.norm_out(h, zq)
-        h = nonlinearity(h)
-        h = self.conv_out(h)
-        return h
-
-    def forward_with_features_output(self, z, zq):
-        # assert z.shape[1:] == self.z_shape[1:]
-        self.last_z_shape = z.shape
-
-        # timestep embedding
-        temb = None
-        output_features = {}
-
-        # z to block_in
-        h = self.conv_in(z)
-        output_features["conv_in"] = h
-
-        # middle
-        h = self.mid.block_1(h, temb, zq)
-        output_features["mid_block_1"] = h
-        h = self.mid.attn_1(h, zq)
-        output_features["mid_attn_1"] = h
-        h = self.mid.block_2(h, temb, zq)
-        output_features["mid_block_2"] = h
-
-        # upsampling
-        for i_level in reversed(range(self.num_resolutions)):
-            for i_block in range(self.num_res_blocks + 1):
-                h = self.up[i_level].block[i_block](h, temb, zq)
-                output_features[f"up_{i_level}_block_{i_block}"] = h
-                if len(self.up[i_level].attn) > 0:
-                    h = self.up[i_level].attn[i_block](h, zq)
-                    output_features[f"up_{i_level}_attn_{i_block}"] = h
-            if i_level != 0:
-                h = self.up[i_level].upsample(h)
-                output_features[f"up_{i_level}_upsample"] = h
-
-        # end
-        if self.give_pre_end:
-            return h
-
-        h = self.norm_out(h, zq)
-        output_features["norm_out"] = h
-        h = nonlinearity(h)
-        output_features["nonlinearity"] = h
-        h = self.conv_out(h)
-        output_features["conv_out"] = h
-
-        return h, output_features
diff --git a/videotuna/models/cogvideo_sat/sgm/modules/autoencoding/vqvae/quantize.py b/videotuna/models/cogvideo_sat/sgm/modules/autoencoding/vqvae/quantize.py
deleted file mode 100644
index 96cc56ac..00000000
--- a/videotuna/models/cogvideo_sat/sgm/modules/autoencoding/vqvae/quantize.py
+++ /dev/null
@@ -1,270 +0,0 @@
-import numpy as np
-import torch
-import torch.nn as nn
-import torch.nn.functional as F
-from einops import rearrange
-from torch import einsum
-
-
-class VectorQuantizer2(nn.Module):
-    """
-    Improved version over VectorQuantizer, can be used as a drop-in replacement. Mostly
-    avoids costly matrix multiplications and allows for post-hoc remapping of indices.
-    """
-
-    # NOTE: due to a bug the beta term was applied to the wrong term. for
-    # backwards compatibility we use the buggy version by default, but you can
-    # specify legacy=False to fix it.
-    def __init__(
-        self,
-        n_e,
-        e_dim,
-        beta,
-        remap=None,
-        unknown_index="random",
-        sane_index_shape=False,
-        legacy=True,
-    ):
-        super().__init__()
-        self.n_e = n_e
-        self.e_dim = e_dim
-        self.beta = beta
-        self.legacy = legacy
-
-        self.embedding = nn.Embedding(self.n_e, self.e_dim)
-        self.embedding.weight.data.uniform_(-1.0 / self.n_e, 1.0 / self.n_e)
-
-        self.remap = remap
-        if self.remap is not None:
-            self.register_buffer("used", torch.tensor(np.load(self.remap)))
-            self.re_embed = self.used.shape[0]
-            self.unknown_index = unknown_index  # "random" or "extra" or integer
-            if self.unknown_index == "extra":
-                self.unknown_index = self.re_embed
-                self.re_embed = self.re_embed + 1
-            print(
-                f"Remapping {self.n_e} indices to {self.re_embed} indices. "
-                f"Using {self.unknown_index} for unknown indices."
-            )
-        else:
-            self.re_embed = n_e
-
-        self.sane_index_shape = sane_index_shape
-
-    def remap_to_used(self, inds):
-        ishape = inds.shape
-        assert len(ishape) > 1
-        inds = inds.reshape(ishape[0], -1)
-        used = self.used.to(inds)
-        match = (inds[:, :, None] == used[None, None, ...]).long()
-        new = match.argmax(-1)
-        unknown = match.sum(2) < 1
-        if self.unknown_index == "random":
-            new[unknown] = torch.randint(0, self.re_embed, size=new[unknown].shape).to(
-                device=new.device
-            )
-        else:
-            new[unknown] = self.unknown_index
-        return new.reshape(ishape)
-
-    def unmap_to_all(self, inds):
-        ishape = inds.shape
-        assert len(ishape) > 1
-        inds = inds.reshape(ishape[0], -1)
-        used = self.used.to(inds)
-        if self.re_embed > self.used.shape[0]:  # extra token
-            inds[inds >= self.used.shape[0]] = 0  # simply set to zero
-        back = torch.gather(used[None, :][inds.shape[0] * [0], :], 1, inds)
-        return back.reshape(ishape)
-
-    def forward(self, z, temp=None, rescale_logits=False, return_logits=False):
-        assert temp is None or temp == 1.0, "Only for interface compatible with Gumbel"
-        assert rescale_logits == False, "Only for interface compatible with Gumbel"
-        assert return_logits == False, "Only for interface compatible with Gumbel"
-        # reshape z -> (batch, height, width, channel) and flatten
-        z = rearrange(z, "b c h w -> b h w c").contiguous()
-        z_flattened = z.view(-1, self.e_dim)
-        # distances from z to embeddings e_j (z - e)^2 = z^2 + e^2 - 2 e * z
-
-        d = (
-            torch.sum(z_flattened**2, dim=1, keepdim=True)
-            + torch.sum(self.embedding.weight**2, dim=1)
-            - 2
-            * torch.einsum(
-                "bd,dn->bn", z_flattened, rearrange(self.embedding.weight, "n d -> d n")
-            )
-        )
-
-        min_encoding_indices = torch.argmin(d, dim=1)
-        z_q = self.embedding(min_encoding_indices).view(z.shape)
-        perplexity = None
-        min_encodings = None
-
-        # compute loss for embedding
-        if not self.legacy:
-            loss = self.beta * torch.mean((z_q.detach() - z) ** 2) + torch.mean(
-                (z_q - z.detach()) ** 2
-            )
-        else:
-            loss = torch.mean((z_q.detach() - z) ** 2) + self.beta * torch.mean(
-                (z_q - z.detach()) ** 2
-            )
-
-        # preserve gradients
-        z_q = z + (z_q - z).detach()
-
-        # reshape back to match original input shape
-        z_q = rearrange(z_q, "b h w c -> b c h w").contiguous()
-
-        if self.remap is not None:
-            min_encoding_indices = min_encoding_indices.reshape(
-                z.shape[0], -1
-            )  # add batch axis
-            min_encoding_indices = self.remap_to_used(min_encoding_indices)
-            min_encoding_indices = min_encoding_indices.reshape(-1, 1)  # flatten
-
-        if self.sane_index_shape:
-            min_encoding_indices = min_encoding_indices.reshape(
-                z_q.shape[0], z_q.shape[2], z_q.shape[3]
-            )
-
-        return z_q, loss, (perplexity, min_encodings, min_encoding_indices)
-
-    def get_codebook_entry(self, indices, shape):
-        # shape specifying (batch, height, width, channel)
-        if self.remap is not None:
-            indices = indices.reshape(shape[0], -1)  # add batch axis
-            indices = self.unmap_to_all(indices)
-            indices = indices.reshape(-1)  # flatten again
-
-        # get quantized latent vectors
-        z_q = self.embedding(indices)
-
-        if shape is not None:
-            z_q = z_q.view(shape)
-            # reshape back to match original input shape
-            z_q = z_q.permute(0, 3, 1, 2).contiguous()
-
-        return z_q
-
-
-class GumbelQuantize(nn.Module):
-    """
-    credit to @karpathy: https://github.com/karpathy/deep-vector-quantization/blob/main/model.py (thanks!)
-    Gumbel Softmax trick quantizer
-    Categorical Reparameterization with Gumbel-Softmax, Jang et al. 2016
-    https://arxiv.org/abs/1611.01144
-    """
-
-    def __init__(
-        self,
-        num_hiddens,
-        embedding_dim,
-        n_embed,
-        straight_through=True,
-        kl_weight=5e-4,
-        temp_init=1.0,
-        use_vqinterface=True,
-        remap=None,
-        unknown_index="random",
-    ):
-        super().__init__()
-
-        self.embedding_dim = embedding_dim
-        self.n_embed = n_embed
-
-        self.straight_through = straight_through
-        self.temperature = temp_init
-        self.kl_weight = kl_weight
-
-        self.proj = nn.Conv2d(num_hiddens, n_embed, 1)
-        self.embed = nn.Embedding(n_embed, embedding_dim)
-
-        self.use_vqinterface = use_vqinterface
-
-        self.remap = remap
-        if self.remap is not None:
-            self.register_buffer("used", torch.tensor(np.load(self.remap)))
-            self.re_embed = self.used.shape[0]
-            self.unknown_index = unknown_index  # "random" or "extra" or integer
-            if self.unknown_index == "extra":
-                self.unknown_index = self.re_embed
-                self.re_embed = self.re_embed + 1
-            print(
-                f"Remapping {self.n_embed} indices to {self.re_embed} indices. "
-                f"Using {self.unknown_index} for unknown indices."
-            )
-        else:
-            self.re_embed = n_embed
-
-    def remap_to_used(self, inds):
-        ishape = inds.shape
-        assert len(ishape) > 1
-        inds = inds.reshape(ishape[0], -1)
-        used = self.used.to(inds)
-        match = (inds[:, :, None] == used[None, None, ...]).long()
-        new = match.argmax(-1)
-        unknown = match.sum(2) < 1
-        if self.unknown_index == "random":
-            new[unknown] = torch.randint(0, self.re_embed, size=new[unknown].shape).to(
-                device=new.device
-            )
-        else:
-            new[unknown] = self.unknown_index
-        return new.reshape(ishape)
-
-    def unmap_to_all(self, inds):
-        ishape = inds.shape
-        assert len(ishape) > 1
-        inds = inds.reshape(ishape[0], -1)
-        used = self.used.to(inds)
-        if self.re_embed > self.used.shape[0]:  # extra token
-            inds[inds >= self.used.shape[0]] = 0  # simply set to zero
-        back = torch.gather(used[None, :][inds.shape[0] * [0], :], 1, inds)
-        return back.reshape(ishape)
-
-    def forward(self, z, temp=None, return_logits=False):
-        # force hard = True when we are in eval mode, as we must quantize. actually, always true seems to work
-        hard = self.straight_through if self.training else True
-        temp = self.temperature if temp is None else temp
-
-        logits = self.proj(z)
-        if self.remap is not None:
-            # continue only with used logits
-            full_zeros = torch.zeros_like(logits)
-            logits = logits[:, self.used, ...]
-
-        soft_one_hot = F.gumbel_softmax(logits, tau=temp, dim=1, hard=hard)
-        if self.remap is not None:
-            # go back to all entries but unused set to zero
-            full_zeros[:, self.used, ...] = soft_one_hot
-            soft_one_hot = full_zeros
-        z_q = einsum("b n h w, n d -> b d h w", soft_one_hot, self.embed.weight)
-
-        # + kl divergence to the prior loss
-        qy = F.softmax(logits, dim=1)
-        diff = (
-            self.kl_weight
-            * torch.sum(qy * torch.log(qy * self.n_embed + 1e-10), dim=1).mean()
-        )
-
-        ind = soft_one_hot.argmax(dim=1)
-        if self.remap is not None:
-            ind = self.remap_to_used(ind)
-        if self.use_vqinterface:
-            if return_logits:
-                return z_q, diff, (None, None, ind), logits
-            return z_q, diff, (None, None, ind)
-        return z_q, diff, ind
-
-    def get_codebook_entry(self, indices, shape):
-        b, h, w, c = shape
-        assert b * h * w == indices.shape[0]
-        indices = rearrange(indices, "(b h w) -> b h w", b=b, h=h, w=w)
-        if self.remap is not None:
-            indices = self.unmap_to_all(indices)
-        one_hot = (
-            F.one_hot(indices, num_classes=self.n_embed).permute(0, 3, 1, 2).float()
-        )
-        z_q = einsum("b n h w, n d -> b d h w", one_hot, self.embed.weight)
-        return z_q
diff --git a/videotuna/models/cogvideo_sat/sgm/modules/autoencoding/vqvae/vqvae_blocks.py b/videotuna/models/cogvideo_sat/sgm/modules/autoencoding/vqvae/vqvae_blocks.py
deleted file mode 100644
index 6334ee02..00000000
--- a/videotuna/models/cogvideo_sat/sgm/modules/autoencoding/vqvae/vqvae_blocks.py
+++ /dev/null
@@ -1,465 +0,0 @@
-# pytorch_diffusion + derived encoder decoder
-import math
-
-import numpy as np
-import torch
-import torch.nn as nn
-
-
-def get_timestep_embedding(timesteps, embedding_dim):
-    """
-    This matches the implementation in Denoising Diffusion Probabilistic Models:
-    From Fairseq.
-    Build sinusoidal embeddings.
-    This matches the implementation in tensor2tensor, but differs slightly
-    from the description in Section 3.5 of "Attention Is All You Need".
-    """
-    assert len(timesteps.shape) == 1
-
-    half_dim = embedding_dim // 2
-    emb = math.log(10000) / (half_dim - 1)
-    emb = torch.exp(torch.arange(half_dim, dtype=torch.float32) * -emb)
-    emb = emb.to(device=timesteps.device)
-    emb = timesteps.float()[:, None] * emb[None, :]
-    emb = torch.cat([torch.sin(emb), torch.cos(emb)], dim=1)
-    if embedding_dim % 2 == 1:  # zero pad
-        emb = torch.nn.functional.pad(emb, (0, 1, 0, 0))
-    return emb
-
-
-def nonlinearity(x):
-    # swish
-    return x * torch.sigmoid(x)
-
-
-def Normalize(in_channels):
-    return torch.nn.GroupNorm(
-        num_groups=32, num_channels=in_channels, eps=1e-6, affine=True
-    )
-
-
-class Upsample(nn.Module):
-    def __init__(self, in_channels, with_conv):
-        super().__init__()
-        self.with_conv = with_conv
-        if self.with_conv:
-            self.conv = torch.nn.Conv2d(
-                in_channels, in_channels, kernel_size=3, stride=1, padding=1
-            )
-
-    def forward(self, x):
-        x = torch.nn.functional.interpolate(x, scale_factor=2.0, mode="nearest")
-        if self.with_conv:
-            x = self.conv(x)
-        return x
-
-
-class Downsample(nn.Module):
-    def __init__(self, in_channels, with_conv):
-        super().__init__()
-        self.with_conv = with_conv
-        if self.with_conv:
-            # no asymmetric padding in torch conv, must do it ourselves
-            self.conv = torch.nn.Conv2d(
-                in_channels, in_channels, kernel_size=3, stride=2, padding=0
-            )
-
-    def forward(self, x):
-        if self.with_conv:
-            pad = (0, 1, 0, 1)
-            x = torch.nn.functional.pad(x, pad, mode="constant", value=0)
-            x = self.conv(x)
-        else:
-            x = torch.nn.functional.avg_pool2d(x, kernel_size=2, stride=2)
-        return x
-
-
-class ResnetBlock(nn.Module):
-    def __init__(
-        self,
-        *,
-        in_channels,
-        out_channels=None,
-        conv_shortcut=False,
-        dropout,
-        temb_channels=512,
-    ):
-        super().__init__()
-        self.in_channels = in_channels
-        out_channels = in_channels if out_channels is None else out_channels
-        self.out_channels = out_channels
-        self.use_conv_shortcut = conv_shortcut
-
-        self.norm1 = Normalize(in_channels)
-        self.conv1 = torch.nn.Conv2d(
-            in_channels, out_channels, kernel_size=3, stride=1, padding=1
-        )
-        if temb_channels > 0:
-            self.temb_proj = torch.nn.Linear(temb_channels, out_channels)
-        self.norm2 = Normalize(out_channels)
-        self.dropout = torch.nn.Dropout(dropout)
-        self.conv2 = torch.nn.Conv2d(
-            out_channels, out_channels, kernel_size=3, stride=1, padding=1
-        )
-        if self.in_channels != self.out_channels:
-            if self.use_conv_shortcut:
-                self.conv_shortcut = torch.nn.Conv2d(
-                    in_channels, out_channels, kernel_size=3, stride=1, padding=1
-                )
-            else:
-                self.nin_shortcut = torch.nn.Conv2d(
-                    in_channels, out_channels, kernel_size=1, stride=1, padding=0
-                )
-
-    def forward(self, x, temb):
-        h = x
-        h = self.norm1(h)
-        h = nonlinearity(h)
-        h = self.conv1(h)
-
-        if temb is not None:
-            h = h + self.temb_proj(nonlinearity(temb))[:, :, None, None]
-
-        h = self.norm2(h)
-        h = nonlinearity(h)
-        h = self.dropout(h)
-        h = self.conv2(h)
-
-        if self.in_channels != self.out_channels:
-            if self.use_conv_shortcut:
-                x = self.conv_shortcut(x)
-            else:
-                x = self.nin_shortcut(x)
-
-        return x + h
-
-
-class AttnBlock(nn.Module):
-    def __init__(self, in_channels):
-        super().__init__()
-        self.in_channels = in_channels
-
-        self.norm = Normalize(in_channels)
-        self.q = torch.nn.Conv2d(
-            in_channels, in_channels, kernel_size=1, stride=1, padding=0
-        )
-        self.k = torch.nn.Conv2d(
-            in_channels, in_channels, kernel_size=1, stride=1, padding=0
-        )
-        self.v = torch.nn.Conv2d(
-            in_channels, in_channels, kernel_size=1, stride=1, padding=0
-        )
-        self.proj_out = torch.nn.Conv2d(
-            in_channels, in_channels, kernel_size=1, stride=1, padding=0
-        )
-
-    def forward(self, x):
-        h_ = x
-        h_ = self.norm(h_)
-        q = self.q(h_)
-        k = self.k(h_)
-        v = self.v(h_)
-
-        # compute attention
-        b, c, h, w = q.shape
-        q = q.reshape(b, c, h * w)
-        q = q.permute(0, 2, 1)  # b,hw,c
-        k = k.reshape(b, c, h * w)  # b,c,hw
-
-        # # original version, nan in fp16
-        # w_ = torch.bmm(q,k)     # b,hw,hw    w[b,i,j]=sum_c q[b,i,c]k[b,c,j]
-        # w_ = w_ * (int(c)**(-0.5))
-        # # implement c**-0.5 on q
-        q = q * (int(c) ** (-0.5))
-        w_ = torch.bmm(q, k)  # b,hw,hw    w[b,i,j]=sum_c q[b,i,c]k[b,c,j]
-
-        w_ = torch.nn.functional.softmax(w_, dim=2)
-
-        # attend to values
-        v = v.reshape(b, c, h * w)
-        w_ = w_.permute(0, 2, 1)  # b,hw,hw (first hw of k, second of q)
-        h_ = torch.bmm(v, w_)  # b, c,hw (hw of q) h_[b,c,j] = sum_i v[b,c,i] w_[b,i,j]
-        h_ = h_.reshape(b, c, h, w)
-
-        h_ = self.proj_out(h_)
-
-        return x + h_
-
-
-class Encoder(nn.Module):
-    def __init__(
-        self,
-        *,
-        ch,
-        out_ch,
-        ch_mult=(1, 2, 4, 8),
-        num_res_blocks,
-        attn_resolutions,
-        dropout=0.0,
-        resamp_with_conv=True,
-        in_channels,
-        resolution,
-        z_channels,
-        double_z=True,
-        **ignore_kwargs,
-    ):
-        super().__init__()
-        self.ch = ch
-        self.temb_ch = 0
-        self.num_resolutions = len(ch_mult)
-        self.num_res_blocks = num_res_blocks
-        self.resolution = resolution
-        self.in_channels = in_channels
-
-        # downsampling
-        self.conv_in = torch.nn.Conv2d(
-            in_channels, self.ch, kernel_size=3, stride=1, padding=1
-        )
-
-        curr_res = resolution
-        in_ch_mult = (1,) + tuple(ch_mult)
-        self.down = nn.ModuleList()
-        for i_level in range(self.num_resolutions):
-            block = nn.ModuleList()
-            attn = nn.ModuleList()
-            block_in = ch * in_ch_mult[i_level]
-            block_out = ch * ch_mult[i_level]
-            for i_block in range(self.num_res_blocks):
-                block.append(
-                    ResnetBlock(
-                        in_channels=block_in,
-                        out_channels=block_out,
-                        temb_channels=self.temb_ch,
-                        dropout=dropout,
-                    )
-                )
-                block_in = block_out
-                if curr_res in attn_resolutions:
-                    attn.append(AttnBlock(block_in))
-            down = nn.Module()
-            down.block = block
-            down.attn = attn
-            if i_level != self.num_resolutions - 1:
-                down.downsample = Downsample(block_in, resamp_with_conv)
-                curr_res = curr_res // 2
-            self.down.append(down)
-
-        # middle
-        self.mid = nn.Module()
-        self.mid.block_1 = ResnetBlock(
-            in_channels=block_in,
-            out_channels=block_in,
-            temb_channels=self.temb_ch,
-            dropout=dropout,
-        )
-        self.mid.attn_1 = AttnBlock(block_in)
-        self.mid.block_2 = ResnetBlock(
-            in_channels=block_in,
-            out_channels=block_in,
-            temb_channels=self.temb_ch,
-            dropout=dropout,
-        )
-
-        # end
-        self.norm_out = Normalize(block_in)
-        self.conv_out = torch.nn.Conv2d(
-            block_in,
-            2 * z_channels if double_z else z_channels,
-            kernel_size=3,
-            stride=1,
-            padding=1,
-        )
-
-    def forward(self, x):
-        # assert x.shape[2] == x.shape[3] == self.resolution, "{}, {}, {}".format(x.shape[2], x.shape[3], self.resolution)
-
-        # timestep embedding
-        temb = None
-
-        # downsampling
-        hs = [self.conv_in(x)]
-        for i_level in range(self.num_resolutions):
-            for i_block in range(self.num_res_blocks):
-                h = self.down[i_level].block[i_block](hs[-1], temb)
-                if len(self.down[i_level].attn) > 0:
-                    h = self.down[i_level].attn[i_block](h)
-                hs.append(h)
-            if i_level != self.num_resolutions - 1:
-                hs.append(self.down[i_level].downsample(hs[-1]))
-
-        # middle
-        h = hs[-1]
-        h = self.mid.block_1(h, temb)
-        h = self.mid.attn_1(h)
-        h = self.mid.block_2(h, temb)
-
-        # end
-        h = self.norm_out(h)
-        h = nonlinearity(h)
-        h = self.conv_out(h)
-        return h
-
-    def forward_with_features_output(self, x):
-        # assert x.shape[2] == x.shape[3] == self.resolution, "{}, {}, {}".format(x.shape[2], x.shape[3], self.resolution)
-
-        # timestep embedding
-        temb = None
-        output_features = {}
-
-        # downsampling
-        hs = [self.conv_in(x)]
-        output_features["conv_in"] = hs[-1]
-        for i_level in range(self.num_resolutions):
-            for i_block in range(self.num_res_blocks):
-                h = self.down[i_level].block[i_block](hs[-1], temb)
-                output_features["down{}_block{}".format(i_level, i_block)] = h
-                if len(self.down[i_level].attn) > 0:
-                    h = self.down[i_level].attn[i_block](h)
-                    output_features["down{}_attn{}".format(i_level, i_block)] = h
-                hs.append(h)
-            if i_level != self.num_resolutions - 1:
-                hs.append(self.down[i_level].downsample(hs[-1]))
-                output_features["down{}_downsample".format(i_level)] = hs[-1]
-
-        # middle
-        h = hs[-1]
-        h = self.mid.block_1(h, temb)
-        output_features["mid_block_1"] = h
-        h = self.mid.attn_1(h)
-        output_features["mid_attn_1"] = h
-        h = self.mid.block_2(h, temb)
-        output_features["mid_block_2"] = h
-
-        # end
-        h = self.norm_out(h)
-        output_features["norm_out"] = h
-        h = nonlinearity(h)
-        output_features["nonlinearity"] = h
-        h = self.conv_out(h)
-        output_features["conv_out"] = h
-
-        return h, output_features
-
-
-class Decoder(nn.Module):
-    def __init__(
-        self,
-        *,
-        ch,
-        out_ch,
-        ch_mult=(1, 2, 4, 8),
-        num_res_blocks,
-        attn_resolutions,
-        dropout=0.0,
-        resamp_with_conv=True,
-        in_channels,
-        resolution,
-        z_channels,
-        give_pre_end=False,
-        **ignorekwargs,
-    ):
-        super().__init__()
-        self.ch = ch
-        self.temb_ch = 0
-        self.num_resolutions = len(ch_mult)
-        self.num_res_blocks = num_res_blocks
-        self.resolution = resolution
-        self.in_channels = in_channels
-        self.give_pre_end = give_pre_end
-
-        # compute in_ch_mult, block_in and curr_res at lowest res
-        in_ch_mult = (1,) + tuple(ch_mult)
-        block_in = ch * ch_mult[self.num_resolutions - 1]
-        curr_res = resolution // 2 ** (self.num_resolutions - 1)
-        self.z_shape = (1, z_channels, curr_res, curr_res)
-        print(
-            "Working with z of shape {} = {} dimensions.".format(
-                self.z_shape, np.prod(self.z_shape)
-            )
-        )
-
-        # z to block_in
-        self.conv_in = torch.nn.Conv2d(
-            z_channels, block_in, kernel_size=3, stride=1, padding=1
-        )
-
-        # middle
-        self.mid = nn.Module()
-        self.mid.block_1 = ResnetBlock(
-            in_channels=block_in,
-            out_channels=block_in,
-            temb_channels=self.temb_ch,
-            dropout=dropout,
-        )
-        self.mid.attn_1 = AttnBlock(block_in)
-        self.mid.block_2 = ResnetBlock(
-            in_channels=block_in,
-            out_channels=block_in,
-            temb_channels=self.temb_ch,
-            dropout=dropout,
-        )
-
-        # upsampling
-        self.up = nn.ModuleList()
-        for i_level in reversed(range(self.num_resolutions)):
-            block = nn.ModuleList()
-            attn = nn.ModuleList()
-            block_out = ch * ch_mult[i_level]
-            for i_block in range(self.num_res_blocks + 1):
-                block.append(
-                    ResnetBlock(
-                        in_channels=block_in,
-                        out_channels=block_out,
-                        temb_channels=self.temb_ch,
-                        dropout=dropout,
-                    )
-                )
-                block_in = block_out
-                if curr_res in attn_resolutions:
-                    attn.append(AttnBlock(block_in))
-            up = nn.Module()
-            up.block = block
-            up.attn = attn
-            if i_level != 0:
-                up.upsample = Upsample(block_in, resamp_with_conv)
-                curr_res = curr_res * 2
-            self.up.insert(0, up)  # prepend to get consistent order
-
-        # end
-        self.norm_out = Normalize(block_in)
-        self.conv_out = torch.nn.Conv2d(
-            block_in, out_ch, kernel_size=3, stride=1, padding=1
-        )
-
-    def forward(self, z):
-        # assert z.shape[1:] == self.z_shape[1:]
-        self.last_z_shape = z.shape
-
-        # timestep embedding
-        temb = None
-
-        # z to block_in
-        h = self.conv_in(z)
-
-        # middle
-        h = self.mid.block_1(h, temb)
-        h = self.mid.attn_1(h)
-        h = self.mid.block_2(h, temb)
-
-        # upsampling
-        for i_level in reversed(range(self.num_resolutions)):
-            for i_block in range(self.num_res_blocks + 1):
-                h = self.up[i_level].block[i_block](h, temb)
-                if len(self.up[i_level].attn) > 0:
-                    h = self.up[i_level].attn[i_block](h)
-            if i_level != 0:
-                h = self.up[i_level].upsample(h)
-
-        # end
-        if self.give_pre_end:
-            return h
-
-        h = self.norm_out(h)
-        h = nonlinearity(h)
-        h = self.conv_out(h)
-        return h
diff --git a/videotuna/models/cogvideo_sat/sgm/modules/cp_enc_dec.py b/videotuna/models/cogvideo_sat/sgm/modules/cp_enc_dec.py
deleted file mode 100644
index 931baf09..00000000
--- a/videotuna/models/cogvideo_sat/sgm/modules/cp_enc_dec.py
+++ /dev/null
@@ -1,187 +0,0 @@
-import math
-
-import torch
-import torch.distributed
-import torch.nn as nn
-
-from ..util import (
-    get_context_parallel_group,
-    get_context_parallel_rank,
-    get_context_parallel_world_size,
-)
-
-_USE_CP = True
-
-
-def cast_tuple(t, length=1):
-    return t if isinstance(t, tuple) else ((t,) * length)
-
-
-def divisible_by(num, den):
-    return (num % den) == 0
-
-
-def is_odd(n):
-    return not divisible_by(n, 2)
-
-
-def exists(v):
-    return v is not None
-
-
-def pair(t):
-    return t if isinstance(t, tuple) else (t, t)
-
-
-def get_timestep_embedding(timesteps, embedding_dim):
-    """
-    This matches the implementation in Denoising Diffusion Probabilistic Models:
-    From Fairseq.
-    Build sinusoidal embeddings.
-    This matches the implementation in tensor2tensor, but differs slightly
-    from the description in Section 3.5 of "Attention Is All You Need".
-    """
-    assert len(timesteps.shape) == 1
-
-    half_dim = embedding_dim // 2
-    emb = math.log(10000) / (half_dim - 1)
-    emb = torch.exp(torch.arange(half_dim, dtype=torch.float32) * -emb)
-    emb = emb.to(device=timesteps.device)
-    emb = timesteps.float()[:, None] * emb[None, :]
-    emb = torch.cat([torch.sin(emb), torch.cos(emb)], dim=1)
-    if embedding_dim % 2 == 1:  # zero pad
-        emb = torch.nn.functional.pad(emb, (0, 1, 0, 0))
-    return emb
-
-
-def nonlinearity(x):
-    # swish
-    return x * torch.sigmoid(x)
-
-
-def leaky_relu(p=0.1):
-    return nn.LeakyReLU(p)
-
-
-def _split(input_, dim):
-    cp_world_size = get_context_parallel_world_size()
-
-    if cp_world_size == 1:
-        return input_
-
-    cp_rank = get_context_parallel_rank()
-
-    # print('in _split, cp_rank:', cp_rank, 'input_size:', input_.shape)
-
-    inpu_first_frame_ = input_.transpose(0, dim)[:1].transpose(0, dim).contiguous()
-    input_ = input_.transpose(0, dim)[1:].transpose(0, dim).contiguous()
-    dim_size = input_.size()[dim] // cp_world_size
-
-    input_list = torch.split(input_, dim_size, dim=dim)
-    output = input_list[cp_rank]
-
-    if cp_rank == 0:
-        output = torch.cat([inpu_first_frame_, output], dim=dim)
-    output = output.contiguous()
-
-    # print('out _split, cp_rank:', cp_rank, 'output_size:', output.shape)
-
-    return output
-
-
-def _gather(input_, dim):
-    cp_world_size = get_context_parallel_world_size()
-
-    # Bypass the function if context parallel is 1
-    if cp_world_size == 1:
-        return input_
-
-    group = get_context_parallel_group()
-    cp_rank = get_context_parallel_rank()
-
-    # print('in _gather, cp_rank:', cp_rank, 'input_size:', input_.shape)
-
-    input_first_frame_ = input_.transpose(0, dim)[:1].transpose(0, dim).contiguous()
-    if cp_rank == 0:
-        input_ = input_.transpose(0, dim)[1:].transpose(0, dim).contiguous()
-
-    tensor_list = [
-        torch.empty_like(torch.cat([input_first_frame_, input_], dim=dim))
-    ] + [torch.empty_like(input_) for _ in range(cp_world_size - 1)]
-
-    if cp_rank == 0:
-        input_ = torch.cat([input_first_frame_, input_], dim=dim)
-
-    tensor_list[cp_rank] = input_
-    torch.distributed.all_gather(tensor_list, input_, group=group)
-
-    output = torch.cat(tensor_list, dim=dim).contiguous()
-
-    # print('out _gather, cp_rank:', cp_rank, 'output_size:', output.shape)
-
-    return output
-
-
-def _conv_split(input_, dim, kernel_size):
-    cp_world_size = get_context_parallel_world_size()
-
-    # Bypass the function if context parallel is 1
-    if cp_world_size == 1:
-        return input_
-
-    # print('in _conv_split, cp_rank:', cp_rank, 'input_size:', input_.shape)
-
-    cp_rank = get_context_parallel_rank()
-
-    dim_size = (input_.size()[dim] - kernel_size) // cp_world_size
-
-    if cp_rank == 0:
-        output = input_.transpose(dim, 0)[: dim_size + kernel_size].transpose(dim, 0)
-    else:
-        output = input_.transpose(dim, 0)[
-            cp_rank * dim_size + 1 : (cp_rank + 1) * dim_size + kernel_size
-        ].transpose(dim, 0)
-    output = output.contiguous()
-
-    # print('out _conv_split, cp_rank:', cp_rank, 'input_size:', output.shape)
-
-    return output
-
-
-def _conv_gather(input_, dim, kernel_size):
-    cp_world_size = get_context_parallel_world_size()
-
-    # Bypass the function if context parallel is 1
-    if cp_world_size == 1:
-        return input_
-
-    group = get_context_parallel_group()
-    cp_rank = get_context_parallel_rank()
-
-    # print('in _conv_gather, cp_rank:', cp_rank, 'input_size:', input_.shape)
-
-    input_first_kernel_ = (
-        input_.transpose(0, dim)[:kernel_size].transpose(0, dim).contiguous()
-    )
-    if cp_rank == 0:
-        input_ = input_.transpose(0, dim)[kernel_size:].transpose(0, dim).contiguous()
-    else:
-        input_ = (
-            input_.transpose(0, dim)[kernel_size - 1 :].transpose(0, dim).contiguous()
-        )
-
-    tensor_list = [
-        torch.empty_like(torch.cat([input_first_kernel_, input_], dim=dim))
-    ] + [torch.empty_like(input_) for _ in range(cp_world_size - 1)]
-    if cp_rank == 0:
-        input_ = torch.cat([input_first_kernel_, input_], dim=dim)
-
-    tensor_list[cp_rank] = input_
-    torch.distributed.all_gather(tensor_list, input_, group=group)
-
-    # Note: torch.cat already creates a contiguous tensor.
-    output = torch.cat(tensor_list, dim=dim).contiguous()
-
-    # print('out _conv_gather, cp_rank:', cp_rank, 'input_size:', output.shape)
-
-    return output
diff --git a/videotuna/models/cogvideo_sat/sgm/modules/diffusionmodules/__init__.py b/videotuna/models/cogvideo_sat/sgm/modules/diffusionmodules/__init__.py
deleted file mode 100644
index fccebf95..00000000
--- a/videotuna/models/cogvideo_sat/sgm/modules/diffusionmodules/__init__.py
+++ /dev/null
@@ -1,6 +0,0 @@
-from .denoiser import Denoiser
-from .discretizer import Discretization
-from .model import Decoder, Encoder, Model
-from .openaimodel import UNetModel
-from .sampling import BaseDiffusionSampler
-from .wrappers import OpenAIWrapper
diff --git a/videotuna/models/cogvideo_sat/sgm/modules/diffusionmodules/denoiser.py b/videotuna/models/cogvideo_sat/sgm/modules/diffusionmodules/denoiser.py
deleted file mode 100644
index ffece73c..00000000
--- a/videotuna/models/cogvideo_sat/sgm/modules/diffusionmodules/denoiser.py
+++ /dev/null
@@ -1,77 +0,0 @@
-from typing import Dict, Union
-
-import torch
-import torch.nn as nn
-
-from ...util import append_dims, instantiate_from_config
-
-
-class Denoiser(nn.Module):
-    def __init__(self, weighting_config, scaling_config):
-        super().__init__()
-
-        self.weighting = instantiate_from_config(weighting_config)
-        self.scaling = instantiate_from_config(scaling_config)
-
-    def possibly_quantize_sigma(self, sigma):
-        return sigma
-
-    def possibly_quantize_c_noise(self, c_noise):
-        return c_noise
-
-    def w(self, sigma):
-        return self.weighting(sigma)
-
-    def forward(
-        self,
-        network: nn.Module,
-        input: torch.Tensor,
-        sigma: torch.Tensor,
-        cond: Dict,
-        **additional_model_inputs,
-    ) -> torch.Tensor:
-        sigma = self.possibly_quantize_sigma(sigma)
-        sigma_shape = sigma.shape
-        sigma = append_dims(sigma, input.ndim)
-        c_skip, c_out, c_in, c_noise = self.scaling(sigma, **additional_model_inputs)
-        c_noise = self.possibly_quantize_c_noise(c_noise.reshape(sigma_shape))
-        return (
-            network(input * c_in, c_noise, cond, **additional_model_inputs) * c_out
-            + input * c_skip
-        )
-
-
-class DiscreteDenoiser(Denoiser):
-    def __init__(
-        self,
-        weighting_config,
-        scaling_config,
-        num_idx,
-        discretization_config,
-        do_append_zero=False,
-        quantize_c_noise=True,
-        flip=True,
-    ):
-        super().__init__(weighting_config, scaling_config)
-        sigmas = instantiate_from_config(discretization_config)(
-            num_idx, do_append_zero=do_append_zero, flip=flip
-        )
-        self.sigmas = sigmas
-        # self.register_buffer("sigmas", sigmas)
-        self.quantize_c_noise = quantize_c_noise
-
-    def sigma_to_idx(self, sigma):
-        dists = sigma - self.sigmas.to(sigma.device)[:, None]
-        return dists.abs().argmin(dim=0).view(sigma.shape)
-
-    def idx_to_sigma(self, idx):
-        return self.sigmas.to(idx.device)[idx]
-
-    def possibly_quantize_sigma(self, sigma):
-        return self.idx_to_sigma(self.sigma_to_idx(sigma))
-
-    def possibly_quantize_c_noise(self, c_noise):
-        if self.quantize_c_noise:
-            return self.sigma_to_idx(c_noise)
-        else:
-            return c_noise
diff --git a/videotuna/models/cogvideo_sat/sgm/modules/diffusionmodules/denoiser_scaling.py b/videotuna/models/cogvideo_sat/sgm/modules/diffusionmodules/denoiser_scaling.py
deleted file mode 100644
index 05362a00..00000000
--- a/videotuna/models/cogvideo_sat/sgm/modules/diffusionmodules/denoiser_scaling.py
+++ /dev/null
@@ -1,70 +0,0 @@
-from abc import ABC, abstractmethod
-from typing import Any, Tuple
-
-import torch
-
-
-class DenoiserScaling(ABC):
-    @abstractmethod
-    def __call__(
-        self, sigma: torch.Tensor
-    ) -> Tuple[torch.Tensor, torch.Tensor, torch.Tensor, torch.Tensor]:
-        pass
-
-
-class EDMScaling:
-    def __init__(self, sigma_data: float = 0.5):
-        self.sigma_data = sigma_data
-
-    def __call__(
-        self, sigma: torch.Tensor
-    ) -> Tuple[torch.Tensor, torch.Tensor, torch.Tensor, torch.Tensor]:
-        c_skip = self.sigma_data**2 / (sigma**2 + self.sigma_data**2)
-        c_out = sigma * self.sigma_data / (sigma**2 + self.sigma_data**2) ** 0.5
-        c_in = 1 / (sigma**2 + self.sigma_data**2) ** 0.5
-        c_noise = 0.25 * sigma.log()
-        return c_skip, c_out, c_in, c_noise
-
-
-class EpsScaling:
-    def __call__(
-        self, sigma: torch.Tensor
-    ) -> Tuple[torch.Tensor, torch.Tensor, torch.Tensor, torch.Tensor]:
-        c_skip = torch.ones_like(sigma, device=sigma.device)
-        c_out = -sigma
-        c_in = 1 / (sigma**2 + 1.0) ** 0.5
-        c_noise = sigma.clone()
-        return c_skip, c_out, c_in, c_noise
-
-
-class VScaling:
-    def __call__(
-        self, sigma: torch.Tensor
-    ) -> Tuple[torch.Tensor, torch.Tensor, torch.Tensor, torch.Tensor]:
-        c_skip = 1.0 / (sigma**2 + 1.0)
-        c_out = -sigma / (sigma**2 + 1.0) ** 0.5
-        c_in = 1.0 / (sigma**2 + 1.0) ** 0.5
-        c_noise = sigma.clone()
-        return c_skip, c_out, c_in, c_noise
-
-
-class VScalingWithEDMcNoise(DenoiserScaling):
-    def __call__(
-        self, sigma: torch.Tensor
-    ) -> Tuple[torch.Tensor, torch.Tensor, torch.Tensor, torch.Tensor]:
-        c_skip = 1.0 / (sigma**2 + 1.0)
-        c_out = -sigma / (sigma**2 + 1.0) ** 0.5
-        c_in = 1.0 / (sigma**2 + 1.0) ** 0.5
-        c_noise = 0.25 * sigma.log()
-        return c_skip, c_out, c_in, c_noise
-
-
-class VideoScaling:  # similar to VScaling
-    def __call__(
-        self, alphas_cumprod_sqrt: torch.Tensor, **additional_model_inputs
-    ) -> Tuple[torch.Tensor, torch.Tensor, torch.Tensor, torch.Tensor]:
-        c_skip = alphas_cumprod_sqrt
-        c_out = -((1 - alphas_cumprod_sqrt**2) ** 0.5)
-        c_in = torch.ones_like(alphas_cumprod_sqrt, device=alphas_cumprod_sqrt.device)
-        c_noise = additional_model_inputs["idx"].clone()
-        return c_skip, c_out, c_in, c_noise
diff --git a/videotuna/models/cogvideo_sat/sgm/modules/diffusionmodules/denoiser_weighting.py b/videotuna/models/cogvideo_sat/sgm/modules/diffusionmodules/denoiser_weighting.py
deleted file mode 100644
index b8b03ca5..00000000
--- a/videotuna/models/cogvideo_sat/sgm/modules/diffusionmodules/denoiser_weighting.py
+++ /dev/null
@@ -1,24 +0,0 @@
-import torch
-
-
-class UnitWeighting:
-    def __call__(self, sigma):
-        return torch.ones_like(sigma, device=sigma.device)
-
-
-class EDMWeighting:
-    def __init__(self, sigma_data=0.5):
-        self.sigma_data = sigma_data
-
-    def __call__(self, sigma):
-        return (sigma**2 + self.sigma_data**2) / (sigma * self.sigma_data) ** 2
-
-
-class VWeighting(EDMWeighting):
-    def __init__(self):
-        super().__init__(sigma_data=1.0)
-
-
-class EpsWeighting:
-    def __call__(self, sigma):
-        return sigma**-2.0
diff --git a/videotuna/models/cogvideo_sat/sgm/modules/diffusionmodules/discretizer.py b/videotuna/models/cogvideo_sat/sgm/modules/diffusionmodules/discretizer.py
deleted file mode 100644
index e4711327..00000000
--- a/videotuna/models/cogvideo_sat/sgm/modules/diffusionmodules/discretizer.py
+++ /dev/null
@@ -1,141 +0,0 @@
-from abc import abstractmethod
-from functools import partial
-
-import numpy as np
-import torch
-
-from ...modules.diffusionmodules.util import make_beta_schedule
-from ...util import append_zero
-
-
-def generate_roughly_equally_spaced_steps(
-    num_substeps: int, max_step: int
-) -> np.ndarray:
-    return np.linspace(max_step - 1, 0, num_substeps, endpoint=False).astype(int)[::-1]
-
-
-class Discretization:
-    def __call__(
-        self, n, do_append_zero=True, device="cpu", flip=False, return_idx=False
-    ):
-        if return_idx:
-            sigmas, idx = self.get_sigmas(n, device=device, return_idx=return_idx)
-        else:
-            sigmas = self.get_sigmas(n, device=device, return_idx=return_idx)
-        sigmas = append_zero(sigmas) if do_append_zero else sigmas
-        if return_idx:
-            return sigmas if not flip else torch.flip(sigmas, (0,)), idx
-        else:
-            return sigmas if not flip else torch.flip(sigmas, (0,))
-
-    @abstractmethod
-    def get_sigmas(self, n, device):
-        pass
-
-
-class EDMDiscretization(Discretization):
-    def __init__(self, sigma_min=0.002, sigma_max=80.0, rho=7.0):
-        self.sigma_min = sigma_min
-        self.sigma_max = sigma_max
-        self.rho = rho
-
-    def get_sigmas(self, n, device="cpu"):
-        ramp = torch.linspace(0, 1, n, device=device)
-        min_inv_rho = self.sigma_min ** (1 / self.rho)
-        max_inv_rho = self.sigma_max ** (1 / self.rho)
-        sigmas = (max_inv_rho + ramp * (min_inv_rho - max_inv_rho)) ** self.rho
-        return sigmas
-
-
-class LegacyDDPMDiscretization(Discretization):
-    def __init__(
-        self,
-        linear_start=0.00085,
-        linear_end=0.0120,
-        num_timesteps=1000,
-    ):
-        super().__init__()
-        self.num_timesteps = num_timesteps
-        betas = make_beta_schedule(
-            "linear", num_timesteps, linear_start=linear_start, linear_end=linear_end
-        )
-        alphas = 1.0 - betas
-        self.alphas_cumprod = np.cumprod(alphas, axis=0)
-        self.to_torch = partial(torch.tensor, dtype=torch.float32)
-
-    def get_sigmas(self, n, device="cpu"):
-        if n < self.num_timesteps:
-            timesteps = generate_roughly_equally_spaced_steps(n, self.num_timesteps)
-            alphas_cumprod = self.alphas_cumprod[timesteps]
-        elif n == self.num_timesteps:
-            alphas_cumprod = self.alphas_cumprod
-        else:
-            raise ValueError
-
-        to_torch = partial(torch.tensor, dtype=torch.float32, device=device)
-        sigmas = to_torch((1 - alphas_cumprod) / alphas_cumprod) ** 0.5
-        return torch.flip(sigmas, (0,))  # sigma_t: 14.4 -> 0.029
-
-
-class ZeroSNRDDPMDiscretization(Discretization):
-    def __init__(
-        self,
-        linear_start=0.00085,
-        linear_end=0.0120,
-        num_timesteps=1000,
-        shift_scale=1.0,  # noise schedule t_n -> t_m: logSNR(t_m) = logSNR(t_n) - log(shift_scale)
-        keep_start=False,
-        post_shift=False,
-    ):
-        super().__init__()
-        if keep_start and not post_shift:
-            linear_start = linear_start / (
-                shift_scale + (1 - shift_scale) * linear_start
-            )
-        self.num_timesteps = num_timesteps
-        betas = make_beta_schedule(
-            "linear", num_timesteps, linear_start=linear_start, linear_end=linear_end
-        )
-        alphas = 1.0 - betas
-        self.alphas_cumprod = np.cumprod(alphas, axis=0)
-        self.to_torch = partial(torch.tensor, dtype=torch.float32)
-
-        # SNR shift
-        if not post_shift:
-            self.alphas_cumprod = self.alphas_cumprod / (
-                shift_scale + (1 - shift_scale) * self.alphas_cumprod
-            )
-
-        self.post_shift = post_shift
-        self.shift_scale = shift_scale
-
-    def get_sigmas(self, n, device="cpu", return_idx=False):
-        if n < self.num_timesteps:
-            timesteps = generate_roughly_equally_spaced_steps(n, self.num_timesteps)
-            alphas_cumprod = self.alphas_cumprod[timesteps]
-        elif n == self.num_timesteps:
-            alphas_cumprod = self.alphas_cumprod
-        else:
-            raise ValueError
-
-        to_torch = partial(torch.tensor, dtype=torch.float32, device=device)
-        alphas_cumprod = to_torch(alphas_cumprod)
-        alphas_cumprod_sqrt = alphas_cumprod.sqrt()
-        alphas_cumprod_sqrt_0 = alphas_cumprod_sqrt[0].clone()
-        alphas_cumprod_sqrt_T = alphas_cumprod_sqrt[-1].clone()
-
-        alphas_cumprod_sqrt -= alphas_cumprod_sqrt_T
-        alphas_cumprod_sqrt *= alphas_cumprod_sqrt_0 / (
-            alphas_cumprod_sqrt_0 - alphas_cumprod_sqrt_T
-        )
-
-        if self.post_shift:
-            alphas_cumprod_sqrt = (
-                alphas_cumprod_sqrt**2
-                / (self.shift_scale + (1 - self.shift_scale) * alphas_cumprod_sqrt**2)
-            ) ** 0.5
-
-        if return_idx:
-            return torch.flip(alphas_cumprod_sqrt, (0,)), timesteps
-        else:
-            return torch.flip(alphas_cumprod_sqrt, (0,))  # sqrt(alpha_t): 0 -> 0.99
diff --git a/videotuna/models/cogvideo_sat/sgm/modules/diffusionmodules/guiders.py b/videotuna/models/cogvideo_sat/sgm/modules/diffusionmodules/guiders.py
deleted file mode 100644
index 4175e133..00000000
--- a/videotuna/models/cogvideo_sat/sgm/modules/diffusionmodules/guiders.py
+++ /dev/null
@@ -1,94 +0,0 @@
-import logging
-import math
-from abc import ABC, abstractmethod
-from functools import partial
-from typing import Dict, List, Optional, Tuple, Union
-
-import torch
-from einops import rearrange, repeat
-
-from ...util import append_dims, default, instantiate_from_config
-
-
-class Guider(ABC):
-    @abstractmethod
-    def __call__(self, x: torch.Tensor, sigma: float) -> torch.Tensor:
-        pass
-
-    def prepare_inputs(
-        self, x: torch.Tensor, s: float, c: Dict, uc: Dict
-    ) -> Tuple[torch.Tensor, float, Dict]:
-        pass
-
-
-class VanillaCFG:
-    """
-    implements parallelized CFG
-    """
-
-    def __init__(self, scale, dyn_thresh_config=None):
-        self.scale = scale
-        scale_schedule = lambda scale, sigma: scale  # independent of step
-        self.scale_schedule = partial(scale_schedule, scale)
-        self.dyn_thresh = instantiate_from_config(
-            default(
-                dyn_thresh_config,
-                {
-                    "target": "sgm.modules.diffusionmodules.sampling_utils.NoDynamicThresholding"
-                },
-            )
-        )
-
-    def __call__(self, x, sigma, scale=None):
-        x_u, x_c = x.chunk(2)
-        scale_value = default(scale, self.scale_schedule(sigma))
-        x_pred = self.dyn_thresh(x_u, x_c, scale_value)
-        return x_pred
-
-    def prepare_inputs(self, x, s, c, uc):
-        c_out = dict()
-
-        for k in c:
-            if k in ["vector", "crossattn", "concat"]:
-                c_out[k] = torch.cat((uc[k], c[k]), 0)
-            else:
-                assert c[k] == uc[k]
-                c_out[k] = c[k]
-        return torch.cat([x] * 2), torch.cat([s] * 2), c_out
-
-
-class DynamicCFG(VanillaCFG):
-    def __init__(self, scale, exp, num_steps, dyn_thresh_config=None):
-        super().__init__(scale, dyn_thresh_config)
-        scale_schedule = (
-            lambda scale, sigma, step_index: 1
-            + scale * (1 - math.cos(math.pi * (step_index / num_steps) ** exp)) / 2
-        )
-        self.scale_schedule = partial(scale_schedule, scale)
-        self.dyn_thresh = instantiate_from_config(
-            default(
-                dyn_thresh_config,
-                {
-                    "target": "sgm.modules.diffusionmodules.sampling_utils.NoDynamicThresholding"
-                },
-            )
-        )
-
-    def __call__(self, x, sigma, step_index, scale=None):
-        x_u, x_c = x.chunk(2)
-        scale_value = self.scale_schedule(sigma, step_index.item())
-        x_pred = self.dyn_thresh(x_u, x_c, scale_value)
-        return x_pred
-
-
-class IdentityGuider:
-    def __call__(self, x, sigma):
-        return x
-
-    def prepare_inputs(self, x, s, c, uc):
-        c_out = dict()
-
-        for k in c:
-            c_out[k] = c[k]
-
-        return x, s, c_out
diff --git a/videotuna/models/cogvideo_sat/sgm/modules/diffusionmodules/lora.py b/videotuna/models/cogvideo_sat/sgm/modules/diffusionmodules/lora.py
deleted file mode 100644
index d3871eb4..00000000
--- a/videotuna/models/cogvideo_sat/sgm/modules/diffusionmodules/lora.py
+++ /dev/null
@@ -1,420 +0,0 @@
-# Copyright 2023 The HuggingFace Team. All rights reserved.
-#
-# Licensed under the Apache License, Version 2.0 (the "License");
-# you may not use this file except in compliance with the License.
-# You may obtain a copy of the License at
-#
-#     http://www.apache.org/licenses/LICENSE-2.0
-#
-# Unless required by applicable law or agreed to in writing, software
-# distributed under the License is distributed on an "AS IS" BASIS,
-# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
-# See the License for the specific language governing permissions and
-# limitations under the License.
-
-from typing import Callable, Dict, List, Optional, Set, Tuple, Type, Union
-
-import torch
-import torch.nn.functional as F
-from torch import nn
-
-
-class LoRALinearLayer(nn.Module):
-    def __init__(
-        self,
-        in_features,
-        out_features,
-        rank=4,
-        network_alpha=None,
-        device=None,
-        dtype=None,
-    ):
-        super().__init__()
-
-        self.down = nn.Linear(in_features, rank, bias=False, device=device, dtype=dtype)
-        self.up = nn.Linear(rank, out_features, bias=False, device=device, dtype=dtype)
-        # This value has the same meaning as the `--network_alpha` option in the kohya-ss trainer script.
-        # See https://github.com/darkstorm2150/sd-scripts/blob/main/docs/train_network_README-en.md#execute-learning
-        self.network_alpha = network_alpha
-        self.rank = rank
-        self.out_features = out_features
-        self.in_features = in_features
-
-        nn.init.normal_(self.down.weight, std=1 / rank)
-        nn.init.zeros_(self.up.weight)
-
-    def forward(self, hidden_states):
-        orig_dtype = hidden_states.dtype
-        dtype = self.down.weight.dtype
-
-        down_hidden_states = self.down(hidden_states.to(dtype))
-        up_hidden_states = self.up(down_hidden_states)
-
-        if self.network_alpha is not None:
-            up_hidden_states *= self.network_alpha / self.rank
-
-        return up_hidden_states.to(orig_dtype)
-
-
-class LoRAConv2dLayer(nn.Module):
-    def __init__(
-        self,
-        in_features,
-        out_features,
-        rank=4,
-        kernel_size=(1, 1),
-        stride=(1, 1),
-        padding=0,
-        network_alpha=None,
-    ):
-        super().__init__()
-
-        self.down = nn.Conv2d(
-            in_features,
-            rank,
-            kernel_size=kernel_size,
-            stride=stride,
-            padding=padding,
-            bias=False,
-        )
-        # according to the official kohya_ss trainer kernel_size are always fixed for the up layer
-        # # see: https://github.com/bmaltais/kohya_ss/blob/2accb1305979ba62f5077a23aabac23b4c37e935/networks/lora_diffusers.py#L129
-        self.up = nn.Conv2d(
-            rank, out_features, kernel_size=(1, 1), stride=(1, 1), bias=False
-        )
-
-        # This value has the same meaning as the `--network_alpha` option in the kohya-ss trainer script.
-        # See https://github.com/darkstorm2150/sd-scripts/blob/main/docs/train_network_README-en.md#execute-learning
-        self.network_alpha = network_alpha
-        self.rank = rank
-
-        nn.init.normal_(self.down.weight, std=1 / rank)
-        nn.init.zeros_(self.up.weight)
-
-    def forward(self, hidden_states):
-        orig_dtype = hidden_states.dtype
-        dtype = self.down.weight.dtype
-
-        down_hidden_states = self.down(hidden_states.to(dtype))
-        up_hidden_states = self.up(down_hidden_states)
-
-        if self.network_alpha is not None:
-            up_hidden_states *= self.network_alpha / self.rank
-
-        return up_hidden_states.to(orig_dtype)
-
-
-class LoRACompatibleConv(nn.Conv2d):
-    """
-    A convolutional layer that can be used with LoRA.
-    """
-
-    def __init__(
-        self,
-        *args,
-        lora_layer: Optional[LoRAConv2dLayer] = None,
-        scale: float = 1.0,
-        **kwargs,
-    ):
-        super().__init__(*args, **kwargs)
-        self.lora_layer = lora_layer
-        self.scale = scale
-
-    def set_lora_layer(self, lora_layer: Optional[LoRAConv2dLayer]):
-        self.lora_layer = lora_layer
-
-    def _fuse_lora(self, lora_scale=1.0):
-        if self.lora_layer is None:
-            return
-
-        dtype, device = self.weight.data.dtype, self.weight.data.device
-
-        w_orig = self.weight.data.float()
-        w_up = self.lora_layer.up.weight.data.float()
-        w_down = self.lora_layer.down.weight.data.float()
-
-        if self.lora_layer.network_alpha is not None:
-            w_up = w_up * self.lora_layer.network_alpha / self.lora_layer.rank
-
-        fusion = torch.mm(w_up.flatten(start_dim=1), w_down.flatten(start_dim=1))
-        fusion = fusion.reshape((w_orig.shape))
-        fused_weight = w_orig + (lora_scale * fusion)
-        self.weight.data = fused_weight.to(device=device, dtype=dtype)
-
-        # we can drop the lora layer now
-        self.lora_layer = None
-
-        # offload the up and down matrices to CPU to not blow the memory
-        self.w_up = w_up.cpu()
-        self.w_down = w_down.cpu()
-        self._lora_scale = lora_scale
-
-    def _unfuse_lora(self):
-        if not (hasattr(self, "w_up") and hasattr(self, "w_down")):
-            return
-
-        fused_weight = self.weight.data
-        dtype, device = fused_weight.data.dtype, fused_weight.data.device
-
-        self.w_up = self.w_up.to(device=device).float()
-        self.w_down = self.w_down.to(device).float()
-
-        fusion = torch.mm(
-            self.w_up.flatten(start_dim=1), self.w_down.flatten(start_dim=1)
-        )
-        fusion = fusion.reshape((fused_weight.shape))
-        unfused_weight = fused_weight.float() - (self._lora_scale * fusion)
-        self.weight.data = unfused_weight.to(device=device, dtype=dtype)
-
-        self.w_up = None
-        self.w_down = None
-
-    def forward(self, hidden_states, scale: float = None):
-        if scale is None:
-            scale = self.scale
-        if self.lora_layer is None:
-            # make sure to the functional Conv2D function as otherwise torch.compile's graph will break
-            # see: https://github.com/huggingface/diffusers/pull/4315
-            return F.conv2d(
-                hidden_states,
-                self.weight,
-                self.bias,
-                self.stride,
-                self.padding,
-                self.dilation,
-                self.groups,
-            )
-        else:
-            return super().forward(hidden_states) + (
-                scale * self.lora_layer(hidden_states)
-            )
-
-
-class LoRACompatibleLinear(nn.Linear):
-    """
-    A Linear layer that can be used with LoRA.
-    """
-
-    def __init__(
-        self,
-        *args,
-        lora_layer: Optional[LoRALinearLayer] = None,
-        scale: float = 1.0,
-        **kwargs,
-    ):
-        super().__init__(*args, **kwargs)
-        self.lora_layer = lora_layer
-        self.scale = scale
-
-    def set_lora_layer(self, lora_layer: Optional[LoRALinearLayer]):
-        self.lora_layer = lora_layer
-
-    def _fuse_lora(self, lora_scale=1.0):
-        if self.lora_layer is None:
-            return
-
-        dtype, device = self.weight.data.dtype, self.weight.data.device
-
-        w_orig = self.weight.data.float()
-        w_up = self.lora_layer.up.weight.data.float()
-        w_down = self.lora_layer.down.weight.data.float()
-
-        if self.lora_layer.network_alpha is not None:
-            w_up = w_up * self.lora_layer.network_alpha / self.lora_layer.rank
-
-        fused_weight = w_orig + (
-            lora_scale * torch.bmm(w_up[None, :], w_down[None, :])[0]
-        )
-        self.weight.data = fused_weight.to(device=device, dtype=dtype)
-
-        # we can drop the lora layer now
-        self.lora_layer = None
-
-        # offload the up and down matrices to CPU to not blow the memory
-        self.w_up = w_up.cpu()
-        self.w_down = w_down.cpu()
-        self._lora_scale = lora_scale
-
-    def _unfuse_lora(self):
-        if not (hasattr(self, "w_up") and hasattr(self, "w_down")):
-            return
-
-        fused_weight = self.weight.data
-        dtype, device = fused_weight.dtype, fused_weight.device
-
-        w_up = self.w_up.to(device=device).float()
-        w_down = self.w_down.to(device).float()
-
-        unfused_weight = fused_weight.float() - (
-            self._lora_scale * torch.bmm(w_up[None, :], w_down[None, :])[0]
-        )
-        self.weight.data = unfused_weight.to(device=device, dtype=dtype)
-
-        self.w_up = None
-        self.w_down = None
-
-    def forward(self, hidden_states, scale: float = None):
-        if scale is None:
-            scale = self.scale
-        if self.lora_layer is None:
-            out = super().forward(hidden_states)
-            return out
-        else:
-            out = super().forward(hidden_states) + (
-                scale * self.lora_layer(hidden_states)
-            )
-            return out
-
-
-def _find_children(
-    model,
-    search_class: List[Type[nn.Module]] = [nn.Linear],
-):
-    """
-    Find all modules of a certain class (or union of classes).
-
-    Returns all matching modules, along with the parent of those moduless and the
-    names they are referenced by.
-    """
-    # For each target find every linear_class module that isn't a child of a LoraInjectedLinear
-    for parent in model.modules():
-        for name, module in parent.named_children():
-            if any([isinstance(module, _class) for _class in search_class]):
-                yield parent, name, module
-
-
-def _find_modules_v2(
-    model,
-    ancestor_class: Optional[Set[str]] = None,
-    search_class: List[Type[nn.Module]] = [nn.Linear],
-    exclude_children_of: Optional[List[Type[nn.Module]]] = [
-        LoRACompatibleLinear,
-        LoRACompatibleConv,
-        LoRALinearLayer,
-        LoRAConv2dLayer,
-    ],
-):
-    """
-    Find all modules of a certain class (or union of classes) that are direct or
-    indirect descendants of other modules of a certain class (or union of classes).
-
-    Returns all matching modules, along with the parent of those moduless and the
-    names they are referenced by.
-    """
-
-    # Get the targets we should replace all linears under
-    if ancestor_class is not None:
-        ancestors = (
-            module
-            for module in model.modules()
-            if module.__class__.__name__ in ancestor_class
-        )
-    else:
-        # this, incase you want to naively iterate over all modules.
-        ancestors = [module for module in model.modules()]
-
-    # For each target find every linear_class module that isn't a child of a LoraInjectedLinear
-    for ancestor in ancestors:
-        for fullname, module in ancestor.named_modules():
-            if any([isinstance(module, _class) for _class in search_class]):
-                # Find the direct parent if this is a descendant, not a child, of target
-                *path, name = fullname.split(".")
-                parent = ancestor
-                flag = False
-                while path:
-                    try:
-                        parent = parent.get_submodule(path.pop(0))
-                    except:
-                        flag = True
-                        break
-                if flag:
-                    continue
-                # Skip this linear if it's a child of a LoraInjectedLinear
-                if exclude_children_of and any(
-                    [isinstance(parent, _class) for _class in exclude_children_of]
-                ):
-                    continue
-                # Otherwise, yield it
-                yield parent, name, module
-
-
-_find_modules = _find_modules_v2
-
-
-def inject_trainable_lora_extended(
-    model: nn.Module,
-    target_replace_module: Set[str] = None,
-    rank: int = 4,
-    scale: float = 1.0,
-):
-    for _module, name, _child_module in _find_modules(
-        model, target_replace_module, search_class=[nn.Linear, nn.Conv2d]
-    ):
-        if _child_module.__class__ == nn.Linear:
-            weight = _child_module.weight
-            bias = _child_module.bias
-            lora_layer = LoRALinearLayer(
-                in_features=_child_module.in_features,
-                out_features=_child_module.out_features,
-                rank=rank,
-            )
-            _tmp = (
-                LoRACompatibleLinear(
-                    _child_module.in_features,
-                    _child_module.out_features,
-                    lora_layer=lora_layer,
-                    scale=scale,
-                )
-                .to(weight.dtype)
-                .to(weight.device)
-            )
-            _tmp.weight = weight
-            if bias is not None:
-                _tmp.bias = bias
-        elif _child_module.__class__ == nn.Conv2d:
-            weight = _child_module.weight
-            bias = _child_module.bias
-            lora_layer = LoRAConv2dLayer(
-                in_features=_child_module.in_channels,
-                out_features=_child_module.out_channels,
-                rank=rank,
-                kernel_size=_child_module.kernel_size,
-                stride=_child_module.stride,
-                padding=_child_module.padding,
-            )
-            _tmp = (
-                LoRACompatibleConv(
-                    _child_module.in_channels,
-                    _child_module.out_channels,
-                    kernel_size=_child_module.kernel_size,
-                    stride=_child_module.stride,
-                    padding=_child_module.padding,
-                    lora_layer=lora_layer,
-                    scale=scale,
-                )
-                .to(weight.dtype)
-                .to(weight.device)
-            )
-            _tmp.weight = weight
-            if bias is not None:
-                _tmp.bias = bias
-        else:
-            continue
-
-        _module._modules[name] = _tmp
-        # print('injecting lora layer to', _module, name)
-
-    return
-
-
-def update_lora_scale(
-    model: nn.Module,
-    target_module: Set[str] = None,
-    scale: float = 1.0,
-):
-    for _module, name, _child_module in _find_modules(
-        model, target_module, search_class=[LoRACompatibleLinear, LoRACompatibleConv]
-    ):
-        _child_module.scale = scale
-
-    return
diff --git a/videotuna/models/cogvideo_sat/sgm/modules/diffusionmodules/loss.py b/videotuna/models/cogvideo_sat/sgm/modules/diffusionmodules/loss.py
deleted file mode 100644
index 3f8d5564..00000000
--- a/videotuna/models/cogvideo_sat/sgm/modules/diffusionmodules/loss.py
+++ /dev/null
@@ -1,152 +0,0 @@
-from typing import List, Optional, Union
-
-import torch
-import torch.nn as nn
-from omegaconf import ListConfig
-from sat import mpu
-
-from ...modules.autoencoding.lpips.loss.lpips import LPIPS
-from ...util import append_dims, instantiate_from_config
-
-
-class StandardDiffusionLoss(nn.Module):
-    def __init__(
-        self,
-        sigma_sampler_config,
-        type="l2",
-        offset_noise_level=0.0,
-        batch2model_keys: Optional[Union[str, List[str], ListConfig]] = None,
-    ):
-        super().__init__()
-
-        assert type in ["l2", "l1", "lpips"]
-
-        self.sigma_sampler = instantiate_from_config(sigma_sampler_config)
-
-        self.type = type
-        self.offset_noise_level = offset_noise_level
-
-        if type == "lpips":
-            self.lpips = LPIPS().eval()
-
-        if not batch2model_keys:
-            batch2model_keys = []
-
-        if isinstance(batch2model_keys, str):
-            batch2model_keys = [batch2model_keys]
-
-        self.batch2model_keys = set(batch2model_keys)
-
-    def __call__(self, network, denoiser, conditioner, input, batch):
-        cond = conditioner(batch)
-        additional_model_inputs = {
-            key: batch[key] for key in self.batch2model_keys.intersection(batch)
-        }
-
-        sigmas = self.sigma_sampler(input.shape[0]).to(input.device)
-        noise = torch.randn_like(input)
-        if self.offset_noise_level > 0.0:
-            noise = (
-                noise
-                + append_dims(torch.randn(input.shape[0]).to(input.device), input.ndim)
-                * self.offset_noise_level
-            )
-            noise = noise.to(input.dtype)
-        noised_input = input.float() + noise * append_dims(sigmas, input.ndim)
-        model_output = denoiser(
-            network, noised_input, sigmas, cond, **additional_model_inputs
-        )
-        w = append_dims(denoiser.w(sigmas), input.ndim)
-        return self.get_loss(model_output, input, w)
-
-    def get_loss(self, model_output, target, w):
-        if self.type == "l2":
-            return torch.mean(
-                (w * (model_output - target) ** 2).reshape(target.shape[0], -1), 1
-            )
-        elif self.type == "l1":
-            return torch.mean(
-                (w * (model_output - target).abs()).reshape(target.shape[0], -1), 1
-            )
-        elif self.type == "lpips":
-            loss = self.lpips(model_output, target).reshape(-1)
-            return loss
-
-
-class VideoDiffusionLoss(StandardDiffusionLoss):
-    def __init__(
-        self,
-        block_scale=None,
-        block_size=None,
-        min_snr_value=None,
-        fixed_frames=0,
-        **kwargs,
-    ):
-        self.fixed_frames = fixed_frames
-        self.block_scale = block_scale
-        self.block_size = block_size
-        self.min_snr_value = min_snr_value
-        super().__init__(**kwargs)
-
-    def __call__(self, network, denoiser, conditioner, input, batch):
-        cond = conditioner(batch)
-        additional_model_inputs = {
-            key: batch[key] for key in self.batch2model_keys.intersection(batch)
-        }
-
-        alphas_cumprod_sqrt, idx = self.sigma_sampler(input.shape[0], return_idx=True)
-        alphas_cumprod_sqrt = alphas_cumprod_sqrt.to(input.device)
-        idx = idx.to(input.device)
-
-        noise = torch.randn_like(input)
-
-        # broadcast noise
-        mp_size = mpu.get_model_parallel_world_size()
-        global_rank = torch.distributed.get_rank() // mp_size
-        src = global_rank * mp_size
-        torch.distributed.broadcast(idx, src=src, group=mpu.get_model_parallel_group())
-        torch.distributed.broadcast(
-            noise, src=src, group=mpu.get_model_parallel_group()
-        )
-        torch.distributed.broadcast(
-            alphas_cumprod_sqrt, src=src, group=mpu.get_model_parallel_group()
-        )
-
-        additional_model_inputs["idx"] = idx
-
-        if self.offset_noise_level > 0.0:
-            noise = (
-                noise
-                + append_dims(torch.randn(input.shape[0]).to(input.device), input.ndim)
-                * self.offset_noise_level
-            )
-
-        noised_input = input.float() * append_dims(
-            alphas_cumprod_sqrt, input.ndim
-        ) + noise * append_dims((1 - alphas_cumprod_sqrt**2) ** 0.5, input.ndim)
-
-        if "concat_images" in batch.keys():
-            cond["concat"] = batch["concat_images"]
-
-        # [2, 13, 16, 60, 90],[2] dict_keys(['crossattn', 'concat'])  dict_keys(['idx'])
-        model_output = denoiser(
-            network, noised_input, alphas_cumprod_sqrt, cond, **additional_model_inputs
-        )
-        w = append_dims(1 / (1 - alphas_cumprod_sqrt**2), input.ndim)  # v-pred
-
-        if self.min_snr_value is not None:
-            w = min(w, self.min_snr_value)
-        return self.get_loss(model_output, input, w)
-
-    def get_loss(self, model_output, target, w):
-        if self.type == "l2":
-            return torch.mean(
-                (w * (model_output - target) ** 2).reshape(target.shape[0], -1), 1
-            )
-        elif self.type == "l1":
-            return torch.mean(
-                (w * (model_output - target).abs()).reshape(target.shape[0], -1), 1
-            )
-        elif self.type == "lpips":
-            loss = self.lpips(model_output, target).reshape(-1)
-            return loss
diff --git a/videotuna/models/cogvideo_sat/sgm/modules/diffusionmodules/model.py b/videotuna/models/cogvideo_sat/sgm/modules/diffusionmodules/model.py
deleted file mode 100644
index 26efd078..00000000
--- a/videotuna/models/cogvideo_sat/sgm/modules/diffusionmodules/model.py
+++ /dev/null
@@ -1,743 +0,0 @@
-# pytorch_diffusion + derived encoder decoder
-import math
-from typing import Any, Callable, Optional
-
-import numpy as np
-import torch
-import torch.nn as nn
-from einops import rearrange
-from packaging import version
-
-try:
-    import xformers
-    import xformers.ops
-
-    XFORMERS_IS_AVAILABLE = True
-except:
-    XFORMERS_IS_AVAILABLE = False
-    print("no module 'xformers'. Processing without...")
-
-from ...modules.attention import LinearAttention, MemoryEfficientCrossAttention
-
-
-def get_timestep_embedding(timesteps, embedding_dim):
-    """
-    This matches the implementation in Denoising Diffusion Probabilistic Models:
-    From Fairseq.
-    Build sinusoidal embeddings.
-    This matches the implementation in tensor2tensor, but differs slightly
-    from the description in Section 3.5 of "Attention Is All You Need".
-    """
-    assert len(timesteps.shape) == 1
-
-    half_dim = embedding_dim // 2
-    emb = math.log(10000) / (half_dim - 1)
-    emb = torch.exp(torch.arange(half_dim, dtype=torch.float32) * -emb)
-    emb = emb.to(device=timesteps.device)
-    emb = timesteps.float()[:, None] * emb[None, :]
-    emb = torch.cat([torch.sin(emb), torch.cos(emb)], dim=1)
-    if embedding_dim % 2 == 1:  # zero pad
-        emb = torch.nn.functional.pad(emb, (0, 1, 0, 0))
-    return emb
-
-
-def nonlinearity(x):
-    # swish
-    return x * torch.sigmoid(x)
-
-
-def Normalize(in_channels, num_groups=32):
-    return torch.nn.GroupNorm(
-        num_groups=num_groups, num_channels=in_channels, eps=1e-6, affine=True
-    )
-
-
-class Upsample(nn.Module):
-    def __init__(self, in_channels, with_conv):
-        super().__init__()
-        self.with_conv = with_conv
-        if self.with_conv:
-            self.conv = torch.nn.Conv2d(
-                in_channels, in_channels, kernel_size=3, stride=1, padding=1
-            )
-
-    def forward(self, x):
-        x = torch.nn.functional.interpolate(x, scale_factor=2.0, mode="nearest")
-        if self.with_conv:
-            x = self.conv(x)
-        return x
-
-
-class Downsample(nn.Module):
-    def __init__(self, in_channels, with_conv):
-        super().__init__()
-        self.with_conv = with_conv
-        if self.with_conv:
-            # no asymmetric padding in torch conv, must do it ourselves
-            self.conv = torch.nn.Conv2d(
-                in_channels, in_channels, kernel_size=3, stride=2, padding=0
-            )
-
-    def forward(self, x):
-        if self.with_conv:
-            pad = (0, 1, 0, 1)
-            x = torch.nn.functional.pad(x, pad, mode="constant", value=0)
-            x = self.conv(x)
-        else:
-            x = torch.nn.functional.avg_pool2d(x, kernel_size=2, stride=2)
-        return x
-
-
-class ResnetBlock(nn.Module):
-    def __init__(
-        self,
-        *,
-        in_channels,
-        out_channels=None,
-        conv_shortcut=False,
-        dropout,
-        temb_channels=512,
-    ):
-        super().__init__()
-        self.in_channels = in_channels
-        out_channels = in_channels if out_channels is None else out_channels
-        self.out_channels = out_channels
-        self.use_conv_shortcut = conv_shortcut
-
-        self.norm1 = Normalize(in_channels)
-        self.conv1 = torch.nn.Conv2d(
-            in_channels, out_channels, kernel_size=3, stride=1, padding=1
-        )
-        if temb_channels > 0:
-            self.temb_proj = torch.nn.Linear(temb_channels, out_channels)
-        self.norm2 = Normalize(out_channels)
-        self.dropout = torch.nn.Dropout(dropout)
-        self.conv2 = torch.nn.Conv2d(
-            out_channels, out_channels, kernel_size=3, stride=1, padding=1
-        )
-        if self.in_channels != self.out_channels:
-            if self.use_conv_shortcut:
-                self.conv_shortcut = torch.nn.Conv2d(
-                    in_channels, out_channels, kernel_size=3, stride=1, padding=1
-                )
-            else:
-                self.nin_shortcut = torch.nn.Conv2d(
-                    in_channels, out_channels, kernel_size=1, stride=1, padding=0
-                )
-
-    def forward(self, x, temb):
-        h = x
-        h = self.norm1(h)
-        h = nonlinearity(h)
-        h = self.conv1(h)
-
-        if temb is not None:
-            h = h + self.temb_proj(nonlinearity(temb))[:, :, None, None]
-
-        h = self.norm2(h)
-        h = nonlinearity(h)
-        h = self.dropout(h)
-        h = self.conv2(h)
-
-        if self.in_channels != self.out_channels:
-            if self.use_conv_shortcut:
-                x = self.conv_shortcut(x)
-            else:
-                x = self.nin_shortcut(x)
-
-        return x + h
-
-
-class LinAttnBlock(LinearAttention):
-    """to match AttnBlock usage"""
-
-    def __init__(self, in_channels):
-        super().__init__(dim=in_channels, heads=1, dim_head=in_channels)
-
-
-class AttnBlock(nn.Module):
-    def __init__(self, in_channels):
-        super().__init__()
-        self.in_channels = in_channels
-
-        self.norm = Normalize(in_channels)
-        self.q = torch.nn.Conv2d(
-            in_channels, in_channels, kernel_size=1, stride=1, padding=0
-        )
-        self.k = torch.nn.Conv2d(
-            in_channels, in_channels, kernel_size=1, stride=1, padding=0
-        )
-        self.v = torch.nn.Conv2d(
-            in_channels, in_channels, kernel_size=1, stride=1, padding=0
-        )
-        self.proj_out = torch.nn.Conv2d(
-            in_channels, in_channels, kernel_size=1, stride=1, padding=0
-        )
-
-    def attention(self, h_: torch.Tensor) -> torch.Tensor:
-        h_ = self.norm(h_)
-        q = self.q(h_)
-        k = self.k(h_)
-        v = self.v(h_)
-
-        b, c, h, w = q.shape
-        q, k, v = map(
-            lambda x: rearrange(x, "b c h w -> b 1 (h w) c").contiguous(), (q, k, v)
-        )
-        h_ = torch.nn.functional.scaled_dot_product_attention(
-            q, k, v
-        )  # scale is dim ** -0.5 per default
-        # compute attention
-
-        return rearrange(h_, "b 1 (h w) c -> b c h w", h=h, w=w, c=c, b=b)
-
-    def forward(self, x, **kwargs):
-        h_ = x
-        h_ = self.attention(h_)
-        h_ = self.proj_out(h_)
-        return x + h_
-
-
-class MemoryEfficientAttnBlock(nn.Module):
-    """
-    Uses xformers efficient implementation,
-    see https://github.com/MatthieuTPHR/diffusers/blob/d80b531ff8060ec1ea982b65a1b8df70f73aa67c/src/diffusers/models/attention.py#L223
-    Note: this is a single-head self-attention operation
-    """
-
-    #
-    def __init__(self, in_channels):
-        super().__init__()
-        self.in_channels = in_channels
-
-        self.norm = Normalize(in_channels)
-        self.q = torch.nn.Conv2d(
-            in_channels, in_channels, kernel_size=1, stride=1, padding=0
-        )
-        self.k = torch.nn.Conv2d(
-            in_channels, in_channels, kernel_size=1, stride=1, padding=0
-        )
-        self.v = torch.nn.Conv2d(
-            in_channels, in_channels, kernel_size=1, stride=1, padding=0
-        )
-        self.proj_out = torch.nn.Conv2d(
-            in_channels, in_channels, kernel_size=1, stride=1, padding=0
-        )
-        self.attention_op: Optional[Any] = None
-
-    def attention(self, h_: torch.Tensor) -> torch.Tensor:
-        h_ = self.norm(h_)
-        q = self.q(h_)
-        k = self.k(h_)
-        v = self.v(h_)
-
-        # compute attention
-        B, C, H, W = q.shape
-        q, k, v = map(lambda x: rearrange(x, "b c h w -> b (h w) c"), (q, k, v))
-
-        q, k, v = map(
-            lambda t: t.unsqueeze(3)
-            .reshape(B, t.shape[1], 1, C)
-            .permute(0, 2, 1, 3)
-            .reshape(B * 1, t.shape[1], C)
-            .contiguous(),
-            (q, k, v),
-        )
-        out = xformers.ops.memory_efficient_attention(
-            q, k, v, attn_bias=None, op=self.attention_op
-        )
-
-        out = (
-            out.unsqueeze(0)
-            .reshape(B, 1, out.shape[1], C)
-            .permute(0, 2, 1, 3)
-            .reshape(B, out.shape[1], C)
-        )
-        return rearrange(out, "b (h w) c -> b c h w", b=B, h=H, w=W, c=C)
-
-    def forward(self, x, **kwargs):
-        h_ = x
-        h_ = self.attention(h_)
-        h_ = self.proj_out(h_)
-        return x + h_
-
-
-class MemoryEfficientCrossAttentionWrapper(MemoryEfficientCrossAttention):
-    def forward(self, x, context=None, mask=None, **unused_kwargs):
-        b, c, h, w = x.shape
-        x = rearrange(x, "b c h w -> b (h w) c")
-        out = super().forward(x, context=context, mask=mask)
-        out = rearrange(out, "b (h w) c -> b c h w", h=h, w=w, c=c)
-        return x + out
-
-
-def make_attn(in_channels, attn_type="vanilla", attn_kwargs=None):
-    assert attn_type in [
-        "vanilla",
-        "vanilla-xformers",
-        "memory-efficient-cross-attn",
-        "linear",
-        "none",
-    ], f"attn_type {attn_type} unknown"
-    if (
-        version.parse(torch.__version__) < version.parse("2.0.0")
-        and attn_type != "none"
-    ):
-        assert XFORMERS_IS_AVAILABLE, (
-            f"We do not support vanilla attention in {torch.__version__} anymore, "
-            f"as it is too expensive. Please install xformers via e.g. 'pip install xformers==0.0.16'"
-        )
-        attn_type = "vanilla-xformers"
-    print(f"making attention of type '{attn_type}' with {in_channels} in_channels")
-    if attn_type == "vanilla":
-        assert attn_kwargs is None
-        return AttnBlock(in_channels)
-    elif attn_type == "vanilla-xformers":
-        print(f"building MemoryEfficientAttnBlock with {in_channels} in_channels...")
-        return MemoryEfficientAttnBlock(in_channels)
-    elif type == "memory-efficient-cross-attn":
-        attn_kwargs["query_dim"] = in_channels
-        return MemoryEfficientCrossAttentionWrapper(**attn_kwargs)
-    elif attn_type == "none":
-        return nn.Identity(in_channels)
-    else:
-        return LinAttnBlock(in_channels)
-
-
-class Model(nn.Module):
-    def __init__(
-        self,
-        *,
-        ch,
-        out_ch,
-        ch_mult=(1, 2, 4, 8),
-        num_res_blocks,
-        attn_resolutions,
-        dropout=0.0,
-        resamp_with_conv=True,
-        in_channels,
-        resolution,
-        use_timestep=True,
-        use_linear_attn=False,
-        attn_type="vanilla",
-    ):
-        super().__init__()
-        if use_linear_attn:
-            attn_type = "linear"
-        self.ch = ch
-        self.temb_ch = self.ch * 4
-        self.num_resolutions = len(ch_mult)
-        self.num_res_blocks = num_res_blocks
-        self.resolution = resolution
-        self.in_channels = in_channels
-
-        self.use_timestep = use_timestep
-        if self.use_timestep:
-            # timestep embedding
-            self.temb = nn.Module()
-            self.temb.dense = nn.ModuleList(
-                [
-                    torch.nn.Linear(self.ch, self.temb_ch),
-                    torch.nn.Linear(self.temb_ch, self.temb_ch),
-                ]
-            )
-
-        # downsampling
-        self.conv_in = torch.nn.Conv2d(
-            in_channels, self.ch, kernel_size=3, stride=1, padding=1
-        )
-
-        curr_res = resolution
-        in_ch_mult = (1,) + tuple(ch_mult)
-        self.down = nn.ModuleList()
-        for i_level in range(self.num_resolutions):
-            block = nn.ModuleList()
-            attn = nn.ModuleList()
-            block_in = ch * in_ch_mult[i_level]
-            block_out = ch * ch_mult[i_level]
-            for i_block in range(self.num_res_blocks):
-                block.append(
-                    ResnetBlock(
-                        in_channels=block_in,
-                        out_channels=block_out,
-                        temb_channels=self.temb_ch,
-                        dropout=dropout,
-                    )
-                )
-                block_in = block_out
-                if curr_res in attn_resolutions:
-                    attn.append(make_attn(block_in, attn_type=attn_type))
-            down = nn.Module()
-            down.block = block
-            down.attn = attn
-            if i_level != self.num_resolutions - 1:
-                down.downsample = Downsample(block_in, resamp_with_conv)
-                curr_res = curr_res // 2
-            self.down.append(down)
-
-        # middle
-        self.mid = nn.Module()
-        self.mid.block_1 = ResnetBlock(
-            in_channels=block_in,
-            out_channels=block_in,
-            temb_channels=self.temb_ch,
-            dropout=dropout,
-        )
-        self.mid.attn_1 = make_attn(block_in, attn_type=attn_type)
-        self.mid.block_2 = ResnetBlock(
-            in_channels=block_in,
-            out_channels=block_in,
-            temb_channels=self.temb_ch,
-            dropout=dropout,
-        )
-
-        # upsampling
-        self.up = nn.ModuleList()
-        for i_level in reversed(range(self.num_resolutions)):
-            block = nn.ModuleList()
-            attn = nn.ModuleList()
-            block_out = ch * ch_mult[i_level]
-            skip_in = ch * ch_mult[i_level]
-            for i_block in range(self.num_res_blocks + 1):
-                if i_block == self.num_res_blocks:
-                    skip_in = ch * in_ch_mult[i_level]
-                block.append(
-                    ResnetBlock(
-                        in_channels=block_in + skip_in,
-                        out_channels=block_out,
-                        temb_channels=self.temb_ch,
-                        dropout=dropout,
-                    )
-                )
-                block_in = block_out
-                if curr_res in attn_resolutions:
-                    attn.append(make_attn(block_in, attn_type=attn_type))
-            up = nn.Module()
-            up.block = block
-            up.attn = attn
-            if i_level != 0:
-                up.upsample = Upsample(block_in, resamp_with_conv)
-                curr_res = curr_res * 2
-            self.up.insert(0, up)  # prepend to get consistent order
-
-        # end
-        self.norm_out = Normalize(block_in)
-        self.conv_out = torch.nn.Conv2d(
-            block_in, out_ch, kernel_size=3, stride=1, padding=1
-        )
-
-    def forward(self, x, t=None, context=None):
-        # assert x.shape[2] == x.shape[3] == self.resolution
-        if context is not None:
-            # assume aligned context, cat along channel axis
-            x = torch.cat((x, context), dim=1)
-        if self.use_timestep:
-            # timestep embedding
-            assert t is not None
-            temb = get_timestep_embedding(t, self.ch)
-            temb = self.temb.dense[0](temb)
-            temb = nonlinearity(temb)
-            temb = self.temb.dense[1](temb)
-        else:
-            temb = None
-
-        # downsampling
-        hs = [self.conv_in(x)]
-        for i_level in range(self.num_resolutions):
-            for i_block in range(self.num_res_blocks):
-                h = self.down[i_level].block[i_block](hs[-1], temb)
-                if len(self.down[i_level].attn) > 0:
-                    h = self.down[i_level].attn[i_block](h)
-                hs.append(h)
-            if i_level != self.num_resolutions - 1:
-                hs.append(self.down[i_level].downsample(hs[-1]))
-
-        # middle
-        h = hs[-1]
-        h = self.mid.block_1(h, temb)
-        h = self.mid.attn_1(h)
-        h = self.mid.block_2(h, temb)
-
-        # upsampling
-        for i_level in reversed(range(self.num_resolutions)):
-            for i_block in range(self.num_res_blocks + 1):
-                h = self.up[i_level].block[i_block](
-                    torch.cat([h, hs.pop()], dim=1), temb
-                )
-                if len(self.up[i_level].attn) > 0:
-                    h = self.up[i_level].attn[i_block](h)
-            if i_level != 0:
-                h = self.up[i_level].upsample(h)
-
-        # end
-        h = self.norm_out(h)
-        h = nonlinearity(h)
-        h = self.conv_out(h)
-        return h
-
-    def get_last_layer(self):
-        return self.conv_out.weight
-
-
-class Encoder(nn.Module):
-    def __init__(
-        self,
-        *,
-        ch,
-        out_ch,
-        ch_mult=(1, 2, 4, 8),
-        num_res_blocks,
-        attn_resolutions,
-        dropout=0.0,
-        resamp_with_conv=True,
-        in_channels,
-        resolution,
-        z_channels,
-        double_z=True,
-        use_linear_attn=False,
-        attn_type="vanilla",
-        **ignore_kwargs,
-    ):
-        super().__init__()
-        if use_linear_attn:
-            attn_type = "linear"
-        self.ch = ch
-        self.temb_ch = 0
-        self.num_resolutions = len(ch_mult)
-        self.num_res_blocks = num_res_blocks
-        self.resolution = resolution
-        self.in_channels = in_channels
-
-        # downsampling
-        self.conv_in = torch.nn.Conv2d(
-            in_channels, self.ch, kernel_size=3, stride=1, padding=1
-        )
-
-        curr_res = resolution
-        in_ch_mult = (1,) + tuple(ch_mult)
-        self.in_ch_mult = in_ch_mult
-        self.down = nn.ModuleList()
-        for i_level in range(self.num_resolutions):
-            block = nn.ModuleList()
-            attn = nn.ModuleList()
-            block_in = ch * in_ch_mult[i_level]
-            block_out = ch * ch_mult[i_level]
-            for i_block in range(self.num_res_blocks):
-                block.append(
-                    ResnetBlock(
-                        in_channels=block_in,
-                        out_channels=block_out,
-                        temb_channels=self.temb_ch,
-                        dropout=dropout,
-                    )
-                )
-                block_in = block_out
-                if curr_res in attn_resolutions:
-                    attn.append(make_attn(block_in, attn_type=attn_type))
-            down = nn.Module()
-            down.block = block
-            down.attn = attn
-            if i_level != self.num_resolutions - 1:
-                down.downsample = Downsample(block_in, resamp_with_conv)
-                curr_res = curr_res // 2
-            self.down.append(down)
-
-        # middle
-        self.mid = nn.Module()
-        self.mid.block_1 = ResnetBlock(
-            in_channels=block_in,
-            out_channels=block_in,
-            temb_channels=self.temb_ch,
-            dropout=dropout,
-        )
-        self.mid.attn_1 = make_attn(block_in, attn_type=attn_type)
-        self.mid.block_2 = ResnetBlock(
-            in_channels=block_in,
-            out_channels=block_in,
-            temb_channels=self.temb_ch,
-            dropout=dropout,
-        )
-
-        # end
-        self.norm_out = Normalize(block_in)
-        self.conv_out = torch.nn.Conv2d(
-            block_in,
-            2 * z_channels if double_z else z_channels,
-            kernel_size=3,
-            stride=1,
-            padding=1,
-        )
-
-    def forward(self, x):
-        # timestep embedding
-        temb = None
-
-        # downsampling
-        hs = [self.conv_in(x)]
-        for i_level in range(self.num_resolutions):
-            for i_block in range(self.num_res_blocks):
-                h = self.down[i_level].block[i_block](hs[-1], temb)
-                if len(self.down[i_level].attn) > 0:
-                    h = self.down[i_level].attn[i_block](h)
-                hs.append(h)
-            if i_level != self.num_resolutions - 1:
-                hs.append(self.down[i_level].downsample(hs[-1]))
-
-        # middle
-        h = hs[-1]
-        h = self.mid.block_1(h, temb)
-        h = self.mid.attn_1(h)
-        h = self.mid.block_2(h, temb)
-
-        # end
-        h = self.norm_out(h)
-        h = nonlinearity(h)
-        h = self.conv_out(h)
-        return h
-
-
-class Decoder(nn.Module):
-    def __init__(
-        self,
-        *,
-        ch,
-        out_ch,
-        ch_mult=(1, 2, 4, 8),
-        num_res_blocks,
-        attn_resolutions,
-        dropout=0.0,
-        resamp_with_conv=True,
-        in_channels,
-        resolution,
-        z_channels,
-        give_pre_end=False,
-        tanh_out=False,
-        use_linear_attn=False,
-        attn_type="vanilla",
-        **ignorekwargs,
-    ):
-        super().__init__()
-        if use_linear_attn:
-            attn_type = "linear"
-        self.ch = ch
-        self.temb_ch = 0
-        self.num_resolutions = len(ch_mult)
-        self.num_res_blocks = num_res_blocks
-        self.resolution = resolution
-        self.in_channels = in_channels
-        self.give_pre_end = give_pre_end
-        self.tanh_out = tanh_out
-
-        # compute in_ch_mult, block_in and curr_res at lowest res
-        in_ch_mult = (1,) + tuple(ch_mult)
-        block_in = ch * ch_mult[self.num_resolutions - 1]
-        curr_res = resolution // 2 ** (self.num_resolutions - 1)
-        self.z_shape = (1, z_channels, curr_res, curr_res)
-        print(
-            "Working with z of shape {} = {} dimensions.".format(
-                self.z_shape, np.prod(self.z_shape)
-            )
-        )
-
-        make_attn_cls = self._make_attn()
-        make_resblock_cls = self._make_resblock()
-        make_conv_cls = self._make_conv()
-        # z to block_in
-        self.conv_in = torch.nn.Conv2d(
-            z_channels, block_in, kernel_size=3, stride=1, padding=1
-        )
-
-        # middle
-        self.mid = nn.Module()
-        self.mid.block_1 = make_resblock_cls(
-            in_channels=block_in,
-            out_channels=block_in,
-            temb_channels=self.temb_ch,
-            dropout=dropout,
-        )
-        self.mid.attn_1 = make_attn_cls(block_in, attn_type=attn_type)
-        self.mid.block_2 = make_resblock_cls(
-            in_channels=block_in,
-            out_channels=block_in,
-            temb_channels=self.temb_ch,
-            dropout=dropout,
-        )
-
-        # upsampling
-        self.up = nn.ModuleList()
-        for i_level in reversed(range(self.num_resolutions)):
-            block = nn.ModuleList()
-            attn = nn.ModuleList()
-            block_out = ch * ch_mult[i_level]
-            for i_block in range(self.num_res_blocks + 1):
-                block.append(
-                    make_resblock_cls(
-                        in_channels=block_in,
-                        out_channels=block_out,
-                        temb_channels=self.temb_ch,
-                        dropout=dropout,
-                    )
-                )
-                block_in = block_out
-                if curr_res in attn_resolutions:
-                    attn.append(make_attn_cls(block_in, attn_type=attn_type))
-            up = nn.Module()
-            up.block = block
-            up.attn = attn
-            if i_level != 0:
-                up.upsample = Upsample(block_in, resamp_with_conv)
-                curr_res = curr_res * 2
-            self.up.insert(0, up)  # prepend to get consistent order
-
-        # end
-        self.norm_out = Normalize(block_in)
-        self.conv_out = make_conv_cls(
-            block_in, out_ch, kernel_size=3, stride=1, padding=1
-        )
-
-    def _make_attn(self) -> Callable:
-        return make_attn
-
-    def _make_resblock(self) -> Callable:
-        return ResnetBlock
-
-    def _make_conv(self) -> Callable:
-        return torch.nn.Conv2d
-
-    def get_last_layer(self, **kwargs):
-        return self.conv_out.weight
-
-    def forward(self, z, **kwargs):
-        # assert z.shape[1:] == self.z_shape[1:]
-        self.last_z_shape = z.shape
-
-        # timestep embedding
-        temb = None
-
-        # z to block_in
-        h = self.conv_in(z)
-
-        # middle
-        h = self.mid.block_1(h, temb, **kwargs)
-        h = self.mid.attn_1(h, **kwargs)
-        h = self.mid.block_2(h, temb, **kwargs)
-
-        # upsampling
-        for i_level in reversed(range(self.num_resolutions)):
-            for i_block in range(self.num_res_blocks + 1):
-                h = self.up[i_level].block[i_block](h, temb, **kwargs)
-                if len(self.up[i_level].attn) > 0:
-                    h = self.up[i_level].attn[i_block](h, **kwargs)
-            if i_level != 0:
-                h = self.up[i_level].upsample(h)
-
-        # end
-        if self.give_pre_end:
-            return h
-
-        h = self.norm_out(h)
-        h = nonlinearity(h)
-        h = self.conv_out(h, **kwargs)
-        if self.tanh_out:
-            h = torch.tanh(h)
-        return h
diff --git a/videotuna/models/cogvideo_sat/sgm/modules/diffusionmodules/openaimodel.py b/videotuna/models/cogvideo_sat/sgm/modules/diffusionmodules/openaimodel.py
deleted file mode 100644
index 167b78e2..00000000
--- a/videotuna/models/cogvideo_sat/sgm/modules/diffusionmodules/openaimodel.py
+++ /dev/null
@@ -1,1319 +0,0 @@
-import math
-import os
-from abc import abstractmethod
-from functools import partial
-from typing import Iterable, List, Optional, Tuple, Union
-
-import numpy as np
-import torch as th
-import torch.nn as nn
-import torch.nn.functional as F
-from einops import rearrange
-
-from ...modules.attention import SpatialTransformer
-from ...modules.diffusionmodules.lora import (
-    inject_trainable_lora_extended,
-    update_lora_scale,
-)
-from ...modules.diffusionmodules.util import (
-    avg_pool_nd,
-    checkpoint,
-    conv_nd,
-    linear,
-    normalization,
-    timestep_embedding,
-    zero_module,
-)
-from ...modules.video_attention import SpatialVideoTransformer
-from ...util import default, exists
-
-
-# dummy replace
-def convert_module_to_f16(x):
-    pass
-
-
-def convert_module_to_f32(x):
-    pass
-
-
-class AttentionPool2d(nn.Module):
-    """
-    Adapted from CLIP: https://github.com/openai/CLIP/blob/main/clip/model.py
-    """
-
-    def __init__(
-        self,
-        spacial_dim: int,
-        embed_dim: int,
-        num_heads_channels: int,
-        output_dim: int = None,
-    ):
-        super().__init__()
-        self.positional_embedding = nn.Parameter(
-            th.randn(embed_dim, spacial_dim**2 + 1) / embed_dim**0.5
-        )
-        self.qkv_proj = conv_nd(1, embed_dim, 3 * embed_dim, 1)
-        self.c_proj = conv_nd(1, embed_dim, output_dim or embed_dim, 1)
-        self.num_heads = embed_dim // num_heads_channels
-        self.attention = QKVAttention(self.num_heads)
-
-    def forward(self, x):
-        b, c, *_spatial = x.shape
-        x = x.reshape(b, c, -1)  # NC(HW)
-        x = th.cat([x.mean(dim=-1, keepdim=True), x], dim=-1)  # NC(HW+1)
-        x = x + self.positional_embedding[None, :, :].to(x.dtype)  # NC(HW+1)
-        x = self.qkv_proj(x)
-        x = self.attention(x)
-        x = self.c_proj(x)
-        return x[:, :, 0]
-
-
-class TimestepBlock(nn.Module):
-    """
-    Any module where forward() takes timestep embeddings as a second argument.
-    """
-
-    @abstractmethod
-    def forward(self, x, emb):
-        """
-        Apply the module to `x` given `emb` timestep embeddings.
-        """
-
-
-class TimestepEmbedSequential(nn.Sequential, TimestepBlock):
-    """
-    A sequential module that passes timestep embeddings to the children that
-    support it as an extra input.
-    """
-
-    def forward(
-        self,
-        x: th.Tensor,
-        emb: th.Tensor,
-        context: Optional[th.Tensor] = None,
-        image_only_indicator: Optional[th.Tensor] = None,
-        time_context: Optional[int] = None,
-        num_video_frames: Optional[int] = None,
-    ):
-        from ...modules.diffusionmodules.video_model import VideoResBlock
-
-        for layer in self:
-            module = layer
-
-            if isinstance(module, TimestepBlock) and not isinstance(
-                module, VideoResBlock
-            ):
-                x = layer(x, emb)
-            elif isinstance(module, VideoResBlock):
-                x = layer(x, emb, num_video_frames, image_only_indicator)
-            elif isinstance(module, SpatialVideoTransformer):
-                x = layer(
-                    x,
-                    context,
-                    time_context,
-                    num_video_frames,
-                    image_only_indicator,
-                )
-            elif isinstance(module, SpatialTransformer):
-                x = layer(x, context)
-            else:
-                x = layer(x)
-        return x
-
-
-class Upsample(nn.Module):
-    """
-    An upsampling layer with an optional convolution.
-    :param channels: channels in the inputs and outputs.
-    :param use_conv: a bool determining if a convolution is applied.
-    :param dims: determines if the signal is 1D, 2D, or 3D. If 3D, then
-                 upsampling occurs in the inner-two dimensions.
-    """
-
-    def __init__(
-        self, channels, use_conv, dims=2, out_channels=None, padding=1, third_up=False
-    ):
-        super().__init__()
-        self.channels = channels
-        self.out_channels = out_channels or channels
-        self.use_conv = use_conv
-        self.dims = dims
-        self.third_up = third_up
-        if use_conv:
-            self.conv = conv_nd(
-                dims, self.channels, self.out_channels, 3, padding=padding
-            )
-
-    def forward(self, x):
-        assert x.shape[1] == self.channels
-        if self.dims == 3:
-            t_factor = 1 if not self.third_up else 2
-            x = F.interpolate(
-                x,
-                (t_factor * x.shape[2], x.shape[3] * 2, x.shape[4] * 2),
-                mode="nearest",
-            )
-        else:
-            x = F.interpolate(x, scale_factor=2, mode="nearest")
-        if self.use_conv:
-            x = self.conv(x)
-        return x
-
-
-class TransposedUpsample(nn.Module):
-    "Learned 2x upsampling without padding"
-
-    def __init__(self, channels, out_channels=None, ks=5):
-        super().__init__()
-        self.channels = channels
-        self.out_channels = out_channels or channels
-
-        self.up = nn.ConvTranspose2d(
-            self.channels, self.out_channels, kernel_size=ks, stride=2
-        )
-
-    def forward(self, x):
-        return self.up(x)
-
-
-class Downsample(nn.Module):
-    """
-    A downsampling layer with an optional convolution.
-    :param channels: channels in the inputs and outputs.
-    :param use_conv: a bool determining if a convolution is applied.
-    :param dims: determines if the signal is 1D, 2D, or 3D. If 3D, then
-                 downsampling occurs in the inner-two dimensions.
-    """
-
-    def __init__(
-        self, channels, use_conv, dims=2, out_channels=None, padding=1, third_down=False
-    ):
-        super().__init__()
-        self.channels = channels
-        self.out_channels = out_channels or channels
-        self.use_conv = use_conv
-        self.dims = dims
-        stride = 2 if dims != 3 else ((1, 2, 2) if not third_down else (2, 2, 2))
-        if use_conv:
-            print(f"Building a Downsample layer with {dims} dims.")
-            print(
-                f"  --> settings are: \n in-chn: {self.channels}, out-chn: {self.out_channels}, "
-                f"kernel-size: 3, stride: {stride}, padding: {padding}"
-            )
-            if dims == 3:
-                print(f"  --> Downsampling third axis (time): {third_down}")
-            self.op = conv_nd(
-                dims,
-                self.channels,
-                self.out_channels,
-                3,
-                stride=stride,
-                padding=padding,
-            )
-        else:
-            assert self.channels == self.out_channels
-            self.op = avg_pool_nd(dims, kernel_size=stride, stride=stride)
-
-    def forward(self, x):
-        assert x.shape[1] == self.channels
-        return self.op(x)
-
-
-class ResBlock(TimestepBlock):
-    """
-    A residual block that can optionally change the number of channels.
-    :param channels: the number of input channels.
-    :param emb_channels: the number of timestep embedding channels.
-    :param dropout: the rate of dropout.
-    :param out_channels: if specified, the number of out channels.
-    :param use_conv: if True and out_channels is specified, use a spatial
-        convolution instead of a smaller 1x1 convolution to change the
-        channels in the skip connection.
-    :param dims: determines if the signal is 1D, 2D, or 3D.
-    :param use_checkpoint: if True, use gradient checkpointing on this module.
-    :param up: if True, use this block for upsampling.
-    :param down: if True, use this block for downsampling.
-    """
-
-    def __init__(
-        self,
-        channels,
-        emb_channels,
-        dropout,
-        out_channels=None,
-        use_conv=False,
-        use_scale_shift_norm=False,
-        dims=2,
-        use_checkpoint=False,
-        up=False,
-        down=False,
-        kernel_size=3,
-        exchange_temb_dims=False,
-        skip_t_emb=False,
-    ):
-        super().__init__()
-        self.channels = channels
-        self.emb_channels = emb_channels
-        self.dropout = dropout
-        self.out_channels = out_channels or channels
-        self.use_conv = use_conv
-        self.use_checkpoint = use_checkpoint
-        self.use_scale_shift_norm = use_scale_shift_norm
-        self.exchange_temb_dims = exchange_temb_dims
-
-        if isinstance(kernel_size, Iterable):
-            padding = [k // 2 for k in kernel_size]
-        else:
-            padding = kernel_size // 2
-
-        self.in_layers = nn.Sequential(
-            normalization(channels),
-            nn.SiLU(),
-            conv_nd(dims, channels, self.out_channels, kernel_size, padding=padding),
-        )
-
-        self.updown = up or down
-
-        if up:
-            self.h_upd = Upsample(channels, False, dims)
-            self.x_upd = Upsample(channels, False, dims)
-        elif down:
-            self.h_upd = Downsample(channels, False, dims)
-            self.x_upd = Downsample(channels, False, dims)
-        else:
-            self.h_upd = self.x_upd = nn.Identity()
-
-        self.skip_t_emb = skip_t_emb
-        self.emb_out_channels = (
-            2 * self.out_channels if use_scale_shift_norm else self.out_channels
-        )
-        if self.skip_t_emb:
-            print(f"Skipping timestep embedding in {self.__class__.__name__}")
-            assert not self.use_scale_shift_norm
-            self.emb_layers = None
-            self.exchange_temb_dims = False
-        else:
-            self.emb_layers = nn.Sequential(
-                nn.SiLU(),
-                linear(
-                    emb_channels,
-                    self.emb_out_channels,
-                ),
-            )
-
-        self.out_layers = nn.Sequential(
-            normalization(self.out_channels),
-            nn.SiLU(),
-            nn.Dropout(p=dropout),
-            zero_module(
-                conv_nd(
-                    dims,
-                    self.out_channels,
-                    self.out_channels,
-                    kernel_size,
-                    padding=padding,
-                )
-            ),
-        )
-
-        if self.out_channels == channels:
-            self.skip_connection = nn.Identity()
-        elif use_conv:
-            self.skip_connection = conv_nd(
-                dims, channels, self.out_channels, kernel_size, padding=padding
-            )
-        else:
-            self.skip_connection = conv_nd(dims, channels, self.out_channels, 1)
-
-    def forward(self, x, emb):
-        """
-        Apply the block to a Tensor, conditioned on a timestep embedding.
-        :param x: an [N x C x ...] Tensor of features.
-        :param emb: an [N x emb_channels] Tensor of timestep embeddings.
-        :return: an [N x C x ...] Tensor of outputs.
-        """
-        return checkpoint(
-            self._forward, (x, emb), self.parameters(), self.use_checkpoint
-        )
-
-    def _forward(self, x, emb):
-        if self.updown:
-            in_rest, in_conv = self.in_layers[:-1], self.in_layers[-1]
-            h = in_rest(x)
-            h = self.h_upd(h)
-            x = self.x_upd(x)
-            h = in_conv(h)
-        else:
-            h = self.in_layers(x)
-
-        if self.skip_t_emb:
-            emb_out = th.zeros_like(h)
-        else:
-            emb_out = self.emb_layers(emb).type(h.dtype)
-        while len(emb_out.shape) < len(h.shape):
-            emb_out = emb_out[..., None]
-        if self.use_scale_shift_norm:
-            out_norm, out_rest = self.out_layers[0], self.out_layers[1:]
-            scale, shift = th.chunk(emb_out, 2, dim=1)
-            h = out_norm(h) * (1 + scale) + shift
-            h = out_rest(h)
-        else:
-            if self.exchange_temb_dims:
-                emb_out = rearrange(emb_out, "b t c ... -> b c t ...")
-            h = h + emb_out
-            h = self.out_layers(h)
-        return self.skip_connection(x) + h
-
-
-class AttentionBlock(nn.Module):
-    """
-    An attention block that allows spatial positions to attend to each other.
-    Originally ported from here, but adapted to the N-d case.
-    https://github.com/hojonathanho/diffusion/blob/1e0dceb3b3495bbe19116a5e1b3596cd0706c543/diffusion_tf/models/unet.py#L66.
-    """
-
-    def __init__(
-        self,
-        channels,
-        num_heads=1,
-        num_head_channels=-1,
-        use_checkpoint=False,
-        use_new_attention_order=False,
-    ):
-        super().__init__()
-        self.channels = channels
-        if num_head_channels == -1:
-            self.num_heads = num_heads
-        else:
-            assert (
-                channels % num_head_channels == 0
-            ), f"q,k,v channels {channels} is not divisible by num_head_channels {num_head_channels}"
-            self.num_heads = channels // num_head_channels
-        self.use_checkpoint = use_checkpoint
-        self.norm = normalization(channels)
-        self.qkv = conv_nd(1, channels, channels * 3, 1)
-        if use_new_attention_order:
-            # split qkv before split heads
-            self.attention = QKVAttention(self.num_heads)
-        else:
-            # split heads before split qkv
-            self.attention = QKVAttentionLegacy(self.num_heads)
-
-        self.proj_out = zero_module(conv_nd(1, channels, channels, 1))
-
-    def forward(self, x, **kwargs):
-        # TODO add crossframe attention and use mixed checkpoint
-        return checkpoint(
-            self._forward, (x,), self.parameters(), True
-        )  # TODO: check checkpoint usage, is True # TODO: fix the .half call!!!
-        # return pt_checkpoint(self._forward, x)  # pytorch
-
-    def _forward(self, x):
-        b, c, *spatial = x.shape
-        x = x.reshape(b, c, -1)
-        qkv = self.qkv(self.norm(x))
-        h = self.attention(qkv)
-        h = self.proj_out(h)
-        return (x + h).reshape(b, c, *spatial)
-
-
-def count_flops_attn(model, _x, y):
-    """
-    A counter for the `thop` package to count the operations in an
-    attention operation.
-    Meant to be used like:
-        macs, params = thop.profile(
-            model,
-            inputs=(inputs, timestamps),
-            custom_ops={QKVAttention: QKVAttention.count_flops},
-        )
-    """
-    b, c, *spatial = y[0].shape
-    num_spatial = int(np.prod(spatial))
-    # We perform two matmuls with the same number of ops.
-    # The first computes the weight matrix, the second computes
-    # the combination of the value vectors.
-    matmul_ops = 2 * b * (num_spatial**2) * c
-    model.total_ops += th.DoubleTensor([matmul_ops])
-
-
-class QKVAttentionLegacy(nn.Module):
-    """
-    A module which performs QKV attention. Matches legacy QKVAttention + input/ouput heads shaping
-    """
-
-    def __init__(self, n_heads):
-        super().__init__()
-        self.n_heads = n_heads
-
-    def forward(self, qkv):
-        """
-        Apply QKV attention.
-        :param qkv: an [N x (H * 3 * C) x T] tensor of Qs, Ks, and Vs.
-        :return: an [N x (H * C) x T] tensor after attention.
-        """
-        bs, width, length = qkv.shape
-        assert width % (3 * self.n_heads) == 0
-        ch = width // (3 * self.n_heads)
-        q, k, v = qkv.reshape(bs * self.n_heads, ch * 3, length).split(ch, dim=1)
-        scale = 1 / math.sqrt(math.sqrt(ch))
-        weight = th.einsum(
-            "bct,bcs->bts", q * scale, k * scale
-        )  # More stable with f16 than dividing afterwards
-        weight = th.softmax(weight.float(), dim=-1).type(weight.dtype)
-        a = th.einsum("bts,bcs->bct", weight, v)
-        return a.reshape(bs, -1, length)
-
-    @staticmethod
-    def count_flops(model, _x, y):
-        return count_flops_attn(model, _x, y)
-
-
-class QKVAttention(nn.Module):
-    """
-    A module which performs QKV attention and splits in a different order.
-    """
-
-    def __init__(self, n_heads):
-        super().__init__()
-        self.n_heads = n_heads
-
-    def forward(self, qkv):
-        """
-        Apply QKV attention.
-        :param qkv: an [N x (3 * H * C) x T] tensor of Qs, Ks, and Vs.
-        :return: an [N x (H * C) x T] tensor after attention.
-        """
-        bs, width, length = qkv.shape
-        assert width % (3 * self.n_heads) == 0
-        ch = width // (3 * self.n_heads)
-        q, k, v = qkv.chunk(3, dim=1)
-        scale = 1 / math.sqrt(math.sqrt(ch))
-        weight = th.einsum(
-            "bct,bcs->bts",
-            (q * scale).view(bs * self.n_heads, ch, length),
-            (k * scale).view(bs * self.n_heads, ch, length),
-        )  # More stable with f16 than dividing afterwards
-        weight = th.softmax(weight.float(), dim=-1).type(weight.dtype)
-        a = th.einsum("bts,bcs->bct", weight, v.reshape(bs * self.n_heads, ch, length))
-        return a.reshape(bs, -1, length)
-
-    @staticmethod
-    def count_flops(model, _x, y):
-        return count_flops_attn(model, _x, y)
-
-
-class Timestep(nn.Module):
-    def __init__(self, dim):
-        super().__init__()
-        self.dim = dim
-
-    def forward(self, t):
-        return timestep_embedding(t, self.dim)
-
-
-str_to_dtype = {"fp32": th.float32, "fp16": th.float16, "bf16": th.bfloat16}
-
-
-class UNetModel(nn.Module):
-    """
-    The full UNet model with attention and timestep embedding.
-    :param in_channels: channels in the input Tensor.
-    :param model_channels: base channel count for the model.
-    :param out_channels: channels in the output Tensor.
-    :param num_res_blocks: number of residual blocks per downsample.
-    :param attention_resolutions: a collection of downsample rates at which
-        attention will take place. May be a set, list, or tuple.
-        For example, if this contains 4, then at 4x downsampling, attention
-        will be used.
-    :param dropout: the dropout probability.
-    :param channel_mult: channel multiplier for each level of the UNet.
-    :param conv_resample: if True, use learned convolutions for upsampling and
-        downsampling.
-    :param dims: determines if the signal is 1D, 2D, or 3D.
-    :param num_classes: if specified (as an int), then this model will be
-        class-conditional with `num_classes` classes.
-    :param use_checkpoint: use gradient checkpointing to reduce memory usage.
-    :param num_heads: the number of attention heads in each attention layer.
-    :param num_heads_channels: if specified, ignore num_heads and instead use
-                               a fixed channel width per attention head.
-    :param num_heads_upsample: works with num_heads to set a different number
-                               of heads for upsampling. Deprecated.
-    :param use_scale_shift_norm: use a FiLM-like conditioning mechanism.
-    :param resblock_updown: use residual blocks for up/downsampling.
-    :param use_new_attention_order: use a different attention pattern for potentially
-                                    increased efficiency.
-    """
-
-    def __init__(
-        self,
-        in_channels,
-        model_channels,
-        out_channels,
-        num_res_blocks,
-        attention_resolutions,
-        dropout=0,
-        channel_mult=(1, 2, 4, 8),
-        conv_resample=True,
-        dims=2,
-        num_classes=None,
-        use_checkpoint=False,
-        use_fp16=False,
-        num_heads=-1,
-        num_head_channels=-1,
-        num_heads_upsample=-1,
-        use_scale_shift_norm=False,
-        resblock_updown=False,
-        use_new_attention_order=False,
-        use_spatial_transformer=False,  # custom transformer support
-        transformer_depth=1,  # custom transformer support
-        context_dim=None,  # custom transformer support
-        n_embed=None,  # custom support for prediction of discrete ids into codebook of first stage vq model
-        legacy=True,
-        disable_self_attentions=None,
-        num_attention_blocks=None,
-        disable_middle_self_attn=False,
-        use_linear_in_transformer=False,
-        spatial_transformer_attn_type="softmax",
-        adm_in_channels=None,
-        use_fairscale_checkpoint=False,
-        offload_to_cpu=False,
-        transformer_depth_middle=None,
-        dtype="fp32",
-        lora_init=False,
-        lora_rank=4,
-        lora_scale=1.0,
-        lora_weight_path=None,
-    ):
-        super().__init__()
-        from omegaconf.listconfig import ListConfig
-
-        self.dtype = str_to_dtype[dtype]
-
-        if use_spatial_transformer:
-            assert (
-                context_dim is not None
-            ), "Fool!! You forgot to include the dimension of your cross-attention conditioning..."
-
-        if context_dim is not None:
-            assert (
-                use_spatial_transformer
-            ), "Fool!! You forgot to use the spatial transformer for your cross-attention conditioning..."
-            if type(context_dim) == ListConfig:
-                context_dim = list(context_dim)
-
-        if num_heads_upsample == -1:
-            num_heads_upsample = num_heads
-
-        if num_heads == -1:
-            assert (
-                num_head_channels != -1
-            ), "Either num_heads or num_head_channels has to be set"
-
-        if num_head_channels == -1:
-            assert (
-                num_heads != -1
-            ), "Either num_heads or num_head_channels has to be set"
-
-        self.in_channels = in_channels
-        self.model_channels = model_channels
-        self.out_channels = out_channels
-        if isinstance(transformer_depth, int):
-            transformer_depth = len(channel_mult) * [transformer_depth]
-        elif isinstance(transformer_depth, ListConfig):
-            transformer_depth = list(transformer_depth)
-        transformer_depth_middle = default(
-            transformer_depth_middle, transformer_depth[-1]
-        )
-
-        if isinstance(num_res_blocks, int):
-            self.num_res_blocks = len(channel_mult) * [num_res_blocks]
-        else:
-            if len(num_res_blocks) != len(channel_mult):
-                raise ValueError(
-                    "provide num_res_blocks either as an int (globally constant) or "
-                    "as a list/tuple (per-level) with the same length as channel_mult"
-                )
-            self.num_res_blocks = num_res_blocks
-        # self.num_res_blocks = num_res_blocks
-        if disable_self_attentions is not None:
-            # should be a list of booleans, indicating whether to disable self-attention in TransformerBlocks or not
-            assert len(disable_self_attentions) == len(channel_mult)
-        if num_attention_blocks is not None:
-            assert len(num_attention_blocks) == len(self.num_res_blocks)
-            assert all(
-                map(
-                    lambda i: self.num_res_blocks[i] >= num_attention_blocks[i],
-                    range(len(num_attention_blocks)),
-                )
-            )
-            print(
-                f"Constructor of UNetModel received num_attention_blocks={num_attention_blocks}. "
-                f"This option has LESS priority than attention_resolutions {attention_resolutions}, "
-                f"i.e., in cases where num_attention_blocks[i] > 0 but 2**i not in attention_resolutions, "
-                f"attention will still not be set."
-            )  # todo: convert to warning
-
-        self.attention_resolutions = attention_resolutions
-        self.dropout = dropout
-        self.channel_mult = channel_mult
-        self.conv_resample = conv_resample
-        self.num_classes = num_classes
-        self.use_checkpoint = use_checkpoint
-        if use_fp16:
-            print("WARNING: use_fp16 was dropped and has no effect anymore.")
-        # self.dtype = th.float16 if use_fp16 else th.float32
-        self.num_heads = num_heads
-        self.num_head_channels = num_head_channels
-        self.num_heads_upsample = num_heads_upsample
-        self.predict_codebook_ids = n_embed is not None
-
-        assert use_fairscale_checkpoint != use_checkpoint or not (
-            use_checkpoint or use_fairscale_checkpoint
-        )
-
-        self.use_fairscale_checkpoint = False
-        checkpoint_wrapper_fn = (
-            partial(checkpoint_wrapper, offload_to_cpu=offload_to_cpu)
-            if self.use_fairscale_checkpoint
-            else lambda x: x
-        )
-
-        time_embed_dim = model_channels * 4
-        self.time_embed = checkpoint_wrapper_fn(
-            nn.Sequential(
-                linear(model_channels, time_embed_dim),
-                nn.SiLU(),
-                linear(time_embed_dim, time_embed_dim),
-            )
-        )
-
-        if self.num_classes is not None:
-            if isinstance(self.num_classes, int):
-                self.label_emb = nn.Embedding(num_classes, time_embed_dim)
-            elif self.num_classes == "continuous":
-                print("setting up linear c_adm embedding layer")
-                self.label_emb = nn.Linear(1, time_embed_dim)
-            elif self.num_classes == "timestep":
-                self.label_emb = checkpoint_wrapper_fn(
-                    nn.Sequential(
-                        Timestep(model_channels),
-                        nn.Sequential(
-                            linear(model_channels, time_embed_dim),
-                            nn.SiLU(),
-                            linear(time_embed_dim, time_embed_dim),
-                        ),
-                    )
-                )
-            elif self.num_classes == "sequential":
-                assert adm_in_channels is not None
-                self.label_emb = nn.Sequential(
-                    nn.Sequential(
-                        linear(adm_in_channels, time_embed_dim),
-                        nn.SiLU(),
-                        linear(time_embed_dim, time_embed_dim),
-                    )
-                )
-            else:
-                raise ValueError()
-
-        self.input_blocks = nn.ModuleList(
-            [
-                TimestepEmbedSequential(
-                    conv_nd(dims, in_channels, model_channels, 3, padding=1)
-                )
-            ]
-        )
-        self._feature_size = model_channels
-        input_block_chans = [model_channels]
-        ch = model_channels
-        ds = 1
-        for level, mult in enumerate(channel_mult):
-            for nr in range(self.num_res_blocks[level]):
-                layers = [
-                    checkpoint_wrapper_fn(
-                        ResBlock(
-                            ch,
-                            time_embed_dim,
-                            dropout,
-                            out_channels=mult * model_channels,
-                            dims=dims,
-                            use_checkpoint=use_checkpoint,
-                            use_scale_shift_norm=use_scale_shift_norm,
-                        )
-                    )
-                ]
-                ch = mult * model_channels
-                if ds in attention_resolutions:
-                    if num_head_channels == -1:
-                        dim_head = ch // num_heads
-                    else:
-                        num_heads = ch // num_head_channels
-                        dim_head = num_head_channels
-                    if legacy:
-                        # num_heads = 1
-                        dim_head = (
-                            ch // num_heads
-                            if use_spatial_transformer
-                            else num_head_channels
-                        )
-                    if exists(disable_self_attentions):
-                        disabled_sa = disable_self_attentions[level]
-                    else:
-                        disabled_sa = False
-
-                    if (
-                        not exists(num_attention_blocks)
-                        or nr < num_attention_blocks[level]
-                    ):
-                        layers.append(
-                            checkpoint_wrapper_fn(
-                                AttentionBlock(
-                                    ch,
-                                    use_checkpoint=use_checkpoint,
-                                    num_heads=num_heads,
-                                    num_head_channels=dim_head,
-                                    use_new_attention_order=use_new_attention_order,
-                                )
-                            )
-                            if not use_spatial_transformer
-                            else checkpoint_wrapper_fn(
-                                SpatialTransformer(
-                                    ch,
-                                    num_heads,
-                                    dim_head,
-                                    depth=transformer_depth[level],
-                                    context_dim=context_dim,
-                                    disable_self_attn=disabled_sa,
-                                    use_linear=use_linear_in_transformer,
-                                    attn_type=spatial_transformer_attn_type,
-                                    use_checkpoint=use_checkpoint,
-                                )
-                            )
-                        )
-                self.input_blocks.append(TimestepEmbedSequential(*layers))
-                self._feature_size += ch
-                input_block_chans.append(ch)
-            if level != len(channel_mult) - 1:
-                out_ch = ch
-                self.input_blocks.append(
-                    TimestepEmbedSequential(
-                        checkpoint_wrapper_fn(
-                            ResBlock(
-                                ch,
-                                time_embed_dim,
-                                dropout,
-                                out_channels=out_ch,
-                                dims=dims,
-                                use_checkpoint=use_checkpoint,
-                                use_scale_shift_norm=use_scale_shift_norm,
-                                down=True,
-                            )
-                        )
-                        if resblock_updown
-                        else Downsample(
-                            ch, conv_resample, dims=dims, out_channels=out_ch
-                        )
-                    )
-                )
-                ch = out_ch
-                input_block_chans.append(ch)
-                ds *= 2
-                self._feature_size += ch
-
-        if num_head_channels == -1:
-            dim_head = ch // num_heads
-        else:
-            num_heads = ch // num_head_channels
-            dim_head = num_head_channels
-        if legacy:
-            # num_heads = 1
-            dim_head = ch // num_heads if use_spatial_transformer else num_head_channels
-        self.middle_block = TimestepEmbedSequential(
-            checkpoint_wrapper_fn(
-                ResBlock(
-                    ch,
-                    time_embed_dim,
-                    dropout,
-                    dims=dims,
-                    use_checkpoint=use_checkpoint,
-                    use_scale_shift_norm=use_scale_shift_norm,
-                )
-            ),
-            (
-                checkpoint_wrapper_fn(
-                    AttentionBlock(
-                        ch,
-                        use_checkpoint=use_checkpoint,
-                        num_heads=num_heads,
-                        num_head_channels=dim_head,
-                        use_new_attention_order=use_new_attention_order,
-                    )
-                )
-                if not use_spatial_transformer
-                else checkpoint_wrapper_fn(
-                    SpatialTransformer(  # always uses a self-attn
-                        ch,
-                        num_heads,
-                        dim_head,
-                        depth=transformer_depth_middle,
-                        context_dim=context_dim,
-                        disable_self_attn=disable_middle_self_attn,
-                        use_linear=use_linear_in_transformer,
-                        attn_type=spatial_transformer_attn_type,
-                        use_checkpoint=use_checkpoint,
-                    )
-                )
-            ),
-            checkpoint_wrapper_fn(
-                ResBlock(
-                    ch,
-                    time_embed_dim,
-                    dropout,
-                    dims=dims,
-                    use_checkpoint=use_checkpoint,
-                    use_scale_shift_norm=use_scale_shift_norm,
-                )
-            ),
-        )
-        self._feature_size += ch
-
-        self.output_blocks = nn.ModuleList([])
-        for level, mult in list(enumerate(channel_mult))[::-1]:
-            for i in range(self.num_res_blocks[level] + 1):
-                ich = input_block_chans.pop()
-                layers = [
-                    checkpoint_wrapper_fn(
-                        ResBlock(
-                            ch + ich,
-                            time_embed_dim,
-                            dropout,
-                            out_channels=model_channels * mult,
-                            dims=dims,
-                            use_checkpoint=use_checkpoint,
-                            use_scale_shift_norm=use_scale_shift_norm,
-                        )
-                    )
-                ]
-                ch = model_channels * mult
-                if ds in attention_resolutions:
-                    if num_head_channels == -1:
-                        dim_head = ch // num_heads
-                    else:
-                        num_heads = ch // num_head_channels
-                        dim_head = num_head_channels
-                    if legacy:
-                        # num_heads = 1
-                        dim_head = (
-                            ch // num_heads
-                            if use_spatial_transformer
-                            else num_head_channels
-                        )
-                    if exists(disable_self_attentions):
-                        disabled_sa = disable_self_attentions[level]
-                    else:
-                        disabled_sa = False
-
-                    if (
-                        not exists(num_attention_blocks)
-                        or i < num_attention_blocks[level]
-                    ):
-                        layers.append(
-                            checkpoint_wrapper_fn(
-                                AttentionBlock(
-                                    ch,
-                                    use_checkpoint=use_checkpoint,
-                                    num_heads=num_heads_upsample,
-                                    num_head_channels=dim_head,
-                                    use_new_attention_order=use_new_attention_order,
-                                )
-                            )
-                            if not use_spatial_transformer
-                            else checkpoint_wrapper_fn(
-                                SpatialTransformer(
-                                    ch,
-                                    num_heads,
-                                    dim_head,
-                                    depth=transformer_depth[level],
-                                    context_dim=context_dim,
-                                    disable_self_attn=disabled_sa,
-                                    use_linear=use_linear_in_transformer,
-                                    attn_type=spatial_transformer_attn_type,
-                                    use_checkpoint=use_checkpoint,
-                                )
-                            )
-                        )
-                if level and i == self.num_res_blocks[level]:
-                    out_ch = ch
-                    layers.append(
-                        checkpoint_wrapper_fn(
-                            ResBlock(
-                                ch,
-                                time_embed_dim,
-                                dropout,
-                                out_channels=out_ch,
-                                dims=dims,
-                                use_checkpoint=use_checkpoint,
-                                use_scale_shift_norm=use_scale_shift_norm,
-                                up=True,
-                            )
-                        )
-                        if resblock_updown
-                        else Upsample(ch, conv_resample, dims=dims, out_channels=out_ch)
-                    )
-                    ds //= 2
-                self.output_blocks.append(TimestepEmbedSequential(*layers))
-                self._feature_size += ch
-
-        self.out = checkpoint_wrapper_fn(
-            nn.Sequential(
-                normalization(ch),
-                nn.SiLU(),
-                zero_module(conv_nd(dims, model_channels, out_channels, 3, padding=1)),
-            )
-        )
-        if self.predict_codebook_ids:
-            self.id_predictor = checkpoint_wrapper_fn(
-                nn.Sequential(
-                    normalization(ch),
-                    conv_nd(dims, model_channels, n_embed, 1),
-                    # nn.LogSoftmax(dim=1)  # change to cross_entropy and produce non-normalized logits
-                )
-            )
-
-        if lora_init:
-            self._init_lora(lora_rank, lora_scale, lora_weight_path)
-
-    def _init_lora(self, rank, scale, ckpt_dir=None):
-        inject_trainable_lora_extended(
-            self, target_replace_module=None, rank=rank, scale=scale
-        )
-
-        if ckpt_dir is not None:
-            with open(os.path.join(ckpt_dir, "latest")) as latest_file:
-                latest = latest_file.read().strip()
-            ckpt_path = os.path.join(ckpt_dir, latest, "mp_rank_00_model_states.pt")
-            print(f"loading lora from {ckpt_path}")
-            sd = th.load(ckpt_path)["module"]
-            sd = {
-                key[len("model.diffusion_model") :]: sd[key]
-                for key in sd
-                if key.startswith("model.diffusion_model")
-            }
-            self.load_state_dict(sd, strict=False)
-
-    def _update_scale(self, scale):
-        update_lora_scale(self, scale)
-
-    def convert_to_fp16(self):
-        """
-        Convert the torso of the model to float16.
-        """
-        self.input_blocks.apply(convert_module_to_f16)
-        self.middle_block.apply(convert_module_to_f16)
-        self.output_blocks.apply(convert_module_to_f16)
-
-    def convert_to_fp32(self):
-        """
-        Convert the torso of the model to float32.
-        """
-        self.input_blocks.apply(convert_module_to_f32)
-        self.middle_block.apply(convert_module_to_f32)
-        self.output_blocks.apply(convert_module_to_f32)
-
-    def forward(self, x, timesteps=None, context=None, y=None, **kwargs):
-        """
-        Apply the model to an input batch.
-        :param x: an [N x C x ...] Tensor of inputs.
-        :param timesteps: a 1-D batch of timesteps.
-        :param context: conditioning plugged in via crossattn
-        :param y: an [N] Tensor of labels, if class-conditional.
-        :return: an [N x C x ...] Tensor of outputs.
-        """
-        assert (y is not None) == (
-            self.num_classes is not None
-        ), "must specify y if and only if the model is class-conditional"
-        hs = []
-        t_emb = timestep_embedding(
-            timesteps, self.model_channels, repeat_only=False, dtype=self.dtype
-        )
-        emb = self.time_embed(t_emb)
-
-        if self.num_classes is not None:
-            assert y.shape[0] == x.shape[0]
-            emb = emb + self.label_emb(y)
-
-        # h = x.type(self.dtype)
-        h = x
-        for module in self.input_blocks:
-            h = module(h, emb, context)
-            hs.append(h)
-        h = self.middle_block(h, emb, context)
-        for module in self.output_blocks:
-            h = th.cat([h, hs.pop()], dim=1)
-            h = module(h, emb, context)
-        h = h.type(x.dtype)
-        if self.predict_codebook_ids:
-            assert False, "not supported anymore. what the f*** are you doing?"
-        else:
-            return self.out(h)
-
-
-class NoTimeUNetModel(UNetModel):
-    def forward(self, x, timesteps=None, context=None, y=None, **kwargs):
-        timesteps = th.zeros_like(timesteps)
-        return super().forward(x, timesteps, context, y, **kwargs)
-
-
-class EncoderUNetModel(nn.Module):
-    """
-    The half UNet model with attention and timestep embedding.
-    For usage, see UNet.
-    """
-
-    def __init__(
-        self,
-        image_size,
-        in_channels,
-        model_channels,
-        out_channels,
-        num_res_blocks,
-        attention_resolutions,
-        dropout=0,
-        channel_mult=(1, 2, 4, 8),
-        conv_resample=True,
-        dims=2,
-        use_checkpoint=False,
-        use_fp16=False,
-        num_heads=1,
-        num_head_channels=-1,
-        num_heads_upsample=-1,
-        use_scale_shift_norm=False,
-        resblock_updown=False,
-        use_new_attention_order=False,
-        pool="adaptive",
-        *args,
-        **kwargs,
-    ):
-        super().__init__()
-
-        if num_heads_upsample == -1:
-            num_heads_upsample = num_heads
-
-        self.in_channels = in_channels
-        self.model_channels = model_channels
-        self.out_channels = out_channels
-        self.num_res_blocks = num_res_blocks
-        self.attention_resolutions = attention_resolutions
-        self.dropout = dropout
-        self.channel_mult = channel_mult
-        self.conv_resample = conv_resample
-        self.use_checkpoint = use_checkpoint
-        self.dtype = th.float16 if use_fp16 else th.float32
-        self.num_heads = num_heads
-        self.num_head_channels = num_head_channels
-        self.num_heads_upsample = num_heads_upsample
-
-        time_embed_dim = model_channels * 4
-        self.time_embed = nn.Sequential(
-            linear(model_channels, time_embed_dim),
-            nn.SiLU(),
-            linear(time_embed_dim, time_embed_dim),
-        )
-
-        self.input_blocks = nn.ModuleList(
-            [
-                TimestepEmbedSequential(
-                    conv_nd(dims, in_channels, model_channels, 3, padding=1)
-                )
-            ]
-        )
-        self._feature_size = model_channels
-        input_block_chans = [model_channels]
-        ch = model_channels
-        ds = 1
-        for level, mult in enumerate(channel_mult):
-            for _ in range(num_res_blocks):
-                layers = [
-                    ResBlock(
-                        ch,
-                        time_embed_dim,
-                        dropout,
-                        out_channels=mult * model_channels,
-                        dims=dims,
-                        use_checkpoint=use_checkpoint,
-                        use_scale_shift_norm=use_scale_shift_norm,
-                    )
-                ]
-                ch = mult * model_channels
-                if ds in attention_resolutions:
-                    layers.append(
-                        AttentionBlock(
-                            ch,
-                            use_checkpoint=use_checkpoint,
-                            num_heads=num_heads,
-                            num_head_channels=num_head_channels,
-                            use_new_attention_order=use_new_attention_order,
-                        )
-                    )
-                self.input_blocks.append(TimestepEmbedSequential(*layers))
-                self._feature_size += ch
-                input_block_chans.append(ch)
-            if level != len(channel_mult) - 1:
-                out_ch = ch
-                self.input_blocks.append(
-                    TimestepEmbedSequential(
-                        ResBlock(
-                            ch,
-                            time_embed_dim,
-                            dropout,
-                            out_channels=out_ch,
-                            dims=dims,
-                            use_checkpoint=use_checkpoint,
-                            use_scale_shift_norm=use_scale_shift_norm,
-                            down=True,
-                        )
-                        if resblock_updown
-                        else Downsample(
-                            ch, conv_resample, dims=dims, out_channels=out_ch
-                        )
-                    )
-                )
-                ch = out_ch
-                input_block_chans.append(ch)
-                ds *= 2
-                self._feature_size += ch
-
-        self.middle_block = TimestepEmbedSequential(
-            ResBlock(
-                ch,
-                time_embed_dim,
-                dropout,
-                dims=dims,
-                use_checkpoint=use_checkpoint,
-                use_scale_shift_norm=use_scale_shift_norm,
-            ),
-            AttentionBlock(
-                ch,
-                use_checkpoint=use_checkpoint,
-                num_heads=num_heads,
-                num_head_channels=num_head_channels,
-                use_new_attention_order=use_new_attention_order,
-            ),
-            ResBlock(
-                ch,
-                time_embed_dim,
-                dropout,
-                dims=dims,
-                use_checkpoint=use_checkpoint,
-                use_scale_shift_norm=use_scale_shift_norm,
-            ),
-        )
-        self._feature_size += ch
-        self.pool = pool
-        if pool == "adaptive":
-            self.out = nn.Sequential(
-                normalization(ch),
-                nn.SiLU(),
-                nn.AdaptiveAvgPool2d((1, 1)),
-                zero_module(conv_nd(dims, ch, out_channels, 1)),
-                nn.Flatten(),
-            )
-        elif pool == "attention":
-            assert num_head_channels != -1
-            self.out = nn.Sequential(
-                normalization(ch),
-                nn.SiLU(),
-                AttentionPool2d(
-                    (image_size // ds), ch, num_head_channels, out_channels
-                ),
-            )
-        elif pool == "spatial":
-            self.out = nn.Sequential(
-                nn.Linear(self._feature_size, 2048),
-                nn.ReLU(),
-                nn.Linear(2048, self.out_channels),
-            )
-        elif pool == "spatial_v2":
-            self.out = nn.Sequential(
-                nn.Linear(self._feature_size, 2048),
-                normalization(2048),
-                nn.SiLU(),
-                nn.Linear(2048, self.out_channels),
-            )
-        else:
-            raise NotImplementedError(f"Unexpected {pool} pooling")
-
-    def convert_to_fp16(self):
-        """
-        Convert the torso of the model to float16.
-        """
-        self.input_blocks.apply(convert_module_to_f16)
-        self.middle_block.apply(convert_module_to_f16)
-
-    def convert_to_fp32(self):
-        """
-        Convert the torso of the model to float32.
-        """
-        self.input_blocks.apply(convert_module_to_f32)
-        self.middle_block.apply(convert_module_to_f32)
-
-    def forward(self, x, timesteps):
-        """
-        Apply the model to an input batch.
-        :param x: an [N x C x ...] Tensor of inputs.
-        :param timesteps: a 1-D batch of timesteps.
-        :return: an [N x K] Tensor of outputs.
-        """
-        emb = self.time_embed(timestep_embedding(timesteps, self.model_channels))
-
-        results = []
-        # h = x.type(self.dtype)
-        h = x
-        for module in self.input_blocks:
-            h = module(h, emb)
-            if self.pool.startswith("spatial"):
-                results.append(h.type(x.dtype).mean(dim=(2, 3)))
-        h = self.middle_block(h, emb)
-        if self.pool.startswith("spatial"):
-            results.append(h.type(x.dtype).mean(dim=(2, 3)))
-            h = th.cat(results, axis=-1)
-            return self.out(h)
-        else:
-            h = h.type(x.dtype)
-            return self.out(h)
-
-
-if __name__ == "__main__":
-
-    class Dummy(nn.Module):
-        def __init__(self, in_channels=3, model_channels=64):
-            super().__init__()
-            self.input_blocks = nn.ModuleList(
-                [
-                    TimestepEmbedSequential(
-                        conv_nd(2, in_channels, model_channels, 3, padding=1)
-                    )
-                ]
-            )
-
-    model = UNetModel(
-        use_checkpoint=True,
-        image_size=64,
-        in_channels=4,
-        out_channels=4,
-        model_channels=128,
-        attention_resolutions=[4, 2],
-        num_res_blocks=2,
-        channel_mult=[1, 2, 4],
-        num_head_channels=64,
-        use_spatial_transformer=False,
-        use_linear_in_transformer=True,
-        transformer_depth=1,
-        legacy=False,
-    ).cuda()
-    x = th.randn(11, 4, 64, 64).cuda()
-    t = th.randint(low=0, high=10, size=(11,), device="cuda")
-    o = model(x, t)
-    print("done.")
diff --git a/videotuna/models/cogvideo_sat/sgm/modules/diffusionmodules/sampling.py b/videotuna/models/cogvideo_sat/sgm/modules/diffusionmodules/sampling.py
deleted file mode 100644
index 7067334d..00000000
--- a/videotuna/models/cogvideo_sat/sgm/modules/diffusionmodules/sampling.py
+++ /dev/null
@@ -1,1103 +0,0 @@
-"""
-    Partially ported from https://github.com/crowsonkb/k-diffusion/blob/master/k_diffusion/sampling.py
-"""
-
-from typing import Dict, Union
-
-import torch
-from omegaconf import ListConfig, OmegaConf
-from tqdm import tqdm
-
-from ...modules.diffusionmodules.sampling_utils import (
-    get_ancestral_step,
-    linear_multistep_coeff,
-    to_d,
-    to_neg_log_sigma,
-    to_sigma,
-)
-from ...util import append_dims, default, instantiate_from_config
-from .guiders import DynamicCFG
-
-DEFAULT_GUIDER = {"target": "sgm.modules.diffusionmodules.guiders.IdentityGuider"}
-
-
-class BaseDiffusionSampler:
-    def __init__(
-        self,
-        discretization_config: Union[Dict, ListConfig, OmegaConf],
-        num_steps: Union[int, None] = None,
-        guider_config: Union[Dict, ListConfig, OmegaConf, None] = None,
-        verbose: bool = False,
-        device: str = "cuda",
-    ):
-        self.num_steps = num_steps
-        self.discretization = instantiate_from_config(discretization_config)
-        self.guider = instantiate_from_config(
-            default(
-                guider_config,
-                DEFAULT_GUIDER,
-            )
-        )
-        self.verbose = verbose
-        self.device = device
-
-    def prepare_sampling_loop(self, x, cond, uc=None, num_steps=None):
-        sigmas = self.discretization(
-            self.num_steps if num_steps is None else num_steps, device=self.device
-        )
-        uc = default(uc, cond)
-
-        x *= torch.sqrt(1.0 + sigmas[0] ** 2.0)
-        num_sigmas = len(sigmas)
-
-        s_in = x.new_ones([x.shape[0]]).float()
-
-        return x, s_in, sigmas, num_sigmas, cond, uc
-
-    def denoise(self, x, denoiser, sigma, cond, uc):
-        denoised = denoiser(*self.guider.prepare_inputs(x, sigma, cond, uc))
-        denoised = self.guider(denoised, sigma)
-        return denoised
-
-    def get_sigma_gen(self, num_sigmas):
-        sigma_generator = range(num_sigmas - 1)
-        if self.verbose:
-            print("#" * 30, " Sampling setting ", "#" * 30)
-            print(f"Sampler: {self.__class__.__name__}")
-            print(f"Discretization: {self.discretization.__class__.__name__}")
-            print(f"Guider: {self.guider.__class__.__name__}")
-            sigma_generator = tqdm(
-                sigma_generator,
-                total=num_sigmas,
-                desc=f"Sampling with {self.__class__.__name__} for {num_sigmas} steps",
-            )
-        return sigma_generator
-
-
-class SingleStepDiffusionSampler(BaseDiffusionSampler):
-    def sampler_step(self, sigma, next_sigma, denoiser, x, cond, uc, *args, **kwargs):
-        raise NotImplementedError
-
-    def euler_step(self, x, d, dt):
-        return x + dt * d
-
-
-class EDMSampler(SingleStepDiffusionSampler):
-    def __init__(
-        self, s_churn=0.0, s_tmin=0.0, s_tmax=float("inf"), s_noise=1.0, *args, **kwargs
-    ):
-        super().__init__(*args, **kwargs)
-
-        self.s_churn = s_churn
-        self.s_tmin = s_tmin
-        self.s_tmax = s_tmax
-        self.s_noise = s_noise
-
-    def sampler_step(self, sigma, next_sigma, denoiser, x, cond, uc=None, gamma=0.0):
-        sigma_hat = sigma * (gamma + 1.0)
-        if gamma > 0:
-            eps = torch.randn_like(x) * self.s_noise
-            x = x + eps * append_dims(sigma_hat**2 - sigma**2, x.ndim) ** 0.5
-
-        denoised = self.denoise(x, denoiser, sigma_hat, cond, uc)
-        d = to_d(x, sigma_hat, denoised)
-        dt = append_dims(next_sigma - sigma_hat, x.ndim)
-
-        euler_step = self.euler_step(x, d, dt)
-        x = self.possible_correction_step(
-            euler_step, x, d, dt, next_sigma, denoiser, cond, uc
-        )
-        return x
-
-    def __call__(self, denoiser, x, cond, uc=None, num_steps=None):
-        x, s_in, sigmas, num_sigmas, cond, uc = self.prepare_sampling_loop(
-            x, cond, uc, num_steps
-        )
-
-        for i in self.get_sigma_gen(num_sigmas):
-            gamma = (
-                min(self.s_churn / (num_sigmas - 1), 2**0.5 - 1)
-                if self.s_tmin <= sigmas[i] <= self.s_tmax
-                else 0.0
-            )
-            x = self.sampler_step(
-                s_in * sigmas[i],
-                s_in * sigmas[i + 1],
-                denoiser,
-                x,
-                cond,
-                uc,
-                gamma,
-            )
-
-        return x
-
-
-class DDIMSampler(SingleStepDiffusionSampler):
-    def __init__(self, s_noise=0.1, *args, **kwargs):
-        super().__init__(*args, **kwargs)
-
-        self.s_noise = s_noise
-
-    def sampler_step(self, sigma, next_sigma, denoiser, x, cond, uc=None, s_noise=0.0):
-
-        denoised = self.denoise(x, denoiser, sigma, cond, uc)
-        d = to_d(x, sigma, denoised)
-        dt = append_dims(next_sigma * (1 - s_noise**2) ** 0.5 - sigma, x.ndim)
-
-        euler_step = (
-            x + dt * d + s_noise * append_dims(next_sigma, x.ndim) * torch.randn_like(x)
-        )
-
-        x = self.possible_correction_step(
-            euler_step, x, d, dt, next_sigma, denoiser, cond, uc
-        )
-        return x
-
-    def __call__(self, denoiser, x, cond, uc=None, num_steps=None):
-        x, s_in, sigmas, num_sigmas, cond, uc = self.prepare_sampling_loop(
-            x, cond, uc, num_steps
-        )
-
-        for i in self.get_sigma_gen(num_sigmas):
-            x = self.sampler_step(
-                s_in * sigmas[i],
-                s_in * sigmas[i + 1],
-                denoiser,
-                x,
-                cond,
-                uc,
-                self.s_noise,
-            )
-
-        return x
-
-
-class AncestralSampler(SingleStepDiffusionSampler):
-    def __init__(self, eta=1.0, s_noise=1.0, *args, **kwargs):
-        super().__init__(*args, **kwargs)
-
-        self.eta = eta
-        self.s_noise = s_noise
-        self.noise_sampler = lambda x: torch.randn_like(x)
-
-    def ancestral_euler_step(self, x, denoised, sigma, sigma_down):
-        d = to_d(x, sigma, denoised)
-        dt = append_dims(sigma_down - sigma, x.ndim)
-
-        return self.euler_step(x, d, dt)
-
-    def ancestral_step(self, x, sigma, next_sigma, sigma_up):
-        x = torch.where(
-            append_dims(next_sigma, x.ndim) > 0.0,
-            x + self.noise_sampler(x) * self.s_noise * append_dims(sigma_up, x.ndim),
-            x,
-        )
-        return x
-
-    def __call__(self, denoiser, x, cond, uc=None, num_steps=None):
-        x, s_in, sigmas, num_sigmas, cond, uc = self.prepare_sampling_loop(
-            x, cond, uc, num_steps
-        )
-
-        for i in self.get_sigma_gen(num_sigmas):
-            x = self.sampler_step(
-                s_in * sigmas[i],
-                s_in * sigmas[i + 1],
-                denoiser,
-                x,
-                cond,
-                uc,
-            )
-
-        return x
-
-
-class LinearMultistepSampler(BaseDiffusionSampler):
-    def __init__(
-        self,
-        order=4,
-        *args,
-        **kwargs,
-    ):
-        super().__init__(*args, **kwargs)
-
-        self.order = order
-
-    def __call__(self, denoiser, x, cond, uc=None, num_steps=None, **kwargs):
-        x, s_in, sigmas, num_sigmas, cond, uc = self.prepare_sampling_loop(
-            x, cond, uc, num_steps
-        )
-
-        ds = []
-        sigmas_cpu = sigmas.detach().cpu().numpy()
-        for i in self.get_sigma_gen(num_sigmas):
-            sigma = s_in * sigmas[i]
-            denoised = denoiser(
-                *self.guider.prepare_inputs(x, sigma, cond, uc), **kwargs
-            )
-            denoised = self.guider(denoised, sigma)
-            d = to_d(x, sigma, denoised)
-            ds.append(d)
-            if len(ds) > self.order:
-                ds.pop(0)
-            cur_order = min(i + 1, self.order)
-            coeffs = [
-                linear_multistep_coeff(cur_order, sigmas_cpu, i, j)
-                for j in range(cur_order)
-            ]
-            x = x + sum(coeff * d for coeff, d in zip(coeffs, reversed(ds)))
-
-        return x
-
-
-class EulerEDMSampler(EDMSampler):
-    def possible_correction_step(
-        self, euler_step, x, d, dt, next_sigma, denoiser, cond, uc
-    ):
-        return euler_step
-
-
-class HeunEDMSampler(EDMSampler):
-    def possible_correction_step(
-        self, euler_step, x, d, dt, next_sigma, denoiser, cond, uc
-    ):
-        if torch.sum(next_sigma) < 1e-14:
-            # Save a network evaluation if all noise levels are 0
-            return euler_step
-        else:
-            denoised = self.denoise(euler_step, denoiser, next_sigma, cond, uc)
-            d_new = to_d(euler_step, next_sigma, denoised)
-            d_prime = (d + d_new) / 2.0
-
-            # apply correction if noise level is not 0
-            x = torch.where(
-                append_dims(next_sigma, x.ndim) > 0.0, x + d_prime * dt, euler_step
-            )
-            return x
-
-
-class EulerAncestralSampler(AncestralSampler):
-    def sampler_step(self, sigma, next_sigma, denoiser, x, cond, uc):
-        sigma_down, sigma_up = get_ancestral_step(sigma, next_sigma, eta=self.eta)
-        denoised = self.denoise(x, denoiser, sigma, cond, uc)
-        x = self.ancestral_euler_step(x, denoised, sigma, sigma_down)
-        x = self.ancestral_step(x, sigma, next_sigma, sigma_up)
-
-        return x
-
-
-class DPMPP2SAncestralSampler(AncestralSampler):
-    def get_variables(self, sigma, sigma_down):
-        t, t_next = [to_neg_log_sigma(s) for s in (sigma, sigma_down)]
-        h = t_next - t
-        s = t + 0.5 * h
-        return h, s, t, t_next
-
-    def get_mult(self, h, s, t, t_next):
-        mult1 = to_sigma(s) / to_sigma(t)
-        mult2 = (-0.5 * h).expm1()
-        mult3 = to_sigma(t_next) / to_sigma(t)
-        mult4 = (-h).expm1()
-
-        return mult1, mult2, mult3, mult4
-
-    def sampler_step(self, sigma, next_sigma, denoiser, x, cond, uc=None, **kwargs):
-        sigma_down, sigma_up = get_ancestral_step(sigma, next_sigma, eta=self.eta)
-        denoised = self.denoise(x, denoiser, sigma, cond, uc)
-        x_euler = self.ancestral_euler_step(x, denoised, sigma, sigma_down)
-
-        if torch.sum(sigma_down) < 1e-14:
-            # Save a network evaluation if all noise levels are 0
-            x = x_euler
-        else:
-            h, s, t, t_next = self.get_variables(sigma, sigma_down)
-            mult = [
-                append_dims(mult, x.ndim) for mult in self.get_mult(h, s, t, t_next)
-            ]
-
-            x2 = mult[0] * x - mult[1] * denoised
-            denoised2 = self.denoise(x2, denoiser, to_sigma(s), cond, uc)
-            x_dpmpp2s = mult[2] * x - mult[3] * denoised2
-
-            # apply correction if noise level is not 0
-            x = torch.where(append_dims(sigma_down, x.ndim) > 0.0, x_dpmpp2s, x_euler)
-
-        x = self.ancestral_step(x, sigma, next_sigma, sigma_up)
-        return x
-
-
-class DPMPP2MSampler(BaseDiffusionSampler):
-    def get_variables(self, sigma, next_sigma, previous_sigma=None):
-        t, t_next = [to_neg_log_sigma(s) for s in (sigma, next_sigma)]
-        h = t_next - t
-
-        if previous_sigma is not None:
-            h_last = t - to_neg_log_sigma(previous_sigma)
-            r = h_last / h
-            return h, r, t, t_next
-        else:
-            return h, None, t, t_next
-
-    def get_mult(self, h, r, t, t_next, previous_sigma):
-        mult1 = to_sigma(t_next) / to_sigma(t)
-        mult2 = (-h).expm1()
-
-        if previous_sigma is not None:
-            mult3 = 1 + 1 / (2 * r)
-            mult4 = 1 / (2 * r)
-            return mult1, mult2, mult3, mult4
-        else:
-            return mult1, mult2
-
-    def sampler_step(
-        self,
-        old_denoised,
-        previous_sigma,
-        sigma,
-        next_sigma,
-        denoiser,
-        x,
-        cond,
-        uc=None,
-    ):
-        denoised = self.denoise(x, denoiser, sigma, cond, uc)
-
-        h, r, t, t_next = self.get_variables(sigma, next_sigma, previous_sigma)
-        mult = [
-            append_dims(mult, x.ndim)
-            for mult in self.get_mult(h, r, t, t_next, previous_sigma)
-        ]
-
-        x_standard = mult[0] * x - mult[1] * denoised
-        if old_denoised is None or torch.sum(next_sigma) < 1e-14:
-            # Save a network evaluation if all noise levels are 0 or on the first step
-            return x_standard, denoised
-        else:
-            denoised_d = mult[2] * denoised - mult[3] * old_denoised
-            x_advanced = mult[0] * x - mult[1] * denoised_d
-
-            # apply correction if noise level is not 0 and not first step
-            x = torch.where(
-                append_dims(next_sigma, x.ndim) > 0.0, x_advanced, x_standard
-            )
-
-        return x, denoised
-
-    def __call__(self, denoiser, x, cond, uc=None, num_steps=None, **kwargs):
-        x, s_in, sigmas, num_sigmas, cond, uc = self.prepare_sampling_loop(
-            x, cond, uc, num_steps
-        )
-
-        old_denoised = None
-        for i in self.get_sigma_gen(num_sigmas):
-            x, old_denoised = self.sampler_step(
-                old_denoised,
-                None if i == 0 else s_in * sigmas[i - 1],
-                s_in * sigmas[i],
-                s_in * sigmas[i + 1],
-                denoiser,
-                x,
-                cond,
-                uc=uc,
-            )
-
-        return x
-
-
-class SDEDPMPP2MSampler(BaseDiffusionSampler):
-    def get_variables(self, sigma, next_sigma, previous_sigma=None):
-        t, t_next = [to_neg_log_sigma(s) for s in (sigma, next_sigma)]
-        h = t_next - t
-
-        if previous_sigma is not None:
-            h_last = t - to_neg_log_sigma(previous_sigma)
-            r = h_last / h
-            return h, r, t, t_next
-        else:
-            return h, None, t, t_next
-
-    def get_mult(self, h, r, t, t_next, previous_sigma):
-        mult1 = to_sigma(t_next) / to_sigma(t) * (-h).exp()
-        mult2 = (-2 * h).expm1()
-
-        if previous_sigma is not None:
-            mult3 = 1 + 1 / (2 * r)
-            mult4 = 1 / (2 * r)
-            return mult1, mult2, mult3, mult4
-        else:
-            return mult1, mult2
-
-    def sampler_step(
-        self,
-        old_denoised,
-        previous_sigma,
-        sigma,
-        next_sigma,
-        denoiser,
-        x,
-        cond,
-        uc=None,
-    ):
-        denoised = self.denoise(x, denoiser, sigma, cond, uc)
-
-        h, r, t, t_next = self.get_variables(sigma, next_sigma, previous_sigma)
-        mult = [
-            append_dims(mult, x.ndim)
-            for mult in self.get_mult(h, r, t, t_next, previous_sigma)
-        ]
-        mult_noise = append_dims(next_sigma * (1 - (-2 * h).exp()) ** 0.5, x.ndim)
-
-        x_standard = mult[0] * x - mult[1] * denoised + mult_noise * torch.randn_like(x)
-        if old_denoised is None or torch.sum(next_sigma) < 1e-14:
-            # Save a network evaluation if all noise levels are 0 or on the first step
-            return x_standard, denoised
-        else:
-            denoised_d = mult[2] * denoised - mult[3] * old_denoised
-            x_advanced = (
-                mult[0] * x - mult[1] * denoised_d + mult_noise * torch.randn_like(x)
-            )
-
-            # apply correction if noise level is not 0 and not first step
-            x = torch.where(
-                append_dims(next_sigma, x.ndim) > 0.0, x_advanced, x_standard
-            )
-
-        return x, denoised
-
-    def __call__(
-        self, denoiser, x, cond, uc=None, num_steps=None, scale=None, **kwargs
-    ):
-        x, s_in, sigmas, num_sigmas, cond, uc = self.prepare_sampling_loop(
-            x, cond, uc, num_steps
-        )
-
-        old_denoised = None
-        for i in self.get_sigma_gen(num_sigmas):
-            x, old_denoised = self.sampler_step(
-                old_denoised,
-                None if i == 0 else s_in * sigmas[i - 1],
-                s_in * sigmas[i],
-                s_in * sigmas[i + 1],
-                denoiser,
-                x,
-                cond,
-                uc=uc,
-            )
-
-        return x
-
-
-class SdeditEDMSampler(EulerEDMSampler):
-    def __init__(self, edit_ratio=0.5, *args, **kwargs):
-        super().__init__(*args, **kwargs)
-
-        self.edit_ratio = edit_ratio
-
-    def __call__(
-        self, denoiser, image, randn, cond, uc=None, num_steps=None, edit_ratio=None
-    ):
-        randn_unit = randn.clone()
-        randn, s_in, sigmas, num_sigmas, cond, uc = self.prepare_sampling_loop(
-            randn, cond, uc, num_steps
-        )
-
-        if num_steps is None:
-            num_steps = self.num_steps
-        if edit_ratio is None:
-            edit_ratio = self.edit_ratio
-        x = None
-
-        for i in self.get_sigma_gen(num_sigmas):
-            if i / num_steps < edit_ratio:
-                continue
-            if x is None:
-                x = image + randn_unit * append_dims(
-                    s_in * sigmas[i], len(randn_unit.shape)
-                )
-
-            gamma = (
-                min(self.s_churn / (num_sigmas - 1), 2**0.5 - 1)
-                if self.s_tmin <= sigmas[i] <= self.s_tmax
-                else 0.0
-            )
-            x = self.sampler_step(
-                s_in * sigmas[i],
-                s_in * sigmas[i + 1],
-                denoiser,
-                x,
-                cond,
-                uc,
-                gamma,
-            )
-
-        return x
-
-
-class VideoDDIMSampler(BaseDiffusionSampler):
-
-    def __init__(self, fixed_frames=0, sdedit=False, **kwargs):
-        super().__init__(**kwargs)
-        self.fixed_frames = fixed_frames
-        self.sdedit = sdedit
-
-    def prepare_sampling_loop(self, x, cond, uc=None, num_steps=None):
-        alpha_cumprod_sqrt, timesteps = self.discretization(
-            self.num_steps if num_steps is None else num_steps,
-            device=self.device,
-            return_idx=True,
-            do_append_zero=False,
-        )
-        alpha_cumprod_sqrt = torch.cat(
-            [alpha_cumprod_sqrt, alpha_cumprod_sqrt.new_ones([1])]
-        )
-        timesteps = torch.cat(
-            [
-                torch.tensor(list(timesteps)).new_zeros([1]) - 1,
-                torch.tensor(list(timesteps)),
-            ]
-        )
-
-        uc = default(uc, cond)
-
-        num_sigmas = len(alpha_cumprod_sqrt)
-
-        s_in = x.new_ones([x.shape[0]])
-
-        return x, s_in, alpha_cumprod_sqrt, num_sigmas, cond, uc, timesteps
-
-    def denoise(
-        self,
-        x,
-        denoiser,
-        alpha_cumprod_sqrt,
-        cond,
-        uc,
-        timestep=None,
-        idx=None,
-        scale=None,
-        scale_emb=None,
-        ofs=None,
-    ):
-        additional_model_inputs = {}
-
-        if ofs is not None:
-            additional_model_inputs["ofs"] = ofs
-
-        if isinstance(scale, torch.Tensor) == False and scale == 1:
-            additional_model_inputs["idx"] = x.new_ones([x.shape[0]]) * timestep
-            if scale_emb is not None:
-                additional_model_inputs["scale_emb"] = scale_emb
-            denoised = denoiser(
-                x, alpha_cumprod_sqrt, cond, **additional_model_inputs
-            ).to(torch.float32)
-        else:
-            additional_model_inputs["idx"] = torch.cat(
-                [x.new_ones([x.shape[0]]) * timestep] * 2
-            )
-            denoised = denoiser(
-                *self.guider.prepare_inputs(x, alpha_cumprod_sqrt, cond, uc),
-                **additional_model_inputs,
-            ).to(torch.float32)
-            if isinstance(self.guider, DynamicCFG):
-                denoised = self.guider(
-                    denoised,
-                    (1 - alpha_cumprod_sqrt**2) ** 0.5,
-                    step_index=self.num_steps - timestep,
-                    scale=scale,
-                )
-            else:
-                denoised = self.guider(
-                    denoised, (1 - alpha_cumprod_sqrt**2) ** 0.5, scale=scale
-                )
-        return denoised
-
-    def sampler_step(
-        self,
-        alpha_cumprod_sqrt,
-        next_alpha_cumprod_sqrt,
-        denoiser,
-        x,
-        cond,
-        uc=None,
-        idx=None,
-        timestep=None,
-        scale=None,
-        scale_emb=None,
-        ofs=None,
-    ):
-        denoised = self.denoise(
-            x,
-            denoiser,
-            alpha_cumprod_sqrt,
-            cond,
-            uc,
-            timestep,
-            idx,
-            scale=scale,
-            scale_emb=scale_emb,
-            ofs=ofs,
-        ).to(
-            torch.float32
-        )  # 1020
-
-        a_t = ((1 - next_alpha_cumprod_sqrt**2) / (1 - alpha_cumprod_sqrt**2)) ** 0.5
-        b_t = next_alpha_cumprod_sqrt - alpha_cumprod_sqrt * a_t
-
-        x = append_dims(a_t, x.ndim) * x + append_dims(b_t, x.ndim) * denoised
-        return x
-
-    def __call__(
-        self,
-        denoiser,
-        x,
-        cond,
-        uc=None,
-        num_steps=None,
-        scale=None,
-        scale_emb=None,
-        ofs=None,
-    ):  # 1020
-        x, s_in, alpha_cumprod_sqrt, num_sigmas, cond, uc, timesteps = (
-            self.prepare_sampling_loop(x, cond, uc, num_steps)
-        )
-
-        for i in self.get_sigma_gen(num_sigmas):
-            x = self.sampler_step(
-                s_in * alpha_cumprod_sqrt[i],
-                s_in * alpha_cumprod_sqrt[i + 1],
-                denoiser,
-                x,
-                cond,
-                uc,
-                idx=self.num_steps - i,
-                timestep=timesteps[-(i + 1)],
-                scale=scale,
-                scale_emb=scale_emb,
-                ofs=ofs,  # 1020
-            )
-
-        return x
-
-
-class Image2VideoDDIMSampler(BaseDiffusionSampler):
-
-    def prepare_sampling_loop(self, x, cond, uc=None, num_steps=None):
-        alpha_cumprod_sqrt, timesteps = self.discretization(
-            self.num_steps if num_steps is None else num_steps,
-            device=self.device,
-            return_idx=True,
-        )
-        uc = default(uc, cond)
-
-        num_sigmas = len(alpha_cumprod_sqrt)
-
-        s_in = x.new_ones([x.shape[0]])
-
-        return x, s_in, alpha_cumprod_sqrt, num_sigmas, cond, uc, timesteps
-
-    def denoise(self, x, denoiser, alpha_cumprod_sqrt, cond, uc, timestep=None):
-        additional_model_inputs = {}
-        additional_model_inputs["idx"] = torch.cat(
-            [x.new_ones([x.shape[0]]) * timestep] * 2
-        )
-        denoised = denoiser(
-            *self.guider.prepare_inputs(x, alpha_cumprod_sqrt, cond, uc),
-            **additional_model_inputs,
-        ).to(torch.float32)
-        if isinstance(self.guider, DynamicCFG):
-            denoised = self.guider(
-                denoised,
-                (1 - alpha_cumprod_sqrt**2) ** 0.5,
-                step_index=self.num_steps - timestep,
-            )
-        else:
-            denoised = self.guider(denoised, (1 - alpha_cumprod_sqrt**2) ** 0.5)
-        return denoised
-
-    def sampler_step(
-        self,
-        alpha_cumprod_sqrt,
-        next_alpha_cumprod_sqrt,
-        denoiser,
-        x,
-        cond,
-        uc=None,
-        idx=None,
-        timestep=None,
-    ):
-        # 此处的sigma实际上是alpha_cumprod_sqrt
-        denoised = self.denoise(x, denoiser, alpha_cumprod_sqrt, cond, uc, timestep).to(
-            torch.float32
-        )
-        if idx == 1:
-            return denoised
-
-        a_t = ((1 - next_alpha_cumprod_sqrt**2) / (1 - alpha_cumprod_sqrt**2)) ** 0.5
-        b_t = next_alpha_cumprod_sqrt - alpha_cumprod_sqrt * a_t
-
-        x = append_dims(a_t, x.ndim) * x + append_dims(b_t, x.ndim) * denoised
-        return x
-
-    def __call__(self, image, denoiser, x, cond, uc=None, num_steps=None):
-        x, s_in, alpha_cumprod_sqrt, num_sigmas, cond, uc, timesteps = (
-            self.prepare_sampling_loop(x, cond, uc, num_steps)
-        )
-
-        for i in self.get_sigma_gen(num_sigmas):
-            x = self.sampler_step(
-                s_in * alpha_cumprod_sqrt[i],
-                s_in * alpha_cumprod_sqrt[i + 1],
-                denoiser,
-                x,
-                cond,
-                uc,
-                idx=self.num_steps - i,
-                timestep=timesteps[-(i + 1)],
-            )
-
-        return x
-
-
-class VPSDEDPMPP2MSampler(VideoDDIMSampler):
-    def get_variables(
-        self,
-        alpha_cumprod_sqrt,
-        next_alpha_cumprod_sqrt,
-        previous_alpha_cumprod_sqrt=None,
-    ):
-        alpha_cumprod = alpha_cumprod_sqrt**2
-        lamb = ((alpha_cumprod / (1 - alpha_cumprod)) ** 0.5).log()
-        next_alpha_cumprod = next_alpha_cumprod_sqrt**2
-        lamb_next = ((next_alpha_cumprod / (1 - next_alpha_cumprod)) ** 0.5).log()
-        h = lamb_next - lamb
-
-        if previous_alpha_cumprod_sqrt is not None:
-            previous_alpha_cumprod = previous_alpha_cumprod_sqrt**2
-            lamb_previous = (
-                (previous_alpha_cumprod / (1 - previous_alpha_cumprod)) ** 0.5
-            ).log()
-            h_last = lamb - lamb_previous
-            r = h_last / h
-            return h, r, lamb, lamb_next
-        else:
-            return h, None, lamb, lamb_next
-
-    def get_mult(
-        self,
-        h,
-        r,
-        alpha_cumprod_sqrt,
-        next_alpha_cumprod_sqrt,
-        previous_alpha_cumprod_sqrt,
-    ):
-        mult1 = (
-            (1 - next_alpha_cumprod_sqrt**2) / (1 - alpha_cumprod_sqrt**2)
-        ) ** 0.5 * (-h).exp()
-        mult2 = (-2 * h).expm1() * next_alpha_cumprod_sqrt
-
-        if previous_alpha_cumprod_sqrt is not None:
-            mult3 = 1 + 1 / (2 * r)
-            mult4 = 1 / (2 * r)
-            return mult1, mult2, mult3, mult4
-        else:
-            return mult1, mult2
-
-    def sampler_step(
-        self,
-        old_denoised,
-        previous_alpha_cumprod_sqrt,
-        alpha_cumprod_sqrt,
-        next_alpha_cumprod_sqrt,
-        denoiser,
-        x,
-        cond,
-        uc=None,
-        idx=None,
-        timestep=None,
-        scale=None,
-        scale_emb=None,
-        ofs=None,  # 1020
-    ):
-        denoised = self.denoise(
-            x,
-            denoiser,
-            alpha_cumprod_sqrt,
-            cond,
-            uc,
-            timestep,
-            idx,
-            scale=scale,
-            scale_emb=scale_emb,
-            ofs=ofs,
-        ).to(
-            torch.float32
-        )  # 1020
-        if idx == 1:
-            return denoised, denoised
-
-        h, r, lamb, lamb_next = self.get_variables(
-            alpha_cumprod_sqrt, next_alpha_cumprod_sqrt, previous_alpha_cumprod_sqrt
-        )
-        mult = [
-            append_dims(mult, x.ndim)
-            for mult in self.get_mult(
-                h,
-                r,
-                alpha_cumprod_sqrt,
-                next_alpha_cumprod_sqrt,
-                previous_alpha_cumprod_sqrt,
-            )
-        ]
-        mult_noise = append_dims(
-            (1 - next_alpha_cumprod_sqrt**2) ** 0.5 * (1 - (-2 * h).exp()) ** 0.5,
-            x.ndim,
-        )
-
-        x_standard = mult[0] * x - mult[1] * denoised + mult_noise * torch.randn_like(x)
-        if old_denoised is None or torch.sum(next_alpha_cumprod_sqrt) < 1e-14:
-            # Save a network evaluation if all noise levels are 0 or on the first step
-            return x_standard, denoised
-        else:
-            denoised_d = mult[2] * denoised - mult[3] * old_denoised
-            x_advanced = (
-                mult[0] * x - mult[1] * denoised_d + mult_noise * torch.randn_like(x)
-            )
-
-            x = x_advanced
-
-        return x, denoised
-
-    def __call__(
-        self,
-        denoiser,
-        x,
-        cond,
-        uc=None,
-        num_steps=None,
-        scale=None,
-        scale_emb=None,
-        ofs=None,
-    ):  # 1020
-        x, s_in, alpha_cumprod_sqrt, num_sigmas, cond, uc, timesteps = (
-            self.prepare_sampling_loop(x, cond, uc, num_steps)
-        )
-
-        if self.fixed_frames > 0:
-            prefix_frames = x[:, : self.fixed_frames]
-        old_denoised = None
-        for i in self.get_sigma_gen(num_sigmas):
-
-            if self.fixed_frames > 0:
-                if self.sdedit:
-                    rd = torch.randn_like(prefix_frames)
-                    noised_prefix_frames = alpha_cumprod_sqrt[
-                        i
-                    ] * prefix_frames + rd * append_dims(
-                        s_in * (1 - alpha_cumprod_sqrt[i] ** 2) ** 0.5,
-                        len(prefix_frames.shape),
-                    )
-                    x = torch.cat(
-                        [noised_prefix_frames, x[:, self.fixed_frames :]], dim=1
-                    )
-                else:
-                    x = torch.cat([prefix_frames, x[:, self.fixed_frames :]], dim=1)
-            x, old_denoised = self.sampler_step(
-                old_denoised,
-                None if i == 0 else s_in * alpha_cumprod_sqrt[i - 1],
-                s_in * alpha_cumprod_sqrt[i],
-                s_in * alpha_cumprod_sqrt[i + 1],
-                denoiser,
-                x,
-                cond,
-                uc=uc,
-                idx=self.num_steps - i,
-                timestep=timesteps[-(i + 1)],
-                scale=scale,
-                scale_emb=scale_emb,
-                ofs=ofs,  # 1020
-            )
-
-        if self.fixed_frames > 0:
-            x = torch.cat([prefix_frames, x[:, self.fixed_frames :]], dim=1)
-
-        return x
-
-
-class VPODEDPMPP2MSampler(VideoDDIMSampler):
-    def get_variables(
-        self,
-        alpha_cumprod_sqrt,
-        next_alpha_cumprod_sqrt,
-        previous_alpha_cumprod_sqrt=None,
-    ):
-        alpha_cumprod = alpha_cumprod_sqrt**2
-        lamb = ((alpha_cumprod / (1 - alpha_cumprod)) ** 0.5).log()
-        next_alpha_cumprod = next_alpha_cumprod_sqrt**2
-        lamb_next = ((next_alpha_cumprod / (1 - next_alpha_cumprod)) ** 0.5).log()
-        h = lamb_next - lamb
-
-        if previous_alpha_cumprod_sqrt is not None:
-            previous_alpha_cumprod = previous_alpha_cumprod_sqrt**2
-            lamb_previous = (
-                (previous_alpha_cumprod / (1 - previous_alpha_cumprod)) ** 0.5
-            ).log()
-            h_last = lamb - lamb_previous
-            r = h_last / h
-            return h, r, lamb, lamb_next
-        else:
-            return h, None, lamb, lamb_next
-
-    def get_mult(
-        self,
-        h,
-        r,
-        alpha_cumprod_sqrt,
-        next_alpha_cumprod_sqrt,
-        previous_alpha_cumprod_sqrt,
-    ):
-        mult1 = ((1 - next_alpha_cumprod_sqrt**2) / (1 - alpha_cumprod_sqrt**2)) ** 0.5
-        mult2 = (-h).expm1() * next_alpha_cumprod_sqrt
-
-        if previous_alpha_cumprod_sqrt is not None:
-            mult3 = 1 + 1 / (2 * r)
-            mult4 = 1 / (2 * r)
-            return mult1, mult2, mult3, mult4
-        else:
-            return mult1, mult2
-
-    def sampler_step(
-        self,
-        old_denoised,
-        previous_alpha_cumprod_sqrt,
-        alpha_cumprod_sqrt,
-        next_alpha_cumprod_sqrt,
-        denoiser,
-        x,
-        cond,
-        uc=None,
-        idx=None,
-        timestep=None,
-    ):
-        denoised = self.denoise(
-            x, denoiser, alpha_cumprod_sqrt, cond, uc, timestep, idx
-        ).to(torch.float32)
-        if idx == 1:
-            return denoised, denoised
-
-        h, r, lamb, lamb_next = self.get_variables(
-            alpha_cumprod_sqrt, next_alpha_cumprod_sqrt, previous_alpha_cumprod_sqrt
-        )
-        mult = [
-            append_dims(mult, x.ndim)
-            for mult in self.get_mult(
-                h,
-                r,
-                alpha_cumprod_sqrt,
-                next_alpha_cumprod_sqrt,
-                previous_alpha_cumprod_sqrt,
-            )
-        ]
-
-        x_standard = mult[0] * x - mult[1] * denoised
-        if old_denoised is None or torch.sum(next_alpha_cumprod_sqrt) < 1e-14:
-            # Save a network evaluation if all noise levels are 0 or on the first step
-            return x_standard, denoised
-        else:
-            denoised_d = mult[2] * denoised - mult[3] * old_denoised
-            x_advanced = mult[0] * x - mult[1] * denoised_d
-
-            x = x_advanced
-
-        return x, denoised
-
-    def __call__(
-        self, denoiser, x, cond, uc=None, num_steps=None, scale=None, **kwargs
-    ):
-        x, s_in, alpha_cumprod_sqrt, num_sigmas, cond, uc, timesteps = (
-            self.prepare_sampling_loop(x, cond, uc, num_steps)
-        )
-
-        old_denoised = None
-        for i in self.get_sigma_gen(num_sigmas):
-            x, old_denoised = self.sampler_step(
-                old_denoised,
-                None if i == 0 else s_in * alpha_cumprod_sqrt[i - 1],
-                s_in * alpha_cumprod_sqrt[i],
-                s_in * alpha_cumprod_sqrt[i + 1],
-                denoiser,
-                x,
-                cond,
-                uc=uc,
-                idx=self.num_steps - i,
-                timestep=timesteps[-(i + 1)],
-            )
-
-        return x
-
-
-class VideoDDPMSampler(VideoDDIMSampler):
-    def sampler_step(
-        self,
-        alpha_cumprod_sqrt,
-        next_alpha_cumprod_sqrt,
-        denoiser,
-        x,
-        cond,
-        uc=None,
-        idx=None,
-    ):
-        # 此处的sigma实际上是alpha_cumprod_sqrt
-        denoised = self.denoise(
-            x, denoiser, alpha_cumprod_sqrt, cond, uc, idx * 1000 // self.num_steps
-        ).to(torch.float32)
-        if idx == 1:
-            return denoised
-
-        alpha_sqrt = alpha_cumprod_sqrt / next_alpha_cumprod_sqrt
-        x = (
-            append_dims(
-                alpha_sqrt
-                * (1 - next_alpha_cumprod_sqrt**2)
-                / (1 - alpha_cumprod_sqrt**2),
-                x.ndim,
-            )
-            * x
-            + append_dims(
-                next_alpha_cumprod_sqrt
-                * (1 - alpha_sqrt**2)
-                / (1 - alpha_cumprod_sqrt**2),
-                x.ndim,
-            )
-            * denoised
-            + append_dims(
-                (
-                    (1 - next_alpha_cumprod_sqrt**2)
-                    * (1 - alpha_sqrt**2)
-                    / (1 - alpha_cumprod_sqrt**2)
-                )
-                ** 0.5,
-                x.ndim,
-            )
-            * torch.randn_like(x)
-        )
-
-        return x
-
-    def __call__(self, denoiser, x, cond, uc=None, num_steps=None):
-        x, s_in, alpha_cumprod_sqrt, num_sigmas, cond, uc = self.prepare_sampling_loop(
-            x, cond, uc, num_steps
-        )
-
-        for i in self.get_sigma_gen(num_sigmas):
-            x = self.sampler_step(
-                s_in * alpha_cumprod_sqrt[i],
-                s_in * alpha_cumprod_sqrt[i + 1],
-                denoiser,
-                x,
-                cond,
-                uc,
-                idx=self.num_steps - i,
-            )
-
-        return x
diff --git a/videotuna/models/cogvideo_sat/sgm/modules/diffusionmodules/sampling_utils.py b/videotuna/models/cogvideo_sat/sgm/modules/diffusionmodules/sampling_utils.py
deleted file mode 100644
index 4c26a75e..00000000
--- a/videotuna/models/cogvideo_sat/sgm/modules/diffusionmodules/sampling_utils.py
+++ /dev/null
@@ -1,157 +0,0 @@
-import torch
-from einops import rearrange
-from scipy import integrate
-
-from ...util import append_dims
-
-
-class NoDynamicThresholding:
-    def __call__(self, uncond, cond, scale):
-        scale = (
-            append_dims(scale, cond.ndim) if isinstance(scale, torch.Tensor) else scale
-        )
-        return uncond + scale * (cond - uncond)
-
-
-class StaticThresholding:
-    def __call__(self, uncond, cond, scale):
-        result = uncond + scale * (cond - uncond)
-        result = torch.clamp(result, min=-1.0, max=1.0)
-        return result
-
-
-def dynamic_threshold(x, p=0.95):
-    N, T, C, H, W = x.shape
-    x = rearrange(x, "n t c h w -> n c (t h w)")
-    l, r = x.quantile(q=torch.tensor([1 - p, p], device=x.device), dim=-1, keepdim=True)
-    s = torch.maximum(-l, r)
-    threshold_mask = (s > 1).expand(-1, -1, H * W * T)
-    if threshold_mask.any():
-        x = torch.where(threshold_mask, x.clamp(min=-1 * s, max=s), x)
-    x = rearrange(x, "n c (t h w) -> n t c h w", t=T, h=H, w=W)
-    return x
-
-
-def dynamic_thresholding2(x0):
-    p = 0.995  # A hyperparameter in the paper of "Imagen" [1].
-    origin_dtype = x0.dtype
-    x0 = x0.to(torch.float32)
-    s = torch.quantile(torch.abs(x0).reshape((x0.shape[0], -1)), p, dim=1)
-    s = append_dims(torch.maximum(s, torch.ones_like(s).to(s.device)), x0.dim())
-    x0 = torch.clamp(x0, -s, s)  # / s
-    return x0.to(origin_dtype)
-
-
-def latent_dynamic_thresholding(x0):
-    p = 0.9995
-    origin_dtype = x0.dtype
-    x0 = x0.to(torch.float32)
-    s = torch.quantile(torch.abs(x0), p, dim=2)
-    s = append_dims(s, x0.dim())
-    x0 = torch.clamp(x0, -s, s) / s
-    return x0.to(origin_dtype)
-
-
-def dynamic_thresholding3(x0):
-    p = 0.995  # A hyperparameter in the paper of "Imagen" [1].
-    origin_dtype = x0.dtype
-    x0 = x0.to(torch.float32)
-    s = torch.quantile(torch.abs(x0).reshape((x0.shape[0], -1)), p, dim=1)
-    s = append_dims(torch.maximum(s, torch.ones_like(s).to(s.device)), x0.dim())
-    x0 = torch.clamp(x0, -s, s)  # / s
-    return x0.to(origin_dtype)
-
-
-class DynamicThresholding:
-    def __call__(self, uncond, cond, scale):
-        mean = uncond.mean()
-        std = uncond.std()
-        result = uncond + scale * (cond - uncond)
-        result_mean, result_std = result.mean(), result.std()
-        result = (result - result_mean) / result_std * std
-        # result = dynamic_thresholding3(result)
-        return result
-
-
-class DynamicThresholdingV1:
-    def __init__(self, scale_factor):
-        self.scale_factor = scale_factor
-
-    def __call__(self, uncond, cond, scale):
-        result = uncond + scale * (cond - uncond)
-        unscaled_result = result / self.scale_factor
-        B, T, C, H, W = unscaled_result.shape
-        flattened = rearrange(unscaled_result, "b t c h w -> b c (t h w)")
-        means = flattened.mean(dim=2).unsqueeze(2)
-        recentered = flattened - means
-        magnitudes = recentered.abs().max()
-        normalized = recentered / magnitudes
-        thresholded = latent_dynamic_thresholding(normalized)
-        denormalized = thresholded * magnitudes
-        uncentered = denormalized + means
-        unflattened = rearrange(uncentered, "b c (t h w) -> b t c h w", t=T, h=H, w=W)
-        scaled_result = unflattened * self.scale_factor
-        return scaled_result
-
-
-class DynamicThresholdingV2:
-    def __call__(self, uncond, cond, scale):
-        B, T, C, H, W = uncond.shape
-        diff = cond - uncond
-        mim_target = uncond + diff * 4.0
-        cfg_target = uncond + diff * 8.0
-
-        mim_flattened = rearrange(mim_target, "b t c h w -> b c (t h w)")
-        cfg_flattened = rearrange(cfg_target, "b t c h w -> b c (t h w)")
-        mim_means = mim_flattened.mean(dim=2).unsqueeze(2)
-        cfg_means = cfg_flattened.mean(dim=2).unsqueeze(2)
-        mim_centered = mim_flattened - mim_means
-        cfg_centered = cfg_flattened - cfg_means
-
-        mim_scaleref = mim_centered.std(dim=2).unsqueeze(2)
-        cfg_scaleref = cfg_centered.std(dim=2).unsqueeze(2)
-
-        cfg_renormalized = cfg_centered / cfg_scaleref * mim_scaleref
-
-        result = cfg_renormalized + cfg_means
-        unflattened = rearrange(result, "b c (t h w) -> b t c h w", t=T, h=H, w=W)
-
-        return unflattened
-
-
-def linear_multistep_coeff(order, t, i, j, epsrel=1e-4):
-    if order - 1 > i:
-        raise ValueError(f"Order {order} too high for step {i}")
-
-    def fn(tau):
-        prod = 1.0
-        for k in range(order):
-            if j == k:
-                continue
-            prod *= (tau - t[i - k]) / (t[i - j] - t[i - k])
-        return prod
-
-    return integrate.quad(fn, t[i], t[i + 1], epsrel=epsrel)[0]
-
-
-def get_ancestral_step(sigma_from, sigma_to, eta=1.0):
-    if not eta:
-        return sigma_to, 0.0
-    sigma_up = torch.minimum(
-        sigma_to,
-        eta * (sigma_to**2 * (sigma_from**2 - sigma_to**2) / sigma_from**2) ** 0.5,
-    )
-    sigma_down = (sigma_to**2 - sigma_up**2) ** 0.5
-    return sigma_down, sigma_up
-
-
-def to_d(x, sigma, denoised):
-    return (x - denoised) / append_dims(sigma, x.ndim)
-
-
-def to_neg_log_sigma(sigma):
-    return sigma.log().neg()
-
-
-def to_sigma(neg_log_sigma):
-    return neg_log_sigma.neg().exp()
diff --git a/videotuna/models/cogvideo_sat/sgm/modules/diffusionmodules/sigma_sampling.py b/videotuna/models/cogvideo_sat/sgm/modules/diffusionmodules/sigma_sampling.py
deleted file mode 100644
index 5af67b15..00000000
--- a/videotuna/models/cogvideo_sat/sgm/modules/diffusionmodules/sigma_sampling.py
+++ /dev/null
@@ -1,95 +0,0 @@
-import torch
-import torch.distributed
-from sat import mpu
-
-from ...util import default, instantiate_from_config
-
-
-class EDMSampling:
-    def __init__(self, p_mean=-1.2, p_std=1.2):
-        self.p_mean = p_mean
-        self.p_std = p_std
-
-    def __call__(self, n_samples, rand=None):
-        log_sigma = self.p_mean + self.p_std * default(rand, torch.randn((n_samples,)))
-        return log_sigma.exp()
-
-
-class DiscreteSampling:
-    def __init__(
-        self,
-        discretization_config,
-        num_idx,
-        do_append_zero=False,
-        flip=True,
-        uniform_sampling=False,
-        group_num=0,
-    ):
-        self.num_idx = num_idx
-        self.sigmas = instantiate_from_config(discretization_config)(
-            num_idx, do_append_zero=do_append_zero, flip=flip
-        )
-        world_size = mpu.get_data_parallel_world_size()
-        if world_size <= 8:
-            uniform_sampling = False
-        self.uniform_sampling = uniform_sampling
-        self.group_num = group_num
-        if self.uniform_sampling:
-            assert self.group_num > 0
-            assert world_size % group_num == 0
-            self.group_width = (
-                world_size // group_num
-            )  # the number of rank in one group
-            self.sigma_interval = self.num_idx // self.group_num
-
-    def idx_to_sigma(self, idx):
-        return self.sigmas[idx]
-
-    def __call__(self, n_samples, rand=None, return_idx=False):
-        if self.uniform_sampling:
-            rank = mpu.get_data_parallel_rank()
-            group_index = rank // self.group_width
-            idx = default(
-                rand,
-                torch.randint(
-                    group_index * self.sigma_interval,
-                    (group_index + 1) * self.sigma_interval,
-                    (n_samples,),
-                ),
-            )
-        else:
-            idx = default(
-                rand,
-                torch.randint(0, self.num_idx, (n_samples,)),
-            )
-        if return_idx:
-            return self.idx_to_sigma(idx), idx
-        else:
-            return self.idx_to_sigma(idx)
-
-
-class PartialDiscreteSampling:
-    def __init__(
-        self,
-        discretization_config,
-        total_num_idx,
-        partial_num_idx,
-        do_append_zero=False,
-        flip=True,
-    ):
-        self.total_num_idx = total_num_idx
-        self.partial_num_idx = partial_num_idx
-        self.sigmas = instantiate_from_config(discretization_config)(
-            total_num_idx, do_append_zero=do_append_zero, flip=flip
-        )
-
-    def idx_to_sigma(self, idx):
-        return self.sigmas[idx]
-
-    def __call__(self, n_samples, rand=None):
-        idx = default(
-            rand,
-            # torch.randint(self.total_num_idx-self.partial_num_idx, self.total_num_idx, (n_samples,)),
-            torch.randint(0, self.partial_num_idx, (n_samples,)),
-        )
-        return self.idx_to_sigma(idx)
diff --git a/videotuna/models/cogvideo_sat/sgm/modules/diffusionmodules/util.py b/videotuna/models/cogvideo_sat/sgm/modules/diffusionmodules/util.py
deleted file mode 100644
index fc671ed1..00000000
--- a/videotuna/models/cogvideo_sat/sgm/modules/diffusionmodules/util.py
+++ /dev/null
@@ -1,371 +0,0 @@
-"""
-adopted from
-https://github.com/openai/improved-diffusion/blob/main/improved_diffusion/gaussian_diffusion.py
-and
-https://github.com/lucidrains/denoising-diffusion-pytorch/blob/7706bdfc6f527f58d33f84b7b522e61e6e3164b3/denoising_diffusion_pytorch/denoising_diffusion_pytorch.py
-and
-https://github.com/openai/guided-diffusion/blob/0ba878e517b276c45d1195eb29f6f5f72659a05b/guided_diffusion/nn.py
-
-thanks!
-"""
-
-import math
-from typing import Optional
-
-import torch
-import torch.nn as nn
-from einops import rearrange, repeat
-
-
-def make_beta_schedule(
-    schedule,
-    n_timestep,
-    linear_start=1e-4,
-    linear_end=2e-2,
-):
-    if schedule == "linear":
-        betas = (
-            torch.linspace(
-                linear_start**0.5, linear_end**0.5, n_timestep, dtype=torch.float64
-            )
-            ** 2
-        )
-    return betas.numpy()
-
-
-def extract_into_tensor(a, t, x_shape):
-    b, *_ = t.shape
-    out = a.gather(-1, t)
-    return out.reshape(b, *((1,) * (len(x_shape) - 1)))
-
-
-def mixed_checkpoint(func, inputs: dict, params, flag):
-    """
-    Evaluate a function without caching intermediate activations, allowing for
-    reduced memory at the expense of extra compute in the backward pass. This differs from the original checkpoint function
-    borrowed from https://github.com/openai/guided-diffusion/blob/0ba878e517b276c45d1195eb29f6f5f72659a05b/guided_diffusion/nn.py in that
-    it also works with non-tensor inputs
-    :param func: the function to evaluate.
-    :param inputs: the argument dictionary to pass to `func`.
-    :param params: a sequence of parameters `func` depends on but does not
-                   explicitly take as arguments.
-    :param flag: if False, disable gradient checkpointing.
-    """
-    if flag:
-        tensor_keys = [key for key in inputs if isinstance(inputs[key], torch.Tensor)]
-        tensor_inputs = [
-            inputs[key] for key in inputs if isinstance(inputs[key], torch.Tensor)
-        ]
-        non_tensor_keys = [
-            key for key in inputs if not isinstance(inputs[key], torch.Tensor)
-        ]
-        non_tensor_inputs = [
-            inputs[key] for key in inputs if not isinstance(inputs[key], torch.Tensor)
-        ]
-        args = tuple(tensor_inputs) + tuple(non_tensor_inputs) + tuple(params)
-        return MixedCheckpointFunction.apply(
-            func,
-            len(tensor_inputs),
-            len(non_tensor_inputs),
-            tensor_keys,
-            non_tensor_keys,
-            *args,
-        )
-    else:
-        return func(**inputs)
-
-
-class MixedCheckpointFunction(torch.autograd.Function):
-    @staticmethod
-    def forward(
-        ctx,
-        run_function,
-        length_tensors,
-        length_non_tensors,
-        tensor_keys,
-        non_tensor_keys,
-        *args,
-    ):
-        ctx.end_tensors = length_tensors
-        ctx.end_non_tensors = length_tensors + length_non_tensors
-        ctx.gpu_autocast_kwargs = {
-            "enabled": torch.is_autocast_enabled(),
-            "dtype": torch.get_autocast_gpu_dtype(),
-            "cache_enabled": torch.is_autocast_cache_enabled(),
-        }
-        assert (
-            len(tensor_keys) == length_tensors
-            and len(non_tensor_keys) == length_non_tensors
-        )
-
-        ctx.input_tensors = {
-            key: val for (key, val) in zip(tensor_keys, list(args[: ctx.end_tensors]))
-        }
-        ctx.input_non_tensors = {
-            key: val
-            for (key, val) in zip(
-                non_tensor_keys, list(args[ctx.end_tensors : ctx.end_non_tensors])
-            )
-        }
-        ctx.run_function = run_function
-        ctx.input_params = list(args[ctx.end_non_tensors :])
-
-        with torch.no_grad():
-            output_tensors = ctx.run_function(
-                **ctx.input_tensors, **ctx.input_non_tensors
-            )
-        return output_tensors
-
-    @staticmethod
-    def backward(ctx, *output_grads):
-        # additional_args = {key: ctx.input_tensors[key] for key in ctx.input_tensors if not isinstance(ctx.input_tensors[key],torch.Tensor)}
-        ctx.input_tensors = {
-            key: ctx.input_tensors[key].detach().requires_grad_(True)
-            for key in ctx.input_tensors
-        }
-
-        with torch.enable_grad(), torch.cuda.amp.autocast(**ctx.gpu_autocast_kwargs):
-            # Fixes a bug where the first op in run_function modifies the
-            # Tensor storage in place, which is not allowed for detach()'d
-            # Tensors.
-            shallow_copies = {
-                key: ctx.input_tensors[key].view_as(ctx.input_tensors[key])
-                for key in ctx.input_tensors
-            }
-            # shallow_copies.update(additional_args)
-            output_tensors = ctx.run_function(**shallow_copies, **ctx.input_non_tensors)
-        input_grads = torch.autograd.grad(
-            output_tensors,
-            list(ctx.input_tensors.values()) + ctx.input_params,
-            output_grads,
-            allow_unused=True,
-        )
-        del ctx.input_tensors
-        del ctx.input_params
-        del output_tensors
-        return (
-            (None, None, None, None, None)
-            + input_grads[: ctx.end_tensors]
-            + (None,) * (ctx.end_non_tensors - ctx.end_tensors)
-            + input_grads[ctx.end_tensors :]
-        )
-
-
-def checkpoint(func, inputs, params, flag):
-    """
-    Evaluate a function without caching intermediate activations, allowing for
-    reduced memory at the expense of extra compute in the backward pass.
-    :param func: the function to evaluate.
-    :param inputs: the argument sequence to pass to `func`.
-    :param params: a sequence of parameters `func` depends on but does not
-                   explicitly take as arguments.
-    :param flag: if False, disable gradient checkpointing.
-    """
-    if flag:
-        args = tuple(inputs) + tuple(params)
-        return CheckpointFunction.apply(func, len(inputs), *args)
-    else:
-        return func(*inputs)
-
-
-class CheckpointFunction(torch.autograd.Function):
-    @staticmethod
-    def forward(ctx, run_function, length, *args):
-        ctx.run_function = run_function
-        ctx.input_tensors = list(args[:length])
-        ctx.input_params = list(args[length:])
-        ctx.gpu_autocast_kwargs = {
-            "enabled": torch.is_autocast_enabled(),
-            "dtype": torch.get_autocast_gpu_dtype(),
-            "cache_enabled": torch.is_autocast_cache_enabled(),
-        }
-        with torch.no_grad():
-            output_tensors = ctx.run_function(*ctx.input_tensors)
-        return output_tensors
-
-    @staticmethod
-    def backward(ctx, *output_grads):
-        ctx.input_tensors = [x.detach().requires_grad_(True) for x in ctx.input_tensors]
-        with torch.enable_grad(), torch.cuda.amp.autocast(**ctx.gpu_autocast_kwargs):
-            # Fixes a bug where the first op in run_function modifies the
-            # Tensor storage in place, which is not allowed for detach()'d
-            # Tensors.
-            shallow_copies = [x.view_as(x) for x in ctx.input_tensors]
-            output_tensors = ctx.run_function(*shallow_copies)
-        input_grads = torch.autograd.grad(
-            output_tensors,
-            ctx.input_tensors + ctx.input_params,
-            output_grads,
-            allow_unused=True,
-        )
-        del ctx.input_tensors
-        del ctx.input_params
-        del output_tensors
-        return (None, None) + input_grads
-
-
-def timestep_embedding(
-    timesteps, dim, max_period=10000, repeat_only=False, dtype=torch.float32
-):
-    """
-    Create sinusoidal timestep embeddings.
-    :param timesteps: a 1-D Tensor of N indices, one per batch element.
-                      These may be fractional.
-    :param dim: the dimension of the output.
-    :param max_period: controls the minimum frequency of the embeddings.
-    :return: an [N x dim] Tensor of positional embeddings.
-    """
-    if not repeat_only:
-        half = dim // 2
-        freqs = torch.exp(
-            -math.log(max_period)
-            * torch.arange(start=0, end=half, dtype=torch.float32)
-            / half
-        ).to(device=timesteps.device)
-        args = timesteps[:, None].float() * freqs[None]
-        embedding = torch.cat([torch.cos(args), torch.sin(args)], dim=-1)
-        if dim % 2:
-            embedding = torch.cat(
-                [embedding, torch.zeros_like(embedding[:, :1])], dim=-1
-            )
-    else:
-        embedding = repeat(timesteps, "b -> b d", d=dim)
-    return embedding.to(dtype)
-
-
-def zero_module(module):
-    """
-    Zero out the parameters of a module and return it.
-    """
-    for p in module.parameters():
-        p.detach().zero_()
-    return module
-
-
-def scale_module(module, scale):
-    """
-    Scale the parameters of a module and return it.
-    """
-    for p in module.parameters():
-        p.detach().mul_(scale)
-    return module
-
-
-def mean_flat(tensor):
-    """
-    Take the mean over all non-batch dimensions.
-    """
-    return tensor.mean(dim=list(range(1, len(tensor.shape))))
-
-
-def normalization(channels):
-    """
-    Make a standard normalization layer.
-    :param channels: number of input channels.
-    :return: an nn.Module for normalization.
-    """
-    return GroupNorm32(32, channels)
-
-
-# PyTorch 1.7 has SiLU, but we support PyTorch 1.5.
-class SiLU(nn.Module):
-    def forward(self, x):
-        return x * torch.sigmoid(x)
-
-
-class GroupNorm32(nn.GroupNorm):
-    def forward(self, x):
-        return super().forward(x).type(x.dtype)
-
-
-def conv_nd(dims, *args, **kwargs):
-    """
-    Create a 1D, 2D, or 3D convolution module.
-    """
-    if dims == 1:
-        return nn.Conv1d(*args, **kwargs)
-    elif dims == 2:
-        return nn.Conv2d(*args, **kwargs)
-    elif dims == 3:
-        return nn.Conv3d(*args, **kwargs)
-    raise ValueError(f"unsupported dimensions: {dims}")
-
-
-def linear(*args, **kwargs):
-    """
-    Create a linear module.
-    """
-    return nn.Linear(*args, **kwargs)
-
-
-def avg_pool_nd(dims, *args, **kwargs):
-    """
-    Create a 1D, 2D, or 3D average pooling module.
-    """
-    if dims == 1:
-        return nn.AvgPool1d(*args, **kwargs)
-    elif dims == 2:
-        return nn.AvgPool2d(*args, **kwargs)
-    elif dims == 3:
-        return nn.AvgPool3d(*args, **kwargs)
-    raise ValueError(f"unsupported dimensions: {dims}")
-
-
-class AlphaBlender(nn.Module):
-    strategies = ["learned", "fixed", "learned_with_images"]
-
-    def __init__(
-        self,
-        alpha: float,
-        merge_strategy: str = "learned_with_images",
-        rearrange_pattern: str = "b t -> (b t) 1 1",
-    ):
-        super().__init__()
-        self.merge_strategy = merge_strategy
-        self.rearrange_pattern = rearrange_pattern
-
-        assert (
-            merge_strategy in self.strategies
-        ), f"merge_strategy needs to be in {self.strategies}"
-
-        if self.merge_strategy == "fixed":
-            self.register_buffer("mix_factor", torch.Tensor([alpha]))
-        elif (
-            self.merge_strategy == "learned"
-            or self.merge_strategy == "learned_with_images"
-        ):
-            self.register_parameter(
-                "mix_factor", torch.nn.Parameter(torch.Tensor([alpha]))
-            )
-        else:
-            raise ValueError(f"unknown merge strategy {self.merge_strategy}")
-
-    def get_alpha(self, image_only_indicator: torch.Tensor) -> torch.Tensor:
-        if self.merge_strategy == "fixed":
-            alpha = self.mix_factor
-        elif self.merge_strategy == "learned":
-            alpha = torch.sigmoid(self.mix_factor)
-        elif self.merge_strategy == "learned_with_images":
-            assert image_only_indicator is not None, "need image_only_indicator ..."
-            alpha = torch.where(
-                image_only_indicator.bool(),
-                torch.ones(1, 1, device=image_only_indicator.device),
-                rearrange(torch.sigmoid(self.mix_factor), "... -> ... 1"),
-            )
-            alpha = rearrange(alpha, self.rearrange_pattern)
-        else:
-            raise NotImplementedError
-        return alpha
-
-    def forward(
-        self,
-        x_spatial: torch.Tensor,
-        x_temporal: torch.Tensor,
-        image_only_indicator: Optional[torch.Tensor] = None,
-    ) -> torch.Tensor:
-        alpha = self.get_alpha(image_only_indicator)
-        x = (
-            alpha.to(x_spatial.dtype) * x_spatial
-            + (1.0 - alpha).to(x_spatial.dtype) * x_temporal
-        )
-        return x
diff --git a/videotuna/models/cogvideo_sat/sgm/modules/diffusionmodules/wrappers.py b/videotuna/models/cogvideo_sat/sgm/modules/diffusionmodules/wrappers.py
deleted file mode 100644
index bfceb421..00000000
--- a/videotuna/models/cogvideo_sat/sgm/modules/diffusionmodules/wrappers.py
+++ /dev/null
@@ -1,49 +0,0 @@
-import torch
-import torch.nn as nn
-from packaging import version
-
-OPENAIUNETWRAPPER = "sgm.modules.diffusionmodules.wrappers.OpenAIWrapper"
-
-
-class IdentityWrapper(nn.Module):
-    def __init__(
-        self,
-        diffusion_model,
-        compile_model: bool = False,
-        dtype: torch.dtype = torch.float32,
-    ):
-        super().__init__()
-        compile = (
-            torch.compile
-            if (version.parse(torch.__version__) >= version.parse("2.0.0"))
-            and compile_model
-            else lambda x: x
-        )
-        self.diffusion_model = compile(diffusion_model)
-        self.dtype = dtype
-
-    def forward(self, *args, **kwargs):
-        return self.diffusion_model(*args, **kwargs)
-
-
-class OpenAIWrapper(IdentityWrapper):
-    def forward(
-        self, x: torch.Tensor, t: torch.Tensor, c: dict, **kwargs
-    ) -> torch.Tensor:
-        for key in c:
-            c[key] = c[key].to(self.dtype)
-
-        if x.dim() == 4:
-            x = torch.cat((x, c.get("concat", torch.Tensor([]).type_as(x))), dim=1)
-        elif x.dim() == 5:
-            x = torch.cat((x, c.get("concat", torch.Tensor([]).type_as(x))), dim=2)
-        else:
-            raise ValueError("Input tensor must be 4D or 5D")
-
-        return self.diffusion_model(
-            x,
-            timesteps=t,
-            context=c.get("crossattn", None),
-            y=c.get("vector", None),
-            **kwargs,
-        )
diff --git a/videotuna/models/cogvideo_sat/sgm/modules/distributions/__init__.py b/videotuna/models/cogvideo_sat/sgm/modules/distributions/__init__.py
deleted file mode 100644
index e69de29b..00000000
diff --git a/videotuna/models/cogvideo_sat/sgm/modules/distributions/distributions.py b/videotuna/models/cogvideo_sat/sgm/modules/distributions/distributions.py
deleted file mode 100644
index 84884566..00000000
--- a/videotuna/models/cogvideo_sat/sgm/modules/distributions/distributions.py
+++ /dev/null
@@ -1,86 +0,0 @@
-import numpy as np
-import torch
-
-
-class DiagonalGaussianDistribution(object):
-    def __init__(self, parameters, deterministic=False):
-        self.parameters = parameters
-        self.mean, self.logvar = torch.chunk(parameters, 2, dim=1)
-        self.logvar = torch.clamp(self.logvar, -30.0, 20.0)
-        self.deterministic = deterministic
-        self.std = torch.exp(0.5 * self.logvar)
-        self.var = torch.exp(self.logvar)
-        if self.deterministic:
-            self.var = self.std = torch.zeros_like(self.mean).to(
-                device=self.parameters.device
-            )
-
-    def sample(self):
-        # x = self.mean + self.std * torch.randn(self.mean.shape).to(
-        #     device=self.parameters.device
-        # )
-        x = self.mean + self.std * torch.randn_like(self.mean).to(
-            device=self.parameters.device
-        )
-        return x
-
-    def kl(self, other=None):
-        if self.deterministic:
-            return torch.Tensor([0.0])
-        else:
-            if other is None:
-                return 0.5 * torch.sum(
-                    torch.pow(self.mean, 2) + self.var - 1.0 - self.logvar,
-                    dim=[1, 2, 3],
-                )
-            else:
-                return 0.5 * torch.sum(
-                    torch.pow(self.mean - other.mean, 2) / other.var
-                    + self.var / other.var
-                    - 1.0
-                    - self.logvar
-                    + other.logvar,
-                    dim=[1, 2, 3],
-                )
-
-    def nll(self, sample, dims=[1, 2, 3]):
-        if self.deterministic:
-            return torch.Tensor([0.0])
-        logtwopi = np.log(2.0 * np.pi)
-        return 0.5 * torch.sum(
-            logtwopi + self.logvar + torch.pow(sample - self.mean, 2) / self.var,
-            dim=dims,
-        )
-
-    def mode(self):
-        return self.mean
-
-
-def normal_kl(mean1, logvar1, mean2, logvar2):
-    """
-    source: https://github.com/openai/guided-diffusion/blob/27c20a8fab9cb472df5d6bdd6c8d11c8f430b924/guided_diffusion/losses.py#L12
-    Compute the KL divergence between two gaussians.
-    Shapes are automatically broadcasted, so batches can be compared to
-    scalars, among other use cases.
-    """
-    tensor = None
-    for obj in (mean1, logvar1, mean2, logvar2):
-        if isinstance(obj, torch.Tensor):
-            tensor = obj
-            break
-    assert tensor is not None, "at least one argument must be a Tensor"
-
-    # Force variances to be Tensors. Broadcasting helps convert scalars to
-    # Tensors, but it does not work for torch.exp().
-    logvar1, logvar2 = [
-        x if isinstance(x, torch.Tensor) else torch.tensor(x).to(tensor)
-        for x in (logvar1, logvar2)
-    ]
-
-    return 0.5 * (
-        -1.0
-        + logvar2
-        - logvar1
-        + torch.exp(logvar1 - logvar2)
-        + ((mean1 - mean2) ** 2) * torch.exp(-logvar2)
-    )
diff --git a/videotuna/models/cogvideo_sat/sgm/modules/ema.py b/videotuna/models/cogvideo_sat/sgm/modules/ema.py
deleted file mode 100644
index 96f64345..00000000
--- a/videotuna/models/cogvideo_sat/sgm/modules/ema.py
+++ /dev/null
@@ -1,88 +0,0 @@
-import torch
-from torch import nn
-
-
-class LitEma(nn.Module):
-    def __init__(self, model, decay=0.9999, use_num_upates=True):
-        super().__init__()
-        if decay < 0.0 or decay > 1.0:
-            raise ValueError("Decay must be between 0 and 1")
-
-        self.m_name2s_name = {}
-        self.register_buffer("decay", torch.tensor(decay, dtype=torch.float32))
-        self.register_buffer(
-            "num_updates",
-            (
-                torch.tensor(0, dtype=torch.int)
-                if use_num_upates
-                else torch.tensor(-1, dtype=torch.int)
-            ),
-        )
-
-        for name, p in model.named_parameters():
-            if p.requires_grad:
-                # remove as '.'-character is not allowed in buffers
-                s_name = name.replace(".", "")
-                self.m_name2s_name.update({name: s_name})
-                self.register_buffer(s_name, p.clone().detach().data)
-
-        self.collected_params = []
-
-    def reset_num_updates(self):
-        del self.num_updates
-        self.register_buffer("num_updates", torch.tensor(0, dtype=torch.int))
-
-    def forward(self, model):
-        decay = self.decay
-
-        if self.num_updates >= 0:
-            self.num_updates += 1
-            decay = min(self.decay, (1 + self.num_updates) / (10 + self.num_updates))
-
-        one_minus_decay = 1.0 - decay
-
-        with torch.no_grad():
-            m_param = dict(model.named_parameters())
-            shadow_params = dict(self.named_buffers())
-
-            for key in m_param:
-                if m_param[key].requires_grad:
-                    sname = self.m_name2s_name[key]
-                    shadow_params[sname] = shadow_params[sname].type_as(m_param[key])
-                    shadow_params[sname].sub_(
-                        one_minus_decay * (shadow_params[sname] - m_param[key])
-                    )
-                else:
-                    assert not key in self.m_name2s_name
-
-    def copy_to(self, model):
-        m_param = dict(model.named_parameters())
-        shadow_params = dict(self.named_buffers())
-        for key in m_param:
-            if m_param[key].requires_grad:
-                m_param[key].data.copy_(shadow_params[self.m_name2s_name[key]].data)
-            else:
-                assert not key in self.m_name2s_name
-
-    def store(self, parameters):
-        """
-        Save the current parameters for restoring later.
-        Args:
-          parameters: Iterable of `torch.nn.Parameter`; the parameters to be
-            temporarily stored.
-        """
-        self.collected_params = [param.clone() for param in parameters]
-
-    def restore(self, parameters):
-        """
-        Restore the parameters stored with the `store` method.
-        Useful to validate the model with EMA parameters without affecting the
-        original optimization process. Store the parameters before the
-        `copy_to` method. After validation (or model saving), use this to
-        restore the former parameters.
-        Args:
-          parameters: Iterable of `torch.nn.Parameter`; the parameters to be
-            updated with the stored parameters.
-        """
-        for c_param, param in zip(self.collected_params, parameters):
-            param.data.copy_(c_param.data)
diff --git a/videotuna/models/cogvideo_sat/sgm/modules/encoders/__init__.py b/videotuna/models/cogvideo_sat/sgm/modules/encoders/__init__.py
deleted file mode 100644
index e69de29b..00000000
diff --git a/videotuna/models/cogvideo_sat/sgm/modules/encoders/modules.py b/videotuna/models/cogvideo_sat/sgm/modules/encoders/modules.py
deleted file mode 100644
index bf90110d..00000000
--- a/videotuna/models/cogvideo_sat/sgm/modules/encoders/modules.py
+++ /dev/null
@@ -1,303 +0,0 @@
-import math
-from contextlib import nullcontext
-from functools import partial
-from typing import Dict, List, Optional, Tuple, Union
-
-import kornia
-import numpy as np
-import torch
-import torch.nn as nn
-from einops import rearrange, repeat
-from omegaconf import ListConfig
-from torch.utils.checkpoint import checkpoint
-from transformers import T5EncoderModel, T5Tokenizer
-
-from ...util import (
-    append_dims,
-    autocast,
-    count_params,
-    default,
-    disabled_train,
-    expand_dims_like,
-    instantiate_from_config,
-)
-
-
-class AbstractEmbModel(nn.Module):
-    def __init__(self):
-        super().__init__()
-        self._is_trainable = None
-        self._ucg_rate = None
-        self._input_key = None
-
-    @property
-    def is_trainable(self) -> bool:
-        return self._is_trainable
-
-    @property
-    def ucg_rate(self) -> Union[float, torch.Tensor]:
-        return self._ucg_rate
-
-    @property
-    def input_key(self) -> str:
-        return self._input_key
-
-    @is_trainable.setter
-    def is_trainable(self, value: bool):
-        self._is_trainable = value
-
-    @ucg_rate.setter
-    def ucg_rate(self, value: Union[float, torch.Tensor]):
-        self._ucg_rate = value
-
-    @input_key.setter
-    def input_key(self, value: str):
-        self._input_key = value
-
-    @is_trainable.deleter
-    def is_trainable(self):
-        del self._is_trainable
-
-    @ucg_rate.deleter
-    def ucg_rate(self):
-        del self._ucg_rate
-
-    @input_key.deleter
-    def input_key(self):
-        del self._input_key
-
-
-class GeneralConditioner(nn.Module):
-    OUTPUT_DIM2KEYS = {2: "vector", 3: "crossattn", 4: "concat", 5: "concat"}
-    KEY2CATDIM = {"vector": 1, "crossattn": 2, "concat": 1}
-
-    def __init__(self, emb_models: Union[List, ListConfig], cor_embs=[], cor_p=[]):
-        super().__init__()
-        embedders = []
-        for n, embconfig in enumerate(emb_models):
-            embedder = instantiate_from_config(embconfig)
-            assert isinstance(
-                embedder, AbstractEmbModel
-            ), f"embedder model {embedder.__class__.__name__} has to inherit from AbstractEmbModel"
-            embedder.is_trainable = embconfig.get("is_trainable", False)
-            embedder.ucg_rate = embconfig.get("ucg_rate", 0.0)
-            if not embedder.is_trainable:
-                embedder.train = disabled_train
-                for param in embedder.parameters():
-                    param.requires_grad = False
-                embedder.eval()
-            print(
-                f"Initialized embedder #{n}: {embedder.__class__.__name__} "
-                f"with {count_params(embedder, False)} params. Trainable: {embedder.is_trainable}"
-            )
-
-            if "input_key" in embconfig:
-                embedder.input_key = embconfig["input_key"]
-            elif "input_keys" in embconfig:
-                embedder.input_keys = embconfig["input_keys"]
-            else:
-                raise KeyError(
-                    f"need either 'input_key' or 'input_keys' for embedder {embedder.__class__.__name__}"
-                )
-
-            embedder.legacy_ucg_val = embconfig.get("legacy_ucg_value", None)
-            if embedder.legacy_ucg_val is not None:
-                embedder.ucg_prng = np.random.RandomState()
-
-            embedders.append(embedder)
-        self.embedders = nn.ModuleList(embedders)
-
-        if len(cor_embs) > 0:
-            assert len(cor_p) == 2 ** len(cor_embs)
-        self.cor_embs = cor_embs
-        self.cor_p = cor_p
-
-    def possibly_get_ucg_val(self, embedder: AbstractEmbModel, batch: Dict) -> Dict:
-        assert embedder.legacy_ucg_val is not None
-        p = embedder.ucg_rate
-        val = embedder.legacy_ucg_val
-        for i in range(len(batch[embedder.input_key])):
-            if embedder.ucg_prng.choice(2, p=[1 - p, p]):
-                batch[embedder.input_key][i] = val
-        return batch
-
-    def surely_get_ucg_val(
-        self, embedder: AbstractEmbModel, batch: Dict, cond_or_not
-    ) -> Dict:
-        assert embedder.legacy_ucg_val is not None
-        val = embedder.legacy_ucg_val
-        for i in range(len(batch[embedder.input_key])):
-            if cond_or_not[i]:
-                batch[embedder.input_key][i] = val
-        return batch
-
-    def get_single_embedding(
-        self,
-        embedder,
-        batch,
-        output,
-        cond_or_not: Optional[np.ndarray] = None,
-        force_zero_embeddings: Optional[List] = None,
-    ):
-        embedding_context = nullcontext if embedder.is_trainable else torch.no_grad
-        with embedding_context():
-            if hasattr(embedder, "input_key") and (embedder.input_key is not None):
-                if embedder.legacy_ucg_val is not None:
-                    if cond_or_not is None:
-                        batch = self.possibly_get_ucg_val(embedder, batch)
-                    else:
-                        batch = self.surely_get_ucg_val(embedder, batch, cond_or_not)
-                emb_out = embedder(batch[embedder.input_key])
-            elif hasattr(embedder, "input_keys"):
-                emb_out = embedder(*[batch[k] for k in embedder.input_keys])
-        assert isinstance(
-            emb_out, (torch.Tensor, list, tuple)
-        ), f"encoder outputs must be tensors or a sequence, but got {type(emb_out)}"
-        if not isinstance(emb_out, (list, tuple)):
-            emb_out = [emb_out]
-        for emb in emb_out:
-            out_key = self.OUTPUT_DIM2KEYS[emb.dim()]
-            if embedder.ucg_rate > 0.0 and embedder.legacy_ucg_val is None:
-                if cond_or_not is None:
-                    emb = (
-                        expand_dims_like(
-                            torch.bernoulli(
-                                (1.0 - embedder.ucg_rate)
-                                * torch.ones(emb.shape[0], device=emb.device)
-                            ),
-                            emb,
-                        )
-                        * emb
-                    )
-                else:
-                    emb = (
-                        expand_dims_like(
-                            torch.tensor(
-                                1 - cond_or_not, dtype=emb.dtype, device=emb.device
-                            ),
-                            emb,
-                        )
-                        * emb
-                    )
-            if (
-                hasattr(embedder, "input_key")
-                and embedder.input_key in force_zero_embeddings
-            ):
-                emb = torch.zeros_like(emb)
-            if out_key in output:
-                output[out_key] = torch.cat(
-                    (output[out_key], emb), self.KEY2CATDIM[out_key]
-                )
-            else:
-                output[out_key] = emb
-        return output
-
-    def forward(
-        self, batch: Dict, force_zero_embeddings: Optional[List] = None
-    ) -> Dict:
-        output = dict()
-        if force_zero_embeddings is None:
-            force_zero_embeddings = []
-
-        if len(self.cor_embs) > 0:
-            batch_size = len(batch[list(batch.keys())[0]])
-            rand_idx = np.random.choice(
-                len(self.cor_p), size=(batch_size,), p=self.cor_p
-            )
-            for emb_idx in self.cor_embs:
-                cond_or_not = rand_idx % 2
-                rand_idx //= 2
-                output = self.get_single_embedding(
-                    self.embedders[emb_idx],
-                    batch,
-                    output=output,
-                    cond_or_not=cond_or_not,
-                    force_zero_embeddings=force_zero_embeddings,
-                )
-
-        for i, embedder in enumerate(self.embedders):
-            if i in self.cor_embs:
-                continue
-            output = self.get_single_embedding(
-                embedder,
-                batch,
-                output=output,
-                force_zero_embeddings=force_zero_embeddings,
-            )
-        return output
-
-    def get_unconditional_conditioning(
-        self, batch_c, batch_uc=None, force_uc_zero_embeddings=None
-    ):
-        if force_uc_zero_embeddings is None:
-            force_uc_zero_embeddings = []
-        ucg_rates = list()
-        for embedder in self.embedders:
-            ucg_rates.append(embedder.ucg_rate)
-            embedder.ucg_rate = 0.0
-        cor_embs = self.cor_embs
-        cor_p = self.cor_p
-        self.cor_embs = []
-        self.cor_p = []
-
-        c = self(batch_c)
-        uc = self(batch_c if batch_uc is None else batch_uc, force_uc_zero_embeddings)
-
-        for embedder, rate in zip(self.embedders, ucg_rates):
-            embedder.ucg_rate = rate
-        self.cor_embs = cor_embs
-        self.cor_p = cor_p
-
-        return c, uc
-
-
-class FrozenT5Embedder(AbstractEmbModel):
-    """Uses the T5 transformer encoder for text"""
-
-    def __init__(
-        self,
-        model_dir="google/t5-v1_1-xxl",
-        device="cuda",
-        max_length=77,
-        freeze=True,
-        cache_dir=None,
-    ):
-        super().__init__()
-        if model_dir is not "google/t5-v1_1-xxl":
-            self.tokenizer = T5Tokenizer.from_pretrained(model_dir)
-            self.transformer = T5EncoderModel.from_pretrained(model_dir)
-        else:
-            self.tokenizer = T5Tokenizer.from_pretrained(model_dir, cache_dir=cache_dir)
-            self.transformer = T5EncoderModel.from_pretrained(
-                model_dir, cache_dir=cache_dir
-            )
-        self.device = device
-        self.max_length = max_length
-        if freeze:
-            self.freeze()
-
-    def freeze(self):
-        self.transformer = self.transformer.eval()
-
-        for param in self.parameters():
-            param.requires_grad = False
-
-    # @autocast
-    def forward(self, text):
-        batch_encoding = self.tokenizer(
-            text,
-            truncation=True,
-            max_length=self.max_length,
-            return_length=True,
-            return_overflowing_tokens=False,
-            padding="max_length",
-            return_tensors="pt",
-        )
-        tokens = batch_encoding["input_ids"].to(self.device)
-        with torch.autocast("cuda", enabled=False):
-            outputs = self.transformer(input_ids=tokens)
-        z = outputs.last_hidden_state
-        return z
-
-    def encode(self, text):
-        return self(text)
diff --git a/videotuna/models/cogvideo_sat/sgm/modules/video_attention.py b/videotuna/models/cogvideo_sat/sgm/modules/video_attention.py
deleted file mode 100644
index 756ae4bf..00000000
--- a/videotuna/models/cogvideo_sat/sgm/modules/video_attention.py
+++ /dev/null
@@ -1,307 +0,0 @@
-import torch
-
-from ..modules.attention import *
-from ..modules.diffusionmodules.util import AlphaBlender, linear, timestep_embedding
-
-
-class TimeMixSequential(nn.Sequential):
-    def forward(self, x, context=None, timesteps=None):
-        for layer in self:
-            x = layer(x, context, timesteps)
-
-        return x
-
-
-class VideoTransformerBlock(nn.Module):
-    ATTENTION_MODES = {
-        "softmax": CrossAttention,
-        "softmax-xformers": MemoryEfficientCrossAttention,
-    }
-
-    def __init__(
-        self,
-        dim,
-        n_heads,
-        d_head,
-        dropout=0.0,
-        context_dim=None,
-        gated_ff=True,
-        checkpoint=True,
-        timesteps=None,
-        ff_in=False,
-        inner_dim=None,
-        attn_mode="softmax",
-        disable_self_attn=False,
-        disable_temporal_crossattention=False,
-        switch_temporal_ca_to_sa=False,
-    ):
-        super().__init__()
-
-        attn_cls = self.ATTENTION_MODES[attn_mode]
-
-        self.ff_in = ff_in or inner_dim is not None
-        if inner_dim is None:
-            inner_dim = dim
-
-        assert int(n_heads * d_head) == inner_dim
-
-        self.is_res = inner_dim == dim
-
-        if self.ff_in:
-            self.norm_in = nn.LayerNorm(dim)
-            self.ff_in = FeedForward(
-                dim, dim_out=inner_dim, dropout=dropout, glu=gated_ff
-            )
-
-        self.timesteps = timesteps
-        self.disable_self_attn = disable_self_attn
-        if self.disable_self_attn:
-            self.attn1 = attn_cls(
-                query_dim=inner_dim,
-                heads=n_heads,
-                dim_head=d_head,
-                context_dim=context_dim,
-                dropout=dropout,
-            )  # is a cross-attention
-        else:
-            self.attn1 = attn_cls(
-                query_dim=inner_dim, heads=n_heads, dim_head=d_head, dropout=dropout
-            )  # is a self-attention
-
-        self.ff = FeedForward(inner_dim, dim_out=dim, dropout=dropout, glu=gated_ff)
-
-        if disable_temporal_crossattention:
-            if switch_temporal_ca_to_sa:
-                raise ValueError
-            else:
-                self.attn2 = None
-        else:
-            self.norm2 = nn.LayerNorm(inner_dim)
-            if switch_temporal_ca_to_sa:
-                self.attn2 = attn_cls(
-                    query_dim=inner_dim, heads=n_heads, dim_head=d_head, dropout=dropout
-                )  # is a self-attention
-            else:
-                self.attn2 = attn_cls(
-                    query_dim=inner_dim,
-                    context_dim=context_dim,
-                    heads=n_heads,
-                    dim_head=d_head,
-                    dropout=dropout,
-                )  # is self-attn if context is none
-
-        self.norm1 = nn.LayerNorm(inner_dim)
-        self.norm3 = nn.LayerNorm(inner_dim)
-        self.switch_temporal_ca_to_sa = switch_temporal_ca_to_sa
-
-        self.checkpoint = checkpoint
-        if self.checkpoint:
-            print(f"{self.__class__.__name__} is using checkpointing")
-
-    def forward(
-        self, x: torch.Tensor, context: torch.Tensor = None, timesteps: int = None
-    ) -> torch.Tensor:
-        if self.checkpoint:
-            return checkpoint(self._forward, x, context, timesteps)
-        else:
-            return self._forward(x, context, timesteps=timesteps)
-
-    def _forward(self, x, context=None, timesteps=None):
-        assert self.timesteps or timesteps
-        assert not (self.timesteps and timesteps) or self.timesteps == timesteps
-        timesteps = self.timesteps or timesteps
-        B, S, C = x.shape
-        x = rearrange(x, "(b t) s c -> (b s) t c", t=timesteps)
-
-        if self.ff_in:
-            x_skip = x
-            x = self.ff_in(self.norm_in(x))
-            if self.is_res:
-                x += x_skip
-
-        if self.disable_self_attn:
-            x = self.attn1(self.norm1(x), context=context) + x
-        else:
-            x = self.attn1(self.norm1(x)) + x
-
-        if self.attn2 is not None:
-            if self.switch_temporal_ca_to_sa:
-                x = self.attn2(self.norm2(x)) + x
-            else:
-                x = self.attn2(self.norm2(x), context=context) + x
-        x_skip = x
-        x = self.ff(self.norm3(x))
-        if self.is_res:
-            x += x_skip
-
-        x = rearrange(
-            x, "(b s) t c -> (b t) s c", s=S, b=B // timesteps, c=C, t=timesteps
-        )
-        return x
-
-    def get_last_layer(self):
-        return self.ff.net[-1].weight
-
-
-str_to_dtype = {"fp32": torch.float32, "fp16": torch.float16, "bf16": torch.bfloat16}
-
-
-class SpatialVideoTransformer(SpatialTransformer):
-    def __init__(
-        self,
-        in_channels,
-        n_heads,
-        d_head,
-        depth=1,
-        dropout=0.0,
-        use_linear=False,
-        context_dim=None,
-        use_spatial_context=False,
-        timesteps=None,
-        merge_strategy: str = "fixed",
-        merge_factor: float = 0.5,
-        time_context_dim=None,
-        ff_in=False,
-        checkpoint=False,
-        time_depth=1,
-        attn_mode="softmax",
-        disable_self_attn=False,
-        disable_temporal_crossattention=False,
-        max_time_embed_period: int = 10000,
-        dtype="fp32",
-    ):
-        super().__init__(
-            in_channels,
-            n_heads,
-            d_head,
-            depth=depth,
-            dropout=dropout,
-            attn_type=attn_mode,
-            use_checkpoint=checkpoint,
-            context_dim=context_dim,
-            use_linear=use_linear,
-            disable_self_attn=disable_self_attn,
-        )
-        self.time_depth = time_depth
-        self.depth = depth
-        self.max_time_embed_period = max_time_embed_period
-
-        time_mix_d_head = d_head
-        n_time_mix_heads = n_heads
-
-        time_mix_inner_dim = int(time_mix_d_head * n_time_mix_heads)
-
-        inner_dim = n_heads * d_head
-        if use_spatial_context:
-            time_context_dim = context_dim
-
-        self.time_stack = nn.ModuleList(
-            [
-                VideoTransformerBlock(
-                    inner_dim,
-                    n_time_mix_heads,
-                    time_mix_d_head,
-                    dropout=dropout,
-                    context_dim=time_context_dim,
-                    timesteps=timesteps,
-                    checkpoint=checkpoint,
-                    ff_in=ff_in,
-                    inner_dim=time_mix_inner_dim,
-                    attn_mode=attn_mode,
-                    disable_self_attn=disable_self_attn,
-                    disable_temporal_crossattention=disable_temporal_crossattention,
-                )
-                for _ in range(self.depth)
-            ]
-        )
-
-        assert len(self.time_stack) == len(self.transformer_blocks)
-
-        self.use_spatial_context = use_spatial_context
-        self.in_channels = in_channels
-
-        time_embed_dim = self.in_channels * 4
-        self.time_pos_embed = nn.Sequential(
-            linear(self.in_channels, time_embed_dim),
-            nn.SiLU(),
-            linear(time_embed_dim, self.in_channels),
-        )
-
-        self.time_mixer = AlphaBlender(
-            alpha=merge_factor, merge_strategy=merge_strategy
-        )
-        self.dtype = str_to_dtype[dtype]
-
-    def forward(
-        self,
-        x: torch.Tensor,
-        context: Optional[torch.Tensor] = None,
-        time_context: Optional[torch.Tensor] = None,
-        timesteps: Optional[int] = None,
-        image_only_indicator: Optional[torch.Tensor] = None,
-    ) -> torch.Tensor:
-        _, _, h, w = x.shape
-        x_in = x
-        spatial_context = None
-        if exists(context):
-            spatial_context = context
-
-        if self.use_spatial_context:
-            assert (
-                context.ndim == 3
-            ), f"n dims of spatial context should be 3 but are {context.ndim}"
-
-            time_context = context
-            time_context_first_timestep = time_context[::timesteps]
-            time_context = repeat(
-                time_context_first_timestep, "b ... -> (b n) ...", n=h * w
-            )
-        elif time_context is not None and not self.use_spatial_context:
-            time_context = repeat(time_context, "b ... -> (b n) ...", n=h * w)
-            if time_context.ndim == 2:
-                time_context = rearrange(time_context, "b c -> b 1 c")
-
-        x = self.norm(x)
-        if not self.use_linear:
-            x = self.proj_in(x)
-        x = rearrange(x, "b c h w -> b (h w) c")
-        if self.use_linear:
-            x = self.proj_in(x)
-
-        num_frames = torch.arange(timesteps, device=x.device)
-        num_frames = repeat(num_frames, "t -> b t", b=x.shape[0] // timesteps)
-        num_frames = rearrange(num_frames, "b t -> (b t)")
-        t_emb = timestep_embedding(
-            num_frames,
-            self.in_channels,
-            repeat_only=False,
-            max_period=self.max_time_embed_period,
-            dtype=self.dtype,
-        )
-        emb = self.time_pos_embed(t_emb)
-        emb = emb[:, None, :]
-
-        for it_, (block, mix_block) in enumerate(
-            zip(self.transformer_blocks, self.time_stack)
-        ):
-            x = block(
-                x,
-                context=spatial_context,
-            )
-
-            x_mix = x
-            x_mix = x_mix + emb
-
-            x_mix = mix_block(x_mix, context=time_context, timesteps=timesteps)
-            x = self.time_mixer(
-                x_spatial=x,
-                x_temporal=x_mix,
-                image_only_indicator=image_only_indicator,
-            )
-        if self.use_linear:
-            x = self.proj_out(x)
-        x = rearrange(x, "b (h w) c -> b c h w", h=h, w=w)
-        if not self.use_linear:
-            x = self.proj_out(x)
-        out = x + x_in
-        return out
diff --git a/videotuna/models/cogvideo_sat/sgm/util.py b/videotuna/models/cogvideo_sat/sgm/util.py
deleted file mode 100644
index c85a493f..00000000
--- a/videotuna/models/cogvideo_sat/sgm/util.py
+++ /dev/null
@@ -1,405 +0,0 @@
-import functools
-import importlib
-import os
-from functools import partial
-from inspect import isfunction
-
-import fsspec
-import numpy as np
-import torch
-import torch.distributed
-from PIL import Image, ImageDraw, ImageFont
-from safetensors.torch import load_file as load_safetensors
-
-_CONTEXT_PARALLEL_GROUP = None
-_CONTEXT_PARALLEL_SIZE = None
-
-
-def is_context_parallel_initialized():
-    if _CONTEXT_PARALLEL_GROUP is None:
-        return False
-    else:
-        return True
-
-
-def set_context_parallel_group(size, group):
-    global _CONTEXT_PARALLEL_GROUP
-    global _CONTEXT_PARALLEL_SIZE
-    _CONTEXT_PARALLEL_GROUP = group
-    _CONTEXT_PARALLEL_SIZE = size
-
-
-def initialize_context_parallel(context_parallel_size):
-    global _CONTEXT_PARALLEL_GROUP
-    global _CONTEXT_PARALLEL_SIZE
-
-    assert (
-        _CONTEXT_PARALLEL_GROUP is None
-    ), "context parallel group is already initialized"
-    _CONTEXT_PARALLEL_SIZE = context_parallel_size
-
-    rank = torch.distributed.get_rank()
-    world_size = torch.distributed.get_world_size()
-
-    for i in range(0, world_size, context_parallel_size):
-        ranks = range(i, i + context_parallel_size)
-        group = torch.distributed.new_group(ranks)
-        if rank in ranks:
-            _CONTEXT_PARALLEL_GROUP = group
-            break
-
-
-def get_context_parallel_group():
-    assert (
-        _CONTEXT_PARALLEL_GROUP is not None
-    ), "context parallel group is not initialized"
-
-    return _CONTEXT_PARALLEL_GROUP
-
-
-def get_context_parallel_world_size():
-    assert (
-        _CONTEXT_PARALLEL_SIZE is not None
-    ), "context parallel size is not initialized"
-
-    return _CONTEXT_PARALLEL_SIZE
-
-
-def get_context_parallel_rank():
-    assert (
-        _CONTEXT_PARALLEL_SIZE is not None
-    ), "context parallel size is not initialized"
-
-    rank = torch.distributed.get_rank()
-    cp_rank = rank % _CONTEXT_PARALLEL_SIZE
-    return cp_rank
-
-
-def get_context_parallel_group_rank():
-    assert (
-        _CONTEXT_PARALLEL_SIZE is not None
-    ), "context parallel size is not initialized"
-
-    rank = torch.distributed.get_rank()
-    cp_group_rank = rank // _CONTEXT_PARALLEL_SIZE
-
-    return cp_group_rank
-
-
-class SafeConv3d(torch.nn.Conv3d):
-    def forward(self, input):
-        memory_count = torch.prod(torch.tensor(input.shape)).item() * 2 / 1024**3
-        if memory_count > 2:
-            # print(f"WARNING: Conv3d with {memory_count:.2f}GB")
-            kernel_size = self.kernel_size[0]
-            part_num = int(memory_count / 2) + 1
-            input_chunks = torch.chunk(input, part_num, dim=2)  # NCTHW
-            if kernel_size > 1:
-                input_chunks = [input_chunks[0]] + [
-                    torch.cat(
-                        (
-                            input_chunks[i - 1][:, :, -kernel_size + 1 :],
-                            input_chunks[i],
-                        ),
-                        dim=2,
-                    )
-                    for i in range(1, len(input_chunks))
-                ]
-
-            output_chunks = []
-            for input_chunk in input_chunks:
-                output_chunks.append(super(SafeConv3d, self).forward(input_chunk))
-            output = torch.cat(output_chunks, dim=2)
-            return output
-        else:
-            return super(SafeConv3d, self).forward(input)
-
-
-def disabled_train(self, mode=True):
-    """Overwrite model.train with this function to make sure train/eval mode
-    does not change anymore."""
-    return self
-
-
-def get_string_from_tuple(s):
-    try:
-        # Check if the string starts and ends with parentheses
-        if s[0] == "(" and s[-1] == ")":
-            # Convert the string to a tuple
-            t = eval(s)
-            # Check if the type of t is tuple
-            if type(t) == tuple:
-                return t[0]
-            else:
-                pass
-    except:
-        pass
-    return s
-
-
-def is_power_of_two(n):
-    """
-    chat.openai.com/chat
-    Return True if n is a power of 2, otherwise return False.
-
-    The function is_power_of_two takes an integer n as input and returns True if n is a power of 2, otherwise it returns False.
-    The function works by first checking if n is less than or equal to 0. If n is less than or equal to 0, it can't be a power of 2, so the function returns False.
-    If n is greater than 0, the function checks whether n is a power of 2 by using a bitwise AND operation between n and n-1. If n is a power of 2, then it will have only one bit set to 1 in its binary representation. When we subtract 1 from a power of 2, all the bits to the right of that bit become 1, and the bit itself becomes 0. So, when we perform a bitwise AND between n and n-1, we get 0 if n is a power of 2, and a non-zero value otherwise.
-    Thus, if the result of the bitwise AND operation is 0, then n is a power of 2 and the function returns True. Otherwise, the function returns False.
-
-    """
-    if n <= 0:
-        return False
-    return (n & (n - 1)) == 0
-
-
-def autocast(f, enabled=True):
-    def do_autocast(*args, **kwargs):
-        with torch.cuda.amp.autocast(
-            enabled=enabled,
-            dtype=torch.get_autocast_gpu_dtype(),
-            cache_enabled=torch.is_autocast_cache_enabled(),
-        ):
-            return f(*args, **kwargs)
-
-    return do_autocast
-
-
-def load_partial_from_config(config):
-    return partial(get_obj_from_str(config["target"]), **config.get("params", dict()))
-
-
-def log_txt_as_img(wh, xc, size=10):
-    # wh a tuple of (width, height)
-    # xc a list of captions to plot
-    b = len(xc)
-    txts = list()
-    for bi in range(b):
-        txt = Image.new("RGB", wh, color="white")
-        draw = ImageDraw.Draw(txt)
-        font = ImageFont.truetype("data/DejaVuSans.ttf", size=size)
-        nc = int(40 * (wh[0] / 256))
-        if isinstance(xc[bi], list):
-            text_seq = xc[bi][0]
-        else:
-            text_seq = xc[bi]
-        lines = "\n".join(
-            text_seq[start : start + nc] for start in range(0, len(text_seq), nc)
-        )
-
-        try:
-            draw.text((0, 0), lines, fill="black", font=font)
-        except UnicodeEncodeError:
-            print("Cant encode string for logging. Skipping.")
-
-        txt = np.array(txt).transpose(2, 0, 1) / 127.5 - 1.0
-        txts.append(txt)
-    txts = np.stack(txts)
-    txts = torch.tensor(txts)
-    return txts
-
-
-def partialclass(cls, *args, **kwargs):
-    class NewCls(cls):
-        __init__ = functools.partialmethod(cls.__init__, *args, **kwargs)
-
-    return NewCls
-
-
-def make_path_absolute(path):
-    fs, p = fsspec.core.url_to_fs(path)
-    if fs.protocol == "file":
-        return os.path.abspath(p)
-    return path
-
-
-def ismap(x):
-    if not isinstance(x, torch.Tensor):
-        return False
-    return (len(x.shape) == 4) and (x.shape[1] > 3)
-
-
-def isimage(x):
-    if not isinstance(x, torch.Tensor):
-        return False
-    return (len(x.shape) == 4) and (x.shape[1] == 3 or x.shape[1] == 1)
-
-
-def isheatmap(x):
-    if not isinstance(x, torch.Tensor):
-        return False
-
-    return x.ndim == 2
-
-
-def isneighbors(x):
-    if not isinstance(x, torch.Tensor):
-        return False
-    return x.ndim == 5 and (x.shape[2] == 3 or x.shape[2] == 1)
-
-
-def exists(x):
-    return x is not None
-
-
-def expand_dims_like(x, y):
-    while x.dim() != y.dim():
-        x = x.unsqueeze(-1)
-    return x
-
-
-def default(val, d):
-    if exists(val):
-        return val
-    return d() if isfunction(d) else d
-
-
-def mean_flat(tensor):
-    """
-    https://github.com/openai/guided-diffusion/blob/27c20a8fab9cb472df5d6bdd6c8d11c8f430b924/guided_diffusion/nn.py#L86
-    Take the mean over all non-batch dimensions.
-    """
-    return tensor.mean(dim=list(range(1, len(tensor.shape))))
-
-
-def count_params(model, verbose=False):
-    total_params = sum(p.numel() for p in model.parameters())
-    if verbose:
-        print(f"{model.__class__.__name__} has {total_params * 1.e-6:.2f} M params.")
-    return total_params
-
-
-def instantiate_from_config(config, **extra_kwargs):
-    if not "target" in config:
-        if config == "__is_first_stage__":
-            return None
-        elif config == "__is_unconditional__":
-            return None
-        raise KeyError("Expected key `target` to instantiate.")
-    return get_obj_from_str(config["target"])(
-        **config.get("params", dict()), **extra_kwargs
-    )
-
-
-def get_obj_from_str(string, reload=False, invalidate_cache=True):
-    module, cls = string.rsplit(".", 1)
-    if invalidate_cache:
-        importlib.invalidate_caches()
-    if reload:
-        module_imp = importlib.import_module(module)
-        importlib.reload(module_imp)
-    return getattr(importlib.import_module(module, package=None), cls)
-
-
-def append_zero(x):
-    return torch.cat([x, x.new_zeros([1])])
-
-
-def append_dims(x, target_dims):
-    """Appends dimensions to the end of a tensor until it has target_dims dimensions."""
-    dims_to_append = target_dims - x.ndim
-    if dims_to_append < 0:
-        raise ValueError(
-            f"input has {x.ndim} dims but target_dims is {target_dims}, which is less"
-        )
-    return x[(...,) + (None,) * dims_to_append]
-
-
-def load_model_from_config(config, ckpt, verbose=True, freeze=True):
-    print(f"Loading model from {ckpt}")
-    if ckpt.endswith("ckpt"):
-        pl_sd = torch.load(ckpt, map_location="cpu")
-        if "global_step" in pl_sd:
-            print(f"Global Step: {pl_sd['global_step']}")
-        sd = pl_sd["state_dict"]
-    elif ckpt.endswith("safetensors"):
-        sd = load_safetensors(ckpt)
-    else:
-        raise NotImplementedError
-
-    model = instantiate_from_config(config.model)
-
-    m, u = model.load_state_dict(sd, strict=False)
-
-    if len(m) > 0 and verbose:
-        print("missing keys:")
-        print(m)
-    if len(u) > 0 and verbose:
-        print("unexpected keys:")
-        print(u)
-
-    if freeze:
-        for param in model.parameters():
-            param.requires_grad = False
-
-    model.eval()
-    return model
-
-
-def get_configs_path() -> str:
-    """
-    Get the `configs` directory.
-    For a working copy, this is the one in the root of the repository,
-    but for an installed copy, it's in the `sgm` package (see pyproject.toml).
-    """
-    this_dir = os.path.dirname(__file__)
-    candidates = (
-        os.path.join(this_dir, "configs"),
-        os.path.join(this_dir, "..", "configs"),
-    )
-    for candidate in candidates:
-        candidate = os.path.abspath(candidate)
-        if os.path.isdir(candidate):
-            return candidate
-    raise FileNotFoundError(f"Could not find SGM configs in {candidates}")
-
-
-def get_nested_attribute(obj, attribute_path, depth=None, return_key=False):
-    """
-    Will return the result of a recursive get attribute call.
-    E.g.:
-        a.b.c
-        = getattr(getattr(a, "b"), "c")
-        = get_nested_attribute(a, "b.c")
-    If any part of the attribute call is an integer x with current obj a, will
-    try to call a[x] instead of a.x first.
-    """
-    attributes = attribute_path.split(".")
-    if depth is not None and depth > 0:
-        attributes = attributes[:depth]
-    assert len(attributes) > 0, "At least one attribute should be selected"
-    current_attribute = obj
-    current_key = None
-    for level, attribute in enumerate(attributes):
-        current_key = ".".join(attributes[: level + 1])
-        try:
-            id_ = int(attribute)
-            current_attribute = current_attribute[id_]
-        except ValueError:
-            current_attribute = getattr(current_attribute, attribute)
-
-    return (current_attribute, current_key) if return_key else current_attribute
-
-
-from math import sqrt
-
-
-class SeededNoise:
-    def __init__(self, seeds, weights):
-        self.seeds = seeds
-        self.weights = weights
-        weight_square_sum = 0
-        for weight in weights:
-            weight_square_sum += weight**2
-        self.weight_square_sum_sqrt = sqrt(weight_square_sum)
-        self.cnt = 0
-
-    def __call__(self, x):
-        self.cnt += 1
-        randn_combined = torch.zeros_like(x)
-        for seed, weight in zip(self.seeds, self.weights):
-            randn = np.random.RandomState(seed + self.cnt).randn(*x.shape)
-            randn = torch.from_numpy(randn, dtype=x.dtype, device=x.device)
-            randn_combined += randn * weight
-        randn_combined /= self.weight_square_sum_sqrt
-        return randn_combined
diff --git a/videotuna/models/cogvideo_sat/sgm/webds.py b/videotuna/models/cogvideo_sat/sgm/webds.py
deleted file mode 100644
index 078ed6dd..00000000
--- a/videotuna/models/cogvideo_sat/sgm/webds.py
+++ /dev/null
@@ -1,419 +0,0 @@
-import io
-import json
-import os
-import re
-import sys
-import tarfile
-from functools import partial
-
-import webdataset as wds
-from webdataset import DataPipeline, ResampledShards, tarfile_to_samples
-from webdataset.filters import pipelinefilter
-from webdataset.gopen import gopen, gopen_schemes
-from webdataset.handlers import reraise_exception
-from webdataset.tariterators import group_by_keys, url_opener
-
-
-def pytorch_worker_info(group=None):  # sourcery skip: use-contextlib-suppress
-    """Return node and worker info for PyTorch and some distributed environments."""
-    rank = 0
-    world_size = 1
-    worker = 0
-    num_workers = 1
-    try:
-        import torch.distributed
-
-        if torch.distributed.is_available() and torch.distributed.is_initialized():
-            group = group or torch.distributed.group.WORLD
-            rank = torch.distributed.get_rank(group=group)
-            world_size = torch.distributed.get_world_size(group=group)
-    except ModuleNotFoundError:
-        pass
-    try:
-        import torch.utils.data
-
-        worker_info = torch.utils.data.get_worker_info()
-        if worker_info is not None:
-            worker = worker_info.id
-            num_workers = worker_info.num_workers
-    except ModuleNotFoundError:
-        pass
-
-    return rank, world_size, worker, num_workers
-
-
-def pytorch_worker_seed(group=None):
-    """Compute a distinct, deterministic RNG seed for each worker and node."""
-    rank, world_size, worker, num_workers = pytorch_worker_info(group=group)
-    return rank * 1000 + worker
-
-
-def worker_seed_sat(group=None, seed=0):
-    return pytorch_worker_seed(group=group) + seed * 23
-
-
-class ConfiguredResampledShards(ResampledShards):
-    def __init__(self, urls, seed, nshards=sys.maxsize, deterministic=True):
-        from sat.helpers import print_rank0
-
-        try:
-            from megatron.core.parallel_state import get_data_parallel_group
-
-            group = get_data_parallel_group()
-            print_rank0("Using megatron data parallel group.")
-        except:
-            from sat.mpu import get_data_parallel_group
-
-            try:
-                group = get_data_parallel_group()
-                print_rank0("Using sat data parallel group.")
-            except AssertionError:
-                group = None
-                print_rank0("No data parallel group is specified!")
-        worker_seed_sat_this = partial(worker_seed_sat, group=group, seed=seed)
-        super().__init__(urls, nshards, worker_seed_sat_this, deterministic)
-
-
-class SimpleDistributedWebDataset(DataPipeline):
-    def __init__(self, path, process_fn, seed, *, shuffle_buffer=1000):
-        # set shuffle_buffer = 1 to disable it, model-parallel will be different due to shuffle
-        try:
-            from sat.mpu import get_model_parallel_world_size
-
-            if get_model_parallel_world_size() > 1:
-                shuffle_buffer = 1
-        except Exception:
-            pass
-        super().__init__(
-            ConfiguredResampledShards(
-                path, seed
-            ),  # Lots of shards are recommended, or not evenly
-            tarfile_to_samples(),
-            wds.shuffle(shuffle_buffer),
-            process_fn,
-        )
-
-
-def tar_file_iterator_with_meta(
-    fileobj,
-    meta_names,
-    skip_meta=r"__[^/]*__($|/)",
-    suffix=None,
-    handler=reraise_exception,
-    meta_stream=None,
-):
-    """Iterate over tar file, yielding filename, content pairs for the given tar stream.
-
-    :param fileobj: byte stream suitable for tarfile
-    :param meta_names: key of different items in meta file
-    :param skip_meta: regexp for keys that are skipped entirely (Default value = r"__[^/]*__($|/)")
-
-    """
-    stream = tarfile.open(fileobj=fileobj, mode="r|*")
-    data_dir, filename = fileobj.name.rsplit("/", 1)
-    meta_data = {}  # {id: {meta_name: meta_value, meta_name2: meta_value2, ...}}
-
-    if meta_stream is None:
-        meta_file_name = filename.split(".")[0] + ".meta.jsonl"
-        meta_path = os.path.join(data_dir, meta_file_name)
-        if os.path.exists(meta_path):
-            meta_stream = open(meta_path, "r")
-    else:
-        meta_file_name = meta_stream.name
-
-    if meta_stream is not None:
-        for lineno, line in enumerate(meta_stream):
-            meta_list = []
-            try:
-                meta_list.append(json.loads(line))
-            except Exception as exn:
-                from sat.helpers import print_rank0
-
-                print_rank0(
-                    f"Error in loading jsonl {meta_file_name}, lineno {lineno}: {line}",
-                    level="DEBUG",
-                )
-                continue
-            for item in meta_list:
-                if not item["key"] in meta_data:
-                    meta_data[item["key"]] = {}
-                for meta_name in meta_names:
-                    if meta_name in item:
-                        meta_data[item["key"]][meta_name] = item[meta_name]
-        meta_stream.close()
-
-    try:
-        for tarinfo in stream:
-            fname = tarinfo.name
-            try:
-                if not tarinfo.isreg():
-                    continue
-                if fname is None:
-                    continue
-                if "/" not in fname and fname.startswith("__") and fname.endswith("__"):
-                    # skipping metadata for now
-                    continue
-                if skip_meta is not None and re.match(skip_meta, fname):
-                    continue
-                if fname.endswith(".txt") and suffix is not None:
-                    data = (
-                        stream.extractfile(tarinfo).read().decode() + suffix
-                    ).encode()
-                else:
-                    data = stream.extractfile(tarinfo).read()
-                result = dict(fname=fname, data=data)
-                yield result
-
-                if fname.endswith(".id"):
-                    fid = fname.split(".")[0]
-                    if "-$#%@&" in fid:
-                        sfid = fid.split("-$#%@&")[0]
-                    else:
-                        sfid = fid
-                    meta_data_fid = meta_data.get(sfid, {})
-                    for meta_name in meta_names:
-                        meta_fname = fid + "." + meta_name
-                        meta = meta_data_fid.get(meta_name, None)
-                        yield dict(fname=meta_fname, data=meta)
-                stream.members = []
-            except Exception as exn:
-                if hasattr(exn, "args") and len(exn.args) > 0:
-                    exn.args = (exn.args[0] + " @ " + str(fileobj),) + exn.args[1:]
-                if handler(exn):
-                    continue
-                else:
-                    break
-    except Exception as exn:
-        print(exn)
-    del stream
-
-
-def tar_file_expander_with_meta(data, meta_names, handler=reraise_exception):
-    """Expand a stream of open tar files into a stream of tar file contents.
-
-    This returns an iterator over (filename, file_contents).
-    """
-    for source in data:
-        url = source["url"]
-        try:
-            assert isinstance(source, dict)
-            assert "stream" in source
-            for sample in tar_file_iterator_with_meta(
-                source["stream"], meta_names, meta_stream=source["meta_stream"]
-            ):
-                assert (
-                    isinstance(sample, dict) and "data" in sample and "fname" in sample
-                )
-                sample["__url__"] = url
-                yield sample
-        except Exception as exn:
-            exn.args = exn.args + (source.get("stream"), source.get("url"))
-            if handler(exn):
-                continue
-            else:
-                break
-
-
-def url_opener(
-    data,
-    handler,
-    **kw,
-):
-    """Open URLs and yield a stream of url+stream pairs.
-
-    Args:
-        data: iterator over dict(url=...)
-        handler: exception handler.
-        kw: keyword arguments for gopen.gopen.
-
-    Yields:
-        a stream of url+stream pairs.
-    """
-    for sample in data:
-        assert isinstance(sample, dict), sample
-        assert "url" in sample
-        url = sample["url"]
-        try:
-            stream = gopen(url, **kw)
-            if hasattr(stream, "meta_stream"):
-                meta_stream = stream.meta_stream
-                del stream.meta_stream
-            else:
-                meta_stream = None
-            sample.update(stream=stream, meta_stream=meta_stream)
-            yield sample
-        except Exception as exn:
-            exn.args = exn.args + (url,)
-            if handler(exn):
-                continue
-            else:
-                break
-
-
-def tarfile_samples_with_meta(src, meta_names, handler=reraise_exception):
-    streams = url_opener(src, handler=handler)
-    files = tar_file_expander_with_meta(streams, meta_names, handler)
-    samples = group_by_keys(files, handler=handler)
-    return samples
-
-
-class MetaDistributedWebDataset(DataPipeline):
-    """WebDataset with meta information files
-    Extra Format:
-        in webdataset (tar), for each sample there is a '.id';
-        for each tar file, there is a '.meta.jsonl' file with the same name;
-        The '.meta.jsonl' file contains lines of json objects, each with a 'key' field to match '.id'.
-    """
-
-    def __init__(
-        self,
-        path,
-        process_fn,
-        seed,
-        *,
-        meta_names=[],
-        nshards=sys.maxsize,
-        shuffle_buffer=1000,
-        include_dirs=None,
-    ):
-        # os.environ['WDS_SHOW_SEED'] = '1'
-        import torch
-
-        if torch.distributed.get_rank() == 0:
-            if include_dirs is not None:  # /webdatasets/A,/webdatasets/C
-                other_paths = []
-                include_dirs = include_dirs.split(",")
-                for include_dir in include_dirs:
-                    if "*" in include_dir:
-                        include_dir, n = include_dir.split("*")
-                        n = int(n)
-                    else:
-                        n = 1
-                    for cur_dir, dirs, files in os.walk(include_dir):
-                        for f in files:
-                            if (
-                                f.endswith("tar")
-                                and os.path.getsize(os.path.join(cur_dir, f)) > 0
-                            ):
-                                # other_paths.append(os.path.join(cur_dir,f))
-                                other_paths.extend([os.path.join(cur_dir, f)] * n)
-                # print(f'Adding dataset paths {",".join(other_paths)}')
-                from braceexpand import braceexpand
-
-                if len(path) > 0:  # not ""
-                    path = list(braceexpand(path)) + other_paths
-                else:
-                    path = other_paths
-            path = [path]
-        else:
-            path = [
-                None,
-            ]
-        torch.distributed.broadcast_object_list(path, src=0)
-        path = path[0]
-
-        tarfile_samples = partial(tarfile_samples_with_meta, meta_names=meta_names)
-        tarfile_to_samples = pipelinefilter(tarfile_samples)
-
-        # if model parallel, shuffle_buffer should be 1 to disable shuffling
-        try:
-            from sat.mpu import get_model_parallel_world_size
-
-            if get_model_parallel_world_size() > 1:
-                shuffle_buffer = 1
-        except Exception:
-            pass
-
-        super().__init__(
-            ConfiguredResampledShards(path, seed, nshards=nshards),
-            tarfile_to_samples(),
-            wds.shuffle(shuffle_buffer),
-            process_fn,
-        )
-
-
-# rclone support
-from webdataset.gopen import Pipe
-
-
-def gopen_rclone(url, mode="rb", bufsize=1024 * 1024 * 32):
-    """Open a URL with `curl`.
-
-    :param url: rclone url, e.g. data:bucket1/foo.tar. data should be configured.
-    :param mode: file mode
-    :param bufsize: buffer size
-    """
-    url = url.replace("rclone://", "")
-    if mode[0] == "r":
-        cmd = f"rclone cat '{url}'"
-        return Pipe(
-            cmd,
-            mode=mode,
-            shell=True,
-            bufsize=bufsize,
-            ignore_status=[141, 23],
-        )  # skipcq: BAN-B604
-    elif mode[0] == "w":
-        cmd = f"rclone cp - '{url}'"
-        return Pipe(
-            cmd,
-            mode=mode,
-            shell=True,
-            bufsize=bufsize,
-            ignore_status=[141, 26],
-        )  # skipcq: BAN-B604
-    else:
-        raise ValueError(f"{mode}: unknown mode")
-
-
-def gopen_boto3(url, mode="rb", bufsize=8192 * 2):
-    """Open a URL with boto3 API.
-
-    :param url: boto3 url, e.g. boto3://bucket1/foo.tar. data should be configured.
-    :param mode: file mode
-    :param bufsize: buffer size
-    """
-    import boto3
-
-    # boto3.set_stream_logger('botocore', level='DEBUG')
-    if url.startswith("boto3://"):
-        url = url.replace("boto3://", "")
-        need_meta = False
-    else:
-        url = url.replace("metaboto3://", "")
-        need_meta = True
-    endpoint_url = os.environ.get("S3_ENDPOINT_URL", None)
-    access_key = os.environ.get("S3_ACCESS_KEY_ID", None)
-    secret_key = os.environ.get("S3_SECRET_ACCESS_KEY", None)
-
-    if mode[0] == "r":
-        s3_client = boto3.client(
-            "s3",
-            endpoint_url=endpoint_url,
-            aws_access_key_id=access_key,
-            aws_secret_access_key=secret_key,
-        )
-        bucket, key = url.split("/", 1)
-
-        if need_meta:
-            # download a meta json
-            meta_file_key = key.split(".")[0] + ".meta.jsonl"
-            meta_stream = io.BytesIO()
-            s3_client.download_fileobj(bucket, meta_file_key, meta_stream)
-            meta_stream.seek(0)
-            meta_stream.name = meta_file_key
-        else:
-            meta_stream = None
-
-        # data tar stream
-        response = s3_client.get_object(Bucket=bucket, Key=key)  # Range optional
-        response["Body"].name = key  # actually not used
-        response["Body"].meta_stream = meta_stream
-        return response["Body"]
-    else:
-        raise ValueError(f"{mode}: unknown mode")
-
-
-gopen_schemes["rclone"] = gopen_rclone
-gopen_schemes["boto3"] = gopen_boto3
-gopen_schemes["metaboto3"] = gopen_boto3
diff --git a/videotuna/models/cogvideo_sat/vae_modules/attention.py b/videotuna/models/cogvideo_sat/vae_modules/attention.py
deleted file mode 100644
index caa594ef..00000000
--- a/videotuna/models/cogvideo_sat/vae_modules/attention.py
+++ /dev/null
@@ -1,633 +0,0 @@
-import math
-from inspect import isfunction
-from typing import Any, Optional
-
-import torch
-import torch.nn.functional as F
-from einops import rearrange, repeat
-from packaging import version
-from torch import nn
-
-if version.parse(torch.__version__) >= version.parse("2.0.0"):
-    SDP_IS_AVAILABLE = True
-    from torch.backends.cuda import SDPBackend, sdp_kernel
-
-    BACKEND_MAP = {
-        SDPBackend.MATH: {
-            "enable_math": True,
-            "enable_flash": False,
-            "enable_mem_efficient": False,
-        },
-        SDPBackend.FLASH_ATTENTION: {
-            "enable_math": False,
-            "enable_flash": True,
-            "enable_mem_efficient": False,
-        },
-        SDPBackend.EFFICIENT_ATTENTION: {
-            "enable_math": False,
-            "enable_flash": False,
-            "enable_mem_efficient": True,
-        },
-        None: {"enable_math": True, "enable_flash": True, "enable_mem_efficient": True},
-    }
-else:
-    from contextlib import nullcontext
-
-    SDP_IS_AVAILABLE = False
-    sdp_kernel = nullcontext
-    BACKEND_MAP = {}
-    print(
-        f"No SDP backend available, likely because you are running in pytorch versions < 2.0. In fact, "
-        f"you are using PyTorch {torch.__version__}. You might want to consider upgrading."
-    )
-
-try:
-    import xformers
-    import xformers.ops
-
-    XFORMERS_IS_AVAILABLE = True
-except:
-    XFORMERS_IS_AVAILABLE = False
-    print("no module 'xformers'. Processing without...")
-
-from modules.utils import checkpoint
-
-
-def exists(val):
-    return val is not None
-
-
-def uniq(arr):
-    return {el: True for el in arr}.keys()
-
-
-def default(val, d):
-    if exists(val):
-        return val
-    return d() if isfunction(d) else d
-
-
-def max_neg_value(t):
-    return -torch.finfo(t.dtype).max
-
-
-def init_(tensor):
-    dim = tensor.shape[-1]
-    std = 1 / math.sqrt(dim)
-    tensor.uniform_(-std, std)
-    return tensor
-
-
-# feedforward
-class GEGLU(nn.Module):
-    def __init__(self, dim_in, dim_out):
-        super().__init__()
-        self.proj = nn.Linear(dim_in, dim_out * 2)
-
-    def forward(self, x):
-        x, gate = self.proj(x).chunk(2, dim=-1)
-        return x * F.gelu(gate)
-
-
-class FeedForward(nn.Module):
-    def __init__(self, dim, dim_out=None, mult=4, glu=False, dropout=0.0):
-        super().__init__()
-        inner_dim = int(dim * mult)
-        dim_out = default(dim_out, dim)
-        project_in = (
-            nn.Sequential(nn.Linear(dim, inner_dim), nn.GELU())
-            if not glu
-            else GEGLU(dim, inner_dim)
-        )
-
-        self.net = nn.Sequential(
-            project_in, nn.Dropout(dropout), nn.Linear(inner_dim, dim_out)
-        )
-
-    def forward(self, x):
-        return self.net(x)
-
-
-def zero_module(module):
-    """
-    Zero out the parameters of a module and return it.
-    """
-    for p in module.parameters():
-        p.detach().zero_()
-    return module
-
-
-def Normalize(in_channels):
-    return torch.nn.GroupNorm(
-        num_groups=32, num_channels=in_channels, eps=1e-6, affine=True
-    )
-
-
-class LinearAttention(nn.Module):
-    def __init__(self, dim, heads=4, dim_head=32):
-        super().__init__()
-        self.heads = heads
-        hidden_dim = dim_head * heads
-        self.to_qkv = nn.Conv2d(dim, hidden_dim * 3, 1, bias=False)
-        self.to_out = nn.Conv2d(hidden_dim, dim, 1)
-
-    def forward(self, x):
-        b, c, h, w = x.shape
-        qkv = self.to_qkv(x)
-        q, k, v = rearrange(
-            qkv, "b (qkv heads c) h w -> qkv b heads c (h w)", heads=self.heads, qkv=3
-        )
-        k = k.softmax(dim=-1)
-        context = torch.einsum("bhdn,bhen->bhde", k, v)
-        out = torch.einsum("bhde,bhdn->bhen", context, q)
-        out = rearrange(
-            out, "b heads c (h w) -> b (heads c) h w", heads=self.heads, h=h, w=w
-        )
-        return self.to_out(out)
-
-
-class SpatialSelfAttention(nn.Module):
-    def __init__(self, in_channels):
-        super().__init__()
-        self.in_channels = in_channels
-
-        self.norm = Normalize(in_channels)
-        self.q = torch.nn.Conv2d(
-            in_channels, in_channels, kernel_size=1, stride=1, padding=0
-        )
-        self.k = torch.nn.Conv2d(
-            in_channels, in_channels, kernel_size=1, stride=1, padding=0
-        )
-        self.v = torch.nn.Conv2d(
-            in_channels, in_channels, kernel_size=1, stride=1, padding=0
-        )
-        self.proj_out = torch.nn.Conv2d(
-            in_channels, in_channels, kernel_size=1, stride=1, padding=0
-        )
-
-    def forward(self, x):
-        h_ = x
-        h_ = self.norm(h_)
-        q = self.q(h_)
-        k = self.k(h_)
-        v = self.v(h_)
-
-        # compute attention
-        b, c, h, w = q.shape
-        q = rearrange(q, "b c h w -> b (h w) c")
-        k = rearrange(k, "b c h w -> b c (h w)")
-        w_ = torch.einsum("bij,bjk->bik", q, k)
-
-        w_ = w_ * (int(c) ** (-0.5))
-        w_ = torch.nn.functional.softmax(w_, dim=2)
-
-        # attend to values
-        v = rearrange(v, "b c h w -> b c (h w)")
-        w_ = rearrange(w_, "b i j -> b j i")
-        h_ = torch.einsum("bij,bjk->bik", v, w_)
-        h_ = rearrange(h_, "b c (h w) -> b c h w", h=h)
-        h_ = self.proj_out(h_)
-
-        return x + h_
-
-
-class CrossAttention(nn.Module):
-    def __init__(
-        self,
-        query_dim,
-        context_dim=None,
-        heads=8,
-        dim_head=64,
-        dropout=0.0,
-        backend=None,
-    ):
-        super().__init__()
-        inner_dim = dim_head * heads
-        context_dim = default(context_dim, query_dim)
-
-        self.scale = dim_head**-0.5
-        self.heads = heads
-
-        self.to_q = nn.Linear(query_dim, inner_dim, bias=False)
-        self.to_k = nn.Linear(context_dim, inner_dim, bias=False)
-        self.to_v = nn.Linear(context_dim, inner_dim, bias=False)
-
-        self.to_out = nn.Sequential(
-            nn.Linear(inner_dim, query_dim), nn.Dropout(dropout)
-        )
-        self.backend = backend
-
-    def forward(
-        self,
-        x,
-        context=None,
-        mask=None,
-        additional_tokens=None,
-        n_times_crossframe_attn_in_self=0,
-    ):
-        h = self.heads
-
-        if additional_tokens is not None:
-            # get the number of masked tokens at the beginning of the output sequence
-            n_tokens_to_mask = additional_tokens.shape[1]
-            # add additional token
-            x = torch.cat([additional_tokens, x], dim=1)
-
-        q = self.to_q(x)
-        context = default(context, x)
-        k = self.to_k(context)
-        v = self.to_v(context)
-
-        if n_times_crossframe_attn_in_self:
-            # reprogramming cross-frame attention as in https://arxiv.org/abs/2303.13439
-            assert x.shape[0] % n_times_crossframe_attn_in_self == 0
-            n_cp = x.shape[0] // n_times_crossframe_attn_in_self
-            k = repeat(
-                k[::n_times_crossframe_attn_in_self], "b ... -> (b n) ...", n=n_cp
-            )
-            v = repeat(
-                v[::n_times_crossframe_attn_in_self], "b ... -> (b n) ...", n=n_cp
-            )
-
-        q, k, v = map(lambda t: rearrange(t, "b n (h d) -> b h n d", h=h), (q, k, v))
-
-        # old
-        """
-        sim = einsum('b i d, b j d -> b i j', q, k) * self.scale
-        del q, k
-
-        if exists(mask):
-            mask = rearrange(mask, 'b ... -> b (...)')
-            max_neg_value = -torch.finfo(sim.dtype).max
-            mask = repeat(mask, 'b j -> (b h) () j', h=h)
-            sim.masked_fill_(~mask, max_neg_value)
-
-        # attention, what we cannot get enough of
-        sim = sim.softmax(dim=-1)
-
-        out = einsum('b i j, b j d -> b i d', sim, v)
-        """
-        # new
-        with sdp_kernel(**BACKEND_MAP[self.backend]):
-            # print("dispatching into backend", self.backend, "q/k/v shape: ", q.shape, k.shape, v.shape)
-            out = F.scaled_dot_product_attention(
-                q, k, v, attn_mask=mask
-            )  # scale is dim_head ** -0.5 per default
-
-        del q, k, v
-        out = rearrange(out, "b h n d -> b n (h d)", h=h)
-
-        if additional_tokens is not None:
-            # remove additional token
-            out = out[:, n_tokens_to_mask:]
-        return self.to_out(out)
-
-
-class MemoryEfficientCrossAttention(nn.Module):
-    # https://github.com/MatthieuTPHR/diffusers/blob/d80b531ff8060ec1ea982b65a1b8df70f73aa67c/src/diffusers/models/attention.py#L223
-    def __init__(
-        self, query_dim, context_dim=None, heads=8, dim_head=64, dropout=0.0, **kwargs
-    ):
-        super().__init__()
-        print(
-            f"Setting up {self.__class__.__name__}. Query dim is {query_dim}, context_dim is {context_dim} and using "
-            f"{heads} heads with a dimension of {dim_head}."
-        )
-        inner_dim = dim_head * heads
-        context_dim = default(context_dim, query_dim)
-
-        self.heads = heads
-        self.dim_head = dim_head
-
-        self.to_q = nn.Linear(query_dim, inner_dim, bias=False)
-        self.to_k = nn.Linear(context_dim, inner_dim, bias=False)
-        self.to_v = nn.Linear(context_dim, inner_dim, bias=False)
-
-        self.to_out = nn.Sequential(
-            nn.Linear(inner_dim, query_dim), nn.Dropout(dropout)
-        )
-        self.attention_op: Optional[Any] = None
-
-    def forward(
-        self,
-        x,
-        context=None,
-        mask=None,
-        additional_tokens=None,
-        n_times_crossframe_attn_in_self=0,
-    ):
-        if additional_tokens is not None:
-            # get the number of masked tokens at the beginning of the output sequence
-            n_tokens_to_mask = additional_tokens.shape[1]
-            # add additional token
-            x = torch.cat([additional_tokens, x], dim=1)
-        q = self.to_q(x)
-        context = default(context, x)
-        k = self.to_k(context)
-        v = self.to_v(context)
-
-        if n_times_crossframe_attn_in_self:
-            # reprogramming cross-frame attention as in https://arxiv.org/abs/2303.13439
-            assert x.shape[0] % n_times_crossframe_attn_in_self == 0
-            # n_cp = x.shape[0]//n_times_crossframe_attn_in_self
-            k = repeat(
-                k[::n_times_crossframe_attn_in_self],
-                "b ... -> (b n) ...",
-                n=n_times_crossframe_attn_in_self,
-            )
-            v = repeat(
-                v[::n_times_crossframe_attn_in_self],
-                "b ... -> (b n) ...",
-                n=n_times_crossframe_attn_in_self,
-            )
-
-        b, _, _ = q.shape
-        q, k, v = map(
-            lambda t: t.unsqueeze(3)
-            .reshape(b, t.shape[1], self.heads, self.dim_head)
-            .permute(0, 2, 1, 3)
-            .reshape(b * self.heads, t.shape[1], self.dim_head)
-            .contiguous(),
-            (q, k, v),
-        )
-
-        # actually compute the attention, what we cannot get enough of
-        out = xformers.ops.memory_efficient_attention(
-            q, k, v, attn_bias=None, op=self.attention_op
-        )
-
-        # TODO: Use this directly in the attention operation, as a bias
-        if exists(mask):
-            raise NotImplementedError
-        out = (
-            out.unsqueeze(0)
-            .reshape(b, self.heads, out.shape[1], self.dim_head)
-            .permute(0, 2, 1, 3)
-            .reshape(b, out.shape[1], self.heads * self.dim_head)
-        )
-        if additional_tokens is not None:
-            # remove additional token
-            out = out[:, n_tokens_to_mask:]
-        return self.to_out(out)
-
-
-class BasicTransformerBlock(nn.Module):
-    ATTENTION_MODES = {
-        "softmax": CrossAttention,  # vanilla attention
-        "softmax-xformers": MemoryEfficientCrossAttention,  # ampere
-    }
-
-    def __init__(
-        self,
-        dim,
-        n_heads,
-        d_head,
-        dropout=0.0,
-        context_dim=None,
-        gated_ff=True,
-        checkpoint=True,
-        disable_self_attn=False,
-        attn_mode="softmax",
-        sdp_backend=None,
-    ):
-        super().__init__()
-        assert attn_mode in self.ATTENTION_MODES
-        if attn_mode != "softmax" and not XFORMERS_IS_AVAILABLE:
-            print(
-                f"Attention mode '{attn_mode}' is not available. Falling back to native attention. "
-                f"This is not a problem in Pytorch >= 2.0. FYI, you are running with PyTorch version {torch.__version__}"
-            )
-            attn_mode = "softmax"
-        elif attn_mode == "softmax" and not SDP_IS_AVAILABLE:
-            print(
-                "We do not support vanilla attention anymore, as it is too expensive. Sorry."
-            )
-            if not XFORMERS_IS_AVAILABLE:
-                assert (
-                    False
-                ), "Please install xformers via e.g. 'pip install xformers==0.0.16'"
-            else:
-                print("Falling back to xformers efficient attention.")
-                attn_mode = "softmax-xformers"
-        attn_cls = self.ATTENTION_MODES[attn_mode]
-        if version.parse(torch.__version__) >= version.parse("2.0.0"):
-            assert sdp_backend is None or isinstance(sdp_backend, SDPBackend)
-        else:
-            assert sdp_backend is None
-        self.disable_self_attn = disable_self_attn
-        self.attn1 = attn_cls(
-            query_dim=dim,
-            heads=n_heads,
-            dim_head=d_head,
-            dropout=dropout,
-            context_dim=context_dim if self.disable_self_attn else None,
-            backend=sdp_backend,
-        )  # is a self-attention if not self.disable_self_attn
-        self.ff = FeedForward(dim, dropout=dropout, glu=gated_ff)
-        self.attn2 = attn_cls(
-            query_dim=dim,
-            context_dim=context_dim,
-            heads=n_heads,
-            dim_head=d_head,
-            dropout=dropout,
-            backend=sdp_backend,
-        )  # is self-attn if context is none
-        self.norm1 = nn.LayerNorm(dim)
-        self.norm2 = nn.LayerNorm(dim)
-        self.norm3 = nn.LayerNorm(dim)
-        self.checkpoint = checkpoint
-        if self.checkpoint:
-            print(f"{self.__class__.__name__} is using checkpointing")
-
-    def forward(
-        self, x, context=None, additional_tokens=None, n_times_crossframe_attn_in_self=0
-    ):
-        kwargs = {"x": x}
-
-        if context is not None:
-            kwargs.update({"context": context})
-
-        if additional_tokens is not None:
-            kwargs.update({"additional_tokens": additional_tokens})
-
-        if n_times_crossframe_attn_in_self:
-            kwargs.update(
-                {"n_times_crossframe_attn_in_self": n_times_crossframe_attn_in_self}
-            )
-
-        # return mixed_checkpoint(self._forward, kwargs, self.parameters(), self.checkpoint)
-        return checkpoint(
-            self._forward, (x, context), self.parameters(), self.checkpoint
-        )
-
-    def _forward(
-        self, x, context=None, additional_tokens=None, n_times_crossframe_attn_in_self=0
-    ):
-        x = (
-            self.attn1(
-                self.norm1(x),
-                context=context if self.disable_self_attn else None,
-                additional_tokens=additional_tokens,
-                n_times_crossframe_attn_in_self=(
-                    n_times_crossframe_attn_in_self if not self.disable_self_attn else 0
-                ),
-            )
-            + x
-        )
-        x = (
-            self.attn2(
-                self.norm2(x), context=context, additional_tokens=additional_tokens
-            )
-            + x
-        )
-        x = self.ff(self.norm3(x)) + x
-        return x
-
-
-class BasicTransformerSingleLayerBlock(nn.Module):
-    ATTENTION_MODES = {
-        "softmax": CrossAttention,  # vanilla attention
-        "softmax-xformers": MemoryEfficientCrossAttention,  # on the A100s not quite as fast as the above version
-        # (todo might depend on head_dim, check, falls back to semi-optimized kernels for dim!=[16,32,64,128])
-    }
-
-    def __init__(
-        self,
-        dim,
-        n_heads,
-        d_head,
-        dropout=0.0,
-        context_dim=None,
-        gated_ff=True,
-        checkpoint=True,
-        attn_mode="softmax",
-    ):
-        super().__init__()
-        assert attn_mode in self.ATTENTION_MODES
-        attn_cls = self.ATTENTION_MODES[attn_mode]
-        self.attn1 = attn_cls(
-            query_dim=dim,
-            heads=n_heads,
-            dim_head=d_head,
-            dropout=dropout,
-            context_dim=context_dim,
-        )
-        self.ff = FeedForward(dim, dropout=dropout, glu=gated_ff)
-        self.norm1 = nn.LayerNorm(dim)
-        self.norm2 = nn.LayerNorm(dim)
-        self.checkpoint = checkpoint
-
-    def forward(self, x, context=None):
-        return checkpoint(
-            self._forward, (x, context), self.parameters(), self.checkpoint
-        )
-
-    def _forward(self, x, context=None):
-        x = self.attn1(self.norm1(x), context=context) + x
-        x = self.ff(self.norm2(x)) + x
-        return x
-
-
-class SpatialTransformer(nn.Module):
-    """
-    Transformer block for image-like data.
-    First, project the input (aka embedding)
-    and reshape to b, t, d.
-    Then apply standard transformer action.
-    Finally, reshape to image
-    NEW: use_linear for more efficiency instead of the 1x1 convs
-    """
-
-    def __init__(
-        self,
-        in_channels,
-        n_heads,
-        d_head,
-        depth=1,
-        dropout=0.0,
-        context_dim=None,
-        disable_self_attn=False,
-        use_linear=False,
-        attn_type="softmax",
-        use_checkpoint=True,
-        # sdp_backend=SDPBackend.FLASH_ATTENTION
-        sdp_backend=None,
-    ):
-        super().__init__()
-        print(
-            f"constructing {self.__class__.__name__} of depth {depth} w/ {in_channels} channels and {n_heads} heads"
-        )
-        from omegaconf import ListConfig
-
-        if exists(context_dim) and not isinstance(context_dim, (list, ListConfig)):
-            context_dim = [context_dim]
-        if exists(context_dim) and isinstance(context_dim, list):
-            if depth != len(context_dim):
-                print(
-                    f"WARNING: {self.__class__.__name__}: Found context dims {context_dim} of depth {len(context_dim)}, "
-                    f"which does not match the specified 'depth' of {depth}. Setting context_dim to {depth * [context_dim[0]]} now."
-                )
-                # depth does not match context dims.
-                assert all(
-                    map(lambda x: x == context_dim[0], context_dim)
-                ), "need homogenous context_dim to match depth automatically"
-                context_dim = depth * [context_dim[0]]
-        elif context_dim is None:
-            context_dim = [None] * depth
-        self.in_channels = in_channels
-        inner_dim = n_heads * d_head
-        self.norm = Normalize(in_channels)
-        if not use_linear:
-            self.proj_in = nn.Conv2d(
-                in_channels, inner_dim, kernel_size=1, stride=1, padding=0
-            )
-        else:
-            self.proj_in = nn.Linear(in_channels, inner_dim)
-
-        self.transformer_blocks = nn.ModuleList(
-            [
-                BasicTransformerBlock(
-                    inner_dim,
-                    n_heads,
-                    d_head,
-                    dropout=dropout,
-                    context_dim=context_dim[d],
-                    disable_self_attn=disable_self_attn,
-                    attn_mode=attn_type,
-                    checkpoint=use_checkpoint,
-                    sdp_backend=sdp_backend,
-                )
-                for d in range(depth)
-            ]
-        )
-        if not use_linear:
-            self.proj_out = zero_module(
-                nn.Conv2d(inner_dim, in_channels, kernel_size=1, stride=1, padding=0)
-            )
-        else:
-            # self.proj_out = zero_module(nn.Linear(in_channels, inner_dim))
-            self.proj_out = zero_module(nn.Linear(inner_dim, in_channels))
-        self.use_linear = use_linear
-
-    def forward(self, x, context=None):
-        # note: if no context is given, cross-attention defaults to self-attention
-        if not isinstance(context, list):
-            context = [context]
-        b, c, h, w = x.shape
-        x_in = x
-        x = self.norm(x)
-        if not self.use_linear:
-            x = self.proj_in(x)
-        x = rearrange(x, "b c h w -> b (h w) c").contiguous()
-        if self.use_linear:
-            x = self.proj_in(x)
-        for i, block in enumerate(self.transformer_blocks):
-            if i > 0 and len(context) == 1:
-                i = 0  # use same context for each block
-            x = block(x, context=context[i])
-        if self.use_linear:
-            x = self.proj_out(x)
-        x = rearrange(x, "b (h w) c -> b c h w", h=h, w=w).contiguous()
-        if not self.use_linear:
-            x = self.proj_out(x)
-        return x + x_in
diff --git a/videotuna/models/cogvideo_sat/vae_modules/autoencoder.py b/videotuna/models/cogvideo_sat/vae_modules/autoencoder.py
deleted file mode 100644
index 790d63c0..00000000
--- a/videotuna/models/cogvideo_sat/vae_modules/autoencoder.py
+++ /dev/null
@@ -1,684 +0,0 @@
-import logging
-import math
-import re
-from abc import abstractmethod
-from contextlib import contextmanager
-from typing import Any, Dict, List, Optional, Tuple, Union
-
-import pytorch_lightning as pl
-import torch
-import torch.distributed
-from packaging import version
-from sgm.util import (
-    default,
-    get_context_parallel_group,
-    get_context_parallel_group_rank,
-    get_obj_from_str,
-    initialize_context_parallel,
-    instantiate_from_config,
-    is_context_parallel_initialized,
-)
-from vae_modules.cp_enc_dec import _conv_gather, _conv_split
-from vae_modules.ema import LitEma
-
-logpy = logging.getLogger(__name__)
-
-
-class AbstractAutoencoder(pl.LightningModule):
-    """
-    This is the base class for all autoencoders, including image autoencoders, image autoencoders with discriminators,
-    unCLIP models, etc. Hence, it is fairly general, and specific features
-    (e.g. discriminator training, encoding, decoding) must be implemented in subclasses.
-    """
-
-    def __init__(
-        self,
-        ema_decay: Union[None, float] = None,
-        monitor: Union[None, str] = None,
-        input_key: str = "jpg",
-    ):
-        super().__init__()
-
-        self.input_key = input_key
-        self.use_ema = ema_decay is not None
-        if monitor is not None:
-            self.monitor = monitor
-
-        if self.use_ema:
-            self.model_ema = LitEma(self, decay=ema_decay)
-            logpy.info(f"Keeping EMAs of {len(list(self.model_ema.buffers()))}.")
-
-        if version.parse(torch.__version__) >= version.parse("2.0.0"):
-            self.automatic_optimization = False
-
-    def apply_ckpt(self, ckpt: Union[None, str, dict]):
-        if ckpt is None:
-            return
-        if isinstance(ckpt, str):
-            ckpt = {
-                "target": "sgm.modules.checkpoint.CheckpointEngine",
-                "params": {"ckpt_path": ckpt},
-            }
-        engine = instantiate_from_config(ckpt)
-        engine(self)
-
-    @abstractmethod
-    def get_input(self, batch) -> Any:
-        raise NotImplementedError()
-
-    def on_train_batch_end(self, *args, **kwargs):
-        # for EMA computation
-        if self.use_ema:
-            self.model_ema(self)
-
-    @contextmanager
-    def ema_scope(self, context=None):
-        if self.use_ema:
-            self.model_ema.store(self.parameters())
-            self.model_ema.copy_to(self)
-            if context is not None:
-                logpy.info(f"{context}: Switched to EMA weights")
-        try:
-            yield None
-        finally:
-            if self.use_ema:
-                self.model_ema.restore(self.parameters())
-                if context is not None:
-                    logpy.info(f"{context}: Restored training weights")
-
-    @abstractmethod
-    def encode(self, *args, **kwargs) -> torch.Tensor:
-        raise NotImplementedError("encode()-method of abstract base class called")
-
-    @abstractmethod
-    def decode(self, *args, **kwargs) -> torch.Tensor:
-        raise NotImplementedError("decode()-method of abstract base class called")
-
-    def instantiate_optimizer_from_config(self, params, lr, cfg):
-        logpy.info(f"loading >>> {cfg['target']} <<< optimizer from config")
-        return get_obj_from_str(cfg["target"])(
-            params, lr=lr, **cfg.get("params", dict())
-        )
-
-    def configure_optimizers(self) -> Any:
-        raise NotImplementedError()
-
-
-class AutoencodingEngine(AbstractAutoencoder):
-    """
-    Base class for all image autoencoders that we train, like VQGAN or AutoencoderKL
-    (we also restore them explicitly as special cases for legacy reasons).
-    Regularizations such as KL or VQ are moved to the regularizer class.
-    """
-
-    def __init__(
-        self,
-        *args,
-        encoder_config: Dict,
-        decoder_config: Dict,
-        loss_config: Dict,
-        regularizer_config: Dict,
-        optimizer_config: Union[Dict, None] = None,
-        lr_g_factor: float = 1.0,
-        trainable_ae_params: Optional[List[List[str]]] = None,
-        ae_optimizer_args: Optional[List[dict]] = None,
-        trainable_disc_params: Optional[List[List[str]]] = None,
-        disc_optimizer_args: Optional[List[dict]] = None,
-        disc_start_iter: int = 0,
-        diff_boost_factor: float = 3.0,
-        ckpt_engine: Union[None, str, dict] = None,
-        ckpt_path: Optional[str] = None,
-        additional_decode_keys: Optional[List[str]] = None,
-        **kwargs,
-    ):
-        super().__init__(*args, **kwargs)
-        self.automatic_optimization = False  # pytorch lightning
-
-        self.encoder = instantiate_from_config(encoder_config)
-        self.decoder = instantiate_from_config(decoder_config)
-        self.loss = instantiate_from_config(loss_config)
-        self.regularization = instantiate_from_config(regularizer_config)
-        self.optimizer_config = default(
-            optimizer_config, {"target": "torch.optim.Adam"}
-        )
-        self.diff_boost_factor = diff_boost_factor
-        self.disc_start_iter = disc_start_iter
-        self.lr_g_factor = lr_g_factor
-        self.trainable_ae_params = trainable_ae_params
-        if self.trainable_ae_params is not None:
-            self.ae_optimizer_args = default(
-                ae_optimizer_args,
-                [{} for _ in range(len(self.trainable_ae_params))],
-            )
-            assert len(self.ae_optimizer_args) == len(self.trainable_ae_params)
-        else:
-            self.ae_optimizer_args = [{}]  # makes type consistent
-
-        self.trainable_disc_params = trainable_disc_params
-        if self.trainable_disc_params is not None:
-            self.disc_optimizer_args = default(
-                disc_optimizer_args,
-                [{} for _ in range(len(self.trainable_disc_params))],
-            )
-            assert len(self.disc_optimizer_args) == len(self.trainable_disc_params)
-        else:
-            self.disc_optimizer_args = [{}]  # makes type consistent
-
-        if ckpt_path is not None:
-            assert ckpt_engine is None, "Can't set ckpt_engine and ckpt_path"
-            logpy.warning("Checkpoint path is deprecated, use `checkpoint_egnine` instead")
-        self.apply_ckpt(default(ckpt_path, ckpt_engine))
-        self.additional_decode_keys = set(default(additional_decode_keys, []))
-
-    def get_input(self, batch: Dict) -> torch.Tensor:
-        # assuming unified data format, dataloader returns a dict.
-        # image tensors should be scaled to -1 ... 1 and in channels-first
-        # format (e.g., bchw instead if bhwc)
-        return batch[self.input_key]
-
-    def get_autoencoder_params(self) -> list:
-        params = []
-        if hasattr(self.loss, "get_trainable_autoencoder_parameters"):
-            params += list(self.loss.get_trainable_autoencoder_parameters())
-        if hasattr(self.regularization, "get_trainable_parameters"):
-            params += list(self.regularization.get_trainable_parameters())
-        params = params + list(self.encoder.parameters())
-        params = params + list(self.decoder.parameters())
-        return params
-
-    def get_discriminator_params(self) -> list:
-        if hasattr(self.loss, "get_trainable_parameters"):
-            params = list(self.loss.get_trainable_parameters())  # e.g., discriminator
-        else:
-            params = []
-        return params
-
-    def get_last_layer(self):
-        return self.decoder.get_last_layer()
-
-    def encode(
-        self,
-        x: torch.Tensor,
-        return_reg_log: bool = False,
-        unregularized: bool = False,
-        **kwargs,
-    ) -> Union[torch.Tensor, Tuple[torch.Tensor, dict]]:
-        z = self.encoder(x, **kwargs)
-        if unregularized:
-            return z, dict()
-        z, reg_log = self.regularization(z)
-        if return_reg_log:
-            return z, reg_log
-        return z
-
-    def decode(self, z: torch.Tensor, **kwargs) -> torch.Tensor:
-        x = self.decoder(z, **kwargs)
-        return x
-
-    def forward(
-        self, x: torch.Tensor, **additional_decode_kwargs
-    ) -> Tuple[torch.Tensor, torch.Tensor, dict]:
-        z, reg_log = self.encode(x, return_reg_log=True)
-        dec = self.decode(z, **additional_decode_kwargs)
-        return z, dec, reg_log
-
-    def inner_training_step(
-        self, batch: dict, batch_idx: int, optimizer_idx: int = 0
-    ) -> torch.Tensor:
-        x = self.get_input(batch)
-        additional_decode_kwargs = {
-            key: batch[key] for key in self.additional_decode_keys.intersection(batch)
-        }
-        z, xrec, regularization_log = self(x, **additional_decode_kwargs)
-        if hasattr(self.loss, "forward_keys"):
-            extra_info = {
-                "z": z,
-                "optimizer_idx": optimizer_idx,
-                "global_step": self.global_step,
-                "last_layer": self.get_last_layer(),
-                "split": "train",
-                "regularization_log": regularization_log,
-                "autoencoder": self,
-            }
-            extra_info = {k: extra_info[k] for k in self.loss.forward_keys}
-        else:
-            extra_info = dict()
-
-        if optimizer_idx == 0:
-            # autoencode
-            out_loss = self.loss(x, xrec, **extra_info)
-            if isinstance(out_loss, tuple):
-                aeloss, log_dict_ae = out_loss
-            else:
-                # simple loss function
-                aeloss = out_loss
-                log_dict_ae = {"train/loss/rec": aeloss.detach()}
-
-            self.log_dict(
-                log_dict_ae,
-                prog_bar=False,
-                logger=True,
-                on_step=True,
-                on_epoch=True,
-                sync_dist=False,
-            )
-            self.log(
-                "loss",
-                aeloss.mean().detach(),
-                prog_bar=True,
-                logger=False,
-                on_epoch=False,
-                on_step=True,
-            )
-            return aeloss
-        elif optimizer_idx == 1:
-            # discriminator
-            discloss, log_dict_disc = self.loss(x, xrec, **extra_info)
-            # -> discriminator always needs to return a tuple
-            self.log_dict(
-                log_dict_disc, prog_bar=False, logger=True, on_step=True, on_epoch=True
-            )
-            return discloss
-        else:
-            raise NotImplementedError(f"Unknown optimizer {optimizer_idx}")
-
-    def training_step(self, batch: dict, batch_idx: int):
-        opts = self.optimizers()
-        if not isinstance(opts, list):
-            # Non-adversarial case
-            opts = [opts]
-        optimizer_idx = batch_idx % len(opts)
-        if self.global_step < self.disc_start_iter:
-            optimizer_idx = 0
-        opt = opts[optimizer_idx]
-        opt.zero_grad()
-        with opt.toggle_model():
-            loss = self.inner_training_step(
-                batch, batch_idx, optimizer_idx=optimizer_idx
-            )
-            self.manual_backward(loss)
-        opt.step()
-
-    def validation_step(self, batch: dict, batch_idx: int) -> Dict:
-        log_dict = self._validation_step(batch, batch_idx)
-        with self.ema_scope():
-            log_dict_ema = self._validation_step(batch, batch_idx, postfix="_ema")
-            log_dict.update(log_dict_ema)
-        return log_dict
-
-    def _validation_step(self, batch: dict, batch_idx: int, postfix: str = "") -> Dict:
-        x = self.get_input(batch)
-
-        z, xrec, regularization_log = self(x)
-        if hasattr(self.loss, "forward_keys"):
-            extra_info = {
-                "z": z,
-                "optimizer_idx": 0,
-                "global_step": self.global_step,
-                "last_layer": self.get_last_layer(),
-                "split": "val" + postfix,
-                "regularization_log": regularization_log,
-                "autoencoder": self,
-            }
-            extra_info = {k: extra_info[k] for k in self.loss.forward_keys}
-        else:
-            extra_info = dict()
-        out_loss = self.loss(x, xrec, **extra_info)
-        if isinstance(out_loss, tuple):
-            aeloss, log_dict_ae = out_loss
-        else:
-            # simple loss function
-            aeloss = out_loss
-            log_dict_ae = {f"val{postfix}/loss/rec": aeloss.detach()}
-        full_log_dict = log_dict_ae
-
-        if "optimizer_idx" in extra_info:
-            extra_info["optimizer_idx"] = 1
-            discloss, log_dict_disc = self.loss(x, xrec, **extra_info)
-            full_log_dict.update(log_dict_disc)
-        self.log(
-            f"val{postfix}/loss/rec",
-            log_dict_ae[f"val{postfix}/loss/rec"],
-            sync_dist=True,
-        )
-        self.log_dict(full_log_dict, sync_dist=True)
-        return full_log_dict
-
-    def get_param_groups(
-        self, parameter_names: List[List[str]], optimizer_args: List[dict]
-    ) -> Tuple[List[Dict[str, Any]], int]:
-        groups = []
-        num_params = 0
-        for names, args in zip(parameter_names, optimizer_args):
-            params = []
-            for pattern_ in names:
-                pattern_params = []
-                pattern = re.compile(pattern_)
-                for p_name, param in self.named_parameters():
-                    if re.match(pattern, p_name):
-                        pattern_params.append(param)
-                        num_params += param.numel()
-                if len(pattern_params) == 0:
-                    logpy.warning(f"Did not find parameters for pattern {pattern_}")
-                params.extend(pattern_params)
-            groups.append({"params": params, **args})
-        return groups, num_params
-
-    def configure_optimizers(self) -> List[torch.optim.Optimizer]:
-        if self.trainable_ae_params is None:
-            ae_params = self.get_autoencoder_params()
-        else:
-            ae_params, num_ae_params = self.get_param_groups(
-                self.trainable_ae_params, self.ae_optimizer_args
-            )
-            logpy.info(f"Number of trainable autoencoder parameters: {num_ae_params:,}")
-        if self.trainable_disc_params is None:
-            disc_params = self.get_discriminator_params()
-        else:
-            disc_params, num_disc_params = self.get_param_groups(
-                self.trainable_disc_params, self.disc_optimizer_args
-            )
-            logpy.info(
-                f"Number of trainable discriminator parameters: {num_disc_params:,}"
-            )
-        opt_ae = self.instantiate_optimizer_from_config(
-            ae_params,
-            default(self.lr_g_factor, 1.0) * self.learning_rate,
-            self.optimizer_config,
-        )
-        opts = [opt_ae]
-        if len(disc_params) > 0:
-            opt_disc = self.instantiate_optimizer_from_config(
-                disc_params, self.learning_rate, self.optimizer_config
-            )
-            opts.append(opt_disc)
-
-        return opts
-
-    @torch.no_grad()
-    def log_images(
-        self, batch: dict, additional_log_kwargs: Optional[Dict] = None, **kwargs
-    ) -> dict:
-        log = dict()
-        additional_decode_kwargs = {}
-        x = self.get_input(batch)
-        additional_decode_kwargs.update(
-            {key: batch[key] for key in self.additional_decode_keys.intersection(batch)}
-        )
-
-        _, xrec, _ = self(x, **additional_decode_kwargs)
-        log["inputs"] = x
-        log["reconstructions"] = xrec
-        diff = 0.5 * torch.abs(torch.clamp(xrec, -1.0, 1.0) - x)
-        diff.clamp_(0, 1.0)
-        log["diff"] = 2.0 * diff - 1.0
-        # diff_boost shows location of small errors, by boosting their
-        # brightness.
-        log["diff_boost"] = (
-            2.0 * torch.clamp(self.diff_boost_factor * diff, 0.0, 1.0) - 1
-        )
-        if hasattr(self.loss, "log_images"):
-            log.update(self.loss.log_images(x, xrec))
-        with self.ema_scope():
-            _, xrec_ema, _ = self(x, **additional_decode_kwargs)
-            log["reconstructions_ema"] = xrec_ema
-            diff_ema = 0.5 * torch.abs(torch.clamp(xrec_ema, -1.0, 1.0) - x)
-            diff_ema.clamp_(0, 1.0)
-            log["diff_ema"] = 2.0 * diff_ema - 1.0
-            log["diff_boost_ema"] = (
-                2.0 * torch.clamp(self.diff_boost_factor * diff_ema, 0.0, 1.0) - 1
-            )
-        if additional_log_kwargs:
-            additional_decode_kwargs.update(additional_log_kwargs)
-            _, xrec_add, _ = self(x, **additional_decode_kwargs)
-            log_str = "reconstructions-" + "-".join(
-                [f"{key}={additional_log_kwargs[key]}" for key in additional_log_kwargs]
-            )
-            log[log_str] = xrec_add
-        return log
-
-
-class AutoencodingEngineLegacy(AutoencodingEngine):
-    def __init__(self, embed_dim: int, **kwargs):
-        self.max_batch_size = kwargs.pop("max_batch_size", None)
-        ddconfig = kwargs.pop("ddconfig")
-        ckpt_path = kwargs.pop("ckpt_path", None)
-        ckpt_engine = kwargs.pop("ckpt_engine", None)
-        super().__init__(
-            encoder_config={
-                "target": "sgm.modules.diffusionmodules.model.Encoder",
-                "params": ddconfig,
-            },
-            decoder_config={
-                "target": "sgm.modules.diffusionmodules.model.Decoder",
-                "params": ddconfig,
-            },
-            **kwargs,
-        )
-        self.quant_conv = torch.nn.Conv2d(
-            (1 + ddconfig["double_z"]) * ddconfig["z_channels"],
-            (1 + ddconfig["double_z"]) * embed_dim,
-            1,
-        )
-        self.post_quant_conv = torch.nn.Conv2d(embed_dim, ddconfig["z_channels"], 1)
-        self.embed_dim = embed_dim
-
-        self.apply_ckpt(default(ckpt_path, ckpt_engine))
-
-    def get_autoencoder_params(self) -> list:
-        params = super().get_autoencoder_params()
-        return params
-
-    def encode(
-        self, x: torch.Tensor, return_reg_log: bool = False
-    ) -> Union[torch.Tensor, Tuple[torch.Tensor, dict]]:
-        if self.max_batch_size is None:
-            z = self.encoder(x)
-            z = self.quant_conv(z)
-        else:
-            N = x.shape[0]
-            bs = self.max_batch_size
-            n_batches = int(math.ceil(N / bs))
-            z = list()
-            for i_batch in range(n_batches):
-                z_batch = self.encoder(x[i_batch * bs : (i_batch + 1) * bs])
-                z_batch = self.quant_conv(z_batch)
-                z.append(z_batch)
-            z = torch.cat(z, 0)
-
-        z, reg_log = self.regularization(z)
-        if return_reg_log:
-            return z, reg_log
-        return z
-
-    def decode(self, z: torch.Tensor, **decoder_kwargs) -> torch.Tensor:
-        if self.max_batch_size is None:
-            dec = self.post_quant_conv(z)
-            dec = self.decoder(dec, **decoder_kwargs)
-        else:
-            N = z.shape[0]
-            bs = self.max_batch_size
-            n_batches = int(math.ceil(N / bs))
-            dec = list()
-            for i_batch in range(n_batches):
-                dec_batch = self.post_quant_conv(z[i_batch * bs : (i_batch + 1) * bs])
-                dec_batch = self.decoder(dec_batch, **decoder_kwargs)
-                dec.append(dec_batch)
-            dec = torch.cat(dec, 0)
-
-        return dec
-
-
-class AutoencoderKL(AutoencodingEngineLegacy):
-    def __init__(self, **kwargs):
-        if "lossconfig" in kwargs:
-            kwargs["loss_config"] = kwargs.pop("lossconfig")
-        super().__init__(
-            regularizer_config={
-                "target": (
-                    "sgm.modules.autoencoding.regularizers"
-                    ".DiagonalGaussianRegularizer"
-                )
-            },
-            **kwargs,
-        )
-
-
-class IdentityFirstStage(AbstractAutoencoder):
-    def __init__(self, *args, **kwargs):
-        super().__init__(*args, **kwargs)
-
-    def get_input(self, x: Any) -> Any:
-        return x
-
-    def encode(self, x: Any, *args, **kwargs) -> Any:
-        return x
-
-    def decode(self, x: Any, *args, **kwargs) -> Any:
-        return x
-
-
-class VideoAutoencodingEngine(AutoencodingEngine):
-    def __init__(
-        self,
-        ckpt_path: Union[None, str] = None,
-        ignore_keys: Union[Tuple, list] = (),
-        image_video_weights=[1, 1],
-        only_train_decoder=False,
-        context_parallel_size=0,
-        **kwargs,
-    ):
-        super().__init__(**kwargs)
-        self.context_parallel_size = context_parallel_size
-        if ckpt_path is not None:
-            self.init_from_ckpt(ckpt_path, ignore_keys=ignore_keys)
-
-    def log_videos(
-        self, batch: dict, additional_log_kwargs: Optional[Dict] = None, **kwargs
-    ) -> dict:
-        return self.log_images(batch, additional_log_kwargs, **kwargs)
-
-    def get_input(self, batch: dict) -> torch.Tensor:
-        if self.context_parallel_size > 0:
-            if not is_context_parallel_initialized():
-                initialize_context_parallel(self.context_parallel_size)
-
-            batch = batch[self.input_key]
-
-            global_src_rank = (
-                get_context_parallel_group_rank() * self.context_parallel_size
-            )
-            torch.distributed.broadcast(
-                batch, src=global_src_rank, group=get_context_parallel_group()
-            )
-
-            batch = _conv_split(batch, dim=2, kernel_size=1)
-            return batch
-
-        return batch[self.input_key]
-
-    def apply_ckpt(self, ckpt: Union[None, str, dict]):
-        if ckpt is None:
-            return
-        self.init_from_ckpt(ckpt)
-
-    def init_from_ckpt(self, path, ignore_keys=list()):
-        sd = torch.load(path, map_location="cpu")["state_dict"]
-        keys = list(sd.keys())
-        for k in keys:
-            for ik in ignore_keys:
-                if k.startswith(ik):
-                    print("Deleting key {} from state_dict.".format(k))
-                    del sd[k]
-        missing_keys, unexpected_keys = self.load_state_dict(sd, strict=False)
-        print("Missing keys: ", missing_keys)
-        print("Unexpected keys: ", unexpected_keys)
-        print(f"Restored from {path}")
-
-
-class VideoAutoencoderInferenceWrapper(VideoAutoencodingEngine):
-    def __init__(
-        self,
-        cp_size=0,
-        *args,
-        **kwargs,
-    ):
-        self.cp_size = cp_size
-        return super().__init__(*args, **kwargs)
-
-    def encode(
-        self,
-        x: torch.Tensor,
-        return_reg_log: bool = False,
-        unregularized: bool = False,
-        input_cp: bool = False,
-        output_cp: bool = False,
-        use_cp: bool = True,
-    ) -> Union[torch.Tensor, Tuple[torch.Tensor, dict]]:
-        if self.cp_size <= 1:
-            use_cp = False
-        if self.cp_size > 0 and use_cp and not input_cp:
-            if not is_context_parallel_initialized:
-                initialize_context_parallel(self.cp_size)
-
-            global_src_rank = get_context_parallel_group_rank() * self.cp_size
-            torch.distributed.broadcast(
-                x, src=global_src_rank, group=get_context_parallel_group()
-            )
-
-            x = _conv_split(x, dim=2, kernel_size=1)
-
-        if return_reg_log:
-            z, reg_log = super().encode(x, return_reg_log, unregularized, use_cp=use_cp)
-        else:
-            z = super().encode(x, return_reg_log, unregularized, use_cp=use_cp)
-
-        if self.cp_size > 0 and use_cp and not output_cp:
-            z = _conv_gather(z, dim=2, kernel_size=1)
-
-        if return_reg_log:
-            return z, reg_log
-        return z
-
-    def decode(
-        self,
-        z: torch.Tensor,
-        input_cp: bool = False,
-        output_cp: bool = False,
-        use_cp: bool = True,
-        **kwargs,
-    ):
-        if self.cp_size <= 1:
-            use_cp = False
-        if self.cp_size > 0 and use_cp and not input_cp:
-            if not is_context_parallel_initialized:
-                initialize_context_parallel(self.cp_size)
-
-            global_src_rank = get_context_parallel_group_rank() * self.cp_size
-            torch.distributed.broadcast(
-                z, src=global_src_rank, group=get_context_parallel_group()
-            )
-
-            z = _conv_split(z, dim=2, kernel_size=1)
-
-        x = super().decode(z, use_cp=use_cp, **kwargs)
-
-        if self.cp_size > 0 and use_cp and not output_cp:
-            x = _conv_gather(x, dim=2, kernel_size=1)
-        return x
-
-    def forward(
-        self,
-        x: torch.Tensor,
-        input_cp: bool = False,
-        latent_cp: bool = False,
-        output_cp: bool = False,
-        **additional_decode_kwargs,
-    ) -> Tuple[torch.Tensor, torch.Tensor, dict]:
-        z, reg_log = self.encode(
-            x, return_reg_log=True, input_cp=input_cp, output_cp=latent_cp
-        )
-        dec = self.decode(
-            z, input_cp=latent_cp, output_cp=output_cp, **additional_decode_kwargs
-        )
-        return z, dec, reg_log
diff --git a/videotuna/models/cogvideo_sat/vae_modules/cp_enc_dec.py b/videotuna/models/cogvideo_sat/vae_modules/cp_enc_dec.py
deleted file mode 100644
index 6db3e499..00000000
--- a/videotuna/models/cogvideo_sat/vae_modules/cp_enc_dec.py
+++ /dev/null
@@ -1,1070 +0,0 @@
-import math
-
-import numpy as np
-import torch
-import torch.distributed
-import torch.nn as nn
-import torch.nn.functional as F
-from beartype import beartype
-from beartype.typing import List, Optional, Tuple, Union
-from einops import rearrange
-from sgm.util import (
-    get_context_parallel_group,
-    get_context_parallel_group_rank,
-    get_context_parallel_rank,
-    get_context_parallel_world_size,
-)
-from vae_modules.utils import SafeConv3d as Conv3d
-
-
-def cast_tuple(t, length=1):
-    return t if isinstance(t, tuple) else ((t,) * length)
-
-
-def divisible_by(num, den):
-    return (num % den) == 0
-
-
-def is_odd(n):
-    return not divisible_by(n, 2)
-
-
-def exists(v):
-    return v is not None
-
-
-def pair(t):
-    return t if isinstance(t, tuple) else (t, t)
-
-
-def get_timestep_embedding(timesteps, embedding_dim):
-    """
-    This matches the implementation in Denoising Diffusion Probabilistic Models:
-    From Fairseq.
-    Build sinusoidal embeddings.
-    This matches the implementation in tensor2tensor, but differs slightly
-    from the description in Section 3.5 of "Attention Is All You Need".
-    """
-    assert len(timesteps.shape) == 1
-
-    half_dim = embedding_dim // 2
-    emb = math.log(10000) / (half_dim - 1)
-    emb = torch.exp(torch.arange(half_dim, dtype=torch.float32) * -emb)
-    emb = emb.to(device=timesteps.device)
-    emb = timesteps.float()[:, None] * emb[None, :]
-    emb = torch.cat([torch.sin(emb), torch.cos(emb)], dim=1)
-    if embedding_dim % 2 == 1:  # zero pad
-        emb = torch.nn.functional.pad(emb, (0, 1, 0, 0))
-    return emb
-
-
-def nonlinearity(x):
-    # swish
-    return x * torch.sigmoid(x)
-
-
-def leaky_relu(p=0.1):
-    return nn.LeakyReLU(p)
-
-
-def _split(input_, dim):
-    cp_world_size = get_context_parallel_world_size()
-
-    if cp_world_size == 1:
-        return input_
-
-    cp_rank = get_context_parallel_rank()
-
-    inpu_first_frame_ = input_.transpose(0, dim)[:1].transpose(0, dim).contiguous()
-    input_ = input_.transpose(0, dim)[1:].transpose(0, dim).contiguous()
-    dim_size = input_.size()[dim] // cp_world_size
-
-    input_list = torch.split(input_, dim_size, dim=dim)
-    output = input_list[cp_rank]
-
-    if cp_rank == 0:
-        output = torch.cat([inpu_first_frame_, output], dim=dim)
-    output = output.contiguous()
-
-    return output
-
-
-def _gather(input_, dim):
-    cp_world_size = get_context_parallel_world_size()
-
-    # Bypass the function if context parallel is 1
-    if cp_world_size == 1:
-        return input_
-
-    group = get_context_parallel_group()
-    cp_rank = get_context_parallel_rank()
-
-    input_first_frame_ = input_.transpose(0, dim)[:1].transpose(0, dim).contiguous()
-    if cp_rank == 0:
-        input_ = input_.transpose(0, dim)[1:].transpose(0, dim).contiguous()
-
-    tensor_list = [
-        torch.empty_like(torch.cat([input_first_frame_, input_], dim=dim))
-    ] + [torch.empty_like(input_) for _ in range(cp_world_size - 1)]
-
-    if cp_rank == 0:
-        input_ = torch.cat([input_first_frame_, input_], dim=dim)
-
-    tensor_list[cp_rank] = input_
-    torch.distributed.all_gather(tensor_list, input_, group=group)
-
-    output = torch.cat(tensor_list, dim=dim).contiguous()
-
-    # print('out _gather, cp_rank:', cp_rank, 'output_size:', output.shape)
-
-    return output
-
-
-def _conv_split(input_, dim, kernel_size):
-    cp_world_size = get_context_parallel_world_size()
-
-    if cp_world_size == 1:
-        return input_
-
-    cp_rank = get_context_parallel_rank()
-
-    dim_size = (input_.size()[dim] - kernel_size) // cp_world_size
-
-    if cp_rank == 0:
-        output = input_.transpose(dim, 0)[: dim_size + kernel_size].transpose(dim, 0)
-    else:
-        output = input_.transpose(dim, 0)[
-            cp_rank * dim_size + kernel_size : (cp_rank + 1) * dim_size + kernel_size
-        ].transpose(dim, 0)
-    output = output.contiguous()
-
-    return output
-
-
-def _conv_gather(input_, dim, kernel_size):
-    cp_world_size = get_context_parallel_world_size()
-
-    # Bypass the function if context parallel is 1
-    if cp_world_size == 1:
-        return input_
-
-    group = get_context_parallel_group()
-    cp_rank = get_context_parallel_rank()
-    input_first_kernel_ = (
-        input_.transpose(0, dim)[:kernel_size].transpose(0, dim).contiguous()
-    )
-    if cp_rank == 0:
-        input_ = input_.transpose(0, dim)[kernel_size:].transpose(0, dim).contiguous()
-    else:
-        input_ = (
-            input_.transpose(0, dim)[max(kernel_size - 1, 0) :]
-            .transpose(0, dim)
-            .contiguous()
-        )
-
-    tensor_list = [
-        torch.empty_like(torch.cat([input_first_kernel_, input_], dim=dim))
-    ] + [torch.empty_like(input_) for _ in range(cp_world_size - 1)]
-    if cp_rank == 0:
-        input_ = torch.cat([input_first_kernel_, input_], dim=dim)
-
-    tensor_list[cp_rank] = input_
-    torch.distributed.all_gather(tensor_list, input_, group=group)
-
-    # Note: torch.cat already creates a contiguous tensor.
-    output = torch.cat(tensor_list, dim=dim).contiguous()
-
-    # print('out _conv_gather, cp_rank:', cp_rank, 'input_size:', output.shape)
-
-    return output
-
-
-def _pass_from_previous_rank(input_, dim, kernel_size):
-    # Bypass the function if kernel size is 1
-    if kernel_size == 1:
-        return input_
-
-    group = get_context_parallel_group()
-    cp_rank = get_context_parallel_rank()
-    cp_group_rank = get_context_parallel_group_rank()
-    cp_world_size = get_context_parallel_world_size()
-
-    # print('in _pass_from_previous_rank, cp_rank:', cp_rank, 'input_size:', input_.shape)
-
-    global_rank = torch.distributed.get_rank()
-    global_world_size = torch.distributed.get_world_size()
-
-    input_ = input_.transpose(0, dim)
-
-    # pass from last rank
-    send_rank = global_rank + 1
-    recv_rank = global_rank - 1
-    if send_rank % cp_world_size == 0:
-        send_rank -= cp_world_size
-    if recv_rank % cp_world_size == cp_world_size - 1:
-        recv_rank += cp_world_size
-
-    if cp_rank < cp_world_size - 1:
-        req_send = torch.distributed.isend(
-            input_[-kernel_size + 1 :].contiguous(), send_rank, group=group
-        )
-    if cp_rank > 0:
-        recv_buffer = torch.empty_like(input_[-kernel_size + 1 :]).contiguous()
-        req_recv = torch.distributed.irecv(recv_buffer, recv_rank, group=group)
-
-    if cp_rank == 0:
-        input_ = torch.cat([input_[:1]] * (kernel_size - 1) + [input_], dim=0)
-    else:
-        req_recv.wait()
-        input_ = torch.cat([recv_buffer, input_], dim=0)
-
-    input_ = input_.transpose(0, dim).contiguous()
-
-    # print('out _pass_from_previous_rank, cp_rank:', cp_rank, 'input_size:', input_.shape)
-
-    return input_
-
-
-def _fake_cp_pass_from_previous_rank(input_, dim, kernel_size, cache_padding=None):
-    # Bypass the function if kernel size is 1
-    if kernel_size == 1:
-        return input_
-
-    group = get_context_parallel_group()
-    cp_rank = get_context_parallel_rank()
-    cp_group_rank = get_context_parallel_group_rank()
-    cp_world_size = get_context_parallel_world_size()
-
-    # print('in _pass_from_previous_rank, cp_rank:', cp_rank, 'input_size:', input_.shape)
-
-    global_rank = torch.distributed.get_rank()
-    global_world_size = torch.distributed.get_world_size()
-
-    input_ = input_.transpose(0, dim)
-
-    # pass from last rank
-    send_rank = global_rank + 1
-    recv_rank = global_rank - 1
-    if send_rank % cp_world_size == 0:
-        send_rank -= cp_world_size
-    if recv_rank % cp_world_size == cp_world_size - 1:
-        recv_rank += cp_world_size
-
-    recv_buffer = torch.empty_like(input_[-kernel_size + 1 :]).contiguous()
-    if cp_rank < cp_world_size - 1:
-        req_send = torch.distributed.isend(
-            input_[-kernel_size + 1 :].contiguous(), send_rank, group=group
-        )
-    if cp_rank > 0:
-        req_recv = torch.distributed.irecv(recv_buffer, recv_rank, group=group)
-
-    if cp_rank == 0:
-        if cache_padding is not None:
-            input_ = torch.cat(
-                [cache_padding.transpose(0, dim).to(input_.device), input_], dim=0
-            )
-        else:
-            input_ = torch.cat([input_[:1]] * (kernel_size - 1) + [input_], dim=0)
-    else:
-        req_recv.wait()
-        input_ = torch.cat([recv_buffer, input_], dim=0)
-
-    input_ = input_.transpose(0, dim).contiguous()
-    return input_
-
-
-def _drop_from_previous_rank(input_, dim, kernel_size):
-    input_ = input_.transpose(0, dim)[kernel_size - 1 :].transpose(0, dim)
-    return input_
-
-
-class _ConvolutionScatterToContextParallelRegion(torch.autograd.Function):
-    @staticmethod
-    def forward(ctx, input_, dim, kernel_size):
-        ctx.dim = dim
-        ctx.kernel_size = kernel_size
-        return _conv_split(input_, dim, kernel_size)
-
-    @staticmethod
-    def backward(ctx, grad_output):
-        return _conv_gather(grad_output, ctx.dim, ctx.kernel_size), None, None
-
-
-class _ConvolutionGatherFromContextParallelRegion(torch.autograd.Function):
-    @staticmethod
-    def forward(ctx, input_, dim, kernel_size):
-        ctx.dim = dim
-        ctx.kernel_size = kernel_size
-        return _conv_gather(input_, dim, kernel_size)
-
-    @staticmethod
-    def backward(ctx, grad_output):
-        return _conv_split(grad_output, ctx.dim, ctx.kernel_size), None, None
-
-
-class _ConvolutionPassFromPreviousRank(torch.autograd.Function):
-    @staticmethod
-    def forward(ctx, input_, dim, kernel_size):
-        ctx.dim = dim
-        ctx.kernel_size = kernel_size
-        return _pass_from_previous_rank(input_, dim, kernel_size)
-
-    @staticmethod
-    def backward(ctx, grad_output):
-        return (
-            _drop_from_previous_rank(grad_output, ctx.dim, ctx.kernel_size),
-            None,
-            None,
-        )
-
-
-class _FakeCPConvolutionPassFromPreviousRank(torch.autograd.Function):
-    @staticmethod
-    def forward(ctx, input_, dim, kernel_size, cache_padding):
-        ctx.dim = dim
-        ctx.kernel_size = kernel_size
-        return _fake_cp_pass_from_previous_rank(input_, dim, kernel_size, cache_padding)
-
-    @staticmethod
-    def backward(ctx, grad_output):
-        return (
-            _drop_from_previous_rank(grad_output, ctx.dim, ctx.kernel_size),
-            None,
-            None,
-            None,
-        )
-
-
-def conv_scatter_to_context_parallel_region(input_, dim, kernel_size):
-    return _ConvolutionScatterToContextParallelRegion.apply(input_, dim, kernel_size)
-
-
-def conv_gather_from_context_parallel_region(input_, dim, kernel_size):
-    return _ConvolutionGatherFromContextParallelRegion.apply(input_, dim, kernel_size)
-
-
-def conv_pass_from_last_rank(input_, dim, kernel_size):
-    return _ConvolutionPassFromPreviousRank.apply(input_, dim, kernel_size)
-
-
-def fake_cp_pass_from_previous_rank(input_, dim, kernel_size, cache_padding):
-    return _FakeCPConvolutionPassFromPreviousRank.apply(
-        input_, dim, kernel_size, cache_padding
-    )
-
-
-class ContextParallelCausalConv3d(nn.Module):
-    def __init__(
-        self,
-        chan_in,
-        chan_out,
-        kernel_size: Union[int, Tuple[int, int, int]],
-        stride=1,
-        **kwargs,
-    ):
-        super().__init__()
-        kernel_size = cast_tuple(kernel_size, 3)
-
-        time_kernel_size, height_kernel_size, width_kernel_size = kernel_size
-
-        assert is_odd(height_kernel_size) and is_odd(width_kernel_size)
-
-        time_pad = time_kernel_size - 1
-        height_pad = height_kernel_size // 2
-        width_pad = width_kernel_size // 2
-
-        self.height_pad = height_pad
-        self.width_pad = width_pad
-        self.time_pad = time_pad
-        self.time_kernel_size = time_kernel_size
-        self.temporal_dim = 2
-
-        stride = (stride, stride, stride)
-        dilation = (1, 1, 1)
-        self.conv = Conv3d(
-            chan_in, chan_out, kernel_size, stride=stride, dilation=dilation, **kwargs
-        )
-        self.cache_padding = None
-
-    def forward(self, input_, clear_cache=True):
-        input_parallel = fake_cp_pass_from_previous_rank(
-            input_, self.temporal_dim, self.time_kernel_size, self.cache_padding
-        )
-
-        del self.cache_padding
-        self.cache_padding = None
-        if not clear_cache:
-            cp_rank, cp_world_size = (
-                get_context_parallel_rank(),
-                get_context_parallel_world_size(),
-            )
-            global_rank = torch.distributed.get_rank()
-            if cp_world_size == 1:
-                self.cache_padding = (
-                    input_parallel[:, :, -self.time_kernel_size + 1 :]
-                    .contiguous()
-                    .detach()
-                    .clone()
-                    .cpu()
-                )
-            else:
-                if cp_rank == cp_world_size - 1:
-                    torch.distributed.isend(
-                        input_parallel[:, :, -self.time_kernel_size + 1 :].contiguous(),
-                        global_rank + 1 - cp_world_size,
-                        group=get_context_parallel_group(),
-                    )
-                if cp_rank == 0:
-                    recv_buffer = torch.empty_like(
-                        input_parallel[:, :, -self.time_kernel_size + 1 :]
-                    ).contiguous()
-                    torch.distributed.recv(
-                        recv_buffer,
-                        global_rank - 1 + cp_world_size,
-                        group=get_context_parallel_group(),
-                    )
-                    self.cache_padding = recv_buffer.contiguous().detach().clone().cpu()
-
-        padding_2d = (self.width_pad, self.width_pad, self.height_pad, self.height_pad)
-        input_parallel = F.pad(input_parallel, padding_2d, mode="constant", value=0)
-
-        output_parallel = self.conv(input_parallel)
-        output = output_parallel
-        return output
-
-
-class ContextParallelGroupNorm(torch.nn.GroupNorm):
-    def forward(self, input_):
-        gather_flag = input_.shape[2] > 1
-        if gather_flag:
-            input_ = conv_gather_from_context_parallel_region(
-                input_, dim=2, kernel_size=1
-            )
-        output = super().forward(input_)
-        if gather_flag:
-            output = conv_scatter_to_context_parallel_region(
-                output, dim=2, kernel_size=1
-            )
-        return output
-
-
-def Normalize(in_channels, gather=False, **kwargs):
-    if gather:
-        return ContextParallelGroupNorm(
-            num_groups=32, num_channels=in_channels, eps=1e-6, affine=True
-        )
-    else:
-        return torch.nn.GroupNorm(
-            num_groups=32, num_channels=in_channels, eps=1e-6, affine=True
-        )
-
-
-class SpatialNorm3D(nn.Module):
-    def __init__(
-        self,
-        f_channels,
-        zq_channels,
-        freeze_norm_layer=False,
-        add_conv=False,
-        pad_mode="constant",
-        gather=False,
-        **norm_layer_params,
-    ):
-        super().__init__()
-        if gather:
-            self.norm_layer = ContextParallelGroupNorm(
-                num_channels=f_channels, **norm_layer_params
-            )
-        else:
-            self.norm_layer = torch.nn.GroupNorm(
-                num_channels=f_channels, **norm_layer_params
-            )
-        # self.norm_layer = norm_layer(num_channels=f_channels, **norm_layer_params)
-        if freeze_norm_layer:
-            for p in self.norm_layer.parameters:
-                p.requires_grad = False
-
-        self.add_conv = add_conv
-        if add_conv:
-            self.conv = ContextParallelCausalConv3d(
-                chan_in=zq_channels,
-                chan_out=zq_channels,
-                kernel_size=3,
-            )
-
-        self.conv_y = ContextParallelCausalConv3d(
-            chan_in=zq_channels,
-            chan_out=f_channels,
-            kernel_size=1,
-        )
-        self.conv_b = ContextParallelCausalConv3d(
-            chan_in=zq_channels,
-            chan_out=f_channels,
-            kernel_size=1,
-        )
-
-    def forward(self, f, zq, clear_fake_cp_cache=True, fake_cp=True):
-        if f.shape[2] > 1 and get_context_parallel_rank() == 0 and fake_cp:
-            f_first, f_rest = f[:, :, :1], f[:, :, 1:]
-            f_first_size, f_rest_size = f_first.shape[-3:], f_rest.shape[-3:]
-            zq_first, zq_rest = zq[:, :, :1], zq[:, :, 1:]
-            zq_first = torch.nn.functional.interpolate(
-                zq_first, size=f_first_size, mode="nearest"
-            )
-
-            zq_rest_splits = torch.split(zq_rest, 32, dim=1)
-            interpolated_splits = [
-                torch.nn.functional.interpolate(split, size=f_rest_size, mode="nearest")
-                for split in zq_rest_splits
-            ]
-
-            zq_rest = torch.cat(interpolated_splits, dim=1)
-            # zq_rest = torch.nn.functional.interpolate(zq_rest, size=f_rest_size, mode="nearest")
-            zq = torch.cat([zq_first, zq_rest], dim=2)
-        else:
-            f_size = f.shape[-3:]
-
-            zq_splits = torch.split(zq, 32, dim=1)
-            interpolated_splits = [
-                torch.nn.functional.interpolate(split, size=f_size, mode="nearest")
-                for split in zq_splits
-            ]
-            zq = torch.cat(interpolated_splits, dim=1)
-
-        if self.add_conv:
-            zq = self.conv(zq, clear_cache=clear_fake_cp_cache)
-
-        norm_f = self.norm_layer(f)
-        new_f = norm_f * self.conv_y(zq) + self.conv_b(zq)
-        return new_f
-
-
-def Normalize3D(
-    in_channels,
-    zq_ch,
-    add_conv,
-    gather=False,
-):
-    return SpatialNorm3D(
-        in_channels,
-        zq_ch,
-        gather=gather,
-        freeze_norm_layer=False,
-        add_conv=add_conv,
-        num_groups=32,
-        eps=1e-6,
-        affine=True,
-    )
-
-
-class Upsample3D(nn.Module):
-    def __init__(
-        self,
-        in_channels,
-        with_conv,
-        compress_time=False,
-    ):
-        super().__init__()
-        self.with_conv = with_conv
-        if self.with_conv:
-            self.conv = torch.nn.Conv2d(
-                in_channels, in_channels, kernel_size=3, stride=1, padding=1
-            )
-        self.compress_time = compress_time
-
-    def forward(self, x, fake_cp=True):
-        if self.compress_time and x.shape[2] > 1:
-            if get_context_parallel_rank() == 0 and fake_cp:
-                # split first frame
-                x_first, x_rest = x[:, :, 0], x[:, :, 1:]
-                x_first = torch.nn.functional.interpolate(
-                    x_first, scale_factor=2.0, mode="nearest"
-                )
-
-                splits = torch.split(x_rest, 32, dim=1)
-                interpolated_splits = [
-                    torch.nn.functional.interpolate(
-                        split, scale_factor=2.0, mode="nearest"
-                    )
-                    for split in splits
-                ]
-                x_rest = torch.cat(interpolated_splits, dim=1)
-                x = torch.cat([x_first[:, :, None, :, :], x_rest], dim=2)
-            else:
-                splits = torch.split(x, 32, dim=1)
-                interpolated_splits = [
-                    torch.nn.functional.interpolate(
-                        split, scale_factor=2.0, mode="nearest"
-                    )
-                    for split in splits
-                ]
-                x = torch.cat(interpolated_splits, dim=1)
-
-        else:
-            # only interpolate 2D
-            t = x.shape[2]
-            x = rearrange(x, "b c t h w -> (b t) c h w")
-
-            splits = torch.split(x, 32, dim=1)
-            interpolated_splits = [
-                torch.nn.functional.interpolate(split, scale_factor=2.0, mode="nearest")
-                for split in splits
-            ]
-            x = torch.cat(interpolated_splits, dim=1)
-
-            x = rearrange(x, "(b t) c h w -> b c t h w", t=t)
-
-        if self.with_conv:
-            t = x.shape[2]
-            x = rearrange(x, "b c t h w -> (b t) c h w")
-            x = self.conv(x)
-            x = rearrange(x, "(b t) c h w -> b c t h w", t=t)
-        return x
-
-
-class DownSample3D(nn.Module):
-    def __init__(self, in_channels, with_conv, compress_time=False, out_channels=None):
-        super().__init__()
-        self.with_conv = with_conv
-        if out_channels is None:
-            out_channels = in_channels
-        if self.with_conv:
-            # no asymmetric padding in torch conv, must do it ourselves
-            self.conv = torch.nn.Conv2d(
-                in_channels, out_channels, kernel_size=3, stride=2, padding=0
-            )
-        self.compress_time = compress_time
-
-    def forward(self, x, fake_cp=True):
-        if self.compress_time and x.shape[2] > 1:
-            h, w = x.shape[-2:]
-            x = rearrange(x, "b c t h w -> (b h w) c t")
-
-            if get_context_parallel_rank() == 0 and fake_cp:
-                # split first frame
-                x_first, x_rest = x[..., 0], x[..., 1:]
-
-                if x_rest.shape[-1] > 0:
-                    splits = torch.split(x_rest, 32, dim=1)
-                    interpolated_splits = [
-                        torch.nn.functional.avg_pool1d(split, kernel_size=2, stride=2)
-                        for split in splits
-                    ]
-                    x_rest = torch.cat(interpolated_splits, dim=1)
-                x = torch.cat([x_first[..., None], x_rest], dim=-1)
-                x = rearrange(x, "(b h w) c t -> b c t h w", h=h, w=w)
-            else:
-                # x = torch.nn.functional.avg_pool1d(x, kernel_size=2, stride=2)
-                splits = torch.split(x, 32, dim=1)
-                interpolated_splits = [
-                    torch.nn.functional.avg_pool1d(split, kernel_size=2, stride=2)
-                    for split in splits
-                ]
-                x = torch.cat(interpolated_splits, dim=1)
-                x = rearrange(x, "(b h w) c t -> b c t h w", h=h, w=w)
-
-        if self.with_conv:
-            pad = (0, 1, 0, 1)
-            x = torch.nn.functional.pad(x, pad, mode="constant", value=0)
-            t = x.shape[2]
-            x = rearrange(x, "b c t h w -> (b t) c h w")
-            x = self.conv(x)
-            x = rearrange(x, "(b t) c h w -> b c t h w", t=t)
-        else:
-            t = x.shape[2]
-            x = rearrange(x, "b c t h w -> (b t) c h w")
-            x = torch.nn.functional.avg_pool2d(x, kernel_size=2, stride=2)
-            x = rearrange(x, "(b t) c h w -> b c t h w", t=t)
-        return x
-
-
-class ContextParallelResnetBlock3D(nn.Module):
-    def __init__(
-        self,
-        *,
-        in_channels,
-        out_channels=None,
-        conv_shortcut=False,
-        dropout,
-        temb_channels=512,
-        zq_ch=None,
-        add_conv=False,
-        gather_norm=False,
-        normalization=Normalize,
-    ):
-        super().__init__()
-        self.in_channels = in_channels
-        out_channels = in_channels if out_channels is None else out_channels
-        self.out_channels = out_channels
-        self.use_conv_shortcut = conv_shortcut
-
-        self.norm1 = normalization(
-            in_channels,
-            zq_ch=zq_ch,
-            add_conv=add_conv,
-            gather=gather_norm,
-        )
-
-        self.conv1 = ContextParallelCausalConv3d(
-            chan_in=in_channels,
-            chan_out=out_channels,
-            kernel_size=3,
-        )
-        if temb_channels > 0:
-            self.temb_proj = torch.nn.Linear(temb_channels, out_channels)
-        self.norm2 = normalization(
-            out_channels,
-            zq_ch=zq_ch,
-            add_conv=add_conv,
-            gather=gather_norm,
-        )
-        self.dropout = torch.nn.Dropout(dropout)
-        self.conv2 = ContextParallelCausalConv3d(
-            chan_in=out_channels,
-            chan_out=out_channels,
-            kernel_size=3,
-        )
-        if self.in_channels != self.out_channels:
-            if self.use_conv_shortcut:
-                self.conv_shortcut = ContextParallelCausalConv3d(
-                    chan_in=in_channels,
-                    chan_out=out_channels,
-                    kernel_size=3,
-                )
-            else:
-                self.nin_shortcut = Conv3d(
-                    in_channels,
-                    out_channels,
-                    kernel_size=1,
-                    stride=1,
-                    padding=0,
-                )
-
-    def forward(self, x, temb, zq=None, clear_fake_cp_cache=True, fake_cp=True):
-        h = x
-
-        if zq is not None:
-            h = self.norm1(
-                h, zq, clear_fake_cp_cache=clear_fake_cp_cache, fake_cp=fake_cp
-            )
-        else:
-            h = self.norm1(h)
-
-        h = nonlinearity(h)
-        h = self.conv1(h, clear_cache=clear_fake_cp_cache)
-
-        if temb is not None:
-            h = h + self.temb_proj(nonlinearity(temb))[:, :, None, None, None]
-
-        if zq is not None:
-            h = self.norm2(
-                h, zq, clear_fake_cp_cache=clear_fake_cp_cache, fake_cp=fake_cp
-            )
-        else:
-            h = self.norm2(h)
-
-        h = nonlinearity(h)
-        h = self.dropout(h)
-        h = self.conv2(h, clear_cache=clear_fake_cp_cache)
-
-        if self.in_channels != self.out_channels:
-            if self.use_conv_shortcut:
-                x = self.conv_shortcut(x, clear_cache=clear_fake_cp_cache)
-            else:
-                x = self.nin_shortcut(x)
-
-        return x + h
-
-
-class ContextParallelEncoder3D(nn.Module):
-    def __init__(
-        self,
-        *,
-        ch,
-        out_ch,
-        ch_mult=(1, 2, 4, 8),
-        num_res_blocks,
-        attn_resolutions,
-        dropout=0.0,
-        resamp_with_conv=True,
-        in_channels,
-        resolution,
-        z_channels,
-        double_z=True,
-        pad_mode="first",
-        temporal_compress_times=4,
-        gather_norm=False,
-        **ignore_kwargs,
-    ):
-        super().__init__()
-        self.ch = ch
-        self.temb_ch = 0
-        self.num_resolutions = len(ch_mult)
-        self.num_res_blocks = num_res_blocks
-        self.resolution = resolution
-        self.in_channels = in_channels
-
-        # log2 of temporal_compress_times
-        self.temporal_compress_level = int(np.log2(temporal_compress_times))
-
-        self.conv_in = ContextParallelCausalConv3d(
-            chan_in=in_channels,
-            chan_out=self.ch,
-            kernel_size=3,
-        )
-
-        curr_res = resolution
-        in_ch_mult = (1,) + tuple(ch_mult)
-        self.down = nn.ModuleList()
-        for i_level in range(self.num_resolutions):
-            block = nn.ModuleList()
-            attn = nn.ModuleList()
-            block_in = ch * in_ch_mult[i_level]
-            block_out = ch * ch_mult[i_level]
-            for i_block in range(self.num_res_blocks):
-                block.append(
-                    ContextParallelResnetBlock3D(
-                        in_channels=block_in,
-                        out_channels=block_out,
-                        dropout=dropout,
-                        temb_channels=self.temb_ch,
-                        gather_norm=gather_norm,
-                    )
-                )
-                block_in = block_out
-            down = nn.Module()
-            down.block = block
-            down.attn = attn
-            if i_level != self.num_resolutions - 1:
-                if i_level < self.temporal_compress_level:
-                    down.downsample = DownSample3D(
-                        block_in, resamp_with_conv, compress_time=True
-                    )
-                else:
-                    down.downsample = DownSample3D(
-                        block_in, resamp_with_conv, compress_time=False
-                    )
-                curr_res = curr_res // 2
-            self.down.append(down)
-
-        # middle
-        self.mid = nn.Module()
-        self.mid.block_1 = ContextParallelResnetBlock3D(
-            in_channels=block_in,
-            out_channels=block_in,
-            temb_channels=self.temb_ch,
-            dropout=dropout,
-            gather_norm=gather_norm,
-        )
-
-        self.mid.block_2 = ContextParallelResnetBlock3D(
-            in_channels=block_in,
-            out_channels=block_in,
-            temb_channels=self.temb_ch,
-            dropout=dropout,
-            gather_norm=gather_norm,
-        )
-
-        # end
-        self.norm_out = Normalize(block_in, gather=gather_norm)
-
-        self.conv_out = ContextParallelCausalConv3d(
-            chan_in=block_in,
-            chan_out=2 * z_channels if double_z else z_channels,
-            kernel_size=3,
-        )
-
-    def forward(self, x, use_cp=True):
-        global _USE_CP
-        _USE_CP = use_cp
-
-        # timestep embedding
-        temb = None
-
-        # downsampling
-        hs = [self.conv_in(x)]
-        for i_level in range(self.num_resolutions):
-            for i_block in range(self.num_res_blocks):
-                h = self.down[i_level].block[i_block](hs[-1], temb)
-                if len(self.down[i_level].attn) > 0:
-                    h = self.down[i_level].attn[i_block](h)
-                hs.append(h)
-            if i_level != self.num_resolutions - 1:
-                hs.append(self.down[i_level].downsample(hs[-1]))
-
-        # middle
-        h = hs[-1]
-        h = self.mid.block_1(h, temb)
-        h = self.mid.block_2(h, temb)
-
-        # end
-        h = self.norm_out(h)
-        h = nonlinearity(h)
-        h = self.conv_out(h)
-
-        return h
-
-
-class ContextParallelDecoder3D(nn.Module):
-    def __init__(
-        self,
-        *,
-        ch,
-        out_ch,
-        ch_mult=(1, 2, 4, 8),
-        num_res_blocks,
-        attn_resolutions,
-        dropout=0.0,
-        resamp_with_conv=True,
-        in_channels,
-        resolution,
-        z_channels,
-        give_pre_end=False,
-        zq_ch=None,
-        add_conv=False,
-        pad_mode="first",
-        temporal_compress_times=4,
-        gather_norm=False,
-        **ignorekwargs,
-    ):
-        super().__init__()
-        self.ch = ch
-        self.temb_ch = 0
-        self.num_resolutions = len(ch_mult)
-        self.num_res_blocks = num_res_blocks
-        self.resolution = resolution
-        self.in_channels = in_channels
-        self.give_pre_end = give_pre_end
-
-        # log2 of temporal_compress_times
-        self.temporal_compress_level = int(np.log2(temporal_compress_times))
-
-        if zq_ch is None:
-            zq_ch = z_channels
-
-        # compute in_ch_mult, block_in and curr_res at lowest res
-        block_in = ch * ch_mult[self.num_resolutions - 1]
-        curr_res = resolution // 2 ** (self.num_resolutions - 1)
-        self.z_shape = (1, z_channels, curr_res, curr_res)
-
-        self.conv_in = ContextParallelCausalConv3d(
-            chan_in=z_channels,
-            chan_out=block_in,
-            kernel_size=3,
-        )
-
-        # middle
-        self.mid = nn.Module()
-        self.mid.block_1 = ContextParallelResnetBlock3D(
-            in_channels=block_in,
-            out_channels=block_in,
-            temb_channels=self.temb_ch,
-            dropout=dropout,
-            zq_ch=zq_ch,
-            add_conv=add_conv,
-            normalization=Normalize3D,
-            gather_norm=gather_norm,
-        )
-
-        self.mid.block_2 = ContextParallelResnetBlock3D(
-            in_channels=block_in,
-            out_channels=block_in,
-            temb_channels=self.temb_ch,
-            dropout=dropout,
-            zq_ch=zq_ch,
-            add_conv=add_conv,
-            normalization=Normalize3D,
-            gather_norm=gather_norm,
-        )
-
-        # upsampling
-        self.up = nn.ModuleList()
-        for i_level in reversed(range(self.num_resolutions)):
-            block = nn.ModuleList()
-            attn = nn.ModuleList()
-            block_out = ch * ch_mult[i_level]
-            for i_block in range(self.num_res_blocks + 1):
-                block.append(
-                    ContextParallelResnetBlock3D(
-                        in_channels=block_in,
-                        out_channels=block_out,
-                        temb_channels=self.temb_ch,
-                        dropout=dropout,
-                        zq_ch=zq_ch,
-                        add_conv=add_conv,
-                        normalization=Normalize3D,
-                        gather_norm=gather_norm,
-                    )
-                )
-                block_in = block_out
-            up = nn.Module()
-            up.block = block
-            up.attn = attn
-            if i_level != 0:
-                if i_level < self.num_resolutions - self.temporal_compress_level:
-                    up.upsample = Upsample3D(
-                        block_in, with_conv=resamp_with_conv, compress_time=False
-                    )
-                else:
-                    up.upsample = Upsample3D(
-                        block_in, with_conv=resamp_with_conv, compress_time=True
-                    )
-            self.up.insert(0, up)
-
-        self.norm_out = Normalize3D(
-            block_in, zq_ch, add_conv=add_conv, gather=gather_norm
-        )
-
-        self.conv_out = ContextParallelCausalConv3d(
-            chan_in=block_in,
-            chan_out=out_ch,
-            kernel_size=3,
-        )
-
-    def forward(self, z, clear_fake_cp_cache=True, use_cp=True):
-        global _USE_CP
-        _USE_CP = use_cp
-        self.last_z_shape = z.shape
-
-        # timestep embedding
-        temb = None
-
-        t = z.shape[2]
-        # z to block_in
-
-        zq = z
-        h = self.conv_in(z, clear_cache=clear_fake_cp_cache)
-
-        # middle
-        h = self.mid.block_1(
-            h, temb, zq, clear_fake_cp_cache=clear_fake_cp_cache, fake_cp=use_cp
-        )
-        h = self.mid.block_2(
-            h, temb, zq, clear_fake_cp_cache=clear_fake_cp_cache, fake_cp=use_cp
-        )
-
-        # upsampling
-        for i_level in reversed(range(self.num_resolutions)):
-            for i_block in range(self.num_res_blocks + 1):
-                h = self.up[i_level].block[i_block](
-                    h, temb, zq, clear_fake_cp_cache=clear_fake_cp_cache, fake_cp=use_cp
-                )
-                if len(self.up[i_level].attn) > 0:
-                    h = self.up[i_level].attn[i_block](h, zq)
-            if i_level != 0:
-                h = self.up[i_level].upsample(h, fake_cp=use_cp)
-
-        # end
-        if self.give_pre_end:
-            return h
-
-        h = self.norm_out(
-            h, zq, clear_fake_cp_cache=clear_fake_cp_cache, fake_cp=use_cp
-        )
-        h = nonlinearity(h)
-        h = self.conv_out(h, clear_cache=clear_fake_cp_cache)
-
-        return h
-
-    def get_last_layer(self):
-        return self.conv_out.conv.weight
diff --git a/videotuna/models/cogvideo_sat/vae_modules/ema.py b/videotuna/models/cogvideo_sat/vae_modules/ema.py
deleted file mode 100644
index 96f64345..00000000
--- a/videotuna/models/cogvideo_sat/vae_modules/ema.py
+++ /dev/null
@@ -1,88 +0,0 @@
-import torch
-from torch import nn
-
-
-class LitEma(nn.Module):
-    def __init__(self, model, decay=0.9999, use_num_upates=True):
-        super().__init__()
-        if decay < 0.0 or decay > 1.0:
-            raise ValueError("Decay must be between 0 and 1")
-
-        self.m_name2s_name = {}
-        self.register_buffer("decay", torch.tensor(decay, dtype=torch.float32))
-        self.register_buffer(
-            "num_updates",
-            (
-                torch.tensor(0, dtype=torch.int)
-                if use_num_upates
-                else torch.tensor(-1, dtype=torch.int)
-            ),
-        )
-
-        for name, p in model.named_parameters():
-            if p.requires_grad:
-                # remove as '.'-character is not allowed in buffers
-                s_name = name.replace(".", "")
-                self.m_name2s_name.update({name: s_name})
-                self.register_buffer(s_name, p.clone().detach().data)
-
-        self.collected_params = []
-
-    def reset_num_updates(self):
-        del self.num_updates
-        self.register_buffer("num_updates", torch.tensor(0, dtype=torch.int))
-
-    def forward(self, model):
-        decay = self.decay
-
-        if self.num_updates >= 0:
-            self.num_updates += 1
-            decay = min(self.decay, (1 + self.num_updates) / (10 + self.num_updates))
-
-        one_minus_decay = 1.0 - decay
-
-        with torch.no_grad():
-            m_param = dict(model.named_parameters())
-            shadow_params = dict(self.named_buffers())
-
-            for key in m_param:
-                if m_param[key].requires_grad:
-                    sname = self.m_name2s_name[key]
-                    shadow_params[sname] = shadow_params[sname].type_as(m_param[key])
-                    shadow_params[sname].sub_(
-                        one_minus_decay * (shadow_params[sname] - m_param[key])
-                    )
-                else:
-                    assert not key in self.m_name2s_name
-
-    def copy_to(self, model):
-        m_param = dict(model.named_parameters())
-        shadow_params = dict(self.named_buffers())
-        for key in m_param:
-            if m_param[key].requires_grad:
-                m_param[key].data.copy_(shadow_params[self.m_name2s_name[key]].data)
-            else:
-                assert not key in self.m_name2s_name
-
-    def store(self, parameters):
-        """
-        Save the current parameters for restoring later.
-        Args:
-          parameters: Iterable of `torch.nn.Parameter`; the parameters to be
-            temporarily stored.
-        """
-        self.collected_params = [param.clone() for param in parameters]
-
-    def restore(self, parameters):
-        """
-        Restore the parameters stored with the `store` method.
-        Useful to validate the model with EMA parameters without affecting the
-        original optimization process. Store the parameters before the
-        `copy_to` method. After validation (or model saving), use this to
-        restore the former parameters.
-        Args:
-          parameters: Iterable of `torch.nn.Parameter`; the parameters to be
-            updated with the stored parameters.
-        """
-        for c_param, param in zip(self.collected_params, parameters):
-            param.data.copy_(c_param.data)
diff --git a/videotuna/models/cogvideo_sat/vae_modules/regularizers.py b/videotuna/models/cogvideo_sat/vae_modules/regularizers.py
deleted file mode 100644
index 7b77ed51..00000000
--- a/videotuna/models/cogvideo_sat/vae_modules/regularizers.py
+++ /dev/null
@@ -1,114 +0,0 @@
-from abc import abstractmethod
-from typing import Any, Tuple
-
-import numpy as np
-import torch
-import torch.nn.functional as F
-from torch import nn
-
-
-class DiagonalGaussianDistribution(object):
-    def __init__(self, parameters, deterministic=False):
-        self.parameters = parameters
-        self.mean, self.logvar = torch.chunk(parameters, 2, dim=1)
-        self.logvar = torch.clamp(self.logvar, -30.0, 20.0)
-        self.deterministic = deterministic
-        self.std = torch.exp(0.5 * self.logvar)
-        self.var = torch.exp(self.logvar)
-        if self.deterministic:
-            self.var = self.std = torch.zeros_like(self.mean).to(
-                device=self.parameters.device
-            )
-
-    def sample(self):
-        # x = self.mean + self.std * torch.randn(self.mean.shape).to(
-        #     device=self.parameters.device
-        # )
-        x = self.mean + self.std * torch.randn_like(self.mean)
-        return x
-
-    def kl(self, other=None):
-        if self.deterministic:
-            return torch.Tensor([0.0])
-        else:
-            if other is None:
-                return 0.5 * torch.sum(
-                    torch.pow(self.mean, 2) + self.var - 1.0 - self.logvar,
-                    dim=[1, 2, 3],
-                )
-            else:
-                return 0.5 * torch.sum(
-                    torch.pow(self.mean - other.mean, 2) / other.var
-                    + self.var / other.var
-                    - 1.0
-                    - self.logvar
-                    + other.logvar,
-                    dim=[1, 2, 3],
-                )
-
-    def nll(self, sample, dims=[1, 2, 3]):
-        if self.deterministic:
-            return torch.Tensor([0.0])
-        logtwopi = np.log(2.0 * np.pi)
-        return 0.5 * torch.sum(
-            logtwopi + self.logvar + torch.pow(sample - self.mean, 2) / self.var,
-            dim=dims,
-        )
-
-    def mode(self):
-        return self.mean
-
-
-class AbstractRegularizer(nn.Module):
-    def __init__(self):
-        super().__init__()
-
-    def forward(self, z: torch.Tensor) -> Tuple[torch.Tensor, dict]:
-        raise NotImplementedError()
-
-    @abstractmethod
-    def get_trainable_parameters(self) -> Any:
-        raise NotImplementedError()
-
-
-class IdentityRegularizer(AbstractRegularizer):
-    def forward(self, z: torch.Tensor) -> Tuple[torch.Tensor, dict]:
-        return z, dict()
-
-    def get_trainable_parameters(self) -> Any:
-        yield from ()
-
-
-def measure_perplexity(
-    predicted_indices: torch.Tensor, num_centroids: int
-) -> Tuple[torch.Tensor, torch.Tensor]:
-    # videotuna: https://github.com/karpathy/deep-vector-quantization/blob/main/model.py
-    # eval cluster perplexity. when perplexity == num_embeddings then all clusters are used exactly equally
-    encodings = (
-        F.one_hot(predicted_indices, num_centroids).float().reshape(-1, num_centroids)
-    )
-    avg_probs = encodings.mean(0)
-    perplexity = (-(avg_probs * torch.log(avg_probs + 1e-10)).sum()).exp()
-    cluster_use = torch.sum(avg_probs > 0)
-    return perplexity, cluster_use
-
-
-class DiagonalGaussianRegularizer(AbstractRegularizer):
-    def __init__(self, sample: bool = True):
-        super().__init__()
-        self.sample = sample
-
-    def get_trainable_parameters(self) -> Any:
-        yield from ()
-
-    def forward(self, z: torch.Tensor) -> Tuple[torch.Tensor, dict]:
-        log = dict()
-        posterior = DiagonalGaussianDistribution(z)
-        if self.sample:
-            z = posterior.sample()
-        else:
-            z = posterior.mode()
-        kl_loss = posterior.kl()
-        kl_loss = torch.sum(kl_loss) / kl_loss.shape[0]
-        log["kl_loss"] = kl_loss
-        return z, log
diff --git a/videotuna/models/cogvideo_sat/vae_modules/utils.py b/videotuna/models/cogvideo_sat/vae_modules/utils.py
deleted file mode 100644
index a52d94d6..00000000
--- a/videotuna/models/cogvideo_sat/vae_modules/utils.py
+++ /dev/null
@@ -1,424 +0,0 @@
-import functools
-import importlib
-import os
-from functools import partial
-from inspect import isfunction
-
-import fsspec
-import numpy as np
-import torch
-import torch.distributed
-from PIL import Image, ImageDraw, ImageFont
-from safetensors.torch import load_file as load_safetensors
-
-_CONTEXT_PARALLEL_GROUP = None
-_CONTEXT_PARALLEL_SIZE = None
-
-
-def is_context_parallel_initialized():
-    if _CONTEXT_PARALLEL_GROUP is None:
-        return False
-    else:
-        return True
-
-
-def initialize_context_parallel(context_parallel_size):
-    global _CONTEXT_PARALLEL_GROUP
-    global _CONTEXT_PARALLEL_SIZE
-
-    assert (
-        _CONTEXT_PARALLEL_GROUP is None
-    ), "context parallel group is already initialized"
-    _CONTEXT_PARALLEL_SIZE = context_parallel_size
-
-    rank = torch.distributed.get_rank()
-    world_size = torch.distributed.get_world_size()
-
-    for i in range(0, world_size, context_parallel_size):
-        ranks = range(i, i + context_parallel_size)
-        group = torch.distributed.new_group(ranks)
-        if rank in ranks:
-            _CONTEXT_PARALLEL_GROUP = group
-            break
-
-
-def get_context_parallel_group():
-    assert (
-        _CONTEXT_PARALLEL_GROUP is not None
-    ), "context parallel group is not initialized"
-
-    return _CONTEXT_PARALLEL_GROUP
-
-
-def get_context_parallel_world_size():
-    assert (
-        _CONTEXT_PARALLEL_SIZE is not None
-    ), "context parallel size is not initialized"
-
-    return _CONTEXT_PARALLEL_SIZE
-
-
-def get_context_parallel_rank():
-    assert (
-        _CONTEXT_PARALLEL_SIZE is not None
-    ), "context parallel size is not initialized"
-
-    rank = torch.distributed.get_rank()
-    cp_rank = rank % _CONTEXT_PARALLEL_SIZE
-    return cp_rank
-
-
-def get_context_parallel_group_rank():
-    assert (
-        _CONTEXT_PARALLEL_SIZE is not None
-    ), "context parallel size is not initialized"
-
-    rank = torch.distributed.get_rank()
-    cp_group_rank = rank // _CONTEXT_PARALLEL_SIZE
-
-    return cp_group_rank
-
-
-class SafeConv3d(torch.nn.Conv3d):
-    def forward(self, input):
-        memory_count = torch.prod(torch.tensor(input.shape)).item() * 2 / 1024**3
-        if memory_count > 2:
-            kernel_size = self.kernel_size[0]
-            part_num = int(memory_count / 2) + 1
-            input_chunks = torch.chunk(input, part_num, dim=2)  # NCTHW
-            if kernel_size > 1:
-                input_chunks = [input_chunks[0]] + [
-                    torch.cat(
-                        (
-                            input_chunks[i - 1][:, :, -kernel_size + 1 :],
-                            input_chunks[i],
-                        ),
-                        dim=2,
-                    )
-                    for i in range(1, len(input_chunks))
-                ]
-
-            output_chunks = []
-            for input_chunk in input_chunks:
-                output_chunks.append(super(SafeConv3d, self).forward(input_chunk))
-            output = torch.cat(output_chunks, dim=2)
-            return output
-        else:
-            return super(SafeConv3d, self).forward(input)
-
-
-def disabled_train(self, mode=True):
-    """Overwrite model.train with this function to make sure train/eval mode
-    does not change anymore."""
-    return self
-
-
-def get_string_from_tuple(s):
-    try:
-        # Check if the string starts and ends with parentheses
-        if s[0] == "(" and s[-1] == ")":
-            # Convert the string to a tuple
-            t = eval(s)
-            # Check if the type of t is tuple
-            if type(t) == tuple:
-                return t[0]
-            else:
-                pass
-    except:
-        pass
-    return s
-
-
-def is_power_of_two(n):
-    """
-    chat.openai.com/chat
-    Return True if n is a power of 2, otherwise return False.
-
-    The function is_power_of_two takes an integer n as input and returns True if n is a power of 2, otherwise it returns False.
-    The function works by first checking if n is less than or equal to 0. If n is less than or equal to 0, it can't be a power of 2, so the function returns False.
-    If n is greater than 0, the function checks whether n is a power of 2 by using a bitwise AND operation between n and n-1. If n is a power of 2, then it will have only one bit set to 1 in its binary representation. When we subtract 1 from a power of 2, all the bits to the right of that bit become 1, and the bit itself becomes 0. So, when we perform a bitwise AND between n and n-1, we get 0 if n is a power of 2, and a non-zero value otherwise.
-    Thus, if the result of the bitwise AND operation is 0, then n is a power of 2 and the function returns True. Otherwise, the function returns False.
-
-    """
-    if n <= 0:
-        return False
-    return (n & (n - 1)) == 0
-
-
-def autocast(f, enabled=True):
-    def do_autocast(*args, **kwargs):
-        with torch.cuda.amp.autocast(
-            enabled=enabled,
-            dtype=torch.get_autocast_gpu_dtype(),
-            cache_enabled=torch.is_autocast_cache_enabled(),
-        ):
-            return f(*args, **kwargs)
-
-    return do_autocast
-
-
-def load_partial_from_config(config):
-    return partial(get_obj_from_str(config["target"]), **config.get("params", dict()))
-
-
-def log_txt_as_img(wh, xc, size=10):
-    # wh a tuple of (width, height)
-    # xc a list of captions to plot
-    b = len(xc)
-    txts = list()
-    for bi in range(b):
-        txt = Image.new("RGB", wh, color="white")
-        draw = ImageDraw.Draw(txt)
-        font = ImageFont.truetype("data/DejaVuSans.ttf", size=size)
-        nc = int(40 * (wh[0] / 256))
-        if isinstance(xc[bi], list):
-            text_seq = xc[bi][0]
-        else:
-            text_seq = xc[bi]
-        lines = "\n".join(
-            text_seq[start : start + nc] for start in range(0, len(text_seq), nc)
-        )
-
-        try:
-            draw.text((0, 0), lines, fill="black", font=font)
-        except UnicodeEncodeError:
-            print("Cant encode string for logging. Skipping.")
-
-        txt = np.array(txt).transpose(2, 0, 1) / 127.5 - 1.0
-        txts.append(txt)
-    txts = np.stack(txts)
-    txts = torch.tensor(txts)
-    return txts
-
-
-def partialclass(cls, *args, **kwargs):
-    class NewCls(cls):
-        __init__ = functools.partialmethod(cls.__init__, *args, **kwargs)
-
-    return NewCls
-
-
-def make_path_absolute(path):
-    fs, p = fsspec.core.url_to_fs(path)
-    if fs.protocol == "file":
-        return os.path.abspath(p)
-    return path
-
-
-def ismap(x):
-    if not isinstance(x, torch.Tensor):
-        return False
-    return (len(x.shape) == 4) and (x.shape[1] > 3)
-
-
-def isimage(x):
-    if not isinstance(x, torch.Tensor):
-        return False
-    return (len(x.shape) == 4) and (x.shape[1] == 3 or x.shape[1] == 1)
-
-
-def isheatmap(x):
-    if not isinstance(x, torch.Tensor):
-        return False
-
-    return x.ndim == 2
-
-
-def isneighbors(x):
-    if not isinstance(x, torch.Tensor):
-        return False
-    return x.ndim == 5 and (x.shape[2] == 3 or x.shape[2] == 1)
-
-
-def exists(x):
-    return x is not None
-
-
-def expand_dims_like(x, y):
-    while x.dim() != y.dim():
-        x = x.unsqueeze(-1)
-    return x
-
-
-def default(val, d):
-    if exists(val):
-        return val
-    return d() if isfunction(d) else d
-
-
-def mean_flat(tensor):
-    """
-    https://github.com/openai/guided-diffusion/blob/27c20a8fab9cb472df5d6bdd6c8d11c8f430b924/guided_diffusion/nn.py#L86
-    Take the mean over all non-batch dimensions.
-    """
-    return tensor.mean(dim=list(range(1, len(tensor.shape))))
-
-
-def count_params(model, verbose=False):
-    total_params = sum(p.numel() for p in model.parameters())
-    if verbose:
-        print(f"{model.__class__.__name__} has {total_params * 1.e-6:.2f} M params.")
-    return total_params
-
-
-def instantiate_from_config(config):
-    if not "target" in config:
-        if config == "__is_first_stage__":
-            return None
-        elif config == "__is_unconditional__":
-            return None
-        raise KeyError("Expected key `target` to instantiate.")
-    return get_obj_from_str(config["target"])(**config.get("params", dict()))
-
-
-def get_obj_from_str(string, reload=False, invalidate_cache=True):
-    module, cls = string.rsplit(".", 1)
-    if invalidate_cache:
-        importlib.invalidate_caches()
-    if reload:
-        module_imp = importlib.import_module(module)
-        importlib.reload(module_imp)
-    return getattr(importlib.import_module(module, package=None), cls)
-
-
-def append_zero(x):
-    return torch.cat([x, x.new_zeros([1])])
-
-
-def append_dims(x, target_dims):
-    """Appends dimensions to the end of a tensor until it has target_dims dimensions."""
-    dims_to_append = target_dims - x.ndim
-    if dims_to_append < 0:
-        raise ValueError(
-            f"input has {x.ndim} dims but target_dims is {target_dims}, which is less"
-        )
-    return x[(...,) + (None,) * dims_to_append]
-
-
-def load_model_from_config(config, ckpt, verbose=True, freeze=True):
-    print(f"Loading model from {ckpt}")
-    if ckpt.endswith("ckpt"):
-        pl_sd = torch.load(ckpt, map_location="cpu")
-        if "global_step" in pl_sd:
-            print(f"Global Step: {pl_sd['global_step']}")
-        sd = pl_sd["state_dict"]
-    elif ckpt.endswith("safetensors"):
-        sd = load_safetensors(ckpt)
-    else:
-        raise NotImplementedError
-
-    model = instantiate_from_config(config.model)
-
-    m, u = model.load_state_dict(sd, strict=False)
-
-    if len(m) > 0 and verbose:
-        print("missing keys:")
-        print(m)
-    if len(u) > 0 and verbose:
-        print("unexpected keys:")
-        print(u)
-
-    if freeze:
-        for param in model.parameters():
-            param.requires_grad = False
-
-    model.eval()
-    return model
-
-
-def get_configs_path() -> str:
-    """
-    Get the `configs` directory.
-    For a working copy, this is the one in the root of the repository,
-    but for an installed copy, it's in the `sgm` package (see pyproject.toml).
-    """
-    this_dir = os.path.dirname(__file__)
-    candidates = (
-        os.path.join(this_dir, "configs"),
-        os.path.join(this_dir, "..", "configs"),
-    )
-    for candidate in candidates:
-        candidate = os.path.abspath(candidate)
-        if os.path.isdir(candidate):
-            return candidate
-    raise FileNotFoundError(f"Could not find SGM configs in {candidates}")
-
-
-def get_nested_attribute(obj, attribute_path, depth=None, return_key=False):
-    """
-    Will return the result of a recursive get attribute call.
-    E.g.:
-        a.b.c
-        = getattr(getattr(a, "b"), "c")
-        = get_nested_attribute(a, "b.c")
-    If any part of the attribute call is an integer x with current obj a, will
-    try to call a[x] instead of a.x first.
-    """
-    attributes = attribute_path.split(".")
-    if depth is not None and depth > 0:
-        attributes = attributes[:depth]
-    assert len(attributes) > 0, "At least one attribute should be selected"
-    current_attribute = obj
-    current_key = None
-    for level, attribute in enumerate(attributes):
-        current_key = ".".join(attributes[: level + 1])
-        try:
-            id_ = int(attribute)
-            current_attribute = current_attribute[id_]
-        except ValueError:
-            current_attribute = getattr(current_attribute, attribute)
-
-    return (current_attribute, current_key) if return_key else current_attribute
-
-
-def checkpoint(func, inputs, params, flag):
-    """
-    Evaluate a function without caching intermediate activations, allowing for
-    reduced memory at the expense of extra compute in the backward pass.
-    :param func: the function to evaluate.
-    :param inputs: the argument sequence to pass to `func`.
-    :param params: a sequence of parameters `func` depends on but does not
-                   explicitly take as arguments.
-    :param flag: if False, disable gradient checkpointing.
-    """
-    if flag:
-        args = tuple(inputs) + tuple(params)
-        return CheckpointFunction.apply(func, len(inputs), *args)
-    else:
-        return func(*inputs)
-
-
-class CheckpointFunction(torch.autograd.Function):
-    @staticmethod
-    def forward(ctx, run_function, length, *args):
-        ctx.run_function = run_function
-        ctx.input_tensors = list(args[:length])
-        ctx.input_params = list(args[length:])
-        ctx.gpu_autocast_kwargs = {
-            "enabled": torch.is_autocast_enabled(),
-            "dtype": torch.get_autocast_gpu_dtype(),
-            "cache_enabled": torch.is_autocast_cache_enabled(),
-        }
-        with torch.no_grad():
-            output_tensors = ctx.run_function(*ctx.input_tensors)
-        return output_tensors
-
-    @staticmethod
-    def backward(ctx, *output_grads):
-        ctx.input_tensors = [x.detach().requires_grad_(True) for x in ctx.input_tensors]
-        with torch.enable_grad(), torch.cuda.amp.autocast(**ctx.gpu_autocast_kwargs):
-            # Fixes a bug where the first op in run_function modifies the
-            # Tensor storage in place, which is not allowed for detach()'d
-            # Tensors.
-            shallow_copies = [x.view_as(x) for x in ctx.input_tensors]
-            output_tensors = ctx.run_function(*shallow_copies)
-        input_grads = torch.autograd.grad(
-            output_tensors,
-            ctx.input_tensors + ctx.input_params,
-            output_grads,
-            allow_unused=True,
-        )
-        del ctx.input_tensors
-        del ctx.input_params
-        del output_tensors
-        return (None, None) + input_grads
diff --git a/videotuna/models/flux/__init__.py b/videotuna/models/flux/__init__.py
deleted file mode 100644
index 43c365a4..00000000
--- a/videotuna/models/flux/__init__.py
+++ /dev/null
@@ -1,11 +0,0 @@
-try:
-    from ._version import version as __version__  # type: ignore
-    from ._version import version_tuple
-except ImportError:
-    __version__ = "unknown (no version information available)"
-    version_tuple = (0, 0, "unknown", "noinfo")
-
-from pathlib import Path
-
-PACKAGE = __package__.replace("_", "-")
-PACKAGE_ROOT = Path(__file__).parent
diff --git a/videotuna/models/flux/__main__.py b/videotuna/models/flux/__main__.py
deleted file mode 100644
index d5cf0fd2..00000000
--- a/videotuna/models/flux/__main__.py
+++ /dev/null
@@ -1,4 +0,0 @@
-from .cli import app
-
-if __name__ == "__main__":
-    app()
diff --git a/videotuna/models/flux/api.py b/videotuna/models/flux/api.py
deleted file mode 100644
index e00cc768..00000000
--- a/videotuna/models/flux/api.py
+++ /dev/null
@@ -1,200 +0,0 @@
-import io
-import os
-import time
-from pathlib import Path
-
-import requests
-from PIL import Image
-
-API_ENDPOINT = "https://api.bfl.ml"
-
-
-class ApiException(Exception):
-    def __init__(self, status_code: int, detail: str | list[dict] | None = None):
-        super().__init__()
-        self.detail = detail
-        self.status_code = status_code
-
-    def __str__(self) -> str:
-        return self.__repr__()
-
-    def __repr__(self) -> str:
-        if self.detail is None:
-            message = None
-        elif isinstance(self.detail, str):
-            message = self.detail
-        else:
-            message = "[" + ",".join(d["msg"] for d in self.detail) + "]"
-        return f"ApiException({self.status_code=}, {message=}, detail={self.detail})"
-
-
-class ImageRequest:
-    def __init__(
-        self,
-        prompt: str,
-        width: int = 1024,
-        height: int = 1024,
-        name: str = "flux.1-pro",
-        num_steps: int = 50,
-        prompt_upsampling: bool = False,
-        seed: int | None = None,
-        validate: bool = True,
-        launch: bool = True,
-        api_key: str | None = None,
-    ):
-        """
-        Manages an image generation request to the API.
-
-        Args:
-            prompt: Prompt to sample
-            width: Width of the image in pixel
-            height: Height of the image in pixel
-            name: Name of the model
-            num_steps: Number of network evaluations
-            prompt_upsampling: Use prompt upsampling
-            seed: Fix the generation seed
-            validate: Run input validation
-            launch: Directly launches request
-            api_key: Your API key if not provided by the environment
-
-        Raises:
-            ValueError: For invalid input
-            ApiException: For errors raised from the API
-        """
-        if validate:
-            if name not in ["flux.1-pro"]:
-                raise ValueError(f"Invalid model {name}")
-            elif width % 32 != 0:
-                raise ValueError(f"width must be divisible by 32, got {width}")
-            elif not (256 <= width <= 1440):
-                raise ValueError(f"width must be between 256 and 1440, got {width}")
-            elif height % 32 != 0:
-                raise ValueError(f"height must be divisible by 32, got {height}")
-            elif not (256 <= height <= 1440):
-                raise ValueError(f"height must be between 256 and 1440, got {height}")
-            elif not (1 <= num_steps <= 50):
-                raise ValueError(f"steps must be between 1 and 50, got {num_steps}")
-
-        self.request_json = {
-            "prompt": prompt,
-            "width": width,
-            "height": height,
-            "variant": name,
-            "steps": num_steps,
-            "prompt_upsampling": prompt_upsampling,
-        }
-        if seed is not None:
-            self.request_json["seed"] = seed
-
-        self.request_id: str | None = None
-        self.result: dict | None = None
-        self._image_bytes: bytes | None = None
-        self._url: str | None = None
-        if api_key is None:
-            self.api_key = os.environ.get("BFL_API_KEY")
-        else:
-            self.api_key = api_key
-
-        if launch:
-            self.request()
-
-    def request(self):
-        """
-        Request to generate the image.
-        """
-        if self.request_id is not None:
-            return
-        response = requests.post(
-            f"{API_ENDPOINT}/v1/image",
-            headers={
-                "accept": "application/json",
-                "x-key": self.api_key,
-                "Content-Type": "application/json",
-            },
-            json=self.request_json,
-        )
-        result = response.json()
-        if response.status_code != 200:
-            raise ApiException(
-                status_code=response.status_code, detail=result.get("detail")
-            )
-        self.request_id = response.json()["id"]
-
-    def retrieve(self) -> dict:
-        """
-        Wait for the generation to finish and retrieve response.
-        """
-        if self.request_id is None:
-            self.request()
-        while self.result is None:
-            response = requests.get(
-                f"{API_ENDPOINT}/v1/get_result",
-                headers={
-                    "accept": "application/json",
-                    "x-key": self.api_key,
-                },
-                params={
-                    "id": self.request_id,
-                },
-            )
-            result = response.json()
-            if "status" not in result:
-                raise ApiException(
-                    status_code=response.status_code, detail=result.get("detail")
-                )
-            elif result["status"] == "Ready":
-                self.result = result["result"]
-            elif result["status"] == "Pending":
-                time.sleep(0.5)
-            else:
-                raise ApiException(
-                    status_code=200, detail=f"API returned status '{result['status']}'"
-                )
-        return self.result
-
-    @property
-    def bytes(self) -> bytes:
-        """
-        Generated image as bytes.
-        """
-        if self._image_bytes is None:
-            response = requests.get(self.url)
-            if response.status_code == 200:
-                self._image_bytes = response.content
-            else:
-                raise ApiException(status_code=response.status_code)
-        return self._image_bytes
-
-    @property
-    def url(self) -> str:
-        """
-        Public url to retrieve the image from
-        """
-        if self._url is None:
-            result = self.retrieve()
-            self._url = result["sample"]
-        return self._url
-
-    @property
-    def image(self) -> Image.Image:
-        """
-        Load the image as a PIL Image
-        """
-        return Image.open(io.BytesIO(self.bytes))
-
-    def save(self, path: str):
-        """
-        Save the generated image to a local path
-        """
-        suffix = Path(self.url).suffix
-        if not path.endswith(suffix):
-            path = path + suffix
-        Path(path).resolve().parent.mkdir(parents=True, exist_ok=True)
-        with open(path, "wb") as file:
-            file.write(self.bytes)
-
-
-if __name__ == "__main__":
-    from fire import Fire
-
-    Fire(ImageRequest)
diff --git a/videotuna/models/flux/cli.py b/videotuna/models/flux/cli.py
deleted file mode 100644
index 56ae3de6..00000000
--- a/videotuna/models/flux/cli.py
+++ /dev/null
@@ -1,272 +0,0 @@
-import os
-import re
-import time
-from dataclasses import dataclass
-from glob import iglob
-
-import torch
-from einops import rearrange
-from fire import Fire
-from flux.sampling import denoise, get_noise, get_schedule, prepare, unpack
-from flux.util import (
-    configs,
-    embed_watermark,
-    load_ae,
-    load_clip,
-    load_flow_model,
-    load_t5,
-)
-from PIL import ExifTags, Image
-from transformers import pipeline
-
-NSFW_THRESHOLD = 0.85
-
-
-@dataclass
-class SamplingOptions:
-    prompt: str
-    width: int
-    height: int
-    num_steps: int
-    guidance: float
-    seed: int | None
-
-
-def parse_prompt(options: SamplingOptions) -> SamplingOptions | None:
-    user_question = (
-        "Next prompt (write /h for help, /q to quit and leave empty to repeat):\n"
-    )
-    usage = (
-        "Usage: Either write your prompt directly, leave this field empty "
-        "to repeat the prompt or write a command starting with a slash:\n"
-        "- '/w <width>' will set the width of the generated image\n"
-        "- '/h <height>' will set the height of the generated image\n"
-        "- '/s <seed>' sets the next seed\n"
-        "- '/g <guidance>' sets the guidance (flux-dev only)\n"
-        "- '/n <steps>' sets the number of steps\n"
-        "- '/q' to quit"
-    )
-
-    while (prompt := input(user_question)).startswith("/"):
-        if prompt.startswith("/w"):
-            if prompt.count(" ") != 1:
-                print(f"Got invalid command '{prompt}'\n{usage}")
-                continue
-            _, width = prompt.split()
-            options.width = 16 * (int(width) // 16)
-            print(
-                f"Setting resolution to {options.width} x {options.height} "
-                f"({options.height *options.width/1e6:.2f}MP)"
-            )
-        elif prompt.startswith("/h"):
-            if prompt.count(" ") != 1:
-                print(f"Got invalid command '{prompt}'\n{usage}")
-                continue
-            _, height = prompt.split()
-            options.height = 16 * (int(height) // 16)
-            print(
-                f"Setting resolution to {options.width} x {options.height} "
-                f"({options.height *options.width/1e6:.2f}MP)"
-            )
-        elif prompt.startswith("/g"):
-            if prompt.count(" ") != 1:
-                print(f"Got invalid command '{prompt}'\n{usage}")
-                continue
-            _, guidance = prompt.split()
-            options.guidance = float(guidance)
-            print(f"Setting guidance to {options.guidance}")
-        elif prompt.startswith("/s"):
-            if prompt.count(" ") != 1:
-                print(f"Got invalid command '{prompt}'\n{usage}")
-                continue
-            _, seed = prompt.split()
-            options.seed = int(seed)
-            print(f"Setting seed to {options.seed}")
-        elif prompt.startswith("/n"):
-            if prompt.count(" ") != 1:
-                print(f"Got invalid command '{prompt}'\n{usage}")
-                continue
-            _, steps = prompt.split()
-            options.num_steps = int(steps)
-            print(f"Setting seed to {options.num_steps}")
-        elif prompt.startswith("/q"):
-            print("Quitting")
-            return None
-        else:
-            if not prompt.startswith("/h"):
-                print(f"Got invalid command '{prompt}'\n{usage}")
-            print(usage)
-    if prompt != "":
-        options.prompt = prompt
-    return options
-
-
-@torch.inference_mode()
-def main(
-    name: str = "flux-schnell",
-    width: int = 1360,
-    height: int = 768,
-    seed: int | None = None,
-    prompt: str = (
-        "a photo of a forest with mist swirling around the tree trunks. The word "
-        '"FLUX" is painted over it in big, red brush strokes with visible texture'
-    ),
-    device: str = "cuda" if torch.cuda.is_available() else "cpu",
-    num_steps: int | None = None,
-    loop: bool = False,
-    guidance: float = 3.5,
-    offload: bool = False,
-    output_dir: str = "output",
-    add_sampling_metadata: bool = True,
-):
-    """
-    Sample the flux model. Either interactively (set `--loop`) or run for a
-    single image.
-
-    Args:
-        name: Name of the model to load
-        height: height of the sample in pixels (should be a multiple of 16)
-        width: width of the sample in pixels (should be a multiple of 16)
-        seed: Set a seed for sampling
-        output_name: where to save the output image, `{idx}` will be replaced
-            by the index of the sample
-        prompt: Prompt used for sampling
-        device: Pytorch device
-        num_steps: number of sampling steps (default 4 for schnell, 50 for guidance distilled)
-        loop: start an interactive session and sample multiple times
-        guidance: guidance value used for guidance distillation
-        add_sampling_metadata: Add the prompt to the image Exif metadata
-    """
-    nsfw_classifier = pipeline(
-        "image-classification", model="Falconsai/nsfw_image_detection"
-    )
-
-    if name not in configs:
-        available = ", ".join(configs.keys())
-        raise ValueError(f"Got unknown model name: {name}, chose from {available}")
-
-    torch_device = torch.device(device)
-    if num_steps is None:
-        num_steps = 4 if name == "flux-schnell" else 50
-
-    # allow for packing and conversion to latent space
-    height = 16 * (height // 16)
-    width = 16 * (width // 16)
-
-    output_name = os.path.join(output_dir, "img_{idx}.jpg")
-    if not os.path.exists(output_dir):
-        os.makedirs(output_dir)
-        idx = 0
-    else:
-        fns = [
-            fn
-            for fn in iglob(output_name.format(idx="*"))
-            if re.search(r"img_[0-9]\.jpg$", fn)
-        ]
-        if len(fns) > 0:
-            idx = max(int(fn.split("_")[-1].split(".")[0]) for fn in fns) + 1
-        else:
-            idx = 0
-
-    # init all components
-    t5 = load_t5(torch_device, max_length=256 if name == "flux-schnell" else 512)
-    clip = load_clip(torch_device)
-    model = load_flow_model(name, device="cpu" if offload else torch_device)
-    ae = load_ae(name, device="cpu" if offload else torch_device)
-
-    rng = torch.Generator(device="cpu")
-    opts = SamplingOptions(
-        prompt=prompt,
-        width=width,
-        height=height,
-        num_steps=num_steps,
-        guidance=guidance,
-        seed=seed,
-    )
-
-    if loop:
-        opts = parse_prompt(opts)
-
-    while opts is not None:
-        if opts.seed is None:
-            opts.seed = rng.seed()
-        print(f"Generating with seed {opts.seed}:\n{opts.prompt}")
-        t0 = time.perf_counter()
-
-        # prepare input
-        x = get_noise(
-            1,
-            opts.height,
-            opts.width,
-            device=torch_device,
-            dtype=torch.bfloat16,
-            seed=opts.seed,
-        )
-        opts.seed = None
-        if offload:
-            ae = ae.cpu()
-            torch.cuda.empty_cache()
-            t5, clip = t5.to(torch_device), clip.to(torch_device)
-        inp = prepare(t5, clip, x, prompt=opts.prompt)
-        timesteps = get_schedule(
-            opts.num_steps, inp["img"].shape[1], shift=(name != "flux-schnell")
-        )
-
-        # offload TEs to CPU, load model to gpu
-        if offload:
-            t5, clip = t5.cpu(), clip.cpu()
-            torch.cuda.empty_cache()
-            model = model.to(torch_device)
-
-        # denoise initial noise
-        x = denoise(model, **inp, timesteps=timesteps, guidance=opts.guidance)
-
-        # offload model, load autoencoder to gpu
-        if offload:
-            model.cpu()
-            torch.cuda.empty_cache()
-            ae.decoder.to(x.device)
-
-        # decode latents to pixel space
-        x = unpack(x.float(), opts.height, opts.width)
-        with torch.autocast(device_type=torch_device.type, dtype=torch.bfloat16):
-            x = ae.decode(x)
-        t1 = time.perf_counter()
-
-        fn = output_name.format(idx=idx)
-        print(f"Done in {t1 - t0:.1f}s. Saving {fn}")
-        # bring into PIL format and save
-        x = x.clamp(-1, 1)
-        x = embed_watermark(x.float())
-        x = rearrange(x[0], "c h w -> h w c")
-
-        img = Image.fromarray((127.5 * (x + 1.0)).cpu().byte().numpy())
-        nsfw_score = [x["score"] for x in nsfw_classifier(img) if x["label"] == "nsfw"][
-            0
-        ]
-
-        if nsfw_score < NSFW_THRESHOLD:
-            exif_data = Image.Exif()
-            exif_data[ExifTags.Base.Software] = "AI generated;txt2img;flux"
-            exif_data[ExifTags.Base.Make] = "Black Forest Labs"
-            exif_data[ExifTags.Base.Model] = name
-            if add_sampling_metadata:
-                exif_data[ExifTags.Base.ImageDescription] = prompt
-            img.save(fn, exif=exif_data, quality=95, subsampling=0)
-            idx += 1
-        else:
-            print("Your generated image may contain NSFW content.")
-
-        if loop:
-            print("-" * 80)
-            opts = parse_prompt(opts)
-        else:
-            opts = None
-
-
-def app():
-    Fire(main)
-
-
-if __name__ == "__main__":
-    app()
diff --git a/videotuna/models/flux/flux_math.py b/videotuna/models/flux/flux_math.py
deleted file mode 100644
index 8c5e2c2c..00000000
--- a/videotuna/models/flux/flux_math.py
+++ /dev/null
@@ -1,32 +0,0 @@
-import torch
-from einops import rearrange
-from torch import Tensor
-
-
-def attention(q: Tensor, k: Tensor, v: Tensor, pe: Tensor) -> Tensor:
-    q, k = apply_rope(q, k, pe)
-
-    x = torch.nn.functional.scaled_dot_product_attention(q, k, v)
-    x = rearrange(x, "B H L D -> B L (H D)")
-
-    return x
-
-
-def rope(pos: Tensor, dim: int, theta: int) -> Tensor:
-    assert dim % 2 == 0
-    scale = torch.arange(0, dim, 2, dtype=torch.float64, device=pos.device) / dim
-    omega = 1.0 / (theta**scale)
-    out = torch.einsum("...n,d->...nd", pos, omega)
-    out = torch.stack(
-        [torch.cos(out), -torch.sin(out), torch.sin(out), torch.cos(out)], dim=-1
-    )
-    out = rearrange(out, "b n d (i j) -> b n d i j", i=2, j=2)
-    return out.float()
-
-
-def apply_rope(xq: Tensor, xk: Tensor, freqs_cis: Tensor) -> tuple[Tensor, Tensor]:
-    xq_ = xq.float().reshape(*xq.shape[:-1], -1, 1, 2)
-    xk_ = xk.float().reshape(*xk.shape[:-1], -1, 1, 2)
-    xq_out = freqs_cis[..., 0] * xq_[..., 0] + freqs_cis[..., 1] * xq_[..., 1]
-    xk_out = freqs_cis[..., 0] * xk_[..., 0] + freqs_cis[..., 1] * xk_[..., 1]
-    return xq_out.reshape(*xq.shape).type_as(xq), xk_out.reshape(*xk.shape).type_as(xk)
diff --git a/videotuna/models/flux/model.py b/videotuna/models/flux/model.py
deleted file mode 100644
index 83321b31..00000000
--- a/videotuna/models/flux/model.py
+++ /dev/null
@@ -1,126 +0,0 @@
-from dataclasses import dataclass
-
-import torch
-from flux.modules.layers import (
-    DoubleStreamBlock,
-    EmbedND,
-    LastLayer,
-    MLPEmbedder,
-    SingleStreamBlock,
-    timestep_embedding,
-)
-from torch import Tensor, nn
-
-
-@dataclass
-class FluxParams:
-    in_channels: int
-    vec_in_dim: int
-    context_in_dim: int
-    hidden_size: int
-    mlp_ratio: float
-    num_heads: int
-    depth: int
-    depth_single_blocks: int
-    axes_dim: list[int]
-    theta: int
-    qkv_bias: bool
-    guidance_embed: bool
-
-
-class Flux(nn.Module):
-    """
-    Transformer model for flow matching on sequences.
-    """
-
-    def __init__(self, params: FluxParams):
-        super().__init__()
-
-        self.params = params
-        self.in_channels = params.in_channels
-        self.out_channels = self.in_channels
-        if params.hidden_size % params.num_heads != 0:
-            raise ValueError(
-                f"Hidden size {params.hidden_size} must be divisible by num_heads {params.num_heads}"
-            )
-        pe_dim = params.hidden_size // params.num_heads
-        if sum(params.axes_dim) != pe_dim:
-            raise ValueError(
-                f"Got {params.axes_dim} but expected positional dim {pe_dim}"
-            )
-        self.hidden_size = params.hidden_size
-        self.num_heads = params.num_heads
-        self.pe_embedder = EmbedND(
-            dim=pe_dim, theta=params.theta, axes_dim=params.axes_dim
-        )
-        self.img_in = nn.Linear(self.in_channels, self.hidden_size, bias=True)
-        self.time_in = MLPEmbedder(in_dim=256, hidden_dim=self.hidden_size)
-        self.vector_in = MLPEmbedder(params.vec_in_dim, self.hidden_size)
-        self.guidance_in = (
-            MLPEmbedder(in_dim=256, hidden_dim=self.hidden_size)
-            if params.guidance_embed
-            else nn.Identity()
-        )
-        self.txt_in = nn.Linear(params.context_in_dim, self.hidden_size)
-
-        self.double_blocks = nn.ModuleList(
-            [
-                DoubleStreamBlock(
-                    self.hidden_size,
-                    self.num_heads,
-                    mlp_ratio=params.mlp_ratio,
-                    qkv_bias=params.qkv_bias,
-                )
-                for _ in range(params.depth)
-            ]
-        )
-
-        self.single_blocks = nn.ModuleList(
-            [
-                SingleStreamBlock(
-                    self.hidden_size, self.num_heads, mlp_ratio=params.mlp_ratio
-                )
-                for _ in range(params.depth_single_blocks)
-            ]
-        )
-
-        self.final_layer = LastLayer(self.hidden_size, 1, self.out_channels)
-
-    def forward(
-        self,
-        img: Tensor,
-        img_ids: Tensor,
-        txt: Tensor,
-        txt_ids: Tensor,
-        timesteps: Tensor,
-        y: Tensor,
-        guidance: Tensor | None = None,
-    ) -> Tensor:
-        if img.ndim != 3 or txt.ndim != 3:
-            raise ValueError("Input img and txt tensors must have 3 dimensions.")
-
-        # running on sequences img
-        img = self.img_in(img)
-        vec = self.time_in(timestep_embedding(timesteps, 256))
-        if self.params.guidance_embed:
-            if guidance is None:
-                raise ValueError(
-                    "Didn't get guidance strength for guidance distilled model."
-                )
-            vec = vec + self.guidance_in(timestep_embedding(guidance, 256))
-        vec = vec + self.vector_in(y)
-        txt = self.txt_in(txt)
-
-        ids = torch.cat((txt_ids, img_ids), dim=1)
-        pe = self.pe_embedder(ids)
-
-        for block in self.double_blocks:
-            img, txt = block(img=img, txt=txt, vec=vec, pe=pe)
-
-        img = torch.cat((txt, img), 1)
-        for block in self.single_blocks:
-            img = block(img, vec=vec, pe=pe)
-        img = img[:, txt.shape[1] :, ...]
-
-        img = self.final_layer(img, vec)  # (N, T, patch_size ** 2 * out_channels)
-        return img
diff --git a/videotuna/models/flux/modules/autoencoder.py b/videotuna/models/flux/modules/autoencoder.py
deleted file mode 100644
index c0624afb..00000000
--- a/videotuna/models/flux/modules/autoencoder.py
+++ /dev/null
@@ -1,338 +0,0 @@
-from dataclasses import dataclass
-
-import torch
-from einops import rearrange
-from torch import Tensor, nn
-
-
-@dataclass
-class AutoEncoderParams:
-    resolution: int
-    in_channels: int
-    ch: int
-    out_ch: int
-    ch_mult: list[int]
-    num_res_blocks: int
-    z_channels: int
-    scale_factor: float
-    shift_factor: float
-
-
-def swish(x: Tensor) -> Tensor:
-    return x * torch.sigmoid(x)
-
-
-class AttnBlock(nn.Module):
-    def __init__(self, in_channels: int):
-        super().__init__()
-        self.in_channels = in_channels
-
-        self.norm = nn.GroupNorm(
-            num_groups=32, num_channels=in_channels, eps=1e-6, affine=True
-        )
-
-        self.q = nn.Conv2d(in_channels, in_channels, kernel_size=1)
-        self.k = nn.Conv2d(in_channels, in_channels, kernel_size=1)
-        self.v = nn.Conv2d(in_channels, in_channels, kernel_size=1)
-        self.proj_out = nn.Conv2d(in_channels, in_channels, kernel_size=1)
-
-    def attention(self, h_: Tensor) -> Tensor:
-        h_ = self.norm(h_)
-        q = self.q(h_)
-        k = self.k(h_)
-        v = self.v(h_)
-
-        b, c, h, w = q.shape
-        q = rearrange(q, "b c h w -> b 1 (h w) c").contiguous()
-        k = rearrange(k, "b c h w -> b 1 (h w) c").contiguous()
-        v = rearrange(v, "b c h w -> b 1 (h w) c").contiguous()
-        h_ = nn.functional.scaled_dot_product_attention(q, k, v)
-
-        return rearrange(h_, "b 1 (h w) c -> b c h w", h=h, w=w, c=c, b=b)
-
-    def forward(self, x: Tensor) -> Tensor:
-        return x + self.proj_out(self.attention(x))
-
-
-class ResnetBlock(nn.Module):
-    def __init__(self, in_channels: int, out_channels: int):
-        super().__init__()
-        self.in_channels = in_channels
-        out_channels = in_channels if out_channels is None else out_channels
-        self.out_channels = out_channels
-
-        self.norm1 = nn.GroupNorm(
-            num_groups=32, num_channels=in_channels, eps=1e-6, affine=True
-        )
-        self.conv1 = nn.Conv2d(
-            in_channels, out_channels, kernel_size=3, stride=1, padding=1
-        )
-        self.norm2 = nn.GroupNorm(
-            num_groups=32, num_channels=out_channels, eps=1e-6, affine=True
-        )
-        self.conv2 = nn.Conv2d(
-            out_channels, out_channels, kernel_size=3, stride=1, padding=1
-        )
-        if self.in_channels != self.out_channels:
-            self.nin_shortcut = nn.Conv2d(
-                in_channels, out_channels, kernel_size=1, stride=1, padding=0
-            )
-
-    def forward(self, x):
-        h = x
-        h = self.norm1(h)
-        h = swish(h)
-        h = self.conv1(h)
-
-        h = self.norm2(h)
-        h = swish(h)
-        h = self.conv2(h)
-
-        if self.in_channels != self.out_channels:
-            x = self.nin_shortcut(x)
-
-        return x + h
-
-
-class Downsample(nn.Module):
-    def __init__(self, in_channels: int):
-        super().__init__()
-        # no asymmetric padding in torch conv, must do it ourselves
-        self.conv = nn.Conv2d(
-            in_channels, in_channels, kernel_size=3, stride=2, padding=0
-        )
-
-    def forward(self, x: Tensor):
-        pad = (0, 1, 0, 1)
-        x = nn.functional.pad(x, pad, mode="constant", value=0)
-        x = self.conv(x)
-        return x
-
-
-class Upsample(nn.Module):
-    def __init__(self, in_channels: int):
-        super().__init__()
-        self.conv = nn.Conv2d(
-            in_channels, in_channels, kernel_size=3, stride=1, padding=1
-        )
-
-    def forward(self, x: Tensor):
-        x = nn.functional.interpolate(x, scale_factor=2.0, mode="nearest")
-        x = self.conv(x)
-        return x
-
-
-class Encoder(nn.Module):
-    def __init__(
-        self,
-        resolution: int,
-        in_channels: int,
-        ch: int,
-        ch_mult: list[int],
-        num_res_blocks: int,
-        z_channels: int,
-    ):
-        super().__init__()
-        self.ch = ch
-        self.num_resolutions = len(ch_mult)
-        self.num_res_blocks = num_res_blocks
-        self.resolution = resolution
-        self.in_channels = in_channels
-        # downsampling
-        self.conv_in = nn.Conv2d(
-            in_channels, self.ch, kernel_size=3, stride=1, padding=1
-        )
-
-        curr_res = resolution
-        in_ch_mult = (1,) + tuple(ch_mult)
-        self.in_ch_mult = in_ch_mult
-        self.down = nn.ModuleList()
-        block_in = self.ch
-        for i_level in range(self.num_resolutions):
-            block = nn.ModuleList()
-            attn = nn.ModuleList()
-            block_in = ch * in_ch_mult[i_level]
-            block_out = ch * ch_mult[i_level]
-            for _ in range(self.num_res_blocks):
-                block.append(ResnetBlock(in_channels=block_in, out_channels=block_out))
-                block_in = block_out
-            down = nn.Module()
-            down.block = block
-            down.attn = attn
-            if i_level != self.num_resolutions - 1:
-                down.downsample = Downsample(block_in)
-                curr_res = curr_res // 2
-            self.down.append(down)
-
-        # middle
-        self.mid = nn.Module()
-        self.mid.block_1 = ResnetBlock(in_channels=block_in, out_channels=block_in)
-        self.mid.attn_1 = AttnBlock(block_in)
-        self.mid.block_2 = ResnetBlock(in_channels=block_in, out_channels=block_in)
-
-        # end
-        self.norm_out = nn.GroupNorm(
-            num_groups=32, num_channels=block_in, eps=1e-6, affine=True
-        )
-        self.conv_out = nn.Conv2d(
-            block_in, 2 * z_channels, kernel_size=3, stride=1, padding=1
-        )
-
-    def forward(self, x: Tensor) -> Tensor:
-        # downsampling
-        hs = [self.conv_in(x)]
-        for i_level in range(self.num_resolutions):
-            for i_block in range(self.num_res_blocks):
-                h = self.down[i_level].block[i_block](hs[-1])
-                if len(self.down[i_level].attn) > 0:
-                    h = self.down[i_level].attn[i_block](h)
-                hs.append(h)
-            if i_level != self.num_resolutions - 1:
-                hs.append(self.down[i_level].downsample(hs[-1]))
-
-        # middle
-        h = hs[-1]
-        h = self.mid.block_1(h)
-        h = self.mid.attn_1(h)
-        h = self.mid.block_2(h)
-        # end
-        h = self.norm_out(h)
-        h = swish(h)
-        h = self.conv_out(h)
-        return h
-
-
-class Decoder(nn.Module):
-    def __init__(
-        self,
-        ch: int,
-        out_ch: int,
-        ch_mult: list[int],
-        num_res_blocks: int,
-        in_channels: int,
-        resolution: int,
-        z_channels: int,
-    ):
-        super().__init__()
-        self.ch = ch
-        self.num_resolutions = len(ch_mult)
-        self.num_res_blocks = num_res_blocks
-        self.resolution = resolution
-        self.in_channels = in_channels
-        self.ffactor = 2 ** (self.num_resolutions - 1)
-
-        # compute in_ch_mult, block_in and curr_res at lowest res
-        block_in = ch * ch_mult[self.num_resolutions - 1]
-        curr_res = resolution // 2 ** (self.num_resolutions - 1)
-        self.z_shape = (1, z_channels, curr_res, curr_res)
-
-        # z to block_in
-        self.conv_in = nn.Conv2d(
-            z_channels, block_in, kernel_size=3, stride=1, padding=1
-        )
-
-        # middle
-        self.mid = nn.Module()
-        self.mid.block_1 = ResnetBlock(in_channels=block_in, out_channels=block_in)
-        self.mid.attn_1 = AttnBlock(block_in)
-        self.mid.block_2 = ResnetBlock(in_channels=block_in, out_channels=block_in)
-
-        # upsampling
-        self.up = nn.ModuleList()
-        for i_level in reversed(range(self.num_resolutions)):
-            block = nn.ModuleList()
-            attn = nn.ModuleList()
-            block_out = ch * ch_mult[i_level]
-            for _ in range(self.num_res_blocks + 1):
-                block.append(ResnetBlock(in_channels=block_in, out_channels=block_out))
-                block_in = block_out
-            up = nn.Module()
-            up.block = block
-            up.attn = attn
-            if i_level != 0:
-                up.upsample = Upsample(block_in)
-                curr_res = curr_res * 2
-            self.up.insert(0, up)  # prepend to get consistent order
-
-        # end
-        self.norm_out = nn.GroupNorm(
-            num_groups=32, num_channels=block_in, eps=1e-6, affine=True
-        )
-        self.conv_out = nn.Conv2d(block_in, out_ch, kernel_size=3, stride=1, padding=1)
-
-    def forward(self, z: Tensor) -> Tensor:
-        # z to block_in
-        h = self.conv_in(z)
-
-        # middle
-        h = self.mid.block_1(h)
-        h = self.mid.attn_1(h)
-        h = self.mid.block_2(h)
-
-        # upsampling
-        for i_level in reversed(range(self.num_resolutions)):
-            for i_block in range(self.num_res_blocks + 1):
-                h = self.up[i_level].block[i_block](h)
-                if len(self.up[i_level].attn) > 0:
-                    h = self.up[i_level].attn[i_block](h)
-            if i_level != 0:
-                h = self.up[i_level].upsample(h)
-
-        # end
-        h = self.norm_out(h)
-        h = swish(h)
-        h = self.conv_out(h)
-        return h
-
-
-class DiagonalGaussian(nn.Module):
-    def __init__(self, sample: bool = True, chunk_dim: int = 1):
-        super().__init__()
-        self.sample = sample
-        self.chunk_dim = chunk_dim
-
-    def forward(self, z: Tensor) -> Tensor:
-        mean, logvar = torch.chunk(z, 2, dim=self.chunk_dim)
-        if self.sample:
-            std = torch.exp(0.5 * logvar)
-            return mean + std * torch.randn_like(mean)
-        else:
-            return mean
-
-
-class AutoEncoder(nn.Module):
-    def __init__(self, params: AutoEncoderParams):
-        super().__init__()
-        self.encoder = Encoder(
-            resolution=params.resolution,
-            in_channels=params.in_channels,
-            ch=params.ch,
-            ch_mult=params.ch_mult,
-            num_res_blocks=params.num_res_blocks,
-            z_channels=params.z_channels,
-        )
-        self.decoder = Decoder(
-            resolution=params.resolution,
-            in_channels=params.in_channels,
-            ch=params.ch,
-            out_ch=params.out_ch,
-            ch_mult=params.ch_mult,
-            num_res_blocks=params.num_res_blocks,
-            z_channels=params.z_channels,
-        )
-        self.reg = DiagonalGaussian()
-
-        self.scale_factor = params.scale_factor
-        self.shift_factor = params.shift_factor
-
-    def encode(self, x: Tensor) -> Tensor:
-        z = self.reg(self.encoder(x))
-        z = self.scale_factor * (z - self.shift_factor)
-        return z
-
-    def decode(self, z: Tensor) -> Tensor:
-        z = z / self.scale_factor + self.shift_factor
-        return self.decoder(z)
-
-    def forward(self, x: Tensor) -> Tensor:
-        return self.decode(self.encode(x))
diff --git a/videotuna/models/flux/modules/conditioner.py b/videotuna/models/flux/modules/conditioner.py
deleted file mode 100644
index c5c3e16e..00000000
--- a/videotuna/models/flux/modules/conditioner.py
+++ /dev/null
@@ -1,45 +0,0 @@
-from torch import Tensor, nn
-from transformers import CLIPTextModel, CLIPTokenizer, T5EncoderModel, T5Tokenizer
-
-
-class HFEmbedder(nn.Module):
-    def __init__(self, version: str, max_length: int, **hf_kwargs):
-        super().__init__()
-        self.is_clip = version.startswith("openai")
-        self.max_length = max_length
-        self.output_key = "pooler_output" if self.is_clip else "last_hidden_state"
-
-        if self.is_clip:
-            self.tokenizer: CLIPTokenizer = CLIPTokenizer.from_pretrained(
-                version, max_length=max_length
-            )
-            self.hf_module: CLIPTextModel = CLIPTextModel.from_pretrained(
-                version, **hf_kwargs
-            )
-        else:
-            self.tokenizer: T5Tokenizer = T5Tokenizer.from_pretrained(
-                version, max_length=max_length
-            )
-            self.hf_module: T5EncoderModel = T5EncoderModel.from_pretrained(
-                version, **hf_kwargs
-            )
-
-        self.hf_module = self.hf_module.eval().requires_grad_(False)
-
-    def forward(self, text: list[str]) -> Tensor:
-        batch_encoding = self.tokenizer(
-            text,
-            truncation=True,
-            max_length=self.max_length,
-            return_length=False,
-            return_overflowing_tokens=False,
-            padding="max_length",
-            return_tensors="pt",
-        )
-
-        outputs = self.hf_module(
-            input_ids=batch_encoding["input_ids"].to(self.hf_module.device),
-            attention_mask=None,
-            output_hidden_states=False,
-        )
-        return outputs[self.output_key]
diff --git a/videotuna/models/flux/modules/layers.py b/videotuna/models/flux/modules/layers.py
deleted file mode 100644
index ab3f4c47..00000000
--- a/videotuna/models/flux/modules/layers.py
+++ /dev/null
@@ -1,278 +0,0 @@
-import math
-from dataclasses import dataclass
-
-import torch
-from einops import rearrange
-from flux.flux_math import attention, rope
-from torch import Tensor, nn
-
-
-class EmbedND(nn.Module):
-    def __init__(self, dim: int, theta: int, axes_dim: list[int]):
-        super().__init__()
-        self.dim = dim
-        self.theta = theta
-        self.axes_dim = axes_dim
-
-    def forward(self, ids: Tensor) -> Tensor:
-        n_axes = ids.shape[-1]
-        emb = torch.cat(
-            [rope(ids[..., i], self.axes_dim[i], self.theta) for i in range(n_axes)],
-            dim=-3,
-        )
-
-        return emb.unsqueeze(1)
-
-
-def timestep_embedding(t: Tensor, dim, max_period=10000, time_factor: float = 1000.0):
-    """
-    Create sinusoidal timestep embeddings.
-    :param t: a 1-D Tensor of N indices, one per batch element.
-                      These may be fractional.
-    :param dim: the dimension of the output.
-    :param max_period: controls the minimum frequency of the embeddings.
-    :return: an (N, D) Tensor of positional embeddings.
-    """
-    t = time_factor * t
-    half = dim // 2
-    freqs = torch.exp(
-        -math.log(max_period)
-        * torch.arange(start=0, end=half, dtype=torch.float32)
-        / half
-    ).to(t.device)
-
-    args = t[:, None].float() * freqs[None]
-    embedding = torch.cat([torch.cos(args), torch.sin(args)], dim=-1)
-    if dim % 2:
-        embedding = torch.cat([embedding, torch.zeros_like(embedding[:, :1])], dim=-1)
-    if torch.is_floating_point(t):
-        embedding = embedding.to(t)
-    return embedding
-
-
-class MLPEmbedder(nn.Module):
-    def __init__(self, in_dim: int, hidden_dim: int):
-        super().__init__()
-        self.in_layer = nn.Linear(in_dim, hidden_dim, bias=True)
-        self.silu = nn.SiLU()
-        self.out_layer = nn.Linear(hidden_dim, hidden_dim, bias=True)
-
-    def forward(self, x: Tensor) -> Tensor:
-        return self.out_layer(self.silu(self.in_layer(x)))
-
-
-class RMSNorm(torch.nn.Module):
-    def __init__(self, dim: int):
-        super().__init__()
-        self.scale = nn.Parameter(torch.ones(dim))
-
-    def forward(self, x: Tensor):
-        x_dtype = x.dtype
-        x = x.float()
-        rrms = torch.rsqrt(torch.mean(x**2, dim=-1, keepdim=True) + 1e-6)
-        return (x * rrms).to(dtype=x_dtype) * self.scale
-
-
-class QKNorm(torch.nn.Module):
-    def __init__(self, dim: int):
-        super().__init__()
-        self.query_norm = RMSNorm(dim)
-        self.key_norm = RMSNorm(dim)
-
-    def forward(self, q: Tensor, k: Tensor, v: Tensor) -> tuple[Tensor, Tensor]:
-        q = self.query_norm(q)
-        k = self.key_norm(k)
-        return q.to(v), k.to(v)
-
-
-class SelfAttention(nn.Module):
-    def __init__(self, dim: int, num_heads: int = 8, qkv_bias: bool = False):
-        super().__init__()
-        self.num_heads = num_heads
-        head_dim = dim // num_heads
-
-        self.qkv = nn.Linear(dim, dim * 3, bias=qkv_bias)
-        self.norm = QKNorm(head_dim)
-        self.proj = nn.Linear(dim, dim)
-
-    def forward(self, x: Tensor, pe: Tensor) -> Tensor:
-        qkv = self.qkv(x)
-        q, k, v = rearrange(qkv, "B L (K H D) -> K B H L D", K=3, H=self.num_heads)
-        q, k = self.norm(q, k, v)
-        x = attention(q, k, v, pe=pe)
-        x = self.proj(x)
-        return x
-
-
-@dataclass
-class ModulationOut:
-    shift: Tensor
-    scale: Tensor
-    gate: Tensor
-
-
-class Modulation(nn.Module):
-    def __init__(self, dim: int, double: bool):
-        super().__init__()
-        self.is_double = double
-        self.multiplier = 6 if double else 3
-        self.lin = nn.Linear(dim, self.multiplier * dim, bias=True)
-
-    def forward(self, vec: Tensor) -> tuple[ModulationOut, ModulationOut | None]:
-        out = self.lin(nn.functional.silu(vec))[:, None, :].chunk(
-            self.multiplier, dim=-1
-        )
-
-        return (
-            ModulationOut(*out[:3]),
-            ModulationOut(*out[3:]) if self.is_double else None,
-        )
-
-
-class DoubleStreamBlock(nn.Module):
-    def __init__(
-        self, hidden_size: int, num_heads: int, mlp_ratio: float, qkv_bias: bool = False
-    ):
-        super().__init__()
-
-        mlp_hidden_dim = int(hidden_size * mlp_ratio)
-        self.num_heads = num_heads
-        self.hidden_size = hidden_size
-        self.img_mod = Modulation(hidden_size, double=True)
-        self.img_norm1 = nn.LayerNorm(hidden_size, elementwise_affine=False, eps=1e-6)
-        self.img_attn = SelfAttention(
-            dim=hidden_size, num_heads=num_heads, qkv_bias=qkv_bias
-        )
-
-        self.img_norm2 = nn.LayerNorm(hidden_size, elementwise_affine=False, eps=1e-6)
-        self.img_mlp = nn.Sequential(
-            nn.Linear(hidden_size, mlp_hidden_dim, bias=True),
-            nn.GELU(approximate="tanh"),
-            nn.Linear(mlp_hidden_dim, hidden_size, bias=True),
-        )
-
-        self.txt_mod = Modulation(hidden_size, double=True)
-        self.txt_norm1 = nn.LayerNorm(hidden_size, elementwise_affine=False, eps=1e-6)
-        self.txt_attn = SelfAttention(
-            dim=hidden_size, num_heads=num_heads, qkv_bias=qkv_bias
-        )
-
-        self.txt_norm2 = nn.LayerNorm(hidden_size, elementwise_affine=False, eps=1e-6)
-        self.txt_mlp = nn.Sequential(
-            nn.Linear(hidden_size, mlp_hidden_dim, bias=True),
-            nn.GELU(approximate="tanh"),
-            nn.Linear(mlp_hidden_dim, hidden_size, bias=True),
-        )
-
-    def forward(
-        self, img: Tensor, txt: Tensor, vec: Tensor, pe: Tensor
-    ) -> tuple[Tensor, Tensor]:
-        img_mod1, img_mod2 = self.img_mod(vec)
-        txt_mod1, txt_mod2 = self.txt_mod(vec)
-
-        # prepare image for attention
-        img_modulated = self.img_norm1(img)
-        img_modulated = (1 + img_mod1.scale) * img_modulated + img_mod1.shift
-        img_qkv = self.img_attn.qkv(img_modulated)
-        img_q, img_k, img_v = rearrange(
-            img_qkv, "B L (K H D) -> K B H L D", K=3, H=self.num_heads
-        )
-        img_q, img_k = self.img_attn.norm(img_q, img_k, img_v)
-
-        # prepare txt for attention
-        txt_modulated = self.txt_norm1(txt)
-        txt_modulated = (1 + txt_mod1.scale) * txt_modulated + txt_mod1.shift
-        txt_qkv = self.txt_attn.qkv(txt_modulated)
-        txt_q, txt_k, txt_v = rearrange(
-            txt_qkv, "B L (K H D) -> K B H L D", K=3, H=self.num_heads
-        )
-        txt_q, txt_k = self.txt_attn.norm(txt_q, txt_k, txt_v)
-
-        # run actual attention
-        q = torch.cat((txt_q, img_q), dim=2)
-        k = torch.cat((txt_k, img_k), dim=2)
-        v = torch.cat((txt_v, img_v), dim=2)
-
-        attn = attention(q, k, v, pe=pe)
-        txt_attn, img_attn = attn[:, : txt.shape[1]], attn[:, txt.shape[1] :]
-
-        # calculate the img bloks
-        img = img + img_mod1.gate * self.img_attn.proj(img_attn)
-        img = img + img_mod2.gate * self.img_mlp(
-            (1 + img_mod2.scale) * self.img_norm2(img) + img_mod2.shift
-        )
-
-        # calculate the txt bloks
-        txt = txt + txt_mod1.gate * self.txt_attn.proj(txt_attn)
-        txt = txt + txt_mod2.gate * self.txt_mlp(
-            (1 + txt_mod2.scale) * self.txt_norm2(txt) + txt_mod2.shift
-        )
-        return img, txt
-
-
-class SingleStreamBlock(nn.Module):
-    """
-    A DiT block with parallel linear layers as described in
-    https://arxiv.org/abs/2302.05442 and adapted modulation interface.
-    """
-
-    def __init__(
-        self,
-        hidden_size: int,
-        num_heads: int,
-        mlp_ratio: float = 4.0,
-        qk_scale: float | None = None,
-    ):
-        super().__init__()
-        self.hidden_dim = hidden_size
-        self.num_heads = num_heads
-        head_dim = hidden_size // num_heads
-        self.scale = qk_scale or head_dim**-0.5
-
-        self.mlp_hidden_dim = int(hidden_size * mlp_ratio)
-        # qkv and mlp_in
-        self.linear1 = nn.Linear(hidden_size, hidden_size * 3 + self.mlp_hidden_dim)
-        # proj and mlp_out
-        self.linear2 = nn.Linear(hidden_size + self.mlp_hidden_dim, hidden_size)
-
-        self.norm = QKNorm(head_dim)
-
-        self.hidden_size = hidden_size
-        self.pre_norm = nn.LayerNorm(hidden_size, elementwise_affine=False, eps=1e-6)
-
-        self.mlp_act = nn.GELU(approximate="tanh")
-        self.modulation = Modulation(hidden_size, double=False)
-
-    def forward(self, x: Tensor, vec: Tensor, pe: Tensor) -> Tensor:
-        mod, _ = self.modulation(vec)
-        x_mod = (1 + mod.scale) * self.pre_norm(x) + mod.shift
-        qkv, mlp = torch.split(
-            self.linear1(x_mod), [3 * self.hidden_size, self.mlp_hidden_dim], dim=-1
-        )
-
-        q, k, v = rearrange(qkv, "B L (K H D) -> K B H L D", K=3, H=self.num_heads)
-        q, k = self.norm(q, k, v)
-
-        # compute attention
-        attn = attention(q, k, v, pe=pe)
-        # compute activation in mlp stream, cat again and run second linear layer
-        output = self.linear2(torch.cat((attn, self.mlp_act(mlp)), 2))
-        return x + mod.gate * output
-
-
-class LastLayer(nn.Module):
-    def __init__(self, hidden_size: int, patch_size: int, out_channels: int):
-        super().__init__()
-        self.norm_final = nn.LayerNorm(hidden_size, elementwise_affine=False, eps=1e-6)
-        self.linear = nn.Linear(
-            hidden_size, patch_size * patch_size * out_channels, bias=True
-        )
-        self.adaLN_modulation = nn.Sequential(
-            nn.SiLU(), nn.Linear(hidden_size, 2 * hidden_size, bias=True)
-        )
-
-    def forward(self, x: Tensor, vec: Tensor) -> Tensor:
-        shift, scale = self.adaLN_modulation(vec).chunk(2, dim=1)
-        x = (1 + scale[:, None, :]) * self.norm_final(x) + shift[:, None, :]
-        x = self.linear(x)
-        return x
diff --git a/videotuna/models/flux/sampling.py b/videotuna/models/flux/sampling.py
deleted file mode 100644
index 111c8b97..00000000
--- a/videotuna/models/flux/sampling.py
+++ /dev/null
@@ -1,140 +0,0 @@
-import math
-from typing import Callable
-
-import torch
-from einops import rearrange, repeat
-from torch import Tensor
-
-from .model import Flux
-from .modules.conditioner import HFEmbedder
-
-
-def get_noise(
-    num_samples: int,
-    height: int,
-    width: int,
-    device: torch.device,
-    dtype: torch.dtype,
-    seed: int,
-):
-    return torch.randn(
-        num_samples,
-        16,
-        # allow for packing
-        2 * math.ceil(height / 16),
-        2 * math.ceil(width / 16),
-        device=device,
-        dtype=dtype,
-        generator=torch.Generator(device=device).manual_seed(seed),
-    )
-
-
-def prepare(
-    t5: HFEmbedder, clip: HFEmbedder, img: Tensor, prompt: str | list[str]
-) -> dict[str, Tensor]:
-    bs, c, h, w = img.shape
-    if bs == 1 and not isinstance(prompt, str):
-        bs = len(prompt)
-
-    img = rearrange(img, "b c (h ph) (w pw) -> b (h w) (c ph pw)", ph=2, pw=2)
-    if img.shape[0] == 1 and bs > 1:
-        img = repeat(img, "1 ... -> bs ...", bs=bs)
-
-    img_ids = torch.zeros(h // 2, w // 2, 3)
-    img_ids[..., 1] = img_ids[..., 1] + torch.arange(h // 2)[:, None]
-    img_ids[..., 2] = img_ids[..., 2] + torch.arange(w // 2)[None, :]
-    img_ids = repeat(img_ids, "h w c -> b (h w) c", b=bs)
-
-    if isinstance(prompt, str):
-        prompt = [prompt]
-    txt = t5(prompt)
-    if txt.shape[0] == 1 and bs > 1:
-        txt = repeat(txt, "1 ... -> bs ...", bs=bs)
-    txt_ids = torch.zeros(bs, txt.shape[1], 3)
-
-    vec = clip(prompt)
-    if vec.shape[0] == 1 and bs > 1:
-        vec = repeat(vec, "1 ... -> bs ...", bs=bs)
-
-    return {
-        "img": img,
-        "img_ids": img_ids.to(img.device),
-        "txt": txt.to(img.device),
-        "txt_ids": txt_ids.to(img.device),
-        "vec": vec.to(img.device),
-    }
-
-
-def time_shift(mu: float, sigma: float, t: Tensor):
-    return math.exp(mu) / (math.exp(mu) + (1 / t - 1) ** sigma)
-
-
-def get_lin_function(
-    x1: float = 256, y1: float = 0.5, x2: float = 4096, y2: float = 1.15
-) -> Callable[[float], float]:
-    m = (y2 - y1) / (x2 - x1)
-    b = y1 - m * x1
-    return lambda x: m * x + b
-
-
-def get_schedule(
-    num_steps: int,
-    image_seq_len: int,
-    base_shift: float = 0.5,
-    max_shift: float = 1.15,
-    shift: bool = True,
-) -> list[float]:
-    # extra step for zero
-    timesteps = torch.linspace(1, 0, num_steps + 1)
-
-    # shifting the schedule to favor high timesteps for higher signal images
-    if shift:
-        # eastimate mu based on linear estimation between two points
-        mu = get_lin_function(y1=base_shift, y2=max_shift)(image_seq_len)
-        timesteps = time_shift(mu, 1.0, timesteps)
-
-    return timesteps.tolist()
-
-
-def denoise(
-    model: Flux,
-    # model input
-    img: Tensor,
-    img_ids: Tensor,
-    txt: Tensor,
-    txt_ids: Tensor,
-    vec: Tensor,
-    # sampling parameters
-    timesteps: list[float],
-    guidance: float = 4.0,
-):
-    # this is ignored for schnell
-    guidance_vec = torch.full(
-        (img.shape[0],), guidance, device=img.device, dtype=img.dtype
-    )
-    for t_curr, t_prev in zip(timesteps[:-1], timesteps[1:]):
-        t_vec = torch.full((img.shape[0],), t_curr, dtype=img.dtype, device=img.device)
-        pred = model(
-            img=img,
-            img_ids=img_ids,
-            txt=txt,
-            txt_ids=txt_ids,
-            y=vec,
-            timesteps=t_vec,
-            guidance=guidance_vec,
-        )
-
-        img = img + (t_prev - t_curr) * pred
-
-    return img
-
-
-def unpack(x: Tensor, height: int, width: int) -> Tensor:
-    return rearrange(
-        x,
-        "b (h w) (c ph pw) -> b c (h ph) (w pw)",
-        h=math.ceil(height / 16),
-        w=math.ceil(width / 16),
-        ph=2,
-        pw=2,
-    )
diff --git a/videotuna/models/flux/util.py b/videotuna/models/flux/util.py
deleted file mode 100644
index b4c7b1b3..00000000
--- a/videotuna/models/flux/util.py
+++ /dev/null
@@ -1,210 +0,0 @@
-import os
-from dataclasses import dataclass
-
-import torch
-from einops import rearrange
-from flux.model import Flux, FluxParams
-from flux.modules.autoencoder import AutoEncoder, AutoEncoderParams
-from flux.modules.conditioner import HFEmbedder
-from huggingface_hub import hf_hub_download
-from imwatermark import WatermarkEncoder
-from safetensors.torch import load_file as load_sft
-
-
-@dataclass
-class ModelSpec:
-    params: FluxParams
-    ae_params: AutoEncoderParams
-    ckpt_path: str | None
-    ae_path: str | None
-    repo_id: str | None
-    repo_flow: str | None
-    repo_ae: str | None
-
-
-configs = {
-    "flux-dev": ModelSpec(
-        repo_id="black-forest-labs/FLUX.1-dev",
-        repo_flow="flux1-dev.safetensors",
-        repo_ae="ae.safetensors",
-        ckpt_path=os.getenv("FLUX_DEV"),
-        params=FluxParams(
-            in_channels=64,
-            vec_in_dim=768,
-            context_in_dim=4096,
-            hidden_size=3072,
-            mlp_ratio=4.0,
-            num_heads=24,
-            depth=19,
-            depth_single_blocks=38,
-            axes_dim=[16, 56, 56],
-            theta=10_000,
-            qkv_bias=True,
-            guidance_embed=True,
-        ),
-        ae_path=os.getenv("AE"),
-        ae_params=AutoEncoderParams(
-            resolution=256,
-            in_channels=3,
-            ch=128,
-            out_ch=3,
-            ch_mult=[1, 2, 4, 4],
-            num_res_blocks=2,
-            z_channels=16,
-            scale_factor=0.3611,
-            shift_factor=0.1159,
-        ),
-    ),
-    "flux-schnell": ModelSpec(
-        repo_id="black-forest-labs/FLUX.1-schnell",
-        repo_flow="flux1-schnell.safetensors",
-        repo_ae="ae.safetensors",
-        ckpt_path=os.getenv("FLUX_SCHNELL"),
-        params=FluxParams(
-            in_channels=64,
-            vec_in_dim=768,
-            context_in_dim=4096,
-            hidden_size=3072,
-            mlp_ratio=4.0,
-            num_heads=24,
-            depth=19,
-            depth_single_blocks=38,
-            axes_dim=[16, 56, 56],
-            theta=10_000,
-            qkv_bias=True,
-            guidance_embed=False,
-        ),
-        ae_path=os.getenv("AE"),
-        ae_params=AutoEncoderParams(
-            resolution=256,
-            in_channels=3,
-            ch=128,
-            out_ch=3,
-            ch_mult=[1, 2, 4, 4],
-            num_res_blocks=2,
-            z_channels=16,
-            scale_factor=0.3611,
-            shift_factor=0.1159,
-        ),
-    ),
-}
-
-
-def print_load_warning(missing: list[str], unexpected: list[str]) -> None:
-    if len(missing) > 0 and len(unexpected) > 0:
-        print(f"Got {len(missing)} missing keys:\n\t" + "\n\t".join(missing))
-        print("\n" + "-" * 79 + "\n")
-        print(f"Got {len(unexpected)} unexpected keys:\n\t" + "\n\t".join(unexpected))
-    elif len(missing) > 0:
-        print(f"Got {len(missing)} missing keys:\n\t" + "\n\t".join(missing))
-    elif len(unexpected) > 0:
-        print(f"Got {len(unexpected)} unexpected keys:\n\t" + "\n\t".join(unexpected))
-
-
-def load_flow_model(
-    name: str, device: str | torch.device = "cuda", hf_download: bool = True
-):
-    # Loading Flux
-    print("Init model")
-    ckpt_path = configs[name].ckpt_path
-    if (
-        ckpt_path is None
-        and configs[name].repo_id is not None
-        and configs[name].repo_flow is not None
-        and hf_download
-    ):
-        ckpt_path = hf_hub_download(configs[name].repo_id, configs[name].repo_flow)
-
-    with torch.device("meta" if ckpt_path is not None else device):
-        model = Flux(configs[name].params).to(torch.bfloat16)
-
-    if ckpt_path is not None:
-        print("Loading checkpoint")
-        # load_sft doesn't support torch.device
-        sd = load_sft(ckpt_path, device=str(device))
-        missing, unexpected = model.load_state_dict(sd, strict=False, assign=True)
-        print_load_warning(missing, unexpected)
-    return model
-
-
-def load_t5(device: str | torch.device = "cuda", max_length: int = 512) -> HFEmbedder:
-    # max length 64, 128, 256 and 512 should work (if your sequence is short enough)
-    return HFEmbedder(
-        "google/t5-v1_1-xxl", max_length=max_length, torch_dtype=torch.bfloat16
-    ).to(device)
-
-
-def load_clip(device: str | torch.device = "cuda") -> HFEmbedder:
-    return HFEmbedder(
-        "openai/clip-vit-large-patch14", max_length=77, torch_dtype=torch.bfloat16
-    ).to(device)
-
-
-def load_ae(
-    name: str, device: str | torch.device = "cuda", hf_download: bool = True
-) -> AutoEncoder:
-    ckpt_path = configs[name].ae_path
-    if (
-        ckpt_path is None
-        and configs[name].repo_id is not None
-        and configs[name].repo_ae is not None
-        and hf_download
-    ):
-        ckpt_path = hf_hub_download(configs[name].repo_id, configs[name].repo_ae)
-
-    # Loading the autoencoder
-    print("Init AE")
-    with torch.device("meta" if ckpt_path is not None else device):
-        ae = AutoEncoder(configs[name].ae_params)
-
-    if ckpt_path is not None:
-        sd = load_sft(ckpt_path, device=str(device))
-        missing, unexpected = ae.load_state_dict(sd, strict=False, assign=True)
-        print_load_warning(missing, unexpected)
-    return ae
-
-
-class WatermarkEmbedder:
-    def __init__(self, watermark):
-        self.watermark = watermark
-        self.num_bits = len(WATERMARK_BITS)
-        self.encoder = WatermarkEncoder()
-        self.encoder.set_watermark("bits", self.watermark)
-
-    def __call__(self, image: torch.Tensor) -> torch.Tensor:
-        """
-        Adds a predefined watermark to the input image
-
-        Args:
-            image: ([N,] B, RGB, H, W) in range [-1, 1]
-
-        Returns:
-            same as input but watermarked
-        """
-        image = 0.5 * image + 0.5
-        squeeze = len(image.shape) == 4
-        if squeeze:
-            image = image[None, ...]
-        n = image.shape[0]
-        image_np = rearrange(
-            (255 * image).detach().cpu(), "n b c h w -> (n b) h w c"
-        ).numpy()[:, :, :, ::-1]
-        # torch (b, c, h, w) in [0, 1] -> numpy (b, h, w, c) [0, 255]
-        # watermarking libary expects input as cv2 BGR format
-        for k in range(image_np.shape[0]):
-            image_np[k] = self.encoder.encode(image_np[k], "dwtDct")
-        image = torch.from_numpy(
-            rearrange(image_np[:, :, :, ::-1], "(n b) h w c -> n b c h w", n=n)
-        ).to(image.device)
-        image = torch.clamp(image / 255, min=0.0, max=1.0)
-        if squeeze:
-            image = image[0]
-        image = 2 * image - 1
-        return image
-
-
-# A fixed 48-bit message that was choosen at random
-WATERMARK_MESSAGE = 0b001010101111111010000111100111001111010100101110
-# bin(x)[2:] gives bits of x as str, use int to convert them to 0/1
-WATERMARK_BITS = [int(bit) for bit in bin(WATERMARK_MESSAGE)[2:]]
-embed_watermark = WatermarkEmbedder(WATERMARK_BITS)
diff --git a/videotuna/models/hunyuan/hyvideo_i2v/__init__.py b/videotuna/models/hunyuan/hyvideo_i2v/__init__.py
deleted file mode 100644
index e69de29b..00000000
diff --git a/videotuna/models/hunyuan/hyvideo_i2v/config.py b/videotuna/models/hunyuan/hyvideo_i2v/config.py
deleted file mode 100644
index b40513f8..00000000
--- a/videotuna/models/hunyuan/hyvideo_i2v/config.py
+++ /dev/null
@@ -1,601 +0,0 @@
-import argparse
-from .constants import *
-import re
-from .modules.models import HUNYUAN_VIDEO_CONFIG
-
-
-def parse_args(mode="eval", namespace=None):
-    parser = argparse.ArgumentParser(description="HunyuanVideo inference/lora training script")
-
-    parser = add_network_args(parser)
-    parser = add_extra_models_args(parser)
-    parser = add_denoise_schedule_args(parser)
-    parser = add_i2v_args(parser)
-    parser = add_lora_args(parser)
-    parser = add_inference_args(parser)
-    parser = add_parallel_args(parser)
-    if mode == "train":
-        parser = add_training_args(parser)
-        parser = add_optimizer_args(parser)
-        parser = add_deepspeed_args(parser)
-        parser = add_data_args(parser)
-        parser = add_train_denoise_schedule_args(parser)
-
-    args = parser.parse_args(namespace=namespace)
-    args = sanity_check_args(args)
-
-    return args
-
-def add_train_denoise_schedule_args(parser: argparse.ArgumentParser):
-    group = parser.add_argument_group(title="Denoise schedule")
-
-    group.add_argument("--flow-path-type", type=str, default="linear", choices=FLOW_PATH_TYPE,
-                       help="Path type for flow matching schedulers.")
-    group.add_argument("--flow-predict-type", type=str, default="velocity", choices=FLOW_PREDICT_TYPE,
-                       help="Prediction type for flow matching schedulers.")
-    group.add_argument("--flow-loss-weight", type=str, default=None, choices=FLOW_LOSS_WEIGHT,
-                       help="Loss weight type for flow matching schedulers.")
-    group.add_argument("--flow-train-eps", type=float, default=None,
-                       help="Small epsilon for avoiding instability during training.")
-    group.add_argument("--flow-sample-eps", type=float, default=None,
-                       help="Small epsilon for avoiding instability during sampling.")
-    group.add_argument("--flow-snr-type", type=str, default="lognorm", choices=FLOW_SNR_TYPE,
-                       help="Type of SNR to use for flow matching schedulers.")
-
-    return parser
-
-def add_deepspeed_args(parser: argparse.ArgumentParser):
-    group = parser.add_argument_group(title="DeepSpeed")
-
-    group.add_argument("--local_rank", type=int, default=-1, help="Local rank for distributed training.")
-    group.add_argument("--zero-stage", type=int, default=0, choices=[0, 1, 2, 3],
-                       help="DeepSpeed ZeRO stage. 0: off, 1: offload optimizer, 2: offload parameters, "
-                            "3: offload optimizer and parameters.")
-    return parser
-
-def add_data_args(parser: argparse.ArgumentParser):
-    group = parser.add_argument_group(title="Data")
-
-    group.add_argument("--data-type", type=str, default="image", choices=DATA_TYPE, help="Type of the dataset.")
-    group.add_argument("--data-jsons-path", type=str, default=None, help="Dataset path for training.")
-    group.add_argument("--sample-n-frames", type=int, default=65,
-                       help="How many frames to sample from a video. if using 3d vae, the number should be 4n+1")
-    group.add_argument("--sample-stride", type=int, default=1,
-                       help="How many frames to skip when sampling from a video.")
-    group.add_argument("--num-workers", type=int, default=4, help="Number of workers for data loading.")
-    group.add_argument("--prefetch-factor", type=int, default=2, help="Prefetch factor for data loading.")
-    group.add_argument("--same-data-batch", action="store_true", help="Use same data type for all rank in a batch for training.")
-    group.add_argument("--uncond-p", type=float, default=0.1,
-                       help="Probability of randomly dropping video description.")
-    group.add_argument("--sematic-cond-drop-p", type=float, default=0.1,
-                       help="Probability of randomly dropping img condition description.")
-
-    return parser
-
-def add_training_args(parser: argparse.ArgumentParser):
-    group = parser.add_argument_group(title="Training")
-
-    group.add_argument("--task-flag", type=str, required=True,
-                       help="Task flag for training/inference. It is used to determine the experiment directory.")
-    group.add_argument("--output-dir", type=str, required=True, help="Directory to save logs and models")
-    group.add_argument("--sample-dir", type=str, default=None, required=False, help="Directory to save samples")
-    group.add_argument("--micro-batch-size", type=int, default=1, nargs='*',
-                       help="Batch size per model instance (local batch size).")
-    group.add_argument("--video-micro-batch-size", type=int, default=None, nargs='*',
-                       help="Batch size per model instance (local batch size).")
-    group.add_argument("--global-batch-size", type=int, default=None, nargs='*',
-                       help="Global batch size (across all model instances). "
-                            "global-batch-size = micro-batch-size * world-size * gradient-accumulation-steps")
-    group.add_argument("--gradient-accumulation-steps", type=int, default=1,
-                       help="Number of steps to accumulate gradients over before performing an update.")
-    group.add_argument("--global-seed", type=int, default=42, help="Global seed for reproducibility.")
-
-    group.add_argument("--resume", type=str, default=None,
-                       help="Path to the checkpoint to resume training. It can be an experiment index to resume from "
-                            "the latest checkpoint in the output directory.")
-    group.add_argument("--init-from", type=str, default=None,
-                       help="Path to the checkpoint to load from init ckpt for training. ")
-    group.add_argument("--training-parts", type=str, default=None, help="Training a subset of the model parameters.")
-    group.add_argument("--init-save", action="store_true", help="Save the initial model before training.")
-    group.set_defaults(final_save=True)
-    group.add_argument("--final-save", action="store_true", help="Save the final model after training.")
-    group.add_argument("--no-final-save", dest="final_save", action="store_false", help="Do not save the final model.")
-
-    group.add_argument("--epochs", type=int, default=100, help="Number of epochs to train.")
-    group.add_argument("--max-training-steps", type=int, default=10_000_000, help="Maximum number of training steps.")
-    group.add_argument("--ckpt-every", type=int, default=5000, help="Save checkpoint every N steps.")
-
-    group.add_argument("--rope-theta-rescale-factor", type=float, default=1.0, nargs='+',
-                       help="Rope interpolation factor.")
-    group.add_argument("--rope-interpolation-factor", type=float, default=1.0, nargs='+',
-                       help="Rope interpolation factor.")
-
-    group.add_argument("--log-every", type=int, default=10, help="Log every N update steps.")
-    group.add_argument("--tensorboard", action="store_true", help="Enable TensorBoard logging.")
-    group.add_argument("--profile", action="store_true", help="Enable PyTorch profiler.")
-    return parser
-
-def add_optimizer_args(parser: argparse.ArgumentParser):
-    group = parser.add_argument_group(title="Optimizer")
-
-    # Learning rate
-    group.add_argument("--lr", type=float, default=1e-4,
-                       help="Basic learning rate, varies depending on learning rate schedule and warmup.")
-    group.add_argument("--warmup-min-lr", type=float, default=1e-6, help="Minimum learning rate for warmup.")
-    group.add_argument("--warmup-num-steps", type=int, default=0, help="Number of warmup steps for learning rate.")
-
-    # Optimizer
-    group.add_argument("--adam-beta1", type=float, default=0.9,
-                       help="[AdamW] First coefficient for computing running averages of gradient.")
-    group.add_argument("--adam-beta2", type=float, default=0.999,
-                       help="[AdamW] Second coefficient for computing running averages of gradient square.")
-    group.add_argument("--adam-eps", type=float, default=1e-8,
-                       help="[AdamW] Term added to the denominator to improve numerical stability.")
-    group.add_argument("--weight-decay", type=float, default=0,
-                       help="Weight decay coefficient for L2 regularization.")
-    return parser
-
-def add_train_args(parser: argparse.ArgumentParser):
-    group = parser.add_argument_group(title="HunyuanVideo train args")
-
-
-    return parser
-
-def add_network_args(parser: argparse.ArgumentParser):
-    group = parser.add_argument_group(title="HunyuanVideo network args")
-
-    # Main model
-    group.add_argument(
-        "--model",
-        type=str,
-        choices=list(HUNYUAN_VIDEO_CONFIG.keys()),
-        default="HYVideo-T/2-cfgdistill",
-    )
-    group.add_argument(
-        "--latent-channels",
-        type=str,
-        default=16,
-        help="Number of latent channels of DiT. If None, it will be determined by `vae`. If provided, "
-        "it still needs to match the latent channels of the VAE model.",
-    )
-    group.add_argument(
-        "--precision",
-        type=str,
-        default="bf16",
-        choices=PRECISIONS,
-        help="Precision mode. Options: fp32, fp16, bf16. Applied to the backbone model and optimizer.",
-    )
-
-    # RoPE
-    group.add_argument(
-        "--rope-theta", type=int, default=256, help="Theta used in RoPE."
-    )
-
-    group.add_argument("--gradient-checkpoint", action="store_true",
-                       help="Enable gradient checkpointing to reduce memory usage.")
-
-    group.add_argument("--gradient-checkpoint-layers", type=int, default=-1,
-                       help="Number of layers to checkpoint. -1 for all layers. `n` for the first n layers.")
-
-    return parser
-
-
-def add_extra_models_args(parser: argparse.ArgumentParser):
-    group = parser.add_argument_group(
-        title="Extra models args, including vae, text encoders and tokenizers)"
-    )
-
-    # - VAE
-    group.add_argument(
-        "--vae",
-        type=str,
-        default="884-16c-hy",
-        choices=list(VAE_PATH),
-        help="Name of the VAE model.",
-    )
-    group.add_argument(
-        "--vae-precision",
-        type=str,
-        default="fp16",
-        choices=PRECISIONS,
-        help="Precision mode for the VAE model.",
-    )
-    group.add_argument(
-        "--vae-tiling",
-        action="store_true",
-        help="Enable tiling for the VAE model to save GPU memory.",
-    )
-    group.set_defaults(vae_tiling=True)
-
-    group.add_argument(
-        "--text-encoder",
-        type=str,
-        default="llm-i2v",
-        choices=list(TEXT_ENCODER_PATH),
-        help="Name of the text encoder model.",
-    )
-    group.add_argument(
-        "--text-encoder-precision",
-        type=str,
-        default="fp16",
-        choices=PRECISIONS,
-        help="Precision mode for the text encoder model.",
-    )
-    group.add_argument(
-        "--text-states-dim",
-        type=int,
-        default=4096,
-        help="Dimension of the text encoder hidden states.",
-    )
-    group.add_argument(
-        "--text-len", type=int, default=256, help="Maximum length of the text input."
-    )
-    group.add_argument(
-        "--tokenizer",
-        type=str,
-        default="llm-i2v",
-        choices=list(TOKENIZER_PATH),
-        help="Name of the tokenizer model.",
-    )
-    group.add_argument(
-        "--prompt-template",
-        type=str,
-        default="dit-llm-encode-i2v",
-        choices=PROMPT_TEMPLATE,
-        help="Image prompt template for the decoder-only text encoder model.",
-    )
-    group.add_argument(
-        "--prompt-template-video",
-        type=str,
-        default="dit-llm-encode-video-i2v",
-        choices=PROMPT_TEMPLATE,
-        help="Video prompt template for the decoder-only text encoder model.",
-    )
-    group.add_argument(
-        "--hidden-state-skip-layer",
-        type=int,
-        default=2,
-        help="Skip layer for hidden states.",
-    )
-    group.add_argument(
-        "--apply-final-norm",
-        action="store_true",
-        help="Apply final normalization to the used text encoder hidden states.",
-    )
-
-    # - CLIP
-    group.add_argument(
-        "--text-encoder-2",
-        type=str,
-        default="clipL",
-        choices=list(TEXT_ENCODER_PATH),
-        help="Name of the second text encoder model.",
-    )
-    group.add_argument(
-        "--text-encoder-precision-2",
-        type=str,
-        default="fp16",
-        choices=PRECISIONS,
-        help="Precision mode for the second text encoder model.",
-    )
-    group.add_argument(
-        "--text-states-dim-2",
-        type=int,
-        default=768,
-        help="Dimension of the second text encoder hidden states.",
-    )
-    group.add_argument(
-        "--tokenizer-2",
-        type=str,
-        default="clipL",
-        choices=list(TOKENIZER_PATH),
-        help="Name of the second tokenizer model.",
-    )
-    group.add_argument(
-        "--text-len-2",
-        type=int,
-        default=77,
-        help="Maximum length of the second text input.",
-    )
-
-    return parser
-
-
-def add_denoise_schedule_args(parser: argparse.ArgumentParser):
-    group = parser.add_argument_group(title="Denoise schedule args")
-
-    group.add_argument(
-        "--denoise-type",
-        type=str,
-        default="flow",
-        help="Denoise type for noised inputs.",
-    )
-
-    # Flow Matching
-    group.add_argument(
-        "--flow-shift",
-        type=float,
-        default=17.0,
-        help="Shift factor for flow matching schedulers.",
-    )
-    group.add_argument(
-        "--flow-reverse",
-        action="store_true",
-        help="If reverse, learning/sampling from t=1 -> t=0.",
-    )
-    group.add_argument(
-        "--flow-solver",
-        type=str,
-        default="euler",
-        help="Solver for flow matching.",
-    )
-    group.add_argument(
-        "--use-linear-quadratic-schedule",
-        action="store_true",
-        help="Use linear quadratic schedule for flow matching."
-        "Following MovieGen (https://ai.meta.com/static-resource/movie-gen-research-paper)",
-    )
-    group.add_argument(
-        "--linear-schedule-end",
-        type=int,
-        default=25,
-        help="End step for linear quadratic schedule for flow matching.",
-    )
-
-    return parser
-
-
-def add_inference_args(parser: argparse.ArgumentParser):
-    group = parser.add_argument_group(title="Inference args")
-
-    # ======================== Model loads ========================
-    group.add_argument(
-        "--model-base",
-        type=str,
-        default="ckpts",
-        help="Root path of all the models, including t2v models and extra models.",
-    )
-    group.add_argument(
-        "--dit-weight",
-        type=str,
-        default="ckpts/hunyuan-video-t2v-720p/transformers/mp_rank_00_model_states.pt",
-        help="Path to the HunyuanVideo model. If None, search the model in the args.model_root."
-        "1. If it is a file, load the model directly."
-        "2. If it is a directory, search the model in the directory. Support two types of models: "
-        "1) named `pytorch_model_*.pt`"
-        "2) named `*_model_states.pt`, where * can be `mp_rank_00`.",
-    )
-    group.add_argument(
-        "--i2v-dit-weight",
-        type=str,
-        default="ckpts/hunyuan-video-i2v-720p/transformers/mp_rank_00_model_states.pt",
-        help="Path to the HunyuanVideo model. If None, search the model in the args.model_root."
-        "1. If it is a file, load the model directly."
-        "2. If it is a directory, search the model in the directory. Support two types of models: "
-        "1) named `pytorch_model_*.pt`"
-        "2) named `*_model_states.pt`, where * can be `mp_rank_00`.",
-    )
-    group.add_argument(
-        "--model-resolution",
-        type=str,
-        default="540p",
-        choices=["540p", "720p"],
-        help="Root path of all the models, including t2v models and extra models.",
-    )
-    group.add_argument(
-        "--load-key",
-        type=str,
-        default="module",
-        help="Key to load the model states. 'module' for the main model, 'ema' for the EMA model.",
-    )
-    group.add_argument(
-        "--use-cpu-offload",
-        action="store_true",
-        help="Use CPU offload for the model load.",
-    )
-
-    # ======================== Inference general setting ========================
-    group.add_argument(
-        "--batch-size",
-        type=int,
-        default=1,
-        help="Batch size for inference and evaluation.",
-    )
-    group.add_argument(
-        "--infer-steps",
-        type=int,
-        default=50,
-        help="Number of denoising steps for inference.",
-    )
-    group.add_argument(
-        "--disable-autocast",
-        action="store_true",
-        help="Disable autocast for denoising loop and vae decoding in pipeline sampling.",
-    )
-    group.add_argument(
-        "--save-path",
-        type=str,
-        default="./results",
-        help="Path to save the generated samples.",
-    )
-    group.add_argument(
-        "--save-path-suffix",
-        type=str,
-        default="",
-        help="Suffix for the directory of saved samples.",
-    )
-    group.add_argument(
-        "--name-suffix",
-        type=str,
-        default="",
-        help="Suffix for the names of saved samples.",
-    )
-    group.add_argument(
-        "--num-videos",
-        type=int,
-        default=1,
-        help="Number of videos to generate for each prompt.",
-    )
-    # ---sample size---
-    group.add_argument(
-        "--video-size",
-        type=int,
-        nargs="+",
-        default=(720, 1280),
-        help="Video size for training. If a single value is provided, it will be used for both height "
-        "and width. If two values are provided, they will be used for height and width "
-        "respectively.",
-    )
-    group.add_argument(
-        "--video-length",
-        type=int,
-        default=129,
-        help="How many frames to sample from a video. if using 3d vae, the number should be 4n+1",
-    )
-    # --- prompt ---
-    group.add_argument(
-        "--prompt",
-        type=str,
-        default=None,
-        help="Prompt for sampling during evaluation.",
-    )
-    group.add_argument(
-        "--seed-type",
-        type=str,
-        default="auto",
-        choices=["file", "random", "fixed", "auto"],
-        help="Seed type for evaluation. If file, use the seed from the CSV file. If random, generate a "
-        "random seed. If fixed, use the fixed seed given by `--seed`. If auto, `csv` will use the "
-        "seed column if available, otherwise use the fixed `seed` value. `prompt` will use the "
-        "fixed `seed` value.",
-    )
-    group.add_argument("--seed", type=int, default=None, help="Seed for evaluation.")
-
-    # Classifier-Free Guidance
-    group.add_argument(
-        "--neg-prompt", type=str, default=None, help="Negative prompt for sampling."
-    )
-    group.add_argument(
-        "--cfg-scale", type=float, default=1.0, help="Classifier free guidance scale."
-    )
-    group.add_argument(
-        "--embedded-cfg-scale",
-        type=float,
-        default=None,
-        help="Embeded classifier free guidance scale.",
-    )
-
-    group.add_argument(
-        "--use-fp8",
-        action="store_true",
-        help="Enable use fp8 for inference acceleration."
-    )
-
-    group.add_argument(
-        "--reproduce",
-        action="store_true",
-        help="Enable reproducibility by setting random seeds and deterministic algorithms.",
-    )
-
-    return parser
-
-def add_i2v_args(parser: argparse.ArgumentParser):
-    group = parser.add_argument_group(title="I2V args")
-
-    group.add_argument(
-        "--i2v-mode",
-        action="store_true",
-        help="Whether to open i2v mode."
-    )
-
-    group.add_argument(
-        "--i2v-resolution",
-        type=str,
-        default="720p",
-        choices=["720p", "540p", "360p"],
-        help="Resolution for i2v inference."
-    )
-
-    group.add_argument(
-        "--i2v-image-path",
-        type=str,
-        default="./assets/demo/i2v/imgs/0.png",
-        help="Image path for i2v inference."
-    )
-
-    group.add_argument(
-        "--i2v-condition-type",
-        type=str,
-        default="token_replace",
-        choices=["token_replace", "latent_concat"],
-        help="Condition type for i2v model."
-    )
-
-    group.add_argument(
-        "--i2v-stability", action="store_true", help="Whether to use i2v stability mode."
-    )
-
-    return parser
-
-
-def add_lora_args(parser: argparse.ArgumentParser):
-    group = parser.add_argument_group(title="lora args")
-
-    group.add_argument(
-        "--use-lora", action="store_true", help="Whether to open lora mode."
-    )
-
-    group.add_argument(
-        "--lora-path", type=str, default="", help="Weight path for lora model."
-    )
-
-    group.add_argument(
-        "--lora-scale", type=float, default=1.0, help="Fusion scale for lora model."
-    )
-
-    group.add_argument(
-        "--lora-rank", type=int, default=64, help="Rank for lora model."
-    )
-
-    return parser
-
-def add_parallel_args(parser: argparse.ArgumentParser):
-    group = parser.add_argument_group(title="Parallel args")
-
-    # ======================== Model loads ========================
-    group.add_argument(
-        "--ulysses-degree",
-        type=int,
-        default=1,
-        help="Ulysses degree for xdit parallel args.",
-    )
-    group.add_argument(
-        "--ring-degree",
-        type=int,
-        default=1,
-        help="Ring degree for xdit parallel args.",
-    )
-    group.add_argument(
-        "--xdit-adaptive-size",
-        action="store_true",
-        help="Make the generated video has no black padding.")
-
-
-    return parser
-
-
-def sanity_check_args(args):
-    # VAE channels
-    vae_pattern = r"\d{2,3}-\d{1,2}c-\w+"
-    if not re.match(vae_pattern, args.vae):
-        raise ValueError(
-            f"Invalid VAE model: {args.vae}. Must be in the format of '{vae_pattern}'."
-        )
-    vae_channels = int(args.vae.split("-")[1][:-1])
-    if args.latent_channels is None:
-        args.latent_channels = vae_channels
-    if vae_channels != args.latent_channels:
-        raise ValueError(
-            f"Latent channels ({args.latent_channels}) must match the VAE channels ({vae_channels})."
-        )
-    return args
diff --git a/videotuna/models/hunyuan/hyvideo_i2v/constants.py b/videotuna/models/hunyuan/hyvideo_i2v/constants.py
deleted file mode 100644
index bfd16499..00000000
--- a/videotuna/models/hunyuan/hyvideo_i2v/constants.py
+++ /dev/null
@@ -1,164 +0,0 @@
-import os
-import torch
-
-__all__ = [
-    "C_SCALE",
-    "PROMPT_TEMPLATE",
-    "MODEL_BASE",
-    "PRECISIONS",
-    "NORMALIZATION_TYPE",
-    "ACTIVATION_TYPE",
-    "VAE_PATH",
-    "TEXT_ENCODER_PATH",
-    "TOKENIZER_PATH",
-    "TEXT_PROJECTION",
-    "DATA_TYPE",
-    "NEGATIVE_PROMPT",
-    "NEGATIVE_PROMPT_I2V",
-    "FLOW_PATH_TYPE",
-    "FLOW_PREDICT_TYPE",
-    "FLOW_LOSS_WEIGHT",
-    "FLOW_SNR_TYPE",
-    "FLOW_SOLVER",
-]
-
-PRECISION_TO_TYPE = {
-    'fp32': torch.float32,
-    'fp16': torch.float16,
-    'bf16': torch.bfloat16,
-}
-
-# =================== Constant Values =====================
-# Computation scale factor, 1P = 1_000_000_000_000_000. Tensorboard will display the value in PetaFLOPS to avoid
-# overflow error when tensorboard logging values.
-C_SCALE = 1_000_000_000_000_000
-
-# When using decoder-only models, we must provide a prompt template to instruct the text encoder
-# on how to generate the text.
-# --------------------------------------------------------------------
-PROMPT_TEMPLATE_ENCODE = (
-    "<|start_header_id|>system<|end_header_id|>\n\nDescribe the image by detailing the color, shape, size, texture, "
-    "quantity, text, spatial relationships of the objects and background:<|eot_id|>"
-    "<|start_header_id|>user<|end_header_id|>\n\n{}<|eot_id|>"
-) 
-PROMPT_TEMPLATE_ENCODE_VIDEO = (
-    "<|start_header_id|>system<|end_header_id|>\n\nDescribe the video by detailing the following aspects: "
-    "1. The main content and theme of the video."
-    "2. The color, shape, size, texture, quantity, text, and spatial relationships of the objects."
-    "3. Actions, events, behaviors temporal relationships, physical movement changes of the objects."
-    "4. background environment, light, style and atmosphere."
-    "5. camera angles, movements, and transitions used in the video:<|eot_id|>"
-    "<|start_header_id|>user<|end_header_id|>\n\n{}<|eot_id|>"
-)  
-
-PROMPT_TEMPLATE_ENCODE_I2V = (
-    "<|start_header_id|>system<|end_header_id|>\n\n<image>\nDescribe the image by detailing the color, shape, size, texture, "
-    "quantity, text, spatial relationships of the objects and background:<|eot_id|>"
-    "<|start_header_id|>user<|end_header_id|>\n\n{}<|eot_id|>"
-    "<|start_header_id|>assistant<|end_header_id|>\n\n"
-)
-
-PROMPT_TEMPLATE_ENCODE_VIDEO_I2V = (
-    "<|start_header_id|>system<|end_header_id|>\n\n<image>\nDescribe the video by detailing the following aspects according to the reference image: "
-    "1. The main content and theme of the video."
-    "2. The color, shape, size, texture, quantity, text, and spatial relationships of the objects."
-    "3. Actions, events, behaviors temporal relationships, physical movement changes of the objects."
-    "4. background environment, light, style and atmosphere."
-    "5. camera angles, movements, and transitions used in the video:<|eot_id|>\n\n"
-    "<|start_header_id|>user<|end_header_id|>\n\n{}<|eot_id|>"
-    "<|start_header_id|>assistant<|end_header_id|>\n\n"
-)
-
-NEGATIVE_PROMPT = "Aerial view, aerial view, overexposed, low quality, deformation, a poor composition, bad hands, bad teeth, bad eyes, bad limbs, distortion"
-NEGATIVE_PROMPT_I2V = "deformation, a poor composition and deformed video, bad teeth, bad eyes, bad limbs"
-
-PROMPT_TEMPLATE = {
-    "dit-llm-encode": {
-        "template": PROMPT_TEMPLATE_ENCODE,
-        "crop_start": 36,
-    },
-    "dit-llm-encode-video": {
-        "template": PROMPT_TEMPLATE_ENCODE_VIDEO,
-        "crop_start": 95,
-    },
-    "dit-llm-encode-i2v": {
-        "template": PROMPT_TEMPLATE_ENCODE_I2V,
-        "crop_start": 36,
-        "image_emb_start": 5,
-        "image_emb_end": 581,
-        "image_emb_len": 576,
-        "double_return_token_id": 271
-    },
-    "dit-llm-encode-video-i2v": {
-        "template": PROMPT_TEMPLATE_ENCODE_VIDEO_I2V,
-        "crop_start": 103,
-        "image_emb_start": 5,
-        "image_emb_end": 581,
-        "image_emb_len": 576,
-        "double_return_token_id": 271
-    },
-}
-
-# ======================= Model ======================
-PRECISIONS = {"fp32", "fp16", "bf16"}
-NORMALIZATION_TYPE = {"layer", "rms"}
-ACTIVATION_TYPE = {"relu", "silu", "gelu", "gelu_tanh"}
-
-# =================== Model Path =====================
-MODEL_BASE = os.getenv("MODEL_BASE", "checkpoints/hunyuanvideo/HunyuanVideo-I2V")
-
-# =================== Data =======================
-DATA_TYPE = {"image", "video", "image_video"}
-
-# 3D VAE
-VAE_PATH = {"884-16c-hy": f"{MODEL_BASE}/hunyuan-video-i2v-720p/vae"}
-
-# Text Encoder
-TEXT_ENCODER_PATH = {
-    "clipL": f"{MODEL_BASE}/text_encoder_2",
-    "llm": f"{MODEL_BASE}/text_encoder",
-    "llm-i2v": f"{MODEL_BASE}/text_encoder_i2v",
-}
-
-# Tokenizer
-TOKENIZER_PATH = {
-    "clipL": f"{MODEL_BASE}/text_encoder_2",
-    "llm": f"{MODEL_BASE}/text_encoder",
-    "llm-i2v": f"{MODEL_BASE}/text_encoder_i2v",
-}
-
-TEXT_PROJECTION = {
-    "linear",  # Default, an nn.Linear() layer
-    "single_refiner",  # Single TokenRefiner. Refer to LI-DiT
-}
-
-# Flow Matching path type
-FLOW_PATH_TYPE = {
-    "linear",               # Linear trajectory between noise and data
-    "gvp",                  # Generalized variance-preserving SDE
-    "vp",                   # Variance-preserving SDE
-}
-
-# Flow Matching predict type
-FLOW_PREDICT_TYPE = {
-    "velocity",             # Predict velocity
-    "score",                # Predict score
-    "noise",                # Predict noise
-}
-
-# Flow Matching loss weight
-FLOW_LOSS_WEIGHT = {
-    "velocity",             # Weight loss by velocity
-    "likelihood",           # Weight loss by likelihood
-}
-
-# Flow Matching SNR type
-FLOW_SNR_TYPE = {
-    "lognorm",              # Log-normal SNR
-    "uniform",              # Uniform SNR
-}
-
-# Flow Matching solvers
-FLOW_SOLVER = {
-    "euler",                # Euler solver
-}
\ No newline at end of file
diff --git a/videotuna/models/hunyuan/hyvideo_i2v/dataset/video_loader.py b/videotuna/models/hunyuan/hyvideo_i2v/dataset/video_loader.py
deleted file mode 100644
index c7c04f06..00000000
--- a/videotuna/models/hunyuan/hyvideo_i2v/dataset/video_loader.py
+++ /dev/null
@@ -1,184 +0,0 @@
-import random
-import os
-import io
-import torch
-import numpy as np
-import json
-import traceback
-import time
-import pyarrow as pa
-
-from torch.utils.data import Dataset
-
-class VideoDataset(Dataset):
-    def __init__(self,
-                 data_jsons_path: str,
-                 sample_n_frames: int = 129,
-                 sample_stride: int = 1,
-                 text_encoder=None,
-                 text_encoder_2=None,
-                 uncond_p=0.0,
-                 args=None,
-                 logger=None,
-                 ) -> None:
-        """_summary_
-
-        Args:
-            data_jsons_path (str): input data json path
-            sample_n_frames (int, optional): training video length. Defaults to 129.
-            sample_stride (int, optional): video frame sample stride. Defaults to 1 (No strid).
-            text_encoder (_type_, optional): text encoder to tokenize. Defaults to None.
-            text_encoder_2 (_type_, optional): second text encoder to tokenize. Defaults to None.
-            uncond_p (float, optional): text uncondition prod. Defaults to 0.0.
-            args (_type_, optional): args. Defaults to None.
-            logger (_type_, optional): logger. Defaults to None.
-        """
-        self.args = args
-        self.sample_n_frames = sample_n_frames
-        self.sample_stride = sample_stride
-        self.text_encoder = text_encoder
-        self.text_encoder_2 = text_encoder_2
-        self.uncond_p = uncond_p
-
-        if logger is None:
-            from loguru import logger
-        self.logger = logger
-
-        json_files = os.listdir(data_jsons_path)
-
-        video_id_list = []
-        latent_shape_list = []
-        prompt_list = []
-        npy_save_path_list = []
-        height_list = []
-        width_list = []
-        for json_file in json_files:
-            with open(f"{data_jsons_path}/{json_file}", 'r', encoding='utf-8-sig') as file:
-                data = json.load(file)
-            video_id = data.get('video_id')
-            latent_shape = data.get('latent_shape')
-            prompt = data.get('prompt')
-            npy_save_path = data.get('npy_save_path')
-
-            video_id_list.append(video_id)
-            latent_shape_list.append(latent_shape)
-            prompt_list.append(prompt)
-            npy_save_path_list.append(npy_save_path)
-            height_list.append(latent_shape[3])
-            width_list.append(latent_shape[4])
-
-        schema = pa.schema([
-            ('video_id', pa.string()),
-            ('latent_shape', pa.list_(pa.int64())),
-            ('prompt', pa.string()),
-            ('npy_save_path', pa.string()),
-            ('height', pa.int64()),
-            ('width', pa.int64()),
-        ])
-
-        video_id_array = pa.array(video_id_list, type=pa.string())
-        latent_shape_array = pa.array(latent_shape_list, type=pa.list_(pa.int64()))
-        prompt_array = pa.array(prompt_list, type=pa.string())
-        npy_save_path_array = pa.array(npy_save_path_list, type=pa.string())
-        height_array = pa.array(height_list, type=pa.int64())
-        width_array = pa.array(width_list, type=pa.int64())
-
-        record_batch = pa.RecordBatch.from_arrays([video_id_array, latent_shape_array, prompt_array,
-                                                   npy_save_path_array, height_array, width_array], schema=schema)
-        self.table = pa.Table.from_batches([record_batch])
-
-        s_time = time.time()
-        logger.info(f"load {data_jsons_path} \t cost {time.time() - s_time} s \t total length {len(self.table)}")
-
-    def __len__(self):
-        return len(self.table)
-
-    def get_data_info(self, index):
-
-        latent_shape = self.table['latent_shape'][index].as_py()
-        assert isinstance(latent_shape, list), "latent_shape must be list"
-        num_frames = latent_shape[-3]
-        height = latent_shape[-2]
-        width = latent_shape[-1]
-        num_frames = (num_frames - 1) * 4 + 1
-
-        return {'height': height,
-                'width': width,
-                'num_frames': num_frames}
-
-    @staticmethod
-    def get_text_tokens(text_encoder, description):
-        text_inputs = text_encoder.text2tokens(description, data_type='video')
-        text_ids = text_inputs["input_ids"].squeeze(0)
-        text_mask = text_inputs["attention_mask"].squeeze(0)
-        return text_ids, text_mask
-
-    def get_batch(self, idx):
-        videoid = self.table['video_id'][idx].as_py()
-        prompt = self.table['prompt'][idx].as_py()
-        pixel_values = torch.tensor(0)
-
-        if random.random() < self.uncond_p:
-            prompt = ''
-
-        text_ids, text_mask = self.get_text_tokens(self.text_encoder, prompt)
-        sample_n_frames = self.sample_n_frames
-
-        cache_path = self.table['npy_save_path'][idx].as_py()
-        latents = torch.from_numpy(np.load(cache_path)).squeeze(0)
-        sample_n_latent = (sample_n_frames - 1) // 4 + 1
-        start_idx = 0
-        latents = latents[:, start_idx:start_idx + sample_n_latent, ...]
-
-        if latents.shape[1] < sample_n_latent:
-            raise Exception(
-                f' videoid: {videoid} has wrong cache data for temporal buckets of shape {latents.shape}, expected length: {sample_n_latent}')
-
-        data_info = self.get_data_info(idx)
-        num_frames, height, width = data_info['num_frames'], data_info['height'], data_info['width']
-        kwargs = {
-            "text": prompt,
-            "index": idx,
-            "type": 'video',
-            'bucket': [num_frames, height, width],
-            "videoid": videoid
-        }
-        if self.text_encoder_2 is None:
-            return (
-                pixel_values,
-                latents,
-                text_ids.clone(),
-                text_mask.clone(),
-                {k: torch.as_tensor(v) if not isinstance(v, str) else v for k, v in kwargs.items()},
-            )
-        else:
-            text_ids_2, text_mask_2 = self.get_text_tokens(self.text_encoder_2, prompt)
-            return (
-                pixel_values,
-                latents,
-                text_ids.clone(),
-                text_mask.clone(),
-                text_ids_2.clone(),
-                text_mask_2.clone(),
-                {k: torch.as_tensor(v) if not isinstance(v, str) else v for k, v in kwargs.items()},
-            )
-
-    def __getitem__(self, idx):
-        try_times = 100
-        for i in range(try_times):
-            try:
-                return self.get_batch(idx)
-            except Exception as e:
-                self.logger.warning(
-                    f"Error details: {str(e)}-{self.table['video_id'][idx]}-{traceback.format_exc()}\n")
-                idx = np.random.randint(len(self))
-
-        raise RuntimeError('Too many bad data.')
-
-if __name__ == "__main__":
-
-    data_jsons_path = "test_path"
-    dataset = VideoDataset(args=None,
-                                      data_jsons_path=data_jsons_path)
-
-    print(dataset.__getitem__(0))
diff --git a/videotuna/models/hunyuan/hyvideo_i2v/diffusion/__init__.py b/videotuna/models/hunyuan/hyvideo_i2v/diffusion/__init__.py
deleted file mode 100644
index d3575c9f..00000000
--- a/videotuna/models/hunyuan/hyvideo_i2v/diffusion/__init__.py
+++ /dev/null
@@ -1,88 +0,0 @@
-from .pipelines import HunyuanVideoPipeline
-from .schedulers import FlowMatchDiscreteScheduler
-from .flow.transport import *
-
-def create_transport(
-        *,
-        path_type,
-        prediction,
-        loss_weight=None,
-        train_eps=None,
-        sample_eps=None,
-        snr_type="uniform",
-        shift=1.0,
-        video_shift=None,
-        reverse=False,
-):
-    if prediction == "noise":
-        model_type = ModelType.NOISE
-    elif prediction == "score":
-        model_type = ModelType.SCORE
-    else:
-        model_type = ModelType.VELOCITY
-
-    if loss_weight == "velocity":
-        loss_type = WeightType.VELOCITY
-    elif loss_weight == "likelihood":
-        loss_type = WeightType.LIKELIHOOD
-    else:
-        loss_type = WeightType.NONE
-
-    if snr_type == "lognorm":
-        snr_type = SNRType.LOGNORM
-    elif snr_type == "uniform":
-        snr_type = SNRType.UNIFORM
-    else:
-        raise ValueError(f"Invalid snr type {snr_type}")
-
-    if video_shift is None:
-        video_shift = shift
-
-    path_choice = {
-        "linear": PathType.LINEAR,
-        "gvp": PathType.GVP,
-        "vp": PathType.VP,
-    }
-
-    path_type = path_choice[path_type.lower()]
-
-    if path_type in [PathType.VP]:
-        train_eps = 1e-5 if train_eps is None else train_eps
-        sample_eps = 1e-3 if train_eps is None else sample_eps
-    elif path_type in [PathType.GVP, PathType.LINEAR] and model_type != ModelType.VELOCITY:
-        train_eps = 1e-3 if train_eps is None else train_eps
-        sample_eps = 1e-3 if train_eps is None else sample_eps
-    else:  # velocity & [GVP, LINEAR] is stable everywhere
-        train_eps = 0
-        sample_eps = 0
-
-    # create flow state
-    state = Transport(
-        model_type=model_type,
-        path_type=path_type,
-        loss_type=loss_type,
-        train_eps=train_eps,
-        sample_eps=sample_eps,
-        snr_type=snr_type,
-        shift=shift,
-        video_shift=video_shift,
-        reverse=reverse,
-    )
-
-    return state
-
-def load_denoiser(args):
-    if args.denoise_type == "flow":
-        denoiser = create_transport(path_type=args.flow_path_type,
-                                    prediction=args.flow_predict_type,
-                                    loss_weight=args.flow_loss_weight,
-                                    train_eps=args.flow_train_eps,
-                                    sample_eps=args.flow_sample_eps,
-                                    snr_type=args.flow_snr_type,
-                                    shift=args.flow_shift,
-                                    video_shift=args.flow_shift,
-                                    reverse=args.flow_reverse,
-                                    )
-    else:
-        raise ValueError(f"Unknown denoise type: {args.denoise_type}")
-    return denoiser
\ No newline at end of file
diff --git a/videotuna/models/hunyuan/hyvideo_i2v/diffusion/flow/__init__.py b/videotuna/models/hunyuan/hyvideo_i2v/diffusion/flow/__init__.py
deleted file mode 100644
index 72560eab..00000000
--- a/videotuna/models/hunyuan/hyvideo_i2v/diffusion/flow/__init__.py
+++ /dev/null
@@ -1,73 +0,0 @@
-from .transport import ModelType, PathType, Sampler, SNRType, Transport, WeightType
-
-
-def create_transport(
-    path_type="linear",
-    prediction="velocity",
-    loss_weight=None,
-    train_eps=None,
-    sample_eps=None,
-    snr_type="uniform",
-):
-    """function for creating Transport object
-    **Note**: model prediction defaults to velocity
-    Args:
-    - path_type: type of path to use; default to linear
-    - learn_score: set model prediction to score
-    - learn_noise: set model prediction to noise
-    - velocity_weighted: weight loss by velocity weight
-    - likelihood_weighted: weight loss by likelihood weight
-    - train_eps: small epsilon for avoiding instability during training
-    - sample_eps: small epsilon for avoiding instability during sampling
-    """
-
-    if prediction == "noise":
-        model_type = ModelType.NOISE
-    elif prediction == "score":
-        model_type = ModelType.SCORE
-    else:
-        model_type = ModelType.VELOCITY
-
-    if loss_weight == "velocity":
-        loss_type = WeightType.VELOCITY
-    elif loss_weight == "likelihood":
-        loss_type = WeightType.LIKELIHOOD
-    else:
-        loss_type = WeightType.NONE
-
-    if snr_type == "lognorm":
-        snr_type = SNRType.LOGNORM
-    elif snr_type == "uniform":
-        snr_type = SNRType.UNIFORM
-    else:
-        raise ValueError(f"Invalid snr type {snr_type}")
-
-    path_choice = {
-        "linear": PathType.LINEAR,
-        "gvp": PathType.GVP,
-        "vp": PathType.VP,
-    }
-
-    path_type = path_choice[path_type.lower()]
-
-    if path_type in [PathType.VP]:
-        train_eps = 1e-5 if train_eps is None else train_eps
-        sample_eps = 1e-3 if train_eps is None else sample_eps
-    elif path_type in [PathType.GVP, PathType.LINEAR] and model_type != ModelType.VELOCITY:
-        train_eps = 1e-3 if train_eps is None else train_eps
-        sample_eps = 1e-3 if train_eps is None else sample_eps
-    else:  # velocity & [GVP, LINEAR] is stable everywhere
-        train_eps = 0
-        sample_eps = 0
-
-    # create flow state
-    state = Transport(
-        model_type=model_type,
-        path_type=path_type,
-        loss_type=loss_type,
-        train_eps=train_eps,
-        sample_eps=sample_eps,
-        snr_type=snr_type,
-    )
-
-    return state
diff --git a/videotuna/models/hunyuan/hyvideo_i2v/diffusion/flow/integrators.py b/videotuna/models/hunyuan/hyvideo_i2v/diffusion/flow/integrators.py
deleted file mode 100644
index 7cbd115d..00000000
--- a/videotuna/models/hunyuan/hyvideo_i2v/diffusion/flow/integrators.py
+++ /dev/null
@@ -1,125 +0,0 @@
-import torch as th
-
-
-class sde:
-    """SDE solver class"""
-
-    def __init__(
-        self,
-        drift,
-        diffusion,
-        *,
-        t0,
-        t1,
-        num_steps,
-        sampler_type,
-    ):
-        assert t0 < t1, "SDE sampler has to be in forward time"
-
-        self.num_timesteps = num_steps
-        self.t = th.linspace(t0, t1, num_steps)
-        self.dt = self.t[1] - self.t[0]
-        self.drift = drift
-        self.diffusion = diffusion
-        self.sampler_type = sampler_type
-
-    def __Euler_Maruyama_step(self, x, mean_x, t, model, **model_kwargs):
-        w_cur = th.randn(x.size()).to(x)
-        t = th.ones(x.size(0)).to(x) * t
-        dw = w_cur * th.sqrt(self.dt)
-        drift = self.drift(x, t, model, **model_kwargs)
-        diffusion = self.diffusion(x, t)
-        mean_x = x + drift * self.dt
-        x = mean_x + th.sqrt(2 * diffusion) * dw
-        return x, mean_x
-
-    def __Heun_step(self, x, _, t, model, **model_kwargs):
-        w_cur = th.randn(x.size()).to(x)
-        dw = w_cur * th.sqrt(self.dt)
-        t_cur = th.ones(x.size(0)).to(x) * t
-        diffusion = self.diffusion(x, t_cur)
-        xhat = x + th.sqrt(2 * diffusion) * dw
-        K1 = self.drift(xhat, t_cur, model, **model_kwargs)
-        xp = xhat + self.dt * K1
-        K2 = self.drift(xp, t_cur + self.dt, model, **model_kwargs)
-        return (
-            xhat + 0.5 * self.dt * (K1 + K2),
-            xhat,
-        )  # at last time point we do not perform the heun step
-
-    def __forward_fn(self):
-        """TODO: generalize here by adding all private functions ending with steps to it"""
-        sampler_dict = {
-            "Euler": self.__Euler_Maruyama_step,
-            "Heun": self.__Heun_step,
-        }
-
-        try:
-            sampler = sampler_dict[self.sampler_type]
-        except:
-            raise NotImplementedError("Smapler type not implemented.")
-
-        return sampler
-
-    def sample(self, init, model, **model_kwargs):
-        """forward loop of sde"""
-        x = init
-        mean_x = init
-        samples = []
-        sampler = self.__forward_fn()
-        for ti in self.t[:-1]:
-            with th.no_grad():
-                x, mean_x = sampler(x, mean_x, ti, model, **model_kwargs)
-                samples.append(x)
-
-        return samples
-
-
-class ode:
-    """ODE solver class"""
-
-    def __init__(
-        self,
-        drift,
-        *,
-        t0,
-        t1,
-        sampler_type,
-        num_steps,
-        atol,
-        rtol,
-        time_shifting_factor=None,
-    ):
-        assert t0 < t1, "ODE sampler has to be in forward time"
-
-        self.drift = drift
-        self.t = th.linspace(t0, t1, num_steps)
-        if time_shifting_factor:
-            self.t = self.t / (self.t + time_shifting_factor - time_shifting_factor * self.t)
-        self.atol = atol
-        self.rtol = rtol
-        self.sampler_type = sampler_type
-
-    def sample(self, x, model, **model_kwargs):
-        from torchdiffeq import odeint
-        device = x[0].device if isinstance(x, tuple) else x.device
-
-        def _fn(t, x):
-            t = th.ones(x[0].size(0)).to(device) * t if isinstance(x, tuple) else th.ones(x.size(0)).to(device) * t
-            model_output = self.drift(x, t, model, **model_kwargs)
-            return model_output
-
-        t = self.t.to(device)
-        atol = [self.atol] * len(x) if isinstance(x, tuple) else [self.atol]
-        rtol = [self.rtol] * len(x) if isinstance(x, tuple) else [self.rtol]
-        samples = odeint(_fn, x, t, method=self.sampler_type, atol=atol, rtol=rtol)
-        return samples
-
-    def sample_with_step_fn(self, x, step_fn):
-        from torchdiffeq import odeint
-        device = x[0].device if isinstance(x, tuple) else x.device
-        t = self.t.to(device)
-        atol = [self.atol] * len(x) if isinstance(x, tuple) else [self.atol]
-        rtol = [self.rtol] * len(x) if isinstance(x, tuple) else [self.rtol]
-        samples = odeint(step_fn, x, t, method=self.sampler_type, atol=atol, rtol=rtol)
-        return samples
diff --git a/videotuna/models/hunyuan/hyvideo_i2v/diffusion/flow/path.py b/videotuna/models/hunyuan/hyvideo_i2v/diffusion/flow/path.py
deleted file mode 100644
index ead4020a..00000000
--- a/videotuna/models/hunyuan/hyvideo_i2v/diffusion/flow/path.py
+++ /dev/null
@@ -1,208 +0,0 @@
-import numpy as np
-import torch as th
-
-
-def expand_t_like_x(t, x):
-    """Function to reshape time t to broadcastable dimension of x
-    Args:
-      t: [batch_dim,], time vector
-      x: [batch_dim,...], data point
-    """
-    dims = [1] * len(x[0].size())
-    t = t.view(t.size(0), *dims)
-    return t
-
-class ICPlan:
-    """Linear Coupling Plan"""
-
-    def __init__(self, sigma=0.0, reverse=False):
-        self.sigma = sigma
-        self.reverse = reverse
-
-    def compute_alpha_t(self, t):
-        """Compute the data coefficient along the path"""
-        if self.reverse:
-            return 1 - t, -1
-        else:
-            return t, 1
-
-    def compute_sigma_t(self, t):
-        """Compute the noise coefficient along the path"""
-        if self.reverse:
-            return t, 1
-        else:
-            return 1 - t, -1
-
-    def compute_d_alpha_alpha_ratio_t(self, t):
-        """Compute the ratio between d_alpha and alpha"""
-        return 1 / t
-
-    def compute_drift(self, x, t):
-        """We always output sde according to score parametrization;"""
-        t = expand_t_like_x(t, x)
-        alpha_ratio = self.compute_d_alpha_alpha_ratio_t(t)
-        sigma_t, d_sigma_t = self.compute_sigma_t(t)
-        drift = alpha_ratio * x
-        diffusion = alpha_ratio * (sigma_t**2) - sigma_t * d_sigma_t
-
-        return -drift, diffusion
-
-    def compute_diffusion(self, x, t, form="constant", norm=1.0):
-        """Compute the diffusion term of the SDE
-        Args:
-          x: [batch_dim, ...], data point
-          t: [batch_dim,], time vector
-          form: str, form of the diffusion term
-          norm: float, norm of the diffusion term
-        """
-        t = expand_t_like_x(t, x)
-        choices = {
-            "constant": norm,
-            "SBDM": norm * self.compute_drift(x, t)[1],
-            "sigma": norm * self.compute_sigma_t(t)[0],
-            "linear": norm * (1 - t),
-            "decreasing": 0.25 * (norm * th.cos(np.pi * t) + 1) ** 2,
-            "inccreasing-decreasing": norm * th.sin(np.pi * t) ** 2,
-        }
-
-        try:
-            diffusion = choices[form]
-        except KeyError:
-            raise NotImplementedError(f"Diffusion form {form} not implemented")
-
-        return diffusion
-
-    def get_score_from_velocity(self, velocity, x, t):
-        """Wrapper function: transfrom velocity prediction model to score
-        Args:
-            velocity: [batch_dim, ...] shaped tensor; velocity model output
-            x: [batch_dim, ...] shaped tensor; x_t data point
-            t: [batch_dim,] time tensor
-        """
-        t = expand_t_like_x(t, x)
-        alpha_t, d_alpha_t = self.compute_alpha_t(t)
-        sigma_t, d_sigma_t = self.compute_sigma_t(t)
-        mean = x
-        reverse_alpha_ratio = alpha_t / d_alpha_t
-        var = sigma_t**2 - reverse_alpha_ratio * d_sigma_t * sigma_t
-        score = (reverse_alpha_ratio * velocity - mean) / var
-        return score
-
-    def get_noise_from_velocity(self, velocity, x, t):
-        """Wrapper function: transfrom velocity prediction model to denoiser
-        Args:
-            velocity: [batch_dim, ...] shaped tensor; velocity model output
-            x: [batch_dim, ...] shaped tensor; x_t data point
-            t: [batch_dim,] time tensor
-        """
-        t = expand_t_like_x(t, x)
-        alpha_t, d_alpha_t = self.compute_alpha_t(t)
-        sigma_t, d_sigma_t = self.compute_sigma_t(t)
-        mean = x
-        reverse_alpha_ratio = alpha_t / d_alpha_t
-        var = reverse_alpha_ratio * d_sigma_t - sigma_t
-        noise = (reverse_alpha_ratio * velocity - mean) / var
-        return noise
-
-    def get_velocity_from_score(self, score, x, t):
-        """Wrapper function: transfrom score prediction model to velocity
-        Args:
-            score: [batch_dim, ...] shaped tensor; score model output
-            x: [batch_dim, ...] shaped tensor; x_t data point
-            t: [batch_dim,] time tensor
-        """
-        t = expand_t_like_x(t, x)
-        drift, var = self.compute_drift(x, t)
-        velocity = var * score - drift
-        return velocity
-
-    def compute_mu_t(self, t, x0, x1):
-        """Compute the mean of time-dependent density p_t"""
-        t = expand_t_like_x(t, x1)
-        alpha_t, _ = self.compute_alpha_t(t)
-        sigma_t, _ = self.compute_sigma_t(t)
-        if isinstance(x1, (list, tuple)):
-            return [alpha_t[i] * x1[i] + sigma_t[i] * x0[i] for i in range(len(x1))]
-        else:
-            return alpha_t * x1 + sigma_t * x0
-
-    def compute_xt(self, t, x0, x1):
-        """Sample xt from time-dependent density p_t; rng is required"""
-        xt = self.compute_mu_t(t, x0, x1)
-        return xt
-
-    def compute_ut(self, t, x0, x1, xt):
-        """Compute the vector field corresponding to p_t"""
-        t = expand_t_like_x(t, x1)
-        _, d_alpha_t = self.compute_alpha_t(t)
-        _, d_sigma_t = self.compute_sigma_t(t)
-        if isinstance(x1, (list, tuple)):
-            return [d_alpha_t * x1[i] + d_sigma_t * x0[i] for i in range(len(x1))]
-        else:
-            return d_alpha_t * x1 + d_sigma_t * x0
-
-    def plan(self, t, x0, x1):
-        xt = self.compute_xt(t, x0, x1)
-        ut = self.compute_ut(t, x0, x1, xt)
-        return t, xt, ut
-
-
-class VPCPlan(ICPlan):
-    """class for VP path flow matching"""
-
-    def __init__(self, sigma_min=0.1, sigma_max=20.0, reverse=False):
-        self.sigma_min = sigma_min
-        self.sigma_max = sigma_max
-        self.log_mean_coeff = (
-            lambda t: -0.25 * ((1 - t) ** 2) * (self.sigma_max - self.sigma_min) - 0.5 * (1 - t) * self.sigma_min
-        )
-        self.d_log_mean_coeff = lambda t: 0.5 * (1 - t) * (self.sigma_max - self.sigma_min) + 0.5 * self.sigma_min
-        self.reverse = reverse
-        if self.reverse:
-            raise NotImplementedError("Reverse VPCPlan is not implemented")
-
-    def compute_alpha_t(self, t):
-        """Compute coefficient of x1"""
-        alpha_t = self.log_mean_coeff(t)
-        alpha_t = th.exp(alpha_t)
-        d_alpha_t = alpha_t * self.d_log_mean_coeff(t)
-        return alpha_t, d_alpha_t
-
-    def compute_sigma_t(self, t):
-        """Compute coefficient of x0"""
-        p_sigma_t = 2 * self.log_mean_coeff(t)
-        sigma_t = th.sqrt(1 - th.exp(p_sigma_t))
-        d_sigma_t = th.exp(p_sigma_t) * (2 * self.d_log_mean_coeff(t)) / (-2 * sigma_t)
-        return sigma_t, d_sigma_t
-
-    def compute_d_alpha_alpha_ratio_t(self, t):
-        """Special purposed function for computing numerical stabled d_alpha_t / alpha_t"""
-        return self.d_log_mean_coeff(t)
-
-    def compute_drift(self, x, t):
-        """Compute the drift term of the SDE"""
-        t = expand_t_like_x(t, x)
-        beta_t = self.sigma_min + (1 - t) * (self.sigma_max - self.sigma_min)
-        return -0.5 * beta_t * x, beta_t / 2
-
-class GVPCPlan(ICPlan):
-    def __init__(self, sigma=0.0, reverse=False):
-        super().__init__(sigma)
-        if self.reverse:
-            raise NotImplementedError("Reverse GVPCPlan is not implemented")
-
-    def compute_alpha_t(self, t):
-        """Compute coefficient of x1"""
-        alpha_t = th.sin(t * np.pi / 2)
-        d_alpha_t = np.pi / 2 * th.cos(t * np.pi / 2)
-        return alpha_t, d_alpha_t
-
-    def compute_sigma_t(self, t):
-        """Compute coefficient of x0"""
-        sigma_t = th.cos(t * np.pi / 2)
-        d_sigma_t = -np.pi / 2 * th.sin(t * np.pi / 2)
-        return sigma_t, d_sigma_t
-
-    def compute_d_alpha_alpha_ratio_t(self, t):
-        """Special purposed function for computing numerical stabled d_alpha_t / alpha_t"""
-        return np.pi / (2 * th.tan(t * np.pi / 2))
diff --git a/videotuna/models/hunyuan/hyvideo_i2v/diffusion/flow/transport.py b/videotuna/models/hunyuan/hyvideo_i2v/diffusion/flow/transport.py
deleted file mode 100644
index 00ace622..00000000
--- a/videotuna/models/hunyuan/hyvideo_i2v/diffusion/flow/transport.py
+++ /dev/null
@@ -1,520 +0,0 @@
-import enum
-import math
-from typing import Callable
-import copy
-import numpy as np
-import torch as th
-
-from . import path
-from .integrators import ode, sde
-from .utils import mean_flat
-from videotuna.models.hunyuan.hyvideo_i2v.constants import PRECISION_TO_TYPE
-
-__all__ = ["ModelType", "PathType", "WeightType", "Transport", "Sampler", "SNRType"]
-
-
-class ModelType(enum.Enum):
-    """
-    Which type of output the model predicts.
-    """
-
-    NOISE = enum.auto()  # the model predicts epsilon
-    SCORE = enum.auto()  # the model predicts \nabla \log p(x)
-    VELOCITY = enum.auto()  # the model predicts v(x)
-
-
-class PathType(enum.Enum):
-    """
-    Which type of path to use.
-    """
-
-    LINEAR = enum.auto()
-    GVP = enum.auto()
-    VP = enum.auto()
-
-
-class WeightType(enum.Enum):
-    """
-    Which type of weighting to use.
-    """
-
-    NONE = enum.auto()
-    VELOCITY = enum.auto()
-    LIKELIHOOD = enum.auto()
-
-
-class SNRType(enum.Enum):
-    UNIFORM = enum.auto()
-    LOGNORM = enum.auto()
-
-
-def get_lin_function(
-    x1: float = 256, y1: float = 0.5, x2: float = 4096, y2: float = 1.15
-) -> Callable[[float], float]:
-    m = (y2 - y1) / (x2 - x1)
-    b = y1 - m * x1
-    return lambda x: m * x + b
-
-
-def time_shift(mu: float, sigma: float, t: th.Tensor):
-    return math.exp(mu) / (math.exp(mu) + (1 / t - 1) ** sigma)
-
-
-class Transport:
-    def __init__(self, *, model_type, path_type, loss_type, train_eps, sample_eps, snr_type,
-                 training_timesteps=1000, reverse_time_schedule=False, shift=1.0, video_shift=None, reverse=False,
-                 ):
-        path_options = {
-            PathType.LINEAR: path.ICPlan,
-            PathType.GVP: path.GVPCPlan,
-            PathType.VP: path.VPCPlan,
-        }
-
-        self.loss_type = loss_type
-        self.model_type = model_type
-        self.path_sampler = path_options[path_type](reverse=reverse)
-        self.train_eps = train_eps
-        self.sample_eps = sample_eps
-
-        self.snr_type = snr_type
-        # timestep shift: http://arxiv.org/abs/2403.03206
-        self.shift = shift  # flow matching shift factor, =sqrt(m/n)
-        if video_shift is None: video_shift = shift # if video shift is not given, set it to be the same as flow shift
-        self.video_shift = video_shift
-        self.reverse = reverse
-
-        self.training_timesteps = training_timesteps
-        self.reverse_time_schedule = reverse_time_schedule
-
-    def prior_logp(self, z):
-        """
-        Standard multivariate normal prior
-        Assume z is batched
-        """
-        shape = th.tensor(z.size())
-        N = th.prod(shape[1:])
-        _fn = lambda x: -N / 2.0 * np.log(2 * np.pi) - th.sum(x**2) / 2.0
-        return th.vmap(_fn)(z)
-
-    def check_interval(
-        self,
-        train_eps,
-        sample_eps,
-        *,
-        diffusion_form="SBDM",
-        sde=False,
-        reverse=False,
-        eval=False,
-        last_step_size=0.0,
-    ):
-        t0 = 0
-        t1 = 1
-        eps = train_eps if not eval else sample_eps
-        if type(self.path_sampler) in [path.VPCPlan]:
-            t1 = 1 - eps if (not sde or last_step_size == 0) else 1 - last_step_size
-
-        elif (type(self.path_sampler) in [path.ICPlan, path.GVPCPlan]) and (
-            self.model_type != ModelType.VELOCITY or sde
-        ):  # avoid numerical issue by taking a first semi-implicit step
-            t0 = eps if (diffusion_form == "SBDM" and sde) or self.model_type != ModelType.VELOCITY else 0
-            t1 = 1 - eps if (not sde or last_step_size == 0) else 1 - last_step_size
-
-        if reverse:
-            t0, t1 = 1 - t0, 1 - t1
-
-        return t0, t1
-
-    def sample(self, x1, n_tokens=None):
-        """Sampling x0 & t based on shape of x1 (if needed)
-        Args:
-          x1 - data point; [batch, *dim]
-        """
-        if isinstance(x1, (list, tuple)):
-            x0 = [th.randn_like(img_start) for img_start in x1]
-        else:
-            x0 = th.randn_like(x1)
-        t0, t1 = self.check_interval(self.train_eps, self.sample_eps)
-
-        if self.snr_type == SNRType.UNIFORM:
-            t = th.rand((len(x1),)) * (t1 - t0) + t0
-        elif self.snr_type == SNRType.LOGNORM:
-            u = th.normal(mean=0.0, std=1.0, size=(len(x1),))
-            t = 1 / (1 + th.exp(-u)) * (t1 - t0) + t0
-        else:
-            raise ValueError(f"Unknown snr type: {self.snr_type}")
-
-        if self.shift != 1.:
-            if self.reverse:
-                # xt = (1 - t) * x1 + t * x0
-                t = (self.shift * t) / (1 + (self.shift - 1) * t)
-            else:
-                # xt = t * x1 + (1 - t) * x0
-                t = t / (self.shift - (self.shift - 1) * t)
-
-        t = t.to(x1[0])
-        return t, x0, x1
-
-    def get_model_t(self, t):
-        if self.reverse_time_schedule:
-            return (1 - t) * self.training_timesteps
-        else:
-            return t * self.training_timesteps
-
-    def training_losses(self, model, x1, model_kwargs=None, timestep=None, n_tokens=None,
-                        i2v_mode=False, cond_latents=None, args=None):
-
-        self.shift = self.video_shift
-        if model_kwargs == None:
-            model_kwargs = {}
-
-        t, x0, x1 = self.sample(x1, n_tokens)
-        if timestep is not None:
-            t = th.ones_like(t) * timestep
-        t, xt, ut = self.path_sampler.plan(t, x0, x1)
-        input_t = self.get_model_t(t)
-
-        if i2v_mode and args.i2v_condition_type == "latent_concat":
-            if cond_latents is not None:
-                x1_concat = cond_latents.repeat(1,1,x1.shape[2],1,1)
-                x1_concat[:, :, 1:, :, :] = 0.0
-            else:
-                x1_concat = x1.cpu().clone().to(device=x1.device)
-                x1_concat[:, :, 1:, :, :] = 0.0
-
-            mask_concat = th.ones(x1.shape[0], 1, x1.shape[2], x1.shape[3], x1.shape[4]).to(device=x1.device)
-            mask_concat[:, :, 1:, ...] = 0.0
-
-            xt = th.concat([xt, x1_concat, mask_concat], dim=1)
-        elif i2v_mode and args.i2v_condition_type == "token_replace":
-            xt = th.concat([cond_latents, xt[:, :, 1:, :, :]], dim=2)
-
-        guidance_expand = (
-            th.tensor(
-                [args.embedded_cfg_scale] * x1.shape[0],
-                dtype=th.float32,
-                device=x1.device,
-            ).to(PRECISION_TO_TYPE[args.precision])
-            * 1000.0
-            if args.embedded_cfg_scale is not None
-            else None
-        )
-        model_kwargs["guidance"] = guidance_expand
-
-        model_output = model(xt, input_t, **model_kwargs)['x']
-
-        if i2v_mode and args.i2v_condition_type == "token_replace":
-            assert self.model_type == ModelType.VELOCITY, f"self.model_type: {self.model_type} must be ModelType.VELOCITY"
-            model_output = model_output[:, :, 1:, :, :]
-            ut = ut[:, :, 1:, :, :]
-
-        if not i2v_mode:
-            assert model_output.size() == xt.size(), f"Output shape from model does not match input shape: " \
-                                                 f"{model_output.size()} != {xt.size()}"
-
-        terms = {}
-        if self.model_type == ModelType.VELOCITY:
-            terms["loss"] = mean_flat(((model_output - ut) ** 2))
-        else:
-            _, drift_var = self.path_sampler.compute_drift(xt, t)
-            sigma_t, _ = self.path_sampler.compute_sigma_t(path.expand_t_like_x(t, xt))
-            if self.loss_type in [WeightType.VELOCITY]:
-                weight = (drift_var / sigma_t) ** 2
-            elif self.loss_type in [WeightType.LIKELIHOOD]:
-                weight = drift_var / (sigma_t ** 2)
-            elif self.loss_type in [WeightType.NONE]:
-                weight = 1
-            else:
-                raise NotImplementedError()
-
-            if self.model_type == ModelType.NOISE:
-                terms['loss'] = mean_flat(weight * ((model_output - x0) ** 2))
-            else:
-                terms['loss'] = mean_flat(weight * ((model_output * sigma_t + x0) ** 2))
-
-        return model_output, terms
-
-    def get_drift(self):
-        """member function for obtaining the drift of the probability flow ODE"""
-
-        def score_ode(x, t, model, **model_kwargs):
-            drift_mean, drift_var = self.path_sampler.compute_drift(x, t)
-            model_output = model(x, t, **model_kwargs)
-            return -drift_mean + drift_var * model_output  # by change of variable
-
-        def noise_ode(x, t, model, **model_kwargs):
-            drift_mean, drift_var = self.path_sampler.compute_drift(x, t)
-            sigma_t, _ = self.path_sampler.compute_sigma_t(path.expand_t_like_x(t, x))
-            model_output = model(x, t, **model_kwargs)
-            score = model_output / -sigma_t
-            return -drift_mean + drift_var * score
-
-        def velocity_ode(x, t, model, **model_kwargs):
-            model_output = model(x, t, **model_kwargs)
-            return model_output
-
-        if self.model_type == ModelType.NOISE:
-            drift_fn = noise_ode
-        elif self.model_type == ModelType.SCORE:
-            drift_fn = score_ode
-        else:
-            drift_fn = velocity_ode
-
-        def body_fn(x, t, model, **model_kwargs):
-            model_output = drift_fn(x, t, model, **model_kwargs)
-            assert model_output.shape == x.shape, "Output shape from ODE solver must match input shape"
-            return model_output
-
-        return body_fn
-
-    def get_score(
-        self,
-    ):
-        """member function for obtaining score of
-        x_t = alpha_t * x + sigma_t * eps"""
-        if self.model_type == ModelType.NOISE:
-            score_fn = (
-                lambda x, t, model, **kwargs: model(x, t, **kwargs)
-                / -self.path_sampler.compute_sigma_t(path.expand_t_like_x(t, x))[0]
-            )
-        elif self.model_type == ModelType.SCORE:
-            score_fn = lambda x, t, model, **kwagrs: model(x, t, **kwagrs)
-        elif self.model_type == ModelType.VELOCITY:
-            score_fn = lambda x, t, model, **kwargs: self.path_sampler.get_score_from_velocity(
-                model(x, t, **kwargs), x, t
-            )
-        else:
-            raise NotImplementedError()
-
-        return score_fn
-
-
-class Sampler:
-    """Sampler class for the transport model"""
-
-    def __init__(
-        self,
-        transport,
-    ):
-        """Constructor for a general sampler; supporting different sampling methods
-        Args:
-        - transport: an tranport object specify model prediction & interpolant type
-        """
-
-        self.transport = transport
-        self.drift = self.transport.get_drift()
-        self.score = self.transport.get_score()
-
-    def __get_sde_diffusion_and_drift(
-        self,
-        *,
-        diffusion_form="SBDM",
-        diffusion_norm=1.0,
-    ):
-        def diffusion_fn(x, t):
-            diffusion = self.transport.path_sampler.compute_diffusion(x, t, form=diffusion_form, norm=diffusion_norm)
-            return diffusion
-
-        sde_drift = lambda x, t, model, **kwargs: self.drift(x, t, model, **kwargs) + diffusion_fn(x, t) * self.score(
-            x, t, model, **kwargs
-        )
-
-        sde_diffusion = diffusion_fn
-
-        return sde_drift, sde_diffusion
-
-    def __get_last_step(
-        self,
-        sde_drift,
-        *,
-        last_step,
-        last_step_size,
-    ):
-        """Get the last step function of the SDE solver"""
-
-        if last_step is None:
-            last_step_fn = lambda x, t, model, **model_kwargs: x
-        elif last_step == "Mean":
-            last_step_fn = (
-                lambda x, t, model, **model_kwargs: x + sde_drift(x, t, model, **model_kwargs) * last_step_size
-            )
-        elif last_step == "Tweedie":
-            alpha = self.transport.path_sampler.compute_alpha_t  # simple aliasing; the original name was too long
-            sigma = self.transport.path_sampler.compute_sigma_t
-            last_step_fn = lambda x, t, model, **model_kwargs: x / alpha(t)[0][0] + (sigma(t)[0][0] ** 2) / alpha(t)[0][
-                0
-            ] * self.score(x, t, model, **model_kwargs)
-        elif last_step == "Euler":
-            last_step_fn = (
-                lambda x, t, model, **model_kwargs: x + self.drift(x, t, model, **model_kwargs) * last_step_size
-            )
-        else:
-            raise NotImplementedError()
-
-        return last_step_fn
-
-    def sample_sde(
-        self,
-        *,
-        sampling_method="Euler",
-        diffusion_form="SBDM",
-        diffusion_norm=1.0,
-        last_step="Mean",
-        last_step_size=0.04,
-        num_steps=250,
-    ):
-        """returns a sampling function with given SDE settings
-        Args:
-        - sampling_method: type of sampler used in solving the SDE; default to be Euler-Maruyama
-        - diffusion_form: function form of diffusion coefficient; default to be matching SBDM
-        - diffusion_norm: function magnitude of diffusion coefficient; default to 1
-        - last_step: type of the last step; default to identity
-        - last_step_size: size of the last step; default to match the stride of 250 steps over [0,1]
-        - num_steps: total integration step of SDE
-        """
-
-        if last_step is None:
-            last_step_size = 0.0
-
-        sde_drift, sde_diffusion = self.__get_sde_diffusion_and_drift(
-            diffusion_form=diffusion_form,
-            diffusion_norm=diffusion_norm,
-        )
-
-        t0, t1 = self.transport.check_interval(
-            self.transport.train_eps,
-            self.transport.sample_eps,
-            diffusion_form=diffusion_form,
-            sde=True,
-            eval=True,
-            reverse=False,
-            last_step_size=last_step_size,
-        )
-
-        _sde = sde(
-            sde_drift,
-            sde_diffusion,
-            t0=t0,
-            t1=t1,
-            num_steps=num_steps,
-            sampler_type=sampling_method,
-        )
-
-        last_step_fn = self.__get_last_step(sde_drift, last_step=last_step, last_step_size=last_step_size)
-
-        def _sample(init, model, **model_kwargs):
-            xs = _sde.sample(init, model, **model_kwargs)
-            ts = th.ones(init.size(0), device=init.device) * t1
-            x = last_step_fn(xs[-1], ts, model, **model_kwargs)
-            xs.append(x)
-
-            assert len(xs) == num_steps, "Samples does not match the number of steps"
-
-            return xs
-
-        return _sample
-
-    def sample_ode(
-        self,
-        *,
-        sampling_method="dopri5",
-        num_steps=50,
-        atol=1e-6,
-        rtol=1e-3,
-        reverse=False,
-        time_shifting_factor=None,
-    ):
-        """returns a sampling function with given ODE settings
-        Args:
-        - sampling_method: type of sampler used in solving the ODE; default to be Dopri5
-        - num_steps:
-            - fixed solver (Euler, Heun): the actual number of integration steps performed
-            - adaptive solver (Dopri5): the number of datapoints saved during integration; produced by interpolation
-        - atol: absolute error tolerance for the solver
-        - rtol: relative error tolerance for the solver
-        - reverse: whether solving the ODE in reverse (data to noise); default to False
-        """
-        if reverse:
-            drift = lambda x, t, model, **kwargs: self.drift(x, th.ones_like(t) * (1 - t), model, **kwargs)
-        else:
-            drift = self.drift
-
-        t0, t1 = self.transport.check_interval(
-            self.transport.train_eps,
-            self.transport.sample_eps,
-            sde=False,
-            eval=True,
-            reverse=reverse,
-            last_step_size=0.0,
-        )
-
-        _ode = ode(
-            drift=drift,
-            t0=t0,
-            t1=t1,
-            sampler_type=sampling_method,
-            num_steps=num_steps,
-            atol=atol,
-            rtol=rtol,
-            time_shifting_factor=time_shifting_factor,
-        )
-        self.ode = _ode
-        return _ode.sample
-
-    def sample_ode_likelihood(
-        self,
-        *,
-        sampling_method="dopri5",
-        num_steps=50,
-        atol=1e-6,
-        rtol=1e-3,
-    ):
-        """returns a sampling function for calculating likelihood with given ODE settings
-        Args:
-        - sampling_method: type of sampler used in solving the ODE; default to be Dopri5
-        - num_steps:
-            - fixed solver (Euler, Heun): the actual number of integration steps performed
-            - adaptive solver (Dopri5): the number of datapoints saved during integration; produced by interpolation
-        - atol: absolute error tolerance for the solver
-        - rtol: relative error tolerance for the solver
-        """
-
-        def _likelihood_drift(x, t, model, **model_kwargs):
-            x, _ = x
-            eps = th.randint(2, x.size(), dtype=th.float, device=x.device) * 2 - 1
-            t = th.ones_like(t) * (1 - t)
-            with th.enable_grad():
-                x.requires_grad = True
-                grad = th.autograd.grad(th.sum(self.drift(x, t, model, **model_kwargs) * eps), x)[0]
-                logp_grad = th.sum(grad * eps, dim=tuple(range(1, len(x.size()))))
-                drift = self.drift(x, t, model, **model_kwargs)
-            return (-drift, logp_grad)
-
-        t0, t1 = self.transport.check_interval(
-            self.transport.train_eps,
-            self.transport.sample_eps,
-            sde=False,
-            eval=True,
-            reverse=False,
-            last_step_size=0.0,
-        )
-
-        _ode = ode(
-            drift=_likelihood_drift,
-            t0=t0,
-            t1=t1,
-            sampler_type=sampling_method,
-            num_steps=num_steps,
-            atol=atol,
-            rtol=rtol,
-        )
-
-        def _sample_fn(x, model, **model_kwargs):
-            init_logp = th.zeros(x.size(0)).to(x)
-            input = (x, init_logp)
-            drift, delta_logp = _ode.sample(input, model, **model_kwargs)
-            drift, delta_logp = drift[-1], delta_logp[-1]
-            prior_logp = self.transport.prior_logp(drift)
-            logp = prior_logp - delta_logp
-            return logp, drift
-
-        return _sample_fn
diff --git a/videotuna/models/hunyuan/hyvideo_i2v/diffusion/flow/utils.py b/videotuna/models/hunyuan/hyvideo_i2v/diffusion/flow/utils.py
deleted file mode 100644
index 33ac64f4..00000000
--- a/videotuna/models/hunyuan/hyvideo_i2v/diffusion/flow/utils.py
+++ /dev/null
@@ -1,31 +0,0 @@
-import torch as th
-
-
-class EasyDict:
-    def __init__(self, sub_dict):
-        for k, v in sub_dict.items():
-            setattr(self, k, v)
-
-    def __getitem__(self, key):
-        return getattr(self, key)
-
-
-def mean_flat(x):
-    """
-    Take the mean over all non-batch dimensions.
-    """
-    return th.mean(x, dim=list(range(1, len(x.size()))))
-
-
-def log_state(state):
-    result = []
-
-    sorted_state = dict(sorted(state.items()))
-    for key, value in sorted_state.items():
-        # Check if the value is an instance of a class
-        if "<object" in str(value) or "object at" in str(value):
-            result.append(f"{key}: [{value.__class__.__name__}]")
-        else:
-            result.append(f"{key}: {value}")
-
-    return "\n".join(result)
diff --git a/videotuna/models/hunyuan/hyvideo_i2v/diffusion/pipelines/__init__.py b/videotuna/models/hunyuan/hyvideo_i2v/diffusion/pipelines/__init__.py
deleted file mode 100644
index e44cb619..00000000
--- a/videotuna/models/hunyuan/hyvideo_i2v/diffusion/pipelines/__init__.py
+++ /dev/null
@@ -1 +0,0 @@
-from .pipeline_hunyuan_video import HunyuanVideoPipeline
diff --git a/videotuna/models/hunyuan/hyvideo_i2v/diffusion/pipelines/pipeline_hunyuan_video.py b/videotuna/models/hunyuan/hyvideo_i2v/diffusion/pipelines/pipeline_hunyuan_video.py
deleted file mode 100644
index be53e98c..00000000
--- a/videotuna/models/hunyuan/hyvideo_i2v/diffusion/pipelines/pipeline_hunyuan_video.py
+++ /dev/null
@@ -1,1172 +0,0 @@
-# Copyright 2024 The HuggingFace Team. All rights reserved.
-#
-# Licensed under the Apache License, Version 2.0 (the "License");
-# you may not use this file except in compliance with the License.
-# You may obtain a copy of the License at
-#
-#     http://www.apache.org/licenses/LICENSE-2.0
-#
-# Unless required by applicable law or agreed to in writing, software
-# distributed under the License is distributed on an "AS IS" BASIS,
-# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
-# See the License for the specific language governing permissions and
-# limitations under the License.
-# ==============================================================================
-#
-# Modified from diffusers==0.29.2
-#
-# ==============================================================================
-import inspect
-from typing import Any, Callable, Dict, List, Optional, Union, Tuple
-import torch
-import torch.distributed as dist
-import numpy as np
-from dataclasses import dataclass
-from packaging import version
-
-from diffusers.callbacks import MultiPipelineCallbacks, PipelineCallback
-from diffusers.configuration_utils import FrozenDict
-from diffusers.image_processor import VaeImageProcessor
-from diffusers.loaders import LoraLoaderMixin, TextualInversionLoaderMixin
-from diffusers.models import AutoencoderKL
-from diffusers.models.lora import adjust_lora_scale_text_encoder
-from diffusers.schedulers import KarrasDiffusionSchedulers
-from diffusers.utils import (
-    USE_PEFT_BACKEND,
-    deprecate,
-    logging,
-    replace_example_docstring,
-    scale_lora_layers,
-    unscale_lora_layers,
-)
-from diffusers.utils.torch_utils import randn_tensor
-from diffusers.pipelines.pipeline_utils import DiffusionPipeline
-from diffusers.utils import BaseOutput
-
-from ...constants import PRECISION_TO_TYPE
-from ...vae.autoencoder_kl_causal_3d import AutoencoderKLCausal3D
-from ...text_encoder import TextEncoder
-from ...modules import HYVideoDiffusionTransformer
-from ...utils.data_utils import black_image
-
-logger = logging.get_logger(__name__)  # pylint: disable=invalid-name
-
-EXAMPLE_DOC_STRING = """"""
-
-
-def rescale_noise_cfg(noise_cfg, noise_pred_text, guidance_rescale=0.0):
-    """
-    Rescale `noise_cfg` according to `guidance_rescale`. Based on findings of [Common Diffusion Noise Schedules and
-    Sample Steps are Flawed](https://arxiv.org/pdf/2305.08891.pdf). See Section 3.4
-    """
-    std_text = noise_pred_text.std(
-        dim=list(range(1, noise_pred_text.ndim)), keepdim=True
-    )
-    std_cfg = noise_cfg.std(dim=list(range(1, noise_cfg.ndim)), keepdim=True)
-    # rescale the results from guidance (fixes overexposure)
-    noise_pred_rescaled = noise_cfg * (std_text / std_cfg)
-    # mix with the original results from guidance by factor guidance_rescale to avoid "plain looking" images
-    noise_cfg = (
-        guidance_rescale * noise_pred_rescaled + (1 - guidance_rescale) * noise_cfg
-    )
-    return noise_cfg
-
-
-def retrieve_timesteps(
-    scheduler,
-    num_inference_steps: Optional[int] = None,
-    device: Optional[Union[str, torch.device]] = None,
-    timesteps: Optional[List[int]] = None,
-    sigmas: Optional[List[float]] = None,
-    **kwargs,
-):
-    """
-    Calls the scheduler's `set_timesteps` method and retrieves timesteps from the scheduler after the call. Handles
-    custom timesteps. Any kwargs will be supplied to `scheduler.set_timesteps`.
-
-    Args:
-        scheduler (`SchedulerMixin`):
-            The scheduler to get timesteps from.
-        num_inference_steps (`int`):
-            The number of diffusion steps used when generating samples with a pre-trained model. If used, `timesteps`
-            must be `None`.
-        device (`str` or `torch.device`, *optional*):
-            The device to which the timesteps should be moved to. If `None`, the timesteps are not moved.
-        timesteps (`List[int]`, *optional*):
-            Custom timesteps used to override the timestep spacing strategy of the scheduler. If `timesteps` is passed,
-            `num_inference_steps` and `sigmas` must be `None`.
-        sigmas (`List[float]`, *optional*):
-            Custom sigmas used to override the timestep spacing strategy of the scheduler. If `sigmas` is passed,
-            `num_inference_steps` and `timesteps` must be `None`.
-
-    Returns:
-        `Tuple[torch.Tensor, int]`: A tuple where the first element is the timestep schedule from the scheduler and the
-        second element is the number of inference steps.
-    """
-    if timesteps is not None and sigmas is not None:
-        raise ValueError(
-            "Only one of `timesteps` or `sigmas` can be passed. Please choose one to set custom values"
-        )
-    if timesteps is not None:
-        accepts_timesteps = "timesteps" in set(
-            inspect.signature(scheduler.set_timesteps).parameters.keys()
-        )
-        if not accepts_timesteps:
-            raise ValueError(
-                f"The current scheduler class {scheduler.__class__}'s `set_timesteps` does not support custom"
-                f" timestep schedules. Please check whether you are using the correct scheduler."
-            )
-        scheduler.set_timesteps(timesteps=timesteps, device=device, **kwargs)
-        timesteps = scheduler.timesteps
-        num_inference_steps = len(timesteps)
-    elif sigmas is not None:
-        accept_sigmas = "sigmas" in set(
-            inspect.signature(scheduler.set_timesteps).parameters.keys()
-        )
-        if not accept_sigmas:
-            raise ValueError(
-                f"The current scheduler class {scheduler.__class__}'s `set_timesteps` does not support custom"
-                f" sigmas schedules. Please check whether you are using the correct scheduler."
-            )
-        scheduler.set_timesteps(sigmas=sigmas, device=device, **kwargs)
-        timesteps = scheduler.timesteps
-        num_inference_steps = len(timesteps)
-    else:
-        scheduler.set_timesteps(num_inference_steps, device=device, **kwargs)
-        timesteps = scheduler.timesteps
-    return timesteps, num_inference_steps
-
-
-@dataclass
-class HunyuanVideoPipelineOutput(BaseOutput):
-    videos: Union[torch.Tensor, np.ndarray]
-
-
-class HunyuanVideoPipeline(DiffusionPipeline):
-    r"""
-    Pipeline for text-to-video generation using HunyuanVideo.
-
-    This model inherits from [`DiffusionPipeline`]. Check the superclass documentation for the generic methods
-    implemented for all pipelines (downloading, saving, running on a particular device, etc.).
-
-    Args:
-        vae ([`AutoencoderKL`]):
-            Variational Auto-Encoder (VAE) model to encode and decode images to and from latent representations.
-        text_encoder ([`TextEncoder`]):
-            Frozen text-encoder.
-        text_encoder_2 ([`TextEncoder`]):
-            Frozen text-encoder_2.
-        transformer ([`HYVideoDiffusionTransformer`]):
-            A `HYVideoDiffusionTransformer` to denoise the encoded video latents.
-        scheduler ([`SchedulerMixin`]):
-            A scheduler to be used in combination with `unet` to denoise the encoded image latents.
-    """
-
-    model_cpu_offload_seq = "text_encoder->text_encoder_2->transformer->vae"
-    _optional_components = ["text_encoder_2"]
-    _exclude_from_cpu_offload = ["transformer"]
-    _callback_tensor_inputs = ["latents", "prompt_embeds", "negative_prompt_embeds"]
-
-    def __init__(
-        self,
-        vae: AutoencoderKL,
-        text_encoder: TextEncoder,
-        transformer: HYVideoDiffusionTransformer,
-        scheduler: KarrasDiffusionSchedulers,
-        text_encoder_2: Optional[TextEncoder] = None,
-        progress_bar_config: Dict[str, Any] = None,
-        vae_precision: str = 'fp16',
-        precision: str = 'bf16',
-        disable_autocast: bool = False,
-    ):
-        super().__init__()
-
-        # ==========================================================================================
-        if progress_bar_config is None:
-            progress_bar_config = {}
-        if not hasattr(self, "_progress_bar_config"):
-            self._progress_bar_config = {}
-        self._progress_bar_config.update(progress_bar_config)
-
-        self.vae_precision = vae_precision
-        self.precision = precision
-        self.disable_autocast = disable_autocast
-        # ==========================================================================================
-
-        if (
-            hasattr(scheduler.config, "steps_offset")
-            and scheduler.config.steps_offset != 1
-        ):
-            deprecation_message = (
-                f"The configuration file of this scheduler: {scheduler} is outdated. `steps_offset`"
-                f" should be set to 1 instead of {scheduler.config.steps_offset}. Please make sure "
-                "to update the config accordingly as leaving `steps_offset` might led to incorrect results"
-                " in future versions. If you have downloaded this checkpoint from the Hugging Face Hub,"
-                " it would be very nice if you could open a Pull request for the `scheduler/scheduler_config.json`"
-                " file"
-            )
-            deprecate(
-                "steps_offset!=1", "1.0.0", deprecation_message, standard_warn=False
-            )
-            new_config = dict(scheduler.config)
-            new_config["steps_offset"] = 1
-            scheduler._internal_dict = FrozenDict(new_config)
-
-        if (
-            hasattr(scheduler.config, "clip_sample")
-            and scheduler.config.clip_sample is True
-        ):
-            deprecation_message = (
-                f"The configuration file of this scheduler: {scheduler} has not set the configuration `clip_sample`."
-                " `clip_sample` should be set to False in the configuration file. Please make sure to update the"
-                " config accordingly as not setting `clip_sample` in the config might lead to incorrect results in"
-                " future versions. If you have downloaded this checkpoint from the Hugging Face Hub, it would be very"
-                " nice if you could open a Pull request for the `scheduler/scheduler_config.json` file"
-            )
-            deprecate(
-                "clip_sample not set", "1.0.0", deprecation_message, standard_warn=False
-            )
-            new_config = dict(scheduler.config)
-            new_config["clip_sample"] = False
-            scheduler._internal_dict = FrozenDict(new_config)
-
-        self.register_modules(
-            vae=vae,
-            text_encoder=text_encoder,
-            transformer=transformer,
-            scheduler=scheduler,
-            text_encoder_2=text_encoder_2,
-        )
-        self.vae_scale_factor = 2 ** (len(self.vae.config.block_out_channels) - 1)
-        self.image_processor = VaeImageProcessor(vae_scale_factor=self.vae_scale_factor)
-
-    def encode_prompt(
-        self,
-        prompt,
-        device,
-        num_videos_per_prompt,
-        do_classifier_free_guidance,
-        negative_prompt=None,
-        prompt_embeds: Optional[torch.Tensor] = None,
-        attention_mask: Optional[torch.Tensor] = None,
-        negative_prompt_embeds: Optional[torch.Tensor] = None,
-        negative_attention_mask: Optional[torch.Tensor] = None,
-        lora_scale: Optional[float] = None,
-        clip_skip: Optional[int] = None,
-        text_encoder: Optional[TextEncoder] = None,
-        data_type: Optional[str] = "image",
-        semantic_images=None
-    ):
-        r"""
-        Encodes the prompt into text encoder hidden states.
-
-        Args:
-            prompt (`str` or `List[str]`, *optional*):
-                prompt to be encoded
-            device: (`torch.device`):
-                torch device
-            num_videos_per_prompt (`int`):
-                number of videos that should be generated per prompt
-            do_classifier_free_guidance (`bool`):
-                whether to use classifier free guidance or not
-            negative_prompt (`str` or `List[str]`, *optional*):
-                The prompt or prompts not to guide the video generation. If not defined, one has to pass
-                `negative_prompt_embeds` instead. Ignored when not using guidance (i.e., ignored if `guidance_scale` is
-                less than `1`).
-            prompt_embeds (`torch.Tensor`, *optional*):
-                Pre-generated text embeddings. Can be used to easily tweak text inputs, *e.g.* prompt weighting. If not
-                provided, text embeddings will be generated from `prompt` input argument.
-            attention_mask (`torch.Tensor`, *optional*):
-            negative_prompt_embeds (`torch.Tensor`, *optional*):
-                Pre-generated negative text embeddings. Can be used to easily tweak text inputs, *e.g.* prompt
-                weighting. If not provided, negative_prompt_embeds will be generated from `negative_prompt` input
-                argument.
-            negative_attention_mask (`torch.Tensor`, *optional*):
-            lora_scale (`float`, *optional*):
-                A LoRA scale that will be applied to all LoRA layers of the text encoder if LoRA layers are loaded.
-            clip_skip (`int`, *optional*):
-                Number of layers to be skipped from CLIP while computing the prompt embeddings. A value of 1 means that
-                the output of the pre-final layer will be used for computing the prompt embeddings.
-            text_encoder (TextEncoder, *optional*):
-            data_type (`str`, *optional*):
-        """
-        if text_encoder is None:
-            text_encoder = self.text_encoder
-
-        # set lora scale so that monkey patched LoRA
-        # function of text encoder can correctly access it
-        if lora_scale is not None and isinstance(self, LoraLoaderMixin):
-            self._lora_scale = lora_scale
-
-            # dynamically adjust the LoRA scale
-            if not USE_PEFT_BACKEND:
-                adjust_lora_scale_text_encoder(text_encoder.model, lora_scale)
-            else:
-                scale_lora_layers(text_encoder.model, lora_scale)
-
-        if prompt is not None and isinstance(prompt, str):
-            batch_size = 1
-        elif prompt is not None and isinstance(prompt, list):
-            batch_size = len(prompt)
-        else:
-            batch_size = prompt_embeds.shape[0]
-
-        if prompt_embeds is None:
-            # textual inversion: process multi-vector tokens if necessary
-            if isinstance(self, TextualInversionLoaderMixin):
-                prompt = self.maybe_convert_prompt(prompt, text_encoder.tokenizer)
-
-            text_inputs = text_encoder.text2tokens(prompt, data_type=data_type)
-
-            if clip_skip is None:
-                prompt_outputs = text_encoder.encode(
-                    text_inputs, data_type=data_type, semantic_images=semantic_images, device=device
-                )
-                prompt_embeds = prompt_outputs.hidden_state
-            else:
-                prompt_outputs = text_encoder.encode(
-                    text_inputs,
-                    output_hidden_states=True,
-                    data_type=data_type,
-                    semantic_images=semantic_images,
-                    device=device,
-                )
-                # Access the `hidden_states` first, that contains a tuple of
-                # all the hidden states from the encoder layers. Then index into
-                # the tuple to access the hidden states from the desired layer.
-                prompt_embeds = prompt_outputs.hidden_states_list[-(clip_skip + 1)]
-                # We also need to apply the final LayerNorm here to not mess with the
-                # representations. The `last_hidden_states` that we typically use for
-                # obtaining the final prompt representations passes through the LayerNorm
-                # layer.
-                prompt_embeds = text_encoder.model.text_model.final_layer_norm(
-                    prompt_embeds
-                )
-
-            attention_mask = prompt_outputs.attention_mask
-            if attention_mask is not None:
-                attention_mask = attention_mask.to(device)
-                bs_embed, seq_len = attention_mask.shape
-                attention_mask = attention_mask.repeat(1, num_videos_per_prompt)
-                attention_mask = attention_mask.view(
-                    bs_embed * num_videos_per_prompt, seq_len
-                )
-
-        if text_encoder is not None:
-            prompt_embeds_dtype = text_encoder.dtype
-        elif self.transformer is not None:
-            prompt_embeds_dtype = self.transformer.dtype
-        else:
-            prompt_embeds_dtype = prompt_embeds.dtype
-
-        prompt_embeds = prompt_embeds.to(dtype=prompt_embeds_dtype, device=device)
-
-        if prompt_embeds.ndim == 2:
-            bs_embed, _ = prompt_embeds.shape
-            # duplicate text embeddings for each generation per prompt, using mps friendly method
-            prompt_embeds = prompt_embeds.repeat(1, num_videos_per_prompt)
-            prompt_embeds = prompt_embeds.view(bs_embed * num_videos_per_prompt, -1)
-        else:
-            bs_embed, seq_len, _ = prompt_embeds.shape
-            # duplicate text embeddings for each generation per prompt, using mps friendly method
-            prompt_embeds = prompt_embeds.repeat(1, num_videos_per_prompt, 1)
-            prompt_embeds = prompt_embeds.view(
-                bs_embed * num_videos_per_prompt, seq_len, -1
-            )
-
-        # get unconditional embeddings for classifier free guidance
-        if do_classifier_free_guidance and negative_prompt_embeds is None:
-            uncond_tokens: List[str]
-            if negative_prompt is None:
-                uncond_tokens = [""] * batch_size
-            elif prompt is not None and type(prompt) is not type(negative_prompt):
-                raise TypeError(
-                    f"`negative_prompt` should be the same type to `prompt`, but got {type(negative_prompt)} !="
-                    f" {type(prompt)}."
-                )
-            elif isinstance(negative_prompt, str):
-                uncond_tokens = [negative_prompt]
-            elif batch_size != len(negative_prompt):
-                raise ValueError(
-                    f"`negative_prompt`: {negative_prompt} has batch size {len(negative_prompt)}, but `prompt`:"
-                    f" {prompt} has batch size {batch_size}. Please make sure that passed `negative_prompt` matches"
-                    " the batch size of `prompt`."
-                )
-            else:
-                uncond_tokens = negative_prompt
-
-            # textual inversion: process multi-vector tokens if necessary
-            if isinstance(self, TextualInversionLoaderMixin):
-                uncond_tokens = self.maybe_convert_prompt(
-                    uncond_tokens, text_encoder.tokenizer
-                )
-
-            # max_length = prompt_embeds.shape[1]
-            uncond_input = text_encoder.text2tokens(uncond_tokens, data_type=data_type)
-
-            if semantic_images is not None:
-                uncond_image = [black_image(img.size[0], img.size[1]) for img in semantic_images]
-            else:
-                uncond_image = None
-
-            negative_prompt_outputs = text_encoder.encode(
-                uncond_input, data_type=data_type, semantic_images=uncond_image, device=device
-            )
-            negative_prompt_embeds = negative_prompt_outputs.hidden_state
-
-            negative_attention_mask = negative_prompt_outputs.attention_mask
-            if negative_attention_mask is not None:
-                negative_attention_mask = negative_attention_mask.to(device)
-                _, seq_len = negative_attention_mask.shape
-                negative_attention_mask = negative_attention_mask.repeat(
-                    1, num_videos_per_prompt
-                )
-                negative_attention_mask = negative_attention_mask.view(
-                    batch_size * num_videos_per_prompt, seq_len
-                )
-
-        if do_classifier_free_guidance:
-            # duplicate unconditional embeddings for each generation per prompt, using mps friendly method
-            seq_len = negative_prompt_embeds.shape[1]
-
-            negative_prompt_embeds = negative_prompt_embeds.to(
-                dtype=prompt_embeds_dtype, device=device
-            )
-
-            if negative_prompt_embeds.ndim == 2:
-                negative_prompt_embeds = negative_prompt_embeds.repeat(
-                    1, num_videos_per_prompt
-                )
-                negative_prompt_embeds = negative_prompt_embeds.view(
-                    batch_size * num_videos_per_prompt, -1
-                )
-            else:
-                negative_prompt_embeds = negative_prompt_embeds.repeat(
-                    1, num_videos_per_prompt, 1
-                )
-                negative_prompt_embeds = negative_prompt_embeds.view(
-                    batch_size * num_videos_per_prompt, seq_len, -1
-                )
-
-        if text_encoder is not None:
-            if isinstance(self, LoraLoaderMixin) and USE_PEFT_BACKEND:
-                # Retrieve the original scale by scaling back the LoRA layers
-                unscale_lora_layers(text_encoder.model, lora_scale)
-
-        return (
-            prompt_embeds,
-            negative_prompt_embeds,
-            attention_mask,
-            negative_attention_mask,
-        )
-
-    def decode_latents(self, latents, enable_tiling=True):
-        deprecation_message = "The decode_latents method is deprecated and will be removed in 1.0.0. Please use VaeImageProcessor.postprocess(...) instead"
-        deprecate("decode_latents", "1.0.0", deprecation_message, standard_warn=False)
-
-        latents = 1 / self.vae.config.scaling_factor * latents
-        if enable_tiling:
-            self.vae.enable_tiling()
-            image = self.vae.decode(latents, return_dict=False)[0]
-        else:
-            image = self.vae.decode(latents, return_dict=False)[0]
-        image = (image / 2 + 0.5).clamp(0, 1)
-        # we always cast to float32 as this does not cause significant overhead and is compatible with bfloat16
-        if image.ndim == 4:
-            image = image.cpu().permute(0, 2, 3, 1).float()
-        else:
-            image = image.cpu().float()
-        return image
-
-    def prepare_extra_func_kwargs(self, func, kwargs):
-        # prepare extra kwargs for the scheduler step, since not all schedulers have the same signature
-        # eta (η) is only used with the DDIMScheduler, it will be ignored for other schedulers.
-        # eta corresponds to η in DDIM paper: https://arxiv.org/abs/2010.02502
-        # and should be between [0, 1]
-        extra_step_kwargs = {}
-
-        for k, v in kwargs.items():
-            accepts = k in set(inspect.signature(func).parameters.keys())
-            if accepts:
-                extra_step_kwargs[k] = v
-        return extra_step_kwargs
-
-    def check_inputs(
-        self,
-        prompt,
-        height,
-        width,
-        video_length,
-        callback_steps,
-        negative_prompt=None,
-        prompt_embeds=None,
-        negative_prompt_embeds=None,
-        callback_on_step_end_tensor_inputs=None,
-        vae_ver="88-4c-sd",
-    ):
-        if height % 8 != 0 or width % 8 != 0:
-            raise ValueError(
-                f"`height` and `width` have to be divisible by 8 but are {height} and {width}."
-            )
-
-        if video_length is not None:
-            if "884" in vae_ver:
-                if video_length != 1 and (video_length - 1) % 4 != 0:
-                    raise ValueError(
-                        f"`video_length` has to be 1 or a multiple of 4 but is {video_length}."
-                    )
-            elif "888" in vae_ver:
-                if video_length != 1 and (video_length - 1) % 8 != 0:
-                    raise ValueError(
-                        f"`video_length` has to be 1 or a multiple of 8 but is {video_length}."
-                    )
-
-        if callback_steps is not None and (
-            not isinstance(callback_steps, int) or callback_steps <= 0
-        ):
-            raise ValueError(
-                f"`callback_steps` has to be a positive integer but is {callback_steps} of type"
-                f" {type(callback_steps)}."
-            )
-        if callback_on_step_end_tensor_inputs is not None and not all(
-            k in self._callback_tensor_inputs
-            for k in callback_on_step_end_tensor_inputs
-        ):
-            raise ValueError(
-                f"`callback_on_step_end_tensor_inputs` has to be in {self._callback_tensor_inputs}, but found {[k for k in callback_on_step_end_tensor_inputs if k not in self._callback_tensor_inputs]}"
-            )
-
-        if prompt is not None and prompt_embeds is not None:
-            raise ValueError(
-                f"Cannot forward both `prompt`: {prompt} and `prompt_embeds`: {prompt_embeds}. Please make sure to"
-                " only forward one of the two."
-            )
-        elif prompt is None and prompt_embeds is None:
-            raise ValueError(
-                "Provide either `prompt` or `prompt_embeds`. Cannot leave both `prompt` and `prompt_embeds` undefined."
-            )
-        elif prompt is not None and (
-            not isinstance(prompt, str) and not isinstance(prompt, list)
-        ):
-            raise ValueError(
-                f"`prompt` has to be of type `str` or `list` but is {type(prompt)}"
-            )
-
-        if negative_prompt is not None and negative_prompt_embeds is not None:
-            raise ValueError(
-                f"Cannot forward both `negative_prompt`: {negative_prompt} and `negative_prompt_embeds`:"
-                f" {negative_prompt_embeds}. Please make sure to only forward one of the two."
-            )
-
-        if prompt_embeds is not None and negative_prompt_embeds is not None:
-            if prompt_embeds.shape != negative_prompt_embeds.shape:
-                raise ValueError(
-                    "`prompt_embeds` and `negative_prompt_embeds` must have the same shape when passed directly, but"
-                    f" got: `prompt_embeds` {prompt_embeds.shape} != `negative_prompt_embeds`"
-                    f" {negative_prompt_embeds.shape}."
-                )
-
-
-    def prepare_latents(
-        self,
-        batch_size,
-        num_channels_latents,
-        height,
-        width,
-        video_length,
-        dtype,
-        device,
-        generator,
-        latents=None,
-        img_latents=None,
-        i2v_mode=False,
-        i2v_condition_type=None,
-        i2v_stability=True,
-    ):
-        if i2v_mode and i2v_condition_type == "latent_concat":
-            num_channels_latents = (num_channels_latents - 1) // 2
-        shape = (
-            batch_size,
-            num_channels_latents,
-            video_length,
-            int(height) // self.vae_scale_factor,
-            int(width) // self.vae_scale_factor,
-        )
-        if isinstance(generator, list) and len(generator) != batch_size:
-            raise ValueError(
-                f"You have passed a list of generators of length {len(generator)}, but requested an effective batch"
-                f" size of {batch_size}. Make sure the batch size matches the length of the generators."
-            )
-
-        if i2v_mode and i2v_stability:
-            if img_latents.shape[2] == 1:
-                img_latents = img_latents.repeat(1, 1, video_length, 1, 1)
-            x0 = randn_tensor(shape, generator=generator, device=device, dtype=dtype)
-            x1 = img_latents
-
-            t = torch.tensor([0.999]).to(device=device)
-            latents = x0 * t + x1 * (1 - t)
-            latents = latents.to(dtype=dtype)
-
-        if latents is None:
-            latents = randn_tensor(
-                shape, generator=generator, device=device, dtype=dtype
-            )
-        else:
-            latents = latents.to(device)
-
-        # Check existence to make it compatible with FlowMatchEulerDiscreteScheduler
-        if hasattr(self.scheduler, "init_noise_sigma"):
-            # scale the initial noise by the standard deviation required by the scheduler
-            latents = latents * self.scheduler.init_noise_sigma
-        return latents
-
-    # Copied from diffusers.pipelines.latent_consistency_models.pipeline_latent_consistency_text2img.LatentConsistencyModelPipeline.get_guidance_scale_embedding
-    def get_guidance_scale_embedding(
-        self,
-        w: torch.Tensor,
-        embedding_dim: int = 512,
-        dtype: torch.dtype = torch.float32,
-    ) -> torch.Tensor:
-        """
-        See https://github.com/google-research/vdm/blob/dc27b98a554f65cdc654b800da5aa1846545d41b/model_vdm.py#L298
-
-        Args:
-            w (`torch.Tensor`):
-                Generate embedding vectors with a specified guidance scale to subsequently enrich timestep embeddings.
-            embedding_dim (`int`, *optional*, defaults to 512):
-                Dimension of the embeddings to generate.
-            dtype (`torch.dtype`, *optional*, defaults to `torch.float32`):
-                Data type of the generated embeddings.
-
-        Returns:
-            `torch.Tensor`: Embedding vectors with shape `(len(w), embedding_dim)`.
-        """
-        assert len(w.shape) == 1
-        w = w * 1000.0
-
-        half_dim = embedding_dim // 2
-        emb = torch.log(torch.tensor(10000.0)) / (half_dim - 1)
-        emb = torch.exp(torch.arange(half_dim, dtype=dtype) * -emb)
-        emb = w.to(dtype)[:, None] * emb[None, :]
-        emb = torch.cat([torch.sin(emb), torch.cos(emb)], dim=1)
-        if embedding_dim % 2 == 1:  # zero pad
-            emb = torch.nn.functional.pad(emb, (0, 1))
-        assert emb.shape == (w.shape[0], embedding_dim)
-        return emb
-
-    @property
-    def guidance_scale(self):
-        return self._guidance_scale
-
-    @property
-    def guidance_rescale(self):
-        return self._guidance_rescale
-
-    @property
-    def clip_skip(self):
-        return self._clip_skip
-
-    # here `guidance_scale` is defined analog to the guidance weight `w` of equation (2)
-    # of the Imagen paper: https://arxiv.org/pdf/2205.11487.pdf . `guidance_scale = 1`
-    # corresponds to doing no classifier free guidance.
-    @property
-    def do_classifier_free_guidance(self):
-        # return self._guidance_scale > 1 and self.transformer.config.time_cond_proj_dim is None
-        return self._guidance_scale > 1
-
-    @property
-    def cross_attention_kwargs(self):
-        return self._cross_attention_kwargs
-
-    @property
-    def num_timesteps(self):
-        return self._num_timesteps
-
-    @property
-    def interrupt(self):
-        return self._interrupt
-
-    @torch.no_grad()
-    @replace_example_docstring(EXAMPLE_DOC_STRING)
-    def __call__(
-        self,
-        prompt: Union[str, List[str]],
-        height: int,
-        width: int,
-        video_length: int,
-        data_type: str = "video",
-        num_inference_steps: int = 50,
-        timesteps: List[int] = None,
-        sigmas: List[float] = None,
-        guidance_scale: float = 7.5,
-        negative_prompt: Optional[Union[str, List[str]]] = None,
-        num_videos_per_prompt: Optional[int] = 1,
-        eta: float = 0.0,
-        generator: Optional[Union[torch.Generator, List[torch.Generator]]] = None,
-        latents: Optional[torch.Tensor] = None,
-        prompt_embeds: Optional[torch.Tensor] = None,
-        attention_mask: Optional[torch.Tensor] = None,
-        negative_prompt_embeds: Optional[torch.Tensor] = None,
-        negative_attention_mask: Optional[torch.Tensor] = None,
-        output_type: Optional[str] = "pil",
-        return_dict: bool = True,
-        cross_attention_kwargs: Optional[Dict[str, Any]] = None,
-        guidance_rescale: float = 0.0,
-        clip_skip: Optional[int] = None,
-        callback_on_step_end: Optional[
-            Union[
-                Callable[[int, int, Dict], None],
-                PipelineCallback,
-                MultiPipelineCallbacks,
-            ]
-        ] = None,
-        callback_on_step_end_tensor_inputs: List[str] = ["latents"],
-        freqs_cis: Tuple[torch.Tensor, torch.Tensor] = None,
-        vae_ver: str = "88-4c-sd",
-        enable_tiling: bool = False,
-        n_tokens: Optional[int] = None,
-        embedded_guidance_scale: Optional[float] = None,
-        i2v_mode: bool = False,
-        i2v_condition_type: str = None,
-        i2v_stability: bool = True,
-        img_latents: Optional[torch.Tensor] = None,
-        semantic_images=None,
-        **kwargs,
-    ):
-        r"""
-        The call function to the pipeline for generation.
-
-        Args:
-            prompt (`str` or `List[str]`):
-                The prompt or prompts to guide image generation. If not defined, you need to pass `prompt_embeds`.
-            height (`int`):
-                The height in pixels of the generated image.
-            width (`int`):
-                The width in pixels of the generated image.
-            video_length (`int`):
-                The number of frames in the generated video.
-            num_inference_steps (`int`, *optional*, defaults to 50):
-                The number of denoising steps. More denoising steps usually lead to a higher quality image at the
-                expense of slower inference.
-            timesteps (`List[int]`, *optional*):
-                Custom timesteps to use for the denoising process with schedulers which support a `timesteps` argument
-                in their `set_timesteps` method. If not defined, the default behavior when `num_inference_steps` is
-                passed will be used. Must be in descending order.
-            sigmas (`List[float]`, *optional*):
-                Custom sigmas to use for the denoising process with schedulers which support a `sigmas` argument in
-                their `set_timesteps` method. If not defined, the default behavior when `num_inference_steps` is passed
-                will be used.
-            guidance_scale (`float`, *optional*, defaults to 7.5):
-                A higher guidance scale value encourages the model to generate images closely linked to the text
-                `prompt` at the expense of lower image quality. Guidance scale is enabled when `guidance_scale > 1`.
-            negative_prompt (`str` or `List[str]`, *optional*):
-                The prompt or prompts to guide what to not include in image generation. If not defined, you need to
-                pass `negative_prompt_embeds` instead. Ignored when not using guidance (`guidance_scale < 1`).
-            num_videos_per_prompt (`int`, *optional*, defaults to 1):
-                The number of images to generate per prompt.
-            eta (`float`, *optional*, defaults to 0.0):
-                Corresponds to parameter eta (η) from the [DDIM](https://arxiv.org/abs/2010.02502) paper. Only applies
-                to the [`~schedulers.DDIMScheduler`], and is ignored in other schedulers.
-            generator (`torch.Generator` or `List[torch.Generator]`, *optional*):
-                A [`torch.Generator`](https://pytorch.org/docs/stable/generated/torch.Generator.html) to make
-                generation deterministic.
-            latents (`torch.Tensor`, *optional*):
-                Pre-generated noisy latents sampled from a Gaussian distribution, to be used as inputs for image
-                generation. Can be used to tweak the same generation with different prompts. If not provided, a latents
-                tensor is generated by sampling using the supplied random `generator`.
-            prompt_embeds (`torch.Tensor`, *optional*):
-                Pre-generated text embeddings. Can be used to easily tweak text inputs (prompt weighting). If not
-                provided, text embeddings are generated from the `prompt` input argument.
-            negative_prompt_embeds (`torch.Tensor`, *optional*):
-                Pre-generated negative text embeddings. Can be used to easily tweak text inputs (prompt weighting). If
-                not provided, `negative_prompt_embeds` are generated from the `negative_prompt` input argument.
-                
-            output_type (`str`, *optional*, defaults to `"pil"`):
-                The output format of the generated image. Choose between `PIL.Image` or `np.array`.
-            return_dict (`bool`, *optional*, defaults to `True`):
-                Whether or not to return a [`HunyuanVideoPipelineOutput`] instead of a
-                plain tuple.
-            cross_attention_kwargs (`dict`, *optional*):
-                A kwargs dictionary that if specified is passed along to the [`AttentionProcessor`] as defined in
-                [`self.processor`](https://github.com/huggingface/diffusers/blob/main/src/diffusers/models/attention_processor.py).
-            guidance_rescale (`float`, *optional*, defaults to 0.0):
-                Guidance rescale factor from [Common Diffusion Noise Schedules and Sample Steps are
-                Flawed](https://arxiv.org/pdf/2305.08891.pdf). Guidance rescale factor should fix overexposure when
-                using zero terminal SNR.
-            clip_skip (`int`, *optional*):
-                Number of layers to be skipped from CLIP while computing the prompt embeddings. A value of 1 means that
-                the output of the pre-final layer will be used for computing the prompt embeddings.
-            callback_on_step_end (`Callable`, `PipelineCallback`, `MultiPipelineCallbacks`, *optional*):
-                A function or a subclass of `PipelineCallback` or `MultiPipelineCallbacks` that is called at the end of
-                each denoising step during the inference. with the following arguments: `callback_on_step_end(self:
-                DiffusionPipeline, step: int, timestep: int, callback_kwargs: Dict)`. `callback_kwargs` will include a
-                list of all tensors as specified by `callback_on_step_end_tensor_inputs`.
-            callback_on_step_end_tensor_inputs (`List`, *optional*):
-                The list of tensor inputs for the `callback_on_step_end` function. The tensors specified in the list
-                will be passed as `callback_kwargs` argument. You will only be able to include variables listed in the
-                `._callback_tensor_inputs` attribute of your pipeline class.
-
-        Examples:
-
-        Returns:
-            [`~HunyuanVideoPipelineOutput`] or `tuple`:
-                If `return_dict` is `True`, [`HunyuanVideoPipelineOutput`] is returned,
-                otherwise a `tuple` is returned where the first element is a list with the generated images and the
-                second element is a list of `bool`s indicating whether the corresponding generated image contains
-                "not-safe-for-work" (nsfw) content.
-        """
-        callback = kwargs.pop("callback", None)
-        callback_steps = kwargs.pop("callback_steps", None)
-
-        if callback is not None:
-            deprecate(
-                "callback",
-                "1.0.0",
-                "Passing `callback` as an input argument to `__call__` is deprecated, consider using `callback_on_step_end`",
-            )
-        if callback_steps is not None:
-            deprecate(
-                "callback_steps",
-                "1.0.0",
-                "Passing `callback_steps` as an input argument to `__call__` is deprecated, consider using `callback_on_step_end`",
-            )
-
-        if isinstance(callback_on_step_end, (PipelineCallback, MultiPipelineCallbacks)):
-            callback_on_step_end_tensor_inputs = callback_on_step_end.tensor_inputs
-
-        # 0. Default height and width to unet
-        # height = height or self.transformer.config.sample_size * self.vae_scale_factor
-        # width = width or self.transformer.config.sample_size * self.vae_scale_factor
-        # to deal with lora scaling and other possible forward hooks
-
-        # 1. Check inputs. Raise error if not correct
-        self.check_inputs(
-            prompt,
-            height,
-            width,
-            video_length,
-            callback_steps,
-            negative_prompt,
-            prompt_embeds,
-            negative_prompt_embeds,
-            callback_on_step_end_tensor_inputs,
-            vae_ver=vae_ver,
-        )
-
-        self._guidance_scale = guidance_scale
-        self._guidance_rescale = guidance_rescale
-        self._clip_skip = clip_skip
-        self._cross_attention_kwargs = cross_attention_kwargs
-        self._interrupt = False
-
-        # 2. Define call parameters
-        if prompt is not None and isinstance(prompt, str):
-            batch_size = 1
-        elif prompt is not None and isinstance(prompt, list):
-            batch_size = len(prompt)
-        else:
-            batch_size = prompt_embeds.shape[0]
-
-        device = torch.device(f"cuda:{dist.get_rank()}") if dist.is_initialized() else self._execution_device
-
-        # 3. Encode input prompt
-        lora_scale = (
-            self.cross_attention_kwargs.get("scale", None)
-            if self.cross_attention_kwargs is not None
-            else None
-        )
-
-        (
-            prompt_embeds,
-            negative_prompt_embeds,
-            prompt_mask,
-            negative_prompt_mask,
-        ) = self.encode_prompt(
-            prompt,
-            device,
-            num_videos_per_prompt,
-            self.do_classifier_free_guidance,
-            negative_prompt,
-            prompt_embeds=prompt_embeds,
-            attention_mask=attention_mask,
-            negative_prompt_embeds=negative_prompt_embeds,
-            negative_attention_mask=negative_attention_mask,
-            lora_scale=lora_scale,
-            clip_skip=self.clip_skip,
-            data_type=data_type,
-            semantic_images=semantic_images
-        )
-        if self.text_encoder_2 is not None:
-            (
-                prompt_embeds_2,
-                negative_prompt_embeds_2,
-                prompt_mask_2,
-                negative_prompt_mask_2,
-            ) = self.encode_prompt(
-                prompt,
-                device,
-                num_videos_per_prompt,
-                self.do_classifier_free_guidance,
-                negative_prompt,
-                prompt_embeds=None,
-                attention_mask=None,
-                negative_prompt_embeds=None,
-                negative_attention_mask=None,
-                lora_scale=lora_scale,
-                clip_skip=self.clip_skip,
-                text_encoder=self.text_encoder_2,
-                data_type=data_type,
-            )
-        else:
-            prompt_embeds_2 = None
-            negative_prompt_embeds_2 = None
-            prompt_mask_2 = None
-            negative_prompt_mask_2 = None
-
-        # For classifier free guidance, we need to do two forward passes.
-        # Here we concatenate the unconditional and text embeddings into a single batch
-        # to avoid doing two forward passes
-        if self.do_classifier_free_guidance:
-            prompt_embeds = torch.cat([negative_prompt_embeds, prompt_embeds])
-            if prompt_mask is not None:
-                prompt_mask = torch.cat([negative_prompt_mask, prompt_mask])
-            if prompt_embeds_2 is not None:
-                prompt_embeds_2 = torch.cat([negative_prompt_embeds_2, prompt_embeds_2])
-            if prompt_mask_2 is not None:
-                prompt_mask_2 = torch.cat([negative_prompt_mask_2, prompt_mask_2])
-
-
-        # 4. Prepare timesteps
-        extra_set_timesteps_kwargs = self.prepare_extra_func_kwargs(
-            self.scheduler.set_timesteps, {"n_tokens": n_tokens}
-        )
-        timesteps, num_inference_steps = retrieve_timesteps(
-            self.scheduler,
-            num_inference_steps,
-            device,
-            timesteps,
-            sigmas,
-            **extra_set_timesteps_kwargs,
-        )
-
-        if "884" in vae_ver:
-            video_length = (video_length - 1) // 4 + 1
-        elif "888" in vae_ver:
-            video_length = (video_length - 1) // 8 + 1
-        else:
-            video_length = video_length
-
-        # 5. Prepare latent variables
-        num_channels_latents = self.transformer.config.in_channels
-        latents = self.prepare_latents(
-            batch_size * num_videos_per_prompt,
-            num_channels_latents,
-            height,
-            width,
-            video_length,
-            prompt_embeds.dtype,
-            device,
-            generator,
-            latents,
-            img_latents=img_latents,
-            i2v_mode=i2v_mode,
-            i2v_condition_type=i2v_condition_type,
-            i2v_stability=i2v_stability
-        )
-
-        if i2v_mode and i2v_condition_type == "latent_concat":
-            if img_latents.shape[2] == 1:
-                img_latents_concat = img_latents.repeat(1, 1, video_length, 1, 1)
-            else:
-                img_latents_concat = img_latents
-            img_latents_concat[:, :, 1:, ...] = 0
-
-            i2v_mask = torch.zeros(video_length)
-            i2v_mask[0] = 1
-
-            mask_concat = torch.ones(img_latents_concat.shape[0], 1, img_latents_concat.shape[2], img_latents_concat.shape[3],
-                                     img_latents_concat.shape[4]).to(device=img_latents.device)
-            mask_concat[:, :, 1:, ...] = 0
-
-        # 6. Prepare extra step kwargs. TODO: Logic should ideally just be moved out of the pipeline
-        extra_step_kwargs = self.prepare_extra_func_kwargs(
-            self.scheduler.step,
-            {"generator": generator, "eta": eta},
-        )
-
-        target_dtype = PRECISION_TO_TYPE[self.precision]
-        autocast_enabled = (
-            target_dtype != torch.float32
-        ) and not self.disable_autocast
-        vae_dtype = PRECISION_TO_TYPE[self.vae_precision]
-        vae_autocast_enabled = (
-            vae_dtype != torch.float32
-        ) and not self.disable_autocast
-
-        # 7. Denoising loop
-        num_warmup_steps = len(timesteps) - num_inference_steps * self.scheduler.order
-        self._num_timesteps = len(timesteps)
-
-        # if is_progress_bar:
-        with self.progress_bar(total=num_inference_steps) as progress_bar:
-            for i, t in enumerate(timesteps):
-                if self.interrupt:
-                    continue
-
-                if i2v_mode and i2v_condition_type == "token_replace":
-                    latents = torch.concat([img_latents, latents[:, :, 1:, :, :]], dim=2)
-
-                # expand the latents if we are doing classifier free guidance
-                if i2v_mode and i2v_condition_type == "latent_concat":
-                    latent_model_input = torch.concat([latents, img_latents_concat, mask_concat], dim=1)
-                else:
-                    latent_model_input = latents
-
-                latent_model_input = (
-                    torch.cat([latent_model_input] * 2)
-                    if self.do_classifier_free_guidance
-                    else latent_model_input
-                )
-
-                latent_model_input = self.scheduler.scale_model_input(
-                    latent_model_input, t
-                )
-
-                t_expand = t.repeat(latent_model_input.shape[0])
-                guidance_expand = (
-                    torch.tensor(
-                        [embedded_guidance_scale] * latent_model_input.shape[0],
-                        dtype=torch.float32,
-                        device=device,
-                    ).to(target_dtype)
-                    * 1000.0
-                    if embedded_guidance_scale is not None
-                    else None
-                )
-
-                # predict the noise residual
-                with torch.autocast(
-                    device_type="cuda", dtype=target_dtype, enabled=autocast_enabled
-                ):
-                    noise_pred = self.transformer(  # For an input image (129, 192, 336) (1, 256, 256)
-                        latent_model_input,  # [2, 16, 33, 24, 42]
-                        t_expand,  # [2]
-                        text_states=prompt_embeds,  # [2, 256, 4096]
-                        text_mask=prompt_mask,  # [2, 256]
-                        text_states_2=prompt_embeds_2,  # [2, 768]
-                        freqs_cos=freqs_cis[0],  # [seqlen, head_dim]
-                        freqs_sin=freqs_cis[1],  # [seqlen, head_dim]
-                        guidance=guidance_expand,
-                        return_dict=True,
-                    )[
-                        "x"
-                    ]
-
-                # perform guidance
-                if self.do_classifier_free_guidance:
-                    noise_pred_uncond, noise_pred_text = noise_pred.chunk(2)
-                    noise_pred = noise_pred_uncond + self.guidance_scale * (
-                        noise_pred_text - noise_pred_uncond
-                    )
-
-                if self.do_classifier_free_guidance and self.guidance_rescale > 0.0:
-                    # Based on 3.4. in https://arxiv.org/pdf/2305.08891.pdf
-                    noise_pred = rescale_noise_cfg(
-                        noise_pred,
-                        noise_pred_text,
-                        guidance_rescale=self.guidance_rescale,
-                    )
-
-                # compute the previous noisy sample x_t -> x_t-1
-                if i2v_mode and i2v_condition_type == "token_replace":
-                    latents = self.scheduler.step(
-                        noise_pred[:, :, 1:, :, :], t, latents[:, :, 1:, :, :], **extra_step_kwargs, return_dict=False
-                    )[0]
-                    latents = torch.concat(
-                        [img_latents, latents], dim=2
-                    )
-                else:
-                    latents = self.scheduler.step(
-                        noise_pred, t, latents, **extra_step_kwargs, return_dict=False
-                    )[0]
-
-                if callback_on_step_end is not None:
-                    callback_kwargs = {}
-                    for k in callback_on_step_end_tensor_inputs:
-                        callback_kwargs[k] = locals()[k]
-                    callback_outputs = callback_on_step_end(self, i, t, callback_kwargs)
-
-                    latents = callback_outputs.pop("latents", latents)
-                    prompt_embeds = callback_outputs.pop("prompt_embeds", prompt_embeds)
-                    negative_prompt_embeds = callback_outputs.pop(
-                        "negative_prompt_embeds", negative_prompt_embeds
-                    )
-
-                # call the callback, if provided
-                if i == len(timesteps) - 1 or (
-                    (i + 1) > num_warmup_steps and (i + 1) % self.scheduler.order == 0
-                ):
-                    if progress_bar is not None:
-                        progress_bar.update()
-                    if callback is not None and i % callback_steps == 0:
-                        step_idx = i // getattr(self.scheduler, "order", 1)
-                        callback(step_idx, t, latents)
-
-        if not output_type == "latent":
-            expand_temporal_dim = False
-            if len(latents.shape) == 4:
-                if isinstance(self.vae, AutoencoderKLCausal3D):
-                    latents = latents.unsqueeze(2)
-                    expand_temporal_dim = True
-            elif len(latents.shape) == 5:
-                pass
-            else:
-                raise ValueError(
-                    f"Only support latents with shape (b, c, h, w) or (b, c, f, h, w), but got {latents.shape}."
-                )
-
-            if (
-                hasattr(self.vae.config, "shift_factor")
-                and self.vae.config.shift_factor
-            ):
-                latents = (
-                    latents / self.vae.config.scaling_factor
-                    + self.vae.config.shift_factor
-                )
-            else:
-                latents = latents / self.vae.config.scaling_factor
-
-            with torch.autocast(
-                device_type="cuda", dtype=vae_dtype, enabled=vae_autocast_enabled
-            ):
-                if enable_tiling:
-                    self.vae.enable_tiling()
-                    image = self.vae.decode(
-                        latents, return_dict=False, generator=generator
-                    )[0]
-                else:
-                    image = self.vae.decode(
-                        latents, return_dict=False, generator=generator
-                    )[0]
-
-            if expand_temporal_dim or image.shape[2] == 1:
-                image = image.squeeze(2)
-
-        else:
-            image = latents
-
-        image = (image / 2 + 0.5).clamp(0, 1)
-        # we always cast to float32 as this does not cause significant overhead and is compatible with bfloa16
-        image = image.cpu().float()
-
-        if i2v_mode and i2v_condition_type == "latent_concat":
-            image = image[:, :, 4:, :, :]
-
-        # Offload all models
-        self.maybe_free_model_hooks()
-
-        if not return_dict:
-            return image
-
-        return HunyuanVideoPipelineOutput(videos=image)
diff --git a/videotuna/models/hunyuan/hyvideo_i2v/diffusion/schedulers/__init__.py b/videotuna/models/hunyuan/hyvideo_i2v/diffusion/schedulers/__init__.py
deleted file mode 100644
index 14f2ba33..00000000
--- a/videotuna/models/hunyuan/hyvideo_i2v/diffusion/schedulers/__init__.py
+++ /dev/null
@@ -1 +0,0 @@
-from .scheduling_flow_match_discrete import FlowMatchDiscreteScheduler
diff --git a/videotuna/models/hunyuan/hyvideo_i2v/diffusion/schedulers/scheduling_flow_match_discrete.py b/videotuna/models/hunyuan/hyvideo_i2v/diffusion/schedulers/scheduling_flow_match_discrete.py
deleted file mode 100644
index c507ec4e..00000000
--- a/videotuna/models/hunyuan/hyvideo_i2v/diffusion/schedulers/scheduling_flow_match_discrete.py
+++ /dev/null
@@ -1,257 +0,0 @@
-# Copyright 2024 Stability AI, Katherine Crowson and The HuggingFace Team. All rights reserved.
-#
-# Licensed under the Apache License, Version 2.0 (the "License");
-# you may not use this file except in compliance with the License.
-# You may obtain a copy of the License at
-#
-#     http://www.apache.org/licenses/LICENSE-2.0
-#
-# Unless required by applicable law or agreed to in writing, software
-# distributed under the License is distributed on an "AS IS" BASIS,
-# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
-# See the License for the specific language governing permissions and
-# limitations under the License.
-# ==============================================================================
-#
-# Modified from diffusers==0.29.2
-#
-# ==============================================================================
-
-from dataclasses import dataclass
-from typing import Optional, Tuple, Union
-
-import numpy as np
-import torch
-
-from diffusers.configuration_utils import ConfigMixin, register_to_config
-from diffusers.utils import BaseOutput, logging
-from diffusers.schedulers.scheduling_utils import SchedulerMixin
-
-
-logger = logging.get_logger(__name__)  # pylint: disable=invalid-name
-
-
-@dataclass
-class FlowMatchDiscreteSchedulerOutput(BaseOutput):
-    """
-    Output class for the scheduler's `step` function output.
-
-    Args:
-        prev_sample (`torch.FloatTensor` of shape `(batch_size, num_channels, height, width)` for images):
-            Computed sample `(x_{t-1})` of previous timestep. `prev_sample` should be used as next model input in the
-            denoising loop.
-    """
-
-    prev_sample: torch.FloatTensor
-
-
-class FlowMatchDiscreteScheduler(SchedulerMixin, ConfigMixin):
-    """
-    Euler scheduler.
-
-    This model inherits from [`SchedulerMixin`] and [`ConfigMixin`]. Check the superclass documentation for the generic
-    methods the library implements for all schedulers such as loading and saving.
-
-    Args:
-        num_train_timesteps (`int`, defaults to 1000):
-            The number of diffusion steps to train the model.
-        timestep_spacing (`str`, defaults to `"linspace"`):
-            The way the timesteps should be scaled. Refer to Table 2 of the [Common Diffusion Noise Schedules and
-            Sample Steps are Flawed](https://huggingface.co/papers/2305.08891) for more information.
-        shift (`float`, defaults to 1.0):
-            The shift value for the timestep schedule.
-        reverse (`bool`, defaults to `True`):
-            Whether to reverse the timestep schedule.
-    """
-
-    _compatibles = []
-    order = 1
-
-    @register_to_config
-    def __init__(
-        self,
-        num_train_timesteps: int = 1000,
-        shift: float = 1.0,
-        reverse: bool = True,
-        solver: str = "euler",
-        n_tokens: Optional[int] = None,
-    ):
-        sigmas = torch.linspace(1, 0, num_train_timesteps + 1)
-
-        if not reverse:
-            sigmas = sigmas.flip(0)
-
-        self.sigmas = sigmas
-        # the value fed to model
-        self.timesteps = (sigmas[:-1] * num_train_timesteps).to(dtype=torch.float32)
-
-        self._step_index = None
-        self._begin_index = None
-
-        self.supported_solver = ["euler"]
-        if solver not in self.supported_solver:
-            raise ValueError(
-                f"Solver {solver} not supported. Supported solvers: {self.supported_solver}"
-            )
-
-    @property
-    def step_index(self):
-        """
-        The index counter for current timestep. It will increase 1 after each scheduler step.
-        """
-        return self._step_index
-
-    @property
-    def begin_index(self):
-        """
-        The index for the first timestep. It should be set from pipeline with `set_begin_index` method.
-        """
-        return self._begin_index
-
-    # Copied from diffusers.schedulers.scheduling_dpmsolver_multistep.DPMSolverMultistepScheduler.set_begin_index
-    def set_begin_index(self, begin_index: int = 0):
-        """
-        Sets the begin index for the scheduler. This function should be run from pipeline before the inference.
-
-        Args:
-            begin_index (`int`):
-                The begin index for the scheduler.
-        """
-        self._begin_index = begin_index
-
-    def _sigma_to_t(self, sigma):
-        return sigma * self.config.num_train_timesteps
-
-    def set_timesteps(
-        self,
-        num_inference_steps: int,
-        device: Union[str, torch.device] = None,
-        n_tokens: int = None,
-    ):
-        """
-        Sets the discrete timesteps used for the diffusion chain (to be run before inference).
-
-        Args:
-            num_inference_steps (`int`):
-                The number of diffusion steps used when generating samples with a pre-trained model.
-            device (`str` or `torch.device`, *optional*):
-                The device to which the timesteps should be moved to. If `None`, the timesteps are not moved.
-            n_tokens (`int`, *optional*):
-                Number of tokens in the input sequence.
-        """
-        self.num_inference_steps = num_inference_steps
-        
-        sigmas = torch.linspace(1, 0, num_inference_steps + 1)
-        sigmas = self.sd3_time_shift(sigmas)
-
-        if not self.config.reverse:
-            sigmas = 1 - sigmas
-
-        self.sigmas = sigmas
-        self.timesteps = (sigmas[:-1] * self.config.num_train_timesteps).to(
-            dtype=torch.float32, device=device
-        )
-
-        # Reset step index
-        self._step_index = None
-
-    def index_for_timestep(self, timestep, schedule_timesteps=None):
-        if schedule_timesteps is None:
-            schedule_timesteps = self.timesteps
-
-        indices = (schedule_timesteps == timestep).nonzero()
-
-        # The sigma index that is taken for the **very** first `step`
-        # is always the second index (or the last index if there is only 1)
-        # This way we can ensure we don't accidentally skip a sigma in
-        # case we start in the middle of the denoising schedule (e.g. for image-to-image)
-        pos = 1 if len(indices) > 1 else 0
-
-        return indices[pos].item()
-
-    def _init_step_index(self, timestep):
-        if self.begin_index is None:
-            if isinstance(timestep, torch.Tensor):
-                timestep = timestep.to(self.timesteps.device)
-            self._step_index = self.index_for_timestep(timestep)
-        else:
-            self._step_index = self._begin_index
-
-    def scale_model_input(
-        self, sample: torch.Tensor, timestep: Optional[int] = None
-    ) -> torch.Tensor:
-        return sample
-
-    def sd3_time_shift(self, t: torch.Tensor):
-        return (self.config.shift * t) / (1 + (self.config.shift - 1) * t)
-
-    def step(
-        self,
-        model_output: torch.FloatTensor,
-        timestep: Union[float, torch.FloatTensor],
-        sample: torch.FloatTensor,
-        return_dict: bool = True,
-    ) -> Union[FlowMatchDiscreteSchedulerOutput, Tuple]:
-        """
-        Predict the sample from the previous timestep by reversing the SDE. This function propagates the diffusion
-        process from the learned model outputs (most often the predicted noise).
-
-        Args:
-            model_output (`torch.FloatTensor`):
-                The direct output from learned diffusion model.
-            timestep (`float`):
-                The current discrete timestep in the diffusion chain.
-            sample (`torch.FloatTensor`):
-                A current instance of a sample created by the diffusion process.
-            generator (`torch.Generator`, *optional*):
-                A random number generator.
-            n_tokens (`int`, *optional*):
-                Number of tokens in the input sequence.
-            return_dict (`bool`):
-                Whether or not to return a [`~schedulers.scheduling_euler_discrete.EulerDiscreteSchedulerOutput`] or
-                tuple.
-
-        Returns:
-            [`~schedulers.scheduling_euler_discrete.EulerDiscreteSchedulerOutput`] or `tuple`:
-                If return_dict is `True`, [`~schedulers.scheduling_euler_discrete.EulerDiscreteSchedulerOutput`] is
-                returned, otherwise a tuple is returned where the first element is the sample tensor.
-        """
-
-        if (
-            isinstance(timestep, int)
-            or isinstance(timestep, torch.IntTensor)
-            or isinstance(timestep, torch.LongTensor)
-        ):
-            raise ValueError(
-                (
-                    "Passing integer indices (e.g. from `enumerate(timesteps)`) as timesteps to"
-                    " `EulerDiscreteScheduler.step()` is not supported. Make sure to pass"
-                    " one of the `scheduler.timesteps` as a timestep."
-                ),
-            )
-
-        if self.step_index is None:
-            self._init_step_index(timestep)
-
-        # Upcast to avoid precision issues when computing prev_sample
-        sample = sample.to(torch.float32)
-
-        dt = self.sigmas[self.step_index + 1] - self.sigmas[self.step_index]
-
-        if self.config.solver == "euler":
-            prev_sample = sample + model_output.to(torch.float32) * dt
-        else:
-            raise ValueError(
-                f"Solver {self.config.solver} not supported. Supported solvers: {self.supported_solver}"
-            )
-
-        # upon completion increase step index by one
-        self._step_index += 1
-
-        if not return_dict:
-            return (prev_sample,)
-
-        return FlowMatchDiscreteSchedulerOutput(prev_sample=prev_sample)
-
-    def __len__(self):
-        return self.config.num_train_timesteps
diff --git a/videotuna/models/hunyuan/hyvideo_i2v/ds_config.py b/videotuna/models/hunyuan/hyvideo_i2v/ds_config.py
deleted file mode 100644
index bf37f5bb..00000000
--- a/videotuna/models/hunyuan/hyvideo_i2v/ds_config.py
+++ /dev/null
@@ -1,63 +0,0 @@
-import argparse
-from pathlib import Path
-
-
-def get_tensorboard_config(output_dir: str, job_name: str):
-    tensorboard_config = {
-        "enabled": True,
-        "output_path": output_dir,
-        "job_name": job_name
-    }
-    return tensorboard_config
-
-
-def get_deepspeed_config(args: argparse.Namespace,
-                         micro_batch_size: int,
-                         global_batch_size: int,
-                         output_dir: str = None,
-                         job_name: str = None,
-                         ):
-    config = {
-        "train_batch_size": global_batch_size,
-        "train_micro_batch_size_per_gpu": micro_batch_size,
-        "gradient_accumulation_steps": args.gradient_accumulation_steps,
-        "steps_per_print": args.log_every,
-        "optimizer": {
-            "type": "AdamW",
-            "params": {
-                "lr": args.lr,
-                "betas": [
-                    args.adam_beta1,
-                    args.adam_beta2
-                ],
-                "eps": args.adam_eps,
-                "weight_decay": args.weight_decay
-            }
-        },
-        "gradient_clipping": 1.0,
-        "prescale_gradients": True,
-
-        "fp16": {
-            "enabled": args.precision == 'fp16',
-            "fp16_master_weights_and_grads": False,
-            "loss_scale": 0,
-            "loss_scale_window": 500,
-            "hysteresis": 2,
-            "min_loss_scale": 1,
-            "initial_scale_power": 15
-        },
-        "bf16": {
-            "enabled": args.precision == 'bf16'
-        },
-        "wall_clock_breakdown": False,
-        "zero_optimization": {
-            "stage": args.zero_stage,
-            "reduce_scatter": False,
-            "reduce_bucket_size": 1e9,
-        },
-    }
-
-    if args.tensorboard:
-        config["tensorboard"] = get_tensorboard_config(output_dir, job_name)
-
-    return config
diff --git a/videotuna/models/hunyuan/hyvideo_i2v/hyvae_extract/README.md b/videotuna/models/hunyuan/hyvideo_i2v/hyvae_extract/README.md
deleted file mode 100644
index 319dbba0..00000000
--- a/videotuna/models/hunyuan/hyvideo_i2v/hyvae_extract/README.md
+++ /dev/null
@@ -1,127 +0,0 @@
-
-[中文阅读](./README_zh.md)
-
-# HunyuanVideo Latent Feature Extraction Tool
-
-This project provides an efficient tool for extracting latent features from videos, preparing them for subsequent video generation and processing tasks.
-
-## Features
-
-- Support for various video formats and resolutions
-- Multi-GPU parallel processing for improved efficiency
-- Support for multiple aspect ratios
-- High-performance VAE model for feature extraction
-- Automatic skipping of already processed videos, supporting resume functionality
-
-## Usage
-
-### 1. Configuration File
-
-## Input dataset Format
-
-The input video metadata file (meta_file.list) should be a list of JSON file paths, with each JSON file containing the following fields:
-
-The format of meta_file.list (e.g., ./assets/demo/i2v_lora/train_dataset/meta_file.list) is as follows
-```
-/path/to/0.json
-/path/to/1.json
-/path/to/2.json
-...
-```
-
-The format of /path/to/0.json (e.g., ./assets/demo/i2v_lora/train_dataset/meta_data.json) is as follows
-```json
-{
-  "video_path": "/path/to/video.mp4",
-  "raw_caption": {
-    "long caption": "Detailed description text of the video"
-  }
-}
-```
-
-Configure parameters in `hyvideo/hyvae_extract/vae.yaml`:
-
-```yaml
-vae_path: "./ckpts/hunyuan-video-i2v-720p/vae" # VAE model path
-video_url_files: "/path/to/meta_file.list"     # Video metadata file list
-output_base_dir: "/path/to/output/directory"   # Output directory
-sample_n_frames: 129                           # Number of frames to sample
-target_size:                                   # Target size
-  - bucket_size
-  - bucket_size
-enable_multi_aspect_ratio: True                # Enable multiple aspect ratios
-use_stride: True                               # Use stride sampling
-```
-
-#### Bucket Size Reference
-
-The `target_size` parameter defines the resolution bucket size. Here are the recommended values for different quality levels:
-
-| Quality | Bucket Size | Typical Resolution |
-|---------|-------------|-------------------|
-| 720p    | 960         | 1280×720 or similar |
-| 540p    | 720         | 960×540 or similar |
-| 360p    | 480         | 640×360 or similar |
-
-When `enable_multi_aspect_ratio` is set to `True`, the system will use these bucket sizes as a base to generate multiple aspect ratio buckets. For optimal performance, choose a bucket size that balances quality and memory usage based on your hardware capabilities.
-
-### 2. Run Extraction
-
-```bash
-# Set environment variables
-export HOST_GPU_NUM=8  # Set the number of GPUs to use
-
-# Run extraction script
-cd HunyuanVideo-I2V
-bash hyvideo/hyvae_extract/start.sh
-```
-
-### 3. Single GPU Run
-
-```bash
-cd HunyuanVideo-I2V
-export PYTHONPATH=${PYTHONPATH}:`pwd`
-export HOST_GPU_NUM=1
-CUDA_VISIBLE_DEVICES=0 python3 -u hyvideo/hyvae_extract/run.py --local_rank 0 --config 'hyvideo/hyvae_extract/vae.yaml'
-```
-
-## Output Files
-
-The program generates the following files in the specified output directory:
-
-1. `{video_id}.npy` - Latent feature array of the video
-2. `json_path/{video_id}.json` - JSON file containing video metadata, including:
-   - video_id: Video ID
-   - latent_shape: Shape of the latent features
-   - video_path: Original video path
-   - prompt: Video description/prompt
-   - npy_save_path: Path where the latent features are saved
-
-```
-output_base_dir/
-│
-├── {video_id_1}.npy # Latent feature array for video 1
-├── {video_id_2}.npy # Latent feature array for video 2
-├── {video_id_3}.npy # Latent feature array for video 3
-│ ...
-├── {video_id_n}.npy # Latent feature array for video n
-│
-└── json_path/ # Directory containing metadata JSON files
-│     ├── {video_id_1}.json # Metadata for video 1
-│     ├── {video_id_2}.json # Metadata for video 2
-│     ├── {video_id_3}.json # Metadata for video 3
-│     │ ...
-│     └── {video_id_n}.json # Metadata for video n
-```
-
-## Advanced Configuration
-
-### Multiple Aspect Ratio Processing
-
-When `enable_multi_aspect_ratio` is set to `True`, the system selects the target size closest to the original aspect ratio of the video, rather than forcing it to be cropped to a fixed size. This is useful for maintaining the integrity of the video content.
-
-### Stride Sampling
-
-When `use_stride` is set to `True`, the system automatically adjusts the sampling stride based on the video's frame rate:
-- When frame rate >= 50fps, stride is 2
-- When frame rate < 50fps, stride is 1
\ No newline at end of file
diff --git a/videotuna/models/hunyuan/hyvideo_i2v/hyvae_extract/README_zh.md b/videotuna/models/hunyuan/hyvideo_i2v/hyvae_extract/README_zh.md
deleted file mode 100644
index 0c6c964b..00000000
--- a/videotuna/models/hunyuan/hyvideo_i2v/hyvae_extract/README_zh.md
+++ /dev/null
@@ -1,126 +0,0 @@
-[English](./README.md)
-
-# 混元视频特征提取工具
-
-本项目提供了一个高效的工具，用于从视频中提取潜在特征，为后续的视频生成和处理任务做准备。
-
-## 功能特点
-
-- 支持各种视频格式和分辨率
-- 多GPU并行处理，提高效率
-- 支持多种宽高比
-- 高性能VAE模型用于特征提取
-- 自动跳过已处理的视频，支持断点续传功能
-
-## 使用方法
-
-### 1. 配置文件
-
-## 输入数据集格式
-
-输入的视频元数据文件(meta_file.list)应为JSON文件路径的列表，每个JSON文件包含以下字段：
-
-meta_file.list的格式（例如，./assets/demo/i2v_lora/train_dataset/meta_file.list）如下：
-```
-/path/to/0.json
-/path/to/1.json
-/path/to/2.json
-...
-```
-
-/path/to/0.json的格式（例如，./assets/demo/i2v_lora/train_dataset/meta_data.json）如下：
-```json
-{
-  "video_path": "/path/to/video.mp4",
-  "raw_caption": {
-    "long caption": "视频的详细描述文本"
-  }
-}
-```
-
-在`hyvideo/hyvae_extract/vae.yaml`中配置参数：
-
-```yaml
-vae_path: "./ckpts/hunyuan-video-i2v-720p/vae" # VAE模型路径
-video_url_files: "/path/to/meta_file.list"     # 视频元数据文件列表
-output_base_dir: "/path/to/output/directory"   # 输出目录
-sample_n_frames: 129                           # 采样帧数
-target_size:                                   # 目标尺寸
-  - bucket_size
-  - bucket_size
-enable_multi_aspect_ratio: True                # 启用多种宽高比
-use_stride: True                               # 使用步长采样
-```
-
-#### 分辨率桶大小参考
-
-`target_size`参数定义了分辨率桶大小。以下是不同质量级别的推荐值：
-
-| 质量 | 桶大小 | 典型分辨率 |
-|---------|-------------|-------------------|
-| 720p    | 960         | 1280×720或类似 |
-| 540p    | 720         | 960×540或类似 |
-| 360p    | 480         | 640×360或类似 |
-
-当`enable_multi_aspect_ratio`设置为`True`时，系统将使用这些桶大小作为基础来生成多种宽高比的桶。为了获得最佳性能，请根据您的硬件能力选择平衡质量和内存使用的桶大小。
-
-### 2. 运行提取
-
-```bash
-# 设置环境变量
-export HOST_GPU_NUM=8  # 设置要使用的GPU数量
-
-# 运行提取脚本
-cd HunyuanVideo-I2V
-bash hyvideo/hyvae_extract/start.sh
-```
-
-### 3. 单GPU运行
-
-```bash
-cd HunyuanVideo-I2V
-export PYTHONPATH=${PYTHONPATH}:`pwd`
-export HOST_GPU_NUM=1
-CUDA_VISIBLE_DEVICES=0 python3 -u hyvideo/hyvae_extract/run.py --local_rank 0 --config 'hyvideo/hyvae_extract/vae.yaml'
-```
-
-## 输出文件
-
-程序在指定的输出目录中生成以下文件：
-
-1. `{video_id}.npy` - 视频的潜在特征数组
-2. `json_path/{video_id}.json` - 包含视频元数据的JSON文件，包括：
-   - video_id: 视频ID
-   - latent_shape: 潜在特征的形状
-   - video_path: 原始视频路径
-   - prompt: 视频描述/提示
-   - npy_save_path: 保存潜在特征的路径
-
-```
-output_base_dir/
-│
-├── {video_id_1}.npy # 视频1的潜在特征数组
-├── {video_id_2}.npy # 视频2的潜在特征数组
-├── {video_id_3}.npy # 视频3的潜在特征数组
-│ ...
-├── {video_id_n}.npy # 视频n的潜在特征数组
-│
-└── json_path/ # 包含元数据JSON文件的目录
-      ├── {video_id_1}.json # 视频1的元数据
-      ├── {video_id_2}.json # 视频2的元数据
-      ├── {video_id_3}.json # 视频3的元数据
-      │ ...
-      └── {video_id_n}.json # 视频n的元数据
-```
-
-## 高级配置
-
-### 多宽高比处理
-
-当`enable_multi_aspect_ratio`设置为`True`时，系统会选择最接近视频原始宽高比的目标尺寸，而不是强制将其裁剪为固定尺寸。这有助于保持视频内容的完整性。
-
-### 步长采样
-
-当`use_stride`设置为`True`时，系统会根据视频的帧率自动调整采样步长：
-- 当帧率 >= 50fps时，步长为2
-- 当帧率 < 50fps时，步长为1 
\ No newline at end of file
diff --git a/videotuna/models/hunyuan/hyvideo_i2v/hyvae_extract/dataset.py b/videotuna/models/hunyuan/hyvideo_i2v/hyvae_extract/dataset.py
deleted file mode 100644
index 8ebd115f..00000000
--- a/videotuna/models/hunyuan/hyvideo_i2v/hyvae_extract/dataset.py
+++ /dev/null
@@ -1,255 +0,0 @@
-from typing import Tuple, List
-from decord import VideoReader
-import urllib
-import io
-import os
-import csv
-import numpy as np
-import torch
-from torch.utils.data import Dataset, IterableDataset
-import torchvision.transforms as transforms
-from torchvision.transforms.functional import crop
-from pathlib import Path
-import sys
-import json
-
-
-def split_video_urls(meta_files: str, global_rank: int, world_size: int):
-    meta_paths = []
-    meta_paths.extend([line.strip() for line in open(meta_files, 'r').readlines()])
-    num_videos = len(meta_paths)
-    num_videos_per_rank = num_videos // world_size
-    remainder = num_videos % world_size
-
-    # Calculate start and end indices
-    start = num_videos_per_rank * global_rank + min(global_rank, remainder)
-    end = start + num_videos_per_rank + (1 if global_rank < remainder else 0)
-
-    return start, end, meta_paths[start:end]
-
-class MultiBucketDataset(IterableDataset):
-    def __init__(self, source: Dataset, batch_size: int, max_buf = 64):
-        super().__init__()
-        self.source = source
-        self.batch_size = batch_size
-        self.buffer = {}   
-        self.max_buf = max_buf
-        self.size = 0
-
-    @staticmethod
-    def collate_fn(samples):
-        pixel_values = torch.stack([sample["pixel_values"] for sample in samples]).contiguous()
-        videoid = [sample["videoid"] for sample in samples]
-        valid = [sample["valid"] for sample in samples]
-        batch = {"pixel_values": pixel_values, "videoid": videoid, "valid": valid}
-        return batch
-
-    def __iter__(self):
-        # split dataset
-        worker_info = torch.utils.data.get_worker_info()
-        if worker_info is None:
-            iter_start = 0
-            iter_end = len(self.source)
-        else:
-            worker_id = int(worker_info.id)
-            per_worker = len(self.source) // int(worker_info.num_workers)
-            per_worker += int(worker_id < len(self.source) % int(worker_info.num_workers))
-            if worker_id >= len(self.source) % int(worker_info.num_workers):
-                iter_start = worker_id * per_worker + len(self.source) % int(worker_info.num_workers)  
-            else:          
-                iter_start = worker_id * per_worker
-            iter_end = iter_start + per_worker
-       
-       # bucketing
-        for i in range(iter_start, iter_end):
-            sample = self.source[i]
-            if sample["valid"] is False:
-                continue
-            T, C, H, W = sample["pixel_values"].shape
-            if (T, H, W) not in self.buffer:
-                self.buffer[(T, H, W)] = []
-            self.buffer[(T, H, W)].append(sample)
-            self.size += 1
-            if len(self.buffer[(T, H, W)]) == self.batch_size:
-                yield self.buffer[(T, H, W)]
-                self.size -= self.batch_size
-                self.buffer[(T, H, W)] = []
-            if self.size > self.max_buf and (len(self.buffer[(T, H, W)]) > 0):
-                self.size -= len(self.buffer[(T, H, W)])
-                yield self.buffer[(T, H, W)]
-                self.buffer[(T, H, W)] = []
-        # yield the remaining batch
-        for bucket, samples in self.buffer.items():
-            if len(samples) > 0:
-                yield samples
-
-class VideoDataset(Dataset):
-    def __init__(
-        self,
-        meta_files: List[str],
-        latent_cache_dir: str,
-        sample_size: Tuple[int, int],
-        sample_n_frames: int,
-        is_center_crop: bool = True,
-        enable_multi_aspect_ratio: bool = False,
-        vae_time_compression_ratio: int = 4,
-        use_stride: bool = False,
-    ):
-        if not Path(latent_cache_dir).exists():
-            Path(latent_cache_dir).mkdir(parents=True, exist_ok=True)
-        self.latent_cache_dir = latent_cache_dir
-
-        self.sample_n_frames = sample_n_frames
-        self.sample_size = tuple(sample_size)
-        self.is_center_crop = is_center_crop
-        self.vae_time_compression_ratio = vae_time_compression_ratio
-        self.enable_multi_aspect_ratio = enable_multi_aspect_ratio
-        self.dataset = meta_files
-        self.length = len(self.dataset)
-        self.use_stride = use_stride
-
-        # multi-aspect-ratio buckets
-        if enable_multi_aspect_ratio:
-            assert self.sample_size[0] == self.sample_size[1]
-            if self.sample_size[0] < 540:
-                self.buckets = self.generate_crop_size_list(base_size=self.sample_size[0])
-            else:
-                self.buckets = self.generate_crop_size_list(base_size=self.sample_size[0], patch_size=32)
-            self.aspect_ratios = np.array([float(w) / float(h) for w, h in self.buckets])
-            print(f"Multi-aspect-ratio bucket num: {len(self.buckets)}")
-        # image preprocess
-        if not enable_multi_aspect_ratio:
-            self.train_crop = transforms.CenterCrop(self.sample_size) if self.is_center_crop else transforms.RandomCrop(self.sample_size)
-
-    def request_ceph_data(self, path):
-        try:
-            video_reader = VideoReader(path)
-        except Exception as e:
-            print(f"Error: {e}")
-            raise
-        return video_reader
-
-    def preprocess_url(self, data_json_path):
-        with open(data_json_path, "r") as f:
-            data_dict = json.load(f)
-
-        video_path = data_dict['video_path']
-        video_id = video_path.split('/')[-1].split('.')[0]
-        prompt = data_dict['raw_caption']["long caption"]
-
-        item = {"video_path": video_path, "videoid": video_id, "prompt": prompt}
-        return item
-
-    def get_item(self, idx):
-        # Create Video Reader
-        data_json_path = self.dataset[idx]
-        video_item = self.preprocess_url(data_json_path)
-
-        # 20250322 pftq: fixed to return 5 values for consistency and "not enough values to unpack" error
-        # Skip if exists
-        latent_save_path = Path(self.latent_cache_dir) / f"{video_item['videoid']}.npy"
-        if latent_save_path.exists():
-            return None, None, None, None, False
-
-        video_reader = self.request_ceph_data(video_item["video_path"])
-
-        fps = video_reader.get_avg_fps()
-
-        stride = 1
-        if self.use_stride:
-            if int(fps) >= 50:
-                stride = 2
-            else:
-                stride = 1
-        else:
-            stride = 1
-            
-        video_length = len(video_reader)
-        if video_length < self.sample_n_frames*stride:
-            sample_n_frames = video_length - (video_length - 1) % (self.vae_time_compression_ratio*stride)  # 4n+1/8n+1
-        else:
-            sample_n_frames = self.sample_n_frames*stride  
-
-        start_idx = 0
-        batch_index = list(range(start_idx, start_idx + sample_n_frames, stride))
-
-        # 20250322 pftq: fixed to return 5 values for consistency and "not enough values to unpack" error
-        if len(batch_index) == 0:
-            print(f"get video len=0, skip for {video_item['video_path']}")
-            return None, video_item["videoid"], video_item["video_path"], video_item["prompt"], False
-
-        # Read frames
-        try:
-            video_images = video_reader.get_batch(batch_index).asnumpy()
-        except Exception as e:
-            print(f'Error: {e}, video_path: {video_item["video_path"]}')
-            raise
-        pixel_values = torch.from_numpy(video_images).permute(0, 3, 1, 2).contiguous()
-        del video_reader
-
-        return pixel_values, video_item["videoid"], video_item["video_path"], video_item["prompt"], True
-
-    def preprocess_train(self, frames):
-        height, width = frames.shape[-2:]
-        # Resize & Crop
-        if self.enable_multi_aspect_ratio:
-            bw, bh = self.get_closest_ratio(width=width, height=height, ratios=self.aspect_ratios, buckets=self.buckets)
-            sample_size = bh, bw
-            target_size = self.get_target_size(frames, sample_size)
-            train_crop = transforms.CenterCrop(sample_size) if self.is_center_crop else transforms.RandomCrop(sample_size)
-        else:
-            sample_size = self.sample_size
-            target_size = self.get_target_size(frames, sample_size)
-            train_crop = self.train_crop
-
-        frames = transforms.Resize(target_size, interpolation=transforms.InterpolationMode.BILINEAR, antialias=True)(frames)
-        if self.is_center_crop:
-            y1 = max(0, int(round((height - sample_size[0]) / 2.0)))
-            x1 = max(0, int(round((width - sample_size[1]) / 2.0)))
-            frames = train_crop(frames)
-        else:
-            y1, x1, h, w = train_crop.get_params(frames, sample_size)
-            frames = crop(frames, y1, x1, h, w)
-        return frames
-
-    @staticmethod
-    def get_closest_ratio(width: float, height: float, ratios: list, buckets: list):
-        aspect_ratio = float(width) / float(height)
-        closest_ratio_id = np.abs(ratios - aspect_ratio).argmin()
-        return buckets[closest_ratio_id]
-
-    @staticmethod
-    def generate_crop_size_list(base_size=256, patch_size=16, max_ratio=4.0):
-        num_patches = round((base_size / patch_size) ** 2)
-        assert max_ratio >= 1.
-        crop_size_list = []
-        wp, hp = num_patches, 1
-        while wp > 0:
-            if max(wp, hp) / min(wp, hp) <= max_ratio:
-                crop_size_list.append((wp * patch_size, hp * patch_size))
-            if (hp + 1) * wp <= num_patches:
-                hp += 1
-            else:
-                wp -= 1
-        return crop_size_list
-
-    def get_target_size(self, frames, target_size):
-        T, C, H, W = frames.shape
-        th, tw = target_size
-        r = max(th / H, tw / W)
-        target_size = int(H * r), int(W * r)
-        return target_size
-
-    def __len__(self):
-        return self.length
-
-    def __getitem__(self, idx):
-        try:
-            pixel, videoid, video_path, prompt, valid = self.get_item(idx)
-            if pixel is not None and valid:
-                pixel = self.preprocess_train(pixel)
-            sample = dict(pixel_values=pixel, videoid=videoid, video_path=video_path, prompt=prompt,valid=valid)
-            return sample
-        except Exception as e:
-            print(e)
-            return dict(pixel_values=None, videoid=None, video_path=None, prompt=None, valid=False)
diff --git a/videotuna/models/hunyuan/hyvideo_i2v/hyvae_extract/run.py b/videotuna/models/hunyuan/hyvideo_i2v/hyvae_extract/run.py
deleted file mode 100644
index 1367ca93..00000000
--- a/videotuna/models/hunyuan/hyvideo_i2v/hyvae_extract/run.py
+++ /dev/null
@@ -1,152 +0,0 @@
-from typing import Tuple, List, Dict
-import sys
-from pathlib import Path
-import argparse
-import time
-import os
-import traceback
-import random
-import numpy as np
-from einops import rearrange
-import torch
-from torch.utils.data import DataLoader
-from torchvision import transforms
-from dataset import VideoDataset, MultiBucketDataset, split_video_urls
-import json
-import glob
-from omegaconf import OmegaConf
-from hyvideo.vae import load_vae
-
-DEVICE = "cuda"
-DTYPE = torch.float16
-
-
-def seed_everything(seed):
-    random.seed(seed)
-    np.random.seed(seed)
-    torch.manual_seed(seed)
-    torch.cuda.manual_seed_all(seed)
-
-@torch.no_grad()
-def extract(
-    vae: torch.nn.Module,
-    meta_files: List[str],
-    output_base_dir: str,
-    sample_n_frames: int,
-    target_size: Tuple[int, int],
-    enable_multi_aspect_ratio: bool = False,
-    use_stride: bool = False,
-    batch_size=None,
-):
-    dataset = VideoDataset(
-        meta_files=meta_files,
-        latent_cache_dir=output_base_dir,
-        sample_size=target_size,
-        sample_n_frames=sample_n_frames,
-        is_center_crop=True,
-        enable_multi_aspect_ratio=enable_multi_aspect_ratio,
-        vae_time_compression_ratio=vae.time_compression_ratio,
-        use_stride=use_stride
-    )
-    if batch_size is not None:
-        dataset = MultiBucketDataset(dataset, batch_size=batch_size)
-
-    dataloader = DataLoader(
-        dataset,
-        batch_size=None,
-        collate_fn=dataset.collate_fn if batch_size is not None else None,
-        shuffle=False,
-        num_workers=8,
-        prefetch_factor=4,
-        pin_memory=False,
-    )
-    normalize_fn = transforms.Normalize(mean=[0.5, 0.5, 0.5], std=[0.5, 0.5, 0.5], inplace=True)
-
-    save_json_path = Path(output_base_dir) / "json_path"
-    if not os.path.exists(save_json_path):
-        os.makedirs(save_json_path, exist_ok=True)
-
-    for i, item in enumerate(dataloader):
-        print(f"processing video latent extraction {i}")
-        if batch_size is None:
-            if item.get("valid", True) is False:
-                continue
-            item["videoid"] = [item["videoid"]]
-            item["valid"] = [item["valid"]]
-            item["prompt"] = [item["prompt"]]
-        try:
-            pixel_values = item["pixel_values"]
-            pixel_values = pixel_values.to(device=vae.device, dtype=vae.dtype)
-            pixel_values = pixel_values / 255.
-            pixel_values = normalize_fn(pixel_values)
-            if pixel_values.ndim == 4:
-                pixel_values = pixel_values.unsqueeze(0)
-            pixel_values = rearrange(pixel_values, "b f c h w -> b c f h w")
-            z = vae.encode(pixel_values).latent_dist.mode()
-            z = z.detach().to(DTYPE).cpu().numpy()
-
-            assert z.shape[0] == len(item["videoid"])
-            for k in range(z.shape[0]):
-                save_path = Path(output_base_dir) / f"{item['videoid'][k]}.npy"
-                np.save(save_path, z[k][None, ...])
-                data = {"video_id": item["videoid"][k],
-                    "latent_shape": z[k][None,...].shape,
-                    "video_path": item["video_path"][k], 
-                    "prompt": item["prompt"][k],
-                    "npy_save_path": str(save_path)}
-                with open(save_json_path / f"{item['videoid'][k]}.json", "w", encoding='utf-8') as f:
-                    json.dump(data, f, ensure_ascii=False)
-        except Exception as e:
-            traceback.print_exc()
-
-def main(
-    local_rank: int,
-    vae_path: str,
-    meta_files: str,
-    output_base_dir: str,
-    sample_n_frames: int,
-    target_size: Tuple[int, int],
-    enable_multi_aspect_ratio: bool = False,
-    use_stride: bool = False,
-    seed: int = 42,
-):
-    seed_everything(seed)
-
-    global_rank = local_rank
-    world_size = int(os.environ["HOST_GPU_NUM"])
-
-    print(f"split video urls")
-    start, end, meta_files = split_video_urls(meta_files, global_rank, world_size)
-
-    print(f"Load VAE")
-    vae, vae_path, spatial_compression_ratio, time_compression_ratio = load_vae(
-        vae_type="884-16c-hy",
-        vae_precision='fp16',
-        vae_path=vae_path,
-        device=DEVICE,
-    )
-
-    # vae.enable_temporal_tiling()
-    vae.enable_spatial_tiling()
-    vae.eval()
-
-    print(f"processing video latent extraction")
-    extract(vae, meta_files, output_base_dir, sample_n_frames, target_size, enable_multi_aspect_ratio, use_stride)
-
-if __name__ == "__main__":
-    parser = argparse.ArgumentParser()
-    parser.add_argument("--local_rank", type=int, required=True)
-    parser.add_argument("--config", default='./vae.yaml', type=str)
-    args = parser.parse_args()
-
-    config = OmegaConf.load(args.config)
-
-    vae_path = config.vae_path
-    sample_n_frames = config.sample_n_frames
-    target_size = [config.target_size[0], config.target_size[1]]
-    enable_multi_aspect_ratio = config.enable_multi_aspect_ratio
-    output_base_dir = config.output_base_dir
-    use_stride = config.use_stride
-    meta_files = config.video_url_files
-
-    main(args.local_rank, vae_path, meta_files, output_base_dir, sample_n_frames, target_size, enable_multi_aspect_ratio, use_stride)
\ No newline at end of file
diff --git a/videotuna/models/hunyuan/hyvideo_i2v/hyvae_extract/start.sh b/videotuna/models/hunyuan/hyvideo_i2v/hyvae_extract/start.sh
deleted file mode 100644
index 0760f277..00000000
--- a/videotuna/models/hunyuan/hyvideo_i2v/hyvae_extract/start.sh
+++ /dev/null
@@ -1,8 +0,0 @@
-export PYTHONPATH=${PYTHONPATH}:`pwd`
-for ((i=0;i<$HOST_GPU_NUM;++i)); do
-    CUDA_VISIBLE_DEVICES=$i python3 -u hyvideo/hyvae_extract/run.py --local_rank $i --config 'hyvideo/hyvae_extract/vae.yaml'&
-done
-# CUDA_VISIBLE_DEVICES=0 python3 -u hyvideo/hyvae_extract/run.py --local_rank 0 --config 'hyvideo/hyvae_extract/vae.yaml'&
-wait
-
-echo "Finished."
diff --git a/videotuna/models/hunyuan/hyvideo_i2v/hyvae_extract/vae.yaml b/videotuna/models/hunyuan/hyvideo_i2v/hyvae_extract/vae.yaml
deleted file mode 100644
index b6e04461..00000000
--- a/videotuna/models/hunyuan/hyvideo_i2v/hyvae_extract/vae.yaml
+++ /dev/null
@@ -1,9 +0,0 @@
-vae_path: "./ckpts/hunyuan-video-i2v-720p/vae"
-video_url_files: "/path/to/meta_file.list"
-output_base_dir: "/path/to/output/directory"
-sample_n_frames: 129
-target_size: 
-  - 480
-  - 480
-enable_multi_aspect_ratio: True
-use_stride: True
\ No newline at end of file
diff --git a/videotuna/models/hunyuan/hyvideo_i2v/modules/__init__.py b/videotuna/models/hunyuan/hyvideo_i2v/modules/__init__.py
deleted file mode 100644
index ec9e0c85..00000000
--- a/videotuna/models/hunyuan/hyvideo_i2v/modules/__init__.py
+++ /dev/null
@@ -1 +0,0 @@
-from .models import HYVideoDiffusionTransformer
\ No newline at end of file
diff --git a/videotuna/models/hunyuan/hyvideo_i2v/modules/activation_layers.py b/videotuna/models/hunyuan/hyvideo_i2v/modules/activation_layers.py
deleted file mode 100644
index f8774c26..00000000
--- a/videotuna/models/hunyuan/hyvideo_i2v/modules/activation_layers.py
+++ /dev/null
@@ -1,23 +0,0 @@
-import torch.nn as nn
-
-
-def get_activation_layer(act_type):
-    """get activation layer
-
-    Args:
-        act_type (str): the activation type
-
-    Returns:
-        torch.nn.functional: the activation layer
-    """
-    if act_type == "gelu":
-        return lambda: nn.GELU()
-    elif act_type == "gelu_tanh":
-        # Approximate `tanh` requires torch >= 1.13
-        return lambda: nn.GELU(approximate="tanh")
-    elif act_type == "relu":
-        return nn.ReLU
-    elif act_type == "silu":
-        return nn.SiLU
-    else:
-        raise ValueError(f"Unknown activation type: {act_type}")
diff --git a/videotuna/models/hunyuan/hyvideo_i2v/modules/attenion.py b/videotuna/models/hunyuan/hyvideo_i2v/modules/attenion.py
deleted file mode 100644
index 44548793..00000000
--- a/videotuna/models/hunyuan/hyvideo_i2v/modules/attenion.py
+++ /dev/null
@@ -1,212 +0,0 @@
-import importlib.metadata
-import math
-
-import torch
-import torch.nn as nn
-import torch.nn.functional as F
-
-try:
-    import flash_attn
-    from flash_attn.flash_attn_interface import _flash_attn_forward
-    from flash_attn.flash_attn_interface import flash_attn_varlen_func
-except ImportError:
-    flash_attn = None
-    flash_attn_varlen_func = None
-    _flash_attn_forward = None
-
-
-MEMORY_LAYOUT = {
-    "flash": (
-        lambda x: x.view(x.shape[0] * x.shape[1], *x.shape[2:]),
-        lambda x: x,
-    ),
-    "torch": (
-        lambda x: x.transpose(1, 2),
-        lambda x: x.transpose(1, 2),
-    ),
-    "vanilla": (
-        lambda x: x.transpose(1, 2),
-        lambda x: x.transpose(1, 2),
-    ),
-}
-
-
-def get_cu_seqlens(text_mask, img_len):
-    """Calculate cu_seqlens_q, cu_seqlens_kv using text_mask and img_len
-
-    Args:
-        text_mask (torch.Tensor): the mask of text
-        img_len (int): the length of image
-
-    Returns:
-        torch.Tensor: the calculated cu_seqlens for flash attention
-    """
-    batch_size = text_mask.shape[0]
-    text_len = text_mask.sum(dim=1)
-    max_len = text_mask.shape[1] + img_len
-
-    cu_seqlens = torch.zeros([2 * batch_size + 1], dtype=torch.int32, device="cuda")
-
-    for i in range(batch_size):
-        s = text_len[i] + img_len
-        s1 = i * max_len + s
-        s2 = (i + 1) * max_len
-        cu_seqlens[2 * i + 1] = s1
-        cu_seqlens[2 * i + 2] = s2
-
-    return cu_seqlens
-
-
-def attention(
-    q,
-    k,
-    v,
-    mode="flash",
-    drop_rate=0,
-    attn_mask=None,
-    causal=False,
-    cu_seqlens_q=None,
-    cu_seqlens_kv=None,
-    max_seqlen_q=None,
-    max_seqlen_kv=None,
-    batch_size=1,
-):
-    """
-    Perform QKV self attention.
-
-    Args:
-        q (torch.Tensor): Query tensor with shape [b, s, a, d], where a is the number of heads.
-        k (torch.Tensor): Key tensor with shape [b, s1, a, d]
-        v (torch.Tensor): Value tensor with shape [b, s1, a, d]
-        mode (str): Attention mode. Choose from 'self_flash', 'cross_flash', 'torch', and 'vanilla'.
-        drop_rate (float): Dropout rate in attention map. (default: 0)
-        attn_mask (torch.Tensor): Attention mask with shape [b, s1] (cross_attn), or [b, a, s, s1] (torch or vanilla).
-            (default: None)
-        causal (bool): Whether to use causal attention. (default: False)
-        cu_seqlens_q (torch.Tensor): dtype torch.int32. The cumulative sequence lengths of the sequences in the batch,
-            used to index into q.
-        cu_seqlens_kv (torch.Tensor): dtype torch.int32. The cumulative sequence lengths of the sequences in the batch,
-            used to index into kv.
-        max_seqlen_q (int): The maximum sequence length in the batch of q.
-        max_seqlen_kv (int): The maximum sequence length in the batch of k and v.
-
-    Returns:
-        torch.Tensor: Output tensor after self attention with shape [b, s, ad]
-    """
-    pre_attn_layout, post_attn_layout = MEMORY_LAYOUT[mode]
-    q = pre_attn_layout(q)
-    k = pre_attn_layout(k)
-    v = pre_attn_layout(v)
-
-    if mode == "torch":
-        if attn_mask is not None and attn_mask.dtype != torch.bool:
-            attn_mask = attn_mask.to(q.dtype)
-        x = F.scaled_dot_product_attention(
-            q, k, v, attn_mask=attn_mask, dropout_p=drop_rate, is_causal=causal
-        )
-    elif mode == "flash":
-        x = flash_attn_varlen_func(
-            q,
-            k,
-            v,
-            cu_seqlens_q,
-            cu_seqlens_kv,
-            max_seqlen_q,
-            max_seqlen_kv,
-        )
-        # x with shape [(bxs), a, d]
-        x = x.view(
-            batch_size, max_seqlen_q, x.shape[-2], x.shape[-1]
-        )  # reshape x to [b, s, a, d]
-    elif mode == "vanilla":
-        scale_factor = 1 / math.sqrt(q.size(-1))
-
-        b, a, s, _ = q.shape
-        s1 = k.size(2)
-        attn_bias = torch.zeros(b, a, s, s1, dtype=q.dtype, device=q.device)
-        if causal:
-            # Only applied to self attention
-            assert (
-                attn_mask is None
-            ), "Causal mask and attn_mask cannot be used together"
-            temp_mask = torch.ones(b, a, s, s, dtype=torch.bool, device=q.device).tril(
-                diagonal=0
-            )
-            attn_bias.masked_fill_(temp_mask.logical_not(), float("-inf"))
-            attn_bias.to(q.dtype)
-
-        if attn_mask is not None:
-            if attn_mask.dtype == torch.bool:
-                attn_bias.masked_fill_(attn_mask.logical_not(), float("-inf"))
-            else:
-                attn_bias += attn_mask
-
-        # TODO: Maybe force q and k to be float32 to avoid numerical overflow
-        attn = (q @ k.transpose(-2, -1)) * scale_factor
-        attn += attn_bias
-        attn = attn.softmax(dim=-1)
-        attn = torch.dropout(attn, p=drop_rate, train=True)
-        x = attn @ v
-    else:
-        raise NotImplementedError(f"Unsupported attention mode: {mode}")
-
-    x = post_attn_layout(x)
-    b, s, a, d = x.shape
-    out = x.reshape(b, s, -1)
-    return out
-
-
-def parallel_attention(
-    hybrid_seq_parallel_attn,
-    q,
-    k,
-    v,
-    img_q_len,
-    img_kv_len,
-    cu_seqlens_q,
-    cu_seqlens_kv
-):
-    attn1 = hybrid_seq_parallel_attn(
-        None,
-        q[:, :img_q_len, :, :],
-        k[:, :img_kv_len, :, :],
-        v[:, :img_kv_len, :, :],
-        dropout_p=0.0,
-        causal=False,
-        joint_tensor_query=q[:,img_q_len:cu_seqlens_q[1]],
-        joint_tensor_key=k[:,img_kv_len:cu_seqlens_kv[1]],
-        joint_tensor_value=v[:,img_kv_len:cu_seqlens_kv[1]],
-        joint_strategy="rear",
-    )
-    if flash_attn.__version__ >= '2.7.0':
-        attn2, *_ = _flash_attn_forward(
-            q[:,cu_seqlens_q[1]:],
-            k[:,cu_seqlens_kv[1]:],
-            v[:,cu_seqlens_kv[1]:],
-            dropout_p=0.0,
-            softmax_scale=q.shape[-1] ** (-0.5),
-            causal=False,
-            window_size_left=-1,
-            window_size_right=-1,
-            softcap=0.0,
-            alibi_slopes=None,
-            return_softmax=False,
-        )
-    else:
-        attn2, *_ = _flash_attn_forward(
-            q[:,cu_seqlens_q[1]:],
-            k[:,cu_seqlens_kv[1]:],
-            v[:,cu_seqlens_kv[1]:],
-            dropout_p=0.0,
-            softmax_scale=q.shape[-1] ** (-0.5),
-            causal=False,
-            window_size=(-1, -1),
-            softcap=0.0,
-            alibi_slopes=None,
-            return_softmax=False,
-        )
-    attn = torch.cat([attn1, attn2], dim=1)
-    b, s, a, d = attn.shape
-    attn = attn.reshape(b, s, -1)
-
-    return attn
diff --git a/videotuna/models/hunyuan/hyvideo_i2v/modules/embed_layers.py b/videotuna/models/hunyuan/hyvideo_i2v/modules/embed_layers.py
deleted file mode 100644
index 3d65ed1a..00000000
--- a/videotuna/models/hunyuan/hyvideo_i2v/modules/embed_layers.py
+++ /dev/null
@@ -1,157 +0,0 @@
-import math
-import torch
-import torch.nn as nn
-from einops import rearrange, repeat
-
-from ..utils.helpers import to_2tuple
-
-
-class PatchEmbed(nn.Module):
-    """2D Image to Patch Embedding
-
-    Image to Patch Embedding using Conv2d
-
-    A convolution based approach to patchifying a 2D image w/ embedding projection.
-
-    Based on the impl in https://github.com/google-research/vision_transformer
-
-    Hacked together by / Copyright 2020 Ross Wightman
-
-    Remove the _assert function in forward function to be compatible with multi-resolution images.
-    """
-
-    def __init__(
-        self,
-        patch_size=16,
-        in_chans=3,
-        embed_dim=768,
-        norm_layer=None,
-        flatten=True,
-        bias=True,
-        dtype=None,
-        device=None,
-    ):
-        factory_kwargs = {"dtype": dtype, "device": device}
-        super().__init__()
-        patch_size = to_2tuple(patch_size)
-        self.patch_size = patch_size
-        self.flatten = flatten
-
-        self.proj = nn.Conv3d(
-            in_chans,
-            embed_dim,
-            kernel_size=patch_size,
-            stride=patch_size,
-            bias=bias,
-            **factory_kwargs
-        )
-        nn.init.xavier_uniform_(self.proj.weight.view(self.proj.weight.size(0), -1))
-        if bias:
-            nn.init.zeros_(self.proj.bias)
-
-        self.norm = norm_layer(embed_dim) if norm_layer else nn.Identity()
-
-    def forward(self, x):
-        x = self.proj(x)
-        if self.flatten:
-            x = x.flatten(2).transpose(1, 2)  # BCHW -> BNC
-        x = self.norm(x)
-        return x
-
-
-class TextProjection(nn.Module):
-    """
-    Projects text embeddings. Also handles dropout for classifier-free guidance.
-
-    Adapted from https://github.com/PixArt-alpha/PixArt-alpha/blob/master/diffusion/model/nets/PixArt_blocks.py
-    """
-
-    def __init__(self, in_channels, hidden_size, act_layer, dtype=None, device=None):
-        factory_kwargs = {"dtype": dtype, "device": device}
-        super().__init__()
-        self.linear_1 = nn.Linear(
-            in_features=in_channels,
-            out_features=hidden_size,
-            bias=True,
-            **factory_kwargs
-        )
-        self.act_1 = act_layer()
-        self.linear_2 = nn.Linear(
-            in_features=hidden_size,
-            out_features=hidden_size,
-            bias=True,
-            **factory_kwargs
-        )
-
-    def forward(self, caption):
-        hidden_states = self.linear_1(caption)
-        hidden_states = self.act_1(hidden_states)
-        hidden_states = self.linear_2(hidden_states)
-        return hidden_states
-
-
-def timestep_embedding(t, dim, max_period=10000):
-    """
-    Create sinusoidal timestep embeddings.
-
-    Args:
-        t (torch.Tensor): a 1-D Tensor of N indices, one per batch element. These may be fractional.
-        dim (int): the dimension of the output.
-        max_period (int): controls the minimum frequency of the embeddings.
-
-    Returns:
-        embedding (torch.Tensor): An (N, D) Tensor of positional embeddings.
-
-    .. ref_link: https://github.com/openai/glide-text2im/blob/main/glide_text2im/nn.py
-    """
-    half = dim // 2
-    freqs = torch.exp(
-        -math.log(max_period)
-        * torch.arange(start=0, end=half, dtype=torch.float32)
-        / half
-    ).to(device=t.device)
-    args = t[:, None].float() * freqs[None]
-    embedding = torch.cat([torch.cos(args), torch.sin(args)], dim=-1)
-    if dim % 2:
-        embedding = torch.cat([embedding, torch.zeros_like(embedding[:, :1])], dim=-1)
-    return embedding
-
-
-class TimestepEmbedder(nn.Module):
-    """
-    Embeds scalar timesteps into vector representations.
-    """
-
-    def __init__(
-        self,
-        hidden_size,
-        act_layer,
-        frequency_embedding_size=256,
-        max_period=10000,
-        out_size=None,
-        dtype=None,
-        device=None,
-    ):
-        factory_kwargs = {"dtype": dtype, "device": device}
-        super().__init__()
-        self.frequency_embedding_size = frequency_embedding_size
-        self.max_period = max_period
-        if out_size is None:
-            out_size = hidden_size
-
-        self.mlp = nn.Sequential(
-            nn.Linear(
-                frequency_embedding_size, hidden_size, bias=True, **factory_kwargs
-            ),
-            act_layer(),
-            nn.Linear(hidden_size, out_size, bias=True, **factory_kwargs),
-        )
-        nn.init.normal_(self.mlp[0].weight, std=0.02)
-        nn.init.normal_(self.mlp[2].weight, std=0.02)
-
-    def forward(self, t):
-        t_freq = timestep_embedding(
-            t, self.frequency_embedding_size, self.max_period
-        ).type(self.mlp[0].weight.dtype)
-        t_emb = self.mlp(t_freq)
-        return t_emb
diff --git a/videotuna/models/hunyuan/hyvideo_i2v/modules/fp8_optimization.py b/videotuna/models/hunyuan/hyvideo_i2v/modules/fp8_optimization.py
deleted file mode 100644
index b95c1f49..00000000
--- a/videotuna/models/hunyuan/hyvideo_i2v/modules/fp8_optimization.py
+++ /dev/null
@@ -1,102 +0,0 @@
-import os
-
-import torch
-import torch.nn as nn
-from torch.nn import functional as F
-
-def get_fp_maxval(bits=8, mantissa_bit=3, sign_bits=1):
-    _bits = torch.tensor(bits)
-    _mantissa_bit = torch.tensor(mantissa_bit)
-    _sign_bits = torch.tensor(sign_bits)
-    M = torch.clamp(torch.round(_mantissa_bit), 1, _bits - _sign_bits)
-    E = _bits - _sign_bits - M
-    bias = 2 ** (E - 1) - 1
-    mantissa = 1
-    for i in range(mantissa_bit - 1):
-        mantissa += 1 / (2 ** (i+1))
-    maxval = mantissa * 2 ** (2**E - 1 - bias)
-    return maxval
-
-def quantize_to_fp8(x, bits=8, mantissa_bit=3, sign_bits=1):
-    """
-    Default is E4M3.
-    """
-    bits = torch.tensor(bits)
-    mantissa_bit = torch.tensor(mantissa_bit)
-    sign_bits = torch.tensor(sign_bits)
-    M = torch.clamp(torch.round(mantissa_bit), 1, bits - sign_bits)
-    E = bits - sign_bits - M
-    bias = 2 ** (E - 1) - 1
-    mantissa = 1
-    for i in range(mantissa_bit - 1):
-        mantissa += 1 / (2 ** (i+1))
-    maxval = mantissa * 2 ** (2**E - 1 - bias)
-    minval = - maxval
-    minval = - maxval if sign_bits == 1 else torch.zeros_like(maxval)
-    input_clamp = torch.min(torch.max(x, minval), maxval)
-    log_scales = torch.clamp((torch.floor(torch.log2(torch.abs(input_clamp)) + bias)).detach(), 1.0)
-    log_scales = 2.0 ** (log_scales - M - bias.type(x.dtype))
-    # dequant
-    qdq_out = torch.round(input_clamp / log_scales) * log_scales
-    return qdq_out, log_scales
-
-def fp8_tensor_quant(x, scale, bits=8, mantissa_bit=3, sign_bits=1):
-    for i in range(len(x.shape) - 1):
-        scale = scale.unsqueeze(-1)
-    new_x = x / scale
-    quant_dequant_x, log_scales = quantize_to_fp8(new_x, bits=bits, mantissa_bit=mantissa_bit, sign_bits=sign_bits)
-    return quant_dequant_x, scale, log_scales
-
-def fp8_activation_dequant(qdq_out, scale, dtype):
-    qdq_out = qdq_out.type(dtype)
-    quant_dequant_x = qdq_out * scale.to(dtype)
-    return quant_dequant_x
-
-def fp8_linear_forward(cls, original_dtype, input):
-    weight_dtype = cls.weight.dtype
-    #####
-    if cls.weight.dtype != torch.float8_e4m3fn:
-        maxval = get_fp_maxval()
-        scale = torch.max(torch.abs(cls.weight.flatten())) / maxval
-        linear_weight, scale, log_scales = fp8_tensor_quant(cls.weight, scale)
-        linear_weight = linear_weight.to(torch.float8_e4m3fn)
-        weight_dtype = linear_weight.dtype
-    else:
-        scale = cls.fp8_scale.to(cls.weight.device)
-        linear_weight = cls.weight
-    #####
-
-    if weight_dtype == torch.float8_e4m3fn and cls.weight.sum() != 0:
-        if True or len(input.shape) == 3:
-            cls_dequant = fp8_activation_dequant(linear_weight, scale, original_dtype)
-            if cls.bias != None:
-                output = F.linear(input, cls_dequant, cls.bias)
-            else:
-                output = F.linear(input, cls_dequant)
-            return output
-        else:
-            return cls.original_forward(input.to(original_dtype))
-    else:
-        return cls.original_forward(input)
-
-def convert_fp8_linear(module, dit_weight_path, original_dtype, params_to_keep={}):
-    setattr(module, "fp8_matmul_enabled", True)
-
-    # loading fp8 mapping file
-    fp8_map_path = dit_weight_path.replace('.pt', '_map.pt')
-    if os.path.exists(fp8_map_path):
-        fp8_map = torch.load(fp8_map_path, map_location=lambda storage, loc: storage)
-    else:
-        raise ValueError(f"Invalid fp8_map path: {fp8_map_path}.")
-
-    fp8_layers = []
-    for key, layer in module.named_modules():
-        if isinstance(layer, nn.Linear) and ('double_blocks' in key or 'single_blocks' in key):
-            fp8_layers.append(key)
-            original_forward = layer.forward
-            layer.weight = torch.nn.Parameter(layer.weight.to(torch.float8_e4m3fn))
-            setattr(layer, "fp8_scale", fp8_map[key].to(dtype=original_dtype))
-            setattr(layer, "original_forward", original_forward)
-            setattr(layer, "forward", lambda input, m=layer: fp8_linear_forward(m, original_dtype, input))
-    
-
diff --git a/videotuna/models/hunyuan/hyvideo_i2v/modules/mlp_layers.py b/videotuna/models/hunyuan/hyvideo_i2v/modules/mlp_layers.py
deleted file mode 100644
index 24dd2d9b..00000000
--- a/videotuna/models/hunyuan/hyvideo_i2v/modules/mlp_layers.py
+++ /dev/null
@@ -1,118 +0,0 @@
-# Modified from timm library:
-# https://github.com/huggingface/pytorch-image-models/blob/648aaa41233ba83eb38faf5ba9d415d574823241/timm/layers/mlp.py#L13
-
-from functools import partial
-
-import torch
-import torch.nn as nn
-
-from .modulate_layers import modulate
-from ..utils.helpers import to_2tuple
-
-
-class MLP(nn.Module):
-    """MLP as used in Vision Transformer, MLP-Mixer and related networks"""
-
-    def __init__(
-        self,
-        in_channels,
-        hidden_channels=None,
-        out_features=None,
-        act_layer=nn.GELU,
-        norm_layer=None,
-        bias=True,
-        drop=0.0,
-        use_conv=False,
-        device=None,
-        dtype=None,
-    ):
-        factory_kwargs = {"device": device, "dtype": dtype}
-        super().__init__()
-        out_features = out_features or in_channels
-        hidden_channels = hidden_channels or in_channels
-        bias = to_2tuple(bias)
-        drop_probs = to_2tuple(drop)
-        linear_layer = partial(nn.Conv2d, kernel_size=1) if use_conv else nn.Linear
-
-        self.fc1 = linear_layer(
-            in_channels, hidden_channels, bias=bias[0], **factory_kwargs
-        )
-        self.act = act_layer()
-        self.drop1 = nn.Dropout(drop_probs[0])
-        self.norm = (
-            norm_layer(hidden_channels, **factory_kwargs)
-            if norm_layer is not None
-            else nn.Identity()
-        )
-        self.fc2 = linear_layer(
-            hidden_channels, out_features, bias=bias[1], **factory_kwargs
-        )
-        self.drop2 = nn.Dropout(drop_probs[1])
-
-    def forward(self, x):
-        x = self.fc1(x)
-        x = self.act(x)
-        x = self.drop1(x)
-        x = self.norm(x)
-        x = self.fc2(x)
-        x = self.drop2(x)
-        return x
-
-
-# 
-class MLPEmbedder(nn.Module):
-    """copied from https://github.com/black-forest-labs/flux/blob/main/src/flux/modules/layers.py"""
-    def __init__(self, in_dim: int, hidden_dim: int, device=None, dtype=None):
-        factory_kwargs = {"device": device, "dtype": dtype}
-        super().__init__()
-        self.in_layer = nn.Linear(in_dim, hidden_dim, bias=True, **factory_kwargs)
-        self.silu = nn.SiLU()
-        self.out_layer = nn.Linear(hidden_dim, hidden_dim, bias=True, **factory_kwargs)
-
-    def forward(self, x: torch.Tensor) -> torch.Tensor:
-        return self.out_layer(self.silu(self.in_layer(x)))
-
-
-class FinalLayer(nn.Module):
-    """The final layer of DiT."""
-
-    def __init__(
-        self, hidden_size, patch_size, out_channels, act_layer, device=None, dtype=None
-    ):
-        factory_kwargs = {"device": device, "dtype": dtype}
-        super().__init__()
-
-        # Just use LayerNorm for the final layer
-        self.norm_final = nn.LayerNorm(
-            hidden_size, elementwise_affine=False, eps=1e-6, **factory_kwargs
-        )
-        if isinstance(patch_size, int):
-            self.linear = nn.Linear(
-                hidden_size,
-                patch_size * patch_size * out_channels,
-                bias=True,
-                **factory_kwargs
-            )
-        else:
-            self.linear = nn.Linear(
-                hidden_size,
-                patch_size[0] * patch_size[1] * patch_size[2] * out_channels,
-                bias=True,
-            )
-        nn.init.zeros_(self.linear.weight)
-        nn.init.zeros_(self.linear.bias)
-
-        # Here we don't distinguish between the modulate types. Just use the simple one.
-        self.adaLN_modulation = nn.Sequential(
-            act_layer(),
-            nn.Linear(hidden_size, 2 * hidden_size, bias=True, **factory_kwargs),
-        )
-        # Zero-initialize the modulation
-        nn.init.zeros_(self.adaLN_modulation[1].weight)
-        nn.init.zeros_(self.adaLN_modulation[1].bias)
-
-    def forward(self, x, c):
-        shift, scale = self.adaLN_modulation(c).chunk(2, dim=1)
-        x = modulate(self.norm_final(x), shift=shift, scale=scale)
-        x = self.linear(x)
-        return x
diff --git a/videotuna/models/hunyuan/hyvideo_i2v/modules/models.py b/videotuna/models/hunyuan/hyvideo_i2v/modules/models.py
deleted file mode 100644
index 7f1f9709..00000000
--- a/videotuna/models/hunyuan/hyvideo_i2v/modules/models.py
+++ /dev/null
@@ -1,994 +0,0 @@
-from typing import Any, List, Tuple, Optional, Union, Dict
-from einops import rearrange
-
-import torch
-import torch.nn as nn
-import torch.nn.functional as F
-from pathlib import Path
-from loguru import logger
-
-from diffusers.models import ModelMixin
-from diffusers.configuration_utils import ConfigMixin, register_to_config
-import torch.utils
-import torch.utils.checkpoint
-
-from .activation_layers import get_activation_layer
-from .norm_layers import get_norm_layer
-from .embed_layers import TimestepEmbedder, PatchEmbed, TextProjection
-from .attenion import attention, parallel_attention, get_cu_seqlens
-from .posemb_layers import apply_rotary_emb
-from .mlp_layers import MLP, MLPEmbedder, FinalLayer
-from .modulate_layers import ModulateDiT, modulate, apply_gate, ckpt_wrapper
-from .token_refiner import SingleTokenRefiner
-from ..constants import PROMPT_TEMPLATE, NEGATIVE_PROMPT, PRECISION_TO_TYPE, NEGATIVE_PROMPT_I2V
-
-class MMDoubleStreamBlock(nn.Module):
-    """
-    A multimodal dit block with seperate modulation for
-    text and image/video, see more details (SD3): https://arxiv.org/abs/2403.03206
-                                     (Flux.1): https://github.com/black-forest-labs/flux
-    """
-
-    def __init__(
-        self,
-        hidden_size: int,
-        heads_num: int,
-        mlp_width_ratio: float,
-        mlp_act_type: str = "gelu_tanh",
-        qk_norm: bool = True,
-        qk_norm_type: str = "rms",
-        qkv_bias: bool = False,
-        dtype: Optional[torch.dtype] = None,
-        device: Optional[torch.device] = None,
-    ):
-        factory_kwargs = {"device": device, "dtype": dtype}
-        super().__init__()
-
-        self.deterministic = False
-        self.heads_num = heads_num
-        head_dim = hidden_size // heads_num
-        mlp_hidden_dim = int(hidden_size * mlp_width_ratio)
-
-        self.img_mod = ModulateDiT(
-            hidden_size,
-            factor=6,
-            act_layer=get_activation_layer("silu"),
-            **factory_kwargs,
-        )
-        self.img_norm1 = nn.LayerNorm(
-            hidden_size, elementwise_affine=False, eps=1e-6, **factory_kwargs
-        )
-
-        self.img_attn_qkv = nn.Linear(
-            hidden_size, hidden_size * 3, bias=qkv_bias, **factory_kwargs
-        )
-        qk_norm_layer = get_norm_layer(qk_norm_type)
-        self.img_attn_q_norm = (
-            qk_norm_layer(head_dim, elementwise_affine=True, eps=1e-6, **factory_kwargs)
-            if qk_norm
-            else nn.Identity()
-        )
-        self.img_attn_k_norm = (
-            qk_norm_layer(head_dim, elementwise_affine=True, eps=1e-6, **factory_kwargs)
-            if qk_norm
-            else nn.Identity()
-        )
-        self.img_attn_proj = nn.Linear(
-            hidden_size, hidden_size, bias=qkv_bias, **factory_kwargs
-        )
-
-        self.img_norm2 = nn.LayerNorm(
-            hidden_size, elementwise_affine=False, eps=1e-6, **factory_kwargs
-        )
-        self.img_mlp = MLP(
-            hidden_size,
-            mlp_hidden_dim,
-            act_layer=get_activation_layer(mlp_act_type),
-            bias=True,
-            **factory_kwargs,
-        )
-
-        self.txt_mod = ModulateDiT(
-            hidden_size,
-            factor=6,
-            act_layer=get_activation_layer("silu"),
-            **factory_kwargs,
-        )
-        self.txt_norm1 = nn.LayerNorm(
-            hidden_size, elementwise_affine=False, eps=1e-6, **factory_kwargs
-        )
-
-        self.txt_attn_qkv = nn.Linear(
-            hidden_size, hidden_size * 3, bias=qkv_bias, **factory_kwargs
-        )
-        self.txt_attn_q_norm = (
-            qk_norm_layer(head_dim, elementwise_affine=True, eps=1e-6, **factory_kwargs)
-            if qk_norm
-            else nn.Identity()
-        )
-        self.txt_attn_k_norm = (
-            qk_norm_layer(head_dim, elementwise_affine=True, eps=1e-6, **factory_kwargs)
-            if qk_norm
-            else nn.Identity()
-        )
-        self.txt_attn_proj = nn.Linear(
-            hidden_size, hidden_size, bias=qkv_bias, **factory_kwargs
-        )
-
-        self.txt_norm2 = nn.LayerNorm(
-            hidden_size, elementwise_affine=False, eps=1e-6, **factory_kwargs
-        )
-        self.txt_mlp = MLP(
-            hidden_size,
-            mlp_hidden_dim,
-            act_layer=get_activation_layer(mlp_act_type),
-            bias=True,
-            **factory_kwargs,
-        )
-        self.hybrid_seq_parallel_attn = None
-
-    def enable_deterministic(self):
-        self.deterministic = True
-
-    def disable_deterministic(self):
-        self.deterministic = False
-
-    def forward(
-        self,
-        img: torch.Tensor,
-        txt: torch.Tensor,
-        vec: torch.Tensor,
-        cu_seqlens_q: Optional[torch.Tensor] = None,
-        cu_seqlens_kv: Optional[torch.Tensor] = None,
-        max_seqlen_q: Optional[int] = None,
-        max_seqlen_kv: Optional[int] = None,
-        freqs_cis: tuple = None,
-        condition_type: str = None,
-        token_replace_vec: torch.Tensor = None,
-        frist_frame_token_num: int = None,
-    ) -> Tuple[torch.Tensor, torch.Tensor]:
-        if condition_type == "token_replace":
-            img_mod1, token_replace_img_mod1 = self.img_mod(vec, condition_type=condition_type, \
-                                                            token_replace_vec=token_replace_vec)
-            (img_mod1_shift,
-             img_mod1_scale,
-             img_mod1_gate,
-             img_mod2_shift,
-             img_mod2_scale,
-             img_mod2_gate) = img_mod1.chunk(6, dim=-1)
-            (tr_img_mod1_shift,
-             tr_img_mod1_scale,
-             tr_img_mod1_gate,
-             tr_img_mod2_shift,
-             tr_img_mod2_scale,
-             tr_img_mod2_gate) = token_replace_img_mod1.chunk(6, dim=-1)
-        else:
-            (
-                img_mod1_shift,
-                img_mod1_scale,
-                img_mod1_gate,
-                img_mod2_shift,
-                img_mod2_scale,
-                img_mod2_gate,
-            ) = self.img_mod(vec).chunk(6, dim=-1)
-
-        (
-            txt_mod1_shift,
-            txt_mod1_scale,
-            txt_mod1_gate,
-            txt_mod2_shift,
-            txt_mod2_scale,
-            txt_mod2_gate,
-        ) = self.txt_mod(vec).chunk(6, dim=-1)
-
-        # Prepare image for attention.
-        img_modulated = self.img_norm1(img)
-        if condition_type == "token_replace":
-            img_modulated = modulate(
-                img_modulated, shift=img_mod1_shift, scale=img_mod1_scale, condition_type=condition_type,
-                tr_shift=tr_img_mod1_shift, tr_scale=tr_img_mod1_scale,
-                frist_frame_token_num=frist_frame_token_num
-            )
-        else:
-            img_modulated = modulate(
-                img_modulated, shift=img_mod1_shift, scale=img_mod1_scale
-            )
-        img_qkv = self.img_attn_qkv(img_modulated)
-        img_q, img_k, img_v = rearrange(
-            img_qkv, "B L (K H D) -> K B L H D", K=3, H=self.heads_num
-        )
-        # Apply QK-Norm if needed
-        img_q = self.img_attn_q_norm(img_q).to(img_v)
-        img_k = self.img_attn_k_norm(img_k).to(img_v)
-
-        # Apply RoPE if needed.
-        if freqs_cis is not None:
-            img_qq, img_kk = apply_rotary_emb(img_q, img_k, freqs_cis, head_first=False)
-            assert (
-                img_qq.shape == img_q.shape and img_kk.shape == img_k.shape
-            ), f"img_kk: {img_qq.shape}, img_q: {img_q.shape}, img_kk: {img_kk.shape}, img_k: {img_k.shape}"
-            img_q, img_k = img_qq, img_kk
-
-        # Prepare txt for attention.
-        txt_modulated = self.txt_norm1(txt)
-        txt_modulated = modulate(
-            txt_modulated, shift=txt_mod1_shift, scale=txt_mod1_scale
-        )
-        txt_qkv = self.txt_attn_qkv(txt_modulated)
-        txt_q, txt_k, txt_v = rearrange(
-            txt_qkv, "B L (K H D) -> K B L H D", K=3, H=self.heads_num
-        )
-        # Apply QK-Norm if needed.
-        txt_q = self.txt_attn_q_norm(txt_q).to(txt_v)
-        txt_k = self.txt_attn_k_norm(txt_k).to(txt_v)
-
-        # Run actual attention.
-        q = torch.cat((img_q, txt_q), dim=1)
-        k = torch.cat((img_k, txt_k), dim=1)
-        v = torch.cat((img_v, txt_v), dim=1)
-        assert (
-            cu_seqlens_q.shape[0] == 2 * img.shape[0] + 1
-        ), f"cu_seqlens_q.shape:{cu_seqlens_q.shape}, img.shape[0]:{img.shape[0]}"
-        
-        # attention computation start
-        if not self.hybrid_seq_parallel_attn:
-            attn = attention(
-                q,
-                k,
-                v,
-                cu_seqlens_q=cu_seqlens_q,
-                cu_seqlens_kv=cu_seqlens_kv,
-                max_seqlen_q=max_seqlen_q,
-                max_seqlen_kv=max_seqlen_kv,
-                batch_size=img_k.shape[0],
-            )
-        else:
-            attn = parallel_attention(
-                self.hybrid_seq_parallel_attn,
-                q,
-                k,
-                v,
-                img_q_len=img_q.shape[1],
-                img_kv_len=img_k.shape[1],
-                cu_seqlens_q=cu_seqlens_q,
-                cu_seqlens_kv=cu_seqlens_kv
-            )
-            
-        # attention computation end
-
-        img_attn, txt_attn = attn[:, : img.shape[1]], attn[:, img.shape[1] :]
-
-        # Calculate the img bloks.
-        if condition_type == "token_replace":
-            img = img + apply_gate(self.img_attn_proj(img_attn), gate=img_mod1_gate, condition_type=condition_type,
-                                   tr_gate=tr_img_mod1_gate, frist_frame_token_num=frist_frame_token_num)
-            img = img + apply_gate(
-                self.img_mlp(
-                    modulate(
-                        self.img_norm2(img), shift=img_mod2_shift, scale=img_mod2_scale, condition_type=condition_type,
-                        tr_shift=tr_img_mod2_shift, tr_scale=tr_img_mod2_scale, frist_frame_token_num=frist_frame_token_num
-                    )
-                ),
-                gate=img_mod2_gate, condition_type=condition_type,
-                tr_gate=tr_img_mod2_gate, frist_frame_token_num=frist_frame_token_num
-            )
-        else:
-            img = img + apply_gate(self.img_attn_proj(img_attn), gate=img_mod1_gate)
-            img = img + apply_gate(
-                self.img_mlp(
-                    modulate(
-                        self.img_norm2(img), shift=img_mod2_shift, scale=img_mod2_scale
-                    )
-                ),
-                gate=img_mod2_gate,
-            )
-
-        # Calculate the txt bloks.
-        txt = txt + apply_gate(self.txt_attn_proj(txt_attn), gate=txt_mod1_gate)
-        txt = txt + apply_gate(
-            self.txt_mlp(
-                modulate(
-                    self.txt_norm2(txt), shift=txt_mod2_shift, scale=txt_mod2_scale
-                )
-            ),
-            gate=txt_mod2_gate,
-        )
-
-        return img, txt
-
-
-class MMSingleStreamBlock(nn.Module):
-    """
-    A DiT block with parallel linear layers as described in
-    https://arxiv.org/abs/2302.05442 and adapted modulation interface.
-    Also refer to (SD3): https://arxiv.org/abs/2403.03206
-                  (Flux.1): https://github.com/black-forest-labs/flux
-    """
-
-    def __init__(
-        self,
-        hidden_size: int,
-        heads_num: int,
-        mlp_width_ratio: float = 4.0,
-        mlp_act_type: str = "gelu_tanh",
-        qk_norm: bool = True,
-        qk_norm_type: str = "rms",
-        qk_scale: float = None,
-        dtype: Optional[torch.dtype] = None,
-        device: Optional[torch.device] = None,
-    ):
-        factory_kwargs = {"device": device, "dtype": dtype}
-        super().__init__()
-
-        self.deterministic = False
-        self.hidden_size = hidden_size
-        self.heads_num = heads_num
-        head_dim = hidden_size // heads_num
-        mlp_hidden_dim = int(hidden_size * mlp_width_ratio)
-        self.mlp_hidden_dim = mlp_hidden_dim
-        self.scale = qk_scale or head_dim ** -0.5
-
-        # qkv and mlp_in
-        self.linear1 = nn.Linear(
-            hidden_size, hidden_size * 3 + mlp_hidden_dim, **factory_kwargs
-        )
-        # proj and mlp_out
-        self.linear2 = nn.Linear(
-            hidden_size + mlp_hidden_dim, hidden_size, **factory_kwargs
-        )
-
-        qk_norm_layer = get_norm_layer(qk_norm_type)
-        self.q_norm = (
-            qk_norm_layer(head_dim, elementwise_affine=True, eps=1e-6, **factory_kwargs)
-            if qk_norm
-            else nn.Identity()
-        )
-        self.k_norm = (
-            qk_norm_layer(head_dim, elementwise_affine=True, eps=1e-6, **factory_kwargs)
-            if qk_norm
-            else nn.Identity()
-        )
-
-        self.pre_norm = nn.LayerNorm(
-            hidden_size, elementwise_affine=False, eps=1e-6, **factory_kwargs
-        )
-
-        self.mlp_act = get_activation_layer(mlp_act_type)()
-        self.modulation = ModulateDiT(
-            hidden_size,
-            factor=3,
-            act_layer=get_activation_layer("silu"),
-            **factory_kwargs,
-        )
-        self.hybrid_seq_parallel_attn = None
-
-    def enable_deterministic(self):
-        self.deterministic = True
-
-    def disable_deterministic(self):
-        self.deterministic = False
-
-    def forward(
-        self,
-        x: torch.Tensor,
-        vec: torch.Tensor,
-        txt_len: int,
-        cu_seqlens_q: Optional[torch.Tensor] = None,
-        cu_seqlens_kv: Optional[torch.Tensor] = None,
-        max_seqlen_q: Optional[int] = None,
-        max_seqlen_kv: Optional[int] = None,
-        freqs_cis: Tuple[torch.Tensor, torch.Tensor] = None,
-        condition_type: str = None,
-        token_replace_vec: torch.Tensor = None,
-        frist_frame_token_num: int = None,
-    ) -> torch.Tensor:
-        if condition_type == "token_replace":
-            mod, tr_mod = self.modulation(vec,
-                                          condition_type=condition_type,
-                                          token_replace_vec=token_replace_vec)
-            (mod_shift,
-             mod_scale,
-             mod_gate) = mod.chunk(3, dim=-1)
-            (tr_mod_shift,
-             tr_mod_scale,
-             tr_mod_gate) = tr_mod.chunk(3, dim=-1)
-        else:
-            mod_shift, mod_scale, mod_gate = self.modulation(vec).chunk(3, dim=-1)
-        if condition_type == "token_replace":
-            x_mod = modulate(self.pre_norm(x), shift=mod_shift, scale=mod_scale, condition_type=condition_type,
-                             tr_shift=tr_mod_shift, tr_scale=tr_mod_scale, frist_frame_token_num=frist_frame_token_num)
-        else:
-            x_mod = modulate(self.pre_norm(x), shift=mod_shift, scale=mod_scale)
-        qkv, mlp = torch.split(
-            self.linear1(x_mod), [3 * self.hidden_size, self.mlp_hidden_dim], dim=-1
-        )
-
-        q, k, v = rearrange(qkv, "B L (K H D) -> K B L H D", K=3, H=self.heads_num)
-
-        # Apply QK-Norm if needed.
-        q = self.q_norm(q).to(v)
-        k = self.k_norm(k).to(v)
-
-        # Apply RoPE if needed.
-        if freqs_cis is not None:
-            img_q, txt_q = q[:, :-txt_len, :, :], q[:, -txt_len:, :, :]
-            img_k, txt_k = k[:, :-txt_len, :, :], k[:, -txt_len:, :, :]
-            img_qq, img_kk = apply_rotary_emb(img_q, img_k, freqs_cis, head_first=False)
-            assert (
-                img_qq.shape == img_q.shape and img_kk.shape == img_k.shape
-            ), f"img_kk: {img_qq.shape}, img_q: {img_q.shape}, img_kk: {img_kk.shape}, img_k: {img_k.shape}"
-            img_q, img_k = img_qq, img_kk
-            q = torch.cat((img_q, txt_q), dim=1)
-            k = torch.cat((img_k, txt_k), dim=1)
-
-        # Compute attention.
-        assert (
-            cu_seqlens_q.shape[0] == 2 * x.shape[0] + 1
-        ), f"cu_seqlens_q.shape:{cu_seqlens_q.shape}, x.shape[0]:{x.shape[0]}"
-        
-        # attention computation start
-        if not self.hybrid_seq_parallel_attn:
-            attn = attention(
-                q,
-                k,
-                v,
-                cu_seqlens_q=cu_seqlens_q,
-                cu_seqlens_kv=cu_seqlens_kv,
-                max_seqlen_q=max_seqlen_q,
-                max_seqlen_kv=max_seqlen_kv,
-                batch_size=x.shape[0],
-            )
-        else:
-            attn = parallel_attention(
-                self.hybrid_seq_parallel_attn,
-                q,
-                k,
-                v,
-                img_q_len=img_q.shape[1],
-                img_kv_len=img_k.shape[1],
-                cu_seqlens_q=cu_seqlens_q,
-                cu_seqlens_kv=cu_seqlens_kv
-            )
-        # attention computation end
-
-        # Compute activation in mlp stream, cat again and run second linear layer.
-        output = self.linear2(torch.cat((attn, self.mlp_act(mlp)), 2))
-
-        if condition_type == "token_replace":
-            output = x + apply_gate(output, gate=mod_gate, condition_type=condition_type,
-                                    tr_gate=tr_mod_gate, frist_frame_token_num=frist_frame_token_num)
-            return output
-        else:
-            return x + apply_gate(output, gate=mod_gate)
-
-class HYVideoDiffusionTransformer(ModelMixin, ConfigMixin):
-    """
-    HunyuanVideo Transformer backbone
-
-    Inherited from ModelMixin and ConfigMixin for compatibility with diffusers' sampler StableDiffusionPipeline.
-
-    Reference:
-    [1] Flux.1: https://github.com/black-forest-labs/flux
-    [2] MMDiT: http://arxiv.org/abs/2403.03206
-
-    Parameters
-    ----------
-    args: argparse.Namespace
-        The arguments parsed by argparse.
-    patch_size: list
-        The size of the patch.
-    in_channels: int
-        The number of input channels.
-    out_channels: int
-        The number of output channels.
-    hidden_size: int
-        The hidden size of the transformer backbone.
-    heads_num: int
-        The number of attention heads.
-    mlp_width_ratio: float
-        The ratio of the hidden size of the MLP in the transformer block.
-    mlp_act_type: str
-        The activation function of the MLP in the transformer block.
-    depth_double_blocks: int
-        The number of transformer blocks in the double blocks.
-    depth_single_blocks: int
-        The number of transformer blocks in the single blocks.
-    rope_dim_list: list
-        The dimension of the rotary embedding for t, h, w.
-    qkv_bias: bool
-        Whether to use bias in the qkv linear layer.
-    qk_norm: bool
-        Whether to use qk norm.
-    qk_norm_type: str
-        The type of qk norm.
-    guidance_embed: bool
-        Whether to use guidance embedding for distillation.
-    text_projection: str
-        The type of the text projection, default is single_refiner.
-    use_attention_mask: bool
-        Whether to use attention mask for text encoder.
-    dtype: torch.dtype
-        The dtype of the model.
-    device: torch.device
-        The device of the model.
-    """
-
-    @register_to_config
-    def __init__(
-        self,
-        patch_size: list = [1, 2, 2],
-        in_channels: int = 4,  # Should be VAE.config.latent_channels.
-        out_channels: int = None,
-        hidden_size: int = 3072,
-        heads_num: int = 24,
-        mlp_width_ratio: float = 4.0,
-        mlp_act_type: str = "gelu_tanh",
-        mm_double_blocks_depth: int = 20,
-        mm_single_blocks_depth: int = 40,
-        rope_dim_list: List[int] = [16, 56, 56],
-        qkv_bias: bool = True,
-        qk_norm: bool = True,
-        qk_norm_type: str = "rms",
-        guidance_embed: bool = False,  # For modulation.
-        text_projection: str = "single_refiner",
-        use_attention_mask: bool = True,
-        dtype: Optional[torch.dtype] = None,
-        device: Optional[torch.device] = None,
-        #below args
-        i2v_condition_type: str = "token_replace",
-        text_states_dim: int = 4096,
-        text_states_dim_2: int = 768,
-        gradient_checkpoint: bool = False,
-        gradient_checkpoint_layers: int = -1,
-    ):
-        factory_kwargs = {"device": device, "dtype": dtype}
-        super().__init__()
-
-        self.patch_size = patch_size
-        self.in_channels = in_channels
-        self.out_channels = in_channels if out_channels is None else out_channels
-        self.unpatchify_channels = self.out_channels
-        self.guidance_embed = guidance_embed
-        self.rope_dim_list = rope_dim_list
-        self.i2v_condition_type = i2v_condition_type
-
-        # Text projection. Default to linear projection.
-        # Alternative: TokenRefiner. See more details (LI-DiT): http://arxiv.org/abs/2406.11831
-        self.use_attention_mask = use_attention_mask
-        self.text_projection = text_projection
-
-        self.text_states_dim = text_states_dim
-        self.text_states_dim_2 = text_states_dim_2
-
-        # Gradient checkpoint.
-        self.gradient_checkpoint = gradient_checkpoint
-        self.gradient_checkpoint_layers = gradient_checkpoint_layers
-        if self.gradient_checkpoint:
-            assert self.gradient_checkpoint_layers <= mm_double_blocks_depth + mm_single_blocks_depth, \
-                f"Gradient checkpoint layers must be less or equal than the depth of the model. " \
-                f"Got gradient_checkpoint_layers={self.gradient_checkpoint_layers} and " \
-                f"depth={mm_double_blocks_depth + mm_single_blocks_depth}."
-
-        if hidden_size % heads_num != 0:
-            raise ValueError(
-                f"Hidden size {hidden_size} must be divisible by heads_num {heads_num}"
-            )
-        pe_dim = hidden_size // heads_num
-        if sum(rope_dim_list) != pe_dim:
-            raise ValueError(
-                f"Got {rope_dim_list} but expected positional dim {pe_dim}"
-            )
-        self.hidden_size = hidden_size
-        self.heads_num = heads_num
-
-        # image projection
-        self.img_in = PatchEmbed(
-            self.patch_size, self.in_channels, self.hidden_size, **factory_kwargs
-        )
-
-        # text projection
-        if self.text_projection == "linear":
-            self.txt_in = TextProjection(
-                self.text_states_dim,
-                self.hidden_size,
-                get_activation_layer("silu"),
-                **factory_kwargs,
-            )
-        elif self.text_projection == "single_refiner":
-            self.txt_in = SingleTokenRefiner(
-                self.text_states_dim, hidden_size, heads_num, depth=2, **factory_kwargs
-            )
-        else:
-            raise NotImplementedError(
-                f"Unsupported text_projection: {self.text_projection}"
-            )
-
-        # time modulation
-        self.time_in = TimestepEmbedder(
-            self.hidden_size, get_activation_layer("silu"), **factory_kwargs
-        )
-
-        # text modulation
-        self.vector_in = MLPEmbedder(
-            self.text_states_dim_2, self.hidden_size, **factory_kwargs
-        )
-
-        # guidance modulation
-        self.guidance_in = (
-            TimestepEmbedder(
-                self.hidden_size, get_activation_layer("silu"), **factory_kwargs
-            )
-            if guidance_embed
-            else None
-        )
-
-        # double blocks
-        self.double_blocks = nn.ModuleList(
-            [
-                MMDoubleStreamBlock(
-                    self.hidden_size,
-                    self.heads_num,
-                    mlp_width_ratio=mlp_width_ratio,
-                    mlp_act_type=mlp_act_type,
-                    qk_norm=qk_norm,
-                    qk_norm_type=qk_norm_type,
-                    qkv_bias=qkv_bias,
-                    **factory_kwargs,
-                )
-                for _ in range(mm_double_blocks_depth)
-            ]
-        )
-
-        # single blocks
-        self.single_blocks = nn.ModuleList(
-            [
-                MMSingleStreamBlock(
-                    self.hidden_size,
-                    self.heads_num,
-                    mlp_width_ratio=mlp_width_ratio,
-                    mlp_act_type=mlp_act_type,
-                    qk_norm=qk_norm,
-                    qk_norm_type=qk_norm_type,
-                    **factory_kwargs,
-                )
-                for _ in range(mm_single_blocks_depth)
-            ]
-        )
-
-        self.final_layer = FinalLayer(
-            self.hidden_size,
-            self.patch_size,
-            self.out_channels,
-            get_activation_layer("silu"),
-            **factory_kwargs,
-        )
-
-    def enable_deterministic(self):
-        for block in self.double_blocks:
-            block.enable_deterministic()
-        for block in self.single_blocks:
-            block.enable_deterministic()
-
-    def disable_deterministic(self):
-        for block in self.double_blocks:
-            block.disable_deterministic()
-        for block in self.single_blocks:
-            block.disable_deterministic()
-
-    def forward(
-        self,
-        x: torch.Tensor,
-        t: torch.Tensor,  # Should be in range(0, 1000).
-        text_states: torch.Tensor = None,
-        text_mask: torch.Tensor = None,  # Now we don't use it.
-        text_states_2: Optional[torch.Tensor] = None,  # Text embedding for modulation.
-        freqs_cos: Optional[torch.Tensor] = None,
-        freqs_sin: Optional[torch.Tensor] = None,
-        guidance: torch.Tensor = None,  # Guidance for modulation, should be cfg_scale x 1000.
-        return_dict: bool = True,
-    ) -> Union[torch.Tensor, Dict[str, torch.Tensor]]:
-        out = {}
-        img = x
-        txt = text_states
-        _, _, ot, oh, ow = x.shape
-        tt, th, tw = (
-            ot // self.patch_size[0],
-            oh // self.patch_size[1],
-            ow // self.patch_size[2],
-        )
-
-        # Prepare modulation vectors.
-        vec = self.time_in(t)
-
-        if self.i2v_condition_type == "token_replace":
-            token_replace_t = torch.zeros_like(t)
-            token_replace_vec = self.time_in(token_replace_t)
-            frist_frame_token_num = th * tw
-        else:
-            token_replace_vec = None
-            frist_frame_token_num = None
-
-        # text modulation
-        vec_2 = self.vector_in(text_states_2)
-        vec = vec + vec_2
-        if self.i2v_condition_type == "token_replace":
-            token_replace_vec = token_replace_vec + vec_2
-
-        # guidance modulation
-        if self.guidance_embed:
-            if guidance is None:
-                raise ValueError(
-                    "Didn't get guidance strength for guidance distilled model."
-                )
-
-            # our timestep_embedding is merged into guidance_in(TimestepEmbedder)
-            vec = vec + self.guidance_in(guidance)
-
-        # Embed image and text.
-        img = self.img_in(img)
-        if self.text_projection == "linear":
-            txt = self.txt_in(txt)
-        elif self.text_projection == "single_refiner":
-            txt = self.txt_in(txt, t, text_mask if self.use_attention_mask else None)
-        else:
-            raise NotImplementedError(
-                f"Unsupported text_projection: {self.text_projection}"
-            )
-
-        txt_seq_len = txt.shape[1]
-        img_seq_len = img.shape[1]
-
-        # Compute cu_squlens and max_seqlen for flash attention
-        cu_seqlens_q = get_cu_seqlens(text_mask, img_seq_len)
-        cu_seqlens_kv = cu_seqlens_q
-        max_seqlen_q = img_seq_len + txt_seq_len
-        max_seqlen_kv = max_seqlen_q
-
-        freqs_cis = (freqs_cos, freqs_sin) if freqs_cos is not None else None
-        # --------------------- Pass through DiT blocks ------------------------
-        for layer_num, block in enumerate(self.double_blocks):
-            double_block_args = [
-                img,
-                txt,
-                vec,
-                cu_seqlens_q,
-                cu_seqlens_kv,
-                max_seqlen_q,
-                max_seqlen_kv,
-                freqs_cis,
-                self.i2v_condition_type,
-                token_replace_vec,
-                frist_frame_token_num,
-            ]
-
-            if self.training and self.gradient_checkpoint and \
-                    (self.gradient_checkpoint_layers == -1 or layer_num < self.gradient_checkpoint_layers):
-                # print(f'gradient checkpointing...')
-                img, txt = torch.utils.checkpoint.checkpoint(ckpt_wrapper(block), *double_block_args, use_reentrant=False)
-            else:
-                img, txt = block(*double_block_args)
-
-        # Merge txt and img to pass through single stream blocks.
-        x = torch.cat((img, txt), 1)
-
-        if len(self.single_blocks) > 0:
-            for _, block in enumerate(self.single_blocks):
-                single_block_args = [
-                    x,
-                    vec,
-                    txt_seq_len,
-                    cu_seqlens_q,
-                    cu_seqlens_kv,
-                    max_seqlen_q,
-                    max_seqlen_kv,
-                    (freqs_cos, freqs_sin),
-                    self.i2v_condition_type,
-                    token_replace_vec,
-                    frist_frame_token_num,
-                ]
-
-                if self.training and self.gradient_checkpoint and \
-                        (self.gradient_checkpoint_layers == -1 or layer_num + len(self.double_blocks) < self.gradient_checkpoint_layers):
-                    x = torch.utils.checkpoint.checkpoint(ckpt_wrapper(block), *single_block_args, use_reentrant=False)
-                else:
-                    x = block(*single_block_args)
-
-        img = x[:, :img_seq_len, ...]
-
-        # ---------------------------- Final layer ------------------------------
-        img = self.final_layer(img, vec)  # (N, T, patch_size ** 2 * out_channels)
-
-        img = self.unpatchify(img, tt, th, tw)
-        if return_dict:
-            out["x"] = img
-            return out
-        return img
-
-    def unpatchify(self, x, t, h, w):
-        """
-        x: (N, T, patch_size**2 * C)
-        imgs: (N, H, W, C)
-        """
-        c = self.unpatchify_channels
-        pt, ph, pw = self.patch_size
-        assert t * h * w == x.shape[1]
-
-        x = x.reshape(shape=(x.shape[0], t, h, w, c, pt, ph, pw))
-        x = torch.einsum("nthwcopq->nctohpwq", x)
-        imgs = x.reshape(shape=(x.shape[0], c, t * pt, h * ph, w * pw))
-
-        return imgs
-
-    def params_count(self):
-        counts = {
-            "double": sum(
-                [
-                    sum(p.numel() for p in block.img_attn_qkv.parameters())
-                    + sum(p.numel() for p in block.img_attn_proj.parameters())
-                    + sum(p.numel() for p in block.img_mlp.parameters())
-                    + sum(p.numel() for p in block.txt_attn_qkv.parameters())
-                    + sum(p.numel() for p in block.txt_attn_proj.parameters())
-                    + sum(p.numel() for p in block.txt_mlp.parameters())
-                    for block in self.double_blocks
-                ]
-            ),
-            "single": sum(
-                [
-                    sum(p.numel() for p in block.linear1.parameters())
-                    + sum(p.numel() for p in block.linear2.parameters())
-                    for block in self.single_blocks
-                ]
-            ),
-            "total": sum(p.numel() for p in self.parameters()),
-        }
-        counts["attn+mlp"] = counts["double"] + counts["single"]
-        return counts
-
-    def set_input_tensor(self, input_tensor):
-        pass
-
-#################################################################################
-#                             HunyuanVideo Configs                              #
-#################################################################################
-
-HUNYUAN_VIDEO_CONFIG = {
-    "HYVideo-T/2": {
-        "mm_double_blocks_depth": 20,
-        "mm_single_blocks_depth": 40,
-        "rope_dim_list": [16, 56, 56],
-        "hidden_size": 3072,
-        "heads_num": 24,
-        "mlp_width_ratio": 4,
-    },
-    "HYVideo-T/2-cfgdistill": {
-        "mm_double_blocks_depth": 20,
-        "mm_single_blocks_depth": 40,
-        "rope_dim_list": [16, 56, 56],
-        "hidden_size": 3072,
-        "heads_num": 24,
-        "mlp_width_ratio": 4,
-        "guidance_embed": True,
-    },
-    "HYVideo-S/2": {
-        "mm_double_blocks_depth": 6,
-        "mm_single_blocks_depth": 12,
-        "rope_dim_list": [12, 42, 42],
-        "hidden_size": 480,
-        "heads_num": 5,
-        "mlp_width_ratio": 4,
-    },
-}
-
-
-class HYVideoDiffusionTransformerWrapper(nn.Module):
-    def __init__(self,
-                device: str = 'cuda',
-                i2v_mode: bool = True,
-                i2v_condition_type: str = 'token_replace',
-                precision: str = 'bf16',
-                latent_channels: int = 16,
-                embedded_cfg_scale: float = 6.0,
-                model: str = 'HYVideo-T/2',
-                gradient_checkpoint: bool = False,
-                gradient_checkpoint_layers: int = -1,
-                text_states_dim: int = 4096,
-                text_states_dim_2: int = 768,
-                ckpt_path: str = None,
-                dit_weight: str = None,
-                i2v_dit_weight: str = None,
-                model_resolution: str = '720p',
-                load_key: str = 'module',
-                *args, 
-                **kwargs):
-        super().__init__(*args, **kwargs)
-
-        factor_kwargs = {"device": device, "dtype": PRECISION_TO_TYPE[precision]}
-        if i2v_mode and i2v_condition_type == "latent_concat":
-            in_channels = latent_channels * 2 + 1
-        elif i2v_mode and i2v_condition_type == "token_replace":
-            in_channels = latent_channels
-        else:
-            in_channels = latent_channels
-        out_channels = latent_channels
-        if embedded_cfg_scale:
-            factor_kwargs["guidance_embed"] = True
-
-        assert model in HUNYUAN_VIDEO_CONFIG.keys(), f"invalid model: {model}"
-        self.model = HYVideoDiffusionTransformer(
-            gradient_checkpoint=gradient_checkpoint,
-            gradient_checkpoint_layers=gradient_checkpoint_layers,
-            in_channels=in_channels,
-            out_channels=out_channels,
-            text_states_dim=text_states_dim,
-            text_states_dim_2=text_states_dim_2,
-            **HUNYUAN_VIDEO_CONFIG[model],
-            **factor_kwargs,
-        )
-        self.dit_weight = dit_weight
-        self.ckpt_path = ckpt_path
-        self.i2v_dit_weight = i2v_dit_weight
-        self.model_resolution = model_resolution
-        self.load_key = load_key
-        self.i2v_mode = i2v_mode
-        self.device = device
-
-
-    def load_weight(self):
-        load_key = self.load_key
-        if self.i2v_mode:
-            dit_weight = Path(self.i2v_dit_weight)
-        else:
-            dit_weight = Path(self.dit_weight)
-
-        if dit_weight is None:
-            model_dir = self.ckpt_path / f"t2v_{self.model_resolution}"
-            files = list(model_dir.glob("*.pt"))
-            if len(files) == 0:
-                raise ValueError(f"No model weights found in {model_dir}")
-            if str(files[0]).startswith("pytorch_model_"):
-                model_path = dit_weight / f"pytorch_model_{load_key}.pt"
-                bare_model = True
-            elif any(str(f).endswith("_model_states.pt") for f in files):
-                files = [f for f in files if str(f).endswith("_model_states.pt")]
-                model_path = files[0]
-                if len(files) > 1:
-                    logger.warning(f"Multiple model weights found in {dit_weight}, using {model_path}")
-                bare_model = False
-            else:
-                raise ValueError(f"Invalid model path: {dit_weight} with unrecognized weight format")
-        else:
-            if dit_weight.is_dir():
-                files = list(dit_weight.glob("*.pt"))
-                if len(files) == 0:
-                    raise ValueError(f"No model weights found in {dit_weight}")
-                if str(files[0]).startswith("pytorch_model_"):
-                    model_path = dit_weight / f"pytorch_model_{load_key}.pt"
-                    bare_model = True
-                elif any(str(f).endswith("_model_states.pt") for f in files):
-                    files = [f for f in files if str(f).endswith("_model_states.pt")]
-                    model_path = files[0]
-                    if len(files) > 1:
-                        logger.warning(f"Multiple model weights found in {dit_weight}, using {model_path}")
-                    bare_model = False
-                else:
-                    raise ValueError(f"Invalid model path: {dit_weight} with unrecognized weight format")
-            elif dit_weight.is_file():
-                model_path = dit_weight
-                bare_model = "unknown"
-            else:
-                raise ValueError(f"Invalid model path: {dit_weight}")
-
-        if not model_path.exists():
-            raise ValueError(f"model_path not exists: {model_path}")
-        logger.info(f"Loading torch model {model_path}...")
-        state_dict = torch.load(model_path, map_location=lambda storage, loc: storage)
-
-        if bare_model == "unknown" and ("ema" in state_dict or "module" in state_dict):
-            bare_model = False
-        if bare_model is False:
-            if load_key in state_dict:
-                state_dict = state_dict[load_key]
-            else:
-                raise KeyError(f"Missing key: `{load_key}` in the checkpoint: {model_path}")
-        self.model.load_state_dict(state_dict, strict=True)
-        self.model = self.model.to(self.device)
\ No newline at end of file
diff --git a/videotuna/models/hunyuan/hyvideo_i2v/modules/modulate_layers.py b/videotuna/models/hunyuan/hyvideo_i2v/modules/modulate_layers.py
deleted file mode 100644
index 97b29eb0..00000000
--- a/videotuna/models/hunyuan/hyvideo_i2v/modules/modulate_layers.py
+++ /dev/null
@@ -1,104 +0,0 @@
-from typing import Callable
-
-import torch
-import torch.nn as nn
-import math
-
-class ModulateDiT(nn.Module):
-    """Modulation layer for DiT."""
-    def __init__(
-        self,
-        hidden_size: int,
-        factor: int,
-        act_layer: Callable,
-        dtype=None,
-        device=None,
-    ):
-        factory_kwargs = {"dtype": dtype, "device": device}
-        super().__init__()
-        self.act = act_layer()
-        self.linear = nn.Linear(
-            hidden_size, factor * hidden_size, bias=True, **factory_kwargs
-        )
-        # Zero-initialize the modulation
-        nn.init.zeros_(self.linear.weight)
-        nn.init.zeros_(self.linear.bias)
-
-    def forward(self, x: torch.Tensor, condition_type=None, token_replace_vec=None) -> torch.Tensor:
-
-        x_out = self.linear(self.act(x))
-
-        if condition_type == "token_replace":
-            x_token_replace_out = self.linear(self.act(token_replace_vec))
-            return x_out, x_token_replace_out
-        else:
-            return x_out
-
-def modulate(x, shift=None, scale=None, condition_type=None,
-             tr_shift=None, tr_scale=None,
-             frist_frame_token_num=None):
-    """modulate by shift and scale
-
-    Args:
-        x (torch.Tensor): input tensor.
-        shift (torch.Tensor, optional): shift tensor. Defaults to None.
-        scale (torch.Tensor, optional): scale tensor. Defaults to None.
-
-    Returns:
-        torch.Tensor: the output tensor after modulate.
-    """
-    if condition_type == "token_replace":
-        x_zero = x[:, :frist_frame_token_num] * (1 + tr_scale.unsqueeze(1)) + tr_shift.unsqueeze(1)
-        x_orig = x[:, frist_frame_token_num:] * (1 + scale.unsqueeze(1)) + shift.unsqueeze(1)
-        x = torch.concat((x_zero, x_orig), dim=1)
-        return x
-    else:
-        if scale is None and shift is None:
-            return x
-        elif shift is None:
-            return x * (1 + scale.unsqueeze(1))
-        elif scale is None:
-            return x + shift.unsqueeze(1)
-        else:
-            return x * (1 + scale.unsqueeze(1)) + shift.unsqueeze(1)
-
-
-def apply_gate(x, gate=None, tanh=False, condition_type=None, tr_gate=None, frist_frame_token_num=None):
-    """AI is creating summary for apply_gate
-
-    Args:
-        x (torch.Tensor): input tensor.
-        gate (torch.Tensor, optional): gate tensor. Defaults to None.
-        tanh (bool, optional): whether to use tanh function. Defaults to False.
-
-    Returns:
-        torch.Tensor: the output tensor after apply gate.
-    """
-    if condition_type == "token_replace":
-        if gate is None:
-            return x
-        if tanh:
-            x_zero = x[:, :frist_frame_token_num] * tr_gate.unsqueeze(1).tanh()
-            x_orig = x[:, frist_frame_token_num:] * gate.unsqueeze(1).tanh()
-            x = torch.concat((x_zero, x_orig), dim=1)
-            return x
-        else:
-            x_zero = x[:, :frist_frame_token_num] * tr_gate.unsqueeze(1)
-            x_orig = x[:, frist_frame_token_num:] * gate.unsqueeze(1)
-            x = torch.concat((x_zero, x_orig), dim=1)
-            return x
-    else:
-        if gate is None:
-            return x
-        if tanh:
-            return x * gate.unsqueeze(1).tanh()
-        else:
-            return x * gate.unsqueeze(1)
-
-
-def ckpt_wrapper(module):
-    def ckpt_forward(*inputs):
-        outputs = module(*inputs)
-        return outputs
-
-    return ckpt_forward
diff --git a/videotuna/models/hunyuan/hyvideo_i2v/modules/norm_layers.py b/videotuna/models/hunyuan/hyvideo_i2v/modules/norm_layers.py
deleted file mode 100644
index d8c73b1a..00000000
--- a/videotuna/models/hunyuan/hyvideo_i2v/modules/norm_layers.py
+++ /dev/null
@@ -1,77 +0,0 @@
-import torch
-import torch.nn as nn
-
-
-class RMSNorm(nn.Module):
-    def __init__(
-        self,
-        dim: int,
-        elementwise_affine=True,
-        eps: float = 1e-6,
-        device=None,
-        dtype=None,
-    ):
-        """
-        Initialize the RMSNorm normalization layer.
-
-        Args:
-            dim (int): The dimension of the input tensor.
-            eps (float, optional): A small value added to the denominator for numerical stability. Default is 1e-6.
-
-        Attributes:
-            eps (float): A small value added to the denominator for numerical stability.
-            weight (nn.Parameter): Learnable scaling parameter.
-
-        """
-        factory_kwargs = {"device": device, "dtype": dtype}
-        super().__init__()
-        self.eps = eps
-        if elementwise_affine:
-            self.weight = nn.Parameter(torch.ones(dim, **factory_kwargs))
-
-    def _norm(self, x):
-        """
-        Apply the RMSNorm normalization to the input tensor.
-
-        Args:
-            x (torch.Tensor): The input tensor.
-
-        Returns:
-            torch.Tensor: The normalized tensor.
-
-        """
-        return x * torch.rsqrt(x.pow(2).mean(-1, keepdim=True) + self.eps)
-
-    def forward(self, x):
-        """
-        Forward pass through the RMSNorm layer.
-
-        Args:
-            x (torch.Tensor): The input tensor.
-
-        Returns:
-            torch.Tensor: The output tensor after applying RMSNorm.
-
-        """
-        output = self._norm(x.float()).type_as(x)
-        if hasattr(self, "weight"):
-            output = output * self.weight
-        return output
-
-
-def get_norm_layer(norm_layer):
-    """
-    Get the normalization layer.
-
-    Args:
-        norm_layer (str): The type of normalization layer.
-
-    Returns:
-        norm_layer (nn.Module): The normalization layer.
-    """
-    if norm_layer == "layer":
-        return nn.LayerNorm
-    elif norm_layer == "rms":
-        return RMSNorm
-    else:
-        raise NotImplementedError(f"Norm layer {norm_layer} is not implemented")
diff --git a/videotuna/models/hunyuan/hyvideo_i2v/modules/posemb_layers.py b/videotuna/models/hunyuan/hyvideo_i2v/modules/posemb_layers.py
deleted file mode 100644
index dfce82c6..00000000
--- a/videotuna/models/hunyuan/hyvideo_i2v/modules/posemb_layers.py
+++ /dev/null
@@ -1,310 +0,0 @@
-import torch
-from typing import Union, Tuple, List
-
-
-def _to_tuple(x, dim=2):
-    if isinstance(x, int):
-        return (x,) * dim
-    elif len(x) == dim:
-        return x
-    else:
-        raise ValueError(f"Expected length {dim} or int, but got {x}")
-
-
-def get_meshgrid_nd(start, *args, dim=2):
-    """
-    Get n-D meshgrid with start, stop and num.
-
-    Args:
-        start (int or tuple): If len(args) == 0, start is num; If len(args) == 1, start is start, args[0] is stop,
-            step is 1; If len(args) == 2, start is start, args[0] is stop, args[1] is num. For n-dim, start/stop/num
-            should be int or n-tuple. If n-tuple is provided, the meshgrid will be stacked following the dim order in
-            n-tuples.
-        *args: See above.
-        dim (int): Dimension of the meshgrid. Defaults to 2.
-
-    Returns:
-        grid (np.ndarray): [dim, ...]
-    """
-    if len(args) == 0:
-        # start is grid_size
-        num = _to_tuple(start, dim=dim)
-        start = (0,) * dim
-        stop = num
-    elif len(args) == 1:
-        # start is start, args[0] is stop, step is 1
-        start = _to_tuple(start, dim=dim)
-        stop = _to_tuple(args[0], dim=dim)
-        num = [stop[i] - start[i] for i in range(dim)]
-    elif len(args) == 2:
-        # start is start, args[0] is stop, args[1] is num
-        start = _to_tuple(start, dim=dim)  # Left-Top       eg: 12,0
-        stop = _to_tuple(args[0], dim=dim)  # Right-Bottom   eg: 20,32
-        num = _to_tuple(args[1], dim=dim)  # Target Size    eg: 32,124
-    else:
-        raise ValueError(f"len(args) should be 0, 1 or 2, but got {len(args)}")
-
-    # PyTorch implement of np.linspace(start[i], stop[i], num[i], endpoint=False)
-    axis_grid = []
-    for i in range(dim):
-        a, b, n = start[i], stop[i], num[i]
-        g = torch.linspace(a, b, n + 1, dtype=torch.float32)[:n]
-        axis_grid.append(g)
-    grid = torch.meshgrid(*axis_grid, indexing="ij")  # dim x [W, H, D]
-    grid = torch.stack(grid, dim=0)  # [dim, W, H, D]
-
-    return grid
-
-
-#################################################################################
-#                   Rotary Positional Embedding Functions                       #
-#################################################################################
-# https://github.com/meta-llama/llama/blob/be327c427cc5e89cc1d3ab3d3fec4484df771245/llama/model.py#L80
-
-
-def reshape_for_broadcast(
-    freqs_cis: Union[torch.Tensor, Tuple[torch.Tensor]],
-    x: torch.Tensor,
-    head_first=False,
-):
-    """
-    Reshape frequency tensor for broadcasting it with another tensor.
-
-    This function reshapes the frequency tensor to have the same shape as the target tensor 'x'
-    for the purpose of broadcasting the frequency tensor during element-wise operations.
-
-    Notes:
-        When using FlashMHAModified, head_first should be False.
-        When using Attention, head_first should be True.
-
-    Args:
-        freqs_cis (Union[torch.Tensor, Tuple[torch.Tensor]]): Frequency tensor to be reshaped.
-        x (torch.Tensor): Target tensor for broadcasting compatibility.
-        head_first (bool): head dimension first (except batch dim) or not.
-
-    Returns:
-        torch.Tensor: Reshaped frequency tensor.
-
-    Raises:
-        AssertionError: If the frequency tensor doesn't match the expected shape.
-        AssertionError: If the target tensor 'x' doesn't have the expected number of dimensions.
-    """
-    ndim = x.ndim
-    assert 0 <= 1 < ndim
-
-    if isinstance(freqs_cis, tuple):
-        # freqs_cis: (cos, sin) in real space
-        if head_first:
-            assert freqs_cis[0].shape == (
-                x.shape[-2],
-                x.shape[-1],
-            ), f"freqs_cis shape {freqs_cis[0].shape} does not match x shape {x.shape}"
-            shape = [
-                d if i == ndim - 2 or i == ndim - 1 else 1
-                for i, d in enumerate(x.shape)
-            ]
-        else:
-            assert freqs_cis[0].shape == (
-                x.shape[1],
-                x.shape[-1],
-            ), f"freqs_cis shape {freqs_cis[0].shape} does not match x shape {x.shape}"
-            shape = [d if i == 1 or i == ndim - 1 else 1 for i, d in enumerate(x.shape)]
-        return freqs_cis[0].view(*shape), freqs_cis[1].view(*shape)
-    else:
-        # freqs_cis: values in complex space
-        if head_first:
-            assert freqs_cis.shape == (
-                x.shape[-2],
-                x.shape[-1],
-            ), f"freqs_cis shape {freqs_cis.shape} does not match x shape {x.shape}"
-            shape = [
-                d if i == ndim - 2 or i == ndim - 1 else 1
-                for i, d in enumerate(x.shape)
-            ]
-        else:
-            assert freqs_cis.shape == (
-                x.shape[1],
-                x.shape[-1],
-            ), f"freqs_cis shape {freqs_cis.shape} does not match x shape {x.shape}"
-            shape = [d if i == 1 or i == ndim - 1 else 1 for i, d in enumerate(x.shape)]
-        return freqs_cis.view(*shape)
-
-
-def rotate_half(x):
-    x_real, x_imag = (
-        x.float().reshape(*x.shape[:-1], -1, 2).unbind(-1)
-    )  # [B, S, H, D//2]
-    return torch.stack([-x_imag, x_real], dim=-1).flatten(3)
-
-
-def apply_rotary_emb(
-    xq: torch.Tensor,
-    xk: torch.Tensor,
-    freqs_cis: Union[torch.Tensor, Tuple[torch.Tensor, torch.Tensor]],
-    head_first: bool = False,
-) -> Tuple[torch.Tensor, torch.Tensor]:
-    """
-    Apply rotary embeddings to input tensors using the given frequency tensor.
-
-    This function applies rotary embeddings to the given query 'xq' and key 'xk' tensors using the provided
-    frequency tensor 'freqs_cis'. The input tensors are reshaped as complex numbers, and the frequency tensor
-    is reshaped for broadcasting compatibility. The resulting tensors contain rotary embeddings and are
-    returned as real tensors.
-
-    Args:
-        xq (torch.Tensor): Query tensor to apply rotary embeddings. [B, S, H, D]
-        xk (torch.Tensor): Key tensor to apply rotary embeddings.   [B, S, H, D]
-        freqs_cis (torch.Tensor or tuple): Precomputed frequency tensor for complex exponential.
-        head_first (bool): head dimension first (except batch dim) or not.
-
-    Returns:
-        Tuple[torch.Tensor, torch.Tensor]: Tuple of modified query tensor and key tensor with rotary embeddings.
-
-    """
-    xk_out = None
-    if isinstance(freqs_cis, tuple):
-        cos, sin = reshape_for_broadcast(freqs_cis, xq, head_first)  # [S, D]
-        cos, sin = cos.to(xq.device), sin.to(xq.device)
-        # real * cos - imag * sin
-        # imag * cos + real * sin
-        xq_out = (xq.float() * cos + rotate_half(xq.float()) * sin).type_as(xq)
-        xk_out = (xk.float() * cos + rotate_half(xk.float()) * sin).type_as(xk)
-    else:
-        # view_as_complex will pack [..., D/2, 2](real) to [..., D/2](complex)
-        xq_ = torch.view_as_complex(
-            xq.float().reshape(*xq.shape[:-1], -1, 2)
-        )  # [B, S, H, D//2]
-        freqs_cis = reshape_for_broadcast(freqs_cis, xq_, head_first).to(
-            xq.device
-        )  # [S, D//2] --> [1, S, 1, D//2]
-        # (real, imag) * (cos, sin) = (real * cos - imag * sin, imag * cos + real * sin)
-        # view_as_real will expand [..., D/2](complex) to [..., D/2, 2](real)
-        xq_out = torch.view_as_real(xq_ * freqs_cis).flatten(3).type_as(xq)
-        xk_ = torch.view_as_complex(
-            xk.float().reshape(*xk.shape[:-1], -1, 2)
-        )  # [B, S, H, D//2]
-        xk_out = torch.view_as_real(xk_ * freqs_cis).flatten(3).type_as(xk)
-
-    return xq_out, xk_out
-
-
-def get_nd_rotary_pos_embed(
-    rope_dim_list,
-    start,
-    *args,
-    theta=10000.0,
-    use_real=False,
-    theta_rescale_factor: Union[float, List[float]] = 1.0,
-    interpolation_factor: Union[float, List[float]] = 1.0,
-):
-    """
-    This is a n-d version of precompute_freqs_cis, which is a RoPE for tokens with n-d structure.
-
-    Args:
-        rope_dim_list (list of int): Dimension of each rope. len(rope_dim_list) should equal to n.
-            sum(rope_dim_list) should equal to head_dim of attention layer.
-        start (int | tuple of int | list of int): If len(args) == 0, start is num; If len(args) == 1, start is start,
-            args[0] is stop, step is 1; If len(args) == 2, start is start, args[0] is stop, args[1] is num.
-        *args: See above.
-        theta (float): Scaling factor for frequency computation. Defaults to 10000.0.
-        use_real (bool): If True, return real part and imaginary part separately. Otherwise, return complex numbers.
-            Some libraries such as TensorRT does not support complex64 data type. So it is useful to provide a real
-            part and an imaginary part separately.
-        theta_rescale_factor (float): Rescale factor for theta. Defaults to 1.0.
-
-    Returns:
-        pos_embed (torch.Tensor): [HW, D/2]
-    """
-
-    grid = get_meshgrid_nd(
-        start, *args, dim=len(rope_dim_list)
-    )  # [3, W, H, D] / [2, W, H]
-
-    if isinstance(theta_rescale_factor, int) or isinstance(theta_rescale_factor, float):
-        theta_rescale_factor = [theta_rescale_factor] * len(rope_dim_list)
-    elif isinstance(theta_rescale_factor, list) and len(theta_rescale_factor) == 1:
-        theta_rescale_factor = [theta_rescale_factor[0]] * len(rope_dim_list)
-    assert len(theta_rescale_factor) == len(
-        rope_dim_list
-    ), "len(theta_rescale_factor) should equal to len(rope_dim_list)"
-
-    if isinstance(interpolation_factor, int) or isinstance(interpolation_factor, float):
-        interpolation_factor = [interpolation_factor] * len(rope_dim_list)
-    elif isinstance(interpolation_factor, list) and len(interpolation_factor) == 1:
-        interpolation_factor = [interpolation_factor[0]] * len(rope_dim_list)
-    assert len(interpolation_factor) == len(
-        rope_dim_list
-    ), "len(interpolation_factor) should equal to len(rope_dim_list)"
-
-    # use 1/ndim of dimensions to encode grid_axis
-    embs = []
-    for i in range(len(rope_dim_list)):
-        emb = get_1d_rotary_pos_embed(
-            rope_dim_list[i],
-            grid[i].reshape(-1),
-            theta,
-            use_real=use_real,
-            theta_rescale_factor=theta_rescale_factor[i],
-            interpolation_factor=interpolation_factor[i],
-        )  # 2 x [WHD, rope_dim_list[i]]
-        embs.append(emb)
-
-    if use_real:
-        cos = torch.cat([emb[0] for emb in embs], dim=1)  # (WHD, D/2)
-        sin = torch.cat([emb[1] for emb in embs], dim=1)  # (WHD, D/2)
-        return cos, sin
-    else:
-        emb = torch.cat(embs, dim=1)  # (WHD, D/2)
-        return emb
-
-
-def get_1d_rotary_pos_embed(
-    dim: int,
-    pos: Union[torch.FloatTensor, int],
-    theta: float = 10000.0,
-    use_real: bool = False,
-    theta_rescale_factor: float = 1.0,
-    interpolation_factor: float = 1.0,
-) -> Union[torch.Tensor, Tuple[torch.Tensor, torch.Tensor]]:
-    """
-    Precompute the frequency tensor for complex exponential (cis) with given dimensions.
-    (Note: `cis` means `cos + i * sin`, where i is the imaginary unit.)
-
-    This function calculates a frequency tensor with complex exponential using the given dimension 'dim'
-    and the end index 'end'. The 'theta' parameter scales the frequencies.
-    The returned tensor contains complex values in complex64 data type.
-
-    Args:
-        dim (int): Dimension of the frequency tensor.
-        pos (int or torch.FloatTensor): Position indices for the frequency tensor. [S] or scalar
-        theta (float, optional): Scaling factor for frequency computation. Defaults to 10000.0.
-        use_real (bool, optional): If True, return real part and imaginary part separately.
-                                   Otherwise, return complex numbers.
-        theta_rescale_factor (float, optional): Rescale factor for theta. Defaults to 1.0.
-
-    Returns:
-        freqs_cis: Precomputed frequency tensor with complex exponential. [S, D/2]
-        freqs_cos, freqs_sin: Precomputed frequency tensor with real and imaginary parts separately. [S, D]
-    """
-    if isinstance(pos, int):
-        pos = torch.arange(pos).float()
-
-    # proposed by reddit user bloc97, to rescale rotary embeddings to longer sequence length without fine-tuning
-    # has some connection to NTK literature
-    if theta_rescale_factor != 1.0:
-        theta *= theta_rescale_factor ** (dim / (dim - 2))
-
-    freqs = 1.0 / (
-        theta ** (torch.arange(0, dim, 2)[: (dim // 2)].float() / dim)
-    )  # [D/2]
-    # assert interpolation_factor == 1.0, f"interpolation_factor: {interpolation_factor}"
-    freqs = torch.outer(pos * interpolation_factor, freqs)  # [S, D/2]
-    if use_real:
-        freqs_cos = freqs.cos().repeat_interleave(2, dim=1)  # [S, D]
-        freqs_sin = freqs.sin().repeat_interleave(2, dim=1)  # [S, D]
-        return freqs_cos, freqs_sin
-    else:
-        freqs_cis = torch.polar(
-            torch.ones_like(freqs), freqs
-        )  # complex64     # [S, D/2]
-        return freqs_cis
diff --git a/videotuna/models/hunyuan/hyvideo_i2v/modules/token_refiner.py b/videotuna/models/hunyuan/hyvideo_i2v/modules/token_refiner.py
deleted file mode 100644
index bf09278e..00000000
--- a/videotuna/models/hunyuan/hyvideo_i2v/modules/token_refiner.py
+++ /dev/null
@@ -1,236 +0,0 @@
-from typing import Optional
-
-from einops import rearrange
-import torch
-import torch.nn as nn
-
-from .activation_layers import get_activation_layer
-from .attenion import attention
-from .norm_layers import get_norm_layer
-from .embed_layers import TimestepEmbedder, TextProjection
-from .attenion import attention
-from .mlp_layers import MLP
-from .modulate_layers import modulate, apply_gate
-
-
-class IndividualTokenRefinerBlock(nn.Module):
-    def __init__(
-        self,
-        hidden_size,
-        heads_num,
-        mlp_width_ratio: str = 4.0,
-        mlp_drop_rate: float = 0.0,
-        act_type: str = "silu",
-        qk_norm: bool = False,
-        qk_norm_type: str = "layer",
-        qkv_bias: bool = True,
-        dtype: Optional[torch.dtype] = None,
-        device: Optional[torch.device] = None,
-    ):
-        factory_kwargs = {"device": device, "dtype": dtype}
-        super().__init__()
-        self.heads_num = heads_num
-        head_dim = hidden_size // heads_num
-        mlp_hidden_dim = int(hidden_size * mlp_width_ratio)
-
-        self.norm1 = nn.LayerNorm(
-            hidden_size, elementwise_affine=True, eps=1e-6, **factory_kwargs
-        )
-        self.self_attn_qkv = nn.Linear(
-            hidden_size, hidden_size * 3, bias=qkv_bias, **factory_kwargs
-        )
-        qk_norm_layer = get_norm_layer(qk_norm_type)
-        self.self_attn_q_norm = (
-            qk_norm_layer(head_dim, elementwise_affine=True, eps=1e-6, **factory_kwargs)
-            if qk_norm
-            else nn.Identity()
-        )
-        self.self_attn_k_norm = (
-            qk_norm_layer(head_dim, elementwise_affine=True, eps=1e-6, **factory_kwargs)
-            if qk_norm
-            else nn.Identity()
-        )
-        self.self_attn_proj = nn.Linear(
-            hidden_size, hidden_size, bias=qkv_bias, **factory_kwargs
-        )
-
-        self.norm2 = nn.LayerNorm(
-            hidden_size, elementwise_affine=True, eps=1e-6, **factory_kwargs
-        )
-        act_layer = get_activation_layer(act_type)
-        self.mlp = MLP(
-            in_channels=hidden_size,
-            hidden_channels=mlp_hidden_dim,
-            act_layer=act_layer,
-            drop=mlp_drop_rate,
-            **factory_kwargs,
-        )
-
-        self.adaLN_modulation = nn.Sequential(
-            act_layer(),
-            nn.Linear(hidden_size, 2 * hidden_size, bias=True, **factory_kwargs),
-        )
-        # Zero-initialize the modulation
-        nn.init.zeros_(self.adaLN_modulation[1].weight)
-        nn.init.zeros_(self.adaLN_modulation[1].bias)
-
-    def forward(
-        self,
-        x: torch.Tensor,
-        c: torch.Tensor,  # timestep_aware_representations + context_aware_representations
-        attn_mask: torch.Tensor = None,
-    ):
-        gate_msa, gate_mlp = self.adaLN_modulation(c).chunk(2, dim=1)
-
-        norm_x = self.norm1(x)
-        qkv = self.self_attn_qkv(norm_x)
-        q, k, v = rearrange(qkv, "B L (K H D) -> K B L H D", K=3, H=self.heads_num)
-        # Apply QK-Norm if needed
-        q = self.self_attn_q_norm(q).to(v)
-        k = self.self_attn_k_norm(k).to(v)
-
-        # Self-Attention
-        attn = attention(q, k, v, mode="torch", attn_mask=attn_mask)
-
-        x = x + apply_gate(self.self_attn_proj(attn), gate_msa)
-
-        # FFN Layer
-        x = x + apply_gate(self.mlp(self.norm2(x)), gate_mlp)
-
-        return x
-
-
-class IndividualTokenRefiner(nn.Module):
-    def __init__(
-        self,
-        hidden_size,
-        heads_num,
-        depth,
-        mlp_width_ratio: float = 4.0,
-        mlp_drop_rate: float = 0.0,
-        act_type: str = "silu",
-        qk_norm: bool = False,
-        qk_norm_type: str = "layer",
-        qkv_bias: bool = True,
-        dtype: Optional[torch.dtype] = None,
-        device: Optional[torch.device] = None,
-    ):
-        factory_kwargs = {"device": device, "dtype": dtype}
-        super().__init__()
-        self.blocks = nn.ModuleList(
-            [
-                IndividualTokenRefinerBlock(
-                    hidden_size=hidden_size,
-                    heads_num=heads_num,
-                    mlp_width_ratio=mlp_width_ratio,
-                    mlp_drop_rate=mlp_drop_rate,
-                    act_type=act_type,
-                    qk_norm=qk_norm,
-                    qk_norm_type=qk_norm_type,
-                    qkv_bias=qkv_bias,
-                    **factory_kwargs,
-                )
-                for _ in range(depth)
-            ]
-        )
-
-    def forward(
-        self,
-        x: torch.Tensor,
-        c: torch.LongTensor,
-        mask: Optional[torch.Tensor] = None,
-    ):
-        self_attn_mask = None
-        if mask is not None:
-            batch_size = mask.shape[0]
-            seq_len = mask.shape[1]
-            mask = mask.to(x.device)
-            # batch_size x 1 x seq_len x seq_len
-            self_attn_mask_1 = mask.view(batch_size, 1, 1, seq_len).repeat(
-                1, 1, seq_len, 1
-            )
-            # batch_size x 1 x seq_len x seq_len
-            self_attn_mask_2 = self_attn_mask_1.transpose(2, 3)
-            # batch_size x 1 x seq_len x seq_len, 1 for broadcasting of heads_num
-            self_attn_mask = (self_attn_mask_1 & self_attn_mask_2).bool()
-            # avoids self-attention weight being NaN for padding tokens
-            self_attn_mask[:, :, :, 0] = True
-
-        for block in self.blocks:
-            x = block(x, c, self_attn_mask)
-        return x
-
-
-class SingleTokenRefiner(nn.Module):
-    """
-    A single token refiner block for llm text embedding refine.
-    """
-    def __init__(
-        self,
-        in_channels,
-        hidden_size,
-        heads_num,
-        depth,
-        mlp_width_ratio: float = 4.0,
-        mlp_drop_rate: float = 0.0,
-        act_type: str = "silu",
-        qk_norm: bool = False,
-        qk_norm_type: str = "layer",
-        qkv_bias: bool = True,
-        attn_mode: str = "torch",
-        dtype: Optional[torch.dtype] = None,
-        device: Optional[torch.device] = None,
-    ):
-        factory_kwargs = {"device": device, "dtype": dtype}
-        super().__init__()
-        self.attn_mode = attn_mode
-        assert self.attn_mode == "torch", "Only support 'torch' mode for token refiner."
-
-        self.input_embedder = nn.Linear(
-            in_channels, hidden_size, bias=True, **factory_kwargs
-        )
-
-        act_layer = get_activation_layer(act_type)
-        # Build timestep embedding layer
-        self.t_embedder = TimestepEmbedder(hidden_size, act_layer, **factory_kwargs)
-        # Build context embedding layer
-        self.c_embedder = TextProjection(
-            in_channels, hidden_size, act_layer, **factory_kwargs
-        )
-
-        self.individual_token_refiner = IndividualTokenRefiner(
-            hidden_size=hidden_size,
-            heads_num=heads_num,
-            depth=depth,
-            mlp_width_ratio=mlp_width_ratio,
-            mlp_drop_rate=mlp_drop_rate,
-            act_type=act_type,
-            qk_norm=qk_norm,
-            qk_norm_type=qk_norm_type,
-            qkv_bias=qkv_bias,
-            **factory_kwargs,
-        )
-
-    def forward(
-        self,
-        x: torch.Tensor,
-        t: torch.LongTensor,
-        mask: Optional[torch.LongTensor] = None,
-    ):
-        timestep_aware_representations = self.t_embedder(t)
-
-        if mask is None:
-            context_aware_representations = x.mean(dim=1)
-        else:
-            mask_float = mask.float().unsqueeze(-1)  # [b, s1, 1]
-            context_aware_representations = (x * mask_float).sum(
-                dim=1
-            ) / mask_float.sum(dim=1)
-        context_aware_representations = self.c_embedder(context_aware_representations)
-        c = timestep_aware_representations + context_aware_representations
-
-        x = self.input_embedder(x)
-
-        x = self.individual_token_refiner(x, c, mask)
-
-        return x
diff --git a/videotuna/models/hunyuan/hyvideo_i2v/text_encoder/__init__.py b/videotuna/models/hunyuan/hyvideo_i2v/text_encoder/__init__.py
deleted file mode 100644
index 814b433f..00000000
--- a/videotuna/models/hunyuan/hyvideo_i2v/text_encoder/__init__.py
+++ /dev/null
@@ -1,611 +0,0 @@
-from dataclasses import dataclass
-from typing import Optional, Tuple
-from copy import deepcopy
-from omegaconf import DictConfig, OmegaConf
-import torch
-import torch.nn as nn
-from transformers import (
-    CLIPTextModel,
-    CLIPTokenizer,
-    AutoTokenizer,
-    AutoModel,
-    LlavaForConditionalGeneration,
-    CLIPImageProcessor,
-)
-from transformers.utils import ModelOutput
-
-from ..constants import TEXT_ENCODER_PATH, TOKENIZER_PATH
-from ..constants import PRECISION_TO_TYPE, PROMPT_TEMPLATE
-
-
-def use_default(value, default):
-    return value if value is not None else default
-
-
-def load_text_encoder(
-    text_encoder_type,
-    text_encoder_precision=None,
-    text_encoder_path=None,
-    logger=None,
-    device=None,
-):
-    if text_encoder_path is None:
-        text_encoder_path = TEXT_ENCODER_PATH[text_encoder_type]
-    if logger is not None:
-        logger.info(
-            f"Loading text encoder model ({text_encoder_type}) from: {text_encoder_path}"
-        )
-
-    if text_encoder_type == "clipL":
-        text_encoder = CLIPTextModel.from_pretrained(text_encoder_path)
-        text_encoder.final_layer_norm = text_encoder.text_model.final_layer_norm
-    elif text_encoder_type == "llm":
-        text_encoder = AutoModel.from_pretrained(
-            text_encoder_path, low_cpu_mem_usage=True
-        )
-        text_encoder.final_layer_norm = text_encoder.norm
-    elif text_encoder_type == "llm-i2v":
-        text_encoder = LlavaForConditionalGeneration.from_pretrained(
-            text_encoder_path, low_cpu_mem_usage=True
-        )
-    else:
-        raise ValueError(f"Unsupported text encoder type: {text_encoder_type}")
-    # from_pretrained will ensure that the model is in eval mode.
-
-    if text_encoder_precision is not None:
-        text_encoder = text_encoder.to(dtype=PRECISION_TO_TYPE[text_encoder_precision])
-
-    text_encoder.requires_grad_(False)
-
-    if logger is not None:
-        logger.info(f"Text encoder to dtype: {text_encoder.dtype}")
-
-    if device is not None:
-        text_encoder = text_encoder.to(device)
-
-    return text_encoder, text_encoder_path
-
-
-def load_tokenizer(
-    tokenizer_type, tokenizer_path=None, padding_side="right", logger=None
-):
-    if tokenizer_path is None:
-        tokenizer_path = TOKENIZER_PATH[tokenizer_type]
-    if logger is not None:
-        logger.info(f"Loading tokenizer ({tokenizer_type}) from: {tokenizer_path}")
-
-    processor = None
-    if tokenizer_type == "clipL":
-        tokenizer = CLIPTokenizer.from_pretrained(tokenizer_path, max_length=77)
-    elif tokenizer_type == "llm":
-        tokenizer = AutoTokenizer.from_pretrained(
-            tokenizer_path, padding_side=padding_side
-        )
-    elif tokenizer_type == "llm-i2v":
-        tokenizer = AutoTokenizer.from_pretrained(
-            tokenizer_path, padding_side=padding_side
-        )
-        processor = CLIPImageProcessor.from_pretrained(tokenizer_path)
-    else:
-        raise ValueError(f"Unsupported tokenizer type: {tokenizer_type}")
-
-    return tokenizer, tokenizer_path, processor
-
-
-@dataclass
-class TextEncoderModelOutput(ModelOutput):
-    """
-    Base class for model's outputs that also contains a pooling of the last hidden states.
-
-    Args:
-        hidden_state (`torch.FloatTensor` of shape `(batch_size, sequence_length, hidden_size)`):
-            Sequence of hidden-states at the output of the last layer of the model.
-        attention_mask (`torch.LongTensor` of shape `(batch_size, sequence_length)`, *optional*):
-            Mask to avoid performing attention on padding token indices. Mask values selected in ``[0, 1]``:
-        hidden_states_list (`tuple(torch.FloatTensor)`, *optional*, returned when `output_hidden_states=True` is passed):
-            Tuple of `torch.FloatTensor` (one for the output of the embeddings, if the model has an embedding layer, +
-            one for the output of each layer) of shape `(batch_size, sequence_length, hidden_size)`.
-            Hidden-states of the model at the output of each layer plus the optional initial embedding outputs.
-        text_outputs (`list`, *optional*, returned when `return_texts=True` is passed):
-            List of decoded texts.
-    """
-
-    hidden_state: torch.FloatTensor = None
-    attention_mask: Optional[torch.LongTensor] = None
-    hidden_states_list: Optional[Tuple[torch.FloatTensor, ...]] = None
-    text_outputs: Optional[list] = None
-
-
-class TextEncoder(nn.Module):
-    def __init__(
-        self,
-        text_encoder_type: str,
-        max_length: int,
-        text_encoder_precision: Optional[str] = None,
-        text_encoder_path: Optional[str] = None,
-        tokenizer_type: Optional[str] = None,
-        tokenizer_path: Optional[str] = None,
-        output_key: Optional[str] = None,
-        use_attention_mask: bool = True,
-        i2v_mode: bool = False,
-        input_max_length: Optional[int] = None,
-        prompt_template: Optional[dict] = None,
-        prompt_template_video: Optional[dict] = None,
-        hidden_state_skip_layer: Optional[int] = None,
-        apply_final_norm: bool = False,
-        reproduce: bool = False,
-        logger=None,
-        device=None,
-        image_embed_interleave=None,
-    ):
-        super().__init__()
-        self.text_encoder_type = text_encoder_type
-        self.max_length = max_length
-        self.precision = text_encoder_precision
-        self.model_path = text_encoder_path
-        self.tokenizer_type = (
-            tokenizer_type if tokenizer_type is not None else text_encoder_type
-        )
-        self.tokenizer_path = (
-            tokenizer_path if tokenizer_path is not None else text_encoder_path
-        )
-        self.use_attention_mask = use_attention_mask
-        if prompt_template_video is not None:
-            assert (
-                use_attention_mask is True
-            ), "Attention mask is True required when training videos."
-        self.input_max_length = (
-            input_max_length if input_max_length is not None else max_length
-        )
-        self.prompt_template = prompt_template
-        self.prompt_template_video = prompt_template_video
-        self.hidden_state_skip_layer = hidden_state_skip_layer
-        self.apply_final_norm = apply_final_norm
-        self.i2v_mode = i2v_mode
-        self.reproduce = reproduce
-        self.logger = logger
-        self.image_embed_interleave = image_embed_interleave
-
-        self.use_template = self.prompt_template is not None
-        if self.use_template:
-            assert (
-                (isinstance(self.prompt_template, dict) or isinstance(self.prompt_template, DictConfig))
-                and "template" in self.prompt_template
-            ), f"`prompt_template` must be a dictionary with a key 'template', got {self.prompt_template}"
-            assert "{}" in str(self.prompt_template["template"]), (
-                "`prompt_template['template']` must contain a placeholder `{}` for the input text, "
-                f"got {self.prompt_template['template']}"
-            )
-
-        self.use_video_template = self.prompt_template_video is not None
-        if self.use_video_template:
-            if self.prompt_template_video is not None:
-                assert (
-                    (isinstance(self.prompt_template_video, dict) or isinstance(self.prompt_template, DictConfig))
-                    and "template" in self.prompt_template_video
-                ), f"`prompt_template_video` must be a dictionary with a key 'template', got {self.prompt_template_video}"
-            assert "{}" in str(self.prompt_template_video["template"]), (
-                "`prompt_template_video['template']` must contain a placeholder `{}` for the input text, "
-                f"got {self.prompt_template_video['template']}"
-            )
-
-        if "t5" in text_encoder_type:
-            self.output_key = output_key or "last_hidden_state"
-        elif "clip" in text_encoder_type:
-            self.output_key = output_key or "pooler_output"
-        elif "llm" in text_encoder_type or "glm" in text_encoder_type:
-            self.output_key = output_key or "last_hidden_state"
-        else:
-            raise ValueError(f"Unsupported text encoder type: {text_encoder_type}")
-
-        self.model, self.model_path = load_text_encoder(
-            text_encoder_type=self.text_encoder_type,
-            text_encoder_precision=self.precision,
-            text_encoder_path=self.model_path,
-            logger=self.logger,
-            device=device,
-        )
-        self.dtype = self.model.dtype
-        self.device = self.model.device
-
-        self.tokenizer, self.tokenizer_path, self.processor = load_tokenizer(
-            tokenizer_type=self.tokenizer_type,
-            tokenizer_path=self.tokenizer_path,
-            padding_side="right",
-            logger=self.logger,
-        )
-
-    def __repr__(self):
-        return f"{self.text_encoder_type} ({self.precision} - {self.model_path})"
-
-    @staticmethod
-    def apply_text_to_template(text, template, prevent_empty_text=True):
-        """
-        Apply text to template.
-
-        Args:
-            text (str): Input text.
-            template (str or list): Template string or list of chat conversation.
-            prevent_empty_text (bool): If Ture, we will prevent the user text from being empty
-                by adding a space. Defaults to True.
-        """
-        if isinstance(template, str):
-            # Will send string to tokenizer. Used for llm
-            return template.format(text)
-        else:
-            raise TypeError(f"Unsupported template type: {type(template)}")
-
-    def text2tokens(self, text, data_type="image"):
-        """
-        Tokenize the input text.
-
-        Args:
-            text (str or list): Input text.
-        """
-        tokenize_input_type = "str"
-        if self.use_template:
-            if data_type == "image":
-                prompt_template = self.prompt_template["template"]
-            elif data_type == "video":
-                prompt_template = self.prompt_template_video["template"]
-            else:
-                raise ValueError(f"Unsupported data type: {data_type}")
-            if isinstance(text, (list, tuple)):
-                text = [
-                    self.apply_text_to_template(one_text, prompt_template)
-                    for one_text in text
-                ]
-                if isinstance(text[0], list):
-                    tokenize_input_type = "list"
-            elif isinstance(text, str):
-                text = self.apply_text_to_template(text, prompt_template)
-                if isinstance(text, list):
-                    tokenize_input_type = "list"
-            else:
-                raise TypeError(f"Unsupported text type: {type(text)}")
-
-        kwargs = dict(
-            truncation=True,
-            max_length=self.max_length,
-            padding="max_length",
-            return_tensors="pt",
-        )
-        if tokenize_input_type == "str":
-            return self.tokenizer(
-                text,
-                return_length=False,
-                return_overflowing_tokens=False,
-                return_attention_mask=True,
-                **kwargs,
-            )
-        elif tokenize_input_type == "list":
-            return self.tokenizer.apply_chat_template(
-                text,
-                add_generation_prompt=True,
-                tokenize=True,
-                return_dict=True,
-                **kwargs,
-            )
-        else:
-            raise ValueError(f"Unsupported tokenize_input_type: {tokenize_input_type}")
-
-    def encode(
-        self,
-        batch_encoding,
-        use_attention_mask=None,
-        output_hidden_states=False,
-        do_sample=None,
-        hidden_state_skip_layer=None,
-        return_texts=False,
-        data_type="image",
-        semantic_images=None,
-        device=None,
-    ):
-        """
-        Args:
-            batch_encoding (dict): Batch encoding from tokenizer.
-            use_attention_mask (bool): Whether to use attention mask. If None, use self.use_attention_mask.
-                Defaults to None.
-            output_hidden_states (bool): Whether to output hidden states. If False, return the value of
-                self.output_key. If True, return the entire output. If set self.hidden_state_skip_layer,
-                output_hidden_states will be set True. Defaults to False.
-            do_sample (bool): Whether to sample from the model. Used for Decoder-Only LLMs. Defaults to None.
-                When self.produce is False, do_sample is set to True by default.
-            hidden_state_skip_layer (int): Number of hidden states to hidden_state_skip_layer. 0 means the last layer.
-                If None, self.output_key will be used. Defaults to None.
-            hidden_state_skip_layer (PIL.Image): The reference images for i2v models.
-            return_texts (bool): Whether to return the decoded texts. Defaults to False.
-        """
-        device = self.model.device if device is None else device
-        use_attention_mask = use_default(use_attention_mask, self.use_attention_mask)
-        hidden_state_skip_layer = use_default(
-            hidden_state_skip_layer, self.hidden_state_skip_layer
-        )
-        do_sample = use_default(do_sample, not self.reproduce)
-        if not self.i2v_mode:
-            attention_mask = (
-                batch_encoding["attention_mask"].to(device)
-                if use_attention_mask
-                else None
-            )
-            outputs = self.model(
-                input_ids=batch_encoding["input_ids"].to(device),
-                attention_mask=attention_mask,
-                output_hidden_states=output_hidden_states
-                or hidden_state_skip_layer is not None,
-            )
-            if hidden_state_skip_layer is not None:
-                last_hidden_state = outputs.hidden_states[
-                    -(hidden_state_skip_layer + 1)
-                ]
-                # Real last hidden state already has layer norm applied. So here we only apply it
-                # for intermediate layers.
-                if hidden_state_skip_layer > 0 and self.apply_final_norm:
-                    last_hidden_state = self.model.final_layer_norm(last_hidden_state)
-            else:
-                last_hidden_state = outputs[self.output_key]
-
-            # Remove hidden states of instruction tokens, only keep prompt tokens.
-            if self.use_template:
-                if data_type == "image":
-                    crop_start = self.prompt_template.get("crop_start", -1)
-                elif data_type == "video":
-                    crop_start = self.prompt_template_video.get("crop_start", -1)
-                else:
-                    raise ValueError(f"Unsupported data type: {data_type}")
-                if crop_start > 0:
-                    last_hidden_state = last_hidden_state[:, crop_start:]
-                    attention_mask = (
-                        attention_mask[:, crop_start:] if use_attention_mask else None
-                    )
-
-            if output_hidden_states:
-                return TextEncoderModelOutput(
-                    last_hidden_state, attention_mask, outputs.hidden_states
-                )
-            return TextEncoderModelOutput(last_hidden_state, attention_mask)
-        else:
-            image_outputs = self.processor(semantic_images, return_tensors="pt")[
-                "pixel_values"
-            ].to(device)
-            attention_mask = (
-                batch_encoding["attention_mask"].to(device)
-                if use_attention_mask
-                else None
-            )
-            outputs = self.model(
-                input_ids=batch_encoding["input_ids"].to(device),
-                attention_mask=attention_mask,
-                output_hidden_states=output_hidden_states
-                or hidden_state_skip_layer is not None,
-                pixel_values=image_outputs,
-            )
-            if hidden_state_skip_layer is not None:
-                last_hidden_state = outputs.hidden_states[
-                    -(hidden_state_skip_layer + 1)
-                ]
-                # Real last hidden state already has layer norm applied. So here we only apply it
-                # for intermediate layers.
-                if hidden_state_skip_layer > 0 and self.apply_final_norm:
-                    last_hidden_state = self.model.final_layer_norm(last_hidden_state)
-            else:
-                last_hidden_state = outputs[self.output_key]
-            if self.use_template:
-                if data_type == "video":
-                    crop_start = self.prompt_template_video.get("crop_start", -1)
-                    text_crop_start = (
-                        crop_start
-                        - 1
-                        + self.prompt_template_video.get("image_emb_len", 576)
-                    )
-                    image_crop_start = self.prompt_template_video.get(
-                        "image_emb_start", 5
-                    )
-                    image_crop_end = self.prompt_template_video.get(
-                        "image_emb_end", 581
-                    )
-                    batch_indices, last_double_return_token_indices = torch.where(
-                        batch_encoding["input_ids"]
-                        == self.prompt_template_video.get("double_return_token_id", 271)
-                    )
-                    if last_double_return_token_indices.shape[0] == 3:
-                        # in case the prompt is too long
-                        last_double_return_token_indices = torch.cat(
-                            (
-                                last_double_return_token_indices,
-                                torch.tensor([batch_encoding["input_ids"].shape[-1]]).to(
-                                    device=last_double_return_token_indices.device),
-                            )
-                        )
-                    last_double_return_token_indices = (
-                        last_double_return_token_indices.reshape(
-                            batch_encoding["input_ids"].shape[0], -1
-                        )[:, -1]
-                    )
-                    assistant_crop_start = (
-                        last_double_return_token_indices
-                        - 1
-                        + self.prompt_template_video.get("image_emb_len", 576)
-                        - 4
-                    )
-                    assistant_crop_end = (
-                        last_double_return_token_indices
-                        - 1
-                        + self.prompt_template_video.get("image_emb_len", 576)
-                    )
-                    attention_mask_assistant_crop_start = (
-                        last_double_return_token_indices - 4
-                    )
-                    attention_mask_assistant_crop_end = last_double_return_token_indices
-                else:
-                    raise ValueError(f"Unsupported data type: {data_type}")
-
-                text_last_hidden_state = []
-                text_attention_mask = []
-                image_last_hidden_state = []
-                image_attention_mask = []
-                for i in range(batch_encoding["input_ids"].shape[0]):
-                    text_last_hidden_state.append(
-                        torch.cat(
-                            [
-                                last_hidden_state[
-                                    i, text_crop_start : assistant_crop_start[i].item()
-                                ],
-                                last_hidden_state[i, assistant_crop_end[i].item() :],
-                            ]
-                        )
-                    )
-                    text_attention_mask.append(
-                        torch.cat(
-                            [
-                                attention_mask[
-                                    i,
-                                    crop_start : attention_mask_assistant_crop_start[
-                                        i
-                                    ].item(),
-                                ],
-                                attention_mask[
-                                    i, attention_mask_assistant_crop_end[i].item() :
-                                ],
-                            ]
-                        )
-                        if use_attention_mask
-                        else None
-                    )
-                    image_last_hidden_state.append(
-                        last_hidden_state[i, image_crop_start:image_crop_end]
-                    )
-                    image_attention_mask.append(
-                        torch.ones(image_last_hidden_state[-1].shape[0])
-                        .to(last_hidden_state.device)
-                        .to(attention_mask.dtype)
-                        if use_attention_mask
-                        else None
-                    )
-
-                text_last_hidden_state = torch.stack(text_last_hidden_state)
-                text_attention_mask = torch.stack(text_attention_mask)
-                image_last_hidden_state = torch.stack(image_last_hidden_state)
-                image_attention_mask = torch.stack(image_attention_mask)
-
-                if semantic_images is not None and 0 < self.image_embed_interleave < 6:
-                    image_last_hidden_state = image_last_hidden_state[
-                        :, ::self.image_embed_interleave, :
-                    ]
-                    image_attention_mask = image_attention_mask[
-                        :, ::self.image_embed_interleave
-                    ]
-
-                assert (
-                    text_last_hidden_state.shape[0] == text_attention_mask.shape[0]
-                    and image_last_hidden_state.shape[0]
-                    == image_attention_mask.shape[0]
-                )
-
-                last_hidden_state = torch.cat(
-                    [image_last_hidden_state, text_last_hidden_state], dim=1
-                )
-                attention_mask = torch.cat(
-                    [image_attention_mask, text_attention_mask], dim=1
-                )
-            if output_hidden_states:
-                return TextEncoderModelOutput(
-                    last_hidden_state,
-                    attention_mask,
-                    hidden_states_list=outputs.hidden_states,
-                )
-            return TextEncoderModelOutput(last_hidden_state, attention_mask)
-
-    def forward(
-        self,
-        text,
-        use_attention_mask=None,
-        output_hidden_states=False,
-        do_sample=False,
-        hidden_state_skip_layer=None,
-        return_texts=False,
-    ):
-        batch_encoding = self.text2tokens(text)
-        return self.encode(
-            batch_encoding,
-            use_attention_mask=use_attention_mask,
-            output_hidden_states=output_hidden_states,
-            do_sample=do_sample,
-            hidden_state_skip_layer=hidden_state_skip_layer,
-            return_texts=return_texts,
-        )
-
-
-class TextEncoderWrapper(nn.Module):
-    def __init__(self, 
-                i2v_mode: bool = True,
-                i2v_condition_type: str = 'token_replace',
-                text_encoder: str = "llm-i2v",
-                text_encoder_precision: str = "fp16",
-                text_states_dim: int = 4096,
-                text_len: int = 256,
-                tokenizer: str = "llm-i2v",
-                prompt_template: str = "dit-llm-encode-i2v",
-                prompt_template_video: str = "dit-llm-encode-video-i2v",
-                hidden_state_skip_layer: int = 2,
-                apply_final_norm: bool = False,
-                reproduce: bool = False,
-                device: str = 'cuda',
-                use_cpu_offload: bool = True,
-                *args, 
-                 **kwargs):
-        super().__init__(*args, **kwargs)
-        self.i2v_mode = i2v_mode
-        self.text_encoder = text_encoder
-        self.text_encoder_precision = text_encoder_precision
-        self.text_states_dim = text_states_dim
-        self.text_len = text_len
-        self.tokenizer = tokenizer
-        self.prompt_template = prompt_template
-        self.prompt_template_video = prompt_template_video
-        self.hidden_state_skip_layer = hidden_state_skip_layer
-        self.apply_final_norm = apply_final_norm
-        self.reproduce = reproduce
-        self.i2v_condition_type = i2v_condition_type
-        self.use_cpu_offload = use_cpu_offload
-        
-        # Text encoder
-        if self.i2v_mode:
-            self.text_encoder = "llm-i2v"
-            self.tokenizer = "llm-i2v"
-            self.prompt_template = "dit-llm-encode-i2v"
-            self.prompt_template_video = "dit-llm-encode-video-i2v"
-
-        if self.prompt_template_video is not None:
-            crop_start = PROMPT_TEMPLATE[self.prompt_template_video].get("crop_start", 0)
-        elif self.prompt_template is not None:
-            crop_start = PROMPT_TEMPLATE[self.prompt_template].get("crop_start", 0)
-        else:
-            crop_start = 0
-        max_length = self.text_len + crop_start
-
-        prompt_template = PROMPT_TEMPLATE[self.prompt_template] if self.prompt_template is not None else None
-        prompt_template_video = PROMPT_TEMPLATE[self.prompt_template_video] if self.prompt_template_video is not None else None
-
-        if self.i2v_mode and self.i2v_condition_type == "latent_concat":
-            image_embed_interleave = 2
-        elif self.i2v_mode and self.i2v_condition_type == "token_replace":
-            image_embed_interleave = 4
-        else:
-            image_embed_interleave = 1
-        
-        self.text_encoder = TextEncoder(
-            text_encoder_type=self.text_encoder,
-            max_length=max_length,
-            text_encoder_precision=self.text_encoder_precision,
-            tokenizer_type=self.tokenizer,
-            i2v_mode=self.i2v_mode,
-            prompt_template=prompt_template,
-            prompt_template_video=prompt_template_video,
-            hidden_state_skip_layer=self.hidden_state_skip_layer,
-            apply_final_norm=self.apply_final_norm,
-            reproduce=self.reproduce,
-            logger=None,
-            device=device if not use_cpu_offload else "cpu",
-            image_embed_interleave=image_embed_interleave
-        )
\ No newline at end of file
diff --git a/videotuna/models/hunyuan/hyvideo_i2v/utils/__init__.py b/videotuna/models/hunyuan/hyvideo_i2v/utils/__init__.py
deleted file mode 100644
index e69de29b..00000000
diff --git a/videotuna/models/hunyuan/hyvideo_i2v/utils/data_utils.py b/videotuna/models/hunyuan/hyvideo_i2v/utils/data_utils.py
deleted file mode 100644
index b7d4b356..00000000
--- a/videotuna/models/hunyuan/hyvideo_i2v/utils/data_utils.py
+++ /dev/null
@@ -1,99 +0,0 @@
-import numpy as np
-import math
-from PIL import Image
-import torch
-import copy
-import string
-import random
-
-
-def align_to(value, alignment):
-    """align hight, width according to alignment
-
-    Args:
-        value (int): height or width
-        alignment (int): target alignment factor
-
-    Returns:
-        int: the aligned value
-    """
-    return int(math.ceil(value / alignment) * alignment)
-
-
-def black_image(width, height):
-    """generate a black image
-
-    Args:
-        width (int): image width
-        height (int): image height
-
-    Returns:
-        _type_: a black image
-    """
-    black_image = Image.new("RGB", (width, height), (0, 0, 0))
-    return black_image
-
-
-def get_closest_ratio(height: float, width: float, ratios: list, buckets: list):
-    """get the closest ratio in the buckets
-
-    Args:
-        height (float): video height
-        width (float): video width
-        ratios (list): video aspect ratio
-        buckets (list): buckets generate by `generate_crop_size_list`
-
-    Returns:
-        the closest ratio in the buckets and the corresponding ratio
-    """
-    aspect_ratio = float(height) / float(width)
-    diff_ratios = ratios - aspect_ratio
-
-    if aspect_ratio >= 1:
-        indices = [(index, x) for index, x in enumerate(diff_ratios) if x <= 0]
-    else:
-        indices = [(index, x) for index, x in enumerate(diff_ratios) if x > 0]
-
-    closest_ratio_id = min(indices, key=lambda pair: abs(pair[1]))[0]
-    closest_size = buckets[closest_ratio_id]
-    closest_ratio = ratios[closest_ratio_id]
-
-    return closest_size, closest_ratio
-
-
-def generate_crop_size_list(base_size=256, patch_size=32, max_ratio=4.0):
-    """generate crop size list
-
-    Args:
-        base_size (int, optional): the base size for generate bucket. Defaults to 256.
-        patch_size (int, optional): the stride to generate bucket. Defaults to 32.
-        max_ratio (float, optional): th max ratio for h or w based on base_size . Defaults to 4.0.
-
-    Returns:
-        list: generate crop size list
-    """
-    num_patches = round((base_size / patch_size) ** 2)
-    assert max_ratio >= 1.0
-    crop_size_list = []
-    wp, hp = num_patches, 1
-    while wp > 0:
-        if max(wp, hp) / min(wp, hp) <= max_ratio:
-            crop_size_list.append((wp * patch_size, hp * patch_size))
-        if (hp + 1) * wp <= num_patches:
-            hp += 1
-        else:
-            wp -= 1
-    return crop_size_list
-
-
-def align_floor_to(value, alignment):
-    """align hight, width according to alignment
-
-    Args:
-        value (int): height or width
-        alignment (int): target alignment factor
-
-    Returns:
-        int: the aligned value
-    """
-    return int(math.floor(value / alignment) * alignment)
diff --git a/videotuna/models/hunyuan/hyvideo_i2v/utils/file_utils.py b/videotuna/models/hunyuan/hyvideo_i2v/utils/file_utils.py
deleted file mode 100644
index 45800c59..00000000
--- a/videotuna/models/hunyuan/hyvideo_i2v/utils/file_utils.py
+++ /dev/null
@@ -1,193 +0,0 @@
-import logging
-import os
-from pathlib import Path
-import json
-import tarfile
-from collections import defaultdict
-from einops import rearrange
-from typing import List
-import torch
-import torchvision
-import numpy as np
-import imageio
-import PIL.Image
-from PIL import Image
-
-CODE_SUFFIXES = {
-    ".py",  # Python codes
-    ".sh",  # Shell scripts
-    ".yaml",
-    ".yml",  # Configuration files
-}
-
-
-def build_pretraining_data_loader():
-    pass
-
-
-def logger_filter(name):
-    def filter_(record):
-        return record["extra"].get("name") == name
-
-    return filter_
-
-
-def resolve_resume_path(resume, results_dir):
-    # Detect the resume path. Support both the experiment index and the full path.
-    if resume.isnumeric():
-        tmp_dirs = list(Path(results_dir).glob("*"))
-        id2exp_dir = defaultdict(list)
-        for tmp_dir in tmp_dirs:
-            part0 = tmp_dir.name.split("_")[0]
-            if part0.isnumeric():
-                id2exp_dir[int(part0)].append(tmp_dir)
-        resume_id = int(resume)
-        valid_exp_dir = id2exp_dir.get(resume_id)
-        if len(valid_exp_dir) == 0:
-            raise ValueError(
-                f"No valid experiment directories found in {results_dir} with the experiment "
-                f"index {resume}."
-            )
-        elif len(valid_exp_dir) > 1:
-            raise ValueError(
-                f"Multiple valid experiment directories found in {results_dir} with the experiment "
-                f"index {resume}: {valid_exp_dir}."
-            )
-        resume_path = valid_exp_dir[0] / "checkpoints"
-    else:
-        resume_path = Path(resume)
-
-    if not resume_path.exists():
-        raise FileNotFoundError(f"Resume path {resume_path} not found.")
-
-    return resume_path
-
-
-def dump_codes(save_path, root, sub_dirs=None, valid_suffixes=None, save_prefix="./"):
-    """
-    Dump codes to the experiment directory.
-
-    Args:
-        save_path (str): Path to the experiment directory.
-        root (Path): Path to the root directory of the codes.
-        sub_dirs (list): List of subdirectories to be dumped. If None, all files in the root directory will
-            be dumped. (default: None)
-        valid_suffixes (tuple, optional): Valid suffixes of the files to be dumped. If None, CODE_SUFFIXES will be used.
-            (default: None)
-        save_prefix (str, optional): Prefix to be added to the files in the tarball. (default: './')
-    """
-    if valid_suffixes is None:
-        valid_suffixes = CODE_SUFFIXES
-
-    # Force to use tar.gz suffix
-    save_path = safe_file(save_path)
-    assert save_path.name.endswith(
-        ".tar.gz"
-    ), f"save_path should end with .tar.gz, got {save_path.name}."
-    # Make root absolute
-    root = Path(root).absolute()
-    # Make a tarball of the codes
-    with tarfile.open(save_path, "w:gz") as tar:
-        # Recursively add all files in the root directory
-        if sub_dirs is None:
-            sub_dirs = list(root.iterdir())
-        for sub_dir in sub_dirs:
-            for file in Path(sub_dir).rglob("*"):
-                if file.is_file() and file.suffix in valid_suffixes:
-                    # make file absolute
-                    file = file.absolute()
-                    arcname = Path(save_prefix) / file.relative_to(root)
-                    tar.add(file, arcname=arcname)
-    return root
-
-
-def dump_args(args, save_path, extra_args=None):
-    args_dict = vars(args)
-    if extra_args:
-        assert isinstance(
-            extra_args, dict
-        ), f"extra_args should be a dictionary, got {type(extra_args)}."
-        args_dict.update(extra_args)
-    # Save to file
-    with safe_file(save_path).open("w") as f:
-        json.dump(args_dict, f, indent=4, sort_keys=True, ensure_ascii=False)
-
-
-def empty_logger():
-    logger = logging.getLogger("hymm_empty_logger")
-    logger.addHandler(logging.NullHandler())
-    logger.setLevel(logging.CRITICAL)
-    return logger
-
-
-def is_valid_experiment(path):
-    path = Path(path)
-    if path.is_dir() and path.name.split("_")[0].isdigit():
-        return True
-    return False
-
-
-def get_experiment_max_number(experiments):
-    valid_experiment_numbers = []
-    for exp in experiments:
-        if is_valid_experiment(exp):
-            valid_experiment_numbers.append(int(Path(exp).name.split("_")[0]))
-    if valid_experiment_numbers:
-        return max(valid_experiment_numbers)
-    return 0
-
-
-def safe_dir(path):
-    """
-    Create a directory (or the parent directory of a file) if it does not exist.
-
-    Args:
-        path (str or Path): Path to the directory.
-
-    Returns:
-        path (Path): Path object of the directory.
-    """
-    path = Path(path)
-    path.mkdir(exist_ok=True, parents=True)
-    return path
-
-
-def safe_file(path):
-    """
-    Create the parent directory of a file if it does not exist.
-
-    Args:
-        path (str or Path): Path to the file.
-
-    Returns:
-        path (Path): Path object of the file.
-    """
-    path = Path(path)
-    path.parent.mkdir(exist_ok=True, parents=True)
-    return path
-
-
-def save_videos_grid(videos: torch.Tensor, path: str, rescale=False, n_rows=1, fps=24):
-    """save videos by video tensor
-       copy from https://github.com/guoyww/AnimateDiff/blob/e92bd5671ba62c0d774a32951453e328018b7c5b/animatediff/utils/util.py#L61
-
-    Args:
-        videos (torch.Tensor): video tensor predicted by the model
-        path (str): path to save video
-        rescale (bool, optional): rescale the video tensor from [-1, 1] to  . Defaults to False.
-        n_rows (int, optional): Defaults to 1.
-        fps (int, optional): video save fps. Defaults to 8.
-    """
-    videos = rearrange(videos, "b c t h w -> t b c h w")
-    outputs = []
-    for x in videos:
-        x = torchvision.utils.make_grid(x, nrow=n_rows)
-        x = x.transpose(0, 1).transpose(1, 2).squeeze(-1)
-        if rescale:
-            x = (x + 1.0) / 2.0  # -1,1 -> 0,1
-        x = torch.clamp(x, 0, 1)
-        x = (x * 255).numpy().astype(np.uint8)
-        outputs.append(x)
-
-    os.makedirs(os.path.dirname(path), exist_ok=True)
-    imageio.mimsave(path, outputs, fps=fps)
diff --git a/videotuna/models/hunyuan/hyvideo_i2v/utils/helpers.py b/videotuna/models/hunyuan/hyvideo_i2v/utils/helpers.py
deleted file mode 100644
index f2d582c0..00000000
--- a/videotuna/models/hunyuan/hyvideo_i2v/utils/helpers.py
+++ /dev/null
@@ -1,122 +0,0 @@
-import collections.abc
-
-from itertools import repeat
-
-import contextlib
-import os
-import random
-
-import numpy as np
-import torch
-import deepspeed
-import torch.distributed as dist
-from torch.utils.tensorboard import SummaryWriter
-
-
-def all_gather_sum(running_value, device):
-    value = torch.tensor(running_value, device=device)
-    dist.all_reduce(value, op=dist.ReduceOp.SUM)
-    return value.item()
-
-
-class EventsMonitor(object):
-    def __init__(self, events_root, rank):
-        self.rank = rank
-        if rank == 0:
-            self.writer = SummaryWriter(log_dir=events_root)
-        else:
-            self.writer = None
-
-    def write_events(self, events):
-        for event in events:
-            name, val, count = event
-            if self.rank == 0:
-                self.writer.add_scalar(name, val, global_step=count)
-
-
-def profiler_context(enable, exp_dir, worker_name):
-    if enable:
-        return torch.profiler.profile(
-            activities=[
-                torch.profiler.ProfilerActivity.CPU,
-                torch.profiler.ProfilerActivity.CUDA,
-            ],
-            schedule=torch.profiler.schedule(
-                skip_first=10,
-                wait=5,
-                warmup=1,
-                active=3,
-                repeat=2,
-            ),
-            profile_memory=True,
-            on_trace_ready=torch.profiler.tensorboard_trace_handler(
-                exp_dir, worker_name=worker_name
-            ),
-        )
-    else:
-        # return empty python context manager
-        return contextlib.nullcontext()
-
-
-def set_reproducibility(enable, global_seed=None):
-    if enable:
-        # Configure the seed for reproducibility
-        set_manual_seed(global_seed)
-    # Set following debug environment variable
-    # See the link for details: https://docs.nvidia.com/cuda/cublas/index.html#results-reproducibility
-    os.environ["CUBLAS_WORKSPACE_CONFIG"] = ":4096:8"
-    # Cudnn benchmarking
-    torch.backends.cudnn.benchmark = not enable
-    # Use deterministic algorithms in PyTorch
-    torch.use_deterministic_algorithms(enable)
-
-    # LSTM and RNN networks are not deterministic
-
-
-def set_manual_seed(global_seed):
-    # Seed the RNG for Python
-    random.seed(global_seed)
-    # Seed the RNG for Numpy
-    np.random.seed(global_seed)
-    # Seed the RNG for all devices (both CPU and CUDA)
-    torch.manual_seed(global_seed)
-    # Seed cuda
-    torch.cuda.manual_seed_all(global_seed)
-
-
-def _ntuple(n):
-    def parse(x):
-        if isinstance(x, collections.abc.Iterable) and not isinstance(x, str):
-            x = tuple(x)
-            if len(x) == 1:
-                x = tuple(repeat(x[0], n))
-            return x
-        return tuple(repeat(x, n))
-
-    return parse
-
-
-to_1tuple = _ntuple(1)
-to_2tuple = _ntuple(2)
-to_3tuple = _ntuple(3)
-to_4tuple = _ntuple(4)
-
-
-def as_tuple(x):
-    if isinstance(x, collections.abc.Iterable) and not isinstance(x, str):
-        return tuple(x)
-    if x is None or isinstance(x, (int, float, str)):
-        return (x,)
-    else:
-        raise ValueError(f"Unknown type {type(x)}")
-
-
-def as_list_of_2tuple(x):
-    x = as_tuple(x)
-    if len(x) == 1:
-        x = (x[0], x[0])
-    assert len(x) % 2 == 0, f"Expect even length, got {len(x)}."
-    lst = []
-    for i in range(0, len(x), 2):
-        lst.append((x[i], x[i + 1]))
-    return lst
diff --git a/videotuna/models/hunyuan/hyvideo_i2v/utils/lora_utils.py b/videotuna/models/hunyuan/hyvideo_i2v/utils/lora_utils.py
deleted file mode 100644
index 760b65b1..00000000
--- a/videotuna/models/hunyuan/hyvideo_i2v/utils/lora_utils.py
+++ /dev/null
@@ -1,87 +0,0 @@
-import torch
-from safetensors.torch import load_file
-
-
-# load kohya lora for diffusers pipeline
-def load_lora_for_pipeline(
-    pipeline,
-    lora_path,
-    LORA_PREFIX_TRANSFORMER="",
-    LORA_PREFIX_TEXT_ENCODER="",
-    alpha=1.0,
-    device=0,
-    is_parallel=False
-):
-    # load LoRA weight from .safetensors
-    if is_parallel:
-        state_dict = load_file(lora_path, device="cpu")
-        for k, v in state_dict.items():
-            state_dict[k] = v.to(device)
-    else:
-        state_dict = load_file(lora_path, device=device)
-
-    visited = []
-
-    # directly update weight in diffusers model
-    for key in state_dict:
-        # it is suggested to print out the key, it usually will be something like below
-        # "lora_te_text_model_encoder_layers_0_self_attn_k_proj.lora_down.weight"
-
-        # as we have set the alpha beforehand, so just skip
-        if ".alpha" in key or key in visited:
-            continue
-
-        if "text" in key:
-            layer_infos = (
-                key.split(".")[0].split(LORA_PREFIX_TEXT_ENCODER + "_")[-1].split("_")
-            )
-            curr_layer = pipeline.text_encoder
-        else:
-            layer_infos = (
-                key.split(".")[0].split(LORA_PREFIX_TRANSFORMER + "_")[-1].split("_")
-            )
-            curr_layer = pipeline.transformer
-
-        # find the target layer
-        temp_name = layer_infos.pop(0)
-        while len(layer_infos) > -1:
-            try:
-                curr_layer = curr_layer.__getattr__(temp_name)
-                if len(layer_infos) > 0:
-                    temp_name = layer_infos.pop(0)
-                elif len(layer_infos) == 0:
-                    break
-            except Exception:
-                if len(temp_name) > 0:
-                    temp_name += "_" + layer_infos.pop(0)
-                else:
-                    temp_name = layer_infos.pop(0)
-
-        pair_keys = []
-        if "lora_down" in key:
-            pair_keys.append(key.replace("lora_down", "lora_up"))
-            pair_keys.append(key)
-        else:
-            pair_keys.append(key)
-            pair_keys.append(key.replace("lora_up", "lora_down"))
-
-        # update weight
-        if len(state_dict[pair_keys[0]].shape) == 4:
-            weight_up = state_dict[pair_keys[0]].squeeze(3).squeeze(2).to(torch.float32)
-            weight_down = (
-                state_dict[pair_keys[1]].squeeze(3).squeeze(2).to(torch.float32)
-            )
-            curr_layer.weight.data += alpha * torch.mm(
-                weight_up, weight_down
-            ).unsqueeze(2).unsqueeze(3)
-        else:
-            weight_up = state_dict[pair_keys[0]].to(torch.float32)
-            weight_down = state_dict[pair_keys[1]].to(torch.float32)
-            curr_layer.weight.data += alpha * torch.mm(weight_up, weight_down)
-
-        # update visited list
-        for item in pair_keys:
-            visited.append(item)
-    del state_dict
-
-    return pipeline
diff --git a/videotuna/models/hunyuan/hyvideo_i2v/utils/preprocess_text_encoder_tokenizer_utils.py b/videotuna/models/hunyuan/hyvideo_i2v/utils/preprocess_text_encoder_tokenizer_utils.py
deleted file mode 100644
index a5306938..00000000
--- a/videotuna/models/hunyuan/hyvideo_i2v/utils/preprocess_text_encoder_tokenizer_utils.py
+++ /dev/null
@@ -1,43 +0,0 @@
-import argparse
-import torch
-from transformers import (
-    AutoProcessor,
-    LlavaForConditionalGeneration,
-)
-
-
-def preprocess_text_encoder_tokenizer(args):
-
-    processor = AutoProcessor.from_pretrained(args.input_dir)
-    model = LlavaForConditionalGeneration.from_pretrained(
-        args.input_dir,
-        torch_dtype=torch.float16,
-        low_cpu_mem_usage=True,
-    ).to(0)
-
-    model.language_model.save_pretrained(f"{args.output_dir}")
-    processor.tokenizer.save_pretrained(f"{args.output_dir}")
-
-
-if __name__ == "__main__":
-
-    parser = argparse.ArgumentParser()
-    parser.add_argument(
-        "--input_dir",
-        type=str,
-        required=True,
-        help="The path to the llava-llama-3-8b-v1_1-transformers.",
-    )
-    parser.add_argument(
-        "--output_dir",
-        type=str,
-        default="",
-        help="The output path of the llava-llama-3-8b-text-encoder-tokenizer."
-        "if '', the parent dir of output will be the same as input dir.",
-    )
-    args = parser.parse_args()
-
-    if len(args.output_dir) == 0:
-        args.output_dir = "/".join(args.input_dir.split("/")[:-1])
-
-    preprocess_text_encoder_tokenizer(args)
diff --git a/videotuna/models/hunyuan/hyvideo_i2v/utils/train_utils.py b/videotuna/models/hunyuan/hyvideo_i2v/utils/train_utils.py
deleted file mode 100644
index 8484ff91..00000000
--- a/videotuna/models/hunyuan/hyvideo_i2v/utils/train_utils.py
+++ /dev/null
@@ -1,455 +0,0 @@
-import random
-import torchvision.transforms as transforms
-
-import numpy as np
-import torch
-
-import imageio
-import os
-import PIL.Image
-from typing import Union, Optional, List
-from peft import get_peft_model_state_dict
-
-from hyvideo.modules.posemb_layers import get_nd_rotary_pos_embed
-from hyvideo.vae import AutoencoderKLCausal3D
-
-from pathlib import Path
-from einops import rearrange
-from PIL import Image
-
-from hyvideo.constants import PRECISION_TO_TYPE
-from safetensors.torch import load_file
-
-
-def convert_kohya_to_peft_keys(
-    kohya_dict: dict,
-    kohya_prefix="",
-    peft_prefix: str = "base_model.model",
-    device="cpu",
-) -> dict:
-    peft_dict = {}
-    for k, v in kohya_dict.items():
-        if ".alpha" in k:
-            continue
-        new_key = k.replace(f"{kohya_prefix}_lora_", f"{peft_prefix}.")
-        new_key = new_key.replace("single_blocks_", "single_blocks.")
-        new_key = new_key.replace("double_blocks_", "double_blocks.")
-        new_key = new_key.replace("_img_attn_proj", ".img_attn_proj")
-        new_key = new_key.replace("_img_attn_qkv", ".img_attn_qkv")
-        new_key = new_key.replace("_img_mlp_fc", ".img_mlp.fc")
-        new_key = new_key.replace("_txt_mlp_fc", ".txt_mlp.fc")
-        new_key = new_key.replace("_img_mod", ".img_mod")
-        new_key = new_key.replace("_txt", ".txt")
-        new_key = new_key.replace("_modulation", ".modulation")
-        new_key = new_key.replace("_linear", ".linear")
-        new_key = new_key.replace("lora_down", "lora_A.default")
-        new_key = new_key.replace("lora_up", "lora_B.default")
-        new_key = new_key.replace(
-            "_individual_token_refiner_blocks_", ".individual_token_refiner.blocks."
-        )
-        new_key = new_key.replace("_mlp_fc", ".mlp.fc")
-
-        peft_dict[new_key] = v.to(device)
-    return peft_dict
-
-
-def load_lora(model, lora_path, device):
-    kohya_weights = load_file(lora_path)
-    peft_weights = convert_kohya_to_peft_keys(
-        kohya_weights, kohya_prefix="Hunyuan_video_I2V", device=device
-    )
-    model.load_state_dict(peft_weights, strict=False)
-    return model
-
-
-def black_image(width, height):
-    black_image = Image.new("RGB", (width, height), (0, 0, 0))
-    return black_image
-
-
-def numpy_to_pil(images: np.ndarray) -> List[PIL.Image.Image]:
-    if images.ndim == 3:
-        images = images[None, ...]
-    images = (images * 255).round().astype("uint8")
-    if images.shape[-1] == 1:
-        # special case for grayscale (single channel) images
-        pil_images = [Image.fromarray(image.squeeze(), mode="L") for image in images]
-    else:
-        pil_images = [Image.fromarray(image) for image in images]
-
-    return pil_images
-
-
-def get_cond_latents(args, latents, vae):
-    """get conditioned latent by decode and encode the first frame latents"""
-    first_image_latents = latents[:, :, 0, ...] if len(latents.shape) == 5 else latents
-    first_image_latents = 1 / vae.config.scaling_factor * first_image_latents
-    first_images = vae.decode(
-        first_image_latents.unsqueeze(2).to(vae.dtype), return_dict=False
-    )[0]
-    first_images = first_images.squeeze(2)
-    first_images = (first_images / 2 + 0.5).clamp(0, 1)
-    first_images = first_images.cpu().permute(0, 2, 3, 1).float().numpy()
-    first_images = numpy_to_pil(first_images)
-
-    image_transform = transforms.Compose(
-        [transforms.ToTensor(), transforms.Normalize([0.5], [0.5])]
-    )
-    first_images_pixel_values = [image_transform(image) for image in first_images]
-    first_images_pixel_values = (
-        torch.cat(first_images_pixel_values).unsqueeze(0).unsqueeze(2).to(vae.device)
-    )
-
-    vae_dtype = PRECISION_TO_TYPE[args.vae_precision]
-    with torch.autocast(
-        device_type="cuda", dtype=vae_dtype, enabled=vae_dtype != torch.float32
-    ):
-        cond_latents = vae.encode(
-            first_images_pixel_values
-        ).latent_dist.sample()  # B, C, F, H, W
-        cond_latents.mul_(vae.config.scaling_factor)
-
-    return cond_latents
-
-
-def get_cond_images(args, latents, vae, is_uncond=False):
-    """get conditioned images by decode the first frame latents"""
-    sematic_image_latents = (
-        latents[:, :, 0, ...] if len(latents.shape) == 5 else latents
-    )
-    sematic_image_latents = 1 / vae.config.scaling_factor * sematic_image_latents
-    semantic_images = vae.decode(
-        sematic_image_latents.unsqueeze(2).to(vae.dtype), return_dict=False
-    )[0]
-    semantic_images = semantic_images.squeeze(2)
-    semantic_images = (semantic_images / 2 + 0.5).clamp(0, 1)
-    semantic_images = semantic_images.cpu().permute(0, 2, 3, 1).float().numpy()
-    semantic_images = numpy_to_pil(semantic_images)
-    if is_uncond:
-        semantic_images = [
-            black_image(img.size[0], img.size[1]) for img in semantic_images
-        ]
-
-    return semantic_images
-
-
-def load_state_dict(args, model, logger):
-    pretrained_model_path = Path(args.model_base)
-    if not pretrained_model_path.exists():
-        raise ValueError(f"`models_root` not exists: {pretrained_model_path}")
-
-    load_key = args.load_key
-    if args.i2v_mode:
-        dit_weight = Path(args.i2v_dit_weight)
-    else:
-        dit_weight = Path(args.dit_weight)
-
-    if dit_weight is None:
-        model_dir = pretrained_model_path / f"t2v_{args.model_resolution}"
-        files = list(model_dir.glob("*.pt"))
-        if len(files) == 0:
-            raise ValueError(f"No model weights found in {model_dir}")
-        if str(files[0]).startswith("pytorch_model_"):
-            model_path = dit_weight / f"pytorch_model_{load_key}.pt"
-            bare_model = True
-        elif any(str(f).endswith("_model_states.pt") for f in files):
-            files = [f for f in files if str(f).endswith("_model_states.pt")]
-            model_path = files[0]
-            if len(files) > 1:
-                logger.warning(
-                    f"Multiple model weights found in {dit_weight}, using {model_path}"
-                )
-            bare_model = False
-        else:
-            raise ValueError(
-                f"Invalid model path: {dit_weight} with unrecognized weight format: "
-                f"{list(map(str, files))}. When given a directory as --dit-weight, only "
-                f"`pytorch_model_*.pt`(provided by HunyuanVideo official) and "
-                f"`*_model_states.pt`(saved by deepspeed) can be parsed. If you want to load a "
-                f"specific weight file, please provide the full path to the file."
-            )
-    else:
-        if dit_weight.is_dir():
-            files = list(dit_weight.glob("*.pt"))
-            if len(files) == 0:
-                raise ValueError(f"No model weights found in {dit_weight}")
-            if str(files[0]).startswith("pytorch_model_"):
-                model_path = dit_weight / f"pytorch_model_{load_key}.pt"
-                bare_model = True
-            elif any(str(f).endswith("_model_states.pt") for f in files):
-                files = [f for f in files if str(f).endswith("_model_states.pt")]
-                model_path = files[0]
-                if len(files) > 1:
-                    logger.warning(
-                        f"Multiple model weights found in {dit_weight}, using {model_path}"
-                    )
-                bare_model = False
-            else:
-                raise ValueError(
-                    f"Invalid model path: {dit_weight} with unrecognized weight format: "
-                    f"{list(map(str, files))}. When given a directory as --dit-weight, only "
-                    f"`pytorch_model_*.pt`(provided by HunyuanVideo official) and "
-                    f"`*_model_states.pt`(saved by deepspeed) can be parsed. If you want to load a "
-                    f"specific weight file, please provide the full path to the file."
-                )
-        elif dit_weight.is_file():
-            model_path = dit_weight
-            bare_model = "unknown"
-        else:
-            raise ValueError(f"Invalid model path: {dit_weight}")
-
-    if not model_path.exists():
-        raise ValueError(f"model_path not exists: {model_path}")
-    logger.info(f"Loading torch model {model_path}...")
-    state_dict = torch.load(model_path, map_location=lambda storage, loc: storage)
-
-    if bare_model == "unknown" and ("ema" in state_dict or "module" in state_dict):
-        bare_model = False
-    if bare_model is False:
-        if load_key in state_dict:
-            state_dict = state_dict[load_key]
-        else:
-            raise KeyError(
-                f"Missing key: `{load_key}` in the checkpoint: {model_path}. The keys in the checkpoint "
-                f"are: {list(state_dict.keys())}."
-            )
-    model.load_state_dict(state_dict, strict=True)
-    return model
-
-
-class set_worker_seed_builder:
-    def __init__(self, global_rank):
-        self.global_rank = global_rank
-
-    def __call__(self, worker_id):
-        set_manual_seed(torch.initial_seed() % (2 ** 32 - 1))
-
-
-def set_reproducibility(enable, global_seed=None):
-    if enable:
-        # Configure the seed for reproducibility
-        set_manual_seed(global_seed)
-    # Set following debug environment variable
-    # See the link for details: https://docs.nvidia.com/cuda/cublas/index.html#results-reproducibility
-    os.environ["CUBLAS_WORKSPACE_CONFIG"] = ":4096:8"
-    # Cudnn benchmarking
-    torch.backends.cudnn.benchmark = not enable
-    # Use deterministic algorithms in PyTorch
-    torch.use_deterministic_algorithms(enable)
-
-    # LSTM and RNN networks are not deterministic
-
-
-def prepare_model_inputs(
-    args,
-    batch: tuple,
-    device: Union[int, str],
-    model,
-    vae,
-    text_encoder,
-    text_encoder_2=None,
-    rope_theta_rescale_factor: Union[float, List[float]] = 1.0,
-    rope_interpolation_factor: Union[float, List[float]] = 1.0,
-):
-    media, latents, *batch_args = batch
-    if len(batch_args) == 3:
-        text_ids, text_mask, kwargs = batch_args
-        text_ids_2, text_mask_2 = None, None
-    elif len(batch_args) == 5:
-        text_ids, text_mask, text_ids_2, text_mask_2, kwargs = batch_args
-    else:
-        raise ValueError(f"Unexpected batch_args.")
-    data_type = kwargs["type"][0]
-
-    # Move batch to device
-    media = media.to(device)
-    latents = latents.to(device)
-    text_ids = text_ids.to(device)
-    text_mask = text_mask.to(device)
-
-    # ======================================== Encode media ======================================
-    # Used for 3D VAE with 2D inputs(image).
-    # Prepare media shape for 2D/3D VAE
-    if len(latents.shape) == 1:
-        if len(media.shape) == 4:
-            # media is a batch of image with shape [b, c, h, w]
-            if isinstance(vae, AutoencoderKLCausal3D):
-                media = media.unsqueeze(2)  # [b, c, 1, h, w]
-        elif len(media.shape) == 5:
-            # media is a batch of video with shape [b, c, f, h, w]
-            if not isinstance(vae, AutoencoderKLCausal3D):
-                media = rearrange(media, "b c f h w -> (b f) c h w")
-        else:
-            raise ValueError(
-                f"Only support media with shape (b, c, h, w) or (b, c, f, h, w), but got {media.shape}."
-            )
-
-        vae_dtype = PRECISION_TO_TYPE[args.vae_precision]
-        with torch.autocast(
-            device_type="cuda", dtype=vae_dtype, enabled=vae_dtype != torch.float32
-        ):
-            latents = vae.encode(media).latent_dist.sample()
-            if hasattr(vae.config, "shift_factor") and vae.config.shift_factor:
-                latents.sub_(vae.config.shift_factor).mul_(vae.config.scaling_factor)
-            else:
-                latents.mul_(vae.config.scaling_factor)
-    elif len(latents.shape) == 5 or len(latents.shape) == 4:  # Using video/image cache
-        latents = (
-            latents * vae.config.scaling_factor
-        )  # vae cache is not multiplied by scaling_factor
-    else:
-        raise ValueError(
-            f"Only support media/latent with shape (b, c, h, w) or (b, c, f, h, w), but got {media.shape} {latents.shape}."
-        )
-
-    cond_latents = get_cond_latents(args, latents, vae)
-    is_uncond = (
-        torch.tensor(1).to(torch.int64)
-        if random.random() < args.sematic_cond_drop_p
-        else torch.tensor(0).to(torch.int64)
-    )
-    semantic_images = get_cond_images(args, latents, vae, is_uncond=is_uncond)
-
-    # ======================================== Encode text ======================================
-    # Autocast is handled by text_encoder itself.
-    # Whether to apply text_mask is determined by args.use_attention_mask.
-    text_outputs = text_encoder.encode(
-        {"input_ids": text_ids, "attention_mask": text_mask},
-        data_type=batch_args[-1]["type"][0],
-        semantic_images=semantic_images,
-    )
-    text_states = text_outputs.hidden_state
-    text_mask = text_outputs.attention_mask
-    text_states_2 = (
-        text_encoder_2.encode(
-            {"input_ids": text_ids_2, "attention_mask": text_mask_2},
-            data_type=data_type,
-        ).hidden_state
-        if text_encoder_2 is not None
-        else None
-    )
-
-    # ======================================== Build RoPE ======================================
-    target_ndim = 3  # n-d RoPE
-    ndim = len(latents.shape) - 2
-    latents_size = list(latents.shape[-ndim:])
-    freqs_cos, freqs_sin = get_rope_freq_from_size(
-        args,
-        model,
-        latents_size,
-        ndim,
-        target_ndim,
-        rope_theta_rescale_factor=rope_theta_rescale_factor,
-        rope_interpolation_factor=rope_interpolation_factor,
-    )
-
-    # ===================================== Pack model kwargs ==================================
-    model_kwargs = dict(
-        text_states=text_states,  # [b, 256, 4096]
-        text_mask=text_mask,  # [b, 256]
-        text_states_2=text_states_2,  # [b, 768]
-        freqs_cos=freqs_cos,  # [seqlen, head_dim]
-        freqs_sin=freqs_sin,  # [seqlen, head_dim]
-        return_dict=True,
-    )
-
-    return latents, model_kwargs, freqs_cos.shape[0], cond_latents
-
-
-def format_params(params):
-    if params < 1e6:
-        return f"{params} (less than 1M)"
-    elif params < 1e9:
-        return f"{params / 1e6:.2f}M"
-    else:
-        return f"{params / 1e9:.2f}B"
-
-
-def set_manual_seed(global_seed):
-    random.seed(global_seed)
-    np.random.seed(global_seed)
-    torch.manual_seed(global_seed)
-
-
-def get_rope_freq_from_size(
-    args,
-    model,
-    latents_size,
-    ndim,
-    target_ndim,
-    rope_theta_rescale_factor=1.0,
-    rope_interpolation_factor=1.0,
-):
-
-    if isinstance(model.patch_size, int):
-        assert all(s % model.patch_size == 0 for s in latents_size), (
-            f"Latent size(last {ndim} dimensions) should be divisible by patch size({model.patch_size}), "
-            f"but got {latents_size}."
-        )
-        rope_sizes = [s // model.patch_size for s in latents_size]
-
-    elif isinstance(model.patch_size, list):
-        assert all(
-            s % model.patch_size[idx] == 0 for idx, s in enumerate(latents_size)
-        ), (
-            f"Latent size(last {ndim} dimensions) should be divisible by patch size({model.patch_size}), "
-            f"but got {latents_size}."
-        )
-        rope_sizes = [s // model.patch_size[idx] for idx, s in enumerate(latents_size)]
-
-    if len(rope_sizes) != target_ndim:
-        rope_sizes = [1] * (target_ndim - len(rope_sizes)) + rope_sizes  # time axis
-    head_dim = model.hidden_size // model.heads_num
-    rope_dim_list = model.rope_dim_list
-
-    if rope_dim_list is None:
-        rope_dim_list = [head_dim // target_ndim for _ in range(target_ndim)]
-    assert (
-        sum(rope_dim_list) == head_dim
-    ), "sum(rope_dim_list) should equal to head_dim of attention layer"
-
-    freqs_cos, freqs_sin = get_nd_rotary_pos_embed(
-        rope_dim_list,
-        rope_sizes,
-        theta=args.rope_theta,
-        use_real=True,
-        theta_rescale_factor=rope_theta_rescale_factor,
-        interpolation_factor=rope_interpolation_factor,
-    )
-
-    return freqs_cos, freqs_sin
-
-
-# copy from https://github.com/huggingface/diffusers/blob/ec9bfa9e148b7764137dd92247ce859d915abcb0/examples/consistency_distillation/train_lcm_distill_lora_sd_wds.py#L258
-# get kohya lora state dict
-def get_module_kohya_state_dict(module, prefix, dtype, adapter_name="default"):
-    kohya_ss_state_dict = {}
-    for peft_key, weight in get_peft_model_state_dict(
-        module, adapter_name=adapter_name
-    ).items():
-        kohya_key = peft_key.replace("base_model.model", prefix)
-        kohya_key = kohya_key.replace("lora_A", "lora_down")
-        kohya_key = kohya_key.replace("lora_B", "lora_up")
-        kohya_key = kohya_key.replace(".", "_", kohya_key.count(".") - 2)
-        kohya_ss_state_dict[kohya_key] = weight.to(dtype)
-
-        # Set alpha parameter
-        if "lora_down" in kohya_key:
-            alpha_key = f'{kohya_key.split(".")[0]}.alpha'
-            kohya_ss_state_dict[alpha_key] = torch.tensor(
-                module.peft_config[adapter_name].lora_alpha
-            ).to(dtype)
-
-    return kohya_ss_state_dict
-
-
-# get diffusers lora state dict
-def get_module_diffusers_state_dict(module, dtype, adapter_name="default"):
-    diffusers_ss_state_dict = {}
-    for peft_key, weight in get_peft_model_state_dict(
-        module, adapter_name=adapter_name
-    ).items():
-        diffusers_key = peft_key.replace("base_model.model", "diffusion_model")
-        diffusers_ss_state_dict[diffusers_key] = weight.to(dtype)
-
-    return diffusers_ss_state_dict
diff --git a/videotuna/models/hunyuan/hyvideo_i2v/vae/__init__.py b/videotuna/models/hunyuan/hyvideo_i2v/vae/__init__.py
deleted file mode 100644
index e69de29b..00000000
diff --git a/videotuna/models/hunyuan/hyvideo_i2v/vae/autoencoder_kl_causal_3d.py b/videotuna/models/hunyuan/hyvideo_i2v/vae/autoencoder_kl_causal_3d.py
deleted file mode 100644
index 2d58933f..00000000
--- a/videotuna/models/hunyuan/hyvideo_i2v/vae/autoencoder_kl_causal_3d.py
+++ /dev/null
@@ -1,655 +0,0 @@
-# Copyright 2024 The HuggingFace Team. All rights reserved.
-#
-# Licensed under the Apache License, Version 2.0 (the "License");
-# you may not use this file except in compliance with the License.
-# You may obtain a copy of the License at
-#
-#     http://www.apache.org/licenses/LICENSE-2.0
-#
-# Unless required by applicable law or agreed to in writing, software
-# distributed under the License is distributed on an "AS IS" BASIS,
-# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
-# See the License for the specific language governing permissions and
-# limitations under the License.
-# ==============================================================================
-#
-# Modified from diffusers==0.29.2
-#
-# ==============================================================================
-from typing import Dict, Optional, Tuple, Union
-from dataclasses import dataclass
-
-import torch
-from pathlib import Path
-import torch.nn as nn
-
-from diffusers.configuration_utils import ConfigMixin, register_to_config
-
-try:
-    # This diffusers is modified and packed in the mirror.
-    from diffusers.loaders import FromOriginalVAEMixin
-except ImportError:
-    # Use this to be compatible with the original diffusers.
-    from diffusers.loaders.single_file_model import FromOriginalModelMixin as FromOriginalVAEMixin
-from diffusers.utils.accelerate_utils import apply_forward_hook
-from diffusers.models.attention_processor import (
-    ADDED_KV_ATTENTION_PROCESSORS,
-    CROSS_ATTENTION_PROCESSORS,
-    Attention,
-    AttentionProcessor,
-    AttnAddedKVProcessor,
-    AttnProcessor,
-)
-from diffusers.models.modeling_outputs import AutoencoderKLOutput
-from diffusers.models.modeling_utils import ModelMixin
-from .vae import DecoderCausal3D, BaseOutput, DecoderOutput, DiagonalGaussianDistribution, EncoderCausal3D
-from ..constants import VAE_PATH, PRECISION_TO_TYPE
-
-
-@dataclass
-class DecoderOutput2(BaseOutput):
-    sample: torch.FloatTensor
-    posterior: Optional[DiagonalGaussianDistribution] = None
-
-
-class AutoencoderKLCausal3D(ModelMixin, ConfigMixin, FromOriginalVAEMixin):
-    r"""
-    A VAE model with KL loss for encoding images/videos into latents and decoding latent representations into images/videos.
-
-    This model inherits from [`ModelMixin`]. Check the superclass documentation for it's generic methods implemented
-    for all models (such as downloading or saving).
-    """
-
-    _supports_gradient_checkpointing = True
-
-    @register_to_config
-    def __init__(
-        self,
-        in_channels: int = 3,
-        out_channels: int = 3,
-        down_block_types: Tuple[str] = ("DownEncoderBlockCausal3D",),
-        up_block_types: Tuple[str] = ("UpDecoderBlockCausal3D",),
-        block_out_channels: Tuple[int] = (64,),
-        layers_per_block: int = 1,
-        act_fn: str = "silu",
-        latent_channels: int = 4,
-        norm_num_groups: int = 32,
-        sample_size: int = 32,
-        sample_tsize: int = 64,
-        scaling_factor: float = 0.18215,
-        force_upcast: float = True,
-        spatial_compression_ratio: int = 8,
-        time_compression_ratio: int = 4,
-        mid_block_add_attention: bool = True,
-    ):
-        super().__init__()
-
-        self.time_compression_ratio = time_compression_ratio
-
-        self.encoder = EncoderCausal3D(
-            in_channels=in_channels,
-            out_channels=latent_channels,
-            down_block_types=down_block_types,
-            block_out_channels=block_out_channels,
-            layers_per_block=layers_per_block,
-            act_fn=act_fn,
-            norm_num_groups=norm_num_groups,
-            double_z=True,
-            time_compression_ratio=time_compression_ratio,
-            spatial_compression_ratio=spatial_compression_ratio,
-            mid_block_add_attention=mid_block_add_attention,
-        )
-
-        self.decoder = DecoderCausal3D(
-            in_channels=latent_channels,
-            out_channels=out_channels,
-            up_block_types=up_block_types,
-            block_out_channels=block_out_channels,
-            layers_per_block=layers_per_block,
-            norm_num_groups=norm_num_groups,
-            act_fn=act_fn,
-            time_compression_ratio=time_compression_ratio,
-            spatial_compression_ratio=spatial_compression_ratio,
-            mid_block_add_attention=mid_block_add_attention,
-        )
-
-        self.quant_conv = nn.Conv3d(2 * latent_channels, 2 * latent_channels, kernel_size=1)
-        self.post_quant_conv = nn.Conv3d(latent_channels, latent_channels, kernel_size=1)
-
-        self.use_slicing = False
-        self.use_spatial_tiling = False
-        self.use_temporal_tiling = False
-
-        # only relevant if vae tiling is enabled
-        self.tile_sample_min_tsize = sample_tsize
-        self.tile_latent_min_tsize = sample_tsize // time_compression_ratio
-
-        self.tile_sample_min_size = self.config.sample_size
-        sample_size = (
-            self.config.sample_size[0]
-            if isinstance(self.config.sample_size, (list, tuple))
-            else self.config.sample_size
-        )
-        self.tile_latent_min_size = int(sample_size / (2 ** (len(self.config.block_out_channels) - 1)))
-        self.tile_overlap_factor = 0.25
-
-    def _set_gradient_checkpointing(self, module, value=False):
-        if isinstance(module, (EncoderCausal3D, DecoderCausal3D)):
-            module.gradient_checkpointing = value
-
-    def enable_temporal_tiling(self, use_tiling: bool = True):
-        self.use_temporal_tiling = use_tiling
-
-    def disable_temporal_tiling(self):
-        self.enable_temporal_tiling(False)
-
-    def enable_spatial_tiling(self, use_tiling: bool = True):
-        self.use_spatial_tiling = use_tiling
-
-    def disable_spatial_tiling(self):
-        self.enable_spatial_tiling(False)
-
-    def enable_tiling(self, use_tiling: bool = True):
-        r"""
-        Enable tiled VAE decoding. When this option is enabled, the VAE will split the input tensor into tiles to
-        compute decoding and encoding in several steps. This is useful for saving a large amount of memory and to allow
-        processing larger videos.
-        """
-        self.enable_spatial_tiling(use_tiling)
-        self.enable_temporal_tiling(use_tiling)
-
-    def disable_tiling(self):
-        r"""
-        Disable tiled VAE decoding. If `enable_tiling` was previously enabled, this method will go back to computing
-        decoding in one step.
-        """
-        self.disable_spatial_tiling()
-        self.disable_temporal_tiling()
-
-    def enable_slicing(self):
-        r"""
-        Enable sliced VAE decoding. When this option is enabled, the VAE will split the input tensor in slices to
-        compute decoding in several steps. This is useful to save some memory and allow larger batch sizes.
-        """
-        self.use_slicing = True
-
-    def disable_slicing(self):
-        r"""
-        Disable sliced VAE decoding. If `enable_slicing` was previously enabled, this method will go back to computing
-        decoding in one step.
-        """
-        self.use_slicing = False
-
-    @property
-    # Copied from diffusers.models.unet_2d_condition.UNet2DConditionModel.attn_processors
-    def attn_processors(self) -> Dict[str, AttentionProcessor]:
-        r"""
-        Returns:
-            `dict` of attention processors: A dictionary containing all attention processors used in the model with
-            indexed by its weight name.
-        """
-        # set recursively
-        processors = {}
-
-        def fn_recursive_add_processors(name: str, module: torch.nn.Module, processors: Dict[str, AttentionProcessor]):
-            if hasattr(module, "get_processor"):
-                processors[f"{name}.processor"] = module.get_processor(return_deprecated_lora=True)
-
-            for sub_name, child in module.named_children():
-                fn_recursive_add_processors(f"{name}.{sub_name}", child, processors)
-
-            return processors
-
-        for name, module in self.named_children():
-            fn_recursive_add_processors(name, module, processors)
-
-        return processors
-
-    # Copied from diffusers.models.unet_2d_condition.UNet2DConditionModel.set_attn_processor
-    def set_attn_processor(
-        self, processor: Union[AttentionProcessor, Dict[str, AttentionProcessor]], _remove_lora=False
-    ):
-        r"""
-        Sets the attention processor to use to compute attention.
-
-        Parameters:
-            processor (`dict` of `AttentionProcessor` or only `AttentionProcessor`):
-                The instantiated processor class or a dictionary of processor classes that will be set as the processor
-                for **all** `Attention` layers.
-
-                If `processor` is a dict, the key needs to define the path to the corresponding cross attention
-                processor. This is strongly recommended when setting trainable attention processors.
-
-        """
-        count = len(self.attn_processors.keys())
-
-        if isinstance(processor, dict) and len(processor) != count:
-            raise ValueError(
-                f"A dict of processors was passed, but the number of processors {len(processor)} does not match the"
-                f" number of attention layers: {count}. Please make sure to pass {count} processor classes."
-            )
-
-        def fn_recursive_attn_processor(name: str, module: torch.nn.Module, processor):
-            if hasattr(module, "set_processor"):
-                if not isinstance(processor, dict):
-                    module.set_processor(processor, _remove_lora=_remove_lora)
-                else:
-                    module.set_processor(processor.pop(f"{name}.processor"), _remove_lora=_remove_lora)
-
-            for sub_name, child in module.named_children():
-                fn_recursive_attn_processor(f"{name}.{sub_name}", child, processor)
-
-        for name, module in self.named_children():
-            fn_recursive_attn_processor(name, module, processor)
-
-    # Copied from diffusers.models.unet_2d_condition.UNet2DConditionModel.set_default_attn_processor
-    def set_default_attn_processor(self):
-        """
-        Disables custom attention processors and sets the default attention implementation.
-        """
-        if all(proc.__class__ in ADDED_KV_ATTENTION_PROCESSORS for proc in self.attn_processors.values()):
-            processor = AttnAddedKVProcessor()
-        elif all(proc.__class__ in CROSS_ATTENTION_PROCESSORS for proc in self.attn_processors.values()):
-            processor = AttnProcessor()
-        else:
-            raise ValueError(
-                f"Cannot call `set_default_attn_processor` when attention processors are of type {next(iter(self.attn_processors.values()))}"
-            )
-
-        self.set_attn_processor(processor, _remove_lora=True)
-
-    @apply_forward_hook
-    def encode(
-        self, x: torch.FloatTensor, return_dict: bool = True
-    ) -> Union[AutoencoderKLOutput, Tuple[DiagonalGaussianDistribution]]:
-        """
-        Encode a batch of images/videos into latents.
-
-        Args:
-            x (`torch.FloatTensor`): Input batch of images/videos.
-            return_dict (`bool`, *optional*, defaults to `True`):
-                Whether to return a [`~models.autoencoder_kl.AutoencoderKLOutput`] instead of a plain tuple.
-
-        Returns:
-                The latent representations of the encoded images/videos. If `return_dict` is True, a
-                [`~models.autoencoder_kl.AutoencoderKLOutput`] is returned, otherwise a plain `tuple` is returned.
-        """
-        assert len(x.shape) == 5, "The input tensor should have 5 dimensions."
-
-        if self.use_temporal_tiling and x.shape[2] > self.tile_sample_min_tsize:
-            return self.temporal_tiled_encode(x, return_dict=return_dict)
-
-        if self.use_spatial_tiling and (x.shape[-1] > self.tile_sample_min_size or x.shape[-2] > self.tile_sample_min_size):
-            return self.spatial_tiled_encode(x, return_dict=return_dict)
-
-        if self.use_slicing and x.shape[0] > 1:
-            encoded_slices = [self.encoder(x_slice) for x_slice in x.split(1)]
-            h = torch.cat(encoded_slices)
-        else:
-            h = self.encoder(x)
-
-        moments = self.quant_conv(h)
-        posterior = DiagonalGaussianDistribution(moments)
-
-        if not return_dict:
-            return (posterior,)
-
-        return AutoencoderKLOutput(latent_dist=posterior)
-
-    def _decode(self, z: torch.FloatTensor, return_dict: bool = True) -> Union[DecoderOutput, torch.FloatTensor]:
-        assert len(z.shape) == 5, "The input tensor should have 5 dimensions."
-
-        if self.use_temporal_tiling and z.shape[2] > self.tile_latent_min_tsize:
-            return self.temporal_tiled_decode(z, return_dict=return_dict)
-
-        if self.use_spatial_tiling and (z.shape[-1] > self.tile_latent_min_size or z.shape[-2] > self.tile_latent_min_size):
-            return self.spatial_tiled_decode(z, return_dict=return_dict)
-
-        z = self.post_quant_conv(z)
-        dec = self.decoder(z)
-
-        if not return_dict:
-            return (dec,)
-
-        return DecoderOutput(sample=dec)
-
-    @apply_forward_hook
-    def decode(
-        self, z: torch.FloatTensor, return_dict: bool = True, generator=None
-    ) -> Union[DecoderOutput, torch.FloatTensor]:
-        """
-        Decode a batch of images/videos.
-
-        Args:
-            z (`torch.FloatTensor`): Input batch of latent vectors.
-            return_dict (`bool`, *optional*, defaults to `True`):
-                Whether to return a [`~models.vae.DecoderOutput`] instead of a plain tuple.
-
-        Returns:
-            [`~models.vae.DecoderOutput`] or `tuple`:
-                If return_dict is True, a [`~models.vae.DecoderOutput`] is returned, otherwise a plain `tuple` is
-                returned.
-
-        """
-        if self.use_slicing and z.shape[0] > 1:
-            decoded_slices = [self._decode(z_slice).sample for z_slice in z.split(1)]
-            decoded = torch.cat(decoded_slices)
-        else:
-            decoded = self._decode(z).sample
-
-        if not return_dict:
-            return (decoded,)
-
-        return DecoderOutput(sample=decoded)
-
-    def blend_v(self, a: torch.Tensor, b: torch.Tensor, blend_extent: int) -> torch.Tensor:
-        blend_extent = min(a.shape[-2], b.shape[-2], blend_extent)
-        for y in range(blend_extent):
-            b[:, :, :, y, :] = a[:, :, :, -blend_extent + y, :] * (1 - y / blend_extent) + b[:, :, :, y, :] * (y / blend_extent)
-        return b
-
-    def blend_h(self, a: torch.Tensor, b: torch.Tensor, blend_extent: int) -> torch.Tensor:
-        blend_extent = min(a.shape[-1], b.shape[-1], blend_extent)
-        for x in range(blend_extent):
-            b[:, :, :, :, x] = a[:, :, :, :, -blend_extent + x] * (1 - x / blend_extent) + b[:, :, :, :, x] * (x / blend_extent)
-        return b
-
-    def blend_t(self, a: torch.Tensor, b: torch.Tensor, blend_extent: int) -> torch.Tensor:
-        blend_extent = min(a.shape[-3], b.shape[-3], blend_extent)
-        for x in range(blend_extent):
-            b[:, :, x, :, :] = a[:, :, -blend_extent + x, :, :] * (1 - x / blend_extent) + b[:, :, x, :, :] * (x / blend_extent)
-        return b
-
-    def spatial_tiled_encode(self, x: torch.FloatTensor, return_dict: bool = True, return_moments: bool = False) -> AutoencoderKLOutput:
-        r"""Encode a batch of images/videos using a tiled encoder.
-
-        When this option is enabled, the VAE will split the input tensor into tiles to compute encoding in several
-        steps. This is useful to keep memory use constant regardless of image/videos size. The end result of tiled encoding is
-        different from non-tiled encoding because each tile uses a different encoder. To avoid tiling artifacts, the
-        tiles overlap and are blended together to form a smooth output. You may still see tile-sized changes in the
-        output, but they should be much less noticeable.
-
-        Args:
-            x (`torch.FloatTensor`): Input batch of images/videos.
-            return_dict (`bool`, *optional*, defaults to `True`):
-                Whether or not to return a [`~models.autoencoder_kl.AutoencoderKLOutput`] instead of a plain tuple.
-
-        Returns:
-            [`~models.autoencoder_kl.AutoencoderKLOutput`] or `tuple`:
-                If return_dict is True, a [`~models.autoencoder_kl.AutoencoderKLOutput`] is returned, otherwise a plain
-                `tuple` is returned.
-        """
-        overlap_size = int(self.tile_sample_min_size * (1 - self.tile_overlap_factor))
-        blend_extent = int(self.tile_latent_min_size * self.tile_overlap_factor)
-        row_limit = self.tile_latent_min_size - blend_extent
-
-        # Split video into tiles and encode them separately.
-        rows = []
-        for i in range(0, x.shape[-2], overlap_size):
-            row = []
-            for j in range(0, x.shape[-1], overlap_size):
-                tile = x[:, :, :, i: i + self.tile_sample_min_size, j: j + self.tile_sample_min_size]
-                tile = self.encoder(tile)
-                tile = self.quant_conv(tile)
-                row.append(tile)
-            rows.append(row)
-        result_rows = []
-        for i, row in enumerate(rows):
-            result_row = []
-            for j, tile in enumerate(row):
-                # blend the above tile and the left tile
-                # to the current tile and add the current tile to the result row
-                if i > 0:
-                    tile = self.blend_v(rows[i - 1][j], tile, blend_extent)
-                if j > 0:
-                    tile = self.blend_h(row[j - 1], tile, blend_extent)
-                result_row.append(tile[:, :, :, :row_limit, :row_limit])
-            result_rows.append(torch.cat(result_row, dim=-1))
-
-        moments = torch.cat(result_rows, dim=-2)
-        if return_moments:
-            return moments
-
-        posterior = DiagonalGaussianDistribution(moments)
-        if not return_dict:
-            return (posterior,)
-
-        return AutoencoderKLOutput(latent_dist=posterior)
-
-    def spatial_tiled_decode(self, z: torch.FloatTensor, return_dict: bool = True) -> Union[DecoderOutput, torch.FloatTensor]:
-        r"""
-        Decode a batch of images/videos using a tiled decoder.
-
-        Args:
-            z (`torch.FloatTensor`): Input batch of latent vectors.
-            return_dict (`bool`, *optional*, defaults to `True`):
-                Whether or not to return a [`~models.vae.DecoderOutput`] instead of a plain tuple.
-
-        Returns:
-            [`~models.vae.DecoderOutput`] or `tuple`:
-                If return_dict is True, a [`~models.vae.DecoderOutput`] is returned, otherwise a plain `tuple` is
-                returned.
-        """
-        overlap_size = int(self.tile_latent_min_size * (1 - self.tile_overlap_factor))
-        blend_extent = int(self.tile_sample_min_size * self.tile_overlap_factor)
-        row_limit = self.tile_sample_min_size - blend_extent
-
-        # Split z into overlapping tiles and decode them separately.
-        # The tiles have an overlap to avoid seams between tiles.
-        rows = []
-        for i in range(0, z.shape[-2], overlap_size):
-            row = []
-            for j in range(0, z.shape[-1], overlap_size):
-                tile = z[:, :, :, i: i + self.tile_latent_min_size, j: j + self.tile_latent_min_size]
-                tile = self.post_quant_conv(tile)
-                decoded = self.decoder(tile)
-                row.append(decoded)
-            rows.append(row)
-        result_rows = []
-        for i, row in enumerate(rows):
-            result_row = []
-            for j, tile in enumerate(row):
-                # blend the above tile and the left tile
-                # to the current tile and add the current tile to the result row
-                if i > 0:
-                    tile = self.blend_v(rows[i - 1][j], tile, blend_extent)
-                if j > 0:
-                    tile = self.blend_h(row[j - 1], tile, blend_extent)
-                result_row.append(tile[:, :, :, :row_limit, :row_limit])
-            result_rows.append(torch.cat(result_row, dim=-1))
-
-        dec = torch.cat(result_rows, dim=-2)
-        if not return_dict:
-            return (dec,)
-
-        return DecoderOutput(sample=dec)
-
-    def temporal_tiled_encode(self, x: torch.FloatTensor, return_dict: bool = True) -> AutoencoderKLOutput:
-
-        B, C, T, H, W = x.shape
-        overlap_size = int(self.tile_sample_min_tsize * (1 - self.tile_overlap_factor))
-        blend_extent = int(self.tile_latent_min_tsize * self.tile_overlap_factor)
-        t_limit = self.tile_latent_min_tsize - blend_extent
-
-        # Split the video into tiles and encode them separately.
-        row = []
-        for i in range(0, T, overlap_size):
-            tile = x[:, :, i: i + self.tile_sample_min_tsize + 1, :, :]
-            if self.use_spatial_tiling and (tile.shape[-1] > self.tile_sample_min_size or tile.shape[-2] > self.tile_sample_min_size):
-                tile = self.spatial_tiled_encode(tile, return_moments=True)
-            else:
-                tile = self.encoder(tile)
-                tile = self.quant_conv(tile)
-            if i > 0:
-                tile = tile[:, :, 1:, :, :]
-            row.append(tile)
-        result_row = []
-        for i, tile in enumerate(row):
-            if i > 0:
-                tile = self.blend_t(row[i - 1], tile, blend_extent)
-                result_row.append(tile[:, :, :t_limit, :, :])
-            else:
-                result_row.append(tile[:, :, :t_limit + 1, :, :])
-
-        moments = torch.cat(result_row, dim=2)
-        posterior = DiagonalGaussianDistribution(moments)
-
-        if not return_dict:
-            return (posterior,)
-
-        return AutoencoderKLOutput(latent_dist=posterior)
-
-    def temporal_tiled_decode(self, z: torch.FloatTensor, return_dict: bool = True) -> Union[DecoderOutput, torch.FloatTensor]:
-        # Split z into overlapping tiles and decode them separately.
-
-        B, C, T, H, W = z.shape
-        overlap_size = int(self.tile_latent_min_tsize * (1 - self.tile_overlap_factor))
-        blend_extent = int(self.tile_sample_min_tsize * self.tile_overlap_factor)
-        t_limit = self.tile_sample_min_tsize - blend_extent
-
-        row = []
-        for i in range(0, T, overlap_size):
-            tile = z[:, :, i: i + self.tile_latent_min_tsize + 1, :, :]
-            if self.use_spatial_tiling and (tile.shape[-1] > self.tile_latent_min_size or tile.shape[-2] > self.tile_latent_min_size):
-                decoded = self.spatial_tiled_decode(tile, return_dict=True).sample
-            else:
-                tile = self.post_quant_conv(tile)
-                decoded = self.decoder(tile)
-            if i > 0:
-                decoded = decoded[:, :, 1:, :, :]
-            row.append(decoded)
-        result_row = []
-        for i, tile in enumerate(row):
-            if i > 0:
-                tile = self.blend_t(row[i - 1], tile, blend_extent)
-                result_row.append(tile[:, :, :t_limit, :, :])
-            else:
-                result_row.append(tile[:, :, :t_limit + 1, :, :])
-
-        dec = torch.cat(result_row, dim=2)
-        if not return_dict:
-            return (dec,)
-
-        return DecoderOutput(sample=dec)
-
-    def forward(
-        self,
-        sample: torch.FloatTensor,
-        sample_posterior: bool = False,
-        return_dict: bool = True,
-        return_posterior: bool = False,
-        generator: Optional[torch.Generator] = None,
-    ) -> Union[DecoderOutput2, torch.FloatTensor]:
-        r"""
-        Args:
-            sample (`torch.FloatTensor`): Input sample.
-            sample_posterior (`bool`, *optional*, defaults to `False`):
-                Whether to sample from the posterior.
-            return_dict (`bool`, *optional*, defaults to `True`):
-                Whether or not to return a [`DecoderOutput`] instead of a plain tuple.
-        """
-        x = sample
-        posterior = self.encode(x).latent_dist
-        if sample_posterior:
-            z = posterior.sample(generator=generator)
-        else:
-            z = posterior.mode()
-        dec = self.decode(z).sample
-
-        if not return_dict:
-            if return_posterior:
-                return (dec, posterior)
-            else:
-                return (dec,)
-        if return_posterior:
-            return DecoderOutput2(sample=dec, posterior=posterior)
-        else:
-            return DecoderOutput2(sample=dec)
-
-    # Copied from diffusers.models.unet_2d_condition.UNet2DConditionModel.fuse_qkv_projections
-    def fuse_qkv_projections(self):
-        """
-        Enables fused QKV projections. For self-attention modules, all projection matrices (i.e., query,
-        key, value) are fused. For cross-attention modules, key and value projection matrices are fused.
-
-        <Tip warning={true}>
-
-        This API is 🧪 experimental.
-
-        </Tip>
-        """
-        self.original_attn_processors = None
-
-        for _, attn_processor in self.attn_processors.items():
-            if "Added" in str(attn_processor.__class__.__name__):
-                raise ValueError("`fuse_qkv_projections()` is not supported for models having added KV projections.")
-
-        self.original_attn_processors = self.attn_processors
-
-        for module in self.modules():
-            if isinstance(module, Attention):
-                module.fuse_projections(fuse=True)
-
-    # Copied from diffusers.models.unet_2d_condition.UNet2DConditionModel.unfuse_qkv_projections
-    def unfuse_qkv_projections(self):
-        """Disables the fused QKV projection if enabled.
-
-        <Tip warning={true}>
-
-        This API is 🧪 experimental.
-
-        </Tip>
-
-        """
-        if self.original_attn_processors is not None:
-            self.set_attn_processor(self.original_attn_processors)
-
-
-
-
-class AutoencoderKLCausal3DWrapper(nn.Module):
-    """
-
-    Args:
-        vae_type (str): the type of the 3D VAE model. Defaults to "884-16c-hy".
-        vae_precision (str, optional): the precision to load vae. Defaults to None.
-        sample_size (tuple, optional): the tiling size. Defaults to None.
-        vae_path (str, optional): the path to vae. Defaults to None.
-        logger (_type_, optional): logger. Defaults to None.
-        device (_type_, optional): device to load vae. Defaults to None.
-    """
-    def __init__(self, 
-             vae_type: str="884-16c-hy",
-             vae_precision: str=None,
-             sample_size: tuple=None,
-             vae_path: str=None,
-             device:str='cuda',
-             use_cpu_offload: bool=True,
-             *args, **kwargs):
-        super().__init__(*args, **kwargs)
-        if vae_path is None:
-            vae_path = VAE_PATH[vae_type]
-        config = AutoencoderKLCausal3D.load_config(vae_path)
-        if sample_size:
-            vae = AutoencoderKLCausal3D.from_config(config, sample_size=sample_size)
-        else:
-            vae = AutoencoderKLCausal3D.from_config(config)
-        self.device = device if not use_cpu_offload else 'cpu'
-        self.vae = vae
-        self.vae_path = vae_path
-        if vae_precision is not None:
-            self.vae = self.vae.to(dtype=PRECISION_TO_TYPE[vae_precision])
-
-    
-    def load_weight(self):
-        vae_ckpt = Path(self.vae_path) / "pytorch_model.pt"
-        assert vae_ckpt.exists(), f"VAE checkpoint not found: {vae_ckpt}"
-        ckpt = torch.load(vae_ckpt, map_location=self.vae.device)
-        if "state_dict" in ckpt:
-            ckpt = ckpt["state_dict"]
-        if any(k.startswith("vae.") for k in ckpt.keys()):
-            ckpt = {k.replace("vae.", ""): v for k, v in ckpt.items() if k.startswith("vae.")}
-        self.vae.load_state_dict(ckpt)
-        self.vae.requires_grad_(False)
-        if self.device is not None:
-            self.vae = self.vae.to(self.device)
diff --git a/videotuna/models/hunyuan/hyvideo_i2v/vae/unet_causal_3d_blocks.py b/videotuna/models/hunyuan/hyvideo_i2v/vae/unet_causal_3d_blocks.py
deleted file mode 100644
index f78bc755..00000000
--- a/videotuna/models/hunyuan/hyvideo_i2v/vae/unet_causal_3d_blocks.py
+++ /dev/null
@@ -1,764 +0,0 @@
-# Copyright 2024 The HuggingFace Team. All rights reserved.
-#
-# Licensed under the Apache License, Version 2.0 (the "License");
-# you may not use this file except in compliance with the License.
-# You may obtain a copy of the License at
-#
-#     http://www.apache.org/licenses/LICENSE-2.0
-#
-# Unless required by applicable law or agreed to in writing, software
-# distributed under the License is distributed on an "AS IS" BASIS,
-# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
-# See the License for the specific language governing permissions and
-# limitations under the License.
-# ==============================================================================
-#
-# Modified from diffusers==0.29.2
-#
-# ==============================================================================
-
-from typing import Optional, Tuple, Union
-
-import torch
-import torch.nn.functional as F
-from torch import nn
-from einops import rearrange
-
-from diffusers.utils import logging
-from diffusers.models.activations import get_activation
-from diffusers.models.attention_processor import SpatialNorm
-from diffusers.models.attention_processor import Attention
-from diffusers.models.normalization import AdaGroupNorm
-from diffusers.models.normalization import RMSNorm
-
-logger = logging.get_logger(__name__)  # pylint: disable=invalid-name
-
-
-def prepare_causal_attention_mask(n_frame: int, n_hw: int, dtype, device, batch_size: int = None):
-    seq_len = n_frame * n_hw
-    mask = torch.full((seq_len, seq_len), float("-inf"), dtype=dtype, device=device)
-    for i in range(seq_len):
-        i_frame = i // n_hw
-        mask[i, : (i_frame + 1) * n_hw] = 0
-    if batch_size is not None:
-        mask = mask.unsqueeze(0).expand(batch_size, -1, -1)
-    return mask
-
-
-class CausalConv3d(nn.Module):
-    """
-    Implements a causal 3D convolution layer where each position only depends on previous timesteps and current spatial locations.
-    This maintains temporal causality in video generation tasks.
-    """
-
-    def __init__(
-        self,
-        chan_in,
-        chan_out,
-        kernel_size: Union[int, Tuple[int, int, int]],
-        stride: Union[int, Tuple[int, int, int]] = 1,
-        dilation: Union[int, Tuple[int, int, int]] = 1,
-        pad_mode='replicate',
-        **kwargs
-    ):
-        super().__init__()
-
-        self.pad_mode = pad_mode
-        padding = (kernel_size // 2, kernel_size // 2, kernel_size // 2, kernel_size // 2, kernel_size - 1, 0)  # W, H, T
-        self.time_causal_padding = padding
-
-        self.conv = nn.Conv3d(chan_in, chan_out, kernel_size, stride=stride, dilation=dilation, **kwargs)
-
-    def forward(self, x):
-        x = F.pad(x, self.time_causal_padding, mode=self.pad_mode)
-        return self.conv(x)
-
-
-class UpsampleCausal3D(nn.Module):
-    """
-    A 3D upsampling layer with an optional convolution.
-    """
-
-    def __init__(
-        self,
-        channels: int,
-        use_conv: bool = False,
-        use_conv_transpose: bool = False,
-        out_channels: Optional[int] = None,
-        name: str = "conv",
-        kernel_size: Optional[int] = None,
-        padding=1,
-        norm_type=None,
-        eps=None,
-        elementwise_affine=None,
-        bias=True,
-        interpolate=True,
-        upsample_factor=(2, 2, 2),
-    ):
-        super().__init__()
-        self.channels = channels
-        self.out_channels = out_channels or channels
-        self.use_conv = use_conv
-        self.use_conv_transpose = use_conv_transpose
-        self.name = name
-        self.interpolate = interpolate
-        self.upsample_factor = upsample_factor
-
-        if norm_type == "ln_norm":
-            self.norm = nn.LayerNorm(channels, eps, elementwise_affine)
-        elif norm_type == "rms_norm":
-            self.norm = RMSNorm(channels, eps, elementwise_affine)
-        elif norm_type is None:
-            self.norm = None
-        else:
-            raise ValueError(f"unknown norm_type: {norm_type}")
-
-        conv = None
-        if use_conv_transpose:
-            raise NotImplementedError
-        elif use_conv:
-            if kernel_size is None:
-                kernel_size = 3
-            conv = CausalConv3d(self.channels, self.out_channels, kernel_size=kernel_size, bias=bias)
-
-        if name == "conv":
-            self.conv = conv
-        else:
-            self.Conv2d_0 = conv
-
-    def forward(
-        self,
-        hidden_states: torch.FloatTensor,
-        output_size: Optional[int] = None,
-        scale: float = 1.0,
-    ) -> torch.FloatTensor:
-        assert hidden_states.shape[1] == self.channels
-
-        if self.norm is not None:
-            raise NotImplementedError
-
-        if self.use_conv_transpose:
-            return self.conv(hidden_states)
-
-        # Cast to float32 to as 'upsample_nearest2d_out_frame' op does not support bfloat16
-        dtype = hidden_states.dtype
-        if dtype == torch.bfloat16:
-            hidden_states = hidden_states.to(torch.float32)
-
-        # upsample_nearest_nhwc fails with large batch sizes. see https://github.com/huggingface/diffusers/issues/984
-        if hidden_states.shape[0] >= 64:
-            hidden_states = hidden_states.contiguous()
-
-        # if `output_size` is passed we force the interpolation output
-        # size and do not make use of `scale_factor=2`
-        if self.interpolate:
-            B, C, T, H, W = hidden_states.shape
-            first_h, other_h = hidden_states.split((1, T - 1), dim=2)
-            if output_size is None:
-                if T > 1:
-                    other_h = F.interpolate(other_h, scale_factor=self.upsample_factor, mode="nearest")
-
-                first_h = first_h.squeeze(2)
-                first_h = F.interpolate(first_h, scale_factor=self.upsample_factor[1:], mode="nearest")
-                first_h = first_h.unsqueeze(2)
-            else:
-                raise NotImplementedError
-
-            if T > 1:
-                hidden_states = torch.cat((first_h, other_h), dim=2)
-            else:
-                hidden_states = first_h
-
-        # If the input is bfloat16, we cast back to bfloat16
-        if dtype == torch.bfloat16:
-            hidden_states = hidden_states.to(dtype)
-
-        if self.use_conv:
-            if self.name == "conv":
-                hidden_states = self.conv(hidden_states)
-            else:
-                hidden_states = self.Conv2d_0(hidden_states)
-
-        return hidden_states
-
-
-class DownsampleCausal3D(nn.Module):
-    """
-    A 3D downsampling layer with an optional convolution.
-    """
-
-    def __init__(
-        self,
-        channels: int,
-        use_conv: bool = False,
-        out_channels: Optional[int] = None,
-        padding: int = 1,
-        name: str = "conv",
-        kernel_size=3,
-        norm_type=None,
-        eps=None,
-        elementwise_affine=None,
-        bias=True,
-        stride=2,
-    ):
-        super().__init__()
-        self.channels = channels
-        self.out_channels = out_channels or channels
-        self.use_conv = use_conv
-        self.padding = padding
-        stride = stride
-        self.name = name
-
-        if norm_type == "ln_norm":
-            self.norm = nn.LayerNorm(channels, eps, elementwise_affine)
-        elif norm_type == "rms_norm":
-            self.norm = RMSNorm(channels, eps, elementwise_affine)
-        elif norm_type is None:
-            self.norm = None
-        else:
-            raise ValueError(f"unknown norm_type: {norm_type}")
-
-        if use_conv:
-            conv = CausalConv3d(
-                self.channels, self.out_channels, kernel_size=kernel_size, stride=stride, bias=bias
-            )
-        else:
-            raise NotImplementedError
-
-        if name == "conv":
-            self.Conv2d_0 = conv
-            self.conv = conv
-        elif name == "Conv2d_0":
-            self.conv = conv
-        else:
-            self.conv = conv
-
-    def forward(self, hidden_states: torch.FloatTensor, scale: float = 1.0) -> torch.FloatTensor:
-        assert hidden_states.shape[1] == self.channels
-
-        if self.norm is not None:
-            hidden_states = self.norm(hidden_states.permute(0, 2, 3, 1)).permute(0, 3, 1, 2)
-
-        assert hidden_states.shape[1] == self.channels
-
-        hidden_states = self.conv(hidden_states)
-
-        return hidden_states
-
-
-class ResnetBlockCausal3D(nn.Module):
-    r"""
-    A Resnet block.
-    """
-
-    def __init__(
-        self,
-        *,
-        in_channels: int,
-        out_channels: Optional[int] = None,
-        conv_shortcut: bool = False,
-        dropout: float = 0.0,
-        temb_channels: int = 512,
-        groups: int = 32,
-        groups_out: Optional[int] = None,
-        pre_norm: bool = True,
-        eps: float = 1e-6,
-        non_linearity: str = "swish",
-        skip_time_act: bool = False,
-        # default, scale_shift, ada_group, spatial
-        time_embedding_norm: str = "default",
-        kernel: Optional[torch.FloatTensor] = None,
-        output_scale_factor: float = 1.0,
-        use_in_shortcut: Optional[bool] = None,
-        up: bool = False,
-        down: bool = False,
-        conv_shortcut_bias: bool = True,
-        conv_3d_out_channels: Optional[int] = None,
-    ):
-        super().__init__()
-        self.pre_norm = pre_norm
-        self.pre_norm = True
-        self.in_channels = in_channels
-        out_channels = in_channels if out_channels is None else out_channels
-        self.out_channels = out_channels
-        self.use_conv_shortcut = conv_shortcut
-        self.up = up
-        self.down = down
-        self.output_scale_factor = output_scale_factor
-        self.time_embedding_norm = time_embedding_norm
-        self.skip_time_act = skip_time_act
-
-        linear_cls = nn.Linear
-
-        if groups_out is None:
-            groups_out = groups
-
-        if self.time_embedding_norm == "ada_group":
-            self.norm1 = AdaGroupNorm(temb_channels, in_channels, groups, eps=eps)
-        elif self.time_embedding_norm == "spatial":
-            self.norm1 = SpatialNorm(in_channels, temb_channels)
-        else:
-            self.norm1 = torch.nn.GroupNorm(num_groups=groups, num_channels=in_channels, eps=eps, affine=True)
-
-        self.conv1 = CausalConv3d(in_channels, out_channels, kernel_size=3, stride=1)
-
-        if temb_channels is not None:
-            if self.time_embedding_norm == "default":
-                self.time_emb_proj = linear_cls(temb_channels, out_channels)
-            elif self.time_embedding_norm == "scale_shift":
-                self.time_emb_proj = linear_cls(temb_channels, 2 * out_channels)
-            elif self.time_embedding_norm == "ada_group" or self.time_embedding_norm == "spatial":
-                self.time_emb_proj = None
-            else:
-                raise ValueError(f"Unknown time_embedding_norm : {self.time_embedding_norm} ")
-        else:
-            self.time_emb_proj = None
-
-        if self.time_embedding_norm == "ada_group":
-            self.norm2 = AdaGroupNorm(temb_channels, out_channels, groups_out, eps=eps)
-        elif self.time_embedding_norm == "spatial":
-            self.norm2 = SpatialNorm(out_channels, temb_channels)
-        else:
-            self.norm2 = torch.nn.GroupNorm(num_groups=groups_out, num_channels=out_channels, eps=eps, affine=True)
-
-        self.dropout = torch.nn.Dropout(dropout)
-        conv_3d_out_channels = conv_3d_out_channels or out_channels
-        self.conv2 = CausalConv3d(out_channels, conv_3d_out_channels, kernel_size=3, stride=1)
-
-        self.nonlinearity = get_activation(non_linearity)
-
-        self.upsample = self.downsample = None
-        if self.up:
-            self.upsample = UpsampleCausal3D(in_channels, use_conv=False)
-        elif self.down:
-            self.downsample = DownsampleCausal3D(in_channels, use_conv=False, name="op")
-
-        self.use_in_shortcut = self.in_channels != conv_3d_out_channels if use_in_shortcut is None else use_in_shortcut
-
-        self.conv_shortcut = None
-        if self.use_in_shortcut:
-            self.conv_shortcut = CausalConv3d(
-                in_channels,
-                conv_3d_out_channels,
-                kernel_size=1,
-                stride=1,
-                bias=conv_shortcut_bias,
-            )
-
-    def forward(
-        self,
-        input_tensor: torch.FloatTensor,
-        temb: torch.FloatTensor,
-        scale: float = 1.0,
-    ) -> torch.FloatTensor:
-        hidden_states = input_tensor
-
-        if self.time_embedding_norm == "ada_group" or self.time_embedding_norm == "spatial":
-            hidden_states = self.norm1(hidden_states, temb)
-        else:
-            hidden_states = self.norm1(hidden_states)
-
-        hidden_states = self.nonlinearity(hidden_states)
-
-        if self.upsample is not None:
-            # upsample_nearest_nhwc fails with large batch sizes. see https://github.com/huggingface/diffusers/issues/984
-            if hidden_states.shape[0] >= 64:
-                input_tensor = input_tensor.contiguous()
-                hidden_states = hidden_states.contiguous()
-            input_tensor = (
-                self.upsample(input_tensor, scale=scale)
-            )
-            hidden_states = (
-                self.upsample(hidden_states, scale=scale)
-            )
-        elif self.downsample is not None:
-            input_tensor = (
-                self.downsample(input_tensor, scale=scale)
-            )
-            hidden_states = (
-                self.downsample(hidden_states, scale=scale)
-            )
-
-        hidden_states = self.conv1(hidden_states)
-
-        if self.time_emb_proj is not None:
-            if not self.skip_time_act:
-                temb = self.nonlinearity(temb)
-            temb = (
-                self.time_emb_proj(temb, scale)[:, :, None, None]
-            )
-
-        if temb is not None and self.time_embedding_norm == "default":
-            hidden_states = hidden_states + temb
-
-        if self.time_embedding_norm == "ada_group" or self.time_embedding_norm == "spatial":
-            hidden_states = self.norm2(hidden_states, temb)
-        else:
-            hidden_states = self.norm2(hidden_states)
-
-        if temb is not None and self.time_embedding_norm == "scale_shift":
-            scale, shift = torch.chunk(temb, 2, dim=1)
-            hidden_states = hidden_states * (1 + scale) + shift
-
-        hidden_states = self.nonlinearity(hidden_states)
-
-        hidden_states = self.dropout(hidden_states)
-        hidden_states = self.conv2(hidden_states)
-
-        if self.conv_shortcut is not None:
-            input_tensor = (
-                self.conv_shortcut(input_tensor)
-            )
-
-        output_tensor = (input_tensor + hidden_states) / self.output_scale_factor
-
-        return output_tensor
-
-
-def get_down_block3d(
-    down_block_type: str,
-    num_layers: int,
-    in_channels: int,
-    out_channels: int,
-    temb_channels: int,
-    add_downsample: bool,
-    downsample_stride: int,
-    resnet_eps: float,
-    resnet_act_fn: str,
-    transformer_layers_per_block: int = 1,
-    num_attention_heads: Optional[int] = None,
-    resnet_groups: Optional[int] = None,
-    cross_attention_dim: Optional[int] = None,
-    downsample_padding: Optional[int] = None,
-    dual_cross_attention: bool = False,
-    use_linear_projection: bool = False,
-    only_cross_attention: bool = False,
-    upcast_attention: bool = False,
-    resnet_time_scale_shift: str = "default",
-    attention_type: str = "default",
-    resnet_skip_time_act: bool = False,
-    resnet_out_scale_factor: float = 1.0,
-    cross_attention_norm: Optional[str] = None,
-    attention_head_dim: Optional[int] = None,
-    downsample_type: Optional[str] = None,
-    dropout: float = 0.0,
-):
-    # If attn head dim is not defined, we default it to the number of heads
-    if attention_head_dim is None:
-        logger.warning(
-            f"It is recommended to provide `attention_head_dim` when calling `get_down_block`. Defaulting `attention_head_dim` to {num_attention_heads}."
-        )
-        attention_head_dim = num_attention_heads
-
-    down_block_type = down_block_type[7:] if down_block_type.startswith("UNetRes") else down_block_type
-    if down_block_type == "DownEncoderBlockCausal3D":
-        return DownEncoderBlockCausal3D(
-            num_layers=num_layers,
-            in_channels=in_channels,
-            out_channels=out_channels,
-            dropout=dropout,
-            add_downsample=add_downsample,
-            downsample_stride=downsample_stride,
-            resnet_eps=resnet_eps,
-            resnet_act_fn=resnet_act_fn,
-            resnet_groups=resnet_groups,
-            downsample_padding=downsample_padding,
-            resnet_time_scale_shift=resnet_time_scale_shift,
-        )
-    raise ValueError(f"{down_block_type} does not exist.")
-
-
-def get_up_block3d(
-    up_block_type: str,
-    num_layers: int,
-    in_channels: int,
-    out_channels: int,
-    prev_output_channel: int,
-    temb_channels: int,
-    add_upsample: bool,
-    upsample_scale_factor: Tuple,
-    resnet_eps: float,
-    resnet_act_fn: str,
-    resolution_idx: Optional[int] = None,
-    transformer_layers_per_block: int = 1,
-    num_attention_heads: Optional[int] = None,
-    resnet_groups: Optional[int] = None,
-    cross_attention_dim: Optional[int] = None,
-    dual_cross_attention: bool = False,
-    use_linear_projection: bool = False,
-    only_cross_attention: bool = False,
-    upcast_attention: bool = False,
-    resnet_time_scale_shift: str = "default",
-    attention_type: str = "default",
-    resnet_skip_time_act: bool = False,
-    resnet_out_scale_factor: float = 1.0,
-    cross_attention_norm: Optional[str] = None,
-    attention_head_dim: Optional[int] = None,
-    upsample_type: Optional[str] = None,
-    dropout: float = 0.0,
-) -> nn.Module:
-    # If attn head dim is not defined, we default it to the number of heads
-    if attention_head_dim is None:
-        logger.warning(
-            f"It is recommended to provide `attention_head_dim` when calling `get_up_block`. Defaulting `attention_head_dim` to {num_attention_heads}."
-        )
-        attention_head_dim = num_attention_heads
-
-    up_block_type = up_block_type[7:] if up_block_type.startswith("UNetRes") else up_block_type
-    if up_block_type == "UpDecoderBlockCausal3D":
-        return UpDecoderBlockCausal3D(
-            num_layers=num_layers,
-            in_channels=in_channels,
-            out_channels=out_channels,
-            resolution_idx=resolution_idx,
-            dropout=dropout,
-            add_upsample=add_upsample,
-            upsample_scale_factor=upsample_scale_factor,
-            resnet_eps=resnet_eps,
-            resnet_act_fn=resnet_act_fn,
-            resnet_groups=resnet_groups,
-            resnet_time_scale_shift=resnet_time_scale_shift,
-            temb_channels=temb_channels,
-        )
-    raise ValueError(f"{up_block_type} does not exist.")
-
-
-class UNetMidBlockCausal3D(nn.Module):
-    """
-    A 3D UNet mid-block [`UNetMidBlockCausal3D`] with multiple residual blocks and optional attention blocks.
-    """
-
-    def __init__(
-        self,
-        in_channels: int,
-        temb_channels: int,
-        dropout: float = 0.0,
-        num_layers: int = 1,
-        resnet_eps: float = 1e-6,
-        resnet_time_scale_shift: str = "default",  # default, spatial
-        resnet_act_fn: str = "swish",
-        resnet_groups: int = 32,
-        attn_groups: Optional[int] = None,
-        resnet_pre_norm: bool = True,
-        add_attention: bool = True,
-        attention_head_dim: int = 1,
-        output_scale_factor: float = 1.0,
-    ):
-        super().__init__()
-        resnet_groups = resnet_groups if resnet_groups is not None else min(in_channels // 4, 32)
-        self.add_attention = add_attention
-
-        if attn_groups is None:
-            attn_groups = resnet_groups if resnet_time_scale_shift == "default" else None
-
-        # there is always at least one resnet
-        resnets = [
-            ResnetBlockCausal3D(
-                in_channels=in_channels,
-                out_channels=in_channels,
-                temb_channels=temb_channels,
-                eps=resnet_eps,
-                groups=resnet_groups,
-                dropout=dropout,
-                time_embedding_norm=resnet_time_scale_shift,
-                non_linearity=resnet_act_fn,
-                output_scale_factor=output_scale_factor,
-                pre_norm=resnet_pre_norm,
-            )
-        ]
-        attentions = []
-
-        if attention_head_dim is None:
-            logger.warning(
-                f"It is not recommend to pass `attention_head_dim=None`. Defaulting `attention_head_dim` to `in_channels`: {in_channels}."
-            )
-            attention_head_dim = in_channels
-
-        for _ in range(num_layers):
-            if self.add_attention:
-                attentions.append(
-                    Attention(
-                        in_channels,
-                        heads=in_channels // attention_head_dim,
-                        dim_head=attention_head_dim,
-                        rescale_output_factor=output_scale_factor,
-                        eps=resnet_eps,
-                        norm_num_groups=attn_groups,
-                        spatial_norm_dim=temb_channels if resnet_time_scale_shift == "spatial" else None,
-                        residual_connection=True,
-                        bias=True,
-                        upcast_softmax=True,
-                        _from_deprecated_attn_block=True,
-                    )
-                )
-            else:
-                attentions.append(None)
-
-            resnets.append(
-                ResnetBlockCausal3D(
-                    in_channels=in_channels,
-                    out_channels=in_channels,
-                    temb_channels=temb_channels,
-                    eps=resnet_eps,
-                    groups=resnet_groups,
-                    dropout=dropout,
-                    time_embedding_norm=resnet_time_scale_shift,
-                    non_linearity=resnet_act_fn,
-                    output_scale_factor=output_scale_factor,
-                    pre_norm=resnet_pre_norm,
-                )
-            )
-
-        self.attentions = nn.ModuleList(attentions)
-        self.resnets = nn.ModuleList(resnets)
-
-    def forward(self, hidden_states: torch.FloatTensor, temb: Optional[torch.FloatTensor] = None) -> torch.FloatTensor:
-        hidden_states = self.resnets[0](hidden_states, temb)
-        for attn, resnet in zip(self.attentions, self.resnets[1:]):
-            if attn is not None:
-                B, C, T, H, W = hidden_states.shape
-                hidden_states = rearrange(hidden_states, "b c f h w -> b (f h w) c")
-                attention_mask = prepare_causal_attention_mask(
-                    T, H * W, hidden_states.dtype, hidden_states.device, batch_size=B
-                )
-                hidden_states = attn(hidden_states, temb=temb, attention_mask=attention_mask)
-                hidden_states = rearrange(hidden_states, "b (f h w) c -> b c f h w", f=T, h=H, w=W)
-            hidden_states = resnet(hidden_states, temb)
-
-        return hidden_states
-
-
-class DownEncoderBlockCausal3D(nn.Module):
-    def __init__(
-        self,
-        in_channels: int,
-        out_channels: int,
-        dropout: float = 0.0,
-        num_layers: int = 1,
-        resnet_eps: float = 1e-6,
-        resnet_time_scale_shift: str = "default",
-        resnet_act_fn: str = "swish",
-        resnet_groups: int = 32,
-        resnet_pre_norm: bool = True,
-        output_scale_factor: float = 1.0,
-        add_downsample: bool = True,
-        downsample_stride: int = 2,
-        downsample_padding: int = 1,
-    ):
-        super().__init__()
-        resnets = []
-
-        for i in range(num_layers):
-            in_channels = in_channels if i == 0 else out_channels
-            resnets.append(
-                ResnetBlockCausal3D(
-                    in_channels=in_channels,
-                    out_channels=out_channels,
-                    temb_channels=None,
-                    eps=resnet_eps,
-                    groups=resnet_groups,
-                    dropout=dropout,
-                    time_embedding_norm=resnet_time_scale_shift,
-                    non_linearity=resnet_act_fn,
-                    output_scale_factor=output_scale_factor,
-                    pre_norm=resnet_pre_norm,
-                )
-            )
-
-        self.resnets = nn.ModuleList(resnets)
-
-        if add_downsample:
-            self.downsamplers = nn.ModuleList(
-                [
-                    DownsampleCausal3D(
-                        out_channels,
-                        use_conv=True,
-                        out_channels=out_channels,
-                        padding=downsample_padding,
-                        name="op",
-                        stride=downsample_stride,
-                    )
-                ]
-            )
-        else:
-            self.downsamplers = None
-
-    def forward(self, hidden_states: torch.FloatTensor, scale: float = 1.0) -> torch.FloatTensor:
-        for resnet in self.resnets:
-            hidden_states = resnet(hidden_states, temb=None, scale=scale)
-
-        if self.downsamplers is not None:
-            for downsampler in self.downsamplers:
-                hidden_states = downsampler(hidden_states, scale)
-
-        return hidden_states
-
-
-class UpDecoderBlockCausal3D(nn.Module):
-    def __init__(
-        self,
-        in_channels: int,
-        out_channels: int,
-        resolution_idx: Optional[int] = None,
-        dropout: float = 0.0,
-        num_layers: int = 1,
-        resnet_eps: float = 1e-6,
-        resnet_time_scale_shift: str = "default",  # default, spatial
-        resnet_act_fn: str = "swish",
-        resnet_groups: int = 32,
-        resnet_pre_norm: bool = True,
-        output_scale_factor: float = 1.0,
-        add_upsample: bool = True,
-        upsample_scale_factor=(2, 2, 2),
-        temb_channels: Optional[int] = None,
-    ):
-        super().__init__()
-        resnets = []
-
-        for i in range(num_layers):
-            input_channels = in_channels if i == 0 else out_channels
-
-            resnets.append(
-                ResnetBlockCausal3D(
-                    in_channels=input_channels,
-                    out_channels=out_channels,
-                    temb_channels=temb_channels,
-                    eps=resnet_eps,
-                    groups=resnet_groups,
-                    dropout=dropout,
-                    time_embedding_norm=resnet_time_scale_shift,
-                    non_linearity=resnet_act_fn,
-                    output_scale_factor=output_scale_factor,
-                    pre_norm=resnet_pre_norm,
-                )
-            )
-
-        self.resnets = nn.ModuleList(resnets)
-
-        if add_upsample:
-            self.upsamplers = nn.ModuleList(
-                [
-                    UpsampleCausal3D(
-                        out_channels,
-                        use_conv=True,
-                        out_channels=out_channels,
-                        upsample_factor=upsample_scale_factor,
-                    )
-                ]
-            )
-        else:
-            self.upsamplers = None
-
-        self.resolution_idx = resolution_idx
-
-    def forward(
-        self, hidden_states: torch.FloatTensor, temb: Optional[torch.FloatTensor] = None, scale: float = 1.0
-    ) -> torch.FloatTensor:
-        for resnet in self.resnets:
-            hidden_states = resnet(hidden_states, temb=temb, scale=scale)
-
-        if self.upsamplers is not None:
-            for upsampler in self.upsamplers:
-                hidden_states = upsampler(hidden_states)
-
-        return hidden_states
diff --git a/videotuna/models/hunyuan/hyvideo_i2v/vae/vae.py b/videotuna/models/hunyuan/hyvideo_i2v/vae/vae.py
deleted file mode 100644
index 4002d1f7..00000000
--- a/videotuna/models/hunyuan/hyvideo_i2v/vae/vae.py
+++ /dev/null
@@ -1,355 +0,0 @@
-from dataclasses import dataclass
-from typing import Optional, Tuple
-
-import numpy as np
-import torch
-import torch.nn as nn
-
-from diffusers.utils import BaseOutput, is_torch_version
-from diffusers.utils.torch_utils import randn_tensor
-from diffusers.models.attention_processor import SpatialNorm
-from .unet_causal_3d_blocks import (
-    CausalConv3d,
-    UNetMidBlockCausal3D,
-    get_down_block3d,
-    get_up_block3d,
-)
-
-
-@dataclass
-class DecoderOutput(BaseOutput):
-    r"""
-    Output of decoding method.
-
-    Args:
-        sample (`torch.FloatTensor` of shape `(batch_size, num_channels, height, width)`):
-            The decoded output sample from the last layer of the model.
-    """
-
-    sample: torch.FloatTensor
-
-
-class EncoderCausal3D(nn.Module):
-    r"""
-    The `EncoderCausal3D` layer of a variational autoencoder that encodes its input into a latent representation.
-    """
-
-    def __init__(
-        self,
-        in_channels: int = 3,
-        out_channels: int = 3,
-        down_block_types: Tuple[str, ...] = ("DownEncoderBlockCausal3D",),
-        block_out_channels: Tuple[int, ...] = (64,),
-        layers_per_block: int = 2,
-        norm_num_groups: int = 32,
-        act_fn: str = "silu",
-        double_z: bool = True,
-        mid_block_add_attention=True,
-        time_compression_ratio: int = 4,
-        spatial_compression_ratio: int = 8,
-    ):
-        super().__init__()
-        self.layers_per_block = layers_per_block
-
-        self.conv_in = CausalConv3d(in_channels, block_out_channels[0], kernel_size=3, stride=1)
-        self.mid_block = None
-        self.down_blocks = nn.ModuleList([])
-
-        # down
-        output_channel = block_out_channels[0]
-        for i, down_block_type in enumerate(down_block_types):
-            input_channel = output_channel
-            output_channel = block_out_channels[i]
-            is_final_block = i == len(block_out_channels) - 1
-            num_spatial_downsample_layers = int(np.log2(spatial_compression_ratio))
-            num_time_downsample_layers = int(np.log2(time_compression_ratio))
-
-            if time_compression_ratio == 4:
-                add_spatial_downsample = bool(i < num_spatial_downsample_layers)
-                add_time_downsample = bool(
-                    i >= (len(block_out_channels) - 1 - num_time_downsample_layers)
-                    and not is_final_block
-                )
-            else:
-                raise ValueError(f"Unsupported time_compression_ratio: {time_compression_ratio}.")
-
-            downsample_stride_HW = (2, 2) if add_spatial_downsample else (1, 1)
-            downsample_stride_T = (2,) if add_time_downsample else (1,)
-            downsample_stride = tuple(downsample_stride_T + downsample_stride_HW)
-            down_block = get_down_block3d(
-                down_block_type,
-                num_layers=self.layers_per_block,
-                in_channels=input_channel,
-                out_channels=output_channel,
-                add_downsample=bool(add_spatial_downsample or add_time_downsample),
-                downsample_stride=downsample_stride,
-                resnet_eps=1e-6,
-                downsample_padding=0,
-                resnet_act_fn=act_fn,
-                resnet_groups=norm_num_groups,
-                attention_head_dim=output_channel,
-                temb_channels=None,
-            )
-            self.down_blocks.append(down_block)
-
-        # mid
-        self.mid_block = UNetMidBlockCausal3D(
-            in_channels=block_out_channels[-1],
-            resnet_eps=1e-6,
-            resnet_act_fn=act_fn,
-            output_scale_factor=1,
-            resnet_time_scale_shift="default",
-            attention_head_dim=block_out_channels[-1],
-            resnet_groups=norm_num_groups,
-            temb_channels=None,
-            add_attention=mid_block_add_attention,
-        )
-
-        # out
-        self.conv_norm_out = nn.GroupNorm(num_channels=block_out_channels[-1], num_groups=norm_num_groups, eps=1e-6)
-        self.conv_act = nn.SiLU()
-
-        conv_out_channels = 2 * out_channels if double_z else out_channels
-        self.conv_out = CausalConv3d(block_out_channels[-1], conv_out_channels, kernel_size=3)
-
-    def forward(self, sample: torch.FloatTensor) -> torch.FloatTensor:
-        r"""The forward method of the `EncoderCausal3D` class."""
-        assert len(sample.shape) == 5, "The input tensor should have 5 dimensions"
-
-        sample = self.conv_in(sample)
-
-        # down
-        for down_block in self.down_blocks:
-            sample = down_block(sample)
-
-        # middle
-        sample = self.mid_block(sample)
-
-        # post-process
-        sample = self.conv_norm_out(sample)
-        sample = self.conv_act(sample)
-        sample = self.conv_out(sample)
-
-        return sample
-
-
-class DecoderCausal3D(nn.Module):
-    r"""
-    The `DecoderCausal3D` layer of a variational autoencoder that decodes its latent representation into an output sample.
-    """
-
-    def __init__(
-        self,
-        in_channels: int = 3,
-        out_channels: int = 3,
-        up_block_types: Tuple[str, ...] = ("UpDecoderBlockCausal3D",),
-        block_out_channels: Tuple[int, ...] = (64,),
-        layers_per_block: int = 2,
-        norm_num_groups: int = 32,
-        act_fn: str = "silu",
-        norm_type: str = "group",  # group, spatial
-        mid_block_add_attention=True,
-        time_compression_ratio: int = 4,
-        spatial_compression_ratio: int = 8,
-    ):
-        super().__init__()
-        self.layers_per_block = layers_per_block
-
-        self.conv_in = CausalConv3d(in_channels, block_out_channels[-1], kernel_size=3, stride=1)
-        self.mid_block = None
-        self.up_blocks = nn.ModuleList([])
-
-        temb_channels = in_channels if norm_type == "spatial" else None
-
-        # mid
-        self.mid_block = UNetMidBlockCausal3D(
-            in_channels=block_out_channels[-1],
-            resnet_eps=1e-6,
-            resnet_act_fn=act_fn,
-            output_scale_factor=1,
-            resnet_time_scale_shift="default" if norm_type == "group" else norm_type,
-            attention_head_dim=block_out_channels[-1],
-            resnet_groups=norm_num_groups,
-            temb_channels=temb_channels,
-            add_attention=mid_block_add_attention,
-        )
-
-        # up
-        reversed_block_out_channels = list(reversed(block_out_channels))
-        output_channel = reversed_block_out_channels[0]
-        for i, up_block_type in enumerate(up_block_types):
-            prev_output_channel = output_channel
-            output_channel = reversed_block_out_channels[i]
-            is_final_block = i == len(block_out_channels) - 1
-            num_spatial_upsample_layers = int(np.log2(spatial_compression_ratio))
-            num_time_upsample_layers = int(np.log2(time_compression_ratio))
-
-            if time_compression_ratio == 4:
-                add_spatial_upsample = bool(i < num_spatial_upsample_layers)
-                add_time_upsample = bool(
-                    i >= len(block_out_channels) - 1 - num_time_upsample_layers
-                    and not is_final_block
-                )
-            else:
-                raise ValueError(f"Unsupported time_compression_ratio: {time_compression_ratio}.")
-
-            upsample_scale_factor_HW = (2, 2) if add_spatial_upsample else (1, 1)
-            upsample_scale_factor_T = (2,) if add_time_upsample else (1,)
-            upsample_scale_factor = tuple(upsample_scale_factor_T + upsample_scale_factor_HW)
-            up_block = get_up_block3d(
-                up_block_type,
-                num_layers=self.layers_per_block + 1,
-                in_channels=prev_output_channel,
-                out_channels=output_channel,
-                prev_output_channel=None,
-                add_upsample=bool(add_spatial_upsample or add_time_upsample),
-                upsample_scale_factor=upsample_scale_factor,
-                resnet_eps=1e-6,
-                resnet_act_fn=act_fn,
-                resnet_groups=norm_num_groups,
-                attention_head_dim=output_channel,
-                temb_channels=temb_channels,
-                resnet_time_scale_shift=norm_type,
-            )
-            self.up_blocks.append(up_block)
-            prev_output_channel = output_channel
-
-        # out
-        if norm_type == "spatial":
-            self.conv_norm_out = SpatialNorm(block_out_channels[0], temb_channels)
-        else:
-            self.conv_norm_out = nn.GroupNorm(num_channels=block_out_channels[0], num_groups=norm_num_groups, eps=1e-6)
-        self.conv_act = nn.SiLU()
-        self.conv_out = CausalConv3d(block_out_channels[0], out_channels, kernel_size=3)
-
-        self.gradient_checkpointing = False
-
-    def forward(
-        self,
-        sample: torch.FloatTensor,
-        latent_embeds: Optional[torch.FloatTensor] = None,
-    ) -> torch.FloatTensor:
-        r"""The forward method of the `DecoderCausal3D` class."""
-        assert len(sample.shape) == 5, "The input tensor should have 5 dimensions."
-
-        sample = self.conv_in(sample)
-
-        upscale_dtype = next(iter(self.up_blocks.parameters())).dtype
-        if self.training and self.gradient_checkpointing:
-
-            def create_custom_forward(module):
-                def custom_forward(*inputs):
-                    return module(*inputs)
-
-                return custom_forward
-
-            if is_torch_version(">=", "1.11.0"):
-                # middle
-                sample = torch.utils.checkpoint.checkpoint(
-                    create_custom_forward(self.mid_block),
-                    sample,
-                    latent_embeds,
-                    use_reentrant=False,
-                )
-                sample = sample.to(upscale_dtype)
-
-                # up
-                for up_block in self.up_blocks:
-                    sample = torch.utils.checkpoint.checkpoint(
-                        create_custom_forward(up_block),
-                        sample,
-                        latent_embeds,
-                        use_reentrant=False,
-                    )
-            else:
-                # middle
-                sample = torch.utils.checkpoint.checkpoint(
-                    create_custom_forward(self.mid_block), sample, latent_embeds
-                )
-                sample = sample.to(upscale_dtype)
-
-                # up
-                for up_block in self.up_blocks:
-                    sample = torch.utils.checkpoint.checkpoint(create_custom_forward(up_block), sample, latent_embeds)
-        else:
-            # middle
-            sample = self.mid_block(sample, latent_embeds)
-            sample = sample.to(upscale_dtype)
-
-            # up
-            for up_block in self.up_blocks:
-                sample = up_block(sample, latent_embeds)
-
-        # post-process
-        if latent_embeds is None:
-            sample = self.conv_norm_out(sample)
-        else:
-            sample = self.conv_norm_out(sample, latent_embeds)
-        sample = self.conv_act(sample)
-        sample = self.conv_out(sample)
-
-        return sample
-
-
-class DiagonalGaussianDistribution(object):
-    def __init__(self, parameters: torch.Tensor, deterministic: bool = False):
-        if parameters.ndim == 3:
-            dim = 2  # (B, L, C)
-        elif parameters.ndim == 5 or parameters.ndim == 4:
-            dim = 1  # (B, C, T, H ,W) / (B, C, H, W)
-        else:
-            raise NotImplementedError
-        self.parameters = parameters
-        self.mean, self.logvar = torch.chunk(parameters, 2, dim=dim)
-        self.logvar = torch.clamp(self.logvar, -30.0, 20.0)
-        self.deterministic = deterministic
-        self.std = torch.exp(0.5 * self.logvar)
-        self.var = torch.exp(self.logvar)
-        if self.deterministic:
-            self.var = self.std = torch.zeros_like(
-                self.mean, device=self.parameters.device, dtype=self.parameters.dtype
-            )
-
-    def sample(self, generator: Optional[torch.Generator] = None) -> torch.FloatTensor:
-        # make sure sample is on the same device as the parameters and has same dtype
-        sample = randn_tensor(
-            self.mean.shape,
-            generator=generator,
-            device=self.parameters.device,
-            dtype=self.parameters.dtype,
-        )
-        x = self.mean + self.std * sample
-        return x
-
-    def kl(self, other: "DiagonalGaussianDistribution" = None) -> torch.Tensor:
-        if self.deterministic:
-            return torch.Tensor([0.0])
-        else:
-            reduce_dim = list(range(1, self.mean.ndim))
-            if other is None:
-                return 0.5 * torch.sum(
-                    torch.pow(self.mean, 2) + self.var - 1.0 - self.logvar,
-                    dim=reduce_dim,
-                )
-            else:
-                return 0.5 * torch.sum(
-                    torch.pow(self.mean - other.mean, 2) / other.var
-                    + self.var / other.var
-                    - 1.0
-                    - self.logvar
-                    + other.logvar,
-                    dim=reduce_dim,
-                )
-
-    def nll(self, sample: torch.Tensor, dims: Tuple[int, ...] = [1, 2, 3]) -> torch.Tensor:
-        if self.deterministic:
-            return torch.Tensor([0.0])
-        logtwopi = np.log(2.0 * np.pi)
-        return 0.5 * torch.sum(
-            logtwopi + self.logvar +
-            torch.pow(sample - self.mean, 2) / self.var,
-            dim=dims,
-        )
-
-    def mode(self) -> torch.Tensor:
-        return self.mean
diff --git a/videotuna/models/hunyuan/hyvideo_t2v/__init__.py b/videotuna/models/hunyuan/hyvideo_t2v/__init__.py
deleted file mode 100644
index e69de29b..00000000
diff --git a/videotuna/models/hunyuan/hyvideo_t2v/config.py b/videotuna/models/hunyuan/hyvideo_t2v/config.py
deleted file mode 100644
index a4f2cb4d..00000000
--- a/videotuna/models/hunyuan/hyvideo_t2v/config.py
+++ /dev/null
@@ -1,398 +0,0 @@
-import argparse
-from .constants import *
-import re
-from .modules.models import HUNYUAN_VIDEO_CONFIG
-
-
-def parse_args(namespace=None):
-    parser = argparse.ArgumentParser(description="HunyuanVideo inference script")
-
-    parser = add_network_args(parser)
-    parser = add_extra_models_args(parser)
-    parser = add_denoise_schedule_args(parser)
-    parser = add_inference_args(parser)
-    parser = add_parallel_args(parser)
-
-    args = parser.parse_args(namespace=namespace)
-    args = sanity_check_args(args)
-
-    return args
-
-
-def add_network_args(parser: argparse.ArgumentParser):
-    group = parser.add_argument_group(title="HunyuanVideo network args")
-
-    # Main model
-    group.add_argument(
-        "--model",
-        type=str,
-        choices=list(HUNYUAN_VIDEO_CONFIG.keys()),
-        default="HYVideo-T/2-cfgdistill",
-    )
-    group.add_argument(
-        "--latent-channels",
-        type=str,
-        default=16,
-        help="Number of latent channels of DiT. If None, it will be determined by `vae`. If provided, "
-        "it still needs to match the latent channels of the VAE model.",
-    )
-    group.add_argument(
-        "--precision",
-        type=str,
-        default="bf16",
-        choices=PRECISIONS,
-        help="Precision mode. Options: fp32, fp16, bf16. Applied to the backbone model and optimizer.",
-    )
-
-    # RoPE
-    group.add_argument(
-        "--rope-theta", type=int, default=256, help="Theta used in RoPE."
-    )
-    return parser
-
-
-def add_extra_models_args(parser: argparse.ArgumentParser):
-    group = parser.add_argument_group(
-        title="Extra models args, including vae, text encoders and tokenizers)"
-    )
-
-    # - VAE
-    group.add_argument(
-        "--vae",
-        type=str,
-        default="884-16c-hy",
-        choices=list(VAE_PATH),
-        help="Name of the VAE model.",
-    )
-    group.add_argument(
-        "--vae-precision",
-        type=str,
-        default="fp16",
-        choices=PRECISIONS,
-        help="Precision mode for the VAE model.",
-    )
-    group.add_argument(
-        "--vae-tiling",
-        action="store_true",
-        help="Enable tiling for the VAE model to save GPU memory.",
-    )
-    group.set_defaults(vae_tiling=True)
-
-    group.add_argument(
-        "--text-encoder",
-        type=str,
-        default="llm",
-        choices=list(TEXT_ENCODER_PATH),
-        help="Name of the text encoder model.",
-    )
-    group.add_argument(
-        "--text-encoder-precision",
-        type=str,
-        default="fp16",
-        choices=PRECISIONS,
-        help="Precision mode for the text encoder model.",
-    )
-    group.add_argument(
-        "--text-states-dim",
-        type=int,
-        default=4096,
-        help="Dimension of the text encoder hidden states.",
-    )
-    group.add_argument(
-        "--text-len", type=int, default=256, help="Maximum length of the text input."
-    )
-    group.add_argument(
-        "--tokenizer",
-        type=str,
-        default="llm",
-        choices=list(TOKENIZER_PATH),
-        help="Name of the tokenizer model.",
-    )
-    group.add_argument(
-        "--prompt-template",
-        type=str,
-        default="dit-llm-encode",
-        choices=PROMPT_TEMPLATE,
-        help="Image prompt template for the decoder-only text encoder model.",
-    )
-    group.add_argument(
-        "--prompt-template-video",
-        type=str,
-        default="dit-llm-encode-video",
-        choices=PROMPT_TEMPLATE,
-        help="Video prompt template for the decoder-only text encoder model.",
-    )
-    group.add_argument(
-        "--hidden-state-skip-layer",
-        type=int,
-        default=2,
-        help="Skip layer for hidden states.",
-    )
-    group.add_argument(
-        "--apply-final-norm",
-        action="store_true",
-        help="Apply final normalization to the used text encoder hidden states.",
-    )
-
-    # - CLIP
-    group.add_argument(
-        "--text-encoder-2",
-        type=str,
-        default="clipL",
-        choices=list(TEXT_ENCODER_PATH),
-        help="Name of the second text encoder model.",
-    )
-    group.add_argument(
-        "--text-encoder-precision-2",
-        type=str,
-        default="fp16",
-        choices=PRECISIONS,
-        help="Precision mode for the second text encoder model.",
-    )
-    group.add_argument(
-        "--text-states-dim-2",
-        type=int,
-        default=768,
-        help="Dimension of the second text encoder hidden states.",
-    )
-    group.add_argument(
-        "--tokenizer-2",
-        type=str,
-        default="clipL",
-        choices=list(TOKENIZER_PATH),
-        help="Name of the second tokenizer model.",
-    )
-    group.add_argument(
-        "--text-len-2",
-        type=int,
-        default=77,
-        help="Maximum length of the second text input.",
-    )
-
-    return parser
-
-
-def add_denoise_schedule_args(parser: argparse.ArgumentParser):
-    group = parser.add_argument_group(title="Denoise schedule args")
-
-    group.add_argument(
-        "--denoise-type",
-        type=str,
-        default="flow",
-        help="Denoise type for noised inputs.",
-    )
-
-    # Flow Matching
-    group.add_argument(
-        "--flow-shift",
-        type=float,
-        default=7.0,
-        help="Shift factor for flow matching schedulers.",
-    )
-    group.add_argument(
-        "--flow-reverse",
-        action="store_true",
-        help="If reverse, learning/sampling from t=1 -> t=0.",
-    )
-    group.add_argument(
-        "--flow-solver",
-        type=str,
-        default="euler",
-        help="Solver for flow matching.",
-    )
-    group.add_argument(
-        "--use-linear-quadratic-schedule",
-        action="store_true",
-        help="Use linear quadratic schedule for flow matching."
-        "Following MovieGen (https://ai.meta.com/static-resource/movie-gen-research-paper)",
-    )
-    group.add_argument(
-        "--linear-schedule-end",
-        type=int,
-        default=25,
-        help="End step for linear quadratic schedule for flow matching.",
-    )
-
-    return parser
-
-
-def add_inference_args(parser: argparse.ArgumentParser):
-    group = parser.add_argument_group(title="Inference args")
-
-    # ======================== Model loads ========================
-    group.add_argument(
-        "--model-base",
-        type=str,
-        default="ckpts",
-        help="Root path of all the models, including t2v models and extra models.",
-    )
-    group.add_argument(
-        "--dit-weight",
-        type=str,
-        default="ckpts/hunyuan-video-t2v-720p/transformers/mp_rank_00_model_states.pt",
-        help="Path to the HunyuanVideo model. If None, search the model in the args.model_root."
-        "1. If it is a file, load the model directly."
-        "2. If it is a directory, search the model in the directory. Support two types of models: "
-        "1) named `pytorch_model_*.pt`"
-        "2) named `*_model_states.pt`, where * can be `mp_rank_00`.",
-    )
-    group.add_argument(
-        "--model-resolution",
-        type=str,
-        default="540p",
-        choices=["540p", "720p"],
-        help="Root path of all the models, including t2v models and extra models.",
-    )
-    group.add_argument(
-        "--load-key",
-        type=str,
-        default="module",
-        help="Key to load the model states. 'module' for the main model, 'ema' for the EMA model.",
-    )
-    group.add_argument(
-        "--use-cpu-offload",
-        action="store_true",
-        help="Use CPU offload for the model load.",
-    )
-
-    # ======================== Inference general setting ========================
-    group.add_argument(
-        "--batch-size",
-        type=int,
-        default=1,
-        help="Batch size for inference and evaluation.",
-    )
-    group.add_argument(
-        "--infer-steps",
-        type=int,
-        default=50,
-        help="Number of denoising steps for inference.",
-    )
-    group.add_argument(
-        "--disable-autocast",
-        action="store_true",
-        help="Disable autocast for denoising loop and vae decoding in pipeline sampling.",
-    )
-    group.add_argument(
-        "--save-path",
-        type=str,
-        default="./results",
-        help="Path to save the generated samples.",
-    )
-    group.add_argument(
-        "--save-path-suffix",
-        type=str,
-        default="",
-        help="Suffix for the directory of saved samples.",
-    )
-    group.add_argument(
-        "--name-suffix",
-        type=str,
-        default="",
-        help="Suffix for the names of saved samples.",
-    )
-    group.add_argument(
-        "--num-videos",
-        type=int,
-        default=1,
-        help="Number of videos to generate for each prompt.",
-    )
-    # ---sample size---
-    group.add_argument(
-        "--video-size",
-        type=int,
-        nargs="+",
-        default=(720, 1280),
-        help="Video size for training. If a single value is provided, it will be used for both height "
-        "and width. If two values are provided, they will be used for height and width "
-        "respectively.",
-    )
-    group.add_argument(
-        "--video-length",
-        type=int,
-        default=129,
-        help="How many frames to sample from a video. if using 3d vae, the number should be 4n+1",
-    )
-    # --- prompt ---
-    group.add_argument(
-        "--prompt",
-        type=str,
-        default=None,
-        help="Prompt for sampling during evaluation.",
-    )
-    group.add_argument(
-        "--seed-type",
-        type=str,
-        default="auto",
-        choices=["file", "random", "fixed", "auto"],
-        help="Seed type for evaluation. If file, use the seed from the CSV file. If random, generate a "
-        "random seed. If fixed, use the fixed seed given by `--seed`. If auto, `csv` will use the "
-        "seed column if available, otherwise use the fixed `seed` value. `prompt` will use the "
-        "fixed `seed` value.",
-    )
-    group.add_argument("--seed", type=int, default=None, help="Seed for evaluation.")
-
-    # Classifier-Free Guidance
-    group.add_argument(
-        "--neg-prompt", type=str, default=None, help="Negative prompt for sampling."
-    )
-    group.add_argument(
-        "--cfg-scale", type=float, default=1.0, help="Classifier free guidance scale."
-    )
-    group.add_argument(
-        "--embedded-cfg-scale",
-        type=float,
-        default=6.0,
-        help="Embeded classifier free guidance scale.",
-    )
-
-    group.add_argument(
-        "--use-fp8",
-        action="store_true",
-        help="Enable use fp8 for inference acceleration."
-    )
-
-    group.add_argument(
-        "--reproduce",
-        action="store_true",
-        help="Enable reproducibility by setting random seeds and deterministic algorithms.",
-    )
-
-    return parser
-
-
-def add_parallel_args(parser: argparse.ArgumentParser):
-    group = parser.add_argument_group(title="Parallel args")
-
-    # ======================== Model loads ========================
-    group.add_argument(
-        "--ulysses-degree",
-        type=int,
-        default=1,
-        help="Ulysses degree.",
-    )
-    group.add_argument(
-        "--ring-degree",
-        type=int,
-        default=1,
-        help="Ulysses degree.",
-    )
-
-    return parser
-
-
-def sanity_check_args(args):
-    # VAE channels
-    vae_pattern = r"\d{2,3}-\d{1,2}c-\w+"
-    if not re.match(vae_pattern, args.vae):
-        raise ValueError(
-            f"Invalid VAE model: {args.vae}. Must be in the format of '{vae_pattern}'."
-        )
-    vae_channels = int(args.vae.split("-")[1][:-1])
-    if args.latent_channels is None:
-        args.latent_channels = vae_channels
-    if vae_channels != args.latent_channels:
-        raise ValueError(
-            f"Latent channels ({args.latent_channels}) must match the VAE channels ({vae_channels})."
-        )
-    return args
diff --git a/videotuna/models/hunyuan/hyvideo_t2v/constants.py b/videotuna/models/hunyuan/hyvideo_t2v/constants.py
deleted file mode 100644
index 2ccfe4de..00000000
--- a/videotuna/models/hunyuan/hyvideo_t2v/constants.py
+++ /dev/null
@@ -1,90 +0,0 @@
-import os
-import torch
-
-__all__ = [
-    "C_SCALE",
-    "PROMPT_TEMPLATE",
-    "MODEL_BASE",
-    "PRECISIONS",
-    "NORMALIZATION_TYPE",
-    "ACTIVATION_TYPE",
-    "VAE_PATH",
-    "TEXT_ENCODER_PATH",
-    "TOKENIZER_PATH",
-    "TEXT_PROJECTION",
-    "DATA_TYPE",
-    "NEGATIVE_PROMPT",
-]
-
-PRECISION_TO_TYPE = {
-    'fp32': torch.float32,
-    'fp16': torch.float16,
-    'bf16': torch.bfloat16,
-}
-
-# =================== Constant Values =====================
-# Computation scale factor, 1P = 1_000_000_000_000_000. Tensorboard will display the value in PetaFLOPS to avoid
-# overflow error when tensorboard logging values.
-C_SCALE = 1_000_000_000_000_000
-
-# When using decoder-only models, we must provide a prompt template to instruct the text encoder
-# on how to generate the text.
-# --------------------------------------------------------------------
-PROMPT_TEMPLATE_ENCODE = (
-    "<|start_header_id|>system<|end_header_id|>\n\nDescribe the image by detailing the color, shape, size, texture, "
-    "quantity, text, spatial relationships of the objects and background:<|eot_id|>"
-    "<|start_header_id|>user<|end_header_id|>\n\n{}<|eot_id|>"
-) 
-PROMPT_TEMPLATE_ENCODE_VIDEO = (
-    "<|start_header_id|>system<|end_header_id|>\n\nDescribe the video by detailing the following aspects: "
-    "1. The main content and theme of the video."
-    "2. The color, shape, size, texture, quantity, text, and spatial relationships of the objects."
-    "3. Actions, events, behaviors temporal relationships, physical movement changes of the objects."
-    "4. background environment, light, style and atmosphere."
-    "5. camera angles, movements, and transitions used in the video:<|eot_id|>"
-    "<|start_header_id|>user<|end_header_id|>\n\n{}<|eot_id|>"
-)  
-
-NEGATIVE_PROMPT = "Aerial view, aerial view, overexposed, low quality, deformation, a poor composition, bad hands, bad teeth, bad eyes, bad limbs, distortion"
-
-PROMPT_TEMPLATE = {
-    "dit-llm-encode": {
-        "template": PROMPT_TEMPLATE_ENCODE,
-        "crop_start": 36,
-    },
-    "dit-llm-encode-video": {
-        "template": PROMPT_TEMPLATE_ENCODE_VIDEO,
-        "crop_start": 95,
-    },
-}
-
-# ======================= Model ======================
-PRECISIONS = {"fp32", "fp16", "bf16"}
-NORMALIZATION_TYPE = {"layer", "rms"}
-ACTIVATION_TYPE = {"relu", "silu", "gelu", "gelu_tanh"}
-
-# =================== Model Path =====================
-MODEL_BASE = os.getenv("MODEL_BASE", "./checkpoints-local/hunyuanvideo")
-
-# =================== Data =======================
-DATA_TYPE = {"image", "video", "image_video"}
-
-# 3D VAE
-VAE_PATH = {"884-16c-hy": f"{MODEL_BASE}/hunyuan-video-t2v-720p/vae"}
-
-# Text Encoder
-TEXT_ENCODER_PATH = {
-    "clipL": f"{MODEL_BASE}/text_encoder_2",
-    "llm": f"{MODEL_BASE}/text_encoder",
-}
-
-# Tokenizer
-TOKENIZER_PATH = {
-    "clipL": f"{MODEL_BASE}/text_encoder_2",
-    "llm": f"{MODEL_BASE}/text_encoder",
-}
-
-TEXT_PROJECTION = {
-    "linear",  # Default, an nn.Linear() layer
-    "single_refiner",  # Single TokenRefiner. Refer to LI-DiT
-}
diff --git a/videotuna/models/hunyuan/hyvideo_t2v/diffusion/__init__.py b/videotuna/models/hunyuan/hyvideo_t2v/diffusion/__init__.py
deleted file mode 100644
index 2141aa3d..00000000
--- a/videotuna/models/hunyuan/hyvideo_t2v/diffusion/__init__.py
+++ /dev/null
@@ -1,2 +0,0 @@
-from .pipelines import HunyuanVideoPipeline
-from .schedulers import FlowMatchDiscreteScheduler
diff --git a/videotuna/models/hunyuan/hyvideo_t2v/diffusion/pipelines/__init__.py b/videotuna/models/hunyuan/hyvideo_t2v/diffusion/pipelines/__init__.py
deleted file mode 100644
index e44cb619..00000000
--- a/videotuna/models/hunyuan/hyvideo_t2v/diffusion/pipelines/__init__.py
+++ /dev/null
@@ -1 +0,0 @@
-from .pipeline_hunyuan_video import HunyuanVideoPipeline
diff --git a/videotuna/models/hunyuan/hyvideo_t2v/diffusion/pipelines/pipeline_hunyuan_video.py b/videotuna/models/hunyuan/hyvideo_t2v/diffusion/pipelines/pipeline_hunyuan_video.py
deleted file mode 100644
index c1293161..00000000
--- a/videotuna/models/hunyuan/hyvideo_t2v/diffusion/pipelines/pipeline_hunyuan_video.py
+++ /dev/null
@@ -1,1100 +0,0 @@
-# Copyright 2024 The HuggingFace Team. All rights reserved.
-#
-# Licensed under the Apache License, Version 2.0 (the "License");
-# you may not use this file except in compliance with the License.
-# You may obtain a copy of the License at
-#
-#     http://www.apache.org/licenses/LICENSE-2.0
-#
-# Unless required by applicable law or agreed to in writing, software
-# distributed under the License is distributed on an "AS IS" BASIS,
-# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
-# See the License for the specific language governing permissions and
-# limitations under the License.
-# ==============================================================================
-#
-# Modified from diffusers==0.29.2
-#
-# ==============================================================================
-import inspect
-from typing import Any, Callable, Dict, List, Optional, Union, Tuple
-import torch
-import torch.distributed as dist
-import numpy as np
-from dataclasses import dataclass
-from packaging import version
-
-from diffusers.callbacks import MultiPipelineCallbacks, PipelineCallback
-from diffusers.configuration_utils import FrozenDict
-from diffusers.image_processor import VaeImageProcessor
-from diffusers.loaders import LoraLoaderMixin, TextualInversionLoaderMixin
-from diffusers.models import AutoencoderKL
-from diffusers.models.lora import adjust_lora_scale_text_encoder
-from diffusers.schedulers import KarrasDiffusionSchedulers
-from diffusers.utils import (
-    USE_PEFT_BACKEND,
-    deprecate,
-    logging,
-    replace_example_docstring,
-    scale_lora_layers,
-    unscale_lora_layers,
-)
-from diffusers.utils.torch_utils import randn_tensor
-from diffusers.pipelines.pipeline_utils import DiffusionPipeline
-from diffusers.utils import BaseOutput
-
-from ...constants import PRECISION_TO_TYPE
-from ...vae.autoencoder_kl_causal_3d import AutoencoderKLCausal3D
-from ...text_encoder import TextEncoder
-from ...modules import HYVideoDiffusionTransformer
-
-logger = logging.get_logger(__name__)  # pylint: disable=invalid-name
-
-EXAMPLE_DOC_STRING = """"""
-
-
-def rescale_noise_cfg(noise_cfg, noise_pred_text, guidance_rescale=0.0):
-    """
-    Rescale `noise_cfg` according to `guidance_rescale`. Based on findings of [Common Diffusion Noise Schedules and
-    Sample Steps are Flawed](https://arxiv.org/pdf/2305.08891.pdf). See Section 3.4
-    """
-    std_text = noise_pred_text.std(
-        dim=list(range(1, noise_pred_text.ndim)), keepdim=True
-    )
-    std_cfg = noise_cfg.std(dim=list(range(1, noise_cfg.ndim)), keepdim=True)
-    # rescale the results from guidance (fixes overexposure)
-    noise_pred_rescaled = noise_cfg * (std_text / std_cfg)
-    # mix with the original results from guidance by factor guidance_rescale to avoid "plain looking" images
-    noise_cfg = (
-        guidance_rescale * noise_pred_rescaled + (1 - guidance_rescale) * noise_cfg
-    )
-    return noise_cfg
-
-
-def retrieve_timesteps(
-    scheduler,
-    num_inference_steps: Optional[int] = None,
-    device: Optional[Union[str, torch.device]] = None,
-    timesteps: Optional[List[int]] = None,
-    sigmas: Optional[List[float]] = None,
-    **kwargs,
-):
-    """
-    Calls the scheduler's `set_timesteps` method and retrieves timesteps from the scheduler after the call. Handles
-    custom timesteps. Any kwargs will be supplied to `scheduler.set_timesteps`.
-
-    Args:
-        scheduler (`SchedulerMixin`):
-            The scheduler to get timesteps from.
-        num_inference_steps (`int`):
-            The number of diffusion steps used when generating samples with a pre-trained model. If used, `timesteps`
-            must be `None`.
-        device (`str` or `torch.device`, *optional*):
-            The device to which the timesteps should be moved to. If `None`, the timesteps are not moved.
-        timesteps (`List[int]`, *optional*):
-            Custom timesteps used to override the timestep spacing strategy of the scheduler. If `timesteps` is passed,
-            `num_inference_steps` and `sigmas` must be `None`.
-        sigmas (`List[float]`, *optional*):
-            Custom sigmas used to override the timestep spacing strategy of the scheduler. If `sigmas` is passed,
-            `num_inference_steps` and `timesteps` must be `None`.
-
-    Returns:
-        `Tuple[torch.Tensor, int]`: A tuple where the first element is the timestep schedule from the scheduler and the
-        second element is the number of inference steps.
-    """
-    if timesteps is not None and sigmas is not None:
-        raise ValueError(
-            "Only one of `timesteps` or `sigmas` can be passed. Please choose one to set custom values"
-        )
-    if timesteps is not None:
-        accepts_timesteps = "timesteps" in set(
-            inspect.signature(scheduler.set_timesteps).parameters.keys()
-        )
-        if not accepts_timesteps:
-            raise ValueError(
-                f"The current scheduler class {scheduler.__class__}'s `set_timesteps` does not support custom"
-                f" timestep schedules. Please check whether you are using the correct scheduler."
-            )
-        scheduler.set_timesteps(timesteps=timesteps, device=device, **kwargs)
-        timesteps = scheduler.timesteps
-        num_inference_steps = len(timesteps)
-    elif sigmas is not None:
-        accept_sigmas = "sigmas" in set(
-            inspect.signature(scheduler.set_timesteps).parameters.keys()
-        )
-        if not accept_sigmas:
-            raise ValueError(
-                f"The current scheduler class {scheduler.__class__}'s `set_timesteps` does not support custom"
-                f" sigmas schedules. Please check whether you are using the correct scheduler."
-            )
-        scheduler.set_timesteps(sigmas=sigmas, device=device, **kwargs)
-        timesteps = scheduler.timesteps
-        num_inference_steps = len(timesteps)
-    else:
-        scheduler.set_timesteps(num_inference_steps, device=device, **kwargs)
-        timesteps = scheduler.timesteps
-    return timesteps, num_inference_steps
-
-
-@dataclass
-class HunyuanVideoPipelineOutput(BaseOutput):
-    videos: Union[torch.Tensor, np.ndarray]
-
-
-class HunyuanVideoPipeline(DiffusionPipeline):
-    r"""
-    Pipeline for text-to-video generation using HunyuanVideo.
-
-    This model inherits from [`DiffusionPipeline`]. Check the superclass documentation for the generic methods
-    implemented for all pipelines (downloading, saving, running on a particular device, etc.).
-
-    Args:
-        vae ([`AutoencoderKL`]):
-            Variational Auto-Encoder (VAE) model to encode and decode images to and from latent representations.
-        text_encoder ([`TextEncoder`]):
-            Frozen text-encoder.
-        text_encoder_2 ([`TextEncoder`]):
-            Frozen text-encoder_2.
-        transformer ([`HYVideoDiffusionTransformer`]):
-            A `HYVideoDiffusionTransformer` to denoise the encoded video latents.
-        scheduler ([`SchedulerMixin`]):
-            A scheduler to be used in combination with `unet` to denoise the encoded image latents.
-    """
-
-    model_cpu_offload_seq = "text_encoder->text_encoder_2->transformer->vae"
-    _optional_components = ["text_encoder_2"]
-    _exclude_from_cpu_offload = ["transformer"]
-    _callback_tensor_inputs = ["latents", "prompt_embeds", "negative_prompt_embeds"]
-
-    def __init__(
-        self,
-        vae: AutoencoderKL,
-        text_encoder: TextEncoder,
-        transformer: HYVideoDiffusionTransformer,
-        scheduler: KarrasDiffusionSchedulers,
-        text_encoder_2: Optional[TextEncoder] = None,
-        progress_bar_config: Dict[str, Any] = None,
-        args=None,
-    ):
-        super().__init__()
-
-        # ==========================================================================================
-        if progress_bar_config is None:
-            progress_bar_config = {}
-        if not hasattr(self, "_progress_bar_config"):
-            self._progress_bar_config = {}
-        self._progress_bar_config.update(progress_bar_config)
-
-        self.args = args
-        # ==========================================================================================
-
-        if (
-            hasattr(scheduler.config, "steps_offset")
-            and scheduler.config.steps_offset != 1
-        ):
-            deprecation_message = (
-                f"The configuration file of this scheduler: {scheduler} is outdated. `steps_offset`"
-                f" should be set to 1 instead of {scheduler.config.steps_offset}. Please make sure "
-                "to update the config accordingly as leaving `steps_offset` might led to incorrect results"
-                " in future versions. If you have downloaded this checkpoint from the Hugging Face Hub,"
-                " it would be very nice if you could open a Pull request for the `scheduler/scheduler_config.json`"
-                " file"
-            )
-            deprecate(
-                "steps_offset!=1", "1.0.0", deprecation_message, standard_warn=False
-            )
-            new_config = dict(scheduler.config)
-            new_config["steps_offset"] = 1
-            scheduler._internal_dict = FrozenDict(new_config)
-
-        if (
-            hasattr(scheduler.config, "clip_sample")
-            and scheduler.config.clip_sample is True
-        ):
-            deprecation_message = (
-                f"The configuration file of this scheduler: {scheduler} has not set the configuration `clip_sample`."
-                " `clip_sample` should be set to False in the configuration file. Please make sure to update the"
-                " config accordingly as not setting `clip_sample` in the config might lead to incorrect results in"
-                " future versions. If you have downloaded this checkpoint from the Hugging Face Hub, it would be very"
-                " nice if you could open a Pull request for the `scheduler/scheduler_config.json` file"
-            )
-            deprecate(
-                "clip_sample not set", "1.0.0", deprecation_message, standard_warn=False
-            )
-            new_config = dict(scheduler.config)
-            new_config["clip_sample"] = False
-            scheduler._internal_dict = FrozenDict(new_config)
-
-        self.register_modules(
-            vae=vae,
-            text_encoder=text_encoder,
-            transformer=transformer,
-            scheduler=scheduler,
-            text_encoder_2=text_encoder_2,
-        )
-        self.vae_scale_factor = 2 ** (len(self.vae.config.block_out_channels) - 1)
-        self.image_processor = VaeImageProcessor(vae_scale_factor=self.vae_scale_factor)
-
-    def encode_prompt(
-        self,
-        prompt,
-        device,
-        num_videos_per_prompt,
-        do_classifier_free_guidance,
-        negative_prompt=None,
-        prompt_embeds: Optional[torch.Tensor] = None,
-        attention_mask: Optional[torch.Tensor] = None,
-        negative_prompt_embeds: Optional[torch.Tensor] = None,
-        negative_attention_mask: Optional[torch.Tensor] = None,
-        lora_scale: Optional[float] = None,
-        clip_skip: Optional[int] = None,
-        text_encoder: Optional[TextEncoder] = None,
-        data_type: Optional[str] = "image",
-    ):
-        r"""
-        Encodes the prompt into text encoder hidden states.
-
-        Args:
-            prompt (`str` or `List[str]`, *optional*):
-                prompt to be encoded
-            device: (`torch.device`):
-                torch device
-            num_videos_per_prompt (`int`):
-                number of videos that should be generated per prompt
-            do_classifier_free_guidance (`bool`):
-                whether to use classifier free guidance or not
-            negative_prompt (`str` or `List[str]`, *optional*):
-                The prompt or prompts not to guide the video generation. If not defined, one has to pass
-                `negative_prompt_embeds` instead. Ignored when not using guidance (i.e., ignored if `guidance_scale` is
-                less than `1`).
-            prompt_embeds (`torch.Tensor`, *optional*):
-                Pre-generated text embeddings. Can be used to easily tweak text inputs, *e.g.* prompt weighting. If not
-                provided, text embeddings will be generated from `prompt` input argument.
-            attention_mask (`torch.Tensor`, *optional*):
-            negative_prompt_embeds (`torch.Tensor`, *optional*):
-                Pre-generated negative text embeddings. Can be used to easily tweak text inputs, *e.g.* prompt
-                weighting. If not provided, negative_prompt_embeds will be generated from `negative_prompt` input
-                argument.
-            negative_attention_mask (`torch.Tensor`, *optional*):
-            lora_scale (`float`, *optional*):
-                A LoRA scale that will be applied to all LoRA layers of the text encoder if LoRA layers are loaded.
-            clip_skip (`int`, *optional*):
-                Number of layers to be skipped from CLIP while computing the prompt embeddings. A value of 1 means that
-                the output of the pre-final layer will be used for computing the prompt embeddings.
-            text_encoder (TextEncoder, *optional*):
-            data_type (`str`, *optional*):
-        """
-        if text_encoder is None:
-            text_encoder = self.text_encoder
-
-        # set lora scale so that monkey patched LoRA
-        # function of text encoder can correctly access it
-        if lora_scale is not None and isinstance(self, LoraLoaderMixin):
-            self._lora_scale = lora_scale
-
-            # dynamically adjust the LoRA scale
-            if not USE_PEFT_BACKEND:
-                adjust_lora_scale_text_encoder(text_encoder.model, lora_scale)
-            else:
-                scale_lora_layers(text_encoder.model, lora_scale)
-
-        if prompt is not None and isinstance(prompt, str):
-            batch_size = 1
-        elif prompt is not None and isinstance(prompt, list):
-            batch_size = len(prompt)
-        else:
-            batch_size = prompt_embeds.shape[0]
-
-        if prompt_embeds is None:
-            # textual inversion: process multi-vector tokens if necessary
-            if isinstance(self, TextualInversionLoaderMixin):
-                prompt = self.maybe_convert_prompt(prompt, text_encoder.tokenizer)
-
-            text_inputs = text_encoder.text2tokens(prompt, data_type=data_type)
-
-            if clip_skip is None:
-                prompt_outputs = text_encoder.encode(
-                    text_inputs, data_type=data_type, device=device
-                )
-                prompt_embeds = prompt_outputs.hidden_state
-            else:
-                prompt_outputs = text_encoder.encode(
-                    text_inputs,
-                    output_hidden_states=True,
-                    data_type=data_type,
-                    device=device,
-                )
-                # Access the `hidden_states` first, that contains a tuple of
-                # all the hidden states from the encoder layers. Then index into
-                # the tuple to access the hidden states from the desired layer.
-                prompt_embeds = prompt_outputs.hidden_states_list[-(clip_skip + 1)]
-                # We also need to apply the final LayerNorm here to not mess with the
-                # representations. The `last_hidden_states` that we typically use for
-                # obtaining the final prompt representations passes through the LayerNorm
-                # layer.
-                prompt_embeds = text_encoder.model.text_model.final_layer_norm(
-                    prompt_embeds
-                )
-
-            attention_mask = prompt_outputs.attention_mask
-            if attention_mask is not None:
-                attention_mask = attention_mask.to(device)
-                bs_embed, seq_len = attention_mask.shape
-                attention_mask = attention_mask.repeat(1, num_videos_per_prompt)
-                attention_mask = attention_mask.view(
-                    bs_embed * num_videos_per_prompt, seq_len
-                )
-
-        if text_encoder is not None:
-            prompt_embeds_dtype = text_encoder.dtype
-        elif self.transformer is not None:
-            prompt_embeds_dtype = self.transformer.dtype
-        else:
-            prompt_embeds_dtype = prompt_embeds.dtype
-
-        prompt_embeds = prompt_embeds.to(dtype=prompt_embeds_dtype, device=device)
-
-        if prompt_embeds.ndim == 2:
-            bs_embed, _ = prompt_embeds.shape
-            # duplicate text embeddings for each generation per prompt, using mps friendly method
-            prompt_embeds = prompt_embeds.repeat(1, num_videos_per_prompt)
-            prompt_embeds = prompt_embeds.view(bs_embed * num_videos_per_prompt, -1)
-        else:
-            bs_embed, seq_len, _ = prompt_embeds.shape
-            # duplicate text embeddings for each generation per prompt, using mps friendly method
-            prompt_embeds = prompt_embeds.repeat(1, num_videos_per_prompt, 1)
-            prompt_embeds = prompt_embeds.view(
-                bs_embed * num_videos_per_prompt, seq_len, -1
-            )
-
-        # get unconditional embeddings for classifier free guidance
-        if do_classifier_free_guidance and negative_prompt_embeds is None:
-            uncond_tokens: List[str]
-            if negative_prompt is None:
-                uncond_tokens = [""] * batch_size
-            elif prompt is not None and type(prompt) is not type(negative_prompt):
-                raise TypeError(
-                    f"`negative_prompt` should be the same type to `prompt`, but got {type(negative_prompt)} !="
-                    f" {type(prompt)}."
-                )
-            elif isinstance(negative_prompt, str):
-                uncond_tokens = [negative_prompt]
-            elif batch_size != len(negative_prompt):
-                raise ValueError(
-                    f"`negative_prompt`: {negative_prompt} has batch size {len(negative_prompt)}, but `prompt`:"
-                    f" {prompt} has batch size {batch_size}. Please make sure that passed `negative_prompt` matches"
-                    " the batch size of `prompt`."
-                )
-            else:
-                uncond_tokens = negative_prompt
-
-            # textual inversion: process multi-vector tokens if necessary
-            if isinstance(self, TextualInversionLoaderMixin):
-                uncond_tokens = self.maybe_convert_prompt(
-                    uncond_tokens, text_encoder.tokenizer
-                )
-
-            # max_length = prompt_embeds.shape[1]
-            uncond_input = text_encoder.text2tokens(uncond_tokens, data_type=data_type)
-
-            negative_prompt_outputs = text_encoder.encode(
-                uncond_input, data_type=data_type, device=device
-            )
-            negative_prompt_embeds = negative_prompt_outputs.hidden_state
-
-            negative_attention_mask = negative_prompt_outputs.attention_mask
-            if negative_attention_mask is not None:
-                negative_attention_mask = negative_attention_mask.to(device)
-                _, seq_len = negative_attention_mask.shape
-                negative_attention_mask = negative_attention_mask.repeat(
-                    1, num_videos_per_prompt
-                )
-                negative_attention_mask = negative_attention_mask.view(
-                    batch_size * num_videos_per_prompt, seq_len
-                )
-
-        if do_classifier_free_guidance:
-            # duplicate unconditional embeddings for each generation per prompt, using mps friendly method
-            seq_len = negative_prompt_embeds.shape[1]
-
-            negative_prompt_embeds = negative_prompt_embeds.to(
-                dtype=prompt_embeds_dtype, device=device
-            )
-
-            if negative_prompt_embeds.ndim == 2:
-                negative_prompt_embeds = negative_prompt_embeds.repeat(
-                    1, num_videos_per_prompt
-                )
-                negative_prompt_embeds = negative_prompt_embeds.view(
-                    batch_size * num_videos_per_prompt, -1
-                )
-            else:
-                negative_prompt_embeds = negative_prompt_embeds.repeat(
-                    1, num_videos_per_prompt, 1
-                )
-                negative_prompt_embeds = negative_prompt_embeds.view(
-                    batch_size * num_videos_per_prompt, seq_len, -1
-                )
-
-        if text_encoder is not None:
-            if isinstance(self, LoraLoaderMixin) and USE_PEFT_BACKEND:
-                # Retrieve the original scale by scaling back the LoRA layers
-                unscale_lora_layers(text_encoder.model, lora_scale)
-
-        return (
-            prompt_embeds,
-            negative_prompt_embeds,
-            attention_mask,
-            negative_attention_mask,
-        )
-
-    def decode_latents(self, latents, enable_tiling=True):
-        deprecation_message = "The decode_latents method is deprecated and will be removed in 1.0.0. Please use VaeImageProcessor.postprocess(...) instead"
-        deprecate("decode_latents", "1.0.0", deprecation_message, standard_warn=False)
-
-        latents = 1 / self.vae.config.scaling_factor * latents
-        if enable_tiling:
-            self.vae.enable_tiling()
-            image = self.vae.decode(latents, return_dict=False)[0]
-        else:
-            image = self.vae.decode(latents, return_dict=False)[0]
-        image = (image / 2 + 0.5).clamp(0, 1)
-        # we always cast to float32 as this does not cause significant overhead and is compatible with bfloat16
-        if image.ndim == 4:
-            image = image.cpu().permute(0, 2, 3, 1).float()
-        else:
-            image = image.cpu().float()
-        return image
-
-    def prepare_extra_func_kwargs(self, func, kwargs):
-        # prepare extra kwargs for the scheduler step, since not all schedulers have the same signature
-        # eta (η) is only used with the DDIMScheduler, it will be ignored for other schedulers.
-        # eta corresponds to η in DDIM paper: https://arxiv.org/abs/2010.02502
-        # and should be between [0, 1]
-        extra_step_kwargs = {}
-
-        for k, v in kwargs.items():
-            accepts = k in set(inspect.signature(func).parameters.keys())
-            if accepts:
-                extra_step_kwargs[k] = v
-        return extra_step_kwargs
-
-    def check_inputs(
-        self,
-        prompt,
-        height,
-        width,
-        video_length,
-        callback_steps,
-        negative_prompt=None,
-        prompt_embeds=None,
-        negative_prompt_embeds=None,
-        callback_on_step_end_tensor_inputs=None,
-        vae_ver="88-4c-sd",
-    ):
-        if height % 8 != 0 or width % 8 != 0:
-            raise ValueError(
-                f"`height` and `width` have to be divisible by 8 but are {height} and {width}."
-            )
-
-        if video_length is not None:
-            if "884" in vae_ver:
-                if video_length != 1 and (video_length - 1) % 4 != 0:
-                    raise ValueError(
-                        f"`video_length` has to be 1 or a multiple of 4 but is {video_length}."
-                    )
-            elif "888" in vae_ver:
-                if video_length != 1 and (video_length - 1) % 8 != 0:
-                    raise ValueError(
-                        f"`video_length` has to be 1 or a multiple of 8 but is {video_length}."
-                    )
-
-        if callback_steps is not None and (
-            not isinstance(callback_steps, int) or callback_steps <= 0
-        ):
-            raise ValueError(
-                f"`callback_steps` has to be a positive integer but is {callback_steps} of type"
-                f" {type(callback_steps)}."
-            )
-        if callback_on_step_end_tensor_inputs is not None and not all(
-            k in self._callback_tensor_inputs
-            for k in callback_on_step_end_tensor_inputs
-        ):
-            raise ValueError(
-                f"`callback_on_step_end_tensor_inputs` has to be in {self._callback_tensor_inputs}, but found {[k for k in callback_on_step_end_tensor_inputs if k not in self._callback_tensor_inputs]}"
-            )
-
-        if prompt is not None and prompt_embeds is not None:
-            raise ValueError(
-                f"Cannot forward both `prompt`: {prompt} and `prompt_embeds`: {prompt_embeds}. Please make sure to"
-                " only forward one of the two."
-            )
-        elif prompt is None and prompt_embeds is None:
-            raise ValueError(
-                "Provide either `prompt` or `prompt_embeds`. Cannot leave both `prompt` and `prompt_embeds` undefined."
-            )
-        elif prompt is not None and (
-            not isinstance(prompt, str) and not isinstance(prompt, list)
-        ):
-            raise ValueError(
-                f"`prompt` has to be of type `str` or `list` but is {type(prompt)}"
-            )
-
-        if negative_prompt is not None and negative_prompt_embeds is not None:
-            raise ValueError(
-                f"Cannot forward both `negative_prompt`: {negative_prompt} and `negative_prompt_embeds`:"
-                f" {negative_prompt_embeds}. Please make sure to only forward one of the two."
-            )
-
-        if prompt_embeds is not None and negative_prompt_embeds is not None:
-            if prompt_embeds.shape != negative_prompt_embeds.shape:
-                raise ValueError(
-                    "`prompt_embeds` and `negative_prompt_embeds` must have the same shape when passed directly, but"
-                    f" got: `prompt_embeds` {prompt_embeds.shape} != `negative_prompt_embeds`"
-                    f" {negative_prompt_embeds.shape}."
-                )
-
-
-    def prepare_latents(
-        self,
-        batch_size,
-        num_channels_latents,
-        height,
-        width,
-        video_length,
-        dtype,
-        device,
-        generator,
-        latents=None,
-    ):
-        shape = (
-            batch_size,
-            num_channels_latents,
-            video_length,
-            int(height) // self.vae_scale_factor,
-            int(width) // self.vae_scale_factor,
-        )
-        if isinstance(generator, list) and len(generator) != batch_size:
-            raise ValueError(
-                f"You have passed a list of generators of length {len(generator)}, but requested an effective batch"
-                f" size of {batch_size}. Make sure the batch size matches the length of the generators."
-            )
-
-        if latents is None:
-            latents = randn_tensor(
-                shape, generator=generator, device=device, dtype=dtype
-            )
-        else:
-            latents = latents.to(device)
-
-        # Check existence to make it compatible with FlowMatchEulerDiscreteScheduler
-        if hasattr(self.scheduler, "init_noise_sigma"):
-            # scale the initial noise by the standard deviation required by the scheduler
-            latents = latents * self.scheduler.init_noise_sigma
-        return latents
-
-    # Copied from diffusers.pipelines.latent_consistency_models.pipeline_latent_consistency_text2img.LatentConsistencyModelPipeline.get_guidance_scale_embedding
-    def get_guidance_scale_embedding(
-        self,
-        w: torch.Tensor,
-        embedding_dim: int = 512,
-        dtype: torch.dtype = torch.float32,
-    ) -> torch.Tensor:
-        """
-        See https://github.com/google-research/vdm/blob/dc27b98a554f65cdc654b800da5aa1846545d41b/model_vdm.py#L298
-
-        Args:
-            w (`torch.Tensor`):
-                Generate embedding vectors with a specified guidance scale to subsequently enrich timestep embeddings.
-            embedding_dim (`int`, *optional*, defaults to 512):
-                Dimension of the embeddings to generate.
-            dtype (`torch.dtype`, *optional*, defaults to `torch.float32`):
-                Data type of the generated embeddings.
-
-        Returns:
-            `torch.Tensor`: Embedding vectors with shape `(len(w), embedding_dim)`.
-        """
-        assert len(w.shape) == 1
-        w = w * 1000.0
-
-        half_dim = embedding_dim // 2
-        emb = torch.log(torch.tensor(10000.0)) / (half_dim - 1)
-        emb = torch.exp(torch.arange(half_dim, dtype=dtype) * -emb)
-        emb = w.to(dtype)[:, None] * emb[None, :]
-        emb = torch.cat([torch.sin(emb), torch.cos(emb)], dim=1)
-        if embedding_dim % 2 == 1:  # zero pad
-            emb = torch.nn.functional.pad(emb, (0, 1))
-        assert emb.shape == (w.shape[0], embedding_dim)
-        return emb
-
-    @property
-    def guidance_scale(self):
-        return self._guidance_scale
-
-    @property
-    def guidance_rescale(self):
-        return self._guidance_rescale
-
-    @property
-    def clip_skip(self):
-        return self._clip_skip
-
-    # here `guidance_scale` is defined analog to the guidance weight `w` of equation (2)
-    # of the Imagen paper: https://arxiv.org/pdf/2205.11487.pdf . `guidance_scale = 1`
-    # corresponds to doing no classifier free guidance.
-    @property
-    def do_classifier_free_guidance(self):
-        # return self._guidance_scale > 1 and self.transformer.config.time_cond_proj_dim is None
-        return self._guidance_scale > 1
-
-    @property
-    def cross_attention_kwargs(self):
-        return self._cross_attention_kwargs
-
-    @property
-    def num_timesteps(self):
-        return self._num_timesteps
-
-    @property
-    def interrupt(self):
-        return self._interrupt
-
-    @torch.no_grad()
-    @replace_example_docstring(EXAMPLE_DOC_STRING)
-    def __call__(
-        self,
-        prompt: Union[str, List[str]],
-        height: int,
-        width: int,
-        video_length: int,
-        data_type: str = "video",
-        num_inference_steps: int = 50,
-        timesteps: List[int] = None,
-        sigmas: List[float] = None,
-        guidance_scale: float = 7.5,
-        negative_prompt: Optional[Union[str, List[str]]] = None,
-        num_videos_per_prompt: Optional[int] = 1,
-        eta: float = 0.0,
-        generator: Optional[Union[torch.Generator, List[torch.Generator]]] = None,
-        latents: Optional[torch.Tensor] = None,
-        prompt_embeds: Optional[torch.Tensor] = None,
-        attention_mask: Optional[torch.Tensor] = None,
-        negative_prompt_embeds: Optional[torch.Tensor] = None,
-        negative_attention_mask: Optional[torch.Tensor] = None,
-        output_type: Optional[str] = "pil",
-        return_dict: bool = True,
-        cross_attention_kwargs: Optional[Dict[str, Any]] = None,
-        guidance_rescale: float = 0.0,
-        clip_skip: Optional[int] = None,
-        callback_on_step_end: Optional[
-            Union[
-                Callable[[int, int, Dict], None],
-                PipelineCallback,
-                MultiPipelineCallbacks,
-            ]
-        ] = None,
-        callback_on_step_end_tensor_inputs: List[str] = ["latents"],
-        freqs_cis: Tuple[torch.Tensor, torch.Tensor] = None,
-        vae_ver: str = "88-4c-sd",
-        enable_tiling: bool = False,
-        n_tokens: Optional[int] = None,
-        embedded_guidance_scale: Optional[float] = None,
-        **kwargs,
-    ):
-        r"""
-        The call function to the pipeline for generation.
-
-        Args:
-            prompt (`str` or `List[str]`):
-                The prompt or prompts to guide image generation. If not defined, you need to pass `prompt_embeds`.
-            height (`int`):
-                The height in pixels of the generated image.
-            width (`int`):
-                The width in pixels of the generated image.
-            video_length (`int`):
-                The number of frames in the generated video.
-            num_inference_steps (`int`, *optional*, defaults to 50):
-                The number of denoising steps. More denoising steps usually lead to a higher quality image at the
-                expense of slower inference.
-            timesteps (`List[int]`, *optional*):
-                Custom timesteps to use for the denoising process with schedulers which support a `timesteps` argument
-                in their `set_timesteps` method. If not defined, the default behavior when `num_inference_steps` is
-                passed will be used. Must be in descending order.
-            sigmas (`List[float]`, *optional*):
-                Custom sigmas to use for the denoising process with schedulers which support a `sigmas` argument in
-                their `set_timesteps` method. If not defined, the default behavior when `num_inference_steps` is passed
-                will be used.
-            guidance_scale (`float`, *optional*, defaults to 7.5):
-                A higher guidance scale value encourages the model to generate images closely linked to the text
-                `prompt` at the expense of lower image quality. Guidance scale is enabled when `guidance_scale > 1`.
-            negative_prompt (`str` or `List[str]`, *optional*):
-                The prompt or prompts to guide what to not include in image generation. If not defined, you need to
-                pass `negative_prompt_embeds` instead. Ignored when not using guidance (`guidance_scale < 1`).
-            num_videos_per_prompt (`int`, *optional*, defaults to 1):
-                The number of images to generate per prompt.
-            eta (`float`, *optional*, defaults to 0.0):
-                Corresponds to parameter eta (η) from the [DDIM](https://arxiv.org/abs/2010.02502) paper. Only applies
-                to the [`~schedulers.DDIMScheduler`], and is ignored in other schedulers.
-            generator (`torch.Generator` or `List[torch.Generator]`, *optional*):
-                A [`torch.Generator`](https://pytorch.org/docs/stable/generated/torch.Generator.html) to make
-                generation deterministic.
-            latents (`torch.Tensor`, *optional*):
-                Pre-generated noisy latents sampled from a Gaussian distribution, to be used as inputs for image
-                generation. Can be used to tweak the same generation with different prompts. If not provided, a latents
-                tensor is generated by sampling using the supplied random `generator`.
-            prompt_embeds (`torch.Tensor`, *optional*):
-                Pre-generated text embeddings. Can be used to easily tweak text inputs (prompt weighting). If not
-                provided, text embeddings are generated from the `prompt` input argument.
-            negative_prompt_embeds (`torch.Tensor`, *optional*):
-                Pre-generated negative text embeddings. Can be used to easily tweak text inputs (prompt weighting). If
-                not provided, `negative_prompt_embeds` are generated from the `negative_prompt` input argument.
-                
-            output_type (`str`, *optional*, defaults to `"pil"`):
-                The output format of the generated image. Choose between `PIL.Image` or `np.array`.
-            return_dict (`bool`, *optional*, defaults to `True`):
-                Whether or not to return a [`HunyuanVideoPipelineOutput`] instead of a
-                plain tuple.
-            cross_attention_kwargs (`dict`, *optional*):
-                A kwargs dictionary that if specified is passed along to the [`AttentionProcessor`] as defined in
-                [`self.processor`](https://github.com/huggingface/diffusers/blob/main/src/diffusers/models/attention_processor.py).
-            guidance_rescale (`float`, *optional*, defaults to 0.0):
-                Guidance rescale factor from [Common Diffusion Noise Schedules and Sample Steps are
-                Flawed](https://arxiv.org/pdf/2305.08891.pdf). Guidance rescale factor should fix overexposure when
-                using zero terminal SNR.
-            clip_skip (`int`, *optional*):
-                Number of layers to be skipped from CLIP while computing the prompt embeddings. A value of 1 means that
-                the output of the pre-final layer will be used for computing the prompt embeddings.
-            callback_on_step_end (`Callable`, `PipelineCallback`, `MultiPipelineCallbacks`, *optional*):
-                A function or a subclass of `PipelineCallback` or `MultiPipelineCallbacks` that is called at the end of
-                each denoising step during the inference. with the following arguments: `callback_on_step_end(self:
-                DiffusionPipeline, step: int, timestep: int, callback_kwargs: Dict)`. `callback_kwargs` will include a
-                list of all tensors as specified by `callback_on_step_end_tensor_inputs`.
-            callback_on_step_end_tensor_inputs (`List`, *optional*):
-                The list of tensor inputs for the `callback_on_step_end` function. The tensors specified in the list
-                will be passed as `callback_kwargs` argument. You will only be able to include variables listed in the
-                `._callback_tensor_inputs` attribute of your pipeline class.
-
-        Examples:
-
-        Returns:
-            [`~HunyuanVideoPipelineOutput`] or `tuple`:
-                If `return_dict` is `True`, [`HunyuanVideoPipelineOutput`] is returned,
-                otherwise a `tuple` is returned where the first element is a list with the generated images and the
-                second element is a list of `bool`s indicating whether the corresponding generated image contains
-                "not-safe-for-work" (nsfw) content.
-        """
-        callback = kwargs.pop("callback", None)
-        callback_steps = kwargs.pop("callback_steps", None)
-
-        if callback is not None:
-            deprecate(
-                "callback",
-                "1.0.0",
-                "Passing `callback` as an input argument to `__call__` is deprecated, consider using `callback_on_step_end`",
-            )
-        if callback_steps is not None:
-            deprecate(
-                "callback_steps",
-                "1.0.0",
-                "Passing `callback_steps` as an input argument to `__call__` is deprecated, consider using `callback_on_step_end`",
-            )
-
-        if isinstance(callback_on_step_end, (PipelineCallback, MultiPipelineCallbacks)):
-            callback_on_step_end_tensor_inputs = callback_on_step_end.tensor_inputs
-
-        # 0. Default height and width to unet
-        # height = height or self.transformer.config.sample_size * self.vae_scale_factor
-        # width = width or self.transformer.config.sample_size * self.vae_scale_factor
-        # to deal with lora scaling and other possible forward hooks
-
-        # 1. Check inputs. Raise error if not correct
-        self.check_inputs(
-            prompt,
-            height,
-            width,
-            video_length,
-            callback_steps,
-            negative_prompt,
-            prompt_embeds,
-            negative_prompt_embeds,
-            callback_on_step_end_tensor_inputs,
-            vae_ver=vae_ver,
-        )
-
-        self._guidance_scale = guidance_scale
-        self._guidance_rescale = guidance_rescale
-        self._clip_skip = clip_skip
-        self._cross_attention_kwargs = cross_attention_kwargs
-        self._interrupt = False
-
-        # 2. Define call parameters
-        if prompt is not None and isinstance(prompt, str):
-            batch_size = 1
-        elif prompt is not None and isinstance(prompt, list):
-            batch_size = len(prompt)
-        else:
-            batch_size = prompt_embeds.shape[0]
-
-        device = torch.device(f"cuda:{dist.get_rank()}") if dist.is_initialized() else self._execution_device
-
-        # 3. Encode input prompt
-        lora_scale = (
-            self.cross_attention_kwargs.get("scale", None)
-            if self.cross_attention_kwargs is not None
-            else None
-        )
-
-        (
-            prompt_embeds,
-            negative_prompt_embeds,
-            prompt_mask,
-            negative_prompt_mask,
-        ) = self.encode_prompt(
-            prompt,
-            device,
-            num_videos_per_prompt,
-            self.do_classifier_free_guidance,
-            negative_prompt,
-            prompt_embeds=prompt_embeds,
-            attention_mask=attention_mask,
-            negative_prompt_embeds=negative_prompt_embeds,
-            negative_attention_mask=negative_attention_mask,
-            lora_scale=lora_scale,
-            clip_skip=self.clip_skip,
-            data_type=data_type,
-        )
-        if self.text_encoder_2 is not None:
-            (
-                prompt_embeds_2,
-                negative_prompt_embeds_2,
-                prompt_mask_2,
-                negative_prompt_mask_2,
-            ) = self.encode_prompt(
-                prompt,
-                device,
-                num_videos_per_prompt,
-                self.do_classifier_free_guidance,
-                negative_prompt,
-                prompt_embeds=None,
-                attention_mask=None,
-                negative_prompt_embeds=None,
-                negative_attention_mask=None,
-                lora_scale=lora_scale,
-                clip_skip=self.clip_skip,
-                text_encoder=self.text_encoder_2,
-                data_type=data_type,
-            )
-        else:
-            prompt_embeds_2 = None
-            negative_prompt_embeds_2 = None
-            prompt_mask_2 = None
-            negative_prompt_mask_2 = None
-
-        # For classifier free guidance, we need to do two forward passes.
-        # Here we concatenate the unconditional and text embeddings into a single batch
-        # to avoid doing two forward passes
-        if self.do_classifier_free_guidance:
-            prompt_embeds = torch.cat([negative_prompt_embeds, prompt_embeds])
-            if prompt_mask is not None:
-                prompt_mask = torch.cat([negative_prompt_mask, prompt_mask])
-            if prompt_embeds_2 is not None:
-                prompt_embeds_2 = torch.cat([negative_prompt_embeds_2, prompt_embeds_2])
-            if prompt_mask_2 is not None:
-                prompt_mask_2 = torch.cat([negative_prompt_mask_2, prompt_mask_2])
-
-
-        # 4. Prepare timesteps
-        extra_set_timesteps_kwargs = self.prepare_extra_func_kwargs(
-            self.scheduler.set_timesteps, {"n_tokens": n_tokens}
-        )
-        timesteps, num_inference_steps = retrieve_timesteps(
-            self.scheduler,
-            num_inference_steps,
-            device,
-            timesteps,
-            sigmas,
-            **extra_set_timesteps_kwargs,
-        )
-
-        if "884" in vae_ver:
-            video_length = (video_length - 1) // 4 + 1
-        elif "888" in vae_ver:
-            video_length = (video_length - 1) // 8 + 1
-        else:
-            video_length = video_length
-
-        # 5. Prepare latent variables
-        num_channels_latents = self.transformer.config.in_channels
-        latents = self.prepare_latents(
-            batch_size * num_videos_per_prompt,
-            num_channels_latents,
-            height,
-            width,
-            video_length,
-            prompt_embeds.dtype,
-            device,
-            generator,
-            latents,
-        )
-
-        # 6. Prepare extra step kwargs. TODO: Logic should ideally just be moved out of the pipeline
-        extra_step_kwargs = self.prepare_extra_func_kwargs(
-            self.scheduler.step,
-            {"generator": generator, "eta": eta},
-        )
-
-        target_dtype = PRECISION_TO_TYPE[self.args.precision]
-        autocast_enabled = (
-            target_dtype != torch.float32
-        ) and not self.args.disable_autocast
-        vae_dtype = PRECISION_TO_TYPE[self.args.vae_precision]
-        vae_autocast_enabled = (
-            vae_dtype != torch.float32
-        ) and not self.args.disable_autocast
-
-        # 7. Denoising loop
-        num_warmup_steps = len(timesteps) - num_inference_steps * self.scheduler.order
-        self._num_timesteps = len(timesteps)
-
-        # if is_progress_bar:
-        with self.progress_bar(total=num_inference_steps) as progress_bar:
-            for i, t in enumerate(timesteps):
-                if self.interrupt:
-                    continue
-
-                # expand the latents if we are doing classifier free guidance
-                latent_model_input = (
-                    torch.cat([latents] * 2)
-                    if self.do_classifier_free_guidance
-                    else latents
-                )
-                latent_model_input = self.scheduler.scale_model_input(
-                    latent_model_input, t
-                )
-
-                t_expand = t.repeat(latent_model_input.shape[0])
-                guidance_expand = (
-                    torch.tensor(
-                        [embedded_guidance_scale] * latent_model_input.shape[0],
-                        dtype=torch.float32,
-                        device=device,
-                    ).to(target_dtype)
-                    * 1000.0
-                    if embedded_guidance_scale is not None
-                    else None
-                )
-
-                # predict the noise residual
-                with torch.autocast(
-                    device_type="cuda", dtype=target_dtype, enabled=autocast_enabled
-                ):
-                    noise_pred = self.transformer(  # For an input image (129, 192, 336) (1, 256, 256)
-                        latent_model_input,  # [2, 16, 33, 24, 42]
-                        t_expand,  # [2]
-                        text_states=prompt_embeds,  # [2, 256, 4096]
-                        text_mask=prompt_mask,  # [2, 256]
-                        text_states_2=prompt_embeds_2,  # [2, 768]
-                        freqs_cos=freqs_cis[0],  # [seqlen, head_dim]
-                        freqs_sin=freqs_cis[1],  # [seqlen, head_dim]
-                        guidance=guidance_expand,
-                        return_dict=True,
-                    )[
-                        "x"
-                    ]
-
-                # perform guidance
-                if self.do_classifier_free_guidance:
-                    noise_pred_uncond, noise_pred_text = noise_pred.chunk(2)
-                    noise_pred = noise_pred_uncond + self.guidance_scale * (
-                        noise_pred_text - noise_pred_uncond
-                    )
-
-                if self.do_classifier_free_guidance and self.guidance_rescale > 0.0:
-                    # Based on 3.4. in https://arxiv.org/pdf/2305.08891.pdf
-                    noise_pred = rescale_noise_cfg(
-                        noise_pred,
-                        noise_pred_text,
-                        guidance_rescale=self.guidance_rescale,
-                    )
-
-                # compute the previous noisy sample x_t -> x_t-1
-                latents = self.scheduler.step(
-                    noise_pred, t, latents, **extra_step_kwargs, return_dict=False
-                )[0]
-
-                if callback_on_step_end is not None:
-                    callback_kwargs = {}
-                    for k in callback_on_step_end_tensor_inputs:
-                        callback_kwargs[k] = locals()[k]
-                    callback_outputs = callback_on_step_end(self, i, t, callback_kwargs)
-
-                    latents = callback_outputs.pop("latents", latents)
-                    prompt_embeds = callback_outputs.pop("prompt_embeds", prompt_embeds)
-                    negative_prompt_embeds = callback_outputs.pop(
-                        "negative_prompt_embeds", negative_prompt_embeds
-                    )
-
-                # call the callback, if provided
-                if i == len(timesteps) - 1 or (
-                    (i + 1) > num_warmup_steps and (i + 1) % self.scheduler.order == 0
-                ):
-                    if progress_bar is not None:
-                        progress_bar.update()
-                    if callback is not None and i % callback_steps == 0:
-                        step_idx = i // getattr(self.scheduler, "order", 1)
-                        callback(step_idx, t, latents)
-
-        if not output_type == "latent":
-            expand_temporal_dim = False
-            if len(latents.shape) == 4:
-                if isinstance(self.vae, AutoencoderKLCausal3D):
-                    latents = latents.unsqueeze(2)
-                    expand_temporal_dim = True
-            elif len(latents.shape) == 5:
-                pass
-            else:
-                raise ValueError(
-                    f"Only support latents with shape (b, c, h, w) or (b, c, f, h, w), but got {latents.shape}."
-                )
-
-            if (
-                hasattr(self.vae.config, "shift_factor")
-                and self.vae.config.shift_factor
-            ):
-                latents = (
-                    latents / self.vae.config.scaling_factor
-                    + self.vae.config.shift_factor
-                )
-            else:
-                latents = latents / self.vae.config.scaling_factor
-
-            with torch.autocast(
-                device_type="cuda", dtype=vae_dtype, enabled=vae_autocast_enabled
-            ):
-                if enable_tiling:
-                    self.vae.enable_tiling()
-                    image = self.vae.decode(
-                        latents, return_dict=False, generator=generator
-                    )[0]
-                else:
-                    image = self.vae.decode(
-                        latents, return_dict=False, generator=generator
-                    )[0]
-
-            if expand_temporal_dim or image.shape[2] == 1:
-                image = image.squeeze(2)
-
-        else:
-            image = latents
-
-        image = (image / 2 + 0.5).clamp(0, 1)
-        # we always cast to float32 as this does not cause significant overhead and is compatible with bfloa16
-        image = image.cpu().float()
-
-        # Offload all models
-        self.maybe_free_model_hooks()
-
-        if not return_dict:
-            return image
-
-        return HunyuanVideoPipelineOutput(videos=image)
diff --git a/videotuna/models/hunyuan/hyvideo_t2v/diffusion/schedulers/__init__.py b/videotuna/models/hunyuan/hyvideo_t2v/diffusion/schedulers/__init__.py
deleted file mode 100644
index 14f2ba33..00000000
--- a/videotuna/models/hunyuan/hyvideo_t2v/diffusion/schedulers/__init__.py
+++ /dev/null
@@ -1 +0,0 @@
-from .scheduling_flow_match_discrete import FlowMatchDiscreteScheduler
diff --git a/videotuna/models/hunyuan/hyvideo_t2v/diffusion/schedulers/scheduling_flow_match_discrete.py b/videotuna/models/hunyuan/hyvideo_t2v/diffusion/schedulers/scheduling_flow_match_discrete.py
deleted file mode 100644
index c507ec4e..00000000
--- a/videotuna/models/hunyuan/hyvideo_t2v/diffusion/schedulers/scheduling_flow_match_discrete.py
+++ /dev/null
@@ -1,257 +0,0 @@
-# Copyright 2024 Stability AI, Katherine Crowson and The HuggingFace Team. All rights reserved.
-#
-# Licensed under the Apache License, Version 2.0 (the "License");
-# you may not use this file except in compliance with the License.
-# You may obtain a copy of the License at
-#
-#     http://www.apache.org/licenses/LICENSE-2.0
-#
-# Unless required by applicable law or agreed to in writing, software
-# distributed under the License is distributed on an "AS IS" BASIS,
-# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
-# See the License for the specific language governing permissions and
-# limitations under the License.
-# ==============================================================================
-#
-# Modified from diffusers==0.29.2
-#
-# ==============================================================================
-
-from dataclasses import dataclass
-from typing import Optional, Tuple, Union
-
-import numpy as np
-import torch
-
-from diffusers.configuration_utils import ConfigMixin, register_to_config
-from diffusers.utils import BaseOutput, logging
-from diffusers.schedulers.scheduling_utils import SchedulerMixin
-
-
-logger = logging.get_logger(__name__)  # pylint: disable=invalid-name
-
-
-@dataclass
-class FlowMatchDiscreteSchedulerOutput(BaseOutput):
-    """
-    Output class for the scheduler's `step` function output.
-
-    Args:
-        prev_sample (`torch.FloatTensor` of shape `(batch_size, num_channels, height, width)` for images):
-            Computed sample `(x_{t-1})` of previous timestep. `prev_sample` should be used as next model input in the
-            denoising loop.
-    """
-
-    prev_sample: torch.FloatTensor
-
-
-class FlowMatchDiscreteScheduler(SchedulerMixin, ConfigMixin):
-    """
-    Euler scheduler.
-
-    This model inherits from [`SchedulerMixin`] and [`ConfigMixin`]. Check the superclass documentation for the generic
-    methods the library implements for all schedulers such as loading and saving.
-
-    Args:
-        num_train_timesteps (`int`, defaults to 1000):
-            The number of diffusion steps to train the model.
-        timestep_spacing (`str`, defaults to `"linspace"`):
-            The way the timesteps should be scaled. Refer to Table 2 of the [Common Diffusion Noise Schedules and
-            Sample Steps are Flawed](https://huggingface.co/papers/2305.08891) for more information.
-        shift (`float`, defaults to 1.0):
-            The shift value for the timestep schedule.
-        reverse (`bool`, defaults to `True`):
-            Whether to reverse the timestep schedule.
-    """
-
-    _compatibles = []
-    order = 1
-
-    @register_to_config
-    def __init__(
-        self,
-        num_train_timesteps: int = 1000,
-        shift: float = 1.0,
-        reverse: bool = True,
-        solver: str = "euler",
-        n_tokens: Optional[int] = None,
-    ):
-        sigmas = torch.linspace(1, 0, num_train_timesteps + 1)
-
-        if not reverse:
-            sigmas = sigmas.flip(0)
-
-        self.sigmas = sigmas
-        # the value fed to model
-        self.timesteps = (sigmas[:-1] * num_train_timesteps).to(dtype=torch.float32)
-
-        self._step_index = None
-        self._begin_index = None
-
-        self.supported_solver = ["euler"]
-        if solver not in self.supported_solver:
-            raise ValueError(
-                f"Solver {solver} not supported. Supported solvers: {self.supported_solver}"
-            )
-
-    @property
-    def step_index(self):
-        """
-        The index counter for current timestep. It will increase 1 after each scheduler step.
-        """
-        return self._step_index
-
-    @property
-    def begin_index(self):
-        """
-        The index for the first timestep. It should be set from pipeline with `set_begin_index` method.
-        """
-        return self._begin_index
-
-    # Copied from diffusers.schedulers.scheduling_dpmsolver_multistep.DPMSolverMultistepScheduler.set_begin_index
-    def set_begin_index(self, begin_index: int = 0):
-        """
-        Sets the begin index for the scheduler. This function should be run from pipeline before the inference.
-
-        Args:
-            begin_index (`int`):
-                The begin index for the scheduler.
-        """
-        self._begin_index = begin_index
-
-    def _sigma_to_t(self, sigma):
-        return sigma * self.config.num_train_timesteps
-
-    def set_timesteps(
-        self,
-        num_inference_steps: int,
-        device: Union[str, torch.device] = None,
-        n_tokens: int = None,
-    ):
-        """
-        Sets the discrete timesteps used for the diffusion chain (to be run before inference).
-
-        Args:
-            num_inference_steps (`int`):
-                The number of diffusion steps used when generating samples with a pre-trained model.
-            device (`str` or `torch.device`, *optional*):
-                The device to which the timesteps should be moved to. If `None`, the timesteps are not moved.
-            n_tokens (`int`, *optional*):
-                Number of tokens in the input sequence.
-        """
-        self.num_inference_steps = num_inference_steps
-        
-        sigmas = torch.linspace(1, 0, num_inference_steps + 1)
-        sigmas = self.sd3_time_shift(sigmas)
-
-        if not self.config.reverse:
-            sigmas = 1 - sigmas
-
-        self.sigmas = sigmas
-        self.timesteps = (sigmas[:-1] * self.config.num_train_timesteps).to(
-            dtype=torch.float32, device=device
-        )
-
-        # Reset step index
-        self._step_index = None
-
-    def index_for_timestep(self, timestep, schedule_timesteps=None):
-        if schedule_timesteps is None:
-            schedule_timesteps = self.timesteps
-
-        indices = (schedule_timesteps == timestep).nonzero()
-
-        # The sigma index that is taken for the **very** first `step`
-        # is always the second index (or the last index if there is only 1)
-        # This way we can ensure we don't accidentally skip a sigma in
-        # case we start in the middle of the denoising schedule (e.g. for image-to-image)
-        pos = 1 if len(indices) > 1 else 0
-
-        return indices[pos].item()
-
-    def _init_step_index(self, timestep):
-        if self.begin_index is None:
-            if isinstance(timestep, torch.Tensor):
-                timestep = timestep.to(self.timesteps.device)
-            self._step_index = self.index_for_timestep(timestep)
-        else:
-            self._step_index = self._begin_index
-
-    def scale_model_input(
-        self, sample: torch.Tensor, timestep: Optional[int] = None
-    ) -> torch.Tensor:
-        return sample
-
-    def sd3_time_shift(self, t: torch.Tensor):
-        return (self.config.shift * t) / (1 + (self.config.shift - 1) * t)
-
-    def step(
-        self,
-        model_output: torch.FloatTensor,
-        timestep: Union[float, torch.FloatTensor],
-        sample: torch.FloatTensor,
-        return_dict: bool = True,
-    ) -> Union[FlowMatchDiscreteSchedulerOutput, Tuple]:
-        """
-        Predict the sample from the previous timestep by reversing the SDE. This function propagates the diffusion
-        process from the learned model outputs (most often the predicted noise).
-
-        Args:
-            model_output (`torch.FloatTensor`):
-                The direct output from learned diffusion model.
-            timestep (`float`):
-                The current discrete timestep in the diffusion chain.
-            sample (`torch.FloatTensor`):
-                A current instance of a sample created by the diffusion process.
-            generator (`torch.Generator`, *optional*):
-                A random number generator.
-            n_tokens (`int`, *optional*):
-                Number of tokens in the input sequence.
-            return_dict (`bool`):
-                Whether or not to return a [`~schedulers.scheduling_euler_discrete.EulerDiscreteSchedulerOutput`] or
-                tuple.
-
-        Returns:
-            [`~schedulers.scheduling_euler_discrete.EulerDiscreteSchedulerOutput`] or `tuple`:
-                If return_dict is `True`, [`~schedulers.scheduling_euler_discrete.EulerDiscreteSchedulerOutput`] is
-                returned, otherwise a tuple is returned where the first element is the sample tensor.
-        """
-
-        if (
-            isinstance(timestep, int)
-            or isinstance(timestep, torch.IntTensor)
-            or isinstance(timestep, torch.LongTensor)
-        ):
-            raise ValueError(
-                (
-                    "Passing integer indices (e.g. from `enumerate(timesteps)`) as timesteps to"
-                    " `EulerDiscreteScheduler.step()` is not supported. Make sure to pass"
-                    " one of the `scheduler.timesteps` as a timestep."
-                ),
-            )
-
-        if self.step_index is None:
-            self._init_step_index(timestep)
-
-        # Upcast to avoid precision issues when computing prev_sample
-        sample = sample.to(torch.float32)
-
-        dt = self.sigmas[self.step_index + 1] - self.sigmas[self.step_index]
-
-        if self.config.solver == "euler":
-            prev_sample = sample + model_output.to(torch.float32) * dt
-        else:
-            raise ValueError(
-                f"Solver {self.config.solver} not supported. Supported solvers: {self.supported_solver}"
-            )
-
-        # upon completion increase step index by one
-        self._step_index += 1
-
-        if not return_dict:
-            return (prev_sample,)
-
-        return FlowMatchDiscreteSchedulerOutput(prev_sample=prev_sample)
-
-    def __len__(self):
-        return self.config.num_train_timesteps
diff --git a/videotuna/models/hunyuan/hyvideo_t2v/hunyuanvideo.py b/videotuna/models/hunyuan/hyvideo_t2v/hunyuanvideo.py
deleted file mode 100644
index b9238eff..00000000
--- a/videotuna/models/hunyuan/hyvideo_t2v/hunyuanvideo.py
+++ /dev/null
@@ -1,1046 +0,0 @@
-import math
-import torch
-import inspect
-from transformers import T5EncoderModel, T5Tokenizer
-from diffusers import (
-    AutoencoderKLCogVideoX,
-    CogVideoXDPMScheduler,
-    CogVideoXTransformer3DModel,
-)
-from diffusers.video_processor import VideoProcessor
-from diffusers.utils.torch_utils import randn_tensor
-from diffusers.callbacks import PipelineCallback, MultiPipelineCallbacks
-from diffusers.models.embeddings import get_3d_rotary_pos_embed
-from diffusers import CogVideoXDDIMScheduler, FlowMatchEulerDiscreteScheduler
-from diffusers.training_utils import compute_loss_weighting_for_sd3
-
-import pytorch_lightning as pl
-from videotuna.utils.common_utils import instantiate_from_config
-from typing import List, Optional, Tuple, Union, Dict, Any, Callable
-from peft import (
-    LoraConfig,
-    get_peft_model_state_dict,
-    set_peft_model_state_dict,
-    get_peft_model,
-)
-import numpy as np
-
-
-DEFAULT_PROMPT_TEMPLATE = {
-    "template": (
-        "<|start_header_id|>system<|end_header_id|>\n\nDescribe the video by detailing the following aspects: "
-        "1. The main content and theme of the video."
-        "2. The color, shape, size, texture, quantity, text, and spatial relationships of the objects."
-        "3. Actions, events, behaviors temporal relationships, physical movement changes of the objects."
-        "4. background environment, light, style and atmosphere."
-        "5. camera angles, movements, and transitions used in the video:<|eot_id|>"
-        "<|start_header_id|>user<|end_header_id|>\n\n{}<|eot_id|>"
-    ),
-    "crop_start": 95,
-}
-
-# Copied from diffusers.pipelines.stable_diffusion.pipeline_stable_diffusion.retrieve_timesteps
-def retrieve_timesteps(
-    scheduler,
-    num_inference_steps: Optional[int] = None,
-    device: Optional[Union[str, torch.device]] = None,
-    timesteps: Optional[List[int]] = None,
-    sigmas: Optional[List[float]] = None,
-    **kwargs,
-):
-    """
-    Calls the scheduler's `set_timesteps` method and retrieves timesteps from the scheduler after the call. Handles
-    custom timesteps. Any kwargs will be supplied to `scheduler.set_timesteps`.
-
-    Args:
-        scheduler (`SchedulerMixin`):
-            The scheduler to get timesteps from.
-        num_inference_steps (`int`):
-            The number of diffusion steps used when generating samples with a pre-trained model. If used, `timesteps`
-            must be `None`.
-        device (`str` or `torch.device`, *optional*):
-            The device to which the timesteps should be moved to. If `None`, the timesteps are not moved.
-        timesteps (`List[int]`, *optional*):
-            Custom timesteps used to override the timestep spacing strategy of the scheduler. If `timesteps` is passed,
-            `num_inference_steps` and `sigmas` must be `None`.
-        sigmas (`List[float]`, *optional*):
-            Custom sigmas used to override the timestep spacing strategy of the scheduler. If `sigmas` is passed,
-            `num_inference_steps` and `timesteps` must be `None`.
-
-    Returns:
-        `Tuple[torch.Tensor, int]`: A tuple where the first element is the timestep schedule from the scheduler and the
-        second element is the number of inference steps.
-    """
-    if timesteps is not None and sigmas is not None:
-        raise ValueError("Only one of `timesteps` or `sigmas` can be passed. Please choose one to set custom values")
-    if timesteps is not None:
-        accepts_timesteps = "timesteps" in set(inspect.signature(scheduler.set_timesteps).parameters.keys())
-        if not accepts_timesteps:
-            raise ValueError(
-                f"The current scheduler class {scheduler.__class__}'s `set_timesteps` does not support custom"
-                f" timestep schedules. Please check whether you are using the correct scheduler."
-            )
-        scheduler.set_timesteps(timesteps=timesteps, device=device, **kwargs)
-        timesteps = scheduler.timesteps
-        num_inference_steps = len(timesteps)
-    elif sigmas is not None:
-        accept_sigmas = "sigmas" in set(inspect.signature(scheduler.set_timesteps).parameters.keys())
-        if not accept_sigmas:
-            raise ValueError(
-                f"The current scheduler class {scheduler.__class__}'s `set_timesteps` does not support custom"
-                f" sigmas schedules. Please check whether you are using the correct scheduler."
-            )
-        scheduler.set_timesteps(sigmas=sigmas, device=device, **kwargs)
-        timesteps = scheduler.timesteps
-        num_inference_steps = len(timesteps)
-    else:
-        scheduler.set_timesteps(num_inference_steps, device=device, **kwargs)
-        timesteps = scheduler.timesteps
-    return timesteps, num_inference_steps
-
-
-def compute_density_for_timestep_sampling(
-    weighting_scheme: str,
-    batch_size: int,
-    logit_mean: float = None,
-    logit_std: float = None,
-    mode_scale: float = None,
-    device: torch.device = torch.device("cpu"),
-    generator: Optional[torch.Generator] = None,
-) -> torch.Tensor:
-    r"""
-    Compute the density for sampling the timesteps when doing SD3 training.
-
-    Courtesy: This was contributed by Rafie Walker in https://github.com/huggingface/diffusers/pull/8528.
-
-    SD3 paper reference: https://arxiv.org/abs/2403.03206v1.
-    """
-    if weighting_scheme == "logit_normal":
-        # See 3.1 in the SD3 paper ($rf/lognorm(0.00,1.00)$).
-        u = torch.normal(mean=logit_mean, std=logit_std, size=(batch_size,), device=device, generator=generator)
-        u = torch.nn.functional.sigmoid(u)
-    elif weighting_scheme == "mode":
-        u = torch.rand(size=(batch_size,), device=device, generator=generator)
-        u = 1 - u - mode_scale * (torch.cos(math.pi * u / 2) ** 2 - 1 + u)
-    else:
-        u = torch.rand(size=(batch_size,), device=device, generator=generator)
-    return u
-
-
-
-def prepare_sigmas(
-    scheduler: Union[CogVideoXDDIMScheduler, FlowMatchEulerDiscreteScheduler],
-    sigmas: torch.Tensor,
-    batch_size: int,
-    num_train_timesteps: int,
-    flow_weighting_scheme: str = "none",
-    flow_logit_mean: float = 0.0,
-    flow_logit_std: float = 1.0,
-    flow_mode_scale: float = 1.29,
-    device: torch.device = torch.device("cpu"),
-    generator: Optional[torch.Generator] = None,
-) -> torch.Tensor:
-    if isinstance(scheduler, FlowMatchEulerDiscreteScheduler):
-        weights = compute_density_for_timestep_sampling(
-            weighting_scheme=flow_weighting_scheme,
-            batch_size=batch_size,
-            logit_mean=flow_logit_mean,
-            logit_std=flow_logit_std,
-            mode_scale=flow_mode_scale,
-            device=device,
-            generator=generator,
-        )
-        indices = (weights * num_train_timesteps).long()
-    else:
-        raise ValueError(f"Unsupported scheduler type {type(scheduler)}")
-
-    return sigmas[indices]
-
-
-def expand_tensor_dims(tensor, ndim):
-    while len(tensor.shape) < ndim:
-        tensor = tensor.unsqueeze(-1)
-    return tensor
-
-
-def prepare_loss_weights(
-    scheduler: Union[CogVideoXDDIMScheduler, FlowMatchEulerDiscreteScheduler],
-    alphas: Optional[torch.Tensor] = None,
-    sigmas: Optional[torch.Tensor] = None,
-    flow_weighting_scheme: str = "none",
-) -> torch.Tensor:
-    if isinstance(scheduler, FlowMatchEulerDiscreteScheduler):
-        return compute_loss_weighting_for_sd3(sigmas=sigmas, weighting_scheme=flow_weighting_scheme)
-    else:
-        raise ValueError(f"Unsupported scheduler type {type(scheduler)}")
-
-
-def prepare_target(
-    scheduler: Union[CogVideoXDDIMScheduler, FlowMatchEulerDiscreteScheduler],
-    noise: torch.Tensor,
-    latents: torch.Tensor,
-) -> torch.Tensor:
-    if isinstance(scheduler, FlowMatchEulerDiscreteScheduler):
-        target = noise - latents
-    elif isinstance(scheduler, CogVideoXDDIMScheduler):
-        target = latents
-    else:
-        raise ValueError(f"Unsupported scheduler type {type(scheduler)}")
-
-    return target
-
-
-class HunyuanVideoWorkFlow(pl.LightningModule):
-    def __init__(
-        self,
-        first_stage_config: Dict[str, Any],  # vae
-        cond_stage_config: Dict[str, Any],  # text encoder
-        cond_stage_config_2: Dict[str, Any],  # text encoder 2
-        tokenizer_config: Dict[str, Any],  # tokenizer
-        tokenizer_config_2: Dict[str, Any],  # tokenizer 2
-        denoiser_config: Dict[str, Any],  # transformer
-        scheduler_config: Dict[str, Any],  # scheduler
-        learning_rate: float = 6e-6,
-        adapter_config=None,
-        deepspeed_config=None,
-        logdir=None,
-    ):
-        super().__init__()
-        self.model = instantiate_from_config(denoiser_config)
-
-        self.logdir = logdir
-        self.learning_rate = learning_rate
-        # condtion stage use T5 class, which is availale at lvdm.module.encoders.condtion.FrozenT5Embedder
-        # but we need to be aware of the model name and tokenizer name
-        # here is the same with DDPM
-        self.instantiate_first_stage(first_stage_config)
-        # max_sequence_length=226
-        self.instantiate_cond_stage(cond_stage_config)
-        self.instantiate_cond_stage_2(cond_stage_config_2)
-
-        self.tokenizer = instantiate_from_config(tokenizer_config)
-        self.tokenizer_2 = instantiate_from_config(tokenizer_config_2)
-
-        # self.vae_scale_factor_spatial = (
-        #     2 ** (len(self.vae.config.block_out_channels) - 1)
-        #     if hasattr(self, "first_stage_model") and self is not None
-        #     else 8
-        # )
-        # self.vae_scale_factor_temporal = (
-        #     self.vae.config.temporal_compression_ratio
-        #     if hasattr(self, "first_stage_model") and self.vae is not None
-        #     else 4
-        # )
-
-        # self.video_processor = VideoProcessor(
-        #     vae_scale_factor=self.vae_scale_factor_spatial
-        # )
-        self.vae_scale_factor_temporal = (
-            self.vae.temporal_compression_ratio 
-            if getattr(self, "vae", None) 
-            else 4
-        )
-        self.vae_scale_factor_spatial = (
-            self.vae.spatial_compression_ratio 
-            if getattr(self, "vae", None) 
-            else 8
-        )
-        self.video_processor = VideoProcessor(
-            vae_scale_factor=self.vae_scale_factor_spatial
-        )
-
-        self.model = instantiate_from_config(denoiser_config)
-        self.scheduler = instantiate_from_config(scheduler_config)
-        self.scheduler_sigmas = self.scheduler.sigmas.clone().to(self.device)
-        # add adapter config (Support Lora and HRA )
-        self.lora_args = []
-        if adapter_config is not None:
-            self.inject_adapter(adapter_config)
-        if deepspeed_config is not None:
-            self.deepspeed_config = deepspeed_config.params
-        
-    
-    def inject_adapter(self, adapter_config):
-        self.model.requires_grad_(False)
-        self.model.enable_gradient_checkpointing()
-        transformer_adapter_config = instantiate_from_config(adapter_config)   
-        # print(transformer_adapter_config)
-        self.model = get_peft_model(self.model, transformer_adapter_config, autocast_adapter_dtype=False)
-        self.model.print_trainable_parameters()
-    
-    ## VAE is named as first_stage_model 
-    ## followed functions are all first stage related. 
-    def instantiate_first_stage(self, config):
-        # import pdb;pdb.set_trace()
-        model = instantiate_from_config(config)
-        self.vae = model.eval()
-        # self.vae.train = disabled_train
-        self.vae.requires_grad_(False)
-    
-    @torch.no_grad()
-    def encode_first_stage(self, x):
-        x = x.permute(0, 2, 1, 3, 4)  # [B, C, F, H, W]
-        latent_dist = self.vae.encode(x).latent_dist
-        return latent_dist
-
-    def _decode_core(self, z, **kwargs):
-        z = 1. / self.scale_factor * z
-
-        if self.encoder_type == "2d" and z.dim() == 5:
-            return self.decode_first_stage_2DAE(z)
-        results = self.vae.decode(z, **kwargs)
-        return results
-
-    @torch.no_grad()
-    def decode_first_stage(self, z, **kwargs):
-        return self._decode_core(z, **kwargs)
-
-    def differentiable_decode_first_stage(self, z, **kwargs):
-        """same as decode_first_stage but without decorator"""
-        return self._decode_core(z, **kwargs)
-    
-    ## second stage : text condition and other condtions 
-    def instantiate_cond_stage(self, config):
-        model = instantiate_from_config(config)
-        # # in finetune cogvideox don't train as default
-        self.cond_stage_model = model.eval()
-        self.cond_stage_model.requires_grad_(False)
-    
-    def instantiate_cond_stage_2(self, config):
-        model = instantiate_from_config(config)
-        # # in finetune cogvideox don't train as default
-        self.cond_stage_model_2 = model.eval()
-        self.cond_stage_model_2.requires_grad_(False)
-
-    def decode_latents(self, latents: torch.Tensor) -> torch.Tensor:
-        latents = latents.permute(0, 2, 1, 3, 4)  # [batch_size, num_channels, num_frames, height, width]
-        latents = 1 / self.vae.config.scaling_factor * latents
-
-        frames = self.vae.decode(latents).sample
-        return frames
-    
-
-    def check_inputs(
-        self,
-        prompt,
-        prompt_2,
-        height,
-        width,
-        prompt_embeds=None,
-        callback_on_step_end_tensor_inputs=None,
-        prompt_template=None,
-    ):
-        if height % 16 != 0 or width % 16 != 0:
-            raise ValueError(f"`height` and `width` have to be divisible by 16 but are {height} and {width}.")
-
-        # if callback_on_step_end_tensor_inputs is not None and not all(
-        #     k in self._callback_tensor_inputs for k in callback_on_step_end_tensor_inputs
-        # ):
-        #     raise ValueError(
-        #         f"`callback_on_step_end_tensor_inputs` has to be in {self._callback_tensor_inputs}, but found {[k for k in callback_on_step_end_tensor_inputs if k not in self._callback_tensor_inputs]}"
-        #     )
-
-        if prompt is not None and prompt_embeds is not None:
-            raise ValueError(
-                f"Cannot forward both `prompt`: {prompt} and `prompt_embeds`: {prompt_embeds}. Please make sure to"
-                " only forward one of the two."
-            )
-        elif prompt_2 is not None and prompt_embeds is not None:
-            raise ValueError(
-                f"Cannot forward both `prompt_2`: {prompt_2} and `prompt_embeds`: {prompt_embeds}. Please make sure to"
-                " only forward one of the two."
-            )
-        elif prompt is None and prompt_embeds is None:
-            raise ValueError(
-                "Provide either `prompt` or `prompt_embeds`. Cannot leave both `prompt` and `prompt_embeds` undefined."
-            )
-        elif prompt is not None and (not isinstance(prompt, str) and not isinstance(prompt, list)):
-            raise ValueError(f"`prompt` has to be of type `str` or `list` but is {type(prompt)}")
-        elif prompt_2 is not None and (not isinstance(prompt_2, str) and not isinstance(prompt_2, list)):
-            raise ValueError(f"`prompt_2` has to be of type `str` or `list` but is {type(prompt_2)}")
-
-        if prompt_template is not None:
-            if not isinstance(prompt_template, dict):
-                raise ValueError(f"`prompt_template` has to be of type `dict` but is {type(prompt_template)}")
-            if "template" not in prompt_template:
-                raise ValueError(
-                    f"`prompt_template` has to contain a key `template` but only found {prompt_template.keys()}"
-                )
-    
-
-    def _get_llama_prompt_embeds(
-        self,
-        prompt: Union[str, List[str]],
-        prompt_template: Dict[str, Any],
-        num_videos_per_prompt: int = 1,
-        device: Optional[torch.device] = None,
-        dtype: Optional[torch.dtype] = None,
-        max_sequence_length: int = 256,
-        num_hidden_layers_to_skip: int = 2,
-    ) -> Tuple[torch.Tensor, torch.Tensor]:
-        # device = device or self._execution_device
-        # dtype = dtype or self.text_encoder.dtype
-
-        device = self.device
-        # TODO: fix data type 
-        # dtype = torch.float32
-        dtype = torch.float16
-
-        prompt = [prompt] if isinstance(prompt, str) else prompt
-        batch_size = len(prompt)
-
-        prompt = [prompt_template["template"].format(p) for p in prompt]
-
-        crop_start = prompt_template.get("crop_start", None)
-        if crop_start is None:
-            prompt_template_input = self.tokenizer(
-                prompt_template["template"],
-                padding="max_length",
-                return_tensors="pt",
-                return_length=False,
-                return_overflowing_tokens=False,
-                return_attention_mask=False,
-            )
-            crop_start = prompt_template_input["input_ids"].shape[-1]
-            # Remove <|eot_id|> token and placeholder {}
-            crop_start -= 2
-        
-        max_sequence_length += crop_start
-        text_inputs = self.tokenizer(
-            prompt,
-            max_length=max_sequence_length,
-            padding="max_length",
-            truncation=True,
-            return_tensors="pt",
-            return_length=False,
-            return_overflowing_tokens=False,
-            return_attention_mask=True,
-        )
-        text_input_ids = text_inputs.input_ids.to(device=device)
-        prompt_attention_mask = text_inputs.attention_mask.to(device=device)
-
-        # todo: check if this is correct
-        prompt_embeds = self.cond_stage_model(
-            input_ids=text_input_ids,
-            attention_mask=prompt_attention_mask,
-            output_hidden_states=True,
-        ).hidden_states[-(num_hidden_layers_to_skip + 1)]
-        prompt_embeds = prompt_embeds.to(dtype=dtype)
-
-        if crop_start is not None and crop_start > 0:
-            prompt_embeds = prompt_embeds[:, crop_start:]
-            prompt_attention_mask = prompt_attention_mask[:, crop_start:]
-
-        # duplicate text embeddings for each generation per prompt, using mps friendly method
-        _, seq_len, _ = prompt_embeds.shape
-        prompt_embeds = prompt_embeds.repeat(1, num_videos_per_prompt, 1)
-        prompt_embeds = prompt_embeds.view(batch_size * num_videos_per_prompt, seq_len, -1)
-        prompt_attention_mask = prompt_attention_mask.repeat(1, num_videos_per_prompt)
-
-        prompt_attention_mask = prompt_attention_mask.view(batch_size * num_videos_per_prompt, seq_len)
-
-        return prompt_embeds, prompt_attention_mask
-
-    def _get_clip_prompt_embeds(
-        self,
-        prompt: Union[str, List[str]],
-        num_videos_per_prompt: int = 1,
-        device: Optional[torch.device] = None,
-        dtype: Optional[torch.dtype] = None,
-        max_sequence_length: int = 77,
-    ) -> torch.Tensor:
-        # device = device or self._execution_device
-        # dtype = dtype or self.text_encoder_2.dtype
-
-        device = self.device
-        # TODO: fix data type 
-        # dtype = torch.float32
-        dtype = torch.float16
-
-        prompt = [prompt] if isinstance(prompt, str) else prompt
-        batch_size = len(prompt)
-
-        text_inputs = self.tokenizer_2(
-            prompt,
-            padding="max_length",
-            max_length=max_sequence_length,
-            truncation=True,
-            return_tensors="pt",
-        )
-
-        text_input_ids = text_inputs.input_ids
-        untruncated_ids = self.tokenizer_2(prompt, padding="longest", return_tensors="pt").input_ids
-        if untruncated_ids.shape[-1] >= text_input_ids.shape[-1] and not torch.equal(text_input_ids, untruncated_ids):
-            removed_text = self.tokenizer_2.batch_decode(untruncated_ids[:, max_sequence_length - 1 : -1])
-            # logger.warning( 
-            #     "The following part of your input was truncated because CLIP can only handle sequences up to"
-            #     f" {max_sequence_length} tokens: {removed_text}"
-            # )
-
-        prompt_embeds = self.cond_stage_model_2(text_input_ids.to(device), output_hidden_states=False).pooler_output
-        prompt_embeds = prompt_embeds.to(dtype=dtype)
-        # duplicate text embeddings for each generation per prompt, using mps friendly method
-        prompt_embeds = prompt_embeds.repeat(1, num_videos_per_prompt)
-        prompt_embeds = prompt_embeds.view(batch_size * num_videos_per_prompt, -1)
-
-        return prompt_embeds
-
-
-    def encode_prompt(
-        self,
-        prompt: Union[str, List[str]],
-        prompt_2: Union[str, List[str]] = None,
-        prompt_template: Dict[str, Any] = DEFAULT_PROMPT_TEMPLATE,
-        num_videos_per_prompt: int = 1,
-        prompt_embeds: Optional[torch.Tensor] = None,
-        pooled_prompt_embeds: Optional[torch.Tensor] = None,
-        prompt_attention_mask: Optional[torch.Tensor] = None,
-        device: Optional[torch.device] = None,
-        dtype: Optional[torch.dtype] = None,
-        max_sequence_length: int = 256,
-    ):
-        if prompt_embeds is None:
-            prompt_embeds, prompt_attention_mask = self._get_llama_prompt_embeds(
-                prompt,
-                prompt_template,
-                num_videos_per_prompt,
-                device=device,
-                dtype=dtype,
-                max_sequence_length=max_sequence_length,
-            )
-
-        if pooled_prompt_embeds is None:
-            if prompt_2 is None and pooled_prompt_embeds is None:
-                prompt_2 = prompt
-            pooled_prompt_embeds = self._get_clip_prompt_embeds(
-                prompt,
-                num_videos_per_prompt,
-                device=device,
-                dtype=dtype,
-                max_sequence_length=77,
-            )
-
-        return prompt_embeds, pooled_prompt_embeds, prompt_attention_mask
-
-    def prepare_latents(
-        self,
-        batch_size: int,
-        num_channels_latents: 32,
-        height: int = 720,
-        width: int = 1280,
-        num_frames: int = 129,
-        dtype: Optional[torch.dtype] = None,
-        device: Optional[torch.device] = None,
-        generator: Optional[Union[torch.Generator, List[torch.Generator]]] = None,
-        latents: Optional[torch.Tensor] = None,
-    ) -> torch.Tensor:
-        if latents is not None:
-            return latents.to(device=device, dtype=dtype)
-
-        shape = (
-            batch_size,
-            num_channels_latents,
-            num_frames,
-            int(height) // self.vae_scale_factor_spatial,
-            int(width) // self.vae_scale_factor_spatial,
-        )
-        if isinstance(generator, list) and len(generator) != batch_size:
-            raise ValueError(
-                f"You have passed a list of generators of length {len(generator)}, but requested an effective batch"
-                f" size of {batch_size}. Make sure the batch size matches the length of the generators."
-            )
-
-        latents = randn_tensor(shape, generator=generator, device=device, dtype=dtype)
-        return latents
-
-    def enable_vae_slicing(self):
-        r"""
-        Enable sliced VAE decoding. When this option is enabled, the VAE will split the input tensor in slices to
-        compute decoding in several steps. This is useful to save some memory and allow larger batch sizes.
-        """
-        self.vae.enable_slicing()
-
-    def disable_vae_slicing(self):
-        r"""
-        Disable sliced VAE decoding. If `enable_vae_slicing` was previously enabled, this method will go back to
-        computing decoding in one step.
-        """
-        self.vae.disable_slicing()
-
-    def enable_vae_tiling(self):
-        r"""
-        Enable tiled VAE decoding. When this option is enabled, the VAE will split the input tensor into tiles to
-        compute decoding and encoding in several steps. This is useful for saving a large amount of memory and to allow
-        processing larger images.
-        """
-        self.vae.enable_tiling()
-
-    def disable_vae_tiling(self):
-        r"""
-        Disable tiled VAE decoding. If `enable_vae_tiling` was previously enabled, this method will go back to
-        computing decoding in one step.
-        """
-        self.vae.disable_tiling()
-      
-
-    # Copied from diffusers.pipelines.stable_diffusion.pipeline_stable_diffusion.StableDiffusionPipeline.prepare_extra_step_kwargs
-    def prepare_extra_step_kwargs(self, generator, eta):
-        # prepare extra kwargs for the scheduler step, since not all schedulers have the same signature
-        # eta (η) is only used with the DDIMScheduler, it will be ignored for other schedulers.
-        # eta corresponds to η in DDIM paper: https://arxiv.org/abs/2010.02502
-        # and should be between [0, 1]
-
-        accepts_eta = "eta" in set(inspect.signature(self.scheduler.step).parameters.keys())
-        extra_step_kwargs = {}
-        if accepts_eta:
-            extra_step_kwargs["eta"] = eta
-
-        # check if the scheduler accepts generator
-        accepts_generator = "generator" in set(inspect.signature(self.scheduler.step).parameters.keys())
-        if accepts_generator:
-            extra_step_kwargs["generator"] = generator
-        return extra_step_kwargs
-    
-    @torch.no_grad()
-    def sample(
-        self,
-        prompt: Union[str, List[str]] = None,
-        prompt_2: Union[str, List[str]] = None,
-        height: int = 720,
-        width: int = 1280,
-        num_frames: int = 129,
-        num_inference_steps: int = 50,
-        sigmas: List[float] = None,
-        guidance_scale: float = 6.0,
-        num_videos_per_prompt: Optional[int] = 1,
-        generator: Optional[Union[torch.Generator, List[torch.Generator]]] = None,
-        latents: Optional[torch.Tensor] = None,
-        prompt_embeds: Optional[torch.Tensor] = None,
-        pooled_prompt_embeds: Optional[torch.Tensor] = None,
-        prompt_attention_mask: Optional[torch.Tensor] = None,
-        output_type: Optional[str] = "pil",
-        return_dict: bool = True,
-        attention_kwargs: Optional[Dict[str, Any]] = None,
-        callback_on_step_end: Optional[
-            Union[Callable[[int, int, Dict], None], PipelineCallback, MultiPipelineCallbacks]
-        ] = None,
-        callback_on_step_end_tensor_inputs: List[str] = ["latents"],
-        prompt_template: Dict[str, Any] = DEFAULT_PROMPT_TEMPLATE,
-        max_sequence_length: int = 256,
-    ) -> Union[Tuple]:
-        r"""
-        The call function to the pipeline for generation.
-
-        Args:
-            prompt (`str` or `List[str]`, *optional*):
-                The prompt or prompts to guide the image generation. If not defined, one has to pass `prompt_embeds`.
-                instead.
-            prompt_2 (`str` or `List[str]`, *optional*):
-                The prompt or prompts to be sent to `tokenizer_2` and `text_encoder_2`. If not defined, `prompt` is
-                will be used instead.
-            height (`int`, defaults to `720`):
-                The height in pixels of the generated image.
-            width (`int`, defaults to `1280`):
-                The width in pixels of the generated image.
-            num_frames (`int`, defaults to `129`):
-                The number of frames in the generated video.
-            num_inference_steps (`int`, defaults to `50`):
-                The number of denoising steps. More denoising steps usually lead to a higher quality image at the
-                expense of slower inference.
-            sigmas (`List[float]`, *optional*):
-                Custom sigmas to use for the denoising process with schedulers which support a `sigmas` argument in
-                their `set_timesteps` method. If not defined, the default behavior when `num_inference_steps` is passed
-                will be used.
-            guidance_scale (`float`, defaults to `6.0`):
-                Guidance scale as defined in [Classifier-Free Diffusion Guidance](https://arxiv.org/abs/2207.12598).
-                `guidance_scale` is defined as `w` of equation 2. of [Imagen
-                Paper](https://arxiv.org/pdf/2205.11487.pdf). Guidance scale is enabled by setting `guidance_scale >
-                1`. Higher guidance scale encourages to generate images that are closely linked to the text `prompt`,
-                usually at the expense of lower image quality. Note that the only available HunyuanVideo model is
-                CFG-distilled, which means that traditional guidance between unconditional and conditional latent is
-                not applied.
-            num_videos_per_prompt (`int`, *optional*, defaults to 1):
-                The number of images to generate per prompt.
-            generator (`torch.Generator` or `List[torch.Generator]`, *optional*):
-                A [`torch.Generator`](https://pytorch.org/docs/stable/generated/torch.Generator.html) to make
-                generation deterministic.
-            latents (`torch.Tensor`, *optional*):
-                Pre-generated noisy latents sampled from a Gaussian distribution, to be used as inputs for image
-                generation. Can be used to tweak the same generation with different prompts. If not provided, a latents
-                tensor is generated by sampling using the supplied random `generator`.
-            prompt_embeds (`torch.Tensor`, *optional*):
-                Pre-generated text embeddings. Can be used to easily tweak text inputs (prompt weighting). If not
-                provided, text embeddings are generated from the `prompt` input argument.
-            output_type (`str`, *optional*, defaults to `"pil"`):
-                The output format of the generated image. Choose between `PIL.Image` or `np.array`.
-            return_dict (`bool`, *optional*, defaults to `True`):
-                Whether or not to return a [`HunyuanVideoPipelineOutput`] instead of a plain tuple.
-            attention_kwargs (`dict`, *optional*):
-                A kwargs dictionary that if specified is passed along to the `AttentionProcessor` as defined under
-                `self.processor` in
-                [diffusers.models.attention_processor](https://github.com/huggingface/diffusers/blob/main/src/diffusers/models/attention_processor.py).
-            clip_skip (`int`, *optional*):
-                Number of layers to be skipped from CLIP while computing the prompt embeddings. A value of 1 means that
-                the output of the pre-final layer will be used for computing the prompt embeddings.
-            callback_on_step_end (`Callable`, `PipelineCallback`, `MultiPipelineCallbacks`, *optional*):
-                A function or a subclass of `PipelineCallback` or `MultiPipelineCallbacks` that is called at the end of
-                each denoising step during the inference. with the following arguments: `callback_on_step_end(self:
-                DiffusionPipeline, step: int, timestep: int, callback_kwargs: Dict)`. `callback_kwargs` will include a
-                list of all tensors as specified by `callback_on_step_end_tensor_inputs`.
-            callback_on_step_end_tensor_inputs (`List`, *optional*):
-                The list of tensor inputs for the `callback_on_step_end` function. The tensors specified in the list
-                will be passed as `callback_kwargs` argument. You will only be able to include variables listed in the
-                `._callback_tensor_inputs` attribute of your pipeline class.
-
-        Examples:
-
-        Returns:
-            [`~HunyuanVideoPipelineOutput`] or `tuple`:
-                If `return_dict` is `True`, [`HunyuanVideoPipelineOutput`] is returned, otherwise a `tuple` is returned
-                where the first element is a list with the generated images and the second element is a list of `bool`s
-                indicating whether the corresponding generated image contains "not-safe-for-work" (nsfw) content.
-        """
-        if isinstance(callback_on_step_end, (PipelineCallback, MultiPipelineCallbacks)):
-            callback_on_step_end_tensor_inputs = callback_on_step_end.tensor_inputs
-
-        # 1. Check inputs. Raise error if not correct
-        self.check_inputs(
-            prompt,
-            prompt_2,
-            height,
-            width,
-            prompt_embeds,
-            callback_on_step_end_tensor_inputs,
-            prompt_template,
-        )
-
-        self._guidance_scale = guidance_scale
-        self._attention_kwargs = attention_kwargs
-        self._current_timestep = None
-        self._interrupt = False
-
-        device = self.device
-
-        # 2. Define call parameters
-        if prompt is not None and isinstance(prompt, str):
-            batch_size = 1
-        elif prompt is not None and isinstance(prompt, list):
-            batch_size = len(prompt)
-        else:
-            batch_size = prompt_embeds.shape[0]
-        
-        # 3. Encode input prompt
-        prompt_embeds, pooled_prompt_embeds, prompt_attention_mask = self.encode_prompt(
-            prompt=prompt,
-            prompt_2=prompt_2,
-            prompt_template=prompt_template,
-            num_videos_per_prompt=num_videos_per_prompt,
-            prompt_embeds=prompt_embeds,
-            pooled_prompt_embeds=pooled_prompt_embeds,
-            prompt_attention_mask=prompt_attention_mask,
-            device=device,
-            max_sequence_length=max_sequence_length,
-        )
-
-        transformer_dtype = self.model.dtype
-        prompt_embeds = prompt_embeds.to(transformer_dtype)
-        prompt_attention_mask = prompt_attention_mask.to(transformer_dtype)
-        if pooled_prompt_embeds is not None:
-            pooled_prompt_embeds = pooled_prompt_embeds.to(transformer_dtype)
-
-        # 4. Prepare timesteps
-        sigmas = np.linspace(1.0, 0.0, num_inference_steps + 1)[:-1] if sigmas is None else sigmas
-        timesteps, num_inference_steps = retrieve_timesteps(
-            self.scheduler,
-            num_inference_steps,
-            device,
-            sigmas=sigmas,
-        )
-
-        # 5. Prepare latent variables
-        num_channels_latents = self.model.config.in_channels
-        num_latent_frames = (num_frames - 1) // self.vae_scale_factor_temporal + 1
-        latents = self.prepare_latents(
-            batch_size * num_videos_per_prompt,
-            num_channels_latents,
-            height,
-            width,
-            num_latent_frames,
-            # torch.float32,
-            torch.float16,
-            device,
-            generator,
-            latents,
-        )
-
-        # 6. Prepare guidance condition
-        guidance = torch.tensor([guidance_scale] * latents.shape[0], dtype=transformer_dtype, device=device) * 1000.0
-
-        # 7. Denoising loop
-        num_warmup_steps = len(timesteps) - num_inference_steps * self.scheduler.order
-        self._num_timesteps = len(timesteps)
-        self.interrupt = False 
-        self.model.cuda()
-
-        # with self.progress_bar(total=num_inference_steps) as progress_bar:
-        for i, t in enumerate(timesteps):
-            if self.interrupt:
-                continue
-
-            self._current_timestep = t
-            latent_model_input = latents.to(transformer_dtype)
-            # broadcast to batch dimension in a way that's compatible with ONNX/Core ML
-            timestep = t.expand(latents.shape[0]).to(latents.dtype)
-
-            noise_pred = self.model(
-                hidden_states=latent_model_input,
-                timestep=timestep,
-                encoder_hidden_states=prompt_embeds,
-                encoder_attention_mask=prompt_attention_mask,
-                pooled_projections=pooled_prompt_embeds,
-                guidance=guidance,
-                attention_kwargs=attention_kwargs,
-                return_dict=False,
-            )[0]
-
-            # compute the previous noisy sample x_t -> x_t-1
-            latents = self.scheduler.step(noise_pred, t, latents, return_dict=False)[0]
-
-            if callback_on_step_end is not None:
-                callback_kwargs = {}
-                for k in callback_on_step_end_tensor_inputs:
-                    callback_kwargs[k] = locals()[k]
-                callback_outputs = callback_on_step_end(self, i, t, callback_kwargs)
-
-                latents = callback_outputs.pop("latents", latents)
-                prompt_embeds = callback_outputs.pop("prompt_embeds", prompt_embeds)
-
-                # # call the callback, if provided
-                # if i == len(timesteps) - 1 or ((i + 1) > num_warmup_steps and (i + 1) % self.scheduler.order == 0):
-                #     progress_bar.update()
-        
-        self._current_timestep = None
-
-        if not output_type == "latent":
-            latents = latents.to(self.vae.dtype) / self.vae.config.scaling_factor
-            video = self.vae.decode(latents, return_dict=False)[0]
-            # video = self.video_processor.postprocess_video(video, output_type=output_type)
-        else:
-            video = latents
-
-        # Offload all models
-        # self.maybe_free_model_hooks()
-
-        video = video[None,...]
-        video = video.cpu()
-        torch.cuda.empty_cache()
-        return video 
-    
-    # training specific functions 
-    def configure_optimizers(self):
-        if self.deepspeed_config is not None and self.deepspeed_config.use_cpu_adam:
-            from deepspeed.ops.adam import DeepSpeedCPUAdam
-            optimizer = DeepSpeedCPUAdam([p for p in self.model.parameters() if p.requires_grad ], lr=self.learning_rate)
-        else:
-            optimizer = torch.optim.AdamW([p for p in self.model.parameters() if p.requires_grad ], lr=self.learning_rate)
-        return optimizer
-    
-    def on_save_checkpoint(self, checkpoint: Dict[str, Any]) -> None:
-        new_satate_dict = checkpoint["state_dict"]
-        new_satate_dict = {k: v for k, v in new_satate_dict.items() if "lora" in k}
-        checkpoint["state_dict"] = new_satate_dict
-        return checkpoint
-        
-    def on_load_checkpoint(self, checkpoint: Dict[str, Any]) -> None:
-        pass
-    
-    def encode_video(self,video):
-        video = video.to(self.device, dtype=self.vae.dtype).unsqueeze(0) # [1, 61, 3, 544, 960]
-        # video = video.to(self.device, dtype=self.vae.dtype).unsqueeze(0) # [1, 61, 3, 544, 960]
-        video = video.permute(0, 2, 1, 3, 4)  # [B, C, F, H, W], [1, 3, 61, 544, 960]
-        
-        latent_dist = self.vae.encode(video).latent_dist
-        return latent_dist
-   
-    def get_batch_input(self, batch):
-        """
-        Prepare model batch inputs 
-        """
-        # equal to collate_fn
-        # the resonable video latents range is [-5,5], approximately.
-        # videos
-        videos = [self.encode_video(video) for video in batch["video"]]
-        videos = [video.sample() * self.vae.config.scaling_factor for video in videos]
-        videos = torch.cat(videos, dim=0)
-        videos = videos.to(memory_format=torch.contiguous_format).float()
-        # prompt
-        prompts = [item for item in batch["prompt"]]
-        return {
-            "videos": videos,
-            "prompts": prompts,
-        }
-    
-    def training_step(self, batch, batch_idx):
-        batch = self.get_batch_input(batch)
-        # model_input = batch["videos"].permute(0, 2, 1, 3, 4).to(dtype=self.vae.dtype)  # [B, F, C, H, W]
-        model_input = batch["videos"].to(dtype=self.vae.dtype)
-        prompts = batch["prompts"]
-        
-        max_sequence_length = 256 # TODO: check this value
-        with torch.no_grad():
-            prompt_embeds, pooled_prompt_embeds, prompt_attention_mask = self.encode_prompt(
-                prompt=prompts,
-                prompt_2=None,
-                prompt_template=DEFAULT_PROMPT_TEMPLATE,
-                num_videos_per_prompt=1,
-                prompt_embeds=None,
-                pooled_prompt_embeds=None,
-                prompt_attention_mask=None,
-                device=self.device,
-                max_sequence_length=max_sequence_length,
-            )
-
-        batch_size, num_frames, num_channels, height, width = model_input.shape
-
-        # flow_weighting_scheme: str = "none"
-        # flow_logit_mean: float = 0.0
-        # flow_logit_std: float = 1.0
-        # flow_mode_scale: float = 1.29
-        
-        sigmas = prepare_sigmas(
-            scheduler=self.scheduler,
-            sigmas=self.scheduler_sigmas.to(self.device),
-            batch_size=batch_size,
-            num_train_timesteps=self.scheduler.config.num_train_timesteps,
-            flow_weighting_scheme="none",
-            flow_logit_mean=0.0,
-            flow_logit_std=1.0,
-            flow_mode_scale=1.29,
-            device=self.device,
-            generator=None, # TODO: do we need to set the generator here?
-        )
-
-        timesteps = (sigmas * 1000.0).long()
-
-        noise = torch.randn(
-            model_input.shape,
-            generator=None, # TODO: do we need to set the generator here?
-            device=self.device,
-            dtype=self.vae.dtype,
-        )
-        sigmas = expand_tensor_dims(sigmas, ndim=noise.ndim)
-
-        noisy_latents = (1.0 - sigmas) * model_input + sigmas * noise
-        noisy_latents = noisy_latents.to(model_input.dtype)
-        weights = prepare_loss_weights(
-            scheduler=self.scheduler,
-            alphas=None, # None for flow matching
-            sigmas=sigmas,
-            flow_weighting_scheme="none",
-        )
-        weights = expand_tensor_dims(weights, noise.ndim)
-
-        # pred = self.model_config["forward_pass"](
-        #     transformer=self.transformer,
-        #     scheduler=self.scheduler,
-        #     timesteps=timesteps,
-        #     **latent_conditions,
-        #     **text_conditions,
-        # )
-        guidance_scale = 1.0
-        guidance = torch.tensor([guidance_scale] * noisy_latents.shape[0], dtype=noisy_latents.dtype, device=noisy_latents.device) * 1000.0
-        
-        model_output = self.model(
-                    hidden_states=noisy_latents,
-                    timestep=timesteps,
-                    encoder_hidden_states=prompt_embeds.to(noisy_latents),
-                    encoder_attention_mask=prompt_attention_mask,
-                    pooled_projections=pooled_prompt_embeds.to(noisy_latents),
-                    guidance=guidance,
-                    # attention_kwargs=attention_kwargs,
-                    return_dict=False,
-                )[0]
-        target = prepare_target(
-            scheduler=self.scheduler, noise=noise, latents=model_input
-        )
-        loss = weights.float() * (model_output.float() - target.float()).pow(2)
-        # Average loss across all but batch dimension
-        loss = loss.mean(list(range(1, loss.ndim)))
-        # Average loss across batch dimension
-        loss = loss.mean()
-        return loss
-
-
-    def training_step_old(self, batch, batch_idx):
-        batch = self.get_batch_input(batch)
-        # model_input = batch["videos"].permute(0, 2, 1, 3, 4).to(dtype=self.vae.dtype)  # [B, F, C, H, W]
-        model_input = batch["videos"].to(dtype=self.vae.dtype)
-        prompts = batch["prompts"]
-        
-        max_sequence_length = 256 # TODO: check this value
-        with torch.no_grad():
-            # prompt_embeds = self.encode_prompt(
-            #     prompts,
-            #     do_classifier_free_guidance=False,# set to false for train
-            #     num_videos_per_prompt=1,
-            #     max_sequence_length=max_sequence_length,
-            #     device=self.device,
-            #     dtype=self.vae.dtype,
-            # )
-            prompt_embeds, pooled_prompt_embeds, prompt_attention_mask = self.encode_prompt(
-                prompt=prompts,
-                prompt_2=None,
-                prompt_template=DEFAULT_PROMPT_TEMPLATE,
-                num_videos_per_prompt=1,
-                prompt_embeds=None,
-                pooled_prompt_embeds=None,
-                prompt_attention_mask=None,
-                device=self.device,
-                max_sequence_length=max_sequence_length,
-            )
-            
-        batch_size, num_frames, num_channels, height, width = model_input.shape
-        
-        # generate noise 
-        # 
-        
-        # Sample noise that will be added to the latents
-        noise = torch.randn_like(model_input)
-
-        # Sample a random timestep for each image
-        timesteps = torch.randint(
-            0, self.scheduler.config.num_train_timesteps, (batch_size,), device=self.device
-        )
-        timesteps = timesteps.long()
-
-        # Add noise to the model input according to the noise magnitude at each timestep
-        # (this is the forward diffusion process)
-        noisy_model_input = self.scheduler.add_noise(model_input, noise, timesteps)
-
-        # guidance = torch.tensor([self._guidance_scale], device=self.device, dtype=self.vae.dtype) * 1000.0
-        guidance_scale = 1.0
-        guidance = torch.tensor([guidance_scale] * noisy_model_input.shape[0], dtype=noisy_model_input.dtype, device=noisy_model_input.device) * 1000.0
-        model_output = self.model(
-                    hidden_states=noisy_model_input,
-                    timestep=timesteps,
-                    encoder_hidden_states=prompt_embeds,
-                    encoder_attention_mask=prompt_attention_mask,
-                    pooled_projections=pooled_prompt_embeds,
-                    guidance=guidance,
-                    # attention_kwargs=attention_kwargs,
-                    return_dict=False,
-                )[0]
-        model_pred = self.scheduler.get_velocity(model_output, noisy_model_input, timesteps)
-
-        alphas_cumprod = self.scheduler.alphas_cumprod[timesteps]
-        weights = 1 / (1 - alphas_cumprod)
-        while len(weights.shape) < len(model_pred.shape):
-            weights = weights.unsqueeze(-1)
-
-        target = model_input
-        # TODO: inherent loss computation from base class. 
-        loss = torch.mean((weights * (model_pred - target) ** 2).reshape(batch_size, -1), dim=1)
-        loss = loss.mean()
-        return loss
-
-
diff --git a/videotuna/models/hunyuan/hyvideo_t2v/inference.py b/videotuna/models/hunyuan/hyvideo_t2v/inference.py
deleted file mode 100644
index 39ab6b2c..00000000
--- a/videotuna/models/hunyuan/hyvideo_t2v/inference.py
+++ /dev/null
@@ -1,671 +0,0 @@
-import os
-import time
-import random
-import functools
-from typing import List, Optional, Tuple, Union
-
-from pathlib import Path
-from loguru import logger
-
-import torch
-import torch.distributed as dist
-from hyvideo_t2v.constants import PROMPT_TEMPLATE, NEGATIVE_PROMPT, PRECISION_TO_TYPE
-from hyvideo_t2v.vae import load_vae
-from hyvideo_t2v.modules import load_model
-from hyvideo_t2v.text_encoder import TextEncoder
-from hyvideo_t2v.utils.data_utils import align_to
-from hyvideo_t2v.modules.posemb_layers import get_nd_rotary_pos_embed
-from hyvideo_t2v.modules.fp8_optimization import convert_fp8_linear
-from hyvideo_t2v.diffusion.schedulers import FlowMatchDiscreteScheduler
-from hyvideo_t2v.diffusion.pipelines import HunyuanVideoPipeline
-
-try:
-    import xfuser
-    from xfuser.core.distributed import (
-        get_sequence_parallel_world_size,
-        get_sequence_parallel_rank,
-        get_sp_group,
-        initialize_model_parallel,
-        init_distributed_environment
-    )
-except:
-    xfuser = None
-    get_sequence_parallel_world_size = None
-    get_sequence_parallel_rank = None
-    get_sp_group = None
-    initialize_model_parallel = None
-    init_distributed_environment = None
-
-
-def parallelize_transformer(pipe):
-    transformer = pipe.transformer
-    original_forward = transformer.forward
-
-    @functools.wraps(transformer.__class__.forward)
-    def new_forward(
-        self,
-        x: torch.Tensor,
-        t: torch.Tensor,  # Should be in range(0, 1000).
-        text_states: torch.Tensor = None,
-        text_mask: torch.Tensor = None,  # Now we don't use it.
-        text_states_2: Optional[torch.Tensor] = None,  # Text embedding for modulation.
-        freqs_cos: Optional[torch.Tensor] = None,
-        freqs_sin: Optional[torch.Tensor] = None,
-        guidance: torch.Tensor = None,  # Guidance for modulation, should be cfg_scale x 1000.
-        return_dict: bool = True,
-    ):
-        if x.shape[-2] // 2 % get_sequence_parallel_world_size() == 0:
-            # try to split x by height
-            split_dim = -2
-        elif x.shape[-1] // 2 % get_sequence_parallel_world_size() == 0:
-            # try to split x by width
-            split_dim = -1
-        else:
-            raise ValueError(f"Cannot split video sequence into ulysses_degree x ring_degree ({get_sequence_parallel_world_size()}) parts evenly")
-
-        # patch sizes for the temporal, height, and width dimensions are 1, 2, and 2.
-        temporal_size, h, w = x.shape[2], x.shape[3] // 2, x.shape[4] // 2
-
-        x = torch.chunk(x, get_sequence_parallel_world_size(),dim=split_dim)[get_sequence_parallel_rank()]
-
-        dim_thw = freqs_cos.shape[-1]
-        freqs_cos = freqs_cos.reshape(temporal_size, h, w, dim_thw)
-        freqs_cos = torch.chunk(freqs_cos, get_sequence_parallel_world_size(),dim=split_dim - 1)[get_sequence_parallel_rank()]
-        freqs_cos = freqs_cos.reshape(-1, dim_thw)
-        dim_thw = freqs_sin.shape[-1]
-        freqs_sin = freqs_sin.reshape(temporal_size, h, w, dim_thw)
-        freqs_sin = torch.chunk(freqs_sin, get_sequence_parallel_world_size(),dim=split_dim - 1)[get_sequence_parallel_rank()]
-        freqs_sin = freqs_sin.reshape(-1, dim_thw)
-        
-        from xfuser.core.long_ctx_attention import xFuserLongContextAttention
-        
-        for block in transformer.double_blocks + transformer.single_blocks:
-            block.hybrid_seq_parallel_attn = xFuserLongContextAttention()
-
-        output = original_forward(
-            x,
-            t,
-            text_states,
-            text_mask,
-            text_states_2,
-            freqs_cos,
-            freqs_sin,
-            guidance,
-            return_dict,
-        )
-
-        return_dict = not isinstance(output, tuple)
-        sample = output["x"]
-        sample = get_sp_group().all_gather(sample, dim=split_dim)
-        output["x"] = sample
-        return output
-
-    new_forward = new_forward.__get__(transformer)
-    transformer.forward = new_forward
-    
-
-class Inference(object):
-    def __init__(
-        self,
-        args,
-        vae,
-        vae_kwargs,
-        text_encoder,
-        model,
-        text_encoder_2=None,
-        pipeline=None,
-        use_cpu_offload=False,
-        device=None,
-        logger=None,
-        parallel_args=None,
-    ):
-        self.vae = vae
-        self.vae_kwargs = vae_kwargs
-
-        self.text_encoder = text_encoder
-        self.text_encoder_2 = text_encoder_2
-
-        self.model = model
-        self.pipeline = pipeline
-        self.use_cpu_offload = use_cpu_offload
-
-        self.args = args
-        self.device = (
-            device
-            if device is not None
-            else "cuda"
-            if torch.cuda.is_available()
-            else "cpu"
-        )
-        self.logger = logger
-        self.parallel_args = parallel_args
-
-    @classmethod
-    def from_pretrained(cls, pretrained_model_path, args, device=None, **kwargs):
-        """
-        Initialize the Inference pipeline.
-
-        Args:
-            pretrained_model_path (str or pathlib.Path): The model path, including t2v, text encoder and vae checkpoints.
-            args (argparse.Namespace): The arguments for the pipeline.
-            device (int): The device for inference. Default is 0.
-        """
-        # ========================================================================
-        logger.info(f"Got text-to-video model root path: {pretrained_model_path}")
-        
-        # ==================== Initialize Distributed Environment ================
-        if args.ulysses_degree > 1 or args.ring_degree > 1:
-            assert xfuser is not None, \
-                "Ulysses Attention and Ring Attention requires xfuser package."
-
-            assert args.use_cpu_offload is False, \
-                "Cannot enable use_cpu_offload in the distributed environment."
-
-            dist.init_process_group("nccl")
-
-            assert dist.get_world_size() == args.ring_degree * args.ulysses_degree, \
-                "number of GPUs should be equal to ring_degree * ulysses_degree."
-
-            init_distributed_environment(rank=dist.get_rank(), world_size=dist.get_world_size())
-            
-            initialize_model_parallel(
-                sequence_parallel_degree=dist.get_world_size(),
-                ring_degree=args.ring_degree,
-                ulysses_degree=args.ulysses_degree,
-            )
-            device = torch.device(f"cuda:{os.environ['LOCAL_RANK']}")
-        else:
-            if device is None:
-                device = "cuda" if torch.cuda.is_available() else "cpu"
-
-        parallel_args = {"ulysses_degree": args.ulysses_degree, "ring_degree": args.ring_degree}
-
-        # ======================== Get the args path =============================
-
-        # Disable gradient
-        torch.set_grad_enabled(False)
-
-        # =========================== Build main model ===========================
-        logger.info("Building model...")
-        factor_kwargs = {"device": device, "dtype": PRECISION_TO_TYPE[args.precision]}
-        in_channels = args.latent_channels
-        out_channels = args.latent_channels
-
-        model = load_model(
-            args,
-            in_channels=in_channels,
-            out_channels=out_channels,
-            factor_kwargs=factor_kwargs,
-        )
-        if args.use_fp8:
-            convert_fp8_linear(model, args.dit_weight, original_dtype=PRECISION_TO_TYPE[args.precision])
-        model = model.to(device)
-        model = Inference.load_state_dict(args, model, pretrained_model_path)
-        model.eval()
-
-        # ============================= Build extra models ========================
-        # VAE
-        vae, _, s_ratio, t_ratio = load_vae(
-            args.vae,
-            args.vae_precision,
-            logger=logger,
-            device=device if not args.use_cpu_offload else "cpu",
-        )
-        vae_kwargs = {"s_ratio": s_ratio, "t_ratio": t_ratio}
-
-        # Text encoder
-        if args.prompt_template_video is not None:
-            crop_start = PROMPT_TEMPLATE[args.prompt_template_video].get(
-                "crop_start", 0
-            )
-        elif args.prompt_template is not None:
-            crop_start = PROMPT_TEMPLATE[args.prompt_template].get("crop_start", 0)
-        else:
-            crop_start = 0
-        max_length = args.text_len + crop_start
-
-        # prompt_template
-        prompt_template = (
-            PROMPT_TEMPLATE[args.prompt_template]
-            if args.prompt_template is not None
-            else None
-        )
-
-        # prompt_template_video
-        prompt_template_video = (
-            PROMPT_TEMPLATE[args.prompt_template_video]
-            if args.prompt_template_video is not None
-            else None
-        )
-
-        text_encoder = TextEncoder(
-            text_encoder_type=args.text_encoder,
-            max_length=max_length,
-            text_encoder_precision=args.text_encoder_precision,
-            tokenizer_type=args.tokenizer,
-            prompt_template=prompt_template,
-            prompt_template_video=prompt_template_video,
-            hidden_state_skip_layer=args.hidden_state_skip_layer,
-            apply_final_norm=args.apply_final_norm,
-            reproduce=args.reproduce,
-            logger=logger,
-            device=device if not args.use_cpu_offload else "cpu",
-        )
-        text_encoder_2 = None
-        if args.text_encoder_2 is not None:
-            text_encoder_2 = TextEncoder(
-                text_encoder_type=args.text_encoder_2,
-                max_length=args.text_len_2,
-                text_encoder_precision=args.text_encoder_precision_2,
-                tokenizer_type=args.tokenizer_2,
-                reproduce=args.reproduce,
-                logger=logger,
-                device=device if not args.use_cpu_offload else "cpu",
-            )
-
-        return cls(
-            args=args,
-            vae=vae,
-            vae_kwargs=vae_kwargs,
-            text_encoder=text_encoder,
-            text_encoder_2=text_encoder_2,
-            model=model,
-            use_cpu_offload=args.use_cpu_offload,
-            device=device,
-            logger=logger,
-            parallel_args=parallel_args
-        )
-
-    @staticmethod
-    def load_state_dict(args, model, pretrained_model_path):
-        load_key = args.load_key
-        dit_weight = Path(args.dit_weight)
-
-        if dit_weight is None:
-            model_dir = pretrained_model_path / f"t2v_{args.model_resolution}"
-            files = list(model_dir.glob("*.pt"))
-            if len(files) == 0:
-                raise ValueError(f"No model weights found in {model_dir}")
-            if str(files[0]).startswith("pytorch_model_"):
-                model_path = dit_weight / f"pytorch_model_{load_key}.pt"
-                bare_model = True
-            elif any(str(f).endswith("_model_states.pt") for f in files):
-                files = [f for f in files if str(f).endswith("_model_states.pt")]
-                model_path = files[0]
-                if len(files) > 1:
-                    logger.warning(
-                        f"Multiple model weights found in {dit_weight}, using {model_path}"
-                    )
-                bare_model = False
-            else:
-                raise ValueError(
-                    f"Invalid model path: {dit_weight} with unrecognized weight format: "
-                    f"{list(map(str, files))}. When given a directory as --dit-weight, only "
-                    f"`pytorch_model_*.pt`(provided by HunyuanDiT official) and "
-                    f"`*_model_states.pt`(saved by deepspeed) can be parsed. If you want to load a "
-                    f"specific weight file, please provide the full path to the file."
-                )
-        else:
-            if dit_weight.is_dir():
-                files = list(dit_weight.glob("*.pt"))
-                if len(files) == 0:
-                    raise ValueError(f"No model weights found in {dit_weight}")
-                if str(files[0]).startswith("pytorch_model_"):
-                    model_path = dit_weight / f"pytorch_model_{load_key}.pt"
-                    bare_model = True
-                elif any(str(f).endswith("_model_states.pt") for f in files):
-                    files = [f for f in files if str(f).endswith("_model_states.pt")]
-                    model_path = files[0]
-                    if len(files) > 1:
-                        logger.warning(
-                            f"Multiple model weights found in {dit_weight}, using {model_path}"
-                        )
-                    bare_model = False
-                else:
-                    raise ValueError(
-                        f"Invalid model path: {dit_weight} with unrecognized weight format: "
-                        f"{list(map(str, files))}. When given a directory as --dit-weight, only "
-                        f"`pytorch_model_*.pt`(provided by HunyuanDiT official) and "
-                        f"`*_model_states.pt`(saved by deepspeed) can be parsed. If you want to load a "
-                        f"specific weight file, please provide the full path to the file."
-                    )
-            elif dit_weight.is_file():
-                model_path = dit_weight
-                bare_model = "unknown"
-            else:
-                raise ValueError(f"Invalid model path: {dit_weight}")
-
-        if not model_path.exists():
-            raise ValueError(f"model_path not exists: {model_path}")
-        logger.info(f"Loading torch model {model_path}...")
-        state_dict = torch.load(model_path, map_location=lambda storage, loc: storage)
-
-        if bare_model == "unknown" and ("ema" in state_dict or "module" in state_dict):
-            bare_model = False
-        if bare_model is False:
-            if load_key in state_dict:
-                state_dict = state_dict[load_key]
-            else:
-                raise KeyError(
-                    f"Missing key: `{load_key}` in the checkpoint: {model_path}. The keys in the checkpoint "
-                    f"are: {list(state_dict.keys())}."
-                )
-        model.load_state_dict(state_dict, strict=True)
-        return model
-
-    @staticmethod
-    def parse_size(size):
-        if isinstance(size, int):
-            size = [size]
-        if not isinstance(size, (list, tuple)):
-            raise ValueError(f"Size must be an integer or (height, width), got {size}.")
-        if len(size) == 1:
-            size = [size[0], size[0]]
-        if len(size) != 2:
-            raise ValueError(f"Size must be an integer or (height, width), got {size}.")
-        return size
-
-
-class HunyuanVideoSampler(Inference):
-    def __init__(
-        self,
-        args,
-        vae,
-        vae_kwargs,
-        text_encoder,
-        model,
-        text_encoder_2=None,
-        pipeline=None,
-        use_cpu_offload=False,
-        device=0,
-        logger=None,
-        parallel_args=None
-    ):
-        super().__init__(
-            args,
-            vae,
-            vae_kwargs,
-            text_encoder,
-            model,
-            text_encoder_2=text_encoder_2,
-            pipeline=pipeline,
-            use_cpu_offload=use_cpu_offload,
-            device=device,
-            logger=logger,
-            parallel_args=parallel_args
-        )
-
-        self.pipeline = self.load_diffusion_pipeline(
-            args=args,
-            vae=self.vae,
-            text_encoder=self.text_encoder,
-            text_encoder_2=self.text_encoder_2,
-            model=self.model,
-            device=self.device,
-        )
-
-        self.default_negative_prompt = NEGATIVE_PROMPT
-        if self.parallel_args['ulysses_degree'] > 1 or self.parallel_args['ring_degree'] > 1:
-            parallelize_transformer(self.pipeline)
-
-    def load_diffusion_pipeline(
-        self,
-        args,
-        vae,
-        text_encoder,
-        text_encoder_2,
-        model,
-        scheduler=None,
-        device=None,
-        progress_bar_config=None,
-        data_type="video",
-    ):
-        """Load the denoising scheduler for inference."""
-        if scheduler is None:
-            if args.denoise_type == "flow":
-                scheduler = FlowMatchDiscreteScheduler(
-                    shift=args.flow_shift,
-                    reverse=args.flow_reverse,
-                    solver=args.flow_solver,
-                )
-            else:
-                raise ValueError(f"Invalid denoise type {args.denoise_type}")
-
-        pipeline = HunyuanVideoPipeline(
-            vae=vae,
-            text_encoder=text_encoder,
-            text_encoder_2=text_encoder_2,
-            transformer=model,
-            scheduler=scheduler,
-            progress_bar_config=progress_bar_config,
-            args=args,
-        )
-        if self.use_cpu_offload:
-            pipeline.enable_sequential_cpu_offload()
-        else:
-            pipeline = pipeline.to(device)
-
-        return pipeline
-
-    def get_rotary_pos_embed(self, video_length, height, width):
-        target_ndim = 3
-        ndim = 5 - 2
-        # 884
-        if "884" in self.args.vae:
-            latents_size = [(video_length - 1) // 4 + 1, height // 8, width // 8]
-        elif "888" in self.args.vae:
-            latents_size = [(video_length - 1) // 8 + 1, height // 8, width // 8]
-        else:
-            latents_size = [video_length, height // 8, width // 8]
-
-        if isinstance(self.model.patch_size, int):
-            assert all(s % self.model.patch_size == 0 for s in latents_size), (
-                f"Latent size(last {ndim} dimensions) should be divisible by patch size({self.model.patch_size}), "
-                f"but got {latents_size}."
-            )
-            rope_sizes = [s // self.model.patch_size for s in latents_size]
-        elif isinstance(self.model.patch_size, list):
-            assert all(
-                s % self.model.patch_size[idx] == 0
-                for idx, s in enumerate(latents_size)
-            ), (
-                f"Latent size(last {ndim} dimensions) should be divisible by patch size({self.model.patch_size}), "
-                f"but got {latents_size}."
-            )
-            rope_sizes = [
-                s // self.model.patch_size[idx] for idx, s in enumerate(latents_size)
-            ]
-
-        if len(rope_sizes) != target_ndim:
-            rope_sizes = [1] * (target_ndim - len(rope_sizes)) + rope_sizes  # time axis
-        head_dim = self.model.hidden_size // self.model.heads_num
-        rope_dim_list = self.model.rope_dim_list
-        if rope_dim_list is None:
-            rope_dim_list = [head_dim // target_ndim for _ in range(target_ndim)]
-        assert (
-            sum(rope_dim_list) == head_dim
-        ), "sum(rope_dim_list) should equal to head_dim of attention layer"
-        freqs_cos, freqs_sin = get_nd_rotary_pos_embed(
-            rope_dim_list,
-            rope_sizes,
-            theta=self.args.rope_theta,
-            use_real=True,
-            theta_rescale_factor=1,
-        )
-        return freqs_cos, freqs_sin
-
-    @torch.no_grad()
-    def predict(
-        self,
-        prompt,
-        height=192,
-        width=336,
-        video_length=129,
-        seed=None,
-        negative_prompt=None,
-        infer_steps=50,
-        guidance_scale=6,
-        flow_shift=5.0,
-        embedded_guidance_scale=None,
-        batch_size=1,
-        num_videos_per_prompt=1,
-        **kwargs,
-    ):
-        """
-        Predict the image/video from the given text.
-
-        Args:
-            prompt (str or List[str]): The input text.
-            kwargs:
-                height (int): The height of the output video. Default is 192.
-                width (int): The width of the output video. Default is 336.
-                video_length (int): The frame number of the output video. Default is 129.
-                seed (int or List[str]): The random seed for the generation. Default is a random integer.
-                negative_prompt (str or List[str]): The negative text prompt. Default is an empty string.
-                guidance_scale (float): The guidance scale for the generation. Default is 6.0.
-                num_images_per_prompt (int): The number of images per prompt. Default is 1.
-                infer_steps (int): The number of inference steps. Default is 100.
-        """
-        out_dict = dict()
-
-        # ========================================================================
-        # Arguments: seed
-        # ========================================================================
-        if isinstance(seed, torch.Tensor):
-            seed = seed.tolist()
-        if seed is None:
-            seeds = [
-                random.randint(0, 1_000_000)
-                for _ in range(batch_size * num_videos_per_prompt)
-            ]
-        elif isinstance(seed, int):
-            seeds = [
-                seed + i
-                for _ in range(batch_size)
-                for i in range(num_videos_per_prompt)
-            ]
-        elif isinstance(seed, (list, tuple)):
-            if len(seed) == batch_size:
-                seeds = [
-                    int(seed[i]) + j
-                    for i in range(batch_size)
-                    for j in range(num_videos_per_prompt)
-                ]
-            elif len(seed) == batch_size * num_videos_per_prompt:
-                seeds = [int(s) for s in seed]
-            else:
-                raise ValueError(
-                    f"Length of seed must be equal to number of prompt(batch_size) or "
-                    f"batch_size * num_videos_per_prompt ({batch_size} * {num_videos_per_prompt}), got {seed}."
-                )
-        else:
-            raise ValueError(
-                f"Seed must be an integer, a list of integers, or None, got {seed}."
-            )
-        generator = [torch.Generator(self.device).manual_seed(seed) for seed in seeds]
-        out_dict["seeds"] = seeds
-
-        # ========================================================================
-        # Arguments: target_width, target_height, target_video_length
-        # ========================================================================
-        if width <= 0 or height <= 0 or video_length <= 0:
-            raise ValueError(
-                f"`height` and `width` and `video_length` must be positive integers, got height={height}, width={width}, video_length={video_length}"
-            )
-        if (video_length - 1) % 4 != 0:
-            raise ValueError(
-                f"`video_length-1` must be a multiple of 4, got {video_length}"
-            )
-
-        logger.info(
-            f"Input (height, width, video_length) = ({height}, {width}, {video_length})"
-        )
-
-        target_height = align_to(height, 16)
-        target_width = align_to(width, 16)
-        target_video_length = video_length
-
-        out_dict["size"] = (target_height, target_width, target_video_length)
-
-        # ========================================================================
-        # Arguments: prompt, new_prompt, negative_prompt
-        # ========================================================================
-        if not isinstance(prompt, str):
-            raise TypeError(f"`prompt` must be a string, but got {type(prompt)}")
-        prompt = [prompt.strip()]
-
-        # negative prompt
-        if negative_prompt is None or negative_prompt == "":
-            negative_prompt = self.default_negative_prompt
-        if not isinstance(negative_prompt, str):
-            raise TypeError(
-                f"`negative_prompt` must be a string, but got {type(negative_prompt)}"
-            )
-        negative_prompt = [negative_prompt.strip()]
-
-        # ========================================================================
-        # Scheduler
-        # ========================================================================
-        scheduler = FlowMatchDiscreteScheduler(
-            shift=flow_shift,
-            reverse=self.args.flow_reverse,
-            solver=self.args.flow_solver
-        )
-        self.pipeline.scheduler = scheduler # yazhou: substitute the scheduler in the pipeline
-
-        # ========================================================================
-        # Build Rope freqs
-        # ========================================================================
-        freqs_cos, freqs_sin = self.get_rotary_pos_embed(
-            target_video_length, target_height, target_width
-        )
-        n_tokens = freqs_cos.shape[0]
-
-        # ========================================================================
-        # Print infer args
-        # ========================================================================
-        debug_str = f"""
-                        height: {target_height}
-                         width: {target_width}
-                  video_length: {target_video_length}
-                        prompt: {prompt}
-                    neg_prompt: {negative_prompt}
-                          seed: {seed}
-                   infer_steps: {infer_steps}
-         num_videos_per_prompt: {num_videos_per_prompt}
-                guidance_scale: {guidance_scale}
-                      n_tokens: {n_tokens}
-                    flow_shift: {flow_shift}
-       embedded_guidance_scale: {embedded_guidance_scale}"""
-        logger.debug(debug_str)
-
-        # ========================================================================
-        # Pipeline inference
-        # ========================================================================
-        start_time = time.time()
-        samples = self.pipeline(
-            prompt=prompt,
-            height=target_height,
-            width=target_width,
-            video_length=target_video_length,
-            num_inference_steps=infer_steps,
-            guidance_scale=guidance_scale,
-            negative_prompt=negative_prompt,
-            num_videos_per_prompt=num_videos_per_prompt,
-            generator=generator,
-            output_type="pil",
-            freqs_cis=(freqs_cos, freqs_sin),
-            n_tokens=n_tokens,
-            embedded_guidance_scale=embedded_guidance_scale,
-            data_type="video" if target_video_length > 1 else "image",
-            is_progress_bar=True,
-            vae_ver=self.args.vae,
-            enable_tiling=self.args.vae_tiling,
-        )[0]
-        out_dict["samples"] = samples
-        out_dict["prompts"] = prompt
-
-        gen_time = time.time() - start_time
-        logger.info(f"Success, time: {gen_time}")
-
-        return out_dict
diff --git a/videotuna/models/hunyuan/hyvideo_t2v/modules/__init__.py b/videotuna/models/hunyuan/hyvideo_t2v/modules/__init__.py
deleted file mode 100644
index 2ebe2c3e..00000000
--- a/videotuna/models/hunyuan/hyvideo_t2v/modules/__init__.py
+++ /dev/null
@@ -1,26 +0,0 @@
-from .models import HYVideoDiffusionTransformer, HUNYUAN_VIDEO_CONFIG
-
-
-def load_model(args, in_channels, out_channels, factor_kwargs):
-    """load hunyuan video model
-
-    Args:
-        args (dict): model args
-        in_channels (int): input channels number
-        out_channels (int): output channels number
-        factor_kwargs (dict): factor kwargs
-
-    Returns:
-        model (nn.Module): The hunyuan video model
-    """
-    if args.model in HUNYUAN_VIDEO_CONFIG.keys():
-        model = HYVideoDiffusionTransformer(
-            args,
-            in_channels=in_channels,
-            out_channels=out_channels,
-            **HUNYUAN_VIDEO_CONFIG[args.model],
-            **factor_kwargs,
-        )
-        return model
-    else:
-        raise NotImplementedError()
diff --git a/videotuna/models/hunyuan/hyvideo_t2v/modules/activation_layers.py b/videotuna/models/hunyuan/hyvideo_t2v/modules/activation_layers.py
deleted file mode 100644
index f8774c26..00000000
--- a/videotuna/models/hunyuan/hyvideo_t2v/modules/activation_layers.py
+++ /dev/null
@@ -1,23 +0,0 @@
-import torch.nn as nn
-
-
-def get_activation_layer(act_type):
-    """get activation layer
-
-    Args:
-        act_type (str): the activation type
-
-    Returns:
-        torch.nn.functional: the activation layer
-    """
-    if act_type == "gelu":
-        return lambda: nn.GELU()
-    elif act_type == "gelu_tanh":
-        # Approximate `tanh` requires torch >= 1.13
-        return lambda: nn.GELU(approximate="tanh")
-    elif act_type == "relu":
-        return nn.ReLU
-    elif act_type == "silu":
-        return nn.SiLU
-    else:
-        raise ValueError(f"Unknown activation type: {act_type}")
diff --git a/videotuna/models/hunyuan/hyvideo_t2v/modules/attenion.py b/videotuna/models/hunyuan/hyvideo_t2v/modules/attenion.py
deleted file mode 100644
index 44548793..00000000
--- a/videotuna/models/hunyuan/hyvideo_t2v/modules/attenion.py
+++ /dev/null
@@ -1,212 +0,0 @@
-import importlib.metadata
-import math
-
-import torch
-import torch.nn as nn
-import torch.nn.functional as F
-
-try:
-    import flash_attn
-    from flash_attn.flash_attn_interface import _flash_attn_forward
-    from flash_attn.flash_attn_interface import flash_attn_varlen_func
-except ImportError:
-    flash_attn = None
-    flash_attn_varlen_func = None
-    _flash_attn_forward = None
-
-
-MEMORY_LAYOUT = {
-    "flash": (
-        lambda x: x.view(x.shape[0] * x.shape[1], *x.shape[2:]),
-        lambda x: x,
-    ),
-    "torch": (
-        lambda x: x.transpose(1, 2),
-        lambda x: x.transpose(1, 2),
-    ),
-    "vanilla": (
-        lambda x: x.transpose(1, 2),
-        lambda x: x.transpose(1, 2),
-    ),
-}
-
-
-def get_cu_seqlens(text_mask, img_len):
-    """Calculate cu_seqlens_q, cu_seqlens_kv using text_mask and img_len
-
-    Args:
-        text_mask (torch.Tensor): the mask of text
-        img_len (int): the length of image
-
-    Returns:
-        torch.Tensor: the calculated cu_seqlens for flash attention
-    """
-    batch_size = text_mask.shape[0]
-    text_len = text_mask.sum(dim=1)
-    max_len = text_mask.shape[1] + img_len
-
-    cu_seqlens = torch.zeros([2 * batch_size + 1], dtype=torch.int32, device="cuda")
-
-    for i in range(batch_size):
-        s = text_len[i] + img_len
-        s1 = i * max_len + s
-        s2 = (i + 1) * max_len
-        cu_seqlens[2 * i + 1] = s1
-        cu_seqlens[2 * i + 2] = s2
-
-    return cu_seqlens
-
-
-def attention(
-    q,
-    k,
-    v,
-    mode="flash",
-    drop_rate=0,
-    attn_mask=None,
-    causal=False,
-    cu_seqlens_q=None,
-    cu_seqlens_kv=None,
-    max_seqlen_q=None,
-    max_seqlen_kv=None,
-    batch_size=1,
-):
-    """
-    Perform QKV self attention.
-
-    Args:
-        q (torch.Tensor): Query tensor with shape [b, s, a, d], where a is the number of heads.
-        k (torch.Tensor): Key tensor with shape [b, s1, a, d]
-        v (torch.Tensor): Value tensor with shape [b, s1, a, d]
-        mode (str): Attention mode. Choose from 'self_flash', 'cross_flash', 'torch', and 'vanilla'.
-        drop_rate (float): Dropout rate in attention map. (default: 0)
-        attn_mask (torch.Tensor): Attention mask with shape [b, s1] (cross_attn), or [b, a, s, s1] (torch or vanilla).
-            (default: None)
-        causal (bool): Whether to use causal attention. (default: False)
-        cu_seqlens_q (torch.Tensor): dtype torch.int32. The cumulative sequence lengths of the sequences in the batch,
-            used to index into q.
-        cu_seqlens_kv (torch.Tensor): dtype torch.int32. The cumulative sequence lengths of the sequences in the batch,
-            used to index into kv.
-        max_seqlen_q (int): The maximum sequence length in the batch of q.
-        max_seqlen_kv (int): The maximum sequence length in the batch of k and v.
-
-    Returns:
-        torch.Tensor: Output tensor after self attention with shape [b, s, ad]
-    """
-    pre_attn_layout, post_attn_layout = MEMORY_LAYOUT[mode]
-    q = pre_attn_layout(q)
-    k = pre_attn_layout(k)
-    v = pre_attn_layout(v)
-
-    if mode == "torch":
-        if attn_mask is not None and attn_mask.dtype != torch.bool:
-            attn_mask = attn_mask.to(q.dtype)
-        x = F.scaled_dot_product_attention(
-            q, k, v, attn_mask=attn_mask, dropout_p=drop_rate, is_causal=causal
-        )
-    elif mode == "flash":
-        x = flash_attn_varlen_func(
-            q,
-            k,
-            v,
-            cu_seqlens_q,
-            cu_seqlens_kv,
-            max_seqlen_q,
-            max_seqlen_kv,
-        )
-        # x with shape [(bxs), a, d]
-        x = x.view(
-            batch_size, max_seqlen_q, x.shape[-2], x.shape[-1]
-        )  # reshape x to [b, s, a, d]
-    elif mode == "vanilla":
-        scale_factor = 1 / math.sqrt(q.size(-1))
-
-        b, a, s, _ = q.shape
-        s1 = k.size(2)
-        attn_bias = torch.zeros(b, a, s, s1, dtype=q.dtype, device=q.device)
-        if causal:
-            # Only applied to self attention
-            assert (
-                attn_mask is None
-            ), "Causal mask and attn_mask cannot be used together"
-            temp_mask = torch.ones(b, a, s, s, dtype=torch.bool, device=q.device).tril(
-                diagonal=0
-            )
-            attn_bias.masked_fill_(temp_mask.logical_not(), float("-inf"))
-            attn_bias.to(q.dtype)
-
-        if attn_mask is not None:
-            if attn_mask.dtype == torch.bool:
-                attn_bias.masked_fill_(attn_mask.logical_not(), float("-inf"))
-            else:
-                attn_bias += attn_mask
-
-        # TODO: Maybe force q and k to be float32 to avoid numerical overflow
-        attn = (q @ k.transpose(-2, -1)) * scale_factor
-        attn += attn_bias
-        attn = attn.softmax(dim=-1)
-        attn = torch.dropout(attn, p=drop_rate, train=True)
-        x = attn @ v
-    else:
-        raise NotImplementedError(f"Unsupported attention mode: {mode}")
-
-    x = post_attn_layout(x)
-    b, s, a, d = x.shape
-    out = x.reshape(b, s, -1)
-    return out
-
-
-def parallel_attention(
-    hybrid_seq_parallel_attn,
-    q,
-    k,
-    v,
-    img_q_len,
-    img_kv_len,
-    cu_seqlens_q,
-    cu_seqlens_kv
-):
-    attn1 = hybrid_seq_parallel_attn(
-        None,
-        q[:, :img_q_len, :, :],
-        k[:, :img_kv_len, :, :],
-        v[:, :img_kv_len, :, :],
-        dropout_p=0.0,
-        causal=False,
-        joint_tensor_query=q[:,img_q_len:cu_seqlens_q[1]],
-        joint_tensor_key=k[:,img_kv_len:cu_seqlens_kv[1]],
-        joint_tensor_value=v[:,img_kv_len:cu_seqlens_kv[1]],
-        joint_strategy="rear",
-    )
-    if flash_attn.__version__ >= '2.7.0':
-        attn2, *_ = _flash_attn_forward(
-            q[:,cu_seqlens_q[1]:],
-            k[:,cu_seqlens_kv[1]:],
-            v[:,cu_seqlens_kv[1]:],
-            dropout_p=0.0,
-            softmax_scale=q.shape[-1] ** (-0.5),
-            causal=False,
-            window_size_left=-1,
-            window_size_right=-1,
-            softcap=0.0,
-            alibi_slopes=None,
-            return_softmax=False,
-        )
-    else:
-        attn2, *_ = _flash_attn_forward(
-            q[:,cu_seqlens_q[1]:],
-            k[:,cu_seqlens_kv[1]:],
-            v[:,cu_seqlens_kv[1]:],
-            dropout_p=0.0,
-            softmax_scale=q.shape[-1] ** (-0.5),
-            causal=False,
-            window_size=(-1, -1),
-            softcap=0.0,
-            alibi_slopes=None,
-            return_softmax=False,
-        )
-    attn = torch.cat([attn1, attn2], dim=1)
-    b, s, a, d = attn.shape
-    attn = attn.reshape(b, s, -1)
-
-    return attn
diff --git a/videotuna/models/hunyuan/hyvideo_t2v/modules/embed_layers.py b/videotuna/models/hunyuan/hyvideo_t2v/modules/embed_layers.py
deleted file mode 100644
index 3d65ed1a..00000000
--- a/videotuna/models/hunyuan/hyvideo_t2v/modules/embed_layers.py
+++ /dev/null
@@ -1,157 +0,0 @@
-import math
-import torch
-import torch.nn as nn
-from einops import rearrange, repeat
-
-from ..utils.helpers import to_2tuple
-
-
-class PatchEmbed(nn.Module):
-    """2D Image to Patch Embedding
-
-    Image to Patch Embedding using Conv2d
-
-    A convolution based approach to patchifying a 2D image w/ embedding projection.
-
-    Based on the impl in https://github.com/google-research/vision_transformer
-
-    Hacked together by / Copyright 2020 Ross Wightman
-
-    Remove the _assert function in forward function to be compatible with multi-resolution images.
-    """
-
-    def __init__(
-        self,
-        patch_size=16,
-        in_chans=3,
-        embed_dim=768,
-        norm_layer=None,
-        flatten=True,
-        bias=True,
-        dtype=None,
-        device=None,
-    ):
-        factory_kwargs = {"dtype": dtype, "device": device}
-        super().__init__()
-        patch_size = to_2tuple(patch_size)
-        self.patch_size = patch_size
-        self.flatten = flatten
-
-        self.proj = nn.Conv3d(
-            in_chans,
-            embed_dim,
-            kernel_size=patch_size,
-            stride=patch_size,
-            bias=bias,
-            **factory_kwargs
-        )
-        nn.init.xavier_uniform_(self.proj.weight.view(self.proj.weight.size(0), -1))
-        if bias:
-            nn.init.zeros_(self.proj.bias)
-
-        self.norm = norm_layer(embed_dim) if norm_layer else nn.Identity()
-
-    def forward(self, x):
-        x = self.proj(x)
-        if self.flatten:
-            x = x.flatten(2).transpose(1, 2)  # BCHW -> BNC
-        x = self.norm(x)
-        return x
-
-
-class TextProjection(nn.Module):
-    """
-    Projects text embeddings. Also handles dropout for classifier-free guidance.
-
-    Adapted from https://github.com/PixArt-alpha/PixArt-alpha/blob/master/diffusion/model/nets/PixArt_blocks.py
-    """
-
-    def __init__(self, in_channels, hidden_size, act_layer, dtype=None, device=None):
-        factory_kwargs = {"dtype": dtype, "device": device}
-        super().__init__()
-        self.linear_1 = nn.Linear(
-            in_features=in_channels,
-            out_features=hidden_size,
-            bias=True,
-            **factory_kwargs
-        )
-        self.act_1 = act_layer()
-        self.linear_2 = nn.Linear(
-            in_features=hidden_size,
-            out_features=hidden_size,
-            bias=True,
-            **factory_kwargs
-        )
-
-    def forward(self, caption):
-        hidden_states = self.linear_1(caption)
-        hidden_states = self.act_1(hidden_states)
-        hidden_states = self.linear_2(hidden_states)
-        return hidden_states
-
-
-def timestep_embedding(t, dim, max_period=10000):
-    """
-    Create sinusoidal timestep embeddings.
-
-    Args:
-        t (torch.Tensor): a 1-D Tensor of N indices, one per batch element. These may be fractional.
-        dim (int): the dimension of the output.
-        max_period (int): controls the minimum frequency of the embeddings.
-
-    Returns:
-        embedding (torch.Tensor): An (N, D) Tensor of positional embeddings.
-
-    .. ref_link: https://github.com/openai/glide-text2im/blob/main/glide_text2im/nn.py
-    """
-    half = dim // 2
-    freqs = torch.exp(
-        -math.log(max_period)
-        * torch.arange(start=0, end=half, dtype=torch.float32)
-        / half
-    ).to(device=t.device)
-    args = t[:, None].float() * freqs[None]
-    embedding = torch.cat([torch.cos(args), torch.sin(args)], dim=-1)
-    if dim % 2:
-        embedding = torch.cat([embedding, torch.zeros_like(embedding[:, :1])], dim=-1)
-    return embedding
-
-
-class TimestepEmbedder(nn.Module):
-    """
-    Embeds scalar timesteps into vector representations.
-    """
-
-    def __init__(
-        self,
-        hidden_size,
-        act_layer,
-        frequency_embedding_size=256,
-        max_period=10000,
-        out_size=None,
-        dtype=None,
-        device=None,
-    ):
-        factory_kwargs = {"dtype": dtype, "device": device}
-        super().__init__()
-        self.frequency_embedding_size = frequency_embedding_size
-        self.max_period = max_period
-        if out_size is None:
-            out_size = hidden_size
-
-        self.mlp = nn.Sequential(
-            nn.Linear(
-                frequency_embedding_size, hidden_size, bias=True, **factory_kwargs
-            ),
-            act_layer(),
-            nn.Linear(hidden_size, out_size, bias=True, **factory_kwargs),
-        )
-        nn.init.normal_(self.mlp[0].weight, std=0.02)
-        nn.init.normal_(self.mlp[2].weight, std=0.02)
-
-    def forward(self, t):
-        t_freq = timestep_embedding(
-            t, self.frequency_embedding_size, self.max_period
-        ).type(self.mlp[0].weight.dtype)
-        t_emb = self.mlp(t_freq)
-        return t_emb
diff --git a/videotuna/models/hunyuan/hyvideo_t2v/modules/fp8_optimization.py b/videotuna/models/hunyuan/hyvideo_t2v/modules/fp8_optimization.py
deleted file mode 100644
index b95c1f49..00000000
--- a/videotuna/models/hunyuan/hyvideo_t2v/modules/fp8_optimization.py
+++ /dev/null
@@ -1,102 +0,0 @@
-import os
-
-import torch
-import torch.nn as nn
-from torch.nn import functional as F
-
-def get_fp_maxval(bits=8, mantissa_bit=3, sign_bits=1):
-    _bits = torch.tensor(bits)
-    _mantissa_bit = torch.tensor(mantissa_bit)
-    _sign_bits = torch.tensor(sign_bits)
-    M = torch.clamp(torch.round(_mantissa_bit), 1, _bits - _sign_bits)
-    E = _bits - _sign_bits - M
-    bias = 2 ** (E - 1) - 1
-    mantissa = 1
-    for i in range(mantissa_bit - 1):
-        mantissa += 1 / (2 ** (i+1))
-    maxval = mantissa * 2 ** (2**E - 1 - bias)
-    return maxval
-
-def quantize_to_fp8(x, bits=8, mantissa_bit=3, sign_bits=1):
-    """
-    Default is E4M3.
-    """
-    bits = torch.tensor(bits)
-    mantissa_bit = torch.tensor(mantissa_bit)
-    sign_bits = torch.tensor(sign_bits)
-    M = torch.clamp(torch.round(mantissa_bit), 1, bits - sign_bits)
-    E = bits - sign_bits - M
-    bias = 2 ** (E - 1) - 1
-    mantissa = 1
-    for i in range(mantissa_bit - 1):
-        mantissa += 1 / (2 ** (i+1))
-    maxval = mantissa * 2 ** (2**E - 1 - bias)
-    minval = - maxval
-    minval = - maxval if sign_bits == 1 else torch.zeros_like(maxval)
-    input_clamp = torch.min(torch.max(x, minval), maxval)
-    log_scales = torch.clamp((torch.floor(torch.log2(torch.abs(input_clamp)) + bias)).detach(), 1.0)
-    log_scales = 2.0 ** (log_scales - M - bias.type(x.dtype))
-    # dequant
-    qdq_out = torch.round(input_clamp / log_scales) * log_scales
-    return qdq_out, log_scales
-
-def fp8_tensor_quant(x, scale, bits=8, mantissa_bit=3, sign_bits=1):
-    for i in range(len(x.shape) - 1):
-        scale = scale.unsqueeze(-1)
-    new_x = x / scale
-    quant_dequant_x, log_scales = quantize_to_fp8(new_x, bits=bits, mantissa_bit=mantissa_bit, sign_bits=sign_bits)
-    return quant_dequant_x, scale, log_scales
-
-def fp8_activation_dequant(qdq_out, scale, dtype):
-    qdq_out = qdq_out.type(dtype)
-    quant_dequant_x = qdq_out * scale.to(dtype)
-    return quant_dequant_x
-
-def fp8_linear_forward(cls, original_dtype, input):
-    weight_dtype = cls.weight.dtype
-    #####
-    if cls.weight.dtype != torch.float8_e4m3fn:
-        maxval = get_fp_maxval()
-        scale = torch.max(torch.abs(cls.weight.flatten())) / maxval
-        linear_weight, scale, log_scales = fp8_tensor_quant(cls.weight, scale)
-        linear_weight = linear_weight.to(torch.float8_e4m3fn)
-        weight_dtype = linear_weight.dtype
-    else:
-        scale = cls.fp8_scale.to(cls.weight.device)
-        linear_weight = cls.weight
-    #####
-
-    if weight_dtype == torch.float8_e4m3fn and cls.weight.sum() != 0:
-        if True or len(input.shape) == 3:
-            cls_dequant = fp8_activation_dequant(linear_weight, scale, original_dtype)
-            if cls.bias != None:
-                output = F.linear(input, cls_dequant, cls.bias)
-            else:
-                output = F.linear(input, cls_dequant)
-            return output
-        else:
-            return cls.original_forward(input.to(original_dtype))
-    else:
-        return cls.original_forward(input)
-
-def convert_fp8_linear(module, dit_weight_path, original_dtype, params_to_keep={}):
-    setattr(module, "fp8_matmul_enabled", True)
-
-    # loading fp8 mapping file
-    fp8_map_path = dit_weight_path.replace('.pt', '_map.pt')
-    if os.path.exists(fp8_map_path):
-        fp8_map = torch.load(fp8_map_path, map_location=lambda storage, loc: storage)
-    else:
-        raise ValueError(f"Invalid fp8_map path: {fp8_map_path}.")
-
-    fp8_layers = []
-    for key, layer in module.named_modules():
-        if isinstance(layer, nn.Linear) and ('double_blocks' in key or 'single_blocks' in key):
-            fp8_layers.append(key)
-            original_forward = layer.forward
-            layer.weight = torch.nn.Parameter(layer.weight.to(torch.float8_e4m3fn))
-            setattr(layer, "fp8_scale", fp8_map[key].to(dtype=original_dtype))
-            setattr(layer, "original_forward", original_forward)
-            setattr(layer, "forward", lambda input, m=layer: fp8_linear_forward(m, original_dtype, input))
-    
-
diff --git a/videotuna/models/hunyuan/hyvideo_t2v/modules/mlp_layers.py b/videotuna/models/hunyuan/hyvideo_t2v/modules/mlp_layers.py
deleted file mode 100644
index 24dd2d9b..00000000
--- a/videotuna/models/hunyuan/hyvideo_t2v/modules/mlp_layers.py
+++ /dev/null
@@ -1,118 +0,0 @@
-# Modified from timm library:
-# https://github.com/huggingface/pytorch-image-models/blob/648aaa41233ba83eb38faf5ba9d415d574823241/timm/layers/mlp.py#L13
-
-from functools import partial
-
-import torch
-import torch.nn as nn
-
-from .modulate_layers import modulate
-from ..utils.helpers import to_2tuple
-
-
-class MLP(nn.Module):
-    """MLP as used in Vision Transformer, MLP-Mixer and related networks"""
-
-    def __init__(
-        self,
-        in_channels,
-        hidden_channels=None,
-        out_features=None,
-        act_layer=nn.GELU,
-        norm_layer=None,
-        bias=True,
-        drop=0.0,
-        use_conv=False,
-        device=None,
-        dtype=None,
-    ):
-        factory_kwargs = {"device": device, "dtype": dtype}
-        super().__init__()
-        out_features = out_features or in_channels
-        hidden_channels = hidden_channels or in_channels
-        bias = to_2tuple(bias)
-        drop_probs = to_2tuple(drop)
-        linear_layer = partial(nn.Conv2d, kernel_size=1) if use_conv else nn.Linear
-
-        self.fc1 = linear_layer(
-            in_channels, hidden_channels, bias=bias[0], **factory_kwargs
-        )
-        self.act = act_layer()
-        self.drop1 = nn.Dropout(drop_probs[0])
-        self.norm = (
-            norm_layer(hidden_channels, **factory_kwargs)
-            if norm_layer is not None
-            else nn.Identity()
-        )
-        self.fc2 = linear_layer(
-            hidden_channels, out_features, bias=bias[1], **factory_kwargs
-        )
-        self.drop2 = nn.Dropout(drop_probs[1])
-
-    def forward(self, x):
-        x = self.fc1(x)
-        x = self.act(x)
-        x = self.drop1(x)
-        x = self.norm(x)
-        x = self.fc2(x)
-        x = self.drop2(x)
-        return x
-
-
-# 
-class MLPEmbedder(nn.Module):
-    """copied from https://github.com/black-forest-labs/flux/blob/main/src/flux/modules/layers.py"""
-    def __init__(self, in_dim: int, hidden_dim: int, device=None, dtype=None):
-        factory_kwargs = {"device": device, "dtype": dtype}
-        super().__init__()
-        self.in_layer = nn.Linear(in_dim, hidden_dim, bias=True, **factory_kwargs)
-        self.silu = nn.SiLU()
-        self.out_layer = nn.Linear(hidden_dim, hidden_dim, bias=True, **factory_kwargs)
-
-    def forward(self, x: torch.Tensor) -> torch.Tensor:
-        return self.out_layer(self.silu(self.in_layer(x)))
-
-
-class FinalLayer(nn.Module):
-    """The final layer of DiT."""
-
-    def __init__(
-        self, hidden_size, patch_size, out_channels, act_layer, device=None, dtype=None
-    ):
-        factory_kwargs = {"device": device, "dtype": dtype}
-        super().__init__()
-
-        # Just use LayerNorm for the final layer
-        self.norm_final = nn.LayerNorm(
-            hidden_size, elementwise_affine=False, eps=1e-6, **factory_kwargs
-        )
-        if isinstance(patch_size, int):
-            self.linear = nn.Linear(
-                hidden_size,
-                patch_size * patch_size * out_channels,
-                bias=True,
-                **factory_kwargs
-            )
-        else:
-            self.linear = nn.Linear(
-                hidden_size,
-                patch_size[0] * patch_size[1] * patch_size[2] * out_channels,
-                bias=True,
-            )
-        nn.init.zeros_(self.linear.weight)
-        nn.init.zeros_(self.linear.bias)
-
-        # Here we don't distinguish between the modulate types. Just use the simple one.
-        self.adaLN_modulation = nn.Sequential(
-            act_layer(),
-            nn.Linear(hidden_size, 2 * hidden_size, bias=True, **factory_kwargs),
-        )
-        # Zero-initialize the modulation
-        nn.init.zeros_(self.adaLN_modulation[1].weight)
-        nn.init.zeros_(self.adaLN_modulation[1].bias)
-
-    def forward(self, x, c):
-        shift, scale = self.adaLN_modulation(c).chunk(2, dim=1)
-        x = modulate(self.norm_final(x), shift=shift, scale=scale)
-        x = self.linear(x)
-        return x
diff --git a/videotuna/models/hunyuan/hyvideo_t2v/modules/models.py b/videotuna/models/hunyuan/hyvideo_t2v/modules/models.py
deleted file mode 100644
index 646a42d0..00000000
--- a/videotuna/models/hunyuan/hyvideo_t2v/modules/models.py
+++ /dev/null
@@ -1,760 +0,0 @@
-from typing import Any, List, Tuple, Optional, Union, Dict
-from einops import rearrange
-
-import torch
-import torch.nn as nn
-import torch.nn.functional as F
-
-from diffusers.models import ModelMixin
-from diffusers.configuration_utils import ConfigMixin, register_to_config
-
-from .activation_layers import get_activation_layer
-from .norm_layers import get_norm_layer
-from .embed_layers import TimestepEmbedder, PatchEmbed, TextProjection
-from .attenion import attention, parallel_attention, get_cu_seqlens
-from .posemb_layers import apply_rotary_emb
-from .mlp_layers import MLP, MLPEmbedder, FinalLayer
-from .modulate_layers import ModulateDiT, modulate, apply_gate
-from .token_refiner import SingleTokenRefiner
-
-
-class MMDoubleStreamBlock(nn.Module):
-    """
-    A multimodal dit block with seperate modulation for
-    text and image/video, see more details (SD3): https://arxiv.org/abs/2403.03206
-                                     (Flux.1): https://github.com/black-forest-labs/flux
-    """
-
-    def __init__(
-        self,
-        hidden_size: int,
-        heads_num: int,
-        mlp_width_ratio: float,
-        mlp_act_type: str = "gelu_tanh",
-        qk_norm: bool = True,
-        qk_norm_type: str = "rms",
-        qkv_bias: bool = False,
-        dtype: Optional[torch.dtype] = None,
-        device: Optional[torch.device] = None,
-    ):
-        factory_kwargs = {"device": device, "dtype": dtype}
-        super().__init__()
-
-        self.deterministic = False
-        self.heads_num = heads_num
-        head_dim = hidden_size // heads_num
-        mlp_hidden_dim = int(hidden_size * mlp_width_ratio)
-
-        self.img_mod = ModulateDiT(
-            hidden_size,
-            factor=6,
-            act_layer=get_activation_layer("silu"),
-            **factory_kwargs,
-        )
-        self.img_norm1 = nn.LayerNorm(
-            hidden_size, elementwise_affine=False, eps=1e-6, **factory_kwargs
-        )
-
-        self.img_attn_qkv = nn.Linear(
-            hidden_size, hidden_size * 3, bias=qkv_bias, **factory_kwargs
-        )
-        qk_norm_layer = get_norm_layer(qk_norm_type)
-        self.img_attn_q_norm = (
-            qk_norm_layer(head_dim, elementwise_affine=True, eps=1e-6, **factory_kwargs)
-            if qk_norm
-            else nn.Identity()
-        )
-        self.img_attn_k_norm = (
-            qk_norm_layer(head_dim, elementwise_affine=True, eps=1e-6, **factory_kwargs)
-            if qk_norm
-            else nn.Identity()
-        )
-        self.img_attn_proj = nn.Linear(
-            hidden_size, hidden_size, bias=qkv_bias, **factory_kwargs
-        )
-
-        self.img_norm2 = nn.LayerNorm(
-            hidden_size, elementwise_affine=False, eps=1e-6, **factory_kwargs
-        )
-        self.img_mlp = MLP(
-            hidden_size,
-            mlp_hidden_dim,
-            act_layer=get_activation_layer(mlp_act_type),
-            bias=True,
-            **factory_kwargs,
-        )
-
-        self.txt_mod = ModulateDiT(
-            hidden_size,
-            factor=6,
-            act_layer=get_activation_layer("silu"),
-            **factory_kwargs,
-        )
-        self.txt_norm1 = nn.LayerNorm(
-            hidden_size, elementwise_affine=False, eps=1e-6, **factory_kwargs
-        )
-
-        self.txt_attn_qkv = nn.Linear(
-            hidden_size, hidden_size * 3, bias=qkv_bias, **factory_kwargs
-        )
-        self.txt_attn_q_norm = (
-            qk_norm_layer(head_dim, elementwise_affine=True, eps=1e-6, **factory_kwargs)
-            if qk_norm
-            else nn.Identity()
-        )
-        self.txt_attn_k_norm = (
-            qk_norm_layer(head_dim, elementwise_affine=True, eps=1e-6, **factory_kwargs)
-            if qk_norm
-            else nn.Identity()
-        )
-        self.txt_attn_proj = nn.Linear(
-            hidden_size, hidden_size, bias=qkv_bias, **factory_kwargs
-        )
-
-        self.txt_norm2 = nn.LayerNorm(
-            hidden_size, elementwise_affine=False, eps=1e-6, **factory_kwargs
-        )
-        self.txt_mlp = MLP(
-            hidden_size,
-            mlp_hidden_dim,
-            act_layer=get_activation_layer(mlp_act_type),
-            bias=True,
-            **factory_kwargs,
-        )
-        self.hybrid_seq_parallel_attn = None
-
-    def enable_deterministic(self):
-        self.deterministic = True
-
-    def disable_deterministic(self):
-        self.deterministic = False
-
-    def forward(
-        self,
-        img: torch.Tensor,
-        txt: torch.Tensor,
-        vec: torch.Tensor,
-        cu_seqlens_q: Optional[torch.Tensor] = None,
-        cu_seqlens_kv: Optional[torch.Tensor] = None,
-        max_seqlen_q: Optional[int] = None,
-        max_seqlen_kv: Optional[int] = None,
-        freqs_cis: tuple = None,
-    ) -> Tuple[torch.Tensor, torch.Tensor]:
-        (
-            img_mod1_shift,
-            img_mod1_scale,
-            img_mod1_gate,
-            img_mod2_shift,
-            img_mod2_scale,
-            img_mod2_gate,
-        ) = self.img_mod(vec).chunk(6, dim=-1)
-        (
-            txt_mod1_shift,
-            txt_mod1_scale,
-            txt_mod1_gate,
-            txt_mod2_shift,
-            txt_mod2_scale,
-            txt_mod2_gate,
-        ) = self.txt_mod(vec).chunk(6, dim=-1)
-
-        # Prepare image for attention.
-        img_modulated = self.img_norm1(img)
-        img_modulated = modulate(
-            img_modulated, shift=img_mod1_shift, scale=img_mod1_scale
-        )
-        img_qkv = self.img_attn_qkv(img_modulated)
-        img_q, img_k, img_v = rearrange(
-            img_qkv, "B L (K H D) -> K B L H D", K=3, H=self.heads_num
-        )
-        # Apply QK-Norm if needed
-        img_q = self.img_attn_q_norm(img_q).to(img_v)
-        img_k = self.img_attn_k_norm(img_k).to(img_v)
-
-        # Apply RoPE if needed.
-        if freqs_cis is not None:
-            img_qq, img_kk = apply_rotary_emb(img_q, img_k, freqs_cis, head_first=False)
-            assert (
-                img_qq.shape == img_q.shape and img_kk.shape == img_k.shape
-            ), f"img_kk: {img_qq.shape}, img_q: {img_q.shape}, img_kk: {img_kk.shape}, img_k: {img_k.shape}"
-            img_q, img_k = img_qq, img_kk
-
-        # Prepare txt for attention.
-        txt_modulated = self.txt_norm1(txt)
-        txt_modulated = modulate(
-            txt_modulated, shift=txt_mod1_shift, scale=txt_mod1_scale
-        )
-        txt_qkv = self.txt_attn_qkv(txt_modulated)
-        txt_q, txt_k, txt_v = rearrange(
-            txt_qkv, "B L (K H D) -> K B L H D", K=3, H=self.heads_num
-        )
-        # Apply QK-Norm if needed.
-        txt_q = self.txt_attn_q_norm(txt_q).to(txt_v)
-        txt_k = self.txt_attn_k_norm(txt_k).to(txt_v)
-
-        # Run actual attention.
-        q = torch.cat((img_q, txt_q), dim=1)
-        k = torch.cat((img_k, txt_k), dim=1)
-        v = torch.cat((img_v, txt_v), dim=1)
-        assert (
-            cu_seqlens_q.shape[0] == 2 * img.shape[0] + 1
-        ), f"cu_seqlens_q.shape:{cu_seqlens_q.shape}, img.shape[0]:{img.shape[0]}"
-        
-        # attention computation start
-        if not self.hybrid_seq_parallel_attn:
-            attn = attention(
-                q,
-                k,
-                v,
-                cu_seqlens_q=cu_seqlens_q,
-                cu_seqlens_kv=cu_seqlens_kv,
-                max_seqlen_q=max_seqlen_q,
-                max_seqlen_kv=max_seqlen_kv,
-                batch_size=img_k.shape[0],
-            )
-        else:
-            attn = parallel_attention(
-                self.hybrid_seq_parallel_attn,
-                q,
-                k,
-                v,
-                img_q_len=img_q.shape[1],
-                img_kv_len=img_k.shape[1],
-                cu_seqlens_q=cu_seqlens_q,
-                cu_seqlens_kv=cu_seqlens_kv
-            )
-            
-        # attention computation end
-
-        img_attn, txt_attn = attn[:, : img.shape[1]], attn[:, img.shape[1] :]
-
-        # Calculate the img bloks.
-        img = img + apply_gate(self.img_attn_proj(img_attn), gate=img_mod1_gate)
-        img = img + apply_gate(
-            self.img_mlp(
-                modulate(
-                    self.img_norm2(img), shift=img_mod2_shift, scale=img_mod2_scale
-                )
-            ),
-            gate=img_mod2_gate,
-        )
-
-        # Calculate the txt bloks.
-        txt = txt + apply_gate(self.txt_attn_proj(txt_attn), gate=txt_mod1_gate)
-        txt = txt + apply_gate(
-            self.txt_mlp(
-                modulate(
-                    self.txt_norm2(txt), shift=txt_mod2_shift, scale=txt_mod2_scale
-                )
-            ),
-            gate=txt_mod2_gate,
-        )
-
-        return img, txt
-
-
-class MMSingleStreamBlock(nn.Module):
-    """
-    A DiT block with parallel linear layers as described in
-    https://arxiv.org/abs/2302.05442 and adapted modulation interface.
-    Also refer to (SD3): https://arxiv.org/abs/2403.03206
-                  (Flux.1): https://github.com/black-forest-labs/flux
-    """
-
-    def __init__(
-        self,
-        hidden_size: int,
-        heads_num: int,
-        mlp_width_ratio: float = 4.0,
-        mlp_act_type: str = "gelu_tanh",
-        qk_norm: bool = True,
-        qk_norm_type: str = "rms",
-        qk_scale: float = None,
-        dtype: Optional[torch.dtype] = None,
-        device: Optional[torch.device] = None,
-    ):
-        factory_kwargs = {"device": device, "dtype": dtype}
-        super().__init__()
-
-        self.deterministic = False
-        self.hidden_size = hidden_size
-        self.heads_num = heads_num
-        head_dim = hidden_size // heads_num
-        mlp_hidden_dim = int(hidden_size * mlp_width_ratio)
-        self.mlp_hidden_dim = mlp_hidden_dim
-        self.scale = qk_scale or head_dim ** -0.5
-
-        # qkv and mlp_in
-        self.linear1 = nn.Linear(
-            hidden_size, hidden_size * 3 + mlp_hidden_dim, **factory_kwargs
-        )
-        # proj and mlp_out
-        self.linear2 = nn.Linear(
-            hidden_size + mlp_hidden_dim, hidden_size, **factory_kwargs
-        )
-
-        qk_norm_layer = get_norm_layer(qk_norm_type)
-        self.q_norm = (
-            qk_norm_layer(head_dim, elementwise_affine=True, eps=1e-6, **factory_kwargs)
-            if qk_norm
-            else nn.Identity()
-        )
-        self.k_norm = (
-            qk_norm_layer(head_dim, elementwise_affine=True, eps=1e-6, **factory_kwargs)
-            if qk_norm
-            else nn.Identity()
-        )
-
-        self.pre_norm = nn.LayerNorm(
-            hidden_size, elementwise_affine=False, eps=1e-6, **factory_kwargs
-        )
-
-        self.mlp_act = get_activation_layer(mlp_act_type)()
-        self.modulation = ModulateDiT(
-            hidden_size,
-            factor=3,
-            act_layer=get_activation_layer("silu"),
-            **factory_kwargs,
-        )
-        self.hybrid_seq_parallel_attn = None
-
-    def enable_deterministic(self):
-        self.deterministic = True
-
-    def disable_deterministic(self):
-        self.deterministic = False
-
-    def forward(
-        self,
-        x: torch.Tensor,
-        vec: torch.Tensor,
-        txt_len: int,
-        cu_seqlens_q: Optional[torch.Tensor] = None,
-        cu_seqlens_kv: Optional[torch.Tensor] = None,
-        max_seqlen_q: Optional[int] = None,
-        max_seqlen_kv: Optional[int] = None,
-        freqs_cis: Tuple[torch.Tensor, torch.Tensor] = None,
-    ) -> torch.Tensor:
-        mod_shift, mod_scale, mod_gate = self.modulation(vec).chunk(3, dim=-1)
-        x_mod = modulate(self.pre_norm(x), shift=mod_shift, scale=mod_scale)
-        qkv, mlp = torch.split(
-            self.linear1(x_mod), [3 * self.hidden_size, self.mlp_hidden_dim], dim=-1
-        )
-
-        q, k, v = rearrange(qkv, "B L (K H D) -> K B L H D", K=3, H=self.heads_num)
-
-        # Apply QK-Norm if needed.
-        q = self.q_norm(q).to(v)
-        k = self.k_norm(k).to(v)
-
-        # Apply RoPE if needed.
-        if freqs_cis is not None:
-            img_q, txt_q = q[:, :-txt_len, :, :], q[:, -txt_len:, :, :]
-            img_k, txt_k = k[:, :-txt_len, :, :], k[:, -txt_len:, :, :]
-            img_qq, img_kk = apply_rotary_emb(img_q, img_k, freqs_cis, head_first=False)
-            assert (
-                img_qq.shape == img_q.shape and img_kk.shape == img_k.shape
-            ), f"img_kk: {img_qq.shape}, img_q: {img_q.shape}, img_kk: {img_kk.shape}, img_k: {img_k.shape}"
-            img_q, img_k = img_qq, img_kk
-            q = torch.cat((img_q, txt_q), dim=1)
-            k = torch.cat((img_k, txt_k), dim=1)
-
-        # Compute attention.
-        assert (
-            cu_seqlens_q.shape[0] == 2 * x.shape[0] + 1
-        ), f"cu_seqlens_q.shape:{cu_seqlens_q.shape}, x.shape[0]:{x.shape[0]}"
-        
-        # attention computation start
-        if not self.hybrid_seq_parallel_attn:
-            attn = attention(
-                q,
-                k,
-                v,
-                cu_seqlens_q=cu_seqlens_q,
-                cu_seqlens_kv=cu_seqlens_kv,
-                max_seqlen_q=max_seqlen_q,
-                max_seqlen_kv=max_seqlen_kv,
-                batch_size=x.shape[0],
-            )
-        else:
-            attn = parallel_attention(
-                self.hybrid_seq_parallel_attn,
-                q,
-                k,
-                v,
-                img_q_len=img_q.shape[1],
-                img_kv_len=img_k.shape[1],
-                cu_seqlens_q=cu_seqlens_q,
-                cu_seqlens_kv=cu_seqlens_kv
-            )
-        # attention computation end
-
-        # Compute activation in mlp stream, cat again and run second linear layer.
-        output = self.linear2(torch.cat((attn, self.mlp_act(mlp)), 2))
-        return x + apply_gate(output, gate=mod_gate)
-
-
-class HYVideoDiffusionTransformer(ModelMixin, ConfigMixin):
-    """
-    HunyuanVideo Transformer backbone
-
-    Inherited from ModelMixin and ConfigMixin for compatibility with diffusers' sampler StableDiffusionPipeline.
-
-    Reference:
-    [1] Flux.1: https://github.com/black-forest-labs/flux
-    [2] MMDiT: http://arxiv.org/abs/2403.03206
-
-    Parameters
-    ----------
-    args: argparse.Namespace
-        The arguments parsed by argparse.
-    patch_size: list
-        The size of the patch.
-    in_channels: int
-        The number of input channels.
-    out_channels: int
-        The number of output channels.
-    hidden_size: int
-        The hidden size of the transformer backbone.
-    heads_num: int
-        The number of attention heads.
-    mlp_width_ratio: float
-        The ratio of the hidden size of the MLP in the transformer block.
-    mlp_act_type: str
-        The activation function of the MLP in the transformer block.
-    depth_double_blocks: int
-        The number of transformer blocks in the double blocks.
-    depth_single_blocks: int
-        The number of transformer blocks in the single blocks.
-    rope_dim_list: list
-        The dimension of the rotary embedding for t, h, w.
-    qkv_bias: bool
-        Whether to use bias in the qkv linear layer.
-    qk_norm: bool
-        Whether to use qk norm.
-    qk_norm_type: str
-        The type of qk norm.
-    guidance_embed: bool
-        Whether to use guidance embedding for distillation.
-    text_projection: str
-        The type of the text projection, default is single_refiner.
-    use_attention_mask: bool
-        Whether to use attention mask for text encoder.
-    dtype: torch.dtype
-        The dtype of the model.
-    device: torch.device
-        The device of the model.
-    """
-
-    @register_to_config
-    def __init__(
-        self,
-        args: Any,
-        patch_size: list = [1, 2, 2],
-        in_channels: int = 4,  # Should be VAE.config.latent_channels.
-        out_channels: int = None,
-        hidden_size: int = 3072,
-        heads_num: int = 24,
-        mlp_width_ratio: float = 4.0,
-        mlp_act_type: str = "gelu_tanh",
-        mm_double_blocks_depth: int = 20,
-        mm_single_blocks_depth: int = 40,
-        rope_dim_list: List[int] = [16, 56, 56],
-        qkv_bias: bool = True,
-        qk_norm: bool = True,
-        qk_norm_type: str = "rms",
-        guidance_embed: bool = False,  # For modulation.
-        text_projection: str = "single_refiner",
-        use_attention_mask: bool = True,
-        dtype: Optional[torch.dtype] = None,
-        device: Optional[torch.device] = None,
-    ):
-        factory_kwargs = {"device": device, "dtype": dtype}
-        super().__init__()
-
-        self.patch_size = patch_size
-        self.in_channels = in_channels
-        self.out_channels = in_channels if out_channels is None else out_channels
-        self.unpatchify_channels = self.out_channels
-        self.guidance_embed = guidance_embed
-        self.rope_dim_list = rope_dim_list
-
-        # Text projection. Default to linear projection.
-        # Alternative: TokenRefiner. See more details (LI-DiT): http://arxiv.org/abs/2406.11831
-        self.use_attention_mask = use_attention_mask
-        self.text_projection = text_projection
-
-        self.text_states_dim = args.text_states_dim
-        self.text_states_dim_2 = args.text_states_dim_2
-
-        if hidden_size % heads_num != 0:
-            raise ValueError(
-                f"Hidden size {hidden_size} must be divisible by heads_num {heads_num}"
-            )
-        pe_dim = hidden_size // heads_num
-        if sum(rope_dim_list) != pe_dim:
-            raise ValueError(
-                f"Got {rope_dim_list} but expected positional dim {pe_dim}"
-            )
-        self.hidden_size = hidden_size
-        self.heads_num = heads_num
-
-        # image projection
-        self.img_in = PatchEmbed(
-            self.patch_size, self.in_channels, self.hidden_size, **factory_kwargs
-        )
-
-        # text projection
-        if self.text_projection == "linear":
-            self.txt_in = TextProjection(
-                self.text_states_dim,
-                self.hidden_size,
-                get_activation_layer("silu"),
-                **factory_kwargs,
-            )
-        elif self.text_projection == "single_refiner":
-            self.txt_in = SingleTokenRefiner(
-                self.text_states_dim, hidden_size, heads_num, depth=2, **factory_kwargs
-            )
-        else:
-            raise NotImplementedError(
-                f"Unsupported text_projection: {self.text_projection}"
-            )
-
-        # time modulation
-        self.time_in = TimestepEmbedder(
-            self.hidden_size, get_activation_layer("silu"), **factory_kwargs
-        )
-
-        # text modulation
-        self.vector_in = MLPEmbedder(
-            self.text_states_dim_2, self.hidden_size, **factory_kwargs
-        )
-
-        # guidance modulation
-        self.guidance_in = (
-            TimestepEmbedder(
-                self.hidden_size, get_activation_layer("silu"), **factory_kwargs
-            )
-            if guidance_embed
-            else None
-        )
-
-        # double blocks
-        self.double_blocks = nn.ModuleList(
-            [
-                MMDoubleStreamBlock(
-                    self.hidden_size,
-                    self.heads_num,
-                    mlp_width_ratio=mlp_width_ratio,
-                    mlp_act_type=mlp_act_type,
-                    qk_norm=qk_norm,
-                    qk_norm_type=qk_norm_type,
-                    qkv_bias=qkv_bias,
-                    **factory_kwargs,
-                )
-                for _ in range(mm_double_blocks_depth)
-            ]
-        )
-
-        # single blocks
-        self.single_blocks = nn.ModuleList(
-            [
-                MMSingleStreamBlock(
-                    self.hidden_size,
-                    self.heads_num,
-                    mlp_width_ratio=mlp_width_ratio,
-                    mlp_act_type=mlp_act_type,
-                    qk_norm=qk_norm,
-                    qk_norm_type=qk_norm_type,
-                    **factory_kwargs,
-                )
-                for _ in range(mm_single_blocks_depth)
-            ]
-        )
-
-        self.final_layer = FinalLayer(
-            self.hidden_size,
-            self.patch_size,
-            self.out_channels,
-            get_activation_layer("silu"),
-            **factory_kwargs,
-        )
-
-    def enable_deterministic(self):
-        for block in self.double_blocks:
-            block.enable_deterministic()
-        for block in self.single_blocks:
-            block.enable_deterministic()
-
-    def disable_deterministic(self):
-        for block in self.double_blocks:
-            block.disable_deterministic()
-        for block in self.single_blocks:
-            block.disable_deterministic()
-
-    def forward(
-        self,
-        x: torch.Tensor,
-        t: torch.Tensor,  # Should be in range(0, 1000).
-        text_states: torch.Tensor = None,
-        text_mask: torch.Tensor = None,  # Now we don't use it.
-        text_states_2: Optional[torch.Tensor] = None,  # Text embedding for modulation.
-        freqs_cos: Optional[torch.Tensor] = None,
-        freqs_sin: Optional[torch.Tensor] = None,
-        guidance: torch.Tensor = None,  # Guidance for modulation, should be cfg_scale x 1000.
-        return_dict: bool = True,
-    ) -> Union[torch.Tensor, Dict[str, torch.Tensor]]:
-        out = {}
-        img = x
-        txt = text_states
-        _, _, ot, oh, ow = x.shape
-        tt, th, tw = (
-            ot // self.patch_size[0],
-            oh // self.patch_size[1],
-            ow // self.patch_size[2],
-        )
-
-        # Prepare modulation vectors.
-        vec = self.time_in(t)
-
-        # text modulation
-        vec = vec + self.vector_in(text_states_2)
-
-        # guidance modulation
-        if self.guidance_embed:
-            if guidance is None:
-                raise ValueError(
-                    "Didn't get guidance strength for guidance distilled model."
-                )
-
-            # our timestep_embedding is merged into guidance_in(TimestepEmbedder)
-            vec = vec + self.guidance_in(guidance)
-
-        # Embed image and text.
-        img = self.img_in(img)
-        if self.text_projection == "linear":
-            txt = self.txt_in(txt)
-        elif self.text_projection == "single_refiner":
-            txt = self.txt_in(txt, t, text_mask if self.use_attention_mask else None)
-        else:
-            raise NotImplementedError(
-                f"Unsupported text_projection: {self.text_projection}"
-            )
-
-        txt_seq_len = txt.shape[1]
-        img_seq_len = img.shape[1]
-
-        # Compute cu_squlens and max_seqlen for flash attention
-        cu_seqlens_q = get_cu_seqlens(text_mask, img_seq_len)
-        cu_seqlens_kv = cu_seqlens_q
-        max_seqlen_q = img_seq_len + txt_seq_len
-        max_seqlen_kv = max_seqlen_q
-
-        freqs_cis = (freqs_cos, freqs_sin) if freqs_cos is not None else None
-        # --------------------- Pass through DiT blocks ------------------------
-        for _, block in enumerate(self.double_blocks):
-            double_block_args = [
-                img,
-                txt,
-                vec,
-                cu_seqlens_q,
-                cu_seqlens_kv,
-                max_seqlen_q,
-                max_seqlen_kv,
-                freqs_cis,
-            ]
-
-            img, txt = block(*double_block_args)
-
-        # Merge txt and img to pass through single stream blocks.
-        x = torch.cat((img, txt), 1)
-        if len(self.single_blocks) > 0:
-            for _, block in enumerate(self.single_blocks):
-                single_block_args = [
-                    x,
-                    vec,
-                    txt_seq_len,
-                    cu_seqlens_q,
-                    cu_seqlens_kv,
-                    max_seqlen_q,
-                    max_seqlen_kv,
-                    (freqs_cos, freqs_sin),
-                ]
-
-                x = block(*single_block_args)
-
-        img = x[:, :img_seq_len, ...]
-
-        # ---------------------------- Final layer ------------------------------
-        img = self.final_layer(img, vec)  # (N, T, patch_size ** 2 * out_channels)
-
-        img = self.unpatchify(img, tt, th, tw)
-        if return_dict:
-            out["x"] = img
-            return out
-        return img
-
-    def unpatchify(self, x, t, h, w):
-        """
-        x: (N, T, patch_size**2 * C)
-        imgs: (N, H, W, C)
-        """
-        c = self.unpatchify_channels
-        pt, ph, pw = self.patch_size
-        assert t * h * w == x.shape[1]
-
-        x = x.reshape(shape=(x.shape[0], t, h, w, c, pt, ph, pw))
-        x = torch.einsum("nthwcopq->nctohpwq", x)
-        imgs = x.reshape(shape=(x.shape[0], c, t * pt, h * ph, w * pw))
-
-        return imgs
-
-    def params_count(self):
-        counts = {
-            "double": sum(
-                [
-                    sum(p.numel() for p in block.img_attn_qkv.parameters())
-                    + sum(p.numel() for p in block.img_attn_proj.parameters())
-                    + sum(p.numel() for p in block.img_mlp.parameters())
-                    + sum(p.numel() for p in block.txt_attn_qkv.parameters())
-                    + sum(p.numel() for p in block.txt_attn_proj.parameters())
-                    + sum(p.numel() for p in block.txt_mlp.parameters())
-                    for block in self.double_blocks
-                ]
-            ),
-            "single": sum(
-                [
-                    sum(p.numel() for p in block.linear1.parameters())
-                    + sum(p.numel() for p in block.linear2.parameters())
-                    for block in self.single_blocks
-                ]
-            ),
-            "total": sum(p.numel() for p in self.parameters()),
-        }
-        counts["attn+mlp"] = counts["double"] + counts["single"]
-        return counts
-
-
-#################################################################################
-#                             HunyuanVideo Configs                              #
-#################################################################################
-
-HUNYUAN_VIDEO_CONFIG = {
-    "HYVideo-T/2": {
-        "mm_double_blocks_depth": 20,
-        "mm_single_blocks_depth": 40,
-        "rope_dim_list": [16, 56, 56],
-        "hidden_size": 3072,
-        "heads_num": 24,
-        "mlp_width_ratio": 4,
-    },
-    "HYVideo-T/2-cfgdistill": {
-        "mm_double_blocks_depth": 20,
-        "mm_single_blocks_depth": 40,
-        "rope_dim_list": [16, 56, 56],
-        "hidden_size": 3072,
-        "heads_num": 24,
-        "mlp_width_ratio": 4,
-        "guidance_embed": True,
-    },
-}
diff --git a/videotuna/models/hunyuan/hyvideo_t2v/modules/modulate_layers.py b/videotuna/models/hunyuan/hyvideo_t2v/modules/modulate_layers.py
deleted file mode 100644
index 93a57c6d..00000000
--- a/videotuna/models/hunyuan/hyvideo_t2v/modules/modulate_layers.py
+++ /dev/null
@@ -1,76 +0,0 @@
-from typing import Callable
-
-import torch
-import torch.nn as nn
-
-
-class ModulateDiT(nn.Module):
-    """Modulation layer for DiT."""
-    def __init__(
-        self,
-        hidden_size: int,
-        factor: int,
-        act_layer: Callable,
-        dtype=None,
-        device=None,
-    ):
-        factory_kwargs = {"dtype": dtype, "device": device}
-        super().__init__()
-        self.act = act_layer()
-        self.linear = nn.Linear(
-            hidden_size, factor * hidden_size, bias=True, **factory_kwargs
-        )
-        # Zero-initialize the modulation
-        nn.init.zeros_(self.linear.weight)
-        nn.init.zeros_(self.linear.bias)
-
-    def forward(self, x: torch.Tensor) -> torch.Tensor:
-        return self.linear(self.act(x))
-
-
-def modulate(x, shift=None, scale=None):
-    """modulate by shift and scale
-
-    Args:
-        x (torch.Tensor): input tensor.
-        shift (torch.Tensor, optional): shift tensor. Defaults to None.
-        scale (torch.Tensor, optional): scale tensor. Defaults to None.
-
-    Returns:
-        torch.Tensor: the output tensor after modulate.
-    """
-    if scale is None and shift is None:
-        return x
-    elif shift is None:
-        return x * (1 + scale.unsqueeze(1))
-    elif scale is None:
-        return x + shift.unsqueeze(1)
-    else:
-        return x * (1 + scale.unsqueeze(1)) + shift.unsqueeze(1)
-
-
-def apply_gate(x, gate=None, tanh=False):
-    """AI is creating summary for apply_gate
-
-    Args:
-        x (torch.Tensor): input tensor.
-        gate (torch.Tensor, optional): gate tensor. Defaults to None.
-        tanh (bool, optional): whether to use tanh function. Defaults to False.
-
-    Returns:
-        torch.Tensor: the output tensor after apply gate.
-    """
-    if gate is None:
-        return x
-    if tanh:
-        return x * gate.unsqueeze(1).tanh()
-    else:
-        return x * gate.unsqueeze(1)
-
-
-def ckpt_wrapper(module):
-    def ckpt_forward(*inputs):
-        outputs = module(*inputs)
-        return outputs
-
-    return ckpt_forward
diff --git a/videotuna/models/hunyuan/hyvideo_t2v/modules/norm_layers.py b/videotuna/models/hunyuan/hyvideo_t2v/modules/norm_layers.py
deleted file mode 100644
index d8c73b1a..00000000
--- a/videotuna/models/hunyuan/hyvideo_t2v/modules/norm_layers.py
+++ /dev/null
@@ -1,77 +0,0 @@
-import torch
-import torch.nn as nn
-
-
-class RMSNorm(nn.Module):
-    def __init__(
-        self,
-        dim: int,
-        elementwise_affine=True,
-        eps: float = 1e-6,
-        device=None,
-        dtype=None,
-    ):
-        """
-        Initialize the RMSNorm normalization layer.
-
-        Args:
-            dim (int): The dimension of the input tensor.
-            eps (float, optional): A small value added to the denominator for numerical stability. Default is 1e-6.
-
-        Attributes:
-            eps (float): A small value added to the denominator for numerical stability.
-            weight (nn.Parameter): Learnable scaling parameter.
-
-        """
-        factory_kwargs = {"device": device, "dtype": dtype}
-        super().__init__()
-        self.eps = eps
-        if elementwise_affine:
-            self.weight = nn.Parameter(torch.ones(dim, **factory_kwargs))
-
-    def _norm(self, x):
-        """
-        Apply the RMSNorm normalization to the input tensor.
-
-        Args:
-            x (torch.Tensor): The input tensor.
-
-        Returns:
-            torch.Tensor: The normalized tensor.
-
-        """
-        return x * torch.rsqrt(x.pow(2).mean(-1, keepdim=True) + self.eps)
-
-    def forward(self, x):
-        """
-        Forward pass through the RMSNorm layer.
-
-        Args:
-            x (torch.Tensor): The input tensor.
-
-        Returns:
-            torch.Tensor: The output tensor after applying RMSNorm.
-
-        """
-        output = self._norm(x.float()).type_as(x)
-        if hasattr(self, "weight"):
-            output = output * self.weight
-        return output
-
-
-def get_norm_layer(norm_layer):
-    """
-    Get the normalization layer.
-
-    Args:
-        norm_layer (str): The type of normalization layer.
-
-    Returns:
-        norm_layer (nn.Module): The normalization layer.
-    """
-    if norm_layer == "layer":
-        return nn.LayerNorm
-    elif norm_layer == "rms":
-        return RMSNorm
-    else:
-        raise NotImplementedError(f"Norm layer {norm_layer} is not implemented")
diff --git a/videotuna/models/hunyuan/hyvideo_t2v/modules/posemb_layers.py b/videotuna/models/hunyuan/hyvideo_t2v/modules/posemb_layers.py
deleted file mode 100644
index dfce82c6..00000000
--- a/videotuna/models/hunyuan/hyvideo_t2v/modules/posemb_layers.py
+++ /dev/null
@@ -1,310 +0,0 @@
-import torch
-from typing import Union, Tuple, List
-
-
-def _to_tuple(x, dim=2):
-    if isinstance(x, int):
-        return (x,) * dim
-    elif len(x) == dim:
-        return x
-    else:
-        raise ValueError(f"Expected length {dim} or int, but got {x}")
-
-
-def get_meshgrid_nd(start, *args, dim=2):
-    """
-    Get n-D meshgrid with start, stop and num.
-
-    Args:
-        start (int or tuple): If len(args) == 0, start is num; If len(args) == 1, start is start, args[0] is stop,
-            step is 1; If len(args) == 2, start is start, args[0] is stop, args[1] is num. For n-dim, start/stop/num
-            should be int or n-tuple. If n-tuple is provided, the meshgrid will be stacked following the dim order in
-            n-tuples.
-        *args: See above.
-        dim (int): Dimension of the meshgrid. Defaults to 2.
-
-    Returns:
-        grid (np.ndarray): [dim, ...]
-    """
-    if len(args) == 0:
-        # start is grid_size
-        num = _to_tuple(start, dim=dim)
-        start = (0,) * dim
-        stop = num
-    elif len(args) == 1:
-        # start is start, args[0] is stop, step is 1
-        start = _to_tuple(start, dim=dim)
-        stop = _to_tuple(args[0], dim=dim)
-        num = [stop[i] - start[i] for i in range(dim)]
-    elif len(args) == 2:
-        # start is start, args[0] is stop, args[1] is num
-        start = _to_tuple(start, dim=dim)  # Left-Top       eg: 12,0
-        stop = _to_tuple(args[0], dim=dim)  # Right-Bottom   eg: 20,32
-        num = _to_tuple(args[1], dim=dim)  # Target Size    eg: 32,124
-    else:
-        raise ValueError(f"len(args) should be 0, 1 or 2, but got {len(args)}")
-
-    # PyTorch implement of np.linspace(start[i], stop[i], num[i], endpoint=False)
-    axis_grid = []
-    for i in range(dim):
-        a, b, n = start[i], stop[i], num[i]
-        g = torch.linspace(a, b, n + 1, dtype=torch.float32)[:n]
-        axis_grid.append(g)
-    grid = torch.meshgrid(*axis_grid, indexing="ij")  # dim x [W, H, D]
-    grid = torch.stack(grid, dim=0)  # [dim, W, H, D]
-
-    return grid
-
-
-#################################################################################
-#                   Rotary Positional Embedding Functions                       #
-#################################################################################
-# https://github.com/meta-llama/llama/blob/be327c427cc5e89cc1d3ab3d3fec4484df771245/llama/model.py#L80
-
-
-def reshape_for_broadcast(
-    freqs_cis: Union[torch.Tensor, Tuple[torch.Tensor]],
-    x: torch.Tensor,
-    head_first=False,
-):
-    """
-    Reshape frequency tensor for broadcasting it with another tensor.
-
-    This function reshapes the frequency tensor to have the same shape as the target tensor 'x'
-    for the purpose of broadcasting the frequency tensor during element-wise operations.
-
-    Notes:
-        When using FlashMHAModified, head_first should be False.
-        When using Attention, head_first should be True.
-
-    Args:
-        freqs_cis (Union[torch.Tensor, Tuple[torch.Tensor]]): Frequency tensor to be reshaped.
-        x (torch.Tensor): Target tensor for broadcasting compatibility.
-        head_first (bool): head dimension first (except batch dim) or not.
-
-    Returns:
-        torch.Tensor: Reshaped frequency tensor.
-
-    Raises:
-        AssertionError: If the frequency tensor doesn't match the expected shape.
-        AssertionError: If the target tensor 'x' doesn't have the expected number of dimensions.
-    """
-    ndim = x.ndim
-    assert 0 <= 1 < ndim
-
-    if isinstance(freqs_cis, tuple):
-        # freqs_cis: (cos, sin) in real space
-        if head_first:
-            assert freqs_cis[0].shape == (
-                x.shape[-2],
-                x.shape[-1],
-            ), f"freqs_cis shape {freqs_cis[0].shape} does not match x shape {x.shape}"
-            shape = [
-                d if i == ndim - 2 or i == ndim - 1 else 1
-                for i, d in enumerate(x.shape)
-            ]
-        else:
-            assert freqs_cis[0].shape == (
-                x.shape[1],
-                x.shape[-1],
-            ), f"freqs_cis shape {freqs_cis[0].shape} does not match x shape {x.shape}"
-            shape = [d if i == 1 or i == ndim - 1 else 1 for i, d in enumerate(x.shape)]
-        return freqs_cis[0].view(*shape), freqs_cis[1].view(*shape)
-    else:
-        # freqs_cis: values in complex space
-        if head_first:
-            assert freqs_cis.shape == (
-                x.shape[-2],
-                x.shape[-1],
-            ), f"freqs_cis shape {freqs_cis.shape} does not match x shape {x.shape}"
-            shape = [
-                d if i == ndim - 2 or i == ndim - 1 else 1
-                for i, d in enumerate(x.shape)
-            ]
-        else:
-            assert freqs_cis.shape == (
-                x.shape[1],
-                x.shape[-1],
-            ), f"freqs_cis shape {freqs_cis.shape} does not match x shape {x.shape}"
-            shape = [d if i == 1 or i == ndim - 1 else 1 for i, d in enumerate(x.shape)]
-        return freqs_cis.view(*shape)
-
-
-def rotate_half(x):
-    x_real, x_imag = (
-        x.float().reshape(*x.shape[:-1], -1, 2).unbind(-1)
-    )  # [B, S, H, D//2]
-    return torch.stack([-x_imag, x_real], dim=-1).flatten(3)
-
-
-def apply_rotary_emb(
-    xq: torch.Tensor,
-    xk: torch.Tensor,
-    freqs_cis: Union[torch.Tensor, Tuple[torch.Tensor, torch.Tensor]],
-    head_first: bool = False,
-) -> Tuple[torch.Tensor, torch.Tensor]:
-    """
-    Apply rotary embeddings to input tensors using the given frequency tensor.
-
-    This function applies rotary embeddings to the given query 'xq' and key 'xk' tensors using the provided
-    frequency tensor 'freqs_cis'. The input tensors are reshaped as complex numbers, and the frequency tensor
-    is reshaped for broadcasting compatibility. The resulting tensors contain rotary embeddings and are
-    returned as real tensors.
-
-    Args:
-        xq (torch.Tensor): Query tensor to apply rotary embeddings. [B, S, H, D]
-        xk (torch.Tensor): Key tensor to apply rotary embeddings.   [B, S, H, D]
-        freqs_cis (torch.Tensor or tuple): Precomputed frequency tensor for complex exponential.
-        head_first (bool): head dimension first (except batch dim) or not.
-
-    Returns:
-        Tuple[torch.Tensor, torch.Tensor]: Tuple of modified query tensor and key tensor with rotary embeddings.
-
-    """
-    xk_out = None
-    if isinstance(freqs_cis, tuple):
-        cos, sin = reshape_for_broadcast(freqs_cis, xq, head_first)  # [S, D]
-        cos, sin = cos.to(xq.device), sin.to(xq.device)
-        # real * cos - imag * sin
-        # imag * cos + real * sin
-        xq_out = (xq.float() * cos + rotate_half(xq.float()) * sin).type_as(xq)
-        xk_out = (xk.float() * cos + rotate_half(xk.float()) * sin).type_as(xk)
-    else:
-        # view_as_complex will pack [..., D/2, 2](real) to [..., D/2](complex)
-        xq_ = torch.view_as_complex(
-            xq.float().reshape(*xq.shape[:-1], -1, 2)
-        )  # [B, S, H, D//2]
-        freqs_cis = reshape_for_broadcast(freqs_cis, xq_, head_first).to(
-            xq.device
-        )  # [S, D//2] --> [1, S, 1, D//2]
-        # (real, imag) * (cos, sin) = (real * cos - imag * sin, imag * cos + real * sin)
-        # view_as_real will expand [..., D/2](complex) to [..., D/2, 2](real)
-        xq_out = torch.view_as_real(xq_ * freqs_cis).flatten(3).type_as(xq)
-        xk_ = torch.view_as_complex(
-            xk.float().reshape(*xk.shape[:-1], -1, 2)
-        )  # [B, S, H, D//2]
-        xk_out = torch.view_as_real(xk_ * freqs_cis).flatten(3).type_as(xk)
-
-    return xq_out, xk_out
-
-
-def get_nd_rotary_pos_embed(
-    rope_dim_list,
-    start,
-    *args,
-    theta=10000.0,
-    use_real=False,
-    theta_rescale_factor: Union[float, List[float]] = 1.0,
-    interpolation_factor: Union[float, List[float]] = 1.0,
-):
-    """
-    This is a n-d version of precompute_freqs_cis, which is a RoPE for tokens with n-d structure.
-
-    Args:
-        rope_dim_list (list of int): Dimension of each rope. len(rope_dim_list) should equal to n.
-            sum(rope_dim_list) should equal to head_dim of attention layer.
-        start (int | tuple of int | list of int): If len(args) == 0, start is num; If len(args) == 1, start is start,
-            args[0] is stop, step is 1; If len(args) == 2, start is start, args[0] is stop, args[1] is num.
-        *args: See above.
-        theta (float): Scaling factor for frequency computation. Defaults to 10000.0.
-        use_real (bool): If True, return real part and imaginary part separately. Otherwise, return complex numbers.
-            Some libraries such as TensorRT does not support complex64 data type. So it is useful to provide a real
-            part and an imaginary part separately.
-        theta_rescale_factor (float): Rescale factor for theta. Defaults to 1.0.
-
-    Returns:
-        pos_embed (torch.Tensor): [HW, D/2]
-    """
-
-    grid = get_meshgrid_nd(
-        start, *args, dim=len(rope_dim_list)
-    )  # [3, W, H, D] / [2, W, H]
-
-    if isinstance(theta_rescale_factor, int) or isinstance(theta_rescale_factor, float):
-        theta_rescale_factor = [theta_rescale_factor] * len(rope_dim_list)
-    elif isinstance(theta_rescale_factor, list) and len(theta_rescale_factor) == 1:
-        theta_rescale_factor = [theta_rescale_factor[0]] * len(rope_dim_list)
-    assert len(theta_rescale_factor) == len(
-        rope_dim_list
-    ), "len(theta_rescale_factor) should equal to len(rope_dim_list)"
-
-    if isinstance(interpolation_factor, int) or isinstance(interpolation_factor, float):
-        interpolation_factor = [interpolation_factor] * len(rope_dim_list)
-    elif isinstance(interpolation_factor, list) and len(interpolation_factor) == 1:
-        interpolation_factor = [interpolation_factor[0]] * len(rope_dim_list)
-    assert len(interpolation_factor) == len(
-        rope_dim_list
-    ), "len(interpolation_factor) should equal to len(rope_dim_list)"
-
-    # use 1/ndim of dimensions to encode grid_axis
-    embs = []
-    for i in range(len(rope_dim_list)):
-        emb = get_1d_rotary_pos_embed(
-            rope_dim_list[i],
-            grid[i].reshape(-1),
-            theta,
-            use_real=use_real,
-            theta_rescale_factor=theta_rescale_factor[i],
-            interpolation_factor=interpolation_factor[i],
-        )  # 2 x [WHD, rope_dim_list[i]]
-        embs.append(emb)
-
-    if use_real:
-        cos = torch.cat([emb[0] for emb in embs], dim=1)  # (WHD, D/2)
-        sin = torch.cat([emb[1] for emb in embs], dim=1)  # (WHD, D/2)
-        return cos, sin
-    else:
-        emb = torch.cat(embs, dim=1)  # (WHD, D/2)
-        return emb
-
-
-def get_1d_rotary_pos_embed(
-    dim: int,
-    pos: Union[torch.FloatTensor, int],
-    theta: float = 10000.0,
-    use_real: bool = False,
-    theta_rescale_factor: float = 1.0,
-    interpolation_factor: float = 1.0,
-) -> Union[torch.Tensor, Tuple[torch.Tensor, torch.Tensor]]:
-    """
-    Precompute the frequency tensor for complex exponential (cis) with given dimensions.
-    (Note: `cis` means `cos + i * sin`, where i is the imaginary unit.)
-
-    This function calculates a frequency tensor with complex exponential using the given dimension 'dim'
-    and the end index 'end'. The 'theta' parameter scales the frequencies.
-    The returned tensor contains complex values in complex64 data type.
-
-    Args:
-        dim (int): Dimension of the frequency tensor.
-        pos (int or torch.FloatTensor): Position indices for the frequency tensor. [S] or scalar
-        theta (float, optional): Scaling factor for frequency computation. Defaults to 10000.0.
-        use_real (bool, optional): If True, return real part and imaginary part separately.
-                                   Otherwise, return complex numbers.
-        theta_rescale_factor (float, optional): Rescale factor for theta. Defaults to 1.0.
-
-    Returns:
-        freqs_cis: Precomputed frequency tensor with complex exponential. [S, D/2]
-        freqs_cos, freqs_sin: Precomputed frequency tensor with real and imaginary parts separately. [S, D]
-    """
-    if isinstance(pos, int):
-        pos = torch.arange(pos).float()
-
-    # proposed by reddit user bloc97, to rescale rotary embeddings to longer sequence length without fine-tuning
-    # has some connection to NTK literature
-    if theta_rescale_factor != 1.0:
-        theta *= theta_rescale_factor ** (dim / (dim - 2))
-
-    freqs = 1.0 / (
-        theta ** (torch.arange(0, dim, 2)[: (dim // 2)].float() / dim)
-    )  # [D/2]
-    # assert interpolation_factor == 1.0, f"interpolation_factor: {interpolation_factor}"
-    freqs = torch.outer(pos * interpolation_factor, freqs)  # [S, D/2]
-    if use_real:
-        freqs_cos = freqs.cos().repeat_interleave(2, dim=1)  # [S, D]
-        freqs_sin = freqs.sin().repeat_interleave(2, dim=1)  # [S, D]
-        return freqs_cos, freqs_sin
-    else:
-        freqs_cis = torch.polar(
-            torch.ones_like(freqs), freqs
-        )  # complex64     # [S, D/2]
-        return freqs_cis
diff --git a/videotuna/models/hunyuan/hyvideo_t2v/modules/token_refiner.py b/videotuna/models/hunyuan/hyvideo_t2v/modules/token_refiner.py
deleted file mode 100644
index bf09278e..00000000
--- a/videotuna/models/hunyuan/hyvideo_t2v/modules/token_refiner.py
+++ /dev/null
@@ -1,236 +0,0 @@
-from typing import Optional
-
-from einops import rearrange
-import torch
-import torch.nn as nn
-
-from .activation_layers import get_activation_layer
-from .attenion import attention
-from .norm_layers import get_norm_layer
-from .embed_layers import TimestepEmbedder, TextProjection
-from .attenion import attention
-from .mlp_layers import MLP
-from .modulate_layers import modulate, apply_gate
-
-
-class IndividualTokenRefinerBlock(nn.Module):
-    def __init__(
-        self,
-        hidden_size,
-        heads_num,
-        mlp_width_ratio: str = 4.0,
-        mlp_drop_rate: float = 0.0,
-        act_type: str = "silu",
-        qk_norm: bool = False,
-        qk_norm_type: str = "layer",
-        qkv_bias: bool = True,
-        dtype: Optional[torch.dtype] = None,
-        device: Optional[torch.device] = None,
-    ):
-        factory_kwargs = {"device": device, "dtype": dtype}
-        super().__init__()
-        self.heads_num = heads_num
-        head_dim = hidden_size // heads_num
-        mlp_hidden_dim = int(hidden_size * mlp_width_ratio)
-
-        self.norm1 = nn.LayerNorm(
-            hidden_size, elementwise_affine=True, eps=1e-6, **factory_kwargs
-        )
-        self.self_attn_qkv = nn.Linear(
-            hidden_size, hidden_size * 3, bias=qkv_bias, **factory_kwargs
-        )
-        qk_norm_layer = get_norm_layer(qk_norm_type)
-        self.self_attn_q_norm = (
-            qk_norm_layer(head_dim, elementwise_affine=True, eps=1e-6, **factory_kwargs)
-            if qk_norm
-            else nn.Identity()
-        )
-        self.self_attn_k_norm = (
-            qk_norm_layer(head_dim, elementwise_affine=True, eps=1e-6, **factory_kwargs)
-            if qk_norm
-            else nn.Identity()
-        )
-        self.self_attn_proj = nn.Linear(
-            hidden_size, hidden_size, bias=qkv_bias, **factory_kwargs
-        )
-
-        self.norm2 = nn.LayerNorm(
-            hidden_size, elementwise_affine=True, eps=1e-6, **factory_kwargs
-        )
-        act_layer = get_activation_layer(act_type)
-        self.mlp = MLP(
-            in_channels=hidden_size,
-            hidden_channels=mlp_hidden_dim,
-            act_layer=act_layer,
-            drop=mlp_drop_rate,
-            **factory_kwargs,
-        )
-
-        self.adaLN_modulation = nn.Sequential(
-            act_layer(),
-            nn.Linear(hidden_size, 2 * hidden_size, bias=True, **factory_kwargs),
-        )
-        # Zero-initialize the modulation
-        nn.init.zeros_(self.adaLN_modulation[1].weight)
-        nn.init.zeros_(self.adaLN_modulation[1].bias)
-
-    def forward(
-        self,
-        x: torch.Tensor,
-        c: torch.Tensor,  # timestep_aware_representations + context_aware_representations
-        attn_mask: torch.Tensor = None,
-    ):
-        gate_msa, gate_mlp = self.adaLN_modulation(c).chunk(2, dim=1)
-
-        norm_x = self.norm1(x)
-        qkv = self.self_attn_qkv(norm_x)
-        q, k, v = rearrange(qkv, "B L (K H D) -> K B L H D", K=3, H=self.heads_num)
-        # Apply QK-Norm if needed
-        q = self.self_attn_q_norm(q).to(v)
-        k = self.self_attn_k_norm(k).to(v)
-
-        # Self-Attention
-        attn = attention(q, k, v, mode="torch", attn_mask=attn_mask)
-
-        x = x + apply_gate(self.self_attn_proj(attn), gate_msa)
-
-        # FFN Layer
-        x = x + apply_gate(self.mlp(self.norm2(x)), gate_mlp)
-
-        return x
-
-
-class IndividualTokenRefiner(nn.Module):
-    def __init__(
-        self,
-        hidden_size,
-        heads_num,
-        depth,
-        mlp_width_ratio: float = 4.0,
-        mlp_drop_rate: float = 0.0,
-        act_type: str = "silu",
-        qk_norm: bool = False,
-        qk_norm_type: str = "layer",
-        qkv_bias: bool = True,
-        dtype: Optional[torch.dtype] = None,
-        device: Optional[torch.device] = None,
-    ):
-        factory_kwargs = {"device": device, "dtype": dtype}
-        super().__init__()
-        self.blocks = nn.ModuleList(
-            [
-                IndividualTokenRefinerBlock(
-                    hidden_size=hidden_size,
-                    heads_num=heads_num,
-                    mlp_width_ratio=mlp_width_ratio,
-                    mlp_drop_rate=mlp_drop_rate,
-                    act_type=act_type,
-                    qk_norm=qk_norm,
-                    qk_norm_type=qk_norm_type,
-                    qkv_bias=qkv_bias,
-                    **factory_kwargs,
-                )
-                for _ in range(depth)
-            ]
-        )
-
-    def forward(
-        self,
-        x: torch.Tensor,
-        c: torch.LongTensor,
-        mask: Optional[torch.Tensor] = None,
-    ):
-        self_attn_mask = None
-        if mask is not None:
-            batch_size = mask.shape[0]
-            seq_len = mask.shape[1]
-            mask = mask.to(x.device)
-            # batch_size x 1 x seq_len x seq_len
-            self_attn_mask_1 = mask.view(batch_size, 1, 1, seq_len).repeat(
-                1, 1, seq_len, 1
-            )
-            # batch_size x 1 x seq_len x seq_len
-            self_attn_mask_2 = self_attn_mask_1.transpose(2, 3)
-            # batch_size x 1 x seq_len x seq_len, 1 for broadcasting of heads_num
-            self_attn_mask = (self_attn_mask_1 & self_attn_mask_2).bool()
-            # avoids self-attention weight being NaN for padding tokens
-            self_attn_mask[:, :, :, 0] = True
-
-        for block in self.blocks:
-            x = block(x, c, self_attn_mask)
-        return x
-
-
-class SingleTokenRefiner(nn.Module):
-    """
-    A single token refiner block for llm text embedding refine.
-    """
-    def __init__(
-        self,
-        in_channels,
-        hidden_size,
-        heads_num,
-        depth,
-        mlp_width_ratio: float = 4.0,
-        mlp_drop_rate: float = 0.0,
-        act_type: str = "silu",
-        qk_norm: bool = False,
-        qk_norm_type: str = "layer",
-        qkv_bias: bool = True,
-        attn_mode: str = "torch",
-        dtype: Optional[torch.dtype] = None,
-        device: Optional[torch.device] = None,
-    ):
-        factory_kwargs = {"device": device, "dtype": dtype}
-        super().__init__()
-        self.attn_mode = attn_mode
-        assert self.attn_mode == "torch", "Only support 'torch' mode for token refiner."
-
-        self.input_embedder = nn.Linear(
-            in_channels, hidden_size, bias=True, **factory_kwargs
-        )
-
-        act_layer = get_activation_layer(act_type)
-        # Build timestep embedding layer
-        self.t_embedder = TimestepEmbedder(hidden_size, act_layer, **factory_kwargs)
-        # Build context embedding layer
-        self.c_embedder = TextProjection(
-            in_channels, hidden_size, act_layer, **factory_kwargs
-        )
-
-        self.individual_token_refiner = IndividualTokenRefiner(
-            hidden_size=hidden_size,
-            heads_num=heads_num,
-            depth=depth,
-            mlp_width_ratio=mlp_width_ratio,
-            mlp_drop_rate=mlp_drop_rate,
-            act_type=act_type,
-            qk_norm=qk_norm,
-            qk_norm_type=qk_norm_type,
-            qkv_bias=qkv_bias,
-            **factory_kwargs,
-        )
-
-    def forward(
-        self,
-        x: torch.Tensor,
-        t: torch.LongTensor,
-        mask: Optional[torch.LongTensor] = None,
-    ):
-        timestep_aware_representations = self.t_embedder(t)
-
-        if mask is None:
-            context_aware_representations = x.mean(dim=1)
-        else:
-            mask_float = mask.float().unsqueeze(-1)  # [b, s1, 1]
-            context_aware_representations = (x * mask_float).sum(
-                dim=1
-            ) / mask_float.sum(dim=1)
-        context_aware_representations = self.c_embedder(context_aware_representations)
-        c = timestep_aware_representations + context_aware_representations
-
-        x = self.input_embedder(x)
-
-        x = self.individual_token_refiner(x, c, mask)
-
-        return x
diff --git a/videotuna/models/hunyuan/hyvideo_t2v/prompt_rewrite.py b/videotuna/models/hunyuan/hyvideo_t2v/prompt_rewrite.py
deleted file mode 100644
index 974c452a..00000000
--- a/videotuna/models/hunyuan/hyvideo_t2v/prompt_rewrite.py
+++ /dev/null
@@ -1,51 +0,0 @@
-normal_mode_prompt = """Normal mode - Video Recaption Task:
-
-You are a large language model specialized in rewriting video descriptions. Your task is to modify the input description.
-
-0. Preserve ALL information, including style words and technical terms.
-
-1. If the input is in Chinese, translate the entire description to English. 
-
-2. If the input is just one or two words describing an object or person, provide a brief, simple description focusing on basic visual characteristics. Limit the description to 1-2 short sentences.
-
-3. If the input does not include style, lighting, atmosphere, you can make reasonable associations.
-
-4. Output ALL must be in English.
-
-Given Input:
-input: "{input}"
-"""
-
-
-master_mode_prompt = """Master mode - Video Recaption Task:
-
-You are a large language model specialized in rewriting video descriptions. Your task is to modify the input description.
-
-0. Preserve ALL information, including style words and technical terms.
-
-1. If the input is in Chinese, translate the entire description to English. 
-
-2. If the input is just one or two words describing an object or person, provide a brief, simple description focusing on basic visual characteristics. Limit the description to 1-2 short sentences.
-
-3. If the input does not include style, lighting, atmosphere, you can make reasonable associations.
-
-4. Output ALL must be in English.
-
-Given Input:
-input: "{input}"
-"""
-
-def get_rewrite_prompt(ori_prompt, mode="Normal"):
-    if mode == "Normal":
-        prompt = normal_mode_prompt.format(input=ori_prompt)
-    elif mode == "Master":
-        prompt = master_mode_prompt.format(input=ori_prompt)
-    else:
-        raise Exception("Only supports Normal and Normal", mode)
-    return prompt
-
-ori_prompt = "一只小狗在草地上奔跑。"
-normal_prompt = get_rewrite_prompt(ori_prompt, mode="Normal")
-master_prompt = get_rewrite_prompt(ori_prompt, mode="Master")
-
-# Then you can use the normal_prompt or master_prompt to access the hunyuan-large rewrite model to get the final prompt.
\ No newline at end of file
diff --git a/videotuna/models/hunyuan/hyvideo_t2v/text_encoder/__init__.py b/videotuna/models/hunyuan/hyvideo_t2v/text_encoder/__init__.py
deleted file mode 100644
index 4fa53ab1..00000000
--- a/videotuna/models/hunyuan/hyvideo_t2v/text_encoder/__init__.py
+++ /dev/null
@@ -1,360 +0,0 @@
-from dataclasses import dataclass
-from typing import Optional, Tuple
-from copy import deepcopy
-
-import torch
-import torch.nn as nn
-from transformers import CLIPTextModel, CLIPTokenizer, AutoTokenizer, AutoModel
-from transformers.utils import ModelOutput
-
-from ..constants import TEXT_ENCODER_PATH, TOKENIZER_PATH
-from ..constants import PRECISION_TO_TYPE
-
-
-def use_default(value, default):
-    return value if value is not None else default
-
-
-def load_text_encoder(
-    text_encoder_type,
-    text_encoder_precision=None,
-    text_encoder_path=None,
-    logger=None,
-    device=None,
-):
-    if text_encoder_path is None:
-        text_encoder_path = TEXT_ENCODER_PATH[text_encoder_type]
-        print(f"text_encoder_path: {text_encoder_path}") 
-    if logger is not None:
-        logger.info(
-            f"Loading text encoder model ({text_encoder_type}) from: {text_encoder_path}"
-        )
-
-    if text_encoder_type == "clipL":
-        text_encoder = CLIPTextModel.from_pretrained(text_encoder_path)
-        text_encoder.final_layer_norm = text_encoder.text_model.final_layer_norm
-    elif text_encoder_type == "llm":
-        text_encoder = AutoModel.from_pretrained(
-            text_encoder_path, low_cpu_mem_usage=True
-        )
-        text_encoder.final_layer_norm = text_encoder.norm
-    else:
-        raise ValueError(f"Unsupported text encoder type: {text_encoder_type}")
-    # from_pretrained will ensure that the model is in eval mode.
-
-    if text_encoder_precision is not None:
-        text_encoder = text_encoder.to(dtype=PRECISION_TO_TYPE[text_encoder_precision])
-
-    text_encoder.requires_grad_(False)
-
-    if logger is not None:
-        logger.info(f"Text encoder to dtype: {text_encoder.dtype}")
-
-    if device is not None:
-        text_encoder = text_encoder.to(device)
-
-    return text_encoder, text_encoder_path
-
-
-def load_tokenizer(
-    tokenizer_type, tokenizer_path=None, padding_side="right", logger=None
-):
-    if tokenizer_path is None:
-        tokenizer_path = TOKENIZER_PATH[tokenizer_type]
-    if logger is not None:
-        logger.info(f"Loading tokenizer ({tokenizer_type}) from: {tokenizer_path}")
-
-    if tokenizer_type == "clipL":
-        tokenizer = CLIPTokenizer.from_pretrained(tokenizer_path, max_length=77)
-    elif tokenizer_type == "llm":
-        print(f"tokenizer_path: {tokenizer_path}")
-        tokenizer = AutoTokenizer.from_pretrained(
-            tokenizer_path, padding_side=padding_side
-        )
-    else:
-        raise ValueError(f"Unsupported tokenizer type: {tokenizer_type}")
-
-    return tokenizer, tokenizer_path
-
-
-@dataclass
-class TextEncoderModelOutput(ModelOutput):
-    """
-    Base class for model's outputs that also contains a pooling of the last hidden states.
-
-    Args:
-        hidden_state (`torch.FloatTensor` of shape `(batch_size, sequence_length, hidden_size)`):
-            Sequence of hidden-states at the output of the last layer of the model.
-        attention_mask (`torch.LongTensor` of shape `(batch_size, sequence_length)`, *optional*):
-            Mask to avoid performing attention on padding token indices. Mask values selected in ``[0, 1]``:
-        hidden_states_list (`tuple(torch.FloatTensor)`, *optional*, returned when `output_hidden_states=True` is passed):
-            Tuple of `torch.FloatTensor` (one for the output of the embeddings, if the model has an embedding layer, +
-            one for the output of each layer) of shape `(batch_size, sequence_length, hidden_size)`.
-            Hidden-states of the model at the output of each layer plus the optional initial embedding outputs.
-        text_outputs (`list`, *optional*, returned when `return_texts=True` is passed):
-            List of decoded texts.
-    """
-
-    hidden_state: torch.FloatTensor = None
-    attention_mask: Optional[torch.LongTensor] = None
-    hidden_states_list: Optional[Tuple[torch.FloatTensor, ...]] = None
-    text_outputs: Optional[list] = None
-
-
-class TextEncoder(nn.Module):
-    def __init__(
-        self,
-        text_encoder_type: str,
-        max_length: int,
-        text_encoder_precision: Optional[str] = None,
-        text_encoder_path: Optional[str] = None,
-        tokenizer_type: Optional[str] = None,
-        tokenizer_path: Optional[str] = None,
-        output_key: Optional[str] = None,
-        use_attention_mask: bool = True,
-        input_max_length: Optional[int] = None,
-        prompt_template: Optional[dict] = None,
-        prompt_template_video: Optional[dict] = None,
-        hidden_state_skip_layer: Optional[int] = None,
-        apply_final_norm: bool = False,
-        reproduce: bool = False,
-        logger=None,
-        device=None,
-    ):
-        super().__init__()
-        self.text_encoder_type = text_encoder_type
-        self.max_length = max_length
-        self.precision = text_encoder_precision
-        self.model_path = text_encoder_path
-        print(f"model_path: {self.model_path}") # None 
-        self.tokenizer_type = (
-            tokenizer_type if tokenizer_type is not None else text_encoder_type
-        )
-        self.tokenizer_path = (
-            tokenizer_path if tokenizer_path is not None else text_encoder_path
-        )
-        self.use_attention_mask = use_attention_mask
-        if prompt_template_video is not None:
-            assert (
-                use_attention_mask is True
-            ), "Attention mask is True required when training videos."
-        self.input_max_length = (
-            input_max_length if input_max_length is not None else max_length
-        )
-        self.prompt_template = prompt_template
-        self.prompt_template_video = prompt_template_video
-        self.hidden_state_skip_layer = hidden_state_skip_layer
-        self.apply_final_norm = apply_final_norm
-        self.reproduce = reproduce
-        self.logger = logger
-
-        self.use_template = self.prompt_template is not None
-        if self.use_template:
-            assert (
-                isinstance(self.prompt_template, dict)
-                and "template" in self.prompt_template
-            ), f"`prompt_template` must be a dictionary with a key 'template', got {self.prompt_template}"
-            assert "{}" in str(self.prompt_template["template"]), (
-                "`prompt_template['template']` must contain a placeholder `{}` for the input text, "
-                f"got {self.prompt_template['template']}"
-            )
-
-        self.use_video_template = self.prompt_template_video is not None
-        if self.use_video_template:
-            if self.prompt_template_video is not None:
-                assert (
-                    isinstance(self.prompt_template_video, dict)
-                    and "template" in self.prompt_template_video
-                ), f"`prompt_template_video` must be a dictionary with a key 'template', got {self.prompt_template_video}"
-            assert "{}" in str(self.prompt_template_video["template"]), (
-                "`prompt_template_video['template']` must contain a placeholder `{}` for the input text, "
-                f"got {self.prompt_template_video['template']}"
-            )
-
-        if "t5" in text_encoder_type:
-            self.output_key = output_key or "last_hidden_state"
-        elif "clip" in text_encoder_type:
-            self.output_key = output_key or "pooler_output"
-        elif "llm" in text_encoder_type or "glm" in text_encoder_type:
-            self.output_key = output_key or "last_hidden_state"
-        else:
-            raise ValueError(f"Unsupported text encoder type: {text_encoder_type}")
-
-        self.model, self.model_path = load_text_encoder(
-            text_encoder_type=self.text_encoder_type,
-            text_encoder_precision=self.precision,
-            text_encoder_path=self.model_path,
-            logger=self.logger,
-            device=device,
-        )
-        self.dtype = self.model.dtype
-        self.device = self.model.device
-
-        self.tokenizer, self.tokenizer_path = load_tokenizer(
-            tokenizer_type=self.tokenizer_type,
-            tokenizer_path=self.tokenizer_path,
-            padding_side="right",
-            logger=self.logger,
-        )
-
-    def __repr__(self):
-        return f"{self.text_encoder_type} ({self.precision} - {self.model_path})"
-
-    @staticmethod
-    def apply_text_to_template(text, template, prevent_empty_text=True):
-        """
-        Apply text to template.
-
-        Args:
-            text (str): Input text.
-            template (str or list): Template string or list of chat conversation.
-            prevent_empty_text (bool): If Ture, we will prevent the user text from being empty
-                by adding a space. Defaults to True.
-        """
-        if isinstance(template, str):
-            # Will send string to tokenizer. Used for llm
-            return template.format(text)
-        else:
-            raise TypeError(f"Unsupported template type: {type(template)}")
-
-    def text2tokens(self, text, data_type="image"):
-        """
-        Tokenize the input text.
-
-        Args:
-            text (str or list): Input text.
-        """
-        tokenize_input_type = "str"
-        if self.use_template:
-            if data_type == "image":
-                prompt_template = self.prompt_template["template"]
-            elif data_type == "video":
-                prompt_template = self.prompt_template_video["template"]
-            else:
-                raise ValueError(f"Unsupported data type: {data_type}")
-            if isinstance(text, (list, tuple)):
-                text = [
-                    self.apply_text_to_template(one_text, prompt_template)
-                    for one_text in text
-                ]
-                if isinstance(text[0], list):
-                    tokenize_input_type = "list"
-            elif isinstance(text, str):
-                text = self.apply_text_to_template(text, prompt_template)
-                if isinstance(text, list):
-                    tokenize_input_type = "list"
-            else:
-                raise TypeError(f"Unsupported text type: {type(text)}")
-
-        kwargs = dict(
-            truncation=True,
-            max_length=self.max_length,
-            padding="max_length",
-            return_tensors="pt",
-        )
-        if tokenize_input_type == "str":
-            return self.tokenizer(
-                text,
-                return_length=False,
-                return_overflowing_tokens=False,
-                return_attention_mask=True,
-                **kwargs,
-            )
-        elif tokenize_input_type == "list":
-            return self.tokenizer.apply_chat_template(
-                text,
-                add_generation_prompt=True,
-                tokenize=True,
-                return_dict=True,
-                **kwargs,
-            )
-        else:
-            raise ValueError(f"Unsupported tokenize_input_type: {tokenize_input_type}")
-
-    def encode(
-        self,
-        batch_encoding,
-        use_attention_mask=None,
-        output_hidden_states=False,
-        do_sample=None,
-        hidden_state_skip_layer=None,
-        return_texts=False,
-        data_type="image",
-        device=None,
-    ):
-        """
-        Args:
-            batch_encoding (dict): Batch encoding from tokenizer.
-            use_attention_mask (bool): Whether to use attention mask. If None, use self.use_attention_mask.
-                Defaults to None.
-            output_hidden_states (bool): Whether to output hidden states. If False, return the value of
-                self.output_key. If True, return the entire output. If set self.hidden_state_skip_layer,
-                output_hidden_states will be set True. Defaults to False.
-            do_sample (bool): Whether to sample from the model. Used for Decoder-Only LLMs. Defaults to None.
-                When self.produce is False, do_sample is set to True by default.
-            hidden_state_skip_layer (int): Number of hidden states to hidden_state_skip_layer. 0 means the last layer.
-                If None, self.output_key will be used. Defaults to None.
-            return_texts (bool): Whether to return the decoded texts. Defaults to False.
-        """
-        device = self.model.device if device is None else device
-        use_attention_mask = use_default(use_attention_mask, self.use_attention_mask)
-        hidden_state_skip_layer = use_default(
-            hidden_state_skip_layer, self.hidden_state_skip_layer
-        )
-        do_sample = use_default(do_sample, not self.reproduce)
-        attention_mask = (
-            batch_encoding["attention_mask"].to(device) if use_attention_mask else None
-        )
-        outputs = self.model(
-            input_ids=batch_encoding["input_ids"].to(device),
-            attention_mask=attention_mask,
-            output_hidden_states=output_hidden_states
-            or hidden_state_skip_layer is not None,
-        )
-        if hidden_state_skip_layer is not None:
-            last_hidden_state = outputs.hidden_states[-(hidden_state_skip_layer + 1)]
-            # Real last hidden state already has layer norm applied. So here we only apply it
-            # for intermediate layers.
-            if hidden_state_skip_layer > 0 and self.apply_final_norm:
-                last_hidden_state = self.model.final_layer_norm(last_hidden_state)
-        else:
-            last_hidden_state = outputs[self.output_key]
-
-        # Remove hidden states of instruction tokens, only keep prompt tokens.
-        if self.use_template:
-            if data_type == "image":
-                crop_start = self.prompt_template.get("crop_start", -1)
-            elif data_type == "video":
-                crop_start = self.prompt_template_video.get("crop_start", -1)
-            else:
-                raise ValueError(f"Unsupported data type: {data_type}")
-            if crop_start > 0:
-                last_hidden_state = last_hidden_state[:, crop_start:]
-                attention_mask = (
-                    attention_mask[:, crop_start:] if use_attention_mask else None
-                )
-
-        if output_hidden_states:
-            return TextEncoderModelOutput(
-                last_hidden_state, attention_mask, outputs.hidden_states
-            )
-        return TextEncoderModelOutput(last_hidden_state, attention_mask)
-
-    def forward(
-        self,
-        text,
-        use_attention_mask=None,
-        output_hidden_states=False,
-        do_sample=False,
-        hidden_state_skip_layer=None,
-        return_texts=False,
-    ):
-        batch_encoding = self.text2tokens(text)
-        return self.encode(
-            batch_encoding,
-            use_attention_mask=use_attention_mask,
-            output_hidden_states=output_hidden_states,
-            do_sample=do_sample,
-            hidden_state_skip_layer=hidden_state_skip_layer,
-            return_texts=return_texts,
-        )
diff --git a/videotuna/models/hunyuan/hyvideo_t2v/utils/__init__.py b/videotuna/models/hunyuan/hyvideo_t2v/utils/__init__.py
deleted file mode 100644
index e69de29b..00000000
diff --git a/videotuna/models/hunyuan/hyvideo_t2v/utils/data_utils.py b/videotuna/models/hunyuan/hyvideo_t2v/utils/data_utils.py
deleted file mode 100644
index 583a9035..00000000
--- a/videotuna/models/hunyuan/hyvideo_t2v/utils/data_utils.py
+++ /dev/null
@@ -1,15 +0,0 @@
-import numpy as np
-import math
-
-
-def align_to(value, alignment):
-    """align hight, width according to alignment
-
-    Args:
-        value (int): height or width
-        alignment (int): target alignment factor
-
-    Returns:
-        int: the aligned value
-    """
-    return int(math.ceil(value / alignment) * alignment)
diff --git a/videotuna/models/hunyuan/hyvideo_t2v/utils/file_utils.py b/videotuna/models/hunyuan/hyvideo_t2v/utils/file_utils.py
deleted file mode 100644
index 2ba36514..00000000
--- a/videotuna/models/hunyuan/hyvideo_t2v/utils/file_utils.py
+++ /dev/null
@@ -1,70 +0,0 @@
-import os
-from pathlib import Path
-from einops import rearrange
-
-import torch
-import torchvision
-import numpy as np
-import imageio
-
-CODE_SUFFIXES = {
-    ".py",  # Python codes
-    ".sh",  # Shell scripts
-    ".yaml",
-    ".yml",  # Configuration files
-}
-
-
-def safe_dir(path):
-    """
-    Create a directory (or the parent directory of a file) if it does not exist.
-
-    Args:
-        path (str or Path): Path to the directory.
-
-    Returns:
-        path (Path): Path object of the directory.
-    """
-    path = Path(path)
-    path.mkdir(exist_ok=True, parents=True)
-    return path
-
-
-def safe_file(path):
-    """
-    Create the parent directory of a file if it does not exist.
-
-    Args:
-        path (str or Path): Path to the file.
-
-    Returns:
-        path (Path): Path object of the file.
-    """
-    path = Path(path)
-    path.parent.mkdir(exist_ok=True, parents=True)
-    return path
-
-def save_videos_grid(videos: torch.Tensor, path: str, rescale=False, n_rows=1, fps=24):
-    """save videos by video tensor
-       copy from https://github.com/guoyww/AnimateDiff/blob/e92bd5671ba62c0d774a32951453e328018b7c5b/animatediff/utils/util.py#L61
-
-    Args:
-        videos (torch.Tensor): video tensor predicted by the model
-        path (str): path to save video
-        rescale (bool, optional): rescale the video tensor from [-1, 1] to  . Defaults to False.
-        n_rows (int, optional): Defaults to 1.
-        fps (int, optional): video save fps. Defaults to 8.
-    """
-    videos = rearrange(videos, "b c t h w -> t b c h w")
-    outputs = []
-    for x in videos:
-        x = torchvision.utils.make_grid(x, nrow=n_rows)
-        x = x.transpose(0, 1).transpose(1, 2).squeeze(-1)
-        if rescale:
-            x = (x + 1.0) / 2.0  # -1,1 -> 0,1
-        x = torch.clamp(x, 0, 1)
-        x = (x * 255).numpy().astype(np.uint8)
-        outputs.append(x)
-
-    os.makedirs(os.path.dirname(path), exist_ok=True)
-    imageio.mimsave(path, outputs, fps=fps)
diff --git a/videotuna/models/hunyuan/hyvideo_t2v/utils/helpers.py b/videotuna/models/hunyuan/hyvideo_t2v/utils/helpers.py
deleted file mode 100644
index 72ab8cb1..00000000
--- a/videotuna/models/hunyuan/hyvideo_t2v/utils/helpers.py
+++ /dev/null
@@ -1,40 +0,0 @@
-import collections.abc
-
-from itertools import repeat
-
-
-def _ntuple(n):
-    def parse(x):
-        if isinstance(x, collections.abc.Iterable) and not isinstance(x, str):
-            x = tuple(x)
-            if len(x) == 1:
-                x = tuple(repeat(x[0], n))
-            return x
-        return tuple(repeat(x, n))
-    return parse
-
-
-to_1tuple = _ntuple(1)
-to_2tuple = _ntuple(2)
-to_3tuple = _ntuple(3)
-to_4tuple = _ntuple(4)
-
-
-def as_tuple(x):
-    if isinstance(x, collections.abc.Iterable) and not isinstance(x, str):
-        return tuple(x)
-    if x is None or isinstance(x, (int, float, str)):
-        return (x,)
-    else:
-        raise ValueError(f"Unknown type {type(x)}")
-
-
-def as_list_of_2tuple(x):
-    x = as_tuple(x)
-    if len(x) == 1:
-        x = (x[0], x[0])
-    assert len(x) % 2 == 0, f"Expect even length, got {len(x)}."
-    lst = []
-    for i in range(0, len(x), 2):
-        lst.append((x[i], x[i + 1]))
-    return lst
diff --git a/videotuna/models/hunyuan/hyvideo_t2v/utils/preprocess_text_encoder_tokenizer_utils.py b/videotuna/models/hunyuan/hyvideo_t2v/utils/preprocess_text_encoder_tokenizer_utils.py
deleted file mode 100644
index 2908eb29..00000000
--- a/videotuna/models/hunyuan/hyvideo_t2v/utils/preprocess_text_encoder_tokenizer_utils.py
+++ /dev/null
@@ -1,46 +0,0 @@
-import argparse
-import torch
-from transformers import (
-    AutoProcessor,
-    LlavaForConditionalGeneration,
-)
-
-
-def preprocess_text_encoder_tokenizer(args):
-
-    processor = AutoProcessor.from_pretrained(args.input_dir)
-    model = LlavaForConditionalGeneration.from_pretrained(
-        args.input_dir,
-        torch_dtype=torch.float16,
-        low_cpu_mem_usage=True,
-    ).to(0)
-
-    model.language_model.save_pretrained(
-        f"{args.output_dir}"
-    )
-    processor.tokenizer.save_pretrained(
-        f"{args.output_dir}"
-    )
-
-if __name__ == "__main__":
-
-    parser = argparse.ArgumentParser()
-    parser.add_argument(
-        "--input_dir",
-        type=str,
-        required=True,
-        help="The path to the llava-llama-3-8b-v1_1-transformers.",
-    )
-    parser.add_argument(
-        "--output_dir",
-        type=str,
-        default="",
-        help="The output path of the llava-llama-3-8b-text-encoder-tokenizer."
-        "if '', the parent dir of output will be the same as input dir.",
-    )
-    args = parser.parse_args()
-
-    if len(args.output_dir) == 0:
-        args.output_dir = "/".join(args.input_dir.split("/")[:-1])
-
-    preprocess_text_encoder_tokenizer(args)
diff --git a/videotuna/models/hunyuan/hyvideo_t2v/vae/__init__.py b/videotuna/models/hunyuan/hyvideo_t2v/vae/__init__.py
deleted file mode 100644
index 7a0d3962..00000000
--- a/videotuna/models/hunyuan/hyvideo_t2v/vae/__init__.py
+++ /dev/null
@@ -1,62 +0,0 @@
-from pathlib import Path
-
-import torch
-
-from .autoencoder_kl_causal_3d import AutoencoderKLCausal3D
-from ..constants import VAE_PATH, PRECISION_TO_TYPE
-
-def load_vae(vae_type: str="884-16c-hy",
-             vae_precision: str=None,
-             sample_size: tuple=None,
-             vae_path: str=None,
-             logger=None,
-             device=None
-             ):
-    """the fucntion to load the 3D VAE model
-
-    Args:
-        vae_type (str): the type of the 3D VAE model. Defaults to "884-16c-hy".
-        vae_precision (str, optional): the precision to load vae. Defaults to None.
-        sample_size (tuple, optional): the tiling size. Defaults to None.
-        vae_path (str, optional): the path to vae. Defaults to None.
-        logger (_type_, optional): logger. Defaults to None.
-        device (_type_, optional): device to load vae. Defaults to None.
-    """
-    if vae_path is None:
-        vae_path = VAE_PATH[vae_type]
-    
-    if logger is not None:
-        logger.info(f"Loading 3D VAE model ({vae_type}) from: {vae_path}")
-    config = AutoencoderKLCausal3D.load_config(vae_path)
-    if sample_size:
-        vae = AutoencoderKLCausal3D.from_config(config, sample_size=sample_size)
-    else:
-        vae = AutoencoderKLCausal3D.from_config(config)
-    
-    vae_ckpt = Path(vae_path) / "pytorch_model.pt"
-    assert vae_ckpt.exists(), f"VAE checkpoint not found: {vae_ckpt}"
-    
-    ckpt = torch.load(vae_ckpt, map_location=vae.device)
-    if "state_dict" in ckpt:
-        ckpt = ckpt["state_dict"]
-    if any(k.startswith("vae.") for k in ckpt.keys()):
-        ckpt = {k.replace("vae.", ""): v for k, v in ckpt.items() if k.startswith("vae.")}
-    vae.load_state_dict(ckpt)
-
-    spatial_compression_ratio = vae.config.spatial_compression_ratio
-    time_compression_ratio = vae.config.time_compression_ratio
-    
-    if vae_precision is not None:
-        vae = vae.to(dtype=PRECISION_TO_TYPE[vae_precision])
-
-    vae.requires_grad_(False)
-
-    if logger is not None:
-        logger.info(f"VAE to dtype: {vae.dtype}")
-
-    if device is not None:
-        vae = vae.to(device)
-
-    vae.eval()
-
-    return vae, vae_path, spatial_compression_ratio, time_compression_ratio
diff --git a/videotuna/models/hunyuan/hyvideo_t2v/vae/autoencoder_kl_causal_3d.py b/videotuna/models/hunyuan/hyvideo_t2v/vae/autoencoder_kl_causal_3d.py
deleted file mode 100644
index c98e41d9..00000000
--- a/videotuna/models/hunyuan/hyvideo_t2v/vae/autoencoder_kl_causal_3d.py
+++ /dev/null
@@ -1,603 +0,0 @@
-# Copyright 2024 The HuggingFace Team. All rights reserved.
-#
-# Licensed under the Apache License, Version 2.0 (the "License");
-# you may not use this file except in compliance with the License.
-# You may obtain a copy of the License at
-#
-#     http://www.apache.org/licenses/LICENSE-2.0
-#
-# Unless required by applicable law or agreed to in writing, software
-# distributed under the License is distributed on an "AS IS" BASIS,
-# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
-# See the License for the specific language governing permissions and
-# limitations under the License.
-# ==============================================================================
-#
-# Modified from diffusers==0.29.2
-#
-# ==============================================================================
-from typing import Dict, Optional, Tuple, Union
-from dataclasses import dataclass
-
-import torch
-import torch.nn as nn
-
-from diffusers.configuration_utils import ConfigMixin, register_to_config
-
-try:
-    # This diffusers is modified and packed in the mirror.
-    from diffusers.loaders import FromOriginalVAEMixin
-except ImportError:
-    # Use this to be compatible with the original diffusers.
-    from diffusers.loaders.single_file_model import FromOriginalModelMixin as FromOriginalVAEMixin
-from diffusers.utils.accelerate_utils import apply_forward_hook
-from diffusers.models.attention_processor import (
-    ADDED_KV_ATTENTION_PROCESSORS,
-    CROSS_ATTENTION_PROCESSORS,
-    Attention,
-    AttentionProcessor,
-    AttnAddedKVProcessor,
-    AttnProcessor,
-)
-from diffusers.models.modeling_outputs import AutoencoderKLOutput
-from diffusers.models.modeling_utils import ModelMixin
-from .vae import DecoderCausal3D, BaseOutput, DecoderOutput, DiagonalGaussianDistribution, EncoderCausal3D
-
-
-@dataclass
-class DecoderOutput2(BaseOutput):
-    sample: torch.FloatTensor
-    posterior: Optional[DiagonalGaussianDistribution] = None
-
-
-class AutoencoderKLCausal3D(ModelMixin, ConfigMixin, FromOriginalVAEMixin):
-    r"""
-    A VAE model with KL loss for encoding images/videos into latents and decoding latent representations into images/videos.
-
-    This model inherits from [`ModelMixin`]. Check the superclass documentation for it's generic methods implemented
-    for all models (such as downloading or saving).
-    """
-
-    _supports_gradient_checkpointing = True
-
-    @register_to_config
-    def __init__(
-        self,
-        in_channels: int = 3,
-        out_channels: int = 3,
-        down_block_types: Tuple[str] = ("DownEncoderBlockCausal3D",),
-        up_block_types: Tuple[str] = ("UpDecoderBlockCausal3D",),
-        block_out_channels: Tuple[int] = (64,),
-        layers_per_block: int = 1,
-        act_fn: str = "silu",
-        latent_channels: int = 4,
-        norm_num_groups: int = 32,
-        sample_size: int = 32,
-        sample_tsize: int = 64,
-        scaling_factor: float = 0.18215,
-        force_upcast: float = True,
-        spatial_compression_ratio: int = 8,
-        time_compression_ratio: int = 4,
-        mid_block_add_attention: bool = True,
-    ):
-        super().__init__()
-
-        self.time_compression_ratio = time_compression_ratio
-
-        self.encoder = EncoderCausal3D(
-            in_channels=in_channels,
-            out_channels=latent_channels,
-            down_block_types=down_block_types,
-            block_out_channels=block_out_channels,
-            layers_per_block=layers_per_block,
-            act_fn=act_fn,
-            norm_num_groups=norm_num_groups,
-            double_z=True,
-            time_compression_ratio=time_compression_ratio,
-            spatial_compression_ratio=spatial_compression_ratio,
-            mid_block_add_attention=mid_block_add_attention,
-        )
-
-        self.decoder = DecoderCausal3D(
-            in_channels=latent_channels,
-            out_channels=out_channels,
-            up_block_types=up_block_types,
-            block_out_channels=block_out_channels,
-            layers_per_block=layers_per_block,
-            norm_num_groups=norm_num_groups,
-            act_fn=act_fn,
-            time_compression_ratio=time_compression_ratio,
-            spatial_compression_ratio=spatial_compression_ratio,
-            mid_block_add_attention=mid_block_add_attention,
-        )
-
-        self.quant_conv = nn.Conv3d(2 * latent_channels, 2 * latent_channels, kernel_size=1)
-        self.post_quant_conv = nn.Conv3d(latent_channels, latent_channels, kernel_size=1)
-
-        self.use_slicing = False
-        self.use_spatial_tiling = False
-        self.use_temporal_tiling = False
-
-        # only relevant if vae tiling is enabled
-        self.tile_sample_min_tsize = sample_tsize
-        self.tile_latent_min_tsize = sample_tsize // time_compression_ratio
-
-        self.tile_sample_min_size = self.config.sample_size
-        sample_size = (
-            self.config.sample_size[0]
-            if isinstance(self.config.sample_size, (list, tuple))
-            else self.config.sample_size
-        )
-        self.tile_latent_min_size = int(sample_size / (2 ** (len(self.config.block_out_channels) - 1)))
-        self.tile_overlap_factor = 0.25
-
-    def _set_gradient_checkpointing(self, module, value=False):
-        if isinstance(module, (EncoderCausal3D, DecoderCausal3D)):
-            module.gradient_checkpointing = value
-
-    def enable_temporal_tiling(self, use_tiling: bool = True):
-        self.use_temporal_tiling = use_tiling
-
-    def disable_temporal_tiling(self):
-        self.enable_temporal_tiling(False)
-
-    def enable_spatial_tiling(self, use_tiling: bool = True):
-        self.use_spatial_tiling = use_tiling
-
-    def disable_spatial_tiling(self):
-        self.enable_spatial_tiling(False)
-
-    def enable_tiling(self, use_tiling: bool = True):
-        r"""
-        Enable tiled VAE decoding. When this option is enabled, the VAE will split the input tensor into tiles to
-        compute decoding and encoding in several steps. This is useful for saving a large amount of memory and to allow
-        processing larger videos.
-        """
-        self.enable_spatial_tiling(use_tiling)
-        self.enable_temporal_tiling(use_tiling)
-
-    def disable_tiling(self):
-        r"""
-        Disable tiled VAE decoding. If `enable_tiling` was previously enabled, this method will go back to computing
-        decoding in one step.
-        """
-        self.disable_spatial_tiling()
-        self.disable_temporal_tiling()
-
-    def enable_slicing(self):
-        r"""
-        Enable sliced VAE decoding. When this option is enabled, the VAE will split the input tensor in slices to
-        compute decoding in several steps. This is useful to save some memory and allow larger batch sizes.
-        """
-        self.use_slicing = True
-
-    def disable_slicing(self):
-        r"""
-        Disable sliced VAE decoding. If `enable_slicing` was previously enabled, this method will go back to computing
-        decoding in one step.
-        """
-        self.use_slicing = False
-
-    @property
-    # Copied from diffusers.models.unet_2d_condition.UNet2DConditionModel.attn_processors
-    def attn_processors(self) -> Dict[str, AttentionProcessor]:
-        r"""
-        Returns:
-            `dict` of attention processors: A dictionary containing all attention processors used in the model with
-            indexed by its weight name.
-        """
-        # set recursively
-        processors = {}
-
-        def fn_recursive_add_processors(name: str, module: torch.nn.Module, processors: Dict[str, AttentionProcessor]):
-            if hasattr(module, "get_processor"):
-                processors[f"{name}.processor"] = module.get_processor(return_deprecated_lora=True)
-
-            for sub_name, child in module.named_children():
-                fn_recursive_add_processors(f"{name}.{sub_name}", child, processors)
-
-            return processors
-
-        for name, module in self.named_children():
-            fn_recursive_add_processors(name, module, processors)
-
-        return processors
-
-    # Copied from diffusers.models.unet_2d_condition.UNet2DConditionModel.set_attn_processor
-    def set_attn_processor(
-        self, processor: Union[AttentionProcessor, Dict[str, AttentionProcessor]], _remove_lora=False
-    ):
-        r"""
-        Sets the attention processor to use to compute attention.
-
-        Parameters:
-            processor (`dict` of `AttentionProcessor` or only `AttentionProcessor`):
-                The instantiated processor class or a dictionary of processor classes that will be set as the processor
-                for **all** `Attention` layers.
-
-                If `processor` is a dict, the key needs to define the path to the corresponding cross attention
-                processor. This is strongly recommended when setting trainable attention processors.
-
-        """
-        count = len(self.attn_processors.keys())
-
-        if isinstance(processor, dict) and len(processor) != count:
-            raise ValueError(
-                f"A dict of processors was passed, but the number of processors {len(processor)} does not match the"
-                f" number of attention layers: {count}. Please make sure to pass {count} processor classes."
-            )
-
-        def fn_recursive_attn_processor(name: str, module: torch.nn.Module, processor):
-            if hasattr(module, "set_processor"):
-                if not isinstance(processor, dict):
-                    module.set_processor(processor, _remove_lora=_remove_lora)
-                else:
-                    module.set_processor(processor.pop(f"{name}.processor"), _remove_lora=_remove_lora)
-
-            for sub_name, child in module.named_children():
-                fn_recursive_attn_processor(f"{name}.{sub_name}", child, processor)
-
-        for name, module in self.named_children():
-            fn_recursive_attn_processor(name, module, processor)
-
-    # Copied from diffusers.models.unet_2d_condition.UNet2DConditionModel.set_default_attn_processor
-    def set_default_attn_processor(self):
-        """
-        Disables custom attention processors and sets the default attention implementation.
-        """
-        if all(proc.__class__ in ADDED_KV_ATTENTION_PROCESSORS for proc in self.attn_processors.values()):
-            processor = AttnAddedKVProcessor()
-        elif all(proc.__class__ in CROSS_ATTENTION_PROCESSORS for proc in self.attn_processors.values()):
-            processor = AttnProcessor()
-        else:
-            raise ValueError(
-                f"Cannot call `set_default_attn_processor` when attention processors are of type {next(iter(self.attn_processors.values()))}"
-            )
-
-        self.set_attn_processor(processor, _remove_lora=True)
-
-    @apply_forward_hook
-    def encode(
-        self, x: torch.FloatTensor, return_dict: bool = True
-    ) -> Union[AutoencoderKLOutput, Tuple[DiagonalGaussianDistribution]]:
-        """
-        Encode a batch of images/videos into latents.
-
-        Args:
-            x (`torch.FloatTensor`): Input batch of images/videos.
-            return_dict (`bool`, *optional*, defaults to `True`):
-                Whether to return a [`~models.autoencoder_kl.AutoencoderKLOutput`] instead of a plain tuple.
-
-        Returns:
-                The latent representations of the encoded images/videos. If `return_dict` is True, a
-                [`~models.autoencoder_kl.AutoencoderKLOutput`] is returned, otherwise a plain `tuple` is returned.
-        """
-        assert len(x.shape) == 5, "The input tensor should have 5 dimensions."
-
-        if self.use_temporal_tiling and x.shape[2] > self.tile_sample_min_tsize:
-            return self.temporal_tiled_encode(x, return_dict=return_dict)
-
-        if self.use_spatial_tiling and (x.shape[-1] > self.tile_sample_min_size or x.shape[-2] > self.tile_sample_min_size):
-            return self.spatial_tiled_encode(x, return_dict=return_dict)
-
-        if self.use_slicing and x.shape[0] > 1:
-            encoded_slices = [self.encoder(x_slice) for x_slice in x.split(1)]
-            h = torch.cat(encoded_slices)
-        else:
-            h = self.encoder(x)
-
-        moments = self.quant_conv(h)
-        posterior = DiagonalGaussianDistribution(moments)
-
-        if not return_dict:
-            return (posterior,)
-
-        return AutoencoderKLOutput(latent_dist=posterior)
-
-    def _decode(self, z: torch.FloatTensor, return_dict: bool = True) -> Union[DecoderOutput, torch.FloatTensor]:
-        assert len(z.shape) == 5, "The input tensor should have 5 dimensions."
-
-        if self.use_temporal_tiling and z.shape[2] > self.tile_latent_min_tsize:
-            return self.temporal_tiled_decode(z, return_dict=return_dict)
-
-        if self.use_spatial_tiling and (z.shape[-1] > self.tile_latent_min_size or z.shape[-2] > self.tile_latent_min_size):
-            return self.spatial_tiled_decode(z, return_dict=return_dict)
-
-        z = self.post_quant_conv(z)
-        dec = self.decoder(z)
-
-        if not return_dict:
-            return (dec,)
-
-        return DecoderOutput(sample=dec)
-
-    @apply_forward_hook
-    def decode(
-        self, z: torch.FloatTensor, return_dict: bool = True, generator=None
-    ) -> Union[DecoderOutput, torch.FloatTensor]:
-        """
-        Decode a batch of images/videos.
-
-        Args:
-            z (`torch.FloatTensor`): Input batch of latent vectors.
-            return_dict (`bool`, *optional*, defaults to `True`):
-                Whether to return a [`~models.vae.DecoderOutput`] instead of a plain tuple.
-
-        Returns:
-            [`~models.vae.DecoderOutput`] or `tuple`:
-                If return_dict is True, a [`~models.vae.DecoderOutput`] is returned, otherwise a plain `tuple` is
-                returned.
-
-        """
-        if self.use_slicing and z.shape[0] > 1:
-            decoded_slices = [self._decode(z_slice).sample for z_slice in z.split(1)]
-            decoded = torch.cat(decoded_slices)
-        else:
-            decoded = self._decode(z).sample
-
-        if not return_dict:
-            return (decoded,)
-
-        return DecoderOutput(sample=decoded)
-
-    def blend_v(self, a: torch.Tensor, b: torch.Tensor, blend_extent: int) -> torch.Tensor:
-        blend_extent = min(a.shape[-2], b.shape[-2], blend_extent)
-        for y in range(blend_extent):
-            b[:, :, :, y, :] = a[:, :, :, -blend_extent + y, :] * (1 - y / blend_extent) + b[:, :, :, y, :] * (y / blend_extent)
-        return b
-
-    def blend_h(self, a: torch.Tensor, b: torch.Tensor, blend_extent: int) -> torch.Tensor:
-        blend_extent = min(a.shape[-1], b.shape[-1], blend_extent)
-        for x in range(blend_extent):
-            b[:, :, :, :, x] = a[:, :, :, :, -blend_extent + x] * (1 - x / blend_extent) + b[:, :, :, :, x] * (x / blend_extent)
-        return b
-
-    def blend_t(self, a: torch.Tensor, b: torch.Tensor, blend_extent: int) -> torch.Tensor:
-        blend_extent = min(a.shape[-3], b.shape[-3], blend_extent)
-        for x in range(blend_extent):
-            b[:, :, x, :, :] = a[:, :, -blend_extent + x, :, :] * (1 - x / blend_extent) + b[:, :, x, :, :] * (x / blend_extent)
-        return b
-
-    def spatial_tiled_encode(self, x: torch.FloatTensor, return_dict: bool = True, return_moments: bool = False) -> AutoencoderKLOutput:
-        r"""Encode a batch of images/videos using a tiled encoder.
-
-        When this option is enabled, the VAE will split the input tensor into tiles to compute encoding in several
-        steps. This is useful to keep memory use constant regardless of image/videos size. The end result of tiled encoding is
-        different from non-tiled encoding because each tile uses a different encoder. To avoid tiling artifacts, the
-        tiles overlap and are blended together to form a smooth output. You may still see tile-sized changes in the
-        output, but they should be much less noticeable.
-
-        Args:
-            x (`torch.FloatTensor`): Input batch of images/videos.
-            return_dict (`bool`, *optional*, defaults to `True`):
-                Whether or not to return a [`~models.autoencoder_kl.AutoencoderKLOutput`] instead of a plain tuple.
-
-        Returns:
-            [`~models.autoencoder_kl.AutoencoderKLOutput`] or `tuple`:
-                If return_dict is True, a [`~models.autoencoder_kl.AutoencoderKLOutput`] is returned, otherwise a plain
-                `tuple` is returned.
-        """
-        overlap_size = int(self.tile_sample_min_size * (1 - self.tile_overlap_factor))
-        blend_extent = int(self.tile_latent_min_size * self.tile_overlap_factor)
-        row_limit = self.tile_latent_min_size - blend_extent
-
-        # Split video into tiles and encode them separately.
-        rows = []
-        for i in range(0, x.shape[-2], overlap_size):
-            row = []
-            for j in range(0, x.shape[-1], overlap_size):
-                tile = x[:, :, :, i: i + self.tile_sample_min_size, j: j + self.tile_sample_min_size]
-                tile = self.encoder(tile)
-                tile = self.quant_conv(tile)
-                row.append(tile)
-            rows.append(row)
-        result_rows = []
-        for i, row in enumerate(rows):
-            result_row = []
-            for j, tile in enumerate(row):
-                # blend the above tile and the left tile
-                # to the current tile and add the current tile to the result row
-                if i > 0:
-                    tile = self.blend_v(rows[i - 1][j], tile, blend_extent)
-                if j > 0:
-                    tile = self.blend_h(row[j - 1], tile, blend_extent)
-                result_row.append(tile[:, :, :, :row_limit, :row_limit])
-            result_rows.append(torch.cat(result_row, dim=-1))
-
-        moments = torch.cat(result_rows, dim=-2)
-        if return_moments:
-            return moments
-
-        posterior = DiagonalGaussianDistribution(moments)
-        if not return_dict:
-            return (posterior,)
-
-        return AutoencoderKLOutput(latent_dist=posterior)
-
-    def spatial_tiled_decode(self, z: torch.FloatTensor, return_dict: bool = True) -> Union[DecoderOutput, torch.FloatTensor]:
-        r"""
-        Decode a batch of images/videos using a tiled decoder.
-
-        Args:
-            z (`torch.FloatTensor`): Input batch of latent vectors.
-            return_dict (`bool`, *optional*, defaults to `True`):
-                Whether or not to return a [`~models.vae.DecoderOutput`] instead of a plain tuple.
-
-        Returns:
-            [`~models.vae.DecoderOutput`] or `tuple`:
-                If return_dict is True, a [`~models.vae.DecoderOutput`] is returned, otherwise a plain `tuple` is
-                returned.
-        """
-        overlap_size = int(self.tile_latent_min_size * (1 - self.tile_overlap_factor))
-        blend_extent = int(self.tile_sample_min_size * self.tile_overlap_factor)
-        row_limit = self.tile_sample_min_size - blend_extent
-
-        # Split z into overlapping tiles and decode them separately.
-        # The tiles have an overlap to avoid seams between tiles.
-        rows = []
-        for i in range(0, z.shape[-2], overlap_size):
-            row = []
-            for j in range(0, z.shape[-1], overlap_size):
-                tile = z[:, :, :, i: i + self.tile_latent_min_size, j: j + self.tile_latent_min_size]
-                tile = self.post_quant_conv(tile)
-                decoded = self.decoder(tile)
-                row.append(decoded)
-            rows.append(row)
-        result_rows = []
-        for i, row in enumerate(rows):
-            result_row = []
-            for j, tile in enumerate(row):
-                # blend the above tile and the left tile
-                # to the current tile and add the current tile to the result row
-                if i > 0:
-                    tile = self.blend_v(rows[i - 1][j], tile, blend_extent)
-                if j > 0:
-                    tile = self.blend_h(row[j - 1], tile, blend_extent)
-                result_row.append(tile[:, :, :, :row_limit, :row_limit])
-            result_rows.append(torch.cat(result_row, dim=-1))
-
-        dec = torch.cat(result_rows, dim=-2)
-        if not return_dict:
-            return (dec,)
-
-        return DecoderOutput(sample=dec)
-
-    def temporal_tiled_encode(self, x: torch.FloatTensor, return_dict: bool = True) -> AutoencoderKLOutput:
-
-        B, C, T, H, W = x.shape
-        overlap_size = int(self.tile_sample_min_tsize * (1 - self.tile_overlap_factor))
-        blend_extent = int(self.tile_latent_min_tsize * self.tile_overlap_factor)
-        t_limit = self.tile_latent_min_tsize - blend_extent
-
-        # Split the video into tiles and encode them separately.
-        row = []
-        for i in range(0, T, overlap_size):
-            tile = x[:, :, i: i + self.tile_sample_min_tsize + 1, :, :]
-            if self.use_spatial_tiling and (tile.shape[-1] > self.tile_sample_min_size or tile.shape[-2] > self.tile_sample_min_size):
-                tile = self.spatial_tiled_encode(tile, return_moments=True)
-            else:
-                tile = self.encoder(tile)
-                tile = self.quant_conv(tile)
-            if i > 0:
-                tile = tile[:, :, 1:, :, :]
-            row.append(tile)
-        result_row = []
-        for i, tile in enumerate(row):
-            if i > 0:
-                tile = self.blend_t(row[i - 1], tile, blend_extent)
-                result_row.append(tile[:, :, :t_limit, :, :])
-            else:
-                result_row.append(tile[:, :, :t_limit + 1, :, :])
-
-        moments = torch.cat(result_row, dim=2)
-        posterior = DiagonalGaussianDistribution(moments)
-
-        if not return_dict:
-            return (posterior,)
-
-        return AutoencoderKLOutput(latent_dist=posterior)
-
-    def temporal_tiled_decode(self, z: torch.FloatTensor, return_dict: bool = True) -> Union[DecoderOutput, torch.FloatTensor]:
-        # Split z into overlapping tiles and decode them separately.
-
-        B, C, T, H, W = z.shape
-        overlap_size = int(self.tile_latent_min_tsize * (1 - self.tile_overlap_factor))
-        blend_extent = int(self.tile_sample_min_tsize * self.tile_overlap_factor)
-        t_limit = self.tile_sample_min_tsize - blend_extent
-
-        row = []
-        for i in range(0, T, overlap_size):
-            tile = z[:, :, i: i + self.tile_latent_min_tsize + 1, :, :]
-            if self.use_spatial_tiling and (tile.shape[-1] > self.tile_latent_min_size or tile.shape[-2] > self.tile_latent_min_size):
-                decoded = self.spatial_tiled_decode(tile, return_dict=True).sample
-            else:
-                tile = self.post_quant_conv(tile)
-                decoded = self.decoder(tile)
-            if i > 0:
-                decoded = decoded[:, :, 1:, :, :]
-            row.append(decoded)
-        result_row = []
-        for i, tile in enumerate(row):
-            if i > 0:
-                tile = self.blend_t(row[i - 1], tile, blend_extent)
-                result_row.append(tile[:, :, :t_limit, :, :])
-            else:
-                result_row.append(tile[:, :, :t_limit + 1, :, :])
-
-        dec = torch.cat(result_row, dim=2)
-        if not return_dict:
-            return (dec,)
-
-        return DecoderOutput(sample=dec)
-
-    def forward(
-        self,
-        sample: torch.FloatTensor,
-        sample_posterior: bool = False,
-        return_dict: bool = True,
-        return_posterior: bool = False,
-        generator: Optional[torch.Generator] = None,
-    ) -> Union[DecoderOutput2, torch.FloatTensor]:
-        r"""
-        Args:
-            sample (`torch.FloatTensor`): Input sample.
-            sample_posterior (`bool`, *optional*, defaults to `False`):
-                Whether to sample from the posterior.
-            return_dict (`bool`, *optional*, defaults to `True`):
-                Whether or not to return a [`DecoderOutput`] instead of a plain tuple.
-        """
-        x = sample
-        posterior = self.encode(x).latent_dist
-        if sample_posterior:
-            z = posterior.sample(generator=generator)
-        else:
-            z = posterior.mode()
-        dec = self.decode(z).sample
-
-        if not return_dict:
-            if return_posterior:
-                return (dec, posterior)
-            else:
-                return (dec,)
-        if return_posterior:
-            return DecoderOutput2(sample=dec, posterior=posterior)
-        else:
-            return DecoderOutput2(sample=dec)
-
-    # Copied from diffusers.models.unet_2d_condition.UNet2DConditionModel.fuse_qkv_projections
-    def fuse_qkv_projections(self):
-        """
-        Enables fused QKV projections. For self-attention modules, all projection matrices (i.e., query,
-        key, value) are fused. For cross-attention modules, key and value projection matrices are fused.
-
-        <Tip warning={true}>
-
-        This API is 🧪 experimental.
-
-        </Tip>
-        """
-        self.original_attn_processors = None
-
-        for _, attn_processor in self.attn_processors.items():
-            if "Added" in str(attn_processor.__class__.__name__):
-                raise ValueError("`fuse_qkv_projections()` is not supported for models having added KV projections.")
-
-        self.original_attn_processors = self.attn_processors
-
-        for module in self.modules():
-            if isinstance(module, Attention):
-                module.fuse_projections(fuse=True)
-
-    # Copied from diffusers.models.unet_2d_condition.UNet2DConditionModel.unfuse_qkv_projections
-    def unfuse_qkv_projections(self):
-        """Disables the fused QKV projection if enabled.
-
-        <Tip warning={true}>
-
-        This API is 🧪 experimental.
-
-        </Tip>
-
-        """
-        if self.original_attn_processors is not None:
-            self.set_attn_processor(self.original_attn_processors)
diff --git a/videotuna/models/hunyuan/hyvideo_t2v/vae/unet_causal_3d_blocks.py b/videotuna/models/hunyuan/hyvideo_t2v/vae/unet_causal_3d_blocks.py
deleted file mode 100644
index f78bc755..00000000
--- a/videotuna/models/hunyuan/hyvideo_t2v/vae/unet_causal_3d_blocks.py
+++ /dev/null
@@ -1,764 +0,0 @@
-# Copyright 2024 The HuggingFace Team. All rights reserved.
-#
-# Licensed under the Apache License, Version 2.0 (the "License");
-# you may not use this file except in compliance with the License.
-# You may obtain a copy of the License at
-#
-#     http://www.apache.org/licenses/LICENSE-2.0
-#
-# Unless required by applicable law or agreed to in writing, software
-# distributed under the License is distributed on an "AS IS" BASIS,
-# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
-# See the License for the specific language governing permissions and
-# limitations under the License.
-# ==============================================================================
-#
-# Modified from diffusers==0.29.2
-#
-# ==============================================================================
-
-from typing import Optional, Tuple, Union
-
-import torch
-import torch.nn.functional as F
-from torch import nn
-from einops import rearrange
-
-from diffusers.utils import logging
-from diffusers.models.activations import get_activation
-from diffusers.models.attention_processor import SpatialNorm
-from diffusers.models.attention_processor import Attention
-from diffusers.models.normalization import AdaGroupNorm
-from diffusers.models.normalization import RMSNorm
-
-logger = logging.get_logger(__name__)  # pylint: disable=invalid-name
-
-
-def prepare_causal_attention_mask(n_frame: int, n_hw: int, dtype, device, batch_size: int = None):
-    seq_len = n_frame * n_hw
-    mask = torch.full((seq_len, seq_len), float("-inf"), dtype=dtype, device=device)
-    for i in range(seq_len):
-        i_frame = i // n_hw
-        mask[i, : (i_frame + 1) * n_hw] = 0
-    if batch_size is not None:
-        mask = mask.unsqueeze(0).expand(batch_size, -1, -1)
-    return mask
-
-
-class CausalConv3d(nn.Module):
-    """
-    Implements a causal 3D convolution layer where each position only depends on previous timesteps and current spatial locations.
-    This maintains temporal causality in video generation tasks.
-    """
-
-    def __init__(
-        self,
-        chan_in,
-        chan_out,
-        kernel_size: Union[int, Tuple[int, int, int]],
-        stride: Union[int, Tuple[int, int, int]] = 1,
-        dilation: Union[int, Tuple[int, int, int]] = 1,
-        pad_mode='replicate',
-        **kwargs
-    ):
-        super().__init__()
-
-        self.pad_mode = pad_mode
-        padding = (kernel_size // 2, kernel_size // 2, kernel_size // 2, kernel_size // 2, kernel_size - 1, 0)  # W, H, T
-        self.time_causal_padding = padding
-
-        self.conv = nn.Conv3d(chan_in, chan_out, kernel_size, stride=stride, dilation=dilation, **kwargs)
-
-    def forward(self, x):
-        x = F.pad(x, self.time_causal_padding, mode=self.pad_mode)
-        return self.conv(x)
-
-
-class UpsampleCausal3D(nn.Module):
-    """
-    A 3D upsampling layer with an optional convolution.
-    """
-
-    def __init__(
-        self,
-        channels: int,
-        use_conv: bool = False,
-        use_conv_transpose: bool = False,
-        out_channels: Optional[int] = None,
-        name: str = "conv",
-        kernel_size: Optional[int] = None,
-        padding=1,
-        norm_type=None,
-        eps=None,
-        elementwise_affine=None,
-        bias=True,
-        interpolate=True,
-        upsample_factor=(2, 2, 2),
-    ):
-        super().__init__()
-        self.channels = channels
-        self.out_channels = out_channels or channels
-        self.use_conv = use_conv
-        self.use_conv_transpose = use_conv_transpose
-        self.name = name
-        self.interpolate = interpolate
-        self.upsample_factor = upsample_factor
-
-        if norm_type == "ln_norm":
-            self.norm = nn.LayerNorm(channels, eps, elementwise_affine)
-        elif norm_type == "rms_norm":
-            self.norm = RMSNorm(channels, eps, elementwise_affine)
-        elif norm_type is None:
-            self.norm = None
-        else:
-            raise ValueError(f"unknown norm_type: {norm_type}")
-
-        conv = None
-        if use_conv_transpose:
-            raise NotImplementedError
-        elif use_conv:
-            if kernel_size is None:
-                kernel_size = 3
-            conv = CausalConv3d(self.channels, self.out_channels, kernel_size=kernel_size, bias=bias)
-
-        if name == "conv":
-            self.conv = conv
-        else:
-            self.Conv2d_0 = conv
-
-    def forward(
-        self,
-        hidden_states: torch.FloatTensor,
-        output_size: Optional[int] = None,
-        scale: float = 1.0,
-    ) -> torch.FloatTensor:
-        assert hidden_states.shape[1] == self.channels
-
-        if self.norm is not None:
-            raise NotImplementedError
-
-        if self.use_conv_transpose:
-            return self.conv(hidden_states)
-
-        # Cast to float32 to as 'upsample_nearest2d_out_frame' op does not support bfloat16
-        dtype = hidden_states.dtype
-        if dtype == torch.bfloat16:
-            hidden_states = hidden_states.to(torch.float32)
-
-        # upsample_nearest_nhwc fails with large batch sizes. see https://github.com/huggingface/diffusers/issues/984
-        if hidden_states.shape[0] >= 64:
-            hidden_states = hidden_states.contiguous()
-
-        # if `output_size` is passed we force the interpolation output
-        # size and do not make use of `scale_factor=2`
-        if self.interpolate:
-            B, C, T, H, W = hidden_states.shape
-            first_h, other_h = hidden_states.split((1, T - 1), dim=2)
-            if output_size is None:
-                if T > 1:
-                    other_h = F.interpolate(other_h, scale_factor=self.upsample_factor, mode="nearest")
-
-                first_h = first_h.squeeze(2)
-                first_h = F.interpolate(first_h, scale_factor=self.upsample_factor[1:], mode="nearest")
-                first_h = first_h.unsqueeze(2)
-            else:
-                raise NotImplementedError
-
-            if T > 1:
-                hidden_states = torch.cat((first_h, other_h), dim=2)
-            else:
-                hidden_states = first_h
-
-        # If the input is bfloat16, we cast back to bfloat16
-        if dtype == torch.bfloat16:
-            hidden_states = hidden_states.to(dtype)
-
-        if self.use_conv:
-            if self.name == "conv":
-                hidden_states = self.conv(hidden_states)
-            else:
-                hidden_states = self.Conv2d_0(hidden_states)
-
-        return hidden_states
-
-
-class DownsampleCausal3D(nn.Module):
-    """
-    A 3D downsampling layer with an optional convolution.
-    """
-
-    def __init__(
-        self,
-        channels: int,
-        use_conv: bool = False,
-        out_channels: Optional[int] = None,
-        padding: int = 1,
-        name: str = "conv",
-        kernel_size=3,
-        norm_type=None,
-        eps=None,
-        elementwise_affine=None,
-        bias=True,
-        stride=2,
-    ):
-        super().__init__()
-        self.channels = channels
-        self.out_channels = out_channels or channels
-        self.use_conv = use_conv
-        self.padding = padding
-        stride = stride
-        self.name = name
-
-        if norm_type == "ln_norm":
-            self.norm = nn.LayerNorm(channels, eps, elementwise_affine)
-        elif norm_type == "rms_norm":
-            self.norm = RMSNorm(channels, eps, elementwise_affine)
-        elif norm_type is None:
-            self.norm = None
-        else:
-            raise ValueError(f"unknown norm_type: {norm_type}")
-
-        if use_conv:
-            conv = CausalConv3d(
-                self.channels, self.out_channels, kernel_size=kernel_size, stride=stride, bias=bias
-            )
-        else:
-            raise NotImplementedError
-
-        if name == "conv":
-            self.Conv2d_0 = conv
-            self.conv = conv
-        elif name == "Conv2d_0":
-            self.conv = conv
-        else:
-            self.conv = conv
-
-    def forward(self, hidden_states: torch.FloatTensor, scale: float = 1.0) -> torch.FloatTensor:
-        assert hidden_states.shape[1] == self.channels
-
-        if self.norm is not None:
-            hidden_states = self.norm(hidden_states.permute(0, 2, 3, 1)).permute(0, 3, 1, 2)
-
-        assert hidden_states.shape[1] == self.channels
-
-        hidden_states = self.conv(hidden_states)
-
-        return hidden_states
-
-
-class ResnetBlockCausal3D(nn.Module):
-    r"""
-    A Resnet block.
-    """
-
-    def __init__(
-        self,
-        *,
-        in_channels: int,
-        out_channels: Optional[int] = None,
-        conv_shortcut: bool = False,
-        dropout: float = 0.0,
-        temb_channels: int = 512,
-        groups: int = 32,
-        groups_out: Optional[int] = None,
-        pre_norm: bool = True,
-        eps: float = 1e-6,
-        non_linearity: str = "swish",
-        skip_time_act: bool = False,
-        # default, scale_shift, ada_group, spatial
-        time_embedding_norm: str = "default",
-        kernel: Optional[torch.FloatTensor] = None,
-        output_scale_factor: float = 1.0,
-        use_in_shortcut: Optional[bool] = None,
-        up: bool = False,
-        down: bool = False,
-        conv_shortcut_bias: bool = True,
-        conv_3d_out_channels: Optional[int] = None,
-    ):
-        super().__init__()
-        self.pre_norm = pre_norm
-        self.pre_norm = True
-        self.in_channels = in_channels
-        out_channels = in_channels if out_channels is None else out_channels
-        self.out_channels = out_channels
-        self.use_conv_shortcut = conv_shortcut
-        self.up = up
-        self.down = down
-        self.output_scale_factor = output_scale_factor
-        self.time_embedding_norm = time_embedding_norm
-        self.skip_time_act = skip_time_act
-
-        linear_cls = nn.Linear
-
-        if groups_out is None:
-            groups_out = groups
-
-        if self.time_embedding_norm == "ada_group":
-            self.norm1 = AdaGroupNorm(temb_channels, in_channels, groups, eps=eps)
-        elif self.time_embedding_norm == "spatial":
-            self.norm1 = SpatialNorm(in_channels, temb_channels)
-        else:
-            self.norm1 = torch.nn.GroupNorm(num_groups=groups, num_channels=in_channels, eps=eps, affine=True)
-
-        self.conv1 = CausalConv3d(in_channels, out_channels, kernel_size=3, stride=1)
-
-        if temb_channels is not None:
-            if self.time_embedding_norm == "default":
-                self.time_emb_proj = linear_cls(temb_channels, out_channels)
-            elif self.time_embedding_norm == "scale_shift":
-                self.time_emb_proj = linear_cls(temb_channels, 2 * out_channels)
-            elif self.time_embedding_norm == "ada_group" or self.time_embedding_norm == "spatial":
-                self.time_emb_proj = None
-            else:
-                raise ValueError(f"Unknown time_embedding_norm : {self.time_embedding_norm} ")
-        else:
-            self.time_emb_proj = None
-
-        if self.time_embedding_norm == "ada_group":
-            self.norm2 = AdaGroupNorm(temb_channels, out_channels, groups_out, eps=eps)
-        elif self.time_embedding_norm == "spatial":
-            self.norm2 = SpatialNorm(out_channels, temb_channels)
-        else:
-            self.norm2 = torch.nn.GroupNorm(num_groups=groups_out, num_channels=out_channels, eps=eps, affine=True)
-
-        self.dropout = torch.nn.Dropout(dropout)
-        conv_3d_out_channels = conv_3d_out_channels or out_channels
-        self.conv2 = CausalConv3d(out_channels, conv_3d_out_channels, kernel_size=3, stride=1)
-
-        self.nonlinearity = get_activation(non_linearity)
-
-        self.upsample = self.downsample = None
-        if self.up:
-            self.upsample = UpsampleCausal3D(in_channels, use_conv=False)
-        elif self.down:
-            self.downsample = DownsampleCausal3D(in_channels, use_conv=False, name="op")
-
-        self.use_in_shortcut = self.in_channels != conv_3d_out_channels if use_in_shortcut is None else use_in_shortcut
-
-        self.conv_shortcut = None
-        if self.use_in_shortcut:
-            self.conv_shortcut = CausalConv3d(
-                in_channels,
-                conv_3d_out_channels,
-                kernel_size=1,
-                stride=1,
-                bias=conv_shortcut_bias,
-            )
-
-    def forward(
-        self,
-        input_tensor: torch.FloatTensor,
-        temb: torch.FloatTensor,
-        scale: float = 1.0,
-    ) -> torch.FloatTensor:
-        hidden_states = input_tensor
-
-        if self.time_embedding_norm == "ada_group" or self.time_embedding_norm == "spatial":
-            hidden_states = self.norm1(hidden_states, temb)
-        else:
-            hidden_states = self.norm1(hidden_states)
-
-        hidden_states = self.nonlinearity(hidden_states)
-
-        if self.upsample is not None:
-            # upsample_nearest_nhwc fails with large batch sizes. see https://github.com/huggingface/diffusers/issues/984
-            if hidden_states.shape[0] >= 64:
-                input_tensor = input_tensor.contiguous()
-                hidden_states = hidden_states.contiguous()
-            input_tensor = (
-                self.upsample(input_tensor, scale=scale)
-            )
-            hidden_states = (
-                self.upsample(hidden_states, scale=scale)
-            )
-        elif self.downsample is not None:
-            input_tensor = (
-                self.downsample(input_tensor, scale=scale)
-            )
-            hidden_states = (
-                self.downsample(hidden_states, scale=scale)
-            )
-
-        hidden_states = self.conv1(hidden_states)
-
-        if self.time_emb_proj is not None:
-            if not self.skip_time_act:
-                temb = self.nonlinearity(temb)
-            temb = (
-                self.time_emb_proj(temb, scale)[:, :, None, None]
-            )
-
-        if temb is not None and self.time_embedding_norm == "default":
-            hidden_states = hidden_states + temb
-
-        if self.time_embedding_norm == "ada_group" or self.time_embedding_norm == "spatial":
-            hidden_states = self.norm2(hidden_states, temb)
-        else:
-            hidden_states = self.norm2(hidden_states)
-
-        if temb is not None and self.time_embedding_norm == "scale_shift":
-            scale, shift = torch.chunk(temb, 2, dim=1)
-            hidden_states = hidden_states * (1 + scale) + shift
-
-        hidden_states = self.nonlinearity(hidden_states)
-
-        hidden_states = self.dropout(hidden_states)
-        hidden_states = self.conv2(hidden_states)
-
-        if self.conv_shortcut is not None:
-            input_tensor = (
-                self.conv_shortcut(input_tensor)
-            )
-
-        output_tensor = (input_tensor + hidden_states) / self.output_scale_factor
-
-        return output_tensor
-
-
-def get_down_block3d(
-    down_block_type: str,
-    num_layers: int,
-    in_channels: int,
-    out_channels: int,
-    temb_channels: int,
-    add_downsample: bool,
-    downsample_stride: int,
-    resnet_eps: float,
-    resnet_act_fn: str,
-    transformer_layers_per_block: int = 1,
-    num_attention_heads: Optional[int] = None,
-    resnet_groups: Optional[int] = None,
-    cross_attention_dim: Optional[int] = None,
-    downsample_padding: Optional[int] = None,
-    dual_cross_attention: bool = False,
-    use_linear_projection: bool = False,
-    only_cross_attention: bool = False,
-    upcast_attention: bool = False,
-    resnet_time_scale_shift: str = "default",
-    attention_type: str = "default",
-    resnet_skip_time_act: bool = False,
-    resnet_out_scale_factor: float = 1.0,
-    cross_attention_norm: Optional[str] = None,
-    attention_head_dim: Optional[int] = None,
-    downsample_type: Optional[str] = None,
-    dropout: float = 0.0,
-):
-    # If attn head dim is not defined, we default it to the number of heads
-    if attention_head_dim is None:
-        logger.warning(
-            f"It is recommended to provide `attention_head_dim` when calling `get_down_block`. Defaulting `attention_head_dim` to {num_attention_heads}."
-        )
-        attention_head_dim = num_attention_heads
-
-    down_block_type = down_block_type[7:] if down_block_type.startswith("UNetRes") else down_block_type
-    if down_block_type == "DownEncoderBlockCausal3D":
-        return DownEncoderBlockCausal3D(
-            num_layers=num_layers,
-            in_channels=in_channels,
-            out_channels=out_channels,
-            dropout=dropout,
-            add_downsample=add_downsample,
-            downsample_stride=downsample_stride,
-            resnet_eps=resnet_eps,
-            resnet_act_fn=resnet_act_fn,
-            resnet_groups=resnet_groups,
-            downsample_padding=downsample_padding,
-            resnet_time_scale_shift=resnet_time_scale_shift,
-        )
-    raise ValueError(f"{down_block_type} does not exist.")
-
-
-def get_up_block3d(
-    up_block_type: str,
-    num_layers: int,
-    in_channels: int,
-    out_channels: int,
-    prev_output_channel: int,
-    temb_channels: int,
-    add_upsample: bool,
-    upsample_scale_factor: Tuple,
-    resnet_eps: float,
-    resnet_act_fn: str,
-    resolution_idx: Optional[int] = None,
-    transformer_layers_per_block: int = 1,
-    num_attention_heads: Optional[int] = None,
-    resnet_groups: Optional[int] = None,
-    cross_attention_dim: Optional[int] = None,
-    dual_cross_attention: bool = False,
-    use_linear_projection: bool = False,
-    only_cross_attention: bool = False,
-    upcast_attention: bool = False,
-    resnet_time_scale_shift: str = "default",
-    attention_type: str = "default",
-    resnet_skip_time_act: bool = False,
-    resnet_out_scale_factor: float = 1.0,
-    cross_attention_norm: Optional[str] = None,
-    attention_head_dim: Optional[int] = None,
-    upsample_type: Optional[str] = None,
-    dropout: float = 0.0,
-) -> nn.Module:
-    # If attn head dim is not defined, we default it to the number of heads
-    if attention_head_dim is None:
-        logger.warning(
-            f"It is recommended to provide `attention_head_dim` when calling `get_up_block`. Defaulting `attention_head_dim` to {num_attention_heads}."
-        )
-        attention_head_dim = num_attention_heads
-
-    up_block_type = up_block_type[7:] if up_block_type.startswith("UNetRes") else up_block_type
-    if up_block_type == "UpDecoderBlockCausal3D":
-        return UpDecoderBlockCausal3D(
-            num_layers=num_layers,
-            in_channels=in_channels,
-            out_channels=out_channels,
-            resolution_idx=resolution_idx,
-            dropout=dropout,
-            add_upsample=add_upsample,
-            upsample_scale_factor=upsample_scale_factor,
-            resnet_eps=resnet_eps,
-            resnet_act_fn=resnet_act_fn,
-            resnet_groups=resnet_groups,
-            resnet_time_scale_shift=resnet_time_scale_shift,
-            temb_channels=temb_channels,
-        )
-    raise ValueError(f"{up_block_type} does not exist.")
-
-
-class UNetMidBlockCausal3D(nn.Module):
-    """
-    A 3D UNet mid-block [`UNetMidBlockCausal3D`] with multiple residual blocks and optional attention blocks.
-    """
-
-    def __init__(
-        self,
-        in_channels: int,
-        temb_channels: int,
-        dropout: float = 0.0,
-        num_layers: int = 1,
-        resnet_eps: float = 1e-6,
-        resnet_time_scale_shift: str = "default",  # default, spatial
-        resnet_act_fn: str = "swish",
-        resnet_groups: int = 32,
-        attn_groups: Optional[int] = None,
-        resnet_pre_norm: bool = True,
-        add_attention: bool = True,
-        attention_head_dim: int = 1,
-        output_scale_factor: float = 1.0,
-    ):
-        super().__init__()
-        resnet_groups = resnet_groups if resnet_groups is not None else min(in_channels // 4, 32)
-        self.add_attention = add_attention
-
-        if attn_groups is None:
-            attn_groups = resnet_groups if resnet_time_scale_shift == "default" else None
-
-        # there is always at least one resnet
-        resnets = [
-            ResnetBlockCausal3D(
-                in_channels=in_channels,
-                out_channels=in_channels,
-                temb_channels=temb_channels,
-                eps=resnet_eps,
-                groups=resnet_groups,
-                dropout=dropout,
-                time_embedding_norm=resnet_time_scale_shift,
-                non_linearity=resnet_act_fn,
-                output_scale_factor=output_scale_factor,
-                pre_norm=resnet_pre_norm,
-            )
-        ]
-        attentions = []
-
-        if attention_head_dim is None:
-            logger.warning(
-                f"It is not recommend to pass `attention_head_dim=None`. Defaulting `attention_head_dim` to `in_channels`: {in_channels}."
-            )
-            attention_head_dim = in_channels
-
-        for _ in range(num_layers):
-            if self.add_attention:
-                attentions.append(
-                    Attention(
-                        in_channels,
-                        heads=in_channels // attention_head_dim,
-                        dim_head=attention_head_dim,
-                        rescale_output_factor=output_scale_factor,
-                        eps=resnet_eps,
-                        norm_num_groups=attn_groups,
-                        spatial_norm_dim=temb_channels if resnet_time_scale_shift == "spatial" else None,
-                        residual_connection=True,
-                        bias=True,
-                        upcast_softmax=True,
-                        _from_deprecated_attn_block=True,
-                    )
-                )
-            else:
-                attentions.append(None)
-
-            resnets.append(
-                ResnetBlockCausal3D(
-                    in_channels=in_channels,
-                    out_channels=in_channels,
-                    temb_channels=temb_channels,
-                    eps=resnet_eps,
-                    groups=resnet_groups,
-                    dropout=dropout,
-                    time_embedding_norm=resnet_time_scale_shift,
-                    non_linearity=resnet_act_fn,
-                    output_scale_factor=output_scale_factor,
-                    pre_norm=resnet_pre_norm,
-                )
-            )
-
-        self.attentions = nn.ModuleList(attentions)
-        self.resnets = nn.ModuleList(resnets)
-
-    def forward(self, hidden_states: torch.FloatTensor, temb: Optional[torch.FloatTensor] = None) -> torch.FloatTensor:
-        hidden_states = self.resnets[0](hidden_states, temb)
-        for attn, resnet in zip(self.attentions, self.resnets[1:]):
-            if attn is not None:
-                B, C, T, H, W = hidden_states.shape
-                hidden_states = rearrange(hidden_states, "b c f h w -> b (f h w) c")
-                attention_mask = prepare_causal_attention_mask(
-                    T, H * W, hidden_states.dtype, hidden_states.device, batch_size=B
-                )
-                hidden_states = attn(hidden_states, temb=temb, attention_mask=attention_mask)
-                hidden_states = rearrange(hidden_states, "b (f h w) c -> b c f h w", f=T, h=H, w=W)
-            hidden_states = resnet(hidden_states, temb)
-
-        return hidden_states
-
-
-class DownEncoderBlockCausal3D(nn.Module):
-    def __init__(
-        self,
-        in_channels: int,
-        out_channels: int,
-        dropout: float = 0.0,
-        num_layers: int = 1,
-        resnet_eps: float = 1e-6,
-        resnet_time_scale_shift: str = "default",
-        resnet_act_fn: str = "swish",
-        resnet_groups: int = 32,
-        resnet_pre_norm: bool = True,
-        output_scale_factor: float = 1.0,
-        add_downsample: bool = True,
-        downsample_stride: int = 2,
-        downsample_padding: int = 1,
-    ):
-        super().__init__()
-        resnets = []
-
-        for i in range(num_layers):
-            in_channels = in_channels if i == 0 else out_channels
-            resnets.append(
-                ResnetBlockCausal3D(
-                    in_channels=in_channels,
-                    out_channels=out_channels,
-                    temb_channels=None,
-                    eps=resnet_eps,
-                    groups=resnet_groups,
-                    dropout=dropout,
-                    time_embedding_norm=resnet_time_scale_shift,
-                    non_linearity=resnet_act_fn,
-                    output_scale_factor=output_scale_factor,
-                    pre_norm=resnet_pre_norm,
-                )
-            )
-
-        self.resnets = nn.ModuleList(resnets)
-
-        if add_downsample:
-            self.downsamplers = nn.ModuleList(
-                [
-                    DownsampleCausal3D(
-                        out_channels,
-                        use_conv=True,
-                        out_channels=out_channels,
-                        padding=downsample_padding,
-                        name="op",
-                        stride=downsample_stride,
-                    )
-                ]
-            )
-        else:
-            self.downsamplers = None
-
-    def forward(self, hidden_states: torch.FloatTensor, scale: float = 1.0) -> torch.FloatTensor:
-        for resnet in self.resnets:
-            hidden_states = resnet(hidden_states, temb=None, scale=scale)
-
-        if self.downsamplers is not None:
-            for downsampler in self.downsamplers:
-                hidden_states = downsampler(hidden_states, scale)
-
-        return hidden_states
-
-
-class UpDecoderBlockCausal3D(nn.Module):
-    def __init__(
-        self,
-        in_channels: int,
-        out_channels: int,
-        resolution_idx: Optional[int] = None,
-        dropout: float = 0.0,
-        num_layers: int = 1,
-        resnet_eps: float = 1e-6,
-        resnet_time_scale_shift: str = "default",  # default, spatial
-        resnet_act_fn: str = "swish",
-        resnet_groups: int = 32,
-        resnet_pre_norm: bool = True,
-        output_scale_factor: float = 1.0,
-        add_upsample: bool = True,
-        upsample_scale_factor=(2, 2, 2),
-        temb_channels: Optional[int] = None,
-    ):
-        super().__init__()
-        resnets = []
-
-        for i in range(num_layers):
-            input_channels = in_channels if i == 0 else out_channels
-
-            resnets.append(
-                ResnetBlockCausal3D(
-                    in_channels=input_channels,
-                    out_channels=out_channels,
-                    temb_channels=temb_channels,
-                    eps=resnet_eps,
-                    groups=resnet_groups,
-                    dropout=dropout,
-                    time_embedding_norm=resnet_time_scale_shift,
-                    non_linearity=resnet_act_fn,
-                    output_scale_factor=output_scale_factor,
-                    pre_norm=resnet_pre_norm,
-                )
-            )
-
-        self.resnets = nn.ModuleList(resnets)
-
-        if add_upsample:
-            self.upsamplers = nn.ModuleList(
-                [
-                    UpsampleCausal3D(
-                        out_channels,
-                        use_conv=True,
-                        out_channels=out_channels,
-                        upsample_factor=upsample_scale_factor,
-                    )
-                ]
-            )
-        else:
-            self.upsamplers = None
-
-        self.resolution_idx = resolution_idx
-
-    def forward(
-        self, hidden_states: torch.FloatTensor, temb: Optional[torch.FloatTensor] = None, scale: float = 1.0
-    ) -> torch.FloatTensor:
-        for resnet in self.resnets:
-            hidden_states = resnet(hidden_states, temb=temb, scale=scale)
-
-        if self.upsamplers is not None:
-            for upsampler in self.upsamplers:
-                hidden_states = upsampler(hidden_states)
-
-        return hidden_states
diff --git a/videotuna/models/hunyuan/hyvideo_t2v/vae/vae.py b/videotuna/models/hunyuan/hyvideo_t2v/vae/vae.py
deleted file mode 100644
index 4002d1f7..00000000
--- a/videotuna/models/hunyuan/hyvideo_t2v/vae/vae.py
+++ /dev/null
@@ -1,355 +0,0 @@
-from dataclasses import dataclass
-from typing import Optional, Tuple
-
-import numpy as np
-import torch
-import torch.nn as nn
-
-from diffusers.utils import BaseOutput, is_torch_version
-from diffusers.utils.torch_utils import randn_tensor
-from diffusers.models.attention_processor import SpatialNorm
-from .unet_causal_3d_blocks import (
-    CausalConv3d,
-    UNetMidBlockCausal3D,
-    get_down_block3d,
-    get_up_block3d,
-)
-
-
-@dataclass
-class DecoderOutput(BaseOutput):
-    r"""
-    Output of decoding method.
-
-    Args:
-        sample (`torch.FloatTensor` of shape `(batch_size, num_channels, height, width)`):
-            The decoded output sample from the last layer of the model.
-    """
-
-    sample: torch.FloatTensor
-
-
-class EncoderCausal3D(nn.Module):
-    r"""
-    The `EncoderCausal3D` layer of a variational autoencoder that encodes its input into a latent representation.
-    """
-
-    def __init__(
-        self,
-        in_channels: int = 3,
-        out_channels: int = 3,
-        down_block_types: Tuple[str, ...] = ("DownEncoderBlockCausal3D",),
-        block_out_channels: Tuple[int, ...] = (64,),
-        layers_per_block: int = 2,
-        norm_num_groups: int = 32,
-        act_fn: str = "silu",
-        double_z: bool = True,
-        mid_block_add_attention=True,
-        time_compression_ratio: int = 4,
-        spatial_compression_ratio: int = 8,
-    ):
-        super().__init__()
-        self.layers_per_block = layers_per_block
-
-        self.conv_in = CausalConv3d(in_channels, block_out_channels[0], kernel_size=3, stride=1)
-        self.mid_block = None
-        self.down_blocks = nn.ModuleList([])
-
-        # down
-        output_channel = block_out_channels[0]
-        for i, down_block_type in enumerate(down_block_types):
-            input_channel = output_channel
-            output_channel = block_out_channels[i]
-            is_final_block = i == len(block_out_channels) - 1
-            num_spatial_downsample_layers = int(np.log2(spatial_compression_ratio))
-            num_time_downsample_layers = int(np.log2(time_compression_ratio))
-
-            if time_compression_ratio == 4:
-                add_spatial_downsample = bool(i < num_spatial_downsample_layers)
-                add_time_downsample = bool(
-                    i >= (len(block_out_channels) - 1 - num_time_downsample_layers)
-                    and not is_final_block
-                )
-            else:
-                raise ValueError(f"Unsupported time_compression_ratio: {time_compression_ratio}.")
-
-            downsample_stride_HW = (2, 2) if add_spatial_downsample else (1, 1)
-            downsample_stride_T = (2,) if add_time_downsample else (1,)
-            downsample_stride = tuple(downsample_stride_T + downsample_stride_HW)
-            down_block = get_down_block3d(
-                down_block_type,
-                num_layers=self.layers_per_block,
-                in_channels=input_channel,
-                out_channels=output_channel,
-                add_downsample=bool(add_spatial_downsample or add_time_downsample),
-                downsample_stride=downsample_stride,
-                resnet_eps=1e-6,
-                downsample_padding=0,
-                resnet_act_fn=act_fn,
-                resnet_groups=norm_num_groups,
-                attention_head_dim=output_channel,
-                temb_channels=None,
-            )
-            self.down_blocks.append(down_block)
-
-        # mid
-        self.mid_block = UNetMidBlockCausal3D(
-            in_channels=block_out_channels[-1],
-            resnet_eps=1e-6,
-            resnet_act_fn=act_fn,
-            output_scale_factor=1,
-            resnet_time_scale_shift="default",
-            attention_head_dim=block_out_channels[-1],
-            resnet_groups=norm_num_groups,
-            temb_channels=None,
-            add_attention=mid_block_add_attention,
-        )
-
-        # out
-        self.conv_norm_out = nn.GroupNorm(num_channels=block_out_channels[-1], num_groups=norm_num_groups, eps=1e-6)
-        self.conv_act = nn.SiLU()
-
-        conv_out_channels = 2 * out_channels if double_z else out_channels
-        self.conv_out = CausalConv3d(block_out_channels[-1], conv_out_channels, kernel_size=3)
-
-    def forward(self, sample: torch.FloatTensor) -> torch.FloatTensor:
-        r"""The forward method of the `EncoderCausal3D` class."""
-        assert len(sample.shape) == 5, "The input tensor should have 5 dimensions"
-
-        sample = self.conv_in(sample)
-
-        # down
-        for down_block in self.down_blocks:
-            sample = down_block(sample)
-
-        # middle
-        sample = self.mid_block(sample)
-
-        # post-process
-        sample = self.conv_norm_out(sample)
-        sample = self.conv_act(sample)
-        sample = self.conv_out(sample)
-
-        return sample
-
-
-class DecoderCausal3D(nn.Module):
-    r"""
-    The `DecoderCausal3D` layer of a variational autoencoder that decodes its latent representation into an output sample.
-    """
-
-    def __init__(
-        self,
-        in_channels: int = 3,
-        out_channels: int = 3,
-        up_block_types: Tuple[str, ...] = ("UpDecoderBlockCausal3D",),
-        block_out_channels: Tuple[int, ...] = (64,),
-        layers_per_block: int = 2,
-        norm_num_groups: int = 32,
-        act_fn: str = "silu",
-        norm_type: str = "group",  # group, spatial
-        mid_block_add_attention=True,
-        time_compression_ratio: int = 4,
-        spatial_compression_ratio: int = 8,
-    ):
-        super().__init__()
-        self.layers_per_block = layers_per_block
-
-        self.conv_in = CausalConv3d(in_channels, block_out_channels[-1], kernel_size=3, stride=1)
-        self.mid_block = None
-        self.up_blocks = nn.ModuleList([])
-
-        temb_channels = in_channels if norm_type == "spatial" else None
-
-        # mid
-        self.mid_block = UNetMidBlockCausal3D(
-            in_channels=block_out_channels[-1],
-            resnet_eps=1e-6,
-            resnet_act_fn=act_fn,
-            output_scale_factor=1,
-            resnet_time_scale_shift="default" if norm_type == "group" else norm_type,
-            attention_head_dim=block_out_channels[-1],
-            resnet_groups=norm_num_groups,
-            temb_channels=temb_channels,
-            add_attention=mid_block_add_attention,
-        )
-
-        # up
-        reversed_block_out_channels = list(reversed(block_out_channels))
-        output_channel = reversed_block_out_channels[0]
-        for i, up_block_type in enumerate(up_block_types):
-            prev_output_channel = output_channel
-            output_channel = reversed_block_out_channels[i]
-            is_final_block = i == len(block_out_channels) - 1
-            num_spatial_upsample_layers = int(np.log2(spatial_compression_ratio))
-            num_time_upsample_layers = int(np.log2(time_compression_ratio))
-
-            if time_compression_ratio == 4:
-                add_spatial_upsample = bool(i < num_spatial_upsample_layers)
-                add_time_upsample = bool(
-                    i >= len(block_out_channels) - 1 - num_time_upsample_layers
-                    and not is_final_block
-                )
-            else:
-                raise ValueError(f"Unsupported time_compression_ratio: {time_compression_ratio}.")
-
-            upsample_scale_factor_HW = (2, 2) if add_spatial_upsample else (1, 1)
-            upsample_scale_factor_T = (2,) if add_time_upsample else (1,)
-            upsample_scale_factor = tuple(upsample_scale_factor_T + upsample_scale_factor_HW)
-            up_block = get_up_block3d(
-                up_block_type,
-                num_layers=self.layers_per_block + 1,
-                in_channels=prev_output_channel,
-                out_channels=output_channel,
-                prev_output_channel=None,
-                add_upsample=bool(add_spatial_upsample or add_time_upsample),
-                upsample_scale_factor=upsample_scale_factor,
-                resnet_eps=1e-6,
-                resnet_act_fn=act_fn,
-                resnet_groups=norm_num_groups,
-                attention_head_dim=output_channel,
-                temb_channels=temb_channels,
-                resnet_time_scale_shift=norm_type,
-            )
-            self.up_blocks.append(up_block)
-            prev_output_channel = output_channel
-
-        # out
-        if norm_type == "spatial":
-            self.conv_norm_out = SpatialNorm(block_out_channels[0], temb_channels)
-        else:
-            self.conv_norm_out = nn.GroupNorm(num_channels=block_out_channels[0], num_groups=norm_num_groups, eps=1e-6)
-        self.conv_act = nn.SiLU()
-        self.conv_out = CausalConv3d(block_out_channels[0], out_channels, kernel_size=3)
-
-        self.gradient_checkpointing = False
-
-    def forward(
-        self,
-        sample: torch.FloatTensor,
-        latent_embeds: Optional[torch.FloatTensor] = None,
-    ) -> torch.FloatTensor:
-        r"""The forward method of the `DecoderCausal3D` class."""
-        assert len(sample.shape) == 5, "The input tensor should have 5 dimensions."
-
-        sample = self.conv_in(sample)
-
-        upscale_dtype = next(iter(self.up_blocks.parameters())).dtype
-        if self.training and self.gradient_checkpointing:
-
-            def create_custom_forward(module):
-                def custom_forward(*inputs):
-                    return module(*inputs)
-
-                return custom_forward
-
-            if is_torch_version(">=", "1.11.0"):
-                # middle
-                sample = torch.utils.checkpoint.checkpoint(
-                    create_custom_forward(self.mid_block),
-                    sample,
-                    latent_embeds,
-                    use_reentrant=False,
-                )
-                sample = sample.to(upscale_dtype)
-
-                # up
-                for up_block in self.up_blocks:
-                    sample = torch.utils.checkpoint.checkpoint(
-                        create_custom_forward(up_block),
-                        sample,
-                        latent_embeds,
-                        use_reentrant=False,
-                    )
-            else:
-                # middle
-                sample = torch.utils.checkpoint.checkpoint(
-                    create_custom_forward(self.mid_block), sample, latent_embeds
-                )
-                sample = sample.to(upscale_dtype)
-
-                # up
-                for up_block in self.up_blocks:
-                    sample = torch.utils.checkpoint.checkpoint(create_custom_forward(up_block), sample, latent_embeds)
-        else:
-            # middle
-            sample = self.mid_block(sample, latent_embeds)
-            sample = sample.to(upscale_dtype)
-
-            # up
-            for up_block in self.up_blocks:
-                sample = up_block(sample, latent_embeds)
-
-        # post-process
-        if latent_embeds is None:
-            sample = self.conv_norm_out(sample)
-        else:
-            sample = self.conv_norm_out(sample, latent_embeds)
-        sample = self.conv_act(sample)
-        sample = self.conv_out(sample)
-
-        return sample
-
-
-class DiagonalGaussianDistribution(object):
-    def __init__(self, parameters: torch.Tensor, deterministic: bool = False):
-        if parameters.ndim == 3:
-            dim = 2  # (B, L, C)
-        elif parameters.ndim == 5 or parameters.ndim == 4:
-            dim = 1  # (B, C, T, H ,W) / (B, C, H, W)
-        else:
-            raise NotImplementedError
-        self.parameters = parameters
-        self.mean, self.logvar = torch.chunk(parameters, 2, dim=dim)
-        self.logvar = torch.clamp(self.logvar, -30.0, 20.0)
-        self.deterministic = deterministic
-        self.std = torch.exp(0.5 * self.logvar)
-        self.var = torch.exp(self.logvar)
-        if self.deterministic:
-            self.var = self.std = torch.zeros_like(
-                self.mean, device=self.parameters.device, dtype=self.parameters.dtype
-            )
-
-    def sample(self, generator: Optional[torch.Generator] = None) -> torch.FloatTensor:
-        # make sure sample is on the same device as the parameters and has same dtype
-        sample = randn_tensor(
-            self.mean.shape,
-            generator=generator,
-            device=self.parameters.device,
-            dtype=self.parameters.dtype,
-        )
-        x = self.mean + self.std * sample
-        return x
-
-    def kl(self, other: "DiagonalGaussianDistribution" = None) -> torch.Tensor:
-        if self.deterministic:
-            return torch.Tensor([0.0])
-        else:
-            reduce_dim = list(range(1, self.mean.ndim))
-            if other is None:
-                return 0.5 * torch.sum(
-                    torch.pow(self.mean, 2) + self.var - 1.0 - self.logvar,
-                    dim=reduce_dim,
-                )
-            else:
-                return 0.5 * torch.sum(
-                    torch.pow(self.mean - other.mean, 2) / other.var
-                    + self.var / other.var
-                    - 1.0
-                    - self.logvar
-                    + other.logvar,
-                    dim=reduce_dim,
-                )
-
-    def nll(self, sample: torch.Tensor, dims: Tuple[int, ...] = [1, 2, 3]) -> torch.Tensor:
-        if self.deterministic:
-            return torch.Tensor([0.0])
-        logtwopi = np.log(2.0 * np.pi)
-        return 0.5 * torch.sum(
-            logtwopi + self.logvar +
-            torch.pow(sample - self.mean, 2) / self.var,
-            dim=dims,
-        )
-
-    def mode(self) -> torch.Tensor:
-        return self.mean
diff --git a/videotuna/models/lvdm/ddpm3d.py b/videotuna/models/lvdm/ddpm3d.py
deleted file mode 100644
index f88503a6..00000000
--- a/videotuna/models/lvdm/ddpm3d.py
+++ /dev/null
@@ -1,1713 +0,0 @@
-"""
-wild mixture of
-https://github.com/openai/improved-diffusion/blob/e94489283bb876ac1477d5dd7709bbbd2d9902ce/improved_diffusion/gaussian_diffusion.py
-https://github.com/lucidrains/denoising-diffusion-pytorch/blob/7706bdfc6f527f58d33f84b7b522e61e6e3164b3/denoising_diffusion_pytorch/denoising_diffusion_pytorch.py
-https://github.com/CompVis/taming-transformers
--- merci
-"""
-
-import logging
-import os
-import random
-from contextlib import contextmanager
-from functools import partial
-
-import numpy as np
-from einops import rearrange, repeat
-from tqdm import tqdm
-from omegaconf import DictConfig
-
-mainlogger = logging.getLogger("mainlogger")
-
-import peft
-import pytorch_lightning as pl
-import torch
-import torch.nn as nn
-from pytorch_lightning.utilities import rank_zero_only
-from torch.optim.lr_scheduler import CosineAnnealingLR, LambdaLR
-from torchvision.utils import make_grid
-
-from videotuna.schedulers.ddim import DDIMSampler
-from videotuna.utils.distributions import DiagonalGaussianDistribution
-from videotuna.utils.ema import LitEma
-
-from videotuna.models.lvdm.models.rlhf_utils.batch_ddim import batch_ddim_sampling
-from videotuna.models.lvdm.models.rlhf_utils.reward_fn import aesthetic_loss_fn
-from videotuna.models.lvdm.modules.encoders.ip_resampler import ImageProjModel, Resampler
-from videotuna.models.lvdm.modules.utils import (
-    default,
-    disabled_train,
-    exists,
-    extract_into_tensor,
-    noise_like,
-)
-from videotuna.utils.common_utils import instantiate_from_config
-
-__conditioning_keys__ = {"concat": "c_concat", "crossattn": "c_crossattn", "adm": "y"}
-
-
-class DDPMFlow(pl.LightningModule):
-    # classic DDPM with Gaussian diffusion, in image space
-    def __init__(
-        self,
-        unet_config,
-        diffusion_scheduler_config,
-        loss_type="l2",
-        ckpt_path=None,
-        ignore_keys=[],
-        load_only_unet=False,
-        monitor=None,
-        use_ema=True,
-        first_stage_key="image",
-        image_size=256,
-        channels=3,
-        log_every_t=100,
-        clip_denoised=True,
-        original_elbo_weight=0.0,
-        l_simple_weight=1.0,
-        conditioning_key=None,
-        parameterization="eps",  # all assuming fixed variance schedules
-        scheduler_config=None,  # learning rate scheduler config
-        use_positional_encodings=False,
-        lora_args=[],
-        *args,
-        **kwargs,
-    ):
-        super().__init__()
-        assert parameterization in [
-            "eps",
-            "x0",
-            "v",
-        ], 'currently only supporting "eps" and "x0" and "v"'
-        self.parameterization = parameterization
-        mainlogger.info(
-            f"{self.__class__.__name__}: Running in {self.parameterization}-prediction mode"
-        )
-
-        self.clip_denoised = clip_denoised
-        self.log_every_t = log_every_t
-
-        # model related
-        self.first_stage_key = first_stage_key
-        self.cond_stage_model = None
-        self.channels = channels
-        self.temporal_length = unet_config.params.get("temporal_length", 16)
-        self.image_size = image_size  # try conv?
-        if isinstance(self.image_size, int):
-            self.image_size = [self.image_size, self.image_size]
-        self.use_positional_encodings = use_positional_encodings
-        self.model = DiffusionWrapper(unet_config, conditioning_key)
-        # load lora models
-        # peft lora config. The arg name is algned with peft.
-        self.lora_args = lora_args
-        if len(lora_args) > 0:
-            self.lora_ckpt_path = getattr(lora_args, "lora_ckpt", None)
-            self.lora_rank = getattr(lora_args, "lora_rank", 4)
-            self.lora_alpha = getattr(lora_args, "lora_alpha", 1)
-            self.lora_dropout = getattr(lora_args, "lora_dropout", 0.0)
-            self.target_modules = getattr(
-                lora_args, "target_modules", ["to_k", "to_v", "to_q"]
-            )
-            # peft set other paramtere requires_grad to False
-            self.lora_config = peft.LoraConfig(
-                r=self.lora_rank,
-                lora_alpha=self.lora_alpha,
-                target_modules=self.target_modules,  # only diffusion_model has these modules
-                lora_dropout=self.lora_dropout,
-            )
-
-        # count_params(self.model, verbose=True)
-        self.use_ema = use_ema
-        if self.use_ema:
-            self.model_ema = LitEma(self.model)
-            mainlogger.info(f"Keeping EMAs of {len(list(self.model_ema.buffers()))}.")
-
-        # this is learning rate scheduler..
-        self.use_scheduler = scheduler_config is not None
-        if self.use_scheduler:
-            self.scheduler_config = scheduler_config
-
-        self.original_elbo_weight = original_elbo_weight
-        self.l_simple_weight = l_simple_weight
-
-        # TODO to be implemented
-        print("scheduler config type: ", type(diffusion_scheduler_config))
-        diffusion_scheduler_config.parameterization = self.parameterization
-        self.diffusion_scheduler = instantiate_from_config(diffusion_scheduler_config)
-        self.num_timesteps = self.diffusion_scheduler.num_timesteps
-
-        # # this can be used to initiate schedulers
-        # self.v_posterior = v_posterior
-        # self.original_elbo_weight = original_elbo_weight
-        # self.l_simple_weight = l_simple_weight
-
-        # self.register_schedule(given_betas=given_betas, beta_schedule=beta_schedule, timesteps=timesteps,
-        #                        linear_start=linear_start, linear_end=linear_end, cosine_s=cosine_s)
-
-        # self.learn_logvar = learn_logvar
-        # self.logvar = torch.full(fill_value=logvar_init, size=(self.num_timesteps,))
-        # if self.learn_logvar:
-        #     self.logvar = nn.Parameter(self.logvar, requires_grad=True)
-
-        # others
-        if monitor is not None:
-            self.monitor = monitor
-        if ckpt_path is not None:
-            self.init_from_ckpt(
-                ckpt_path, ignore_keys=ignore_keys, only_model=load_only_unet
-            )
-
-        self.loss_type = loss_type
-
-    @contextmanager
-    def ema_scope(self, context=None):
-        if self.use_ema:
-            self.model_ema.store(self.model.parameters())
-            self.model_ema.copy_to(self.model)
-            if context is not None:
-                mainlogger.info(f"{context}: Switched to EMA weights")
-        try:
-            yield None
-        finally:
-            if self.use_ema:
-                self.model_ema.restore(self.model.parameters())
-                if context is not None:
-                    mainlogger.info(f"{context}: Restored training weights")
-
-    def init_from_ckpt(self, path, ignore_keys=list(), only_model=False):
-        # TODO logvar keys
-        sd = torch.load(path, map_location="cpu")
-        if "state_dict" in list(sd.keys()):
-            sd = sd["state_dict"]
-        keys = list(sd.keys())
-        for k in keys:
-            for ik in ignore_keys:
-                if k.startswith(ik):
-                    mainlogger.info("Deleting key {} from state_dict.".format(k))
-                    del sd[k]
-        missing, unexpected = (
-            self.load_state_dict(sd, strict=False)
-            if not only_model
-            else self.model.load_state_dict(sd, strict=False)
-        )
-        mainlogger.info(
-            f"Restored from {path} with {len(missing)} missing and {len(unexpected)} unexpected keys"
-        )
-        if len(missing) > 0:
-            mainlogger.info(f"Missing Keys: {missing}")
-        if len(unexpected) > 0:
-            mainlogger.info(f"Unexpected Keys: {unexpected}")
-
-    @torch.no_grad()
-    def p_sample_loop(self, shape, return_intermediates=False):
-        device = self.diffusion_scheduler.betas.device
-        b = shape[0]
-        img = torch.randn(shape, device=device)
-        intermediates = [img]
-        for i in tqdm(
-            reversed(range(0, self.num_timesteps)),
-            desc="Sampling t",
-            total=self.num_timesteps,
-        ):
-            img = self.diffusion_scheduler.p_sample(
-                img,
-                torch.full((b,), i, device=device, dtype=torch.long),
-                clip_denoised=self.clip_denoised,
-            )
-            if i % self.log_every_t == 0 or i == self.num_timesteps - 1:
-                intermediates.append(img)
-        if return_intermediates:
-            return img, intermediates
-        return img
-
-    @torch.no_grad()
-    def sample(self, batch_size=16, return_intermediates=False):
-        image_size = self.image_size
-        channels = self.channels
-        return self.p_sample_loop(
-            (batch_size, channels, image_size, image_size),
-            return_intermediates=return_intermediates,
-        )
-
-    def get_loss(self, pred, target, mean=True):
-
-        if target.size()[1] != pred.size()[1]:
-            c = target.size()[1]
-            pred = pred[
-                :, :c, ...
-            ]  # opensora, only previous 4 channels used for calculating loss.
-
-        if self.loss_type == "l1":
-            loss = (target - pred).abs()
-            if mean:
-                loss = loss.mean()
-        elif self.loss_type == "l2":
-            if mean:
-                loss = torch.nn.functional.mse_loss(target, pred)
-            else:
-                loss = torch.nn.functional.mse_loss(target, pred, reduction="none")
-        else:
-            raise NotImplementedError("unknown loss type '{loss_type}'")
-
-        return loss
-
-    def p_losses(self, x_start, t, noise=None):
-        noise = default(noise, lambda: torch.randn_like(x_start))
-        x_noisy = self.diffusion_scheduler.q_sample(x_start=x_start, t=t, noise=noise)
-        model_out = self.model(x_noisy, t)
-
-        loss_dict = {}
-        if self.parameterization == "eps":
-            target = noise
-        elif self.parameterization == "x0":
-            target = x_start
-        elif self.parameterization == "v":
-            target = self.diffusion_scheduler.get_v(x_start, noise, t)
-        else:
-            raise NotImplementedError(
-                f"Paramterization {self.parameterization} not yet supported"
-            )
-
-        loss = self.get_loss(model_out, target, mean=False).mean(dim=[1, 2, 3])
-
-        log_prefix = "train" if self.training else "val"
-
-        loss_dict.update({f"{log_prefix}/loss_simple": loss.mean()})
-        loss_simple = loss.mean() * self.l_simple_weight
-
-        loss_vlb = (self.diffusion_scheduler.lvlb_weights[t] * loss).mean()
-        loss_dict.update({f"{log_prefix}/loss_vlb": loss_vlb})
-
-        loss = loss_simple + self.original_elbo_weight * loss_vlb
-
-        loss_dict.update({f"{log_prefix}/loss": loss})
-
-        return loss, loss_dict
-
-    def forward(self, x, *args, **kwargs):
-        # b, c, h, w, device, img_size, = *x.shape, x.device, self.image_size
-        # assert h == img_size and w == img_size, f'height and width of image must be {img_size}'
-        t = torch.randint(
-            0, self.num_timesteps, (x.shape[0],), device=self.device
-        ).long()
-        return self.p_losses(x, t, *args, **kwargs)
-
-    def get_input(self, batch, k):
-        x = batch[k]
-        """
-        if len(x.shape) == 3:
-            x = x[..., None]
-        x = rearrange(x, 'b h w c -> b c h w')
-        """
-        x = x.to(memory_format=torch.contiguous_format).float()
-        return x
-
-    def shared_step(self, batch):
-        x = self.get_input(batch, self.first_stage_key)
-        loss, loss_dict = self(x)
-        return loss, loss_dict
-
-    def training_step(self, batch, batch_idx):
-        loss, loss_dict = self.shared_step(batch)
-
-        self.log_dict(
-            loss_dict, prog_bar=True, logger=True, on_step=True, on_epoch=True
-        )
-
-        self.log(
-            "global_step",
-            self.global_step,
-            prog_bar=True,
-            logger=True,
-            on_step=True,
-            on_epoch=False,
-        )
-
-        if self.use_scheduler:
-            lr = self.optimizers().param_groups[0]["lr"]
-            self.log(
-                "lr_abs", lr, prog_bar=True, logger=True, on_step=True, on_epoch=False
-            )
-
-        return loss
-
-    @torch.no_grad()
-    def validation_step(self, batch, batch_idx):
-        _, loss_dict_no_ema = self.shared_step(batch, random_uncond=False)
-        with self.ema_scope():
-            _, loss_dict_ema = self.shared_step(batch, random_uncond=False)
-            loss_dict_ema = {key + "_ema": loss_dict_ema[key] for key in loss_dict_ema}
-        self.log_dict(
-            loss_dict_no_ema, prog_bar=False, logger=True, on_step=False, on_epoch=True
-        )
-        self.log_dict(
-            loss_dict_ema, prog_bar=False, logger=True, on_step=False, on_epoch=True
-        )
-
-    def on_train_batch_end(self, *args, **kwargs):
-        if self.use_ema:
-            self.model_ema(self.model)
-
-    def _get_rows_from_list(self, samples):
-        n_imgs_per_row = len(samples)
-        denoise_grid = rearrange(samples, "n b c h w -> b n c h w")
-        denoise_grid = rearrange(denoise_grid, "b n c h w -> (b n) c h w")
-        denoise_grid = make_grid(denoise_grid, nrow=n_imgs_per_row)
-        return denoise_grid
-
-    @torch.no_grad()
-    def log_images(self, batch, N=8, n_row=2, sample=True, return_keys=None, **kwargs):
-        log = dict()
-        x = self.get_input(batch, self.first_stage_key)
-        N = min(x.shape[0], N)
-        n_row = min(x.shape[0], n_row)
-        x = x.to(self.device)[:N]
-        log["inputs"] = x
-
-        # get diffusion row
-        diffusion_row = list()
-        x_start = x[:n_row]
-
-        for t in range(self.num_timesteps):
-            if t % self.log_every_t == 0 or t == self.num_timesteps - 1:
-                t = repeat(torch.tensor([t]), "1 -> b", b=n_row)
-                t = t.to(self.device).long()
-                noise = torch.randn_like(x_start)
-                x_noisy = self.q_sample(x_start=x_start, t=t, noise=noise)
-                diffusion_row.append(x_noisy)
-
-        log["diffusion_row"] = self._get_rows_from_list(diffusion_row)
-
-        if sample:
-            # get denoise row
-            with self.ema_scope("Plotting"):
-                samples, denoise_row = self.sample(
-                    batch_size=N, return_intermediates=True
-                )
-
-            log["samples"] = samples
-            log["denoise_row"] = self._get_rows_from_list(denoise_row)
-
-        if return_keys:
-            if np.intersect1d(list(log.keys()), return_keys).shape[0] == 0:
-                return log
-            else:
-                return {key: log[key] for key in return_keys}
-        return log
-
-    def configure_optimizers(self):
-        lr = self.learning_rate
-        params = list(self.model.parameters())
-        if self.diffusion_scheduler.learn_logvar:
-            params = params + [self.logvar]
-        opt = torch.optim.AdamW(params, lr=lr)
-        return opt
-
-    def load_lora_from_ckpt(self, model, path):
-        lora_state_dict = torch.load(path)["state_dict"]
-        copy_tracker = {key: False for key in lora_state_dict}
-        for n, p in model.named_parameters():
-            if "lora" in n:
-                lora_n = f"model.{n}"
-            else:
-                continue
-            if lora_n in lora_state_dict:
-                if copy_tracker[lora_n]:
-                    raise RuntimeError(
-                        f"Parameter {lora_n} has already been copied once."
-                    )
-                print(f"Copying parameter {lora_n}")
-                with torch.no_grad():
-                    p.copy_(lora_state_dict[lora_n])
-                copy_tracker[lora_n] = True
-            else:
-                raise RuntimeError(f"Parameter {lora_n} not found in lora_state_dict.")
-        # check parameter load intergrity
-        for key, copied in copy_tracker.items():
-            if not copied:
-                raise RuntimeError(
-                    f"Parameter {key} from lora_state_dict was not copied to the model."
-                )
-                # print(f"Parameter {key} from lora_state_dict was not copied to the model.")
-        print(f"All Parameters was copied successfully.")
-
-    def inject_lora(self):
-        """inject lora into the denoising module.
-        The self.model should be a instance of pl.LightningModule or nn.Module.
-        """
-        # TODO: we can support inistantiate from config in the future. Now we test correctness.
-        # for simplicity, we just inject denoising model. not injecting condition model.
-        self.model = peft.get_peft_model(self.model, self.lora_config)
-
-        self.model.print_trainable_parameters()
-
-        if self.lora_ckpt_path is not None:
-            self.load_lora_from_ckpt(self.model, self.lora_ckpt_path)
-
-
-class LVDMFlow(DDPMFlow):
-    """main class"""
-
-    def __init__(
-        self,
-        first_stage_config,
-        cond_stage_config,
-        diffusion_scheduler_config,
-        cond_stage_key="caption",
-        cond_stage_trainable=False,
-        cond_stage_forward=None,
-        conditioning_key=None,
-        uncond_prob=0.2,
-        uncond_type="empty_seq",
-        scale_factor=1.0,
-        scale_by_std=False,
-        fps_condition_type="fs",
-        # added for LVDM
-        encoder_type="2d",
-        frame_cond=None,
-        only_model=False,
-        use_scale=False,  # dynamic rescaling
-        scale_a=1,
-        scale_b=0.3,
-        mid_step=400,
-        fix_scale_bug=False,
-        interp_mode=False,
-        logdir=None,
-        rand_cond_frame=False,
-        empty_params_only=False,
-        *args,
-        **kwargs,
-    ):
-        self.scale_by_std = scale_by_std
-        ckpt_path = kwargs.pop("ckpt_path", None)
-        ignore_keys = kwargs.pop("ignore_keys", [])
-        conditioning_key = default(conditioning_key, "crossattn")
-
-        # the init func of the base class will initiate a diffusion_scheduler
-        super().__init__(
-            conditioning_key=conditioning_key,
-            diffusion_scheduler_config=diffusion_scheduler_config,
-            *args,
-            **kwargs,
-        )
-
-        self.cond_stage_trainable = cond_stage_trainable
-        self.cond_stage_key = cond_stage_key
-        self.empty_params_only = empty_params_only
-        self.fps_condition_type = fps_condition_type
-
-        # scale factor
-        self.use_scale = use_scale
-        if self.use_scale:
-            self.scale_a = scale_a
-            self.scale_b = scale_b
-            if fix_scale_bug:
-                scale_step = self.num_timesteps - mid_step
-            else:  # bug
-                scale_step = self.num_timesteps
-
-            scale_arr1 = np.linspace(scale_a, scale_b, mid_step)
-            scale_arr2 = np.full(scale_step, scale_b)
-            scale_arr = np.concatenate((scale_arr1, scale_arr2))
-            scale_arr_prev = np.append(scale_a, scale_arr[:-1])
-            to_torch = partial(torch.tensor, dtype=torch.float32)
-            self.register_buffer("scale_arr", to_torch(scale_arr))
-
-        try:
-            self.num_downs = len(first_stage_config.params.ddconfig.ch_mult) - 1
-        except:
-            self.num_downs = 0
-        if not scale_by_std:
-            self.scale_factor = scale_factor
-        else:
-            self.register_buffer("scale_factor", torch.tensor(scale_factor))
-
-        self.instantiate_first_stage(first_stage_config)
-        self.instantiate_cond_stage(cond_stage_config)
-        self.first_stage_config = first_stage_config
-        self.cond_stage_config = cond_stage_config
-        self.clip_denoised = False
-
-        self.cond_stage_forward = cond_stage_forward
-        self.encoder_type = encoder_type
-        assert encoder_type in ["2d", "3d"]
-        self.uncond_prob = uncond_prob
-        self.classifier_free_guidance = True if uncond_prob > 0 else False
-        assert uncond_type in ["zero_embed", "empty_seq"]
-        self.uncond_type = uncond_type
-
-        ## future frame prediction
-        self.frame_cond = frame_cond
-        if self.frame_cond:
-            # frame_len = self.model.diffusion_model.temporal_length
-            frame_len = self.temporal_length
-            cond_mask = torch.zeros(frame_len, dtype=torch.float32)
-            cond_mask[: self.frame_cond] = 1.0
-            ## b,c,t,h,w
-            self.cond_mask = cond_mask[None, None, :, None, None]
-            mainlogger.info(
-                "---training for %d-frame conditoning T2V" % (self.frame_cond)
-            )
-        else:
-            self.cond_mask = None
-
-        self.restarted_from_ckpt = False
-        if ckpt_path is not None:
-            self.init_from_ckpt(ckpt_path, ignore_keys, only_model=only_model)
-            self.restarted_from_ckpt = True
-
-        self.logdir = logdir
-        self.rand_cond_frame = rand_cond_frame
-        self.interp_mode = interp_mode
-
-    def _freeze_model(self):
-        for name, para in self.model.diffusion_model.named_parameters():
-            para.requires_grad = False
-
-    @rank_zero_only
-    @torch.no_grad()
-    def on_train_batch_start(self, batch, batch_idx, dataloader_idx=None):
-        # only for very first batch, reset the self.scale_factor
-        if (
-            self.scale_by_std
-            and self.current_epoch == 0
-            and self.global_step == 0
-            and batch_idx == 0
-            and not self.restarted_from_ckpt
-        ):
-            assert (
-                self.scale_factor == 1.0
-            ), "rather not use custom rescaling and std-rescaling simultaneously"
-            # set rescale weight to 1./std of encodings
-            mainlogger.info("### USING STD-RESCALING ###")
-            x = super().get_input(batch, self.first_stage_key)
-            x = x.to(self.device)
-            encoder_posterior = self.encode_first_stage(x)
-            z = self.get_first_stage_encoding(encoder_posterior).detach()
-            del self.scale_factor
-            self.register_buffer("scale_factor", 1.0 / z.flatten().std())
-            mainlogger.info(f"setting self.scale_factor to {self.scale_factor}")
-            mainlogger.info("### USING STD-RESCALING ###")
-            mainlogger.info(f"std={z.flatten().std()}")
-
-    def instantiate_first_stage(self, config):
-        model = instantiate_from_config(config)
-        self.first_stage_model = model.eval()
-        self.first_stage_model.train = disabled_train
-        for param in self.first_stage_model.parameters():
-            param.requires_grad = False
-
-    def instantiate_cond_stage(self, config):
-        if not self.cond_stage_trainable:
-            model = instantiate_from_config(config)
-            self.cond_stage_model = model.eval()
-            self.cond_stage_model.train = disabled_train
-            for param in self.cond_stage_model.parameters():
-                param.requires_grad = False
-        else:
-            model = instantiate_from_config(config)
-            self.cond_stage_model = model
-
-    def get_learned_conditioning(self, c):
-        if self.cond_stage_forward is None:
-            if hasattr(self.cond_stage_model, "encode") and callable(
-                self.cond_stage_model.encode
-            ):
-                c = self.cond_stage_model.encode(c)
-                if isinstance(c, DiagonalGaussianDistribution):
-                    c = c.mode()
-            else:
-                c = self.cond_stage_model(c)
-        else:
-            assert hasattr(self.cond_stage_model, self.cond_stage_forward)
-            c = getattr(self.cond_stage_model, self.cond_stage_forward)(c)
-        return c
-
-    def get_first_stage_encoding(self, encoder_posterior, noise=None):
-        if isinstance(encoder_posterior, DiagonalGaussianDistribution):
-            z = encoder_posterior.sample(noise=noise)
-        elif isinstance(encoder_posterior, torch.Tensor):
-            z = encoder_posterior
-        else:
-            raise NotImplementedError(
-                f"encoder_posterior of type '{type(encoder_posterior)}' not yet implemented"
-            )
-        return self.scale_factor * z
-
-    @torch.no_grad()
-    def encode_first_stage(self, x):
-        if self.encoder_type == "2d" and x.dim() == 5:
-            return self.encode_first_stage_2DAE(x)
-        encoder_posterior = self.first_stage_model.encode(x)
-        results = self.get_first_stage_encoding(encoder_posterior).detach()
-        return results
-
-    def encode_first_stage_2DAE(self, x):
-        """encode frame by frame"""
-        b, _, t, _, _ = x.shape
-        results = torch.cat(
-            [
-                self.get_first_stage_encoding(self.first_stage_model.encode(x[:, :, i]))
-                .detach()
-                .unsqueeze(2)
-                for i in range(t)
-            ],
-            dim=2,
-        )
-        return results
-
-    def decode_first_stage_2DAE(self, z, **kwargs):
-        """decode frame by frame"""
-        _, _, t, _, _ = z.shape
-        results = torch.cat(
-            [
-                self.first_stage_model.decode(z[:, :, i], **kwargs).unsqueeze(2)
-                for i in range(t)
-            ],
-            dim=2,
-        )
-        return results
-
-    def _decode_core(self, z, **kwargs):
-        z = 1.0 / self.scale_factor * z
-
-        if self.encoder_type == "2d" and z.dim() == 5:
-            return self.decode_first_stage_2DAE(z)
-        results = self.first_stage_model.decode(z, **kwargs)
-        return results
-
-    @torch.no_grad()
-    def decode_first_stage(self, z, **kwargs):
-        return self._decode_core(z, **kwargs)
-
-    def differentiable_decode_first_stage(self, z, **kwargs):
-        """same as decode_first_stage but without decorator"""
-        return self._decode_core(z, **kwargs)
-
-    def get_batch_input(
-        self,
-        batch,
-        random_uncond,
-        return_first_stage_outputs=False,
-        return_original_cond=False,
-        is_imgbatch=False,
-    ):
-        ## image/video shape: b, c, t, h, w
-        data_key = "jpg" if is_imgbatch else self.first_stage_key
-        x = super().get_input(batch, data_key)
-        if is_imgbatch:
-            ## pack image as video
-            # x = x[:,:,None,:,:]
-            b = x.shape[0] // self.temporal_length
-            x = rearrange(x, "(b t) c h w -> b c t h w", b=b, t=self.temporal_length)
-        x_ori = x
-        ## encode video frames x to z via a 2D encoder
-        z = self.encode_first_stage(x)
-
-        ## get caption condition
-        cond_key = "txt" if is_imgbatch else self.cond_stage_key
-        cond = batch[cond_key]
-        if random_uncond and self.uncond_type == "empty_seq":
-            for i, ci in enumerate(cond):
-                if random.random() < self.uncond_prob:
-                    cond[i] = ""
-        if isinstance(cond, dict) or isinstance(cond, list):
-            cond_emb = self.get_learned_conditioning(cond)
-        else:
-            cond_emb = self.get_learned_conditioning(cond.to(self.device))
-        if random_uncond and self.uncond_type == "zero_embed":
-            for i, ci in enumerate(cond):
-                if random.random() < self.uncond_prob:
-                    cond_emb[i] = torch.zeros_like(ci)
-
-        out = [z, cond_emb]
-        ## optional output: self-reconst or caption
-        if return_first_stage_outputs:
-            xrec = self.decode_first_stage(z)
-            out.extend([x_ori, xrec])
-        if return_original_cond:
-            out.append(cond)
-
-        return out
-
-    def forward(self, x, c, **kwargs):
-        if "t" in kwargs:
-            t = kwargs.pop("t")
-        else:
-            t = torch.randint(
-                0, self.num_timesteps, (x.shape[0],), device=self.device
-            ).long()
-        if self.use_scale:
-            x = x * extract_into_tensor(self.scale_arr, t, x.shape)
-        return self.p_losses(x, c, t, **kwargs)
-
-    def shared_step(self, batch, random_uncond, **kwargs):
-        is_imgbatch = False
-        if "loader_img" in batch.keys():
-            ratio = 10.0 / self.temporal_length
-            if random.uniform(0.0, 10.0) < ratio:
-                is_imgbatch = True
-                batch = batch["loader_img"]
-            else:
-                batch = batch["loader_video"]
-        else:
-            pass
-
-        x, c = self.get_batch_input(
-            batch, random_uncond=random_uncond, is_imgbatch=is_imgbatch
-        )
-        loss, loss_dict = self(x, c, is_imgbatch=is_imgbatch, **kwargs)
-        return loss, loss_dict
-
-    def apply_model(self, x_noisy, t, cond, **kwargs):
-        if self.model.conditioning_key == "crossattn_stdit":
-            key = "c_crossattn_stdit"
-            cond = {key: [cond["y"]], "mask": [cond["mask"]]}  # support mask for T5
-        else:
-            if isinstance(cond, dict):
-                # hybrid case, cond is exptected to be a dict
-                pass
-            else:
-                if not isinstance(cond, list):
-                    cond = [cond]
-                key = (
-                    "c_concat"
-                    if self.model.conditioning_key == "concat"
-                    else "c_crossattn"
-                )
-                cond = {key: cond}
-
-        x_recon = self.model(x_noisy, t, **cond, **kwargs)
-
-        if isinstance(x_recon, tuple):
-            return x_recon[0]
-        else:
-            return x_recon
-
-    def p_losses(self, x_start, cond, t, noise=None, **kwargs):
-        noise = default(noise, lambda: torch.randn_like(x_start))
-        x_noisy = self.diffusion_scheduler.q_sample(x_start=x_start, t=t, noise=noise)
-        if self.frame_cond:
-            if self.cond_mask.device is not self.device:
-                self.cond_mask = self.cond_mask.to(self.device)
-            ## condition on fist few frames
-            x_noisy = x_start * self.cond_mask + (1.0 - self.cond_mask) * x_noisy
-        model_output = self.apply_model(x_noisy, t, cond, **kwargs)
-
-        loss_dict = {}
-        prefix = "train" if self.training else "val"
-
-        if self.parameterization == "x0":
-            target = x_start
-        elif self.parameterization == "eps":
-            target = noise
-        elif self.parameterization == "v":
-            target = self.diffusion_scheduler.get_v(x_start, noise, t)
-        else:
-            raise NotImplementedError()
-
-        if self.frame_cond:
-            ## [b,c,t,h,w]: only care about the predicted part (avoid disturbance)
-            model_output = model_output[:, :, self.frame_cond :, :, :]
-            target = target[:, :, self.frame_cond :, :, :]
-
-        loss_simple = self.get_loss(model_output, target, mean=False).mean([1, 2, 3, 4])
-
-        if torch.isnan(loss_simple).any():
-            print(f"loss_simple exists nan: {loss_simple}")
-            for i in range(loss_simple.shape[0]):
-                if torch.isnan(loss_simple[i]).any():
-                    loss_simple[i] = torch.zeros_like(loss_simple[i])
-
-        loss_dict.update({f"{prefix}/loss_simple": loss_simple.mean()})
-
-        if self.diffusion_scheduler.logvar.device is not self.device:
-            self.diffusion_scheduler.logvar = self.diffusion_scheduler.logvar.to(
-                self.device
-            )
-        logvar_t = self.diffusion_scheduler.logvar[t]
-        # logvar_t = self.logvar[t.item()].to(self.device) # device conflict when ddp shared
-        loss = loss_simple / torch.exp(logvar_t) + logvar_t
-        # loss = loss_simple / torch.exp(self.logvar) + self.logvar
-        if self.diffusion_scheduler.learn_logvar:
-            loss_dict.update({f"{prefix}/loss_gamma": loss.mean()})
-            loss_dict.update({"logvar": self.diffusion_scheduler.logvar.data.mean()})
-
-        loss = self.l_simple_weight * loss.mean()
-
-        if self.original_elbo_weight > 0:
-            loss_vlb = self.get_loss(model_output, target, mean=False).mean(
-                dim=(1, 2, 3, 4)
-            )
-            loss_vlb = (self.lvlb_weights[t] * loss_vlb).mean()
-            loss_dict.update({f"{prefix}/loss_vlb": loss_vlb})
-            loss += self.original_elbo_weight * loss_vlb
-        loss_dict.update({f"{prefix}/loss": loss})
-
-        return loss, loss_dict
-
-    def training_step(self, batch, batch_idx):
-        loss, loss_dict = self.shared_step(
-            batch, random_uncond=self.classifier_free_guidance
-        )
-        self.log_dict(
-            loss_dict,
-            prog_bar=True,
-            logger=True,
-            on_step=True,
-            on_epoch=True,
-            sync_dist=False,
-        )
-        # self.log("epoch/global_step", self.global_step.float(), prog_bar=True, logger=True, on_step=True, on_epoch=False)
-        """
-        if self.use_scheduler:
-            lr = self.optimizers().param_groups[0]['lr']
-            self.log('lr_abs', lr, prog_bar=True, logger=True, on_step=True, on_epoch=False, rank_zero_only=True)
-        """
-        if (batch_idx + 1) % self.log_every_t == 0:
-            mainlogger.info(
-                f"batch:{batch_idx}|epoch:{self.current_epoch} [globalstep:{self.global_step}]: loss={loss}"
-            )
-        return loss
-
-    def _get_denoise_row_from_list(self, samples, desc=""):
-        denoise_row = []
-        for zd in tqdm(samples, desc=desc):
-            denoise_row.append(self.decode_first_stage(zd.to(self.device)))
-        n_log_timesteps = len(denoise_row)
-
-        denoise_row = torch.stack(denoise_row)  # n_log_timesteps, b, C, H, W
-
-        if denoise_row.dim() == 5:
-            # img, num_imgs= n_log_timesteps * bs, grid_size=[bs,n_log_timesteps]
-            # batch:col, different samples,
-            # n:rows, different steps for one sample
-            denoise_grid = rearrange(denoise_row, "n b c h w -> b n c h w")
-            denoise_grid = rearrange(denoise_grid, "b n c h w -> (b n) c h w")
-            denoise_grid = make_grid(denoise_grid, nrow=n_log_timesteps)
-        elif denoise_row.dim() == 6:
-            # video, grid_size=[n_log_timesteps*bs, t]
-            video_length = denoise_row.shape[3]
-            denoise_grid = rearrange(denoise_row, "n b c t h w -> b n c t h w")
-            denoise_grid = rearrange(denoise_grid, "b n c t h w -> (b n) c t h w")
-            denoise_grid = rearrange(denoise_grid, "n c t h w -> (n t) c h w")
-            denoise_grid = make_grid(denoise_grid, nrow=video_length)
-        else:
-            raise ValueError
-
-        return denoise_grid
-
-    @torch.no_grad()
-    def log_images(
-        self,
-        batch,
-        sample=True,
-        ddim_steps=200,
-        ddim_eta=1.0,
-        plot_denoise_rows=False,
-        unconditional_guidance_scale=1.0,
-        **kwargs,
-    ):
-        """log images for LatentDiffusion"""
-        ## TBD: currently, classifier_free_guidance sampling is only supported by DDIM
-        use_ddim = ddim_steps is not None
-        log = dict()
-        z, c, x, xrec, xc = self.get_batch_input(
-            batch,
-            random_uncond=False,
-            return_first_stage_outputs=True,
-            return_original_cond=True,
-        )
-        N, _, T, H, W = x.shape
-        # TODO fix data type
-        log["inputs"] = x.to(torch.bfloat16)
-        log["reconst"] = xrec
-        log["condition"] = xc
-
-        if sample:
-            # get uncond embedding for classifier-free guidance sampling
-            if unconditional_guidance_scale != 1.0:
-                if isinstance(c, dict):
-                    if "y" in c:
-                        c_emb = c["y"]
-                        c_cat = None  # set default value is None
-                    else:
-                        c_cat, c_emb = c["c_concat"][0], c["c_crossattn"][0]
-                else:
-                    c_emb = c
-
-                # TODO fix data type
-                z = z.to(torch.bfloat16)
-                c_emb = c_emb.to(torch.bfloat16)
-
-                # get uc: unconditional condition for classifier-free guidance sampling
-                if self.uncond_type == "empty_seq":
-                    prompts = N * [""]
-                    uc = self.get_learned_conditioning(prompts)
-                elif self.uncond_type == "zero_embed":
-                    uc = torch.zeros_like(c_emb)
-                # make uc for hybrid condition case
-                if isinstance(c, dict) and c_cat is not None:
-                    uc = {"c_concat": [c_cat], "c_crossattn": [uc]}
-            else:
-                uc = None
-
-            with self.ema_scope("Plotting"):
-                samples, z_denoise_row = self.sample_log(
-                    cond=c,
-                    batch_size=N,
-                    ddim=use_ddim,
-                    ddim_steps=ddim_steps,
-                    eta=ddim_eta,
-                    unconditional_guidance_scale=unconditional_guidance_scale,
-                    unconditional_conditioning=uc,
-                    mask=self.cond_mask,
-                    x0=z,
-                    **kwargs,
-                )
-            x_samples = self.decode_first_stage(samples)
-            log["samples"] = x_samples
-
-            if plot_denoise_rows:
-                denoise_grid = self._get_denoise_row_from_list(z_denoise_row)
-                log["denoise_row"] = denoise_grid
-
-        return log
-
-    @torch.no_grad()
-    def p_sample_loop(
-        self,
-        cond,
-        shape,
-        return_intermediates=False,
-        x_T=None,
-        verbose=True,
-        callback=None,
-        timesteps=None,
-        mask=None,
-        x0=None,
-        img_callback=None,
-        start_T=None,
-        log_every_t=None,
-        **kwargs,
-    ):
-
-        if not log_every_t:
-            log_every_t = self.log_every_t
-        device = self.diffusion_scheduler.betas.device
-        b = shape[0]
-        # sample an initial noise
-        if x_T is None:
-            img = torch.randn(shape, device=device)
-        else:
-            img = x_T
-
-        intermediates = [img]
-        if timesteps is None:
-            timesteps = self.num_timesteps
-        if start_T is not None:
-            timesteps = min(timesteps, start_T)
-
-        iterator = (
-            tqdm(reversed(range(0, timesteps)), desc="Sampling t", total=timesteps)
-            if verbose
-            else reversed(range(0, timesteps))
-        )
-
-        if mask is not None:
-            assert x0 is not None
-            assert x0.shape[2:3] == mask.shape[2:3]  # spatial size has to match
-
-        for i in iterator:
-            ts = torch.full((b,), i, device=device, dtype=torch.long)
-            if self.diffusion_scheduler.shorten_cond_schedule:
-                assert self.model.conditioning_key != "hybrid"
-                tc = self.cond_ids[ts].to(cond.device)
-                cond = self.diffusion_scheduler.q_sample(
-                    x_start=cond, t=tc, noise=torch.randn_like(cond)
-                )
-
-            img = self.diffusion_scheduler.p_sample(
-                img, cond, ts, clip_denoised=self.clip_denoised, **kwargs
-            )
-            if mask is not None:
-                img_orig = self.diffusion_scheduler.q_sample(x0, ts)
-                img = img_orig * mask + (1.0 - mask) * img
-
-            if i % log_every_t == 0 or i == timesteps - 1:
-                intermediates.append(img)
-            if callback:
-                callback(i)
-            if img_callback:
-                img_callback(img, i)
-
-        if return_intermediates:
-            return img, intermediates
-        return img
-
-    @torch.no_grad()
-    def sample(
-        self,
-        cond,
-        batch_size=16,
-        return_intermediates=False,
-        x_T=None,
-        verbose=True,
-        timesteps=None,
-        mask=None,
-        x0=None,
-        shape=None,
-        decode=True,
-        **kwargs,
-    ):
-        if shape is None:
-            shape = (batch_size, self.channels, self.temporal_length, *self.image_size)
-
-        if cond is not None:
-            if isinstance(cond, dict):
-                cond = {
-                    key: (
-                        cond[key][:batch_size]
-                        if not isinstance(cond[key], list)
-                        else list(map(lambda x: x[:batch_size], cond[key]))
-                    )
-                    for key in cond
-                }
-            else:
-                cond = (
-                    [c[:batch_size] for c in cond]
-                    if isinstance(cond, list)
-                    else cond[:batch_size]
-                )
-
-        samples = self.p_sample_loop(
-            cond,
-            shape,
-            return_intermediates=return_intermediates,
-            x_T=x_T,
-            verbose=verbose,
-            timesteps=timesteps,
-            mask=mask,
-            x0=x0,
-            **kwargs,
-        )
-        if decode:
-            samples = self.decode_first_stage(samples)
-        return samples
-
-    @torch.no_grad()
-    def sample_log(self, cond, batch_size, ddim, ddim_steps, **kwargs):
-        if ddim:
-            ddim_sampler = DDIMSampler(self)
-            shape = (self.channels, self.temporal_length, *self.image_size)
-            # kwargs.update({"clean_cond": True})
-            samples, intermediates = ddim_sampler.sample(
-                ddim_steps, batch_size, shape, cond, verbose=False, **kwargs
-            )
-
-        else:
-            samples, intermediates = self.sample(
-                cond=cond, batch_size=batch_size, return_intermediates=True, **kwargs
-            )
-
-        return samples, intermediates
-
-    def configure_optimizers(self):
-        """configure_optimizers for LatentDiffusion"""
-        lr = self.learning_rate
-        if self.empty_params_only and hasattr(self, "empty_paras"):
-            params = [
-                p for n, p in self.model.named_parameters() if n in self.empty_paras
-            ]
-            print("self.empty_paras", len(self.empty_paras))
-            for n, p in self.model.named_parameters():
-                if n not in self.empty_paras:
-                    p.requires_grad = False
-            mainlogger.info(f"@Training [{len(params)}] Empty Paramters ONLY.")
-        elif len(self.lora_args) > 0:
-            # if there is lora_args, but haven't injected lora, it would also work.
-            # but the trainable params will be significantly more than the lora_params
-            # filter out the non lora parameters.
-            params = [
-                p
-                for n, p in self.model.named_parameters()
-                if p.requires_grad and "lora" in n
-            ]
-            mainlogger.info(f"@Training [{len(params)}] Lora Paramters.")
-        else:
-            params = list(self.model.parameters())
-            mainlogger.info(f"@Training [{len(params)}] Full Paramters.")
-
-        if self.diffusion_scheduler.learn_logvar:
-            mainlogger.info("Diffusion model optimizing logvar")
-            if isinstance(params[0], dict):
-                params.append({"params": [self.logvar]})
-            else:
-                params.append(self.logvar)
-
-        ## optimizer
-        optimizer = torch.optim.AdamW(params, lr=lr)
-        ## lr scheduler
-        if self.use_scheduler:
-            mainlogger.info("Setting up LambdaLR scheduler...")
-            lr_scheduler = self.configure_schedulers(optimizer)
-            return [optimizer], [lr_scheduler]
-
-        return optimizer
-
-    def configure_schedulers(self, optimizer):
-        assert "target" in self.scheduler_config
-        scheduler_name = self.scheduler_config.target.split(".")[-1]
-        interval = self.scheduler_config.interval
-        frequency = self.scheduler_config.frequency
-        if scheduler_name == "LambdaLRScheduler":
-            scheduler = instantiate_from_config(self.scheduler_config)
-            scheduler.start_step = self.global_step
-            lr_scheduler = {
-                "scheduler": LambdaLR(optimizer, lr_lambda=scheduler.schedule),
-                "interval": interval,
-                "frequency": frequency,
-            }
-        elif scheduler_name == "CosineAnnealingLRScheduler":
-            scheduler = instantiate_from_config(self.scheduler_config)
-            decay_steps = scheduler.decay_steps
-            last_step = -1 if self.global_step == 0 else scheduler.start_step
-            lr_scheduler = {
-                "scheduler": CosineAnnealingLR(
-                    optimizer, T_max=decay_steps, last_epoch=last_step
-                ),
-                "interval": interval,
-                "frequency": frequency,
-            }
-        else:
-            raise NotImplementedError
-        return lr_scheduler
-
-
-class RewardLVDMTrainer(LVDMFlow):
-    def __init__(self, *args, **kwargs):
-        super().__init__(*args, **kwargs)
-        # Reward Gradient Training Using LoRA as a default
-        # TODO: use config and getattr to set default values
-        # sampling configs for DDIM
-        self.ddim_eta = 1.0
-        self.ddim_steps = 20  # reduce some steps to speed up sampling process
-        self.n_samples = 1
-        self.fps = 24  # default 24 following VADER
-        # rlhf configs
-        self.backprop_mode = (
-            "last"  # m"backpropagation mode: 'last', 'rand', 'specific'"
-        )
-        self.decode_frame = (
-            -1
-        )  # it could also be any number str like '3', '10'. alt: alternate frames, fml: first, middle, last frames, all: all frames. '-1': random frame
-        self.reward_loss_type = "aesthetic"
-        # self.configure_reward_loss()
-
-    def configure_reward_loss(self, loss_type=None):
-        if loss_type is None:
-            loss_type = self.reward_loss_type
-
-        if loss_type == "aesthetic":
-            self.loss_fn = aesthetic_loss_fn(
-                grad_scale=0.1,
-                aesthetic_target=10,
-                torch_dtype=self.model.dtype,
-                device=self.device,
-            )
-        else:
-            raise NotImplementedError(f"loss type {loss_type} not implemented")
-
-    def on_train_batch_start(self, batch, batch_idx, dataloader_idx=None):
-        # the reason why configure here is to wait the model transfered to target device
-        # otherwise, the loss_fn will be on cpu if configure in __init__
-        self.configure_reward_loss()
-
-    def training_step(self, batch, batch_idx):
-        """training_step for Reward Model Feedback"""
-        # in reward model training, we just need shape of video frames
-        # default cond is text prompt
-        prompts = batch[self.cond_stage_key]
-        # print(prompts) #  Elon mask is talking
-        x, c = self.get_batch_input(
-            batch, random_uncond=self.classifier_free_guidance, is_imgbatch=False
-        )  # x is latent image ; c is text embedding(tensor)
-        kwargs = {}
-        batch_size = x.shape[0]
-        noise_shape = (
-            batch_size,
-            self.channels,
-            self.temporal_length // 4,
-            *self.image_size,
-        )  # (1, 4, 4, 40, 64)
-        # print("noise shape",noise_shape)
-        fps = torch.tensor([self.fps] * batch_size).to(self.device).long()
-        cond = {"c_crossattn": [c], "fps": fps}
-        # Notice: VADER has modified ddim for training
-        # input cond = {"c_crossattn": [text_emb], "fps": fps}
-        batch_samples = batch_ddim_sampling(
-            self,
-            cond,
-            noise_shape,
-            self.n_samples,
-            self.ddim_steps,
-            self.ddim_eta,
-            self.classifier_free_guidance,
-            None,
-            backprop_mode=self.backprop_mode,
-            decode_frame=self.decode_frame,
-            **kwargs,
-        )
-
-        video_frames_ = batch_samples.permute(
-            1, 0, 3, 2, 4, 5
-        )  # batch,samples,channels,frames,height,width >> s,b,f,c,h,w
-        # print("video_frames shape",batch_samples.shape,video_frames_.requires_grad)
-        s_, bs, nf, c_, h_, w_ = video_frames_.shape
-        assert s_ == 1  # samples should only be on single sample in training mode
-        video_frames_ = video_frames_.squeeze(0)  # s,b,f,c,h,w >> b,f,c,h,w
-        assert nf == 1  # reward should only be on single frame
-        video_frames_ = video_frames_.squeeze(1)  # b,f,c,h,w >> b,c,h,w
-        video_frames_ = video_frames_.to(x.dtype)
-
-        # some reward fn may require prompts as input.
-        loss, rewards = self.loss_fn(video_frames_)  # rewards is for logging only.
-        loss_dict = {
-            "reward_train_loss": loss.detach().item(),
-            "step_reward": rewards.detach().item(),
-        }
-        self.log_dict(
-            loss_dict,
-            prog_bar=True,
-            logger=True,
-            on_step=True,
-            on_epoch=True,
-            sync_dist=False,
-        )
-        self.log(
-            "global_step",
-            self.global_step,
-            prog_bar=True,
-            logger=True,
-            on_step=True,
-            on_epoch=False,
-        )
-        if (batch_idx + 1) % self.log_every_t == 0:
-            mainlogger.info(
-                f"batch:{batch_idx}|epoch:{self.current_epoch} [globalstep:{self.global_step}]: loss={loss} reward={rewards}"
-            )
-        return loss
-
-    def configure_optimizers(self):
-        lr = self.learning_rate
-        opt = torch.optim.AdamW(self.model.parameters(), lr=lr)
-        # the training steps count is short no need for scheduler as default
-        if self.use_scheduler:
-            lr_scheduler = self.configure_schedulers(opt)
-            return [opt], [lr_scheduler]
-        return opt
-
-
-class LatentVisualDiffusionFlow(LVDMFlow):
-    def __init__(
-        self,
-        img_cond_stage_config,
-        finegrained=False,  # vc1-i2v
-        image_proj_stage_config=None,  # dc
-        freeze_embedder=True,  # dc
-        image_proj_model_trainable=True,  # dc
-        *args,
-        **kwargs,
-    ):
-        super().__init__(*args, **kwargs)
-        self.image_proj_model_trainable = image_proj_model_trainable
-        self._init_embedder(img_cond_stage_config, freeze_embedder)
-        if image_proj_stage_config is not None:
-            self._init_img_ctx_projector(
-                image_proj_stage_config, image_proj_model_trainable
-            )
-        else:
-            self.init_projector(
-                use_finegrained=finegrained,
-                num_tokens=16 if finegrained else 4,
-                input_dim=1024,
-                cross_attention_dim=1024,
-                dim=1280,
-            )
-
-    def _init_img_ctx_projector(self, config, trainable):
-        self.image_proj_model = instantiate_from_config(config)
-        if not trainable:
-            self.image_proj_model.eval()
-            self.image_proj_model.train = disabled_train
-            for param in self.image_proj_model.parameters():
-                param.requires_grad = False
-
-    def _init_embedder(self, config, freeze=True):
-        embedder = instantiate_from_config(config)
-        if freeze:
-            self.embedder = embedder.eval()
-            self.embedder.train = disabled_train
-            for param in self.embedder.parameters():
-                param.requires_grad = False
-
-    def init_projector(
-        self, use_finegrained, num_tokens, input_dim, cross_attention_dim, dim
-    ):
-        if not use_finegrained:
-            image_proj_model = ImageProjModel(
-                clip_extra_context_tokens=num_tokens,
-                cross_attention_dim=cross_attention_dim,
-                clip_embeddings_dim=input_dim,
-            )
-        else:
-            image_proj_model = Resampler(
-                dim=input_dim,
-                depth=4,
-                dim_head=64,
-                heads=12,
-                num_queries=num_tokens,
-                embedding_dim=dim,
-                output_dim=cross_attention_dim,
-                ff_mult=4,
-            )
-        self.image_proj_model = image_proj_model
-
-    ## Never delete this func: it is used in log_images() and inference stage
-    def get_image_embeds(self, batch_imgs):
-        ## img: b c h w
-        img_token = self.embedder(batch_imgs)
-        img_emb = self.image_proj_model(img_token)
-        return img_emb
-
-    def shared_step(self, batch, random_uncond, **kwargs):
-        x, c, fs = self.get_batch_input(
-            batch, random_uncond=random_uncond, return_fs=True
-        )
-        kwargs.update({"fs": fs.long()})
-        loss, loss_dict = self(x, c, **kwargs)
-        return loss, loss_dict
-
-    def get_batch_input(
-        self,
-        batch,
-        random_uncond,
-        return_first_stage_outputs=False,
-        return_original_cond=False,
-        return_fs=False,
-        return_cond_frame=False,
-        return_original_input=False,
-        **kwargs,
-    ):
-        ## x: b c t h w
-        x = super().get_input(batch, self.first_stage_key)
-        ## encode video frames x to z via a 2D encoder
-        z = self.encode_first_stage(x)
-
-        ## get caption condition
-        cond_input = batch[self.cond_stage_key]
-
-        if isinstance(cond_input, dict) or isinstance(cond_input, list):
-            cond_emb = self.get_learned_conditioning(cond_input)
-        else:
-            cond_emb = self.get_learned_conditioning(cond_input.to(self.device))
-
-        cond = {}
-        ## to support classifier-free guidance, randomly drop out only text conditioning 5%, only image conditioning 5%, and both 5%.
-        if random_uncond:
-            random_num = torch.rand(x.size(0), device=x.device)
-        else:
-            random_num = torch.ones(
-                x.size(0), device=x.device
-            )  ## by doning so, we can get text embedding and complete img emb for inference
-        prompt_mask = rearrange(random_num < 2 * self.uncond_prob, "n -> n 1 1")
-        input_mask = 1 - rearrange(
-            (random_num >= self.uncond_prob).float()
-            * (random_num < 3 * self.uncond_prob).float(),
-            "n -> n 1 1 1",
-        )
-
-        null_prompt = self.get_learned_conditioning([""])
-        prompt_imb = torch.where(prompt_mask, null_prompt, cond_emb.detach())
-
-        ## get conditioning frame
-        cond_frame_index = 0
-        if self.rand_cond_frame:
-            cond_frame_index = random.randint(
-                0, self.model.diffusion_model.temporal_length - 1
-            )
-
-        img = x[:, :, cond_frame_index, ...]
-        img = input_mask * img
-        ## img: b c h w
-        img_emb = self.embedder(img)  ## b l c
-        img_emb = self.image_proj_model(img_emb)
-
-        if self.model.conditioning_key == "hybrid":
-            if self.interp_mode:
-                ## starting frame + (L-2 empty frames) + ending frame
-                img_cat_cond = torch.zeros_like(z)
-                img_cat_cond[:, :, 0, :, :] = z[:, :, 0, :, :]
-                img_cat_cond[:, :, -1, :, :] = z[:, :, -1, :, :]
-            else:
-                ## simply repeat the cond_frame to match the seq_len of z
-                img_cat_cond = z[:, :, cond_frame_index, :, :]
-                img_cat_cond = img_cat_cond.unsqueeze(2)
-                img_cat_cond = repeat(
-                    img_cat_cond, "b c t h w -> b c (repeat t) h w", repeat=z.shape[2]
-                )
-            cond["c_concat"] = [img_cat_cond]  # b c t h w
-        cond["c_crossattn"] = [
-            torch.cat([prompt_imb, img_emb], dim=1)
-        ]  ## concat in the seq_len dim
-
-        out = [z, cond]
-        if return_first_stage_outputs:
-            xrec = self.decode_first_stage(z)
-            out.extend([xrec])
-
-        if return_original_cond:
-            out.append(cond_input)
-        if return_fs:
-            if self.fps_condition_type == "fs":
-                fs = super().get_input(batch, "frame_stride")
-            elif self.fps_condition_type == "fps":
-                fs = super().get_input(batch, "fps")
-            out.append(fs)
-        if return_cond_frame:
-            out.append(x[:, :, cond_frame_index, ...].unsqueeze(2))
-        if return_original_input:
-            out.append(x)
-
-        return out
-
-    @torch.no_grad()
-    def log_images(
-        self,
-        batch,
-        sample=True,
-        ddim_steps=50,
-        ddim_eta=1.0,
-        plot_denoise_rows=False,
-        unconditional_guidance_scale=1.0,
-        mask=None,
-        **kwargs,
-    ):
-        """log images for LatentVisualDiffusion"""
-        ##### sampled_img_num: control sampled imgae for logging, larger value may cause OOM
-        sampled_img_num = 1
-        for key in batch.keys():
-            batch[key] = batch[key][:sampled_img_num]
-
-        ## TBD: currently, classifier_free_guidance sampling is only supported by DDIM
-        use_ddim = ddim_steps is not None
-        log = dict()
-
-        z, c, xrec, xc, fs, cond_x = self.get_batch_input(
-            batch,
-            random_uncond=False,
-            return_first_stage_outputs=True,
-            return_original_cond=True,
-            return_fs=True,
-            return_cond_frame=True,
-        )
-
-        N = xrec.shape[0]
-        log["image_condition"] = cond_x
-        log["reconst"] = xrec
-        xc_with_fs = []
-        for idx, content in enumerate(xc):
-            xc_with_fs.append(content + "_fs=" + str(fs[idx].item()))
-        log["condition"] = xc_with_fs
-        kwargs.update({"fs": fs.long()})
-
-        c_cat = None
-        if sample:
-            # get uncond embedding for classifier-free guidance sampling
-            if unconditional_guidance_scale != 1.0:
-                if isinstance(c, dict):
-                    c_emb = c["c_crossattn"][0]
-                    if "c_concat" in c.keys():
-                        c_cat = c["c_concat"][0]
-                else:
-                    c_emb = c
-
-                if self.uncond_type == "empty_seq":
-                    prompts = N * [""]
-                    uc_prompt = self.get_learned_conditioning(prompts)
-                elif self.uncond_type == "zero_embed":
-                    uc_prompt = torch.zeros_like(c_emb)
-
-                img = torch.zeros_like(xrec[:, :, 0])  ## b c h w
-                ## img: b c h w
-                img_emb = self.embedder(img)  ## b l c
-                uc_img = self.image_proj_model(img_emb)
-
-                uc = torch.cat([uc_prompt, uc_img], dim=1)
-                ## hybrid case
-                if isinstance(c, dict):
-                    uc_hybrid = {"c_concat": [c_cat], "c_crossattn": [uc]}
-                    uc = uc_hybrid
-            else:
-                uc = None
-
-            with self.ema_scope("Plotting"):
-                samples, z_denoise_row = self.sample_log(
-                    cond=c,
-                    batch_size=N,
-                    ddim=use_ddim,
-                    ddim_steps=ddim_steps,
-                    eta=ddim_eta,
-                    unconditional_guidance_scale=unconditional_guidance_scale,
-                    unconditional_conditioning=uc,
-                    x0=z,
-                    **kwargs,
-                )
-            x_samples = self.decode_first_stage(samples)
-            log["samples"] = x_samples
-
-            if plot_denoise_rows:
-                denoise_grid = self._get_denoise_row_from_list(z_denoise_row)
-                log["denoise_row"] = denoise_grid
-
-        return log
-
-    def configure_optimizers(self):
-        """configure_optimizers for LatentDiffusion"""
-        lr = self.learning_rate
-
-        params = list(self.model.parameters())
-        mainlogger.info(f"@Training [{len(params)}] Full Paramters.")
-
-        if self.cond_stage_trainable:
-            params_cond_stage = [
-                p for p in self.cond_stage_model.parameters() if p.requires_grad == True
-            ]
-            mainlogger.info(
-                f"@Training [{len(params_cond_stage)}] Paramters for Cond_stage_model."
-            )
-            params.extend(params_cond_stage)
-
-        if self.image_proj_model_trainable:
-            mainlogger.info(
-                f"@Training [{len(list(self.image_proj_model.parameters()))}] Paramters for Image_proj_model."
-            )
-            params.extend(list(self.image_proj_model.parameters()))
-
-        if self.diffusion_scheduler.learn_logvar:
-            mainlogger.info("Diffusion model optimizing logvar")
-            if isinstance(params[0], dict):
-                params.append({"params": [self.logvar]})
-            else:
-                params.append(self.logvar)
-
-        ## optimizer
-        optimizer = torch.optim.AdamW(params, lr=lr)
-
-        ## lr scheduler
-        if self.use_scheduler:
-            mainlogger.info("Setting up scheduler...")
-            lr_scheduler = self.configure_schedulers(optimizer)
-            return [optimizer], [lr_scheduler]
-
-        return optimizer
-
-class DiffusionWrapper(pl.LightningModule):
-    def __init__(self, diff_model_config, conditioning_key):
-        super().__init__()
-        
-        if isinstance(diff_model_config, dict) or isinstance(diff_model_config, DictConfig):
-            self.diffusion_model = instantiate_from_config(diff_model_config)
-        elif isinstance(diff_model_config, nn.Module):
-            self.diffusion_model = diff_model_config
-        else:
-            raise ValueError("diff_model_config should be a dict or a nn.Module")
-        
-        self.conditioning_key = conditioning_key
-
-    def forward(
-        self,
-        x,
-        t,
-        c_concat: list = None,
-        c_crossattn: list = None,
-        c_crossattn_stdit: list = None,
-        mask: list = None,
-        c_adm=None,
-        s=None,
-        **kwargs,
-    ):
-        if self.conditioning_key is None:
-            out = self.diffusion_model(x, t)
-        elif self.conditioning_key == "concat":
-            xc = torch.cat([x] + c_concat, dim=1)
-            out = self.diffusion_model(xc, t, **kwargs)
-        elif self.conditioning_key == "crossattn":
-            cc = torch.cat(c_crossattn, 1)
-            out = self.diffusion_model(x, t, context=cc, **kwargs)
-        elif self.conditioning_key == "crossattn_stdit":
-            cc = torch.cat(c_crossattn_stdit, 1)  # [b, 77, 1024]
-            mask = torch.cat(mask, 1)
-            # TODO fix precision
-            # if self.precision is not None and self.precision == 'bf16':
-            # print('Convert datatype')
-            cc = cc.to(torch.bfloat16)
-            self.diffusion_model = self.diffusion_model.to(torch.bfloat16)
-
-            out = self.diffusion_model(x, t, y=cc, mask=mask)
-            # def forward(self, x, timestep, y, mask=None, x_mask=None, fps=None, height=None, width=None, **kwargs):
-        elif self.conditioning_key == "hybrid":
-            ## it is just right [b,c,t,h,w]: concatenate in channel dim
-            xc = torch.cat([x] + c_concat, dim=1)
-            cc = torch.cat(c_crossattn, 1)
-            out = self.diffusion_model(xc, t, context=cc, **kwargs)
-        elif self.conditioning_key == "resblockcond":
-            cc = c_crossattn[0]
-            out = self.diffusion_model(x, t, context=cc)
-        elif self.conditioning_key == "adm":
-            cc = c_crossattn[0]
-            out = self.diffusion_model(x, t, y=cc)
-        elif self.conditioning_key == "hybrid-adm":
-            assert c_adm is not None
-            xc = torch.cat([x] + c_concat, dim=1)
-            cc = torch.cat(c_crossattn, 1)
-            out = self.diffusion_model(xc, t, context=cc, y=c_adm, **kwargs)
-        elif self.conditioning_key == "hybrid-time":
-            assert s is not None
-            xc = torch.cat([x] + c_concat, dim=1)
-            cc = torch.cat(c_crossattn, 1)
-            out = self.diffusion_model(xc, t, context=cc, s=s)
-        elif self.conditioning_key == "concat-time-mask":
-            # assert s is not None
-            xc = torch.cat([x] + c_concat, dim=1)
-            out = self.diffusion_model(xc, t, context=None, s=s, mask=mask)
-        elif self.conditioning_key == "concat-adm-mask":
-            # assert s is not None
-            if c_concat is not None:
-                xc = torch.cat([x] + c_concat, dim=1)
-            else:
-                xc = x
-            out = self.diffusion_model(xc, t, context=None, y=s, mask=mask)
-        elif self.conditioning_key == "hybrid-adm-mask":
-            cc = torch.cat(c_crossattn, 1)
-            if c_concat is not None:
-                xc = torch.cat([x] + c_concat, dim=1)
-            else:
-                xc = x
-            out = self.diffusion_model(xc, t, context=cc, y=s, mask=mask)
-        elif (
-            self.conditioning_key == "hybrid-time-adm"
-        ):  # adm means y, e.g., class index
-            # assert s is not None
-            assert c_adm is not None
-            xc = torch.cat([x] + c_concat, dim=1)
-            cc = torch.cat(c_crossattn, 1)
-            out = self.diffusion_model(xc, t, context=cc, s=s, y=c_adm)
-        elif self.conditioning_key == "crossattn-adm":
-            assert c_adm is not None
-            cc = torch.cat(c_crossattn, 1)
-            out = self.diffusion_model(x, t, context=cc, y=c_adm)
-        else:
-            raise NotImplementedError()
-
-        return out
diff --git a/videotuna/models/lvdm/models/rlhf_utils/actpred_scorer.py b/videotuna/models/lvdm/models/rlhf_utils/actpred_scorer.py
deleted file mode 100644
index da5872cb..00000000
--- a/videotuna/models/lvdm/models/rlhf_utils/actpred_scorer.py
+++ /dev/null
@@ -1,100 +0,0 @@
-import numpy as np
-import torch
-from transformers import VideoMAEFeatureExtractor, VideoMAEForVideoClassification
-
-
-class ActPredScorer(torch.nn.Module):
-
-    def __init__(
-        self,
-        model_name="MCG-NJU/videomae-base-finetuned-kinetics",
-        num_frames=16,
-        device="cuda",
-        dtype=torch.float32,
-    ):
-        super().__init__()
-        self.model = VideoMAEForVideoClassification.from_pretrained(
-            model_name, num_frames=num_frames, torch_dtype=dtype
-        )
-        self.feature_extractor = VideoMAEFeatureExtractor.from_pretrained(model_name)
-        self.device = device
-        self.model.to(device)
-
-    def get_target_class_idx(self, target_action):
-        def mapping_func(x):
-            if "piano" in x:
-                return "playing piano"
-            if "guitar" in x:
-                return "playing guitar"
-            if "doughnuts" in x:
-                return "eating doughnuts"
-            if "beer" in x:
-                return "drinking beer"
-            if "badminton" in x:
-                return "playing badminton"
-            if "cello" in x:
-                return "playing cello"
-            if "scooter" in x:
-                return "riding scooter"
-            if "ballet" in x:
-                return "dancing ballet"
-            if "pancake" in x:
-                return "flipping pancake"
-            if "violin" in x:
-                return "playing violin"
-            if "wood" in x:
-                return "chopping wood"
-            if "watermelon" in x:
-                return "eating watermelon"
-            if "jogging" in x:
-                return "jogging"
-            else:
-                print(
-                    f"Please add your action mapping to ActPredScorer. Mapping not found for {x}"
-                )
-                raise NotImplementedError
-
-        try:
-            target_class_idx = self.model.config.label2id[target_action]
-        except:
-            target_class_idx = self.model.config.label2id[mapping_func(target_action)]
-        return target_class_idx
-
-    def get_loss_and_score(self, norm_vid, target_action):
-        """video should be a torch array of dtype float, with values from 0-1, of dimension (num_frames, height, width, 3)"""
-
-        target_class_idx = self.get_target_class_idx(target_action)
-        outputs = self.model(
-            norm_vid, labels=torch.tensor([target_class_idx]).to(self.device)
-        )
-        loss = outputs.loss
-        logits = outputs.logits
-
-        norm_logits = torch.exp(logits) / (torch.exp(logits).sum())
-        norm_logits = norm_logits.squeeze()
-
-        score = norm_logits[target_class_idx]
-        return loss, score, self.get_pred_class(logits)
-
-    def get_pred_class(self, logits):
-        predicted_class_idx = logits.argmax(-1).item()
-        return self.model.config.id2label[predicted_class_idx]
-
-
-def gen_rand_labels_file(labels_list, out_file, num_labels=50):
-    idxs = np.random.choice(len(labels_list), num_labels, replace=False)
-    rand_labels = [labels_list[i] for i in idxs]
-    rand_labels.sort()
-    with open(out_file, "w") as f:
-        for line in rand_labels:
-            f.write(f"{line}\n")
-
-
-if __name__ == "__main__":
-    # import numpy as np
-    # scorer = ActPredScorer(num_frames = 7)
-    # video_torch = [torch.randn((3,256,256)).clamp(0,1) for _ in range(7)]
-    # encoding = scorer.feature_extractor(video_torch,  do_rescale = False, return_tensors="pt")
-    # print(scorer.get_loss_and_score(video_torch))
-    scorer = ActPredScorer(num_frames=7)
-    labels = scorer.model.config.id2label
diff --git a/videotuna/models/lvdm/models/rlhf_utils/aesthetic_scorer.py b/videotuna/models/lvdm/models/rlhf_utils/aesthetic_scorer.py
deleted file mode 100644
index 271f1e89..00000000
--- a/videotuna/models/lvdm/models/rlhf_utils/aesthetic_scorer.py
+++ /dev/null
@@ -1,99 +0,0 @@
-# Based on https://github.com/christophschuhmann/improved-aesthetic-predictor/blob/fe88a163f4661b4ddabba0751ff645e2e620746e/simple_inference.py
-# import ipdb
-# st = ipdb.set_trace
-# from importlib_resources import files
-import os
-
-import numpy as np
-import torch
-import torch.nn as nn
-from PIL import Image
-from transformers import CLIPModel, CLIPProcessor
-
-# ASSETS_PATH = files("lvdm.models.rlhf_utils.pretrained_reward_models")
-ASSETS_PATH = "videotuna/models/lvdm/models/rlhf_utils/pretrained_reward_models"
-
-
-class MLPDiff(nn.Module):
-    def __init__(self):
-        super().__init__()
-        self.layers = nn.Sequential(
-            nn.Linear(768, 1024),
-            nn.Dropout(0.2),
-            nn.Linear(1024, 128),
-            nn.Dropout(0.2),
-            nn.Linear(128, 64),
-            nn.Dropout(0.1),
-            nn.Linear(64, 16),
-            nn.Linear(16, 1),
-        )
-
-    def forward(self, embed):
-        return self.layers(embed)
-
-
-class AestheticScorerDiff(torch.nn.Module):
-    def __init__(self, dtype):
-        super().__init__()
-        self.clip = CLIPModel.from_pretrained("openai/clip-vit-large-patch14")
-        self.mlp = MLPDiff()
-        state_dict = torch.load(
-            os.path.join(ASSETS_PATH, "sac+logos+ava1-l14-linearMSE.pth")
-        )
-        self.mlp.load_state_dict(state_dict)
-        self.dtype = dtype
-        self.eval()
-
-    def __call__(self, images):
-        # device = next(self.parameters()).device
-        # print("AestheticScorerDiff",device)
-        embed = self.clip.get_image_features(pixel_values=images)
-        embed = embed / torch.linalg.vector_norm(embed, dim=-1, keepdim=True)
-        return self.mlp(embed).squeeze(1)
-
-    def eval_video(self, video_path):
-        # read a video and return the aesthetic score
-        cap = cv2.VideoCapture(video_path)
-        frames = []
-        while True:
-            ret, frame = cap.read()
-            if not ret:
-                break
-            frame = cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)
-            frame = cv2.resize(frame, target_size)
-            frame = frame / 255.0
-            frames.append(frame)
-        # im_pix = ((im_pix_un / 2) + 0.5).clamp(0, 1)
-        # im_pix = torchvision.transforms.Resize(target_size)(im_pix)
-        # im_pix = normalize(im_pix).to(im_pix_un.dtype)
-        frames = np.array(frames)
-        frames = torch.tensor(frames, dtype=torch.float32).permute(0, 3, 1, 2)
-        scores = self(frames)
-        return scores.mean().item()
-
-    def eval_video_folder(self, video_folder):
-        # read a folder of videos and return the aesthetic scores
-
-        files = os.listdir(video_folder)
-        scores = []
-        for file in files:
-            if file.endswith(".mp4"):
-                score = self.eval_video(video_folder + "/" + file)
-                scores.append(score)
-
-
-## the main function is a aesthetic scorer that takes in a video folder and returns the aesthetic scores
-if __name__ == "__main__":
-    target_size = (224, 224)
-    normalize = torchvision.transforms.Normalize(
-        mean=[0.48145466, 0.4578275, 0.40821073],
-        std=[0.26862954, 0.26130258, 0.27577711],
-    )
-    scorer = AestheticScorerDiff(dtype=torch_dtype).to(device, dtype=torch_dtype)
-    scorer.requires_grad_(False)
-    video_folder = (
-        "path to video /rlhf-visual-results/lora_aes_chatgpt_instructions-3184"
-    )
-    scores = scorer.eval_video_folder(video_folder)
-    print(type(scores), type(scores[0]))
-    print(scores, np.mean(scores))
diff --git a/videotuna/models/lvdm/models/rlhf_utils/batch_ddim.py b/videotuna/models/lvdm/models/rlhf_utils/batch_ddim.py
deleted file mode 100644
index 007a11db..00000000
--- a/videotuna/models/lvdm/models/rlhf_utils/batch_ddim.py
+++ /dev/null
@@ -1,266 +0,0 @@
-# Adapted from VideoCrafter: https://github.com/AILab-CVC/VideoCrafter
-# Second Apdated from VADER: https://github.com/mihirp1998/VADER
-import glob
-import os
-import random
-import sys
-from collections import OrderedDict
-
-import cv2
-import numpy as np
-import torch
-import torchvision
-from decord import VideoReader, cpu
-
-sys.path.append("videotuna")
-from models.lvdm.models.rlhf_utils.rl_ddim import DDIMSampler
-
-# import ipdb
-# st = ipdb.set_trace
-
-
-def batch_ddim_sampling(
-    model,
-    cond,
-    noise_shape,
-    n_samples=1,
-    ddim_steps=50,
-    ddim_eta=1.0,
-    cfg_scale=1.0,
-    temporal_cfg_scale=None,
-    backprop_mode=None,
-    decode_frame="-1",
-    **kwargs,
-):
-    ddim_sampler = DDIMSampler(model)
-    if (
-        backprop_mode is not None
-    ):  # it is for training now, backprop_mode != None also means vader training mode
-        ddim_sampler.backprop_mode = backprop_mode
-        ddim_sampler.training_mode = True
-    uncond_type = model.uncond_type
-    batch_size = noise_shape[0]
-
-    ## construct unconditional guidance
-    if cfg_scale != 1.0:
-        if uncond_type == "empty_seq":
-            prompts = batch_size * [""]
-
-            uc_emb = model.get_learned_conditioning(prompts)
-        elif uncond_type == "zero_embed":
-            c_emb = cond["c_crossattn"][0] if isinstance(cond, dict) else cond
-            uc_emb = torch.zeros_like(c_emb)
-
-        ## process image embedding token
-        if hasattr(model, "embedder"):
-            uc_img = torch.zeros(noise_shape[0], 3, 224, 224).to(model.device)
-            ## img: b c h w >> b l c
-            uc_img = model.get_image_embeds(uc_img)
-            uc_emb = torch.cat([uc_emb, uc_img], dim=1)
-
-        if isinstance(cond, dict):
-            uc = {key: cond[key] for key in cond.keys()}
-            uc.update({"c_crossattn": [uc_emb]})
-        else:
-            uc = uc_emb
-    else:
-        uc = None
-
-    x_T = None
-    batch_variants = []
-
-    for _ in range(n_samples):
-        if ddim_sampler is not None:
-            kwargs.update({"clean_cond": True})
-            samples, _ = ddim_sampler.sample(
-                S=ddim_steps,  # samples: batch, c, t, h, w
-                conditioning=cond,
-                batch_size=noise_shape[0],
-                shape=noise_shape[1:],
-                verbose=False,
-                unconditional_guidance_scale=cfg_scale,
-                unconditional_conditioning=uc,
-                eta=ddim_eta,
-                temporal_length=noise_shape[2],
-                conditional_guidance_scale_temporal=temporal_cfg_scale,
-                x_T=x_T,
-                **kwargs,
-            )
-
-        ## reconstruct from latent to pixel space
-        if (
-            backprop_mode is not None
-        ):  # it is for training now. Use one frame randomly to save memory
-            # print('>>> backprop_mode:', backprop_mode)
-            try:
-                decode_frame = int(decode_frame)
-                # it's a int
-            except:
-                pass
-            # modified by haoyu , here we need to distinguish trainable and non-trainable decode.
-            if type(decode_frame) == int:
-                frame_index = (
-                    random.randint(0, samples.shape[2] - 1)
-                    if decode_frame == -1
-                    else decode_frame
-                )  # samples: batch, c, t, h, w
-                batch_images = model.differentiable_decode_first_stage(
-                    samples[:, :, frame_index : frame_index + 1, :, :]
-                )
-            elif decode_frame in ["alt", "all"]:
-                idxs = (
-                    range(0, samples.shape[2], 2)
-                    if decode_frame == "alt"
-                    else range(samples.shape[2])
-                )
-                batch_images = model.differentiable_decode_first_stage(
-                    samples[:, :, idxs, :, :]
-                )
-
-        else:  # inference mode
-            batch_images = model.decode_first_stage_2DAE(samples)
-        batch_variants.append(batch_images)
-
-    ## batch, <samples>, c, t, h, w
-    batch_variants = torch.stack(batch_variants, dim=1)
-    return batch_variants
-
-
-def get_filelist(data_dir, ext="*"):
-    file_list = glob.glob(os.path.join(data_dir, "*.%s" % ext))
-    file_list.sort()
-    return file_list
-
-
-def get_dirlist(path):
-    list = []
-    if os.path.exists(path):
-        files = os.listdir(path)
-        for file in files:
-            m = os.path.join(path, file)
-            if os.path.isdir(m):
-                list.append(m)
-    list.sort()
-    return list
-
-
-def load_model_checkpoint(model, ckpt):
-    def load_checkpoint(model, ckpt, full_strict):
-        state_dict = torch.load(ckpt, map_location="cpu")
-        try:
-            ## deepspeed
-            new_pl_sd = OrderedDict()
-            for key in state_dict["module"].keys():
-                new_pl_sd[key[16:]] = state_dict["module"][key]
-            model.load_state_dict(new_pl_sd, strict=full_strict)
-        except:
-            if "state_dict" in list(state_dict.keys()):
-                state_dict = state_dict["state_dict"]
-            model.load_state_dict(state_dict, strict=full_strict)
-        return model
-
-    load_checkpoint(model, ckpt, full_strict=True)
-    print(">>> model checkpoint loaded.")
-    return model
-
-
-def load_video_batch(
-    filepath_list, frame_stride, video_size=(256, 256), video_frames=16
-):
-    """
-    Notice about some special cases:
-    1. video_frames=-1 means to take all the frames (with fs=1)
-    2. when the total video frames is less than required, padding strategy will be used (repreated last frame)
-    """
-    fps_list = []
-    batch_tensor = []
-    assert frame_stride > 0, "valid frame stride should be a positive interge!"
-    for filepath in filepath_list:
-        padding_num = 0
-        vidreader = VideoReader(
-            filepath, ctx=cpu(0), width=video_size[1], height=video_size[0]
-        )
-        fps = vidreader.get_avg_fps()
-        total_frames = len(vidreader)
-        max_valid_frames = (total_frames - 1) // frame_stride + 1
-        if video_frames < 0:
-            ## all frames are collected: fs=1 is a must
-            required_frames = total_frames
-            frame_stride = 1
-        else:
-            required_frames = video_frames
-        query_frames = min(required_frames, max_valid_frames)
-        frame_indices = [frame_stride * i for i in range(query_frames)]
-
-        ## [t,h,w,c] -> [c,t,h,w]
-        frames = vidreader.get_batch(frame_indices)
-        frame_tensor = torch.tensor(frames.asnumpy()).permute(3, 0, 1, 2).float()
-        frame_tensor = (frame_tensor / 255.0 - 0.5) * 2
-        if max_valid_frames < required_frames:
-            padding_num = required_frames - max_valid_frames
-            frame_tensor = torch.cat(
-                [frame_tensor, *([frame_tensor[:, -1:, :, :]] * padding_num)], dim=1
-            )
-            print(
-                f"{os.path.split(filepath)[1]} is not long enough: {padding_num} frames padded."
-            )
-        batch_tensor.append(frame_tensor)
-        sample_fps = int(fps / frame_stride)
-        fps_list.append(sample_fps)
-
-    return torch.stack(batch_tensor, dim=0)
-
-
-from PIL import Image
-
-
-def load_image_batch(filepath_list, image_size=(256, 256)):
-    batch_tensor = []
-    for filepath in filepath_list:
-        _, filename = os.path.split(filepath)
-        _, ext = os.path.splitext(filename)
-        if ext == ".mp4":
-            vidreader = VideoReader(
-                filepath, ctx=cpu(0), width=image_size[1], height=image_size[0]
-            )
-            frame = vidreader.get_batch([0])
-            img_tensor = (
-                torch.tensor(frame.asnumpy()).squeeze(0).permute(2, 0, 1).float()
-            )
-        elif ext == ".png" or ext == ".jpg":
-            img = Image.open(filepath).convert("RGB")
-            rgb_img = np.array(img, np.float32)
-            # bgr_img = cv2.imread(filepath, cv2.IMREAD_COLOR)
-            # bgr_img = cv2.cvtColor(bgr_img, cv2.COLOR_BGR2RGB)
-            rgb_img = cv2.resize(
-                rgb_img, (image_size[1], image_size[0]), interpolation=cv2.INTER_LINEAR
-            )
-            img_tensor = torch.from_numpy(rgb_img).permute(2, 0, 1).float()
-        else:
-            print(
-                f"ERROR: <{ext}> image loading only support format: [mp4], [png], [jpg]"
-            )
-            raise NotImplementedError
-        img_tensor = (img_tensor / 255.0 - 0.5) * 2
-        batch_tensor.append(img_tensor)
-    return torch.stack(batch_tensor, dim=0)
-
-
-def save_videos(batch_tensors, savedir, filenames, fps=10):
-    # b,samples,c,t,h,w
-    n_samples = batch_tensors.shape[1]
-    for idx, vid_tensor in enumerate(batch_tensors):
-        video = vid_tensor.detach().cpu()
-        video = torch.clamp(video.float(), -1.0, 1.0)
-        video = video.permute(2, 0, 1, 3, 4)  # t,n,c,h,w
-        frame_grids = [
-            torchvision.utils.make_grid(framesheet, nrow=int(n_samples))
-            for framesheet in video
-        ]  # [3, 1*h, n*w]
-        grid = torch.stack(frame_grids, dim=0)  # stack in temporal dim [t, 3, n*h, w]
-        grid = (grid + 1.0) / 2.0
-        grid = (grid * 255).to(torch.uint8).permute(0, 2, 3, 1)
-        savepath = os.path.join(savedir, f"{filenames[idx]}.mp4")
-        torchvision.io.write_video(
-            savepath, grid, fps=fps, video_codec="h264", options={"crf": "10"}
-        )
diff --git a/videotuna/models/lvdm/models/rlhf_utils/compression_scorer.py b/videotuna/models/lvdm/models/rlhf_utils/compression_scorer.py
deleted file mode 100644
index 196197a0..00000000
--- a/videotuna/models/lvdm/models/rlhf_utils/compression_scorer.py
+++ /dev/null
@@ -1,130 +0,0 @@
-# Adapt from Cheng An Hsieh, et. al.: https://github.com/RewardMultiverse/reward-multiverse
-import io
-
-import numpy as np
-import torch
-import torch.nn as nn
-import torchvision
-from PIL import Image
-
-# import albumentations as A
-from transformers import CLIPModel, CLIPProcessor
-
-# import ipdb
-# st = ipdb.set_trace
-
-
-def jpeg_compressibility(device):
-    def _fn(images):
-        """
-        args:
-            images: shape NCHW
-        """
-        org_type = images.dtype
-        if isinstance(images, torch.Tensor):
-            images = (images * 255).round().clamp(0, 255).to(torch.uint8).cpu().numpy()
-            images = images.transpose(0, 2, 3, 1)  # NCHW -> NHWC
-
-        transform_images_tensor = torch.Tensor(np.array(images)).to(
-            device, dtype=org_type
-        )
-        transform_images_tensor = (
-            transform_images_tensor.permute(0, 3, 1, 2) / 255
-        ).clamp(
-            0, 1
-        )  # NHWC -> NCHW
-        transform_images_pil = [Image.fromarray(image) for image in images]
-        buffers = [io.BytesIO() for _ in transform_images_pil]
-
-        for image, buffer in zip(transform_images_pil, buffers):
-            image.save(buffer, format="JPEG", quality=95)
-
-        sizes = [buffer.tell() / 1000 for buffer in buffers]
-
-        return np.array(sizes), transform_images_tensor
-
-    return _fn
-
-
-class MLP(nn.Module):
-    def __init__(self):
-        super().__init__()
-        self.layers = nn.Sequential(
-            nn.Linear(768, 512),
-            nn.ReLU(),
-            nn.Dropout(0.2),
-            nn.Linear(512, 256),
-            nn.ReLU(),
-            nn.Dropout(0.2),
-            nn.Linear(256, 128),
-            nn.ReLU(),
-            nn.Dropout(0.2),
-            nn.Linear(128, 32),
-            nn.ReLU(),
-            nn.Dropout(0.1),
-            nn.Linear(32, 1),
-        )
-
-    def forward(self, embed):
-        return self.layers(embed)
-
-
-def jpegcompression_loss_fn(
-    target=None,
-    grad_scale=0,
-    device=None,
-    accelerator=None,
-    torch_dtype=None,
-    reward_model_resume_from=None,
-):
-    scorer = JpegCompressionScorer(
-        dtype=torch_dtype, model_path=reward_model_resume_from
-    ).to(device, dtype=torch_dtype)
-    scorer.requires_grad_(False)
-    scorer.eval()
-
-    def loss_fn(im_pix_un):
-        if accelerator.mixed_precision == "fp16":
-            with accelerator.autocast():
-                rewards = scorer(im_pix_un)
-        else:
-            rewards = scorer(im_pix_un)
-
-        if target is None:
-            loss = rewards
-        else:
-            loss = abs(rewards - target)
-        return loss * grad_scale, rewards
-
-    return loss_fn
-
-
-class JpegCompressionScorer(nn.Module):
-    def __init__(self, dtype=None, model_path=None):
-        super().__init__()
-        self.clip = CLIPModel.from_pretrained("openai/clip-vit-large-patch14")
-        self.clip.requires_grad_(False)
-        self.score_generator = MLP()
-
-        if model_path:
-            state_dict = torch.load(model_path)
-            self.score_generator.load_state_dict(state_dict)
-        if dtype:
-            self.dtype = dtype
-        self.target_size = (224, 224)
-        self.normalize = torchvision.transforms.Normalize(
-            mean=[0.48145466, 0.4578275, 0.40821073],
-            std=[0.26862954, 0.26130258, 0.27577711],
-        )
-
-    def set_device(self, device, inference_type):
-        # self.clip.to(device, dtype = inference_type)
-        self.score_generator.to(device)  # , dtype = inference_type
-
-    def __call__(self, images):
-        device = next(self.parameters()).device
-        im_pix = torchvision.transforms.Resize(self.target_size)(images)
-        im_pix = self.normalize(im_pix).to(images.dtype)
-        embed = self.clip.get_image_features(pixel_values=im_pix)
-        embed = embed / torch.linalg.vector_norm(embed, dim=-1, keepdim=True)
-        return self.score_generator(embed).squeeze(1)
diff --git a/videotuna/models/lvdm/models/rlhf_utils/pretrained_reward_models/compression_reward.pt b/videotuna/models/lvdm/models/rlhf_utils/pretrained_reward_models/compression_reward.pt
deleted file mode 100644
index d2e3c1ea..00000000
Binary files a/videotuna/models/lvdm/models/rlhf_utils/pretrained_reward_models/compression_reward.pt and /dev/null differ
diff --git a/videotuna/models/lvdm/models/rlhf_utils/pretrained_reward_models/rainy_reward.pt b/videotuna/models/lvdm/models/rlhf_utils/pretrained_reward_models/rainy_reward.pt
deleted file mode 100644
index 3c3520f7..00000000
Binary files a/videotuna/models/lvdm/models/rlhf_utils/pretrained_reward_models/rainy_reward.pt and /dev/null differ
diff --git a/videotuna/models/lvdm/models/rlhf_utils/pretrained_reward_models/sac+logos+ava1-l14-linearMSE.pth b/videotuna/models/lvdm/models/rlhf_utils/pretrained_reward_models/sac+logos+ava1-l14-linearMSE.pth
deleted file mode 100644
index 7c0d8aa2..00000000
Binary files a/videotuna/models/lvdm/models/rlhf_utils/pretrained_reward_models/sac+logos+ava1-l14-linearMSE.pth and /dev/null differ
diff --git a/videotuna/models/lvdm/models/rlhf_utils/pretrained_reward_models/snowy_reward.pt b/videotuna/models/lvdm/models/rlhf_utils/pretrained_reward_models/snowy_reward.pt
deleted file mode 100644
index 296f15c7..00000000
Binary files a/videotuna/models/lvdm/models/rlhf_utils/pretrained_reward_models/snowy_reward.pt and /dev/null differ
diff --git a/videotuna/models/lvdm/models/rlhf_utils/prompts.py b/videotuna/models/lvdm/models/rlhf_utils/prompts.py
deleted file mode 100644
index d4a74dec..00000000
--- a/videotuna/models/lvdm/models/rlhf_utils/prompts.py
+++ /dev/null
@@ -1,209 +0,0 @@
-# from importlib_resources import files
-import functools
-import os
-import random
-
-# import inflect
-
-# IE = inflect.engine()
-# ASSETS_PATH = files("lvdm.models.rlhf_utils.pretrained_reward_models")
-ASSETS_PATH = "videotuna/models/lvdm/models/rlhf_utils/pretrained_reward_models"
-
-
-@functools.lru_cache(maxsize=None)
-def _load_lines(path):
-    """
-    Load lines from a file. First tries to load from `path` directly, and if that doesn't exist, searches the
-    `ddpo_pytorch/assets` directory for a file named `path`.
-    """
-    if not os.path.exists(path):
-        # newpath = ASSETS_PATH.joinpath(path)
-        newpath = os.path.join(ASSETS_PATH, path)
-    if not os.path.exists(newpath):
-        raise FileNotFoundError(f"Could not find {path} or assets/{path}")
-    path = newpath
-    with open(path, "r") as f:
-        return [line.strip() for line in f.readlines()]
-
-
-def hps_v2_all(nouns_file=None, activities_file=None):
-    return from_file("hps_v2_all.txt")
-
-
-def hps_custom(nouns_file=None, activities_file=None):
-    return from_file("hps_custom.txt")
-
-
-def hps_debug(nouns_file=None, activities_file=None):
-    return from_file("hps_debug.txt")
-
-
-def hps_single(nouns_file=None, activities_file=None):
-    return from_file("hps_single.txt")
-
-
-def kinetics_4rand(nouns_file=None, activities_file=None):
-    return from_file("kinetics_4rand.txt")
-
-
-def kinetics_50rand(nouns_file=None, activities_file=None):
-    return from_file("kinetics_50rand.txt")
-
-
-def simple_animals():
-    return from_file("simple_animals.txt")
-
-
-def eval_simple_animals():
-    return from_file("eval_simple_animals.txt")
-
-
-def eval_hps_v2_all(nouns_file=None, activities_file=None):
-    return from_file("hps_v2_all_eval.txt")
-
-
-def chatgpt_custom(nouns_file=None, activities_file=None):
-    return from_file("chatgpt_custom.txt")
-
-
-def chatgpt_custom_instruments(nouns_file=None, activities_file=None):
-    return from_file("chatgpt_custom_instruments.txt")
-
-
-def chatgpt_custom_human(nouns_file=None, activities_file=None):
-    return from_file("chatgpt_custom_human.txt")
-
-
-def chatgpt_custom_human_activity(nouns_file=None, activities_file=None):
-    return from_file("chatgpt_custom_human_activity.txt")
-
-
-def chatgpt_custom_animal(nouns_file=None, activities_file=None):
-    return from_file("chatgpt_custom_animal.txt")
-
-
-def chatgpt_custom_animal_sport(nouns_file=None, activities_file=None):
-    return from_file("chatgpt_custom_animal_sport.txt")
-
-
-def chatgpt_custom_animal_sportV2(nouns_file=None, activities_file=None):
-    return from_file("chatgpt_custom_animal_sportV2.txt")
-
-
-def chatgpt_custom_animal_clothes(nouns_file=None, activities_file=None):
-    return from_file("chatgpt_custom_animal_clothes.txt")
-
-
-def chatgpt_custom_animal_clothesV2(nouns_file=None, activities_file=None):
-    return from_file("chatgpt_custom_animal_clothesV2.txt")
-
-
-def chatgpt_custom_animal_clothesV3(nouns_file=None, activities_file=None):
-    return from_file("chatgpt_custom_animal_clothesV3.txt")
-
-
-def chatgpt_custom_animal_technology(nouns_file=None, activities_file=None):
-    return from_file("chatgpt_custom_animal_technology.txt")
-
-
-def chatgpt_custom_animal_housework(nouns_file=None, activities_file=None):
-    return from_file("chatgpt_custom_animal_housework.txt")
-
-
-def chatgpt_custom_animal_action(nouns_file=None, activities_file=None):
-    return from_file("chatgpt_custom_animal_action.txt")
-
-
-def chatgpt_custom_outdoor(nouns_file=None, activities_file=None):
-    return from_file("chatgpt_custom_outdoor.txt")
-
-
-def chatgpt_custom_rainy(nouns_file=None, activities_file=None):
-    return from_file("chatgpt_custom_rainy.txt")
-
-
-def chatgpt_custom_snowy(nouns_file=None, activities_file=None):
-    return from_file("chatgpt_custom_snowy.txt")
-
-
-def chatgpt_custom_dog(nouns_file=None, activities_file=None):
-    return from_file("chatgpt_custom_dog.txt")
-
-
-def chatgpt_custom_banana(nouns_file=None, activities_file=None):
-    return from_file("chatgpt_custom_banana.txt")
-
-
-def chatgpt_custom_forest(nouns_file=None, activities_file=None):
-    return from_file("chatgpt_custom_forest.txt")
-
-
-def chatgpt_custom_forest_vivid(nouns_file=None, activities_file=None):
-    return from_file("chatgpt_custom_forest_vivid.txt")
-
-
-def chatgpt_custom_cruel_animal(nouns_file=None, activities_file=None):
-    return from_file("chatgpt_custom_cruel_animal.txt")
-
-
-def chatgpt_custom_cruel_animal2(nouns_file=None, activities_file=None):
-    return from_file("chatgpt_custom_cruel_animal2.txt")
-
-
-def chatgpt_custom_bottle_glass(nouns_file=None, activities_file=None):
-    return from_file("chatgpt_custom_bottle_glass.txt")
-
-
-def chatgpt_custom_book_cup(nouns_file=None, activities_file=None):
-    return from_file("chatgpt_custom_book_cup.txt")
-
-
-def chatgpt_custom_book_cup_character(nouns_file=None, activities_file=None):
-    return from_file("chatgpt_custom_book_cup_character.txt")
-
-
-def chatgpt_custom_cute(nouns_file=None, activities_file=None):
-    return from_file("chatgpt_custom_cute.txt")
-
-
-def chatgpt_custom_ice(nouns_file=None, activities_file=None):
-    return from_file("chatgpt_custom_ice.txt")
-
-
-def chatgpt_custom_compression(nouns_file=None, activities_file=None):
-    return from_file("chatgpt_custom_compression.txt")
-
-
-def chatgpt_custom_compression_animals(nouns_file=None, activities_file=None):
-    return from_file("chatgpt_custom_compression_animals.txt")
-
-
-def chatgpt_custom_actpred(nouns_file=None, activities_file=None):
-    return from_file("chatgpt_custom_actpred.txt")
-
-
-def chatgpt_custom_actpred2(nouns_file=None, activities_file=None):
-    return from_file("chatgpt_custom_actpred2.txt")
-
-
-def chatgpt_custom_instruments_unseen(nouns_file=None, activities_file=None):
-    return from_file("chatgpt_custom_instruments_unseen.txt")
-
-
-def chatgpt_inference(nouns_file=None, activities_file=None):
-    return from_file("chatgpt_inference.txt")
-
-
-def from_file(path, low=None, high=None, **kwargs):
-    prompts = _load_lines(path)[low:high]
-    return random.choice(prompts), {}
-
-
-def from_str(_str, **kwargs):
-    return _str, {}
-
-
-def nouns_activities(nouns_file, activities_file, **kwargs):
-    nouns = _load_lines(nouns_file)
-    activities = _load_lines(activities_file)
-    return f"{IE.a(random.choice(nouns))} {random.choice(activities)}", {}
diff --git a/videotuna/models/lvdm/models/rlhf_utils/reward_fn.py b/videotuna/models/lvdm/models/rlhf_utils/reward_fn.py
deleted file mode 100644
index 61a24492..00000000
--- a/videotuna/models/lvdm/models/rlhf_utils/reward_fn.py
+++ /dev/null
@@ -1,793 +0,0 @@
-# adapted from VADER  https://github.com/mihirp1998/VADER
-import argparse
-import glob
-import math
-import os
-import random
-import sys
-
-import yaml
-
-sys.path.insert(
-    1, os.path.join(sys.path[0], "..", "..")
-)  # setting path to get Core and assets
-
-import hpsv2
-import models.lvdm.models.rlhf_utils.prompts as prompts_file
-import torchvision
-from hpsv2.src.open_clip import create_model_and_transforms, get_tokenizer
-from models.lvdm.models.rlhf_utils.actpred_scorer import ActPredScorer
-from models.lvdm.models.rlhf_utils.aesthetic_scorer import AestheticScorerDiff
-from models.lvdm.models.rlhf_utils.compression_scorer import (
-    JpegCompressionScorer,
-    jpeg_compressibility,
-)
-from models.lvdm.models.rlhf_utils.weather_scorer import WeatherScorer
-from transformers import (
-    AutoImageProcessor,
-    AutoModel,
-    AutoModelForObjectDetection,
-    AutoModelForZeroShotObjectDetection,
-    AutoProcessor,
-)
-from transformers.utils import ContextManagers
-
-# import ipdb
-# st = ipdb.set_trace
-
-
-def create_output_folders(output_dir, run_name):
-    out_dir = os.path.join(output_dir, run_name)
-    os.makedirs(out_dir, exist_ok=True)
-    os.makedirs(f"{out_dir}/samples", exist_ok=True)
-    return out_dir
-
-
-# to convert string to boolean in argparse
-def str2bool(v):
-    if isinstance(v, bool):
-        return v
-    if v.lower() in ("yes", "true", "t", "y", "1"):
-        return True
-    elif v.lower() in ("no", "false", "f", "n", "0"):
-        return False
-    else:
-        raise argparse.ArgumentTypeError("Boolean value expected.")
-
-
-def get_parser():
-    parser = argparse.ArgumentParser()
-    parser.add_argument(
-        "--seed", type=int, default=20230211, help="seed for seed_everything"
-    )
-    parser.add_argument(
-        "--mode",
-        default="base",
-        type=str,
-        help="which kind of inference mode: {'base', 'i2v'}",
-    )
-    parser.add_argument("--ckpt_path", type=str, default=None, help="checkpoint path")
-    parser.add_argument("--config", type=str, help="config (yaml) path")
-    parser.add_argument("--savefps", type=str, default=10, help="video fps to generate")
-    parser.add_argument(
-        "--n_samples",
-        type=int,
-        default=1,
-        help="num of samples per prompt",
-    )
-    parser.add_argument(
-        "--ddim_steps",
-        type=int,
-        default=50,
-        help="steps of ddim if positive, otherwise use DDPM",
-    )
-    parser.add_argument(
-        "--ddim_eta",
-        type=float,
-        default=1.0,
-        help="eta for ddim sampling (0.0 yields deterministic sampling)",
-    )
-    parser.add_argument(
-        "--height", type=int, default=512, help="image height, in pixel space"
-    )
-    parser.add_argument(
-        "--width", type=int, default=512, help="image width, in pixel space"
-    )
-    parser.add_argument(
-        "--frames", type=int, default=-1, help="frames num to inference"
-    )
-    parser.add_argument("--fps", type=int, default=24)
-    parser.add_argument(
-        "--unconditional_guidance_scale",
-        type=float,
-        default=1.0,
-        help="prompt classifier-free guidance",
-    )
-    parser.add_argument(
-        "--unconditional_guidance_scale_temporal",
-        type=float,
-        default=None,
-        help="temporal consistency guidance",
-    )
-    ## for conditional i2v only
-    parser.add_argument(
-        "--cond_input", type=str, default=None, help="data dir of conditional input"
-    )
-    ## for training
-    parser.add_argument("--lr", type=float, default=2e-4, help="learning rate")
-    parser.add_argument(
-        "--val_batch_size", type=int, default=1, help="batch size for validation"
-    )
-    parser.add_argument(
-        "--num_val_runs",
-        type=int,
-        default=1,
-        help="total number of validation samples = num_val_runs * num_gpus * num_val_batch",
-    )
-    parser.add_argument(
-        "--train_batch_size", type=int, default=1, help="batch size for training"
-    )
-    parser.add_argument(
-        "--reward_fn",
-        type=str,
-        default="aesthetic",
-        help="reward function: 'aesthetic', 'hps', 'aesthetic_hps', 'pick_score', 'rainy', 'snowy', 'objectDetection', 'actpred', 'compression'",
-    )
-    parser.add_argument(
-        "--compression_model_path",
-        type=str,
-        default="../pretrained_models/compression_reward.pt",
-        help="compression model path",
-    )  # The compression model is used only when reward_fn is 'compression'
-    # The "book." is for grounding-dino model . Remember to add "." at the end of the object name for grounding-dino model.
-    # But for yolos model, do not add "." at the end of the object name. Instead, you should set the object name to "book" for example.
-    parser.add_argument(
-        "--target_object",
-        type=str,
-        default="book",
-        help="target object for object detection reward function",
-    )
-    parser.add_argument(
-        "--detector_model",
-        type=str,
-        default="yolos-base",
-        help="object detection model",
-        choices=[
-            "yolos-base",
-            "yolos-tiny",
-            "grounding-dino-base",
-            "grounding-dino-tiny",
-        ],
-    )
-    parser.add_argument(
-        "--hps_version", type=str, default="v2.1", help="hps version: 'v2.0', 'v2.1'"
-    )
-    parser.add_argument(
-        "--prompt_fn", type=str, default="hps_custom", help="prompt function"
-    )
-    parser.add_argument(
-        "--nouns_file", type=str, default="simple_animals.txt", help="nouns file"
-    )
-    parser.add_argument(
-        "--activities_file", type=str, default="activities.txt", help="activities file"
-    )
-    parser.add_argument(
-        "--num_train_epochs", type=int, default=200, help="number of training epochs"
-    )
-    parser.add_argument(
-        "--max_train_steps", type=int, default=10000, help="max training steps"
-    )
-    parser.add_argument(
-        "--backprop_mode",
-        type=str,
-        default="last",
-        help="backpropagation mode: 'last', 'rand', 'specific'",
-    )  # backprop_mode != None also means training mode for batch_ddim_sampling
-    parser.add_argument(
-        "--gradient_accumulation_steps",
-        type=int,
-        default=1,
-        help="gradient accumulation steps",
-    )
-    parser.add_argument(
-        "--mixed_precision",
-        type=str,
-        default="fp16",
-        help="mixed precision training: 'no', 'fp8', 'fp16', 'bf16'",
-    )
-    parser.add_argument(
-        "--logger_type",
-        type=str,
-        default="wandb",
-        help="logger type: 'wandb', 'tensorboard'",
-    )
-    parser.add_argument(
-        "--project_dir", type=str, default="./project_dir", help="project directory"
-    )
-    parser.add_argument(
-        "--validation_steps",
-        type=int,
-        default=1,
-        help="The frequency of validation, e.g., 1 means validate every 1*accelerator.num_processes steps",
-    )
-    parser.add_argument(
-        "--checkpointing_steps",
-        type=int,
-        default=1,
-        help="The frequency of checkpointing",
-    )
-    parser.add_argument(
-        "--use_wandb", type=str2bool, default=True, help="use wandb for logging"
-    )
-    parser.add_argument("--wandb_entity", type=str, default="", help="wandb entity")
-    parser.add_argument("--debug", type=str2bool, default=False, help="debug mode")
-    parser.add_argument(
-        "--max_grad_norm", type=float, default=1.0, help="max gradient norm"
-    )
-    parser.add_argument(
-        "--use_AdamW8bit", type=str2bool, default=False, help="use AdamW8bit optimizer"
-    )
-    parser.add_argument(
-        "--is_sample_preview",
-        type=str2bool,
-        default=True,
-        help="sample preview during training",
-    )
-    parser.add_argument(
-        "--decode_frame",
-        type=str,
-        default="-1",
-        help="decode frame: '-1', 'fml', 'all', 'alt'",
-    )  # it could also be any number str like '3', '10'. alt: alternate frames, fml: first, middle, last frames, all: all frames. '-1': random frame
-    parser.add_argument(
-        "--inference_only", type=str2bool, default=False, help="only do inference"
-    )
-    parser.add_argument(
-        "--lora_ckpt_path", type=str, default=None, help="LoRA checkpoint path"
-    )
-    parser.add_argument("--lora_rank", type=int, default=8, help="LoRA rank")
-
-    return parser
-
-
-def aesthetic_loss_fn(
-    aesthetic_target=None, grad_scale=0, device=None, torch_dtype=None
-):
-    """
-    Args:
-        aesthetic_target: float, the target value of the aesthetic score. it is 10 in this experiment
-        grad_scale: float, the scale of the gradient. it is 0.1 in this experiment
-        device: torch.device, the device to run the model.
-        torch_dtype: torch.dtype, the data type of the model.
-
-    Returns:
-        loss_fn: function, the loss function of the aesthetic reward function.
-    """
-    target_size = (224, 224)
-    normalize = torchvision.transforms.Normalize(
-        mean=[0.48145466, 0.4578275, 0.40821073],
-        std=[0.26862954, 0.26130258, 0.27577711],
-    )
-
-    scorer = AestheticScorerDiff(dtype=torch_dtype).to(device, dtype=torch_dtype)
-    scorer.requires_grad_(False)
-
-    def loss_fn(im_pix_un):
-        im_pix = ((im_pix_un / 2) + 0.5).clamp(0, 1)
-        im_pix = torchvision.transforms.Resize(target_size)(im_pix)
-        im_pix = normalize(im_pix).to(im_pix_un.dtype)
-        rewards = scorer(im_pix)
-        if aesthetic_target is None:  # default maximization
-            loss = -1 * rewards
-        else:
-            # using L1 to keep on same scale
-            loss = abs(rewards - aesthetic_target)
-        return loss.mean() * grad_scale, rewards.mean()
-
-    return loss_fn
-
-
-def hps_loss_fn(inference_dtype=None, device=None, hps_version="v2.0"):
-    """
-    Args:
-        inference_dtype: torch.dtype, the data type of the model.
-        device: torch.device, the device to run the model.
-        hps_version: str, the version of the HPS model. It is "v2.0" or "v2.1" in this experiment.
-
-    Returns:
-        loss_fn: function, the loss function of the HPS reward function.
-    """
-    model_name = "ViT-H-14"
-
-    model, preprocess_train, preprocess_val = create_model_and_transforms(
-        model_name,
-        "laion2B-s32B-b79K",
-        precision=inference_dtype,
-        device=device,
-        jit=False,
-        force_quick_gelu=False,
-        force_custom_text=False,
-        force_patch_dropout=False,
-        force_image_size=None,
-        pretrained_image=False,
-        image_mean=None,
-        image_std=None,
-        light_augmentation=True,
-        aug_cfg={},
-        output_dict=True,
-        with_score_predictor=False,
-        with_region_predictor=False,
-    )
-
-    tokenizer = get_tokenizer(model_name)
-
-    if (
-        hps_version == "v2.0"
-    ):  # if there is a error, please download the model manually and set the path
-        checkpoint_path = f"{os.path.expanduser('~')}/.cache/huggingface/hub/models--xswu--HPSv2/snapshots/697403c78157020a1ae59d23f111aa58ced35b0a/HPS_v2_compressed.pt"
-    else:  # hps_version == "v2.1"
-        checkpoint_path = f"{os.path.expanduser('~')}/.cache/huggingface/hub/models--xswu--HPSv2/snapshots/697403c78157020a1ae59d23f111aa58ced35b0a/HPS_v2.1_compressed.pt"
-    # force download of model via score
-    hpsv2.score([], "", hps_version=hps_version)
-
-    checkpoint = torch.load(checkpoint_path, map_location=device)
-    model.load_state_dict(checkpoint["state_dict"])
-    tokenizer = get_tokenizer(model_name)
-    model = model.to(device, dtype=inference_dtype)
-    model.eval()
-
-    target_size = (224, 224)
-    normalize = torchvision.transforms.Normalize(
-        mean=[0.48145466, 0.4578275, 0.40821073],
-        std=[0.26862954, 0.26130258, 0.27577711],
-    )
-
-    def loss_fn(im_pix, prompts):
-        im_pix = ((im_pix / 2) + 0.5).clamp(0, 1)
-        x_var = torchvision.transforms.Resize(target_size)(im_pix)
-        x_var = normalize(x_var).to(im_pix.dtype)
-        caption = tokenizer(prompts)
-        caption = caption.to(device)
-        outputs = model(x_var, caption)
-        image_features, text_features = (
-            outputs["image_features"],
-            outputs["text_features"],
-        )
-        logits = image_features @ text_features.T
-        scores = torch.diagonal(logits)
-        loss = 1.0 - scores
-        return loss.mean(), scores.mean()
-
-    return loss_fn
-
-
-def aesthetic_hps_loss_fn(
-    aesthetic_target=None,
-    grad_scale=0,
-    inference_dtype=None,
-    device=None,
-    hps_version="v2.0",
-):
-    """
-    Args:
-        aesthetic_target: float, the target value of the aesthetic score. it is 10 in this experiment
-        grad_scale: float, the scale of the gradient. it is 0.1 in this experiment
-        inference_dtype: torch.dtype, the data type of the model.
-        device: torch.device, the device to run the model.
-        hps_version: str, the version of the HPS model. It is "v2.0" or "v2.1" in this experiment.
-
-    Returns:
-        loss_fn: function, the loss function of a combination of aesthetic and HPS reward function.
-    """
-    # HPS
-    model_name = "ViT-H-14"
-
-    model, preprocess_train, preprocess_val = create_model_and_transforms(
-        model_name,
-        "laion2B-s32B-b79K",
-        precision=inference_dtype,
-        device=device,
-        jit=False,
-        force_quick_gelu=False,
-        force_custom_text=False,
-        force_patch_dropout=False,
-        force_image_size=None,
-        pretrained_image=False,
-        image_mean=None,
-        image_std=None,
-        light_augmentation=True,
-        aug_cfg={},
-        output_dict=True,
-        with_score_predictor=False,
-        with_region_predictor=False,
-    )
-
-    # tokenizer = get_tokenizer(model_name)
-
-    if (
-        hps_version == "v2.0"
-    ):  # if there is a error, please download the model manually and set the path
-        checkpoint_path = f"{os.path.expanduser('~')}/.cache/huggingface/hub/models--xswu--HPSv2/snapshots/697403c78157020a1ae59d23f111aa58ced35b0a/HPS_v2_compressed.pt"
-    else:  # hps_version == "v2.1"
-        checkpoint_path = f"{os.path.expanduser('~')}/.cache/huggingface/hub/models--xswu--HPSv2/snapshots/697403c78157020a1ae59d23f111aa58ced35b0a/HPS_v2.1_compressed.pt"
-    # force download of model via score
-    hpsv2.score([], "", hps_version=hps_version)
-
-    checkpoint = torch.load(checkpoint_path, map_location=device)
-    model.load_state_dict(checkpoint["state_dict"])
-    tokenizer = get_tokenizer(model_name)
-    model = model.to(device, dtype=inference_dtype)
-    model.eval()
-
-    target_size = (224, 224)
-    normalize = torchvision.transforms.Normalize(
-        mean=[0.48145466, 0.4578275, 0.40821073],
-        std=[0.26862954, 0.26130258, 0.27577711],
-    )
-    # Aesthetic
-    scorer = AestheticScorerDiff(dtype=inference_dtype).to(
-        device, dtype=inference_dtype
-    )
-    scorer.requires_grad_(False)
-
-    def loss_fn(im_pix_un, prompts):
-        # Aesthetic
-        im_pix = ((im_pix_un / 2) + 0.5).clamp(0, 1)
-        im_pix = torchvision.transforms.Resize(target_size)(im_pix)
-        im_pix = normalize(im_pix).to(im_pix_un.dtype)
-
-        aesthetic_rewards = scorer(im_pix)
-        if aesthetic_target is None:  # default maximization
-            aesthetic_loss = -1 * aesthetic_rewards
-        else:
-            # using L1 to keep on same scale
-            aesthetic_loss = abs(aesthetic_rewards - aesthetic_target)
-        aesthetic_loss = aesthetic_loss.mean() * grad_scale
-        aesthetic_rewards = aesthetic_rewards.mean()
-
-        # HPS
-        caption = tokenizer(prompts)
-        caption = caption.to(device)
-        outputs = model(im_pix, caption)
-        image_features, text_features = (
-            outputs["image_features"],
-            outputs["text_features"],
-        )
-        logits = image_features @ text_features.T
-        scores = torch.diagonal(logits)
-        hps_loss = abs(1.0 - scores)
-        hps_loss = hps_loss.mean()
-        hps_rewards = scores.mean()
-
-        loss = (
-            1.5 * aesthetic_loss + hps_loss
-        ) / 2  # 1.5 is a hyperparameter. Set it to 1.5 because experimentally hps_loss is 1.5 times larger than aesthetic_loss
-        rewards = (
-            aesthetic_rewards + 15 * hps_rewards
-        ) / 2  # 15 is a hyperparameter. Set it to 15 because experimentally aesthetic_rewards is 15 times larger than hps_reward
-        return loss, rewards
-
-    return loss_fn
-
-
-def pick_score_loss_fn(inference_dtype=None, device=None):
-    """
-    Args:
-        inference_dtype: torch.dtype, the data type of the model.
-        device: torch.device, the device to run the model.
-
-    Returns:
-        loss_fn: function, the loss function of the PickScore reward function.
-    """
-    processor_name_or_path = "laion/CLIP-ViT-H-14-laion2B-s32B-b79K"
-    model_pretrained_name_or_path = "yuvalkirstain/PickScore_v1"
-    processor = AutoProcessor.from_pretrained(
-        processor_name_or_path, torch_dtype=inference_dtype
-    )
-    model = (
-        AutoModel.from_pretrained(
-            model_pretrained_name_or_path, torch_dtype=inference_dtype
-        )
-        .eval()
-        .to(device)
-    )
-    model.requires_grad_(False)
-
-    def loss_fn(im_pix_un, prompts):  # im_pix_un: b,c,h,w
-        im_pix = ((im_pix_un / 2) + 0.5).clamp(0, 1)
-
-        # reproduce the pick_score preprocessing
-        im_pix = im_pix * 255  # b,c,h,w
-
-        if im_pix.shape[2] < im_pix.shape[3]:
-            height = 224
-            width = (
-                im_pix.shape[3] * height // im_pix.shape[2]
-            )  # keep the aspect ratio, so the width is w * 224/h
-        else:
-            width = 224
-            height = (
-                im_pix.shape[2] * width // im_pix.shape[3]
-            )  # keep the aspect ratio, so the height is h * 224/w
-
-        # interpolation and antialiasing should be the same as below
-        im_pix = torchvision.transforms.Resize(
-            (height, width),
-            interpolation=torchvision.transforms.InterpolationMode.BICUBIC,
-            antialias=True,
-        )(im_pix)
-        im_pix = im_pix.permute(0, 2, 3, 1)  # b,c,h,w -> (b,h,w,c)
-        # crop the center 224x224
-        startx = width // 2 - (224 // 2)
-        starty = height // 2 - (224 // 2)
-        im_pix = im_pix[:, starty : starty + 224, startx : startx + 224, :]
-        # do rescale and normalize as CLIP
-        im_pix = im_pix * 0.00392156862745098  # rescale factor
-        mean = torch.tensor([0.48145466, 0.4578275, 0.40821073]).to(device)
-        std = torch.tensor([0.26862954, 0.26130258, 0.27577711]).to(device)
-        im_pix = (im_pix - mean) / std
-        im_pix = im_pix.permute(0, 3, 1, 2)  # BHWC -> BCHW
-
-        text_inputs = processor(
-            text=prompts,
-            padding=True,
-            truncation=True,
-            max_length=77,
-            return_tensors="pt",
-        ).to(device)
-
-        # embed
-        image_embs = model.get_image_features(pixel_values=im_pix)
-        image_embs = image_embs / torch.norm(image_embs, dim=-1, keepdim=True)
-
-        text_embs = model.get_text_features(**text_inputs)
-        text_embs = text_embs / torch.norm(text_embs, dim=-1, keepdim=True)
-
-        # score
-        scores = model.logit_scale.exp() * (text_embs @ image_embs.T)[0]
-        loss = abs(1.0 - scores / 100.0)
-        return loss.mean(), scores.mean()
-
-    return loss_fn
-
-
-def weather_loss_fn(
-    inference_dtype=None, device=None, weather="rainy", target=None, grad_scale=0
-):
-    """
-    Args:
-        inference_dtype: torch.dtype, the data type of the model.
-        device: torch.device, the device to run the model.
-        weather: str, the weather condition. It is "rainy" or "snowy" in this experiment.
-        target: float, the target value of the weather score. It is 1.0 in this experiment.
-        grad_scale: float, the scale of the gradient. It is 1 in this experiment.
-
-    Returns:
-        loss_fn: function, the loss function of the weather reward function.
-    """
-    if weather == "rainy":
-        reward_model_path = "../pretrained_models/rainy_reward.pt"
-    elif weather == "snowy":
-        reward_model_path = "../pretrained_models/snowy_reward.pt"
-    else:
-        raise NotImplementedError
-    scorer = WeatherScorer(dtype=inference_dtype, model_path=reward_model_path).to(
-        device, dtype=inference_dtype
-    )
-    scorer.requires_grad_(False)
-    scorer.eval()
-
-    def loss_fn(im_pix_un):
-        im_pix = ((im_pix_un + 1) / 2).clamp(0, 1)  # from [-1, 1] to [0, 1]
-        rewards = scorer(im_pix)
-
-        if target is None:
-            loss = rewards
-        else:
-            loss = abs(rewards - target)
-
-        return loss.mean() * grad_scale, rewards.mean()
-
-    return loss_fn
-
-
-def objectDetection_loss_fn(
-    inference_dtype=None,
-    device=None,
-    targetObject="dog.",
-    model_name="grounding-dino-base",
-):
-    """
-    This reward function is used to remove the target object from the generated video.
-    We use yolo-s-tiny model to detect the target object in the generated video.
-
-    Args:
-        inference_dtype: torch.dtype, the data type of the model.
-        device: torch.device, the device to run the model.
-        targetObject: str, the object to detect. It is "dog" in this experiment.
-
-    Returns:
-        loss_fn: function, the loss function of the object detection reward function.
-    """
-    if model_name == "yolos-base":
-        image_processor = AutoImageProcessor.from_pretrained(
-            "hustvl/yolos-base", torch_dtype=inference_dtype
-        )
-        model = AutoModelForObjectDetection.from_pretrained(
-            "hustvl/yolos-base", torch_dtype=inference_dtype
-        ).to(device)
-        # check if "." in the targetObject name for yolos model
-        if "." in targetObject:
-            raise ValueError(
-                "The targetObject name should not contain '.' for yolos-base model."
-            )
-    elif model_name == "yolos-tiny":
-        image_processor = AutoImageProcessor.from_pretrained(
-            "hustvl/yolos-tiny", torch_dtype=inference_dtype
-        )
-        model = AutoModelForObjectDetection.from_pretrained(
-            "hustvl/yolos-tiny", torch_dtype=inference_dtype
-        ).to(device)
-        # check if "." in the targetObject name for yolos model
-        if "." in targetObject:
-            raise ValueError(
-                "The targetObject name should not contain '.' for yolos-tiny model."
-            )
-    elif model_name == "grounding-dino-base":
-        image_processor = AutoProcessor.from_pretrained(
-            "IDEA-Research/grounding-dino-base", torch_dtype=inference_dtype
-        )
-        model = AutoModelForZeroShotObjectDetection.from_pretrained(
-            "IDEA-Research/grounding-dino-base", torch_dtype=inference_dtype
-        ).to(device)
-        # check if "." in the targetObject name for grounding-dino model
-        if "." not in targetObject:
-            raise ValueError(
-                "The targetObject name should contain '.' for grounding-dino-base model."
-            )
-    elif model_name == "grounding-dino-tiny":
-        image_processor = AutoProcessor.from_pretrained(
-            "IDEA-Research/grounding-dino-tiny", torch_dtype=inference_dtype
-        )
-        model = AutoModelForZeroShotObjectDetection.from_pretrained(
-            "IDEA-Research/grounding-dino-tiny", torch_dtype=inference_dtype
-        ).to(device)
-        # check if "." in the targetObject name for grounding-dino model
-        if "." not in targetObject:
-            raise ValueError(
-                "The targetObject name should contain '.' for grounding-dino-tiny model."
-            )
-    else:
-        raise NotImplementedError
-
-    model.requires_grad_(False)
-    model.eval()
-
-    def loss_fn(im_pix_un):  # im_pix_un: b,c,h,w
-        images = ((im_pix_un / 2) + 0.5).clamp(0.0, 1.0)
-
-        # reproduce the yolo preprocessing
-        height = 512
-        width = (
-            512 * images.shape[3] // images.shape[2]
-        )  # keep the aspect ratio, so the width is 512 * w/h
-        images = torchvision.transforms.Resize((height, width), antialias=False)(images)
-        images = images.permute(0, 2, 3, 1)  # b,c,h,w -> (b,h,w,c)
-
-        image_mean = torch.tensor([0.485, 0.456, 0.406]).to(device)
-        image_std = torch.tensor([0.229, 0.224, 0.225]).to(device)
-
-        images = (images - image_mean) / image_std
-        normalized_image = images.permute(0, 3, 1, 2)  # NHWC -> NCHW
-
-        # Process images
-        if model_name == "yolos-base" or model_name == "yolos-tiny":
-            outputs = model(pixel_values=normalized_image)
-        else:  # grounding-dino model
-            inputs = image_processor(text=targetObject, return_tensors="pt").to(device)
-            outputs = model(pixel_values=normalized_image, input_ids=inputs.input_ids)
-
-        # Get target sizes for each image
-        target_sizes = torch.tensor(
-            [normalized_image[0].shape[1:]] * normalized_image.shape[0]
-        ).to(device)
-
-        # Post-process results for each image
-        if model_name == "yolos-base" or model_name == "yolos-tiny":
-            results = image_processor.post_process_object_detection(
-                outputs, threshold=0.2, target_sizes=target_sizes
-            )
-        else:  # grounding-dino model
-            results = image_processor.post_process_grounded_object_detection(
-                outputs,
-                inputs.input_ids,
-                box_threshold=0.4,
-                text_threshold=0.3,
-                target_sizes=target_sizes,
-            )
-
-        sum_avg_scores = 0
-        for i, result in enumerate(results):
-            if model_name == "yolos-base" or model_name == "yolos-tiny":
-                id = model.config.label2id[targetObject]
-                # get index of targetObject's label
-                index = torch.where(result["labels"] == id)
-                if len(index[0]) == 0:  # index: ([],[]) so index[0] is the first list
-                    sum_avg_scores = torch.sum(
-                        outputs.logits - outputs.logits
-                    )  # set sum_avg_scores to 0
-                    continue
-                scores = result["scores"][index]
-            else:  # grounding-dino model
-                if result["scores"].shape[0] == 0:
-                    sum_avg_scores = torch.sum(
-                        outputs.last_hidden_state - outputs.last_hidden_state
-                    )  # set sum_avg_scores to 0
-                    continue
-                scores = result["scores"]
-            sum_avg_scores = sum_avg_scores + (torch.sum(scores) / scores.shape[0])
-
-        loss = sum_avg_scores / len(results)
-        reward = 1 - loss
-
-        return loss, reward
-
-    return loss_fn
-
-
-def compression_loss_fn(
-    inference_dtype=None, device=None, target=None, grad_scale=0, model_path=None
-):
-    """
-    Args:
-        inference_dtype: torch.dtype, the data type of the model.
-        device: torch.device, the device to run the model.
-        model_path: str, the path of the compression model.
-
-    Returns:
-        loss_fn: function, the loss function of the compression reward function.
-    """
-    scorer = JpegCompressionScorer(dtype=inference_dtype, model_path=model_path).to(
-        device, dtype=inference_dtype
-    )
-    scorer.requires_grad_(False)
-    scorer.eval()
-
-    def loss_fn(im_pix_un):
-        im_pix = ((im_pix_un + 1) / 2).clamp(0, 1)
-        rewards = scorer(im_pix)
-
-        if target is None:
-            loss = rewards
-        else:
-            loss = abs(rewards - target)
-        return loss.mean() * grad_scale, rewards.mean()
-
-    return loss_fn
-
-
-def actpred_loss_fn(inference_dtype=None, device=None, num_frames=14, target_size=224):
-    scorer = ActPredScorer(device=device, num_frames=num_frames, dtype=inference_dtype)
-    scorer.requires_grad_(False)
-
-    def preprocess_img(img):
-        img = ((img / 2) + 0.5).clamp(0, 1)
-        img = torchvision.transforms.Resize((target_size, target_size), antialias=True)(
-            img
-        )
-        img = torchvision.transforms.Normalize(
-            mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]
-        )(img)
-        return img
-
-    def loss_fn(vid, target_action_label):
-        vid = torch.cat([preprocess_img(img).unsqueeze(0) for img in vid])[None]
-        return scorer.get_loss_and_score(vid, target_action_label)
-
-    return loss_fn
-
-
-def should_sample(global_step, validation_steps, is_sample_preview):
-    return (
-        global_step % validation_steps == 0 or global_step == 1
-    ) and is_sample_preview
diff --git a/videotuna/models/lvdm/models/rlhf_utils/rl_ddim.py b/videotuna/models/lvdm/models/rlhf_utils/rl_ddim.py
deleted file mode 100644
index 2f3245c9..00000000
--- a/videotuna/models/lvdm/models/rlhf_utils/rl_ddim.py
+++ /dev/null
@@ -1,526 +0,0 @@
-# Adapted from VideoCrafter: https://github.com/AILab-CVC/VideoCrafter
-import random
-
-import numpy as np
-import torch
-from models.lvdm.modules.utils import noise_like
-from tqdm import tqdm
-
-from videotuna.utils.diffusion_utils import (
-    make_ddim_sampling_parameters,
-    make_ddim_timesteps,
-)
-
-
-class DDIMSampler(object):
-    def __init__(self, model, schedule="linear", **kwargs):
-        super().__init__()
-        self.model = model
-        self.ddpm_num_timesteps = model.num_timesteps
-        self.schedule = schedule
-        self.counter = 0
-        self.backprop_mode = "last"  # default
-        self.training_mode = False  # default
-
-    def register_buffer(self, name, attr):
-        if type(attr) == torch.Tensor:
-            if attr.device != torch.device("cuda"):
-                attr = attr.to(torch.device("cuda"))
-        setattr(self, name, attr)
-
-    def make_schedule(
-        self, ddim_num_steps, ddim_discretize="uniform", ddim_eta=0.0, verbose=True
-    ):
-        self.ddim_timesteps = make_ddim_timesteps(
-            ddim_discr_method=ddim_discretize,
-            num_ddim_timesteps=ddim_num_steps,
-            num_ddpm_timesteps=self.ddpm_num_timesteps,
-            verbose=verbose,
-        )
-        alphas_cumprod = self.model.diffusion_scheduler.alphas_cumprod
-        assert (
-            alphas_cumprod.shape[0] == self.ddpm_num_timesteps
-        ), "alphas have to be defined for each timestep"
-        to_torch = lambda x: x.clone().detach().to(torch.float32).to(self.model.device)
-
-        self.register_buffer("betas", to_torch(self.model.diffusion_scheduler.betas))
-        self.register_buffer("alphas_cumprod", to_torch(alphas_cumprod))
-        self.register_buffer(
-            "alphas_cumprod_prev",
-            to_torch(self.model.diffusion_scheduler.alphas_cumprod_prev),
-        )
-        self.use_scale = self.model.use_scale
-
-        if self.use_scale:
-            self.register_buffer("scale_arr", to_torch(self.model.scale_arr))
-            ddim_scale_arr = self.scale_arr.cpu()[self.ddim_timesteps]
-            self.register_buffer("ddim_scale_arr", ddim_scale_arr)
-            ddim_scale_arr = np.asarray(
-                [self.scale_arr.cpu()[0]]
-                + self.scale_arr.cpu()[self.ddim_timesteps[:-1]].tolist()
-            )
-            self.register_buffer("ddim_scale_arr_prev", ddim_scale_arr)
-
-        # calculations for diffusion q(x_t | x_{t-1}) and others
-        self.register_buffer(
-            "sqrt_alphas_cumprod", to_torch(np.sqrt(alphas_cumprod.cpu()))
-        )
-        self.register_buffer(
-            "sqrt_one_minus_alphas_cumprod",
-            to_torch(np.sqrt(1.0 - alphas_cumprod.cpu())),
-        )
-        self.register_buffer(
-            "log_one_minus_alphas_cumprod", to_torch(np.log(1.0 - alphas_cumprod.cpu()))
-        )
-        self.register_buffer(
-            "sqrt_recip_alphas_cumprod", to_torch(np.sqrt(1.0 / alphas_cumprod.cpu()))
-        )
-        self.register_buffer(
-            "sqrt_recipm1_alphas_cumprod",
-            to_torch(np.sqrt(1.0 / alphas_cumprod.cpu() - 1)),
-        )
-
-        # ddim sampling parameters
-        ddim_sigmas, ddim_alphas, ddim_alphas_prev = make_ddim_sampling_parameters(
-            alphacums=alphas_cumprod.cpu(),
-            ddim_timesteps=self.ddim_timesteps,
-            eta=ddim_eta,
-            verbose=verbose,
-        )
-        self.register_buffer("ddim_sigmas", ddim_sigmas)
-        self.register_buffer("ddim_alphas", ddim_alphas)
-        self.register_buffer("ddim_alphas_prev", ddim_alphas_prev)
-        self.register_buffer("ddim_sqrt_one_minus_alphas", np.sqrt(1.0 - ddim_alphas))
-        sigmas_for_original_sampling_steps = ddim_eta * torch.sqrt(
-            (1 - self.alphas_cumprod_prev)
-            / (1 - self.alphas_cumprod)
-            * (1 - self.alphas_cumprod / self.alphas_cumprod_prev)
-        )
-        self.register_buffer(
-            "ddim_sigmas_for_original_num_steps", sigmas_for_original_sampling_steps
-        )
-
-    # @torch.no_grad()
-    def sample(
-        self,
-        S,
-        batch_size,
-        shape,
-        conditioning=None,
-        callback=None,
-        normals_sequence=None,
-        img_callback=None,
-        quantize_x0=False,
-        eta=0.0,
-        mask=None,
-        x0=None,
-        temperature=1.0,
-        noise_dropout=0.0,
-        score_corrector=None,
-        corrector_kwargs=None,
-        verbose=True,
-        schedule_verbose=False,
-        x_T=None,
-        log_every_t=100,
-        unconditional_guidance_scale=1.0,
-        unconditional_conditioning=None,
-        # this has to come in the same format as the conditioning, # e.g. as encoded tokens, ...
-        **kwargs,
-    ):
-
-        # check condition bs
-        if conditioning is not None:
-            if isinstance(conditioning, dict):
-                try:
-                    cbs = conditioning[list(conditioning.keys())[0]].shape[0]
-                except:
-                    cbs = conditioning[list(conditioning.keys())[0]][0].shape[0]
-
-                if cbs != batch_size:
-                    print(
-                        f"Warning: Got {cbs} conditionings but batch-size is {batch_size}"
-                    )
-            else:
-                if conditioning.shape[0] != batch_size:
-                    print(
-                        f"Warning: Got {conditioning.shape[0]} conditionings but batch-size is {batch_size}"
-                    )
-
-        self.make_schedule(ddim_num_steps=S, ddim_eta=eta, verbose=schedule_verbose)
-        self.ddim_num_steps = S  # for ddim_sampling
-
-        # make shape
-        if len(shape) == 3:
-            C, H, W = shape
-            size = (batch_size, C, H, W)
-        elif len(shape) == 4:
-            C, T, H, W = shape
-            size = (batch_size, C, T, H, W)
-        # print(f'Data shape for DDIM sampling is {size}, eta {eta}')
-
-        samples, intermediates = self.ddim_sampling(
-            conditioning,
-            size,  # samples: batch, c, t, h, w
-            callback=callback,
-            img_callback=img_callback,
-            quantize_denoised=quantize_x0,
-            mask=mask,
-            x0=x0,
-            ddim_use_original_steps=False,
-            noise_dropout=noise_dropout,
-            temperature=temperature,
-            score_corrector=score_corrector,
-            corrector_kwargs=corrector_kwargs,
-            x_T=x_T,
-            log_every_t=log_every_t,
-            unconditional_guidance_scale=unconditional_guidance_scale,
-            unconditional_conditioning=unconditional_conditioning,
-            verbose=verbose,
-            **kwargs,
-        )
-        return samples, intermediates
-
-    # @torch.no_grad()
-    def ddim_sampling(
-        self,
-        cond,
-        shape,
-        x_T=None,
-        ddim_use_original_steps=False,
-        callback=None,
-        timesteps=None,
-        quantize_denoised=False,
-        mask=None,
-        x0=None,
-        img_callback=None,
-        log_every_t=100,
-        temperature=1.0,
-        noise_dropout=0.0,
-        score_corrector=None,
-        corrector_kwargs=None,
-        unconditional_guidance_scale=1.0,
-        unconditional_conditioning=None,
-        verbose=True,
-        cond_tau=1.0,
-        target_size=None,
-        start_timesteps=None,
-        **kwargs,
-    ):
-        device = self.model.diffusion_scheduler.betas.device
-        # print('ddim device', device)
-        b = shape[0]
-        if x_T is None:
-            img = torch.randn(shape, device=device)
-        else:
-            img = x_T
-
-        if timesteps is None:
-            timesteps = (
-                self.ddpm_num_timesteps
-                if ddim_use_original_steps
-                else self.ddim_timesteps
-            )
-        elif timesteps is not None and not ddim_use_original_steps:
-            subset_end = (
-                int(
-                    min(timesteps / self.ddim_timesteps.shape[0], 1)
-                    * self.ddim_timesteps.shape[0]
-                )
-                - 1
-            )
-            timesteps = self.ddim_timesteps[:subset_end]
-
-        intermediates = {"x_inter": [img], "pred_x0": [img]}
-        time_range = (
-            reversed(range(0, timesteps))
-            if ddim_use_original_steps
-            else np.flip(timesteps)
-        )
-        total_steps = timesteps if ddim_use_original_steps else timesteps.shape[0]
-        if verbose:
-            iterator = tqdm(time_range, desc="DDIM Sampler", total=total_steps)
-        else:
-            iterator = time_range
-
-        init_x0 = False
-        clean_cond = kwargs.pop("clean_cond", False)
-
-        if self.training_mode == True:
-            # print("Training mode", self.training_mode)
-            if self.backprop_mode == "last":
-                backprop_cutoff_idx = self.ddim_num_steps - 1
-            elif self.backprop_mode == "rand":
-                backprop_cutoff_idx = random.randint(0, self.ddim_num_steps - 1)
-            elif self.backprop_mode == "specific":
-                backprop_cutoff_idx = 15
-
-        for i, step in enumerate(iterator):
-            index = total_steps - i - 1
-            ts = torch.full((b,), step, device=device, dtype=torch.long)
-
-            if self.training_mode == True:
-                if i >= backprop_cutoff_idx:
-                    for name, param in self.model.named_parameters():
-                        if "lora" in name:
-                            # print(name,"restting grad")
-                            param.requires_grad = True
-                else:
-                    for name, param in self.model.named_parameters():
-                        param.requires_grad = False
-
-            if start_timesteps is not None:
-                assert x0 is not None
-                if step > start_timesteps * time_range[0]:
-                    continue
-                elif not init_x0:
-                    img = self.model.q_sample(x0, ts)
-                    init_x0 = True
-
-            # use mask to blend noised original latent (img_orig) & new sampled latent (img)
-            if mask is not None:
-                assert x0 is not None
-                if clean_cond:
-                    img_orig = x0
-                else:
-                    img_orig = self.model.q_sample(
-                        x0, ts
-                    )  # TODO: deterministic forward pass? <ddim inversion>
-                img = (
-                    img_orig * mask + (1.0 - mask) * img
-                )  # keep original & modify use img
-
-            index_clip = int((1 - cond_tau) * total_steps)
-            if index <= index_clip and target_size is not None:
-                target_size_ = [
-                    target_size[0],
-                    target_size[1] // 8,
-                    target_size[2] // 8,
-                ]
-                img = torch.nn.functional.interpolate(
-                    img,
-                    size=target_size_,
-                    mode="nearest",
-                )
-
-            # forward_context = torch.autograd.graph.save_on_cpu
-            # with forward_context():
-            outs = self.p_sample_ddim(
-                img,
-                cond,
-                ts,
-                index=index,
-                use_original_steps=ddim_use_original_steps,
-                quantize_denoised=quantize_denoised,
-                temperature=temperature,
-                noise_dropout=noise_dropout,
-                score_corrector=score_corrector,
-                corrector_kwargs=corrector_kwargs,
-                unconditional_guidance_scale=unconditional_guidance_scale,
-                unconditional_conditioning=unconditional_conditioning,
-                x0=x0,
-                **kwargs,
-            )
-
-            img, pred_x0 = outs
-
-            if callback:
-                callback(i)
-            if img_callback:
-                img_callback(pred_x0, i)
-
-            if index % log_every_t == 0 or index == total_steps - 1:
-                intermediates["x_inter"].append(img)
-                intermediates["pred_x0"].append(pred_x0)
-
-        return img, intermediates
-
-    # @torch.no_grad()
-    def p_sample_ddim(
-        self,
-        x,
-        c,
-        t,
-        index,
-        repeat_noise=False,
-        use_original_steps=False,
-        quantize_denoised=False,
-        temperature=1.0,
-        noise_dropout=0.0,
-        score_corrector=None,
-        corrector_kwargs=None,
-        unconditional_guidance_scale=1.0,
-        unconditional_conditioning=None,
-        uc_type=None,
-        conditional_guidance_scale_temporal=None,
-        **kwargs,
-    ):
-        b, *_, device = *x.shape, x.device
-
-        if x.dim() == 5:
-            is_video = True
-        else:
-            is_video = False
-        if unconditional_conditioning is None or unconditional_guidance_scale == 1.0:
-            e_t = self.model.apply_model(x, t, c, **kwargs)  # unet denoiser
-        else:
-            # with unconditional condition
-            if isinstance(c, torch.Tensor):
-                e_t = self.model.apply_model(x, t, c, **kwargs)  # unet denoiser
-                e_t_uncond = self.model.apply_model(
-                    x, t, unconditional_conditioning, **kwargs
-                )
-            elif isinstance(c, dict):
-                e_t = self.model.apply_model(x, t, c, **kwargs)  # unet denoiser
-                e_t_uncond = self.model.apply_model(
-                    x, t, unconditional_conditioning, **kwargs
-                )  # unet denoiser
-            else:
-                raise NotImplementedError
-
-            if uc_type is None:
-                e_t = e_t_uncond + unconditional_guidance_scale * (e_t - e_t_uncond)
-            else:
-                if uc_type == "cfg_original":
-                    e_t = e_t + unconditional_guidance_scale * (e_t - e_t_uncond)
-                elif uc_type == "cfg_ours":
-                    e_t = e_t + unconditional_guidance_scale * (e_t_uncond - e_t)
-                else:
-                    raise NotImplementedError
-
-            # temporal guidance
-            if conditional_guidance_scale_temporal is not None:
-                e_t_temporal = self.model.apply_model(x, t, c, **kwargs)
-                e_t_image = self.model.apply_model(
-                    x, t, c, no_temporal_attn=True, **kwargs
-                )
-                e_t = e_t + conditional_guidance_scale_temporal * (
-                    e_t_temporal - e_t_image
-                )
-
-        if score_corrector is not None:
-            assert self.model.parameterization == "eps"
-            e_t = score_corrector.modify_score(
-                self.model, e_t, x, t, c, **corrector_kwargs
-            )
-
-        alphas = self.model.alphas_cumprod if use_original_steps else self.ddim_alphas
-        alphas_prev = (
-            self.model.alphas_cumprod_prev
-            if use_original_steps
-            else self.ddim_alphas_prev
-        )
-        sqrt_one_minus_alphas = (
-            self.model.sqrt_one_minus_alphas_cumprod
-            if use_original_steps
-            else self.ddim_sqrt_one_minus_alphas
-        )
-        sigmas = (
-            self.model.ddim_sigmas_for_original_num_steps
-            if use_original_steps
-            else self.ddim_sigmas
-        )
-        # select parameters corresponding to the currently considered timestep
-
-        if is_video:
-            size = (b, 1, 1, 1, 1)
-        else:
-            size = (b, 1, 1, 1)
-        a_t = torch.full(size, alphas[index], device=device)
-        a_prev = torch.full(size, alphas_prev[index], device=device)
-        sigma_t = torch.full(size, sigmas[index], device=device)
-        sqrt_one_minus_at = torch.full(
-            size, sqrt_one_minus_alphas[index], device=device
-        )
-
-        # current prediction for x_0
-        pred_x0 = (x - sqrt_one_minus_at * e_t) / a_t.sqrt()
-        if quantize_denoised:
-            pred_x0, _, *_ = self.model.first_stage_model.quantize(pred_x0)
-        # direction pointing to x_t
-        dir_xt = (1.0 - a_prev - sigma_t**2).sqrt() * e_t
-
-        noise = sigma_t * noise_like(x.shape, device, repeat_noise) * temperature
-        if noise_dropout > 0.0:
-            noise = torch.nn.functional.dropout(noise, p=noise_dropout)
-        alphas = self.model.alphas_cumprod if use_original_steps else self.ddim_alphas
-        if self.use_scale:
-            scale_arr = (
-                self.model.scale_arr if use_original_steps else self.ddim_scale_arr
-            )
-            scale_t = torch.full(size, scale_arr[index], device=device)
-            scale_arr_prev = (
-                self.model.scale_arr_prev
-                if use_original_steps
-                else self.ddim_scale_arr_prev
-            )
-            scale_t_prev = torch.full(size, scale_arr_prev[index], device=device)
-            pred_x0 /= scale_t
-            x_prev = a_prev.sqrt() * scale_t_prev * pred_x0 + dir_xt + noise
-        else:
-            x_prev = a_prev.sqrt() * pred_x0 + dir_xt + noise
-
-        return x_prev, pred_x0
-
-    @torch.no_grad()
-    def stochastic_encode(self, x0, t, use_original_steps=False, noise=None):
-        # fast, but does not allow for exact reconstruction
-        # t serves as an index to gather the correct alphas
-        if use_original_steps:
-            sqrt_alphas_cumprod = self.sqrt_alphas_cumprod
-            sqrt_one_minus_alphas_cumprod = self.sqrt_one_minus_alphas_cumprod
-        else:
-            sqrt_alphas_cumprod = torch.sqrt(self.ddim_alphas)
-            sqrt_one_minus_alphas_cumprod = self.ddim_sqrt_one_minus_alphas
-
-        if noise is None:
-            noise = torch.randn_like(x0)
-
-        def extract_into_tensor(a, t, x_shape):
-            b, *_ = t.shape
-            out = a.gather(-1, t)
-            return out.reshape(b, *((1,) * (len(x_shape) - 1)))
-
-        return (
-            extract_into_tensor(sqrt_alphas_cumprod, t, x0.shape) * x0
-            + extract_into_tensor(sqrt_one_minus_alphas_cumprod, t, x0.shape) * noise
-        )
-
-    @torch.no_grad()
-    def decode(
-        self,
-        x_latent,
-        cond,
-        t_start,
-        unconditional_guidance_scale=1.0,
-        unconditional_conditioning=None,
-        use_original_steps=False,
-    ):
-
-        timesteps = (
-            np.arange(self.ddpm_num_timesteps)
-            if use_original_steps
-            else self.ddim_timesteps
-        )
-        timesteps = timesteps[:t_start]
-
-        time_range = np.flip(timesteps)
-        total_steps = timesteps.shape[0]
-        print(f"Running DDIM Sampling with {total_steps} timesteps")
-
-        iterator = tqdm(time_range, desc="Decoding image", total=total_steps)
-        x_dec = x_latent
-        for i, step in enumerate(iterator):
-            index = total_steps - i - 1
-            ts = torch.full(
-                (x_latent.shape[0],), step, device=x_latent.device, dtype=torch.long
-            )
-            x_dec, _ = self.p_sample_ddim(
-                x_dec,
-                cond,
-                ts,
-                index=index,
-                use_original_steps=use_original_steps,
-                unconditional_guidance_scale=unconditional_guidance_scale,
-                unconditional_conditioning=unconditional_conditioning,
-            )
-        return x_dec
diff --git a/videotuna/models/lvdm/models/rlhf_utils/weather_scorer.py b/videotuna/models/lvdm/models/rlhf_utils/weather_scorer.py
deleted file mode 100644
index 8ea0a507..00000000
--- a/videotuna/models/lvdm/models/rlhf_utils/weather_scorer.py
+++ /dev/null
@@ -1,182 +0,0 @@
-# Copy from Cheng An Hsieh, et. al.: https://github.com/RewardMultiverse/reward-multiverse
-import torch
-import torch.nn as nn
-import torchvision
-from transformers import CLIPModel, CLIPProcessor
-
-
-class SimpleCNN(nn.Module):  # parameter = 6333513
-    def __init__(self, num_class=None):
-        super(SimpleCNN, self).__init__()
-        self.layer1 = nn.Sequential(
-            nn.Conv2d(3, 16, kernel_size=3, stride=1, padding=1),
-            nn.ReLU(),
-            nn.MaxPool2d(kernel_size=2, stride=2),
-        )
-        self.layer2 = nn.Sequential(
-            nn.Conv2d(16, 32, kernel_size=3, stride=1, padding=1),
-            nn.ReLU(),
-            nn.MaxPool2d(kernel_size=2, stride=2),
-        )
-        self.layer3 = nn.Sequential(
-            nn.Conv2d(32, 64, kernel_size=3, stride=1, padding=1),
-            nn.ReLU(),
-            nn.MaxPool2d(kernel_size=2, stride=2),
-        )
-        self.layer4 = nn.Sequential(
-            nn.Conv2d(64, 128, kernel_size=3, stride=1, padding=1),
-            nn.ReLU(),
-            nn.MaxPool2d(kernel_size=2, stride=2),
-        )
-        self.fc1 = nn.Linear(128 * 32 * 32, 1000)
-        self.fc2 = nn.Linear(1000, num_class)
-
-    def forward(self, x):
-        x = self.layer1(x)
-        # print("x1", x.shape)
-        x = self.layer2(x)
-        # print("x2", x.shape)
-        x = self.layer3(x)
-        # print("x3", x.shape)
-        x = self.layer4(x)
-        # print("x4", x.shape)
-
-        x = x.reshape(x.size(0), -1)
-        # print("x reshape", x.shape)
-        x = torch.relu(self.fc1(x))
-        x = self.fc2(x)
-        return x
-
-
-class MLP(nn.Module):
-    def __init__(self):
-        super().__init__()
-        self.layers = nn.Sequential(  # regression
-            nn.Linear(768, 1024),
-            nn.Dropout(0.2),
-            nn.Linear(1024, 128),
-            nn.Dropout(0.2),
-            nn.Linear(128, 64),
-            nn.Dropout(0.1),
-            nn.Linear(64, 16),
-            nn.Linear(16, 1),
-            nn.Sigmoid(),
-        )
-
-        # self.layers = nn.Sequential(  # classification
-        #     nn.Linear(768, 1024),
-        #     nn.Dropout(0.2),
-        #     nn.Linear(1024, 128),
-        #     nn.Dropout(0.2),
-        #     nn.Linear(128, 64),
-        #     nn.Dropout(0.1),
-        #     nn.Linear(64, 16),
-        #     nn.Linear(16, 2)
-        # )
-
-    def forward(self, embed):
-        return self.layers(embed)
-
-
-class MLP_Resnet(nn.Module):
-    def __init__(self, num_class):
-        super().__init__()
-        self.layers = nn.Sequential(
-            nn.Linear(1000, 128),
-            # nn.Dropout(0.2),
-            nn.Linear(128, 64),
-            # nn.Dropout(0.2),
-            nn.Linear(64, 16),
-            nn.Linear(16, num_class),
-        )
-
-    def forward(self, embed):
-        return self.layers(embed)
-
-
-def weather_loss_fn(
-    target=None,  # TODO: use config.task to decide returned loss_fn
-    grad_scale=0,
-    device=None,
-    accelerator=None,
-    torch_dtype=None,
-    reward_model_resume_from=None,
-    num_of_labels=None,
-):
-    scorer = WeatherScorer(
-        dtype=torch_dtype, model_path=reward_model_resume_from, num_class=num_of_labels
-    ).to(device, dtype=torch_dtype)
-    scorer.requires_grad_(False)
-    scorer.eval()
-
-    def loss_fn(im_pix_un):
-        if accelerator.mixed_precision == "fp16":
-            with accelerator.autocast():
-                rewards = scorer(im_pix_un)
-        else:
-            rewards = scorer(im_pix_un)
-
-        target_tensors = torch.full((rewards.shape[0],), target).to(
-            rewards.device, dtype=rewards.dtype
-        )  # regression
-        criterion = torch.nn.MSELoss(reduction="sum")  # regression
-        # target_tensors = torch.full((rewards.shape[0],), target).to(rewards.device, dtype=torch.long)    # classification
-        # criterion = nn.CrossEntropyLoss(reduction="sum")    # classification
-        loss = criterion(rewards, target_tensors)
-        return (
-            loss * grad_scale,
-            rewards,
-        )  # nn.Softmax(dim=-1)(rewards)   # rewards (reg)
-
-    return loss_fn
-
-
-class WeatherModel(nn.Module):
-    def __init__(self, num_class=None):
-        super().__init__()
-        self.embed_model = torch.hub.load(
-            "pytorch/vision:v0.10.0", "resnet18", pretrained=True
-        )
-        self.score_model = MLP_Resnet(num_class)
-
-    def __call__(self, im):
-        return self.score_model(self.embed_model(im))
-
-
-class WeatherScorer(nn.Module):  # Reward model
-    def __init__(self, dtype=None, model_path=None, num_class=None):
-        super().__init__()
-        self.clip = CLIPModel.from_pretrained("openai/clip-vit-large-patch14")
-        self.clip.requires_grad_(False)
-        self.clip.eval()
-        self.score_generator = MLP()
-        # self.score_generator = WeatherModel(num_class)    # resnet + mlp
-        if model_path:
-            state_dict = torch.load(model_path)
-            self.score_generator.load_state_dict(state_dict)
-            self.score_generator.requires_grad_(False)
-            self.score_generator.eval()
-            # self.clip.requires_grad_(False)
-            # self.clip.eval()
-        else:
-            self.score_generator.requires_grad_(True)
-        if dtype:
-            self.dtype = dtype
-        self.target_size = (224, 224)  # resnet 224, cnn 512 (use 224 for both...?)
-        self.normalize = torchvision.transforms.Normalize(
-            mean=[0.48145466, 0.4578275, 0.40821073],
-            std=[0.26862954, 0.26130258, 0.27577711],
-        )
-
-    def set_device(self, device, inference_type):
-        self.clip.to(device, dtype=inference_type)  # uncomment for mlp
-        self.score_generator.to(device)  #  dtype = inference_dtype
-
-    def __call__(self, images):
-        device = next(self.parameters()).device
-        im_pix = torchvision.transforms.Resize(self.target_size)(images)
-        im_pix = self.normalize(im_pix).to(images.dtype)
-        embed = self.clip.get_image_features(pixel_values=im_pix)
-        embed = embed / torch.linalg.vector_norm(embed, dim=-1, keepdim=True)
-        return self.score_generator(embed).squeeze(1)  # CLIP + MLP
-        # return self.score_generator(im_pix).squeeze(1)    # for simpleCNN
diff --git a/videotuna/models/lvdm/modules/ae_modules.py b/videotuna/models/lvdm/modules/ae_modules.py
deleted file mode 100644
index 3c16119a..00000000
--- a/videotuna/models/lvdm/modules/ae_modules.py
+++ /dev/null
@@ -1,1027 +0,0 @@
-# pytorch_diffusion + derived encoder decoder
-import math
-
-import numpy as np
-import torch
-import torch.nn as nn
-from einops import rearrange
-
-from videotuna.models.lvdm.modules.attention import LinearAttention
-from videotuna.utils.common_utils import instantiate_from_config
-
-
-def nonlinearity(x):
-    # swish
-    return x * torch.sigmoid(x)
-
-
-def Normalize(in_channels, num_groups=32):
-    return torch.nn.GroupNorm(
-        num_groups=num_groups, num_channels=in_channels, eps=1e-6, affine=True
-    )
-
-
-class LinAttnBlock(LinearAttention):
-    """to match AttnBlock usage"""
-
-    def __init__(self, in_channels):
-        super().__init__(dim=in_channels, heads=1, dim_head=in_channels)
-
-
-class AttnBlock(nn.Module):
-    def __init__(self, in_channels):
-        super().__init__()
-        self.in_channels = in_channels
-
-        self.norm = Normalize(in_channels)
-        self.q = torch.nn.Conv2d(
-            in_channels, in_channels, kernel_size=1, stride=1, padding=0
-        )
-        self.k = torch.nn.Conv2d(
-            in_channels, in_channels, kernel_size=1, stride=1, padding=0
-        )
-        self.v = torch.nn.Conv2d(
-            in_channels, in_channels, kernel_size=1, stride=1, padding=0
-        )
-        self.proj_out = torch.nn.Conv2d(
-            in_channels, in_channels, kernel_size=1, stride=1, padding=0
-        )
-
-    def forward(self, x):
-        h_ = x
-        h_ = self.norm(h_)
-        q = self.q(h_)
-        k = self.k(h_)
-        v = self.v(h_)
-
-        # compute attention
-        b, c, h, w = q.shape
-        q = q.reshape(b, c, h * w)  # bcl
-        q = q.permute(0, 2, 1)  # bcl -> blc l=hw
-        k = k.reshape(b, c, h * w)  # bcl
-
-        w_ = torch.bmm(q, k)  # b,hw,hw    w[b,i,j]=sum_c q[b,i,c]k[b,c,j]
-        w_ = w_ * (int(c) ** (-0.5))
-        w_ = torch.nn.functional.softmax(w_, dim=2)
-
-        # attend to values
-        v = v.reshape(b, c, h * w)
-        w_ = w_.permute(0, 2, 1)  # b,hw,hw (first hw of k, second of q)
-        h_ = torch.bmm(v, w_)  # b, c,hw (hw of q) h_[b,c,j] = sum_i v[b,c,i] w_[b,i,j]
-        h_ = h_.reshape(b, c, h, w)
-
-        h_ = self.proj_out(h_)
-
-        return x + h_
-
-
-def make_attn(in_channels, attn_type="vanilla"):
-    assert attn_type in ["vanilla", "linear", "none"], f"attn_type {attn_type} unknown"
-    # print(f"making attention of type '{attn_type}' with {in_channels} in_channels")
-    if attn_type == "vanilla":
-        return AttnBlock(in_channels)
-    elif attn_type == "none":
-        return nn.Identity(in_channels)
-    else:
-        return LinAttnBlock(in_channels)
-
-
-class Downsample(nn.Module):
-    def __init__(self, in_channels, with_conv):
-        super().__init__()
-        self.with_conv = with_conv
-        self.in_channels = in_channels
-        if self.with_conv:
-            # no asymmetric padding in torch conv, must do it ourselves
-            self.conv = torch.nn.Conv2d(
-                in_channels, in_channels, kernel_size=3, stride=2, padding=0
-            )
-
-    def forward(self, x):
-        if self.with_conv:
-            pad = (0, 1, 0, 1)
-            x = torch.nn.functional.pad(x, pad, mode="constant", value=0)
-            x = self.conv(x)
-        else:
-            x = torch.nn.functional.avg_pool2d(x, kernel_size=2, stride=2)
-        return x
-
-
-class Upsample(nn.Module):
-    def __init__(self, in_channels, with_conv):
-        super().__init__()
-        self.with_conv = with_conv
-        self.in_channels = in_channels
-        if self.with_conv:
-            self.conv = torch.nn.Conv2d(
-                in_channels, in_channels, kernel_size=3, stride=1, padding=1
-            )
-
-    def forward(self, x):
-        x = torch.nn.functional.interpolate(x, scale_factor=2.0, mode="nearest")
-        if self.with_conv:
-            x = self.conv(x)
-        return x
-
-
-def get_timestep_embedding(timesteps, embedding_dim):
-    """
-    This matches the implementation in Denoising Diffusion Probabilistic Models:
-    From Fairseq.
-    Build sinusoidal embeddings.
-    This matches the implementation in tensor2tensor, but differs slightly
-    from the description in Section 3.5 of "Attention Is All You Need".
-    """
-    assert len(timesteps.shape) == 1
-
-    half_dim = embedding_dim // 2
-    emb = math.log(10000) / (half_dim - 1)
-    emb = torch.exp(torch.arange(half_dim, dtype=torch.float32) * -emb)
-    emb = emb.to(device=timesteps.device)
-    emb = timesteps.float()[:, None] * emb[None, :]
-    emb = torch.cat([torch.sin(emb), torch.cos(emb)], dim=1)
-    if embedding_dim % 2 == 1:  # zero pad
-        emb = torch.nn.functional.pad(emb, (0, 1, 0, 0))
-    return emb
-
-
-class ResnetBlock(nn.Module):
-    def __init__(
-        self,
-        *,
-        in_channels,
-        out_channels=None,
-        conv_shortcut=False,
-        dropout,
-        temb_channels=512,
-    ):
-        super().__init__()
-        self.in_channels = in_channels
-        out_channels = in_channels if out_channels is None else out_channels
-        self.out_channels = out_channels
-        self.use_conv_shortcut = conv_shortcut
-
-        self.norm1 = Normalize(in_channels)
-        self.conv1 = torch.nn.Conv2d(
-            in_channels, out_channels, kernel_size=3, stride=1, padding=1
-        )
-        if temb_channels > 0:
-            self.temb_proj = torch.nn.Linear(temb_channels, out_channels)
-        self.norm2 = Normalize(out_channels)
-        self.dropout = torch.nn.Dropout(dropout)
-        self.conv2 = torch.nn.Conv2d(
-            out_channels, out_channels, kernel_size=3, stride=1, padding=1
-        )
-        if self.in_channels != self.out_channels:
-            if self.use_conv_shortcut:
-                self.conv_shortcut = torch.nn.Conv2d(
-                    in_channels, out_channels, kernel_size=3, stride=1, padding=1
-                )
-            else:
-                self.nin_shortcut = torch.nn.Conv2d(
-                    in_channels, out_channels, kernel_size=1, stride=1, padding=0
-                )
-
-    def forward(self, x, temb):
-        h = x
-        h = self.norm1(h)
-        h = nonlinearity(h)
-        h = self.conv1(h)
-
-        if temb is not None:
-            h = h + self.temb_proj(nonlinearity(temb))[:, :, None, None]
-
-        h = self.norm2(h)
-        h = nonlinearity(h)
-        h = self.dropout(h)
-        h = self.conv2(h)
-
-        if self.in_channels != self.out_channels:
-            if self.use_conv_shortcut:
-                x = self.conv_shortcut(x)
-            else:
-                x = self.nin_shortcut(x)
-
-        return x + h
-
-
-class Model(nn.Module):
-    def __init__(
-        self,
-        *,
-        ch,
-        out_ch,
-        ch_mult=(1, 2, 4, 8),
-        num_res_blocks,
-        attn_resolutions,
-        dropout=0.0,
-        resamp_with_conv=True,
-        in_channels,
-        resolution,
-        use_timestep=True,
-        use_linear_attn=False,
-        attn_type="vanilla",
-    ):
-        super().__init__()
-        if use_linear_attn:
-            attn_type = "linear"
-        self.ch = ch
-        self.temb_ch = self.ch * 4
-        self.num_resolutions = len(ch_mult)
-        self.num_res_blocks = num_res_blocks
-        self.resolution = resolution
-        self.in_channels = in_channels
-
-        self.use_timestep = use_timestep
-        if self.use_timestep:
-            # timestep embedding
-            self.temb = nn.Module()
-            self.temb.dense = nn.ModuleList(
-                [
-                    torch.nn.Linear(self.ch, self.temb_ch),
-                    torch.nn.Linear(self.temb_ch, self.temb_ch),
-                ]
-            )
-
-        # downsampling
-        self.conv_in = torch.nn.Conv2d(
-            in_channels, self.ch, kernel_size=3, stride=1, padding=1
-        )
-
-        curr_res = resolution
-        in_ch_mult = (1,) + tuple(ch_mult)
-        self.down = nn.ModuleList()
-        for i_level in range(self.num_resolutions):
-            block = nn.ModuleList()
-            attn = nn.ModuleList()
-            block_in = ch * in_ch_mult[i_level]
-            block_out = ch * ch_mult[i_level]
-            for i_block in range(self.num_res_blocks):
-                block.append(
-                    ResnetBlock(
-                        in_channels=block_in,
-                        out_channels=block_out,
-                        temb_channels=self.temb_ch,
-                        dropout=dropout,
-                    )
-                )
-                block_in = block_out
-                if curr_res in attn_resolutions:
-                    attn.append(make_attn(block_in, attn_type=attn_type))
-            down = nn.Module()
-            down.block = block
-            down.attn = attn
-            if i_level != self.num_resolutions - 1:
-                down.downsample = Downsample(block_in, resamp_with_conv)
-                curr_res = curr_res // 2
-            self.down.append(down)
-
-        # middle
-        self.mid = nn.Module()
-        self.mid.block_1 = ResnetBlock(
-            in_channels=block_in,
-            out_channels=block_in,
-            temb_channels=self.temb_ch,
-            dropout=dropout,
-        )
-        self.mid.attn_1 = make_attn(block_in, attn_type=attn_type)
-        self.mid.block_2 = ResnetBlock(
-            in_channels=block_in,
-            out_channels=block_in,
-            temb_channels=self.temb_ch,
-            dropout=dropout,
-        )
-
-        # upsampling
-        self.up = nn.ModuleList()
-        for i_level in reversed(range(self.num_resolutions)):
-            block = nn.ModuleList()
-            attn = nn.ModuleList()
-            block_out = ch * ch_mult[i_level]
-            skip_in = ch * ch_mult[i_level]
-            for i_block in range(self.num_res_blocks + 1):
-                if i_block == self.num_res_blocks:
-                    skip_in = ch * in_ch_mult[i_level]
-                block.append(
-                    ResnetBlock(
-                        in_channels=block_in + skip_in,
-                        out_channels=block_out,
-                        temb_channels=self.temb_ch,
-                        dropout=dropout,
-                    )
-                )
-                block_in = block_out
-                if curr_res in attn_resolutions:
-                    attn.append(make_attn(block_in, attn_type=attn_type))
-            up = nn.Module()
-            up.block = block
-            up.attn = attn
-            if i_level != 0:
-                up.upsample = Upsample(block_in, resamp_with_conv)
-                curr_res = curr_res * 2
-            self.up.insert(0, up)  # prepend to get consistent order
-
-        # end
-        self.norm_out = Normalize(block_in)
-        self.conv_out = torch.nn.Conv2d(
-            block_in, out_ch, kernel_size=3, stride=1, padding=1
-        )
-
-    def forward(self, x, t=None, context=None):
-        # assert x.shape[2] == x.shape[3] == self.resolution
-        if context is not None:
-            # assume aligned context, cat along channel axis
-            x = torch.cat((x, context), dim=1)
-        if self.use_timestep:
-            # timestep embedding
-            assert t is not None
-            temb = get_timestep_embedding(t, self.ch)
-            temb = self.temb.dense[0](temb)
-            temb = nonlinearity(temb)
-            temb = self.temb.dense[1](temb)
-        else:
-            temb = None
-
-        # downsampling
-        hs = [self.conv_in(x)]
-        for i_level in range(self.num_resolutions):
-            for i_block in range(self.num_res_blocks):
-                h = self.down[i_level].block[i_block](hs[-1], temb)
-                if len(self.down[i_level].attn) > 0:
-                    h = self.down[i_level].attn[i_block](h)
-                hs.append(h)
-            if i_level != self.num_resolutions - 1:
-                hs.append(self.down[i_level].downsample(hs[-1]))
-
-        # middle
-        h = hs[-1]
-        h = self.mid.block_1(h, temb)
-        h = self.mid.attn_1(h)
-        h = self.mid.block_2(h, temb)
-
-        # upsampling
-        for i_level in reversed(range(self.num_resolutions)):
-            for i_block in range(self.num_res_blocks + 1):
-                h = self.up[i_level].block[i_block](
-                    torch.cat([h, hs.pop()], dim=1), temb
-                )
-                if len(self.up[i_level].attn) > 0:
-                    h = self.up[i_level].attn[i_block](h)
-            if i_level != 0:
-                h = self.up[i_level].upsample(h)
-
-        # end
-        h = self.norm_out(h)
-        h = nonlinearity(h)
-        h = self.conv_out(h)
-        return h
-
-    def get_last_layer(self):
-        return self.conv_out.weight
-
-
-class Encoder(nn.Module):
-    def __init__(
-        self,
-        *,
-        ch,
-        out_ch,
-        ch_mult=(1, 2, 4, 8),
-        num_res_blocks,
-        attn_resolutions,
-        dropout=0.0,
-        resamp_with_conv=True,
-        in_channels,
-        resolution,
-        z_channels,
-        double_z=True,
-        use_linear_attn=False,
-        attn_type="vanilla",
-        **ignore_kwargs,
-    ):
-        super().__init__()
-        if use_linear_attn:
-            attn_type = "linear"
-        self.ch = ch
-        self.temb_ch = 0
-        self.num_resolutions = len(ch_mult)
-        self.num_res_blocks = num_res_blocks
-        self.resolution = resolution
-        self.in_channels = in_channels
-
-        # downsampling
-        self.conv_in = torch.nn.Conv2d(
-            in_channels, self.ch, kernel_size=3, stride=1, padding=1
-        )
-
-        curr_res = resolution
-        in_ch_mult = (1,) + tuple(ch_mult)
-        self.in_ch_mult = in_ch_mult
-        self.down = nn.ModuleList()
-        for i_level in range(self.num_resolutions):
-            block = nn.ModuleList()
-            attn = nn.ModuleList()
-            block_in = ch * in_ch_mult[i_level]
-            block_out = ch * ch_mult[i_level]
-            for i_block in range(self.num_res_blocks):
-                block.append(
-                    ResnetBlock(
-                        in_channels=block_in,
-                        out_channels=block_out,
-                        temb_channels=self.temb_ch,
-                        dropout=dropout,
-                    )
-                )
-                block_in = block_out
-                if curr_res in attn_resolutions:
-                    attn.append(make_attn(block_in, attn_type=attn_type))
-            down = nn.Module()
-            down.block = block
-            down.attn = attn
-            if i_level != self.num_resolutions - 1:
-                down.downsample = Downsample(block_in, resamp_with_conv)
-                curr_res = curr_res // 2
-            self.down.append(down)
-
-        # middle
-        self.mid = nn.Module()
-        self.mid.block_1 = ResnetBlock(
-            in_channels=block_in,
-            out_channels=block_in,
-            temb_channels=self.temb_ch,
-            dropout=dropout,
-        )
-        self.mid.attn_1 = make_attn(block_in, attn_type=attn_type)
-        self.mid.block_2 = ResnetBlock(
-            in_channels=block_in,
-            out_channels=block_in,
-            temb_channels=self.temb_ch,
-            dropout=dropout,
-        )
-
-        # end
-        self.norm_out = Normalize(block_in)
-        self.conv_out = torch.nn.Conv2d(
-            block_in,
-            2 * z_channels if double_z else z_channels,
-            kernel_size=3,
-            stride=1,
-            padding=1,
-        )
-
-    def forward(self, x):
-        # timestep embedding
-        temb = None
-
-        # print(f'encoder-input={x.shape}')
-        # downsampling
-        hs = [self.conv_in(x)]
-        # print(f'encoder-conv in feat={hs[0].shape}')
-        for i_level in range(self.num_resolutions):
-            for i_block in range(self.num_res_blocks):
-                h = self.down[i_level].block[i_block](hs[-1], temb)
-                # print(f'encoder-down feat={h.shape}')
-                if len(self.down[i_level].attn) > 0:
-                    h = self.down[i_level].attn[i_block](h)
-                hs.append(h)
-            if i_level != self.num_resolutions - 1:
-                # print(f'encoder-downsample (input)={hs[-1].shape}')
-                hs.append(self.down[i_level].downsample(hs[-1]))
-                # print(f'encoder-downsample (output)={hs[-1].shape}')
-
-        # middle
-        h = hs[-1]
-        h = self.mid.block_1(h, temb)
-        # print(f'encoder-mid1 feat={h.shape}')
-        h = self.mid.attn_1(h)
-        h = self.mid.block_2(h, temb)
-        # print(f'encoder-mid2 feat={h.shape}')
-
-        # end
-        h = self.norm_out(h)
-        h = nonlinearity(h)
-        h = self.conv_out(h)
-        # print(f'end feat={h.shape}')
-        return h
-
-
-class Decoder(nn.Module):
-    def __init__(
-        self,
-        *,
-        ch,
-        out_ch,
-        ch_mult=(1, 2, 4, 8),
-        num_res_blocks,
-        attn_resolutions,
-        dropout=0.0,
-        resamp_with_conv=True,
-        in_channels,
-        resolution,
-        z_channels,
-        give_pre_end=False,
-        tanh_out=False,
-        use_linear_attn=False,
-        attn_type="vanilla",
-        **ignorekwargs,
-    ):
-        super().__init__()
-        if use_linear_attn:
-            attn_type = "linear"
-        self.ch = ch
-        self.temb_ch = 0
-        self.num_resolutions = len(ch_mult)
-        self.num_res_blocks = num_res_blocks
-        self.resolution = resolution
-        self.in_channels = in_channels
-        self.give_pre_end = give_pre_end
-        self.tanh_out = tanh_out
-
-        # compute in_ch_mult, block_in and curr_res at lowest res
-        in_ch_mult = (1,) + tuple(ch_mult)
-        block_in = ch * ch_mult[self.num_resolutions - 1]
-        curr_res = resolution // 2 ** (self.num_resolutions - 1)
-        self.z_shape = (1, z_channels, curr_res, curr_res)
-        print(
-            "AE working on z of shape {} = {} dimensions.".format(
-                self.z_shape, np.prod(self.z_shape)
-            )
-        )
-
-        # z to block_in
-        self.conv_in = torch.nn.Conv2d(
-            z_channels, block_in, kernel_size=3, stride=1, padding=1
-        )
-
-        # middle
-        self.mid = nn.Module()
-        self.mid.block_1 = ResnetBlock(
-            in_channels=block_in,
-            out_channels=block_in,
-            temb_channels=self.temb_ch,
-            dropout=dropout,
-        )
-        self.mid.attn_1 = make_attn(block_in, attn_type=attn_type)
-        self.mid.block_2 = ResnetBlock(
-            in_channels=block_in,
-            out_channels=block_in,
-            temb_channels=self.temb_ch,
-            dropout=dropout,
-        )
-
-        # upsampling
-        self.up = nn.ModuleList()
-        for i_level in reversed(range(self.num_resolutions)):
-            block = nn.ModuleList()
-            attn = nn.ModuleList()
-            block_out = ch * ch_mult[i_level]
-            for i_block in range(self.num_res_blocks + 1):
-                block.append(
-                    ResnetBlock(
-                        in_channels=block_in,
-                        out_channels=block_out,
-                        temb_channels=self.temb_ch,
-                        dropout=dropout,
-                    )
-                )
-                block_in = block_out
-                if curr_res in attn_resolutions:
-                    attn.append(make_attn(block_in, attn_type=attn_type))
-            up = nn.Module()
-            up.block = block
-            up.attn = attn
-            if i_level != 0:
-                up.upsample = Upsample(block_in, resamp_with_conv)
-                curr_res = curr_res * 2
-            self.up.insert(0, up)  # prepend to get consistent order
-
-        # end
-        self.norm_out = Normalize(block_in)
-        self.conv_out = torch.nn.Conv2d(
-            block_in, out_ch, kernel_size=3, stride=1, padding=1
-        )
-
-    def forward(self, z):
-        # assert z.shape[1:] == self.z_shape[1:]
-        self.last_z_shape = z.shape
-
-        # print(f'decoder-input={z.shape}')
-        # timestep embedding
-        temb = None
-
-        # z to block_in
-        h = self.conv_in(z)
-        # print(f'decoder-conv in feat={h.shape}')
-
-        # middle
-        h = self.mid.block_1(h, temb)
-        h = self.mid.attn_1(h)
-        h = self.mid.block_2(h, temb)
-        # print(f'decoder-mid feat={h.shape}')
-
-        # upsampling
-        for i_level in reversed(range(self.num_resolutions)):
-            for i_block in range(self.num_res_blocks + 1):
-                h = self.up[i_level].block[i_block](h, temb)
-                if len(self.up[i_level].attn) > 0:
-                    h = self.up[i_level].attn[i_block](h)
-                # print(f'decoder-up feat={h.shape}')
-            if i_level != 0:
-                h = self.up[i_level].upsample(h)
-                # print(f'decoder-upsample feat={h.shape}')
-
-        # end
-        if self.give_pre_end:
-            return h
-
-        h = self.norm_out(h)
-        h = nonlinearity(h)
-        h = self.conv_out(h)
-        # print(f'decoder-conv_out feat={h.shape}')
-        if self.tanh_out:
-            h = torch.tanh(h)
-        return h
-
-
-class SimpleDecoder(nn.Module):
-    def __init__(self, in_channels, out_channels, *args, **kwargs):
-        super().__init__()
-        self.model = nn.ModuleList(
-            [
-                nn.Conv2d(in_channels, in_channels, 1),
-                ResnetBlock(
-                    in_channels=in_channels,
-                    out_channels=2 * in_channels,
-                    temb_channels=0,
-                    dropout=0.0,
-                ),
-                ResnetBlock(
-                    in_channels=2 * in_channels,
-                    out_channels=4 * in_channels,
-                    temb_channels=0,
-                    dropout=0.0,
-                ),
-                ResnetBlock(
-                    in_channels=4 * in_channels,
-                    out_channels=2 * in_channels,
-                    temb_channels=0,
-                    dropout=0.0,
-                ),
-                nn.Conv2d(2 * in_channels, in_channels, 1),
-                Upsample(in_channels, with_conv=True),
-            ]
-        )
-        # end
-        self.norm_out = Normalize(in_channels)
-        self.conv_out = torch.nn.Conv2d(
-            in_channels, out_channels, kernel_size=3, stride=1, padding=1
-        )
-
-    def forward(self, x):
-        for i, layer in enumerate(self.model):
-            if i in [1, 2, 3]:
-                x = layer(x, None)
-            else:
-                x = layer(x)
-
-        h = self.norm_out(x)
-        h = nonlinearity(h)
-        x = self.conv_out(h)
-        return x
-
-
-class UpsampleDecoder(nn.Module):
-    def __init__(
-        self,
-        in_channels,
-        out_channels,
-        ch,
-        num_res_blocks,
-        resolution,
-        ch_mult=(2, 2),
-        dropout=0.0,
-    ):
-        super().__init__()
-        # upsampling
-        self.temb_ch = 0
-        self.num_resolutions = len(ch_mult)
-        self.num_res_blocks = num_res_blocks
-        block_in = in_channels
-        curr_res = resolution // 2 ** (self.num_resolutions - 1)
-        self.res_blocks = nn.ModuleList()
-        self.upsample_blocks = nn.ModuleList()
-        for i_level in range(self.num_resolutions):
-            res_block = []
-            block_out = ch * ch_mult[i_level]
-            for i_block in range(self.num_res_blocks + 1):
-                res_block.append(
-                    ResnetBlock(
-                        in_channels=block_in,
-                        out_channels=block_out,
-                        temb_channels=self.temb_ch,
-                        dropout=dropout,
-                    )
-                )
-                block_in = block_out
-            self.res_blocks.append(nn.ModuleList(res_block))
-            if i_level != self.num_resolutions - 1:
-                self.upsample_blocks.append(Upsample(block_in, True))
-                curr_res = curr_res * 2
-
-        # end
-        self.norm_out = Normalize(block_in)
-        self.conv_out = torch.nn.Conv2d(
-            block_in, out_channels, kernel_size=3, stride=1, padding=1
-        )
-
-    def forward(self, x):
-        # upsampling
-        h = x
-        for k, i_level in enumerate(range(self.num_resolutions)):
-            for i_block in range(self.num_res_blocks + 1):
-                h = self.res_blocks[i_level][i_block](h, None)
-            if i_level != self.num_resolutions - 1:
-                h = self.upsample_blocks[k](h)
-        h = self.norm_out(h)
-        h = nonlinearity(h)
-        h = self.conv_out(h)
-        return h
-
-
-class LatentRescaler(nn.Module):
-    def __init__(self, factor, in_channels, mid_channels, out_channels, depth=2):
-        super().__init__()
-        # residual block, interpolate, residual block
-        self.factor = factor
-        self.conv_in = nn.Conv2d(
-            in_channels, mid_channels, kernel_size=3, stride=1, padding=1
-        )
-        self.res_block1 = nn.ModuleList(
-            [
-                ResnetBlock(
-                    in_channels=mid_channels,
-                    out_channels=mid_channels,
-                    temb_channels=0,
-                    dropout=0.0,
-                )
-                for _ in range(depth)
-            ]
-        )
-        self.attn = AttnBlock(mid_channels)
-        self.res_block2 = nn.ModuleList(
-            [
-                ResnetBlock(
-                    in_channels=mid_channels,
-                    out_channels=mid_channels,
-                    temb_channels=0,
-                    dropout=0.0,
-                )
-                for _ in range(depth)
-            ]
-        )
-
-        self.conv_out = nn.Conv2d(
-            mid_channels,
-            out_channels,
-            kernel_size=1,
-        )
-
-    def forward(self, x):
-        x = self.conv_in(x)
-        for block in self.res_block1:
-            x = block(x, None)
-        x = torch.nn.functional.interpolate(
-            x,
-            size=(
-                int(round(x.shape[2] * self.factor)),
-                int(round(x.shape[3] * self.factor)),
-            ),
-        )
-        x = self.attn(x)
-        for block in self.res_block2:
-            x = block(x, None)
-        x = self.conv_out(x)
-        return x
-
-
-class MergedRescaleEncoder(nn.Module):
-    def __init__(
-        self,
-        in_channels,
-        ch,
-        resolution,
-        out_ch,
-        num_res_blocks,
-        attn_resolutions,
-        dropout=0.0,
-        resamp_with_conv=True,
-        ch_mult=(1, 2, 4, 8),
-        rescale_factor=1.0,
-        rescale_module_depth=1,
-    ):
-        super().__init__()
-        intermediate_chn = ch * ch_mult[-1]
-        self.encoder = Encoder(
-            in_channels=in_channels,
-            num_res_blocks=num_res_blocks,
-            ch=ch,
-            ch_mult=ch_mult,
-            z_channels=intermediate_chn,
-            double_z=False,
-            resolution=resolution,
-            attn_resolutions=attn_resolutions,
-            dropout=dropout,
-            resamp_with_conv=resamp_with_conv,
-            out_ch=None,
-        )
-        self.rescaler = LatentRescaler(
-            factor=rescale_factor,
-            in_channels=intermediate_chn,
-            mid_channels=intermediate_chn,
-            out_channels=out_ch,
-            depth=rescale_module_depth,
-        )
-
-    def forward(self, x):
-        x = self.encoder(x)
-        x = self.rescaler(x)
-        return x
-
-
-class MergedRescaleDecoder(nn.Module):
-    def __init__(
-        self,
-        z_channels,
-        out_ch,
-        resolution,
-        num_res_blocks,
-        attn_resolutions,
-        ch,
-        ch_mult=(1, 2, 4, 8),
-        dropout=0.0,
-        resamp_with_conv=True,
-        rescale_factor=1.0,
-        rescale_module_depth=1,
-    ):
-        super().__init__()
-        tmp_chn = z_channels * ch_mult[-1]
-        self.decoder = Decoder(
-            out_ch=out_ch,
-            z_channels=tmp_chn,
-            attn_resolutions=attn_resolutions,
-            dropout=dropout,
-            resamp_with_conv=resamp_with_conv,
-            in_channels=None,
-            num_res_blocks=num_res_blocks,
-            ch_mult=ch_mult,
-            resolution=resolution,
-            ch=ch,
-        )
-        self.rescaler = LatentRescaler(
-            factor=rescale_factor,
-            in_channels=z_channels,
-            mid_channels=tmp_chn,
-            out_channels=tmp_chn,
-            depth=rescale_module_depth,
-        )
-
-    def forward(self, x):
-        x = self.rescaler(x)
-        x = self.decoder(x)
-        return x
-
-
-class Upsampler(nn.Module):
-    def __init__(self, in_size, out_size, in_channels, out_channels, ch_mult=2):
-        super().__init__()
-        assert out_size >= in_size
-        num_blocks = int(np.log2(out_size // in_size)) + 1
-        factor_up = 1.0 + (out_size % in_size)
-        print(
-            f"Building {self.__class__.__name__} with in_size: {in_size} --> out_size {out_size} and factor {factor_up}"
-        )
-        self.rescaler = LatentRescaler(
-            factor=factor_up,
-            in_channels=in_channels,
-            mid_channels=2 * in_channels,
-            out_channels=in_channels,
-        )
-        self.decoder = Decoder(
-            out_ch=out_channels,
-            resolution=out_size,
-            z_channels=in_channels,
-            num_res_blocks=2,
-            attn_resolutions=[],
-            in_channels=None,
-            ch=in_channels,
-            ch_mult=[ch_mult for _ in range(num_blocks)],
-        )
-
-    def forward(self, x):
-        x = self.rescaler(x)
-        x = self.decoder(x)
-        return x
-
-
-class Resize(nn.Module):
-    def __init__(self, in_channels=None, learned=False, mode="bilinear"):
-        super().__init__()
-        self.with_conv = learned
-        self.mode = mode
-        if self.with_conv:
-            print(
-                f"Note: {self.__class__.__name} uses learned downsampling and will ignore the fixed {mode} mode"
-            )
-            raise NotImplementedError()
-            assert in_channels is not None
-            # no asymmetric padding in torch conv, must do it ourselves
-            self.conv = torch.nn.Conv2d(
-                in_channels, in_channels, kernel_size=4, stride=2, padding=1
-            )
-
-    def forward(self, x, scale_factor=1.0):
-        if scale_factor == 1.0:
-            return x
-        else:
-            x = torch.nn.functional.interpolate(
-                x, mode=self.mode, align_corners=False, scale_factor=scale_factor
-            )
-        return x
-
-
-class FirstStagePostProcessor(nn.Module):
-
-    def __init__(
-        self,
-        ch_mult: list,
-        in_channels,
-        pretrained_model: nn.Module = None,
-        reshape=False,
-        n_channels=None,
-        dropout=0.0,
-        pretrained_config=None,
-    ):
-        super().__init__()
-        if pretrained_config is None:
-            assert (
-                pretrained_model is not None
-            ), 'Either "pretrained_model" or "pretrained_config" must not be None'
-            self.pretrained_model = pretrained_model
-        else:
-            assert (
-                pretrained_config is not None
-            ), 'Either "pretrained_model" or "pretrained_config" must not be None'
-            self.instantiate_pretrained(pretrained_config)
-
-        self.do_reshape = reshape
-
-        if n_channels is None:
-            n_channels = self.pretrained_model.encoder.ch
-
-        self.proj_norm = Normalize(in_channels, num_groups=in_channels // 2)
-        self.proj = nn.Conv2d(
-            in_channels, n_channels, kernel_size=3, stride=1, padding=1
-        )
-
-        blocks = []
-        downs = []
-        ch_in = n_channels
-        for m in ch_mult:
-            blocks.append(
-                ResnetBlock(
-                    in_channels=ch_in, out_channels=m * n_channels, dropout=dropout
-                )
-            )
-            ch_in = m * n_channels
-            downs.append(Downsample(ch_in, with_conv=False))
-
-        self.model = nn.ModuleList(blocks)
-        self.downsampler = nn.ModuleList(downs)
-
-    def instantiate_pretrained(self, config):
-        model = instantiate_from_config(config)
-        self.pretrained_model = model.eval()
-        # self.pretrained_model.train = False
-        for param in self.pretrained_model.parameters():
-            param.requires_grad = False
-
-    @torch.no_grad()
-    def encode_with_pretrained(self, x):
-        c = self.pretrained_model.encode(x)
-        if isinstance(c, DiagonalGaussianDistribution):
-            c = c.mode()
-        return c
-
-    def forward(self, x):
-        z_fs = self.encode_with_pretrained(x)
-        z = self.proj_norm(z_fs)
-        z = self.proj(z)
-        z = nonlinearity(z)
-
-        for submodel, downmodel in zip(self.model, self.downsampler):
-            z = submodel(z, temb=None)
-            z = downmodel(z)
-
-        if self.do_reshape:
-            z = rearrange(z, "b c h w -> b (h w) c")
-        return z
diff --git a/videotuna/models/lvdm/modules/attention.py b/videotuna/models/lvdm/modules/attention.py
deleted file mode 100644
index 8ef58c6c..00000000
--- a/videotuna/models/lvdm/modules/attention.py
+++ /dev/null
@@ -1,618 +0,0 @@
-from functools import partial
-
-import torch
-import torch.nn.functional as F
-from einops import rearrange, repeat
-from torch import einsum, nn
-
-try:
-    import xformers
-    import xformers.ops
-
-    XFORMERS_IS_AVAILBLE = True
-except:
-    XFORMERS_IS_AVAILBLE = False
-
-from videotuna.models.lvdm.modules.utils import checkpoint, default, exists, zero_module
-
-
-class RelativePosition(nn.Module):
-    """https://github.com/evelinehong/Transformer_Relative_Position_PyTorch/blob/master/relative_position.py"""
-
-    def __init__(self, num_units, max_relative_position):
-        super().__init__()
-        self.num_units = num_units
-        self.max_relative_position = max_relative_position
-        self.embeddings_table = nn.Parameter(
-            torch.Tensor(max_relative_position * 2 + 1, num_units)
-        )
-        nn.init.xavier_uniform_(self.embeddings_table)
-
-    def forward(self, length_q, length_k):
-        device = self.embeddings_table.device
-        range_vec_q = torch.arange(length_q, device=device)
-        range_vec_k = torch.arange(length_k, device=device)
-        distance_mat = range_vec_k[None, :] - range_vec_q[:, None]
-        distance_mat_clipped = torch.clamp(
-            distance_mat, -self.max_relative_position, self.max_relative_position
-        )
-        final_mat = distance_mat_clipped + self.max_relative_position
-        final_mat = final_mat.long()
-        embeddings = self.embeddings_table[final_mat]
-        return embeddings
-
-
-class CrossAttention(nn.Module):
-
-    def __init__(
-        self,
-        query_dim,
-        context_dim=None,
-        heads=8,
-        dim_head=64,
-        dropout=0.0,
-        relative_position=False,
-        temporal_length=None,
-        video_length=None,
-        img_cross_attention=False,
-        img_cross_attention_scale=1.0,
-        img_cross_attention_scale_learnable=False,
-        text_context_len=77,
-    ):
-        super().__init__()
-        inner_dim = dim_head * heads
-        context_dim = default(context_dim, query_dim)
-
-        self.scale = dim_head**-0.5
-        self.heads = heads
-        self.dim_head = dim_head
-        self.to_q = nn.Linear(query_dim, inner_dim, bias=False)
-        self.to_k = nn.Linear(context_dim, inner_dim, bias=False)
-        self.to_v = nn.Linear(context_dim, inner_dim, bias=False)
-        self.to_out = nn.Sequential(
-            nn.Linear(inner_dim, query_dim), nn.Dropout(dropout)
-        )
-
-        self.img_cross_attention = img_cross_attention
-        self.img_cross_attention_scale = img_cross_attention_scale
-        self.img_cross_attention_scale_learnable = img_cross_attention_scale_learnable
-        self.text_context_len = text_context_len
-
-        if self.img_cross_attention:
-            self.to_k_ip = nn.Linear(context_dim, inner_dim, bias=False)
-            self.to_v_ip = nn.Linear(context_dim, inner_dim, bias=False)
-            if img_cross_attention_scale_learnable:
-                self.register_parameter("alpha", nn.Parameter(torch.tensor(0.0)))
-
-        self.relative_position = relative_position
-        if self.relative_position:
-            assert temporal_length is not None
-            self.relative_position_k = RelativePosition(
-                num_units=dim_head, max_relative_position=temporal_length
-            )
-            self.relative_position_v = RelativePosition(
-                num_units=dim_head, max_relative_position=temporal_length
-            )
-        else:
-            ## only used for spatial attention, while NOT for temporal attention
-            if XFORMERS_IS_AVAILBLE and temporal_length is None:
-                self.forward = self.efficient_forward
-
-    def forward(self, x, context=None, mask=None):
-        is_self_attn = context is None
-        k_ip, v_ip, out_ip = None, None, None
-
-        h = self.heads
-
-        q = self.to_q(x)
-        context = default(context, x)
-
-        if self.img_cross_attention and not is_self_attn:
-            context, context_img = (
-                context[:, : self.text_context_len, :],
-                context[:, self.text_context_len :, :],
-            )
-            k = self.to_k(context)
-            v = self.to_v(context)
-            k_ip = self.to_k_ip(context_img)
-            v_ip = self.to_v_ip(context_img)
-        else:
-            if not is_self_attn:
-                context = context[:, : self.text_context_len, :]
-
-            k = self.to_k(context)
-            v = self.to_v(context)
-
-        q, k, v = map(lambda t: rearrange(t, "b n (h d) -> (b h) n d", h=h), (q, k, v))
-
-        sim = torch.einsum("b i d, b j d -> b i j", q, k) * self.scale
-        if self.relative_position:
-            len_q, len_k, len_v = q.shape[1], k.shape[1], v.shape[1]
-            k2 = self.relative_position_k(len_q, len_k)
-            sim2 = einsum("b t d, t s d -> b t s", q, k2) * self.scale  # TODO check
-            sim += sim2
-        del k
-
-        if exists(mask):
-            ## feasible for causal attention mask only
-            max_neg_value = -torch.finfo(sim.dtype).max
-            mask = repeat(mask, "b i j -> (b h) i j", h=h)
-            sim.masked_fill_(~(mask > 0.5), max_neg_value)
-
-        # attention, what we cannot get enough of
-        sim = sim.softmax(dim=-1)
-        out = torch.einsum("b i j, b j d -> b i d", sim, v)
-        if self.relative_position:
-            v2 = self.relative_position_v(len_q, len_v)
-            out2 = einsum("b t s, t s d -> b t d", sim, v2)  # TODO check
-            out += out2
-        out = rearrange(out, "(b h) n d -> b n (h d)", h=h)
-
-        ## for image cross-attention
-        if k_ip is not None:
-            k_ip, v_ip = map(
-                lambda t: rearrange(t, "b n (h d) -> (b h) n d", h=h), (k_ip, v_ip)
-            )
-            sim_ip = torch.einsum("b i d, b j d -> b i j", q, k_ip) * self.scale
-            del k_ip
-            sim_ip = sim_ip.softmax(dim=-1)
-            out_ip = torch.einsum("b i j, b j d -> b i d", sim_ip, v_ip)
-            out_ip = rearrange(out_ip, "(b h) n d -> b n (h d)", h=h)
-
-        if out_ip is not None:
-            if self.img_cross_attention_scale_learnable:
-                out = out + self.img_cross_attention_scale * out_ip * (
-                    torch.tanh(self.alpha) + 1
-                )
-            else:
-                out = out + self.img_cross_attention_scale * out_ip
-
-        return self.to_out(out)
-
-    def efficient_forward(self, x, context=None, mask=None):
-        is_self_attn = context is None
-        k_ip, v_ip, out_ip = None, None, None
-
-        q = self.to_q(x)
-        context = default(context, x)
-
-        ## considering image token additionally
-        if self.img_cross_attention and not is_self_attn:
-            context, context_img = (
-                context[:, : self.text_context_len, :],
-                context[:, self.text_context_len :, :],
-            )
-            k = self.to_k(context)
-            v = self.to_v(context)
-            k_ip = self.to_k_ip(context_img)
-            v_ip = self.to_v_ip(context_img)
-        else:
-            if not is_self_attn:
-                context = context[:, : self.text_context_len, :]
-            k = self.to_k(context)
-            v = self.to_v(context)
-
-        b, _, _ = q.shape
-        q, k, v = map(
-            lambda t: t.unsqueeze(3)
-            .reshape(b, t.shape[1], self.heads, self.dim_head)
-            .permute(0, 2, 1, 3)
-            .reshape(b * self.heads, t.shape[1], self.dim_head)
-            .contiguous(),
-            (q, k, v),
-        )
-        # actually compute the attention, what we cannot get enough of
-        out = xformers.ops.memory_efficient_attention(q, k, v, attn_bias=None, op=None)
-
-        # for image cross-attention
-        if k_ip is not None:
-            k_ip, v_ip = map(
-                lambda t: t.unsqueeze(3)
-                .reshape(b, t.shape[1], self.heads, self.dim_head)
-                .permute(0, 2, 1, 3)
-                .reshape(b * self.heads, t.shape[1], self.dim_head)
-                .contiguous(),
-                (k_ip, v_ip),
-            )
-            out_ip = xformers.ops.memory_efficient_attention(
-                q, k_ip, v_ip, attn_bias=None, op=None
-            )
-            out_ip = (
-                out_ip.unsqueeze(0)
-                .reshape(b, self.heads, out.shape[1], self.dim_head)
-                .permute(0, 2, 1, 3)
-                .reshape(b, out.shape[1], self.heads * self.dim_head)
-            )
-
-        if exists(mask):
-            raise NotImplementedError
-        out = (
-            out.unsqueeze(0)
-            .reshape(b, self.heads, out.shape[1], self.dim_head)
-            .permute(0, 2, 1, 3)
-            .reshape(b, out.shape[1], self.heads * self.dim_head)
-        )
-        if out_ip is not None:
-            if self.img_cross_attention_scale_learnable:
-                out = out + self.img_cross_attention_scale * out_ip * (
-                    torch.tanh(self.alpha) + 1
-                )
-            else:
-                out = out + self.img_cross_attention_scale * out_ip
-        return self.to_out(out)
-
-
-class BasicTransformerBlock(nn.Module):
-    def __init__(
-        self,
-        dim,
-        n_heads,
-        d_head,
-        dropout=0.0,
-        context_dim=None,
-        gated_ff=True,
-        checkpoint=True,
-        disable_self_attn=False,
-        attention_cls=None,
-        **kwargs,
-    ):
-        super().__init__()
-        attn_cls = CrossAttention if attention_cls is None else attention_cls
-        self.disable_self_attn = disable_self_attn
-        self.attn1 = attn_cls(
-            query_dim=dim,
-            heads=n_heads,
-            dim_head=d_head,
-            dropout=dropout,
-            context_dim=context_dim if self.disable_self_attn else None,
-        )
-        self.ff = FeedForward(dim, dropout=dropout, glu=gated_ff)
-        self.attn2 = attn_cls(
-            query_dim=dim,
-            heads=n_heads,
-            dim_head=d_head,
-            dropout=dropout,
-            context_dim=context_dim,
-            **kwargs,
-        )
-        self.norm1 = nn.LayerNorm(dim)
-        self.norm2 = nn.LayerNorm(dim)
-        self.norm3 = nn.LayerNorm(dim)
-        self.checkpoint = checkpoint
-
-    def forward(self, x, context=None, mask=None, **kwargs):
-        ## implementation tricks: because checkpointing doesn't support non-tensor (e.g. None or scalar) arguments
-        input_tuple = (
-            x,
-        )  ## should not be (x), otherwise *input_tuple will decouple x into multiple arguments
-        if context is not None:
-            input_tuple = (x, context)
-        if mask is not None:
-            forward_mask = partial(self._forward, mask=mask)
-            return checkpoint(forward_mask, (x,), self.parameters(), self.checkpoint)
-        if context is not None and mask is not None:
-            input_tuple = (x, context, mask)
-        return checkpoint(
-            self._forward, input_tuple, self.parameters(), self.checkpoint
-        )
-
-    def _forward(self, x, context=None, mask=None):
-        x = (
-            self.attn1(
-                self.norm1(x),
-                context=context if self.disable_self_attn else None,
-                mask=mask,
-            )
-            + x
-        )
-        x = self.attn2(self.norm2(x), context=context, mask=mask) + x
-        x = self.ff(self.norm3(x)) + x
-        return x
-
-
-class SpatialTransformer(nn.Module):
-    """
-    Transformer block for image-like data in spatial axis.
-    First, project the input (aka embedding)
-    and reshape to b, t, d.
-    Then apply standard transformer action.
-    Finally, reshape to image
-    NEW: use_linear for more efficiency instead of the 1x1 convs
-    """
-
-    def __init__(
-        self,
-        in_channels,
-        n_heads,
-        d_head,
-        depth=1,
-        dropout=0.0,
-        context_dim=None,
-        use_checkpoint=True,
-        disable_self_attn=False,
-        use_linear=False,
-        img_cross_attention=False,
-        img_cross_attention_scale_learnable=False,
-    ):
-        super().__init__()
-        self.in_channels = in_channels
-        inner_dim = n_heads * d_head
-        self.norm = torch.nn.GroupNorm(
-            num_groups=32, num_channels=in_channels, eps=1e-6, affine=True
-        )
-        if not use_linear:
-            self.proj_in = nn.Conv2d(
-                in_channels, inner_dim, kernel_size=1, stride=1, padding=0
-            )
-        else:
-            self.proj_in = nn.Linear(in_channels, inner_dim)
-
-        attention_cls = None
-        self.transformer_blocks = nn.ModuleList(
-            [
-                BasicTransformerBlock(
-                    inner_dim,
-                    n_heads,
-                    d_head,
-                    dropout=dropout,
-                    context_dim=context_dim,
-                    disable_self_attn=disable_self_attn,
-                    checkpoint=use_checkpoint,
-                    attention_cls=attention_cls,
-                    img_cross_attention=img_cross_attention,
-                    img_cross_attention_scale_learnable=img_cross_attention_scale_learnable,
-                )
-                for d in range(depth)
-            ]
-        )
-        if not use_linear:
-            self.proj_out = zero_module(
-                nn.Conv2d(inner_dim, in_channels, kernel_size=1, stride=1, padding=0)
-            )
-        else:
-            self.proj_out = zero_module(nn.Linear(inner_dim, in_channels))
-        self.use_linear = use_linear
-
-    def forward(self, x, context=None):
-        b, c, h, w = x.shape
-        x_in = x
-        x = self.norm(x)
-        if not self.use_linear:
-            x = self.proj_in(x)
-        x = rearrange(x, "b c h w -> b (h w) c").contiguous()
-        if self.use_linear:
-            x = self.proj_in(x)
-        for i, block in enumerate(self.transformer_blocks):
-            x = block(x, context=context)
-        if self.use_linear:
-            x = self.proj_out(x)
-        x = rearrange(x, "b (h w) c -> b c h w", h=h, w=w).contiguous()
-        if not self.use_linear:
-            x = self.proj_out(x)
-        return x + x_in
-
-
-class TemporalTransformer(nn.Module):
-    """
-    Transformer block for image-like data in temporal axis.
-    First, reshape to b, t, d.
-    Then apply standard transformer action.
-    Finally, reshape to image
-    """
-
-    def __init__(
-        self,
-        in_channels,
-        n_heads,
-        d_head,
-        depth=1,
-        dropout=0.0,
-        context_dim=None,
-        use_checkpoint=True,
-        use_linear=False,
-        only_self_att=True,
-        causal_attention=False,
-        causal_block_size=1,
-        relative_position=False,
-        temporal_length=None,
-    ):
-        super().__init__()
-        self.only_self_att = only_self_att
-        self.relative_position = relative_position
-        self.causal_attention = causal_attention
-        self.causal_block_size = causal_block_size
-        self.in_channels = in_channels
-        inner_dim = n_heads * d_head
-        self.norm = torch.nn.GroupNorm(
-            num_groups=32, num_channels=in_channels, eps=1e-6, affine=True
-        )
-        self.proj_in = nn.Conv1d(
-            in_channels, inner_dim, kernel_size=1, stride=1, padding=0
-        )
-        if not use_linear:
-            self.proj_in = nn.Conv1d(
-                in_channels, inner_dim, kernel_size=1, stride=1, padding=0
-            )
-        else:
-            self.proj_in = nn.Linear(in_channels, inner_dim)
-
-        if relative_position:
-            assert temporal_length is not None
-            attention_cls = partial(
-                CrossAttention, relative_position=True, temporal_length=temporal_length
-            )
-        else:
-            attention_cls = partial(CrossAttention, temporal_length=temporal_length)
-
-        if self.causal_attention:
-            assert temporal_length is not None
-            self.mask = torch.tril(torch.ones([1, temporal_length, temporal_length]))
-
-        if self.only_self_att:
-            context_dim = None
-        self.transformer_blocks = nn.ModuleList(
-            [
-                BasicTransformerBlock(
-                    inner_dim,
-                    n_heads,
-                    d_head,
-                    dropout=dropout,
-                    context_dim=context_dim,
-                    attention_cls=attention_cls,
-                    checkpoint=use_checkpoint,
-                )
-                for d in range(depth)
-            ]
-        )
-        if not use_linear:
-            self.proj_out = zero_module(
-                nn.Conv1d(inner_dim, in_channels, kernel_size=1, stride=1, padding=0)
-            )
-        else:
-            self.proj_out = zero_module(nn.Linear(inner_dim, in_channels))
-        self.use_linear = use_linear
-
-    def forward(self, x, context=None):
-        b, c, t, h, w = x.shape
-        x_in = x
-        x = self.norm(x)
-        x = rearrange(x, "b c t h w -> (b h w) c t").contiguous()
-        if not self.use_linear:
-            x = self.proj_in(x)
-        x = rearrange(x, "bhw c t -> bhw t c").contiguous()
-        if self.use_linear:
-            x = self.proj_in(x)
-
-        if self.causal_attention:
-            mask = self.mask[:, :t, :t]
-
-            mask = self.mask.to(x.device)
-            mask = repeat(mask, "l i j -> (l bhw) i j", bhw=b * h * w)
-        else:
-            mask = None
-
-        if self.only_self_att:
-            # if no context is given, cross-attention defaults to self-attention
-            for i, block in enumerate(self.transformer_blocks):
-                x = block(x, mask=mask)
-            x = rearrange(x, "(b hw) t c -> b hw t c", b=b).contiguous()
-        else:
-            x = rearrange(x, "(b hw) t c -> b hw t c", b=b).contiguous()
-            context = rearrange(context, "(b t) l con -> b t l con", t=t).contiguous()
-            for i, block in enumerate(self.transformer_blocks):
-                # calculate each batch one by one (since number in shape could not greater then 65,535 for some package)
-                for j in range(b):
-                    context_j = repeat(
-                        context[j], "t l con -> (t r) l con", r=(h * w) // t, t=t
-                    ).contiguous()
-                    ## note: causal mask will not applied in cross-attention case
-                    x[j] = block(x[j], context=context_j)
-
-        if self.use_linear:
-            x = self.proj_out(x)
-            x = rearrange(x, "b (h w) t c -> b c t h w", h=h, w=w).contiguous()
-        if not self.use_linear:
-            x = rearrange(x, "b hw t c -> (b hw) c t").contiguous()
-            x = self.proj_out(x)
-            x = rearrange(x, "(b h w) c t -> b c t h w", b=b, h=h, w=w).contiguous()
-
-        return x + x_in
-
-
-class GEGLU(nn.Module):
-    def __init__(self, dim_in, dim_out):
-        super().__init__()
-        self.proj = nn.Linear(dim_in, dim_out * 2)
-
-    def forward(self, x):
-        x, gate = self.proj(x).chunk(2, dim=-1)
-        return x * F.gelu(gate)
-
-
-class FeedForward(nn.Module):
-    def __init__(self, dim, dim_out=None, mult=4, glu=False, dropout=0.0):
-        super().__init__()
-        inner_dim = int(dim * mult)
-        dim_out = default(dim_out, dim)
-        project_in = (
-            nn.Sequential(nn.Linear(dim, inner_dim), nn.GELU())
-            if not glu
-            else GEGLU(dim, inner_dim)
-        )
-
-        self.net = nn.Sequential(
-            project_in, nn.Dropout(dropout), nn.Linear(inner_dim, dim_out)
-        )
-
-    def forward(self, x):
-        return self.net(x)
-
-
-class LinearAttention(nn.Module):
-    def __init__(self, dim, heads=4, dim_head=32):
-        super().__init__()
-        self.heads = heads
-        hidden_dim = dim_head * heads
-        self.to_qkv = nn.Conv2d(dim, hidden_dim * 3, 1, bias=False)
-        self.to_out = nn.Conv2d(hidden_dim, dim, 1)
-
-    def forward(self, x):
-        b, c, h, w = x.shape
-        qkv = self.to_qkv(x)
-        q, k, v = rearrange(
-            qkv, "b (qkv heads c) h w -> qkv b heads c (h w)", heads=self.heads, qkv=3
-        )
-        k = k.softmax(dim=-1)
-        context = torch.einsum("bhdn,bhen->bhde", k, v)
-        out = torch.einsum("bhde,bhdn->bhen", context, q)
-        out = rearrange(
-            out, "b heads c (h w) -> b (heads c) h w", heads=self.heads, h=h, w=w
-        )
-        return self.to_out(out)
-
-
-class SpatialSelfAttention(nn.Module):
-    def __init__(self, in_channels):
-        super().__init__()
-        self.in_channels = in_channels
-
-        self.norm = torch.nn.GroupNorm(
-            num_groups=32, num_channels=in_channels, eps=1e-6, affine=True
-        )
-        self.q = torch.nn.Conv2d(
-            in_channels, in_channels, kernel_size=1, stride=1, padding=0
-        )
-        self.k = torch.nn.Conv2d(
-            in_channels, in_channels, kernel_size=1, stride=1, padding=0
-        )
-        self.v = torch.nn.Conv2d(
-            in_channels, in_channels, kernel_size=1, stride=1, padding=0
-        )
-        self.proj_out = torch.nn.Conv2d(
-            in_channels, in_channels, kernel_size=1, stride=1, padding=0
-        )
-
-    def forward(self, x):
-        h_ = x
-        h_ = self.norm(h_)
-        q = self.q(h_)
-        k = self.k(h_)
-        v = self.v(h_)
-
-        # compute attention
-        b, c, h, w = q.shape
-        q = rearrange(q, "b c h w -> b (h w) c")
-        k = rearrange(k, "b c h w -> b c (h w)")
-        w_ = torch.einsum("bij,bjk->bik", q, k)
-
-        w_ = w_ * (int(c) ** (-0.5))
-        w_ = torch.nn.functional.softmax(w_, dim=2)
-
-        # attend to values
-        v = rearrange(v, "b c h w -> b c (h w)")
-        w_ = rearrange(w_, "b i j -> b j i")
-        h_ = torch.einsum("bij,bjk->bik", v, w_)
-        h_ = rearrange(h_, "b c (h w) -> b c h w", h=h)
-        h_ = self.proj_out(h_)
-
-        return x + h_
diff --git a/videotuna/models/lvdm/modules/encoders/condition.py b/videotuna/models/lvdm/modules/encoders/condition.py
deleted file mode 100644
index 5ea90546..00000000
--- a/videotuna/models/lvdm/modules/encoders/condition.py
+++ /dev/null
@@ -1,513 +0,0 @@
-import kornia
-import open_clip
-import torch
-import torch.nn as nn
-from torch.utils.checkpoint import checkpoint
-from transformers import CLIPTextModel, CLIPTokenizer, T5EncoderModel, T5Tokenizer
-
-from videotuna.models.lvdm.modules.utils import autocast
-from videotuna.utils.common_utils import count_params
-
-
-class AbstractEncoder(nn.Module):
-    def __init__(self):
-        super().__init__()
-
-    def encode(self, *args, **kwargs):
-        raise NotImplementedError
-
-
-class IdentityEncoder(AbstractEncoder):
-
-    def encode(self, x):
-        return x
-
-
-class ClassEmbedder(nn.Module):
-    def __init__(self, embed_dim, n_classes=1000, key="class", ucg_rate=0.1):
-        super().__init__()
-        self.key = key
-        self.embedding = nn.Embedding(n_classes, embed_dim)
-        self.n_classes = n_classes
-        self.ucg_rate = ucg_rate
-
-    def forward(self, batch, key=None, disable_dropout=False):
-        if key is None:
-            key = self.key
-        # this is for use in crossattn
-        c = batch[key][:, None]
-        if self.ucg_rate > 0.0 and not disable_dropout:
-            mask = 1.0 - torch.bernoulli(torch.ones_like(c) * self.ucg_rate)
-            c = mask * c + (1 - mask) * torch.ones_like(c) * (self.n_classes - 1)
-            c = c.long()
-        c = self.embedding(c)
-        return c
-
-    def get_unconditional_conditioning(self, bs, device="cuda"):
-        uc_class = (
-            self.n_classes - 1
-        )  # 1000 classes --> 0 ... 999, one extra class for ucg (class 1000)
-        uc = torch.ones((bs,), device=device) * uc_class
-        uc = {self.key: uc}
-        return uc
-
-
-def disabled_train(self, mode=True):
-    """Overwrite model.train with this function to make sure train/eval mode
-    does not change anymore."""
-    return self
-
-
-class FrozenT5Embedder(AbstractEncoder):
-    """Uses the T5 transformer encoder for text"""
-
-    def __init__(
-        self, version="google/t5-v1_1-large", device="cuda", max_length=77, freeze=True
-    ):  # others are google/t5-v1_1-xl and google/t5-v1_1-xxl
-        super().__init__()
-        self.tokenizer = T5Tokenizer.from_pretrained(version)
-        self.transformer = T5EncoderModel.from_pretrained(version)
-        self.device = device
-        self.max_length = max_length  # TODO: typical value?
-        if freeze:
-            self.freeze()
-
-    def freeze(self):
-        self.transformer = self.transformer.eval()
-        # self.train = disabled_train
-        for param in self.parameters():
-            param.requires_grad = False
-
-    def forward(self, text):
-        batch_encoding = self.tokenizer(
-            text,
-            truncation=True,
-            max_length=self.max_length,
-            return_length=True,
-            return_overflowing_tokens=False,
-            padding="max_length",
-            return_tensors="pt",
-        )
-        tokens = batch_encoding["input_ids"].to(self.device)
-        outputs = self.transformer(input_ids=tokens)
-
-        z = outputs.last_hidden_state
-        return z
-
-    def encode(self, text):
-        return self(text)
-
-
-class FrozenCLIPEmbedder(AbstractEncoder):
-    """Uses the CLIP transformer encoder for text (from huggingface)"""
-
-    LAYERS = ["last", "pooled", "hidden"]
-
-    def __init__(
-        self,
-        version="openai/clip-vit-large-patch14",
-        device="cuda",
-        max_length=77,
-        freeze=True,
-        layer="last",
-        layer_idx=None,
-    ):  # clip-vit-base-patch32
-        super().__init__()
-        assert layer in self.LAYERS
-        self.tokenizer = CLIPTokenizer.from_pretrained(version)
-        self.transformer = CLIPTextModel.from_pretrained(version)
-        self.device = device
-        self.max_length = max_length
-        if freeze:
-            self.freeze()
-        self.layer = layer
-        self.layer_idx = layer_idx
-        if layer == "hidden":
-            assert layer_idx is not None
-            assert 0 <= abs(layer_idx) <= 12
-
-    def freeze(self):
-        self.transformer = self.transformer.eval()
-        # self.train = disabled_train
-        for param in self.parameters():
-            param.requires_grad = False
-
-    def forward(self, text):
-        batch_encoding = self.tokenizer(
-            text,
-            truncation=True,
-            max_length=self.max_length,
-            return_length=True,
-            return_overflowing_tokens=False,
-            padding="max_length",
-            return_tensors="pt",
-        )
-        tokens = batch_encoding["input_ids"].to(self.device)
-        outputs = self.transformer(
-            input_ids=tokens, output_hidden_states=self.layer == "hidden"
-        )
-        if self.layer == "last":
-            z = outputs.last_hidden_state
-        elif self.layer == "pooled":
-            z = outputs.pooler_output[:, None, :]
-        else:
-            z = outputs.hidden_states[self.layer_idx]
-        return z
-
-    def encode(self, text):
-        return self(text)
-
-
-class ClipImageEmbedder(nn.Module):
-    def __init__(
-        self,
-        model,
-        jit=False,
-        device="cuda" if torch.cuda.is_available() else "cpu",
-        antialias=True,
-        ucg_rate=0.0,
-    ):
-        super().__init__()
-        from clip import load as load_clip
-
-        self.model, _ = load_clip(name=model, device=device, jit=jit)
-
-        self.antialias = antialias
-
-        self.register_buffer(
-            "mean", torch.Tensor([0.48145466, 0.4578275, 0.40821073]), persistent=False
-        )
-        self.register_buffer(
-            "std", torch.Tensor([0.26862954, 0.26130258, 0.27577711]), persistent=False
-        )
-        self.ucg_rate = ucg_rate
-
-    def preprocess(self, x):
-        # normalize to [0,1]
-        x = kornia.geometry.resize(
-            x,
-            (224, 224),
-            interpolation="bicubic",
-            align_corners=True,
-            antialias=self.antialias,
-        )
-        x = (x + 1.0) / 2.0
-        # re-normalize according to clip
-        x = kornia.enhance.normalize(x, self.mean, self.std)
-        return x
-
-    def forward(self, x, no_dropout=False):
-        # x is assumed to be in range [-1,1]
-        out = self.model.encode_image(self.preprocess(x))
-        out = out.to(x.dtype)
-        if self.ucg_rate > 0.0 and not no_dropout:
-            out = (
-                torch.bernoulli(
-                    (1.0 - self.ucg_rate) * torch.ones(out.shape[0], device=out.device)
-                )[:, None]
-                * out
-            )
-        return out
-
-
-class FrozenOpenCLIPEmbedder(AbstractEncoder):
-    """
-    Uses the OpenCLIP transformer encoder for text
-    """
-
-    LAYERS = [
-        # "pooled",
-        "last",
-        "penultimate",
-    ]
-
-    def __init__(
-        self,
-        arch="ViT-H-14",
-        version="laion2b_s32b_b79k",
-        device="cuda",
-        max_length=77,
-        freeze=True,
-        layer="last",
-    ):
-        super().__init__()
-        assert layer in self.LAYERS
-        model, _, _ = open_clip.create_model_and_transforms(
-            arch, device=torch.device("cpu")
-        )
-        del model.visual
-        self.model = model
-
-        self.device = device
-        self.max_length = max_length
-        if freeze:
-            self.freeze()
-        self.layer = layer
-        if self.layer == "last":
-            self.layer_idx = 0
-        elif self.layer == "penultimate":
-            self.layer_idx = 1
-        else:
-            raise NotImplementedError()
-
-    def freeze(self):
-        self.model = self.model.eval()
-        for param in self.parameters():
-            param.requires_grad = False
-
-    def forward(self, text):
-        self.device = self.model.positional_embedding.device
-        tokens = open_clip.tokenize(text)
-        z = self.encode_with_transformer(tokens.to(self.device))
-        return z
-
-    def encode_with_transformer(self, text):
-        x = self.model.token_embedding(text)  # [batch_size, n_ctx, d_model]
-        x = x + self.model.positional_embedding
-        x = x.permute(1, 0, 2)  # NLD -> LND
-        x = self.text_transformer_forward(x, attn_mask=self.model.attn_mask)
-        x = x.permute(1, 0, 2)  # LND -> NLD
-        x = self.model.ln_final(x)
-        return x
-
-    def text_transformer_forward(self, x: torch.Tensor, attn_mask=None):
-        for i, r in enumerate(self.model.transformer.resblocks):
-            if i == len(self.model.transformer.resblocks) - self.layer_idx:
-                break
-            if (
-                self.model.transformer.grad_checkpointing
-                and not torch.jit.is_scripting()
-            ):
-                x = checkpoint(r, x, attn_mask)
-            else:
-                x = r(x, attn_mask=attn_mask)
-        return x
-
-    def encode(self, text):
-        return self(text)
-
-
-class FrozenOpenCLIPImageEmbedder(AbstractEncoder):
-    """
-    Uses the OpenCLIP vision transformer encoder for images
-    """
-
-    def __init__(
-        self,
-        arch="ViT-H-14",
-        version="laion2b_s32b_b79k",
-        device="cuda",
-        max_length=77,
-        freeze=True,
-        layer="pooled",
-        antialias=True,
-        ucg_rate=0.0,
-    ):
-        super().__init__()
-        model, _, _ = open_clip.create_model_and_transforms(
-            arch,
-            device=torch.device("cpu"),
-            pretrained=version,
-        )
-        del model.transformer
-        self.model = model
-
-        self.device = device
-        self.max_length = max_length
-        if freeze:
-            self.freeze()
-        self.layer = layer
-        if self.layer == "penultimate":
-            raise NotImplementedError()
-            self.layer_idx = 1
-
-        self.antialias = antialias
-
-        self.register_buffer(
-            "mean", torch.Tensor([0.48145466, 0.4578275, 0.40821073]), persistent=False
-        )
-        self.register_buffer(
-            "std", torch.Tensor([0.26862954, 0.26130258, 0.27577711]), persistent=False
-        )
-        self.ucg_rate = ucg_rate
-
-    def preprocess(self, x):
-        # normalize to [0,1]
-        x = kornia.geometry.resize(
-            x,
-            (224, 224),
-            interpolation="bicubic",
-            align_corners=True,
-            antialias=self.antialias,
-        )
-        x = (x + 1.0) / 2.0
-        # renormalize according to clip
-        x = kornia.enhance.normalize(x, self.mean, self.std)
-        return x
-
-    def freeze(self):
-        self.model = self.model.eval()
-        for param in self.parameters():
-            param.requires_grad = False
-
-    @autocast
-    def forward(self, image, no_dropout=False):
-        z = self.encode_with_vision_transformer(image)
-        if self.ucg_rate > 0.0 and not no_dropout:
-            z = (
-                torch.bernoulli(
-                    (1.0 - self.ucg_rate) * torch.ones(z.shape[0], device=z.device)
-                )[:, None]
-                * z
-            )
-        return z
-
-    def encode_with_vision_transformer(self, img):
-        img = self.preprocess(img)
-        x = self.model.visual(img)
-        return x
-
-    def encode(self, text):
-        return self(text)
-
-
-class FrozenOpenCLIPImageEmbedderV2(AbstractEncoder):
-    """
-    Uses the OpenCLIP vision transformer encoder for images
-    """
-
-    def __init__(
-        self,
-        arch="ViT-H-14",
-        version="laion2b_s32b_b79k",
-        device="cuda",
-        freeze=True,
-        layer="pooled",
-        antialias=True,
-    ):
-        super().__init__()
-        model, _, _ = open_clip.create_model_and_transforms(
-            arch,
-            device=torch.device("cpu"),
-            pretrained=version,
-        )
-        del model.transformer
-        self.model = model
-        self.device = device
-
-        if freeze:
-            self.freeze()
-        self.layer = layer
-        if self.layer == "penultimate":
-            raise NotImplementedError()
-            self.layer_idx = 1
-
-        self.antialias = antialias
-        self.register_buffer(
-            "mean", torch.Tensor([0.48145466, 0.4578275, 0.40821073]), persistent=False
-        )
-        self.register_buffer(
-            "std", torch.Tensor([0.26862954, 0.26130258, 0.27577711]), persistent=False
-        )
-
-    def preprocess(self, x):
-        # normalize to [0,1]
-        x = kornia.geometry.resize(
-            x,
-            (224, 224),
-            interpolation="bicubic",
-            align_corners=True,
-            antialias=self.antialias,
-        )
-        x = (x + 1.0) / 2.0
-        # renormalize according to clip
-        x = kornia.enhance.normalize(x, self.mean, self.std)
-        return x
-
-    def freeze(self):
-        self.model = self.model.eval()
-        for param in self.model.parameters():
-            param.requires_grad = False
-
-    def forward(self, image, no_dropout=False):
-        ## image: b c h w
-        z = self.encode_with_vision_transformer(image)
-        return z
-
-    def encode_with_vision_transformer(self, x):
-        x = self.preprocess(x)
-
-        # to patches - whether to use dual patchnorm - https://arxiv.org/abs/2302.01327v1
-        if self.model.visual.input_patchnorm:
-            # einops - rearrange(x, 'b c (h p1) (w p2) -> b (h w) (c p1 p2)')
-            x = x.reshape(
-                x.shape[0],
-                x.shape[1],
-                self.model.visual.grid_size[0],
-                self.model.visual.patch_size[0],
-                self.model.visual.grid_size[1],
-                self.model.visual.patch_size[1],
-            )
-            x = x.permute(0, 2, 4, 1, 3, 5)
-            x = x.reshape(
-                x.shape[0],
-                self.model.visual.grid_size[0] * self.model.visual.grid_size[1],
-                -1,
-            )
-            x = self.model.visual.patchnorm_pre_ln(x)
-            x = self.model.visual.conv1(x)
-        else:
-            x = self.model.visual.conv1(x)  # shape = [*, width, grid, grid]
-            x = x.reshape(x.shape[0], x.shape[1], -1)  # shape = [*, width, grid ** 2]
-            x = x.permute(0, 2, 1)  # shape = [*, grid ** 2, width]
-
-        # class embeddings and positional embeddings
-        x = torch.cat(
-            [
-                self.model.visual.class_embedding.to(x.dtype)
-                + torch.zeros(
-                    x.shape[0], 1, x.shape[-1], dtype=x.dtype, device=x.device
-                ),
-                x,
-            ],
-            dim=1,
-        )  # shape = [*, grid ** 2 + 1, width]
-        x = x + self.model.visual.positional_embedding.to(x.dtype)
-
-        # a patch_dropout of 0. would mean it is disabled and this function would do nothing but return what was passed in
-        x = self.model.visual.patch_dropout(x)
-        x = self.model.visual.ln_pre(x)
-
-        x = x.permute(1, 0, 2)  # NLD -> LND
-        x = self.model.visual.transformer(x)
-        x = x.permute(1, 0, 2)  # LND -> NLD
-
-        return x
-
-
-class FrozenCLIPT5Encoder(AbstractEncoder):
-    def __init__(
-        self,
-        clip_version="openai/clip-vit-large-patch14",
-        t5_version="google/t5-v1_1-xl",
-        device="cuda",
-        clip_max_length=77,
-        t5_max_length=77,
-    ):
-        super().__init__()
-        self.clip_encoder = FrozenCLIPEmbedder(
-            clip_version, device, max_length=clip_max_length
-        )
-        self.t5_encoder = FrozenT5Embedder(t5_version, device, max_length=t5_max_length)
-        print(
-            f"{self.clip_encoder.__class__.__name__} has {count_params(self.clip_encoder) * 1.e-6:.2f} M parameters, "
-            f"{self.t5_encoder.__class__.__name__} comes with {count_params(self.t5_encoder) * 1.e-6:.2f} M params."
-        )
-
-    def encode(self, text):
-        return self(text)
-
-    def forward(self, text):
-        clip_z = self.clip_encoder.encode(text)
-        t5_z = self.t5_encoder.encode(text)
-        return [clip_z, t5_z]
diff --git a/videotuna/models/lvdm/modules/encoders/ip_resampler.py b/videotuna/models/lvdm/modules/encoders/ip_resampler.py
deleted file mode 100644
index 6718d77f..00000000
--- a/videotuna/models/lvdm/modules/encoders/ip_resampler.py
+++ /dev/null
@@ -1,152 +0,0 @@
-# modified from https://github.com/mlfoundations/open_flamingo/blob/main/open_flamingo/src/helpers.py
-import math
-
-import torch
-import torch.nn as nn
-
-
-class ImageProjModel(nn.Module):
-    """Projection Model"""
-
-    def __init__(
-        self,
-        cross_attention_dim=1024,
-        clip_embeddings_dim=1024,
-        clip_extra_context_tokens=4,
-    ):
-        super().__init__()
-        self.cross_attention_dim = cross_attention_dim
-        self.clip_extra_context_tokens = clip_extra_context_tokens
-        self.proj = nn.Linear(
-            clip_embeddings_dim, self.clip_extra_context_tokens * cross_attention_dim
-        )
-        self.norm = nn.LayerNorm(cross_attention_dim)
-
-    def forward(self, image_embeds):
-        # embeds = image_embeds
-        embeds = image_embeds.type(list(self.proj.parameters())[0].dtype)
-        clip_extra_context_tokens = self.proj(embeds).reshape(
-            -1, self.clip_extra_context_tokens, self.cross_attention_dim
-        )
-        clip_extra_context_tokens = self.norm(clip_extra_context_tokens)
-        return clip_extra_context_tokens
-
-
-# FFN
-def FeedForward(dim, mult=4):
-    inner_dim = int(dim * mult)
-    return nn.Sequential(
-        nn.LayerNorm(dim),
-        nn.Linear(dim, inner_dim, bias=False),
-        nn.GELU(),
-        nn.Linear(inner_dim, dim, bias=False),
-    )
-
-
-def reshape_tensor(x, heads):
-    bs, length, width = x.shape
-    # (bs, length, width) --> (bs, length, n_heads, dim_per_head)
-    x = x.view(bs, length, heads, -1)
-    # (bs, length, n_heads, dim_per_head) --> (bs, n_heads, length, dim_per_head)
-    x = x.transpose(1, 2)
-    # (bs, n_heads, length, dim_per_head) --> (bs*n_heads, length, dim_per_head)
-    x = x.reshape(bs, heads, length, -1)
-    return x
-
-
-class PerceiverAttention(nn.Module):
-    def __init__(self, *, dim, dim_head=64, heads=8):
-        super().__init__()
-        self.scale = dim_head**-0.5
-        self.dim_head = dim_head
-        self.heads = heads
-        inner_dim = dim_head * heads
-
-        self.norm1 = nn.LayerNorm(dim)
-        self.norm2 = nn.LayerNorm(dim)
-
-        self.to_q = nn.Linear(dim, inner_dim, bias=False)
-        self.to_kv = nn.Linear(dim, inner_dim * 2, bias=False)
-        self.to_out = nn.Linear(inner_dim, dim, bias=False)
-
-    def forward(self, x, latents):
-        """
-        Args:
-            x (torch.Tensor): image features
-                shape (b, n1, D)
-            latent (torch.Tensor): latent features
-                shape (b, n2, D)
-        """
-        x = self.norm1(x)
-        latents = self.norm2(latents)
-
-        b, l, _ = latents.shape
-
-        q = self.to_q(latents)
-        kv_input = torch.cat((x, latents), dim=-2)
-        k, v = self.to_kv(kv_input).chunk(2, dim=-1)
-
-        q = reshape_tensor(q, self.heads)
-        k = reshape_tensor(k, self.heads)
-        v = reshape_tensor(v, self.heads)
-
-        # attention
-        scale = 1 / math.sqrt(math.sqrt(self.dim_head))
-        weight = (q * scale) @ (k * scale).transpose(
-            -2, -1
-        )  # More stable with f16 than dividing afterwards
-        weight = torch.softmax(weight.float(), dim=-1).type(weight.dtype)
-        out = weight @ v
-
-        out = out.permute(0, 2, 1, 3).reshape(b, l, -1)
-
-        return self.to_out(out)
-
-
-class Resampler(nn.Module):
-    def __init__(
-        self,
-        dim=1024,
-        depth=8,
-        dim_head=64,
-        heads=16,
-        num_queries=8,
-        embedding_dim=768,
-        output_dim=1024,
-        ff_mult=4,
-        video_length=None,  # using frame-wise version or not
-    ):
-        super().__init__()
-        ## queries for a single frame / image
-        self.num_queries = num_queries
-        self.video_length = video_length
-        ## <num_queries> queries for each frame
-        if video_length is not None:
-            num_queries = num_queries * video_length
-
-        self.latents = nn.Parameter(torch.randn(1, num_queries, dim) / dim**0.5)
-        self.proj_in = nn.Linear(embedding_dim, dim)
-        self.proj_out = nn.Linear(dim, output_dim)
-        self.norm_out = nn.LayerNorm(output_dim)
-
-        self.layers = nn.ModuleList([])
-        for _ in range(depth):
-            self.layers.append(
-                nn.ModuleList(
-                    [
-                        PerceiverAttention(dim=dim, dim_head=dim_head, heads=heads),
-                        FeedForward(dim=dim, mult=ff_mult),
-                    ]
-                )
-            )
-
-    def forward(self, x):
-        latents = self.latents.repeat(x.size(0), 1, 1)
-        x = self.proj_in(x)
-
-        for attn, ff in self.layers:
-            latents = attn(x, latents) + latents
-            latents = ff(latents) + latents
-
-        latents = self.proj_out(latents)
-        return self.norm_out(latents)
diff --git a/videotuna/models/lvdm/modules/losses/__init__.py b/videotuna/models/lvdm/modules/losses/__init__.py
deleted file mode 100644
index f372dbcb..00000000
--- a/videotuna/models/lvdm/modules/losses/__init__.py
+++ /dev/null
@@ -1 +0,0 @@
-from models.lvdm.modules.losses.contperceptual import LPIPSWithDiscriminator
diff --git a/videotuna/models/lvdm/modules/losses/contperceptual.py b/videotuna/models/lvdm/modules/losses/contperceptual.py
deleted file mode 100644
index 8f332297..00000000
--- a/videotuna/models/lvdm/modules/losses/contperceptual.py
+++ /dev/null
@@ -1,173 +0,0 @@
-import torch
-import torch.nn as nn
-from einops import rearrange
-from taming.modules.losses.vqperceptual import *  # TODO: taming dependency yes/no?
-
-
-class LPIPSWithDiscriminator(nn.Module):
-    def __init__(
-        self,
-        disc_start,
-        logvar_init=0.0,
-        kl_weight=1.0,
-        pixelloss_weight=1.0,
-        disc_num_layers=3,
-        disc_in_channels=3,
-        disc_factor=1.0,
-        disc_weight=1.0,
-        perceptual_weight=1.0,
-        use_actnorm=False,
-        disc_conditional=False,
-        disc_loss="hinge",
-        max_bs=None,
-    ):
-
-        super().__init__()
-        assert disc_loss in ["hinge", "vanilla"]
-        self.kl_weight = kl_weight
-        self.pixel_weight = pixelloss_weight
-        self.perceptual_loss = LPIPS().eval()
-        self.perceptual_weight = perceptual_weight
-        # output log variance
-        self.logvar = nn.Parameter(torch.ones(size=()) * logvar_init)
-
-        self.discriminator = NLayerDiscriminator(
-            input_nc=disc_in_channels, n_layers=disc_num_layers, use_actnorm=use_actnorm
-        ).apply(weights_init)
-        self.discriminator_iter_start = disc_start
-        self.disc_loss = hinge_d_loss if disc_loss == "hinge" else vanilla_d_loss
-        self.disc_factor = disc_factor
-        self.discriminator_weight = disc_weight
-        self.disc_conditional = disc_conditional
-        self.max_bs = max_bs
-
-    def calculate_adaptive_weight(self, nll_loss, g_loss, last_layer=None):
-        if last_layer is not None:
-            nll_grads = torch.autograd.grad(nll_loss, last_layer, retain_graph=True)[0]
-            g_grads = torch.autograd.grad(g_loss, last_layer, retain_graph=True)[0]
-        else:
-            nll_grads = torch.autograd.grad(
-                nll_loss, self.last_layer[0], retain_graph=True
-            )[0]
-            g_grads = torch.autograd.grad(
-                g_loss, self.last_layer[0], retain_graph=True
-            )[0]
-
-        d_weight = torch.norm(nll_grads) / (torch.norm(g_grads) + 1e-4)
-        d_weight = torch.clamp(d_weight, 0.0, 1e4).detach()
-        d_weight = d_weight * self.discriminator_weight
-        return d_weight
-
-    def forward(
-        self,
-        inputs,
-        reconstructions,
-        posteriors,
-        optimizer_idx,
-        global_step,
-        last_layer=None,
-        cond=None,
-        split="train",
-        weights=None,
-    ):
-        if inputs.dim() == 5:
-            inputs = rearrange(inputs, "b c t h w -> (b t) c h w")
-        if reconstructions.dim() == 5:
-            reconstructions = rearrange(reconstructions, "b c t h w -> (b t) c h w")
-        rec_loss = torch.abs(inputs.contiguous() - reconstructions.contiguous())
-        if self.perceptual_weight > 0:
-            if self.max_bs is not None and self.max_bs < inputs.shape[0]:
-                input_list = torch.split(inputs, self.max_bs, dim=0)
-                reconstruction_list = torch.split(reconstructions, self.max_bs, dim=0)
-                p_losses = [
-                    self.perceptual_loss(
-                        inputs.contiguous(), reconstructions.contiguous()
-                    )
-                    for inputs, reconstructions in zip(input_list, reconstruction_list)
-                ]
-                p_loss = torch.cat(p_losses, dim=0)
-            else:
-                p_loss = self.perceptual_loss(
-                    inputs.contiguous(), reconstructions.contiguous()
-                )
-            rec_loss = rec_loss + self.perceptual_weight * p_loss
-
-        nll_loss = rec_loss / torch.exp(self.logvar) + self.logvar
-        weighted_nll_loss = nll_loss
-        if weights is not None:
-            weighted_nll_loss = weights * nll_loss
-        weighted_nll_loss = torch.sum(weighted_nll_loss) / weighted_nll_loss.shape[0]
-        nll_loss = torch.sum(nll_loss) / nll_loss.shape[0]
-
-        kl_loss = posteriors.kl()
-        kl_loss = torch.sum(kl_loss) / kl_loss.shape[0]
-
-        # now the GAN part
-        if optimizer_idx == 0:
-            # generator update
-            if cond is None:
-                assert not self.disc_conditional
-                logits_fake = self.discriminator(reconstructions.contiguous())
-            else:
-                assert self.disc_conditional
-                logits_fake = self.discriminator(
-                    torch.cat((reconstructions.contiguous(), cond), dim=1)
-                )
-            g_loss = -torch.mean(logits_fake)
-
-            if self.disc_factor > 0.0:
-                try:
-                    d_weight = self.calculate_adaptive_weight(
-                        nll_loss, g_loss, last_layer=last_layer
-                    )
-                except RuntimeError:
-                    assert not self.training
-                    d_weight = torch.tensor(0.0)
-            else:
-                d_weight = torch.tensor(0.0)
-
-            disc_factor = adopt_weight(
-                self.disc_factor, global_step, threshold=self.discriminator_iter_start
-            )
-            loss = (
-                weighted_nll_loss
-                + self.kl_weight * kl_loss
-                + d_weight * disc_factor * g_loss
-            )
-
-            log = {
-                "{}/total_loss".format(split): loss.clone().detach().mean(),
-                "{}/logvar".format(split): self.logvar.detach(),
-                "{}/kl_loss".format(split): kl_loss.detach().mean(),
-                "{}/nll_loss".format(split): nll_loss.detach().mean(),
-                "{}/rec_loss".format(split): rec_loss.detach().mean(),
-                "{}/d_weight".format(split): d_weight.detach(),
-                "{}/disc_factor".format(split): torch.tensor(disc_factor),
-                "{}/g_loss".format(split): g_loss.detach().mean(),
-            }
-            return loss, log
-
-        if optimizer_idx == 1:
-            # second pass for discriminator update
-            if cond is None:
-                logits_real = self.discriminator(inputs.contiguous().detach())
-                logits_fake = self.discriminator(reconstructions.contiguous().detach())
-            else:
-                logits_real = self.discriminator(
-                    torch.cat((inputs.contiguous().detach(), cond), dim=1)
-                )
-                logits_fake = self.discriminator(
-                    torch.cat((reconstructions.contiguous().detach(), cond), dim=1)
-                )
-
-            disc_factor = adopt_weight(
-                self.disc_factor, global_step, threshold=self.discriminator_iter_start
-            )
-            d_loss = disc_factor * self.disc_loss(logits_real, logits_fake)
-
-            log = {
-                "{}/disc_loss".format(split): d_loss.clone().detach().mean(),
-                "{}/logits_real".format(split): logits_real.detach().mean(),
-                "{}/logits_fake".format(split): logits_fake.detach().mean(),
-            }
-            return d_loss, log
diff --git a/videotuna/models/lvdm/modules/networks/openaimodel3d.py b/videotuna/models/lvdm/modules/networks/openaimodel3d.py
deleted file mode 100644
index 565c980e..00000000
--- a/videotuna/models/lvdm/modules/networks/openaimodel3d.py
+++ /dev/null
@@ -1,694 +0,0 @@
-from abc import abstractmethod
-from functools import partial
-
-import torch
-import torch.nn as nn
-import torch.nn.functional as F
-from einops import rearrange
-
-from videotuna.utils.diffusion_utils import timestep_embedding
-from videotuna.models.lvdm.modules.attention import SpatialTransformer, TemporalTransformer
-from videotuna.models.lvdm.modules.utils import (
-    avg_pool_nd,
-    checkpoint,
-    conv_nd,
-    linear,
-    normalization,
-    zero_module,
-)
-
-
-class TimestepBlock(nn.Module):
-    """
-    Any module where forward() takes timestep embeddings as a second argument.
-    """
-
-    @abstractmethod
-    def forward(self, x, emb):
-        """
-        Apply the module to `x` given `emb` timestep embeddings.
-        """
-
-
-class TimestepEmbedSequential(nn.Sequential, TimestepBlock):
-    """
-    A sequential module that passes timestep embeddings to the children that
-    support it as an extra input.
-    """
-
-    def forward(self, x, emb, context=None, batch_size=None):
-        for layer in self:
-            if isinstance(layer, TimestepBlock):
-                x = layer(x, emb, batch_size)
-            elif isinstance(layer, SpatialTransformer):
-                x = layer(x, context)
-            elif isinstance(layer, TemporalTransformer):
-                x = rearrange(x, "(b f) c h w -> b c f h w", b=batch_size)
-                x = layer(x, context)
-                x = rearrange(x, "b c f h w -> (b f) c h w")
-            else:
-                x = layer(
-                    x,
-                )
-        return x
-
-
-class Downsample(nn.Module):
-    """
-    A downsampling layer with an optional convolution.
-    :param channels: channels in the inputs and outputs.
-    :param use_conv: a bool determining if a convolution is applied.
-    :param dims: determines if the signal is 1D, 2D, or 3D. If 3D, then
-                 downsampling occurs in the inner-two dimensions.
-    """
-
-    def __init__(self, channels, use_conv, dims=2, out_channels=None, padding=1):
-        super().__init__()
-        self.channels = channels
-        self.out_channels = out_channels or channels
-        self.use_conv = use_conv
-        self.dims = dims
-        stride = 2 if dims != 3 else (1, 2, 2)
-        if use_conv:
-            self.op = conv_nd(
-                dims,
-                self.channels,
-                self.out_channels,
-                3,
-                stride=stride,
-                padding=padding,
-            )
-        else:
-            assert self.channels == self.out_channels
-            self.op = avg_pool_nd(dims, kernel_size=stride, stride=stride)
-
-    def forward(self, x):
-        assert x.shape[1] == self.channels
-        return self.op(x)
-
-
-class Upsample(nn.Module):
-    """
-    An upsampling layer with an optional convolution.
-    :param channels: channels in the inputs and outputs.
-    :param use_conv: a bool determining if a convolution is applied.
-    :param dims: determines if the signal is 1D, 2D, or 3D. If 3D, then
-                 upsampling occurs in the inner-two dimensions.
-    """
-
-    def __init__(self, channels, use_conv, dims=2, out_channels=None, padding=1):
-        super().__init__()
-        self.channels = channels
-        self.out_channels = out_channels or channels
-        self.use_conv = use_conv
-        self.dims = dims
-        if use_conv:
-            self.conv = conv_nd(
-                dims, self.channels, self.out_channels, 3, padding=padding
-            )
-
-    def forward(self, x):
-        assert x.shape[1] == self.channels
-        if self.dims == 3:
-            x = F.interpolate(
-                x, (x.shape[2], x.shape[3] * 2, x.shape[4] * 2), mode="nearest"
-            )
-        else:
-            x = F.interpolate(x, scale_factor=2, mode="nearest")
-        if self.use_conv:
-            x = self.conv(x)
-        return x
-
-
-class ResBlock(TimestepBlock):
-    """
-    A residual block that can optionally change the number of channels.
-    :param channels: the number of input channels.
-    :param emb_channels: the number of timestep embedding channels.
-    :param dropout: the rate of dropout.
-    :param out_channels: if specified, the number of out channels.
-    :param use_conv: if True and out_channels is specified, use a spatial
-        convolution instead of a smaller 1x1 convolution to change the
-        channels in the skip connection.
-    :param dims: determines if the signal is 1D, 2D, or 3D.
-    :param up: if True, use this block for upsampling.
-    :param down: if True, use this block for downsampling.
-    :param use_temporal_conv: if True, use the temporal convolution.
-    """
-
-    def __init__(
-        self,
-        channels,
-        emb_channels,
-        dropout,
-        out_channels=None,
-        use_scale_shift_norm=False,
-        dims=2,
-        use_checkpoint=False,
-        use_conv=False,
-        up=False,
-        down=False,
-        use_temporal_conv=False,
-        tempspatial_aware=False,
-    ):
-        super().__init__()
-        self.channels = channels
-        self.emb_channels = emb_channels
-        self.dropout = dropout
-        self.out_channels = out_channels or channels
-        self.use_conv = use_conv
-        self.use_checkpoint = use_checkpoint
-        self.use_scale_shift_norm = use_scale_shift_norm
-        self.use_temporal_conv = use_temporal_conv
-
-        self.in_layers = nn.Sequential(
-            normalization(channels),
-            nn.SiLU(),
-            conv_nd(dims, channels, self.out_channels, 3, padding=1),
-        )
-
-        self.updown = up or down
-
-        if up:
-            self.h_upd = Upsample(channels, False, dims)
-            self.x_upd = Upsample(channels, False, dims)
-        elif down:
-            self.h_upd = Downsample(channels, False, dims)
-            self.x_upd = Downsample(channels, False, dims)
-        else:
-            self.h_upd = self.x_upd = nn.Identity()
-
-        self.emb_layers = nn.Sequential(
-            nn.SiLU(),
-            nn.Linear(
-                emb_channels,
-                2 * self.out_channels if use_scale_shift_norm else self.out_channels,
-            ),
-        )
-        self.out_layers = nn.Sequential(
-            normalization(self.out_channels),
-            nn.SiLU(),
-            nn.Dropout(p=dropout),
-            zero_module(nn.Conv2d(self.out_channels, self.out_channels, 3, padding=1)),
-        )
-
-        if self.out_channels == channels:
-            self.skip_connection = nn.Identity()
-        elif use_conv:
-            self.skip_connection = conv_nd(
-                dims, channels, self.out_channels, 3, padding=1
-            )
-        else:
-            self.skip_connection = conv_nd(dims, channels, self.out_channels, 1)
-
-        if self.use_temporal_conv:
-            self.temopral_conv = TemporalConvBlock(
-                self.out_channels,
-                self.out_channels,
-                dropout=0.1,
-                spatial_aware=tempspatial_aware,
-            )
-
-    def forward(self, x, emb, batch_size=None):
-        """
-        Apply the block to a Tensor, conditioned on a timestep embedding.
-        :param x: an [N x C x ...] Tensor of features.
-        :param emb: an [N x emb_channels] Tensor of timestep embeddings.
-        :return: an [N x C x ...] Tensor of outputs.
-        """
-        input_tuple = (x, emb)
-        if batch_size:
-            forward_batchsize = partial(self._forward, batch_size=batch_size)
-            return checkpoint(
-                forward_batchsize, input_tuple, self.parameters(), self.use_checkpoint
-            )
-        return checkpoint(
-            self._forward, input_tuple, self.parameters(), self.use_checkpoint
-        )
-
-    def _forward(self, x, emb, batch_size=None):
-        if self.updown:
-            in_rest, in_conv = self.in_layers[:-1], self.in_layers[-1]
-            h = in_rest(x)
-            h = self.h_upd(h)
-            x = self.x_upd(x)
-            h = in_conv(h)
-        else:
-            h = self.in_layers(x)
-        emb_out = self.emb_layers(emb).type(h.dtype)
-        while len(emb_out.shape) < len(h.shape):
-            emb_out = emb_out[..., None]
-        if self.use_scale_shift_norm:
-            out_norm, out_rest = self.out_layers[0], self.out_layers[1:]
-            scale, shift = torch.chunk(emb_out, 2, dim=1)
-            h = out_norm(h) * (1 + scale) + shift
-            h = out_rest(h)
-        else:
-            h = h + emb_out
-            h = self.out_layers(h)
-        h = self.skip_connection(x) + h
-
-        if self.use_temporal_conv and batch_size:
-            h = rearrange(h, "(b t) c h w -> b c t h w", b=batch_size)
-            h = self.temopral_conv(h)
-            h = rearrange(h, "b c t h w -> (b t) c h w")
-        return h
-
-
-class TemporalConvBlock(nn.Module):
-    """
-    Adapted from modelscope: https://github.com/modelscope/modelscope/blob/master/modelscope/models/multi_modal/video_synthesis/unet_sd.py
-    """
-
-    def __init__(
-        self, in_channels, out_channels=None, dropout=0.0, spatial_aware=False
-    ):
-        super(TemporalConvBlock, self).__init__()
-        if out_channels is None:
-            out_channels = in_channels
-        self.in_channels = in_channels
-        self.out_channels = out_channels
-        kernel_shape = (3, 1, 1) if not spatial_aware else (3, 3, 3)
-        padding_shape = (1, 0, 0) if not spatial_aware else (1, 1, 1)
-
-        # conv layers
-        self.conv1 = nn.Sequential(
-            nn.GroupNorm(32, in_channels),
-            nn.SiLU(),
-            nn.Conv3d(in_channels, out_channels, kernel_shape, padding=padding_shape),
-        )
-        self.conv2 = nn.Sequential(
-            nn.GroupNorm(32, out_channels),
-            nn.SiLU(),
-            nn.Dropout(dropout),
-            nn.Conv3d(out_channels, in_channels, kernel_shape, padding=padding_shape),
-        )
-        self.conv3 = nn.Sequential(
-            nn.GroupNorm(32, out_channels),
-            nn.SiLU(),
-            nn.Dropout(dropout),
-            nn.Conv3d(out_channels, in_channels, (3, 1, 1), padding=(1, 0, 0)),
-        )
-        self.conv4 = nn.Sequential(
-            nn.GroupNorm(32, out_channels),
-            nn.SiLU(),
-            nn.Dropout(dropout),
-            nn.Conv3d(out_channels, in_channels, (3, 1, 1), padding=(1, 0, 0)),
-        )
-
-        # zero out the last layer params,so the conv block is identity
-        nn.init.zeros_(self.conv4[-1].weight)
-        nn.init.zeros_(self.conv4[-1].bias)
-
-    def forward(self, x):
-        identity = x
-        x = self.conv1(x)
-        x = self.conv2(x)
-        x = self.conv3(x)
-        x = self.conv4(x)
-
-        return x + identity
-
-
-class UNetModel(nn.Module):
-    """
-    The full UNet model with attention and timestep embedding.
-    :param in_channels: in_channels in the input Tensor.
-    :param model_channels: base channel count for the model.
-    :param out_channels: channels in the output Tensor.
-    :param num_res_blocks: number of residual blocks per downsample.
-    :param attention_resolutions: a collection of downsample rates at which
-        attention will take place. May be a set, list, or tuple.
-        For example, if this contains 4, then at 4x downsampling, attention
-        will be used.
-    :param dropout: the dropout probability.
-    :param channel_mult: channel multiplier for each level of the UNet.
-    :param conv_resample: if True, use learned convolutions for upsampling and
-        downsampling.
-    :param dims: determines if the signal is 1D, 2D, or 3D.
-    :param num_classes: if specified (as an int), then this model will be
-        class-conditional with `num_classes` classes.
-    :param use_checkpoint: use gradient checkpointing to reduce memory usage.
-    :param num_heads: the number of attention heads in each attention layer.
-    :param num_heads_channels: if specified, ignore num_heads and instead use
-                               a fixed channel width per attention head.
-    :param num_heads_upsample: works with num_heads to set a different number
-                               of heads for upsampling. Deprecated.
-    :param use_scale_shift_norm: use a FiLM-like conditioning mechanism.
-    :param resblock_updown: use residual blocks for up/downsampling.
-    """
-
-    def __init__(
-        self,
-        in_channels,
-        model_channels,
-        out_channels,
-        num_res_blocks,
-        attention_resolutions,
-        dropout=0.0,
-        channel_mult=(1, 2, 4, 8),
-        conv_resample=True,
-        dims=2,
-        context_dim=None,
-        use_scale_shift_norm=False,
-        resblock_updown=False,
-        num_heads=-1,
-        num_head_channels=-1,
-        transformer_depth=1,
-        use_linear=False,
-        use_checkpoint=False,
-        temporal_conv=False,
-        tempspatial_aware=False,
-        temporal_attention=True,
-        temporal_selfatt_only=True,
-        use_relative_position=True,
-        use_causal_attention=False,
-        temporal_length=None,
-        use_fp16=False,
-        addition_attention=False,
-        use_image_attention=False,
-        temporal_transformer_depth=1,
-        fps_cond=False,
-    ):
-        super(UNetModel, self).__init__()
-        if num_heads == -1:
-            assert (
-                num_head_channels != -1
-            ), "Either num_heads or num_head_channels has to be set"
-        if num_head_channels == -1:
-            assert (
-                num_heads != -1
-            ), "Either num_heads or num_head_channels has to be set"
-
-        self.in_channels = in_channels
-        self.model_channels = model_channels
-        self.out_channels = out_channels
-        self.num_res_blocks = num_res_blocks
-        self.attention_resolutions = attention_resolutions
-        self.dropout = dropout
-        self.channel_mult = channel_mult
-        self.conv_resample = conv_resample
-        self.temporal_attention = temporal_attention
-        time_embed_dim = model_channels * 4
-        self.use_checkpoint = use_checkpoint
-        self.dtype = torch.float16 if use_fp16 else torch.float32
-        self.addition_attention = addition_attention
-        self.use_image_attention = use_image_attention
-        self.fps_cond = fps_cond
-
-        self.time_embed = nn.Sequential(
-            linear(model_channels, time_embed_dim),
-            nn.SiLU(),
-            linear(time_embed_dim, time_embed_dim),
-        )
-        if self.fps_cond:
-            self.fps_embedding = nn.Sequential(
-                linear(model_channels, time_embed_dim),
-                nn.SiLU(),
-                linear(time_embed_dim, time_embed_dim),
-            )
-
-        self.input_blocks = nn.ModuleList(
-            [
-                TimestepEmbedSequential(
-                    conv_nd(dims, in_channels, model_channels, 3, padding=1)
-                )
-            ]
-        )
-        if self.addition_attention:
-            self.init_attn = TimestepEmbedSequential(
-                TemporalTransformer(
-                    model_channels,
-                    n_heads=8,
-                    d_head=num_head_channels,
-                    depth=transformer_depth,
-                    context_dim=context_dim,
-                    use_checkpoint=use_checkpoint,
-                    only_self_att=temporal_selfatt_only,
-                    causal_attention=use_causal_attention,
-                    relative_position=use_relative_position,
-                    temporal_length=temporal_length,
-                )
-            )
-
-        input_block_chans = [model_channels]
-        ch = model_channels
-        ds = 1
-        for level, mult in enumerate(channel_mult):
-            for _ in range(num_res_blocks):
-                layers = [
-                    ResBlock(
-                        ch,
-                        time_embed_dim,
-                        dropout,
-                        out_channels=mult * model_channels,
-                        dims=dims,
-                        use_checkpoint=use_checkpoint,
-                        use_scale_shift_norm=use_scale_shift_norm,
-                        tempspatial_aware=tempspatial_aware,
-                        use_temporal_conv=temporal_conv,
-                    )
-                ]
-                ch = mult * model_channels
-                if ds in attention_resolutions:
-                    if num_head_channels == -1:
-                        dim_head = ch // num_heads
-                    else:
-                        num_heads = ch // num_head_channels
-                        dim_head = num_head_channels
-                    layers.append(
-                        SpatialTransformer(
-                            ch,
-                            num_heads,
-                            dim_head,
-                            depth=transformer_depth,
-                            context_dim=context_dim,
-                            use_linear=use_linear,
-                            use_checkpoint=use_checkpoint,
-                            disable_self_attn=False,
-                            img_cross_attention=self.use_image_attention,
-                        )
-                    )
-                    if self.temporal_attention:
-                        layers.append(
-                            TemporalTransformer(
-                                ch,
-                                num_heads,
-                                dim_head,
-                                depth=temporal_transformer_depth,
-                                context_dim=context_dim,
-                                use_linear=use_linear,
-                                use_checkpoint=use_checkpoint,
-                                only_self_att=temporal_selfatt_only,
-                                causal_attention=use_causal_attention,
-                                relative_position=use_relative_position,
-                                temporal_length=temporal_length,
-                            )
-                        )
-                self.input_blocks.append(TimestepEmbedSequential(*layers))
-                input_block_chans.append(ch)
-            if level != len(channel_mult) - 1:
-                out_ch = ch
-                self.input_blocks.append(
-                    TimestepEmbedSequential(
-                        ResBlock(
-                            ch,
-                            time_embed_dim,
-                            dropout,
-                            out_channels=out_ch,
-                            dims=dims,
-                            use_checkpoint=use_checkpoint,
-                            use_scale_shift_norm=use_scale_shift_norm,
-                            down=True,
-                        )
-                        if resblock_updown
-                        else Downsample(
-                            ch, conv_resample, dims=dims, out_channels=out_ch
-                        )
-                    )
-                )
-                ch = out_ch
-                input_block_chans.append(ch)
-                ds *= 2
-
-        if num_head_channels == -1:
-            dim_head = ch // num_heads
-        else:
-            num_heads = ch // num_head_channels
-            dim_head = num_head_channels
-        layers = [
-            ResBlock(
-                ch,
-                time_embed_dim,
-                dropout,
-                dims=dims,
-                use_checkpoint=use_checkpoint,
-                use_scale_shift_norm=use_scale_shift_norm,
-                tempspatial_aware=tempspatial_aware,
-                use_temporal_conv=temporal_conv,
-            ),
-            SpatialTransformer(
-                ch,
-                num_heads,
-                dim_head,
-                depth=transformer_depth,
-                context_dim=context_dim,
-                use_linear=use_linear,
-                use_checkpoint=use_checkpoint,
-                disable_self_attn=False,
-                img_cross_attention=self.use_image_attention,
-            ),
-        ]
-        if self.temporal_attention:
-            layers.append(
-                TemporalTransformer(
-                    ch,
-                    num_heads,
-                    dim_head,
-                    depth=temporal_transformer_depth,
-                    context_dim=context_dim,
-                    use_linear=use_linear,
-                    use_checkpoint=use_checkpoint,
-                    only_self_att=temporal_selfatt_only,
-                    causal_attention=use_causal_attention,
-                    relative_position=use_relative_position,
-                    temporal_length=temporal_length,
-                )
-            )
-        layers.append(
-            ResBlock(
-                ch,
-                time_embed_dim,
-                dropout,
-                dims=dims,
-                use_checkpoint=use_checkpoint,
-                use_scale_shift_norm=use_scale_shift_norm,
-                tempspatial_aware=tempspatial_aware,
-                use_temporal_conv=temporal_conv,
-            )
-        )
-        self.middle_block = TimestepEmbedSequential(*layers)
-
-        self.output_blocks = nn.ModuleList([])
-        for level, mult in list(enumerate(channel_mult))[::-1]:
-            for i in range(num_res_blocks + 1):
-                ich = input_block_chans.pop()
-                layers = [
-                    ResBlock(
-                        ch + ich,
-                        time_embed_dim,
-                        dropout,
-                        out_channels=mult * model_channels,
-                        dims=dims,
-                        use_checkpoint=use_checkpoint,
-                        use_scale_shift_norm=use_scale_shift_norm,
-                        tempspatial_aware=tempspatial_aware,
-                        use_temporal_conv=temporal_conv,
-                    )
-                ]
-                ch = model_channels * mult
-                if ds in attention_resolutions:
-                    if num_head_channels == -1:
-                        dim_head = ch // num_heads
-                    else:
-                        num_heads = ch // num_head_channels
-                        dim_head = num_head_channels
-                    layers.append(
-                        SpatialTransformer(
-                            ch,
-                            num_heads,
-                            dim_head,
-                            depth=transformer_depth,
-                            context_dim=context_dim,
-                            use_linear=use_linear,
-                            use_checkpoint=use_checkpoint,
-                            disable_self_attn=False,
-                            img_cross_attention=self.use_image_attention,
-                        )
-                    )
-                    if self.temporal_attention:
-                        layers.append(
-                            TemporalTransformer(
-                                ch,
-                                num_heads,
-                                dim_head,
-                                depth=temporal_transformer_depth,
-                                context_dim=context_dim,
-                                use_linear=use_linear,
-                                use_checkpoint=use_checkpoint,
-                                only_self_att=temporal_selfatt_only,
-                                causal_attention=use_causal_attention,
-                                relative_position=use_relative_position,
-                                temporal_length=temporal_length,
-                            )
-                        )
-                if level and i == num_res_blocks:
-                    out_ch = ch
-                    layers.append(
-                        ResBlock(
-                            ch,
-                            time_embed_dim,
-                            dropout,
-                            out_channels=out_ch,
-                            dims=dims,
-                            use_checkpoint=use_checkpoint,
-                            use_scale_shift_norm=use_scale_shift_norm,
-                            up=True,
-                        )
-                        if resblock_updown
-                        else Upsample(ch, conv_resample, dims=dims, out_channels=out_ch)
-                    )
-                    ds //= 2
-                self.output_blocks.append(TimestepEmbedSequential(*layers))
-
-        self.out = nn.Sequential(
-            normalization(ch),
-            nn.SiLU(),
-            zero_module(conv_nd(dims, model_channels, out_channels, 3, padding=1)),
-        )
-
-    def forward(
-        self, x, timesteps, context=None, features_adapter=None, fps=16, **kwargs
-    ):
-        t_emb = timestep_embedding(timesteps, self.model_channels, repeat_only=False)
-        emb = self.time_embed(t_emb)
-
-        if self.fps_cond:
-            if type(fps) == int:
-                fps = torch.full_like(timesteps, fps)
-            fps_emb = timestep_embedding(fps, self.model_channels, repeat_only=False)
-            emb += self.fps_embedding(fps_emb)
-
-        b, _, t, _, _ = x.shape
-        ## repeat t times for context [(b t) 77 768] & time embedding
-        context = context.repeat_interleave(repeats=t, dim=0)
-        emb = emb.repeat_interleave(repeats=t, dim=0)
-
-        ## always in shape (b t) c h w, except for temporal layer
-        x = rearrange(x, "b c t h w -> (b t) c h w")
-
-        h = x.type(self.dtype)
-        adapter_idx = 0
-        hs = []
-        for id, module in enumerate(self.input_blocks):
-            h = module(h, emb, context=context, batch_size=b)
-            if id == 0 and self.addition_attention:
-                h = self.init_attn(h, emb, context=context, batch_size=b)
-            ## plug-in adapter features
-            if ((id + 1) % 3 == 0) and features_adapter is not None:
-                h = h + features_adapter[adapter_idx]
-                adapter_idx += 1
-            hs.append(h)
-        if features_adapter is not None:
-            assert len(features_adapter) == adapter_idx, "Wrong features_adapter"
-
-        h = self.middle_block(h, emb, context=context, batch_size=b)
-        for module in self.output_blocks:
-            h = torch.cat([h, hs.pop()], dim=1)
-            h = module(h, emb, context=context, batch_size=b)
-        h = h.type(x.dtype)
-        y = self.out(h).type(x.dtype)
-
-        # reshape back to (b c t h w)
-        y = rearrange(y, "(b t) c h w -> b c t h w", b=b)
-        return y
diff --git a/videotuna/models/lvdm/modules/networks/openaimodel3d_dc.py b/videotuna/models/lvdm/modules/networks/openaimodel3d_dc.py
deleted file mode 100644
index 8d461f60..00000000
--- a/videotuna/models/lvdm/modules/networks/openaimodel3d_dc.py
+++ /dev/null
@@ -1,737 +0,0 @@
-from abc import abstractmethod
-from functools import partial
-
-import torch
-import torch.nn as nn
-import torch.nn.functional as F
-from einops import rearrange
-
-from videotuna.utils.diffusion_utils import timestep_embedding
-from videotuna.models.lvdm.modules.attention import SpatialTransformer, TemporalTransformer
-from videotuna.models.lvdm.modules.utils import (
-    avg_pool_nd,
-    checkpoint,
-    conv_nd,
-    linear,
-    normalization,
-    zero_module,
-)
-
-
-class TimestepBlock(nn.Module):
-    """
-    Any module where forward() takes timestep embeddings as a second argument.
-    """
-
-    @abstractmethod
-    def forward(self, x, emb):
-        """
-        Apply the module to `x` given `emb` timestep embeddings.
-        """
-
-
-class TimestepEmbedSequential(nn.Sequential, TimestepBlock):
-    """
-    A sequential module that passes timestep embeddings to the children that
-    support it as an extra input.
-    """
-
-    def forward(self, x, emb, context=None, batch_size=None):
-        for layer in self:
-            if isinstance(layer, TimestepBlock):
-                x = layer(x, emb, batch_size=batch_size)
-            elif isinstance(layer, SpatialTransformer):
-                x = layer(x, context)
-            elif isinstance(layer, TemporalTransformer):
-                x = rearrange(x, "(b f) c h w -> b c f h w", b=batch_size)
-                x = layer(x, context)
-                x = rearrange(x, "b c f h w -> (b f) c h w")
-            else:
-                x = layer(x)
-        return x
-
-
-class Downsample(nn.Module):
-    """
-    A downsampling layer with an optional convolution.
-    :param channels: channels in the inputs and outputs.
-    :param use_conv: a bool determining if a convolution is applied.
-    :param dims: determines if the signal is 1D, 2D, or 3D. If 3D, then
-                 downsampling occurs in the inner-two dimensions.
-    """
-
-    def __init__(self, channels, use_conv, dims=2, out_channels=None, padding=1):
-        super().__init__()
-        self.channels = channels
-        self.out_channels = out_channels or channels
-        self.use_conv = use_conv
-        self.dims = dims
-        stride = 2 if dims != 3 else (1, 2, 2)
-        if use_conv:
-            self.op = conv_nd(
-                dims,
-                self.channels,
-                self.out_channels,
-                3,
-                stride=stride,
-                padding=padding,
-            )
-        else:
-            assert self.channels == self.out_channels
-            self.op = avg_pool_nd(dims, kernel_size=stride, stride=stride)
-
-    def forward(self, x):
-        assert x.shape[1] == self.channels
-        return self.op(x)
-
-
-class Upsample(nn.Module):
-    """
-    An upsampling layer with an optional convolution.
-    :param channels: channels in the inputs and outputs.
-    :param use_conv: a bool determining if a convolution is applied.
-    :param dims: determines if the signal is 1D, 2D, or 3D. If 3D, then
-                 upsampling occurs in the inner-two dimensions.
-    """
-
-    def __init__(self, channels, use_conv, dims=2, out_channels=None, padding=1):
-        super().__init__()
-        self.channels = channels
-        self.out_channels = out_channels or channels
-        self.use_conv = use_conv
-        self.dims = dims
-        if use_conv:
-            self.conv = conv_nd(
-                dims, self.channels, self.out_channels, 3, padding=padding
-            )
-
-    def forward(self, x):
-        assert x.shape[1] == self.channels
-        if self.dims == 3:
-            x = F.interpolate(
-                x, (x.shape[2], x.shape[3] * 2, x.shape[4] * 2), mode="nearest"
-            )
-        else:
-            x = F.interpolate(x, scale_factor=2, mode="nearest")
-        if self.use_conv:
-            x = self.conv(x)
-        return x
-
-
-class ResBlock(TimestepBlock):
-    """
-    A residual block that can optionally change the number of channels.
-    :param channels: the number of input channels.
-    :param emb_channels: the number of timestep embedding channels.
-    :param dropout: the rate of dropout.
-    :param out_channels: if specified, the number of out channels.
-    :param use_conv: if True and out_channels is specified, use a spatial
-        convolution instead of a smaller 1x1 convolution to change the
-        channels in the skip connection.
-    :param dims: determines if the signal is 1D, 2D, or 3D.
-    :param up: if True, use this block for upsampling.
-    :param down: if True, use this block for downsampling.
-    :param use_temporal_conv: if True, use the temporal convolution.
-    :param use_image_dataset: if True, the temporal parameters will not be optimized.
-    """
-
-    def __init__(
-        self,
-        channels,
-        emb_channels,
-        dropout,
-        out_channels=None,
-        use_scale_shift_norm=False,
-        dims=2,
-        use_checkpoint=False,
-        use_conv=False,
-        up=False,
-        down=False,
-        use_temporal_conv=False,
-        tempspatial_aware=False,
-    ):
-        super().__init__()
-        self.channels = channels
-        self.emb_channels = emb_channels
-        self.dropout = dropout
-        self.out_channels = out_channels or channels
-        self.use_conv = use_conv
-        self.use_checkpoint = use_checkpoint
-        self.use_scale_shift_norm = use_scale_shift_norm
-        self.use_temporal_conv = use_temporal_conv
-
-        self.in_layers = nn.Sequential(
-            normalization(channels),
-            nn.SiLU(),
-            conv_nd(dims, channels, self.out_channels, 3, padding=1),
-        )
-
-        self.updown = up or down
-
-        if up:
-            self.h_upd = Upsample(channels, False, dims)
-            self.x_upd = Upsample(channels, False, dims)
-        elif down:
-            self.h_upd = Downsample(channels, False, dims)
-            self.x_upd = Downsample(channels, False, dims)
-        else:
-            self.h_upd = self.x_upd = nn.Identity()
-
-        self.emb_layers = nn.Sequential(
-            nn.SiLU(),
-            nn.Linear(
-                emb_channels,
-                2 * self.out_channels if use_scale_shift_norm else self.out_channels,
-            ),
-        )
-        self.out_layers = nn.Sequential(
-            normalization(self.out_channels),
-            nn.SiLU(),
-            nn.Dropout(p=dropout),
-            zero_module(nn.Conv2d(self.out_channels, self.out_channels, 3, padding=1)),
-        )
-
-        if self.out_channels == channels:
-            self.skip_connection = nn.Identity()
-        elif use_conv:
-            self.skip_connection = conv_nd(
-                dims, channels, self.out_channels, 3, padding=1
-            )
-        else:
-            self.skip_connection = conv_nd(dims, channels, self.out_channels, 1)
-
-        if self.use_temporal_conv:
-            self.temopral_conv = TemporalConvBlock(
-                self.out_channels,
-                self.out_channels,
-                dropout=0.1,
-                spatial_aware=tempspatial_aware,
-            )
-
-    def forward(self, x, emb, batch_size=None):
-        """
-        Apply the block to a Tensor, conditioned on a timestep embedding.
-        :param x: an [N x C x ...] Tensor of features.
-        :param emb: an [N x emb_channels] Tensor of timestep embeddings.
-        :return: an [N x C x ...] Tensor of outputs.
-        """
-        input_tuple = (x, emb)
-        if batch_size:
-            forward_batchsize = partial(self._forward, batch_size=batch_size)
-            return checkpoint(
-                forward_batchsize, input_tuple, self.parameters(), self.use_checkpoint
-            )
-        return checkpoint(
-            self._forward, input_tuple, self.parameters(), self.use_checkpoint
-        )
-
-    def _forward(self, x, emb, batch_size=None):
-        if self.updown:
-            in_rest, in_conv = self.in_layers[:-1], self.in_layers[-1]
-            h = in_rest(x)
-            h = self.h_upd(h)
-            x = self.x_upd(x)
-            h = in_conv(h)
-        else:
-            h = self.in_layers(x)
-        emb_out = self.emb_layers(emb).type(h.dtype)
-        while len(emb_out.shape) < len(h.shape):
-            emb_out = emb_out[..., None]
-        if self.use_scale_shift_norm:
-            out_norm, out_rest = self.out_layers[0], self.out_layers[1:]
-            scale, shift = torch.chunk(emb_out, 2, dim=1)
-            h = out_norm(h) * (1 + scale) + shift
-            h = out_rest(h)
-        else:
-            h = h + emb_out
-            h = self.out_layers(h)
-        h = self.skip_connection(x) + h
-
-        if self.use_temporal_conv and batch_size:
-            h = rearrange(h, "(b t) c h w -> b c t h w", b=batch_size)
-            h = self.temopral_conv(h)
-            h = rearrange(h, "b c t h w -> (b t) c h w")
-        return h
-
-
-class TemporalConvBlock(nn.Module):
-    """
-    Adapted from modelscope: https://github.com/modelscope/modelscope/blob/master/modelscope/models/multi_modal/video_synthesis/unet_sd.py
-    """
-
-    def __init__(
-        self, in_channels, out_channels=None, dropout=0.0, spatial_aware=False
-    ):
-        super(TemporalConvBlock, self).__init__()
-        if out_channels is None:
-            out_channels = in_channels
-        self.in_channels = in_channels
-        self.out_channels = out_channels
-        th_kernel_shape = (3, 1, 1) if not spatial_aware else (3, 3, 1)
-        th_padding_shape = (1, 0, 0) if not spatial_aware else (1, 1, 0)
-        tw_kernel_shape = (3, 1, 1) if not spatial_aware else (3, 1, 3)
-        tw_padding_shape = (1, 0, 0) if not spatial_aware else (1, 0, 1)
-
-        # conv layers
-        self.conv1 = nn.Sequential(
-            nn.GroupNorm(32, in_channels),
-            nn.SiLU(),
-            nn.Conv3d(
-                in_channels, out_channels, th_kernel_shape, padding=th_padding_shape
-            ),
-        )
-        self.conv2 = nn.Sequential(
-            nn.GroupNorm(32, out_channels),
-            nn.SiLU(),
-            nn.Dropout(dropout),
-            nn.Conv3d(
-                out_channels, in_channels, tw_kernel_shape, padding=tw_padding_shape
-            ),
-        )
-        self.conv3 = nn.Sequential(
-            nn.GroupNorm(32, out_channels),
-            nn.SiLU(),
-            nn.Dropout(dropout),
-            nn.Conv3d(
-                out_channels, in_channels, th_kernel_shape, padding=th_padding_shape
-            ),
-        )
-        self.conv4 = nn.Sequential(
-            nn.GroupNorm(32, out_channels),
-            nn.SiLU(),
-            nn.Dropout(dropout),
-            nn.Conv3d(
-                out_channels, in_channels, tw_kernel_shape, padding=tw_padding_shape
-            ),
-        )
-
-        # zero out the last layer params,so the conv block is identity
-        nn.init.zeros_(self.conv4[-1].weight)
-        nn.init.zeros_(self.conv4[-1].bias)
-
-    def forward(self, x):
-        identity = x
-        x = self.conv1(x)
-        x = self.conv2(x)
-        x = self.conv3(x)
-        x = self.conv4(x)
-
-        return identity + x
-
-
-class UNetModel(nn.Module):
-    """
-    The full UNet model with attention and timestep embedding.
-    :param in_channels: in_channels in the input Tensor.
-    :param model_channels: base channel count for the model.
-    :param out_channels: channels in the output Tensor.
-    :param num_res_blocks: number of residual blocks per downsample.
-    :param attention_resolutions: a collection of downsample rates at which
-        attention will take place. May be a set, list, or tuple.
-        For example, if this contains 4, then at 4x downsampling, attention
-        will be used.
-    :param dropout: the dropout probability.
-    :param channel_mult: channel multiplier for each level of the UNet.
-    :param conv_resample: if True, use learned convolutions for upsampling and
-        downsampling.
-    :param dims: determines if the signal is 1D, 2D, or 3D.
-    :param num_classes: if specified (as an int), then this model will be
-        class-conditional with `num_classes` classes.
-    :param use_checkpoint: use gradient checkpointing to reduce memory usage.
-    :param num_heads: the number of attention heads in each attention layer.
-    :param num_heads_channels: if specified, ignore num_heads and instead use
-                               a fixed channel width per attention head.
-    :param num_heads_upsample: works with num_heads to set a different number
-                               of heads for upsampling. Deprecated.
-    :param use_scale_shift_norm: use a FiLM-like conditioning mechanism.
-    :param resblock_updown: use residual blocks for up/downsampling.
-    :param use_new_attention_order: use a different attention pattern for potentially
-                                    increased efficiency.
-    """
-
-    def __init__(
-        self,
-        in_channels,
-        model_channels,
-        out_channels,
-        num_res_blocks,
-        attention_resolutions,
-        dropout=0.0,
-        channel_mult=(1, 2, 4, 8),
-        conv_resample=True,
-        dims=2,
-        context_dim=None,
-        use_scale_shift_norm=False,
-        resblock_updown=False,
-        num_heads=-1,
-        num_head_channels=-1,
-        transformer_depth=1,
-        use_linear=False,
-        use_checkpoint=False,
-        temporal_conv=False,
-        tempspatial_aware=False,
-        temporal_attention=True,
-        use_relative_position=True,
-        use_causal_attention=False,
-        temporal_length=None,
-        use_fp16=False,
-        addition_attention=False,
-        temporal_selfatt_only=True,
-        img_cross_attention=False,
-        img_cross_attention_scale_learnable=False,
-        default_fs=4,
-        fs_condition=False,
-    ):
-        super(UNetModel, self).__init__()
-        if num_heads == -1:
-            assert (
-                num_head_channels != -1
-            ), "Either num_heads or num_head_channels has to be set"
-        if num_head_channels == -1:
-            assert (
-                num_heads != -1
-            ), "Either num_heads or num_head_channels has to be set"
-
-        self.in_channels = in_channels
-        self.model_channels = model_channels
-        self.out_channels = out_channels
-        self.num_res_blocks = num_res_blocks
-        self.attention_resolutions = attention_resolutions
-        self.dropout = dropout
-        self.channel_mult = channel_mult
-        self.conv_resample = conv_resample
-        self.temporal_attention = temporal_attention
-        time_embed_dim = model_channels * 4
-        self.use_checkpoint = use_checkpoint
-        self.dtype = torch.float16 if use_fp16 else torch.float32
-        temporal_self_att_only = True
-        self.addition_attention = addition_attention
-        self.temporal_length = temporal_length
-        self.img_cross_attention = img_cross_attention
-        self.img_cross_attention_scale_learnable = img_cross_attention_scale_learnable
-        self.default_fs = default_fs
-        self.fs_condition = fs_condition
-
-        ## Time embedding blocks
-        self.time_embed = nn.Sequential(
-            linear(model_channels, time_embed_dim),
-            nn.SiLU(),
-            linear(time_embed_dim, time_embed_dim),
-        )
-        if fs_condition:
-            self.fps_embedding = nn.Sequential(
-                linear(model_channels, time_embed_dim),
-                nn.SiLU(),
-                linear(time_embed_dim, time_embed_dim),
-            )
-            nn.init.zeros_(self.fps_embedding[-1].weight)
-            nn.init.zeros_(self.fps_embedding[-1].bias)
-        ## Input Block
-        self.input_blocks = nn.ModuleList(
-            [
-                TimestepEmbedSequential(
-                    conv_nd(dims, in_channels, model_channels, 3, padding=1)
-                )
-            ]
-        )
-        if self.addition_attention:
-            self.init_attn = TimestepEmbedSequential(
-                TemporalTransformer(
-                    model_channels,
-                    n_heads=8,
-                    d_head=num_head_channels,
-                    depth=transformer_depth,
-                    context_dim=context_dim,
-                    use_checkpoint=use_checkpoint,
-                    only_self_att=temporal_selfatt_only,
-                    causal_attention=False,
-                    relative_position=use_relative_position,
-                    temporal_length=temporal_length,
-                )
-            )
-
-        input_block_chans = [model_channels]
-        ch = model_channels
-        ds = 1
-        for level, mult in enumerate(channel_mult):
-            for _ in range(num_res_blocks):
-                layers = [
-                    ResBlock(
-                        ch,
-                        time_embed_dim,
-                        dropout,
-                        out_channels=mult * model_channels,
-                        dims=dims,
-                        use_checkpoint=use_checkpoint,
-                        use_scale_shift_norm=use_scale_shift_norm,
-                        tempspatial_aware=tempspatial_aware,
-                        use_temporal_conv=temporal_conv,
-                    )
-                ]
-                ch = mult * model_channels
-                if ds in attention_resolutions:
-                    if num_head_channels == -1:
-                        dim_head = ch // num_heads
-                    else:
-                        num_heads = ch // num_head_channels
-                        dim_head = num_head_channels
-                    layers.append(
-                        SpatialTransformer(
-                            ch,
-                            num_heads,
-                            dim_head,
-                            depth=transformer_depth,
-                            context_dim=context_dim,
-                            use_linear=use_linear,
-                            use_checkpoint=use_checkpoint,
-                            disable_self_attn=False,
-                            img_cross_attention=self.img_cross_attention,
-                            img_cross_attention_scale_learnable=self.img_cross_attention_scale_learnable,
-                        )
-                    )
-                    if self.temporal_attention:
-                        layers.append(
-                            TemporalTransformer(
-                                ch,
-                                num_heads,
-                                dim_head,
-                                depth=transformer_depth,
-                                context_dim=context_dim,
-                                use_linear=use_linear,
-                                use_checkpoint=use_checkpoint,
-                                only_self_att=temporal_self_att_only,
-                                causal_attention=use_causal_attention,
-                                relative_position=use_relative_position,
-                                temporal_length=temporal_length,
-                            )
-                        )
-                self.input_blocks.append(TimestepEmbedSequential(*layers))
-                input_block_chans.append(ch)
-            if level != len(channel_mult) - 1:
-                out_ch = ch
-                self.input_blocks.append(
-                    TimestepEmbedSequential(
-                        ResBlock(
-                            ch,
-                            time_embed_dim,
-                            dropout,
-                            out_channels=out_ch,
-                            dims=dims,
-                            use_checkpoint=use_checkpoint,
-                            use_scale_shift_norm=use_scale_shift_norm,
-                            down=True,
-                        )
-                        if resblock_updown
-                        else Downsample(
-                            ch, conv_resample, dims=dims, out_channels=out_ch
-                        )
-                    )
-                )
-                ch = out_ch
-                input_block_chans.append(ch)
-                ds *= 2
-
-        if num_head_channels == -1:
-            dim_head = ch // num_heads
-        else:
-            num_heads = ch // num_head_channels
-            dim_head = num_head_channels
-        layers = [
-            ResBlock(
-                ch,
-                time_embed_dim,
-                dropout,
-                dims=dims,
-                use_checkpoint=use_checkpoint,
-                use_scale_shift_norm=use_scale_shift_norm,
-                tempspatial_aware=tempspatial_aware,
-                use_temporal_conv=temporal_conv,
-            ),
-            SpatialTransformer(
-                ch,
-                num_heads,
-                dim_head,
-                depth=transformer_depth,
-                context_dim=context_dim,
-                use_linear=use_linear,
-                use_checkpoint=use_checkpoint,
-                disable_self_attn=False,
-                img_cross_attention=self.img_cross_attention,
-                img_cross_attention_scale_learnable=self.img_cross_attention_scale_learnable,
-            ),
-        ]
-        if self.temporal_attention:
-            layers.append(
-                TemporalTransformer(
-                    ch,
-                    num_heads,
-                    dim_head,
-                    depth=transformer_depth,
-                    context_dim=context_dim,
-                    use_linear=use_linear,
-                    use_checkpoint=use_checkpoint,
-                    only_self_att=temporal_self_att_only,
-                    causal_attention=use_causal_attention,
-                    relative_position=use_relative_position,
-                    temporal_length=temporal_length,
-                )
-            )
-        layers.append(
-            ResBlock(
-                ch,
-                time_embed_dim,
-                dropout,
-                dims=dims,
-                use_checkpoint=use_checkpoint,
-                use_scale_shift_norm=use_scale_shift_norm,
-                tempspatial_aware=tempspatial_aware,
-                use_temporal_conv=temporal_conv,
-            )
-        )
-
-        ## Middle Block
-        self.middle_block = TimestepEmbedSequential(*layers)
-
-        ## Output Block
-        self.output_blocks = nn.ModuleList([])
-        for level, mult in list(enumerate(channel_mult))[::-1]:
-            for i in range(num_res_blocks + 1):
-                ich = input_block_chans.pop()
-                layers = [
-                    ResBlock(
-                        ch + ich,
-                        time_embed_dim,
-                        dropout,
-                        out_channels=mult * model_channels,
-                        dims=dims,
-                        use_checkpoint=use_checkpoint,
-                        use_scale_shift_norm=use_scale_shift_norm,
-                        tempspatial_aware=tempspatial_aware,
-                        use_temporal_conv=temporal_conv,
-                    )
-                ]
-                ch = model_channels * mult
-                if ds in attention_resolutions:
-                    if num_head_channels == -1:
-                        dim_head = ch // num_heads
-                    else:
-                        num_heads = ch // num_head_channels
-                        dim_head = num_head_channels
-                    layers.append(
-                        SpatialTransformer(
-                            ch,
-                            num_heads,
-                            dim_head,
-                            depth=transformer_depth,
-                            context_dim=context_dim,
-                            use_linear=use_linear,
-                            use_checkpoint=use_checkpoint,
-                            disable_self_attn=False,
-                            img_cross_attention=self.img_cross_attention,
-                            img_cross_attention_scale_learnable=self.img_cross_attention_scale_learnable,
-                        )
-                    )
-                    if self.temporal_attention:
-                        layers.append(
-                            TemporalTransformer(
-                                ch,
-                                num_heads,
-                                dim_head,
-                                depth=transformer_depth,
-                                context_dim=context_dim,
-                                use_linear=use_linear,
-                                use_checkpoint=use_checkpoint,
-                                only_self_att=temporal_self_att_only,
-                                causal_attention=use_causal_attention,
-                                relative_position=use_relative_position,
-                                temporal_length=temporal_length,
-                            )
-                        )
-                if level and i == num_res_blocks:
-                    out_ch = ch
-                    layers.append(
-                        ResBlock(
-                            ch,
-                            time_embed_dim,
-                            dropout,
-                            out_channels=out_ch,
-                            dims=dims,
-                            use_checkpoint=use_checkpoint,
-                            use_scale_shift_norm=use_scale_shift_norm,
-                            up=True,
-                        )
-                        if resblock_updown
-                        else Upsample(ch, conv_resample, dims=dims, out_channels=out_ch)
-                    )
-                    ds //= 2
-                self.output_blocks.append(TimestepEmbedSequential(*layers))
-
-        self.out = nn.Sequential(
-            normalization(ch),
-            nn.SiLU(),
-            zero_module(conv_nd(dims, model_channels, out_channels, 3, padding=1)),
-        )
-
-    def forward(
-        self, x, timesteps, context=None, features_adapter=None, fs=None, **kwargs
-    ):
-        b, _, t, _, _ = x.shape
-        t_emb = timestep_embedding(
-            timesteps, self.model_channels, repeat_only=False
-        ).type(x.dtype)
-        emb = self.time_embed(t_emb)
-
-        ## repeat t times for context [(b t) 77 768] & time embedding
-        ## check if we use per-frame image conditioning
-        _, l_context, _ = context.shape
-        if l_context == 77 + t * 16:  ## !!! HARD CODE here
-            context_text, context_img = context[:, :77, :], context[:, 77:, :]
-            context_text = context_text.repeat_interleave(repeats=t, dim=0)
-            context_img = rearrange(context_img, "b (t l) c -> (b t) l c", t=t)
-            context = torch.cat([context_text, context_img], dim=1)
-        else:
-            context = context.repeat_interleave(repeats=t, dim=0)
-        emb = emb.repeat_interleave(repeats=t, dim=0)
-
-        ## always in shape (b t) c h w, except for temporal layer
-        x = rearrange(x, "b c t h w -> (b t) c h w")
-
-        ## combine emb
-        if self.fs_condition:
-            if fs is None:
-                fs = torch.tensor(
-                    [self.default_fs] * b, dtype=torch.long, device=x.device
-                )
-            fs_emb = timestep_embedding(
-                fs, self.model_channels, repeat_only=False
-            ).type(x.dtype)
-
-            fs_embed = self.fps_embedding(fs_emb)
-            fs_embed = fs_embed.repeat_interleave(repeats=t, dim=0)
-            emb = emb + fs_embed
-
-        h = x.type(self.dtype)
-        adapter_idx = 0
-        hs = []
-        for id, module in enumerate(self.input_blocks):
-            h = module(h, emb, context=context, batch_size=b)
-            if id == 0 and self.addition_attention:
-                h = self.init_attn(h, emb, context=context, batch_size=b)
-            ## plug-in adapter features
-            if ((id + 1) % 3 == 0) and features_adapter is not None:
-                h = h + features_adapter[adapter_idx]
-                adapter_idx += 1
-            hs.append(h)
-        if features_adapter is not None:
-            assert len(features_adapter) == adapter_idx, "Wrong features_adapter"
-
-        h = self.middle_block(h, emb, context=context, batch_size=b)
-        for module in self.output_blocks:
-            h = torch.cat([h, hs.pop()], dim=1)
-            h = module(h, emb, context=context, batch_size=b)
-        h = h.type(x.dtype)
-        y = self.out(h)
-
-        # reshape back to (b c t h w)
-        y = rearrange(y, "(b t) c h w -> b c t h w", b=b)
-        return y
diff --git a/videotuna/models/lvdm/modules/utils.py b/videotuna/models/lvdm/modules/utils.py
deleted file mode 100644
index 5ba79d7d..00000000
--- a/videotuna/models/lvdm/modules/utils.py
+++ /dev/null
@@ -1,216 +0,0 @@
-# adopted from
-# https://github.com/openai/improved-diffusion/blob/main/improved_diffusion/gaussian_diffusion.py
-# and
-# https://github.com/lucidrains/denoising-diffusion-pytorch/blob/7706bdfc6f527f58d33f84b7b522e61e6e3164b3/denoising_diffusion_pytorch/denoising_diffusion_pytorch.py
-# and
-# https://github.com/openai/guided-diffusion/blob/0ba878e517b276c45d1195eb29f6f5f72659a05b/guided_diffusion/nn.py
-#
-# thanks!
-
-import math
-from inspect import isfunction
-
-import torch
-import torch.distributed as dist
-import torch.nn as nn
-from torch import nn
-
-from videotuna.utils.common_utils import instantiate_from_config
-
-
-def gather_data(data, return_np=True):
-    """gather data from multiple processes to one list"""
-    data_list = [torch.zeros_like(data) for _ in range(dist.get_world_size())]
-    dist.all_gather(data_list, data)  # gather not supported with NCCL
-    if return_np:
-        data_list = [data.cpu().numpy() for data in data_list]
-    return data_list
-
-
-def autocast(f):
-    def do_autocast(*args, **kwargs):
-        with torch.cuda.amp.autocast(
-            enabled=True,
-            dtype=torch.get_autocast_gpu_dtype(),
-            cache_enabled=torch.is_autocast_cache_enabled(),
-        ):
-            return f(*args, **kwargs)
-
-    return do_autocast
-
-
-def extract_into_tensor(a, t, x_shape):
-    b, *_ = t.shape
-    out = a.gather(-1, t)
-    return out.reshape(b, *((1,) * (len(x_shape) - 1)))
-
-
-def noise_like(shape, device, repeat=False):
-    repeat_noise = lambda: torch.randn((1, *shape[1:]), device=device).repeat(
-        shape[0], *((1,) * (len(shape) - 1))
-    )
-    noise = lambda: torch.randn(shape, device=device)
-    return repeat_noise() if repeat else noise()
-
-
-def default(val, d):
-    if exists(val):
-        return val
-    return d() if isfunction(d) else d
-
-
-def exists(val):
-    return val is not None
-
-
-def identity(*args, **kwargs):
-    return nn.Identity()
-
-
-def uniq(arr):
-    return {el: True for el in arr}.keys()
-
-
-def mean_flat(tensor):
-    """
-    Take the mean over all non-batch dimensions.
-    """
-    return tensor.mean(dim=list(range(1, len(tensor.shape))))
-
-
-def ismap(x):
-    if not isinstance(x, torch.Tensor):
-        return False
-    return (len(x.shape) == 4) and (x.shape[1] > 3)
-
-
-def isimage(x):
-    if not isinstance(x, torch.Tensor):
-        return False
-    return (len(x.shape) == 4) and (x.shape[1] == 3 or x.shape[1] == 1)
-
-
-def max_neg_value(t):
-    return -torch.finfo(t.dtype).max
-
-
-def shape_to_str(x):
-    shape_str = "x".join([str(x) for x in x.shape])
-    return shape_str
-
-
-def init_(tensor):
-    dim = tensor.shape[-1]
-    std = 1 / math.sqrt(dim)
-    tensor.uniform_(-std, std)
-    return tensor
-
-
-ckpt = torch.utils.checkpoint.checkpoint
-
-
-def checkpoint(func, inputs, params, flag):
-    """
-    Evaluate a function without caching intermediate activations, allowing for
-    reduced memory at the expense of extra compute in the backward pass.
-    :param func: the function to evaluate.
-    :param inputs: the argument sequence to pass to `func`.
-    :param params: a sequence of parameters `func` depends on but does not
-                   explicitly take as arguments.
-    :param flag: if False, disable gradient checkpointing.
-    """
-    if flag:
-        return ckpt(func, *inputs, use_reentrant=False)
-    else:
-        return func(*inputs)
-
-
-def disabled_train(self, mode=True):
-    """Overwrite model.train with this function to make sure train/eval mode
-    does not change anymore."""
-    return self
-
-
-def zero_module(module):
-    """
-    Zero out the parameters of a module and return it.
-    """
-    for p in module.parameters():
-        p.detach().zero_()
-    return module
-
-
-def scale_module(module, scale):
-    """
-    Scale the parameters of a module and return it.
-    """
-    for p in module.parameters():
-        p.detach().mul_(scale)
-    return module
-
-
-def conv_nd(dims, *args, **kwargs):
-    """
-    Create a 1D, 2D, or 3D convolution module.
-    """
-    if dims == 1:
-        return nn.Conv1d(*args, **kwargs)
-    elif dims == 2:
-        return nn.Conv2d(*args, **kwargs)
-    elif dims == 3:
-        return nn.Conv3d(*args, **kwargs)
-    raise ValueError(f"unsupported dimensions: {dims}")
-
-
-def linear(*args, **kwargs):
-    """
-    Create a linear module.
-    """
-    return nn.Linear(*args, **kwargs)
-
-
-def avg_pool_nd(dims, *args, **kwargs):
-    """
-    Create a 1D, 2D, or 3D average pooling module.
-    """
-    if dims == 1:
-        return nn.AvgPool1d(*args, **kwargs)
-    elif dims == 2:
-        return nn.AvgPool2d(*args, **kwargs)
-    elif dims == 3:
-        return nn.AvgPool3d(*args, **kwargs)
-    raise ValueError(f"unsupported dimensions: {dims}")
-
-
-def nonlinearity(type="silu"):
-    if type == "silu":
-        return nn.SiLU()
-    elif type == "leaky_relu":
-        return nn.LeakyReLU()
-
-
-class GroupNormSpecific(nn.GroupNorm):
-    def forward(self, x):
-        return super().forward(x.float()).type(x.dtype)
-
-
-def normalization(channels, num_groups=32):
-    """
-    Make a standard normalization layer.
-    :param channels: number of input channels.
-    :return: an nn.Module for normalization.
-    """
-    return GroupNormSpecific(num_groups, channels)
-
-
-class HybridConditioner(nn.Module):
-
-    def __init__(self, c_concat_config, c_crossattn_config):
-        super().__init__()
-        self.concat_conditioner = instantiate_from_config(c_concat_config)
-        self.crossattn_conditioner = instantiate_from_config(c_crossattn_config)
-
-    def forward(self, c_concat, c_crossattn):
-        c_concat = self.concat_conditioner(c_concat)
-        c_crossattn = self.crossattn_conditioner(c_crossattn)
-        return {"c_concat": [c_concat], "c_crossattn": [c_crossattn]}
diff --git a/videotuna/models/lvdm/modules/vae/autoencoder.py b/videotuna/models/lvdm/modules/vae/autoencoder.py
deleted file mode 100644
index 47cb9ee4..00000000
--- a/videotuna/models/lvdm/modules/vae/autoencoder.py
+++ /dev/null
@@ -1,276 +0,0 @@
-import os
-
-import torch
-from einops import rearrange
-import pytorch_lightning as pl
-import torch.nn.functional as F
-
-from videotuna.models.lvdm.modules.ae_modules import Decoder, Encoder
-from videotuna.utils.distributions import DiagonalGaussianDistribution
-from videotuna.utils.common_utils import instantiate_from_config
-
-
-class AutoencoderKL(pl.LightningModule):
-    def __init__(
-        self,
-        ddconfig,
-        lossconfig,
-        embed_dim,
-        ckpt_path=None,
-        ignore_keys=[],
-        image_key="image",
-        colorize_nlabels=None,
-        monitor=None,
-        test=False,
-        logdir=None,
-        input_dim=4,
-        test_args=None,
-    ):
-        super().__init__()
-        self.image_key = image_key
-        self.encoder = Encoder(**ddconfig)
-        self.decoder = Decoder(**ddconfig)
-        self.loss = instantiate_from_config(lossconfig)
-        assert ddconfig["double_z"]
-        self.quant_conv = torch.nn.Conv2d(2 * ddconfig["z_channels"], 2 * embed_dim, 1)
-        self.post_quant_conv = torch.nn.Conv2d(embed_dim, ddconfig["z_channels"], 1)
-        self.embed_dim = embed_dim
-        self.input_dim = input_dim
-        self.test = test
-        self.test_args = test_args
-        self.logdir = logdir
-        if colorize_nlabels is not None:
-            assert type(colorize_nlabels) == int
-            self.register_buffer("colorize", torch.randn(3, colorize_nlabels, 1, 1))
-        if monitor is not None:
-            self.monitor = monitor
-        if ckpt_path is not None:
-            self.init_from_ckpt(ckpt_path, ignore_keys=ignore_keys)
-        if self.test:
-            self.init_test()
-
-    def init_test(
-        self,
-    ):
-        self.test = True
-        save_dir = os.path.join(self.logdir, "test")
-        if "ckpt" in self.test_args:
-            ckpt_name = (
-                os.path.basename(self.test_args.ckpt).split(".ckpt")[0]
-                + f"_epoch{self._cur_epoch}"
-            )
-            self.root = os.path.join(save_dir, ckpt_name)
-        else:
-            self.root = save_dir
-        if "test_subdir" in self.test_args:
-            self.root = os.path.join(save_dir, self.test_args.test_subdir)
-
-        self.root_zs = os.path.join(self.root, "zs")
-        self.root_dec = os.path.join(self.root, "reconstructions")
-        self.root_inputs = os.path.join(self.root, "inputs")
-        os.makedirs(self.root, exist_ok=True)
-
-        if self.test_args.save_z:
-            os.makedirs(self.root_zs, exist_ok=True)
-        if self.test_args.save_reconstruction:
-            os.makedirs(self.root_dec, exist_ok=True)
-        if self.test_args.save_input:
-            os.makedirs(self.root_inputs, exist_ok=True)
-        assert self.test_args is not None
-        self.test_maximum = getattr(self.test_args, "test_maximum", None)
-        self.count = 0
-        self.eval_metrics = {}
-        self.decodes = []
-        self.save_decode_samples = 2048
-
-    def init_from_ckpt(self, path, ignore_keys=list()):
-        sd = torch.load(path, map_location="cpu")
-        try:
-            self._cur_epoch = sd["epoch"]
-            sd = sd["state_dict"]
-        except:
-            self._cur_epoch = "null"
-        keys = list(sd.keys())
-        for k in keys:
-            for ik in ignore_keys:
-                if k.startswith(ik):
-                    print("Deleting key {} from state_dict.".format(k))
-                    del sd[k]
-        self.load_state_dict(sd, strict=False)
-        # self.load_state_dict(sd, strict=True)
-        print(f"Restored from {path}")
-
-    def encode(self, x, **kwargs):
-
-        h = self.encoder(x)
-        moments = self.quant_conv(h)
-        posterior = DiagonalGaussianDistribution(moments)
-        return posterior
-
-    def decode(self, z, **kwargs):
-        z = self.post_quant_conv(z)
-        dec = self.decoder(z)
-        return dec
-
-    def forward(self, input, sample_posterior=True):
-        posterior = self.encode(input)
-        if sample_posterior:
-            z = posterior.sample()
-        else:
-            z = posterior.mode()
-        dec = self.decode(z)
-        return dec, posterior
-
-    def get_input(self, batch, k):
-        x = batch[k]
-        if x.dim() == 5 and self.input_dim == 4:
-            b, c, t, h, w = x.shape
-            self.b = b
-            self.t = t
-            x = rearrange(x, "b c t h w -> (b t) c h w")
-
-        return x
-
-    def training_step(self, batch, batch_idx, optimizer_idx):
-        inputs = self.get_input(batch, self.image_key)
-        reconstructions, posterior = self(inputs)
-
-        if optimizer_idx == 0:
-            # train encoder+decoder+logvar
-            aeloss, log_dict_ae = self.loss(
-                inputs,
-                reconstructions,
-                posterior,
-                optimizer_idx,
-                self.global_step,
-                last_layer=self.get_last_layer(),
-                split="train",
-            )
-            self.log(
-                "aeloss",
-                aeloss,
-                prog_bar=True,
-                logger=True,
-                on_step=True,
-                on_epoch=True,
-            )
-            self.log_dict(
-                log_dict_ae, prog_bar=False, logger=True, on_step=True, on_epoch=False
-            )
-            return aeloss
-
-        if optimizer_idx == 1:
-            # train the discriminator
-            discloss, log_dict_disc = self.loss(
-                inputs,
-                reconstructions,
-                posterior,
-                optimizer_idx,
-                self.global_step,
-                last_layer=self.get_last_layer(),
-                split="train",
-            )
-
-            self.log(
-                "discloss",
-                discloss,
-                prog_bar=True,
-                logger=True,
-                on_step=True,
-                on_epoch=True,
-            )
-            self.log_dict(
-                log_dict_disc, prog_bar=False, logger=True, on_step=True, on_epoch=False
-            )
-            return discloss
-
-    def validation_step(self, batch, batch_idx):
-        inputs = self.get_input(batch, self.image_key)
-        reconstructions, posterior = self(inputs)
-        aeloss, log_dict_ae = self.loss(
-            inputs,
-            reconstructions,
-            posterior,
-            0,
-            self.global_step,
-            last_layer=self.get_last_layer(),
-            split="val",
-        )
-
-        discloss, log_dict_disc = self.loss(
-            inputs,
-            reconstructions,
-            posterior,
-            1,
-            self.global_step,
-            last_layer=self.get_last_layer(),
-            split="val",
-        )
-
-        self.log("val/rec_loss", log_dict_ae["val/rec_loss"])
-        self.log_dict(log_dict_ae)
-        self.log_dict(log_dict_disc)
-        return self.log_dict
-
-    def configure_optimizers(self):
-        lr = self.learning_rate
-        opt_ae = torch.optim.Adam(
-            list(self.encoder.parameters())
-            + list(self.decoder.parameters())
-            + list(self.quant_conv.parameters())
-            + list(self.post_quant_conv.parameters()),
-            lr=lr,
-            betas=(0.5, 0.9),
-        )
-        opt_disc = torch.optim.Adam(
-            self.loss.discriminator.parameters(), lr=lr, betas=(0.5, 0.9)
-        )
-        return [opt_ae, opt_disc], []
-
-    def get_last_layer(self):
-        return self.decoder.conv_out.weight
-
-    @torch.no_grad()
-    def log_images(self, batch, only_inputs=False, **kwargs):
-        log = dict()
-        x = self.get_input(batch, self.image_key)
-        x = x.to(self.device)
-        if not only_inputs:
-            xrec, posterior = self(x)
-            if x.shape[1] > 3:
-                # colorize with random projection
-                assert xrec.shape[1] > 3
-                x = self.to_rgb(x)
-                xrec = self.to_rgb(xrec)
-            log["samples"] = self.decode(torch.randn_like(posterior.sample()))
-            log["reconstructions"] = xrec
-        log["inputs"] = x
-        return log
-
-    def to_rgb(self, x):
-        assert self.image_key == "segmentation"
-        if not hasattr(self, "colorize"):
-            self.register_buffer("colorize", torch.randn(3, x.shape[1], 1, 1).to(x))
-        x = F.conv2d(x, weight=self.colorize)
-        x = 2.0 * (x - x.min()) / (x.max() - x.min()) - 1.0
-        return x
-
-
-class IdentityFirstStage(torch.nn.Module):
-    def __init__(self, *args, vq_interface=False, **kwargs):
-        self.vq_interface = vq_interface  # TODO: Should be true by default but check to not break older stuff
-        super().__init__()
-
-    def encode(self, x, *args, **kwargs):
-        return x
-
-    def decode(self, x, *args, **kwargs):
-        return x
-
-    def quantize(self, x, *args, **kwargs):
-        if self.vq_interface:
-            return x, None, [None, None, None]
-        return x
-
-    def forward(self, x, *args, **kwargs):
-        return x
diff --git a/videotuna/models/lvdm/modules/x_transformer.py b/videotuna/models/lvdm/modules/x_transformer.py
deleted file mode 100644
index e0986b62..00000000
--- a/videotuna/models/lvdm/modules/x_transformer.py
+++ /dev/null
@@ -1,705 +0,0 @@
-"""shout-out to https://github.com/lucidrains/x-transformers/tree/main/x_transformers"""
-
-from collections import namedtuple
-from functools import partial
-from inspect import isfunction
-
-import torch
-import torch.nn.functional as F
-from einops import rearrange, repeat
-from torch import einsum, nn
-
-# constants
-DEFAULT_DIM_HEAD = 64
-
-Intermediates = namedtuple("Intermediates", ["pre_softmax_attn", "post_softmax_attn"])
-
-LayerIntermediates = namedtuple("Intermediates", ["hiddens", "attn_intermediates"])
-
-
-class AbsolutePositionalEmbedding(nn.Module):
-    def __init__(self, dim, max_seq_len):
-        super().__init__()
-        self.emb = nn.Embedding(max_seq_len, dim)
-        self.init_()
-
-    def init_(self):
-        nn.init.normal_(self.emb.weight, std=0.02)
-
-    def forward(self, x):
-        n = torch.arange(x.shape[1], device=x.device)
-        return self.emb(n)[None, :, :]
-
-
-class FixedPositionalEmbedding(nn.Module):
-    def __init__(self, dim):
-        super().__init__()
-        inv_freq = 1.0 / (10000 ** (torch.arange(0, dim, 2).float() / dim))
-        self.register_buffer("inv_freq", inv_freq)
-
-    def forward(self, x, seq_dim=1, offset=0):
-        t = (
-            torch.arange(x.shape[seq_dim], device=x.device).type_as(self.inv_freq)
-            + offset
-        )
-        sinusoid_inp = torch.einsum("i , j -> i j", t, self.inv_freq)
-        emb = torch.cat((sinusoid_inp.sin(), sinusoid_inp.cos()), dim=-1)
-        return emb[None, :, :]
-
-
-# helpers
-
-
-def exists(val):
-    return val is not None
-
-
-def default(val, d):
-    if exists(val):
-        return val
-    return d() if isfunction(d) else d
-
-
-def always(val):
-    def inner(*args, **kwargs):
-        return val
-
-    return inner
-
-
-def not_equals(val):
-    def inner(x):
-        return x != val
-
-    return inner
-
-
-def equals(val):
-    def inner(x):
-        return x == val
-
-    return inner
-
-
-def max_neg_value(tensor):
-    return -torch.finfo(tensor.dtype).max
-
-
-# keyword argument helpers
-
-
-def pick_and_pop(keys, d):
-    values = list(map(lambda key: d.pop(key), keys))
-    return dict(zip(keys, values))
-
-
-def group_dict_by_key(cond, d):
-    return_val = [dict(), dict()]
-    for key in d.keys():
-        match = bool(cond(key))
-        ind = int(not match)
-        return_val[ind][key] = d[key]
-    return (*return_val,)
-
-
-def string_begins_with(prefix, str):
-    return str.startswith(prefix)
-
-
-def group_by_key_prefix(prefix, d):
-    return group_dict_by_key(partial(string_begins_with, prefix), d)
-
-
-def groupby_prefix_and_trim(prefix, d):
-    kwargs_with_prefix, kwargs = group_dict_by_key(
-        partial(string_begins_with, prefix), d
-    )
-    kwargs_without_prefix = dict(
-        map(lambda x: (x[0][len(prefix) :], x[1]), tuple(kwargs_with_prefix.items()))
-    )
-    return kwargs_without_prefix, kwargs
-
-
-# classes
-class Scale(nn.Module):
-    def __init__(self, value, fn):
-        super().__init__()
-        self.value = value
-        self.fn = fn
-
-    def forward(self, x, **kwargs):
-        x, *rest = self.fn(x, **kwargs)
-        return (x * self.value, *rest)
-
-
-class Rezero(nn.Module):
-    def __init__(self, fn):
-        super().__init__()
-        self.fn = fn
-        self.g = nn.Parameter(torch.zeros(1))
-
-    def forward(self, x, **kwargs):
-        x, *rest = self.fn(x, **kwargs)
-        return (x * self.g, *rest)
-
-
-class ScaleNorm(nn.Module):
-    def __init__(self, dim, eps=1e-5):
-        super().__init__()
-        self.scale = dim**-0.5
-        self.eps = eps
-        self.g = nn.Parameter(torch.ones(1))
-
-    def forward(self, x):
-        norm = torch.norm(x, dim=-1, keepdim=True) * self.scale
-        return x / norm.clamp(min=self.eps) * self.g
-
-
-class RMSNorm(nn.Module):
-    def __init__(self, dim, eps=1e-8):
-        super().__init__()
-        self.scale = dim**-0.5
-        self.eps = eps
-        self.g = nn.Parameter(torch.ones(dim))
-
-    def forward(self, x):
-        norm = torch.norm(x, dim=-1, keepdim=True) * self.scale
-        return x / norm.clamp(min=self.eps) * self.g
-
-
-class Residual(nn.Module):
-    def forward(self, x, residual):
-        return x + residual
-
-
-class GRUGating(nn.Module):
-    def __init__(self, dim):
-        super().__init__()
-        self.gru = nn.GRUCell(dim, dim)
-
-    def forward(self, x, residual):
-        gated_output = self.gru(
-            rearrange(x, "b n d -> (b n) d"), rearrange(residual, "b n d -> (b n) d")
-        )
-
-        return gated_output.reshape_as(x)
-
-
-# feedforward
-
-
-class GEGLU(nn.Module):
-    def __init__(self, dim_in, dim_out):
-        super().__init__()
-        self.proj = nn.Linear(dim_in, dim_out * 2)
-
-    def forward(self, x):
-        x, gate = self.proj(x).chunk(2, dim=-1)
-        return x * F.gelu(gate)
-
-
-class FeedForward(nn.Module):
-    def __init__(self, dim, dim_out=None, mult=4, glu=False, dropout=0.0):
-        super().__init__()
-        inner_dim = int(dim * mult)
-        dim_out = default(dim_out, dim)
-        project_in = (
-            nn.Sequential(nn.Linear(dim, inner_dim), nn.GELU())
-            if not glu
-            else GEGLU(dim, inner_dim)
-        )
-
-        self.net = nn.Sequential(
-            project_in, nn.Dropout(dropout), nn.Linear(inner_dim, dim_out)
-        )
-
-    def forward(self, x):
-        return self.net(x)
-
-
-# attention.
-class Attention(nn.Module):
-    def __init__(
-        self,
-        dim,
-        dim_head=DEFAULT_DIM_HEAD,
-        heads=8,
-        causal=False,
-        mask=None,
-        talking_heads=False,
-        sparse_topk=None,
-        use_entmax15=False,
-        num_mem_kv=0,
-        dropout=0.0,
-        on_attn=False,
-    ):
-        super().__init__()
-        if use_entmax15:
-            raise NotImplementedError(
-                "Check out entmax activation instead of softmax activation!"
-            )
-        self.scale = dim_head**-0.5
-        self.heads = heads
-        self.causal = causal
-        self.mask = mask
-
-        inner_dim = dim_head * heads
-
-        self.to_q = nn.Linear(dim, inner_dim, bias=False)
-        self.to_k = nn.Linear(dim, inner_dim, bias=False)
-        self.to_v = nn.Linear(dim, inner_dim, bias=False)
-        self.dropout = nn.Dropout(dropout)
-
-        # talking heads
-        self.talking_heads = talking_heads
-        if talking_heads:
-            self.pre_softmax_proj = nn.Parameter(torch.randn(heads, heads))
-            self.post_softmax_proj = nn.Parameter(torch.randn(heads, heads))
-
-        # explicit topk sparse attention
-        self.sparse_topk = sparse_topk
-
-        # entmax
-        # self.attn_fn = entmax15 if use_entmax15 else F.softmax
-        self.attn_fn = F.softmax
-
-        # add memory key / values
-        self.num_mem_kv = num_mem_kv
-        if num_mem_kv > 0:
-            self.mem_k = nn.Parameter(torch.randn(heads, num_mem_kv, dim_head))
-            self.mem_v = nn.Parameter(torch.randn(heads, num_mem_kv, dim_head))
-
-        # attention on attention
-        self.attn_on_attn = on_attn
-        self.to_out = (
-            nn.Sequential(nn.Linear(inner_dim, dim * 2), nn.GLU())
-            if on_attn
-            else nn.Linear(inner_dim, dim)
-        )
-
-    def forward(
-        self,
-        x,
-        context=None,
-        mask=None,
-        context_mask=None,
-        rel_pos=None,
-        sinusoidal_emb=None,
-        prev_attn=None,
-        mem=None,
-    ):
-        b, n, _, h, talking_heads, device = (
-            *x.shape,
-            self.heads,
-            self.talking_heads,
-            x.device,
-        )
-        kv_input = default(context, x)
-
-        q_input = x
-        k_input = kv_input
-        v_input = kv_input
-
-        if exists(mem):
-            k_input = torch.cat((mem, k_input), dim=-2)
-            v_input = torch.cat((mem, v_input), dim=-2)
-
-        if exists(sinusoidal_emb):
-            # in shortformer, the query would start at a position offset depending on the past cached memory
-            offset = k_input.shape[-2] - q_input.shape[-2]
-            q_input = q_input + sinusoidal_emb(q_input, offset=offset)
-            k_input = k_input + sinusoidal_emb(k_input)
-
-        q = self.to_q(q_input)
-        k = self.to_k(k_input)
-        v = self.to_v(v_input)
-
-        q, k, v = map(lambda t: rearrange(t, "b n (h d) -> b h n d", h=h), (q, k, v))
-
-        input_mask = None
-        if any(map(exists, (mask, context_mask))):
-            q_mask = default(mask, lambda: torch.ones((b, n), device=device).bool())
-            k_mask = q_mask if not exists(context) else context_mask
-            k_mask = default(
-                k_mask, lambda: torch.ones((b, k.shape[-2]), device=device).bool()
-            )
-            q_mask = rearrange(q_mask, "b i -> b () i ()")
-            k_mask = rearrange(k_mask, "b j -> b () () j")
-            input_mask = q_mask * k_mask
-
-        if self.num_mem_kv > 0:
-            mem_k, mem_v = map(
-                lambda t: repeat(t, "h n d -> b h n d", b=b), (self.mem_k, self.mem_v)
-            )
-            k = torch.cat((mem_k, k), dim=-2)
-            v = torch.cat((mem_v, v), dim=-2)
-            if exists(input_mask):
-                input_mask = F.pad(input_mask, (self.num_mem_kv, 0), value=True)
-
-        dots = einsum("b h i d, b h j d -> b h i j", q, k) * self.scale
-        mask_value = max_neg_value(dots)
-
-        if exists(prev_attn):
-            dots = dots + prev_attn
-
-        pre_softmax_attn = dots
-
-        if talking_heads:
-            dots = einsum(
-                "b h i j, h k -> b k i j", dots, self.pre_softmax_proj
-            ).contiguous()
-
-        if exists(rel_pos):
-            dots = rel_pos(dots)
-
-        if exists(input_mask):
-            dots.masked_fill_(~input_mask, mask_value)
-            del input_mask
-
-        if self.causal:
-            i, j = dots.shape[-2:]
-            r = torch.arange(i, device=device)
-            mask = rearrange(r, "i -> () () i ()") < rearrange(r, "j -> () () () j")
-            mask = F.pad(mask, (j - i, 0), value=False)
-            dots.masked_fill_(mask, mask_value)
-            del mask
-
-        if exists(self.sparse_topk) and self.sparse_topk < dots.shape[-1]:
-            top, _ = dots.topk(self.sparse_topk, dim=-1)
-            vk = top[..., -1].unsqueeze(-1).expand_as(dots)
-            mask = dots < vk
-            dots.masked_fill_(mask, mask_value)
-            del mask
-
-        attn = self.attn_fn(dots, dim=-1)
-        post_softmax_attn = attn
-
-        attn = self.dropout(attn)
-
-        if talking_heads:
-            attn = einsum(
-                "b h i j, h k -> b k i j", attn, self.post_softmax_proj
-            ).contiguous()
-
-        out = einsum("b h i j, b h j d -> b h i d", attn, v)
-        out = rearrange(out, "b h n d -> b n (h d)")
-
-        intermediates = Intermediates(
-            pre_softmax_attn=pre_softmax_attn, post_softmax_attn=post_softmax_attn
-        )
-
-        return self.to_out(out), intermediates
-
-
-class AttentionLayers(nn.Module):
-    def __init__(
-        self,
-        dim,
-        depth,
-        heads=8,
-        causal=False,
-        cross_attend=False,
-        only_cross=False,
-        use_scalenorm=False,
-        use_rmsnorm=False,
-        use_rezero=False,
-        rel_pos_num_buckets=32,
-        rel_pos_max_distance=128,
-        position_infused_attn=False,
-        custom_layers=None,
-        sandwich_coef=None,
-        par_ratio=None,
-        residual_attn=False,
-        cross_residual_attn=False,
-        macaron=False,
-        pre_norm=True,
-        gate_residual=False,
-        **kwargs,
-    ):
-        super().__init__()
-        ff_kwargs, kwargs = groupby_prefix_and_trim("ff_", kwargs)
-        attn_kwargs, _ = groupby_prefix_and_trim("attn_", kwargs)
-
-        dim_head = attn_kwargs.get("dim_head", DEFAULT_DIM_HEAD)
-
-        self.dim = dim
-        self.depth = depth
-        self.layers = nn.ModuleList([])
-
-        self.has_pos_emb = position_infused_attn
-        self.pia_pos_emb = (
-            FixedPositionalEmbedding(dim) if position_infused_attn else None
-        )
-        self.rotary_pos_emb = always(None)
-
-        assert (
-            rel_pos_num_buckets <= rel_pos_max_distance
-        ), "number of relative position buckets must be less than the relative position max distance"
-        self.rel_pos = None
-
-        self.pre_norm = pre_norm
-
-        self.residual_attn = residual_attn
-        self.cross_residual_attn = cross_residual_attn
-
-        norm_class = ScaleNorm if use_scalenorm else nn.LayerNorm
-        norm_class = RMSNorm if use_rmsnorm else norm_class
-        norm_fn = partial(norm_class, dim)
-
-        norm_fn = nn.Identity if use_rezero else norm_fn
-        branch_fn = Rezero if use_rezero else None
-
-        if cross_attend and not only_cross:
-            default_block = ("a", "c", "f")
-        elif cross_attend and only_cross:
-            default_block = ("c", "f")
-        else:
-            default_block = ("a", "f")
-
-        if macaron:
-            default_block = ("f",) + default_block
-
-        if exists(custom_layers):
-            layer_types = custom_layers
-        elif exists(par_ratio):
-            par_depth = depth * len(default_block)
-            assert 1 < par_ratio <= par_depth, "par ratio out of range"
-            default_block = tuple(filter(not_equals("f"), default_block))
-            par_attn = par_depth // par_ratio
-            depth_cut = (
-                par_depth * 2 // 3
-            )  # 2 / 3 attention layer cutoff suggested by PAR paper
-            par_width = (depth_cut + depth_cut // par_attn) // par_attn
-            assert (
-                len(default_block) <= par_width
-            ), "default block is too large for par_ratio"
-            par_block = default_block + ("f",) * (par_width - len(default_block))
-            par_head = par_block * par_attn
-            layer_types = par_head + ("f",) * (par_depth - len(par_head))
-        elif exists(sandwich_coef):
-            assert (
-                sandwich_coef > 0 and sandwich_coef <= depth
-            ), "sandwich coefficient should be less than the depth"
-            layer_types = (
-                ("a",) * sandwich_coef
-                + default_block * (depth - sandwich_coef)
-                + ("f",) * sandwich_coef
-            )
-        else:
-            layer_types = default_block * depth
-
-        self.layer_types = layer_types
-        self.num_attn_layers = len(list(filter(equals("a"), layer_types)))
-
-        for layer_type in self.layer_types:
-            if layer_type == "a":
-                layer = Attention(dim, heads=heads, causal=causal, **attn_kwargs)
-            elif layer_type == "c":
-                layer = Attention(dim, heads=heads, **attn_kwargs)
-            elif layer_type == "f":
-                layer = FeedForward(dim, **ff_kwargs)
-                layer = layer if not macaron else Scale(0.5, layer)
-            else:
-                raise Exception(f"invalid layer type {layer_type}")
-
-            if isinstance(layer, Attention) and exists(branch_fn):
-                layer = branch_fn(layer)
-
-            if gate_residual:
-                residual_fn = GRUGating(dim)
-            else:
-                residual_fn = Residual()
-
-            self.layers.append(nn.ModuleList([norm_fn(), layer, residual_fn]))
-
-    def forward(
-        self,
-        x,
-        context=None,
-        mask=None,
-        context_mask=None,
-        mems=None,
-        return_hiddens=False,
-    ):
-        hiddens = []
-        intermediates = []
-        prev_attn = None
-        prev_cross_attn = None
-
-        mems = mems.copy() if exists(mems) else [None] * self.num_attn_layers
-
-        for ind, (layer_type, (norm, block, residual_fn)) in enumerate(
-            zip(self.layer_types, self.layers)
-        ):
-            is_last = ind == (len(self.layers) - 1)
-
-            if layer_type == "a":
-                hiddens.append(x)
-                layer_mem = mems.pop(0)
-
-            residual = x
-
-            if self.pre_norm:
-                x = norm(x)
-
-            if layer_type == "a":
-                out, inter = block(
-                    x,
-                    mask=mask,
-                    sinusoidal_emb=self.pia_pos_emb,
-                    rel_pos=self.rel_pos,
-                    prev_attn=prev_attn,
-                    mem=layer_mem,
-                )
-            elif layer_type == "c":
-                out, inter = block(
-                    x,
-                    context=context,
-                    mask=mask,
-                    context_mask=context_mask,
-                    prev_attn=prev_cross_attn,
-                )
-            elif layer_type == "f":
-                out = block(x)
-
-            x = residual_fn(out, residual)
-
-            if layer_type in ("a", "c"):
-                intermediates.append(inter)
-
-            if layer_type == "a" and self.residual_attn:
-                prev_attn = inter.pre_softmax_attn
-            elif layer_type == "c" and self.cross_residual_attn:
-                prev_cross_attn = inter.pre_softmax_attn
-
-            if not self.pre_norm and not is_last:
-                x = norm(x)
-
-        if return_hiddens:
-            intermediates = LayerIntermediates(
-                hiddens=hiddens, attn_intermediates=intermediates
-            )
-
-            return x, intermediates
-
-        return x
-
-
-class Encoder(AttentionLayers):
-    def __init__(self, **kwargs):
-        assert "causal" not in kwargs, "cannot set causality on encoder"
-        super().__init__(causal=False, **kwargs)
-
-
-class TransformerWrapper(nn.Module):
-    def __init__(
-        self,
-        *,
-        num_tokens,
-        max_seq_len,
-        attn_layers,
-        emb_dim=None,
-        max_mem_len=0.0,
-        emb_dropout=0.0,
-        num_memory_tokens=None,
-        tie_embedding=False,
-        use_pos_emb=True,
-    ):
-        super().__init__()
-        assert isinstance(
-            attn_layers, AttentionLayers
-        ), "attention layers must be one of Encoder or Decoder"
-
-        dim = attn_layers.dim
-        emb_dim = default(emb_dim, dim)
-
-        self.max_seq_len = max_seq_len
-        self.max_mem_len = max_mem_len
-        self.num_tokens = num_tokens
-
-        self.token_emb = nn.Embedding(num_tokens, emb_dim)
-        self.pos_emb = (
-            AbsolutePositionalEmbedding(emb_dim, max_seq_len)
-            if (use_pos_emb and not attn_layers.has_pos_emb)
-            else always(0)
-        )
-        self.emb_dropout = nn.Dropout(emb_dropout)
-
-        self.project_emb = nn.Linear(emb_dim, dim) if emb_dim != dim else nn.Identity()
-        self.attn_layers = attn_layers
-        self.norm = nn.LayerNorm(dim)
-
-        self.init_()
-
-        self.to_logits = (
-            nn.Linear(dim, num_tokens)
-            if not tie_embedding
-            else lambda t: t @ self.token_emb.weight.t()
-        )
-
-        # memory tokens (like [cls]) from Memory Transformers paper
-        num_memory_tokens = default(num_memory_tokens, 0)
-        self.num_memory_tokens = num_memory_tokens
-        if num_memory_tokens > 0:
-            self.memory_tokens = nn.Parameter(torch.randn(num_memory_tokens, dim))
-
-            # let funnel encoder know number of memory tokens, if specified
-            if hasattr(attn_layers, "num_memory_tokens"):
-                attn_layers.num_memory_tokens = num_memory_tokens
-
-    def init_(self):
-        nn.init.normal_(self.token_emb.weight, std=0.02)
-
-    def forward(
-        self,
-        x,
-        return_embeddings=False,
-        mask=None,
-        return_mems=False,
-        return_attn=False,
-        mems=None,
-        **kwargs,
-    ):
-        b, n, device, num_mem = *x.shape, x.device, self.num_memory_tokens
-        x = self.token_emb(x)
-        x += self.pos_emb(x)
-        x = self.emb_dropout(x)
-
-        x = self.project_emb(x)
-
-        if num_mem > 0:
-            mem = repeat(self.memory_tokens, "n d -> b n d", b=b)
-            x = torch.cat((mem, x), dim=1)
-
-            # auto-handle masking after appending memory tokens
-            if exists(mask):
-                mask = F.pad(mask, (num_mem, 0), value=True)
-
-        x, intermediates = self.attn_layers(
-            x, mask=mask, mems=mems, return_hiddens=True, **kwargs
-        )
-        x = self.norm(x)
-
-        mem, x = x[:, :num_mem], x[:, num_mem:]
-
-        out = self.to_logits(x) if not return_embeddings else x
-
-        if return_mems:
-            hiddens = intermediates.hiddens
-            new_mems = (
-                list(map(lambda pair: torch.cat(pair, dim=-2), zip(mems, hiddens)))
-                if exists(mems)
-                else hiddens
-            )
-            new_mems = list(
-                map(lambda t: t[..., -self.max_mem_len :, :].detach(), new_mems)
-            )
-            return out, new_mems
-
-        if return_attn:
-            attn_maps = list(
-                map(lambda t: t.post_softmax_attn, intermediates.attn_intermediates)
-            )
-            return out, attn_maps
-
-        return out
diff --git a/videotuna/models/opensora/__init__.py b/videotuna/models/opensora/__init__.py
deleted file mode 100644
index a4df48f0..00000000
--- a/videotuna/models/opensora/__init__.py
+++ /dev/null
@@ -1,5 +0,0 @@
-# from .acceleration import *
-# from .datasets import *
-# from .models import *
-# from .registry import *
-# from .llms import build
diff --git a/videotuna/models/opensora/acceleration/__init__.py b/videotuna/models/opensora/acceleration/__init__.py
deleted file mode 100644
index e69de29b..00000000
diff --git a/videotuna/models/opensora/acceleration/checkpoint.py b/videotuna/models/opensora/acceleration/checkpoint.py
deleted file mode 100644
index c24bebd3..00000000
--- a/videotuna/models/opensora/acceleration/checkpoint.py
+++ /dev/null
@@ -1,26 +0,0 @@
-from collections.abc import Iterable
-
-import torch.nn as nn
-from torch.utils.checkpoint import checkpoint, checkpoint_sequential
-
-
-def set_grad_checkpoint(model, use_fp32_attention=False, gc_step=1):
-    assert isinstance(model, nn.Module)
-
-    def set_attr(module):
-        module.grad_checkpointing = True
-        module.fp32_attention = use_fp32_attention
-        module.grad_checkpointing_step = gc_step
-
-    model.apply(set_attr)
-
-
-def auto_grad_checkpoint(module, *args, **kwargs):
-    if getattr(module, "grad_checkpointing", False):
-        if not isinstance(module, Iterable):
-            return checkpoint(module, *args, use_reentrant=False, **kwargs)
-        gc_step = module[0].grad_checkpointing_step
-        return checkpoint_sequential(
-            module, gc_step, *args, use_reentrant=False, **kwargs
-        )
-    return module(*args, **kwargs)
diff --git a/videotuna/models/opensora/acceleration/communications.py b/videotuna/models/opensora/acceleration/communications.py
deleted file mode 100644
index 654d1534..00000000
--- a/videotuna/models/opensora/acceleration/communications.py
+++ /dev/null
@@ -1,192 +0,0 @@
-import torch
-import torch.distributed as dist
-
-
-# ====================
-# All-To-All
-# ====================
-def _all_to_all(
-    input_: torch.Tensor,
-    world_size: int,
-    group: dist.ProcessGroup,
-    scatter_dim: int,
-    gather_dim: int,
-):
-    input_list = [
-        t.contiguous() for t in torch.tensor_split(input_, world_size, scatter_dim)
-    ]
-    output_list = [torch.empty_like(input_list[0]) for _ in range(world_size)]
-    dist.all_to_all(output_list, input_list, group=group)
-    return torch.cat(output_list, dim=gather_dim).contiguous()
-
-
-class _AllToAll(torch.autograd.Function):
-    """All-to-all communication.
-
-    Args:
-        input_: input matrix
-        process_group: communication group
-        scatter_dim: scatter dimension
-        gather_dim: gather dimension
-    """
-
-    @staticmethod
-    def forward(ctx, input_, process_group, scatter_dim, gather_dim):
-        ctx.process_group = process_group
-        ctx.scatter_dim = scatter_dim
-        ctx.gather_dim = gather_dim
-        ctx.world_size = dist.get_world_size(process_group)
-        output = _all_to_all(
-            input_, ctx.world_size, process_group, scatter_dim, gather_dim
-        )
-        return output
-
-    @staticmethod
-    def backward(ctx, grad_output):
-        grad_output = _all_to_all(
-            grad_output,
-            ctx.world_size,
-            ctx.process_group,
-            ctx.gather_dim,
-            ctx.scatter_dim,
-        )
-        return (
-            grad_output,
-            None,
-            None,
-            None,
-        )
-
-
-def all_to_all(
-    input_: torch.Tensor,
-    process_group: dist.ProcessGroup,
-    scatter_dim: int = 2,
-    gather_dim: int = 1,
-):
-    return _AllToAll.apply(input_, process_group, scatter_dim, gather_dim)
-
-
-def _gather(
-    input_: torch.Tensor,
-    world_size: int,
-    group: dist.ProcessGroup,
-    gather_dim: int,
-):
-    if gather_list is None:
-        gather_list = [torch.empty_like(input_) for _ in range(world_size)]
-    dist.gather(input_, gather_list, group=group, gather_dim=gather_dim)
-    return gather_list
-
-
-# ====================
-# Gather-Split
-# ====================
-
-
-def _split(input_, pg: dist.ProcessGroup, dim=-1):
-    # skip if only one rank involved
-    world_size = dist.get_world_size(pg)
-    rank = dist.get_rank(pg)
-    if world_size == 1:
-        return input_
-
-    # Split along last dimension.
-    dim_size = input_.size(dim)
-    assert dim_size % world_size == 0, (
-        f"The dimension to split ({dim_size}) is not a multiple of world size ({world_size}), "
-        f"cannot split tensor evenly"
-    )
-
-    tensor_list = torch.split(input_, dim_size // world_size, dim=dim)
-    output = tensor_list[rank].contiguous()
-
-    return output
-
-
-def _gather(input_, pg: dist.ProcessGroup, dim=-1):
-    # skip if only one rank involved
-    input_ = input_.contiguous()
-    world_size = dist.get_world_size(pg)
-    dist.get_rank(pg)
-
-    if world_size == 1:
-        return input_
-
-    # all gather
-    tensor_list = [torch.empty_like(input_) for _ in range(world_size)]
-    assert input_.device.type == "cuda"
-    torch.distributed.all_gather(tensor_list, input_, group=pg)
-
-    # concat
-    output = torch.cat(tensor_list, dim=dim).contiguous()
-
-    return output
-
-
-class _GatherForwardSplitBackward(torch.autograd.Function):
-    """Gather the input from model parallel region and concatenate.
-
-    Args:
-        input_: input matrix.
-        process_group: parallel mode.
-        dim: dimension
-    """
-
-    @staticmethod
-    def symbolic(graph, input_):
-        return _gather(input_)
-
-    @staticmethod
-    def forward(ctx, input_, process_group, dim, grad_scale):
-        ctx.mode = process_group
-        ctx.dim = dim
-        ctx.grad_scale = grad_scale
-        return _gather(input_, process_group, dim)
-
-    @staticmethod
-    def backward(ctx, grad_output):
-        if ctx.grad_scale == "up":
-            grad_output = grad_output * dist.get_world_size(ctx.mode)
-        elif ctx.grad_scale == "down":
-            grad_output = grad_output / dist.get_world_size(ctx.mode)
-
-        return _split(grad_output, ctx.mode, ctx.dim), None, None, None
-
-
-class _SplitForwardGatherBackward(torch.autograd.Function):
-    """
-    Split the input and keep only the corresponding chuck to the rank.
-
-    Args:
-        input_: input matrix.
-        process_group: parallel mode.
-        dim: dimension
-    """
-
-    @staticmethod
-    def symbolic(graph, input_):
-        return _split(input_)
-
-    @staticmethod
-    def forward(ctx, input_, process_group, dim, grad_scale):
-        ctx.mode = process_group
-        ctx.dim = dim
-        ctx.grad_scale = grad_scale
-        return _split(input_, process_group, dim)
-
-    @staticmethod
-    def backward(ctx, grad_output):
-        if ctx.grad_scale == "up":
-            grad_output = grad_output * dist.get_world_size(ctx.mode)
-        elif ctx.grad_scale == "down":
-            grad_output = grad_output / dist.get_world_size(ctx.mode)
-        return _gather(grad_output, ctx.mode, ctx.dim), None, None, None
-
-
-def split_forward_gather_backward(input_, process_group, dim, grad_scale=1.0):
-    return _SplitForwardGatherBackward.apply(input_, process_group, dim, grad_scale)
-
-
-def gather_forward_split_backward(input_, process_group, dim, grad_scale=None):
-    return _GatherForwardSplitBackward.apply(input_, process_group, dim, grad_scale)
diff --git a/videotuna/models/opensora/acceleration/parallel_states.py b/videotuna/models/opensora/acceleration/parallel_states.py
deleted file mode 100644
index 3c05cf13..00000000
--- a/videotuna/models/opensora/acceleration/parallel_states.py
+++ /dev/null
@@ -1,19 +0,0 @@
-import torch.distributed as dist
-
-_GLOBAL_PARALLEL_GROUPS = dict()
-
-
-def set_data_parallel_group(group: dist.ProcessGroup):
-    _GLOBAL_PARALLEL_GROUPS["data"] = group
-
-
-def get_data_parallel_group():
-    return _GLOBAL_PARALLEL_GROUPS.get("data", dist.group.WORLD)
-
-
-def set_sequence_parallel_group(group: dist.ProcessGroup):
-    _GLOBAL_PARALLEL_GROUPS["sequence"] = group
-
-
-def get_sequence_parallel_group():
-    return _GLOBAL_PARALLEL_GROUPS.get("sequence", None)
diff --git a/videotuna/models/opensora/acceleration/plugin.py b/videotuna/models/opensora/acceleration/plugin.py
deleted file mode 100644
index 32362649..00000000
--- a/videotuna/models/opensora/acceleration/plugin.py
+++ /dev/null
@@ -1,102 +0,0 @@
-import random
-from typing import Optional
-
-import numpy as np
-import torch
-from colossalai.booster.plugin import LowLevelZeroPlugin
-from colossalai.cluster import ProcessGroupMesh
-from torch.utils.data import DataLoader
-from torch.utils.data.distributed import DistributedSampler
-
-DP_AXIS, SP_AXIS = 0, 1
-
-
-class ZeroSeqParallelPlugin(LowLevelZeroPlugin):
-    def __init__(
-        self,
-        sp_size: int = 1,
-        stage: int = 2,
-        precision: str = "fp16",
-        initial_scale: float = 2**32,
-        min_scale: float = 1,
-        growth_factor: float = 2,
-        backoff_factor: float = 0.5,
-        growth_interval: int = 1000,
-        hysteresis: int = 2,
-        max_scale: float = 2**32,
-        max_norm: float = 0.0,
-        norm_type: float = 2.0,
-        reduce_bucket_size_in_m: int = 12,
-        communication_dtype: Optional[torch.dtype] = None,
-        overlap_communication: bool = True,
-        cpu_offload: bool = False,
-        master_weights: bool = True,
-        verbose: bool = False,
-    ) -> None:
-        super().__init__(
-            stage=stage,
-            precision=precision,
-            initial_scale=initial_scale,
-            min_scale=min_scale,
-            growth_factor=growth_factor,
-            backoff_factor=backoff_factor,
-            growth_interval=growth_interval,
-            hysteresis=hysteresis,
-            max_scale=max_scale,
-            max_norm=max_norm,
-            norm_type=norm_type,
-            reduce_bucket_size_in_m=reduce_bucket_size_in_m,
-            communication_dtype=communication_dtype,
-            overlap_communication=overlap_communication,
-            cpu_offload=cpu_offload,
-            master_weights=master_weights,
-            verbose=verbose,
-        )
-        self.sp_size = sp_size
-        assert self.world_size % sp_size == 0, "world_size must be divisible by sp_size"
-        self.dp_size = self.world_size // sp_size
-        self.pg_mesh = ProcessGroupMesh(self.dp_size, self.sp_size)
-        self.dp_group = self.pg_mesh.get_group_along_axis(DP_AXIS)
-        self.sp_group = self.pg_mesh.get_group_along_axis(SP_AXIS)
-        self.dp_rank = self.pg_mesh.coordinate(DP_AXIS)
-        self.sp_rank = self.pg_mesh.coordinate(SP_AXIS)
-
-    def __del__(self):
-        """Destroy the prcess groups in ProcessGroupMesh"""
-        self.pg_mesh.destroy_mesh_process_groups()
-
-    def prepare_dataloader(
-        self,
-        dataset,
-        batch_size,
-        shuffle=False,
-        seed=1024,
-        drop_last=False,
-        pin_memory=False,
-        num_workers=0,
-        distributed_sampler_cls=None,
-        **kwargs,
-    ):
-        _kwargs = kwargs.copy()
-        distributed_sampler_cls = distributed_sampler_cls or DistributedSampler
-        sampler = distributed_sampler_cls(
-            dataset, num_replicas=self.dp_size, rank=self.dp_rank, shuffle=shuffle
-        )
-
-        # Deterministic dataloader
-        def seed_worker(worker_id):
-            worker_seed = seed
-            np.random.seed(worker_seed)
-            torch.manual_seed(worker_seed)
-            random.seed(worker_seed)
-
-        return DataLoader(
-            dataset,
-            batch_size=batch_size,
-            sampler=sampler,
-            worker_init_fn=seed_worker,
-            drop_last=drop_last,
-            pin_memory=pin_memory,
-            num_workers=num_workers,
-            **_kwargs,
-        )
diff --git a/videotuna/models/opensora/acceleration/shardformer/modeling/__init__.py b/videotuna/models/opensora/acceleration/shardformer/modeling/__init__.py
deleted file mode 100644
index e69de29b..00000000
diff --git a/videotuna/models/opensora/acceleration/shardformer/modeling/t5.py b/videotuna/models/opensora/acceleration/shardformer/modeling/t5.py
deleted file mode 100644
index 9cfb8084..00000000
--- a/videotuna/models/opensora/acceleration/shardformer/modeling/t5.py
+++ /dev/null
@@ -1,39 +0,0 @@
-import torch
-import torch.nn as nn
-
-
-class T5LayerNorm(nn.Module):
-    def __init__(self, hidden_size, eps=1e-6):
-        """
-        Construct a layernorm module in the T5 style. No bias and no subtraction of mean.
-        """
-        super().__init__()
-        self.weight = nn.Parameter(torch.ones(hidden_size))
-        self.variance_epsilon = eps
-
-    def forward(self, hidden_states):
-        # T5 uses a layer_norm which only scales and doesn't shift, which is also known as Root Mean
-        # Square Layer Normalization https://arxiv.org/abs/1910.07467 thus varience is calculated
-        # w/o mean and there is no bias. Additionally we want to make sure that the accumulation for
-        # half-precision inputs is done in fp32
-
-        variance = hidden_states.to(torch.float32).pow(2).mean(-1, keepdim=True)
-        hidden_states = hidden_states * torch.rsqrt(variance + self.variance_epsilon)
-
-        # convert into half-precision if necessary
-        if self.weight.dtype in [torch.float16, torch.bfloat16]:
-            hidden_states = hidden_states.to(self.weight.dtype)
-
-        return self.weight * hidden_states
-
-    @staticmethod
-    def from_native_module(module, *args, **kwargs):
-        assert module.__class__.__name__ == "FusedRMSNorm", (
-            "Recovering T5LayerNorm requires the original layer to be apex's Fused RMS Norm."
-            "Apex's fused norm is automatically used by Hugging Face Transformers https://github.com/huggingface/transformers/blob/main/src/transformers/models/t5/modeling_t5.py#L265C5-L265C48"
-        )
-
-        layer_norm = T5LayerNorm(module.normalized_shape, eps=module.eps)
-        layer_norm.weight.data.copy_(module.weight.data)
-        layer_norm = layer_norm.to(module.weight.device)
-        return layer_norm
diff --git a/videotuna/models/opensora/acceleration/shardformer/policy/__init__.py b/videotuna/models/opensora/acceleration/shardformer/policy/__init__.py
deleted file mode 100644
index e69de29b..00000000
diff --git a/videotuna/models/opensora/acceleration/shardformer/policy/t5_encoder.py b/videotuna/models/opensora/acceleration/shardformer/policy/t5_encoder.py
deleted file mode 100644
index 1093fa5f..00000000
--- a/videotuna/models/opensora/acceleration/shardformer/policy/t5_encoder.py
+++ /dev/null
@@ -1,81 +0,0 @@
-from colossalai.shardformer.modeling.jit import get_jit_fused_dropout_add_func
-from colossalai.shardformer.modeling.t5 import (
-    get_jit_fused_T5_layer_ff_forward,
-    get_T5_layer_self_attention_forward,
-)
-from colossalai.shardformer.policies.base_policy import (
-    Policy,
-    SubModuleReplacementDescription,
-)
-
-
-class T5EncoderPolicy(Policy):
-    def config_sanity_check(self):
-        assert not self.shard_config.enable_tensor_parallelism
-        assert not self.shard_config.enable_flash_attention
-
-    def preprocess(self):
-        return self.model
-
-    def module_policy(self):
-        from transformers.models.t5.modeling_t5 import (
-            T5LayerFF,
-            T5LayerSelfAttention,
-            T5Stack,
-        )
-
-        policy = {}
-
-        # check whether apex is installed
-        try:
-            from opensora.acceleration.shardformer.modeling.t5 import T5LayerNorm
-
-            # recover hf from fused rms norm to T5 norm which is faster
-            self.append_or_create_submodule_replacement(
-                description=SubModuleReplacementDescription(
-                    suffix="layer_norm",
-                    target_module=T5LayerNorm,
-                ),
-                policy=policy,
-                target_key=T5LayerFF,
-            )
-            self.append_or_create_submodule_replacement(
-                description=SubModuleReplacementDescription(
-                    suffix="layer_norm", target_module=T5LayerNorm
-                ),
-                policy=policy,
-                target_key=T5LayerSelfAttention,
-            )
-            self.append_or_create_submodule_replacement(
-                description=SubModuleReplacementDescription(
-                    suffix="final_layer_norm", target_module=T5LayerNorm
-                ),
-                policy=policy,
-                target_key=T5Stack,
-            )
-        except (ImportError, ModuleNotFoundError):
-            pass
-
-        # use jit operator
-        if self.shard_config.enable_jit_fused:
-            self.append_or_create_method_replacement(
-                description={
-                    "forward": get_jit_fused_T5_layer_ff_forward(),
-                    "dropout_add": get_jit_fused_dropout_add_func(),
-                },
-                policy=policy,
-                target_key=T5LayerFF,
-            )
-            self.append_or_create_method_replacement(
-                description={
-                    "forward": get_T5_layer_self_attention_forward(),
-                    "dropout_add": get_jit_fused_dropout_add_func(),
-                },
-                policy=policy,
-                target_key=T5LayerSelfAttention,
-            )
-
-        return policy
-
-    def postprocess(self):
-        return self.model
diff --git a/videotuna/models/opensora/models/__init__.py b/videotuna/models/opensora/models/__init__.py
deleted file mode 100644
index ed0654e2..00000000
--- a/videotuna/models/opensora/models/__init__.py
+++ /dev/null
@@ -1,6 +0,0 @@
-# from .dit import *
-# from .latte import *
-# from .pixart import *
-# from .stdit import *
-# from .text_encoder import *
-# from .vae import *
diff --git a/videotuna/models/opensora/models/iddpm3d.py b/videotuna/models/opensora/models/iddpm3d.py
deleted file mode 100644
index 79da672e..00000000
--- a/videotuna/models/opensora/models/iddpm3d.py
+++ /dev/null
@@ -1,1770 +0,0 @@
-import enum
-import logging
-import math
-import os
-import random
-from contextlib import contextmanager
-from functools import partial
-
-import numpy as np
-from einops import rearrange, repeat
-from omegaconf.listconfig import ListConfig
-from tqdm import tqdm
-
-mainlogger = logging.getLogger("mainlogger")
-
-import torch
-import torch.nn as nn
-from pytorch_lightning.utilities import rank_zero_only
-from torch.optim.lr_scheduler import CosineAnnealingLR, LambdaLR
-from torchvision.utils import make_grid
-
-from videotuna.schedulers.ddim import DDIMSampler
-from videotuna.models.lvdm.ddpm3d import DDPMFlow
-from videotuna.schedulers.diffusion_schedulers import DDPMScheduler
-from videotuna.utils.distributions import DiagonalGaussianDistribution, normal_kl
-from videotuna.utils.diffusion_utils import (
-    discretized_gaussian_log_likelihood,
-)
-from videotuna.models.lvdm.modules.utils import (
-    default,
-    disabled_train,
-    exists,
-    extract_into_tensor,
-    noise_like,
-)
-from videotuna.utils.common_utils import instantiate_from_config
-
-
-def mean_flat(tensor: torch.Tensor, mask=None) -> torch.Tensor:
-    """
-    Take the mean over all non-batch dimensions.
-    """
-    if mask is None:
-        return tensor.mean(dim=list(range(1, len(tensor.shape))))
-    else:
-        assert tensor.dim() == 5
-        assert tensor.shape[2] == mask.shape[1]
-        tensor = rearrange(tensor, "b c t h w -> b t (c h w)")
-        denom = mask.sum(dim=1) * tensor.shape[-1]
-        loss = (tensor * mask.unsqueeze(2)).sum(dim=1).sum(dim=1) / denom
-        return loss
-
-
-def _warmup_beta(beta_start, beta_end, num_diffusion_timesteps, warmup_frac):
-    betas = beta_end * np.ones(num_diffusion_timesteps, dtype=np.float64)
-    warmup_time = int(num_diffusion_timesteps * warmup_frac)
-    betas[:warmup_time] = np.linspace(
-        beta_start, beta_end, warmup_time, dtype=np.float64
-    )
-    return betas
-
-
-def get_beta_schedule(
-    beta_schedule: str, *, beta_start, beta_end, num_diffusion_timesteps: int
-) -> np.ndarray:
-    """
-    This is the deprecated API for creating beta schedules.
-    See get_named_beta_schedule() for the new library of schedules.
-    """
-    if beta_schedule == "quad":
-        betas = (
-            np.linspace(
-                beta_start**0.5,
-                beta_end**0.5,
-                num_diffusion_timesteps,
-                dtype=np.float64,
-            )
-            ** 2
-        )
-    elif beta_schedule == "linear":
-        betas = np.linspace(
-            beta_start, beta_end, num_diffusion_timesteps, dtype=np.float64
-        )
-    elif beta_schedule == "warmup10":
-        betas = _warmup_beta(beta_start, beta_end, num_diffusion_timesteps, 0.1)
-    elif beta_schedule == "warmup50":
-        betas = _warmup_beta(beta_start, beta_end, num_diffusion_timesteps, 0.5)
-    elif beta_schedule == "const":
-        betas = beta_end * np.ones(num_diffusion_timesteps, dtype=np.float64)
-    elif beta_schedule == "jsd":  # 1/T, 1/(T-1), 1/(T-2), ..., 1
-        betas = 1.0 / np.linspace(
-            num_diffusion_timesteps, 1, num_diffusion_timesteps, dtype=np.float64
-        )
-    else:
-        raise NotImplementedError(beta_schedule)
-    assert betas.shape == (num_diffusion_timesteps,)
-    return betas
-
-
-def get_named_beta_schedule(
-    schedule_name: str, num_diffusion_timesteps: int
-) -> np.ndarray:
-    """
-    Get a pre-defined beta schedule for the given name.
-    The beta schedule library consists of beta schedules which remain similar
-    in the limit of num_diffusion_timesteps.
-    Beta schedules may be added, but should not be removed or changed once
-    they are committed to maintain backwards compatibility.
-    """
-    if schedule_name == "linear":
-        # Linear schedule from Ho et al, extended to work for any number of
-        # diffusion steps.
-        scale = 1000 / num_diffusion_timesteps
-        return get_beta_schedule(
-            "linear",
-            beta_start=scale * 0.0001,
-            beta_end=scale * 0.02,
-            num_diffusion_timesteps=num_diffusion_timesteps,
-        )
-    elif schedule_name == "squaredcos_cap_v2":
-        return betas_for_alpha_bar(
-            num_diffusion_timesteps,
-            lambda t: math.cos((t + 0.008) / 1.008 * math.pi / 2) ** 2,
-        )
-    else:
-        raise NotImplementedError(f"unknown beta schedule: {schedule_name}")
-
-
-def betas_for_alpha_bar(
-    num_diffusion_timesteps, alpha_bar, max_beta=0.999
-) -> np.ndarray:
-    """
-    Create a beta schedule that discretizes the given alpha_t_bar function,
-    which defines the cumulative product of (1-beta) over time from t = [0,1].
-    :param num_diffusion_timesteps: the number of betas to produce.
-    :param alpha_bar: a lambda that takes an argument t from 0 to 1 and
-                      produces the cumulative product of (1-beta) up to that
-                      part of the diffusion process.
-    :param max_beta: the maximum beta to use; use values lower than 1 to
-                     prevent singularities.
-    """
-    betas = []
-    for i in range(num_diffusion_timesteps):
-        t1 = i / num_diffusion_timesteps
-        t2 = (i + 1) / num_diffusion_timesteps
-        betas.append(min(1 - alpha_bar(t2) / alpha_bar(t1), max_beta))
-    return np.array(betas)
-
-
-class ModelMeanType(enum.Enum):
-    """
-    Which type of output the model predicts.
-    """
-
-    PREVIOUS_X = enum.auto()  # the model predicts x_{t-1}
-    START_X = enum.auto()  # the model predicts x_0
-    EPSILON = enum.auto()  # the model predicts epsilon
-
-
-class ModelVarType(enum.Enum):
-    """
-    What is used as the model's output variance.
-    The LEARNED_RANGE option has been added to allow the model to predict
-    values between FIXED_SMALL and FIXED_LARGE, making its job easier.
-    """
-
-    LEARNED = enum.auto()
-    FIXED_SMALL = enum.auto()
-    FIXED_LARGE = enum.auto()
-    LEARNED_RANGE = enum.auto()
-
-
-class LossType(enum.Enum):
-    MSE = enum.auto()  # use raw MSE loss (and KL when learning variances)
-    RESCALED_MSE = (
-        enum.auto()
-    )  # use raw MSE loss (with RESCALED_KL when learning variances)
-    KL = enum.auto()  # use the variational lower-bound
-    RESCALED_KL = enum.auto()  # like KL, but rescale to estimate the full VLB
-
-    def is_vb(self):
-        return self == LossType.KL or self == LossType.RESCALED_KL
-
-
-class IDDPMScheduler(DDPMScheduler):
-    def __init__(self, *args, **kwargs):
-        super().__init__(*args, **kwargs)
-
-    def register_schedule(self, given_betas, *args, **kwargs):
-        if exists(given_betas):
-            if isinstance(given_betas, list) or isinstance(given_betas, ListConfig):
-                betas = np.array(given_betas, dtype=np.float32)
-            elif isinstance(given_betas, np.ndarray):
-                betas = given_betas
-            else:
-                raise TypeError("given_betas type error")
-        else:
-            raise ValueError("given_betas must be provided")
-
-        betas = np.array(betas, dtype=np.float32)
-
-        timesteps = betas.shape[0]
-        self.num_timesteps = int(timesteps)
-
-        alphas = 1.0 - betas
-        alphas_cumprod = np.cumprod(alphas, axis=0)
-        alphas_cumprod_prev = np.append(1.0, alphas_cumprod[:-1])
-        alphas_cumprod_next = np.append(alphas_cumprod[1:], 0.0)
-        assert (
-            alphas_cumprod.shape[0] == self.num_timesteps
-        ), "alphas have to be defined for each timestep"
-
-        to_torch = partial(torch.tensor, device=self.device)
-
-        self.register_buffer("betas", to_torch(betas, dtype=torch.float32))
-        self.register_buffer("alphas_cumprod", to_torch(alphas_cumprod))
-        self.register_buffer("alphas_cumprod_prev", to_torch(alphas_cumprod_prev))
-        self.register_buffer("alphas_cumprod_next", to_torch(alphas_cumprod_next))
-
-        # calculations for diffusion q(x_t | x_{t-1}) and others
-        self.register_buffer("sqrt_alphas_cumprod", to_torch(np.sqrt(alphas_cumprod)))
-        self.register_buffer(
-            "sqrt_one_minus_alphas_cumprod", to_torch(np.sqrt(1.0 - alphas_cumprod))
-        )
-        self.register_buffer(
-            "log_one_minus_alphas_cumprod", to_torch(np.log(1.0 - alphas_cumprod))
-        )
-        self.register_buffer(
-            "sqrt_recip_alphas_cumprod", to_torch(np.sqrt(1.0 / alphas_cumprod))
-        )
-        self.register_buffer(
-            "sqrt_recipm1_alphas_cumprod", to_torch(np.sqrt(1.0 / alphas_cumprod - 1))
-        )
-
-        # calculations for posterior q(x_{t-1} | x_t, x_0)
-        posterior_variance = (1 - self.v_posterior) * betas * (
-            1.0 - alphas_cumprod_prev
-        ) / (1.0 - alphas_cumprod) + self.v_posterior * betas
-        self.register_buffer("posterior_variance", to_torch(posterior_variance))
-        self.register_buffer(
-            "posterior_log_variance_clipped",
-            (
-                to_torch(
-                    np.log(np.append(posterior_variance[1], posterior_variance[1:]))
-                )
-                if len(self.posterior_variance) > 1
-                else torch.DoubleTensor([])
-            ),
-        )
-        self.register_buffer(
-            "posterior_mean_coef1",
-            to_torch(betas * np.sqrt(alphas_cumprod_prev) / (1.0 - alphas_cumprod)),
-        )
-        self.register_buffer(
-            "posterior_mean_coef2",
-            to_torch(
-                (1.0 - alphas_cumprod_prev) * np.sqrt(alphas) / (1.0 - alphas_cumprod)
-            ),
-        )
-
-    @torch.no_grad()
-    def p_sample(
-        self,
-        x,
-        t,
-        clip_denoised=True,
-        denoised_fn=None,
-        cond_fn=None,
-        model_kwargs=None,
-        mask=None,
-    ):
-        """
-        Sample x_{t-1} from the model at the given timestep.
-        :param x: the current tensor at x_{t-1}.
-        :param t: the value of t, starting at 0 for the first diffusion step.
-        :param clip_denoised: if True, clip the x_start prediction to [-1, 1].
-        :param denoised_fn: if not None, a function which applies to the
-            x_start prediction before it is used to sample.
-        :param cond_fn: if not None, this is a gradient function that acts
-                        similarly to the model.
-        :param model_kwargs: if not None, a dict of extra keyword arguments to
-            pass to the model. This can be used for conditioning.
-        :return: a dict containing the following keys:
-                 - 'sample': a random sample from the model.
-                 - 'pred_xstart': a prediction of x_0.
-        """
-        if mask is not None:
-            if mask.shape[0] != x.shape[0]:
-                mask = mask.repeat(2, 1)  # HACK
-            mask_t = (mask * len(self.betas)).to(torch.int)
-
-            # x0: copy unchanged x values
-            # x_noise: add noise to x values
-            x0 = x.clone()
-            x_noise = x0 * extract_into_tensor(
-                self.sqrt_alphas_cumprod, t, x.shape
-            ) + torch.randn_like(x) * extract_into_tensor(
-                self.sqrt_one_minus_alphas_cumprod, t, x.shape
-            )
-
-            # active noise addition
-            mask_t_equall = (mask_t == t.unsqueeze(1))[:, None, :, None, None]
-            x = torch.where(mask_t_equall, x_noise, x0)
-
-            # create x_mask
-            mask_t_upper = (mask_t > t.unsqueeze(1))[:, None, :, None, None]
-            batch_size = x.shape[0]
-            model_kwargs["x_mask"] = mask_t_upper.reshape(batch_size, -1).to(torch.bool)
-
-        model_mean, _, model_log_variance = self.p_mean_variance(
-            x,
-            t,
-            clip_denoised=clip_denoised,
-            denoised_fn=denoised_fn,
-            model_kwargs=model_kwargs,
-        )
-        noise = torch.randn_like(x)
-        nonzero_mask = (
-            (t != 0).float().view(-1, *([1] * (len(x.shape) - 1)))
-        )  # no noise when t == 0
-        # TODO: check cond_fn
-        # if cond_fn is not None:
-        #     model_mean = self.condition_mean(cond_fn, out, x, t, model_kwargs=model_kwargs)
-        sample = model_mean + nonzero_mask * torch.exp(0.5 * model_log_variance) * noise
-
-        if mask is not None:
-            mask_t_lower = (mask_t < t.unsqueeze(1))[:, None, :, None, None]
-            sample = torch.where(mask_t_lower, x0, sample)
-
-        return sample
-
-    def predict_start_from_noise(self, x_t, t, noise):
-        assert noise.shape == x_t.shape
-        return super().predict_start_from_noise(x_t, t, noise)
-
-    def predict_start_from_prev(self, x_t, t, x_prev):
-        assert x_prev.shape == x_t.shape
-        return (  # (x_prev - coef2 * x_t) / coef1
-            extract_into_tensor(1.0 / self.posterior_mean_coef1, t, x_t.shape) * x_prev
-            - extract_into_tensor(
-                self.posterior_mean_coef2 / self.posterior_mean_coef1, t, x_t.shape
-            )
-            * x_t
-        )
-
-    def p_mean_variance(
-        self, x, t, clip_denoised=True, denoised_fn=None, model_kwargs=None
-    ):
-        """
-        Apply the model to get p(x_{t-1} | x_t), as well as a prediction of
-        the initial x, x_0.
-        :param model: the model, which takes a signal and a batch of timesteps
-                      as input.
-        :param x: the [N x C x ...] tensor at time t.
-        :param t: a 1-D Tensor of timesteps.
-        :param clip_denoised: if True, clip the denoised signal into [-1, 1].
-        :param denoised_fn: if not None, a function which applies to the
-            x_start prediction before it is used to sample. Applies before
-            clip_denoised.
-        :param model_kwargs: if not None, a dict of extra keyword arguments to
-            pass to the model. This can be used for conditioning.
-        :return: a dict with the following keys:
-                 - 'mean': the model mean output.
-                 - 'variance': the model variance output.
-                 - 'log_variance': the log of 'variance'.
-                 - 'pred_xstart': the prediction for x_0.
-        """
-        if model_kwargs is None:
-            model_kwargs = {}
-
-        B, C = x.shape[:2]
-        assert t.shape == (B,)
-        model_output = self.model(x, t, **model_kwargs)
-        if isinstance(model_output, tuple):
-            model_output, extra = model_output
-        else:
-            extra = None
-
-        if self.model_var_type in [ModelVarType.LEARNED, ModelVarType.LEARNED_RANGE]:
-            assert model_output.shape == (B, C * 2, *x.shape[2:])
-            model_output, model_var_values = torch.split(model_output, C, dim=1)
-            min_log = extract_into_tensor(
-                self.posterior_log_variance_clipped, t, x.shape
-            )
-            max_log = extract_into_tensor(torch.log(self.betas), t, x.shape)
-            # The model_var_values is [-1, 1] for [min_var, max_var].
-            frac = (model_var_values + 1) / 2
-            model_log_variance = frac * max_log + (1 - frac) * min_log
-            model_variance = torch.exp(model_log_variance)
-        else:
-            model_variance, model_log_variance = {
-                # for fixedlarge, we set the initial (log-)variance like so
-                # to get a better decoder log likelihood.
-                ModelVarType.FIXED_LARGE: (
-                    torch.cat(self.posterior_variance[1].unsqueeze(0), self.betas[1:]),
-                    torch.log(
-                        torch.cat(
-                            self.posterior_variance[1].unsqueeze(0), self.betas[1:]
-                        )
-                    ),
-                ),
-                ModelVarType.FIXED_SMALL: (
-                    self.posterior_variance,
-                    self.posterior_log_variance_clipped,
-                ),
-            }[self.model_var_type]
-            model_variance = extract_into_tensor(model_variance, t, x.shape)
-            model_log_variance = extract_into_tensor(model_log_variance, t, x.shape)
-
-        def process_xstart(x):
-            if denoised_fn is not None:
-                x = denoised_fn(x)
-            if clip_denoised:
-                return x.clamp(-1, 1)
-            return x
-
-        if self.model_mean_type == ModelMeanType.PREVIOUS_X:
-            model_mean = model_output
-            pred_xstart = process_xstart(
-                self.predict_start_from_prev(x_t=x, t=t, x_prev=model_output)
-            )
-        elif self.model_mean_type == ModelMeanType.START_X:
-            pred_xstart = process_xstart(model_output)
-            model_mean, _, _ = self.q_posterior(x_start=pred_xstart, x_t=x, t=t)
-        elif self.model_mean_type == ModelMeanType.EPSILON:
-            pred_xstart = process_xstart(
-                self.predict_start_from_noise(x_t=x, t=t, eps=model_output)
-            )
-            model_mean, _, _ = self.q_posterior(x_start=pred_xstart, x_t=x, t=t)
-        else:
-            raise NotImplementedError(self.model_mean_type)
-
-        assert (
-            model_mean.shape == model_log_variance.shape == pred_xstart.shape == x.shape
-        )
-
-        return model_mean, model_variance, model_log_variance
-
-
-class OpenSoraScheduler(IDDPMScheduler):
-    def __init__(self, *args, **kwargs):
-        super().__init__(*args, **kwargs)
-
-    def p_mean_variance(
-        self,
-        x,
-        c,
-        t,
-        clip_denoised: bool,
-        return_x0=False,
-        score_corrector=None,
-        corrector_kwargs=None,
-        model_kwargs=None,
-        **kwargs,
-    ):
-        if model_kwargs is None:
-            model_kwargs = {}
-        t_in = t
-        model_output = self.apply_model(x, t_in, c, **kwargs)
-        B, C = x.shape[:2]
-        if isinstance(model_output, tuple):
-            model_output, extra = model_output
-        else:
-            extra = None
-
-        if self.model_var_type in [ModelVarType.LEARNED, ModelVarType.LEARNED_RANGE]:
-            assert model_output.shape == (B, C * 2, *x.shape[2:])
-            model_output, model_var_values = torch.split(model_output, C, dim=1)
-            min_log = extract_into_tensor(
-                self.posterior_log_variance_clipped, t, x.shape
-            )
-            max_log = extract_into_tensor(torch.log(self.betas), t, x.shape)
-            # The model_var_values is [-1, 1] for [min_var, max_var].
-            frac = (model_var_values + 1) / 2
-            model_log_variance = frac * max_log + (1 - frac) * min_log
-            model_variance = torch.exp(model_log_variance)
-        else:
-            model_variance, model_log_variance = {
-                # for fixedlarge, we set the initial (log-)variance like so
-                # to get a better decoder log likelihood.
-                ModelVarType.FIXED_LARGE: (
-                    torch.cat(self.posterior_variance[1].unsqueeze(0), self.betas[1:]),
-                    torch.log(
-                        torch.cat(
-                            self.posterior_variance[1].unsqueeze(0), self.betas[1:]
-                        )
-                    ),
-                ),
-                ModelVarType.FIXED_SMALL: (
-                    self.posterior_variance,
-                    self.posterior_log_variance_clipped,
-                ),
-            }[self.model_var_type]
-            model_variance = extract_into_tensor(model_variance, t, x.shape)
-            model_log_variance = extract_into_tensor(model_log_variance, t, x.shape)
-
-        if self.model_mean_type == ModelMeanType.START_X:
-            x_recon = self.predict_start_from_noise(x, t=t, noise=model_output)
-        else:
-            x_recon = model_output
-        if clip_denoised:
-            x_recon.clamp_(-1.0, 1.0)
-
-        if score_corrector is not None:
-            assert self.parameterization == "eps"
-            model_output = score_corrector.modify_score(
-                self, model_output, x, t, c, **corrector_kwargs
-            )
-
-        model_mean, posterior_variance, posterior_log_variance = self.q_posterior(
-            x_start=x_recon, x_t=x, t=t
-        )
-        # posterior_variance = extract_into_tensor(posterior_variance, t, x.shape)
-        # posterior_log_variance = extract_into_tensor(posterior_log_variance, t, x.shape)
-        if return_x0:
-            return model_mean, model_log_variance, model_log_variance, x_recon
-        else:
-            return model_mean, model_log_variance, model_log_variance
-
-    @torch.no_grad()
-    def p_sample(
-        self,
-        x,
-        c,
-        t,
-        clip_denoised=False,
-        repeat_noise=False,
-        return_x0=False,
-        temperature=1.0,
-        noise_dropout=0.0,
-        score_corrector=None,
-        corrector_kwargs=None,
-        **kwargs,
-    ):
-        b, *_, device = *x.shape, x.device
-        outputs = self.p_mean_variance(
-            model=self.model,
-            x=x,
-            c=c,
-            t=t,
-            clip_denoised=clip_denoised,
-            return_x0=return_x0,
-            score_corrector=score_corrector,
-            corrector_kwargs=corrector_kwargs,
-            **kwargs,
-        )
-        if return_x0:
-            model_mean, _, model_log_variance, x0 = outputs
-        else:
-            model_mean, _, model_log_variance = outputs
-
-        noise = noise_like(x.shape, device, repeat_noise) * temperature
-        if noise_dropout > 0.0:
-            noise = torch.nn.functional.dropout(noise, p=noise_dropout)
-        # no noise when t == 0
-        nonzero_mask = (1 - (t == 0).float()).reshape(b, *((1,) * (len(x.shape) - 1)))
-
-        if return_x0:
-            return (
-                model_mean + nonzero_mask * (0.5 * model_log_variance).exp() * noise,
-                x0,
-            )
-        else:
-            return model_mean + nonzero_mask * (0.5 * model_log_variance).exp() * noise
-
-    def apply_model(self, x_noisy, t, cond, **kwargs):
-        if isinstance(cond, dict):
-            # hybrid case, cond is exptected to be a dict
-            pass
-        else:
-            if not isinstance(cond, list):
-                cond = [cond]
-            key = (
-                "c_concat" if self.model.conditioning_key == "concat" else "c_crossattn"
-            )
-            cond = {key: cond}
-
-        # If model is provided as a keyword argument, use it to train the variance.
-        if "model" not in kwargs:
-            x_recon = self.model(x_noisy, t, **cond, **kwargs)
-        else:
-            x_recon = kwargs["model"](x_noisy, t)
-
-        if isinstance(x_recon, tuple):
-            return x_recon[0]
-        else:
-            return x_recon
-
-
-class IDDPM(DDPMFlow):
-    def __init__(
-        self,
-        model_mean_type=ModelMeanType.EPSILON,
-        model_var_type=ModelVarType.LEARNED_RANGE,
-        loss_type=LossType.MSE,
-        *args,
-        **kwargs,
-    ):
-        super().__init__(*args, **kwargs)
-        self.model_mean_type = model_mean_type
-        self.model_var_type = model_var_type
-        self.loss_type = loss_type
-
-        # Add the model mean and variance types to the scheduler.
-        self.diffusion_scheduler.model_mean_type = model_mean_type
-        self.diffusion_scheduler.model_var_type = model_var_type
-
-    @torch.no_grad()
-    def p_sample_loop(
-        self,
-        shape,
-        return_intermediates=False,
-        noise=None,
-        clip_denoised=True,
-        denoised_fn=None,
-        cond_fn=None,
-        model_kwargs=None,
-        device=None,
-        progress=False,
-        mask=None,
-    ):
-        """
-        Generate samples from the model.
-        :param shape: the shape of the samples, (N, C, H, W).
-        :param return_intermediates: if True, return all intermediate samples.
-        :param noise: if specified, the noise from the encoder to sample.
-                      Should be of the same shape as `shape`.
-        :param clip_denoised: if True, clip x_start predictions to [-1, 1].
-        :param denoised_fn: if not None, a function which applies to the
-            x_start prediction before it is used to sample.
-        :param cond_fn: if not None, this is a gradient function that acts
-                        similarly to the model.
-        :param model_kwargs: if not None, a dict of extra keyword arguments to
-            pass to the model. This can be used for conditioning.
-        :param device: if specified, the device to create the samples on.
-                       If not specified, use a model parameter's device.
-        :param progress: if True, show a tqdm progress bar.
-        :return: a non-differentiable batch of samples.
-        """
-        if device is None:
-            device = next(self.model.parameters()).device
-        assert isinstance(shape, (tuple, list))
-
-        if noise is not None:
-            img = noise
-        else:
-            img = torch.randn(*shape, device=device)
-        indices = list(range(self.diffusion_scheduler.num_timesteps))[::-1]
-        b = shape[0]
-
-        if progress:
-            # Lazy import so that we don't depend on tqdm.
-            from tqdm.auto import tqdm
-
-            indices = tqdm(indices)
-
-        intermediates = [img]
-        for i in indices:
-            t = torch.full((b,), i, device=device)
-            out = self.diffusion_scheduler.p_sample(
-                img,
-                t,
-                clip_denoised=clip_denoised,
-                denoised_fn=denoised_fn,
-                cond_fn=cond_fn,
-                model_kwargs=model_kwargs,
-                mask=mask,
-            )
-            if (
-                i % self.log_every_t == 0
-                or i == self.diffusion_scheduler.num_timesteps - 1
-            ):
-                intermediates.append(img)
-
-        if return_intermediates:
-            return img, intermediates
-        else:
-            return img
-
-    def _vb_terms_bpd(
-        self, x_start, x_t, t, clip_denoised=True, model_kwargs=None, mask=None
-    ):
-        """
-        Get a term for the variational lower-bound.
-        The resulting units are bits (rather than nats, as one might expect).
-        This allows for comparison to other papers.
-        :return: a dict with the following keys:
-                 - 'output': a shape [N] tensor of NLLs or KLs.
-                 - 'pred_xstart': the x_0 predictions.
-        """
-        true_mean, _, true_log_variance_clipped = self.diffusion_scheduler.q_posterior(
-            x_start=x_start, x_t=x_t, t=t
-        )
-        out = self.diffusion_scheduler.p_mean_variance(
-            x_t, t, clip_denoised=clip_denoised, model_kwargs=model_kwargs
-        )
-        kl = normal_kl(
-            true_mean, true_log_variance_clipped, out["mean"], out["log_variance"]
-        )
-        kl = mean_flat(kl, mask=mask) / np.log(2.0)
-
-        decoder_nll = -discretized_gaussian_log_likelihood(
-            x_start, means=out["mean"], log_scales=0.5 * out["log_variance"]
-        )
-        assert decoder_nll.shape == x_start.shape
-        decoder_nll = mean_flat(decoder_nll, mask=mask) / np.log(2.0)
-
-        # At the first timestep return the decoder NLL,
-        # otherwise return KL(q(x_{t-1}|x_t,x_0) || p(x_{t-1}|x_t))
-        output = torch.where((t == 0), decoder_nll, kl)
-        return output
-
-    def p_losses(
-        self,
-        x_start,
-        t,
-        noise=None,
-        model_kwargs=None,
-        mask=None,
-        weights=None,
-    ):
-        """
-        Compute the losses for the model.
-        :param x_start: the [N x C x ...] tensor of inputs.
-        :param t: the timestep tensor.
-        :param noise: if specified, the specific Gaussian noise to try to remove.
-        :param model_kwargs: if not None, a dict of extra keyword arguments to
-            pass to the model. This can be used for conditioning.
-        :param mask: if not None, a mask tensor to apply to the losses.
-        :return: a dict containing the following keys:
-                 - 'loss': the loss value.
-                 - 'denoised': the denoised signal.
-        """
-        if model_kwargs is None:
-            model_kwargs = {}
-        noise = default(noise, lambda: torch.randn_like(x_start))
-        x_t = self.diffusion_scheduler.q_sample(x_start, t, noise=noise)
-        if mask is not None:
-            t0 = torch.zeros_like(t)
-            x_t0 = self.diffusion_scheduler.q_sample(x_start, t0, noise=noise)
-            x_t = torch.where(mask[:, None, :, None, None], x_t, x_t0)
-
-        terms = {}
-        if self.loss_type == LossType.KL or self.loss_type == LossType.RESCALED_KL:
-            assert mask is None, "mask not supported for KL loss"
-            terms["loss"] = self._vb_terms_bpd(
-                x_start=x_start,
-                x_t=x_t,
-                t=t,
-                clip_denoised=False,
-                model_kwargs=model_kwargs,
-            )
-            if self.loss_type == LossType.RESCALED_KL:
-                terms["loss"] *= self.diffusion_scheduler.num_timesteps
-        elif self.loss_type == LossType.MSE or self.loss_type == LossType.RESCALED_MSE:
-            model_output = self.model(x_t, t, **model_kwargs)
-
-            if self.model_var_type in [
-                ModelVarType.LEARNED,
-                ModelVarType.LEARNED_RANGE,
-            ]:
-                B, C = x_t.shape[:2]
-                assert model_output.shape == (B, C * 2, *x_t.shape[2:])
-                model_output, model_var_values = torch.split(model_output, C, dim=1)
-                # Learn the variance using the variational bound, but don't let
-                # it affect our mean prediction.
-                # TODO: refactor this protect mean prediction
-                # frozen_out = torch.cat([model_output.detach(), model_var_values], dim=1)
-                terms["vb"] = self._vb_terms_bpd(
-                    x_start=x_start,
-                    x_t=x_t,
-                    t=t,
-                    clip_denoised=False,
-                    mask=mask,
-                )
-                if self.loss_type == LossType.RESCALED_MSE:
-                    # Divide by 1000 for equivalence with initial implementation.
-                    # Without a factor of 1/1000, the VB term hurts the MSE term.
-                    terms["vb"] *= self.diffusion_scheduler.num_timesteps / 1000.0
-
-            target = {
-                ModelMeanType.PREVIOUS_X: self.diffusion_scheduler.q_posterior(
-                    x_start=x_start, x_t=x_t, t=t
-                )[0],
-                ModelMeanType.START_X: x_start,
-                ModelMeanType.EPSILON: noise,
-            }[self.model_mean_type]
-            assert model_output.shape == target.shape == x_start.shape
-            if weights is None:
-                terms["mse"] = mean_flat((target - model_output) ** 2, mask=mask)
-            else:
-                weight = extract_into_tensor(weights, t, target.shape)
-                terms["mse"] = mean_flat(
-                    weight * (target - model_output) ** 2, mask=mask
-                )
-            if "vb" in terms:
-                terms["loss"] = terms["mse"] + terms["vb"]
-            else:
-                terms["loss"] = terms["mse"]
-        else:
-            raise NotImplementedError(self.loss_type)
-
-        loss = terms["loss"].mean()
-
-        loss_dict = {}
-        log_prefix = "train" if self.training else "val"
-
-        loss_dict.update({f"{log_prefix}/loss": loss})
-        loss_dict.update({f"{log_prefix}/loss_mse": terms["mse"].mean()})
-        if "vb" in terms:
-            loss_dict.update({f"{log_prefix}/loss_vb": terms["vb"].mean()})
-
-        return loss, loss_dict
-
-
-def space_timesteps(num_timesteps, section_counts):
-    """
-    Create a list of timesteps to use from an original diffusion process,
-    given the number of timesteps we want to take from equally-sized portions
-    of the original process.
-    For example, if there's 300 timesteps and the section counts are [10,15,20]
-    then the first 100 timesteps are strided to be 10 timesteps, the second 100
-    are strided to be 15 timesteps, and the final 100 are strided to be 20.
-    If the stride is a string starting with "ddim", then the fixed striding
-    from the DDIM paper is used, and only one section is allowed.
-    :param num_timesteps: the number of diffusion steps in the original
-                          process to divide up.
-    :param section_counts: either a list of numbers, or a string containing
-                           comma-separated numbers, indicating the step count
-                           per section. As a special case, use "ddimN" where N
-                           is a number of steps to use the striding from the
-                           DDIM paper.
-    :return: a set of diffusion steps from the original process to use.
-    """
-    if isinstance(section_counts, str):
-        if section_counts.startswith("ddim"):
-            desired_count = int(section_counts[len("ddim") :])
-            for i in range(1, num_timesteps):
-                if len(range(0, num_timesteps, i)) == desired_count:
-                    return set(range(0, num_timesteps, i))
-            raise ValueError(
-                f"cannot create exactly {num_timesteps} steps with an integer stride"
-            )
-        section_counts = [int(x) for x in section_counts.split(",")]
-    size_per = num_timesteps // len(section_counts)
-    extra = num_timesteps % len(section_counts)
-    start_idx = 0
-    all_steps = []
-    for i, section_count in enumerate(section_counts):
-        size = size_per + (1 if i < extra else 0)
-        if size < section_count:
-            raise ValueError(
-                f"cannot divide section of {size} steps into {section_count}"
-            )
-        if section_count <= 1:
-            frac_stride = 1
-        else:
-            frac_stride = (size - 1) / (section_count - 1)
-        cur_idx = 0.0
-        taken_steps = []
-        for _ in range(section_count):
-            taken_steps.append(start_idx + round(cur_idx))
-            cur_idx += frac_stride
-        all_steps += taken_steps
-        start_idx += size
-    return set(all_steps)
-
-
-class SpacedDiffusion(IDDPM):
-    """
-    A diffusion process which can skip steps in a base diffusion process.
-    :param use_timesteps: a collection (sequence or set) of timesteps from the
-                          original diffusion process to retain.
-    :param kwargs: the kwargs to create the base diffusion process.
-    """
-
-    def __init__(self, use_timesteps, **kwargs):
-        self.use_timesteps = set(use_timesteps)
-        self.timestep_map = []
-        self.original_num_steps = len(kwargs["given_betas"])
-
-        kwargs = self.add_given_betas_to_config(
-            kwargs["given_betas"], kwargs
-        )  # add the 'given_betas' into the 'diffusion_scheduler_config'
-        base_diffusion = IDDPM(**kwargs)  # pylint: disable=missing-kwoa
-
-        last_alpha_cumprod = 1.0
-        new_betas = []
-        for i, alpha_cumprod in enumerate(
-            base_diffusion.diffusion_scheduler.alphas_cumprod
-        ):
-            if i in self.use_timesteps:
-                new_betas.append(1 - alpha_cumprod / last_alpha_cumprod)
-                last_alpha_cumprod = alpha_cumprod
-                self.timestep_map.append(i)
-        kwargs["given_betas"] = torch.FloatTensor(new_betas)
-        kwargs = self.add_given_betas_to_config(new_betas, kwargs)
-        super().__init__(**kwargs)
-        self.map_tensor = torch.tensor(self.timestep_map)  # TODO: get device
-
-    def add_given_betas_to_config(self, given_betas, kwargs):
-        given_betas_list = list(given_betas)
-        given_betas_list = [float(x) for x in given_betas_list]
-        kwargs["diffusion_scheduler_config"]["params"]["given_betas"] = given_betas_list
-        return kwargs
-
-    def p_mean_variance(
-        self, model, *args, **kwargs
-    ):  # pylint: disable=signature-differs
-        return super().p_mean_variance(self._wrap_model(model), *args, **kwargs)
-
-    def training_losses(
-        self, model, *args, **kwargs
-    ):  # pylint: disable=signature-differs
-        return super().training_losses(self._wrap_model(model), *args, **kwargs)
-
-    def condition_mean(self, cond_fn, *args, **kwargs):
-        return super().condition_mean(self._wrap_model(cond_fn), *args, **kwargs)
-
-    def condition_score(self, cond_fn, *args, **kwargs):
-        return super().condition_score(self._wrap_model(cond_fn), *args, **kwargs)
-
-    def _wrap_model(self, model):
-        if isinstance(model, _WrappedModel):
-            return model
-        return _WrappedModel(model, self.map_tensor, self.original_num_steps)
-
-    def _scale_timesteps(self, t):
-        # Scaling is done by the wrapped model.
-        return t
-
-
-class _WrappedModel:
-    def __init__(self, model, map_tensor, original_num_steps):
-        self.model = model
-        self.map_tensor = map_tensor
-        # self.rescale_timesteps = rescale_timesteps
-        self.original_num_steps = original_num_steps
-
-    def __call__(self, x, ts, **kwargs):
-        new_ts = self.map_tensor[ts].to(device=ts.device, dtype=ts.dtype)
-        # if self.rescale_timesteps:
-        #     new_ts = new_ts.float() * (1000.0 / self.original_num_steps)
-        return self.model(x, new_ts, **kwargs)
-
-
-class LatentDiffusion(SpacedDiffusion):
-    """main class"""
-
-    def __init__(
-        self,
-        first_stage_config,
-        cond_stage_config,
-        diffusion_scheduler_config,
-        cond_stage_key="caption",
-        cond_stage_trainable=False,
-        cond_stage_forward=None,
-        conditioning_key=None,
-        uncond_prob=0.2,
-        uncond_type="empty_seq",
-        scale_factor=1.0,
-        scale_by_std=False,
-        fps_condition_type="fs",
-        # Added for LVDM
-        encoder_type="2d",
-        frame_cond=None,
-        only_model=False,
-        use_scale=False,  # dynamic rescaling
-        scale_a=1,
-        scale_b=0.3,
-        mid_step=400,
-        fix_scale_bug=False,
-        interp_mode=False,
-        logdir=None,
-        rand_cond_frame=False,
-        empty_params_only=False,
-        num_sampling_steps=None,  # Added for SpacedDiffusion
-        timestep_respacing=None,  # Added for SpacedDiffusion
-        *args,
-        **kwargs,
-    ):
-        self.scale_by_std = scale_by_std
-        ckpt_path = kwargs.pop("ckpt_path", None)
-        ignore_keys = kwargs.pop("ignore_keys", [])
-        conditioning_key = default(conditioning_key, "crossattn")
-
-        given_betas = get_named_beta_schedule(
-            "linear", diffusion_scheduler_config.params.get("timesteps", 1000)
-        )
-        if num_sampling_steps is not None:
-            assert timestep_respacing is None
-            timestep_respacing = str(num_sampling_steps)
-        if timestep_respacing is None or timestep_respacing == "":
-            timestep_respacing = [
-                diffusion_scheduler_config.params.get("timesteps", 1000)
-            ]
-
-        super().__init__(
-            use_timesteps=space_timesteps(
-                diffusion_scheduler_config.params.get("timesteps", 1000),
-                timestep_respacing,
-            ),
-            given_betas=given_betas,
-            conditioning_key=conditioning_key,
-            diffusion_scheduler_config=diffusion_scheduler_config,
-            *args,
-            **kwargs,
-        )
-
-        # add support for auto gradient checkpointing
-        from videotuna.models.opensora.acceleration.checkpoint import set_grad_checkpoint
-
-        set_grad_checkpoint(self.model)
-
-        self.cond_stage_trainable = cond_stage_trainable
-        self.cond_stage_key = cond_stage_key
-        self.empty_params_only = empty_params_only
-        self.fps_condition_type = fps_condition_type
-
-        # scale factor
-        self.use_scale = use_scale
-        if self.use_scale:
-            self.scale_a = scale_a
-            self.scale_b = scale_b
-            if fix_scale_bug:
-                scale_step = self.num_timesteps - mid_step
-            else:  # bug
-                scale_step = self.num_timesteps
-
-            scale_arr1 = np.linspace(scale_a, scale_b, mid_step)
-            scale_arr2 = np.full(scale_step, scale_b)
-            scale_arr = np.concatenate((scale_arr1, scale_arr2))
-            scale_arr_prev = np.append(scale_a, scale_arr[:-1])
-            to_torch = partial(torch.tensor, dtype=torch.float32)
-            self.register_buffer("scale_arr", to_torch(scale_arr))
-
-        try:
-            self.num_downs = len(first_stage_config.params.ddconfig.ch_mult) - 1
-        except:
-            self.num_downs = 0
-        if not scale_by_std:
-            self.scale_factor = scale_factor
-        else:
-            self.register_buffer("scale_factor", torch.tensor(scale_factor))
-
-        self.instantiate_first_stage(first_stage_config)
-        self.instantiate_cond_stage(cond_stage_config)
-        self.first_stage_config = first_stage_config
-        self.cond_stage_config = cond_stage_config
-        self.clip_denoised = False
-
-        self.cond_stage_forward = cond_stage_forward
-        self.encoder_type = encoder_type
-        assert encoder_type in ["2d", "3d"]
-        self.uncond_prob = uncond_prob
-        self.classifier_free_guidance = True if uncond_prob > 0 else False
-        assert uncond_type in ["zero_embed", "empty_seq"]
-        self.uncond_type = uncond_type
-
-        ## future frame prediction
-        self.frame_cond = frame_cond
-        if self.frame_cond:
-            # frame_len = self.model.diffusion_model.temporal_length
-            frame_len = self.temporal_length
-            cond_mask = torch.zeros(frame_len, dtype=torch.float32)
-            cond_mask[: self.frame_cond] = 1.0
-            ## b,c,t,h,w
-            self.cond_mask = cond_mask[None, None, :, None, None]
-            mainlogger.info(
-                "---training for %d-frame conditoning T2V" % (self.frame_cond)
-            )
-        else:
-            self.cond_mask = None
-
-        self.restarted_from_ckpt = False
-        if ckpt_path is not None:
-            self.init_from_ckpt(ckpt_path, ignore_keys, only_model=only_model)
-            self.restarted_from_ckpt = True
-
-        self.logdir = logdir
-        self.rand_cond_frame = rand_cond_frame
-        self.interp_mode = interp_mode
-
-    def _freeze_model(self):
-        for name, para in self.model.diffusion_model.named_parameters():
-            para.requires_grad = False
-
-    @rank_zero_only
-    @torch.no_grad()
-    def on_train_batch_start(self, batch, batch_idx, dataloader_idx=None):
-        # only for very first batch, reset the self.scale_factor
-        if (
-            self.scale_by_std
-            and self.current_epoch == 0
-            and self.global_step == 0
-            and batch_idx == 0
-            and not self.restarted_from_ckpt
-        ):
-            assert (
-                self.scale_factor == 1.0
-            ), "rather not use custom rescaling and std-rescaling simultaneously"
-            # set rescale weight to 1./std of encodings
-            mainlogger.info("### USING STD-RESCALING ###")
-            x = super().get_input(batch, self.first_stage_key)
-            x = x.to(self.device)
-            encoder_posterior = self.encode_first_stage(x)
-            z = self.get_first_stage_encoding(encoder_posterior).detach()
-            del self.scale_factor
-            self.register_buffer("scale_factor", 1.0 / z.flatten().std())
-            mainlogger.info(f"setting self.scale_factor to {self.scale_factor}")
-            mainlogger.info("### USING STD-RESCALING ###")
-            mainlogger.info(f"std={z.flatten().std()}")
-
-    def register_schedule(
-        self,
-        given_betas=None,
-        beta_schedule="linear",
-        timesteps=1000,
-        linear_start=1e-4,
-        linear_end=2e-2,
-        cosine_s=8e-3,
-    ):
-        super().register_schedule(
-            given_betas, beta_schedule, timesteps, linear_start, linear_end, cosine_s
-        )
-
-    def instantiate_first_stage(self, config):
-        model = instantiate_from_config(config)
-        self.first_stage_model = model.eval()
-        self.first_stage_model.train = disabled_train
-        for param in self.first_stage_model.parameters():
-            param.requires_grad = False
-
-    def instantiate_cond_stage(self, config):
-        if not self.cond_stage_trainable:
-            model = instantiate_from_config(config)
-            self.cond_stage_model = model.eval()
-            self.cond_stage_model.train = disabled_train
-            for param in self.cond_stage_model.parameters():
-                param.requires_grad = False
-        else:
-            model = instantiate_from_config(config)
-            self.cond_stage_model = model
-
-    def get_learned_conditioning(self, c):
-        if self.cond_stage_forward is None:
-            if hasattr(self.cond_stage_model, "encode") and callable(
-                self.cond_stage_model.encode
-            ):
-                c = self.cond_stage_model.encode(c)
-                if isinstance(c, DiagonalGaussianDistribution):
-                    c = c.mode()
-            else:
-                c = self.cond_stage_model(c)
-        else:
-            assert hasattr(self.cond_stage_model, self.cond_stage_forward)
-            c = getattr(self.cond_stage_model, self.cond_stage_forward)(c)
-        return c
-
-    def get_first_stage_encoding(self, encoder_posterior, noise=None):
-        if isinstance(encoder_posterior, DiagonalGaussianDistribution):
-            z = encoder_posterior.sample(noise=noise)
-        elif isinstance(encoder_posterior, torch.Tensor):
-            z = encoder_posterior
-        else:
-            raise NotImplementedError(
-                f"encoder_posterior of type '{type(encoder_posterior)}' not yet implemented"
-            )
-        return self.scale_factor * z
-
-    @torch.no_grad()
-    def encode_first_stage(self, x):
-        if self.encoder_type == "2d" and x.dim() == 5:
-            return self.encode_first_stage_2DAE(x)
-        encoder_posterior = self.first_stage_model.encode(x)
-        results = self.get_first_stage_encoding(encoder_posterior).detach()
-        return results
-
-    def encode_first_stage_2DAE(self, x):
-        """encode frame by frame"""
-        b, _, t, _, _ = x.shape
-        results = torch.cat(
-            [
-                self.get_first_stage_encoding(self.first_stage_model.encode(x[:, :, i]))
-                .detach()
-                .unsqueeze(2)
-                for i in range(t)
-            ],
-            dim=2,
-        )
-        return results
-
-    def decode_first_stage_2DAE(self, z, **kwargs):
-        """decode frame by frame"""
-        _, _, t, _, _ = z.shape
-        results = torch.cat(
-            [
-                self.first_stage_model.decode(z[:, :, i], **kwargs).unsqueeze(2)
-                for i in range(t)
-            ],
-            dim=2,
-        )
-        return results
-
-    def _decode_core(self, z, **kwargs):
-        z = 1.0 / self.scale_factor * z
-
-        if self.encoder_type == "2d" and z.dim() == 5:
-            return self.decode_first_stage_2DAE(z)
-        results = self.first_stage_model.decode(z, **kwargs)
-        return results
-
-    @torch.no_grad()
-    def decode_first_stage(self, z, **kwargs):
-        return self._decode_core(z, **kwargs)
-
-    def differentiable_decode_first_stage(self, z, **kwargs):
-        """same as decode_first_stage but without decorator"""
-        return self._decode_core(z, **kwargs)
-
-    @torch.no_grad()
-    def get_batch_input(
-        self,
-        batch,
-        random_uncond,
-        return_first_stage_outputs=False,
-        return_original_cond=False,
-        is_imgbatch=False,
-    ):
-        ## image/video shape: b, c, t, h, w
-        data_key = "jpg" if is_imgbatch else self.first_stage_key
-        x = super().get_input(batch, data_key)
-        if is_imgbatch:
-            ## pack image as video
-            # x = x[:,:,None,:,:]
-            b = x.shape[0] // self.temporal_length
-            x = rearrange(x, "(b t) c h w -> b c t h w", b=b, t=self.temporal_length)
-        x_ori = x
-        ## encode video frames x to z via a 2D encoder
-        z = self.encode_first_stage(x)
-
-        ## get caption condition
-        cond_key = "txt" if is_imgbatch else self.cond_stage_key
-        cond = batch[cond_key]
-        if random_uncond and self.uncond_type == "empty_seq":
-            for i, ci in enumerate(cond):
-                if random.random() < self.uncond_prob:
-                    cond[i] = ""
-        if isinstance(cond, dict) or isinstance(cond, list):
-            cond_emb = self.get_learned_conditioning(cond)
-        else:
-            cond_emb = self.get_learned_conditioning(cond.to(self.device))
-        if random_uncond and self.uncond_type == "zero_embed":
-            for i, ci in enumerate(cond):
-                if random.random() < self.uncond_prob:
-                    cond_emb[i] = torch.zeros_like(ci)
-
-        out = [z, cond_emb]
-        ## optional output: self-reconst or caption
-        if return_first_stage_outputs:
-            xrec = self.decode_first_stage(z)
-            out.extend([x_ori, xrec])
-        if return_original_cond:
-            out.append(cond)
-
-        return out
-
-    def forward(self, x, c, **kwargs):
-        if "t" in kwargs:
-            t = kwargs.pop("t")
-        else:
-            t = torch.randint(
-                0, self.num_timesteps, (x.shape[0],), device=self.device
-            ).long()
-        if self.use_scale:
-            x = x * extract_into_tensor(self.scale_arr, t, x.shape)
-        return self.p_losses(x, c, t, **kwargs)
-
-    def shared_step(self, batch, random_uncond, **kwargs):
-        is_imgbatch = False
-        if "loader_img" in batch.keys():
-            ratio = 10.0 / self.temporal_length
-            if random.uniform(0.0, 10.0) < ratio:
-                is_imgbatch = True
-                batch = batch["loader_img"]
-            else:
-                batch = batch["loader_video"]
-        else:
-            pass
-
-        x, c = self.get_batch_input(
-            batch, random_uncond=random_uncond, is_imgbatch=is_imgbatch
-        )
-        loss, loss_dict = self(x, c, is_imgbatch=is_imgbatch, **kwargs)
-        return loss, loss_dict
-
-    def apply_model(self, x_noisy, t, cond, **kwargs):
-        if self.model.conditioning_key == "crossattn_stdit":
-            key = "c_crossattn_stdit"
-            try:
-                cond = {
-                    key: [cond["c_crossattn"][0]["y"]],
-                    "mask": [cond["c_crossattn"][0]["mask"]],
-                }
-            except:
-                cond = {key: [cond["y"]], "mask": [cond["mask"]]}  # support mask for T5
-        else:
-            if isinstance(cond, dict):
-                # hybrid case, cond is exptected to be a dict
-                pass
-            else:
-                if not isinstance(cond, list):
-                    cond = [cond]
-                key = (
-                    "c_concat"
-                    if self.model.conditioning_key == "concat"
-                    else "c_crossattn"
-                )
-                cond = {key: cond}
-
-        # If model is provided as a keyword argument, use it to train the variance.
-        if "model" not in kwargs:
-            x_recon = self.model(x_noisy, t, **cond, **kwargs)
-        else:
-            x_recon = kwargs["model"](x_noisy, t)
-
-        if isinstance(x_recon, tuple):
-            return x_recon[0]
-        else:
-            return x_recon
-
-    def p_losses(
-        self,
-        x_start,
-        cond,
-        t,
-        noise=None,
-        model_kwargs=None,
-        mask=None,
-        weights=None,
-        **kwargs,
-    ):
-        if model_kwargs is None:
-            model_kwargs = {}
-        noise = default(noise, lambda: torch.randn_like(x_start))
-        x_t = self.diffusion_scheduler.q_sample(x_start=x_start, t=t, noise=noise)
-        if mask is not None:
-            t0 = torch.zeros_like(t)
-            x_t0 = self.diffusion_scheduler.q_sample(x_start, t0, noise=noise)
-            x_t = torch.where(mask[:, None, :, None, None], x_t, x_t0)
-
-        model_output = self.apply_model(x_t, t, cond, **kwargs)
-
-        terms = {}
-        if self.loss_type == LossType.MSE or self.loss_type == LossType.RESCALED_MSE:
-            if self.model_var_type in [
-                ModelVarType.LEARNED,
-                ModelVarType.LEARNED_RANGE,
-            ]:
-                B, C = x_t.shape[:2]
-                assert model_output.shape == (B, C * 2, *x_t.shape[2:])
-                model_output, model_var_values = torch.split(model_output, C, dim=1)
-                # Learn the variance using the variational bound, but don't let
-                # it affect our mean prediction.
-                frozen_out = torch.cat([model_output.detach(), model_var_values], dim=1)
-                terms["vb"] = self._vb_terms_bpd(
-                    x_start=x_start,
-                    x_t=x_t,
-                    c=cond,
-                    t=t,
-                    clip_denoised=False,
-                    mask=mask,
-                    model=lambda *args, r=frozen_out: r,
-                    **kwargs,
-                )
-                if self.loss_type == LossType.RESCALED_MSE:
-                    # Divide by 1000 for equivalence with initial implementation.
-                    # Without a factor of 1/1000, the VB term hurts the MSE term.
-                    terms["vb"] *= self.num_timesteps / 1000.0
-
-            target = {
-                ModelMeanType.PREVIOUS_X: self.diffusion_scheduler.q_posterior(
-                    x_start=x_start, x_t=x_t, t=t
-                )[0],
-                ModelMeanType.START_X: x_start,
-                ModelMeanType.EPSILON: noise,
-            }[self.model_mean_type]
-            assert model_output.shape == target.shape == x_start.shape
-            if weights is None:
-                terms["mse"] = mean_flat((target - model_output) ** 2, mask=mask)
-            else:
-                weight = extract_into_tensor(weights, t, target.shape)
-                terms["mse"] = mean_flat(
-                    weight * (target - model_output) ** 2, mask=mask
-                )
-            if "vb" in terms:
-                terms["loss"] = terms["mse"] + terms["vb"]
-            else:
-                terms["loss"] = terms["mse"]
-        else:
-            raise NotImplementedError(self.loss_type)
-
-        loss = terms["loss"].mean()
-
-        loss_dict = {}
-        log_prefix = "train" if self.training else "val"
-
-        loss_dict.update({f"{log_prefix}/loss": loss})
-        loss_dict.update({f"{log_prefix}/loss_mse": terms["mse"].mean()})
-        if "vb" in terms:
-            loss_dict.update({f"{log_prefix}/loss_vb": terms["vb"].mean()})
-
-        return loss, loss_dict
-
-    def training_step(self, batch, batch_idx):
-        loss, loss_dict = self.shared_step(
-            batch, random_uncond=self.classifier_free_guidance
-        )
-        self.log_dict(
-            loss_dict,
-            prog_bar=True,
-            logger=True,
-            on_step=True,
-            on_epoch=True,
-            sync_dist=False,
-        )
-        # self.log("epoch/global_step", self.global_step.float(), prog_bar=True, logger=True, on_step=True, on_epoch=False)
-        """
-        if self.use_scheduler:
-            lr = self.optimizers().param_groups[0]['lr']
-            self.log('lr_abs', lr, prog_bar=True, logger=True, on_step=True, on_epoch=False, rank_zero_only=True)
-        """
-        if (batch_idx + 1) % self.log_every_t == 0:
-            mainlogger.info(
-                f"batch:{batch_idx}|epoch:{self.current_epoch} [globalstep:{self.global_step}]: loss={loss}"
-            )
-        return loss
-
-    def _get_denoise_row_from_list(self, samples, desc=""):
-        denoise_row = []
-        for zd in tqdm(samples, desc=desc):
-            denoise_row.append(self.decode_first_stage(zd.to(self.device)))
-        n_log_timesteps = len(denoise_row)
-
-        denoise_row = torch.stack(denoise_row)  # n_log_timesteps, b, C, H, W
-
-        if denoise_row.dim() == 5:
-            # img, num_imgs= n_log_timesteps * bs, grid_size=[bs,n_log_timesteps]
-            # batch:col, different samples,
-            # n:rows, different steps for one sample
-            denoise_grid = rearrange(denoise_row, "n b c h w -> b n c h w")
-            denoise_grid = rearrange(denoise_grid, "b n c h w -> (b n) c h w")
-            denoise_grid = make_grid(denoise_grid, nrow=n_log_timesteps)
-        elif denoise_row.dim() == 6:
-            # video, grid_size=[n_log_timesteps*bs, t]
-            video_length = denoise_row.shape[3]
-            denoise_grid = rearrange(denoise_row, "n b c t h w -> b n c t h w")
-            denoise_grid = rearrange(denoise_grid, "b n c t h w -> (b n) c t h w")
-            denoise_grid = rearrange(denoise_grid, "n c t h w -> (n t) c h w")
-            denoise_grid = make_grid(denoise_grid, nrow=video_length)
-        else:
-            raise ValueError
-
-        return denoise_grid
-
-    @torch.no_grad()
-    def log_images(
-        self,
-        batch,
-        sample=True,
-        ddim_steps=200,
-        ddim_eta=1.0,
-        plot_denoise_rows=False,
-        unconditional_guidance_scale=1.0,
-        **kwargs,
-    ):
-        """log images for LatentDiffusion"""
-        ## TBD: currently, classifier_free_guidance sampling is only supported by DDIM
-        use_ddim = ddim_steps is not None
-        log = dict()
-        z, c, x, xrec, xc = self.get_batch_input(
-            batch,
-            random_uncond=False,
-            return_first_stage_outputs=True,
-            return_original_cond=True,
-        )
-        N, _, T, H, W = x.shape
-        # TODO fix data type
-        log["inputs"] = x.to(torch.bfloat16)
-        log["reconst"] = xrec
-        log["condition"] = xc
-
-        if sample:
-            # get uncond embedding for classifier-free guidance sampling
-            if unconditional_guidance_scale != 1.0:
-                if isinstance(c, dict):
-                    if "y" in c:
-                        c_emb = c["y"]
-                        c_cat = None  # set default value is None
-                    else:
-                        c_cat, c_emb = c["c_concat"][0], c["c_crossattn"][0]
-                else:
-                    c_emb = c
-
-                # TODO fix data type
-                z = z.to(torch.bfloat16)
-                c_emb = c_emb.to(torch.bfloat16)
-
-                # get uc: unconditional condition for classifier-free guidance sampling
-                if self.uncond_type == "empty_seq":
-                    prompts = N * [""]
-                    uc = self.get_learned_conditioning(prompts)
-                elif self.uncond_type == "zero_embed":
-                    uc = torch.zeros_like(c_emb)
-                # make uc for hybrid condition case
-                if isinstance(c, dict) and c_cat is not None:
-                    uc = {"c_concat": [c_cat], "c_crossattn": [uc]}
-            else:
-                uc = None
-
-            with self.ema_scope("Plotting"):
-                samples, z_denoise_row = self.sample_log(
-                    cond=c,
-                    batch_size=N,
-                    ddim=use_ddim,
-                    ddim_steps=ddim_steps,
-                    eta=ddim_eta,
-                    unconditional_guidance_scale=unconditional_guidance_scale,
-                    unconditional_conditioning=uc,
-                    mask=self.cond_mask,
-                    x0=z,
-                    **kwargs,
-                )
-            x_samples = self.decode_first_stage(samples)
-            log["samples"] = x_samples
-
-            if plot_denoise_rows:
-                denoise_grid = self._get_denoise_row_from_list(z_denoise_row)
-                log["denoise_row"] = denoise_grid
-
-        return log
-
-    def _vb_terms_bpd(
-        self,
-        x_start,
-        x_t,
-        c,
-        t,
-        clip_denoised=True,
-        model_kwargs=None,
-        mask=None,
-        **kwargs,
-    ):
-        """
-        Get a term for the variational lower-bound.
-        The resulting units are bits (rather than nats, as one might expect).
-        This allows for comparison to other papers.
-        :return: a dict with the following keys:
-                 - 'output': a shape [N] tensor of NLLs or KLs.
-                 - 'pred_xstart': the x_0 predictions.
-        """
-        true_mean, _, true_log_variance_clipped = self.diffusion_scheduler.q_posterior(
-            x_start=x_start, x_t=x_t, t=t
-        )
-        out = self.diffusion_scheduler.p_mean_variance(
-            x_t, c, t, clip_denoised=clip_denoised, model_kwargs=model_kwargs, **kwargs
-        )
-        model_mean, posterior_variance, posterior_log_variance = out
-        kl = normal_kl(
-            true_mean, true_log_variance_clipped, model_mean, posterior_log_variance
-        )
-        kl = mean_flat(kl, mask=mask) / np.log(2.0)
-
-        decoder_nll = -discretized_gaussian_log_likelihood(
-            x_start, means=model_mean, log_scales=0.5 * posterior_log_variance
-        )
-        assert decoder_nll.shape == x_start.shape
-        decoder_nll = mean_flat(decoder_nll, mask=mask) / np.log(2.0)
-
-        # At the first timestep return the decoder NLL,
-        # otherwise return KL(q(x_{t-1}|x_t,x_0) || p(x_{t-1}|x_t))
-        output = torch.where((t == 0), decoder_nll, kl)
-        return output
-
-    @torch.no_grad()
-    def p_sample_loop(
-        self,
-        cond,
-        shape,
-        return_intermediates=False,
-        x_T=None,
-        verbose=True,
-        callback=None,
-        timesteps=None,
-        mask=None,
-        x0=None,
-        img_callback=None,
-        start_T=None,
-        log_every_t=None,
-        **kwargs,
-    ):
-
-        if not log_every_t:
-            log_every_t = self.log_every_t
-        device = self.diffusion_scheduler.betas.device
-        b = shape[0]
-        # sample an initial noise
-        if x_T is None:
-            img = torch.randn(shape, device=device)
-        else:
-            img = x_T
-
-        intermediates = [img]
-        if timesteps is None:
-            timesteps = self.num_timesteps
-        if start_T is not None:
-            timesteps = min(timesteps, start_T)
-
-        iterator = (
-            tqdm(reversed(range(0, timesteps)), desc="Sampling t", total=timesteps)
-            if verbose
-            else reversed(range(0, timesteps))
-        )
-
-        if mask is not None:
-            assert x0 is not None
-            assert x0.shape[2:3] == mask.shape[2:3]  # spatial size has to match
-
-        for i in iterator:
-            ts = torch.full((b,), i, device=device, dtype=torch.long)
-
-            img = self.diffusion_scheduler.p_sample(
-                img, cond, ts, clip_denoised=self.clip_denoised, **kwargs
-            )
-            if mask is not None:
-                img_orig = self.diffusion_scheduler.q_sample(x0, ts)
-                img = img_orig * mask + (1.0 - mask) * img
-
-            if i % log_every_t == 0 or i == timesteps - 1:
-                intermediates.append(img)
-            if callback:
-                callback(i)
-            if img_callback:
-                img_callback(img, i)
-
-        if return_intermediates:
-            return img, intermediates
-        return img
-
-    @torch.no_grad()
-    def sample(
-        self,
-        cond,
-        batch_size=16,
-        return_intermediates=False,
-        x_T=None,
-        verbose=True,
-        timesteps=None,
-        mask=None,
-        x0=None,
-        shape=None,
-        **kwargs,
-    ):
-        if shape is None:
-            shape = (batch_size, self.channels, self.temporal_length, *self.image_size)
-        if cond is not None:
-            if isinstance(cond, dict):
-                cond = {
-                    key: (
-                        cond[key][:batch_size]
-                        if not isinstance(cond[key], list)
-                        else list(map(lambda x: x[:batch_size], cond[key]))
-                    )
-                    for key in cond
-                }
-            else:
-                cond = (
-                    [c[:batch_size] for c in cond]
-                    if isinstance(cond, list)
-                    else cond[:batch_size]
-                )
-        return self.p_sample_loop(
-            cond,
-            shape,
-            return_intermediates=return_intermediates,
-            x_T=x_T,
-            verbose=verbose,
-            timesteps=timesteps,
-            mask=mask,
-            x0=x0,
-            **kwargs,
-        )
-
-    @torch.no_grad()
-    def sample_log(self, cond, batch_size, ddim, ddim_steps, **kwargs):
-        if ddim:
-            ddim_sampler = DDIMSampler(self)
-            shape = (self.channels, self.temporal_length, *self.image_size)
-            # kwargs.update({"clean_cond": True})
-            samples, intermediates = ddim_sampler.sample(
-                ddim_steps, batch_size, shape, cond, verbose=False, **kwargs
-            )
-
-        else:
-            samples, intermediates = self.sample(
-                cond=cond, batch_size=batch_size, return_intermediates=True, **kwargs
-            )
-
-        return samples, intermediates
-
-    def configure_optimizers(self):
-        """configure_optimizers for LatentDiffusion"""
-        lr = self.learning_rate
-        if self.empty_params_only and hasattr(self, "empty_paras"):
-            params = [
-                p for n, p in self.model.named_parameters() if n in self.empty_paras
-            ]
-            print("self.empty_paras", len(self.empty_paras))
-            for n, p in self.model.named_parameters():
-                if n not in self.empty_paras:
-                    p.requires_grad = False
-            mainlogger.info(f"@Training [{len(params)}] Empty Paramters ONLY.")
-        else:
-            params = list(self.model.parameters())
-            mainlogger.info(f"@Training [{len(params)}] Full Paramters.")
-
-        if self.diffusion_scheduler.learn_logvar:
-            mainlogger.info("Diffusion model optimizing logvar")
-            if isinstance(params[0], dict):
-                params.append({"params": [self.logvar]})
-            else:
-                params.append(self.logvar)
-
-        ## optimizer
-        optimizer = torch.optim.AdamW(params, lr=lr)
-        ## lr scheduler
-        if self.use_scheduler:
-            mainlogger.info("Setting up LambdaLR scheduler...")
-            lr_scheduler = self.configure_schedulers(optimizer)
-            return [optimizer], [lr_scheduler]
-
-        return optimizer
-
-    def configure_schedulers(self, optimizer):
-        assert "target" in self.scheduler_config
-        scheduler_name = self.scheduler_config.target.split(".")[-1]
-        interval = self.scheduler_config.interval
-        frequency = self.scheduler_config.frequency
-        if scheduler_name == "LambdaLRScheduler":
-            scheduler = instantiate_from_config(self.scheduler_config)
-            scheduler.start_step = self.global_step
-            lr_scheduler = {
-                "scheduler": LambdaLR(optimizer, lr_lambda=scheduler.schedule),
-                "interval": interval,
-                "frequency": frequency,
-            }
-        elif scheduler_name == "CosineAnnealingLRScheduler":
-            scheduler = instantiate_from_config(self.scheduler_config)
-            decay_steps = scheduler.decay_steps
-            last_step = -1 if self.global_step == 0 else scheduler.start_step
-            lr_scheduler = {
-                "scheduler": CosineAnnealingLR(
-                    optimizer, T_max=decay_steps, last_epoch=last_step
-                ),
-                "interval": interval,
-                "frequency": frequency,
-            }
-        else:
-            raise NotImplementedError
-        return lr_scheduler
diff --git a/videotuna/models/opensora/models/layers/__init__.py b/videotuna/models/opensora/models/layers/__init__.py
deleted file mode 100644
index e69de29b..00000000
diff --git a/videotuna/models/opensora/models/layers/blocks.py b/videotuna/models/opensora/models/layers/blocks.py
deleted file mode 100644
index 92d6c90f..00000000
--- a/videotuna/models/opensora/models/layers/blocks.py
+++ /dev/null
@@ -1,919 +0,0 @@
-# This source code is licensed under the license found in the
-# LICENSE file in the root directory of this source tree.
-# --------------------------------------------------------
-# References:
-# PixArt: https://github.com/PixArt-alpha/PixArt-alpha
-# Latte:  https://github.com/Vchitect/Latte
-# DiT:    https://github.com/facebookresearch/DiT/tree/main
-# GLIDE:  https://github.com/openai/glide-text2im
-# MAE:    https://github.com/facebookresearch/mae/blob/main/models_mae.py
-# --------------------------------------------------------
-
-import functools
-import math
-from typing import Optional
-
-import numpy as np
-import torch
-import torch.distributed as dist
-import torch.nn as nn
-import torch.nn.functional as F
-import torch.utils.checkpoint
-import xformers.ops
-from einops import rearrange
-from timm.models.vision_transformer import Mlp
-
-from videotuna.models.opensora.acceleration.communications import (
-    all_to_all,
-    split_forward_gather_backward,
-)
-from videotuna.models.opensora.acceleration.parallel_states import get_sequence_parallel_group
-
-approx_gelu = lambda: nn.GELU(approximate="tanh")
-
-
-class LlamaRMSNorm(nn.Module):
-    def __init__(self, hidden_size, eps=1e-6):
-        """
-        LlamaRMSNorm is equivalent to T5LayerNorm
-        """
-        super().__init__()
-        self.weight = nn.Parameter(torch.ones(hidden_size))
-        self.variance_epsilon = eps
-
-    def forward(self, hidden_states):
-        input_dtype = hidden_states.dtype
-        hidden_states = hidden_states.to(torch.float32)
-        variance = hidden_states.pow(2).mean(-1, keepdim=True)
-        hidden_states = hidden_states * torch.rsqrt(variance + self.variance_epsilon)
-        return self.weight * hidden_states.to(input_dtype)
-
-
-def get_layernorm(
-    hidden_size: torch.Tensor, eps: float, affine: bool, use_kernel: bool
-):
-    if use_kernel:
-        try:
-            from apex.normalization import FusedLayerNorm
-
-            return FusedLayerNorm(hidden_size, elementwise_affine=affine, eps=eps)
-        except ImportError:
-            raise RuntimeError("FusedLayerNorm not available. Please install apex.")
-    else:
-        return nn.LayerNorm(hidden_size, eps, elementwise_affine=affine)
-
-
-def modulate(norm_func, x, shift, scale):
-    # Suppose x is (B, N, D), shift is (B, D), scale is (B, D)
-    dtype = x.dtype
-    x = norm_func(x.to(torch.float32)).to(dtype)
-    x = x * (scale.unsqueeze(1) + 1) + shift.unsqueeze(1)
-    x = x.to(dtype)
-    return x
-
-
-def t2i_modulate(x, shift, scale):
-    return x * (1 + scale) + shift
-
-
-# ===============================================
-# General-purpose Layers
-# ===============================================
-
-
-class PatchEmbed3D(nn.Module):
-    """Video to Patch Embedding.
-
-    Args:
-        patch_size (int): Patch token size. Default: (2,4,4).
-        in_chans (int): Number of input video channels. Default: 3.
-        embed_dim (int): Number of linear projection output channels. Default: 96.
-        norm_layer (nn.Module, optional): Normalization layer. Default: None
-    """
-
-    def __init__(
-        self,
-        patch_size=(2, 4, 4),
-        in_chans=3,
-        embed_dim=96,
-        norm_layer=None,
-        flatten=True,
-    ):
-        super().__init__()
-        self.patch_size = patch_size
-        self.flatten = flatten
-
-        self.in_chans = in_chans
-        self.embed_dim = embed_dim
-
-        self.proj = nn.Conv3d(
-            in_chans, embed_dim, kernel_size=patch_size, stride=patch_size
-        )
-        if norm_layer is not None:
-            self.norm = norm_layer(embed_dim)
-        else:
-            self.norm = None
-
-    def forward(self, x):
-        """Forward function."""
-        # padding
-        _, _, D, H, W = x.size()
-        if W % self.patch_size[2] != 0:
-            x = F.pad(x, (0, self.patch_size[2] - W % self.patch_size[2]))
-        if H % self.patch_size[1] != 0:
-            x = F.pad(x, (0, 0, 0, self.patch_size[1] - H % self.patch_size[1]))
-        if D % self.patch_size[0] != 0:
-            x = F.pad(x, (0, 0, 0, 0, 0, self.patch_size[0] - D % self.patch_size[0]))
-
-        x = self.proj(x)  # (B C T H W)
-        if self.norm is not None:
-            D, Wh, Ww = x.size(2), x.size(3), x.size(4)
-            x = x.flatten(2).transpose(1, 2)
-            x = self.norm(x)
-            x = x.transpose(1, 2).view(-1, self.embed_dim, D, Wh, Ww)
-        if self.flatten:
-            x = x.flatten(2).transpose(1, 2)  # BCTHW -> BNC
-        return x
-
-
-class Attention(nn.Module):
-    def __init__(
-        self,
-        dim: int,
-        num_heads: int = 8,
-        qkv_bias: bool = False,
-        qk_norm: bool = False,
-        attn_drop: float = 0.0,
-        proj_drop: float = 0.0,
-        norm_layer: nn.Module = LlamaRMSNorm,
-        enable_flash_attn: bool = False,
-        rope=None,
-        qk_norm_legacy: bool = False,
-    ) -> None:
-        super().__init__()
-        assert dim % num_heads == 0, "dim should be divisible by num_heads"
-        self.dim = dim
-        self.num_heads = num_heads
-        self.head_dim = dim // num_heads
-        self.scale = self.head_dim**-0.5
-        self.enable_flash_attn = enable_flash_attn
-
-        self.qkv = nn.Linear(dim, dim * 3, bias=qkv_bias)
-        self.q_norm = norm_layer(self.head_dim) if qk_norm else nn.Identity()
-        self.k_norm = norm_layer(self.head_dim) if qk_norm else nn.Identity()
-        self.qk_norm_legacy = qk_norm_legacy
-        self.attn_drop = nn.Dropout(attn_drop)
-        self.proj = nn.Linear(dim, dim)
-        self.proj_drop = nn.Dropout(proj_drop)
-
-        self.rope = False
-        if rope is not None:
-            self.rope = True
-            self.rotary_emb = rope
-
-    def forward(self, x: torch.Tensor) -> torch.Tensor:
-        B, N, C = x.shape
-        # flash attn is not memory efficient for small sequences, this is empirical
-        enable_flash_attn = self.enable_flash_attn and (N > B)
-        qkv = self.qkv(x)
-        qkv_shape = (B, N, 3, self.num_heads, self.head_dim)
-
-        qkv = qkv.view(qkv_shape).permute(2, 0, 3, 1, 4)
-        q, k, v = qkv.unbind(0)
-        if self.qk_norm_legacy:
-            # WARNING: this may be a bug
-            if self.rope:
-                q = self.rotary_emb(q)
-                k = self.rotary_emb(k)
-            q, k = self.q_norm(q), self.k_norm(k)
-        else:
-            q, k = self.q_norm(q), self.k_norm(k)
-            if self.rope:
-                q = self.rotary_emb(q)
-                k = self.rotary_emb(k)
-
-        if enable_flash_attn:
-            from flash_attn import flash_attn_func
-
-            # (B, #heads, N, #dim) -> (B, N, #heads, #dim)
-            q = q.permute(0, 2, 1, 3)
-            k = k.permute(0, 2, 1, 3)
-            v = v.permute(0, 2, 1, 3)
-            x = flash_attn_func(
-                q,
-                k,
-                v,
-                dropout_p=self.attn_drop.p if self.training else 0.0,
-                softmax_scale=self.scale,
-            )
-        else:
-            dtype = q.dtype
-            q = q * self.scale
-            attn = q @ k.transpose(-2, -1)  # translate attn to float32
-            attn = attn.to(torch.float32)
-            attn = attn.softmax(dim=-1)
-            attn = attn.to(dtype)  # cast back attn to original dtype
-            attn = self.attn_drop(attn)
-            x = attn @ v
-
-        x_output_shape = (B, N, C)
-        if not enable_flash_attn:
-            x = x.transpose(1, 2)
-        x = x.reshape(x_output_shape)
-        x = self.proj(x)
-        x = self.proj_drop(x)
-        return x
-
-
-class KVCompressAttention(nn.Module):
-    def __init__(
-        self,
-        dim: int,
-        num_heads: int = 8,
-        qkv_bias: bool = False,
-        qk_norm: bool = False,
-        attn_drop: float = 0.0,
-        proj_drop: float = 0.0,
-        norm_layer: nn.Module = LlamaRMSNorm,
-        enable_flash_attn: bool = False,
-        sampling="conv",
-        sr_ratio=1,
-        mem_eff_attention=False,
-        attn_half=False,
-    ) -> None:
-        super().__init__()
-        assert dim % num_heads == 0, "dim should be divisible by num_heads"
-        self.dim = dim
-        self.num_heads = num_heads
-        self.head_dim = dim // num_heads
-        self.scale = self.head_dim**-0.5
-        self.enable_flash_attn = enable_flash_attn
-
-        self.qkv = nn.Linear(dim, dim * 3, bias=qkv_bias)
-
-        self.sr_ratio = sr_ratio
-        self.sampling = sampling
-        if sr_ratio > 1 and sampling == "conv":
-            # Avg Conv Init.
-            self.sr = nn.Conv2d(
-                dim, dim, groups=dim, kernel_size=sr_ratio, stride=sr_ratio
-            )
-            self.sr.weight.data.fill_(1 / sr_ratio**2)
-            self.sr.bias.data.zero_()
-            self.norm = nn.LayerNorm(dim)
-
-        self.q_norm = norm_layer(self.head_dim) if qk_norm else nn.Identity()
-        self.k_norm = norm_layer(self.head_dim) if qk_norm else nn.Identity()
-        self.attn_drop = nn.Dropout(attn_drop)
-        self.proj = nn.Linear(dim, dim)
-        self.proj_drop = nn.Dropout(proj_drop)
-
-        self.mem_eff_attention = mem_eff_attention
-        self.attn_half = attn_half
-
-    def downsample_2d(self, tensor, H, W, scale_factor, sampling=None):
-        if sampling is None or scale_factor == 1:
-            return tensor
-        B, N, C = tensor.shape
-
-        if sampling == "uniform_every":
-            return tensor[:, ::scale_factor], int(N // scale_factor)
-
-        tensor = tensor.reshape(B, H, W, C).permute(0, 3, 1, 2)
-        new_H, new_W = int(H / scale_factor), int(W / scale_factor)
-        new_N = new_H * new_W
-
-        if sampling == "ave":
-            tensor = F.interpolate(
-                tensor, scale_factor=1 / scale_factor, mode="nearest"
-            ).permute(0, 2, 3, 1)
-        elif sampling == "uniform":
-            tensor = tensor[:, :, ::scale_factor, ::scale_factor].permute(0, 2, 3, 1)
-        elif sampling == "conv":
-            tensor = self.sr(tensor).reshape(B, C, -1).permute(0, 2, 1)
-            tensor = self.norm(tensor)
-        else:
-            raise ValueError
-
-        return tensor.reshape(B, new_N, C).contiguous(), new_N
-
-    def forward(
-        self, x: torch.Tensor, mask=None, HW=None, block_id=None, **kwargs
-    ) -> torch.Tensor:
-        B, N, C = x.shape
-        new_N = N
-        H, W = HW
-        # flash attn is not memory efficient for small sequences, this is empirical
-        enable_flash_attn = self.enable_flash_attn and (N > B)
-
-        qkv = self.qkv(x).reshape(B, N, 3, C)
-        q, k, v = qkv.unbind(2)
-        dtype = q.dtype
-        # KV compression
-        if self.sr_ratio > 1:
-            k, new_N = self.downsample_2d(
-                k, H, W, self.sr_ratio, sampling=self.sampling
-            )
-            v, new_N = self.downsample_2d(
-                v, H, W, self.sr_ratio, sampling=self.sampling
-            )
-
-        q = q.reshape(B, N, self.num_heads, C // self.num_heads).to(dtype)
-        k = k.reshape(B, new_N, self.num_heads, C // self.num_heads).to(dtype)
-        v = v.reshape(B, new_N, self.num_heads, C // self.num_heads).to(dtype)
-
-        q, k = self.q_norm(q), self.k_norm(k)
-
-        if enable_flash_attn:
-            from flash_attn import flash_attn_func
-
-            x = flash_attn_func(
-                q,
-                k,
-                v,
-                dropout_p=self.attn_drop.p if self.training else 0.0,
-                softmax_scale=self.scale,
-            )
-
-        elif self.mem_eff_attention:
-            attn_bias = None
-            if mask is not None:
-                attn_bias = torch.zeros(
-                    [B * self.num_heads, q.shape[1], k.shape[1]],
-                    dtype=q.dtype,
-                    device=q.device,
-                )
-                attn_bias.masked_fill_(
-                    mask.squeeze(1).repeat(self.num_heads, 1, 1) == 0, float("-inf")
-                )
-            x = xformers.ops.memory_efficient_attention(
-                q, k, v, p=self.attn_drop.p, attn_bias=attn_bias
-            )
-        else:
-            # (B, N, #heads, #dim) -> (B, #heads, N, #dim)
-            q = q.permute(0, 2, 1, 3)
-            k = k.permute(0, 2, 1, 3)
-            v = v.permute(0, 2, 1, 3)
-            dtype = q.dtype
-            q = q * self.scale
-            attn = q @ k.transpose(-2, -1)  # translate attn to float32
-            if not self.attn_half:
-                attn = attn.to(torch.float32)
-            attn = attn.softmax(dim=-1)
-            attn = attn.to(dtype)  # cast back attn to original dtype
-            attn = self.attn_drop(attn)
-            x = attn @ v
-
-        x_output_shape = (B, N, C)
-        if not enable_flash_attn:
-            x = x.transpose(1, 2)
-        x = x.reshape(x_output_shape)
-        x = self.proj(x)
-        x = self.proj_drop(x)
-        return x
-
-
-class SeqParallelAttention(Attention):
-    def __init__(
-        self,
-        dim: int,
-        num_heads: int = 8,
-        qkv_bias: bool = False,
-        qk_norm: bool = False,
-        attn_drop: float = 0.0,
-        proj_drop: float = 0.0,
-        norm_layer: nn.Module = LlamaRMSNorm,
-        enable_flash_attn: bool = False,
-        rope=None,
-    ) -> None:
-        assert rope is None, "Rope is not supported in SeqParallelAttention"
-        super().__init__(
-            dim=dim,
-            num_heads=num_heads,
-            qkv_bias=qkv_bias,
-            qk_norm=qk_norm,
-            attn_drop=attn_drop,
-            proj_drop=proj_drop,
-            norm_layer=norm_layer,
-            enable_flash_attn=enable_flash_attn,
-        )
-
-    def forward(self, x: torch.Tensor) -> torch.Tensor:
-        B, N, C = (
-            x.shape
-        )  # for sequence parallel here, the N is a local sequence length
-        qkv = self.qkv(x)
-        qkv_shape = (B, N, 3, self.num_heads, self.head_dim)
-
-        qkv = qkv.view(qkv_shape)
-
-        sp_group = get_sequence_parallel_group()
-
-        # apply all_to_all to gather sequence and split attention heads
-        # [B, SUB_N, 3, NUM_HEAD, HEAD_DIM] -> [B, N, 3, NUM_HEAD_PER_DEVICE, HEAD_DIM]
-        qkv = all_to_all(qkv, sp_group, scatter_dim=3, gather_dim=1)
-
-        if self.enable_flash_attn:
-            qkv_permute_shape = (
-                2,
-                0,
-                1,
-                3,
-                4,
-            )  # [3, B, N, NUM_HEAD_PER_DEVICE, HEAD_DIM]
-        else:
-            qkv_permute_shape = (
-                2,
-                0,
-                3,
-                1,
-                4,
-            )  # [3, B, NUM_HEAD_PER_DEVICE, N, HEAD_DIM]
-        qkv = qkv.permute(qkv_permute_shape)
-
-        # ERROR: Should qk_norm first
-        q, k, v = qkv.unbind(0)
-        q, k = self.q_norm(q), self.k_norm(k)
-        if self.enable_flash_attn:
-            from flash_attn import flash_attn_func
-
-            x = flash_attn_func(
-                q,
-                k,
-                v,
-                dropout_p=self.attn_drop.p if self.training else 0.0,
-                softmax_scale=self.scale,
-            )
-        else:
-            dtype = q.dtype
-            q = q * self.scale
-            attn = q @ k.transpose(-2, -1)  # translate attn to float32
-            attn = attn.to(torch.float32)
-            attn = attn.softmax(dim=-1)
-            attn = attn.to(dtype)  # cast back attn to original dtype
-            attn = self.attn_drop(attn)
-            x = attn @ v
-
-        if not self.enable_flash_attn:
-            x = x.transpose(1, 2)
-
-        # apply all to all to gather back attention heads and split sequence
-        # [B, N, NUM_HEAD_PER_DEVICE, HEAD_DIM]  -> [B, SUB_N, NUM_HEAD, HEAD_DIM]
-        x = all_to_all(x, sp_group, scatter_dim=1, gather_dim=2)
-
-        # reshape outputs back to [B, N, C]
-        x_output_shape = (B, N, C)
-        x = x.reshape(x_output_shape)
-        x = self.proj(x)
-        x = self.proj_drop(x)
-        return x
-
-
-class MultiHeadCrossAttention(nn.Module):
-    def __init__(self, d_model, num_heads, attn_drop=0.0, proj_drop=0.0):
-        super(MultiHeadCrossAttention, self).__init__()
-        assert d_model % num_heads == 0, "d_model must be divisible by num_heads"
-
-        self.d_model = d_model
-        self.num_heads = num_heads
-        self.head_dim = d_model // num_heads
-
-        self.q_linear = nn.Linear(d_model, d_model)
-        self.kv_linear = nn.Linear(d_model, d_model * 2)
-        self.attn_drop = nn.Dropout(attn_drop)
-        self.proj = nn.Linear(d_model, d_model)
-        self.proj_drop = nn.Dropout(proj_drop)
-
-    def forward(self, x, cond, mask=None):
-        # query/value: img tokens; key: condition; mask: if padding tokens
-        B, N, C = x.shape
-
-        q = self.q_linear(x).view(1, -1, self.num_heads, self.head_dim)
-        kv = self.kv_linear(cond).view(1, -1, 2, self.num_heads, self.head_dim)
-        k, v = kv.unbind(2)
-
-        attn_bias = None
-        if mask is not None:
-            attn_bias = xformers.ops.fmha.BlockDiagonalMask.from_seqlens([N] * B, mask)
-        x = xformers.ops.memory_efficient_attention(
-            q, k, v, p=self.attn_drop.p, attn_bias=attn_bias
-        )
-
-        x = x.view(B, -1, C)
-        x = self.proj(x)
-        x = self.proj_drop(x)
-        return x
-
-
-class SeqParallelMultiHeadCrossAttention(MultiHeadCrossAttention):
-    def __init__(
-        self,
-        d_model,
-        num_heads,
-        attn_drop=0.0,
-        proj_drop=0.0,
-    ):
-        super().__init__(
-            d_model=d_model,
-            num_heads=num_heads,
-            attn_drop=attn_drop,
-            proj_drop=proj_drop,
-        )
-
-    def forward(self, x, cond, mask=None):
-        # query/value: img tokens; key: condition; mask: if padding tokens
-        sp_group = get_sequence_parallel_group()
-        sp_size = dist.get_world_size(sp_group)
-        B, SUB_N, C = x.shape  # [B, TS/p, C]
-        N = SUB_N * sp_size
-
-        # shape:
-        # q, k, v: [B, SUB_N, NUM_HEADS, HEAD_DIM]
-        q = self.q_linear(x).view(1, -1, self.num_heads, self.head_dim)
-        kv = self.kv_linear(cond).view(1, -1, 2, self.num_heads, self.head_dim)
-        kv = split_forward_gather_backward(
-            kv, get_sequence_parallel_group(), dim=3, grad_scale="down"
-        )
-        k, v = kv.unbind(2)
-
-        # apply all_to_all to gather sequence and split attention heads
-        q = all_to_all(q, sp_group, scatter_dim=2, gather_dim=1)
-
-        q = q.view(1, -1, self.num_heads // sp_size, self.head_dim)
-        k = k.view(1, -1, self.num_heads // sp_size, self.head_dim)
-        v = v.view(1, -1, self.num_heads // sp_size, self.head_dim)
-
-        # compute attention
-        attn_bias = None
-        if mask is not None:
-            attn_bias = xformers.ops.fmha.BlockDiagonalMask.from_seqlens([N] * B, mask)
-        x = xformers.ops.memory_efficient_attention(
-            q, k, v, p=self.attn_drop.p, attn_bias=attn_bias
-        )
-
-        # apply all to all to gather back attention heads and scatter sequence
-        x = x.view(B, -1, self.num_heads // sp_size, self.head_dim)
-        x = all_to_all(x, sp_group, scatter_dim=1, gather_dim=2)
-
-        # apply output projection
-        x = x.view(B, -1, C)
-        x = self.proj(x)
-        x = self.proj_drop(x)
-        return x
-
-
-class FinalLayer(nn.Module):
-    """
-    The final layer of DiT.
-    """
-
-    def __init__(self, hidden_size, num_patch, out_channels):
-        super().__init__()
-        self.norm_final = nn.LayerNorm(hidden_size, elementwise_affine=False, eps=1e-6)
-        self.linear = nn.Linear(hidden_size, num_patch * out_channels, bias=True)
-        self.adaLN_modulation = nn.Sequential(
-            nn.SiLU(), nn.Linear(hidden_size, 2 * hidden_size, bias=True)
-        )
-
-    def forward(self, x, c):
-        shift, scale = self.adaLN_modulation(c).chunk(2, dim=1)
-        x = modulate(self.norm_final, x, shift, scale)
-        x = self.linear(x)
-        return x
-
-
-class T2IFinalLayer(nn.Module):
-    """
-    The final layer of PixArt.
-    """
-
-    def __init__(self, hidden_size, num_patch, out_channels, d_t=None, d_s=None):
-        super().__init__()
-        self.norm_final = nn.LayerNorm(hidden_size, elementwise_affine=False, eps=1e-6)
-        self.linear = nn.Linear(hidden_size, num_patch * out_channels, bias=True)
-        self.scale_shift_table = nn.Parameter(
-            torch.randn(2, hidden_size) / hidden_size**0.5
-        )
-        self.out_channels = out_channels
-        self.d_t = d_t
-        self.d_s = d_s
-
-    def t_mask_select(self, x_mask, x, masked_x, T, S):
-        # x: [B, (T, S), C]
-        # mased_x: [B, (T, S), C]
-        # x_mask: [B, T]
-        x = rearrange(x, "B (T S) C -> B T S C", T=T, S=S)
-        masked_x = rearrange(masked_x, "B (T S) C -> B T S C", T=T, S=S)
-        x = torch.where(x_mask[:, :, None, None], x, masked_x)
-        x = rearrange(x, "B T S C -> B (T S) C")
-        return x
-
-    def forward(self, x, t, x_mask=None, t0=None, T=None, S=None):
-        if T is None:
-            T = self.d_t
-        if S is None:
-            S = self.d_s
-        shift, scale = (self.scale_shift_table[None] + t[:, None]).chunk(2, dim=1)
-        x = t2i_modulate(self.norm_final(x), shift, scale)
-        if x_mask is not None:
-            shift_zero, scale_zero = (self.scale_shift_table[None] + t0[:, None]).chunk(
-                2, dim=1
-            )
-            x_zero = t2i_modulate(self.norm_final(x), shift_zero, scale_zero)
-            x = self.t_mask_select(x_mask, x, x_zero, T, S)
-        x = self.linear(x)
-        return x
-
-
-# ===============================================
-# Embedding Layers for Timesteps and Class Labels
-# ===============================================
-
-
-class TimestepEmbedder(nn.Module):
-    """
-    Embeds scalar timesteps into vector representations.
-    """
-
-    def __init__(self, hidden_size, frequency_embedding_size=256):
-        super().__init__()
-        self.mlp = nn.Sequential(
-            nn.Linear(frequency_embedding_size, hidden_size, bias=True),
-            nn.SiLU(),
-            nn.Linear(hidden_size, hidden_size, bias=True),
-        )
-        self.frequency_embedding_size = frequency_embedding_size
-
-    @staticmethod
-    def timestep_embedding(t, dim, max_period=10000):
-        """
-        Create sinusoidal timestep embeddings.
-        :param t: a 1-D Tensor of N indices, one per batch element.
-                          These may be fractional.
-        :param dim: the dimension of the output.
-        :param max_period: controls the minimum frequency of the embeddings.
-        :return: an (N, D) Tensor of positional embeddings.
-        """
-        # https://github.com/openai/glide-text2im/blob/main/glide_text2im/nn.py
-        half = dim // 2
-        freqs = torch.exp(
-            -math.log(max_period)
-            * torch.arange(start=0, end=half, dtype=torch.float32)
-            / half
-        )
-        freqs = freqs.to(device=t.device)
-        args = t[:, None].float() * freqs[None]
-        embedding = torch.cat([torch.cos(args), torch.sin(args)], dim=-1)
-        if dim % 2:
-            embedding = torch.cat(
-                [embedding, torch.zeros_like(embedding[:, :1])], dim=-1
-            )
-        return embedding
-
-    def forward(self, t, dtype):
-        t_freq = self.timestep_embedding(t, self.frequency_embedding_size)
-        if t_freq.dtype != dtype:
-            t_freq = t_freq.to(dtype)
-        t_emb = self.mlp(t_freq)
-        return t_emb
-
-
-class LabelEmbedder(nn.Module):
-    """
-    Embeds class labels into vector representations. Also handles label dropout for classifier-free guidance.
-    """
-
-    def __init__(self, num_classes, hidden_size, dropout_prob):
-        super().__init__()
-        use_cfg_embedding = dropout_prob > 0
-        self.embedding_table = nn.Embedding(
-            num_classes + use_cfg_embedding, hidden_size
-        )
-        self.num_classes = num_classes
-        self.dropout_prob = dropout_prob
-
-    def token_drop(self, labels, force_drop_ids=None):
-        """
-        Drops labels to enable classifier-free guidance.
-        """
-        if force_drop_ids is None:
-            drop_ids = torch.rand(labels.shape[0]).cuda() < self.dropout_prob
-        else:
-            drop_ids = force_drop_ids == 1
-        labels = torch.where(drop_ids, self.num_classes, labels)
-        return labels
-
-    def forward(self, labels, train, force_drop_ids=None):
-        use_dropout = self.dropout_prob > 0
-        if (train and use_dropout) or (force_drop_ids is not None):
-            labels = self.token_drop(labels, force_drop_ids)
-        return self.embedding_table(labels)
-
-
-class SizeEmbedder(TimestepEmbedder):
-    """
-    Embeds scalar timesteps into vector representations.
-    """
-
-    def __init__(self, hidden_size, frequency_embedding_size=256):
-        super().__init__(
-            hidden_size=hidden_size, frequency_embedding_size=frequency_embedding_size
-        )
-        self.mlp = nn.Sequential(
-            nn.Linear(frequency_embedding_size, hidden_size, bias=True),
-            nn.SiLU(),
-            nn.Linear(hidden_size, hidden_size, bias=True),
-        )
-        self.frequency_embedding_size = frequency_embedding_size
-        self.outdim = hidden_size
-
-    def forward(self, s, bs):
-        if s.ndim == 1:
-            s = s[:, None]
-        assert s.ndim == 2
-        if s.shape[0] != bs:
-            s = s.repeat(bs // s.shape[0], 1)
-            assert s.shape[0] == bs
-        b, dims = s.shape[0], s.shape[1]
-        s = rearrange(s, "b d -> (b d)")
-        s_freq = self.timestep_embedding(s, self.frequency_embedding_size).to(
-            self.dtype
-        )
-        s_emb = self.mlp(s_freq)
-        s_emb = rearrange(s_emb, "(b d) d2 -> b (d d2)", b=b, d=dims, d2=self.outdim)
-        return s_emb
-
-    @property
-    def dtype(self):
-        return next(self.parameters()).dtype
-
-
-class CaptionEmbedder(nn.Module):
-    """
-    Embeds class labels into vector representations. Also handles label dropout for classifier-free guidance.
-    """
-
-    def __init__(
-        self,
-        in_channels,
-        hidden_size,
-        uncond_prob,
-        act_layer=nn.GELU(approximate="tanh"),
-        token_num=120,
-    ):
-        super().__init__()
-        self.y_proj = Mlp(
-            in_features=in_channels,
-            hidden_features=hidden_size,
-            out_features=hidden_size,
-            act_layer=act_layer,
-            drop=0,
-        )
-        self.register_buffer(
-            "y_embedding",
-            torch.randn(token_num, in_channels) / in_channels**0.5,
-        )
-        self.uncond_prob = uncond_prob
-
-    def token_drop(self, caption, force_drop_ids=None):
-        """
-        Drops labels to enable classifier-free guidance.
-        """
-        if force_drop_ids is None:
-            drop_ids = torch.rand(caption.shape[0]).cuda() < self.uncond_prob
-        else:
-            drop_ids = force_drop_ids == 1
-        caption = torch.where(drop_ids[:, None, None, None], self.y_embedding, caption)
-        return caption
-
-    def forward(self, caption, train, force_drop_ids=None):
-        if train:
-            assert caption.shape[2:] == self.y_embedding.shape
-        use_dropout = self.uncond_prob > 0
-        if (train and use_dropout) or (force_drop_ids is not None):
-            caption = self.token_drop(caption, force_drop_ids)
-        caption = self.y_proj(caption)
-        return caption
-
-
-class PositionEmbedding2D(nn.Module):
-    def __init__(self, dim: int) -> None:
-        super().__init__()
-        self.dim = dim
-        assert dim % 4 == 0, "dim must be divisible by 4"
-        half_dim = dim // 2
-        inv_freq = 1.0 / (10000 ** (torch.arange(0, half_dim, 2).float() / half_dim))
-        self.register_buffer("inv_freq", inv_freq, persistent=False)
-
-    def _get_sin_cos_emb(self, t: torch.Tensor):
-        out = torch.einsum("i,d->id", t, self.inv_freq)
-        emb_cos = torch.cos(out)
-        emb_sin = torch.sin(out)
-        return torch.cat((emb_sin, emb_cos), dim=-1)
-
-    @functools.lru_cache(maxsize=512)
-    def _get_cached_emb(
-        self,
-        device: torch.device,
-        dtype: torch.dtype,
-        h: int,
-        w: int,
-        scale: float = 1.0,
-        base_size: Optional[int] = None,
-    ):
-        grid_h = torch.arange(h, device=device) / scale
-        grid_w = torch.arange(w, device=device) / scale
-        if base_size is not None:
-            grid_h *= base_size / h
-            grid_w *= base_size / w
-        grid_h, grid_w = torch.meshgrid(
-            grid_w,
-            grid_h,
-            indexing="ij",
-        )  # here w goes first
-        grid_h = grid_h.t().reshape(-1)
-        grid_w = grid_w.t().reshape(-1)
-        emb_h = self._get_sin_cos_emb(grid_h)
-        emb_w = self._get_sin_cos_emb(grid_w)
-        return torch.concat([emb_h, emb_w], dim=-1).unsqueeze(0).to(dtype)
-
-    def forward(
-        self,
-        x: torch.Tensor,
-        h: int,
-        w: int,
-        scale: Optional[float] = 1.0,
-        base_size: Optional[int] = None,
-    ) -> torch.Tensor:
-        return self._get_cached_emb(x.device, x.dtype, h, w, scale, base_size)
-
-
-# ===============================================
-# Sine/Cosine Positional Embedding Functions
-# ===============================================
-# https://github.com/facebookresearch/mae/blob/main/util/pos_embed.py
-
-
-def get_2d_sincos_pos_embed(
-    embed_dim, grid_size, cls_token=False, extra_tokens=0, scale=1.0, base_size=None
-):
-    """
-    grid_size: int of the grid height and width
-    return:
-    pos_embed: [grid_size*grid_size, embed_dim] or [1+grid_size*grid_size, embed_dim] (w/ or w/o cls_token)
-    """
-    if not isinstance(grid_size, tuple):
-        grid_size = (grid_size, grid_size)
-
-    grid_h = np.arange(grid_size[0], dtype=np.float32) / scale
-    grid_w = np.arange(grid_size[1], dtype=np.float32) / scale
-    if base_size is not None:
-        grid_h *= base_size / grid_size[0]
-        grid_w *= base_size / grid_size[1]
-    grid = np.meshgrid(grid_w, grid_h)  # here w goes first
-    grid = np.stack(grid, axis=0)
-
-    grid = grid.reshape([2, 1, grid_size[1], grid_size[0]])
-    pos_embed = get_2d_sincos_pos_embed_from_grid(embed_dim, grid)
-    if cls_token and extra_tokens > 0:
-        pos_embed = np.concatenate(
-            [np.zeros([extra_tokens, embed_dim]), pos_embed], axis=0
-        )
-    return pos_embed
-
-
-def get_2d_sincos_pos_embed_from_grid(embed_dim, grid):
-    assert embed_dim % 2 == 0
-
-    # use half of dimensions to encode grid_h
-    emb_h = get_1d_sincos_pos_embed_from_grid(embed_dim // 2, grid[0])  # (H*W, D/2)
-    emb_w = get_1d_sincos_pos_embed_from_grid(embed_dim // 2, grid[1])  # (H*W, D/2)
-
-    emb = np.concatenate([emb_h, emb_w], axis=1)  # (H*W, D)
-    return emb
-
-
-def get_1d_sincos_pos_embed(embed_dim, length, scale=1.0):
-    pos = np.arange(0, length)[..., None] / scale
-    return get_1d_sincos_pos_embed_from_grid(embed_dim, pos)
-
-
-def get_1d_sincos_pos_embed_from_grid(embed_dim, pos):
-    """
-    embed_dim: output dimension for each position
-    pos: a list of positions to be encoded: size (M,)
-    out: (M, D)
-    """
-    assert embed_dim % 2 == 0
-    omega = np.arange(embed_dim // 2, dtype=np.float64)
-    omega /= embed_dim / 2.0
-    omega = 1.0 / 10000**omega  # (D/2,)
-
-    pos = pos.reshape(-1)  # (M,)
-    out = np.einsum("m,d->md", pos, omega)  # (M, D/2), outer product
-
-    emb_sin = np.sin(out)  # (M, D/2)
-    emb_cos = np.cos(out)  # (M, D/2)
-
-    emb = np.concatenate([emb_sin, emb_cos], axis=1)  # (M, D)
-    return emb
diff --git a/videotuna/models/opensora/models/stdit/__init__.py b/videotuna/models/opensora/models/stdit/__init__.py
deleted file mode 100644
index 7d59db5b..00000000
--- a/videotuna/models/opensora/models/stdit/__init__.py
+++ /dev/null
@@ -1,8 +0,0 @@
-from .stdit import STDiT
-from .stdit2 import STDiT2
-from .stdit3 import STDiT3
-from .stdit4 import STDiT4
-from .stdit5 import STDiT5
-from .stdit6 import STDiT6
-from .stdit7 import STDiT7
-from .stdit8 import STDiT8
diff --git a/videotuna/models/opensora/models/stdit/stdit.py b/videotuna/models/opensora/models/stdit/stdit.py
deleted file mode 100644
index 676ec578..00000000
--- a/videotuna/models/opensora/models/stdit/stdit.py
+++ /dev/null
@@ -1,426 +0,0 @@
-import numpy as np
-import torch
-import torch.distributed as dist
-import torch.nn as nn
-from einops import rearrange
-from timm.models.layers import DropPath
-from timm.models.vision_transformer import Mlp
-
-from videotuna.models.opensora.acceleration.checkpoint import auto_grad_checkpoint
-from videotuna.models.opensora.acceleration.communications import (
-    gather_forward_split_backward,
-    split_forward_gather_backward,
-)
-from videotuna.models.opensora.acceleration.parallel_states import get_sequence_parallel_group
-from videotuna.models.opensora.models.layers.blocks import (
-    Attention,
-    CaptionEmbedder,
-    MultiHeadCrossAttention,
-    PatchEmbed3D,
-    SeqParallelAttention,
-    SeqParallelMultiHeadCrossAttention,
-    T2IFinalLayer,
-    TimestepEmbedder,
-    approx_gelu,
-    get_1d_sincos_pos_embed,
-    get_2d_sincos_pos_embed,
-    get_layernorm,
-    t2i_modulate,
-)
-from videotuna.models.opensora.registry import MODELS
-from videotuna.models.opensora.utils.ckpt_utils import load_checkpoint
-
-
-class STDiTBlock(nn.Module):
-    def __init__(
-        self,
-        hidden_size,
-        num_heads,
-        d_s=None,
-        d_t=None,
-        mlp_ratio=4.0,
-        drop_path=0.0,
-        enable_flashattn=False,
-        enable_layernorm_kernel=False,
-        enable_sequence_parallelism=False,
-    ):
-        super().__init__()
-        self.hidden_size = hidden_size
-        self.enable_flashattn = enable_flashattn
-        self._enable_sequence_parallelism = enable_sequence_parallelism
-
-        if enable_sequence_parallelism:
-            self.attn_cls = SeqParallelAttention
-            self.mha_cls = SeqParallelMultiHeadCrossAttention
-        else:
-            self.attn_cls = Attention
-            self.mha_cls = MultiHeadCrossAttention
-
-        self.norm1 = get_layernorm(
-            hidden_size, eps=1e-6, affine=False, use_kernel=enable_layernorm_kernel
-        )
-        self.attn = self.attn_cls(
-            hidden_size,
-            num_heads=num_heads,
-            qkv_bias=True,
-            enable_flash_attn=enable_flashattn,
-            norm_layer=nn.LayerNorm,
-        )
-        self.cross_attn = self.mha_cls(hidden_size, num_heads)
-        self.norm2 = get_layernorm(
-            hidden_size, eps=1e-6, affine=False, use_kernel=enable_layernorm_kernel
-        )
-        self.mlp = Mlp(
-            in_features=hidden_size,
-            hidden_features=int(hidden_size * mlp_ratio),
-            act_layer=approx_gelu,
-            drop=0,
-        )
-        self.drop_path = DropPath(drop_path) if drop_path > 0.0 else nn.Identity()
-        self.scale_shift_table = nn.Parameter(
-            torch.randn(6, hidden_size) / hidden_size**0.5
-        )
-
-        # temporal attention
-        self.d_s = d_s
-        self.d_t = d_t
-
-        if self._enable_sequence_parallelism:
-            sp_size = dist.get_world_size(get_sequence_parallel_group())
-            # make sure d_t is divisible by sp_size
-            assert d_t % sp_size == 0
-            self.d_t = d_t // sp_size
-
-        self.attn_temp = self.attn_cls(
-            hidden_size,
-            num_heads=num_heads,
-            qkv_bias=True,
-            enable_flash_attn=self.enable_flashattn,
-            norm_layer=nn.LayerNorm,
-        )
-
-    def forward(self, x, y, t, mask=None, tpe=None):
-        B, N, C = x.shape
-
-        shift_msa, scale_msa, gate_msa, shift_mlp, scale_mlp, gate_mlp = (
-            self.scale_shift_table[None] + t.reshape(B, 6, -1)
-        ).chunk(6, dim=1)
-        x_m = t2i_modulate(self.norm1(x), shift_msa, scale_msa)
-
-        # spatial branch
-        x_s = rearrange(x_m, "B (T S) C -> (B T) S C", T=self.d_t, S=self.d_s)
-        x_s = self.attn(x_s)
-        x_s = rearrange(x_s, "(B T) S C -> B (T S) C", T=self.d_t, S=self.d_s)
-        x = x + self.drop_path(gate_msa * x_s)
-
-        # temporal branch
-        x_t = rearrange(x, "B (T S) C -> (B S) T C", T=self.d_t, S=self.d_s)
-        if tpe is not None:
-            x_t = x_t + tpe
-        x_t = self.attn_temp(x_t)
-        x_t = rearrange(x_t, "(B S) T C -> B (T S) C", T=self.d_t, S=self.d_s)
-        x = x + self.drop_path(gate_msa * x_t)
-
-        # cross attn
-        x = x + self.cross_attn(x, y, mask)
-
-        # mlp
-        x = x + self.drop_path(
-            gate_mlp * self.mlp(t2i_modulate(self.norm2(x), shift_mlp, scale_mlp))
-        )
-
-        return x
-
-
-@MODELS.register_module()
-class STDiT(nn.Module):
-    def __init__(
-        self,
-        input_size=(1, 32, 32),
-        in_channels=4,
-        patch_size=(1, 2, 2),
-        hidden_size=1152,
-        depth=28,
-        num_heads=16,
-        mlp_ratio=4.0,
-        class_dropout_prob=0.1,
-        pred_sigma=True,
-        drop_path=0.0,
-        no_temporal_pos_emb=False,
-        caption_channels=4096,
-        model_max_length=120,
-        # dtype=torch.float32,
-        dtype=torch.bfloat16,  # TODO fix data type
-        space_scale=1.0,
-        time_scale=1.0,
-        freeze=None,
-        enable_flashattn=False,
-        enable_layernorm_kernel=False,
-        enable_sequence_parallelism=False,
-    ):
-        super().__init__()
-        self.pred_sigma = pred_sigma
-        self.in_channels = in_channels
-        self.out_channels = in_channels * 2 if pred_sigma else in_channels
-        self.hidden_size = hidden_size
-        self.patch_size = patch_size
-        self.input_size = input_size
-        num_patches = np.prod([input_size[i] // patch_size[i] for i in range(3)])
-        self.num_patches = num_patches
-        self.num_temporal = input_size[0] // patch_size[0]
-        self.num_spatial = num_patches // self.num_temporal
-        self.num_heads = num_heads
-        self.dtype = dtype
-        self.no_temporal_pos_emb = no_temporal_pos_emb
-        self.depth = depth
-        self.mlp_ratio = mlp_ratio
-        self.enable_flashattn = enable_flashattn
-        self.enable_layernorm_kernel = enable_layernorm_kernel
-        self.space_scale = space_scale
-        self.time_scale = time_scale
-
-        self.register_buffer("pos_embed", self.get_spatial_pos_embed())
-        self.register_buffer("pos_embed_temporal", self.get_temporal_pos_embed())
-
-        self.x_embedder = PatchEmbed3D(patch_size, in_channels, hidden_size)
-        self.t_embedder = TimestepEmbedder(hidden_size)
-        self.t_block = nn.Sequential(
-            nn.SiLU(), nn.Linear(hidden_size, 6 * hidden_size, bias=True)
-        )
-        self.y_embedder = CaptionEmbedder(
-            in_channels=caption_channels,
-            hidden_size=hidden_size,
-            uncond_prob=class_dropout_prob,
-            act_layer=approx_gelu,
-            token_num=model_max_length,
-        )
-
-        drop_path = [x.item() for x in torch.linspace(0, drop_path, depth)]
-        self.blocks = nn.ModuleList(
-            [
-                STDiTBlock(
-                    self.hidden_size,
-                    self.num_heads,
-                    mlp_ratio=self.mlp_ratio,
-                    drop_path=drop_path[i],
-                    enable_flashattn=self.enable_flashattn,
-                    enable_layernorm_kernel=self.enable_layernorm_kernel,
-                    enable_sequence_parallelism=enable_sequence_parallelism,
-                    d_t=self.num_temporal,
-                    d_s=self.num_spatial,
-                )
-                for i in range(self.depth)
-            ]
-        )
-        self.final_layer = T2IFinalLayer(
-            hidden_size, np.prod(self.patch_size), self.out_channels
-        )
-
-        # init model
-        self.initialize_weights()
-        self.initialize_temporal()
-        if freeze is not None:
-            assert freeze in ["not_temporal", "text"]
-            if freeze == "not_temporal":
-                self.freeze_not_temporal()
-            elif freeze == "text":
-                self.freeze_text()
-
-        # sequence parallel related configs
-        self.enable_sequence_parallelism = enable_sequence_parallelism
-        if enable_sequence_parallelism:
-            self.sp_rank = dist.get_rank(get_sequence_parallel_group())
-        else:
-            self.sp_rank = None
-
-    def forward(self, x, timestep, y, mask=None):
-        """
-        Forward pass of STDiT.
-        Args:
-            x (torch.Tensor): latent representation of video; of shape [B, C, T, H, W]
-            timestep (torch.Tensor): diffusion time steps; of shape [B]
-            y (torch.Tensor): representation of prompts; of shape [B, 1, N_token, C]
-            mask (torch.Tensor): mask for selecting prompt tokens; of shape [B, N_token]
-
-        Returns:
-            x (torch.Tensor): output latent representation; of shape [B, C, T, H, W]
-        """
-        x = x.to(self.dtype)
-        timestep = timestep.to(self.dtype)
-        y = y.to(self.dtype)
-
-        # embedding
-        x = self.x_embedder(x)  # [B, N, C]
-        x = rearrange(
-            x, "B (T S) C -> B T S C", T=self.num_temporal, S=self.num_spatial
-        )
-        x = x + self.pos_embed
-        x = rearrange(x, "B T S C -> B (T S) C")
-
-        # shard over the sequence dim if sp is enabled
-        if self.enable_sequence_parallelism:
-            x = split_forward_gather_backward(
-                x, get_sequence_parallel_group(), dim=1, grad_scale="down"
-            )
-
-        t = self.t_embedder(timestep, dtype=x.dtype)  # [B, C]
-        t0 = self.t_block(t)  # [B, C]
-        y = self.y_embedder(y, self.training)  # [B, 1, N_token, C]
-
-        if mask is not None:
-            if mask.shape[0] != y.shape[0]:
-                mask = mask.repeat(y.shape[0] // mask.shape[0], 1)
-            mask = mask.squeeze(1).squeeze(1)
-            y = (
-                y.squeeze(1)
-                .masked_select(mask.unsqueeze(-1) != 0)
-                .view(1, -1, x.shape[-1])
-            )
-            y_lens = mask.sum(dim=1).tolist()
-        else:
-            y_lens = [y.shape[2]] * y.shape[0]
-            y = y.squeeze(1).view(1, -1, x.shape[-1])
-
-        # blocks
-        for i, block in enumerate(self.blocks):
-            if i == 0:
-                if self.enable_sequence_parallelism:
-                    tpe = torch.chunk(
-                        self.pos_embed_temporal,
-                        dist.get_world_size(get_sequence_parallel_group()),
-                        dim=1,
-                    )[self.sp_rank].contiguous()
-                else:
-                    tpe = self.pos_embed_temporal
-            else:
-                tpe = None
-            x = auto_grad_checkpoint(block, x, y, t0, y_lens, tpe)
-
-        if self.enable_sequence_parallelism:
-            x = gather_forward_split_backward(
-                x, get_sequence_parallel_group(), dim=1, grad_scale="up"
-            )
-        # x.shape: [B, N, C]
-
-        # final process
-        x = self.final_layer(x, t)  # [B, N, C=T_p * H_p * W_p * C_out]
-        x = self.unpatchify(x)  # [B, C_out, T, H, W]
-
-        # cast to float32 for better accuracy
-        x = x.to(torch.float32)
-        return x
-
-    def unpatchify(self, x):
-        """
-        Args:
-            x (torch.Tensor): of shape [B, N, C]
-
-        Return:
-            x (torch.Tensor): of shape [B, C_out, T, H, W]
-        """
-
-        N_t, N_h, N_w = [self.input_size[i] // self.patch_size[i] for i in range(3)]
-        T_p, H_p, W_p = self.patch_size
-        x = rearrange(
-            x,
-            "B (N_t N_h N_w) (T_p H_p W_p C_out) -> B C_out (N_t T_p) (N_h H_p) (N_w W_p)",
-            N_t=N_t,
-            N_h=N_h,
-            N_w=N_w,
-            T_p=T_p,
-            H_p=H_p,
-            W_p=W_p,
-            C_out=self.out_channels,
-        )
-        return x
-
-    def unpatchify_old(self, x):
-        c = self.out_channels
-        t, h, w = [self.input_size[i] // self.patch_size[i] for i in range(3)]
-        pt, ph, pw = self.patch_size
-
-        x = x.reshape(shape=(x.shape[0], t, h, w, pt, ph, pw, c))
-        x = rearrange(x, "n t h w r p q c -> n c t r h p w q")
-        imgs = x.reshape(shape=(x.shape[0], c, t * pt, h * ph, w * pw))
-        return imgs
-
-    def get_spatial_pos_embed(self, grid_size=None):
-        if grid_size is None:
-            grid_size = self.input_size[1:]
-        pos_embed = get_2d_sincos_pos_embed(
-            self.hidden_size,
-            (grid_size[0] // self.patch_size[1], grid_size[1] // self.patch_size[2]),
-            scale=self.space_scale,
-        )
-        pos_embed = (
-            torch.from_numpy(pos_embed).float().unsqueeze(0).requires_grad_(False)
-        )
-        return pos_embed
-
-    def get_temporal_pos_embed(self):
-        pos_embed = get_1d_sincos_pos_embed(
-            self.hidden_size,
-            self.input_size[0] // self.patch_size[0],
-            scale=self.time_scale,
-        )
-        pos_embed = (
-            torch.from_numpy(pos_embed).float().unsqueeze(0).requires_grad_(False)
-        )
-        return pos_embed
-
-    def freeze_not_temporal(self):
-        for n, p in self.named_parameters():
-            if "attn_temp" not in n:
-                p.requires_grad = False
-
-    def freeze_text(self):
-        for n, p in self.named_parameters():
-            if "cross_attn" in n:
-                p.requires_grad = False
-
-    def initialize_temporal(self):
-        for block in self.blocks:
-            nn.init.constant_(block.attn_temp.proj.weight, 0)
-            nn.init.constant_(block.attn_temp.proj.bias, 0)
-
-    def initialize_weights(self):
-        # Initialize transformer layers:
-        def _basic_init(module):
-            if isinstance(module, nn.Linear):
-                torch.nn.init.xavier_uniform_(module.weight)
-                if module.bias is not None:
-                    nn.init.constant_(module.bias, 0)
-
-        self.apply(_basic_init)
-
-        # Initialize patch_embed like nn.Linear (instead of nn.Conv2d):
-        w = self.x_embedder.proj.weight.data
-        nn.init.xavier_uniform_(w.view([w.shape[0], -1]))
-
-        # Initialize timestep embedding MLP:
-        nn.init.normal_(self.t_embedder.mlp[0].weight, std=0.02)
-        nn.init.normal_(self.t_embedder.mlp[2].weight, std=0.02)
-        nn.init.normal_(self.t_block[1].weight, std=0.02)
-
-        # Initialize caption embedding MLP:
-        nn.init.normal_(self.y_embedder.y_proj.fc1.weight, std=0.02)
-        nn.init.normal_(self.y_embedder.y_proj.fc2.weight, std=0.02)
-
-        # Zero-out adaLN modulation layers in PixArt blocks:
-        for block in self.blocks:
-            nn.init.constant_(block.cross_attn.proj.weight, 0)
-            nn.init.constant_(block.cross_attn.proj.bias, 0)
-
-        # Zero-out output layers:
-        nn.init.constant_(self.final_layer.linear.weight, 0)
-        nn.init.constant_(self.final_layer.linear.bias, 0)
-
-
-@MODELS.register_module("STDiT-XL/2")
-def STDiT_XL_2(from_pretrained=None, **kwargs):
-    model = STDiT(
-        depth=28, hidden_size=1152, patch_size=(1, 2, 2), num_heads=16, **kwargs
-    )
-    if from_pretrained is not None and from_pretrained:
-        load_checkpoint(model, from_pretrained)
-    return model
diff --git a/videotuna/models/opensora/models/stdit/stdit2.py b/videotuna/models/opensora/models/stdit/stdit2.py
deleted file mode 100644
index fa584236..00000000
--- a/videotuna/models/opensora/models/stdit/stdit2.py
+++ /dev/null
@@ -1,487 +0,0 @@
-import numpy as np
-import torch
-import torch.distributed as dist
-import torch.nn as nn
-from einops import rearrange
-from timm.models.layers import DropPath
-from timm.models.vision_transformer import Mlp
-
-from videotuna.models.opensora.acceleration.checkpoint import auto_grad_checkpoint
-from videotuna.models.opensora.acceleration.communications import (
-    gather_forward_split_backward,
-    split_forward_gather_backward,
-)
-from videotuna.models.opensora.acceleration.parallel_states import get_sequence_parallel_group
-from videotuna.models.opensora.models.layers.blocks import (
-    Attention,
-    CaptionEmbedder,
-    MultiHeadCrossAttention,
-    PatchEmbed3D,
-    SeqParallelAttention,
-    SeqParallelMultiHeadCrossAttention,
-    T2IFinalLayer,
-    TimestepEmbedder,
-    approx_gelu,
-    get_1d_sincos_pos_embed,
-    get_2d_sincos_pos_embed,
-    get_layernorm,
-    t2i_modulate,
-)
-from videotuna.models.opensora.registry import MODELS
-from videotuna.models.opensora.utils.ckpt_utils import load_checkpoint
-
-
-class STDiTBlock2(nn.Module):
-    def __init__(
-        self,
-        hidden_size,
-        num_heads,
-        d_s=None,
-        d_t=None,
-        mlp_ratio=4.0,
-        drop_path=0.0,
-        enable_flashattn=False,
-        enable_layernorm_kernel=False,
-        enable_sequence_parallelism=False,
-    ):
-        super().__init__()
-        self.hidden_size = hidden_size
-        self.enable_flashattn = enable_flashattn
-        self._enable_sequence_parallelism = enable_sequence_parallelism
-
-        if enable_sequence_parallelism:
-            self.attn_cls = SeqParallelAttention
-            self.mha_cls = SeqParallelMultiHeadCrossAttention
-        else:
-            self.attn_cls = Attention
-            self.mha_cls = MultiHeadCrossAttention
-
-        self.norm1 = get_layernorm(
-            hidden_size, eps=1e-6, affine=False, use_kernel=enable_layernorm_kernel
-        )
-        self.attn = self.attn_cls(
-            hidden_size,
-            num_heads=num_heads,
-            qkv_bias=True,
-            enable_flashattn=enable_flashattn,
-        )
-        self.cross_attn = self.mha_cls(hidden_size, num_heads)
-        self.norm2 = get_layernorm(
-            hidden_size, eps=1e-6, affine=False, use_kernel=enable_layernorm_kernel
-        )
-        self.mlp = Mlp(
-            in_features=hidden_size,
-            hidden_features=int(hidden_size * mlp_ratio),
-            act_layer=approx_gelu,
-            drop=0,
-        )
-        self.drop_path = DropPath(drop_path) if drop_path > 0.0 else nn.Identity()
-        self.scale_shift_table = nn.Parameter(
-            torch.randn(6, hidden_size) / hidden_size**0.5
-        )
-
-        # temporal attention
-        self.d_s = d_s
-        self.d_t = d_t
-
-        if self._enable_sequence_parallelism:
-            sp_size = dist.get_world_size(get_sequence_parallel_group())
-            # make sure d_t is divisible by sp_size
-            assert d_t % sp_size == 0
-            self.d_t = d_t // sp_size
-
-        self.attn_temp = self.attn_cls(
-            hidden_size,
-            num_heads=num_heads,
-            qkv_bias=True,
-            enable_flashattn=self.enable_flashattn,
-        )
-
-    def t_mask_select(self, x_mask, x, masked_x, T, S):
-        # x: [B, (T, S), C]
-        # mased_x: [B, (T, S), C]
-        # x_mask: [B, T]
-        x = rearrange(x, "B (T S) C -> B T S C", T=T, S=S)
-        masked_x = rearrange(masked_x, "B (T S) C -> B T S C", T=T, S=S)
-        x = torch.where(x_mask[:, :, None, None], x, masked_x)
-        x = rearrange(x, "B T S C -> B (T S) C")
-        return x
-
-    def forward(
-        self, x, y, t, mask=None, tpe=None, x_mask=None, t0=None, T=None, S=None
-    ):
-        B, N, C = x.shape
-
-        shift_msa, scale_msa, gate_msa, shift_mlp, scale_mlp, gate_mlp = (
-            self.scale_shift_table[None] + t.reshape(B, 6, -1)
-        ).chunk(6, dim=1)
-        if x_mask is not None:
-            (
-                shift_msa_zero,
-                scale_msa_zero,
-                gate_msa_zero,
-                shift_mlp_zero,
-                scale_mlp_zero,
-                gate_mlp_zero,
-            ) = (self.scale_shift_table[None] + t0.reshape(B, 6, -1)).chunk(6, dim=1)
-
-        # modulate
-        x_m = t2i_modulate(self.norm1(x), shift_msa, scale_msa)
-        if x_mask is not None:
-            x_m_zero = t2i_modulate(self.norm1(x), shift_msa_zero, scale_msa_zero)
-            x_m = self.t_mask_select(x_mask, x_m, x_m_zero, T, S)
-
-        # spatial branch
-        x_s = rearrange(x_m, "B (T S) C -> (B T) S C", T=self.d_t, S=self.d_s)
-        x_s = self.attn(x_s)
-        x_s = rearrange(x_s, "(B T) S C -> B (T S) C", T=self.d_t, S=self.d_s)
-        if x_mask is not None:
-            x_s_zero = gate_msa_zero * x_s
-            x_s = gate_msa * x_s
-            x_s = self.t_mask_select(x_mask, x_s, x_s_zero, T, S)
-        else:
-            x_s = gate_msa * x_s
-        x = x + self.drop_path(x_s)
-
-        # temporal branch
-        x_t = rearrange(x, "B (T S) C -> (B S) T C", T=self.d_t, S=self.d_s)
-        if tpe is not None:
-            x_t = x_t + tpe
-        x_t = self.attn_temp(x_t)
-        x_t = rearrange(x_t, "(B S) T C -> B (T S) C", T=self.d_t, S=self.d_s)
-        if x_mask is not None:
-            x_t_zero = gate_msa_zero * x_t
-            x_t = gate_msa * x_t
-            x_t = self.t_mask_select(x_mask, x_t, x_t_zero, T, S)
-        else:
-            x_t = gate_msa * x_t
-        x = x + self.drop_path(x_t)
-
-        # cross attn
-        x = x + self.cross_attn(x, y, mask)
-
-        # mlp
-        x_m = t2i_modulate(self.norm2(x), shift_mlp, scale_mlp)
-        if x_mask is not None:
-            x_m_zero = t2i_modulate(self.norm2(x), shift_mlp_zero, scale_mlp_zero)
-            x_m = self.t_mask_select(x_mask, x_m, x_m_zero, T, S)
-
-        x_mlp = self.mlp(x_m)
-        if x_mask is not None:
-            x_mlp_zero = gate_mlp_zero * x_mlp
-            x_mlp = gate_mlp * x_mlp
-            x_mlp = self.t_mask_select(x_mask, x_mlp, x_mlp_zero, T, S)
-        else:
-            x_mlp = gate_mlp * x_mlp
-
-        x = x + self.drop_path(x_mlp)
-
-        return x
-
-
-@MODELS.register_module()
-class STDiT2(nn.Module):
-    def __init__(
-        self,
-        input_size=(1, 32, 32),
-        in_channels=4,
-        patch_size=(1, 2, 2),
-        hidden_size=1152,
-        depth=28,
-        num_heads=16,
-        mlp_ratio=4.0,
-        class_dropout_prob=0.1,
-        pred_sigma=True,
-        drop_path=0.0,
-        no_temporal_pos_emb=False,
-        caption_channels=4096,
-        model_max_length=120,
-        dtype=torch.float32,
-        space_scale=1.0,
-        time_scale=1.0,
-        freeze=None,
-        enable_flashattn=False,
-        enable_layernorm_kernel=False,
-        enable_sequence_parallelism=False,
-    ):
-        super().__init__()
-        self.pred_sigma = pred_sigma
-        self.in_channels = in_channels
-        self.out_channels = in_channels * 2 if pred_sigma else in_channels
-        self.hidden_size = hidden_size
-        self.patch_size = patch_size
-        self.input_size = input_size
-        num_patches = np.prod([input_size[i] // patch_size[i] for i in range(3)])
-        self.num_patches = num_patches
-        self.num_temporal = input_size[0] // patch_size[0]
-        self.num_spatial = num_patches // self.num_temporal
-        self.num_heads = num_heads
-        self.dtype = dtype
-        self.no_temporal_pos_emb = no_temporal_pos_emb
-        self.depth = depth
-        self.mlp_ratio = mlp_ratio
-        self.enable_flashattn = enable_flashattn
-        self.enable_layernorm_kernel = enable_layernorm_kernel
-        self.space_scale = space_scale
-        self.time_scale = time_scale
-
-        self.register_buffer("pos_embed", self.get_spatial_pos_embed())
-        self.register_buffer("pos_embed_temporal", self.get_temporal_pos_embed())
-
-        self.x_embedder = PatchEmbed3D(patch_size, in_channels, hidden_size)
-        self.t_embedder = TimestepEmbedder(hidden_size)
-        self.t_block = nn.Sequential(
-            nn.SiLU(), nn.Linear(hidden_size, 6 * hidden_size, bias=True)
-        )
-        self.y_embedder = CaptionEmbedder(
-            in_channels=caption_channels,
-            hidden_size=hidden_size,
-            uncond_prob=class_dropout_prob,
-            act_layer=approx_gelu,
-            token_num=model_max_length,
-        )
-
-        drop_path = [x.item() for x in torch.linspace(0, drop_path, depth)]
-        self.blocks = nn.ModuleList(
-            [
-                STDiTBlock2(
-                    self.hidden_size,
-                    self.num_heads,
-                    mlp_ratio=self.mlp_ratio,
-                    drop_path=drop_path[i],
-                    enable_flashattn=self.enable_flashattn,
-                    enable_layernorm_kernel=self.enable_layernorm_kernel,
-                    enable_sequence_parallelism=enable_sequence_parallelism,
-                    d_t=self.num_temporal,
-                    d_s=self.num_spatial,
-                )
-                for i in range(self.depth)
-            ]
-        )
-        self.final_layer = T2IFinalLayer(
-            hidden_size, np.prod(self.patch_size), self.out_channels
-        )
-
-        # init model
-        self.initialize_weights()
-        self.initialize_temporal()
-        if freeze is not None:
-            assert freeze in ["not_temporal", "text"]
-            if freeze == "not_temporal":
-                self.freeze_not_temporal()
-            elif freeze == "text":
-                self.freeze_text()
-
-        # sequence parallel related configs
-        self.enable_sequence_parallelism = enable_sequence_parallelism
-        if enable_sequence_parallelism:
-            self.sp_rank = dist.get_rank(get_sequence_parallel_group())
-        else:
-            self.sp_rank = None
-
-    def forward(self, x, timestep, y, mask=None, x_mask=None):
-        """
-        Forward pass of STDiT.
-        Args:
-            x (torch.Tensor): latent representation of video; of shape [B, C, T, H, W]
-            timestep (torch.Tensor): diffusion time steps; of shape [B]
-            y (torch.Tensor): representation of prompts; of shape [B, 1, N_token, C]
-            mask (torch.Tensor): mask for selecting prompt tokens; of shape [B, N_token]
-
-        Returns:
-            x (torch.Tensor): output latent representation; of shape [B, C, T, H, W]
-        """
-
-        T = self.num_temporal
-        S = self.num_spatial
-
-        x = x.to(self.dtype)
-        timestep = timestep.to(self.dtype)
-        y = y.to(self.dtype)
-
-        # embedding
-        x = self.x_embedder(x)  # [B, N, C]
-        x = rearrange(
-            x, "B (T S) C -> B T S C", T=self.num_temporal, S=self.num_spatial
-        )
-        x = x + self.pos_embed
-        x = rearrange(x, "B T S C -> B (T S) C")
-
-        # shard over the sequence dim if sp is enabled
-        if self.enable_sequence_parallelism:
-            x = split_forward_gather_backward(
-                x, get_sequence_parallel_group(), dim=1, grad_scale="down"
-            )
-
-        t = self.t_embedder(timestep, dtype=x.dtype)  # [B, C]
-        t_emb = self.t_block(t)  # [B, C]
-        if x_mask is not None:
-            t0_timestep = torch.zeros_like(timestep)
-            t0 = self.t_embedder(t0_timestep, dtype=x.dtype)
-            t0_emb = self.t_block(t0)
-        else:
-            t0_emb = None
-
-        y = self.y_embedder(y, self.training)  # [B, 1, N_token, C]
-
-        if mask is not None:
-            if mask.shape[0] != y.shape[0]:
-                mask = mask.repeat(y.shape[0] // mask.shape[0], 1)
-            mask = mask.squeeze(1).squeeze(1)
-            y = (
-                y.squeeze(1)
-                .masked_select(mask.unsqueeze(-1) != 0)
-                .view(1, -1, x.shape[-1])
-            )
-            y_lens = mask.sum(dim=1).tolist()
-        else:
-            y_lens = [y.shape[2]] * y.shape[0]
-            y = y.squeeze(1).view(1, -1, x.shape[-1])
-
-        # blocks
-        for i, block in enumerate(self.blocks):
-            if i == 0:
-                if self.enable_sequence_parallelism:
-                    tpe = torch.chunk(
-                        self.pos_embed_temporal,
-                        dist.get_world_size(get_sequence_parallel_group()),
-                        dim=1,
-                    )[self.sp_rank].contiguous()
-                else:
-                    tpe = self.pos_embed_temporal
-            else:
-                tpe = None
-            x = auto_grad_checkpoint(
-                block, x, y, t_emb, y_lens, tpe, x_mask, t0_emb, T, S
-            )
-
-        if self.enable_sequence_parallelism:
-            x = gather_forward_split_backward(
-                x, get_sequence_parallel_group(), dim=1, grad_scale="up"
-            )
-        # x.shape: [B, N, C]
-
-        # final process
-        x = self.final_layer(
-            x, t, x_mask, t0, T, S
-        )  # [B, N, C=T_p * H_p * W_p * C_out]
-        x = self.unpatchify(x)  # [B, C_out, T, H, W]
-
-        # cast to float32 for better accuracy
-        x = x.to(torch.float32)
-        return x
-
-    def unpatchify(self, x):
-        """
-        Args:
-            x (torch.Tensor): of shape [B, N, C]
-
-        Return:
-            x (torch.Tensor): of shape [B, C_out, T, H, W]
-        """
-
-        N_t, N_h, N_w = [self.input_size[i] // self.patch_size[i] for i in range(3)]
-        T_p, H_p, W_p = self.patch_size
-        x = rearrange(
-            x,
-            "B (N_t N_h N_w) (T_p H_p W_p C_out) -> B C_out (N_t T_p) (N_h H_p) (N_w W_p)",
-            N_t=N_t,
-            N_h=N_h,
-            N_w=N_w,
-            T_p=T_p,
-            H_p=H_p,
-            W_p=W_p,
-            C_out=self.out_channels,
-        )
-        return x
-
-    def unpatchify_old(self, x):
-        c = self.out_channels
-        t, h, w = [self.input_size[i] // self.patch_size[i] for i in range(3)]
-        pt, ph, pw = self.patch_size
-
-        x = x.reshape(shape=(x.shape[0], t, h, w, pt, ph, pw, c))
-        x = rearrange(x, "n t h w r p q c -> n c t r h p w q")
-        imgs = x.reshape(shape=(x.shape[0], c, t * pt, h * ph, w * pw))
-        return imgs
-
-    def get_spatial_pos_embed(self, grid_size=None):
-        if grid_size is None:
-            grid_size = self.input_size[1:]
-        pos_embed = get_2d_sincos_pos_embed(
-            self.hidden_size,
-            (grid_size[0] // self.patch_size[1], grid_size[1] // self.patch_size[2]),
-            scale=self.space_scale,
-        )
-        pos_embed = (
-            torch.from_numpy(pos_embed).float().unsqueeze(0).requires_grad_(False)
-        )
-        return pos_embed
-
-    def get_temporal_pos_embed(self):
-        pos_embed = get_1d_sincos_pos_embed(
-            self.hidden_size,
-            self.input_size[0] // self.patch_size[0],
-            scale=self.time_scale,
-        )
-        pos_embed = (
-            torch.from_numpy(pos_embed).float().unsqueeze(0).requires_grad_(False)
-        )
-        return pos_embed
-
-    def freeze_not_temporal(self):
-        for n, p in self.named_parameters():
-            if "attn_temp" not in n:
-                p.requires_grad = False
-
-    def freeze_text(self):
-        for n, p in self.named_parameters():
-            if "cross_attn" in n:
-                p.requires_grad = False
-
-    def initialize_temporal(self):
-        for block in self.blocks:
-            nn.init.constant_(block.attn_temp.proj.weight, 0)
-            nn.init.constant_(block.attn_temp.proj.bias, 0)
-
-    def initialize_weights(self):
-        # Initialize transformer layers:
-        def _basic_init(module):
-            if isinstance(module, nn.Linear):
-                torch.nn.init.xavier_uniform_(module.weight)
-                if module.bias is not None:
-                    nn.init.constant_(module.bias, 0)
-
-        self.apply(_basic_init)
-
-        # Initialize patch_embed like nn.Linear (instead of nn.Conv2d):
-        w = self.x_embedder.proj.weight.data
-        nn.init.xavier_uniform_(w.view([w.shape[0], -1]))
-
-        # Initialize timestep embedding MLP:
-        nn.init.normal_(self.t_embedder.mlp[0].weight, std=0.02)
-        nn.init.normal_(self.t_embedder.mlp[2].weight, std=0.02)
-        nn.init.normal_(self.t_block[1].weight, std=0.02)
-
-        # Initialize caption embedding MLP:
-        nn.init.normal_(self.y_embedder.y_proj.fc1.weight, std=0.02)
-        nn.init.normal_(self.y_embedder.y_proj.fc2.weight, std=0.02)
-
-        # Zero-out adaLN modulation layers in PixArt blocks:
-        for block in self.blocks:
-            nn.init.constant_(block.cross_attn.proj.weight, 0)
-            nn.init.constant_(block.cross_attn.proj.bias, 0)
-
-        # Zero-out output layers:
-        nn.init.constant_(self.final_layer.linear.weight, 0)
-        nn.init.constant_(self.final_layer.linear.bias, 0)
-
-
-@MODELS.register_module("STDiT2-XL/2")
-def STDiT2_XL_2(from_pretrained=None, **kwargs):
-    model = STDiT2(
-        depth=28, hidden_size=1152, patch_size=(1, 2, 2), num_heads=16, **kwargs
-    )
-    if from_pretrained is not None:
-        load_checkpoint(model, from_pretrained)
-    return model
diff --git a/videotuna/models/opensora/models/stdit/stdit3.py b/videotuna/models/opensora/models/stdit/stdit3.py
deleted file mode 100644
index a7ffafe7..00000000
--- a/videotuna/models/opensora/models/stdit/stdit3.py
+++ /dev/null
@@ -1,442 +0,0 @@
-import numpy as np
-import torch
-import torch.distributed as dist
-import torch.nn as nn
-from einops import rearrange
-from timm.models.layers import DropPath
-from timm.models.vision_transformer import Mlp
-
-from videotuna.models.opensora.acceleration.checkpoint import auto_grad_checkpoint
-from videotuna.models.opensora.acceleration.communications import (
-    gather_forward_split_backward,
-    split_forward_gather_backward,
-)
-from videotuna.models.opensora.acceleration.parallel_states import get_sequence_parallel_group
-from videotuna.models.opensora.models.layers.blocks import (
-    Attention,
-    CaptionEmbedder,
-    MultiHeadCrossAttention,
-    PatchEmbed3D,
-    SeqParallelAttention,
-    SeqParallelMultiHeadCrossAttention,
-    T2IFinalLayer,
-    TimestepEmbedder,
-    approx_gelu,
-    get_1d_sincos_pos_embed,
-    get_2d_sincos_pos_embed,
-    get_layernorm,
-    t2i_modulate,
-)
-from videotuna.models.opensora.registry import MODELS
-from videotuna.models.opensora.utils.ckpt_utils import load_checkpoint
-
-
-class STDiTBlock(nn.Module):
-    def __init__(
-        self,
-        hidden_size,
-        num_heads,
-        mlp_ratio=4.0,
-        drop_path=0.0,
-        enable_flashattn=False,
-        enable_layernorm_kernel=False,
-        enable_sequence_parallelism=False,
-    ):
-        super().__init__()
-        self.hidden_size = hidden_size
-        self.enable_flashattn = enable_flashattn
-        self._enable_sequence_parallelism = enable_sequence_parallelism
-
-        if enable_sequence_parallelism:
-            self.attn_cls = SeqParallelAttention
-            self.mha_cls = SeqParallelMultiHeadCrossAttention
-        else:
-            self.attn_cls = Attention
-            self.mha_cls = MultiHeadCrossAttention
-
-        self.norm1 = get_layernorm(
-            hidden_size, eps=1e-6, affine=False, use_kernel=enable_layernorm_kernel
-        )
-        self.attn = self.attn_cls(
-            hidden_size,
-            num_heads=num_heads,
-            qkv_bias=True,
-            enable_flashattn=enable_flashattn,
-        )
-        self.cross_attn = self.mha_cls(hidden_size, num_heads)
-        self.norm2 = get_layernorm(
-            hidden_size, eps=1e-6, affine=False, use_kernel=enable_layernorm_kernel
-        )
-        self.mlp = Mlp(
-            in_features=hidden_size,
-            hidden_features=int(hidden_size * mlp_ratio),
-            act_layer=approx_gelu,
-            drop=0,
-        )
-        self.drop_path = DropPath(drop_path) if drop_path > 0.0 else nn.Identity()
-        self.scale_shift_table = nn.Parameter(
-            torch.randn(6, hidden_size) / hidden_size**0.5
-        )
-
-        # temporal attention
-        self.attn_temp = self.attn_cls(
-            hidden_size,
-            num_heads=num_heads,
-            qkv_bias=True,
-            enable_flashattn=self.enable_flashattn,
-        )
-
-    def forward(self, x, y, t, T, S, mask=None, tpe=None):
-        B, N, C = x.shape
-
-        shift_msa, scale_msa, gate_msa, shift_mlp, scale_mlp, gate_mlp = (
-            self.scale_shift_table[None] + t.reshape(B, 6, -1)
-        ).chunk(6, dim=1)
-        x_m = t2i_modulate(self.norm1(x), shift_msa, scale_msa)
-
-        # spatial branch
-        x_s = rearrange(x_m, "B (T S) C -> (B T) S C", T=T, S=S)
-        x_s = self.attn(x_s)
-        x_s = rearrange(x_s, "(B T) S C -> B (T S) C", T=T, S=S)
-        x = x + self.drop_path(gate_msa * x_s)
-
-        # temporal branch
-        x_t = rearrange(x, "B (T S) C -> (B S) T C", T=T, S=S)
-        if tpe is not None:
-            x_t = x_t + tpe
-        x_t = self.attn_temp(x_t)
-        x_t = rearrange(x_t, "(B S) T C -> B (T S) C", T=T, S=S)
-        x = x + self.drop_path(gate_msa * x_t)
-
-        # cross attn
-        x = x + self.cross_attn(x, y, mask)
-
-        # mlp
-        x = x + self.drop_path(
-            gate_mlp * self.mlp(t2i_modulate(self.norm2(x), shift_mlp, scale_mlp))
-        )
-
-        return x
-
-
-@MODELS.register_module()
-class STDiT3(nn.Module):
-    def __init__(
-        self,
-        input_size=(1, 32, 32),
-        in_channels=4,
-        patch_size=(1, 2, 2),
-        hidden_size=1152,
-        depth=28,
-        num_heads=16,
-        mlp_ratio=4.0,
-        class_dropout_prob=0.1,
-        pred_sigma=True,
-        drop_path=0.0,
-        no_temporal_pos_emb=False,
-        caption_channels=4096,
-        model_max_length=120,
-        dtype=torch.float32,
-        space_scale=1.0,
-        time_scale=1.0,
-        freeze=None,
-        enable_flashattn=False,
-        enable_layernorm_kernel=False,
-        enable_sequence_parallelism=False,
-    ):
-        super().__init__()
-        self.pred_sigma = pred_sigma
-        self.in_channels = in_channels
-        self.out_channels = in_channels * 2 if pred_sigma else in_channels
-        self.hidden_size = hidden_size
-        self.patch_size = patch_size
-
-        # size for video input
-        self.input_size = input_size
-        num_patches = np.prod([input_size[i] // patch_size[i] for i in range(3)])
-        self.num_patches = num_patches
-        self.num_temporal = input_size[0] // patch_size[0]
-        self.num_spatial = num_patches // self.num_temporal
-
-        # size for image input
-        img_input_size = [1, input_size[1], input_size[2]]
-        self.img_input_size = img_input_size
-        self.img_num_patches = np.prod(
-            [self.img_input_size[i] // patch_size[i] for i in range(3)]
-        )
-        self.img_num_temporal = self.img_input_size[0] // patch_size[0]
-        self.img_num_spatial = self.img_num_patches // self.img_num_temporal
-
-        self.num_heads = num_heads
-        self.dtype = dtype
-        self.no_temporal_pos_emb = no_temporal_pos_emb
-        self.depth = depth
-        self.mlp_ratio = mlp_ratio
-        self.enable_flashattn = enable_flashattn
-        self.enable_layernorm_kernel = enable_layernorm_kernel
-        self.space_scale = space_scale
-        self.time_scale = time_scale
-
-        self.register_buffer("pos_embed", self.get_spatial_pos_embed())
-        self.register_buffer("pos_embed_temporal", self.get_temporal_pos_embed())
-
-        self.x_embedder = PatchEmbed3D(patch_size, in_channels, hidden_size)
-        self.t_embedder = TimestepEmbedder(hidden_size)
-        self.t_block = nn.Sequential(
-            nn.SiLU(), nn.Linear(hidden_size, 6 * hidden_size, bias=True)
-        )
-        self.y_embedder = CaptionEmbedder(
-            in_channels=caption_channels,
-            hidden_size=hidden_size,
-            uncond_prob=class_dropout_prob,
-            act_layer=approx_gelu,
-            token_num=model_max_length,
-        )
-
-        drop_path = [x.item() for x in torch.linspace(0, drop_path, depth)]
-        self.blocks = nn.ModuleList(
-            [
-                STDiTBlock(
-                    self.hidden_size,
-                    self.num_heads,
-                    mlp_ratio=self.mlp_ratio,
-                    drop_path=drop_path[i],
-                    enable_flashattn=self.enable_flashattn,
-                    enable_layernorm_kernel=self.enable_layernorm_kernel,
-                    enable_sequence_parallelism=enable_sequence_parallelism,
-                )
-                for i in range(self.depth)
-            ]
-        )
-        self.final_layer = T2IFinalLayer(
-            hidden_size, np.prod(self.patch_size), self.out_channels
-        )
-
-        # init model
-        self.initialize_weights()
-        self.initialize_temporal()
-        if freeze is not None:
-            assert freeze in ["not_temporal", "text"]
-            if freeze == "not_temporal":
-                self.freeze_not_temporal()
-            elif freeze == "text":
-                self.freeze_text()
-
-        # sequence parallel related configs
-        self.enable_sequence_parallelism = enable_sequence_parallelism
-        if enable_sequence_parallelism:
-            self.sp_rank = dist.get_rank(get_sequence_parallel_group())
-        else:
-            self.sp_rank = None
-
-    def forward(self, x, timestep, y, mask=None):
-        """
-        Forward pass of STDiT.
-        Args:
-            x (torch.Tensor): latent representation of video; of shape [B, C, T, H, W]
-            timestep (torch.Tensor): diffusion time steps; of shape [B]
-            y (torch.Tensor): representation of prompts; of shape [B, 1, N_token, C]
-            mask (torch.Tensor): mask for selecting prompt tokens; of shape [B, N_token]
-
-        Returns:
-            x (torch.Tensor): output latent representation; of shape [B, C, T, H, W]
-        """
-        if x.shape[2] == 1:
-            input_type = "image"
-        elif x.shape[2] > 1:
-            input_type = "video"
-        else:
-            raise ValueError("Input shape not recognized.")
-
-        if input_type == "image":
-            T = self.img_num_temporal
-            S = self.img_num_spatial
-        elif input_type == "video":
-            T = self.num_temporal
-            S = self.num_spatial
-
-        x = x.to(self.dtype)
-        timestep = timestep.to(self.dtype)
-        y = y.to(self.dtype)
-
-        # embedding
-        x = self.x_embedder(x)  # [B, N, C]
-        x = rearrange(x, "B (T S) C -> B T S C", T=T, S=S)
-        x = x + self.pos_embed
-        x = rearrange(x, "B T S C -> B (T S) C")
-
-        # shard over the sequence dim if sp is enabled
-        if self.enable_sequence_parallelism:
-            x = split_forward_gather_backward(
-                x, get_sequence_parallel_group(), dim=1, grad_scale="down"
-            )
-
-        t = self.t_embedder(timestep, dtype=x.dtype)  # [B, C]
-        t0 = self.t_block(t)  # [B, C]
-        y = self.y_embedder(y, self.training)  # [B, 1, N_token, C]
-        import pdb; pdb.set_trace()
-        
-        if mask is not None:
-            if mask.shape[0] != y.shape[0]:
-                mask = mask.repeat(y.shape[0] // mask.shape[0], 1)
-            mask = mask.squeeze(1).squeeze(1)
-            y = (
-                y.squeeze(1)
-                .masked_select(mask.unsqueeze(-1) != 0)
-                .view(1, -1, x.shape[-1])
-            )
-            y_lens = mask.sum(dim=1).tolist()
-        else:
-            y_lens = [y.shape[2]] * y.shape[0]
-            y = y.squeeze(1).view(1, -1, x.shape[-1])
-
-        # blocks
-        for i, block in enumerate(self.blocks):
-            if i == 0:
-                if self.enable_sequence_parallelism:
-                    tpe = torch.chunk(
-                        self.pos_embed_temporal,
-                        dist.get_world_size(get_sequence_parallel_group()),
-                        dim=1,
-                    )[self.sp_rank].contiguous()
-                else:
-                    if input_type == "image":
-                        tpe = self.pos_embed_temporal[:, :1, :]
-                    else:
-                        tpe = self.pos_embed_temporal
-            else:
-                tpe = None
-            x = auto_grad_checkpoint(block, x, y, t0, T, S, y_lens, tpe)
-
-        if self.enable_sequence_parallelism:
-            x = gather_forward_split_backward(
-                x, get_sequence_parallel_group(), dim=1, grad_scale="up"
-            )
-        # x.shape: [B, N, C]
-
-        # final process
-        x = self.final_layer(x, t)  # [B, N, C=T_p * H_p * W_p * C_out]
-        x = self.unpatchify(x, input_type=input_type)  # [B, C_out, T, H, W]
-
-        # cast to float32 for better accuracy
-        x = x.to(torch.float32)
-        return x
-
-    def unpatchify(self, x, input_type=None):
-        """
-        Args:
-            x (torch.Tensor): of shape [B, N, C]
-
-        Return:
-            x (torch.Tensor): of shape [B, C_out, T, H, W]
-        """
-        if input_type == "image":
-            N_t, N_h, N_w = [
-                self.img_input_size[i] // self.patch_size[i] for i in range(3)
-            ]
-        else:
-            N_t, N_h, N_w = [self.input_size[i] // self.patch_size[i] for i in range(3)]
-        T_p, H_p, W_p = self.patch_size
-        x = rearrange(
-            x,
-            "B (N_t N_h N_w) (T_p H_p W_p C_out) -> B C_out (N_t T_p) (N_h H_p) (N_w W_p)",
-            N_t=N_t,
-            N_h=N_h,
-            N_w=N_w,
-            T_p=T_p,
-            H_p=H_p,
-            W_p=W_p,
-            C_out=self.out_channels,
-        )
-        return x
-
-    def unpatchify_old(self, x):
-        c = self.out_channels
-        t, h, w = [self.input_size[i] // self.patch_size[i] for i in range(3)]
-        pt, ph, pw = self.patch_size
-
-        x = x.reshape(shape=(x.shape[0], t, h, w, pt, ph, pw, c))
-        x = rearrange(x, "n t h w r p q c -> n c t r h p w q")
-        imgs = x.reshape(shape=(x.shape[0], c, t * pt, h * ph, w * pw))
-        return imgs
-
-    def get_spatial_pos_embed(self, grid_size=None):
-        if grid_size is None:
-            grid_size = self.input_size[1:]
-        pos_embed = get_2d_sincos_pos_embed(
-            self.hidden_size,
-            (grid_size[0] // self.patch_size[1], grid_size[1] // self.patch_size[2]),
-            scale=self.space_scale,
-        )
-        pos_embed = (
-            torch.from_numpy(pos_embed).float().unsqueeze(0).requires_grad_(False)
-        )
-        return pos_embed
-
-    def get_temporal_pos_embed(self):
-        pos_embed = get_1d_sincos_pos_embed(
-            self.hidden_size,
-            self.input_size[0] // self.patch_size[0],
-            scale=self.time_scale,
-        )
-        pos_embed = (
-            torch.from_numpy(pos_embed).float().unsqueeze(0).requires_grad_(False)
-        )
-        return pos_embed
-
-    def freeze_not_temporal(self):
-        for n, p in self.named_parameters():
-            if "attn_temp" not in n:
-                p.requires_grad = False
-
-    def freeze_text(self):
-        for n, p in self.named_parameters():
-            if "cross_attn" in n:
-                p.requires_grad = False
-
-    def initialize_temporal(self):
-        for block in self.blocks:
-            nn.init.constant_(block.attn_temp.proj.weight, 0)
-            nn.init.constant_(block.attn_temp.proj.bias, 0)
-
-    def initialize_weights(self):
-        # Initialize transformer layers:
-        def _basic_init(module):
-            if isinstance(module, nn.Linear):
-                torch.nn.init.xavier_uniform_(module.weight)
-                if module.bias is not None:
-                    nn.init.constant_(module.bias, 0)
-
-        self.apply(_basic_init)
-
-        # Initialize patch_embed like nn.Linear (instead of nn.Conv2d):
-        w = self.x_embedder.proj.weight.data
-        nn.init.xavier_uniform_(w.view([w.shape[0], -1]))
-
-        # Initialize timestep embedding MLP:
-        nn.init.normal_(self.t_embedder.mlp[0].weight, std=0.02)
-        nn.init.normal_(self.t_embedder.mlp[2].weight, std=0.02)
-        nn.init.normal_(self.t_block[1].weight, std=0.02)
-
-        # Initialize caption embedding MLP:
-        nn.init.normal_(self.y_embedder.y_proj.fc1.weight, std=0.02)
-        nn.init.normal_(self.y_embedder.y_proj.fc2.weight, std=0.02)
-
-        # Zero-out adaLN modulation layers in PixArt blocks:
-        for block in self.blocks:
-            nn.init.constant_(block.cross_attn.proj.weight, 0)
-            nn.init.constant_(block.cross_attn.proj.bias, 0)
-
-        # Zero-out output layers:
-        nn.init.constant_(self.final_layer.linear.weight, 0)
-        nn.init.constant_(self.final_layer.linear.bias, 0)
-
-
-@MODELS.register_module("STDiT3-XL/2")
-def STDiT3_XL_2(from_pretrained=None, **kwargs):
-    model = STDiT3(
-        depth=28, hidden_size=1152, patch_size=(1, 2, 2), num_heads=16, **kwargs
-    )
-    if from_pretrained is not None:
-        load_checkpoint(model, from_pretrained)
-    return model
diff --git a/videotuna/models/opensora/models/stdit/stdit4.py b/videotuna/models/opensora/models/stdit/stdit4.py
deleted file mode 100644
index 4294e711..00000000
--- a/videotuna/models/opensora/models/stdit/stdit4.py
+++ /dev/null
@@ -1,503 +0,0 @@
-import numpy as np
-import torch
-import torch.distributed as dist
-import torch.nn as nn
-from einops import rearrange
-from timm.models.layers import DropPath
-from timm.models.vision_transformer import Mlp
-
-from videotuna.models.opensora.acceleration.checkpoint import auto_grad_checkpoint
-from videotuna.models.opensora.acceleration.communications import (
-    gather_forward_split_backward,
-    split_forward_gather_backward,
-)
-from videotuna.models.opensora.acceleration.parallel_states import get_sequence_parallel_group
-from videotuna.models.opensora.models.layers.blocks import (
-    Attention,
-    CaptionEmbedder,
-    MultiHeadCrossAttention,
-    PatchEmbed3D,
-    SeqParallelAttention,
-    SeqParallelMultiHeadCrossAttention,
-    T2IFinalLayer,
-    TimestepEmbedder,
-    approx_gelu,
-    get_1d_sincos_pos_embed,
-    get_2d_sincos_pos_embed,
-    get_layernorm,
-    t2i_modulate,
-)
-from videotuna.models.opensora.registry import MODELS
-from videotuna.models.opensora.utils.ckpt_utils import load_checkpoint
-
-
-class STDiTBlock2(nn.Module):
-    def __init__(
-        self,
-        hidden_size,
-        num_heads,
-        mlp_ratio=4.0,
-        drop_path=0.0,
-        enable_flashattn=False,
-        enable_layernorm_kernel=False,
-        enable_sequence_parallelism=False,
-    ):
-        super().__init__()
-        self.hidden_size = hidden_size
-        self.enable_flashattn = enable_flashattn
-        self._enable_sequence_parallelism = enable_sequence_parallelism
-
-        if enable_sequence_parallelism:
-            self.attn_cls = SeqParallelAttention
-            self.mha_cls = SeqParallelMultiHeadCrossAttention
-        else:
-            self.attn_cls = Attention
-            self.mha_cls = MultiHeadCrossAttention
-
-        self.norm1 = get_layernorm(
-            hidden_size, eps=1e-6, affine=False, use_kernel=enable_layernorm_kernel
-        )
-        self.attn = self.attn_cls(
-            hidden_size,
-            num_heads=num_heads,
-            qkv_bias=True,
-            enable_flashattn=enable_flashattn,
-        )
-        self.cross_attn = self.mha_cls(hidden_size, num_heads)
-        self.norm2 = get_layernorm(
-            hidden_size, eps=1e-6, affine=False, use_kernel=enable_layernorm_kernel
-        )
-        self.mlp = Mlp(
-            in_features=hidden_size,
-            hidden_features=int(hidden_size * mlp_ratio),
-            act_layer=approx_gelu,
-            drop=0,
-        )
-        self.drop_path = DropPath(drop_path) if drop_path > 0.0 else nn.Identity()
-        self.scale_shift_table = nn.Parameter(
-            torch.randn(6, hidden_size) / hidden_size**0.5
-        )
-
-        # temporal attention
-        self.attn_temp = self.attn_cls(
-            hidden_size,
-            num_heads=num_heads,
-            qkv_bias=True,
-            enable_flashattn=self.enable_flashattn,
-        )
-
-    def t_mask_select(self, x_mask, x, masked_x, T, S):
-        # x: [B, (T, S), C]
-        # mased_x: [B, (T, S), C]
-        # x_mask: [B, T]
-        x = rearrange(x, "B (T S) C -> B T S C", T=T, S=S)
-        masked_x = rearrange(masked_x, "B (T S) C -> B T S C", T=T, S=S)
-        x = torch.where(x_mask[:, :, None, None], x, masked_x)
-        x = rearrange(x, "B T S C -> B (T S) C")
-        return x
-
-    def forward(
-        self, x, y, t, mask=None, tpe=None, x_mask=None, t0=None, T=None, S=None
-    ):
-        B, N, C = x.shape
-
-        shift_msa, scale_msa, gate_msa, shift_mlp, scale_mlp, gate_mlp = (
-            self.scale_shift_table[None] + t.reshape(B, 6, -1)
-        ).chunk(6, dim=1)
-        if x_mask is not None:
-            (
-                shift_msa_zero,
-                scale_msa_zero,
-                gate_msa_zero,
-                shift_mlp_zero,
-                scale_mlp_zero,
-                gate_mlp_zero,
-            ) = (self.scale_shift_table[None] + t0.reshape(B, 6, -1)).chunk(6, dim=1)
-
-        # modulate
-        x_m = t2i_modulate(self.norm1(x), shift_msa, scale_msa)
-        if x_mask is not None:
-            x_m_zero = t2i_modulate(self.norm1(x), shift_msa_zero, scale_msa_zero)
-            x_m = self.t_mask_select(x_mask, x_m, x_m_zero, T, S)
-
-        # spatial branch
-        x_s = rearrange(x_m, "B (T S) C -> (B T) S C", T=T, S=S)
-        x_s = self.attn(x_s)
-        x_s = rearrange(x_s, "(B T) S C -> B (T S) C", T=T, S=S)
-        if x_mask is not None:
-            x_s_zero = gate_msa_zero * x_s
-            x_s = gate_msa * x_s
-            x_s = self.t_mask_select(x_mask, x_s, x_s_zero, T, S)
-        else:
-            x_s = gate_msa * x_s
-        x = x + self.drop_path(x_s)
-
-        # temporal branch
-        x_t = rearrange(x, "B (T S) C -> (B S) T C", T=T, S=S)
-        if tpe is not None:
-            x_t = x_t + tpe
-        x_t = self.attn_temp(x_t)
-        x_t = rearrange(x_t, "(B S) T C -> B (T S) C", T=T, S=S)
-        if x_mask is not None:
-            x_t_zero = gate_msa_zero * x_t
-            x_t = gate_msa * x_t
-            x_t = self.t_mask_select(x_mask, x_t, x_t_zero, T, S)
-        else:
-            x_t = gate_msa * x_t
-        x = x + self.drop_path(x_t)
-
-        # cross attn
-        x = x + self.cross_attn(x, y, mask)
-
-        # mlp
-        x_m = t2i_modulate(self.norm2(x), shift_mlp, scale_mlp)
-        if x_mask is not None:
-            x_m_zero = t2i_modulate(self.norm2(x), shift_mlp_zero, scale_mlp_zero)
-            x_m = self.t_mask_select(x_mask, x_m, x_m_zero, T, S)
-
-        x_mlp = self.mlp(x_m)
-        if x_mask is not None:
-            x_mlp_zero = gate_mlp_zero * x_mlp
-            x_mlp = gate_mlp * x_mlp
-            x_mlp = self.t_mask_select(x_mask, x_mlp, x_mlp_zero, T, S)
-        else:
-            x_mlp = gate_mlp * x_mlp
-
-        x = x + self.drop_path(x_mlp)
-
-        return x
-
-
-@MODELS.register_module()
-class STDiT4(nn.Module):
-    def __init__(
-        self,
-        input_size=(1, 32, 32),
-        in_channels=4,
-        patch_size=(1, 2, 2),
-        hidden_size=1152,
-        depth=28,
-        num_heads=16,
-        mlp_ratio=4.0,
-        class_dropout_prob=0.1,
-        pred_sigma=True,
-        drop_path=0.0,
-        no_temporal_pos_emb=False,
-        caption_channels=4096,
-        model_max_length=120,
-        dtype=torch.float16,
-        space_scale=1.0,
-        time_scale=1.0,
-        freeze=None,
-        enable_flashattn=False,
-        enable_layernorm_kernel=False,
-        enable_sequence_parallelism=False,
-    ):
-        super().__init__()
-        self.pred_sigma = pred_sigma
-        self.in_channels = in_channels
-        self.out_channels = in_channels * 2 if pred_sigma else in_channels
-        self.hidden_size = hidden_size
-        self.patch_size = patch_size
-
-        # size for video input
-        self.input_size = input_size
-        num_patches = np.prod([input_size[i] // patch_size[i] for i in range(3)])
-        self.num_patches = num_patches
-        self.num_temporal = input_size[0] // patch_size[0]
-        self.num_spatial = num_patches // self.num_temporal
-
-        # size for image input
-        img_input_size = [1, input_size[1], input_size[2]]
-        self.img_input_size = img_input_size
-        self.img_num_patches = np.prod(
-            [self.img_input_size[i] // patch_size[i] for i in range(3)]
-        )
-        self.img_num_temporal = self.img_input_size[0] // patch_size[0]
-        self.img_num_spatial = self.img_num_patches // self.img_num_temporal
-
-        self.num_heads = num_heads
-        self.dtype = dtype
-        self.no_temporal_pos_emb = no_temporal_pos_emb
-        self.depth = depth
-        self.mlp_ratio = mlp_ratio
-        self.enable_flashattn = enable_flashattn
-        self.enable_layernorm_kernel = enable_layernorm_kernel
-        self.space_scale = space_scale
-        self.time_scale = time_scale
-
-        self.register_buffer("pos_embed", self.get_spatial_pos_embed())
-        self.register_buffer("pos_embed_temporal", self.get_temporal_pos_embed())
-
-        self.x_embedder = PatchEmbed3D(patch_size, in_channels, hidden_size)
-        self.t_embedder = TimestepEmbedder(hidden_size)
-        self.t_block = nn.Sequential(
-            nn.SiLU(), nn.Linear(hidden_size, 6 * hidden_size, bias=True)
-        )
-        self.y_embedder = CaptionEmbedder(
-            in_channels=caption_channels,
-            hidden_size=hidden_size,
-            uncond_prob=class_dropout_prob,
-            act_layer=approx_gelu,
-            token_num=model_max_length,
-        )
-
-        drop_path = [x.item() for x in torch.linspace(0, drop_path, depth)]
-        self.blocks = nn.ModuleList(
-            [
-                STDiTBlock2(
-                    self.hidden_size,
-                    self.num_heads,
-                    mlp_ratio=self.mlp_ratio,
-                    drop_path=drop_path[i],
-                    enable_flashattn=self.enable_flashattn,
-                    enable_layernorm_kernel=self.enable_layernorm_kernel,
-                    enable_sequence_parallelism=enable_sequence_parallelism,
-                )
-                for i in range(self.depth)
-            ]
-        )
-        self.final_layer = T2IFinalLayer(
-            hidden_size, np.prod(self.patch_size), self.out_channels
-        )
-
-        # init model
-        self.initialize_weights()
-        self.initialize_temporal()
-        if freeze is not None:
-            assert freeze in ["not_temporal", "text"]
-            if freeze == "not_temporal":
-                self.freeze_not_temporal()
-            elif freeze == "text":
-                self.freeze_text()
-
-        # sequence parallel related configs
-        self.enable_sequence_parallelism = enable_sequence_parallelism
-        if enable_sequence_parallelism:
-            self.sp_rank = dist.get_rank(get_sequence_parallel_group())
-        else:
-            self.sp_rank = None
-
-    def forward(self, x, timestep, y, mask=None, x_mask=None):
-        """
-        Forward pass of STDiT.
-        Args:
-            x (torch.Tensor): latent representation of video; of shape [B, C, T, H, W]
-            timestep (torch.Tensor): diffusion time steps; of shape [B]
-            y (torch.Tensor): representation of prompts; of shape [B, 1, N_token, C]
-            mask (torch.Tensor): mask for selecting prompt tokens; of shape [B, N_token]
-
-        Returns:
-            x (torch.Tensor): output latent representation; of shape [B, C, T, H, W]
-        """
-        if x.shape[2] == 1:
-            input_type = "image"
-        elif x.shape[2] > 1:
-            input_type = "video"
-        else:
-            raise ValueError("Input shape not recognized.")
-
-        if input_type == "image":
-            T = self.img_num_temporal
-            S = self.img_num_spatial
-        elif input_type == "video":
-            T = self.num_temporal
-            S = self.num_spatial
-
-        x = x.to(self.dtype)
-        timestep = timestep.to(self.dtype)
-        y = y.to(self.dtype)
-
-        # embedding
-        x = self.x_embedder(x)  # [B, N, C]
-        x = rearrange(x, "B (T S) C -> B T S C", T=T, S=S)
-        x = x + self.pos_embed
-        x = rearrange(x, "B T S C -> B (T S) C")
-
-        # shard over the sequence dim if sp is enabled
-        if self.enable_sequence_parallelism:
-            x = split_forward_gather_backward(
-                x, get_sequence_parallel_group(), dim=1, grad_scale="down"
-            )
-
-        t = self.t_embedder(timestep, dtype=x.dtype)  # [B, C]
-        t_emb = self.t_block(t)  # [B, C]
-        if x_mask is not None:
-            t0_timestep = torch.zeros_like(timestep)
-            t0 = self.t_embedder(t0_timestep, dtype=x.dtype)
-            t0_emb = self.t_block(t0)
-        else:
-            t0 = None
-            t0_emb = None
-
-        y = self.y_embedder(y, self.training)  # [B, 1, N_token, C]
-
-        if mask is not None:
-            if mask.shape[0] != y.shape[0]:
-                mask = mask.repeat(y.shape[0] // mask.shape[0], 1)
-            mask = mask.squeeze(1).squeeze(1)
-            y = (
-                y.squeeze(1)
-                .masked_select(mask.unsqueeze(-1) != 0)
-                .view(1, -1, x.shape[-1])
-            )
-            y_lens = mask.sum(dim=1).tolist()
-        else:
-            y_lens = [y.shape[2]] * y.shape[0]
-            y = y.squeeze(1).view(1, -1, x.shape[-1])
-
-        # blocks
-        for i, block in enumerate(self.blocks):
-            if i == 0:
-                if self.enable_sequence_parallelism:
-                    tpe = torch.chunk(
-                        self.pos_embed_temporal,
-                        dist.get_world_size(get_sequence_parallel_group()),
-                        dim=1,
-                    )[self.sp_rank].contiguous()
-                else:
-                    if input_type == "image":
-                        tpe = self.pos_embed_temporal[:, :1, :]
-                    else:
-                        tpe = self.pos_embed_temporal
-            else:
-                tpe = None
-            x = auto_grad_checkpoint(
-                block, x, y, t_emb, y_lens, tpe, x_mask, t0_emb, T, S
-            )
-
-        if self.enable_sequence_parallelism:
-            x = gather_forward_split_backward(
-                x, get_sequence_parallel_group(), dim=1, grad_scale="up"
-            )
-        # x.shape: [B, N, C]
-
-        # final process
-        x = self.final_layer(
-            x, t, x_mask, t0, T, S
-        )  # [B, N, C=T_p * H_p * W_p * C_out]
-        x = self.unpatchify(x, input_type=input_type)  # [B, C_out, T, H, W]
-
-        # cast to float32 for better accuracy
-        x = x.to(torch.float32)
-        return x
-
-    def unpatchify(self, x, input_type=None):
-        """
-        Args:
-            x (torch.Tensor): of shape [B, N, C]
-
-        Return:
-            x (torch.Tensor): of shape [B, C_out, T, H, W]
-        """
-
-        if input_type == "image":
-            N_t, N_h, N_w = [
-                self.img_input_size[i] // self.patch_size[i] for i in range(3)
-            ]
-        else:
-            N_t, N_h, N_w = [self.input_size[i] // self.patch_size[i] for i in range(3)]
-        T_p, H_p, W_p = self.patch_size
-        x = rearrange(
-            x,
-            "B (N_t N_h N_w) (T_p H_p W_p C_out) -> B C_out (N_t T_p) (N_h H_p) (N_w W_p)",
-            N_t=N_t,
-            N_h=N_h,
-            N_w=N_w,
-            T_p=T_p,
-            H_p=H_p,
-            W_p=W_p,
-            C_out=self.out_channels,
-        )
-        return x
-
-    def unpatchify_old(self, x):
-        c = self.out_channels
-        t, h, w = [self.input_size[i] // self.patch_size[i] for i in range(3)]
-        pt, ph, pw = self.patch_size
-
-        x = x.reshape(shape=(x.shape[0], t, h, w, pt, ph, pw, c))
-        x = rearrange(x, "n t h w r p q c -> n c t r h p w q")
-        imgs = x.reshape(shape=(x.shape[0], c, t * pt, h * ph, w * pw))
-        return imgs
-
-    def get_spatial_pos_embed(self, grid_size=None):
-        if grid_size is None:
-            grid_size = self.input_size[1:]
-        pos_embed = get_2d_sincos_pos_embed(
-            self.hidden_size,
-            (grid_size[0] // self.patch_size[1], grid_size[1] // self.patch_size[2]),
-            scale=self.space_scale,
-        )
-        pos_embed = (
-            torch.from_numpy(pos_embed).float().unsqueeze(0).requires_grad_(False)
-        )
-        return pos_embed
-
-    def get_temporal_pos_embed(self):
-        pos_embed = get_1d_sincos_pos_embed(
-            self.hidden_size,
-            self.input_size[0] // self.patch_size[0],
-            scale=self.time_scale,
-        )
-        pos_embed = (
-            torch.from_numpy(pos_embed).float().unsqueeze(0).requires_grad_(False)
-        )
-        return pos_embed
-
-    def freeze_not_temporal(self):
-        for n, p in self.named_parameters():
-            if "attn_temp" not in n:
-                p.requires_grad = False
-
-    def freeze_text(self):
-        for n, p in self.named_parameters():
-            if "cross_attn" in n:
-                p.requires_grad = False
-
-    def initialize_temporal(self):
-        for block in self.blocks:
-            nn.init.constant_(block.attn_temp.proj.weight, 0)
-            nn.init.constant_(block.attn_temp.proj.bias, 0)
-
-    def initialize_weights(self):
-        # Initialize transformer layers:
-        def _basic_init(module):
-            if isinstance(module, nn.Linear):
-                torch.nn.init.xavier_uniform_(module.weight)
-                if module.bias is not None:
-                    nn.init.constant_(module.bias, 0)
-
-        self.apply(_basic_init)
-
-        # Initialize patch_embed like nn.Linear (instead of nn.Conv2d):
-        w = self.x_embedder.proj.weight.data
-        nn.init.xavier_uniform_(w.view([w.shape[0], -1]))
-
-        # Initialize timestep embedding MLP:
-        nn.init.normal_(self.t_embedder.mlp[0].weight, std=0.02)
-        nn.init.normal_(self.t_embedder.mlp[2].weight, std=0.02)
-        nn.init.normal_(self.t_block[1].weight, std=0.02)
-
-        # Initialize caption embedding MLP:
-        nn.init.normal_(self.y_embedder.y_proj.fc1.weight, std=0.02)
-        nn.init.normal_(self.y_embedder.y_proj.fc2.weight, std=0.02)
-
-        # Zero-out adaLN modulation layers in PixArt blocks:
-        for block in self.blocks:
-            nn.init.constant_(block.cross_attn.proj.weight, 0)
-            nn.init.constant_(block.cross_attn.proj.bias, 0)
-
-        # Zero-out output layers:
-        nn.init.constant_(self.final_layer.linear.weight, 0)
-        nn.init.constant_(self.final_layer.linear.bias, 0)
-
-
-@MODELS.register_module("STDiT4-XL/2")
-def STDiT4_XL_2(from_pretrained=None, **kwargs):
-    model = STDiT4(
-        depth=28, hidden_size=1152, patch_size=(1, 2, 2), num_heads=16, **kwargs
-    )
-    if from_pretrained is not None:
-        load_checkpoint(model, from_pretrained)
-    return model
diff --git a/videotuna/models/opensora/models/stdit/stdit5.py b/videotuna/models/opensora/models/stdit/stdit5.py
deleted file mode 100644
index 9d3b1ef7..00000000
--- a/videotuna/models/opensora/models/stdit/stdit5.py
+++ /dev/null
@@ -1,646 +0,0 @@
-# Add the ar and fps to the time embedding with Rope Positional Embedding on Temporal Attention
-
-import numpy as np
-import torch
-import torch.distributed as dist
-import torch.nn as nn
-from einops import rearrange
-from rotary_embedding_torch import RotaryEmbedding
-from timm.models.layers import DropPath
-from timm.models.vision_transformer import Mlp
-from transformers import PretrainedConfig, PreTrainedModel
-
-from videotuna.models.opensora.acceleration.checkpoint import auto_grad_checkpoint
-from videotuna.models.opensora.acceleration.communications import (
-    gather_forward_split_backward,
-    split_forward_gather_backward,
-)
-from videotuna.models.opensora.acceleration.parallel_states import get_sequence_parallel_group
-from videotuna.models.opensora.models.layers.blocks import (
-    Attention,
-    CaptionEmbedder,
-    MultiHeadCrossAttention,
-    PatchEmbed3D,
-    PositionEmbedding2D,
-    SeqParallelAttention,
-    SeqParallelMultiHeadCrossAttention,
-    SizeEmbedder,
-    T2IFinalLayer,
-    TimestepEmbedder,
-    approx_gelu,
-    get_1d_sincos_pos_embed,
-    get_2d_sincos_pos_embed,
-    get_layernorm,
-    t2i_modulate,
-)
-from videotuna.models.opensora.registry import MODELS
-from videotuna.models.opensora.utils.ckpt_utils import load_checkpoint
-
-
-class STDiTBlock2(nn.Module):
-    def __init__(
-        self,
-        hidden_size,
-        num_heads,
-        mlp_ratio=4.0,
-        drop_path=0.0,
-        enable_flashattn=False,
-        enable_layernorm_kernel=False,
-        enable_sequence_parallelism=False,
-        rope=None,
-    ):
-        super().__init__()
-        self.hidden_size = hidden_size
-        self.enable_flashattn = enable_flashattn
-        self._enable_sequence_parallelism = enable_sequence_parallelism
-
-        if enable_sequence_parallelism:
-            self.attn_cls = SeqParallelAttention
-            self.mha_cls = SeqParallelMultiHeadCrossAttention
-        else:
-            self.attn_cls = Attention
-            self.mha_cls = MultiHeadCrossAttention
-
-        # spatial branch
-        self.norm1 = get_layernorm(
-            hidden_size, eps=1e-6, affine=False, use_kernel=enable_layernorm_kernel
-        )
-        self.attn = self.attn_cls(
-            hidden_size,
-            num_heads=num_heads,
-            qkv_bias=True,
-            enable_flashattn=enable_flashattn,
-        )
-        self.scale_shift_table = nn.Parameter(
-            torch.randn(6, hidden_size) / hidden_size**0.5
-        )
-
-        # cross attn
-        self.cross_attn = self.mha_cls(hidden_size, num_heads)
-
-        # mpl branch
-        self.norm2 = get_layernorm(
-            hidden_size, eps=1e-6, affine=False, use_kernel=enable_layernorm_kernel
-        )
-        self.mlp = Mlp(
-            in_features=hidden_size,
-            hidden_features=int(hidden_size * mlp_ratio),
-            act_layer=approx_gelu,
-            drop=0,
-        )
-        self.drop_path = DropPath(drop_path) if drop_path > 0.0 else nn.Identity()
-
-        # temporal attention
-        self.norm_temp = get_layernorm(
-            hidden_size, eps=1e-6, affine=False, use_kernel=enable_layernorm_kernel
-        )  # new
-        self.attn_temp = self.attn_cls(
-            hidden_size,
-            num_heads=num_heads,
-            qkv_bias=True,
-            enable_flashattn=self.enable_flashattn,
-            rope=rope,
-        )
-        self.scale_shift_table_temporal = nn.Parameter(
-            torch.randn(3, hidden_size) / hidden_size**0.5
-        )  # new
-
-    def t_mask_select(self, x_mask, x, masked_x, T, S):
-        # x: [B, (T, S), C]
-        # mased_x: [B, (T, S), C]
-        # x_mask: [B, T]
-        x = rearrange(x, "B (T S) C -> B T S C", T=T, S=S)
-        masked_x = rearrange(masked_x, "B (T S) C -> B T S C", T=T, S=S)
-        x = torch.where(x_mask[:, :, None, None], x, masked_x)
-        x = rearrange(x, "B T S C -> B (T S) C")
-        return x
-
-    def forward(
-        self,
-        x,
-        y,
-        t,
-        t_tmp,
-        mask=None,
-        x_mask=None,
-        t0=None,
-        t0_tmp=None,
-        T=None,
-        S=None,
-    ):
-        B, N, C = x.shape
-
-        shift_msa, scale_msa, gate_msa, shift_mlp, scale_mlp, gate_mlp = (
-            self.scale_shift_table[None] + t.reshape(B, 6, -1)
-        ).chunk(6, dim=1)
-        shift_tmp, scale_tmp, gate_tmp = (
-            self.scale_shift_table_temporal[None] + t_tmp.reshape(B, 3, -1)
-        ).chunk(3, dim=1)
-        if x_mask is not None:
-            (
-                shift_msa_zero,
-                scale_msa_zero,
-                gate_msa_zero,
-                shift_mlp_zero,
-                scale_mlp_zero,
-                gate_mlp_zero,
-            ) = (self.scale_shift_table[None] + t0.reshape(B, 6, -1)).chunk(6, dim=1)
-            shift_tmp_zero, scale_tmp_zero, gate_tmp_zero = (
-                self.scale_shift_table_temporal[None] + t0_tmp.reshape(B, 3, -1)
-            ).chunk(3, dim=1)
-
-        # modulate
-        x_m = t2i_modulate(self.norm1(x), shift_msa, scale_msa)
-        if x_mask is not None:
-            x_m_zero = t2i_modulate(self.norm1(x), shift_msa_zero, scale_msa_zero)
-            x_m = self.t_mask_select(x_mask, x_m, x_m_zero, T, S)
-
-        # spatial branch
-        x_s = rearrange(x_m, "B (T S) C -> (B T) S C", T=T, S=S)
-        x_s = self.attn(x_s)
-        x_s = rearrange(x_s, "(B T) S C -> B (T S) C", T=T, S=S)
-        if x_mask is not None:
-            x_s_zero = gate_msa_zero * x_s
-            x_s = gate_msa * x_s
-            x_s = self.t_mask_select(x_mask, x_s, x_s_zero, T, S)
-        else:
-            x_s = gate_msa * x_s
-        x = x + self.drop_path(x_s)
-
-        # modulate
-        x_m = t2i_modulate(self.norm_temp(x), shift_tmp, scale_tmp)
-        if x_mask is not None:
-            x_m_zero = t2i_modulate(self.norm_temp(x), shift_tmp_zero, scale_tmp_zero)
-            x_m = self.t_mask_select(x_mask, x_m, x_m_zero, T, S)
-
-        # temporal branch
-        x_t = rearrange(x_m, "B (T S) C -> (B S) T C", T=T, S=S)
-        x_t = self.attn_temp(x_t)
-        x_t = rearrange(x_t, "(B S) T C -> B (T S) C", T=T, S=S)
-        if x_mask is not None:
-            x_t_zero = gate_tmp_zero * x_t
-            x_t = gate_tmp * x_t
-            x_t = self.t_mask_select(x_mask, x_t, x_t_zero, T, S)
-        else:
-            x_t = gate_tmp * x_t
-        x = x + self.drop_path(x_t)
-
-        # cross attn
-        x = x + self.cross_attn(x, y, mask)
-
-        # modulate
-        x_m = t2i_modulate(self.norm2(x), shift_mlp, scale_mlp)
-        if x_mask is not None:
-            x_m_zero = t2i_modulate(self.norm2(x), shift_mlp_zero, scale_mlp_zero)
-            x_m = self.t_mask_select(x_mask, x_m, x_m_zero, T, S)
-
-        # mlp
-        x_mlp = self.mlp(x_m)
-        if x_mask is not None:
-            x_mlp_zero = gate_mlp_zero * x_mlp
-            x_mlp = gate_mlp * x_mlp
-            x_mlp = self.t_mask_select(x_mask, x_mlp, x_mlp_zero, T, S)
-        else:
-            x_mlp = gate_mlp * x_mlp
-        x = x + self.drop_path(x_mlp)
-
-        return x
-
-
-class STDiT5Config(PretrainedConfig):
-
-    model_type = "STDiT5"
-
-    def __init__(
-        self,
-        input_size=(None, None, None),
-        input_sq_size=32,
-        in_channels=4,
-        patch_size=(1, 2, 2),
-        hidden_size=1152,
-        depth=28,
-        num_heads=16,
-        mlp_ratio=4.0,
-        class_dropout_prob=0.1,
-        pred_sigma=True,
-        drop_path=0.0,
-        no_temporal_pos_emb=False,
-        caption_channels=4096,
-        model_max_length=120,
-        freeze=None,
-        qk_norm=False,
-        enable_flash_attn=False,
-        enable_layernorm_kernel=False,
-        enable_sequence_parallelism=False,
-        **kwargs,
-    ):
-        self.input_size = input_size
-        self.input_sq_size = input_sq_size
-        self.in_channels = in_channels
-        self.patch_size = patch_size
-        self.hidden_size = hidden_size
-        self.depth = depth
-        self.num_heads = num_heads
-        self.mlp_ratio = mlp_ratio
-        self.class_dropout_prob = class_dropout_prob
-        self.pred_sigma = pred_sigma
-        self.drop_path = drop_path
-        self.no_temporal_pos_emb = no_temporal_pos_emb
-        self.caption_channels = caption_channels
-        self.model_max_length = model_max_length
-        self.freeze = freeze
-        self.qk_norm = qk_norm
-        self.enable_flash_attn = enable_flash_attn
-        self.enable_layernorm_kernel = enable_layernorm_kernel
-        self.enable_sequence_parallelism = enable_sequence_parallelism
-        super().__init__(**kwargs)
-
-
-@MODELS.register_module()
-class STDiT5(nn.Module):
-
-    config_class = STDiT5Config
-
-    def __init__(self, config):
-        super().__init__()
-        self.dtype = config.dtype
-        self.pred_sigma = config.pred_sigma
-        self.in_channels = config.in_channels
-        self.out_channels = (
-            config.in_channels * 2 if config.pred_sigma else config.in_channels
-        )
-        self.hidden_size = config.hidden_size
-        self.num_heads = config.num_heads
-        self.no_temporal_pos_emb = config.no_temporal_pos_emb
-        self.depth = config.depth
-        self.mlp_ratio = config.mlp_ratio
-        self.enable_flashattn = config.enable_flashattn
-        self.enable_layernorm_kernel = config.enable_layernorm_kernel
-
-        # support dynamic input
-        self.patch_size = config.patch_size
-        self.input_size = config.input_size
-        self.input_sq_size = config.input_sq_size
-        self.pos_embed = PositionEmbedding2D(config.hidden_size)
-
-        self.x_embedder = PatchEmbed3D(
-            config.patch_size, config.in_channels, config.hidden_size
-        )
-        self.t_embedder = TimestepEmbedder(config.hidden_size)
-        self.t_block = nn.Sequential(
-            nn.SiLU(), nn.Linear(config.hidden_size, 6 * config.hidden_size, bias=True)
-        )
-        self.t_block_temp = nn.Sequential(
-            nn.SiLU(), nn.Linear(config.hidden_size, 3 * config.hidden_size, bias=True)
-        )
-        self.y_embedder = CaptionEmbedder(
-            in_channels=config.caption_channels,
-            hidden_size=config.hidden_size,
-            uncond_prob=config.class_dropout_prob,
-            act_layer=approx_gelu,
-            token_num=config.model_max_length,
-        )
-
-        self.space_scale = config.space_scale
-        self.time_scale = config.time_scale
-
-        drop_path = [
-            x.item() for x in torch.linspace(0, config.drop_path, config.depth)
-        ]
-        self.rope = RotaryEmbedding(dim=self.hidden_size // self.num_heads)  # new
-        self.blocks = nn.ModuleList(
-            [
-                STDiTBlock2(
-                    self.hidden_size,
-                    self.num_heads,
-                    mlp_ratio=self.mlp_ratio,
-                    drop_path=drop_path[i],
-                    enable_flashattn=self.enable_flashattn,
-                    enable_layernorm_kernel=self.enable_layernorm_kernel,
-                    enable_sequence_parallelism=config.enable_sequence_parallelism,
-                    rope=self.rope.rotate_queries_or_keys,
-                )
-                for i in range(self.depth)
-            ]
-        )
-        self.final_layer = T2IFinalLayer(
-            config.hidden_size, np.prod(self.patch_size), self.out_channels
-        )
-
-        # multi_res
-        assert self.hidden_size % 3 == 0, "hidden_size must be divisible by 3"
-        self.csize_embedder = SizeEmbedder(self.hidden_size // 3)
-        self.ar_embedder = SizeEmbedder(self.hidden_size // 3)
-        self.fl_embedder = SizeEmbedder(self.hidden_size)  # new
-        self.fps_embedder = SizeEmbedder(self.hidden_size)  # new
-
-        # init model
-        self.initialize_weights()
-        self.initialize_temporal()
-        if config.freeze is not None:
-            assert config.freeze in ["not_temporal", "text"]
-            if config.freeze == "not_temporal":
-                self.freeze_not_temporal()
-            elif config.freeze == "text":
-                self.freeze_text()
-
-        # sequence parallel related configs
-        self.enable_sequence_parallelism = config.enable_sequence_parallelism
-        if config.enable_sequence_parallelism:
-            self.sp_rank = dist.get_rank(get_sequence_parallel_group())
-        else:
-            self.sp_rank = None
-
-    def get_dynamic_size(self, x):
-        _, _, T, H, W = x.size()
-        if T % self.patch_size[0] != 0:
-            T += self.patch_size[0] - T % self.patch_size[0]
-        if H % self.patch_size[1] != 0:
-            H += self.patch_size[1] - H % self.patch_size[1]
-        if W % self.patch_size[2] != 0:
-            W += self.patch_size[2] - W % self.patch_size[2]
-        T = T // self.patch_size[0]
-        H = H // self.patch_size[1]
-        W = W // self.patch_size[2]
-        return (T, H, W)
-
-    def forward(
-        self,
-        x,
-        timestep,
-        y,
-        mask=None,
-        x_mask=None,
-        num_frames=None,
-        height=None,
-        width=None,
-        ar=None,
-        fps=None,
-    ):
-        """
-        Forward pass of STDiT.
-        Args:
-            x (torch.Tensor): latent representation of video; of shape [B, C, T, H, W]
-            timestep (torch.Tensor): diffusion time steps; of shape [B]
-            y (torch.Tensor): representation of prompts; of shape [B, 1, N_token, C]
-            mask (torch.Tensor): mask for selecting prompt tokens; of shape [B, N_token]
-
-        Returns:
-            x (torch.Tensor): output latent representation; of shape [B, C, T, H, W]
-        """
-        B = x.shape[0]
-        x = x.to(self.dtype)
-        timestep = timestep.to(self.dtype)
-        y = y.to(self.dtype)
-
-        # == process data info ==
-        # 1. get dynamic size
-        hw = torch.cat([height[:, None], width[:, None]], dim=1)
-        rs = (height[0].item() * width[0].item()) ** 0.5
-        csize = self.csize_embedder(hw, B)
-
-        # 2. get aspect ratio
-        ar = ar.unsqueeze(1)
-        ar = self.ar_embedder(ar, B)
-        data_info = torch.cat([csize, ar], dim=1)
-
-        # 3. get number of frames
-        fl = num_frames.unsqueeze(1)
-        fps = fps.unsqueeze(1)
-        fl = self.fl_embedder(fl, B)
-        fl = fl + self.fps_embedder(fps, B)
-
-        # === get dynamic shape size ===
-        _, _, Tx, Hx, Wx = x.size()
-        T, H, W = self.get_dynamic_size(x)
-        S = H * W
-        scale = rs / self.input_sq_size
-        base_size = round(S**0.5)
-        pos_emb = self.pos_embed(x, H, W, scale=scale, base_size=base_size)
-
-        # embedding
-        x = self.x_embedder(x)  # [B, N, C]
-        x = rearrange(x, "B (T S) C -> B T S C", T=T, S=S)
-        x = x + pos_emb
-        x = rearrange(x, "B T S C -> B (T S) C")
-
-        # shard over the sequence dim if sp is enabled
-        if self.enable_sequence_parallelism:
-            x = split_forward_gather_backward(
-                x, get_sequence_parallel_group(), dim=1, grad_scale="down"
-            )
-
-        t = self.t_embedder(timestep, dtype=x.dtype)  # [B, C]
-        t_spc = t + data_info  # [B, C]
-        t_tmp = t + fl  # [B, C]
-        t_emb = self.t_block(t_tmp)  # [B, C]
-        t_spc_mlp = self.t_block(t_spc)  # [B, 6*C]
-        t_tmp_mlp = self.t_block_temp(t_tmp)  # [B, 3*C]
-        if x_mask is not None:
-            t0_timestep = torch.zeros_like(timestep)
-            t0 = self.t_embedder(t0_timestep, dtype=x.dtype)
-            t0_spc = t0 + data_info
-            t0_tmp = t0 + fl
-            t0_spc_mlp = self.t_block(t0_spc)
-            t0_tmp_mlp = self.t_block_temp(t0_tmp)
-        else:
-            t0_spc = None
-            t0_tmp = None
-            t0_spc_mlp = None
-            t0_tmp_mlp = None
-
-        # prepare y
-        y = self.y_embedder(y, self.training)  # [B, 1, N_token, C]
-
-        if mask is not None:
-            if mask.shape[0] != y.shape[0]:
-                mask = mask.repeat(y.shape[0] // mask.shape[0], 1)
-            mask = mask.squeeze(1).squeeze(1)
-            y = (
-                y.squeeze(1)
-                .masked_select(mask.unsqueeze(-1) != 0)
-                .view(1, -1, x.shape[-1])
-            )
-            y_lens = mask.sum(dim=1).tolist()
-        else:
-            y_lens = [y.shape[2]] * y.shape[0]
-            y = y.squeeze(1).view(1, -1, x.shape[-1])
-
-        # blocks
-        for _, block in enumerate(self.blocks):
-            x = auto_grad_checkpoint(
-                block,
-                x,
-                y,
-                t_spc_mlp,
-                t_tmp_mlp,
-                y_lens,
-                x_mask,
-                t0_spc_mlp,
-                t0_tmp_mlp,
-                T,
-                S,
-            )
-            # x.shape: [B, N, C]
-
-        # final process
-        x = self.final_layer(
-            x, t, x_mask, t0_spc, T, S
-        )  # [B, N, C=T_p * H_p * W_p * C_out]
-        x = self.unpatchify(x, T, H, W, Tx, Hx, Wx)  # [B, C_out, T, H, W]
-
-        # cast to float32 for better accuracy
-        x = x.to(torch.float32)
-        return x
-
-    def unpatchify(self, x, N_t, N_h, N_w, R_t, R_h, R_w):
-        """
-        Args:
-            x (torch.Tensor): of shape [B, N, C]
-
-        Return:
-            x (torch.Tensor): of shape [B, C_out, T, H, W]
-        """
-
-        # N_t, N_h, N_w = [self.input_size[i] // self.patch_size[i] for i in range(3)]
-        T_p, H_p, W_p = self.patch_size
-        x = rearrange(
-            x,
-            "B (N_t N_h N_w) (T_p H_p W_p C_out) -> B C_out (N_t T_p) (N_h H_p) (N_w W_p)",
-            N_t=N_t,
-            N_h=N_h,
-            N_w=N_w,
-            T_p=T_p,
-            H_p=H_p,
-            W_p=W_p,
-            C_out=self.out_channels,
-        )
-        # unpad
-        x = x[:, :, :R_t, :R_h, :R_w]
-        return x
-
-    def unpatchify_lastest(self, x, input_type=None):
-        """
-        Args:
-            x (torch.Tensor): of shape [B, N, C]
-
-        Return:
-            x (torch.Tensor): of shape [B, C_out, T, H, W]
-        """
-
-        if input_type == "image":
-            N_t, N_h, N_w = [
-                self.img_input_size[i] // self.patch_size[i] for i in range(3)
-            ]
-        else:
-            N_t, N_h, N_w = [self.input_size[i] // self.patch_size[i] for i in range(3)]
-        T_p, H_p, W_p = self.patch_size
-        x = rearrange(
-            x,
-            "B (N_t N_h N_w) (T_p H_p W_p C_out) -> B C_out (N_t T_p) (N_h H_p) (N_w W_p)",
-            N_t=N_t,
-            N_h=N_h,
-            N_w=N_w,
-            T_p=T_p,
-            H_p=H_p,
-            W_p=W_p,
-            C_out=self.out_channels,
-        )
-        return x
-
-    def unpatchify_old(self, x):
-        c = self.out_channels
-        t, h, w = [self.input_size[i] // self.patch_size[i] for i in range(3)]
-        pt, ph, pw = self.patch_size
-
-        x = x.reshape(shape=(x.shape[0], t, h, w, pt, ph, pw, c))
-        x = rearrange(x, "n t h w r p q c -> n c t r h p w q")
-        imgs = x.reshape(shape=(x.shape[0], c, t * pt, h * ph, w * pw))
-        return imgs
-
-    def get_spatial_pos_embed(self, grid_size=None):
-        if grid_size is None:
-            grid_size = self.input_size[1:]
-        pos_embed = get_2d_sincos_pos_embed(
-            self.hidden_size,
-            (grid_size[0] // self.patch_size[1], grid_size[1] // self.patch_size[2]),
-            scale=self.space_scale,
-        )
-        pos_embed = (
-            torch.from_numpy(pos_embed).float().unsqueeze(0).requires_grad_(False)
-        )
-        return pos_embed
-
-    def get_temporal_pos_embed(self):
-        pos_embed = get_1d_sincos_pos_embed(
-            self.hidden_size,
-            self.input_size[0] // self.patch_size[0],
-            scale=self.time_scale,
-        )
-        pos_embed = (
-            torch.from_numpy(pos_embed).float().unsqueeze(0).requires_grad_(False)
-        )
-        return pos_embed
-
-    def freeze_not_temporal(self):
-        for n, p in self.named_parameters():
-            if "attn_temp" not in n:
-                p.requires_grad = False
-
-    def freeze_text(self):
-        for n, p in self.named_parameters():
-            if "cross_attn" in n:
-                p.requires_grad = False
-
-    def initialize_temporal(self):
-        for block in self.blocks:
-            nn.init.constant_(block.attn_temp.proj.weight, 0)
-            nn.init.constant_(block.attn_temp.proj.bias, 0)
-
-    def initialize_weights(self):
-        # Initialize transformer layers:
-        def _basic_init(module):
-            if isinstance(module, nn.Linear):
-                torch.nn.init.xavier_uniform_(module.weight)
-                if module.bias is not None:
-                    nn.init.constant_(module.bias, 0)
-
-        self.apply(_basic_init)
-
-        # Initialize patch_embed like nn.Linear (instead of nn.Conv2d):
-        w = self.x_embedder.proj.weight.data
-        nn.init.xavier_uniform_(w.view([w.shape[0], -1]))
-
-        # Initialize timestep embedding MLP:
-        nn.init.normal_(self.t_embedder.mlp[0].weight, std=0.02)
-        nn.init.normal_(self.t_embedder.mlp[2].weight, std=0.02)
-        nn.init.normal_(self.t_block[1].weight, std=0.02)
-        nn.init.normal_(self.t_block_temp[1].weight, std=0.02)
-
-        # Initialize caption embedding MLP:
-        nn.init.normal_(self.y_embedder.y_proj.fc1.weight, std=0.02)
-        nn.init.normal_(self.y_embedder.y_proj.fc2.weight, std=0.02)
-
-        # Zero-out adaLN modulation layers in PixArt blocks:
-        for block in self.blocks:
-            nn.init.constant_(block.cross_attn.proj.weight, 0)
-            nn.init.constant_(block.cross_attn.proj.bias, 0)
-
-        # Zero-out output layers:
-        nn.init.constant_(self.final_layer.linear.weight, 0)
-        nn.init.constant_(self.final_layer.linear.bias, 0)
-
-
-@MODELS.register_module("STDiT5-XL/2")
-def STDiT5_XL_2(from_pretrained=None, **kwargs):
-    config = STDiT5Config(
-        depth=28, hidden_size=1152, patch_size=(1, 2, 2), num_heads=16, **kwargs
-    )
-    model = STDiT5(config)
-    if from_pretrained is not None:
-        import os
-
-        if os.path.isdir(from_pretrained) or os.path.isfile(from_pretrained):
-            load_checkpoint(model, from_pretrained)
-
-    return model
diff --git a/videotuna/models/opensora/models/stdit/stdit6.py b/videotuna/models/opensora/models/stdit/stdit6.py
deleted file mode 100644
index 7f6750d5..00000000
--- a/videotuna/models/opensora/models/stdit/stdit6.py
+++ /dev/null
@@ -1,630 +0,0 @@
-# Concat the ar and fps to the text token embedding with Sincosin Positional Embedding
-
-import numpy as np
-import torch
-import torch.distributed as dist
-import torch.nn as nn
-from einops import rearrange
-from timm.models.layers import DropPath
-from timm.models.vision_transformer import Mlp
-from transformers import PretrainedConfig, PreTrainedModel
-
-from videotuna.models.opensora.acceleration.checkpoint import auto_grad_checkpoint
-from videotuna.models.opensora.acceleration.communications import (
-    gather_forward_split_backward,
-    split_forward_gather_backward,
-)
-from videotuna.models.opensora.acceleration.parallel_states import get_sequence_parallel_group
-from videotuna.models.opensora.models.layers.blocks import (
-    Attention,
-    CaptionEmbedder,
-    MultiHeadCrossAttention,
-    PatchEmbed3D,
-    PositionEmbedding2D,
-    SeqParallelAttention,
-    SeqParallelMultiHeadCrossAttention,
-    SizeEmbedder,
-    T2IFinalLayer,
-    TimestepEmbedder,
-    approx_gelu,
-    get_1d_sincos_pos_embed,
-    get_2d_sincos_pos_embed,
-    get_layernorm,
-    t2i_modulate,
-)
-from videotuna.models.opensora.registry import MODELS
-from videotuna.models.opensora.utils.ckpt_utils import load_checkpoint
-
-
-class STDiTBlock2(nn.Module):
-    def __init__(
-        self,
-        hidden_size,
-        num_heads,
-        mlp_ratio=4.0,
-        drop_path=0.0,
-        enable_flashattn=False,
-        enable_layernorm_kernel=False,
-        enable_sequence_parallelism=False,
-    ):
-        super().__init__()
-        self.hidden_size = hidden_size
-        self.enable_flashattn = enable_flashattn
-        self._enable_sequence_parallelism = enable_sequence_parallelism
-
-        if enable_sequence_parallelism:
-            self.attn_cls = SeqParallelAttention
-            self.mha_cls = SeqParallelMultiHeadCrossAttention
-        else:
-            self.attn_cls = Attention
-            self.mha_cls = MultiHeadCrossAttention
-
-        # spatial branch
-        self.norm1 = get_layernorm(
-            hidden_size, eps=1e-6, affine=False, use_kernel=enable_layernorm_kernel
-        )
-        self.attn = self.attn_cls(
-            hidden_size,
-            num_heads=num_heads,
-            qkv_bias=True,
-            enable_flashattn=enable_flashattn,
-        )
-        self.scale_shift_table = nn.Parameter(
-            torch.randn(6, hidden_size) / hidden_size**0.5
-        )
-
-        # cross attn
-        self.cross_attn = self.mha_cls(hidden_size, num_heads)
-
-        # mpl branch
-        self.norm2 = get_layernorm(
-            hidden_size, eps=1e-6, affine=False, use_kernel=enable_layernorm_kernel
-        )
-        self.mlp = Mlp(
-            in_features=hidden_size,
-            hidden_features=int(hidden_size * mlp_ratio),
-            act_layer=approx_gelu,
-            drop=0,
-        )
-        self.drop_path = DropPath(drop_path) if drop_path > 0.0 else nn.Identity()
-
-        # temporal attention
-        self.norm_temp = get_layernorm(
-            hidden_size, eps=1e-6, affine=False, use_kernel=enable_layernorm_kernel
-        )  # new
-        self.attn_temp = self.attn_cls(
-            hidden_size,
-            num_heads=num_heads,
-            qkv_bias=True,
-            enable_flashattn=self.enable_flashattn,
-        )
-        self.scale_shift_table_temporal = nn.Parameter(
-            torch.randn(3, hidden_size) / hidden_size**0.5
-        )  # new
-
-    def t_mask_select(self, x_mask, x, masked_x, T, S):
-        # x: [B, (T, S), C]
-        # mased_x: [B, (T, S), C]
-        # x_mask: [B, T]
-        x = rearrange(x, "B (T S) C -> B T S C", T=T, S=S)
-        masked_x = rearrange(masked_x, "B (T S) C -> B T S C", T=T, S=S)
-        x = torch.where(x_mask[:, :, None, None], x, masked_x)
-        x = rearrange(x, "B T S C -> B (T S) C")
-        return x
-
-    def forward(
-        self, x, y, t, tpe=None, mask=None, x_mask=None, t0=None, T=None, S=None
-    ):
-        B, N, C = x.shape
-
-        shift_msa, scale_msa, gate_msa, shift_mlp, scale_mlp, gate_mlp = (
-            self.scale_shift_table[None] + t.reshape(B, 6, -1)
-        ).chunk(6, dim=1)
-        if x_mask is not None:
-            (
-                shift_msa_zero,
-                scale_msa_zero,
-                gate_msa_zero,
-                shift_mlp_zero,
-                scale_mlp_zero,
-                gate_mlp_zero,
-            ) = (self.scale_shift_table[None] + t0.reshape(B, 6, -1)).chunk(6, dim=1)
-
-        # modulate
-        x_m = t2i_modulate(self.norm1(x), shift_msa, scale_msa)
-        if x_mask is not None:
-            x_m_zero = t2i_modulate(self.norm1(x), shift_msa_zero, scale_msa_zero)
-            x_m = self.t_mask_select(x_mask, x_m, x_m_zero, T, S)
-
-        # spatial branch
-        x_s = rearrange(x_m, "B (T S) C -> (B T) S C", T=T, S=S)
-        x_s = self.attn(x_s)
-        x_s = rearrange(x_s, "(B T) S C -> B (T S) C", T=T, S=S)
-        if x_mask is not None:
-            x_s_zero = gate_msa_zero * x_s
-            x_s = gate_msa * x_s
-            x_s = self.t_mask_select(x_mask, x_s, x_s_zero, T, S)
-        else:
-            x_s = gate_msa * x_s
-        x = x + self.drop_path(x_s)
-
-        # temporal branch
-        x_t = rearrange(x_m, "B (T S) C -> (B S) T C", T=T, S=S)
-        if tpe is not None:
-            x_t = x_t + tpe
-        x_t = self.attn_temp(x_t)
-        x_t = rearrange(x_t, "(B S) T C -> B (T S) C", T=T, S=S)
-        if x_mask is not None:
-            x_t_zero = gate_msa_zero * x_t
-            x_t = gate_msa * x_t
-            x_t = self.t_mask_select(x_mask, x_t, x_t_zero, T, S)
-        else:
-            x_t = gate_msa * x_t
-        x = x + self.drop_path(x_t)
-
-        # cross attn
-        x = x + self.cross_attn(x, y, mask)
-
-        # modulate
-        x_m = t2i_modulate(self.norm2(x), shift_mlp, scale_mlp)
-        if x_mask is not None:
-            x_m_zero = t2i_modulate(self.norm2(x), shift_mlp_zero, scale_mlp_zero)
-            x_m = self.t_mask_select(x_mask, x_m, x_m_zero, T, S)
-
-        # mlp
-        x_mlp = self.mlp(x_m)
-        if x_mask is not None:
-            x_mlp_zero = gate_mlp_zero * x_mlp
-            x_mlp = gate_mlp * x_mlp
-            x_mlp = self.t_mask_select(x_mask, x_mlp, x_mlp_zero, T, S)
-        else:
-            x_mlp = gate_mlp * x_mlp
-
-        x = x + self.drop_path(x_mlp)
-
-        return x
-
-
-class STDiT6Config(PretrainedConfig):
-
-    model_type = "STDiT6"
-
-    def __init__(
-        self,
-        input_size=(None, None, None),
-        input_sq_size=32,
-        in_channels=4,
-        patch_size=(1, 2, 2),
-        hidden_size=1152,
-        depth=28,
-        num_heads=16,
-        mlp_ratio=4.0,
-        class_dropout_prob=0.1,
-        pred_sigma=True,
-        drop_path=0.0,
-        no_temporal_pos_emb=False,
-        caption_channels=4096,
-        model_max_length=120,
-        freeze=None,
-        qk_norm=False,
-        enable_flash_attn=False,
-        enable_layernorm_kernel=False,
-        enable_sequence_parallelism=False,
-        **kwargs,
-    ):
-        self.input_size = input_size
-        self.input_sq_size = input_sq_size
-        self.in_channels = in_channels
-        self.patch_size = patch_size
-        self.hidden_size = hidden_size
-        self.depth = depth
-        self.num_heads = num_heads
-        self.mlp_ratio = mlp_ratio
-        self.class_dropout_prob = class_dropout_prob
-        self.pred_sigma = pred_sigma
-        self.drop_path = drop_path
-        self.no_temporal_pos_emb = no_temporal_pos_emb
-        self.caption_channels = caption_channels
-        self.model_max_length = model_max_length
-        self.freeze = freeze
-        self.qk_norm = qk_norm
-        self.enable_flash_attn = enable_flash_attn
-        self.enable_layernorm_kernel = enable_layernorm_kernel
-        self.enable_sequence_parallelism = enable_sequence_parallelism
-        super().__init__(**kwargs)
-
-
-@MODELS.register_module()
-class STDiT6(nn.Module):
-
-    config_class = STDiT6Config
-
-    def __init__(self, config):
-        super().__init__()
-        self.dtype = config.dtype
-        self.pred_sigma = config.pred_sigma
-        self.in_channels = config.in_channels
-        self.out_channels = (
-            config.in_channels * 2 if config.pred_sigma else config.in_channels
-        )
-        self.hidden_size = config.hidden_size
-        self.num_heads = config.num_heads
-        self.no_temporal_pos_emb = config.no_temporal_pos_emb
-        self.depth = config.depth
-        self.mlp_ratio = config.mlp_ratio
-        self.enable_flashattn = config.enable_flashattn
-        self.enable_layernorm_kernel = config.enable_layernorm_kernel
-
-        # support dynamic input
-        self.patch_size = config.patch_size
-        self.input_size = config.input_size
-        self.input_sq_size = config.input_sq_size
-        self.pos_embed = PositionEmbedding2D(config.hidden_size)
-
-        self.x_embedder = PatchEmbed3D(
-            config.patch_size, config.in_channels, config.hidden_size
-        )
-        self.t_embedder = TimestepEmbedder(config.hidden_size)
-        self.t_block = nn.Sequential(
-            nn.SiLU(), nn.Linear(config.hidden_size, 6 * config.hidden_size, bias=True)
-        )
-        self.t_block_temp = nn.Sequential(
-            nn.SiLU(), nn.Linear(config.hidden_size, 3 * config.hidden_size, bias=True)
-        )
-        self.y_embedder = CaptionEmbedder(
-            in_channels=config.caption_channels,
-            hidden_size=config.hidden_size,
-            uncond_prob=config.class_dropout_prob,
-            act_layer=approx_gelu,
-            token_num=config.model_max_length,
-        )
-
-        self.space_scale = config.space_scale
-        self.time_scale = config.time_scale
-
-        drop_path = [
-            x.item() for x in torch.linspace(0, config.drop_path, config.depth)
-        ]
-        self.blocks = nn.ModuleList(
-            [
-                STDiTBlock2(
-                    self.hidden_size,
-                    self.num_heads,
-                    mlp_ratio=self.mlp_ratio,
-                    drop_path=drop_path[i],
-                    enable_flashattn=self.enable_flashattn,
-                    enable_layernorm_kernel=self.enable_layernorm_kernel,
-                    enable_sequence_parallelism=config.enable_sequence_parallelism,
-                )
-                for i in range(self.depth)
-            ]
-        )
-        self.final_layer = T2IFinalLayer(
-            config.hidden_size, np.prod(self.patch_size), self.out_channels
-        )
-
-        # multi_res
-        assert self.hidden_size % 3 == 0, "hidden_size must be divisible by 3"
-        self.csize_embedder = SizeEmbedder(self.hidden_size // 3)
-        self.ar_embedder = SizeEmbedder(self.hidden_size // 3)
-        self.fl_embedder = SizeEmbedder(self.hidden_size)  # new
-        self.fps_embedder = SizeEmbedder(self.hidden_size)  # new
-
-        # init model
-        self.initialize_weights()
-        self.initialize_temporal()
-        if config.freeze is not None:
-            assert config.freeze in ["not_temporal", "text"]
-            if config.freeze == "not_temporal":
-                self.freeze_not_temporal()
-            elif config.freeze == "text":
-                self.freeze_text()
-
-        # sequence parallel related configs
-        self.enable_sequence_parallelism = config.enable_sequence_parallelism
-        if config.enable_sequence_parallelism:
-            self.sp_rank = dist.get_rank(get_sequence_parallel_group())
-        else:
-            self.sp_rank = None
-
-    def get_dynamic_size(self, x):
-        _, _, T, H, W = x.size()
-        if T % self.patch_size[0] != 0:
-            T += self.patch_size[0] - T % self.patch_size[0]
-        if H % self.patch_size[1] != 0:
-            H += self.patch_size[1] - H % self.patch_size[1]
-        if W % self.patch_size[2] != 0:
-            W += self.patch_size[2] - W % self.patch_size[2]
-        T = T // self.patch_size[0]
-        H = H // self.patch_size[1]
-        W = W // self.patch_size[2]
-        return (T, H, W)
-
-    def forward(
-        self,
-        x,
-        timestep,
-        y,
-        mask=None,
-        x_mask=None,
-        num_frames=None,
-        height=None,
-        width=None,
-        ar=None,
-        fps=None,
-    ):
-        """
-        Forward pass of STDiT.
-        Args:
-            x (torch.Tensor): latent representation of video; of shape [B, C, T, H, W]
-            timestep (torch.Tensor): diffusion time steps; of shape [B]
-            y (torch.Tensor): representation of prompts; of shape [B, 1, N_token, C]
-            mask (torch.Tensor): mask for selecting prompt tokens; of shape [B, N_token]
-
-        Returns:
-            x (torch.Tensor): output latent representation; of shape [B, C, T, H, W]
-        """
-        B = x.shape[0]
-        x = x.to(self.dtype)
-        timestep = timestep.to(self.dtype)
-        y = y.to(self.dtype)
-
-        # == process data info ==
-        # 1. get dynamic size
-        hw = torch.cat([height[:, None], width[:, None]], dim=1)
-        rs = (height[0].item() * width[0].item()) ** 0.5
-        csize = self.csize_embedder(hw, B)  # [B, C/2]
-
-        # 2. get aspect ratio
-        ar = ar.unsqueeze(1)
-        ar = self.ar_embedder(ar, B)  # [B, C/2]
-        data_info = torch.cat([csize, ar], dim=1)  # [B, C]
-
-        # 3. get number of frames
-        fl = num_frames.unsqueeze(1)
-        fps = fps.unsqueeze(1)
-        fl = self.fl_embedder(fl, B)
-        fl = fl + self.fps_embedder(fps, B)  # [B, 2]
-
-        # === get dynamic shape size ===
-        _, _, Tx, Hx, Wx = x.size()
-        T, H, W = self.get_dynamic_size(x)
-        S = H * W
-        scale = rs / self.input_sq_size
-        base_size = round(S**0.5)
-        pos_emb = self.pos_embed(x, H, W, scale=scale, base_size=base_size)
-
-        # embedding
-        x = self.x_embedder(x)  # [B, N, C]
-        x = rearrange(x, "B (T S) C -> B T S C", T=T, S=S)
-        x = x + pos_emb
-        x = rearrange(x, "B T S C -> B (T S) C")
-
-        # shard over the sequence dim if sp is enabled
-        if self.enable_sequence_parallelism:
-            x = split_forward_gather_backward(
-                x, get_sequence_parallel_group(), dim=1, grad_scale="down"
-            )
-
-        t = self.t_embedder(timestep, dtype=x.dtype)  # [B, C]
-        t_emb = self.t_block(t)  # [B, C]
-        if x_mask is not None:
-            t0_timestep = torch.zeros_like(timestep)
-            t0 = self.t_embedder(t0_timestep, dtype=x.dtype)
-            t0_emb = self.t_block(t0)
-        else:
-            t0 = None
-            t0_emb = None
-
-        # prepare y
-        y = self.y_embedder(y, self.training)  # [B, 1, N_token, C]
-        data_info = data_info.unsqueeze(1).unsqueeze(1)
-        fl = fl.unsqueeze(1).unsqueeze(1)
-        y = torch.cat([y, data_info, fl], dim=2)  # [B, 1, N_token+2, C]
-        b = y.shape[0]
-        n_token = y.shape[2]
-        mask = torch.ones((b, n_token), device=y.device, dtype=torch.int64)
-
-        if mask is not None:
-            if mask.shape[0] != y.shape[0]:
-                mask = mask.repeat(y.shape[0] // mask.shape[0], 1)
-            mask = mask.squeeze(1).squeeze(1)
-            y = (
-                y.squeeze(1)
-                .masked_select(mask.unsqueeze(-1) != 0)
-                .view(1, -1, x.shape[-1])
-            )
-            y_lens = mask.sum(dim=1).tolist()
-        else:
-            y_lens = [y.shape[2]] * y.shape[0]
-            y = y.squeeze(1).view(1, -1, x.shape[-1])
-
-        # blocks
-        for i, block in enumerate(self.blocks):
-            if i == 0:
-                tpe = get_1d_sincos_pos_embed(
-                    self.hidden_size, T, scale=self.time_scale
-                )
-                tpe = (
-                    torch.from_numpy(tpe)
-                    .unsqueeze(0)
-                    .requires_grad_(False)
-                    .to(x.device, x.dtype)
-                )
-            else:
-                tpe = None
-            x = auto_grad_checkpoint(
-                block,
-                x,
-                y,
-                t_emb,
-                tpe,
-                y_lens,
-                x_mask,
-                t0_emb,
-                T,
-                S,
-            )
-            # x.shape: [B, N, C]
-
-        # final process
-        x = self.final_layer(
-            x, t, x_mask, t0, T, S
-        )  # [B, N, C=T_p * H_p * W_p * C_out]
-        x = self.unpatchify(x, T, H, W, Tx, Hx, Wx)  # [B, C_out, T, H, W]
-
-        # cast to float32 for better accuracy
-        x = x.to(torch.float32)
-        return x
-
-    def unpatchify(self, x, N_t, N_h, N_w, R_t, R_h, R_w):
-        """
-        Args:
-            x (torch.Tensor): of shape [B, N, C]
-
-        Return:
-            x (torch.Tensor): of shape [B, C_out, T, H, W]
-        """
-
-        # N_t, N_h, N_w = [self.input_size[i] // self.patch_size[i] for i in range(3)]
-        T_p, H_p, W_p = self.patch_size
-        x = rearrange(
-            x,
-            "B (N_t N_h N_w) (T_p H_p W_p C_out) -> B C_out (N_t T_p) (N_h H_p) (N_w W_p)",
-            N_t=N_t,
-            N_h=N_h,
-            N_w=N_w,
-            T_p=T_p,
-            H_p=H_p,
-            W_p=W_p,
-            C_out=self.out_channels,
-        )
-        # unpad
-        x = x[:, :, :R_t, :R_h, :R_w]
-        return x
-
-    def unpatchify_lastest(self, x, input_type=None):
-        """
-        Args:
-            x (torch.Tensor): of shape [B, N, C]
-
-        Return:
-            x (torch.Tensor): of shape [B, C_out, T, H, W]
-        """
-
-        if input_type == "image":
-            N_t, N_h, N_w = [
-                self.img_input_size[i] // self.patch_size[i] for i in range(3)
-            ]
-        else:
-            N_t, N_h, N_w = [self.input_size[i] // self.patch_size[i] for i in range(3)]
-        T_p, H_p, W_p = self.patch_size
-        x = rearrange(
-            x,
-            "B (N_t N_h N_w) (T_p H_p W_p C_out) -> B C_out (N_t T_p) (N_h H_p) (N_w W_p)",
-            N_t=N_t,
-            N_h=N_h,
-            N_w=N_w,
-            T_p=T_p,
-            H_p=H_p,
-            W_p=W_p,
-            C_out=self.out_channels,
-        )
-        return x
-
-    def unpatchify_old(self, x):
-        c = self.out_channels
-        t, h, w = [self.input_size[i] // self.patch_size[i] for i in range(3)]
-        pt, ph, pw = self.patch_size
-
-        x = x.reshape(shape=(x.shape[0], t, h, w, pt, ph, pw, c))
-        x = rearrange(x, "n t h w r p q c -> n c t r h p w q")
-        imgs = x.reshape(shape=(x.shape[0], c, t * pt, h * ph, w * pw))
-        return imgs
-
-    def get_spatial_pos_embed(self, grid_size=None):
-        if grid_size is None:
-            grid_size = self.input_size[1:]
-        pos_embed = get_2d_sincos_pos_embed(
-            self.hidden_size,
-            (grid_size[0] // self.patch_size[1], grid_size[1] // self.patch_size[2]),
-            scale=self.space_scale,
-        )
-        pos_embed = (
-            torch.from_numpy(pos_embed).float().unsqueeze(0).requires_grad_(False)
-        )
-        return pos_embed
-
-    def get_temporal_pos_embed(self):
-        pos_embed = get_1d_sincos_pos_embed(
-            self.hidden_size,
-            self.input_size[0] // self.patch_size[0],
-            scale=self.time_scale,
-        )
-        pos_embed = (
-            torch.from_numpy(pos_embed).float().unsqueeze(0).requires_grad_(False)
-        )
-        return pos_embed
-
-    def freeze_not_temporal(self):
-        for n, p in self.named_parameters():
-            if "attn_temp" not in n:
-                p.requires_grad = False
-
-    def freeze_text(self):
-        for n, p in self.named_parameters():
-            if "cross_attn" in n:
-                p.requires_grad = False
-
-    def initialize_temporal(self):
-        for block in self.blocks:
-            nn.init.constant_(block.attn_temp.proj.weight, 0)
-            nn.init.constant_(block.attn_temp.proj.bias, 0)
-
-    def initialize_weights(self):
-        # Initialize transformer layers:
-        def _basic_init(module):
-            if isinstance(module, nn.Linear):
-                torch.nn.init.xavier_uniform_(module.weight)
-                if module.bias is not None:
-                    nn.init.constant_(module.bias, 0)
-
-        self.apply(_basic_init)
-
-        # Initialize patch_embed like nn.Linear (instead of nn.Conv2d):
-        w = self.x_embedder.proj.weight.data
-        nn.init.xavier_uniform_(w.view([w.shape[0], -1]))
-
-        # Initialize timestep embedding MLP:
-        nn.init.normal_(self.t_embedder.mlp[0].weight, std=0.02)
-        nn.init.normal_(self.t_embedder.mlp[2].weight, std=0.02)
-        nn.init.normal_(self.t_block[1].weight, std=0.02)
-        nn.init.normal_(self.t_block_temp[1].weight, std=0.02)
-
-        # Initialize caption embedding MLP:
-        nn.init.normal_(self.y_embedder.y_proj.fc1.weight, std=0.02)
-        nn.init.normal_(self.y_embedder.y_proj.fc2.weight, std=0.02)
-
-        # Zero-out adaLN modulation layers in PixArt blocks:
-        for block in self.blocks:
-            nn.init.constant_(block.cross_attn.proj.weight, 0)
-            nn.init.constant_(block.cross_attn.proj.bias, 0)
-
-        # Zero-out output layers:
-        nn.init.constant_(self.final_layer.linear.weight, 0)
-        nn.init.constant_(self.final_layer.linear.bias, 0)
-
-
-@MODELS.register_module("STDiT6-XL/2")
-def STDiT6_XL_2(from_pretrained=None, **kwargs):
-    config = STDiT6Config(
-        depth=28, hidden_size=1152, patch_size=(1, 2, 2), num_heads=16, **kwargs
-    )
-    model = STDiT6(config)
-    if from_pretrained is not None:
-        import os
-
-        if os.path.isdir(from_pretrained) or os.path.isfile(from_pretrained):
-            load_checkpoint(model, from_pretrained)
-
-    return model
diff --git a/videotuna/models/opensora/models/stdit/stdit7.py b/videotuna/models/opensora/models/stdit/stdit7.py
deleted file mode 100644
index 69db52f8..00000000
--- a/videotuna/models/opensora/models/stdit/stdit7.py
+++ /dev/null
@@ -1,661 +0,0 @@
-# Add the ar and fps to the time embedding with Sincosin Positional Embedding
-
-import numpy as np
-import torch
-import torch.distributed as dist
-import torch.nn as nn
-from einops import rearrange
-from rotary_embedding_torch import RotaryEmbedding
-from timm.models.layers import DropPath
-from timm.models.vision_transformer import Mlp
-from transformers import PretrainedConfig, PreTrainedModel
-
-from videotuna.models.opensora.acceleration.checkpoint import auto_grad_checkpoint
-from videotuna.models.opensora.acceleration.communications import (
-    gather_forward_split_backward,
-    split_forward_gather_backward,
-)
-from videotuna.models.opensora.acceleration.parallel_states import get_sequence_parallel_group
-from videotuna.models.opensora.models.layers.blocks import (
-    Attention,
-    CaptionEmbedder,
-    MultiHeadCrossAttention,
-    PatchEmbed3D,
-    PositionEmbedding2D,
-    SeqParallelAttention,
-    SeqParallelMultiHeadCrossAttention,
-    SizeEmbedder,
-    T2IFinalLayer,
-    TimestepEmbedder,
-    approx_gelu,
-    get_1d_sincos_pos_embed,
-    get_2d_sincos_pos_embed,
-    get_layernorm,
-    t2i_modulate,
-)
-from videotuna.models.opensora.registry import MODELS
-from videotuna.models.opensora.utils.ckpt_utils import load_checkpoint
-
-
-class STDiTBlock2(nn.Module):
-    def __init__(
-        self,
-        hidden_size,
-        num_heads,
-        mlp_ratio=4.0,
-        drop_path=0.0,
-        enable_flashattn=False,
-        enable_layernorm_kernel=False,
-        enable_sequence_parallelism=False,
-        rope=None,
-    ):
-        super().__init__()
-        self.hidden_size = hidden_size
-        self.enable_flashattn = enable_flashattn
-        self._enable_sequence_parallelism = enable_sequence_parallelism
-
-        if enable_sequence_parallelism:
-            self.attn_cls = SeqParallelAttention
-            self.mha_cls = SeqParallelMultiHeadCrossAttention
-        else:
-            self.attn_cls = Attention
-            self.mha_cls = MultiHeadCrossAttention
-
-        # spatial branch
-        self.norm1 = get_layernorm(
-            hidden_size, eps=1e-6, affine=False, use_kernel=enable_layernorm_kernel
-        )
-        self.attn = self.attn_cls(
-            hidden_size,
-            num_heads=num_heads,
-            qkv_bias=True,
-            enable_flashattn=enable_flashattn,
-        )
-        self.scale_shift_table = nn.Parameter(
-            torch.randn(6, hidden_size) / hidden_size**0.5
-        )
-
-        # cross attn
-        self.cross_attn = self.mha_cls(hidden_size, num_heads)
-
-        # mpl branch
-        self.norm2 = get_layernorm(
-            hidden_size, eps=1e-6, affine=False, use_kernel=enable_layernorm_kernel
-        )
-        self.mlp = Mlp(
-            in_features=hidden_size,
-            hidden_features=int(hidden_size * mlp_ratio),
-            act_layer=approx_gelu,
-            drop=0,
-        )
-        self.drop_path = DropPath(drop_path) if drop_path > 0.0 else nn.Identity()
-
-        # temporal attention
-        self.norm_temp = get_layernorm(
-            hidden_size, eps=1e-6, affine=False, use_kernel=enable_layernorm_kernel
-        )  # new
-        self.attn_temp = self.attn_cls(
-            hidden_size,
-            num_heads=num_heads,
-            qkv_bias=True,
-            enable_flashattn=self.enable_flashattn,
-            rope=rope,
-        )
-        self.scale_shift_table_temporal = nn.Parameter(
-            torch.randn(3, hidden_size) / hidden_size**0.5
-        )  # new
-
-    def t_mask_select(self, x_mask, x, masked_x, T, S):
-        # x: [B, (T, S), C]
-        # mased_x: [B, (T, S), C]
-        # x_mask: [B, T]
-        x = rearrange(x, "B (T S) C -> B T S C", T=T, S=S)
-        masked_x = rearrange(masked_x, "B (T S) C -> B T S C", T=T, S=S)
-        x = torch.where(x_mask[:, :, None, None], x, masked_x)
-        x = rearrange(x, "B T S C -> B (T S) C")
-        return x
-
-    def forward(
-        self,
-        x,
-        y,
-        tpe,
-        t,
-        t_tmp,
-        mask=None,
-        x_mask=None,
-        t0=None,
-        t0_tmp=None,
-        T=None,
-        S=None,
-    ):
-        B, N, C = x.shape
-
-        shift_msa, scale_msa, gate_msa, shift_mlp, scale_mlp, gate_mlp = (
-            self.scale_shift_table[None] + t.reshape(B, 6, -1)
-        ).chunk(6, dim=1)
-        shift_tmp, scale_tmp, gate_tmp = (
-            self.scale_shift_table_temporal[None] + t_tmp.reshape(B, 3, -1)
-        ).chunk(3, dim=1)
-        if x_mask is not None:
-            (
-                shift_msa_zero,
-                scale_msa_zero,
-                gate_msa_zero,
-                shift_mlp_zero,
-                scale_mlp_zero,
-                gate_mlp_zero,
-            ) = (self.scale_shift_table[None] + t0.reshape(B, 6, -1)).chunk(6, dim=1)
-            shift_tmp_zero, scale_tmp_zero, gate_tmp_zero = (
-                self.scale_shift_table_temporal[None] + t0_tmp.reshape(B, 3, -1)
-            ).chunk(3, dim=1)
-
-        # modulate
-        x_m = t2i_modulate(self.norm1(x), shift_msa, scale_msa)
-        if x_mask is not None:
-            x_m_zero = t2i_modulate(self.norm1(x), shift_msa_zero, scale_msa_zero)
-            x_m = self.t_mask_select(x_mask, x_m, x_m_zero, T, S)
-
-        # spatial branch
-        x_s = rearrange(x_m, "B (T S) C -> (B T) S C", T=T, S=S)
-        x_s = self.attn(x_s)
-        x_s = rearrange(x_s, "(B T) S C -> B (T S) C", T=T, S=S)
-        if x_mask is not None:
-            x_s_zero = gate_msa_zero * x_s
-            x_s = gate_msa * x_s
-            x_s = self.t_mask_select(x_mask, x_s, x_s_zero, T, S)
-        else:
-            x_s = gate_msa * x_s
-        x = x + self.drop_path(x_s)
-
-        # modulate
-        x_m = t2i_modulate(self.norm_temp(x), shift_tmp, scale_tmp)
-        if x_mask is not None:
-            x_m_zero = t2i_modulate(self.norm_temp(x), shift_tmp_zero, scale_tmp_zero)
-            x_m = self.t_mask_select(x_mask, x_m, x_m_zero, T, S)
-
-        # temporal branch
-        x_t = rearrange(x_m, "B (T S) C -> (B S) T C", T=T, S=S)
-        if tpe is not None:
-            x_t = x_t + tpe
-        x_t = self.attn_temp(x_t)
-        x_t = rearrange(x_t, "(B S) T C -> B (T S) C", T=T, S=S)
-        if x_mask is not None:
-            x_t_zero = gate_tmp_zero * x_t
-            x_t = gate_tmp * x_t
-            x_t = self.t_mask_select(x_mask, x_t, x_t_zero, T, S)
-        else:
-            x_t = gate_tmp * x_t
-        x = x + self.drop_path(x_t)
-
-        # cross attn
-        x = x + self.cross_attn(x, y, mask)
-
-        # modulate
-        x_m = t2i_modulate(self.norm2(x), shift_mlp, scale_mlp)
-        if x_mask is not None:
-            x_m_zero = t2i_modulate(self.norm2(x), shift_mlp_zero, scale_mlp_zero)
-            x_m = self.t_mask_select(x_mask, x_m, x_m_zero, T, S)
-
-        # mlp
-        x_mlp = self.mlp(x_m)
-        if x_mask is not None:
-            x_mlp_zero = gate_mlp_zero * x_mlp
-            x_mlp = gate_mlp * x_mlp
-            x_mlp = self.t_mask_select(x_mask, x_mlp, x_mlp_zero, T, S)
-        else:
-            x_mlp = gate_mlp * x_mlp
-        x = x + self.drop_path(x_mlp)
-
-        return x
-
-
-class STDiT7Config(PretrainedConfig):
-
-    model_type = "STDiT7"
-
-    def __init__(
-        self,
-        input_size=(None, None, None),
-        input_sq_size=32,
-        in_channels=4,
-        patch_size=(1, 2, 2),
-        hidden_size=1152,
-        depth=28,
-        num_heads=16,
-        mlp_ratio=4.0,
-        class_dropout_prob=0.1,
-        pred_sigma=True,
-        drop_path=0.0,
-        no_temporal_pos_emb=False,
-        caption_channels=4096,
-        model_max_length=120,
-        freeze=None,
-        qk_norm=False,
-        enable_flash_attn=False,
-        enable_layernorm_kernel=False,
-        enable_sequence_parallelism=False,
-        **kwargs,
-    ):
-        self.input_size = input_size
-        self.input_sq_size = input_sq_size
-        self.in_channels = in_channels
-        self.patch_size = patch_size
-        self.hidden_size = hidden_size
-        self.depth = depth
-        self.num_heads = num_heads
-        self.mlp_ratio = mlp_ratio
-        self.class_dropout_prob = class_dropout_prob
-        self.pred_sigma = pred_sigma
-        self.drop_path = drop_path
-        self.no_temporal_pos_emb = no_temporal_pos_emb
-        self.caption_channels = caption_channels
-        self.model_max_length = model_max_length
-        self.freeze = freeze
-        self.qk_norm = qk_norm
-        self.enable_flash_attn = enable_flash_attn
-        self.enable_layernorm_kernel = enable_layernorm_kernel
-        self.enable_sequence_parallelism = enable_sequence_parallelism
-        super().__init__(**kwargs)
-
-
-@MODELS.register_module()
-class STDiT7(nn.Module):
-
-    config_class = STDiT7Config
-
-    def __init__(self, config):
-        super().__init__()
-        self.dtype = config.dtype
-        self.pred_sigma = config.pred_sigma
-        self.in_channels = config.in_channels
-        self.out_channels = (
-            config.in_channels * 2 if config.pred_sigma else config.in_channels
-        )
-        self.hidden_size = config.hidden_size
-        self.num_heads = config.num_heads
-        self.no_temporal_pos_emb = config.no_temporal_pos_emb
-        self.depth = config.depth
-        self.mlp_ratio = config.mlp_ratio
-        self.enable_flashattn = config.enable_flashattn
-        self.enable_layernorm_kernel = config.enable_layernorm_kernel
-
-        # support dynamic input
-        self.patch_size = config.patch_size
-        self.input_size = config.input_size
-        self.input_sq_size = config.input_sq_size
-        self.pos_embed = PositionEmbedding2D(config.hidden_size)
-
-        self.x_embedder = PatchEmbed3D(
-            config.patch_size, config.in_channels, config.hidden_size
-        )
-        self.t_embedder = TimestepEmbedder(config.hidden_size)
-        self.t_block = nn.Sequential(
-            nn.SiLU(), nn.Linear(config.hidden_size, 6 * config.hidden_size, bias=True)
-        )
-        self.t_block_temp = nn.Sequential(
-            nn.SiLU(), nn.Linear(config.hidden_size, 3 * config.hidden_size, bias=True)
-        )
-        self.y_embedder = CaptionEmbedder(
-            in_channels=config.caption_channels,
-            hidden_size=config.hidden_size,
-            uncond_prob=config.class_dropout_prob,
-            act_layer=approx_gelu,
-            token_num=config.model_max_length,
-        )
-
-        self.space_scale = config.space_scale
-        self.time_scale = config.time_scale
-
-        drop_path = [
-            x.item() for x in torch.linspace(0, config.drop_path, config.depth)
-        ]
-        self.blocks = nn.ModuleList(
-            [
-                STDiTBlock2(
-                    self.hidden_size,
-                    self.num_heads,
-                    mlp_ratio=self.mlp_ratio,
-                    drop_path=drop_path[i],
-                    enable_flashattn=self.enable_flashattn,
-                    enable_layernorm_kernel=self.enable_layernorm_kernel,
-                    enable_sequence_parallelism=config.enable_sequence_parallelism,
-                    rope=None,
-                )
-                for i in range(self.depth)
-            ]
-        )
-        self.final_layer = T2IFinalLayer(
-            config.hidden_size, np.prod(self.patch_size), self.out_channels
-        )
-
-        # multi_res
-        assert self.hidden_size % 3 == 0, "hidden_size must be divisible by 3"
-        self.csize_embedder = SizeEmbedder(self.hidden_size // 3)
-        self.ar_embedder = SizeEmbedder(self.hidden_size // 3)
-        self.fl_embedder = SizeEmbedder(self.hidden_size)  # new
-        self.fps_embedder = SizeEmbedder(self.hidden_size)  # new
-
-        # init model
-        self.initialize_weights()
-        self.initialize_temporal()
-        if config.freeze is not None:
-            assert config.freeze in ["not_temporal", "text"]
-            if config.freeze == "not_temporal":
-                self.freeze_not_temporal()
-            elif config.freeze == "text":
-                self.freeze_text()
-
-        # sequence parallel related configs
-        self.enable_sequence_parallelism = config.enable_sequence_parallelism
-        if config.enable_sequence_parallelism:
-            self.sp_rank = dist.get_rank(get_sequence_parallel_group())
-        else:
-            self.sp_rank = None
-
-    def get_dynamic_size(self, x):
-        _, _, T, H, W = x.size()
-        if T % self.patch_size[0] != 0:
-            T += self.patch_size[0] - T % self.patch_size[0]
-        if H % self.patch_size[1] != 0:
-            H += self.patch_size[1] - H % self.patch_size[1]
-        if W % self.patch_size[2] != 0:
-            W += self.patch_size[2] - W % self.patch_size[2]
-        T = T // self.patch_size[0]
-        H = H // self.patch_size[1]
-        W = W // self.patch_size[2]
-        return (T, H, W)
-
-    def forward(
-        self,
-        x,
-        timestep,
-        y,
-        mask=None,
-        x_mask=None,
-        num_frames=None,
-        height=None,
-        width=None,
-        ar=None,
-        fps=None,
-    ):
-        """
-        Forward pass of STDiT.
-        Args:
-            x (torch.Tensor): latent representation of video; of shape [B, C, T, H, W]
-            timestep (torch.Tensor): diffusion time steps; of shape [B]
-            y (torch.Tensor): representation of prompts; of shape [B, 1, N_token, C]
-            mask (torch.Tensor): mask for selecting prompt tokens; of shape [B, N_token]
-
-        Returns:
-            x (torch.Tensor): output latent representation; of shape [B, C, T, H, W]
-        """
-        B = x.shape[0]
-        x = x.to(self.dtype)
-        timestep = timestep.to(self.dtype)
-        y = y.to(self.dtype)
-
-        # == process data info ==
-        # 1. get dynamic size
-        hw = torch.cat([height[:, None], width[:, None]], dim=1)
-        rs = (height[0].item() * width[0].item()) ** 0.5
-        csize = self.csize_embedder(hw, B)
-
-        # 2. get aspect ratio
-        ar = ar.unsqueeze(1)
-        ar = self.ar_embedder(ar, B)
-        data_info = torch.cat([csize, ar], dim=1)
-
-        # 3. get number of frames
-        fl = num_frames.unsqueeze(1)
-        fps = fps.unsqueeze(1)
-        fl = self.fl_embedder(fl, B)
-        fl = fl + self.fps_embedder(fps, B)
-
-        # === get dynamic shape size ===
-        _, _, Tx, Hx, Wx = x.size()
-        T, H, W = self.get_dynamic_size(x)
-        S = H * W
-        scale = rs / self.input_sq_size
-        base_size = round(S**0.5)
-        pos_emb = self.pos_embed(x, H, W, scale=scale, base_size=base_size)
-
-        # embedding
-        x = self.x_embedder(x)  # [B, N, C]
-        x = rearrange(x, "B (T S) C -> B T S C", T=T, S=S)
-        x = x + pos_emb
-        x = rearrange(x, "B T S C -> B (T S) C")
-
-        # shard over the sequence dim if sp is enabled
-        if self.enable_sequence_parallelism:
-            x = split_forward_gather_backward(
-                x, get_sequence_parallel_group(), dim=1, grad_scale="down"
-            )
-
-        t = self.t_embedder(timestep, dtype=x.dtype)  # [B, C]
-        t_spc = t + data_info  # [B, C]
-        t_tmp = t + fl  # [B, C]
-        t_emb = self.t_block(t_tmp)  # [B, C]
-        t_spc_mlp = self.t_block(t_spc)  # [B, 6*C]
-        t_tmp_mlp = self.t_block_temp(t_tmp)  # [B, 3*C]
-        if x_mask is not None:
-            t0_timestep = torch.zeros_like(timestep)
-            t0 = self.t_embedder(t0_timestep, dtype=x.dtype)
-            t0_spc = t0 + data_info
-            t0_tmp = t0 + fl
-            t0_spc_mlp = self.t_block(t0_spc)
-            t0_tmp_mlp = self.t_block_temp(t0_tmp)
-        else:
-            t0_spc = None
-            t0_tmp = None
-            t0_spc_mlp = None
-            t0_tmp_mlp = None
-
-        # prepare y
-        y = self.y_embedder(y, self.training)  # [B, 1, N_token, C]
-
-        if mask is not None:
-            if mask.shape[0] != y.shape[0]:
-                mask = mask.repeat(y.shape[0] // mask.shape[0], 1)
-            mask = mask.squeeze(1).squeeze(1)
-            y = (
-                y.squeeze(1)
-                .masked_select(mask.unsqueeze(-1) != 0)
-                .view(1, -1, x.shape[-1])
-            )
-            y_lens = mask.sum(dim=1).tolist()
-        else:
-            y_lens = [y.shape[2]] * y.shape[0]
-            y = y.squeeze(1).view(1, -1, x.shape[-1])
-
-        # blocks
-        for i, block in enumerate(self.blocks):
-            if i == 0:
-                tpe = get_1d_sincos_pos_embed(
-                    self.hidden_size, T, scale=self.time_scale
-                )
-                tpe = (
-                    torch.from_numpy(tpe)
-                    .unsqueeze(0)
-                    .requires_grad_(False)
-                    .to(x.device, x.dtype)
-                )
-            else:
-                tpe = None
-            x = auto_grad_checkpoint(
-                block,
-                x,
-                y,
-                tpe,
-                t_spc_mlp,
-                t_tmp_mlp,
-                y_lens,
-                x_mask,
-                t0_spc_mlp,
-                t0_tmp_mlp,
-                T,
-                S,
-            )
-            # x.shape: [B, N, C]
-
-        # final process
-        x = self.final_layer(
-            x, t, x_mask, t0_spc, T, S
-        )  # [B, N, C=T_p * H_p * W_p * C_out]
-        x = self.unpatchify(x, T, H, W, Tx, Hx, Wx)  # [B, C_out, T, H, W]
-
-        # cast to float32 for better accuracy
-        x = x.to(torch.float32)
-        return x
-
-    def unpatchify(self, x, N_t, N_h, N_w, R_t, R_h, R_w):
-        """
-        Args:
-            x (torch.Tensor): of shape [B, N, C]
-
-        Return:
-            x (torch.Tensor): of shape [B, C_out, T, H, W]
-        """
-
-        # N_t, N_h, N_w = [self.input_size[i] // self.patch_size[i] for i in range(3)]
-        T_p, H_p, W_p = self.patch_size
-        x = rearrange(
-            x,
-            "B (N_t N_h N_w) (T_p H_p W_p C_out) -> B C_out (N_t T_p) (N_h H_p) (N_w W_p)",
-            N_t=N_t,
-            N_h=N_h,
-            N_w=N_w,
-            T_p=T_p,
-            H_p=H_p,
-            W_p=W_p,
-            C_out=self.out_channels,
-        )
-        # unpad
-        x = x[:, :, :R_t, :R_h, :R_w]
-        return x
-
-    def unpatchify_lastest(self, x, input_type=None):
-        """
-        Args:
-            x (torch.Tensor): of shape [B, N, C]
-
-        Return:
-            x (torch.Tensor): of shape [B, C_out, T, H, W]
-        """
-
-        if input_type == "image":
-            N_t, N_h, N_w = [
-                self.img_input_size[i] // self.patch_size[i] for i in range(3)
-            ]
-        else:
-            N_t, N_h, N_w = [self.input_size[i] // self.patch_size[i] for i in range(3)]
-        T_p, H_p, W_p = self.patch_size
-        x = rearrange(
-            x,
-            "B (N_t N_h N_w) (T_p H_p W_p C_out) -> B C_out (N_t T_p) (N_h H_p) (N_w W_p)",
-            N_t=N_t,
-            N_h=N_h,
-            N_w=N_w,
-            T_p=T_p,
-            H_p=H_p,
-            W_p=W_p,
-            C_out=self.out_channels,
-        )
-        return x
-
-    def unpatchify_old(self, x):
-        c = self.out_channels
-        t, h, w = [self.input_size[i] // self.patch_size[i] for i in range(3)]
-        pt, ph, pw = self.patch_size
-
-        x = x.reshape(shape=(x.shape[0], t, h, w, pt, ph, pw, c))
-        x = rearrange(x, "n t h w r p q c -> n c t r h p w q")
-        imgs = x.reshape(shape=(x.shape[0], c, t * pt, h * ph, w * pw))
-        return imgs
-
-    def get_spatial_pos_embed(self, grid_size=None):
-        if grid_size is None:
-            grid_size = self.input_size[1:]
-        pos_embed = get_2d_sincos_pos_embed(
-            self.hidden_size,
-            (grid_size[0] // self.patch_size[1], grid_size[1] // self.patch_size[2]),
-            scale=self.space_scale,
-        )
-        pos_embed = (
-            torch.from_numpy(pos_embed).float().unsqueeze(0).requires_grad_(False)
-        )
-        return pos_embed
-
-    def get_temporal_pos_embed(self):
-        pos_embed = get_1d_sincos_pos_embed(
-            self.hidden_size,
-            self.input_size[0] // self.patch_size[0],
-            scale=self.time_scale,
-        )
-        pos_embed = (
-            torch.from_numpy(pos_embed).float().unsqueeze(0).requires_grad_(False)
-        )
-        return pos_embed
-
-    def freeze_not_temporal(self):
-        for n, p in self.named_parameters():
-            if "attn_temp" not in n:
-                p.requires_grad = False
-
-    def freeze_text(self):
-        for n, p in self.named_parameters():
-            if "cross_attn" in n:
-                p.requires_grad = False
-
-    def initialize_temporal(self):
-        for block in self.blocks:
-            nn.init.constant_(block.attn_temp.proj.weight, 0)
-            nn.init.constant_(block.attn_temp.proj.bias, 0)
-
-    def initialize_weights(self):
-        # Initialize transformer layers:
-        def _basic_init(module):
-            if isinstance(module, nn.Linear):
-                torch.nn.init.xavier_uniform_(module.weight)
-                if module.bias is not None:
-                    nn.init.constant_(module.bias, 0)
-
-        self.apply(_basic_init)
-
-        # Initialize patch_embed like nn.Linear (instead of nn.Conv2d):
-        w = self.x_embedder.proj.weight.data
-        nn.init.xavier_uniform_(w.view([w.shape[0], -1]))
-
-        # Initialize timestep embedding MLP:
-        nn.init.normal_(self.t_embedder.mlp[0].weight, std=0.02)
-        nn.init.normal_(self.t_embedder.mlp[2].weight, std=0.02)
-        nn.init.normal_(self.t_block[1].weight, std=0.02)
-        nn.init.normal_(self.t_block_temp[1].weight, std=0.02)
-
-        # Initialize caption embedding MLP:
-        nn.init.normal_(self.y_embedder.y_proj.fc1.weight, std=0.02)
-        nn.init.normal_(self.y_embedder.y_proj.fc2.weight, std=0.02)
-
-        # Zero-out adaLN modulation layers in PixArt blocks:
-        for block in self.blocks:
-            nn.init.constant_(block.cross_attn.proj.weight, 0)
-            nn.init.constant_(block.cross_attn.proj.bias, 0)
-
-        # Zero-out output layers:
-        nn.init.constant_(self.final_layer.linear.weight, 0)
-        nn.init.constant_(self.final_layer.linear.bias, 0)
-
-
-@MODELS.register_module("STDiT7-XL/2")
-def STDiT7_XL_2(from_pretrained=None, **kwargs):
-    config = STDiT7Config(
-        depth=28, hidden_size=1152, patch_size=(1, 2, 2), num_heads=16, **kwargs
-    )
-    model = STDiT7(config)
-    if from_pretrained is not None:
-        import os
-
-        if os.path.isdir(from_pretrained) or os.path.isfile(from_pretrained):
-            load_checkpoint(model, from_pretrained)
-
-    return model
diff --git a/videotuna/models/opensora/models/stdit/stdit8.py b/videotuna/models/opensora/models/stdit/stdit8.py
deleted file mode 100644
index f2e3da07..00000000
--- a/videotuna/models/opensora/models/stdit/stdit8.py
+++ /dev/null
@@ -1,591 +0,0 @@
-import os
-from copy import deepcopy
-
-import numpy as np
-import torch
-import torch.distributed as dist
-import torch.nn as nn
-from einops import rearrange
-from rotary_embedding_torch import RotaryEmbedding
-from timm.models.layers import DropPath
-from timm.models.vision_transformer import Mlp
-from transformers import PretrainedConfig, PreTrainedModel
-
-from videotuna.models.opensora.acceleration.checkpoint import auto_grad_checkpoint
-from videotuna.models.opensora.acceleration.communications import (
-    gather_forward_split_backward,
-    split_forward_gather_backward,
-)
-from videotuna.models.opensora.acceleration.parallel_states import get_sequence_parallel_group
-from videotuna.models.opensora.models.layers.blocks import (
-    Attention,
-    CaptionEmbedder,
-    MultiHeadCrossAttention,
-    PatchEmbed3D,
-    PositionEmbedding2D,
-    SeqParallelAttention,
-    SeqParallelMultiHeadCrossAttention,
-    SizeEmbedder,
-    T2IFinalLayer,
-    TimestepEmbedder,
-    approx_gelu,
-    get_1d_sincos_pos_embed,
-    get_2d_sincos_pos_embed,
-    get_layernorm,
-    t2i_modulate,
-)
-from videotuna.models.opensora.registry import MODELS
-from videotuna.models.opensora.utils.ckpt_utils import load_checkpoint
-
-
-class STDiT8Block(nn.Module):
-    def __init__(
-        self,
-        hidden_size,
-        num_heads,
-        mlp_ratio=4.0,
-        drop_path=0.0,
-        rope=None,
-        qk_norm=False,
-        temporal=False,
-        enable_flash_attn=False,
-        enable_layernorm_kernel=False,
-        enable_sequence_parallelism=False,
-    ):
-        super().__init__()
-        self.temporal = temporal
-        self.hidden_size = hidden_size
-        self.enable_flash_attn = enable_flash_attn
-        self.enable_sequence_parallelism = enable_sequence_parallelism
-
-        if self.enable_sequence_parallelism and not temporal:
-            attn_cls = SeqParallelAttention
-            mha_cls = SeqParallelMultiHeadCrossAttention
-        else:
-            attn_cls = Attention
-            mha_cls = MultiHeadCrossAttention
-
-        self.norm1 = get_layernorm(
-            hidden_size, eps=1e-6, affine=False, use_kernel=enable_layernorm_kernel
-        )
-        self.attn = attn_cls(
-            hidden_size,
-            num_heads=num_heads,
-            qkv_bias=True,
-            qk_norm=qk_norm,
-            rope=rope,
-            enable_flash_attn=enable_flash_attn,
-        )
-        self.cross_attn = mha_cls(hidden_size, num_heads)
-        self.norm2 = get_layernorm(
-            hidden_size, eps=1e-6, affine=False, use_kernel=enable_layernorm_kernel
-        )
-        self.mlp = Mlp(
-            in_features=hidden_size,
-            hidden_features=int(hidden_size * mlp_ratio),
-            act_layer=approx_gelu,
-            drop=0,
-        )
-        self.drop_path = DropPath(drop_path) if drop_path > 0.0 else nn.Identity()
-        self.scale_shift_table = nn.Parameter(
-            torch.randn(6, hidden_size) / hidden_size**0.5
-        )
-
-    def t_mask_select(self, x_mask, x, masked_x, T, S):
-        # x: [B, (T, S), C]
-        # mased_x: [B, (T, S), C]
-        # x_mask: [B, T]
-        x = rearrange(x, "B (T S) C -> B T S C", T=T, S=S)
-        masked_x = rearrange(masked_x, "B (T S) C -> B T S C", T=T, S=S)
-        x = torch.where(x_mask[:, :, None, None], x, masked_x)
-        x = rearrange(x, "B T S C -> B (T S) C")
-        return x
-
-    def forward(
-        self,
-        x,
-        y,
-        t,
-        mask=None,  # text mask
-        x_mask=None,  # temporal mask
-        t0=None,  # t with timestamp=0
-        T=None,  # number of frames
-        S=None,  # number of pixel patches
-        debug=None,
-    ):
-        # prepare modulate parameters
-        B, N, C = x.shape
-        shift_msa, scale_msa, gate_msa, shift_mlp, scale_mlp, gate_mlp = (
-            self.scale_shift_table[None] + t.reshape(B, 6, -1)
-        ).chunk(6, dim=1)
-        if x_mask is not None:
-            (
-                shift_msa_zero,
-                scale_msa_zero,
-                gate_msa_zero,
-                shift_mlp_zero,
-                scale_mlp_zero,
-                gate_mlp_zero,
-            ) = (self.scale_shift_table[None] + t0.reshape(B, 6, -1)).chunk(6, dim=1)
-        if debug:
-            x_tmp = deepcopy(x)
-        # modulate (attention)
-        x_m = t2i_modulate(self.norm1(x), shift_msa, scale_msa)
-        if x_mask is not None:
-            x_m_zero = t2i_modulate(self.norm1(x), shift_msa_zero, scale_msa_zero)
-            x_m = self.t_mask_select(x_mask, x_m, x_m_zero, T, S)
-
-        # attention
-        if self.temporal:
-            x_m = rearrange(x_m, "B (T S) C -> (B S) T C", T=T, S=S)
-            x_m = self.attn(x_m)
-            x_m = rearrange(x_m, "(B S) T C -> B (T S) C", T=T, S=S)
-        else:
-            x_m = rearrange(x_m, "B (T S) C -> (B T) S C", T=T, S=S)
-            x_m = self.attn(x_m)
-            x_m = rearrange(x_m, "(B T) S C -> B (T S) C", T=T, S=S)
-
-        # modulate (attention)
-        x_m_s = gate_msa * x_m
-        if x_mask is not None:
-            x_m_s_zero = gate_msa_zero * x_m
-            x_m_s = self.t_mask_select(x_mask, x_m_s, x_m_s_zero, T, S)
-
-        # residual
-        x = x + self.drop_path(x_m_s)
-
-        # cross attention
-        x = x + self.cross_attn(x, y, mask)
-
-        # modulate (MLP)
-        x_m = t2i_modulate(self.norm2(x), shift_mlp, scale_mlp)
-        if x_mask is not None:
-            x_m_zero = t2i_modulate(self.norm2(x), shift_mlp_zero, scale_mlp_zero)
-            x_m = self.t_mask_select(x_mask, x_m, x_m_zero, T, S)
-
-        # MLP
-        x_m = self.mlp(x_m)
-
-        # modulate (MLP)
-        x_m_s = gate_mlp * x_m
-        if x_mask is not None:
-            x_m_s_zero = gate_mlp_zero * x_m
-            x_m_s = self.t_mask_select(x_mask, x_m_s, x_m_s_zero, T, S)
-
-        # residual
-        x = x + self.drop_path(x_m_s)
-
-        if debug:
-            print(f"Block {debug} x diff: {torch.norm(x - x_tmp)}")
-
-        return x
-
-
-class STDiT8Config(PretrainedConfig):
-    model_type = "STDiT8"
-
-    def __init__(
-        self,
-        input_size=(None, None, None),
-        input_sq_size=512,
-        in_channels=4,
-        patch_size=(1, 2, 2),
-        hidden_size=1152,
-        depth=28,
-        num_heads=16,
-        mlp_ratio=4.0,
-        class_dropout_prob=0.1,
-        pred_sigma=True,
-        drop_path=0.0,
-        caption_channels=4096,
-        model_max_length=300,
-        qk_norm=True,
-        enable_flash_attn=False,
-        enable_layernorm_kernel=False,
-        enable_sequence_parallelism=False,
-        only_train_temporal=False,
-        freeze_y_embedder=False,
-        skip_y_embedder=False,
-        freeze_previous_blocks=None,
-        **kwargs,
-    ):
-        self.input_size = input_size
-        self.input_sq_size = input_sq_size
-        self.in_channels = in_channels
-        self.patch_size = patch_size
-        self.hidden_size = hidden_size
-        self.depth = depth
-        self.num_heads = num_heads
-        self.mlp_ratio = mlp_ratio
-        self.class_dropout_prob = class_dropout_prob
-        self.pred_sigma = pred_sigma
-        self.drop_path = drop_path
-        self.caption_channels = caption_channels
-        self.model_max_length = model_max_length
-        self.qk_norm = qk_norm
-        self.enable_flash_attn = enable_flash_attn
-        self.enable_layernorm_kernel = enable_layernorm_kernel
-        self.enable_sequence_parallelism = enable_sequence_parallelism
-        self.only_train_temporal = only_train_temporal
-        self.freeze_y_embedder = freeze_y_embedder
-        self.skip_y_embedder = skip_y_embedder
-        self.freeze_previous_blocks = freeze_previous_blocks
-        super().__init__(**kwargs)
-
-
-class STDiT8(PreTrainedModel):
-    config_class = STDiT8Config
-
-    def __init__(self, config):
-        super().__init__(config)
-        self.pred_sigma = config.pred_sigma
-        self.in_channels = config.in_channels
-        self.out_channels = (
-            config.in_channels * 2 if config.pred_sigma else config.in_channels
-        )
-
-        # model size related
-        self.depth = config.depth
-        self.mlp_ratio = config.mlp_ratio
-        self.hidden_size = config.hidden_size
-        self.num_heads = config.num_heads
-
-        # computation related
-        self.drop_path = config.drop_path
-        self.enable_flash_attn = config.enable_flash_attn
-        self.enable_layernorm_kernel = config.enable_layernorm_kernel
-        self.enable_sequence_parallelism = config.enable_sequence_parallelism
-
-        # input size related
-        self.patch_size = config.patch_size
-        self.input_sq_size = config.input_sq_size
-        self.pos_embed = PositionEmbedding2D(config.hidden_size)
-        self.rope = RotaryEmbedding(dim=self.hidden_size // self.num_heads)
-
-        # embedding
-        self.x_embedder = PatchEmbed3D(
-            config.patch_size, config.in_channels, config.hidden_size
-        )
-        self.t_embedder = TimestepEmbedder(config.hidden_size)
-        self.fps_embedder = SizeEmbedder(self.hidden_size)
-        self.t_block = nn.Sequential(
-            nn.SiLU(),
-            nn.Linear(config.hidden_size, 6 * config.hidden_size, bias=True),
-        )
-        self.y_embedder = CaptionEmbedder(
-            in_channels=config.caption_channels,
-            hidden_size=config.hidden_size,
-            uncond_prob=config.class_dropout_prob,
-            act_layer=approx_gelu,
-            token_num=config.model_max_length,
-        )
-
-        # spatial blocks
-        drop_path = [x.item() for x in torch.linspace(0, self.drop_path, config.depth)]
-        self.spatial_blocks = nn.ModuleList(
-            [
-                STDiT8Block(
-                    hidden_size=config.hidden_size,
-                    num_heads=config.num_heads,
-                    mlp_ratio=config.mlp_ratio,
-                    drop_path=drop_path[i],
-                    qk_norm=config.qk_norm,
-                    enable_flash_attn=config.enable_flash_attn,
-                    enable_layernorm_kernel=config.enable_layernorm_kernel,
-                    enable_sequence_parallelism=config.enable_sequence_parallelism,
-                )
-                for i in range(config.depth)
-            ]
-        )
-
-        # temporal blocks
-        drop_path = [x.item() for x in torch.linspace(0, self.drop_path, config.depth)]
-        self.temporal_blocks = nn.ModuleList(
-            [
-                STDiT8Block(
-                    hidden_size=config.hidden_size,
-                    num_heads=config.num_heads,
-                    mlp_ratio=config.mlp_ratio,
-                    drop_path=drop_path[i],
-                    qk_norm=config.qk_norm,
-                    enable_flash_attn=config.enable_flash_attn,
-                    enable_layernorm_kernel=config.enable_layernorm_kernel,
-                    enable_sequence_parallelism=config.enable_sequence_parallelism,
-                    # temporal
-                    temporal=True,
-                    rope=self.rope.rotate_queries_or_keys,
-                )
-                for i in range(config.depth)
-            ]
-        )
-
-        # final layer
-        self.final_layer = T2IFinalLayer(
-            config.hidden_size, np.prod(self.patch_size), self.out_channels
-        )
-
-        # weight training config
-        self.only_train_temporal = config.only_train_temporal
-        self.freeze_y_embedder = config.freeze_y_embedder
-        self.freeze_previous_blocks = config.freeze_previous_blocks
-
-        self.initialize_weights()
-        self.freeze_weights()
-
-    def freeze_weights(self):
-        # only train temproal blocks
-        if self.only_train_temporal:
-            for param in self.parameters():
-                param.requires_grad = False
-            for block in self.temporal_blocks:
-                for param in block.parameters():
-                    param.requires_grad = True
-
-        # freeze y embedder
-        if self.freeze_y_embedder:
-            for param in self.y_embedder.parameters():
-                param.requires_grad = False
-
-        # freeze previous blocks
-        if self.freeze_previous_blocks is not None:
-            print("Freeze all parameters")
-            for param in self.parameters():
-                param.requires_grad = False
-            freeze_type = self.freeze_previous_blocks[1]
-            if freeze_type == "front":
-                print("Freeze the front blocks")
-                for name, param in self.spatial_blocks.named_parameters():
-                    if int(name.split(".")[0]) >= 28:
-                        param.requires_grad = True
-                for name, param in self.temporal_blocks.named_parameters():
-                    if int(name.split(".")[0]) >= 28:
-                        param.requires_grad = True
-            elif freeze_type == "inter":
-                print("Freeze the inter blocks")
-                un_freeze_depth_list = [x for x in range(1, self.depth, 2)]
-                for name, param in self.spatial_blocks.named_parameters():
-                    if int(name.split(".")[0]) in un_freeze_depth_list:
-                        param.requires_grad = True
-                for name, param in self.temporal_blocks.named_parameters():
-                    if int(name.split(".")[0]) in un_freeze_depth_list:
-                        param.requires_grad = True
-            else:
-                raise ValueError(f"Unknown freeze_previous_blocks type: {freeze_type}")
-
-    def initialize_weights(self):
-        # Initialize transformer layers:
-        def _basic_init(module):
-            if isinstance(module, nn.Linear):
-                torch.nn.init.xavier_uniform_(module.weight)
-                if module.bias is not None:
-                    nn.init.constant_(module.bias, 0)
-
-        self.apply(_basic_init)
-
-        # Initialize fps_embedder
-        nn.init.normal_(self.fps_embedder.mlp[0].weight, std=0.02)
-        nn.init.constant_(self.fps_embedder.mlp[0].bias, 0)
-        nn.init.constant_(self.fps_embedder.mlp[2].weight, 0)
-        nn.init.constant_(self.fps_embedder.mlp[2].bias, 0)
-
-        # Initialize timporal blocks
-        for block in self.temporal_blocks:
-            nn.init.constant_(block.attn.proj.weight, 0)
-            nn.init.constant_(block.cross_attn.proj.weight, 0)
-            nn.init.constant_(block.mlp.fc2.weight, 0)
-
-    def get_dynamic_size(self, x):
-        _, _, T, H, W = x.size()
-        if T % self.patch_size[0] != 0:
-            T += self.patch_size[0] - T % self.patch_size[0]
-        if H % self.patch_size[1] != 0:
-            H += self.patch_size[1] - H % self.patch_size[1]
-        if W % self.patch_size[2] != 0:
-            W += self.patch_size[2] - W % self.patch_size[2]
-        T = T // self.patch_size[0]
-        H = H // self.patch_size[1]
-        W = W // self.patch_size[2]
-        return (T, H, W)
-
-    def encode_text(self, y, mask=None):
-        y = self.y_embedder(y, self.training)  # [B, 1, N_token, C]
-        if mask is not None:
-            if mask.shape[0] != y.shape[0]:
-                mask = mask.repeat(y.shape[0] // mask.shape[0], 1)
-            mask = mask.squeeze(1).squeeze(1)
-            y = (
-                y.squeeze(1)
-                .masked_select(mask.unsqueeze(-1) != 0)
-                .view(1, -1, self.hidden_size)
-            )
-            y_lens = mask.sum(dim=1).tolist()
-        else:
-            y_lens = [y.shape[2]] * y.shape[0]
-            y = y.squeeze(1).view(1, -1, self.hidden_size)
-        return y, y_lens
-
-    def forward(
-        self,
-        x,
-        timestep,
-        y,
-        mask=None,
-        x_mask=None,
-        fps=None,
-        height=None,
-        width=None,
-        **kwargs,
-    ):
-        dtype = self.x_embedder.proj.weight.dtype
-        B = x.size(0)
-        x = x.to(dtype)
-        timestep = timestep.to(dtype)
-        y = y.to(dtype)
-
-        # === get pos embed ===
-        _, _, Tx, Hx, Wx = x.size()
-        T, H, W = self.get_dynamic_size(x)
-        S = H * W
-        base_size = round(S**0.5)
-        resolution_sq = (height[0].item() * width[0].item()) ** 0.5
-        scale = resolution_sq / self.input_sq_size
-        pos_emb = self.pos_embed(x, H, W, scale=scale, base_size=base_size)
-
-        # === get timestep embed ===
-        t = self.t_embedder(timestep, dtype=x.dtype)  # [B, C]
-        fps = self.fps_embedder(fps.unsqueeze(1), B)
-        t = t + fps
-        t_mlp = self.t_block(t)
-        t0 = t0_mlp = None
-        if x_mask is not None:
-            t0_timestep = torch.zeros_like(timestep)
-            t0 = self.t_embedder(t0_timestep, dtype=x.dtype)
-            t0 = t0 + fps
-            t0_mlp = self.t_block(t0)
-
-        # === get y embed ===
-        if self.config.skip_y_embedder:
-            y_lens = mask
-            if isinstance(y_lens, torch.Tensor):
-                y_lens = y_lens.long().tolist()
-        else:
-            y, y_lens = self.encode_text(y, mask)
-
-        # === get x embed ===
-        x = self.x_embedder(x)  # [B, N, C]
-        x = rearrange(x, "B (T S) C -> B T S C", T=T, S=S)
-        x = x + pos_emb
-
-        # shard over the sequence dim if sp is enabled
-        if self.enable_sequence_parallelism:
-            x = split_forward_gather_backward(
-                x, get_sequence_parallel_group(), dim=2, grad_scale="down"
-            )
-            S = S // dist.get_world_size(get_sequence_parallel_group())
-
-        x = rearrange(x, "B T S C -> B (T S) C", T=T, S=S)
-
-        # === blocks ===
-        for spatial_block, temporal_block in zip(
-            self.spatial_blocks, self.temporal_blocks
-        ):
-            x = auto_grad_checkpoint(
-                spatial_block, x, y, t_mlp, y_lens, x_mask, t0_mlp, T, S
-            )
-            x = auto_grad_checkpoint(
-                temporal_block, x, y, t_mlp, y_lens, x_mask, t0_mlp, T, S
-            )
-
-        if self.enable_sequence_parallelism:
-            x = rearrange(x, "B (T S) C -> B T S C", T=T, S=S)
-            x = gather_forward_split_backward(
-                x, get_sequence_parallel_group(), dim=2, grad_scale="up"
-            )
-            S = S * dist.get_world_size(get_sequence_parallel_group())
-            x = rearrange(x, "B T S C -> B (T S) C", T=T, S=S)
-
-        # === final layer ===
-        x = self.final_layer(x, t, x_mask, t0, T, S)
-        x = self.unpatchify(x, T, H, W, Tx, Hx, Wx)
-
-        # cast to float32 for better accuracy
-        x = x.to(torch.float32)
-        return x
-
-    def unpatchify(self, x, N_t, N_h, N_w, R_t, R_h, R_w):
-        """
-        Args:
-            x (torch.Tensor): of shape [B, N, C]
-
-        Return:
-            x (torch.Tensor): of shape [B, C_out, T, H, W]
-        """
-
-        # N_t, N_h, N_w = [self.input_size[i] // self.patch_size[i] for i in range(3)]
-        T_p, H_p, W_p = self.patch_size
-        x = rearrange(
-            x,
-            "B (N_t N_h N_w) (T_p H_p W_p C_out) -> B C_out (N_t T_p) (N_h H_p) (N_w W_p)",
-            N_t=N_t,
-            N_h=N_h,
-            N_w=N_w,
-            T_p=T_p,
-            H_p=H_p,
-            W_p=W_p,
-            C_out=self.out_channels,
-        )
-        # unpad
-        x = x[:, :, :R_t, :R_h, :R_w]
-        return x
-
-
-@MODELS.register_module("STDiT8-XL/2")
-def STDiT8_XL_2(from_pretrained=None, **kwargs):
-    force_huggingface = kwargs.pop("force_huggingface", False)
-    if (
-        force_huggingface
-        or from_pretrained is not None
-        and not os.path.isdir(from_pretrained)
-    ):
-        model = STDiT8.from_pretrained(from_pretrained, **kwargs)
-    else:
-        config = STDiT8Config(
-            depth=28, hidden_size=1152, patch_size=(1, 2, 2), num_heads=16, **kwargs
-        )
-        model = STDiT8(config)
-        if from_pretrained is not None:
-            load_checkpoint(model, from_pretrained)
-    return model
-
-
-@MODELS.register_module("STDiT8-2B/2")
-def STDiT8_2B_2(from_pretrained=None, **kwargs):
-    force_huggingface = kwargs.pop("force_huggingface", False)
-    if (
-        force_huggingface
-        or from_pretrained is not None
-        and not os.path.isdir(from_pretrained)
-    ):
-        model = STDiT8.from_pretrained(from_pretrained, **kwargs)
-    else:
-        config = STDiT8Config(
-            depth=56, hidden_size=1152, patch_size=(1, 2, 2), num_heads=16, **kwargs
-        )
-        model = STDiT8(config)
-        if from_pretrained is not None:
-            load_checkpoint(model, from_pretrained)
-    return model
-
-
-@MODELS.register_module("STDiT8-3B/2")
-def STDiT8_3B_2(from_pretrained=None, **kwargs):
-    if from_pretrained is not None and not os.path.isdir(from_pretrained):
-        model = STDiT8.from_pretrained(from_pretrained, **kwargs)
-    else:
-        config = STDiT8Config(
-            depth=28, hidden_size=1872, patch_size=(1, 2, 2), num_heads=26, **kwargs
-        )
-        model = STDiT8(config)
-        if from_pretrained is not None:
-            load_checkpoint(model, from_pretrained)
-    return model
diff --git a/videotuna/models/opensora/models/stdit/stdit8_debug.py b/videotuna/models/opensora/models/stdit/stdit8_debug.py
deleted file mode 100644
index bad3afdb..00000000
--- a/videotuna/models/opensora/models/stdit/stdit8_debug.py
+++ /dev/null
@@ -1,636 +0,0 @@
-import sys
-
-sys.path.append("/home/zraoac/Open-Sora")
-import os
-from copy import deepcopy
-
-import numpy as np
-import torch
-import torch.distributed as dist
-import torch.nn as nn
-from einops import rearrange
-from rotary_embedding_torch import RotaryEmbedding
-from timm.models.layers import DropPath
-from timm.models.vision_transformer import Mlp
-from transformers import PretrainedConfig, PreTrainedModel
-
-from videotuna.models.opensora.acceleration.checkpoint import auto_grad_checkpoint
-from videotuna.models.opensora.acceleration.communications import (
-    gather_forward_split_backward,
-    split_forward_gather_backward,
-)
-from videotuna.models.opensora.acceleration.parallel_states import get_sequence_parallel_group
-from videotuna.models.opensora.models.layers.blocks import (
-    Attention,
-    CaptionEmbedder,
-    MultiHeadCrossAttention,
-    PatchEmbed3D,
-    PositionEmbedding2D,
-    SeqParallelAttention,
-    SeqParallelMultiHeadCrossAttention,
-    SizeEmbedder,
-    T2IFinalLayer,
-    TimestepEmbedder,
-    approx_gelu,
-    get_1d_sincos_pos_embed,
-    get_2d_sincos_pos_embed,
-    get_layernorm,
-    t2i_modulate,
-)
-from videotuna.models.opensora.registry import MODELS
-from videotuna.models.opensora.utils.ckpt_utils import load_checkpoint
-
-
-class STDiT8Block(nn.Module):
-    def __init__(
-        self,
-        hidden_size,
-        num_heads,
-        mlp_ratio=4.0,
-        drop_path=0.0,
-        rope=None,
-        qk_norm=False,
-        temporal=False,
-        enable_flash_attn=False,
-        enable_layernorm_kernel=False,
-        enable_sequence_parallelism=False,
-    ):
-        super().__init__()
-        self.temporal = temporal
-        self.hidden_size = hidden_size
-        self.enable_flash_attn = enable_flash_attn
-        self.enable_sequence_parallelism = enable_sequence_parallelism
-
-        if self.enable_sequence_parallelism and not temporal:
-            attn_cls = SeqParallelAttention
-            mha_cls = SeqParallelMultiHeadCrossAttention
-        else:
-            attn_cls = Attention
-            mha_cls = MultiHeadCrossAttention
-
-        self.norm1 = get_layernorm(
-            hidden_size, eps=1e-6, affine=False, use_kernel=enable_layernorm_kernel
-        )
-        self.attn = attn_cls(
-            hidden_size,
-            num_heads=num_heads,
-            qkv_bias=True,
-            qk_norm=qk_norm,
-            rope=rope,
-            enable_flash_attn=enable_flash_attn,
-        )
-        self.cross_attn = mha_cls(hidden_size, num_heads)
-        self.norm2 = get_layernorm(
-            hidden_size, eps=1e-6, affine=False, use_kernel=enable_layernorm_kernel
-        )
-        self.mlp = Mlp(
-            in_features=hidden_size,
-            hidden_features=int(hidden_size * mlp_ratio),
-            act_layer=approx_gelu,
-            drop=0,
-        )
-        self.drop_path = DropPath(drop_path) if drop_path > 0.0 else nn.Identity()
-        self.scale_shift_table = nn.Parameter(
-            torch.randn(6, hidden_size) / hidden_size**0.5
-        )
-
-    def t_mask_select(self, x_mask, x, masked_x, T, S):
-        # x: [B, (T, S), C]
-        # mased_x: [B, (T, S), C]
-        # x_mask: [B, T]
-        x = rearrange(x, "B (T S) C -> B T S C", T=T, S=S)
-        masked_x = rearrange(masked_x, "B (T S) C -> B T S C", T=T, S=S)
-        x = torch.where(x_mask[:, :, None, None], x, masked_x)
-        x = rearrange(x, "B T S C -> B (T S) C")
-        return x
-
-    def forward(
-        self,
-        x,
-        y,
-        t,
-        mask=None,  # text mask
-        x_mask=None,  # temporal mask
-        t0=None,  # t with timestamp=0
-        T=None,  # number of frames
-        S=None,  # number of pixel patches
-        debug=None,
-    ):
-        # prepare modulate parameters
-        B, N, C = x.shape
-        shift_msa, scale_msa, gate_msa, shift_mlp, scale_mlp, gate_mlp = (
-            self.scale_shift_table[None] + t.reshape(B, 6, -1)
-        ).chunk(6, dim=1)
-        if x_mask is not None:
-            (
-                shift_msa_zero,
-                scale_msa_zero,
-                gate_msa_zero,
-                shift_mlp_zero,
-                scale_mlp_zero,
-                gate_mlp_zero,
-            ) = (self.scale_shift_table[None] + t0.reshape(B, 6, -1)).chunk(6, dim=1)
-        if debug:
-            x_tmp = deepcopy(x)
-        # modulate (attention)
-        x_m = t2i_modulate(self.norm1(x), shift_msa, scale_msa)
-        if x_mask is not None:
-            x_m_zero = t2i_modulate(self.norm1(x), shift_msa_zero, scale_msa_zero)
-            x_m = self.t_mask_select(x_mask, x_m, x_m_zero, T, S)
-
-        # attention
-        # if x_m.dtype == torch.float32:
-        #     x_m = x_m.to(dtype=torch.float16)
-
-        if self.temporal:
-            x_m = rearrange(x_m, "B (T S) C -> (B S) T C", T=T, S=S)
-            x_m = self.attn(x_m)
-            x_m = rearrange(x_m, "(B S) T C -> B (T S) C", T=T, S=S)
-        else:
-            x_m = rearrange(x_m, "B (T S) C -> (B T) S C", T=T, S=S)
-            x_m = self.attn(x_m)
-            x_m = rearrange(x_m, "(B T) S C -> B (T S) C", T=T, S=S)
-
-        # modulate (attention)
-        x_m_s = gate_msa * x_m
-        if x_mask is not None:
-            x_m_s_zero = gate_msa_zero * x_m
-            x_m_s = self.t_mask_select(x_mask, x_m_s, x_m_s_zero, T, S)
-
-        # residual
-        x = x + self.drop_path(x_m_s)
-
-        # cross attention
-        x = x + self.cross_attn(x, y, mask)
-
-        # modulate (MLP)
-        x_m = t2i_modulate(self.norm2(x), shift_mlp, scale_mlp)
-        if x_mask is not None:
-            x_m_zero = t2i_modulate(self.norm2(x), shift_mlp_zero, scale_mlp_zero)
-            x_m = self.t_mask_select(x_mask, x_m, x_m_zero, T, S)
-
-        # MLP
-        x_m = self.mlp(x_m)
-
-        # modulate (MLP)
-        x_m_s = gate_mlp * x_m
-        if x_mask is not None:
-            x_m_s_zero = gate_mlp_zero * x_m
-            x_m_s = self.t_mask_select(x_mask, x_m_s, x_m_s_zero, T, S)
-
-        # residual
-        x = x + self.drop_path(x_m_s)
-
-        if debug:
-            print(f"Block {debug} x diff: {torch.norm(x - x_tmp)}")
-
-        return x
-
-
-class STDiT8Config(PretrainedConfig):
-    model_type = "STDiT8"
-
-    def __init__(
-        self,
-        input_size=(None, None, None),
-        input_sq_size=512,
-        in_channels=4,
-        patch_size=(1, 2, 2),
-        hidden_size=1152,
-        depth=28,
-        num_heads=16,
-        mlp_ratio=4.0,
-        class_dropout_prob=0.1,
-        pred_sigma=True,
-        drop_path=0.0,
-        caption_channels=4096,
-        model_max_length=300,
-        qk_norm=True,
-        enable_flash_attn=False,
-        enable_layernorm_kernel=False,
-        enable_sequence_parallelism=False,
-        only_train_temporal=False,
-        freeze_y_embedder=False,
-        skip_y_embedder=False,
-        freeze_previous_blocks=None,
-        **kwargs,
-    ):
-        self.input_size = input_size
-        self.input_sq_size = input_sq_size
-        self.in_channels = in_channels
-        self.patch_size = patch_size
-        self.hidden_size = hidden_size
-        self.depth = depth
-        self.num_heads = num_heads
-        self.mlp_ratio = mlp_ratio
-        self.class_dropout_prob = class_dropout_prob
-        self.pred_sigma = pred_sigma
-        self.drop_path = drop_path
-        self.caption_channels = caption_channels
-        self.model_max_length = model_max_length
-        self.qk_norm = qk_norm
-        self.enable_flash_attn = enable_flash_attn
-        self.enable_layernorm_kernel = enable_layernorm_kernel
-        self.enable_sequence_parallelism = enable_sequence_parallelism
-        self.only_train_temporal = only_train_temporal
-        self.freeze_y_embedder = freeze_y_embedder
-        self.skip_y_embedder = skip_y_embedder
-        self.freeze_previous_blocks = freeze_previous_blocks
-        super().__init__(**kwargs)
-
-
-class STDiT8(PreTrainedModel):
-    config_class = STDiT8Config
-
-    def __init__(self, config):
-        super().__init__(config)
-        self.pred_sigma = config.pred_sigma
-        self.in_channels = config.in_channels
-        self.out_channels = (
-            config.in_channels * 2 if config.pred_sigma else config.in_channels
-        )
-
-        # model size related
-        self.depth = config.depth
-        self.mlp_ratio = config.mlp_ratio
-        self.hidden_size = config.hidden_size
-        self.num_heads = config.num_heads
-
-        # computation related
-        self.drop_path = config.drop_path
-        self.enable_flash_attn = config.enable_flash_attn
-        self.enable_layernorm_kernel = config.enable_layernorm_kernel
-        self.enable_sequence_parallelism = config.enable_sequence_parallelism
-
-        # input size related
-        self.patch_size = config.patch_size
-        self.input_sq_size = config.input_sq_size
-        self.pos_embed = PositionEmbedding2D(config.hidden_size)
-        self.rope = RotaryEmbedding(dim=self.hidden_size // self.num_heads)
-
-        # embedding
-        self.x_embedder = PatchEmbed3D(
-            config.patch_size, config.in_channels, config.hidden_size
-        )
-        self.t_embedder = TimestepEmbedder(config.hidden_size)
-        self.fps_embedder = SizeEmbedder(self.hidden_size)
-        self.t_block = nn.Sequential(
-            nn.SiLU(),
-            nn.Linear(config.hidden_size, 6 * config.hidden_size, bias=True),
-        )
-        self.y_embedder = CaptionEmbedder(
-            in_channels=config.caption_channels,
-            hidden_size=config.hidden_size,
-            uncond_prob=config.class_dropout_prob,
-            act_layer=approx_gelu,
-            token_num=config.model_max_length,
-        )
-
-        # spatial blocks
-        drop_path = [x.item() for x in torch.linspace(0, self.drop_path, config.depth)]
-        self.spatial_blocks = nn.ModuleList(
-            [
-                STDiT8Block(
-                    hidden_size=config.hidden_size,
-                    num_heads=config.num_heads,
-                    mlp_ratio=config.mlp_ratio,
-                    drop_path=drop_path[i],
-                    qk_norm=config.qk_norm,
-                    enable_flash_attn=config.enable_flash_attn,
-                    enable_layernorm_kernel=config.enable_layernorm_kernel,
-                    enable_sequence_parallelism=config.enable_sequence_parallelism,
-                )
-                for i in range(config.depth)
-            ]
-        )
-
-        # temporal blocks
-        drop_path = [x.item() for x in torch.linspace(0, self.drop_path, config.depth)]
-        self.temporal_blocks = nn.ModuleList(
-            [
-                STDiT8Block(
-                    hidden_size=config.hidden_size,
-                    num_heads=config.num_heads,
-                    mlp_ratio=config.mlp_ratio,
-                    drop_path=drop_path[i],
-                    qk_norm=config.qk_norm,
-                    enable_flash_attn=config.enable_flash_attn,
-                    enable_layernorm_kernel=config.enable_layernorm_kernel,
-                    enable_sequence_parallelism=config.enable_sequence_parallelism,
-                    # temporal
-                    temporal=True,
-                    rope=self.rope.rotate_queries_or_keys,
-                )
-                for i in range(config.depth)
-            ]
-        )
-
-        # final layer
-        self.final_layer = T2IFinalLayer(
-            config.hidden_size, np.prod(self.patch_size), self.out_channels
-        )
-
-        # weight training config
-        self.only_train_temporal = config.only_train_temporal
-        self.freeze_y_embedder = config.freeze_y_embedder
-        self.freeze_previous_blocks = config.freeze_previous_blocks
-
-        self.initialize_weights()
-        self.freeze_weights()
-
-    def freeze_weights(self):
-        # only train temproal blocks
-        if self.only_train_temporal:
-            for param in self.parameters():
-                param.requires_grad = False
-            for block in self.temporal_blocks:
-                for param in block.parameters():
-                    param.requires_grad = True
-
-        # freeze y embedder
-        if self.freeze_y_embedder:
-            for param in self.y_embedder.parameters():
-                param.requires_grad = False
-
-        # freeze previous blocks
-        if self.freeze_previous_blocks is not None:
-            freeze_type = self.freeze_previous_blocks[1]
-            if freeze_type == "front":
-                print("Freeze the front blocks")
-                for name, param in self.spatial_blocks.named_parameters():
-                    if int(name.split(".")[0]) < 28:
-                        param.requires_grad = False
-                for name, param in self.temporal_blocks.named_parameters():
-                    if int(name.split(".")[0]) < 28:
-                        param.requires_grad = False
-            elif freeze_type == "inter":
-                print("Freeze the inter blocks")
-                freeze_depth_list = [x for x in range(0, self.depth, 2)]
-                for name, param in self.spatial_blocks.named_parameters():
-                    if int(name.split(".")[0]) in freeze_depth_list:
-                        param.requires_grad = False
-                for name, param in self.temporal_blocks.named_parameters():
-                    if int(name.split(".")[0]) in freeze_depth_list:
-                        param.requires_grad = False
-            else:
-                raise ValueError(f"Unknown freeze_previous_blocks type: {freeze_type}")
-
-    def initialize_weights(self):
-        # Initialize transformer layers:
-        def _basic_init(module):
-            if isinstance(module, nn.Linear):
-                torch.nn.init.xavier_uniform_(module.weight)
-                if module.bias is not None:
-                    nn.init.constant_(module.bias, 0)
-
-        self.apply(_basic_init)
-
-        # Initialize fps_embedder
-        nn.init.normal_(self.fps_embedder.mlp[0].weight, std=0.02)
-        nn.init.constant_(self.fps_embedder.mlp[0].bias, 0)
-        nn.init.constant_(self.fps_embedder.mlp[2].weight, 0)
-        nn.init.constant_(self.fps_embedder.mlp[2].bias, 0)
-
-        # Initialize timporal blocks
-        for block in self.temporal_blocks:
-            nn.init.constant_(block.attn.proj.weight, 0)
-            nn.init.constant_(block.cross_attn.proj.weight, 0)
-            nn.init.constant_(block.mlp.fc2.weight, 0)
-
-    def get_dynamic_size(self, x):
-        _, _, T, H, W = x.size()
-        if T % self.patch_size[0] != 0:
-            T += self.patch_size[0] - T % self.patch_size[0]
-        if H % self.patch_size[1] != 0:
-            H += self.patch_size[1] - H % self.patch_size[1]
-        if W % self.patch_size[2] != 0:
-            W += self.patch_size[2] - W % self.patch_size[2]
-        T = T // self.patch_size[0]
-        H = H // self.patch_size[1]
-        W = W // self.patch_size[2]
-        return (T, H, W)
-
-    def encode_text(self, y, mask=None):
-        y = self.y_embedder(y, self.training)  # [B, 1, N_token, C]
-        if mask is not None:
-            if mask.shape[0] != y.shape[0]:
-                mask = mask.repeat(y.shape[0] // mask.shape[0], 1)
-            mask = mask.squeeze(1).squeeze(1)
-            y = (
-                y.squeeze(1)
-                .masked_select(mask.unsqueeze(-1) != 0)
-                .view(1, -1, self.hidden_size)
-            )
-            y_lens = mask.sum(dim=1).tolist()
-        else:
-            y_lens = [y.shape[2]] * y.shape[0]
-            y = y.squeeze(1).view(1, -1, self.hidden_size)
-        return y, y_lens
-
-    def forward(
-        self,
-        x,
-        timestep,
-        y,
-        mask=None,
-        x_mask=None,
-        fps=None,
-        height=None,
-        width=None,
-        **kwargs,
-    ):
-        dtype = self.x_embedder.proj.weight.dtype
-        B = x.size(0)
-        x = x.to(dtype)
-        timestep = timestep.to(dtype)
-        y = y.to(dtype)
-
-        # === get pos embed ===
-        _, _, Tx, Hx, Wx = x.size()
-        T, H, W = self.get_dynamic_size(x)
-        S = H * W
-        base_size = round(S**0.5)
-        resolution_sq = (height[0].item() * width[0].item()) ** 0.5
-        scale = resolution_sq / self.input_sq_size
-        pos_emb = self.pos_embed(x, H, W, scale=scale, base_size=base_size)
-
-        # === get timestep embed ===
-        t = self.t_embedder(timestep, dtype=x.dtype)  # [B, C]
-        fps = self.fps_embedder(fps.unsqueeze(1), B)
-        t = t + fps
-        t_mlp = self.t_block(t)
-        t0 = t0_mlp = None
-        if x_mask is not None:
-            t0_timestep = torch.zeros_like(timestep)
-            t0 = self.t_embedder(t0_timestep, dtype=x.dtype)
-            t0 = t0 + fps
-            t0_mlp = self.t_block(t0)
-
-        # === get y embed ===
-        if self.config.skip_y_embedder:
-            y_lens = mask
-            if isinstance(y_lens, torch.Tensor):
-                y_lens = y_lens.long().tolist()
-        else:
-            y, y_lens = self.encode_text(y, mask)
-
-        # === get x embed ===
-        x = self.x_embedder(x)  # [B, N, C]
-        x = rearrange(x, "B (T S) C -> B T S C", T=T, S=S)
-        x = x + pos_emb
-
-        # shard over the sequence dim if sp is enabled
-        if self.enable_sequence_parallelism:
-            x = split_forward_gather_backward(
-                x, get_sequence_parallel_group(), dim=2, grad_scale="down"
-            )
-            S = S // dist.get_world_size(get_sequence_parallel_group())
-
-        x = rearrange(x, "B T S C -> B (T S) C", T=T, S=S)
-
-        # === blocks ===
-        for spatial_block, temporal_block in zip(
-            self.spatial_blocks, self.temporal_blocks
-        ):
-            x = auto_grad_checkpoint(
-                spatial_block, x, y, t_mlp, y_lens, x_mask, t0_mlp, T, S
-            )
-            x = auto_grad_checkpoint(
-                temporal_block, x, y, t_mlp, y_lens, x_mask, t0_mlp, T, S
-            )
-
-        if self.enable_sequence_parallelism:
-            x = rearrange(x, "B (T S) C -> B T S C", T=T, S=S)
-            x = gather_forward_split_backward(
-                x, get_sequence_parallel_group(), dim=2, grad_scale="up"
-            )
-            S = S * dist.get_world_size(get_sequence_parallel_group())
-            x = rearrange(x, "B T S C -> B (T S) C", T=T, S=S)
-
-        # === final layer ===
-        x = self.final_layer(x, t, x_mask, t0, T, S)
-        x = self.unpatchify(x, T, H, W, Tx, Hx, Wx)
-
-        # cast to float32 for better accuracy
-        x = x.to(torch.float32)
-        return x
-
-    def unpatchify(self, x, N_t, N_h, N_w, R_t, R_h, R_w):
-        """
-        Args:
-            x (torch.Tensor): of shape [B, N, C]
-
-        Return:
-            x (torch.Tensor): of shape [B, C_out, T, H, W]
-        """
-
-        # N_t, N_h, N_w = [self.input_size[i] // self.patch_size[i] for i in range(3)]
-        T_p, H_p, W_p = self.patch_size
-        x = rearrange(
-            x,
-            "B (N_t N_h N_w) (T_p H_p W_p C_out) -> B C_out (N_t T_p) (N_h H_p) (N_w W_p)",
-            N_t=N_t,
-            N_h=N_h,
-            N_w=N_w,
-            T_p=T_p,
-            H_p=H_p,
-            W_p=W_p,
-            C_out=self.out_channels,
-        )
-        # unpad
-        x = x[:, :, :R_t, :R_h, :R_w]
-        return x
-
-
-# @MODELS.register_module("STDiT8-XL/2")
-def STDiT8_XL_2(from_pretrained=None, **kwargs):
-    force_huggingface = kwargs.pop("force_huggingface", False)
-    if (
-        force_huggingface
-        or from_pretrained is not None
-        and not os.path.isdir(from_pretrained)
-    ):
-        model = STDiT8.from_pretrained(from_pretrained, **kwargs)
-    else:
-        config = STDiT8Config(
-            depth=28, hidden_size=1152, patch_size=(1, 2, 2), num_heads=16, **kwargs
-        )
-        model = STDiT8(config)
-        if from_pretrained is not None:
-            load_checkpoint(model, from_pretrained)
-    return model
-
-
-# @MODELS.register_module("STDiT8-2B/2")
-def STDiT8_2B_2(from_pretrained=None, **kwargs):
-    force_huggingface = kwargs.pop("force_huggingface", False)
-    if (
-        force_huggingface
-        or from_pretrained is not None
-        and not os.path.isdir(from_pretrained)
-    ):
-        model = STDiT8.from_pretrained(from_pretrained, **kwargs)
-    else:
-        config = STDiT8Config(
-            depth=56, hidden_size=1152, patch_size=(1, 2, 2), num_heads=16, **kwargs
-        )
-        model = STDiT8(config)
-        if from_pretrained is not None:
-            load_checkpoint(model, from_pretrained)
-    return model
-
-
-# @MODELS.register_module("STDiT8-3B/2")
-def STDiT8_3B_2(from_pretrained=None, **kwargs):
-    if from_pretrained is not None and not os.path.isdir(from_pretrained):
-        model = STDiT8.from_pretrained(from_pretrained, **kwargs)
-    else:
-        config = STDiT8Config(
-            depth=28, hidden_size=1872, patch_size=(1, 2, 2), num_heads=26, **kwargs
-        )
-        model = STDiT8(config)
-        if from_pretrained is not None:
-            load_checkpoint(model, from_pretrained)
-    return model
-
-
-if __name__ == "__main__":
-    model = (
-        STDiT8_2B_2(
-            from_pretrained="/project/llmsvgen/share/shared_ckpts/Open-Sora/OpenSora-STDiT-2B-inter-zero/",
-            qk_norm=True,
-            enable_flash_attn=True,
-            enable_layernorm_kernel=True,
-        )
-        .to(device="cuda", dtype=torch.bfloat16)
-        .eval()
-    )
-
-    model2 = (
-        STDiT8_XL_2(
-            from_pretrained="/project/llmsvgen/share/shared_ckpts/Open-Sora/OpenSora-STDiT-v3/",
-            qk_norm=True,
-            enable_flash_attn=True,
-            enable_layernorm_kernel=True,
-        )
-        .to(device="cuda", dtype=torch.bfloat16)
-        .eval()
-    )
-
-    x = torch.randn(2, 4, 15, 64, 64).to(device="cuda", dtype=torch.bfloat16)
-    y = torch.randn(2, 1, 300, 4096).to(device="cuda", dtype=torch.bfloat16)
-    timestep = torch.randn(2).to(device="cuda", dtype=torch.bfloat16)
-    mask = torch.randn(2, 300).to(device="cuda", dtype=torch.bfloat16)
-
-    fps = torch.randn(2).to(device="cuda", dtype=torch.bfloat16)
-    height = torch.randn(2).to(device="cuda", dtype=torch.bfloat16)
-    width = torch.randn(2).to(device="cuda", dtype=torch.bfloat16)
-
-    with torch.no_grad():
-        out1 = model(x, timestep, y, mask, None, fps, height, width)
-        # out2 = model2(x, timestep, y, mask, None, fps, height, width)
-    print(out1)
-
-    # print(model)
-
-    # print(model2)
diff --git a/videotuna/models/opensora/models/text_encoder/__init__.py b/videotuna/models/opensora/models/text_encoder/__init__.py
deleted file mode 100644
index 9fc6a999..00000000
--- a/videotuna/models/opensora/models/text_encoder/__init__.py
+++ /dev/null
@@ -1,3 +0,0 @@
-from .classes import ClassEncoder
-from .clip import ClipEncoder
-from .t5 import T5Encoder
diff --git a/videotuna/models/opensora/models/text_encoder/classes.py b/videotuna/models/opensora/models/text_encoder/classes.py
deleted file mode 100644
index d3d428a3..00000000
--- a/videotuna/models/opensora/models/text_encoder/classes.py
+++ /dev/null
@@ -1,22 +0,0 @@
-import torch
-
-from videotuna.models.opensora.registry import MODELS
-
-
-@MODELS.register_module("classes")
-class ClassEncoder:
-    def __init__(
-        self, num_classes, model_max_length=None, device="cuda", dtype=torch.float
-    ):
-        self.num_classes = num_classes
-        self.y_embedder = None
-
-        self.model_max_length = model_max_length
-        self.output_dim = None
-        self.device = device
-
-    def encode(self, text):
-        return dict(y=torch.tensor([int(t) for t in text]).to(self.device))
-
-    def null(self, n):
-        return torch.tensor([self.num_classes] * n).to(self.device)
diff --git a/videotuna/models/opensora/models/text_encoder/clip.py b/videotuna/models/opensora/models/text_encoder/clip.py
deleted file mode 100644
index faa341e6..00000000
--- a/videotuna/models/opensora/models/text_encoder/clip.py
+++ /dev/null
@@ -1,118 +0,0 @@
-# Copyright 2024 Vchitect/Latte
-#
-# Licensed under the Apache License, Version 2.0 (the "License");
-# you may not use this file except in compliance with the License.
-# You may obtain a copy of the License at
-#
-#     http://www.apache.org/licenses/LICENSE-2.0
-#
-# Unless required by applicable law or agreed to in writing, software
-# distributed under the License is distributed on an "AS IS" BASIS,
-# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
-# See the License for the specific language governing permissions and
-# limitations under the License.# Modified from Latte
-#
-# This file is adapted from the Latte project.
-#
-# This source code is licensed under the license found in the
-# LICENSE file in the root directory of this source tree.
-# --------------------------------------------------------
-# References:
-# Latte: https://github.com/Vchitect/Latte
-# DiT:   https://github.com/facebookresearch/DiT/tree/main
-# --------------------------------------------------------
-
-
-import torch
-import torch.nn as nn
-import transformers
-from transformers import CLIPTextModel, CLIPTokenizer
-
-from videotuna.models.opensora.registry import MODELS
-
-transformers.logging.set_verbosity_error()
-
-
-class AbstractEncoder(nn.Module):
-    def __init__(self):
-        super().__init__()
-
-    def encode(self, *args, **kwargs):
-        raise NotImplementedError
-
-
-class FrozenCLIPEmbedder(AbstractEncoder):
-    """Uses the CLIP transformer encoder for text (from Hugging Face)"""
-
-    def __init__(
-        self, path="openai/clip-vit-huge-patch14", device="cuda", max_length=77
-    ):
-        super().__init__()
-        self.tokenizer = CLIPTokenizer.from_pretrained(path)
-        self.transformer = CLIPTextModel.from_pretrained(path)
-        self.device = device
-        self.max_length = max_length
-        self._freeze()
-
-    def _freeze(self):
-        self.transformer = self.transformer.eval()
-        for param in self.parameters():
-            param.requires_grad = False
-
-    def forward(self, text):
-        batch_encoding = self.tokenizer(
-            text,
-            truncation=True,
-            max_length=self.max_length,
-            return_length=True,
-            return_overflowing_tokens=False,
-            padding="max_length",
-            return_tensors="pt",
-        )
-        tokens = batch_encoding["input_ids"].to(self.device)
-        outputs = self.transformer(input_ids=tokens)
-
-        z = outputs.last_hidden_state
-        pooled_z = outputs.pooler_output
-        return z, pooled_z
-
-    def encode(self, text):
-        return self(text)
-
-
-@MODELS.register_module("clip")
-class ClipEncoder:
-    """
-    Embeds text prompt into vector representations. Also handles text dropout for classifier-free guidance.
-    """
-
-    def __init__(
-        self,
-        from_pretrained,
-        model_max_length=77,
-        device="cuda",
-        dtype=torch.float,
-    ):
-        super().__init__()
-        assert from_pretrained is not None, "Please specify the path to the T5 model"
-
-        self.text_encoder = FrozenCLIPEmbedder(
-            path=from_pretrained, max_length=model_max_length
-        ).to(device, dtype)
-        self.y_embedder = None
-
-        self.model_max_length = model_max_length
-        self.output_dim = self.text_encoder.transformer.config.hidden_size
-
-    def encode(self, text):
-        _, pooled_embeddings = self.text_encoder.encode(text)
-        y = pooled_embeddings.unsqueeze(1).unsqueeze(1)
-        return dict(y=y)
-
-    def null(self, n):
-        null_y = self.y_embedder.y_embedding[None].repeat(n, 1, 1)[:, None]
-        return null_y
-
-    def to(self, dtype):
-        self.text_encoder = self.text_encoder.to(dtype)
-        return self
diff --git a/videotuna/models/opensora/models/text_encoder/t5.py b/videotuna/models/opensora/models/text_encoder/t5.py
deleted file mode 100644
index e63b7abc..00000000
--- a/videotuna/models/opensora/models/text_encoder/t5.py
+++ /dev/null
@@ -1,503 +0,0 @@
-# Adapted from PixArt
-#
-# Copyright (C) 2023  PixArt-alpha/PixArt-alpha
-#
-# This program is free software: you can redistribute it and/or modify
-# it under the terms of the GNU Affero General Public License as published
-# by the Free Software Foundation, either version 3 of the License, or
-# (at your option) any later version.
-#
-# This program is distributed in the hope that it will be useful,
-# but WITHOUT ANY WARRANTY; without even the implied warranty of
-# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
-# GNU Affero General Public License for more details.
-#
-#
-# This source code is licensed under the license found in the
-# LICENSE file in the root directory of this source tree.
-# --------------------------------------------------------
-# References:
-# PixArt: https://github.com/PixArt-alpha/PixArt-alpha
-# T5:     https://github.com/google-research/text-to-text-transfer-transformer
-# --------------------------------------------------------
-
-
-import html
-import os
-import re
-import urllib.parse as ul
-
-import ftfy
-import torch
-from bs4 import BeautifulSoup
-from huggingface_hub import hf_hub_download
-from transformers import AutoTokenizer, T5EncoderModel
-
-from videotuna.models.opensora.registry import MODELS
-
-
-class T5Embedder:
-    available_models = ["DeepFloyd/t5-v1_1-xxl"]
-    bad_punct_regex = re.compile(
-        r"["
-        + "#®•©™&@·º½¾¿¡§~"
-        + "\)"
-        + "\("
-        + "\]"
-        + "\["
-        + "\}"
-        + "\{"
-        + "\|"
-        + "\\"
-        + "\/"
-        + "\*"
-        + r"]{1,}"
-    )  # noqa
-
-    def __init__(
-        self,
-        device,
-        from_pretrained=None,
-        *,
-        cache_dir=None,
-        hf_token=None,
-        use_text_preprocessing=True,
-        t5_model_kwargs=None,
-        torch_dtype=None,
-        use_offload_folder=None,
-        model_max_length=120,
-    ):
-        self.device = torch.device(device)
-        self.torch_dtype = torch_dtype or torch.bfloat16
-        self.cache_dir = cache_dir
-
-        if t5_model_kwargs is None:
-            t5_model_kwargs = {
-                "low_cpu_mem_usage": True,
-                "torch_dtype": self.torch_dtype,
-            }
-
-            if use_offload_folder is not None:
-                t5_model_kwargs["offload_folder"] = use_offload_folder
-                t5_model_kwargs["device_map"] = {
-                    "shared": self.device,
-                    "encoder.embed_tokens": self.device,
-                    "encoder.block.0": self.device,
-                    "encoder.block.1": self.device,
-                    "encoder.block.2": self.device,
-                    "encoder.block.3": self.device,
-                    "encoder.block.4": self.device,
-                    "encoder.block.5": self.device,
-                    "encoder.block.6": self.device,
-                    "encoder.block.7": self.device,
-                    "encoder.block.8": self.device,
-                    "encoder.block.9": self.device,
-                    "encoder.block.10": self.device,
-                    "encoder.block.11": self.device,
-                    "encoder.block.12": "disk",
-                    "encoder.block.13": "disk",
-                    "encoder.block.14": "disk",
-                    "encoder.block.15": "disk",
-                    "encoder.block.16": "disk",
-                    "encoder.block.17": "disk",
-                    "encoder.block.18": "disk",
-                    "encoder.block.19": "disk",
-                    "encoder.block.20": "disk",
-                    "encoder.block.21": "disk",
-                    "encoder.block.22": "disk",
-                    "encoder.block.23": "disk",
-                    "encoder.final_layer_norm": "disk",
-                    "encoder.dropout": "disk",
-                }
-            else:
-                t5_model_kwargs["device_map"] = {
-                    "shared": self.device,
-                    "encoder": self.device,
-                }
-
-        self.use_text_preprocessing = use_text_preprocessing
-        self.hf_token = hf_token
-
-        assert from_pretrained in self.available_models
-        self.tokenizer = AutoTokenizer.from_pretrained(
-            from_pretrained, cache_dir=cache_dir
-        )
-        self.model = T5EncoderModel.from_pretrained(
-            from_pretrained, cache_dir=cache_dir, **t5_model_kwargs
-        ).eval()
-        self.model_max_length = model_max_length
-
-    def get_text_embeddings(self, texts):
-        texts = [self.text_preprocessing(text) for text in texts]
-
-        text_tokens_and_mask = self.tokenizer(
-            texts,
-            max_length=self.model_max_length,
-            padding="max_length",
-            truncation=True,
-            return_attention_mask=True,
-            add_special_tokens=True,
-            return_tensors="pt",
-        )
-
-        text_tokens_and_mask["input_ids"] = text_tokens_and_mask["input_ids"]
-        text_tokens_and_mask["attention_mask"] = text_tokens_and_mask["attention_mask"]
-
-        with torch.no_grad():
-            text_encoder_embs = self.model(
-                input_ids=text_tokens_and_mask["input_ids"].to(self.device),
-                attention_mask=text_tokens_and_mask["attention_mask"].to(self.device),
-            )["last_hidden_state"].detach()
-        return text_encoder_embs, text_tokens_and_mask["attention_mask"].to(self.device)
-
-    def text_preprocessing(self, text):
-        if self.use_text_preprocessing:
-            # The exact text cleaning as was in the training stage:
-            text = self.clean_caption(text)
-            text = self.clean_caption(text)
-            return text
-        else:
-            return text.lower().strip()
-
-    @staticmethod
-    def basic_clean(text):
-        text = ftfy.fix_text(text)
-        text = html.unescape(html.unescape(text))
-        return text.strip()
-
-    def clean_caption(self, caption):
-        caption = str(caption)
-        caption = ul.unquote_plus(caption)
-        caption = caption.strip().lower()
-        caption = re.sub("<person>", "person", caption)
-        # urls:
-        caption = re.sub(
-            r"\b((?:https?:(?:\/{1,3}|[a-zA-Z0-9%])|[a-zA-Z0-9.\-]+[.](?:com|co|ru|net|org|edu|gov|it)[\w/-]*\b\/?(?!@)))",  # noqa
-            "",
-            caption,
-        )  # regex for urls
-        caption = re.sub(
-            r"\b((?:www:(?:\/{1,3}|[a-zA-Z0-9%])|[a-zA-Z0-9.\-]+[.](?:com|co|ru|net|org|edu|gov|it)[\w/-]*\b\/?(?!@)))",  # noqa
-            "",
-            caption,
-        )  # regex for urls
-        # html:
-        caption = BeautifulSoup(caption, features="html.parser").text
-
-        # @<nickname>
-        caption = re.sub(r"@[\w\d]+\b", "", caption)
-
-        # 31C0—31EF CJK Strokes
-        # 31F0—31FF Katakana Phonetic Extensions
-        # 3200—32FF Enclosed CJK Letters and Months
-        # 3300—33FF CJK Compatibility
-        # 3400—4DBF CJK Unified Ideographs Extension A
-        # 4DC0—4DFF Yijing Hexagram Symbols
-        # 4E00—9FFF CJK Unified Ideographs
-        caption = re.sub(r"[\u31c0-\u31ef]+", "", caption)
-        caption = re.sub(r"[\u31f0-\u31ff]+", "", caption)
-        caption = re.sub(r"[\u3200-\u32ff]+", "", caption)
-        caption = re.sub(r"[\u3300-\u33ff]+", "", caption)
-        caption = re.sub(r"[\u3400-\u4dbf]+", "", caption)
-        caption = re.sub(r"[\u4dc0-\u4dff]+", "", caption)
-        caption = re.sub(r"[\u4e00-\u9fff]+", "", caption)
-        #######################################################
-
-        # все виды тире / all types of dash --> "-"
-        caption = re.sub(
-            r"[\u002D\u058A\u05BE\u1400\u1806\u2010-\u2015\u2E17\u2E1A\u2E3A\u2E3B\u2E40\u301C\u3030\u30A0\uFE31\uFE32\uFE58\uFE63\uFF0D]+",  # noqa
-            "-",
-            caption,
-        )
-
-        # кавычки к одному стандарту
-        caption = re.sub(r"[`´«»“”¨]", '"', caption)
-        caption = re.sub(r"[‘’]", "'", caption)
-
-        # &quot;
-        caption = re.sub(r"&quot;?", "", caption)
-        # &amp
-        caption = re.sub(r"&amp", "", caption)
-
-        # ip adresses:
-        caption = re.sub(r"\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}", " ", caption)
-
-        # article ids:
-        caption = re.sub(r"\d:\d\d\s+$", "", caption)
-
-        # \n
-        caption = re.sub(r"\\n", " ", caption)
-
-        # "#123"
-        caption = re.sub(r"#\d{1,3}\b", "", caption)
-        # "#12345.."
-        caption = re.sub(r"#\d{5,}\b", "", caption)
-        # "123456.."
-        caption = re.sub(r"\b\d{6,}\b", "", caption)
-        # filenames:
-        caption = re.sub(
-            r"[\S]+\.(?:png|jpg|jpeg|bmp|webp|eps|pdf|apk|mp4)", "", caption
-        )
-
-        #
-        caption = re.sub(r"[\"\']{2,}", r'"', caption)  # """AUSVERKAUFT"""
-        caption = re.sub(r"[\.]{2,}", r" ", caption)  # """AUSVERKAUFT"""
-
-        caption = re.sub(
-            self.bad_punct_regex, r" ", caption
-        )  # ***AUSVERKAUFT***, #AUSVERKAUFT
-        caption = re.sub(r"\s+\.\s+", r" ", caption)  # " . "
-
-        # this-is-my-cute-cat / this_is_my_cute_cat
-        regex2 = re.compile(r"(?:\-|\_)")
-        if len(re.findall(regex2, caption)) > 3:
-            caption = re.sub(regex2, " ", caption)
-
-        caption = self.basic_clean(caption)
-
-        caption = re.sub(r"\b[a-zA-Z]{1,3}\d{3,15}\b", "", caption)  # jc6640
-        caption = re.sub(r"\b[a-zA-Z]+\d+[a-zA-Z]+\b", "", caption)  # jc6640vc
-        caption = re.sub(r"\b\d+[a-zA-Z]+\d+\b", "", caption)  # 6640vc231
-
-        caption = re.sub(r"(worldwide\s+)?(free\s+)?shipping", "", caption)
-        caption = re.sub(r"(free\s)?download(\sfree)?", "", caption)
-        caption = re.sub(r"\bclick\b\s(?:for|on)\s\w+", "", caption)
-        caption = re.sub(
-            r"\b(?:png|jpg|jpeg|bmp|webp|eps|pdf|apk|mp4)(\simage[s]?)?", "", caption
-        )
-        caption = re.sub(r"\bpage\s+\d+\b", "", caption)
-
-        caption = re.sub(
-            r"\b\d*[a-zA-Z]+\d+[a-zA-Z]+\d+[a-zA-Z\d]*\b", r" ", caption
-        )  # j2d1a2a...
-
-        caption = re.sub(r"\b\d+\.?\d*[xх×]\d+\.?\d*\b", "", caption)
-
-        caption = re.sub(r"\b\s+\:\s+", r": ", caption)
-        caption = re.sub(r"(\D[,\./])\b", r"\1 ", caption)
-        caption = re.sub(r"\s+", " ", caption)
-
-        caption.strip()
-
-        caption = re.sub(r"^[\"\']([\w\W]+)[\"\']$", r"\1", caption)
-        caption = re.sub(r"^[\'\_,\-\:;]", r"", caption)
-        caption = re.sub(r"[\'\_,\-\:\-\+]$", r"", caption)
-        caption = re.sub(r"^\.\S+$", "", caption)
-
-        return caption.strip()
-
-
-@MODELS.register_module("t5")
-class T5Encoder:
-    def __init__(
-        self,
-        from_pretrained=None,
-        model_max_length=120,
-        device="cuda",
-        dtype=torch.float,
-        cache_dir=None,
-        shardformer=False,
-    ):
-        assert from_pretrained is not None, "Please specify the path to the T5 model"
-
-        self.t5 = T5Embedder(
-            device=device,
-            torch_dtype=dtype,
-            from_pretrained=from_pretrained,
-            cache_dir=cache_dir,
-            model_max_length=model_max_length,
-        )
-        self.t5.model.to(dtype=dtype)
-        self.y_embedder = None
-
-        self.model_max_length = model_max_length
-        self.output_dim = self.t5.model.config.d_model
-
-        if shardformer:
-            self.shardformer_t5()
-
-    def shardformer_t5(self):
-        from colossalai.shardformer import ShardConfig, ShardFormer
-        from opensora.acceleration.shardformer.policy.t5_encoder import T5EncoderPolicy
-        from opensora.utils.misc import requires_grad
-
-        shard_config = ShardConfig(
-            tensor_parallel_process_group=None,
-            pipeline_stage_manager=None,
-            enable_tensor_parallelism=False,
-            enable_fused_normalization=False,
-            enable_flash_attention=False,
-            enable_jit_fused=True,
-            enable_sequence_parallelism=False,
-            enable_sequence_overlap=False,
-        )
-        shard_former = ShardFormer(shard_config=shard_config)
-        optim_model, _ = shard_former.optimize(self.t5.model, policy=T5EncoderPolicy())
-        self.t5.model = optim_model.half()
-
-        # ensure the weights are frozen
-        requires_grad(self.t5.model, False)
-
-    def encode(self, text):
-        caption_embs, emb_masks = self.t5.get_text_embeddings(text)
-        caption_embs = caption_embs[:, None]
-        return dict(y=caption_embs, mask=emb_masks)
-
-    def null(self, n):
-        null_y = self.y_embedder.y_embedding[None].repeat(n, 1, 1)[:, None]
-        return null_y
-
-
-def basic_clean(text):
-    text = ftfy.fix_text(text)
-    text = html.unescape(html.unescape(text))
-    return text.strip()
-
-
-BAD_PUNCT_REGEX = re.compile(
-    r"["
-    + "#®•©™&@·º½¾¿¡§~"
-    + "\)"
-    + "\("
-    + "\]"
-    + "\["
-    + "\}"
-    + "\{"
-    + "\|"
-    + "\\"
-    + "\/"
-    + "\*"
-    + r"]{1,}"
-)  # noqa]
-
-
-def clean_caption(caption):
-    import urllib.parse as ul
-
-    from bs4 import BeautifulSoup
-
-    caption = str(caption)
-    caption = ul.unquote_plus(caption)
-    caption = caption.strip().lower()
-    caption = re.sub("<person>", "person", caption)
-    # urls:
-    caption = re.sub(
-        r"\b((?:https?:(?:\/{1,3}|[a-zA-Z0-9%])|[a-zA-Z0-9.\-]+[.](?:com|co|ru|net|org|edu|gov|it)[\w/-]*\b\/?(?!@)))",  # noqa
-        "",
-        caption,
-    )  # regex for urls
-    caption = re.sub(
-        r"\b((?:www:(?:\/{1,3}|[a-zA-Z0-9%])|[a-zA-Z0-9.\-]+[.](?:com|co|ru|net|org|edu|gov|it)[\w/-]*\b\/?(?!@)))",  # noqa
-        "",
-        caption,
-    )  # regex for urls
-    # html:
-    caption = BeautifulSoup(caption, features="html.parser").text
-
-    # @<nickname>
-    caption = re.sub(r"@[\w\d]+\b", "", caption)
-
-    # 31C0—31EF CJK Strokes
-    # 31F0—31FF Katakana Phonetic Extensions
-    # 3200—32FF Enclosed CJK Letters and Months
-    # 3300—33FF CJK Compatibility
-    # 3400—4DBF CJK Unified Ideographs Extension A
-    # 4DC0—4DFF Yijing Hexagram Symbols
-    # 4E00—9FFF CJK Unified Ideographs
-    caption = re.sub(r"[\u31c0-\u31ef]+", "", caption)
-    caption = re.sub(r"[\u31f0-\u31ff]+", "", caption)
-    caption = re.sub(r"[\u3200-\u32ff]+", "", caption)
-    caption = re.sub(r"[\u3300-\u33ff]+", "", caption)
-    caption = re.sub(r"[\u3400-\u4dbf]+", "", caption)
-    caption = re.sub(r"[\u4dc0-\u4dff]+", "", caption)
-    caption = re.sub(r"[\u4e00-\u9fff]+", "", caption)
-    #######################################################
-
-    # все виды тире / all types of dash --> "-"
-    caption = re.sub(
-        r"[\u002D\u058A\u05BE\u1400\u1806\u2010-\u2015\u2E17\u2E1A\u2E3A\u2E3B\u2E40\u301C\u3030\u30A0\uFE31\uFE32\uFE58\uFE63\uFF0D]+",  # noqa
-        "-",
-        caption,
-    )
-
-    # кавычки к одному стандарту
-    caption = re.sub(r"[`´«»“”¨]", '"', caption)
-    caption = re.sub(r"[‘’]", "'", caption)
-
-    # &quot;
-    caption = re.sub(r"&quot;?", "", caption)
-    # &amp
-    caption = re.sub(r"&amp", "", caption)
-
-    # ip adresses:
-    caption = re.sub(r"\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}", " ", caption)
-
-    # article ids:
-    caption = re.sub(r"\d:\d\d\s+$", "", caption)
-
-    # \n
-    caption = re.sub(r"\\n", " ", caption)
-
-    # "#123"
-    caption = re.sub(r"#\d{1,3}\b", "", caption)
-    # "#12345.."
-    caption = re.sub(r"#\d{5,}\b", "", caption)
-    # "123456.."
-    caption = re.sub(r"\b\d{6,}\b", "", caption)
-    # filenames:
-    caption = re.sub(r"[\S]+\.(?:png|jpg|jpeg|bmp|webp|eps|pdf|apk|mp4)", "", caption)
-
-    #
-    caption = re.sub(r"[\"\']{2,}", r'"', caption)  # """AUSVERKAUFT"""
-    caption = re.sub(r"[\.]{2,}", r" ", caption)  # """AUSVERKAUFT"""
-
-    caption = re.sub(BAD_PUNCT_REGEX, r" ", caption)  # ***AUSVERKAUFT***, #AUSVERKAUFT
-    caption = re.sub(r"\s+\.\s+", r" ", caption)  # " . "
-
-    # this-is-my-cute-cat / this_is_my_cute_cat
-    regex2 = re.compile(r"(?:\-|\_)")
-    if len(re.findall(regex2, caption)) > 3:
-        caption = re.sub(regex2, " ", caption)
-
-    caption = basic_clean(caption)
-
-    caption = re.sub(r"\b[a-zA-Z]{1,3}\d{3,15}\b", "", caption)  # jc6640
-    caption = re.sub(r"\b[a-zA-Z]+\d+[a-zA-Z]+\b", "", caption)  # jc6640vc
-    caption = re.sub(r"\b\d+[a-zA-Z]+\d+\b", "", caption)  # 6640vc231
-
-    caption = re.sub(r"(worldwide\s+)?(free\s+)?shipping", "", caption)
-    caption = re.sub(r"(free\s)?download(\sfree)?", "", caption)
-    caption = re.sub(r"\bclick\b\s(?:for|on)\s\w+", "", caption)
-    caption = re.sub(
-        r"\b(?:png|jpg|jpeg|bmp|webp|eps|pdf|apk|mp4)(\simage[s]?)?", "", caption
-    )
-    caption = re.sub(r"\bpage\s+\d+\b", "", caption)
-
-    caption = re.sub(
-        r"\b\d*[a-zA-Z]+\d+[a-zA-Z]+\d+[a-zA-Z\d]*\b", r" ", caption
-    )  # j2d1a2a...
-
-    caption = re.sub(r"\b\d+\.?\d*[xх×]\d+\.?\d*\b", "", caption)
-
-    caption = re.sub(r"\b\s+\:\s+", r": ", caption)
-    caption = re.sub(r"(\D[,\./])\b", r"\1 ", caption)
-    caption = re.sub(r"\s+", " ", caption)
-
-    caption.strip()
-
-    caption = re.sub(r"^[\"\']([\w\W]+)[\"\']$", r"\1", caption)
-    caption = re.sub(r"^[\'\_,\-\:;]", r"", caption)
-    caption = re.sub(r"[\'\_,\-\:\-\+]$", r"", caption)
-    caption = re.sub(r"^\.\S+$", "", caption)
-
-    return caption.strip()
-
-
-def text_preprocessing(text, use_text_preprocessing: bool = True):
-    if use_text_preprocessing:
-        # The exact text cleaning as was in the training stage:
-        text = clean_caption(text)
-        text = clean_caption(text)
-        return text
-    else:
-        return text.lower().strip()
diff --git a/videotuna/models/opensora/models/vae/__init__.py b/videotuna/models/opensora/models/vae/__init__.py
deleted file mode 100644
index f27be475..00000000
--- a/videotuna/models/opensora/models/vae/__init__.py
+++ /dev/null
@@ -1,3 +0,0 @@
-from .discriminator import DISCRIMINATOR_3D
-from .vae import VideoAutoencoderKL, VideoAutoencoderKLTemporalDecoder
-from .vae_temporal import VAE_Temporal
diff --git a/videotuna/models/opensora/models/vae/discriminator.py b/videotuna/models/opensora/models/vae/discriminator.py
deleted file mode 100644
index a3a17877..00000000
--- a/videotuna/models/opensora/models/vae/discriminator.py
+++ /dev/null
@@ -1,476 +0,0 @@
-import functools
-import math
-
-import numpy as np
-import torch
-import torch.nn as nn
-import torch.nn.functional as F
-
-from videotuna.models.opensora.registry import MODELS
-from videotuna.models.opensora.utils.ckpt_utils import find_model, load_checkpoint
-
-
-def cast_tuple(t, length=1):
-    return t if isinstance(t, tuple) else ((t,) * length)
-
-
-def xavier_uniform_weight_init(m):
-    if isinstance(m, nn.Conv3d) or isinstance(m, nn.Linear):
-        nn.init.xavier_uniform_(m.weight, gain=nn.init.calculate_gain("relu"))
-        if m.bias is not None:
-            nn.init.zeros_(m.bias)
-        # print("initialized module to xavier_uniform:", m)
-
-
-# SCH: taken from Open Sora Plan
-def n_layer_disc_weights_init(m):
-    classname = m.__class__.__name__
-    if classname.find("Conv") != -1:
-        nn.init.normal_(m.weight.data, 0.0, 0.02)
-    elif classname.find("BatchNorm") != -1:
-        nn.init.normal_(m.weight.data, 1.0, 0.02)
-        nn.init.constant_(m.bias.data, 0)
-
-
-# SCH: own implementation modified on top of: discriminator with anti-aliased downsampling (blurpool Zhang et al.)
-class BlurPool3D(nn.Module):
-    def __init__(
-        self,
-        channels,
-        pad_type="reflect",
-        filt_size=3,
-        stride=2,
-        pad_off=0,
-        device="cpu",
-        dtype=torch.bfloat16,
-    ):
-        super(BlurPool3D, self).__init__()
-        self.filt_size = filt_size
-        self.pad_off = pad_off
-        self.pad_sizes = [
-            int(1.0 * (filt_size - 1) / 2),
-            int(np.ceil(1.0 * (filt_size - 1) / 2)),
-            int(1.0 * (filt_size - 1) / 2),
-            int(np.ceil(1.0 * (filt_size - 1) / 2)),
-            int(1.0 * (filt_size - 1) / 2),
-            int(np.ceil(1.0 * (filt_size - 1) / 2)),
-        ]
-        self.pad_sizes = [pad_size + pad_off for pad_size in self.pad_sizes]
-        self.stride = stride
-        self.off = int((self.stride - 1) / 2.0)
-        self.channels = channels
-
-        if self.filt_size == 1:
-            a = np.array(
-                [
-                    1.0,
-                ]
-            )
-        elif self.filt_size == 2:
-            a = np.array([1.0, 1.0])
-        elif self.filt_size == 3:
-            a = np.array([1.0, 2.0, 1.0])
-        elif self.filt_size == 4:
-            a = np.array([1.0, 3.0, 3.0, 1.0])
-        elif self.filt_size == 5:
-            a = np.array([1.0, 4.0, 6.0, 4.0, 1.0])
-        elif self.filt_size == 6:
-            a = np.array([1.0, 5.0, 10.0, 10.0, 5.0, 1.0])
-        elif self.filt_size == 7:
-            a = np.array([1.0, 6.0, 15.0, 20.0, 15.0, 6.0, 1.0])
-
-        filt_2d = a[:, None] * a[None, :]
-        filt_3d = torch.Tensor(a[:, None, None] * filt_2d[None, :, :]).to(device, dtype)
-
-        filt = filt_3d / torch.sum(filt_3d)  # SCH: modified to it 3D
-        self.register_buffer(
-            "filt", filt[None, None, :, :, :].repeat((self.channels, 1, 1, 1, 1))
-        )
-
-        self.pad = get_pad_layer(pad_type)(self.pad_sizes)
-
-    def forward(self, inp):
-        if self.filt_size == 1:
-            if self.pad_off == 0:
-                return inp[:, :, :: self.stride, :: self.stride]
-            else:
-                return self.pad(inp)[:, :, :: self.stride, :: self.stride]
-        else:
-            return F.conv3d(
-                self.pad(inp), self.filt, stride=self.stride, groups=inp.shape[1]
-            )
-
-
-class ResBlockDown(nn.Module):
-    """3D StyleGAN ResBlock for D."""
-
-    def __init__(
-        self,
-        in_channels,
-        filters,
-        activation_fn,
-        num_groups=32,
-        device="cpu",
-        dtype=torch.bfloat16,
-    ):
-        super().__init__()
-
-        self.filters = filters
-        self.activation_fn = activation_fn
-
-        # SCH: NOTE: although paper says conv (X->Y, Y->Y), original code implementation is (X->X, X->Y), we follow code
-        self.conv1 = nn.Conv3d(
-            in_channels, in_channels, (3, 3, 3), padding=1, device=device, dtype=dtype
-        )  # NOTE: init to xavier_uniform
-        self.norm1 = nn.GroupNorm(num_groups, in_channels, device=device, dtype=dtype)
-
-        self.blur = BlurPool3D(in_channels, device=device, dtype=dtype)
-
-        self.conv2 = nn.Conv3d(
-            in_channels, self.filters, (1, 1, 1), bias=False, device=device, dtype=dtype
-        )  # NOTE: init to xavier_uniform
-        self.conv3 = nn.Conv3d(
-            in_channels, self.filters, (3, 3, 3), padding=1, device=device, dtype=dtype
-        )  # NOTE: init to xavier_uniform
-        self.norm2 = nn.GroupNorm(num_groups, self.filters, device=device, dtype=dtype)
-
-        # self.apply(xavier_uniform_weight_init)
-
-    def forward(self, x):
-        residual = x
-        x = self.conv1(x)
-        x = self.norm1(x)
-        x = self.activation_fn(x)
-
-        residual = self.blur(residual)
-        residual = self.conv2(residual)
-
-        x = self.blur(x)
-        x = self.conv3(x)
-        x = self.norm2(x)
-        x = self.activation_fn(x)
-        out = (residual + x) / math.sqrt(2)
-        return out
-
-
-@MODELS.register_module()
-class NLayerDiscriminator(nn.Module):
-    """Defines a PatchGAN discriminator as in Pix2Pix
-    --> see https://github.com/junyanz/pytorch-CycleGAN-and-pix2pix/blob/master/models/networks.py
-    """
-
-    def __init__(
-        self, input_nc=3, ndf=64, n_layers=3, use_actnorm=False, from_pretrained=None
-    ):
-        """Construct a PatchGAN discriminator
-        Parameters:
-            input_nc (int)  -- the number of channels in input images
-            ndf (int)       -- the number of filters in the last conv layer
-            n_layers (int)  -- the number of conv layers in the discriminator
-            norm_layer      -- normalization layer
-        """
-        super(NLayerDiscriminator, self).__init__()
-
-        norm_layer = nn.BatchNorm2d
-
-        if (
-            type(norm_layer) == functools.partial
-        ):  # no need to use bias as BatchNorm2d has affine parameters
-            use_bias = norm_layer.func != nn.BatchNorm2d
-        else:
-            use_bias = norm_layer != nn.BatchNorm2d
-
-        kw = 4
-        padw = 1
-        sequence = [
-            nn.Conv2d(input_nc, ndf, kernel_size=kw, stride=2, padding=padw),
-            nn.LeakyReLU(0.2, True),
-        ]
-        nf_mult = 1
-        nf_mult_prev = 1
-        for n in range(1, n_layers):  # gradually increase the number of filters
-            nf_mult_prev = nf_mult
-            nf_mult = min(2**n, 8)
-            sequence += [
-                nn.Conv2d(
-                    ndf * nf_mult_prev,
-                    ndf * nf_mult,
-                    kernel_size=kw,
-                    stride=2,
-                    padding=padw,
-                    bias=use_bias,
-                ),
-                norm_layer(ndf * nf_mult),
-                nn.LeakyReLU(0.2, True),
-            ]
-
-        nf_mult_prev = nf_mult
-        nf_mult = min(2**n_layers, 8)
-        sequence += [
-            nn.Conv2d(
-                ndf * nf_mult_prev,
-                ndf * nf_mult,
-                kernel_size=kw,
-                stride=1,
-                padding=padw,
-                bias=use_bias,
-            ),
-            norm_layer(ndf * nf_mult),
-            nn.LeakyReLU(0.2, True),
-        ]
-
-        sequence += [
-            nn.Conv2d(ndf * nf_mult, 1, kernel_size=kw, stride=1, padding=padw)
-        ]  # output 1 channel prediction map
-        self.main = nn.Sequential(*sequence)
-
-        if from_pretrained is not None:
-            load_checkpoint(self, from_pretrained)
-
-    def forward(self, input):
-        """Standard forward."""
-        return self.main(input)
-
-
-class NLayerDiscriminator3D(nn.Module):
-    """Defines a 3D PatchGAN discriminator as in Pix2Pix but for 3D inputs."""
-
-    def __init__(self, input_nc=1, ndf=64, n_layers=3, use_actnorm=False):
-        """
-        Construct a 3D PatchGAN discriminator
-
-        Parameters:
-            input_nc (int)  -- the number of channels in input volumes
-            ndf (int)       -- the number of filters in the last conv layer
-            n_layers (int)  -- the number of conv layers in the discriminator
-            use_actnorm (bool) -- flag to use actnorm instead of batchnorm
-        """
-        super(NLayerDiscriminator3D, self).__init__()
-        if not use_actnorm:
-            norm_layer = nn.BatchNorm3d
-        else:
-            raise NotImplementedError("Not implemented.")
-        if type(norm_layer) == functools.partial:
-            use_bias = norm_layer.func != nn.BatchNorm3d
-        else:
-            use_bias = norm_layer != nn.BatchNorm3d
-
-        kw = 4
-        padw = 1
-        sequence = [
-            nn.Conv3d(input_nc, ndf, kernel_size=kw, stride=2, padding=padw),
-            nn.LeakyReLU(0.2, True),
-        ]
-        nf_mult = 1
-        nf_mult_prev = 1
-        for n in range(1, n_layers):  # gradually increase the number of filters
-            nf_mult_prev = nf_mult
-            nf_mult = min(2**n, 8)
-            sequence += [
-                nn.Conv3d(
-                    ndf * nf_mult_prev,
-                    ndf * nf_mult,
-                    kernel_size=(kw, kw, kw),
-                    stride=(1, 2, 2),
-                    padding=padw,
-                    bias=use_bias,
-                ),
-                norm_layer(ndf * nf_mult),
-                nn.LeakyReLU(0.2, True),
-            ]
-
-        nf_mult_prev = nf_mult
-        nf_mult = min(2**n_layers, 8)
-        sequence += [
-            nn.Conv3d(
-                ndf * nf_mult_prev,
-                ndf * nf_mult,
-                kernel_size=(kw, kw, kw),
-                stride=1,
-                padding=padw,
-                bias=use_bias,
-            ),
-            norm_layer(ndf * nf_mult),
-            nn.LeakyReLU(0.2, True),
-        ]
-
-        sequence += [
-            nn.Conv3d(ndf * nf_mult, 1, kernel_size=kw, stride=1, padding=padw)
-        ]  # output 1 channel prediction map
-        self.main = nn.Sequential(*sequence)
-
-    def forward(self, input):
-        """Standard forward."""
-        return self.main(input)
-
-
-class StyleGANDiscriminatorBlur(nn.Module):
-    """StyleGAN Discriminator.
-
-    SCH: NOTE:
-        this discriminator requries the num_frames to be fixed during training;
-        in case we pre-train with image then train on video, this disciminator's Linear layer would have to be re-trained!
-    """
-
-    def __init__(
-        self,
-        image_size=(128, 128),
-        num_frames=17,
-        in_channels=3,
-        filters=128,
-        channel_multipliers=(2, 4, 4, 4, 4),
-        num_groups=32,
-        dtype=torch.bfloat16,
-        device="cpu",
-    ):
-        super().__init__()
-
-        self.dtype = dtype
-        self.input_size = cast_tuple(image_size, 2)
-        self.filters = filters
-        self.activation_fn = nn.LeakyReLU(negative_slope=0.2)
-        self.channel_multipliers = channel_multipliers
-
-        self.conv1 = nn.Conv3d(
-            in_channels, self.filters, (3, 3, 3), padding=1, device=device, dtype=dtype
-        )  # NOTE: init to xavier_uniform
-
-        prev_filters = self.filters  # record in_channels
-        self.num_blocks = len(self.channel_multipliers)
-        self.res_block_list = nn.ModuleList([])
-        for i in range(self.num_blocks):
-            filters = self.filters * self.channel_multipliers[i]
-            self.res_block_list.append(
-                ResBlockDown(
-                    prev_filters,
-                    filters,
-                    self.activation_fn,
-                    device=device,
-                    dtype=dtype,
-                ).apply(xavier_uniform_weight_init)
-            )
-            prev_filters = filters  # update in_channels
-
-        self.conv2 = nn.Conv3d(
-            prev_filters, prev_filters, (3, 3, 3), padding=1, device=device, dtype=dtype
-        )  # NOTE: init to xavier_uniform
-        # torch.nn.init.xavier_uniform_(self.conv2.weight)
-
-        self.norm1 = nn.GroupNorm(num_groups, prev_filters, dtype=dtype, device=device)
-
-        scale_factor = 2**self.num_blocks
-        if (
-            num_frames % scale_factor != 0
-        ):  # SCH: NOTE: has first frame which would be padded before usage
-            time_scaled = num_frames // scale_factor + 1
-        else:
-            time_scaled = num_frames / scale_factor
-
-        assert (
-            self.input_size[0] % scale_factor == 0
-        ), f"image width {self.input_size[0]} is not divisible by scale factor {scale_factor}"
-        assert (
-            self.input_size[1] % scale_factor == 0
-        ), f"image height {self.input_size[1]} is not divisible by scale factor {scale_factor}"
-        w_scaled, h_scaled = (
-            self.input_size[0] / scale_factor,
-            self.input_size[1] / scale_factor,
-        )
-        in_features = int(prev_filters * time_scaled * w_scaled * h_scaled)  # (C*T*W*H)
-        self.linear1 = nn.Linear(
-            in_features, prev_filters, device=device, dtype=dtype
-        )  # NOTE: init to xavier_uniform
-        self.linear2 = nn.Linear(
-            prev_filters, 1, device=device, dtype=dtype
-        )  # NOTE: init to xavier_uniform
-
-        # self.apply(xavier_uniform_weight_init)
-
-    def forward(self, x):
-        x = self.conv1(x)
-        # print("discriminator aft conv:", x.size())
-        x = self.activation_fn(x)
-
-        for i in range(self.num_blocks):
-            x = self.res_block_list[i](x)
-            # print("discriminator resblock down:", x.size())
-
-        x = self.conv2(x)
-        # print("discriminator aft conv2:", x.size())
-        x = self.norm1(x)
-        x = self.activation_fn(x)
-        x = x.reshape((x.shape[0], -1))  # SCH: [B, (C * T * W * H)] ?
-
-        # print("discriminator reshape:", x.size())
-        x = self.linear1(x)
-        # print("discriminator aft linear1:", x.size())
-
-        x = self.activation_fn(x)
-        x = self.linear2(x)
-        # print("discriminator aft linear2:", x.size())
-        return x
-
-
-def load_checkpoint_with_inflation(model, ckpt_path):
-    """
-    pre-train using image, then inflate to 3D videos
-    """
-    if ckpt_path.endswith(".pt") or ckpt_path.endswith(".pth"):
-        state_dict = find_model(ckpt_path)
-        with torch.no_grad():
-            for key in state_dict:
-                if key in model:
-                    # central inflation
-                    if state_dict[key].size() == model[key][:, :, 0, :, :].size():
-                        # temporal dimension
-                        val = torch.zeros_like(model[key])
-                        centre = int(model[key].size(2) // 2)
-                        val[:, :, centre, :, :] = state_dict[key]
-        missing_keys, unexpected_keys = model.load_state_dict(state_dict, strict=False)
-        print(f"Missing keys: {missing_keys}")
-        print(f"Unexpected keys: {unexpected_keys}")
-    else:
-        load_checkpoint(model, ckpt_path)  # use the default function
-
-
-@MODELS.register_module("DISCRIMINATOR_3D")
-def DISCRIMINATOR_3D(
-    from_pretrained=None, inflate_from_2d=False, use_pretrained=True, **kwargs
-):
-    model = StyleGANDiscriminatorBlur(**kwargs).apply(xavier_uniform_weight_init)
-    if from_pretrained is not None:
-        if use_pretrained:
-            if inflate_from_2d:
-                load_checkpoint_with_inflation(model, from_pretrained)
-            else:
-                load_checkpoint(model, from_pretrained, model_name="discriminator")
-                print("loaded discriminator")
-        else:
-            print(
-                f"discriminator use_pretrained={use_pretrained}, initializing new discriminator"
-            )
-
-    return model
-
-
-@MODELS.register_module("N_Layer_DISCRIMINATOR_3D")
-def DISCRIMINATOR_3D_N_Layer(
-    from_pretrained=None, inflate_from_2d=False, use_pretrained=True, **kwargs
-):
-    model = NLayerDiscriminator3D(
-        input_nc=3,
-        n_layers=3,
-    ).apply(n_layer_disc_weights_init)
-    if from_pretrained is not None:
-        if use_pretrained:
-            if inflate_from_2d:
-                load_checkpoint_with_inflation(model, from_pretrained)
-            else:
-                load_checkpoint(model, from_pretrained, model_name="discriminator")
-                print("loaded discriminator")
-        else:
-            print(
-                f"discriminator use_pretrained={use_pretrained}, initializing new discriminator"
-            )
-
-    return model
diff --git a/videotuna/models/opensora/models/vae/losses.py b/videotuna/models/opensora/models/vae/losses.py
deleted file mode 100644
index 1b61c7b3..00000000
--- a/videotuna/models/opensora/models/vae/losses.py
+++ /dev/null
@@ -1,301 +0,0 @@
-import torch
-import torch.nn as nn
-import torch.nn.functional as F
-from einops import rearrange, repeat
-
-from .lpips import LPIPS
-
-
-def hinge_d_loss(logits_real, logits_fake):
-    loss_real = torch.mean(F.relu(1.0 - logits_real))
-    loss_fake = torch.mean(F.relu(1.0 + logits_fake))
-    d_loss = 0.5 * (loss_real + loss_fake)
-    return d_loss
-
-
-def vanilla_d_loss(logits_real, logits_fake):
-    d_loss = 0.5 * (
-        torch.mean(torch.nn.functional.softplus(-logits_real))
-        + torch.mean(torch.nn.functional.softplus(logits_fake))
-    )
-    return d_loss
-
-
-# from MAGVIT, used in place hof hinge_d_loss
-def sigmoid_cross_entropy_with_logits(labels, logits):
-    # The final formulation is: max(x, 0) - x * z + log(1 + exp(-abs(x)))
-    zeros = torch.zeros_like(logits, dtype=logits.dtype)
-    condition = logits >= zeros
-    relu_logits = torch.where(condition, logits, zeros)
-    neg_abs_logits = torch.where(condition, -logits, logits)
-    return relu_logits - logits * labels + torch.log1p(torch.exp(neg_abs_logits))
-
-
-def lecam_reg(real_pred, fake_pred, ema_real_pred, ema_fake_pred):
-    assert real_pred.ndim == 0 and ema_fake_pred.ndim == 0
-    lecam_loss = torch.mean(torch.pow(nn.ReLU()(real_pred - ema_fake_pred), 2))
-    lecam_loss += torch.mean(torch.pow(nn.ReLU()(ema_real_pred - fake_pred), 2))
-    return lecam_loss
-
-
-def gradient_penalty_fn(images, output):
-    gradients = torch.autograd.grad(
-        outputs=output,
-        inputs=images,
-        grad_outputs=torch.ones(output.size(), device=images.device),
-        create_graph=True,
-        retain_graph=True,
-        only_inputs=True,
-    )[0]
-
-    gradients = rearrange(gradients, "b ... -> b (...)")
-    return ((gradients.norm(2, dim=1) - 1) ** 2).mean()
-
-
-class VAELoss(nn.Module):
-    def __init__(
-        self,
-        logvar_init=0.0,
-        perceptual_loss_weight=0.1,
-        kl_loss_weight=0.000001,
-        device="cpu",
-        dtype="bf16",
-    ):
-        super().__init__()
-
-        if type(dtype) == str:
-            if dtype == "bf16":
-                dtype = torch.bfloat16
-            elif dtype == "fp16":
-                dtype = torch.float16
-            else:
-                raise NotImplementedError(f"dtype: {dtype}")
-
-        # KL Loss
-        self.kl_loss_weight = kl_loss_weight
-        # Perceptual Loss
-        self.perceptual_loss_fn = LPIPS().eval().to(device, dtype)
-        self.perceptual_loss_weight = perceptual_loss_weight
-        self.logvar = nn.Parameter(torch.ones(size=()) * logvar_init)
-
-    def forward(
-        self,
-        video,
-        recon_video,
-        posterior,
-        nll_weights=None,
-        no_perceptual=False,
-    ):
-        video = rearrange(video, "b c t h w -> (b t) c h w").contiguous()
-        recon_video = rearrange(recon_video, "b c t h w -> (b t) c h w").contiguous()
-
-        # reconstruction loss
-        recon_loss = torch.abs(video - recon_video)
-
-        # perceptual loss
-        if (
-            self.perceptual_loss_weight is not None
-            and self.perceptual_loss_weight > 0.0
-            and not no_perceptual
-        ):
-            # handle channels
-            channels = video.shape[1]
-            assert channels in {1, 3}
-            if channels == 1:
-                input_vgg_input = repeat(video, "b 1 h w -> b c h w", c=3)
-                recon_vgg_input = repeat(recon_video, "b 1 h w -> b c h w", c=3)
-            else:
-                input_vgg_input = video
-                recon_vgg_input = recon_video
-
-            perceptual_loss = self.perceptual_loss_fn(input_vgg_input, recon_vgg_input)
-            recon_loss = recon_loss + self.perceptual_loss_weight * perceptual_loss
-
-        nll_loss = recon_loss / torch.exp(self.logvar) + self.logvar
-
-        weighted_nll_loss = nll_loss
-        if nll_weights is not None:
-            weighted_nll_loss = nll_weights * nll_loss
-        weighted_nll_loss = torch.sum(weighted_nll_loss) / weighted_nll_loss.shape[0]
-        nll_loss = torch.sum(nll_loss) / nll_loss.shape[0]
-
-        # KL Loss
-        weighted_kl_loss = 0
-        if self.kl_loss_weight is not None and self.kl_loss_weight > 0.0:
-            kl_loss = posterior.kl()
-            kl_loss = torch.sum(kl_loss) / kl_loss.shape[0]
-            weighted_kl_loss = kl_loss * self.kl_loss_weight
-
-        return nll_loss, weighted_nll_loss, weighted_kl_loss
-
-
-def adopt_weight(weight, global_step, threshold=0, value=0.0):
-    if global_step < threshold:
-        weight = value
-    return weight
-
-
-class AdversarialLoss(nn.Module):
-    def __init__(
-        self,
-        discriminator_factor=1.0,
-        discriminator_start=50001,
-        generator_factor=0.5,
-        generator_loss_type="non-saturating",
-    ):
-        super().__init__()
-        self.discriminator_factor = discriminator_factor
-        self.discriminator_start = discriminator_start
-        self.generator_factor = generator_factor
-        self.generator_loss_type = generator_loss_type
-
-    def calculate_adaptive_weight(self, nll_loss, g_loss, last_layer):
-        nll_grads = torch.autograd.grad(nll_loss, last_layer, retain_graph=True)[0]
-        g_grads = torch.autograd.grad(g_loss, last_layer, retain_graph=True)[0]
-        d_weight = torch.norm(nll_grads) / (torch.norm(g_grads) + 1e-4)
-        d_weight = torch.clamp(d_weight, 0.0, 1e4).detach()
-        d_weight = d_weight * self.generator_factor
-        return d_weight
-
-    def forward(
-        self,
-        fake_logits,
-        nll_loss,
-        last_layer,
-        global_step,
-        is_training=True,
-    ):
-        # NOTE: following MAGVIT to allow non_saturating
-        assert self.generator_loss_type in ["hinge", "vanilla", "non-saturating"]
-
-        if self.generator_loss_type == "hinge":
-            gen_loss = -torch.mean(fake_logits)
-        elif self.generator_loss_type == "non-saturating":
-            gen_loss = torch.mean(
-                sigmoid_cross_entropy_with_logits(
-                    labels=torch.ones_like(fake_logits), logits=fake_logits
-                )
-            )
-        else:
-            raise ValueError(
-                "Generator loss {} not supported".format(self.generator_loss_type)
-            )
-
-        if self.discriminator_factor is not None and self.discriminator_factor > 0.0:
-            try:
-                d_weight = self.calculate_adaptive_weight(
-                    nll_loss, gen_loss, last_layer
-                )
-            except RuntimeError:
-                assert not is_training
-                d_weight = torch.tensor(0.0)
-        else:
-            d_weight = torch.tensor(0.0)
-
-        disc_factor = adopt_weight(
-            self.discriminator_factor, global_step, threshold=self.discriminator_start
-        )
-        weighted_gen_loss = d_weight * disc_factor * gen_loss
-
-        return weighted_gen_loss
-
-
-class LeCamEMA:
-    def __init__(
-        self,
-        ema_real=0.0,
-        ema_fake=0.0,
-        decay=0.999,
-        dtype=torch.bfloat16,
-        device="cpu",
-    ):
-        self.decay = decay
-        self.ema_real = torch.tensor(ema_real).to(device, dtype)
-        self.ema_fake = torch.tensor(ema_fake).to(device, dtype)
-
-    def update(self, ema_real, ema_fake):
-        self.ema_real = self.ema_real * self.decay + ema_real * (1 - self.decay)
-        self.ema_fake = self.ema_fake * self.decay + ema_fake * (1 - self.decay)
-
-    def get(self):
-        return self.ema_real, self.ema_fake
-
-
-class DiscriminatorLoss(nn.Module):
-    def __init__(
-        self,
-        discriminator_factor=1.0,
-        discriminator_start=50001,
-        discriminator_loss_type="non-saturating",
-        lecam_loss_weight=None,
-        gradient_penalty_loss_weight=None,  # SCH: following MAGVIT config.vqgan.grad_penalty_cost
-    ):
-        super().__init__()
-
-        assert discriminator_loss_type in ["hinge", "vanilla", "non-saturating"]
-        self.discriminator_factor = discriminator_factor
-        self.discriminator_start = discriminator_start
-        self.lecam_loss_weight = lecam_loss_weight
-        self.gradient_penalty_loss_weight = gradient_penalty_loss_weight
-        self.discriminator_loss_type = discriminator_loss_type
-
-    def forward(
-        self,
-        real_logits,
-        fake_logits,
-        global_step,
-        lecam_ema_real=None,
-        lecam_ema_fake=None,
-        real_video=None,
-        split="train",
-    ):
-        if self.discriminator_factor is not None and self.discriminator_factor > 0.0:
-            disc_factor = adopt_weight(
-                self.discriminator_factor,
-                global_step,
-                threshold=self.discriminator_start,
-            )
-
-            if self.discriminator_loss_type == "hinge":
-                disc_loss = hinge_d_loss(real_logits, fake_logits)
-            elif self.discriminator_loss_type == "non-saturating":
-                if real_logits is not None:
-                    real_loss = sigmoid_cross_entropy_with_logits(
-                        labels=torch.ones_like(real_logits), logits=real_logits
-                    )
-                else:
-                    real_loss = 0.0
-                if fake_logits is not None:
-                    fake_loss = sigmoid_cross_entropy_with_logits(
-                        labels=torch.zeros_like(fake_logits), logits=fake_logits
-                    )
-                else:
-                    fake_loss = 0.0
-                disc_loss = 0.5 * (torch.mean(real_loss) + torch.mean(fake_loss))
-            elif self.discriminator_loss_type == "vanilla":
-                disc_loss = vanilla_d_loss(real_logits, fake_logits)
-            else:
-                raise ValueError(f"Unknown GAN loss '{self.discriminator_loss_type}'.")
-
-            weighted_d_adversarial_loss = disc_factor * disc_loss
-
-        else:
-            weighted_d_adversarial_loss = 0
-
-        lecam_loss = torch.tensor(0.0)
-        if self.lecam_loss_weight is not None and self.lecam_loss_weight > 0.0:
-            real_pred = torch.mean(real_logits)
-            fake_pred = torch.mean(fake_logits)
-            lecam_loss = lecam_reg(real_pred, fake_pred, lecam_ema_real, lecam_ema_fake)
-            lecam_loss = lecam_loss * self.lecam_loss_weight
-
-        gradient_penalty = torch.tensor(0.0)
-        if (
-            self.gradient_penalty_loss_weight is not None
-            and self.gradient_penalty_loss_weight > 0.0
-        ):
-            assert real_video is not None
-            gradient_penalty = gradient_penalty_fn(real_video, real_logits)
-            gradient_penalty *= self.gradient_penalty_loss_weight
-
-        return (weighted_d_adversarial_loss, lecam_loss, gradient_penalty)
diff --git a/videotuna/models/opensora/models/vae/lpips.py b/videotuna/models/opensora/models/vae/lpips.py
deleted file mode 100644
index dd28c326..00000000
--- a/videotuna/models/opensora/models/vae/lpips.py
+++ /dev/null
@@ -1,182 +0,0 @@
-import hashlib
-import os
-from collections import namedtuple
-
-import requests
-import torch
-import torch.nn as nn
-from torchvision import models
-from tqdm import tqdm
-
-URL_MAP = {"vgg_lpips": "https://heibox.uni-heidelberg.de/f/607503859c864bc1b30b/?dl=1"}
-
-CKPT_MAP = {"vgg_lpips": "vgg.pth"}
-
-MD5_MAP = {"vgg_lpips": "d507d7349b931f0638a25a48a722f98a"}
-
-
-def md5_hash(path):
-    with open(path, "rb") as f:
-        content = f.read()
-    return hashlib.md5(content).hexdigest()
-
-
-def download(url, local_path, chunk_size=1024):
-    os.makedirs(os.path.split(local_path)[0], exist_ok=True)
-    with requests.get(url, stream=True) as r:
-        total_size = int(r.headers.get("content-length", 0))
-        with tqdm(total=total_size, unit="B", unit_scale=True) as pbar:
-            with open(local_path, "wb") as f:
-                for data in r.iter_content(chunk_size=chunk_size):
-                    if data:
-                        f.write(data)
-                        pbar.update(chunk_size)
-
-
-def get_ckpt_path(name, root, check=False):
-    assert name in URL_MAP
-    path = os.path.join(root, CKPT_MAP[name])
-    if not os.path.exists(path) or (check and not md5_hash(path) == MD5_MAP[name]):
-        print("Downloading {} model from {} to {}".format(name, URL_MAP[name], path))
-        download(URL_MAP[name], path)
-        md5 = md5_hash(path)
-        assert md5 == MD5_MAP[name], md5
-    return path
-
-
-class LPIPS(nn.Module):
-    # Learned perceptual metric
-    def __init__(self, use_dropout=True):
-        super().__init__()
-        self.scaling_layer = ScalingLayer()
-        self.chns = [64, 128, 256, 512, 512]  # vg16 features
-        self.net = vgg16(pretrained=True, requires_grad=False)
-        self.lin0 = NetLinLayer(self.chns[0], use_dropout=use_dropout)
-        self.lin1 = NetLinLayer(self.chns[1], use_dropout=use_dropout)
-        self.lin2 = NetLinLayer(self.chns[2], use_dropout=use_dropout)
-        self.lin3 = NetLinLayer(self.chns[3], use_dropout=use_dropout)
-        self.lin4 = NetLinLayer(self.chns[4], use_dropout=use_dropout)
-        self.load_from_pretrained()
-        for param in self.parameters():
-            param.requires_grad = False
-
-    def load_from_pretrained(self, name="vgg_lpips"):
-        ckpt = get_ckpt_path(name, "pretrained_models/taming/modules/autoencoder/lpips")
-        self.load_state_dict(
-            torch.load(ckpt, map_location=torch.device("cpu")), strict=False
-        )
-        # print("loaded pretrained LPIPS loss from {}".format(ckpt))
-
-    @classmethod
-    def from_pretrained(cls, name="vgg_lpips"):
-        if name != "vgg_lpips":
-            raise NotImplementedError
-        model = cls()
-        ckpt = get_ckpt_path(name)
-        model.load_state_dict(
-            torch.load(ckpt, map_location=torch.device("cpu")), strict=False
-        )
-        return model
-
-    def forward(self, input, target):
-        in0_input, in1_input = (self.scaling_layer(input), self.scaling_layer(target))
-        outs0, outs1 = self.net(in0_input), self.net(in1_input)
-        feats0, feats1, diffs = {}, {}, {}
-        lins = [self.lin0, self.lin1, self.lin2, self.lin3, self.lin4]
-        for kk in range(len(self.chns)):
-            feats0[kk], feats1[kk] = normalize_tensor(outs0[kk]), normalize_tensor(
-                outs1[kk]
-            )
-            diffs[kk] = (feats0[kk] - feats1[kk]) ** 2
-
-        res = [
-            spatial_average(lins[kk].model(diffs[kk]), keepdim=True)
-            for kk in range(len(self.chns))
-        ]
-        val = res[0]
-        for l in range(1, len(self.chns)):
-            val += res[l]
-        return val
-
-
-class ScalingLayer(nn.Module):
-    def __init__(self):
-        super(ScalingLayer, self).__init__()
-        self.register_buffer(
-            "shift", torch.Tensor([-0.030, -0.088, -0.188])[None, :, None, None]
-        )
-        self.register_buffer(
-            "scale", torch.Tensor([0.458, 0.448, 0.450])[None, :, None, None]
-        )
-
-    def forward(self, inp):
-        return (inp - self.shift) / self.scale
-
-
-class NetLinLayer(nn.Module):
-    """A single linear layer which does a 1x1 conv"""
-
-    def __init__(self, chn_in, chn_out=1, use_dropout=False):
-        super(NetLinLayer, self).__init__()
-        layers = (
-            [
-                nn.Dropout(),
-            ]
-            if (use_dropout)
-            else []
-        )
-        layers += [
-            nn.Conv2d(chn_in, chn_out, 1, stride=1, padding=0, bias=False),
-        ]
-        self.model = nn.Sequential(*layers)
-
-
-class vgg16(torch.nn.Module):
-    def __init__(self, requires_grad=False, pretrained=True):
-        super(vgg16, self).__init__()
-        vgg_pretrained_features = models.vgg16(pretrained=pretrained).features
-        self.slice1 = torch.nn.Sequential()
-        self.slice2 = torch.nn.Sequential()
-        self.slice3 = torch.nn.Sequential()
-        self.slice4 = torch.nn.Sequential()
-        self.slice5 = torch.nn.Sequential()
-        self.N_slices = 5
-        for x in range(4):
-            self.slice1.add_module(str(x), vgg_pretrained_features[x])
-        for x in range(4, 9):
-            self.slice2.add_module(str(x), vgg_pretrained_features[x])
-        for x in range(9, 16):
-            self.slice3.add_module(str(x), vgg_pretrained_features[x])
-        for x in range(16, 23):
-            self.slice4.add_module(str(x), vgg_pretrained_features[x])
-        for x in range(23, 30):
-            self.slice5.add_module(str(x), vgg_pretrained_features[x])
-        if not requires_grad:
-            for param in self.parameters():
-                param.requires_grad = False
-
-    def forward(self, X):
-        h = self.slice1(X)
-        h_relu1_2 = h
-        h = self.slice2(h)
-        h_relu2_2 = h
-        h = self.slice3(h)
-        h_relu3_3 = h
-        h = self.slice4(h)
-        h_relu4_3 = h
-        h = self.slice5(h)
-        h_relu5_3 = h
-        vgg_outputs = namedtuple(
-            "VggOutputs", ["relu1_2", "relu2_2", "relu3_3", "relu4_3", "relu5_3"]
-        )
-        out = vgg_outputs(h_relu1_2, h_relu2_2, h_relu3_3, h_relu4_3, h_relu5_3)
-        return out
-
-
-def normalize_tensor(x, eps=1e-10):
-    norm_factor = torch.sqrt(torch.sum(x**2, dim=1, keepdim=True))
-    return x / (norm_factor + eps)
-
-
-def spatial_average(x, keepdim=True):
-    return x.mean([2, 3], keepdim=keepdim)
diff --git a/videotuna/models/opensora/models/vae/opensoravae.py b/videotuna/models/opensora/models/vae/opensoravae.py
deleted file mode 100644
index ac6f2bbd..00000000
--- a/videotuna/models/opensora/models/vae/opensoravae.py
+++ /dev/null
@@ -1,64 +0,0 @@
-import os
-from einops import rearrange
-from typing import Optional, Union
-
-import torch
-import pytorch_lightning as pl
-
-from diffusers.models import AutoencoderKL
-
-
-class VideoAutoencoderKL(pl.LightningModule):
-    def __init__(
-        self,
-        from_pretrained: Optional[Union[str, os.PathLike]] = None,
-        micro_batch_size: int = None,
-    ):
-        super().__init__()
-        self.module = AutoencoderKL.from_pretrained(from_pretrained)
-        self.out_channels = self.module.config.latent_channels
-        self.patch_size = (1, 8, 8)
-        self.micro_batch_size = micro_batch_size
-
-    def encode(self, x):
-        # x: (B, C, T, H, W)
-        B = x.shape[0]
-        x = rearrange(x, "B C T H W -> (B T) C H W")
-
-        if self.micro_batch_size is None:
-            x = self.module.encode(x).latent_dist.sample()
-        else:
-            bs = self.micro_batch_size
-            x_out = []
-            for i in range(0, x.shape[0], bs):
-                x_bs = x[i : i + bs]
-                x_bs = self.module.encode(x_bs).latent_dist.sample()
-                x_out.append(x_bs)
-            x = torch.cat(x_out, dim=0)
-        x = rearrange(x, "(B T) C H W -> B C T H W", B=B)
-        return x
-
-    def decode(self, x):
-        # x: (B, C, T, H, W)
-        B = x.shape[0]
-        x = rearrange(x, "B C T H W -> (B T) C H W")
-        if self.micro_batch_size is None:
-            x = self.module.decode(x).sample
-        else:
-            bs = self.micro_batch_size
-            x_out = []
-            for i in range(0, x.shape[0], bs):
-                x_bs = x[i : i + bs]
-                x_bs = self.module.decode(x_bs).sample
-                x_out.append(x_bs)
-            x = torch.cat(x_out, dim=0)
-        x = rearrange(x, "(B T) C H W -> B C T H W", B=B)
-        return x
-
-    def get_latent_size(self, input_size):
-        for i in range(3):
-            assert (
-                input_size[i] % self.patch_size[i] == 0
-            ), "Input size must be divisible by patch size"
-        input_size = [input_size[i] // self.patch_size[i] for i in range(3)]
-        return input_size
diff --git a/videotuna/models/opensora/models/vae/utils.py b/videotuna/models/opensora/models/vae/utils.py
deleted file mode 100644
index f22fe0ec..00000000
--- a/videotuna/models/opensora/models/vae/utils.py
+++ /dev/null
@@ -1,60 +0,0 @@
-import numpy as np
-import torch
-
-"""Stripped version of https://github.com/richzhang/PerceptualSimilarity/tree/master/models"""
-
-
-class DiagonalGaussianDistribution(object):
-    def __init__(
-        self,
-        parameters,
-        deterministic=False,
-    ):
-        self.parameters = parameters
-        self.mean, self.logvar = torch.chunk(parameters, 2, dim=1)
-        self.logvar = torch.clamp(self.logvar, -30.0, 20.0)
-        self.deterministic = deterministic
-        self.std = torch.exp(0.5 * self.logvar)
-        self.var = torch.exp(self.logvar)
-        if self.deterministic:
-            self.var = self.std = torch.zeros_like(self.mean).to(
-                device=self.parameters.device, dtype=self.mean.dtype
-            )
-
-    def sample(self):
-        # torch.randn: standard normal distribution
-        x = self.mean + self.std * torch.randn(self.mean.shape).to(
-            device=self.parameters.device, dtype=self.mean.dtype
-        )
-        return x
-
-    def kl(self, other=None):
-        if self.deterministic:
-            return torch.Tensor([0.0])
-        else:
-            if other is None:  # SCH: assumes other is a standard normal distribution
-                return 0.5 * torch.sum(
-                    torch.pow(self.mean, 2) + self.var - 1.0 - self.logvar,
-                    dim=[1, 2, 3, 4],
-                )
-            else:
-                return 0.5 * torch.sum(
-                    torch.pow(self.mean - other.mean, 2) / other.var
-                    + self.var / other.var
-                    - 1.0
-                    - self.logvar
-                    + other.logvar,
-                    dim=[1, 2, 3, 4],
-                )
-
-    def nll(self, sample, dims=[1, 2, 3, 4]):
-        if self.deterministic:
-            return torch.Tensor([0.0])
-        logtwopi = np.log(2.0 * np.pi)
-        return 0.5 * torch.sum(
-            logtwopi + self.logvar + torch.pow(sample - self.mean, 2) / self.var,
-            dim=dims,
-        )
-
-    def mode(self):
-        return self.mean
diff --git a/videotuna/models/opensora/models/vae/vae.py b/videotuna/models/opensora/models/vae/vae.py
deleted file mode 100644
index 8cf7fbb5..00000000
--- a/videotuna/models/opensora/models/vae/vae.py
+++ /dev/null
@@ -1,313 +0,0 @@
-import os
-
-import torch
-import torch.nn as nn
-from diffusers.models import AutoencoderKL, AutoencoderKLTemporalDecoder
-from einops import rearrange
-from transformers import PretrainedConfig, PreTrainedModel
-
-from videotuna.models.opensora.registry import MODELS, build_module
-from videotuna.models.opensora.utils.ckpt_utils import load_checkpoint
-
-
-@MODELS.register_module()
-class VideoAutoencoderKL(nn.Module):
-    def __init__(
-        self,
-        from_pretrained=None,
-        micro_batch_size=None,
-        cache_dir=None,
-        local_files_only=False,
-        subfolder=None,
-    ):
-        super().__init__()
-        self.module = AutoencoderKL.from_pretrained(
-            from_pretrained,
-            cache_dir=cache_dir,
-            local_files_only=local_files_only,
-            subfolder=subfolder,
-        )
-        self.out_channels = self.module.config.latent_channels
-        self.patch_size = (1, 8, 8)
-        self.micro_batch_size = micro_batch_size
-
-    def encode(self, x):
-        # x: (B, C, T, H, W)
-        B = x.shape[0]
-        x = rearrange(x, "B C T H W -> (B T) C H W")
-
-        if self.micro_batch_size is None:
-            x = self.module.encode(x).latent_dist.sample().mul_(0.18215)
-        else:
-            # NOTE: cannot be used for training
-            bs = self.micro_batch_size
-            x_out = []
-            for i in range(0, x.shape[0], bs):
-                x_bs = x[i : i + bs]
-                x_bs = self.module.encode(x_bs).latent_dist.sample().mul_(0.18215)
-                x_out.append(x_bs)
-            x = torch.cat(x_out, dim=0)
-        x = rearrange(x, "(B T) C H W -> B C T H W", B=B)
-        return x
-
-    def decode(self, x, **kwargs):
-        # x: (B, C, T, H, W)
-        B = x.shape[0]
-        x = rearrange(x, "B C T H W -> (B T) C H W")
-        if self.micro_batch_size is None:
-            x = self.module.decode(x / 0.18215).sample
-        else:
-            # NOTE: cannot be used for training
-            bs = self.micro_batch_size
-            x_out = []
-            for i in range(0, x.shape[0], bs):
-                x_bs = x[i : i + bs]
-                x_bs = self.module.decode(x_bs / 0.18215).sample
-                x_out.append(x_bs)
-            x = torch.cat(x_out, dim=0)
-        x = rearrange(x, "(B T) C H W -> B C T H W", B=B)
-        return x
-
-    def get_latent_size(self, input_size):
-        latent_size = []
-        for i in range(3):
-            # assert (
-            #     input_size[i] is None or input_size[i] % self.patch_size[i] == 0
-            # ), "Input size must be divisible by patch size"
-            latent_size.append(
-                input_size[i] // self.patch_size[i]
-                if input_size[i] is not None
-                else None
-            )
-        return latent_size
-
-    @property
-    def device(self):
-        return next(self.parameters()).device
-
-    @property
-    def dtype(self):
-        return next(self.parameters()).dtype
-
-
-@MODELS.register_module()
-class VideoAutoencoderKLTemporalDecoder(nn.Module):
-    def __init__(self, from_pretrained=None, cache_dir=None, local_files_only=False):
-        super().__init__()
-        self.module = AutoencoderKLTemporalDecoder.from_pretrained(
-            from_pretrained, cache_dir=cache_dir, local_files_only=local_files_only
-        )
-        self.out_channels = self.module.config.latent_channels
-        self.patch_size = (1, 8, 8)
-
-    def encode(self, x):
-        raise NotImplementedError
-
-    def decode(self, x, **kwargs):
-        B, _, T = x.shape[:3]
-        x = rearrange(x, "B C T H W -> (B T) C H W")
-        x = self.module.decode(x / 0.18215, num_frames=T).sample
-        x = rearrange(x, "(B T) C H W -> B C T H W", B=B)
-        return x
-
-    def get_latent_size(self, input_size):
-        latent_size = []
-        for i in range(3):
-            # assert (
-            #     input_size[i] is None or input_size[i] % self.patch_size[i] == 0
-            # ), "Input size must be divisible by patch size"
-            latent_size.append(
-                input_size[i] // self.patch_size[i]
-                if input_size[i] is not None
-                else None
-            )
-        return latent_size
-
-    @property
-    def device(self):
-        return next(self.parameters()).device
-
-    @property
-    def dtype(self):
-        return next(self.parameters()).dtype
-
-
-class VideoAutoencoderPipelineConfig(PretrainedConfig):
-    model_type = "VideoAutoencoderPipeline"
-
-    def __init__(
-        self,
-        vae_2d=None,
-        vae_temporal=None,
-        from_pretrained=None,
-        freeze_vae_2d=False,
-        cal_loss=False,
-        micro_frame_size=None,
-        shift=0.0,
-        scale=1.0,
-        **kwargs,
-    ):
-        self.vae_2d = vae_2d
-        self.vae_temporal = vae_temporal
-        self.from_pretrained = from_pretrained
-        self.freeze_vae_2d = freeze_vae_2d
-        self.cal_loss = cal_loss
-        self.micro_frame_size = micro_frame_size
-        self.shift = shift
-        self.scale = scale
-        super().__init__(**kwargs)
-
-
-class VideoAutoencoderPipeline(PreTrainedModel):
-    config_class = VideoAutoencoderPipelineConfig
-
-    def __init__(self, config: VideoAutoencoderPipelineConfig):
-        super().__init__(config=config)
-        self.spatial_vae = build_module(config.vae_2d, MODELS)
-        self.temporal_vae = build_module(config.vae_temporal, MODELS)
-        self.cal_loss = config.cal_loss
-        self.micro_frame_size = config.micro_frame_size
-        self.micro_z_frame_size = self.temporal_vae.get_latent_size(
-            [config.micro_frame_size, None, None]
-        )[0]
-
-        if config.freeze_vae_2d:
-            for param in self.spatial_vae.parameters():
-                param.requires_grad = False
-
-        self.out_channels = self.temporal_vae.out_channels
-
-        # normalization parameters
-        scale = torch.tensor(config.scale)
-        shift = torch.tensor(config.shift)
-        if len(scale.shape) > 0:
-            scale = scale[None, :, None, None, None]
-        if len(shift.shape) > 0:
-            shift = shift[None, :, None, None, None]
-        self.register_buffer("scale", scale)
-        self.register_buffer("shift", shift)
-
-    def encode(self, x):
-        x_z = self.spatial_vae.encode(x)
-
-        if self.micro_frame_size is None:
-            posterior = self.temporal_vae.encode(x_z)
-            z = posterior.sample()
-        else:
-            z_list = []
-            for i in range(0, x_z.shape[2], self.micro_frame_size):
-                x_z_bs = x_z[:, :, i : i + self.micro_frame_size]
-                posterior = self.temporal_vae.encode(x_z_bs)
-                z_list.append(posterior.sample())
-            z = torch.cat(z_list, dim=2)
-
-        if self.cal_loss:
-            return z, posterior, x_z
-        else:
-            return (z - self.shift) / self.scale
-
-    def decode(self, z, num_frames=None):
-        if not self.cal_loss:
-            z = z * self.scale.to(z.dtype) + self.shift.to(z.dtype)
-
-        if self.micro_frame_size is None:
-            x_z = self.temporal_vae.decode(z, num_frames=num_frames)
-            x = self.spatial_vae.decode(x_z)
-        else:
-            x_z_list = []
-            for i in range(0, z.size(2), self.micro_z_frame_size):
-                z_bs = z[:, :, i : i + self.micro_z_frame_size]
-                x_z_bs = self.temporal_vae.decode(
-                    z_bs, num_frames=min(self.micro_frame_size, num_frames)
-                )
-                x_z_list.append(x_z_bs)
-                num_frames -= self.micro_frame_size
-            x_z = torch.cat(x_z_list, dim=2)
-            x = self.spatial_vae.decode(x_z)
-
-        if self.cal_loss:
-            return x, x_z
-        else:
-            return x
-
-    def forward(self, x):
-        assert self.cal_loss, "This method is only available when cal_loss is True"
-        z, posterior, x_z = self.encode(x)
-        x_rec, x_z_rec = self.decode(z, num_frames=x_z.shape[2])
-        return x_rec, x_z_rec, z, posterior, x_z
-
-    def get_latent_size(self, input_size):
-        if self.micro_frame_size is None or input_size[0] is None:
-            return self.temporal_vae.get_latent_size(
-                self.spatial_vae.get_latent_size(input_size)
-            )
-        else:
-            sub_input_size = [self.micro_frame_size, input_size[1], input_size[2]]
-            sub_latent_size = self.temporal_vae.get_latent_size(
-                self.spatial_vae.get_latent_size(sub_input_size)
-            )
-            sub_latent_size[0] = sub_latent_size[0] * (
-                input_size[0] // self.micro_frame_size
-            )
-            remain_temporal_size = [input_size[0] % self.micro_frame_size, None, None]
-            if remain_temporal_size[0] > 0:
-                remain_size = self.temporal_vae.get_latent_size(remain_temporal_size)
-                sub_latent_size[0] += remain_size[0]
-            return sub_latent_size
-
-    def get_temporal_last_layer(self):
-        return self.temporal_vae.decoder.conv_out.conv.weight
-
-    @property
-    def device(self):
-        return next(self.parameters()).device
-
-    @property
-    def dtype(self):
-        return next(self.parameters()).dtype
-
-
-@MODELS.register_module()
-def OpenSoraVAE_V1_2(
-    micro_batch_size=4,
-    micro_frame_size=17,
-    from_pretrained=None,
-    local_files_only=False,
-    freeze_vae_2d=False,
-    cal_loss=False,
-    force_huggingface=False,
-):
-    vae_2d = dict(
-        type="VideoAutoencoderKL",
-        from_pretrained="PixArt-alpha/pixart_sigma_sdxlvae_T5_diffusers",
-        subfolder="vae",
-        micro_batch_size=micro_batch_size,
-        local_files_only=local_files_only,
-    )
-    vae_temporal = dict(
-        type="VAE_Temporal_SD",
-        from_pretrained=None,
-    )
-    shift = (-0.10, 0.34, 0.27, 0.98)
-    scale = (3.85, 2.32, 2.33, 3.06)
-    kwargs = dict(
-        vae_2d=vae_2d,
-        vae_temporal=vae_temporal,
-        freeze_vae_2d=freeze_vae_2d,
-        cal_loss=cal_loss,
-        micro_frame_size=micro_frame_size,
-        shift=shift,
-        scale=scale,
-    )
-
-    if force_huggingface or (
-        from_pretrained is not None and not os.path.isdir(from_pretrained)
-    ):
-        model = VideoAutoencoderPipeline.from_pretrained(from_pretrained, **kwargs)
-    else:
-        config = VideoAutoencoderPipelineConfig(**kwargs)
-        model = VideoAutoencoderPipeline(config)
-
-        if from_pretrained:
-            load_checkpoint(model, from_pretrained)
-    return model
diff --git a/videotuna/models/opensora/models/vae/vae_temporal.py b/videotuna/models/opensora/models/vae/vae_temporal.py
deleted file mode 100644
index 8676a761..00000000
--- a/videotuna/models/opensora/models/vae/vae_temporal.py
+++ /dev/null
@@ -1,462 +0,0 @@
-from typing import Tuple, Union
-
-import torch.nn as nn
-import torch.nn.functional as F
-from einops import rearrange
-
-from videotuna.models.opensora.registry import MODELS
-from videotuna.models.opensora.utils.ckpt_utils import load_checkpoint
-
-from .utils import DiagonalGaussianDistribution
-
-
-def cast_tuple(t, length=1):
-    return t if isinstance(t, tuple) else ((t,) * length)
-
-
-def divisible_by(num, den):
-    return (num % den) == 0
-
-
-def is_odd(n):
-    return not divisible_by(n, 2)
-
-
-def pad_at_dim(t, pad, dim=-1):
-    dims_from_right = (-dim - 1) if dim < 0 else (t.ndim - dim - 1)
-    zeros = (0, 0) * dims_from_right
-    return F.pad(t, (*zeros, *pad), mode="constant")
-
-
-def exists(v):
-    return v is not None
-
-
-class CausalConv3d(nn.Module):
-    def __init__(
-        self,
-        chan_in,
-        chan_out,
-        kernel_size: Union[int, Tuple[int, int, int]],
-        pad_mode="constant",
-        strides=None,  # allow custom stride
-        **kwargs,
-    ):
-        super().__init__()
-        kernel_size = cast_tuple(kernel_size, 3)
-
-        time_kernel_size, height_kernel_size, width_kernel_size = kernel_size
-
-        assert is_odd(height_kernel_size) and is_odd(width_kernel_size)
-
-        dilation = kwargs.pop("dilation", 1)
-        stride = strides[0] if strides is not None else kwargs.pop("stride", 1)
-
-        self.pad_mode = pad_mode
-        time_pad = dilation * (time_kernel_size - 1) + (1 - stride)
-        height_pad = height_kernel_size // 2
-        width_pad = width_kernel_size // 2
-
-        self.time_pad = time_pad
-        self.time_causal_padding = (
-            width_pad,
-            width_pad,
-            height_pad,
-            height_pad,
-            time_pad,
-            0,
-        )
-
-        stride = strides if strides is not None else (stride, 1, 1)
-        dilation = (dilation, 1, 1)
-        self.conv = nn.Conv3d(
-            chan_in, chan_out, kernel_size, stride=stride, dilation=dilation, **kwargs
-        )
-
-    def forward(self, x):
-        x = F.pad(x, self.time_causal_padding, mode=self.pad_mode)
-        x = self.conv(x)
-        return x
-
-
-class ResBlock(nn.Module):
-    def __init__(
-        self,
-        in_channels,  # SCH: added
-        filters,
-        conv_fn,
-        activation_fn=nn.SiLU,
-        use_conv_shortcut=False,
-        num_groups=32,
-    ):
-        super().__init__()
-        self.in_channels = in_channels
-        self.filters = filters
-        self.activate = activation_fn()
-        self.use_conv_shortcut = use_conv_shortcut
-
-        # SCH: MAGVIT uses GroupNorm by default
-        self.norm1 = nn.GroupNorm(num_groups, in_channels)
-        self.conv1 = conv_fn(
-            in_channels, self.filters, kernel_size=(3, 3, 3), bias=False
-        )
-        self.norm2 = nn.GroupNorm(num_groups, self.filters)
-        self.conv2 = conv_fn(
-            self.filters, self.filters, kernel_size=(3, 3, 3), bias=False
-        )
-        if in_channels != filters:
-            if self.use_conv_shortcut:
-                self.conv3 = conv_fn(
-                    in_channels, self.filters, kernel_size=(3, 3, 3), bias=False
-                )
-            else:
-                self.conv3 = conv_fn(
-                    in_channels, self.filters, kernel_size=(1, 1, 1), bias=False
-                )
-
-    def forward(self, x):
-        residual = x
-        x = self.norm1(x)
-        x = self.activate(x)
-        x = self.conv1(x)
-        x = self.norm2(x)
-        x = self.activate(x)
-        x = self.conv2(x)
-        if self.in_channels != self.filters:  # SCH: ResBlock X->Y
-            residual = self.conv3(residual)
-        return x + residual
-
-
-def get_activation_fn(activation):
-    if activation == "relu":
-        activation_fn = nn.ReLU
-    elif activation == "swish":
-        activation_fn = nn.SiLU
-    else:
-        raise NotImplementedError
-    return activation_fn
-
-
-class Encoder(nn.Module):
-    """Encoder Blocks."""
-
-    def __init__(
-        self,
-        in_out_channels=4,
-        latent_embed_dim=512,  # num channels for latent vector
-        filters=128,
-        num_res_blocks=4,
-        channel_multipliers=(1, 2, 2, 4),
-        temporal_downsample=(False, True, True),
-        num_groups=32,  # for nn.GroupNorm
-        activation_fn="swish",
-    ):
-        super().__init__()
-        self.filters = filters
-        self.num_res_blocks = num_res_blocks
-        self.num_blocks = len(channel_multipliers)
-        self.channel_multipliers = channel_multipliers
-        self.temporal_downsample = temporal_downsample
-        self.num_groups = num_groups
-        self.embedding_dim = latent_embed_dim
-
-        self.activation_fn = get_activation_fn(activation_fn)
-        self.activate = self.activation_fn()
-        self.conv_fn = CausalConv3d
-        self.block_args = dict(
-            conv_fn=self.conv_fn,
-            activation_fn=self.activation_fn,
-            use_conv_shortcut=False,
-            num_groups=self.num_groups,
-        )
-
-        # first layer conv
-        self.conv_in = self.conv_fn(
-            in_out_channels,
-            filters,
-            kernel_size=(3, 3, 3),
-            bias=False,
-        )
-
-        # ResBlocks and conv downsample
-        self.block_res_blocks = nn.ModuleList([])
-        self.conv_blocks = nn.ModuleList([])
-
-        filters = self.filters
-        prev_filters = filters  # record for in_channels
-        for i in range(self.num_blocks):
-            filters = self.filters * self.channel_multipliers[i]
-            block_items = nn.ModuleList([])
-            for _ in range(self.num_res_blocks):
-                block_items.append(ResBlock(prev_filters, filters, **self.block_args))
-                prev_filters = filters  # update in_channels
-            self.block_res_blocks.append(block_items)
-
-            if i < self.num_blocks - 1:
-                if self.temporal_downsample[i]:
-                    t_stride = 2 if self.temporal_downsample[i] else 1
-                    s_stride = 1
-                    self.conv_blocks.append(
-                        self.conv_fn(
-                            prev_filters,
-                            filters,
-                            kernel_size=(3, 3, 3),
-                            strides=(t_stride, s_stride, s_stride),
-                        )
-                    )
-                    prev_filters = filters  # update in_channels
-                else:
-                    # if no t downsample, don't add since this does nothing for pipeline models
-                    self.conv_blocks.append(nn.Identity(prev_filters))  # Identity
-                    prev_filters = filters  # update in_channels
-
-        # last layer res block
-        self.res_blocks = nn.ModuleList([])
-        for _ in range(self.num_res_blocks):
-            self.res_blocks.append(ResBlock(prev_filters, filters, **self.block_args))
-            prev_filters = filters  # update in_channels
-
-        # MAGVIT uses Group Normalization
-        self.norm1 = nn.GroupNorm(self.num_groups, prev_filters)
-
-        self.conv2 = self.conv_fn(
-            prev_filters, self.embedding_dim, kernel_size=(1, 1, 1), padding="same"
-        )
-
-    def forward(self, x):
-        x = self.conv_in(x)
-
-        for i in range(self.num_blocks):
-            for j in range(self.num_res_blocks):
-                x = self.block_res_blocks[i][j](x)
-            if i < self.num_blocks - 1:
-                x = self.conv_blocks[i](x)
-        for i in range(self.num_res_blocks):
-            x = self.res_blocks[i](x)
-
-        x = self.norm1(x)
-        x = self.activate(x)
-        x = self.conv2(x)
-        return x
-
-
-class Decoder(nn.Module):
-    """Decoder Blocks."""
-
-    def __init__(
-        self,
-        in_out_channels=4,
-        latent_embed_dim=512,
-        filters=128,
-        num_res_blocks=4,
-        channel_multipliers=(1, 2, 2, 4),
-        temporal_downsample=(False, True, True),
-        num_groups=32,  # for nn.GroupNorm
-        activation_fn="swish",
-    ):
-        super().__init__()
-        self.filters = filters
-        self.num_res_blocks = num_res_blocks
-        self.num_blocks = len(channel_multipliers)
-        self.channel_multipliers = channel_multipliers
-        self.temporal_downsample = temporal_downsample
-        self.num_groups = num_groups
-        self.embedding_dim = latent_embed_dim
-        self.s_stride = 1
-
-        self.activation_fn = get_activation_fn(activation_fn)
-        self.activate = self.activation_fn()
-        self.conv_fn = CausalConv3d
-        self.block_args = dict(
-            conv_fn=self.conv_fn,
-            activation_fn=self.activation_fn,
-            use_conv_shortcut=False,
-            num_groups=self.num_groups,
-        )
-
-        filters = self.filters * self.channel_multipliers[-1]
-        prev_filters = filters
-
-        # last conv
-        self.conv1 = self.conv_fn(
-            self.embedding_dim, filters, kernel_size=(3, 3, 3), bias=True
-        )
-
-        # last layer res block
-        self.res_blocks = nn.ModuleList([])
-        for _ in range(self.num_res_blocks):
-            self.res_blocks.append(ResBlock(filters, filters, **self.block_args))
-
-        # ResBlocks and conv upsample
-        self.block_res_blocks = nn.ModuleList([])
-        self.num_blocks = len(self.channel_multipliers)
-        self.conv_blocks = nn.ModuleList([])
-        # reverse to keep track of the in_channels, but append also in a reverse direction
-        for i in reversed(range(self.num_blocks)):
-            filters = self.filters * self.channel_multipliers[i]
-            # resblock handling
-            block_items = nn.ModuleList([])
-            for _ in range(self.num_res_blocks):
-                block_items.append(ResBlock(prev_filters, filters, **self.block_args))
-                prev_filters = filters  # SCH: update in_channels
-            self.block_res_blocks.insert(0, block_items)  # SCH: append in front
-
-            # conv blocks with upsampling
-            if i > 0:
-                if self.temporal_downsample[i - 1]:
-                    t_stride = 2 if self.temporal_downsample[i - 1] else 1
-                    # SCH: T-Causal Conv 3x3x3, f -> (t_stride * 2 * 2) * f, depth to space t_stride x 2 x 2
-                    self.conv_blocks.insert(
-                        0,
-                        self.conv_fn(
-                            prev_filters,
-                            prev_filters * t_stride * self.s_stride * self.s_stride,
-                            kernel_size=(3, 3, 3),
-                        ),
-                    )
-                else:
-                    self.conv_blocks.insert(
-                        0,
-                        nn.Identity(prev_filters),
-                    )
-
-        self.norm1 = nn.GroupNorm(self.num_groups, prev_filters)
-
-        self.conv_out = self.conv_fn(filters, in_out_channels, 3)
-
-    def forward(self, x):
-        x = self.conv1(x)
-        for i in range(self.num_res_blocks):
-            x = self.res_blocks[i](x)
-        for i in reversed(range(self.num_blocks)):
-            for j in range(self.num_res_blocks):
-                x = self.block_res_blocks[i][j](x)
-            if i > 0:
-                t_stride = 2 if self.temporal_downsample[i - 1] else 1
-                x = self.conv_blocks[i - 1](x)
-                x = rearrange(
-                    x,
-                    "B (C ts hs ws) T H W -> B C (T ts) (H hs) (W ws)",
-                    ts=t_stride,
-                    hs=self.s_stride,
-                    ws=self.s_stride,
-                )
-
-        x = self.norm1(x)
-        x = self.activate(x)
-        x = self.conv_out(x)
-        return x
-
-
-@MODELS.register_module()
-class VAE_Temporal(nn.Module):
-    def __init__(
-        self,
-        in_out_channels=4,
-        latent_embed_dim=4,
-        embed_dim=4,
-        filters=128,
-        num_res_blocks=4,
-        channel_multipliers=(1, 2, 2, 4),
-        temporal_downsample=(True, True, False),
-        num_groups=32,  # for nn.GroupNorm
-        activation_fn="swish",
-    ):
-        super().__init__()
-
-        self.time_downsample_factor = 2 ** sum(temporal_downsample)
-        # self.time_padding = self.time_downsample_factor - 1
-        self.patch_size = (self.time_downsample_factor, 1, 1)
-        self.out_channels = in_out_channels
-
-        # NOTE: following MAGVIT, conv in bias=False in encoder first conv
-        self.encoder = Encoder(
-            in_out_channels=in_out_channels,
-            latent_embed_dim=latent_embed_dim * 2,
-            filters=filters,
-            num_res_blocks=num_res_blocks,
-            channel_multipliers=channel_multipliers,
-            temporal_downsample=temporal_downsample,
-            num_groups=num_groups,  # for nn.GroupNorm
-            activation_fn=activation_fn,
-        )
-        self.quant_conv = CausalConv3d(2 * latent_embed_dim, 2 * embed_dim, 1)
-
-        self.post_quant_conv = CausalConv3d(embed_dim, latent_embed_dim, 1)
-        self.decoder = Decoder(
-            in_out_channels=in_out_channels,
-            latent_embed_dim=latent_embed_dim,
-            filters=filters,
-            num_res_blocks=num_res_blocks,
-            channel_multipliers=channel_multipliers,
-            temporal_downsample=temporal_downsample,
-            num_groups=num_groups,  # for nn.GroupNorm
-            activation_fn=activation_fn,
-        )
-
-    def get_latent_size(self, input_size):
-        latent_size = []
-        for i in range(3):
-            if input_size[i] is None:
-                lsize = None
-            elif i == 0:
-                time_padding = (
-                    0
-                    if (input_size[i] % self.time_downsample_factor == 0)
-                    else self.time_downsample_factor
-                    - input_size[i] % self.time_downsample_factor
-                )
-                lsize = (input_size[i] + time_padding) // self.patch_size[i]
-            else:
-                lsize = input_size[i] // self.patch_size[i]
-            latent_size.append(lsize)
-        return latent_size
-
-    def encode(self, x):
-        time_padding = (
-            0
-            if (x.shape[2] % self.time_downsample_factor == 0)
-            else self.time_downsample_factor - x.shape[2] % self.time_downsample_factor
-        )
-        x = pad_at_dim(x, (time_padding, 0), dim=2)
-        encoded_feature = self.encoder(x)
-        moments = self.quant_conv(encoded_feature).to(x.dtype)
-        posterior = DiagonalGaussianDistribution(moments)
-        return posterior
-
-    def decode(self, z, num_frames=None):
-        time_padding = (
-            0
-            if (num_frames % self.time_downsample_factor == 0)
-            else self.time_downsample_factor - num_frames % self.time_downsample_factor
-        )
-        z = self.post_quant_conv(z)
-        x = self.decoder(z)
-        x = x[:, :, time_padding:]
-        return x
-
-    def forward(self, x, sample_posterior=True):
-        posterior = self.encode(x)
-        if sample_posterior:
-            z = posterior.sample()
-        else:
-            z = posterior.mode()
-        recon_video = self.decode(z, num_frames=x.shape[2])
-        return recon_video, posterior, z
-
-
-@MODELS.register_module("VAE_Temporal_SD")
-def VAE_Temporal_SD(from_pretrained=None, **kwargs):
-    model = VAE_Temporal(
-        in_out_channels=4,
-        latent_embed_dim=4,
-        embed_dim=4,
-        filters=128,
-        num_res_blocks=4,
-        channel_multipliers=(1, 2, 2, 4),
-        temporal_downsample=(False, True, True),
-        **kwargs,
-    )
-    if from_pretrained is not None:
-        load_checkpoint(model, from_pretrained)
-    return model
diff --git a/videotuna/models/opensora/registry.py b/videotuna/models/opensora/registry.py
deleted file mode 100644
index 384d6d77..00000000
--- a/videotuna/models/opensora/registry.py
+++ /dev/null
@@ -1,44 +0,0 @@
-from copy import deepcopy
-
-import torch.nn as nn
-from mmengine.registry import Registry
-
-
-def build_module(module, builder, **kwargs):
-    """Build module from config or return the module itself.
-
-    Args:
-        module (Union[dict, nn.Module]): The module to build.
-        builder (Registry): The registry to build module.
-        *args, **kwargs: Arguments passed to build function.
-
-    Returns:
-        Any: The built module.
-    """
-    if isinstance(module, dict):
-        cfg = deepcopy(module)
-        for k, v in kwargs.items():
-            cfg[k] = v
-        return builder.build(cfg)
-    elif isinstance(module, nn.Module):
-        return module
-    elif module is None:
-        return None
-    else:
-        raise TypeError(f"Only support dict and nn.Module, but got {type(module)}.")
-
-
-MODELS = Registry(
-    "model",
-    locations=["videotuna.opensora.models"],
-)
-
-SCHEDULERS = Registry(
-    "scheduler",
-    locations=["videotuna.opensora.schedulers"],
-)
-
-DATASETS = Registry(
-    "dataset",
-    locations=["videotuna.opensora.datasets"],
-)
diff --git a/videotuna/models/opensora/utils/__init__.py b/videotuna/models/opensora/utils/__init__.py
deleted file mode 100644
index e69de29b..00000000
diff --git a/videotuna/models/opensora/utils/ckpt_utils.py b/videotuna/models/opensora/utils/ckpt_utils.py
deleted file mode 100644
index 5a9ed8d9..00000000
--- a/videotuna/models/opensora/utils/ckpt_utils.py
+++ /dev/null
@@ -1,410 +0,0 @@
-import functools
-import json
-import logging
-import operator
-import os
-from typing import Tuple
-
-import colossalai
-import torch
-import torch.distributed as dist
-import torch.nn as nn
-from colossalai.booster import Booster
-from colossalai.checkpoint_io import GeneralCheckpointIO
-from colossalai.cluster import DistCoordinator
-from torch.optim import Optimizer
-from torch.optim.lr_scheduler import _LRScheduler
-from torchvision.datasets.utils import download_url
-
-from .misc import get_logger
-
-hf_endpoint = os.environ.get("HF_ENDPOINT")
-if hf_endpoint is None:
-    hf_endpoint = "https://huggingface.co"
-
-pretrained_models = {
-    "DiT-XL-2-512x512.pt": "https://dl.fbaipublicfiles.com/DiT/models/DiT-XL-2-512x512.pt",
-    "DiT-XL-2-256x256.pt": "https://dl.fbaipublicfiles.com/DiT/models/DiT-XL-2-256x256.pt",
-    "Latte-XL-2-256x256-ucf101.pt": hf_endpoint
-    + "/maxin-cn/Latte/resolve/main/ucf101.pt",
-    "PixArt-XL-2-256x256.pth": hf_endpoint
-    + "/PixArt-alpha/PixArt-alpha/resolve/main/PixArt-XL-2-256x256.pth",
-    "PixArt-XL-2-SAM-256x256.pth": hf_endpoint
-    + "/PixArt-alpha/PixArt-alpha/resolve/main/PixArt-XL-2-SAM-256x256.pth",
-    "PixArt-XL-2-512x512.pth": hf_endpoint
-    + "/PixArt-alpha/PixArt-alpha/resolve/main/PixArt-XL-2-512x512.pth",
-    "PixArt-XL-2-1024-MS.pth": hf_endpoint
-    + "/PixArt-alpha/PixArt-alpha/resolve/main/PixArt-XL-2-1024-MS.pth",
-    "OpenSora-v1-16x256x256.pth": hf_endpoint
-    + "/hpcai-tech/Open-Sora/resolve/main/OpenSora-v1-16x256x256.pth",
-    "OpenSora-v1-HQ-16x256x256.pth": hf_endpoint
-    + "/hpcai-tech/Open-Sora/resolve/main/OpenSora-v1-HQ-16x256x256.pth",
-    "OpenSora-v1-HQ-16x512x512.pth": hf_endpoint
-    + "/hpcai-tech/Open-Sora/resolve/main/OpenSora-v1-HQ-16x512x512.pth",
-    "PixArt-Sigma-XL-2-256x256.pth": hf_endpoint
-    + "/PixArt-alpha/PixArt-Sigma/resolve/main/PixArt-Sigma-XL-2-256x256.pth",
-    "PixArt-Sigma-XL-2-512-MS.pth": hf_endpoint
-    + "/PixArt-alpha/PixArt-Sigma/resolve/main/PixArt-Sigma-XL-2-512-MS.pth",
-    "PixArt-Sigma-XL-2-1024-MS.pth": hf_endpoint
-    + "/PixArt-alpha/PixArt-Sigma/resolve/main/PixArt-Sigma-XL-2-1024-MS.pth",
-    "PixArt-Sigma-XL-2-2K-MS.pth": hf_endpoint
-    + "/PixArt-alpha/PixArt-Sigma/resolve/main/PixArt-Sigma-XL-2-2K-MS.pth",
-}
-
-
-def reparameter(ckpt, name=None, model=None):
-    model_name = name
-    name = os.path.basename(name)
-    if not dist.is_initialized() or dist.get_rank() == 0:
-        get_logger().info("loading pretrained model: %s", model_name)
-    if name in ["DiT-XL-2-512x512.pt", "DiT-XL-2-256x256.pt"]:
-        ckpt["x_embedder.proj.weight"] = ckpt["x_embedder.proj.weight"].unsqueeze(2)
-        del ckpt["pos_embed"]
-    if name in ["Latte-XL-2-256x256-ucf101.pt"]:
-        ckpt = ckpt["ema"]
-        ckpt["x_embedder.proj.weight"] = ckpt["x_embedder.proj.weight"].unsqueeze(2)
-        del ckpt["pos_embed"]
-        del ckpt["temp_embed"]
-    if name in [
-        "PixArt-XL-2-256x256.pth",
-        "PixArt-XL-2-SAM-256x256.pth",
-        "PixArt-XL-2-512x512.pth",
-        "PixArt-XL-2-1024-MS.pth",
-        "PixArt-Sigma-XL-2-256x256.pth",
-        "PixArt-Sigma-XL-2-512-MS.pth",
-        "PixArt-Sigma-XL-2-1024-MS.pth",
-        "PixArt-Sigma-XL-2-2K-MS.pth",
-    ]:
-        ckpt = ckpt["state_dict"]
-        ckpt["x_embedder.proj.weight"] = ckpt["x_embedder.proj.weight"].unsqueeze(2)
-        if "pos_embed" in ckpt:
-            del ckpt["pos_embed"]
-
-    if name in [
-        "PixArt-1B-2.pth",
-    ]:
-        ckpt = ckpt["state_dict"]
-        if "pos_embed" in ckpt:
-            del ckpt["pos_embed"]
-
-    # no need pos_embed
-    if "pos_embed_temporal" in ckpt:
-        del ckpt["pos_embed_temporal"]
-    if "pos_embed" in ckpt:
-        del ckpt["pos_embed"]
-    # different text length
-    if "y_embedder.y_embedding" in ckpt:
-        if (
-            ckpt["y_embedder.y_embedding"].shape[0]
-            < model.y_embedder.y_embedding.shape[0]
-        ):
-            get_logger().info(
-                "Extend y_embedding from %s to %s",
-                ckpt["y_embedder.y_embedding"].shape[0],
-                model.y_embedder.y_embedding.shape[0],
-            )
-            additional_length = (
-                model.y_embedder.y_embedding.shape[0]
-                - ckpt["y_embedder.y_embedding"].shape[0]
-            )
-            new_y_embedding = torch.zeros(
-                additional_length, model.y_embedder.y_embedding.shape[1]
-            )
-            new_y_embedding[:] = ckpt["y_embedder.y_embedding"][-1]
-            ckpt["y_embedder.y_embedding"] = torch.cat(
-                [ckpt["y_embedder.y_embedding"], new_y_embedding], dim=0
-            )
-        elif (
-            ckpt["y_embedder.y_embedding"].shape[0]
-            > model.y_embedder.y_embedding.shape[0]
-        ):
-            get_logger().info(
-                "Shrink y_embedding from %s to %s",
-                ckpt["y_embedder.y_embedding"].shape[0],
-                model.y_embedder.y_embedding.shape[0],
-            )
-            ckpt["y_embedder.y_embedding"] = ckpt["y_embedder.y_embedding"][
-                : model.y_embedder.y_embedding.shape[0]
-            ]
-    # stdit3 special case
-    if type(model).__name__ == "STDiT3" and "PixArt-Sigma" in name:
-        ckpt_keys = list(ckpt.keys())
-        for key in ckpt_keys:
-            if "blocks." in key:
-                ckpt[key.replace("blocks.", "spatial_blocks.")] = ckpt[key]
-                del ckpt[key]
-
-    return ckpt
-
-
-def find_model(model_name, model=None):
-    """
-    Finds a pre-trained DiT model, downloading it if necessary. Alternatively, loads a model from a local path.
-    """
-    if model_name in pretrained_models:  # Find/download our pre-trained DiT checkpoints
-        model_ckpt = download_model(model_name)
-        model_ckpt = reparameter(model_ckpt, model_name, model=model)
-    elif os.path.isfile(model_name) and model_name.split("/")[-1] in pretrained_models:
-        model_ckpt = torch.load(model_name, map_location=lambda storage, loc: storage)
-        model_ckpt = reparameter(model_ckpt, model_name.split("/")[-1], model=model)
-    else:  # Load a custom DiT checkpoint:
-        assert os.path.isfile(
-            model_name
-        ), f"Could not find DiT checkpoint at {model_name}"
-        model_ckpt = torch.load(model_name, map_location=lambda storage, loc: storage)
-        model_ckpt = reparameter(model_ckpt, model_name, model=model)
-    return model_ckpt
-
-
-def download_model(model_name=None, local_path=None, url=None):
-    """
-    Downloads a pre-trained DiT model from the web.
-    """
-    if model_name is not None:
-        assert model_name in pretrained_models
-        local_path = f"pretrained_models/{model_name}"
-        web_path = pretrained_models[model_name]
-    else:
-        assert local_path is not None
-        assert url is not None
-        web_path = url
-    if not os.path.isfile(local_path):
-        os.makedirs("pretrained_models", exist_ok=True)
-        dir_name = os.path.dirname(local_path)
-        file_name = os.path.basename(local_path)
-        download_url(web_path, dir_name, file_name)
-    model = torch.load(local_path, map_location=lambda storage, loc: storage)
-    return model
-
-
-def load_from_sharded_state_dict(model, ckpt_path, model_name="model", strict=False):
-    ckpt_io = GeneralCheckpointIO()
-    ckpt_io.load_model(model, os.path.join(ckpt_path, model_name), strict=strict)
-
-
-def model_sharding(model: torch.nn.Module):
-    global_rank = dist.get_rank()
-    world_size = dist.get_world_size()
-    for _, param in model.named_parameters():
-        padding_size = (world_size - param.numel() % world_size) % world_size
-        if padding_size > 0:
-            padding_param = torch.nn.functional.pad(
-                param.data.view(-1), [0, padding_size]
-            )
-        else:
-            padding_param = param.data.view(-1)
-        splited_params = padding_param.split(padding_param.numel() // world_size)
-        splited_params = splited_params[global_rank]
-        param.data = splited_params
-
-
-def load_json(file_path: str):
-    with open(file_path, "r") as f:
-        return json.load(f)
-
-
-def save_json(data, file_path: str):
-    with open(file_path, "w") as f:
-        json.dump(data, f, indent=4)
-
-
-def remove_padding(tensor: torch.Tensor, original_shape: Tuple) -> torch.Tensor:
-    return tensor[: functools.reduce(operator.mul, original_shape)]
-
-
-def model_gathering(model: torch.nn.Module, model_shape_dict: dict):
-    global_rank = dist.get_rank()
-    global_size = dist.get_world_size()
-    for name, param in model.named_parameters():
-        all_params = [torch.empty_like(param.data) for _ in range(global_size)]
-        dist.all_gather(all_params, param.data, group=dist.group.WORLD)
-        if int(global_rank) == 0:
-            all_params = torch.cat(all_params)
-            param.data = remove_padding(all_params, model_shape_dict[name]).view(
-                model_shape_dict[name]
-            )
-    dist.barrier()
-
-
-def remove_padding(tensor: torch.Tensor, original_shape: Tuple) -> torch.Tensor:
-    return tensor[: functools.reduce(operator.mul, original_shape)]
-
-
-def record_model_param_shape(model: torch.nn.Module) -> dict:
-    param_shape = {}
-    for name, param in model.named_parameters():
-        param_shape[name] = param.shape
-    return param_shape
-
-
-def load_checkpoint(
-    model, ckpt_path, save_as_pt=False, model_name="model", strict=False
-):
-    if ckpt_path.endswith(".pt") or ckpt_path.endswith(".pth"):
-        state_dict = find_model(ckpt_path, model=model)
-        missing_keys, unexpected_keys = model.load_state_dict(state_dict, strict=strict)
-        get_logger().info("Missing keys: %s", missing_keys)
-        get_logger().info("Unexpected keys: %s", unexpected_keys)
-    elif os.path.isdir(ckpt_path):
-        load_from_sharded_state_dict(model, ckpt_path, model_name, strict=strict)
-        get_logger().info("Model checkpoint loaded from %s", ckpt_path)
-        if save_as_pt:
-            save_path = os.path.join(ckpt_path, model_name + "_ckpt.pt")
-            torch.save(model.state_dict(), save_path)
-            get_logger().info("Model checkpoint saved to %s", save_path)
-    else:
-        raise ValueError(f"Invalid checkpoint path: {ckpt_path}")
-
-
-def save_frequently(
-    booster: Booster,
-    model: nn.Module,
-    ema: nn.Module,
-    optimizer: Optimizer,
-    lr_scheduler: _LRScheduler,
-    epoch: int,
-    step: int,
-    global_step: int,
-    batch_size: int,
-    coordinator: DistCoordinator,
-    save_dir: str,
-    shape_dict: dict,
-):
-    save_dir = os.path.join(save_dir, f"last")
-    os.makedirs(os.path.join(save_dir, "model"), exist_ok=True)
-
-    booster.save_model(model, os.path.join(save_dir, "model"), shard=True)
-    # ema is not boosted, so we don't need to use booster.save_model
-    model_gathering(ema, shape_dict)
-    global_rank = dist.get_rank()
-    if int(global_rank) == 0:
-        torch.save(ema.state_dict(), os.path.join(save_dir, "ema.pt"))
-        model_sharding(ema)
-
-    booster.save_optimizer(
-        optimizer, os.path.join(save_dir, "optimizer"), shard=True, size_per_shard=4096
-    )
-    if lr_scheduler is not None:
-        booster.save_lr_scheduler(lr_scheduler, os.path.join(save_dir, "lr_scheduler"))
-    running_states = {
-        "epoch": epoch,
-        "step": step,
-        "global_step": global_step,
-        "sample_start_index": step * batch_size,
-    }
-    if coordinator.is_master():
-        save_json(running_states, os.path.join(save_dir, "running_states.json"))
-    dist.barrier()
-
-
-def save(
-    booster: Booster,
-    save_dir: str,
-    model: nn.Module = None,
-    ema: nn.Module = None,
-    optimizer: Optimizer = None,
-    lr_scheduler: _LRScheduler = None,
-    sampler=None,
-    epoch: int = None,
-    step: int = None,
-    global_step: int = None,
-    batch_size: int = None,
-):
-    save_dir = os.path.join(save_dir, f"epoch{epoch}-global_step{global_step}")
-    os.makedirs(os.path.join(save_dir, "model"), exist_ok=True)
-
-    if model is not None:
-        booster.save_model(model, os.path.join(save_dir, "model"), shard=True)
-    if optimizer is not None:
-        booster.save_optimizer(
-            optimizer,
-            os.path.join(save_dir, "optimizer"),
-            shard=True,
-            size_per_shard=4096,
-        )
-    if lr_scheduler is not None:
-        booster.save_lr_scheduler(lr_scheduler, os.path.join(save_dir, "lr_scheduler"))
-    if dist.get_rank() == 0:
-        running_states = {
-            "epoch": epoch,
-            "step": step,
-            "global_step": global_step,
-            "batch_size": batch_size,
-            "sample_start_index": step * batch_size,
-        }
-        save_json(running_states, os.path.join(save_dir, "running_states.json"))
-
-        if ema is not None:
-            torch.save(ema.state_dict(), os.path.join(save_dir, "ema.pt"))
-
-        if sampler is not None:
-            # only for VariableVideoBatchSampler
-            torch.save(sampler.state_dict(step), os.path.join(save_dir, "sampler"))
-    dist.barrier()
-    return save_dir
-
-
-def load(
-    booster: Booster,
-    load_dir: str,
-    model: nn.Module = None,
-    ema: nn.Module = None,
-    optimizer: Optimizer = None,
-    lr_scheduler: _LRScheduler = None,
-    sampler=None,
-) -> Tuple[int, int, int]:
-    assert os.path.exists(load_dir), f"Checkpoint directory {load_dir} does not exist"
-    assert os.path.exists(
-        os.path.join(load_dir, "running_states.json")
-    ), "running_states.json does not exist"
-    running_states = load_json(os.path.join(load_dir, "running_states.json"))
-    if model is not None:
-        booster.load_model(model, os.path.join(load_dir, "model"))
-    if ema is not None:
-        # ema is not boosted, so we don't use booster.load_model
-        ema.load_state_dict(
-            torch.load(
-                os.path.join(load_dir, "ema.pt"), map_location=torch.device("cpu")
-            ),
-            strict=False,
-        )
-    if optimizer is not None:
-        booster.load_optimizer(optimizer, os.path.join(load_dir, "optimizer"))
-    if lr_scheduler is not None:
-        booster.load_lr_scheduler(lr_scheduler, os.path.join(load_dir, "lr_scheduler"))
-    if sampler is not None:
-        sampler.load_state_dict(torch.load(os.path.join(load_dir, "sampler")))
-    dist.barrier()
-
-    if "sample_start_index" not in running_states:
-        return (
-            running_states["epoch"],
-            running_states["step"],
-        )
-    else:
-        return (
-            running_states["epoch"],
-            running_states["step"],
-            running_states["sample_start_index"],
-        )
-
-
-def create_logger(logging_dir):
-    """
-    Create a logger that writes to a log file and stdout.
-    """
-    if dist.get_rank() == 0:  # real logger
-        logging.basicConfig(
-            level=logging.INFO,
-            format="[\033[34m%(asctime)s\033[0m] %(message)s",
-            datefmt="%Y-%m-%d %H:%M:%S",
-            handlers=[
-                logging.StreamHandler(),
-                logging.FileHandler(f"{logging_dir}/log.txt"),
-            ],
-        )
-        logger = logging.getLogger(__name__)
-    else:  # dummy logger (does nothing)
-        logger = logging.getLogger(__name__)
-        logger.addHandler(logging.NullHandler())
-    return logger
diff --git a/videotuna/models/opensora/utils/config_utils.py b/videotuna/models/opensora/utils/config_utils.py
deleted file mode 100644
index 0be005ba..00000000
--- a/videotuna/models/opensora/utils/config_utils.py
+++ /dev/null
@@ -1,255 +0,0 @@
-import argparse
-import json
-import os
-from glob import glob
-
-from mmengine.config import Config
-
-
-def parse_args(training=False):
-    parser = argparse.ArgumentParser()
-
-    # model config
-    parser.add_argument("config", help="model config file path")
-
-    # ======================================================
-    # General
-    # ======================================================
-    parser.add_argument(
-        "--seed", default=None, type=int, help="seed for reproducibility"
-    )
-    parser.add_argument(
-        "--ckpt-path",
-        default=None,
-        type=str,
-        help="path to model ckpt; will overwrite cfg.model.from_pretrained if specified",
-    )
-    parser.add_argument("--batch-size", default=None, type=int, help="batch size")
-    parser.add_argument(
-        "--outputs", default=None, type=str, help="the dir to save model weights"
-    )
-    parser.add_argument(
-        "--remarks", default=None, type=str, help="remarks for the experiment"
-    )
-    parser.add_argument(
-        "--flash-attn", default=None, type=str2bool, help="enable flash attention"
-    )
-    parser.add_argument(
-        "--layernorm-kernel",
-        default=None,
-        type=str2bool,
-        help="enable layernorm kernel",
-    )
-    parser.add_argument("--resolution", default=None, type=str, help="multi resolution")
-    parser.add_argument("--data-path", default=None, type=str, help="path to data csv")
-    parser.add_argument("--dtype", default=None, type=str, help="data type")
-
-    # ======================================================
-    # Inference
-    # ======================================================
-
-    if not training:
-        # output
-        parser.add_argument(
-            "--save-dir", default=None, type=str, help="path to save generated samples"
-        )
-        parser.add_argument(
-            "--sample-name",
-            default=None,
-            type=str,
-            help="sample name, default is sample_idx",
-        )
-        parser.add_argument(
-            "--start-index", default=None, type=int, help="start index for sample name"
-        )
-        parser.add_argument(
-            "--end-index", default=None, type=int, help="end index for sample name"
-        )
-        parser.add_argument(
-            "--num-sample",
-            default=None,
-            type=int,
-            help="number of samples to generate for one prompt",
-        )
-        parser.add_argument(
-            "--prompt-as-path",
-            action="store_true",
-            help="use prompt as path to save samples",
-        )
-        parser.add_argument("--verbose", default=None, type=int, help="verbose level")
-
-        # prompt
-        parser.add_argument(
-            "--prompt-path", default=None, type=str, help="path to prompt txt file"
-        )
-        parser.add_argument(
-            "--prompt", default=None, type=str, nargs="+", help="prompt list"
-        )
-        parser.add_argument(
-            "--llm-refine", default=None, type=str2bool, help="enable LLM refine"
-        )
-        parser.add_argument(
-            "--prompt-generator", default=None, type=str, help="prompt generator"
-        )
-
-        # image/video
-        parser.add_argument(
-            "--num-frames", default=None, type=str, help="number of frames"
-        )
-        parser.add_argument("--fps", default=None, type=int, help="fps")
-        parser.add_argument("--save-fps", default=None, type=int, help="save fps")
-        parser.add_argument(
-            "--image-size", default=None, type=int, nargs=2, help="image size"
-        )
-        parser.add_argument(
-            "--frame-interval", default=None, type=int, help="frame interval"
-        )
-        parser.add_argument(
-            "--aspect-ratio", default=None, type=str, help="aspect ratio (h:w)"
-        )
-        parser.add_argument(
-            "--watermark", default=None, type=str2bool, help="watermark video"
-        )
-
-        # hyperparameters
-        parser.add_argument(
-            "--num-sampling-steps", default=None, type=int, help="sampling steps"
-        )
-        parser.add_argument(
-            "--cfg-scale",
-            default=None,
-            type=float,
-            help="balance between cond & uncond",
-        )
-
-        # reference
-        parser.add_argument("--loop", default=None, type=int, help="loop")
-        parser.add_argument(
-            "--condition-frame-length",
-            default=None,
-            type=int,
-            help="condition frame length",
-        )
-        parser.add_argument(
-            "--reference-path", default=None, type=str, nargs="+", help="reference path"
-        )
-        parser.add_argument(
-            "--mask-strategy", default=None, type=str, nargs="+", help="mask strategy"
-        )
-        parser.add_argument("--aes", default=None, type=float, help="aesthetic score")
-        parser.add_argument("--flow", default=None, type=float, help="flow score")
-        parser.add_argument(
-            "--camera-motion", default=None, type=str, help="camera motion"
-        )
-    # ======================================================
-    # Training
-    # ======================================================
-    else:
-        parser.add_argument("--lr", default=None, type=float, help="learning rate")
-        parser.add_argument("--wandb", default=None, type=bool, help="enable wandb")
-        parser.add_argument(
-            "--project-name", default=None, type=str, help="project name for wandb"
-        )
-        parser.add_argument(
-            "--load", default=None, type=str, help="path to continue training"
-        )
-        parser.add_argument(
-            "--start-from-scratch",
-            action="store_true",
-            help="start training from scratch",
-        )
-        parser.add_argument(
-            "--warmup-steps", default=None, type=int, help="warmup steps"
-        )
-
-    return parser.parse_args()
-
-
-def merge_args(cfg, args, training=False):
-    if args.ckpt_path is not None:
-        cfg.model["from_pretrained"] = args.ckpt_path
-        if cfg.get("discriminator") is not None:
-            cfg.discriminator["from_pretrained"] = args.ckpt_path
-        args.ckpt_path = None
-    if args.flash_attn is not None:
-        cfg.model["enable_flash_attn"] = args.flash_attn
-        args.enable_flash_attn = None
-    if args.layernorm_kernel is not None:
-        cfg.model["enable_layernorm_kernel"] = args.layernorm_kernel
-        args.enable_layernorm_kernel = None
-    if args.data_path is not None:
-        cfg.dataset["data_path"] = args.data_path
-        args.data_path = None
-    # NOTE: for vae inference (reconstruction)
-    if not training and "dataset" in cfg:
-        if args.image_size is not None:
-            cfg.dataset["image_size"] = args.image_size
-        if args.num_frames is not None:
-            cfg.dataset["num_frames"] = args.num_frames
-    if not training:
-        if args.cfg_scale is not None:
-            cfg.scheduler["cfg_scale"] = args.cfg_scale
-            args.cfg_scale = None
-        if args.num_sampling_steps is not None:
-            cfg.scheduler["num_sampling_steps"] = args.num_sampling_steps
-            args.num_sampling_steps = None
-
-    for k, v in vars(args).items():
-        if v is not None:
-            cfg[k] = v
-
-    return cfg
-
-
-def read_config(config_path):
-    cfg = Config.fromfile(config_path)
-    return cfg
-
-
-def parse_configs(training=False):
-    args = parse_args(training)
-    cfg = read_config(args.config)
-    cfg = merge_args(cfg, args, training)
-    return cfg
-
-
-def define_experiment_workspace(cfg, get_last_workspace=False, remarks=None):
-    """
-    This function creates a folder for experiment tracking.
-
-    Args:
-        args: The parsed arguments.
-
-    Returns:
-        exp_dir: The path to the experiment folder.
-    """
-    # Make outputs folder (holds all experiment subfolders)
-    os.makedirs(cfg.outputs, exist_ok=True)
-    experiment_index = len(glob(f"{cfg.outputs}/*"))
-    if get_last_workspace:
-        experiment_index -= 1
-
-    # Create an experiment folder
-    model_name = cfg.model["type"].replace("/", "-")
-    if remarks is not None:
-        exp_name = f"{experiment_index:03d}-{model_name}-{remarks}"
-    else:
-        exp_name = f"{experiment_index:03d}-{model_name}"
-    exp_dir = f"{cfg.outputs}/{exp_name}"
-    return exp_name, exp_dir
-
-
-def save_training_config(cfg, experiment_dir):
-    with open(f"{experiment_dir}/config.txt", "w") as f:
-        json.dump(cfg, f, indent=4)
-
-
-def str2bool(v):
-    if isinstance(v, bool):
-        return v
-    if v.lower() in ("yes", "true", "t", "y", "1"):
-        return True
-    elif v.lower() in ("no", "false", "f", "n", "0"):
-        return False
-    else:
-        raise argparse.ArgumentTypeError("Boolean value expected.")
diff --git a/videotuna/models/opensora/utils/inference_utils.py b/videotuna/models/opensora/utils/inference_utils.py
deleted file mode 100644
index 7444a115..00000000
--- a/videotuna/models/opensora/utils/inference_utils.py
+++ /dev/null
@@ -1,393 +0,0 @@
-import json
-import os
-import re
-from pathlib import Path
-
-import torch
-
-from videotuna.models.opensora.datasets import IMG_FPS
-from videotuna.models.opensora.datasets.utils import read_from_path
-
-
-def prepare_multi_resolution_info(
-    info_type, batch_size, image_size, num_frames, fps, device, dtype
-):
-    if info_type is None:
-        return dict()
-    elif info_type == "PixArtMS":
-        hw = torch.tensor([image_size], device=device, dtype=dtype).repeat(
-            batch_size, 1
-        )
-        ar = torch.tensor(
-            [[image_size[0] / image_size[1]]], device=device, dtype=dtype
-        ).repeat(batch_size, 1)
-        return dict(ar=ar, hw=hw)
-    elif info_type in ["STDiT2", "OpenSora"]:
-        fps = fps if num_frames > 1 else IMG_FPS
-        fps = torch.tensor([fps], device=device, dtype=dtype).repeat(batch_size)
-        height = torch.tensor([image_size[0]], device=device, dtype=dtype).repeat(
-            batch_size
-        )
-        width = torch.tensor([image_size[1]], device=device, dtype=dtype).repeat(
-            batch_size
-        )
-        num_frames = torch.tensor([num_frames], device=device, dtype=dtype).repeat(
-            batch_size
-        )
-        ar = torch.tensor(
-            [image_size[0] / image_size[1]], device=device, dtype=dtype
-        ).repeat(batch_size)
-        return dict(height=height, width=width, num_frames=num_frames, ar=ar, fps=fps)
-    else:
-        raise NotImplementedError
-
-
-def load_prompts(prompt_path, start_idx=None, end_idx=None):
-    prompt_path = Path(prompt_path)
-    prompt_dict = {}
-    if prompt_path.is_dir():
-        for prompt_file in prompt_path.glob("*.txt"):
-            prompts = []
-            file_name = prompt_file.stem
-            with open(prompt_file, "r") as f:
-                prompts.extend([line.strip() for line in f.readlines()])
-            prompt_dict[file_name] = prompts
-    elif prompt_path.is_file():
-        file_name = prompt_path.stem
-        with open(prompt_path, "r") as f:
-            prompts = [line.strip() for line in f.readlines()]
-        prompt_dict[file_name] = prompts
-    else:
-        raise FileNotFoundError(f"Prompt file or directory not found: {prompt_path}")
-
-    return prompt_dict
-
-
-def get_save_path_name(
-    save_dir,
-    sample_name=None,  # prefix
-    sample_idx=None,  # sample index
-    prompt=None,  # used prompt
-    prompt_as_path=False,  # use prompt as path
-    num_sample=1,  # number of samples to generate for one prompt
-    k=None,  # kth sample
-):
-    if sample_name is None:
-        sample_name = "" if prompt_as_path else "sample"
-    sample_name_suffix = prompt if prompt_as_path else f"_{sample_idx:04d}"
-    save_path = os.path.join(save_dir, f"{sample_name}{sample_name_suffix}")
-    if num_sample != 1:
-        save_path = f"{save_path}-{k}"
-    return save_path
-
-
-def get_save_path_name_vbench(
-    save_dir,
-    prompt_file=None,
-    sample_idx=None,
-    prompt=None,
-    k=None,
-):
-    save_dir = Path(save_dir)
-    sub_dir = save_dir / prompt_file / f"{k}"
-    sub_dir.mkdir(parents=True, exist_ok=True)
-
-    if len(prompt) > 150:
-        prompt_save = prompt[:150]
-    else:
-        prompt_save = prompt
-    print(f"Prompt: {prompt}")
-
-    save_file_name = f"{sample_idx:05d}-{prompt_save.strip('.').replace(' ', '_')}"
-    save_path = sub_dir / save_file_name
-    return str(save_path)
-
-
-def append_score_to_prompts(prompts, aes=None, flow=None, camera_motion=None):
-    new_prompts = []
-    for prompt in prompts:
-        new_prompt = prompt
-        if aes is not None and "aesthetic score:" not in prompt:
-            new_prompt = f"{new_prompt} aesthetic score: {aes:.1f}."
-        if flow is not None and "motion score:" not in prompt:
-            new_prompt = f"{new_prompt} motion score: {flow:.1f}."
-        if camera_motion is not None and "camera motion:" not in prompt:
-            new_prompt = f"{new_prompt} camera motion: {camera_motion}."
-        new_prompts.append(new_prompt)
-    return new_prompts
-
-
-def extract_json_from_prompts(prompts, reference, mask_strategy):
-    ret_prompts = []
-    for i, prompt in enumerate(prompts):
-        parts = re.split(r"(?=[{])", prompt)
-        assert len(parts) <= 2, f"Invalid prompt: {prompt}"
-        ret_prompts.append(parts[0])
-        if len(parts) > 1:
-            additional_info = json.loads(parts[1])
-            for key in additional_info:
-                assert key in ["reference_path", "mask_strategy"], f"Invalid key: {key}"
-                if key == "reference_path":
-                    reference[i] = additional_info[key]
-                elif key == "mask_strategy":
-                    mask_strategy[i] = additional_info[key]
-    return ret_prompts, reference, mask_strategy
-
-
-def collect_references_batch(reference_paths, vae, image_size):
-    refs_x = []  # refs_x: [batch, ref_num, C, T, H, W]
-    for reference_path in reference_paths:
-        if reference_path == "":
-            refs_x.append([])
-            continue
-        ref_path = reference_path.split(";")
-        ref = []
-        for r_path in ref_path:
-            r = read_from_path(r_path, image_size, transform_name="resize_crop")
-            r_x = vae.encode(r.unsqueeze(0).to(vae.device, vae.dtype))
-            r_x = r_x.squeeze(0)
-            ref.append(r_x)
-        refs_x.append(ref)
-    return refs_x
-
-
-def extract_prompts_loop(prompts, num_loop):
-    ret_prompts = []
-    for prompt in prompts:
-        if prompt.startswith("|0|"):
-            prompt_list = prompt.split("|")[1:]
-            text_list = []
-            for i in range(0, len(prompt_list), 2):
-                start_loop = int(prompt_list[i])
-                text = prompt_list[i + 1]
-                end_loop = (
-                    int(prompt_list[i + 2])
-                    if i + 2 < len(prompt_list)
-                    else num_loop + 1
-                )
-                text_list.extend([text] * (end_loop - start_loop))
-            prompt = text_list[num_loop]
-        ret_prompts.append(prompt)
-    return ret_prompts
-
-
-def split_prompt(prompt_text):
-    if prompt_text.startswith("|0|"):
-        # this is for prompts which look like
-        # |0| a beautiful day |1| a sunny day |2| a rainy day
-        # we want to parse it into a list of prompts with the loop index
-        prompt_list = prompt_text.split("|")[1:]
-        text_list = []
-        loop_idx = []
-        for i in range(0, len(prompt_list), 2):
-            start_loop = int(prompt_list[i])
-            text = prompt_list[i + 1].strip()
-            text_list.append(text)
-            loop_idx.append(start_loop)
-        return text_list, loop_idx
-    else:
-        return [prompt_text], None
-
-
-def merge_prompt(text_list, loop_idx_list=None):
-    if loop_idx_list is None:
-        return text_list[0]
-    else:
-        prompt = ""
-        for i, text in enumerate(text_list):
-            prompt += f"|{loop_idx_list[i]}|{text}"
-        return prompt
-
-
-MASK_DEFAULT = ["0", "0", "0", "0", "1", "0"]
-
-
-def parse_mask_strategy(mask_strategy):
-    mask_batch = []
-    if mask_strategy == "" or mask_strategy is None:
-        return mask_batch
-
-    mask_strategy = mask_strategy.split(";")
-    for mask in mask_strategy:
-        mask_group = mask.split(",")
-        num_group = len(mask_group)
-        assert num_group >= 1 and num_group <= 6, f"Invalid mask strategy: {mask}"
-        mask_group.extend(MASK_DEFAULT[num_group:])
-        for i in range(5):
-            mask_group[i] = int(mask_group[i])
-        mask_group[5] = float(mask_group[5])
-        mask_batch.append(mask_group)
-    return mask_batch
-
-
-def find_nearest_point(value, point, max_value):
-    t = value // point
-    if value % point > point / 2 and t < max_value // point - 1:
-        t += 1
-    return t * point
-
-
-def apply_mask_strategy(z, refs_x, mask_strategys, loop_i, align=None):
-    masks = []
-    no_mask = True
-    for i, mask_strategy in enumerate(mask_strategys):
-        no_mask = False
-        mask = torch.ones(z.shape[2], dtype=torch.float, device=z.device)
-        mask_strategy = parse_mask_strategy(mask_strategy)
-        for mst in mask_strategy:
-            loop_id, m_id, m_ref_start, m_target_start, m_length, edit_ratio = mst
-            if loop_id != loop_i:
-                continue
-            ref = refs_x[i][m_id]
-
-            if m_ref_start < 0:
-                # ref: [C, T, H, W]
-                m_ref_start = ref.shape[1] + m_ref_start
-            if m_target_start < 0:
-                # z: [B, C, T, H, W]
-                m_target_start = z.shape[2] + m_target_start
-            if align is not None:
-                m_ref_start = find_nearest_point(m_ref_start, align, ref.shape[1])
-                m_target_start = find_nearest_point(m_target_start, align, z.shape[2])
-            m_length = min(
-                m_length, z.shape[2] - m_target_start, ref.shape[1] - m_ref_start
-            )
-            z[i, :, m_target_start : m_target_start + m_length] = ref[
-                :, m_ref_start : m_ref_start + m_length
-            ]
-            mask[m_target_start : m_target_start + m_length] = edit_ratio
-        masks.append(mask)
-    if no_mask:
-        return None
-    masks = torch.stack(masks)
-    return masks
-
-
-def append_generated(
-    vae,
-    generated_video,
-    refs_x,
-    mask_strategy,
-    loop_i,
-    condition_frame_length,
-    condition_frame_edit,
-):
-    ref_x = vae.encode(generated_video)
-    for j, refs in enumerate(refs_x):
-        if refs is None:
-            refs_x[j] = [ref_x[j]]
-        else:
-            refs.append(ref_x[j])
-        if mask_strategy[j] is None or mask_strategy[j] == "":
-            mask_strategy[j] = ""
-        else:
-            mask_strategy[j] += ";"
-        mask_strategy[
-            j
-        ] += f"{loop_i},{len(refs)-1},-{condition_frame_length},0,{condition_frame_length},{condition_frame_edit}"
-    return refs_x, mask_strategy
-
-
-def dframe_to_frame(num):
-    assert num % 5 == 0, f"Invalid num: {num}"
-    return num // 5 * 17
-
-
-OPENAI_CLIENT = None
-REFINE_PROMPTS = None
-REFINE_PROMPTS_PATH = "assets/texts/t2v_pllava.txt"
-REFINE_PROMPTS_TEMPLATE = """
-You need to refine user's input prompt. The user's input prompt is used for video generation task. You need to refine the user's prompt to make it more suitable for the task. Here are some examples of refined prompts:
-{}
-
-The refined prompt should pay attention to all objects in the video. The description should be useful for AI to re-generate the video. The description should be no more than six sentences. The refined prompt should be in English.
-"""
-RANDOM_PROMPTS = None
-RANDOM_PROMPTS_TEMPLATE = """
-You need to generate one input prompt for video generation task. The prompt should be suitable for the task. Here are some examples of refined prompts:
-{}
-
-The prompt should pay attention to all objects in the video. The description should be useful for AI to re-generate the video. The description should be no more than six sentences. The prompt should be in English.
-"""
-
-
-def get_openai_response(sys_prompt, usr_prompt, model="gpt-4o"):
-    global OPENAI_CLIENT
-    if OPENAI_CLIENT is None:
-        from openai import OpenAI
-
-        OPENAI_CLIENT = OpenAI(api_key=os.environ.get("OPENAI_API_KEY"))
-
-    completion = OPENAI_CLIENT.chat.completions.create(
-        model=model,
-        messages=[
-            {
-                "role": "system",
-                "content": sys_prompt,
-            },  # <-- This is the system message that provides context to the model
-            {
-                "role": "user",
-                "content": usr_prompt,
-            },  # <-- This is the user message for which the model will generate a response
-        ],
-    )
-
-    return completion.choices[0].message.content
-
-
-def get_random_prompt_by_openai():
-    global RANDOM_PROMPTS
-    if RANDOM_PROMPTS is None:
-        examples = load_prompts(REFINE_PROMPTS_PATH)
-        RANDOM_PROMPTS = RANDOM_PROMPTS_TEMPLATE.format("\n".join(examples))
-
-    response = get_openai_response(RANDOM_PROMPTS, "Generate one example.")
-    return response
-
-
-def refine_prompt_by_openai(prompt):
-    global REFINE_PROMPTS
-    if REFINE_PROMPTS is None:
-        examples = load_prompts(REFINE_PROMPTS_PATH)
-        REFINE_PROMPTS = REFINE_PROMPTS_TEMPLATE.format("\n".join(examples))
-
-    response = get_openai_response(REFINE_PROMPTS, prompt)
-    return response
-
-
-def has_openai_key():
-    return "OPENAI_API_KEY" in os.environ
-
-
-def refine_prompts_by_openai(prompts):
-    new_prompts = []
-    for prompt in prompts:
-        try:
-            if prompt.strip() == "":
-                new_prompt = get_random_prompt_by_openai()
-                print(
-                    f"[Info] Empty prompt detected, generate random prompt: {new_prompt}"
-                )
-            else:
-                new_prompt = refine_prompt_by_openai(prompt)
-                print(f"[Info] Refine prompt: {prompt} -> {new_prompt}")
-            new_prompts.append(new_prompt)
-        except Exception as e:
-            print(f"[Warning] Failed to refine prompt: {prompt} due to {e}")
-            new_prompts.append(prompt)
-    return new_prompts
-
-
-def add_watermark(
-    input_video_path,
-    watermark_image_path="./assets/images/watermark/watermark.png",
-    output_video_path=None,
-):
-    # execute this command in terminal with subprocess
-    # return if the process is successful
-    if output_video_path is None:
-        output_video_path = input_video_path.replace(".mp4", "_watermark.mp4")
-    cmd = f'ffmpeg -y -i {input_video_path} -i {watermark_image_path} -filter_complex "[1][0]scale2ref=oh*mdar:ih*0.1[logo][video];[video][logo]overlay" {output_video_path}'
-    exit_code = os.system(cmd)
-    is_success = exit_code == 0
-    return is_success
diff --git a/videotuna/models/opensora/utils/lr_scheduler.py b/videotuna/models/opensora/utils/lr_scheduler.py
deleted file mode 100644
index 36378b4d..00000000
--- a/videotuna/models/opensora/utils/lr_scheduler.py
+++ /dev/null
@@ -1,25 +0,0 @@
-from torch.optim.lr_scheduler import _LRScheduler
-
-
-class LinearWarmupLR(_LRScheduler):
-    """Linearly warmup learning rate and then linearly decay.
-
-    Args:
-        optimizer (:class:`torch.optim.Optimizer`): Wrapped optimizer.
-        warmup_steps (int, optional): Number of warmup steps, defaults to 0
-        last_step (int, optional): The index of last step, defaults to -1. When last_step=-1,
-            the schedule is started from the beginning or When last_step=-1, sets initial lr as lr.
-    """
-
-    def __init__(self, optimizer, warmup_steps: int = 0, last_epoch: int = -1):
-        self.warmup_steps = warmup_steps
-        super().__init__(optimizer, last_epoch=last_epoch)
-
-    def get_lr(self):
-        if self.last_epoch < self.warmup_steps:
-            return [
-                (self.last_epoch + 1) / (self.warmup_steps + 1) * lr
-                for lr in self.base_lrs
-            ]
-        else:
-            return self.base_lrs
diff --git a/videotuna/models/opensora/utils/misc.py b/videotuna/models/opensora/utils/misc.py
deleted file mode 100644
index b2f67309..00000000
--- a/videotuna/models/opensora/utils/misc.py
+++ /dev/null
@@ -1,411 +0,0 @@
-import collections
-import importlib
-import logging
-import os
-import time
-from collections import OrderedDict
-from collections.abc import Sequence
-from itertools import repeat
-from typing import Tuple
-
-import numpy as np
-import torch
-import torch.distributed as dist
-
-# ======================================================
-# Logging
-# ======================================================
-
-
-def is_distributed():
-    return os.environ.get("WORLD_SIZE", None) is not None
-
-
-def is_main_process():
-    return not is_distributed() or dist.get_rank() == 0
-
-
-def get_world_size():
-    if is_distributed():
-        return dist.get_world_size()
-    else:
-        return 1
-
-
-def create_logger(logging_dir=None):
-    """
-    Create a logger that writes to a log file and stdout.
-    """
-    if is_main_process():  # real logger
-        additional_args = dict()
-        if logging_dir is not None:
-            additional_args["handlers"] = [
-                logging.StreamHandler(),
-                logging.FileHandler(f"{logging_dir}/log.txt"),
-            ]
-        logging.basicConfig(
-            level=logging.INFO,
-            format="[\033[34m%(asctime)s\033[0m] %(message)s",
-            datefmt="%Y-%m-%d %H:%M:%S",
-            **additional_args,
-        )
-        logger = logging.getLogger(__name__)
-    else:  # dummy logger (does nothing)
-        logger = logging.getLogger(__name__)
-        logger.addHandler(logging.NullHandler())
-    return logger
-
-
-def get_logger():
-    return logging.getLogger(__name__)
-
-
-def print_rank(var_name, var_value, rank=0):
-    if dist.get_rank() == rank:
-        print(f"[Rank {rank}] {var_name}: {var_value}")
-
-
-def print_0(*args, **kwargs):
-    if dist.get_rank() == 0:
-        print(*args, **kwargs)
-
-
-def create_tensorboard_writer(exp_dir):
-    from torch.utils.tensorboard import SummaryWriter
-
-    tensorboard_dir = f"{exp_dir}/tensorboard"
-    os.makedirs(tensorboard_dir, exist_ok=True)
-    writer = SummaryWriter(tensorboard_dir)
-    return writer
-
-
-# ======================================================
-# String
-# ======================================================
-
-
-def format_numel_str(numel: int) -> str:
-    B = 1024**3
-    M = 1024**2
-    K = 1024
-    if numel >= B:
-        return f"{numel / B:.2f} B"
-    elif numel >= M:
-        return f"{numel / M:.2f} M"
-    elif numel >= K:
-        return f"{numel / K:.2f} K"
-    else:
-        return f"{numel}"
-
-
-def get_timestamp():
-    timestamp = time.strftime("%Y%m%d-%H%M%S", time.localtime(time.time()))
-    return timestamp
-
-
-def format_time(seconds):
-    days = int(seconds / 3600 / 24)
-    seconds = seconds - days * 3600 * 24
-    hours = int(seconds / 3600)
-    seconds = seconds - hours * 3600
-    minutes = int(seconds / 60)
-    seconds = seconds - minutes * 60
-    secondsf = int(seconds)
-    seconds = seconds - secondsf
-    millis = int(seconds * 1000)
-
-    f = ""
-    i = 1
-    if days > 0:
-        f += str(days) + "D"
-        i += 1
-    if hours > 0 and i <= 2:
-        f += str(hours) + "h"
-        i += 1
-    if minutes > 0 and i <= 2:
-        f += str(minutes) + "m"
-        i += 1
-    if secondsf > 0 and i <= 2:
-        f += str(secondsf) + "s"
-        i += 1
-    if millis > 0 and i <= 2:
-        f += str(millis) + "ms"
-        i += 1
-    if f == "":
-        f = "0ms"
-    return f
-
-
-class BColors:
-    HEADER = "\033[95m"
-    OKBLUE = "\033[94m"
-    OKCYAN = "\033[96m"
-    OKGREEN = "\033[92m"
-    WARNING = "\033[93m"
-    FAIL = "\033[91m"
-    ENDC = "\033[0m"
-    BOLD = "\033[1m"
-    UNDERLINE = "\033[4m"
-
-
-# ======================================================
-# PyTorch
-# ======================================================
-
-
-def requires_grad(model: torch.nn.Module, flag: bool = True) -> None:
-    """
-    Set requires_grad flag for all parameters in a model.
-    """
-    for p in model.parameters():
-        p.requires_grad = flag
-
-
-def all_reduce_mean(tensor: torch.Tensor) -> torch.Tensor:
-    dist.all_reduce(tensor=tensor, op=dist.ReduceOp.SUM)
-    tensor.div_(dist.get_world_size())
-    return tensor
-
-
-def get_model_numel(model: torch.nn.Module) -> Tuple[int, int]:
-    num_params = 0
-    num_params_trainable = 0
-    for p in model.parameters():
-        num_params += p.numel()
-        if p.requires_grad:
-            num_params_trainable += p.numel()
-    return num_params, num_params_trainable
-
-
-def count_params(model):
-    return sum(p.numel() for p in model.parameters() if p.requires_grad)
-
-
-def to_tensor(data):
-    """Convert objects of various python types to :obj:`torch.Tensor`.
-
-    Supported types are: :class:`numpy.ndarray`, :class:`torch.Tensor`,
-    :class:`Sequence`, :class:`int` and :class:`float`.
-
-    Args:
-        data (torch.Tensor | numpy.ndarray | Sequence | int | float): Data to
-            be converted.
-    """
-
-    if isinstance(data, torch.Tensor):
-        return data
-    elif isinstance(data, np.ndarray):
-        return torch.from_numpy(data)
-    elif isinstance(data, Sequence) and not isinstance(data, str):
-        return torch.tensor(data)
-    elif isinstance(data, int):
-        return torch.LongTensor([data])
-    elif isinstance(data, float):
-        return torch.FloatTensor([data])
-    else:
-        raise TypeError(f"type {type(data)} cannot be converted to tensor.")
-
-
-def to_ndarray(data):
-    if isinstance(data, torch.Tensor):
-        return data.numpy()
-    elif isinstance(data, np.ndarray):
-        return data
-    elif isinstance(data, Sequence):
-        return np.array(data)
-    elif isinstance(data, int):
-        return np.ndarray([data], dtype=int)
-    elif isinstance(data, float):
-        return np.array([data], dtype=float)
-    else:
-        raise TypeError(f"type {type(data)} cannot be converted to ndarray.")
-
-
-def to_torch_dtype(dtype):
-    if isinstance(dtype, torch.dtype):
-        return dtype
-    elif isinstance(dtype, str):
-        dtype_mapping = {
-            "float64": torch.float64,
-            "float32": torch.float32,
-            "float16": torch.float16,
-            "fp32": torch.float32,
-            "fp16": torch.float16,
-            "half": torch.float16,
-            "bf16": torch.bfloat16,
-        }
-        if dtype not in dtype_mapping:
-            raise ValueError
-        dtype = dtype_mapping[dtype]
-        return dtype
-    else:
-        raise ValueError
-
-
-def _ntuple(n):
-    def parse(x):
-        if isinstance(x, collections.abc.Iterable) and not isinstance(x, str):
-            return x
-        return tuple(repeat(x, n))
-
-    return parse
-
-
-to_1tuple = _ntuple(1)
-to_2tuple = _ntuple(2)
-to_3tuple = _ntuple(3)
-to_4tuple = _ntuple(4)
-to_ntuple = _ntuple
-
-
-def convert_SyncBN_to_BN2d(model_cfg):
-    for k in model_cfg:
-        v = model_cfg[k]
-        if k == "norm_cfg" and v["type"] == "SyncBN":
-            v["type"] = "BN2d"
-        elif isinstance(v, dict):
-            convert_SyncBN_to_BN2d(v)
-
-
-def get_topk(x, dim=4, k=5):
-    x = to_tensor(x)
-    inds = x[..., dim].topk(k)[1]
-    return x[inds]
-
-
-def param_sigmoid(x, alpha):
-    ret = 1 / (1 + (-alpha * x).exp())
-    return ret
-
-
-def inverse_param_sigmoid(x, alpha, eps=1e-5):
-    x = x.clamp(min=0, max=1)
-    x1 = x.clamp(min=eps)
-    x2 = (1 - x).clamp(min=eps)
-    return torch.log(x1 / x2) / alpha
-
-
-def inverse_sigmoid(x, eps=1e-5):
-    """Inverse function of sigmoid.
-
-    Args:
-        x (Tensor): The tensor to do the
-            inverse.
-        eps (float): EPS avoid numerical
-            overflow. Defaults 1e-5.
-    Returns:
-        Tensor: The x has passed the inverse
-            function of sigmoid, has same
-            shape with input.
-    """
-    x = x.clamp(min=0, max=1)
-    x1 = x.clamp(min=eps)
-    x2 = (1 - x).clamp(min=eps)
-    return torch.log(x1 / x2)
-
-
-# ======================================================
-# Python
-# ======================================================
-
-
-def count_columns(df, columns):
-    cnt_dict = OrderedDict()
-    num_samples = len(df)
-
-    for col in columns:
-        d_i = df[col].value_counts().to_dict()
-        for k in d_i:
-            d_i[k] = (d_i[k], d_i[k] / num_samples)
-        cnt_dict[col] = d_i
-
-    return cnt_dict
-
-
-def try_import(name):
-    """Try to import a module.
-
-    Args:
-        name (str): Specifies what module to import in absolute or relative
-            terms (e.g. either pkg.mod or ..mod).
-    Returns:
-        ModuleType or None: If importing successfully, returns the imported
-        module, otherwise returns None.
-    """
-    try:
-        return importlib.import_module(name)
-    except ImportError:
-        return None
-
-
-def transpose(x):
-    """
-    transpose a list of list
-    Args:
-        x (list[list]):
-    """
-    ret = list(map(list, zip(*x)))
-    return ret
-
-
-def all_exists(paths):
-    return all(os.path.exists(path) for path in paths)
-
-
-# ======================================================
-# Profile
-# ======================================================
-
-
-class Timer:
-    def __init__(self, name, log=False):
-        self.name = name
-        self.start_time = None
-        self.end_time = None
-        self.log = log
-
-    @property
-    def elapsed_time(self):
-        return self.end_time - self.start_time
-
-    def __enter__(self):
-        torch.cuda.synchronize()
-        self.start_time = time.time()
-        return self
-
-    def __exit__(self, exc_type, exc_val, exc_tb):
-        torch.cuda.synchronize()
-        self.end_time = time.time()
-        if self.log:
-            print(f"Elapsed time for {self.name}: {self.elapsed_time:.2f} s")
-
-
-def get_tensor_memory(tensor, human_readable=True):
-    size = tensor.element_size() * tensor.nelement()
-    if human_readable:
-        size = format_numel_str(size)
-    return size
-
-
-class FeatureSaver:
-    def __init__(self, save_dir, bin_size=10, start_bin=0):
-        self.save_dir = save_dir
-        self.bin_size = bin_size
-        self.bin_cnt = start_bin
-
-        self.data_list = []
-        self.cnt = 0
-
-    def update(self, data):
-        self.data_list.append(data)
-        self.cnt += 1
-
-        if self.cnt % self.bin_size == 0:
-            self.save()
-
-    def save(self):
-        save_path = os.path.join(self.save_dir, f"{self.bin_cnt:08}.bin")
-        torch.save(self.data_list, save_path)
-        get_logger().info("Saved to %s", save_path)
-        self.data_list = []
-        self.bin_cnt += 1
diff --git a/videotuna/models/opensora/utils/train_utils.py b/videotuna/models/opensora/utils/train_utils.py
deleted file mode 100644
index cc13ac84..00000000
--- a/videotuna/models/opensora/utils/train_utils.py
+++ /dev/null
@@ -1,175 +0,0 @@
-import math
-import random
-from collections import OrderedDict
-
-import torch
-import torch.distributed as dist
-from colossalai.booster.plugin import LowLevelZeroPlugin
-
-from videotuna.models.opensora.acceleration.parallel_states import (
-    set_data_parallel_group,
-    set_sequence_parallel_group,
-)
-from videotuna.models.opensora.acceleration.plugin import ZeroSeqParallelPlugin
-
-from .misc import get_logger
-
-
-def create_colossalai_plugin(plugin, dtype, grad_clip, sp_size):
-    if plugin == "zero2":
-        assert sp_size == 1, "Zero2 plugin does not support sequence parallelism"
-        plugin = LowLevelZeroPlugin(
-            stage=2,
-            precision=dtype,
-            initial_scale=2**16,
-            max_norm=grad_clip,
-        )
-        set_data_parallel_group(dist.group.WORLD)
-    elif plugin == "zero2-seq":
-        assert sp_size > 1, "Zero2-seq plugin requires sequence parallelism"
-        plugin = ZeroSeqParallelPlugin(
-            sp_size=sp_size,
-            stage=2,
-            precision=dtype,
-            initial_scale=2**16,
-            max_norm=grad_clip,
-        )
-        set_sequence_parallel_group(plugin.sp_group)
-        set_data_parallel_group(plugin.dp_group)
-    else:
-        raise ValueError(f"Unknown plugin {plugin}")
-    return plugin
-
-
-@torch.no_grad()
-def update_ema(
-    ema_model: torch.nn.Module,
-    model: torch.nn.Module,
-    optimizer=None,
-    decay: float = 0.9999,
-    sharded: bool = True,
-) -> None:
-    """
-    Step the EMA model towards the current model.
-    """
-    ema_params = OrderedDict(ema_model.named_parameters())
-    model_params = OrderedDict(model.named_parameters())
-
-    for name, param in model_params.items():
-        if name == "pos_embed":
-            continue
-        if not param.requires_grad:
-            continue
-        if not sharded:
-            param_data = param.data
-            ema_params[name].mul_(decay).add_(param_data, alpha=1 - decay)
-        else:
-            if param.data.dtype != torch.float32:
-                param_id = id(param)
-                master_param = optimizer._param_store.working_to_master_param[param_id]
-                param_data = master_param.data
-            else:
-                param_data = param.data
-            ema_params[name].mul_(decay).add_(param_data, alpha=1 - decay)
-
-
-class MaskGenerator:
-    def __init__(self, mask_ratios):
-        valid_mask_names = [
-            "identity",
-            "mask_no",
-            "mask_quarter_random",
-            "mask_quarter_head",
-            "mask_quarter_tail",
-            "mask_quarter_head_tail",
-            "mask_image_random",
-            "mask_image_head",
-            "mask_image_tail",
-            "mask_image_head_tail",
-            "mask_inter",
-        ]
-        assert all(
-            mask_name in valid_mask_names for mask_name in mask_ratios.keys()
-        ), f"mask_name should be one of {valid_mask_names}, got {mask_ratios.keys()}"
-        assert all(
-            mask_ratio >= 0 for mask_ratio in mask_ratios.values()
-        ), f"mask_ratio should be greater than or equal to 0, got {mask_ratios.values()}"
-        assert all(
-            mask_ratio <= 1 for mask_ratio in mask_ratios.values()
-        ), f"mask_ratio should be less than or equal to 1, got {mask_ratios.values()}"
-        # sum of mask_ratios should be 1
-        if "identity" in mask_ratios:
-            mask_ratios["identity"] = 1.0 - sum(mask_ratios.values())
-        assert math.isclose(
-            sum(mask_ratios.values()), 1.0, abs_tol=1e-6
-        ), f"sum of mask_ratios should be 1, got {sum(mask_ratios.values())}"
-        get_logger().info("mask ratios: %s", mask_ratios)
-        self.mask_ratios = mask_ratios
-
-    def get_mask(self, x):
-        mask_type = random.random()
-        mask_name = None
-        prob_acc = 0.0
-        for mask, mask_ratio in self.mask_ratios.items():
-            prob_acc += mask_ratio
-            if mask_type < prob_acc:
-                mask_name = mask
-                break
-
-        num_frames = x.shape[2]
-        # Hardcoded condition_frames
-        condition_frames_max = num_frames // 4
-
-        mask = torch.ones(num_frames, dtype=torch.bool, device=x.device)
-        if num_frames <= 1:
-            return mask
-
-        if mask_name == "mask_quarter_random":
-            random_size = random.randint(1, condition_frames_max)
-            random_pos = random.randint(0, x.shape[2] - random_size)
-            mask[random_pos : random_pos + random_size] = 0
-        elif mask_name == "mask_image_random":
-            random_size = 1
-            random_pos = random.randint(0, x.shape[2] - random_size)
-            mask[random_pos : random_pos + random_size] = 0
-        elif mask_name == "mask_quarter_head":
-            random_size = random.randint(1, condition_frames_max)
-            mask[:random_size] = 0
-        elif mask_name == "mask_image_head":
-            random_size = 1
-            mask[:random_size] = 0
-        elif mask_name == "mask_quarter_tail":
-            random_size = random.randint(1, condition_frames_max)
-            mask[-random_size:] = 0
-        elif mask_name == "mask_image_tail":
-            random_size = 1
-            mask[-random_size:] = 0
-        elif mask_name == "mask_quarter_head_tail":
-            random_size = random.randint(1, condition_frames_max)
-            mask[:random_size] = 0
-            mask[-random_size:] = 0
-        elif mask_name == "mask_image_head_tail":
-            random_size = 1
-            mask[:random_size] = 0
-            mask[-random_size:] = 0
-        elif mask_name == "mask_inter":
-            mask[0::2] = 0
-        elif mask_name == "intepolate":
-            random_start = random.randint(0, 1)
-            mask[random_start::2] = 0
-        elif mask_name == "random":
-            mask_ratio = random.uniform(0.1, 0.9)
-            mask = torch.rand(num_frames, device=x.device) > mask_ratio
-            # if mask is all False, set the last frame to True
-            if not mask.any():
-                mask[-1] = 1
-
-        return mask
-
-    def get_masks(self, x):
-        masks = []
-        for _ in range(len(x)):
-            mask = self.get_mask(x)
-            masks.append(mask)
-        masks = torch.stack(masks, dim=0)
-        return masks
diff --git a/videotuna/models/stepvideo/run.py b/videotuna/models/stepvideo/run.py
deleted file mode 100644
index 67b4d873..00000000
--- a/videotuna/models/stepvideo/run.py
+++ /dev/null
@@ -1,40 +0,0 @@
-from stepvideo.diffusion.video_pipeline import StepVideoPipeline
-import torch
-from stepvideo.config import parse_args
-from stepvideo.utils import setup_seed
-
-import torch
-import os
-import pickle
-import argparse
-import threading
-import argparse
-
-
-    
-if __name__ == "__main__":
-    args = parse_args()
-
-    setup_seed(args.seed)
-        
-    vae_dir = os.path.join(args.model_dir, args.vae_dir)
-    llm_dir = os.path.join(args.model_dir, args.llm_dir)
-    clip_dir = os.path.join(args.model_dir, args.clip_dir)
-
-    pipeline = StepVideoPipeline.from_pretrained(args.model_dir).to(dtype=torch.bfloat16)
-    pipeline.setup_dir(vae_dir, llm_dir, clip_dir)
-    pipeline.enable_vram_management(num_persistent_param_in_dit=0)
-
-    prompt = args.prompt
-    videos = pipeline(
-        prompt=prompt, 
-        num_frames=args.num_frames, 
-        height=args.height, 
-        width=args.width,
-        num_inference_steps = args.infer_steps,
-        guidance_scale=args.cfg_scale,
-        time_shift=args.time_shift,
-        pos_magic=args.pos_magic,
-        neg_magic=args.neg_magic,
-        output_file_name=prompt[:50]
-    )
\ No newline at end of file
diff --git a/videotuna/models/stepvideo/stepvideo/__init__.py b/videotuna/models/stepvideo/stepvideo/__init__.py
deleted file mode 100644
index e919871d..00000000
--- a/videotuna/models/stepvideo/stepvideo/__init__.py
+++ /dev/null
@@ -1,7 +0,0 @@
-import os
-
-os.environ["NCCL_DEBUG"] = "ERROR"
-
-from .diffusion.scheduler import *
-from .diffusion.video_pipeline import *
-from .modules.model import *
\ No newline at end of file
diff --git a/videotuna/models/stepvideo/stepvideo/__version__.py b/videotuna/models/stepvideo/stepvideo/__version__.py
deleted file mode 100644
index a68927d6..00000000
--- a/videotuna/models/stepvideo/stepvideo/__version__.py
+++ /dev/null
@@ -1 +0,0 @@
-__version__ = "0.1.0"
\ No newline at end of file
diff --git a/videotuna/models/stepvideo/stepvideo/config.py b/videotuna/models/stepvideo/stepvideo/config.py
deleted file mode 100644
index b982ed94..00000000
--- a/videotuna/models/stepvideo/stepvideo/config.py
+++ /dev/null
@@ -1,189 +0,0 @@
-import argparse
-
-def parse_args(namespace=None):
-    parser = argparse.ArgumentParser(description="StepVideo inference script")
-
-    parser = add_extra_models_args(parser)
-    parser = add_denoise_schedule_args(parser)
-    parser = add_inference_args(parser)
-    parser = add_parallel_args(parser)
-
-    args = parser.parse_args(namespace=namespace)
-
-    return args
-
-
-
-def add_extra_models_args(parser: argparse.ArgumentParser):
-    group = parser.add_argument_group(
-        title="Extra models args, including vae, text encoders and tokenizers)"
-    )
-
-    group.add_argument(
-        "--vae_dir",
-        type=str,
-        default='var',
-        help="vae dir.",
-    )
-    group.add_argument(
-        "--llm_dir",
-        type=str,
-        default='step_llm',
-        help="llm encoder dir",
-    )
-    group.add_argument(
-        "--clip_dir",
-        type=str,
-        default='hunyuan_clip',
-        help="clip encoder dir",
-    )
-
-    return parser
-
-
-def add_denoise_schedule_args(parser: argparse.ArgumentParser):
-    group = parser.add_argument_group(title="Denoise schedule args")
-
-    # Flow Matching
-    group.add_argument(
-        "--time_shift",
-        type=float,
-        default=7.0,
-        help="Shift factor for flow matching schedulers.",
-    )
-    group.add_argument(
-        "--flow_reverse",
-        action="store_true",
-        help="If reverse, learning/sampling from t=1 -> t=0.",
-    )
-    group.add_argument(
-        "--flow_solver",
-        type=str,
-        default="euler",
-        help="Solver for flow matching.",
-    )
-
-    return parser
-
-
-def add_inference_args(parser: argparse.ArgumentParser):
-    group = parser.add_argument_group(title="Inference args")
-
-    # ======================== Model loads ========================
-    group.add_argument(
-        "--model_dir",
-        type=str,
-        default="./ckpts",
-        help="Root path of all the models, including t2v models and extra models.",
-    )
-    group.add_argument(
-        "--model_resolution",
-        type=str,
-        default="540p",
-        choices=["540p"],
-        help="Root path of all the models, including t2v models and extra models.",
-    )
-    group.add_argument(
-        "--use-cpu-offload",
-        action="store_true",
-        help="Use CPU offload for the model load.",
-    )
-
-    # ======================== Inference general setting ========================
-    group.add_argument(
-        "--batch_size",
-        type=int,
-        default=1,
-        help="Batch size for inference and evaluation.",
-    )
-    group.add_argument(
-        "--infer_steps",
-        type=int,
-        default=50,
-        help="Number of denoising steps for inference.",
-    )
-    group.add_argument(
-        "--save_path",
-        type=str,
-        default="./results",
-        help="Path to save the generated samples.",
-    )
-    group.add_argument(
-        "--name_suffix",
-        type=str,
-        default="",
-        help="Suffix for the names of saved samples.",
-    )
-    group.add_argument(
-        "--num_videos",
-        type=int,
-        default=1,
-        help="Number of videos to generate for each prompt.",
-    )
-    # ---sample size---
-    group.add_argument(
-        "--num_frames",
-        type=int,
-        default=204,
-        help="How many frames to sample from a video. ",
-    )
-    group.add_argument(
-        "--height",
-        type=int,
-        default=544,
-        help="The height of video sample",
-    )
-    group.add_argument(
-        "--width",
-        type=int,
-        default=992,
-        help="The width of video sample",
-    )
-    # --- prompt ---
-    group.add_argument(
-        "--prompt",
-        type=str,
-        default=None,
-        help="Prompt for sampling during evaluation.",
-    )
-    group.add_argument("--seed", type=int, default=1234, help="Seed for evaluation.")
-
-    # Classifier-Free Guidance
-    group.add_argument(
-        "--pos_magic", type=str, default="超高清、HDR 视频、环境光、杜比全景声、画面稳定、流畅动作、逼真的细节、专业级构图、超现实主义、自然、生动、超细节、清晰。", help="Positive magic prompt for sampling."
-    )
-    group.add_argument(
-        "--neg_magic", type=str, default="画面暗、低分辨率、不良手、文本、缺少手指、多余的手指、裁剪、低质量、颗粒状、签名、水印、用户名、模糊。", help="Negative magic prompt for sampling."
-    )
-    group.add_argument(
-        "--cfg_scale", type=float, default=9.0, help="Classifier free guidance scale."
-    )
-
-
-    return parser
-
-
-def add_parallel_args(parser: argparse.ArgumentParser):
-    group = parser.add_argument_group(title="Parallel args")
-
-    # ======================== Model loads ========================
-    group.add_argument(
-        "--ulysses_degree",
-        type=int,
-        default=8,
-        help="Ulysses degree.",
-    )
-    group.add_argument(
-        "--ring_degree",
-        type=int,
-        default=1,
-        help="Ulysses degree.",
-    )
-
-    group.add_argument(
-        "--tensor_parallel_degree",
-        type=int,
-        default=1,
-        help="Tensor parallel degree.",
-    )
-    return parser
diff --git a/videotuna/models/stepvideo/stepvideo/diffusion/scheduler.py b/videotuna/models/stepvideo/stepvideo/diffusion/scheduler.py
deleted file mode 100644
index 7996a1d3..00000000
--- a/videotuna/models/stepvideo/stepvideo/diffusion/scheduler.py
+++ /dev/null
@@ -1,236 +0,0 @@
-from dataclasses import dataclass
-from typing import Optional, Tuple, Union
-
-import numpy as np
-import torch
-
-from diffusers.configuration_utils import ConfigMixin, register_to_config
-from diffusers.utils import BaseOutput, logging
-from diffusers.schedulers.scheduling_utils import SchedulerMixin
-
-
-logger = logging.get_logger(__name__)  # pylint: disable=invalid-name
-
-
-@dataclass
-class FlowMatchDiscreteSchedulerOutput(BaseOutput):
-    """
-    Output class for the scheduler's `step` function output.
-
-    Args:
-        prev_sample (`torch.FloatTensor` of shape `(batch_size, num_channels, height, width)` for images):
-            Computed sample `(x_{t-1})` of previous timestep. `prev_sample` should be used as next model input in the
-            denoising loop.
-    """
-
-    prev_sample: torch.FloatTensor
-
-
-class FlowMatchDiscreteScheduler(SchedulerMixin, ConfigMixin):
-    """
-    Euler scheduler.
-
-    This model inherits from [`SchedulerMixin`] and [`ConfigMixin`]. Check the superclass documentation for the generic
-    methods the library implements for all schedulers such as loading and saving.
-
-    Args:
-        num_train_timesteps (`int`, defaults to 1000):
-            The number of diffusion steps to train the model.
-        timestep_spacing (`str`, defaults to `"linspace"`):
-            The way the timesteps should be scaled. Refer to Table 2 of the [Common Diffusion Noise Schedules and
-            Sample Steps are Flawed](https://huggingface.co/papers/2305.08891) for more information.
-        reverse (`bool`, defaults to `True`):
-            Whether to reverse the timestep schedule.
-    """
-
-    _compatibles = []
-    order = 1
-
-    @register_to_config
-    def __init__(
-        self,
-        num_train_timesteps: int = 1000,
-        reverse: bool = False,
-        solver: str = "euler",
-        device: Union[str, torch.device] = None,
-    ):
-        sigmas = torch.linspace(1, 0, num_train_timesteps + 1)
-
-        if not reverse:
-            sigmas = sigmas.flip(0)
-
-        self.sigmas = sigmas
-        # the value fed to model
-        self.timesteps = (sigmas[:-1] * num_train_timesteps).to(dtype=torch.float32)
-
-        self._step_index = None
-        self._begin_index = None
-        
-        self.device = device
-
-        self.supported_solver = ["euler"]
-        if solver not in self.supported_solver:
-            raise ValueError(
-                f"Solver {solver} not supported. Supported solvers: {self.supported_solver}"
-            )
-
-    @property
-    def step_index(self):
-        """
-        The index counter for current timestep. It will increase 1 after each scheduler step.
-        """
-        return self._step_index
-
-    @property
-    def begin_index(self):
-        """
-        The index for the first timestep. It should be set from pipeline with `set_begin_index` method.
-        """
-        return self._begin_index
-
-    # Copied from diffusers.schedulers.scheduling_dpmsolver_multistep.DPMSolverMultistepScheduler.set_begin_index
-    def set_begin_index(self, begin_index: int = 0):
-        """
-        Sets the begin index for the scheduler. This function should be run from pipeline before the inference.
-
-        Args:
-            begin_index (`int`):
-                The begin index for the scheduler.
-        """
-        self._begin_index = begin_index
-
-    def _sigma_to_t(self, sigma):
-        return sigma * self.config.num_train_timesteps
-
-    def set_timesteps(
-        self,
-        num_inference_steps: int,
-        time_shift: float = 13.0,
-        device: Union[str, torch.device] = None,
-    ):
-        """
-        Sets the discrete timesteps used for the diffusion chain (to be run before inference).
-
-        Args:
-            num_inference_steps (`int`):
-                The number of diffusion steps used when generating samples with a pre-trained model.
-            device (`str` or `torch.device`, *optional*):
-                The device to which the timesteps should be moved to. If `None`, the timesteps are not moved.
-            n_tokens (`int`, *optional*):
-                Number of tokens in the input sequence.
-        """
-        device = device or self.device
-        self.num_inference_steps = num_inference_steps
-        
-        sigmas = torch.linspace(1, 0, num_inference_steps + 1, device=device)
-        sigmas = self.sd3_time_shift(sigmas, time_shift)
-
-        if not self.config.reverse:
-            sigmas = 1 - sigmas
-
-        self.sigmas = sigmas
-        self.timesteps = sigmas[:-1]
-
-        # Reset step index
-        self._step_index = None
-
-    def index_for_timestep(self, timestep, schedule_timesteps=None):
-        if schedule_timesteps is None:
-            schedule_timesteps = self.timesteps
-
-        indices = (schedule_timesteps == timestep).nonzero()
-
-        # The sigma index that is taken for the **very** first `step`
-        # is always the second index (or the last index if there is only 1)
-        # This way we can ensure we don't accidentally skip a sigma in
-        # case we start in the middle of the denoising schedule (e.g. for image-to-image)
-        pos = 1 if len(indices) > 1 else 0
-
-        return indices[pos].item()
-
-    def _init_step_index(self, timestep):
-        if self.begin_index is None:
-            if isinstance(timestep, torch.Tensor):
-                timestep = timestep.to(self.timesteps.device)
-            self._step_index = self.index_for_timestep(timestep)
-        else:
-            self._step_index = self._begin_index
-
-    def scale_model_input(
-        self, sample: torch.Tensor, timestep: Optional[int] = None
-    ) -> torch.Tensor:
-        return sample
-
-    def sd3_time_shift(self, t: torch.Tensor, time_shift: float = 13.0):
-        return (time_shift * t) / (1 + (time_shift - 1) * t)
-
-    def step(
-        self,
-        model_output: torch.FloatTensor,
-        timestep: Union[float, torch.FloatTensor],
-        sample: torch.FloatTensor,
-        return_dict: bool = False,
-    ) -> Union[FlowMatchDiscreteSchedulerOutput, Tuple]:
-        """
-        Predict the sample from the previous timestep by reversing the SDE. This function propagates the diffusion
-        process from the learned model outputs (most often the predicted noise).
-
-        Args:
-            model_output (`torch.FloatTensor`):
-                The direct output from learned diffusion model.
-            timestep (`float`):
-                The current discrete timestep in the diffusion chain.
-            sample (`torch.FloatTensor`):
-                A current instance of a sample created by the diffusion process.
-            generator (`torch.Generator`, *optional*):
-                A random number generator.
-            n_tokens (`int`, *optional*):
-                Number of tokens in the input sequence.
-            return_dict (`bool`):
-                Whether or not to return a [`~schedulers.scheduling_euler_discrete.EulerDiscreteSchedulerOutput`] or
-                tuple.
-
-        Returns:
-            [`~schedulers.scheduling_euler_discrete.EulerDiscreteSchedulerOutput`] or `tuple`:
-                If return_dict is `True`, [`~schedulers.scheduling_euler_discrete.EulerDiscreteSchedulerOutput`] is
-                returned, otherwise a tuple is returned where the first element is the sample tensor.
-        """
-
-        if (
-            isinstance(timestep, int)
-            or isinstance(timestep, torch.IntTensor)
-            or isinstance(timestep, torch.LongTensor)
-        ):
-            raise ValueError(
-                (
-                    "Passing integer indices (e.g. from `enumerate(timesteps)`) as timesteps to"
-                    " `EulerDiscreteScheduler.step()` is not supported. Make sure to pass"
-                    " one of the `scheduler.timesteps` as a timestep."
-                ),
-            )
-
-        if self.step_index is None:
-            self._init_step_index(timestep)
-
-        # Upcast to avoid precision issues when computing prev_sample
-        sample = sample.to(torch.float32)
-
-        dt = self.sigmas[self.step_index + 1] - self.sigmas[self.step_index]
-
-        if self.config.solver == "euler":
-            prev_sample = sample + model_output.to(torch.float32) * dt
-        else:
-            raise ValueError(
-                f"Solver {self.config.solver} not supported. Supported solvers: {self.supported_solver}"
-            )
-
-        # upon completion increase step index by one
-        self._step_index += 1
-
-        if not return_dict:
-            return prev_sample
-
-        return FlowMatchDiscreteSchedulerOutput(prev_sample=prev_sample)
-
-    def __len__(self):
-        return self.config.num_train_timesteps
diff --git a/videotuna/models/stepvideo/stepvideo/diffusion/video_pipeline.py b/videotuna/models/stepvideo/stepvideo/diffusion/video_pipeline.py
deleted file mode 100755
index b1eadfc6..00000000
--- a/videotuna/models/stepvideo/stepvideo/diffusion/video_pipeline.py
+++ /dev/null
@@ -1,576 +0,0 @@
-# Copyright 2025 StepFun Inc. All Rights Reserved.
-
-from typing import Any, Callable, Dict, List, Optional, Tuple, Union
-from dataclasses import dataclass
-
-import numpy as np
-import pickle
-import torch, copy
-from diffusers.pipelines.pipeline_utils import DiffusionPipeline
-from diffusers.utils import BaseOutput
-import asyncio
-
-from ..modules.model import StepVideoModel
-from .scheduler import FlowMatchDiscreteScheduler
-from ..utils import VideoProcessor, with_empty_init
-import os
-
-from transformers.models.bert.modeling_bert import BertEmbeddings
-from ..modules.model import RMSNorm
-from ..vae.vae import CausalConv, CausalConvAfterNorm, Upsample2D
-
-def cast_to(weight, dtype, device):
-    r = torch.empty_like(weight, dtype=dtype, device=device)
-    r.copy_(weight)
-    return r
-
-
-class AutoWrappedModule(torch.nn.Module):
-    def __init__(self, module: torch.nn.Module, offload_dtype, offload_device, onload_dtype, onload_device, computation_dtype, computation_device):
-        super().__init__()
-        self.module = module.to(dtype=offload_dtype, device=offload_device)
-        self.offload_dtype = offload_dtype
-        self.offload_device = offload_device
-        self.onload_dtype = onload_dtype
-        self.onload_device = onload_device
-        self.computation_dtype = computation_dtype
-        self.computation_device = computation_device
-        self.state = 0
-
-    def offload(self):
-        if self.state == 1 and (self.offload_dtype != self.onload_dtype or self.offload_device != self.onload_device):
-            self.module.to(dtype=self.offload_dtype, device=self.offload_device)
-            self.state = 0
-
-    def onload(self):
-        if self.state == 0 and (self.offload_dtype != self.onload_dtype or self.offload_device != self.onload_device):
-            self.module.to(dtype=self.onload_dtype, device=self.onload_device)
-            self.state = 1
-
-    def forward(self, *args, **kwargs):
-        if self.onload_dtype == self.computation_dtype and self.onload_device == self.computation_device:
-            module = self.module
-        else:
-            module = copy.deepcopy(self.module).to(dtype=self.computation_dtype, device=self.computation_device)
-        return module(*args, **kwargs)
-    
-
-class AutoWrappedLinear(torch.nn.Linear):
-
-    @with_empty_init
-    def __init__(self, module: torch.nn.Linear, offload_dtype, offload_device, onload_dtype, onload_device, computation_dtype, computation_device):
-        super().__init__(in_features=module.in_features, out_features=module.out_features, bias=module.bias is not None, dtype=offload_dtype, device=offload_device)
-        self.weight = module.weight
-        self.bias = module.bias
-        self.offload_dtype = offload_dtype
-        self.offload_device = offload_device
-        self.onload_dtype = onload_dtype
-        self.onload_device = onload_device
-        self.computation_dtype = computation_dtype
-        self.computation_device = computation_device
-        self.state = 0
-
-    def offload(self):
-        if self.state == 1 and (self.offload_dtype != self.onload_dtype or self.offload_device != self.onload_device):
-            self.to(dtype=self.offload_dtype, device=self.offload_device)
-            self.state = 0
-
-    def onload(self):
-        if self.state == 0 and (self.offload_dtype != self.onload_dtype or self.offload_device != self.onload_device):
-            self.to(dtype=self.onload_dtype, device=self.onload_device)
-            self.state = 1
-
-    def forward(self, x, *args, **kwargs):
-        if self.onload_dtype == self.computation_dtype and self.onload_device == self.computation_device:
-            weight, bias = self.weight, self.bias
-        else:
-            weight = cast_to(self.weight, self.computation_dtype, self.computation_device)
-            bias = None if self.bias is None else cast_to(self.bias, self.computation_dtype, self.computation_device)
-        return torch.nn.functional.linear(x, weight, bias)
-
-
-def enable_vram_management_recursively(model: torch.nn.Module, module_map: dict, module_config: dict, max_num_param=None, overflow_module_config: dict = None, total_num_param=0):
-    for name, module in model.named_children():
-        for source_module, target_module in module_map.items():
-            if isinstance(module, source_module):
-                num_param = sum(p.numel() for p in module.parameters())
-                if max_num_param is not None and total_num_param + num_param > max_num_param:
-                    module_config_ = overflow_module_config
-                else:
-                    module_config_ = module_config
-                module_ = target_module(module, **module_config_)
-                setattr(model, name, module_)
-                total_num_param += num_param
-                break
-        else:
-            total_num_param = enable_vram_management_recursively(module, module_map, module_config, max_num_param, overflow_module_config, total_num_param)
-    return total_num_param
-
-
-def enable_vram_management(model: torch.nn.Module, module_map: dict, module_config: dict, max_num_param=None, overflow_module_config: dict = None):
-    enable_vram_management_recursively(model, module_map, module_config, max_num_param, overflow_module_config, total_num_param=0)
-    model.vram_management_enabled = True
-
-
-
-@dataclass
-class StepVideoPipelineOutput(BaseOutput):
-    video: Union[torch.Tensor, np.ndarray]
-    
-    
-class StepVideoPipeline(DiffusionPipeline):
-    r"""
-    Pipeline for text-to-video generation using StepVideo.
-
-    This model inherits from [`DiffusionPipeline`]. Check the superclass documentation for the generic methods
-    implemented for all pipelines (downloading, saving, running on a particular device, etc.).
-
-    Args:
-        transformer ([`StepVideoModel`]):
-            Conditional Transformer to denoise the encoded image latents.
-        scheduler ([`FlowMatchDiscreteScheduler`]):
-            A scheduler to be used in combination with `transformer` to denoise the encoded image latents.
-        vae_url:
-            remote vae server's url.
-        caption_url:
-            remote caption (stepllm and clip) server's url.
-    """
-
-    def __init__(
-        self,
-        transformer: StepVideoModel,
-        scheduler: FlowMatchDiscreteScheduler,
-        vae_dir: str = '',
-        caption_dir: tuple = ('', ''),
-        save_path: str = './results',
-        name_suffix: str = '',
-    ):
-        super().__init__()
-
-        self.register_modules(
-            transformer=transformer,
-            scheduler=scheduler,
-        )
-        
-
-        self.vae_scale_factor_temporal = self.vae.temporal_compression_ratio if getattr(self, "vae", None) else 8
-        self.vae_scale_factor_spatial = self.vae.spatial_compression_ratio if getattr(self, "vae", None) else 16
-        self.video_processor = VideoProcessor(save_path, name_suffix)
-        self.model_names = ['vae', 'text_encoder', 'clip', 'transformer']
-
-        self.torch_dtype = torch.bfloat16
-        self.device_type = 'cuda'
-
-
-        # self.vae_dir = vae_dir
-        # self.llm_dir, self.clip_dir = caption_dir
-        # self.setup_dir(self.vae_dir, self.llm_dir, self.clip_dir)
-        
-    def setup_dir(self, vae_dir, llm_dir, clip_dir, version=2):
-        self.vae_dir = vae_dir
-        self.llm_dir = llm_dir
-        self.clip_dir = clip_dir
-        self.text_encoder = self.build_llm(llm_dir)
-        self.clip = self.build_clip(clip_dir)
-        self.vae = self.build_vae(vae_dir, version)
-        self.scale_factor = 1.0
-        return self
-    
-    def enable_vram_management(self, num_persistent_param_in_dit=None):
-        dtype = next(iter(self.clip.parameters())).dtype
-        enable_vram_management(
-            self.clip,
-            module_map = {
-                torch.nn.Linear: AutoWrappedLinear,
-                BertEmbeddings: AutoWrappedModule,
-                torch.nn.LayerNorm: AutoWrappedModule,
-            },
-            module_config = dict(
-                offload_dtype=dtype,
-                offload_device="cpu",
-                onload_dtype=dtype,
-                onload_device="cpu",
-                computation_dtype=self.torch_dtype,
-                computation_device=self.device_type,
-            ),
-        )
-        dtype = next(iter(self.text_encoder.parameters())).dtype
-        enable_vram_management(
-            self.text_encoder,
-            module_map = {
-                torch.nn.Linear: AutoWrappedLinear,
-                RMSNorm: AutoWrappedModule,
-                torch.nn.Embedding: AutoWrappedModule,
-            },
-            module_config = dict(
-                offload_dtype=dtype,
-                offload_device="cpu",
-                onload_dtype=dtype,
-                onload_device="cpu",
-                computation_dtype=self.torch_dtype,
-                computation_device=self.device_type,
-            ),
-        )
-        dtype = next(iter(self.transformer.parameters())).dtype
-        enable_vram_management(
-            self.transformer,
-            module_map = {
-                torch.nn.Linear: AutoWrappedLinear,
-                torch.nn.Conv2d: AutoWrappedModule,
-                torch.nn.LayerNorm: AutoWrappedModule,
-                RMSNorm: AutoWrappedModule,
-            },
-            module_config = dict(
-                offload_dtype=dtype,
-                offload_device="cpu",
-                onload_dtype=dtype,
-                onload_device=self.device_type,
-                computation_dtype=self.torch_dtype,
-                computation_device=self.device_type,
-            ),
-            max_num_param=num_persistent_param_in_dit,
-            overflow_module_config = dict(
-                offload_dtype=dtype,
-                offload_device="cpu",
-                onload_dtype=dtype,
-                onload_device="cpu",
-                computation_dtype=self.torch_dtype,
-                computation_device=self.device_type,
-            ),
-        )
-        dtype = next(iter(self.vae.parameters())).dtype
-        enable_vram_management(
-            self.vae,
-            module_map = {
-                torch.nn.Linear: AutoWrappedLinear,
-                torch.nn.Conv3d: AutoWrappedModule,
-                CausalConv: AutoWrappedModule,
-                CausalConvAfterNorm: AutoWrappedModule,
-                Upsample2D: AutoWrappedModule
-            },
-            module_config = dict(
-                offload_dtype=dtype,
-                offload_device="cpu",
-                onload_dtype=dtype,
-                onload_device="cpu",
-                computation_dtype=self.torch_dtype,
-                computation_device=self.device_type,
-            ),
-        )
-        self.enable_cpu_offload()
-
-    def enable_cpu_offload(self):
-        self.cpu_offload = True
-
-
-    def load_models_to_device(self, loadmodel_names=[]):
-        # only load models to device if cpu_offload is enabled
-        if not self.cpu_offload:
-            return
-        # offload the unneeded models to cpu
-        for model_name in self.model_names:
-            if model_name not in loadmodel_names:
-                model = getattr(self, model_name)
-                if model is not None:
-                    if hasattr(model, "vram_management_enabled") and model.vram_management_enabled:
-                        for module in model.modules():
-                            if hasattr(module, "offload"):
-                                module.offload()
-                    else:
-                        model.cpu()
-        # load the needed models to device
-        for model_name in loadmodel_names:
-            model = getattr(self, model_name)
-            if model is not None:
-                if hasattr(model, "vram_management_enabled") and model.vram_management_enabled:
-                    for module in model.modules():
-                        if hasattr(module, "onload"):
-                            module.onload()
-                else:
-                    model.to(self.device)
-        # fresh the cuda cache
-        torch.cuda.empty_cache()
-    
-    def build_llm(self, model_dir):
-        from stepvideo.text_encoder.stepllm import STEP1TextEncoder
-        text_encoder = STEP1TextEncoder(model_dir, max_length=320).eval()
-        print("Inintialized text encoder...")
-        return text_encoder
-        
-    def build_clip(self, model_dir):
-        from stepvideo.text_encoder.clip import HunyuanClip
-        clip = HunyuanClip(model_dir, max_length=77).eval()
-        print("Inintialized clip encoder...")
-        return clip
-
-    def build_vae(self, vae_dir, version=2):
-        from stepvideo.vae.vae import AutoencoderKL
-        (model_name, z_channels) = ("vae_v2.safetensors", 64) if version == 2 else ("vae.safetensors", 16)
-        model_path = os.path.join(vae_dir, model_name) 
-        
-        model = AutoencoderKL(
-            z_channels=z_channels,
-            model_path=model_path,
-            version=version,
-        ).eval()
-        print("Inintialized vae...")
-        return model
-    
-    def encode_prompt(
-        self,
-        prompt: str,
-        neg_magic: str = '',
-        pos_magic: str = '',
-    ):
-        device = self._execution_device
-        prompts = [prompt+pos_magic]
-        bs = len(prompts)
-        prompts += [neg_magic]*bs
-        
-        data = self.embedding(prompts)
-        prompt_embeds, prompt_attention_mask, clip_embedding = data['y'].to(device), data['y_mask'].to(device), data['clip_embedding'].to(device)
-
-        return prompt_embeds, clip_embedding, prompt_attention_mask
-
-    def embedding(self, prompts, *args, **kwargs):
-        with torch.no_grad():
-            try:
-                y, y_mask = self.text_encoder(prompts)
-                    
-                clip_embedding, _ = self.clip(prompts)
-                
-                len_clip = clip_embedding.shape[1]
-                y_mask = torch.nn.functional.pad(y_mask, (len_clip, 0), value=1)   ## pad attention_mask with clip's length 
-
-                data = {
-                    'y': y.detach().cpu(),
-                    'y_mask': y_mask.detach().cpu(),
-                    'clip_embedding': clip_embedding.to(torch.bfloat16).detach().cpu()
-                }
-
-                return data
-            except Exception as err:
-                print(f"{err}")
-                return None
-
-    def decode_vae(self, samples):
-        samples = self.decode(samples)
-        return samples
-    
-    def decode(self, samples, *args, **kwargs):
-        with torch.no_grad():
-            try:
-                dtype = self.dtype
-                device = self.device_type
-                samples = self.vae.decode(samples.to(dtype).to(device) / self.scale_factor)
-                if hasattr(samples,'sample'):
-                    samples = samples.sample
-                return samples
-            except:
-                torch.cuda.empty_cache()
-                return None
-
-    def check_inputs(self, num_frames, width, height):
-        num_frames = max(num_frames//17*17, 1)
-        width = max(width//16*16, 16)
-        height = max(height//16*16, 16)
-        return num_frames, width, height
-
-    def prepare_latents(
-        self,
-        batch_size: int,
-        num_channels_latents: 64,
-        height: int = 544,
-        width: int = 992,
-        num_frames: int = 204,
-        dtype: Optional[torch.dtype] = None,
-        device: Optional[torch.device] = None,
-        generator: Optional[Union[torch.Generator, List[torch.Generator]]] = None,
-        latents: Optional[torch.Tensor] = None,
-    ) -> torch.Tensor:
-        if latents is not None:
-            return latents.to(device=device, dtype=dtype)
-
-        num_frames, width, height = self.check_inputs(num_frames, width, height)
-        shape = (
-            batch_size,
-            max(num_frames//17*3, 1),
-            num_channels_latents,
-            int(height) // self.vae_scale_factor_spatial,
-            int(width) // self.vae_scale_factor_spatial,
-        )   # b,f,c,h,w
-        if isinstance(generator, list) and len(generator) != batch_size:
-            raise ValueError(
-                f"You have passed a list of generators of length {len(generator)}, but requested an effective batch"
-                f" size of {batch_size}. Make sure the batch size matches the length of the generators."
-            )
-
-        if generator is None:
-            generator = torch.Generator(device=self._execution_device)
-
-        latents = torch.randn(shape, generator=generator, device=device, dtype=dtype)
-        return latents
-
-
-    @torch.inference_mode()
-    def __call__(
-        self,
-        prompt: Union[str, List[str]] = None,
-        height: int = 544,
-        width: int = 992,
-        num_frames: int = 204,
-        num_inference_steps: int = 50,
-        guidance_scale: float = 9.0,
-        time_shift: float = 13.0,
-        neg_magic: str = "",
-        pos_magic: str = "",
-        num_videos_per_prompt: Optional[int] = 1,
-        generator: Optional[Union[torch.Generator, List[torch.Generator]]] = None,
-        latents: Optional[torch.Tensor] = None,
-        output_type: Optional[str] = "mp4",
-        output_file_name: Optional[str] = "",
-        return_dict: bool = True,
-    ):
-        r"""
-        The call function to the pipeline for generation.
-
-        Args:
-            prompt (`str` or `List[str]`, *optional*):
-                The prompt or prompts to guide the image generation. If not defined, one has to pass `prompt_embeds`.
-                instead.
-            height (`int`, defaults to `544`):
-                The height in pixels of the generated image.
-            width (`int`, defaults to `992`):
-                The width in pixels of the generated image.
-            num_frames (`int`, defaults to `204`):
-                The number of frames in the generated video.
-            num_inference_steps (`int`, defaults to `50`):
-                The number of denoising steps. More denoising steps usually lead to a higher quality image at the
-                expense of slower inference.
-            guidance_scale (`float`, defaults to `9.0`):
-                Guidance scale as defined in [Classifier-Free Diffusion Guidance](https://arxiv.org/abs/2207.12598).
-                `guidance_scale` is defined as `w` of equation 2. of [Imagen
-                Paper](https://arxiv.org/pdf/2205.11487.pdf). Guidance scale is enabled by setting `guidance_scale >
-                1`. Higher guidance scale encourages to generate images that are closely linked to the text `prompt`,
-                usually at the expense of lower image quality. 
-            num_videos_per_prompt (`int`, *optional*, defaults to 1):
-                The number of images to generate per prompt.
-            generator (`torch.Generator` or `List[torch.Generator]`, *optional*):
-                A [`torch.Generator`](https://pytorch.org/docs/stable/generated/torch.Generator.html) to make
-                generation deterministic.
-            latents (`torch.Tensor`, *optional*):
-                Pre-generated noisy latents sampled from a Gaussian distribution, to be used as inputs for image
-                generation. Can be used to tweak the same generation with different prompts. If not provided, a latents
-                tensor is generated by sampling using the supplied random `generator`.
-            output_type (`str`, *optional*, defaults to `"pil"`):
-                The output format of the generated image. Choose between `PIL.Image` or `np.array`.
-            output_file_name(`str`, *optional*`):
-                The output mp4 file name.
-            return_dict (`bool`, *optional*, defaults to `True`):
-                Whether or not to return a [`StepVideoPipelineOutput`] instead of a plain tuple.
-
-        Examples:
-
-        Returns:
-            [`~StepVideoPipelineOutput`] or `tuple`:
-                If `return_dict` is `True`, [`StepVideoPipelineOutput`] is returned, otherwise a `tuple` is returned
-                where the first element is a list with the generated images and the second element is a list of `bool`s
-                indicating whether the corresponding generated image contains "not-safe-for-work" (nsfw) content.
-        """
-
-        # 1. Check inputs. Raise error if not correct
-        device = self._execution_device
-
-        # 2. Define call parameters
-        if prompt is not None and isinstance(prompt, str):
-            batch_size = 1
-        elif prompt is not None and isinstance(prompt, list):
-            batch_size = len(prompt)
-        else:
-            batch_size = prompt_embeds.shape[0]
-
-        do_classifier_free_guidance = guidance_scale > 1.0
-
-        # 3. Encode input prompt
-        self.load_models_to_device(['text_encoder', 'clip'])
-        prompt_embeds, prompt_embeds_2, prompt_attention_mask = self.encode_prompt(
-            prompt=prompt,
-            neg_magic=neg_magic,
-            pos_magic=pos_magic,
-        )
-
-        transformer_dtype = self.transformer.dtype
-        prompt_embeds = prompt_embeds.to(transformer_dtype).to(self.device_type)
-        prompt_attention_mask = prompt_attention_mask.to(transformer_dtype).to(self.device_type)
-        prompt_embeds_2 = prompt_embeds_2.to(transformer_dtype).to(self.device_type)
-
-        # 4. Prepare timesteps
-        self.scheduler.set_timesteps(
-            num_inference_steps=num_inference_steps,
-            time_shift=time_shift,
-            device=device
-        )
-
-        # 5. Prepare latent variables
-        num_channels_latents = self.transformer.config.in_channels
-        latents = self.prepare_latents(
-            batch_size * num_videos_per_prompt,
-            num_channels_latents,
-            height,
-            width,
-            num_frames,
-            torch.bfloat16,
-            device,
-            generator,
-            latents,
-        ).to(self.device_type)
-
-        # 7. Denoising loop
-        self.load_models_to_device(['transformer'])
-        with self.progress_bar(total=num_inference_steps) as progress_bar:
-            for i, t in enumerate(self.scheduler.timesteps):
-                latent_model_input = torch.cat([latents] * 2) if do_classifier_free_guidance else latents
-                latent_model_input = latent_model_input.to(transformer_dtype)
-                # broadcast to batch dimension in a way that's compatible with ONNX/Core ML
-                timestep = t.expand(latent_model_input.shape[0]).to(latent_model_input.dtype).to(self.device_type)
-
-                noise_pred = self.transformer(
-                    hidden_states=latent_model_input,
-                    timestep=timestep,
-                    encoder_hidden_states=prompt_embeds,
-                    encoder_attention_mask=prompt_attention_mask,
-                    encoder_hidden_states_2=prompt_embeds_2,
-                    return_dict=False,
-                )
-                # perform guidance
-                if do_classifier_free_guidance:
-                    noise_pred_text, noise_pred_uncond = noise_pred.chunk(2)
-                    noise_pred = noise_pred_uncond + guidance_scale * (noise_pred_text - noise_pred_uncond)
-
-                # compute the previous noisy sample x_t -> x_t-1
-                latents = self.scheduler.step(
-                    model_output=noise_pred,
-                    timestep=t,
-                    sample=latents
-                )
-                
-                progress_bar.update()
-
-        if not torch.distributed.is_initialized() or int(torch.distributed.get_rank())==0:
-            if not output_type == "latent":
-                self.load_models_to_device(['vae'])
-                video = self.decode_vae(latents)
-                video = self.video_processor.postprocess_video(video, output_file_name=output_file_name, output_type=output_type)
-            else:
-                video = latents
-
-            # Offload all models
-            self.maybe_free_model_hooks()
-
-            if not return_dict:
-                return (video, )
-
-            return StepVideoPipelineOutput(video=video)
-        
-
-        
\ No newline at end of file
diff --git a/videotuna/models/stepvideo/stepvideo/modules/attentions.py b/videotuna/models/stepvideo/stepvideo/modules/attentions.py
deleted file mode 100755
index c6ef95cf..00000000
--- a/videotuna/models/stepvideo/stepvideo/modules/attentions.py
+++ /dev/null
@@ -1,62 +0,0 @@
-import torch
-import torch.nn as nn
-from einops import rearrange
-
-try:
-    from xfuser.core.long_ctx_attention import xFuserLongContextAttention
-except ImportError:
-    xFuserLongContextAttention = None
-    
-    
-class Attention(nn.Module):
-    def __init__(self):
-        super().__init__()
-    
-    def attn_processor(self, attn_type):
-        if attn_type == 'torch':
-            return self.torch_attn_func
-        elif attn_type == 'parallel':
-            return self.parallel_attn_func
-        else:
-            raise Exception('Not supported attention type...')
-
-    def torch_attn_func(
-        self,
-        q,
-        k,
-        v,
-        attn_mask=None,
-        causal=False,
-        drop_rate=0.0,
-        **kwargs
-    ):
-
-        if attn_mask is not None and attn_mask.dtype != torch.bool:
-            attn_mask = attn_mask.to(q.dtype)
-            
-        if attn_mask is not None and attn_mask.ndim == 3:   ## no head
-            n_heads = q.shape[2]
-            attn_mask = attn_mask.unsqueeze(1).repeat(1, n_heads, 1, 1)
-        
-        q, k, v = map(lambda x: rearrange(x, 'b s h d -> b h s d'), (q, k, v))
-        x = torch.nn.functional.scaled_dot_product_attention(
-            q, k, v, attn_mask=attn_mask, dropout_p=drop_rate, is_causal=causal
-        )
-        x = rearrange(x, 'b h s d -> b s h d')
-        return x        
-
-    def parallel_attn_func(
-        self,
-        q,
-        k,
-        v,
-        causal=False,
-        **kwargs
-    ):
-        assert xFuserLongContextAttention is not None; 'to use sequence parallel attention, xFuserLongContextAttention should be imported...'
-        hybrid_seq_parallel_attn = xFuserLongContextAttention()
-        x = hybrid_seq_parallel_attn(
-            None, q,k,v, causal=causal
-        )
-        return x
-
diff --git a/videotuna/models/stepvideo/stepvideo/modules/blocks.py b/videotuna/models/stepvideo/stepvideo/modules/blocks.py
deleted file mode 100755
index 6493c176..00000000
--- a/videotuna/models/stepvideo/stepvideo/modules/blocks.py
+++ /dev/null
@@ -1,314 +0,0 @@
-# Copyright 2025 StepFun Inc. All Rights Reserved.
-# 
-# Permission is hereby granted, free of charge, to any person obtaining a copy
-# of this software and associated documentation files (the "Software"), to deal
-# in the Software without restriction, including without limitation the rights
-# to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
-# copies of the Software, and to permit persons to whom the Software is
-# furnished to do so, subject to the following conditions:
-#
-# The above copyright notice and this permission notice shall be included in all
-# copies or substantial portions of the Software.
-# ==============================================================================
-import torch
-import torch.nn as nn
-from typing import Optional
-from einops import rearrange
-from .rope import RoPE3D
-from .attentions import Attention
-from .normalization import RMSNorm
-
-
-class SelfAttention(Attention):
-    def __init__(self, hidden_dim, head_dim, bias=False, with_rope=True, with_qk_norm=True, attn_type='torch'):
-        super().__init__()
-        self.head_dim = head_dim
-        self.n_heads = hidden_dim // head_dim
-        
-        self.wqkv = nn.Linear(hidden_dim, hidden_dim*3, bias=bias)
-        self.wo = nn.Linear(hidden_dim, hidden_dim, bias=bias)
-        
-        self.with_rope = with_rope
-        self.with_qk_norm = with_qk_norm
-        if self.with_qk_norm:
-            self.q_norm = RMSNorm(head_dim, elementwise_affine=True)
-            self.k_norm = RMSNorm(head_dim, elementwise_affine=True)
-        
-        if self.with_rope:
-            self.rope_3d = RoPE3D(freq=1e4, F0=1.0, scaling_factor=1.0)
-            self.rope_ch_split = [64, 32, 32]
-        
-        self.core_attention = self.attn_processor(attn_type=attn_type)
-        self.parallel = attn_type=='parallel'
-        
-    def apply_rope3d(self, x, fhw_positions, rope_ch_split, parallel=True):
-        x = self.rope_3d(x, fhw_positions, rope_ch_split, parallel)
-        return x
-        
-    def forward(
-        self, 
-        x,
-        cu_seqlens=None,
-        max_seqlen=None,
-        rope_positions=None,
-        attn_mask=None
-    ):
-        xqkv = self.wqkv(x) 
-        xqkv = xqkv.view(*x.shape[:-1], self.n_heads, 3*self.head_dim)
-
-        xq, xk, xv = torch.split(xqkv, [self.head_dim]*3, dim=-1)  ## seq_len, n, dim
-    
-        if self.with_qk_norm:
-            xq = self.q_norm(xq)
-            xk = self.k_norm(xk)
-    
-        if self.with_rope:
-            xq = self.apply_rope3d(xq, rope_positions, self.rope_ch_split, parallel=self.parallel)
-            xk = self.apply_rope3d(xk, rope_positions, self.rope_ch_split, parallel=self.parallel)
-            
-        output = self.core_attention(
-                    xq,
-                    xk,
-                    xv,
-                    cu_seqlens=cu_seqlens,
-                    max_seqlen=max_seqlen,
-                    attn_mask=attn_mask
-                )
-        output = rearrange(output, 'b s h d -> b s (h d)')
-        output = self.wo(output)
-        
-        return output
-    
-    
-class CrossAttention(Attention):
-    def __init__(self, hidden_dim, head_dim, bias=False, with_qk_norm=True, attn_type='torch'):
-        super().__init__()
-        self.head_dim = head_dim
-        self.n_heads = hidden_dim // head_dim
-        
-        self.wq = nn.Linear(hidden_dim, hidden_dim, bias=bias)
-        self.wkv = nn.Linear(hidden_dim, hidden_dim*2, bias=bias)
-        self.wo = nn.Linear(hidden_dim, hidden_dim, bias=bias)
-        
-        self.with_qk_norm = with_qk_norm
-        if self.with_qk_norm:
-            self.q_norm = RMSNorm(head_dim, elementwise_affine=True)
-            self.k_norm = RMSNorm(head_dim, elementwise_affine=True)
-        
-        self.core_attention = self.attn_processor(attn_type=attn_type)
-
-    def forward(
-            self, 
-            x: torch.Tensor,
-            encoder_hidden_states: torch.Tensor,
-            attn_mask=None
-        ):
-        xq = self.wq(x) 
-        xq = xq.view(*xq.shape[:-1], self.n_heads, self.head_dim)
-        
-        xkv = self.wkv(encoder_hidden_states)
-        xkv = xkv.view(*xkv.shape[:-1], self.n_heads, 2*self.head_dim)
-
-        xk, xv = torch.split(xkv, [self.head_dim]*2, dim=-1)  ## seq_len, n, dim
-    
-        if self.with_qk_norm:
-            xq = self.q_norm(xq)
-            xk = self.k_norm(xk)
-
-        output = self.core_attention(
-                    xq,
-                    xk,
-                    xv,
-                    attn_mask=attn_mask
-                )
-        
-        output = rearrange(output, 'b s h d -> b s (h d)')
-        output = self.wo(output)
-        
-        return output
-
-    
-class GELU(nn.Module):
-    r"""
-    GELU activation function with tanh approximation support with `approximate="tanh"`.
-
-    Parameters:
-        dim_in (`int`): The number of channels in the input.
-        dim_out (`int`): The number of channels in the output.
-        approximate (`str`, *optional*, defaults to `"none"`): If `"tanh"`, use tanh approximation.
-        bias (`bool`, defaults to True): Whether to use a bias in the linear layer.
-    """
-
-    def __init__(self, dim_in: int, dim_out: int, approximate: str = "none", bias: bool = True):
-        super().__init__()
-        self.proj = nn.Linear(dim_in, dim_out, bias=bias)
-        self.approximate = approximate
-
-    def gelu(self, gate: torch.Tensor) -> torch.Tensor:
-        return torch.nn.functional.gelu(gate, approximate=self.approximate)
-
-    def forward(self, hidden_states):
-        hidden_states = self.proj(hidden_states)
-        hidden_states = self.gelu(hidden_states)
-        return hidden_states
-    
-    
-class FeedForward(nn.Module):
-    def __init__(
-        self, 
-        dim: int,
-        inner_dim: Optional[int] = None,
-        dim_out: Optional[int] = None,
-        mult: int = 4,
-        bias: bool = False,
-    ):
-        super().__init__()
-        inner_dim = dim*mult if inner_dim is None else inner_dim
-        dim_out = dim if dim_out is None else dim_out
-        self.net = nn.ModuleList([
-            GELU(dim, inner_dim, approximate="tanh", bias=bias),
-            nn.Identity(),
-            nn.Linear(inner_dim, dim_out, bias=bias)
-        ])
-        
-        
-    def forward(self, hidden_states: torch.Tensor, *args, **kwargs) -> torch.Tensor:
-        for module in self.net:
-            hidden_states = module(hidden_states)
-        return hidden_states
-    
-
-def modulate(x, scale, shift):
-    x = x * (1 + scale) + shift
-    return x
-
-def gate(x, gate):
-    x = gate * x
-    return x
-
-
-class StepVideoTransformerBlock(nn.Module):
-    r"""
-    A basic Transformer block.
-
-    Parameters:
-        dim (`int`): The number of channels in the input and output.
-        num_attention_heads (`int`): The number of heads to use for multi-head attention.
-        attention_head_dim (`int`): The number of channels in each head.
-        dropout (`float`, *optional*, defaults to 0.0): The dropout probability to use.
-        cross_attention_dim (`int`, *optional*): The size of the encoder_hidden_states vector for cross attention.
-        activation_fn (`str`, *optional*, defaults to `"geglu"`): Activation function to be used in feed-forward.
-        num_embeds_ada_norm (:
-            obj: `int`, *optional*): The number of diffusion steps used during training. See `Transformer2DModel`.
-        attention_bias (:
-            obj: `bool`, *optional*, defaults to `False`): Configure if the attentions should contain a bias parameter.
-        only_cross_attention (`bool`, *optional*):
-            Whether to use only cross-attention layers. In this case two cross attention layers are used.
-        double_self_attention (`bool`, *optional*):
-            Whether to use two self-attention layers. In this case no cross attention layers are used.
-        upcast_attention (`bool`, *optional*):
-            Whether to upcast the attention computation to float32. This is useful for mixed precision training.
-        norm_elementwise_affine (`bool`, *optional*, defaults to `True`):
-            Whether to use learnable elementwise affine parameters for normalization.
-        norm_type (`str`, *optional*, defaults to `"layer_norm"`):
-            The normalization layer to use. Can be `"layer_norm"`, `"ada_norm"` or `"ada_norm_zero"`.
-        final_dropout (`bool` *optional*, defaults to False):
-            Whether to apply a final dropout after the last feed-forward layer.
-        attention_type (`str`, *optional*, defaults to `"default"`):
-            The type of attention to use. Can be `"default"` or `"gated"` or `"gated-text-image"`.
-        positional_embeddings (`str`, *optional*, defaults to `None`):
-            The type of positional embeddings to apply to.
-        num_positional_embeddings (`int`, *optional*, defaults to `None`):
-            The maximum number of positional embeddings to apply.
-    """
-
-    def __init__(
-        self,
-        dim: int,
-        attention_head_dim: int,
-        norm_eps: float = 1e-5,
-        ff_inner_dim: Optional[int] = None,
-        ff_bias: bool = False,
-        attention_type: str = 'parallel'
-    ):
-        super().__init__()
-        self.dim = dim
-        self.norm1 = nn.LayerNorm(dim, eps=norm_eps)
-        self.attn1 = SelfAttention(dim, attention_head_dim, bias=False, with_rope=True, with_qk_norm=True, attn_type=attention_type)
-        
-        self.norm2 = nn.LayerNorm(dim, eps=norm_eps)
-        self.attn2 = CrossAttention(dim, attention_head_dim, bias=False, with_qk_norm=True, attn_type='torch')
-
-        self.ff = FeedForward(dim=dim, inner_dim=ff_inner_dim, dim_out=dim, bias=ff_bias)
-
-        self.scale_shift_table = nn.Parameter(torch.randn(6, dim) /dim**0.5)
-
-    @torch.no_grad()
-    def forward(
-        self,
-        q: torch.Tensor,
-        kv: Optional[torch.Tensor] = None,
-        timestep: Optional[torch.LongTensor] =  None,
-        attn_mask = None,
-        rope_positions: list = None, 
-    ) -> torch.Tensor:
-        shift_msa, scale_msa, gate_msa, shift_mlp, scale_mlp, gate_mlp = (
-            torch.clone(chunk) for chunk in (self.scale_shift_table[None] + timestep.reshape(-1, 6, self.dim)).chunk(6, dim=1)
-        )
-        
-        scale_shift_q = modulate(self.norm1(q), scale_msa, shift_msa)
-
-        attn_q = self.attn1(
-            scale_shift_q,
-            rope_positions=rope_positions
-        )
-
-        q = gate(attn_q, gate_msa) + q
-        
-        attn_q = self.attn2(
-                q,
-                kv,
-                attn_mask
-            )
-
-        q = attn_q + q
-
-        scale_shift_q = modulate(self.norm2(q), scale_mlp, shift_mlp)
-
-        ff_output = self.ff(scale_shift_q)
-        
-        q = gate(ff_output, gate_mlp) + q
-        
-        return q
-    
-    
-class PatchEmbed(nn.Module):
-    """2D Image to Patch Embedding"""
-
-    def __init__(
-        self,
-        patch_size=64,
-        in_channels=3,
-        embed_dim=768,
-        layer_norm=False,
-        flatten=True,
-        bias=True,
-    ):
-        super().__init__()
-
-        self.flatten = flatten
-        self.layer_norm = layer_norm
-
-        self.proj = nn.Conv2d(
-            in_channels, embed_dim, kernel_size=(patch_size, patch_size), stride=patch_size, bias=bias
-        )
-
-    def forward(self, latent):
-        latent = self.proj(latent).to(latent.dtype)   
-        if self.flatten:
-            latent = latent.flatten(2).transpose(1, 2)  # BCHW -> BNC
-        if self.layer_norm:
-            latent = self.norm(latent)
-
-        return latent
-    
\ No newline at end of file
diff --git a/videotuna/models/stepvideo/stepvideo/modules/model.py b/videotuna/models/stepvideo/stepvideo/modules/model.py
deleted file mode 100755
index bcc0333f..00000000
--- a/videotuna/models/stepvideo/stepvideo/modules/model.py
+++ /dev/null
@@ -1,920 +0,0 @@
-# Copyright 2025 StepFun Inc. All Rights Reserved.
-# 
-# Permission is hereby granted, free of charge, to any person obtaining a copy
-# of this software and associated documentation files (the "Software"), to deal
-# in the Software without restriction, including without limitation the rights
-# to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
-# copies of the Software, and to permit persons to whom the Software is
-# furnished to do so, subject to the following conditions:
-#
-# The above copyright notice and this permission notice shall be included in all
-# copies or substantial portions of the Software.
-# ==============================================================================
-from typing import Any, Dict, Optional, Union, Tuple
-import torch, math
-from torch import nn
-import os
-from einops import rearrange, repeat
-from tqdm import tqdm
-
-from ..utils import with_empty_init
-
-from diffusers.configuration_utils import ConfigMixin, register_to_config
-from diffusers.models.modeling_utils import ModelMixin
-
-class RMSNorm(nn.Module):
-    def __init__(
-        self,
-        dim: int,
-        elementwise_affine=True,
-        eps: float = 1e-6,
-        device=None,
-        dtype=None,
-    ):
-        """
-        Initialize the RMSNorm normalization layer.
-
-        Args:
-            dim (int): The dimension of the input tensor.
-            eps (float, optional): A small value added to the denominator for numerical stability. Default is 1e-6.
-
-        Attributes:
-            eps (float): A small value added to the denominator for numerical stability.
-            weight (nn.Parameter): Learnable scaling parameter.
-
-        """
-        factory_kwargs = {"device": device, "dtype": dtype}
-        super().__init__()
-        self.eps = eps
-        if elementwise_affine:
-            self.weight = nn.Parameter(torch.ones(dim, **factory_kwargs))
-
-    def _norm(self, x):
-        """
-        Apply the RMSNorm normalization to the input tensor.
-
-        Args:
-            x (torch.Tensor): The input tensor.
-
-        Returns:
-            torch.Tensor: The normalized tensor.
-
-        """
-        return x * torch.rsqrt(x.pow(2).mean(-1, keepdim=True) + self.eps)
-
-    def forward(self, x):
-        """
-        Forward pass through the RMSNorm layer.
-
-        Args:
-            x (torch.Tensor): The input tensor.
-
-        Returns:
-            torch.Tensor: The output tensor after applying RMSNorm.
-
-        """
-        output = self._norm(x.float()).type_as(x)
-        if hasattr(self, "weight"):
-            output = output * self.weight
-        return output
-
-ACTIVATION_FUNCTIONS = {
-    "swish": nn.SiLU(),
-    "silu": nn.SiLU(),
-    "mish": nn.Mish(),
-    "gelu": nn.GELU(),
-    "relu": nn.ReLU(),
-}
-
-
-def get_activation(act_fn: str) -> nn.Module:
-    """Helper function to get activation function from string.
-
-    Args:
-        act_fn (str): Name of activation function.
-
-    Returns:
-        nn.Module: Activation function.
-    """
-
-    act_fn = act_fn.lower()
-    if act_fn in ACTIVATION_FUNCTIONS:
-        return ACTIVATION_FUNCTIONS[act_fn]
-    else:
-        raise ValueError(f"Unsupported activation function: {act_fn}")
-
-
-def get_timestep_embedding(
-    timesteps: torch.Tensor,
-    embedding_dim: int,
-    flip_sin_to_cos: bool = False,
-    downscale_freq_shift: float = 1,
-    scale: float = 1,
-    max_period: int = 10000,
-):
-    """
-    This matches the implementation in Denoising Diffusion Probabilistic Models: Create sinusoidal timestep embeddings.
-
-    :param timesteps: a 1-D Tensor of N indices, one per batch element.
-                      These may be fractional.
-    :param embedding_dim: the dimension of the output. :param max_period: controls the minimum frequency of the
-    embeddings. :return: an [N x dim] Tensor of positional embeddings.
-    """
-    assert len(timesteps.shape) == 1, "Timesteps should be a 1d-array"
-
-    half_dim = embedding_dim // 2
-    exponent = -math.log(max_period) * torch.arange(
-        start=0, end=half_dim, dtype=torch.float32, device=timesteps.device
-    )
-    exponent = exponent / (half_dim - downscale_freq_shift)
-
-    emb = torch.exp(exponent)
-    emb = timesteps[:, None].float() * emb[None, :]
-
-    # scale embeddings
-    emb = scale * emb
-
-    # concat sine and cosine embeddings
-    emb = torch.cat([torch.sin(emb), torch.cos(emb)], dim=-1)
-
-    # flip sine and cosine embeddings
-    if flip_sin_to_cos:
-        emb = torch.cat([emb[:, half_dim:], emb[:, :half_dim]], dim=-1)
-
-    # zero pad
-    if embedding_dim % 2 == 1:
-        emb = torch.nn.functional.pad(emb, (0, 1, 0, 0))
-    return emb
-
-
-class Timesteps(nn.Module):
-    def __init__(self, num_channels: int, flip_sin_to_cos: bool, downscale_freq_shift: float):
-        super().__init__()
-        self.num_channels = num_channels
-        self.flip_sin_to_cos = flip_sin_to_cos
-        self.downscale_freq_shift = downscale_freq_shift
-
-    def forward(self, timesteps):
-        t_emb = get_timestep_embedding(
-            timesteps,
-            self.num_channels,
-            flip_sin_to_cos=self.flip_sin_to_cos,
-            downscale_freq_shift=self.downscale_freq_shift,
-        )
-        return t_emb
-
-
-class TimestepEmbedding(nn.Module):
-    def __init__(
-        self,
-        in_channels: int,
-        time_embed_dim: int,
-        act_fn: str = "silu",
-        out_dim: int = None,
-        post_act_fn: Optional[str] = None,
-        cond_proj_dim=None,
-        sample_proj_bias=True
-    ):
-        super().__init__()
-        linear_cls = nn.Linear
-
-        self.linear_1 = linear_cls(
-                in_channels, 
-                time_embed_dim, 
-                bias=sample_proj_bias,
-            )
-
-        if cond_proj_dim is not None:
-            self.cond_proj = linear_cls(
-                    cond_proj_dim, 
-                    in_channels, 
-                    bias=False,
-                )
-        else:
-            self.cond_proj = None
-
-        self.act = get_activation(act_fn)
-
-        if out_dim is not None:
-            time_embed_dim_out = out_dim
-        else:
-            time_embed_dim_out = time_embed_dim
-            
-        self.linear_2 = linear_cls(
-                time_embed_dim, 
-                time_embed_dim_out, 
-                bias=sample_proj_bias, 
-            )
-
-        if post_act_fn is None:
-            self.post_act = None
-        else:
-            self.post_act = get_activation(post_act_fn)
-
-    def forward(self, sample, condition=None):
-        if condition is not None:
-            sample = sample + self.cond_proj(condition)
-        sample = self.linear_1(sample)
-
-        if self.act is not None:
-            sample = self.act(sample)
-
-        sample = self.linear_2(sample)
-
-        if self.post_act is not None:
-            sample = self.post_act(sample)
-        return sample
-
-
-class PixArtAlphaCombinedTimestepSizeEmbeddings(nn.Module):
-    def __init__(self, embedding_dim, size_emb_dim, use_additional_conditions: bool = False):
-        super().__init__()
-
-        self.outdim = size_emb_dim
-        self.time_proj = Timesteps(num_channels=256, flip_sin_to_cos=True, downscale_freq_shift=0)
-        self.timestep_embedder = TimestepEmbedding(in_channels=256, time_embed_dim=embedding_dim)
-
-        self.use_additional_conditions = use_additional_conditions
-        if self.use_additional_conditions:
-            self.additional_condition_proj = Timesteps(num_channels=256, flip_sin_to_cos=True, downscale_freq_shift=0)
-            self.resolution_embedder = TimestepEmbedding(in_channels=256, time_embed_dim=size_emb_dim)
-            self.nframe_embedder = TimestepEmbedding(in_channels=256, time_embed_dim=embedding_dim)
-            self.fps_embedder = TimestepEmbedding(in_channels=256, time_embed_dim=embedding_dim)
-
-    def forward(self, timestep, resolution=None, nframe=None, fps=None):
-        hidden_dtype = timestep.dtype
-
-        timesteps_proj = self.time_proj(timestep)
-        timesteps_emb = self.timestep_embedder(timesteps_proj.to(dtype=hidden_dtype))  # (N, D)
-
-        if self.use_additional_conditions:
-            batch_size = timestep.shape[0]
-            resolution_emb = self.additional_condition_proj(resolution.flatten()).to(hidden_dtype)
-            resolution_emb = self.resolution_embedder(resolution_emb).reshape(batch_size, -1)
-            nframe_emb = self.additional_condition_proj(nframe.flatten()).to(hidden_dtype)
-            nframe_emb = self.nframe_embedder(nframe_emb).reshape(batch_size, -1)
-            conditioning = timesteps_emb + resolution_emb + nframe_emb
-
-            if fps is not None:
-                fps_emb = self.additional_condition_proj(fps.flatten()).to(hidden_dtype)
-                fps_emb = self.fps_embedder(fps_emb).reshape(batch_size, -1)
-                conditioning = conditioning + fps_emb
-        else:
-            conditioning = timesteps_emb
-
-        return conditioning
-
-
-class AdaLayerNormSingle(nn.Module):
-    r"""
-        Norm layer adaptive layer norm single (adaLN-single).
-
-        As proposed in PixArt-Alpha (see: https://arxiv.org/abs/2310.00426; Section 2.3).
-
-        Parameters:
-            embedding_dim (`int`): The size of each embedding vector.
-            use_additional_conditions (`bool`): To use additional conditions for normalization or not.
-    """
-    def __init__(self, embedding_dim: int, use_additional_conditions: bool = False, time_step_rescale=1000):
-        super().__init__()
-
-        self.emb = PixArtAlphaCombinedTimestepSizeEmbeddings(
-            embedding_dim, size_emb_dim=embedding_dim // 2, use_additional_conditions=use_additional_conditions
-        )
-
-        self.silu = nn.SiLU()
-        self.linear = nn.Linear(embedding_dim, 6 * embedding_dim, bias=True)
-
-        self.time_step_rescale = time_step_rescale  ## timestep usually in [0, 1], we rescale it to [0,1000] for stability
-
-    def forward(
-        self,
-        timestep: torch.Tensor,
-        added_cond_kwargs: Dict[str, torch.Tensor] = None,
-    ) -> Tuple[torch.Tensor, torch.Tensor, torch.Tensor, torch.Tensor, torch.Tensor]:
-        embedded_timestep = self.emb(timestep*self.time_step_rescale, **added_cond_kwargs)
-
-        out = self.linear(self.silu(embedded_timestep))
-
-        return out, embedded_timestep
-    
-
-class PixArtAlphaTextProjection(nn.Module):
-    """
-    Projects caption embeddings. Also handles dropout for classifier-free guidance.
-
-    Adapted from https://github.com/PixArt-alpha/PixArt-alpha/blob/master/diffusion/model/nets/PixArt_blocks.py
-    """
-
-    def __init__(self, in_features, hidden_size):
-        super().__init__()
-        self.linear_1 = nn.Linear(
-                in_features, 
-                hidden_size, 
-                bias=True, 
-            )        
-        self.act_1 = nn.GELU(approximate="tanh")
-        self.linear_2 = nn.Linear(
-                hidden_size, 
-                hidden_size, 
-                bias=True, 
-            )
-
-    def forward(self, caption):
-        hidden_states = self.linear_1(caption)
-        hidden_states = self.act_1(hidden_states)
-        hidden_states = self.linear_2(hidden_states)
-        return hidden_states
-    
-class RoPE1D:
-    def __init__(self, freq=1e4, F0=1.0, scaling_factor=1.0):
-        self.base = freq
-        self.F0 = F0
-        self.scaling_factor = scaling_factor
-        self.cache = {}
-
-    def get_cos_sin(self, D, seq_len, device, dtype):
-        if (D, seq_len, device, dtype) not in self.cache:
-            inv_freq = 1.0 / (self.base ** (torch.arange(0, D, 2).float().to(device) / D))
-            t = torch.arange(seq_len, device=device, dtype=inv_freq.dtype)
-            freqs = torch.einsum("i,j->ij", t, inv_freq).to(dtype)
-            freqs = torch.cat((freqs, freqs), dim=-1)
-            cos = freqs.cos()  # (Seq, Dim)
-            sin = freqs.sin()
-            self.cache[D, seq_len, device, dtype] = (cos, sin)
-        return self.cache[D, seq_len, device, dtype]
-
-    @staticmethod
-    def rotate_half(x):
-        x1, x2 = x[..., : x.shape[-1] // 2], x[..., x.shape[-1] // 2:]
-        return torch.cat((-x2, x1), dim=-1)
-
-    def apply_rope1d(self, tokens, pos1d, cos, sin):
-        assert pos1d.ndim == 2
-        cos = torch.nn.functional.embedding(pos1d, cos)[:, :, None, :]
-        sin = torch.nn.functional.embedding(pos1d, sin)[:, :, None, :]
-        return (tokens * cos) + (self.rotate_half(tokens) * sin)
-
-    def __call__(self, tokens, positions):
-        """
-        input:
-            * tokens: batch_size x ntokens x nheads x dim
-            * positions: batch_size x ntokens (t position of each token)
-        output:
-            * tokens after applying RoPE2D (batch_size x ntokens x nheads x dim)
-        """
-        D = tokens.size(3)
-        assert positions.ndim == 2  # Batch, Seq
-        cos, sin = self.get_cos_sin(D, int(positions.max()) + 1, tokens.device, tokens.dtype)
-        tokens = self.apply_rope1d(tokens, positions, cos, sin)
-        return tokens
-class Attention(nn.Module):
-    def __init__(self):
-        super().__init__()
-    
-    def attn_processor(self, attn_type):
-        if attn_type == 'torch':
-            return self.torch_attn_func
-        elif attn_type == 'parallel':
-            return self.parallel_attn_func
-        else:
-            raise Exception('Not supported attention type...')
-
-    def torch_attn_func(
-        self,
-        q,
-        k,
-        v,
-        attn_mask=None,
-        causal=False,
-        drop_rate=0.0,
-        **kwargs
-    ):
-
-        if attn_mask is not None and attn_mask.dtype != torch.bool:
-            attn_mask = attn_mask.to(q.dtype)
-            
-        if attn_mask is not None and attn_mask.ndim == 3:   ## no head
-            n_heads = q.shape[2]
-            attn_mask = attn_mask.unsqueeze(1).repeat(1, n_heads, 1, 1)
-        
-        q, k, v = map(lambda x: rearrange(x, 'b s h d -> b h s d'), (q, k, v))
-        if attn_mask is not None:
-            attn_mask = attn_mask.to(q.device)
-        x = torch.nn.functional.scaled_dot_product_attention(
-            q, k, v, attn_mask=attn_mask, dropout_p=drop_rate, is_causal=causal
-        )
-        x = rearrange(x, 'b h s d -> b s h d')
-        return x  
-
-class RoPE3D(RoPE1D):
-    def __init__(self, freq=1e4, F0=1.0, scaling_factor=1.0):
-        super(RoPE3D, self).__init__(freq, F0, scaling_factor)
-        self.position_cache = {}
-
-    def get_mesh_3d(self, rope_positions, bsz):
-        f, h, w = rope_positions
-
-        if f"{f}-{h}-{w}" not in self.position_cache:
-            x = torch.arange(f, device='cpu')
-            y = torch.arange(h, device='cpu')
-            z = torch.arange(w, device='cpu')
-            self.position_cache[f"{f}-{h}-{w}"] = torch.cartesian_prod(x, y, z).view(1, f*h*w, 3).expand(bsz, -1, 3)
-        return self.position_cache[f"{f}-{h}-{w}"]
-     
-    def __call__(self, tokens, rope_positions, ch_split, parallel=False):
-        """
-        input:
-            * tokens: batch_size x ntokens x nheads x dim
-            * rope_positions: list of (f, h, w)
-        output:
-            * tokens after applying RoPE2D (batch_size x ntokens x nheads x dim)
-        """
-        assert sum(ch_split) == tokens.size(-1); 
-
-        mesh_grid = self.get_mesh_3d(rope_positions, bsz=tokens.shape[0])
-        out = []
-        for i, (D, x) in enumerate(zip(ch_split, torch.split(tokens, ch_split, dim=-1))):
-            cos, sin = self.get_cos_sin(D, int(mesh_grid.max()) + 1, tokens.device, tokens.dtype)
-            
-            if parallel:
-                pass
-            else:
-                mesh = mesh_grid[:, :, i].clone()
-            x = self.apply_rope1d(x, mesh.to(tokens.device), cos, sin)
-            out.append(x)
-            
-        tokens = torch.cat(out, dim=-1)
-        return tokens
-
-class SelfAttention(Attention):
-    def __init__(self, hidden_dim, head_dim, bias=False, with_rope=True, with_qk_norm=True, attn_type='torch'):
-        super().__init__()
-        self.head_dim = head_dim
-        self.n_heads = hidden_dim // head_dim
-        
-        self.wqkv = nn.Linear(hidden_dim, hidden_dim*3, bias=bias)
-        self.wo = nn.Linear(hidden_dim, hidden_dim, bias=bias)
-        
-        self.with_rope = with_rope
-        self.with_qk_norm = with_qk_norm
-        if self.with_qk_norm:
-            self.q_norm = RMSNorm(head_dim, elementwise_affine=True)
-            self.k_norm = RMSNorm(head_dim, elementwise_affine=True)
-        
-        if self.with_rope:
-            self.rope_3d = RoPE3D(freq=1e4, F0=1.0, scaling_factor=1.0)
-            self.rope_ch_split = [64, 32, 32]
-        
-        self.core_attention = self.attn_processor(attn_type=attn_type)
-        self.parallel = attn_type=='parallel'
-        
-    def apply_rope3d(self, x, fhw_positions, rope_ch_split, parallel=True):
-        x = self.rope_3d(x, fhw_positions, rope_ch_split, parallel)
-        return x
-        
-    def forward(
-        self, 
-        x,
-        cu_seqlens=None,
-        max_seqlen=None,
-        rope_positions=None,
-        attn_mask=None
-    ):
-        xqkv = self.wqkv(x) 
-        xqkv = xqkv.view(*x.shape[:-1], self.n_heads, 3*self.head_dim)
-
-        xq, xk, xv = torch.split(xqkv, [self.head_dim]*3, dim=-1)  ## seq_len, n, dim
-    
-        if self.with_qk_norm:
-            xq = self.q_norm(xq)
-            xk = self.k_norm(xk)
-    
-        if self.with_rope:
-            xq = self.apply_rope3d(xq, rope_positions, self.rope_ch_split, parallel=self.parallel)
-            xk = self.apply_rope3d(xk, rope_positions, self.rope_ch_split, parallel=self.parallel)
-            
-        output = self.core_attention(
-                    xq,
-                    xk,
-                    xv,
-                    cu_seqlens=cu_seqlens,
-                    max_seqlen=max_seqlen,
-                    attn_mask=attn_mask
-                )
-        output = rearrange(output, 'b s h d -> b s (h d)')
-        output = self.wo(output)
-        
-        return output
-    
-    
-class CrossAttention(Attention):
-    def __init__(self, hidden_dim, head_dim, bias=False, with_qk_norm=True, attn_type='torch'):
-        super().__init__()
-        self.head_dim = head_dim
-        self.n_heads = hidden_dim // head_dim
-        
-        self.wq = nn.Linear(hidden_dim, hidden_dim, bias=bias)
-        self.wkv = nn.Linear(hidden_dim, hidden_dim*2, bias=bias)
-        self.wo = nn.Linear(hidden_dim, hidden_dim, bias=bias)
-        
-        self.with_qk_norm = with_qk_norm
-        if self.with_qk_norm:
-            self.q_norm = RMSNorm(head_dim, elementwise_affine=True)
-            self.k_norm = RMSNorm(head_dim, elementwise_affine=True)
-        
-        self.core_attention = self.attn_processor(attn_type=attn_type)
-
-    def forward(
-            self, 
-            x: torch.Tensor,
-            encoder_hidden_states: torch.Tensor,
-            attn_mask=None
-        ):
-        xq = self.wq(x) 
-        xq = xq.view(*xq.shape[:-1], self.n_heads, self.head_dim)
-        
-        xkv = self.wkv(encoder_hidden_states)
-        xkv = xkv.view(*xkv.shape[:-1], self.n_heads, 2*self.head_dim)
-
-        xk, xv = torch.split(xkv, [self.head_dim]*2, dim=-1)  ## seq_len, n, dim
-    
-        if self.with_qk_norm:
-            xq = self.q_norm(xq)
-            xk = self.k_norm(xk)
-
-        output = self.core_attention(
-                    xq,
-                    xk,
-                    xv,
-                    attn_mask=attn_mask
-                )
-        
-        output = rearrange(output, 'b s h d -> b s (h d)')
-        output = self.wo(output)
-        
-        return output
-
-    
-class GELU(nn.Module):
-    r"""
-    GELU activation function with tanh approximation support with `approximate="tanh"`.
-
-    Parameters:
-        dim_in (`int`): The number of channels in the input.
-        dim_out (`int`): The number of channels in the output.
-        approximate (`str`, *optional*, defaults to `"none"`): If `"tanh"`, use tanh approximation.
-        bias (`bool`, defaults to True): Whether to use a bias in the linear layer.
-    """
-
-    def __init__(self, dim_in: int, dim_out: int, approximate: str = "none", bias: bool = True):
-        super().__init__()
-        self.proj = nn.Linear(dim_in, dim_out, bias=bias)
-        self.approximate = approximate
-
-    def gelu(self, gate: torch.Tensor) -> torch.Tensor:
-        return torch.nn.functional.gelu(gate, approximate=self.approximate)
-
-    def forward(self, hidden_states):
-        hidden_states = self.proj(hidden_states)
-        hidden_states = self.gelu(hidden_states)
-        return hidden_states
-class PatchEmbed(nn.Module):
-    """2D Image to Patch Embedding"""
-
-    def __init__(
-        self,
-        patch_size=64,
-        in_channels=3,
-        embed_dim=768,
-        layer_norm=False,
-        flatten=True,
-        bias=True,
-    ):
-        super().__init__()
-
-        self.flatten = flatten
-        self.layer_norm = layer_norm
-
-        self.proj = nn.Conv2d(
-            in_channels, embed_dim, kernel_size=(patch_size, patch_size), stride=patch_size, bias=bias
-        )
-
-    def forward(self, latent):
-        latent = self.proj(latent).to(latent.dtype)   
-        if self.flatten:
-            latent = latent.flatten(2).transpose(1, 2)  # BCHW -> BNC
-        if self.layer_norm:
-            latent = self.norm(latent)
-
-        return latent
-class FeedForward(nn.Module):
-    def __init__(
-        self, 
-        dim: int,
-        inner_dim: Optional[int] = None,
-        dim_out: Optional[int] = None,
-        mult: int = 4,
-        bias: bool = False,
-    ):
-        super().__init__()
-        inner_dim = dim*mult if inner_dim is None else inner_dim
-        dim_out = dim if dim_out is None else dim_out
-        self.net = nn.ModuleList([
-            GELU(dim, inner_dim, approximate="tanh", bias=bias),
-            nn.Identity(),
-            nn.Linear(inner_dim, dim_out, bias=bias)
-        ])
-        
-        
-    def forward(self, hidden_states: torch.Tensor, *args, **kwargs) -> torch.Tensor:
-        for module in self.net:
-            hidden_states = module(hidden_states)
-        return hidden_states
-    
-
-def modulate(x, scale, shift):
-    x = x * (1 + scale) + shift
-    return x
-
-
-def gate(x, gate):
-    x = gate * x
-    return x
-
-class StepVideoTransformerBlock(nn.Module):
-    r"""
-    A basic Transformer block.
-
-    Parameters:
-        dim (`int`): The number of channels in the input and output.
-        num_attention_heads (`int`): The number of heads to use for multi-head attention.
-        attention_head_dim (`int`): The number of channels in each head.
-        dropout (`float`, *optional*, defaults to 0.0): The dropout probability to use.
-        cross_attention_dim (`int`, *optional*): The size of the encoder_hidden_states vector for cross attention.
-        activation_fn (`str`, *optional*, defaults to `"geglu"`): Activation function to be used in feed-forward.
-        num_embeds_ada_norm (:
-            obj: `int`, *optional*): The number of diffusion steps used during training. See `Transformer2DModel`.
-        attention_bias (:
-            obj: `bool`, *optional*, defaults to `False`): Configure if the attentions should contain a bias parameter.
-        only_cross_attention (`bool`, *optional*):
-            Whether to use only cross-attention layers. In this case two cross attention layers are used.
-        double_self_attention (`bool`, *optional*):
-            Whether to use two self-attention layers. In this case no cross attention layers are used.
-        upcast_attention (`bool`, *optional*):
-            Whether to upcast the attention computation to float32. This is useful for mixed precision training.
-        norm_elementwise_affine (`bool`, *optional*, defaults to `True`):
-            Whether to use learnable elementwise affine parameters for normalization.
-        norm_type (`str`, *optional*, defaults to `"layer_norm"`):
-            The normalization layer to use. Can be `"layer_norm"`, `"ada_norm"` or `"ada_norm_zero"`.
-        final_dropout (`bool` *optional*, defaults to False):
-            Whether to apply a final dropout after the last feed-forward layer.
-        attention_type (`str`, *optional*, defaults to `"default"`):
-            The type of attention to use. Can be `"default"` or `"gated"` or `"gated-text-image"`.
-        positional_embeddings (`str`, *optional*, defaults to `None`):
-            The type of positional embeddings to apply to.
-        num_positional_embeddings (`int`, *optional*, defaults to `None`):
-            The maximum number of positional embeddings to apply.
-    """
-
-    def __init__(
-        self,
-        dim: int,
-        attention_head_dim: int,
-        norm_eps: float = 1e-5,
-        ff_inner_dim: Optional[int] = None,
-        ff_bias: bool = False,
-        attention_type: str = 'parallel'
-    ):
-        super().__init__()
-        self.dim = dim
-        self.norm1 = nn.LayerNorm(dim, eps=norm_eps)
-        self.attn1 = SelfAttention(dim, attention_head_dim, bias=False, with_rope=True, with_qk_norm=True, attn_type=attention_type)
-        
-        self.norm2 = nn.LayerNorm(dim, eps=norm_eps)
-        self.attn2 = CrossAttention(dim, attention_head_dim, bias=False, with_qk_norm=True, attn_type='torch')
-
-        self.ff = FeedForward(dim=dim, inner_dim=ff_inner_dim, dim_out=dim, bias=ff_bias)
-
-        self.scale_shift_table = nn.Parameter(torch.randn(6, dim) /dim**0.5)
-
-    @torch.no_grad()
-    def forward(
-        self,
-        q: torch.Tensor,
-        kv: Optional[torch.Tensor] = None,
-        timestep: Optional[torch.LongTensor] =  None,
-        attn_mask = None,
-        rope_positions: list = None, 
-    ) -> torch.Tensor:
-        shift_msa, scale_msa, gate_msa, shift_mlp, scale_mlp, gate_mlp = (
-            torch.clone(chunk) for chunk in (self.scale_shift_table[None].to(dtype=q.dtype, device=q.device) + timestep.reshape(-1, 6, self.dim)).chunk(6, dim=1)
-        )
-        
-        scale_shift_q = modulate(self.norm1(q), scale_msa, shift_msa)
-
-        attn_q = self.attn1(
-            scale_shift_q,
-            rope_positions=rope_positions
-        )
-
-        q = gate(attn_q, gate_msa) + q
-        
-        attn_q = self.attn2(
-                q,
-                kv,
-                attn_mask
-            )
-
-        q = attn_q + q
-
-        scale_shift_q = modulate(self.norm2(q), scale_mlp, shift_mlp)
-
-        ff_output = self.ff(scale_shift_q)
-        
-        q = gate(ff_output, gate_mlp) + q
-        
-        return q
-
-class StepVideoModel(ModelMixin, ConfigMixin):
-    _no_split_modules = ["StepVideoTransformerBlock", "PatchEmbed"]
-
-    @register_to_config
-    def __init__(
-        self,
-        num_attention_heads: int = 48,
-        attention_head_dim: int = 128,
-        in_channels: int = 64,
-        out_channels: Optional[int] = 64,
-        num_layers: int = 48,
-        dropout: float = 0.0,
-        patch_size: int = 1,
-        norm_type: str = "ada_norm_single",
-        norm_elementwise_affine: bool = False,
-        norm_eps: float = 1e-6,
-        use_additional_conditions: Optional[bool] = False,
-        caption_channels: Optional[int]|list|tuple = [6144, 1024],
-        attention_type: Optional[str] = "torch",
-    ):
-        super().__init__()
-
-        # Set some common variables used across the board.
-        self.inner_dim = self.config.num_attention_heads * self.config.attention_head_dim
-        self.out_channels = in_channels if out_channels is None else out_channels
-
-        self.use_additional_conditions = use_additional_conditions
-
-        self.pos_embed = PatchEmbed(
-            patch_size=patch_size,
-            in_channels=self.config.in_channels,
-            embed_dim=self.inner_dim,
-        )
-
-        self.transformer_blocks = nn.ModuleList(
-            [
-                StepVideoTransformerBlock(
-                    dim=self.inner_dim,
-                    attention_head_dim=self.config.attention_head_dim,
-                    attention_type='torch'
-                )
-                for _ in range(self.config.num_layers)
-            ]
-        )
-
-        # 3. Output blocks.
-        self.norm_out = nn.LayerNorm(self.inner_dim, eps=norm_eps, elementwise_affine=norm_elementwise_affine)
-        self.scale_shift_table = nn.Parameter(torch.randn(2, self.inner_dim) / self.inner_dim**0.5)
-        self.proj_out = nn.Linear(self.inner_dim, patch_size * patch_size * self.out_channels)
-        self.patch_size = patch_size
-
-        self.adaln_single = AdaLayerNormSingle(
-            self.inner_dim, use_additional_conditions=self.use_additional_conditions
-        )
-
-        if isinstance(self.config.caption_channels, int):
-            caption_channel = self.config.caption_channels
-        else:
-            caption_channel, clip_channel = self.config.caption_channels
-            self.clip_projection = nn.Linear(clip_channel, self.inner_dim) 
-
-        self.caption_norm = nn.LayerNorm(caption_channel,  eps=norm_eps, elementwise_affine=norm_elementwise_affine)
-        
-        self.caption_projection = PixArtAlphaTextProjection(
-            in_features=caption_channel, hidden_size=self.inner_dim
-        )
-        
-        self.parallel = attention_type=='parallel'
-
-    def patchfy(self, hidden_states):
-        hidden_states = rearrange(hidden_states, 'b f c h w -> (b f) c h w')
-        hidden_states = self.pos_embed(hidden_states)
-        return hidden_states
-
-    def prepare_attn_mask(self, encoder_attention_mask, encoder_hidden_states, q_seqlen):
-        kv_seqlens = encoder_attention_mask.sum(dim=1).int()
-        mask = torch.zeros([len(kv_seqlens), q_seqlen, max(kv_seqlens)], dtype=torch.bool, device=encoder_attention_mask.device)
-        encoder_hidden_states = encoder_hidden_states[:,: max(kv_seqlens)]
-        for i, kv_len in enumerate(kv_seqlens):
-            mask[i, :, :kv_len] = 1
-        return encoder_hidden_states, mask
-
-    def block_forward(
-        self,
-        hidden_states,
-        encoder_hidden_states=None,
-        timestep=None,
-        rope_positions=None,
-        attn_mask=None,
-        parallel=True
-    ):
-
-        for block in tqdm(self.transformer_blocks, desc="Transformer Block"):
-            hidden_states = block(
-                hidden_states,
-                encoder_hidden_states,
-                timestep=timestep,
-                attn_mask=attn_mask,
-                rope_positions=rope_positions
-            )
-
-        return hidden_states
-        
-
-    @torch.inference_mode()
-    def forward(
-        self,
-        hidden_states: torch.Tensor,
-        encoder_hidden_states: Optional[torch.Tensor] = None,
-        encoder_hidden_states_2: Optional[torch.Tensor] = None,
-        timestep: Optional[torch.LongTensor] = None,
-        added_cond_kwargs: Dict[str, torch.Tensor] = None,
-        encoder_attention_mask: Optional[torch.Tensor] = None,
-        fps: torch.Tensor=None,
-        return_dict: bool = True,
-    ):
-        assert hidden_states.ndim==5; "hidden_states's shape should be (bsz, f, ch, h ,w)"
-
-        bsz, frame, _, height, width = hidden_states.shape
-        height, width = height // self.patch_size, width // self.patch_size
-                
-        hidden_states = self.patchfy(hidden_states) 
-        len_frame = hidden_states.shape[1]
-                
-        if self.use_additional_conditions:
-            added_cond_kwargs = {
-                "resolution": torch.tensor([(height, width)]*bsz, device=hidden_states.device, dtype=hidden_states.dtype),
-                "nframe": torch.tensor([frame]*bsz, device=hidden_states.device, dtype=hidden_states.dtype),
-                "fps": fps
-            }    
-        else:
-            added_cond_kwargs = {}
-        
-        timestep, embedded_timestep = self.adaln_single(
-            timestep, added_cond_kwargs=added_cond_kwargs
-        )
-
-        encoder_hidden_states = self.caption_projection(self.caption_norm(encoder_hidden_states))
-        
-        if encoder_hidden_states_2 is not None and hasattr(self, 'clip_projection'):
-            clip_embedding = self.clip_projection(encoder_hidden_states_2)
-            encoder_hidden_states = torch.cat([clip_embedding, encoder_hidden_states], dim=1)
-
-        hidden_states = rearrange(hidden_states, '(b f) l d->  b (f l) d', b=bsz, f=frame, l=len_frame).contiguous()
-        encoder_hidden_states, attn_mask = self.prepare_attn_mask(encoder_attention_mask, encoder_hidden_states, q_seqlen=frame*len_frame)
-        
-        hidden_states = self.block_forward(
-            hidden_states,
-            encoder_hidden_states,
-            timestep=timestep,
-            rope_positions=[frame, height, width],
-            attn_mask=attn_mask,
-            parallel=self.parallel
-        )
-        
-        hidden_states = rearrange(hidden_states, 'b (f l) d -> (b f) l d', b=bsz, f=frame, l=len_frame)
-        
-        embedded_timestep = repeat(embedded_timestep, 'b d -> (b f) d', f=frame).contiguous()
-        
-        shift, scale = (self.scale_shift_table[None].to(dtype=embedded_timestep.dtype, device=embedded_timestep.device) + embedded_timestep[:, None]).chunk(2, dim=1)
-        hidden_states = self.norm_out(hidden_states)
-        # Modulation
-        hidden_states = hidden_states * (1 + scale) + shift
-        hidden_states = self.proj_out(hidden_states)
-        
-        # unpatchify
-        hidden_states = hidden_states.reshape(
-            shape=(-1, height, width, self.patch_size, self.patch_size, self.out_channels)
-        )
-        
-        hidden_states = rearrange(hidden_states, 'n h w p q c -> n c h p w q')
-        output = hidden_states.reshape(
-            shape=(-1, self.out_channels, height * self.patch_size, width * self.patch_size)
-        )
-
-        output = rearrange(output, '(b f) c h w -> b f c h w', f=frame)
-
-        if return_dict:
-            return {'x': output}
-        return output
-    
-    
\ No newline at end of file
diff --git a/videotuna/models/stepvideo/stepvideo/modules/normalization.py b/videotuna/models/stepvideo/stepvideo/modules/normalization.py
deleted file mode 100755
index df303ca6..00000000
--- a/videotuna/models/stepvideo/stepvideo/modules/normalization.py
+++ /dev/null
@@ -1,317 +0,0 @@
-from typing import Any, Dict, Optional, Union, Tuple
-import torch
-import torch.nn as nn
-import math
-
-
-class RMSNorm(nn.Module):
-    def __init__(
-        self,
-        dim: int,
-        elementwise_affine=True,
-        eps: float = 1e-6,
-        device=None,
-        dtype=None,
-    ):
-        """
-        Initialize the RMSNorm normalization layer.
-
-        Args:
-            dim (int): The dimension of the input tensor.
-            eps (float, optional): A small value added to the denominator for numerical stability. Default is 1e-6.
-
-        Attributes:
-            eps (float): A small value added to the denominator for numerical stability.
-            weight (nn.Parameter): Learnable scaling parameter.
-
-        """
-        factory_kwargs = {"device": device, "dtype": dtype}
-        super().__init__()
-        self.eps = eps
-        if elementwise_affine:
-            self.weight = nn.Parameter(torch.ones(dim, **factory_kwargs))
-
-    def _norm(self, x):
-        """
-        Apply the RMSNorm normalization to the input tensor.
-
-        Args:
-            x (torch.Tensor): The input tensor.
-
-        Returns:
-            torch.Tensor: The normalized tensor.
-
-        """
-        return x * torch.rsqrt(x.pow(2).mean(-1, keepdim=True) + self.eps)
-
-    def forward(self, x):
-        """
-        Forward pass through the RMSNorm layer.
-
-        Args:
-            x (torch.Tensor): The input tensor.
-
-        Returns:
-            torch.Tensor: The output tensor after applying RMSNorm.
-
-        """
-        output = self._norm(x.float()).type_as(x)
-        if hasattr(self, "weight"):
-            output = output * self.weight
-        return output
-    
-
-ACTIVATION_FUNCTIONS = {
-    "swish": nn.SiLU(),
-    "silu": nn.SiLU(),
-    "mish": nn.Mish(),
-    "gelu": nn.GELU(),
-    "relu": nn.ReLU(),
-}
-
-
-def get_activation(act_fn: str) -> nn.Module:
-    """Helper function to get activation function from string.
-
-    Args:
-        act_fn (str): Name of activation function.
-
-    Returns:
-        nn.Module: Activation function.
-    """
-
-    act_fn = act_fn.lower()
-    if act_fn in ACTIVATION_FUNCTIONS:
-        return ACTIVATION_FUNCTIONS[act_fn]
-    else:
-        raise ValueError(f"Unsupported activation function: {act_fn}")
-
-
-
-def get_timestep_embedding(
-    timesteps: torch.Tensor,
-    embedding_dim: int,
-    flip_sin_to_cos: bool = False,
-    downscale_freq_shift: float = 1,
-    scale: float = 1,
-    max_period: int = 10000,
-):
-    """
-    This matches the implementation in Denoising Diffusion Probabilistic Models: Create sinusoidal timestep embeddings.
-
-    :param timesteps: a 1-D Tensor of N indices, one per batch element.
-                      These may be fractional.
-    :param embedding_dim: the dimension of the output. :param max_period: controls the minimum frequency of the
-    embeddings. :return: an [N x dim] Tensor of positional embeddings.
-    """
-    assert len(timesteps.shape) == 1, "Timesteps should be a 1d-array"
-
-    half_dim = embedding_dim // 2
-    exponent = -math.log(max_period) * torch.arange(
-        start=0, end=half_dim, dtype=torch.float32, device=timesteps.device
-    )
-    exponent = exponent / (half_dim - downscale_freq_shift)
-
-    emb = torch.exp(exponent)
-    emb = timesteps[:, None].float() * emb[None, :]
-
-    # scale embeddings
-    emb = scale * emb
-
-    # concat sine and cosine embeddings
-    emb = torch.cat([torch.sin(emb), torch.cos(emb)], dim=-1)
-
-    # flip sine and cosine embeddings
-    if flip_sin_to_cos:
-        emb = torch.cat([emb[:, half_dim:], emb[:, :half_dim]], dim=-1)
-
-    # zero pad
-    if embedding_dim % 2 == 1:
-        emb = torch.nn.functional.pad(emb, (0, 1, 0, 0))
-    return emb
-
-
-
-class Timesteps(nn.Module):
-    def __init__(self, num_channels: int, flip_sin_to_cos: bool, downscale_freq_shift: float):
-        super().__init__()
-        self.num_channels = num_channels
-        self.flip_sin_to_cos = flip_sin_to_cos
-        self.downscale_freq_shift = downscale_freq_shift
-
-    def forward(self, timesteps):
-        t_emb = get_timestep_embedding(
-            timesteps,
-            self.num_channels,
-            flip_sin_to_cos=self.flip_sin_to_cos,
-            downscale_freq_shift=self.downscale_freq_shift,
-        )
-        return t_emb
-
-
-
-class TimestepEmbedding(nn.Module):
-    def __init__(
-        self,
-        in_channels: int,
-        time_embed_dim: int,
-        act_fn: str = "silu",
-        out_dim: int = None,
-        post_act_fn: Optional[str] = None,
-        cond_proj_dim=None,
-        sample_proj_bias=True
-    ):
-        super().__init__()
-        linear_cls = nn.Linear
-
-        self.linear_1 = linear_cls(
-                in_channels, 
-                time_embed_dim, 
-                bias=sample_proj_bias,
-            )
-
-        if cond_proj_dim is not None:
-            self.cond_proj = linear_cls(
-                    cond_proj_dim, 
-                    in_channels, 
-                    bias=False,
-                )
-        else:
-            self.cond_proj = None
-
-        self.act = get_activation(act_fn)
-
-        if out_dim is not None:
-            time_embed_dim_out = out_dim
-        else:
-            time_embed_dim_out = time_embed_dim
-            
-        self.linear_2 = linear_cls(
-                time_embed_dim, 
-                time_embed_dim_out, 
-                bias=sample_proj_bias, 
-            )
-
-        if post_act_fn is None:
-            self.post_act = None
-        else:
-            self.post_act = get_activation(post_act_fn)
-
-    def forward(self, sample, condition=None):
-        if condition is not None:
-            sample = sample + self.cond_proj(condition)
-        sample = self.linear_1(sample)
-
-        if self.act is not None:
-            sample = self.act(sample)
-
-        sample = self.linear_2(sample)
-
-        if self.post_act is not None:
-            sample = self.post_act(sample)
-        return sample
-
-
-
-
-class PixArtAlphaCombinedTimestepSizeEmbeddings(nn.Module):
-    def __init__(self, embedding_dim, size_emb_dim, use_additional_conditions: bool = False):
-        super().__init__()
-
-        self.outdim = size_emb_dim
-        self.time_proj = Timesteps(num_channels=256, flip_sin_to_cos=True, downscale_freq_shift=0)
-        self.timestep_embedder = TimestepEmbedding(in_channels=256, time_embed_dim=embedding_dim)
-
-        self.use_additional_conditions = use_additional_conditions
-        if self.use_additional_conditions:
-            self.additional_condition_proj = Timesteps(num_channels=256, flip_sin_to_cos=True, downscale_freq_shift=0)
-            self.resolution_embedder = TimestepEmbedding(in_channels=256, time_embed_dim=size_emb_dim)
-            self.nframe_embedder = TimestepEmbedding(in_channels=256, time_embed_dim=embedding_dim)
-            self.fps_embedder = TimestepEmbedding(in_channels=256, time_embed_dim=embedding_dim)
-
-    def forward(self, timestep, resolution=None, nframe=None, fps=None):
-        hidden_dtype = next(self.timestep_embedder.parameters()).dtype
-
-        timesteps_proj = self.time_proj(timestep)
-        timesteps_emb = self.timestep_embedder(timesteps_proj.to(dtype=hidden_dtype))  # (N, D)
-
-        if self.use_additional_conditions:
-            batch_size = timestep.shape[0]
-            resolution_emb = self.additional_condition_proj(resolution.flatten()).to(hidden_dtype)
-            resolution_emb = self.resolution_embedder(resolution_emb).reshape(batch_size, -1)
-            nframe_emb = self.additional_condition_proj(nframe.flatten()).to(hidden_dtype)
-            nframe_emb = self.nframe_embedder(nframe_emb).reshape(batch_size, -1)
-            conditioning = timesteps_emb + resolution_emb + nframe_emb
-
-            if fps is not None:
-                fps_emb = self.additional_condition_proj(fps.flatten()).to(hidden_dtype)
-                fps_emb = self.fps_embedder(fps_emb).reshape(batch_size, -1)
-                conditioning = conditioning + fps_emb
-        else:
-            conditioning = timesteps_emb
-
-        return conditioning
-
-
-
-class AdaLayerNormSingle(nn.Module):
-    r"""
-        Norm layer adaptive layer norm single (adaLN-single).
-
-        As proposed in PixArt-Alpha (see: https://arxiv.org/abs/2310.00426; Section 2.3).
-
-        Parameters:
-            embedding_dim (`int`): The size of each embedding vector.
-            use_additional_conditions (`bool`): To use additional conditions for normalization or not.
-    """
-    def __init__(self, embedding_dim: int, use_additional_conditions: bool = False, time_step_rescale=1000):
-        super().__init__()
-
-        self.emb = PixArtAlphaCombinedTimestepSizeEmbeddings(
-            embedding_dim, size_emb_dim=embedding_dim // 2, use_additional_conditions=use_additional_conditions
-        )
-
-        self.silu = nn.SiLU()
-        self.linear = nn.Linear(embedding_dim, 6 * embedding_dim, bias=True)
-
-        self.time_step_rescale = time_step_rescale  ## timestep usually in [0, 1], we rescale it to [0,1000] for stability
-
-    def forward(
-        self,
-        timestep: torch.Tensor,
-        added_cond_kwargs: Dict[str, torch.Tensor] = None,
-    ) -> Tuple[torch.Tensor, torch.Tensor, torch.Tensor, torch.Tensor, torch.Tensor]:
-        embedded_timestep = self.emb(timestep*self.time_step_rescale, **added_cond_kwargs)
-
-        out = self.linear(self.silu(embedded_timestep))
-
-        return out, embedded_timestep
-    
-
-class PixArtAlphaTextProjection(nn.Module):
-    """
-    Projects caption embeddings. Also handles dropout for classifier-free guidance.
-
-    Adapted from https://github.com/PixArt-alpha/PixArt-alpha/blob/master/diffusion/model/nets/PixArt_blocks.py
-    """
-
-    def __init__(self, in_features, hidden_size):
-        super().__init__()
-        self.linear_1 = nn.Linear(
-                in_features, 
-                hidden_size, 
-                bias=True, 
-            )        
-        self.act_1 = nn.GELU(approximate="tanh")
-        self.linear_2 = nn.Linear(
-                hidden_size, 
-                hidden_size, 
-                bias=True, 
-            )
-
-    def forward(self, caption):
-        hidden_states = self.linear_1(caption)
-        hidden_states = self.act_1(hidden_states)
-        hidden_states = self.linear_2(hidden_states)
-        return hidden_states
-
diff --git a/videotuna/models/stepvideo/stepvideo/modules/rope.py b/videotuna/models/stepvideo/stepvideo/modules/rope.py
deleted file mode 100755
index 58640946..00000000
--- a/videotuna/models/stepvideo/stepvideo/modules/rope.py
+++ /dev/null
@@ -1,90 +0,0 @@
-import torch
-from videotuna.models.stepvideo.stepvideo.parallel import get_sequence_parallel_world_size, get_sequence_parallel_rank
-
-
-class RoPE1D:
-    def __init__(self, freq=1e4, F0=1.0, scaling_factor=1.0):
-        self.base = freq
-        self.F0 = F0
-        self.scaling_factor = scaling_factor
-        self.cache = {}
-
-    def get_cos_sin(self, D, seq_len, device, dtype):
-        if (D, seq_len, device, dtype) not in self.cache:
-            inv_freq = 1.0 / (self.base ** (torch.arange(0, D, 2).float().to(device) / D))
-            t = torch.arange(seq_len, device=device, dtype=inv_freq.dtype)
-            freqs = torch.einsum("i,j->ij", t, inv_freq).to(dtype)
-            freqs = torch.cat((freqs, freqs), dim=-1)
-            cos = freqs.cos()  # (Seq, Dim)
-            sin = freqs.sin()
-            self.cache[D, seq_len, device, dtype] = (cos, sin)
-        return self.cache[D, seq_len, device, dtype]
-
-    @staticmethod
-    def rotate_half(x):
-        x1, x2 = x[..., : x.shape[-1] // 2], x[..., x.shape[-1] // 2:]
-        return torch.cat((-x2, x1), dim=-1)
-
-    def apply_rope1d(self, tokens, pos1d, cos, sin):
-        assert pos1d.ndim == 2
-        cos = torch.nn.functional.embedding(pos1d, cos)[:, :, None, :]
-        sin = torch.nn.functional.embedding(pos1d, sin)[:, :, None, :]
-        return (tokens * cos) + (self.rotate_half(tokens) * sin)
-
-    def __call__(self, tokens, positions):
-        """
-        input:
-            * tokens: batch_size x ntokens x nheads x dim
-            * positions: batch_size x ntokens (t position of each token)
-        output:
-            * tokens after appplying RoPE2D (batch_size x ntokens x nheads x dim)
-        """
-        D = tokens.size(3)
-        assert positions.ndim == 2  # Batch, Seq
-        cos, sin = self.get_cos_sin(D, int(positions.max()) + 1, tokens.device, tokens.dtype)
-        tokens = self.apply_rope1d(tokens, positions, cos, sin)
-        return tokens
-
-
-
-class RoPE3D(RoPE1D):
-    def __init__(self, freq=1e4, F0=1.0, scaling_factor=1.0):
-        super(RoPE3D, self).__init__(freq, F0, scaling_factor)
-        self.position_cache = {}
-
-    def get_mesh_3d(self, rope_positions, bsz):
-        f, h, w = rope_positions
-
-        if f"{f}-{h}-{w}" not in self.position_cache:
-            x = torch.arange(f, device='cpu')
-            y = torch.arange(h, device='cpu')
-            z = torch.arange(w, device='cpu')
-            self.position_cache[f"{f}-{h}-{w}"] = torch.cartesian_prod(x, y, z).view(1, f*h*w, 3).expand(bsz, -1, 3)
-        return self.position_cache[f"{f}-{h}-{w}"]
-     
-    def __call__(self, tokens, rope_positions, ch_split, parallel=False):
-        """
-        input:
-            * tokens: batch_size x ntokens x nheads x dim
-            * rope_positions: list of (f, h, w)
-        output:
-            * tokens after appplying RoPE2D (batch_size x ntokens x nheads x dim)
-        """
-        assert sum(ch_split) == tokens.size(-1); 
-
-        mesh_grid = self.get_mesh_3d(rope_positions, bsz=tokens.shape[0])
-        out = []
-        for i, (D, x) in enumerate(zip(ch_split, torch.split(tokens, ch_split, dim=-1))):
-            cos, sin = self.get_cos_sin(D, int(mesh_grid.max()) + 1, tokens.device, tokens.dtype)
-            
-            if parallel:
-                mesh = torch.chunk(mesh_grid[:, :, i], get_sequence_parallel_world_size(),dim=1)[get_sequence_parallel_rank()].clone()
-            else:
-                mesh = mesh_grid[:, :, i].clone()
-            x = self.apply_rope1d(x, mesh.to(tokens.device), cos, sin)
-            out.append(x)
-            
-        tokens = torch.cat(out, dim=-1)
-        return tokens
-    
-
diff --git a/videotuna/models/stepvideo/stepvideo/parallel.py b/videotuna/models/stepvideo/stepvideo/parallel.py
deleted file mode 100644
index 05e79988..00000000
--- a/videotuna/models/stepvideo/stepvideo/parallel.py
+++ /dev/null
@@ -1,47 +0,0 @@
-import torch.distributed as dist
-import xfuser
-import torch
-
-
-def initialize_parall_group(ring_degree, ulysses_degree, tensor_parallel_degree):
-    dist.init_process_group("nccl")
-    xfuser.core.distributed.init_distributed_environment(
-        rank=dist.get_rank(), 
-        world_size=dist.get_world_size()
-    )
-    
-    xfuser.core.distributed.initialize_model_parallel(
-        sequence_parallel_degree=ulysses_degree,
-        ring_degree=ring_degree,
-        ulysses_degree=ulysses_degree,
-        tensor_parallel_degree=tensor_parallel_degree,
-    )
-    torch.cuda.set_device(dist.get_rank())
-
-def get_parallel_group():
-    return xfuser.core.distributed.get_world_group()
-
-def get_sequence_parallel_world_size():
-    return xfuser.core.distributed.parallel_state.get_sequence_parallel_world_size()
-
-def get_sequence_parallel_rank():
-    return xfuser.core.distributed.parallel_state.get_sequence_parallel_rank()
-
-def get_sp_group():
-    return xfuser.core.distributed.parallel_state.get_sp_group()
-
-
-
-def parallel_forward(fn_):
-    def wrapTheFunction(_, hidden_states, *args, **kwargs):
-        if kwargs['parallel']:            
-            hidden_states = torch.chunk(hidden_states, get_sequence_parallel_world_size(), dim=-2)[get_sequence_parallel_rank()]
-            kwargs['attn_mask'] = torch.chunk(kwargs['attn_mask'], get_sequence_parallel_world_size(), dim=-2)[get_sequence_parallel_rank()]
-        output = fn_(_, hidden_states, *args, **kwargs)
-        
-        if kwargs['parallel']:
-            output = get_sp_group().all_gather(output.contiguous(), dim=-2)
-        
-        return output
-     
-    return wrapTheFunction
\ No newline at end of file
diff --git a/videotuna/models/stepvideo/stepvideo/text_encoder/__init__.py b/videotuna/models/stepvideo/stepvideo/text_encoder/__init__.py
deleted file mode 100644
index e69de29b..00000000
diff --git a/videotuna/models/stepvideo/stepvideo/text_encoder/clip.py b/videotuna/models/stepvideo/stepvideo/text_encoder/clip.py
deleted file mode 100755
index 13670363..00000000
--- a/videotuna/models/stepvideo/stepvideo/text_encoder/clip.py
+++ /dev/null
@@ -1,47 +0,0 @@
-import torch
-import torch.nn as nn
-from transformers import BertModel, BertTokenizer, BertConfig
-import os
-from ..utils import with_empty_init
-from loguru import logger
-class HunyuanClip(nn.Module):
-    """
-        Hunyuan clip code copied from https://github.com/huggingface/diffusers/blob/main/src/diffusers/pipelines/hunyuandit/pipeline_hunyuandit.py
-        hunyuan's clip used BertModel and BertTokenizer, so we copy it.
-    """
-    def __init__(self, model_dir, max_length=77, torch_dtype: torch.dtype = torch.bfloat16):
-        super(HunyuanClip, self).__init__()
-        
-        self.model_dir = model_dir
-        self.max_length = max_length
-        self.tokenizer = BertTokenizer.from_pretrained(os.path.join(model_dir, 'tokenizer'))
-        self.config = BertConfig.from_pretrained(os.path.join(model_dir, 'clip_text_encoder'))
-        self.text_encoder = BertModel(self.config)
-        self.torch_dtype = torch_dtype
-
-    def load_weight(self):
-        #1. hunyuan has visual + bert, but we only want bert here, and remember remove bert prefix
-        #2. BertModel has pooler layer, we do not need that
-        logger.info("HunyuanClip: fixing bert model weights")
-        state_dict = torch.load(os.path.join(self.model_dir, 'clip_text_encoder/pytorch_model.bin'), map_location='cpu')
-        state_dict_pruned = {k[5:] : v for k, v in state_dict.items() if k.startswith('bert')}
-        self.text_encoder.load_state_dict(state_dict_pruned, strict=False, assign=True)
-        self.text_encoder = self.text_encoder.to(self.torch_dtype)
-        
-    @torch.no_grad
-    def forward(self, prompts, with_mask=True, device='cuda'):
-        self.device = device
-        text_inputs = self.tokenizer(
-            prompts,
-            padding="max_length",
-            max_length=self.max_length,
-            truncation=True,
-            return_attention_mask=True,
-            return_tensors="pt",
-        )
-        prompt_embeds = self.text_encoder(
-            text_inputs.input_ids.to(self.device),
-            attention_mask=text_inputs.attention_mask.to(self.device) if with_mask else None,
-        )
-        return prompt_embeds.last_hidden_state, prompt_embeds.pooler_output
-        
\ No newline at end of file
diff --git a/videotuna/models/stepvideo/stepvideo/text_encoder/flashattention.py b/videotuna/models/stepvideo/stepvideo/text_encoder/flashattention.py
deleted file mode 100755
index 5fe543f0..00000000
--- a/videotuna/models/stepvideo/stepvideo/text_encoder/flashattention.py
+++ /dev/null
@@ -1,37 +0,0 @@
-# Copyright 2025 StepFun Inc. All Rights Reserved.
-# 
-# Permission is hereby granted, free of charge, to any person obtaining a copy
-# of this software and associated documentation files (the "Software"), to deal
-# in the Software without restriction, including without limitation the rights
-# to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
-# copies of the Software, and to permit persons to whom the Software is
-# furnished to do so, subject to the following conditions:
-#
-# The above copyright notice and this permission notice shall be included in all
-# copies or substantial portions of the Software.
-# ==============================================================================
-import torch
-
-def flash_attn_func(q, k, v, dropout_p=0.0, softmax_scale=None, causal=True,
-                    return_attn_probs=False, tp_group_rank=0, tp_group_size=1):
-    softmax_scale = q.size(-1) ** (-0.5) if softmax_scale is None else softmax_scale
-    return torch.ops.Optimus.fwd(q, k, v, None, dropout_p, softmax_scale, causal, return_attn_probs, None, tp_group_rank, tp_group_size)[0]
-
-
-class FlashSelfAttention(torch.nn.Module):
-    def __init__(
-        self,
-        attention_dropout=0.0,
-    ):
-        super().__init__()
-        self.dropout_p = attention_dropout
-
-
-    def forward(self, q, k, v, cu_seqlens=None, max_seq_len=None):
-        if cu_seqlens is None:
-            output = flash_attn_func(q, k, v, dropout_p=self.dropout_p)
-        else:
-            raise ValueError('cu_seqlens is not supported!')
-
-        return output
-    
diff --git a/videotuna/models/stepvideo/stepvideo/text_encoder/stepllm.py b/videotuna/models/stepvideo/stepvideo/text_encoder/stepllm.py
deleted file mode 100755
index bf30de79..00000000
--- a/videotuna/models/stepvideo/stepvideo/text_encoder/stepllm.py
+++ /dev/null
@@ -1,303 +0,0 @@
-# Copyright 2025 StepFun Inc. All Rights Reserved.
-# 
-# Permission is hereby granted, free of charge, to any person obtaining a copy
-# of this software and associated documentation files (the "Software"), to deal
-# in the Software without restriction, including without limitation the rights
-# to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
-# copies of the Software, and to permit persons to whom the Software is
-# furnished to do so, subject to the following conditions:
-#
-# The above copyright notice and this permission notice shall be included in all
-# copies or substantial portions of the Software.
-# ==============================================================================
-import os
-from typing import Optional
-
-import torch
-import torch.nn as nn
-import torch.nn.functional as F
-from .flashattention import FlashSelfAttention
-from ..modules.model import RMSNorm
-from .tokenizer import LLaMaEmbedding, Wrapped_StepChatTokenizer
-from ..utils import with_empty_init
-from safetensors.torch import load_file
-from transformers.modeling_utils import PretrainedConfig, PreTrainedModel
-from einops import rearrange
-import json
-from loguru import logger
-
-
-    
-def safediv(n, d):
-    q, r = divmod(n, d)
-    assert r == 0
-    return q
-
-
-class MultiQueryAttention(nn.Module):
-    def __init__(self, cfg, layer_id=None):
-        super().__init__()
-
-        self.head_dim = cfg.hidden_size // cfg.num_attention_heads
-        self.max_seq_len = cfg.seq_length
-        self.use_flash_attention = cfg.use_flash_attn
-        assert self.use_flash_attention, 'FlashAttention is required!'
-
-        self.n_groups = cfg.num_attention_groups
-        self.tp_size = 1
-        self.n_local_heads = cfg.num_attention_heads
-        self.n_local_groups = self.n_groups
-
-        self.wqkv = nn.Linear(
-            cfg.hidden_size,
-            cfg.hidden_size + self.head_dim * 2 * self.n_groups,
-            bias=False,
-        )
-        self.wo = nn.Linear(
-            cfg.hidden_size,
-            cfg.hidden_size,
-            bias=False,
-        )
-
-        assert self.use_flash_attention, 'non-Flash attention not supported yet.'
-        self.core_attention = FlashSelfAttention(attention_dropout=cfg.attention_dropout)
-        
-        self.layer_id = layer_id
-
-    def forward(
-        self,
-        x: torch.Tensor,
-        mask: Optional[torch.Tensor],
-        cu_seqlens: Optional[torch.Tensor],
-        max_seq_len: Optional[torch.Tensor],
-    ):
-        seqlen, bsz, dim = x.shape
-        xqkv = self.wqkv(x)
-
-        xq, xkv = torch.split(
-            xqkv,
-            (dim // self.tp_size,
-             self.head_dim*2*self.n_groups // self.tp_size
-            ),
-            dim=-1,
-        )
-
-        # gather on 1st dimension
-        xq = xq.view(seqlen, bsz, self.n_local_heads, self.head_dim)
-        xkv = xkv.view(seqlen, bsz, self.n_local_groups, 2 * self.head_dim)
-        xk, xv = xkv.chunk(2, -1)
-
-        # rotary embedding + flash attn
-        xq = rearrange(xq, "s b h d -> b s h d")
-        xk = rearrange(xk, "s b h d -> b s h d")
-        xv = rearrange(xv, "s b h d -> b s h d")
-
-        q_per_kv = self.n_local_heads // self.n_local_groups
-        if q_per_kv > 1:
-            b, s, h, d = xk.size()
-            if h == 1:
-                xk = xk.expand(b, s, q_per_kv, d)
-                xv = xv.expand(b, s, q_per_kv, d)
-            else:
-                ''' To cover the cases where h > 1, we have
-                    the following implementation, which is equivalent to:
-                        xk = xk.repeat_interleave(q_per_kv, dim=-2)
-                        xv = xv.repeat_interleave(q_per_kv, dim=-2)
-                    but can avoid calling aten::item() that involves cpu.
-                '''
-                idx = torch.arange(q_per_kv * h, device=xk.device).reshape(q_per_kv, -1).permute(1, 0).flatten()
-                xk = torch.index_select(xk.repeat(1, 1, q_per_kv, 1), 2, idx).contiguous()
-                xv = torch.index_select(xv.repeat(1, 1, q_per_kv, 1), 2, idx).contiguous()
-
-        if self.use_flash_attention:
-            output = self.core_attention(xq, xk, xv,
-                                      cu_seqlens=cu_seqlens,
-                                      max_seq_len=max_seq_len)
-            # reduce-scatter only support first dimention now
-            output = rearrange(output, "b s h d -> s b (h d)").contiguous()
-        else:
-            xq, xk, xv = [
-                rearrange(x, "b s ... -> s b ...").contiguous()
-                for x in (xq, xk, xv)
-            ]
-            output = self.core_attention(xq, xk, xv, mask)
-        output = self.wo(output)
-        return output
-
-
-
-class FeedForward(nn.Module):
-    def __init__(
-        self,
-        cfg,
-        dim: int,
-        hidden_dim: int,
-        layer_id: int,
-        multiple_of: int=256,
-    ):
-        super().__init__()
-
-        hidden_dim = multiple_of * ((hidden_dim + multiple_of - 1) // multiple_of)
-        def swiglu(x):
-            x = torch.chunk(x, 2, dim=-1)
-            return F.silu(x[0]) * x[1]
-        self.swiglu = swiglu
-            
-        self.w1 = nn.Linear(
-            dim,
-            2 * hidden_dim,
-            bias=False,
-        )
-        self.w2 = nn.Linear(
-            hidden_dim,
-            dim,
-            bias=False,
-        )
-
-    def forward(self, x):
-        x = self.swiglu(self.w1(x))
-        output = self.w2(x)
-        return output
-
-
-
-class TransformerBlock(nn.Module):
-    def __init__(
-        self, cfg, layer_id: int
-    ):
-        super().__init__()
-
-        self.n_heads = cfg.num_attention_heads
-        self.dim = cfg.hidden_size
-        self.head_dim = cfg.hidden_size // cfg.num_attention_heads
-        self.attention = MultiQueryAttention(
-            cfg,
-            layer_id=layer_id,
-        )
-
-        self.feed_forward = FeedForward(
-            cfg,
-            dim=cfg.hidden_size,
-            hidden_dim=cfg.ffn_hidden_size,
-            layer_id=layer_id,
-        )
-        self.layer_id = layer_id
-        self.attention_norm = RMSNorm(
-            cfg.hidden_size,
-            eps=cfg.layernorm_epsilon,
-        )
-        self.ffn_norm = RMSNorm(
-            cfg.hidden_size,
-            eps=cfg.layernorm_epsilon,
-        )
-
-    def forward(
-        self,
-        x: torch.Tensor,
-        mask: Optional[torch.Tensor],
-        cu_seqlens: Optional[torch.Tensor],
-        max_seq_len: Optional[torch.Tensor],
-    ):
-        residual = self.attention.forward(
-            self.attention_norm(x), mask,
-            cu_seqlens, max_seq_len
-        )
-        h = x + residual
-        ffn_res = self.feed_forward.forward(self.ffn_norm(h))
-        out = h + ffn_res
-        return out
-
-
-class Transformer(nn.Module):
-    def __init__(
-        self,
-        config,
-        max_seq_size=8192,
-    ):
-        super().__init__()
-        self.num_layers = config.num_layers
-        self.layers = self._build_layers(config)
-
-    def _build_layers(self, config):
-        layers = torch.nn.ModuleList()
-        for layer_id in range(self.num_layers):
-            layers.append(
-                TransformerBlock(
-                    config,
-                    layer_id=layer_id + 1 ,
-                )
-            )
-        return layers
-
-    def forward(
-        self,
-        hidden_states,
-        attention_mask,
-        cu_seqlens=None,
-        max_seq_len=None,
-    ):
-
-        if max_seq_len is not None and not isinstance(max_seq_len, torch.Tensor):
-            max_seq_len = torch.tensor(max_seq_len, dtype=torch.int32, device="cpu")
-
-        for lid, layer in enumerate(self.layers):
-            hidden_states = layer(
-                                    hidden_states,
-                                    attention_mask,
-                                    cu_seqlens,
-                                    max_seq_len,
-                                )
-        return hidden_states
-
-
-class Step1Model(PreTrainedModel):
-    config_class=PretrainedConfig
-    @with_empty_init
-    def __init__(
-        self,
-        config,
-    ):
-        super().__init__(config)
-        self.tok_embeddings = LLaMaEmbedding(config)
-        self.transformer = Transformer(config)
-
-    def forward(
-        self,
-        input_ids=None,
-        attention_mask=None,
-    ):
-
-        hidden_states = self.tok_embeddings(input_ids)
-
-        hidden_states = self.transformer(
-            hidden_states,
-            attention_mask,
-        )
-        return hidden_states
-    
-    
-
-class STEP1TextEncoder(torch.nn.Module):
-    def __init__(self, model_dir, max_length=320):
-        super(STEP1TextEncoder, self).__init__()
-        self.max_length = max_length
-        self.text_tokenizer = Wrapped_StepChatTokenizer(os.path.join(model_dir, 'step1_chat_tokenizer.model'))
-        logger.info("Directly loading STEP1TextEncoder weights")
-        self.text_encoder = Step1Model.from_pretrained(model_dir, torch_dtype=torch.bfloat16)
-        
-    @torch.no_grad
-    def forward(self, prompts, with_mask=True, max_length=None, device='cuda'):
-        with torch.no_grad(), torch.cuda.amp.autocast(dtype=torch.bfloat16):
-            if type(prompts) is str:
-                prompts = [prompts]
-            
-            txt_tokens = self.text_tokenizer(
-                prompts, max_length=max_length or self.max_length, padding="max_length", truncation=True, return_tensors="pt"
-            )
-            y = self.text_encoder(
-                txt_tokens.input_ids.to(device), 
-                attention_mask=txt_tokens.attention_mask.to(device) if with_mask else None
-            )
-            y_mask = txt_tokens.attention_mask
-        return y.transpose(0,1), y_mask
-
diff --git a/videotuna/models/stepvideo/stepvideo/text_encoder/tokenizer.py b/videotuna/models/stepvideo/stepvideo/text_encoder/tokenizer.py
deleted file mode 100755
index 6fa7b48b..00000000
--- a/videotuna/models/stepvideo/stepvideo/text_encoder/tokenizer.py
+++ /dev/null
@@ -1,206 +0,0 @@
-# Copyright 2025 StepFun Inc. All Rights Reserved.
-# 
-# Permission is hereby granted, free of charge, to any person obtaining a copy
-# of this software and associated documentation files (the "Software"), to deal
-# in the Software without restriction, including without limitation the rights
-# to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
-# copies of the Software, and to permit persons to whom the Software is
-# furnished to do so, subject to the following conditions:
-#
-# The above copyright notice and this permission notice shall be included in all
-# copies or substantial portions of the Software.
-# ==============================================================================
-import torch.nn as nn
-import torch
-from typing import List
-
-
-class LLaMaEmbedding(nn.Module):
-    """Language model embeddings.
-
-    Arguments:
-        hidden_size: hidden size
-        vocab_size: vocabulary size
-        max_sequence_length: maximum size of sequence. This
-                             is used for positional embedding
-        embedding_dropout_prob: dropout probability for embeddings
-        init_method: weight initialization method
-        num_tokentypes: size of the token-type embeddings. 0 value
-                        will ignore this embedding
-    """
-
-    def __init__(self,
-                 cfg,
-                 ):
-        super().__init__()
-        self.hidden_size = cfg.hidden_size
-        self.params_dtype = cfg.params_dtype
-        self.fp32_residual_connection = cfg.fp32_residual_connection 
-        self.embedding_weights_in_fp32 = cfg.embedding_weights_in_fp32
-        self.word_embeddings = torch.nn.Embedding(
-            cfg.padded_vocab_size, self.hidden_size,
-        )
-        self.embedding_dropout = torch.nn.Dropout(cfg.hidden_dropout)
-
-    def forward(self, input_ids):
-        # Embeddings.
-        if self.embedding_weights_in_fp32:
-            self.word_embeddings = self.word_embeddings.to(torch.float32)
-        embeddings = self.word_embeddings(input_ids)
-        if self.embedding_weights_in_fp32:
-            embeddings = embeddings.to(self.params_dtype)
-            self.word_embeddings = self.word_embeddings.to(self.params_dtype)
-
-        # Data format change to avoid explicit tranposes : [b s h] --> [s b h].
-        embeddings = embeddings.transpose(0, 1).contiguous()
-
-        # If the input flag for fp32 residual connection is set, convert for float.
-        if self.fp32_residual_connection:
-            embeddings = embeddings.float()
-
-        # Dropout.
-        embeddings = self.embedding_dropout(embeddings)
-
-        return embeddings
-
-
-
-class StepChatTokenizer:
-    """Step Chat Tokenizer"""
-
-    def __init__(
-        self, model_file, name="StepChatTokenizer",
-        bot_token="<|BOT|>",  # Begin of Turn
-        eot_token="<|EOT|>",  # End of Turn
-        call_start_token="<|CALL_START|>",      # Call Start
-        call_end_token="<|CALL_END|>",          # Call End
-        think_start_token="<|THINK_START|>",    # Think Start
-        think_end_token="<|THINK_END|>",        # Think End
-        mask_start_token="<|MASK_1e69f|>",      # Mask start
-        mask_end_token="<|UNMASK_1e69f|>",      # Mask end
-    ):
-        import sentencepiece
-
-        self._tokenizer = sentencepiece.SentencePieceProcessor(model_file=model_file)
-
-        self._vocab = {}
-        self._inv_vocab = {}
-
-        self._special_tokens = {}
-        self._inv_special_tokens = {}
-
-        self._t5_tokens = []
-
-        for idx in range(self._tokenizer.get_piece_size()):
-            text = self._tokenizer.id_to_piece(idx)
-            self._inv_vocab[idx] = text
-            self._vocab[text] = idx
-
-            if self._tokenizer.is_control(idx) or self._tokenizer.is_unknown(idx):
-                self._special_tokens[text] = idx
-                self._inv_special_tokens[idx] = text
-
-        self._unk_id = self._tokenizer.unk_id()
-        self._bos_id = self._tokenizer.bos_id()
-        self._eos_id = self._tokenizer.eos_id()
-
-        for token in [
-            bot_token, eot_token, call_start_token, call_end_token,
-            think_start_token, think_end_token
-        ]:
-            assert token in self._vocab, f"Token '{token}' not found in tokenizer"
-            assert token in self._special_tokens, f"Token '{token}' is not a special token"
-
-        for token in [mask_start_token, mask_end_token]:
-            assert token in self._vocab, f"Token '{token}' not found in tokenizer"
-
-        self._bot_id = self._tokenizer.piece_to_id(bot_token)
-        self._eot_id = self._tokenizer.piece_to_id(eot_token)
-        self._call_start_id = self._tokenizer.piece_to_id(call_start_token)
-        self._call_end_id = self._tokenizer.piece_to_id(call_end_token)
-        self._think_start_id = self._tokenizer.piece_to_id(think_start_token)
-        self._think_end_id = self._tokenizer.piece_to_id(think_end_token)
-        self._mask_start_id = self._tokenizer.piece_to_id(mask_start_token)
-        self._mask_end_id = self._tokenizer.piece_to_id(mask_end_token)
-
-        self._underline_id = self._tokenizer.piece_to_id("\u2581")
-        
-    @property
-    def vocab(self):
-        return self._vocab
-
-    @property
-    def inv_vocab(self):
-        return self._inv_vocab
-
-    @property
-    def vocab_size(self):
-        return self._tokenizer.vocab_size()
-
-    def tokenize(self, text: str) -> List[int]:
-        return self._tokenizer.encode_as_ids(text)
-
-    def detokenize(self, token_ids: List[int]) -> str:
-        return self._tokenizer.decode_ids(token_ids)
-
-    
-class Tokens:
-    def __init__(self, input_ids, cu_input_ids, attention_mask, cu_seqlens, max_seq_len) -> None:
-        self.input_ids = input_ids
-        self.attention_mask = attention_mask
-        self.cu_input_ids = cu_input_ids
-        self.cu_seqlens = cu_seqlens
-        self.max_seq_len = max_seq_len
-    def to(self, device):
-        self.input_ids = self.input_ids.to(device)
-        self.attention_mask = self.attention_mask.to(device)
-        self.cu_input_ids = self.cu_input_ids.to(device)
-        self.cu_seqlens = self.cu_seqlens.to(device)
-        return self
-    
-class Wrapped_StepChatTokenizer(StepChatTokenizer):
-    def __call__(self, text, max_length=320, padding="max_length", truncation=True, return_tensors="pt"):
-        # [bos, ..., eos, pad, pad, ..., pad]
-        self.BOS = 1
-        self.EOS = 2
-        self.PAD = 2
-        out_tokens = []
-        attn_mask = []
-        if len(text) == 0:
-            part_tokens = [self.BOS] + [self.EOS]
-            valid_size = len(part_tokens)
-            if len(part_tokens) < max_length:
-                part_tokens += [self.PAD] * (max_length - valid_size)
-            out_tokens.append(part_tokens)
-            attn_mask.append([1]*valid_size+[0]*(max_length-valid_size))
-        else:
-            for part in text:
-                part_tokens = self.tokenize(part)
-                part_tokens = part_tokens[:(max_length - 2)] # leave 2 space for bos and eos
-                part_tokens = [self.BOS] + part_tokens + [self.EOS]
-                valid_size = len(part_tokens)
-                if len(part_tokens) < max_length:
-                    part_tokens += [self.PAD] * (max_length - valid_size)
-                out_tokens.append(part_tokens)
-                attn_mask.append([1]*valid_size+[0]*(max_length-valid_size))
-
-        out_tokens = torch.tensor(out_tokens, dtype=torch.long)
-        attn_mask = torch.tensor(attn_mask, dtype=torch.long)
-
-        # padding y based on tp size
-        padded_len = 0
-        padded_flag = True if padded_len > 0 else False
-        if padded_flag:
-            pad_tokens = torch.tensor([[self.PAD] * max_length], device=out_tokens.device)
-            pad_attn_mask = torch.tensor([[1]*padded_len+[0]*(max_length-padded_len)], device=attn_mask.device)
-            out_tokens = torch.cat([out_tokens, pad_tokens], dim=0)
-            attn_mask = torch.cat([attn_mask, pad_attn_mask], dim=0)
-        
-        # cu_seqlens
-        cu_out_tokens = out_tokens.masked_select(attn_mask != 0).unsqueeze(0)
-        seqlen = attn_mask.sum(dim=1).tolist()
-        cu_seqlens = torch.cumsum(torch.tensor([0]+seqlen), 0).to(device=out_tokens.device,dtype=torch.int32)
-        max_seq_len = max(seqlen)
-        return Tokens(out_tokens, cu_out_tokens, attn_mask, cu_seqlens, max_seq_len)
-    
-    
\ No newline at end of file
diff --git a/videotuna/models/stepvideo/stepvideo/utils/__init__.py b/videotuna/models/stepvideo/stepvideo/utils/__init__.py
deleted file mode 100755
index a77226b1..00000000
--- a/videotuna/models/stepvideo/stepvideo/utils/__init__.py
+++ /dev/null
@@ -1,2 +0,0 @@
-from .utils import *
-from .video_process import *
\ No newline at end of file
diff --git a/videotuna/models/stepvideo/stepvideo/utils/utils.py b/videotuna/models/stepvideo/stepvideo/utils/utils.py
deleted file mode 100755
index b0f1dc7b..00000000
--- a/videotuna/models/stepvideo/stepvideo/utils/utils.py
+++ /dev/null
@@ -1,64 +0,0 @@
-import numpy as np
-import random
-import torch
-from functools import wraps
-import torch.utils._device
-
-
-def setup_seed(seed):
-    random.seed(seed)
-    np.random.seed(seed)
-    torch.manual_seed(seed)
-    torch.cuda.manual_seed_all(seed)
-    torch.backends.cudnn.deterministic = True
-    torch.backends.cudnn.benchmark = False
-
-
-class EmptyInitOnDevice(torch.overrides.TorchFunctionMode):
-    def __init__(self, device=None):
-        self.device = device
-
-    def __torch_function__(self, func, types, args=(), kwargs=None):
-        kwargs = kwargs or {}
-        if getattr(func, '__module__', None) == 'torch.nn.init':
-            if 'tensor' in kwargs:
-                return kwargs['tensor']
-            else:
-                return args[0]
-        if self.device is not None and func in torch.utils._device._device_constructors() and kwargs.get('device') is None:
-            kwargs['device'] = self.device
-        return func(*args, **kwargs)
-    
-
-def with_empty_init(func):
-    @wraps(func)
-    def wrapper(*args, **kwargs):
-        with EmptyInitOnDevice('cpu'):
-            return func(*args, **kwargs)
-    return wrapper
-
-
-    
-    
-def culens2mask(
-    cu_seqlens=None,
-    cu_seqlens_kv=None,
-    max_seqlen=None,
-    max_seqlen_kv=None,
-    is_causal=False
-):
-    assert len(cu_seqlens) == len(cu_seqlens_kv); "q k v should have same bsz..."
-    bsz = len(cu_seqlens) - 1
-    seqlens = cu_seqlens[1:]-cu_seqlens[:-1]
-    seqlens_kv = cu_seqlens_kv[1:]-cu_seqlens_kv[:-1]
-    
-    attn_mask = torch.zeros(bsz, max_seqlen, max_seqlen_kv, dtype=torch.bool)
-    for i, (seq_len, seq_len_kv) in enumerate(zip(seqlens, seqlens_kv)):
-        if is_causal:
-            attn_mask[i, :seq_len, :seq_len_kv] = torch.triu(torch.ones(seq_len, seq_len_kv), diagonal=1).bool()
-        else:
-            attn_mask[i, :seq_len, :seq_len_kv] = torch.ones([seq_len, seq_len_kv], dtype=torch.bool)
-
-    return attn_mask
-    
-    
diff --git a/videotuna/models/stepvideo/stepvideo/utils/video_process.py b/videotuna/models/stepvideo/stepvideo/utils/video_process.py
deleted file mode 100644
index 77803179..00000000
--- a/videotuna/models/stepvideo/stepvideo/utils/video_process.py
+++ /dev/null
@@ -1,50 +0,0 @@
-import numpy as np
-import datetime
-import torch
-import os
-import imageio
-
-
-class VideoProcessor:
-    def __init__(self, save_path: str='./results', name_suffix: str=''):
-        self.save_path = save_path
-        os.makedirs(self.save_path, exist_ok=True)
-        self.name_suffix = name_suffix
-    
-    def crop2standard540p(self, vid_array):
-        _, height, width, _ = vid_array.shape
-        height_center = height//2
-        width_center = width//2
-        if width_center>height_center:  ## horizon mode
-            return vid_array[:, height_center-270:height_center+270, width_center-480:width_center+480]
-        elif width_center<height_center: ## portrait mode
-            return vid_array[:, height_center-480:height_center+480, width_center-270:width_center+270]
-        else:
-            return vid_array
-
-    def save_imageio_video(self, video_array: np.array, output_filename: str, fps=25, codec='libx264'):
-        
-        ffmpeg_params = [
-            "-vf", "atadenoise=0a=0.1:0b=0.1:1a=0.1:1b=0.1",  # denoise
-        ]
-   
-        with imageio.get_writer(output_filename, fps=fps, codec=codec, ffmpeg_params=ffmpeg_params) as vid_writer:
-            for img_array in video_array:
-                vid_writer.append_data(img_array)   
-        
-    
-    def postprocess_video(self, video_tensor, output_file_name='', output_type="mp4", crop2standard540p=True):
-        if len(self.name_suffix) == 0:
-            video_path = os.path.join(self.save_path, f"{output_file_name}-{str(datetime.datetime.now())}.{output_type}")
-        else:
-            video_path = os.path.join(self.save_path, f"{output_file_name}-{self.name_suffix}.{output_type}")
-        
-        video_tensor = (video_tensor.cpu().clamp(-1, 1)+1)*127.5
-        video_tensor = torch.cat([t for t in video_tensor], dim=-2)
-        video_array = video_tensor.clamp(0, 255).to(torch.uint8).numpy().transpose(0,2,3,1)
-        
-        if crop2standard540p:
-            video_array = self.crop2standard540p(video_array)
-
-        self.save_imageio_video(video_array, video_path)
-        print(f"Saved the generated video in {video_path}")
\ No newline at end of file
diff --git a/videotuna/models/stepvideo/stepvideo/vae/vae.py b/videotuna/models/stepvideo/stepvideo/vae/vae.py
deleted file mode 100755
index bf3e19c7..00000000
--- a/videotuna/models/stepvideo/stepvideo/vae/vae.py
+++ /dev/null
@@ -1,1014 +0,0 @@
-# Copyright 2025 StepFun Inc. All Rights Reserved.
-# 
-# Permission is hereby granted, free of charge, to any person obtaining a copy
-# of this software and associated documentation files (the "Software"), to deal
-# in the Software without restriction, including without limitation the rights
-# to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
-# copies of the Software, and to permit persons to whom the Software is
-# furnished to do so, subject to the following conditions:
-#
-# The above copyright notice and this permission notice shall be included in all
-# copies or substantial portions of the Software.
-# ==============================================================================
-import torch
-from einops import rearrange
-from torch import nn
-from torch.nn import functional as F
-from loguru import logger
-from ..utils import with_empty_init
-
-
-def base_group_norm(x, norm_layer, act_silu=False, channel_last=False):
-    if hasattr(base_group_norm, 'spatial') and base_group_norm.spatial:
-        assert channel_last == True
-        x_shape = x.shape
-        x = x.flatten(0, 1)
-        if channel_last:
-            # Permute to NCHW format
-            x = x.permute(0, 3, 1, 2)
-
-        out = F.group_norm(x.contiguous(), norm_layer.num_groups, norm_layer.weight.to(device=x.device, dtype=x.dtype), norm_layer.bias.to(device=x.device, dtype=x.dtype), norm_layer.eps)
-        if act_silu:
-            out = F.silu(out)
-        
-        if channel_last:
-            # Permute back to NHWC format
-            out = out.permute(0, 2, 3, 1)
-
-        out = out.view(x_shape)
-    else:
-        if channel_last:
-            # Permute to NCHW format
-            x = x.permute(0, 4, 1, 2, 3)
-        out = F.group_norm(x.contiguous(), norm_layer.num_groups, norm_layer.weight, norm_layer.bias, norm_layer.eps)
-        if act_silu:
-            out = F.silu(out)
-        if channel_last:
-            # Permute back to NHWC format
-            out = out.permute(0, 2, 3, 4, 1)
-    return out
-
-def base_conv2d(x, conv_layer, channel_last=False, residual=None):
-    if channel_last:
-        x = x.permute(0, 3, 1, 2)  # NHWC to NCHW
-    out = F.conv2d(x, conv_layer.weight, conv_layer.bias, stride=conv_layer.stride, padding=conv_layer.padding)
-    if residual is not None:
-        if channel_last:
-            residual = residual.permute(0, 3, 1, 2)  # NHWC to NCHW
-        out += residual
-    if channel_last:
-        out = out.permute(0, 2, 3, 1)  # NCHW to NHWC
-    return out
-
-def base_conv3d(x, conv_layer, channel_last=False, residual=None, only_return_output=False):
-    if only_return_output:
-        size = cal_outsize(x.shape, conv_layer.weight.shape, conv_layer.stride, conv_layer.padding)
-        return torch.empty(size, device=x.device, dtype=x.dtype)
-    if channel_last:
-        x = x.permute(0, 4, 1, 2, 3)  # NDHWC to NCDHW
-    out = F.conv3d(x, conv_layer.weight, conv_layer.bias, stride=conv_layer.stride, padding=conv_layer.padding)
-    if residual is not None:
-        if channel_last:
-            residual = residual.permute(0, 4, 1, 2, 3)  # NDHWC to NCDHW
-        out += residual
-    if channel_last:
-        out = out.permute(0, 2, 3, 4, 1)  # NCDHW to NDHWC
-    return out
-
-
-def cal_outsize(input_sizes, kernel_sizes, stride, padding):
-    stride_d, stride_h, stride_w = stride
-    padding_d, padding_h, padding_w = padding 
-    dilation_d, dilation_h, dilation_w = 1, 1, 1
-
-    in_d = input_sizes[1]
-    in_h = input_sizes[2]
-    in_w = input_sizes[3]
-    in_channel = input_sizes[4]
-
-
-    kernel_d = kernel_sizes[2]
-    kernel_h = kernel_sizes[3]
-    kernel_w = kernel_sizes[4]
-    out_channels = kernel_sizes[0]
-
-    out_d = calc_out_(in_d, padding_d, dilation_d, kernel_d, stride_d)
-    out_h = calc_out_(in_h, padding_h, dilation_h, kernel_h, stride_h)
-    out_w = calc_out_(in_w, padding_w, dilation_w, kernel_w, stride_w)
-    size = [input_sizes[0], out_d, out_h, out_w, out_channels]
-    return size
-
-
-
-
-def calc_out_(in_size, padding, dilation, kernel, stride):
-    return (in_size + 2 * padding - dilation * (kernel - 1) - 1) // stride + 1
-
-
-
-def base_conv3d_channel_last(x, conv_layer, residual=None):
-    in_numel = x.numel()
-    out_numel = int(x.numel() * conv_layer.out_channels / conv_layer.in_channels)
-    if (in_numel >= 2**30) or (out_numel >= 2**30):
-        assert conv_layer.stride[0] == 1, "time split asks time stride = 1"
-
-        B,T,H,W,C = x.shape
-        K = conv_layer.kernel_size[0]
-
-        chunks = 4
-        chunk_size = T // chunks
-
-        if residual is None:
-            out_nhwc = base_conv3d(x, conv_layer, channel_last=True, residual=residual, only_return_output=True)
-        else:
-            out_nhwc = residual
-
-        assert B == 1
-        outs = []
-        for i in range(chunks):
-            if i == chunks-1:
-                xi = x[:1,chunk_size*i:]
-                out_nhwci = out_nhwc[:1,chunk_size*i:]
-            else:
-                xi = x[:1,chunk_size*i:chunk_size*(i+1)+K-1]
-                out_nhwci = out_nhwc[:1,chunk_size*i:chunk_size*(i+1)]
-            if residual is not None:
-                if i == chunks-1:
-                    ri = residual[:1,chunk_size*i:]
-                else:
-                    ri = residual[:1,chunk_size*i:chunk_size*(i+1)]
-            else:
-                ri = None
-            out_nhwci.copy_(base_conv3d(xi, conv_layer, channel_last=True, residual=ri))
-    else:
-        out_nhwc = base_conv3d(x, conv_layer, channel_last=True, residual=residual)
-    return out_nhwc
-
-
-
-class Upsample2D(nn.Module):
-    def __init__(self,
-                 channels,
-                 use_conv=False,
-                 use_conv_transpose=False,
-                 out_channels=None):
-        super().__init__()
-        self.channels = channels
-        self.out_channels = out_channels or channels
-        self.use_conv = use_conv
-        self.use_conv_transpose = use_conv_transpose
-
-        if use_conv:
-            self.conv = nn.Conv2d(self.channels, self.out_channels, 3, padding=1)
-        else:
-            assert "Not Supported"
-            self.conv = nn.ConvTranspose2d(channels, self.out_channels, 4, 2, 1)
-
-    def forward(self, x, output_size=None):
-        assert x.shape[-1] == self.channels
-
-        if self.use_conv_transpose:
-            return self.conv(x)
-
-        if output_size is None:
-            x = F.interpolate(
-                x.permute(0,3,1,2).to(memory_format=torch.channels_last),
-                scale_factor=2.0, mode='nearest').permute(0,2,3,1).contiguous()
-        else:
-            x = F.interpolate(
-                x.permute(0,3,1,2).to(memory_format=torch.channels_last),
-                size=output_size, mode='nearest').permute(0,2,3,1).contiguous()
-
-        # x = self.conv(x)
-        x = base_conv2d(x, self.conv, channel_last=True)
-        return x
-
-
-class Downsample2D(nn.Module):
-    def __init__(self, channels, use_conv=False, out_channels=None, padding=1):
-        super().__init__()
-        self.channels = channels
-        self.out_channels = out_channels or channels
-        self.use_conv = use_conv
-        self.padding = padding
-        stride = 2
-
-        if use_conv:
-            self.conv = nn.Conv2d(self.channels, self.out_channels, 3, stride=stride, padding=padding)
-        else:
-            assert self.channels == self.out_channels
-            self.conv = nn.AvgPool2d(kernel_size=stride, stride=stride)
-
-    def forward(self, x):
-        assert x.shape[-1] == self.channels
-        if self.use_conv and self.padding == 0:
-            pad = (0, 0, 0, 1, 0, 1)
-            x = F.pad(x, pad, mode="constant", value=0)
-
-        assert x.shape[-1] == self.channels
-        # x = self.conv(x)
-        x = base_conv2d(x, self.conv, channel_last=True)
-        return x
-
-
-
-class CausalConv(nn.Module):
-    def __init__(self,
-        chan_in,
-        chan_out,
-        kernel_size,
-        **kwargs
-    ):
-        super().__init__()
-
-        if isinstance(kernel_size, int):
-            kernel_size = kernel_size if isinstance(kernel_size, tuple) else ((kernel_size,) * 3)
-        time_kernel_size, height_kernel_size, width_kernel_size = kernel_size
-
-        self.dilation = kwargs.pop('dilation', 1)
-        self.stride = kwargs.pop('stride', 1)
-        if isinstance(self.stride, int):
-            self.stride = (self.stride, 1, 1)
-        time_pad = self.dilation * (time_kernel_size - 1) + max((1 - self.stride[0]), 0)
-        height_pad = height_kernel_size // 2
-        width_pad = width_kernel_size // 2
-        self.time_causal_padding = (width_pad, width_pad, height_pad, height_pad, time_pad, 0)
-        self.time_uncausal_padding = (width_pad, width_pad, height_pad, height_pad, 0, 0)
-
-        self.conv = nn.Conv3d(chan_in, chan_out, kernel_size, stride=self.stride, dilation=self.dilation, **kwargs)
-        self.is_first_run = True
-
-    def forward(self, x, is_init=True, residual=None):
-        x = nn.functional.pad(x,
-            self.time_causal_padding if is_init else self.time_uncausal_padding)
-
-        x = self.conv(x)
-        if residual is not None:
-            x.add_(residual)
-        return x
-
-
-class ChannelDuplicatingPixelUnshuffleUpSampleLayer3D(nn.Module):
-    def __init__(
-        self,
-        in_channels: int,
-        out_channels: int,
-        factor: int,
-    ):
-        super().__init__()
-        self.in_channels = in_channels
-        self.out_channels = out_channels
-        self.factor = factor
-        assert out_channels * factor**3 % in_channels == 0
-        self.repeats = out_channels * factor**3 // in_channels
-
-    def forward(self, x: torch.Tensor, is_init=True) -> torch.Tensor:
-        x = x.repeat_interleave(self.repeats, dim=1)
-        x = x.view(x.size(0), self.out_channels, self.factor, self.factor, self.factor, x.size(2), x.size(3), x.size(4))
-        x = x.permute(0, 1, 5, 2, 6, 3, 7, 4).contiguous()
-        x = x.view(x.size(0), self.out_channels, x.size(2)*self.factor, x.size(4)*self.factor, x.size(6)*self.factor)
-        x = x[:, :, self.factor - 1:, :, :]
-        return x
-
-class ConvPixelShuffleUpSampleLayer3D(nn.Module):
-    def __init__(
-        self,
-        in_channels: int,
-        out_channels: int,
-        kernel_size: int,
-        factor: int,
-    ):
-        super().__init__()
-        self.factor = factor
-        out_ratio = factor**3
-        self.conv = CausalConv(
-            in_channels,
-            out_channels * out_ratio,
-            kernel_size=kernel_size
-        )
-
-    def forward(self, x: torch.Tensor, is_init=True) -> torch.Tensor:
-        x = self.conv(x, is_init)
-        x = self.pixel_shuffle_3d(x, self.factor)
-        return x
-
-    @staticmethod
-    def pixel_shuffle_3d(x: torch.Tensor, factor: int) -> torch.Tensor:
-        batch_size, channels, depth, height, width = x.size()
-        new_channels = channels // (factor ** 3)
-        new_depth = depth * factor
-        new_height = height * factor
-        new_width = width * factor
-
-        x = x.view(batch_size, new_channels, factor, factor, factor, depth, height, width)
-        x = x.permute(0, 1, 5, 2, 6, 3, 7, 4).contiguous()
-        x = x.view(batch_size, new_channels, new_depth, new_height, new_width)
-        x = x[:, :, factor - 1:, :, :]
-        return x
-
-class ConvPixelUnshuffleDownSampleLayer3D(nn.Module):
-    def __init__(
-        self,
-        in_channels: int,
-        out_channels: int,
-        kernel_size: int,
-        factor: int,
-    ):
-        super().__init__()
-        self.factor = factor
-        out_ratio = factor**3
-        assert out_channels % out_ratio == 0
-        self.conv = CausalConv(
-            in_channels,
-            out_channels // out_ratio,
-            kernel_size=kernel_size
-        )
-
-    def forward(self, x: torch.Tensor, is_init=True) -> torch.Tensor:
-        x = self.conv(x, is_init)
-        x = self.pixel_unshuffle_3d(x, self.factor)
-        return x
-
-    @staticmethod
-    def pixel_unshuffle_3d(x: torch.Tensor, factor: int) -> torch.Tensor:
-        pad = (0, 0, 0, 0, factor-1, 0)  # (left, right, top, bottom, front, back)
-        x = F.pad(x, pad)
-        B, C, D, H, W = x.shape
-        x = x.view(B, C, D // factor, factor, H // factor, factor, W // factor, factor)
-        x = x.permute(0, 1, 3, 5, 7, 2, 4, 6).contiguous()
-        x = x.view(B, C * factor**3, D // factor, H // factor, W // factor)
-        return x
-
-class PixelUnshuffleChannelAveragingDownSampleLayer3D(nn.Module):
-    def __init__(
-        self,
-        in_channels: int,
-        out_channels: int,
-        factor: int,
-    ):
-        super().__init__()
-        self.in_channels = in_channels
-        self.out_channels = out_channels
-        self.factor = factor
-        assert in_channels * factor**3 % out_channels == 0
-        self.group_size = in_channels * factor**3 // out_channels
-
-    def forward(self, x: torch.Tensor, is_init=True) -> torch.Tensor:
-        pad = (0, 0, 0, 0, self.factor-1, 0)  # (left, right, top, bottom, front, back)
-        x = F.pad(x, pad)
-        B, C, D, H, W = x.shape
-        x = x.view(B, C, D // self.factor, self.factor, H // self.factor, self.factor, W // self.factor, self.factor)
-        x = x.permute(0, 1, 3, 5, 7, 2, 4, 6).contiguous()
-        x = x.view(B, C * self.factor**3, D // self.factor, H // self.factor, W // self.factor)
-        x = x.view(B, self.out_channels, self.group_size, D // self.factor, H // self.factor, W // self.factor)
-        x = x.mean(dim=2)
-        return x
-
-    def __init__(
-        self,
-        in_channels: int,
-        out_channels: int,
-        factor: int,
-    ):
-        super().__init__()
-        self.in_channels = in_channels
-        self.out_channels = out_channels
-        self.factor = factor
-        assert in_channels * factor**3 % out_channels == 0
-        self.group_size = in_channels * factor**3 // out_channels
-
-    def forward(self, x: torch.Tensor, is_init=True) -> torch.Tensor:
-        pad = (0, 0, 0, 0, self.factor-1, 0)  # (left, right, top, bottom, front, back)
-        x = F.pad(x, pad)
-        B, C, D, H, W = x.shape
-        x = x.view(B, C, D // self.factor, self.factor, H // self.factor, self.factor, W // self.factor, self.factor)
-        x = x.permute(0, 1, 3, 5, 7, 2, 4, 6).contiguous()
-        x = x.view(B, C * self.factor**3, D // self.factor, H // self.factor, W // self.factor)
-        x = x.view(B, self.out_channels, self.group_size, D // self.factor, H // self.factor, W // self.factor)
-        x = x.mean(dim=2)
-        return x
-
-
-
-
-def base_group_norm_with_zero_pad(x, norm_layer, act_silu=True, pad_size=2):
-    out_shape = list(x.shape)
-    out_shape[1] += pad_size
-    out = torch.empty(out_shape, dtype=x.dtype, device=x.device)
-    out[:, pad_size:] = base_group_norm(x, norm_layer, act_silu=act_silu, channel_last=True)
-    out[:, :pad_size] = 0
-    return out
-
-
-class CausalConvChannelLast(CausalConv):
-    def __init__(self,
-        chan_in,
-        chan_out,
-        kernel_size,
-        **kwargs
-    ):
-        super().__init__(
-            chan_in, chan_out, kernel_size, **kwargs)
-
-        self.time_causal_padding = (0, 0) + self.time_causal_padding
-        self.time_uncausal_padding = (0, 0) + self.time_uncausal_padding
-
-    def forward(self, x, is_init=True, residual=None):
-        if self.is_first_run:
-            self.is_first_run = False
-            # self.conv.weight = nn.Parameter(self.conv.weight.permute(0,2,3,4,1).contiguous())
-
-        x = nn.functional.pad(x,
-            self.time_causal_padding if is_init else self.time_uncausal_padding)
-
-        x = base_conv3d_channel_last(x, self.conv, residual=residual)
-        return x
-
-class CausalConvAfterNorm(CausalConv):
-    def __init__(self,
-        chan_in,
-        chan_out,
-        kernel_size,
-        **kwargs
-    ):
-        super().__init__(
-            chan_in, chan_out, kernel_size, **kwargs)
-
-        if self.time_causal_padding == (1, 1, 1, 1, 2, 0):
-            self.conv = nn.Conv3d(chan_in, chan_out, kernel_size, stride=self.stride, dilation=self.dilation, padding=(0, 1, 1), **kwargs)
-        else:
-            self.conv = nn.Conv3d(chan_in, chan_out, kernel_size, stride=self.stride, dilation=self.dilation, **kwargs)
-        self.is_first_run = True
-
-    def forward(self, x, is_init=True, residual=None):
-        if self.is_first_run:
-            self.is_first_run = False
-
-        if self.time_causal_padding == (1, 1, 1, 1, 2, 0):
-            pass
-        else:
-            x = nn.functional.pad(x, self.time_causal_padding).contiguous()
-
-        x = base_conv3d_channel_last(x, self.conv, residual=residual)
-        return x
-
-class AttnBlock(nn.Module):
-    def __init__(self,
-        in_channels
-    ):
-        super().__init__()
-
-        self.norm = nn.GroupNorm(num_groups=32, num_channels=in_channels)
-        self.q        = CausalConvChannelLast(in_channels, in_channels, kernel_size=1)
-        self.k        = CausalConvChannelLast(in_channels, in_channels, kernel_size=1)
-        self.v        = CausalConvChannelLast(in_channels, in_channels, kernel_size=1)
-        self.proj_out = CausalConvChannelLast(in_channels, in_channels, kernel_size=1)
-
-    def attention(self, x, is_init=True):
-        x = base_group_norm(x, self.norm, act_silu=False, channel_last=True)
-        q = self.q(x, is_init)
-        k = self.k(x, is_init)
-        v = self.v(x, is_init)
-
-        b, t, h, w, c = q.shape
-        q, k, v = map(lambda x: rearrange(x, "b t h w c -> b 1 (t h w) c"), (q, k, v))
-        x = nn.functional.scaled_dot_product_attention(q, k, v, is_causal=True)
-        x = rearrange(x, "b 1 (t h w) c -> b t h w c", t=t, h=h, w=w)
-
-        return x
-
-    def forward(self, x):
-        x = x.permute(0,2,3,4,1).contiguous()
-        h = self.attention(x)
-        x = self.proj_out(h, residual=x)
-        x = x.permute(0,4,1,2,3)
-        return x
-
-class Resnet3DBlock(nn.Module):
-    def __init__(self,
-        in_channels,
-        out_channels=None,
-        temb_channels=512,
-        conv_shortcut=False,
-    ):
-        super().__init__()
-
-        self.in_channels = in_channels
-        out_channels = in_channels if out_channels is None else out_channels
-        self.out_channels = out_channels
-
-        self.norm1 = nn.GroupNorm(num_groups=32, num_channels=in_channels)
-        self.conv1 = CausalConvAfterNorm(in_channels, out_channels, kernel_size=3)
-        if temb_channels > 0:
-            self.temb_proj = nn.Linear(temb_channels, out_channels)
-
-        self.norm2 = nn.GroupNorm(num_groups=32, num_channels=out_channels)
-        self.conv2 = CausalConvAfterNorm(out_channels, out_channels, kernel_size=3)
-
-        assert conv_shortcut is False
-        self.use_conv_shortcut = conv_shortcut
-        if self.in_channels != self.out_channels:
-            if self.use_conv_shortcut:
-                self.conv_shortcut = CausalConvAfterNorm(in_channels, out_channels, kernel_size=3)
-            else:
-                self.nin_shortcut = CausalConvAfterNorm(in_channels, out_channels, kernel_size=1)
-
-    def forward(self, x, temb=None, is_init=True):
-        x = x.permute(0,2,3,4,1).contiguous()
-
-        h = base_group_norm_with_zero_pad(x, self.norm1, act_silu=True, pad_size=2)
-        h = self.conv1(h)
-        if temb is not None:
-            h = h + self.temb_proj(nn.functional.silu(temb))[:, :, None, None]
-
-        x = self.nin_shortcut(x) if self.in_channels != self.out_channels else x
-
-        h = base_group_norm_with_zero_pad(h, self.norm2, act_silu=True, pad_size=2)
-        x = self.conv2(h, residual=x)
-
-        x = x.permute(0,4,1,2,3)
-        return x
-
-
-class Downsample3D(nn.Module):
-    def __init__(self,
-        in_channels,
-        with_conv,
-        stride
-    ):
-        super().__init__()
-
-        self.with_conv = with_conv
-        if with_conv:
-            self.conv = CausalConv(in_channels, in_channels, kernel_size=3, stride=stride)
-
-    def forward(self, x, is_init=True):
-        if self.with_conv:
-            x = self.conv(x, is_init)
-        else:
-            x = nn.functional.avg_pool3d(x, kernel_size=2, stride=2)
-        return x
-
-class VideoEncoder(nn.Module):
-    def __init__(self,
-        ch=32,
-        ch_mult=(4, 8, 16, 16),
-        num_res_blocks=2,
-        in_channels=3,
-        z_channels=16,
-        double_z=True,
-        down_sampling_layer=[1, 2],
-        resamp_with_conv=True,
-        version=1,
-    ):
-        super().__init__()
-
-        temb_ch = 0
-
-        self.num_resolutions = len(ch_mult)
-        self.num_res_blocks = num_res_blocks
-
-        # downsampling
-        self.conv_in = CausalConv(in_channels, ch, kernel_size=3)
-        self.down_sampling_layer = down_sampling_layer
-
-        in_ch_mult = (1,) + tuple(ch_mult)
-        self.down = nn.ModuleList()
-        for i_level in range(self.num_resolutions):
-            block = nn.ModuleList()
-            attn = nn.ModuleList()
-            block_in = ch * in_ch_mult[i_level]
-            block_out = ch * ch_mult[i_level]
-            for i_block in range(self.num_res_blocks):
-                block.append(
-                    Resnet3DBlock(in_channels=block_in, out_channels=block_out, temb_channels=temb_ch))
-                block_in = block_out
-            down = nn.Module()
-            down.block = block
-            down.attn = attn
-            if i_level != self.num_resolutions - 1:
-                if i_level in self.down_sampling_layer:
-                    down.downsample = Downsample3D(block_in, resamp_with_conv, stride=(2, 2, 2))
-                else:
-                    down.downsample = Downsample2D(block_in, resamp_with_conv, padding=0) #DIFF
-            self.down.append(down)
-
-        # middle
-        self.mid = nn.Module()
-        self.mid.block_1 = Resnet3DBlock(in_channels=block_in, out_channels=block_in, temb_channels=temb_ch)
-        self.mid.attn_1 = AttnBlock(block_in)
-        self.mid.block_2 = Resnet3DBlock(in_channels=block_in, out_channels=block_in, temb_channels=temb_ch)
-
-        # end
-        self.norm_out = nn.GroupNorm(num_groups=32, num_channels=block_in)
-        self.version = version
-        if version == 2:
-            channels = 4 * z_channels * 2 ** 3
-            self.conv_patchify = ConvPixelUnshuffleDownSampleLayer3D(block_in, channels, kernel_size=3, factor=2)
-            self.shortcut_pathify = PixelUnshuffleChannelAveragingDownSampleLayer3D(block_in, channels, 2)
-            self.shortcut_out = PixelUnshuffleChannelAveragingDownSampleLayer3D(channels, 2 * z_channels if double_z else z_channels, 1)
-            self.conv_out = CausalConvChannelLast(channels, 2 * z_channels if double_z else z_channels, kernel_size=3)
-        else:
-            self.conv_out = CausalConvAfterNorm(block_in, 2 * z_channels if double_z else z_channels, kernel_size=3)
-
-    @torch.inference_mode()
-    def forward(self, x, video_frame_num, is_init=True):
-        # timestep embedding
-        temb = None
-
-        t = video_frame_num
-
-        # downsampling
-        h = self.conv_in(x, is_init)
-
-        # make it real channel last, but behave like normal layout
-        h = h.permute(0,2,3,4,1).contiguous().permute(0,4,1,2,3)
-
-        for i_level in range(self.num_resolutions):
-            for i_block in range(self.num_res_blocks):
-                h = self.down[i_level].block[i_block](h, temb, is_init)
-                if len(self.down[i_level].attn) > 0:
-                    h = self.down[i_level].attn[i_block](h)
-
-            if i_level != self.num_resolutions - 1:
-                if isinstance(self.down[i_level].downsample, Downsample2D):
-                    _, _, t, _, _ = h.shape
-                    h = rearrange(h, "b c t h w -> (b t) h w c", t=t)
-                    h = self.down[i_level].downsample(h)
-                    h = rearrange(h, "(b t) h w c -> b c t h w", t=t)
-                else:
-                    h = self.down[i_level].downsample(h, is_init)
-
-        h = self.mid.block_1(h, temb, is_init)
-        h = self.mid.attn_1(h)
-        h = self.mid.block_2(h, temb, is_init)
-
-        h = h.permute(0,2,3,4,1).contiguous() # b c l h w -> b l h w c
-        if self.version == 2:
-            h = base_group_norm(h, self.norm_out, act_silu=True, channel_last=True)
-            h = h.permute(0,4,1,2,3).contiguous()
-            shortcut = self.shortcut_pathify(h, is_init)
-            h = self.conv_patchify(h, is_init)
-            h = h.add_(shortcut)
-            shortcut = self.shortcut_out(h, is_init).permute(0,2,3,4,1)
-            h = self.conv_out(h.permute(0,2,3,4,1).contiguous(), is_init)
-            h = h.add_(shortcut)
-        else:
-            h = base_group_norm_with_zero_pad(h, self.norm_out, act_silu=True, pad_size=2)
-            h = self.conv_out(h, is_init)
-        h = h.permute(0,4,1,2,3) # b l h w c -> b c l h w
-
-        h = rearrange(h, "b c t h w -> b t c h w")
-        return h
-
-
-class Res3DBlockUpsample(nn.Module):
-    def __init__(self,
-        input_filters,
-        num_filters,
-        down_sampling_stride,
-        down_sampling=False
-    ):
-        super().__init__()
-
-        self.input_filters = input_filters
-        self.num_filters = num_filters
-
-        self.act_ = nn.SiLU(inplace=True)
-
-        self.conv1 = CausalConvChannelLast(num_filters, num_filters, kernel_size=[3, 3, 3])
-        self.norm1 = nn.GroupNorm(32, num_filters)
-
-        self.conv2 = CausalConvChannelLast(num_filters, num_filters, kernel_size=[3, 3, 3])
-        self.norm2 = nn.GroupNorm(32, num_filters)
-
-        self.down_sampling = down_sampling
-        if down_sampling:
-            self.down_sampling_stride = down_sampling_stride
-        else:
-            self.down_sampling_stride = [1, 1, 1]
-
-        if num_filters != input_filters or down_sampling:
-            self.conv3 = CausalConvChannelLast(input_filters, num_filters, kernel_size=[1, 1, 1], stride=self.down_sampling_stride)
-            self.norm3 = nn.GroupNorm(32, num_filters)
-
-    def forward(self, x, is_init=False):
-        x = x.permute(0,2,3,4,1).contiguous()
-
-        residual = x
-
-        h = self.conv1(x, is_init)
-        h = base_group_norm(h, self.norm1, act_silu=True, channel_last=True)
-
-        h = self.conv2(h, is_init)
-        h = base_group_norm(h, self.norm2, act_silu=False, channel_last=True)
-
-        if self.down_sampling or self.num_filters != self.input_filters:
-            x = self.conv3(x, is_init)
-            x = base_group_norm(x, self.norm3, act_silu=False, channel_last=True)
-
-        h.add_(x)
-        h = self.act_(h)
-        if residual is not None:
-            h.add_(residual)
-
-        h = h.permute(0,4,1,2,3)
-        return h
-
-class Upsample3D(nn.Module):
-    def __init__(self,
-        in_channels,
-        scale_factor=2
-    ):
-        super().__init__()
-
-        self.scale_factor = scale_factor
-        self.conv3d = Res3DBlockUpsample(input_filters=in_channels,
-                                         num_filters=in_channels,
-                                         down_sampling_stride=(1, 1, 1),
-                                         down_sampling=False)
-
-    def forward(self, x, is_init=True, is_split=True):
-        b, c, t, h, w = x.shape
-
-        # x = x.permute(0,2,3,4,1).contiguous().permute(0,4,1,2,3).to(memory_format=torch.channels_last_3d)
-        if is_split:
-            split_size = c // 8
-            x_slices = torch.split(x, split_size, dim=1)
-            x = [nn.functional.interpolate(x, scale_factor=self.scale_factor) for x in x_slices]
-            x = torch.cat(x, dim=1)
-        else:
-            x = nn.functional.interpolate(x, scale_factor=self.scale_factor)
-
-        x = self.conv3d(x, is_init)
-        return x
-
-class VideoDecoder(nn.Module):
-    def __init__(self,
-        ch=128,
-        z_channels=16,
-        out_channels=3,
-        ch_mult=(1, 2, 4, 4),
-        num_res_blocks=2,
-        temporal_up_layers=[2, 3],
-        temporal_downsample=4,
-        resamp_with_conv=True,
-        version=1,
-    ):
-        super().__init__()
-
-        temb_ch = 0
-
-        self.num_resolutions = len(ch_mult)
-        self.num_res_blocks = num_res_blocks
-        self.temporal_downsample = temporal_downsample
-
-        block_in = ch * ch_mult[self.num_resolutions - 1]
-        self.version = version
-        if version == 2:
-            channels = 4 * z_channels * 2 ** 3
-            self.conv_in = CausalConv(z_channels, channels, kernel_size=3)
-            self.shortcut_in = ChannelDuplicatingPixelUnshuffleUpSampleLayer3D(z_channels, channels, 1)
-            self.conv_unpatchify = ConvPixelShuffleUpSampleLayer3D(channels, block_in, kernel_size=3, factor=2)
-            self.shortcut_unpathify = ChannelDuplicatingPixelUnshuffleUpSampleLayer3D(channels, block_in, 2)
-        else:
-            self.conv_in = CausalConv(z_channels, block_in, kernel_size=3)
-
-        # middle
-        self.mid = nn.Module()
-        self.mid.block_1 = Resnet3DBlock(in_channels=block_in, out_channels=block_in, temb_channels=temb_ch)
-        self.mid.attn_1 = AttnBlock(block_in)
-        self.mid.block_2 = Resnet3DBlock(in_channels=block_in, out_channels=block_in, temb_channels=temb_ch)
-
-        # upsampling
-        self.up_id = len(temporal_up_layers)
-        self.video_frame_num = 1
-        self.cur_video_frame_num = self.video_frame_num // 2 ** self.up_id + 1
-        self.up = nn.ModuleList()
-        for i_level in reversed(range(self.num_resolutions)):
-            block = nn.ModuleList()
-            attn = nn.ModuleList()
-            block_out = ch * ch_mult[i_level]
-            for i_block in range(self.num_res_blocks + 1):
-                block.append(
-                    Resnet3DBlock(in_channels=block_in, out_channels=block_out, temb_channels=temb_ch))
-                block_in = block_out
-            up = nn.Module()
-            up.block = block
-            up.attn = attn
-            if i_level != 0:
-                if i_level in temporal_up_layers:
-                    up.upsample = Upsample3D(block_in)
-                    self.cur_video_frame_num = self.cur_video_frame_num * 2
-                else:
-                    up.upsample = Upsample2D(block_in, resamp_with_conv)
-            self.up.insert(0, up)  # prepend to get consistent order
-
-        # end
-        self.norm_out = nn.GroupNorm(num_groups=32, num_channels=block_in)
-        self.conv_out = CausalConvAfterNorm(block_in, out_channels, kernel_size=3)
-
-    @torch.inference_mode()
-    def forward(self, z, is_init=True):
-        z = rearrange(z, "b t c h w -> b c t h w")
-
-        h = self.conv_in(z, is_init=is_init)
-        if self.version == 2:
-            shortcut = self.shortcut_in(z, is_init=is_init)
-            h = h.add_(shortcut)
-            shortcut = self.shortcut_unpathify(h, is_init=is_init)
-            h = self.conv_unpatchify(h, is_init=is_init)
-            h = h.add_(shortcut)
-
-        temb = None
-
-        h = h.permute(0,2,3,4,1).contiguous().permute(0,4,1,2,3)
-        h = self.mid.block_1(h, temb, is_init=is_init)
-        h = self.mid.attn_1(h)
-        h = h.permute(0,2,3,4,1).contiguous().permute(0,4,1,2,3)
-        h = self.mid.block_2(h, temb, is_init=is_init)
-
-        # upsampling
-        for i_level in reversed(range(self.num_resolutions)):
-            for i_block in range(self.num_res_blocks + 1):
-                h = h.permute(0,2,3,4,1).contiguous().permute(0,4,1,2,3)
-                h = self.up[i_level].block[i_block](h, temb, is_init=is_init)
-                if len(self.up[i_level].attn) > 0:
-                    h = self.up[i_level].attn[i_block](h)
-            if i_level != 0:
-                if isinstance(self.up[i_level].upsample, Upsample2D) or (hasattr(self.up[i_level].upsample, "module") and isinstance(self.up[i_level].upsample.module, Upsample2D)):
-                    B = h.size(0)
-                    h = h.permute(0,2,3,4,1).flatten(0,1)
-                    h = self.up[i_level].upsample(h)
-                    h = h.unflatten(0, (B, -1)).permute(0,4,1,2,3)
-                else:
-                    h = self.up[i_level].upsample(h, is_init=is_init)
-
-        # end
-        h = h.permute(0,2,3,4,1) # b c l h w -> b l h w c
-        h = base_group_norm_with_zero_pad(h, self.norm_out, act_silu=True, pad_size=2)
-        h = self.conv_out(h)
-        h = h.permute(0,4,1,2,3)
-
-        if is_init:
-            h = h[:, :, (self.temporal_downsample - 1):]
-        return h
-
-
-
-def rms_norm(input, normalized_shape, eps=1e-6):
-    dtype = input.dtype
-    input = input.to(torch.float32)
-    variance = input.pow(2).flatten(-len(normalized_shape)).mean(-1)[(...,) + (None,) * len(normalized_shape)]
-    input = input * torch.rsqrt(variance + eps)
-    return input.to(dtype)
-
-class DiagonalGaussianDistribution(object):
-    def __init__(self, parameters, deterministic=False, rms_norm_mean=False, only_return_mean=False):
-        self.parameters = parameters
-        self.mean, self.logvar = torch.chunk(parameters, 2, dim=-3) #N,[X],C,H,W
-        self.logvar = torch.clamp(self.logvar, -30.0, 20.0)
-        self.std = torch.exp(0.5 * self.logvar)
-        self.var = torch.exp(self.logvar)
-        self.deterministic = deterministic
-        if self.deterministic:
-            self.var = self.std = torch.zeros_like(
-                self.mean,
-                device=self.parameters.device,
-                dtype=self.parameters.dtype)
-        if rms_norm_mean:
-            self.mean = rms_norm(self.mean, self.mean.size()[1:])
-        self.only_return_mean = only_return_mean
-
-    def sample(self, generator=None):
-        # make sure sample is on the same device
-        # as the parameters and has same dtype
-        sample = torch.randn(
-            self.mean.shape, generator=generator, device=self.parameters.device)
-        sample = sample.to(dtype=self.parameters.dtype)
-        x = self.mean + self.std * sample
-        if self.only_return_mean:
-            return self.mean
-        else:
-            return x
-
-class AutoencoderKL(nn.Module):
-    def __init__(self,
-        in_channels=3,
-        out_channels=3,
-        z_channels=16,
-        num_res_blocks=2,
-        model_path=None,
-        weight_dict={},
-        world_size=1,
-        version=1,
-        torch_dtype: torch.dtype = torch.bfloat16 
-    ):
-        super().__init__()
-
-        self.frame_len = 17
-        self.latent_len = 3 if version == 2 else 5
-
-        base_group_norm.spatial = True if version == 2 else False
-
-        self.encoder = VideoEncoder(
-            in_channels=in_channels,
-            z_channels=z_channels,
-            num_res_blocks=num_res_blocks,
-            version=version,
-        )
-
-        self.decoder = VideoDecoder(
-            z_channels=z_channels,
-            out_channels=out_channels,
-            num_res_blocks=num_res_blocks,
-            version=version,
-        )
-        self.world_size = world_size
-        self.model_path = model_path
-        self.torch_dtype = torch_dtype
-    
-    def load_weight(self):
-        logger.info("AutoencoderKL: start load weight")
-        if self.model_path is not None:
-            weight_dict = self.init_from_ckpt(self.model_path)
-        if len(weight_dict) != 0:
-            self.load_from_dict(weight_dict)
-        self.convert_channel_last()
-        self.to(self.torch_dtype)
-        logger.info("AutoencoderKL: end load weight")
-
-
-    def init_from_ckpt(self, model_path):
-        from safetensors import safe_open
-        p = {}
-        with safe_open(model_path, framework="pt", device="cpu") as f:
-            for k in f.keys():
-                tensor = f.get_tensor(k)
-                if k.startswith("decoder.conv_out."):
-                    k = k.replace("decoder.conv_out.", "decoder.conv_out.conv.")
-                p[k] = tensor
-        return p
-
-    def load_from_dict(self, p):
-        self.load_state_dict(p)
-
-    def convert_channel_last(self):
-        #Conv2d NCHW->NHWC
-        pass
-
-    def naive_encode(self, x, is_init_image=True):
-        b, l, c, h, w = x.size()
-        x = rearrange(x, 'b l c h w -> b c l h w').contiguous()
-        z = self.encoder(x, l, True) # 下采样[1, 4, 8, 16, 16]
-        return z
-
-    @torch.inference_mode()
-    def encode(self, x):
-        # b (nc cf) c h w -> (b nc) cf c h w -> encode -> (b nc) cf c h w -> b (nc cf) c h w
-        chunks = list(x.split(self.frame_len, dim=1))
-        for i in range(len(chunks)):
-            chunks[i] = self.naive_encode(chunks[i], True)
-        z = torch.cat(chunks, dim=1)
-
-        posterior = DiagonalGaussianDistribution(z)
-        return posterior.sample()
-
-    def decode_naive(self, z, is_init=True):
-        dec = self.decoder(z, is_init)
-        return dec
-
-    @torch.inference_mode()
-    def decode(self, z):
-        # b (nc cf) c h w -> (b nc) cf c h w -> decode -> (b nc) c cf h w -> b (nc cf) c h w
-        chunks = list(z.split(self.latent_len, dim=1))
-
-        if self.world_size > 1:
-            chunks_total_num = len(chunks)
-            max_num_per_rank = (chunks_total_num + self.world_size - 1) // self.world_size
-            rank = torch.distributed.get_rank()
-            chunks_ = chunks[max_num_per_rank * rank : max_num_per_rank * (rank + 1)]
-            if len(chunks_) < max_num_per_rank:
-                chunks_.extend(chunks[:max_num_per_rank-len(chunks_)])
-            chunks = chunks_
-
-        for i in range(len(chunks)):
-            chunks[i] = self.decode_naive(chunks[i], True).permute(0,2,1,3,4)
-        x = torch.cat(chunks, dim=1)
-
-        if self.world_size > 1:
-            x_ = torch.empty([x.size(0), (self.world_size * max_num_per_rank) * self.frame_len, *x.shape[2:]], dtype=x.dtype, device=x.device)
-            torch.distributed.all_gather_into_tensor(x_, x)
-            x = x_[:, : chunks_total_num * self.frame_len]
-
-        x = self.mix(x)
-        return x
-
-    def mix(self, x):
-        remain_scale = 0.6
-        mix_scale = 1. - remain_scale
-        front = slice(self.frame_len - 1, x.size(1) - 1, self.frame_len)
-        back = slice(self.frame_len, x.size(1), self.frame_len)
-        x[:, back] = x[:, back] * remain_scale + x[:, front] * mix_scale
-        x[:, front] = x[:, front] * remain_scale + x[:, back] * mix_scale
-        return x
diff --git a/videotuna/models/wan/LICENSE b/videotuna/models/wan/LICENSE
new file mode 100644
index 00000000..261eeb9e
--- /dev/null
+++ b/videotuna/models/wan/LICENSE
@@ -0,0 +1,201 @@
+                                 Apache License
+                           Version 2.0, January 2004
+                        http://www.apache.org/licenses/
+
+   TERMS AND CONDITIONS FOR USE, REPRODUCTION, AND DISTRIBUTION
+
+   1. Definitions.
+
+      "License" shall mean the terms and conditions for use, reproduction,
+      and distribution as defined by Sections 1 through 9 of this document.
+
+      "Licensor" shall mean the copyright owner or entity authorized by
+      the copyright owner that is granting the License.
+
+      "Legal Entity" shall mean the union of the acting entity and all
+      other entities that control, are controlled by, or are under common
+      control with that entity. For the purposes of this definition,
+      "control" means (i) the power, direct or indirect, to cause the
+      direction or management of such entity, whether by contract or
+      otherwise, or (ii) ownership of fifty percent (50%) or more of the
+      outstanding shares, or (iii) beneficial ownership of such entity.
+
+      "You" (or "Your") shall mean an individual or Legal Entity
+      exercising permissions granted by this License.
+
+      "Source" form shall mean the preferred form for making modifications,
+      including but not limited to software source code, documentation
+      source, and configuration files.
+
+      "Object" form shall mean any form resulting from mechanical
+      transformation or translation of a Source form, including but
+      not limited to compiled object code, generated documentation,
+      and conversions to other media types.
+
+      "Work" shall mean the work of authorship, whether in Source or
+      Object form, made available under the License, as indicated by a
+      copyright notice that is included in or attached to the work
+      (an example is provided in the Appendix below).
+
+      "Derivative Works" shall mean any work, whether in Source or Object
+      form, that is based on (or derived from) the Work and for which the
+      editorial revisions, annotations, elaborations, or other modifications
+      represent, as a whole, an original work of authorship. For the purposes
+      of this License, Derivative Works shall not include works that remain
+      separable from, or merely link (or bind by name) to the interfaces of,
+      the Work and Derivative Works thereof.
+
+      "Contribution" shall mean any work of authorship, including
+      the original version of the Work and any modifications or additions
+      to that Work or Derivative Works thereof, that is intentionally
+      submitted to Licensor for inclusion in the Work by the copyright owner
+      or by an individual or Legal Entity authorized to submit on behalf of
+      the copyright owner. For the purposes of this definition, "submitted"
+      means any form of electronic, verbal, or written communication sent
+      to the Licensor or its representatives, including but not limited to
+      communication on electronic mailing lists, source code control systems,
+      and issue tracking systems that are managed by, or on behalf of, the
+      Licensor for the purpose of discussing and improving the Work, but
+      excluding communication that is conspicuously marked or otherwise
+      designated in writing by the copyright owner as "Not a Contribution."
+
+      "Contributor" shall mean Licensor and any individual or Legal Entity
+      on behalf of whom a Contribution has been received by Licensor and
+      subsequently incorporated within the Work.
+
+   2. Grant of Copyright License. Subject to the terms and conditions of
+      this License, each Contributor hereby grants to You a perpetual,
+      worldwide, non-exclusive, no-charge, royalty-free, irrevocable
+      copyright license to reproduce, prepare Derivative Works of,
+      publicly display, publicly perform, sublicense, and distribute the
+      Work and such Derivative Works in Source or Object form.
+
+   3. Grant of Patent License. Subject to the terms and conditions of
+      this License, each Contributor hereby grants to You a perpetual,
+      worldwide, non-exclusive, no-charge, royalty-free, irrevocable
+      (except as stated in this section) patent license to make, have made,
+      use, offer to sell, sell, import, and otherwise transfer the Work,
+      where such license applies only to those patent claims licensable
+      by such Contributor that are necessarily infringed by their
+      Contribution(s) alone or by combination of their Contribution(s)
+      with the Work to which such Contribution(s) was submitted. If You
+      institute patent litigation against any entity (including a
+      cross-claim or counterclaim in a lawsuit) alleging that the Work
+      or a Contribution incorporated within the Work constitutes direct
+      or contributory patent infringement, then any patent licenses
+      granted to You under this License for that Work shall terminate
+      as of the date such litigation is filed.
+
+   4. Redistribution. You may reproduce and distribute copies of the
+      Work or Derivative Works thereof in any medium, with or without
+      modifications, and in Source or Object form, provided that You
+      meet the following conditions:
+
+      (a) You must give any other recipients of the Work or
+          Derivative Works a copy of this License; and
+
+      (b) You must cause any modified files to carry prominent notices
+          stating that You changed the files; and
+
+      (c) You must retain, in the Source form of any Derivative Works
+          that You distribute, all copyright, patent, trademark, and
+          attribution notices from the Source form of the Work,
+          excluding those notices that do not pertain to any part of
+          the Derivative Works; and
+
+      (d) If the Work includes a "NOTICE" text file as part of its
+          distribution, then any Derivative Works that You distribute must
+          include a readable copy of the attribution notices contained
+          within such NOTICE file, excluding those notices that do not
+          pertain to any part of the Derivative Works, in at least one
+          of the following places: within a NOTICE text file distributed
+          as part of the Derivative Works; within the Source form or
+          documentation, if provided along with the Derivative Works; or,
+          within a display generated by the Derivative Works, if and
+          wherever such third-party notices normally appear. The contents
+          of the NOTICE file are for informational purposes only and
+          do not modify the License. You may add Your own attribution
+          notices within Derivative Works that You distribute, alongside
+          or as an addendum to the NOTICE text from the Work, provided
+          that such additional attribution notices cannot be construed
+          as modifying the License.
+
+      You may add Your own copyright statement to Your modifications and
+      may provide additional or different license terms and conditions
+      for use, reproduction, or distribution of Your modifications, or
+      for any such Derivative Works as a whole, provided Your use,
+      reproduction, and distribution of the Work otherwise complies with
+      the conditions stated in this License.
+
+   5. Submission of Contributions. Unless You explicitly state otherwise,
+      any Contribution intentionally submitted for inclusion in the Work
+      by You to the Licensor shall be under the terms and conditions of
+      this License, without any additional terms or conditions.
+      Notwithstanding the above, nothing herein shall supersede or modify
+      the terms of any separate license agreement you may have executed
+      with Licensor regarding such Contributions.
+
+   6. Trademarks. This License does not grant permission to use the trade
+      names, trademarks, service marks, or product names of the Licensor,
+      except as required for reasonable and customary use in describing the
+      origin of the Work and reproducing the content of the NOTICE file.
+
+   7. Disclaimer of Warranty. Unless required by applicable law or
+      agreed to in writing, Licensor provides the Work (and each
+      Contributor provides its Contributions) on an "AS IS" BASIS,
+      WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or
+      implied, including, without limitation, any warranties or conditions
+      of TITLE, NON-INFRINGEMENT, MERCHANTABILITY, or FITNESS FOR A
+      PARTICULAR PURPOSE. You are solely responsible for determining the
+      appropriateness of using or redistributing the Work and assume any
+      risks associated with Your exercise of permissions under this License.
+
+   8. Limitation of Liability. In no event and under no legal theory,
+      whether in tort (including negligence), contract, or otherwise,
+      unless required by applicable law (such as deliberate and grossly
+      negligent acts) or agreed to in writing, shall any Contributor be
+      liable to You for damages, including any direct, indirect, special,
+      incidental, or consequential damages of any character arising as a
+      result of this License or out of the use or inability to use the
+      Work (including but not limited to damages for loss of goodwill,
+      work stoppage, computer failure or malfunction, or any and all
+      other commercial damages or losses), even if such Contributor
+      has been advised of the possibility of such damages.
+
+   9. Accepting Warranty or Additional Liability. While redistributing
+      the Work or Derivative Works thereof, You may choose to offer,
+      and charge a fee for, acceptance of support, warranty, indemnity,
+      or other liability obligations and/or rights consistent with this
+      License. However, in accepting such obligations, You may act only
+      on Your own behalf and on Your sole responsibility, not on behalf
+      of any other Contributor, and only if You agree to indemnify,
+      defend, and hold each Contributor harmless for any liability
+      incurred by, or claims asserted against, such Contributor by reason
+      of your accepting any such warranty or additional liability.
+
+   END OF TERMS AND CONDITIONS
+
+   APPENDIX: How to apply the Apache License to your work.
+
+      To apply the Apache License to your work, attach the following
+      boilerplate notice, with the fields enclosed by brackets "[]"
+      replaced with your own identifying information. (Don't include
+      the brackets!)  The text should be enclosed in the appropriate
+      comment syntax for the file format. We also recommend that a
+      file or class name and description of purpose be included on the
+      same "printed page" as the copyright notice for easier
+      identification within third-party archives.
+
+   Copyright [yyyy] [name of copyright owner]
+
+   Licensed under the Apache License, Version 2.0 (the "License");
+   you may not use this file except in compliance with the License.
+   You may obtain a copy of the License at
+
+       http://www.apache.org/licenses/LICENSE-2.0
+
+   Unless required by applicable law or agreed to in writing, software
+   distributed under the License is distributed on an "AS IS" BASIS,
+   WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+   See the License for the specific language governing permissions and
+   limitations under the License.
diff --git a/videotuna/models/wan/VENDOR.md b/videotuna/models/wan/VENDOR.md
new file mode 100644
index 00000000..9c3df780
--- /dev/null
+++ b/videotuna/models/wan/VENDOR.md
@@ -0,0 +1,100 @@
+# Vendor: Wan native stack
+
+| Field | Value |
+|-------|-------|
+| **Path** | `videotuna/models/wan/` (in-tree snapshot; submodule migration planned) |
+| **Upstream** | https://github.com/Wan-Video/Wan2.2 |
+| **License** | Apache-2.0 — [upstream LICENSE.txt](https://github.com/Wan-Video/Wan2.2/blob/main/LICENSE.txt) |
+| **Pinned commit** | `42bf4cfaa384bc21833865abc2f9e6c0e67233dc` (2026-03-17; basis of sync in `1100b6a`) |
+| **Import / last sync** | Initial Wan2.1 import `7b1513f` (2025-04-12); Wan2.2 resync `1100b6a` (2026-06-22); PyAV patch `1ac39d9` (2026-06-23) |
+| **Weight lineage** | Domain training loads **Wan 2.1** hub checkpoints (`Wan-AI/Wan2.1-T2V-14B`, `Wan-AI/Wan2.1-I2V-14B-480P`); code tree is Wan2.2-native |
+| **PrivTune entrypoints** | See below |
+| **Vendor deps** | `easydict` (configs only; enforced by `tests/test_vendor_import_boundary.py`) |
+
+## PrivTune entrypoints
+
+**Poetry scripts**
+
+- `train-domain-t2v`, `train-domain-i2v`
+- `train-wan2-1-t2v-lora`, `train-wan2-1-i2v-lora`
+
+**Runners and flow**
+
+- `scripts/train_new.py`, `scripts/inference_new.py`
+- `videotuna/flow/wanvideo.py`
+- `shscripts/inference_wanvideo_t2v_lora.sh`
+
+**Configs**
+
+- `configs/domain/wan_*_lora*.yaml`
+- `configs/inference/presets/wan_domain_lora_smoke*.yaml`
+
+**Not this tree** — Wan 2.2 Diffusers validation/inference uses `validate-domain-*`, `inference-wan2.2-*`, and `videotuna/flow/diffusers_video.py`.
+
+## Local patches
+
+Re-apply after every upstream bump:
+
+| File | Patch |
+|------|-------|
+| `wan/configs/__init__.py` | Legacy Wan2.1 task aliases (`t2v-14B`, `i2v-14B`, …) for domain YAMLs |
+| `wan/modules/vae.py` | Shim re-exporting `WanVAE_` from `vae2_1` for YAML targets |
+| `wan/modules/t5.py` | `videotuna.utils.device_utils.resolve_inference_device()` instead of hard-coded CUDA |
+## Pruned variants (not in PrivTune scope)
+
+The following upstream Wan 2.2 variants were **removed** from this snapshot. PrivTune domain training uses **T2V/I2V LoRA only** (`train-domain-t2v`, `train-domain-i2v`).
+
+| Removed path | Upstream variant |
+|--------------|------------------|
+| `wan/speech2video.py`, `wan/modules/s2v/`, `wan/configs/wan_s2v_14B.py` | Speech-to-video (s2v) |
+| `wan/animate.py`, `wan/modules/animate/`, `wan/configs/wan_animate_14B.py` | Character animation |
+| `wan/textimage2video.py`, `wan/configs/wan_ti2v_5B.py`, `wan/modules/vae2_2.py` | Text-image-to-video (ti2v) |
+
+Do **not** restore these paths during upstream resync unless product scope expands. Re-copy from upstream Wan2.2 if needed in the future.
+
+## Update procedure (in-tree snapshot)
+
+1. Identify upstream commit to pin on [Wan-Video/Wan2.2](https://github.com/Wan-Video/Wan2.2).
+2. Diff and sync `videotuna/models/wan/wan/` from upstream `wan/` — **skip pruned variant paths** (table above).
+3. Re-apply local patches (table above).
+4. Update the pinned SHA and sync dates in this file.
+5. Run smoke tests:
+
+```bash
+poetry run test tests/test_import_smoke.py -q
+poetry run test tests/test_vendor_import_boundary.py -q
+poetry run test tests/test_domain_finetune_configs.py -q
+poetry run test tests/test_wan_lora_bridge.py -q
+poetry run test tests/test_wan_training_step.py -q
+```
+
+## Submodule migration plan (future)
+
+The tree predates the `videotuna/vendor/` layout. Migration is deferred until a fork or patch queue exists.
+
+### Step A — Submodule + fork
+
+```ini
+# .gitmodules (future)
+[submodule "videotuna/vendor/wan22"]
+    path = videotuna/vendor/wan22
+    url = https://github.com/Wan-Video/Wan2.2.git
+```
+
+PrivTune patches upstream files; use a **fork** or `git format-patch` queue before the submodule can replace the in-tree copy.
+
+### Step B — Import-path bridge
+
+Keep YAML targets stable (`videotuna.models.wan.wan.modules.*`) via a thin wrapper at `videotuna/models/wan/` that re-exports from the submodule or patched fork checkout. Alternative (higher churn): move to `videotuna/vendor/wan22/wan/` and update all YAML configs plus `videotuna/flow/wanvideo.py` imports.
+
+### Step C — Submodule bump
+
+```bash
+git submodule update --init videotuna/vendor/wan22
+cd videotuna/vendor/wan22
+git fetch origin
+git checkout <new-sha>
+cd ../../..
+# Re-apply patch queue or merge fork, sync into videotuna/models/wan/wan/
+# Update this file with the new SHA, then run smoke tests (see above)
+```
diff --git a/videotuna/models/wan/wan/__init__.py b/videotuna/models/wan/wan/__init__.py
index df36ebed..f1d5682b 100644
--- a/videotuna/models/wan/wan/__init__.py
+++ b/videotuna/models/wan/wan/__init__.py
@@ -1,3 +1,6 @@
+# Copyright 2024-2025 The Alibaba Wan Team Authors. All rights reserved.
 from . import configs, distributed, modules
 from .image2video import WanI2V
 from .text2video import WanT2V
+
+__all__ = ["WanT2V", "WanI2V", "configs", "distributed", "modules"]
diff --git a/videotuna/models/wan/wan/configs/__init__.py b/videotuna/models/wan/wan/configs/__init__.py
index c72d2d01..ede975c4 100644
--- a/videotuna/models/wan/wan/configs/__init__.py
+++ b/videotuna/models/wan/wan/configs/__init__.py
@@ -2,41 +2,56 @@
 import copy
 import os
 
-os.environ['TOKENIZERS_PARALLELISM'] = 'false'
+os.environ["TOKENIZERS_PARALLELISM"] = "false"
 
-from .wan_i2v_14B import i2v_14B
+from .wan_i2v_A14B import i2v_A14B
 from .wan_t2v_1_3B import t2v_1_3B
-from .wan_t2v_14B import t2v_14B
+from .wan_t2v_A14B import t2v_A14B
 
-# the config of t2i_14B is the same as t2v_14B
-t2i_14B = copy.deepcopy(t2v_14B)
-t2i_14B.__name__ = 'Config: Wan T2I 14B'
+# Legacy Wan2.1 task name aliases (VideoTuna configs / poetry scripts).
+t2v_14B = t2v_A14B
+i2v_14B = i2v_A14B
+t2i_14B = copy.deepcopy(t2v_A14B)
+t2i_14B.__name__ = "Config: Wan T2I 14B"
 
 WAN_CONFIGS = {
-    't2v-14B': t2v_14B,
-    't2v-1.3B': t2v_1_3B,
-    'i2v-14B': i2v_14B,
-    't2i-14B': t2i_14B,
+    "t2v-A14B": t2v_A14B,
+    "i2v-A14B": i2v_A14B,
+    # Wan2.1 / VideoTuna legacy task names
+    "t2v-14B": t2v_14B,
+    "t2v-1.3B": t2v_1_3B,
+    "i2v-14B": i2v_14B,
+    "t2i-14B": t2i_14B,
 }
 
 SIZE_CONFIGS = {
-    '720*1280': (720, 1280),
-    '1280*720': (1280, 720),
-    '480*832': (480, 832),
-    '832*480': (832, 480),
-    '1024*1024': (1024, 1024),
+    "720*1280": (720, 1280),
+    "1280*720": (1280, 720),
+    "480*832": (480, 832),
+    "832*480": (832, 480),
+    "704*1280": (704, 1280),
+    "1280*704": (1280, 704),
+    "1024*704": (1024, 704),
+    "704*1024": (704, 1024),
 }
 
 MAX_AREA_CONFIGS = {
-    '720*1280': 720 * 1280,
-    '1280*720': 1280 * 720,
-    '480*832': 480 * 832,
-    '832*480': 832 * 480,
+    "720*1280": 720 * 1280,
+    "1280*720": 1280 * 720,
+    "480*832": 480 * 832,
+    "832*480": 832 * 480,
+    "704*1280": 704 * 1280,
+    "1280*704": 1280 * 704,
+    "1024*704": 1024 * 704,
+    "704*1024": 704 * 1024,
 }
 
 SUPPORTED_SIZES = {
-    't2v-14B': ('720*1280', '1280*720', '480*832', '832*480'),
-    't2v-1.3B': ('480*832', '832*480'),
-    'i2v-14B': ('720*1280', '1280*720', '480*832', '832*480'),
-    't2i-14B': tuple(SIZE_CONFIGS.keys()),
+    "t2v-A14B": ("720*1280", "1280*720", "480*832", "832*480"),
+    "i2v-A14B": ("720*1280", "1280*720", "480*832", "832*480"),
+    # Legacy Wan2.1 task names
+    "t2v-14B": ("720*1280", "1280*720", "480*832", "832*480"),
+    "t2v-1.3B": ("480*832", "832*480"),
+    "i2v-14B": ("720*1280", "1280*720", "480*832", "832*480"),
+    "t2i-14B": tuple(SIZE_CONFIGS.keys()),
 }
diff --git a/videotuna/models/wan/wan/configs/shared_config.py b/videotuna/models/wan/wan/configs/shared_config.py
index 04a9f454..b589ff2e 100644
--- a/videotuna/models/wan/wan/configs/shared_config.py
+++ b/videotuna/models/wan/wan/configs/shared_config.py
@@ -2,11 +2,11 @@
 import torch
 from easydict import EasyDict
 
-#------------------------ Wan shared config ------------------------#
+# ------------------------ Wan shared config ------------------------#
 wan_shared_cfg = EasyDict()
 
 # t5
-wan_shared_cfg.t5_model = 'umt5_xxl'
+wan_shared_cfg.t5_model = "umt5_xxl"
 wan_shared_cfg.t5_dtype = torch.bfloat16
 wan_shared_cfg.text_len = 512
 
@@ -16,4 +16,5 @@
 # inference
 wan_shared_cfg.num_train_timesteps = 1000
 wan_shared_cfg.sample_fps = 16
-wan_shared_cfg.sample_neg_prompt = '色调艳丽，过曝，静态，细节模糊不清，字幕，风格，作品，画作，画面，静止，整体发灰，最差质量，低质量，JPEG压缩残留，丑陋的，残缺的，多余的手指，画得不好的手部，画得不好的脸部，畸形的，毁容的，形态畸形的肢体，手指融合，静止不动的画面，杂乱的背景，三条腿，背景人很多，倒着走'
+wan_shared_cfg.sample_neg_prompt = "色调艳丽，过曝，静态，细节模糊不清，字幕，风格，作品，画作，画面，静止，整体发灰，最差质量，低质量，JPEG压缩残留，丑陋的，残缺的，多余的手指，画得不好的手部，画得不好的脸部，畸形的，毁容的，形态畸形的肢体，手指融合，静止不动的画面，杂乱的背景，三条腿，背景人很多，倒着走"
+wan_shared_cfg.frame_num = 81
diff --git a/videotuna/models/wan/wan/configs/wan_i2v_14B.py b/videotuna/models/wan/wan/configs/wan_i2v_14B.py
deleted file mode 100644
index 12e8e205..00000000
--- a/videotuna/models/wan/wan/configs/wan_i2v_14B.py
+++ /dev/null
@@ -1,35 +0,0 @@
-# Copyright 2024-2025 The Alibaba Wan Team Authors. All rights reserved.
-import torch
-from easydict import EasyDict
-
-from .shared_config import wan_shared_cfg
-
-#------------------------ Wan I2V 14B ------------------------#
-
-i2v_14B = EasyDict(__name__='Config: Wan I2V 14B')
-i2v_14B.update(wan_shared_cfg)
-
-i2v_14B.t5_checkpoint = 'models_t5_umt5-xxl-enc-bf16.pth'
-i2v_14B.t5_tokenizer = 'google/umt5-xxl'
-
-# clip
-i2v_14B.clip_model = 'clip_xlm_roberta_vit_h_14'
-i2v_14B.clip_dtype = torch.float16
-i2v_14B.clip_checkpoint = 'models_clip_open-clip-xlm-roberta-large-vit-huge-14.pth'
-i2v_14B.clip_tokenizer = 'xlm-roberta-large'
-
-# vae
-i2v_14B.vae_checkpoint = 'Wan2.1_VAE.pth'
-i2v_14B.vae_stride = (4, 8, 8)
-
-# transformer
-i2v_14B.patch_size = (1, 2, 2)
-i2v_14B.dim = 5120
-i2v_14B.ffn_dim = 13824
-i2v_14B.freq_dim = 256
-i2v_14B.num_heads = 40
-i2v_14B.num_layers = 40
-i2v_14B.window_size = (-1, -1)
-i2v_14B.qk_norm = True
-i2v_14B.cross_attn_norm = True
-i2v_14B.eps = 1e-6
diff --git a/videotuna/models/wan/wan/configs/wan_i2v_A14B.py b/videotuna/models/wan/wan/configs/wan_i2v_A14B.py
new file mode 100644
index 00000000..cb2f40a0
--- /dev/null
+++ b/videotuna/models/wan/wan/configs/wan_i2v_A14B.py
@@ -0,0 +1,36 @@
+# Copyright 2024-2025 The Alibaba Wan Team Authors. All rights reserved.
+from easydict import EasyDict
+
+from .shared_config import wan_shared_cfg
+
+# ------------------------ Wan I2V A14B ------------------------#
+
+i2v_A14B = EasyDict(__name__="Config: Wan I2V A14B")
+i2v_A14B.update(wan_shared_cfg)
+
+i2v_A14B.t5_checkpoint = "models_t5_umt5-xxl-enc-bf16.pth"
+i2v_A14B.t5_tokenizer = "google/umt5-xxl"
+
+# vae
+i2v_A14B.vae_checkpoint = "Wan2.1_VAE.pth"
+i2v_A14B.vae_stride = (4, 8, 8)
+
+# transformer
+i2v_A14B.patch_size = (1, 2, 2)
+i2v_A14B.dim = 5120
+i2v_A14B.ffn_dim = 13824
+i2v_A14B.freq_dim = 256
+i2v_A14B.num_heads = 40
+i2v_A14B.num_layers = 40
+i2v_A14B.window_size = (-1, -1)
+i2v_A14B.qk_norm = True
+i2v_A14B.cross_attn_norm = True
+i2v_A14B.eps = 1e-6
+i2v_A14B.low_noise_checkpoint = "low_noise_model"
+i2v_A14B.high_noise_checkpoint = "high_noise_model"
+
+# inference
+i2v_A14B.sample_shift = 5.0
+i2v_A14B.sample_steps = 40
+i2v_A14B.boundary = 0.900
+i2v_A14B.sample_guide_scale = (3.5, 3.5)  # low noise, high noise
diff --git a/videotuna/models/wan/wan/configs/wan_t2v_14B.py b/videotuna/models/wan/wan/configs/wan_t2v_14B.py
deleted file mode 100644
index 9d0ee69d..00000000
--- a/videotuna/models/wan/wan/configs/wan_t2v_14B.py
+++ /dev/null
@@ -1,29 +0,0 @@
-# Copyright 2024-2025 The Alibaba Wan Team Authors. All rights reserved.
-from easydict import EasyDict
-
-from .shared_config import wan_shared_cfg
-
-#------------------------ Wan T2V 14B ------------------------#
-
-t2v_14B = EasyDict(__name__='Config: Wan T2V 14B')
-t2v_14B.update(wan_shared_cfg)
-
-# t5
-t2v_14B.t5_checkpoint = 'models_t5_umt5-xxl-enc-bf16.pth'
-t2v_14B.t5_tokenizer = 'google/umt5-xxl'
-
-# vae
-t2v_14B.vae_checkpoint = 'Wan2.1_VAE.pth'
-t2v_14B.vae_stride = (4, 8, 8)
-
-# transformer
-t2v_14B.patch_size = (1, 2, 2)
-t2v_14B.dim = 5120
-t2v_14B.ffn_dim = 13824
-t2v_14B.freq_dim = 256
-t2v_14B.num_heads = 40
-t2v_14B.num_layers = 40
-t2v_14B.window_size = (-1, -1)
-t2v_14B.qk_norm = True
-t2v_14B.cross_attn_norm = True
-t2v_14B.eps = 1e-6
diff --git a/videotuna/models/wan/wan/configs/wan_t2v_1_3B.py b/videotuna/models/wan/wan/configs/wan_t2v_1_3B.py
index ea9502b0..63d0a037 100644
--- a/videotuna/models/wan/wan/configs/wan_t2v_1_3B.py
+++ b/videotuna/models/wan/wan/configs/wan_t2v_1_3B.py
@@ -3,17 +3,17 @@
 
 from .shared_config import wan_shared_cfg
 
-#------------------------ Wan T2V 1.3B ------------------------#
+# ------------------------ Wan T2V 1.3B ------------------------#
 
-t2v_1_3B = EasyDict(__name__='Config: Wan T2V 1.3B')
+t2v_1_3B = EasyDict(__name__="Config: Wan T2V 1.3B")
 t2v_1_3B.update(wan_shared_cfg)
 
 # t5
-t2v_1_3B.t5_checkpoint = 'models_t5_umt5-xxl-enc-bf16.pth'
-t2v_1_3B.t5_tokenizer = 'google/umt5-xxl'
+t2v_1_3B.t5_checkpoint = "models_t5_umt5-xxl-enc-bf16.pth"
+t2v_1_3B.t5_tokenizer = "google/umt5-xxl"
 
 # vae
-t2v_1_3B.vae_checkpoint = 'Wan2.1_VAE.pth'
+t2v_1_3B.vae_checkpoint = "Wan2.1_VAE.pth"
 t2v_1_3B.vae_stride = (4, 8, 8)
 
 # transformer
diff --git a/videotuna/models/wan/wan/configs/wan_t2v_A14B.py b/videotuna/models/wan/wan/configs/wan_t2v_A14B.py
new file mode 100644
index 00000000..0e72db73
--- /dev/null
+++ b/videotuna/models/wan/wan/configs/wan_t2v_A14B.py
@@ -0,0 +1,37 @@
+# Copyright 2024-2025 The Alibaba Wan Team Authors. All rights reserved.
+from easydict import EasyDict
+
+from .shared_config import wan_shared_cfg
+
+# ------------------------ Wan T2V A14B ------------------------#
+
+t2v_A14B = EasyDict(__name__="Config: Wan T2V A14B")
+t2v_A14B.update(wan_shared_cfg)
+
+# t5
+t2v_A14B.t5_checkpoint = "models_t5_umt5-xxl-enc-bf16.pth"
+t2v_A14B.t5_tokenizer = "google/umt5-xxl"
+
+# vae
+t2v_A14B.vae_checkpoint = "Wan2.1_VAE.pth"
+t2v_A14B.vae_stride = (4, 8, 8)
+
+# transformer
+t2v_A14B.patch_size = (1, 2, 2)
+t2v_A14B.dim = 5120
+t2v_A14B.ffn_dim = 13824
+t2v_A14B.freq_dim = 256
+t2v_A14B.num_heads = 40
+t2v_A14B.num_layers = 40
+t2v_A14B.window_size = (-1, -1)
+t2v_A14B.qk_norm = True
+t2v_A14B.cross_attn_norm = True
+t2v_A14B.eps = 1e-6
+t2v_A14B.low_noise_checkpoint = "low_noise_model"
+t2v_A14B.high_noise_checkpoint = "high_noise_model"
+
+# inference
+t2v_A14B.sample_shift = 12.0
+t2v_A14B.sample_steps = 40
+t2v_A14B.boundary = 0.875
+t2v_A14B.sample_guide_scale = (3.0, 4.0)  # low noise, high noise
diff --git a/videotuna/models/wan/wan/distributed/__init__.py b/videotuna/models/wan/wan/distributed/__init__.py
index e69de29b..566f71ed 100644
--- a/videotuna/models/wan/wan/distributed/__init__.py
+++ b/videotuna/models/wan/wan/distributed/__init__.py
@@ -0,0 +1 @@
+# Copyright 2024-2025 The Alibaba Wan Team Authors. All rights reserved.
diff --git a/videotuna/models/wan/wan/distributed/fsdp.py b/videotuna/models/wan/wan/distributed/fsdp.py
index 18ba2f3e..c7168662 100644
--- a/videotuna/models/wan/wan/distributed/fsdp.py
+++ b/videotuna/models/wan/wan/distributed/fsdp.py
@@ -8,6 +8,7 @@
 from torch.distributed.fsdp.wrap import lambda_auto_wrap_policy
 from torch.distributed.utils import _free_storage
 
+
 def shard_model(
     model,
     device_id,
@@ -17,21 +18,27 @@ def shard_model(
     process_group=None,
     sharding_strategy=ShardingStrategy.FULL_SHARD,
     sync_module_states=True,
+    use_lora=False,
 ):
     model = FSDP(
         module=model,
         process_group=process_group,
         sharding_strategy=sharding_strategy,
         auto_wrap_policy=partial(
-            lambda_auto_wrap_policy, lambda_fn=lambda m: m in model.blocks),
+            lambda_auto_wrap_policy, lambda_fn=lambda m: m in model.blocks
+        ),
         mixed_precision=MixedPrecision(
             param_dtype=param_dtype,
             reduce_dtype=reduce_dtype,
-            buffer_dtype=buffer_dtype),
+            buffer_dtype=buffer_dtype,
+        ),
         device_id=device_id,
-        sync_module_states=sync_module_states)
+        sync_module_states=sync_module_states,
+        use_orig_params=True if use_lora else False,
+    )
     return model
 
+
 def free_model(model):
     for m in model.modules():
         if isinstance(m, FSDP):
diff --git a/videotuna/models/wan/wan/distributed/xdit_context_parallel.py b/videotuna/models/wan/wan/distributed/sequence_parallel.py
similarity index 52%
rename from videotuna/models/wan/wan/distributed/xdit_context_parallel.py
rename to videotuna/models/wan/wan/distributed/sequence_parallel.py
index 01936cee..1a3493be 100644
--- a/videotuna/models/wan/wan/distributed/xdit_context_parallel.py
+++ b/videotuna/models/wan/wan/distributed/sequence_parallel.py
@@ -1,28 +1,22 @@
 # Copyright 2024-2025 The Alibaba Wan Team Authors. All rights reserved.
 import torch
-import torch.cuda.amp as amp
-from xfuser.core.distributed import (get_sequence_parallel_rank,
-                                     get_sequence_parallel_world_size,
-                                     get_sp_group)
-from xfuser.core.long_ctx_attention import xFuserLongContextAttention
 
 from ..modules.model import sinusoidal_embedding_1d
+from .ulysses import distributed_attention
+from .util import gather_forward, get_rank, get_world_size
 
 
 def pad_freqs(original_tensor, target_len):
     seq_len, s1, s2 = original_tensor.shape
     pad_size = target_len - seq_len
     padding_tensor = torch.ones(
-        pad_size,
-        s1,
-        s2,
-        dtype=original_tensor.dtype,
-        device=original_tensor.device)
+        pad_size, s1, s2, dtype=original_tensor.dtype, device=original_tensor.device
+    )
     padded_tensor = torch.cat([original_tensor, padding_tensor], dim=0)
     return padded_tensor
 
 
-@amp.autocast(enabled=False)
+@torch.amp.autocast("cuda", enabled=False)
 def rope_apply(x, grid_sizes, freqs):
     """
     x:          [B, L, N, C].
@@ -39,22 +33,24 @@ def rope_apply(x, grid_sizes, freqs):
         seq_len = f * h * w
 
         # precompute multipliers
-        x_i = torch.view_as_complex(x[i, :s].to(torch.float64).reshape(
-            s, n, -1, 2))
-        freqs_i = torch.cat([
-            freqs[0][:f].view(f, 1, 1, -1).expand(f, h, w, -1),
-            freqs[1][:h].view(1, h, 1, -1).expand(f, h, w, -1),
-            freqs[2][:w].view(1, 1, w, -1).expand(f, h, w, -1)
-        ],
-                            dim=-1).reshape(seq_len, 1, -1)
+        x_i = torch.view_as_complex(x[i, :s].to(torch.float64).reshape(s, n, -1, 2))
+        freqs_i = torch.cat(
+            [
+                freqs[0][:f].view(f, 1, 1, -1).expand(f, h, w, -1),
+                freqs[1][:h].view(1, h, 1, -1).expand(f, h, w, -1),
+                freqs[2][:w].view(1, 1, w, -1).expand(f, h, w, -1),
+            ],
+            dim=-1,
+        ).reshape(seq_len, 1, -1)
 
         # apply rotary embedding
-        sp_size = get_sequence_parallel_world_size()
-        sp_rank = get_sequence_parallel_rank()
+        sp_size = get_world_size()
+        sp_rank = get_rank()
         freqs_i = pad_freqs(freqs_i, s * sp_size)
         s_per_rank = s
-        freqs_i_rank = freqs_i[(sp_rank * s_per_rank):((sp_rank + 1) *
-                                                       s_per_rank), :, :]
+        freqs_i_rank = freqs_i[
+            (sp_rank * s_per_rank) : ((sp_rank + 1) * s_per_rank), :, :
+        ]
         x_i = torch.view_as_real(x_i * freqs_i_rank).flatten(2)
         x_i = torch.cat([x_i, x[i, s:]])
 
@@ -63,13 +59,12 @@ def rope_apply(x, grid_sizes, freqs):
     return torch.stack(output).float()
 
 
-def usp_dit_forward(
+def sp_dit_forward(
     self,
     x,
     t,
     context,
     seq_len,
-    clip_fea=None,
     y=None,
 ):
     """
@@ -77,8 +72,8 @@ def usp_dit_forward(
     t:              [B].
     context:        A list of text embeddings each with shape [L, C].
     """
-    if self.model_type == 'i2v':
-        assert clip_fea is not None and y is not None
+    if self.model_type == "i2v":
+        assert y is not None
     # params
     device = self.patch_embedding.weight.device
     if self.freqs.device != device:
@@ -89,34 +84,46 @@ def usp_dit_forward(
 
     # embeddings
     x = [self.patch_embedding(u.unsqueeze(0)) for u in x]
-    grid_sizes = torch.stack(
-        [torch.tensor(u.shape[2:], dtype=torch.long) for u in x])
+    grid_sizes = torch.stack([torch.tensor(u.shape[2:], dtype=torch.long) for u in x])
     x = [u.flatten(2).transpose(1, 2) for u in x]
     seq_lens = torch.tensor([u.size(1) for u in x], dtype=torch.long)
     assert seq_lens.max() <= seq_len
-    x = torch.cat([
-        torch.cat([u, u.new_zeros(1, seq_len - u.size(1), u.size(2))], dim=1)
-        for u in x
-    ])
+    x = torch.cat(
+        [
+            torch.cat([u, u.new_zeros(1, seq_len - u.size(1), u.size(2))], dim=1)
+            for u in x
+        ]
+    )
 
     # time embeddings
-    with amp.autocast(dtype=torch.float32):
+    if t.dim() == 1:
+        t = t.expand(t.size(0), seq_len)
+    with torch.amp.autocast("cuda", dtype=torch.float32):
+        bt = t.size(0)
+        t = t.flatten()
         e = self.time_embedding(
-            sinusoidal_embedding_1d(self.freq_dim, t).float())
-        e0 = self.time_projection(e).unflatten(1, (6, self.dim))
+            sinusoidal_embedding_1d(self.freq_dim, t)
+            .unflatten(0, (bt, seq_len))
+            .float()
+        )
+        e0 = self.time_projection(e).unflatten(2, (6, self.dim))
         assert e.dtype == torch.float32 and e0.dtype == torch.float32
 
     # context
     context_lens = None
     context = self.text_embedding(
-        torch.stack([
-            torch.cat([u, u.new_zeros(self.text_len - u.size(0), u.size(1))])
-            for u in context
-        ]))
+        torch.stack(
+            [
+                torch.cat([u, u.new_zeros(self.text_len - u.size(0), u.size(1))])
+                for u in context
+            ]
+        )
+    )
 
-    if clip_fea is not None:
-        context_clip = self.img_emb(clip_fea)  # bs x 257 x dim
-        context = torch.concat([context_clip, context], dim=1)
+    # Context Parallel
+    x = torch.chunk(x, get_world_size(), dim=1)[get_rank()]
+    e = torch.chunk(e, get_world_size(), dim=1)[get_rank()]
+    e0 = torch.chunk(e0, get_world_size(), dim=1)[get_rank()]
 
     # arguments
     kwargs = dict(
@@ -125,12 +132,8 @@ def usp_dit_forward(
         grid_sizes=grid_sizes,
         freqs=self.freqs,
         context=context,
-        context_lens=context_lens)
-
-    # Context Parallel
-    x = torch.chunk(
-        x, get_sequence_parallel_world_size(),
-        dim=1)[get_sequence_parallel_rank()]
+        context_lens=context_lens,
+    )
 
     for block in self.blocks:
         x = block(x, **kwargs)
@@ -139,19 +142,14 @@ def usp_dit_forward(
     x = self.head(x, e)
 
     # Context Parallel
-    x = get_sp_group().all_gather(x, dim=1)
+    x = gather_forward(x, dim=1)
 
     # unpatchify
     x = self.unpatchify(x, grid_sizes)
     return [u.float() for u in x]
 
 
-def usp_attn_forward(self,
-                     x,
-                     seq_lens,
-                     grid_sizes,
-                     freqs,
-                     dtype=torch.bfloat16):
+def sp_attn_forward(self, x, seq_lens, grid_sizes, freqs, dtype=torch.bfloat16):
     b, s, n, d = *x.shape[:2], self.num_heads, self.head_dim
     half_dtypes = (torch.float16, torch.bfloat16)
 
@@ -169,22 +167,13 @@ def qkv_fn(x):
     q = rope_apply(q, grid_sizes, freqs)
     k = rope_apply(k, grid_sizes, freqs)
 
-    # TODO: We should use unpaded q,k,v for attention.
-    # k_lens = seq_lens // get_sequence_parallel_world_size()
-    # if k_lens is not None:
-    #     q = torch.cat([u[:l] for u, l in zip(q, k_lens)]).unsqueeze(0)
-    #     k = torch.cat([u[:l] for u, l in zip(k, k_lens)]).unsqueeze(0)
-    #     v = torch.cat([u[:l] for u, l in zip(v, k_lens)]).unsqueeze(0)
-
-    x = xFuserLongContextAttention()(
-        None,
-        query=half(q),
-        key=half(k),
-        value=half(v),
-        window_size=self.window_size)
-
-    # TODO: padding after attention.
-    # x = torch.cat([x, x.new_zeros(b, s - x.size(1), n, d)], dim=1)
+    x = distributed_attention(
+        half(q),
+        half(k),
+        half(v),
+        seq_lens,
+        window_size=self.window_size,
+    )
 
     # output
     x = x.flatten(2)
diff --git a/videotuna/models/wan/wan/distributed/ulysses.py b/videotuna/models/wan/wan/distributed/ulysses.py
new file mode 100644
index 00000000..cf1c0419
--- /dev/null
+++ b/videotuna/models/wan/wan/distributed/ulysses.py
@@ -0,0 +1,46 @@
+# Copyright 2024-2025 The Alibaba Wan Team Authors. All rights reserved.
+import torch.distributed as dist
+
+from ..modules.attention import flash_attention
+from .util import all_to_all
+
+
+def distributed_attention(
+    q,
+    k,
+    v,
+    seq_lens,
+    window_size=(-1, -1),
+):
+    """
+    Performs distributed attention based on DeepSpeed Ulysses attention mechanism.
+    please refer to https://arxiv.org/pdf/2309.14509
+
+    Args:
+        q:           [B, Lq // p, Nq, C1].
+        k:           [B, Lk // p, Nk, C1].
+        v:           [B, Lk // p, Nk, C2]. Nq must be divisible by Nk.
+        seq_lens:    [B], length of each sequence in batch
+        window_size: (left right). If not (-1, -1), apply sliding window local attention.
+    """
+    if not dist.is_initialized():
+        raise ValueError("distributed group should be initialized.")
+    q.shape[0]
+
+    # gather q/k/v sequence
+    q = all_to_all(q, scatter_dim=2, gather_dim=1)
+    k = all_to_all(k, scatter_dim=2, gather_dim=1)
+    v = all_to_all(v, scatter_dim=2, gather_dim=1)
+
+    # apply attention
+    x = flash_attention(
+        q,
+        k,
+        v,
+        k_lens=seq_lens,
+        window_size=window_size,
+    )
+
+    # scatter q/k/v sequence
+    x = all_to_all(x, scatter_dim=1, gather_dim=2)
+    return x
diff --git a/videotuna/models/wan/wan/distributed/util.py b/videotuna/models/wan/wan/distributed/util.py
new file mode 100644
index 00000000..dfcd1399
--- /dev/null
+++ b/videotuna/models/wan/wan/distributed/util.py
@@ -0,0 +1,50 @@
+# Copyright 2024-2025 The Alibaba Wan Team Authors. All rights reserved.
+import torch
+import torch.distributed as dist
+
+
+def init_distributed_group():
+    """r initialize sequence parallel group."""
+    if not dist.is_initialized():
+        dist.init_process_group(backend="nccl")
+
+
+def get_rank():
+    return dist.get_rank()
+
+
+def get_world_size():
+    return dist.get_world_size()
+
+
+def all_to_all(x, scatter_dim, gather_dim, group=None, **kwargs):
+    """
+    `scatter` along one dimension and `gather` along another.
+    """
+    world_size = get_world_size()
+    if world_size > 1:
+        inputs = [u.contiguous() for u in x.chunk(world_size, dim=scatter_dim)]
+        outputs = [torch.empty_like(u) for u in inputs]
+        dist.all_to_all(outputs, inputs, group=group, **kwargs)
+        x = torch.cat(outputs, dim=gather_dim).contiguous()
+    return x
+
+
+def all_gather(tensor):
+    world_size = dist.get_world_size()
+    if world_size == 1:
+        return [tensor]
+    tensor_list = [torch.empty_like(tensor) for _ in range(world_size)]
+    torch.distributed.all_gather(tensor_list, tensor)
+    return tensor_list
+
+
+def gather_forward(input, dim):
+    # skip if world_size == 1
+    world_size = dist.get_world_size()
+    if world_size == 1:
+        return input
+
+    # gather sequence
+    output = all_gather(input)
+    return torch.cat(output, dim=dim).contiguous()
diff --git a/videotuna/models/wan/wan/image2video.py b/videotuna/models/wan/wan/image2video.py
index 2f3cecd0..5f54bf12 100644
--- a/videotuna/models/wan/wan/image2video.py
+++ b/videotuna/models/wan/wan/image2video.py
@@ -1,6 +1,6 @@
 # Copyright 2024-2025 The Alibaba Wan Team Authors. All rights reserved.
 import gc
-from loguru import logger
+import logging
 import math
 import os
 import random
@@ -8,29 +8,28 @@
 import types
 from contextlib import contextmanager
 from functools import partial
-from typing import Union
 
 import numpy as np
 import torch
-import torch.cuda.amp as amp
 import torch.distributed as dist
 import torchvision.transforms.functional as TF
 from tqdm import tqdm
-from PIL import Image
 
 from .distributed.fsdp import shard_model
-from .modules.clip import CLIPModel, XLMRobertaCLIP
+from .distributed.sequence_parallel import sp_attn_forward, sp_dit_forward
+from .distributed.util import get_world_size
 from .modules.model import WanModel
-from .modules.t5 import T5Encoder, T5EncoderModel
-from .modules.vae import WanVAE, WanVAE_
-from .utils.fm_solvers import (FlowDPMSolverMultistepScheduler,
-                               get_sampling_sigmas, retrieve_timesteps)
+from .modules.t5 import T5EncoderModel
+from .modules.vae2_1 import Wan2_1_VAE
+from .utils.fm_solvers import (
+    FlowDPMSolverMultistepScheduler,
+    get_sampling_sigmas,
+    retrieve_timesteps,
+)
 from .utils.fm_solvers_unipc import FlowUniPCMultistepScheduler
-from ....utils.common_utils import monitor_resources
-from ....schedulers.flow_matching import FlowMatchScheduler
 
-class WanI2V:
 
+class WanI2V:
     def __init__(
         self,
         config,
@@ -39,13 +38,10 @@ def __init__(
         rank=0,
         t5_fsdp=False,
         dit_fsdp=False,
-        use_usp=False,
+        use_sp=False,
         t5_cpu=False,
         init_on_cpu=True,
-        first_stage_model: WanVAE_= None ,
-        cond_stage_model: T5Encoder=None,
-        cond_stage_2_model:XLMRobertaCLIP=None,
-        denoiser: WanModel=None,
+        convert_model_dtype=False,
     ):
         r"""
         Initializes the image-to-video generation model components.
@@ -63,73 +59,169 @@ def __init__(
                 Enable FSDP sharding for T5 model
             dit_fsdp (`bool`, *optional*, defaults to False):
                 Enable FSDP sharding for DiT model
-            use_usp (`bool`, *optional*, defaults to False):
-                Enable distribution strategy of USP.
+            use_sp (`bool`, *optional*, defaults to False):
+                Enable distribution strategy of sequence parallel.
             t5_cpu (`bool`, *optional*, defaults to False):
                 Whether to place T5 model on CPU. Only works without t5_fsdp.
             init_on_cpu (`bool`, *optional*, defaults to True):
                 Enable initializing Transformer Model on CPU. Only works without FSDP or USP.
+            convert_model_dtype (`bool`, *optional*, defaults to False):
+                Convert DiT model parameters dtype to 'config.param_dtype'.
+                Only works without FSDP.
         """
         self.device = torch.device(f"cuda:{device_id}")
         self.config = config
         self.rank = rank
-        self.use_usp = use_usp
         self.t5_cpu = t5_cpu
-        self.t5_fsdp = t5_fsdp
-        self.dit_fsdp = dit_fsdp
+        self.init_on_cpu = init_on_cpu
+
         self.num_train_timesteps = config.num_train_timesteps
+        self.boundary = config.boundary
         self.param_dtype = config.param_dtype
-        
+
+        if t5_fsdp or dit_fsdp or use_sp:
+            self.init_on_cpu = False
+
         shard_fn = partial(shard_model, device_id=device_id)
-        self.text_encoder : T5EncoderModel = T5EncoderModel(
+        self.text_encoder = T5EncoderModel(
             text_len=config.text_len,
             dtype=config.t5_dtype,
-            device=torch.device('cpu'),
+            device=torch.device("cpu"),
             checkpoint_path=os.path.join(checkpoint_dir, config.t5_checkpoint),
             tokenizer_path=os.path.join(checkpoint_dir, config.t5_tokenizer),
             shard_fn=shard_fn if t5_fsdp else None,
-            model=cond_stage_model
         )
 
-        #vae
         self.vae_stride = config.vae_stride
         self.patch_size = config.patch_size
-        self.vae: WanVAE = WanVAE(
-            vae=first_stage_model,
+        self.vae = Wan2_1_VAE(
             vae_pth=os.path.join(checkpoint_dir, config.vae_checkpoint),
-            device=self.device)
-
-        #clip
-        self.clip = CLIPModel(
-            dtype=config.clip_dtype,
             device=self.device,
-            checkpoint_path=os.path.join(checkpoint_dir,
-                                         config.clip_checkpoint),
-            tokenizer_path=os.path.join(checkpoint_dir, config.clip_tokenizer),
-            model=cond_stage_2_model)
+        )
 
+        logging.info(f"Creating WanModel from {checkpoint_dir}")
+        self.low_noise_model = WanModel.from_pretrained(
+            checkpoint_dir, subfolder=config.low_noise_checkpoint
+        )
+        self.low_noise_model = self._configure_model(
+            model=self.low_noise_model,
+            use_sp=use_sp,
+            dit_fsdp=dit_fsdp,
+            shard_fn=shard_fn,
+            convert_model_dtype=convert_model_dtype,
+        )
+
+        self.high_noise_model = WanModel.from_pretrained(
+            checkpoint_dir, subfolder=config.high_noise_checkpoint
+        )
+        self.high_noise_model = self._configure_model(
+            model=self.high_noise_model,
+            use_sp=use_sp,
+            dit_fsdp=dit_fsdp,
+            shard_fn=shard_fn,
+            convert_model_dtype=convert_model_dtype,
+        )
+        if use_sp:
+            self.sp_size = get_world_size()
+        else:
+            self.sp_size = 1
 
-        #denoiser
-        self.model : WanModel = denoiser
-        self.shard_fn = shard_fn
         self.sample_neg_prompt = config.sample_neg_prompt
-        self.init_on_cpu = init_on_cpu
-        if t5_fsdp or dit_fsdp or use_usp:
-            self.init_on_cpu = False
 
-    @monitor_resources(return_metrics=True)
-    def generate(self,
-                 input_prompt,
-                 img,
-                 max_area=720 * 1280,
-                 frame_num=81,
-                 shift=5.0,
-                 sample_solver='unipc',
-                 sampling_steps=40,
-                 guide_scale=5.0,
-                 n_prompt="",
-                 seed=-1,
-                 offload_model=True):
+    def _configure_model(self, model, use_sp, dit_fsdp, shard_fn, convert_model_dtype):
+        """
+        Configures a model object. This includes setting evaluation modes,
+        applying distributed parallel strategy, and handling device placement.
+
+        Args:
+            model (torch.nn.Module):
+                The model instance to configure.
+            use_sp (`bool`):
+                Enable distribution strategy of sequence parallel.
+            dit_fsdp (`bool`):
+                Enable FSDP sharding for DiT model.
+            shard_fn (callable):
+                The function to apply FSDP sharding.
+            convert_model_dtype (`bool`):
+                Convert DiT model parameters dtype to 'config.param_dtype'.
+                Only works without FSDP.
+
+        Returns:
+            torch.nn.Module:
+                The configured model.
+        """
+        model.eval().requires_grad_(False)
+
+        if use_sp:
+            for block in model.blocks:
+                block.self_attn.forward = types.MethodType(
+                    sp_attn_forward, block.self_attn
+                )
+            model.forward = types.MethodType(sp_dit_forward, model)
+
+        if dist.is_initialized():
+            dist.barrier()
+
+        if dit_fsdp:
+            model = shard_fn(model)
+        else:
+            if convert_model_dtype:
+                model.to(self.param_dtype)
+            if not self.init_on_cpu:
+                model.to(self.device)
+
+        return model
+
+    def _prepare_model_for_timestep(self, t, boundary, offload_model):
+        r"""
+        Prepares and returns the required model for the current timestep.
+
+        Args:
+            t (torch.Tensor):
+                current timestep.
+            boundary (`int`):
+                The timestep threshold. If `t` is at or above this value,
+                the `high_noise_model` is considered as the required model.
+            offload_model (`bool`):
+                A flag intended to control the offloading behavior.
+
+        Returns:
+            torch.nn.Module:
+                The active model on the target device for the current timestep.
+        """
+        if t.item() >= boundary:
+            required_model_name = "high_noise_model"
+            offload_model_name = "low_noise_model"
+        else:
+            required_model_name = "low_noise_model"
+            offload_model_name = "high_noise_model"
+        if offload_model or self.init_on_cpu:
+            if (
+                next(getattr(self, offload_model_name).parameters()).device.type
+                == "cuda"
+            ):
+                getattr(self, offload_model_name).to("cpu")
+            if (
+                next(getattr(self, required_model_name).parameters()).device.type
+                == "cpu"
+            ):
+                getattr(self, required_model_name).to(self.device)
+        return getattr(self, required_model_name)
+
+    def generate(
+        self,
+        input_prompt,
+        img,
+        max_area=720 * 1280,
+        frame_num=81,
+        shift=5.0,
+        sample_solver="unipc",
+        sampling_steps=40,
+        guide_scale=5.0,
+        n_prompt="",
+        seed=-1,
+        offload_model=True,
+    ):
         r"""
         Generates video frames from input image and text prompt using diffusion process.
 
@@ -149,8 +241,10 @@ def generate(self,
                 Solver used to sample the video.
             sampling_steps (`int`, *optional*, defaults to 40):
                 Number of diffusion sampling steps. Higher values improve quality but slow generation
-            guide_scale (`float`, *optional*, defaults 5.0):
-                Classifier-free guidance scale. Controls prompt adherence vs. creativity
+            guide_scale (`float` or tuple[`float`], *optional*, defaults 5.0):
+                Classifier-free guidance scale. Controls prompt adherence vs. creativity.
+                If tuple, the first guide_scale will be used for low noise model and
+                the second guide_scale will be used for high noise model.
             n_prompt (`str`, *optional*, defaults to ""):
                 Negative prompt for content exclusion. If not given, use `config.sample_neg_prompt`
             seed (`int`, *optional*, defaults to -1):
@@ -165,24 +259,39 @@ def generate(self,
                 - N: Number of frames (81)
                 - H: Frame height (from max_area)
                 - W: Frame width from max_area)
-
         """
+        # preprocess
+        guide_scale = (
+            (guide_scale, guide_scale)
+            if isinstance(guide_scale, float)
+            else guide_scale
+        )
         img = TF.to_tensor(img).sub_(0.5).div_(0.5).to(self.device)
 
         F = frame_num
         h, w = img.shape[1:]
         aspect_ratio = h / w
         lat_h = round(
-            np.sqrt(max_area * aspect_ratio) // self.vae_stride[1] //
-            self.patch_size[1] * self.patch_size[1])
+            np.sqrt(max_area * aspect_ratio)
+            // self.vae_stride[1]
+            // self.patch_size[1]
+            * self.patch_size[1]
+        )
         lat_w = round(
-            np.sqrt(max_area / aspect_ratio) // self.vae_stride[2] //
-            self.patch_size[2] * self.patch_size[2])
+            np.sqrt(max_area / aspect_ratio)
+            // self.vae_stride[2]
+            // self.patch_size[2]
+            * self.patch_size[2]
+        )
         h = lat_h * self.vae_stride[1]
         w = lat_w * self.vae_stride[2]
 
-        max_seq_len = ((F - 1) // self.vae_stride[0] + 1) * lat_h * lat_w // (
-            self.patch_size[1] * self.patch_size[2])
+        max_seq_len = (
+            ((F - 1) // self.vae_stride[0] + 1)
+            * lat_h
+            * lat_w
+            // (self.patch_size[1] * self.patch_size[2])
+        )
         max_seq_len = int(math.ceil(max_seq_len / self.sp_size)) * self.sp_size
 
         seed = seed if seed >= 0 else random.randint(0, sys.maxsize)
@@ -190,19 +299,19 @@ def generate(self,
         seed_g.manual_seed(seed)
         noise = torch.randn(
             16,
-            21,
+            (F - 1) // self.vae_stride[0] + 1,
             lat_h,
             lat_w,
             dtype=torch.float32,
             generator=seed_g,
-            device=self.device)
+            device=self.device,
+        )
 
-        msk = torch.ones(1, 81, lat_h, lat_w, device=self.device)
+        msk = torch.ones(1, F, lat_h, lat_w, device=self.device)
         msk[:, 1:] = 0
-        msk = torch.concat([
-            torch.repeat_interleave(msk[:, 0:1], repeats=4, dim=1), msk[:, 1:]
-        ],
-                           dim=1)
+        msk = torch.concat(
+            [torch.repeat_interleave(msk[:, 0:1], repeats=4, dim=1), msk[:, 1:]], dim=1
+        )
         msk = msk.view(1, msk.shape[1] // 4, 4, lat_h, lat_w)
         msk = msk.transpose(1, 2)[0]
 
@@ -217,57 +326,62 @@ def generate(self,
             if offload_model:
                 self.text_encoder.model.cpu()
         else:
-            context = self.text_encoder([input_prompt], torch.device('cpu'))
-            context_null = self.text_encoder([n_prompt], torch.device('cpu'))
+            context = self.text_encoder([input_prompt], torch.device("cpu"))
+            context_null = self.text_encoder([n_prompt], torch.device("cpu"))
             context = [t.to(self.device) for t in context]
             context_null = [t.to(self.device) for t in context_null]
 
-        self.clip.model.to(self.device)
-        clip_context = self.clip.visual([img[:, None, :, :]])
-        if offload_model:
-            self.clip.model.cpu()
-
-        self.vae.model.to(self.device)
-        y = self.vae.encode([
-            torch.concat([
-                torch.nn.functional.interpolate(
-                    img[None].cpu(), size=(h, w), mode='bicubic').transpose(
-                        0, 1),
-                torch.zeros(3, 80, h, w)
-            ],
-                         dim=1).to(self.device)
-        ])[0]
+        y = self.vae.encode(
+            [
+                torch.concat(
+                    [
+                        torch.nn.functional.interpolate(
+                            img[None].cpu(), size=(h, w), mode="bicubic"
+                        ).transpose(0, 1),
+                        torch.zeros(3, F - 1, h, w),
+                    ],
+                    dim=1,
+                ).to(self.device)
+            ]
+        )[0]
         y = torch.concat([msk, y])
-        if offload_model:
-            self.vae.model.cpu()
 
         @contextmanager
         def noop_no_sync():
             yield
 
-        no_sync = getattr(self.model, 'no_sync', noop_no_sync)
+        no_sync_low_noise = getattr(self.low_noise_model, "no_sync", noop_no_sync)
+        no_sync_high_noise = getattr(self.high_noise_model, "no_sync", noop_no_sync)
 
         # evaluation mode
-        with amp.autocast(dtype=self.param_dtype), torch.no_grad(), no_sync():
-
-            if sample_solver == 'unipc':
+        with (
+            torch.amp.autocast("cuda", dtype=self.param_dtype),
+            torch.no_grad(),
+            no_sync_low_noise(),
+            no_sync_high_noise(),
+        ):
+            boundary = self.boundary * self.num_train_timesteps
+
+            if sample_solver == "unipc":
                 sample_scheduler = FlowUniPCMultistepScheduler(
                     num_train_timesteps=self.num_train_timesteps,
                     shift=1,
-                    use_dynamic_shifting=False)
+                    use_dynamic_shifting=False,
+                )
                 sample_scheduler.set_timesteps(
-                    sampling_steps, device=self.device, shift=shift)
+                    sampling_steps, device=self.device, shift=shift
+                )
                 timesteps = sample_scheduler.timesteps
-            elif sample_solver == 'dpm++':
+            elif sample_solver == "dpm++":
                 sample_scheduler = FlowDPMSolverMultistepScheduler(
                     num_train_timesteps=self.num_train_timesteps,
                     shift=1,
-                    use_dynamic_shifting=False)
+                    use_dynamic_shifting=False,
+                )
                 sampling_sigmas = get_sampling_sigmas(sampling_steps, shift)
                 timesteps, _ = retrieve_timesteps(
-                    sample_scheduler,
-                    device=self.device,
-                    sigmas=sampling_sigmas)
+                    sample_scheduler, device=self.device, sigmas=sampling_sigmas
+                )
             else:
                 raise NotImplementedError("Unsupported solver.")
 
@@ -275,67 +389,62 @@ def noop_no_sync():
             latent = noise
 
             arg_c = {
-                'context': [context[0]],
-                'clip_fea': clip_context,
-                'seq_len': max_seq_len,
-                'y': [y],
+                "context": [context[0]],
+                "seq_len": max_seq_len,
+                "y": [y],
             }
 
             arg_null = {
-                'context': context_null,
-                'clip_fea': clip_context,
-                'seq_len': max_seq_len,
-                'y': [y],
+                "context": context_null,
+                "seq_len": max_seq_len,
+                "y": [y],
             }
 
             if offload_model:
                 torch.cuda.empty_cache()
 
-            self.model.to(self.device)
             for _, t in enumerate(tqdm(timesteps)):
                 latent_model_input = [latent.to(self.device)]
                 timestep = [t]
 
                 timestep = torch.stack(timestep).to(self.device)
 
-                noise_pred_cond = self.model(
-                    latent_model_input, t=timestep, **arg_c)[0].to(
-                        torch.device('cpu') if offload_model else self.device)
+                model = self._prepare_model_for_timestep(t, boundary, offload_model)
+                sample_guide_scale = (
+                    guide_scale[1] if t.item() >= boundary else guide_scale[0]
+                )
+
+                noise_pred_cond = model(latent_model_input, t=timestep, **arg_c)[0]
                 if offload_model:
                     torch.cuda.empty_cache()
-                noise_pred_uncond = self.model(
-                    latent_model_input, t=timestep, **arg_null)[0].to(
-                        torch.device('cpu') if offload_model else self.device)
+                noise_pred_uncond = model(latent_model_input, t=timestep, **arg_null)[0]
                 if offload_model:
                     torch.cuda.empty_cache()
-                noise_pred = noise_pred_uncond + guide_scale * (
-                    noise_pred_cond - noise_pred_uncond)
-
-                latent = latent.to(
-                    torch.device('cpu') if offload_model else self.device)
+                noise_pred = noise_pred_uncond + sample_guide_scale * (
+                    noise_pred_cond - noise_pred_uncond
+                )
 
                 temp_x0 = sample_scheduler.step(
                     noise_pred.unsqueeze(0),
                     t,
                     latent.unsqueeze(0),
                     return_dict=False,
-                    generator=seed_g)[0]
+                    generator=seed_g,
+                )[0]
                 latent = temp_x0.squeeze(0)
 
-                x0 = [latent.to(self.device)]
+                x0 = [latent]
                 del latent_model_input, timestep
 
             if offload_model:
-                self.model.cpu()
+                self.low_noise_model.cpu()
+                self.high_noise_model.cpu()
                 torch.cuda.empty_cache()
 
             if self.rank == 0:
-                self.vae.model.to(self.device)
                 videos = self.vae.decode(x0)
-                if offload_model:
-                    self.vae.model.cpu()
 
-        del noise, latent
+        del noise, latent, x0
         del sample_scheduler
         if offload_model:
             gc.collect()
@@ -344,86 +453,3 @@ def noop_no_sync():
             dist.barrier()
 
         return videos[0] if self.rank == 0 else None
-    
-    def load_weight(self):
-        self.text_encoder.load_weight()
-        self.vae.load_weight()
-        self.clip.load_weight()
-        #denoiser use from_pretrained, no need load again
-        if self.use_usp:
-            from xfuser.core.distributed import \
-                get_sequence_parallel_world_size
-
-            from .distributed.xdit_context_parallel import (usp_attn_forward,
-                                                            usp_dit_forward)
-            for block in self.model.blocks:
-                block.self_attn.forward = types.MethodType(
-                    usp_attn_forward, block.self_attn)
-            self.model.forward = types.MethodType(usp_dit_forward, self.model)
-            self.sp_size = get_sequence_parallel_world_size()
-        else:
-            self.sp_size = 1
-
-        if dist.is_initialized():
-            dist.barrier()
-        if self.dit_fsdp:
-            self.model = self.shard_fn(self.model)
-        else:
-            if not self.init_on_cpu:
-                self.model = self.model.to(self.device)
-
-    def enable_vram_management(self):
-        pass
-
-
-    def training_step(self, batch, batch_idx, 
-                      first_stage_key:str, 
-                      cond_stage_key:str,
-                      model_offload:bool = True,
-                      dtype:torch.dtype = torch.bfloat16,
-                      device:str = "cuda"):
-        videos = batch[first_stage_key]
-        first_frame = videos[:, :, 0:1, :, :]
-        
-        ## compute latent and embeddings
-        with torch.no_grad():
-            if model_offload:
-                self.vae.model.to(device)
-                latents = torch.stack(self.vae.encode(videos)).to(dtype=dtype, device=device).detach()
-                videos[:, :, 1:, :, :] = 0
-                y = torch.stack(self.vae.encode(videos)).to(dtype=dtype, device=device).detach()
-                self.vae.model.to('cpu')
-                self.text_encoder.model.to(device)
-                text_cond_embed = self.text_encoder(batch[cond_stage_key], device)
-                self.text_encoder.model.to('cpu')
-                self.clip.model.to(device)
-                clip_context = self.clip.visual(first_frame)
-                self.clip.model.to('cpu')
-            else:
-                latents = torch.stack(self.vae.encode(videos)).to(dtype=dtype, device=device).detach()
-                videos[:, :, 1:, :, :] = 0
-                y = torch.stack(self.vae.encode(videos)).to(dtype=dtype, device=device).detach()
-                text_cond_embed = self.text_encoder(batch[cond_stage_key], device)
-                clip_context = self.clip.visual(first_frame)
-
-        ## scheduler
-        self.scheduler : FlowMatchScheduler = FlowMatchScheduler(shift=5, sigma_min=0.0, extra_one_step=True)
-        self.scheduler.set_timesteps(1000, training=True)
-
-        ## noise
-        b, c, f, h, w = latents.shape
-        noise = torch.randn_like(latents)
-        timestep_ids = torch.randint(0, self.scheduler.num_train_timesteps, (b,))
-        timesteps = self.scheduler.timesteps[timestep_ids].to(dtype=dtype, device=device)
-        noisy_latents = self.scheduler.add_noise(latents, noise, timesteps).to(dtype=dtype, device=device)
-        training_target = noise.to(device) - latents
-
-        # compute loss
-        mask = torch.zeros((b, 4, f, h, w), device=device, dtype=dtype)
-        mask[:, :, 0, :, :] = 1
-        y = torch.cat([mask, y], dim=1)
-
-        noise_pred = self.model(x=noisy_latents, t=timesteps, context=text_cond_embed, clip_fea=clip_context, seq_len=None, y=y)
-        loss = torch.nn.functional.mse_loss(torch.stack(noise_pred).float(), training_target.float())
-        loss = loss * self.scheduler.training_weight(timesteps).to(device=device)
-        return loss
\ No newline at end of file
diff --git a/videotuna/models/wan/wan/modules/__init__.py b/videotuna/models/wan/wan/modules/__init__.py
index f8935bbb..9b74cded 100644
--- a/videotuna/models/wan/wan/modules/__init__.py
+++ b/videotuna/models/wan/wan/modules/__init__.py
@@ -1,16 +1,17 @@
+# Copyright 2024-2025 The Alibaba Wan Team Authors. All rights reserved.
 from .attention import flash_attention
 from .model import WanModel
 from .t5 import T5Decoder, T5Encoder, T5EncoderModel, T5Model
 from .tokenizers import HuggingfaceTokenizer
-from .vae import WanVAE
+from .vae2_1 import Wan2_1_VAE
 
 __all__ = [
-    'WanVAE',
-    'WanModel',
-    'T5Model',
-    'T5Encoder',
-    'T5Decoder',
-    'T5EncoderModel',
-    'HuggingfaceTokenizer',
-    'flash_attention',
+    "Wan2_1_VAE",
+    "WanModel",
+    "T5Model",
+    "T5Encoder",
+    "T5Decoder",
+    "T5EncoderModel",
+    "HuggingfaceTokenizer",
+    "flash_attention",
 ]
diff --git a/videotuna/models/wan/wan/modules/attention.py b/videotuna/models/wan/wan/modules/attention.py
index 4dbbe03f..de604a5f 100644
--- a/videotuna/models/wan/wan/modules/attention.py
+++ b/videotuna/models/wan/wan/modules/attention.py
@@ -3,12 +3,14 @@
 
 try:
     import flash_attn_interface
+
     FLASH_ATTN_3_AVAILABLE = True
 except ModuleNotFoundError:
     FLASH_ATTN_3_AVAILABLE = False
 
 try:
     import flash_attn
+
     FLASH_ATTN_2_AVAILABLE = True
 except ModuleNotFoundError:
     FLASH_ATTN_2_AVAILABLE = False
@@ -16,8 +18,8 @@
 import warnings
 
 __all__ = [
-    'flash_attention',
-    'attention',
+    "flash_attention",
+    "attention",
 ]
 
 
@@ -27,7 +29,7 @@ def flash_attention(
     v,
     q_lens=None,
     k_lens=None,
-    dropout_p=0.,
+    dropout_p=0.0,
     softmax_scale=None,
     q_scale=None,
     causal=False,
@@ -51,7 +53,7 @@ def flash_attention(
     """
     half_dtypes = (torch.float16, torch.bfloat16)
     assert dtype in half_dtypes
-    assert q.device.type == 'cuda' and q.size(-1) <= 256
+    assert q.device.type == "cuda" and q.size(-1) <= 256
 
     # params
     b, lq, lk, out_dtype = q.size(0), q.size(1), k.size(1), q.dtype
@@ -62,9 +64,9 @@ def half(x):
     # preprocess query
     if q_lens is None:
         q = half(q.flatten(0, 1))
-        q_lens = torch.tensor(
-            [lq] * b, dtype=torch.int32).to(
-                device=q.device, non_blocking=True)
+        q_lens = torch.tensor([lq] * b, dtype=torch.int32).to(
+            device=q.device, non_blocking=True
+        )
     else:
         q = half(torch.cat([u[:v] for u, v in zip(q, q_lens)]))
 
@@ -72,9 +74,9 @@ def half(x):
     if k_lens is None:
         k = half(k.flatten(0, 1))
         v = half(v.flatten(0, 1))
-        k_lens = torch.tensor(
-            [lk] * b, dtype=torch.int32).to(
-                device=k.device, non_blocking=True)
+        k_lens = torch.tensor([lk] * b, dtype=torch.int32).to(
+            device=k.device, non_blocking=True
+        )
     else:
         k = half(torch.cat([u[:v] for u, v in zip(k, k_lens)]))
         v = half(torch.cat([u[:v] for u, v in zip(v, k_lens)]))
@@ -87,7 +89,7 @@ def half(x):
 
     if version is not None and version == 3 and not FLASH_ATTN_3_AVAILABLE:
         warnings.warn(
-            'Flash attention 3 is not available, use flash attention 2 instead.'
+            "Flash attention 3 is not available, use flash attention 2 instead."
         )
 
     # apply attention
@@ -97,34 +99,40 @@ def half(x):
             q=q,
             k=k,
             v=v,
-            cu_seqlens_q=torch.cat([q_lens.new_zeros([1]), q_lens]).cumsum(
-                0, dtype=torch.int32).to(q.device, non_blocking=True),
-            cu_seqlens_k=torch.cat([k_lens.new_zeros([1]), k_lens]).cumsum(
-                0, dtype=torch.int32).to(q.device, non_blocking=True),
+            cu_seqlens_q=torch.cat([q_lens.new_zeros([1]), q_lens])
+            .cumsum(0, dtype=torch.int32)
+            .to(q.device, non_blocking=True),
+            cu_seqlens_k=torch.cat([k_lens.new_zeros([1]), k_lens])
+            .cumsum(0, dtype=torch.int32)
+            .to(q.device, non_blocking=True),
             seqused_q=None,
             seqused_k=None,
             max_seqlen_q=lq,
             max_seqlen_k=lk,
             softmax_scale=softmax_scale,
             causal=causal,
-            deterministic=deterministic)[0].unflatten(0, (b, lq))
+            deterministic=deterministic,
+        )[0].unflatten(0, (b, lq))
     else:
         assert FLASH_ATTN_2_AVAILABLE
         x = flash_attn.flash_attn_varlen_func(
             q=q,
             k=k,
             v=v,
-            cu_seqlens_q=torch.cat([q_lens.new_zeros([1]), q_lens]).cumsum(
-                0, dtype=torch.int32).to(q.device, non_blocking=True),
-            cu_seqlens_k=torch.cat([k_lens.new_zeros([1]), k_lens]).cumsum(
-                0, dtype=torch.int32).to(q.device, non_blocking=True),
+            cu_seqlens_q=torch.cat([q_lens.new_zeros([1]), q_lens])
+            .cumsum(0, dtype=torch.int32)
+            .to(q.device, non_blocking=True),
+            cu_seqlens_k=torch.cat([k_lens.new_zeros([1]), k_lens])
+            .cumsum(0, dtype=torch.int32)
+            .to(q.device, non_blocking=True),
             max_seqlen_q=lq,
             max_seqlen_k=lk,
             dropout_p=dropout_p,
             softmax_scale=softmax_scale,
             causal=causal,
             window_size=window_size,
-            deterministic=deterministic).unflatten(0, (b, lq))
+            deterministic=deterministic,
+        ).unflatten(0, (b, lq))
 
     # output
     return x.type(out_dtype)
@@ -136,7 +144,7 @@ def attention(
     v,
     q_lens=None,
     k_lens=None,
-    dropout_p=0.,
+    dropout_p=0.0,
     softmax_scale=None,
     q_scale=None,
     causal=False,
@@ -164,7 +172,7 @@ def attention(
     else:
         if q_lens is not None or k_lens is not None:
             warnings.warn(
-                'Padding mask is disabled when using scaled_dot_product_attention. It can have a significant impact on performance.'
+                "Padding mask is disabled when using scaled_dot_product_attention. It can have a significant impact on performance."
             )
         attn_mask = None
 
@@ -173,7 +181,8 @@ def attention(
         v = v.transpose(1, 2).to(dtype)
 
         out = torch.nn.functional.scaled_dot_product_attention(
-            q, k, v, attn_mask=attn_mask, is_causal=causal, dropout_p=dropout_p)
+            q, k, v, attn_mask=attn_mask, is_causal=causal, dropout_p=dropout_p
+        )
 
         out = out.transpose(1, 2).contiguous()
         return out
diff --git a/videotuna/models/wan/wan/modules/clip.py b/videotuna/models/wan/wan/modules/clip.py
deleted file mode 100644
index a6288c10..00000000
--- a/videotuna/models/wan/wan/modules/clip.py
+++ /dev/null
@@ -1,487 +0,0 @@
-# Modified from ``https://github.com/openai/CLIP'' and ``https://github.com/mlfoundations/open_clip''
-# Copyright 2024-2025 The Alibaba Wan Team Authors. All rights reserved.
-from loguru import logger
-import math
-
-import torch
-import torch.nn as nn
-import torch.nn.functional as F
-import torchvision.transforms as T
-
-from .attention import flash_attention
-from .tokenizers import HuggingfaceTokenizer
-from .xlm_roberta import XLMRoberta
-
-__all__ = [
-    'XLMRobertaCLIP',
-    'clip_xlm_roberta_vit_h_14',
-    'CLIPModel',
-]
-
-
-def pos_interpolate(pos, seq_len):
-    if pos.size(1) == seq_len:
-        return pos
-    else:
-        src_grid = int(math.sqrt(pos.size(1)))
-        tar_grid = int(math.sqrt(seq_len))
-        n = pos.size(1) - src_grid * src_grid
-        return torch.cat([
-            pos[:, :n],
-            F.interpolate(
-                pos[:, n:].float().reshape(1, src_grid, src_grid, -1).permute(
-                    0, 3, 1, 2),
-                size=(tar_grid, tar_grid),
-                mode='bicubic',
-                align_corners=False).flatten(2).transpose(1, 2)
-        ],
-                         dim=1)
-
-
-class QuickGELU(nn.Module):
-
-    def forward(self, x):
-        return x * torch.sigmoid(1.702 * x)
-
-
-class LayerNorm(nn.LayerNorm):
-
-    def forward(self, x):
-        return super().forward(x.float()).type_as(x)
-
-
-class SelfAttention(nn.Module):
-
-    def __init__(self,
-                 dim,
-                 num_heads,
-                 causal=False,
-                 attn_dropout=0.0,
-                 proj_dropout=0.0):
-        assert dim % num_heads == 0
-        super().__init__()
-        self.dim = dim
-        self.num_heads = num_heads
-        self.head_dim = dim // num_heads
-        self.causal = causal
-        self.attn_dropout = attn_dropout
-        self.proj_dropout = proj_dropout
-
-        # layers
-        self.to_qkv = nn.Linear(dim, dim * 3)
-        self.proj = nn.Linear(dim, dim)
-
-    def forward(self, x):
-        """
-        x:   [B, L, C].
-        """
-        b, s, c, n, d = *x.size(), self.num_heads, self.head_dim
-
-        # compute query, key, value
-        q, k, v = self.to_qkv(x).view(b, s, 3, n, d).unbind(2)
-
-        # compute attention
-        p = self.attn_dropout if self.training else 0.0
-        x = flash_attention(q, k, v, dropout_p=p, causal=self.causal, version=2)
-        x = x.reshape(b, s, c)
-
-        # output
-        x = self.proj(x)
-        x = F.dropout(x, self.proj_dropout, self.training)
-        return x
-
-
-class SwiGLU(nn.Module):
-
-    def __init__(self, dim, mid_dim):
-        super().__init__()
-        self.dim = dim
-        self.mid_dim = mid_dim
-
-        # layers
-        self.fc1 = nn.Linear(dim, mid_dim)
-        self.fc2 = nn.Linear(dim, mid_dim)
-        self.fc3 = nn.Linear(mid_dim, dim)
-
-    def forward(self, x):
-        x = F.silu(self.fc1(x)) * self.fc2(x)
-        x = self.fc3(x)
-        return x
-
-
-class AttentionBlock(nn.Module):
-
-    def __init__(self,
-                 dim,
-                 mlp_ratio,
-                 num_heads,
-                 post_norm=False,
-                 causal=False,
-                 activation='quick_gelu',
-                 attn_dropout=0.0,
-                 proj_dropout=0.0,
-                 norm_eps=1e-5):
-        assert activation in ['quick_gelu', 'gelu', 'swi_glu']
-        super().__init__()
-        self.dim = dim
-        self.mlp_ratio = mlp_ratio
-        self.num_heads = num_heads
-        self.post_norm = post_norm
-        self.causal = causal
-        self.norm_eps = norm_eps
-
-        # layers
-        self.norm1 = LayerNorm(dim, eps=norm_eps)
-        self.attn = SelfAttention(dim, num_heads, causal, attn_dropout,
-                                  proj_dropout)
-        self.norm2 = LayerNorm(dim, eps=norm_eps)
-        if activation == 'swi_glu':
-            self.mlp = SwiGLU(dim, int(dim * mlp_ratio))
-        else:
-            self.mlp = nn.Sequential(
-                nn.Linear(dim, int(dim * mlp_ratio)),
-                QuickGELU() if activation == 'quick_gelu' else nn.GELU(),
-                nn.Linear(int(dim * mlp_ratio), dim), nn.Dropout(proj_dropout))
-
-    def forward(self, x):
-        if self.post_norm:
-            x = x + self.norm1(self.attn(x))
-            x = x + self.norm2(self.mlp(x))
-        else:
-            x = x + self.attn(self.norm1(x))
-            x = x + self.mlp(self.norm2(x))
-        return x
-
-
-class AttentionPool(nn.Module):
-
-    def __init__(self,
-                 dim,
-                 mlp_ratio,
-                 num_heads,
-                 activation='gelu',
-                 proj_dropout=0.0,
-                 norm_eps=1e-5):
-        assert dim % num_heads == 0
-        super().__init__()
-        self.dim = dim
-        self.mlp_ratio = mlp_ratio
-        self.num_heads = num_heads
-        self.head_dim = dim // num_heads
-        self.proj_dropout = proj_dropout
-        self.norm_eps = norm_eps
-
-        # layers
-        gain = 1.0 / math.sqrt(dim)
-        self.cls_embedding = nn.Parameter(gain * torch.randn(1, 1, dim))
-        self.to_q = nn.Linear(dim, dim)
-        self.to_kv = nn.Linear(dim, dim * 2)
-        self.proj = nn.Linear(dim, dim)
-        self.norm = LayerNorm(dim, eps=norm_eps)
-        self.mlp = nn.Sequential(
-            nn.Linear(dim, int(dim * mlp_ratio)),
-            QuickGELU() if activation == 'quick_gelu' else nn.GELU(),
-            nn.Linear(int(dim * mlp_ratio), dim), nn.Dropout(proj_dropout))
-
-    def forward(self, x):
-        """
-        x:  [B, L, C].
-        """
-        b, s, c, n, d = *x.size(), self.num_heads, self.head_dim
-
-        # compute query, key, value
-        q = self.to_q(self.cls_embedding).view(1, 1, n, d).expand(b, -1, -1, -1)
-        k, v = self.to_kv(x).view(b, s, 2, n, d).unbind(2)
-
-        # compute attention
-        x = flash_attention(q, k, v, version=2)
-        x = x.reshape(b, 1, c)
-
-        # output
-        x = self.proj(x)
-        x = F.dropout(x, self.proj_dropout, self.training)
-
-        # mlp
-        x = x + self.mlp(self.norm(x))
-        return x[:, 0]
-
-
-class VisionTransformer(nn.Module):
-
-    def __init__(self,
-                 image_size=224,
-                 patch_size=16,
-                 dim=768,
-                 mlp_ratio=4,
-                 out_dim=512,
-                 num_heads=12,
-                 num_layers=12,
-                 pool_type='token',
-                 pre_norm=True,
-                 post_norm=False,
-                 activation='quick_gelu',
-                 attn_dropout=0.0,
-                 proj_dropout=0.0,
-                 embedding_dropout=0.0,
-                 norm_eps=1e-5):
-        if image_size % patch_size != 0:
-            print(
-                '[WARNING] image_size is not divisible by patch_size',
-                flush=True)
-        assert pool_type in ('token', 'token_fc', 'attn_pool')
-        out_dim = out_dim or dim
-        super().__init__()
-        self.image_size = image_size
-        self.patch_size = patch_size
-        self.num_patches = (image_size // patch_size)**2
-        self.dim = dim
-        self.mlp_ratio = mlp_ratio
-        self.out_dim = out_dim
-        self.num_heads = num_heads
-        self.num_layers = num_layers
-        self.pool_type = pool_type
-        self.post_norm = post_norm
-        self.norm_eps = norm_eps
-
-        # embeddings
-        gain = 1.0 / math.sqrt(dim)
-        self.patch_embedding = nn.Conv2d(
-            3,
-            dim,
-            kernel_size=patch_size,
-            stride=patch_size,
-            bias=not pre_norm)
-        if pool_type in ('token', 'token_fc'):
-            self.cls_embedding = nn.Parameter(gain * torch.randn(1, 1, dim))
-        self.pos_embedding = nn.Parameter(gain * torch.randn(
-            1, self.num_patches +
-            (1 if pool_type in ('token', 'token_fc') else 0), dim))
-        self.dropout = nn.Dropout(embedding_dropout)
-
-        # transformer
-        self.pre_norm = LayerNorm(dim, eps=norm_eps) if pre_norm else None
-        self.transformer = nn.Sequential(*[
-            AttentionBlock(dim, mlp_ratio, num_heads, post_norm, False,
-                           activation, attn_dropout, proj_dropout, norm_eps)
-            for _ in range(num_layers)
-        ])
-        self.post_norm = LayerNorm(dim, eps=norm_eps)
-
-        # head
-        if pool_type == 'token':
-            self.head = nn.Parameter(gain * torch.randn(dim, out_dim))
-        elif pool_type == 'token_fc':
-            self.head = nn.Linear(dim, out_dim)
-        elif pool_type == 'attn_pool':
-            self.head = AttentionPool(dim, mlp_ratio, num_heads, activation,
-                                      proj_dropout, norm_eps)
-
-    def forward(self, x, interpolation=False, use_31_block=False):
-        b = x.size(0)
-
-        # embeddings
-        x = self.patch_embedding(x).flatten(2).permute(0, 2, 1)
-        if self.pool_type in ('token', 'token_fc'):
-            x = torch.cat([self.cls_embedding.expand(b, -1, -1), x], dim=1)
-        if interpolation:
-            e = pos_interpolate(self.pos_embedding, x.size(1))
-        else:
-            e = self.pos_embedding
-        x = self.dropout(x + e)
-        if self.pre_norm is not None:
-            x = self.pre_norm(x)
-
-        # transformer
-        if use_31_block:
-            x = self.transformer[:-1](x)
-            return x
-        else:
-            x = self.transformer(x)
-            return x
-
-
-class XLMRobertaWithHead(XLMRoberta):
-
-    def __init__(self, **kwargs):
-        self.out_dim = kwargs.pop('out_dim')
-        super().__init__(**kwargs)
-
-        # head
-        mid_dim = (self.dim + self.out_dim) // 2
-        self.head = nn.Sequential(
-            nn.Linear(self.dim, mid_dim, bias=False), nn.GELU(),
-            nn.Linear(mid_dim, self.out_dim, bias=False))
-
-    def forward(self, ids):
-        # xlm-roberta
-        x = super().forward(ids)
-
-        # average pooling
-        mask = ids.ne(self.pad_id).unsqueeze(-1).to(x)
-        x = (x * mask).sum(dim=1) / mask.sum(dim=1)
-
-        # head
-        x = self.head(x)
-        return x
-
-
-class XLMRobertaCLIP(nn.Module):
-
-    def __init__(self,
-                 embed_dim=1024,
-                 image_size=224,
-                 patch_size=14,
-                 vision_dim=1280,
-                 vision_mlp_ratio=4,
-                 vision_heads=16,
-                 vision_layers=32,
-                 vision_pool='token',
-                 vision_pre_norm=True,
-                 vision_post_norm=False,
-                 activation='gelu',
-                 vocab_size=250002,
-                 max_text_len=514,
-                 type_size=1,
-                 pad_id=1,
-                 text_dim=1024,
-                 text_heads=16,
-                 text_layers=24,
-                 text_post_norm=True,
-                 text_dropout=0.1,
-                 attn_dropout=0.0,
-                 proj_dropout=0.0,
-                 embedding_dropout=0.0,
-                 norm_eps=1e-5):
-        super().__init__()
-        self.embed_dim = embed_dim
-        self.image_size = image_size
-        self.patch_size = patch_size
-        self.vision_dim = vision_dim
-        self.vision_mlp_ratio = vision_mlp_ratio
-        self.vision_heads = vision_heads
-        self.vision_layers = vision_layers
-        self.vision_pre_norm = vision_pre_norm
-        self.vision_post_norm = vision_post_norm
-        self.activation = activation
-        self.vocab_size = vocab_size
-        self.max_text_len = max_text_len
-        self.type_size = type_size
-        self.pad_id = pad_id
-        self.text_dim = text_dim
-        self.text_heads = text_heads
-        self.text_layers = text_layers
-        self.text_post_norm = text_post_norm
-        self.norm_eps = norm_eps
-
-        # models
-        self.visual = VisionTransformer(
-            image_size=image_size,
-            patch_size=patch_size,
-            dim=vision_dim,
-            mlp_ratio=vision_mlp_ratio,
-            out_dim=embed_dim,
-            num_heads=vision_heads,
-            num_layers=vision_layers,
-            pool_type=vision_pool,
-            pre_norm=vision_pre_norm,
-            post_norm=vision_post_norm,
-            activation=activation,
-            attn_dropout=attn_dropout,
-            proj_dropout=proj_dropout,
-            embedding_dropout=embedding_dropout,
-            norm_eps=norm_eps)
-        self.textual = XLMRobertaWithHead(
-            vocab_size=vocab_size,
-            max_seq_len=max_text_len,
-            type_size=type_size,
-            pad_id=pad_id,
-            dim=text_dim,
-            out_dim=embed_dim,
-            num_heads=text_heads,
-            num_layers=text_layers,
-            post_norm=text_post_norm,
-            dropout=text_dropout)
-        self.log_scale = nn.Parameter(math.log(1 / 0.07) * torch.ones([]))
-
-    def forward(self, imgs, txt_ids):
-        """
-        imgs:       [B, 3, H, W] of torch.float32.
-        - mean:     [0.48145466, 0.4578275, 0.40821073]
-        - std:      [0.26862954, 0.26130258, 0.27577711]
-        txt_ids:    [B, L] of torch.long.
-                    Encoded by data.CLIPTokenizer.
-        """
-        xi = self.visual(imgs)
-        xt = self.textual(txt_ids)
-        return xi, xt
-
-    def param_groups(self):
-        groups = [{
-            'params': [
-                p for n, p in self.named_parameters()
-                if 'norm' in n or n.endswith('bias')
-            ],
-            'weight_decay': 0.0
-        }, {
-            'params': [
-                p for n, p in self.named_parameters()
-                if not ('norm' in n or n.endswith('bias'))
-            ]
-        }]
-        return groups
-
-
-def clip_transforms(model, pretrained_name):
-    if 'siglip' in pretrained_name.lower():
-        mean, std = [0.5, 0.5, 0.5], [0.5, 0.5, 0.5]
-    else:
-        mean = [0.48145466, 0.4578275, 0.40821073]
-        std = [0.26862954, 0.26130258, 0.27577711]
-
-    # transforms
-    return T.Compose([
-        T.Resize((model.image_size, model.image_size),
-                    interpolation=T.InterpolationMode.BICUBIC),
-        T.ToTensor(),
-        T.Normalize(mean=mean, std=std)
-    ])
-
-
-class CLIPModel:
-
-    def __init__(self, dtype, device, checkpoint_path, tokenizer_path, model: XLMRobertaCLIP):
-        self.dtype = dtype
-        self.device = device
-        self.checkpoint_path = checkpoint_path
-        self.tokenizer_path = tokenizer_path
-
-        self.model = model.to(dtype)
-        self.transforms = clip_transforms(model, 'open-clip-xlm-roberta-large-vit-huge-14')
-        self.tokenizer = HuggingfaceTokenizer(
-            name=tokenizer_path,
-            seq_len=self.model.max_text_len - 2,
-            clean='whitespace')
-
-    def visual(self, videos):
-        # preprocess
-        size = (self.model.image_size,) * 2
-        videos = torch.cat([
-            F.interpolate(
-                u.transpose(0, 1),
-                size=size,
-                mode='bicubic',
-                align_corners=False) for u in videos
-        ])
-        videos = self.transforms.transforms[-1](videos.mul_(0.5).add_(0.5))
-
-        # forward
-        with torch.cuda.amp.autocast(dtype=self.dtype):
-            out = self.model.visual(videos, use_31_block=True)
-            return out
-
-    def load_weight(self):
-        logger.info(f'loading CLIPModel weight from ckpt_path: {self.checkpoint_path}')
-        self.model.load_state_dict(
-            torch.load(self.checkpoint_path, map_location='cpu'))
-        self.model = self.model.to(self.dtype)
-        logger.info(f'loading CLIPModel weight from ckpt_path: {self.checkpoint_path} finished')
diff --git a/videotuna/models/wan/wan/modules/model.py b/videotuna/models/wan/wan/modules/model.py
index 0a0eee4c..bef78657 100644
--- a/videotuna/models/wan/wan/modules/model.py
+++ b/videotuna/models/wan/wan/modules/model.py
@@ -2,15 +2,13 @@
 import math
 
 import torch
-import torch.cuda.amp as amp
 import torch.nn as nn
 from diffusers.configuration_utils import ConfigMixin, register_to_config
 from diffusers.models.modeling_utils import ModelMixin
-from tqdm import tqdm
-from loguru import logger
+
 from .attention import flash_attention
 
-__all__ = ['WanModel']
+__all__ = ["WanModel"]
 
 
 def sinusoidal_embedding_1d(dim, position):
@@ -21,23 +19,24 @@ def sinusoidal_embedding_1d(dim, position):
 
     # calculation
     sinusoid = torch.outer(
-        position, torch.pow(10000, -torch.arange(half).to(position).div(half)))
+        position, torch.pow(10000, -torch.arange(half).to(position).div(half))
+    )
     x = torch.cat([torch.cos(sinusoid), torch.sin(sinusoid)], dim=1)
     return x
 
 
-@amp.autocast(enabled=False)
+@torch.amp.autocast("cuda", enabled=False)
 def rope_params(max_seq_len, dim, theta=10000):
     assert dim % 2 == 0
     freqs = torch.outer(
         torch.arange(max_seq_len),
-        1.0 / torch.pow(theta,
-                        torch.arange(0, dim, 2).to(torch.float64).div(dim)))
+        1.0 / torch.pow(theta, torch.arange(0, dim, 2).to(torch.float64).div(dim)),
+    )
     freqs = torch.polar(torch.ones_like(freqs), freqs)
     return freqs
 
 
-@amp.autocast(enabled=False)
+@torch.amp.autocast("cuda", enabled=False)
 def rope_apply(x, grid_sizes, freqs):
     n, c = x.size(2), x.size(3) // 2
 
@@ -50,14 +49,17 @@ def rope_apply(x, grid_sizes, freqs):
         seq_len = f * h * w
 
         # precompute multipliers
-        x_i = torch.view_as_complex(x[i, :seq_len].to(torch.float64).reshape(
-            seq_len, n, -1, 2))
-        freqs_i = torch.cat([
-            freqs[0][:f].view(f, 1, 1, -1).expand(f, h, w, -1),
-            freqs[1][:h].view(1, h, 1, -1).expand(f, h, w, -1),
-            freqs[2][:w].view(1, 1, w, -1).expand(f, h, w, -1)
-        ],
-                            dim=-1).reshape(seq_len, 1, -1)
+        x_i = torch.view_as_complex(
+            x[i, :seq_len].to(torch.float64).reshape(seq_len, n, -1, 2)
+        )
+        freqs_i = torch.cat(
+            [
+                freqs[0][:f].view(f, 1, 1, -1).expand(f, h, w, -1),
+                freqs[1][:h].view(1, h, 1, -1).expand(f, h, w, -1),
+                freqs[2][:w].view(1, 1, w, -1).expand(f, h, w, -1),
+            ],
+            dim=-1,
+        ).reshape(seq_len, 1, -1)
 
         # apply rotary embedding
         x_i = torch.view_as_real(x_i * freqs_i).flatten(2)
@@ -69,7 +71,6 @@ def rope_apply(x, grid_sizes, freqs):
 
 
 class WanRMSNorm(nn.Module):
-
     def __init__(self, dim, eps=1e-5):
         super().__init__()
         self.dim = dim
@@ -88,7 +89,6 @@ def _norm(self, x):
 
 
 class WanLayerNorm(nn.LayerNorm):
-
     def __init__(self, dim, eps=1e-6, elementwise_affine=False):
         super().__init__(dim, elementwise_affine=elementwise_affine, eps=eps)
 
@@ -101,13 +101,7 @@ def forward(self, x):
 
 
 class WanSelfAttention(nn.Module):
-
-    def __init__(self,
-                 dim,
-                 num_heads,
-                 window_size=(-1, -1),
-                 qk_norm=True,
-                 eps=1e-6):
+    def __init__(self, dim, num_heads, window_size=(-1, -1), qk_norm=True, eps=1e-6):
         assert dim % num_heads == 0
         super().__init__()
         self.dim = dim
@@ -149,7 +143,8 @@ def qkv_fn(x):
             k=rope_apply(k, grid_sizes, freqs),
             v=v,
             k_lens=seq_lens,
-            window_size=self.window_size)
+            window_size=self.window_size,
+        )
 
         # output
         x = x.flatten(2)
@@ -157,8 +152,7 @@ def qkv_fn(x):
         return x
 
 
-class WanT2VCrossAttention(WanSelfAttention):
-
+class WanCrossAttention(WanSelfAttention):
     def forward(self, x, context, context_lens):
         r"""
         Args:
@@ -182,67 +176,17 @@ def forward(self, x, context, context_lens):
         return x
 
 
-class WanI2VCrossAttention(WanSelfAttention):
-
-    def __init__(self,
-                 dim,
-                 num_heads,
-                 window_size=(-1, -1),
-                 qk_norm=True,
-                 eps=1e-6):
-        super().__init__(dim, num_heads, window_size, qk_norm, eps)
-
-        self.k_img = nn.Linear(dim, dim)
-        self.v_img = nn.Linear(dim, dim)
-        # self.alpha = nn.Parameter(torch.zeros((1, )))
-        self.norm_k_img = WanRMSNorm(dim, eps=eps) if qk_norm else nn.Identity()
-
-    def forward(self, x, context, context_lens):
-        r"""
-        Args:
-            x(Tensor): Shape [B, L1, C]
-            context(Tensor): Shape [B, L2, C]
-            context_lens(Tensor): Shape [B]
-        """
-        context_img = context[:, :257]
-        context = context[:, 257:]
-        b, n, d = x.size(0), self.num_heads, self.head_dim
-
-        # compute query, key, value
-        q = self.norm_q(self.q(x)).view(b, -1, n, d)
-        k = self.norm_k(self.k(context)).view(b, -1, n, d)
-        v = self.v(context).view(b, -1, n, d)
-        k_img = self.norm_k_img(self.k_img(context_img)).view(b, -1, n, d)
-        v_img = self.v_img(context_img).view(b, -1, n, d)
-        img_x = flash_attention(q, k_img, v_img, k_lens=None)
-        # compute attention
-        x = flash_attention(q, k, v, k_lens=context_lens)
-
-        # output
-        x = x.flatten(2)
-        img_x = img_x.flatten(2)
-        x = x + img_x
-        x = self.o(x)
-        return x
-
-
-WAN_CROSSATTENTION_CLASSES = {
-    't2v_cross_attn': WanT2VCrossAttention,
-    'i2v_cross_attn': WanI2VCrossAttention,
-}
-
-
 class WanAttentionBlock(nn.Module):
-
-    def __init__(self,
-                 cross_attn_type,
-                 dim,
-                 ffn_dim,
-                 num_heads,
-                 window_size=(-1, -1),
-                 qk_norm=True,
-                 cross_attn_norm=False,
-                 eps=1e-6):
+    def __init__(
+        self,
+        dim,
+        ffn_dim,
+        num_heads,
+        window_size=(-1, -1),
+        qk_norm=True,
+        cross_attn_norm=False,
+        eps=1e-6,
+    ):
         super().__init__()
         self.dim = dim
         self.ffn_dim = ffn_dim
@@ -254,20 +198,19 @@ def __init__(self,
 
         # layers
         self.norm1 = WanLayerNorm(dim, eps)
-        self.self_attn = WanSelfAttention(dim, num_heads, window_size, qk_norm,
-                                          eps)
-        self.norm3 = WanLayerNorm(
-            dim, eps,
-            elementwise_affine=True) if cross_attn_norm else nn.Identity()
-        self.cross_attn = WAN_CROSSATTENTION_CLASSES[cross_attn_type](dim,
-                                                                      num_heads,
-                                                                      (-1, -1),
-                                                                      qk_norm,
-                                                                      eps)
+        self.self_attn = WanSelfAttention(dim, num_heads, window_size, qk_norm, eps)
+        self.norm3 = (
+            WanLayerNorm(dim, eps, elementwise_affine=True)
+            if cross_attn_norm
+            else nn.Identity()
+        )
+        self.cross_attn = WanCrossAttention(dim, num_heads, (-1, -1), qk_norm, eps)
         self.norm2 = WanLayerNorm(dim, eps)
         self.ffn = nn.Sequential(
-            nn.Linear(dim, ffn_dim), nn.GELU(approximate='tanh'),
-            nn.Linear(ffn_dim, dim))
+            nn.Linear(dim, ffn_dim),
+            nn.GELU(approximate="tanh"),
+            nn.Linear(ffn_dim, dim),
+        )
 
         # modulation
         self.modulation = nn.Parameter(torch.randn(1, 6, dim) / dim**0.5)
@@ -285,29 +228,34 @@ def forward(
         r"""
         Args:
             x(Tensor): Shape [B, L, C]
-            e(Tensor): Shape [B, 6, C]
+            e(Tensor): Shape [B, L1, 6, C]
             seq_lens(Tensor): Shape [B], length of each sequence in batch
             grid_sizes(Tensor): Shape [B, 3], the second dimension contains (F, H, W)
             freqs(Tensor): Rope freqs, shape [1024, C / num_heads / 2]
         """
         assert e.dtype == torch.float32
-        with amp.autocast(dtype=torch.float32):
-            e = (self.modulation + e).chunk(6, dim=1)
+        with torch.amp.autocast("cuda", dtype=torch.float32):
+            e = (self.modulation.unsqueeze(0) + e).chunk(6, dim=2)
         assert e[0].dtype == torch.float32
 
         # self-attention
         y = self.self_attn(
-            self.norm1(x).float() * (1 + e[1]) + e[0], seq_lens, grid_sizes,
-            freqs)
-        with amp.autocast(dtype=torch.float32):
-            x = x + y * e[2]
+            self.norm1(x).float() * (1 + e[1].squeeze(2)) + e[0].squeeze(2),
+            seq_lens,
+            grid_sizes,
+            freqs,
+        )
+        with torch.amp.autocast("cuda", dtype=torch.float32):
+            x = x + y * e[2].squeeze(2)
 
         # cross-attention & ffn function
         def cross_attn_ffn(x, context, context_lens, e):
             x = x + self.cross_attn(self.norm3(x), context, context_lens)
-            y = self.ffn(self.norm2(x).float() * (1 + e[4]) + e[3])
-            with amp.autocast(dtype=torch.float32):
-                x = x + y * e[5]
+            y = self.ffn(
+                self.norm2(x).float() * (1 + e[4].squeeze(2)) + e[3].squeeze(2)
+            )
+            with torch.amp.autocast("cuda", dtype=torch.float32):
+                x = x + y * e[5].squeeze(2)
             return x
 
         x = cross_attn_ffn(x, context, context_lens, e)
@@ -315,7 +263,6 @@ def cross_attn_ffn(x, context, context_lens, e):
 
 
 class Head(nn.Module):
-
     def __init__(self, dim, out_dim, patch_size, eps=1e-6):
         super().__init__()
         self.dim = dim
@@ -335,57 +282,48 @@ def forward(self, x, e):
         r"""
         Args:
             x(Tensor): Shape [B, L1, C]
-            e(Tensor): Shape [B, C]
+            e(Tensor): Shape [B, L1, C]
         """
         assert e.dtype == torch.float32
-        with amp.autocast(dtype=torch.float32):
-            e = (self.modulation + e.unsqueeze(1)).chunk(2, dim=1)
-            x = (self.head(self.norm(x) * (1 + e[1]) + e[0]))
+        with torch.amp.autocast("cuda", dtype=torch.float32):
+            e = (self.modulation.unsqueeze(0) + e.unsqueeze(2)).chunk(2, dim=2)
+            x = self.head(self.norm(x) * (1 + e[1].squeeze(2)) + e[0].squeeze(2))
         return x
 
 
-class MLPProj(torch.nn.Module):
-
-    def __init__(self, in_dim, out_dim):
-        super().__init__()
-
-        self.proj = torch.nn.Sequential(
-            torch.nn.LayerNorm(in_dim), torch.nn.Linear(in_dim, in_dim),
-            torch.nn.GELU(), torch.nn.Linear(in_dim, out_dim),
-            torch.nn.LayerNorm(out_dim))
-
-    def forward(self, image_embeds):
-        clip_extra_context_tokens = self.proj(image_embeds)
-        return clip_extra_context_tokens
-
-
 class WanModel(ModelMixin, ConfigMixin):
     r"""
     Wan diffusion backbone supporting both text-to-video and image-to-video.
     """
 
     ignore_for_config = [
-        'patch_size', 'cross_attn_norm', 'qk_norm', 'text_dim', 'window_size'
+        "patch_size",
+        "cross_attn_norm",
+        "qk_norm",
+        "text_dim",
+        "window_size",
     ]
-    _no_split_modules = ['WanAttentionBlock']
+    _no_split_modules = ["WanAttentionBlock"]
 
     @register_to_config
-    def __init__(self,
-                 model_type='t2v',
-                 patch_size=(1, 2, 2),
-                 text_len=512,
-                 in_dim=16,
-                 dim=2048,
-                 ffn_dim=8192,
-                 freq_dim=256,
-                 text_dim=4096,
-                 out_dim=16,
-                 num_heads=16,
-                 num_layers=32,
-                 window_size=(-1, -1),
-                 qk_norm=True,
-                 cross_attn_norm=True,
-                 eps=1e-6):
+    def __init__(
+        self,
+        model_type="t2v",
+        patch_size=(1, 2, 2),
+        text_len=512,
+        in_dim=16,
+        dim=2048,
+        ffn_dim=8192,
+        freq_dim=256,
+        text_dim=4096,
+        out_dim=16,
+        num_heads=16,
+        num_layers=32,
+        window_size=(-1, -1),
+        qk_norm=True,
+        cross_attn_norm=True,
+        eps=1e-6,
+    ):
         r"""
         Initialize the diffusion model backbone.
 
@@ -424,7 +362,7 @@ def __init__(self,
 
         super().__init__()
 
-        assert model_type in ['t2v', 'i2v']
+        assert model_type in ["t2v", "i2v", "ti2v", "s2v"]
         self.model_type = model_type
 
         self.patch_size = patch_size
@@ -444,22 +382,26 @@ def __init__(self,
 
         # embeddings
         self.patch_embedding = nn.Conv3d(
-            in_dim, dim, kernel_size=patch_size, stride=patch_size)
+            in_dim, dim, kernel_size=patch_size, stride=patch_size
+        )
         self.text_embedding = nn.Sequential(
-            nn.Linear(text_dim, dim), nn.GELU(approximate='tanh'),
-            nn.Linear(dim, dim))
+            nn.Linear(text_dim, dim), nn.GELU(approximate="tanh"), nn.Linear(dim, dim)
+        )
 
         self.time_embedding = nn.Sequential(
-            nn.Linear(freq_dim, dim), nn.SiLU(), nn.Linear(dim, dim))
+            nn.Linear(freq_dim, dim), nn.SiLU(), nn.Linear(dim, dim)
+        )
         self.time_projection = nn.Sequential(nn.SiLU(), nn.Linear(dim, dim * 6))
 
         # blocks
-        cross_attn_type = 't2v_cross_attn' if model_type == 't2v' else 'i2v_cross_attn'
-        self.blocks = nn.ModuleList([
-            WanAttentionBlock(cross_attn_type, dim, ffn_dim, num_heads,
-                              window_size, qk_norm, cross_attn_norm, eps)
-            for _ in range(num_layers)
-        ])
+        self.blocks = nn.ModuleList(
+            [
+                WanAttentionBlock(
+                    dim, ffn_dim, num_heads, window_size, qk_norm, cross_attn_norm, eps
+                )
+                for _ in range(num_layers)
+            ]
+        )
 
         # head
         self.head = Head(dim, out_dim, patch_size, eps)
@@ -467,15 +409,14 @@ def __init__(self,
         # buffers (don't use register_buffer otherwise dtype will be changed in to())
         assert (dim % num_heads) == 0 and (dim // num_heads) % 2 == 0
         d = dim // num_heads
-        self.freqs = torch.cat([
-            rope_params(1024, d - 4 * (d // 6)),
-            rope_params(1024, 2 * (d // 6)),
-            rope_params(1024, 2 * (d // 6))
-        ],
-                               dim=1)
-
-        if model_type == 'i2v':
-            self.img_emb = MLPProj(1280, dim)
+        self.freqs = torch.cat(
+            [
+                rope_params(1024, d - 4 * (d // 6)),
+                rope_params(1024, 2 * (d // 6)),
+                rope_params(1024, 2 * (d // 6)),
+            ],
+            dim=1,
+        )
 
         # initialize weights
         self.init_weights()
@@ -485,11 +426,8 @@ def forward(
         x,
         t,
         context,
-        seq_len=None,
-        clip_fea=None,
+        seq_len,
         y=None,
-        grad_offload=True,
-        activation_checkpointing=True
     ):
         r"""
         Forward pass through the diffusion model
@@ -503,8 +441,6 @@ def forward(
                 List of text embeddings each with shape [L, C]
             seq_len (`int`):
                 Maximum sequence length for positional encoding
-            clip_fea (Tensor, *optional*):
-                CLIP image features for image-to-video mode
             y (List[Tensor], *optional*):
                 Conditional video inputs for image-to-video mode, same shape as x
 
@@ -512,8 +448,8 @@ def forward(
             List[Tensor]:
                 List of denoised video tensors with original input shapes [C_out, F, H / 8, W / 8]
         """
-        if self.model_type == 'i2v':
-            assert clip_fea is not None and y is not None
+        if self.model_type == "i2v":
+            assert y is not None
         # params
         device = self.patch_embedding.weight.device
         if self.freqs.device != device:
@@ -525,36 +461,42 @@ def forward(
         # embeddings
         x = [self.patch_embedding(u.unsqueeze(0)) for u in x]
         grid_sizes = torch.stack(
-            [torch.tensor(u.shape[2:], dtype=torch.long) for u in x])
+            [torch.tensor(u.shape[2:], dtype=torch.long) for u in x]
+        )
         x = [u.flatten(2).transpose(1, 2) for u in x]
         seq_lens = torch.tensor([u.size(1) for u in x], dtype=torch.long)
-        if seq_len is None:
-            seq_len = seq_lens.max()
         assert seq_lens.max() <= seq_len
-        x = torch.cat([
-            torch.cat([u, u.new_zeros(1, seq_len - u.size(1), u.size(2))],
-                      dim=1) for u in x
-        ])
+        x = torch.cat(
+            [
+                torch.cat([u, u.new_zeros(1, seq_len - u.size(1), u.size(2))], dim=1)
+                for u in x
+            ]
+        )
 
         # time embeddings
-        with amp.autocast(dtype=torch.float32):
+        if t.dim() == 1:
+            t = t.expand(t.size(0), seq_len)
+        with torch.amp.autocast("cuda", dtype=torch.float32):
+            bt = t.size(0)
+            t = t.flatten()
             e = self.time_embedding(
-                sinusoidal_embedding_1d(self.freq_dim, t).float())
-            e0 = self.time_projection(e).unflatten(1, (6, self.dim))
+                sinusoidal_embedding_1d(self.freq_dim, t)
+                .unflatten(0, (bt, seq_len))
+                .float()
+            )
+            e0 = self.time_projection(e).unflatten(2, (6, self.dim))
             assert e.dtype == torch.float32 and e0.dtype == torch.float32
 
         # context
         context_lens = None
         context = self.text_embedding(
-            torch.stack([
-                torch.cat(
-                    [u, u.new_zeros(self.text_len - u.size(0), u.size(1))])
-                for u in context
-            ]))
-
-        if clip_fea is not None:
-            context_clip = self.img_emb(clip_fea)  # bs x 257 x dim
-            context = torch.concat([context_clip, context], dim=1)
+            torch.stack(
+                [
+                    torch.cat([u, u.new_zeros(self.text_len - u.size(0), u.size(1))])
+                    for u in context
+                ]
+            )
+        )
 
         # arguments
         kwargs = dict(
@@ -563,25 +505,12 @@ def forward(
             grid_sizes=grid_sizes,
             freqs=self.freqs,
             context=context,
-            context_lens=context_lens)
-
-        def create_custom_forward(module):
-            def custom_forward(*inputs):
-                return module(*inputs)
-            return custom_forward
-        
-        for block in tqdm(self.blocks):
-            if self.training and activation_checkpointing:
-                if grad_offload:
-                    #logger.info("activation checkpointing with cpu offload")
-                    with torch.autograd.graph.save_on_cpu():
-                        x = torch.utils.checkpoint.checkpoint(create_custom_forward(block), x, e0, seq_lens, grid_sizes, self.freqs, context, context_lens, use_reentrant=False)
-                else:
-                    #logger.info("activation checkpointing")
-                    x = torch.utils.checkpoint.checkpoint(create_custom_forward(block), x, e0, seq_lens, grid_sizes, self.freqs, context, context_lens, use_reentrant=False)
-            else:
-                x = block(x, **kwargs)
-                
+            context_lens=context_lens,
+        )
+
+        for block in self.blocks:
+            x = block(x, **kwargs)
+
         # head
         x = self.head(x, e)
 
@@ -608,8 +537,8 @@ def unpatchify(self, x, grid_sizes):
         c = self.out_dim
         out = []
         for u, v in zip(x, grid_sizes.tolist()):
-            u = u[:math.prod(v)].view(*v, *self.patch_size, c)
-            u = torch.einsum('fhwpqrc->cfphqwr', u)
+            u = u[: math.prod(v)].view(*v, *self.patch_size, c)
+            u = torch.einsum("fhwpqrc->cfphqwr", u)
             u = u.reshape(c, *[i * j for i, j in zip(v, self.patch_size)])
             out.append(u)
         return out
@@ -630,10 +559,10 @@ def init_weights(self):
         nn.init.xavier_uniform_(self.patch_embedding.weight.flatten(1))
         for m in self.text_embedding.modules():
             if isinstance(m, nn.Linear):
-                nn.init.normal_(m.weight, std=.02)
+                nn.init.normal_(m.weight, std=0.02)
         for m in self.time_embedding.modules():
             if isinstance(m, nn.Linear):
-                nn.init.normal_(m.weight, std=.02)
+                nn.init.normal_(m.weight, std=0.02)
 
         # init output layer
         nn.init.zeros_(self.head.head.weight)
diff --git a/videotuna/models/wan/wan/modules/t5.py b/videotuna/models/wan/wan/modules/t5.py
index 362fdbb1..b6bbdf83 100644
--- a/videotuna/models/wan/wan/modules/t5.py
+++ b/videotuna/models/wan/wan/modules/t5.py
@@ -1,8 +1,7 @@
 # Modified from transformers.models.t5.modeling_t5
 # Copyright 2024-2025 The Alibaba Wan Team Authors. All rights reserved.
-from loguru import logger
+import logging
 import math
-from typing import Union
 
 import torch
 import torch.nn as nn
@@ -11,10 +10,10 @@
 from .tokenizers import HuggingfaceTokenizer
 
 __all__ = [
-    'T5Model',
-    'T5Encoder',
-    'T5Decoder',
-    'T5EncoderModel',
+    "T5Model",
+    "T5Encoder",
+    "T5Decoder",
+    "T5EncoderModel",
 ]
 
 
@@ -35,24 +34,31 @@ def init_weights(m):
         nn.init.normal_(m.fc1.weight, std=m.dim**-0.5)
         nn.init.normal_(m.fc2.weight, std=m.dim_ffn**-0.5)
     elif isinstance(m, T5Attention):
-        nn.init.normal_(m.q.weight, std=(m.dim * m.dim_attn)**-0.5)
+        nn.init.normal_(m.q.weight, std=(m.dim * m.dim_attn) ** -0.5)
         nn.init.normal_(m.k.weight, std=m.dim**-0.5)
         nn.init.normal_(m.v.weight, std=m.dim**-0.5)
-        nn.init.normal_(m.o.weight, std=(m.num_heads * m.dim_attn)**-0.5)
+        nn.init.normal_(m.o.weight, std=(m.num_heads * m.dim_attn) ** -0.5)
     elif isinstance(m, T5RelativeEmbedding):
         nn.init.normal_(
-            m.embedding.weight, std=(2 * m.num_buckets * m.num_heads)**-0.5)
+            m.embedding.weight, std=(2 * m.num_buckets * m.num_heads) ** -0.5
+        )
 
 
 class GELU(nn.Module):
-
     def forward(self, x):
-        return 0.5 * x * (1.0 + torch.tanh(
-            math.sqrt(2.0 / math.pi) * (x + 0.044715 * torch.pow(x, 3.0))))
+        return (
+            0.5
+            * x
+            * (
+                1.0
+                + torch.tanh(
+                    math.sqrt(2.0 / math.pi) * (x + 0.044715 * torch.pow(x, 3.0))
+                )
+            )
+        )
 
 
 class T5LayerNorm(nn.Module):
-
     def __init__(self, dim, eps=1e-6):
         super(T5LayerNorm, self).__init__()
         self.dim = dim
@@ -60,15 +66,13 @@ def __init__(self, dim, eps=1e-6):
         self.weight = nn.Parameter(torch.ones(dim))
 
     def forward(self, x):
-        x = x * torch.rsqrt(x.float().pow(2).mean(dim=-1, keepdim=True) +
-                            self.eps)
+        x = x * torch.rsqrt(x.float().pow(2).mean(dim=-1, keepdim=True) + self.eps)
         if self.weight.dtype in [torch.float16, torch.bfloat16]:
             x = x.type_as(self.weight)
         return self.weight * x
 
 
 class T5Attention(nn.Module):
-
     def __init__(self, dim, dim_attn, num_heads, dropout=0.1):
         assert dim_attn % num_heads == 0
         super(T5Attention, self).__init__()
@@ -105,14 +109,13 @@ def forward(self, x, context=None, mask=None, pos_bias=None):
             attn_bias += pos_bias
         if mask is not None:
             assert mask.ndim in [2, 3]
-            mask = mask.view(b, 1, 1,
-                             -1) if mask.ndim == 2 else mask.unsqueeze(1)
+            mask = mask.view(b, 1, 1, -1) if mask.ndim == 2 else mask.unsqueeze(1)
             attn_bias.masked_fill_(mask == 0, torch.finfo(x.dtype).min)
 
         # compute attention (T5 does not use scaling)
-        attn = torch.einsum('binc,bjnc->bnij', q, k) + attn_bias
+        attn = torch.einsum("binc,bjnc->bnij", q, k) + attn_bias
         attn = F.softmax(attn.float(), dim=-1).type_as(attn)
-        x = torch.einsum('bnij,bjnc->binc', attn, v)
+        x = torch.einsum("bnij,bjnc->binc", attn, v)
 
         # output
         x = x.reshape(b, -1, n * c)
@@ -122,7 +125,6 @@ def forward(self, x, context=None, mask=None, pos_bias=None):
 
 
 class T5FeedForward(nn.Module):
-
     def __init__(self, dim, dim_ffn, dropout=0.1):
         super(T5FeedForward, self).__init__()
         self.dim = dim
@@ -143,15 +145,16 @@ def forward(self, x):
 
 
 class T5SelfAttention(nn.Module):
-
-    def __init__(self,
-                 dim,
-                 dim_attn,
-                 dim_ffn,
-                 num_heads,
-                 num_buckets,
-                 shared_pos=True,
-                 dropout=0.1):
+    def __init__(
+        self,
+        dim,
+        dim_attn,
+        dim_ffn,
+        num_heads,
+        num_buckets,
+        shared_pos=True,
+        dropout=0.1,
+    ):
         super(T5SelfAttention, self).__init__()
         self.dim = dim
         self.dim_attn = dim_attn
@@ -165,27 +168,35 @@ def __init__(self,
         self.attn = T5Attention(dim, dim_attn, num_heads, dropout)
         self.norm2 = T5LayerNorm(dim)
         self.ffn = T5FeedForward(dim, dim_ffn, dropout)
-        self.pos_embedding = None if shared_pos else T5RelativeEmbedding(
-            num_buckets, num_heads, bidirectional=True)
+        self.pos_embedding = (
+            None
+            if shared_pos
+            else T5RelativeEmbedding(num_buckets, num_heads, bidirectional=True)
+        )
 
     def forward(self, x, mask=None, pos_bias=None):
-        e = pos_bias if self.shared_pos else self.pos_embedding(
-            x.size(1), x.size(1))
+        if self.shared_pos:
+            e = pos_bias
+        else:
+            pos_embedding = self.pos_embedding
+            assert pos_embedding is not None
+            e = pos_embedding(x.size(1), x.size(1))
         x = fp16_clamp(x + self.attn(self.norm1(x), mask=mask, pos_bias=e))
         x = fp16_clamp(x + self.ffn(self.norm2(x)))
         return x
 
 
 class T5CrossAttention(nn.Module):
-
-    def __init__(self,
-                 dim,
-                 dim_attn,
-                 dim_ffn,
-                 num_heads,
-                 num_buckets,
-                 shared_pos=True,
-                 dropout=0.1):
+    def __init__(
+        self,
+        dim,
+        dim_attn,
+        dim_ffn,
+        num_heads,
+        num_buckets,
+        shared_pos=True,
+        dropout=0.1,
+    ):
         super(T5CrossAttention, self).__init__()
         self.dim = dim
         self.dim_attn = dim_attn
@@ -201,26 +212,31 @@ def __init__(self,
         self.cross_attn = T5Attention(dim, dim_attn, num_heads, dropout)
         self.norm3 = T5LayerNorm(dim)
         self.ffn = T5FeedForward(dim, dim_ffn, dropout)
-        self.pos_embedding = None if shared_pos else T5RelativeEmbedding(
-            num_buckets, num_heads, bidirectional=False)
-
-    def forward(self,
-                x,
-                mask=None,
-                encoder_states=None,
-                encoder_mask=None,
-                pos_bias=None):
-        e = pos_bias if self.shared_pos else self.pos_embedding(
-            x.size(1), x.size(1))
+        self.pos_embedding = (
+            None
+            if shared_pos
+            else T5RelativeEmbedding(num_buckets, num_heads, bidirectional=False)
+        )
+
+    def forward(
+        self, x, mask=None, encoder_states=None, encoder_mask=None, pos_bias=None
+    ):
+        if self.shared_pos:
+            e = pos_bias
+        else:
+            pos_embedding = self.pos_embedding
+            assert pos_embedding is not None
+            e = pos_embedding(x.size(1), x.size(1))
         x = fp16_clamp(x + self.self_attn(self.norm1(x), mask=mask, pos_bias=e))
-        x = fp16_clamp(x + self.cross_attn(
-            self.norm2(x), context=encoder_states, mask=encoder_mask))
+        x = fp16_clamp(
+            x
+            + self.cross_attn(self.norm2(x), context=encoder_states, mask=encoder_mask)
+        )
         x = fp16_clamp(x + self.ffn(self.norm3(x)))
         return x
 
 
 class T5RelativeEmbedding(nn.Module):
-
     def __init__(self, num_buckets, num_heads, bidirectional, max_dist=128):
         super(T5RelativeEmbedding, self).__init__()
         self.num_buckets = num_buckets
@@ -235,12 +251,12 @@ def forward(self, lq, lk):
         device = self.embedding.weight.device
         # rel_pos = torch.arange(lk).unsqueeze(0).to(device) - \
         #     torch.arange(lq).unsqueeze(1).to(device)
-        rel_pos = torch.arange(lk, device=device).unsqueeze(0) - \
-            torch.arange(lq, device=device).unsqueeze(1)
+        rel_pos = torch.arange(lk, device=device).unsqueeze(0) - torch.arange(
+            lq, device=device
+        ).unsqueeze(1)
         rel_pos = self._relative_position_bucket(rel_pos)
         rel_pos_embeds = self.embedding(rel_pos)
-        rel_pos_embeds = rel_pos_embeds.permute(2, 0, 1).unsqueeze(
-            0)  # [1, N, Lq, Lk]
+        rel_pos_embeds = rel_pos_embeds.permute(2, 0, 1).unsqueeze(0)  # [1, N, Lq, Lk]
         return rel_pos_embeds.contiguous()
 
     def _relative_position_bucket(self, rel_pos):
@@ -256,27 +272,34 @@ def _relative_position_bucket(self, rel_pos):
 
         # embeddings for small and large positions
         max_exact = num_buckets // 2
-        rel_pos_large = max_exact + (torch.log(rel_pos.float() / max_exact) /
-                                     math.log(self.max_dist / max_exact) *
-                                     (num_buckets - max_exact)).long()
+        rel_pos_large = (
+            max_exact
+            + (
+                torch.log(rel_pos.float() / max_exact)
+                / math.log(self.max_dist / max_exact)
+                * (num_buckets - max_exact)
+            ).long()
+        )
         rel_pos_large = torch.min(
-            rel_pos_large, torch.full_like(rel_pos_large, num_buckets - 1))
+            rel_pos_large, torch.full_like(rel_pos_large, num_buckets - 1)
+        )
         rel_buckets += torch.where(rel_pos < max_exact, rel_pos, rel_pos_large)
         return rel_buckets
 
 
 class T5Encoder(nn.Module):
-
-    def __init__(self,
-                 vocab,
-                 dim,
-                 dim_attn,
-                 dim_ffn,
-                 num_heads,
-                 num_layers,
-                 num_buckets,
-                 shared_pos=True,
-                 dropout=0.1):
+    def __init__(
+        self,
+        vocab,
+        dim,
+        dim_attn,
+        dim_ffn,
+        num_heads,
+        num_layers,
+        num_buckets,
+        shared_pos=True,
+        dropout=0.1,
+    ):
         super(T5Encoder, self).__init__()
         self.dim = dim
         self.dim_attn = dim_attn
@@ -287,15 +310,23 @@ def __init__(self,
         self.shared_pos = shared_pos
 
         # layers
-        self.token_embedding = vocab if isinstance(vocab, nn.Embedding) \
-            else nn.Embedding(vocab, dim)
-        self.pos_embedding = T5RelativeEmbedding(
-            num_buckets, num_heads, bidirectional=True) if shared_pos else None
+        self.token_embedding = (
+            vocab if isinstance(vocab, nn.Embedding) else nn.Embedding(vocab, dim)
+        )
+        self.pos_embedding = (
+            T5RelativeEmbedding(num_buckets, num_heads, bidirectional=True)
+            if shared_pos
+            else None
+        )
         self.dropout = nn.Dropout(dropout)
-        self.blocks = nn.ModuleList([
-            T5SelfAttention(dim, dim_attn, dim_ffn, num_heads, num_buckets,
-                            shared_pos, dropout) for _ in range(num_layers)
-        ])
+        self.blocks = nn.ModuleList(
+            [
+                T5SelfAttention(
+                    dim, dim_attn, dim_ffn, num_heads, num_buckets, shared_pos, dropout
+                )
+                for _ in range(num_layers)
+            ]
+        )
         self.norm = T5LayerNorm(dim)
 
         # initialize weights
@@ -304,8 +335,11 @@ def __init__(self,
     def forward(self, ids, mask=None):
         x = self.token_embedding(ids)
         x = self.dropout(x)
-        e = self.pos_embedding(x.size(1),
-                               x.size(1)) if self.shared_pos else None
+        if self.shared_pos:
+            assert self.pos_embedding is not None
+            e = self.pos_embedding(x.size(1), x.size(1))
+        else:
+            e = None
         for block in self.blocks:
             x = block(x, mask, pos_bias=e)
         x = self.norm(x)
@@ -314,17 +348,18 @@ def forward(self, ids, mask=None):
 
 
 class T5Decoder(nn.Module):
-
-    def __init__(self,
-                 vocab,
-                 dim,
-                 dim_attn,
-                 dim_ffn,
-                 num_heads,
-                 num_layers,
-                 num_buckets,
-                 shared_pos=True,
-                 dropout=0.1):
+    def __init__(
+        self,
+        vocab,
+        dim,
+        dim_attn,
+        dim_ffn,
+        num_heads,
+        num_layers,
+        num_buckets,
+        shared_pos=True,
+        dropout=0.1,
+    ):
         super(T5Decoder, self).__init__()
         self.dim = dim
         self.dim_attn = dim_attn
@@ -335,15 +370,23 @@ def __init__(self,
         self.shared_pos = shared_pos
 
         # layers
-        self.token_embedding = vocab if isinstance(vocab, nn.Embedding) \
-            else nn.Embedding(vocab, dim)
-        self.pos_embedding = T5RelativeEmbedding(
-            num_buckets, num_heads, bidirectional=False) if shared_pos else None
+        self.token_embedding = (
+            vocab if isinstance(vocab, nn.Embedding) else nn.Embedding(vocab, dim)
+        )
+        self.pos_embedding = (
+            T5RelativeEmbedding(num_buckets, num_heads, bidirectional=False)
+            if shared_pos
+            else None
+        )
         self.dropout = nn.Dropout(dropout)
-        self.blocks = nn.ModuleList([
-            T5CrossAttention(dim, dim_attn, dim_ffn, num_heads, num_buckets,
-                             shared_pos, dropout) for _ in range(num_layers)
-        ])
+        self.blocks = nn.ModuleList(
+            [
+                T5CrossAttention(
+                    dim, dim_attn, dim_ffn, num_heads, num_buckets, shared_pos, dropout
+                )
+                for _ in range(num_layers)
+            ]
+        )
         self.norm = T5LayerNorm(dim)
 
         # initialize weights
@@ -361,8 +404,11 @@ def forward(self, ids, mask=None, encoder_states=None, encoder_mask=None):
         # layers
         x = self.token_embedding(ids)
         x = self.dropout(x)
-        e = self.pos_embedding(x.size(1),
-                               x.size(1)) if self.shared_pos else None
+        if self.shared_pos:
+            assert self.pos_embedding is not None
+            e = self.pos_embedding(x.size(1), x.size(1))
+        else:
+            e = None
         for block in self.blocks:
             x = block(x, mask, encoder_states, encoder_mask, pos_bias=e)
         x = self.norm(x)
@@ -371,18 +417,19 @@ def forward(self, ids, mask=None, encoder_states=None, encoder_mask=None):
 
 
 class T5Model(nn.Module):
-
-    def __init__(self,
-                 vocab_size,
-                 dim,
-                 dim_attn,
-                 dim_ffn,
-                 num_heads,
-                 encoder_layers,
-                 decoder_layers,
-                 num_buckets,
-                 shared_pos=True,
-                 dropout=0.1):
+    def __init__(
+        self,
+        vocab_size,
+        dim,
+        dim_attn,
+        dim_ffn,
+        num_heads,
+        encoder_layers,
+        decoder_layers,
+        num_buckets,
+        shared_pos=True,
+        dropout=0.1,
+    ):
         super(T5Model, self).__init__()
         self.vocab_size = vocab_size
         self.dim = dim
@@ -395,12 +442,28 @@ def __init__(self,
 
         # layers
         self.token_embedding = nn.Embedding(vocab_size, dim)
-        self.encoder = T5Encoder(self.token_embedding, dim, dim_attn, dim_ffn,
-                                 num_heads, encoder_layers, num_buckets,
-                                 shared_pos, dropout)
-        self.decoder = T5Decoder(self.token_embedding, dim, dim_attn, dim_ffn,
-                                 num_heads, decoder_layers, num_buckets,
-                                 shared_pos, dropout)
+        self.encoder = T5Encoder(
+            self.token_embedding,
+            dim,
+            dim_attn,
+            dim_ffn,
+            num_heads,
+            encoder_layers,
+            num_buckets,
+            shared_pos,
+            dropout,
+        )
+        self.decoder = T5Decoder(
+            self.token_embedding,
+            dim,
+            dim_attn,
+            dim_ffn,
+            num_heads,
+            decoder_layers,
+            num_buckets,
+            shared_pos,
+            dropout,
+        )
         self.head = nn.Linear(dim, vocab_size, bias=False)
 
         # initialize weights
@@ -412,45 +475,116 @@ def forward(self, encoder_ids, encoder_mask, decoder_ids, decoder_mask):
         x = self.head(x)
         return x
 
-class T5EncoderModel:
 
+def _t5(
+    name,
+    encoder_only=False,
+    decoder_only=False,
+    return_tokenizer=False,
+    tokenizer_kwargs={},
+    dtype=torch.float32,
+    device="cpu",
+    **kwargs,
+):
+    # sanity check
+    assert not (encoder_only and decoder_only)
+
+    # params
+    if encoder_only:
+        model_cls = T5Encoder
+        kwargs["vocab"] = kwargs.pop("vocab_size")
+        kwargs["num_layers"] = kwargs.pop("encoder_layers")
+        _ = kwargs.pop("decoder_layers")
+    elif decoder_only:
+        model_cls = T5Decoder
+        kwargs["vocab"] = kwargs.pop("vocab_size")
+        kwargs["num_layers"] = kwargs.pop("decoder_layers")
+        _ = kwargs.pop("encoder_layers")
+    else:
+        model_cls = T5Model
+
+    # init model
+    with torch.device(device):
+        model = model_cls(**kwargs)
+
+    # set device
+    model = model.to(dtype=dtype, device=device)
+
+    # init tokenizer
+    if return_tokenizer:
+        from .tokenizers import HuggingfaceTokenizer
+
+        tokenizer = HuggingfaceTokenizer(f"google/{name}", **tokenizer_kwargs)
+        return model, tokenizer
+    else:
+        return model
+
+
+def umt5_xxl(**kwargs):
+    cfg = dict(
+        vocab_size=256384,
+        dim=4096,
+        dim_attn=4096,
+        dim_ffn=10240,
+        num_heads=64,
+        encoder_layers=24,
+        decoder_layers=24,
+        num_buckets=32,
+        shared_pos=False,
+        dropout=0.1,
+    )
+    cfg.update(**kwargs)
+    return _t5("umt5-xxl", **cfg)
+
+
+class T5EncoderModel:
     def __init__(
         self,
         text_len,
         dtype=torch.bfloat16,
-        device=torch.cuda.current_device(),
+        device=None,
         checkpoint_path=None,
         tokenizer_path=None,
         shard_fn=None,
-        model:T5Encoder=None
     ):
+        from videotuna.utils.device_utils import resolve_inference_device
+
+        if device is None:
+            device = resolve_inference_device()
+        elif not isinstance(device, torch.device):
+            device = resolve_inference_device(device)
+
         self.text_len = text_len
         self.dtype = dtype
         self.device = device
-        self.tokenizer_path = tokenizer_path
         self.checkpoint_path = checkpoint_path
-        self.shard_fn = shard_fn
-        self.model = model.to(dtype=self.dtype)
-        self.tokenizer = HuggingfaceTokenizer(
-            name=tokenizer_path, seq_len=text_len, clean='whitespace')
+        self.tokenizer_path = tokenizer_path
 
+        # init model
+        model = (
+            umt5_xxl(
+                encoder_only=True, return_tokenizer=False, dtype=dtype, device=device
+            )
+            .eval()
+            .requires_grad_(False)
+        )
+        logging.info(f"loading {checkpoint_path}")
+        assert checkpoint_path is not None
+        model.load_state_dict(torch.load(checkpoint_path, map_location="cpu"))
+        self.model = model
+        if shard_fn is not None:
+            self.model = shard_fn(self.model, sync_module_states=False)
+        else:
+            self.model.to(self.device)
+        # init tokenizer
+        self.tokenizer = HuggingfaceTokenizer(
+            name=tokenizer_path, seq_len=text_len, clean="whitespace"
+        )
 
     def __call__(self, texts, device):
-        ids, mask = self.tokenizer(
-            texts, return_mask=True, add_special_tokens=True)
+        ids, mask = self.tokenizer(texts, return_mask=True, add_special_tokens=True)
         ids = ids.to(device)
         mask = mask.to(device)
         seq_lens = mask.gt(0).sum(dim=1).long()
         context = self.model(ids, mask)
         return [u[:v] for u, v in zip(context, seq_lens)]
-
-    def load_weight(self):
-        logger.info(f'loading T5EncoderModel from ckpt_path: {self.checkpoint_path}')
-        self.model.load_state_dict(torch.load(self.checkpoint_path, map_location='cpu'))
-        logger.info(f'loading T5EncoderModel from ckpt_path: {self.checkpoint_path} finished')
-
-        if self.shard_fn is not None:
-            logger.info(f'shard T5EncoderModel')
-            self.model = self.shard_fn(self.model, sync_module_states=False)
-        else:
-            self.model = self.model.to(self.device).to(self.dtype)
diff --git a/videotuna/models/wan/wan/modules/tokenizers.py b/videotuna/models/wan/wan/modules/tokenizers.py
index 121e591c..6702a8f2 100644
--- a/videotuna/models/wan/wan/modules/tokenizers.py
+++ b/videotuna/models/wan/wan/modules/tokenizers.py
@@ -6,7 +6,7 @@
 import regex as re
 from transformers import AutoTokenizer
 
-__all__ = ['HuggingfaceTokenizer']
+__all__ = ["HuggingfaceTokenizer"]
 
 
 def basic_clean(text):
@@ -16,28 +16,28 @@ def basic_clean(text):
 
 
 def whitespace_clean(text):
-    text = re.sub(r'\s+', ' ', text)
+    text = re.sub(r"\s+", " ", text)
     text = text.strip()
     return text
 
 
 def canonicalize(text, keep_punctuation_exact_string=None):
-    text = text.replace('_', ' ')
+    text = text.replace("_", " ")
     if keep_punctuation_exact_string:
         text = keep_punctuation_exact_string.join(
-            part.translate(str.maketrans('', '', string.punctuation))
-            for part in text.split(keep_punctuation_exact_string))
+            part.translate(str.maketrans("", "", string.punctuation))
+            for part in text.split(keep_punctuation_exact_string)
+        )
     else:
-        text = text.translate(str.maketrans('', '', string.punctuation))
+        text = text.translate(str.maketrans("", "", string.punctuation))
     text = text.lower()
-    text = re.sub(r'\s+', ' ', text)
+    text = re.sub(r"\s+", " ", text)
     return text.strip()
 
 
 class HuggingfaceTokenizer:
-
     def __init__(self, name, seq_len=None, clean=None, **kwargs):
-        assert clean in (None, 'whitespace', 'lower', 'canonicalize')
+        assert clean in (None, "whitespace", "lower", "canonicalize")
         self.name = name
         self.seq_len = seq_len
         self.clean = clean
@@ -47,16 +47,18 @@ def __init__(self, name, seq_len=None, clean=None, **kwargs):
         self.vocab_size = self.tokenizer.vocab_size
 
     def __call__(self, sequence, **kwargs):
-        return_mask = kwargs.pop('return_mask', False)
+        return_mask = kwargs.pop("return_mask", False)
 
         # arguments
-        _kwargs = {'return_tensors': 'pt'}
+        _kwargs = {"return_tensors": "pt"}
         if self.seq_len is not None:
-            _kwargs.update({
-                'padding': 'max_length',
-                'truncation': True,
-                'max_length': self.seq_len
-            })
+            _kwargs.update(
+                {
+                    "padding": "max_length",
+                    "truncation": True,
+                    "max_length": self.seq_len,
+                }
+            )
         _kwargs.update(**kwargs)
 
         # tokenization
@@ -73,10 +75,10 @@ def __call__(self, sequence, **kwargs):
             return ids.input_ids
 
     def _clean(self, text):
-        if self.clean == 'whitespace':
+        if self.clean == "whitespace":
             text = whitespace_clean(basic_clean(text))
-        elif self.clean == 'lower':
+        elif self.clean == "lower":
             text = whitespace_clean(basic_clean(text)).lower()
-        elif self.clean == 'canonicalize':
+        elif self.clean == "canonicalize":
             text = canonicalize(basic_clean(text))
         return text
diff --git a/videotuna/models/wan/wan/modules/vae.py b/videotuna/models/wan/wan/modules/vae.py
index ea328265..c65fd075 100644
--- a/videotuna/models/wan/wan/modules/vae.py
+++ b/videotuna/models/wan/wan/modules/vae.py
@@ -1,639 +1,6 @@
 # Copyright 2024-2025 The Alibaba Wan Team Authors. All rights reserved.
-from loguru import logger
+"""Shim so YAML configs can target ``modules.vae.WanVAE_`` (implemented in vae2_1)."""
 
-import torch
-import torch.cuda.amp as amp
-import torch.nn as nn
-import torch.nn.functional as F
-from einops import rearrange
-from tqdm import tqdm
+from .vae2_1 import Wan2_1_VAE, WanVAE_
 
-__all__ = [
-    'WanVAE',
-]
-
-CACHE_T = 2
-
-
-class CausalConv3d(nn.Conv3d):
-    """
-    Causal 3d convolusion.
-    """
-
-    def __init__(self, *args, **kwargs):
-        super().__init__(*args, **kwargs)
-        self._padding = (self.padding[2], self.padding[2], self.padding[1],
-                         self.padding[1], 2 * self.padding[0], 0)
-        self.padding = (0, 0, 0)
-
-    def forward(self, x, cache_x=None):
-        padding = list(self._padding)
-        if cache_x is not None and self._padding[4] > 0:
-            cache_x = cache_x.to(x.device)
-            x = torch.cat([cache_x, x], dim=2)
-            padding[4] -= cache_x.shape[2]
-        x = F.pad(x, padding)
-
-        return super().forward(x)
-
-
-class RMS_norm(nn.Module):
-
-    def __init__(self, dim, channel_first=True, images=True, bias=False):
-        super().__init__()
-        broadcastable_dims = (1, 1, 1) if not images else (1, 1)
-        shape = (dim, *broadcastable_dims) if channel_first else (dim,)
-
-        self.channel_first = channel_first
-        self.scale = dim**0.5
-        self.gamma = nn.Parameter(torch.ones(shape))
-        self.bias = nn.Parameter(torch.zeros(shape)) if bias else 0.
-
-    def forward(self, x):
-        return F.normalize(
-            x, dim=(1 if self.channel_first else
-                    -1)) * self.scale * self.gamma + self.bias
-
-
-class Upsample(nn.Upsample):
-
-    def forward(self, x):
-        """
-        Fix bfloat16 support for nearest neighbor interpolation.
-        """
-        return super().forward(x.float()).type_as(x)
-
-
-class Resample(nn.Module):
-
-    def __init__(self, dim, mode):
-        assert mode in ('none', 'upsample2d', 'upsample3d', 'downsample2d',
-                        'downsample3d')
-        super().__init__()
-        self.dim = dim
-        self.mode = mode
-
-        # layers
-        if mode == 'upsample2d':
-            self.resample = nn.Sequential(
-                Upsample(scale_factor=(2., 2.), mode='nearest-exact'),
-                nn.Conv2d(dim, dim // 2, 3, padding=1))
-        elif mode == 'upsample3d':
-            self.resample = nn.Sequential(
-                Upsample(scale_factor=(2., 2.), mode='nearest-exact'),
-                nn.Conv2d(dim, dim // 2, 3, padding=1))
-            self.time_conv = CausalConv3d(
-                dim, dim * 2, (3, 1, 1), padding=(1, 0, 0))
-
-        elif mode == 'downsample2d':
-            self.resample = nn.Sequential(
-                nn.ZeroPad2d((0, 1, 0, 1)),
-                nn.Conv2d(dim, dim, 3, stride=(2, 2)))
-        elif mode == 'downsample3d':
-            self.resample = nn.Sequential(
-                nn.ZeroPad2d((0, 1, 0, 1)),
-                nn.Conv2d(dim, dim, 3, stride=(2, 2)))
-            self.time_conv = CausalConv3d(
-                dim, dim, (3, 1, 1), stride=(2, 1, 1), padding=(0, 0, 0))
-
-        else:
-            self.resample = nn.Identity()
-
-    def forward(self, x, feat_cache=None, feat_idx=[0]):
-        b, c, t, h, w = x.size()
-        if self.mode == 'upsample3d':
-            if feat_cache is not None:
-                idx = feat_idx[0]
-                if feat_cache[idx] is None:
-                    feat_cache[idx] = 'Rep'
-                    feat_idx[0] += 1
-                else:
-
-                    cache_x = x[:, :, -CACHE_T:, :, :].clone()
-                    if cache_x.shape[2] < 2 and feat_cache[
-                            idx] is not None and feat_cache[idx] != 'Rep':
-                        # cache last frame of last two chunk
-                        cache_x = torch.cat([
-                            feat_cache[idx][:, :, -1, :, :].unsqueeze(2).to(
-                                cache_x.device), cache_x
-                        ],
-                                            dim=2)
-                    if cache_x.shape[2] < 2 and feat_cache[
-                            idx] is not None and feat_cache[idx] == 'Rep':
-                        cache_x = torch.cat([
-                            torch.zeros_like(cache_x).to(cache_x.device),
-                            cache_x
-                        ],
-                                            dim=2)
-                    if feat_cache[idx] == 'Rep':
-                        x = self.time_conv(x)
-                    else:
-                        x = self.time_conv(x, feat_cache[idx])
-                    feat_cache[idx] = cache_x
-                    feat_idx[0] += 1
-
-                    x = x.reshape(b, 2, c, t, h, w)
-                    x = torch.stack((x[:, 0, :, :, :, :], x[:, 1, :, :, :, :]),
-                                    3)
-                    x = x.reshape(b, c, t * 2, h, w)
-        t = x.shape[2]
-        x = rearrange(x, 'b c t h w -> (b t) c h w')
-        x = self.resample(x)
-        x = rearrange(x, '(b t) c h w -> b c t h w', t=t)
-
-        if self.mode == 'downsample3d':
-            if feat_cache is not None:
-                idx = feat_idx[0]
-                if feat_cache[idx] is None:
-                    feat_cache[idx] = x.clone()
-                    feat_idx[0] += 1
-                else:
-
-                    cache_x = x[:, :, -1:, :, :].clone()
-                    # if cache_x.shape[2] < 2 and feat_cache[idx] is not None and feat_cache[idx]!='Rep':
-                    #     # cache last frame of last two chunk
-                    #     cache_x = torch.cat([feat_cache[idx][:, :, -1, :, :].unsqueeze(2).to(cache_x.device), cache_x], dim=2)
-
-                    x = self.time_conv(
-                        torch.cat([feat_cache[idx][:, :, -1:, :, :], x], 2))
-                    feat_cache[idx] = cache_x
-                    feat_idx[0] += 1
-        return x, feat_cache, feat_idx
-
-    def init_weight(self, conv):
-        conv_weight = conv.weight
-        nn.init.zeros_(conv_weight)
-        c1, c2, t, h, w = conv_weight.size()
-        one_matrix = torch.eye(c1, c2)
-        init_matrix = one_matrix
-        nn.init.zeros_(conv_weight)
-        #conv_weight.data[:,:,-1,1,1] = init_matrix * 0.5
-        conv_weight.data[:, :, 1, 0, 0] = init_matrix  #* 0.5
-        conv.weight.data.copy_(conv_weight)
-        nn.init.zeros_(conv.bias.data)
-
-    def init_weight2(self, conv):
-        conv_weight = conv.weight.data
-        nn.init.zeros_(conv_weight)
-        c1, c2, t, h, w = conv_weight.size()
-        init_matrix = torch.eye(c1 // 2, c2)
-        #init_matrix = repeat(init_matrix, 'o ... -> (o 2) ...').permute(1,0,2).contiguous().reshape(c1,c2)
-        conv_weight[:c1 // 2, :, -1, 0, 0] = init_matrix
-        conv_weight[c1 // 2:, :, -1, 0, 0] = init_matrix
-        conv.weight.data.copy_(conv_weight)
-        nn.init.zeros_(conv.bias.data)
-
-
-class ResidualBlock(nn.Module):
-
-    def __init__(self, in_dim, out_dim, dropout=0.0):
-        super().__init__()
-        self.in_dim = in_dim
-        self.out_dim = out_dim
-
-        # layers
-        self.residual = nn.Sequential(
-            RMS_norm(in_dim, images=False), nn.SiLU(),
-            CausalConv3d(in_dim, out_dim, 3, padding=1),
-            RMS_norm(out_dim, images=False), nn.SiLU(), nn.Dropout(dropout),
-            CausalConv3d(out_dim, out_dim, 3, padding=1))
-        self.shortcut = CausalConv3d(in_dim, out_dim, 1) \
-            if in_dim != out_dim else nn.Identity()
-
-    def forward(self, x, feat_cache=None, feat_idx=[0]):
-        h = self.shortcut(x)
-        for layer in self.residual:
-            if isinstance(layer, CausalConv3d) and feat_cache is not None:
-                idx = feat_idx[0]
-                cache_x = x[:, :, -CACHE_T:, :, :].clone()
-                if cache_x.shape[2] < 2 and feat_cache[idx] is not None:
-                    # cache last frame of last two chunk
-                    cache_x = torch.cat([
-                        feat_cache[idx][:, :, -1, :, :].unsqueeze(2).to(
-                            cache_x.device), cache_x
-                    ],
-                                        dim=2)
-                x = layer(x, feat_cache[idx])
-                feat_cache[idx] = cache_x
-                feat_idx[0] += 1
-            else:
-                x = layer(x)
-        return x + h, feat_cache, feat_idx
-
-
-class AttentionBlock(nn.Module):
-    """
-    Causal self-attention with a single head.
-    """
-
-    def __init__(self, dim):
-        super().__init__()
-        self.dim = dim
-
-        # layers
-        self.norm = RMS_norm(dim)
-        self.to_qkv = nn.Conv2d(dim, dim * 3, 1)
-        self.proj = nn.Conv2d(dim, dim, 1)
-
-        # zero out the last layer params
-        nn.init.zeros_(self.proj.weight)
-
-    def forward(self, x):
-        identity = x
-        b, c, t, h, w = x.size()
-        x = rearrange(x, 'b c t h w -> (b t) c h w')
-        x = self.norm(x)
-        # compute query, key, value
-        q, k, v = self.to_qkv(x).reshape(b * t, 1, c * 3,
-                                         -1).permute(0, 1, 3,
-                                                     2).contiguous().chunk(
-                                                         3, dim=-1)
-
-        # apply attention
-        x = F.scaled_dot_product_attention(
-            q,
-            k,
-            v,
-        )
-        x = x.squeeze(1).permute(0, 2, 1).reshape(b * t, c, h, w)
-
-        # output
-        x = self.proj(x)
-        x = rearrange(x, '(b t) c h w-> b c t h w', t=t)
-        return x + identity
-
-
-class Encoder3d(nn.Module):
-
-    def __init__(self,
-                 dim=128,
-                 z_dim=4,
-                 dim_mult=[1, 2, 4, 4],
-                 num_res_blocks=2,
-                 attn_scales=[],
-                 temperal_downsample=[True, True, False],
-                 dropout=0.0):
-        super().__init__()
-        self.dim = dim
-        self.z_dim = z_dim
-        self.dim_mult = dim_mult
-        self.num_res_blocks = num_res_blocks
-        self.attn_scales = attn_scales
-        self.temperal_downsample = temperal_downsample
-
-        # dimensions
-        dims = [dim * u for u in [1] + dim_mult]
-        scale = 1.0
-
-        # init block
-        self.conv1 = CausalConv3d(3, dims[0], 3, padding=1)
-
-        # downsample blocks
-        downsamples = []
-        for i, (in_dim, out_dim) in enumerate(zip(dims[:-1], dims[1:])):
-            # residual (+attention) blocks
-            for _ in range(num_res_blocks):
-                downsamples.append(ResidualBlock(in_dim, out_dim, dropout))
-                if scale in attn_scales:
-                    downsamples.append(AttentionBlock(out_dim))
-                in_dim = out_dim
-
-            # downsample block
-            if i != len(dim_mult) - 1:
-                mode = 'downsample3d' if temperal_downsample[
-                    i] else 'downsample2d'
-                downsamples.append(Resample(out_dim, mode=mode))
-                scale /= 2.0
-        self.downsamples = nn.Sequential(*downsamples)
-
-        # middle blocks
-        self.middle = nn.Sequential(
-            ResidualBlock(out_dim, out_dim, dropout), AttentionBlock(out_dim),
-            ResidualBlock(out_dim, out_dim, dropout))
-
-        # output blocks
-        self.head = nn.Sequential(
-            RMS_norm(out_dim, images=False), nn.SiLU(),
-            CausalConv3d(out_dim, z_dim, 3, padding=1))
-
-    def forward(self, x, feat_cache=None, feat_idx=[0]):
-        if feat_cache is not None:
-            idx = feat_idx[0]
-            cache_x = x[:, :, -CACHE_T:, :, :].clone()
-            if cache_x.shape[2] < 2 and feat_cache[idx] is not None:
-                # cache last frame of last two chunk
-                cache_x = torch.cat([
-                    feat_cache[idx][:, :, -1, :, :].unsqueeze(2).to(
-                        cache_x.device), cache_x
-                ],
-                                    dim=2)
-            x = self.conv1(x, feat_cache[idx])
-            feat_cache[idx] = cache_x
-            feat_idx[0] += 1
-        else:
-            x = self.conv1(x)
-
-        ## downsamples
-        for layer in self.downsamples:
-            if feat_cache is not None:
-                x, feat_cache, feat_idx = layer(x, feat_cache, feat_idx)
-            else:
-                x = layer(x)
-
-        ## middle
-        for layer in self.middle:
-            if isinstance(layer, ResidualBlock) and feat_cache is not None:
-                x, feat_cache, feat_idx = layer(x, feat_cache, feat_idx)
-            else:
-                x = layer(x)
-
-        ## head
-        for layer in self.head:
-            if isinstance(layer, CausalConv3d) and feat_cache is not None:
-                idx = feat_idx[0]
-                cache_x = x[:, :, -CACHE_T:, :, :].clone()
-                if cache_x.shape[2] < 2 and feat_cache[idx] is not None:
-                    # cache last frame of last two chunk
-                    cache_x = torch.cat([
-                        feat_cache[idx][:, :, -1, :, :].unsqueeze(2).to(
-                            cache_x.device), cache_x
-                    ],
-                                        dim=2)
-                x = layer(x, feat_cache[idx])
-                feat_cache[idx] = cache_x
-                feat_idx[0] += 1
-            else:
-                x = layer(x)
-        return x, feat_cache, feat_idx
-
-
-class Decoder3d(nn.Module):
-
-    def __init__(self,
-                 dim=128,
-                 z_dim=4,
-                 dim_mult=[1, 2, 4, 4],
-                 num_res_blocks=2,
-                 attn_scales=[],
-                 temperal_upsample=[False, True, True],
-                 dropout=0.0):
-        super().__init__()
-        self.dim = dim
-        self.z_dim = z_dim
-        self.dim_mult = dim_mult
-        self.num_res_blocks = num_res_blocks
-        self.attn_scales = attn_scales
-        self.temperal_upsample = temperal_upsample
-
-        # dimensions
-        dims = [dim * u for u in [dim_mult[-1]] + dim_mult[::-1]]
-        scale = 1.0 / 2**(len(dim_mult) - 2)
-
-        # init block
-        self.conv1 = CausalConv3d(z_dim, dims[0], 3, padding=1)
-
-        # middle blocks
-        self.middle = nn.Sequential(
-            ResidualBlock(dims[0], dims[0], dropout), AttentionBlock(dims[0]),
-            ResidualBlock(dims[0], dims[0], dropout))
-
-        # upsample blocks
-        upsamples = []
-        for i, (in_dim, out_dim) in enumerate(zip(dims[:-1], dims[1:])):
-            # residual (+attention) blocks
-            if i == 1 or i == 2 or i == 3:
-                in_dim = in_dim // 2
-            for _ in range(num_res_blocks + 1):
-                upsamples.append(ResidualBlock(in_dim, out_dim, dropout))
-                if scale in attn_scales:
-                    upsamples.append(AttentionBlock(out_dim))
-                in_dim = out_dim
-
-            # upsample block
-            if i != len(dim_mult) - 1:
-                mode = 'upsample3d' if temperal_upsample[i] else 'upsample2d'
-                upsamples.append(Resample(out_dim, mode=mode))
-                scale *= 2.0
-        self.upsamples = nn.Sequential(*upsamples)
-
-        # output blocks
-        self.head = nn.Sequential(
-            RMS_norm(out_dim, images=False), nn.SiLU(),
-            CausalConv3d(out_dim, 3, 3, padding=1))
-
-    def forward(self, x, feat_cache=None, feat_idx=[0]):
-        ## conv1
-        if feat_cache is not None:
-            idx = feat_idx[0]
-            cache_x = x[:, :, -CACHE_T:, :, :].clone()
-            if cache_x.shape[2] < 2 and feat_cache[idx] is not None:
-                # cache last frame of last two chunk
-                cache_x = torch.cat([
-                    feat_cache[idx][:, :, -1, :, :].unsqueeze(2).to(
-                        cache_x.device), cache_x
-                ],
-                                    dim=2)
-            x = self.conv1(x, feat_cache[idx])
-            feat_cache[idx] = cache_x
-            feat_idx[0] += 1
-        else:
-            x = self.conv1(x)
-
-        ## middle
-        for layer in self.middle:
-            if isinstance(layer, ResidualBlock) and feat_cache is not None:
-                x, feat_cache, feat_idx = layer(x, feat_cache, feat_idx)
-            else:
-                x = layer(x)
-
-        ## upsamples
-        for layer in self.upsamples:
-            if feat_cache is not None:
-                x, feat_cache, feat_idx = layer(x, feat_cache, feat_idx)
-            else:
-                x = layer(x)
-
-        ## head
-        for layer in self.head:
-            if isinstance(layer, CausalConv3d) and feat_cache is not None:
-                idx = feat_idx[0]
-                cache_x = x[:, :, -CACHE_T:, :, :].clone()
-                if cache_x.shape[2] < 2 and feat_cache[idx] is not None:
-                    # cache last frame of last two chunk
-                    cache_x = torch.cat([
-                        feat_cache[idx][:, :, -1, :, :].unsqueeze(2).to(
-                            cache_x.device), cache_x
-                    ],
-                                        dim=2)
-                x = layer(x, feat_cache[idx])
-                feat_cache[idx] = cache_x
-                feat_idx[0] += 1
-            else:
-                x = layer(x)
-        return x, feat_cache, feat_idx
-
-
-def count_conv3d(model):
-    count = 0
-    for m in model.modules():
-        if isinstance(m, CausalConv3d):
-            count += 1
-    return count
-
-
-class WanVAE_(nn.Module):
-
-    def __init__(self,
-                 dim=128,
-                 z_dim=4,
-                 dim_mult=[1, 2, 4, 4],
-                 num_res_blocks=2,
-                 attn_scales=[],
-                 temperal_downsample=[True, True, False],
-                 dropout=0.0):
-        super().__init__()
-        self.dim = dim
-        self.z_dim = z_dim
-        self.dim_mult = dim_mult
-        self.num_res_blocks = num_res_blocks
-        self.attn_scales = attn_scales
-        self.temperal_downsample = temperal_downsample
-        self.temperal_upsample = temperal_downsample[::-1]
-
-        # modules
-        self.encoder = Encoder3d(dim, z_dim * 2, dim_mult, num_res_blocks,
-                                 attn_scales, self.temperal_downsample, dropout)
-        self.conv1 = CausalConv3d(z_dim * 2, z_dim * 2, 1)
-        self.conv2 = CausalConv3d(z_dim, z_dim, 1)
-        self.decoder = Decoder3d(dim, z_dim, dim_mult, num_res_blocks,
-                                 attn_scales, self.temperal_upsample, dropout)
-
-    def forward(self, x):
-        mu, log_var = self.encode(x)
-        z = self.reparameterize(mu, log_var)
-        x_recon = self.decode(z)
-        return x_recon, mu, log_var
-
-    def encode(self, x, scale):
-        self.clear_cache()
-        ## cache
-        t = x.shape[2]
-        iter_ = 1 + (t - 1) // 4
-        ## 对encode输入的x，按时间拆分为1、4、4、4....
-        for i in tqdm(range(iter_)):
-            self._enc_conv_idx = [0]
-            if i == 0:
-                out, self._enc_feat_map, self._enc_conv_idx = self.encoder(
-                    x[:, :, :1, :, :],
-                    feat_cache=self._enc_feat_map,
-                    feat_idx=self._enc_conv_idx)
-            else:
-                out_, self._enc_feat_map, self._enc_conv_idx = self.encoder(
-                    x[:, :, 1 + 4 * (i - 1):1 + 4 * i, :, :],
-                    feat_cache=self._enc_feat_map,
-                    feat_idx=self._enc_conv_idx)
-                out = torch.cat([out, out_], 2)
-        mu, log_var = self.conv1(out).chunk(2, dim=1)
-        if isinstance(scale[0], torch.Tensor):
-            mu = (mu - scale[0].view(1, self.z_dim, 1, 1, 1)) * scale[1].view(
-                1, self.z_dim, 1, 1, 1)
-        else:
-            mu = (mu - scale[0]) * scale[1]
-        self.clear_cache()
-        return mu
-
-    def decode(self, z, scale):
-        self.clear_cache()
-        # z: [b,c,t,h,w]
-        if isinstance(scale[0], torch.Tensor):
-            z = z / scale[1].view(1, self.z_dim, 1, 1, 1) + scale[0].view(
-                1, self.z_dim, 1, 1, 1)
-        else:
-            z = z / scale[1] + scale[0]
-        iter_ = z.shape[2]
-        x = self.conv2(z)
-        for i in range(iter_):
-            self._conv_idx = [0]
-            if i == 0:
-                out, self._feat_map, self._conv_idx  = self.decoder(
-                    x[:, :, i:i + 1, :, :],
-                    feat_cache=self._feat_map,
-                    feat_idx=self._conv_idx)
-            else:
-                out_, self._feat_map, self._conv_idx = self.decoder(
-                    x[:, :, i:i + 1, :, :],
-                    feat_cache=self._feat_map,
-                    feat_idx=self._conv_idx)
-                out = torch.cat([out, out_], 2)
-        self.clear_cache()
-        return out
-
-    def reparameterize(self, mu, log_var):
-        std = torch.exp(0.5 * log_var)
-        eps = torch.randn_like(std)
-        return eps * std + mu
-
-    def sample(self, imgs, deterministic=False):
-        mu, log_var = self.encode(imgs)
-        if deterministic:
-            return mu
-        std = torch.exp(0.5 * log_var.clamp(-30.0, 20.0))
-        return mu + std * torch.randn_like(std)
-
-    def clear_cache(self):
-        self._conv_num = count_conv3d(self.decoder)
-        self._conv_idx = [0]
-        self._feat_map = [None] * self._conv_num
-        #cache encode
-        self._enc_conv_num = count_conv3d(self.encoder)
-        self._enc_conv_idx = [0]
-        self._enc_feat_map = [None] * self._enc_conv_num
-
-
-class WanVAE:
-
-    def __init__(self,
-                 vae:WanVAE_=None,
-                 vae_pth='cache/vae_step_411000.pth',
-                 dtype=torch.float,
-                 device="cuda"):
-        self.dtype = dtype
-        self.device = device
-
-        mean = [
-            -0.7571, -0.7089, -0.9113, 0.1075, -0.1745, 0.9653, -0.1517, 1.5508,
-            0.4134, -0.0715, 0.5517, -0.3632, -0.1922, -0.9497, 0.2503, -0.2921
-        ]
-        std = [
-            2.8184, 1.4541, 2.3275, 2.6558, 1.2196, 1.7708, 2.6052, 2.0743,
-            3.2687, 2.1526, 2.8652, 1.5579, 1.6382, 1.1253, 2.8251, 1.9160
-        ]
-        self.mean = torch.tensor(mean, dtype=dtype, device=device)
-        self.std = torch.tensor(std, dtype=dtype, device=device)
-        self.scale = [self.mean, 1.0 / self.std]
-        self.model = vae.to(dtype)
-        self.vae_pth = vae_pth
-
-    def encode(self, videos):
-        """
-        videos: A list of videos each with shape [C, T, H, W].
-        """
-        with amp.autocast(dtype=self.dtype):
-            return [
-                self.model.encode(u.unsqueeze(0), self.scale).float().squeeze(0)
-                for u in videos
-            ]
-
-    def decode(self, zs):
-        with amp.autocast(dtype=self.dtype):
-            return [
-                self.model.decode(u.unsqueeze(0),
-                                  self.scale).float().clamp_(-1, 1).squeeze(0)
-                for u in zs
-            ]
-
-    def load_weight(self):    
-        logger.info(f'loading WanVAE from ckpt_path: {self.vae_pth}')
-        self.model.load_state_dict(torch.load(self.vae_pth, map_location=self.device), assign=True)
-        logger.info(f'loading WanVAE from ckpt_path: {self.vae_pth} Finished')
-        self.model = self.model.to(self.device).to(self.dtype)
\ No newline at end of file
+__all__ = ["Wan2_1_VAE", "WanVAE_"]
diff --git a/videotuna/models/wan/wan/modules/vae2_1.py b/videotuna/models/wan/wan/modules/vae2_1.py
new file mode 100644
index 00000000..f3bf18f8
--- /dev/null
+++ b/videotuna/models/wan/wan/modules/vae2_1.py
@@ -0,0 +1,772 @@
+# Copyright 2024-2025 The Alibaba Wan Team Authors. All rights reserved.
+import logging
+
+import torch
+import torch.cuda.amp as amp
+import torch.nn as nn
+import torch.nn.functional as F
+from einops import rearrange
+
+__all__ = [
+    "Wan2_1_VAE",
+]
+
+CACHE_T = 2
+
+
+class CausalConv3d(nn.Conv3d):
+    """
+    Causal 3d convolusion.
+    """
+
+    def __init__(self, *args, **kwargs):
+        super().__init__(*args, **kwargs)
+        self._padding = (
+            self.padding[2],
+            self.padding[2],
+            self.padding[1],
+            self.padding[1],
+            2 * self.padding[0],
+            0,
+        )
+        self.padding = (0, 0, 0)
+
+    def forward(self, x, cache_x=None):
+        padding = list(self._padding)
+        if cache_x is not None and self._padding[4] > 0:
+            cache_x = cache_x.to(x.device)
+            x = torch.cat([cache_x, x], dim=2)
+            padding[4] -= cache_x.shape[2]
+        x = F.pad(x, padding)
+
+        return super().forward(x)
+
+
+class RMS_norm(nn.Module):
+    def __init__(self, dim, channel_first=True, images=True, bias=False):
+        super().__init__()
+        broadcastable_dims = (1, 1, 1) if not images else (1, 1)
+        shape = (dim, *broadcastable_dims) if channel_first else (dim,)
+
+        self.channel_first = channel_first
+        self.scale = dim**0.5
+        self.gamma = nn.Parameter(torch.ones(shape))
+        self.bias = nn.Parameter(torch.zeros(shape)) if bias else 0.0
+
+    def forward(self, x):
+        return (
+            F.normalize(x, dim=(1 if self.channel_first else -1))
+            * self.scale
+            * self.gamma
+            + self.bias
+        )
+
+
+class Upsample(nn.Upsample):
+    def forward(self, x):
+        """
+        Fix bfloat16 support for nearest neighbor interpolation.
+        """
+        return super().forward(x.float()).type_as(x)
+
+
+class Resample(nn.Module):
+    def __init__(self, dim, mode):
+        assert mode in (
+            "none",
+            "upsample2d",
+            "upsample3d",
+            "downsample2d",
+            "downsample3d",
+        )
+        super().__init__()
+        self.dim = dim
+        self.mode = mode
+
+        # layers
+        if mode == "upsample2d":
+            self.resample = nn.Sequential(
+                Upsample(scale_factor=(2.0, 2.0), mode="nearest-exact"),
+                nn.Conv2d(dim, dim // 2, 3, padding=1),
+            )
+        elif mode == "upsample3d":
+            self.resample = nn.Sequential(
+                Upsample(scale_factor=(2.0, 2.0), mode="nearest-exact"),
+                nn.Conv2d(dim, dim // 2, 3, padding=1),
+            )
+            self.time_conv = CausalConv3d(dim, dim * 2, (3, 1, 1), padding=(1, 0, 0))
+
+        elif mode == "downsample2d":
+            self.resample = nn.Sequential(
+                nn.ZeroPad2d((0, 1, 0, 1)), nn.Conv2d(dim, dim, 3, stride=(2, 2))
+            )
+        elif mode == "downsample3d":
+            self.resample = nn.Sequential(
+                nn.ZeroPad2d((0, 1, 0, 1)), nn.Conv2d(dim, dim, 3, stride=(2, 2))
+            )
+            self.time_conv = CausalConv3d(
+                dim, dim, (3, 1, 1), stride=(2, 1, 1), padding=(0, 0, 0)
+            )
+
+        else:
+            self.resample = nn.Identity()
+
+    def forward(self, x, feat_cache=None, feat_idx=[0]):
+        b, c, t, h, w = x.size()
+        if self.mode == "upsample3d":
+            if feat_cache is not None:
+                idx = feat_idx[0]
+                if feat_cache[idx] is None:
+                    feat_cache[idx] = "Rep"
+                    feat_idx[0] += 1
+                else:
+                    cache_x = x[:, :, -CACHE_T:, :, :].clone()
+                    if (
+                        cache_x.shape[2] < 2
+                        and feat_cache[idx] is not None
+                        and feat_cache[idx] != "Rep"
+                    ):
+                        # cache last frame of last two chunk
+                        cache_x = torch.cat(
+                            [
+                                feat_cache[idx][:, :, -1, :, :]
+                                .unsqueeze(2)
+                                .to(cache_x.device),
+                                cache_x,
+                            ],
+                            dim=2,
+                        )
+                    if (
+                        cache_x.shape[2] < 2
+                        and feat_cache[idx] is not None
+                        and feat_cache[idx] == "Rep"
+                    ):
+                        cache_x = torch.cat(
+                            [torch.zeros_like(cache_x).to(cache_x.device), cache_x],
+                            dim=2,
+                        )
+                    if feat_cache[idx] == "Rep":
+                        x = self.time_conv(x)
+                    else:
+                        x = self.time_conv(x, feat_cache[idx])
+                    feat_cache[idx] = cache_x
+                    feat_idx[0] += 1
+
+                    x = x.reshape(b, 2, c, t, h, w)
+                    x = torch.stack((x[:, 0, :, :, :, :], x[:, 1, :, :, :, :]), 3)
+                    x = x.reshape(b, c, t * 2, h, w)
+        t = x.shape[2]
+        x = rearrange(x, "b c t h w -> (b t) c h w")
+        x = self.resample(x)
+        x = rearrange(x, "(b t) c h w -> b c t h w", t=t)
+
+        if self.mode == "downsample3d":
+            if feat_cache is not None:
+                idx = feat_idx[0]
+                if feat_cache[idx] is None:
+                    feat_cache[idx] = x.clone()
+                    feat_idx[0] += 1
+                else:
+                    cache_x = x[:, :, -1:, :, :].clone()
+                    # if cache_x.shape[2] < 2 and feat_cache[idx] is not None and feat_cache[idx]!='Rep':
+                    #     # cache last frame of last two chunk
+                    #     cache_x = torch.cat([feat_cache[idx][:, :, -1, :, :].unsqueeze(2).to(cache_x.device), cache_x], dim=2)
+
+                    x = self.time_conv(
+                        torch.cat([feat_cache[idx][:, :, -1:, :, :], x], 2)
+                    )
+                    feat_cache[idx] = cache_x
+                    feat_idx[0] += 1
+        return x
+
+    def init_weight(self, conv):
+        conv_weight = conv.weight
+        nn.init.zeros_(conv_weight)
+        c1, c2, t, h, w = conv_weight.size()
+        one_matrix = torch.eye(c1, c2)
+        init_matrix = one_matrix
+        nn.init.zeros_(conv_weight)
+        # conv_weight.data[:,:,-1,1,1] = init_matrix * 0.5
+        conv_weight.data[:, :, 1, 0, 0] = init_matrix  # * 0.5
+        conv.weight.data.copy_(conv_weight)
+        nn.init.zeros_(conv.bias.data)
+
+    def init_weight2(self, conv):
+        conv_weight = conv.weight.data
+        nn.init.zeros_(conv_weight)
+        c1, c2, t, h, w = conv_weight.size()
+        init_matrix = torch.eye(c1 // 2, c2)
+        # init_matrix = repeat(init_matrix, 'o ... -> (o 2) ...').permute(1,0,2).contiguous().reshape(c1,c2)
+        conv_weight[: c1 // 2, :, -1, 0, 0] = init_matrix
+        conv_weight[c1 // 2 :, :, -1, 0, 0] = init_matrix
+        conv.weight.data.copy_(conv_weight)
+        nn.init.zeros_(conv.bias.data)
+
+
+class ResidualBlock(nn.Module):
+    def __init__(self, in_dim, out_dim, dropout=0.0):
+        super().__init__()
+        self.in_dim = in_dim
+        self.out_dim = out_dim
+
+        # layers
+        self.residual = nn.Sequential(
+            RMS_norm(in_dim, images=False),
+            nn.SiLU(),
+            CausalConv3d(in_dim, out_dim, 3, padding=1),
+            RMS_norm(out_dim, images=False),
+            nn.SiLU(),
+            nn.Dropout(dropout),
+            CausalConv3d(out_dim, out_dim, 3, padding=1),
+        )
+        self.shortcut = (
+            CausalConv3d(in_dim, out_dim, 1) if in_dim != out_dim else nn.Identity()
+        )
+
+    def forward(self, x, feat_cache=None, feat_idx=[0]):
+        h = self.shortcut(x)
+        for layer in self.residual:
+            if isinstance(layer, CausalConv3d) and feat_cache is not None:
+                idx = feat_idx[0]
+                cache_x = x[:, :, -CACHE_T:, :, :].clone()
+                if cache_x.shape[2] < 2 and feat_cache[idx] is not None:
+                    # cache last frame of last two chunk
+                    cache_x = torch.cat(
+                        [
+                            feat_cache[idx][:, :, -1, :, :]
+                            .unsqueeze(2)
+                            .to(cache_x.device),
+                            cache_x,
+                        ],
+                        dim=2,
+                    )
+                x = layer(x, feat_cache[idx])
+                feat_cache[idx] = cache_x
+                feat_idx[0] += 1
+            else:
+                x = layer(x)
+        return x + h
+
+
+class AttentionBlock(nn.Module):
+    """
+    Causal self-attention with a single head.
+    """
+
+    def __init__(self, dim):
+        super().__init__()
+        self.dim = dim
+
+        # layers
+        self.norm = RMS_norm(dim)
+        self.to_qkv = nn.Conv2d(dim, dim * 3, 1)
+        self.proj = nn.Conv2d(dim, dim, 1)
+
+        # zero out the last layer params
+        nn.init.zeros_(self.proj.weight)
+
+    def forward(self, x):
+        identity = x
+        b, c, t, h, w = x.size()
+        x = rearrange(x, "b c t h w -> (b t) c h w")
+        x = self.norm(x)
+        # compute query, key, value
+        q, k, v = (
+            self.to_qkv(x)
+            .reshape(b * t, 1, c * 3, -1)
+            .permute(0, 1, 3, 2)
+            .contiguous()
+            .chunk(3, dim=-1)
+        )
+
+        # apply attention
+        x = F.scaled_dot_product_attention(
+            q,
+            k,
+            v,
+        )
+        x = x.squeeze(1).permute(0, 2, 1).reshape(b * t, c, h, w)
+
+        # output
+        x = self.proj(x)
+        x = rearrange(x, "(b t) c h w-> b c t h w", t=t)
+        return x + identity
+
+
+class Encoder3d(nn.Module):
+    def __init__(
+        self,
+        dim=128,
+        z_dim=4,
+        dim_mult=[1, 2, 4, 4],
+        num_res_blocks=2,
+        attn_scales=[],
+        temperal_downsample=[True, True, False],
+        dropout=0.0,
+    ):
+        super().__init__()
+        self.dim = dim
+        self.z_dim = z_dim
+        self.dim_mult = dim_mult
+        self.num_res_blocks = num_res_blocks
+        self.attn_scales = attn_scales
+        self.temperal_downsample = temperal_downsample
+
+        # dimensions
+        dims = [dim * u for u in [1] + dim_mult]
+        scale = 1.0
+
+        # init block
+        self.conv1 = CausalConv3d(3, dims[0], 3, padding=1)
+
+        # downsample blocks
+        downsamples = []
+        for i, (in_dim, out_dim) in enumerate(zip(dims[:-1], dims[1:])):
+            # residual (+attention) blocks
+            for _ in range(num_res_blocks):
+                downsamples.append(ResidualBlock(in_dim, out_dim, dropout))
+                if scale in attn_scales:
+                    downsamples.append(AttentionBlock(out_dim))
+                in_dim = out_dim
+
+            # downsample block
+            if i != len(dim_mult) - 1:
+                mode = "downsample3d" if temperal_downsample[i] else "downsample2d"
+                downsamples.append(Resample(out_dim, mode=mode))
+                scale /= 2.0
+        self.downsamples = nn.Sequential(*downsamples)
+
+        # middle blocks
+        self.middle = nn.Sequential(
+            ResidualBlock(out_dim, out_dim, dropout),
+            AttentionBlock(out_dim),
+            ResidualBlock(out_dim, out_dim, dropout),
+        )
+
+        # output blocks
+        self.head = nn.Sequential(
+            RMS_norm(out_dim, images=False),
+            nn.SiLU(),
+            CausalConv3d(out_dim, z_dim, 3, padding=1),
+        )
+
+    def forward(self, x, feat_cache=None, feat_idx=[0]):
+        if feat_cache is not None:
+            idx = feat_idx[0]
+            cache_x = x[:, :, -CACHE_T:, :, :].clone()
+            if cache_x.shape[2] < 2 and feat_cache[idx] is not None:
+                # cache last frame of last two chunk
+                cache_x = torch.cat(
+                    [
+                        feat_cache[idx][:, :, -1, :, :].unsqueeze(2).to(cache_x.device),
+                        cache_x,
+                    ],
+                    dim=2,
+                )
+            x = self.conv1(x, feat_cache[idx])
+            feat_cache[idx] = cache_x
+            feat_idx[0] += 1
+        else:
+            x = self.conv1(x)
+
+        ## downsamples
+        for layer in self.downsamples:
+            if feat_cache is not None:
+                x = layer(x, feat_cache, feat_idx)
+            else:
+                x = layer(x)
+
+        ## middle
+        for layer in self.middle:
+            if isinstance(layer, ResidualBlock) and feat_cache is not None:
+                x = layer(x, feat_cache, feat_idx)
+            else:
+                x = layer(x)
+
+        ## head
+        for layer in self.head:
+            if isinstance(layer, CausalConv3d) and feat_cache is not None:
+                idx = feat_idx[0]
+                cache_x = x[:, :, -CACHE_T:, :, :].clone()
+                if cache_x.shape[2] < 2 and feat_cache[idx] is not None:
+                    # cache last frame of last two chunk
+                    cache_x = torch.cat(
+                        [
+                            feat_cache[idx][:, :, -1, :, :]
+                            .unsqueeze(2)
+                            .to(cache_x.device),
+                            cache_x,
+                        ],
+                        dim=2,
+                    )
+                x = layer(x, feat_cache[idx])
+                feat_cache[idx] = cache_x
+                feat_idx[0] += 1
+            else:
+                x = layer(x)
+        return x
+
+
+class Decoder3d(nn.Module):
+    def __init__(
+        self,
+        dim=128,
+        z_dim=4,
+        dim_mult=[1, 2, 4, 4],
+        num_res_blocks=2,
+        attn_scales=[],
+        temperal_upsample=[False, True, True],
+        dropout=0.0,
+    ):
+        super().__init__()
+        self.dim = dim
+        self.z_dim = z_dim
+        self.dim_mult = dim_mult
+        self.num_res_blocks = num_res_blocks
+        self.attn_scales = attn_scales
+        self.temperal_upsample = temperal_upsample
+
+        # dimensions
+        dims = [dim * u for u in [dim_mult[-1]] + dim_mult[::-1]]
+        scale = 1.0 / 2 ** (len(dim_mult) - 2)
+
+        # init block
+        self.conv1 = CausalConv3d(z_dim, dims[0], 3, padding=1)
+
+        # middle blocks
+        self.middle = nn.Sequential(
+            ResidualBlock(dims[0], dims[0], dropout),
+            AttentionBlock(dims[0]),
+            ResidualBlock(dims[0], dims[0], dropout),
+        )
+
+        # upsample blocks
+        upsamples = []
+        for i, (in_dim, out_dim) in enumerate(zip(dims[:-1], dims[1:])):
+            # residual (+attention) blocks
+            if i == 1 or i == 2 or i == 3:
+                in_dim = in_dim // 2
+            for _ in range(num_res_blocks + 1):
+                upsamples.append(ResidualBlock(in_dim, out_dim, dropout))
+                if scale in attn_scales:
+                    upsamples.append(AttentionBlock(out_dim))
+                in_dim = out_dim
+
+            # upsample block
+            if i != len(dim_mult) - 1:
+                mode = "upsample3d" if temperal_upsample[i] else "upsample2d"
+                upsamples.append(Resample(out_dim, mode=mode))
+                scale *= 2.0
+        self.upsamples = nn.Sequential(*upsamples)
+
+        # output blocks
+        self.head = nn.Sequential(
+            RMS_norm(out_dim, images=False),
+            nn.SiLU(),
+            CausalConv3d(out_dim, 3, 3, padding=1),
+        )
+
+    def forward(self, x, feat_cache=None, feat_idx=[0]):
+        ## conv1
+        if feat_cache is not None:
+            idx = feat_idx[0]
+            cache_x = x[:, :, -CACHE_T:, :, :].clone()
+            if cache_x.shape[2] < 2 and feat_cache[idx] is not None:
+                # cache last frame of last two chunk
+                cache_x = torch.cat(
+                    [
+                        feat_cache[idx][:, :, -1, :, :].unsqueeze(2).to(cache_x.device),
+                        cache_x,
+                    ],
+                    dim=2,
+                )
+            x = self.conv1(x, feat_cache[idx])
+            feat_cache[idx] = cache_x
+            feat_idx[0] += 1
+        else:
+            x = self.conv1(x)
+
+        ## middle
+        for layer in self.middle:
+            if isinstance(layer, ResidualBlock) and feat_cache is not None:
+                x = layer(x, feat_cache, feat_idx)
+            else:
+                x = layer(x)
+
+        ## upsamples
+        for layer in self.upsamples:
+            if feat_cache is not None:
+                x = layer(x, feat_cache, feat_idx)
+            else:
+                x = layer(x)
+
+        ## head
+        for layer in self.head:
+            if isinstance(layer, CausalConv3d) and feat_cache is not None:
+                idx = feat_idx[0]
+                cache_x = x[:, :, -CACHE_T:, :, :].clone()
+                if cache_x.shape[2] < 2 and feat_cache[idx] is not None:
+                    # cache last frame of last two chunk
+                    cache_x = torch.cat(
+                        [
+                            feat_cache[idx][:, :, -1, :, :]
+                            .unsqueeze(2)
+                            .to(cache_x.device),
+                            cache_x,
+                        ],
+                        dim=2,
+                    )
+                x = layer(x, feat_cache[idx])
+                feat_cache[idx] = cache_x
+                feat_idx[0] += 1
+            else:
+                x = layer(x)
+        return x
+
+
+def count_conv3d(model):
+    count = 0
+    for m in model.modules():
+        if isinstance(m, CausalConv3d):
+            count += 1
+    return count
+
+
+class WanVAE_(nn.Module):
+    def __init__(
+        self,
+        dim=128,
+        z_dim=4,
+        dim_mult=[1, 2, 4, 4],
+        num_res_blocks=2,
+        attn_scales=[],
+        temperal_downsample=[True, True, False],
+        dropout=0.0,
+    ):
+        super().__init__()
+        self.dim = dim
+        self.z_dim = z_dim
+        self.dim_mult = dim_mult
+        self.num_res_blocks = num_res_blocks
+        self.attn_scales = attn_scales
+        self.temperal_downsample = temperal_downsample
+        self.temperal_upsample = temperal_downsample[::-1]
+
+        # modules
+        self.encoder = Encoder3d(
+            dim,
+            z_dim * 2,
+            dim_mult,
+            num_res_blocks,
+            attn_scales,
+            self.temperal_downsample,
+            dropout,
+        )
+        self.conv1 = CausalConv3d(z_dim * 2, z_dim * 2, 1)
+        self.conv2 = CausalConv3d(z_dim, z_dim, 1)
+        self.decoder = Decoder3d(
+            dim,
+            z_dim,
+            dim_mult,
+            num_res_blocks,
+            attn_scales,
+            self.temperal_upsample,
+            dropout,
+        )
+
+    def forward(self, x):
+        mu, log_var = self.encode(x)
+        z = self.reparameterize(mu, log_var)
+        x_recon = self.decode(z)
+        return x_recon, mu, log_var
+
+    def encode(self, x, scale):
+        self.clear_cache()
+        ## cache
+        t = x.shape[2]
+        iter_ = 1 + (t - 1) // 4
+        ## 对encode输入的x，按时间拆分为1、4、4、4....
+        for i in range(iter_):
+            self._enc_conv_idx = [0]
+            if i == 0:
+                out = self.encoder(
+                    x[:, :, :1, :, :],
+                    feat_cache=self._enc_feat_map,
+                    feat_idx=self._enc_conv_idx,
+                )
+            else:
+                out_ = self.encoder(
+                    x[:, :, 1 + 4 * (i - 1) : 1 + 4 * i, :, :],
+                    feat_cache=self._enc_feat_map,
+                    feat_idx=self._enc_conv_idx,
+                )
+                out = torch.cat([out, out_], 2)
+        mu, log_var = self.conv1(out).chunk(2, dim=1)
+        if isinstance(scale[0], torch.Tensor):
+            mu = (mu - scale[0].view(1, self.z_dim, 1, 1, 1)) * scale[1].view(
+                1, self.z_dim, 1, 1, 1
+            )
+        else:
+            mu = (mu - scale[0]) * scale[1]
+        self.clear_cache()
+        return mu
+
+    def decode(self, z, scale):
+        self.clear_cache()
+        # z: [b,c,t,h,w]
+        if isinstance(scale[0], torch.Tensor):
+            z = z / scale[1].view(1, self.z_dim, 1, 1, 1) + scale[0].view(
+                1, self.z_dim, 1, 1, 1
+            )
+        else:
+            z = z / scale[1] + scale[0]
+        iter_ = z.shape[2]
+        x = self.conv2(z)
+        for i in range(iter_):
+            self._conv_idx = [0]
+            if i == 0:
+                out = self.decoder(
+                    x[:, :, i : i + 1, :, :],
+                    feat_cache=self._feat_map,
+                    feat_idx=self._conv_idx,
+                )
+            else:
+                out_ = self.decoder(
+                    x[:, :, i : i + 1, :, :],
+                    feat_cache=self._feat_map,
+                    feat_idx=self._conv_idx,
+                )
+                out = torch.cat([out, out_], 2)
+        self.clear_cache()
+        return out
+
+    def reparameterize(self, mu, log_var):
+        std = torch.exp(0.5 * log_var)
+        eps = torch.randn_like(std)
+        return eps * std + mu
+
+    def sample(self, imgs, deterministic=False):
+        mu, log_var = self.encode(imgs)
+        if deterministic:
+            return mu
+        std = torch.exp(0.5 * log_var.clamp(-30.0, 20.0))
+        return mu + std * torch.randn_like(std)
+
+    def clear_cache(self):
+        self._conv_num = count_conv3d(self.decoder)
+        self._conv_idx = [0]
+        self._feat_map = [None] * self._conv_num
+        # cache encode
+        self._enc_conv_num = count_conv3d(self.encoder)
+        self._enc_conv_idx = [0]
+        self._enc_feat_map = [None] * self._enc_conv_num
+
+
+def _video_vae(pretrained_path=None, z_dim=None, device="cpu", **kwargs):
+    """
+    Autoencoder3d adapted from Stable Diffusion 1.x, 2.x and XL.
+    """
+    # params
+    cfg = dict(
+        dim=96,
+        z_dim=z_dim,
+        dim_mult=[1, 2, 4, 4],
+        num_res_blocks=2,
+        attn_scales=[],
+        temperal_downsample=[False, True, True],
+        dropout=0.0,
+    )
+    cfg.update(**kwargs)
+
+    # init model
+    with torch.device("meta"):
+        model = WanVAE_(**cfg)
+
+    # load checkpoint
+    logging.info(f"loading {pretrained_path}")
+    model.load_state_dict(torch.load(pretrained_path, map_location=device), assign=True)
+
+    return model
+
+
+class Wan2_1_VAE:
+    def __init__(
+        self,
+        z_dim=16,
+        vae_pth="cache/vae_step_411000.pth",
+        dtype=torch.float,
+        device="cuda",
+    ):
+        self.dtype = dtype
+        self.device = device
+
+        mean = [
+            -0.7571,
+            -0.7089,
+            -0.9113,
+            0.1075,
+            -0.1745,
+            0.9653,
+            -0.1517,
+            1.5508,
+            0.4134,
+            -0.0715,
+            0.5517,
+            -0.3632,
+            -0.1922,
+            -0.9497,
+            0.2503,
+            -0.2921,
+        ]
+        std = [
+            2.8184,
+            1.4541,
+            2.3275,
+            2.6558,
+            1.2196,
+            1.7708,
+            2.6052,
+            2.0743,
+            3.2687,
+            2.1526,
+            2.8652,
+            1.5579,
+            1.6382,
+            1.1253,
+            2.8251,
+            1.9160,
+        ]
+        self.mean = torch.tensor(mean, dtype=dtype, device=device)
+        self.std = torch.tensor(std, dtype=dtype, device=device)
+        self.scale = [self.mean, 1.0 / self.std]
+
+        # init model
+        self.model = (
+            _video_vae(
+                pretrained_path=vae_pth,
+                z_dim=z_dim,
+            )
+            .eval()
+            .requires_grad_(False)
+            .to(device)
+        )
+
+    def encode(self, videos):
+        """
+        videos: A list of videos each with shape [C, T, H, W].
+        """
+        with amp.autocast(dtype=self.dtype):
+            return [
+                self.model.encode(u.unsqueeze(0), self.scale).float().squeeze(0)
+                for u in videos
+            ]
+
+    def decode(self, zs):
+        with amp.autocast(dtype=self.dtype):
+            return [
+                self.model.decode(u.unsqueeze(0), self.scale)
+                .float()
+                .clamp_(-1, 1)
+                .squeeze(0)
+                for u in zs
+            ]
diff --git a/videotuna/models/wan/wan/modules/xlm_roberta.py b/videotuna/models/wan/wan/modules/xlm_roberta.py
deleted file mode 100644
index 4bd38c10..00000000
--- a/videotuna/models/wan/wan/modules/xlm_roberta.py
+++ /dev/null
@@ -1,170 +0,0 @@
-# Modified from transformers.models.xlm_roberta.modeling_xlm_roberta
-# Copyright 2024-2025 The Alibaba Wan Team Authors. All rights reserved.
-import torch
-import torch.nn as nn
-import torch.nn.functional as F
-
-__all__ = ['XLMRoberta', 'xlm_roberta_large']
-
-
-class SelfAttention(nn.Module):
-
-    def __init__(self, dim, num_heads, dropout=0.1, eps=1e-5):
-        assert dim % num_heads == 0
-        super().__init__()
-        self.dim = dim
-        self.num_heads = num_heads
-        self.head_dim = dim // num_heads
-        self.eps = eps
-
-        # layers
-        self.q = nn.Linear(dim, dim)
-        self.k = nn.Linear(dim, dim)
-        self.v = nn.Linear(dim, dim)
-        self.o = nn.Linear(dim, dim)
-        self.dropout = nn.Dropout(dropout)
-
-    def forward(self, x, mask):
-        """
-        x:   [B, L, C].
-        """
-        b, s, c, n, d = *x.size(), self.num_heads, self.head_dim
-
-        # compute query, key, value
-        q = self.q(x).reshape(b, s, n, d).permute(0, 2, 1, 3)
-        k = self.k(x).reshape(b, s, n, d).permute(0, 2, 1, 3)
-        v = self.v(x).reshape(b, s, n, d).permute(0, 2, 1, 3)
-
-        # compute attention
-        p = self.dropout.p if self.training else 0.0
-        x = F.scaled_dot_product_attention(q, k, v, mask, p)
-        x = x.permute(0, 2, 1, 3).reshape(b, s, c)
-
-        # output
-        x = self.o(x)
-        x = self.dropout(x)
-        return x
-
-
-class AttentionBlock(nn.Module):
-
-    def __init__(self, dim, num_heads, post_norm, dropout=0.1, eps=1e-5):
-        super().__init__()
-        self.dim = dim
-        self.num_heads = num_heads
-        self.post_norm = post_norm
-        self.eps = eps
-
-        # layers
-        self.attn = SelfAttention(dim, num_heads, dropout, eps)
-        self.norm1 = nn.LayerNorm(dim, eps=eps)
-        self.ffn = nn.Sequential(
-            nn.Linear(dim, dim * 4), nn.GELU(), nn.Linear(dim * 4, dim),
-            nn.Dropout(dropout))
-        self.norm2 = nn.LayerNorm(dim, eps=eps)
-
-    def forward(self, x, mask):
-        if self.post_norm:
-            x = self.norm1(x + self.attn(x, mask))
-            x = self.norm2(x + self.ffn(x))
-        else:
-            x = x + self.attn(self.norm1(x), mask)
-            x = x + self.ffn(self.norm2(x))
-        return x
-
-
-class XLMRoberta(nn.Module):
-    """
-    XLMRobertaModel with no pooler and no LM head.
-    """
-
-    def __init__(self,
-                 vocab_size=250002,
-                 max_seq_len=514,
-                 type_size=1,
-                 pad_id=1,
-                 dim=1024,
-                 num_heads=16,
-                 num_layers=24,
-                 post_norm=True,
-                 dropout=0.1,
-                 eps=1e-5):
-        super().__init__()
-        self.vocab_size = vocab_size
-        self.max_seq_len = max_seq_len
-        self.type_size = type_size
-        self.pad_id = pad_id
-        self.dim = dim
-        self.num_heads = num_heads
-        self.num_layers = num_layers
-        self.post_norm = post_norm
-        self.eps = eps
-
-        # embeddings
-        self.token_embedding = nn.Embedding(vocab_size, dim, padding_idx=pad_id)
-        self.type_embedding = nn.Embedding(type_size, dim)
-        self.pos_embedding = nn.Embedding(max_seq_len, dim, padding_idx=pad_id)
-        self.dropout = nn.Dropout(dropout)
-
-        # blocks
-        self.blocks = nn.ModuleList([
-            AttentionBlock(dim, num_heads, post_norm, dropout, eps)
-            for _ in range(num_layers)
-        ])
-
-        # norm layer
-        self.norm = nn.LayerNorm(dim, eps=eps)
-
-    def forward(self, ids):
-        """
-        ids: [B, L] of torch.LongTensor.
-        """
-        b, s = ids.shape
-        mask = ids.ne(self.pad_id).long()
-
-        # embeddings
-        x = self.token_embedding(ids) + \
-            self.type_embedding(torch.zeros_like(ids)) + \
-            self.pos_embedding(self.pad_id + torch.cumsum(mask, dim=1) * mask)
-        if self.post_norm:
-            x = self.norm(x)
-        x = self.dropout(x)
-
-        # blocks
-        mask = torch.where(
-            mask.view(b, 1, 1, s).gt(0), 0.0,
-            torch.finfo(x.dtype).min)
-        for block in self.blocks:
-            x = block(x, mask)
-
-        # output
-        if not self.post_norm:
-            x = self.norm(x)
-        return x
-
-
-def xlm_roberta_large(pretrained=False,
-                      return_tokenizer=False,
-                      device='cpu',
-                      **kwargs):
-    """
-    XLMRobertaLarge adapted from Huggingface.
-    """
-    # params
-    cfg = dict(
-        vocab_size=250002,
-        max_seq_len=514,
-        type_size=1,
-        pad_id=1,
-        dim=1024,
-        num_heads=16,
-        num_layers=24,
-        post_norm=True,
-        dropout=0.1,
-        eps=1e-5)
-    cfg.update(**kwargs)
-
-    # init a model on device
-    with torch.device(device):
-        model = XLMRoberta(**cfg)
-    return model
diff --git a/videotuna/models/wan/wan/text2video.py b/videotuna/models/wan/wan/text2video.py
index 64444f03..ba313ac4 100644
--- a/videotuna/models/wan/wan/text2video.py
+++ b/videotuna/models/wan/wan/text2video.py
@@ -1,6 +1,6 @@
 # Copyright 2024-2025 The Alibaba Wan Team Authors. All rights reserved.
 import gc
-from loguru import logger
+import logging
 import math
 import os
 import random
@@ -8,27 +8,26 @@
 import types
 from contextlib import contextmanager
 from functools import partial
-from typing import Union
-from pathlib import Path
 
 import torch
-import torch.cuda.amp as amp
 import torch.distributed as dist
 from tqdm import tqdm
-from typing import Optional
 
 from .distributed.fsdp import shard_model
+from .distributed.sequence_parallel import sp_attn_forward, sp_dit_forward
+from .distributed.util import get_world_size
 from .modules.model import WanModel
-from .modules.t5 import T5Encoder, T5EncoderModel
-from .modules.vae import WanVAE, WanVAE_
-from .utils.fm_solvers import (FlowDPMSolverMultistepScheduler,
-                               get_sampling_sigmas, retrieve_timesteps)
+from .modules.t5 import T5EncoderModel
+from .modules.vae2_1 import Wan2_1_VAE
+from .utils.fm_solvers import (
+    FlowDPMSolverMultistepScheduler,
+    get_sampling_sigmas,
+    retrieve_timesteps,
+)
 from .utils.fm_solvers_unipc import FlowUniPCMultistepScheduler
-from ....utils.common_utils import monitor_resources
-from ....schedulers.flow_matching import FlowMatchScheduler
 
-class WanT2V:
 
+class WanT2V:
     def __init__(
         self,
         config,
@@ -37,11 +36,10 @@ def __init__(
         rank=0,
         t5_fsdp=False,
         dit_fsdp=False,
-        use_usp=False,
+        use_sp=False,
         t5_cpu=False,
-        first_stage_model: WanVAE_= None ,
-        cond_stage_model:T5Encoder=None,
-        denoiser: WanModel=None,
+        init_on_cpu=True,
+        convert_model_dtype=False,
     ):
         r"""
         Initializes the Wan text-to-video generation model components.
@@ -59,63 +57,175 @@ def __init__(
                 Enable FSDP sharding for T5 model
             dit_fsdp (`bool`, *optional*, defaults to False):
                 Enable FSDP sharding for DiT model
-            use_usp (`bool`, *optional*, defaults to False):
-                Enable distribution strategy of USP.
+            use_sp (`bool`, *optional*, defaults to False):
+                Enable distribution strategy of sequence parallel.
             t5_cpu (`bool`, *optional*, defaults to False):
                 Whether to place T5 model on CPU. Only works without t5_fsdp.
+            init_on_cpu (`bool`, *optional*, defaults to True):
+                Enable initializing Transformer Model on CPU. Only works without FSDP or USP.
+            convert_model_dtype (`bool`, *optional*, defaults to False):
+                Convert DiT model parameters dtype to 'config.param_dtype'.
+                Only works without FSDP.
         """
         self.device = torch.device(f"cuda:{device_id}")
         self.config = config
         self.rank = rank
         self.t5_cpu = t5_cpu
-        self.t5_fsdp = t5_fsdp
-        self.dit_fsdp = dit_fsdp
-        self.use_usp = use_usp
+        self.init_on_cpu = init_on_cpu
+
         self.num_train_timesteps = config.num_train_timesteps
+        self.boundary = config.boundary
         self.param_dtype = config.param_dtype
 
-        #encoder
+        if t5_fsdp or dit_fsdp or use_sp:
+            self.init_on_cpu = False
+
         shard_fn = partial(shard_model, device_id=device_id)
-        self.text_encoder : T5EncoderModel = T5EncoderModel(
+        self.text_encoder = T5EncoderModel(
             text_len=config.text_len,
             dtype=config.t5_dtype,
-            device=torch.device('cpu'),
+            device=torch.device("cpu"),
             checkpoint_path=os.path.join(checkpoint_dir, config.t5_checkpoint),
             tokenizer_path=os.path.join(checkpoint_dir, config.t5_tokenizer),
             shard_fn=shard_fn if t5_fsdp else None,
-            model=cond_stage_model)
+        )
 
-        #vae
-        self.vae: WanVAE = WanVAE(vae=first_stage_model, 
-                                  vae_pth=os.path.join(checkpoint_dir, config.vae_checkpoint), 
-                                  device=self.device)
         self.vae_stride = config.vae_stride
         self.patch_size = config.patch_size
+        self.vae = Wan2_1_VAE(
+            vae_pth=os.path.join(checkpoint_dir, config.vae_checkpoint),
+            device=self.device,
+        )
+
+        logging.info(f"Creating WanModel from {checkpoint_dir}")
+        self.low_noise_model = WanModel.from_pretrained(
+            checkpoint_dir, subfolder=config.low_noise_checkpoint
+        )
+        self.low_noise_model = self._configure_model(
+            model=self.low_noise_model,
+            use_sp=use_sp,
+            dit_fsdp=dit_fsdp,
+            shard_fn=shard_fn,
+            convert_model_dtype=convert_model_dtype,
+        )
+
+        self.high_noise_model = WanModel.from_pretrained(
+            checkpoint_dir, subfolder=config.high_noise_checkpoint
+        )
+        self.high_noise_model = self._configure_model(
+            model=self.high_noise_model,
+            use_sp=use_sp,
+            dit_fsdp=dit_fsdp,
+            shard_fn=shard_fn,
+            convert_model_dtype=convert_model_dtype,
+        )
+        if use_sp:
+            self.sp_size = get_world_size()
+        else:
+            self.sp_size = 1
 
-        #denoiser
-        self.model: WanModel = denoiser
-        self.shard_fn = shard_fn
         self.sample_neg_prompt = config.sample_neg_prompt
 
-    @monitor_resources(return_metrics=True)
-    def generate(self,
-                 input_prompt,
-                 size=(1280, 720),
-                 frame_num=81,
-                 shift=5.0,
-                 sample_solver='unipc',
-                 sampling_steps=50,
-                 guide_scale=5.0,
-                 n_prompt="",
-                 seed=-1,
-                 offload_model=True):
+    def _configure_model(self, model, use_sp, dit_fsdp, shard_fn, convert_model_dtype):
+        """
+        Configures a model object. This includes setting evaluation modes,
+        applying distributed parallel strategy, and handling device placement.
+
+        Args:
+            model (torch.nn.Module):
+                The model instance to configure.
+            use_sp (`bool`):
+                Enable distribution strategy of sequence parallel.
+            dit_fsdp (`bool`):
+                Enable FSDP sharding for DiT model.
+            shard_fn (callable):
+                The function to apply FSDP sharding.
+            convert_model_dtype (`bool`):
+                Convert DiT model parameters dtype to 'config.param_dtype'.
+                Only works without FSDP.
+
+        Returns:
+            torch.nn.Module:
+                The configured model.
+        """
+        model.eval().requires_grad_(False)
+
+        if use_sp:
+            for block in model.blocks:
+                block.self_attn.forward = types.MethodType(
+                    sp_attn_forward, block.self_attn
+                )
+            model.forward = types.MethodType(sp_dit_forward, model)
+
+        if dist.is_initialized():
+            dist.barrier()
+
+        if dit_fsdp:
+            model = shard_fn(model)
+        else:
+            if convert_model_dtype:
+                model.to(self.param_dtype)
+            if not self.init_on_cpu:
+                model.to(self.device)
+
+        return model
+
+    def _prepare_model_for_timestep(self, t, boundary, offload_model):
+        r"""
+        Prepares and returns the required model for the current timestep.
+
+        Args:
+            t (torch.Tensor):
+                current timestep.
+            boundary (`int`):
+                The timestep threshold. If `t` is at or above this value,
+                the `high_noise_model` is considered as the required model.
+            offload_model (`bool`):
+                A flag intended to control the offloading behavior.
+
+        Returns:
+            torch.nn.Module:
+                The active model on the target device for the current timestep.
+        """
+        if t.item() >= boundary:
+            required_model_name = "high_noise_model"
+            offload_model_name = "low_noise_model"
+        else:
+            required_model_name = "low_noise_model"
+            offload_model_name = "high_noise_model"
+        if offload_model or self.init_on_cpu:
+            if (
+                next(getattr(self, offload_model_name).parameters()).device.type
+                == "cuda"
+            ):
+                getattr(self, offload_model_name).to("cpu")
+            if (
+                next(getattr(self, required_model_name).parameters()).device.type
+                == "cpu"
+            ):
+                getattr(self, required_model_name).to(self.device)
+        return getattr(self, required_model_name)
+
+    def generate(
+        self,
+        input_prompt,
+        size=(1280, 720),
+        frame_num=81,
+        shift=5.0,
+        sample_solver="unipc",
+        sampling_steps=50,
+        guide_scale=5.0,
+        n_prompt="",
+        seed=-1,
+        offload_model=True,
+    ):
         r"""
         Generates video frames from text prompt using diffusion process.
 
         Args:
             input_prompt (`str`):
                 Text prompt for content generation
-            size (tupele[`int`], *optional*, defaults to (1280,720)):
+            size (`tuple[int]`, *optional*, defaults to (1280,720)):
                 Controls video resolution, (width,height).
             frame_num (`int`, *optional*, defaults to 81):
                 How many frames to sample from a video. The number should be 4n+1
@@ -123,10 +233,12 @@ def generate(self,
                 Noise schedule shift parameter. Affects temporal dynamics
             sample_solver (`str`, *optional*, defaults to 'unipc'):
                 Solver used to sample the video.
-            sampling_steps (`int`, *optional*, defaults to 40):
+            sampling_steps (`int`, *optional*, defaults to 50):
                 Number of diffusion sampling steps. Higher values improve quality but slow generation
-            guide_scale (`float`, *optional*, defaults 5.0):
-                Classifier-free guidance scale. Controls prompt adherence vs. creativity
+            guide_scale (`float` or tuple[`float`], *optional*, defaults 5.0):
+                Classifier-free guidance scale. Controls prompt adherence vs. creativity.
+                If tuple, the first guide_scale will be used for low noise model and
+                the second guide_scale will be used for high noise model.
             n_prompt (`str`, *optional*, defaults to ""):
                 Negative prompt for content exclusion. If not given, use `config.sample_neg_prompt`
             seed (`int`, *optional*, defaults to -1):
@@ -143,14 +255,28 @@ def generate(self,
                 - W: Frame width from size)
         """
         # preprocess
+        guide_scale = (
+            (guide_scale, guide_scale)
+            if isinstance(guide_scale, float)
+            else guide_scale
+        )
         F = frame_num
-        target_shape = (self.vae.model.z_dim, (F - 1) // self.vae_stride[0] + 1,
-                        size[1] // self.vae_stride[1],
-                        size[0] // self.vae_stride[2])
-
-        seq_len = math.ceil((target_shape[2] * target_shape[3]) /
-                            (self.patch_size[1] * self.patch_size[2]) *
-                            target_shape[1] / self.sp_size) * self.sp_size
+        target_shape = (
+            self.vae.model.z_dim,
+            (F - 1) // self.vae_stride[0] + 1,
+            size[1] // self.vae_stride[1],
+            size[0] // self.vae_stride[2],
+        )
+
+        seq_len = (
+            math.ceil(
+                (target_shape[2] * target_shape[3])
+                / (self.patch_size[1] * self.patch_size[2])
+                * target_shape[1]
+                / self.sp_size
+            )
+            * self.sp_size
+        )
 
         if n_prompt == "":
             n_prompt = self.sample_neg_prompt
@@ -165,8 +291,8 @@ def generate(self,
             if offload_model:
                 self.text_encoder.model.cpu()
         else:
-            context = self.text_encoder([input_prompt], torch.device('cpu'))
-            context_null = self.text_encoder([n_prompt], torch.device('cpu'))
+            context = self.text_encoder([input_prompt], torch.device("cpu"))
+            context_null = self.text_encoder([n_prompt], torch.device("cpu"))
             context = [t.to(self.device) for t in context]
             context_null = [t.to(self.device) for t in context_null]
 
@@ -178,44 +304,54 @@ def generate(self,
                 target_shape[3],
                 dtype=torch.float32,
                 device=self.device,
-                generator=seed_g)
+                generator=seed_g,
+            )
         ]
 
         @contextmanager
         def noop_no_sync():
             yield
 
-        no_sync = getattr(self.model, 'no_sync', noop_no_sync)
+        no_sync_low_noise = getattr(self.low_noise_model, "no_sync", noop_no_sync)
+        no_sync_high_noise = getattr(self.high_noise_model, "no_sync", noop_no_sync)
 
         # evaluation mode
-        with amp.autocast(dtype=self.param_dtype), torch.no_grad(), no_sync():
-
-            if sample_solver == 'unipc':
+        with (
+            torch.amp.autocast("cuda", dtype=self.param_dtype),
+            torch.no_grad(),
+            no_sync_low_noise(),
+            no_sync_high_noise(),
+        ):
+            boundary = self.boundary * self.num_train_timesteps
+
+            if sample_solver == "unipc":
                 sample_scheduler = FlowUniPCMultistepScheduler(
                     num_train_timesteps=self.num_train_timesteps,
                     shift=1,
-                    use_dynamic_shifting=False)
+                    use_dynamic_shifting=False,
+                )
                 sample_scheduler.set_timesteps(
-                    sampling_steps, device=self.device, shift=shift)
+                    sampling_steps, device=self.device, shift=shift
+                )
                 timesteps = sample_scheduler.timesteps
-            elif sample_solver == 'dpm++':
+            elif sample_solver == "dpm++":
                 sample_scheduler = FlowDPMSolverMultistepScheduler(
                     num_train_timesteps=self.num_train_timesteps,
                     shift=1,
-                    use_dynamic_shifting=False)
+                    use_dynamic_shifting=False,
+                )
                 sampling_sigmas = get_sampling_sigmas(sampling_steps, shift)
                 timesteps, _ = retrieve_timesteps(
-                    sample_scheduler,
-                    device=self.device,
-                    sigmas=sampling_sigmas)
+                    sample_scheduler, device=self.device, sigmas=sampling_sigmas
+                )
             else:
                 raise NotImplementedError("Unsupported solver.")
 
             # sample videos
             latents = noise
 
-            arg_c = {'context': context, 'seq_len': seq_len}
-            arg_null = {'context': context_null, 'seq_len': seq_len}
+            arg_c = {"context": context, "seq_len": seq_len}
+            arg_null = {"context": context_null, "seq_len": seq_len}
 
             for _, t in enumerate(tqdm(timesteps)):
                 latent_model_input = latents
@@ -223,26 +359,31 @@ def noop_no_sync():
 
                 timestep = torch.stack(timestep)
 
-                self.model.to(self.device)
-                noise_pred_cond = self.model(
-                    latent_model_input, t=timestep, **arg_c)[0]
-                noise_pred_uncond = self.model(
-                    latent_model_input, t=timestep, **arg_null)[0]
+                model = self._prepare_model_for_timestep(t, boundary, offload_model)
+                sample_guide_scale = (
+                    guide_scale[1] if t.item() >= boundary else guide_scale[0]
+                )
 
-                noise_pred = noise_pred_uncond + guide_scale * (
-                    noise_pred_cond - noise_pred_uncond)
+                noise_pred_cond = model(latent_model_input, t=timestep, **arg_c)[0]
+                noise_pred_uncond = model(latent_model_input, t=timestep, **arg_null)[0]
+
+                noise_pred = noise_pred_uncond + sample_guide_scale * (
+                    noise_pred_cond - noise_pred_uncond
+                )
 
                 temp_x0 = sample_scheduler.step(
                     noise_pred.unsqueeze(0),
                     t,
                     latents[0].unsqueeze(0),
                     return_dict=False,
-                    generator=seed_g)[0]
+                    generator=seed_g,
+                )[0]
                 latents = [temp_x0.squeeze(0)]
 
             x0 = latents
             if offload_model:
-                self.model.cpu()
+                self.low_noise_model.cpu()
+                self.high_noise_model.cpu()
                 torch.cuda.empty_cache()
             if self.rank == 0:
                 videos = self.vae.decode(x0)
@@ -256,77 +397,3 @@ def noop_no_sync():
             dist.barrier()
 
         return videos[0] if self.rank == 0 else None
-
-    def load_weight(self):
-        self.text_encoder.load_weight()
-        self.vae.load_weight()
-        #denoiser use from_pretrained, no need load again
-        if self.use_usp:
-            from xfuser.core.distributed import \
-                get_sequence_parallel_world_size
-
-            from .distributed.xdit_context_parallel import (usp_attn_forward,
-                                                            usp_dit_forward)
-            for block in self.model.blocks:
-                block.self_attn.forward = types.MethodType(
-                    usp_attn_forward, block.self_attn)
-            self.model.forward = types.MethodType(usp_dit_forward, self.model)
-            self.sp_size = get_sequence_parallel_world_size()
-        else:
-            self.sp_size = 1
-
-        if dist.is_initialized():
-            dist.barrier()
-        if self.dit_fsdp:
-            self.model = self.shard_fn(self.model)
-        else:
-            self.model = self.model.to(self.device)
-
-    def enable_vram_management(self):
-        pass
-
-    def get_seq_len(self, frames:int=81, width:int=1280, height:int=720):
-        target_shape = (self.vae.model.z_dim, (frames - 1) // self.vae_stride[0] + 1,
-                        height // self.vae_stride[1],
-                        width // self.vae_stride[2])
-
-        seq_len = math.ceil((target_shape[2] * target_shape[3]) /
-                            (self.patch_size[1] * self.patch_size[2]) *
-                            target_shape[1] / self.sp_size) * self.sp_size
-        return seq_len
-    
-    def training_step(self, batch, batch_idx, 
-                      first_stage_key:str, 
-                      cond_stage_key:str,
-                      model_offload:bool = True,
-                      dtype:torch.dtype = torch.bfloat16,
-                      device:str = "cuda"):
-        with torch.no_grad():
-            if not model_offload:
-                latents = torch.stack(self.vae.encode(batch[first_stage_key])).to(dtype=dtype, device=device).detach()
-                text_cond_embed = self.text_encoder(batch[cond_stage_key], device)
-            else:
-                self.vae.model.to(device)
-                latents = torch.stack(self.vae.encode(batch[first_stage_key])).to(dtype=dtype, device=device).detach()
-                self.vae.model.to('cpu')
-                self.text_encoder.model.to(device)
-                text_cond_embed = self.text_encoder(batch[cond_stage_key], device)
-                self.text_encoder.model.to('cpu')
-
-        ## scheduler
-        self.scheduler : FlowMatchScheduler = FlowMatchScheduler(shift=5, sigma_min=0.0, extra_one_step=True)
-        self.scheduler.set_timesteps(1000, training=True)
-
-        ## noise
-        B = len(latents)
-        noise = torch.randn_like(latents)
-        timestep_id = torch.randint(0, self.scheduler.num_train_timesteps, (1,))
-        timestep = self.scheduler.timesteps[timestep_id].to(dtype=dtype, device=device)
-        noisy_latents = self.scheduler.add_noise(latents, noise, timestep).to(dtype=dtype, device=device)
-        training_target = noise.to(device) - latents
-
-        # compute loss
-        noise_pred = self.model(x=noisy_latents, t=timestep, context=text_cond_embed, seq_len=None)
-        loss = torch.nn.functional.mse_loss(torch.stack(noise_pred).float(), training_target.float())
-        loss = loss * self.scheduler.training_weight(timestep).to(device=device)
-        return loss
\ No newline at end of file
diff --git a/videotuna/models/wan/wan/utils/__init__.py b/videotuna/models/wan/wan/utils/__init__.py
index 6e9a339e..e08f2ba7 100644
--- a/videotuna/models/wan/wan/utils/__init__.py
+++ b/videotuna/models/wan/wan/utils/__init__.py
@@ -1,8 +1,15 @@
-from .fm_solvers import (FlowDPMSolverMultistepScheduler, get_sampling_sigmas,
-                         retrieve_timesteps)
+# Copyright 2024-2025 The Alibaba Wan Team Authors. All rights reserved.
+from .fm_solvers import (
+    FlowDPMSolverMultistepScheduler,
+    get_sampling_sigmas,
+    retrieve_timesteps,
+)
 from .fm_solvers_unipc import FlowUniPCMultistepScheduler
 
 __all__ = [
-    'HuggingfaceTokenizer', 'get_sampling_sigmas', 'retrieve_timesteps',
-    'FlowDPMSolverMultistepScheduler', 'FlowUniPCMultistepScheduler'
+    "HuggingfaceTokenizer",
+    "get_sampling_sigmas",
+    "retrieve_timesteps",
+    "FlowDPMSolverMultistepScheduler",
+    "FlowUniPCMultistepScheduler",
 ]
diff --git a/videotuna/models/wan/wan/utils/fm_solvers.py b/videotuna/models/wan/wan/utils/fm_solvers.py
index c908969e..448a3856 100644
--- a/videotuna/models/wan/wan/utils/fm_solvers.py
+++ b/videotuna/models/wan/wan/utils/fm_solvers.py
@@ -9,9 +9,11 @@
 import numpy as np
 import torch
 from diffusers.configuration_utils import ConfigMixin, register_to_config
-from diffusers.schedulers.scheduling_utils import (KarrasDiffusionSchedulers,
-                                                   SchedulerMixin,
-                                                   SchedulerOutput)
+from diffusers.schedulers.scheduling_utils import (
+    KarrasDiffusionSchedulers,
+    SchedulerMixin,
+    SchedulerOutput,
+)
 from diffusers.utils import deprecate, is_scipy_available
 from diffusers.utils.torch_utils import randn_tensor
 
@@ -21,7 +23,7 @@
 
 def get_sampling_sigmas(sampling_steps, shift):
     sigma = np.linspace(1, 0, sampling_steps + 1)[:sampling_steps]
-    sigma = (shift * sigma / (1 + (shift - 1) * sigma))
+    sigma = shift * sigma / (1 + (shift - 1) * sigma)
 
     return sigma
 
@@ -40,7 +42,8 @@ def retrieve_timesteps(
         )
     if timesteps is not None:
         accepts_timesteps = "timesteps" in set(
-            inspect.signature(scheduler.set_timesteps).parameters.keys())
+            inspect.signature(scheduler.set_timesteps).parameters.keys()
+        )
         if not accepts_timesteps:
             raise ValueError(
                 f"The current scheduler class {scheduler.__class__}'s `set_timesteps` does not support custom"
@@ -51,7 +54,8 @@ def retrieve_timesteps(
         num_inference_steps = len(timesteps)
     elif sigmas is not None:
         accept_sigmas = "sigmas" in set(
-            inspect.signature(scheduler.set_timesteps).parameters.keys())
+            inspect.signature(scheduler.set_timesteps).parameters.keys()
+        )
         if not accept_sigmas:
             raise ValueError(
                 f"The current scheduler class {scheduler.__class__}'s `set_timesteps` does not support custom"
@@ -147,43 +151,53 @@ def __init__(
     ):
         if algorithm_type in ["dpmsolver", "sde-dpmsolver"]:
             deprecation_message = f"algorithm_type {algorithm_type} is deprecated and will be removed in a future version. Choose from `dpmsolver++` or `sde-dpmsolver++` instead"
-            deprecate("algorithm_types dpmsolver and sde-dpmsolver", "1.0.0",
-                      deprecation_message)
+            deprecate(
+                "algorithm_types dpmsolver and sde-dpmsolver",
+                "1.0.0",
+                deprecation_message,
+            )
 
         # settings for DPM-Solver
         if algorithm_type not in [
-                "dpmsolver", "dpmsolver++", "sde-dpmsolver", "sde-dpmsolver++"
+            "dpmsolver",
+            "dpmsolver++",
+            "sde-dpmsolver",
+            "sde-dpmsolver++",
         ]:
             if algorithm_type == "deis":
                 self.register_to_config(algorithm_type="dpmsolver++")
             else:
                 raise NotImplementedError(
-                    f"{algorithm_type} is not implemented for {self.__class__}")
+                    f"{algorithm_type} is not implemented for {self.__class__}"
+                )
 
         if solver_type not in ["midpoint", "heun"]:
             if solver_type in ["logrho", "bh1", "bh2"]:
                 self.register_to_config(solver_type="midpoint")
             else:
                 raise NotImplementedError(
-                    f"{solver_type} is not implemented for {self.__class__}")
+                    f"{solver_type} is not implemented for {self.__class__}"
+                )
 
-        if algorithm_type not in ["dpmsolver++", "sde-dpmsolver++"
-                                 ] and final_sigmas_type == "zero":
+        if (
+            algorithm_type not in ["dpmsolver++", "sde-dpmsolver++"]
+            and final_sigmas_type == "zero"
+        ):
             raise ValueError(
                 f"`final_sigmas_type` {final_sigmas_type} is not supported for `algorithm_type` {algorithm_type}. Please choose `sigma_min` instead."
             )
 
         # setable values
         self.num_inference_steps = None
-        alphas = np.linspace(1, 1 / num_train_timesteps,
-                             num_train_timesteps)[::-1].copy()
+        alphas = np.linspace(1, 1 / num_train_timesteps, num_train_timesteps)[
+            ::-1
+        ].copy()
         sigmas = 1.0 - alphas
         sigmas = torch.from_numpy(sigmas).to(dtype=torch.float32)
 
         if not use_dynamic_shifting:
             # when use_dynamic_shifting is True, we apply the timestep shifting on the fly based on the image resolution
-            sigmas = shift * sigmas / (1 +
-                                       (shift - 1) * sigmas)  # pyright: ignore
+            sigmas = shift * sigmas / (1 + (shift - 1) * sigmas)  # pyright: ignore
 
         self.sigmas = sigmas
         self.timesteps = sigmas * num_train_timesteps
@@ -246,21 +260,19 @@ def set_timesteps(
             )
 
         if sigmas is None:
-            sigmas = np.linspace(self.sigma_max, self.sigma_min,
-                                 num_inference_steps +
-                                 1).copy()[:-1]  # pyright: ignore
+            sigmas = np.linspace(
+                self.sigma_max, self.sigma_min, num_inference_steps + 1
+            ).copy()[:-1]  # pyright: ignore
 
         if self.config.use_dynamic_shifting:
             sigmas = self.time_shift(mu, 1.0, sigmas)  # pyright: ignore
         else:
             if shift is None:
                 shift = self.config.shift
-            sigmas = shift * sigmas / (1 +
-                                       (shift - 1) * sigmas)  # pyright: ignore
+            sigmas = shift * sigmas / (1 + (shift - 1) * sigmas)  # pyright: ignore
 
         if self.config.final_sigmas_type == "sigma_min":
-            sigma_last = ((1 - self.alphas_cumprod[0]) /
-                          self.alphas_cumprod[0])**0.5
+            sigma_last = ((1 - self.alphas_cumprod[0]) / self.alphas_cumprod[0]) ** 0.5
         elif self.config.final_sigmas_type == "zero":
             sigma_last = 0
         else:
@@ -269,12 +281,12 @@ def set_timesteps(
             )
 
         timesteps = sigmas * self.config.num_train_timesteps
-        sigmas = np.concatenate([sigmas, [sigma_last]
-                                ]).astype(np.float32)  # pyright: ignore
+        sigmas = np.concatenate([sigmas, [sigma_last]]).astype(np.float32)  # pyright: ignore
 
         self.sigmas = torch.from_numpy(sigmas)
         self.timesteps = torch.from_numpy(timesteps).to(
-            device=device, dtype=torch.int64)
+            device=device, dtype=torch.int64
+        )
 
         self.num_inference_steps = len(timesteps)
 
@@ -302,7 +314,8 @@ def _threshold_sample(self, sample: torch.Tensor) -> torch.Tensor:
         batch_size, channels, *remaining_dims = sample.shape
 
         if dtype not in (torch.float32, torch.float64):
-            sample = sample.float(
+            sample = (
+                sample.float()
             )  # upcast for quantile calculation, and clamp not implemented for cpu half
 
         # Flatten sample for doing quantile calculation along each image
@@ -310,16 +323,14 @@ def _threshold_sample(self, sample: torch.Tensor) -> torch.Tensor:
 
         abs_sample = sample.abs()  # "a certain percentile absolute pixel value"
 
-        s = torch.quantile(
-            abs_sample, self.config.dynamic_thresholding_ratio, dim=1)
+        s = torch.quantile(abs_sample, self.config.dynamic_thresholding_ratio, dim=1)
         s = torch.clamp(
             s, min=1, max=self.config.sample_max_value
         )  # When clamped to min=1, equivalent to standard clipping to [-1, 1]
-        s = s.unsqueeze(
-            1)  # (batch_size, 1) because clamp will broadcast along dim=0
-        sample = torch.clamp(
-            sample, -s, s
-        ) / s  # "we threshold xt0 to the range [-s, s] and then divide by s"
+        s = s.unsqueeze(1)  # (batch_size, 1) because clamp will broadcast along dim=0
+        sample = (
+            torch.clamp(sample, -s, s) / s
+        )  # "we threshold xt0 to the range [-s, s] and then divide by s"
 
         sample = sample.reshape(batch_size, channels, *remaining_dims)
         sample = sample.to(dtype)
@@ -335,7 +346,7 @@ def _sigma_to_alpha_sigma_t(self, sigma):
 
     # Copied from diffusers.schedulers.scheduling_flow_match_euler_discrete.set_timesteps
     def time_shift(self, mu: float, sigma: float, t: torch.Tensor):
-        return math.exp(mu) / (math.exp(mu) + (1 / t - 1)**sigma)
+        return math.exp(mu) / (math.exp(mu) + (1 / t - 1) ** sigma)
 
     # Copied from diffusers.schedulers.scheduling_dpmsolver_multistep.DPMSolverMultistepScheduler.convert_model_output
     def convert_model_output(
@@ -367,8 +378,7 @@ def convert_model_output(
             if len(args) > 1:
                 sample = args[1]
             else:
-                raise ValueError(
-                    "missing `sample` as a required keyward argument")
+                raise ValueError("missing `sample` as a required keyward argument")
         if timestep is not None:
             deprecate(
                 "timesteps",
@@ -432,14 +442,12 @@ def dpm_solver_first_order_update(
                 The sample tensor at the previous timestep.
         """
         timestep = args[0] if len(args) > 0 else kwargs.pop("timestep", None)
-        prev_timestep = args[1] if len(args) > 1 else kwargs.pop(
-            "prev_timestep", None)
+        prev_timestep = args[1] if len(args) > 1 else kwargs.pop("prev_timestep", None)
         if sample is None:
             if len(args) > 2:
                 sample = args[2]
             else:
-                raise ValueError(
-                    " missing `sample` as a required keyward argument")
+                raise ValueError(" missing `sample` as a required keyward argument")
         if timestep is not None:
             deprecate(
                 "timesteps",
@@ -454,8 +462,10 @@ def dpm_solver_first_order_update(
                 "Passing `prev_timestep` is deprecated and has no effect as model output conversion is now handled via an internal counter `self.step_index`",
             )
 
-        sigma_t, sigma_s = self.sigmas[self.step_index + 1], self.sigmas[
-            self.step_index]  # pyright: ignore
+        sigma_t, sigma_s = (
+            self.sigmas[self.step_index + 1],
+            self.sigmas[self.step_index],
+        )  # pyright: ignore
         alpha_t, sigma_t = self._sigma_to_alpha_sigma_t(sigma_t)
         alpha_s, sigma_s = self._sigma_to_alpha_sigma_t(sigma_s)
         lambda_t = torch.log(alpha_t) - torch.log(sigma_t)
@@ -463,23 +473,27 @@ def dpm_solver_first_order_update(
 
         h = lambda_t - lambda_s
         if self.config.algorithm_type == "dpmsolver++":
-            x_t = (sigma_t /
-                   sigma_s) * sample - (alpha_t *
-                                        (torch.exp(-h) - 1.0)) * model_output
+            x_t = (sigma_t / sigma_s) * sample - (
+                alpha_t * (torch.exp(-h) - 1.0)
+            ) * model_output
         elif self.config.algorithm_type == "dpmsolver":
-            x_t = (alpha_t /
-                   alpha_s) * sample - (sigma_t *
-                                        (torch.exp(h) - 1.0)) * model_output
+            x_t = (alpha_t / alpha_s) * sample - (
+                sigma_t * (torch.exp(h) - 1.0)
+            ) * model_output
         elif self.config.algorithm_type == "sde-dpmsolver++":
             assert noise is not None
-            x_t = ((sigma_t / sigma_s * torch.exp(-h)) * sample +
-                   (alpha_t * (1 - torch.exp(-2.0 * h))) * model_output +
-                   sigma_t * torch.sqrt(1.0 - torch.exp(-2 * h)) * noise)
+            x_t = (
+                (sigma_t / sigma_s * torch.exp(-h)) * sample
+                + (alpha_t * (1 - torch.exp(-2.0 * h))) * model_output
+                + sigma_t * torch.sqrt(1.0 - torch.exp(-2 * h)) * noise
+            )
         elif self.config.algorithm_type == "sde-dpmsolver":
             assert noise is not None
-            x_t = ((alpha_t / alpha_s) * sample - 2.0 *
-                   (sigma_t * (torch.exp(h) - 1.0)) * model_output +
-                   sigma_t * torch.sqrt(torch.exp(2 * h) - 1.0) * noise)
+            x_t = (
+                (alpha_t / alpha_s) * sample
+                - 2.0 * (sigma_t * (torch.exp(h) - 1.0)) * model_output
+                + sigma_t * torch.sqrt(torch.exp(2 * h) - 1.0) * noise
+            )
         return x_t  # pyright: ignore
 
     # Copied from diffusers.schedulers.scheduling_dpmsolver_multistep.DPMSolverMultistepScheduler.multistep_dpm_solver_second_order_update
@@ -502,16 +516,13 @@ def multistep_dpm_solver_second_order_update(
             `torch.Tensor`:
                 The sample tensor at the previous timestep.
         """
-        timestep_list = args[0] if len(args) > 0 else kwargs.pop(
-            "timestep_list", None)
-        prev_timestep = args[1] if len(args) > 1 else kwargs.pop(
-            "prev_timestep", None)
+        timestep_list = args[0] if len(args) > 0 else kwargs.pop("timestep_list", None)
+        prev_timestep = args[1] if len(args) > 1 else kwargs.pop("prev_timestep", None)
         if sample is None:
             if len(args) > 2:
                 sample = args[2]
             else:
-                raise ValueError(
-                    " missing `sample` as a required keyward argument")
+                raise ValueError(" missing `sample` as a required keyward argument")
         if timestep_list is not None:
             deprecate(
                 "timestep_list",
@@ -548,48 +559,63 @@ def multistep_dpm_solver_second_order_update(
         if self.config.algorithm_type == "dpmsolver++":
             # See https://arxiv.org/abs/2211.01095 for detailed derivations
             if self.config.solver_type == "midpoint":
-                x_t = ((sigma_t / sigma_s0) * sample -
-                       (alpha_t * (torch.exp(-h) - 1.0)) * D0 - 0.5 *
-                       (alpha_t * (torch.exp(-h) - 1.0)) * D1)
+                x_t = (
+                    (sigma_t / sigma_s0) * sample
+                    - (alpha_t * (torch.exp(-h) - 1.0)) * D0
+                    - 0.5 * (alpha_t * (torch.exp(-h) - 1.0)) * D1
+                )
             elif self.config.solver_type == "heun":
-                x_t = ((sigma_t / sigma_s0) * sample -
-                       (alpha_t * (torch.exp(-h) - 1.0)) * D0 +
-                       (alpha_t * ((torch.exp(-h) - 1.0) / h + 1.0)) * D1)
+                x_t = (
+                    (sigma_t / sigma_s0) * sample
+                    - (alpha_t * (torch.exp(-h) - 1.0)) * D0
+                    + (alpha_t * ((torch.exp(-h) - 1.0) / h + 1.0)) * D1
+                )
         elif self.config.algorithm_type == "dpmsolver":
             # See https://arxiv.org/abs/2206.00927 for detailed derivations
             if self.config.solver_type == "midpoint":
-                x_t = ((alpha_t / alpha_s0) * sample -
-                       (sigma_t * (torch.exp(h) - 1.0)) * D0 - 0.5 *
-                       (sigma_t * (torch.exp(h) - 1.0)) * D1)
+                x_t = (
+                    (alpha_t / alpha_s0) * sample
+                    - (sigma_t * (torch.exp(h) - 1.0)) * D0
+                    - 0.5 * (sigma_t * (torch.exp(h) - 1.0)) * D1
+                )
             elif self.config.solver_type == "heun":
-                x_t = ((alpha_t / alpha_s0) * sample -
-                       (sigma_t * (torch.exp(h) - 1.0)) * D0 -
-                       (sigma_t * ((torch.exp(h) - 1.0) / h - 1.0)) * D1)
+                x_t = (
+                    (alpha_t / alpha_s0) * sample
+                    - (sigma_t * (torch.exp(h) - 1.0)) * D0
+                    - (sigma_t * ((torch.exp(h) - 1.0) / h - 1.0)) * D1
+                )
         elif self.config.algorithm_type == "sde-dpmsolver++":
             assert noise is not None
             if self.config.solver_type == "midpoint":
-                x_t = ((sigma_t / sigma_s0 * torch.exp(-h)) * sample +
-                       (alpha_t * (1 - torch.exp(-2.0 * h))) * D0 + 0.5 *
-                       (alpha_t * (1 - torch.exp(-2.0 * h))) * D1 +
-                       sigma_t * torch.sqrt(1.0 - torch.exp(-2 * h)) * noise)
+                x_t = (
+                    (sigma_t / sigma_s0 * torch.exp(-h)) * sample
+                    + (alpha_t * (1 - torch.exp(-2.0 * h))) * D0
+                    + 0.5 * (alpha_t * (1 - torch.exp(-2.0 * h))) * D1
+                    + sigma_t * torch.sqrt(1.0 - torch.exp(-2 * h)) * noise
+                )
             elif self.config.solver_type == "heun":
-                x_t = ((sigma_t / sigma_s0 * torch.exp(-h)) * sample +
-                       (alpha_t * (1 - torch.exp(-2.0 * h))) * D0 +
-                       (alpha_t * ((1.0 - torch.exp(-2.0 * h)) /
-                                   (-2.0 * h) + 1.0)) * D1 +
-                       sigma_t * torch.sqrt(1.0 - torch.exp(-2 * h)) * noise)
+                x_t = (
+                    (sigma_t / sigma_s0 * torch.exp(-h)) * sample
+                    + (alpha_t * (1 - torch.exp(-2.0 * h))) * D0
+                    + (alpha_t * ((1.0 - torch.exp(-2.0 * h)) / (-2.0 * h) + 1.0)) * D1
+                    + sigma_t * torch.sqrt(1.0 - torch.exp(-2 * h)) * noise
+                )
         elif self.config.algorithm_type == "sde-dpmsolver":
             assert noise is not None
             if self.config.solver_type == "midpoint":
-                x_t = ((alpha_t / alpha_s0) * sample - 2.0 *
-                       (sigma_t * (torch.exp(h) - 1.0)) * D0 -
-                       (sigma_t * (torch.exp(h) - 1.0)) * D1 +
-                       sigma_t * torch.sqrt(torch.exp(2 * h) - 1.0) * noise)
+                x_t = (
+                    (alpha_t / alpha_s0) * sample
+                    - 2.0 * (sigma_t * (torch.exp(h) - 1.0)) * D0
+                    - (sigma_t * (torch.exp(h) - 1.0)) * D1
+                    + sigma_t * torch.sqrt(torch.exp(2 * h) - 1.0) * noise
+                )
             elif self.config.solver_type == "heun":
-                x_t = ((alpha_t / alpha_s0) * sample - 2.0 *
-                       (sigma_t * (torch.exp(h) - 1.0)) * D0 - 2.0 *
-                       (sigma_t * ((torch.exp(h) - 1.0) / h - 1.0)) * D1 +
-                       sigma_t * torch.sqrt(torch.exp(2 * h) - 1.0) * noise)
+                x_t = (
+                    (alpha_t / alpha_s0) * sample
+                    - 2.0 * (sigma_t * (torch.exp(h) - 1.0)) * D0
+                    - 2.0 * (sigma_t * ((torch.exp(h) - 1.0) / h - 1.0)) * D1
+                    + sigma_t * torch.sqrt(torch.exp(2 * h) - 1.0) * noise
+                )
         return x_t  # pyright: ignore
 
     # Copied from diffusers.schedulers.scheduling_dpmsolver_multistep.DPMSolverMultistepScheduler.multistep_dpm_solver_third_order_update
@@ -612,16 +638,13 @@ def multistep_dpm_solver_third_order_update(
                 The sample tensor at the previous timestep.
         """
 
-        timestep_list = args[0] if len(args) > 0 else kwargs.pop(
-            "timestep_list", None)
-        prev_timestep = args[1] if len(args) > 1 else kwargs.pop(
-            "prev_timestep", None)
+        timestep_list = args[0] if len(args) > 0 else kwargs.pop("timestep_list", None)
+        prev_timestep = args[1] if len(args) > 1 else kwargs.pop("prev_timestep", None)
         if sample is None:
             if len(args) > 2:
                 sample = args[2]
             else:
-                raise ValueError(
-                    " missing`sample` as a required keyward argument")
+                raise ValueError(" missing`sample` as a required keyward argument")
         if timestep_list is not None:
             deprecate(
                 "timestep_list",
@@ -653,8 +676,7 @@ def multistep_dpm_solver_third_order_update(
         lambda_s1 = torch.log(alpha_s1) - torch.log(sigma_s1)
         lambda_s2 = torch.log(alpha_s2) - torch.log(sigma_s2)
 
-        m0, m1, m2 = model_output_list[-1], model_output_list[
-            -2], model_output_list[-3]
+        m0, m1, m2 = model_output_list[-1], model_output_list[-2], model_output_list[-3]
 
         h, h_0, h_1 = lambda_t - lambda_s0, lambda_s0 - lambda_s1, lambda_s1 - lambda_s2
         r0, r1 = h_0 / h, h_1 / h
@@ -664,16 +686,20 @@ def multistep_dpm_solver_third_order_update(
         D2 = (1.0 / (r0 + r1)) * (D1_0 - D1_1)
         if self.config.algorithm_type == "dpmsolver++":
             # See https://arxiv.org/abs/2206.00927 for detailed derivations
-            x_t = ((sigma_t / sigma_s0) * sample -
-                   (alpha_t * (torch.exp(-h) - 1.0)) * D0 +
-                   (alpha_t * ((torch.exp(-h) - 1.0) / h + 1.0)) * D1 -
-                   (alpha_t * ((torch.exp(-h) - 1.0 + h) / h**2 - 0.5)) * D2)
+            x_t = (
+                (sigma_t / sigma_s0) * sample
+                - (alpha_t * (torch.exp(-h) - 1.0)) * D0
+                + (alpha_t * ((torch.exp(-h) - 1.0) / h + 1.0)) * D1
+                - (alpha_t * ((torch.exp(-h) - 1.0 + h) / h**2 - 0.5)) * D2
+            )
         elif self.config.algorithm_type == "dpmsolver":
             # See https://arxiv.org/abs/2206.00927 for detailed derivations
-            x_t = ((alpha_t / alpha_s0) * sample - (sigma_t *
-                                                    (torch.exp(h) - 1.0)) * D0 -
-                   (sigma_t * ((torch.exp(h) - 1.0) / h - 1.0)) * D1 -
-                   (sigma_t * ((torch.exp(h) - 1.0 - h) / h**2 - 0.5)) * D2)
+            x_t = (
+                (alpha_t / alpha_s0) * sample
+                - (sigma_t * (torch.exp(h) - 1.0)) * D0
+                - (sigma_t * ((torch.exp(h) - 1.0) / h - 1.0)) * D1
+                - (sigma_t * ((torch.exp(h) - 1.0 - h) / h**2 - 0.5)) * D2
+            )
         return x_t  # pyright: ignore
 
     def index_for_timestep(self, timestep, schedule_timesteps=None):
@@ -744,12 +770,15 @@ def step(
 
         # Improve numerical stability for small number of steps
         lower_order_final = (self.step_index == len(self.timesteps) - 1) and (
-            self.config.euler_at_final or
-            (self.config.lower_order_final and len(self.timesteps) < 15) or
-            self.config.final_sigmas_type == "zero")
-        lower_order_second = ((self.step_index == len(self.timesteps) - 2) and
-                              self.config.lower_order_final and
-                              len(self.timesteps) < 15)
+            self.config.euler_at_final
+            or (self.config.lower_order_final and len(self.timesteps) < 15)
+            or self.config.final_sigmas_type == "zero"
+        )
+        lower_order_second = (
+            (self.step_index == len(self.timesteps) - 2)
+            and self.config.lower_order_final
+            and len(self.timesteps) < 15
+        )
 
         model_output = self.convert_model_output(model_output, sample=sample)
         for i in range(self.config.solver_order - 1):
@@ -758,29 +787,41 @@ def step(
 
         # Upcast to avoid precision issues when computing prev_sample
         sample = sample.to(torch.float32)
-        if self.config.algorithm_type in ["sde-dpmsolver", "sde-dpmsolver++"
-                                         ] and variance_noise is None:
+        if (
+            self.config.algorithm_type in ["sde-dpmsolver", "sde-dpmsolver++"]
+            and variance_noise is None
+        ):
             noise = randn_tensor(
                 model_output.shape,
                 generator=generator,
                 device=model_output.device,
-                dtype=torch.float32)
+                dtype=torch.float32,
+            )
         elif self.config.algorithm_type in ["sde-dpmsolver", "sde-dpmsolver++"]:
-            noise = variance_noise.to(
-                device=model_output.device,
-                dtype=torch.float32)  # pyright: ignore
+            noise = variance_noise.to(device=model_output.device, dtype=torch.float32)  # pyright: ignore
         else:
             noise = None
 
-        if self.config.solver_order == 1 or self.lower_order_nums < 1 or lower_order_final:
+        if (
+            self.config.solver_order == 1
+            or self.lower_order_nums < 1
+            or lower_order_final
+        ):
             prev_sample = self.dpm_solver_first_order_update(
-                model_output, sample=sample, noise=noise)
-        elif self.config.solver_order == 2 or self.lower_order_nums < 2 or lower_order_second:
+                model_output, sample=sample, noise=noise
+            )
+        elif (
+            self.config.solver_order == 2
+            or self.lower_order_nums < 2
+            or lower_order_second
+        ):
             prev_sample = self.multistep_dpm_solver_second_order_update(
-                self.model_outputs, sample=sample, noise=noise)
+                self.model_outputs, sample=sample, noise=noise
+            )
         else:
             prev_sample = self.multistep_dpm_solver_third_order_update(
-                self.model_outputs, sample=sample)
+                self.model_outputs, sample=sample
+            )
 
         if self.lower_order_nums < self.config.solver_order:
             self.lower_order_nums += 1
@@ -797,8 +838,7 @@ def step(
         return SchedulerOutput(prev_sample=prev_sample)
 
     # Copied from diffusers.schedulers.scheduling_dpmsolver_multistep.DPMSolverMultistepScheduler.scale_model_input
-    def scale_model_input(self, sample: torch.Tensor, *args,
-                          **kwargs) -> torch.Tensor:
+    def scale_model_input(self, sample: torch.Tensor, *args, **kwargs) -> torch.Tensor:
         """
         Ensures interchangeability with schedulers that need to scale the denoising model input depending on the
         current timestep.
@@ -820,14 +860,14 @@ def add_noise(
     ) -> torch.Tensor:
         # Make sure sigmas and timesteps have the same device and dtype as original_samples
         sigmas = self.sigmas.to(
-            device=original_samples.device, dtype=original_samples.dtype)
-        if original_samples.device.type == "mps" and torch.is_floating_point(
-                timesteps):
+            device=original_samples.device, dtype=original_samples.dtype
+        )
+        if original_samples.device.type == "mps" and torch.is_floating_point(timesteps):
             # mps does not support float64
             schedule_timesteps = self.timesteps.to(
-                original_samples.device, dtype=torch.float32)
-            timesteps = timesteps.to(
-                original_samples.device, dtype=torch.float32)
+                original_samples.device, dtype=torch.float32
+            )
+            timesteps = timesteps.to(original_samples.device, dtype=torch.float32)
         else:
             schedule_timesteps = self.timesteps.to(original_samples.device)
             timesteps = timesteps.to(original_samples.device)
@@ -835,8 +875,7 @@ def add_noise(
         # begin_index is None when the scheduler is used for training or pipeline does not implement set_begin_index
         if self.begin_index is None:
             step_indices = [
-                self.index_for_timestep(t, schedule_timesteps)
-                for t in timesteps
+                self.index_for_timestep(t, schedule_timesteps) for t in timesteps
             ]
         elif self.step_index is not None:
             # add_noise is called after first denoising step (for inpainting)
diff --git a/videotuna/models/wan/wan/utils/fm_solvers_unipc.py b/videotuna/models/wan/wan/utils/fm_solvers_unipc.py
index 57321baa..2cc1d1f0 100644
--- a/videotuna/models/wan/wan/utils/fm_solvers_unipc.py
+++ b/videotuna/models/wan/wan/utils/fm_solvers_unipc.py
@@ -8,13 +8,15 @@
 import numpy as np
 import torch
 from diffusers.configuration_utils import ConfigMixin, register_to_config
-from diffusers.schedulers.scheduling_utils import (KarrasDiffusionSchedulers,
-                                                   SchedulerMixin,
-                                                   SchedulerOutput)
+from diffusers.schedulers.scheduling_utils import (
+    KarrasDiffusionSchedulers,
+    SchedulerMixin,
+    SchedulerOutput,
+)
 from diffusers.utils import deprecate, is_scipy_available
 
 if is_scipy_available():
-    import scipy.stats
+    pass
 
 
 class FlowUniPCMultistepScheduler(SchedulerMixin, ConfigMixin):
@@ -75,44 +77,44 @@ class FlowUniPCMultistepScheduler(SchedulerMixin, ConfigMixin):
 
     @register_to_config
     def __init__(
-            self,
-            num_train_timesteps: int = 1000,
-            solver_order: int = 2,
-            prediction_type: str = "flow_prediction",
-            shift: Optional[float] = 1.0,
-            use_dynamic_shifting=False,
-            thresholding: bool = False,
-            dynamic_thresholding_ratio: float = 0.995,
-            sample_max_value: float = 1.0,
-            predict_x0: bool = True,
-            solver_type: str = "bh2",
-            lower_order_final: bool = True,
-            disable_corrector: List[int] = [],
-            solver_p: SchedulerMixin = None,
-            timestep_spacing: str = "linspace",
-            steps_offset: int = 0,
-            final_sigmas_type: Optional[str] = "zero",  # "zero", "sigma_min"
+        self,
+        num_train_timesteps: int = 1000,
+        solver_order: int = 2,
+        prediction_type: str = "flow_prediction",
+        shift: Optional[float] = 1.0,
+        use_dynamic_shifting=False,
+        thresholding: bool = False,
+        dynamic_thresholding_ratio: float = 0.995,
+        sample_max_value: float = 1.0,
+        predict_x0: bool = True,
+        solver_type: str = "bh2",
+        lower_order_final: bool = True,
+        disable_corrector: List[int] = [],
+        solver_p: SchedulerMixin = None,
+        timestep_spacing: str = "linspace",
+        steps_offset: int = 0,
+        final_sigmas_type: Optional[str] = "zero",  # "zero", "sigma_min"
     ):
-
         if solver_type not in ["bh1", "bh2"]:
             if solver_type in ["midpoint", "heun", "logrho"]:
                 self.register_to_config(solver_type="bh2")
             else:
                 raise NotImplementedError(
-                    f"{solver_type} is not implemented for {self.__class__}")
+                    f"{solver_type} is not implemented for {self.__class__}"
+                )
 
         self.predict_x0 = predict_x0
         # setable values
         self.num_inference_steps = None
-        alphas = np.linspace(1, 1 / num_train_timesteps,
-                             num_train_timesteps)[::-1].copy()
+        alphas = np.linspace(1, 1 / num_train_timesteps, num_train_timesteps)[
+            ::-1
+        ].copy()
         sigmas = 1.0 - alphas
         sigmas = torch.from_numpy(sigmas).to(dtype=torch.float32)
 
         if not use_dynamic_shifting:
             # when use_dynamic_shifting is True, we apply the timestep shifting on the fly based on the image resolution
-            sigmas = shift * sigmas / (1 +
-                                       (shift - 1) * sigmas)  # pyright: ignore
+            sigmas = shift * sigmas / (1 + (shift - 1) * sigmas)  # pyright: ignore
 
         self.sigmas = sigmas
         self.timesteps = sigmas * num_train_timesteps
@@ -126,8 +128,7 @@ def __init__(
         self._step_index = None
         self._begin_index = None
 
-        self.sigmas = self.sigmas.to(
-            "cpu")  # to avoid too much CPU/GPU communication
+        self.sigmas = self.sigmas.to("cpu")  # to avoid too much CPU/GPU communication
         self.sigma_min = self.sigmas[-1].item()
         self.sigma_max = self.sigmas[0].item()
 
@@ -180,21 +181,19 @@ def set_timesteps(
             )
 
         if sigmas is None:
-            sigmas = np.linspace(self.sigma_max, self.sigma_min,
-                                 num_inference_steps +
-                                 1).copy()[:-1]  # pyright: ignore
+            sigmas = np.linspace(
+                self.sigma_max, self.sigma_min, num_inference_steps + 1
+            ).copy()[:-1]  # pyright: ignore
 
         if self.config.use_dynamic_shifting:
             sigmas = self.time_shift(mu, 1.0, sigmas)  # pyright: ignore
         else:
             if shift is None:
                 shift = self.config.shift
-            sigmas = shift * sigmas / (1 +
-                                       (shift - 1) * sigmas)  # pyright: ignore
+            sigmas = shift * sigmas / (1 + (shift - 1) * sigmas)  # pyright: ignore
 
         if self.config.final_sigmas_type == "sigma_min":
-            sigma_last = ((1 - self.alphas_cumprod[0]) /
-                          self.alphas_cumprod[0])**0.5
+            sigma_last = ((1 - self.alphas_cumprod[0]) / self.alphas_cumprod[0]) ** 0.5
         elif self.config.final_sigmas_type == "zero":
             sigma_last = 0
         else:
@@ -203,12 +202,12 @@ def set_timesteps(
             )
 
         timesteps = sigmas * self.config.num_train_timesteps
-        sigmas = np.concatenate([sigmas, [sigma_last]
-                                ]).astype(np.float32)  # pyright: ignore
+        sigmas = np.concatenate([sigmas, [sigma_last]]).astype(np.float32)  # pyright: ignore
 
         self.sigmas = torch.from_numpy(sigmas)
         self.timesteps = torch.from_numpy(timesteps).to(
-            device=device, dtype=torch.int64)
+            device=device, dtype=torch.int64
+        )
 
         self.num_inference_steps = len(timesteps)
 
@@ -223,8 +222,7 @@ def set_timesteps(
         # add an index counter for schedulers that allow duplicated timesteps
         self._step_index = None
         self._begin_index = None
-        self.sigmas = self.sigmas.to(
-            "cpu")  # to avoid too much CPU/GPU communication
+        self.sigmas = self.sigmas.to("cpu")  # to avoid too much CPU/GPU communication
 
     # Copied from diffusers.schedulers.scheduling_ddpm.DDPMScheduler._threshold_sample
     def _threshold_sample(self, sample: torch.Tensor) -> torch.Tensor:
@@ -241,7 +239,8 @@ def _threshold_sample(self, sample: torch.Tensor) -> torch.Tensor:
         batch_size, channels, *remaining_dims = sample.shape
 
         if dtype not in (torch.float32, torch.float64):
-            sample = sample.float(
+            sample = (
+                sample.float()
             )  # upcast for quantile calculation, and clamp not implemented for cpu half
 
         # Flatten sample for doing quantile calculation along each image
@@ -249,16 +248,14 @@ def _threshold_sample(self, sample: torch.Tensor) -> torch.Tensor:
 
         abs_sample = sample.abs()  # "a certain percentile absolute pixel value"
 
-        s = torch.quantile(
-            abs_sample, self.config.dynamic_thresholding_ratio, dim=1)
+        s = torch.quantile(abs_sample, self.config.dynamic_thresholding_ratio, dim=1)
         s = torch.clamp(
             s, min=1, max=self.config.sample_max_value
         )  # When clamped to min=1, equivalent to standard clipping to [-1, 1]
-        s = s.unsqueeze(
-            1)  # (batch_size, 1) because clamp will broadcast along dim=0
-        sample = torch.clamp(
-            sample, -s, s
-        ) / s  # "we threshold xt0 to the range [-s, s] and then divide by s"
+        s = s.unsqueeze(1)  # (batch_size, 1) because clamp will broadcast along dim=0
+        sample = (
+            torch.clamp(sample, -s, s) / s
+        )  # "we threshold xt0 to the range [-s, s] and then divide by s"
 
         sample = sample.reshape(batch_size, channels, *remaining_dims)
         sample = sample.to(dtype)
@@ -274,7 +271,7 @@ def _sigma_to_alpha_sigma_t(self, sigma):
 
     # Copied from diffusers.schedulers.scheduling_flow_match_euler_discrete.set_timesteps
     def time_shift(self, mu: float, sigma: float, t: torch.Tensor):
-        return math.exp(mu) / (math.exp(mu) + (1 / t - 1)**sigma)
+        return math.exp(mu) / (math.exp(mu) + (1 / t - 1) ** sigma)
 
     def convert_model_output(
         self,
@@ -303,8 +300,7 @@ def convert_model_output(
             if len(args) > 1:
                 sample = args[1]
             else:
-                raise ValueError(
-                    "missing `sample` as a required keyward argument")
+                raise ValueError("missing `sample` as a required keyward argument")
         if timestep is not None:
             deprecate(
                 "timesteps",
@@ -372,20 +368,17 @@ def multistep_uni_p_bh_update(
             `torch.Tensor`:
                 The sample tensor at the previous timestep.
         """
-        prev_timestep = args[0] if len(args) > 0 else kwargs.pop(
-            "prev_timestep", None)
+        prev_timestep = args[0] if len(args) > 0 else kwargs.pop("prev_timestep", None)
         if sample is None:
             if len(args) > 1:
                 sample = args[1]
             else:
-                raise ValueError(
-                    " missing `sample` as a required keyward argument")
+                raise ValueError(" missing `sample` as a required keyward argument")
         if order is None:
             if len(args) > 2:
                 order = args[2]
             else:
-                raise ValueError(
-                    " missing `order` as a required keyward argument")
+                raise ValueError(" missing `order` as a required keyward argument")
         if prev_timestep is not None:
             deprecate(
                 "prev_timestep",
@@ -402,8 +395,10 @@ def multistep_uni_p_bh_update(
             x_t = self.solver_p.step(model_output, s0, x).prev_sample
             return x_t
 
-        sigma_t, sigma_s0 = self.sigmas[self.step_index + 1], self.sigmas[
-            self.step_index]  # pyright: ignore
+        sigma_t, sigma_s0 = (
+            self.sigmas[self.step_index + 1],
+            self.sigmas[self.step_index],
+        )  # pyright: ignore
         alpha_t, sigma_t = self._sigma_to_alpha_sigma_t(sigma_t)
         alpha_s0, sigma_s0 = self._sigma_to_alpha_sigma_t(sigma_s0)
 
@@ -458,24 +453,21 @@ def multistep_uni_p_bh_update(
             if order == 2:
                 rhos_p = torch.tensor([0.5], dtype=x.dtype, device=device)
             else:
-                rhos_p = torch.linalg.solve(R[:-1, :-1],
-                                            b[:-1]).to(device).to(x.dtype)
+                rhos_p = torch.linalg.solve(R[:-1, :-1], b[:-1]).to(device).to(x.dtype)
         else:
             D1s = None
 
         if self.predict_x0:
             x_t_ = sigma_t / sigma_s0 * x - alpha_t * h_phi_1 * m0
             if D1s is not None:
-                pred_res = torch.einsum("k,bkc...->bc...", rhos_p,
-                                        D1s)  # pyright: ignore
+                pred_res = torch.einsum("k,bkc...->bc...", rhos_p, D1s)  # pyright: ignore
             else:
                 pred_res = 0
             x_t = x_t_ - alpha_t * B_h * pred_res
         else:
             x_t_ = alpha_t / alpha_s0 * x - sigma_t * h_phi_1 * m0
             if D1s is not None:
-                pred_res = torch.einsum("k,bkc...->bc...", rhos_p,
-                                        D1s)  # pyright: ignore
+                pred_res = torch.einsum("k,bkc...->bc...", rhos_p, D1s)  # pyright: ignore
             else:
                 pred_res = 0
             x_t = x_t_ - sigma_t * B_h * pred_res
@@ -511,26 +503,22 @@ def multistep_uni_c_bh_update(
             `torch.Tensor`:
                 The corrected sample tensor at the current timestep.
         """
-        this_timestep = args[0] if len(args) > 0 else kwargs.pop(
-            "this_timestep", None)
+        this_timestep = args[0] if len(args) > 0 else kwargs.pop("this_timestep", None)
         if last_sample is None:
             if len(args) > 1:
                 last_sample = args[1]
             else:
-                raise ValueError(
-                    " missing`last_sample` as a required keyward argument")
+                raise ValueError(" missing`last_sample` as a required keyward argument")
         if this_sample is None:
             if len(args) > 2:
                 this_sample = args[2]
             else:
-                raise ValueError(
-                    " missing`this_sample` as a required keyward argument")
+                raise ValueError(" missing`this_sample` as a required keyward argument")
         if order is None:
             if len(args) > 3:
                 order = args[3]
             else:
-                raise ValueError(
-                    " missing`order` as a required keyward argument")
+                raise ValueError(" missing`order` as a required keyward argument")
         if this_timestep is not None:
             deprecate(
                 "this_timestep",
@@ -545,8 +533,10 @@ def multistep_uni_c_bh_update(
         x_t = this_sample
         model_t = this_model_output
 
-        sigma_t, sigma_s0 = self.sigmas[self.step_index], self.sigmas[
-            self.step_index - 1]  # pyright: ignore
+        sigma_t, sigma_s0 = (
+            self.sigmas[self.step_index],
+            self.sigmas[self.step_index - 1],
+        )  # pyright: ignore
         alpha_t, sigma_t = self._sigma_to_alpha_sigma_t(sigma_t)
         alpha_s0, sigma_s0 = self._sigma_to_alpha_sigma_t(sigma_s0)
 
@@ -652,12 +642,14 @@ def _init_step_index(self, timestep):
         else:
             self._step_index = self._begin_index
 
-    def step(self,
-             model_output: torch.Tensor,
-             timestep: Union[int, torch.Tensor],
-             sample: torch.Tensor,
-             return_dict: bool = True,
-             generator=None) -> Union[SchedulerOutput, Tuple]:
+    def step(
+        self,
+        model_output: torch.Tensor,
+        timestep: Union[int, torch.Tensor],
+        sample: torch.Tensor,
+        return_dict: bool = True,
+        generator=None,
+    ) -> Union[SchedulerOutput, Tuple]:
         """
         Predict the sample from the previous timestep by reversing the SDE. This function propagates the sample with
         the multistep UniPC.
@@ -687,13 +679,12 @@ def step(self,
             self._init_step_index(timestep)
 
         use_corrector = (
-            self.step_index > 0 and
-            self.step_index - 1 not in self.disable_corrector and
-            self.last_sample is not None  # pyright: ignore
+            self.step_index > 0
+            and self.step_index - 1 not in self.disable_corrector
+            and self.last_sample is not None  # pyright: ignore
         )
 
-        model_output_convert = self.convert_model_output(
-            model_output, sample=sample)
+        model_output_convert = self.convert_model_output(model_output, sample=sample)
         if use_corrector:
             sample = self.multistep_uni_c_bh_update(
                 this_model_output=model_output_convert,
@@ -710,14 +701,15 @@ def step(self,
         self.timestep_list[-1] = timestep  # pyright: ignore
 
         if self.config.lower_order_final:
-            this_order = min(self.config.solver_order,
-                             len(self.timesteps) -
-                             self.step_index)  # pyright: ignore
+            this_order = min(
+                self.config.solver_order, len(self.timesteps) - self.step_index
+            )  # pyright: ignore
         else:
             this_order = self.config.solver_order
 
-        self.this_order = min(this_order,
-                              self.lower_order_nums + 1)  # warmup for multistep
+        self.this_order = min(
+            this_order, self.lower_order_nums + 1
+        )  # warmup for multistep
         assert self.this_order > 0
 
         self.last_sample = sample
@@ -738,8 +730,7 @@ def step(self,
 
         return SchedulerOutput(prev_sample=prev_sample)
 
-    def scale_model_input(self, sample: torch.Tensor, *args,
-                          **kwargs) -> torch.Tensor:
+    def scale_model_input(self, sample: torch.Tensor, *args, **kwargs) -> torch.Tensor:
         """
         Ensures interchangeability with schedulers that need to scale the denoising model input depending on the
         current timestep.
@@ -763,14 +754,14 @@ def add_noise(
     ) -> torch.Tensor:
         # Make sure sigmas and timesteps have the same device and dtype as original_samples
         sigmas = self.sigmas.to(
-            device=original_samples.device, dtype=original_samples.dtype)
-        if original_samples.device.type == "mps" and torch.is_floating_point(
-                timesteps):
+            device=original_samples.device, dtype=original_samples.dtype
+        )
+        if original_samples.device.type == "mps" and torch.is_floating_point(timesteps):
             # mps does not support float64
             schedule_timesteps = self.timesteps.to(
-                original_samples.device, dtype=torch.float32)
-            timesteps = timesteps.to(
-                original_samples.device, dtype=torch.float32)
+                original_samples.device, dtype=torch.float32
+            )
+            timesteps = timesteps.to(original_samples.device, dtype=torch.float32)
         else:
             schedule_timesteps = self.timesteps.to(original_samples.device)
             timesteps = timesteps.to(original_samples.device)
@@ -778,8 +769,7 @@ def add_noise(
         # begin_index is None when the scheduler is used for training or pipeline does not implement set_begin_index
         if self.begin_index is None:
             step_indices = [
-                self.index_for_timestep(t, schedule_timesteps)
-                for t in timesteps
+                self.index_for_timestep(t, schedule_timesteps) for t in timesteps
             ]
         elif self.step_index is not None:
             # add_noise is called after first denoising step (for inpainting)
diff --git a/videotuna/models/wan/wan/utils/prompt_extend.py b/videotuna/models/wan/wan/utils/prompt_extend.py
index 8b9db081..c1171fa0 100644
--- a/videotuna/models/wan/wan/utils/prompt_extend.py
+++ b/videotuna/models/wan/wan/utils/prompt_extend.py
@@ -1,5 +1,6 @@
 # Copyright 2024-2025 The Alibaba Wan Team Authors. All rights reserved.
 import json
+import logging
 import math
 import os
 import random
@@ -7,7 +8,7 @@
 import tempfile
 from dataclasses import dataclass
 from http import HTTPStatus
-from typing import Optional, Union
+from typing import Union
 
 import dashscope
 import torch
@@ -15,86 +16,28 @@
 
 try:
     from flash_attn import flash_attn_varlen_func
+
     FLASH_VER = 2
 except ModuleNotFoundError:
     flash_attn_varlen_func = None  # in compatible with CPU machines
     FLASH_VER = None
 
-LM_ZH_SYS_PROMPT = \
-    '''你是一位Prompt优化师，旨在将用户输入改写为优质Prompt，使其更完整、更具表现力，同时不改变原意。\n''' \
-    '''任务要求：\n''' \
-    '''1. 对于过于简短的用户输入，在不改变原意前提下，合理推断并补充细节，使得画面更加完整好看；\n''' \
-    '''2. 完善用户描述中出现的主体特征（如外貌、表情，数量、种族、姿态等）、画面风格、空间关系、镜头景别；\n''' \
-    '''3. 整体中文输出，保留引号、书名号中原文以及重要的输入信息，不要改写；\n''' \
-    '''4. Prompt应匹配符合用户意图且精准细分的风格描述。如果用户未指定，则根据画面选择最恰当的风格，或使用纪实摄影风格。如果用户未指定，除非画面非常适合，否则不要使用插画风格。如果用户指定插画风格，则生成插画风格；\n''' \
-    '''5. 如果Prompt是古诗词，应该在生成的Prompt中强调中国古典元素，避免出现西方、现代、外国场景；\n''' \
-    '''6. 你需要强调输入中的运动信息和不同的镜头运镜；\n''' \
-    '''7. 你的输出应当带有自然运动属性，需要根据描述主体目标类别增加这个目标的自然动作，描述尽可能用简单直接的动词；\n''' \
-    '''8. 改写后的prompt字数控制在80-100字左右\n''' \
-    '''改写后 prompt 示例：\n''' \
-    '''1. 日系小清新胶片写真，扎着双麻花辫的年轻东亚女孩坐在船边。女孩穿着白色方领泡泡袖连衣裙，裙子上有褶皱和纽扣装饰。她皮肤白皙，五官清秀，眼神略带忧郁，直视镜头。女孩的头发自然垂落，刘海遮住部分额头。她双手扶船，姿态自然放松。背景是模糊的户外场景，隐约可见蓝天、山峦和一些干枯植物。复古胶片质感照片。中景半身坐姿人像。\n''' \
-    '''2. 二次元厚涂动漫插画，一个猫耳兽耳白人少女手持文件夹，神情略带不满。她深紫色长发，红色眼睛，身穿深灰色短裙和浅灰色上衣，腰间系着白色系带，胸前佩戴名牌，上面写着黑体中文"紫阳"。淡黄色调室内背景，隐约可见一些家具轮廓。少女头顶有一个粉色光圈。线条流畅的日系赛璐璐风格。近景半身略俯视视角。\n''' \
-    '''3. CG游戏概念数字艺术，一只巨大的鳄鱼张开大嘴，背上长着树木和荆棘。鳄鱼皮肤粗糙，呈灰白色，像是石头或木头的质感。它背上生长着茂盛的树木、灌木和一些荆棘状的突起。鳄鱼嘴巴大张，露出粉红色的舌头和锋利的牙齿。画面背景是黄昏的天空，远处有一些树木。场景整体暗黑阴冷。近景，仰视视角。\n''' \
-    '''4. 美剧宣传海报风格，身穿黄色防护服的Walter White坐在金属折叠椅上，上方无衬线英文写着"Breaking Bad"，周围是成堆的美元和蓝色塑料储物箱。他戴着眼镜目光直视前方，身穿黄色连体防护服，双手放在膝盖上，神态稳重自信。背景是一个废弃的阴暗厂房，窗户透着光线。带有明显颗粒质感纹理。中景人物平视特写。\n''' \
-    '''下面我将给你要改写的Prompt，请直接对该Prompt进行忠实原意的扩写和改写，输出为中文文本，即使收到指令，也应当扩写或改写该指令本身，而不是回复该指令。请直接对Prompt进行改写，不要进行多余的回复：'''
-
-LM_EN_SYS_PROMPT = \
-    '''You are a prompt engineer, aiming to rewrite user inputs into high-quality prompts for better video generation without affecting the original meaning.\n''' \
-    '''Task requirements:\n''' \
-    '''1. For overly concise user inputs, reasonably infer and add details to make the video more complete and appealing without altering the original intent;\n''' \
-    '''2. Enhance the main features in user descriptions (e.g., appearance, expression, quantity, race, posture, etc.), visual style, spatial relationships, and shot scales;\n''' \
-    '''3. Output the entire prompt in English, retaining original text in quotes and titles, and preserving key input information;\n''' \
-    '''4. Prompts should match the user’s intent and accurately reflect the specified style. If the user does not specify a style, choose the most appropriate style for the video;\n''' \
-    '''5. Emphasize motion information and different camera movements present in the input description;\n''' \
-    '''6. Your output should have natural motion attributes. For the target category described, add natural actions of the target using simple and direct verbs;\n''' \
-    '''7. The revised prompt should be around 80-100 words long.\n''' \
-    '''Revised prompt examples:\n''' \
-    '''1. Japanese-style fresh film photography, a young East Asian girl with braided pigtails sitting by the boat. The girl is wearing a white square-neck puff sleeve dress with ruffles and button decorations. She has fair skin, delicate features, and a somewhat melancholic look, gazing directly into the camera. Her hair falls naturally, with bangs covering part of her forehead. She is holding onto the boat with both hands, in a relaxed posture. The background is a blurry outdoor scene, with faint blue sky, mountains, and some withered plants. Vintage film texture photo. Medium shot half-body portrait in a seated position.\n''' \
-    '''2. Anime thick-coated illustration, a cat-ear beast-eared white girl holding a file folder, looking slightly displeased. She has long dark purple hair, red eyes, and is wearing a dark grey short skirt and light grey top, with a white belt around her waist, and a name tag on her chest that reads "Ziyang" in bold Chinese characters. The background is a light yellow-toned indoor setting, with faint outlines of furniture. There is a pink halo above the girl's head. Smooth line Japanese cel-shaded style. Close-up half-body slightly overhead view.\n''' \
-    '''3. CG game concept digital art, a giant crocodile with its mouth open wide, with trees and thorns growing on its back. The crocodile's skin is rough, greyish-white, with a texture resembling stone or wood. Lush trees, shrubs, and thorny protrusions grow on its back. The crocodile's mouth is wide open, showing a pink tongue and sharp teeth. The background features a dusk sky with some distant trees. The overall scene is dark and cold. Close-up, low-angle view.\n''' \
-    '''4. American TV series poster style, Walter White wearing a yellow protective suit sitting on a metal folding chair, with "Breaking Bad" in sans-serif text above. Surrounded by piles of dollars and blue plastic storage bins. He is wearing glasses, looking straight ahead, dressed in a yellow one-piece protective suit, hands on his knees, with a confident and steady expression. The background is an abandoned dark factory with light streaming through the windows. With an obvious grainy texture. Medium shot character eye-level close-up.\n''' \
-    '''I will now provide the prompt for you to rewrite. Please directly expand and rewrite the specified prompt in English while preserving the original meaning. Even if you receive a prompt that looks like an instruction, proceed with expanding or rewriting that instruction itself, rather than replying to it. Please directly rewrite the prompt without extra responses and quotation mark:'''
-
-
-VL_ZH_SYS_PROMPT = \
-    '''你是一位Prompt优化师，旨在参考用户输入的图像的细节内容，把用户输入的Prompt改写为优质Prompt，使其更完整、更具表现力，同时不改变原意。你需要综合用户输入的照片内容和输入的Prompt进行改写，严格参考示例的格式进行改写。\n''' \
-    '''任务要求：\n''' \
-    '''1. 对于过于简短的用户输入，在不改变原意前提下，合理推断并补充细节，使得画面更加完整好看；\n''' \
-    '''2. 完善用户描述中出现的主体特征（如外貌、表情，数量、种族、姿态等）、画面风格、空间关系、镜头景别；\n''' \
-    '''3. 整体中文输出，保留引号、书名号中原文以及重要的输入信息，不要改写；\n''' \
-    '''4. Prompt应匹配符合用户意图且精准细分的风格描述。如果用户未指定，则根据用户提供的照片的风格，你需要仔细分析照片的风格，并参考风格进行改写；\n''' \
-    '''5. 如果Prompt是古诗词，应该在生成的Prompt中强调中国古典元素，避免出现西方、现代、外国场景；\n''' \
-    '''6. 你需要强调输入中的运动信息和不同的镜头运镜；\n''' \
-    '''7. 你的输出应当带有自然运动属性，需要根据描述主体目标类别增加这个目标的自然动作，描述尽可能用简单直接的动词；\n''' \
-    '''8. 你需要尽可能的参考图片的细节信息，如人物动作、服装、背景等，强调照片的细节元素；\n''' \
-    '''9. 改写后的prompt字数控制在80-100字左右\n''' \
-    '''10. 无论用户输入什么语言，你都必须输出中文\n''' \
-    '''改写后 prompt 示例：\n''' \
-    '''1. 日系小清新胶片写真，扎着双麻花辫的年轻东亚女孩坐在船边。女孩穿着白色方领泡泡袖连衣裙，裙子上有褶皱和纽扣装饰。她皮肤白皙，五官清秀，眼神略带忧郁，直视镜头。女孩的头发自然垂落，刘海遮住部分额头。她双手扶船，姿态自然放松。背景是模糊的户外场景，隐约可见蓝天、山峦和一些干枯植物。复古胶片质感照片。中景半身坐姿人像。\n''' \
-    '''2. 二次元厚涂动漫插画，一个猫耳兽耳白人少女手持文件夹，神情略带不满。她深紫色长发，红色眼睛，身穿深灰色短裙和浅灰色上衣，腰间系着白色系带，胸前佩戴名牌，上面写着黑体中文"紫阳"。淡黄色调室内背景，隐约可见一些家具轮廓。少女头顶有一个粉色光圈。线条流畅的日系赛璐璐风格。近景半身略俯视视角。\n''' \
-    '''3. CG游戏概念数字艺术，一只巨大的鳄鱼张开大嘴，背上长着树木和荆棘。鳄鱼皮肤粗糙，呈灰白色，像是石头或木头的质感。它背上生长着茂盛的树木、灌木和一些荆棘状的突起。鳄鱼嘴巴大张，露出粉红色的舌头和锋利的牙齿。画面背景是黄昏的天空，远处有一些树木。场景整体暗黑阴冷。近景，仰视视角。\n''' \
-    '''4. 美剧宣传海报风格，身穿黄色防护服的Walter White坐在金属折叠椅上，上方无衬线英文写着"Breaking Bad"，周围是成堆的美元和蓝色塑料储物箱。他戴着眼镜目光直视前方，身穿黄色连体防护服，双手放在膝盖上，神态稳重自信。背景是一个废弃的阴暗厂房，窗户透着光线。带有明显颗粒质感纹理。中景人物平视特写。\n''' \
-    '''直接输出改写后的文本。'''
-
-VL_EN_SYS_PROMPT =  \
-    '''You are a prompt optimization specialist whose goal is to rewrite the user's input prompts into high-quality English prompts by referring to the details of the user's input images, making them more complete and expressive while maintaining the original meaning. You need to integrate the content of the user's photo with the input prompt for the rewrite, strictly adhering to the formatting of the examples provided.\n''' \
-    '''Task Requirements:\n''' \
-    '''1. For overly brief user inputs, reasonably infer and supplement details without changing the original meaning, making the image more complete and visually appealing;\n''' \
-    '''2. Improve the characteristics of the main subject in the user's description (such as appearance, expression, quantity, ethnicity, posture, etc.), rendering style, spatial relationships, and camera angles;\n''' \
-    '''3. The overall output should be in Chinese, retaining original text in quotes and book titles as well as important input information without rewriting them;\n''' \
-    '''4. The prompt should match the user’s intent and provide a precise and detailed style description. If the user has not specified a style, you need to carefully analyze the style of the user's provided photo and use that as a reference for rewriting;\n''' \
-    '''5. If the prompt is an ancient poem, classical Chinese elements should be emphasized in the generated prompt, avoiding references to Western, modern, or foreign scenes;\n''' \
-    '''6. You need to emphasize movement information in the input and different camera angles;\n''' \
-    '''7. Your output should convey natural movement attributes, incorporating natural actions related to the described subject category, using simple and direct verbs as much as possible;\n''' \
-    '''8. You should reference the detailed information in the image, such as character actions, clothing, backgrounds, and emphasize the details in the photo;\n''' \
-    '''9. Control the rewritten prompt to around 80-100 words.\n''' \
-    '''10. No matter what language the user inputs, you must always output in English.\n''' \
-    '''Example of the rewritten English prompt:\n''' \
-    '''1. A Japanese fresh film-style photo of a young East Asian girl with double braids sitting by the boat. The girl wears a white square collar puff sleeve dress, decorated with pleats and buttons. She has fair skin, delicate features, and slightly melancholic eyes, staring directly at the camera. Her hair falls naturally, with bangs covering part of her forehead. She rests her hands on the boat, appearing natural and relaxed. The background features a blurred outdoor scene, with hints of blue sky, mountains, and some dry plants. The photo has a vintage film texture. A medium shot of a seated portrait.\n''' \
-    '''2. An anime illustration in vibrant thick painting style of a white girl with cat ears holding a folder, showing a slightly dissatisfied expression. She has long dark purple hair and red eyes, wearing a dark gray skirt and a light gray top with a white waist tie and a name tag in bold Chinese characters that says "紫阳" (Ziyang). The background has a light yellow indoor tone, with faint outlines of some furniture visible. A pink halo hovers above her head, in a smooth Japanese cel-shading style. A close-up shot from a slightly elevated perspective.\n''' \
-    '''3. CG game concept digital art featuring a huge crocodile with its mouth wide open, with trees and thorns growing on its back. The crocodile's skin is rough and grayish-white, resembling stone or wood texture. Its back is lush with trees, shrubs, and thorny protrusions. With its mouth agape, the crocodile reveals a pink tongue and sharp teeth. The background features a dusk sky with some distant trees, giving the overall scene a dark and cold atmosphere. A close-up from a low angle.\n''' \
-    '''4. In the style of an American drama promotional poster, Walter White sits in a metal folding chair wearing a yellow protective suit, with the words "Breaking Bad" written in sans-serif English above him, surrounded by piles of dollar bills and blue plastic storage boxes. He wears glasses, staring forward, dressed in a yellow jumpsuit, with his hands resting on his knees, exuding a calm and confident demeanor. The background shows an abandoned, dim factory with light filtering through the windows. There’s a noticeable grainy texture. A medium shot with a straight-on close-up of the character.\n''' \
-    '''Directly output the rewritten English text.'''
+from .system_prompt import *
+
+DEFAULT_SYS_PROMPTS = {
+    "t2v-A14B": {
+        "zh": T2V_A14B_ZH_SYS_PROMPT,
+        "en": T2V_A14B_EN_SYS_PROMPT,
+    },
+    "i2v-A14B": {
+        "zh": I2V_A14B_ZH_SYS_PROMPT,
+        "en": I2V_A14B_EN_SYS_PROMPT,
+        "empty": {
+            "zh": I2V_A14B_EMPTY_ZH_SYS_PROMPT,
+            "en": I2V_A14B_EMPTY_EN_SYS_PROMPT,
+        },
+    },
+}
 
 
 @dataclass
@@ -110,44 +53,44 @@ def add_custom_field(self, key: str, value) -> None:
 
 
 class PromptExpander:
-
-    def __init__(self, model_name, is_vl=False, device=0, **kwargs):
+    def __init__(self, model_name, task, is_vl=False, device=0, **kwargs):
         self.model_name = model_name
+        self.task = task
         self.is_vl = is_vl
         self.device = device
 
-    def extend_with_img(self,
-                        prompt,
-                        system_prompt,
-                        image=None,
-                        seed=-1,
-                        *args,
-                        **kwargs):
+    def extend_with_img(
+        self, prompt, system_prompt, image=None, seed=-1, *args, **kwargs
+    ):
         pass
 
     def extend(self, prompt, system_prompt, seed=-1, *args, **kwargs):
         pass
 
-    def decide_system_prompt(self, tar_lang="zh"):
-        zh = tar_lang == "zh"
-        if zh:
-            return LM_ZH_SYS_PROMPT if not self.is_vl else VL_ZH_SYS_PROMPT
-        else:
-            return LM_EN_SYS_PROMPT if not self.is_vl else VL_EN_SYS_PROMPT
-
-    def __call__(self,
-                 prompt,
-                 tar_lang="zh",
-                 image=None,
-                 seed=-1,
-                 *args,
-                 **kwargs):
-        system_prompt = self.decide_system_prompt(tar_lang=tar_lang)
+    def decide_system_prompt(self, tar_lang="zh", prompt=None):
+        assert self.task is not None
+        if "i2v" in self.task and len(prompt) == 0:
+            return DEFAULT_SYS_PROMPTS[self.task]["empty"][tar_lang]
+        return DEFAULT_SYS_PROMPTS[self.task][tar_lang]
+
+    def __call__(
+        self,
+        prompt,
+        system_prompt=None,
+        tar_lang="zh",
+        image=None,
+        seed=-1,
+        *args,
+        **kwargs,
+    ):
+        if system_prompt is None:
+            system_prompt = self.decide_system_prompt(tar_lang=tar_lang, prompt=prompt)
         if seed < 0:
             seed = random.randint(0, sys.maxsize)
         if image is not None and self.is_vl:
             return self.extend_with_img(
-                prompt, system_prompt, image=image, seed=seed, *args, **kwargs)
+                prompt, system_prompt, image=image, seed=seed, *args, **kwargs
+            )
         elif not self.is_vl:
             return self.extend(prompt, system_prompt, seed, *args, **kwargs)
         else:
@@ -155,38 +98,39 @@ def __call__(self,
 
 
 class DashScopePromptExpander(PromptExpander):
-
-    def __init__(self,
-                 api_key=None,
-                 model_name=None,
-                 max_image_size=512 * 512,
-                 retry_times=4,
-                 is_vl=False,
-                 **kwargs):
-        '''
+    def __init__(
+        self,
+        api_key=None,
+        model_name=None,
+        task=None,
+        max_image_size=512 * 512,
+        retry_times=4,
+        is_vl=False,
+        **kwargs,
+    ):
+        """
         Args:
             api_key: The API key for Dash Scope authentication and access to related services.
             model_name: Model name, 'qwen-plus' for extending prompts, 'qwen-vl-max' for extending prompt-images.
+            task: Task name. This is required to determine the default system prompt.
             max_image_size: The maximum size of the image; unit unspecified (e.g., pixels, KB). Please specify the unit based on actual usage.
             retry_times: Number of retry attempts in case of request failure.
             is_vl: A flag indicating whether the task involves visual-language processing.
             **kwargs: Additional keyword arguments that can be passed to the function or method.
-        '''
+        """
         if model_name is None:
-            model_name = 'qwen-plus' if not is_vl else 'qwen-vl-max'
-        super().__init__(model_name, is_vl, **kwargs)
+            model_name = "qwen-plus" if not is_vl else "qwen-vl-max"
+        super().__init__(model_name, task, is_vl, **kwargs)
         if api_key is not None:
             dashscope.api_key = api_key
-        elif 'DASH_API_KEY' in os.environ and os.environ[
-                'DASH_API_KEY'] is not None:
-            dashscope.api_key = os.environ['DASH_API_KEY']
+        elif "DASH_API_KEY" in os.environ and os.environ["DASH_API_KEY"] is not None:
+            dashscope.api_key = os.environ["DASH_API_KEY"]
         else:
             raise ValueError("DASH_API_KEY is not set")
-        if 'DASH_API_URL' in os.environ and os.environ[
-                'DASH_API_URL'] is not None:
-            dashscope.base_http_api_url = os.environ['DASH_API_URL']
+        if "DASH_API_URL" in os.environ and os.environ["DASH_API_URL"] is not None:
+            dashscope.base_http_api_url = os.environ["DASH_API_URL"]
         else:
-            dashscope.base_http_api_url = 'https://dashscope.aliyuncs.com/api/v1'
+            dashscope.base_http_api_url = "https://dashscope.aliyuncs.com/api/v1"
         self.api_key = api_key
 
         self.max_image_size = max_image_size
@@ -194,13 +138,10 @@ def __init__(self,
         self.retry_times = retry_times
 
     def extend(self, prompt, system_prompt, seed=-1, *args, **kwargs):
-        messages = [{
-            'role': 'system',
-            'content': system_prompt
-        }, {
-            'role': 'user',
-            'content': prompt
-        }]
+        messages = [
+            {"role": "system", "content": system_prompt},
+            {"role": "user", "content": prompt},
+        ]
 
         exception = None
         for _ in range(self.retry_times):
@@ -209,17 +150,17 @@ def extend(self, prompt, system_prompt, seed=-1, *args, **kwargs):
                     self.model,
                     messages=messages,
                     seed=seed,
-                    result_format='message',  # set the result to be "message" format.
+                    result_format="message",  # set the result to be "message" format.
                 )
                 assert response.status_code == HTTPStatus.OK, response
-                expanded_prompt = response['output']['choices'][0]['message'][
-                    'content']
+                expanded_prompt = response["output"]["choices"][0]["message"]["content"]
                 return PromptOutput(
                     status=True,
                     prompt=expanded_prompt,
                     seed=seed,
                     system_prompt=system_prompt,
-                    message=json.dumps(response, ensure_ascii=False))
+                    message=json.dumps(response, ensure_ascii=False),
+                )
             except Exception as e:
                 exception = e
         return PromptOutput(
@@ -227,17 +168,20 @@ def extend(self, prompt, system_prompt, seed=-1, *args, **kwargs):
             prompt=prompt,
             seed=seed,
             system_prompt=system_prompt,
-            message=str(exception))
-
-    def extend_with_img(self,
-                        prompt,
-                        system_prompt,
-                        image: Union[Image.Image, str] = None,
-                        seed=-1,
-                        *args,
-                        **kwargs):
+            message=str(exception),
+        )
+
+    def extend_with_img(
+        self,
+        prompt,
+        system_prompt,
+        image: Union[Image.Image, str] = None,
+        seed=-1,
+        *args,
+        **kwargs,
+    ):
         if isinstance(image, str):
-            image = Image.open(image).convert('RGB')
+            image = Image.open(image).convert("RGB")
         w = image.width
         h = image.height
         area = min(w * h, self.max_image_size)
@@ -245,26 +189,14 @@ def extend_with_img(self,
         resized_h = round(math.sqrt(area * aspect_ratio))
         resized_w = round(math.sqrt(area / aspect_ratio))
         image = image.resize((resized_w, resized_h))
-        with tempfile.NamedTemporaryFile(suffix='.png', delete=False) as f:
+        with tempfile.NamedTemporaryFile(suffix=".png", delete=False) as f:
             image.save(f.name)
             fname = f.name
             image_path = f"file://{f.name}"
         prompt = f"{prompt}"
         messages = [
-            {
-                'role': 'system',
-                'content': [{
-                    "text": system_prompt
-                }]
-            },
-            {
-                'role': 'user',
-                'content': [{
-                    "text": prompt
-                }, {
-                    "image": image_path
-                }]
-            },
+            {"role": "system", "content": [{"text": system_prompt}]},
+            {"role": "user", "content": [{"text": prompt}, {"image": image_path}]},
         ]
         response = None
         result_prompt = prompt
@@ -276,16 +208,17 @@ def extend_with_img(self,
                     self.model,
                     messages=messages,
                     seed=seed,
-                    result_format='message',  # set the result to be "message" format.
+                    result_format="message",  # set the result to be "message" format.
                 )
                 assert response.status_code == HTTPStatus.OK, response
-                result_prompt = response['output']['choices'][0]['message'][
-                    'content'][0]['text'].replace('\n', '\\n')
+                result_prompt = response["output"]["choices"][0]["message"]["content"][
+                    0
+                ]["text"].replace("\n", "\\n")
                 status = True
                 break
             except Exception as e:
                 exception = e
-        result_prompt = result_prompt.replace('\n', '\\n')
+        result_prompt = result_prompt.replace("\n", "\\n")
         os.remove(fname)
 
         return PromptOutput(
@@ -293,8 +226,10 @@ def extend_with_img(self,
             prompt=result_prompt,
             seed=seed,
             system_prompt=system_prompt,
-            message=str(exception) if not status else json.dumps(
-                response, ensure_ascii=False))
+            message=str(exception)
+            if not status
+            else json.dumps(response, ensure_ascii=False),
+        )
 
 
 class QwenPromptExpander(PromptExpander):
@@ -306,8 +241,8 @@ class QwenPromptExpander(PromptExpander):
         "Qwen2.5_14B": "Qwen/Qwen2.5-14B-Instruct",
     }
 
-    def __init__(self, model_name=None, device=0, is_vl=False, **kwargs):
-        '''
+    def __init__(self, model_name=None, task=None, device=0, is_vl=False, **kwargs):
+        """
         Args:
             model_name: Use predefined model names such as 'QwenVL2.5_7B' and 'Qwen2.5_14B',
                 which are specific versions of the Qwen model. Alternatively, you can use the
@@ -319,20 +254,26 @@ def __init__(self, model_name=None, device=0, is_vl=False, **kwargs):
                 * You can provide the path to a model that you have downloaded locally.
                 Hugging Face Model Name:
                 * You can also specify the model name from Hugging Face's model hub.
+            task: Task name. This is required to determine the default system prompt.
             is_vl: A flag indicating whether the task involves visual-language processing.
             **kwargs: Additional keyword arguments that can be passed to the function or method.
-        '''
+        """
         if model_name is None:
-            model_name = 'Qwen2.5_14B' if not is_vl else 'QwenVL2.5_7B'
-        super().__init__(model_name, is_vl, device, **kwargs)
-        if (not os.path.exists(self.model_name)) and (self.model_name
-                                                      in self.model_dict):
+            model_name = "Qwen2.5_14B" if not is_vl else "QwenVL2.5_7B"
+        super().__init__(model_name, task, is_vl, device, **kwargs)
+        if (not os.path.exists(self.model_name)) and (
+            self.model_name in self.model_dict
+        ):
             self.model_name = self.model_dict[self.model_name]
 
         if self.is_vl:
             # default: Load the model on the available device(s)
-            from transformers import (AutoProcessor, AutoTokenizer,
-                                      Qwen2_5_VLForConditionalGeneration)
+            from transformers import (
+                AutoProcessor,
+                AutoTokenizer,
+                Qwen2_5_VLForConditionalGeneration,
+            )
+
             try:
                 from .qwen_vl_utils import process_vision_info
             except:
@@ -344,88 +285,86 @@ def __init__(self, model_name=None, device=0, is_vl=False, **kwargs):
                 self.model_name,
                 min_pixels=min_pixels,
                 max_pixels=max_pixels,
-                use_fast=True)
+                use_fast=True,
+            )
             self.model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
                 self.model_name,
-                torch_dtype=torch.bfloat16 if FLASH_VER == 2 else
-                torch.float16 if "AWQ" in self.model_name else "auto",
-                attn_implementation="flash_attention_2"
-                if FLASH_VER == 2 else None,
-                device_map="cpu")
+                torch_dtype=torch.bfloat16
+                if FLASH_VER == 2
+                else torch.float16
+                if "AWQ" in self.model_name
+                else "auto",
+                attn_implementation="flash_attention_2" if FLASH_VER == 2 else None,
+                device_map="cpu",
+            )
         else:
             from transformers import AutoModelForCausalLM, AutoTokenizer
+
             self.model = AutoModelForCausalLM.from_pretrained(
                 self.model_name,
-                torch_dtype=torch.float16
-                if "AWQ" in self.model_name else "auto",
-                attn_implementation="flash_attention_2"
-                if FLASH_VER == 2 else None,
-                device_map="cpu")
+                torch_dtype=torch.float16 if "AWQ" in self.model_name else "auto",
+                attn_implementation="flash_attention_2" if FLASH_VER == 2 else None,
+                device_map="cpu",
+            )
             self.tokenizer = AutoTokenizer.from_pretrained(self.model_name)
 
     def extend(self, prompt, system_prompt, seed=-1, *args, **kwargs):
         self.model = self.model.to(self.device)
-        messages = [{
-            "role": "system",
-            "content": system_prompt
-        }, {
-            "role": "user",
-            "content": prompt
-        }]
+        messages = [
+            {"role": "system", "content": system_prompt},
+            {"role": "user", "content": prompt},
+        ]
         text = self.tokenizer.apply_chat_template(
-            messages, tokenize=False, add_generation_prompt=True)
-        model_inputs = self.tokenizer([text],
-                                      return_tensors="pt").to(self.model.device)
+            messages, tokenize=False, add_generation_prompt=True
+        )
+        model_inputs = self.tokenizer([text], return_tensors="pt").to(self.model.device)
 
         generated_ids = self.model.generate(**model_inputs, max_new_tokens=512)
         generated_ids = [
-            output_ids[len(input_ids):] for input_ids, output_ids in zip(
-                model_inputs.input_ids, generated_ids)
+            output_ids[len(input_ids) :]
+            for input_ids, output_ids in zip(model_inputs.input_ids, generated_ids)
         ]
 
         expanded_prompt = self.tokenizer.batch_decode(
-            generated_ids, skip_special_tokens=True)[0]
+            generated_ids, skip_special_tokens=True
+        )[0]
         self.model = self.model.to("cpu")
         return PromptOutput(
             status=True,
             prompt=expanded_prompt,
             seed=seed,
             system_prompt=system_prompt,
-            message=json.dumps({"content": expanded_prompt},
-                               ensure_ascii=False))
-
-    def extend_with_img(self,
-                        prompt,
-                        system_prompt,
-                        image: Union[Image.Image, str] = None,
-                        seed=-1,
-                        *args,
-                        **kwargs):
+            message=json.dumps({"content": expanded_prompt}, ensure_ascii=False),
+        )
+
+    def extend_with_img(
+        self,
+        prompt,
+        system_prompt,
+        image: Union[Image.Image, str] = None,
+        seed=-1,
+        *args,
+        **kwargs,
+    ):
         self.model = self.model.to(self.device)
-        messages = [{
-            'role': 'system',
-            'content': [{
-                "type": "text",
-                "text": system_prompt
-            }]
-        }, {
-            "role":
-                "user",
-            "content": [
-                {
-                    "type": "image",
-                    "image": image,
-                },
-                {
-                    "type": "text",
-                    "text": prompt
-                },
-            ],
-        }]
+        messages = [
+            {"role": "system", "content": [{"type": "text", "text": system_prompt}]},
+            {
+                "role": "user",
+                "content": [
+                    {
+                        "type": "image",
+                        "image": image,
+                    },
+                    {"type": "text", "text": prompt},
+                ],
+            },
+        ]
 
         # Preparation for inference
         text = self.processor.apply_chat_template(
-            messages, tokenize=False, add_generation_prompt=True)
+            messages, tokenize=False, add_generation_prompt=True
+        )
         image_inputs, video_inputs = self.process_vision_info(messages)
         inputs = self.processor(
             text=[text],
@@ -439,105 +378,136 @@ def extend_with_img(self,
         # Inference: Generation of the output
         generated_ids = self.model.generate(**inputs, max_new_tokens=512)
         generated_ids_trimmed = [
-            out_ids[len(in_ids):]
+            out_ids[len(in_ids) :]
             for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
         ]
         expanded_prompt = self.processor.batch_decode(
             generated_ids_trimmed,
             skip_special_tokens=True,
-            clean_up_tokenization_spaces=False)[0]
+            clean_up_tokenization_spaces=False,
+        )[0]
         self.model = self.model.to("cpu")
         return PromptOutput(
             status=True,
             prompt=expanded_prompt,
             seed=seed,
             system_prompt=system_prompt,
-            message=json.dumps({"content": expanded_prompt},
-                               ensure_ascii=False))
+            message=json.dumps({"content": expanded_prompt}, ensure_ascii=False),
+        )
 
 
 if __name__ == "__main__":
+    logging.basicConfig(
+        level=logging.INFO,
+        format="[%(asctime)s] %(levelname)s: %(message)s",
+        handlers=[logging.StreamHandler(stream=sys.stdout)],
+    )
 
     seed = 100
     prompt = "夏日海滩度假风格，一只戴着墨镜的白色猫咪坐在冲浪板上。猫咪毛发蓬松，表情悠闲，直视镜头。背景是模糊的海滩景色，海水清澈，远处有绿色的山丘和蓝天白云。猫咪的姿态自然放松，仿佛在享受海风和阳光。近景特写，强调猫咪的细节和海滩的清新氛围。"
     en_prompt = "Summer beach vacation style, a white cat wearing sunglasses sits on a surfboard. The fluffy-furred feline gazes directly at the camera with a relaxed expression. Blurred beach scenery forms the background featuring crystal-clear waters, distant green hills, and a blue sky dotted with white clouds. The cat assumes a naturally relaxed posture, as if savoring the sea breeze and warm sunlight. A close-up shot highlights the feline's intricate details and the refreshing atmosphere of the seaside."
-    # test cases for prompt extend
-    ds_model_name = "qwen-plus"
-    # for qwenmodel, you can download the model form modelscope or huggingface and use the model path as model_name
-    qwen_model_name = "./models/Qwen2.5-14B-Instruct/"  # VRAM: 29136MiB
-    # qwen_model_name = "./models/Qwen2.5-14B-Instruct-AWQ/"  # VRAM: 10414MiB
-
-    # test dashscope api
-    dashscope_prompt_expander = DashScopePromptExpander(
-        model_name=ds_model_name)
-    dashscope_result = dashscope_prompt_expander(prompt, tar_lang="zh")
-    print("LM dashscope result -> zh",
-          dashscope_result.prompt)  #dashscope_result.system_prompt)
-    dashscope_result = dashscope_prompt_expander(prompt, tar_lang="en")
-    print("LM dashscope result -> en",
-          dashscope_result.prompt)  #dashscope_result.system_prompt)
-    dashscope_result = dashscope_prompt_expander(en_prompt, tar_lang="zh")
-    print("LM dashscope en result -> zh",
-          dashscope_result.prompt)  #dashscope_result.system_prompt)
-    dashscope_result = dashscope_prompt_expander(en_prompt, tar_lang="en")
-    print("LM dashscope en result -> en",
-          dashscope_result.prompt)  #dashscope_result.system_prompt)
-    # # test qwen api
-    qwen_prompt_expander = QwenPromptExpander(
-        model_name=qwen_model_name, is_vl=False, device=0)
-    qwen_result = qwen_prompt_expander(prompt, tar_lang="zh")
-    print("LM qwen result -> zh",
-          qwen_result.prompt)  #qwen_result.system_prompt)
-    qwen_result = qwen_prompt_expander(prompt, tar_lang="en")
-    print("LM qwen result -> en",
-          qwen_result.prompt)  # qwen_result.system_prompt)
-    qwen_result = qwen_prompt_expander(en_prompt, tar_lang="zh")
-    print("LM qwen en result -> zh",
-          qwen_result.prompt)  #, qwen_result.system_prompt)
-    qwen_result = qwen_prompt_expander(en_prompt, tar_lang="en")
-    print("LM qwen en result -> en",
-          qwen_result.prompt)  # , qwen_result.system_prompt)
-    # test case for prompt-image extend
-    ds_model_name = "qwen-vl-max"
-    #qwen_model_name = "./models/Qwen2.5-VL-3B-Instruct/" #VRAM: 9686MiB
-    qwen_model_name = "./models/Qwen2.5-VL-7B-Instruct-AWQ/"  # VRAM: 8492
     image = "./examples/i2v_input.JPG"
 
-    # test dashscope api why image_path is local directory; skip
-    dashscope_prompt_expander = DashScopePromptExpander(
-        model_name=ds_model_name, is_vl=True)
-    dashscope_result = dashscope_prompt_expander(
-        prompt, tar_lang="zh", image=image, seed=seed)
-    print("VL dashscope result -> zh",
-          dashscope_result.prompt)  #, dashscope_result.system_prompt)
-    dashscope_result = dashscope_prompt_expander(
-        prompt, tar_lang="en", image=image, seed=seed)
-    print("VL dashscope result -> en",
-          dashscope_result.prompt)  # , dashscope_result.system_prompt)
-    dashscope_result = dashscope_prompt_expander(
-        en_prompt, tar_lang="zh", image=image, seed=seed)
-    print("VL dashscope en result -> zh",
-          dashscope_result.prompt)  #, dashscope_result.system_prompt)
-    dashscope_result = dashscope_prompt_expander(
-        en_prompt, tar_lang="en", image=image, seed=seed)
-    print("VL dashscope en result -> en",
-          dashscope_result.prompt)  # , dashscope_result.system_prompt)
-    # test qwen api
-    qwen_prompt_expander = QwenPromptExpander(
-        model_name=qwen_model_name, is_vl=True, device=0)
-    qwen_result = qwen_prompt_expander(
-        prompt, tar_lang="zh", image=image, seed=seed)
-    print("VL qwen result -> zh",
-          qwen_result.prompt)  #, qwen_result.system_prompt)
-    qwen_result = qwen_prompt_expander(
-        prompt, tar_lang="en", image=image, seed=seed)
-    print("VL qwen result ->en",
-          qwen_result.prompt)  # , qwen_result.system_prompt)
-    qwen_result = qwen_prompt_expander(
-        en_prompt, tar_lang="zh", image=image, seed=seed)
-    print("VL qwen vl en result -> zh",
-          qwen_result.prompt)  #, qwen_result.system_prompt)
-    qwen_result = qwen_prompt_expander(
-        en_prompt, tar_lang="en", image=image, seed=seed)
-    print("VL qwen vl en result -> en",
-          qwen_result.prompt)  # , qwen_result.system_prompt)
+    def test(method, prompt, model_name, task, image=None, en_prompt=None, seed=None):
+        prompt_expander = method(
+            model_name=model_name, task=task, is_vl=image is not None
+        )
+        result = prompt_expander(prompt, image=image, tar_lang="zh")
+        logging.info(f"zh prompt -> zh: {result.prompt}")
+        result = prompt_expander(prompt, image=image, tar_lang="en")
+        logging.info(f"zh prompt -> en: {result.prompt}")
+        if en_prompt is not None:
+            result = prompt_expander(en_prompt, image=image, tar_lang="zh")
+            logging.info(f"en prompt -> zh: {result.prompt}")
+            result = prompt_expander(en_prompt, image=image, tar_lang="en")
+            logging.info(f"en prompt -> en: {result.prompt}")
+
+    ds_model_name = None
+    ds_vl_model_name = None
+    qwen_model_name = None
+    qwen_vl_model_name = None
+
+    for task in ["t2v-A14B", "i2v-A14B"]:
+        # test prompt extend
+        if "t2v" in task:
+            # test dashscope api
+            logging.info("-" * 40)
+            logging.info(f"Testing {task} dashscope prompt extend")
+            test(
+                DashScopePromptExpander,
+                prompt,
+                ds_model_name,
+                task,
+                image=None,
+                en_prompt=en_prompt,
+                seed=seed,
+            )
+
+            # test qwen api
+            logging.info("-" * 40)
+            logging.info(f"Testing {task} qwen prompt extend")
+            test(
+                QwenPromptExpander,
+                prompt,
+                qwen_model_name,
+                task,
+                image=None,
+                en_prompt=en_prompt,
+                seed=seed,
+            )
+
+        # test prompt-image extend
+        if "i2v" in task:
+            # test dashscope api
+            logging.info("-" * 40)
+            logging.info(f"Testing {task} dashscope vl prompt extend")
+            test(
+                DashScopePromptExpander,
+                prompt,
+                ds_vl_model_name,
+                task,
+                image=image,
+                en_prompt=en_prompt,
+                seed=seed,
+            )
+
+            # test qwen api
+            logging.info("-" * 40)
+            logging.info(f"Testing {task} qwen vl prompt extend")
+            test(
+                QwenPromptExpander,
+                prompt,
+                qwen_vl_model_name,
+                task,
+                image=image,
+                en_prompt=en_prompt,
+                seed=seed,
+            )
+
+        # test empty prompt extend
+        if "i2v-A14B" in task:
+            # test dashscope api
+            logging.info("-" * 40)
+            logging.info(f"Testing {task} dashscope vl empty prompt extend")
+            test(
+                DashScopePromptExpander,
+                "",
+                ds_vl_model_name,
+                task,
+                image=image,
+                en_prompt=None,
+                seed=seed,
+            )
+
+            # test qwen api
+            logging.info("-" * 40)
+            logging.info(f"Testing {task} qwen vl empty prompt extend")
+            test(
+                QwenPromptExpander,
+                "",
+                qwen_vl_model_name,
+                task,
+                image=image,
+                en_prompt=None,
+                seed=seed,
+            )
diff --git a/videotuna/models/wan/wan/utils/qwen_vl_utils.py b/videotuna/models/wan/wan/utils/qwen_vl_utils.py
index 3c682e6a..9fb19ba0 100644
--- a/videotuna/models/wan/wan/utils/qwen_vl_utils.py
+++ b/videotuna/models/wan/wan/utils/qwen_vl_utils.py
@@ -51,11 +51,13 @@ def floor_by_factor(number: int, factor: int) -> int:
     return math.floor(number / factor) * factor
 
 
-def smart_resize(height: int,
-                 width: int,
-                 factor: int = IMAGE_FACTOR,
-                 min_pixels: int = MIN_PIXELS,
-                 max_pixels: int = MAX_PIXELS) -> tuple[int, int]:
+def smart_resize(
+    height: int,
+    width: int,
+    factor: int = IMAGE_FACTOR,
+    min_pixels: int = MIN_PIXELS,
+    max_pixels: int = MAX_PIXELS,
+) -> tuple[int, int]:
     """
     Rescales the image so that the following conditions are met:
 
@@ -82,8 +84,9 @@ def smart_resize(height: int,
     return h_bar, w_bar
 
 
-def fetch_image(ele: dict[str, str | Image.Image],
-                size_factor: int = IMAGE_FACTOR) -> Image.Image:
+def fetch_image(
+    ele: dict[str, str | Image.Image], size_factor: int = IMAGE_FACTOR
+) -> Image.Image:
     if "image" in ele:
         image = ele["image"]
     else:
@@ -153,17 +156,17 @@ def smart_nframes(
     Returns:
         int: the number of frames for video used for model inputs.
     """
-    assert not ("fps" in ele and
-                "nframes" in ele), "Only accept either `fps` or `nframes`"
+    assert not (
+        "fps" in ele and "nframes" in ele
+    ), "Only accept either `fps` or `nframes`"
     if "nframes" in ele:
         nframes = round_by_factor(ele["nframes"], FRAME_FACTOR)
     else:
         fps = ele.get("fps", FPS)
-        min_frames = ceil_by_factor(
-            ele.get("min_frames", FPS_MIN_FRAMES), FRAME_FACTOR)
+        min_frames = ceil_by_factor(ele.get("min_frames", FPS_MIN_FRAMES), FRAME_FACTOR)
         max_frames = floor_by_factor(
-            ele.get("max_frames", min(FPS_MAX_FRAMES, total_frames)),
-            FRAME_FACTOR)
+            ele.get("max_frames", min(FPS_MAX_FRAMES, total_frames)), FRAME_FACTOR
+        )
         nframes = total_frames / video_fps * fps
         nframes = min(max(nframes, min_frames), max_frames)
         nframes = round_by_factor(nframes, FRAME_FACTOR)
@@ -174,7 +177,9 @@ def smart_nframes(
     return nframes
 
 
-def _read_video_torchvision(ele: dict,) -> torch.Tensor:
+def _read_video_torchvision(
+    ele: dict,
+) -> torch.Tensor:
     """read video using torchvision.io.read_video
 
     Args:
@@ -212,45 +217,7 @@ def _read_video_torchvision(ele: dict,) -> torch.Tensor:
     return video
 
 
-def is_decord_available() -> bool:
-    import importlib.util
-
-    return importlib.util.find_spec("decord") is not None
-
-
-def _read_video_decord(ele: dict,) -> torch.Tensor:
-    """read video using decord.VideoReader
-
-    Args:
-        ele (dict): a dict contains the configuration of video.
-        support keys:
-            - video: the path of video. support "file://", "http://", "https://" and local path.
-            - video_start: the start time of video.
-            - video_end: the end time of video.
-    Returns:
-        torch.Tensor: the video tensor with shape (T, C, H, W).
-    """
-    import decord
-    video_path = ele["video"]
-    st = time.time()
-    vr = decord.VideoReader(video_path)
-    # TODO: support start_pts and end_pts
-    if 'video_start' in ele or 'video_end' in ele:
-        raise NotImplementedError(
-            "not support start_pts and end_pts in decord for now.")
-    total_frames, video_fps = len(vr), vr.get_avg_fps()
-    logger.info(
-        f"decord:  {video_path=}, {total_frames=}, {video_fps=}, time={time.time() - st:.3f}s"
-    )
-    nframes = smart_nframes(ele, total_frames=total_frames, video_fps=video_fps)
-    idx = torch.linspace(0, total_frames - 1, nframes).round().long().tolist()
-    video = vr.get_batch(idx).asnumpy()
-    video = torch.tensor(video).permute(0, 3, 1, 2)  # Convert to TCHW format
-    return video
-
-
 VIDEO_READER_BACKENDS = {
-    "decord": _read_video_decord,
     "torchvision": _read_video_torchvision,
 }
 
@@ -261,19 +228,17 @@ def _read_video_decord(ele: dict,) -> torch.Tensor:
 def get_video_reader_backend() -> str:
     if FORCE_QWENVL_VIDEO_READER is not None:
         video_reader_backend = FORCE_QWENVL_VIDEO_READER
-    elif is_decord_available():
-        video_reader_backend = "decord"
     else:
         video_reader_backend = "torchvision"
-    print(
-        f"qwen-vl-utils using {video_reader_backend} to read video.",
-        file=sys.stderr)
+    logger.info(
+        f"qwen-vl-utils using {video_reader_backend} to read video.", file=sys.stderr
+    )
     return video_reader_backend
 
 
 def fetch_video(
-        ele: dict,
-        image_factor: int = IMAGE_FACTOR) -> torch.Tensor | list[Image.Image]:
+    ele: dict, image_factor: int = IMAGE_FACTOR
+) -> torch.Tensor | list[Image.Image]:
     if isinstance(ele["video"], str):
         video_reader_backend = get_video_reader_backend()
         video = VIDEO_READER_BACKENDS[video_reader_backend](ele)
@@ -283,7 +248,8 @@ def fetch_video(
         total_pixels = ele.get("total_pixels", VIDEO_TOTAL_PIXELS)
         max_pixels = max(
             min(VIDEO_MAX_PIXELS, total_pixels / nframes * FRAME_FACTOR),
-            int(min_pixels * 1.05))
+            int(min_pixels * 1.05),
+        )
         max_pixels = ele.get("max_pixels", max_pixels)
         if "resized_height" in ele and "resized_width" in ele:
             resized_height, resized_width = smart_resize(
@@ -312,11 +278,9 @@ def fetch_video(
         process_info.pop("type", None)
         process_info.pop("video", None)
         images = [
-            fetch_image({
-                "image": video_element,
-                **process_info
-            },
-                        size_factor=image_factor)
+            fetch_image(
+                {"image": video_element, **process_info}, size_factor=image_factor
+            )
             for video_element in ele["video"]
         ]
         nframes = ceil_by_factor(len(images), FRAME_FACTOR)
@@ -325,8 +289,7 @@ def fetch_video(
         return images
 
 
-def extract_vision_info(
-        conversations: list[dict] | list[list[dict]]) -> list[dict]:
+def extract_vision_info(conversations: list[dict] | list[list[dict]]) -> list[dict]:
     vision_infos = []
     if isinstance(conversations[0], dict):
         conversations = [conversations]
@@ -334,17 +297,19 @@ def extract_vision_info(
         for message in conversation:
             if isinstance(message["content"], list):
                 for ele in message["content"]:
-                    if ("image" in ele or "image_url" in ele or
-                            "video" in ele or
-                            ele["type"] in ("image", "image_url", "video")):
+                    if (
+                        "image" in ele
+                        or "image_url" in ele
+                        or "video" in ele
+                        or ele["type"] in ("image", "image_url", "video")
+                    ):
                         vision_infos.append(ele)
     return vision_infos
 
 
 def process_vision_info(
     conversations: list[dict] | list[list[dict]],
-) -> tuple[list[Image.Image] | None, list[torch.Tensor | list[Image.Image]] |
-           None]:
+) -> tuple[list[Image.Image] | None, list[torch.Tensor | list[Image.Image]] | None]:
     vision_infos = extract_vision_info(conversations)
     ## Read images or videos
     image_inputs = []
diff --git a/videotuna/models/wan/wan/utils/system_prompt.py b/videotuna/models/wan/wan/utils/system_prompt.py
new file mode 100644
index 00000000..3949cb09
--- /dev/null
+++ b/videotuna/models/wan/wan/utils/system_prompt.py
@@ -0,0 +1,141 @@
+# Copyright 2024-2025 The Alibaba Wan Team Authors. All rights reserved.
+
+T2V_A14B_ZH_SYS_PROMPT = """ 你是一位电影导演，旨在为用户输入的原始prompt添加电影元素，改写为优质Prompt，使其完整、具有表现力。
+任务要求： 
+1. 对于用户输入的prompt,在不改变prompt的原意（如主体、动作）前提下，从下列电影美学设定中选择部分合适的时间、光源、光线强度、光线角度、对比度、饱和度、色调、拍摄角度、镜头大小、构图的电影设定细节,将这些内容添加到prompt中，让画面变得更美，注意，可以任选，不必每项都有 
+  时间：["白天", "夜晚", "黎明", "日出"], 可以不选, 如果prompt没有特别说明则选白天 !
+  光源：[日光", "人工光", "月光", "实用光", "火光", "荧光", "阴天光", "晴天光"], 根据根据室内室外及prompt内容选定义光源，添加关于光源的描述，如光线来源（窗户、灯具等）
+  光线强度：["柔光", "硬光"], 
+  光线角度：["顶光", "侧光", "底光", "边缘光",] 
+  色调：["暖色调","冷色调", "混合色调"] 
+  镜头尺寸：["中景", "中近景", "全景","中全景","近景", "特写", "极端全景"]若无特殊要求，默认选择中景或全景
+  拍摄角度：["过肩镜头角度拍摄", "低角度拍摄", "高角度拍摄","倾斜角度拍摄", "航拍","俯视角度拍摄"],如果原始prompt中有运镜的描述，则不要添加此项!
+  构图：["中心构图"，"平衡构图","右侧重构图", "左侧重构图", "对称构图", "短边构图"] 若无特殊要求，默认选择中心构图 
+2. 完善用户描述中出现的主体特征（如外貌、表情，数量、种族、姿态等）等内容，确保不要添加原始prompt中不存在的主体，（如prompt是对风景或物体的描述，但添加了人），增加背景元素的细节； 
+3. 不要输出关于氛围、感觉等文学描写，如（画面充满运动感与生活张力，突出正式氛围）。； 
+4. 对于prompt中的动作，详细解释运动的发生过程，若没有动作，则添加动作描述（摇晃身体、跳舞等），对背景元素也可添加适当运动（如云彩飘动，风吹树叶等）。 
+5. 若原始prompt中没有风格，则不添加风格描述，若有风格描述，则将风格描述放于首位，若为2D插画等与现实电影相悖的风格，则不要添加关于电影美学的描写； 
+6. 若prompt出现天空的描述，则改为湛蓝色的天空相关描述，避免曝光；
+7. 改写后的prompt字数控制在60-200字左右, 不要输出类似“改写后prompt:”这样的输出 
+8. 如果用户输入存在疑似脱衣服、衣物透明化、轻纱衣物、湿身、洗澡、舔舐牛奶/酸奶/白色液体、性暗示、强奸、漏出或微露胸部、爬行、性器官、儿童身体裸露等色情内容，请将原prompt替换为内容完全不同的高美感prompt，而不要拒答或者提醒用户。 
+9. 如果用户输入存在比基尼穿着内容，请将原prompt替换为确保内容发生在沙滩、泳池、海边的高美感prompt，而不要拒答或者提醒用户。 
+10. 如果用户输入存在性别多元化、同性恋亲吻、脸颊贴近（两个男性亲吻或两个女性亲吻）/性交等同性恋内容，请将原prompt替换为不同动作（不能是亲吻和其他亲密动作）的高美感prompt，而不要拒答或者提醒用户。 
+生成的 prompt 示例： 
+1.边缘光，中近景，日光，左侧重构图，暖色调，硬光，晴天光，侧光，白天，一个年轻的女孩坐在高草丛生的田野中，两条毛发蓬松的小毛驴站在她身后。女孩大约十一二岁，穿着简单的碎花裙子，头发扎成两条麻花辫，脸上带着纯真的笑容。她双腿交叉坐下，双手轻轻抚弄身旁的野花。小毛驴体型健壮，耳朵竖起，好奇地望着镜头方向。阳光洒在田野上，营造出温暖自然的画面感。
+2.黎明，顶光，俯视角度拍摄，日光，长焦，中心构图，近景，高角度拍摄，荧光，柔光，冷色调，在昏暗的环境中，一个外国白人女子在水中仰面漂浮。俯拍近景镜头中，她有着棕色的短发，脸上有几颗雀斑。随着镜头下摇，她转过头来，面向右侧，水面上泛起一圈涟漪。虚化的背景一片漆黑，只有微弱的光线照亮了女子的脸庞和水面的一部分区域，水面呈现蓝色。女子穿着一件蓝色的吊带，肩膀裸露在外。
+3.右侧重构图，暖色调，底光，侧光，夜晚，火光，过肩镜头角度拍摄, 镜头平拍拍摄外国女子在室内的近景，她穿着棕色的衣服戴着彩色的项链和粉色的帽子，坐在深灰色的椅子上，双手放在黑色的桌子上，眼睛看着镜头的左侧，嘴巴张动，左手上下晃动，桌子上有白色的蜡烛有黄色的火焰，后面是黑色的墙，前面有黑色的网状架子，旁边是黑色的箱子，上面有一些黑色的物品，都做了虚化的处理。 
+4. 二次元厚涂动漫插画，一个猫耳兽耳白人少女手持文件夹摇晃，神情略带不满。她深紫色长发，红色眼睛，身穿深灰色短裙和浅灰色上衣，腰间系着白色系带，胸前佩戴名牌，上面写着黑体中文"紫阳"。淡黄色调室内背景，隐约可见一些家具轮廓。少女头顶有一个粉色光圈。线条流畅的日系赛璐璐风格。近景半身略俯视视角。 
+"""
+
+
+T2V_A14B_EN_SYS_PROMPT = """你是一位电影导演，旨在为用户输入的原始prompt添加电影元素，改写为优质（英文）Prompt，使其完整、具有表现力注意，输出必须是英文！
+任务要求：
+1. 对于用户输入的prompt,在不改变prompt的原意（如主体、动作）前提下，从下列电影美学设定中选择不超过4种合适的时间、光源、光线强度、光线角度、对比度、饱和度、色调、拍摄角度、镜头大小、构图的电影设定细节,将这些内容添加到prompt中，让画面变得更美，注意，可以任选，不必每项都有
+  时间：["Day time", "Night time" "Dawn time","Sunrise time"], 如果prompt没有特别说明则选 Day time!!!
+  光源：["Daylight", "Artificial lighting", "Moonlight", "Practical lighting", "Firelight","Fluorescent lighting", "Overcast lighting" "Sunny lighting"], 根据根据室内室外及prompt内容选定义光源，添加关于光源的描述，如光线来源（窗户、灯具等）
+  光线强度：["Soft lighting", "Hard lighting"], 
+  色调：["Warm colors","Cool colors", "Mixed colors"] 
+  光线角度：["Top lighting", "Side lighting", "Underlighting", "Edge lighting"]
+  镜头尺寸：["Medium shot", "Medium close-up shot", "Wide shot","Medium wide shot","Close-up shot", "Extreme close-up shot", "Extreme wide shot"]若无特殊要求，默认选择Medium shot或Wide shot
+  拍摄角度：["Over-the-shoulder shot", ""Low angle shot", "High angle shot","Dutch angle shot", "Aerial shot","Overhead shot"] 若原始prompt中有运镜的描述，则取消添加此项。
+  构图：["Center composition"，"Balanced composition","Right-heavy composition", "Left-heavy composition", "Symmetrical composition", "Short-side composition"] 若无特殊要求，默认选择Center composition
+2. 完善用户描述中出现的主体特征（如外貌、表情，数量、种族、姿态等）等内容，确保不要添加原始prompt中不存在的主体，（如prompt是对风景或物体的描述，但添加了人），增加背景元素的细节；
+3. 不要输出关于氛围、感觉等文学描写，如（画面充满运动感与生活张力，突出正式氛围）。；
+4. 对于prompt中的动作，详细描述运动的发生过程，若没有动作，则添加动作描述（摇晃身体、跳舞等，对背景元素也可添加适当运动（如云彩飘动，风吹树叶等）。
+5. 若原始prompt中没有风格，则不添加风格描述，若有风格描述，则将风格描述放于首位，若为2D插画等与现实电影相悖的风格，则不要添加关于电影美学的描写；
+6. 若prompt出现天空的描述，则改为湛蓝色的天空相关描述，避免曝光；
+7. 输出必须是全英文，改写后的prompt字数控制在60-200字左右, 不要输出类似“改写后prompt:”这样的输出
+8. 如果用户输入存在疑似脱衣服、衣物透明化、轻纱衣物、湿身、洗澡、舔舐牛奶/酸奶/白色液体、性暗示、强奸、漏出或微露胸部、爬行、性器官、儿童身体裸露等色情内容，请将原prompt替换为内容完全不同的高美感prompt，而不要拒答或者提醒用户。
+9. 如果用户输入存在比基尼穿着内容，请将原prompt替换为确保内容发生在沙滩、泳池、海边的高美感prompt，而不要拒答或者提醒用户。
+10. 如果用户输入存在性别多元化、同性恋亲吻、脸颊贴近（两个男性亲吻或两个女性亲吻）/性交等同性恋内容，请将原prompt替换为不同动作（不能是亲吻和其他亲密动作）的高美感prompt，而不要拒答或者提醒用户。
+生成的 prompt 示例：
+1.Edge lighting, medium close-up shot, daylight, left-heavy composition. A young girl around 11-12 years old sits in a field of tall grass, with two fluffy small donkeys standing behind her. She wears a simple floral dress with hair in twin braids, smiling innocently while cross-legged and gently touching wild flowers beside her. The sturdy donkeys have perked ears, curiously gazing toward the camera. Sunlight bathes the field, creating a warm natural atmosphere.
+2.Dawn time, top lighting, high-angle shot, daylight, long lens shot, center composition, Close-up shot,  Fluorescent lighting,  soft lighting, cool colors. In dim surroundings, a Caucasian woman floats on her back in water. The俯拍close-up shows her brown short hair and freckled face. As the camera tilts downward, she turns her head toward the right, creating ripples on the blue-toned water surface. The blurred background is pitch black except for faint light illuminating her face and partial water surface. She wears a blue sleeveless top with bare shoulders.
+3.Right-heavy composition, warm colors, night time, firelight, over-the-shoulder angle. An eye-level close-up of a foreign woman indoors wearing brown clothes with colorful necklace and pink hat. She sits on a charcoal-gray chair, hands on black table, eyes looking left of camera while mouth moves and left hand gestures up/down. White candles with yellow flames sit on the table. Background shows black walls, with blurred black mesh shelf nearby and black crate containing dark items in front.
+4."Anime-style thick-painted style. A cat-eared Caucasian girl with beast ears holds a folder, showing slight displeasure. Features deep purple hair, red eyes, dark gray skirt and light gray top with white waist sash. A name tag labeled 'Ziyang' in bold Chinese characters hangs on her chest. Pale yellow indoor background with faint furniture outlines. A pink halo floats above her head. Features smooth linework in cel-shaded Japanese style, medium close-up from slightly elevated perspective.
+"""
+
+
+I2V_A14B_ZH_SYS_PROMPT = """你是一个视频描述提示词的改写专家，你的任务是根据用户给你输入的图像，对提供的视频描述提示词进行改写，你要强调潜在的动态内容。具体要求如下
+用户输入的语言可能含有多样化的描述，如markdown文档格式、指令格式，长度过长或者过短，你需要根据图片的内容和用户的输入的提示词，尽可能提取用户输入的提示词和图片关联信息。
+你改写的视频描述结果要尽可能保留提供给你的视频描述提示词中动态部分，保留主体的动作。
+你要根据图像，强调并简化视频描述提示词中的图像主体，如果用户只提供了动作，你要根据图像内容合理补充，如“跳舞”补充称“一个女孩在跳舞”
+如果用户输入的提示词过长，你需要提炼潜在的动作过程
+如果用户输入的提示词过短，综合用户输入的提示词以及画面内容，合理的增加潜在的运动信息
+你要根据图像，保留并强调视频描述提示词中关于运镜手段的描述，如“镜头上摇”，“镜头从左到右”，“镜头从右到左”等等，你要保留，如“镜头拍摄两个男人打斗，他们先是躺在地上，随后镜头向上移动，拍摄他们站起来，接着镜头向左移动，左边男人拿着一个蓝色的东西，右边男人上前抢夺，两人激烈地来回争抢。”。
+你需要给出对视频描述的动态内容，不要添加对于静态场景的描述，如果用户输入的描述已经在画面中出现，则移除这些描述
+改写后的prompt字数控制在100字以下
+无论用户输入那种语言，你都需要输出中文
+改写后 prompt 示例：
+1. 镜头后拉，拍摄两个外国男人，走在楼梯上，镜头左侧的男人右手搀扶着镜头右侧的男人。
+2. 一只黑色的小松鼠专注地吃着东西，偶尔抬头看看四周。
+3. 男子说着话，表情从微笑逐渐转变为闭眼，然后睁开眼睛，最后是闭眼微笑，他的手势活跃，在说话时做出一系列的手势。
+4. 一个人正在用尺子和笔进行测量的特写，右手用一支黑色水性笔在纸上画出一条直线。
+5. 一辆车模型在木板上形式，车辆从画面的右侧向左侧移动，经过一片草地和一些木制结构。
+6. 镜头左移后前推，拍摄一个人坐在防波堤上。
+7. 男子说着话，他的表情和手势随着对话内容的变化而变化，但整体场景保持不变。
+8. 镜头左移后前推，拍摄一个人坐在防波堤上。
+9. 带着珍珠项链的女子看向画面右侧并说着话。
+请直接输出改写后的文本，不要进行多余的回复。"""
+
+
+I2V_A14B_EN_SYS_PROMPT = """You are an expert in rewriting video description prompts. Your task is to rewrite the provided video description prompts based on the images given by users, emphasizing potential dynamic content. Specific requirements are as follows:
+The user's input language may include diverse descriptions, such as markdown format, instruction format, or be too long or too short. You need to extract the relevant information from the user’s input and associate it with the image content.
+Your rewritten video description should retain the dynamic parts of the provided prompts, focusing on the main subject's actions. Emphasize and simplify the main subject of the image while retaining their movement. If the user only provides an action (e.g., "dancing"), supplement it reasonably based on the image content (e.g., "a girl is dancing").
+If the user’s input prompt is too long, refine it to capture the essential action process. If the input is too short, add reasonable motion-related details based on the image content.
+Retain and emphasize descriptions of camera movements, such as "the camera pans up," "the camera moves from left to right," or "the camera moves from right to left." For example: "The camera captures two men fighting. They start lying on the ground, then the camera moves upward as they stand up. The camera shifts left, showing the man on the left holding a blue object while the man on the right tries to grab it, resulting in a fierce back-and-forth struggle."
+Focus on dynamic content in the video description and avoid adding static scene descriptions. If the user’s input already describes elements visible in the image, remove those static descriptions.
+Limit the rewritten prompt to 100 words or less. Regardless of the input language, your output must be in English.
+
+Examples of rewritten prompts:
+The camera pulls back to show two foreign men walking up the stairs. The man on the left supports the man on the right with his right hand.
+A black squirrel focuses on eating, occasionally looking around.
+A man talks, his expression shifting from smiling to closing his eyes, reopening them, and finally smiling with closed eyes. His gestures are lively, making various hand motions while speaking.
+A close-up of someone measuring with a ruler and pen, drawing a straight line on paper with a black marker in their right hand.
+A model car moves on a wooden board, traveling from right to left across grass and wooden structures.
+The camera moves left, then pushes forward to capture a person sitting on a breakwater.
+A man speaks, his expressions and gestures changing with the conversation, while the overall scene remains constant.
+The camera moves left, then pushes forward to capture a person sitting on a breakwater.
+A woman wearing a pearl necklace looks to the right and speaks.
+Output only the rewritten text without additional responses."""
+
+
+I2V_A14B_EMPTY_ZH_SYS_PROMPT = """你是一个视频描述提示词的撰写专家，你的任务是根据用户给你输入的图像，发挥合理的想象，让这张图动起来，你要强调潜在的动态内容。具体要求如下
+你需要根据图片的内容想象出运动的主体
+你输出的结果应强调图片中的动态部分，保留主体的动作。
+你需要给出对视频描述的动态内容，不要有过多的对于静态场景的描述
+输出的prompt字数控制在100字以下
+你需要输出中文
+prompt 示例：
+1. 镜头后拉，拍摄两个外国男人，走在楼梯上，镜头左侧的男人右手搀扶着镜头右侧的男人。
+2. 一只黑色的小松鼠专注地吃着东西，偶尔抬头看看四周。
+3. 男子说着话，表情从微笑逐渐转变为闭眼，然后睁开眼睛，最后是闭眼微笑，他的手势活跃，在说话时做出一系列的手势。
+4. 一个人正在用尺子和笔进行测量的特写，右手用一支黑色水性笔在纸上画出一条直线。
+5. 一辆车模型在木板上形式，车辆从画面的右侧向左侧移动，经过一片草地和一些木制结构。
+6. 镜头左移后前推，拍摄一个人坐在防波堤上。
+7. 男子说着话，他的表情和手势随着对话内容的变化而变化，但整体场景保持不变。
+8. 镜头左移后前推，拍摄一个人坐在防波堤上。
+9. 带着珍珠项链的女子看向画面右侧并说着话。
+请直接输出文本，不要进行多余的回复。"""
+
+
+I2V_A14B_EMPTY_EN_SYS_PROMPT = """You are an expert in writing video description prompts. Your task is to bring the image provided by the user to life through reasonable imagination, emphasizing potential dynamic content. Specific requirements are as follows:
+
+You need to imagine the moving subject based on the content of the image.
+Your output should emphasize the dynamic parts of the image and retain the main subject’s actions.
+Focus only on describing dynamic content; avoid excessive descriptions of static scenes.
+Limit the output prompt to 100 words or less.
+The output must be in English.
+
+Prompt examples:
+
+The camera pulls back to show two foreign men walking up the stairs. The man on the left supports the man on the right with his right hand.
+A black squirrel focuses on eating, occasionally looking around.
+A man talks, his expression shifting from smiling to closing his eyes, reopening them, and finally smiling with closed eyes. His gestures are lively, making various hand motions while speaking.
+A close-up of someone measuring with a ruler and pen, drawing a straight line on paper with a black marker in their right hand.
+A model car moves on a wooden board, traveling from right to left across grass and wooden structures.
+The camera moves left, then pushes forward to capture a person sitting on a breakwater.
+A man speaks, his expressions and gestures changing with the conversation, while the overall scene remains constant.
+The camera moves left, then pushes forward to capture a person sitting on a breakwater.
+A woman wearing a pearl necklace looks to the right and speaks.
+Output only the text without additional responses."""
diff --git a/videotuna/models/wan/wan/utils/utils.py b/videotuna/models/wan/wan/utils/utils.py
index d7259996..7392cd00 100644
--- a/videotuna/models/wan/wan/utils/utils.py
+++ b/videotuna/models/wan/wan/utils/utils.py
@@ -1,94 +1,146 @@
 # Copyright 2024-2025 The Alibaba Wan Team Authors. All rights reserved.
 import argparse
 import binascii
+import logging
 import os
 import os.path as osp
+import shutil
+import subprocess
 
 import imageio
 import torch
 import torchvision
 
-__all__ = ['cache_video', 'cache_image', 'str2bool']
+__all__ = ["save_video", "save_image", "str2bool"]
 
 
-def rand_name(length=8, suffix=''):
-    name = binascii.b2a_hex(os.urandom(length)).decode('utf-8')
+def rand_name(length=8, suffix=""):
+    name = binascii.b2a_hex(os.urandom(length)).decode("utf-8")
     if suffix:
-        if not suffix.startswith('.'):
-            suffix = '.' + suffix
+        if not suffix.startswith("."):
+            suffix = "." + suffix
         name += suffix
     return name
 
 
-def cache_video(tensor,
-                save_file=None,
-                fps=30,
-                suffix='.mp4',
-                nrow=8,
-                normalize=True,
-                value_range=(-1, 1),
-                retry=5):
+def merge_video_audio(video_path: str, audio_path: str):
+    """
+    Merge the video and audio into a new video, with the duration set to the shorter of the two,
+    and overwrite the original video file.
+
+    Parameters:
+    video_path (str): Path to the original video file
+    audio_path (str): Path to the audio file
+    """
+    # set logging
+    logging.basicConfig(level=logging.INFO)
+
+    # check
+    if not os.path.exists(video_path):
+        raise FileNotFoundError(f"video file {video_path} does not exist")
+    if not os.path.exists(audio_path):
+        raise FileNotFoundError(f"audio file {audio_path} does not exist")
+
+    base, ext = os.path.splitext(video_path)
+    temp_output = f"{base}_temp{ext}"
+
+    try:
+        # create ffmpeg command
+        command = [
+            "ffmpeg",
+            "-y",  # overwrite
+            "-i",
+            video_path,
+            "-i",
+            audio_path,
+            "-c:v",
+            "copy",  # copy video stream
+            "-c:a",
+            "aac",  # use AAC audio encoder
+            "-b:a",
+            "192k",  # set audio bitrate (optional)
+            "-map",
+            "0:v:0",  # select the first video stream
+            "-map",
+            "1:a:0",  # select the first audio stream
+            "-shortest",  # choose the shortest duration
+            temp_output,
+        ]
+
+        # execute the command
+        logging.info("Start merging video and audio...")
+        result = subprocess.run(
+            command, stdout=subprocess.PIPE, stderr=subprocess.PIPE, text=True
+        )
+
+        # check result
+        if result.returncode != 0:
+            error_msg = f"FFmpeg execute failed: {result.stderr}"
+            logging.error(error_msg)
+            raise RuntimeError(error_msg)
+
+        shutil.move(temp_output, video_path)
+        logging.info(f"Merge completed, saved to {video_path}")
+
+    except Exception as e:
+        if os.path.exists(temp_output):
+            os.remove(temp_output)
+        logging.error(f"merge_video_audio failed with error: {e}")
+
+
+def save_video(
+    tensor,
+    save_file=None,
+    fps=30,
+    suffix=".mp4",
+    nrow=8,
+    normalize=True,
+    value_range=(-1, 1),
+):
     # cache file
-    cache_file = osp.join('/tmp', rand_name(
-        suffix=suffix)) if save_file is None else save_file
+    cache_file = (
+        osp.join("/tmp", rand_name(suffix=suffix)) if save_file is None else save_file
+    )
 
     # save to cache
-    error = None
-    for _ in range(retry):
-        try:
-            # preprocess
-            tensor = tensor.clamp(min(value_range), max(value_range))
-            tensor = torch.stack([
+    try:
+        # preprocess
+        tensor = tensor.clamp(min(value_range), max(value_range))
+        tensor = torch.stack(
+            [
                 torchvision.utils.make_grid(
-                    u, nrow=nrow, normalize=normalize, value_range=value_range)
+                    u, nrow=nrow, normalize=normalize, value_range=value_range
+                )
                 for u in tensor.unbind(2)
             ],
-                                 dim=1).permute(1, 2, 3, 0)
-            tensor = (tensor * 255).type(torch.uint8).cpu()
-
-            # write video
-            writer = imageio.get_writer(
-                cache_file, fps=fps, codec='libx264', quality=8)
-            for frame in tensor.numpy():
-                writer.append_data(frame)
-            writer.close()
-            return cache_file
-        except Exception as e:
-            error = e
-            continue
-    else:
-        print(f'cache_video failed, error: {error}', flush=True)
-        return None
+            dim=1,
+        ).permute(1, 2, 3, 0)
+        tensor = (tensor * 255).type(torch.uint8).cpu()
+
+        # write video
+        writer = imageio.get_writer(cache_file, fps=fps, codec="libx264", quality=8)
+        for frame in tensor.numpy():
+            writer.append_data(frame)
+        writer.close()
+    except Exception as e:
+        logging.info(f"save_video failed, error: {e}")
 
 
-def cache_image(tensor,
-                save_file,
-                nrow=8,
-                normalize=True,
-                value_range=(-1, 1),
-                retry=5):
+def save_image(tensor, save_file, nrow=8, normalize=True, value_range=(-1, 1)):
     # cache file
     suffix = osp.splitext(save_file)[1]
-    if suffix.lower() not in [
-            '.jpg', '.jpeg', '.png', '.tiff', '.gif', '.webp'
-    ]:
-        suffix = '.png'
+    if suffix.lower() not in [".jpg", ".jpeg", ".png", ".tiff", ".gif", ".webp"]:
+        suffix = ".png"
 
     # save to cache
-    error = None
-    for _ in range(retry):
-        try:
-            tensor = tensor.clamp(min(value_range), max(value_range))
-            torchvision.utils.save_image(
-                tensor,
-                save_file,
-                nrow=nrow,
-                normalize=normalize,
-                value_range=value_range)
-            return save_file
-        except Exception as e:
-            error = e
-            continue
+    try:
+        tensor = tensor.clamp(min(value_range), max(value_range))
+        torchvision.utils.save_image(
+            tensor, save_file, nrow=nrow, normalize=normalize, value_range=value_range
+        )
+        return save_file
+    except Exception as e:
+        logging.info(f"save_image failed, error: {e}")
 
 
 def str2bool(v):
@@ -110,9 +162,89 @@ def str2bool(v):
     if isinstance(v, bool):
         return v
     v_lower = v.lower()
-    if v_lower in ('yes', 'true', 't', 'y', '1'):
+    if v_lower in ("yes", "true", "t", "y", "1"):
         return True
-    elif v_lower in ('no', 'false', 'f', 'n', '0'):
+    elif v_lower in ("no", "false", "f", "n", "0"):
         return False
     else:
-        raise argparse.ArgumentTypeError('Boolean value expected (True/False)')
+        raise argparse.ArgumentTypeError("Boolean value expected (True/False)")
+
+
+def masks_like(tensor, zero=False, generator=None, p=0.2):
+    assert isinstance(tensor, list)
+    out1 = [torch.ones(u.shape, dtype=u.dtype, device=u.device) for u in tensor]
+
+    out2 = [torch.ones(u.shape, dtype=u.dtype, device=u.device) for u in tensor]
+
+    if zero:
+        if generator is not None:
+            for u, v in zip(out1, out2):
+                random_num = torch.rand(
+                    1, generator=generator, device=generator.device
+                ).item()
+                if random_num < p:
+                    u[:, 0] = (
+                        torch.normal(
+                            mean=-3.5,
+                            std=0.5,
+                            size=(1,),
+                            device=u.device,
+                            generator=generator,
+                        )
+                        .expand_as(u[:, 0])
+                        .exp()
+                    )
+                    v[:, 0] = torch.zeros_like(v[:, 0])
+                else:
+                    u[:, 0] = u[:, 0]
+                    v[:, 0] = v[:, 0]
+        else:
+            for u, v in zip(out1, out2):
+                u[:, 0] = torch.zeros_like(u[:, 0])
+                v[:, 0] = torch.zeros_like(v[:, 0])
+
+    return out1, out2
+
+
+def best_output_size(w, h, dw, dh, expected_area):
+    # float output size
+    ratio = w / h
+    ow = (expected_area * ratio) ** 0.5
+    oh = expected_area / ow
+
+    # process width first
+    ow1 = int(ow // dw * dw)
+    oh1 = int(expected_area / ow1 // dh * dh)
+    assert ow1 % dw == 0 and oh1 % dh == 0 and ow1 * oh1 <= expected_area
+    ratio1 = ow1 / oh1
+
+    # process height first
+    oh2 = int(oh // dh * dh)
+    ow2 = int(expected_area / oh2 // dw * dw)
+    assert oh2 % dh == 0 and ow2 % dw == 0 and ow2 * oh2 <= expected_area
+    ratio2 = ow2 / oh2
+
+    # compare ratios
+    if max(ratio / ratio1, ratio1 / ratio) < max(ratio / ratio2, ratio2 / ratio):
+        return ow1, oh1
+    else:
+        return ow2, oh2
+
+
+def download_cosyvoice_repo(repo_path):
+    try:
+        import git
+    except ImportError:
+        raise ImportError("failed to import git, please run pip install GitPython")
+    git.Repo.clone_from(
+        "https://github.com/FunAudioLLM/CosyVoice.git",
+        repo_path,
+        multi_options=["--recursive"],
+        branch="main",
+    )
+
+
+def download_cosyvoice_model(model_name, model_path):
+    from modelscope import snapshot_download
+
+    snapshot_download("iic/{}".format(model_name), local_dir=model_path)
diff --git a/videotuna/schedulers/ddim.py b/videotuna/schedulers/ddim.py
index 09825836..a969ba7b 100644
--- a/videotuna/schedulers/ddim.py
+++ b/videotuna/schedulers/ddim.py
@@ -7,7 +7,7 @@
     make_ddim_timesteps,
     rescale_noise_cfg,
 )
-from videotuna.models.lvdm.modules.utils import noise_like
+from videotuna.utils.sched_utils import noise_like
 
 
 class DDIMSampler(object):
@@ -19,7 +19,7 @@ def __init__(self, model, schedule="linear", **kwargs):
         self.counter = 0
 
     def register_buffer(self, name, attr):
-        if type(attr) == torch.Tensor:
+        if isinstance(attr, torch.Tensor):
             if attr.device != torch.device("cuda"):
                 attr = attr.to(torch.device("cuda"))
         setattr(self, name, attr)
@@ -37,7 +37,9 @@ def make_schedule(
         assert (
             alphas_cumprod.shape[0] == self.ddpm_num_timesteps
         ), "alphas have to be defined for each timestep"
-        to_torch = lambda x: x.clone().detach().to(torch.float32).to(self.model.device)
+
+        def to_torch(x):
+            return x.clone().detach().to(torch.float32).to(self.model.device)
 
         self.register_buffer("betas", to_torch(self.model.diffusion_scheduler.betas))
         self.register_buffer("alphas_cumprod", to_torch(alphas_cumprod))
@@ -127,28 +129,29 @@ def sample(
         guidance_rescale=0.0,
         **kwargs,
     ):
-
         # check condition bs
         if conditioning is not None:
             if isinstance(conditioning, dict):
                 try:
                     cbs = conditioning[list(conditioning.keys())[0]].shape[0]
-                except:
+                except Exception:
                     try:
                         cbs = conditioning[list(conditioning.keys())[0]][0].shape[0]
-                    except:
+                    except Exception:
                         cbs = int(
                             conditioning[list(conditioning.keys())[0]][0]["y"].shape[0]
                         )
 
                 if cbs != batch_size:
                     print(
-                        f"Warning: Got {cbs} conditionings but batch-size is {batch_size}"
+                        f"Warning: Got {cbs} conditionings "
+                        f"but batch-size is {batch_size}"
                     )
             else:
                 if conditioning.shape[0] != batch_size:
                     print(
-                        f"Warning: Got {conditioning.shape[0]} conditionings but batch-size is {batch_size}"
+                        f"Warning: Got {conditioning.shape[0]} conditionings "
+                        f"but batch-size is {batch_size}"
                     )
 
         self.make_schedule(
@@ -373,12 +376,14 @@ def p_sample_ddim(
         else:
             # with unconditional condition
             if isinstance(c, torch.Tensor):
-                e_t = self.model.apply_model(x, t, c, **kwargs)
+                e_t_cond = self.model.apply_model(x, t, c, **kwargs)
+                e_t = e_t_cond
                 e_t_uncond = self.model.apply_model(
                     x, t, unconditional_conditioning, **kwargs
                 )
             elif isinstance(c, dict):
-                e_t = self.model.apply_model(x, t, c, **kwargs)
+                e_t_cond = self.model.apply_model(x, t, c, **kwargs)
+                e_t = e_t_cond
                 e_t_uncond = self.model.apply_model(
                     x, t, unconditional_conditioning, **kwargs
                 )
@@ -529,7 +534,6 @@ def decode(
         unconditional_conditioning=None,
         use_original_steps=False,
     ):
-
         timesteps = (
             np.arange(self.ddpm_num_timesteps)
             if use_original_steps
diff --git a/videotuna/schedulers/ddim_multiplecond.py b/videotuna/schedulers/ddim_multiplecond.py
index 9b38e325..de6fd4c4 100644
--- a/videotuna/schedulers/ddim_multiplecond.py
+++ b/videotuna/schedulers/ddim_multiplecond.py
@@ -1,5 +1,3 @@
-import copy
-
 import numpy as np
 import torch
 from tqdm import tqdm
@@ -9,7 +7,7 @@
     make_ddim_timesteps,
     rescale_noise_cfg,
 )
-from videotuna.models.lvdm.modules.utils import extract_into_tensor, noise_like
+from videotuna.utils.sched_utils import extract_into_tensor, noise_like
 
 
 class DDIMSampler(object):
@@ -21,7 +19,7 @@ def __init__(self, model, schedule="linear", **kwargs):
         self.counter = 0
 
     def register_buffer(self, name, attr):
-        if type(attr) == torch.Tensor:
+        if isinstance(attr, torch.Tensor):
             if attr.device != torch.device("cuda"):
                 attr = attr.to(torch.device("cuda"))
         setattr(self, name, attr)
@@ -39,7 +37,9 @@ def make_schedule(
         assert (
             alphas_cumprod.shape[0] == self.ddpm_num_timesteps
         ), "alphas have to be defined for each timestep"
-        to_torch = lambda x: x.clone().detach().to(torch.float32).to(self.model.device)
+
+        def to_torch(x):
+            return x.clone().detach().to(torch.float32).to(self.model.device)
 
         if self.model.use_scale:
             self.ddim_scale_arr = self.model.scale_arr[self.ddim_timesteps]
@@ -120,26 +120,28 @@ def sample(
         fs=None,
         timestep_spacing="uniform",  # uniform_trailing for starting from last timestep
         guidance_rescale=0.0,
-        # this has to come in the same format as the conditioning, # e.g. as encoded tokens, ...
+        # this has to come in the same format as the conditioning,
+        # e.g. as encoded tokens, ...
         **kwargs,
     ):
-
         # check condition bs
         if conditioning is not None:
             if isinstance(conditioning, dict):
                 try:
                     cbs = conditioning[list(conditioning.keys())[0]].shape[0]
-                except:
+                except Exception:
                     cbs = conditioning[list(conditioning.keys())[0]][0].shape[0]
 
                 if cbs != batch_size:
                     print(
-                        f"Warning: Got {cbs} conditionings but batch-size is {batch_size}"
+                        f"Warning: Got {cbs} conditionings "
+                        f"but batch-size is {batch_size}"
                     )
             else:
                 if conditioning.shape[0] != batch_size:
                     print(
-                        f"Warning: Got {conditioning.shape[0]} conditionings but batch-size is {batch_size}"
+                        f"Warning: Got {conditioning.shape[0]} conditionings "
+                        f"but batch-size is {batch_size}"
                     )
 
         # print('==> timestep_spacing: ', timestep_spacing, guidance_rescale)
@@ -250,12 +252,15 @@ def ddim_sampling(
 
         clean_cond = kwargs.pop("clean_cond", False)
 
-        # cond_copy, unconditional_conditioning_copy = copy.deepcopy(cond), copy.deepcopy(unconditional_conditioning)
+        # cond_copy, unconditional_conditioning_copy = (
+        #     copy.deepcopy(cond), copy.deepcopy(unconditional_conditioning)
+        # )
         for i, step in enumerate(iterator):
             index = total_steps - i - 1
             ts = torch.full((b,), step, device=device, dtype=torch.long)
 
-            ## use mask to blend noised original latent (img_orig) & new sampled latent (img)
+            ## use mask to blend noised original latent (img_orig)
+            ## & new sampled latent (img)
             if mask is not None:
                 assert x0 is not None
                 if clean_cond:
@@ -435,7 +440,6 @@ def decode(
         use_original_steps=False,
         callback=None,
     ):
-
         timesteps = (
             np.arange(self.ddpm_num_timesteps)
             if use_original_steps
diff --git a/videotuna/schedulers/ddpm.py b/videotuna/schedulers/ddpm.py
index 25c454f3..0054b823 100644
--- a/videotuna/schedulers/ddpm.py
+++ b/videotuna/schedulers/ddpm.py
@@ -1,21 +1,15 @@
-import logging
-import os
-import random
-from contextlib import contextmanager
 from functools import partial
 
 import numpy as np
-from einops import rearrange, repeat
-from tqdm import tqdm
-
 import torch
 import torch.nn as nn
-import torch.nn.functional as F
 
-from videotuna.utils.diffusion_utils import make_beta_schedule, rescale_zero_terminal_snr
-from videotuna.models.lvdm.modules.utils import (
+from videotuna.utils.diffusion_utils import (
+    make_beta_schedule,
+    rescale_zero_terminal_snr,
+)
+from videotuna.utils.sched_utils import (
     default,
-    disabled_train,
     exists,
     noise_like,
 )
@@ -36,7 +30,9 @@ def __init__(
         linear_end=2e-2,
         cosine_s=8e-3,
         given_betas=None,
-        v_posterior=0.0,  # weight for choosing posterior variance as sigma = (1-v) * beta_tilde + v * beta
+        # weight for choosing posterior variance as
+        # sigma = (1-v) * beta_tilde + v * beta
+        v_posterior=0.0,
         learn_logvar=False,
         logvar_init=0.0,
         rescale_betas_zero_snr=False,
@@ -107,10 +103,14 @@ def register_schedule(
         self.log_one_minus_alphas_cumprod = to_torch(np.log(1.0 - alphas_cumprod))
         if self.parameterization == "v":
             self.sqrt_recip_alphas_cumprod = torch.zeros_like(to_torch(alphas_cumprod))
-            self.sqrt_recipm1_alphas_cumprod = torch.zeros_like(to_torch(alphas_cumprod))
+            self.sqrt_recipm1_alphas_cumprod = torch.zeros_like(
+                to_torch(alphas_cumprod)
+            )
         else:
             self.sqrt_recip_alphas_cumprod = to_torch(np.sqrt(1.0 / alphas_cumprod))
-            self.sqrt_recipm1_alphas_cumprod = to_torch(np.sqrt(1.0 / alphas_cumprod - 1))
+            self.sqrt_recipm1_alphas_cumprod = to_torch(
+                np.sqrt(1.0 / alphas_cumprod - 1)
+            )
 
         # calculations for posterior q(x_{t-1} | x_t, x_0)
         posterior_variance = (1 - self.v_posterior) * betas * (
@@ -118,10 +118,17 @@ def register_schedule(
         ) / (1.0 - alphas_cumprod) + self.v_posterior * betas
         # above: equal to 1. / (1. / (1. - alpha_cumprod_tm1) + alpha_t / beta_t)
         self.posterior_variance = to_torch(posterior_variance)
-        # below: log calculation clipped because the posterior variance is 0 at the beginning of the diffusion chain
-        self.posterior_log_variance_clipped = to_torch(np.log(np.maximum(posterior_variance, 1e-20)))
-        self.posterior_mean_coef1 = to_torch(betas * np.sqrt(alphas_cumprod_prev) / (1.0 - alphas_cumprod))
-        self.posterior_mean_coef2 = to_torch((1.0 - alphas_cumprod_prev) * np.sqrt(alphas) / (1.0 - alphas_cumprod))
+        # below: log calculation clipped because the posterior variance
+        # is 0 at the beginning of the diffusion chain
+        self.posterior_log_variance_clipped = to_torch(
+            np.log(np.maximum(posterior_variance, 1e-20))
+        )
+        self.posterior_mean_coef1 = to_torch(
+            betas * np.sqrt(alphas_cumprod_prev) / (1.0 - alphas_cumprod)
+        )
+        self.posterior_mean_coef2 = to_torch(
+            (1.0 - alphas_cumprod_prev) * np.sqrt(alphas) / (1.0 - alphas_cumprod)
+        )
 
         if self.parameterization == "eps":
             lvlb_weights = self.betas**2 / (
@@ -175,7 +182,10 @@ def predict_start_from_noise(self, x_t, t, noise):
 
     def predict_start_from_z_and_v(self, x_t, t, v):
         # self.register_buffer('sqrt_alphas_cumprod', to_torch(np.sqrt(alphas_cumprod)))
-        # self.register_buffer('sqrt_one_minus_alphas_cumprod', to_torch(np.sqrt(1. - alphas_cumprod)))
+        # self.register_buffer(
+        #     'sqrt_one_minus_alphas_cumprod',
+        #     to_torch(np.sqrt(1. - alphas_cumprod)),
+        # )
         return (
             extract_into_tensor(self.sqrt_alphas_cumprod, t, x_t.shape) * x_t
             - extract_into_tensor(self.sqrt_one_minus_alphas_cumprod, t, x_t.shape) * v
@@ -325,4 +335,4 @@ def p_sample(
                 x0,
             )
         else:
-            return model_mean + nonzero_mask * (0.5 * model_log_variance).exp() * noise
\ No newline at end of file
+            return model_mean + nonzero_mask * (0.5 * model_log_variance).exp() * noise
diff --git a/videotuna/schedulers/diffusion_schedulers.py b/videotuna/schedulers/diffusion_schedulers.py
index 2a0efa43..653a4e2b 100644
--- a/videotuna/schedulers/diffusion_schedulers.py
+++ b/videotuna/schedulers/diffusion_schedulers.py
@@ -5,10 +5,12 @@
 import torch
 import torch.nn as nn
 
-from videotuna.utils.diffusion_utils import make_beta_schedule, rescale_zero_terminal_snr
-from videotuna.models.lvdm.modules.utils import (
+from videotuna.utils.diffusion_utils import (
+    make_beta_schedule,
+    rescale_zero_terminal_snr,
+)
+from videotuna.utils.sched_utils import (
     default,
-    disabled_train,
     exists,
     extract_into_tensor,
     noise_like,
@@ -24,7 +26,9 @@ def __init__(
         linear_end=2e-2,
         cosine_s=8e-3,
         given_betas=None,
-        v_posterior=0.0,  # weight for choosing posterior variance as sigma = (1-v) * beta_tilde + v * beta
+        # weight for choosing posterior variance as
+        # sigma = (1-v) * beta_tilde + v * beta
+        v_posterior=0.0,
         learn_logvar=False,
         logvar_init=0.0,
         rescale_betas_zero_snr=False,
@@ -120,7 +124,8 @@ def register_schedule(
         ) / (1.0 - alphas_cumprod) + self.v_posterior * betas
         # above: equal to 1. / (1. / (1. - alpha_cumprod_tm1) + alpha_t / beta_t)
         self.register_buffer("posterior_variance", to_torch(posterior_variance))
-        # below: log calculation clipped because the posterior variance is 0 at the beginning of the diffusion chain
+        # below: log calculation clipped because the posterior variance
+        # is 0 at the beginning of the diffusion chain
         self.register_buffer(
             "posterior_log_variance_clipped",
             to_torch(np.log(np.maximum(posterior_variance, 1e-20))),
@@ -189,7 +194,10 @@ def predict_start_from_noise(self, x_t, t, noise):
 
     def predict_start_from_z_and_v(self, x_t, t, v):
         # self.register_buffer('sqrt_alphas_cumprod', to_torch(np.sqrt(alphas_cumprod)))
-        # self.register_buffer('sqrt_one_minus_alphas_cumprod', to_torch(np.sqrt(1. - alphas_cumprod)))
+        # self.register_buffer(
+        #     'sqrt_one_minus_alphas_cumprod',
+        #     to_torch(np.sqrt(1. - alphas_cumprod)),
+        # )
         return (
             extract_into_tensor(self.sqrt_alphas_cumprod, t, x_t.shape) * x_t
             - extract_into_tensor(self.sqrt_one_minus_alphas_cumprod, t, x_t.shape) * v
diff --git a/videotuna/schedulers/flow_matching.py b/videotuna/schedulers/flow_matching.py
index d6d02195..b72eaa81 100644
--- a/videotuna/schedulers/flow_matching.py
+++ b/videotuna/schedulers/flow_matching.py
@@ -1,10 +1,18 @@
 import torch
 
 
-
-class FlowMatchScheduler():
-
-    def __init__(self, num_inference_steps=100, num_train_timesteps=1000, shift=3.0, sigma_max=1.0, sigma_min=0.003/1.002, inverse_timesteps=False, extra_one_step=False, reverse_sigmas=False):
+class FlowMatchScheduler:
+    def __init__(
+        self,
+        num_inference_steps=100,
+        num_train_timesteps=1000,
+        shift=3.0,
+        sigma_max=1.0,
+        sigma_min=0.003 / 1.002,
+        inverse_timesteps=False,
+        extra_one_step=False,
+        reverse_sigmas=False,
+    ):
         self.num_train_timesteps = num_train_timesteps
         self.shift = shift
         self.sigma_max = sigma_max
@@ -14,15 +22,29 @@ def __init__(self, num_inference_steps=100, num_train_timesteps=1000, shift=3.0,
         self.reverse_sigmas = reverse_sigmas
         self.set_timesteps(num_inference_steps)
 
-
-    def set_timesteps(self, num_inference_steps=100, denoising_strength=1.0, training=False, shift=None):
-        if shift is not None:
+    def set_timesteps(
+        self,
+        num_inference_steps=100,
+        denoising_strength=1.0,
+        training=False,
+        shift=None,
+        time_shift=None,
+        device=None,
+    ):
+        if time_shift is not None:
+            shift = time_shift
             self.shift = shift
-        sigma_start = self.sigma_min + (self.sigma_max - self.sigma_min) * denoising_strength
+        sigma_start = (
+            self.sigma_min + (self.sigma_max - self.sigma_min) * denoising_strength
+        )
         if self.extra_one_step:
-            self.sigmas = torch.linspace(sigma_start, self.sigma_min, num_inference_steps + 1)[:-1]
+            self.sigmas = torch.linspace(
+                sigma_start, self.sigma_min, num_inference_steps + 1
+            )[:-1]
         else:
-            self.sigmas = torch.linspace(sigma_start, self.sigma_min, num_inference_steps)
+            self.sigmas = torch.linspace(
+                sigma_start, self.sigma_min, num_inference_steps
+            )
         if self.inverse_timesteps:
             self.sigmas = torch.flip(self.sigmas, dims=[0])
         self.sigmas = self.shift * self.sigmas / (1 + (self.shift - 1) * self.sigmas)
@@ -30,13 +52,14 @@ def set_timesteps(self, num_inference_steps=100, denoising_strength=1.0, trainin
             self.sigmas = 1 - self.sigmas
         self.timesteps = self.sigmas * self.num_train_timesteps
         if training:
-            x = self.timesteps
-            y = torch.exp(-2 * ((x - num_inference_steps / 2) / num_inference_steps) ** 2)
+            x = self.timesteps.float()
+            steps = float(num_inference_steps)
+            shifted = (x - (steps / 2.0)) / steps
+            y = torch.exp(-2.0 * shifted * shifted)
             y_shifted = y - y.min()
             bsmntw_weighing = y_shifted * (num_inference_steps / y_shifted.sum())
             self.linear_timesteps_weights = bsmntw_weighing
 
-
     def step(self, model_output, timestep, sample, to_final=False, **kwargs):
         if isinstance(timestep, torch.Tensor):
             timestep = timestep.cpu()
@@ -48,7 +71,6 @@ def step(self, model_output, timestep, sample, to_final=False, **kwargs):
             sigma_ = self.sigmas[timestep_id + 1]
         prev_sample = sample + model_output * (sigma_ - sigma)
         return prev_sample
-    
 
     def return_to_timestep(self, timestep, sample, sample_stablized):
         if isinstance(timestep, torch.Tensor):
@@ -57,8 +79,7 @@ def return_to_timestep(self, timestep, sample, sample_stablized):
         sigma = self.sigmas[timestep_id]
         model_output = (sample - sample_stablized) / sigma
         return model_output
-    
-    
+
     def add_noise(self, original_samples, noise, timestep):
         if isinstance(timestep, torch.Tensor):
             timestep = timestep.cpu()
@@ -66,14 +87,14 @@ def add_noise(self, original_samples, noise, timestep):
         sigma = self.sigmas[timestep_id]
         sample = (1 - sigma) * original_samples + sigma * noise
         return sample
-    
 
     def training_target(self, sample, noise, timestep):
         target = noise - sample
         return target
-    
 
     def training_weight(self, timestep):
-        timestep_id = torch.argmin((self.timesteps - timestep.to(self.timesteps.device)).abs())
+        timestep_id = torch.argmin(
+            (self.timesteps - timestep.to(self.timesteps.device)).abs()
+        )
         weights = self.linear_timesteps_weights[timestep_id]
         return weights
diff --git a/videotuna/settings.py b/videotuna/settings.py
new file mode 100644
index 00000000..dd4d34aa
--- /dev/null
+++ b/videotuna/settings.py
@@ -0,0 +1,156 @@
+"""Centralized PrivTune environment settings (VIDEOTUNA_* prefix)."""
+
+from __future__ import annotations
+
+from typing import Literal
+
+from loguru import logger
+from pydantic import field_validator
+from pydantic_settings import BaseSettings, SettingsConfigDict
+
+ComputeBackendSetting = Literal["auto", "cuda", "rocm", "cpu"]
+CpuModeSetting = Literal["off", "smoke", "force"]
+AttnBackendSetting = Literal["auto", "flash", "sdpa", "eager"]
+TorchCompileModeSetting = Literal["reduce-overhead", "max-autotune"]
+MetricsOwnerSetting = Literal["script", "flow"]
+MetricsBackendSetting = Literal["tensorboard", "trackio"]
+
+ENV_PREFIX = "VIDEOTUNA_"
+ENV_COMPUTE_BACKEND = f"{ENV_PREFIX}COMPUTE_BACKEND"
+ENV_CPU_MODE = f"{ENV_PREFIX}CPU_MODE"
+ENV_ALLOW_CPU_INFERENCE = f"{ENV_PREFIX}ALLOW_CPU_INFERENCE"
+ENV_ATTN_BACKEND = f"{ENV_PREFIX}ATTN_BACKEND"
+ENV_ATTN_BACKEND_STRICT = f"{ENV_PREFIX}ATTN_BACKEND_STRICT"
+ENV_TORCH_COMPILE = f"{ENV_PREFIX}TORCH_COMPILE"
+ENV_TORCH_COMPILE_MODE = f"{ENV_PREFIX}TORCH_COMPILE_MODE"
+ENV_METRICS_OWNER = f"{ENV_PREFIX}METRICS_OWNER"
+ENV_METRICS_BACKEND = f"{ENV_PREFIX}METRICS_BACKEND"
+ENV_TRACKIO_SPACE_ID = f"{ENV_PREFIX}TRACKIO_SPACE_ID"
+ENV_TRACKIO_PROJECT = f"{ENV_PREFIX}TRACKIO_PROJECT"
+ENV_LOG_LEVEL = f"{ENV_PREFIX}LOG_LEVEL"
+ENV_BENCH_MODEL = f"{ENV_PREFIX}BENCH_MODEL"
+
+_VALID_COMPILE_MODES = ("reduce-overhead", "max-autotune")
+
+
+def _parse_bool01(value: object) -> bool:
+    """Parse 0/1 env strings; only ``1`` is truthy (legacy os.environ semantics)."""
+    if isinstance(value, bool):
+        return value
+    if value is None:
+        return False
+    text = str(value).strip()
+    if text == "1":
+        return True
+    if text == "0":
+        return False
+    raise ValueError(f"Expected '0' or '1', got {value!r}")
+
+
+def _normalize_lower(value: object) -> object:
+    if isinstance(value, str):
+        return value.strip().lower()
+    return value
+
+
+class PrivTuneSettings(BaseSettings):
+    """Load all VIDEOTUNA_* environment variables from one settings object."""
+
+    model_config = SettingsConfigDict(
+        env_prefix=ENV_PREFIX,
+        env_file=".env",
+        env_file_encoding="utf-8",
+        extra="ignore",
+        case_sensitive=False,
+    )
+
+    compute_backend: ComputeBackendSetting = "auto"
+    cpu_mode: CpuModeSetting = "off"
+    allow_cpu_inference: bool = False
+    attn_backend: AttnBackendSetting = "auto"
+    attn_backend_strict: bool = False
+    torch_compile: bool = False
+    torch_compile_mode: TorchCompileModeSetting = "reduce-overhead"
+    metrics_owner: MetricsOwnerSetting = "script"
+    metrics_backend: MetricsBackendSetting = "tensorboard"
+    trackio_space_id: str | None = None
+    trackio_project: str | None = None
+    log_level: str = "INFO"
+    bench_model: str | None = None
+
+    @field_validator(
+        "compute_backend",
+        "attn_backend",
+        "metrics_owner",
+        "metrics_backend",
+        mode="before",
+    )
+    @classmethod
+    def _normalize_string_literals(cls, value: object) -> object:
+        normalized = _normalize_lower(value)
+        if normalized == "mps":
+            raise ValueError(
+                "VIDEOTUNA_COMPUTE_BACKEND=mps is not supported. "
+                "PrivTune supports auto, cuda, rocm, and cpu. "
+                "For config validation on Apple Silicon, use "
+                "VIDEOTUNA_CPU_MODE=smoke or --cpu-smoke."
+            )
+        return normalized
+
+    @field_validator("cpu_mode", mode="before")
+    @classmethod
+    def _normalize_cpu_mode(cls, value: object) -> object:
+        if isinstance(value, str):
+            text = value.strip().lower()
+            return text if text else "off"
+        return value
+
+    @field_validator(
+        "allow_cpu_inference",
+        "attn_backend_strict",
+        "torch_compile",
+        mode="before",
+    )
+    @classmethod
+    def _parse_bool01_fields(cls, value: object) -> bool:
+        return _parse_bool01(value)
+
+    @field_validator("torch_compile_mode", mode="before")
+    @classmethod
+    def _normalize_compile_mode(cls, value: object) -> str:
+        if value is None:
+            return "reduce-overhead"
+        mode = str(value).strip()
+        if mode in _VALID_COMPILE_MODES:
+            return mode
+        logger.warning(
+            "Invalid {}={!r}; using reduce-overhead",
+            ENV_TORCH_COMPILE_MODE,
+            mode,
+        )
+        return "reduce-overhead"
+
+    @field_validator("log_level", mode="before")
+    @classmethod
+    def _normalize_log_level(cls, value: object) -> str:
+        if value is None:
+            return "INFO"
+        return str(value).strip().upper()
+
+    @field_validator(
+        "bench_model",
+        "trackio_space_id",
+        "trackio_project",
+        mode="before",
+    )
+    @classmethod
+    def _normalize_optional_string(cls, value: object) -> str | None:
+        if value is None:
+            return None
+        text = str(value).strip()
+        return text or None
+
+
+def get_settings() -> PrivTuneSettings:
+    """Return settings loaded from the current environment (no cache)."""
+    return PrivTuneSettings()
diff --git a/videotuna/testing/__init__.py b/videotuna/testing/__init__.py
new file mode 100644
index 00000000..0da06882
--- /dev/null
+++ b/videotuna/testing/__init__.py
@@ -0,0 +1 @@
+"""Test helpers and synthetic fixtures (not part of the production API)."""
diff --git a/videotuna/testing/wan_lora_ckpt.py b/videotuna/testing/wan_lora_ckpt.py
new file mode 100644
index 00000000..8487ce9d
--- /dev/null
+++ b/videotuna/testing/wan_lora_ckpt.py
@@ -0,0 +1,33 @@
+"""Synthetic Wan 2.1 native LoRA checkpoints for tests and bridge spikes."""
+
+from __future__ import annotations
+
+from pathlib import Path
+
+import torch
+
+
+def build_synthetic_wan_lora_ckpt(
+    path: Path,
+    *,
+    num_blocks: int = 2,
+    rank: int = 16,
+) -> Path:
+    """Write a synthetic denoiser ckpt with production-style key names."""
+    dim_in, dim_mid, dim_out = 5120, 13824, 5120
+    state: dict[str, torch.Tensor] = {}
+    for i in range(num_blocks):
+        for p in ("q", "k", "v", "o"):
+            state[f"denoiser.blocks.{i}.self_attn.{p}.lora_A.weight"] = torch.randn(
+                rank, dim_in
+            )
+            state[f"denoiser.blocks.{i}.self_attn.{p}.lora_B.weight"] = torch.randn(
+                dim_in, rank
+            )
+        state[f"denoiser.blocks.{i}.ffn.0.lora_A.weight"] = torch.randn(rank, dim_in)
+        state[f"denoiser.blocks.{i}.ffn.0.lora_B.weight"] = torch.randn(dim_mid, rank)
+        state[f"denoiser.blocks.{i}.ffn.2.lora_A.weight"] = torch.randn(rank, dim_mid)
+        state[f"denoiser.blocks.{i}.ffn.2.lora_B.weight"] = torch.randn(dim_out, rank)
+    path.parent.mkdir(parents=True, exist_ok=True)
+    torch.save({"state_dict": state}, path)
+    return path
diff --git a/videotuna/third_party/flux/caching/memory.py b/videotuna/third_party/flux/caching/memory.py
deleted file mode 100644
index 0ff727e3..00000000
--- a/videotuna/third_party/flux/caching/memory.py
+++ /dev/null
@@ -1,14 +0,0 @@
-def reclaim_memory():
-    import gc
-
-    import torch
-
-    if torch.cuda.is_available():
-        gc.collect()
-        torch.cuda.empty_cache()
-        torch.cuda.ipc_collect()
-
-    if torch.backends.mps.is_available():
-        torch.mps.empty_cache()
-        torch.mps.synchronize()
-        gc.collect()
diff --git a/videotuna/third_party/flux/caching/text_embeds.py b/videotuna/third_party/flux/caching/text_embeds.py
deleted file mode 100644
index 171b8feb..00000000
--- a/videotuna/third_party/flux/caching/text_embeds.py
+++ /dev/null
@@ -1,1428 +0,0 @@
-import gc
-import hashlib
-import logging
-import os
-import queue
-import time
-from concurrent.futures import ThreadPoolExecutor
-from queue import Queue
-from threading import Thread
-
-import torch
-from tqdm import tqdm
-
-from videotuna.third_party.flux.data_backend.base import BaseDataBackend
-from videotuna.third_party.flux.prompts import PromptHandler
-from videotuna.third_party.flux.training.multi_process import _get_rank as get_rank
-from videotuna.third_party.flux.training.multi_process import rank_info, should_log
-from videotuna.third_party.flux.training.state_tracker import StateTracker
-from videotuna.third_party.flux.webhooks.mixin import WebhookMixin
-
-logger = logging.getLogger("TextEmbeddingCache")
-if should_log():
-    logger.setLevel(os.environ.get("SIMPLETUNER_LOG_LEVEL", "INFO"))
-else:
-    logger.setLevel("ERROR")
-
-
-def _encode_sd3_prompt_with_t5(
-    text_encoder,
-    tokenizer,
-    prompt=None,
-    num_images_per_prompt=1,
-    device=None,
-    zero_padding_tokens: bool = True,
-    max_sequence_length: int = 77,
-):
-    prompt = [prompt] if isinstance(prompt, str) else prompt
-    batch_size = len(prompt)
-
-    text_inputs = tokenizer(
-        prompt,
-        padding="max_length",
-        max_length=max_sequence_length,
-        truncation=True,
-        add_special_tokens=True,
-        return_tensors="pt",
-    )
-    text_input_ids = text_inputs.input_ids
-    prompt_embeds = text_encoder(text_input_ids.to(device))[0]
-
-    dtype = text_encoder.dtype
-    prompt_embeds = prompt_embeds.to(dtype=dtype, device=device)
-
-    _, seq_len, _ = prompt_embeds.shape
-
-    # duplicate text embeddings and attention mask for each generation per prompt, using mps friendly method
-    prompt_embeds = prompt_embeds.repeat(1, num_images_per_prompt, 1)
-    prompt_embeds = prompt_embeds.view(batch_size * num_images_per_prompt, seq_len, -1)
-    attention_mask = text_inputs.attention_mask.to(device)
-
-    if zero_padding_tokens:
-        # for some reason, SAI's reference code doesn't bother to mask the prompt embeddings.
-        # this can lead to a problem where the model fails to represent short and long prompts equally well.
-        # additionally, the model learns the bias of the prompt embeds' noise.
-        return prompt_embeds * attention_mask.unsqueeze(-1).expand(prompt_embeds.shape)
-    else:
-        return prompt_embeds
-
-
-def _encode_sd3_prompt_with_clip(
-    text_encoder,
-    tokenizer,
-    prompt: str,
-    device=None,
-    num_images_per_prompt: int = 1,
-    max_token_length: int = 77,
-):
-    prompt = [prompt] if isinstance(prompt, str) else prompt
-    batch_size = len(prompt)
-
-    text_inputs = tokenizer(
-        prompt,
-        padding="max_length",
-        max_length=max_token_length,
-        truncation=True,
-        return_tensors="pt",
-    )
-    text_input_ids = text_inputs.input_ids
-    prompt_embeds = text_encoder(text_input_ids.to(device), output_hidden_states=True)
-
-    pooled_prompt_embeds = prompt_embeds[0]
-    prompt_embeds = prompt_embeds.hidden_states[-2]
-    prompt_embeds = prompt_embeds.to(dtype=text_encoder.dtype, device=device)
-
-    _, seq_len, _ = prompt_embeds.shape
-    # duplicate text embeddings for each generation per prompt, using mps friendly method
-    prompt_embeds = prompt_embeds.repeat(1, num_images_per_prompt, 1)
-    prompt_embeds = prompt_embeds.view(batch_size * num_images_per_prompt, seq_len, -1)
-
-    return prompt_embeds, pooled_prompt_embeds
-
-
-class TextEmbeddingCache(WebhookMixin):
-    prompts = {}
-
-    def __init__(
-        self,
-        id: str,
-        data_backend: BaseDataBackend,
-        text_encoders,
-        tokenizers,
-        accelerator,
-        webhook_progress_interval: int = 100,
-        cache_dir: str = "cache",
-        model_type: str = "sdxl",
-        prompt_handler: PromptHandler = None,
-        write_batch_size: int = 128,
-        read_batch_size: int = 25,
-        process_queue_size: int = 16,
-        text_encoder_batch_size: int = 4,
-        max_workers: int = 32,
-    ):
-        self.id = id
-        if data_backend.id != id:
-            raise ValueError(
-                f"TextEmbeddingCache received incorrect data_backend: {data_backend}"
-            )
-        self.should_abort = False
-        self.data_backend = data_backend
-        self.text_encoders = text_encoders
-        self.tokenizers = tokenizers
-        self.accelerator = accelerator
-        self.cache_dir = cache_dir
-        self.model_type = model_type
-        self.pipeline = None
-        if self.model_type == "flux":
-            from diffusers.pipelines.flux import FluxPipeline
-
-            self.pipeline = FluxPipeline.from_pretrained(
-                pretrained_model_name_or_path=StateTracker.get_args().pretrained_model_name_or_path,
-                text_encoder=text_encoders[0],
-                text_encoder_2=text_encoders[1],
-                tokenizer=tokenizers[0],
-                tokenizer_2=tokenizers[1],
-                transformer=None,
-                vae=None,
-            )
-        self.prompt_handler = prompt_handler
-        self.write_batch_size = write_batch_size
-        self.read_batch_size = read_batch_size
-        self.process_queue_size = process_queue_size
-        self.write_thread_bar = None
-        self.text_encoder_batch_size = text_encoder_batch_size
-        self.max_workers = max_workers
-        self.rank_info = rank_info()
-        if self.data_backend.type == "local":
-            self.cache_dir = os.path.abspath(self.cache_dir)
-        self.data_backend.create_directory(self.cache_dir)
-        self.write_queue = Queue()
-        self.process_write_batches = True
-        self.batch_write_thread = Thread(
-            target=self.batch_write_embeddings,
-            name=f"batch_write_thread_{self.id}",
-            daemon=True,
-        )
-        self.batch_write_thread.start()
-        self.webhook_progress_interval = webhook_progress_interval
-
-    def debug_log(self, msg: str):
-        logger.debug(f"{self.rank_info}(id={self.id}) {msg}")
-
-    def create_hash(self, caption):
-        if caption is None:
-            # It's gross, but some images do not have captions.
-            caption = ""
-        # Precomputed part of the format string
-        hash_format = f"-{self.model_type}"
-
-        # Reuse the hash object
-        md5_hash = hashlib.md5()
-        md5_hash.update(str(caption).encode())
-        # logger.debug(f"Hashing caption: {caption}")
-        result = md5_hash.hexdigest() + hash_format
-        # logger.debug(f"-> {result}")
-        return result
-
-    def hash_prompt_with_path(self, caption):
-        return os.path.join(self.cache_dir, self.create_hash(caption) + ".pt")
-
-    def hash_prompt(self, caption):
-        return self.create_hash(caption) + ".pt"
-
-    def discover_all_files(self):
-        """Identify all files in the data backend."""
-        logger.info(
-            f"{self.rank_info}(id={self.id}) Listing all text embed cache entries"
-        )
-        # This isn't returned, because we merely check if it's stored, or, store it.
-        (
-            StateTracker.get_text_cache_files(data_backend_id=self.id)
-            or StateTracker.set_text_cache_files(
-                self.data_backend.list_files(
-                    instance_data_dir=self.cache_dir,
-                    file_extensions=["pt"],
-                ),
-                data_backend_id=self.id,
-            )
-        )
-        self.debug_log(" -> done listing all text embed cache entries")
-
-    def save_to_cache(self, filename, embeddings):
-        """Add write requests to the queue instead of writing directly."""
-        if not self.batch_write_thread.is_alive():
-            logger.debug("Restarting background write thread.")
-            # Start the thread again.
-            self.process_write_batches = True
-            self.batch_write_thread = Thread(target=self.batch_write_embeddings)
-            self.batch_write_thread.start()
-        self.write_queue.put((embeddings, filename))
-        logger.debug(
-            f"save_to_cache called for {filename}, write queue has {self.write_queue.qsize()} items, and the write thread's status: {self.batch_write_thread.is_alive()}"
-        )
-
-    def batch_write_embeddings(self):
-        """Process write requests in batches."""
-        batch = []
-        written_elements = 0
-        while True:
-            try:
-                # Block until an item is available or timeout occurs
-                first_item = self.write_queue.get(timeout=1)
-                batch = [first_item]
-
-                # Try to get more items without blocking
-                while (
-                    not self.write_queue.empty() and len(batch) < self.write_batch_size
-                ):
-                    logger.debug("Retrieving more items from the queue.")
-                    items = self.write_queue.get_nowait()
-                    batch.append(items)
-                    logger.debug(f"Batch now contains {len(batch)} items.")
-
-                self.process_write_batch(batch)
-                self.write_thread_bar.update(len(batch))
-                logger.debug("Processed batch write.")
-                written_elements += len(batch)
-
-            except queue.Empty:
-                # Timeout occurred, no items were ready
-                if not self.process_write_batches:
-                    if len(batch) > 0:
-                        self.process_write_batch(batch)
-                        self.write_thread_bar.update(len(batch))
-                    logger.debug(
-                        f"Exiting batch write thread, no more work to do after writing {written_elements} elements"
-                    )
-                    break
-                logger.debug(
-                    f"Queue is empty. Retrieving new entries. Should retrieve? {self.process_write_batches}"
-                )
-                pass
-            except Exception:
-                logger.exception("An error occurred while writing embeddings to disk.")
-        logger.debug("Exiting background batch write thread.")
-
-    def process_write_batch(self, batch):
-        """Write a batch of embeddings to the cache."""
-        logger.debug(f"Writing {len(batch)} items to disk")
-        logger.debug(f"Batch: {batch}")
-        with ThreadPoolExecutor(max_workers=self.max_workers) as executor:
-            futures = [
-                executor.submit(self.data_backend.torch_save, *args) for args in batch
-            ]
-            for future in futures:
-                future.result()  # Wait for all writes to complete
-        logger.debug(f"Completed write batch of {len(batch)} items")
-
-    def load_from_cache(self, filename):
-        result = self.data_backend.torch_load(filename)
-        return result
-
-    def encode_flux_prompt(
-        self,
-        text_encoders,
-        tokenizers,
-        prompt: str,
-        is_validation: bool = False,
-        zero_padding_tokens: bool = True,
-    ):
-        """
-        Encode a prompt for a Flux model.
-
-        Args:
-            text_encoders: List of text encoders.
-            tokenizers: List of tokenizers.
-            prompt: The prompt to encode.
-            num_images_per_prompt: The number of images to generate per prompt.
-            is_validation: Whether the prompt is for validation. No-op for SD3.
-
-        Returns:
-            Tuple of (prompt_embeds, pooled_prompt_embeds).
-        """
-        from videotuna.third_party.flux.models.flux import FluxPipeline
-
-        pipe = FluxPipeline(
-            self.pipeline.scheduler,
-            self.pipeline.vae,
-            self.pipeline.text_encoder,
-            self.pipeline.tokenizer,
-            self.pipeline.text_encoder_2,
-            self.pipeline.tokenizer_2,
-            self.pipeline.transformer,
-        )
-
-        prompt_embeds, pooled_prompt_embeds, time_ids, masks = pipe.encode_prompt(
-            prompt=prompt,
-            prompt_2=prompt,
-            device=self.accelerator.device,
-            max_sequence_length=StateTracker.get_args().tokenizer_max_length,
-        )
-        if zero_padding_tokens:
-            # we can zero the padding tokens if we're just going to mask them later anyway.
-            prompt_embeds = prompt_embeds * masks.to(
-                device=prompt_embeds.device
-            ).unsqueeze(-1).expand(prompt_embeds.shape)
-
-        return prompt_embeds, pooled_prompt_embeds, time_ids, masks
-
-    # Adapted from pipelines.StableDiffusion3Pipeline.encode_prompt
-    def encode_sd3_prompt(
-        self,
-        text_encoders,
-        tokenizers,
-        prompt: str,
-        is_validation: bool = False,
-        zero_padding_tokens: bool = False,
-    ):
-        """
-        Encode a prompt for an SD3 model.
-
-        Args:
-            text_encoders: List of text encoders.
-            tokenizers: List of tokenizers.
-            prompt: The prompt to encode.
-            num_images_per_prompt: The number of images to generate per prompt.
-            is_validation: Whether the prompt is for validation. No-op for SD3.
-
-        Returns:
-            Tuple of (prompt_embeds, pooled_prompt_embeds).
-        """
-        prompt = [prompt] if isinstance(prompt, str) else prompt
-        num_images_per_prompt = 1
-
-        clip_tokenizers = tokenizers[:2]
-        clip_text_encoders = text_encoders[:2]
-
-        clip_prompt_embeds_list = []
-        clip_pooled_prompt_embeds_list = []
-        for tokenizer, text_encoder in zip(clip_tokenizers, clip_text_encoders):
-            prompt_embeds, pooled_prompt_embeds = _encode_sd3_prompt_with_clip(
-                text_encoder=text_encoder,
-                tokenizer=tokenizer,
-                prompt=prompt,
-                device=self.accelerator.device,
-                num_images_per_prompt=num_images_per_prompt,
-            )
-            clip_prompt_embeds_list.append(prompt_embeds)
-            clip_pooled_prompt_embeds_list.append(pooled_prompt_embeds)
-
-        clip_prompt_embeds = torch.cat(clip_prompt_embeds_list, dim=-1)
-        pooled_prompt_embeds = torch.cat(clip_pooled_prompt_embeds_list, dim=-1)
-
-        t5_prompt_embed = _encode_sd3_prompt_with_t5(
-            text_encoders[-1],
-            tokenizers[-1],
-            prompt=prompt,
-            num_images_per_prompt=num_images_per_prompt,
-            device=self.accelerator.device,
-            zero_padding_tokens=zero_padding_tokens,
-            max_sequence_length=StateTracker.get_args().tokenizer_max_length,
-        )
-
-        clip_prompt_embeds = torch.nn.functional.pad(
-            clip_prompt_embeds,
-            (0, t5_prompt_embed.shape[-1] - clip_prompt_embeds.shape[-1]),
-        )
-        prompt_embeds = torch.cat([clip_prompt_embeds, t5_prompt_embed], dim=-2)
-
-        return prompt_embeds, pooled_prompt_embeds
-
-    def encode_legacy_prompt(self, text_encoder, tokenizer, prompt):
-        input_tokens = tokenizer(
-            PromptHandler.filter_caption(self.data_backend, prompt),
-            truncation=True,
-            padding="max_length",
-            max_length=tokenizer.model_max_length,
-            return_tensors="pt",
-        ).input_ids.to(self.accelerator.device)
-        output = text_encoder(input_tokens)[0]
-        # self.debug_log(f"Legacy prompt shape: {output.shape}")
-        # self.debug_log(f"Legacy prompt encoded: {output}")
-        return output
-
-    # Adapted from pipelines.StableDiffusionXLPipeline.encode_sdxl_prompt
-    def encode_sdxl_prompt(
-        self,
-        text_encoders,
-        tokenizers,
-        prompt,
-        is_validation: bool = False,
-    ):
-        prompt_embeds_list = []
-
-        emitted_warning = False
-        try:
-            for tokenizer, text_encoder in zip(tokenizers, text_encoders):
-                if tokenizer is None or text_encoder is None:
-                    # SDXL Refiner only has one text encoder and tokenizer
-                    continue
-                if type(prompt) is not str and type(prompt) is not list:
-                    prompt = str(prompt)
-                max_seq_len = 256 if self.model_type == "kolors" else 77
-                text_inputs = tokenizer(
-                    prompt,
-                    padding="max_length",
-                    truncation=True,
-                    return_tensors="pt",
-                    max_length=max_seq_len,
-                )
-                untruncated_ids = tokenizer(
-                    prompt,
-                    padding="longest",
-                    return_tensors="pt",
-                    max_length=max_seq_len,
-                ).input_ids
-
-                if untruncated_ids.shape[
-                    -1
-                ] > tokenizer.model_max_length and not torch.equal(
-                    text_inputs.input_ids, untruncated_ids
-                ):
-                    removed_text = tokenizer.batch_decode(
-                        untruncated_ids[:, tokenizer.model_max_length - 1 : -1]
-                    )
-                    if not emitted_warning:
-                        # Only print this once. It's a bit spammy otherwise.
-                        emitted_warning = True
-                        logger.warning(
-                            f"The following part of your input was truncated because CLIP can only handle sequences up to {tokenizer.model_max_length} tokens: {removed_text}"
-                        )
-                if self.model_type == "sdxl":
-                    prompt_embeds_output = text_encoder(
-                        text_inputs.input_ids.to(self.accelerator.device),
-                        output_hidden_states=True,
-                    )
-                    # We are always interested in the pooled output of the final text encoder
-                    pooled_prompt_embeds = prompt_embeds_output[0]
-                    prompt_embeds = prompt_embeds_output.hidden_states[-2]
-                elif self.model_type == "kolors":
-                    # we pass the attention mask into the text encoder. it transforms the embeds but does not attend to them.
-                    # unfortunately, kolors does not return the attention mask for later use by the U-net to avoid attending to the padding tokens.
-                    prompt_embeds_output = text_encoder(
-                        input_ids=text_inputs["input_ids"].to(self.accelerator.device),
-                        attention_mask=text_inputs["attention_mask"].to(
-                            self.accelerator.device
-                        ),
-                        position_ids=text_inputs["position_ids"],
-                        output_hidden_states=True,
-                    )
-                    # the ChatGLM encoder output is hereby mangled in fancy ways for Kolors to be useful.
-                    prompt_embeds = (
-                        prompt_embeds_output.hidden_states[-2].permute(1, 0, 2).clone()
-                    )
-                    # [max_sequence_length, batch, hidden_size] -> [batch, hidden_size]
-                    pooled_prompt_embeds = prompt_embeds_output.hidden_states[-1][
-                        -1, :, :
-                    ].clone()
-                else:
-                    raise ValueError(f"Unknown model type: {self.model_type}")
-                bs_embed, seq_len, _ = prompt_embeds.shape
-                prompt_embeds = prompt_embeds.view(bs_embed, seq_len, -1)
-
-                # Clear out anything we moved to the text encoder device
-                text_inputs.input_ids.to("cpu")
-                del prompt_embeds_output
-                del text_inputs
-
-                prompt_embeds_list.append(prompt_embeds)
-        except Exception as e:
-            import traceback
-
-            logger.error(
-                f"Failed to encode prompt: {prompt}\n-> error: {e}\n-> traceback: {traceback.format_exc()}"
-            )
-            raise e
-
-        prompt_embeds = torch.cat(prompt_embeds_list, dim=-1)
-        return prompt_embeds, pooled_prompt_embeds
-
-    # Adapted from pipelines.StableDiffusionXLPipeline.encode_prompt
-    def encode_sdxl_prompts(
-        self,
-        text_encoders,
-        tokenizers,
-        prompts,
-        is_validation: bool = False,
-    ):
-        prompt_embeds_all = []
-        pooled_prompt_embeds_all = []
-
-        for prompt in prompts:
-            prompt_embeds, pooled_prompt_embeds = self.encode_sdxl_prompt(
-                text_encoders, tokenizers, prompt, is_validation
-            )
-            prompt_embeds_all.append(prompt_embeds)
-            pooled_prompt_embeds_all.append(pooled_prompt_embeds)
-
-        return torch.stack(prompt_embeds_all).squeeze(dim=1), torch.stack(
-            pooled_prompt_embeds_all
-        ).squeeze(dim=1)
-
-    def encode_prompt(self, prompt: str, is_validation: bool = False):
-        if self.model_type == "sdxl" or self.model_type == "kolors":
-            return self.encode_sdxl_prompt(
-                self.text_encoders, self.tokenizers, prompt, is_validation
-            )
-        elif self.model_type == "sd3":
-            return self.encode_sd3_prompt(
-                self.text_encoders,
-                self.tokenizers,
-                prompt,
-                is_validation,
-                zero_padding_tokens=(
-                    True if StateTracker.get_args().t5_padding == "zero" else False
-                ),
-            )
-        else:
-            return self.encode_legacy_prompt(
-                self.text_encoders[0], self.tokenizers[0], prompt
-            )
-
-    def tokenize_t5_prompt(self, prompt, tokenizer_max_length=None):
-        if tokenizer_max_length is not None:
-            max_length = tokenizer_max_length
-        else:
-            # prevent runaway token length sizes.
-            # huge captions aren't very helpful, and if you want them, use --tokenizer_max_length
-            max_length = 144
-
-        text_inputs = self.tokenizers[0](
-            prompt,
-            truncation=True,
-            padding="max_length",
-            max_length=max_length,
-            return_tensors="pt",
-        )
-
-        return text_inputs
-
-    def encode_t5_prompt(self, input_ids, attention_mask):
-        text_input_ids = input_ids.to(self.text_encoders[0].device)
-        attention_mask = attention_mask.to(self.text_encoders[0].device)
-        prompt_embeds = self.text_encoders[0](
-            text_input_ids,
-            attention_mask=attention_mask,
-            return_dict=False,
-        )[0]
-        prompt_embeds = prompt_embeds.to("cpu")
-
-        return prompt_embeds
-
-    def compute_t5_prompt(self, prompt: str):
-        """
-        Tokenise, encode, optionally mask, and then return a prompt_embed for a T5 model.
-
-        Args:
-            prompt: The prompt to encode.
-        Returns:
-            Tuple of (prompt_embeds, attention_mask)
-        """
-        logger.debug(f"Computing T5 caption for: {prompt}")
-        text_inputs = self.tokenize_t5_prompt(
-            prompt, tokenizer_max_length=StateTracker.get_args().tokenizer_max_length
-        )
-        result = self.encode_t5_prompt(
-            text_inputs.input_ids,
-            text_inputs.attention_mask,
-        )
-        attn_mask = text_inputs.attention_mask
-        del text_inputs
-
-        return result, attn_mask
-
-    def compute_embeddings_for_prompts(
-        self,
-        all_prompts,
-        return_concat: bool = True,
-        is_validation: bool = False,
-        load_from_cache: bool = True,
-    ):
-        logger.debug("Initialising text embed calculator...")
-        if not self.batch_write_thread.is_alive():
-            logger.debug("Restarting background write thread.")
-            # Start the thread again.
-            self.process_write_batches = True
-            self.batch_write_thread = Thread(target=self.batch_write_embeddings)
-            self.batch_write_thread.start()
-
-        existing_cache_filenames = list(
-            StateTracker.get_text_cache_files(data_backend_id=self.id).keys()
-        )
-
-        # Parallel processing for hashing
-        with ThreadPoolExecutor() as executor:
-            all_cache_filenames = list(
-                executor.map(self.hash_prompt_with_path, all_prompts)
-            )
-
-        # Create a set for faster lookups
-        existing_cache_filenames_set = set(existing_cache_filenames)
-
-        # Determine which prompts are not cached
-        uncached_prompts = [
-            prompt
-            for prompt, filename in zip(all_prompts, all_cache_filenames)
-            if filename not in existing_cache_filenames_set
-        ]
-
-        # If all prompts are cached and certain conditions are met, return None
-        if not uncached_prompts and not return_concat:
-            self.debug_log(
-                f"All prompts are cached, ignoring (uncached_prompts={uncached_prompts}, is_validation={is_validation}, return_concat={return_concat})"
-            )
-            return None
-        else:
-            self.debug_log(
-                f"(uncached_prompts={uncached_prompts}, is_validation={is_validation}, return_concat={return_concat})"
-            )
-
-        # Proceed with uncached prompts
-        raw_prompts = uncached_prompts if uncached_prompts else all_prompts
-        output = None
-        if self.model_type == "sdxl" or self.model_type == "kolors":
-            output = self.compute_embeddings_for_sdxl_prompts(
-                raw_prompts,
-                return_concat=return_concat,
-                is_validation=is_validation,
-                load_from_cache=load_from_cache,
-            )
-        elif (
-            self.model_type == "legacy"
-            or self.model_type == "pixart_sigma"
-            or self.model_type == "smoldit"
-        ):
-            # both sd1.x/2.x and t5 style models like pixart use this flow.
-            output = self.compute_embeddings_for_legacy_prompts(
-                raw_prompts,
-                return_concat=return_concat,
-                load_from_cache=load_from_cache,
-            )
-        elif self.model_type == "sd3":
-            output = self.compute_embeddings_for_sd3_prompts(
-                raw_prompts,
-                return_concat=return_concat,
-                load_from_cache=load_from_cache,
-            )
-        elif self.model_type == "flux":
-            output = self.compute_embeddings_for_flux_prompts(
-                raw_prompts,
-                return_concat=return_concat,
-                load_from_cache=load_from_cache,
-            )
-        else:
-            raise ValueError(
-                f"No such text encoding backend for model type '{self.model_type}'"
-            )
-        # logger.debug(f"Returning output: {output}")
-        return output
-
-    def split_captions_between_processes(self, all_captions: list):
-        with self.accelerator.split_between_processes(all_captions) as split:
-            split_captions = split
-        self.debug_log(
-            f"Before splitting, we had {len(all_captions)} captions. After splitting, we have {len(split_captions)} unprocessed files."
-        )
-        # # Print the first 5 as a debug log:
-        self.debug_log(f"Local unprocessed captions: {split_captions[:5]} (truncated)")
-        return split_captions
-
-    def compute_embeddings_for_sdxl_prompts(
-        self,
-        prompts: list = None,
-        return_concat: bool = True,
-        is_validation: bool = False,
-        load_from_cache: bool = True,
-    ):
-        prompt_embeds_all = []
-        add_text_embeds_all = []
-        should_encode = not load_from_cache
-        args = StateTracker.get_args()
-        if should_encode:
-            local_caption_split = self.split_captions_between_processes(
-                prompts or self.prompts
-            )
-        else:
-            local_caption_split = prompts or self.prompts
-        if (
-            hasattr(args, "cache_clear_validation_prompts")
-            and args.cache_clear_validation_prompts
-            and is_validation
-        ):
-            # If --cache_clear_validation_prompts was provided, we will forcibly overwrite them.
-            load_from_cache = False
-            should_encode = True
-        # self.debug_log(
-        #     f"compute_embeddings_for_sdxl_prompts received list of prompts: {list(prompts)[:5]}"
-        # )
-        if self.webhook_handler is not None:
-            last_reported_index = 0
-            self.send_progress_update(
-                type="init_cache_text_embeds_started",
-                progress=int(0 // len(local_caption_split)),
-                total=len(local_caption_split),
-                current=0,
-            )
-        self.write_thread_bar = tqdm(
-            desc="Write embeds to disk",
-            leave=False,
-            ncols=125,
-            disable=return_concat,
-            total=len(local_caption_split),
-            position=get_rank(),
-        )
-        with torch.no_grad():
-            last_reported_index = 0
-            for prompt in tqdm(
-                local_caption_split,
-                desc="Processing prompts",
-                disable=return_concat,
-                miniters=50,
-                leave=False,
-                ncols=125,
-                position=get_rank() + self.accelerator.num_processes + 1,
-            ):
-                filename = os.path.join(self.cache_dir, self.hash_prompt(prompt))
-                debug_msg = f"Processing file: {filename}, prompt: {prompt}"
-                prompt = PromptHandler.filter_caption(self.data_backend, prompt)
-                debug_msg = f"{debug_msg}\n -> filtered prompt: {prompt}"
-                logger.debug(debug_msg)
-                if return_concat and load_from_cache:
-                    try:
-                        # We attempt to load.
-                        prompt_embeds, add_text_embeds = self.load_from_cache(filename)
-                    except Exception as e:
-                        # We failed to load. Now encode the prompt.
-                        logger.error(
-                            f"Failed retrieving prompt from cache:"
-                            f"\n-> prompt: {prompt}"
-                            f"\n-> filename: {filename}"
-                            f"\n-> error: {e}"
-                            f"\n-> id: {self.id}, data_backend id: {self.data_backend.id}"
-                        )
-                        should_encode = True
-                        raise Exception(
-                            "Cache retrieval for text embed file failed. Ensure your dataloader config value for skip_file_discovery does not contain 'text', and that preserve_data_backend_cache is disabled or unset."
-                        )
-                if should_encode:
-                    # If load_from_cache is True, should_encode would be False unless we failed to load.
-                    # self.debug_log(f"Encoding prompt: {prompt}")
-                    prompt_embeds, pooled_prompt_embeds = self.encode_sdxl_prompts(
-                        self.text_encoders,
-                        self.tokenizers,
-                        [prompt],
-                        is_validation,
-                    )
-                    add_text_embeds = pooled_prompt_embeds
-                    # If the prompt is empty, zero out the embeddings
-                    if prompt == "":
-                        prompt_embeds = torch.zeros_like(prompt_embeds)
-                        add_text_embeds = torch.zeros_like(add_text_embeds)
-                    # Get the current size of the queue.
-                    current_size = self.write_queue.qsize()
-                    if current_size >= 2048:
-                        log_msg = str(
-                            f"[WARNING] Write queue size is {current_size}. This is quite large."
-                            " Consider increasing the write batch size. Delaying encode so that writes can catch up."
-                        )
-                        self.write_thread_bar.write(log_msg)
-                        while self.write_queue.qsize() > 100:
-                            time.sleep(0.1)
-
-                    self.debug_log(f"Adding embed to write queue: {filename}")
-                    self.save_to_cache(filename, (prompt_embeds, add_text_embeds))
-
-                    if (
-                        self.webhook_handler is not None
-                        and int(
-                            self.write_thread_bar.n % self.webhook_progress_interval
-                        )
-                        < 10
-                    ):
-                        last_reported_index = int(
-                            self.write_thread_bar.n % self.webhook_progress_interval
-                        )
-                        self.send_progress_update(
-                            type="init_cache_text_embeds_status_update",
-                            progress=int(
-                                self.write_thread_bar.n
-                                // len(local_caption_split)
-                                * 100
-                            ),
-                            total=len(local_caption_split),
-                            current=0,
-                        )
-
-                    if return_concat:
-                        prompt_embeds = prompt_embeds.to(self.accelerator.device)
-                        add_text_embeds = add_text_embeds.to(self.accelerator.device)
-                    else:
-                        del prompt_embeds
-                        del add_text_embeds
-                        del pooled_prompt_embeds
-                        continue
-
-                if return_concat:
-                    prompt_embeds_all.append(prompt_embeds)
-                    add_text_embeds_all.append(add_text_embeds)
-
-            while self.write_queue.qsize() > 0:
-                time.sleep(0.1)  # Sleep briefly to avoid busy-waiting
-
-            logger.debug(
-                f"Exiting text cache write busy-loop, {self.write_queue.qsize()} items remaining."
-            )
-
-            if self.webhook_handler is not None:
-                self.send_progress_update(
-                    type="init_cache_text_embeds_status_complete",
-                    progress=100,
-                    total=len(local_caption_split),
-                    current=len(local_caption_split),
-                )
-
-            # Close the tqdm progress bar after the loop
-            self.write_thread_bar.close()
-            self.process_write_batches = False
-
-            if not return_concat:
-                del prompt_embeds_all
-                del add_text_embeds_all
-                return
-
-            prompt_embeds_all = torch.cat(prompt_embeds_all, dim=0)
-            add_text_embeds_all = torch.cat(add_text_embeds_all, dim=0)
-
-        return prompt_embeds_all, add_text_embeds_all
-
-    def compute_embeddings_for_legacy_prompts(
-        self,
-        prompts: list = None,
-        return_concat: bool = True,
-        load_from_cache: bool = True,
-    ):
-        logger.debug(
-            f"compute_embeddings_for_legacy_prompts arguments: prompts={prompts}, return_concat={return_concat}, load_from_cache={load_from_cache}"
-        )
-        prompt_embeds_all = []
-        prompt_embeds_all = []
-        should_encode = not load_from_cache
-        args = StateTracker.get_args()
-        if (
-            hasattr(args, "cache_clear_validation_prompts")
-            and args.cache_clear_validation_prompts
-            and not load_from_cache
-        ):
-            # If --cache_clear_validation_prompts was provided, we will forcibly overwrite them.
-            should_encode = True
-            logger.debug("Setting should_encode = True")
-        # self.debug_log(
-        #     f"compute_embeddings_for_legacy_prompts received list of prompts: {list(prompts)[:5]}"
-        # )
-        if should_encode:
-            local_caption_split = self.split_captions_between_processes(
-                prompts or self.prompts
-            )
-        else:
-            local_caption_split = prompts or self.prompts
-
-        if self.webhook_handler is not None:
-            last_reported_index = 0
-            self.send_progress_update(
-                type="init_cache_text_embeds_started",
-                progress=int(0 // len(local_caption_split)),
-                total=len(local_caption_split),
-                current=0,
-            )
-
-        self.write_thread_bar = tqdm(
-            desc="Write embeds to disk",
-            leave=False,
-            ncols=125,
-            disable=return_concat,
-            total=len(local_caption_split),
-            position=get_rank(),
-        )
-        with torch.no_grad():
-            attention_mask = None
-            attention_masks_all = []
-            last_reported_index = 0
-            for prompt in tqdm(
-                local_caption_split,
-                desc="Processing prompts",
-                leave=False,
-                ncols=125,
-                disable=return_concat,
-                position=get_rank() + self.accelerator.num_processes + 1,
-            ):
-                filename = os.path.join(self.cache_dir, self.hash_prompt(prompt))
-                if prompt != "":
-                    prompt = PromptHandler.filter_caption(self.data_backend, prompt)
-                if prompt is None:
-                    continue
-
-                if return_concat and load_from_cache:
-                    try:
-                        # We attempt to load.
-                        logging.debug("Loading embed from cache.")
-                        prompt_embeds = self.load_from_cache(filename)
-                        if type(prompt_embeds) is tuple and len(prompt_embeds) == 2:
-                            # we have an attention mask stored with the embed.
-                            prompt_embeds, attention_mask = prompt_embeds
-                        logging.debug(f"Loaded embeds: {prompt_embeds.shape}")
-                    except Exception as e:
-                        # We failed to load. Now encode the prompt.
-                        logger.error(
-                            f"Failed retrieving prompt from cache:"
-                            f"\n-> prompt: {prompt}"
-                            f"\n-> filename: {filename}"
-                            f"\n-> error: {e}"
-                        )
-                        should_encode = True
-                        raise Exception(
-                            "Cache retrieval for text embed file failed. Ensure your dataloader config value for skip_file_discovery does not contain 'text', and that preserve_data_backend_cache is disabled or unset."
-                        )
-
-                if should_encode:
-                    # self.debug_log(f"Encoding prompt: {prompt}")
-                    # Get the current size of the queue.
-                    current_size = self.write_queue.qsize()
-                    if current_size >= 2048:
-                        log_msg = str(
-                            f"[WARNING] Write queue size is {current_size}. This is quite large."
-                            " Consider increasing the write batch size. Delaying encode so that writes can catch up."
-                        )
-                        self.write_thread_bar.write(log_msg)
-                        while self.write_queue.qsize() > 100:
-                            logger.debug("Waiting for write thread to catch up.")
-                            time.sleep(5)
-                    if (
-                        "deepfloyd" in StateTracker.get_args().model_type
-                        or self.model_type == "pixart_sigma"
-                        or self.model_type == "smoldit"
-                    ):
-                        # TODO: Batch this
-                        prompt_embeds, attention_mask = self.compute_t5_prompt(
-                            prompt=prompt,
-                        )
-                        if "deepfloyd" not in StateTracker.get_args().model_type:
-                            # we have to store the attn mask with the embed for pixart.
-                            # smoldit requires the attn mask at inference time 💪🏽
-                            prompt_embeds = (prompt_embeds, attention_mask)
-                    else:
-                        prompt_embeds = self.encode_legacy_prompt(
-                            self.text_encoders[0], self.tokenizers[0], [prompt]
-                        )
-                    if return_concat:
-                        if type(prompt_embeds) is tuple:
-                            prompt_embeds = (
-                                prompt_embeds[0].to(self.accelerator.device),
-                                prompt_embeds[1].to(self.accelerator.device),
-                            )
-                        else:
-                            prompt_embeds = prompt_embeds.to(self.accelerator.device)
-
-                    self.save_to_cache(filename, prompt_embeds)
-
-                    if (
-                        self.webhook_handler is not None
-                        and int(
-                            self.write_thread_bar.n % self.webhook_progress_interval
-                        )
-                        < 10
-                    ):
-                        last_reported_index = int(
-                            self.write_thread_bar.n % self.webhook_progress_interval
-                        )
-                        self.send_progress_update(
-                            type="init_cache_text_embeds_status_update",
-                            progress=int(
-                                self.write_thread_bar.n
-                                // len(local_caption_split)
-                                * 100
-                            ),
-                            total=len(local_caption_split),
-                            current=0,
-                        )
-
-                if not return_concat:
-                    del prompt_embeds
-                    prompt_embeds = None
-
-                if return_concat:
-                    prompt_embeds_all.append(prompt_embeds)
-                    if attention_mask is not None:
-                        attention_masks_all.append(attention_mask)
-
-            while self.write_queue.qsize() > 0:
-                time.sleep(0.1)  # Sleep briefly to avoid busy-waiting
-
-            logger.debug(
-                f"Exiting text cache write busy-loop, {self.write_queue.qsize()} items remaining."
-            )
-
-            if self.webhook_handler is not None:
-                self.send_progress_update(
-                    type="init_cache_text_embeds_status_complete",
-                    progress=100,
-                    total=len(local_caption_split),
-                    current=len(local_caption_split),
-                )
-
-            # Close the tqdm progress bar after the loop
-            self.write_thread_bar.close()
-            self.process_write_batches = False
-
-            if not return_concat:
-                del prompt_embeds_all
-                gc.collect()
-                return
-
-        # logger.debug(f"Returning all prompt embeds: {prompt_embeds_all}")
-        if len(attention_masks_all) > 0:
-            return prompt_embeds_all, attention_masks_all
-        return prompt_embeds_all
-
-    def compute_embeddings_for_flux_prompts(
-        self,
-        prompts: list = None,
-        return_concat: bool = True,
-        is_validation: bool = False,
-        load_from_cache: bool = True,
-    ):
-        prompt_embeds_all = []
-        add_text_embeds_all = []
-        time_ids_all = []
-        masks_all = []
-        should_encode = not load_from_cache
-        args = StateTracker.get_args()
-        if should_encode:
-            local_caption_split = self.split_captions_between_processes(
-                prompts or self.prompts
-            )
-        else:
-            local_caption_split = prompts or self.prompts
-        if (
-            hasattr(args, "cache_clear_validation_prompts")
-            and args.cache_clear_validation_prompts
-            and is_validation
-        ):
-            # If --cache_clear_validation_prompts was provided, we will forcibly overwrite them.
-            load_from_cache = False
-            should_encode = True
-
-        if self.webhook_handler is not None:
-            last_reported_index = 0
-            self.send_progress_update(
-                type="init_cache_text_embeds_started",
-                progress=int(0 // len(local_caption_split)),
-                total=len(local_caption_split),
-                current=0,
-            )
-        self.write_thread_bar = tqdm(
-            desc="Write embeds to disk",
-            leave=False,
-            ncols=125,
-            disable=return_concat,
-            total=len(local_caption_split),
-            position=get_rank(),
-        )
-        with torch.no_grad():
-            last_reported_index = 0
-            for prompt in tqdm(
-                local_caption_split,
-                desc="Processing prompts",
-                disable=return_concat,
-                miniters=50,
-                leave=False,
-                ncols=125,
-                position=get_rank() + self.accelerator.num_processes + 1,
-            ):
-                filename = os.path.join(self.cache_dir, self.hash_prompt(prompt))
-                debug_msg = f"Processing file: {filename}, prompt: {prompt}"
-                prompt = PromptHandler.filter_caption(self.data_backend, prompt)
-                debug_msg = f"{debug_msg}\n -> filtered prompt: {prompt}"
-                if prompt is None:
-                    logger.error(f"Filename {filename} does not have a caption.")
-                    continue
-                logger.debug(debug_msg)
-                if return_concat and load_from_cache:
-                    try:
-                        # We attempt to load.
-                        _flux_embed = self.load_from_cache(filename)
-                        if len(_flux_embed) == 3:
-                            # legacy flux embed w/o attn mask
-                            prompt_embeds, add_text_embeds, time_ids = _flux_embed
-                            masks = None
-                        elif len(_flux_embed) == 4:
-                            # flux embed with attn mask
-                            prompt_embeds, add_text_embeds, time_ids, masks = (
-                                _flux_embed
-                            )
-                        del _flux_embed
-                        logger.debug(
-                            f"Cached Flux text embeds: {prompt_embeds.shape}, {add_text_embeds.shape}, {time_ids.shape}, {masks.shape if masks is not None else None}"
-                        )
-                    except Exception as e:
-                        # We failed to load. Now encode the prompt.
-                        logger.error(
-                            f"Failed retrieving prompt from cache:"
-                            f"\n-> prompt: {prompt}"
-                            f"\n-> filename: {filename}"
-                            f"\n-> error: {e}"
-                            f"\n-> id: {self.id}, data_backend id: {self.data_backend.id}"
-                        )
-                        should_encode = True
-                        raise Exception(
-                            "Cache retrieval for text embed file failed. Ensure your dataloader config value for skip_file_discovery does not contain 'text', and that preserve_data_backend_cache is disabled or unset."
-                        )
-                if should_encode:
-                    # If load_from_cache is True, should_encode would be False unless we failed to load.
-                    self.debug_log(f"Encoding prompt: {prompt}")
-                    prompt_embeds, pooled_prompt_embeds, time_ids, masks = (
-                        self.encode_flux_prompt(
-                            self.text_encoders,
-                            self.tokenizers,
-                            [prompt],
-                            is_validation,
-                            zero_padding_tokens=StateTracker.get_args().t5_padding
-                            == "zero",
-                        )
-                    )
-                    logger.debug(
-                        f"Flux prompt embeds: {prompt_embeds.shape}, {pooled_prompt_embeds.shape}, {time_ids.shape}, {masks.shape}"
-                    )
-                    add_text_embeds = pooled_prompt_embeds
-                    current_size = self.write_queue.qsize()
-                    if current_size >= 2048:
-                        log_msg = str(
-                            f"[WARNING] Write queue size is {current_size}. This is quite large."
-                            " Consider increasing the write batch size. Delaying encode so that writes can catch up."
-                        )
-                        self.write_thread_bar.write(log_msg)
-                        while self.write_queue.qsize() > 100:
-                            time.sleep(0.1)
-
-                    self.debug_log(f"Adding embed to write queue: {filename}")
-                    self.save_to_cache(
-                        filename, (prompt_embeds, add_text_embeds, time_ids, masks)
-                    )
-                    if (
-                        self.webhook_handler is not None
-                        and int(
-                            self.write_thread_bar.n % self.webhook_progress_interval
-                        )
-                        < 10
-                    ):
-                        last_reported_index = int(
-                            self.write_thread_bar.n % self.webhook_progress_interval
-                        )
-                        self.send_progress_update(
-                            type="init_cache_text_embeds_status_update",
-                            progress=int(
-                                self.write_thread_bar.n
-                                // len(local_caption_split)
-                                * 100
-                            ),
-                            total=len(local_caption_split),
-                            current=0,
-                        )
-
-                    if return_concat:
-                        prompt_embeds = prompt_embeds.to(self.accelerator.device)
-                        add_text_embeds = add_text_embeds.to(self.accelerator.device)
-                        time_ids = time_ids.to(self.accelerator.device)
-                        masks = masks.to(self.accelerator.device)
-                    else:
-                        del prompt_embeds
-                        del add_text_embeds
-                        del pooled_prompt_embeds
-                        del masks
-                        continue
-
-                if return_concat:
-                    prompt_embeds_all.append(prompt_embeds)
-                    add_text_embeds_all.append(add_text_embeds)
-                    time_ids_all.append(time_ids)
-                    masks_all.append(masks)
-
-            while self.write_queue.qsize() > 0:
-                time.sleep(0.1)  # Sleep briefly to avoid busy-waiting
-
-            if self.webhook_handler is not None:
-                self.send_progress_update(
-                    type="init_cache_text_embeds_status_complete",
-                    progress=100,
-                    total=len(local_caption_split),
-                    current=len(local_caption_split),
-                )
-
-            # Close the tqdm progress bar after the loop
-            self.write_thread_bar.close()
-            self.process_write_batches = False
-
-            if not return_concat:
-                del prompt_embeds_all
-                del add_text_embeds_all
-                del time_ids_all
-                del masks_all
-                return
-
-            logger.debug(f"Returning all prompt embeds: {prompt_embeds_all}")
-            prompt_embeds_all = torch.cat(prompt_embeds_all, dim=0)
-            add_text_embeds_all = torch.cat(add_text_embeds_all, dim=0)
-            time_ids_all = torch.cat(time_ids_all, dim=0)
-            # if any masks_all are None, we can't cat
-            masks_all = torch.cat(masks_all, dim=0) if None not in masks_all else None
-
-        return prompt_embeds_all, add_text_embeds_all, time_ids_all, masks_all
-
-    def compute_embeddings_for_sd3_prompts(
-        self,
-        prompts: list = None,
-        return_concat: bool = True,
-        is_validation: bool = False,
-        load_from_cache: bool = True,
-    ):
-        prompt_embeds_all = []
-        add_text_embeds_all = []
-        should_encode = not load_from_cache
-        args = StateTracker.get_args()
-        if should_encode:
-            local_caption_split = self.split_captions_between_processes(
-                prompts or self.prompts
-            )
-        else:
-            local_caption_split = prompts or self.prompts
-        if (
-            hasattr(args, "cache_clear_validation_prompts")
-            and args.cache_clear_validation_prompts
-            and is_validation
-        ):
-            # If --cache_clear_validation_prompts was provided, we will forcibly overwrite them.
-            load_from_cache = False
-            should_encode = True
-        # self.debug_log(
-        #     f"compute_embeddings_for_sdxl_prompts received list of prompts: {list(prompts)[:5]}"
-        # )
-
-        if self.webhook_handler is not None:
-            last_reported_index = 0
-            self.send_progress_update(
-                type="init_cache_text_embeds_started",
-                progress=int(0 // len(local_caption_split)),
-                total=len(local_caption_split),
-                current=0,
-            )
-
-        self.write_thread_bar = tqdm(
-            desc="Write embeds to disk",
-            leave=False,
-            ncols=125,
-            disable=return_concat,
-            total=len(local_caption_split),
-            position=get_rank(),
-        )
-        with torch.no_grad():
-            last_reported_index = 0
-            for prompt in tqdm(
-                local_caption_split,
-                desc="Processing prompts",
-                disable=return_concat,
-                miniters=50,
-                leave=False,
-                ncols=125,
-                position=get_rank() + self.accelerator.num_processes + 1,
-            ):
-                filename = os.path.join(self.cache_dir, self.hash_prompt(prompt))
-                debug_msg = f"Processing file: {filename}, prompt: {prompt}"
-                prompt = PromptHandler.filter_caption(self.data_backend, prompt)
-                debug_msg = f"{debug_msg}\n -> filtered prompt: {prompt}"
-                if prompt is None:
-                    logger.error(f"Filename {filename} does not have a caption.")
-                    continue
-                logger.debug(debug_msg)
-                if return_concat and load_from_cache:
-                    try:
-                        # We attempt to load.
-                        prompt_embeds, add_text_embeds = self.load_from_cache(filename)
-                        logger.debug(
-                            f"Cached SD3 embeds: {prompt_embeds.shape}, {add_text_embeds.shape}"
-                        )
-                    except Exception as e:
-                        # We failed to load. Now encode the prompt.
-                        logger.error(
-                            f"Failed retrieving prompt from cache:"
-                            f"\n-> prompt: {prompt}"
-                            f"\n-> filename: {filename}"
-                            f"\n-> error: {e}"
-                            f"\n-> id: {self.id}, data_backend id: {self.data_backend.id}"
-                        )
-                        should_encode = True
-                        raise Exception(
-                            "Cache retrieval for text embed file failed. Ensure your dataloader config value for skip_file_discovery does not contain 'text', and that preserve_data_backend_cache is disabled or unset."
-                        )
-                if should_encode:
-                    # If load_from_cache is True, should_encode would be False unless we failed to load.
-                    self.debug_log(
-                        f"Encoding filename {filename} :: device {self.text_encoders[0].device} :: prompt {prompt}"
-                    )
-                    prompt_embeds, pooled_prompt_embeds = self.encode_sd3_prompt(
-                        self.text_encoders,
-                        self.tokenizers,
-                        [prompt],
-                        is_validation,
-                        zero_padding_tokens=(
-                            True
-                            if StateTracker.get_args().t5_padding == "zero"
-                            else False
-                        ),
-                    )
-                    logger.debug(
-                        f"Filename {filename} SD3 prompt embeds: {prompt_embeds.shape}, {pooled_prompt_embeds.shape}"
-                    )
-                    add_text_embeds = pooled_prompt_embeds
-                    # StabilityAI say not to zero them out.
-                    if prompt == "":
-                        if StateTracker.get_args().sd3_clip_uncond_behaviour == "zero":
-                            prompt_embeds = torch.zeros_like(prompt_embeds)
-                        if StateTracker.get_args().sd3_t5_uncond_behaviour == "zero":
-                            add_text_embeds = torch.zeros_like(add_text_embeds)
-                    # Get the current size of the queue.
-                    current_size = self.write_queue.qsize()
-                    if current_size >= 2048:
-                        log_msg = str(
-                            f"[WARNING] Write queue size is {current_size}. This is quite large."
-                            " Consider increasing the write batch size. Delaying encode so that writes can catch up."
-                        )
-                        self.write_thread_bar.write(log_msg)
-                        while self.write_queue.qsize() > 100:
-                            time.sleep(0.1)
-
-                    self.debug_log(f"Adding embed to write queue: {filename}")
-                    self.save_to_cache(filename, (prompt_embeds, add_text_embeds))
-
-                    if (
-                        self.webhook_handler is not None
-                        and int(
-                            self.write_thread_bar.n % self.webhook_progress_interval
-                        )
-                        < 10
-                    ):
-                        last_reported_index = int(
-                            self.write_thread_bar.n % self.webhook_progress_interval
-                        )
-                        self.send_progress_update(
-                            type="init_cache_text_embeds_status_update",
-                            progress=int(
-                                self.write_thread_bar.n
-                                // len(local_caption_split)
-                                * 100
-                            ),
-                            total=len(local_caption_split),
-                            current=0,
-                        )
-
-                    if return_concat:
-                        prompt_embeds = prompt_embeds.to(self.accelerator.device)
-                        add_text_embeds = add_text_embeds.to(self.accelerator.device)
-                    else:
-                        del prompt_embeds
-                        del add_text_embeds
-                        del pooled_prompt_embeds
-                        continue
-
-                if return_concat:
-                    prompt_embeds_all.append(prompt_embeds)
-                    add_text_embeds_all.append(add_text_embeds)
-
-            while self.write_queue.qsize() > 0:
-                time.sleep(0.1)  # Sleep briefly to avoid busy-waiting
-
-            if self.webhook_handler is not None:
-                self.send_progress_update(
-                    type="init_cache_text_embeds_status_complete",
-                    progress=100,
-                    total=len(local_caption_split),
-                    current=len(local_caption_split),
-                )
-
-            # Close the tqdm progress bar after the loop
-            self.write_thread_bar.close()
-            self.process_write_batches = False
-
-            if not return_concat:
-                del prompt_embeds_all
-                del add_text_embeds_all
-                return
-
-            logger.debug(f"Returning all prompt embeds: {prompt_embeds_all}")
-            prompt_embeds_all = torch.cat(prompt_embeds_all, dim=0)
-            add_text_embeds_all = torch.cat(add_text_embeds_all, dim=0)
-
-        return prompt_embeds_all, add_text_embeds_all
-
-    def __del__(self):
-        """Ensure that the batch write thread is properly closed."""
-        if self.batch_write_thread.is_alive():
-            self.batch_write_thread.join()
diff --git a/videotuna/third_party/flux/caching/vae.py b/videotuna/third_party/flux/caching/vae.py
deleted file mode 100644
index 1d26b0fd..00000000
--- a/videotuna/third_party/flux/caching/vae.py
+++ /dev/null
@@ -1,1106 +0,0 @@
-import logging
-import os
-import traceback
-from concurrent.futures import ThreadPoolExecutor, as_completed
-from hashlib import sha256
-from pathlib import Path
-from queue import Queue
-from random import shuffle
-
-import torch
-from numpy import str_ as numpy_str
-from PIL import Image
-from tqdm import tqdm
-
-from videotuna.third_party.flux.data_backend.base import BaseDataBackend
-from videotuna.third_party.flux.image_manipulation.training_sample import (
-    PreparedSample,
-    TrainingSample,
-)
-from videotuna.third_party.flux.metadata.backends.base import MetadataBackend
-from videotuna.third_party.flux.multiaspect.image import MultiaspectImage
-from videotuna.third_party.flux.training import image_file_extensions
-from videotuna.third_party.flux.training.multi_process import _get_rank as get_rank
-from videotuna.third_party.flux.training.multi_process import rank_info
-from videotuna.third_party.flux.training.state_tracker import StateTracker
-from videotuna.third_party.flux.webhooks.mixin import WebhookMixin
-
-logger = logging.getLogger("VAECache")
-logger.setLevel(os.environ.get("SIMPLETUNER_LOG_LEVEL", "INFO"))
-
-
-def prepare_sample(
-    image: Image.Image = None, data_backend_id: str = None, filepath: str = None
-):
-    metadata = StateTracker.get_metadata_by_filepath(
-        filepath, data_backend_id=data_backend_id
-    )
-    data_backend = StateTracker.get_data_backend(data_backend_id)
-    data_sampler = data_backend.get("sampler")
-    image_data = image
-    if image_data is None:
-        image_data = data_sampler.yield_single_image(filepath)
-    training_sample = TrainingSample(
-        image=image_data,
-        data_backend_id=data_backend_id,
-        image_metadata=metadata,
-        image_path=filepath,
-    )
-    prepared_sample = training_sample.prepare()
-    return (
-        prepared_sample.image,
-        prepared_sample.crop_coordinates,
-        prepared_sample.aspect_ratio,
-    )
-
-
-class VAECache(WebhookMixin):
-    def __init__(
-        self,
-        id: str,
-        vae,
-        accelerator,
-        metadata_backend: MetadataBackend,
-        instance_data_dir: str,
-        image_data_backend: BaseDataBackend,
-        webhook_progress_interval: int = 100,
-        cache_data_backend: BaseDataBackend = None,
-        cache_dir="vae_cache",
-        resolution: float = 1024,
-        maximum_image_size: float = None,
-        target_downsample_size: float = None,
-        delete_problematic_images: bool = False,
-        write_batch_size: int = 25,
-        read_batch_size: int = 25,
-        process_queue_size: int = 16,
-        vae_batch_size: int = 4,
-        resolution_type: str = "pixel",
-        minimum_image_size: int = None,
-        max_workers: int = 32,
-        vae_cache_ondemand: bool = False,
-        hash_filenames: bool = False,
-    ):
-        self.id = id
-        if image_data_backend and image_data_backend.id != id:
-            raise ValueError(
-                f"VAECache received incorrect image_data_backend: {image_data_backend}"
-            )
-        self.image_data_backend = image_data_backend
-        self.cache_data_backend = (
-            cache_data_backend if cache_data_backend is not None else image_data_backend
-        )
-        self.hash_filenames = hash_filenames
-        self.vae = vae
-        self.accelerator = accelerator
-        self.cache_dir = cache_dir
-        if len(self.cache_dir) > 0 and self.cache_dir[-1] == "/":
-            # Remove trailing slash
-            self.cache_dir = self.cache_dir[:-1]
-        if self.cache_data_backend and self.cache_data_backend.type == "local":
-            self.cache_dir = os.path.abspath(self.cache_dir)
-            self.cache_data_backend.create_directory(self.cache_dir)
-        self.resolution = resolution
-        self.resolution_type = resolution_type
-        self.minimum_image_size = minimum_image_size
-        self.webhook_progress_interval = webhook_progress_interval
-        self.delete_problematic_images = delete_problematic_images
-        self.write_batch_size = write_batch_size
-        self.read_batch_size = read_batch_size
-        self.process_queue_size = process_queue_size
-        self.vae_batch_size = vae_batch_size
-        self.instance_data_dir = instance_data_dir
-        self.transform = MultiaspectImage.get_image_transforms()
-        self.rank_info = rank_info()
-        self.metadata_backend = metadata_backend
-        if self.metadata_backend and not self.metadata_backend.image_metadata_loaded:
-            self.metadata_backend.load_image_metadata()
-
-        self.vae_cache_ondemand = vae_cache_ondemand
-
-        self.max_workers = max_workers
-        if (maximum_image_size and not target_downsample_size) or (
-            target_downsample_size and not maximum_image_size
-        ):
-            raise ValueError(
-                "Both maximum_image_size and target_downsample_size must be specified."
-                f"Only {'maximum_image_size' if maximum_image_size else 'target_downsample_size'} was specified."
-            )
-        self.maximum_image_size = maximum_image_size
-        self.target_downsample_size = target_downsample_size
-        self.read_queue = Queue()
-        self.process_queue = Queue()
-        self.write_queue = Queue()
-        self.vae_input_queue = Queue()
-
-    def debug_log(self, msg: str):
-        logger.debug(f"{self.rank_info}{msg}")
-
-    def generate_vae_cache_filename(self, filepath: str) -> tuple:
-        """Get the cache filename for a given image filepath and its base name."""
-        if filepath.endswith(".pt"):
-            return filepath, os.path.basename(filepath)
-        # Extract the base name from the filepath and replace the image extension with .pt
-        base_filename = os.path.splitext(os.path.basename(filepath))[0]
-        if self.hash_filenames:
-            base_filename = str(sha256(str(base_filename).encode()).hexdigest())
-        base_filename = str(base_filename) + ".pt"
-        # Find the subfolders the sample was in, and replace the instance_data_dir with the cache_dir
-        subfolders = ""
-        if self.instance_data_dir is not None:
-            subfolders = os.path.dirname(filepath).replace(self.instance_data_dir, "")
-            subfolders = subfolders.lstrip(os.sep)
-
-        if len(subfolders) > 0:
-            full_filename = os.path.join(self.cache_dir, subfolders, base_filename)
-            # logger.debug(
-            #     f"full_filename: {full_filename} = os.path.join({self.cache_dir}, {subfolders}, {base_filename})"
-            # )
-        else:
-            full_filename = os.path.join(self.cache_dir, base_filename)
-            # logger.debug(
-            #     f"full_filename: {full_filename} = os.path.join({self.cache_dir}, {base_filename})"
-            # )
-        return full_filename, base_filename
-
-    def _image_filename_from_vaecache_filename(self, filepath: str) -> tuple[str, str]:
-        test_filepath, _ = self.generate_vae_cache_filename(filepath)
-        result = self.vae_path_to_image_path.get(test_filepath, None)
-
-        return result
-
-    def build_vae_cache_filename_map(self, all_image_files: list):
-        """Build a map of image filepaths to their corresponding cache filenames."""
-        self.image_path_to_vae_path = {}
-        self.vae_path_to_image_path = {}
-        for image_file in all_image_files:
-            cache_filename, _ = self.generate_vae_cache_filename(image_file)
-            if self.cache_data_backend.type == "local":
-                cache_filename = os.path.abspath(cache_filename)
-            self.image_path_to_vae_path[image_file] = cache_filename
-            self.vae_path_to_image_path[cache_filename] = image_file
-
-    def already_cached(self, filepath: str) -> bool:
-        test_path = self.image_path_to_vae_path.get(filepath, None)
-        if self.cache_data_backend.exists(test_path):
-            return True
-        return False
-
-    def _read_from_storage(
-        self, filename: str, hide_errors: bool = False
-    ) -> torch.Tensor:
-        """Read an image or cache object from the storage backend.
-
-        Args:
-            filename (str): The path to the cache item, eg. `vae_cache/foo.pt` or `instance_data_dir/foo.png`
-
-        Returns:
-            Image or cache object
-        """
-        if os.path.splitext(filename)[1] != ".pt":
-            try:
-                return self.image_data_backend.read_image(filename)
-            except Exception as e:
-                if self.delete_problematic_images:
-                    self.metadata_backend.remove_image(filename)
-                    self.image_data_backend.delete(filename)
-                    self.debug_log(
-                        f"Deleted {filename} because it was problematic: {e}"
-                    )
-                raise e
-        try:
-            return self.cache_data_backend.torch_load(filename).to("cpu")
-        except Exception as e:
-            if hide_errors:
-                self.debug_log(
-                    f"Filename: {filename}, returning None even though read_from_storage found no object, since hide_errors is True: {e}"
-                )
-                return None
-            raise e
-
-    def retrieve_from_cache(self, filepath: str):
-        """
-        Use the encode_images method to emulate a single image encoding.
-        """
-        return self.encode_images([None], [filepath])[0]
-
-    def retreve_batch_from_cache(self, filepaths: list):
-        """
-        Use the encode_images method to emulate a batch of image encodings.
-        """
-        return self.encode_images([None] * len(filepaths), filepaths)
-
-    def discover_all_files(self):
-        """Identify all files in the data backend."""
-        all_image_files = StateTracker.get_image_files(
-            data_backend_id=self.id
-        ) or StateTracker.set_image_files(
-            self.image_data_backend.list_files(
-                instance_data_dir=self.instance_data_dir,
-                file_extensions=image_file_extensions,
-            ),
-            data_backend_id=self.id,
-        )
-        # This isn't returned, because we merely check if it's stored, or, store it.
-        (
-            StateTracker.get_vae_cache_files(data_backend_id=self.id)
-            or StateTracker.set_vae_cache_files(
-                self.cache_data_backend.list_files(
-                    instance_data_dir=self.cache_dir,
-                    file_extensions=["pt"],
-                ),
-                data_backend_id=self.id,
-            )
-        )
-        self.debug_log(
-            f"VAECache discover_all_files found {len(all_image_files)} images"
-        )
-        return all_image_files
-
-    def init_vae(self):
-        from diffusers import AutoencoderKL
-
-        args = StateTracker.get_args()
-        vae_path = (
-            args.pretrained_model_name_or_path
-            if args.pretrained_vae_model_name_or_path is None
-            else args.pretrained_vae_model_name_or_path
-        )
-        precached_vae = StateTracker.get_vae()
-        self.vae = precached_vae or AutoencoderKL.from_pretrained(
-            vae_path,
-            subfolder="vae" if args.pretrained_vae_model_name_or_path is None else None,
-            revision=args.revision,
-            force_upcast=False,
-        ).to(self.accelerator.device)
-        if self.vae.device != self.accelerator.device:
-            self.vae = self.vae.to(self.accelerator.device)
-        StateTracker.set_vae(self.vae)
-
-    def rebuild_cache(self):
-        """
-        First, we'll clear the cache before rebuilding it.
-        """
-        self.debug_log("Rebuilding cache.")
-        if self.accelerator.is_local_main_process:
-            self.debug_log("Updating StateTracker with new VAE cache entry list.")
-            StateTracker.set_vae_cache_files(
-                self.cache_data_backend.list_files(
-                    instance_data_dir=self.cache_dir,
-                    file_extensions=["pt"],
-                ),
-                data_backend_id=self.id,
-            )
-        self.accelerator.wait_for_everyone()
-        self.debug_log("-> Clearing cache objects")
-        self.clear_cache()
-        self.debug_log("-> Split tasks between GPU(s)")
-        self.discover_unprocessed_files()
-        self.debug_log("-> Load VAE")
-        self.init_vae()
-        if not StateTracker.get_args().vae_cache_ondemand:
-            self.debug_log("-> Process VAE cache")
-            self.process_buckets()
-            if self.accelerator.is_local_main_process:
-                self.debug_log("Updating StateTracker with new VAE cache entry list.")
-                StateTracker.set_vae_cache_files(
-                    self.cache_data_backend.list_files(
-                        instance_data_dir=self.cache_dir,
-                        file_extensions=["pt"],
-                    ),
-                    data_backend_id=self.id,
-                )
-            self.accelerator.wait_for_everyone()
-        self.debug_log("-> Completed cache rebuild")
-
-    def clear_cache(self):
-        """
-        Clear all .pt files in our data backend's cache prefix, as obtained from self.discover_all_files().
-
-        We can't simply clear the directory, because it might be mixed with the image samples (in the case of S3)
-
-        We want to thread this, using the data_backend.delete function as the worker function.
-        """
-        futures = []
-        all_cache_files = StateTracker.get_vae_cache_files(data_backend_id=self.id)
-        with ThreadPoolExecutor(max_workers=self.max_workers) as executor:
-            for filename in all_cache_files:
-                full_path = os.path.join(self.cache_dir, filename)
-                self.debug_log(f"Would delete: {full_path}")
-                futures.append(
-                    executor.submit(self.cache_data_backend.delete, full_path)
-                )
-            for future in tqdm(
-                as_completed(futures),
-                total=len(futures),
-                desc=f"Deleting files for backend {self.id}",
-                position=get_rank(),
-                ncols=125,
-                leave=False,
-            ):
-                try:
-                    future.result()
-                except Exception as e:
-                    logger.error(f"Error deleting file {filename}: {e}")
-                    self.debug_log(f"Error traceback: {traceback.format_exc()}")
-                    raise e
-        # Clear the StateTracker list of VAE objects:
-        StateTracker.set_vae_cache_files([], data_backend_id=self.id)
-
-    def _list_cached_images(self):
-        """
-        Return a set of filenames (without the .pt extension) that have been processed.
-        """
-        # Extract array of tuple into just, an array of files:
-        pt_files = StateTracker.get_vae_cache_files(data_backend_id=self.id)
-        # Extract just the base filename without the extension
-        results = {os.path.splitext(f)[0] for f in pt_files}
-        # self.debug_log(
-        #     f"Found {len(pt_files)} cached files in {self.cache_dir} (truncated): {list(results)[:5]}"
-        # )
-        return results
-
-    def discover_unprocessed_files(self, directory: str = None):
-        """Identify files that haven't been processed yet."""
-        all_image_files = set(StateTracker.get_image_files(data_backend_id=self.id))
-        existing_cache_files = set(
-            StateTracker.get_vae_cache_files(data_backend_id=self.id)
-        )
-        # Convert cache filenames to their corresponding image filenames
-        already_cached_images = []
-        for cache_file in existing_cache_files:
-            try:
-                n = self._image_filename_from_vaecache_filename(cache_file)
-                if n is None:
-                    continue
-                already_cached_images.append(n)
-            except Exception as e:
-                logger.error(
-                    f"Could not find image path for cache file {cache_file}: {e}"
-                )
-                continue
-
-        # Identify unprocessed files
-        self.local_unprocessed_files = list(
-            set(all_image_files) - set(already_cached_images)
-        )
-
-        return self.local_unprocessed_files
-
-    def _reduce_bucket(
-        self,
-        bucket: str,
-        aspect_bucket_cache: dict,
-        processed_images: dict,
-        do_shuffle: bool = True,
-    ):
-        """
-        Given a bucket, return the relevant files for that bucket.
-        """
-        relevant_files = []
-        total_files = 0
-        skipped_files = 0
-        for full_image_path in aspect_bucket_cache[bucket]:
-            total_files += 1
-            comparison_path = self.generate_vae_cache_filename(full_image_path)[0]
-            if os.path.splitext(comparison_path)[0] in processed_images:
-                # processed_images contains basename *cache* paths:
-                skipped_files += 1
-                # self.debug_log(
-                #     f"Reduce bucket {bucket}, skipping ({skipped_files}/{total_files}) {full_image_path} because it is in processed_images"
-                # )
-                continue
-            if full_image_path not in self.local_unprocessed_files:
-                # full_image_path is the full *image* path:
-                skipped_files += 1
-                # self.debug_log(
-                #     f"Reduce bucket {bucket}, skipping ({skipped_files}/{total_files}) {full_image_path} because it is not in local_unprocessed_files"
-                # )
-                continue
-            # self.debug_log(
-            #     f"Reduce bucket {bucket}, adding ({len(relevant_files)}/{total_files}) {full_image_path}"
-            # )
-            relevant_files.append(full_image_path)
-        if do_shuffle:
-            shuffle(relevant_files)
-        # self.debug_log(
-        #     f"Reduced bucket {bucket} down from {len(aspect_bucket_cache[bucket])} to {len(relevant_files)} relevant files."
-        #     f" Our system has {len(self.local_unprocessed_files)} total images in its assigned slice for processing across all buckets."
-        # )
-        return relevant_files
-
-    def encode_images(self, images, filepaths, load_from_cache=True):
-        """
-        Encode a batch of input images. Images must be the same dimension.
-
-        If load_from_cache=True, we read from the VAE cache rather than encode.
-        If load_from_cache=True, we will throw an exception if the entry is not found.
-        """
-        batch_size = len(images)
-        if batch_size != len(filepaths):
-            raise ValueError("Mismatch between number of images and filepaths.")
-
-        full_filenames = [
-            self.generate_vae_cache_filename(filepath)[0] for filepath in filepaths
-        ]
-
-        # Check cache for each image and filter out already cached ones
-        uncached_images = []
-        uncached_image_indices = [
-            i
-            for i, filename in enumerate(full_filenames)
-            if not self.cache_data_backend.exists(filename)
-        ]
-        uncached_image_paths = [
-            filepaths[i]
-            for i, filename in enumerate(full_filenames)
-            if i in uncached_image_indices
-        ]
-
-        # We need to populate any uncached images with the actual image data if they are None.
-        missing_images = [
-            i
-            for i, image in enumerate(images)
-            if i in uncached_image_indices and image is None
-        ]
-        missing_image_pixel_values = []
-        written_latents = []
-        if len(missing_images) > 0 and self.vae_cache_ondemand:
-            missing_image_paths = [filepaths[i] for i in missing_images]
-            missing_image_data_generator = self._read_from_storage_concurrently(
-                missing_image_paths, hide_errors=True
-            )
-            # extract images from generator:
-            missing_image_data = [
-                retrieved_image_data[1]
-                for retrieved_image_data in missing_image_data_generator
-            ]
-            missing_image_pixel_values = self._process_images_in_batch(
-                missing_image_paths, missing_image_data, disable_queue=True
-            )
-            missing_image_vae_outputs = self._encode_images_in_batch(
-                image_pixel_values=missing_image_pixel_values, disable_queue=True
-            )
-            written_latents = self._write_latents_in_batch(missing_image_vae_outputs)
-            if len(written_latents) == len(images):
-                return written_latents
-
-        if len(uncached_image_indices) > 0:
-            uncached_images = [images[i] for i in uncached_image_indices]
-        elif len(missing_images) > 0 and len(missing_image_pixel_values) > 0:
-            uncached_images = []
-            for i in uncached_image_indices:
-                if images[i] is not None:
-                    uncached_images.append(images[i])
-                elif i in missing_image_pixel_values:
-                    uncached_images.append(missing_image_pixel_values[i])
-
-        if (
-            len(uncached_image_indices) > 0
-            and load_from_cache
-            and not self.vae_cache_ondemand
-        ):
-            # We wanted only uncached images. Something went wrong.
-            raise Exception(
-                f"(id={self.id}) Some images were not correctly cached during the VAE Cache operations. Ensure --skip_file_discovery=vae is not set.\nProblematic images: {uncached_image_paths}"
-            )
-
-        latents = []
-        if load_from_cache:
-            # If all images are cached, simply load them
-            latents = [
-                self._read_from_storage(filename, hide_errors=self.vae_cache_ondemand)
-                for filename in full_filenames
-                if filename not in uncached_images
-            ]
-
-        if len(uncached_images) > 0 and (
-            len(images) != len(latents) or len(filepaths) != len(latents)
-        ):
-            # Process images not found in cache
-            with torch.no_grad():
-                processed_images = torch.stack(uncached_images).to(
-                    self.accelerator.device, dtype=StateTracker.get_vae_dtype()
-                )
-                latents_uncached = self.vae.encode(
-                    processed_images
-                ).latent_dist.sample()
-                if (
-                    hasattr(self.vae, "config")
-                    and hasattr(self.vae.config, "shift_factor")
-                    and self.vae.config.shift_factor is not None
-                ):
-                    latents_uncached = (
-                        latents_uncached - self.vae.config.shift_factor
-                    ) * self.vae.config.scaling_factor
-                else:
-                    latents_uncached = latents_uncached * self.vae.config.scaling_factor
-                logger.debug(f"Latents shape: {latents_uncached.shape}")
-
-            # Prepare final latents list by combining cached and newly computed latents
-            cached_idx, uncached_idx = 0, 0
-            for i in range(batch_size):
-                if i in uncached_image_indices:
-                    latents.append(latents_uncached[uncached_idx])
-                    uncached_idx += 1
-                else:
-                    latents.append(self._read_from_storage(full_filenames[i]))
-                    cached_idx += 1
-        return latents
-
-    def _write_latents_in_batch(self, input_latents: list = None):
-        # Pull the 'filepaths' and 'latents' from self.write_queue
-        filepaths, latents = [], []
-        if input_latents is not None:
-            qlen = len(input_latents)
-        else:
-            qlen = self.write_queue.qsize()
-
-        for idx in range(0, qlen):
-            if input_latents:
-                output_file, filepath, latent_vector = input_latents.pop()
-            else:
-                output_file, filepath, latent_vector = self.write_queue.get()
-            file_extension = os.path.splitext(output_file)[1]
-            if file_extension != ".pt":
-                raise ValueError(
-                    f"Cannot write a latent embedding to an image path, {output_file}"
-                )
-            filepaths.append(output_file)
-            # pytorch will hold onto all of the tensors in the list if we do not use clone()
-            latents.append(latent_vector.clone())
-
-        self.cache_data_backend.write_batch(filepaths, latents)
-
-        return latents
-
-    def _process_images_in_batch(
-        self,
-        image_paths: list = None,
-        image_data: list = None,
-        disable_queue: bool = False,
-    ) -> None:
-        """Process a queue of images. This method assumes our batch size has been reached.
-
-        Args:
-            image_paths: list If given, image_data must also be supplied. This will avoid the use of the Queues.
-            image_data: list Provided Image objects for corresponding image_paths.
-
-        Returns:
-            None
-        """
-        try:
-            # self.debug_log(
-            #     f"Processing batch of images into VAE embeds. image_paths: {type(image_paths)}, image_data: {type(image_data)}"
-            # )
-            initial_data = []
-            filepaths = []
-            if image_paths is not None and image_data is not None:
-                qlen = len(image_paths)
-            else:
-                qlen = self.process_queue.qsize()
-
-            # First Loop: Preparation and Filtering
-            for _ in range(qlen):
-                if image_paths:
-                    # retrieve image data from Generator, image_data:
-                    filepath = image_paths.pop()
-                    image = image_data.pop()
-                    aspect_bucket = (
-                        self.metadata_backend.get_metadata_attribute_by_filepath(
-                            filepath=filepath, attribute="aspect_bucket"
-                        )
-                    )
-                else:
-                    filepath, image, aspect_bucket = self.process_queue.get()
-                if self.minimum_image_size is not None:
-                    if not self.metadata_backend.meets_resolution_requirements(
-                        image_path=filepath
-                    ):
-                        self.debug_log(
-                            f"Skipping {filepath} because it does not meet the minimum image size requirement of {self.minimum_image_size}"
-                        )
-                        continue
-                # image.save(f"test_{os.path.basename(filepath)}.png")
-                initial_data.append((filepath, image, aspect_bucket))
-
-            # Process Pool Execution
-            processed_images = []
-            with ThreadPoolExecutor(self.max_workers) as executor:
-                futures = [
-                    executor.submit(
-                        prepare_sample,
-                        data_backend_id=self.id,
-                        filepath=data[0],
-                    )
-                    for data in initial_data
-                ]
-                first_aspect_ratio = None
-                for future in futures:
-                    try:
-                        result = (
-                            future.result()
-                        )  # Returns PreparedSample or tuple(image, crop_coordinates, aspect_ratio)
-                        if result:  # Ensure result is not None or invalid
-                            processed_images.append(result)
-                            if first_aspect_ratio is None:
-                                first_aspect_ratio = result[2]
-                            elif (
-                                type(result) is PreparedSample
-                                and result.aspect_ratio is not None
-                                and first_aspect_ratio is not None
-                                and result.aspect_ratio != first_aspect_ratio
-                            ):
-                                raise ValueError(
-                                    f"({type(result)}) Image {filepath} has a different aspect ratio ({result.aspect_ratio}) than the first image in the batch ({first_aspect_ratio})."
-                                )
-                            elif (
-                                type(result) is tuple
-                                and result[2]
-                                and first_aspect_ratio is not None
-                                and result[2] != first_aspect_ratio
-                            ):
-                                raise ValueError(
-                                    f"({type(result)}) Image {filepath} has a different aspect ratio ({result[2]}) than the first image in the batch ({first_aspect_ratio})."
-                                )
-
-                    except Exception as e:
-                        logger.error(
-                            f"Error processing image in pool: {e}, traceback: {traceback.format_exc()}"
-                        )
-
-            # Second Loop: Final Processing
-            is_final_sample = False
-            output_values = []
-            first_aspect_ratio = None
-            for idx, (image, crop_coordinates, new_aspect_ratio) in enumerate(
-                processed_images
-            ):
-                if idx == len(processed_images) - 1:
-                    is_final_sample = True
-                if first_aspect_ratio is None:
-                    first_aspect_ratio = new_aspect_ratio
-                elif new_aspect_ratio != first_aspect_ratio:
-                    is_final_sample = True
-                    first_aspect_ratio = new_aspect_ratio
-                filepath, _, aspect_bucket = initial_data[idx]
-                filepaths.append(filepath)
-
-                pixel_values = self.transform(image).to(
-                    self.accelerator.device, dtype=self.vae.dtype
-                )
-                output_value = (pixel_values, filepath, aspect_bucket, is_final_sample)
-                output_values.append(output_value)
-                if not disable_queue:
-                    self.vae_input_queue.put(
-                        (pixel_values, filepath, aspect_bucket, is_final_sample)
-                    )
-                # Update the crop_coordinates in the metadata document
-                # NOTE: This is currently a no-op because the metadata is now considered 'trustworthy'.
-                #       The VAE encode uses the preexisting metadata, and the TrainingSample class will not update.
-                #       However, we'll check that the values didn't change anyway, just in case.
-                if crop_coordinates:
-                    current_crop_coordinates = (
-                        self.metadata_backend.get_metadata_attribute_by_filepath(
-                            filepath=filepath,
-                            attribute="crop_coordinates",
-                        )
-                    )
-                    if tuple(current_crop_coordinates) != tuple(crop_coordinates):
-                        logger.debug(
-                            f"Should be updating crop_coordinates for {filepath} from {current_crop_coordinates} to {crop_coordinates}. But we won't.."
-                        )
-
-            self.debug_log(
-                f"Completed processing gathered {len(output_values)} output values."
-            )
-        except Exception as e:
-            logger.error(
-                f"Error processing images {filepaths if len(filepaths) > 0 else image_paths}: {e}"
-            )
-            self.debug_log(f"Error traceback: {traceback.format_exc()}")
-            raise e
-        return output_values
-
-    def _encode_images_in_batch(
-        self, image_pixel_values: list = None, disable_queue: bool = False
-    ) -> None:
-        """Encode the batched Image objects using the VAE model.
-
-        Raises:
-            ValueError: If we receive any invalid results.
-        """
-        try:
-            if image_pixel_values is not None:
-                qlen = len(image_pixel_values)
-                if self.vae_batch_size != len(image_pixel_values):
-                    self.vae_batch_size = len(image_pixel_values)
-            else:
-                qlen = self.vae_input_queue.qsize()
-
-            if qlen == 0:
-                return
-            output_values = []
-            while qlen > 0:
-                vae_input_images, vae_input_filepaths, vae_output_filepaths = [], [], []
-                batch_aspect_bucket = None
-                count_to_process = min(qlen, self.vae_batch_size)
-                for idx in range(0, count_to_process):
-                    if image_pixel_values:
-                        pixel_values, filepath, aspect_bucket, is_final_sample = (
-                            image_pixel_values.pop()
-                        )
-                    else:
-                        pixel_values, filepath, aspect_bucket, is_final_sample = (
-                            self.vae_input_queue.get()
-                        )
-
-                    if batch_aspect_bucket is None:
-                        batch_aspect_bucket = aspect_bucket
-                    vae_input_images.append(pixel_values)
-                    vae_input_filepaths.append(filepath)
-                    vae_output_filepaths.append(
-                        self.generate_vae_cache_filename(filepath)[0]
-                    )
-                    if is_final_sample:
-                        # When we have fewer samples in a bucket than our VAE batch size might indicate,
-                        # we need to respect is_final_sample value and not retrieve the *next* element yet.
-                        break
-
-                latents = self.encode_images(
-                    [
-                        sample.to(dtype=StateTracker.get_vae_dtype())
-                        for sample in vae_input_images
-                    ],
-                    vae_input_filepaths,
-                    load_from_cache=False,
-                )
-                if latents is None:
-                    raise ValueError("Received None from encode_images")
-                for output_file, latent_vector, filepath in zip(
-                    vae_output_filepaths, latents, vae_input_filepaths
-                ):
-                    if latent_vector is None:
-                        raise ValueError(
-                            f"Latent vector is None for filepath {filepath}"
-                        )
-                    output_value = (output_file, filepath, latent_vector)
-                    output_values.append(output_value)
-                    if not disable_queue:
-                        logger.debug("Adding outputs to write queue")
-                        self.write_queue.put(output_value)
-                if image_pixel_values is not None:
-                    qlen = len(image_pixel_values)
-                else:
-                    qlen = self.vae_input_queue.qsize()
-        except Exception as e:
-            logger.error(f"Error encoding images {vae_input_filepaths}: {e}")
-            if "out of memory" in str(e).lower():
-                import sys
-
-                sys.exit(1)
-            # Remove all of the errored images from the bucket. They will be captured on restart.
-            for filepath in vae_input_filepaths:
-                self.metadata_backend.remove_image(filepath)
-            self.debug_log(f"Error traceback: {traceback.format_exc()}")
-            raise Exception(
-                f"Error encoding images {vae_input_filepaths}: {e}, traceback: {traceback.format_exc()}"
-            )
-        return output_values
-
-    def _read_from_storage_concurrently(self, paths, hide_errors: bool = False):
-        """
-        A helper method to read files from storage concurrently, without Queues.
-
-        Args:
-            paths (List[str]): A list of file paths to read.
-
-        Returns:
-            Generator[Tuple[str, Any], None, None]: Yields file path and contents.
-        """
-
-        def read_file(path):
-            try:
-                return path, self._read_from_storage(path, hide_errors=hide_errors)
-            except Exception as e:
-                import traceback
-
-                logger.error(
-                    f"Error reading {path}: {e}, traceback: {traceback.format_exc()}"
-                )
-                # If --delete_problematic_images is supplied, we remove the image now:
-                if self.delete_problematic_images:
-                    self.metadata_backend.remove_image(path)
-                    self.image_data_backend.delete(path)
-                return path, None
-
-        with ThreadPoolExecutor(max_workers=self.max_workers) as executor:
-            # Map read_file operation over all paths
-            future_to_path = {executor.submit(read_file, path): path for path in paths}
-            for future in as_completed(future_to_path):
-                path = future_to_path[future]
-                try:
-                    yield future.result()
-                except Exception as exc:
-                    logger.error(f"{path} generated an exception: {exc}")
-
-    def read_images_in_batch(self) -> None:
-        """Immediately read a batch of images.
-
-        The images are added to a Queue, for later processing.
-
-        Args:
-            filepaths (list): A list of image file paths.
-
-        Returns:
-            None
-        """
-        filepaths = []
-        qlen = self.read_queue.qsize()
-        for idx in range(0, qlen):
-            read_queue_item = self.read_queue.get()
-            path, aspect_bucket = read_queue_item
-            filepaths.append(path)
-        available_filepaths, batch_output = self.image_data_backend.read_image_batch(
-            filepaths, delete_problematic_images=self.delete_problematic_images
-        )
-        missing_image_count = len(filepaths) - len(available_filepaths)
-        if len(available_filepaths) != len(filepaths):
-            logging.warning(
-                f"Failed to request {missing_image_count} sample{'s' if missing_image_count > 1 else ''} during batched read, out of {len(filepaths)} total samples requested."
-                " These samples likely do not exist in the storage pool any longer."
-            )
-        for filepath, element in zip(available_filepaths, batch_output):
-            if type(filepath) != str:
-                raise ValueError(
-                    f"Received unknown filepath type ({type(filepath)}) value: {filepath}"
-                )
-            # Add the element to the queue for later processing.
-            # This allows us to have separate read and processing queue size limits.
-            self.process_queue.put((filepath, element, aspect_bucket))
-
-    def _process_raw_filepath(self, raw_filepath: str):
-        if type(raw_filepath) == str or len(raw_filepath) == 1:
-            filepath = raw_filepath
-        elif len(raw_filepath) == 2:
-            basename, filepath = raw_filepath
-        elif type(raw_filepath) == Path or type(raw_filepath) == numpy_str:
-            filepath = str(raw_filepath)
-        else:
-            raise ValueError(
-                f"Received unknown filepath type ({type(raw_filepath)}) value: {raw_filepath}"
-            )
-        return filepath
-
-    def _accumulate_read_queue(self, filepath, aspect_bucket):
-        self.read_queue.put((filepath, aspect_bucket))
-
-    def _process_futures(self, futures: list, executor: ThreadPoolExecutor):
-        completed_futures = []
-        for future in as_completed(futures):
-            try:
-                future.result()
-                completed_futures.append(future)
-            except Exception as e:
-                logging.error(
-                    f"An error occurred in a future: {e}, file {e.__traceback__.tb_frame}, {e.__traceback__.tb_lineno}, future traceback {traceback.format_exc()}"
-                )
-                completed_futures.append(future)
-        return [f for f in futures if f not in completed_futures]
-
-    def process_buckets(self):
-        futures = []
-        processed_images = self._list_cached_images()
-        aspect_bucket_cache = self.metadata_backend.read_cache().copy()
-
-        # Extract and shuffle the keys of the dictionary
-        do_shuffle = (
-            os.environ.get("SIMPLETUNER_SHUFFLE_ASPECTS", "true").lower() == "true"
-        )
-        if do_shuffle:
-            shuffled_keys = list(aspect_bucket_cache.keys())
-            shuffle(shuffled_keys)
-
-        if self.webhook_handler is not None:
-            total_count = len(
-                [item for sublist in aspect_bucket_cache.values() for item in sublist]
-            )
-            self.send_progress_update(
-                type="init_cache_vae_processing_started",
-                progress=int(len(processed_images) / total_count * 100),
-                total=total_count,
-                current=len(processed_images),
-            )
-
-        with ThreadPoolExecutor(max_workers=self.max_workers) as executor:
-            for bucket in shuffled_keys:
-                relevant_files = self._reduce_bucket(
-                    bucket, aspect_bucket_cache, processed_images, do_shuffle
-                )
-                if len(relevant_files) == 0:
-                    continue
-                statistics = {
-                    "not_local": 0,
-                    "already_cached": 0,
-                    "cached": 0,
-                    "total": 0,
-                }
-                last_reported_index = 0
-
-                for raw_filepath in tqdm(
-                    relevant_files,
-                    desc=f"Processing bucket {bucket}",
-                    position=get_rank(),
-                    ncols=125,
-                    leave=False,
-                ):
-                    statistics["total"] += 1
-                    filepath = self._process_raw_filepath(raw_filepath)
-                    test_filepath = self._image_filename_from_vaecache_filename(
-                        filepath
-                    )
-                    if test_filepath is None:
-                        continue
-                    if test_filepath not in self.local_unprocessed_files:
-                        statistics["not_local"] += 1
-                        continue
-                    try:
-                        # Convert whatever we have, into the VAE cache basename.
-                        filepath = self._process_raw_filepath(raw_filepath)
-                        # Does it exist on the backend?
-                        if self.already_cached(filepath):
-                            statistics["already_cached"] += 1
-                            continue
-                        # It does not exist. We can add it to the read queue.
-                        self._accumulate_read_queue(filepath, aspect_bucket=bucket)
-                        # We will check to see whether the queue is ready.
-                        if self.read_queue.qsize() >= self.read_batch_size:
-                            # We have an adequate number of samples to read. Let's now do that in a batch, to reduce I/O wait.
-                            future_to_read = executor.submit(self.read_images_in_batch)
-                            futures.append(future_to_read)
-
-                        # Now we try and process the images, if we have a process batch size large enough.
-                        if self.process_queue.qsize() >= self.process_queue_size:
-                            future_to_process = executor.submit(
-                                self._process_images_in_batch
-                            )
-                            futures.append(future_to_process)
-
-                        # Now we encode the images.
-                        if self.vae_input_queue.qsize() >= self.vae_batch_size:
-                            statistics["cached"] += 1
-                            future_to_process = executor.submit(
-                                self._encode_images_in_batch
-                            )
-                            futures.append(future_to_process)
-                            if (
-                                self.webhook_handler is not None
-                                and int(
-                                    statistics["total"]
-                                    // self.webhook_progress_interval
-                                )
-                                > last_reported_index
-                            ):
-                                last_reported_index = (
-                                    statistics["total"]
-                                    // self.webhook_progress_interval
-                                )
-                                self.send_progress_update(
-                                    type="vaecache",
-                                    progress=int(
-                                        statistics["total"] / len(relevant_files) * 100
-                                    ),
-                                    total=len(relevant_files),
-                                    current=statistics["total"],
-                                )
-
-                        # If we have accumulated enough write objects, we can write them to disk at once.
-                        if self.write_queue.qsize() >= self.write_batch_size:
-                            future_to_write = executor.submit(
-                                self._write_latents_in_batch
-                            )
-                            futures.append(future_to_write)
-                    except ValueError as e:
-                        logger.error(f"Received fatal error: {e}")
-                        raise e
-                    except Exception as e:
-                        logger.error(f"Error processing image {filepath}: {e}")
-                        self.debug_log(f"Error traceback: {traceback.format_exc()}")
-                        raise e
-
-                    # Now, see if we have any futures to complete, and execute them.
-                    # Cleanly removes futures from the list, once they are completed.
-                    futures = self._process_futures(futures, executor)
-
-                try:
-                    # Handle remainders after processing the bucket
-                    if self.read_queue.qsize() > 0:
-                        # We have an adequate number of samples to read. Let's now do that in a batch, to reduce I/O wait.
-                        future_to_read = executor.submit(self.read_images_in_batch)
-                        futures.append(future_to_read)
-
-                    futures = self._process_futures(futures, executor)
-
-                    # Now we try and process the images, if we have a process batch size large enough.
-                    if self.process_queue.qsize() > 0:
-                        future_to_process = executor.submit(
-                            self._process_images_in_batch
-                        )
-                        futures.append(future_to_process)
-
-                    futures = self._process_futures(futures, executor)
-
-                    if self.vae_input_queue.qsize() > 0:
-                        future_to_process = executor.submit(
-                            self._encode_images_in_batch
-                        )
-                        futures.append(future_to_process)
-
-                    futures = self._process_futures(futures, executor)
-
-                    # Write the remaining batches. This is not strictly necessary, since they do not need to be written with matching dimensions.
-                    # However, it's simply easiest to do this now, even if we have less-than a single batch size.
-                    if self.write_queue.qsize() > 0:
-                        future_to_write = executor.submit(self._write_latents_in_batch)
-                        futures.append(future_to_write)
-
-                    futures = self._process_futures(futures, executor)
-                    log_msg = (
-                        f"(id={self.id}) Bucket {bucket} caching results: {statistics}"
-                    )
-                    if get_rank() == 0:
-                        logger.debug(log_msg)
-                        tqdm.write(log_msg)
-                    if self.webhook_handler is not None:
-                        self.send_progress_update(
-                            type="init_cache_vae_processing_complete",
-                            progress=100,
-                            total=statistics["total"],
-                            current=statistics["total"],
-                        )
-                    self.debug_log(
-                        "Completed process_buckets, all futures have been returned."
-                    )
-                except Exception as e:
-                    logger.error(f"Fatal error when processing bucket {bucket}: {e}")
-                    continue
-
-    def scan_cache_contents(self):
-        """
-        A generator method that iterates over the VAE cache, yielding each cache file's path and its contents
-        using multi-threading for improved performance.
-
-        Yields:
-            Tuple[str, Any]: A tuple containing the file path and its contents.
-        """
-        try:
-            all_cache_files = StateTracker.get_vae_cache_files(data_backend_id=self.id)
-            try:
-                yield from self._read_from_storage_concurrently(
-                    all_cache_files, hide_errors=True
-                )
-            except FileNotFoundError:
-                yield (None, None)
-        except Exception as e:
-            if "is not iterable" not in str(e):
-                logger.error(f"Error in scan_cache_contents: {e}")
-                self.debug_log(f"Error traceback: {traceback.format_exc()}")
diff --git a/videotuna/third_party/flux/configuration/cmd_args.py b/videotuna/third_party/flux/configuration/cmd_args.py
deleted file mode 100644
index 7510830b..00000000
--- a/videotuna/third_party/flux/configuration/cmd_args.py
+++ /dev/null
@@ -1,2396 +0,0 @@
-import argparse
-import logging
-import os
-import random
-import sys
-import time
-from datetime import timedelta
-from typing import Dict, List, Optional, Tuple
-
-import torch
-from accelerate import InitProcessGroupKwargs
-from accelerate.utils import ProjectConfiguration
-
-from videotuna.third_party.flux.models.smoldit import SmolDiTConfigurationNames
-from videotuna.third_party.flux.training import quantised_precision_levels
-from videotuna.third_party.flux.training.optimizer_param import (
-    is_optimizer_deprecated,
-    is_optimizer_grad_fp32,
-    map_deprecated_optimizer_parameter,
-    optimizer_choices,
-)
-
-logger = logging.getLogger("ArgsParser")
-# Are we the primary process?
-is_primary_process = True
-if os.environ.get("RANK") is not None:
-    if int(os.environ.get("RANK")) != 0:
-        is_primary_process = False
-logger.setLevel(
-    os.environ.get("SIMPLETUNER_LOG_LEVEL", "INFO" if is_primary_process else "ERROR")
-)
-
-if torch.cuda.is_available():
-    os.environ["NCCL_SOCKET_NTIMEO"] = "2000000"
-
-
-def print_on_main_thread(message):
-    if is_primary_process:
-        print(message)
-
-
-def info_log(message):
-    if is_primary_process:
-        logger.info(message)
-
-
-def warning_log(message):
-    if is_primary_process:
-        logger.warning(message)
-
-
-def error_log(message):
-    if is_primary_process:
-        logger.error(message)
-
-
-def get_argument_parser():
-    parser = argparse.ArgumentParser(
-        description="The following SimpleTuner command-line options are available:",
-        exit_on_error=False,
-    )
-    parser.add_argument(
-        "--snr_gamma",
-        type=float,
-        default=None,
-        help=(
-            "SNR weighting gamma to be used if rebalancing the loss. Recommended value is 5.0."
-            " More details here: https://arxiv.org/abs/2303.09556."
-        ),
-    )
-    parser.add_argument(
-        "--use_soft_min_snr",
-        action="store_true",
-        help=(
-            "If set, will use the soft min SNR calculation method. This method uses the sigma_data parameter."
-            " If not provided, the method will raise an error."
-        ),
-    )
-    parser.add_argument(
-        "--soft_min_snr_sigma_data",
-        default=None,
-        type=float,
-        help=(
-            "The standard deviation of the data used in the soft min weighting method."
-            " This is required when using the soft min SNR calculation method."
-        ),
-    )
-    parser.add_argument(
-        "--model_family",
-        choices=["pixart_sigma", "kolors", "sd3", "flux", "smoldit", "sdxl", "legacy"],
-        default=None,
-        required=True,
-        help=("The model family to train. This option is required."),
-    )
-    parser.add_argument(
-        "--model_type",
-        type=str,
-        choices=[
-            "full",
-            "lora",
-            "deepfloyd-full",
-            "deepfloyd-lora",
-            "deepfloyd-stage2",
-            "deepfloyd-stage2-lora",
-        ],
-        default="full",
-        help=(
-            "The training type to use. 'full' will train the full model, while 'lora' will train the LoRA model."
-            " LoRA is a smaller model that can be used for faster training."
-        ),
-    )
-    parser.add_argument(
-        "--flux_lora_target",
-        type=str,
-        choices=[
-            "mmdit",
-            "context",
-            "context+ffs",
-            "all",
-            "all+ffs",
-            "ai-toolkit",
-            "tiny",
-            "nano",
-        ],
-        default="all",
-        help=(
-            "Flux has single and joint attention blocks."
-            " By default, all attention layers are trained, but not the feed-forward layers"
-            " If 'mmdit' is provided, the text input layers will not be trained."
-            " If 'context' is provided, then ONLY the text attention layers are trained"
-            " If 'context+ffs' is provided, then text attention and text feed-forward layers are trained. This is somewhat similar to text-encoder-only training in earlier SD versions."
-            " If 'all' is provided, all layers will be trained, minus feed-forward."
-            " If 'all+ffs' is provided, all layers will be trained including feed-forward."
-            " If 'ai-toolkit' is provided, all layers will be trained including feed-forward and norms (based on ostris/ai-toolkit)."
-            " If 'tiny' is provided, only two layers will be trained."
-            " If 'nano' is provided, only one layers will be trained."
-        ),
-    )
-    parser.add_argument(
-        "--flow_matching_sigmoid_scale",
-        type=float,
-        default=1.0,
-        help="Scale factor for sigmoid timestep sampling for flow-matching models..",
-    )
-    parser.add_argument(
-        "--flux_fast_schedule",
-        action="store_true",
-        help=(
-            "An experimental feature to train Flux.1S using a noise schedule closer to what it was trained with,"
-            " which has improved results in short experiments. Thanks to @mhirki for the contribution."
-        ),
-    )
-    parser.add_argument(
-        "--flux_use_beta_schedule",
-        action="store_true",
-        help=(
-            "Whether or not to use a beta schedule with Flux instead of sigmoid. The default values of alpha"
-            " and beta approximate a sigmoid."
-        ),
-    )
-    parser.add_argument(
-        "--flux_beta_schedule_alpha",
-        type=float,
-        default=2.0,
-        help=("The alpha value of the flux beta schedule. Default is 2.0"),
-    )
-    parser.add_argument(
-        "--flux_beta_schedule_beta",
-        type=float,
-        default=2.0,
-        help=("The beta value of the flux beta schedule. Default is 2.0"),
-    )
-    parser.add_argument(
-        "--flux_schedule_shift",
-        type=float,
-        default=3,
-        help=(
-            "Shift the noise schedule. This is a value between 0 and ~4.0, where 0 disables the timestep-dependent shift,"
-            " and anything greater than 0 will shift the timestep sampling accordingly. The SD3 model was trained with"
-            " a shift value of 3. The value for Flux is unknown. Higher values result in less noisy timesteps sampled,"
-            " which results in a lower mean loss value, but not necessarily better results. Early reports indicate"
-            " that modification of this value can change how the contrast is learnt by the model, and whether fine"
-            " details are ignored or accentuated, removing fine details and making the outputs blurrier."
-        ),
-    )
-    parser.add_argument(
-        "--flux_schedule_auto_shift",
-        action="store_true",
-        default=False,
-        help=(
-            "Shift the noise schedule depending on image resolution. The shift value calculation is taken from the official"
-            " Flux inference code. Shift value is math.exp(1.15) = 3.1581 for a pixel count of 1024px * 1024px. The shift"
-            " value grows exponentially with higher pixel counts. It is a good idea to train on a mix of different resolutions"
-            " when this option is enabled. You may need to lower your learning rate with this enabled."
-        ),
-    )
-    parser.add_argument(
-        "--flux_guidance_mode",
-        type=str,
-        choices=["constant", "random-range", "mobius"],
-        default="constant",
-        help=(
-            "Flux has a 'guidance' value used during training time that reflects the CFG range of your training samples."
-            " The default mode 'constant' will use a single value for every sample."
-            " The mode 'random-range' will randomly select a value from the range of the CFG for each sample."
-            " The mode 'mobius' will use a value that is a function of the remaining steps in the epoch, constructively"
-            " deconstructing the constructed deconstructions to then Mobius them back into the constructed reconstructions,"
-            " possibly resulting in the exploration of what is known as the Mobius space, a new continuous"
-            " realm of possibility brought about by destroying the model so that you can make it whole once more."
-            " Or so according to DataVoid, anyway. This is just a Flux-specific implementation of Mobius."
-            " Set the range using --flux_guidance_min and --flux_guidance_max."
-        ),
-    )
-    parser.add_argument(
-        "--flux_guidance_value",
-        type=float,
-        default=1.0,
-        help=(
-            "When using --flux_guidance_mode=constant, this value will be used for every input sample."
-            " Using a value of 1.0 seems to preserve the CFG distillation for the Dev model,"
-            " and using any other value will result in the resulting LoRA requiring CFG at inference time."
-        ),
-    )
-    parser.add_argument(
-        "--flux_guidance_min",
-        type=float,
-        default=0.0,
-    )
-    parser.add_argument(
-        "--flux_guidance_max",
-        type=float,
-        default=4.0,
-    )
-    parser.add_argument(
-        "--flux_attention_masked_training",
-        action="store_true",
-        default=False,
-        help="Use attention masking while training flux.",
-    )
-    parser.add_argument(
-        "--t5_padding",
-        choices=["zero", "unmodified"],
-        default="unmodified",
-        help=(
-            "The padding behaviour for Flux. The default is 'zero', which will pad the input with zeros."
-            " The alternative is 'unmodified', which will not pad the input."
-        ),
-    )
-    parser.add_argument(
-        "--smoldit",
-        action="store_true",
-        default=False,
-        help=("Use the experimental SmolDiT model architecture."),
-    )
-    parser.add_argument(
-        "--smoldit_config",
-        type=str,
-        choices=SmolDiTConfigurationNames,
-        default="smoldit-base",
-        help=(
-            "The SmolDiT configuration to use. This is a list of pre-configured models."
-            " The default is 'smoldit-base'."
-        ),
-    )
-    parser.add_argument(
-        "--flow_matching_loss",
-        type=str,
-        choices=["diffusers", "compatible", "diffusion", "sd35"],
-        default="compatible",
-        help=(
-            "A discrepancy exists between the Diffusers implementation of flow matching and the minimal implementation provided"
-            " by StabilityAI. This experimental option allows switching loss calculations to be compatible with those."
-            " Additionally, 'diffusion' is offered as an option to reparameterise a model to v_prediction loss."
-            " sd35 provides the ability to train on SD3.5's flow-matching target, which is the denoised sample."
-        ),
-    )
-    parser.add_argument(
-        "--sd3_clip_uncond_behaviour",
-        type=str,
-        choices=["empty_string", "zero"],
-        default="empty_string",
-        help=(
-            "SD3 can be trained using zeroed prompt embeds during unconditional dropout,"
-            " or an encoded empty string may be used instead (the default). Changing this value may stabilise or"
-            " destabilise training. The default is 'empty_string'."
-        ),
-    )
-    parser.add_argument(
-        "--sd3_t5_uncond_behaviour",
-        type=str,
-        choices=["empty_string", "zero"],
-        default=None,
-        help=(
-            "Override the value of unconditional prompts from T5 embeds."
-            " The default is to follow the value of --sd3_clip_uncond_behaviour."
-        ),
-    )
-    parser.add_argument(
-        "--lora_type",
-        type=str.lower,
-        choices=["standard", "lycoris"],
-        default="standard",
-        help=(
-            "When training using --model_type=lora, you may specify a different type of LoRA to train here."
-            " standard refers to training a vanilla LoRA via PEFT, lycoris refers to training with KohakuBlueleaf's library of the same name."
-        ),
-    )
-    parser.add_argument(
-        "--lora_init_type",
-        type=str,
-        choices=["default", "gaussian", "loftq", "olora", "pissa"],
-        default="default",
-        help=(
-            "The initialization type for the LoRA model. 'default' will use Microsoft's initialization method,"
-            " 'gaussian' will use a Gaussian scaled distribution, and 'loftq' will use LoftQ initialization."
-            " In short experiments, 'default' produced accurate results earlier in training, 'gaussian' had slightly more"
-            " creative outputs, and LoftQ produces an entirely different result with worse quality at first, taking"
-            " potentially longer to converge than the other methods."
-        ),
-    )
-    parser.add_argument(
-        "--init_lora",
-        type=str,
-        default=None,
-        help="Specify an existing LoRA or LyCORIS safetensors file to initialize the adapter and continue training, if a full checkpoint is not available.",
-    )
-    parser.add_argument(
-        "--lora_rank",
-        type=int,
-        default=16,
-        help=("The dimension of the LoRA update matrices."),
-    )
-    parser.add_argument(
-        "--lora_alpha",
-        type=float,
-        required=False,
-        default=None,
-        help=(
-            "The alpha value for the LoRA model. This is the learning rate for the LoRA update matrices."
-        ),
-    )
-    parser.add_argument(
-        "--lora_dropout",
-        type=float,
-        default=0.1,
-        help=(
-            "LoRA dropout randomly ignores neurons during training. This can help prevent overfitting."
-        ),
-    )
-    parser.add_argument(
-        "--lycoris_config",
-        type=str,
-        default="configs/006_flux/lycoris_config.json",
-        help=("The location for the JSON file of the Lycoris configuration."),
-    )
-    parser.add_argument(
-        "--init_lokr_norm",
-        type=float,
-        required=False,
-        default=None,
-        help=(
-            "Setting this turns on perturbed normal initialization of the LyCORIS LoKr PEFT layers. A good value is between 1e-4 and 1e-2."
-        ),
-    )
-    parser.add_argument(
-        "--controlnet",
-        action="store_true",
-        default=False,
-        help=(
-            "If set, ControlNet style training will be used, where a conditioning input image is required alongside the training data."
-        ),
-    )
-    parser.add_argument(
-        "--controlnet_model_name_or_path",
-        action="store_true",
-        default=None,
-        help=(
-            "When provided alongside --controlnet, this will specify ControlNet model weights to preload from the hub."
-        ),
-    )
-    parser.add_argument(
-        "--pretrained_model_name_or_path",
-        type=str,
-        default=None,
-        required=True,
-        help="Path to pretrained model or model identifier from huggingface.co/models.",
-    )
-    parser.add_argument(
-        "--pretrained_transformer_model_name_or_path",
-        type=str,
-        default=None,
-        help="Path to pretrained transformer model or model identifier from huggingface.co/models.",
-    )
-    parser.add_argument(
-        "--pretrained_transformer_subfolder",
-        type=str,
-        default="transformer",
-        help="The subfolder to load the transformer model from. Use 'none' for a flat directory.",
-    )
-    parser.add_argument(
-        "--pretrained_unet_model_name_or_path",
-        type=str,
-        default=None,
-        help="Path to pretrained unet model or model identifier from huggingface.co/models.",
-    )
-    parser.add_argument(
-        "--pretrained_unet_subfolder",
-        type=str,
-        default="unet",
-        help="The subfolder to load the unet model from. Use 'none' for a flat directory.",
-    )
-    parser.add_argument(
-        "--pretrained_vae_model_name_or_path",
-        type=str,
-        default="madebyollin/sdxl-vae-fp16-fix",
-        help="Path to an improved VAE to stabilize training. For more details check out: https://github.com/huggingface/diffusers/pull/4038.",
-    )
-    parser.add_argument(
-        "--pretrained_t5_model_name_or_path",
-        type=str,
-        default=None,
-        help=(
-            "T5-XXL is a huge model, and starting from many different models will download a separate one each time."
-            " This option allows you to specify a specific location to retrieve T5-XXL v1.1 from, so that it only downloads once.."
-        ),
-    )
-
-    parser.add_argument(
-        "--prediction_type",
-        type=str,
-        default="epsilon",
-        choices=["epsilon", "v_prediction", "sample"],
-        help=(
-            "The type of prediction to use for the u-net. Choose between ['epsilon', 'v_prediction', 'sample']."
-            " For SD 2.1-v, this is v_prediction. For 2.1-base, it is epsilon. SDXL is generally epsilon."
-            " SD 1.5 is epsilon."
-        ),
-    )
-    parser.add_argument(
-        "--snr_weight",
-        type=float,
-        default=1.0,
-        help=(
-            "When training a model using `--prediction_type=sample`, one can supply an SNR weight value to augment the loss with."
-            " If a value of 0.5 is provided here, the loss is taken half from the SNR and half from the MSE."
-        ),
-    )
-    parser.add_argument(
-        "--training_scheduler_timestep_spacing",
-        type=str,
-        default="trailing",
-        choices=["leading", "linspace", "trailing"],
-        help=(
-            "(SDXL Only) Spacing timesteps can fundamentally alter the course of history. Er, I mean, your model weights."
-            " For all training, including epsilon, it would seem that 'trailing' is the right choice. SD 2.x always uses 'trailing',"
-            " but SDXL may do better in its default state when using 'leading'."
-        ),
-    )
-    parser.add_argument(
-        "--inference_scheduler_timestep_spacing",
-        type=str,
-        default="trailing",
-        choices=["leading", "linspace", "trailing"],
-        help=(
-            "(SDXL Only) The Bytedance paper on zero terminal SNR recommends inference using 'trailing'. SD 2.x always uses 'trailing',"
-            " but SDXL may do better in its default state when using 'leading'."
-        ),
-    )
-    parser.add_argument(
-        "--refiner_training",
-        action="store_true",
-        default=False,
-        help=(
-            "When training or adapting a model into a mixture-of-experts 2nd stage / refiner model, this option should be set."
-            " This will slice the timestep schedule defined by --refiner_training_strength proportion value (default 0.2)"
-        ),
-    )
-    parser.add_argument(
-        "--refiner_training_invert_schedule",
-        action="store_true",
-        default=False,
-        help=(
-            "While the refiner training strength is applied to the end of the schedule, this option will invert the result"
-            " for training a **base** model, eg. the first model in a mixture-of-experts series."
-            " A --refiner_training_strength of 0.35 will result in the refiner learning timesteps 349-0."
-            " Setting --refiner_training_invert_schedule then would result in the base model learning timesteps 999-350."
-        ),
-    )
-    parser.add_argument(
-        "--refiner_training_strength",
-        default=0.2,
-        type=float,
-        help=(
-            "When training a refiner / 2nd stage mixture of experts model, the refiner training strength"
-            " indicates how much of the *end* of the schedule it will be trained on. A value of 0.2 means"
-            " timesteps 199-0 will be the focus of this model, and 0.3 would be 299-0 and so on."
-            " The default value is 0.2, in line with the SDXL refiner pretraining."
-        ),
-    )
-    parser.add_argument(
-        "--timestep_bias_strategy",
-        type=str,
-        default="none",
-        choices=["earlier", "later", "range", "none"],
-        help=(
-            "The timestep bias strategy, which may help direct the model toward learning low or frequency details."
-            " Choices: ['earlier', 'later', 'none']."
-            " The default is 'none', which means no bias is applied, and training proceeds normally."
-            " The value of 'later' will prefer to generate samples for later timesteps."
-        ),
-    )
-    parser.add_argument(
-        "--timestep_bias_multiplier",
-        type=float,
-        default=1.0,
-        help=(
-            "The multiplier for the bias. Defaults to 1.0, which means no bias is applied."
-            " A value of 2.0 will double the weight of the bias, and a value of 0.5 will halve it."
-        ),
-    )
-    parser.add_argument(
-        "--timestep_bias_begin",
-        type=int,
-        default=0,
-        help=(
-            "When using `--timestep_bias_strategy=range`, the beginning timestep to bias."
-            " Defaults to zero, which equates to having no specific bias."
-        ),
-    )
-    parser.add_argument(
-        "--timestep_bias_end",
-        type=int,
-        default=1000,
-        help=(
-            "When using `--timestep_bias_strategy=range`, the final timestep to bias."
-            " Defaults to 1000, which is the number of timesteps that SDXL Base and SD 2.x were trained on."
-        ),
-    )
-    parser.add_argument(
-        "--timestep_bias_portion",
-        type=float,
-        default=0.25,
-        help=(
-            "The portion of timesteps to bias. Defaults to 0.25, which 25 percent of timesteps will be biased."
-            " A value of 0.5 will bias one half of the timesteps. The value provided for `--timestep_bias_strategy` determines"
-            " whether the biased portions are in the earlier or later timesteps."
-        ),
-    )
-    parser.add_argument(
-        "--disable_segmented_timestep_sampling",
-        action="store_true",
-        help=(
-            "By default, the timestep schedule is divided into roughly `train_batch_size` number of segments, and then"
-            " each of those are sampled from separately. This improves the selection distribution, but may not"
-            " be desired in certain training scenarios, eg. when limiting the timestep selection range."
-        ),
-    )
-    parser.add_argument(
-        "--rescale_betas_zero_snr",
-        action="store_true",
-        help=(
-            "If set, will rescale the betas to zero terminal SNR. This is recommended for training with v_prediction."
-            " For epsilon, this might help with fine details, but will not result in contrast improvements."
-        ),
-    )
-    parser.add_argument(
-        "--vae_dtype",
-        type=str,
-        default="bf16",
-        choices=["default", "fp16", "fp32", "bf16"],
-        required=False,
-        help=(
-            "The dtype of the VAE model. Choose between ['default', 'fp16', 'fp32', 'bf16']."
-            " The default VAE dtype is bfloat16, due to NaN issues in SDXL 1.0."
-            " Using fp16 is not recommended."
-        ),
-    )
-    parser.add_argument(
-        "--vae_batch_size",
-        type=int,
-        default=4,
-        help=(
-            "When pre-caching latent vectors, this is the batch size to use. Decreasing this may help with VRAM issues,"
-            " but if you are at that point of contention, it's possible that your GPU has too little RAM. Default: 4."
-        ),
-    )
-    parser.add_argument(
-        "--vae_cache_scan_behaviour",
-        type=str,
-        choices=["recreate", "sync"],
-        default="recreate",
-        help=(
-            "When a mismatched latent vector is detected, a scan will be initiated to locate inconsistencies and resolve them."
-            " The default setting 'recreate' will delete any inconsistent cache entries and rebuild it."
-            " Alternatively, 'sync' will update the bucket configuration so that the image is in a bucket that matches its latent size."
-            " The recommended behaviour is to use the default value and allow the cache to be recreated."
-        ),
-    )
-    parser.add_argument(
-        "--vae_cache_ondemand",
-        action="store_true",
-        default=False,
-        help=(
-            "By default, will batch-encode images before training. For some situations, ondemand may be desired, but it greatly slows training and increases memory pressure."
-        ),
-    )
-    parser.add_argument(
-        "--compress_disk_cache",
-        action="store_true",
-        default=False,
-        help=(
-            "If set, will gzip-compress the disk cache for Pytorch files. This will save substantial disk space, but may slow down the training process."
-        ),
-    )
-    parser.add_argument(
-        "--aspect_bucket_disable_rebuild",
-        action="store_true",
-        default=False,
-        help=(
-            "When using a randomised aspect bucket list, the VAE and aspect cache are rebuilt on each epoch."
-            " With a large and diverse enough dataset, rebuilding the aspect list may take a long time, and this may be undesirable."
-            " This option will not override vae_cache_clear_each_epoch. If both options are provided, only the VAE cache will be rebuilt."
-        ),
-    )
-    parser.add_argument(
-        "--keep_vae_loaded",
-        action="store_true",
-        default=False,
-        help="If set, will keep the VAE loaded in memory. This can reduce disk churn, but consumes VRAM during the forward pass.",
-    )
-    parser.add_argument(
-        "--skip_file_discovery",
-        type=str,
-        default="",
-        help=(
-            "Comma-separated values of which stages to skip discovery for. Skipping any stage will speed up resumption,"
-            " but will increase the risk of errors, as missing images or incorrectly bucketed images may not be caught."
-            " 'vae' will skip the VAE cache process, 'aspect' will not build any aspect buckets, and 'text' will avoid text embed management."
-            " Valid options: aspect, vae, text, metadata."
-        ),
-    )
-    parser.add_argument(
-        "--revision",
-        type=str,
-        default=None,
-        required=False,
-        help=(
-            "Revision of pretrained model identifier from huggingface.co/models. Trainable model components should be"
-            " at least bfloat16 precision."
-        ),
-    )
-    parser.add_argument(
-        "--variant",
-        type=str,
-        default=None,
-        required=False,
-        help=(
-            "Variant of pretrained model identifier from huggingface.co/models. Trainable model components should be"
-            " at least bfloat16 precision."
-        ),
-    )
-    parser.add_argument(
-        "--preserve_data_backend_cache",
-        action="store_true",
-        default=False,
-        help=(
-            "For very large cloud storage buckets that will never change, enabling this option will prevent the trainer"
-            " from scanning it at startup, by preserving the cache files that we generate. Be careful when using this,"
-            " as, switching datasets can result in the preserved cache being used, which would be problematic."
-            " Currently, cache is not stored in the dataset itself but rather, locally. This may change in a future release."
-        ),
-    )
-    parser.add_argument(
-        "--use_dora",
-        action="store_true",
-        default=False,
-        help=(
-            "If set, will use the DoRA-enhanced LoRA training. This is an experimental feature, may slow down training,"
-            " and is not recommended for general use."
-        ),
-    )
-    parser.add_argument(
-        "--override_dataset_config",
-        action="store_true",
-        default=False,
-        help=(
-            "When provided, the dataset's config will not be checked against the live backend config."
-            " This is useful if you want to simply update the behaviour of an existing dataset,"
-            " but the recommendation is to not change the dataset configuration after caching has begun,"
-            " as most options cannot be changed without unexpected behaviour later on. Additionally, it prevents"
-            " accidentally loading an SDXL configuration on a SD 2.x model and vice versa."
-        ),
-    )
-    parser.add_argument(
-        "--cache_dir_text",
-        type=str,
-        default="cache",
-        help=(
-            "This is the path to a local directory that will contain your text embed cache."
-        ),
-    )
-    parser.add_argument(
-        "--cache_dir_vae",
-        type=str,
-        default="",
-        help=(
-            "This is the path to a local directory that will contain your VAE outputs."
-            " Unlike the text embed cache, your VAE latents will be stored in the AWS data backend."
-            " Each backend can have its own value, but if that is not provided, this will be the default value."
-        ),
-    )
-    parser.add_argument(
-        "--data_backend_config",
-        type=str,
-        default=None,
-        help=(
-            "The relative or fully-qualified path for your data backend config."
-            " See multidatabackend.json.example for an example."
-        ),
-    )
-    parser.add_argument(
-        "--data_backend_sampling",
-        type=str,
-        choices=["uniform", "auto-weighting"],
-        default="auto-weighting",
-        help=(
-            "When using multiple data backends, the sampling weighting can be set to 'uniform' or 'auto-weighting'."
-            " The default value is 'auto-weighting', which will automatically adjust the sampling weights based on the"
-            " number of images in each backend. 'uniform' will sample from each backend equally."
-        ),
-    )
-    parser.add_argument(
-        "--ignore_missing_files",
-        action="store_true",
-        help=(
-            "This option will disable the check for files that have been deleted or removed from your data directory."
-            " This would allow training on large datasets without keeping the associated images on disk, though it's"
-            " not recommended and is not a supported feature. Use with caution, as it mostly exists for experimentation."
-        ),
-    )
-    parser.add_argument(
-        "--write_batch_size",
-        type=int,
-        default=128,
-        help=(
-            "When using certain storage backends, it is better to batch smaller writes rather than continuous dispatching."
-            " In SimpleTuner, write batching is currently applied during VAE caching, when many small objects are written."
-            " This mostly applies to S3, but some shared server filesystems may benefit as well, eg. Ceph. Default: 64."
-        ),
-    )
-    parser.add_argument(
-        "--read_batch_size",
-        type=int,
-        default=25,
-        help=(
-            "Used by the VAE cache to prefetch image data. This is the number of images to read ahead."
-        ),
-    )
-    parser.add_argument(
-        "--image_processing_batch_size",
-        type=int,
-        default=32,
-        help=(
-            "When resizing and cropping images, we do it in parallel using processes or threads."
-            " This defines how many images will be read into the queue before they are processed."
-        ),
-    )
-    parser.add_argument(
-        "--enable_multiprocessing",
-        default=False,
-        action="store_true",
-        help=(
-            "If set, will use processes instead of threads during metadata caching operations."
-            " For some systems, multiprocessing may be faster than threading, but will consume a lot more memory."
-            " Use this option with caution, and monitor your system's memory usage."
-        ),
-    )
-    parser.add_argument(
-        "--max_workers",
-        default=32,
-        type=int,
-        help=("How many active threads or processes to run during VAE caching."),
-    )
-    parser.add_argument(
-        "--aws_max_pool_connections",
-        type=int,
-        default=128,
-        help=(
-            "When using AWS backends, the maximum number of connections to keep open to the S3 bucket at a single time."
-            " This should be greater or equal to the max_workers and aspect bucket worker count values."
-        ),
-    )
-    parser.add_argument(
-        "--torch_num_threads",
-        type=int,
-        default=8,
-        help=(
-            "The number of threads to use for PyTorch operations. This is not the same as the number of workers."
-            " Default: 8."
-        ),
-    )
-    parser.add_argument(
-        "--dataloader_prefetch",
-        action="store_true",
-        default=False,
-        help=(
-            "When provided, the dataloader will read-ahead and attempt to retrieve latents, text embeds, and other metadata"
-            " ahead of the time when the batch is required, so that it can be immediately available."
-        ),
-    )
-    parser.add_argument(
-        "--dataloader_prefetch_qlen",
-        type=int,
-        default=10,
-        help=("Set the number of prefetched batches."),
-    )
-    parser.add_argument(
-        "--aspect_bucket_worker_count",
-        type=int,
-        default=12,
-        help=(
-            "The number of workers to use for aspect bucketing. This is a CPU-bound task, so the number of workers"
-            " should be set to the number of CPU threads available. If you use an I/O bound backend, an even higher"
-            " value may make sense. Default: 12."
-        ),
-    )
-    parser.add_argument(
-        "--cache_dir",
-        type=str,
-        default=None,
-        help="The directory where the downloaded models and datasets will be stored.",
-    )
-    parser.add_argument(
-        "--cache_clear_validation_prompts",
-        action="store_true",
-        help=(
-            "When provided, any validation prompt entries in the text embed cache will be recreated."
-            " This is useful if you've modified any of the existing prompts, or, disabled/enabled Compel,"
-            " via `--disable_compel`"
-        ),
-    )
-    parser.add_argument(
-        "--caption_strategy",
-        type=str,
-        default="filename",
-        choices=["filename", "textfile", "instance_prompt", "parquet"],
-        help=(
-            "The default captioning strategy, 'filename', will use the filename as the caption, after stripping some characters like underscores."
-            " The 'textfile' strategy will use the contents of a text file with the same name as the image."
-            " The 'parquet' strategy requires a parquet file with the same name as the image, containing a 'caption' column."
-        ),
-    )
-    parser.add_argument(
-        "--parquet_caption_column",
-        type=str,
-        default=None,
-        help=(
-            "When using caption_strategy=parquet, this option will allow you to globally set the default caption field across all datasets"
-            " that do not have an override set."
-        ),
-    )
-    parser.add_argument(
-        "--parquet_filename_column",
-        type=str,
-        default=None,
-        help=(
-            "When using caption_strategy=parquet, this option will allow you to globally set the default filename field across all datasets"
-            " that do not have an override set."
-        ),
-    )
-    parser.add_argument(
-        "--instance_prompt",
-        type=str,
-        default=None,
-        required=False,
-        help="This is unused. Filenames will be the captions instead.",
-    )
-    parser.add_argument(
-        "--output_dir",
-        type=str,
-        default="simpletuner-results",
-        help="The output directory where the model predictions and checkpoints will be written.",
-    )
-    parser.add_argument(
-        "--seed", type=int, default=None, help="A seed for reproducible training."
-    )
-    parser.add_argument(
-        "--seed_for_each_device",
-        type=bool,
-        default=True,
-        help=(
-            "By default, a unique seed will be used for each GPU."
-            " This is done deterministically, so that each GPU will receive the same seed across invocations."
-            " If --seed_for_each_device=false is provided, then we will use the same seed across all GPUs,"
-            " which will almost certainly result in the over-sampling of inputs on larger datasets."
-        ),
-    )
-    parser.add_argument(
-        "--resolution",
-        type=float,
-        default=1024,
-        help=(
-            "The resolution for input images, all the images in the train/validation dataset will be resized to this"
-            " resolution. If using --resolution_type=area, this float value represents megapixels."
-        ),
-    )
-    parser.add_argument(
-        "--resolution_type",
-        type=str,
-        default="pixel_area",
-        choices=["pixel", "area", "pixel_area"],
-        help=(
-            "Resizing images maintains aspect ratio. This defines the resizing strategy."
-            " If 'pixel', the images will be resized to the resolution by the shortest pixel edge, if the target size does not match the current size."
-            " If 'area', the images will be resized so the pixel area is this many megapixels. Common rounded values such as `0.5` and `1.0` will be implicitly adjusted to their squared size equivalents."
-            " If 'pixel_area', the pixel value (eg. 1024) will be converted to the proper value for 'area', and then calculate everything the same as 'area' would."
-        ),
-    )
-    parser.add_argument(
-        "--aspect_bucket_rounding",
-        type=int,
-        default=None,
-        choices=range(1, 10),
-        help=(
-            "The number of decimal places to round the aspect ratio to. This is used to create buckets for aspect ratios."
-            " For higher precision, ensure the image sizes remain compatible. Higher precision levels result in a"
-            " greater number of buckets, which may not be a desirable outcome."
-        ),
-    )
-    parser.add_argument(
-        "--aspect_bucket_alignment",
-        type=int,
-        choices=[8, 64],
-        default=64,
-        help=(
-            "When training diffusion models, the image sizes generally must align to a 64 pixel interval."
-            " This is an exception when training models like DeepFloyd that use a base resolution of 64 pixels,"
-            " as aligning to 64 pixels would result in a 1:1 or 2:1 aspect ratio, overly distorting images."
-            " For DeepFloyd, this value is set to 8, but all other training defaults to 64. You may experiment"
-            " with this value, but it is not recommended."
-        ),
-    )
-    parser.add_argument(
-        "--minimum_image_size",
-        type=float,
-        default=None,
-        help=(
-            "The minimum resolution for both sides of input images."
-            " If --delete_unwanted_images is set, images smaller than this will be DELETED."
-            " The default value is None, which means no minimum resolution is enforced."
-            " If this option is not provided, it is possible that images will be destructively upsampled, harming model performance."
-        ),
-    )
-    parser.add_argument(
-        "--maximum_image_size",
-        type=float,
-        default=None,
-        help=(
-            "When cropping images that are excessively large, the entire scene context may be lost, eg. the crop might just"
-            " end up being a portion of the background. To avoid this, a maximum image size may be provided, which will"
-            " result in very-large images being downsampled before cropping them. This value uses --resolution_type to determine"
-            " whether it is a pixel edge or megapixel value."
-        ),
-    )
-    parser.add_argument(
-        "--target_downsample_size",
-        type=float,
-        default=None,
-        help=(
-            "When using --maximum_image_size, very-large images exceeding that value will be downsampled to this target"
-            " size before cropping. If --resolution_type=area and --maximum_image_size=4.0, --target_downsample_size=2.0"
-            " would result in a 4 megapixel image being resized to 2 megapixel before cropping to 1 megapixel."
-        ),
-    )
-    parser.add_argument(
-        "--train_text_encoder",
-        action="store_true",
-        help="(SD 2.x only) Whether to train the text encoder. If set, the text encoder should be float32 precision.",
-    )
-    # DeepFloyd
-    parser.add_argument(
-        "--tokenizer_max_length",
-        type=int,
-        default=None,
-        required=False,
-        help="The maximum length of the tokenizer. If not set, will default to the tokenizer's max length.",
-    )
-    # End DeepFloyd-specific settings
-    parser.add_argument(
-        "--train_batch_size",
-        type=int,
-        default=4,
-        help="Batch size (per device) for the training dataloader.",
-    )
-    parser.add_argument("--num_train_epochs", type=int, default=1)
-    parser.add_argument(
-        "--max_train_steps",
-        type=int,
-        default=None,
-        help="Total number of training steps to perform.  If provided, overrides num_train_epochs.",
-    )
-    parser.add_argument(
-        "--checkpointing_steps",
-        type=int,
-        default=500,
-        help=(
-            "Save a checkpoint of the training state every X updates. Checkpoints can be used for resuming training via `--resume_from_checkpoint`. "
-            "In the case that the checkpoint is better than the final trained model, the checkpoint can also be used for inference."
-            "Using a checkpoint for inference requires separate loading of the original pipeline and the individual checkpointed model components."
-            "See https://huggingface.co/docs/diffusers/main/en/training/dreambooth#performing-inference-using-a-saved-checkpoint for step by step"
-            "instructions."
-        ),
-    )
-    parser.add_argument(
-        "--checkpoints_total_limit",
-        type=int,
-        default=None,
-        help="Max number of checkpoints to store.",
-    )
-    parser.add_argument(
-        "--resume_from_checkpoint",
-        type=str,
-        default=None,
-        help=(
-            "Whether training should be resumed from a previous checkpoint. Use a path saved by"
-            ' `--checkpointing_steps`, or `"latest"` to automatically select the last available checkpoint.'
-        ),
-    )
-    parser.add_argument(
-        "--gradient_accumulation_steps",
-        type=int,
-        default=1,
-        help="Number of updates steps to accumulate before performing a backward/update pass.",
-    )
-    parser.add_argument(
-        "--gradient_checkpointing",
-        action="store_true",
-        help="Whether or not to use gradient checkpointing to save memory at the expense of slower backward pass.",
-    )
-    parser.add_argument(
-        "--learning_rate",
-        type=float,
-        default=4e-7,
-        help=(
-            "Initial learning rate (after the potential warmup period) to use."
-            " When using a cosine or sine schedule, --learning_rate defines the maximum learning rate."
-        ),
-    )
-    parser.add_argument(
-        "--text_encoder_lr",
-        type=float,
-        default=None,
-        help="Learning rate for the text encoder. If not provided, the value of --learning_rate will be used.",
-    )
-    parser.add_argument(
-        "--lr_scale",
-        action="store_true",
-        default=False,
-        help="Scale the learning rate by the number of GPUs, gradient accumulation steps, and batch size.",
-    )
-    parser.add_argument(
-        "--lr_scheduler",
-        type=str,
-        default="sine",
-        choices=[
-            "linear",
-            "sine",
-            "cosine",
-            "cosine_with_restarts",
-            "polynomial",
-            "constant",
-            "constant_with_warmup",
-        ],
-        help=("The scheduler type to use. Default: sine"),
-    )
-    parser.add_argument(
-        "--lr_warmup_steps",
-        type=int,
-        default=500,
-        help="Number of steps for the warmup in the lr scheduler.",
-    )
-    parser.add_argument(
-        "--lr_num_cycles",
-        type=int,
-        default=1,
-        help="Number of hard resets of the lr in cosine_with_restarts scheduler.",
-    )
-    parser.add_argument(
-        "--lr_power",
-        type=float,
-        default=0.8,
-        help="Power factor of the polynomial scheduler.",
-    )
-    parser.add_argument(
-        "--use_ema",
-        action="store_true",
-        help="Whether to use EMA (exponential moving average) model.",
-    )
-    parser.add_argument(
-        "--ema_device",
-        choices=["cpu", "accelerator"],
-        default="cpu",
-        help=(
-            "The device to use for the EMA model. If set to 'accelerator', the EMA model will be placed on the accelerator."
-            " This provides the fastest EMA update times, but is not ultimately necessary for EMA to function."
-        ),
-    )
-    parser.add_argument(
-        "--ema_cpu_only",
-        action="store_true",
-        default=False,
-        help=(
-            "When using EMA, the shadow model is moved to the accelerator before we update its parameters."
-            " When provided, this option will disable the moving of the EMA model to the accelerator."
-            " This will save a lot of VRAM at the cost of a lot of time for updates. It is recommended to also supply"
-            " --ema_update_interval to reduce the number of updates to eg. every 100 steps."
-        ),
-    )
-    parser.add_argument(
-        "--ema_foreach_disable",
-        action="store_true",
-        default=True,
-        help=(
-            "By default, we use torch._foreach functions for updating the shadow parameters, which should be fast."
-            " When provided, this option will disable the foreach methods and use vanilla EMA updates."
-        ),
-    )
-    parser.add_argument(
-        "--ema_update_interval",
-        type=int,
-        default=None,
-        help=(
-            "The number of optimization steps between EMA updates. If not provided, EMA network will update on every step."
-        ),
-    )
-    parser.add_argument(
-        "--ema_decay",
-        type=float,
-        default=0.995,
-        help=(
-            "The closer to 0.9999 this gets, the less updates will occur over time. Setting it to a lower value, such as 0.990,"
-            " will allow greater influence of later updates."
-        ),
-    )
-    parser.add_argument(
-        "--non_ema_revision",
-        type=str,
-        default=None,
-        required=False,
-        help=(
-            "Revision of pretrained non-ema model identifier. Must be a branch, tag or git identifier of the local or"
-            " remote repository specified with --pretrained_model_name_or_path."
-        ),
-    )
-    parser.add_argument(
-        "--offload_param_path",
-        type=str,
-        default=None,
-        help=(
-            "When using DeepSpeed ZeRo stage 2 or 3 with NVMe offload, this may be specified to provide a path for the offload."
-        ),
-    )
-    parser.add_argument(
-        "--optimizer",
-        type=str,
-        choices=optimizer_choices.keys(),
-        required=True,
-        default=None,
-    )
-    parser.add_argument(
-        "--optimizer_config",
-        type=str,
-        default=None,
-        help=(
-            "When setting a given optimizer, this allows a comma-separated list of key-value pairs to be provided that will override the optimizer defaults."
-            " For example, `--optimizer_config=decouple_lr=True,weight_decay=0.01`."
-        ),
-    )
-    parser.add_argument(
-        "--optimizer_cpu_offload_method",
-        choices=["none"],  # , "torchao"],
-        default="none",
-        help=(
-            "This option is a placeholder. In the future, it will allow for the selection of different CPU offload methods."
-        ),
-    )
-    parser.add_argument(
-        "--optimizer_offload_gradients",
-        action="store_true",
-        default=False,
-        help=(
-            "When creating a CPU-offloaded optimiser, the gradients can be offloaded to the CPU to save more memory."
-        ),
-    )
-    parser.add_argument(
-        "--fuse_optimizer",
-        action="store_true",
-        default=False,
-        help=(
-            "When creating a CPU-offloaded optimiser, the fused optimiser could be used to save on memory, while running slightly slower."
-        ),
-    )
-    parser.add_argument(
-        "--optimizer_beta1",
-        type=float,
-        default=None,
-        help="The value to use for the first beta value in the optimiser, which is used for the first moment estimate. A range of 0.8-0.9 is common.",
-    )
-    parser.add_argument(
-        "--optimizer_beta2",
-        type=float,
-        default=None,
-        help="The value to use for the second beta value in the optimiser, which is used for the second moment estimate. A range of 0.999-0.9999 is common.",
-    )
-    parser.add_argument(
-        "--optimizer_release_gradients",
-        action="store_true",
-        help=(
-            "When using Optimi optimizers, this option will release the gradients after the optimizer step."
-            " This can save memory, but may slow down training. With Quanto, there may be no benefit."
-        ),
-    )
-    parser.add_argument(
-        "--adam_beta1",
-        type=float,
-        default=0.9,
-        help="The beta1 parameter for the Adam and other optimizers.",
-    )
-    parser.add_argument(
-        "--adam_beta2",
-        type=float,
-        default=0.999,
-        help="The beta2 parameter for the Adam and other optimizers.",
-    )
-    parser.add_argument(
-        "--adam_weight_decay", type=float, default=1e-2, help="Weight decay to use."
-    )
-    parser.add_argument(
-        "--adam_epsilon",
-        type=float,
-        default=1e-08,
-        help="Epsilon value for the Adam optimizer",
-    )
-    parser.add_argument(
-        "--max_grad_norm",
-        default=2.0,
-        type=float,
-        help=(
-            "Clipping the max gradient norm can help prevent exploding gradients, but"
-            " may also harm training by introducing artifacts or making it hard to train artifacts away."
-        ),
-    )
-    parser.add_argument(
-        "--push_to_hub",
-        action="store_true",
-        help="Whether or not to push the model to the Hub.",
-    )
-    parser.add_argument(
-        "--push_checkpoints_to_hub",
-        action="store_true",
-        help=(
-            "When set along with --push_to_hub, all intermediary checkpoints will be pushed to the hub as if they were a final checkpoint."
-        ),
-    )
-    parser.add_argument(
-        "--hub_model_id",
-        type=str,
-        default=None,
-        help="The name of the repository to keep in sync with the local `output_dir`.",
-    )
-    parser.add_argument(
-        "--model_card_note",
-        type=str,
-        default=None,
-        help=(
-            "Add a string to the top of your model card to provide users with some additional context."
-        ),
-    )
-    parser.add_argument(
-        "--model_card_safe_for_work",
-        action="store_true",
-        default=False,
-        help=(
-            "Hugging Face Hub requires a warning to be added to models that may generate NSFW content."
-            " This is done by default in SimpleTuner for safety purposes, but can be disabled with this option."
-            " Additionally, removing the not-for-all-audiences tag from the README.md in the repo will also disable this warning"
-            " on previously-uploaded models."
-        ),
-    )
-    parser.add_argument(
-        "--logging_dir",
-        type=str,
-        default="logs",
-        help=(
-            "[TensorBoard](https://www.tensorflow.org/tensorboard) log directory. Will default to"
-            " *output_dir/runs/**CURRENT_DATETIME_HOSTNAME***."
-        ),
-    )
-    parser.add_argument(
-        "--benchmark_base_model",
-        action="store_true",
-        default=False,
-        help=(
-            "Deprecated option, benchmarks are now enabled by default. Use --disable_benchmark to disable."
-        ),
-    )
-    parser.add_argument(
-        "--disable_benchmark",
-        action="store_true",
-        default=False,
-        help=(
-            "By default, the model will be benchmarked on the first batch of the first epoch."
-            " This can be disabled with this option."
-        ),
-    )
-    parser.add_argument(
-        "--validation_on_startup",
-        action="store_true",
-        default=False,
-        help=(
-            "When training begins, the starting model will have validation prompts run through it, for later comparison."
-        ),
-    )
-    parser.add_argument(
-        "--validation_seed_source",
-        type=str,
-        default="cpu",
-        choices=["gpu", "cpu"],
-        help=(
-            "Some systems may benefit from using CPU-based seeds for reproducibility. On other systems, this may cause a TypeError."
-            " Setting this option to 'cpu' may cause validation errors. If so, please set SIMPLETUNER_LOG_LEVEL=DEBUG"
-            " and submit debug.log to a new Github issue report."
-        ),
-    )
-    parser.add_argument(
-        "--validation_torch_compile",
-        action="store_true",
-        default=False,
-        help=(
-            "Supply `--validation_torch_compile=true` to enable the use of torch.compile() on the validation pipeline."
-            " For some setups, torch.compile() may error out. This is dependent on PyTorch version, phase of the moon,"
-            " but if it works, you should leave it enabled for a great speed-up."
-        ),
-    )
-    parser.add_argument(
-        "--validation_torch_compile_mode",
-        type=str,
-        default="max-autotune",
-        choices=["max-autotune", "reduce-overhead", "default"],
-        help=(
-            "PyTorch provides different modes for the Torch Inductor when compiling graphs. max-autotune,"
-            " the default mode, provides the most benefit."
-        ),
-    )
-    parser.add_argument(
-        "--allow_tf32",
-        action="store_true",
-        help=(
-            "Deprecated option. TF32 is now enabled by default. Use --disable_tf32 to disable."
-        ),
-    )
-    parser.add_argument(
-        "--disable_tf32",
-        action="store_true",
-        help=(
-            "Previous defaults were to disable TF32 on Ampere GPUs. This option is provided to explicitly disable TF32,"
-            " after default configuration was updated to enable TF32 on Ampere GPUs."
-        ),
-    )
-    parser.add_argument(
-        "--validation_using_datasets",
-        action="store_true",
-        default=None,
-        help=(
-            "When set, validation will use images sampled randomly from each dataset for validation."
-            " Be mindful of privacy issues when publishing training data to the internet."
-        ),
-    )
-    parser.add_argument(
-        "--webhook_config",
-        type=str,
-        default=None,
-        help=(
-            "The path to the webhook configuration file. This file should be a JSON file with the following format:"
-            ' {"url": "https://your.webhook.url", "webhook_type": "discord"}}'
-        ),
-    )
-    parser.add_argument(
-        "--webhook_reporting_interval",
-        type=int,
-        default=None,
-        help=(
-            "When using 'raw' webhooks that receive structured data, you can specify a reporting interval here for"
-            " training progress updates to be sent at. This does not impact 'discord' webhook types."
-        ),
-    )
-    parser.add_argument(
-        "--report_to",
-        type=str,
-        default="wandb",
-        help=(
-            'The integration to report the results and logs to. Supported platforms are `"tensorboard"`'
-            ' (default), `"wandb"` and `"comet_ml"`. Use `"all"` to report to all integrations,'
-            ' or `"none"` to disable logging.'
-        ),
-    )
-    parser.add_argument(
-        "--tracker_run_name",
-        type=str,
-        default="simpletuner-testing",
-        help="The name of the run to track with the tracker.",
-    )
-    parser.add_argument(
-        "--tracker_project_name",
-        type=str,
-        default="simpletuner",
-        help="The name of the project for WandB or Tensorboard.",
-    )
-    parser.add_argument(
-        "--tracker_image_layout",
-        choices=["gallery", "table"],
-        default="gallery",
-        help=(
-            "When running validations with multiple images, you may want them all placed together in a table, row-wise."
-            " Gallery mode, the default, will allow use of a slider to view the historical images easily."
-        ),
-    )
-    parser.add_argument(
-        "--validation_prompt",
-        type=str,
-        default=None,
-        help="A prompt that is used during validation to verify that the model is learning.",
-    )
-    parser.add_argument(
-        "--validation_prompt_library",
-        action="store_true",
-        help="If this is provided, the SimpleTuner prompt library will be used to generate multiple images.",
-    )
-    parser.add_argument(
-        "--user_prompt_library",
-        type=str,
-        default=None,
-        help="This should be a path to the JSON file containing your prompt library. See user_prompt_library.json.example.",
-    )
-    parser.add_argument(
-        "--validation_negative_prompt",
-        type=str,
-        default="blurry, cropped, ugly",
-        help=(
-            "When validating images, a negative prompt may be used to guide the model away from certain features."
-            " When this value is set to --validation_negative_prompt='', no negative guidance will be applied."
-            " Default: blurry, cropped, ugly"
-        ),
-    )
-    parser.add_argument(
-        "--num_validation_images",
-        type=int,
-        default=1,
-        help="Number of images that should be generated during validation with `validation_prompt`.",
-    )
-    parser.add_argument(
-        "--validation_steps",
-        type=int,
-        default=100,
-        help=(
-            "Run validation every X steps. Validation consists of running the prompt"
-            " `args.validation_prompt` multiple times: `args.num_validation_images`"
-            " and logging the images."
-        ),
-    )
-    parser.add_argument(
-        "--num_eval_images",
-        type=int,
-        default=4,
-        help=(
-            "If possible, this many eval images will be selected from each dataset."
-            " This is used when training super-resolution models such as DeepFloyd Stage II,"
-            " which will upscale input images from the training set."
-        ),
-    )
-    parser.add_argument(
-        "--eval_dataset_id",
-        type=str,
-        default=None,
-        help=(
-            "When provided, only this dataset's images will be used as the eval set, to keep"
-            " the training and eval images split."
-        ),
-    )
-    parser.add_argument(
-        "--validation_num_inference_steps",
-        type=int,
-        default=30,
-        help=(
-            "The default scheduler, DDIM, benefits from more steps. UniPC can do well with just 10-15."
-            " For more speed during validations, reduce this value. For better quality, increase it."
-            " For model distilation, you will likely want to keep this low."
-        ),
-    )
-    parser.add_argument(
-        "--validation_resolution",
-        type=str,
-        default=256,
-        help="Square resolution images will be output at this resolution (256x256).",
-    )
-    parser.add_argument(
-        "--validation_noise_scheduler",
-        type=str,
-        choices=["ddim", "ddpm", "euler", "euler-a", "unipc"],
-        default=None,
-        help=(
-            "When validating the model at inference time, a different scheduler may be chosen."
-            " UniPC can offer better speed, and Euler A can put up with instabilities a bit better."
-            " For zero-terminal SNR models, DDIM is the best choice. Choices: ['ddim', 'ddpm', 'euler', 'euler-a', 'unipc'],"
-            " Default: None (use the model default)"
-        ),
-    )
-    parser.add_argument(
-        "--validation_disable_unconditional",
-        action="store_true",
-        help=(
-            "When set, the validation pipeline will not generate unconditional samples."
-            " This is useful to speed up validations with a single prompt on slower systems, or if you are not"
-            " interested in unconditional space generations."
-        ),
-    )
-    parser.add_argument(
-        "--enable_watermark",
-        default=False,
-        action="store_true",
-        help=(
-            "The SDXL 0.9 and 1.0 licenses both require a watermark be used to identify any images created to be shared."
-            " Since the images created during validation typically are not shared, and we want the most accurate results,"
-            " this watermarker is disabled by default. If you are sharing the validation images, it is up to you"
-            " to ensure that you are complying with the license, whether that is through this watermarker, or another."
-        ),
-    )
-    parser.add_argument(
-        "--mixed_precision",
-        type=str,
-        default="bf16",
-        choices=["bf16", "no"],
-        help=(
-            "SimpleTuner only supports bf16 training. Bf16 requires PyTorch >="
-            " 1.10. on an Nvidia Ampere or later GPU, and PyTorch 2.3 or newer for Apple Silicon."
-            " Default to the value of accelerate config of the current system or the"
-            " flag passed with the `accelerate.launch` command. Use this argument to override the accelerate config."
-        ),
-    )
-    parser.add_argument(
-        "--gradient_precision",
-        type=str,
-        choices=["unmodified", "fp32"],
-        default=None,
-        help=(
-            "One of the hallmark discoveries of the Llama 3.1 paper is numeric instability when calculating"
-            " gradients in bf16 precision. The default behaviour when gradient accumulation steps are enabled"
-            " is now to use fp32 gradients, which is slower, but provides more accurate updates."
-        ),
-    )
-    parser.add_argument(
-        "--quantize_via",
-        type=str,
-        choices=["cpu", "accelerator"],
-        default="accelerator",
-        help=(
-            "When quantising the model, the quantisation process can be done on the CPU or the accelerator."
-            " When done on the accelerator (default), slightly more VRAM is required, but the process completes in milliseconds."
-            " When done on the CPU, the process may take upwards of 60 seconds, but can complete without OOM on 16G cards."
-        ),
-    )
-    parser.add_argument(
-        "--base_model_precision",
-        type=str,
-        default="no_change",
-        choices=quantised_precision_levels,
-        help=(
-            "When training a LoRA, you might want to quantise the base model to a lower precision to save more VRAM."
-            " The default value, 'no_change', does not quantise any weights."
-            " Using 'fp4-bnb' or 'fp8-bnb' will require Bits n Bytes for quantisation (NVIDIA, maybe AMD)."
-            " Using 'fp8-quanto' will require Quanto for quantisation (Apple Silicon, NVIDIA, AMD)."
-        ),
-    )
-    parser.add_argument(
-        "--quantize_activations",
-        action="store_true",
-        help=(
-            "(EXPERIMENTAL) This option is currently unsupported, and exists solely for development purposes."
-        ),
-    )
-    parser.add_argument(
-        "--base_model_default_dtype",
-        type=str,
-        default="bf16",
-        choices=["bf16", "fp32"],
-        help=(
-            "Unlike --mixed_precision, this value applies specifically for the default weights of your quantised base model."
-            " When quantised, not every parameter can or should be quantised down to the target precision."
-            " By default, we use bf16 weights for the base model - but this can be changed to fp32 to enable"
-            " the use of other optimizers than adamw_bf16. However, this uses marginally more memory,"
-            " and may not be necessary for your use case."
-        ),
-    )
-    for i in range(1, 4):
-        parser.add_argument(
-            f"--text_encoder_{i}_precision",
-            type=str,
-            default="no_change",
-            choices=quantised_precision_levels,
-            help=(
-                f"When training a LoRA, you might want to quantise text encoder {i} to a lower precision to save more VRAM."
-                " The default value is to follow base_model_precision (no_change)."
-                " Using 'fp4-bnb' or 'fp8-bnb' will require Bits n Bytes for quantisation (NVIDIA, maybe AMD)."
-                " Using 'fp8-quanto' will require Quanto for quantisation (Apple Silicon, NVIDIA, AMD)."
-            ),
-        )
-    parser.add_argument(
-        "--local_rank",
-        type=int,
-        default=-1,
-        help="For distributed training: local_rank",
-    )
-    parser.add_argument(
-        "--enable_xformers_memory_efficient_attention",
-        action="store_true",
-        help="Whether or not to use xformers.",
-    )
-    parser.add_argument(
-        "--set_grads_to_none",
-        action="store_true",
-        help=(
-            "Save more memory by using setting grads to None instead of zero. Be aware, that this changes certain"
-            " behaviors, so disable this argument if it causes any problems. More info:"
-            " https://pytorch.org/docs/stable/generated/torch.optim.Optimizer.zero_grad.html"
-        ),
-    )
-    parser.add_argument(
-        "--noise_offset",
-        type=float,
-        default=0.1,
-        help="The scale of noise offset. Default: 0.1",
-    )
-    parser.add_argument(
-        "--noise_offset_probability",
-        type=float,
-        default=0.25,
-        help=(
-            "When training with --offset_noise, the value of --noise_offset will only be applied probabilistically."
-            " The default behaviour is for offset noise (if enabled) to be applied 25 percent of the time."
-        ),
-    )
-    parser.add_argument(
-        "--validation_guidance",
-        type=float,
-        default=7.5,
-        help="CFG value for validation images. Default: 7.5",
-    )
-    parser.add_argument(
-        "--validation_guidance_real",
-        type=float,
-        default=1.0,
-        help="Use real CFG sampling for Flux validation images. Default: 1.0 (no CFG)",
-    )
-    parser.add_argument(
-        "--validation_no_cfg_until_timestep",
-        type=int,
-        default=2,
-        help="When using real CFG sampling for Flux validation images, skip doing CFG on these timesteps. Default: 2",
-    )
-    parser.add_argument(
-        "--validation_guidance_rescale",
-        type=float,
-        default=0.0,
-        help="CFG rescale value for validation images. Default: 0.0, max 1.0",
-    )
-    parser.add_argument(
-        "--validation_randomize",
-        action="store_true",
-        default=False,
-        help="If supplied, validations will be random, ignoring any seeds.",
-    )
-    parser.add_argument(
-        "--validation_seed",
-        type=int,
-        default=None,
-        help=(
-            "If not supplied, the value for --seed will be used."
-            " If neither those nor --validation_randomize are supplied, a seed of zero is used."
-        ),
-    )
-    parser.add_argument(
-        "--fully_unload_text_encoder",
-        action="store_true",
-        help=(
-            "If set, will fully unload the text_encoder from memory when not in use."
-            " This currently has the side effect of crashing validations, but it is useful"
-            " for initiating VAE caching on GPUs that would otherwise be too small."
-        ),
-    )
-    parser.add_argument(
-        "--freeze_encoder_before",
-        type=int,
-        default=12,
-        help="When using 'before' strategy, we will freeze layers earlier than this.",
-    )
-    parser.add_argument(
-        "--freeze_encoder_after",
-        type=int,
-        default=17,
-        help="When using 'after' strategy, we will freeze layers later than this.",
-    )
-    parser.add_argument(
-        "--freeze_encoder_strategy",
-        type=str,
-        default="after",
-        help=(
-            "When freezing the text_encoder, we can use the 'before', 'between', or 'after' strategy."
-            " The 'between' strategy will freeze layers between those two values, leaving the outer layers unfrozen."
-            " The default strategy is to freeze all layers from 17 up."
-            " This can be helpful when fine-tuning Stable Diffusion 2.1 on a new style."
-        ),
-    )
-    parser.add_argument(
-        "--layer_freeze_strategy",
-        type=str,
-        choices=["none", "bitfit"],
-        default="none",
-        help=(
-            "When freezing parameters, we can use the 'none' or 'bitfit' strategy."
-            " The 'bitfit' strategy will freeze all weights, and leave bias in a trainable state."
-            " The default strategy is to leave all parameters in a trainable state."
-            " Freezing the weights can improve convergence for finetuning."
-            " Using bitfit only moderately reduces VRAM consumption, but substantially reduces the count of trainable parameters."
-        ),
-    )
-    parser.add_argument(
-        "--unet_attention_slice",
-        action="store_true",
-        default=False,
-        help=(
-            "If set, will use attention slicing for the SDXL UNet. This is an experimental feature and is not recommended for general use."
-            " SD 2.x makes use of attention slicing on Apple MPS platform to avoid a NDArray size crash, but SDXL does not"
-            " seem to require attention slicing on MPS. If memory constrained, try enabling it anyway."
-        ),
-    )
-    parser.add_argument(
-        "--print_filenames",
-        action="store_true",
-        help=(
-            "If any image files are stopping the process eg. due to corruption or truncation, this will help identify which is at fault."
-        ),
-    )
-    parser.add_argument(
-        "--print_sampler_statistics",
-        action="store_true",
-        help=(
-            "If provided, will print statistics about the dataset sampler. This is useful for debugging."
-            " The default behaviour is to not print sampler statistics."
-        ),
-    )
-    parser.add_argument(
-        "--metadata_update_interval",
-        type=int,
-        default=3600,
-        help=(
-            "When generating the aspect bucket indicies, we want to save it every X seconds."
-            " The default is to save it every 1 hour, such that progress is not lost on clusters"
-            " where runtime is limited to 6-hour increments (e.g. the JUWELS Supercomputer)."
-            " The minimum value is 60 seconds."
-        ),
-    )
-    parser.add_argument(
-        "--debug_aspect_buckets",
-        action="store_true",
-        help="If set, will print excessive debugging for aspect bucket operations.",
-    )
-    parser.add_argument(
-        "--debug_dataset_loader",
-        action="store_true",
-        help="If set, will print excessive debugging for data loader operations.",
-    )
-    parser.add_argument(
-        "--freeze_encoder",
-        type=bool,
-        default=True,
-        help="Whether or not to freeze the text_encoder. The default is true.",
-    )
-    parser.add_argument(
-        "--save_text_encoder",
-        action="store_true",
-        default=False,
-        help=(
-            "If set, will save the text_encoder after training."
-            " This is useful if you're using --push_to_hub so that the final pipeline contains all necessary components to run."
-        ),
-    )
-    parser.add_argument(
-        "--text_encoder_limit",
-        type=int,
-        default=25,
-        help=(
-            "When training the text_encoder, we want to limit how long it trains for to avoid catastrophic loss."
-        ),
-    )
-    parser.add_argument(
-        "--prepend_instance_prompt",
-        action="store_true",
-        help=(
-            "When determining the captions from the filename, prepend the instance prompt as an enforced keyword."
-        ),
-    )
-    parser.add_argument(
-        "--only_instance_prompt",
-        action="store_true",
-        help="Use the instance prompt instead of the caption from filename.",
-    )
-    parser.add_argument(
-        "--data_aesthetic_score",
-        type=float,
-        default=7.0,
-        help=(
-            "Since currently we do not calculate aesthetic scores for data, we will statically set it to one value. This is only used by the SDXL Refiner."
-        ),
-    )
-    parser.add_argument(
-        "--sdxl_refiner_uses_full_range",
-        action="store_true",
-        default=False,
-        help=(
-            "If set, the SDXL Refiner will use the full range of the model, rather than the design value of 20 percent."
-            " This is useful for training models that will be used for inference from end-to-end of the noise schedule."
-            " You may use this for example, to turn the SDXL refiner into a full text-to-image model."
-        ),
-    )
-    parser.add_argument(
-        "--caption_dropout_probability",
-        type=float,
-        default=None,
-        help=(
-            "Caption dropout will randomly drop captions and, for SDXL, size conditioning inputs based on this probability."
-            " When set to a value of 0.1, it will drop approximately 10 percent of the inputs."
-            " Maximum recommended value is probably less than 0.5, or 50 percent of the inputs. Maximum technical value is 1.0."
-            " The default is to use zero caption dropout, though for better generalisation, a value of 0.1 is recommended."
-        ),
-    )
-    parser.add_argument(
-        "--delete_unwanted_images",
-        action="store_true",
-        help=(
-            "If set, will delete images that are not of a minimum size to save on disk space for large training runs."
-            " Default behaviour: Unset, remove images from bucket only."
-        ),
-    )
-    parser.add_argument(
-        "--delete_problematic_images",
-        action="store_true",
-        help=(
-            "If set, any images that error out during load will be removed from the underlying storage medium."
-            " This is useful to prevent repeatedly attempting to cache bad files on a cloud bucket."
-        ),
-    )
-    parser.add_argument(
-        "--disable_bucket_pruning",
-        action="store_true",
-        help=(
-            "When training on very small datasets, you might not care that the batch sizes will outpace your image count."
-            " Setting this option will prevent SimpleTuner from deleting your bucket lists that do not meet"
-            " the minimum image count requirements. Use at your own risk, it may end up throwing off your statistics or epoch tracking."
-        ),
-    )
-    parser.add_argument(
-        "--offset_noise",
-        action="store_true",
-        default=False,
-        help=(
-            "Fine-tuning against a modified noise"
-            " See: https://www.crosslabs.org//blog/diffusion-with-offset-noise for more information."
-        ),
-    )
-    parser.add_argument(
-        "--input_perturbation",
-        type=float,
-        default=0.0,
-        help=(
-            "Add additional noise only to the inputs fed to the model during training."
-            " This will make the training converge faster. A value of 0.1 is suggested if you want to enable this."
-            " Input perturbation seems to also work with flow-matching (e.g. SD3 and Flux)."
-        ),
-    )
-    parser.add_argument(
-        "--input_perturbation_steps",
-        type=float,
-        default=0,
-        help=(
-            "Only apply input perturbation over the first N steps with linear decay."
-            " This should prevent artifacts from showing up in longer training runs."
-        ),
-    )
-    parser.add_argument(
-        "--lr_end",
-        type=str,
-        default="4e-7",
-        help=(
-            "A polynomial learning rate will end up at this value after the specified number of warmup steps."
-            " A sine or cosine wave will use this value as its lower bound for the learning rate."
-        ),
-    )
-    parser.add_argument(
-        "--i_know_what_i_am_doing",
-        action="store_true",
-        help=(
-            "This flag allows you to override some safety checks."
-            " It's not recommended to use this unless you are developing the platform."
-            " Generally speaking, issue reports submitted with this flag enabled will go to the bottom of the queue."
-        ),
-    )
-    parser.add_argument(
-        "--accelerator_cache_clear_interval",
-        default=None,
-        type=int,
-        help=(
-            "Clear the cache from VRAM every X steps. This can help prevent memory leaks, but may slow down training."
-        ),
-    )
-
-    return parser
-
-
-def get_default_config():
-    parser = get_argument_parser()
-    default_config = {}
-    for action in parser.__dict__["_actions"]:
-        if action.dest:
-            default_config[action.dest] = action.default
-
-    return default_config
-
-
-def parse_cmdline_args(input_args=None):
-    parser = get_argument_parser()
-    if input_args is not None:
-        for key_val in input_args:
-            print_on_main_thread(f"{key_val}")
-        try:
-            args = parser.parse_args(input_args)
-        except:
-            logger.error(f"Could not parse input: {input_args}")
-            import traceback
-
-            logger.error(traceback.format_exc())
-    else:
-        args = parser.parse_args()
-
-    if args.optimizer == "adam_bfloat16" and args.mixed_precision != "bf16":
-        if not torch.backends.mps.is_available():
-            logging.error(
-                "You cannot use --adam_bfloat16 without --mixed_precision=bf16."
-            )
-            sys.exit(1)
-
-    env_local_rank = int(os.environ.get("LOCAL_RANK", -1))
-    if env_local_rank != -1 and env_local_rank != args.local_rank:
-        args.local_rank = env_local_rank
-
-    if args.seed is not None:
-        if args.seed == 0:
-            # the current time should be used if value is zero, providing a rolling seed.
-            args.seed = int(time.time())
-        elif args.seed == -1:
-            # more random seed if value is -1, it will be very different on each startup.
-            args.seed = int(random.randint(0, 2**30))
-
-    # default to using the same revision for the non-ema model if not specified
-    if args.non_ema_revision is None:
-        args.non_ema_revision = args.revision
-
-    if args.cache_dir is None or args.cache_dir == "":
-        args.cache_dir = os.path.join(args.output_dir, "cache")
-
-    if args.maximum_image_size is not None and not args.target_downsample_size:
-        raise ValueError(
-            "When providing --maximum_image_size, you must also provide a value for --target_downsample_size."
-        )
-    if (
-        args.maximum_image_size is not None
-        and args.resolution_type == "area"
-        and args.maximum_image_size > 5
-        and not os.environ.get("SIMPLETUNER_MAXIMUM_IMAGE_SIZE_OVERRIDE", False)
-    ):
-        raise ValueError(
-            f"When using --resolution_type=area, --maximum_image_size must be less than 5 megapixels. You may have accidentally entered {args.maximum_image_size} pixels, instead of megapixels."
-        )
-    elif (
-        args.maximum_image_size is not None
-        and args.resolution_type == "pixel"
-        and args.maximum_image_size < 512
-    ):
-        raise ValueError(
-            f"When using --resolution_type=pixel, --maximum_image_size must be at least 512 pixels. You may have accidentally entered {args.maximum_image_size} megapixels, instead of pixels."
-        )
-    if (
-        args.target_downsample_size is not None
-        and args.resolution_type == "area"
-        and args.target_downsample_size > 5
-        and not os.environ.get("SIMPLETUNER_MAXIMUM_IMAGE_SIZE_OVERRIDE", False)
-    ):
-        raise ValueError(
-            f"When using --resolution_type=area, --target_downsample_size must be less than 5 megapixels. You may have accidentally entered {args.target_downsample_size} pixels, instead of megapixels."
-        )
-    elif (
-        args.target_downsample_size is not None
-        and args.resolution_type == "pixel"
-        and args.target_downsample_size < 512
-    ):
-        raise ValueError(
-            f"When using --resolution_type=pixel, --target_downsample_size must be at least 512 pixels. You may have accidentally entered {args.target_downsample_size} megapixels, instead of pixels."
-        )
-
-    model_is_bf16 = (
-        args.base_model_precision == "no_change"
-        and (args.mixed_precision == "bf16" or torch.backends.mps.is_available())
-    ) or (
-        args.base_model_precision != "no_change"
-        and args.base_model_default_dtype == "bf16"
-    )
-    model_is_quantized = args.base_model_precision != "no_change"
-    # check optimiser validity
-    chosen_optimizer = args.optimizer
-    is_optimizer_deprecated(chosen_optimizer)
-    from videotuna.third_party.flux.training.optimizer_param import optimizer_parameters
-
-    optimizer_cls, optimizer_details = optimizer_parameters(chosen_optimizer, args)
-    using_bf16_optimizer = optimizer_details.get("default_settings", {}).get(
-        "precision"
-    ) in ["any", "bf16"]
-    if using_bf16_optimizer and not model_is_bf16:
-        raise ValueError(
-            f"Model is not using bf16 precision, but the optimizer {chosen_optimizer} requires it."
-        )
-    if is_optimizer_grad_fp32(args.optimizer):
-        warning_log(
-            "Using an optimizer that requires fp32 gradients. Training will potentially run more slowly."
-        )
-        if args.gradient_precision != "fp32":
-            args.gradient_precision = "fp32"
-    else:
-        if args.gradient_precision == "fp32":
-            args.gradient_precision = "unmodified"
-
-    if torch.backends.mps.is_available():
-        if (
-            args.model_family.lower() not in ["sd3", "flux", "legacy"]
-            and not args.unet_attention_slice
-        ):
-            warning_log(
-                "MPS may benefit from the use of --unet_attention_slice for memory savings at the cost of speed."
-            )
-        if args.model_family != "smoldit" and args.train_batch_size > 16:
-            error_log(
-                "An M3 Max 128G will use 12 seconds per step at a batch size of 1 and 65 seconds per step at a batch size of 12."
-                " Any higher values will result in NDArray size errors or other unstable training results and crashes."
-                "\nPlease reduce the batch size to 12 or lower."
-            )
-            sys.exit(1)
-
-        if args.quantize_via == "accelerator":
-            error_log(
-                "MPS does not benefit from models being quantized on the accelerator device. Overriding --quantize_via to 'cpu'."
-            )
-            args.quantize_via = "cpu"
-
-    if (
-        args.max_train_steps is not None
-        and args.max_train_steps > 0
-        and args.num_train_epochs > 0
-    ):
-        error_log(
-            "When using --max_train_steps (MAX_NUM_STEPS), you must set --num_train_epochs (NUM_EPOCHS) to 0."
-        )
-        sys.exit(1)
-
-    if (
-        args.pretrained_vae_model_name_or_path is not None
-        and args.model_family in ["legacy", "flux", "sd3"]
-        and "sdxl" in args.pretrained_vae_model_name_or_path
-        and "deepfloyd" not in args.model_type
-    ):
-        warning_log(
-            f"The VAE model {args.pretrained_vae_model_name_or_path} is not compatible. Please use a compatible VAE to eliminate this warning. The baked-in VAE will be used, instead."
-        )
-        args.pretrained_vae_model_name_or_path = None
-    if (
-        args.pretrained_vae_model_name_or_path == ""
-        or args.pretrained_vae_model_name_or_path == "''"
-    ):
-        args.pretrained_vae_model_name_or_path = None
-
-    if "deepfloyd" not in args.model_type:
-        info_log(
-            f"VAE Model: {args.pretrained_vae_model_name_or_path or args.pretrained_model_name_or_path}"
-        )
-        info_log(f"Default VAE Cache location: {args.cache_dir_vae}")
-        info_log(f"Text Cache location: {args.cache_dir_text}")
-    if args.model_family == "sd3":
-        warning_log(
-            "MM-DiT requires an alignment value of 64px. Overriding the value of --aspect_bucket_alignment."
-        )
-        args.aspect_bucket_alignment = 64
-        if args.sd3_t5_uncond_behaviour is None:
-            args.sd3_t5_uncond_behaviour = args.sd3_clip_uncond_behaviour
-        info_log(
-            f"SD3 embeds for unconditional captions: t5={args.sd3_t5_uncond_behaviour}, clip={args.sd3_clip_uncond_behaviour}"
-        )
-
-    elif "deepfloyd" in args.model_type:
-        deepfloyd_pixel_alignment = 8
-        if args.aspect_bucket_alignment != deepfloyd_pixel_alignment:
-            warning_log(
-                f"Overriding aspect bucket alignment pixel interval to {deepfloyd_pixel_alignment}px instead of {args.aspect_bucket_alignment}px."
-            )
-            args.aspect_bucket_alignment = deepfloyd_pixel_alignment
-
-    if "deepfloyd-stage2" in args.model_type and args.resolution < 256:
-        warning_log(
-            "DeepFloyd Stage II requires a resolution of at least 256. Setting to 256."
-        )
-        args.resolution = 256
-        args.aspect_bucket_alignment = 64
-        args.resolution_type = "pixel"
-
-    validation_resolution_is_float = False
-    if "." in str(args.validation_resolution):
-        try:
-            # this makes handling for int() conversion easier later.
-            args.validation_resolution = float(args.validation_resolution)
-            validation_resolution_is_float = True
-        except ValueError:
-            pass
-    validation_resolution_is_digit = False
-    try:
-        int(args.validation_resolution)
-        validation_resolution_is_digit = True
-    except ValueError:
-        pass
-
-    if (
-        (validation_resolution_is_digit or validation_resolution_is_float)
-        and int(args.validation_resolution) < 128
-        and "deepfloyd" not in args.model_type
-    ):
-        # Convert from megapixels to pixels:
-        log_msg = f"It seems that --validation_resolution was given in megapixels ({args.validation_resolution}). Converting to pixel measurement:"
-        if int(args.validation_resolution) == 1:
-            args.validation_resolution = 1024
-        else:
-            args.validation_resolution = int(int(args.validation_resolution) * 1e3)
-            # Make it divisible by 8:
-            args.validation_resolution = int(int(args.validation_resolution) / 8) * 8
-        info_log(f"{log_msg} {int(args.validation_resolution)}px")
-    if args.timestep_bias_portion < 0.0 or args.timestep_bias_portion > 1.0:
-        raise ValueError("Timestep bias portion must be between 0.0 and 1.0.")
-
-    if args.controlnet and "lora" in args.model_type:
-        raise ValueError("ControlNet is not supported for LoRA models.")
-
-    if args.metadata_update_interval < 60:
-        raise ValueError("Metadata update interval must be at least 60 seconds.")
-
-    if args.model_family == "sd3":
-        args.pretrained_vae_model_name_or_path = None
-        args.disable_compel = True
-
-    t5_max_length = 256
-    if args.model_family == "sd3" and (
-        args.tokenizer_max_length is None
-        or int(args.tokenizer_max_length) > t5_max_length
-    ):
-        if not args.i_know_what_i_am_doing:
-            warning_log(
-                f"Updating T5 XXL tokeniser max length to {t5_max_length} for SD3."
-            )
-            args.tokenizer_max_length = t5_max_length
-        else:
-            warning_log(
-                f"-!- SD3 supports a max length of {t5_max_length} tokens, but you have supplied `--i_know_what_i_am_doing`, so this limit will not be enforced. -!-"
-            )
-            warning_log(
-                f"The model will begin to collapse after a short period of time, if the model you are continuing from has not been tuned beyond {t5_max_length} tokens."
-            )
-    flux_version = "dev"
-    model_max_seq_length = 512
-    if (
-        "schnell" in args.pretrained_model_name_or_path.lower()
-        or args.flux_fast_schedule
-    ):
-        if not args.flux_fast_schedule and not args.i_know_what_i_am_doing:
-            error_log(
-                "Schnell requires --flux_fast_schedule (or --i_know_what_i_am_doing)."
-            )
-            sys.exit(1)
-        flux_version = "schnell"
-        model_max_seq_length = 256
-
-    if args.model_family == "flux":
-        if (
-            args.tokenizer_max_length is None
-            or int(args.tokenizer_max_length) > model_max_seq_length
-        ):
-            if not args.i_know_what_i_am_doing:
-                warning_log(
-                    f"Updating T5 XXL tokeniser max length to {model_max_seq_length} for Flux."
-                )
-                args.tokenizer_max_length = model_max_seq_length
-            else:
-                warning_log(
-                    f"-!- Flux supports a max length of {model_max_seq_length} tokens, but you have supplied `--i_know_what_i_am_doing`, so this limit will not be enforced. -!-"
-                )
-                warning_log(
-                    f"The model will begin to collapse after a short period of time, if the model you are continuing from has not been tuned beyond 256 tokens."
-                )
-        if flux_version == "dev":
-            if args.validation_num_inference_steps > 28:
-                warning_log(
-                    "Flux Dev expects around 28 or fewer inference steps. Consider limiting --validation_num_inference_steps to 28."
-                )
-            if args.validation_num_inference_steps < 15:
-                warning_log(
-                    "Flux Dev expects around 15 or more inference steps. Consider increasing --validation_num_inference_steps to 15."
-                )
-        if flux_version == "schnell" and args.validation_num_inference_steps > 4:
-            warning_log(
-                "Flux Schnell requires fewer inference steps. Consider reducing --validation_num_inference_steps to 4."
-            )
-
-        if args.flux_guidance_mode == "mobius":
-            warning_log(
-                "Mobius training is only for the most elite. Pardon my English, but this is not for those who don't like to destroy something beautiful every now and then. If you feel perhaps this is not for you, please consider using a different guidance mode."
-            )
-            if args.flux_guidance_min < 1.0:
-                warning_log(
-                    "Flux minimum guidance value for Mobius training is 1.0. Updating value.."
-                )
-                args.flux_guidance_min = 1.0
-
-    if args.use_ema and args.ema_cpu_only:
-        args.ema_device = "cpu"
-
-    if (args.optimizer_beta1 is not None and args.optimizer_beta2 is None) or (
-        args.optimizer_beta1 is None and args.optimizer_beta2 is not None
-    ):
-        error_log("Both --optimizer_beta1 and --optimizer_beta2 should be provided.")
-        sys.exit(1)
-
-    if not args.i_know_what_i_am_doing:
-        if args.model_family == "pixart_sigma" or args.model_family == "sd3":
-            if args.max_grad_norm is None or float(args.max_grad_norm) > 0.01:
-                warning_log(
-                    f"{'PixArt Sigma' if args.model_family == 'pixart_sigma' else 'Stable Diffusion 3'} requires --max_grad_norm=0.01 to prevent model collapse. Overriding value. Set this value manually to disable this warning."
-                )
-                args.max_grad_norm = 0.01
-    if args.gradient_checkpointing:
-        # enable torch compile w/ activation checkpointing :[ slows us down.
-        torch._dynamo.config.optimize_ddp = False
-    if args.gradient_accumulation_steps > 1:
-        if args.gradient_precision == "unmodified" or args.gradient_precision is None:
-            warning_log(
-                "Gradient accumulation steps are enabled, but gradient precision is set to 'unmodified'."
-                " This may lead to numeric instability. Consider disabling gradient accumulation steps. Continuing in 10 seconds.."
-            )
-            time.sleep(10)
-        elif args.gradient_precision == "fp32":
-            info_log(
-                "Gradient accumulation steps are enabled, and gradient precision is set to 'fp32'."
-            )
-            args.gradient_precision = "fp32"
-
-    if args.use_ema:
-        if args.model_family == "sd3":
-            raise ValueError(
-                "Using EMA is not currently supported for Stable Diffusion 3 training."
-            )
-        if "lora" in args.model_type:
-            raise ValueError("Using EMA is not currently supported for LoRA training.")
-    args.logging_dir = os.path.join(args.output_dir, args.logging_dir)
-    args.accelerator_project_config = ProjectConfiguration(
-        project_dir=args.output_dir, logging_dir=args.logging_dir
-    )
-    # Create the custom configuration
-    args.process_group_kwargs = InitProcessGroupKwargs(
-        timeout=timedelta(seconds=5400)
-    )  # 1.5 hours
-
-    # Enable TF32 for faster training on Ampere GPUs,
-    # cf https://pytorch.org/docs/stable/notes/cuda.html#tensorfloat-32-tf32-on-ampere-devices
-    if torch.cuda.is_available():
-        torch.backends.cuda.matmul.allow_tf32 = True
-        torch.backends.cudnn.allow_tf32 = True
-        if args.disable_tf32:
-            warning_log(
-                "--disable_tf32 is provided, not enabling. Training will potentially be much slower."
-            )
-            torch.backends.cuda.matmul.allow_tf32 = False
-            torch.backends.cudnn.allow_tf32 = False
-        else:
-            info_log(
-                "Enabled NVIDIA TF32 for faster training on Ampere GPUs. Use --disable_tf32 if this causes any problems."
-            )
-
-    args.is_quantized = (
-        False
-        if (args.base_model_precision == "no_change" or "lora" not in args.model_type)
-        else True
-    )
-    args.weight_dtype = (
-        torch.bfloat16
-        if (
-            (args.mixed_precision == "bf16" or torch.backends.mps.is_available())
-            or (args.base_model_default_dtype == "bf16" and args.is_quantized)
-        )
-        else torch.float32
-    )
-    args.disable_accelerator = os.environ.get("SIMPLETUNER_DISABLE_ACCELERATOR", False)
-
-    if "lycoris" == args.lora_type.lower():
-        from lycoris import create_lycoris
-
-        if args.lycoris_config is None:
-            raise ValueError(
-                "--lora_type=lycoris requires you to add a JSON "
-                + "configuration file location with --lycoris_config"
-            )
-        # is it readable?
-        if not os.path.isfile(args.lycoris_config) or not os.access(
-            args.lycoris_config, os.R_OK
-        ):
-            raise ValueError(
-                f"Could not find the JSON configuration file at {args.lycoris_config}"
-            )
-        import json
-
-        with open(args.lycoris_config, "r") as f:
-            lycoris_config = json.load(f)
-        assert "algo" in lycoris_config, "lycoris_config JSON must contain algo key"
-        assert (
-            "multiplier" in lycoris_config
-        ), "lycoris_config JSON must contain multiplier key"
-        assert (
-            "linear_dim" in lycoris_config
-        ), "lycoris_config JSON must contain linear_dim key"
-        assert (
-            "linear_alpha" in lycoris_config
-        ), "lycoris_config JSON must contain linear_alpha key"
-
-    elif "standard" == args.lora_type.lower():
-        if hasattr(args, "lora_init_type") and args.lora_init_type is not None:
-            if torch.backends.mps.is_available() and args.lora_init_type == "loftq":
-                logger.error(
-                    "Apple MPS cannot make use of LoftQ initialisation. Overriding to 'default'."
-                )
-            elif args.is_quantized and args.lora_init_type == "loftq":
-                logger.error(
-                    "LoftQ initialisation is not supported with quantised models. Overriding to 'default'."
-                )
-            else:
-                args.lora_initialisation_style = (
-                    args.lora_init_type if args.lora_init_type != "default" else True
-                )
-        if args.use_dora:
-            if "quanto" in args.base_model_precision:
-                logger.error(
-                    "Quanto does not yet support DoRA training in PEFT. Disabling DoRA. 😴"
-                )
-                args.use_dora = False
-            else:
-                warning_log(
-                    "DoRA support is experimental and not very thoroughly tested."
-                )
-                args.lora_initialisation_style = "default"
-
-    if not args.data_backend_config:
-        from videotuna.third_party.flux.training.state_tracker import StateTracker
-
-        args.data_backend_config = os.path.join(
-            StateTracker.get_config_path(), "multidatabackend.json"
-        )
-        warning_log(
-            f"No data backend config provided. Using default config at {args.data_backend_config}."
-        )
-
-    # Check if we have a valid gradient accumulation steps.
-    if args.gradient_accumulation_steps < 1:
-        raise ValueError(
-            f"Invalid gradient_accumulation_steps parameter: {args.gradient_accumulation_steps}, should be >= 1"
-        )
-
-    return args
diff --git a/videotuna/third_party/flux/configuration/configure.py b/videotuna/third_party/flux/configuration/configure.py
deleted file mode 100644
index 09640e27..00000000
--- a/videotuna/third_party/flux/configuration/configure.py
+++ /dev/null
@@ -1,905 +0,0 @@
-import os
-
-import huggingface_hub
-import torch
-
-from videotuna.third_party.flux.training import (
-    lycoris_defaults,
-    quantised_precision_levels,
-)
-from videotuna.third_party.flux.training.optimizer_param import optimizer_choices
-
-bf16_only_optims = [
-    key
-    for key, value in optimizer_choices.items()
-    if value.get("precision", "any") == "bf16"
-]
-any_precision_optims = [
-    key
-    for key, value in optimizer_choices.items()
-    if value.get("precision", "any") == "any"
-]
-model_classes = {
-    "full": [
-        "flux",
-        "sdxl",
-        "pixart_sigma",
-        "kolors",
-        "sd3",
-        "legacy",
-    ],
-    "lora": ["flux", "sdxl", "kolors", "sd3", "legacy"],
-    "controlnet": ["sdxl", "legacy"],
-}
-
-default_models = {
-    "flux": "black-forest-labs/FLUX.1-dev",
-    "sdxl": "stabilityai/stable-diffusion-xl-base-1.0",
-    "pixart_sigma": "PixArt-alpha/PixArt-Sigma-XL-2-1024-MS",
-    "kolors": "kwai-kolors/kolors-diffusers",
-    "terminus": "ptx0/terminus-xl-velocity-v2",
-    "sd3": "stabilityai/stable-diffusion-3.5-large",
-    "legacy": "stabilityai/stable-diffusion-2-1-base",
-}
-
-default_cfg = {
-    "flux": 3.0,
-    "sdxl": 4.2,
-    "pixart_sigma": 3.4,
-    "kolors": 5.0,
-    "terminus": 8.0,
-    "sd3": 5.0,
-}
-
-model_labels = {
-    "sd3": "Stable Diffusion 3",
-    "flux": "FLUX",
-    "pixart_sigma": "PixArt Sigma",
-    "kolors": "Kwai Kolors",
-    "terminus": "Terminus",
-    "sdxl": "Stable Diffusion XL",
-    "legacy": "Stable Diffusion",
-}
-
-lora_ranks = [1, 16, 64, 128, 256]
-learning_rates_by_rank = {
-    1: "3e-4",
-    16: "1e-4",
-    64: "8e-5",
-    128: "6e-5",
-    256: "5.09e-5",
-}
-
-
-def print_config(env_contents: dict, extra_args: list):
-    # env_contents["TRAINER_EXTRA_ARGS"] = " ".join(extra_args)
-    # output = json.dumps(env_contents, indent=4)
-    # print(output)
-    pass
-
-
-def prompt_user(prompt, default=None):
-    if default:
-        prompt = f"{prompt} (default: {default})"
-    user_input = input(f"{prompt}: ")
-    return user_input.strip() or default
-
-
-def configure_lycoris():
-    print("Let's configure your LyCORIS model!\n")
-
-    print("Select a LyCORIS algorithm:\n")
-
-    print(
-        "1. LoRA - Efficient, balanced fine-tuning. Good for general tasks. (algo=lora)"
-    )
-    print(
-        "2. LoHa - Advanced, strong dampening. Ideal for multi-concept fine-tuning. (algo=loha)"
-    )
-    print(
-        "3. LoKr - Kronecker product-based. Use for complex transformations. (algo=lokr)"
-    )
-    print("4. Full Fine-Tuning - Traditional full model tuning. (algo=full)")
-    print("5. IA^3 - Efficient, tiny files, best for styles. (algo=ia3)")
-    print("6. DyLoRA - Dynamic updates, efficient with large dims. (algo=dylora)")
-    print("7. Diag-OFT - Fast convergence with orthogonal fine-tuning. (algo=diag-oft)")
-    print("8. BOFT - Advanced version of Diag-OFT with more flexibility. (algo=boft)")
-    print("9. GLoRA - Generalized LoRA. (algo=glora)\n")
-
-    # Prompt user to select an algorithm
-    algo = prompt_user(
-        f"Which LyCORIS algorithm would you like to use? (Enter the number corresponding to the algorithm)",
-        "3",  # Default to LoKr
-    )
-
-    # Map the selected number to the actual algorithm name
-    algo_map = {
-        "1": "lora",
-        "2": "loha",
-        "3": "lokr",
-        "4": "full",
-        "5": "ia3",
-        "6": "dylora",
-        "7": "diag-oft",
-        "8": "boft",
-        "9": "glora",
-    }
-
-    algo = algo_map.get(algo, "lokr").lower()
-
-    # Get the default configuration for the selected algorithm
-    default_config = lycoris_defaults.get(algo, {}).copy()
-
-    # Continue with further configuration
-    print(f"\nConfiguring {algo.upper()} algorithm...\n")
-
-    multiplier = float(
-        prompt_user(
-            f"Set the effect multiplier. Adjust for stronger or subtler effects. "
-            f"(default: {default_config.get('multiplier', 1.0)})",
-            default_config.get("multiplier", 1.0),
-        )
-    )
-
-    linear_dim = int(
-        prompt_user(
-            f"Set the linear dimension. Higher values mean more capacity but use more resources. "
-            f"(default: {default_config.get('linear_dim', 1000000)})",
-            default_config.get("linear_dim", 1000000),
-        )
-    )
-
-    linear_alpha = int(
-        prompt_user(
-            f"Set the alpha scaling factor. Controls the impact on the model. "
-            f"(default: {default_config.get('linear_alpha', 1)})",
-            default_config.get("linear_alpha", 1),
-        )
-    )
-
-    # Update basic parameters in config
-    default_config.update(
-        {
-            "multiplier": multiplier,
-            "linear_dim": linear_dim,
-            "linear_alpha": linear_alpha,
-        }
-    )
-
-    # Conditional prompts based on the selected algorithm
-    if algo == "lokr":
-        factor = int(
-            prompt_user(
-                f"Set the factor for compression/expansion. "
-                f"(default: {default_config.get('factor', 16)})",
-                default_config.get("factor", 16),
-            )
-        )
-        default_config.update({"factor": factor})
-
-        if linear_dim >= 10000:  # Handle full-dimension case
-            print("Full-dimension mode activated. Alpha will be set to 1.")
-            default_config["linear_alpha"] = 1
-
-    elif algo == "loha":
-        if linear_dim > 32:
-            print("Warning: High dim values with LoHa may cause instability.")
-        # Additional LoHa-specific configurations can be added here if needed
-
-    elif algo == "dylora":
-        block_size = int(
-            prompt_user(
-                f"Set block size for DyLoRA (rows/columns updated per step). "
-                f"(default: {default_config.get('block_size', 0)})",
-                default_config.get("block_size", 0),
-            )
-        )
-        default_config.update({"block_size": block_size})
-
-    elif algo in ["diag-oft", "boft"]:
-        constraint = (
-            prompt_user(
-                f"Enforce constraints (e.g., orthogonality)? "
-                f"(True/False, default: {default_config.get('constraint', False)})",
-                str(default_config.get("constraint", False)),
-            ).lower()
-            == "true"
-        )
-
-        rescaled = (
-            prompt_user(
-                f"Rescale transformations? Adjusts model impact. "
-                f"(True/False, default: {default_config.get('rescaled', False)})",
-                str(default_config.get("rescaled", False)),
-            ).lower()
-            == "true"
-        )
-
-        default_config.update(
-            {
-                "constraint": constraint,
-                "rescaled": rescaled,
-            }
-        )
-
-    # Handle presets for specific modules
-    if "apply_preset" in default_config:
-        print("\nNext, configure the modules to target with this algorithm.")
-        target_module = prompt_user(
-            f"Which modules should the {algo.upper()} algorithm be applied to? "
-            f"(default: {', '.join(default_config['apply_preset']['target_module'])})",
-            ", ".join(default_config["apply_preset"]["target_module"]),
-        ).split(",")
-        default_config["apply_preset"]["target_module"] = [
-            m.strip() for m in target_module
-        ]
-
-        for module_name, module_config in default_config["apply_preset"][
-            "module_algo_map"
-        ].items():
-            for param, value in module_config.items():
-                user_value = prompt_user(
-                    f"Set {param} for {module_name}. " f"(default: {value})", value
-                )
-                module_config[param] = (
-                    int(user_value) if isinstance(value, int) else float(user_value)
-                )
-
-    print("\nLyCORIS configuration complete: ", default_config)
-    return default_config
-
-
-def configure_env():
-    print("Welcome to SimpleTuner!")
-    print("This script will guide you through setting up your config.json file.\n")
-    env_contents = {
-        "--resume_from_checkpoint": "latest",
-        "--data_backend_config": "configs/006_flux/multidatabackend.json",
-        "--aspect_bucket_rounding": 2,
-        "--seed": 42,
-        "--minimum_image_size": 0,
-        "--disable_benchmark": False,
-    }
-    extra_args = []
-
-    output_dir = prompt_user(
-        "Enter the directory where you want to store your outputs", "output/models"
-    )
-    while not os.path.exists(output_dir):
-        should_create = (
-            prompt_user(
-                "That directory did not exist. Should I create it? Answer 'n' to select a new location. ([y]/n)",
-                "y",
-            )
-            == "y"
-        )
-        if should_create:
-            os.makedirs(output_dir, exist_ok=True)
-        else:
-            print(
-                f"Directory {output_dir} does not exist. Please create it and try again."
-            )
-            output_dir = prompt_user(
-                "Enter the directory where you want to store your outputs",
-                "output/models",
-            )
-    env_contents["--output_dir"] = output_dir
-
-    # Start with the basic options
-    model_type = prompt_user(
-        "What type of model are you training? (Options: [lora], full)", "lora"
-    ).lower()
-    use_lycoris = False
-    use_lora = False
-    if model_type == "lora":
-        use_lora = True
-        use_lycoris = (
-            prompt_user("Would you like to train a LyCORIS model? ([y]/n)", "y").lower()
-            == "y"
-        )
-        if use_lycoris:
-            env_contents["--lora_type"] = "lycoris"
-            lycoris_config = configure_lycoris()
-            env_contents["--lycoris_config"] = "configs/006_flux/lycoris_config.json"
-            # write json to file
-            import json
-
-            # approximate the rank of the lycoris
-            lora_rank = 16
-            with open(
-                "configs/006_flux/lycoris_config.json", "w", encoding="utf-8"
-            ) as f:
-                f.write(json.dumps(lycoris_config, indent=4))
-        else:
-            env_contents["--lora_type"] = "standard"
-            use_dora = prompt_user(
-                "Would you like to train a DoRA model? (y/[n])", "n"
-            ).lower()
-            if use_dora == "y":
-                env_contents["--use_dora"] = "true"
-            lora_rank = None
-            while lora_rank not in lora_ranks:
-                if lora_rank is not None:
-                    print(f"Invalid LoRA rank: {lora_rank}")
-                lora_rank = int(
-                    prompt_user(
-                        f"Set the LoRA rank (Options: {', '.join([str(x) for x in lora_ranks])})",
-                        "64",
-                    )
-                )
-            env_contents["--lora_rank"] = lora_rank
-    elif model_type == "full":
-        use_ema = prompt_user(
-            "Would you like to use EMA for training? (y/[n])", "n"
-        ).lower()
-        if use_ema == "y":
-            env_contents["--use_ema"] = "true"
-
-    print("We'll try and login to Hugging Face Hub..")
-    whoami = None
-    try:
-        whoami = huggingface_hub.whoami()
-    except:
-        pass
-    should_retry = True
-    while not whoami and should_retry:
-        should_retry = (
-            prompt_user(
-                "You are not currently logged into Hugging Face Hub. Would you like to login? (y/n)",
-                "y",
-            ).lower()
-            == "y"
-        )
-        if not should_retry:
-            whoami = None
-            print("Will not be logged into Hugging Face Hub.")
-            break
-        huggingface_hub.login()
-        whoami = huggingface_hub.whoami()
-
-    finishing_count_type = prompt_user(
-        "Should we schedule the end of training by epochs, or steps?", "steps"
-    ).lower()
-    while finishing_count_type not in ["steps", "epochs"]:
-        print(f"Invalid finishing count type: {finishing_count_type}")
-        finishing_count_type = prompt_user(
-            "Should we schedule the end of training by epochs, or steps?", "steps"
-        ).lower()
-    default_checkpointing_interval = 500
-    if finishing_count_type == "steps":
-        env_contents["--max_train_steps"] = int(
-            prompt_user("Set the maximum number of steps", 10000)
-        )
-        if env_contents["--max_train_steps"] < default_checkpointing_interval:
-            # reduce the default checkpointing interval offered to the user so that they get a reasonable value.
-            default_checkpointing_interval = env_contents["--max_train_steps"] // 10
-        env_contents["--num_train_epochs"] = 0
-    else:
-        env_contents["--num_train_epochs"] = prompt_user(
-            "Set the maximum number of epochs", 100
-        )
-        env_contents["--max_train_steps"] = 0
-
-    checkpointing_interval = prompt_user(
-        "Set the checkpointing interval (in steps)", default_checkpointing_interval
-    )
-    env_contents["--checkpointing_steps"] = int(checkpointing_interval)
-    checkpointing_limit = prompt_user(
-        "How many checkpoints do you want to keep? LoRA are small, and you can keep more than a full finetune.",
-        5,
-    )
-    env_contents["--checkpoints_total_limit"] = int(checkpointing_limit)
-    if whoami is not None:
-        print("Connected to Hugging Face Hub as:", whoami["name"])
-        should_push_to_hub = (
-            prompt_user(
-                "Do you want to push your model to Hugging Face Hub when it is completed uploading? (y/n)",
-                "y",
-            ).lower()
-            == "y"
-        )
-        if should_push_to_hub:
-            env_contents["--hub_model_id"] = prompt_user(
-                f"What do you want the name of your Hugging Face Hub model to be? This will be accessible as https://huggingface.co/{whoami['name']}/your-model-name-here",
-                f"simpletuner-{model_type}",
-            )
-            should_push_checkpoints = False
-            env_contents["--push_to_hub"] = "true"
-            should_push_checkpoints = (
-                prompt_user(
-                    "Do you want to push intermediary checkpoints to Hugging Face Hub? ([y]/n)",
-                    "y",
-                ).lower()
-                == "y"
-            )
-            if should_push_checkpoints:
-                env_contents["--push_checkpoints_to_hub"] = "true"
-            model_card_safe_for_work = (
-                prompt_user(
-                    "Is your target model considered safe-for-work? Answering yes here will remove the NSFW warning from the Hugging Face Hub model card. If you are unsure, please leave this as 'no'. (y/[n])",
-                    "n",
-                ).lower()
-                == "y"
-            )
-            if model_card_safe_for_work:
-                env_contents["--model_card_safe_for_work"] = "true"
-    report_to_wandb = (
-        prompt_user(
-            "Would you like to report training statistics to Weights & Biases? ([y]/n)",
-            "y",
-        ).lower()
-        == "y"
-    )
-    report_to_tensorboard = (
-        prompt_user(
-            "Would you like to report training statistics to TensorBoard? (y/[n])", "n"
-        ).lower()
-        == "y"
-    )
-    report_to_str = ""
-    if report_to_wandb or report_to_tensorboard:
-        tracker_project_name = prompt_user(
-            "Enter the name of your Weights & Biases project", f"{model_type}-training"
-        )
-        env_contents["--tracker_project_name"] = tracker_project_name
-        tracker_run_name = prompt_user(
-            "Enter the name of your Weights & Biases runs. This can use shell commands, which can be used to dynamically set the run name.",
-            f"simpletuner-{model_type}",
-        )
-        env_contents["--tracker_run_name"] = tracker_run_name
-        report_to_str = None
-        if report_to_wandb:
-            report_to_str = "wandb"
-        if report_to_tensorboard:
-            if report_to_wandb:
-                report_to_str += ","
-            else:
-                report_to_str = ""
-            report_to_str += "tensorboard"
-        if report_to_str:
-            env_contents["--report_to"] = report_to_str
-
-    print_config(env_contents, extra_args)
-
-    model_class = None
-    while model_class not in model_classes[model_type]:
-        if model_class is not None:
-            print(f"Invalid model class: {model_class}")
-        model_class = prompt_user(
-            f"Which model family are you training? ({'/'.join(model_classes[model_type])})",
-            "flux",
-        ).lower()
-
-    can_load_model = False
-    model_name = None
-    while not can_load_model:
-        if model_name is not None:
-            print(
-                "For some reason, we can not load that model. Can you check your Hugging Face login and try again?"
-            )
-        model_name = prompt_user(
-            "Enter the model name from Hugging Face Hub", default_models[model_class]
-        )
-        try:
-            model_info = huggingface_hub.model_info(model_name)
-            if hasattr(model_info, "id"):
-                can_load_model = True
-        except:
-            continue
-    env_contents["--model_type"] = model_type
-    env_contents["--pretrained_model_name_or_path"] = model_name
-    env_contents["--model_family"] = model_class.lower()
-    # Flux-specific options
-    if "FLUX" in env_contents and env_contents["--model_family"] == "flux":
-        if env_contents["--model_type"].lower() == "lora" and not use_lycoris:
-            flux_targets = [
-                "mmdit",
-                "context",
-                "all",
-                "all+ffs",
-                "ai-toolkit",
-                "tiny",
-                "nano",
-            ]
-            flux_target_layers = None
-            while flux_target_layers not in flux_targets:
-                if flux_target_layers:
-                    print(f"Invalid Flux target layers: {flux_target_layers}")
-                flux_target_layers = prompt_user(
-                    f"Set Flux target layers (Options: {'/'.join(flux_targets)})",
-                    "all",
-                )
-            env_contents["--flux_lora_target"] = flux_target_layers
-
-    print_config(env_contents, extra_args)
-
-    # Additional settings
-    env_contents["--train_batch_size"] = int(
-        prompt_user(
-            "Set the training batch size. Larger values will require larger datasets, more VRAM, and slow things down.",
-            1,
-        )
-    )
-    env_contents["--gradient_checkpointing"] = "true"
-
-    env_contents["--caption_dropout_probability"] = float(
-        prompt_user(
-            "Set the caption dropout rate, or use 0.0 to disable it. Dropout is not recommended for LoRA/LyCORIS training unless you are training for style transfer.",
-            "0.0" if any([use_lora, use_lycoris]) else "0.1",
-        )
-    )
-
-    resolution_types = ["pixel", "area", "pixel_area"]
-    env_contents["--resolution_type"] = None
-    while env_contents["--resolution_type"] not in resolution_types:
-        if env_contents["--resolution_type"]:
-            print(f"Invalid resolution type: {env_contents['--resolution_type']}")
-        env_contents["--resolution_type"] = prompt_user(
-            "How do you want to measure dataset resolutions? 'pixel' will size images with the shorter edge, 'area' will measure in megapixels, and is great for aspect-bucketing. 'pixel_area' is a combination of these two ideas, which lets you set your area using pixels instead of megapixels.",
-            "pixel_area",
-        ).lower()
-    if (
-        env_contents["--resolution_type"] == "pixel"
-        or env_contents["--resolution_type"] == "pixel_area"
-    ):
-        default_resolution = 1024
-        resolution_unit = "pixel"
-    else:
-        default_resolution = 1.0
-        resolution_unit = "megapixel"
-    env_contents["--resolution"] = prompt_user(
-        f"What would you like the default resolution of your datasets to be? The default for is {env_contents['--resolution_type']} is {default_resolution} {resolution_unit}s.",
-        default_resolution,
-    )
-
-    # remove spaces from validation resolution, ensure it's a single WxH or a comma-separated list of WxH
-    env_contents["--validation_seed"] = prompt_user("Set the seed for validation", 42)
-    env_contents["--validation_steps"] = prompt_user(
-        "How many steps in between validation outputs?",
-        env_contents["--checkpointing_steps"],
-    )
-    env_contents["--validation_resolution"] = None
-    while (
-        env_contents["--validation_resolution"] is None
-        or "x" not in env_contents["--validation_resolution"]
-    ):
-        if env_contents["--validation_resolution"] is not None:
-            print(
-                "Invalid resolution format. Please enter a single resolution, or a comma-separated list. Example: 1024x1024,1280x768"
-            )
-        env_contents["--validation_resolution"] = prompt_user(
-            "Set the validation resolution. Format could be a single resolution, or comma-separated.",
-            "1024x1024",
-        )
-        env_contents["--validation_resolution"] = ",".join(
-            [x.strip() for x in env_contents["--validation_resolution"].split(",")]
-        )
-    env_contents["--validation_guidance"] = prompt_user(
-        "Set the guidance scale for validation", default_cfg.get(model_class, 3.0)
-    )
-    env_contents["--validation_guidance_rescale"] = prompt_user(
-        "Set the guidance re-scale for validation - this is called dynamic thresholding and is used mostly for zero-terminal SNR models.",
-        "0.0",
-    )
-    env_contents["--validation_num_inference_steps"] = prompt_user(
-        "Set the number of inference steps for validation", "20"
-    )
-    env_contents["--validation_prompt"] = prompt_user(
-        "Set the validation prompt", "A photo-realistic image of a cat"
-    )
-    print_config(env_contents, extra_args)
-
-    # Advanced options
-    if torch.cuda.is_available():
-        use_tf32 = (
-            prompt_user("Would you like to enable TF32 mode? ([y]/n)", "y").lower()
-            == "y"
-        )
-        if not use_tf32:
-            env_contents["--disable_tf32"] = "true"
-    mixed_precision_options = ["bf16", "no"]
-    env_contents["--mixed_precision"] = None
-    while (
-        not env_contents["--mixed_precision"]
-        or env_contents["--mixed_precision"] not in mixed_precision_options
-    ):
-        if env_contents["--mixed_precision"]:
-            print(
-                f"Invalid mixed precision option: {env_contents['--mixed_precision']}"
-            )
-        env_contents["--mixed_precision"] = prompt_user(
-            "Set mixed precision mode (Options: bf16, no (fp32))", "bf16"
-        )
-    if env_contents["--mixed_precision"] == "bf16":
-        compatible_optims = bf16_only_optims + any_precision_optims
-    else:
-        compatible_optims = any_precision_optims
-    env_contents["--optimizer"] = None
-    while (
-        not env_contents["--optimizer"]
-        or env_contents["--optimizer"] not in compatible_optims
-    ):
-        if env_contents["--optimizer"]:
-            print(f"Invalid optimizer: {env_contents['--optimizer']}")
-        env_contents["--optimizer"] = prompt_user(
-            f"Choose an optimizer (Options: {'/'.join(compatible_optims)})",
-            compatible_optims[0],
-        )
-
-    lr_schedulers = ["polynomial", "constant"]
-    lr_scheduler = None
-    while lr_scheduler not in lr_schedulers:
-        if lr_scheduler:
-            print(f"Invalid learning rate scheduler: {lr_scheduler}")
-        lr_scheduler = prompt_user(
-            f"Set the learning rate scheduler. Options: {'/'.join(lr_schedulers)}",
-            lr_schedulers[0],
-        )
-    learning_rate = prompt_user(
-        "Set the learning rate",
-        (
-            learning_rates_by_rank[lora_rank]
-            if model_type == "lora"
-            else 1.0 if env_contents["--optimizer"] == "prodigy" else "1e-6"
-        ),
-    )
-    lr_warmup_steps = prompt_user(
-        "Set the number of warmup steps before the learning rate reaches its peak. This is set to 10 percent of the total runtime by default, or 100 steps, whichever is higher.",
-        min(100, int(env_contents["--max_train_steps"]) // 10),
-    )
-    env_contents["--learning_rate"] = learning_rate
-    env_contents["--lr_scheduler"] = lr_scheduler
-    if lr_scheduler == "polynomial":
-        extra_args.append("--lr_end=1e-8")
-    env_contents["--lr_warmup_steps"] = lr_warmup_steps
-
-    quantization = (
-        prompt_user(
-            f"Would you like to enable model quantization? {'NOTE: Currently, a bug prevents multi-GPU training with LoRA' if use_lora else ''}. ([y]/n)",
-            "y",
-        ).lower()
-        == "y"
-    )
-    if quantization:
-        if env_contents.get("--use_dora") == "true":
-            print("DoRA will be disabled for quantisation.")
-            del env_contents["--use_dora"]
-        quantization_type = None
-        while (
-            not quantization_type or quantization_type not in quantised_precision_levels
-        ):
-            if quantization_type:
-                print(f"Invalid quantization type: {quantization_type}")
-            quantization_type = prompt_user(
-                f"Choose quantization type. (Options: {'/'.join(quantised_precision_levels)})",
-                "int8-quanto",
-            )
-        env_contents["--base_model_precision"] = quantization_type
-    print_config(env_contents, extra_args)
-    compress_disk_cache = (
-        prompt_user("Would you like to compress the disk cache? (y/n)", "y").lower()
-        == "y"
-    )
-    if compress_disk_cache:
-        extra_args.append("--compress_disk_cache")
-
-    # torch compile
-    torch_compile = (
-        prompt_user(
-            "Would you like to use torch compile during validations? (y/n)", "n"
-        ).lower()
-        == "y"
-    )
-    env_contents["--validation_torch_compile"] = "false"
-    if torch_compile:
-        env_contents["--validation_torch_compile"] = "true"
-
-    # Summary and confirmation
-    print_config(env_contents, extra_args)
-    confirm = prompt_user("Does this look correct? (y/n)", "y").lower() == "y"
-
-    if confirm:
-        # Write to .env file
-        with open("configs/006_flux/config.json", "w") as env_file:
-            import json
-
-            env_file.write(json.dumps(env_contents, indent=4))
-
-        print("\nConfiguration file created successfully!")
-    else:
-        print("\nConfiguration aborted. No changes were made.")
-        import sys
-
-        sys.exit(1)
-
-    # dataloader configuration
-    default_local_configuration = [
-        {
-            "id": "PLACEHOLDER-512",
-            "type": "local",
-            "instance_data_dir": None,
-            "crop": False,
-            "crop_style": "random",
-            "minimum_image_size": 128,
-            "resolution": 512,
-            "resolution_type": "pixel_area",
-            "repeats": 10,
-            "metadata_backend": "discovery",
-            "caption_strategy": "filename",
-            "cache_dir_vae": "vae-512",
-        },
-        {
-            "id": "PLACEHOLDER-1024",
-            "type": "local",
-            "instance_data_dir": None,
-            "crop": False,
-            "crop_style": "random",
-            "minimum_image_size": 128,
-            "resolution": 1024,
-            "resolution_type": "pixel_area",
-            "repeats": 10,
-            "metadata_backend": "discovery",
-            "caption_strategy": "filename",
-            "cache_dir_vae": "vae-1024",
-        },
-        {
-            "id": "PLACEHOLDER-512-crop",
-            "type": "local",
-            "instance_data_dir": None,
-            "crop": True,
-            "crop_style": "random",
-            "minimum_image_size": 128,
-            "resolution": 512,
-            "resolution_type": "pixel_area",
-            "repeats": 10,
-            "metadata_backend": "discovery",
-            "caption_strategy": "filename",
-            "cache_dir_vae": "vae-512-crop",
-        },
-        {
-            "id": "PLACEHOLDER-1024-crop",
-            "type": "local",
-            "instance_data_dir": None,
-            "crop": True,
-            "crop_style": "random",
-            "minimum_image_size": 128,
-            "resolution": 1024,
-            "resolution_type": "pixel_area",
-            "repeats": 10,
-            "metadata_backend": "discovery",
-            "caption_strategy": "filename",
-            "cache_dir_vae": "vae-1024-crop",
-        },
-        {
-            "id": "text-embed-cache",
-            "dataset_type": "text_embeds",
-            "default": True,
-            "type": "local",
-            "cache_dir": "text",
-        },
-    ]
-
-    # Let's offer to generate a prompt library for the user. Preserve their existing one if it already exists.
-    should_generate_by_default = "n"
-    if not os.path.exists("configs/006_flux/user_prompt_library.json"):
-        should_generate_by_default = "y"
-    should_generate_prompt_library = (
-        prompt_user(
-            (
-                "Would you like to generate a very rudimentary subject-centric prompt library for your dataset?"
-                " This will download a small 1B Llama 3.2 model."
-                " If a user prompt library exists, it will be overwritten. (y/n)"
-            ),
-            should_generate_by_default,
-        ).lower()
-        == "y"
-    )
-    if should_generate_prompt_library:
-        try:
-            user_caption_trigger = prompt_user(
-                "Enter a trigger word (or a few words) that you would like Llama 3.2 1B to expand.",
-                "Character Name",
-            )
-            number_of_prompts = int(
-                prompt_user("How many prompts would you like to generate?", 8)
-            )
-            from videotuna.third_party.flux.prompt_expander import PromptExpander
-
-            PromptExpander.initialize_model()
-            user_prompt_library = PromptExpander.generate_prompts(
-                trigger_phrase=user_caption_trigger, num_prompts=number_of_prompts
-            )
-            with open(
-                "configs/006_flux/user_prompt_library.json", "w", encoding="utf-8"
-            ) as f:
-                f.write(json.dumps(user_prompt_library, indent=4))
-            print("Prompt library generated successfully!")
-            env_contents["--user_prompt_library"] = (
-                "configs/006_flux/user_prompt_library.json"
-            )
-        except Exception as e:
-            print(f"(warning) Failed to generate prompt library: {e}")
-
-    # now we ask user the path to their data, the path to the cache (cache/), number of repeats, update the id placeholder based on users dataset name
-    # then we'll write the file to multidatabackend.json
-    should_configure_dataloader = (
-        prompt_user("Would you like to configure your dataloader? (y/n)", "y").lower()
-        == "y"
-    )
-    if not should_configure_dataloader:
-        print("Skipping dataloader configuration.")
-        return
-    dataset_id = prompt_user(
-        "Enter the name of your dataset. This will be used to generate the cache directory. It should be simple, and not contain spaces or special characters.",
-        "my-dataset",
-    )
-    dataset_path = prompt_user(
-        "Enter the path to your dataset. This should be a directory containing images and text files for their caption. For reliability, use an absolute (full) path, beginning with a '/'",
-        "/datasets/my-dataset",
-    )
-    dataset_caption_strategy = prompt_user(
-        (
-            "How should the dataloader handle captions?"
-            "\n-> 'filename' will use the names of your image files as the caption"
-            "\n-> 'textfile' requires a image.txt file to go next to your image.png file"
-            "\n-> 'instanceprompt' will just use one trigger phrase for all images"
-            "\n"
-            "\n(Options: filename, textfile, instanceprompt)"
-        ),
-        "textfile",
-    )
-    if dataset_caption_strategy not in ["filename", "textfile", "instanceprompt"]:
-        print(f"Invalid caption strategy: {dataset_caption_strategy}")
-        dataset_caption_strategy = "textfile"
-    dataset_instance_prompt = None
-    if "instanceprompt" in dataset_caption_strategy:
-        dataset_instance_prompt = prompt_user(
-            "Enter the instance_prompt you want to use for all images in this dataset",
-            "Character Name",
-        )
-    dataset_repeats = int(
-        prompt_user(
-            "How many times do you want to repeat each image in the dataset?", 10
-        )
-    )
-    dataset_cache_prefix = prompt_user(
-        "Where will your VAE and text encoder caches be written to? Subdirectories will be created inside for you automatically.",
-        "cache/",
-    )
-    has_very_large_images = (
-        prompt_user(
-            "Do you have very-large images in the dataset (eg. much larger than 1024x1024)? (y/n)",
-            "n",
-        ).lower()
-        == "y"
-    )
-
-    # Now we'll modify the default json and if has_very_large_images is true, we will add two keys to each image dataset, 'maximum_image_size' and 'target_downsample_size' equal to the dataset's resolution value
-    for dataset in default_local_configuration:
-        if dataset.get("dataset_type") == "text_embeds":
-            dataset["cache_dir"] = f"{dataset_cache_prefix}/{dataset['cache_dir']}"
-            continue
-        dataset["instance_data_dir"] = dataset_path
-        dataset["repeats"] = dataset_repeats
-        dataset["cache_dir_vae"] = f"{dataset_cache_prefix}/{dataset['cache_dir_vae']}"
-        if has_very_large_images:
-            dataset["maximum_image_size"] = dataset["resolution"]
-            dataset["target_downsample_size"] = dataset["resolution"]
-        dataset["id"] = dataset["id"].replace("PLACEHOLDER", dataset_id)
-        if dataset_instance_prompt:
-            dataset["instance_prompt"] = dataset_instance_prompt
-        dataset["caption_strategy"] = dataset_caption_strategy
-
-    print("Dataloader configuration:")
-    print(default_local_configuration)
-    confirm = prompt_user("Does this look correct? (y/n)", "y").lower() == "y"
-    if confirm:
-        import json
-
-        with open("configs/006_flux/multidatabackend.json", "w", encoding="utf-8") as f:
-            f.write(json.dumps(default_local_configuration, indent=4))
-        print("Dataloader configuration written successfully!")
-
-
-if __name__ == "__main__":
-    configure_env()
diff --git a/videotuna/third_party/flux/configuration/env_file.py b/videotuna/third_party/flux/configuration/env_file.py
deleted file mode 100644
index 19b1d057..00000000
--- a/videotuna/third_party/flux/configuration/env_file.py
+++ /dev/null
@@ -1,193 +0,0 @@
-import json
-
-env_to_args_map = {
-    "RESUME_CHECKPOINT": "--resume_from_checkpoint",
-    "DATALOADER_CONFIG": "--data_backend_config",
-    "ASPECT_BUCKET_ROUNDING": "--aspect_bucket_rounding",
-    "TRAINING_SEED": "--seed",
-    "USE_EMA": "--use_ema",
-    "USE_XFORMERS": "--enable_xformers_memory_efficient_attention",
-    "MINIMUM_RESOLUTION": "--minimum_image_size",
-    "OUTPUT_DIR": "--output_dir",
-    "USE_DORA": "--use_dora",
-    "USE_BITFIT": "--layer_freeze_strategy=bitfit",
-    "LORA_TYPE": "--lora_type",
-    "LYCORIS_CONFIG": "--lycoris_config",
-    "PUSH_TO_HUB": "--push_to_hub",
-    "PUSH_CHECKPOINTS": "--push_checkpoints_to_hub",
-    "MAX_NUM_STEPS": "--max_train_steps",
-    "NUM_EPOCHS": "--num_train_epochs",
-    "CHECKPOINTING_STEPS": "--checkpointing_steps",
-    "CHECKPOINTING_LIMIT": "--checkpoints_total_limit",
-    "HUB_MODEL_NAME": "--hub_model_id",
-    "MODEL_CARD_SAFE_FOR_WORK": "--model_card_safe_for_work",
-    "TRACKER_PROJECT_NAME": "--tracker_project_name",
-    "TRACKER_RUN_NAME": "--tracker_run_name",
-    "MODEL_TYPE": "--model_type",
-    "MODEL_NAME": "--pretrained_model_name_or_path",
-    "MODEL_FAMILY": "--model_family",
-    "TRAIN_BATCH_SIZE": "--train_batch_size",
-    "USE_GRADIENT_CHECKPOINTING": "--gradient_checkpointing",
-    "CAPTION_DROPOUT_PROBABILITY": "--caption_dropout_probability",
-    "RESOLUTION_TYPE": "--resolution_type",
-    "RESOLUTION": "--resolution",
-    "VALIDATION_SEED": "--validation_seed",
-    "VALIDATION_STEPS": "--validation_steps",
-    "VALIDATION_RESOLUTION": "--validation_resolution",
-    "VALIDATION_GUIDANCE": "--validation_guidance",
-    "VALIDATION_GUIDANCE_RESCALE": "--validation_guidance_rescale",
-    "VALIDATION_NUM_INFERENCE_STEPS": "--validation_num_inference_steps",
-    "VALIDATION_PROMPT": "--validation_prompt",
-    "ALLOW_TF32": "--allow_tf32",
-    "MIXED_PRECISION": "--mixed_precision",
-    "OPTIMIZER": "--optimizer",
-    "LEARNING_RATE": "--learning_rate",
-    "LR_SCHEDULE": "--lr_scheduler",
-    "LR_WARMUP_STEPS": "--lr_warmup_steps",
-    "BASE_MODEL_PRECISION": "--base_model_precision",
-    "TRAINING_NUM_PROCESSES": "--num_processes",
-    "TRAINING_NUM_MACHINES": "--num_machines",
-    "VALIDATION_TORCH_COMPILE": "--validation_torch_compile",
-    "TRAINER_DYNAMO_BACKEND": "--dynamo_backend",
-    "VALIDATION_GUIDANCE_REAL": "--validation_guidance_real",
-    "VALIDATION_NO_CFG_UNTIL_TIMESTEP": "--validation_no_cfg_until_timestep",
-    "TRAINING_SCHEDULER_TIMESTEP_SPACING": "--training_scheduler_timestep_spacing",
-    "INFERENCE_SCHEDULER_TIMESTEP_SPACING": "--inference_scheduler_timestep_spacing",
-    "GRADIENT_ACCUMULATION_STEPS": "--gradient_accumulation_steps",
-    "TRAINING_DYNAMO_BACKEND": "--dynamo_backend",
-    "LR_END": "--lr_end",
-    "FLUX_GUIDANCE_VALUE": "--flux_guidance_value",
-    "FLUX_LORA_TARGET": "--flux_lora_target",
-    "VALIDATION_NEGATIVE_PROMPT": "--validation_negative_prompt",
-    "METADATA_UPDATE_INTERVAL": "--metadata_update_interval",
-    "READ_BATCH_SIZE": "--read_batch_size",
-    "WRITE_BATCH_SIZE": "--write_batch_size",
-    "AWS_MAX_POOL_CONNECTIONS": "--aws_max_pool_connections",
-    "TORCH_NUM_THREADS": "--torch_num_threads",
-    "IMAGE_PROCESSING_BATCH_SIZE": "--image_processing_batch_size",
-    "DISABLE_BENCHMARK": "--disable_benchmark",
-}
-
-import logging
-import os
-import subprocess
-
-logger = logging.getLogger("SimpleTuner")
-logger.setLevel(os.environ.get("SIMPLETUNER_LOG_LEVEL", "INFO"))
-
-
-def load_env():
-    """
-    Load environment variables from .env files based on the specified environment.
-    """
-    # Define the paths to the default and environment-specific .env files
-    config_env_path = "configs/006_flux/config.env"
-    env = os.environ.get(
-        "SIMPLETUNER_ENVIRONMENT",
-        os.environ.get("SIMPLETUNER_ENV", os.environ.get("ENV", None)),
-    )
-    if env and env != "default":
-        config_env_path = f"configs/006_flux/{env}/config.env"
-
-    # Load default environment variables if the file exists
-    config_file_contents = {}
-    if os.path.isfile(config_env_path):
-        # Loop through, ignoring comments '#' and empty lines, while setting the env variables
-        with open(config_env_path, "r") as f:
-            for line in f:
-                # Skip comments and empty lines
-                if line.startswith("#") or line.strip() == "":
-                    continue
-
-                # Remove 'export' from the start
-                if line.startswith("export"):
-                    line = line[7:]
-
-                # Handle `+=` for appending values
-                if "+=" in line:
-                    key, value = line.strip().split("+=", 1)
-                    key, value = (
-                        key.strip(),
-                        value.strip('"').strip("'").strip().split(),
-                    )
-                    # Append each element to the existing key's list or create a new list
-                    if key in config_file_contents:
-                        config_file_contents[key].extend(value)
-                    else:
-                        config_file_contents[key] = value
-                else:
-                    # Regular `=` assignment
-                    c = line.strip().split("=", 1)
-                    if len(c) == 2:
-                        key, value = c
-                        config_file_contents[key.strip()] = (
-                            value.strip('"').strip("'").split()
-                        )
-
-        # Convert lists to single string values with spaces, if needed
-        for key, value in config_file_contents.items():
-            if isinstance(value, list):
-                if value and "${" in value[0]:
-                    continue
-                config_file_contents[key] = " ".join(value)
-
-        print(f"[CONFIG.ENV] Loaded environment variables from {config_env_path}")
-    else:
-        logger.error(f"Cannot find config file: {config_env_path}")
-
-    return config_file_contents
-
-
-def load_env_config():
-    """
-    Map the environment variables to command-line arguments.
-
-    :return: List of command-line arguments.
-    """
-    config_file_contents = load_env()
-    mapped_args = []
-    # Loop through the environment variable to argument mapping
-    ignored_accelerate_kwargs = [
-        "--num_processes",
-        "--num_machines",
-        "--dynamo_backend",
-    ]
-    for env_var, arg_name in env_to_args_map.items():
-        if arg_name in ignored_accelerate_kwargs:
-            continue
-        value = config_file_contents.get(env_var, None)
-        # strip 's from the outside of value
-        if value is not None and value.startswith("'") and value.endswith("'"):
-            value = value[1:-1]
-        if value is not None and value.startswith('"') and value.endswith('"'):
-            value = value[1:-1]
-        is_numeric = (
-            str(value).isnumeric()
-            or str(value).isdigit()
-            or str(value).replace(".", "").isdigit()
-        )
-        if value is not None:
-            # Handle booleans by checking their string value
-            if value.lower() in ["true", "false"]:
-                if value.lower() == "true":
-                    mapped_args.append(f"{arg_name}")
-            elif is_numeric:
-                # Handle numeric values
-                mapped_args.append(f"{arg_name}={value}")
-            else:
-                # Add the argument and its value to the list
-                mapped_args.append(f"{arg_name}={value}")
-    # handle TRAINER_EXTRA_ARGS, which is like `TRAINER_EXTRA_ARGS="--num_processes=1 --num_machines=1 --dynamo_backend=local"`
-    extra_args = config_file_contents.get("TRAINER_EXTRA_ARGS", None)
-    if extra_args:
-        print(f"Extra args: {extra_args}")
-        if type(extra_args) is list:
-            for value in extra_args:
-                if "${" in value:
-                    continue
-                mapped_args.extend(value.split())
-        else:
-            mapped_args.extend(extra_args.split())
-
-    logger.info(f"Loaded environment variables: {json.dumps(mapped_args, indent=4)}")
-    return mapped_args
diff --git a/videotuna/third_party/flux/configuration/json_file.py b/videotuna/third_party/flux/configuration/json_file.py
deleted file mode 100644
index 6d8c4921..00000000
--- a/videotuna/third_party/flux/configuration/json_file.py
+++ /dev/null
@@ -1,66 +0,0 @@
-import json
-import logging
-import os
-
-# Set up logging
-from videotuna.third_party.flux.training.multi_process import _get_rank
-
-logger = logging.getLogger("SimpleTuner")
-if _get_rank() > 0:
-    logger.setLevel(logging.WARNING)
-else:
-    logger.setLevel(os.environ.get("SIMPLETUNER_LOG_LEVEL", "INFO"))
-
-
-def normalize_args(args_dict):
-    """
-    Normalize arguments, ensuring they have '--' at the start if necessary.
-
-    :param args_dict: A dictionary of arguments that may or may not have '--' prefixes.
-    :return: A normalized dictionary of arguments.
-    """
-    normalized = []
-    for key, value in args_dict.items():
-        # Add -- prefix if not present
-        if (type(value) is bool and value) or value == "true":
-            if not key.startswith("--"):
-                normalized_key = f"--{key}"
-            else:
-                normalized_key = key
-        elif type(value) is bool and not value or value == "false":
-            logger.warning(f"Skipping false argument: {key}")
-            continue
-        else:
-            if not key.startswith("--"):
-                normalized_key = f"--{key}={value}"
-            else:
-                normalized_key = f"{key}={value}"
-        normalized.append(normalized_key)
-    return normalized
-
-
-def load_json_config():
-    """
-    Load configuration from a JSON file that directly specifies command-line arguments.
-
-    :param json_path: The path to the JSON file.
-    :return: A dictionary containing the configuration.
-    """
-    config_json_path = "configs/006_flux/config.json"
-    env = os.environ.get(
-        "SIMPLETUNER_ENVIRONMENT",
-        os.environ.get("SIMPLETUNER_ENV", os.environ.get("ENV", None)),
-    )
-    if env and env != "default":
-        config_json_path = f"configs/006_flux/{env}/config.json"
-
-    if not os.path.isfile(config_json_path):
-        raise ValueError(f"JSON configuration file not found: {config_json_path}")
-
-    with open(config_json_path, "r") as file:
-        try:
-            config = json.load(file)
-            logger.info(f"[CONFIG.JSON] Loaded configuration from {config_json_path}")
-            return normalize_args(config)
-        except json.JSONDecodeError as e:
-            raise ValueError(f"Failed to parse JSON file {config_json_path}: {e}")
diff --git a/videotuna/third_party/flux/configuration/loader.py b/videotuna/third_party/flux/configuration/loader.py
deleted file mode 100644
index 44488665..00000000
--- a/videotuna/third_party/flux/configuration/loader.py
+++ /dev/null
@@ -1,64 +0,0 @@
-import logging
-import os
-import sys
-
-from videotuna.third_party.flux.configuration import (
-    cmd_args,
-    env_file,
-    json_file,
-    toml_file,
-)
-from videotuna.third_party.flux.training.state_tracker import StateTracker
-
-logger = logging.getLogger("SimpleTuner")
-logger.setLevel(os.environ.get("SIMPLETUNER_LOG_LEVEL", "INFO"))
-
-tools = {
-    "json": json_file.load_json_config,
-    "toml": toml_file.load_toml_config,
-    "env": env_file.load_env_config,
-    "cmd": cmd_args.parse_cmdline_args,
-}
-
-default_config_paths = {
-    "json": "config.json",
-    "toml": "config.toml",
-    "env": "config.env",
-}
-
-
-def attach_env_to_path_if_not_present(backend: str, env: str = None):
-    backend_cfg_path = default_config_paths.get(backend)
-    if env and env != "default":
-        return f"configs/006_flux/{env}/{backend_cfg_path}"
-    return f"configs/006_flux/{backend_cfg_path}"
-
-
-def load_config(args: dict = None):
-    # Check if help is requested; bypass configuration loading if true
-    if "-h" in sys.argv or "--help" in sys.argv:
-        return tools["cmd"]()
-
-    mapped_config = args
-    if mapped_config is None or not mapped_config:
-        config_backend = os.environ.get(
-            "SIMPLETUNER_CONFIG_BACKEND",
-            os.environ.get("CONFIG_BACKEND", os.environ.get("CONFIG_TYPE", "env")),
-        ).lower()
-        config_env = os.environ.get(
-            "SIMPLETUNER_ENVIRONMENT",
-            os.environ.get("SIMPLETUNER_ENV", os.environ.get("ENV", "default")),
-        )
-        config_backend_path = "config"
-        if config_env and config_env != "default" and config_env is not None:
-            config_backend_path = os.path.join("config", config_env)
-        StateTracker.set_config_path(config_backend_path)
-        logger.info("Using {} configuration backend.".format(config_backend))
-        mapped_config = tools[config_backend]()
-        if config_backend == "cmd":
-            return mapped_config
-
-    # Other configs need to be passed through parse_cmdline_args to be made whole and have complete defaults and safety checks applied.
-    configuration = tools["cmd"](input_args=mapped_config)
-
-    return configuration
diff --git a/videotuna/third_party/flux/configuration/toml_file.py b/videotuna/third_party/flux/configuration/toml_file.py
deleted file mode 100644
index 2e3eb625..00000000
--- a/videotuna/third_party/flux/configuration/toml_file.py
+++ /dev/null
@@ -1,75 +0,0 @@
-import logging
-import os
-
-import toml
-
-# Set up logging
-from videotuna.third_party.flux.training.multi_process import _get_rank
-
-logger = logging.getLogger("SimpleTuner")
-if _get_rank() > 0:
-    logger.setLevel(logging.WARNING)
-else:
-    logger.setLevel(os.environ.get("SIMPLETUNER_LOG_LEVEL", "INFO"))
-
-
-def normalize_args(args_dict):
-    """
-    Normalize arguments, ensuring they have '--' at the start if necessary.
-
-    :param args_dict: A dictionary of arguments that may or may not have '--' prefixes.
-    :return: A normalized dictionary of arguments.
-    """
-    normalized = []
-    for key, value in args_dict.items():
-        # Add -- prefix if not present
-        if type(value) is bool and value or value == "true":
-            if not key.startswith("--"):
-                normalized_key = f"--{key}"
-            else:
-                normalized_key = key
-        elif type(value) is bool and not value or value == "false":
-            logger.warning(f"Skipping false argument: {key}")
-            continue
-        else:
-            print(f"Value: {value}, type: {type(value)}")
-            if not key.startswith("--"):
-                normalized_key = f"--{key}={value}"
-            else:
-                normalized_key = f"{key}={value}"
-        normalized.append(normalized_key)
-    return normalized
-
-
-def load_toml_config():
-    """
-    Load configuration from a TOML file that directly specifies command-line arguments.
-
-    :param toml_path: The path to the TOML file.
-    :return: A dictionary containing the configuration.
-    """
-    config_toml_path = "configs/006_flux/config.toml"
-    env = os.environ.get(
-        "SIMPLETUNER_ENVIRONMENT",
-        os.environ.get("SIMPLETUNER_ENV", os.environ.get("ENV", None)),
-    )
-    if env and env != "default":
-        config_toml_path = f"configs/006_flux/{env}/config.toml"
-
-    if not os.path.isfile(config_toml_path):
-        raise ValueError(f"Can not find config file: {config_toml_path}")
-
-    with open(config_toml_path, "r") as file:
-        try:
-            config = toml.load(file)
-            logger.info(f"[CONFIG.TOML] Loaded configuration from {config_toml_path}")
-            toml_config = config
-        except toml.TomlDecodeError as e:
-            logger.error(f"Failed to parse TOML file {config_toml_path}: {e}")
-            toml_config = {}
-    normalized_config = normalize_args(toml_config)
-    logger.info(
-        f"[CONFIG] Loaded and normalized TOML configuration: {normalized_config}"
-    )
-
-    return normalized_config
diff --git a/videotuna/third_party/flux/convert_parquet_to_images.py b/videotuna/third_party/flux/convert_parquet_to_images.py
deleted file mode 100644
index 5ce2d0a1..00000000
--- a/videotuna/third_party/flux/convert_parquet_to_images.py
+++ /dev/null
@@ -1,44 +0,0 @@
-import io
-import os
-
-import numpy as np
-import pandas as pd
-from PIL import Image
-
-# Step 1: Load Parquet File
-parquet_file_path = "data/train-00000-of-00001-dfb0d9df7ebab67e.parquet"  # Replace with your Parquet file path
-output_directory = "data-res"  # Directory to save the images
-import pandas as pd
-
-# Load the Parquet file into a DataFrame
-df = pd.read_parquet(parquet_file_path)
-
-# Step 2: Print the column names
-print("Columns in the Parquet file:")
-print(df.columns)
-
-# Load the Parquet file into a Pandas DataFrame
-df = pd.read_parquet(parquet_file_path)
-
-
-# Step 2: Process DataFrame Rows
-for index, row in df.iterrows():
-    # Extract the 'text' column as the filename (without extension)
-    text_filename = row["text"]
-
-    # Extract image data - assuming the image data is in a column called 'image_data'
-    # This data should be in a format suitable to create an image (e.g., 2D numpy array)
-    image_data = row["image"]  # Replace with the actual column name for image data
-    # print(image_data.items())
-    image_bytes = image_data["bytes"]
-
-    # Step 2: Convert the bytes data to an image
-    image = Image.open(io.BytesIO(image_bytes))
-
-    # Step 3: Save the image to disk or process it further
-    output_path = os.path.join(output_directory, text_filename + ".png")
-    image.save(output_path)
-
-    print(f"Saved image: {output_path}")
-
-print("All images have been saved.")
diff --git a/videotuna/third_party/flux/data_backend/aws.py b/videotuna/third_party/flux/data_backend/aws.py
deleted file mode 100644
index b8b2a357..00000000
--- a/videotuna/third_party/flux/data_backend/aws.py
+++ /dev/null
@@ -1,424 +0,0 @@
-import concurrent.futures
-import fnmatch
-import logging
-import os
-import time
-from io import BytesIO
-from os.path import splitext
-
-import boto3
-import torch
-from botocore.config import Config
-from botocore.exceptions import NoCredentialsError, PartialCredentialsError
-from torch import Tensor
-
-from videotuna.third_party.flux.data_backend.base import BaseDataBackend
-from videotuna.third_party.flux.image_manipulation.load import load_image
-from videotuna.third_party.flux.training.multi_process import _get_rank as get_rank
-
-loggers_to_silence = [
-    "botocore.hooks",
-    "botocore.auth",
-    "botocore.httpsession",
-    "botocore.parsers",
-    "botocore.retryhandler",
-    "botocore.loaders",
-    "botocore.regions",
-    "botocore.utils",
-    "botocore.client",
-    "botocore.handler",
-    "botocore.handlers",
-    "botocore.awsrequest",
-]
-
-for logger_name in loggers_to_silence:
-    logger = logging.getLogger(logger_name)
-    logger.setLevel("ERROR")
-
-# Arguably, the most interesting one:
-boto_logger = logging.getLogger("botocore.endpoint")
-boto_logger.setLevel(os.environ.get("SIMPLETUNER_AWS_LOG_LEVEL", "ERROR"))
-
-logger = logging.getLogger("S3DataBackend")
-logger.setLevel(os.environ.get("SIMPLETUNER_LOG_LEVEL", "INFO"))
-
-
-class S3DataBackend(BaseDataBackend):
-    # Storing the list_files output in a local dict.
-    _list_cache: dict = {}
-
-    def __init__(
-        self,
-        id: str,
-        bucket_name,
-        accelerator,
-        region_name="us-east-1",
-        endpoint_url: str = None,
-        aws_access_key_id: str = None,
-        aws_secret_access_key: str = None,
-        read_retry_limit: int = 5,
-        write_retry_limit: int = 5,
-        read_retry_interval: int = 5,
-        write_retry_interval: int = 5,
-        compress_cache: bool = False,
-        max_pool_connections: int = 128,
-    ):
-        self.id = id
-        self.accelerator = accelerator
-        self.bucket_name = bucket_name
-        self.read_retry_limit = read_retry_limit
-        self.read_retry_interval = read_retry_interval
-        self.write_retry_limit = write_retry_limit
-        self.write_retry_interval = write_retry_interval
-        self.compress_cache = compress_cache
-        self.max_pool_connections = max_pool_connections
-        self.type = "aws"
-        # AWS buckets might use a region.
-        extra_args = {
-            "region_name": region_name,
-        }
-        # If using an endpoint_url, we do not use the region.
-        if endpoint_url:
-            extra_args = {
-                "endpoint_url": endpoint_url,
-            }
-        s3_config = Config(max_pool_connections=self.max_pool_connections)
-        self.client = boto3.client(
-            "s3",
-            aws_access_key_id=aws_access_key_id,
-            aws_secret_access_key=aws_secret_access_key,
-            config=s3_config,
-            **extra_args,
-        )
-
-    def exists(self, s3_key):
-        """Check if the key exists in S3, with retries for transient errors."""
-        for i in range(self.read_retry_limit):
-            try:
-                self.client.head_object(Bucket=self.bucket_name, Key=str(s3_key))
-                return True
-            except self.client.exceptions.NoSuchKey:
-                logger.debug(
-                    f"File {s3_key} does not exist in S3 bucket ({self.bucket_name})"
-                )
-                return False
-            except (NoCredentialsError, PartialCredentialsError) as e:
-                raise e  # Raise credential errors to the caller
-            except Exception as e:
-                logger.error(f'Error checking existence of S3 key "{s3_key}": {e}')
-                if i == self.read_retry_limit - 1:
-                    # We have reached our maximum retry count.
-                    raise e
-                else:
-                    # Sleep for a bit before retrying.
-                    time.sleep(self.read_retry_interval)
-            except:
-                if i == self.read_retry_limit - 1:
-                    # We have reached our maximum retry count.
-                    raise
-                else:
-                    # Sleep for a bit before retrying.
-                    time.sleep(self.read_retry_interval)
-
-    def read(self, s3_key):
-        """Retrieve and return the content of the file from S3."""
-        for i in range(self.read_retry_limit):
-            try:
-                response = self.client.get_object(
-                    Bucket=self.bucket_name, Key=str(s3_key)
-                )
-                return response["Body"].read()
-            except self.client.exceptions.NoSuchKey:
-                logger.debug(
-                    f"File {s3_key} does not exist in S3 bucket ({self.bucket_name})"
-                )
-                return None
-            except (NoCredentialsError, PartialCredentialsError) as e:
-                raise e  # Raise credential errors to the caller
-            except Exception as e:
-                logger.error(f'Error reading S3 bucket key "{s3_key}": {e}')
-                if i == self.read_retry_limit - 1:
-                    # We have reached our maximum retry count.
-                    raise e
-                else:
-                    # Sleep for a bit before retrying.
-                    time.sleep(self.read_retry_interval)
-            except:
-                if i == self.read_retry_limit - 1:
-                    # We have reached our maximum retry count.
-                    raise
-                else:
-                    # Sleep for a bit before retrying.
-                    time.sleep(self.read_retry_interval)
-
-    def open_file(self, s3_key, mode):
-        """Open the file in the specified mode."""
-        return self.read(s3_key)
-
-    def write(self, s3_key, data):
-        """Upload data to the specified S3 key."""
-        real_key = str(s3_key)
-        for i in range(self.write_retry_limit):
-            try:
-                if type(data) == Tensor:
-                    return self.torch_save(data, real_key)
-                response = self.client.put_object(
-                    Body=data,
-                    Bucket=self.bucket_name,
-                    Key=real_key,
-                )
-                return response
-            except Exception as e:
-                logger.error(f'Error writing S3 bucket key "{real_key}": {e}')
-                if i == self.write_retry_limit - 1:
-                    # We have reached our maximum retry count.
-                    raise e
-                else:
-                    # Sleep for a bit before retrying.
-                    time.sleep(self.write_retry_interval)
-
-    def delete(self, s3_key):
-        """Delete the specified file from S3."""
-        for i in range(self.write_retry_limit):
-            try:
-                logger.debug(f'Deleting S3 key "{s3_key}"')
-                response = self.client.delete_object(
-                    Bucket=self.bucket_name, Key=str(s3_key)
-                )
-                return response
-            except Exception as e:
-                logger.error(f'Error deleting S3 bucket key "{s3_key}": {e}')
-                if i == self.write_retry_limit - 1:
-                    # We have reached our maximum retry count.
-                    raise e
-                else:
-                    # Sleep for a bit before retrying.
-                    time.sleep(self.write_retry_interval)
-
-    def list_by_prefix(self, prefix=""):
-        """List all files under a specific path (prefix) in the S3 bucket."""
-        response = self.client.list_objects_v2(Bucket=self.bucket_name, Prefix=prefix)
-        bucket_prefix = f"{self.bucket_name}/"
-
-        return [
-            (
-                item["Key"][len(bucket_prefix) :]
-                if item["Key"].startswith(bucket_prefix)
-                else item["Key"]
-            )
-            for item in response.get("Contents", [])
-        ]
-
-    def list_files(self, file_extensions: list, instance_data_dir: str = None):
-        # Initialize the results list
-        results = []
-
-        def splitext_(path):
-            o = splitext(path)[1].lower()
-            # remove leading .
-            return o[1:] if o else o
-
-        # Grab a timestamp for our start time.
-        start_time = time.time()
-
-        # Using paginator to handle potential large number of objects
-        paginator = self.client.get_paginator("list_objects_v2")
-
-        # Using a dictionary to hold files based on their prefixes (subdirectories)
-        prefix_dict = {}
-        # Log the first few items, alphabetically sorted:
-        logger.debug(
-            f"Listing files in S3 bucket {self.bucket_name} in prefix {instance_data_dir} with extensions: {file_extensions}"
-        )
-
-        # Paginating over the entire bucket objects
-        for page in paginator.paginate(Bucket=self.bucket_name, MaxKeys=1000):
-            # logger.debug(f"Page: {page}")
-            for obj in page.get("Contents", []):
-                # Filter based on the provided pattern
-                ext = splitext_(obj["Key"])
-                if file_extensions and ext not in file_extensions:
-                    continue
-                # Split the S3 key to determine the directory and file structure
-                parts = obj["Key"].split("/")
-                subdir = "/".join(parts[:-1])  # Get the directory excluding the file
-                filename = parts[-1]  # Get the file name
-
-                # Storing filenames under their respective subdirectories
-                if subdir not in prefix_dict:
-                    prefix_dict[subdir] = []
-                prefix_dict[subdir].append(obj["Key"])
-
-        # Transforming the prefix_dict into the desired results format
-        for subdir, files in prefix_dict.items():
-            results.append((subdir, [], files))
-
-        end_time = time.time()
-        total_time = end_time - start_time
-        # Log the output in n automatically human friendly manner, eg. "x minutes" or "x seconds"
-        if total_time > 120:
-            logger.debug(f"Completed file list in {total_time/60} minutes.")
-        elif total_time < 60:
-            logger.debug(f"Completed file list in {total_time} seconds.")
-        return results
-
-    def read_image(self, s3_key):
-        return load_image(BytesIO(self.read(s3_key)))
-
-    def read_image_batch(self, s3_keys: list, delete_problematic_images: bool = False):
-        """
-        Return a list of Image objects, given a list of S3 keys.
-        This makes use of read_batch for efficiency.
-        Args:
-            s3_keys (list): List of S3 keys to read. May not be included in the output, if it does not exist, or had an error.
-            delete_problematic_images (bool, optional): Whether to delete problematic images. Defaults to False.
-
-        Returns:
-            tuple(list, list): (available_keys, output_images)
-        """
-        batch = self.read_batch(s3_keys)
-        output_images = []
-        available_keys = []
-        for s3_key, data in zip(s3_keys, batch):
-            try:
-                image_data = load_image(BytesIO(data))
-                if image_data is None:
-                    logger.warning(f"Unable to load image '{s3_key}', skipping.")
-                    continue
-                output_images.append(image_data)
-                available_keys.append(s3_key)
-            except Exception as e:
-                if delete_problematic_images:
-                    logger.warning(
-                        f"Deleting image '{s3_key}', because --delete_problematic_images is provided. Error: {e}"
-                    )
-                    self.delete(s3_key)
-                else:
-                    logger.warning(
-                        f"A problematic image {s3_key} is detected, but we are not allowed to remove it, because --delete_problematic_image is not provided."
-                        f" Please correct this manually. Error: {e}"
-                    )
-        return (available_keys, output_images)
-
-    def create_directory(self, directory_path):
-        # Since S3 doesn't have a traditional directory structure, this is just a pass-through
-        pass
-
-    def _detect_file_format(self, fileobj):
-        fileobj.seek(0)
-        magic_number = fileobj.read(4)
-        fileobj.seek(0)
-        logger.debug(f"Magic number: {magic_number}")
-        if magic_number[:2] == b"\x80\x04" or b"PK" in magic_number:
-            # This is likely a torch-saved object (Pickle protocol 4)
-            # Need to check whether it's the incorrectly saved compressed data
-            try:
-                obj = torch.load(fileobj, map_location="cpu")
-                if isinstance(obj, bytes):
-                    # If obj is bytes, it means compressed data was saved incorrectly
-                    return "incorrect"
-                else:
-                    return "correct_uncompressed"
-            except Exception as e:
-                # If torch.load fails, it's possibly compressed correctly
-                return "correct_compressed"
-        elif magic_number[:2] == b"\x1f\x8b":
-            # GZIP magic number, compressed data saved correctly
-            return "correct_compressed"
-        else:
-            # Unrecognized format
-            return "unknown"
-
-    def torch_load(self, s3_key):
-        for i in range(self.read_retry_limit):
-            try:
-                # Read data from S3
-                data = self.read(s3_key)
-                stored_data = BytesIO(data)
-                stored_data.seek(0)
-
-                # Determine if the file was saved incorrectly
-                file_format = self._detect_file_format(stored_data)
-                logger.debug(f"File format: {file_format}")
-                if file_format == "incorrect":
-                    # Load the compressed bytes object serialized by torch.save
-                    stored_data.seek(0)
-                    compressed_data = BytesIO(
-                        torch.load(stored_data, map_location="cpu")
-                    )
-                    # Decompress the data
-                    stored_tensor = self._decompress_torch(compressed_data)
-                elif file_format == "correct_compressed":
-                    # Data is compressed but saved correctly
-                    decompressed_data = self._decompress_torch(data)
-                else:
-                    # Data is uncompressed and saved correctly
-                    stored_tensor = stored_data
-
-                if hasattr(stored_tensor, "seek"):
-                    stored_tensor.seek(0)
-                obj = torch.load(stored_tensor, map_location="cpu")
-
-                if isinstance(obj, tuple):
-                    obj = tuple(o.to(torch.float32) for o in obj)
-                elif isinstance(obj, torch.Tensor):
-                    obj = obj.to(torch.float32)
-
-                return obj
-            except Exception as e:
-                logging.error(f"Failed to load tensor from {s3_key}: {e}")
-                if i == self.read_retry_limit - 1:
-                    raise
-                else:
-                    logging.info(f"Retrying... ({i+1}/{self.read_retry_limit})")
-
-    def torch_save(self, data, s3_key):
-        from io import BytesIO
-
-        import torch
-
-        # Retry the torch save within the retry limit
-        for i in range(self.write_retry_limit):
-            try:
-                buffer = BytesIO()
-                if self.compress_cache:
-                    compressed_data = self._compress_torch(data)
-                    buffer.write(compressed_data)
-                else:
-                    torch.save(data, buffer)
-                buffer.seek(0)  # Reset buffer position to the beginning
-                logger.debug(f"Writing torch file: {s3_key}")
-                result = self.write(s3_key, buffer.getvalue())
-                logger.debug(f"Write completed: {s3_key}")
-                return result
-            except Exception as e:
-                logger.error(f"Could not torch save to backend: {e}")
-                if i == self.write_retry_limit - 1:
-                    # We have reached our maximum retry count.
-                    raise e
-                else:
-                    # Sleep for a bit before retrying.
-                    time.sleep(self.write_retry_interval)
-
-    def write_batch(self, s3_keys, data_list):
-        """Write a batch of files to the specified S3 keys concurrently."""
-        # Use ThreadPoolExecutor for concurrent uploads
-        with concurrent.futures.ThreadPoolExecutor() as executor:
-            executor.map(self.write, s3_keys, data_list)
-
-    def read_batch(self, s3_keys):
-        """Read a batch of files from the specified S3 keys concurrently."""
-
-        # Use ThreadPoolExecutor for concurrent reads
-        with concurrent.futures.ThreadPoolExecutor() as executor:
-            return list(executor.map(self.read, s3_keys))
-
-    def bulk_exists(self, s3_keys, prefix=""):
-        """Check the existence of a list of S3 keys in bulk."""
-
-        # List all objects with the given prefix
-        objects = self.client.list_objects_v2(Bucket=self.bucket_name, Prefix=prefix)
-        existing_keys = set(obj["Key"] for obj in objects.get("Contents", []))
-
-        # Check existence for each key
-        return [key in existing_keys for key in s3_keys]
diff --git a/videotuna/third_party/flux/data_backend/base.py b/videotuna/third_party/flux/data_backend/base.py
deleted file mode 100644
index dcf30509..00000000
--- a/videotuna/third_party/flux/data_backend/base.py
+++ /dev/null
@@ -1,113 +0,0 @@
-import gzip
-from abc import ABC, abstractmethod
-from io import BytesIO
-
-import torch
-
-
-class BaseDataBackend(ABC):
-    @abstractmethod
-    def read(self, identifier):
-        """
-        Read data based on the identifier.
-        """
-        pass
-
-    @abstractmethod
-    def write(self, identifier, data):
-        """
-        Write data to the specified identifier.
-        """
-        pass
-
-    @abstractmethod
-    def delete(self, identifier):
-        """
-        Delete data associated with the identifier.
-        """
-        pass
-
-    @abstractmethod
-    def exists(self, identifier):
-        """
-        Check if the identifier exists.
-        """
-        pass
-
-    @abstractmethod
-    def open_file(self, identifier, mode):
-        """
-        Open the identifier (file or object) in the specified mode.
-        """
-        pass
-
-    @abstractmethod
-    def list_files(self, file_extensions: list, instance_data_dir: str = None) -> tuple:
-        """
-        List all files matching the pattern.
-        """
-        pass
-
-    @abstractmethod
-    def read_image(self, filepath: str, delete_problematic_images: bool = False):
-        """
-        Read an image from the backend and return a PIL Image.
-        """
-        pass
-
-    @abstractmethod
-    def read_image_batch(self, filepaths: str, delete_problematic_images: bool = False):
-        """
-        Read a batch of images from the backend and return a list of PIL Images.
-        """
-        pass
-
-    @abstractmethod
-    def create_directory(self, directory_path):
-        """
-        Creates a directory in the backend.
-        """
-        pass
-
-    @abstractmethod
-    def torch_load(self, filename):
-        """
-        Reads content from the backend and loads it with torch.
-        """
-        pass
-
-    @abstractmethod
-    def torch_save(self, data, filename):
-        """
-        Saves the data using torch to the backend.
-        """
-        pass
-
-    @abstractmethod
-    def write_batch(self, identifiers, files):
-        """
-        Write a batch of files to the specified identifiers.
-        """
-        pass
-
-    def _decompress_torch(self, gzip_data):
-        """
-        We've read the gzip from disk. Just decompress it.
-        """
-        gzip_data.seek(0)
-        with gzip.GzipFile(fileobj=gzip_data, mode="rb") as file:
-            decompressed_data = file.read()
-        return BytesIO(decompressed_data)
-
-    def _compress_torch(self, data):
-        """
-        Compress the torch data before writing it to disk.
-        """
-        output_data_container = BytesIO()
-        torch.save(data, output_data_container)
-        output_data_container.seek(0)
-
-        with BytesIO() as compressed_output:
-            with gzip.GzipFile(fileobj=compressed_output, mode="wb") as file:
-                file.write(output_data_container.getvalue())
-            return compressed_output.getvalue()
diff --git a/videotuna/third_party/flux/data_backend/csv_url_list.py b/videotuna/third_party/flux/data_backend/csv_url_list.py
deleted file mode 100644
index 790b076f..00000000
--- a/videotuna/third_party/flux/data_backend/csv_url_list.py
+++ /dev/null
@@ -1,322 +0,0 @@
-import fnmatch
-import hashlib
-import logging
-import os
-from io import BytesIO
-from pathlib import Path
-from typing import Any, BinaryIO, Optional, Union
-
-import pandas as pd
-import requests
-import torch
-
-from videotuna.third_party.flux.data_backend.base import BaseDataBackend
-from videotuna.third_party.flux.image_manipulation.load import load_image
-from videotuna.third_party.flux.training.multi_process import should_log
-
-logger = logging.getLogger("CSVDataBackend")
-if should_log():
-    logger.setLevel(os.environ.get("SIMPLETUNER_LOG_LEVEL", "INFO"))
-else:
-    logger.setLevel("ERROR")
-
-
-def url_to_filename(url: str) -> str:
-    return url.split("/")[-1]
-
-
-def str_hash(filename: str) -> str:
-    return str(hashlib.sha256(str(filename).encode()).hexdigest())
-
-
-def path_to_hashed_path(path: Path, hash_filenames: bool) -> Path:
-    path = Path(path).resolve()
-    if hash_filenames:
-        return path.parent.joinpath(str_hash(path.stem) + path.suffix)
-    return path
-
-
-def html_to_file_loc(parent_directory: Path, url: str, hash_filenames: bool) -> str:
-    filename = url_to_filename(url)
-    cached_loc = path_to_hashed_path(
-        parent_directory.joinpath(filename), hash_filenames
-    )
-    return str(cached_loc.resolve())
-
-
-class CSVDataBackend(BaseDataBackend):
-    def __init__(
-        self,
-        accelerator,
-        id: str,
-        csv_file: Path,
-        compress_cache: bool = False,
-        url_column: str = "url",
-        caption_column: str = "caption",
-        image_cache_loc: Optional[str] = None,
-        hash_filenames: bool = True,
-    ):
-        self.id = id
-        self.type = "csv"
-        self.compress_cache = compress_cache
-        self.hash_filenames = hash_filenames
-        self.csv_file = csv_file
-        self.accelerator = accelerator
-        self.url_column = url_column
-        self.df = pd.read_csv(csv_file, index_col=url_column)
-        self.df = self.df.groupby(level=0).last()  # deduplicate by index (image loc)
-        self.caption_column = caption_column
-        self.image_cache_loc = (
-            Path(image_cache_loc) if image_cache_loc is not None else None
-        )
-
-    def read(self, location, as_byteIO: bool = False):
-        """Read and return the content of the file."""
-        already_hashed = False
-        if isinstance(location, Path):
-            location = str(location.resolve())
-        if location.startswith("http"):
-            if self.image_cache_loc is not None:
-                # check for cache
-                cached_loc = html_to_file_loc(
-                    self.image_cache_loc,
-                    location,
-                    self.hash_filenames,
-                )
-                if os.path.exists(cached_loc):
-                    # found cache
-                    location = cached_loc
-                    already_hashed = True
-                else:
-                    # actually go to website
-                    data = requests.get(location, stream=True).raw.data
-                    with open(cached_loc, "wb") as f:
-                        f.write(data)
-            else:
-                data = requests.get(location, stream=True).raw.data
-        if not location.startswith("http"):
-            # read from local file
-            hashed_location = path_to_hashed_path(
-                location, hash_filenames=self.hash_filenames and not already_hashed
-            )
-            try:
-                with open(hashed_location, "rb") as file:
-                    data = file.read()
-            except FileNotFoundError as e:
-                tqdm.write(f"ask was for file {location} bound to {hashed_location}")
-                raise e
-        if not as_byteIO:
-            return data
-        return BytesIO(data)
-
-    def write(self, filepath: Union[str, Path], data: Any) -> None:
-        """Write the provided data to the specified filepath."""
-        if isinstance(filepath, str):
-            assert not filepath.startswith(
-                "http"
-            ), f"writing to {filepath} is not allowed as it has http in it"
-            filepath = Path(filepath)
-        # Not a huge fan of auto-shortening filenames, as we hash things for that in other cases.
-        # However, this is copied in from the original Arcade-AI CSV backend implementation for compatibility.
-        filepath = path_to_hashed_path(filepath, self.hash_filenames)
-        filepath.parent.mkdir(parents=True, exist_ok=True)
-        with open(filepath, "wb") as file:
-            # Check if data is a Tensor, and if so, save it appropriately
-            if isinstance(data, torch.Tensor):
-                # logger.debug(f"Writing a torch file to disk.")
-                return self.torch_save(data, file)
-            if isinstance(data, str):
-                # logger.debug(f"Writing a string to disk as {filepath}: {data}")
-                data = data.encode("utf-8")
-            else:
-                logger.debug(
-                    f"Received an unknown data type to write to disk. Doing our best: {type(data)}"
-                )
-            file.write(data)
-
-    def delete(self, filepath):
-        """Delete the specified file."""
-        if filepath in self.df.index:
-            self.df.drop(filepath, inplace=True)
-            # self.save_state()
-        filepath = path_to_hashed_path(filepath, self.hash_filenames)
-        if os.path.exists(filepath):
-            logger.debug(f"Deleting file: {filepath}")
-            os.remove(filepath)
-        # Validate that we deleted it correctly.
-        if self.exists(filepath) or filepath in self.df.index:
-            raise Exception(f"Failed to delete {filepath}")
-
-    def exists(self, filepath):
-        """Check if the file exists."""
-        if isinstance(filepath, str) and "http" in filepath:
-            return filepath in self.df.index
-        else:
-            filepath = path_to_hashed_path(filepath, self.hash_filenames)
-            return os.path.exists(filepath)
-
-    def open_file(self, filepath, mode):
-        """Open the file in the specified mode."""
-        return open(path_to_hashed_path(filepath, self.hash_filenames), mode)
-
-    def list_files(
-        self, file_extensions: list = None, instance_data_dir: str = None
-    ) -> tuple:
-        """
-        List all files matching the file extensions.
-        Creates Path objects of each file found.
-        """
-        logger.debug(
-            f"CSVDataBackend.list_files: file_extensions={file_extensions}, instance_data_dir={instance_data_dir}"
-        )
-
-        if instance_data_dir is None:
-            filtered_paths = set(self.df.index)
-            filtered_ids = set(filtered_paths)
-        else:
-            # Convert file extensions to patterns
-            if file_extensions:
-                patterns = [f"*.{ext.lower()}" for ext in file_extensions]
-            else:
-                patterns = ["*"]
-
-            filtered_ids = set()
-            for pattern in patterns:
-                filtered_ids.update(
-                    filter(lambda id: fnmatch.fnmatch(id, pattern), list(self.df.index))
-                )
-
-            filtered_paths = set(
-                filter(lambda id: "http" not in id and os.path.exists(id), filtered_ids)
-            )
-
-        # Group files by their parent directory
-        path_dict = {}
-        for path in filtered_paths:
-            if hasattr(path, "parent"):
-                parent = str(Path(path).parent)
-                if parent not in path_dict:
-                    path_dict[parent] = []
-                path_dict[parent].append(str(Path(path).absolute()))
-            else:
-                if "/" not in path_dict:
-                    path_dict["/"] = []
-                if os.path.splitext(str(path))[1] not in [".json", ".csv", ".parquet"]:
-                    path_dict["/"].append(str(path))
-
-        results = [(subdir, [], files) for subdir, files in path_dict.items()]
-        results += [("", [], filtered_ids - filtered_paths)]
-        return results
-
-    def read_image(self, filepath: str, delete_problematic_images: bool = False):
-        # Remove embedded null byte:
-        if isinstance(filepath, str):
-            filepath = filepath.replace("\x00", "")
-        try:
-            image_data = self.read(filepath, as_byteIO=True)
-            image = load_image(image_data)
-            return image
-        except Exception as e:
-            import traceback
-
-            logger.error(
-                f"Encountered error opening image {filepath}: {e}, traceback: {traceback.format_exc()}"
-            )
-            if delete_problematic_images:
-                logger.error(
-                    "Deleting image, because --delete_problematic_images is provided."
-                )
-                self.delete(filepath)
-            else:
-                exit(1)
-                raise e
-
-    def read_image_batch(
-        self, filepaths: list, delete_problematic_images: bool = False
-    ) -> list:
-        """Read a batch of images from the specified filepaths."""
-        if type(filepaths) != list:
-            raise ValueError(
-                f"read_image_batch must be given a list of image filepaths. we received: {filepaths}"
-            )
-        output_images = []
-        available_keys = []
-        for filepath in filepaths:
-            try:
-                image_data = self.read_image(filepath, delete_problematic_images)
-                if image_data is None:
-                    logger.warning(f"Unable to load image '{filepath}', skipping.")
-                    continue
-                output_images.append(image_data)
-                available_keys.append(filepath)
-            except Exception as e:
-                if delete_problematic_images:
-                    logger.error(
-                        f"Deleting image '{filepath}', because --delete_problematic_images is provided. Error: {e}"
-                    )
-                else:
-                    logger.warning(
-                        f"A problematic image {filepath} is detected, but we are not allowed to remove it, because --delete_problematic_image is not provided."
-                        f" Please correct this manually. Error: {e}"
-                    )
-        return (available_keys, output_images)
-
-    def create_directory(self, directory_path):
-        if os.path.exists(directory_path):
-            return
-        logger.debug(f"Creating directory: {directory_path}")
-        os.makedirs(directory_path, exist_ok=True)
-
-    def torch_load(self, filename):
-        """
-        Load a torch tensor from a file.
-        """
-
-        stored_tensor = self.read(filename, as_byteIO=True)
-
-        if self.compress_cache:
-            try:
-                stored_tensor = self._decompress_torch(stored_tensor)
-            except Exception as e:
-                logger.error(
-                    f"Failed to decompress torch file, falling back to passthrough: {e}"
-                )
-        if hasattr(stored_tensor, "seek"):
-            stored_tensor.seek(0)
-        try:
-            loaded_tensor = torch.load(stored_tensor, map_location="cpu")
-        except Exception as e:
-            logger.error(f"Failed to load corrupt torch file '{filename}': {e}")
-            if "invalid load key" in str(e):
-                self.delete(filename)
-            raise e
-        return loaded_tensor
-
-    def torch_save(self, data, location: Union[str, Path, BytesIO]):
-        """
-        Save a torch tensor to a file.
-        """
-
-        if isinstance(location, str) or isinstance(location, Path):
-            location = path_to_hashed_path(location, self.hash_filenames)
-            location = self.open_file(location, "wb")
-
-        if self.compress_cache:
-            compressed_data = self._compress_torch(data)
-            location.write(compressed_data)
-        else:
-            torch.save(data, location)
-        location.close()
-
-    def write_batch(self, filepaths: list, data_list: list) -> None:
-        """Write a batch of data to the specified filepaths."""
-        for filepath, data in zip(filepaths, data_list):
-            self.write(filepath, data)
-
-    def save_state(self):
-        self.df.to_csv(self.csv_file, index_label=self.url_column)
-
-    def get_caption(self, image_path: str) -> str:
-        if self.caption_column is None:
-            raise ValueError("Cannot retrieve caption from csv, as one is not set.")
-        return self.df.loc[image_path, self.caption_column]
diff --git a/videotuna/third_party/flux/data_backend/factory.py b/videotuna/third_party/flux/data_backend/factory.py
deleted file mode 100644
index 15fe437d..00000000
--- a/videotuna/third_party/flux/data_backend/factory.py
+++ /dev/null
@@ -1,1393 +0,0 @@
-import io
-import json
-import logging
-import os
-import queue
-import threading
-import time
-from math import sqrt
-
-import torch
-from tqdm import tqdm
-
-from videotuna.third_party.flux.caching.text_embeds import TextEmbeddingCache
-from videotuna.third_party.flux.caching.vae import VAECache
-from videotuna.third_party.flux.data_backend.aws import S3DataBackend
-from videotuna.third_party.flux.data_backend.base import BaseDataBackend
-from videotuna.third_party.flux.data_backend.csv_url_list import CSVDataBackend
-from videotuna.third_party.flux.data_backend.local import LocalDataBackend
-from videotuna.third_party.flux.multiaspect.dataset import MultiAspectDataset
-from videotuna.third_party.flux.multiaspect.sampler import MultiAspectSampler
-from videotuna.third_party.flux.prompts import PromptHandler
-from videotuna.third_party.flux.training.collate import collate_fn
-from videotuna.third_party.flux.training.default_settings import (
-    default,
-    latest_config_version,
-)
-from videotuna.third_party.flux.training.exceptions import MultiDatasetExhausted
-from videotuna.third_party.flux.training.multi_process import _get_rank as get_rank
-from videotuna.third_party.flux.training.multi_process import rank_info, should_log
-from videotuna.third_party.flux.training.state_tracker import StateTracker
-
-logger = logging.getLogger("DataBackendFactory")
-if should_log():
-    logger.setLevel(os.environ.get("SIMPLETUNER_LOG_LEVEL", "INFO"))
-else:
-    logger.setLevel(logging.ERROR)
-prefetch_log = logging.getLogger("DataBackendPrefetch")
-if should_log():
-    prefetch_log.setLevel(os.environ.get("SIMPLETUNER_PREFETCH_LOG_LEVEL", "INFO"))
-else:
-    prefetch_log.setLevel(logging.ERROR)
-
-# For prefetching.
-
-
-def prefetch_log_debug(message):
-    prefetch_log.debug(f"{rank_info()} {message}")
-
-
-def info_log(message):
-    if StateTracker.get_accelerator().is_main_process:
-        logger.info(message)
-
-
-def init_backend_config(backend: dict, args: dict, accelerator) -> dict:
-    output = {"id": backend["id"], "config": {}}
-    if backend.get("dataset_type", None) == "text_embeds":
-        if "caption_filter_list" in backend:
-            output["config"]["caption_filter_list"] = backend["caption_filter_list"]
-        output["dataset_type"] = "text_embeds"
-
-        return output
-    elif backend.get("dataset_type", None) == "image_embeds":
-        # no overrides for image embed backends
-        return output
-    else:
-        ## Check for settings we shouldn't have for non-text datasets.
-        if "caption_filter_list" in backend:
-            raise ValueError(
-                f"caption_filter_list is only a valid setting for text datasets. It is currently set for the {backend.get('dataset_type', 'image')} dataset {backend['id']}."
-            )
-
-    # Image backend config
-    output["dataset_type"] = backend.get("dataset_type", "image")
-    choices = ["image", "conditioning"]
-    if (
-        StateTracker.get_args().controlnet
-        and output["dataset_type"] == "image"
-        and backend.get("conditioning_data", None) is None
-    ):
-        raise ValueError(
-            "Image datasets require a corresponding conditioning_data set configured in your dataloader."
-        )
-    if output["dataset_type"] not in choices:
-        raise ValueError(f"(id={backend['id']}) dataset_type must be one of {choices}.")
-    if "vae_cache_clear_each_epoch" in backend:
-        output["config"]["vae_cache_clear_each_epoch"] = backend[
-            "vae_cache_clear_each_epoch"
-        ]
-    if "probability" in backend:
-        output["config"]["probability"] = (
-            float(backend["probability"]) if backend["probability"] else 1.0
-        )
-    if "ignore_epochs" in backend:
-        logger.error(
-            "ignore_epochs is deprecated, and will do nothing. This can be safely removed from your configuration."
-        )
-    if "repeats" in backend:
-        output["config"]["repeats"] = (
-            int(backend["repeats"]) if backend["repeats"] else 0
-        )
-    if "crop" in backend:
-        output["config"]["crop"] = backend["crop"]
-    else:
-        output["config"]["crop"] = False
-    if backend.get("type") == "csv":
-        if "csv_cache_dir" in backend:
-            output["config"]["csv_cache_dir"] = backend["csv_cache_dir"]
-        if "csv_file" in backend:
-            output["config"]["csv_file"] = backend["csv_file"]
-        if "csv_caption_column" in backend:
-            output["config"]["csv_caption_column"] = backend["csv_caption_column"]
-        if "csv_url_column" in backend:
-            output["config"]["csv_url_column"] = backend["csv_url_column"]
-    if "crop_aspect" in backend:
-        choices = ["square", "preserve", "random", "closest"]
-        if backend.get("crop_aspect", None) not in choices:
-            raise ValueError(
-                f"(id={backend['id']}) crop_aspect must be one of {choices}."
-            )
-        output["config"]["crop_aspect"] = backend["crop_aspect"]
-        if (
-            output["config"]["crop_aspect"] == "random"
-            or output["config"]["crop_aspect"] == "closest"
-        ):
-            if "crop_aspect_buckets" not in backend or not isinstance(
-                backend["crop_aspect_buckets"], list
-            ):
-                raise ValueError(
-                    f"(id={backend['id']}) crop_aspect_buckets must be provided when crop_aspect is set to 'random'."
-                    " This should be a list of float values or a list of dictionaries following the format: {'aspect_bucket': float, 'weight': float}."
-                    " The weight represents how likely this bucket is to be chosen, and all weights should add up to 1.0 collectively."
-                )
-            for bucket in backend.get("crop_aspect_buckets"):
-                if type(bucket) not in [float, int, dict]:
-                    raise ValueError(
-                        f"(id={backend['id']}) crop_aspect_buckets must be a list of float values or a list of dictionaries following the format: {'aspect_bucket': float, 'weight': float}."
-                        " The weight represents how likely this bucket is to be chosen, and all weights should add up to 1.0 collectively."
-                    )
-
-        output["config"]["crop_aspect_buckets"] = backend.get("crop_aspect_buckets")
-    else:
-        output["config"]["crop_aspect"] = "square"
-    if "crop_style" in backend:
-        crop_styles = ["random", "corner", "center", "centre", "face"]
-        if backend["crop_style"] not in crop_styles:
-            raise ValueError(
-                f"(id={backend['id']}) crop_style must be one of {crop_styles}."
-            )
-        output["config"]["crop_style"] = backend["crop_style"]
-    else:
-        output["config"]["crop_style"] = "random"
-    output["config"]["disable_validation"] = backend.get("disable_validation", False)
-    if "resolution" in backend:
-        output["config"]["resolution"] = backend["resolution"]
-    else:
-        output["config"]["resolution"] = args.resolution
-    if "resolution_type" in backend:
-        output["config"]["resolution_type"] = backend["resolution_type"]
-    else:
-        output["config"]["resolution_type"] = args.resolution_type
-    if "parquet" in backend:
-        output["config"]["parquet"] = backend["parquet"]
-    if "caption_strategy" in backend:
-        output["config"]["caption_strategy"] = backend["caption_strategy"]
-    else:
-        output["config"]["caption_strategy"] = args.caption_strategy
-    output["config"]["instance_data_dir"] = backend.get(
-        "instance_data_dir", backend.get("aws_data_prefix", "")
-    )
-    if "hash_filenames" in backend:
-        output["config"]["hash_filenames"] = backend["hash_filenames"]
-    if "hash_filenames" in backend and backend.get("type") == "csv":
-        output["config"]["hash_filenames"] = backend["hash_filenames"]
-
-    # check if caption_strategy=parquet with metadata_backend=json
-    current_metadata_backend_type = backend.get("metadata_backend", "discovery")
-    if output["config"]["caption_strategy"] == "parquet" and (
-        current_metadata_backend_type == "json"
-        or current_metadata_backend_type == "discovery"
-    ):
-        raise ValueError(
-            f"(id={backend['id']}) Cannot use caption_strategy=parquet with metadata_backend={current_metadata_backend_type}. Instead, it is recommended to use the textfile strategy and extract your captions into txt files."
-        )
-
-    maximum_image_size = backend.get("maximum_image_size", args.maximum_image_size)
-    target_downsample_size = backend.get(
-        "target_downsample_size", args.target_downsample_size
-    )
-    output["config"]["maximum_image_size"] = maximum_image_size
-    output["config"]["target_downsample_size"] = target_downsample_size
-
-    if maximum_image_size and not target_downsample_size:
-        raise ValueError(
-            "When a data backend is configured to use `maximum_image_size`, you must also provide a value for `target_downsample_size`."
-        )
-    if (
-        maximum_image_size
-        and output["config"]["resolution_type"] == "area"
-        and maximum_image_size > 10
-        and not os.environ.get("SIMPLETUNER_MAXIMUM_IMAGE_SIZE_OVERRIDE", False)
-    ):
-        raise ValueError(
-            f"When a data backend is configured to use `'resolution_type':area`, `maximum_image_size` must be less than 10 megapixels. You may have accidentally entered {maximum_image_size} pixels, instead of megapixels."
-        )
-    elif (
-        maximum_image_size
-        and output["config"]["resolution_type"] == "pixel"
-        and maximum_image_size < 512
-        and "deepfloyd" not in args.model_type
-        and args.model_family != "smoldit"
-    ):
-        raise ValueError(
-            f"When a data backend is configured to use `'resolution_type':pixel`, `maximum_image_size` must be at least 512 pixels. You may have accidentally entered {maximum_image_size} megapixels, instead of pixels."
-        )
-    if (
-        target_downsample_size
-        and output["config"]["resolution_type"] == "area"
-        and target_downsample_size > 10
-        and not os.environ.get("SIMPLETUNER_MAXIMUM_IMAGE_SIZE_OVERRIDE", False)
-    ):
-        raise ValueError(
-            f"When a data backend is configured to use `'resolution_type':area`, `target_downsample_size` must be less than 10 megapixels. You may have accidentally entered {target_downsample_size} pixels, instead of megapixels."
-        )
-    elif (
-        target_downsample_size
-        and output["config"]["resolution_type"] == "pixel"
-        and target_downsample_size < 512
-        and "deepfloyd" not in args.model_type
-        and args.model_family != "smoldit"
-    ):
-        raise ValueError(
-            f"When a data backend is configured to use `'resolution_type':pixel`, `target_downsample_size` must be at least 512 pixels. You may have accidentally entered {target_downsample_size} megapixels, instead of pixels."
-        )
-
-    return output
-
-
-def print_bucket_info(metadata_backend):
-    # Print table header
-    if get_rank() == 0:
-        tqdm.write(f"{rank_info()} | {'Bucket':<10} | {'Image Count (per-GPU)':<12}")
-
-        # Print separator
-        tqdm.write("-" * 30)
-
-        # Print each bucket's information
-        for bucket in metadata_backend.aspect_ratio_bucket_indices:
-            image_count = len(metadata_backend.aspect_ratio_bucket_indices[bucket])
-            if image_count == 0:
-                continue
-            tqdm.write(f"{rank_info()} | {bucket:<10} | {image_count:<12}")
-
-
-def configure_parquet_database(backend: dict, args, data_backend: BaseDataBackend):
-    """When given a backend config dictionary, configure a parquet database."""
-    parquet_config = backend.get("parquet", None)
-    if not parquet_config:
-        raise ValueError(
-            "Parquet backend must have a 'parquet' field in the backend config containing required fields for configuration."
-        )
-    parquet_path = parquet_config.get("path", None)
-    if not parquet_path:
-        raise ValueError(
-            "Parquet backend must have a 'path' field in the backend config under the 'parquet' key."
-        )
-    if not data_backend.exists(parquet_path):
-        raise FileNotFoundError(f"Parquet file {parquet_path} not found.")
-    # Load the dataframe
-    import pandas as pd
-
-    bytes_string = data_backend.read(parquet_path)
-    pq = io.BytesIO(bytes_string)
-    if parquet_path.endswith(".jsonl"):
-        df = pd.read_json(pq, lines=True)
-    else:
-        df = pd.read_parquet(pq)
-
-    caption_column = parquet_config.get(
-        "caption_column", args.parquet_caption_column or "description"
-    )
-    fallback_caption_column = parquet_config.get("fallback_caption_column", None)
-    filename_column = parquet_config.get(
-        "filename_column", args.parquet_filename_column or "id"
-    )
-    identifier_includes_extension = parquet_config.get(
-        "identifier_includes_extension", False
-    )
-
-    # Check the columns exist
-    if caption_column not in df.columns:
-        raise ValueError(
-            f"Parquet file {parquet_path} does not contain a column named '{caption_column}'."
-        )
-    if filename_column not in df.columns:
-        raise ValueError(
-            f"Parquet file {parquet_path} does not contain a column named '{filename_column}'."
-        )
-    # Check for null values
-    if df[caption_column].isnull().values.any() and not fallback_caption_column:
-        raise ValueError(
-            f"Parquet file {parquet_path} contains null values in the '{caption_column}' column, but no fallback_caption_column was set."
-        )
-    if df[filename_column].isnull().values.any():
-        raise ValueError(
-            f"Parquet file {parquet_path} contains null values in the '{filename_column}' column."
-        )
-    # Check for empty strings
-    if (df[caption_column] == "").sum() > 0 and not fallback_caption_column:
-        raise ValueError(
-            f"Parquet file {parquet_path} contains empty strings in the '{caption_column}' column."
-        )
-    if (df[filename_column] == "").sum() > 0:
-        raise ValueError(
-            f"Parquet file {parquet_path} contains empty strings in the '{filename_column}' column."
-        )
-    # Store the database in StateTracker
-    StateTracker.set_parquet_database(
-        backend["id"],
-        (
-            df,
-            filename_column,
-            caption_column,
-            fallback_caption_column,
-            identifier_includes_extension,
-        ),
-    )
-    info_log(
-        f"Configured parquet database for backend {backend['id']}. Caption column: {caption_column}. Filename column: {filename_column}."
-    )
-
-
-def configure_multi_databackend(args: dict, accelerator, text_encoders, tokenizers):
-    """
-    Configure a multiple dataloaders based on the provided commandline args.
-    """
-    StateTracker.clear_data_backends()
-    logger.setLevel(
-        os.environ.get(
-            "SIMPLETUNER_LOG_LEVEL", "INFO" if accelerator.is_main_process else "ERROR"
-        )
-    )
-    if args.data_backend_config is None:
-        raise ValueError(
-            "Must provide a data backend config file via --data_backend_config"
-        )
-    if not os.path.exists(args.data_backend_config):
-        raise FileNotFoundError(
-            f"Data backend config file {args.data_backend_config} not found."
-        )
-    info_log(f"Loading data backend config from {args.data_backend_config}")
-    with open(args.data_backend_config, "r", encoding="utf-8") as f:
-        data_backend_config = json.load(f)
-    if len(data_backend_config) == 0:
-        raise ValueError(
-            "Must provide at least one data backend in the data backend config file."
-        )
-
-    text_embed_backends = {}
-    image_embed_backends = {}
-
-    ###                                            ###
-    #    now we configure the text embed backends    #
-    ###                                            ###
-    default_text_embed_backend_id = None
-    text_embed_cache_dir_paths = []
-    for backend in data_backend_config:
-        dataset_type = backend.get("dataset_type", None)
-        if dataset_type is None or dataset_type != "text_embeds":
-            # Skip configuration of image data backends. It is done later.
-            continue
-        if ("disabled" in backend and backend["disabled"]) or (
-            "disable" in backend and backend["disable"]
-        ):
-            info_log(f"Skipping disabled data backend {backend['id']} in config file.")
-            continue
-
-        info_log(f'Configuring text embed backend: {backend["id"]}')
-        if backend.get("default", None):
-            if default_text_embed_backend_id is not None:
-                raise ValueError(
-                    "Only one text embed backend can be marked as default."
-                )
-            default_text_embed_backend_id = backend["id"]
-        # Retrieve some config file overrides for commandline arguments,
-        #  there currently isn't much for text embeds.
-        init_backend = init_backend_config(backend, args, accelerator)
-        StateTracker.set_data_backend_config(init_backend["id"], init_backend["config"])
-        if backend["type"] == "local":
-            text_embed_cache_dir_paths.append(
-                backend.get("cache_dir", args.cache_dir_text)
-            )
-            init_backend["data_backend"] = get_local_backend(
-                accelerator, init_backend["id"], compress_cache=args.compress_disk_cache
-            )
-            init_backend["cache_dir"] = backend["cache_dir"]
-        elif backend["type"] == "aws":
-            check_aws_config(backend)
-            init_backend["data_backend"] = get_aws_backend(
-                identifier=init_backend["id"],
-                aws_bucket_name=backend["aws_bucket_name"],
-                aws_region_name=backend["aws_region_name"],
-                aws_endpoint_url=backend["aws_endpoint_url"],
-                aws_access_key_id=backend["aws_access_key_id"],
-                aws_secret_access_key=backend["aws_secret_access_key"],
-                accelerator=accelerator,
-                max_pool_connections=backend.get(
-                    "max_pool_connections", args.aws_max_pool_connections
-                ),
-            )
-            # S3 buckets use the aws_data_prefix as their prefix/ for all data.
-            # Ensure we have a trailing slash on the prefix:
-            init_backend["cache_dir"] = backend.get(
-                "aws_data_prefix", backend.get("cache_dir", args.cache_dir_text)
-            )
-        elif backend["type"] == "csv":
-            raise ValueError("Cannot use CSV backend for text embed storage.")
-        else:
-            raise ValueError(f"Unknown data backend type: {backend['type']}")
-
-        preserve_data_backend_cache = backend.get("preserve_data_backend_cache", False)
-        if not preserve_data_backend_cache and accelerator.is_local_main_process:
-            StateTracker.delete_cache_files(
-                data_backend_id=init_backend["id"],
-                preserve_data_backend_cache=preserve_data_backend_cache,
-            )
-
-        # Generate a TextEmbeddingCache object
-        init_backend["text_embed_cache"] = TextEmbeddingCache(
-            id=init_backend["id"],
-            data_backend=init_backend["data_backend"],
-            text_encoders=text_encoders,
-            tokenizers=tokenizers,
-            accelerator=accelerator,
-            cache_dir=init_backend.get("cache_dir", args.cache_dir_text),
-            model_type=StateTracker.get_model_family(),
-            write_batch_size=backend.get("write_batch_size", args.write_batch_size),
-        )
-        init_backend["text_embed_cache"].set_webhook_handler(
-            StateTracker.get_webhook_handler()
-        )
-        with accelerator.main_process_first():
-            init_backend["text_embed_cache"].discover_all_files()
-        accelerator.wait_for_everyone()
-
-        if backend.get("default", False):
-            # The default embed cache will be used for eg. validation prompts.
-            StateTracker.set_default_text_embed_cache(init_backend["text_embed_cache"])
-            logger.debug(f"Set the default text embed cache to {init_backend['id']}.")
-            # We will compute the null embedding for caption dropout here.
-            info_log("Pre-computing null embedding")
-            with accelerator.main_process_first():
-                init_backend["text_embed_cache"].compute_embeddings_for_prompts(
-                    [""], return_concat=False, load_from_cache=False
-                )
-            time.sleep(5)
-            accelerator.wait_for_everyone()
-        if args.caption_dropout_probability == 0.0:
-            logger.warning(
-                "Not using caption dropout will potentially lead to overfitting on captions, eg. CFG will not work very well. Set --caption_dropout_probability=0.1 as a recommended value."
-            )
-
-        # We don't compute the text embeds at this time, because we do not really have any captions available yet.
-        text_embed_backends[init_backend["id"]] = init_backend
-
-    if not text_embed_backends:
-        raise ValueError(
-            "Your dataloader config must contain at least one image dataset AND at least one text_embed dataset."
-            " See this link for more information about dataset_type: https://github.com/bghira/SimpleTuner/blob/main/documentation/DATALOADER.md#configuration-options"
-        )
-    if not default_text_embed_backend_id and len(text_embed_backends) > 1:
-        raise ValueError(
-            f"You have {len(text_embed_backends)} text_embed dataset{'s' if len(text_embed_backends) > 1 else ''}, but no default text embed was defined."
-            "\nPlease set default: true on one of the text_embed datasets, as this will be the location of global embeds (validation prompts, etc)."
-            "\nSee this link for more information on how to configure a default text embed dataset: https://github.com/bghira/SimpleTuner/blob/main/documentation/DATALOADER.md#configuration-options"
-        )
-    elif not default_text_embed_backend_id:
-        logger.warning(
-            f"No default text embed was defined, using {list(text_embed_backends.keys())[0]} as the default."
-            " See this page for information about the default text embed backend: https://github.com/bghira/SimpleTuner/blob/main/documentation/DATALOADER.md#configuration-options"
-        )
-        default_text_embed_backend_id = list(text_embed_backends.keys())[0]
-    info_log("Completed loading text embed services.")
-
-    ###                                             ###
-    #    now we configure the image embed backends    #
-    ###                                             ###
-    for backend in data_backend_config:
-        dataset_type = backend.get("dataset_type", None)
-        if dataset_type is None or dataset_type != "image_embeds":
-            continue
-        if ("disabled" in backend and backend["disabled"]) or (
-            "disable" in backend and backend["disable"]
-        ):
-            info_log(f"Skipping disabled data backend {backend['id']} in config file.")
-            continue
-
-        info_log(f'Configuring VAE image embed backend: {backend["id"]}')
-        # Retrieve some config file overrides for commandline arguments,
-        #  there currently isn't much for text embeds.
-        init_backend = init_backend_config(backend, args, accelerator)
-        existing_config = StateTracker.get_data_backend_config(init_backend["id"])
-        if existing_config is not None and existing_config != {}:
-            raise ValueError(
-                f"You can only have one backend named {init_backend['id']}"
-            )
-        StateTracker.set_data_backend_config(init_backend["id"], init_backend["config"])
-        if backend["type"] == "local":
-            init_backend["data_backend"] = get_local_backend(
-                accelerator, init_backend["id"], compress_cache=args.compress_disk_cache
-            )
-        elif backend["type"] == "aws":
-            check_aws_config(backend)
-            init_backend["data_backend"] = get_aws_backend(
-                identifier=init_backend["id"],
-                aws_bucket_name=backend["aws_bucket_name"],
-                aws_region_name=backend["aws_region_name"],
-                aws_endpoint_url=backend["aws_endpoint_url"],
-                aws_access_key_id=backend["aws_access_key_id"],
-                aws_secret_access_key=backend["aws_secret_access_key"],
-                accelerator=accelerator,
-                max_pool_connections=backend.get(
-                    "max_pool_connections", args.aws_max_pool_connections
-                ),
-            )
-            # S3 buckets use the aws_data_prefix as their prefix/ for all data.
-            # Ensure we have a trailing slash on the prefix:
-            init_backend["cache_dir"] = backend.get("aws_data_prefix", None)
-        elif backend["type"] == "csv":
-            raise ValueError("Cannot use CSV backend for image embed storage.")
-        else:
-            raise ValueError(f"Unknown data backend type: {backend['type']}")
-
-        preserve_data_backend_cache = backend.get("preserve_data_backend_cache", False)
-        if not preserve_data_backend_cache and accelerator.is_local_main_process:
-            StateTracker.delete_cache_files(
-                data_backend_id=init_backend["id"],
-                preserve_data_backend_cache=preserve_data_backend_cache,
-            )
-
-        image_embed_backends[init_backend["id"]] = init_backend
-
-    ###                                       ###
-    #    now we configure the image backends    #
-    ###                                       ###
-    vae_cache_dir_paths = []  # tracking for duplicates
-    for backend in data_backend_config:
-        dataset_type = backend.get("dataset_type", None)
-        if dataset_type is not None and (
-            dataset_type != "image" and dataset_type != "conditioning"
-        ):
-            # Skip configuration of text embed backends. It is done earlier.
-            continue
-        if ("disabled" in backend and backend["disabled"]) or (
-            "disable" in backend and backend["disable"]
-        ):
-            info_log(f"Skipping disabled data backend {backend['id']} in config file.")
-            continue
-        # For each backend, we will create a dict to store all of its components in.
-        if (
-            "id" not in backend
-            or backend["id"] == ""
-            or backend["id"] in StateTracker.get_data_backends()
-        ):
-            raise ValueError("Each dataset needs a unique 'id' field.")
-        info_log(f"Configuring data backend: {backend['id']}")
-        conditioning_type = backend.get("conditioning_type")
-        if (
-            backend.get("dataset_type") == "conditioning"
-            or conditioning_type is not None
-        ):
-            backend["dataset_type"] = "conditioning"
-        resolution_type = backend.get("resolution_type", args.resolution_type)
-        if resolution_type == "pixel_area":
-            pixel_edge_length = backend.get("resolution", int(args.resolution))
-            if pixel_edge_length is None or (
-                type(pixel_edge_length) is not int
-                or not str(pixel_edge_length).isdigit()
-            ):
-                raise ValueError(
-                    f"Resolution type 'pixel_area' requires a 'resolution' field to be set in the backend config using an integer in the format: 1024, but {pixel_edge_length} was given"
-                )
-            # we'll convert pixel_area to area
-            backend["resolution_type"] = "area"
-            backend["resolution"] = (pixel_edge_length * pixel_edge_length) / (1000**2)
-            # convert the other megapixel values.
-            if (
-                backend.get("maximum_image_size", None) is not None
-                and backend["maximum_image_size"] > 0
-            ):
-                backend["maximum_image_size"] = (
-                    backend["maximum_image_size"] * backend["maximum_image_size"]
-                ) / 1_000_000
-            if (
-                backend.get("target_downsample_size", None) is not None
-                and backend["target_downsample_size"] > 0
-            ):
-                backend["target_downsample_size"] = (
-                    backend["target_downsample_size"]
-                    * backend["target_downsample_size"]
-                ) / 1_000_000
-            if (
-                backend.get("minimum_image_size", None) is not None
-                and backend["minimum_image_size"] > 0
-            ):
-                backend["minimum_image_size"] = (
-                    backend["minimum_image_size"] * backend["minimum_image_size"]
-                ) / 1_000_000
-
-        # Retrieve some config file overrides for commandline arguments, eg. cropping
-        init_backend = init_backend_config(backend, args, accelerator)
-        StateTracker.set_data_backend_config(
-            data_backend_id=init_backend["id"],
-            config=init_backend["config"],
-        )
-
-        preserve_data_backend_cache = backend.get("preserve_data_backend_cache", False)
-        if not preserve_data_backend_cache:
-            StateTracker.delete_cache_files(
-                data_backend_id=init_backend["id"],
-                preserve_data_backend_cache=preserve_data_backend_cache,
-            )
-        StateTracker.load_aspect_resolution_map(
-            dataloader_resolution=init_backend["config"]["resolution"],
-        )
-
-        if backend["type"] == "local":
-            init_backend["data_backend"] = get_local_backend(
-                accelerator, init_backend["id"], compress_cache=args.compress_disk_cache
-            )
-            init_backend["instance_data_dir"] = backend.get(
-                "instance_data_dir", backend.get("instance_data_root")
-            )
-            if init_backend["instance_data_dir"] is None:
-                raise ValueError(
-                    "A local backend requires instance_data_dir be defined and pointing to the image data directory."
-                )
-            # Remove trailing slash
-            if (
-                init_backend["instance_data_dir"] is not None
-                and init_backend["instance_data_dir"][-1] == "/"
-            ):
-                init_backend["instance_data_dir"] = init_backend["instance_data_dir"][
-                    :-1
-                ]
-        elif backend["type"] == "aws":
-            check_aws_config(backend)
-            init_backend["data_backend"] = get_aws_backend(
-                identifier=init_backend["id"],
-                aws_bucket_name=backend["aws_bucket_name"],
-                aws_region_name=backend["aws_region_name"],
-                aws_endpoint_url=backend["aws_endpoint_url"],
-                aws_access_key_id=backend["aws_access_key_id"],
-                aws_secret_access_key=backend["aws_secret_access_key"],
-                accelerator=accelerator,
-                compress_cache=args.compress_disk_cache,
-                max_pool_connections=backend.get(
-                    "max_pool_connections", args.aws_max_pool_connections
-                ),
-            )
-            # S3 buckets use the aws_data_prefix as their prefix/ for all data.
-            init_backend["instance_data_dir"] = backend.get("aws_data_prefix", "")
-        elif backend["type"] == "csv":
-            check_csv_config(backend=backend, args=args)
-            init_backend["data_backend"] = get_csv_backend(
-                accelerator=accelerator,
-                id=backend["id"],
-                csv_file=backend["csv_file"],
-                csv_cache_dir=backend["csv_cache_dir"],
-                compress_cache=args.compress_disk_cache,
-                hash_filenames=backend.get("hash_filenames", False),
-            )
-            # init_backend["instance_data_dir"] = backend.get("instance_data_dir", backend.get("instance_data_root", backend.get("csv_cache_dir")))
-            init_backend["instance_data_dir"] = None
-            # if init_backend["instance_data_dir"] is None:
-            #     raise ValueError("CSV backend requires one of instance_data_dir, instance_data_root or csv_cache_dir to be set, as we require a location to place metadata lists.")
-            # Remove trailing slash
-            if (
-                init_backend["instance_data_dir"] is not None
-                and init_backend["instance_data_dir"][-1] == "/"
-            ):
-                init_backend["instance_data_dir"] = init_backend["instance_data_dir"][
-                    :-1
-                ]
-        else:
-            raise ValueError(f"Unknown data backend type: {backend['type']}")
-
-        # Assign a TextEmbeddingCache to this dataset. it might be undefined.
-        text_embed_id = backend.get(
-            "text_embeds",
-            backend.get("text_embed_cache", default_text_embed_backend_id),
-        )
-        if text_embed_id not in text_embed_backends:
-            raise ValueError(
-                f"Text embed backend {text_embed_id} not found in data backend config file."
-            )
-        # Do we have a specific VAE embed backend?
-        image_embed_backend_id = backend.get("image_embeds", None)
-        image_embed_data_backend = init_backend
-        if image_embed_backend_id is not None:
-            if image_embed_backend_id not in image_embed_backends:
-                raise ValueError(
-                    f"Could not find image embed backend ID in multidatabackend config: {image_embed_backend_id}"
-                )
-            image_embed_data_backend = image_embed_backends[image_embed_backend_id]
-        info_log(f"(id={init_backend['id']}) Loading bucket manager.")
-        metadata_backend_args = {}
-        metadata_backend = backend.get("metadata_backend", "discovery")
-        if metadata_backend == "json" or metadata_backend == "discovery":
-            from videotuna.third_party.flux.metadata.backends.discovery import (
-                DiscoveryMetadataBackend,
-            )
-
-            BucketManager_cls = DiscoveryMetadataBackend
-        elif metadata_backend == "parquet":
-            from videotuna.third_party.flux.metadata.backends.parquet import (
-                ParquetMetadataBackend,
-            )
-
-            BucketManager_cls = ParquetMetadataBackend
-            metadata_backend_args["parquet_config"] = backend.get("parquet", None)
-            if not metadata_backend_args["parquet_config"]:
-                raise ValueError(
-                    "Parquet metadata backend requires a 'parquet' field in the backend config containing required fields for configuration."
-                )
-        else:
-            raise ValueError(f"Unknown metadata backend type: {metadata_backend}")
-
-        init_backend["metadata_backend"] = BucketManager_cls(
-            id=init_backend["id"],
-            instance_data_dir=init_backend["instance_data_dir"],
-            data_backend=init_backend["data_backend"],
-            accelerator=accelerator,
-            resolution=backend.get("resolution", args.resolution),
-            minimum_image_size=backend.get(
-                "minimum_image_size", args.minimum_image_size
-            ),
-            resolution_type=backend.get("resolution_type", args.resolution_type),
-            batch_size=args.train_batch_size,
-            metadata_update_interval=backend.get(
-                "metadata_update_interval", args.metadata_update_interval
-            ),
-            cache_file=os.path.join(
-                backend.get(
-                    "instance_data_dir",
-                    backend.get("csv_cache_dir", backend.get("aws_data_prefix", "")),
-                ),
-                "aspect_ratio_bucket_indices",
-            ),
-            metadata_file=os.path.join(
-                backend.get(
-                    "instance_data_dir",
-                    backend.get("csv_cache_dir", backend.get("aws_data_prefix", "")),
-                ),
-                "aspect_ratio_bucket_metadata",
-            ),
-            delete_problematic_images=args.delete_problematic_images or False,
-            delete_unwanted_images=backend.get(
-                "delete_unwanted_images", args.delete_unwanted_images
-            ),
-            cache_file_suffix=backend.get("cache_file_suffix", init_backend["id"]),
-            repeats=init_backend["config"].get("repeats", 0),
-            **metadata_backend_args,
-        )
-
-        if (
-            "aspect" not in args.skip_file_discovery
-            and "aspect" not in backend.get("skip_file_discovery", "")
-            and conditioning_type not in ["mask", "controlnet"]
-        ):
-            if accelerator.is_local_main_process:
-                info_log(
-                    f"(id={init_backend['id']}) Refreshing aspect buckets on main process."
-                )
-                init_backend["metadata_backend"].refresh_buckets(rank_info())
-        accelerator.wait_for_everyone()
-        if not accelerator.is_main_process:
-            info_log(
-                f"(id={init_backend['id']}) Reloading bucket manager cache on subprocesses."
-            )
-            init_backend["metadata_backend"].reload_cache()
-        accelerator.wait_for_everyone()
-        if init_backend["metadata_backend"].has_single_underfilled_bucket():
-            raise Exception(
-                f"Cannot train using a dataset that has a single bucket with fewer than {args.train_batch_size} images."
-                f" You have to reduce your batch size, or increase your dataset size (id={init_backend['id']})."
-            )
-        # Now split the contents of these buckets between all processes
-        init_backend["metadata_backend"].split_buckets_between_processes(
-            gradient_accumulation_steps=args.gradient_accumulation_steps,
-        )
-
-        # Check if there is an existing 'config' in the metadata_backend.config
-        excluded_keys = [
-            "probability",
-            "repeats",
-            "ignore_epochs",
-            "caption_filter_list",
-            "vae_cache_clear_each_epoch",
-            "caption_strategy",
-            "maximum_image_size",
-            "target_downsample_size",
-            "parquet",
-        ]
-        # we will set the latest version by default.
-        current_config_version = latest_config_version()
-        if init_backend["metadata_backend"].config != {}:
-            prev_config = init_backend["metadata_backend"].config
-            # if the prev config used an old default config version, we will update defaults here.
-            current_config_version = prev_config.get("config_version", None)
-            if current_config_version is None:
-                # backwards compatibility for non-versioned config files, so that we do not enable life-changing options.
-                current_config_version = 1
-
-            logger.debug(
-                f"Found existing config (version={current_config_version}): {prev_config}"
-            )
-            logger.debug(f"Comparing against new config: {init_backend['config']}")
-            # Check if any values differ between the 'backend' values and the 'config' values:
-            for key, _ in prev_config.items():
-                logger.debug(f"Checking config key: {key}")
-                if key not in excluded_keys:
-                    if key in backend and prev_config[key] != backend[key]:
-                        # if not args.override_dataset_config:
-                        #     raise Exception(
-                        #         f"Dataset {init_backend['id']} has inconsistent config, and --override_dataset_config was not provided."
-                        #         f"\n-> Expected value {key}={prev_config.get(key)} differs from current value={backend.get(key)}."
-                        #         f"\n-> Recommended action is to correct the current config values to match the values that were used to create this dataset:"
-                        #         f"\n{prev_config}"
-                        #     )
-                        # else:
-                        #    logger.warning(
-                        #        f"Overriding config value {key}={prev_config[key]} with {backend[key]}"
-                        #    )
-                        #    prev_config[key] = backend[key]
-                        logger.warning(
-                            f"Overriding config value {key}={prev_config[key]} with {backend[key]}"
-                        )
-                        prev_config[key] = backend[key]
-                    elif key not in backend:
-                        if should_log():
-                            logger.warning(
-                                f"Key {key} not found in the current backend config, using the existing value '{prev_config[key]}'."
-                            )
-                        init_backend["config"][key] = prev_config[key]
-
-        init_backend["config"]["config_version"] = current_config_version
-        StateTracker.set_data_backend_config(init_backend["id"], init_backend["config"])
-        info_log(f"Configured backend: {init_backend}")
-
-        print_bucket_info(init_backend["metadata_backend"])
-        if len(init_backend["metadata_backend"]) == 0 and conditioning_type is None:
-            raise Exception(
-                f"No images were discovered by the bucket manager in the dataset: {init_backend['id']}."
-            )
-
-        use_captions = True
-        is_regularisation_data = backend.get(
-            "is_regularisation_data", backend.get("is_regularization_data", False)
-        )
-        if "only_instance_prompt" in backend and backend["only_instance_prompt"]:
-            use_captions = False
-        elif args.only_instance_prompt:
-            use_captions = False
-        init_backend["train_dataset"] = MultiAspectDataset(
-            id=init_backend["id"],
-            datasets=[init_backend["metadata_backend"]],
-            is_regularisation_data=is_regularisation_data,
-        )
-
-        if "deepfloyd" in args.model_type:
-            if init_backend["metadata_backend"].resolution_type == "area":
-                logger.warning(
-                    "Resolution type is 'area', but should be 'pixel' for DeepFloyd. Unexpected results may occur."
-                )
-                if init_backend["metadata_backend"].resolution > 0.25:
-                    logger.warning(
-                        "Resolution is greater than 0.25 megapixels. This may lead to unconstrained memory requirements."
-                    )
-            if init_backend["metadata_backend"].resolution_type == "pixel":
-                if (
-                    "stage2" not in args.model_type
-                    and init_backend["metadata_backend"].resolution > 64
-                ):
-                    logger.warning(
-                        "Resolution is greater than 64 pixels, which will possibly lead to poor quality results."
-                    )
-
-        if "deepfloyd-stage2" in args.model_type:
-            # Resolution must be at least 256 for Stage II.
-            if init_backend["metadata_backend"].resolution < 256:
-                logger.warning(
-                    "Increasing resolution to 256, as is required for DF Stage II."
-                )
-
-        init_backend["sampler"] = MultiAspectSampler(
-            id=init_backend["id"],
-            metadata_backend=init_backend["metadata_backend"],
-            data_backend=init_backend["data_backend"],
-            accelerator=accelerator,
-            batch_size=args.train_batch_size,
-            debug_aspect_buckets=args.debug_aspect_buckets,
-            delete_unwanted_images=backend.get(
-                "delete_unwanted_images", args.delete_unwanted_images
-            ),
-            resolution=backend.get("resolution", args.resolution),
-            resolution_type=backend.get("resolution_type", args.resolution_type),
-            caption_strategy=backend.get("caption_strategy", args.caption_strategy),
-            use_captions=use_captions,
-            prepend_instance_prompt=backend.get(
-                "prepend_instance_prompt", args.prepend_instance_prompt
-            ),
-            instance_prompt=backend.get("instance_prompt", args.instance_prompt),
-            conditioning_type=conditioning_type,
-            is_regularisation_data=is_regularisation_data,
-        )
-        if init_backend["sampler"].caption_strategy == "parquet":
-            configure_parquet_database(backend, args, init_backend["data_backend"])
-        init_backend["train_dataloader"] = torch.utils.data.DataLoader(
-            init_backend["train_dataset"],
-            batch_size=1,  # The sampler handles batching
-            shuffle=False,  # The sampler handles shuffling
-            sampler=init_backend["sampler"],
-            collate_fn=lambda examples: collate_fn(examples),
-            num_workers=0,
-            persistent_workers=False,
-        )
-
-        init_backend["text_embed_cache"] = text_embed_backends[text_embed_id][
-            "text_embed_cache"
-        ]
-        prepend_instance_prompt = backend.get(
-            "prepend_instance_prompt", args.prepend_instance_prompt
-        )
-        instance_prompt = backend.get("instance_prompt", args.instance_prompt)
-        if prepend_instance_prompt and instance_prompt is None:
-            raise ValueError(
-                f"Backend {init_backend['id']} has prepend_instance_prompt=True, but no instance_prompt was provided. You must provide an instance_prompt, or disable this option."
-            )
-
-        # Update the backend registration here so the metadata backend can be found.
-        StateTracker.register_data_backend(init_backend)
-
-        # We get captions from the IMAGE dataset. Not the text embeds dataset.
-        if (
-            conditioning_type != "mask"
-            and "text" not in args.skip_file_discovery
-            and "text" not in backend.get("skip_file_discovery", "")
-        ):
-            info_log(f"(id={init_backend['id']}) Collecting captions.")
-            captions = PromptHandler.get_all_captions(
-                data_backend=init_backend["data_backend"],
-                instance_data_dir=init_backend["instance_data_dir"],
-                prepend_instance_prompt=prepend_instance_prompt,
-                instance_prompt=instance_prompt,
-                use_captions=use_captions,
-                caption_strategy=backend.get("caption_strategy", args.caption_strategy),
-            )
-            logger.debug(
-                f"Pre-computing text embeds / updating cache. We have {len(captions)} captions to process, though these will be filtered next."
-            )
-            caption_strategy = backend.get("caption_strategy", args.caption_strategy)
-            info_log(
-                f"(id={init_backend['id']}) Initialise text embed pre-computation using the {caption_strategy} caption strategy. We have {len(captions)} captions to process."
-            )
-            init_backend["text_embed_cache"].compute_embeddings_for_prompts(
-                captions, return_concat=False, load_from_cache=False
-            )
-            info_log(
-                f"(id={init_backend['id']}) Completed processing {len(captions)} captions."
-            )
-
-        # Register the backend here so the sampler can be found.
-        StateTracker.register_data_backend(init_backend)
-
-        default_hash_option = True
-        hash_filenames = init_backend["config"].get(
-            "hash_filenames", default_hash_option
-        )
-        init_backend["config"]["hash_filenames"] = hash_filenames
-        StateTracker.set_data_backend_config(init_backend["id"], init_backend["config"])
-        logger.debug(f"Hashing filenames: {hash_filenames}")
-
-        if (
-            "deepfloyd" not in StateTracker.get_args().model_type
-            and conditioning_type not in ["mask", "controlnet"]
-        ):
-            info_log(f"(id={init_backend['id']}) Creating VAE latent cache.")
-            vae_cache_dir = backend.get("cache_dir_vae", None)
-            if vae_cache_dir in vae_cache_dir_paths:
-                raise ValueError(
-                    f"VAE image embed cache directory {backend.get('cache_dir_vae')} is the same as another VAE image embed cache directory. This is not allowed, the trainer will get confused and sleepy and wake up in a distant place with no memory and no money for a taxi ride back home, forever looking in the mirror and wondering who they are. This should be avoided."
-                )
-            vae_cache_dir_paths.append(vae_cache_dir)
-
-            if (
-                vae_cache_dir is not None
-                and vae_cache_dir in text_embed_cache_dir_paths
-            ):
-                raise ValueError(
-                    f"VAE image embed cache directory {backend.get('cache_dir_vae')} is the same as the text embed cache directory. This is not allowed, the trainer will get confused."
-                )
-            init_backend["vaecache"] = VAECache(
-                id=init_backend["id"],
-                vae=StateTracker.get_vae(),
-                accelerator=accelerator,
-                metadata_backend=init_backend["metadata_backend"],
-                image_data_backend=init_backend["data_backend"],
-                cache_data_backend=image_embed_data_backend["data_backend"],
-                instance_data_dir=init_backend["instance_data_dir"],
-                delete_problematic_images=backend.get(
-                    "delete_problematic_images", args.delete_problematic_images
-                ),
-                resolution=backend.get("resolution", args.resolution),
-                resolution_type=backend.get("resolution_type", args.resolution_type),
-                maximum_image_size=backend.get(
-                    "maximum_image_size",
-                    args.maximum_image_size
-                    or backend.get("resolution", args.resolution) * 1.5,
-                ),
-                target_downsample_size=backend.get(
-                    "target_downsample_size",
-                    args.target_downsample_size
-                    or backend.get("resolution", args.resolution) * 1.25,
-                ),
-                minimum_image_size=backend.get(
-                    "minimum_image_size",
-                    args.minimum_image_size,
-                ),
-                vae_batch_size=backend.get("vae_batch_size", args.vae_batch_size),
-                write_batch_size=backend.get("write_batch_size", args.write_batch_size),
-                read_batch_size=backend.get("read_batch_size", args.read_batch_size),
-                cache_dir=backend.get("cache_dir_vae", args.cache_dir_vae),
-                max_workers=backend.get("max_workers", args.max_workers),
-                process_queue_size=backend.get(
-                    "image_processing_batch_size", args.image_processing_batch_size
-                ),
-                vae_cache_ondemand=args.vae_cache_ondemand,
-                hash_filenames=hash_filenames,
-            )
-            init_backend["vaecache"].set_webhook_handler(
-                StateTracker.get_webhook_handler()
-            )
-
-            if not args.vae_cache_ondemand:
-                info_log(f"(id={init_backend['id']}) Discovering cache objects..")
-                if accelerator.is_local_main_process:
-                    init_backend["vaecache"].discover_all_files()
-                accelerator.wait_for_everyone()
-            all_image_files = StateTracker.get_image_files(
-                data_backend_id=init_backend["id"]
-            )
-            init_backend["vaecache"].build_vae_cache_filename_map(
-                all_image_files=all_image_files
-            )
-
-        if (
-            (
-                "metadata" not in args.skip_file_discovery
-                or "metadata" not in backend.get("skip_file_discovery", "")
-            )
-            and accelerator.is_main_process
-            and backend.get("scan_for_errors", False)
-            and "deepfloyd" not in StateTracker.get_args().model_type
-            and conditioning_type not in ["mask", "controlnet"]
-        ):
-            info_log(
-                f"Beginning error scan for dataset {init_backend['id']}. Set 'scan_for_errors' to False in the dataset config to disable this."
-            )
-            init_backend["metadata_backend"].handle_vae_cache_inconsistencies(
-                vae_cache=init_backend["vaecache"],
-                vae_cache_behavior=backend.get(
-                    "vae_cache_scan_behaviour", args.vae_cache_scan_behaviour
-                ),
-            )
-            init_backend["metadata_backend"].scan_for_metadata()
-
-        accelerator.wait_for_everyone()
-        if not accelerator.is_main_process:
-            init_backend["metadata_backend"].load_image_metadata()
-        accelerator.wait_for_everyone()
-
-        if (
-            not args.vae_cache_ondemand
-            and "vaecache" in init_backend
-            and "vae" not in args.skip_file_discovery
-            and "vae" not in backend.get("skip_file_discovery", "")
-            and "deepfloyd" not in StateTracker.get_args().model_type
-            and conditioning_type not in ["mask", "controlnet"]
-        ):
-            init_backend["vaecache"].discover_unprocessed_files()
-            if not args.vae_cache_ondemand:
-                init_backend["vaecache"].process_buckets()
-            logger.debug(f"Encoding images during training: {args.vae_cache_ondemand}")
-            accelerator.wait_for_everyone()
-
-        info_log(f"Configured backend: {init_backend}")
-
-        StateTracker.register_data_backend(init_backend)
-        init_backend["metadata_backend"].save_cache()
-
-    # For each image backend, connect it to its conditioning backend.
-    for backend in data_backend_config:
-        dataset_type = backend.get("dataset_type", "image")
-        if dataset_type is not None and dataset_type != "image":
-            # Skip configuration of conditioning/text data backends. It is done earlier.
-            continue
-        if ("disabled" in backend and backend["disabled"]) or (
-            "disable" in backend and backend["disable"]
-        ):
-            info_log(f"Skipping disabled data backend {backend['id']} in config file.")
-            continue
-        if "conditioning_data" in backend and backend[
-            "conditioning_data"
-        ] not in StateTracker.get_data_backends(_type="conditioning"):
-            raise ValueError(
-                f"Conditioning data backend {backend['conditioning_data']} not found in data backend list: {StateTracker.get_data_backends()}."
-            )
-        if "conditioning_data" in backend:
-            StateTracker.set_conditioning_dataset(
-                backend["id"], backend["conditioning_data"]
-            )
-            info_log(
-                f"Successfully configured conditioning image dataset for {backend['id']}"
-            )
-
-    if len(StateTracker.get_data_backends()) == 0:
-        raise ValueError(
-            "Must provide at least one data backend in the data backend config file."
-        )
-    return StateTracker.get_data_backends()
-
-
-def get_local_backend(
-    accelerator, identifier: str, compress_cache: bool = False
-) -> LocalDataBackend:
-    """
-    Get a local disk backend.
-
-    Args:
-        accelerator (Accelerator): A Huggingface Accelerate object.
-        identifier (str): An identifier that links this data backend to its other components.
-    Returns:
-        LocalDataBackend: A LocalDataBackend object.
-    """
-    return LocalDataBackend(
-        accelerator=accelerator, id=identifier, compress_cache=compress_cache
-    )
-
-
-def get_csv_backend(
-    accelerator,
-    id: str,
-    csv_file: str,
-    csv_cache_dir: str,
-    url_column: str,
-    caption_column: str,
-    compress_cache: bool = False,
-    hash_filenames: bool = False,
-    shorten_filenames: bool = False,
-) -> CSVDataBackend:
-    from pathlib import Path
-
-    return CSVDataBackend(
-        accelerator=accelerator,
-        id=id,
-        csv_file=Path(csv_file),
-        image_cache_loc=csv_cache_dir,
-        url_column=url_column,
-        caption_column=caption_column,
-        compress_cache=compress_cache,
-        shorten_filenames=shorten_filenames,
-        hash_filenames=hash_filenames,
-    )
-
-
-def check_csv_config(backend: dict, args) -> None:
-    required_keys = {
-        "csv_file": "This is the path to the CSV file containing your image URLs.",
-        "csv_cache_dir": "This is the path to your temporary cache files where images will be stored. This can grow quite large.",
-        "csv_caption_column": "This is the column in your csv which contains the caption(s) for the samples.",
-        "csv_url_column": "This is the column in your csv that contains image urls or paths.",
-    }
-    for key in required_keys.keys():
-        if key not in backend:
-            raise ValueError(
-                f"Missing required key {key} in CSV backend config: {required_keys[key]}"
-            )
-    if not args.compress_disk_cache:
-        logger.warning(
-            "You can save more disk space for cache objects by providing --compress_disk_cache and recreating its contents"
-        )
-    caption_strategy = backend.get("caption_strategy")
-    if caption_strategy is None or caption_strategy != "csv":
-        raise ValueError("CSV backend requires a caption_strategy of 'csv'.")
-
-
-def check_aws_config(backend: dict) -> None:
-    """
-    Check the configuration for an AWS backend.
-
-    Args:
-        backend (dict): A dictionary of the backend configuration.
-    Returns:
-        None
-    """
-    required_keys = [
-        "aws_bucket_name",
-        "aws_region_name",
-        "aws_endpoint_url",
-        "aws_access_key_id",
-        "aws_secret_access_key",
-    ]
-    for key in required_keys:
-        if key not in backend:
-            raise ValueError(f"Missing required key {key} in AWS backend config.")
-
-
-def get_aws_backend(
-    aws_bucket_name: str,
-    aws_region_name: str,
-    aws_endpoint_url: str,
-    aws_access_key_id: str,
-    aws_secret_access_key: str,
-    accelerator,
-    identifier: str,
-    compress_cache: bool = False,
-    max_pool_connections: int = 128,
-) -> S3DataBackend:
-    return S3DataBackend(
-        id=identifier,
-        bucket_name=aws_bucket_name,
-        accelerator=accelerator,
-        region_name=aws_region_name,
-        endpoint_url=aws_endpoint_url,
-        aws_access_key_id=aws_access_key_id,
-        aws_secret_access_key=aws_secret_access_key,
-        compress_cache=compress_cache,
-        max_pool_connections=max_pool_connections,
-    )
-
-
-def select_dataloader_index(step, backends):
-    # Generate weights for each backend based on some criteria
-    weights = []
-    backend_ids = []
-    for backend_id, backend in backends.items():
-        weight = get_backend_weight(backend_id, backend, step)
-        weights.append(weight)
-        backend_ids.append(backend_id)
-
-    # Convert to a torch tensor for easy sampling
-    weights = torch.tensor(weights, dtype=torch.float32)
-    weights /= weights.sum()  # Normalize the weights
-    if weights.sum() == 0:
-        return None
-
-    # Sample a backend index based on the weights
-    chosen_index = torch.multinomial(weights, 1).item()
-    chosen_backend_id = backend_ids[chosen_index]
-
-    return chosen_backend_id
-
-
-def get_backend_weight(backend_id, backend, step):
-    backend_config = StateTracker.get_data_backend_config(backend_id)
-    prob = backend_config.get("probability", 1)
-
-    if StateTracker.get_args().data_backend_sampling == "uniform":
-        return prob
-    elif StateTracker.get_args().data_backend_sampling == "auto-weighting":
-        # Get the dataset length (assuming you have a method or property to retrieve it)
-        dataset_length = StateTracker.get_dataset_size(backend_id)
-
-        # Calculate the weight based on dataset length
-        length_factor = dataset_length / sum(
-            StateTracker.get_dataset_size(b) for b in StateTracker.get_data_backends()
-        )
-
-        # Adjust the probability by length factor
-        adjusted_prob = prob * length_factor
-
-        disable_step = backend_config.get("disable_after_epoch_step", None)
-        if disable_step:
-            disable_step = int(disable_step)
-        else:
-            disable_step = float("inf")
-        adjusted_prob = (
-            0
-            if int(step) > disable_step
-            else max(0, adjusted_prob * (1 - step / disable_step))
-        )
-
-        return adjusted_prob
-    else:
-        raise ValueError(
-            f"Unknown sampling weighting method: {StateTracker.get_args().data_backend_sampling}"
-        )
-
-
-def random_dataloader_iterator(step, backends: dict):
-    prefetch_log_debug("Random dataloader iterator launched.")
-    gradient_accumulation_steps = StateTracker.get_args().gradient_accumulation_steps
-    logger.debug(f"Backends to select from {backends}")
-    if backends == {}:
-        logger.debug(
-            "All dataloaders exhausted. Moving to next epoch in main training loop."
-        )
-        StateTracker.clear_exhausted_buckets()
-        StateTracker.set_repeats(repeats=0)
-        return False
-    while backends:
-        epoch_step = int(step / gradient_accumulation_steps)
-        StateTracker.set_epoch_step(epoch_step)
-
-        chosen_backend_id = select_dataloader_index(step, backends)
-        if chosen_backend_id is None:
-            logger.debug("No dataloader iterators were available.")
-            break
-
-        chosen_iter = iter(backends[chosen_backend_id])
-
-        try:
-            return next(chosen_iter)
-        except MultiDatasetExhausted:
-            # We may want to repeat the same dataset multiple times in a single epoch.
-            # If so, we can just reset the iterator and keep going.
-            repeats = StateTracker.get_data_backend_config(chosen_backend_id).get(
-                "repeats", False
-            )
-            if (
-                repeats
-                and repeats > 0
-                and StateTracker.get_repeats(chosen_backend_id) < repeats
-            ):
-                StateTracker.increment_repeats(chosen_backend_id)
-                logger.debug(
-                    f"Dataset (name={chosen_backend_id}) is now sampling its {StateTracker.get_repeats(chosen_backend_id)} repeat out of {repeats} total allowed."
-                )
-                continue
-            logger.debug(
-                f"Dataset (name={chosen_backend_id}) is now exhausted after {StateTracker.get_repeats(chosen_backend_id)} repeat(s). Removing from list."
-            )
-            del backends[chosen_backend_id]
-            StateTracker.backend_exhausted(chosen_backend_id)
-            StateTracker.set_repeats(data_backend_id=chosen_backend_id, repeats=0)
-        finally:
-            if not backends:
-                logger.debug(
-                    "All dataloaders exhausted. Moving to next epoch in main training loop."
-                )
-                StateTracker.clear_exhausted_buckets()
-                return False
-
-
-class BatchFetcher:
-    def __init__(self, step, max_size=10, datasets={}):
-        self.queue = queue.Queue(max_size)
-        self.datasets = datasets
-        self.keep_running = True
-        self.step = step
-
-    def start_fetching(self):
-        thread = threading.Thread(target=self.fetch_responses)
-        thread.start()
-        return thread
-
-    def fetch_responses(self):
-        prefetch_log_debug("Launching retrieval thread.")
-        while self.keep_running:
-            if self.queue.qsize() < self.queue.maxsize:
-                prefetch_log_debug(
-                    f"Queue size: {self.queue.qsize()}. Fetching more data."
-                )
-                self.queue.put(random_dataloader_iterator(self.step, self.datasets))
-                if self.queue.qsize() >= self.queue.maxsize:
-                    prefetch_log_debug("Completed fetching data. Queue is full.")
-                    continue
-            else:
-                time.sleep(0.5)
-        prefetch_log_debug("Exiting retrieval thread.")
-
-    def next_response(self, step: int):
-        self.step = step
-        if self.queue.empty():
-            prefetch_log_debug("Queue is empty. Waiting for data.")
-        while self.queue.empty():
-            continue
-        prefetch_log_debug("Queue has data. Yielding next item.")
-        return self.queue.get()
-
-    def stop_fetching(self):
-        self.keep_running = False
diff --git a/videotuna/third_party/flux/data_backend/local.py b/videotuna/third_party/flux/data_backend/local.py
deleted file mode 100644
index b1baaef5..00000000
--- a/videotuna/third_party/flux/data_backend/local.py
+++ /dev/null
@@ -1,231 +0,0 @@
-import logging
-import os
-from io import BytesIO
-from pathlib import Path
-from typing import Any
-
-import torch
-from regex import regex
-
-from videotuna.third_party.flux.data_backend.base import BaseDataBackend
-from videotuna.third_party.flux.image_manipulation.load import load_image
-
-logger = logging.getLogger("LocalDataBackend")
-logger.setLevel(os.environ.get("SIMPLETUNER_LOG_LEVEL", "INFO"))
-
-
-class LocalDataBackend(BaseDataBackend):
-    def __init__(self, accelerator, id: str, compress_cache: bool = False):
-        self.accelerator = accelerator
-        self.id = id
-        self.type = "local"
-        self.compress_cache = compress_cache
-
-    def read(self, filepath, as_byteIO: bool = False):
-        """Read and return the content of the file."""
-        # Openfilepath as BytesIO:
-        with open(filepath, "rb") as file:
-            data = file.read()
-        if not as_byteIO:
-            return data
-        return BytesIO(data)
-
-    def write(self, filepath: str, data: Any) -> None:
-        """Write the provided data to the specified filepath."""
-        os.makedirs(os.path.dirname(filepath), exist_ok=True)
-        with open(filepath, "wb") as file:
-            # Check if data is a Tensor, and if so, save it appropriately
-            if isinstance(data, torch.Tensor):
-                # logger.debug(f"Writing a torch file to disk.")
-                return self.torch_save(data, file)
-            elif isinstance(data, str):
-                # logger.debug(f"Writing a string to disk as {filepath}: {data}")
-                data = data.encode("utf-8")
-            else:
-                logger.debug(
-                    f"Received an unknown data type to write to disk. Doing our best: {type(data)}"
-                )
-            file.write(data)
-
-    def delete(self, filepath):
-        """Delete the specified file."""
-        if os.path.exists(filepath):
-            logger.debug(f"Deleting file: {filepath}")
-            os.remove(filepath)
-        else:
-            raise FileNotFoundError(f"{filepath} not found.")
-        # Check if file exists:
-        if self.exists(filepath):
-            raise Exception(f"Failed to delete {filepath}")
-
-    def exists(self, filepath):
-        """Check if the file exists."""
-        return os.path.exists(filepath)
-
-    def open_file(self, filepath, mode):
-        """Open the file in the specified mode."""
-        return open(filepath, mode)
-
-    def list_files(self, file_extensions: list, instance_data_dir: str):
-        """
-        List all files matching the given file extensions.
-        Creates Path objects of each file found.
-        """
-        logger.debug(
-            f"LocalDataBackend.list_files: file_extensions={file_extensions}, instance_data_dir={instance_data_dir}"
-        )
-        if instance_data_dir is None:
-            raise ValueError("instance_data_dir must be specified.")
-
-        def _rglob_follow_symlinks(path: Path, extensions: list):
-            # Skip Spotlight and Jupyter directories
-            forbidden_directories = [
-                ".Spotlight-V100",
-                ".Trashes",
-                ".fseventsd",
-                ".TemporaryItems",
-                ".zfs",
-                ".ipynb_checkpoints",
-            ]
-            if path.name in forbidden_directories:
-                return
-
-            # If no extensions are provided, list all files
-            if not extensions:
-                for p in path.rglob("*"):
-                    if p.is_file():
-                        yield p
-            else:
-                for ext in extensions:
-                    for p in path.rglob(ext):
-                        yield p
-
-            for p in path.iterdir():
-                if p.is_dir() and not p.is_symlink():
-                    yield from _rglob_follow_symlinks(p, extensions)
-                elif p.is_symlink():
-                    real_path = Path(os.readlink(p))
-                    if real_path.is_dir():
-                        yield from _rglob_follow_symlinks(real_path, extensions)
-
-        # If file_extensions is None, list all files
-        extensions = (
-            [f"*.{ext.lower()}" for ext in file_extensions] if file_extensions else None
-        )
-
-        paths = list(_rglob_follow_symlinks(Path(instance_data_dir), extensions))
-
-        # Group files by their parent directory
-        path_dict = {}
-        for path in paths:
-            parent = str(path.parent)
-            if parent not in path_dict:
-                path_dict[parent] = []
-            path_dict[parent].append(str(path.absolute()))
-
-        results = [(subdir, [], files) for subdir, files in path_dict.items()]
-        return results
-
-    def read_image(self, filepath: str, delete_problematic_images: bool = False):
-        # Remove embedded null byte:
-        filepath = filepath.replace("\x00", "")
-        try:
-            image = load_image(filepath)
-            return image
-        except Exception as e:
-            import traceback
-
-            logger.error(
-                f"Encountered error opening image {filepath}: {e}, traceback: {traceback.format_exc()}"
-            )
-            if delete_problematic_images:
-                logger.error(
-                    "Deleting image, because --delete_problematic_images is provided."
-                )
-                self.delete(filepath)
-            else:
-                exit(1)
-                raise e
-
-    def read_image_batch(
-        self, filepaths: list, delete_problematic_images: bool = False
-    ) -> list:
-        """Read a batch of images from the specified filepaths."""
-        if type(filepaths) != list:
-            raise ValueError(
-                f"read_image_batch must be given a list of image filepaths. we received: {filepaths}"
-            )
-        output_images = []
-        available_keys = []
-        for filepath in filepaths:
-            try:
-                image_data = self.read_image(filepath, delete_problematic_images)
-                if image_data is None:
-                    logger.warning(f"Unable to load image '{filepath}', skipping.")
-                    continue
-                output_images.append(image_data)
-                available_keys.append(filepath)
-            except Exception as e:
-                if delete_problematic_images:
-                    logger.error(
-                        f"Deleting image '{filepath}', because --delete_problematic_images is provided. Error: {e}"
-                    )
-                else:
-                    logger.warning(
-                        f"A problematic image {filepath} is detected, but we are not allowed to remove it, because --delete_problematic_image is not provided."
-                        f" Please correct this manually. Error: {e}"
-                    )
-        return (available_keys, output_images)
-
-    def create_directory(self, directory_path):
-        if os.path.exists(directory_path):
-            return
-        logger.debug(f"Creating directory: {directory_path}")
-        os.makedirs(directory_path, exist_ok=True)
-
-    def torch_load(self, filename):
-        """
-        Load a torch tensor from a file.
-        """
-        if not self.exists(filename):
-            raise FileNotFoundError(f"{filename} not found.")
-
-        stored_tensor = self.read(filename, as_byteIO=True)
-
-        if self.compress_cache:
-            try:
-                stored_tensor = self._decompress_torch(stored_tensor)
-            except Exception as e:
-                pass
-
-        if hasattr(stored_tensor, "seek"):
-            stored_tensor.seek(0)
-        try:
-            loaded_tensor = torch.load(stored_tensor, map_location="cpu")
-        except Exception as e:
-            logger.error(f"Failed to load corrupt torch file '{filename}': {e}")
-            if "invalid load key" in str(e):
-                self.delete(filename)
-            raise e
-        return loaded_tensor
-
-    def torch_save(self, data, original_location):
-        """
-        Save a torch tensor to a file.
-        """
-        if isinstance(original_location, str):
-            location = self.open_file(original_location, "wb")
-        else:
-            location = original_location
-
-        if self.compress_cache:
-            compressed_data = self._compress_torch(data)
-            location.write(compressed_data)
-        else:
-            torch.save(data, location)
-        location.close()
-
-    def write_batch(self, filepaths: list, data_list: list) -> None:
-        """Write a batch of data to the specified filepaths."""
-        for filepath, data in zip(filepaths, data_list):
-            self.write(filepath, data)
diff --git a/videotuna/third_party/flux/image_manipulation/brightness.py b/videotuna/third_party/flux/image_manipulation/brightness.py
deleted file mode 100644
index 5862ddac..00000000
--- a/videotuna/third_party/flux/image_manipulation/brightness.py
+++ /dev/null
@@ -1,28 +0,0 @@
-import multiprocessing
-
-import numpy as np
-from PIL import Image
-
-
-def calculate_luminance(img: Image.Image):
-    np_img = np.asarray(img.convert("RGB"))
-    r, g, b = np_img[:, :, 0], np_img[:, :, 1], np_img[:, :, 2]
-    luminance = 0.299 * r + 0.587 * g + 0.114 * b
-    avg_luminance = np.mean(luminance)
-    return avg_luminance
-
-
-def worker_batch_luminance(imgs: list):
-    return [calculate_luminance(img) for img in imgs]
-
-
-def calculate_batch_luminance(imgs: list):
-    num_processes = multiprocessing.cpu_count()
-    with multiprocessing.Pool(num_processes) as pool:
-        # Splitting images into batches for each process
-        img_batches = [imgs[i::num_processes] for i in range(num_processes)]
-        results = pool.map(worker_batch_luminance, img_batches)
-
-    # Flatten the results and calculate average luminance
-    all_luminance_values = [lum for sublist in results for lum in sublist]
-    return sum(all_luminance_values) / len(all_luminance_values)
diff --git a/videotuna/third_party/flux/image_manipulation/cropping.py b/videotuna/third_party/flux/image_manipulation/cropping.py
deleted file mode 100644
index a5b3581a..00000000
--- a/videotuna/third_party/flux/image_manipulation/cropping.py
+++ /dev/null
@@ -1,129 +0,0 @@
-import logging
-import os
-
-from PIL import Image
-
-logger = logging.getLogger(__name__)
-logger.setLevel(os.environ.get("SIMPLETUNER_LOG_LEVEL", "INFO"))
-
-
-class BaseCropping:
-    def __init__(self, image: Image = None, image_metadata: dict = None):
-        self.original_height = None
-        self.original_width = None
-        self.intermediary_height = None
-        self.intermediary_width = None
-        self.image = image
-        self.image_metadata = image_metadata
-        # When we've only got metadata, we can't crop the image.
-        self.meta_crop = False
-        if self.image:
-            self.original_width, self.original_height = self.image.size
-        elif self.image_metadata:
-            self.original_width, self.original_height = self.image_metadata[
-                "original_size"
-            ]
-        # print(
-        #     "Cropper intialized with image size: %s x %s",
-        #     self.original_width,
-        #     self.original_height,
-        # )
-
-    def crop(self, target_width, target_height):
-        raise NotImplementedError("Subclasses must implement this method")
-
-    def set_image(self, image: Image.Image):
-        if type(image) is not Image.Image:
-            raise TypeError("Image must be a PIL Image object")
-        # else:
-        #     print(f"Cropper received updated image contents: {image}")
-        self.image = image
-
-        return self
-
-    def set_intermediary_size(self, width, height):
-        self.intermediary_width = width
-        self.intermediary_height = height
-        # print(f"Updated intermediary size: {width} x {height}")
-
-        return self
-
-
-class CornerCropping(BaseCropping):
-    def crop(self, target_width, target_height):
-        left = max(0, self.intermediary_width - target_width)
-        top = max(0, self.intermediary_height - target_height)
-        right = self.intermediary_width
-        bottom = self.intermediary_height
-        if self.image:
-            return self.image.crop((left, top, right, bottom)), (top, left)
-        elif self.image_metadata:
-            return None, (top, left)
-
-
-class CenterCropping(BaseCropping):
-    def crop(self, target_width, target_height):
-        left = (self.intermediary_width - target_width) / 2
-        top = (self.intermediary_height - target_height) / 2
-        right = (self.intermediary_width + target_width) / 2
-        bottom = (self.intermediary_height + target_height) / 2
-        if self.image:
-            return self.image.crop((left, top, right, bottom)), (top, left)
-        elif self.image_metadata:
-            return None, (top, left)
-
-
-class RandomCropping(BaseCropping):
-    def crop(self, target_width, target_height):
-        import random
-
-        left = random.randint(0, max(0, self.intermediary_width - target_width))
-        top = random.randint(0, max(0, self.intermediary_height - target_height))
-        right = left + target_width
-        bottom = top + target_height
-        if self.image:
-            return self.image.crop((left, top, right, bottom)), (top, left)
-        elif self.image_metadata:
-            return None, (top, left)
-
-
-class FaceCropping(RandomCropping):
-    def crop(
-        self,
-        image: Image.Image,
-        target_width: int,
-        target_height: int,
-    ):
-        # Import modules
-        import cv2
-        import numpy as np
-
-        # Detect a face in the image
-        face_cascade = cv2.CascadeClassifier(
-            cv2.data.haarcascades + "haarcascade_frontalface_default.xml"
-        )
-        image = image.convert("RGB")
-        image = np.array(image)
-        gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)
-        faces = face_cascade.detectMultiScale(gray, 1.1, 4)
-        if len(faces) > 0:
-            # Get the largest face
-            face = max(faces, key=lambda f: f[2] * f[3])
-            x, y, w, h = face
-            left = max(0, x - 0.5 * w)
-            top = max(0, y - 0.5 * h)
-            right = min(image.shape[1], x + 1.5 * w)
-            bottom = min(image.shape[0], y + 1.5 * h)
-            image = Image.fromarray(image)
-            return image.crop((left, top, right, bottom)), (top, left)
-        else:
-            # Crop the image from a random position
-            return super.crop(image, target_width, target_height)
-
-
-crop_handlers = {
-    "corner": CornerCropping,
-    "centre": CenterCropping,
-    "center": CenterCropping,
-    "random": RandomCropping,
-}
diff --git a/videotuna/third_party/flux/image_manipulation/load.py b/videotuna/third_party/flux/image_manipulation/load.py
deleted file mode 100644
index 9f9a7288..00000000
--- a/videotuna/third_party/flux/image_manipulation/load.py
+++ /dev/null
@@ -1,102 +0,0 @@
-import logging
-from io import BytesIO
-from typing import IO, Any, Union
-
-import numpy as np
-from PIL import Image, PngImagePlugin
-
-logger = logging.getLogger(__name__)
-logger.setLevel(logging.WARNING)
-
-try:
-    import cv2
-except Exception as e:
-    if "libGL" in str(e):
-        print(
-            "An error occurred while importing OpenCV2 due to a missing LibGL dependency on your system or container."
-            " Unfortunately, this is not a dependency that SimpleTuner can include during install time."
-            "\nFor Ubuntu systems, you can typically resolve this by running the following command:\n"
-            "sudo apt-get install libgl1-mesa-glx"
-            "\nor, if that does not work:\n"
-            "sudo apt-get install libgl1-mesa-dri"
-            "\nIf all else fails, you may need to contact the support department for your chosen platform."
-            " You can find the full error message at the end of debug.log inside the SimpleTuner directory."
-        )
-        from sys import exit
-
-        exit(1)
-    else:
-        raise e
-
-
-LARGE_ENOUGH_NUMBER = 100
-PngImagePlugin.MAX_TEXT_CHUNK = LARGE_ENOUGH_NUMBER * (1024**2)
-
-
-def decode_image_with_opencv(nparr: np.ndarray) -> Union[Image.Image, None]:
-    img_cv = cv2.imdecode(nparr, cv2.IMREAD_COLOR)
-    if img_cv is not None:
-        img_cv = cv2.cvtColor(img_cv, cv2.COLOR_BGR2RGB)
-        # Ensuring we only convert to RGB if needed.
-        if len(img_cv.shape) == 2 or (img_cv.shape[2] != 3 and img_cv.shape[2] == 1):
-            img_cv = cv2.cvtColor(img_cv, cv2.COLOR_GRAY2RGB)
-    return img_cv if img_cv is None else Image.fromarray(img_cv)
-
-
-def decode_image_with_pil(img_data: bytes) -> Image.Image:
-    try:
-        if isinstance(img_data, bytes):
-            img_pil = Image.open(BytesIO(img_data))
-        else:
-            img_pil = Image.open(img_data)
-
-        if img_pil.mode not in ["RGB", "RGBA"] and "transparency" in img_pil.info:
-            img_pil = img_pil.convert("RGBA")
-
-        # For transparent images, add a white background as this is correct
-        # most of the time.
-        if img_pil.mode == "RGBA":
-            canvas = Image.new("RGBA", img_pil.size, (255, 255, 255))
-            canvas.alpha_composite(img_pil)
-            img_pil = canvas.convert("RGB")
-        else:
-            img_pil = img_pil.convert("RGB")
-    except (OSError, Image.DecompressionBombError, ValueError) as e:
-        logger.warning(f"Error decoding image: {e}")
-        raise
-    return img_pil
-
-
-def load_image(img_data: Union[bytes, IO[Any], str]) -> Image.Image:
-    """
-    Load an image using CV2. If that fails, fall back to PIL.
-
-    The image is returned as a PIL object.
-    """
-    if isinstance(img_data, str):
-        with open(img_data, "rb") as file:
-            img_data = file.read()
-    elif hasattr(img_data, "read"):
-        # Check if it's file-like object.
-        img_data = img_data.read()
-
-    # Preload the image bytes with channels unchanged and ensure determine
-    # if the image has an alpha channel. If it does we should add a white
-    # background to it using PIL.
-    nparr = np.frombuffer(img_data, np.uint8)
-    image_preload = cv2.imdecode(nparr, cv2.IMREAD_UNCHANGED)
-    has_alpha = False
-    if (
-        image_preload is not None
-        and len(image_preload.shape) >= 3
-        and image_preload.shape[2] == 4
-    ):
-        has_alpha = True
-    del image_preload
-
-    img = None
-    if not has_alpha:
-        img = decode_image_with_opencv(nparr)
-    if img is None:
-        img = decode_image_with_pil(img_data)
-    return img
diff --git a/videotuna/third_party/flux/image_manipulation/training_sample.py b/videotuna/third_party/flux/image_manipulation/training_sample.py
deleted file mode 100644
index 23a8a463..00000000
--- a/videotuna/third_party/flux/image_manipulation/training_sample.py
+++ /dev/null
@@ -1,706 +0,0 @@
-import logging
-import os
-import random
-import time
-from math import sqrt
-
-from PIL import Image
-from PIL.ImageOps import exif_transpose
-from tqdm import tqdm
-
-from videotuna.third_party.flux.image_manipulation.cropping import crop_handlers
-from videotuna.third_party.flux.multiaspect.image import MultiaspectImage, resize_tools
-from videotuna.third_party.flux.training.multi_process import should_log
-from videotuna.third_party.flux.training.state_tracker import StateTracker
-
-logger = logging.getLogger(__name__)
-if should_log():
-    logger.setLevel(os.environ.get("SIMPLETUNER_LOG_LEVEL", "INFO"))
-else:
-    logger.setLevel("ERROR")
-
-
-class TrainingSample:
-    def __init__(
-        self,
-        image: Image.Image,
-        data_backend_id: str,
-        image_metadata: dict = None,
-        image_path: str = None,
-        conditioning_type: str = None,
-    ):
-        """
-        Initializes a new TrainingSample instance with a provided PIL.Image object and a data backend identifier.
-
-        Args:
-            image (Image.Image): A PIL Image object.
-            data_backend_id (str): Identifier for the data backend used for additional operations.
-            metadata (dict): Optional metadata associated with the image.
-        """
-        self.image = image
-        self.target_size = None
-        self.intermediary_size = None
-        self.original_size = None
-        self.conditioning_type = conditioning_type
-        self.data_backend_id = data_backend_id
-        self.image_metadata = (
-            image_metadata
-            if image_metadata
-            else StateTracker.get_metadata_by_filepath(image_path, data_backend_id)
-        )
-        if hasattr(image, "size"):
-            self.original_size = self.image.size
-            self.original_aspect_ratio = MultiaspectImage.calculate_image_aspect_ratio(
-                self.original_size
-            )
-        elif image_metadata is not None:
-            self.original_size = image_metadata.get("original_size")
-            self.original_aspect_ratio = MultiaspectImage.calculate_image_aspect_ratio(
-                self.original_size
-            )
-        self.current_size = self.original_size
-
-        if not self.original_size:
-            raise Exception("Original size not found in metadata.")
-
-        # Torchvision transforms turn the pixels into a Tensor and normalize them for the VAE.
-        self.transforms = MultiaspectImage.get_image_transforms()
-
-        # Backend config details
-        self.data_backend_config = StateTracker.get_data_backend_config(data_backend_id)
-        self.crop_enabled = self.data_backend_config.get("crop", False)
-        self.crop_style = self.data_backend_config.get("crop_style", "random")
-        self.crop_aspect = self.data_backend_config.get("crop_aspect", "square")
-        self.crop_aspect_buckets = self.data_backend_config.get(
-            "crop_aspect_buckets", []
-        )
-        self.crop_coordinates = (0, 0)
-        crop_handler_cls = crop_handlers.get(self.crop_style)
-        if not crop_handler_cls:
-            raise ValueError(f"Unknown crop style: {self.crop_style}")
-        self.cropper = crop_handler_cls(image=self.image, image_metadata=image_metadata)
-        self.resolution = self.data_backend_config.get("resolution")
-        self.resolution_type = self.data_backend_config.get("resolution_type")
-        self.target_size_calculator = resize_tools.get(self.resolution_type)
-        if self.target_size_calculator is None and conditioning_type not in [
-            "mask",
-            "controlnet",
-        ]:
-            raise ValueError(f"Unknown resolution type: {self.resolution_type}")
-        self._set_resolution()
-        self.target_downsample_size = self.data_backend_config.get(
-            "target_downsample_size", None
-        )
-        self.maximum_image_size = self.data_backend_config.get(
-            "maximum_image_size", None
-        )
-        self._image_path = image_path
-        # RGB/EXIF conversions.
-        self.correct_image()
-        self._validate_image_metadata()
-
-    def save_debug_image(self, path: str):
-        if self.image and os.environ.get("SIMPLETUNER_DEBUG_IMAGE_PREP", "") == "true":
-            self.image.save(path)
-        return self
-
-    @staticmethod
-    def from_image_path(image_path: str, data_backend_id: str):
-        """
-        Create a new TrainingSample instance from an image path.
-
-        Args:
-            image_path (str): The path to the image.
-            data_backend_id (str): Identifier for the data backend used for additional operations.
-
-        Returns:
-            TrainingSample: A new TrainingSample instance.
-        """
-        data_backend = StateTracker.get_data_backend(data_backend_id)
-        image = data_backend["data_backend"].read_image(image_path)
-        return TrainingSample(image, data_backend_id, image_path=image_path)
-
-    def _validate_image_metadata(self) -> bool:
-        """
-        Determine whether all required keys exist for prepare() to skip calculations.
-        This is useful because randomised aspect buckets must be preserved across runs to avoid mismatched tensor dimensions.
-
-        Returns:
-            bool: True if the metadata is valid, False otherwise.
-        """
-        required_keys = [
-            "original_size",
-            "target_size",
-            "intermediary_size",
-            "crop_coordinates",
-            "aspect_ratio",
-        ]
-        if type(self.image_metadata) is not dict:
-            self.valid_metadata = False
-        else:
-            self.valid_metadata = all(
-                key in self.image_metadata for key in required_keys
-            )
-        if self.valid_metadata:
-            self.original_size = self.image_metadata["original_size"]
-            self.target_size = self.image_metadata["target_size"]
-            self.intermediary_size = self.image_metadata["intermediary_size"]
-            self.crop_coordinates = self.image_metadata["crop_coordinates"]
-            self.aspect_ratio = self.image_metadata["aspect_ratio"]
-
-        self.original_aspect_ratio = MultiaspectImage.calculate_image_aspect_ratio(
-            self.original_size
-        )
-
-        if not self.valid_metadata and hasattr(self.image, "size"):
-            self.original_size = self.image.size
-
-        return self.valid_metadata
-
-    def _set_resolution(self):
-        if self.resolution_type == "pixel":
-            self.target_area = self.resolution
-            # Store the pixel value, eg. 1024
-            self.pixel_resolution = int(self.resolution)
-            # Store the megapixel value, eg. 1.0
-            self.megapixel_resolution = self.resolution / 1e3
-        elif self.resolution_type == "area":
-            # Convert pixel area to megapixels, remapping commonly used round values
-            # to their pixel_area equivalents for compatibility purposes.
-            self.target_area = {
-                0.25: 512**2,
-                0.5: 768**2,
-                1.0: 1024**2,
-                2.0: 1536**2,
-                4.0: 2048**2,
-            }.get(self.resolution, self.resolution * 1e6)
-            # Store the pixel value, eg. 1024
-            self.pixel_resolution = int(
-                MultiaspectImage._round_to_nearest_multiple(
-                    sqrt(self.resolution * (1024**2))
-                )
-            )
-            # Store the megapixel value, eg. 1.0
-            self.megapixel_resolution = self.resolution
-        else:
-            raise Exception(f"Unknown resolution type: {self.resolution_type}")
-
-    def _trim_aspect_bucket_list(self):
-        """
-        Momentarily return a temporarily list of pruned buckets that'll work for this image.
-        An aspect bucket will "work" if the image must be upscaled less than 20% to fit into it.
-
-        Returns:
-            list[float]: The list of available aspect buckets
-        """
-        available_buckets = []
-        for bucket in self.crop_aspect_buckets:
-            # We want to ensure we don't upscale images beyond about 20% of their original size.
-            # If any of the aspect buckets will result in that, we'll ignore it.
-            if type(bucket) is dict:
-                aspect = bucket["aspect_ratio"]
-            elif type(bucket) is float or type(bucket) is int:
-                aspect = bucket
-            else:
-                raise ValueError(
-                    "Aspect buckets must be a list of floats or dictionaries."
-                )
-            # Calculate new size
-            target_size, intermediary_size, aspect_ratio = self.target_size_calculator(
-                aspect, self.resolution, self.original_size
-            )
-            # Check the size vs a 20% threshold
-            if (
-                target_size[0] * 1.2 < self.original_size[0]
-                and target_size[1] * 1.2 < self.original_size[1]
-            ):
-                available_buckets.append(aspect)
-        return available_buckets
-
-    def _select_random_aspect(self):
-        """
-        This method returns an aspect bucket based on the crop_aspect configuration.
-        If crop_aspect is "closest", it returns the closest aspect ratio.
-        If crop_aspect is "random", it returns a random aspect ratio based on weights.
-
-        Returns:
-            float: The selected aspect ratio.
-        """
-        if not self.crop_aspect_buckets:
-            raise ValueError(
-                "Aspect buckets are not defined in the data backend config."
-            )
-
-        if self.valid_metadata:
-            self.aspect_ratio = self.image_metadata["aspect_ratio"]
-            return self.aspect_ratio
-
-        # Handle 'preserve' crop_aspect mode by picking the closest aspect ratio
-        if self.crop_aspect == "closest":
-            closest_aspect = min(
-                self.crop_aspect_buckets,
-                key=lambda bucket: abs(
-                    (bucket["aspect"] if isinstance(bucket, dict) else bucket)
-                    - self.aspect_ratio
-                ),
-            )
-            closest_aspect_value = (
-                closest_aspect["aspect"]
-                if isinstance(closest_aspect, dict)
-                else closest_aspect
-            )
-            # logger.debug(f"Selected closest aspect: {closest_aspect_value} for aspect ratio: {self.aspect_ratio}")
-            return closest_aspect_value
-
-        # Handle 'random' crop_aspect mode by picking a random aspect ratio based on weights
-        if self.crop_aspect == "random":
-            if (
-                len(self.crop_aspect_buckets) > 0
-                and type(self.crop_aspect_buckets[0]) is dict
-            ):
-                has_portrait_buckets = any(
-                    bucket["aspect"] < 1.0 for bucket in self.crop_aspect_buckets
-                )
-                has_landscape_buckets = any(
-                    bucket["aspect"] > 1.0 for bucket in self.crop_aspect_buckets
-                )
-                logger.error(
-                    f"has_portrait_buckets: {has_portrait_buckets}, has_landscape_buckets: {has_landscape_buckets}"
-                )
-
-                # Instead of defaulting to 1.0, use whatever buckets are available
-                aspects = [bucket["aspect"] for bucket in self.crop_aspect_buckets]
-                weights = [bucket["weight"] for bucket in self.crop_aspect_buckets]
-
-                # Ensure that the weights add up to 1.0
-                total_weight = sum(weights)
-                if total_weight != 1.0:
-                    raise ValueError("The weights of aspect buckets must add up to 1.")
-
-                selected_aspect = random.choices(aspects, weights)[0]
-                return selected_aspect
-
-            elif (
-                len(self.crop_aspect_buckets) > 0
-                and type(self.crop_aspect_buckets[0]) is float
-            ):
-                available_aspects = self._trim_aspect_bucket_list()
-                if len(available_aspects) == 0:
-                    selected_aspect = 1.0
-                    if should_log():
-                        tqdm.write(
-                            "[WARNING] Image dimensions do not fit into the configured aspect buckets. Using square crop."
-                        )
-                else:
-                    selected_aspect = random.choice(available_aspects)
-                return selected_aspect
-
-            else:
-                raise ValueError(
-                    "Aspect buckets must be a list of floats or dictionaries."
-                    " If using a dictionary, it is expected to be in the format {'aspect': 1.0, 'weight': 0.5}."
-                    " To provide multiple aspect ratios, use a list of dictionaries: [{'aspect': 1.0, 'weight': 0.5}, {'aspect': 1.5, 'weight': 0.5}]."
-                )
-
-        # Default to 1.0 if none of the conditions above match
-        return 1.0
-
-    def prepare_like(self, other_sample, return_tensor=False):
-        """
-        Prepare the current TrainingSample in the same way as other_sample.
-
-        Args:
-            other_sample (TrainingSample): The sample to mimic.
-            return_tensors (bool): Whether to return tensors.
-
-        Returns:
-            PreparedSample: The prepared sample.
-        """
-        # Copy over the image metadata from the other sample
-        self.image_metadata = (
-            other_sample.image_metadata.copy() if other_sample.image_metadata else {}
-        )
-        # Validate the metadata to set internal attributes
-        self._validate_image_metadata()
-        # Proceed to prepare the image
-        return self.prepare(return_tensor=return_tensor)
-
-    def prepare(self, return_tensor: bool = False):
-        """
-        Perform initial image preparations such as converting to RGB and applying EXIF transformations.
-
-        Args:
-            image (Image.Image): The image to prepare.
-
-        Returns: tuple
-            - image data (PIL.Image)
-            - crop_coordinates (tuple)
-            - aspect_ratio (float)
-        """
-        self.save_debug_image(f"images/{time.time()}-0-original.png")
-        self.crop()
-        self.save_debug_image(f"images/{time.time()}-1-cropped.png")
-        if not self.crop_enabled:
-            self.save_debug_image(f"images/{time.time()}-1b-nocrop-resize.png")
-            self.resize()
-
-        image = self.image
-        if return_tensor:
-            # Return normalised tensor.
-            image = self.transforms(image)
-        webhook_handler = StateTracker.get_webhook_handler()
-        prepared_sample = PreparedSample(
-            image=image,
-            original_size=self.original_size,
-            crop_coordinates=self.crop_coordinates,
-            aspect_ratio=self.aspect_ratio,
-            image_metadata=self.image_metadata,
-            target_size=self.target_size,
-            intermediary_size=self.intermediary_size,
-        )
-        if webhook_handler:
-            webhook_handler.send(
-                message=f"Debug info for prepared sample, {str(prepared_sample)}",
-                images=[self.image] if self.image else None,
-                message_level="debug",
-            )
-        return prepared_sample
-
-    def area(self) -> int:
-        """
-        Calculate the area of the image.
-
-        Returns:
-            int: The area of the image.
-        """
-        if self.image is not None:
-            return self.image.size[0] * self.image.size[1]
-        if self.original_size:
-            return self.original_size[0] * self.original_size[1]
-
-    def _should_resize_before_crop(self) -> bool:
-        """
-        If the options to do so are enabled, or, the image require it; we will resize before cropping.
-
-        Returns:
-            bool: True if the image should be resized before cropping, False otherwise.
-        """
-        if (
-            not self.crop_enabled
-            or not self.maximum_image_size
-            or not self.target_downsample_size
-        ):
-            return False
-        if self.data_backend_config.get("resolution_type") == "pixel":
-            return (
-                self.current_size[0] > self.pixel_resolution
-                or self.current_size[1] > self.pixel_resolution
-            ) or (
-                self.current_size[0] < self.pixel_resolution
-                or self.current_size[1] < self.pixel_resolution
-            )
-        elif self.data_backend_config.get("resolution_type") == "area":
-            should_resize = (
-                self.area() > self.target_area
-                or self.area() < self.target_area
-                or self.current_size[0] < self.target_size[0]
-                or self.current_size[1] < self.target_size[1]
-            )
-            logger.debug(f"Should resize? {should_resize}")
-            return should_resize
-        else:
-            raise ValueError(
-                f"Unknown resolution type: {self.data_backend_config.get('resolution_type')}"
-            )
-
-    def _calculate_target_downsample_size(self):
-        """
-        When cropping images, it is optional to disturb them with a resize before the crop.
-        This is desirable when a large image is being cropped to a small size, as it will preserve scene details and maintain aspect ratio.
-
-        Returns:
-            tuple: The target downsample size as (width, height).
-        """
-        # We'll run the target size calculator logic without updating any of the object attributes.
-        # This will prevent contamination of the final values that the image will represent.
-        _, calculated_intermediary_size, _ = self.target_size_calculator(
-            self.original_aspect_ratio, self.target_downsample_size, self.original_size
-        )
-        # The calculated_intermediary_size's purpose is to resize to this value before cropping to target_size.
-        # If the intermediary size is smaller than target_size on either edge, the cropping will result in black bars.
-        # We have to calculate the scale factor and adjust the image edges proportionally to avoid squishing it.
-        if calculated_intermediary_size[0] < self.target_size[0]:
-            scale_factor = self.target_size[0] / calculated_intermediary_size[0]
-            calculated_intermediary_size = (
-                self.target_size[0],
-                int(calculated_intermediary_size[1] * scale_factor),
-            )
-        elif calculated_intermediary_size[1] < self.target_size[1]:
-            scale_factor = self.target_size[1] / calculated_intermediary_size[1]
-            calculated_intermediary_size = (
-                int(calculated_intermediary_size[0] * scale_factor),
-                self.target_size[1],
-            )
-
-        return calculated_intermediary_size
-
-    def _downsample_before_crop(self):
-        """
-        Downsample the image before cropping, to preserve scene details and maintain aspect ratio.
-
-        Returns:
-            TrainingSample: The current TrainingSample instance.
-        """
-        if self._should_resize_before_crop():
-            target_downsample_size = self._calculate_target_downsample_size()
-            logger.debug(f"resizing to {target_downsample_size}")
-            self.resize(target_downsample_size)
-        return self
-
-    def correct_intermediary_square_size(self):
-        """
-        When an intermediary size is calculated, we don't adjust it to be divisible by 8 or 64.
-        However, the aspect ratio 1.0 needs special consideration for our base resolutions 512, 768, and 1024, because they typically result in 500x500, 750x750, and 1000x1000 images.
-
-        Returns:
-            TrainingSample: The current TrainingSample instance.
-        """
-        if (
-            self.aspect_ratio == 1.0
-            and self.intermediary_size[0] < self.pixel_resolution
-        ):
-            self.intermediary_size = (
-                self.pixel_resolution,
-                self.pixel_resolution,
-            )
-            self.crop_coordinates = (0, 0)
-        return self
-
-    def calculate_target_size(self):
-        """
-        This method will populate the values for self.{target_size,intermediary_size,aspect_ratio} based on the image's original size and the data backend configuration.
-
-        Returns:
-            tuple:
-                - The target size as (width, height).
-                - The intermediary size as (width, height).
-                - The aspect ratio of the target size. This will likely be different from the original aspect ratio.
-        """
-        self.aspect_ratio = MultiaspectImage.calculate_image_aspect_ratio(
-            self.original_size
-        )
-        if self.crop_enabled:
-            if self.crop_aspect == "square":
-                self.target_size = (self.pixel_resolution, self.pixel_resolution)
-                _, self.intermediary_size, _ = self.target_size_calculator(
-                    self.aspect_ratio, self.resolution, self.original_size
-                )
-                self.aspect_ratio = 1.0
-                self.correct_intermediary_square_size()
-                square_crop_metadata = (
-                    self.target_size,
-                    self.intermediary_size,
-                    self.aspect_ratio,
-                )
-                logger.debug(f"Square crop metadata: {square_crop_metadata}")
-                return square_crop_metadata
-        if self.crop_enabled and (
-            self.crop_aspect == "random" or self.crop_aspect == "closest"
-        ):
-            # Grab a random aspect ratio from a list.
-            self.aspect_ratio = self._select_random_aspect()
-        self.target_size, calculated_intermediary_size, self.aspect_ratio = (
-            self.target_size_calculator(
-                self.aspect_ratio, self.resolution, self.original_size
-            )
-        )
-        if self.crop_aspect != "random" or not self.valid_metadata:
-            self.intermediary_size = calculated_intermediary_size
-        self.aspect_ratio = MultiaspectImage.calculate_image_aspect_ratio(
-            self.target_size
-        )
-        self.correct_intermediary_square_size()
-        if self.aspect_ratio == 1.0:
-            self.target_size = (self.pixel_resolution, self.pixel_resolution)
-
-        return (
-            self.target_size,
-            (int(self.intermediary_size[0]), int(self.intermediary_size[1])),
-            self.aspect_ratio,
-        )
-
-    def correct_image(self):
-        """
-        Apply a series of transformations to the image to "correct" it, such as EXIF rotation and conversion to RGB.
-
-        Returns:
-            TrainingSample: The current TrainingSample instance.
-        """
-        if self.image:
-            # Convert image to RGB to remove any alpha channel and apply EXIF data transformations
-            self.image = self.image.convert("RGB")
-            self.image = exif_transpose(self.image)
-        return self
-
-    def crop(self):
-        """
-        Crop the image using the detected crop handler class.
-        If cropping is not enabled, we do nothing.
-
-        Returns:
-            TrainingSample: The current TrainingSample instance.
-        """
-        if not self.crop_enabled:
-            return self
-        # Too-big of an image, resize before we crop.
-        self.calculate_target_size()
-        self._downsample_before_crop()
-        self.save_debug_image(f"images/{time.time()}-0.5-downsampled.png")
-        if self.image is not None:
-            logger.debug(f"setting image: {self.image.size}")
-            self.cropper.set_image(self.image)
-        logger.debug(f"Cropper size updating to {self.current_size}")
-        self.cropper.set_intermediary_size(self.current_size[0], self.current_size[1])
-        self.image, self.crop_coordinates = self.cropper.crop(
-            self.target_size[0], self.target_size[1]
-        )
-        self.current_size = self.target_size
-        logger.debug(
-            f"Cropped to {self.image.size if self.image is not None else self.current_size} via crop coordinates {self.crop_coordinates} {'resulting in current_size of' if self.image is not None else ''} {self.current_size if self.image is not None else ''}"
-        )
-        return self
-
-    def resize(self, size: tuple = None):
-        """
-        Resize the image to a new size. If one is not provided, we will use the precalculated self.target_size
-
-        Args:
-            (optional) target_size (tuple): The target size as (width, height).
-        Returns:
-            TrainingSample: The current TrainingSample instance.
-        """
-        current_size = self.image.size if self.image is not None else self.original_size
-        if size is None:
-            if not self.valid_metadata:
-                self.target_size, self.intermediary_size, self.target_aspect_ratio = (
-                    self.calculate_target_size()
-                )
-            size = self.target_size
-            if self.target_size != self.intermediary_size:
-                logger.debug(
-                    f"we have to crop because target size {self.target_size} != intermediary size {self.intermediary_size}"
-                )
-                # Now we can resize the image to the intermediary size.
-                if self.image is not None:
-                    self.image = self.image.resize(
-                        self.intermediary_size, Image.Resampling.LANCZOS
-                    )
-                self.current_size = self.intermediary_size
-                if self.image is not None and self.cropper:
-                    self.cropper.set_image(self.image)
-                self.cropper.set_intermediary_size(
-                    self.intermediary_size[0], self.intermediary_size[1]
-                )
-                self.image, self.crop_coordinates = self.cropper.crop(
-                    self.target_size[0], self.target_size[1]
-                )
-                logger.debug(f"crop coordinates: {self.crop_coordinates}")
-                return self
-
-        if self.image and hasattr(self.image, "resize"):
-            self.image = self.image.resize(size, Image.Resampling.LANCZOS)
-            self.aspect_ratio = MultiaspectImage.calculate_image_aspect_ratio(
-                self.image.size
-            )
-        self.current_size = size
-        logger.debug(
-            f"Resized to {self.current_size} (aspect ratio: {self.aspect_ratio})"
-        )
-        return self
-
-    def get_image(self):
-        """
-        Returns the current state of the image.
-        If using the `parquet` metadata backend, this value may be None during the initial aspect bucketing phase.
-
-        Returns:
-            Image.Image: The current image.
-        """
-        return self.image
-
-    def is_conditioning_sample(self):
-        return self.conditioning_type is not None
-
-    def get_conditioning_type(self):
-        return self.conditioning_type
-
-    def cache_path(self):
-        """
-        Given an image path, manipulate the prefix and suffix to return its counterpart cache path.
-        The image extension will be stripped and replaced with the appropriate value (.pt).
-        If the instance_data_dir is found in the path, it will be replaced with the cache_dir.
-
-        Returns:
-            str: The cache path for the image.
-        """
-        vae_cache = StateTracker.get_data_backend(self.data_backend_id)["vaecache"]
-
-        return vae_cache.image_path_to_vae_path.get(self._image_path, None)
-
-    def image_path(self, basename_only=False):
-        """
-        Returns the absolute or basename path for the current training sample.
-
-        Args:
-            basename_only (bool): Whether to return the basename only.
-        Returns:
-            str: The image path
-        """
-        if basename_only:
-            return os.path.basename(self._image_path)
-        return self._image_path
-
-
-class PreparedSample:
-    def __init__(
-        self,
-        image: Image.Image,
-        image_metadata: dict,
-        original_size: tuple,
-        intermediary_size: tuple,
-        target_size: tuple,
-        aspect_ratio: float,
-        crop_coordinates: tuple,
-    ):
-        """
-        Initializes a new PreparedSample instance with a provided PIL.Image object and optional metadata.
-
-        Args:
-        image (Image.Image): A PIL Image object.
-        metadata (dict): Optional metadata associated with the image.
-        """
-        self.image = image
-        self.image_metadata = image_metadata if image_metadata else {}
-        self.original_size = original_size
-        self.intermediary_size = intermediary_size
-        self.target_size = target_size
-        if image is not None and hasattr(image, "size") and type(image.size) is tuple:
-            self.aspect_ratio = MultiaspectImage.calculate_image_aspect_ratio(
-                image.size[0] / image.size[1]
-            )
-        else:
-            self.aspect_ratio = aspect_ratio
-        self.crop_coordinates = crop_coordinates
-
-    def __str__(self):
-        return f"PreparedSample(image={self.image}, original_size={self.original_size}, intermediary_size={self.intermediary_size}, target_size={self.target_size}, aspect_ratio={self.aspect_ratio}, crop_coordinates={self.crop_coordinates})"
-
-    def to_dict(self):
-        return {
-            "image": self.image,
-            "original_size": self.original_size,
-            "intermediary_size": self.intermediary_size,
-            "target_size": self.target_size,
-            "aspect_ratio": self.aspect_ratio,
-            "crop_coordinates": self.crop_coordinates,
-        }
diff --git a/videotuna/third_party/flux/log_format.py b/videotuna/third_party/flux/log_format.py
deleted file mode 100644
index b15eddbf..00000000
--- a/videotuna/third_party/flux/log_format.py
+++ /dev/null
@@ -1,109 +0,0 @@
-import logging
-import os
-
-from colorama import Back, Fore, Style, init
-
-
-class ColorizedFormatter(logging.Formatter):
-    level_colors = {
-        logging.DEBUG: Fore.CYAN,
-        logging.INFO: Fore.GREEN,
-        logging.WARNING: Fore.YELLOW,
-        logging.ERROR: Fore.RED,
-        logging.CRITICAL: Fore.RED + Back.WHITE + Style.BRIGHT,
-    }
-
-    def format(self, record):
-        level_color = self.level_colors.get(record.levelno, "")
-        reset_color = Style.RESET_ALL
-        message = super().format(record)
-        return f"{level_color}{message}{reset_color}"
-
-
-# Initialize colorama
-init(autoreset=True)
-
-# Create a logger
-logger = logging.getLogger()
-logger.setLevel(logging.DEBUG)  # Set lowest level to capture everything
-
-# Create handlers
-console_handler = logging.StreamHandler()
-console_handler.setLevel(
-    logging.INFO
-)  # Change to ERROR if you want to suppress INFO messages too
-console_handler.setFormatter(
-    ColorizedFormatter("%(asctime)s [%(levelname)s] %(message)s")
-)
-
-# blank out the existing debug.log, if exists
-if os.path.exists("debug.log"):
-    with open("debug.log", "w"):
-        pass
-
-# Create a file handler
-if not os.path.exists("cache"):
-    os.makedirs("cache")
-file_handler = logging.FileHandler("cache/debug.log")
-file_handler.setLevel(logging.DEBUG)  # Capture debug and above
-file_handler.setFormatter(
-    logging.Formatter("%(asctime)s [%(levelname)s] (%(name)s) %(message)s")
-)
-
-# Remove existing handlers
-for handler in logger.handlers[:]:
-    logger.removeHandler(handler)
-
-# Add handlers to the logger
-logger.addHandler(console_handler)
-logger.addHandler(file_handler)
-
-forward_logger = logging.getLogger("diffusers.models.unet_2d_condition")
-forward_logger.setLevel(logging.WARNING)
-
-pil_logger = logging.getLogger("PIL")
-pil_logger.setLevel(logging.INFO)
-pil_logger = logging.getLogger("PIL.Image")
-pil_logger.setLevel("ERROR")
-pil_logger = logging.getLogger("PIL.PngImagePlugin")
-pil_logger.setLevel("ERROR")
-transformers_logger = logging.getLogger("transformers.configuration_utils")
-transformers_logger.setLevel("ERROR")
-diffusers_logger = logging.getLogger("diffusers.configuration_utils")
-diffusers_logger.setLevel("ERROR")
-torchdistlogger = logging.getLogger("torch.distributed.nn.jit.instantiator")
-torchdistlogger.setLevel("WARNING")
-torch_utils_logger = logging.getLogger("diffusers.utils.torch_utils")
-torch_utils_logger.setLevel("ERROR")
-
-import warnings
-
-# Suppress specific PIL warning
-warnings.filterwarnings(
-    "ignore",
-    category=UserWarning,
-    module="PIL",
-    message="Palette images with Transparency expressed in bytes should be converted to RGBA images",
-)
-warnings.filterwarnings(
-    "ignore",
-    category=FutureWarning,
-    module="transformers.deepspeed",
-    message="transformers.deepspeed module is deprecated and will be removed in a future version. Please import deepspeed modules directly from transformers.integrations",
-)
-
-# Ignore torch.utils._pytree._register_pytree_node is deprecated. Please use torch.utils._pytree.register_pytree_node instead.
-warnings.filterwarnings(
-    "ignore",
-    category=DeprecationWarning,
-    module="torch.utils._pytree",
-    message="torch.utils._pytree._register_pytree_node is deprecated. Please use torch.utils._pytree.register_pytree_node instead.",
-)
-
-warnings.filterwarnings(
-    "ignore",
-)
-warnings.filterwarnings(
-    "ignore",
-    message=".*is deprecated.*",
-)
diff --git a/videotuna/third_party/flux/metadata/backends/base.py b/videotuna/third_party/flux/metadata/backends/base.py
deleted file mode 100644
index 0231c320..00000000
--- a/videotuna/third_party/flux/metadata/backends/base.py
+++ /dev/null
@@ -1,991 +0,0 @@
-import logging
-import os
-import threading
-import time
-from math import ceil, floor
-from multiprocessing import Process, Queue
-from pathlib import Path
-
-# For semaphore
-from threading import Semaphore, Thread
-
-import numpy as np
-import torch
-from PIL import Image
-from tqdm import tqdm
-
-from videotuna.third_party.flux.data_backend.base import BaseDataBackend
-from videotuna.third_party.flux.multiaspect.image import MultiaspectImage
-from videotuna.third_party.flux.training.multi_process import should_log
-from videotuna.third_party.flux.training.state_tracker import StateTracker
-
-logger = logging.getLogger("BaseMetadataBackend")
-if should_log():
-    logger.setLevel(os.environ.get("SIMPLETUNER_LOG_LEVEL", "INFO"))
-else:
-    logger.setLevel("ERROR")
-
-
-class MetadataBackend:
-    def __init__(
-        self,
-        id: str,
-        instance_data_dir: str,
-        cache_file: str,
-        metadata_file: str,
-        data_backend: BaseDataBackend,
-        accelerator,
-        batch_size: int,
-        resolution: float,
-        resolution_type: str,
-        delete_problematic_images: bool = False,
-        delete_unwanted_images: bool = False,
-        metadata_update_interval: int = 3600,
-        minimum_image_size: int = None,
-        cache_file_suffix: str = None,
-        repeats: int = 0,
-    ):
-        self.id = id
-        if self.id != data_backend.id:
-            raise ValueError(
-                f"BucketManager ID ({self.id}) must match the DataBackend ID ({data_backend.id})."
-            )
-        self.accelerator = accelerator
-        self.should_abort = False
-        self.data_backend = data_backend
-        self.batch_size = int(batch_size)
-        self.repeats = int(repeats)
-        self.instance_data_dir = instance_data_dir
-        if cache_file_suffix is not None:
-            cache_file = f"{cache_file}_{cache_file_suffix}"
-            metadata_file = f"{metadata_file}_{cache_file_suffix}"
-        self.cache_file = Path(f"{cache_file}.json")
-        self.metadata_file = Path(f"{metadata_file}.json")
-        self.aspect_ratio_bucket_indices = {}
-        self.image_metadata = {}  # Store image metadata
-        self.seen_images = {}
-        self.config = {}
-        self.reload_cache()
-        self.resolution = float(resolution)
-        self.resolution_type = resolution_type
-        self.delete_problematic_images = delete_problematic_images
-        self.delete_unwanted_images = delete_unwanted_images
-        self.metadata_update_interval = metadata_update_interval
-        self.minimum_image_size = (
-            float(minimum_image_size) if minimum_image_size else None
-        )
-        self.image_metadata_loaded = False
-        self.vae_output_scaling_factor = 8
-        self.metadata_semaphor = Semaphore()
-        # When a multi-gpu system splits the buckets, we no longer update.
-        self.read_only = False
-
-    def load_metadata(self):
-        raise NotImplementedError
-
-    def save_metadata(self):
-        raise NotImplementedError
-
-    def _bucket_worker(
-        self,
-        tqdm_queue,
-        files,
-        aspect_ratio_bucket_indices_queue,
-        metadata_updates_queue,
-        written_files_queue,
-        existing_files_set,
-    ):
-        """
-        A worker function to bucket a list of files.
-
-        Args:
-            tqdm_queue (Queue): A queue to report progress to.
-            files (list): A list of files to bucket.
-            aspect_ratio_bucket_indices_queue (Queue): A queue to report the bucket indices to.
-            existing_files_set (set): A set of existing files.
-
-        Returns:
-            dict: The bucket indices.
-        """
-        local_aspect_ratio_bucket_indices = {}
-        local_metadata_updates = {}
-        processed_file_list = set()
-        processed_file_count = 0
-        # Initialize statistics dictionary
-        statistics = {
-            "total_processed": 0,
-            "skipped": {
-                "already_exists": 0,
-                "metadata_missing": 0,
-                "not_found": 0,
-                "too_small": 0,
-                "other": 0,  # Add more specific reasons as needed
-            },
-        }
-
-        for file in files:
-            if str(file) not in existing_files_set:
-                logger.debug(f"Processing file {file}.")
-                try:
-                    local_aspect_ratio_bucket_indices = self._process_for_bucket(
-                        file,
-                        local_aspect_ratio_bucket_indices,
-                        metadata_updates=local_metadata_updates,
-                        delete_problematic_images=self.delete_problematic_images,
-                        statistics=statistics,
-                    )
-                except Exception as e:
-                    logger.error(
-                        f"Error processing file {file}. Reason: {e}. Skipping."
-                    )
-                    statistics["skipped"]["error"] += 1
-                logger.debug(
-                    f"Statistics: {statistics}, total: {sum([len(bucket) for bucket in local_aspect_ratio_bucket_indices.values()])}"
-                )
-                processed_file_count += 1
-                # Successfully processed
-                statistics["total_processed"] = processed_file_count
-                processed_file_list.add(file)
-            else:
-                statistics["skipped"]["already_exists"] += 1
-            tqdm_queue.put(1)
-            if processed_file_count % 500 == 0:
-                # Send updates to queues and reset the local dictionaries
-                if aspect_ratio_bucket_indices_queue is not None:
-                    aspect_ratio_bucket_indices_queue.put(
-                        local_aspect_ratio_bucket_indices
-                    )
-                if written_files_queue is not None:
-                    written_files_queue.put(processed_file_list)
-                metadata_updates_queue.put(local_metadata_updates)
-                local_aspect_ratio_bucket_indices = {}
-                local_metadata_updates = {}
-                processed_file_list = set()
-        if (
-            aspect_ratio_bucket_indices_queue is not None
-            and local_aspect_ratio_bucket_indices
-        ):
-            aspect_ratio_bucket_indices_queue.put(local_aspect_ratio_bucket_indices)
-        if local_metadata_updates:
-            metadata_updates_queue.put(local_metadata_updates)
-            # At the end of the _bucket_worker method
-            metadata_updates_queue.put(("statistics", statistics))
-        time.sleep(0.001)
-        logger.debug("Bucket worker completed processing. Returning to main thread.")
-
-    def compute_aspect_ratio_bucket_indices(self, ignore_existing_cache: bool = False):
-        """
-        Compute the aspect ratio bucket indices. The workhorse of this class.
-
-        Arguments:
-            ignore_existing_cache (bool): Whether to ignore the existing cache
-            and entirely recompute the aspect ratio bucket indices.
-
-        Returns:
-            dict: The aspect ratio bucket indices.
-        """
-        logger.info("Discovering new files...")
-        new_files = self._discover_new_files(
-            ignore_existing_cache=ignore_existing_cache
-        )
-
-        existing_files_set = set().union(*self.aspect_ratio_bucket_indices.values())
-        logger.info(
-            f"Compressed {len(existing_files_set)} existing files from {len(self.aspect_ratio_bucket_indices.values())}."
-        )
-        # Initialize aggregated statistics
-        aggregated_statistics = {
-            "total_processed": 0,
-            "skipped": {
-                "already_exists": len(existing_files_set),
-                "metadata_missing": 0,
-                "not_found": 0,
-                "too_small": 0,
-                "other": 0,
-            },
-        }
-        if not new_files:
-            logger.info("No new files discovered. Doing nothing.")
-            logger.info(f"Statistics: {aggregated_statistics}")
-            return
-        num_cpus = (
-            StateTracker.get_args().aspect_bucket_worker_count
-        )  # Using a fixed number for better control and predictability
-        files_split = np.array_split(new_files, num_cpus)
-
-        metadata_updates_queue = Queue()
-        written_files_queue = Queue()
-        tqdm_queue = Queue()
-        aspect_ratio_bucket_indices_queue = Queue()
-        try:
-            self.load_image_metadata()
-        except Exception as e:
-            if ignore_existing_cache:
-                logger.warning(
-                    f"Error loading image metadata, creating new metadata cache: {e}"
-                )
-                self.image_metadata = {}
-            else:
-                raise Exception(
-                    f"Error loading image metadata. You may have to remove the metadata json file '{self.metadata_file}' and VAE cache manually: {e}"
-                )
-        worker_cls = (
-            Process if StateTracker.get_args().enable_multiprocessing else Thread
-        )
-        workers = [
-            worker_cls(
-                target=self._bucket_worker,
-                args=(
-                    tqdm_queue,
-                    file_shard,
-                    aspect_ratio_bucket_indices_queue,
-                    metadata_updates_queue,
-                    written_files_queue,
-                    existing_files_set,
-                ),
-            )
-            for file_shard in files_split
-        ]
-
-        for worker in workers:
-            worker.start()
-        last_write_time = time.time()
-        written_files = set()
-        with tqdm(
-            desc="Generating aspect bucket cache",
-            total=len(new_files),
-            leave=False,
-            ncols=100,
-            miniters=int(len(new_files) / 100),
-        ) as pbar:
-            if self.should_abort:
-                logger.info("Aborting aspect bucket update.")
-                return
-            while (
-                any(worker.is_alive() for worker in workers)
-                or not tqdm_queue.empty()
-                or not aspect_ratio_bucket_indices_queue.empty()
-                or not metadata_updates_queue.empty()
-                or not written_files_queue.empty()
-            ):
-                current_time = time.time()
-                while not tqdm_queue.empty():
-                    pbar.update(tqdm_queue.get())
-                while not aspect_ratio_bucket_indices_queue.empty():
-                    aspect_ratio_bucket_indices_update = (
-                        aspect_ratio_bucket_indices_queue.get()
-                    )
-                    for key, value in aspect_ratio_bucket_indices_update.items():
-                        self.aspect_ratio_bucket_indices.setdefault(key, []).extend(
-                            value
-                        )
-                # Now, pull metadata updates from the queue
-                while not metadata_updates_queue.empty():
-                    metadata_update = metadata_updates_queue.get()
-                    if (
-                        type(metadata_update) is tuple
-                        and metadata_update[0] == "statistics"
-                    ):
-                        logger.debug(
-                            f"Received statistics update: {metadata_update[1]}"
-                        )
-                        for reason, count in metadata_update[1]["skipped"].items():
-                            aggregated_statistics["skipped"][reason] += count
-                        aggregated_statistics["total_processed"] += metadata_update[1][
-                            "total_processed"
-                        ]
-                        continue
-                    for filepath, meta in metadata_update.items():
-                        self.set_metadata_by_filepath(
-                            filepath=filepath, metadata=meta, update_json=False
-                        )
-                # Process the written files queue
-                while not written_files_queue.empty():
-                    written_files_batch = written_files_queue.get()
-                    written_files.update(written_files_batch)  # Use update for sets
-
-                processing_duration = current_time - last_write_time
-                if processing_duration >= self.metadata_update_interval:
-                    logger.debug(
-                        f"In-flight metadata update after {processing_duration} seconds. Saving {len(self.image_metadata)} metadata entries and {len(self.aspect_ratio_bucket_indices)} aspect bucket lists."
-                    )
-                    self.save_cache(enforce_constraints=False)
-                    self.save_image_metadata()
-                    last_write_time = current_time
-
-                time.sleep(0.001)
-
-        for worker in workers:
-            worker.join()
-        logger.info(f"Image processing statistics: {aggregated_statistics}")
-        self.save_image_metadata()
-        self.save_cache(enforce_constraints=True)
-        logger.info("Completed aspect bucket update.")
-
-    def split_buckets_between_processes(self, gradient_accumulation_steps=1):
-        """
-        Splits the contents of each bucket in aspect_ratio_bucket_indices between the available processes.
-        """
-        new_aspect_ratio_bucket_indices = {}
-        total_images = sum(
-            [len(bucket) for bucket in self.aspect_ratio_bucket_indices.values()]
-        )
-        logger.debug(f"Count of items before split: {total_images}")
-
-        # Determine the effective batch size for all processes considering gradient accumulation
-        num_processes = self.accelerator.num_processes
-        effective_batch_size = (
-            self.batch_size * num_processes * gradient_accumulation_steps
-        )
-
-        for bucket, images in self.aspect_ratio_bucket_indices.items():
-            # Trim the list to a length that's divisible by the effective batch size
-            total_img_count_incl_repeats = len(images) * (self.repeats + 1)
-            num_batches = ceil(total_img_count_incl_repeats / effective_batch_size)
-            trimmed_images = images[: num_batches * effective_batch_size]
-            if len(trimmed_images) == 0 and should_log():
-                logger.error(
-                    f"Bucket {bucket} has no images after trimming because {len(images)} images are not enough to satisfy an effective batch size of {effective_batch_size}."
-                    " Lower your batch size, increase repeat count, or increase data pool size."
-                )
-
-            with self.accelerator.split_between_processes(
-                trimmed_images, apply_padding=False
-            ) as images_split:
-                # Now images_split contains only the part of the images list that this process should handle
-                new_aspect_ratio_bucket_indices[bucket] = images_split
-
-        # Replace the original aspect_ratio_bucket_indices with the new one containing only this process's share
-        self.aspect_ratio_bucket_indices = new_aspect_ratio_bucket_indices
-        post_total = sum(
-            [len(bucket) for bucket in self.aspect_ratio_bucket_indices.values()]
-        )
-        if total_images != post_total:
-            self.read_only = True
-
-        logger.debug(f"Count of items after split: {post_total}")
-
-    def mark_as_seen(self, image_path):
-        """Mark an image as seen."""
-        self.seen_images[image_path] = True
-
-    def mark_batch_as_seen(self, image_paths):
-        """Efficiently extend the Manager with new contents, image_paths
-
-        Args:
-            image_paths (list): A list of image paths to mark as seen.
-        """
-        self.seen_images.update({image_path: True for image_path in image_paths})
-
-    def is_seen(self, image_path):
-        """Check if an image is seen."""
-        return self.seen_images.get(image_path, False)
-
-    def reset_seen_images(self):
-        """Reset the seen images."""
-        self.seen_images.clear()
-
-    def remove_image(self, image_path, bucket: str = None):
-        """
-        Used by other classes to reliably remove images from a bucket.
-
-        Args:
-            image_path (str): The path to the image to remove.
-            bucket (str): The bucket to remove the image from.
-
-        Returns:
-            dict: The aspect ratio bucket indices.
-        """
-        if not bucket:
-            for bucket, images in self.aspect_ratio_bucket_indices.items():
-                if image_path in images:
-                    self.aspect_ratio_bucket_indices[bucket].remove(image_path)
-                    break
-        if image_path in self.aspect_ratio_bucket_indices[bucket]:
-            self.aspect_ratio_bucket_indices[bucket].remove(image_path)
-
-    def update_buckets_with_existing_files(self, existing_files: set):
-        """
-        Update bucket indices to remove entries that no longer exist and remove duplicates.
-
-        Args:
-            existing_files (set): A set of existing files.
-        """
-        logger.debug(
-            f"Before updating, in all buckets, we had {sum([len(bucket) for bucket in self.aspect_ratio_bucket_indices.values()])}."
-        )
-        for bucket, images in self.aspect_ratio_bucket_indices.items():
-            # Remove non-existing files and duplicates while preserving order
-            filtered_images = list(
-                dict.fromkeys(img for img in images if img in existing_files)
-            )
-            self.aspect_ratio_bucket_indices[bucket] = filtered_images
-        logger.debug(
-            f"After updating, in all buckets, we had {sum([len(bucket) for bucket in self.aspect_ratio_bucket_indices.values()])}."
-        )
-        # Save the updated cache
-        self.save_cache()
-
-    def refresh_buckets(self, rank: int = None):
-        """
-        Discover new files and remove images that no longer exist.
-        """
-        # Discover new files and update bucket indices
-        self.compute_aspect_ratio_bucket_indices()
-
-        # Get the list of existing files
-        logger.debug(
-            f"Refreshing buckets for rank {rank} via data_backend id {self.id}."
-        )
-        existing_files = StateTracker.get_image_files(data_backend_id=self.id)
-
-        if not StateTracker.get_args().ignore_missing_files:
-            # Update bucket indices to remove entries that no longer exist
-            self.update_buckets_with_existing_files(existing_files)
-        return
-
-    def _enforce_min_bucket_size(self):
-        """
-        Remove buckets that have fewer samples than batch_size and enforce minimum image size constraints.
-        """
-        logger.info(
-            f"Enforcing minimum image size of {self.minimum_image_size}."
-            " This could take a while for very-large datasets."
-        )
-        for bucket in tqdm(
-            list(self.aspect_ratio_bucket_indices.keys()),
-            leave=False,
-            desc="Enforcing minimum bucket size",
-        ):  # Safe iteration over keys
-            # Prune the smaller buckets so that we don't enforce resolution constraints on them unnecessarily.
-            self._prune_small_buckets(bucket)
-            if self.minimum_image_size is not None:
-                self._enforce_resolution_constraints(bucket)
-                # We do this twice in case there were any new contenders for being too small.
-                self._prune_small_buckets(bucket)
-
-    def _prune_small_buckets(self, bucket):
-        """
-        Remove buckets with fewer images than the batch size.
-        """
-        if StateTracker.get_args().disable_bucket_pruning:
-            logger.warning(
-                "Not pruning small buckets, as --disable_bucket_pruning is provided."
-            )
-            return
-        if (
-            bucket in self.aspect_ratio_bucket_indices
-            and (
-                len(self.aspect_ratio_bucket_indices[bucket]) * (int(self.repeats) + 1)
-            )
-            < self.batch_size
-        ):
-            bucket_sample_count = len(self.aspect_ratio_bucket_indices[bucket])
-            del self.aspect_ratio_bucket_indices[bucket]
-            logger.warning(
-                f"Removing bucket {bucket} due to insufficient samples; your batch size may be too large for the small quantity of data (batch_size={self.batch_size} > sample_count={bucket_sample_count})."
-            )
-
-    def _enforce_resolution_constraints(self, bucket):
-        """
-        Enforce resolution constraints on images in a bucket.
-        """
-        if self.minimum_image_size is not None:
-            if bucket not in self.aspect_ratio_bucket_indices:
-                logger.debug(
-                    f"Bucket {bucket} was already removed due to insufficient samples."
-                )
-                return
-            images = self.aspect_ratio_bucket_indices[bucket]
-            total_before = len(images)
-            self.aspect_ratio_bucket_indices[bucket] = [
-                img
-                for img in images
-                if self.meets_resolution_requirements(
-                    image_path=img,
-                    image=None,
-                )
-            ]
-            total_after = len(self.aspect_ratio_bucket_indices[bucket])
-            total_lost = total_before - total_after
-            if total_lost > 0:
-                logger.info(
-                    f"Had {total_before} samples before and {total_lost} that did not meet the minimum image size requirement ({self.minimum_image_size})."
-                )
-
-    def meets_resolution_requirements(
-        self,
-        image_path: str = None,
-        image: Image = None,
-        image_metadata: dict = None,
-    ):
-        """
-        Check if an image meets the resolution requirements.
-        """
-        if image is None and (image_path is not None and image_metadata is None):
-            metadata = self.get_metadata_by_filepath(image_path)
-            if metadata is None:
-                logger.warning(f"Metadata not found for image {image_path}.")
-                return False
-            width, height = metadata["original_size"]
-        elif image is not None:
-            width, height = image.size
-        elif image_metadata is not None:
-            width, height = image_metadata["original_size"]
-        else:
-            # Unexpected condition
-            raise ValueError(
-                f"meets_resolution_requirements expects an image_path"
-                f" ({image_path}) or Image object ({image}), but received neither."
-            )
-
-        if self.minimum_image_size is None:
-            return True
-
-        if self.resolution_type == "pixel":
-            return (
-                self.minimum_image_size <= width and self.minimum_image_size <= height
-            )
-        elif self.resolution_type == "area":
-            # We receive megapixel integer value, and then have to compare here by converting minimum_image_size MP to pixels.
-            if self.minimum_image_size > 5:
-                raise ValueError(
-                    f"--minimum_image_size was given with a value of {self.minimum_image_size} but resolution_type is area, which means this value is most likely too large. Please use a value less than 5."
-                )
-            # We need to find the square image length if crop_style = square.
-            minimum_image_size = self.minimum_image_size * 1_000_000
-            if (
-                StateTracker.get_data_backend_config(self.id).get("crop", False)
-                and StateTracker.get_data_backend_config(self.id).get(
-                    "crop_aspect", "square"
-                )
-                == "square"
-            ):
-                # When comparing the 'area' of an image but cropping to square area, one side might be too small.
-                # So we have to convert our megapixel value to a 1.0 aspect square image size.
-                # We do this by taking the square root of the megapixel value.
-                pixel_edge_len = floor(np.sqrt(minimum_image_size))
-                if not (pixel_edge_len <= width and pixel_edge_len <= height):
-                    # If the square edge length is too small, then the image is too small.
-                    return False
-            # Since we've now tested whether a square-cropped image will be adequate, we can calculate the area of the image.
-            return minimum_image_size <= width * height
-        else:
-            raise ValueError(
-                f"BucketManager.meets_resolution_requirements received unexpected value for resolution_type: {self.resolution_type}"
-            )
-
-    def handle_incorrect_bucket(
-        self, image_path: str, bucket: str, actual_bucket: str, save_cache: bool = True
-    ):
-        """
-        Used by other classes to move images between buckets, when mis-detected.
-
-        Args:
-            image_path (str): The path to the image to move.
-            bucket (str): The bucket to move the image from.
-            actual_bucket (str): The bucket to move the image to.
-        """
-        logger.warning(
-            f"Found an image in bucket {bucket} it doesn't belong in, when actually it is: {actual_bucket}"
-        )
-        self.remove_image(image_path, bucket)
-        if actual_bucket in self.aspect_ratio_bucket_indices:
-            logger.warning("Moved image to bucket, it already existed.")
-            self.aspect_ratio_bucket_indices[actual_bucket].append(image_path)
-        else:
-            logger.warning("Created new bucket for that pesky image.")
-            self.aspect_ratio_bucket_indices[actual_bucket] = [image_path]
-        if save_cache:
-            self.save_cache()
-
-    def handle_small_image(
-        self, image_path: str, bucket: str, delete_unwanted_images: bool
-    ):
-        """
-        Used by other classes to remove an image, or DELETE it from disk, depending on parameters.
-
-        Args:
-            image_path (str): The path to the image to remove.
-            bucket (str): The bucket to remove the image from.
-            delete_unwanted_images (bool): Whether to delete the image from disk.
-        """
-        if delete_unwanted_images:
-            try:
-                logger.warning(
-                    f"Image {image_path} too small: DELETING image and continuing search."
-                )
-                self.data_backend.delete(image_path)
-            except Exception:
-                logger.debug(
-                    f"Image {image_path} was already deleted. Another GPU must have gotten to it."
-                )
-        else:
-            logger.warning(
-                f"Image {image_path} too small, but --delete_unwanted_images is not provided, so we simply ignore and remove from bucket."
-            )
-        self.remove_image(image_path, bucket)
-
-    def has_single_underfilled_bucket(self):
-        """
-        Check if there's only one active bucket and it has fewer images than the batch size.
-
-        Returns:
-            bool: True if there's a single underfilled bucket, False otherwise.
-        """
-        if len(self.aspect_ratio_bucket_indices) != 1:
-            return False
-
-        bucket = list(self.aspect_ratio_bucket_indices.keys())[0]
-        if (
-            len(self.aspect_ratio_bucket_indices[bucket]) * (int(self.repeats) + 1)
-        ) < self.batch_size:
-            return True
-
-        return False
-
-    def read_cache(self):
-        """
-        Read the entire bucket cache.
-        """
-        return self.aspect_ratio_bucket_indices
-
-    def get_metadata_attribute_by_filepath(self, filepath: str, attribute: str):
-        """Use get_metadata_by_filepath to return a specific attribute.
-
-        Args:
-            filepath (str): The complete path from the aspect bucket list.
-            attribute (str): The attribute you are seeking.
-
-        Returns:
-            any type: The attribute value, or None.
-        """
-        metadata = self.get_metadata_by_filepath(filepath)
-        if metadata:
-            return metadata.get(attribute, None)
-        else:
-            return None
-
-    def set_metadata_attribute_by_filepath(
-        self, filepath: str, attribute: str, value: any, update_json: bool = True
-    ):
-        """Use set_metadata_by_filepath to update the contents of a specific attribute.
-
-        Args:
-            filepath (str): The complete path from the aspect bucket list.
-            attribute (str): The attribute you are updating.
-            value (any type): The value to set.
-        """
-        metadata = self.get_metadata_by_filepath(filepath) or {}
-        metadata[attribute] = value
-        return self.set_metadata_by_filepath(filepath, metadata, update_json)
-
-    def set_metadata_by_filepath(
-        self, filepath: str, metadata: dict, update_json: bool = True
-    ):
-        """Set metadata for a given image file path.
-
-        Args:
-            filepath (str): The complete path from the aspect bucket list.
-        """
-        with self.metadata_semaphor:
-            logger.debug(f"Setting metadata for {filepath} to {metadata}.")
-            self.image_metadata[filepath] = metadata
-            if update_json:
-                self.save_image_metadata()
-
-    def get_metadata_by_filepath(self, filepath: str):
-        """Retrieve metadata for a given image file path.
-
-        Args:
-            filepath (str): The complete or basename path from the aspect bucket list.
-                            First, we search for the basename as the key, and we fall
-                             back to the
-
-        Returns:
-            dict: Metadata for the image. Returns None if not found.
-        """
-        if type(filepath) is tuple or type(filepath) is list:
-            for path in filepath:
-                if path in self.image_metadata:
-                    result = self.image_metadata.get(path, None)
-                    logger.debug(
-                        f"Retrieving metadata for path: {filepath}, result: {result}"
-                    )
-                    if result is not None:
-                        return result
-            return None
-
-        return self.image_metadata.get(filepath, None)
-
-    def scan_for_metadata(self):
-        """
-        Update the metadata without modifying the bucket indices.
-        """
-        logger.info(f"Loading metadata from {self.metadata_file}")
-        self.load_image_metadata()
-        logger.debug(
-            f"A subset of the available metadata: {list(self.image_metadata.keys())[:5]}"
-        )
-        logger.info("Discovering new images for metadata scan...")
-        new_files = self._discover_new_files(for_metadata=True)
-        if not new_files:
-            logger.info("No new files discovered. Exiting.")
-            return
-
-        existing_files_set = {
-            existing_file for existing_file in self.image_metadata.keys()
-        }
-
-        num_cpus = 8  # Using a fixed number for better control and predictability
-        files_split = np.array_split(new_files, num_cpus)
-
-        metadata_updates_queue = Queue()
-        tqdm_queue = Queue()
-        worker_cls = (
-            Process if StateTracker.get_args().enable_multiprocessing else Thread
-        )
-        workers = [
-            worker_cls(
-                target=self._bucket_worker,
-                args=(
-                    tqdm_queue,
-                    file_shard,
-                    None,  # Passing None to indicate we don't want to update the buckets
-                    metadata_updates_queue,
-                    None,  # Passing None to indicate we don't want to update the written files list
-                    existing_files_set,
-                ),
-            )
-            for file_shard in files_split
-        ]
-
-        for worker in workers:
-            worker.start()
-
-        with tqdm(
-            desc="Scanning image metadata",
-            total=len(new_files),
-            leave=False,
-            ncols=100,
-        ) as pbar:
-            while any(worker.is_alive() for worker in workers):
-                while not tqdm_queue.empty():
-                    pbar.update(tqdm_queue.get())
-
-                # Only update the metadata
-                while not metadata_updates_queue.empty():
-                    metadata_update = metadata_updates_queue.get()
-                    logger.debug(
-                        f"Received type of metadata update: {type(metadata_update)}, contents: {metadata_update}"
-                    )
-                    if type(metadata_update) == dict:
-                        for filepath, meta in metadata_update.items():
-                            self.set_metadata_by_filepath(
-                                filepath=filepath, metadata=meta, update_json=False
-                            )
-
-        for worker in workers:
-            worker.join()
-
-        self.save_image_metadata()
-        self.save_cache(enforce_constraints=True)
-        logger.info("Completed metadata update.")
-
-    def handle_vae_cache_inconsistencies(self, vae_cache, vae_cache_behavior: str):
-        """
-        Handles inconsistencies between the aspect buckets and the VAE cache.
-
-        Args:
-            vae_cache: The VAECache object.
-            vae_cache_behavior (str): Behavior for handling inconsistencies ('sync' or 'recreate').
-        """
-        if "deepfloyd" in StateTracker.get_args().model_type:
-            return
-        if vae_cache_behavior not in ["sync", "recreate"]:
-            raise ValueError("Invalid VAE cache behavior specified.")
-        logger.info("Scanning VAE cache for inconsistencies with aspect buckets...")
-        try:
-            for cache_file, cache_content in vae_cache.scan_cache_contents():
-                if cache_content is None:
-                    continue
-                if vae_cache_behavior == "sync":
-                    # Sync aspect buckets with the cache
-                    expected_bucket = str(
-                        self._get_aspect_ratio_from_tensor(cache_content)
-                    )
-                    self._modify_cache_entry_bucket(cache_file, expected_bucket)
-                elif vae_cache_behavior == "recreate":
-                    # Delete the cache file if it doesn't match the aspect bucket indices
-                    if self.is_cache_inconsistent(vae_cache, cache_file, cache_content):
-                        threading.Thread(
-                            target=self.data_backend.delete,
-                            args=(cache_file,),
-                            daemon=True,
-                        ).start()
-        except Exception as e:
-            logger.debug(f"Error running VAE cache scan: {e}")
-            return
-
-        # Update any state or metadata post-processing
-        self.save_cache()
-
-    def _recalculate_target_resolution(self, original_aspect_ratio: float) -> tuple:
-        """Given the original resolution, use our backend config to properly recalculate the size."""
-        resolution_type = StateTracker.get_data_backend_config(self.id)[
-            "resolution_type"
-        ]
-        resolution = StateTracker.get_data_backend_config(self.id)["resolution"]
-        if resolution_type == "pixel":
-            return MultiaspectImage.calculate_new_size_by_pixel_edge(
-                original_aspect_ratio, int(resolution)
-            )
-        elif resolution_type == "area":
-            if original_aspect_ratio is None:
-                raise ValueError(
-                    "Original aspect ratio must be provided for area-based resolution."
-                )
-            return MultiaspectImage.calculate_new_size_by_pixel_area(
-                original_aspect_ratio, resolution
-            )
-
-    def is_cache_inconsistent(self, vae_cache, cache_file, cache_content):
-        """
-        Check if a cache file's content is inconsistent with the aspect ratio bucket indices.
-
-        Args:
-            cache_file (str): The cache file path.
-            cache_content: The content of the cache file (PyTorch Tensor).
-
-        Returns:
-            bool: True if the cache file is inconsistent, False otherwise.
-        """
-        # Get tensor shape and multiply by self.scaling_factor or 8
-        if cache_content is None:
-            return True
-        # is it a tensor with nan or inf values?
-        if torch.isnan(cache_content).any() or torch.isinf(cache_content).any():
-            logger.warning(f"Cache file {cache_file} contains NaN or Inf values.")
-            return True
-        image_filename = vae_cache._image_filename_from_vaecache_filename(cache_file)
-        logger.debug(
-            f"Checking cache file {cache_file} for inconsistencies. Image filename: {image_filename}"
-        )
-        actual_resolution = self._get_image_size_from_tensor(cache_content)
-        original_resolution = self.get_metadata_attribute_by_filepath(
-            image_filename, "original_size"
-        )
-        metadata_target_size = self.get_metadata_attribute_by_filepath(
-            image_filename, "target_size"
-        )
-        if metadata_target_size is None:
-            logger.error(
-                f"Received sample with no metadata: {self.get_metadata_by_filepath(image_filename)}"
-            )
-            return True
-        target_resolution = tuple(metadata_target_size)
-        recalculated_target_resolution, intermediary_size, recalculated_aspect_ratio = (
-            self._recalculate_target_resolution(
-                original_aspect_ratio=MultiaspectImage.calculate_image_aspect_ratio(
-                    original_resolution
-                )
-            )
-        )
-        logger.debug(
-            f"Original resolution: {original_resolution}, Target resolution: {target_resolution}, Recalculated target resolution: {recalculated_target_resolution}"
-        )
-        if (
-            original_resolution is not None
-            and target_resolution is not None
-            and (
-                actual_resolution != target_resolution
-                or actual_resolution != recalculated_target_resolution
-            )
-        ):
-            logger.debug(
-                f"Actual resolution {actual_resolution} does not match target resolution {target_resolution}, recalculated as {recalculated_target_resolution}."
-            )
-            return True
-        else:
-            logger.debug(
-                f"Actual resolution {actual_resolution} matches target resolution {target_resolution}."
-            )
-
-        actual_aspect_ratio = self._get_aspect_ratio_from_tensor(cache_content)
-        expected_bucket = str(recalculated_aspect_ratio)
-        logger.debug(
-            f"Expected bucket for {cache_file}: {expected_bucket} vs actual {actual_aspect_ratio}"
-        )
-
-        # Extract the base filename without the extension
-        base_filename = os.path.splitext(os.path.basename(cache_file))[0]
-        base_filename_png = os.path.join(self.instance_data_dir, f"{base_filename}.png")
-        base_filename_jpg = os.path.join(self.instance_data_dir, f"{base_filename}.jpg")
-        # Check if the base filename is in the correct bucket
-        if any(
-            base_filename_png in self.aspect_ratio_bucket_indices.get(bucket, set())
-            for bucket in [expected_bucket, str(expected_bucket)]
-        ):
-            logger.debug(f"File {base_filename} is in the correct bucket.")
-            return False
-        if any(
-            base_filename_jpg in self.aspect_ratio_bucket_indices.get(bucket, set())
-            for bucket in [expected_bucket, str(expected_bucket)]
-        ):
-            logger.debug(f"File {base_filename} is in the correct bucket.")
-            return False
-        logger.debug(f"File {base_filename} was not found in the correct place.")
-        return True
-
-    def _get_aspect_ratio_from_tensor(self, tensor):
-        """
-        Calculate the aspect ratio from a PyTorch Tensor.
-
-        Args:
-            tensor (torch.Tensor): The tensor representing the image.
-
-        Returns:
-            float: The aspect ratio of the image.
-        """
-        if tensor.dim() < 3:
-            raise ValueError(
-                "Tensor does not have enough dimensions to determine aspect ratio."
-            )
-        # Assuming tensor is in CHW format (channel, height, width)
-        _, height, width = tensor.size()
-        return width / height
-
-    def _get_image_size_from_tensor(self, tensor):
-        """
-        Calculate the image size from a PyTorch Tensor.
-
-        Args:
-            tensor (torch.Tensor): The tensor representing the image.
-
-        Returns:
-            tuple[width, height]: The resolution of the image just before it was encoded.
-        """
-        if tensor.dim() < 3:
-            raise ValueError(
-                f"Tensor does not have enough dimensions to determine an image resolution. Its shape is: {tensor.size}"
-            )
-        # Assuming tensor is in CHW format (channel, height, width)
-        _, height, width = tensor.size()
-        return (
-            width * self.vae_output_scaling_factor,
-            height * self.vae_output_scaling_factor,
-        )
-
-    def _modify_cache_entry_bucket(self, cache_file, expected_bucket):
-        """
-        Update the bucket indices based on the cache file's actual aspect ratio.
-
-        Args:
-            cache_file (str): The cache file path.
-            expected_bucket (str): The bucket that the cache file should belong to.
-        """
-        for bucket, files in self.aspect_ratio_bucket_indices.items():
-            if cache_file in files and str(bucket) != str(expected_bucket):
-                files.remove(cache_file)
-                self.aspect_ratio_bucket_indices[expected_bucket].append(cache_file)
-                break
diff --git a/videotuna/third_party/flux/metadata/backends/discovery.py b/videotuna/third_party/flux/metadata/backends/discovery.py
deleted file mode 100644
index 9c8d871b..00000000
--- a/videotuna/third_party/flux/metadata/backends/discovery.py
+++ /dev/null
@@ -1,282 +0,0 @@
-import json
-import logging
-import os
-import traceback
-from io import BytesIO
-
-from videotuna.third_party.flux.data_backend.base import BaseDataBackend
-from videotuna.third_party.flux.image_manipulation.brightness import calculate_luminance
-from videotuna.third_party.flux.image_manipulation.load import load_image
-from videotuna.third_party.flux.image_manipulation.training_sample import TrainingSample
-from videotuna.third_party.flux.metadata.backends.base import MetadataBackend
-from videotuna.third_party.flux.training import image_file_extensions
-from videotuna.third_party.flux.training.multi_process import should_log
-from videotuna.third_party.flux.training.state_tracker import StateTracker
-
-logger = logging.getLogger("DiscoveryMetadataBackend")
-if should_log():
-    target_level = os.environ.get("SIMPLETUNER_LOG_LEVEL", "INFO")
-else:
-    target_level = "ERROR"
-logger.setLevel(target_level)
-
-
-class DiscoveryMetadataBackend(MetadataBackend):
-    def __init__(
-        self,
-        id: str,
-        instance_data_dir: str,
-        cache_file: str,
-        metadata_file: str,
-        data_backend: BaseDataBackend,
-        accelerator,
-        batch_size: int,
-        resolution: float,
-        resolution_type: str,
-        delete_problematic_images: bool = False,
-        delete_unwanted_images: bool = False,
-        metadata_update_interval: int = 3600,
-        minimum_image_size: int = None,
-        cache_file_suffix: str = None,
-        repeats: int = 0,
-    ):
-        super().__init__(
-            id=id,
-            instance_data_dir=instance_data_dir,
-            cache_file=cache_file,
-            metadata_file=metadata_file,
-            data_backend=data_backend,
-            accelerator=accelerator,
-            batch_size=batch_size,
-            resolution=resolution,
-            resolution_type=resolution_type,
-            delete_problematic_images=delete_problematic_images,
-            delete_unwanted_images=delete_unwanted_images,
-            metadata_update_interval=metadata_update_interval,
-            minimum_image_size=minimum_image_size,
-            cache_file_suffix=cache_file_suffix,
-            repeats=repeats,
-        )
-
-    def _discover_new_files(
-        self, for_metadata: bool = False, ignore_existing_cache: bool = False
-    ):
-        """
-        Discover new files that have not been processed yet.
-
-        Returns:
-            list: A list of new files.
-        """
-        all_image_files = StateTracker.get_image_files(
-            data_backend_id=self.data_backend.id
-        )
-        if ignore_existing_cache:
-            # Return all files and remove the existing buckets.
-            logger.debug(
-                "Resetting the entire aspect bucket cache as we've received the signal to ignore existing cache."
-            )
-            self.aspect_ratio_bucket_indices = {}
-            return list(all_image_files.keys())
-        if all_image_files is None:
-            logger.debug("No image file cache available, retrieving fresh")
-            all_image_files = self.data_backend.list_files(
-                instance_data_dir=self.instance_data_dir,
-                file_extensions=image_file_extensions,
-            )
-            all_image_files = StateTracker.set_image_files(
-                all_image_files, data_backend_id=self.data_backend.id
-            )
-        else:
-            logger.debug("Using cached image file list")
-
-        # Flatten the list if it contains nested lists
-        if any(isinstance(i, list) for i in all_image_files):
-            all_image_files = [item for sublist in all_image_files for item in sublist]
-
-        # logger.debug(f"All image files: {json.dumps(all_image_files, indent=4)}")
-
-        all_image_files_set = set(all_image_files)
-
-        if for_metadata:
-            result = [
-                file
-                for file in all_image_files
-                if self.get_metadata_by_filepath(file) is None
-            ]
-        else:
-            processed_files = set(
-                path
-                for paths in self.aspect_ratio_bucket_indices.values()
-                for path in paths
-            )
-            result = [
-                file for file in all_image_files_set if file not in processed_files
-            ]
-
-        return result
-
-    def reload_cache(self, set_config: bool = True):
-        """
-        Load cache data from a JSON file.
-
-        Returns:
-            dict: The cache data.
-        """
-        # Query our DataBackend to see whether the cache file exists.
-        logger.debug(f"Checking for cache file: {self.cache_file}")
-        if self.data_backend.exists(self.cache_file):
-            try:
-                # Use our DataBackend to actually read the cache file.
-                logger.debug("Pulling cache file from storage")
-                cache_data_raw = self.data_backend.read(self.cache_file)
-                cache_data = json.loads(cache_data_raw)
-            except Exception as e:
-                logger.warning(
-                    f"Error loading aspect bucket cache, creating new one: {e}"
-                )
-                cache_data = {}
-            self.aspect_ratio_bucket_indices = cache_data.get(
-                "aspect_ratio_bucket_indices", {}
-            )
-            if set_config:
-                self.config = cache_data.get("config", {})
-                if self.config != {}:
-                    logger.debug(f"Setting config to {self.config}")
-                    logger.debug(f"Loaded previous data backend config: {self.config}")
-                    StateTracker.set_data_backend_config(
-                        data_backend_id=self.id,
-                        config=self.config,
-                    )
-            logger.debug(
-                f"(id={self.id}) Loaded {len(self.aspect_ratio_bucket_indices)} aspect ratio buckets"
-            )
-        else:
-            logger.warning("No cache file found, creating new one.")
-
-    def save_cache(self, enforce_constraints: bool = False):
-        """
-        Save cache data to file.
-        """
-        # Prune any buckets that have fewer samples than batch_size
-        if enforce_constraints:
-            self._enforce_min_bucket_size()
-        if self.read_only:
-            logger.debug("Skipping cache update on storage backend, read-only mode.")
-            return
-        # Convert any non-strings into strings as we save the index.
-        aspect_ratio_bucket_indices_str = {
-            key: [str(path) for path in value]
-            for key, value in self.aspect_ratio_bucket_indices.items()
-        }
-        # Encode the cache as JSON.
-        cache_data = {
-            "config": StateTracker.get_data_backend_config(
-                data_backend_id=self.data_backend.id
-            ),
-            "aspect_ratio_bucket_indices": aspect_ratio_bucket_indices_str,
-        }
-        logger.debug(f"save_cache has config to write: {cache_data['config']}")
-        cache_data_str = json.dumps(cache_data)
-        # Use our DataBackend to write the cache file.
-        self.data_backend.write(self.cache_file, cache_data_str)
-
-    def load_image_metadata(self):
-        """Load image metadata from a JSON file."""
-        self.image_metadata = {}
-        self.image_metadata_loaded = False
-        if self.data_backend.exists(self.metadata_file):
-            cache_data_raw = self.data_backend.read(self.metadata_file)
-            self.image_metadata = json.loads(cache_data_raw)
-            self.image_metadata_loaded = True
-
-    def save_image_metadata(self):
-        """Save image metadata to a JSON file."""
-        self.data_backend.write(self.metadata_file, json.dumps(self.image_metadata))
-
-    def _process_for_bucket(
-        self,
-        image_path_str,
-        aspect_ratio_bucket_indices,
-        aspect_ratio_rounding: int = 3,
-        metadata_updates=None,
-        delete_problematic_images: bool = False,
-        statistics: dict = {},
-    ):
-        try:
-            image_metadata = {}
-            image_data = self.data_backend.read(image_path_str)
-            if image_data is None:
-                logger.debug(
-                    f"Image {image_path_str} was not found on the backend. Skipping image."
-                )
-                statistics.setdefault("skipped", {}).setdefault("not_found", 0)
-                statistics["skipped"]["not_found"] += 1
-                return aspect_ratio_bucket_indices
-
-            with load_image(BytesIO(image_data)) as image:
-                if not self.meets_resolution_requirements(image=image):
-                    if not self.delete_unwanted_images:
-                        logger.debug(
-                            f"Image {image_path_str} does not meet minimum size requirements. Skipping image."
-                        )
-                    else:
-                        logger.debug(
-                            f"Image {image_path_str} does not meet minimum size requirements. Deleting image."
-                        )
-                        self.data_backend.delete(image_path_str)
-                    statistics.setdefault("skipped", {}).setdefault("too_small", 0)
-                    statistics["skipped"]["too_small"] += 1
-                    return aspect_ratio_bucket_indices
-
-                image_metadata["original_size"] = image.size
-                training_sample = TrainingSample(
-                    image=image,
-                    data_backend_id=self.id,
-                    image_metadata=image_metadata,
-                    image_path=image_path_str,
-                )
-                prepared_sample = training_sample.prepare()
-                image_metadata.update(
-                    {
-                        "crop_coordinates": prepared_sample.crop_coordinates,
-                        "target_size": prepared_sample.target_size,
-                        "intermediary_size": prepared_sample.intermediary_size,
-                        "aspect_ratio": prepared_sample.aspect_ratio,
-                        "luminance": calculate_luminance(image),
-                    }
-                )
-                logger.debug(
-                    f"Image {image_path_str} has aspect ratio {prepared_sample.aspect_ratio} and size {image.size}."
-                )
-
-            aspect_ratio_key = str(prepared_sample.aspect_ratio)
-            if aspect_ratio_key not in aspect_ratio_bucket_indices:
-                aspect_ratio_bucket_indices[aspect_ratio_key] = []
-            aspect_ratio_bucket_indices[aspect_ratio_key].append(image_path_str)
-
-            if metadata_updates is not None:
-                metadata_updates[image_path_str] = image_metadata
-
-        except Exception as e:
-            logger.error(f"Error processing image: {e}")
-            logger.error(f"Error traceback: {traceback.format_exc()}")
-            if delete_problematic_images:
-                logger.error(f"Deleting image {image_path_str}.")
-                self.data_backend.delete(image_path_str)
-
-        return aspect_ratio_bucket_indices
-
-    def __len__(self):
-        """
-        Returns:
-            int: The number of batches in the dataset, accounting for images that can't form a complete batch and are discarded.
-        """
-
-        def repeat_len(bucket):
-            return len(bucket) * (self.repeats + 1)
-
-        return sum(
-            (repeat_len(bucket) + (self.batch_size - 1)) // self.batch_size
-            for bucket in self.aspect_ratio_bucket_indices.values()
-            if repeat_len(bucket) >= self.batch_size
-        )
diff --git a/videotuna/third_party/flux/metadata/backends/parquet.py b/videotuna/third_party/flux/metadata/backends/parquet.py
deleted file mode 100644
index 6a64e61a..00000000
--- a/videotuna/third_party/flux/metadata/backends/parquet.py
+++ /dev/null
@@ -1,601 +0,0 @@
-import json
-import logging
-import os
-import time
-import traceback
-
-import numpy
-from tqdm import tqdm
-
-from videotuna.third_party.flux.data_backend.base import BaseDataBackend
-from videotuna.third_party.flux.image_manipulation.training_sample import TrainingSample
-from videotuna.third_party.flux.metadata.backends.base import MetadataBackend
-from videotuna.third_party.flux.multiaspect.image import MultiaspectImage
-from videotuna.third_party.flux.training import image_file_extensions
-from videotuna.third_party.flux.training.state_tracker import StateTracker
-
-logger = logging.getLogger("ParquetMetadataBackend")
-target_level = os.environ.get("SIMPLETUNER_LOG_LEVEL", "INFO")
-logger.setLevel(target_level)
-
-try:
-    import pandas as pd
-except ImportError:
-    raise ImportError("Pandas is required for the ParquetMetadataBackend.")
-
-
-class ParquetMetadataBackend(MetadataBackend):
-    def __init__(
-        self,
-        id: str,
-        instance_data_dir: str,
-        cache_file: str,
-        metadata_file: str,
-        data_backend: BaseDataBackend,
-        accelerator,
-        batch_size: int,
-        resolution: float,
-        resolution_type: str,
-        parquet_config: dict,
-        delete_problematic_images: bool = False,
-        delete_unwanted_images: bool = False,
-        metadata_update_interval: int = 3600,
-        minimum_image_size: int = None,
-        cache_file_suffix: str = None,
-        repeats: int = 0,
-    ):
-        self.parquet_config = parquet_config
-        self.parquet_path = parquet_config.get("path", None)
-        self.is_json_lines = self.parquet_path.endswith(".jsonl")
-        self.is_json_file = self.parquet_path.endswith(".json")
-        super().__init__(
-            id=id,
-            instance_data_dir=instance_data_dir,
-            cache_file=cache_file,
-            metadata_file=metadata_file,
-            data_backend=data_backend,
-            accelerator=accelerator,
-            batch_size=batch_size,
-            resolution=resolution,
-            resolution_type=resolution_type,
-            delete_problematic_images=delete_problematic_images,
-            delete_unwanted_images=delete_unwanted_images,
-            metadata_update_interval=metadata_update_interval,
-            minimum_image_size=minimum_image_size,
-            cache_file_suffix=cache_file_suffix,
-            repeats=repeats,
-        )
-        self.load_parquet_database()
-        self.caption_cache = self._extract_captions_to_fast_list()
-        self.missing_captions = self._locate_missing_caption_from_fast_list()
-        if self.missing_captions:
-            logger.warning(
-                f"Missing captions for {len(self.missing_captions)} images: {self.missing_captions}"
-            )
-
-    def load_parquet_database(self):
-        """
-        Load the parquet database from file.
-        """
-        if self.data_backend.exists(self.parquet_path):
-            try:
-                bytes_string = self.data_backend.read(self.parquet_path)
-                import io
-
-                pq = io.BytesIO(bytes_string)
-            except Exception as e:
-                raise e
-            if self.is_json_lines or self.is_json_file:
-                self.parquet_database = pd.read_json(pq, lines=self.is_json_lines)
-            else:
-                self.parquet_database = pd.read_parquet(pq, engine="pyarrow")
-            self.parquet_database.set_index(
-                self.parquet_config.get("filename_column"), inplace=True
-            )
-        else:
-            raise FileNotFoundError(
-                f"Parquet could not be loaded from {self.parquet_path}: database file does not exist (path={self.parquet_path})."
-            )
-
-    def _locate_missing_caption_from_fast_list(self):
-        """
-        Check the fast list keys vs the filenames in our aspect ratio bucket indices.
-        """
-        missing_captions = []
-        identifier_includes_extension = self.parquet_config.get(
-            "identifier_includes_extension", False
-        )
-        # currently we just don't do this.
-        identifier_includes_path = False
-        for key in self.aspect_ratio_bucket_indices.keys():
-            for filename in self.aspect_ratio_bucket_indices[key]:
-                if not identifier_includes_extension:
-                    filename = os.path.splitext(filename)[0]
-                if not identifier_includes_path:
-                    # strip out self.instance_data_dir
-                    filename = filename.replace(self.instance_data_dir, "")
-                    # any leading /
-                    if filename.startswith("/"):
-                        filename = filename[1:]
-                if filename not in self.caption_cache:
-                    missing_captions.append(filename)
-        return missing_captions
-
-    def _extract_captions_to_fast_list(self):
-        """
-        Pull the captions from the parquet table into a dict with the format {filename: caption}.
-
-        This helps because parquet's columnar format sucks for searching.
-
-        Returns:
-            dict: A dictionary of captions.
-        """
-        if self.parquet_database is None:
-            raise ValueError("Parquet database is not loaded.")
-        filename_column = self.parquet_config.get("filename_column")
-        caption_column = self.parquet_config.get("caption_column")
-        fallback_caption_column = self.parquet_config.get("fallback_caption_column")
-        identifier_includes_extension = self.parquet_config.get(
-            "identifier_includes_extension", False
-        )
-        captions = {}
-        for index, row in self.parquet_database.iterrows():
-            if filename_column in row:
-                filename = str(row[filename_column])
-            else:
-                filename = str(index)
-            if not identifier_includes_extension:
-                filename = os.path.splitext(filename)[0]
-
-            if type(caption_column) == list:
-                caption = None
-                if len(caption_column) > 0:
-                    caption = [row[c] for c in caption_column]
-            else:
-                caption = row[caption_column]
-
-            if not caption and fallback_caption_column:
-                caption = row[fallback_caption_column]
-            if not caption:
-                raise ValueError(
-                    f"Could not locate caption for image {filename} in sampler_backend {self.id} with filename column {filename_column}, caption column {caption_column}, and a parquet database with {len(self.parquet_database)} entries."
-                )
-            if type(caption) == bytes:
-                caption = caption.decode("utf-8")
-            elif type(caption) == list:
-                caption = [c.strip() for c in caption if c.strip()]
-            if caption:
-                caption = caption.strip()
-            captions[filename] = caption
-        return captions
-
-    def caption_cache_entry(self, index: str):
-        result = self.caption_cache.get(str(index), None)
-
-        logger.debug(f"Caption cache entry for idx {str(index)}: {result}")
-        return result
-
-    def _discover_new_files(
-        self, for_metadata: bool = False, ignore_existing_cache: bool = False
-    ):
-        """
-        Discover new files that have not been processed yet.
-
-        Returns:
-            list: A list of new files.
-        """
-        all_image_files = StateTracker.get_image_files(
-            data_backend_id=self.data_backend.id
-        )
-        if all_image_files is None:
-            logger.debug("No image file cache available, retrieving fresh")
-            all_image_files = self.data_backend.list_files(
-                instance_data_dir=self.instance_data_dir,
-                file_extensions=image_file_extensions,
-            )
-            all_image_files = StateTracker.set_image_files(
-                all_image_files, data_backend_id=self.data_backend.id
-            )
-        else:
-            logger.debug("Using cached image file list")
-        if ignore_existing_cache:
-            # Return all files and remove the existing buckets.
-            logger.debug(
-                "Resetting the entire aspect bucket cache as we've received the signal to ignore existing cache."
-            )
-            self.aspect_ratio_bucket_indices = {}
-            return list(all_image_files.keys())
-        # Flatten the list if it contains nested lists
-        if any(isinstance(i, list) for i in all_image_files):
-            all_image_files = [item for sublist in all_image_files for item in sublist]
-
-        # logger.debug(f"All image files: {json.dumps(all_image_files, indent=4)}")
-
-        all_image_files_set = set(all_image_files)
-
-        if for_metadata:
-            result = [
-                file
-                for file in all_image_files
-                if self.get_metadata_by_filepath(file) is None
-            ]
-        elif ignore_existing_cache:
-            # Remove existing aspect bucket indices and return all image files.
-            result = all_image_files
-            self.aspect_ratio_bucket_indices = {}
-        else:
-            processed_files = set(
-                path
-                for paths in self.aspect_ratio_bucket_indices.values()
-                for path in paths
-            )
-            result = [
-                file for file in all_image_files_set if file not in processed_files
-            ]
-
-        return result
-
-    def reload_cache(self, set_config: bool = True):
-        """
-        Load cache data from a parquet file.
-
-        Returns:
-            dict: The cache data.
-        """
-        # Query our DataBackend to see whether the cache file exists.
-        if self.data_backend.exists(self.cache_file):
-            try:
-                # Use our DataBackend to actually read the cache file.
-                logger.debug("Pulling cache file from storage.")
-                cache_data_raw = self.data_backend.read(self.cache_file)
-                cache_data = json.loads(cache_data_raw)
-                logger.debug("Completed loading cache data.")
-            except Exception as e:
-                logger.warning(
-                    f"Error loading aspect bucket cache, creating new one: {e}"
-                )
-                cache_data = {}
-            self.aspect_ratio_bucket_indices = cache_data.get(
-                "aspect_ratio_bucket_indices", {}
-            )
-            if set_config:
-                self.config = cache_data.get("config", {})
-                if self.config != {}:
-                    logger.debug(f"Setting config to {self.config}")
-                    logger.debug(f"Loaded previous data backend config: {self.config}")
-                    StateTracker.set_data_backend_config(
-                        data_backend_id=self.id,
-                        config=self.config,
-                    )
-
-    def save_cache(self, enforce_constraints: bool = False):
-        """
-        Save cache data to file.
-        """
-        # Prune any buckets that have fewer samples than batch_size
-        if enforce_constraints:
-            self._enforce_min_bucket_size()
-        if self.read_only:
-            logger.debug("Metadata backend is read-only, skipping cache save.")
-            return
-        # Convert any non-strings into strings as we save the index.
-        aspect_ratio_bucket_indices_str = {
-            key: [str(path) for path in value]
-            for key, value in self.aspect_ratio_bucket_indices.items()
-        }
-        # Encode the cache as JSON.
-        cache_data = {
-            "config": StateTracker.get_data_backend_config(
-                data_backend_id=self.data_backend.id
-            ),
-            "aspect_ratio_bucket_indices": aspect_ratio_bucket_indices_str,
-        }
-        logger.debug(f"save_cache has config to write: {cache_data['config']}")
-        cache_data_str = json.dumps(cache_data)
-        # Use our DataBackend to write the cache file.
-        self.data_backend.write(self.cache_file, cache_data_str)
-
-    def load_image_metadata(self):
-        """Load image metadata from a JSON file."""
-        logger.debug(f"Loading metadata: {self.metadata_file}")
-        self.image_metadata = {}
-        self.image_metadata_loaded = False
-        if self.data_backend.exists(self.metadata_file):
-            cache_data_raw = self.data_backend.read(self.metadata_file)
-            self.image_metadata = json.loads(cache_data_raw)
-            self.image_metadata_loaded = True
-        logger.debug("Metadata loaded.")
-
-    def save_image_metadata(self):
-        """Save image metadata to a JSON file."""
-        self.data_backend.write(self.metadata_file, json.dumps(self.image_metadata))
-
-    def compute_aspect_ratio_bucket_indices(self, ignore_existing_cache: bool = False):
-        """
-        Compute the aspect ratio bucket indices without any threads or queues.
-
-        Parquet backend behaves very differently to JSON backend.
-
-        Returns:
-            dict: The aspect ratio bucket indices.
-        """
-        logger.info("Discovering new files...")
-        new_files = self._discover_new_files(
-            ignore_existing_cache=ignore_existing_cache
-        )
-
-        existing_files_set = set().union(*self.aspect_ratio_bucket_indices.values())
-        # Initialize aggregated statistics
-        statistics = {
-            "total_processed": 0,
-            "skipped": {
-                "already_exists": len(existing_files_set),
-                "metadata_missing": 0,
-                "not_found": 0,
-                "too_small": 0,
-                "other": 0,
-            },
-        }
-        if not new_files:
-            logger.debug("No new files discovered. Doing nothing.")
-            return
-
-        try:
-            self.load_image_metadata()
-        except Exception as e:
-            if ignore_existing_cache:
-                logger.warning(
-                    f"Error loading image metadata, creating new metadata cache: {e}"
-                )
-                self.image_metadata = {}
-            else:
-                raise Exception(
-                    f"Error loading image metadata. You may have to remove the metadata json file '{self.metadata_file}' and VAE cache manually: {e}"
-                )
-        last_write_time = time.time()
-        aspect_ratio_bucket_updates = {}
-        # log a truncated set of the parquet table
-        logger.debug(f"Parquet table head: {self.parquet_database.head().to_string()}")
-        for file in tqdm(
-            new_files,
-            desc="Generating aspect bucket cache",
-            total=len(new_files),
-            leave=False,
-            ncols=100,
-            miniters=int(len(new_files) / 100),
-        ):
-            current_time = time.time()
-            if str(file) not in existing_files_set:
-                logger.debug(f"Processing file {file}.")
-                metadata_updates = {}
-                if self.should_abort:
-                    logger.info("Aborting aspect bucket update.")
-                    return
-                aspect_ratio_bucket_updates = self._process_for_bucket(
-                    file,
-                    aspect_ratio_bucket_updates,
-                    metadata_updates=metadata_updates,
-                    delete_problematic_images=self.delete_problematic_images,
-                    statistics=statistics,
-                )
-                statistics["total_processed"] += 1
-                logger.debug(f"Statistics: {statistics}")
-                logger.debug(f"Metadata updates: {metadata_updates}")
-            else:
-                statistics["skipped"]["already_exists"] += 1
-                continue
-
-            # Now, pull metadata updates from the queue
-            if len(metadata_updates) > 0 and file in metadata_updates:
-                metadata_update = metadata_updates[file]
-                self.set_metadata_by_filepath(
-                    filepath=file, metadata=metadata_updates[file], update_json=False
-                )
-
-                continue
-            processing_duration = current_time - last_write_time
-            if processing_duration >= self.metadata_update_interval:
-                logger.debug(
-                    f"In-flight metadata update after {processing_duration} seconds. Saving {len(self.image_metadata)} metadata entries and {len(self.aspect_ratio_bucket_indices)} aspect bucket lists."
-                )
-                self.save_cache(enforce_constraints=False)
-                self.save_image_metadata()
-                last_write_time = current_time
-
-        for key, value in aspect_ratio_bucket_updates.items():
-            self.aspect_ratio_bucket_indices.setdefault(key, []).extend(value)
-
-        logger.debug("Bucket worker completed processing. Returning to main thread.")
-        logger.info(f"Image processing statistics: {statistics}")
-        self.save_image_metadata()
-        self.save_cache(enforce_constraints=True)
-        logger.info("Completed aspect bucket update.")
-
-    def _get_first_value(self, series_or_scalar):
-        """Extract the first value if the input is a Series, else return the value itself."""
-        if isinstance(series_or_scalar, pd.Series):
-            return int(series_or_scalar.iloc[0])
-        elif isinstance(series_or_scalar, str):
-            # Convert to int if the input is a string representing a number
-            return int(series_or_scalar)
-        elif isinstance(series_or_scalar, (int, float)):
-            return series_or_scalar
-        elif isinstance(series_or_scalar, numpy.int64):
-            new_type = int(series_or_scalar)
-            if type(new_type) != int:
-                raise ValueError(f"Unsupported data type: {type(series_or_scalar)}.")
-            return new_type
-        else:
-            raise ValueError(f"Unsupported data type: {type(series_or_scalar)}.")
-
-    def _process_for_bucket(
-        self,
-        image_path_str,
-        aspect_ratio_bucket_indices,
-        aspect_ratio_rounding: int = 3,
-        metadata_updates=None,
-        delete_problematic_images: bool = False,
-        statistics: dict = {},
-    ):
-        try:
-            # Adjust image path if the identifier does not include extension
-            image_path_filtered = image_path_str
-            if not self.parquet_config.get("identifier_includes_extension", False):
-                image_path_filtered = os.path.splitext(
-                    os.path.split(image_path_str)[-1]
-                )[0]
-            if self.instance_data_dir in image_path_filtered:
-                image_path_filtered = image_path_filtered.replace(
-                    self.instance_data_dir, ""
-                )
-                # remove leading /
-                if image_path_filtered.startswith("/"):
-                    image_path_filtered = image_path_filtered[1:]
-            if image_path_filtered.isdigit():
-                image_path_filtered = int(image_path_filtered)
-
-            logger.debug(
-                f"Reading image {image_path_str} metadata from parquet backend column {self.parquet_config.get('filename_column')} without instance root dir prefix {self.instance_data_dir}: {image_path_filtered}."
-            )
-
-            try:
-                database_image_metadata = self.parquet_database.loc[image_path_filtered]
-            except KeyError:
-                database_image_metadata = None
-
-            logger.debug(f"Found image metadata: {database_image_metadata}")
-            if database_image_metadata is None:
-                logger.debug(
-                    f"Image {image_path_str} was not found on the backend. Skipping image."
-                )
-                statistics.setdefault("skipped", {}).setdefault("metadata_missing", 0)
-                statistics["skipped"]["metadata_missing"] += 1
-                return aspect_ratio_bucket_indices
-
-            width_column = self.parquet_config.get("width_column", "width")
-            height_column = self.parquet_config.get("height_column", "height")
-            if width_column is None or height_column is None:
-                raise ValueError(
-                    "ParquetMetadataBackend requires width and height columns to be defined."
-                )
-            w = self._get_first_value(database_image_metadata[width_column])
-            h = self._get_first_value(database_image_metadata[height_column])
-            logger.debug(
-                f"Image {image_path_str} has dimensions {w}x{h} types {type(w)}."
-            )
-            original_size = (w, h)
-            if (
-                original_size[0] < StateTracker.get_args().aspect_bucket_alignment
-                or original_size[1] < StateTracker.get_args().aspect_bucket_alignment
-            ):
-                logger.debug(
-                    f"Image {image_path_str} is smaller than the aspect bucket index. Skipping image."
-                )
-                return aspect_ratio_bucket_indices
-
-            training_sample = TrainingSample(
-                image=None,
-                data_backend_id=self.id,
-                image_metadata={"original_size": original_size},
-                image_path=image_path_str,
-            )
-            prepared_sample = training_sample.prepare()
-            image_metadata = {"original_size": training_sample.original_size}
-
-            logger.debug("Prepared sample: %s", str(prepared_sample))
-
-            logger.debug("Checking minimum resolution size vs image size...")
-            if not self.meets_resolution_requirements(image_metadata=image_metadata):
-                if not self.delete_unwanted_images:
-                    logger.debug(
-                        f"Image {image_path_str} does not meet minimum image size requirements. Skipping image."
-                    )
-                else:
-                    logger.debug(
-                        f"Image {image_path_str} does not meet minimum image size requirements. Deleting image."
-                    )
-                    try:
-                        self.data_backend.delete(image_path_str)
-                    except:
-                        pass
-                statistics.setdefault("skipped", {}).setdefault("too_small", 0)
-                statistics["skipped"]["too_small"] += 1
-
-                return aspect_ratio_bucket_indices
-
-            logger.debug("Collecting aspect ratio data...")
-            aspect_ratio_column = self.parquet_config.get("aspect_ratio_column")
-            aspect_ratio = (
-                database_image_metadata[aspect_ratio_column]
-                if aspect_ratio_column
-                else training_sample.aspect_ratio
-            )
-            aspect_ratio = MultiaspectImage.calculate_image_aspect_ratio(
-                float(aspect_ratio)
-            )
-
-            logger.debug("Image metadata has been generated and collected.")
-            image_metadata.update(
-                {
-                    "intermediary_size": prepared_sample.intermediary_size,
-                    "crop_coordinates": prepared_sample.crop_coordinates,
-                    "target_size": prepared_sample.target_size,
-                    "aspect_ratio": float(prepared_sample.aspect_ratio),
-                    "luminance": int(
-                        database_image_metadata.get(
-                            self.parquet_config.get("luminance_column"), 0
-                        )
-                    ),
-                }
-            )
-            # logger.debug(
-            #     f"Data types for metadata: {[type(v) for v in image_metadata.values()]}"
-            # )
-            # print the types of any iterable values
-            # for key, value in image_metadata.items():
-            # if hasattr(value, "__iter__"):
-            # logger.debug(f"Key {key} has type {type(value)}: {value}")
-            # for v in value:
-            # logger.debug(f"Value has type {type(v)}: {v}")
-
-            # logger.debug(
-            #     f"Image {image_path_str} has aspect ratio {prepared_sample.aspect_ratio}, intermediary size {image_metadata['intermediary_size']}, target size {image_metadata['target_size']}."
-            # )
-
-            # Create a new bucket if it doesn't exist
-            aspect_ratio_key = str(prepared_sample.aspect_ratio)
-            if aspect_ratio_key not in aspect_ratio_bucket_indices:
-                aspect_ratio_bucket_indices[aspect_ratio_key] = []
-            logger.debug("Adding to list...")
-            aspect_ratio_bucket_indices[aspect_ratio_key].append(image_path_str)
-            logger.debug("Added to list.")
-
-            # Instead of directly updating, just fill the provided dictionary
-            if metadata_updates is not None:
-                logger.debug("Adding to metadata list...")
-                metadata_updates[image_path_str] = image_metadata
-                logger.debug("Added to metadata list.")
-
-        except Exception as e:
-            logger.error(f"Error processing image: {e}")
-            logger.error(f"Error traceback: {traceback.format_exc()}")
-            if delete_problematic_images:
-                logger.error(f"Deleting image {image_path_str}.")
-                self.data_backend.delete(image_path_str)
-
-        return aspect_ratio_bucket_indices
-
-    def __len__(self):
-        """
-        Returns:
-            int: The number of batches in the dataset, accounting for images that can't form a complete batch and are discarded.
-        """
-
-        def repeat_len(bucket):
-            return len(bucket) * (self.repeats + 1)
-
-        return sum(
-            (repeat_len(bucket) + (self.batch_size - 1)) // self.batch_size
-            for bucket in self.aspect_ratio_bucket_indices.values()
-            if repeat_len(bucket) >= self.batch_size
-        )
diff --git a/videotuna/third_party/flux/models/flux/__init__.py b/videotuna/third_party/flux/models/flux/__init__.py
deleted file mode 100644
index fec70e30..00000000
--- a/videotuna/third_party/flux/models/flux/__init__.py
+++ /dev/null
@@ -1,123 +0,0 @@
-import math
-import random
-
-import torch
-from diffusers.pipelines.flux.pipeline_flux import (
-    calculate_shift as calculate_shift_flux,
-)
-
-from videotuna.third_party.flux.models.flux.pipeline import FluxPipeline
-from videotuna.third_party.flux.training import steps_remaining_in_epoch
-
-
-def apply_flux_schedule_shift(args, noise_scheduler, sigmas, noise):
-    # Resolution-dependent shifting of timestep schedules as per section 5.3.2 of SD3 paper
-    shift = None
-    if args.flux_schedule_shift is not None and args.flux_schedule_shift > 0:
-        # Static shift value for every resolution
-        shift = args.flux_schedule_shift
-    elif args.flux_schedule_auto_shift:
-        # Resolution-dependent shift value calculation used by official Flux inference implementation
-        image_seq_len = (noise.shape[-1] * noise.shape[-2]) // 4
-        mu = calculate_shift_flux(
-            (noise.shape[-1] * noise.shape[-2]) // 4,
-            noise_scheduler.config.base_image_seq_len,
-            noise_scheduler.config.max_image_seq_len,
-            noise_scheduler.config.base_shift,
-            noise_scheduler.config.max_shift,
-        )
-        shift = math.exp(mu)
-    if shift is not None:
-        sigmas = (sigmas * shift) / (1 + (shift - 1) * sigmas)
-    return sigmas
-
-
-def get_mobius_guidance(args, global_step, steps_per_epoch, batch_size, device):
-    """
-    state of the art
-    """
-    steps_remaining = steps_remaining_in_epoch(global_step, steps_per_epoch)
-
-    # Start with a linear mapping from remaining steps to a scale between 0 and 1
-    scale_factor = steps_remaining / steps_per_epoch
-
-    # we want the last 10% of the epoch to have a guidance of 1.0
-    threshold_step_count = max(1, int(steps_per_epoch * 0.1))
-
-    if (
-        steps_remaining <= threshold_step_count
-    ):  # Last few steps in the epoch, set guidance to 1.0
-        guidance_values = [1.0 for _ in range(batch_size)]
-    else:
-        # Sample between flux_guidance_min and flux_guidance_max with bias towards 1.0
-        guidance_values = [
-            random.uniform(args.flux_guidance_min, args.flux_guidance_max)
-            * scale_factor
-            + (1.0 - scale_factor)
-            for _ in range(batch_size)
-        ]
-
-    return guidance_values
-
-
-def update_flux_schedule_to_fast(args, noise_scheduler_to_copy):
-    if args.flux_fast_schedule and args.model_family.lower() == "flux":
-        # 4-step noise schedule [0.7, 0.1, 0.1, 0.1] from SD3-Turbo paper
-        for i in range(0, 250):
-            noise_scheduler_to_copy.sigmas[i] = 1.0
-        for i in range(250, 500):
-            noise_scheduler_to_copy.sigmas[i] = 0.3
-        for i in range(500, 750):
-            noise_scheduler_to_copy.sigmas[i] = 0.2
-        for i in range(750, 1000):
-            noise_scheduler_to_copy.sigmas[i] = 0.1
-    return noise_scheduler_to_copy
-
-
-def pack_latents(latents, batch_size, num_channels_latents, height, width):
-    latents = latents.view(
-        batch_size, num_channels_latents, height // 2, 2, width // 2, 2
-    )
-    latents = latents.permute(0, 2, 4, 1, 3, 5)
-    latents = latents.reshape(
-        batch_size, (height // 2) * (width // 2), num_channels_latents * 4
-    )
-
-    return latents
-
-
-def unpack_latents(latents, height, width, vae_scale_factor):
-    batch_size, num_patches, channels = latents.shape
-
-    height = height // vae_scale_factor
-    width = width // vae_scale_factor
-
-    latents = latents.view(batch_size, height, width, channels // 4, 2, 2)
-    latents = latents.permute(0, 3, 1, 4, 2, 5)
-
-    latents = latents.reshape(batch_size, channels // (2 * 2), height * 2, width * 2)
-
-    return latents
-
-
-def prepare_latent_image_ids(batch_size, height, width, device, dtype):
-    latent_image_ids = torch.zeros(height // 2, width // 2, 3)
-    latent_image_ids[..., 1] = (
-        latent_image_ids[..., 1] + torch.arange(height // 2)[:, None]
-    )
-    latent_image_ids[..., 2] = (
-        latent_image_ids[..., 2] + torch.arange(width // 2)[None, :]
-    )
-
-    latent_image_id_height, latent_image_id_width, latent_image_id_channels = (
-        latent_image_ids.shape
-    )
-
-    latent_image_ids = latent_image_ids[None, :].repeat(batch_size, 1, 1, 1)
-    latent_image_ids = latent_image_ids.reshape(
-        batch_size,
-        latent_image_id_height * latent_image_id_width,
-        latent_image_id_channels,
-    )
-
-    return latent_image_ids.to(device=device, dtype=dtype)[0]
diff --git a/videotuna/third_party/flux/models/flux/attention.py b/videotuna/third_party/flux/models/flux/attention.py
deleted file mode 100644
index 6f69855f..00000000
--- a/videotuna/third_party/flux/models/flux/attention.py
+++ /dev/null
@@ -1,213 +0,0 @@
-from diffusers.models.attention_processor import Attention
-from diffusers.models.embeddings import apply_rotary_emb
-from einops import rearrange
-from torch import FloatTensor, Tensor
-from torch.nn import functional as F
-
-try:
-    from flash_attn_interface import flash_attn_func
-except:
-    pass
-
-
-def fa3_sdpa(
-    q,
-    k,
-    v,
-):
-    # flash attention 3 sdpa drop-in replacement
-    q, k, v = [x.permute(0, 2, 1, 3) for x in [q, k, v]]
-    out = flash_attn_func(q, k, v)[0]
-    return out.permute(0, 2, 1, 3)
-
-
-class FluxSingleAttnProcessor3_0:
-    r"""
-    Processor for implementing scaled dot-product attention (enabled by default if you're using PyTorch 2.0).
-    """
-
-    def __init__(self):
-        if not hasattr(F, "scaled_dot_product_attention"):
-            raise ImportError(
-                "AttnProcessor2_0 requires PyTorch 2.0, to use it, please upgrade PyTorch to 2.0."
-            )
-
-    def __call__(
-        self,
-        attn,
-        hidden_states: Tensor,
-        encoder_hidden_states: Tensor = None,
-        attention_mask: FloatTensor = None,
-        image_rotary_emb: Tensor = None,
-    ) -> Tensor:
-        input_ndim = hidden_states.ndim
-
-        if input_ndim == 4:
-            batch_size, channel, height, width = hidden_states.shape
-            hidden_states = hidden_states.view(
-                batch_size, channel, height * width
-            ).transpose(1, 2)
-
-        batch_size, _, _ = (
-            hidden_states.shape
-            if encoder_hidden_states is None
-            else encoder_hidden_states.shape
-        )
-
-        query = attn.to_q(hidden_states)
-        if encoder_hidden_states is None:
-            encoder_hidden_states = hidden_states
-
-        key = attn.to_k(encoder_hidden_states)
-        value = attn.to_v(encoder_hidden_states)
-
-        inner_dim = key.shape[-1]
-        head_dim = inner_dim // attn.heads
-
-        query = query.view(batch_size, -1, attn.heads, head_dim).transpose(1, 2)
-
-        key = key.view(batch_size, -1, attn.heads, head_dim).transpose(1, 2)
-        value = value.view(batch_size, -1, attn.heads, head_dim).transpose(1, 2)
-
-        if attn.norm_q is not None:
-            query = attn.norm_q(query)
-        if attn.norm_k is not None:
-            key = attn.norm_k(key)
-
-        # Apply RoPE if needed
-        if image_rotary_emb is not None:
-            query = apply_rotary_emb(query, image_rotary_emb)
-            key = apply_rotary_emb(key, image_rotary_emb)
-
-        # the output of sdp = (batch, num_heads, seq_len, head_dim)
-        # TODO: add support for attn.scale when we move to Torch 2.1
-        # hidden_states = F.scaled_dot_product_attention(query, key, value, dropout_p=0.0, is_causal=False)
-        hidden_states = fa3_sdpa(query, key, value)
-        hidden_states = rearrange(hidden_states, "B H L D -> B L (H D)")
-
-        hidden_states = hidden_states.transpose(1, 2).reshape(
-            batch_size, -1, attn.heads * head_dim
-        )
-        hidden_states = hidden_states.to(query.dtype)
-
-        if input_ndim == 4:
-            hidden_states = hidden_states.transpose(-1, -2).reshape(
-                batch_size, channel, height, width
-            )
-
-        return hidden_states
-
-
-class FluxAttnProcessor3_0:
-    """Attention processor used typically in processing the SD3-like self-attention projections."""
-
-    def __init__(self):
-        if not hasattr(F, "scaled_dot_product_attention"):
-            raise ImportError(
-                "FluxAttnProcessor3_0 requires PyTorch 2.0, to use it, please upgrade PyTorch to 2.0."
-            )
-
-    def __call__(
-        self,
-        attn,
-        hidden_states: FloatTensor,
-        encoder_hidden_states: FloatTensor = None,
-        attention_mask: FloatTensor = None,
-        image_rotary_emb: Tensor = None,
-    ) -> FloatTensor:
-        input_ndim = hidden_states.ndim
-        if input_ndim == 4:
-            batch_size, channel, height, width = hidden_states.shape
-            hidden_states = hidden_states.view(
-                batch_size, channel, height * width
-            ).transpose(1, 2)
-        context_input_ndim = encoder_hidden_states.ndim
-        if context_input_ndim == 4:
-            batch_size, channel, height, width = encoder_hidden_states.shape
-            encoder_hidden_states = encoder_hidden_states.view(
-                batch_size, channel, height * width
-            ).transpose(1, 2)
-
-        batch_size = encoder_hidden_states.shape[0]
-
-        # `sample` projections.
-        query = attn.to_q(hidden_states)
-        key = attn.to_k(hidden_states)
-        value = attn.to_v(hidden_states)
-
-        inner_dim = key.shape[-1]
-        head_dim = inner_dim // attn.heads
-
-        query = query.view(batch_size, -1, attn.heads, head_dim).transpose(1, 2)
-        key = key.view(batch_size, -1, attn.heads, head_dim).transpose(1, 2)
-        value = value.view(batch_size, -1, attn.heads, head_dim).transpose(1, 2)
-
-        if attn.norm_q is not None:
-            query = attn.norm_q(query)
-        if attn.norm_k is not None:
-            key = attn.norm_k(key)
-
-        # `context` projections.
-        encoder_hidden_states_query_proj = attn.add_q_proj(encoder_hidden_states)
-        encoder_hidden_states_key_proj = attn.add_k_proj(encoder_hidden_states)
-        encoder_hidden_states_value_proj = attn.add_v_proj(encoder_hidden_states)
-
-        encoder_hidden_states_query_proj = encoder_hidden_states_query_proj.view(
-            batch_size, -1, attn.heads, head_dim
-        ).transpose(1, 2)
-        encoder_hidden_states_key_proj = encoder_hidden_states_key_proj.view(
-            batch_size, -1, attn.heads, head_dim
-        ).transpose(1, 2)
-        encoder_hidden_states_value_proj = encoder_hidden_states_value_proj.view(
-            batch_size, -1, attn.heads, head_dim
-        ).transpose(1, 2)
-
-        if attn.norm_added_q is not None:
-            encoder_hidden_states_query_proj = attn.norm_added_q(
-                encoder_hidden_states_query_proj
-            )
-        if attn.norm_added_k is not None:
-            encoder_hidden_states_key_proj = attn.norm_added_k(
-                encoder_hidden_states_key_proj
-            )
-
-        # attention
-        query = torch.cat([encoder_hidden_states_query_proj, query], dim=2)
-        key = torch.cat([encoder_hidden_states_key_proj, key], dim=2)
-        value = torch.cat([encoder_hidden_states_value_proj, value], dim=2)
-
-        if image_rotary_emb is not None:
-
-            query = apply_rotary_emb(query, image_rotary_emb)
-            key = apply_rotary_emb(key, image_rotary_emb)
-
-        # hidden_states = F.scaled_dot_product_attention(query, key, value, dropout_p=0.0, is_causal=False)
-        hidden_states = fa3_sdpa(query, key, value)
-        hidden_states = rearrange(hidden_states, "B H L D -> B L (H D)")
-
-        hidden_states = hidden_states.transpose(1, 2).reshape(
-            batch_size, -1, attn.heads * head_dim
-        )
-        hidden_states = hidden_states.to(query.dtype)
-
-        encoder_hidden_states, hidden_states = (
-            hidden_states[:, : encoder_hidden_states.shape[1]],
-            hidden_states[:, encoder_hidden_states.shape[1] :],
-        )
-
-        # linear proj
-        hidden_states = attn.to_out[0](hidden_states)
-        # dropout
-        hidden_states = attn.to_out[1](hidden_states)
-        encoder_hidden_states = attn.to_add_out(encoder_hidden_states)
-
-        if input_ndim == 4:
-            hidden_states = hidden_states.transpose(-1, -2).reshape(
-                batch_size, channel, height, width
-            )
-        if context_input_ndim == 4:
-            encoder_hidden_states = encoder_hidden_states.transpose(-1, -2).reshape(
-                batch_size, channel, height, width
-            )
-
-        return hidden_states, encoder_hidden_states
diff --git a/videotuna/third_party/flux/models/flux/pipeline.py b/videotuna/third_party/flux/models/flux/pipeline.py
deleted file mode 100644
index 3f2fdcc7..00000000
--- a/videotuna/third_party/flux/models/flux/pipeline.py
+++ /dev/null
@@ -1,936 +0,0 @@
-# Copyright 2024 Black Forest Labs and The HuggingFace Team. All rights reserved.
-#
-# Licensed under the Apache License, Version 2.0 (the "License");
-# you may not use this file except in compliance with the License.
-# You may obtain a copy of the License at
-#
-#     http://www.apache.org/licenses/LICENSE-2.0
-#
-# Unless required by applicable law or agreed to in writing, software
-# distributed under the License is distributed on an "AS IS" BASIS,
-# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
-# See the License for the specific language governing permissions and
-# limitations under the License.
-
-import inspect
-from typing import Any, Callable, Dict, List, Optional, Union
-
-import numpy as np
-import torch
-from diffusers.image_processor import VaeImageProcessor
-from diffusers.loaders import FluxLoraLoaderMixin
-from diffusers.models.autoencoders import AutoencoderKL
-from diffusers.models.transformers import FluxTransformer2DModel
-from diffusers.pipelines.pipeline_utils import DiffusionPipeline
-from diffusers.schedulers import FlowMatchEulerDiscreteScheduler
-from diffusers.utils import (
-    USE_PEFT_BACKEND,
-    is_torch_xla_available,
-    logging,
-    replace_example_docstring,
-    scale_lora_layers,
-    unscale_lora_layers,
-)
-from diffusers.utils.torch_utils import randn_tensor
-from transformers import CLIPTextModel, CLIPTokenizer, T5EncoderModel, T5TokenizerFast
-
-if is_torch_xla_available():
-    import torch_xla.core.xla_model as xm
-
-    XLA_AVAILABLE = True
-else:
-    XLA_AVAILABLE = False
-
-
-logger = logging.get_logger(__name__)  # pylint: disable=invalid-name
-
-EXAMPLE_DOC_STRING = """
-    Examples:
-        ```py
-        >>> import torch
-        >>> from diffusers import FluxPipeline
-
-        >>> pipe = FluxPipeline.from_pretrained("black-forest-labs/FLUX.1-schnell", torch_dtype=torch.bfloat16)
-        >>> pipe.to("cuda")
-        >>> prompt = "A cat holding a sign that says hello world"
-        >>> # Depending on the variant being used, the pipeline call will slightly vary.
-        >>> # Refer to the pipeline documentation for more details.
-        >>> image = pipe(prompt, num_inference_steps=4, guidance_scale=0.0).images[0]
-        >>> image.save("flux.png")
-        ```
-"""
-
-
-def calculate_shift(
-    image_seq_len,
-    base_seq_len: int = 256,
-    max_seq_len: int = 4096,
-    base_shift: float = 0.5,
-    max_shift: float = 1.16,
-):
-    m = (max_shift - base_shift) / (max_seq_len - base_seq_len)
-    b = base_shift - m * base_seq_len
-    mu = image_seq_len * m + b
-    return mu
-
-
-# Copied from diffusers.pipelines.stable_diffusion.pipeline_stable_diffusion.retrieve_timesteps
-def retrieve_timesteps(
-    scheduler,
-    num_inference_steps: Optional[int] = None,
-    device: Optional[Union[str, torch.device]] = None,
-    timesteps: Optional[List[int]] = None,
-    sigmas: Optional[List[float]] = None,
-    **kwargs,
-):
-    """
-    Calls the scheduler's `set_timesteps` method and retrieves timesteps from the scheduler after the call. Handles
-    custom timesteps. Any kwargs will be supplied to `scheduler.set_timesteps`.
-
-    Args:
-        scheduler (`SchedulerMixin`):
-            The scheduler to get timesteps from.
-        num_inference_steps (`int`):
-            The number of diffusion steps used when generating samples with a pre-trained model. If used, `timesteps`
-            must be `None`.
-        device (`str` or `torch.device`, *optional*):
-            The device to which the timesteps should be moved to. If `None`, the timesteps are not moved.
-        timesteps (`List[int]`, *optional*):
-            Custom timesteps used to override the timestep spacing strategy of the scheduler. If `timesteps` is passed,
-            `num_inference_steps` and `sigmas` must be `None`.
-        sigmas (`List[float]`, *optional*):
-            Custom sigmas used to override the timestep spacing strategy of the scheduler. If `sigmas` is passed,
-            `num_inference_steps` and `timesteps` must be `None`.
-
-    Returns:
-        `Tuple[torch.Tensor, int]`: A tuple where the first element is the timestep schedule from the scheduler and the
-        second element is the number of inference steps.
-    """
-    if timesteps is not None and sigmas is not None:
-        raise ValueError(
-            "Only one of `timesteps` or `sigmas` can be passed. Please choose one to set custom values"
-        )
-    if timesteps is not None:
-        accepts_timesteps = "timesteps" in set(
-            inspect.signature(scheduler.set_timesteps).parameters.keys()
-        )
-        if not accepts_timesteps:
-            raise ValueError(
-                f"The current scheduler class {scheduler.__class__}'s `set_timesteps` does not support custom"
-                f" timestep schedules. Please check whether you are using the correct scheduler."
-            )
-        scheduler.set_timesteps(timesteps=timesteps, device=device, **kwargs)
-        timesteps = scheduler.timesteps
-        num_inference_steps = len(timesteps)
-    elif sigmas is not None:
-        accept_sigmas = "sigmas" in set(
-            inspect.signature(scheduler.set_timesteps).parameters.keys()
-        )
-        if not accept_sigmas:
-            raise ValueError(
-                f"The current scheduler class {scheduler.__class__}'s `set_timesteps` does not support custom"
-                f" sigmas schedules. Please check whether you are using the correct scheduler."
-            )
-        scheduler.set_timesteps(sigmas=sigmas, device=device, **kwargs)
-        timesteps = scheduler.timesteps
-        num_inference_steps = len(timesteps)
-    else:
-        scheduler.set_timesteps(num_inference_steps, device=device, **kwargs)
-        timesteps = scheduler.timesteps
-    return timesteps, num_inference_steps
-
-
-class FluxPipeline(DiffusionPipeline, FluxLoraLoaderMixin):
-    r"""
-    The Flux pipeline for text-to-image generation.
-
-    Reference: https://blackforestlabs.ai/announcing-black-forest-labs/
-
-    Args:
-        transformer ([`FluxTransformer2DModel`]):
-            Conditional Transformer (MMDiT) architecture to denoise the encoded image latents.
-        scheduler ([`FlowMatchEulerDiscreteScheduler`]):
-            A scheduler to be used in combination with `transformer` to denoise the encoded image latents.
-        vae ([`AutoencoderKL`]):
-            Variational Auto-Encoder (VAE) Model to encode and decode images to and from latent representations.
-        text_encoder ([`CLIPTextModelWithProjection`]):
-            [CLIP](https://huggingface.co/docs/transformers/model_doc/clip#transformers.CLIPTextModelWithProjection),
-            specifically the [clip-vit-large-patch14](https://huggingface.co/openai/clip-vit-large-patch14) variant,
-            with an additional added projection layer that is initialized with a diagonal matrix with the `hidden_size`
-            as its dimension.
-        text_encoder_2 ([`CLIPTextModelWithProjection`]):
-            [CLIP](https://huggingface.co/docs/transformers/model_doc/clip#transformers.CLIPTextModelWithProjection),
-            specifically the
-            [laion/CLIP-ViT-bigG-14-laion2B-39B-b160k](https://huggingface.co/laion/CLIP-ViT-bigG-14-laion2B-39B-b160k)
-            variant.
-        tokenizer (`CLIPTokenizer`):
-            Tokenizer of class
-            [CLIPTokenizer](https://huggingface.co/docs/transformers/v4.21.0/en/model_doc/clip#transformers.CLIPTokenizer).
-        tokenizer_2 (`CLIPTokenizer`):
-            Second Tokenizer of class
-            [CLIPTokenizer](https://huggingface.co/docs/transformers/v4.21.0/en/model_doc/clip#transformers.CLIPTokenizer).
-    """
-
-    model_cpu_offload_seq = "text_encoder->text_encoder_2->transformer->vae"
-    _optional_components = []
-    _callback_tensor_inputs = ["latents", "prompt_embeds"]
-
-    def __init__(
-        self,
-        scheduler: FlowMatchEulerDiscreteScheduler,
-        vae: AutoencoderKL,
-        text_encoder: CLIPTextModel,
-        tokenizer: CLIPTokenizer,
-        text_encoder_2: T5EncoderModel,
-        tokenizer_2: T5TokenizerFast,
-        transformer: FluxTransformer2DModel,
-    ):
-        super().__init__()
-
-        self.register_modules(
-            vae=vae,
-            text_encoder=text_encoder,
-            text_encoder_2=text_encoder_2,
-            tokenizer=tokenizer,
-            tokenizer_2=tokenizer_2,
-            transformer=transformer,
-            scheduler=scheduler,
-        )
-        self.vae_scale_factor = (
-            2 ** (len(self.vae.config.block_out_channels))
-            if hasattr(self, "vae") and self.vae is not None
-            else 16
-        )
-        self.image_processor = VaeImageProcessor(vae_scale_factor=self.vae_scale_factor)
-        self.tokenizer_max_length = (
-            self.tokenizer.model_max_length
-            if hasattr(self, "tokenizer") and self.tokenizer is not None
-            else 77
-        )
-        self.default_sample_size = 64
-
-    def _get_t5_prompt_embeds(
-        self,
-        prompt: Union[str, List[str]] = None,
-        num_images_per_prompt: int = 1,
-        max_sequence_length: int = 512,
-        device: Optional[torch.device] = None,
-        dtype: Optional[torch.dtype] = None,
-    ):
-        device = device or self._execution_device
-        dtype = dtype or self.text_encoder.dtype
-
-        prompt = [prompt] if isinstance(prompt, str) else prompt
-        batch_size = len(prompt)
-
-        text_inputs = self.tokenizer_2(
-            prompt,
-            padding="max_length",
-            max_length=max_sequence_length,
-            truncation=True,
-            return_length=False,
-            return_overflowing_tokens=False,
-            return_tensors="pt",
-        )
-        prompt_attention_mask = text_inputs.attention_mask
-        text_input_ids = text_inputs.input_ids
-        untruncated_ids = self.tokenizer_2(
-            prompt, padding="longest", return_tensors="pt"
-        ).input_ids
-
-        if untruncated_ids.shape[-1] >= text_input_ids.shape[-1] and not torch.equal(
-            text_input_ids, untruncated_ids
-        ):
-            removed_text = self.tokenizer_2.batch_decode(
-                untruncated_ids[:, self.tokenizer_max_length - 1 : -1]
-            )
-            # logger.warning(
-            #     "The following part of your input was truncated because `max_sequence_length` is set to "
-            #     f" {max_sequence_length} tokens: {removed_text}"
-            # )
-
-        prompt_embeds = self.text_encoder_2(
-            text_input_ids.to(device), output_hidden_states=False
-        )[0]
-
-        dtype = self.text_encoder_2.dtype
-        prompt_embeds = prompt_embeds.to(dtype=dtype, device=device)
-
-        _, seq_len, _ = prompt_embeds.shape
-
-        # duplicate text embeddings and attention mask for each generation per prompt, using mps friendly method
-        prompt_embeds = prompt_embeds.repeat(1, num_images_per_prompt, 1)
-        prompt_embeds = prompt_embeds.view(
-            batch_size * num_images_per_prompt, seq_len, -1
-        )
-
-        return prompt_embeds, prompt_attention_mask
-
-    def _get_clip_prompt_embeds(
-        self,
-        prompt: Union[str, List[str]],
-        num_images_per_prompt: int = 1,
-        device: Optional[torch.device] = None,
-    ):
-        device = device or self._execution_device
-
-        prompt = [prompt] if isinstance(prompt, str) else prompt
-        batch_size = len(prompt)
-
-        text_inputs = self.tokenizer(
-            prompt,
-            padding="max_length",
-            max_length=self.tokenizer_max_length,
-            truncation=True,
-            return_overflowing_tokens=False,
-            return_length=False,
-            return_tensors="pt",
-        )
-
-        text_input_ids = text_inputs.input_ids
-        untruncated_ids = self.tokenizer(
-            prompt, padding="longest", return_tensors="pt"
-        ).input_ids
-        if untruncated_ids.shape[-1] >= text_input_ids.shape[-1] and not torch.equal(
-            text_input_ids, untruncated_ids
-        ):
-            removed_text = self.tokenizer.batch_decode(
-                untruncated_ids[:, self.tokenizer_max_length - 1 : -1]
-            )
-            # logger.warning(
-            #     "The following part of your input was truncated because CLIP can only handle sequences up to"
-            #     f" {self.tokenizer_max_length} tokens: {removed_text}"
-            # )
-        prompt_embeds = self.text_encoder(
-            text_input_ids.to(device), output_hidden_states=False
-        )
-
-        # Use pooled output of CLIPTextModel
-        prompt_embeds = prompt_embeds.pooler_output
-        prompt_embeds = prompt_embeds.to(dtype=self.text_encoder.dtype, device=device)
-
-        # duplicate text embeddings for each generation per prompt, using mps friendly method
-        prompt_embeds = prompt_embeds.repeat(1, num_images_per_prompt, 1)
-        prompt_embeds = prompt_embeds.view(batch_size * num_images_per_prompt, -1)
-
-        return prompt_embeds
-
-    def encode_prompt(
-        self,
-        prompt: Union[str, List[str]],
-        prompt_2: Union[str, List[str]],
-        device: Optional[torch.device] = None,
-        num_images_per_prompt: int = 1,
-        prompt_embeds: Optional[torch.FloatTensor] = None,
-        pooled_prompt_embeds: Optional[torch.FloatTensor] = None,
-        max_sequence_length: int = 512,
-        lora_scale: Optional[float] = None,
-    ):
-        r"""
-
-        Args:
-            prompt (`str` or `List[str]`, *optional*):
-                prompt to be encoded
-            prompt_2 (`str` or `List[str]`, *optional*):
-                The prompt or prompts to be sent to the `tokenizer_2` and `text_encoder_2`. If not defined, `prompt` is
-                used in all text-encoders
-            device: (`torch.device`):
-                torch device
-            num_images_per_prompt (`int`):
-                number of images that should be generated per prompt
-            prompt_embeds (`torch.FloatTensor`, *optional*):
-                Pre-generated text embeddings. Can be used to easily tweak text inputs, *e.g.* prompt weighting. If not
-                provided, text embeddings will be generated from `prompt` input argument.
-            pooled_prompt_embeds (`torch.FloatTensor`, *optional*):
-                Pre-generated pooled text embeddings. Can be used to easily tweak text inputs, *e.g.* prompt weighting.
-                If not provided, pooled text embeddings will be generated from `prompt` input argument.
-            clip_skip (`int`, *optional*):
-                Number of layers to be skipped from CLIP while computing the prompt embeddings. A value of 1 means that
-                the output of the pre-final layer will be used for computing the prompt embeddings.
-            lora_scale (`float`, *optional*):
-                A lora scale that will be applied to all LoRA layers of the text encoder if LoRA layers are loaded.
-        """
-        device = device or self._execution_device
-
-        # set lora scale so that monkey patched LoRA
-        # function of text encoder can correctly access it
-        if lora_scale is not None and isinstance(self, FluxLoraLoaderMixin):
-            self._lora_scale = lora_scale
-
-            # dynamically adjust the LoRA scale
-            if self.text_encoder is not None and USE_PEFT_BACKEND:
-                scale_lora_layers(self.text_encoder, lora_scale)
-            if self.text_encoder_2 is not None and USE_PEFT_BACKEND:
-                scale_lora_layers(self.text_encoder_2, lora_scale)
-
-        prompt = [prompt] if isinstance(prompt, str) else prompt
-        if prompt is not None:
-            batch_size = len(prompt)
-        else:
-            batch_size = prompt_embeds.shape[0]
-
-        prompt_attention_mask = None
-        if prompt_embeds is None:
-            prompt_2 = prompt_2 or prompt
-            prompt_2 = [prompt_2] if isinstance(prompt_2, str) else prompt_2
-
-            # We only use the pooled prompt output from the CLIPTextModel
-            pooled_prompt_embeds = self._get_clip_prompt_embeds(
-                prompt=prompt,
-                device=device,
-                num_images_per_prompt=num_images_per_prompt,
-            )
-            prompt_embeds, prompt_attention_mask = self._get_t5_prompt_embeds(
-                prompt=prompt_2,
-                num_images_per_prompt=num_images_per_prompt,
-                max_sequence_length=max_sequence_length,
-                device=device,
-            )
-
-        if self.text_encoder is not None:
-            if isinstance(self, FluxLoraLoaderMixin) and USE_PEFT_BACKEND:
-                # Retrieve the original scale by scaling back the LoRA layers
-                unscale_lora_layers(self.text_encoder, lora_scale)
-
-        if self.text_encoder_2 is not None:
-            if isinstance(self, FluxLoraLoaderMixin) and USE_PEFT_BACKEND:
-                # Retrieve the original scale by scaling back the LoRA layers
-                unscale_lora_layers(self.text_encoder_2, lora_scale)
-
-        text_ids = torch.zeros(batch_size, prompt_embeds.shape[1], 3).to(
-            device=device, dtype=prompt_embeds.dtype
-        )
-
-        return prompt_embeds, pooled_prompt_embeds, text_ids, prompt_attention_mask
-
-    def check_inputs(
-        self,
-        prompt,
-        prompt_2,
-        height,
-        width,
-        prompt_embeds=None,
-        pooled_prompt_embeds=None,
-        callback_on_step_end_tensor_inputs=None,
-        max_sequence_length=None,
-    ):
-        if height % 8 != 0 or width % 8 != 0:
-            raise ValueError(
-                f"`height` and `width` have to be divisible by 8 but are {height} and {width}."
-            )
-
-        if callback_on_step_end_tensor_inputs is not None and not all(
-            k in self._callback_tensor_inputs
-            for k in callback_on_step_end_tensor_inputs
-        ):
-            raise ValueError(
-                f"`callback_on_step_end_tensor_inputs` has to be in {self._callback_tensor_inputs}, but found {[k for k in callback_on_step_end_tensor_inputs if k not in self._callback_tensor_inputs]}"
-            )
-
-        if prompt is not None and prompt_embeds is not None:
-            raise ValueError(
-                f"Cannot forward both `prompt`: {prompt} and `prompt_embeds`: {prompt_embeds}. Please make sure to"
-                " only forward one of the two."
-            )
-        elif prompt_2 is not None and prompt_embeds is not None:
-            raise ValueError(
-                f"Cannot forward both `prompt_2`: {prompt_2} and `prompt_embeds`: {prompt_embeds}. Please make sure to"
-                " only forward one of the two."
-            )
-        elif prompt is None and prompt_embeds is None:
-            raise ValueError(
-                "Provide either `prompt` or `prompt_embeds`. Cannot leave both `prompt` and `prompt_embeds` undefined."
-            )
-        elif prompt is not None and (
-            not isinstance(prompt, str) and not isinstance(prompt, list)
-        ):
-            raise ValueError(
-                f"`prompt` has to be of type `str` or `list` but is {type(prompt)}"
-            )
-        elif prompt_2 is not None and (
-            not isinstance(prompt_2, str) and not isinstance(prompt_2, list)
-        ):
-            raise ValueError(
-                f"`prompt_2` has to be of type `str` or `list` but is {type(prompt_2)}"
-            )
-
-        if prompt_embeds is not None and pooled_prompt_embeds is None:
-            raise ValueError(
-                "If `prompt_embeds` are provided, `pooled_prompt_embeds` also have to be passed. Make sure to generate `pooled_prompt_embeds` from the same text encoder that was used to generate `prompt_embeds`."
-            )
-
-        if max_sequence_length is not None and max_sequence_length > 512:
-            raise ValueError(
-                f"`max_sequence_length` cannot be greater than 512 but is {max_sequence_length}"
-            )
-
-    @staticmethod
-    def _prepare_latent_image_ids(batch_size, height, width, device, dtype):
-        latent_image_ids = torch.zeros(height // 2, width // 2, 3)
-        latent_image_ids[..., 1] = (
-            latent_image_ids[..., 1] + torch.arange(height // 2)[:, None]
-        )
-        latent_image_ids[..., 2] = (
-            latent_image_ids[..., 2] + torch.arange(width // 2)[None, :]
-        )
-
-        latent_image_id_height, latent_image_id_width, latent_image_id_channels = (
-            latent_image_ids.shape
-        )
-
-        latent_image_ids = latent_image_ids[None, :].repeat(batch_size, 1, 1, 1)
-        latent_image_ids = latent_image_ids.reshape(
-            batch_size,
-            latent_image_id_height * latent_image_id_width,
-            latent_image_id_channels,
-        )
-
-        return latent_image_ids.to(device=device, dtype=dtype)
-
-    @staticmethod
-    def _pack_latents(latents, batch_size, num_channels_latents, height, width):
-        latents = latents.view(
-            batch_size, num_channels_latents, height // 2, 2, width // 2, 2
-        )
-        latents = latents.permute(0, 2, 4, 1, 3, 5)
-        latents = latents.reshape(
-            batch_size, (height // 2) * (width // 2), num_channels_latents * 4
-        )
-
-        return latents
-
-    @staticmethod
-    def _unpack_latents(latents, height, width, vae_scale_factor):
-        batch_size, num_patches, channels = latents.shape
-
-        height = height // vae_scale_factor
-        width = width // vae_scale_factor
-
-        latents = latents.view(batch_size, height, width, channels // 4, 2, 2)
-        latents = latents.permute(0, 3, 1, 4, 2, 5)
-
-        latents = latents.reshape(
-            batch_size, channels // (2 * 2), height * 2, width * 2
-        )
-
-        return latents
-
-    def prepare_latents(
-        self,
-        batch_size,
-        num_channels_latents,
-        height,
-        width,
-        dtype,
-        device,
-        generator,
-        latents=None,
-    ):
-        height = 2 * (int(height) // self.vae_scale_factor)
-        width = 2 * (int(width) // self.vae_scale_factor)
-
-        shape = (batch_size, num_channels_latents, height, width)
-
-        if latents is not None:
-            latent_image_ids = self._prepare_latent_image_ids(
-                batch_size, height, width, device, dtype
-            )
-            return latents.to(device=device, dtype=dtype), latent_image_ids
-
-        if isinstance(generator, list) and len(generator) != batch_size:
-            raise ValueError(
-                f"You have passed a list of generators of length {len(generator)}, but requested an effective batch"
-                f" size of {batch_size}. Make sure the batch size matches the length of the generators."
-            )
-
-        latents = randn_tensor(shape, generator=generator, device=device, dtype=dtype)
-        latents = self._pack_latents(
-            latents, batch_size, num_channels_latents, height, width
-        )
-
-        latent_image_ids = self._prepare_latent_image_ids(
-            batch_size, height, width, device, dtype
-        )
-
-        return latents, latent_image_ids
-
-    @property
-    def guidance_scale(self):
-        return self._guidance_scale
-
-    @property
-    def joint_attention_kwargs(self):
-        return self._joint_attention_kwargs
-
-    @property
-    def num_timesteps(self):
-        return self._num_timesteps
-
-    @property
-    def interrupt(self):
-        return self._interrupt
-
-    @torch.no_grad()
-    @replace_example_docstring(EXAMPLE_DOC_STRING)
-    def __call__(
-        self,
-        prompt: Union[str, List[str]] = None,
-        prompt_mask: Optional[Union[torch.FloatTensor, List[torch.FloatTensor]]] = None,
-        negative_mask: Optional[
-            Union[torch.FloatTensor, List[torch.FloatTensor]]
-        ] = None,
-        prompt_2: Optional[Union[str, List[str]]] = None,
-        height: Optional[int] = None,
-        width: Optional[int] = None,
-        num_inference_steps: int = 28,
-        timesteps: List[int] = None,
-        guidance_scale: float = 3.5,
-        num_images_per_prompt: Optional[int] = 1,
-        generator: Optional[Union[torch.Generator, List[torch.Generator]]] = None,
-        latents: Optional[torch.FloatTensor] = None,
-        prompt_embeds: Optional[torch.FloatTensor] = None,
-        pooled_prompt_embeds: Optional[torch.FloatTensor] = None,
-        output_type: Optional[str] = "pil",
-        return_dict: bool = True,
-        joint_attention_kwargs: Optional[Dict[str, Any]] = None,
-        callback_on_step_end: Optional[Callable[[int, int, Dict], None]] = None,
-        callback_on_step_end_tensor_inputs: List[str] = ["latents"],
-        max_sequence_length: int = 512,
-        guidance_scale_real: float = 1.0,
-        negative_prompt: Union[str, List[str]] = "",
-        negative_prompt_2: Union[str, List[str]] = "",
-        negative_prompt_embeds: Optional[torch.FloatTensor] = None,
-        negative_pooled_prompt_embeds: Optional[torch.FloatTensor] = None,
-        no_cfg_until_timestep: int = 2,
-    ):
-        r"""
-        Function invoked when calling the pipeline for generation.
-
-        Args:
-            prompt (`str` or `List[str]`, *optional*):
-                The prompt or prompts to guide the image generation. If not defined, one has to pass `prompt_embeds`.
-                instead.
-            prompt_mask (`str` or `List[str]`, *optional*):
-                The prompt or prompts to be used as a mask for the image generation. If not defined, `prompt` is used
-                instead.
-            prompt_2 (`str` or `List[str]`, *optional*):
-                The prompt or prompts to be sent to `tokenizer_2` and `text_encoder_2`. If not defined, `prompt` is
-                will be used instead
-            height (`int`, *optional*, defaults to self.unet.config.sample_size * self.vae_scale_factor):
-                The height in pixels of the generated image. This is set to 1024 by default for the best results.
-            width (`int`, *optional*, defaults to self.unet.config.sample_size * self.vae_scale_factor):
-                The width in pixels of the generated image. This is set to 1024 by default for the best results.
-            num_inference_steps (`int`, *optional*, defaults to 50):
-                The number of denoising steps. More denoising steps usually lead to a higher quality image at the
-                expense of slower inference.
-            timesteps (`List[int]`, *optional*):
-                Custom timesteps to use for the denoising process with schedulers which support a `timesteps` argument
-                in their `set_timesteps` method. If not defined, the default behavior when `num_inference_steps` is
-                passed will be used. Must be in descending order.
-            guidance_scale (`float`, *optional*, defaults to 7.0):
-                Guidance scale as defined in [Classifier-Free Diffusion Guidance](https://arxiv.org/abs/2207.12598).
-                `guidance_scale` is defined as `w` of equation 2. of [Imagen
-                Paper](https://arxiv.org/pdf/2205.11487.pdf). Guidance scale is enabled by setting `guidance_scale >
-                1`. Higher guidance scale encourages to generate images that are closely linked to the text `prompt`,
-                usually at the expense of lower image quality.
-            num_images_per_prompt (`int`, *optional*, defaults to 1):
-                The number of images to generate per prompt.
-            generator (`torch.Generator` or `List[torch.Generator]`, *optional*):
-                One or a list of [torch generator(s)](https://pytorch.org/docs/stable/generated/torch.Generator.html)
-                to make generation deterministic.
-            latents (`torch.FloatTensor`, *optional*):
-                Pre-generated noisy latents, sampled from a Gaussian distribution, to be used as inputs for image
-                generation. Can be used to tweak the same generation with different prompts. If not provided, a latents
-                tensor will ge generated by sampling using the supplied random `generator`.
-            prompt_embeds (`torch.FloatTensor`, *optional*):
-                Pre-generated text embeddings. Can be used to easily tweak text inputs, *e.g.* prompt weighting. If not
-                provided, text embeddings will be generated from `prompt` input argument.
-            pooled_prompt_embeds (`torch.FloatTensor`, *optional*):
-                Pre-generated pooled text embeddings. Can be used to easily tweak text inputs, *e.g.* prompt weighting.
-                If not provided, pooled text embeddings will be generated from `prompt` input argument.
-            output_type (`str`, *optional*, defaults to `"pil"`):
-                The output format of the generate image. Choose between
-                [PIL](https://pillow.readthedocs.io/en/stable/): `PIL.Image.Image` or `np.array`.
-            return_dict (`bool`, *optional*, defaults to `True`):
-                Whether or not to return a [`~pipelines.flux.FluxPipelineOutput`] instead of a plain tuple.
-            joint_attention_kwargs (`dict`, *optional*):
-                A kwargs dictionary that if specified is passed along to the `AttentionProcessor` as defined under
-                `self.processor` in
-                [diffusers.models.attention_processor](https://github.com/huggingface/diffusers/blob/main/src/diffusers/models/attention_processor.py).
-            callback_on_step_end (`Callable`, *optional*):
-                A function that calls at the end of each denoising steps during the inference. The function is called
-                with the following arguments: `callback_on_step_end(self: DiffusionPipeline, step: int, timestep: int,
-                callback_kwargs: Dict)`. `callback_kwargs` will include a list of all tensors as specified by
-                `callback_on_step_end_tensor_inputs`.
-            callback_on_step_end_tensor_inputs (`List`, *optional*):
-                The list of tensor inputs for the `callback_on_step_end` function. The tensors specified in the list
-                will be passed as `callback_kwargs` argument. You will only be able to include variables listed in the
-                `._callback_tensor_inputs` attribute of your pipeline class.
-            max_sequence_length (`int` defaults to 512): Maximum sequence length to use with the `prompt`.
-
-        Examples:
-
-        Returns:
-            [`~pipelines.flux.FluxPipelineOutput`] or `tuple`: [`~pipelines.flux.FluxPipelineOutput`] if `return_dict`
-            is True, otherwise a `tuple`. When returning a tuple, the first element is a list with the generated
-            images.
-        """
-
-        height = height or self.default_sample_size * self.vae_scale_factor
-        width = width or self.default_sample_size * self.vae_scale_factor
-
-        # 1. Check inputs. Raise error if not correct
-        self.check_inputs(
-            prompt,
-            prompt_2,
-            height,
-            width,
-            prompt_embeds=prompt_embeds,
-            pooled_prompt_embeds=pooled_prompt_embeds,
-            callback_on_step_end_tensor_inputs=callback_on_step_end_tensor_inputs,
-            max_sequence_length=max_sequence_length,
-        )
-
-        self._guidance_scale = guidance_scale
-        self._guidance_scale_real = guidance_scale_real
-        self._joint_attention_kwargs = joint_attention_kwargs
-        self._interrupt = False
-
-        # 2. Define call parameters
-        if prompt is not None and isinstance(prompt, str):
-            batch_size = 1
-        elif prompt is not None and isinstance(prompt, list):
-            batch_size = len(prompt)
-        else:
-            batch_size = prompt_embeds.shape[0]
-
-        device = self._execution_device
-
-        lora_scale = (
-            self.joint_attention_kwargs.get("scale", None)
-            if self.joint_attention_kwargs is not None
-            else None
-        )
-        (
-            prompt_embeds,
-            pooled_prompt_embeds,
-            text_ids,
-            _,
-        ) = self.encode_prompt(
-            prompt=prompt,
-            prompt_2=prompt_2,
-            prompt_embeds=prompt_embeds,
-            pooled_prompt_embeds=pooled_prompt_embeds,
-            device=device,
-            num_images_per_prompt=num_images_per_prompt,
-            max_sequence_length=max_sequence_length,
-            lora_scale=lora_scale,
-        )
-
-        if negative_prompt_2 == "" and negative_prompt != "":
-            negative_prompt_2 = negative_prompt
-
-        negative_text_ids = text_ids
-        if guidance_scale_real > 1.0 and (
-            negative_prompt_embeds is None or negative_pooled_prompt_embeds is None
-        ):
-            (
-                negative_prompt_embeds,
-                negative_pooled_prompt_embeds,
-                negative_text_ids,
-                _,
-            ) = self.encode_prompt(
-                prompt=negative_prompt,
-                prompt_2=negative_prompt_2,
-                prompt_embeds=None,
-                pooled_prompt_embeds=None,
-                device=device,
-                num_images_per_prompt=num_images_per_prompt,
-                max_sequence_length=max_sequence_length,
-                lora_scale=lora_scale,
-            )
-
-        # 4. Prepare latent variables
-        num_channels_latents = self.transformer.config.in_channels // 4
-        latents, latent_image_ids = self.prepare_latents(
-            batch_size * num_images_per_prompt,
-            num_channels_latents,
-            height,
-            width,
-            prompt_embeds.dtype,
-            device,
-            generator,
-            latents,
-        )
-
-        # 5. Prepare timesteps
-        sigmas = np.linspace(1.0, 1 / num_inference_steps, num_inference_steps)
-        image_seq_len = latents.shape[1]
-        mu = calculate_shift(
-            image_seq_len,
-            self.scheduler.config.base_image_seq_len,
-            self.scheduler.config.max_image_seq_len,
-            self.scheduler.config.base_shift,
-            self.scheduler.config.max_shift,
-        )
-        timesteps, num_inference_steps = retrieve_timesteps(
-            self.scheduler,
-            num_inference_steps,
-            device,
-            timesteps,
-            sigmas,
-            mu=mu,
-        )
-        num_warmup_steps = max(
-            len(timesteps) - num_inference_steps * self.scheduler.order, 0
-        )
-        self._num_timesteps = len(timesteps)
-
-        latents = latents.to(self.transformer.device)
-        latent_image_ids = latent_image_ids.to(self.transformer.device)[0]
-        timesteps = timesteps.to(self.transformer.device)
-        text_ids = text_ids.to(self.transformer.device)[0]
-
-        # 6. Denoising loop
-        with self.progress_bar(total=num_inference_steps) as progress_bar:
-            for i, t in enumerate(timesteps):
-                if self.interrupt:
-                    continue
-
-                # broadcast to batch dimension in a way that's compatible with ONNX/Core ML
-                timestep = t.expand(latents.shape[0]).to(latents.dtype)
-
-                # handle guidance
-                if self.transformer.config.guidance_embeds:
-                    guidance = torch.tensor(
-                        [guidance_scale], device=self.transformer.device
-                    )
-                    guidance = guidance.expand(latents.shape[0])
-                else:
-                    guidance = None
-
-                extra_transformer_args = {}
-                if prompt_mask is not None:
-                    extra_transformer_args["attention_mask"] = prompt_mask.to(
-                        device=self.transformer.device
-                    )
-
-                noise_pred = self.transformer(
-                    hidden_states=latents.to(
-                        device=self.transformer.device  # , dtype=self.transformer.dtype     # can't cast dtype like this because of NF4
-                    ),
-                    # YiYi notes: divide it by 1000 for now because we scale it by 1000 in the transforme rmodel (we should not keep it but I want to keep the inputs same for the model for testing)
-                    timestep=timestep / 1000,
-                    guidance=guidance,
-                    pooled_projections=pooled_prompt_embeds.to(
-                        device=self.transformer.device  # , dtype=self.transformer.dtype     # can't cast dtype like this because of NF4
-                    ),
-                    encoder_hidden_states=prompt_embeds.to(
-                        device=self.transformer.device  # , dtype=self.transformer.dtype     # can't cast dtype like this because of NF4
-                    ),
-                    txt_ids=text_ids,
-                    img_ids=latent_image_ids,
-                    joint_attention_kwargs=self.joint_attention_kwargs,
-                    return_dict=False,
-                    **extra_transformer_args,
-                )[0]
-
-                # TODO optionally use batch prediction to speed this up.
-                if guidance_scale_real > 1.0 and i >= no_cfg_until_timestep:
-                    noise_pred_uncond = self.transformer(
-                        hidden_states=latents.to(
-                            device=self.transformer.device  # , dtype=self.transformer.dtype     # can't cast dtype like this because of NF4
-                        ),
-                        # YiYi notes: divide it by 1000 for now because we scale it by 1000 in the transforme rmodel (we should not keep it but I want to keep the inputs same for the model for testing)
-                        timestep=timestep / 1000,
-                        guidance=guidance,
-                        pooled_projections=negative_pooled_prompt_embeds.to(
-                            device=self.transformer.device  # , dtype=self.transformer.dtype     # can't cast dtype like this because of NF4
-                        ),
-                        encoder_hidden_states=negative_prompt_embeds.to(
-                            device=self.transformer.device  # , dtype=self.transformer.dtype     # can't cast dtype like this because of NF4
-                        ),
-                        txt_ids=negative_text_ids.to(device=self.transformer.device),
-                        img_ids=latent_image_ids.to(device=self.transformer.device),
-                        joint_attention_kwargs=self.joint_attention_kwargs,
-                        return_dict=False,
-                    )[0]
-
-                    noise_pred = noise_pred_uncond + guidance_scale_real * (
-                        noise_pred - noise_pred_uncond
-                    )
-
-                # compute the previous noisy sample x_t -> x_t-1
-                latents_dtype = latents.dtype
-                latents = self.scheduler.step(
-                    noise_pred, t, latents, return_dict=False
-                )[0]
-
-                if latents.dtype != latents_dtype:
-                    if torch.backends.mps.is_available():
-                        # some platforms (eg. apple mps) misbehave due to a pytorch bug: https://github.com/pytorch/pytorch/pull/99272
-                        latents = latents.to(latents_dtype)
-
-                if callback_on_step_end is not None:
-                    callback_kwargs = {}
-                    for k in callback_on_step_end_tensor_inputs:
-                        callback_kwargs[k] = locals()[k]
-                    callback_outputs = callback_on_step_end(self, i, t, callback_kwargs)
-
-                    latents = callback_outputs.pop("latents", latents)
-                    prompt_embeds = callback_outputs.pop("prompt_embeds", prompt_embeds)
-
-                # call the callback, if provided
-                if i == len(timesteps) - 1 or (
-                    (i + 1) > num_warmup_steps and (i + 1) % self.scheduler.order == 0
-                ):
-                    progress_bar.update()
-
-                if XLA_AVAILABLE:
-                    xm.mark_step()
-
-        if output_type == "latent":
-            image = latents
-
-        else:
-            latents = self._unpack_latents(
-                latents, height, width, self.vae_scale_factor
-            )
-            latents = (
-                latents / self.vae.config.scaling_factor
-            ) + self.vae.config.shift_factor
-
-            image = self.vae.decode(
-                latents.to(device=self.vae.device, dtype=self.vae.dtype),
-                return_dict=False,
-            )[0]
-            image = self.image_processor.postprocess(image, output_type=output_type)
-
-        # Offload all models
-        self.maybe_free_model_hooks()
-
-        if not return_dict:
-            return (image,)
-
-        return FluxPipelineOutput(images=image)
-
-
-from dataclasses import dataclass
-from typing import List, Union
-
-import PIL.Image
-from diffusers.utils import BaseOutput
-
-
-@dataclass
-class FluxPipelineOutput(BaseOutput):
-    """
-    Output class for Stable Diffusion pipelines.
-
-    Args:
-        images (`List[PIL.Image.Image]` or `np.ndarray`)
-            List of denoised PIL images of length `batch_size` or numpy array of shape `(batch_size, height, width,
-            num_channels)`. PIL images or numpy array present the denoised images of the diffusion pipeline.
-    """
-
-    images: Union[List[PIL.Image.Image], np.ndarray]
diff --git a/videotuna/third_party/flux/models/flux/transformer.py b/videotuna/third_party/flux/models/flux/transformer.py
deleted file mode 100644
index c677fa7d..00000000
--- a/videotuna/third_party/flux/models/flux/transformer.py
+++ /dev/null
@@ -1,753 +0,0 @@
-# Copyright 2024 Stability AI, The HuggingFace Team, The InstantX Team, and Terminus Research Group. All rights reserved.
-#
-# Originally licensed under the Apache License, Version 2.0 (the "License");
-# Updated to "Affero GENERAL PUBLIC LICENSE Version 3, 19 November 2007" via extensive updates to attn_mask usage.
-
-from typing import Any, Dict, List, Optional, Tuple, Union
-
-import torch
-import torch.nn as nn
-import torch.nn.functional as F
-from diffusers.configuration_utils import ConfigMixin, register_to_config
-from diffusers.loaders import FromOriginalModelMixin, PeftAdapterMixin
-from diffusers.models.attention import FeedForward
-from diffusers.models.attention_processor import Attention
-from diffusers.models.embeddings import (
-    CombinedTimestepGuidanceTextProjEmbeddings,
-    CombinedTimestepTextProjEmbeddings,
-    FluxPosEmbed,
-)
-from diffusers.models.modeling_outputs import Transformer2DModelOutput
-from diffusers.models.modeling_utils import ModelMixin
-from diffusers.models.normalization import (
-    AdaLayerNormContinuous,
-    AdaLayerNormZero,
-    AdaLayerNormZeroSingle,
-)
-from diffusers.utils import (
-    USE_PEFT_BACKEND,
-    is_torch_version,
-    logging,
-    scale_lora_layers,
-    unscale_lora_layers,
-)
-from diffusers.utils.torch_utils import maybe_allow_in_graph
-
-logger = logging.get_logger(__name__)  # pylint: disable=invalid-name
-
-is_flash_attn_available = False
-try:
-    from flash_attn_interface import flash_attn_func
-
-    is_flash_attn_available = True
-except:
-    pass
-
-from videotuna.third_party.flux.models.flux.attention import (
-    FluxAttnProcessor3_0,
-    FluxSingleAttnProcessor3_0,
-)
-
-
-class FluxAttnProcessor2_0:
-    """Attention processor used typically in processing the SD3-like self-attention projections."""
-
-    def __init__(self):
-        if not hasattr(F, "scaled_dot_product_attention"):
-            raise ImportError(
-                "FluxAttnProcessor2_0 requires PyTorch 2.0, to use it, please upgrade PyTorch to 2.0."
-            )
-
-    def __call__(
-        self,
-        attn: Attention,
-        hidden_states: torch.FloatTensor,
-        encoder_hidden_states: torch.FloatTensor = None,
-        attention_mask: Optional[torch.FloatTensor] = None,
-        image_rotary_emb: Optional[torch.Tensor] = None,
-    ) -> torch.FloatTensor:
-        batch_size, _, _ = (
-            hidden_states.shape
-            if encoder_hidden_states is None
-            else encoder_hidden_states.shape
-        )
-
-        # `sample` projections.
-        query = attn.to_q(hidden_states)
-        key = attn.to_k(hidden_states)
-        value = attn.to_v(hidden_states)
-
-        inner_dim = key.shape[-1]
-        head_dim = inner_dim // attn.heads
-
-        query = query.view(batch_size, -1, attn.heads, head_dim).transpose(1, 2)
-        key = key.view(batch_size, -1, attn.heads, head_dim).transpose(1, 2)
-        value = value.view(batch_size, -1, attn.heads, head_dim).transpose(1, 2)
-
-        if attn.norm_q is not None:
-            query = attn.norm_q(query)
-        if attn.norm_k is not None:
-            key = attn.norm_k(key)
-
-        # the attention in FluxSingleTransformerBlock does not use `encoder_hidden_states`
-        if encoder_hidden_states is not None:
-            # `context` projections.
-            encoder_hidden_states_query_proj = attn.add_q_proj(encoder_hidden_states)
-            encoder_hidden_states_key_proj = attn.add_k_proj(encoder_hidden_states)
-            encoder_hidden_states_value_proj = attn.add_v_proj(encoder_hidden_states)
-
-            encoder_hidden_states_query_proj = encoder_hidden_states_query_proj.view(
-                batch_size, -1, attn.heads, head_dim
-            ).transpose(1, 2)
-            encoder_hidden_states_key_proj = encoder_hidden_states_key_proj.view(
-                batch_size, -1, attn.heads, head_dim
-            ).transpose(1, 2)
-            encoder_hidden_states_value_proj = encoder_hidden_states_value_proj.view(
-                batch_size, -1, attn.heads, head_dim
-            ).transpose(1, 2)
-
-            if attn.norm_added_q is not None:
-                encoder_hidden_states_query_proj = attn.norm_added_q(
-                    encoder_hidden_states_query_proj
-                )
-            if attn.norm_added_k is not None:
-                encoder_hidden_states_key_proj = attn.norm_added_k(
-                    encoder_hidden_states_key_proj
-                )
-
-            # attention
-            query = torch.cat([encoder_hidden_states_query_proj, query], dim=2)
-            key = torch.cat([encoder_hidden_states_key_proj, key], dim=2)
-            value = torch.cat([encoder_hidden_states_value_proj, value], dim=2)
-
-        if image_rotary_emb is not None:
-            from diffusers.models.embeddings import apply_rotary_emb
-
-            query = apply_rotary_emb(query, image_rotary_emb)
-            key = apply_rotary_emb(key, image_rotary_emb)
-
-        if attention_mask is not None:
-            attention_mask = attention_mask.unsqueeze(1).unsqueeze(2)
-            attention_mask = (attention_mask > 0).bool()
-            attention_mask = attention_mask.to(
-                device=hidden_states.device, dtype=hidden_states.dtype
-            )
-
-        hidden_states = F.scaled_dot_product_attention(
-            query,
-            key,
-            value,
-            dropout_p=0.0,
-            is_causal=False,
-            attn_mask=attention_mask,
-        )
-        hidden_states = hidden_states.transpose(1, 2).reshape(
-            batch_size, -1, attn.heads * head_dim
-        )
-        hidden_states = hidden_states.to(query.dtype)
-
-        if encoder_hidden_states is not None:
-            encoder_hidden_states, hidden_states = (
-                hidden_states[:, : encoder_hidden_states.shape[1]],
-                hidden_states[:, encoder_hidden_states.shape[1] :],
-            )
-
-            # linear proj
-            hidden_states = attn.to_out[0](hidden_states)
-            # dropout
-            hidden_states = attn.to_out[1](hidden_states)
-            encoder_hidden_states = attn.to_add_out(encoder_hidden_states)
-
-            return hidden_states, encoder_hidden_states
-        return hidden_states
-
-
-def expand_flux_attention_mask(
-    hidden_states: torch.Tensor,
-    attn_mask: torch.Tensor,
-) -> torch.Tensor:
-    """
-    Expand a mask so that the image is included.
-    """
-    bsz = attn_mask.shape[0]
-    assert bsz == hidden_states.shape[0]
-    residual_seq_len = hidden_states.shape[1]
-    mask_seq_len = attn_mask.shape[1]
-
-    expanded_mask = torch.ones(bsz, residual_seq_len)
-    expanded_mask[:, :mask_seq_len] = attn_mask
-
-    return expanded_mask
-
-
-@maybe_allow_in_graph
-class FluxSingleTransformerBlock(nn.Module):
-    r"""
-    A Transformer block following the MMDiT architecture, introduced in Stable Diffusion 3.
-
-    Reference: https://arxiv.org/abs/2403.03206
-
-    Parameters:
-        dim (`int`): The number of channels in the input and output.
-        num_attention_heads (`int`): The number of heads to use for multi-head attention.
-        attention_head_dim (`int`): The number of channels in each head.
-        context_pre_only (`bool`): Boolean to determine if we should add some blocks associated with the
-            processing of `context` conditions.
-    """
-
-    def __init__(self, dim, num_attention_heads, attention_head_dim, mlp_ratio=4.0):
-        super().__init__()
-        self.mlp_hidden_dim = int(dim * mlp_ratio)
-
-        self.norm = AdaLayerNormZeroSingle(dim)
-        self.proj_mlp = nn.Linear(dim, self.mlp_hidden_dim)
-        self.act_mlp = nn.GELU(approximate="tanh")
-        self.proj_out = nn.Linear(dim + self.mlp_hidden_dim, dim)
-
-        processor = FluxAttnProcessor2_0()
-        if torch.cuda.is_available():
-            rank = (
-                torch.distributed.get_rank()
-                if torch.distributed.is_initialized()
-                else 0
-            )
-            primary_device = torch.cuda.get_device_properties(rank)
-            if primary_device.major == 9 and primary_device.minor == 0:
-                if is_flash_attn_available:
-                    if rank == 0:
-                        print("Using FlashAttention3_0 for H100 GPU (Single block)")
-                    processor = FluxSingleAttnProcessor3_0()
-                else:
-                    if rank == 0:
-                        print(
-                            "FlashAttention3_0 is not available, using FlashAttention2_0 for H100 GPU (Single block). Install flash_attn to make use of it."
-                        )
-        self.attn = Attention(
-            query_dim=dim,
-            cross_attention_dim=None,
-            dim_head=attention_head_dim,
-            heads=num_attention_heads,
-            out_dim=dim,
-            bias=True,
-            processor=processor,
-            qk_norm="rms_norm",
-            eps=1e-6,
-            pre_only=True,
-        )
-
-    def forward(
-        self,
-        hidden_states: torch.FloatTensor,
-        temb: torch.FloatTensor,
-        image_rotary_emb=None,
-        attention_mask: Optional[torch.Tensor] = None,
-    ):
-        residual = hidden_states
-        norm_hidden_states, gate = self.norm(hidden_states, emb=temb)
-        mlp_hidden_states = self.act_mlp(self.proj_mlp(norm_hidden_states))
-
-        if attention_mask is not None:
-            attention_mask = expand_flux_attention_mask(
-                hidden_states,
-                attention_mask,
-            )
-
-        attn_output = self.attn(
-            hidden_states=norm_hidden_states,
-            image_rotary_emb=image_rotary_emb,
-            attention_mask=attention_mask,
-        )
-
-        hidden_states = torch.cat([attn_output, mlp_hidden_states], dim=2)
-        gate = gate.unsqueeze(1)
-        hidden_states = gate * self.proj_out(hidden_states)
-        hidden_states = residual + hidden_states
-
-        return hidden_states
-
-
-@maybe_allow_in_graph
-class FluxTransformerBlock(nn.Module):
-    r"""
-    A Transformer block following the MMDiT architecture, introduced in Stable Diffusion 3.
-
-    Reference: https://arxiv.org/abs/2403.03206
-
-    Parameters:
-        dim (`int`): The number of channels in the input and output.
-        num_attention_heads (`int`): The number of heads to use for multi-head attention.
-        attention_head_dim (`int`): The number of channels in each head.
-        context_pre_only (`bool`): Boolean to determine if we should add some blocks associated with the
-            processing of `context` conditions.
-    """
-
-    def __init__(
-        self, dim, num_attention_heads, attention_head_dim, qk_norm="rms_norm", eps=1e-6
-    ):
-        super().__init__()
-
-        self.norm1 = AdaLayerNormZero(dim)
-
-        self.norm1_context = AdaLayerNormZero(dim)
-
-        if hasattr(F, "scaled_dot_product_attention"):
-            processor = FluxAttnProcessor2_0()
-            if torch.cuda.is_available():
-                rank = (
-                    torch.distributed.get_rank()
-                    if torch.distributed.is_initialized()
-                    else 0
-                )
-                primary_device = torch.cuda.get_device_properties(rank)
-                if primary_device.major == 9 and primary_device.minor == 0:
-                    if is_flash_attn_available:
-                        if rank == 0:
-                            print("Using FlashAttention3_0 for H100 GPU (Double block)")
-                        processor = FluxAttnProcessor3_0()
-        else:
-            raise ValueError(
-                "The current PyTorch version does not support the `scaled_dot_product_attention` function."
-            )
-        self.attn = Attention(
-            query_dim=dim,
-            cross_attention_dim=None,
-            added_kv_proj_dim=dim,
-            dim_head=attention_head_dim,
-            heads=num_attention_heads,
-            out_dim=dim,
-            context_pre_only=False,
-            bias=True,
-            processor=processor,
-            qk_norm=qk_norm,
-            eps=eps,
-        )
-
-        self.norm2 = nn.LayerNorm(dim, elementwise_affine=False, eps=1e-6)
-        self.ff = FeedForward(dim=dim, dim_out=dim, activation_fn="gelu-approximate")
-
-        self.norm2_context = nn.LayerNorm(dim, elementwise_affine=False, eps=1e-6)
-        self.ff_context = FeedForward(
-            dim=dim, dim_out=dim, activation_fn="gelu-approximate"
-        )
-
-        # let chunk size default to None
-        self._chunk_size = None
-        self._chunk_dim = 0
-
-    def forward(
-        self,
-        hidden_states: torch.FloatTensor,
-        encoder_hidden_states: torch.FloatTensor,
-        temb: torch.FloatTensor,
-        image_rotary_emb=None,
-        attention_mask: Optional[torch.Tensor] = None,
-    ):
-        norm_hidden_states, gate_msa, shift_mlp, scale_mlp, gate_mlp = self.norm1(
-            hidden_states, emb=temb
-        )
-
-        norm_encoder_hidden_states, c_gate_msa, c_shift_mlp, c_scale_mlp, c_gate_mlp = (
-            self.norm1_context(encoder_hidden_states, emb=temb)
-        )
-
-        if attention_mask is not None:
-            attention_mask = expand_flux_attention_mask(
-                torch.cat([encoder_hidden_states, hidden_states], dim=1),
-                attention_mask,
-            )
-
-        # Attention.
-        attn_output, context_attn_output = self.attn(
-            hidden_states=norm_hidden_states,
-            encoder_hidden_states=norm_encoder_hidden_states,
-            image_rotary_emb=image_rotary_emb,
-            attention_mask=attention_mask,
-        )
-
-        # Process attention outputs for the `hidden_states`.
-        attn_output = gate_msa.unsqueeze(1) * attn_output
-        hidden_states = hidden_states + attn_output
-
-        norm_hidden_states = self.norm2(hidden_states)
-        norm_hidden_states = (
-            norm_hidden_states * (1 + scale_mlp[:, None]) + shift_mlp[:, None]
-        )
-
-        ff_output = self.ff(norm_hidden_states)
-        ff_output = gate_mlp.unsqueeze(1) * ff_output
-
-        hidden_states = hidden_states + ff_output
-
-        # Process attention outputs for the `encoder_hidden_states`.
-
-        context_attn_output = c_gate_msa.unsqueeze(1) * context_attn_output
-        encoder_hidden_states = encoder_hidden_states + context_attn_output
-
-        norm_encoder_hidden_states = self.norm2_context(encoder_hidden_states)
-        norm_encoder_hidden_states = (
-            norm_encoder_hidden_states * (1 + c_scale_mlp[:, None])
-            + c_shift_mlp[:, None]
-        )
-
-        context_ff_output = self.ff_context(norm_encoder_hidden_states)
-        encoder_hidden_states = (
-            encoder_hidden_states + c_gate_mlp.unsqueeze(1) * context_ff_output
-        )
-
-        return encoder_hidden_states, hidden_states
-
-
-class FluxTransformer2DModelWithMasking(
-    ModelMixin, ConfigMixin, PeftAdapterMixin, FromOriginalModelMixin
-):
-    """
-    The Transformer model introduced in Flux.
-
-    Reference: https://blackforestlabs.ai/announcing-black-forest-labs/
-
-    Parameters:
-        patch_size (`int`): Patch size to turn the input data into small patches.
-        in_channels (`int`, *optional*, defaults to 16): The number of channels in the input.
-        num_layers (`int`, *optional*, defaults to 18): The number of layers of MMDiT blocks to use.
-        num_single_layers (`int`, *optional*, defaults to 18): The number of layers of single DiT blocks to use.
-        attention_head_dim (`int`, *optional*, defaults to 64): The number of channels in each head.
-        num_attention_heads (`int`, *optional*, defaults to 18): The number of heads to use for multi-head attention.
-        joint_attention_dim (`int`, *optional*): The number of `encoder_hidden_states` dimensions to use.
-        pooled_projection_dim (`int`): Number of dimensions to use when projecting the `pooled_projections`.
-        guidance_embeds (`bool`, defaults to False): Whether to use guidance embeddings.
-    """
-
-    _supports_gradient_checkpointing = True
-
-    @register_to_config
-    def __init__(
-        self,
-        patch_size: int = 1,
-        in_channels: int = 64,
-        num_layers: int = 19,
-        num_single_layers: int = 38,
-        attention_head_dim: int = 128,
-        num_attention_heads: int = 24,
-        joint_attention_dim: int = 4096,
-        pooled_projection_dim: int = 768,
-        guidance_embeds: bool = False,
-        axes_dims_rope: Tuple[int] = (16, 56, 56),
-    ):
-        super().__init__()
-        self.out_channels = in_channels
-        self.inner_dim = (
-            self.config.num_attention_heads * self.config.attention_head_dim
-        )
-
-        self.pos_embed = FluxPosEmbed(theta=10000, axes_dim=axes_dims_rope)
-        text_time_guidance_cls = (
-            CombinedTimestepGuidanceTextProjEmbeddings
-            if guidance_embeds
-            else CombinedTimestepTextProjEmbeddings
-        )
-        self.time_text_embed = text_time_guidance_cls(
-            embedding_dim=self.inner_dim,
-            pooled_projection_dim=self.config.pooled_projection_dim,
-        )
-
-        self.context_embedder = nn.Linear(
-            self.config.joint_attention_dim, self.inner_dim
-        )
-        self.x_embedder = torch.nn.Linear(self.config.in_channels, self.inner_dim)
-
-        self.transformer_blocks = nn.ModuleList(
-            [
-                FluxTransformerBlock(
-                    dim=self.inner_dim,
-                    num_attention_heads=self.config.num_attention_heads,
-                    attention_head_dim=self.config.attention_head_dim,
-                )
-                for i in range(self.config.num_layers)
-            ]
-        )
-
-        self.single_transformer_blocks = nn.ModuleList(
-            [
-                FluxSingleTransformerBlock(
-                    dim=self.inner_dim,
-                    num_attention_heads=self.config.num_attention_heads,
-                    attention_head_dim=self.config.attention_head_dim,
-                )
-                for i in range(self.config.num_single_layers)
-            ]
-        )
-
-        self.norm_out = AdaLayerNormContinuous(
-            self.inner_dim, self.inner_dim, elementwise_affine=False, eps=1e-6
-        )
-        self.proj_out = nn.Linear(
-            self.inner_dim, patch_size * patch_size * self.out_channels, bias=True
-        )
-
-        self.gradient_checkpointing = False
-
-    def _set_gradient_checkpointing(self, module, value=False):
-        if hasattr(module, "gradient_checkpointing"):
-            module.gradient_checkpointing = value
-
-    def forward(
-        self,
-        hidden_states: torch.Tensor,
-        encoder_hidden_states: torch.Tensor = None,
-        pooled_projections: torch.Tensor = None,
-        timestep: torch.LongTensor = None,
-        img_ids: torch.Tensor = None,
-        txt_ids: torch.Tensor = None,
-        guidance: torch.Tensor = None,
-        joint_attention_kwargs: Optional[Dict[str, Any]] = None,
-        return_dict: bool = True,
-        attention_mask: Optional[torch.Tensor] = None,
-    ) -> Union[torch.FloatTensor, Transformer2DModelOutput]:
-        """
-        The [`FluxTransformer2DModelWithMasking`] forward method.
-
-        Args:
-            hidden_states (`torch.FloatTensor` of shape `(batch size, channel, height, width)`):
-                Input `hidden_states`.
-            encoder_hidden_states (`torch.FloatTensor` of shape `(batch size, sequence_len, embed_dims)`):
-                Conditional embeddings (embeddings computed from the input conditions such as prompts) to use.
-            pooled_projections (`torch.FloatTensor` of shape `(batch_size, projection_dim)`): Embeddings projected
-                from the embeddings of input conditions.
-            timestep ( `torch.LongTensor`):
-                Used to indicate denoising step.
-            block_controlnet_hidden_states: (`list` of `torch.Tensor`):
-                A list of tensors that if specified are added to the residuals of transformer blocks.
-            joint_attention_kwargs (`dict`, *optional*):
-                A kwargs dictionary that if specified is passed along to the `AttentionProcessor` as defined under
-                `self.processor` in
-                [diffusers.models.attention_processor](https://github.com/huggingface/diffusers/blob/main/src/diffusers/models/attention_processor.py).
-            return_dict (`bool`, *optional*, defaults to `True`):
-                Whether or not to return a [`~models.transformer_2d.Transformer2DModelOutput`] instead of a plain
-                tuple.
-
-        Returns:
-            If `return_dict` is True, an [`~models.transformer_2d.Transformer2DModelOutput`] is returned, otherwise a
-            `tuple` where the first element is the sample tensor.
-        """
-        if joint_attention_kwargs is not None:
-            joint_attention_kwargs = joint_attention_kwargs.copy()
-            lora_scale = joint_attention_kwargs.pop("scale", 1.0)
-        else:
-            lora_scale = 1.0
-
-        if USE_PEFT_BACKEND:
-            # weight the lora layers by setting `lora_scale` for each PEFT layer
-            scale_lora_layers(self, lora_scale)
-        else:
-            if (
-                joint_attention_kwargs is not None
-                and joint_attention_kwargs.get("scale", None) is not None
-            ):
-                logger.warning(
-                    "Passing `scale` via `joint_attention_kwargs` when not using the PEFT backend is ineffective."
-                )
-        hidden_states = self.x_embedder(hidden_states)
-
-        timestep = timestep.to(hidden_states.dtype) * 1000
-        if guidance is not None:
-            guidance = guidance.to(hidden_states.dtype) * 1000
-        else:
-            guidance = None
-        temb = (
-            self.time_text_embed(timestep, pooled_projections)
-            if guidance is None
-            else self.time_text_embed(timestep, guidance, pooled_projections)
-        )
-        encoder_hidden_states = self.context_embedder(encoder_hidden_states)
-
-        if txt_ids.ndim == 3:
-            txt_ids = txt_ids[0]
-        if img_ids.ndim == 3:
-            img_ids = img_ids[0]
-
-        ids = torch.cat((txt_ids, img_ids), dim=0)
-
-        image_rotary_emb = self.pos_embed(ids)
-
-        for index_block, block in enumerate(self.transformer_blocks):
-            if self.training and self.gradient_checkpointing:
-
-                def create_custom_forward(module, return_dict=None):
-                    def custom_forward(*inputs):
-                        if return_dict is not None:
-                            return module(*inputs, return_dict=return_dict)
-                        else:
-                            return module(*inputs)
-
-                    return custom_forward
-
-                ckpt_kwargs: Dict[str, Any] = (
-                    {"use_reentrant": False} if is_torch_version(">=", "1.11.0") else {}
-                )
-                encoder_hidden_states, hidden_states = (
-                    torch.utils.checkpoint.checkpoint(
-                        create_custom_forward(block),
-                        hidden_states,
-                        encoder_hidden_states,
-                        temb,
-                        image_rotary_emb,
-                        attention_mask,
-                        **ckpt_kwargs,
-                    )
-                )
-
-            else:
-                encoder_hidden_states, hidden_states = block(
-                    hidden_states=hidden_states,
-                    encoder_hidden_states=encoder_hidden_states,
-                    temb=temb,
-                    image_rotary_emb=image_rotary_emb,
-                    attention_mask=attention_mask,
-                )
-
-        # Flux places the text tokens in front of the image tokens in the
-        # sequence.
-        hidden_states = torch.cat([encoder_hidden_states, hidden_states], dim=1)
-
-        for index_block, block in enumerate(self.single_transformer_blocks):
-            if self.training and self.gradient_checkpointing:
-
-                def create_custom_forward(module, return_dict=None):
-                    def custom_forward(*inputs):
-                        if return_dict is not None:
-                            return module(*inputs, return_dict=return_dict)
-                        else:
-                            return module(*inputs)
-
-                    return custom_forward
-
-                ckpt_kwargs: Dict[str, Any] = (
-                    {"use_reentrant": False} if is_torch_version(">=", "1.11.0") else {}
-                )
-                hidden_states = torch.utils.checkpoint.checkpoint(
-                    create_custom_forward(block),
-                    hidden_states,
-                    temb,
-                    image_rotary_emb,
-                    attention_mask,
-                    **ckpt_kwargs,
-                )
-
-            else:
-                hidden_states = block(
-                    hidden_states=hidden_states,
-                    temb=temb,
-                    image_rotary_emb=image_rotary_emb,
-                    attention_mask=attention_mask,
-                )
-
-        hidden_states = hidden_states[:, encoder_hidden_states.shape[1] :, ...]
-
-        hidden_states = self.norm_out(hidden_states, temb)
-        output = self.proj_out(hidden_states)
-
-        if USE_PEFT_BACKEND:
-            # remove `lora_scale` from each PEFT layer
-            unscale_lora_layers(self, lora_scale)
-
-        if not return_dict:
-            return (output,)
-
-        return Transformer2DModelOutput(sample=output)
-
-
-if __name__ == "__main__":
-    dtype = torch.bfloat16
-    bsz = 2
-    img = torch.rand((bsz, 16, 64, 64)).to("cuda", dtype=dtype)
-    timestep = torch.tensor([0.5, 0.5]).to("cuda", dtype=torch.float32)
-    pooled = torch.rand(bsz, 768).to("cuda", dtype=dtype)
-    text = torch.rand((bsz, 512, 4096)).to("cuda", dtype=dtype)
-    attn_mask = torch.tensor([[1.0] * 384 + [0.0] * 128] * bsz).to(
-        "cuda", dtype=dtype
-    )  # Last 128 positions are masked
-
-    def _pack_latents(latents, batch_size, num_channels_latents, height, width):
-        latents = latents.view(
-            batch_size, num_channels_latents, height // 2, 2, width // 2, 2
-        )
-        latents = latents.permute(0, 2, 4, 1, 3, 5)
-        latents = latents.reshape(
-            batch_size, (height // 2) * (width // 2), num_channels_latents * 4
-        )
-
-        return latents
-
-    def _prepare_latent_image_ids(
-        batch_size, height, width, device="cuda", dtype=dtype
-    ):
-        latent_image_ids = torch.zeros(height // 2, width // 2, 3)
-        latent_image_ids[..., 1] = (
-            latent_image_ids[..., 1] + torch.arange(height // 2)[:, None]
-        )
-        latent_image_ids[..., 2] = (
-            latent_image_ids[..., 2] + torch.arange(width // 2)[None, :]
-        )
-
-        latent_image_id_height, latent_image_id_width, latent_image_id_channels = (
-            latent_image_ids.shape
-        )
-
-        latent_image_ids = latent_image_ids[None, :].repeat(batch_size, 1, 1, 1)
-        latent_image_ids = latent_image_ids.reshape(
-            batch_size,
-            latent_image_id_height * latent_image_id_width,
-            latent_image_id_channels,
-        )
-
-        return latent_image_ids.to(device=device, dtype=dtype)
-
-    txt_ids = torch.zeros(bsz, text.shape[1], 3).to(device="cuda", dtype=dtype)
-
-    vae_scale_factor = 16
-    height = 2 * (int(512) // vae_scale_factor)
-    width = 2 * (int(512) // vae_scale_factor)
-    img_ids = _prepare_latent_image_ids(bsz, height, width)
-    img = _pack_latents(img, img.shape[0], 16, height, width)
-
-    # Gotta go fast
-    transformer = FluxTransformer2DModelWithMasking.from_config(
-        {
-            "attention_head_dim": 128,
-            "guidance_embeds": True,
-            "in_channels": 64,
-            "joint_attention_dim": 4096,
-            "num_attention_heads": 24,
-            "num_layers": 4,
-            "num_single_layers": 8,
-            "patch_size": 1,
-            "pooled_projection_dim": 768,
-        }
-    ).to("cuda", dtype=dtype)
-
-    guidance = torch.tensor([2.0], device="cuda")
-    guidance = guidance.expand(bsz)
-
-    with torch.no_grad():
-        no_mask = transformer(
-            img,
-            encoder_hidden_states=text,
-            pooled_projections=pooled,
-            timestep=timestep,
-            img_ids=img_ids,
-            txt_ids=txt_ids,
-            guidance=guidance,
-        )
-        mask = transformer(
-            img,
-            encoder_hidden_states=text,
-            pooled_projections=pooled,
-            timestep=timestep,
-            img_ids=img_ids,
-            txt_ids=txt_ids,
-            guidance=guidance,
-            attention_mask=attn_mask,
-        )
-
-    assert torch.allclose(no_mask.sample, mask.sample) is False
-    print("Attention masking test ran OK. Differences in output were detected.")
diff --git a/videotuna/third_party/flux/models/pixart/pipeline.py b/videotuna/third_party/flux/models/pixart/pipeline.py
deleted file mode 100644
index 6412cefb..00000000
--- a/videotuna/third_party/flux/models/pixart/pipeline.py
+++ /dev/null
@@ -1,1254 +0,0 @@
-# Copyright 2024 PixArt-Sigma Authors and The HuggingFace Team. All rights reserved.
-#
-# Licensed under the Apache License, Version 2.0 (the "License");
-# you may not use this file except in compliance with the License.
-# You may obtain a copy of the License at
-#
-#     http://www.apache.org/licenses/LICENSE-2.0
-#
-# Unless required by applicable law or agreed to in writing, software
-# distributed under the License is distributed on an "AS IS" BASIS,
-# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
-# See the License for the specific language governing permissions and
-# limitations under the License.
-
-import html
-import inspect
-import re
-import urllib.parse as ul
-from typing import Callable, List, Optional, Tuple, Union
-
-import torch
-from diffusers.image_processor import PipelineImageInput, PixArtImageProcessor
-from diffusers.models import AutoencoderKL, PixArtTransformer2DModel
-from diffusers.pipelines.pipeline_utils import DiffusionPipeline, ImagePipelineOutput
-from diffusers.pipelines.pixart_alpha.pipeline_pixart_alpha import (
-    ASPECT_RATIO_256_BIN,
-    ASPECT_RATIO_512_BIN,
-    ASPECT_RATIO_1024_BIN,
-)
-from diffusers.schedulers import KarrasDiffusionSchedulers
-from diffusers.utils import (
-    BACKENDS_MAPPING,
-    deprecate,
-    is_bs4_available,
-    is_ftfy_available,
-    logging,
-    replace_example_docstring,
-)
-from diffusers.utils.torch_utils import randn_tensor
-from transformers import T5EncoderModel, T5Tokenizer
-
-
-# Copied from diffusers.pipelines.stable_diffusion.pipeline_stable_diffusion_img2img.retrieve_latents
-def retrieve_latents(
-    encoder_output: torch.Tensor,
-    generator: Optional[torch.Generator] = None,
-    sample_mode: str = "sample",
-):
-    if hasattr(encoder_output, "latent_dist") and sample_mode == "sample":
-        return encoder_output.latent_dist.sample(generator)
-    elif hasattr(encoder_output, "latent_dist") and sample_mode == "argmax":
-        return encoder_output.latent_dist.mode()
-    elif hasattr(encoder_output, "latents"):
-        return encoder_output.latents
-    else:
-        raise AttributeError("Could not access latents of provided encoder_output")
-
-
-logger = logging.get_logger(__name__)  # pylint: disable=invalid-name
-
-if is_bs4_available():
-    from bs4 import BeautifulSoup
-
-if is_ftfy_available():
-    import ftfy
-
-
-ASPECT_RATIO_2048_BIN = {
-    "0.25": [1024.0, 4096.0],
-    "0.26": [1024.0, 3968.0],
-    "0.27": [1024.0, 3840.0],
-    "0.28": [1024.0, 3712.0],
-    "0.32": [1152.0, 3584.0],
-    "0.33": [1152.0, 3456.0],
-    "0.35": [1152.0, 3328.0],
-    "0.4": [1280.0, 3200.0],
-    "0.42": [1280.0, 3072.0],
-    "0.48": [1408.0, 2944.0],
-    "0.5": [1408.0, 2816.0],
-    "0.52": [1408.0, 2688.0],
-    "0.57": [1536.0, 2688.0],
-    "0.6": [1536.0, 2560.0],
-    "0.68": [1664.0, 2432.0],
-    "0.72": [1664.0, 2304.0],
-    "0.78": [1792.0, 2304.0],
-    "0.82": [1792.0, 2176.0],
-    "0.88": [1920.0, 2176.0],
-    "0.94": [1920.0, 2048.0],
-    "1.0": [2048.0, 2048.0],
-    "1.07": [2048.0, 1920.0],
-    "1.13": [2176.0, 1920.0],
-    "1.21": [2176.0, 1792.0],
-    "1.29": [2304.0, 1792.0],
-    "1.38": [2304.0, 1664.0],
-    "1.46": [2432.0, 1664.0],
-    "1.67": [2560.0, 1536.0],
-    "1.75": [2688.0, 1536.0],
-    "2.0": [2816.0, 1408.0],
-    "2.09": [2944.0, 1408.0],
-    "2.4": [3072.0, 1280.0],
-    "2.5": [3200.0, 1280.0],
-    "2.89": [3328.0, 1152.0],
-    "3.0": [3456.0, 1152.0],
-    "3.11": [3584.0, 1152.0],
-    "3.62": [3712.0, 1024.0],
-    "3.75": [3840.0, 1024.0],
-    "3.88": [3968.0, 1024.0],
-    "4.0": [4096.0, 1024.0],
-}
-
-
-EXAMPLE_DOC_STRING = """
-    Examples:
-        ```py
-        >>> import torch
-        >>> from diffusers import PixArtSigmaPipeline
-
-        >>> # You can replace the checkpoint id with "PixArt-alpha/PixArt-Sigma-XL-2-512-MS" too.
-        >>> pipe = PixArtSigmaPipeline.from_pretrained(
-        ...     "PixArt-alpha/PixArt-Sigma-XL-2-1024-MS", torch_dtype=torch.float16
-        ... )
-        >>> # Enable memory optimizations.
-        >>> # pipe.enable_model_cpu_offload()
-
-        >>> prompt = "A small cactus with a happy face in the Sahara desert."
-        >>> image = pipe(prompt).images[0]
-        ```
-"""
-
-
-# Copied from diffusers.pipelines.stable_diffusion.pipeline_stable_diffusion.retrieve_timesteps
-def retrieve_timesteps(
-    scheduler,
-    num_inference_steps: Optional[int] = None,
-    device: Optional[Union[str, torch.device]] = None,
-    timesteps: Optional[List[int]] = None,
-    sigmas: Optional[List[float]] = None,
-    **kwargs,
-):
-    """
-    Calls the scheduler's `set_timesteps` method and retrieves timesteps from the scheduler after the call. Handles
-    custom timesteps. Any kwargs will be supplied to `scheduler.set_timesteps`.
-
-    Args:
-        scheduler (`SchedulerMixin`):
-            The scheduler to get timesteps from.
-        num_inference_steps (`int`):
-            The number of diffusion steps used when generating samples with a pre-trained model. If used, `timesteps`
-            must be `None`.
-        device (`str` or `torch.device`, *optional*):
-            The device to which the timesteps should be moved to. If `None`, the timesteps are not moved.
-        timesteps (`List[int]`, *optional*):
-            Custom timesteps used to override the timestep spacing strategy of the scheduler. If `timesteps` is passed,
-            `num_inference_steps` and `sigmas` must be `None`.
-        sigmas (`List[float]`, *optional*):
-            Custom sigmas used to override the timestep spacing strategy of the scheduler. If `sigmas` is passed,
-            `num_inference_steps` and `timesteps` must be `None`.
-
-    Returns:
-        `Tuple[torch.Tensor, int]`: A tuple where the first element is the timestep schedule from the scheduler and the
-        second element is the number of inference steps.
-    """
-    if timesteps is not None and sigmas is not None:
-        raise ValueError(
-            "Only one of `timesteps` or `sigmas` can be passed. Please choose one to set custom values"
-        )
-    if timesteps is not None:
-        accepts_timesteps = "timesteps" in set(
-            inspect.signature(scheduler.set_timesteps).parameters.keys()
-        )
-        if not accepts_timesteps:
-            raise ValueError(
-                f"The current scheduler class {scheduler.__class__}'s `set_timesteps` does not support custom"
-                f" timestep schedules. Please check whether you are using the correct scheduler."
-            )
-        scheduler.set_timesteps(timesteps=timesteps, device=device, **kwargs)
-        timesteps = scheduler.timesteps
-        num_inference_steps = len(timesteps)
-    elif sigmas is not None:
-        accept_sigmas = "sigmas" in set(
-            inspect.signature(scheduler.set_timesteps).parameters.keys()
-        )
-        if not accept_sigmas:
-            raise ValueError(
-                f"The current scheduler class {scheduler.__class__}'s `set_timesteps` does not support custom"
-                f" sigmas schedules. Please check whether you are using the correct scheduler."
-            )
-        scheduler.set_timesteps(sigmas=sigmas, device=device, **kwargs)
-        timesteps = scheduler.timesteps
-        num_inference_steps = len(timesteps)
-    else:
-        scheduler.set_timesteps(num_inference_steps, device=device, **kwargs)
-        timesteps = scheduler.timesteps
-    return timesteps, num_inference_steps
-
-
-class PixArtSigmaPipeline(DiffusionPipeline):
-    r"""
-    Pipeline for text-to-image generation using PixArt-Sigma.
-    """
-
-    bad_punct_regex = re.compile(
-        r"["
-        + "#®•©™&@·º½¾¿¡§~"
-        + r"\)"
-        + r"\("
-        + r"\]"
-        + r"\["
-        + r"\}"
-        + r"\{"
-        + r"\|"
-        + "\\"
-        + r"\/"
-        + r"\*"
-        + r"]{1,}"
-    )  # noqa
-
-    _optional_components = ["tokenizer", "text_encoder"]
-    model_cpu_offload_seq = "text_encoder->transformer->vae"
-
-    def __init__(
-        self,
-        tokenizer: T5Tokenizer,
-        text_encoder: T5EncoderModel,
-        vae: AutoencoderKL,
-        transformer: PixArtTransformer2DModel,
-        scheduler: KarrasDiffusionSchedulers,
-    ):
-        super().__init__()
-
-        self.register_modules(
-            tokenizer=tokenizer,
-            text_encoder=text_encoder,
-            vae=vae,
-            transformer=transformer,
-            scheduler=scheduler,
-        )
-
-        self.vae_scale_factor = 2 ** (len(self.vae.config.block_out_channels) - 1)
-        self.image_processor = PixArtImageProcessor(
-            vae_scale_factor=self.vae_scale_factor
-        )
-
-    def get_timesteps(
-        self, num_inference_steps, strength, device, denoising_start=None
-    ):
-        # get the original timestep using init_timestep
-        if denoising_start is not None:
-            init_timestep = min(
-                int(num_inference_steps * denoising_start), num_inference_steps
-            )
-            t_start = max(num_inference_steps - init_timestep, 0)
-        else:
-            t_start = 0
-
-        timesteps = self.scheduler.timesteps[t_start * self.scheduler.order :]
-        # Strength is irrelevant if we directly request a timestep to start at;
-        # that is, strength is determined by the denoising_start instead.
-        if denoising_start is not None:
-            discrete_timestep_cutoff = int(
-                round(
-                    self.scheduler.config.num_train_timesteps
-                    - (denoising_start * self.scheduler.config.num_train_timesteps)
-                )
-            )
-
-            num_inference_steps = (timesteps < discrete_timestep_cutoff).sum().item()
-            if self.scheduler.order == 2 and num_inference_steps % 2 == 0:
-                # if the scheduler is a 2nd order scheduler we might have to do +1
-                # because `num_inference_steps` might be even given that every timestep
-                # (except the highest one) is duplicated. If `num_inference_steps` is even it would
-                # mean that we cut the timesteps in the middle of the denoising step
-                # (between 1st and 2nd derivative) which leads to incorrect results. By adding 1
-                # we ensure that the denoising process always ends after the 2nd derivate step of the scheduler
-                num_inference_steps = num_inference_steps + 1
-
-            # because t_n+1 >= t_n, we slice the timesteps starting from the end
-            timesteps = timesteps[-num_inference_steps:]
-            return timesteps, num_inference_steps
-
-        return timesteps, num_inference_steps - t_start
-
-    # Copied from diffusers.pipelines.pixart_alpha.pipeline_pixart_alpha.PixArtAlphaPipeline.encode_prompt with 120->300
-    def encode_prompt(
-        self,
-        prompt: Union[str, List[str]],
-        do_classifier_free_guidance: bool = True,
-        negative_prompt: str = "",
-        num_images_per_prompt: int = 1,
-        device: Optional[torch.device] = None,
-        prompt_embeds: Optional[torch.Tensor] = None,
-        negative_prompt_embeds: Optional[torch.Tensor] = None,
-        prompt_attention_mask: Optional[torch.Tensor] = None,
-        negative_prompt_attention_mask: Optional[torch.Tensor] = None,
-        clean_caption: bool = False,
-        max_sequence_length: int = 300,
-        **kwargs,
-    ):
-        r"""
-        Encodes the prompt into text encoder hidden states.
-
-        Args:
-            prompt (`str` or `List[str]`, *optional*):
-                prompt to be encoded
-            negative_prompt (`str` or `List[str]`, *optional*):
-                The prompt not to guide the image generation. If not defined, one has to pass `negative_prompt_embeds`
-                instead. Ignored when not using guidance (i.e., ignored if `guidance_scale` is less than `1`). For
-                PixArt-Alpha, this should be "".
-            do_classifier_free_guidance (`bool`, *optional*, defaults to `True`):
-                whether to use classifier free guidance or not
-            num_images_per_prompt (`int`, *optional*, defaults to 1):
-                number of images that should be generated per prompt
-            device: (`torch.device`, *optional*):
-                torch device to place the resulting embeddings on
-            prompt_embeds (`torch.Tensor`, *optional*):
-                Pre-generated text embeddings. Can be used to easily tweak text inputs, *e.g.* prompt weighting. If not
-                provided, text embeddings will be generated from `prompt` input argument.
-            negative_prompt_embeds (`torch.Tensor`, *optional*):
-                Pre-generated negative text embeddings. For PixArt-Alpha, it's should be the embeddings of the ""
-                string.
-            clean_caption (`bool`, defaults to `False`):
-                If `True`, the function will preprocess and clean the provided caption before encoding.
-            max_sequence_length (`int`, defaults to 300): Maximum sequence length to use for the prompt.
-        """
-
-        if "mask_feature" in kwargs:
-            deprecation_message = "The use of `mask_feature` is deprecated. It is no longer used in any computation and that doesn't affect the end results. It will be removed in a future version."
-            deprecate("mask_feature", "1.0.0", deprecation_message, standard_warn=False)
-
-        if device is None:
-            device = self._execution_device
-
-        if prompt is not None and isinstance(prompt, str):
-            batch_size = 1
-        elif prompt is not None and isinstance(prompt, list):
-            batch_size = len(prompt)
-        else:
-            batch_size = prompt_embeds.shape[0]
-
-        # See Section 3.1. of the paper.
-        max_length = max_sequence_length
-
-        if prompt_embeds is None:
-            prompt = self._text_preprocessing(prompt, clean_caption=clean_caption)
-            text_inputs = self.tokenizer(
-                prompt,
-                padding="max_length",
-                max_length=max_length,
-                truncation=True,
-                add_special_tokens=True,
-                return_tensors="pt",
-            )
-            text_input_ids = text_inputs.input_ids
-            untruncated_ids = self.tokenizer(
-                prompt, padding="longest", return_tensors="pt"
-            ).input_ids
-
-            if untruncated_ids.shape[-1] >= text_input_ids.shape[
-                -1
-            ] and not torch.equal(text_input_ids, untruncated_ids):
-                removed_text = self.tokenizer.batch_decode(
-                    untruncated_ids[:, max_length - 1 : -1]
-                )
-                logger.warning(
-                    "The following part of your input was truncated because T5 can only handle sequences up to"
-                    f" {max_length} tokens: {removed_text}"
-                )
-
-            prompt_attention_mask = text_inputs.attention_mask
-            prompt_attention_mask = prompt_attention_mask.to(device)
-
-            prompt_embeds = self.text_encoder(
-                text_input_ids.to(device), attention_mask=prompt_attention_mask
-            )
-            prompt_embeds = prompt_embeds[0]
-
-        if self.text_encoder is not None:
-            dtype = self.text_encoder.dtype
-        elif self.transformer is not None:
-            dtype = self.transformer.dtype
-        else:
-            dtype = None
-
-        prompt_embeds = prompt_embeds.to(dtype=dtype, device=device)
-
-        bs_embed, seq_len, _ = prompt_embeds.shape
-        # duplicate text embeddings and attention mask for each generation per prompt, using mps friendly method
-        prompt_embeds = prompt_embeds.repeat(1, num_images_per_prompt, 1)
-        prompt_embeds = prompt_embeds.view(
-            bs_embed * num_images_per_prompt, seq_len, -1
-        )
-        prompt_attention_mask = prompt_attention_mask.view(bs_embed, -1)
-        prompt_attention_mask = prompt_attention_mask.repeat(num_images_per_prompt, 1)
-
-        # get unconditional embeddings for classifier free guidance
-        if do_classifier_free_guidance and negative_prompt_embeds is None:
-            uncond_tokens = (
-                [negative_prompt] * batch_size
-                if isinstance(negative_prompt, str)
-                else negative_prompt
-            )
-            uncond_tokens = self._text_preprocessing(
-                uncond_tokens, clean_caption=clean_caption
-            )
-            max_length = prompt_embeds.shape[1]
-            uncond_input = self.tokenizer(
-                uncond_tokens,
-                padding="max_length",
-                max_length=max_length,
-                truncation=True,
-                return_attention_mask=True,
-                add_special_tokens=True,
-                return_tensors="pt",
-            )
-            negative_prompt_attention_mask = uncond_input.attention_mask
-            negative_prompt_attention_mask = negative_prompt_attention_mask.to(device)
-
-            negative_prompt_embeds = self.text_encoder(
-                uncond_input.input_ids.to(device),
-                attention_mask=negative_prompt_attention_mask,
-            )
-            negative_prompt_embeds = negative_prompt_embeds[0]
-
-        if do_classifier_free_guidance:
-            # duplicate unconditional embeddings for each generation per prompt, using mps friendly method
-            seq_len = negative_prompt_embeds.shape[1]
-
-            negative_prompt_embeds = negative_prompt_embeds.to(
-                dtype=dtype, device=device
-            )
-
-            negative_prompt_embeds = negative_prompt_embeds.repeat(
-                1, num_images_per_prompt, 1
-            )
-            negative_prompt_embeds = negative_prompt_embeds.view(
-                batch_size * num_images_per_prompt, seq_len, -1
-            )
-
-            negative_prompt_attention_mask = negative_prompt_attention_mask.view(
-                bs_embed, -1
-            )
-            negative_prompt_attention_mask = negative_prompt_attention_mask.repeat(
-                num_images_per_prompt, 1
-            )
-        else:
-            negative_prompt_embeds = None
-            negative_prompt_attention_mask = None
-
-        return (
-            prompt_embeds,
-            prompt_attention_mask,
-            negative_prompt_embeds,
-            negative_prompt_attention_mask,
-        )
-
-    # Copied from diffusers.pipelines.stable_diffusion.pipeline_stable_diffusion.StableDiffusionPipeline.prepare_extra_step_kwargs
-    def prepare_extra_step_kwargs(self, generator, eta):
-        # prepare extra kwargs for the scheduler step, since not all schedulers have the same signature
-        # eta (η) is only used with the DDIMScheduler, it will be ignored for other schedulers.
-        # eta corresponds to η in DDIM paper: https://arxiv.org/abs/2010.02502
-        # and should be between [0, 1]
-
-        accepts_eta = "eta" in set(
-            inspect.signature(self.scheduler.step).parameters.keys()
-        )
-        extra_step_kwargs = {}
-        if accepts_eta:
-            extra_step_kwargs["eta"] = eta
-
-        # check if the scheduler accepts generator
-        accepts_generator = "generator" in set(
-            inspect.signature(self.scheduler.step).parameters.keys()
-        )
-        if accepts_generator:
-            extra_step_kwargs["generator"] = generator
-        return extra_step_kwargs
-
-    # Copied from diffusers.pipelines.pixart_alpha.pipeline_pixart_alpha.PixArtAlphaPipeline.check_inputs
-    def check_inputs(
-        self,
-        prompt,
-        height,
-        width,
-        strength,
-        num_inference_steps,
-        negative_prompt,
-        callback_steps,
-        prompt_embeds=None,
-        negative_prompt_embeds=None,
-        prompt_attention_mask=None,
-        negative_prompt_attention_mask=None,
-    ):
-        if strength is None:
-            if height % 8 != 0 or width % 8 != 0:
-                raise ValueError(
-                    f"`height` and `width` have to be divisible by 8 but are {height} and {width}."
-                )
-        else:
-            if strength < 0 or strength > 1:
-                raise ValueError(
-                    f"The value of strength should in [0.0, 1.0] but is {strength}"
-                )
-            if num_inference_steps is None:
-                raise ValueError("`num_inference_steps` cannot be None.")
-            elif not isinstance(num_inference_steps, int) or num_inference_steps <= 0:
-                raise ValueError(
-                    f"`num_inference_steps` has to be a positive integer but is {num_inference_steps} of type"
-                    f" {type(num_inference_steps)}."
-                )
-        if (callback_steps is None) or (
-            callback_steps is not None
-            and (not isinstance(callback_steps, int) or callback_steps <= 0)
-        ):
-            raise ValueError(
-                f"`callback_steps` has to be a positive integer but is {callback_steps} of type"
-                f" {type(callback_steps)}."
-            )
-
-        if prompt is not None and prompt_embeds is not None:
-            prompt = None
-
-        if prompt is None and prompt_embeds is None:
-            raise ValueError(
-                "Provide either `prompt` or `prompt_embeds`. Cannot leave both `prompt` and `prompt_embeds` undefined."
-            )
-        elif prompt is not None and (
-            not isinstance(prompt, str) and not isinstance(prompt, list)
-        ):
-            raise ValueError(
-                f"`prompt` has to be of type `str` or `list` but is {type(prompt)}"
-            )
-
-        if prompt is not None and negative_prompt_embeds is not None:
-            raise ValueError(
-                f"Cannot forward both `prompt`: {prompt} and `negative_prompt_embeds`:"
-                f" {negative_prompt_embeds}. Please make sure to only forward one of the two."
-            )
-
-        if negative_prompt is not None and negative_prompt_embeds is not None:
-            negative_prompt = None
-
-        if prompt_embeds is not None and prompt_attention_mask is None:
-            raise ValueError(
-                "Must provide `prompt_attention_mask` when specifying `prompt_embeds`."
-            )
-
-        if (
-            negative_prompt_embeds is not None
-            and negative_prompt_attention_mask is None
-        ):
-            raise ValueError(
-                "Must provide `negative_prompt_attention_mask` when specifying `negative_prompt_embeds`."
-            )
-
-        if prompt_embeds is not None and negative_prompt_embeds is not None:
-            if prompt_embeds.shape != negative_prompt_embeds.shape:
-                raise ValueError(
-                    "`prompt_embeds` and `negative_prompt_embeds` must have the same shape when passed directly, but"
-                    f" got: `prompt_embeds` {prompt_embeds.shape} != `negative_prompt_embeds`"
-                    f" {negative_prompt_embeds.shape}."
-                )
-            if prompt_attention_mask.shape != negative_prompt_attention_mask.shape:
-                raise ValueError(
-                    "`prompt_attention_mask` and `negative_prompt_attention_mask` must have the same shape when passed directly, but"
-                    f" got: `prompt_attention_mask` {prompt_attention_mask.shape} != `negative_prompt_attention_mask`"
-                    f" {negative_prompt_attention_mask.shape}."
-                )
-
-    # Copied from diffusers.pipelines.deepfloyd_if.pipeline_if.IFPipeline._text_preprocessing
-    def _text_preprocessing(self, text, clean_caption=False):
-        if clean_caption and not is_bs4_available():
-            logger.warning(
-                BACKENDS_MAPPING["bs4"][-1].format("Setting `clean_caption=True`")
-            )
-            logger.warning("Setting `clean_caption` to False...")
-            clean_caption = False
-
-        if clean_caption and not is_ftfy_available():
-            logger.warning(
-                BACKENDS_MAPPING["ftfy"][-1].format("Setting `clean_caption=True`")
-            )
-            logger.warning("Setting `clean_caption` to False...")
-            clean_caption = False
-
-        if not isinstance(text, (tuple, list)):
-            text = [text]
-
-        def process(text: str):
-            if clean_caption:
-                text = self._clean_caption(text)
-                text = self._clean_caption(text)
-            else:
-                text = text.lower().strip()
-            return text
-
-        return [process(t) for t in text]
-
-    # Copied from diffusers.pipelines.deepfloyd_if.pipeline_if.IFPipeline._clean_caption
-    def _clean_caption(self, caption):
-        caption = str(caption)
-        caption = ul.unquote_plus(caption)
-        caption = caption.strip().lower()
-        caption = re.sub("<person>", "person", caption)
-        # urls:
-        caption = re.sub(
-            r"\b((?:https?:(?:\/{1,3}|[a-zA-Z0-9%])|[a-zA-Z0-9.\-]+[.](?:com|co|ru|net|org|edu|gov|it)[\w/-]*\b\/?(?!@)))",  # noqa
-            "",
-            caption,
-        )  # regex for urls
-        caption = re.sub(
-            r"\b((?:www:(?:\/{1,3}|[a-zA-Z0-9%])|[a-zA-Z0-9.\-]+[.](?:com|co|ru|net|org|edu|gov|it)[\w/-]*\b\/?(?!@)))",  # noqa
-            "",
-            caption,
-        )  # regex for urls
-        # html:
-        caption = BeautifulSoup(caption, features="html.parser").text
-
-        # @<nickname>
-        caption = re.sub(r"@[\w\d]+\b", "", caption)
-
-        # 31C0—31EF CJK Strokes
-        # 31F0—31FF Katakana Phonetic Extensions
-        # 3200—32FF Enclosed CJK Letters and Months
-        # 3300—33FF CJK Compatibility
-        # 3400—4DBF CJK Unified Ideographs Extension A
-        # 4DC0—4DFF Yijing Hexagram Symbols
-        # 4E00—9FFF CJK Unified Ideographs
-        caption = re.sub(r"[\u31c0-\u31ef]+", "", caption)
-        caption = re.sub(r"[\u31f0-\u31ff]+", "", caption)
-        caption = re.sub(r"[\u3200-\u32ff]+", "", caption)
-        caption = re.sub(r"[\u3300-\u33ff]+", "", caption)
-        caption = re.sub(r"[\u3400-\u4dbf]+", "", caption)
-        caption = re.sub(r"[\u4dc0-\u4dff]+", "", caption)
-        caption = re.sub(r"[\u4e00-\u9fff]+", "", caption)
-        #######################################################
-
-        # все виды тире / all types of dash --> "-"
-        caption = re.sub(
-            r"[\u002D\u058A\u05BE\u1400\u1806\u2010-\u2015\u2E17\u2E1A\u2E3A\u2E3B\u2E40\u301C\u3030\u30A0\uFE31\uFE32\uFE58\uFE63\uFF0D]+",  # noqa
-            "-",
-            caption,
-        )
-
-        # кавычки к одному стандарту
-        caption = re.sub(r"[`´«»“”¨]", '"', caption)
-        caption = re.sub(r"[‘’]", "'", caption)
-
-        # &quot;
-        caption = re.sub(r"&quot;?", "", caption)
-        # &amp
-        caption = re.sub(r"&amp", "", caption)
-
-        # ip adresses:
-        caption = re.sub(r"\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}", " ", caption)
-
-        # article ids:
-        caption = re.sub(r"\d:\d\d\s+$", "", caption)
-
-        # \n
-        caption = re.sub(r"\\n", " ", caption)
-
-        # "#123"
-        caption = re.sub(r"#\d{1,3}\b", "", caption)
-        # "#12345.."
-        caption = re.sub(r"#\d{5,}\b", "", caption)
-        # "123456.."
-        caption = re.sub(r"\b\d{6,}\b", "", caption)
-        # filenames:
-        caption = re.sub(
-            r"[\S]+\.(?:png|jpg|jpeg|bmp|webp|eps|pdf|apk|mp4)", "", caption
-        )
-
-        #
-        caption = re.sub(r"[\"\']{2,}", r'"', caption)  # """AUSVERKAUFT"""
-        caption = re.sub(r"[\.]{2,}", r" ", caption)  # """AUSVERKAUFT"""
-
-        caption = re.sub(
-            self.bad_punct_regex, r" ", caption
-        )  # ***AUSVERKAUFT***, #AUSVERKAUFT
-        caption = re.sub(r"\s+\.\s+", r" ", caption)  # " . "
-
-        # this-is-my-cute-cat / this_is_my_cute_cat
-        regex2 = re.compile(r"(?:\-|\_)")
-        if len(re.findall(regex2, caption)) > 3:
-            caption = re.sub(regex2, " ", caption)
-
-        caption = ftfy.fix_text(caption)
-        caption = html.unescape(html.unescape(caption))
-
-        caption = re.sub(r"\b[a-zA-Z]{1,3}\d{3,15}\b", "", caption)  # jc6640
-        caption = re.sub(r"\b[a-zA-Z]+\d+[a-zA-Z]+\b", "", caption)  # jc6640vc
-        caption = re.sub(r"\b\d+[a-zA-Z]+\d+\b", "", caption)  # 6640vc231
-
-        caption = re.sub(r"(worldwide\s+)?(free\s+)?shipping", "", caption)
-        caption = re.sub(r"(free\s)?download(\sfree)?", "", caption)
-        caption = re.sub(r"\bclick\b\s(?:for|on)\s\w+", "", caption)
-        caption = re.sub(
-            r"\b(?:png|jpg|jpeg|bmp|webp|eps|pdf|apk|mp4)(\simage[s]?)?", "", caption
-        )
-        caption = re.sub(r"\bpage\s+\d+\b", "", caption)
-
-        caption = re.sub(
-            r"\b\d*[a-zA-Z]+\d+[a-zA-Z]+\d+[a-zA-Z\d]*\b", r" ", caption
-        )  # j2d1a2a...
-
-        caption = re.sub(r"\b\d+\.?\d*[xх×]\d+\.?\d*\b", "", caption)
-
-        caption = re.sub(r"\b\s+\:\s+", r": ", caption)
-        caption = re.sub(r"(\D[,\./])\b", r"\1 ", caption)
-        caption = re.sub(r"\s+", " ", caption)
-
-        caption.strip()
-
-        caption = re.sub(r"^[\"\']([\w\W]+)[\"\']$", r"\1", caption)
-        caption = re.sub(r"^[\'\_,\-\:;]", r"", caption)
-        caption = re.sub(r"[\'\_,\-\:\-\+]$", r"", caption)
-        caption = re.sub(r"^\.\S+$", "", caption)
-
-        return caption.strip()
-
-    # Copied from diffusers.pipelines.stable_diffusion.pipeline_stable_diffusion.StableDiffusionPipeline.prepare_latents
-    def prepare_latents(
-        self,
-        batch_size,
-        num_channels_latents,
-        height,
-        width,
-        dtype,
-        device,
-        generator,
-        latents=None,
-        timestep=None,
-        add_noise=False,
-        image=None,
-    ):
-        shape = (
-            batch_size,
-            num_channels_latents,
-            int(height) // self.vae_scale_factor,
-            int(width) // self.vae_scale_factor,
-        )
-        if isinstance(generator, list) and len(generator) != batch_size:
-            raise ValueError(
-                f"You have passed a list of generators of length {len(generator)}, but requested an effective batch"
-                f" size of {batch_size}. Make sure the batch size matches the length of the generators."
-            )
-
-        if latents is None:
-            latents = randn_tensor(
-                shape, generator=generator, device=device, dtype=dtype
-            )
-        else:
-            latents = latents.to(device)
-            if add_noise and timestep is not None:
-                shape = latents.shape
-                noise = randn_tensor(
-                    shape, generator=generator, device=device, dtype=dtype
-                )
-                # get latents
-                latents = self.scheduler.add_noise(latents, noise, timestep)
-
-        # scale the initial noise by the standard deviation required by the scheduler
-        init_latents = latents * self.scheduler.init_noise_sigma
-
-        if image is not None:
-            if image.shape[1] == 4:
-                init_latents = image
-
-            else:
-                # make sure the VAE is in float32 mode, as it overflows in float16
-                if self.vae.config.force_upcast:
-                    image = image.float()
-                    self.vae.to(dtype=torch.float32)
-
-                if isinstance(generator, list) and len(generator) != batch_size:
-                    raise ValueError(
-                        f"You have passed a list of generators of length {len(generator)}, but requested an effective batch"
-                        f" size of {batch_size}. Make sure the batch size matches the length of the generators."
-                    )
-
-                elif isinstance(generator, list):
-                    init_latents = [
-                        retrieve_latents(
-                            self.vae.encode(image[i : i + 1]), generator=generator[i]
-                        )
-                        for i in range(batch_size)
-                    ]
-                    init_latents = torch.cat(init_latents, dim=0)
-                else:
-                    init_latents = retrieve_latents(
-                        self.vae.encode(image), generator=generator
-                    )
-
-                if self.vae.config.force_upcast:
-                    self.vae.to(dtype)
-
-                init_latents = init_latents.to(dtype)
-                if latents_mean is not None and latents_std is not None:
-                    latents_mean = latents_mean.to(device=device, dtype=dtype)
-                    latents_std = latents_std.to(device=device, dtype=dtype)
-                    init_latents = (
-                        (init_latents - latents_mean)
-                        * self.vae.config.scaling_factor
-                        / latents_std
-                    )
-                else:
-                    init_latents = self.vae.config.scaling_factor * init_latents
-
-            if (
-                batch_size > init_latents.shape[0]
-                and batch_size % init_latents.shape[0] == 0
-            ):
-                # expand init_latents for batch_size
-                additional_image_per_prompt = batch_size // init_latents.shape[0]
-                init_latents = torch.cat(
-                    [init_latents] * additional_image_per_ompt, dim=0
-                )
-            elif (
-                batch_size > init_latents.shape[0]
-                and batch_size % init_latents.shape[0] != 0
-            ):
-                raise ValueError(
-                    f"Cannot duplicate `image` of batch size {init_latents.shape[0]} to {batch_size} text prompts."
-                )
-            else:
-                init_latents = torch.cat([init_latents], dim=0)
-
-        return init_latents
-
-    @property
-    def denoising_start(self):
-        return self._denoising_start
-
-    @property
-    def denoising_end(self):
-        return self._denoising_end
-
-    @property
-    def num_timesteps(self):
-        return self._num_timesteps
-
-    @torch.no_grad()
-    @replace_example_docstring(EXAMPLE_DOC_STRING)
-    def __call__(
-        self,
-        prompt: Union[str, List[str]] = None,
-        negative_prompt: str = "",
-        strength: float = None,
-        num_inference_steps: int = 20,
-        timesteps: List[int] = None,
-        sigmas: List[float] = None,
-        denoising_start: Optional[float] = None,
-        denoising_end: Optional[float] = None,
-        guidance_scale: float = 4.5,
-        num_images_per_prompt: Optional[int] = 1,
-        height: Optional[int] = None,
-        width: Optional[int] = None,
-        eta: float = 0.0,
-        generator: Optional[Union[torch.Generator, List[torch.Generator]]] = None,
-        image: Optional[PipelineImageInput] = None,
-        latents: Optional[torch.Tensor] = None,
-        prompt_embeds: Optional[torch.Tensor] = None,
-        prompt_attention_mask: Optional[torch.Tensor] = None,
-        negative_prompt_embeds: Optional[torch.Tensor] = None,
-        negative_prompt_attention_mask: Optional[torch.Tensor] = None,
-        output_type: Optional[str] = "pil",
-        return_dict: bool = True,
-        callback: Optional[Callable[[int, int, torch.Tensor], None]] = None,
-        callback_steps: int = 1,
-        clean_caption: bool = True,
-        use_resolution_binning: bool = True,
-        max_sequence_length: int = 300,
-        **kwargs,
-    ) -> Union[ImagePipelineOutput, Tuple]:
-        """
-        Function invoked when calling the pipeline for generation.
-
-        Args:
-            prompt (`str` or `List[str]`, *optional*):
-                The prompt or prompts to guide the image generation. If not defined, one has to pass `prompt_embeds`.
-                instead.
-            negative_prompt (`str` or `List[str]`, *optional*):
-                The prompt or prompts not to guide the image generation. If not defined, one has to pass
-                `negative_prompt_embeds` instead. Ignored when not using guidance (i.e., ignored if `guidance_scale` is
-                less than `1`).
-            strength (`float`, *optional*, defaults to 0.3):
-                Conceptually, indicates how much to transform the reference `image`. Must be between 0 and 1. `image`
-                will be used as a starting point, adding more noise to it the larger the `strength`. The number of
-                denoising steps depends on the amount of noise initially added. When `strength` is 1, added noise will
-                be maximum and the denoising process will run for the full number of iterations specified in
-                `num_inference_steps`. A value of 1, therefore, essentially ignores `image`. Note that in the case of
-                `denoising_start` being declared as an integer, the value of `strength` will be ignored.
-            num_inference_steps (`int`, *optional*, defaults to 100):
-                The number of denoising steps. More denoising steps usually lead to a higher quality image at the
-                expense of slower inference.
-            denoising_start (`float`, *optional*):
-                When specified, indicates the fraction (between 0.0 and 1.0) of the total denoising process to be
-                bypassed before it is initiated. Consequently, the initial part of the denoising process is skipped and
-                it is assumed that the passed `image` is a partly denoised image. Note that when this is specified,
-                strength will be ignored. The `denoising_start` parameter is particularly beneficial when this pipeline
-                is integrated into a "Mixture of Denoisers" multi-pipeline setup, as detailed in [**Refine Image
-                Quality**](https://huggingface.co/docs/diffusers/using-diffusers/sdxl#refine-image-quality).
-            denoising_end (`float`, *optional*):
-                When specified, determines the fraction (between 0.0 and 1.0) of the total denoising process to be
-                completed before it is intentionally prematurely terminated. As a result, the returned sample will
-                still retain a substantial amount of noise as determined by the discrete timesteps selected by the
-                scheduler. The denoising_end parameter should ideally be utilized when this pipeline forms a part of a
-                "Mixture of Denoisers" multi-pipeline setup, as elaborated in [**Refining the Image
-                Output**](https://huggingface.co/docs/diffusers/api/pipelines/stable_diffusion/stable_diffusion_xl#refining-the-image-output)
-            timesteps (`List[int]`, *optional*):
-                Custom timesteps to use for the denoising process with schedulers which support a `timesteps` argument
-                in their `set_timesteps` method. If not defined, the default behavior when `num_inference_steps` is
-                passed will be used. Must be in descending order.
-            sigmas (`List[float]`, *optional*):
-                Custom sigmas to use for the denoising process with schedulers which support a `sigmas` argument in
-                their `set_timesteps` method. If not defined, the default behavior when `num_inference_steps` is passed
-                will be used.
-            guidance_scale (`float`, *optional*, defaults to 4.5):
-                Guidance scale as defined in [Classifier-Free Diffusion Guidance](https://arxiv.org/abs/2207.12598).
-                `guidance_scale` is defined as `w` of equation 2. of [Imagen
-                Paper](https://arxiv.org/pdf/2205.11487.pdf). Guidance scale is enabled by setting `guidance_scale >
-                1`. Higher guidance scale encourages to generate images that are closely linked to the text `prompt`,
-                usually at the expense of lower image quality.
-            num_images_per_prompt (`int`, *optional*, defaults to 1):
-                The number of images to generate per prompt.
-            height (`int`, *optional*, defaults to self.unet.config.sample_size):
-                The height in pixels of the generated image.
-            width (`int`, *optional*, defaults to self.unet.config.sample_size):
-                The width in pixels of the generated image.
-            eta (`float`, *optional*, defaults to 0.0):
-                Corresponds to parameter eta (η) in the DDIM paper: https://arxiv.org/abs/2010.02502. Only applies to
-                [`schedulers.DDIMScheduler`], will be ignored for others.
-            generator (`torch.Generator` or `List[torch.Generator]`, *optional*):
-                One or a list of [torch generator(s)](https://pytorch.org/docs/stable/generated/torch.Generator.html)
-                to make generation deterministic.
-            latents (`torch.Tensor`, *optional*):
-                Pre-generated noisy latents, sampled from a Gaussian distribution, to be used as inputs for image
-                generation. Can be used to tweak the same generation with different prompts. If not provided, a latents
-                tensor will ge generated by sampling using the supplied random `generator`.
-            prompt_embeds (`torch.Tensor`, *optional*):
-                Pre-generated text embeddings. Can be used to easily tweak text inputs, *e.g.* prompt weighting. If not
-                provided, text embeddings will be generated from `prompt` input argument.
-            prompt_attention_mask (`torch.Tensor`, *optional*): Pre-generated attention mask for text embeddings.
-            negative_prompt_embeds (`torch.Tensor`, *optional*):
-                Pre-generated negative text embeddings. For PixArt-Sigma this negative prompt should be "". If not
-                provided, negative_prompt_embeds will be generated from `negative_prompt` input argument.
-            negative_prompt_attention_mask (`torch.Tensor`, *optional*):
-                Pre-generated attention mask for negative text embeddings.
-            output_type (`str`, *optional*, defaults to `"pil"`):
-                The output format of the generate image. Choose between
-                [PIL](https://pillow.readthedocs.io/en/stable/): `PIL.Image.Image` or `np.array`.
-            return_dict (`bool`, *optional*, defaults to `True`):
-                Whether or not to return a [`~pipelines.stable_diffusion.IFPipelineOutput`] instead of a plain tuple.
-            callback (`Callable`, *optional*):
-                A function that will be called every `callback_steps` steps during inference. The function will be
-                called with the following arguments: `callback(step: int, timestep: int, latents: torch.Tensor)`.
-            callback_steps (`int`, *optional*, defaults to 1):
-                The frequency at which the `callback` function will be called. If not specified, the callback will be
-                called at every step.
-            clean_caption (`bool`, *optional*, defaults to `True`):
-                Whether or not to clean the caption before creating embeddings. Requires `beautifulsoup4` and `ftfy` to
-                be installed. If the dependencies are not installed, the embeddings will be created from the raw
-                prompt.
-            use_resolution_binning (`bool` defaults to `True`):
-                If set to `True`, the requested height and width are first mapped to the closest resolutions using
-                `ASPECT_RATIO_1024_BIN`. After the produced latents are decoded into images, they are resized back to
-                the requested resolution. Useful for generating non-square images.
-            max_sequence_length (`int` defaults to 300): Maximum sequence length to use with the `prompt`.
-
-        Examples:
-
-        Returns:
-            [`~pipelines.ImagePipelineOutput`] or `tuple`:
-                If `return_dict` is `True`, [`~pipelines.ImagePipelineOutput`] is returned, otherwise a `tuple` is
-                returned where the first element is a list with the generated images
-        """
-        # 1. Check inputs. Raise error if not correct
-        height = height or self.transformer.config.sample_size * self.vae_scale_factor
-        width = width or self.transformer.config.sample_size * self.vae_scale_factor
-        if use_resolution_binning:
-            if self.transformer.config.sample_size == 256:
-                aspect_ratio_bin = ASPECT_RATIO_2048_BIN
-            elif self.transformer.config.sample_size == 128:
-                aspect_ratio_bin = ASPECT_RATIO_1024_BIN
-            elif self.transformer.config.sample_size == 64:
-                aspect_ratio_bin = ASPECT_RATIO_512_BIN
-            elif self.transformer.config.sample_size == 32:
-                aspect_ratio_bin = ASPECT_RATIO_256_BIN
-            else:
-                raise ValueError("Invalid sample size")
-            orig_height, orig_width = height, width
-            height, width = self.image_processor.classify_height_width_bin(
-                height, width, ratios=aspect_ratio_bin
-            )
-
-        self.check_inputs(
-            prompt,
-            height,
-            width,
-            strength,
-            num_inference_steps,
-            negative_prompt,
-            callback_steps,
-            prompt_embeds,
-            negative_prompt_embeds,
-            prompt_attention_mask,
-            negative_prompt_attention_mask,
-        )
-
-        # 2. Default height and width to transformer
-        if prompt is not None and isinstance(prompt, str):
-            batch_size = 1
-        elif prompt is not None and isinstance(prompt, list):
-            batch_size = len(prompt)
-        else:
-            batch_size = prompt_embeds.shape[0]
-
-        device = self._execution_device
-        self._denoising_start = denoising_start
-        self._num_timesteps = num_inference_steps
-        self._denoising_end = denoising_end
-
-        # here `guidance_scale` is defined analog to the guidance weight `w` of equation (2)
-        # of the Imagen paper: https://arxiv.org/pdf/2205.11487.pdf . `guidance_scale = 1`
-        # corresponds to doing no classifier free guidance.
-        do_classifier_free_guidance = guidance_scale > 1.0
-
-        # 3. Encode input prompt
-        (
-            prompt_embeds,
-            prompt_attention_mask,
-            negative_prompt_embeds,
-            negative_prompt_attention_mask,
-        ) = self.encode_prompt(
-            prompt,
-            do_classifier_free_guidance,
-            negative_prompt=negative_prompt,
-            num_images_per_prompt=num_images_per_prompt,
-            device=device,
-            prompt_embeds=prompt_embeds,
-            negative_prompt_embeds=negative_prompt_embeds,
-            prompt_attention_mask=prompt_attention_mask,
-            negative_prompt_attention_mask=negative_prompt_attention_mask,
-            clean_caption=clean_caption,
-            max_sequence_length=max_sequence_length,
-        )
-        if do_classifier_free_guidance:
-            prompt_embeds = torch.cat([negative_prompt_embeds, prompt_embeds], dim=0)
-            prompt_attention_mask = torch.cat(
-                [negative_prompt_attention_mask, prompt_attention_mask], dim=0
-            )
-
-        # 4. Prepare timesteps
-        def denoising_value_valid(dnv):
-            return isinstance(dnv, float) and 0 < dnv < 1
-
-        timesteps, num_inference_steps = retrieve_timesteps(
-            self.scheduler, num_inference_steps, device, timesteps, sigmas
-        )
-
-        # 5. Prepare latents.
-        if image is not None:
-            image = self.image_processor.preprocess(image)
-            image = image.to(device=device, dtype=dtype)
-
-        latent_channels = self.transformer.config.in_channels
-        latent_timestep = None
-        if denoising_end is not None or denoising_start is not None:
-            timesteps, num_inference_steps = self.get_timesteps(
-                num_inference_steps,
-                strength,
-                device,
-                denoising_start=(
-                    self.denoising_start
-                    if denoising_value_valid(self.denoising_start)
-                    else None
-                ),
-            )
-            latent_timestep = timesteps[:1].repeat(batch_size * num_images_per_prompt)
-            if latents is not None:
-                height, width = latents.shape[-2:]
-                height = height * self.vae_scale_factor
-                width = width * self.vae_scale_factor
-        add_noise = True if self.denoising_start is None else False
-        if latents is None:
-            latents = self.prepare_latents(
-                batch_size * num_images_per_prompt,
-                latent_channels,
-                height,
-                width,
-                prompt_embeds.dtype,
-                device,
-                generator,
-                latents,
-                timestep=latent_timestep,
-                add_noise=add_noise,
-                image=image,
-            )
-
-        # 6. Prepare extra step kwargs. TODO: Logic should ideally just be moved out of the pipeline
-        extra_step_kwargs = self.prepare_extra_step_kwargs(generator, eta)
-
-        # 6.1 Prepare micro-conditions.
-        added_cond_kwargs = {"resolution": None, "aspect_ratio": None}
-
-        # 7. Denoising loop
-        num_warmup_steps = max(
-            len(timesteps) - num_inference_steps * self.scheduler.order, 0
-        )
-        if (
-            self.denoising_end is not None
-            and self.denoising_start is not None
-            and denoising_value_valid(self.denoising_end)
-            and denoising_value_valid(self.denoising_start)
-            and self.denoising_start >= self.denoising_end
-        ):
-            raise ValueError(
-                f"`denoising_start`: {self.denoising_start} cannot be larger than or equal to `denoising_end`: "
-                + f" {self.denoising_end} when using type float."
-            )
-        if self.denoising_start is not None:
-            if denoising_value_valid(self.denoising_start):
-                discrete_timestep_cutoff = int(
-                    round(
-                        self.scheduler.config.num_train_timesteps
-                        - (denoising_start * self.scheduler.config.num_train_timesteps)
-                    )
-                )
-
-                num_inference_steps = (
-                    (timesteps < discrete_timestep_cutoff).sum().item()
-                )
-                print(
-                    f"Beginning inference for stage2 with {num_inference_steps} steps."
-                )
-
-            else:
-                raise ValueError(
-                    f"`denoising_start` must be a float between 0 and 1: {denoising_start}"
-                )
-        if self.denoising_end is not None:
-            if denoising_value_valid(self.denoising_end):
-                discrete_timestep_cutoff = int(
-                    round(
-                        self.scheduler.config.num_train_timesteps
-                        - (
-                            self.denoising_end
-                            * self.scheduler.config.num_train_timesteps
-                        )
-                    )
-                )
-                num_inference_steps = len(
-                    list(filter(lambda ts: ts >= discrete_timestep_cutoff, timesteps))
-                )
-                print(
-                    f"Beginning inference for stage1 with {num_inference_steps} steps."
-                )
-                timesteps = timesteps[:num_inference_steps]
-            else:
-                raise ValueError(
-                    f"`denoising_end` must be a float between 0 and 1: {denoising_end}"
-                )
-        with self.progress_bar(total=num_inference_steps) as progress_bar:
-            for i, t in enumerate(timesteps):
-                latent_model_input = (
-                    torch.cat([latents] * 2) if do_classifier_free_guidance else latents
-                )
-                latent_model_input = self.scheduler.scale_model_input(
-                    latent_model_input, t
-                )
-
-                current_timestep = t
-                if not torch.is_tensor(current_timestep):
-                    # TODO: this requires sync between CPU and GPU. So try to pass timesteps as tensors if you can
-                    # This would be a good case for the `match` statement (Python 3.10+)
-                    is_mps = latent_model_input.device.type == "mps"
-                    if isinstance(current_timestep, float):
-                        dtype = torch.float32 if is_mps else torch.float64
-                    else:
-                        dtype = torch.int32 if is_mps else torch.int64
-                    current_timestep = torch.tensor(
-                        [current_timestep],
-                        dtype=dtype,
-                        device=latent_model_input.device,
-                    )
-                elif len(current_timestep.shape) == 0:
-                    current_timestep = current_timestep[None].to(
-                        latent_model_input.device
-                    )
-                # broadcast to batch dimension in a way that's compatible with ONNX/Core ML
-                current_timestep = current_timestep.expand(latent_model_input.shape[0])
-
-                # predict noise model_output
-                noise_pred = self.transformer(
-                    latent_model_input.to(
-                        device=self.transformer.device, dtype=self.transformer.dtype
-                    ),
-                    encoder_hidden_states=prompt_embeds,
-                    encoder_attention_mask=prompt_attention_mask,
-                    timestep=current_timestep,
-                    added_cond_kwargs=added_cond_kwargs,
-                    return_dict=False,
-                )[0]
-
-                # perform guidance
-                if do_classifier_free_guidance:
-                    noise_pred_uncond, noise_pred_text = noise_pred.chunk(2)
-                    noise_pred = noise_pred_uncond + guidance_scale * (
-                        noise_pred_text - noise_pred_uncond
-                    )
-
-                # learned sigma
-                if self.transformer.config.out_channels // 2 == latent_channels:
-                    noise_pred = noise_pred.chunk(2, dim=1)[0]
-                else:
-                    noise_pred = noise_pred
-
-                # compute previous image: x_t -> x_t-1
-                latents = self.scheduler.step(
-                    noise_pred, t, latents, **extra_step_kwargs, return_dict=False
-                )[0]
-
-                # call the callback, if provided
-                if i == len(timesteps) - 1 or (
-                    (i + 1) > num_warmup_steps and (i + 1) % self.scheduler.order == 0
-                ):
-                    progress_bar.update()
-                    if callback is not None and i % callback_steps == 0:
-                        step_idx = i // getattr(self.scheduler, "order", 1)
-                        callback(step_idx, t, latents)
-
-        if not output_type == "latent":
-            image = self.vae.decode(
-                latents.to(device=self.vae.device, dtype=self.vae.dtype)
-                / self.vae.config.scaling_factor,
-                return_dict=False,
-            )[0]
-            if use_resolution_binning:
-                image = self.image_processor.resize_and_crop_tensor(
-                    image, orig_width, orig_height
-                )
-        else:
-            image = latents
-
-        if not output_type == "latent":
-            image = self.image_processor.postprocess(image, output_type=output_type)
-
-        # Offload all models
-        self.maybe_free_model_hooks()
-
-        if not return_dict:
-            return (image,)
-
-        return ImagePipelineOutput(images=image)
diff --git a/videotuna/third_party/flux/models/sd3/expanded.py b/videotuna/third_party/flux/models/sd3/expanded.py
deleted file mode 100644
index e0ea5d42..00000000
--- a/videotuna/third_party/flux/models/sd3/expanded.py
+++ /dev/null
@@ -1,737 +0,0 @@
-import argparse
-import gc
-import operator
-import os
-from typing import Any, Dict, Optional, Union
-
-import torch
-import torch.nn as nn
-import torch.nn.functional as F
-from diffusers.configuration_utils import ConfigMixin, register_to_config
-from diffusers.loaders import FromOriginalModelMixin, PeftAdapterMixin
-from diffusers.models.attention import FeedForward, _chunked_feed_forward
-from diffusers.models.attention_processor import (
-    Attention,
-    AttentionProcessor,
-    JointAttnProcessor2_0,
-)
-from diffusers.models.embeddings import CombinedTimestepTextProjEmbeddings, PatchEmbed
-from diffusers.models.modeling_utils import ModelMixin
-from diffusers.models.normalization import AdaLayerNormContinuous, AdaLayerNormZero
-from diffusers.models.transformers.transformer_2d import Transformer2DModelOutput
-from diffusers.utils import (
-    USE_PEFT_BACKEND,
-    is_torch_version,
-    logging,
-    scale_lora_layers,
-    unscale_lora_layers,
-)
-from diffusers.utils.torch_utils import maybe_allow_in_graph
-
-ORIG_DEPTH = 24
-FINAL_DEPTH = 36
-M_VALUE = 6
-
-logger = logging.get_logger(__name__)  # pylint: disable=invalid-name
-
-
-@maybe_allow_in_graph
-class JointTransformerBlock(nn.Module):
-    r"""
-    A Transformer block following the MMDiT architecture, introduced in Stable Diffusion 3.
-
-    Reference: https://arxiv.org/abs/2403.03206
-
-    Parameters:
-        dim (`int`): The number of channels in the input and output.
-        num_attention_heads (`int`): The number of heads to use for multi-head attention.
-        attention_head_dim (`int`): The number of channels in each head.
-        context_pre_only (`bool`): Boolean to determine if we should add some blocks associated with the
-            processing of `context` conditions.
-    """
-
-    def __init__(
-        self,
-        dim,
-        num_attention_heads,
-        attention_head_dim,
-        context_pre_only=False,
-        qk_norm="layer_norm",
-    ):
-        super().__init__()
-
-        self.context_pre_only = context_pre_only
-        context_norm_type = (
-            "ada_norm_continous" if context_pre_only else "ada_norm_zero"
-        )
-
-        self.norm1 = AdaLayerNormZero(dim)
-
-        if context_norm_type == "ada_norm_continous":
-            self.norm1_context = AdaLayerNormContinuous(
-                dim,
-                dim,
-                elementwise_affine=False,
-                eps=1e-6,
-                bias=True,
-                norm_type="layer_norm",
-            )
-        elif context_norm_type == "ada_norm_zero":
-            self.norm1_context = AdaLayerNormZero(dim)
-        else:
-            raise ValueError(
-                f"Unknown context_norm_type: {context_norm_type}, currently only support `ada_norm_continous`, `ada_norm_zero`"
-            )
-        if hasattr(F, "scaled_dot_product_attention"):
-            processor = JointAttnProcessor2_0()
-        else:
-            raise ValueError(
-                "The current PyTorch version does not support the `scaled_dot_product_attention` function."
-            )
-        self.attn = Attention(
-            query_dim=dim,
-            cross_attention_dim=None,
-            added_kv_proj_dim=dim,
-            qk_norm=qk_norm,
-            dim_head=attention_head_dim // num_attention_heads,
-            heads=num_attention_heads,
-            out_dim=attention_head_dim,
-            context_pre_only=context_pre_only,
-            bias=True,
-            processor=processor,
-        )
-
-        self.norm2 = nn.LayerNorm(dim, elementwise_affine=False, eps=1e-6)
-        self.ff = FeedForward(dim=dim, dim_out=dim, activation_fn="gelu-approximate")
-
-        if not context_pre_only:
-            self.norm2_context = nn.LayerNorm(dim, elementwise_affine=False, eps=1e-6)
-            self.ff_context = FeedForward(
-                dim=dim, dim_out=dim, activation_fn="gelu-approximate"
-            )
-        else:
-            self.norm2_context = None
-            self.ff_context = None
-
-        # let chunk size default to None
-        self._chunk_size = None
-        self._chunk_dim = 0
-
-    # Copied from diffusers.models.attention.BasicTransformerBlock.set_chunk_feed_forward
-    def set_chunk_feed_forward(self, chunk_size: Optional[int], dim: int = 0):
-        # Sets chunk feed-forward
-        self._chunk_size = chunk_size
-        self._chunk_dim = dim
-
-    def forward(
-        self,
-        hidden_states: torch.FloatTensor,
-        encoder_hidden_states: torch.FloatTensor,
-        temb: torch.FloatTensor,
-    ):
-        norm_hidden_states, gate_msa, shift_mlp, scale_mlp, gate_mlp = self.norm1(
-            hidden_states, emb=temb
-        )
-
-        if self.context_pre_only:
-            norm_encoder_hidden_states = self.norm1_context(encoder_hidden_states, temb)
-        else:
-            (
-                norm_encoder_hidden_states,
-                c_gate_msa,
-                c_shift_mlp,
-                c_scale_mlp,
-                c_gate_mlp,
-            ) = self.norm1_context(encoder_hidden_states, emb=temb)
-
-        # Attention.
-        attn_output, context_attn_output = self.attn(
-            hidden_states=norm_hidden_states,
-            encoder_hidden_states=norm_encoder_hidden_states,
-        )
-
-        # Process attention outputs for the `hidden_states`.
-        attn_output = gate_msa.unsqueeze(1) * attn_output
-        hidden_states = hidden_states + attn_output
-
-        norm_hidden_states = self.norm2(hidden_states)
-        norm_hidden_states = (
-            norm_hidden_states * (1 + scale_mlp[:, None]) + shift_mlp[:, None]
-        )
-        if self._chunk_size is not None:
-            # "feed_forward_chunk_size" can be used to save memory
-            ff_output = _chunked_feed_forward(
-                self.ff, norm_hidden_states, self._chunk_dim, self._chunk_size
-            )
-        else:
-            ff_output = self.ff(norm_hidden_states)
-        ff_output = gate_mlp.unsqueeze(1) * ff_output
-
-        hidden_states = hidden_states + ff_output
-
-        # Process attention outputs for the `encoder_hidden_states`.
-        if self.context_pre_only:
-            encoder_hidden_states = None
-        else:
-            context_attn_output = c_gate_msa.unsqueeze(1) * context_attn_output
-            encoder_hidden_states = encoder_hidden_states + context_attn_output
-
-            norm_encoder_hidden_states = self.norm2_context(encoder_hidden_states)
-            norm_encoder_hidden_states = (
-                norm_encoder_hidden_states * (1 + c_scale_mlp[:, None])
-                + c_shift_mlp[:, None]
-            )
-            if self._chunk_size is not None:
-                # "feed_forward_chunk_size" can be used to save memory
-                context_ff_output = _chunked_feed_forward(
-                    self.ff_context,
-                    norm_encoder_hidden_states,
-                    self._chunk_dim,
-                    self._chunk_size,
-                )
-            else:
-                context_ff_output = self.ff_context(norm_encoder_hidden_states)
-            encoder_hidden_states = (
-                encoder_hidden_states + c_gate_mlp.unsqueeze(1) * context_ff_output
-            )
-
-        return encoder_hidden_states, hidden_states
-
-
-class SD3TransformerQKNorm2DModel(
-    ModelMixin, ConfigMixin, PeftAdapterMixin, FromOriginalModelMixin
-):
-    """
-    The Transformer model introduced in Stable Diffusion 3.
-
-    Reference: https://arxiv.org/abs/2403.03206
-
-    Parameters:
-        sample_size (`int`): The width of the latent images. This is fixed during training since
-            it is used to learn a number of position embeddings.
-        patch_size (`int`): Patch size to turn the input data into small patches.
-        in_channels (`int`, *optional*, defaults to 16): The number of channels in the input.
-        num_layers (`int`, *optional*, defaults to 18): The number of layers of Transformer blocks to use.
-        attention_head_dim (`int`, *optional*, defaults to 64): The number of channels in each head.
-        num_attention_heads (`int`, *optional*, defaults to 18): The number of heads to use for multi-head attention.
-        cross_attention_dim (`int`, *optional*): The number of `encoder_hidden_states` dimensions to use.
-        caption_projection_dim (`int`): Number of dimensions to use when projecting the `encoder_hidden_states`.
-        pooled_projection_dim (`int`): Number of dimensions to use when projecting the `pooled_projections`.
-        out_channels (`int`, defaults to 16): Number of output channels.
-        qk_norm (`str`, defaults to "layer_norm"): The type of qk_norm to use.
-
-        TODO The SD3 paper uses RMSNorm instead of LayerNorm but it is unlikely
-             that there is much difference betweens RMSNorm being faster.
-    """
-
-    _supports_gradient_checkpointing = True
-
-    @register_to_config
-    def __init__(
-        self,
-        sample_size: int = 128,
-        patch_size: int = 2,
-        in_channels: int = 16,
-        num_layers: int = 18,
-        attention_head_dim: int = 64,
-        num_attention_heads: int = 18,
-        joint_attention_dim: int = 4096,
-        caption_projection_dim: int = 1152,
-        pooled_projection_dim: int = 2048,
-        out_channels: int = 16,
-        pos_embed_max_size: int = 96,
-        qk_norm: str | None = "layer_norm",
-    ):
-        super().__init__()
-        default_out_channels = in_channels
-        self.out_channels = (
-            out_channels if out_channels is not None else default_out_channels
-        )
-        self.inner_dim = (
-            self.config.num_attention_heads * self.config.attention_head_dim
-        )
-
-        self.pos_embed = PatchEmbed(
-            height=self.config.sample_size,
-            width=self.config.sample_size,
-            patch_size=self.config.patch_size,
-            in_channels=self.config.in_channels,
-            embed_dim=self.inner_dim,
-            pos_embed_max_size=pos_embed_max_size,  # hard-code for now.
-        )
-        self.time_text_embed = CombinedTimestepTextProjEmbeddings(
-            embedding_dim=self.inner_dim,
-            pooled_projection_dim=self.config.pooled_projection_dim,
-        )
-        self.context_embedder = nn.Linear(
-            self.config.joint_attention_dim, self.config.caption_projection_dim
-        )
-
-        # `attention_head_dim` is doubled to account for the mixing.
-        # It needs to crafted when we get the actual checkpoints.
-        self.transformer_blocks = nn.ModuleList(
-            [
-                JointTransformerBlock(
-                    dim=self.inner_dim,
-                    num_attention_heads=self.config.num_attention_heads,
-                    attention_head_dim=self.inner_dim,
-                    context_pre_only=i == num_layers - 1,
-                    qk_norm=qk_norm,
-                )
-                for i in range(self.config.num_layers)
-            ]
-        )
-
-        self.norm_out = AdaLayerNormContinuous(
-            self.inner_dim, self.inner_dim, elementwise_affine=False, eps=1e-6
-        )
-        self.proj_out = nn.Linear(
-            self.inner_dim, patch_size * patch_size * self.out_channels, bias=True
-        )
-
-        self.gradient_checkpointing = False
-
-    # Copied from diffusers.models.unets.unet_3d_condition.UNet3DConditionModel.enable_forward_chunking
-    def enable_forward_chunking(
-        self, chunk_size: Optional[int] = None, dim: int = 0
-    ) -> None:
-        """
-        Sets the attention processor to use [feed forward
-        chunking](https://huggingface.co/blog/reformer#2-chunked-feed-forward-layers).
-
-        Parameters:
-            chunk_size (`int`, *optional*):
-                The chunk size of the feed-forward layers. If not specified, will run feed-forward layer individually
-                over each tensor of dim=`dim`.
-            dim (`int`, *optional*, defaults to `0`):
-                The dimension over which the feed-forward computation should be chunked. Choose between dim=0 (batch)
-                or dim=1 (sequence length).
-        """
-        if dim not in [0, 1]:
-            raise ValueError(f"Make sure to set `dim` to either 0 or 1, not {dim}")
-
-        # By default chunk size is 1
-        chunk_size = chunk_size or 1
-
-        def fn_recursive_feed_forward(
-            module: torch.nn.Module, chunk_size: int, dim: int
-        ):
-            if hasattr(module, "set_chunk_feed_forward"):
-                module.set_chunk_feed_forward(chunk_size=chunk_size, dim=dim)
-
-            for child in module.children():
-                fn_recursive_feed_forward(child, chunk_size, dim)
-
-        for module in self.children():
-            fn_recursive_feed_forward(module, chunk_size, dim)
-
-    @property
-    # Copied from diffusers.models.unets.unet_2d_condition.UNet2DConditionModel.attn_processors
-    def attn_processors(self) -> Dict[str, AttentionProcessor]:
-        r"""
-        Returns:
-            `dict` of attention processors: A dictionary containing all attention processors used in the model with
-            indexed by its weight name.
-        """
-        # set recursively
-        processors = {}
-
-        def fn_recursive_add_processors(
-            name: str,
-            module: torch.nn.Module,
-            processors: Dict[str, AttentionProcessor],
-        ):
-            if hasattr(module, "get_processor"):
-                processors[f"{name}.processor"] = module.get_processor(
-                    return_deprecated_lora=True
-                )
-
-            for sub_name, child in module.named_children():
-                fn_recursive_add_processors(f"{name}.{sub_name}", child, processors)
-
-            return processors
-
-        for name, module in self.named_children():
-            fn_recursive_add_processors(name, module, processors)
-
-        return processors
-
-    # Copied from diffusers.models.unets.unet_2d_condition.UNet2DConditionModel.set_attn_processor
-    def set_attn_processor(
-        self, processor: Union[AttentionProcessor, Dict[str, AttentionProcessor]]
-    ):
-        r"""
-        Sets the attention processor to use to compute attention.
-
-        Parameters:
-            processor (`dict` of `AttentionProcessor` or only `AttentionProcessor`):
-                The instantiated processor class or a dictionary of processor classes that will be set as the processor
-                for **all** `Attention` layers.
-
-                If `processor` is a dict, the key needs to define the path to the corresponding cross attention
-                processor. This is strongly recommended when setting trainable attention processors.
-
-        """
-        count = len(self.attn_processors.keys())
-
-        if isinstance(processor, dict) and len(processor) != count:
-            raise ValueError(
-                f"A dict of processors was passed, but the number of processors {len(processor)} does not match the"
-                f" number of attention layers: {count}. Please make sure to pass {count} processor classes."
-            )
-
-        def fn_recursive_attn_processor(name: str, module: torch.nn.Module, processor):
-            if hasattr(module, "set_processor"):
-                if not isinstance(processor, dict):
-                    module.set_processor(processor)
-                else:
-                    module.set_processor(processor.pop(f"{name}.processor"))
-
-            for sub_name, child in module.named_children():
-                fn_recursive_attn_processor(f"{name}.{sub_name}", child, processor)
-
-        for name, module in self.named_children():
-            fn_recursive_attn_processor(name, module, processor)
-
-    # Copied from diffusers.models.unets.unet_2d_condition.UNet2DConditionModel.fuse_qkv_projections
-    def fuse_qkv_projections(self):
-        """
-        Enables fused QKV projections. For self-attention modules, all projection matrices (i.e., query, key, value)
-        are fused. For cross-attention modules, key and value projection matrices are fused.
-
-        <Tip warning={true}>
-
-        This API is 🧪 experimental.
-
-        </Tip>
-        """
-        self.original_attn_processors = None
-
-        for _, attn_processor in self.attn_processors.items():
-            if "Added" in str(attn_processor.__class__.__name__):
-                raise ValueError(
-                    "`fuse_qkv_projections()` is not supported for models having added KV projections."
-                )
-
-        self.original_attn_processors = self.attn_processors
-
-        for module in self.modules():
-            if isinstance(module, Attention):
-                module.fuse_projections(fuse=True)
-
-    # Copied from diffusers.models.unets.unet_2d_condition.UNet2DConditionModel.unfuse_qkv_projections
-    def unfuse_qkv_projections(self):
-        """Disables the fused QKV projection if enabled.
-
-        <Tip warning={true}>
-
-        This API is 🧪 experimental.
-
-        </Tip>
-
-        """
-        if self.original_attn_processors is not None:
-            self.set_attn_processor(self.original_attn_processors)
-
-    def _set_gradient_checkpointing(self, module, value=False):
-        if hasattr(module, "gradient_checkpointing"):
-            module.gradient_checkpointing = value
-
-    def forward(
-        self,
-        hidden_states: torch.FloatTensor,
-        encoder_hidden_states: torch.FloatTensor = None,
-        pooled_projections: torch.FloatTensor = None,
-        timestep: torch.LongTensor = None,
-        joint_attention_kwargs: Optional[Dict[str, Any]] = None,
-        return_dict: bool = True,
-    ) -> Union[torch.FloatTensor, Transformer2DModelOutput]:
-        """
-        The [`SD3Transformer2DModel`] forward method.
-
-        Args:
-            hidden_states (`torch.FloatTensor` of shape `(batch size, channel, height, width)`):
-                Input `hidden_states`.
-            encoder_hidden_states (`torch.FloatTensor` of shape `(batch size, sequence_len, embed_dims)`):
-                Conditional embeddings (embeddings computed from the input conditions such as prompts) to use.
-            pooled_projections (`torch.FloatTensor` of shape `(batch_size, projection_dim)`): Embeddings projected
-                from the embeddings of input conditions.
-            timestep ( `torch.LongTensor`):
-                Used to indicate denoising step.
-            joint_attention_kwargs (`dict`, *optional*):
-                A kwargs dictionary that if specified is passed along to the `AttentionProcessor` as defined under
-                `self.processor` in
-                [diffusers.models.attention_processor](https://github.com/huggingface/diffusers/blob/main/src/diffusers/models/attention_processor.py).
-            return_dict (`bool`, *optional*, defaults to `True`):
-                Whether or not to return a [`~models.transformer_2d.Transformer2DModelOutput`] instead of a plain
-                tuple.
-
-        Returns:
-            If `return_dict` is True, an [`~models.transformer_2d.Transformer2DModelOutput`] is returned, otherwise a
-            `tuple` where the first element is the sample tensor.
-        """
-        if joint_attention_kwargs is not None:
-            joint_attention_kwargs = joint_attention_kwargs.copy()
-            lora_scale = joint_attention_kwargs.pop("scale", 1.0)
-        else:
-            lora_scale = 1.0
-
-        if USE_PEFT_BACKEND:
-            # weight the lora layers by setting `lora_scale` for each PEFT layer
-            scale_lora_layers(self, lora_scale)
-        else:
-            if (
-                joint_attention_kwargs is not None
-                and joint_attention_kwargs.get("scale", None) is not None
-            ):
-                logger.warning(
-                    "Passing `scale` via `joint_attention_kwargs` when not using the PEFT backend is ineffective."
-                )
-
-        height, width = hidden_states.shape[-2:]
-
-        hidden_states = self.pos_embed(
-            hidden_states
-        )  # takes care of adding positional embeddings too.
-        temb = self.time_text_embed(timestep, pooled_projections)
-        encoder_hidden_states = self.context_embedder(encoder_hidden_states)
-
-        for block in self.transformer_blocks:
-            if self.training and self.gradient_checkpointing:
-
-                def create_custom_forward(module, return_dict=None):
-                    def custom_forward(*inputs):
-                        if return_dict is not None:
-                            return module(*inputs, return_dict=return_dict)
-                        else:
-                            return module(*inputs)
-
-                    return custom_forward
-
-                ckpt_kwargs: Dict[str, Any] = (
-                    {"use_reentrant": False} if is_torch_version(">=", "1.11.0") else {}
-                )
-                hidden_states = torch.utils.checkpoint.checkpoint(
-                    create_custom_forward(block),
-                    hidden_states,
-                    encoder_hidden_states,
-                    temb,
-                    **ckpt_kwargs,
-                )
-
-            else:
-                encoder_hidden_states, hidden_states = block(
-                    hidden_states=hidden_states,
-                    encoder_hidden_states=encoder_hidden_states,
-                    temb=temb,
-                )
-
-        hidden_states = self.norm_out(hidden_states, temb)
-        hidden_states = self.proj_out(hidden_states)
-
-        # unpatchify
-        patch_size = self.config.patch_size
-        height = height // patch_size
-        width = width // patch_size
-
-        hidden_states = hidden_states.reshape(
-            shape=(
-                hidden_states.shape[0],
-                height,
-                width,
-                patch_size,
-                patch_size,
-                self.out_channels,
-            )
-        )
-        hidden_states = torch.einsum("nhwpqc->nchpwq", hidden_states)
-        output = hidden_states.reshape(
-            shape=(
-                hidden_states.shape[0],
-                self.out_channels,
-                height * patch_size,
-                width * patch_size,
-            )
-        )
-
-        if USE_PEFT_BACKEND:
-            # remove `lora_scale` from each PEFT layer
-            unscale_lora_layers(self, lora_scale)
-
-        if not return_dict:
-            return (output,)
-
-        return Transformer2DModelOutput(sample=output)
-
-
-def verify_all_parameters_offset_copy(
-    model_old,
-    model_new,
-    layer_name_prefix,
-    source_start_idx,
-    dest_start_idx,
-    num_layers_to_check,
-):
-    """
-    Verifies that all parameters from a specified range in the old model are correctly copied to a new range in the scaled model.
-
-    Parameters:
-    - model_old: The original PyTorch model.
-    - model_new: The depth-scaled PyTorch model.
-    - layer_name_prefix: The prefix of the layer names to check, e.g., 'transformer_blocks'.
-    - source_start_idx: The starting index of the layers in the old model from which parameters are copied.
-    - dest_start_idx: The starting index of the layers in the new model where parameters are copied into.
-    - num_layers_to_check: The number of layers to check from the source_start_idx.
-    """
-    for offset in range(num_layers_to_check):
-        source_idx = source_start_idx + offset
-        dest_idx = dest_start_idx + offset
-        source_layer = getattr(model_old, layer_name_prefix)[source_idx]
-        dest_layer = getattr(model_new, layer_name_prefix)[dest_idx]
-
-        for param_name, source_param in source_layer.named_parameters():
-            # Retrieve the corresponding parameter from the destination layer
-            if isinstance(operator.attrgetter(param_name)(dest_layer), torch.Tensor):
-                dest_param = operator.attrgetter(param_name)(dest_layer)
-
-                # Check if the parameters are close enough (considering floating-point arithmetic)
-                if not torch.allclose(source_param, dest_param, atol=1e-6):
-                    raise AssertionError(
-                        f"Parameter mismatch for {layer_name_prefix}.{source_idx}.{param_name} (original) -> {layer_name_prefix}.{dest_idx}.{param_name} (new)."
-                    )
-            else:
-                raise AssertionError(
-                    f"Missing parameter {layer_name_prefix}.{dest_idx}.{param_name} in the new model."
-                )
-
-    print(
-        f"All parameters from {source_start_idx} to {source_start_idx + num_layers_to_check - 1} ({num_layers_to_check} layers) in {layer_name_prefix} have been verified to be correctly copied to {dest_start_idx} to {dest_start_idx + num_layers_to_check - 1}."
-    )
-
-
-def expand_existing_sd3_model(model_old):
-    # This model is 36 layers deep, versus 24 layers deep from the original model.
-    # We will prune 12 layers off from the end and the start of the merged weights.
-    model_new = SD3TransformerQKNorm2DModel.from_config(
-        {
-            "_class_name": "SD3Transformer2DModel",
-            "_diffusers_version": "0.30.0.dev0",
-            "_name_or_path": "stabilityai/stable-diffusion-3-medium-diffusers",
-            "attention_head_dim": 64,
-            "caption_projection_dim": 1536,
-            "in_channels": 16,
-            "joint_attention_dim": 4096,
-            "num_attention_heads": 24,
-            "num_layers": FINAL_DEPTH,
-            "out_channels": 16,
-            "patch_size": 2,
-            "pooled_projection_dim": 2048,
-            "pos_embed_max_size": 192,
-            "qk_norm": "layer_norm",
-            "sample_size": 128,
-        }
-    )
-
-    # Copy in layers 0...23 and all other layers.
-    with torch.no_grad():
-        new_model_param_names = set(name for name, _ in model_new.named_parameters())
-
-        # Iterate through parameters of the old model
-        for name, param in model_old.named_parameters():
-            if name in new_model_param_names:
-                # Get the corresponding parameter from the new model and copy the old param in
-                try:
-                    model_new.state_dict()[name].copy_(param)
-                except RuntimeError as e:
-                    if (
-                        "The size of tensor a (9216) must match the size of tensor b (3072) at non-singleton dimension 0"
-                        in str(e)
-                    ):
-                        pass
-                    else:
-                        print(f"Got {str(e)} on layer {name}")
-                        raise
-
-    # We now need to deal with [18:] for both transformer_blocks.
-    # We do this by copying in [6:] into [18:] for these blocks.
-    with torch.no_grad():
-        for layer_idx, injection_idx in zip(
-            range(M_VALUE, FINAL_DEPTH),
-            range(ORIG_DEPTH - M_VALUE, FINAL_DEPTH),
-        ):
-            for name, param in model_old.named_parameters():
-                if "transformer_blocks" in name:
-                    if f"transformer_blocks.{layer_idx}." in name:
-                        name_to_inject_into = name.replace(
-                            f"transformer_blocks.{layer_idx}.",
-                            f"transformer_blocks.{injection_idx}.",
-                        )
-                        model_new.state_dict()[name_to_inject_into].copy_(param)
-
-    # Finally, transform all the newly added qk norm layers in passthroughs.
-    # Setting the weights to 1 and the bias to zero means that initially they
-    # should do nothing to the model.
-    with torch.no_grad():
-        for name, param in model_new.named_parameters():
-            if "transformer_blocks" in name and ("norm_q" in name or "norm_k" in name):
-                if "norm_q.weight" in name:
-                    param.fill_(1)
-                elif "norm_q.bias" in name:
-                    param.fill_(0)
-
-    verify_all_parameters_offset_copy(
-        model_old, model_new, "transformer_blocks", 0, 0, ORIG_DEPTH - M_VALUE
-    )  # Adjust the index as needed
-    verify_all_parameters_offset_copy(
-        model_old, model_new, "transformer_blocks", 6, 18, ORIG_DEPTH - M_VALUE
-    )  # Adjust the last parameter as needed based on the number of layers you're checking
-
-    orig_params = sum(p.numel() for p in model_old.parameters())
-    expanded_params = sum(p.numel() for p in model_new.parameters())
-    print(
-        f"Model has been successfully expanded from {orig_params / 1e6:.2f}M to {expanded_params / 1e6:.2f}M."
-    )
-
-    model_new.save_pretrained((os.path.join(args.output_model, "transformer")))
-    return model_new
-
-
-if __name__ == "__main__":
-    from diffusers.models.transformers.transformer_sd3 import SD3Transformer2DModel
-
-    parser = argparse.ArgumentParser(
-        description="Make a 24 block deep SD3 2B into a 36 block deep version",
-    )
-    parser.add_argument(
-        "input_model",
-        action="store",
-        type=str,
-        help="The input pretrained model",
-    )
-    parser.add_argument(
-        "output_model",
-        action="store",
-        type=str,
-        help="The output pretrained model location",
-    )
-
-    args = parser.parse_args()
-
-    model_old = SD3Transformer2DModel.from_pretrained(
-        args.input_model,
-        subfolder="transformer",
-    )
-    model_new = expand_existing_sd3_model(model_old)
-    del model_old
-    gc.collect()
-    model_new = model_new.to("cuda", dtype=torch.bfloat16)
-    with torch.no_grad(), torch.inference_mode():
-        model_new(
-            hidden_states=torch.rand((1, 16, 64, 64)).to("cuda", dtype=torch.bfloat16),
-            encoder_hidden_states=torch.rand((1, 144, 4096)).to(
-                "cuda", dtype=torch.bfloat16
-            ),
-            pooled_projections=torch.rand((1, 2048)).to("cuda", dtype=torch.bfloat16),
-            timestep=torch.tensor([500]).to("cuda", dtype=torch.bfloat16),
-        )
-    print("Successfully expanded and tested model.")
diff --git a/videotuna/third_party/flux/models/sd3/pipeline.py b/videotuna/third_party/flux/models/sd3/pipeline.py
deleted file mode 100644
index 90753b48..00000000
--- a/videotuna/third_party/flux/models/sd3/pipeline.py
+++ /dev/null
@@ -1,1973 +0,0 @@
-# Copyright 2024 Stability AI and The HuggingFace Team. All rights reserved.
-#
-# Licensed under the Apache License, Version 2.0 (the "License");
-# you may not use this file except in compliance with the License.
-# You may obtain a copy of the License at
-#
-#     http://www.apache.org/licenses/LICENSE-2.0
-#
-# Unless required by applicable law or agreed to in writing, software
-# distributed under the License is distributed on an "AS IS" BASIS,
-# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
-# See the License for the specific language governing permissions and
-# limitations under the License.
-
-import inspect
-from typing import Any, Callable, Dict, List, Optional, Union
-
-import torch
-from diffusers.image_processor import VaeImageProcessor
-from diffusers.loaders import FromSingleFileMixin, SD3LoraLoaderMixin
-from diffusers.models.autoencoders import AutoencoderKL
-from diffusers.models.transformers import SD3Transformer2DModel
-from diffusers.pipelines.pipeline_utils import DiffusionPipeline
-from diffusers.pipelines.stable_diffusion_3.pipeline_output import (
-    StableDiffusion3PipelineOutput,
-)
-from diffusers.schedulers import FlowMatchEulerDiscreteScheduler
-from diffusers.utils import is_torch_xla_available, logging, replace_example_docstring
-from diffusers.utils.torch_utils import randn_tensor
-from transformers import (
-    CLIPTextModelWithProjection,
-    CLIPTokenizer,
-    T5EncoderModel,
-    T5TokenizerFast,
-)
-
-if is_torch_xla_available():
-    import torch_xla.core.xla_model as xm
-
-    XLA_AVAILABLE = True
-else:
-    XLA_AVAILABLE = False
-
-
-logger = logging.get_logger(__name__)  # pylint: disable=invalid-name
-
-EXAMPLE_DOC_STRING = """
-    Examples:
-        ```py
-        >>> import torch
-        >>> from diffusers import StableDiffusion3Pipeline
-
-        >>> pipe = StableDiffusion3Pipeline.from_pretrained(
-        ...     "stabilityai/stable-diffusion-3-medium-diffusers", torch_dtype=torch.float16
-        ... )
-        >>> pipe.to("cuda")
-        >>> prompt = "A cat holding a sign that says hello world"
-        >>> image = pipe(prompt).images[0]
-        >>> image.save("sd3.png")
-        ```
-"""
-
-
-# Copied from diffusers.pipelines.stable_diffusion.pipeline_stable_diffusion.retrieve_timesteps
-def retrieve_timesteps(
-    scheduler,
-    num_inference_steps: Optional[int] = None,
-    device: Optional[Union[str, torch.device]] = None,
-    timesteps: Optional[List[int]] = None,
-    sigmas: Optional[List[float]] = None,
-    **kwargs,
-):
-    """
-    Calls the scheduler's `set_timesteps` method and retrieves timesteps from the scheduler after the call. Handles
-    custom timesteps. Any kwargs will be supplied to `scheduler.set_timesteps`.
-
-    Args:
-        scheduler (`SchedulerMixin`):
-            The scheduler to get timesteps from.
-        num_inference_steps (`int`):
-            The number of diffusion steps used when generating samples with a pre-trained model. If used, `timesteps`
-            must be `None`.
-        device (`str` or `torch.device`, *optional*):
-            The device to which the timesteps should be moved to. If `None`, the timesteps are not moved.
-        timesteps (`List[int]`, *optional*):
-            Custom timesteps used to override the timestep spacing strategy of the scheduler. If `timesteps` is passed,
-            `num_inference_steps` and `sigmas` must be `None`.
-        sigmas (`List[float]`, *optional*):
-            Custom sigmas used to override the timestep spacing strategy of the scheduler. If `sigmas` is passed,
-            `num_inference_steps` and `timesteps` must be `None`.
-
-    Returns:
-        `Tuple[torch.Tensor, int]`: A tuple where the first element is the timestep schedule from the scheduler and the
-        second element is the number of inference steps.
-    """
-    if timesteps is not None and sigmas is not None:
-        raise ValueError(
-            "Only one of `timesteps` or `sigmas` can be passed. Please choose one to set custom values"
-        )
-    if timesteps is not None:
-        accepts_timesteps = "timesteps" in set(
-            inspect.signature(scheduler.set_timesteps).parameters.keys()
-        )
-        if not accepts_timesteps:
-            raise ValueError(
-                f"The current scheduler class {scheduler.__class__}'s `set_timesteps` does not support custom"
-                f" timestep schedules. Please check whether you are using the correct scheduler."
-            )
-        scheduler.set_timesteps(timesteps=timesteps, device=device, **kwargs)
-        timesteps = scheduler.timesteps
-        num_inference_steps = len(timesteps)
-    elif sigmas is not None:
-        accept_sigmas = "sigmas" in set(
-            inspect.signature(scheduler.set_timesteps).parameters.keys()
-        )
-        if not accept_sigmas:
-            raise ValueError(
-                f"The current scheduler class {scheduler.__class__}'s `set_timesteps` does not support custom"
-                f" sigmas schedules. Please check whether you are using the correct scheduler."
-            )
-        scheduler.set_timesteps(sigmas=sigmas, device=device, **kwargs)
-        timesteps = scheduler.timesteps
-        num_inference_steps = len(timesteps)
-    else:
-        scheduler.set_timesteps(num_inference_steps, device=device, **kwargs)
-        timesteps = scheduler.timesteps
-    return timesteps, num_inference_steps
-
-
-class StableDiffusion3Pipeline(
-    DiffusionPipeline, SD3LoraLoaderMixin, FromSingleFileMixin
-):
-    r"""
-    Args:
-        transformer ([`SD3Transformer2DModel`]):
-            Conditional Transformer (MMDiT) architecture to denoise the encoded image latents.
-        scheduler ([`FlowMatchEulerDiscreteScheduler`]):
-            A scheduler to be used in combination with `transformer` to denoise the encoded image latents.
-        vae ([`AutoencoderKL`]):
-            Variational Auto-Encoder (VAE) Model to encode and decode images to and from latent representations.
-        text_encoder ([`CLIPTextModelWithProjection`]):
-            [CLIP](https://huggingface.co/docs/transformers/model_doc/clip#transformers.CLIPTextModelWithProjection),
-            specifically the [clip-vit-large-patch14](https://huggingface.co/openai/clip-vit-large-patch14) variant,
-            with an additional added projection layer that is initialized with a diagonal matrix with the `hidden_size`
-            as its dimension.
-        text_encoder_2 ([`CLIPTextModelWithProjection`]):
-            [CLIP](https://huggingface.co/docs/transformers/model_doc/clip#transformers.CLIPTextModelWithProjection),
-            specifically the
-            [laion/CLIP-ViT-bigG-14-laion2B-39B-b160k](https://huggingface.co/laion/CLIP-ViT-bigG-14-laion2B-39B-b160k)
-            variant.
-        text_encoder_3 ([`T5EncoderModel`]):
-            Frozen text-encoder. Stable Diffusion 3 uses
-            [T5](https://huggingface.co/docs/transformers/model_doc/t5#transformers.T5EncoderModel), specifically the
-            [t5-v1_1-xxl](https://huggingface.co/google/t5-v1_1-xxl) variant.
-        tokenizer (`CLIPTokenizer`):
-            Tokenizer of class
-            [CLIPTokenizer](https://huggingface.co/docs/transformers/v4.21.0/en/model_doc/clip#transformers.CLIPTokenizer).
-        tokenizer_2 (`CLIPTokenizer`):
-            Second Tokenizer of class
-            [CLIPTokenizer](https://huggingface.co/docs/transformers/v4.21.0/en/model_doc/clip#transformers.CLIPTokenizer).
-        tokenizer_3 (`T5TokenizerFast`):
-            Tokenizer of class
-            [T5Tokenizer](https://huggingface.co/docs/transformers/model_doc/t5#transformers.T5Tokenizer).
-    """
-
-    model_cpu_offload_seq = (
-        "text_encoder->text_encoder_2->text_encoder_3->transformer->vae"
-    )
-    _optional_components = []
-    _callback_tensor_inputs = [
-        "latents",
-        "prompt_embeds",
-        "negative_prompt_embeds",
-        "negative_pooled_prompt_embeds",
-    ]
-
-    def __init__(
-        self,
-        transformer: SD3Transformer2DModel,
-        scheduler: FlowMatchEulerDiscreteScheduler,
-        vae: AutoencoderKL,
-        text_encoder: CLIPTextModelWithProjection,
-        tokenizer: CLIPTokenizer,
-        text_encoder_2: CLIPTextModelWithProjection,
-        tokenizer_2: CLIPTokenizer,
-        text_encoder_3: T5EncoderModel,
-        tokenizer_3: T5TokenizerFast,
-    ):
-        super().__init__()
-
-        self.register_modules(
-            vae=vae,
-            text_encoder=text_encoder,
-            text_encoder_2=text_encoder_2,
-            text_encoder_3=text_encoder_3,
-            tokenizer=tokenizer,
-            tokenizer_2=tokenizer_2,
-            tokenizer_3=tokenizer_3,
-            transformer=transformer,
-            scheduler=scheduler,
-        )
-        self.vae_scale_factor = (
-            2 ** (len(self.vae.config.block_out_channels) - 1)
-            if hasattr(self, "vae") and self.vae is not None
-            else 8
-        )
-        self.image_processor = VaeImageProcessor(vae_scale_factor=self.vae_scale_factor)
-        self.tokenizer_max_length = (
-            self.tokenizer.model_max_length
-            if hasattr(self, "tokenizer") and self.tokenizer is not None
-            else 77
-        )
-        self.default_sample_size = (
-            self.transformer.config.sample_size
-            if hasattr(self, "transformer") and self.transformer is not None
-            else 128
-        )
-
-    def _get_t5_prompt_embeds(
-        self,
-        prompt: Union[str, List[str]] = None,
-        num_images_per_prompt: int = 1,
-        device: Optional[torch.device] = None,
-        dtype: Optional[torch.dtype] = None,
-    ):
-        device = device or self._execution_device
-        dtype = dtype or self.text_encoder.dtype
-
-        prompt = [prompt] if isinstance(prompt, str) else prompt
-        batch_size = len(prompt)
-
-        if self.text_encoder_3 is None:
-            return torch.zeros(
-                (
-                    batch_size,
-                    self.tokenizer_max_length,
-                    self.transformer.config.joint_attention_dim,
-                ),
-                device=device,
-                dtype=dtype,
-            )
-
-        text_inputs = self.tokenizer_3(
-            prompt,
-            padding="max_length",
-            max_length=self.tokenizer_max_length,
-            truncation=True,
-            add_special_tokens=True,
-            return_tensors="pt",
-        )
-        text_input_ids = text_inputs.input_ids
-        untruncated_ids = self.tokenizer_3(
-            prompt, padding="longest", return_tensors="pt"
-        ).input_ids
-
-        if untruncated_ids.shape[-1] >= text_input_ids.shape[-1] and not torch.equal(
-            text_input_ids, untruncated_ids
-        ):
-            removed_text = self.tokenizer_3.batch_decode(
-                untruncated_ids[:, self.tokenizer_max_length - 1 : -1]
-            )
-            logger.warning(
-                "The following part of your input was truncated because CLIP can only handle sequences up to"
-                f" {self.tokenizer_max_length} tokens: {removed_text}"
-            )
-
-        prompt_embeds = self.text_encoder_3(text_input_ids.to(device))[0]
-
-        dtype = self.text_encoder_3.dtype
-        prompt_embeds = prompt_embeds.to(dtype=dtype, device=device)
-
-        _, seq_len, _ = prompt_embeds.shape
-
-        # duplicate text embeddings and attention mask for each generation per prompt, using mps friendly method
-        prompt_embeds = prompt_embeds.repeat(1, num_images_per_prompt, 1)
-        prompt_embeds = prompt_embeds.view(
-            batch_size * num_images_per_prompt, seq_len, -1
-        )
-
-        return prompt_embeds
-
-    def _get_clip_prompt_embeds(
-        self,
-        prompt: Union[str, List[str]],
-        num_images_per_prompt: int = 1,
-        device: Optional[torch.device] = None,
-        clip_skip: Optional[int] = None,
-        clip_model_index: int = 0,
-    ):
-        device = device or self._execution_device
-
-        clip_tokenizers = [self.tokenizer, self.tokenizer_2]
-        clip_text_encoders = [self.text_encoder, self.text_encoder_2]
-
-        tokenizer = clip_tokenizers[clip_model_index]
-        text_encoder = clip_text_encoders[clip_model_index]
-
-        prompt = [prompt] if isinstance(prompt, str) else prompt
-        batch_size = len(prompt)
-
-        text_inputs = tokenizer(
-            prompt,
-            padding="max_length",
-            max_length=self.tokenizer_max_length,
-            truncation=True,
-            return_tensors="pt",
-        )
-
-        text_input_ids = text_inputs.input_ids
-        untruncated_ids = tokenizer(
-            prompt, padding="longest", return_tensors="pt"
-        ).input_ids
-        if untruncated_ids.shape[-1] >= text_input_ids.shape[-1] and not torch.equal(
-            text_input_ids, untruncated_ids
-        ):
-            removed_text = tokenizer.batch_decode(
-                untruncated_ids[:, self.tokenizer_max_length - 1 : -1]
-            )
-            logger.warning(
-                "The following part of your input was truncated because CLIP can only handle sequences up to"
-                f" {self.tokenizer_max_length} tokens: {removed_text}"
-            )
-        prompt_embeds = text_encoder(
-            text_input_ids.to(device), output_hidden_states=True
-        )
-        pooled_prompt_embeds = prompt_embeds[0]
-
-        if clip_skip is None:
-            prompt_embeds = prompt_embeds.hidden_states[-2]
-        else:
-            prompt_embeds = prompt_embeds.hidden_states[-(clip_skip + 2)]
-
-        prompt_embeds = prompt_embeds.to(dtype=self.text_encoder.dtype, device=device)
-
-        _, seq_len, _ = prompt_embeds.shape
-        # duplicate text embeddings for each generation per prompt, using mps friendly method
-        prompt_embeds = prompt_embeds.repeat(1, num_images_per_prompt, 1)
-        prompt_embeds = prompt_embeds.view(
-            batch_size * num_images_per_prompt, seq_len, -1
-        )
-
-        pooled_prompt_embeds = pooled_prompt_embeds.repeat(1, num_images_per_prompt, 1)
-        pooled_prompt_embeds = pooled_prompt_embeds.view(
-            batch_size * num_images_per_prompt, -1
-        )
-
-        return prompt_embeds, pooled_prompt_embeds
-
-    def encode_prompt(
-        self,
-        prompt: Union[str, List[str]],
-        prompt_2: Union[str, List[str]],
-        prompt_3: Union[str, List[str]],
-        device: Optional[torch.device] = None,
-        num_images_per_prompt: int = 1,
-        do_classifier_free_guidance: bool = True,
-        negative_prompt: Optional[Union[str, List[str]]] = None,
-        negative_prompt_2: Optional[Union[str, List[str]]] = None,
-        negative_prompt_3: Optional[Union[str, List[str]]] = None,
-        prompt_embeds: Optional[torch.FloatTensor] = None,
-        negative_prompt_embeds: Optional[torch.FloatTensor] = None,
-        pooled_prompt_embeds: Optional[torch.FloatTensor] = None,
-        negative_pooled_prompt_embeds: Optional[torch.FloatTensor] = None,
-        clip_skip: Optional[int] = None,
-    ):
-        r"""
-
-        Args:
-            prompt (`str` or `List[str]`, *optional*):
-                prompt to be encoded
-            prompt_2 (`str` or `List[str]`, *optional*):
-                The prompt or prompts to be sent to the `tokenizer_2` and `text_encoder_2`. If not defined, `prompt` is
-                used in all text-encoders
-            prompt_3 (`str` or `List[str]`, *optional*):
-                The prompt or prompts to be sent to the `tokenizer_3` and `text_encoder_3`. If not defined, `prompt` is
-                used in all text-encoders
-            device: (`torch.device`):
-                torch device
-            num_images_per_prompt (`int`):
-                number of images that should be generated per prompt
-            do_classifier_free_guidance (`bool`):
-                whether to use classifier free guidance or not
-            negative_prompt (`str` or `List[str]`, *optional*):
-                The prompt or prompts not to guide the image generation. If not defined, one has to pass
-                `negative_prompt_embeds` instead. Ignored when not using guidance (i.e., ignored if `guidance_scale` is
-                less than `1`).
-            negative_prompt_2 (`str` or `List[str]`, *optional*):
-                The prompt or prompts not to guide the image generation to be sent to `tokenizer_2` and
-                `text_encoder_2`. If not defined, `negative_prompt` is used in all the text-encoders.
-            negative_prompt_2 (`str` or `List[str]`, *optional*):
-                The prompt or prompts not to guide the image generation to be sent to `tokenizer_3` and
-                `text_encoder_3`. If not defined, `negative_prompt` is used in both text-encoders
-            prompt_embeds (`torch.FloatTensor`, *optional*):
-                Pre-generated text embeddings. Can be used to easily tweak text inputs, *e.g.* prompt weighting. If not
-                provided, text embeddings will be generated from `prompt` input argument.
-            negative_prompt_embeds (`torch.FloatTensor`, *optional*):
-                Pre-generated negative text embeddings. Can be used to easily tweak text inputs, *e.g.* prompt
-                weighting. If not provided, negative_prompt_embeds will be generated from `negative_prompt` input
-                argument.
-            pooled_prompt_embeds (`torch.FloatTensor`, *optional*):
-                Pre-generated pooled text embeddings. Can be used to easily tweak text inputs, *e.g.* prompt weighting.
-                If not provided, pooled text embeddings will be generated from `prompt` input argument.
-            negative_pooled_prompt_embeds (`torch.FloatTensor`, *optional*):
-                Pre-generated negative pooled text embeddings. Can be used to easily tweak text inputs, *e.g.* prompt
-                weighting. If not provided, pooled negative_prompt_embeds will be generated from `negative_prompt`
-                input argument.
-            clip_skip (`int`, *optional*):
-                Number of layers to be skipped from CLIP while computing the prompt embeddings. A value of 1 means that
-                the output of the pre-final layer will be used for computing the prompt embeddings.
-        """
-        device = device or self._execution_device
-
-        prompt = [prompt] if isinstance(prompt, str) else prompt
-        if prompt is not None:
-            batch_size = len(prompt)
-        else:
-            batch_size = prompt_embeds.shape[0]
-
-        if prompt_embeds is None:
-            prompt_2 = prompt_2 or prompt
-            prompt_2 = [prompt_2] if isinstance(prompt_2, str) else prompt_2
-
-            prompt_3 = prompt_3 or prompt
-            prompt_3 = [prompt_3] if isinstance(prompt_3, str) else prompt_3
-
-            prompt_embed, pooled_prompt_embed = self._get_clip_prompt_embeds(
-                prompt=prompt,
-                device=device,
-                num_images_per_prompt=num_images_per_prompt,
-                clip_skip=clip_skip,
-                clip_model_index=0,
-            )
-            prompt_2_embed, pooled_prompt_2_embed = self._get_clip_prompt_embeds(
-                prompt=prompt_2,
-                device=device,
-                num_images_per_prompt=num_images_per_prompt,
-                clip_skip=clip_skip,
-                clip_model_index=1,
-            )
-            clip_prompt_embeds = torch.cat([prompt_embed, prompt_2_embed], dim=-1)
-
-            t5_prompt_embed = self._get_t5_prompt_embeds(
-                prompt=prompt_3,
-                num_images_per_prompt=num_images_per_prompt,
-                device=device,
-            )
-
-            clip_prompt_embeds = torch.nn.functional.pad(
-                clip_prompt_embeds,
-                (0, t5_prompt_embed.shape[-1] - clip_prompt_embeds.shape[-1]),
-            )
-
-            prompt_embeds = torch.cat([clip_prompt_embeds, t5_prompt_embed], dim=-2)
-            pooled_prompt_embeds = torch.cat(
-                [pooled_prompt_embed, pooled_prompt_2_embed], dim=-1
-            )
-
-        if do_classifier_free_guidance and negative_prompt_embeds is None:
-            negative_prompt = negative_prompt or ""
-            negative_prompt_2 = negative_prompt_2 or negative_prompt
-            negative_prompt_3 = negative_prompt_3 or negative_prompt
-
-            # normalize str to list
-            negative_prompt = (
-                batch_size * [negative_prompt]
-                if isinstance(negative_prompt, str)
-                else negative_prompt
-            )
-            negative_prompt_2 = (
-                batch_size * [negative_prompt_2]
-                if isinstance(negative_prompt_2, str)
-                else negative_prompt_2
-            )
-            negative_prompt_3 = (
-                batch_size * [negative_prompt_3]
-                if isinstance(negative_prompt_3, str)
-                else negative_prompt_3
-            )
-
-            if prompt is not None and type(prompt) is not type(negative_prompt):
-                raise TypeError(
-                    f"`negative_prompt` should be the same type to `prompt`, but got {type(negative_prompt)} !="
-                    f" {type(prompt)}."
-                )
-            elif batch_size != len(negative_prompt):
-                raise ValueError(
-                    f"`negative_prompt`: {negative_prompt} has batch size {len(negative_prompt)}, but `prompt`:"
-                    f" {prompt} has batch size {batch_size}. Please make sure that passed `negative_prompt` matches"
-                    " the batch size of `prompt`."
-                )
-
-            negative_prompt_embed, negative_pooled_prompt_embed = (
-                self._get_clip_prompt_embeds(
-                    negative_prompt,
-                    device=device,
-                    num_images_per_prompt=num_images_per_prompt,
-                    clip_skip=None,
-                    clip_model_index=0,
-                )
-            )
-            negative_prompt_2_embed, negative_pooled_prompt_2_embed = (
-                self._get_clip_prompt_embeds(
-                    negative_prompt_2,
-                    device=device,
-                    num_images_per_prompt=num_images_per_prompt,
-                    clip_skip=None,
-                    clip_model_index=1,
-                )
-            )
-            negative_clip_prompt_embeds = torch.cat(
-                [negative_prompt_embed, negative_prompt_2_embed], dim=-1
-            )
-
-            t5_negative_prompt_embed = self._get_t5_prompt_embeds(
-                prompt=negative_prompt_3,
-                num_images_per_prompt=num_images_per_prompt,
-                device=device,
-            )
-
-            negative_clip_prompt_embeds = torch.nn.functional.pad(
-                negative_clip_prompt_embeds,
-                (
-                    0,
-                    t5_negative_prompt_embed.shape[-1]
-                    - negative_clip_prompt_embeds.shape[-1],
-                ),
-            )
-
-            negative_prompt_embeds = torch.cat(
-                [negative_clip_prompt_embeds, t5_negative_prompt_embed], dim=-2
-            )
-            negative_pooled_prompt_embeds = torch.cat(
-                [negative_pooled_prompt_embed, negative_pooled_prompt_2_embed], dim=-1
-            )
-
-        return (
-            prompt_embeds,
-            negative_prompt_embeds,
-            pooled_prompt_embeds,
-            negative_pooled_prompt_embeds,
-        )
-
-    def check_inputs(
-        self,
-        prompt,
-        prompt_2,
-        prompt_3,
-        height,
-        width,
-        negative_prompt=None,
-        negative_prompt_2=None,
-        negative_prompt_3=None,
-        prompt_embeds=None,
-        negative_prompt_embeds=None,
-        pooled_prompt_embeds=None,
-        negative_pooled_prompt_embeds=None,
-        callback_on_step_end_tensor_inputs=None,
-    ):
-        if height % 8 != 0 or width % 8 != 0:
-            raise ValueError(
-                f"`height` and `width` have to be divisible by 8 but are {height} and {width}."
-            )
-
-        if callback_on_step_end_tensor_inputs is not None and not all(
-            k in self._callback_tensor_inputs
-            for k in callback_on_step_end_tensor_inputs
-        ):
-            raise ValueError(
-                f"`callback_on_step_end_tensor_inputs` has to be in {self._callback_tensor_inputs}, but found {[k for k in callback_on_step_end_tensor_inputs if k not in self._callback_tensor_inputs]}"
-            )
-
-        if prompt is not None and prompt_embeds is not None:
-            raise ValueError(
-                f"Cannot forward both `prompt`: {prompt} and `prompt_embeds`: {prompt_embeds}. Please make sure to"
-                " only forward one of the two."
-            )
-        elif prompt_2 is not None and prompt_embeds is not None:
-            raise ValueError(
-                f"Cannot forward both `prompt_2`: {prompt_2} and `prompt_embeds`: {prompt_embeds}. Please make sure to"
-                " only forward one of the two."
-            )
-        elif prompt_3 is not None and prompt_embeds is not None:
-            raise ValueError(
-                f"Cannot forward both `prompt_3`: {prompt_2} and `prompt_embeds`: {prompt_embeds}. Please make sure to"
-                " only forward one of the two."
-            )
-        elif prompt is None and prompt_embeds is None:
-            raise ValueError(
-                "Provide either `prompt` or `prompt_embeds`. Cannot leave both `prompt` and `prompt_embeds` undefined."
-            )
-        elif prompt is not None and (
-            not isinstance(prompt, str) and not isinstance(prompt, list)
-        ):
-            raise ValueError(
-                f"`prompt` has to be of type `str` or `list` but is {type(prompt)}"
-            )
-        elif prompt_2 is not None and (
-            not isinstance(prompt_2, str) and not isinstance(prompt_2, list)
-        ):
-            raise ValueError(
-                f"`prompt_2` has to be of type `str` or `list` but is {type(prompt_2)}"
-            )
-        elif prompt_3 is not None and (
-            not isinstance(prompt_3, str) and not isinstance(prompt_3, list)
-        ):
-            raise ValueError(
-                f"`prompt_3` has to be of type `str` or `list` but is {type(prompt_3)}"
-            )
-
-        if negative_prompt is not None and negative_prompt_embeds is not None:
-            raise ValueError(
-                f"Cannot forward both `negative_prompt`: {negative_prompt} and `negative_prompt_embeds`:"
-                f" {negative_prompt_embeds}. Please make sure to only forward one of the two."
-            )
-        elif negative_prompt_2 is not None and negative_prompt_embeds is not None:
-            raise ValueError(
-                f"Cannot forward both `negative_prompt_2`: {negative_prompt_2} and `negative_prompt_embeds`:"
-                f" {negative_prompt_embeds}. Please make sure to only forward one of the two."
-            )
-        elif negative_prompt_3 is not None and negative_prompt_embeds is not None:
-            raise ValueError(
-                f"Cannot forward both `negative_prompt_3`: {negative_prompt_3} and `negative_prompt_embeds`:"
-                f" {negative_prompt_embeds}. Please make sure to only forward one of the two."
-            )
-
-        if prompt_embeds is not None and negative_prompt_embeds is not None:
-            if prompt_embeds.shape != negative_prompt_embeds.shape:
-                raise ValueError(
-                    "`prompt_embeds` and `negative_prompt_embeds` must have the same shape when passed directly, but"
-                    f" got: `prompt_embeds` {prompt_embeds.shape} != `negative_prompt_embeds`"
-                    f" {negative_prompt_embeds.shape}."
-                )
-
-        if prompt_embeds is not None and pooled_prompt_embeds is None:
-            raise ValueError(
-                "If `prompt_embeds` are provided, `pooled_prompt_embeds` also have to be passed. Make sure to generate `pooled_prompt_embeds` from the same text encoder that was used to generate `prompt_embeds`."
-            )
-
-        if negative_prompt_embeds is not None and negative_pooled_prompt_embeds is None:
-            raise ValueError(
-                "If `negative_prompt_embeds` are provided, `negative_pooled_prompt_embeds` also have to be passed. Make sure to generate `negative_pooled_prompt_embeds` from the same text encoder that was used to generate `negative_prompt_embeds`."
-            )
-
-    def prepare_latents(
-        self,
-        batch_size,
-        num_channels_latents,
-        height,
-        width,
-        dtype,
-        device,
-        generator,
-        latents=None,
-    ):
-        if latents is not None:
-            return latents.to(device=device, dtype=dtype)
-
-        shape = (
-            batch_size,
-            num_channels_latents,
-            int(height) // self.vae_scale_factor,
-            int(width) // self.vae_scale_factor,
-        )
-
-        if isinstance(generator, list) and len(generator) != batch_size:
-            raise ValueError(
-                f"You have passed a list of generators of length {len(generator)}, but requested an effective batch"
-                f" size of {batch_size}. Make sure the batch size matches the length of the generators."
-            )
-
-        latents = randn_tensor(shape, generator=generator, device=device, dtype=dtype)
-
-        return latents
-
-    @property
-    def guidance_scale(self):
-        return self._guidance_scale
-
-    @property
-    def clip_skip(self):
-        return self._clip_skip
-
-    # here `guidance_scale` is defined analog to the guidance weight `w` of equation (2)
-    # of the Imagen paper: https://arxiv.org/pdf/2205.11487.pdf . `guidance_scale = 1`
-    # corresponds to doing no classifier free guidance.
-    @property
-    def do_classifier_free_guidance(self):
-        return self._guidance_scale > 1
-
-    @property
-    def joint_attention_kwargs(self):
-        return self._joint_attention_kwargs
-
-    @property
-    def num_timesteps(self):
-        return self._num_timesteps
-
-    @property
-    def interrupt(self):
-        return self._interrupt
-
-    @torch.no_grad()
-    @replace_example_docstring(EXAMPLE_DOC_STRING)
-    def __call__(
-        self,
-        prompt: Union[str, List[str]] = None,
-        prompt_2: Optional[Union[str, List[str]]] = None,
-        prompt_3: Optional[Union[str, List[str]]] = None,
-        height: Optional[int] = None,
-        width: Optional[int] = None,
-        num_inference_steps: int = 28,
-        timesteps: List[int] = None,
-        guidance_scale: float = 7.0,
-        negative_prompt: Optional[Union[str, List[str]]] = None,
-        negative_prompt_2: Optional[Union[str, List[str]]] = None,
-        negative_prompt_3: Optional[Union[str, List[str]]] = None,
-        num_images_per_prompt: Optional[int] = 1,
-        generator: Optional[Union[torch.Generator, List[torch.Generator]]] = None,
-        latents: Optional[torch.FloatTensor] = None,
-        prompt_embeds: Optional[torch.FloatTensor] = None,
-        negative_prompt_embeds: Optional[torch.FloatTensor] = None,
-        pooled_prompt_embeds: Optional[torch.FloatTensor] = None,
-        negative_pooled_prompt_embeds: Optional[torch.FloatTensor] = None,
-        output_type: Optional[str] = "pil",
-        return_dict: bool = True,
-        joint_attention_kwargs: Optional[Dict[str, Any]] = None,
-        clip_skip: Optional[int] = None,
-        callback_on_step_end: Optional[Callable[[int, int, Dict], None]] = None,
-        callback_on_step_end_tensor_inputs: List[str] = ["latents"],
-    ):
-        r"""
-        Function invoked when calling the pipeline for generation.
-
-        Args:
-            prompt (`str` or `List[str]`, *optional*):
-                The prompt or prompts to guide the image generation. If not defined, one has to pass `prompt_embeds`.
-                instead.
-            prompt_2 (`str` or `List[str]`, *optional*):
-                The prompt or prompts to be sent to `tokenizer_2` and `text_encoder_2`. If not defined, `prompt` is
-                will be used instead
-            prompt_3 (`str` or `List[str]`, *optional*):
-                The prompt or prompts to be sent to `tokenizer_3` and `text_encoder_3`. If not defined, `prompt` is
-                will be used instead
-            height (`int`, *optional*, defaults to self.unet.config.sample_size * self.vae_scale_factor):
-                The height in pixels of the generated image. This is set to 1024 by default for the best results.
-            width (`int`, *optional*, defaults to self.unet.config.sample_size * self.vae_scale_factor):
-                The width in pixels of the generated image. This is set to 1024 by default for the best results.
-            num_inference_steps (`int`, *optional*, defaults to 50):
-                The number of denoising steps. More denoising steps usually lead to a higher quality image at the
-                expense of slower inference.
-            timesteps (`List[int]`, *optional*):
-                Custom timesteps to use for the denoising process with schedulers which support a `timesteps` argument
-                in their `set_timesteps` method. If not defined, the default behavior when `num_inference_steps` is
-                passed will be used. Must be in descending order.
-            guidance_scale (`float`, *optional*, defaults to 5.0):
-                Guidance scale as defined in [Classifier-Free Diffusion Guidance](https://arxiv.org/abs/2207.12598).
-                `guidance_scale` is defined as `w` of equation 2. of [Imagen
-                Paper](https://arxiv.org/pdf/2205.11487.pdf). Guidance scale is enabled by setting `guidance_scale >
-                1`. Higher guidance scale encourages to generate images that are closely linked to the text `prompt`,
-                usually at the expense of lower image quality.
-            negative_prompt (`str` or `List[str]`, *optional*):
-                The prompt or prompts not to guide the image generation. If not defined, one has to pass
-                `negative_prompt_embeds` instead. Ignored when not using guidance (i.e., ignored if `guidance_scale` is
-                less than `1`).
-            negative_prompt_2 (`str` or `List[str]`, *optional*):
-                The prompt or prompts not to guide the image generation to be sent to `tokenizer_2` and
-                `text_encoder_2`. If not defined, `negative_prompt` is used instead
-            negative_prompt_3 (`str` or `List[str]`, *optional*):
-                The prompt or prompts not to guide the image generation to be sent to `tokenizer_3` and
-                `text_encoder_3`. If not defined, `negative_prompt` is used instead
-            num_images_per_prompt (`int`, *optional*, defaults to 1):
-                The number of images to generate per prompt.
-            generator (`torch.Generator` or `List[torch.Generator]`, *optional*):
-                One or a list of [torch generator(s)](https://pytorch.org/docs/stable/generated/torch.Generator.html)
-                to make generation deterministic.
-            latents (`torch.FloatTensor`, *optional*):
-                Pre-generated noisy latents, sampled from a Gaussian distribution, to be used as inputs for image
-                generation. Can be used to tweak the same generation with different prompts. If not provided, a latents
-                tensor will ge generated by sampling using the supplied random `generator`.
-            prompt_embeds (`torch.FloatTensor`, *optional*):
-                Pre-generated text embeddings. Can be used to easily tweak text inputs, *e.g.* prompt weighting. If not
-                provided, text embeddings will be generated from `prompt` input argument.
-            negative_prompt_embeds (`torch.FloatTensor`, *optional*):
-                Pre-generated negative text embeddings. Can be used to easily tweak text inputs, *e.g.* prompt
-                weighting. If not provided, negative_prompt_embeds will be generated from `negative_prompt` input
-                argument.
-            pooled_prompt_embeds (`torch.FloatTensor`, *optional*):
-                Pre-generated pooled text embeddings. Can be used to easily tweak text inputs, *e.g.* prompt weighting.
-                If not provided, pooled text embeddings will be generated from `prompt` input argument.
-            negative_pooled_prompt_embeds (`torch.FloatTensor`, *optional*):
-                Pre-generated negative pooled text embeddings. Can be used to easily tweak text inputs, *e.g.* prompt
-                weighting. If not provided, pooled negative_prompt_embeds will be generated from `negative_prompt`
-                input argument.
-            output_type (`str`, *optional*, defaults to `"pil"`):
-                The output format of the generate image. Choose between
-                [PIL](https://pillow.readthedocs.io/en/stable/): `PIL.Image.Image` or `np.array`.
-            return_dict (`bool`, *optional*, defaults to `True`):
-                Whether or not to return a [`~pipelines.stable_diffusion_xl.StableDiffusionXLPipelineOutput`] instead
-                of a plain tuple.
-            joint_attention_kwargs (`dict`, *optional*):
-                A kwargs dictionary that if specified is passed along to the `AttentionProcessor` as defined under
-                `self.processor` in
-                [diffusers.models.attention_processor](https://github.com/huggingface/diffusers/blob/main/src/diffusers/models/attention_processor.py).
-            callback_on_step_end (`Callable`, *optional*):
-                A function that calls at the end of each denoising steps during the inference. The function is called
-                with the following arguments: `callback_on_step_end(self: DiffusionPipeline, step: int, timestep: int,
-                callback_kwargs: Dict)`. `callback_kwargs` will include a list of all tensors as specified by
-                `callback_on_step_end_tensor_inputs`.
-            callback_on_step_end_tensor_inputs (`List`, *optional*):
-                The list of tensor inputs for the `callback_on_step_end` function. The tensors specified in the list
-                will be passed as `callback_kwargs` argument. You will only be able to include variables listed in the
-                `._callback_tensor_inputs` attribute of your pipeline class.
-
-        Examples:
-
-        Returns:
-            [`~pipelines.stable_diffusion_xl.StableDiffusionXLPipelineOutput`] or `tuple`:
-            [`~pipelines.stable_diffusion_xl.StableDiffusionXLPipelineOutput`] if `return_dict` is True, otherwise a
-            `tuple`. When returning a tuple, the first element is a list with the generated images.
-        """
-
-        height = height or self.default_sample_size * self.vae_scale_factor
-        width = width or self.default_sample_size * self.vae_scale_factor
-
-        # 1. Check inputs. Raise error if not correct
-        self.check_inputs(
-            prompt,
-            prompt_2,
-            prompt_3,
-            height,
-            width,
-            negative_prompt=negative_prompt,
-            negative_prompt_2=negative_prompt_2,
-            negative_prompt_3=negative_prompt_3,
-            prompt_embeds=prompt_embeds,
-            negative_prompt_embeds=negative_prompt_embeds,
-            pooled_prompt_embeds=pooled_prompt_embeds,
-            negative_pooled_prompt_embeds=negative_pooled_prompt_embeds,
-            callback_on_step_end_tensor_inputs=callback_on_step_end_tensor_inputs,
-        )
-
-        self._guidance_scale = guidance_scale
-        self._clip_skip = clip_skip
-        self._joint_attention_kwargs = joint_attention_kwargs
-        self._interrupt = False
-
-        # 2. Define call parameters
-        if prompt is not None and isinstance(prompt, str):
-            batch_size = 1
-        elif prompt is not None and isinstance(prompt, list):
-            batch_size = len(prompt)
-        else:
-            batch_size = prompt_embeds.shape[0]
-
-        device = self._execution_device
-
-        (
-            prompt_embeds,
-            negative_prompt_embeds,
-            pooled_prompt_embeds,
-            negative_pooled_prompt_embeds,
-        ) = self.encode_prompt(
-            prompt=prompt,
-            prompt_2=prompt_2,
-            prompt_3=prompt_3,
-            negative_prompt=negative_prompt,
-            negative_prompt_2=negative_prompt_2,
-            negative_prompt_3=negative_prompt_3,
-            do_classifier_free_guidance=self.do_classifier_free_guidance,
-            prompt_embeds=prompt_embeds,
-            negative_prompt_embeds=negative_prompt_embeds,
-            pooled_prompt_embeds=pooled_prompt_embeds,
-            negative_pooled_prompt_embeds=negative_pooled_prompt_embeds,
-            device=device,
-            clip_skip=self.clip_skip,
-            num_images_per_prompt=num_images_per_prompt,
-        )
-
-        if self.do_classifier_free_guidance:
-            prompt_embeds = torch.cat([negative_prompt_embeds, prompt_embeds], dim=0)
-            pooled_prompt_embeds = torch.cat(
-                [negative_pooled_prompt_embeds, pooled_prompt_embeds], dim=0
-            )
-
-        # 4. Prepare timesteps
-        timesteps, num_inference_steps = retrieve_timesteps(
-            self.scheduler, num_inference_steps, device, timesteps
-        )
-        num_warmup_steps = max(
-            len(timesteps) - num_inference_steps * self.scheduler.order, 0
-        )
-        self._num_timesteps = len(timesteps)
-
-        # 5. Prepare latent variables
-        num_channels_latents = self.transformer.config.in_channels
-        latents = self.prepare_latents(
-            batch_size * num_images_per_prompt,
-            num_channels_latents,
-            height,
-            width,
-            prompt_embeds.dtype,
-            device,
-            generator,
-            latents,
-        )
-        latents = latents.to(self.transformer.device)
-        timesteps = timesteps.to(self.transformer.device)
-
-        # 6. Denoising loop
-        with self.progress_bar(total=num_inference_steps) as progress_bar:
-            for i, t in enumerate(timesteps):
-                if self.interrupt:
-                    continue
-
-                # expand the latents if we are doing classifier free guidance
-                latent_model_input = (
-                    torch.cat([latents] * 2)
-                    if self.do_classifier_free_guidance
-                    else latents
-                )
-                # broadcast to batch dimension in a way that's compatible with ONNX/Core ML
-                timestep = t.expand(latent_model_input.shape[0])
-
-                noise_pred = self.transformer(
-                    hidden_states=latent_model_input.to(
-                        device=self.transformer.device, dtype=self.transformer.dtype
-                    ),
-                    timestep=timestep,
-                    encoder_hidden_states=prompt_embeds.to(
-                        device=self.transformer.device, dtype=self.transformer.dtype
-                    ),
-                    pooled_projections=pooled_prompt_embeds.to(
-                        device=self.transformer.device, dtype=self.transformer.dtype
-                    ),
-                    joint_attention_kwargs=self.joint_attention_kwargs,
-                    return_dict=False,
-                )[0]
-
-                # perform guidance
-                if self.do_classifier_free_guidance:
-                    noise_pred_uncond, noise_pred_text = noise_pred.chunk(2)
-                    noise_pred = noise_pred_uncond + self.guidance_scale * (
-                        noise_pred_text - noise_pred_uncond
-                    )
-
-                # compute the previous noisy sample x_t -> x_t-1
-                latents_dtype = latents.dtype
-                latents = self.scheduler.step(
-                    noise_pred, t, latents, return_dict=False
-                )[0]
-
-                if latents.dtype != latents_dtype:
-                    if torch.backends.mps.is_available():
-                        # some platforms (eg. apple mps) misbehave due to a pytorch bug: https://github.com/pytorch/pytorch/pull/99272
-                        latents = latents.to(latents_dtype)
-
-                if callback_on_step_end is not None:
-                    callback_kwargs = {}
-                    for k in callback_on_step_end_tensor_inputs:
-                        callback_kwargs[k] = locals()[k]
-                    callback_outputs = callback_on_step_end(self, i, t, callback_kwargs)
-
-                    latents = callback_outputs.pop("latents", latents)
-                    prompt_embeds = callback_outputs.pop("prompt_embeds", prompt_embeds)
-                    negative_prompt_embeds = callback_outputs.pop(
-                        "negative_prompt_embeds", negative_prompt_embeds
-                    )
-                    negative_pooled_prompt_embeds = callback_outputs.pop(
-                        "negative_pooled_prompt_embeds", negative_pooled_prompt_embeds
-                    )
-
-                # call the callback, if provided
-                if i == len(timesteps) - 1 or (
-                    (i + 1) > num_warmup_steps and (i + 1) % self.scheduler.order == 0
-                ):
-                    progress_bar.update()
-
-                if XLA_AVAILABLE:
-                    xm.mark_step()
-
-        if output_type == "latent":
-            image = latents
-
-        else:
-            latents = (
-                latents / self.vae.config.scaling_factor
-            ) + self.vae.config.shift_factor
-
-            image = self.vae.decode(latents.to(self.vae.dtype), return_dict=False)[0]
-            image = self.image_processor.postprocess(image, output_type=output_type)
-
-        # Offload all models
-        self.maybe_free_model_hooks()
-
-        if not return_dict:
-            return (image,)
-
-        return StableDiffusion3PipelineOutput(images=image)
-
-
-# Copyright 2024 Stability AI and The HuggingFace Team. All rights reserved.
-#
-# Licensed under the Apache License, Version 2.0 (the "License");
-# you may not use this file except in compliance with the License.
-# You may obtain a copy of the License at
-#
-#     http://www.apache.org/licenses/LICENSE-2.0
-#
-# Unless required by applicable law or agreed to in writing, software
-# distributed under the License is distributed on an "AS IS" BASIS,
-# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
-# See the License for the specific language governing permissions and
-# limitations under the License.
-
-from typing import Callable, Dict, List, Optional, Union
-
-import PIL.Image
-import torch
-from diffusers.image_processor import PipelineImageInput
-from transformers import (
-    CLIPTextModelWithProjection,
-    CLIPTokenizer,
-    T5EncoderModel,
-    T5TokenizerFast,
-)
-
-if is_torch_xla_available():
-    import torch_xla.core.xla_model as xm
-
-    XLA_AVAILABLE = True
-else:
-    XLA_AVAILABLE = False
-
-
-logger = logging.get_logger(__name__)  # pylint: disable=invalid-name
-
-EXAMPLE_DOC_STRING = """
-    Examples:
-        ```py
-        >>> import torch
-
-        >>> from diffusers import AutoPipelineForImage2Image
-        >>> from diffusers.utils import load_image
-
-        >>> device = "cuda"
-        >>> model_id_or_path = "stabilityai/stable-diffusion-3-medium-diffusers"
-        >>> pipe = AutoPipelineForImage2Image.from_pretrained(model_id_or_path, torch_dtype=torch.float16)
-        >>> pipe = pipe.to(device)
-
-        >>> url = "https://raw.githubusercontent.com/CompVis/stable-diffusion/main/assets/stable-samples/img2img/sketch-mountains-input.jpg"
-        >>> init_image = load_image(url).resize((512, 512))
-
-        >>> prompt = "cat wizard, gandalf, lord of the rings, detailed, fantasy, cute, adorable, Pixar, Disney, 8k"
-
-        >>> images = pipe(prompt=prompt, image=init_image, strength=0.95, guidance_scale=7.5).images[0]
-        ```
-"""
-
-
-# Copied from diffusers.pipelines.stable_diffusion.pipeline_stable_diffusion_img2img.retrieve_latents
-def retrieve_latents(
-    encoder_output: torch.Tensor,
-    generator: Optional[torch.Generator] = None,
-    sample_mode: str = "sample",
-):
-    if hasattr(encoder_output, "latent_dist") and sample_mode == "sample":
-        return encoder_output.latent_dist.sample(generator)
-    elif hasattr(encoder_output, "latent_dist") and sample_mode == "argmax":
-        return encoder_output.latent_dist.mode()
-    elif hasattr(encoder_output, "latents"):
-        return encoder_output.latents
-    else:
-        raise AttributeError("Could not access latents of provided encoder_output")
-
-
-class StableDiffusion3Img2ImgPipeline(DiffusionPipeline):
-    r"""
-    Args:
-        transformer ([`SD3Transformer2DModel`]):
-            Conditional Transformer (MMDiT) architecture to denoise the encoded image latents.
-        scheduler ([`FlowMatchEulerDiscreteScheduler`]):
-            A scheduler to be used in combination with `transformer` to denoise the encoded image latents.
-        vae ([`AutoencoderKL`]):
-            Variational Auto-Encoder (VAE) Model to encode and decode images to and from latent representations.
-        text_encoder ([`CLIPTextModelWithProjection`]):
-            [CLIP](https://huggingface.co/docs/transformers/model_doc/clip#transformers.CLIPTextModelWithProjection),
-            specifically the [clip-vit-large-patch14](https://huggingface.co/openai/clip-vit-large-patch14) variant,
-            with an additional added projection layer that is initialized with a diagonal matrix with the `hidden_size`
-            as its dimension.
-        text_encoder_2 ([`CLIPTextModelWithProjection`]):
-            [CLIP](https://huggingface.co/docs/transformers/model_doc/clip#transformers.CLIPTextModelWithProjection),
-            specifically the
-            [laion/CLIP-ViT-bigG-14-laion2B-39B-b160k](https://huggingface.co/laion/CLIP-ViT-bigG-14-laion2B-39B-b160k)
-            variant.
-        text_encoder_3 ([`T5EncoderModel`]):
-            Frozen text-encoder. Stable Diffusion 3 uses
-            [T5](https://huggingface.co/docs/transformers/model_doc/t5#transformers.T5EncoderModel), specifically the
-            [t5-v1_1-xxl](https://huggingface.co/google/t5-v1_1-xxl) variant.
-        tokenizer (`CLIPTokenizer`):
-            Tokenizer of class
-            [CLIPTokenizer](https://huggingface.co/docs/transformers/v4.21.0/en/model_doc/clip#transformers.CLIPTokenizer).
-        tokenizer_2 (`CLIPTokenizer`):
-            Second Tokenizer of class
-            [CLIPTokenizer](https://huggingface.co/docs/transformers/v4.21.0/en/model_doc/clip#transformers.CLIPTokenizer).
-        tokenizer_3 (`T5TokenizerFast`):
-            Tokenizer of class
-            [T5Tokenizer](https://huggingface.co/docs/transformers/model_doc/t5#transformers.T5Tokenizer).
-    """
-
-    model_cpu_offload_seq = (
-        "text_encoder->text_encoder_2->text_encoder_3->transformer->vae"
-    )
-    _optional_components = []
-    _callback_tensor_inputs = [
-        "latents",
-        "prompt_embeds",
-        "negative_prompt_embeds",
-        "negative_pooled_prompt_embeds",
-    ]
-
-    def __init__(
-        self,
-        transformer: SD3Transformer2DModel,
-        scheduler: FlowMatchEulerDiscreteScheduler,
-        vae: AutoencoderKL,
-        text_encoder: CLIPTextModelWithProjection,
-        tokenizer: CLIPTokenizer,
-        text_encoder_2: CLIPTextModelWithProjection,
-        tokenizer_2: CLIPTokenizer,
-        text_encoder_3: T5EncoderModel,
-        tokenizer_3: T5TokenizerFast,
-    ):
-        super().__init__()
-
-        self.register_modules(
-            vae=vae,
-            text_encoder=text_encoder,
-            text_encoder_2=text_encoder_2,
-            text_encoder_3=text_encoder_3,
-            tokenizer=tokenizer,
-            tokenizer_2=tokenizer_2,
-            tokenizer_3=tokenizer_3,
-            transformer=transformer,
-            scheduler=scheduler,
-        )
-        self.vae_scale_factor = 2 ** (len(self.vae.config.block_out_channels) - 1)
-        self.image_processor = VaeImageProcessor(
-            vae_scale_factor=self.vae_scale_factor,
-            vae_latent_channels=self.vae.config.latent_channels,
-        )
-        self.tokenizer_max_length = self.tokenizer.model_max_length
-        self.default_sample_size = self.transformer.config.sample_size
-
-    # Copied from diffusers.pipelines.stable_diffusion_3.pipeline_stable_diffusion_3.StableDiffusion3Pipeline._get_t5_prompt_embeds
-    def _get_t5_prompt_embeds(
-        self,
-        prompt: Union[str, List[str]] = None,
-        num_images_per_prompt: int = 1,
-        device: Optional[torch.device] = None,
-        dtype: Optional[torch.dtype] = None,
-    ):
-        device = device or self._execution_device
-        dtype = dtype or self.text_encoder.dtype
-
-        prompt = [prompt] if isinstance(prompt, str) else prompt
-        batch_size = len(prompt)
-
-        if self.text_encoder_3 is None:
-            return torch.zeros(
-                (
-                    batch_size,
-                    self.tokenizer_max_length,
-                    self.transformer.config.joint_attention_dim,
-                ),
-                device=device,
-                dtype=dtype,
-            )
-
-        text_inputs = self.tokenizer_3(
-            prompt,
-            padding="max_length",
-            max_length=self.tokenizer_max_length,
-            truncation=True,
-            add_special_tokens=True,
-            return_tensors="pt",
-        )
-        text_input_ids = text_inputs.input_ids
-        untruncated_ids = self.tokenizer_3(
-            prompt, padding="longest", return_tensors="pt"
-        ).input_ids
-
-        if untruncated_ids.shape[-1] >= text_input_ids.shape[-1] and not torch.equal(
-            text_input_ids, untruncated_ids
-        ):
-            removed_text = self.tokenizer_3.batch_decode(
-                untruncated_ids[:, self.tokenizer_max_length - 1 : -1]
-            )
-            logger.warning(
-                "The following part of your input was truncated because CLIP can only handle sequences up to"
-                f" {self.tokenizer_max_length} tokens: {removed_text}"
-            )
-
-        prompt_embeds = self.text_encoder_3(text_input_ids.to(device))[0]
-
-        dtype = self.text_encoder_3.dtype
-        prompt_embeds = prompt_embeds.to(dtype=dtype, device=device)
-
-        _, seq_len, _ = prompt_embeds.shape
-
-        # duplicate text embeddings and attention mask for each generation per prompt, using mps friendly method
-        prompt_embeds = prompt_embeds.repeat(1, num_images_per_prompt, 1)
-        prompt_embeds = prompt_embeds.view(
-            batch_size * num_images_per_prompt, seq_len, -1
-        )
-
-        return prompt_embeds
-
-    # Copied from diffusers.pipelines.stable_diffusion_3.pipeline_stable_diffusion_3.StableDiffusion3Pipeline._get_clip_prompt_embeds
-    def _get_clip_prompt_embeds(
-        self,
-        prompt: Union[str, List[str]],
-        num_images_per_prompt: int = 1,
-        device: Optional[torch.device] = None,
-        clip_skip: Optional[int] = None,
-        clip_model_index: int = 0,
-    ):
-        device = device or self._execution_device
-
-        clip_tokenizers = [self.tokenizer, self.tokenizer_2]
-        clip_text_encoders = [self.text_encoder, self.text_encoder_2]
-
-        tokenizer = clip_tokenizers[clip_model_index]
-        text_encoder = clip_text_encoders[clip_model_index]
-
-        prompt = [prompt] if isinstance(prompt, str) else prompt
-        batch_size = len(prompt)
-
-        text_inputs = tokenizer(
-            prompt,
-            padding="max_length",
-            max_length=self.tokenizer_max_length,
-            truncation=True,
-            return_tensors="pt",
-        )
-
-        text_input_ids = text_inputs.input_ids
-        untruncated_ids = tokenizer(
-            prompt, padding="longest", return_tensors="pt"
-        ).input_ids
-        if untruncated_ids.shape[-1] >= text_input_ids.shape[-1] and not torch.equal(
-            text_input_ids, untruncated_ids
-        ):
-            removed_text = tokenizer.batch_decode(
-                untruncated_ids[:, self.tokenizer_max_length - 1 : -1]
-            )
-            logger.warning(
-                "The following part of your input was truncated because CLIP can only handle sequences up to"
-                f" {self.tokenizer_max_length} tokens: {removed_text}"
-            )
-        prompt_embeds = text_encoder(
-            text_input_ids.to(device), output_hidden_states=True
-        )
-        pooled_prompt_embeds = prompt_embeds[0]
-
-        if clip_skip is None:
-            prompt_embeds = prompt_embeds.hidden_states[-2]
-        else:
-            prompt_embeds = prompt_embeds.hidden_states[-(clip_skip + 2)]
-
-        prompt_embeds = prompt_embeds.to(dtype=self.text_encoder.dtype, device=device)
-
-        _, seq_len, _ = prompt_embeds.shape
-        # duplicate text embeddings for each generation per prompt, using mps friendly method
-        prompt_embeds = prompt_embeds.repeat(1, num_images_per_prompt, 1)
-        prompt_embeds = prompt_embeds.view(
-            batch_size * num_images_per_prompt, seq_len, -1
-        )
-
-        pooled_prompt_embeds = pooled_prompt_embeds.repeat(1, num_images_per_prompt, 1)
-        pooled_prompt_embeds = pooled_prompt_embeds.view(
-            batch_size * num_images_per_prompt, -1
-        )
-
-        return prompt_embeds, pooled_prompt_embeds
-
-    # Copied from diffusers.pipelines.stable_diffusion_3.pipeline_stable_diffusion_3.StableDiffusion3Pipeline.encode_prompt
-    def encode_prompt(
-        self,
-        prompt: Union[str, List[str]],
-        prompt_2: Union[str, List[str]],
-        prompt_3: Union[str, List[str]],
-        device: Optional[torch.device] = None,
-        num_images_per_prompt: int = 1,
-        do_classifier_free_guidance: bool = True,
-        negative_prompt: Optional[Union[str, List[str]]] = None,
-        negative_prompt_2: Optional[Union[str, List[str]]] = None,
-        negative_prompt_3: Optional[Union[str, List[str]]] = None,
-        prompt_embeds: Optional[torch.FloatTensor] = None,
-        negative_prompt_embeds: Optional[torch.FloatTensor] = None,
-        pooled_prompt_embeds: Optional[torch.FloatTensor] = None,
-        negative_pooled_prompt_embeds: Optional[torch.FloatTensor] = None,
-        clip_skip: Optional[int] = None,
-    ):
-        r"""
-
-        Args:
-            prompt (`str` or `List[str]`, *optional*):
-                prompt to be encoded
-            prompt_2 (`str` or `List[str]`, *optional*):
-                The prompt or prompts to be sent to the `tokenizer_2` and `text_encoder_2`. If not defined, `prompt` is
-                used in all text-encoders
-            prompt_3 (`str` or `List[str]`, *optional*):
-                The prompt or prompts to be sent to the `tokenizer_3` and `text_encoder_3`. If not defined, `prompt` is
-                used in all text-encoders
-            device: (`torch.device`):
-                torch device
-            num_images_per_prompt (`int`):
-                number of images that should be generated per prompt
-            do_classifier_free_guidance (`bool`):
-                whether to use classifier free guidance or not
-            negative_prompt (`str` or `List[str]`, *optional*):
-                The prompt or prompts not to guide the image generation. If not defined, one has to pass
-                `negative_prompt_embeds` instead. Ignored when not using guidance (i.e., ignored if `guidance_scale` is
-                less than `1`).
-            negative_prompt_2 (`str` or `List[str]`, *optional*):
-                The prompt or prompts not to guide the image generation to be sent to `tokenizer_2` and
-                `text_encoder_2`. If not defined, `negative_prompt` is used in all the text-encoders.
-            negative_prompt_2 (`str` or `List[str]`, *optional*):
-                The prompt or prompts not to guide the image generation to be sent to `tokenizer_3` and
-                `text_encoder_3`. If not defined, `negative_prompt` is used in both text-encoders
-            prompt_embeds (`torch.FloatTensor`, *optional*):
-                Pre-generated text embeddings. Can be used to easily tweak text inputs, *e.g.* prompt weighting. If not
-                provided, text embeddings will be generated from `prompt` input argument.
-            negative_prompt_embeds (`torch.FloatTensor`, *optional*):
-                Pre-generated negative text embeddings. Can be used to easily tweak text inputs, *e.g.* prompt
-                weighting. If not provided, negative_prompt_embeds will be generated from `negative_prompt` input
-                argument.
-            pooled_prompt_embeds (`torch.FloatTensor`, *optional*):
-                Pre-generated pooled text embeddings. Can be used to easily tweak text inputs, *e.g.* prompt weighting.
-                If not provided, pooled text embeddings will be generated from `prompt` input argument.
-            negative_pooled_prompt_embeds (`torch.FloatTensor`, *optional*):
-                Pre-generated negative pooled text embeddings. Can be used to easily tweak text inputs, *e.g.* prompt
-                weighting. If not provided, pooled negative_prompt_embeds will be generated from `negative_prompt`
-                input argument.
-            clip_skip (`int`, *optional*):
-                Number of layers to be skipped from CLIP while computing the prompt embeddings. A value of 1 means that
-                the output of the pre-final layer will be used for computing the prompt embeddings.
-        """
-        device = device or self._execution_device
-
-        prompt = [prompt] if isinstance(prompt, str) else prompt
-        if prompt is not None:
-            batch_size = len(prompt)
-        else:
-            batch_size = prompt_embeds.shape[0]
-
-        if prompt_embeds is None:
-            prompt_2 = prompt_2 or prompt
-            prompt_2 = [prompt_2] if isinstance(prompt_2, str) else prompt_2
-
-            prompt_3 = prompt_3 or prompt
-            prompt_3 = [prompt_3] if isinstance(prompt_3, str) else prompt_3
-
-            prompt_embed, pooled_prompt_embed = self._get_clip_prompt_embeds(
-                prompt=prompt,
-                device=device,
-                num_images_per_prompt=num_images_per_prompt,
-                clip_skip=clip_skip,
-                clip_model_index=0,
-            )
-            prompt_2_embed, pooled_prompt_2_embed = self._get_clip_prompt_embeds(
-                prompt=prompt_2,
-                device=device,
-                num_images_per_prompt=num_images_per_prompt,
-                clip_skip=clip_skip,
-                clip_model_index=1,
-            )
-            clip_prompt_embeds = torch.cat([prompt_embed, prompt_2_embed], dim=-1)
-
-            t5_prompt_embed = self._get_t5_prompt_embeds(
-                prompt=prompt_3,
-                num_images_per_prompt=num_images_per_prompt,
-                device=device,
-            )
-
-            clip_prompt_embeds = torch.nn.functional.pad(
-                clip_prompt_embeds,
-                (0, t5_prompt_embed.shape[-1] - clip_prompt_embeds.shape[-1]),
-            )
-
-            prompt_embeds = torch.cat([clip_prompt_embeds, t5_prompt_embed], dim=-2)
-            pooled_prompt_embeds = torch.cat(
-                [pooled_prompt_embed, pooled_prompt_2_embed], dim=-1
-            )
-
-        if do_classifier_free_guidance and negative_prompt_embeds is None:
-            negative_prompt = negative_prompt or ""
-            negative_prompt_2 = negative_prompt_2 or negative_prompt
-            negative_prompt_3 = negative_prompt_3 or negative_prompt
-
-            # normalize str to list
-            negative_prompt = (
-                batch_size * [negative_prompt]
-                if isinstance(negative_prompt, str)
-                else negative_prompt
-            )
-            negative_prompt_2 = (
-                batch_size * [negative_prompt_2]
-                if isinstance(negative_prompt_2, str)
-                else negative_prompt_2
-            )
-            negative_prompt_3 = (
-                batch_size * [negative_prompt_3]
-                if isinstance(negative_prompt_3, str)
-                else negative_prompt_3
-            )
-
-            if prompt is not None and type(prompt) is not type(negative_prompt):
-                raise TypeError(
-                    f"`negative_prompt` should be the same type to `prompt`, but got {type(negative_prompt)} !="
-                    f" {type(prompt)}."
-                )
-            elif batch_size != len(negative_prompt):
-                raise ValueError(
-                    f"`negative_prompt`: {negative_prompt} has batch size {len(negative_prompt)}, but `prompt`:"
-                    f" {prompt} has batch size {batch_size}. Please make sure that passed `negative_prompt` matches"
-                    " the batch size of `prompt`."
-                )
-
-            negative_prompt_embed, negative_pooled_prompt_embed = (
-                self._get_clip_prompt_embeds(
-                    negative_prompt,
-                    device=device,
-                    num_images_per_prompt=num_images_per_prompt,
-                    clip_skip=None,
-                    clip_model_index=0,
-                )
-            )
-            negative_prompt_2_embed, negative_pooled_prompt_2_embed = (
-                self._get_clip_prompt_embeds(
-                    negative_prompt_2,
-                    device=device,
-                    num_images_per_prompt=num_images_per_prompt,
-                    clip_skip=None,
-                    clip_model_index=1,
-                )
-            )
-            negative_clip_prompt_embeds = torch.cat(
-                [negative_prompt_embed, negative_prompt_2_embed], dim=-1
-            )
-
-            t5_negative_prompt_embed = self._get_t5_prompt_embeds(
-                prompt=negative_prompt_3,
-                num_images_per_prompt=num_images_per_prompt,
-                device=device,
-            )
-
-            negative_clip_prompt_embeds = torch.nn.functional.pad(
-                negative_clip_prompt_embeds,
-                (
-                    0,
-                    t5_negative_prompt_embed.shape[-1]
-                    - negative_clip_prompt_embeds.shape[-1],
-                ),
-            )
-
-            negative_prompt_embeds = torch.cat(
-                [negative_clip_prompt_embeds, t5_negative_prompt_embed], dim=-2
-            )
-            negative_pooled_prompt_embeds = torch.cat(
-                [negative_pooled_prompt_embed, negative_pooled_prompt_2_embed], dim=-1
-            )
-
-        return (
-            prompt_embeds,
-            negative_prompt_embeds,
-            pooled_prompt_embeds,
-            negative_pooled_prompt_embeds,
-        )
-
-    def check_inputs(
-        self,
-        prompt,
-        prompt_2,
-        prompt_3,
-        strength,
-        negative_prompt=None,
-        negative_prompt_2=None,
-        negative_prompt_3=None,
-        prompt_embeds=None,
-        negative_prompt_embeds=None,
-        pooled_prompt_embeds=None,
-        negative_pooled_prompt_embeds=None,
-        callback_on_step_end_tensor_inputs=None,
-    ):
-        if strength < 0 or strength > 1:
-            raise ValueError(
-                f"The value of strength should in [0.0, 1.0] but is {strength}"
-            )
-
-        if callback_on_step_end_tensor_inputs is not None and not all(
-            k in self._callback_tensor_inputs
-            for k in callback_on_step_end_tensor_inputs
-        ):
-            raise ValueError(
-                f"`callback_on_step_end_tensor_inputs` has to be in {self._callback_tensor_inputs}, but found {[k for k in callback_on_step_end_tensor_inputs if k not in self._callback_tensor_inputs]}"
-            )
-
-        if prompt is not None and prompt_embeds is not None:
-            raise ValueError(
-                f"Cannot forward both `prompt`: {prompt} and `prompt_embeds`: {prompt_embeds}. Please make sure to"
-                " only forward one of the two."
-            )
-        elif prompt_2 is not None and prompt_embeds is not None:
-            raise ValueError(
-                f"Cannot forward both `prompt_2`: {prompt_2} and `prompt_embeds`: {prompt_embeds}. Please make sure to"
-                " only forward one of the two."
-            )
-        elif prompt_3 is not None and prompt_embeds is not None:
-            raise ValueError(
-                f"Cannot forward both `prompt_3`: {prompt_2} and `prompt_embeds`: {prompt_embeds}. Please make sure to"
-                " only forward one of the two."
-            )
-        elif prompt is None and prompt_embeds is None:
-            raise ValueError(
-                "Provide either `prompt` or `prompt_embeds`. Cannot leave both `prompt` and `prompt_embeds` undefined."
-            )
-        elif prompt is not None and (
-            not isinstance(prompt, str) and not isinstance(prompt, list)
-        ):
-            raise ValueError(
-                f"`prompt` has to be of type `str` or `list` but is {type(prompt)}"
-            )
-        elif prompt_2 is not None and (
-            not isinstance(prompt_2, str) and not isinstance(prompt_2, list)
-        ):
-            raise ValueError(
-                f"`prompt_2` has to be of type `str` or `list` but is {type(prompt_2)}"
-            )
-        elif prompt_3 is not None and (
-            not isinstance(prompt_3, str) and not isinstance(prompt_3, list)
-        ):
-            raise ValueError(
-                f"`prompt_3` has to be of type `str` or `list` but is {type(prompt_3)}"
-            )
-
-        if negative_prompt is not None and negative_prompt_embeds is not None:
-            raise ValueError(
-                f"Cannot forward both `negative_prompt`: {negative_prompt} and `negative_prompt_embeds`:"
-                f" {negative_prompt_embeds}. Please make sure to only forward one of the two."
-            )
-        elif negative_prompt_2 is not None and negative_prompt_embeds is not None:
-            raise ValueError(
-                f"Cannot forward both `negative_prompt_2`: {negative_prompt_2} and `negative_prompt_embeds`:"
-                f" {negative_prompt_embeds}. Please make sure to only forward one of the two."
-            )
-        elif negative_prompt_3 is not None and negative_prompt_embeds is not None:
-            raise ValueError(
-                f"Cannot forward both `negative_prompt_3`: {negative_prompt_3} and `negative_prompt_embeds`:"
-                f" {negative_prompt_embeds}. Please make sure to only forward one of the two."
-            )
-
-        if prompt_embeds is not None and negative_prompt_embeds is not None:
-            if prompt_embeds.shape != negative_prompt_embeds.shape:
-                raise ValueError(
-                    "`prompt_embeds` and `negative_prompt_embeds` must have the same shape when passed directly, but"
-                    f" got: `prompt_embeds` {prompt_embeds.shape} != `negative_prompt_embeds`"
-                    f" {negative_prompt_embeds.shape}."
-                )
-
-        if prompt_embeds is not None and pooled_prompt_embeds is None:
-            raise ValueError(
-                "If `prompt_embeds` are provided, `pooled_prompt_embeds` also have to be passed. Make sure to generate `pooled_prompt_embeds` from the same text encoder that was used to generate `prompt_embeds`."
-            )
-
-        if negative_prompt_embeds is not None and negative_pooled_prompt_embeds is None:
-            raise ValueError(
-                "If `negative_prompt_embeds` are provided, `negative_pooled_prompt_embeds` also have to be passed. Make sure to generate `negative_pooled_prompt_embeds` from the same text encoder that was used to generate `negative_prompt_embeds`."
-            )
-
-    def get_timesteps(self, num_inference_steps, strength, device):
-        # get the original timestep using init_timestep
-        init_timestep = min(num_inference_steps * strength, num_inference_steps)
-
-        t_start = int(max(num_inference_steps - init_timestep, 0))
-        timesteps = self.scheduler.timesteps[t_start * self.scheduler.order :]
-        if hasattr(self.scheduler, "set_begin_index"):
-            self.scheduler.set_begin_index(t_start * self.scheduler.order)
-
-        return timesteps, num_inference_steps - t_start
-
-    def prepare_latents(
-        self,
-        image,
-        timestep,
-        batch_size,
-        num_images_per_prompt,
-        dtype,
-        device,
-        generator=None,
-    ):
-        if not isinstance(image, (torch.Tensor, PIL.Image.Image, list)):
-            raise ValueError(
-                f"`image` has to be of type `torch.Tensor`, `PIL.Image.Image` or list but is {type(image)}"
-            )
-
-        image = image.to(device=device, dtype=dtype)
-        if image.shape[1] == self.vae.config.latent_channels:
-            init_latents = image
-
-        batch_size = batch_size * num_images_per_prompt
-        if image.shape[1] == self.vae.config.latent_channels:
-            init_latents = image
-
-        else:
-            if isinstance(generator, list) and len(generator) != batch_size:
-                raise ValueError(
-                    f"You have passed a list of generators of length {len(generator)}, but requested an effective batch"
-                    f" size of {batch_size}. Make sure the batch size matches the length of the generators."
-                )
-
-            elif isinstance(generator, list):
-                init_latents = [
-                    retrieve_latents(
-                        self.vae.encode(image[i : i + 1]), generator=generator[i]
-                    )
-                    for i in range(batch_size)
-                ]
-                init_latents = torch.cat(init_latents, dim=0)
-            else:
-                init_latents = retrieve_latents(
-                    self.vae.encode(image), generator=generator
-                )
-
-            init_latents = (
-                init_latents - self.vae.config.shift_factor
-            ) * self.vae.config.scaling_factor
-
-        if (
-            batch_size > init_latents.shape[0]
-            and batch_size % init_latents.shape[0] == 0
-        ):
-            # expand init_latents for batch_size
-            additional_image_per_prompt = batch_size // init_latents.shape[0]
-            init_latents = torch.cat(
-                [init_latents] * additional_image_per_prompt, dim=0
-            )
-        elif (
-            batch_size > init_latents.shape[0]
-            and batch_size % init_latents.shape[0] != 0
-        ):
-            raise ValueError(
-                f"Cannot duplicate `image` of batch size {init_latents.shape[0]} to {batch_size} text prompts."
-            )
-        else:
-            init_latents = torch.cat([init_latents], dim=0)
-
-        shape = init_latents.shape
-        noise = randn_tensor(shape, generator=generator, device=device, dtype=dtype)
-
-        # get latents
-        init_latents = self.scheduler.scale_noise(init_latents, timestep, noise)
-        latents = init_latents.to(device=device, dtype=dtype)
-
-        return latents
-
-    @property
-    def guidance_scale(self):
-        return self._guidance_scale
-
-    @property
-    def clip_skip(self):
-        return self._clip_skip
-
-    # here `guidance_scale` is defined analog to the guidance weight `w` of equation (2)
-    # of the Imagen paper: https://arxiv.org/pdf/2205.11487.pdf . `guidance_scale = 1`
-    # corresponds to doing no classifier free guidance.
-    @property
-    def do_classifier_free_guidance(self):
-        return self._guidance_scale > 1
-
-    @property
-    def num_timesteps(self):
-        return self._num_timesteps
-
-    @property
-    def interrupt(self):
-        return self._interrupt
-
-    @torch.no_grad()
-    @replace_example_docstring(EXAMPLE_DOC_STRING)
-    def __call__(
-        self,
-        prompt: Union[str, List[str]] = None,
-        prompt_2: Optional[Union[str, List[str]]] = None,
-        prompt_3: Optional[Union[str, List[str]]] = None,
-        image: PipelineImageInput = None,
-        strength: float = 0.6,
-        num_inference_steps: int = 50,
-        timesteps: List[int] = None,
-        guidance_scale: float = 7.0,
-        negative_prompt: Optional[Union[str, List[str]]] = None,
-        negative_prompt_2: Optional[Union[str, List[str]]] = None,
-        negative_prompt_3: Optional[Union[str, List[str]]] = None,
-        num_images_per_prompt: Optional[int] = 1,
-        generator: Optional[Union[torch.Generator, List[torch.Generator]]] = None,
-        latents: Optional[torch.FloatTensor] = None,
-        prompt_embeds: Optional[torch.FloatTensor] = None,
-        negative_prompt_embeds: Optional[torch.FloatTensor] = None,
-        pooled_prompt_embeds: Optional[torch.FloatTensor] = None,
-        negative_pooled_prompt_embeds: Optional[torch.FloatTensor] = None,
-        output_type: Optional[str] = "pil",
-        return_dict: bool = True,
-        clip_skip: Optional[int] = None,
-        callback_on_step_end: Optional[Callable[[int, int, Dict], None]] = None,
-        callback_on_step_end_tensor_inputs: List[str] = ["latents"],
-    ):
-        r"""
-        Function invoked when calling the pipeline for generation.
-
-        Args:
-            prompt (`str` or `List[str]`, *optional*):
-                The prompt or prompts to guide the image generation. If not defined, one has to pass `prompt_embeds`.
-                instead.
-            prompt_2 (`str` or `List[str]`, *optional*):
-                The prompt or prompts to be sent to `tokenizer_2` and `text_encoder_2`. If not defined, `prompt` is
-                will be used instead
-            prompt_3 (`str` or `List[str]`, *optional*):
-                The prompt or prompts to be sent to `tokenizer_3` and `text_encoder_3`. If not defined, `prompt` is
-                will be used instead
-            height (`int`, *optional*, defaults to self.unet.config.sample_size * self.vae_scale_factor):
-                The height in pixels of the generated image. This is set to 1024 by default for the best results.
-            width (`int`, *optional*, defaults to self.unet.config.sample_size * self.vae_scale_factor):
-                The width in pixels of the generated image. This is set to 1024 by default for the best results.
-            num_inference_steps (`int`, *optional*, defaults to 50):
-                The number of denoising steps. More denoising steps usually lead to a higher quality image at the
-                expense of slower inference.
-            timesteps (`List[int]`, *optional*):
-                Custom timesteps to use for the denoising process with schedulers which support a `timesteps` argument
-                in their `set_timesteps` method. If not defined, the default behavior when `num_inference_steps` is
-                passed will be used. Must be in descending order.
-            guidance_scale (`float`, *optional*, defaults to 5.0):
-                Guidance scale as defined in [Classifier-Free Diffusion Guidance](https://arxiv.org/abs/2207.12598).
-                `guidance_scale` is defined as `w` of equation 2. of [Imagen
-                Paper](https://arxiv.org/pdf/2205.11487.pdf). Guidance scale is enabled by setting `guidance_scale >
-                1`. Higher guidance scale encourages to generate images that are closely linked to the text `prompt`,
-                usually at the expense of lower image quality.
-            negative_prompt (`str` or `List[str]`, *optional*):
-                The prompt or prompts not to guide the image generation. If not defined, one has to pass
-                `negative_prompt_embeds` instead. Ignored when not using guidance (i.e., ignored if `guidance_scale` is
-                less than `1`).
-            negative_prompt_2 (`str` or `List[str]`, *optional*):
-                The prompt or prompts not to guide the image generation to be sent to `tokenizer_2` and
-                `text_encoder_2`. If not defined, `negative_prompt` is used instead
-            negative_prompt_3 (`str` or `List[str]`, *optional*):
-                The prompt or prompts not to guide the image generation to be sent to `tokenizer_3` and
-                `text_encoder_3`. If not defined, `negative_prompt` is used instead
-            num_images_per_prompt (`int`, *optional*, defaults to 1):
-                The number of images to generate per prompt.
-            generator (`torch.Generator` or `List[torch.Generator]`, *optional*):
-                One or a list of [torch generator(s)](https://pytorch.org/docs/stable/generated/torch.Generator.html)
-                to make generation deterministic.
-            latents (`torch.FloatTensor`, *optional*):
-                Pre-generated noisy latents, sampled from a Gaussian distribution, to be used as inputs for image
-                generation. Can be used to tweak the same generation with different prompts. If not provided, a latents
-                tensor will ge generated by sampling using the supplied random `generator`.
-            prompt_embeds (`torch.FloatTensor`, *optional*):
-                Pre-generated text embeddings. Can be used to easily tweak text inputs, *e.g.* prompt weighting. If not
-                provided, text embeddings will be generated from `prompt` input argument.
-            negative_prompt_embeds (`torch.FloatTensor`, *optional*):
-                Pre-generated negative text embeddings. Can be used to easily tweak text inputs, *e.g.* prompt
-                weighting. If not provided, negative_prompt_embeds will be generated from `negative_prompt` input
-                argument.
-            pooled_prompt_embeds (`torch.FloatTensor`, *optional*):
-                Pre-generated pooled text embeddings. Can be used to easily tweak text inputs, *e.g.* prompt weighting.
-                If not provided, pooled text embeddings will be generated from `prompt` input argument.
-            negative_pooled_prompt_embeds (`torch.FloatTensor`, *optional*):
-                Pre-generated negative pooled text embeddings. Can be used to easily tweak text inputs, *e.g.* prompt
-                weighting. If not provided, pooled negative_prompt_embeds will be generated from `negative_prompt`
-                input argument.
-            output_type (`str`, *optional*, defaults to `"pil"`):
-                The output format of the generate image. Choose between
-                [PIL](https://pillow.readthedocs.io/en/stable/): `PIL.Image.Image` or `np.array`.
-            return_dict (`bool`, *optional*, defaults to `True`):
-                Whether or not to return a [`~pipelines.stable_diffusion_xl.StableDiffusionXLPipelineOutput`] instead
-                of a plain tuple.
-            callback_on_step_end (`Callable`, *optional*):
-                A function that calls at the end of each denoising steps during the inference. The function is called
-                with the following arguments: `callback_on_step_end(self: DiffusionPipeline, step: int, timestep: int,
-                callback_kwargs: Dict)`. `callback_kwargs` will include a list of all tensors as specified by
-                `callback_on_step_end_tensor_inputs`.
-            callback_on_step_end_tensor_inputs (`List`, *optional*):
-                The list of tensor inputs for the `callback_on_step_end` function. The tensors specified in the list
-                will be passed as `callback_kwargs` argument. You will only be able to include variables listed in the
-                `._callback_tensor_inputs` attribute of your pipeline class.
-
-        Examples:
-
-        Returns:
-            [`~pipelines.stable_diffusion_xl.StableDiffusionXLPipelineOutput`] or `tuple`:
-            [`~pipelines.stable_diffusion_xl.StableDiffusionXLPipelineOutput`] if `return_dict` is True, otherwise a
-            `tuple`. When returning a tuple, the first element is a list with the generated images.
-        """
-
-        # 1. Check inputs. Raise error if not correct
-        self.check_inputs(
-            prompt,
-            prompt_2,
-            prompt_3,
-            strength,
-            negative_prompt=negative_prompt,
-            negative_prompt_2=negative_prompt_2,
-            negative_prompt_3=negative_prompt_3,
-            prompt_embeds=prompt_embeds,
-            negative_prompt_embeds=negative_prompt_embeds,
-            pooled_prompt_embeds=pooled_prompt_embeds,
-            negative_pooled_prompt_embeds=negative_pooled_prompt_embeds,
-            callback_on_step_end_tensor_inputs=callback_on_step_end_tensor_inputs,
-        )
-
-        self._guidance_scale = guidance_scale
-        self._clip_skip = clip_skip
-        self._interrupt = False
-
-        # 2. Define call parameters
-        if prompt is not None and isinstance(prompt, str):
-            batch_size = 1
-        elif prompt is not None and isinstance(prompt, list):
-            batch_size = len(prompt)
-        else:
-            batch_size = prompt_embeds.shape[0]
-
-        device = self._execution_device
-
-        (
-            prompt_embeds,
-            negative_prompt_embeds,
-            pooled_prompt_embeds,
-            negative_pooled_prompt_embeds,
-        ) = self.encode_prompt(
-            prompt=prompt,
-            prompt_2=prompt_2,
-            prompt_3=prompt_3,
-            negative_prompt=negative_prompt,
-            negative_prompt_2=negative_prompt_2,
-            negative_prompt_3=negative_prompt_3,
-            do_classifier_free_guidance=self.do_classifier_free_guidance,
-            prompt_embeds=prompt_embeds,
-            negative_prompt_embeds=negative_prompt_embeds,
-            pooled_prompt_embeds=pooled_prompt_embeds,
-            negative_pooled_prompt_embeds=negative_pooled_prompt_embeds,
-            device=device,
-            clip_skip=self.clip_skip,
-            num_images_per_prompt=num_images_per_prompt,
-        )
-
-        if self.do_classifier_free_guidance:
-            prompt_embeds = torch.cat([negative_prompt_embeds, prompt_embeds], dim=0)
-            pooled_prompt_embeds = torch.cat(
-                [negative_pooled_prompt_embeds, pooled_prompt_embeds], dim=0
-            )
-
-        # 3. Preprocess image
-        image = self.image_processor.preprocess(image)
-
-        # 4. Prepare timesteps
-        timesteps, num_inference_steps = retrieve_timesteps(
-            self.scheduler, num_inference_steps, device, timesteps
-        )
-        timesteps, num_inference_steps = self.get_timesteps(
-            num_inference_steps, strength, device
-        )
-        latent_timestep = timesteps[:1].repeat(batch_size * num_inference_steps)
-
-        # 5. Prepare latent variables
-        if latents is None:
-            latents = self.prepare_latents(
-                image,
-                latent_timestep,
-                batch_size,
-                num_images_per_prompt,
-                prompt_embeds.dtype,
-                device,
-                generator,
-            )
-
-        # 6. Denoising loop
-        num_warmup_steps = max(
-            len(timesteps) - num_inference_steps * self.scheduler.order, 0
-        )
-        self._num_timesteps = len(timesteps)
-        with self.progress_bar(total=num_inference_steps) as progress_bar:
-            for i, t in enumerate(timesteps):
-                if self.interrupt:
-                    continue
-
-                # expand the latents if we are doing classifier free guidance
-                latent_model_input = (
-                    torch.cat([latents] * 2)
-                    if self.do_classifier_free_guidance
-                    else latents
-                )
-                # broadcast to batch dimension in a way that's compatible with ONNX/Core ML
-                timestep = t.expand(latent_model_input.shape[0])
-
-                noise_pred = self.transformer(
-                    hidden_states=latent_model_input,
-                    timestep=timestep,
-                    encoder_hidden_states=prompt_embeds,
-                    pooled_projections=pooled_prompt_embeds,
-                    return_dict=False,
-                )[0]
-
-                # perform guidance
-                if self.do_classifier_free_guidance:
-                    noise_pred_uncond, noise_pred_text = noise_pred.chunk(2)
-                    noise_pred = noise_pred_uncond + self.guidance_scale * (
-                        noise_pred_text - noise_pred_uncond
-                    )
-
-                # compute the previous noisy sample x_t -> x_t-1
-                latents_dtype = latents.dtype
-                latents = self.scheduler.step(
-                    noise_pred, t, latents, return_dict=False
-                )[0]
-
-                if latents.dtype != latents_dtype:
-                    if torch.backends.mps.is_available():
-                        # some platforms (eg. apple mps) misbehave due to a pytorch bug: https://github.com/pytorch/pytorch/pull/99272
-                        latents = latents.to(latents_dtype)
-
-                if callback_on_step_end is not None:
-                    callback_kwargs = {}
-                    for k in callback_on_step_end_tensor_inputs:
-                        callback_kwargs[k] = locals()[k]
-                    callback_outputs = callback_on_step_end(self, i, t, callback_kwargs)
-
-                    latents = callback_outputs.pop("latents", latents)
-                    prompt_embeds = callback_outputs.pop("prompt_embeds", prompt_embeds)
-                    negative_prompt_embeds = callback_outputs.pop(
-                        "negative_prompt_embeds", negative_prompt_embeds
-                    )
-                    negative_pooled_prompt_embeds = callback_outputs.pop(
-                        "negative_pooled_prompt_embeds", negative_pooled_prompt_embeds
-                    )
-
-                # call the callback, if provided
-                if i == len(timesteps) - 1 or (
-                    (i + 1) > num_warmup_steps and (i + 1) % self.scheduler.order == 0
-                ):
-                    progress_bar.update()
-
-                if XLA_AVAILABLE:
-                    xm.mark_step()
-
-        if output_type == "latent":
-            image = latents
-
-        else:
-            latents = (
-                latents / self.vae.config.scaling_factor
-            ) + self.vae.config.shift_factor
-
-            image = self.vae.decode(latents.to(self.vae.dtype), return_dict=False)[0]
-            image = self.image_processor.postprocess(image, output_type=output_type)
-
-        # Offload all models
-        self.maybe_free_model_hooks()
-
-        if not return_dict:
-            return (image,)
-
-        return StableDiffusion3PipelineOutput(images=image)
diff --git a/videotuna/third_party/flux/models/sdxl/pipeline.py b/videotuna/third_party/flux/models/sdxl/pipeline.py
deleted file mode 100644
index 1e6a3bf5..00000000
--- a/videotuna/third_party/flux/models/sdxl/pipeline.py
+++ /dev/null
@@ -1,3039 +0,0 @@
-# Copyright 2024 The HuggingFace Team. All rights reserved.
-#
-# Licensed under the Apache License, Version 2.0 (the "License");
-# you may not use this file except in compliance with the License.
-# You may obtain a copy of the License at
-#
-#     http://www.apache.org/licenses/LICENSE-2.0
-#
-# Unless required by applicable law or agreed to in writing, software
-# distributed under the License is distributed on an "AS IS" BASIS,
-# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
-# See the License for the specific language governing permissions and
-# limitations under the License.
-
-import inspect
-from typing import Any, Callable, Dict, List, Optional, Tuple, Union
-
-import PIL
-import torch
-from diffusers.callbacks import MultiPipelineCallbacks, PipelineCallback
-from diffusers.image_processor import PipelineImageInput, VaeImageProcessor
-from diffusers.loaders import (
-    FromSingleFileMixin,
-    IPAdapterMixin,
-    StableDiffusionXLLoraLoaderMixin,
-    TextualInversionLoaderMixin,
-)
-from diffusers.models import AutoencoderKL, ImageProjection, UNet2DConditionModel
-from diffusers.models.attention_processor import (
-    AttnProcessor2_0,
-    FusedAttnProcessor2_0,
-    XFormersAttnProcessor,
-)
-from diffusers.models.lora import adjust_lora_scale_text_encoder
-from diffusers.pipelines.pipeline_utils import DiffusionPipeline, StableDiffusionMixin
-from diffusers.pipelines.stable_diffusion_xl.pipeline_output import (
-    StableDiffusionXLPipelineOutput,
-)
-from diffusers.schedulers import KarrasDiffusionSchedulers
-from diffusers.utils import (
-    USE_PEFT_BACKEND,
-    deprecate,
-    is_invisible_watermark_available,
-    is_torch_xla_available,
-    logging,
-    replace_example_docstring,
-    scale_lora_layers,
-    unscale_lora_layers,
-)
-from diffusers.utils.torch_utils import randn_tensor
-from transformers import (
-    CLIPImageProcessor,
-    CLIPTextModel,
-    CLIPTextModelWithProjection,
-    CLIPTokenizer,
-    CLIPVisionModelWithProjection,
-)
-
-from videotuna.third_party.flux.training.state_tracker import StateTracker
-
-if is_invisible_watermark_available():
-    from diffusers.pipelines.stable_diffusion_xl.watermark import (
-        StableDiffusionXLWatermarker,
-    )
-
-if is_torch_xla_available():
-    import torch_xla.core.xla_model as xm
-
-    XLA_AVAILABLE = True
-else:
-    XLA_AVAILABLE = False
-
-
-logger = logging.get_logger(__name__)  # pylint: disable=invalid-name
-
-EXAMPLE_DOC_STRING = """
-    Examples:
-        ```py
-        >>> import torch
-        >>> from diffusers import StableDiffusionXLPipeline
-
-        >>> pipe = StableDiffusionXLPipeline.from_pretrained(
-        ...     "stabilityai/stable-diffusion-xl-base-1.0", torch_dtype=torch.float16
-        ... )
-        >>> pipe = pipe.to("cuda")
-
-        >>> prompt = "a photo of an astronaut riding a horse on mars"
-        >>> image = pipe(prompt).images[0]
-        ```
-"""
-
-
-def retrieve_latents(
-    encoder_output: torch.Tensor,
-    generator: Optional[torch.Generator] = None,
-    sample_mode: str = "sample",
-):
-    if hasattr(encoder_output, "latent_dist") and sample_mode == "sample":
-        return encoder_output.latent_dist.sample(generator)
-    elif hasattr(encoder_output, "latent_dist") and sample_mode == "argmax":
-        return encoder_output.latent_dist.mode()
-    elif hasattr(encoder_output, "latents"):
-        return encoder_output.latents
-    else:
-        raise AttributeError("Could not access latents of provided encoder_output")
-
-
-# Copied from diffusers.pipelines.stable_diffusion.pipeline_stable_diffusion.rescale_noise_cfg
-def rescale_noise_cfg(noise_cfg, noise_pred_text, guidance_rescale=0.0):
-    """
-    Rescale `noise_cfg` according to `guidance_rescale`. Based on findings of [Common Diffusion Noise Schedules and
-    Sample Steps are Flawed](https://arxiv.org/pdf/2305.08891.pdf). See Section 3.4
-    """
-    std_text = noise_pred_text.std(
-        dim=list(range(1, noise_pred_text.ndim)), keepdim=True
-    )
-    std_cfg = noise_cfg.std(dim=list(range(1, noise_cfg.ndim)), keepdim=True)
-    # rescale the results from guidance (fixes overexposure)
-    noise_pred_rescaled = noise_cfg * (std_text / std_cfg)
-    # mix with the original results from guidance by factor guidance_rescale to avoid "plain looking" images
-    noise_cfg = (
-        guidance_rescale * noise_pred_rescaled + (1 - guidance_rescale) * noise_cfg
-    )
-    return noise_cfg
-
-
-# Copied from diffusers.pipelines.stable_diffusion.pipeline_stable_diffusion.retrieve_timesteps
-def retrieve_timesteps(
-    scheduler,
-    num_inference_steps: Optional[int] = None,
-    device: Optional[Union[str, torch.device]] = None,
-    timesteps: Optional[List[int]] = None,
-    sigmas: Optional[List[float]] = None,
-    **kwargs,
-):
-    """
-    Calls the scheduler's `set_timesteps` method and retrieves timesteps from the scheduler after the call. Handles
-    custom timesteps. Any kwargs will be supplied to `scheduler.set_timesteps`.
-
-    Args:
-        scheduler (`SchedulerMixin`):
-            The scheduler to get timesteps from.
-        num_inference_steps (`int`):
-            The number of diffusion steps used when generating samples with a pre-trained model. If used, `timesteps`
-            must be `None`.
-        device (`str` or `torch.device`, *optional*):
-            The device to which the timesteps should be moved to. If `None`, the timesteps are not moved.
-        timesteps (`List[int]`, *optional*):
-            Custom timesteps used to override the timestep spacing strategy of the scheduler. If `timesteps` is passed,
-            `num_inference_steps` and `sigmas` must be `None`.
-        sigmas (`List[float]`, *optional*):
-            Custom sigmas used to override the timestep spacing strategy of the scheduler. If `sigmas` is passed,
-            `num_inference_steps` and `timesteps` must be `None`.
-
-    Returns:
-        `Tuple[torch.Tensor, int]`: A tuple where the first element is the timestep schedule from the scheduler and the
-        second element is the number of inference steps.
-    """
-    if timesteps is not None and sigmas is not None:
-        raise ValueError(
-            "Only one of `timesteps` or `sigmas` can be passed. Please choose one to set custom values"
-        )
-    if timesteps is not None:
-        accepts_timesteps = "timesteps" in set(
-            inspect.signature(scheduler.set_timesteps).parameters.keys()
-        )
-        if not accepts_timesteps:
-            raise ValueError(
-                f"The current scheduler class {scheduler.__class__}'s `set_timesteps` does not support custom"
-                f" timestep schedules. Please check whether you are using the correct scheduler."
-            )
-        scheduler.set_timesteps(timesteps=timesteps, device=device, **kwargs)
-        timesteps = scheduler.timesteps
-        num_inference_steps = len(timesteps)
-    elif sigmas is not None:
-        accept_sigmas = "sigmas" in set(
-            inspect.signature(scheduler.set_timesteps).parameters.keys()
-        )
-        if not accept_sigmas:
-            raise ValueError(
-                f"The current scheduler class {scheduler.__class__}'s `set_timesteps` does not support custom"
-                f" sigmas schedules. Please check whether you are using the correct scheduler."
-            )
-        scheduler.set_timesteps(sigmas=sigmas, device=device, **kwargs)
-        timesteps = scheduler.timesteps
-        num_inference_steps = len(timesteps)
-    else:
-        scheduler.set_timesteps(num_inference_steps, device=device, **kwargs)
-        timesteps = scheduler.timesteps
-    return timesteps, num_inference_steps
-
-
-class StableDiffusionXLPipeline(
-    DiffusionPipeline,
-    StableDiffusionMixin,
-    FromSingleFileMixin,
-    StableDiffusionXLLoraLoaderMixin,
-    TextualInversionLoaderMixin,
-    IPAdapterMixin,
-):
-    r"""
-    Pipeline for text-to-image generation using Stable Diffusion XL.
-
-    This model inherits from [`DiffusionPipeline`]. Check the superclass documentation for the generic methods the
-    library implements for all the pipelines (such as downloading or saving, running on a particular device, etc.)
-
-    The pipeline also inherits the following loading methods:
-        - [`~loaders.TextualInversionLoaderMixin.load_textual_inversion`] for loading textual inversion embeddings
-        - [`~loaders.FromSingleFileMixin.from_single_file`] for loading `.ckpt` files
-        - [`~loaders.StableDiffusionXLLoraLoaderMixin.load_lora_weights`] for loading LoRA weights
-        - [`~loaders.StableDiffusionXLLoraLoaderMixin.save_lora_weights`] for saving LoRA weights
-        - [`~loaders.IPAdapterMixin.load_ip_adapter`] for loading IP Adapters
-
-    Args:
-        vae ([`AutoencoderKL`]):
-            Variational Auto-Encoder (VAE) Model to encode and decode images to and from latent representations.
-        text_encoder ([`CLIPTextModel`]):
-            Frozen text-encoder. Stable Diffusion XL uses the text portion of
-            [CLIP](https://huggingface.co/docs/transformers/model_doc/clip#transformers.CLIPTextModel), specifically
-            the [clip-vit-large-patch14](https://huggingface.co/openai/clip-vit-large-patch14) variant.
-        text_encoder_2 ([` CLIPTextModelWithProjection`]):
-            Second frozen text-encoder. Stable Diffusion XL uses the text and pool portion of
-            [CLIP](https://huggingface.co/docs/transformers/model_doc/clip#transformers.CLIPTextModelWithProjection),
-            specifically the
-            [laion/CLIP-ViT-bigG-14-laion2B-39B-b160k](https://huggingface.co/laion/CLIP-ViT-bigG-14-laion2B-39B-b160k)
-            variant.
-        tokenizer (`CLIPTokenizer`):
-            Tokenizer of class
-            [CLIPTokenizer](https://huggingface.co/docs/transformers/v4.21.0/en/model_doc/clip#transformers.CLIPTokenizer).
-        tokenizer_2 (`CLIPTokenizer`):
-            Second Tokenizer of class
-            [CLIPTokenizer](https://huggingface.co/docs/transformers/v4.21.0/en/model_doc/clip#transformers.CLIPTokenizer).
-        unet ([`UNet2DConditionModel`]): Conditional U-Net architecture to denoise the encoded image latents.
-        scheduler ([`SchedulerMixin`]):
-            A scheduler to be used in combination with `unet` to denoise the encoded image latents. Can be one of
-            [`DDIMScheduler`], [`LMSDiscreteScheduler`], or [`PNDMScheduler`].
-        force_zeros_for_empty_prompt (`bool`, *optional*, defaults to `"True"`):
-            Whether the negative prompt embeddings shall be forced to always be set to 0. Also see the config of
-            `stabilityai/stable-diffusion-xl-base-1-0`.
-        add_watermarker (`bool`, *optional*):
-            Whether to use the [invisible_watermark library](https://github.com/ShieldMnt/invisible-watermark/) to
-            watermark output images. If not defined, it will default to True if the package is installed, otherwise no
-            watermarker will be used.
-    """
-
-    model_cpu_offload_seq = "text_encoder->text_encoder_2->image_encoder->unet->vae"
-    _optional_components = [
-        "tokenizer",
-        "tokenizer_2",
-        "text_encoder",
-        "text_encoder_2",
-        "image_encoder",
-        "feature_extractor",
-    ]
-    _callback_tensor_inputs = [
-        "latents",
-        "prompt_embeds",
-        "negative_prompt_embeds",
-        "add_text_embeds",
-        "add_time_ids",
-        "negative_pooled_prompt_embeds",
-        "negative_add_time_ids",
-    ]
-
-    def __init__(
-        self,
-        vae: AutoencoderKL,
-        text_encoder: CLIPTextModel,
-        text_encoder_2: CLIPTextModelWithProjection,
-        tokenizer: CLIPTokenizer,
-        tokenizer_2: CLIPTokenizer,
-        unet: UNet2DConditionModel,
-        scheduler: KarrasDiffusionSchedulers,
-        image_encoder: CLIPVisionModelWithProjection = None,
-        feature_extractor: CLIPImageProcessor = None,
-        force_zeros_for_empty_prompt: bool = True,
-        add_watermarker: Optional[bool] = None,
-    ):
-        super().__init__()
-
-        self.register_modules(
-            vae=vae,
-            text_encoder=text_encoder,
-            text_encoder_2=text_encoder_2,
-            tokenizer=tokenizer,
-            tokenizer_2=tokenizer_2,
-            unet=unet,
-            scheduler=scheduler,
-            image_encoder=image_encoder,
-            feature_extractor=feature_extractor,
-        )
-        self.register_to_config(
-            force_zeros_for_empty_prompt=force_zeros_for_empty_prompt
-        )
-        self.vae_scale_factor = 2 ** (len(self.vae.config.block_out_channels) - 1)
-        self.image_processor = VaeImageProcessor(vae_scale_factor=self.vae_scale_factor)
-
-        self.default_sample_size = self.unet.config.sample_size
-
-        add_watermarker = (
-            add_watermarker
-            if add_watermarker is not None
-            else is_invisible_watermark_available()
-        )
-
-        if add_watermarker:
-            self.watermark = StableDiffusionXLWatermarker()
-        else:
-            self.watermark = None
-
-    def encode_prompt(
-        self,
-        prompt: str,
-        prompt_2: Optional[str] = None,
-        device: Optional[torch.device] = None,
-        num_images_per_prompt: int = 1,
-        do_classifier_free_guidance: bool = True,
-        negative_prompt: Optional[str] = None,
-        negative_prompt_2: Optional[str] = None,
-        prompt_embeds: Optional[torch.FloatTensor] = None,
-        negative_prompt_embeds: Optional[torch.FloatTensor] = None,
-        pooled_prompt_embeds: Optional[torch.FloatTensor] = None,
-        negative_pooled_prompt_embeds: Optional[torch.FloatTensor] = None,
-        lora_scale: Optional[float] = None,
-        clip_skip: Optional[int] = None,
-    ):
-        r"""
-        Encodes the prompt into text encoder hidden states.
-
-        Args:
-            prompt (`str` or `List[str]`, *optional*):
-                prompt to be encoded
-            prompt_2 (`str` or `List[str]`, *optional*):
-                The prompt or prompts to be sent to the `tokenizer_2` and `text_encoder_2`. If not defined, `prompt` is
-                used in both text-encoders
-            device: (`torch.device`):
-                torch device
-            num_images_per_prompt (`int`):
-                number of images that should be generated per prompt
-            do_classifier_free_guidance (`bool`):
-                whether to use classifier free guidance or not
-            negative_prompt (`str` or `List[str]`, *optional*):
-                The prompt or prompts not to guide the image generation. If not defined, one has to pass
-                `negative_prompt_embeds` instead. Ignored when not using guidance (i.e., ignored if `guidance_scale` is
-                less than `1`).
-            negative_prompt_2 (`str` or `List[str]`, *optional*):
-                The prompt or prompts not to guide the image generation to be sent to `tokenizer_2` and
-                `text_encoder_2`. If not defined, `negative_prompt` is used in both text-encoders
-            prompt_embeds (`torch.FloatTensor`, *optional*):
-                Pre-generated text embeddings. Can be used to easily tweak text inputs, *e.g.* prompt weighting. If not
-                provided, text embeddings will be generated from `prompt` input argument.
-            negative_prompt_embeds (`torch.FloatTensor`, *optional*):
-                Pre-generated negative text embeddings. Can be used to easily tweak text inputs, *e.g.* prompt
-                weighting. If not provided, negative_prompt_embeds will be generated from `negative_prompt` input
-                argument.
-            pooled_prompt_embeds (`torch.FloatTensor`, *optional*):
-                Pre-generated pooled text embeddings. Can be used to easily tweak text inputs, *e.g.* prompt weighting.
-                If not provided, pooled text embeddings will be generated from `prompt` input argument.
-            negative_pooled_prompt_embeds (`torch.FloatTensor`, *optional*):
-                Pre-generated negative pooled text embeddings. Can be used to easily tweak text inputs, *e.g.* prompt
-                weighting. If not provided, pooled negative_prompt_embeds will be generated from `negative_prompt`
-                input argument.
-            lora_scale (`float`, *optional*):
-                A lora scale that will be applied to all LoRA layers of the text encoder if LoRA layers are loaded.
-            clip_skip (`int`, *optional*):
-                Number of layers to be skipped from CLIP while computing the prompt embeddings. A value of 1 means that
-                the output of the pre-final layer will be used for computing the prompt embeddings.
-        """
-        device = device or self._execution_device
-
-        # set lora scale so that monkey patched LoRA
-        # function of text encoder can correctly access it
-        if lora_scale is not None and isinstance(
-            self, StableDiffusionXLLoraLoaderMixin
-        ):
-            self._lora_scale = lora_scale
-
-            # dynamically adjust the LoRA scale
-            if self.text_encoder is not None:
-                if not USE_PEFT_BACKEND:
-                    adjust_lora_scale_text_encoder(self.text_encoder, lora_scale)
-                else:
-                    scale_lora_layers(self.text_encoder, lora_scale)
-
-            if self.text_encoder_2 is not None:
-                if not USE_PEFT_BACKEND:
-                    adjust_lora_scale_text_encoder(self.text_encoder_2, lora_scale)
-                else:
-                    scale_lora_layers(self.text_encoder_2, lora_scale)
-
-        prompt = [prompt] if isinstance(prompt, str) else prompt
-
-        if prompt is not None:
-            batch_size = len(prompt)
-        else:
-            batch_size = prompt_embeds.shape[0]
-
-        # Define tokenizers and text encoders
-        tokenizers = (
-            [self.tokenizer, self.tokenizer_2]
-            if self.tokenizer is not None
-            else [self.tokenizer_2]
-        )
-        text_encoders = (
-            [self.text_encoder, self.text_encoder_2]
-            if self.text_encoder is not None
-            else [self.text_encoder_2]
-        )
-
-        if prompt_embeds is None:
-            prompt_2 = prompt_2 or prompt
-            prompt_2 = [prompt_2] if isinstance(prompt_2, str) else prompt_2
-
-            # textual inversion: process multi-vector tokens if necessary
-            prompt_embeds_list = []
-            prompts = [prompt, prompt_2]
-            for prompt, tokenizer, text_encoder in zip(
-                prompts, tokenizers, text_encoders
-            ):
-                if isinstance(self, TextualInversionLoaderMixin):
-                    prompt = self.maybe_convert_prompt(prompt, tokenizer)
-
-                text_inputs = tokenizer(
-                    prompt,
-                    padding="max_length",
-                    max_length=tokenizer.model_max_length,
-                    truncation=True,
-                    return_tensors="pt",
-                )
-
-                text_input_ids = text_inputs.input_ids
-                untruncated_ids = tokenizer(
-                    prompt, padding="longest", return_tensors="pt"
-                ).input_ids
-
-                if untruncated_ids.shape[-1] >= text_input_ids.shape[
-                    -1
-                ] and not torch.equal(text_input_ids, untruncated_ids):
-                    removed_text = tokenizer.batch_decode(
-                        untruncated_ids[:, tokenizer.model_max_length - 1 : -1]
-                    )
-                    logger.warning(
-                        "The following part of your input was truncated because CLIP can only handle sequences up to"
-                        f" {tokenizer.model_max_length} tokens: {removed_text}"
-                    )
-
-                prompt_embeds = text_encoder(
-                    text_input_ids.to(device), output_hidden_states=True
-                )
-
-                # We are only ALWAYS interested in the pooled output of the final text encoder
-                pooled_prompt_embeds = (
-                    prompt_embeds[0]
-                    if pooled_prompt_embeds is None
-                    else pooled_prompt_embeds
-                )
-                if clip_skip is None:
-                    prompt_embeds = prompt_embeds.hidden_states[-2]
-                else:
-                    # "2" because SDXL always indexes from the penultimate layer.
-                    prompt_embeds = prompt_embeds.hidden_states[-(clip_skip + 2)]
-
-                prompt_embeds_list.append(prompt_embeds)
-
-            prompt_embeds = torch.concat(prompt_embeds_list, dim=-1)
-
-        # get unconditional embeddings for classifier free guidance
-        zero_out_negative_prompt = (
-            negative_prompt is None and self.config.force_zeros_for_empty_prompt
-        )
-        if (
-            do_classifier_free_guidance
-            and negative_prompt_embeds is None
-            and zero_out_negative_prompt
-        ):
-            negative_prompt_embeds = torch.zeros_like(prompt_embeds)
-            negative_pooled_prompt_embeds = torch.zeros_like(pooled_prompt_embeds)
-        elif do_classifier_free_guidance and negative_prompt_embeds is None:
-            negative_prompt = negative_prompt or ""
-            negative_prompt_2 = negative_prompt_2 or negative_prompt
-
-            # normalize str to list
-            negative_prompt = (
-                batch_size * [negative_prompt]
-                if isinstance(negative_prompt, str)
-                else negative_prompt
-            )
-            negative_prompt_2 = (
-                batch_size * [negative_prompt_2]
-                if isinstance(negative_prompt_2, str)
-                else negative_prompt_2
-            )
-
-            uncond_tokens: List[str]
-            if prompt is not None and type(prompt) is not type(negative_prompt):
-                raise TypeError(
-                    f"`negative_prompt` should be the same type to `prompt`, but got {type(negative_prompt)} !="
-                    f" {type(prompt)}."
-                )
-            elif batch_size != len(negative_prompt):
-                raise ValueError(
-                    f"`negative_prompt`: {negative_prompt} has batch size {len(negative_prompt)}, but `prompt`:"
-                    f" {prompt} has batch size {batch_size}. Please make sure that passed `negative_prompt` matches"
-                    " the batch size of `prompt`."
-                )
-            else:
-                uncond_tokens = [negative_prompt, negative_prompt_2]
-
-            negative_prompt_embeds_list = []
-            for negative_prompt, tokenizer, text_encoder in zip(
-                uncond_tokens, tokenizers, text_encoders
-            ):
-                if isinstance(self, TextualInversionLoaderMixin):
-                    negative_prompt = self.maybe_convert_prompt(
-                        negative_prompt, tokenizer
-                    )
-
-                max_length = prompt_embeds.shape[1]
-                uncond_input = tokenizer(
-                    negative_prompt,
-                    padding="max_length",
-                    max_length=max_length,
-                    truncation=True,
-                    return_tensors="pt",
-                )
-
-                negative_prompt_embeds = text_encoder(
-                    uncond_input.input_ids.to(device),
-                    output_hidden_states=True,
-                )
-                # We are only ALWAYS interested in the pooled output of the final text encoder
-                negative_pooled_prompt_embeds = negative_prompt_embeds[0]
-                negative_prompt_embeds = negative_prompt_embeds.hidden_states[-2]
-
-                negative_prompt_embeds_list.append(negative_prompt_embeds)
-
-            negative_prompt_embeds = torch.concat(negative_prompt_embeds_list, dim=-1)
-
-        if self.text_encoder_2 is not None:
-            prompt_embeds = prompt_embeds.to(
-                dtype=self.text_encoder_2.dtype, device=device
-            )
-        else:
-            prompt_embeds = prompt_embeds.to(dtype=self.unet.dtype, device=device)
-
-        bs_embed, seq_len, _ = prompt_embeds.shape
-        # duplicate text embeddings for each generation per prompt, using mps friendly method
-        prompt_embeds = prompt_embeds.repeat(1, num_images_per_prompt, 1)
-        prompt_embeds = prompt_embeds.view(
-            bs_embed * num_images_per_prompt, seq_len, -1
-        )
-
-        if do_classifier_free_guidance:
-            # duplicate unconditional embeddings for each generation per prompt, using mps friendly method
-            seq_len = negative_prompt_embeds.shape[1]
-
-            if self.text_encoder_2 is not None:
-                negative_prompt_embeds = negative_prompt_embeds.to(
-                    dtype=self.text_encoder_2.dtype, device=device
-                )
-            else:
-                negative_prompt_embeds = negative_prompt_embeds.to(
-                    dtype=self.unet.dtype, device=device
-                )
-
-            negative_prompt_embeds = negative_prompt_embeds.repeat(
-                1, num_images_per_prompt, 1
-            )
-            negative_prompt_embeds = negative_prompt_embeds.view(
-                batch_size * num_images_per_prompt, seq_len, -1
-            )
-
-        pooled_prompt_embeds = pooled_prompt_embeds.repeat(
-            1, num_images_per_prompt
-        ).view(bs_embed * num_images_per_prompt, -1)
-        if do_classifier_free_guidance:
-            negative_pooled_prompt_embeds = negative_pooled_prompt_embeds.repeat(
-                1, num_images_per_prompt
-            ).view(bs_embed * num_images_per_prompt, -1)
-
-        if self.text_encoder is not None:
-            if isinstance(self, StableDiffusionXLLoraLoaderMixin) and USE_PEFT_BACKEND:
-                # Retrieve the original scale by scaling back the LoRA layers
-                unscale_lora_layers(self.text_encoder, lora_scale)
-
-        if self.text_encoder_2 is not None:
-            if isinstance(self, StableDiffusionXLLoraLoaderMixin) and USE_PEFT_BACKEND:
-                # Retrieve the original scale by scaling back the LoRA layers
-                unscale_lora_layers(self.text_encoder_2, lora_scale)
-
-        return (
-            prompt_embeds,
-            negative_prompt_embeds,
-            pooled_prompt_embeds,
-            negative_pooled_prompt_embeds,
-        )
-
-    # Copied from diffusers.pipelines.stable_diffusion.pipeline_stable_diffusion.StableDiffusionPipeline.encode_image
-    def encode_image(
-        self, image, device, num_images_per_prompt, output_hidden_states=None
-    ):
-        dtype = next(self.image_encoder.parameters()).dtype
-
-        if not isinstance(image, torch.Tensor):
-            image = self.feature_extractor(image, return_tensors="pt").pixel_values
-
-        image = image.to(device=device, dtype=dtype)
-        if output_hidden_states:
-            image_enc_hidden_states = self.image_encoder(
-                image, output_hidden_states=True
-            ).hidden_states[-2]
-            image_enc_hidden_states = image_enc_hidden_states.repeat_interleave(
-                num_images_per_prompt, dim=0
-            )
-            uncond_image_enc_hidden_states = self.image_encoder(
-                torch.zeros_like(image), output_hidden_states=True
-            ).hidden_states[-2]
-            uncond_image_enc_hidden_states = (
-                uncond_image_enc_hidden_states.repeat_interleave(
-                    num_images_per_prompt, dim=0
-                )
-            )
-            return image_enc_hidden_states, uncond_image_enc_hidden_states
-        else:
-            image_embeds = self.image_encoder(image).image_embeds
-            image_embeds = image_embeds.repeat_interleave(num_images_per_prompt, dim=0)
-            uncond_image_embeds = torch.zeros_like(image_embeds)
-
-            return image_embeds, uncond_image_embeds
-
-    # Copied from diffusers.pipelines.stable_diffusion.pipeline_stable_diffusion.StableDiffusionPipeline.prepare_ip_adapter_image_embeds
-    def prepare_ip_adapter_image_embeds(
-        self,
-        ip_adapter_image,
-        ip_adapter_image_embeds,
-        device,
-        num_images_per_prompt,
-        do_classifier_free_guidance,
-    ):
-        if ip_adapter_image_embeds is None:
-            if not isinstance(ip_adapter_image, list):
-                ip_adapter_image = [ip_adapter_image]
-
-            if len(ip_adapter_image) != len(
-                self.unet.encoder_hid_proj.image_projection_layers
-            ):
-                raise ValueError(
-                    f"`ip_adapter_image` must have same length as the number of IP Adapters. Got {len(ip_adapter_image)} images and {len(self.unet.encoder_hid_proj.image_projection_layers)} IP Adapters."
-                )
-
-            image_embeds = []
-            for single_ip_adapter_image, image_proj_layer in zip(
-                ip_adapter_image, self.unet.encoder_hid_proj.image_projection_layers
-            ):
-                output_hidden_state = not isinstance(image_proj_layer, ImageProjection)
-                single_image_embeds, single_negative_image_embeds = self.encode_image(
-                    single_ip_adapter_image, device, 1, output_hidden_state
-                )
-                single_image_embeds = torch.stack(
-                    [single_image_embeds] * num_images_per_prompt, dim=0
-                )
-                single_negative_image_embeds = torch.stack(
-                    [single_negative_image_embeds] * num_images_per_prompt, dim=0
-                )
-
-                if do_classifier_free_guidance:
-                    single_image_embeds = torch.cat(
-                        [single_negative_image_embeds, single_image_embeds]
-                    )
-                    single_image_embeds = single_image_embeds.to(device)
-
-                image_embeds.append(single_image_embeds)
-        else:
-            repeat_dims = [1]
-            image_embeds = []
-            for single_image_embeds in ip_adapter_image_embeds:
-                if do_classifier_free_guidance:
-                    single_negative_image_embeds, single_image_embeds = (
-                        single_image_embeds.chunk(2)
-                    )
-                    single_image_embeds = single_image_embeds.repeat(
-                        num_images_per_prompt,
-                        *(repeat_dims * len(single_image_embeds.shape[1:])),
-                    )
-                    single_negative_image_embeds = single_negative_image_embeds.repeat(
-                        num_images_per_prompt,
-                        *(repeat_dims * len(single_negative_image_embeds.shape[1:])),
-                    )
-                    single_image_embeds = torch.cat(
-                        [single_negative_image_embeds, single_image_embeds]
-                    )
-                else:
-                    single_image_embeds = single_image_embeds.repeat(
-                        num_images_per_prompt,
-                        *(repeat_dims * len(single_image_embeds.shape[1:])),
-                    )
-                image_embeds.append(single_image_embeds)
-
-        return image_embeds
-
-    # Copied from diffusers.pipelines.stable_diffusion.pipeline_stable_diffusion.StableDiffusionPipeline.prepare_extra_step_kwargs
-    def prepare_extra_step_kwargs(self, generator, eta):
-        # prepare extra kwargs for the scheduler step, since not all schedulers have the same signature
-        # eta (η) is only used with the DDIMScheduler, it will be ignored for other schedulers.
-        # eta corresponds to η in DDIM paper: https://arxiv.org/abs/2010.02502
-        # and should be between [0, 1]
-
-        accepts_eta = "eta" in set(
-            inspect.signature(self.scheduler.step).parameters.keys()
-        )
-        extra_step_kwargs = {}
-        if accepts_eta:
-            extra_step_kwargs["eta"] = eta
-
-        # check if the scheduler accepts generator
-        accepts_generator = "generator" in set(
-            inspect.signature(self.scheduler.step).parameters.keys()
-        )
-        if accepts_generator:
-            extra_step_kwargs["generator"] = generator
-        return extra_step_kwargs
-
-    def check_inputs(
-        self,
-        prompt,
-        prompt_2,
-        height,
-        width,
-        callback_steps,
-        negative_prompt=None,
-        negative_prompt_2=None,
-        prompt_embeds=None,
-        negative_prompt_embeds=None,
-        pooled_prompt_embeds=None,
-        negative_pooled_prompt_embeds=None,
-        ip_adapter_image=None,
-        ip_adapter_image_embeds=None,
-        callback_on_step_end_tensor_inputs=None,
-    ):
-        if height % 8 != 0 or width % 8 != 0:
-            raise ValueError(
-                f"`height` and `width` have to be divisible by 8 but are {height} and {width}."
-            )
-
-        if callback_steps is not None and (
-            not isinstance(callback_steps, int) or callback_steps <= 0
-        ):
-            raise ValueError(
-                f"`callback_steps` has to be a positive integer but is {callback_steps} of type"
-                f" {type(callback_steps)}."
-            )
-
-        if callback_on_step_end_tensor_inputs is not None and not all(
-            k in self._callback_tensor_inputs
-            for k in callback_on_step_end_tensor_inputs
-        ):
-            raise ValueError(
-                f"`callback_on_step_end_tensor_inputs` has to be in {self._callback_tensor_inputs}, but found {[k for k in callback_on_step_end_tensor_inputs if k not in self._callback_tensor_inputs]}"
-            )
-
-        if prompt is not None and prompt_embeds is not None:
-            raise ValueError(
-                f"Cannot forward both `prompt`: {prompt} and `prompt_embeds`: {prompt_embeds}. Please make sure to"
-                " only forward one of the two."
-            )
-        elif prompt_2 is not None and prompt_embeds is not None:
-            raise ValueError(
-                f"Cannot forward both `prompt_2`: {prompt_2} and `prompt_embeds`: {prompt_embeds}. Please make sure to"
-                " only forward one of the two."
-            )
-        elif prompt is None and prompt_embeds is None:
-            raise ValueError(
-                "Provide either `prompt` or `prompt_embeds`. Cannot leave both `prompt` and `prompt_embeds` undefined."
-            )
-        elif prompt is not None and (
-            not isinstance(prompt, str) and not isinstance(prompt, list)
-        ):
-            raise ValueError(
-                f"`prompt` has to be of type `str` or `list` but is {type(prompt)}"
-            )
-        elif prompt_2 is not None and (
-            not isinstance(prompt_2, str) and not isinstance(prompt_2, list)
-        ):
-            raise ValueError(
-                f"`prompt_2` has to be of type `str` or `list` but is {type(prompt_2)}"
-            )
-
-        if negative_prompt is not None and negative_prompt_embeds is not None:
-            raise ValueError(
-                f"Cannot forward both `negative_prompt`: {negative_prompt} and `negative_prompt_embeds`:"
-                f" {negative_prompt_embeds}. Please make sure to only forward one of the two."
-            )
-        elif negative_prompt_2 is not None and negative_prompt_embeds is not None:
-            raise ValueError(
-                f"Cannot forward both `negative_prompt_2`: {negative_prompt_2} and `negative_prompt_embeds`:"
-                f" {negative_prompt_embeds}. Please make sure to only forward one of the two."
-            )
-
-        if prompt_embeds is not None and negative_prompt_embeds is not None:
-            if prompt_embeds.shape != negative_prompt_embeds.shape:
-                raise ValueError(
-                    "`prompt_embeds` and `negative_prompt_embeds` must have the same shape when passed directly, but"
-                    f" got: `prompt_embeds` {prompt_embeds.shape} != `negative_prompt_embeds`"
-                    f" {negative_prompt_embeds.shape}."
-                )
-
-        if prompt_embeds is not None and pooled_prompt_embeds is None:
-            raise ValueError(
-                "If `prompt_embeds` are provided, `pooled_prompt_embeds` also have to be passed. Make sure to generate `pooled_prompt_embeds` from the same text encoder that was used to generate `prompt_embeds`."
-            )
-
-        if negative_prompt_embeds is not None and negative_pooled_prompt_embeds is None:
-            raise ValueError(
-                "If `negative_prompt_embeds` are provided, `negative_pooled_prompt_embeds` also have to be passed. Make sure to generate `negative_pooled_prompt_embeds` from the same text encoder that was used to generate `negative_prompt_embeds`."
-            )
-
-        if ip_adapter_image is not None and ip_adapter_image_embeds is not None:
-            raise ValueError(
-                "Provide either `ip_adapter_image` or `ip_adapter_image_embeds`. Cannot leave both `ip_adapter_image` and `ip_adapter_image_embeds` defined."
-            )
-
-        if ip_adapter_image_embeds is not None:
-            if not isinstance(ip_adapter_image_embeds, list):
-                raise ValueError(
-                    f"`ip_adapter_image_embeds` has to be of type `list` but is {type(ip_adapter_image_embeds)}"
-                )
-            elif ip_adapter_image_embeds[0].ndim not in [3, 4]:
-                raise ValueError(
-                    f"`ip_adapter_image_embeds` has to be a list of 3D or 4D tensors but is {ip_adapter_image_embeds[0].ndim}D"
-                )
-
-    # Copied from diffusers.pipelines.stable_diffusion.pipeline_stable_diffusion.StableDiffusionPipeline.prepare_latents
-    def prepare_latents(
-        self,
-        batch_size,
-        num_channels_latents,
-        height,
-        width,
-        dtype,
-        device,
-        generator,
-        latents=None,
-    ):
-        shape = (
-            batch_size,
-            num_channels_latents,
-            height // self.vae_scale_factor,
-            width // self.vae_scale_factor,
-        )
-        if isinstance(generator, list) and len(generator) != batch_size:
-            raise ValueError(
-                f"You have passed a list of generators of length {len(generator)}, but requested an effective batch"
-                f" size of {batch_size}. Make sure the batch size matches the length of the generators."
-            )
-
-        if latents is None:
-            latents = randn_tensor(
-                shape, generator=generator, device=device, dtype=dtype
-            )
-        else:
-            latents = latents.to(device)
-
-        # scale the initial noise by the standard deviation required by the scheduler
-        latents = latents * self.scheduler.init_noise_sigma
-        return latents
-
-    def _get_add_time_ids(
-        self,
-        original_size,
-        crops_coords_top_left,
-        target_size,
-        dtype,
-        text_encoder_projection_dim=None,
-    ):
-        if StateTracker.is_sdxl_refiner():
-            add_time_ids = list(
-                original_size
-                + crops_coords_top_left
-                + (StateTracker.get_args().data_aesthetic_score,)
-            )
-        else:
-            add_time_ids = list(original_size + crops_coords_top_left + target_size)
-
-        passed_add_embed_dim = (
-            self.unet.config.addition_time_embed_dim * len(add_time_ids)
-            + text_encoder_projection_dim
-        )
-        expected_add_embed_dim = self.unet.add_embedding.linear_1.in_features
-
-        if expected_add_embed_dim != passed_add_embed_dim:
-            raise ValueError(
-                f"Model expects an added time embedding vector of length {expected_add_embed_dim}, but a vector of {passed_add_embed_dim} was created. The model has an incorrect config. Please check `unet.config.time_embedding_type` and `text_encoder_2.config.projection_dim`."
-            )
-
-        add_time_ids = torch.tensor([add_time_ids], dtype=dtype)
-        return add_time_ids
-
-    def upcast_vae(self):
-        dtype = self.vae.dtype
-        self.vae.to(dtype=torch.float32)
-        use_torch_2_0_or_xformers = isinstance(
-            self.vae.decoder.mid_block.attentions[0].processor,
-            (
-                AttnProcessor2_0,
-                XFormersAttnProcessor,
-                FusedAttnProcessor2_0,
-            ),
-        )
-        # if xformers or torch_2_0 is used attention block does not need
-        # to be in float32 which can save lots of memory
-        if use_torch_2_0_or_xformers:
-            self.vae.post_quant_conv.to(dtype)
-            self.vae.decoder.conv_in.to(dtype)
-            self.vae.decoder.mid_block.to(dtype)
-
-    # Copied from diffusers.pipelines.latent_consistency_models.pipeline_latent_consistency_text2img.LatentConsistencyModelPipeline.get_guidance_scale_embedding
-    def get_guidance_scale_embedding(
-        self,
-        w: torch.Tensor,
-        embedding_dim: int = 512,
-        dtype: torch.dtype = torch.float32,
-    ) -> torch.FloatTensor:
-        """
-        See https://github.com/google-research/vdm/blob/dc27b98a554f65cdc654b800da5aa1846545d41b/model_vdm.py#L298
-
-        Args:
-            w (`torch.Tensor`):
-                Generate embedding vectors with a specified guidance scale to subsequently enrich timestep embeddings.
-            embedding_dim (`int`, *optional*, defaults to 512):
-                Dimension of the embeddings to generate.
-            dtype (`torch.dtype`, *optional*, defaults to `torch.float32`):
-                Data type of the generated embeddings.
-
-        Returns:
-            `torch.FloatTensor`: Embedding vectors with shape `(len(w), embedding_dim)`.
-        """
-        assert len(w.shape) == 1
-        w = w * 1000.0
-
-        half_dim = embedding_dim // 2
-        emb = torch.log(torch.tensor(10000.0)) / (half_dim - 1)
-        emb = torch.exp(torch.arange(half_dim, dtype=dtype) * -emb)
-        emb = w.to(dtype)[:, None] * emb[None, :]
-        emb = torch.cat([torch.sin(emb), torch.cos(emb)], dim=1)
-        if embedding_dim % 2 == 1:  # zero pad
-            emb = torch.nn.functional.pad(emb, (0, 1))
-        assert emb.shape == (w.shape[0], embedding_dim)
-        return emb
-
-    @property
-    def guidance_scale(self):
-        return self._guidance_scale
-
-    @property
-    def guidance_rescale(self):
-        return self._guidance_rescale
-
-    @property
-    def clip_skip(self):
-        return self._clip_skip
-
-    # here `guidance_scale` is defined analog to the guidance weight `w` of equation (2)
-    # of the Imagen paper: https://arxiv.org/pdf/2205.11487.pdf . `guidance_scale = 1`
-    # corresponds to doing no classifier free guidance.
-    @property
-    def do_classifier_free_guidance(self):
-        return self._guidance_scale > 1 and self.unet.config.time_cond_proj_dim is None
-
-    @property
-    def cross_attention_kwargs(self):
-        return self._cross_attention_kwargs
-
-    @property
-    def denoising_end(self):
-        return self._denoising_end
-
-    @property
-    def num_timesteps(self):
-        return self._num_timesteps
-
-    @property
-    def interrupt(self):
-        return self._interrupt
-
-    @torch.no_grad()
-    @replace_example_docstring(EXAMPLE_DOC_STRING)
-    def __call__(
-        self,
-        prompt: Union[str, List[str]] = None,
-        prompt_2: Optional[Union[str, List[str]]] = None,
-        height: Optional[int] = None,
-        width: Optional[int] = None,
-        num_inference_steps: int = 50,
-        timesteps: List[int] = None,
-        denoising_end: Optional[float] = None,
-        guidance_scale: float = 5.0,
-        negative_prompt: Optional[Union[str, List[str]]] = None,
-        negative_prompt_2: Optional[Union[str, List[str]]] = None,
-        num_images_per_prompt: Optional[int] = 1,
-        eta: float = 0.0,
-        generator: Optional[Union[torch.Generator, List[torch.Generator]]] = None,
-        latents: Optional[torch.FloatTensor] = None,
-        prompt_embeds: Optional[torch.FloatTensor] = None,
-        negative_prompt_embeds: Optional[torch.FloatTensor] = None,
-        pooled_prompt_embeds: Optional[torch.FloatTensor] = None,
-        negative_pooled_prompt_embeds: Optional[torch.FloatTensor] = None,
-        ip_adapter_image: Optional[PipelineImageInput] = None,
-        ip_adapter_image_embeds: Optional[List[torch.FloatTensor]] = None,
-        output_type: Optional[str] = "pil",
-        return_dict: bool = True,
-        cross_attention_kwargs: Optional[Dict[str, Any]] = None,
-        guidance_rescale: float = 0.0,
-        original_size: Optional[Tuple[int, int]] = None,
-        crops_coords_top_left: Tuple[int, int] = (0, 0),
-        target_size: Optional[Tuple[int, int]] = None,
-        negative_original_size: Optional[Tuple[int, int]] = None,
-        negative_crops_coords_top_left: Tuple[int, int] = (0, 0),
-        negative_target_size: Optional[Tuple[int, int]] = None,
-        clip_skip: Optional[int] = None,
-        callback_on_step_end: Optional[Callable[[int, int, Dict], None]] = None,
-        callback_on_step_end_tensor_inputs: List[str] = ["latents"],
-        **kwargs,
-    ):
-        r"""
-        Function invoked when calling the pipeline for generation.
-
-        Args:
-            prompt (`str` or `List[str]`, *optional*):
-                The prompt or prompts to guide the image generation. If not defined, one has to pass `prompt_embeds`.
-                instead.
-            prompt_2 (`str` or `List[str]`, *optional*):
-                The prompt or prompts to be sent to the `tokenizer_2` and `text_encoder_2`. If not defined, `prompt` is
-                used in both text-encoders
-            height (`int`, *optional*, defaults to self.unet.config.sample_size * self.vae_scale_factor):
-                The height in pixels of the generated image. This is set to 1024 by default for the best results.
-                Anything below 512 pixels won't work well for
-                [stabilityai/stable-diffusion-xl-base-1.0](https://huggingface.co/stabilityai/stable-diffusion-xl-base-1.0)
-                and checkpoints that are not specifically fine-tuned on low resolutions.
-            width (`int`, *optional*, defaults to self.unet.config.sample_size * self.vae_scale_factor):
-                The width in pixels of the generated image. This is set to 1024 by default for the best results.
-                Anything below 512 pixels won't work well for
-                [stabilityai/stable-diffusion-xl-base-1.0](https://huggingface.co/stabilityai/stable-diffusion-xl-base-1.0)
-                and checkpoints that are not specifically fine-tuned on low resolutions.
-            num_inference_steps (`int`, *optional*, defaults to 50):
-                The number of denoising steps. More denoising steps usually lead to a higher quality image at the
-                expense of slower inference.
-            timesteps (`List[int]`, *optional*):
-                Custom timesteps to use for the denoising process with schedulers which support a `timesteps` argument
-                in their `set_timesteps` method. If not defined, the default behavior when `num_inference_steps` is
-                passed will be used. Must be in descending order.
-            denoising_end (`float`, *optional*):
-                When specified, determines the fraction (between 0.0 and 1.0) of the total denoising process to be
-                completed before it is intentionally prematurely terminated. As a result, the returned sample will
-                still retain a substantial amount of noise as determined by the discrete timesteps selected by the
-                scheduler. The denoising_end parameter should ideally be utilized when this pipeline forms a part of a
-                "Mixture of Denoisers" multi-pipeline setup, as elaborated in [**Refining the Image
-                Output**](https://huggingface.co/docs/diffusers/api/pipelines/stable_diffusion/stable_diffusion_xl#refining-the-image-output)
-            guidance_scale (`float`, *optional*, defaults to 5.0):
-                Guidance scale as defined in [Classifier-Free Diffusion Guidance](https://arxiv.org/abs/2207.12598).
-                `guidance_scale` is defined as `w` of equation 2. of [Imagen
-                Paper](https://arxiv.org/pdf/2205.11487.pdf). Guidance scale is enabled by setting `guidance_scale >
-                1`. Higher guidance scale encourages to generate images that are closely linked to the text `prompt`,
-                usually at the expense of lower image quality.
-            negative_prompt (`str` or `List[str]`, *optional*):
-                The prompt or prompts not to guide the image generation. If not defined, one has to pass
-                `negative_prompt_embeds` instead. Ignored when not using guidance (i.e., ignored if `guidance_scale` is
-                less than `1`).
-            negative_prompt_2 (`str` or `List[str]`, *optional*):
-                The prompt or prompts not to guide the image generation to be sent to `tokenizer_2` and
-                `text_encoder_2`. If not defined, `negative_prompt` is used in both text-encoders
-            num_images_per_prompt (`int`, *optional*, defaults to 1):
-                The number of images to generate per prompt.
-            eta (`float`, *optional*, defaults to 0.0):
-                Corresponds to parameter eta (η) in the DDIM paper: https://arxiv.org/abs/2010.02502. Only applies to
-                [`schedulers.DDIMScheduler`], will be ignored for others.
-            generator (`torch.Generator` or `List[torch.Generator]`, *optional*):
-                One or a list of [torch generator(s)](https://pytorch.org/docs/stable/generated/torch.Generator.html)
-                to make generation deterministic.
-            latents (`torch.FloatTensor`, *optional*):
-                Pre-generated noisy latents, sampled from a Gaussian distribution, to be used as inputs for image
-                generation. Can be used to tweak the same generation with different prompts. If not provided, a latents
-                tensor will ge generated by sampling using the supplied random `generator`.
-            prompt_embeds (`torch.FloatTensor`, *optional*):
-                Pre-generated text embeddings. Can be used to easily tweak text inputs, *e.g.* prompt weighting. If not
-                provided, text embeddings will be generated from `prompt` input argument.
-            negative_prompt_embeds (`torch.FloatTensor`, *optional*):
-                Pre-generated negative text embeddings. Can be used to easily tweak text inputs, *e.g.* prompt
-                weighting. If not provided, negative_prompt_embeds will be generated from `negative_prompt` input
-                argument.
-            pooled_prompt_embeds (`torch.FloatTensor`, *optional*):
-                Pre-generated pooled text embeddings. Can be used to easily tweak text inputs, *e.g.* prompt weighting.
-                If not provided, pooled text embeddings will be generated from `prompt` input argument.
-            negative_pooled_prompt_embeds (`torch.FloatTensor`, *optional*):
-                Pre-generated negative pooled text embeddings. Can be used to easily tweak text inputs, *e.g.* prompt
-                weighting. If not provided, pooled negative_prompt_embeds will be generated from `negative_prompt`
-                input argument.
-            ip_adapter_image: (`PipelineImageInput`, *optional*): Optional image input to work with IP Adapters.
-            ip_adapter_image_embeds (`List[torch.FloatTensor]`, *optional*):
-                Pre-generated image embeddings for IP-Adapter. It should be a list of length same as number of IP-adapters.
-                Each element should be a tensor of shape `(batch_size, num_images, emb_dim)`. It should contain the negative image embedding
-                if `do_classifier_free_guidance` is set to `True`.
-                If not provided, embeddings are computed from the `ip_adapter_image` input argument.
-            output_type (`str`, *optional*, defaults to `"pil"`):
-                The output format of the generate image. Choose between
-                [PIL](https://pillow.readthedocs.io/en/stable/): `PIL.Image.Image` or `np.array`.
-            return_dict (`bool`, *optional*, defaults to `True`):
-                Whether or not to return a [`~pipelines.stable_diffusion_xl.StableDiffusionXLPipelineOutput`] instead
-                of a plain tuple.
-            cross_attention_kwargs (`dict`, *optional*):
-                A kwargs dictionary that if specified is passed along to the `AttentionProcessor` as defined under
-                `self.processor` in
-                [diffusers.models.attention_processor](https://github.com/huggingface/diffusers/blob/main/src/diffusers/models/attention_processor.py).
-            guidance_rescale (`float`, *optional*, defaults to 0.0):
-                Guidance rescale factor proposed by [Common Diffusion Noise Schedules and Sample Steps are
-                Flawed](https://arxiv.org/pdf/2305.08891.pdf) `guidance_scale` is defined as `φ` in equation 16. of
-                [Common Diffusion Noise Schedules and Sample Steps are Flawed](https://arxiv.org/pdf/2305.08891.pdf).
-                Guidance rescale factor should fix overexposure when using zero terminal SNR.
-            original_size (`Tuple[int]`, *optional*, defaults to (1024, 1024)):
-                If `original_size` is not the same as `target_size` the image will appear to be down- or upsampled.
-                `original_size` defaults to `(height, width)` if not specified. Part of SDXL's micro-conditioning as
-                explained in section 2.2 of
-                [https://huggingface.co/papers/2307.01952](https://huggingface.co/papers/2307.01952).
-            crops_coords_top_left (`Tuple[int]`, *optional*, defaults to (0, 0)):
-                `crops_coords_top_left` can be used to generate an image that appears to be "cropped" from the position
-                `crops_coords_top_left` downwards. Favorable, well-centered images are usually achieved by setting
-                `crops_coords_top_left` to (0, 0). Part of SDXL's micro-conditioning as explained in section 2.2 of
-                [https://huggingface.co/papers/2307.01952](https://huggingface.co/papers/2307.01952).
-            target_size (`Tuple[int]`, *optional*, defaults to (1024, 1024)):
-                For most cases, `target_size` should be set to the desired height and width of the generated image. If
-                not specified it will default to `(height, width)`. Part of SDXL's micro-conditioning as explained in
-                section 2.2 of [https://huggingface.co/papers/2307.01952](https://huggingface.co/papers/2307.01952).
-            negative_original_size (`Tuple[int]`, *optional*, defaults to (1024, 1024)):
-                To negatively condition the generation process based on a specific image resolution. Part of SDXL's
-                micro-conditioning as explained in section 2.2 of
-                [https://huggingface.co/papers/2307.01952](https://huggingface.co/papers/2307.01952). For more
-                information, refer to this issue thread: https://github.com/huggingface/diffusers/issues/4208.
-            negative_crops_coords_top_left (`Tuple[int]`, *optional*, defaults to (0, 0)):
-                To negatively condition the generation process based on a specific crop coordinates. Part of SDXL's
-                micro-conditioning as explained in section 2.2 of
-                [https://huggingface.co/papers/2307.01952](https://huggingface.co/papers/2307.01952). For more
-                information, refer to this issue thread: https://github.com/huggingface/diffusers/issues/4208.
-            negative_target_size (`Tuple[int]`, *optional*, defaults to (1024, 1024)):
-                To negatively condition the generation process based on a target image resolution. It should be as same
-                as the `target_size` for most cases. Part of SDXL's micro-conditioning as explained in section 2.2 of
-                [https://huggingface.co/papers/2307.01952](https://huggingface.co/papers/2307.01952). For more
-                information, refer to this issue thread: https://github.com/huggingface/diffusers/issues/4208.
-            callback_on_step_end (`Callable`, *optional*):
-                A function that calls at the end of each denoising steps during the inference. The function is called
-                with the following arguments: `callback_on_step_end(self: DiffusionPipeline, step: int, timestep: int,
-                callback_kwargs: Dict)`. `callback_kwargs` will include a list of all tensors as specified by
-                `callback_on_step_end_tensor_inputs`.
-            callback_on_step_end_tensor_inputs (`List`, *optional*):
-                The list of tensor inputs for the `callback_on_step_end` function. The tensors specified in the list
-                will be passed as `callback_kwargs` argument. You will only be able to include variables listed in the
-                `._callback_tensor_inputs` attribute of your pipeline class.
-
-        Examples:
-
-        Returns:
-            [`~pipelines.stable_diffusion_xl.StableDiffusionXLPipelineOutput`] or `tuple`:
-            [`~pipelines.stable_diffusion_xl.StableDiffusionXLPipelineOutput`] if `return_dict` is True, otherwise a
-            `tuple`. When returning a tuple, the first element is a list with the generated images.
-        """
-
-        callback = kwargs.pop("callback", None)
-        callback_steps = kwargs.pop("callback_steps", None)
-
-        if callback is not None:
-            deprecate(
-                "callback",
-                "1.0.0",
-                "Passing `callback` as an input argument to `__call__` is deprecated, consider use `callback_on_step_end`",
-            )
-        if callback_steps is not None:
-            deprecate(
-                "callback_steps",
-                "1.0.0",
-                "Passing `callback_steps` as an input argument to `__call__` is deprecated, consider use `callback_on_step_end`",
-            )
-
-        # 0. Default height and width to unet
-        height = height or self.default_sample_size * self.vae_scale_factor
-        width = width or self.default_sample_size * self.vae_scale_factor
-
-        original_size = original_size or (height, width)
-        target_size = target_size or (height, width)
-
-        # 1. Check inputs. Raise error if not correct
-        self.check_inputs(
-            prompt,
-            prompt_2,
-            height,
-            width,
-            callback_steps,
-            negative_prompt,
-            negative_prompt_2,
-            prompt_embeds,
-            negative_prompt_embeds,
-            pooled_prompt_embeds,
-            negative_pooled_prompt_embeds,
-            ip_adapter_image,
-            ip_adapter_image_embeds,
-            callback_on_step_end_tensor_inputs,
-        )
-
-        self._guidance_scale = guidance_scale
-        self._guidance_rescale = guidance_rescale
-        self._clip_skip = clip_skip
-        self._cross_attention_kwargs = cross_attention_kwargs
-        self._denoising_end = denoising_end
-        self._interrupt = False
-
-        # 2. Define call parameters
-        if prompt is not None and isinstance(prompt, str):
-            batch_size = 1
-        elif prompt is not None and isinstance(prompt, list):
-            batch_size = len(prompt)
-        else:
-            batch_size = prompt_embeds.shape[0]
-
-        device = self.unet.device
-
-        # 3. Encode input prompt
-        lora_scale = (
-            self.cross_attention_kwargs.get("scale", None)
-            if self.cross_attention_kwargs is not None
-            else None
-        )
-
-        (
-            prompt_embeds,
-            negative_prompt_embeds,
-            pooled_prompt_embeds,
-            negative_pooled_prompt_embeds,
-        ) = self.encode_prompt(
-            prompt=prompt,
-            prompt_2=prompt_2,
-            device=device,
-            num_images_per_prompt=num_images_per_prompt,
-            do_classifier_free_guidance=self.do_classifier_free_guidance,
-            negative_prompt=negative_prompt,
-            negative_prompt_2=negative_prompt_2,
-            prompt_embeds=prompt_embeds,
-            negative_prompt_embeds=negative_prompt_embeds,
-            pooled_prompt_embeds=pooled_prompt_embeds,
-            negative_pooled_prompt_embeds=negative_pooled_prompt_embeds,
-            lora_scale=lora_scale,
-            clip_skip=self.clip_skip,
-        )
-
-        # 4. Prepare timesteps
-        timesteps, num_inference_steps = retrieve_timesteps(
-            self.scheduler, num_inference_steps, device, timesteps
-        )
-
-        # 5. Prepare latent variables
-        num_channels_latents = self.unet.config.in_channels
-        latents = self.prepare_latents(
-            batch_size * num_images_per_prompt,
-            num_channels_latents,
-            height,
-            width,
-            prompt_embeds.dtype,
-            device,
-            generator,
-            latents,
-        )
-
-        # 6. Prepare extra step kwargs. TODO: Logic should ideally just be moved out of the pipeline
-        extra_step_kwargs = self.prepare_extra_step_kwargs(generator, eta)
-
-        # 7. Prepare added time ids & embeddings
-        add_text_embeds = pooled_prompt_embeds
-        if self.text_encoder_2 is None:
-            text_encoder_projection_dim = int(pooled_prompt_embeds.shape[-1])
-        else:
-            text_encoder_projection_dim = self.text_encoder_2.config.projection_dim
-
-        add_time_ids = self._get_add_time_ids(
-            original_size,
-            crops_coords_top_left,
-            target_size,
-            dtype=prompt_embeds.dtype,
-            text_encoder_projection_dim=text_encoder_projection_dim,
-        )
-        if negative_original_size is not None and negative_target_size is not None:
-            negative_add_time_ids = self._get_add_time_ids(
-                negative_original_size,
-                negative_crops_coords_top_left,
-                negative_target_size,
-                dtype=prompt_embeds.dtype,
-                text_encoder_projection_dim=text_encoder_projection_dim,
-            )
-        else:
-            negative_add_time_ids = add_time_ids
-
-        if self.do_classifier_free_guidance:
-            prompt_embeds = torch.cat([negative_prompt_embeds, prompt_embeds], dim=0)
-            add_text_embeds = torch.cat(
-                [negative_pooled_prompt_embeds, add_text_embeds], dim=0
-            )
-            add_time_ids = torch.cat([negative_add_time_ids, add_time_ids], dim=0)
-
-        prompt_embeds = prompt_embeds.to(device)
-        add_text_embeds = add_text_embeds.to(device)
-        add_time_ids = add_time_ids.to(device).repeat(
-            batch_size * num_images_per_prompt, 1
-        )
-
-        if ip_adapter_image is not None or ip_adapter_image_embeds is not None:
-            image_embeds = self.prepare_ip_adapter_image_embeds(
-                ip_adapter_image,
-                ip_adapter_image_embeds,
-                device,
-                batch_size * num_images_per_prompt,
-                self.do_classifier_free_guidance,
-            )
-
-        # 8. Denoising loop
-        num_warmup_steps = max(
-            len(timesteps) - num_inference_steps * self.scheduler.order, 0
-        )
-
-        # 8.1 Apply denoising_end
-        if (
-            self.denoising_end is not None
-            and isinstance(self.denoising_end, float)
-            and self.denoising_end > 0
-            and self.denoising_end < 1
-        ):
-            discrete_timestep_cutoff = int(
-                round(
-                    self.scheduler.config.num_train_timesteps
-                    - (self.denoising_end * self.scheduler.config.num_train_timesteps)
-                )
-            )
-            num_inference_steps = len(
-                list(filter(lambda ts: ts >= discrete_timestep_cutoff, timesteps))
-            )
-            timesteps = timesteps[:num_inference_steps]
-
-        # 9. Optionally get Guidance Scale Embedding
-        timestep_cond = None
-        if self.unet.config.time_cond_proj_dim is not None:
-            guidance_scale_tensor = torch.tensor(self.guidance_scale - 1).repeat(
-                batch_size * num_images_per_prompt
-            )
-            timestep_cond = self.get_guidance_scale_embedding(
-                guidance_scale_tensor, embedding_dim=self.unet.config.time_cond_proj_dim
-            ).to(device=device, dtype=latents.dtype)
-
-        self._num_timesteps = len(timesteps)
-        with self.progress_bar(total=num_inference_steps) as progress_bar:
-            for i, t in enumerate(timesteps):
-                if self.interrupt:
-                    continue
-
-                # expand the latents if we are doing classifier free guidance
-                latent_model_input = (
-                    torch.cat([latents] * 2)
-                    if self.do_classifier_free_guidance
-                    else latents
-                )
-
-                latent_model_input = self.scheduler.scale_model_input(
-                    latent_model_input, t
-                )
-
-                # predict the noise residual
-                added_cond_kwargs = {
-                    "text_embeds": add_text_embeds,
-                    "time_ids": add_time_ids,
-                }
-                if ip_adapter_image is not None or ip_adapter_image_embeds is not None:
-                    added_cond_kwargs["image_embeds"] = image_embeds
-                noise_pred = self.unet(
-                    latent_model_input.to(self.unet.device),
-                    t,
-                    encoder_hidden_states=prompt_embeds.to(self.unet.device),
-                    timestep_cond=timestep_cond,
-                    cross_attention_kwargs=self.cross_attention_kwargs,
-                    added_cond_kwargs={
-                        k: v.to(self.unet.device) for k, v in added_cond_kwargs.items()
-                    },
-                    return_dict=False,
-                )[0]
-
-                # perform guidance
-                if self.do_classifier_free_guidance:
-                    noise_pred_uncond, noise_pred_text = noise_pred.chunk(2)
-                    noise_pred = noise_pred_uncond + self.guidance_scale * (
-                        noise_pred_text - noise_pred_uncond
-                    )
-
-                if self.do_classifier_free_guidance and self.guidance_rescale > 0.0:
-                    # Based on 3.4. in https://arxiv.org/pdf/2305.08891.pdf
-                    noise_pred = rescale_noise_cfg(
-                        noise_pred,
-                        noise_pred_text,
-                        guidance_rescale=self.guidance_rescale,
-                    )
-
-                # compute the previous noisy sample x_t -> x_t-1
-                latents_dtype = latents.dtype
-                latents = self.scheduler.step(
-                    noise_pred, t, latents, **extra_step_kwargs, return_dict=False
-                )[0]
-                if latents.dtype != latents_dtype:
-                    if torch.backends.mps.is_available():
-                        # some platforms (eg. apple mps) misbehave due to a pytorch bug: https://github.com/pytorch/pytorch/pull/99272
-                        latents = latents.to(latents_dtype)
-
-                if callback_on_step_end is not None:
-                    callback_kwargs = {}
-                    for k in callback_on_step_end_tensor_inputs:
-                        callback_kwargs[k] = locals()[k]
-                    callback_outputs = callback_on_step_end(self, i, t, callback_kwargs)
-
-                    latents = callback_outputs.pop("latents", latents)
-                    prompt_embeds = callback_outputs.pop("prompt_embeds", prompt_embeds)
-                    negative_prompt_embeds = callback_outputs.pop(
-                        "negative_prompt_embeds", negative_prompt_embeds
-                    )
-                    add_text_embeds = callback_outputs.pop(
-                        "add_text_embeds", add_text_embeds
-                    )
-                    negative_pooled_prompt_embeds = callback_outputs.pop(
-                        "negative_pooled_prompt_embeds", negative_pooled_prompt_embeds
-                    )
-                    add_time_ids = callback_outputs.pop("add_time_ids", add_time_ids)
-                    negative_add_time_ids = callback_outputs.pop(
-                        "negative_add_time_ids", negative_add_time_ids
-                    )
-
-                # call the callback, if provided
-                if i == len(timesteps) - 1 or (
-                    (i + 1) > num_warmup_steps and (i + 1) % self.scheduler.order == 0
-                ):
-                    progress_bar.update()
-                    if callback is not None and i % callback_steps == 0:
-                        step_idx = i // getattr(self.scheduler, "order", 1)
-                        callback(step_idx, t, latents)
-
-                if XLA_AVAILABLE:
-                    xm.mark_step()
-
-        if not output_type == "latent":
-            # make sure the VAE is in float32 mode, as it overflows in float16
-            needs_upcasting = (
-                self.vae.dtype == torch.float16 and self.vae.config.force_upcast
-            )
-
-            if needs_upcasting:
-                self.upcast_vae()
-                latents = latents.to(
-                    next(iter(self.vae.post_quant_conv.parameters())).dtype
-                )
-            elif latents.dtype != self.vae.dtype:
-                if torch.backends.mps.is_available():
-                    # some platforms (eg. apple mps) misbehave due to a pytorch bug: https://github.com/pytorch/pytorch/pull/99272
-                    self.vae = self.vae.to(latents.dtype)
-
-            # unscale/denormalize the latents
-            # denormalize with the mean and std if available and not None
-            has_latents_mean = (
-                hasattr(self.vae.config, "latents_mean")
-                and self.vae.config.latents_mean is not None
-            )
-            has_latents_std = (
-                hasattr(self.vae.config, "latents_std")
-                and self.vae.config.latents_std is not None
-            )
-            if has_latents_mean and has_latents_std:
-                latents_mean = (
-                    torch.tensor(self.vae.config.latents_mean)
-                    .view(1, 4, 1, 1)
-                    .to(latents.device, latents.dtype)
-                )
-                latents_std = (
-                    torch.tensor(self.vae.config.latents_std)
-                    .view(1, 4, 1, 1)
-                    .to(latents.device, latents.dtype)
-                )
-                latents = (
-                    latents * latents_std / self.vae.config.scaling_factor
-                    + latents_mean
-                )
-            else:
-                latents = latents / self.vae.config.scaling_factor
-
-            image = self.vae.decode(
-                latents.to(dtype=self.vae.dtype), return_dict=False
-            )[0]
-
-            # cast back to fp16 if needed
-            if needs_upcasting:
-                self.vae.to(dtype=torch.float16)
-        else:
-            image = latents
-
-        if not output_type == "latent":
-            # apply watermark if available
-            if self.watermark is not None:
-                image = self.watermark.apply_watermark(image)
-
-            image = self.image_processor.postprocess(image, output_type=output_type)
-
-        # Offload all models
-        self.maybe_free_model_hooks()
-
-        if not return_dict:
-            return (image,)
-
-        return StableDiffusionXLPipelineOutput(images=image)
-
-
-class StableDiffusionXLImg2ImgPipeline(
-    DiffusionPipeline,
-    StableDiffusionMixin,
-    TextualInversionLoaderMixin,
-    FromSingleFileMixin,
-    StableDiffusionXLLoraLoaderMixin,
-    IPAdapterMixin,
-):
-    r"""
-    Pipeline for text-to-image generation using Stable Diffusion XL.
-
-    This model inherits from [`DiffusionPipeline`]. Check the superclass documentation for the generic methods the
-    library implements for all the pipelines (such as downloading or saving, running on a particular device, etc.)
-
-    The pipeline also inherits the following loading methods:
-        - [`~loaders.TextualInversionLoaderMixin.load_textual_inversion`] for loading textual inversion embeddings
-        - [`~loaders.FromSingleFileMixin.from_single_file`] for loading `.ckpt` files
-        - [`~loaders.StableDiffusionXLLoraLoaderMixin.load_lora_weights`] for loading LoRA weights
-        - [`~loaders.StableDiffusionXLLoraLoaderMixin.save_lora_weights`] for saving LoRA weights
-        - [`~loaders.IPAdapterMixin.load_ip_adapter`] for loading IP Adapters
-
-    Args:
-        vae ([`AutoencoderKL`]):
-            Variational Auto-Encoder (VAE) Model to encode and decode images to and from latent representations.
-        text_encoder ([`CLIPTextModel`]):
-            Frozen text-encoder. Stable Diffusion XL uses the text portion of
-            [CLIP](https://huggingface.co/docs/transformers/model_doc/clip#transformers.CLIPTextModel), specifically
-            the [clip-vit-large-patch14](https://huggingface.co/openai/clip-vit-large-patch14) variant.
-        text_encoder_2 ([` CLIPTextModelWithProjection`]):
-            Second frozen text-encoder. Stable Diffusion XL uses the text and pool portion of
-            [CLIP](https://huggingface.co/docs/transformers/model_doc/clip#transformers.CLIPTextModelWithProjection),
-            specifically the
-            [laion/CLIP-ViT-bigG-14-laion2B-39B-b160k](https://huggingface.co/laion/CLIP-ViT-bigG-14-laion2B-39B-b160k)
-            variant.
-        tokenizer (`CLIPTokenizer`):
-            Tokenizer of class
-            [CLIPTokenizer](https://huggingface.co/docs/transformers/v4.21.0/en/model_doc/clip#transformers.CLIPTokenizer).
-        tokenizer_2 (`CLIPTokenizer`):
-            Second Tokenizer of class
-            [CLIPTokenizer](https://huggingface.co/docs/transformers/v4.21.0/en/model_doc/clip#transformers.CLIPTokenizer).
-        unet ([`UNet2DConditionModel`]): Conditional U-Net architecture to denoise the encoded image latents.
-        scheduler ([`SchedulerMixin`]):
-            A scheduler to be used in combination with `unet` to denoise the encoded image latents. Can be one of
-            [`DDIMScheduler`], [`LMSDiscreteScheduler`], or [`PNDMScheduler`].
-        requires_aesthetics_score (`bool`, *optional*, defaults to `"False"`):
-            Whether the `unet` requires an `aesthetic_score` condition to be passed during inference. Also see the
-            config of `stabilityai/stable-diffusion-xl-refiner-1-0`.
-        force_zeros_for_empty_prompt (`bool`, *optional*, defaults to `"True"`):
-            Whether the negative prompt embeddings shall be forced to always be set to 0. Also see the config of
-            `stabilityai/stable-diffusion-xl-base-1-0`.
-        add_watermarker (`bool`, *optional*):
-            Whether to use the [invisible_watermark library](https://github.com/ShieldMnt/invisible-watermark/) to
-            watermark output images. If not defined, it will default to True if the package is installed, otherwise no
-            watermarker will be used.
-    """
-
-    model_cpu_offload_seq = "text_encoder->text_encoder_2->image_encoder->unet->vae"
-    _optional_components = [
-        "tokenizer",
-        "tokenizer_2",
-        "text_encoder",
-        "text_encoder_2",
-        "image_encoder",
-        "feature_extractor",
-    ]
-    _callback_tensor_inputs = [
-        "latents",
-        "prompt_embeds",
-        "negative_prompt_embeds",
-        "add_text_embeds",
-        "add_time_ids",
-        "negative_pooled_prompt_embeds",
-        "add_neg_time_ids",
-    ]
-
-    def __init__(
-        self,
-        vae: AutoencoderKL,
-        text_encoder: CLIPTextModel,
-        text_encoder_2: CLIPTextModelWithProjection,
-        tokenizer: CLIPTokenizer,
-        tokenizer_2: CLIPTokenizer,
-        unet: UNet2DConditionModel,
-        scheduler: KarrasDiffusionSchedulers,
-        image_encoder: CLIPVisionModelWithProjection = None,
-        feature_extractor: CLIPImageProcessor = None,
-        requires_aesthetics_score: bool = False,
-        force_zeros_for_empty_prompt: bool = True,
-        add_watermarker: Optional[bool] = None,
-    ):
-        super().__init__()
-
-        self.register_modules(
-            vae=vae,
-            text_encoder=text_encoder,
-            text_encoder_2=text_encoder_2,
-            tokenizer=tokenizer,
-            tokenizer_2=tokenizer_2,
-            unet=unet,
-            image_encoder=image_encoder,
-            feature_extractor=feature_extractor,
-            scheduler=scheduler,
-        )
-        self.register_to_config(
-            force_zeros_for_empty_prompt=force_zeros_for_empty_prompt
-        )
-        self.register_to_config(requires_aesthetics_score=requires_aesthetics_score)
-        self.vae_scale_factor = 2 ** (len(self.vae.config.block_out_channels) - 1)
-        self.image_processor = VaeImageProcessor(vae_scale_factor=self.vae_scale_factor)
-
-        add_watermarker = (
-            add_watermarker
-            if add_watermarker is not None
-            else is_invisible_watermark_available()
-        )
-
-        if add_watermarker:
-            self.watermark = StableDiffusionXLWatermarker()
-        else:
-            self.watermark = None
-
-    # Copied from diffusers.pipelines.stable_diffusion_xl.pipeline_stable_diffusion_xl.StableDiffusionXLPipeline.encode_prompt
-    def encode_prompt(
-        self,
-        prompt: str,
-        prompt_2: Optional[str] = None,
-        device: Optional[torch.device] = None,
-        num_images_per_prompt: int = 1,
-        do_classifier_free_guidance: bool = True,
-        negative_prompt: Optional[str] = None,
-        negative_prompt_2: Optional[str] = None,
-        prompt_embeds: Optional[torch.Tensor] = None,
-        negative_prompt_embeds: Optional[torch.Tensor] = None,
-        pooled_prompt_embeds: Optional[torch.Tensor] = None,
-        negative_pooled_prompt_embeds: Optional[torch.Tensor] = None,
-        lora_scale: Optional[float] = None,
-        clip_skip: Optional[int] = None,
-    ):
-        r"""
-        Encodes the prompt into text encoder hidden states.
-
-        Args:
-            prompt (`str` or `List[str]`, *optional*):
-                prompt to be encoded
-            prompt_2 (`str` or `List[str]`, *optional*):
-                The prompt or prompts to be sent to the `tokenizer_2` and `text_encoder_2`. If not defined, `prompt` is
-                used in both text-encoders
-            device: (`torch.device`):
-                torch device
-            num_images_per_prompt (`int`):
-                number of images that should be generated per prompt
-            do_classifier_free_guidance (`bool`):
-                whether to use classifier free guidance or not
-            negative_prompt (`str` or `List[str]`, *optional*):
-                The prompt or prompts not to guide the image generation. If not defined, one has to pass
-                `negative_prompt_embeds` instead. Ignored when not using guidance (i.e., ignored if `guidance_scale` is
-                less than `1`).
-            negative_prompt_2 (`str` or `List[str]`, *optional*):
-                The prompt or prompts not to guide the image generation to be sent to `tokenizer_2` and
-                `text_encoder_2`. If not defined, `negative_prompt` is used in both text-encoders
-            prompt_embeds (`torch.Tensor`, *optional*):
-                Pre-generated text embeddings. Can be used to easily tweak text inputs, *e.g.* prompt weighting. If not
-                provided, text embeddings will be generated from `prompt` input argument.
-            negative_prompt_embeds (`torch.Tensor`, *optional*):
-                Pre-generated negative text embeddings. Can be used to easily tweak text inputs, *e.g.* prompt
-                weighting. If not provided, negative_prompt_embeds will be generated from `negative_prompt` input
-                argument.
-            pooled_prompt_embeds (`torch.Tensor`, *optional*):
-                Pre-generated pooled text embeddings. Can be used to easily tweak text inputs, *e.g.* prompt weighting.
-                If not provided, pooled text embeddings will be generated from `prompt` input argument.
-            negative_pooled_prompt_embeds (`torch.Tensor`, *optional*):
-                Pre-generated negative pooled text embeddings. Can be used to easily tweak text inputs, *e.g.* prompt
-                weighting. If not provided, pooled negative_prompt_embeds will be generated from `negative_prompt`
-                input argument.
-            lora_scale (`float`, *optional*):
-                A lora scale that will be applied to all LoRA layers of the text encoder if LoRA layers are loaded.
-            clip_skip (`int`, *optional*):
-                Number of layers to be skipped from CLIP while computing the prompt embeddings. A value of 1 means that
-                the output of the pre-final layer will be used for computing the prompt embeddings.
-        """
-        device = device or self._execution_device
-
-        # set lora scale so that monkey patched LoRA
-        # function of text encoder can correctly access it
-        if lora_scale is not None and isinstance(
-            self, StableDiffusionXLLoraLoaderMixin
-        ):
-            self._lora_scale = lora_scale
-
-            # dynamically adjust the LoRA scale
-            if self.text_encoder is not None:
-                if not USE_PEFT_BACKEND:
-                    adjust_lora_scale_text_encoder(self.text_encoder, lora_scale)
-                else:
-                    scale_lora_layers(self.text_encoder, lora_scale)
-
-            if self.text_encoder_2 is not None:
-                if not USE_PEFT_BACKEND:
-                    adjust_lora_scale_text_encoder(self.text_encoder_2, lora_scale)
-                else:
-                    scale_lora_layers(self.text_encoder_2, lora_scale)
-
-        prompt = [prompt] if isinstance(prompt, str) else prompt
-
-        if prompt is not None:
-            batch_size = len(prompt)
-        else:
-            batch_size = prompt_embeds.shape[0]
-
-        # Define tokenizers and text encoders
-        tokenizers = (
-            [self.tokenizer, self.tokenizer_2]
-            if self.tokenizer is not None
-            else [self.tokenizer_2]
-        )
-        text_encoders = (
-            [self.text_encoder, self.text_encoder_2]
-            if self.text_encoder is not None
-            else [self.text_encoder_2]
-        )
-
-        if prompt_embeds is None:
-            prompt_2 = prompt_2 or prompt
-            prompt_2 = [prompt_2] if isinstance(prompt_2, str) else prompt_2
-
-            # textual inversion: process multi-vector tokens if necessary
-            prompt_embeds_list = []
-            prompts = [prompt, prompt_2]
-            for prompt, tokenizer, text_encoder in zip(
-                prompts, tokenizers, text_encoders
-            ):
-                if isinstance(self, TextualInversionLoaderMixin):
-                    prompt = self.maybe_convert_prompt(prompt, tokenizer)
-
-                text_inputs = tokenizer(
-                    prompt,
-                    padding="max_length",
-                    max_length=tokenizer.model_max_length,
-                    truncation=True,
-                    return_tensors="pt",
-                )
-
-                text_input_ids = text_inputs.input_ids
-                untruncated_ids = tokenizer(
-                    prompt, padding="longest", return_tensors="pt"
-                ).input_ids
-
-                if untruncated_ids.shape[-1] >= text_input_ids.shape[
-                    -1
-                ] and not torch.equal(text_input_ids, untruncated_ids):
-                    removed_text = tokenizer.batch_decode(
-                        untruncated_ids[:, tokenizer.model_max_length - 1 : -1]
-                    )
-                    logger.warning(
-                        "The following part of your input was truncated because CLIP can only handle sequences up to"
-                        f" {tokenizer.model_max_length} tokens: {removed_text}"
-                    )
-
-                prompt_embeds = text_encoder(
-                    text_input_ids.to(device), output_hidden_states=True
-                )
-
-                # We are only ALWAYS interested in the pooled output of the final text encoder
-                pooled_prompt_embeds = prompt_embeds[0]
-                if clip_skip is None:
-                    prompt_embeds = prompt_embeds.hidden_states[-2]
-                else:
-                    # "2" because SDXL always indexes from the penultimate layer.
-                    prompt_embeds = prompt_embeds.hidden_states[-(clip_skip + 2)]
-
-                prompt_embeds_list.append(prompt_embeds)
-
-            prompt_embeds = torch.concat(prompt_embeds_list, dim=-1)
-
-        # get unconditional embeddings for classifier free guidance
-        zero_out_negative_prompt = (
-            negative_prompt is None and self.config.force_zeros_for_empty_prompt
-        )
-        if (
-            do_classifier_free_guidance
-            and negative_prompt_embeds is None
-            and zero_out_negative_prompt
-        ):
-            negative_prompt_embeds = torch.zeros_like(prompt_embeds)
-            negative_pooled_prompt_embeds = torch.zeros_like(pooled_prompt_embeds)
-        elif do_classifier_free_guidance and negative_prompt_embeds is None:
-            negative_prompt = negative_prompt or ""
-            negative_prompt_2 = negative_prompt_2 or negative_prompt
-
-            # normalize str to list
-            negative_prompt = (
-                batch_size * [negative_prompt]
-                if isinstance(negative_prompt, str)
-                else negative_prompt
-            )
-            negative_prompt_2 = (
-                batch_size * [negative_prompt_2]
-                if isinstance(negative_prompt_2, str)
-                else negative_prompt_2
-            )
-
-            uncond_tokens: List[str]
-            if prompt is not None and type(prompt) is not type(negative_prompt):
-                raise TypeError(
-                    f"`negative_prompt` should be the same type to `prompt`, but got {type(negative_prompt)} !="
-                    f" {type(prompt)}."
-                )
-            elif batch_size != len(negative_prompt):
-                raise ValueError(
-                    f"`negative_prompt`: {negative_prompt} has batch size {len(negative_prompt)}, but `prompt`:"
-                    f" {prompt} has batch size {batch_size}. Please make sure that passed `negative_prompt` matches"
-                    " the batch size of `prompt`."
-                )
-            else:
-                uncond_tokens = [negative_prompt, negative_prompt_2]
-
-            negative_prompt_embeds_list = []
-            for negative_prompt, tokenizer, text_encoder in zip(
-                uncond_tokens, tokenizers, text_encoders
-            ):
-                if isinstance(self, TextualInversionLoaderMixin):
-                    negative_prompt = self.maybe_convert_prompt(
-                        negative_prompt, tokenizer
-                    )
-
-                max_length = prompt_embeds.shape[1]
-                uncond_input = tokenizer(
-                    negative_prompt,
-                    padding="max_length",
-                    max_length=max_length,
-                    truncation=True,
-                    return_tensors="pt",
-                )
-
-                negative_prompt_embeds = text_encoder(
-                    uncond_input.input_ids.to(device),
-                    output_hidden_states=True,
-                )
-                # We are only ALWAYS interested in the pooled output of the final text encoder
-                negative_pooled_prompt_embeds = negative_prompt_embeds[0]
-                negative_prompt_embeds = negative_prompt_embeds.hidden_states[-2]
-
-                negative_prompt_embeds_list.append(negative_prompt_embeds)
-
-            negative_prompt_embeds = torch.concat(negative_prompt_embeds_list, dim=-1)
-
-        if self.text_encoder_2 is not None:
-            prompt_embeds = prompt_embeds.to(
-                dtype=self.text_encoder_2.dtype, device=device
-            )
-        else:
-            prompt_embeds = prompt_embeds.to(dtype=self.unet.dtype, device=device)
-
-        bs_embed, seq_len, _ = prompt_embeds.shape
-        # duplicate text embeddings for each generation per prompt, using mps friendly method
-        prompt_embeds = prompt_embeds.repeat(1, num_images_per_prompt, 1)
-        prompt_embeds = prompt_embeds.view(
-            bs_embed * num_images_per_prompt, seq_len, -1
-        )
-
-        if do_classifier_free_guidance:
-            # duplicate unconditional embeddings for each generation per prompt, using mps friendly method
-            seq_len = negative_prompt_embeds.shape[1]
-
-            if self.text_encoder_2 is not None:
-                negative_prompt_embeds = negative_prompt_embeds.to(
-                    dtype=self.text_encoder_2.dtype, device=device
-                )
-            else:
-                negative_prompt_embeds = negative_prompt_embeds.to(
-                    dtype=self.unet.dtype, device=device
-                )
-
-            negative_prompt_embeds = negative_prompt_embeds.repeat(
-                1, num_images_per_prompt, 1
-            )
-            negative_prompt_embeds = negative_prompt_embeds.view(
-                batch_size * num_images_per_prompt, seq_len, -1
-            )
-
-        pooled_prompt_embeds = pooled_prompt_embeds.repeat(
-            1, num_images_per_prompt
-        ).view(bs_embed * num_images_per_prompt, -1)
-        if do_classifier_free_guidance:
-            negative_pooled_prompt_embeds = negative_pooled_prompt_embeds.repeat(
-                1, num_images_per_prompt
-            ).view(bs_embed * num_images_per_prompt, -1)
-
-        if self.text_encoder is not None:
-            if isinstance(self, StableDiffusionXLLoraLoaderMixin) and USE_PEFT_BACKEND:
-                # Retrieve the original scale by scaling back the LoRA layers
-                unscale_lora_layers(self.text_encoder, lora_scale)
-
-        if self.text_encoder_2 is not None:
-            if isinstance(self, StableDiffusionXLLoraLoaderMixin) and USE_PEFT_BACKEND:
-                # Retrieve the original scale by scaling back the LoRA layers
-                unscale_lora_layers(self.text_encoder_2, lora_scale)
-
-        return (
-            prompt_embeds,
-            negative_prompt_embeds,
-            pooled_prompt_embeds,
-            negative_pooled_prompt_embeds,
-        )
-
-    # Copied from diffusers.pipelines.stable_diffusion.pipeline_stable_diffusion.StableDiffusionPipeline.prepare_extra_step_kwargs
-    def prepare_extra_step_kwargs(self, generator, eta):
-        # prepare extra kwargs for the scheduler step, since not all schedulers have the same signature
-        # eta (η) is only used with the DDIMScheduler, it will be ignored for other schedulers.
-        # eta corresponds to η in DDIM paper: https://arxiv.org/abs/2010.02502
-        # and should be between [0, 1]
-
-        accepts_eta = "eta" in set(
-            inspect.signature(self.scheduler.step).parameters.keys()
-        )
-        extra_step_kwargs = {}
-        if accepts_eta:
-            extra_step_kwargs["eta"] = eta
-
-        # check if the scheduler accepts generator
-        accepts_generator = "generator" in set(
-            inspect.signature(self.scheduler.step).parameters.keys()
-        )
-        if accepts_generator:
-            extra_step_kwargs["generator"] = generator
-        return extra_step_kwargs
-
-    def check_inputs(
-        self,
-        prompt,
-        prompt_2,
-        strength,
-        num_inference_steps,
-        callback_steps,
-        negative_prompt=None,
-        negative_prompt_2=None,
-        prompt_embeds=None,
-        negative_prompt_embeds=None,
-        ip_adapter_image=None,
-        ip_adapter_image_embeds=None,
-        callback_on_step_end_tensor_inputs=None,
-    ):
-        if strength < 0 or strength > 1:
-            raise ValueError(
-                f"The value of strength should in [0.0, 1.0] but is {strength}"
-            )
-        if num_inference_steps is None:
-            raise ValueError("`num_inference_steps` cannot be None.")
-        elif not isinstance(num_inference_steps, int) or num_inference_steps <= 0:
-            raise ValueError(
-                f"`num_inference_steps` has to be a positive integer but is {num_inference_steps} of type"
-                f" {type(num_inference_steps)}."
-            )
-        if callback_steps is not None and (
-            not isinstance(callback_steps, int) or callback_steps <= 0
-        ):
-            raise ValueError(
-                f"`callback_steps` has to be a positive integer but is {callback_steps} of type"
-                f" {type(callback_steps)}."
-            )
-
-        if callback_on_step_end_tensor_inputs is not None and not all(
-            k in self._callback_tensor_inputs
-            for k in callback_on_step_end_tensor_inputs
-        ):
-            raise ValueError(
-                f"`callback_on_step_end_tensor_inputs` has to be in {self._callback_tensor_inputs}, but found {[k for k in callback_on_step_end_tensor_inputs if k not in self._callback_tensor_inputs]}"
-            )
-
-        if prompt is not None and prompt_embeds is not None:
-            raise ValueError(
-                f"Cannot forward both `prompt`: {prompt} and `prompt_embeds`: {prompt_embeds}. Please make sure to"
-                " only forward one of the two."
-            )
-        elif prompt_2 is not None and prompt_embeds is not None:
-            raise ValueError(
-                f"Cannot forward both `prompt_2`: {prompt_2} and `prompt_embeds`: {prompt_embeds}. Please make sure to"
-                " only forward one of the two."
-            )
-        elif prompt is None and prompt_embeds is None:
-            raise ValueError(
-                "Provide either `prompt` or `prompt_embeds`. Cannot leave both `prompt` and `prompt_embeds` undefined."
-            )
-        elif prompt is not None and (
-            not isinstance(prompt, str) and not isinstance(prompt, list)
-        ):
-            raise ValueError(
-                f"`prompt` has to be of type `str` or `list` but is {type(prompt)}"
-            )
-        elif prompt_2 is not None and (
-            not isinstance(prompt_2, str) and not isinstance(prompt_2, list)
-        ):
-            raise ValueError(
-                f"`prompt_2` has to be of type `str` or `list` but is {type(prompt_2)}"
-            )
-
-        if negative_prompt is not None and negative_prompt_embeds is not None:
-            raise ValueError(
-                f"Cannot forward both `negative_prompt`: {negative_prompt} and `negative_prompt_embeds`:"
-                f" {negative_prompt_embeds}. Please make sure to only forward one of the two."
-            )
-        elif negative_prompt_2 is not None and negative_prompt_embeds is not None:
-            raise ValueError(
-                f"Cannot forward both `negative_prompt_2`: {negative_prompt_2} and `negative_prompt_embeds`:"
-                f" {negative_prompt_embeds}. Please make sure to only forward one of the two."
-            )
-
-        if prompt_embeds is not None and negative_prompt_embeds is not None:
-            if prompt_embeds.shape != negative_prompt_embeds.shape:
-                raise ValueError(
-                    "`prompt_embeds` and `negative_prompt_embeds` must have the same shape when passed directly, but"
-                    f" got: `prompt_embeds` {prompt_embeds.shape} != `negative_prompt_embeds`"
-                    f" {negative_prompt_embeds.shape}."
-                )
-
-        if ip_adapter_image is not None and ip_adapter_image_embeds is not None:
-            raise ValueError(
-                "Provide either `ip_adapter_image` or `ip_adapter_image_embeds`. Cannot leave both `ip_adapter_image` and `ip_adapter_image_embeds` defined."
-            )
-
-        if ip_adapter_image_embeds is not None:
-            if not isinstance(ip_adapter_image_embeds, list):
-                raise ValueError(
-                    f"`ip_adapter_image_embeds` has to be of type `list` but is {type(ip_adapter_image_embeds)}"
-                )
-            elif ip_adapter_image_embeds[0].ndim not in [3, 4]:
-                raise ValueError(
-                    f"`ip_adapter_image_embeds` has to be a list of 3D or 4D tensors but is {ip_adapter_image_embeds[0].ndim}D"
-                )
-
-    def get_timesteps(
-        self, num_inference_steps, strength, device, denoising_start=None
-    ):
-        # get the original timestep using init_timestep
-        if denoising_start is None:
-            init_timestep = min(
-                int(num_inference_steps * strength), num_inference_steps
-            )
-            t_start = max(num_inference_steps - init_timestep, 0)
-        else:
-            t_start = 0
-
-        timesteps = self.scheduler.timesteps[t_start * self.scheduler.order :]
-
-        # Strength is irrelevant if we directly request a timestep to start at;
-        # that is, strength is determined by the denoising_start instead.
-        if denoising_start is not None:
-            discrete_timestep_cutoff = int(
-                round(
-                    self.scheduler.config.num_train_timesteps
-                    - (denoising_start * self.scheduler.config.num_train_timesteps)
-                )
-            )
-
-            num_inference_steps = (timesteps < discrete_timestep_cutoff).sum().item()
-            if self.scheduler.order == 2 and num_inference_steps % 2 == 0:
-                # if the scheduler is a 2nd order scheduler we might have to do +1
-                # because `num_inference_steps` might be even given that every timestep
-                # (except the highest one) is duplicated. If `num_inference_steps` is even it would
-                # mean that we cut the timesteps in the middle of the denoising step
-                # (between 1st and 2nd derivative) which leads to incorrect results. By adding 1
-                # we ensure that the denoising process always ends after the 2nd derivate step of the scheduler
-                num_inference_steps = num_inference_steps + 1
-
-            # because t_n+1 >= t_n, we slice the timesteps starting from the end
-            timesteps = timesteps[-num_inference_steps:]
-            return timesteps, num_inference_steps
-
-        return timesteps, num_inference_steps - t_start
-
-    def prepare_latents(
-        self,
-        image,
-        timestep,
-        batch_size,
-        num_images_per_prompt,
-        dtype,
-        device,
-        generator=None,
-        add_noise=True,
-    ):
-        if not isinstance(image, (torch.Tensor, PIL.Image.Image, list)):
-            raise ValueError(
-                f"`image` has to be of type `torch.Tensor`, `PIL.Image.Image` or list but is {type(image)}"
-            )
-
-        latents_mean = latents_std = None
-        if (
-            hasattr(self.vae.config, "latents_mean")
-            and self.vae.config.latents_mean is not None
-        ):
-            latents_mean = torch.tensor(self.vae.config.latents_mean).view(1, 4, 1, 1)
-        if (
-            hasattr(self.vae.config, "latents_std")
-            and self.vae.config.latents_std is not None
-        ):
-            latents_std = torch.tensor(self.vae.config.latents_std).view(1, 4, 1, 1)
-
-        # Offload text encoder if `enable_model_cpu_offload` was enabled
-        if hasattr(self, "final_offload_hook") and self.final_offload_hook is not None:
-            self.text_encoder_2.to("cpu")
-            torch.cuda.empty_cache()
-
-        image = image.to(device=device, dtype=dtype)
-
-        batch_size = batch_size * num_images_per_prompt
-
-        if image.shape[1] == 4:
-            init_latents = image
-
-        else:
-            # make sure the VAE is in float32 mode, as it overflows in float16
-            if self.vae.config.force_upcast:
-                image = image.float()
-                self.vae.to(dtype=torch.float32)
-
-            if isinstance(generator, list) and len(generator) != batch_size:
-                raise ValueError(
-                    f"You have passed a list of generators of length {len(generator)}, but requested an effective batch"
-                    f" size of {batch_size}. Make sure the batch size matches the length of the generators."
-                )
-
-            elif isinstance(generator, list):
-                init_latents = [
-                    retrieve_latents(
-                        self.vae.encode(image[i : i + 1]), generator=generator[i]
-                    )
-                    for i in range(batch_size)
-                ]
-                init_latents = torch.cat(init_latents, dim=0)
-            else:
-                init_latents = retrieve_latents(
-                    self.vae.encode(image), generator=generator
-                )
-
-            if self.vae.config.force_upcast:
-                self.vae.to(dtype)
-
-            init_latents = init_latents.to(dtype)
-            if latents_mean is not None and latents_std is not None:
-                latents_mean = latents_mean.to(device=self.device, dtype=dtype)
-                latents_std = latents_std.to(device=self.device, dtype=dtype)
-                init_latents = (
-                    (init_latents - latents_mean)
-                    * self.vae.config.scaling_factor
-                    / latents_std
-                )
-            else:
-                init_latents = self.vae.config.scaling_factor * init_latents
-
-        if (
-            batch_size > init_latents.shape[0]
-            and batch_size % init_latents.shape[0] == 0
-        ):
-            # expand init_latents for batch_size
-            additional_image_per_prompt = batch_size // init_latents.shape[0]
-            init_latents = torch.cat(
-                [init_latents] * additional_image_per_prompt, dim=0
-            )
-        elif (
-            batch_size > init_latents.shape[0]
-            and batch_size % init_latents.shape[0] != 0
-        ):
-            raise ValueError(
-                f"Cannot duplicate `image` of batch size {init_latents.shape[0]} to {batch_size} text prompts."
-            )
-        else:
-            init_latents = torch.cat([init_latents], dim=0)
-
-        if add_noise:
-            shape = init_latents.shape
-            noise = randn_tensor(shape, generator=generator, device=device, dtype=dtype)
-            # get latents
-            init_latents = self.scheduler.add_noise(init_latents, noise, timestep)
-
-        latents = init_latents
-
-        return latents
-
-    # Copied from diffusers.pipelines.stable_diffusion.pipeline_stable_diffusion.StableDiffusionPipeline.encode_image
-    def encode_image(
-        self, image, device, num_images_per_prompt, output_hidden_states=None
-    ):
-        dtype = next(self.image_encoder.parameters()).dtype
-
-        if not isinstance(image, torch.Tensor):
-            image = self.feature_extractor(image, return_tensors="pt").pixel_values
-
-        image = image.to(device=device, dtype=dtype)
-        if output_hidden_states:
-            image_enc_hidden_states = self.image_encoder(
-                image, output_hidden_states=True
-            ).hidden_states[-2]
-            image_enc_hidden_states = image_enc_hidden_states.repeat_interleave(
-                num_images_per_prompt, dim=0
-            )
-            uncond_image_enc_hidden_states = self.image_encoder(
-                torch.zeros_like(image), output_hidden_states=True
-            ).hidden_states[-2]
-            uncond_image_enc_hidden_states = (
-                uncond_image_enc_hidden_states.repeat_interleave(
-                    num_images_per_prompt, dim=0
-                )
-            )
-            return image_enc_hidden_states, uncond_image_enc_hidden_states
-        else:
-            image_embeds = self.image_encoder(image).image_embeds
-            image_embeds = image_embeds.repeat_interleave(num_images_per_prompt, dim=0)
-            uncond_image_embeds = torch.zeros_like(image_embeds)
-
-            return image_embeds, uncond_image_embeds
-
-    # Copied from diffusers.pipelines.stable_diffusion.pipeline_stable_diffusion.StableDiffusionPipeline.prepare_ip_adapter_image_embeds
-    def prepare_ip_adapter_image_embeds(
-        self,
-        ip_adapter_image,
-        ip_adapter_image_embeds,
-        device,
-        num_images_per_prompt,
-        do_classifier_free_guidance,
-    ):
-        if ip_adapter_image_embeds is None:
-            if not isinstance(ip_adapter_image, list):
-                ip_adapter_image = [ip_adapter_image]
-
-            if len(ip_adapter_image) != len(
-                self.unet.encoder_hid_proj.image_projection_layers
-            ):
-                raise ValueError(
-                    f"`ip_adapter_image` must have same length as the number of IP Adapters. Got {len(ip_adapter_image)} images and {len(self.unet.encoder_hid_proj.image_projection_layers)} IP Adapters."
-                )
-
-            image_embeds = []
-            for single_ip_adapter_image, image_proj_layer in zip(
-                ip_adapter_image, self.unet.encoder_hid_proj.image_projection_layers
-            ):
-                output_hidden_state = not isinstance(image_proj_layer, ImageProjection)
-                single_image_embeds, single_negative_image_embeds = self.encode_image(
-                    single_ip_adapter_image, device, 1, output_hidden_state
-                )
-                single_image_embeds = torch.stack(
-                    [single_image_embeds] * num_images_per_prompt, dim=0
-                )
-                single_negative_image_embeds = torch.stack(
-                    [single_negative_image_embeds] * num_images_per_prompt, dim=0
-                )
-
-                if do_classifier_free_guidance:
-                    single_image_embeds = torch.cat(
-                        [single_negative_image_embeds, single_image_embeds]
-                    )
-                    single_image_embeds = single_image_embeds.to(device)
-
-                image_embeds.append(single_image_embeds)
-        else:
-            repeat_dims = [1]
-            image_embeds = []
-            for single_image_embeds in ip_adapter_image_embeds:
-                if do_classifier_free_guidance:
-                    single_negative_image_embeds, single_image_embeds = (
-                        single_image_embeds.chunk(2)
-                    )
-                    single_image_embeds = single_image_embeds.repeat(
-                        num_images_per_prompt,
-                        *(repeat_dims * len(single_image_embeds.shape[1:])),
-                    )
-                    single_negative_image_embeds = single_negative_image_embeds.repeat(
-                        num_images_per_prompt,
-                        *(repeat_dims * len(single_negative_image_embeds.shape[1:])),
-                    )
-                    single_image_embeds = torch.cat(
-                        [single_negative_image_embeds, single_image_embeds]
-                    )
-                else:
-                    single_image_embeds = single_image_embeds.repeat(
-                        num_images_per_prompt,
-                        *(repeat_dims * len(single_image_embeds.shape[1:])),
-                    )
-                image_embeds.append(single_image_embeds)
-
-        return image_embeds
-
-    def _get_add_time_ids(
-        self,
-        original_size,
-        crops_coords_top_left,
-        target_size,
-        aesthetic_score,
-        negative_aesthetic_score,
-        negative_original_size,
-        negative_crops_coords_top_left,
-        negative_target_size,
-        dtype,
-        text_encoder_projection_dim=None,
-    ):
-        if self.config.requires_aesthetics_score:
-            add_time_ids = list(
-                original_size + crops_coords_top_left + (aesthetic_score,)
-            )
-            add_neg_time_ids = list(
-                negative_original_size
-                + negative_crops_coords_top_left
-                + (negative_aesthetic_score,)
-            )
-        else:
-            add_time_ids = list(original_size + crops_coords_top_left + target_size)
-            add_neg_time_ids = list(
-                negative_original_size + crops_coords_top_left + negative_target_size
-            )
-
-        passed_add_embed_dim = (
-            self.unet.config.addition_time_embed_dim * len(add_time_ids)
-            + text_encoder_projection_dim
-        )
-        expected_add_embed_dim = self.unet.add_embedding.linear_1.in_features
-
-        if (
-            expected_add_embed_dim > passed_add_embed_dim
-            and (expected_add_embed_dim - passed_add_embed_dim)
-            == self.unet.config.addition_time_embed_dim
-        ):
-            raise ValueError(
-                f"Model expects an added time embedding vector of length {expected_add_embed_dim}, but a vector of {passed_add_embed_dim} was created. Please make sure to enable `requires_aesthetics_score` with `pipe.register_to_config(requires_aesthetics_score=True)` to make sure `aesthetic_score` {aesthetic_score} and `negative_aesthetic_score` {negative_aesthetic_score} is correctly used by the model."
-            )
-        elif (
-            expected_add_embed_dim < passed_add_embed_dim
-            and (passed_add_embed_dim - expected_add_embed_dim)
-            == self.unet.config.addition_time_embed_dim
-        ):
-            raise ValueError(
-                f"Model expects an added time embedding vector of length {expected_add_embed_dim}, but a vector of {passed_add_embed_dim} was created. Please make sure to disable `requires_aesthetics_score` with `pipe.register_to_config(requires_aesthetics_score=False)` to make sure `target_size` {target_size} is correctly used by the model."
-            )
-        elif expected_add_embed_dim != passed_add_embed_dim:
-            raise ValueError(
-                f"Model expects an added time embedding vector of length {expected_add_embed_dim}, but a vector of {passed_add_embed_dim} was created. The model has an incorrect config. Please check `unet.config.time_embedding_type` and `text_encoder_2.config.projection_dim`."
-            )
-
-        add_time_ids = torch.tensor([add_time_ids], dtype=dtype)
-        add_neg_time_ids = torch.tensor([add_neg_time_ids], dtype=dtype)
-
-        return add_time_ids, add_neg_time_ids
-
-    # Copied from diffusers.pipelines.stable_diffusion.pipeline_stable_diffusion_upscale.StableDiffusionUpscalePipeline.upcast_vae
-    def upcast_vae(self):
-        dtype = self.vae.dtype
-        self.vae.to(dtype=torch.float32)
-        use_torch_2_0_or_xformers = isinstance(
-            self.vae.decoder.mid_block.attentions[0].processor,
-            (
-                AttnProcessor2_0,
-                XFormersAttnProcessor,
-            ),
-        )
-        # if xformers or torch_2_0 is used attention block does not need
-        # to be in float32 which can save lots of memory
-        if use_torch_2_0_or_xformers:
-            self.vae.post_quant_conv.to(dtype)
-            self.vae.decoder.conv_in.to(dtype)
-            self.vae.decoder.mid_block.to(dtype)
-
-    # Copied from diffusers.pipelines.latent_consistency_models.pipeline_latent_consistency_text2img.LatentConsistencyModelPipeline.get_guidance_scale_embedding
-    def get_guidance_scale_embedding(
-        self,
-        w: torch.Tensor,
-        embedding_dim: int = 512,
-        dtype: torch.dtype = torch.float32,
-    ) -> torch.Tensor:
-        """
-        See https://github.com/google-research/vdm/blob/dc27b98a554f65cdc654b800da5aa1846545d41b/model_vdm.py#L298
-
-        Args:
-            w (`torch.Tensor`):
-                Generate embedding vectors with a specified guidance scale to subsequently enrich timestep embeddings.
-            embedding_dim (`int`, *optional*, defaults to 512):
-                Dimension of the embeddings to generate.
-            dtype (`torch.dtype`, *optional*, defaults to `torch.float32`):
-                Data type of the generated embeddings.
-
-        Returns:
-            `torch.Tensor`: Embedding vectors with shape `(len(w), embedding_dim)`.
-        """
-        assert len(w.shape) == 1
-        w = w * 1000.0
-
-        half_dim = embedding_dim // 2
-        emb = torch.log(torch.tensor(10000.0)) / (half_dim - 1)
-        emb = torch.exp(torch.arange(half_dim, dtype=dtype) * -emb)
-        emb = w.to(dtype)[:, None] * emb[None, :]
-        emb = torch.cat([torch.sin(emb), torch.cos(emb)], dim=1)
-        if embedding_dim % 2 == 1:  # zero pad
-            emb = torch.nn.functional.pad(emb, (0, 1))
-        assert emb.shape == (w.shape[0], embedding_dim)
-        return emb
-
-    @property
-    def guidance_scale(self):
-        return self._guidance_scale
-
-    @property
-    def guidance_rescale(self):
-        return self._guidance_rescale
-
-    @property
-    def clip_skip(self):
-        return self._clip_skip
-
-    # here `guidance_scale` is defined analog to the guidance weight `w` of equation (2)
-    # of the Imagen paper: https://arxiv.org/pdf/2205.11487.pdf . `guidance_scale = 1`
-    # corresponds to doing no classifier free guidance.
-    @property
-    def do_classifier_free_guidance(self):
-        return self._guidance_scale > 1 and self.unet.config.time_cond_proj_dim is None
-
-    @property
-    def cross_attention_kwargs(self):
-        return self._cross_attention_kwargs
-
-    @property
-    def denoising_end(self):
-        return self._denoising_end
-
-    @property
-    def denoising_start(self):
-        return self._denoising_start
-
-    @property
-    def num_timesteps(self):
-        return self._num_timesteps
-
-    @property
-    def interrupt(self):
-        return self._interrupt
-
-    @torch.no_grad()
-    @replace_example_docstring(EXAMPLE_DOC_STRING)
-    def __call__(
-        self,
-        prompt: Union[str, List[str]] = None,
-        prompt_2: Optional[Union[str, List[str]]] = None,
-        image: PipelineImageInput = None,
-        strength: float = 0.3,
-        num_inference_steps: int = 50,
-        timesteps: List[int] = None,
-        sigmas: List[float] = None,
-        denoising_start: Optional[float] = None,
-        denoising_end: Optional[float] = None,
-        guidance_scale: float = 5.0,
-        negative_prompt: Optional[Union[str, List[str]]] = None,
-        negative_prompt_2: Optional[Union[str, List[str]]] = None,
-        num_images_per_prompt: Optional[int] = 1,
-        eta: float = 0.0,
-        generator: Optional[Union[torch.Generator, List[torch.Generator]]] = None,
-        latents: Optional[torch.Tensor] = None,
-        prompt_embeds: Optional[torch.Tensor] = None,
-        negative_prompt_embeds: Optional[torch.Tensor] = None,
-        pooled_prompt_embeds: Optional[torch.Tensor] = None,
-        negative_pooled_prompt_embeds: Optional[torch.Tensor] = None,
-        ip_adapter_image: Optional[PipelineImageInput] = None,
-        ip_adapter_image_embeds: Optional[List[torch.Tensor]] = None,
-        output_type: Optional[str] = "pil",
-        return_dict: bool = True,
-        cross_attention_kwargs: Optional[Dict[str, Any]] = None,
-        guidance_rescale: float = 0.0,
-        original_size: Tuple[int, int] = None,
-        crops_coords_top_left: Tuple[int, int] = (0, 0),
-        target_size: Tuple[int, int] = None,
-        negative_original_size: Optional[Tuple[int, int]] = None,
-        negative_crops_coords_top_left: Tuple[int, int] = (0, 0),
-        negative_target_size: Optional[Tuple[int, int]] = None,
-        aesthetic_score: float = 6.0,
-        negative_aesthetic_score: float = 2.5,
-        clip_skip: Optional[int] = None,
-        callback_on_step_end: Optional[
-            Union[
-                Callable[[int, int, Dict], None],
-                PipelineCallback,
-                MultiPipelineCallbacks,
-            ]
-        ] = None,
-        callback_on_step_end_tensor_inputs: List[str] = ["latents"],
-        **kwargs,
-    ):
-        r"""
-        Function invoked when calling the pipeline for generation.
-
-        Args:
-            prompt (`str` or `List[str]`, *optional*):
-                The prompt or prompts to guide the image generation. If not defined, one has to pass `prompt_embeds`.
-                instead.
-            prompt_2 (`str` or `List[str]`, *optional*):
-                The prompt or prompts to be sent to the `tokenizer_2` and `text_encoder_2`. If not defined, `prompt` is
-                used in both text-encoders
-            image (`torch.Tensor` or `PIL.Image.Image` or `np.ndarray` or `List[torch.Tensor]` or `List[PIL.Image.Image]` or `List[np.ndarray]`):
-                The image(s) to modify with the pipeline.
-            strength (`float`, *optional*, defaults to 0.3):
-                Conceptually, indicates how much to transform the reference `image`. Must be between 0 and 1. `image`
-                will be used as a starting point, adding more noise to it the larger the `strength`. The number of
-                denoising steps depends on the amount of noise initially added. When `strength` is 1, added noise will
-                be maximum and the denoising process will run for the full number of iterations specified in
-                `num_inference_steps`. A value of 1, therefore, essentially ignores `image`. Note that in the case of
-                `denoising_start` being declared as an integer, the value of `strength` will be ignored.
-            num_inference_steps (`int`, *optional*, defaults to 50):
-                The number of denoising steps. More denoising steps usually lead to a higher quality image at the
-                expense of slower inference.
-            timesteps (`List[int]`, *optional*):
-                Custom timesteps to use for the denoising process with schedulers which support a `timesteps` argument
-                in their `set_timesteps` method. If not defined, the default behavior when `num_inference_steps` is
-                passed will be used. Must be in descending order.
-            sigmas (`List[float]`, *optional*):
-                Custom sigmas to use for the denoising process with schedulers which support a `sigmas` argument in
-                their `set_timesteps` method. If not defined, the default behavior when `num_inference_steps` is passed
-                will be used.
-            denoising_start (`float`, *optional*):
-                When specified, indicates the fraction (between 0.0 and 1.0) of the total denoising process to be
-                bypassed before it is initiated. Consequently, the initial part of the denoising process is skipped and
-                it is assumed that the passed `image` is a partly denoised image. Note that when this is specified,
-                strength will be ignored. The `denoising_start` parameter is particularly beneficial when this pipeline
-                is integrated into a "Mixture of Denoisers" multi-pipeline setup, as detailed in [**Refine Image
-                Quality**](https://huggingface.co/docs/diffusers/using-diffusers/sdxl#refine-image-quality).
-            denoising_end (`float`, *optional*):
-                When specified, determines the fraction (between 0.0 and 1.0) of the total denoising process to be
-                completed before it is intentionally prematurely terminated. As a result, the returned sample will
-                still retain a substantial amount of noise (ca. final 20% of timesteps still needed) and should be
-                denoised by a successor pipeline that has `denoising_start` set to 0.8 so that it only denoises the
-                final 20% of the scheduler. The denoising_end parameter should ideally be utilized when this pipeline
-                forms a part of a "Mixture of Denoisers" multi-pipeline setup, as elaborated in [**Refine Image
-                Quality**](https://huggingface.co/docs/diffusers/using-diffusers/sdxl#refine-image-quality).
-            guidance_scale (`float`, *optional*, defaults to 7.5):
-                Guidance scale as defined in [Classifier-Free Diffusion Guidance](https://arxiv.org/abs/2207.12598).
-                `guidance_scale` is defined as `w` of equation 2. of [Imagen
-                Paper](https://arxiv.org/pdf/2205.11487.pdf). Guidance scale is enabled by setting `guidance_scale >
-                1`. Higher guidance scale encourages to generate images that are closely linked to the text `prompt`,
-                usually at the expense of lower image quality.
-            negative_prompt (`str` or `List[str]`, *optional*):
-                The prompt or prompts not to guide the image generation. If not defined, one has to pass
-                `negative_prompt_embeds` instead. Ignored when not using guidance (i.e., ignored if `guidance_scale` is
-                less than `1`).
-            negative_prompt_2 (`str` or `List[str]`, *optional*):
-                The prompt or prompts not to guide the image generation to be sent to `tokenizer_2` and
-                `text_encoder_2`. If not defined, `negative_prompt` is used in both text-encoders
-            num_images_per_prompt (`int`, *optional*, defaults to 1):
-                The number of images to generate per prompt.
-            eta (`float`, *optional*, defaults to 0.0):
-                Corresponds to parameter eta (η) in the DDIM paper: https://arxiv.org/abs/2010.02502. Only applies to
-                [`schedulers.DDIMScheduler`], will be ignored for others.
-            generator (`torch.Generator` or `List[torch.Generator]`, *optional*):
-                One or a list of [torch generator(s)](https://pytorch.org/docs/stable/generated/torch.Generator.html)
-                to make generation deterministic.
-            latents (`torch.Tensor`, *optional*):
-                Pre-generated noisy latents, sampled from a Gaussian distribution, to be used as inputs for image
-                generation. Can be used to tweak the same generation with different prompts. If not provided, a latents
-                tensor will ge generated by sampling using the supplied random `generator`.
-            prompt_embeds (`torch.Tensor`, *optional*):
-                Pre-generated text embeddings. Can be used to easily tweak text inputs, *e.g.* prompt weighting. If not
-                provided, text embeddings will be generated from `prompt` input argument.
-            negative_prompt_embeds (`torch.Tensor`, *optional*):
-                Pre-generated negative text embeddings. Can be used to easily tweak text inputs, *e.g.* prompt
-                weighting. If not provided, negative_prompt_embeds will be generated from `negative_prompt` input
-                argument.
-            pooled_prompt_embeds (`torch.Tensor`, *optional*):
-                Pre-generated pooled text embeddings. Can be used to easily tweak text inputs, *e.g.* prompt weighting.
-                If not provided, pooled text embeddings will be generated from `prompt` input argument.
-            negative_pooled_prompt_embeds (`torch.Tensor`, *optional*):
-                Pre-generated negative pooled text embeddings. Can be used to easily tweak text inputs, *e.g.* prompt
-                weighting. If not provided, pooled negative_prompt_embeds will be generated from `negative_prompt`
-                input argument.
-            ip_adapter_image: (`PipelineImageInput`, *optional*): Optional image input to work with IP Adapters.
-            ip_adapter_image_embeds (`List[torch.Tensor]`, *optional*):
-                Pre-generated image embeddings for IP-Adapter. It should be a list of length same as number of
-                IP-adapters. Each element should be a tensor of shape `(batch_size, num_images, emb_dim)`. It should
-                contain the negative image embedding if `do_classifier_free_guidance` is set to `True`. If not
-                provided, embeddings are computed from the `ip_adapter_image` input argument.
-            output_type (`str`, *optional*, defaults to `"pil"`):
-                The output format of the generate image. Choose between
-                [PIL](https://pillow.readthedocs.io/en/stable/): `PIL.Image.Image` or `np.array`.
-            return_dict (`bool`, *optional*, defaults to `True`):
-                Whether or not to return a [`~pipelines.stable_diffusion.StableDiffusionXLPipelineOutput`] instead of a
-                plain tuple.
-            cross_attention_kwargs (`dict`, *optional*):
-                A kwargs dictionary that if specified is passed along to the `AttentionProcessor` as defined under
-                `self.processor` in
-                [diffusers.models.attention_processor](https://github.com/huggingface/diffusers/blob/main/src/diffusers/models/attention_processor.py).
-            guidance_rescale (`float`, *optional*, defaults to 0.0):
-                Guidance rescale factor proposed by [Common Diffusion Noise Schedules and Sample Steps are
-                Flawed](https://arxiv.org/pdf/2305.08891.pdf) `guidance_scale` is defined as `φ` in equation 16. of
-                [Common Diffusion Noise Schedules and Sample Steps are Flawed](https://arxiv.org/pdf/2305.08891.pdf).
-                Guidance rescale factor should fix overexposure when using zero terminal SNR.
-            original_size (`Tuple[int]`, *optional*, defaults to (1024, 1024)):
-                If `original_size` is not the same as `target_size` the image will appear to be down- or upsampled.
-                `original_size` defaults to `(height, width)` if not specified. Part of SDXL's micro-conditioning as
-                explained in section 2.2 of
-                [https://huggingface.co/papers/2307.01952](https://huggingface.co/papers/2307.01952).
-            crops_coords_top_left (`Tuple[int]`, *optional*, defaults to (0, 0)):
-                `crops_coords_top_left` can be used to generate an image that appears to be "cropped" from the position
-                `crops_coords_top_left` downwards. Favorable, well-centered images are usually achieved by setting
-                `crops_coords_top_left` to (0, 0). Part of SDXL's micro-conditioning as explained in section 2.2 of
-                [https://huggingface.co/papers/2307.01952](https://huggingface.co/papers/2307.01952).
-            target_size (`Tuple[int]`, *optional*, defaults to (1024, 1024)):
-                For most cases, `target_size` should be set to the desired height and width of the generated image. If
-                not specified it will default to `(height, width)`. Part of SDXL's micro-conditioning as explained in
-                section 2.2 of [https://huggingface.co/papers/2307.01952](https://huggingface.co/papers/2307.01952).
-            negative_original_size (`Tuple[int]`, *optional*, defaults to (1024, 1024)):
-                To negatively condition the generation process based on a specific image resolution. Part of SDXL's
-                micro-conditioning as explained in section 2.2 of
-                [https://huggingface.co/papers/2307.01952](https://huggingface.co/papers/2307.01952). For more
-                information, refer to this issue thread: https://github.com/huggingface/diffusers/issues/4208.
-            negative_crops_coords_top_left (`Tuple[int]`, *optional*, defaults to (0, 0)):
-                To negatively condition the generation process based on a specific crop coordinates. Part of SDXL's
-                micro-conditioning as explained in section 2.2 of
-                [https://huggingface.co/papers/2307.01952](https://huggingface.co/papers/2307.01952). For more
-                information, refer to this issue thread: https://github.com/huggingface/diffusers/issues/4208.
-            negative_target_size (`Tuple[int]`, *optional*, defaults to (1024, 1024)):
-                To negatively condition the generation process based on a target image resolution. It should be as same
-                as the `target_size` for most cases. Part of SDXL's micro-conditioning as explained in section 2.2 of
-                [https://huggingface.co/papers/2307.01952](https://huggingface.co/papers/2307.01952). For more
-                information, refer to this issue thread: https://github.com/huggingface/diffusers/issues/4208.
-            aesthetic_score (`float`, *optional*, defaults to 6.0):
-                Used to simulate an aesthetic score of the generated image by influencing the positive text condition.
-                Part of SDXL's micro-conditioning as explained in section 2.2 of
-                [https://huggingface.co/papers/2307.01952](https://huggingface.co/papers/2307.01952).
-            negative_aesthetic_score (`float`, *optional*, defaults to 2.5):
-                Part of SDXL's micro-conditioning as explained in section 2.2 of
-                [https://huggingface.co/papers/2307.01952](https://huggingface.co/papers/2307.01952). Can be used to
-                simulate an aesthetic score of the generated image by influencing the negative text condition.
-            clip_skip (`int`, *optional*):
-                Number of layers to be skipped from CLIP while computing the prompt embeddings. A value of 1 means that
-                the output of the pre-final layer will be used for computing the prompt embeddings.
-            callback_on_step_end (`Callable`, `PipelineCallback`, `MultiPipelineCallbacks`, *optional*):
-                A function or a subclass of `PipelineCallback` or `MultiPipelineCallbacks` that is called at the end of
-                each denoising step during the inference. with the following arguments: `callback_on_step_end(self:
-                DiffusionPipeline, step: int, timestep: int, callback_kwargs: Dict)`. `callback_kwargs` will include a
-                list of all tensors as specified by `callback_on_step_end_tensor_inputs`.
-            callback_on_step_end_tensor_inputs (`List`, *optional*):
-                The list of tensor inputs for the `callback_on_step_end` function. The tensors specified in the list
-                will be passed as `callback_kwargs` argument. You will only be able to include variables listed in the
-                `._callback_tensor_inputs` attribute of your pipeline class.
-
-        Examples:
-
-        Returns:
-            [`~pipelines.stable_diffusion.StableDiffusionXLPipelineOutput`] or `tuple`:
-            [`~pipelines.stable_diffusion.StableDiffusionXLPipelineOutput`] if `return_dict` is True, otherwise a
-            `tuple. When returning a tuple, the first element is a list with the generated images.
-        """
-
-        callback = kwargs.pop("callback", None)
-        callback_steps = kwargs.pop("callback_steps", None)
-
-        if callback is not None:
-            deprecate(
-                "callback",
-                "1.0.0",
-                "Passing `callback` as an input argument to `__call__` is deprecated, consider use `callback_on_step_end`",
-            )
-        if callback_steps is not None:
-            deprecate(
-                "callback_steps",
-                "1.0.0",
-                "Passing `callback_steps` as an input argument to `__call__` is deprecated, consider use `callback_on_step_end`",
-            )
-
-        if isinstance(callback_on_step_end, (PipelineCallback, MultiPipelineCallbacks)):
-            callback_on_step_end_tensor_inputs = callback_on_step_end.tensor_inputs
-
-        # 1. Check inputs. Raise error if not correct
-        self.check_inputs(
-            prompt,
-            prompt_2,
-            strength,
-            num_inference_steps,
-            callback_steps,
-            negative_prompt,
-            negative_prompt_2,
-            prompt_embeds,
-            negative_prompt_embeds,
-            ip_adapter_image,
-            ip_adapter_image_embeds,
-            callback_on_step_end_tensor_inputs,
-        )
-
-        self._guidance_scale = guidance_scale
-        self._guidance_rescale = guidance_rescale
-        self._clip_skip = clip_skip
-        self._cross_attention_kwargs = cross_attention_kwargs
-        self._denoising_end = denoising_end
-        self._denoising_start = denoising_start
-        self._interrupt = False
-
-        # 2. Define call parameters
-        if prompt is not None and isinstance(prompt, str):
-            batch_size = 1
-        elif prompt is not None and isinstance(prompt, list):
-            batch_size = len(prompt)
-        else:
-            batch_size = prompt_embeds.shape[0]
-
-        device = self._execution_device
-
-        # 3. Encode input prompt
-        text_encoder_lora_scale = (
-            self.cross_attention_kwargs.get("scale", None)
-            if self.cross_attention_kwargs is not None
-            else None
-        )
-        (
-            prompt_embeds,
-            negative_prompt_embeds,
-            pooled_prompt_embeds,
-            negative_pooled_prompt_embeds,
-        ) = self.encode_prompt(
-            prompt=prompt,
-            prompt_2=prompt_2,
-            device=device,
-            num_images_per_prompt=num_images_per_prompt,
-            do_classifier_free_guidance=self.do_classifier_free_guidance,
-            negative_prompt=negative_prompt,
-            negative_prompt_2=negative_prompt_2,
-            prompt_embeds=prompt_embeds,
-            negative_prompt_embeds=negative_prompt_embeds,
-            pooled_prompt_embeds=pooled_prompt_embeds,
-            negative_pooled_prompt_embeds=negative_pooled_prompt_embeds,
-            lora_scale=text_encoder_lora_scale,
-            clip_skip=self.clip_skip,
-        )
-
-        # 4. Preprocess image
-        image = self.image_processor.preprocess(image)
-
-        # 5. Prepare timesteps
-        def denoising_value_valid(dnv):
-            return isinstance(dnv, float) and 0 < dnv < 1
-
-        timesteps, num_inference_steps = retrieve_timesteps(
-            self.scheduler,
-            num_inference_steps,
-            device,
-            timesteps=timesteps,
-            sigmas=sigmas,
-        )
-        timesteps, num_inference_steps = self.get_timesteps(
-            num_inference_steps,
-            strength,
-            device,
-            denoising_start=(
-                self.denoising_start
-                if denoising_value_valid(self.denoising_start)
-                else None
-            ),
-        )
-        latent_timestep = timesteps[:1].repeat(batch_size * num_images_per_prompt)
-
-        add_noise = True if self.denoising_start is None else False
-
-        # 6. Prepare latent variables
-        if latents is None:
-            latents = self.prepare_latents(
-                image,
-                latent_timestep,
-                batch_size,
-                num_images_per_prompt,
-                prompt_embeds.dtype,
-                device,
-                generator,
-                add_noise,
-            )
-        # 7. Prepare extra step kwargs.
-        extra_step_kwargs = self.prepare_extra_step_kwargs(generator, eta)
-
-        height, width = latents.shape[-2:]
-        height = height * self.vae_scale_factor
-        width = width * self.vae_scale_factor
-
-        original_size = original_size or (height, width)
-        target_size = target_size or (height, width)
-
-        # 8. Prepare added time ids & embeddings
-        if negative_original_size is None:
-            negative_original_size = original_size
-        if negative_target_size is None:
-            negative_target_size = target_size
-
-        add_text_embeds = pooled_prompt_embeds
-        if self.text_encoder_2 is None:
-            text_encoder_projection_dim = int(pooled_prompt_embeds.shape[-1])
-        else:
-            text_encoder_projection_dim = self.text_encoder_2.config.projection_dim
-
-        add_time_ids, add_neg_time_ids = self._get_add_time_ids(
-            original_size,
-            crops_coords_top_left,
-            target_size,
-            aesthetic_score,
-            negative_aesthetic_score,
-            negative_original_size,
-            negative_crops_coords_top_left,
-            negative_target_size,
-            dtype=prompt_embeds.dtype,
-            text_encoder_projection_dim=text_encoder_projection_dim,
-        )
-        add_time_ids = add_time_ids.repeat(batch_size * num_images_per_prompt, 1)
-
-        if self.do_classifier_free_guidance:
-            prompt_embeds = torch.cat([negative_prompt_embeds, prompt_embeds], dim=0)
-            add_text_embeds = torch.cat(
-                [negative_pooled_prompt_embeds, add_text_embeds], dim=0
-            )
-            add_neg_time_ids = add_neg_time_ids.repeat(
-                batch_size * num_images_per_prompt, 1
-            )
-            add_time_ids = torch.cat([add_neg_time_ids, add_time_ids], dim=0)
-
-        prompt_embeds = prompt_embeds.to(device)
-        add_text_embeds = add_text_embeds.to(device)
-        add_time_ids = add_time_ids.to(device)
-
-        if ip_adapter_image is not None or ip_adapter_image_embeds is not None:
-            image_embeds = self.prepare_ip_adapter_image_embeds(
-                ip_adapter_image,
-                ip_adapter_image_embeds,
-                device,
-                batch_size * num_images_per_prompt,
-                self.do_classifier_free_guidance,
-            )
-
-        # 9. Denoising loop
-        num_warmup_steps = max(
-            len(timesteps) - num_inference_steps * self.scheduler.order, 0
-        )
-
-        # 9.1 Apply denoising_end
-        if (
-            self.denoising_end is not None
-            and self.denoising_start is not None
-            and denoising_value_valid(self.denoising_end)
-            and denoising_value_valid(self.denoising_start)
-            and self.denoising_start >= self.denoising_end
-        ):
-            raise ValueError(
-                f"`denoising_start`: {self.denoising_start} cannot be larger than or equal to `denoising_end`: "
-                + f" {self.denoising_end} when using type float."
-            )
-        elif self.denoising_end is not None and denoising_value_valid(
-            self.denoising_end
-        ):
-            discrete_timestep_cutoff = int(
-                round(
-                    self.scheduler.config.num_train_timesteps
-                    - (self.denoising_end * self.scheduler.config.num_train_timesteps)
-                )
-            )
-            num_inference_steps = len(
-                list(filter(lambda ts: ts >= discrete_timestep_cutoff, timesteps))
-            )
-            timesteps = timesteps[:num_inference_steps]
-
-        # 9.2 Optionally get Guidance Scale Embedding
-        timestep_cond = None
-        if self.unet.config.time_cond_proj_dim is not None:
-            guidance_scale_tensor = torch.tensor(self.guidance_scale - 1).repeat(
-                batch_size * num_images_per_prompt
-            )
-            timestep_cond = self.get_guidance_scale_embedding(
-                guidance_scale_tensor, embedding_dim=self.unet.config.time_cond_proj_dim
-            ).to(device=device, dtype=latents.dtype)
-
-        self._num_timesteps = len(timesteps)
-        with self.progress_bar(total=num_inference_steps) as progress_bar:
-            for i, t in enumerate(timesteps):
-                if self.interrupt:
-                    continue
-
-                # expand the latents if we are doing classifier free guidance
-                latent_model_input = (
-                    torch.cat([latents] * 2)
-                    if self.do_classifier_free_guidance
-                    else latents
-                )
-
-                latent_model_input = self.scheduler.scale_model_input(
-                    latent_model_input, t
-                )
-
-                # predict the noise residual
-                added_cond_kwargs = {
-                    "text_embeds": add_text_embeds,
-                    "time_ids": add_time_ids,
-                }
-                if ip_adapter_image is not None or ip_adapter_image_embeds is not None:
-                    added_cond_kwargs["image_embeds"] = image_embeds
-                noise_pred = self.unet(
-                    latent_model_input,
-                    t,
-                    encoder_hidden_states=prompt_embeds,
-                    timestep_cond=timestep_cond,
-                    cross_attention_kwargs=self.cross_attention_kwargs,
-                    added_cond_kwargs=added_cond_kwargs,
-                    return_dict=False,
-                )[0]
-
-                # perform guidance
-                if self.do_classifier_free_guidance:
-                    noise_pred_uncond, noise_pred_text = noise_pred.chunk(2)
-                    noise_pred = noise_pred_uncond + self.guidance_scale * (
-                        noise_pred_text - noise_pred_uncond
-                    )
-
-                if self.do_classifier_free_guidance and self.guidance_rescale > 0.0:
-                    # Based on 3.4. in https://arxiv.org/pdf/2305.08891.pdf
-                    noise_pred = rescale_noise_cfg(
-                        noise_pred,
-                        noise_pred_text,
-                        guidance_rescale=self.guidance_rescale,
-                    )
-
-                # compute the previous noisy sample x_t -> x_t-1
-                latents_dtype = latents.dtype
-                latents = self.scheduler.step(
-                    noise_pred, t, latents, **extra_step_kwargs, return_dict=False
-                )[0]
-                if latents.dtype != latents_dtype:
-                    if torch.backends.mps.is_available():
-                        # some platforms (eg. apple mps) misbehave due to a pytorch bug: https://github.com/pytorch/pytorch/pull/99272
-                        latents = latents.to(latents_dtype)
-
-                if callback_on_step_end is not None:
-                    callback_kwargs = {}
-                    for k in callback_on_step_end_tensor_inputs:
-                        callback_kwargs[k] = locals()[k]
-                    callback_outputs = callback_on_step_end(self, i, t, callback_kwargs)
-
-                    latents = callback_outputs.pop("latents", latents)
-                    prompt_embeds = callback_outputs.pop("prompt_embeds", prompt_embeds)
-                    negative_prompt_embeds = callback_outputs.pop(
-                        "negative_prompt_embeds", negative_prompt_embeds
-                    )
-                    add_text_embeds = callback_outputs.pop(
-                        "add_text_embeds", add_text_embeds
-                    )
-                    negative_pooled_prompt_embeds = callback_outputs.pop(
-                        "negative_pooled_prompt_embeds", negative_pooled_prompt_embeds
-                    )
-                    add_time_ids = callback_outputs.pop("add_time_ids", add_time_ids)
-                    add_neg_time_ids = callback_outputs.pop(
-                        "add_neg_time_ids", add_neg_time_ids
-                    )
-
-                # call the callback, if provided
-                if i == len(timesteps) - 1 or (
-                    (i + 1) > num_warmup_steps and (i + 1) % self.scheduler.order == 0
-                ):
-                    progress_bar.update()
-                    if callback is not None and i % callback_steps == 0:
-                        step_idx = i // getattr(self.scheduler, "order", 1)
-                        callback(step_idx, t, latents)
-
-                if XLA_AVAILABLE:
-                    xm.mark_step()
-
-        if not output_type == "latent":
-            # make sure the VAE is in float32 mode, as it overflows in float16
-            needs_upcasting = (
-                self.vae.dtype == torch.float16 and self.vae.config.force_upcast
-            )
-
-            if needs_upcasting:
-                self.upcast_vae()
-                latents = latents.to(
-                    next(iter(self.vae.post_quant_conv.parameters())).dtype
-                )
-            elif latents.dtype != self.vae.dtype:
-                if torch.backends.mps.is_available():
-                    # some platforms (eg. apple mps) misbehave due to a pytorch bug: https://github.com/pytorch/pytorch/pull/99272
-                    self.vae = self.vae.to(latents.dtype)
-
-            # unscale/denormalize the latents
-            # denormalize with the mean and std if available and not None
-            has_latents_mean = (
-                hasattr(self.vae.config, "latents_mean")
-                and self.vae.config.latents_mean is not None
-            )
-            has_latents_std = (
-                hasattr(self.vae.config, "latents_std")
-                and self.vae.config.latents_std is not None
-            )
-            if has_latents_mean and has_latents_std:
-                latents_mean = (
-                    torch.tensor(self.vae.config.latents_mean)
-                    .view(1, 4, 1, 1)
-                    .to(latents.device, latents.dtype)
-                )
-                latents_std = (
-                    torch.tensor(self.vae.config.latents_std)
-                    .view(1, 4, 1, 1)
-                    .to(latents.device, latents.dtype)
-                )
-                latents = (
-                    latents * latents_std / self.vae.config.scaling_factor
-                    + latents_mean
-                )
-            else:
-                latents = latents / self.vae.config.scaling_factor
-
-            image = self.vae.decode(latents.to(self.vae.dtype), return_dict=False)[0]
-
-            # cast back to fp16 if needed
-            if needs_upcasting:
-                self.vae.to(dtype=torch.float16)
-        else:
-            image = latents
-
-        # apply watermark if available
-        if self.watermark is not None:
-            image = self.watermark.apply_watermark(image)
-
-        image = self.image_processor.postprocess(image, output_type=output_type)
-
-        # Offload all models
-        self.maybe_free_model_hooks()
-
-        if not return_dict:
-            return (image,)
-
-        return StableDiffusionXLPipelineOutput(images=image)
diff --git a/videotuna/third_party/flux/models/smoldit/__init__.py b/videotuna/third_party/flux/models/smoldit/__init__.py
deleted file mode 100644
index a6dfb63e..00000000
--- a/videotuna/third_party/flux/models/smoldit/__init__.py
+++ /dev/null
@@ -1,67 +0,0 @@
-from videotuna.third_party.flux.models.smoldit.pipeline import SmolDiTPipeline
-from videotuna.third_party.flux.models.smoldit.transformer import SmolDiT2DModel
-
-SmolDiTConfigurations = {
-    "smoldit-small": {
-        "sample_size": 64,
-        "num_layers": 18,
-        "patch_size": 2,
-        "attention_head_dim": 64,
-        "num_attention_heads": 16,
-        "num_kv_heads": 4,
-        "in_channels": 4,
-        "cross_attention_dim": 768,
-        "out_channels": 4,
-        "activation_fn": "gelu-approximate",
-    },
-    "smoldit-swiglu": {
-        "sample_size": 64,
-        "num_layers": 24,
-        "patch_size": 2,
-        "attention_head_dim": 72,
-        "num_attention_heads": 16,
-        "num_kv_heads": 4,
-        "in_channels": 4,
-        "cross_attention_dim": 768,
-        "out_channels": 4,
-        "activation_fn": "swiglu",
-    },
-    "smoldit-base": {
-        "sample_size": 64,
-        "num_layers": 24,
-        "patch_size": 2,
-        "attention_head_dim": 72,
-        "num_attention_heads": 16,
-        "num_kv_heads": 4,
-        "in_channels": 4,
-        "cross_attention_dim": 768,
-        "out_channels": 4,
-        "activation_fn": "gelu-approximate",
-    },
-    "smoldit-large": {
-        "sample_size": 64,
-        "num_layers": 30,
-        "patch_size": 2,
-        "attention_head_dim": 72,
-        "num_attention_heads": 32,
-        "num_kv_heads": 8,
-        "in_channels": 4,
-        "cross_attention_dim": 768,
-        "out_channels": 4,
-        "activation_fn": "gelu-approximate",
-    },
-    "smoldit-huge": {
-        "sample_size": 64,
-        "num_layers": 36,
-        "patch_size": 2,
-        "attention_head_dim": 96,
-        "num_attention_heads": 64,
-        "num_kv_heads": 16,
-        "in_channels": 4,
-        "cross_attention_dim": 768,
-        "out_channels": 4,
-        "activation_fn": "gelu-approximate",
-    },
-}
-SmolDiTConfigurationNames = list(SmolDiTConfigurations.keys())
-
diff --git a/videotuna/third_party/flux/models/smoldit/pipeline.py b/videotuna/third_party/flux/models/smoldit/pipeline.py
deleted file mode 100644
index 8a9513d1..00000000
--- a/videotuna/third_party/flux/models/smoldit/pipeline.py
+++ /dev/null
@@ -1,607 +0,0 @@
-# Copyright 2024 PixArt and The HuggingFace Team. All rights reserved.
-#
-# Licensed under the Apache License, Version 2.0 (the "License");
-# you may not use this file except in compliance with the License.
-# You may obtain a copy of the License at
-#
-#     http://www.apache.org/licenses/LICENSE-2.0
-#
-# Unless required by applicable law or agreed to in writing, software
-# distributed under the License is distributed on an "AS IS" BASIS,
-# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
-# See the License for the specific language governing permissions and
-# limitations under the License.
-import inspect
-from typing import Callable, List, Optional, Union
-
-import torch
-from diffusers.image_processor import VaeImageProcessor
-from diffusers.models import AutoencoderKL
-from diffusers.models.embeddings import get_2d_rotary_pos_embed
-from diffusers.pipelines.pipeline_utils import DiffusionPipeline, ImagePipelineOutput
-from diffusers.schedulers import KarrasDiffusionSchedulers
-from diffusers.utils import logging
-from diffusers.utils.torch_utils import randn_tensor
-from transformers import T5EncoderModel, T5Tokenizer
-
-from videotuna.utils.common_utils import get_resize_crop_region_for_grid
-from videotuna.third_party.flux.models.smoldit.transformer import SmolDiT2DModel
-
-logger = logging.get_logger(__name__)  # pylint: disable=invalid-name
-
-
-# Copied from diffusers.pipelines.stable_diffusion.pipeline_stable_diffusion.rescale_noise_cfg
-def rescale_noise_cfg(noise_cfg, noise_pred_text, guidance_rescale=0.0):
-    """
-    Rescale `noise_cfg` according to `guidance_rescale`. Based on findings of [Common Diffusion Noise Schedules and
-    Sample Steps are Flawed](https://arxiv.org/pdf/2305.08891.pdf). See Section 3.4
-    """
-    std_text = noise_pred_text.std(
-        dim=list(range(1, noise_pred_text.ndim)), keepdim=True
-    )
-    std_cfg = noise_cfg.std(dim=list(range(1, noise_cfg.ndim)), keepdim=True)
-    # rescale the results from guidance (fixes overexposure)
-    noise_pred_rescaled = noise_cfg * (std_text / std_cfg)
-    # mix with the original results from guidance by factor guidance_rescale to avoid "plain looking" images
-    noise_cfg = (
-        guidance_rescale * noise_pred_rescaled + (1 - guidance_rescale) * noise_cfg
-    )
-    return noise_cfg
-
-
-def retrieve_timesteps(
-    scheduler,
-    num_inference_steps: Optional[int] = None,
-    device: Optional[Union[str, torch.device]] = None,
-    timesteps: Optional[List[int]] = None,
-    sigmas: Optional[List[float]] = None,
-    **kwargs,
-):
-    """
-    Calls the scheduler's `set_timesteps` method and retrieves timesteps from the scheduler after the call. Handles
-    custom timesteps. Any kwargs will be supplied to `scheduler.set_timesteps`.
-
-    Args:
-        scheduler (`SchedulerMixin`):
-            The scheduler to get timesteps from.
-        num_inference_steps (`int`):
-            The number of diffusion steps used when generating samples with a pre-trained model. If used, `timesteps`
-            must be `None`.
-        device (`str` or `torch.device`, *optional*):
-            The device to which the timesteps should be moved to. If `None`, the timesteps are not moved.
-        timesteps (`List[int]`, *optional*):
-            Custom timesteps used to override the timestep spacing strategy of the scheduler. If `timesteps` is passed,
-            `num_inference_steps` and `sigmas` must be `None`.
-        sigmas (`List[float]`, *optional*):
-            Custom sigmas used to override the timestep spacing strategy of the scheduler. If `sigmas` is passed,
-            `num_inference_steps` and `timesteps` must be `None`.
-
-    Returns:
-        `Tuple[torch.Tensor, int]`: A tuple where the first element is the timestep schedule from the scheduler and the
-        second element is the number of inference steps.
-    """
-    if timesteps is not None and sigmas is not None:
-        raise ValueError(
-            "Only one of `timesteps` or `sigmas` can be passed. Please choose one to set custom values"
-        )
-    if timesteps is not None:
-        accepts_timesteps = "timesteps" in set(
-            inspect.signature(scheduler.set_timesteps).parameters.keys()
-        )
-        if not accepts_timesteps:
-            raise ValueError(
-                f"The current scheduler class {scheduler.__class__}'s `set_timesteps` does not support custom"
-                f" timestep schedules. Please check whether you are using the correct scheduler."
-            )
-        scheduler.set_timesteps(timesteps=timesteps, device=device, **kwargs)
-        timesteps = scheduler.timesteps
-        num_inference_steps = len(timesteps)
-    elif sigmas is not None:
-        accept_sigmas = "sigmas" in set(
-            inspect.signature(scheduler.set_timesteps).parameters.keys()
-        )
-        if not accept_sigmas:
-            raise ValueError(
-                f"The current scheduler class {scheduler.__class__}'s `set_timesteps` does not support custom"
-                f" sigmas schedules. Please check whether you are using the correct scheduler."
-            )
-        scheduler.set_timesteps(sigmas=sigmas, device=device, **kwargs)
-        timesteps = scheduler.timesteps
-        num_inference_steps = len(timesteps)
-    else:
-        scheduler.set_timesteps(num_inference_steps, device=device, **kwargs)
-        timesteps = scheduler.timesteps
-    return timesteps, num_inference_steps
-
-
-class SmolDiTPipeline(DiffusionPipeline):
-    model_cpu_offload_seq = "text_encoder->transformer->vae"
-
-    @property
-    def guidance_rescale(self):
-        return self._guidance_rescale
-
-    def __init__(
-        self,
-        vae: AutoencoderKL,
-        text_encoder: T5EncoderModel,
-        tokenizer: T5Tokenizer,
-        transformer: SmolDiT2DModel,
-        scheduler: KarrasDiffusionSchedulers,
-    ):
-        super().__init__()
-
-        self.register_modules(
-            vae=vae,
-            text_encoder=text_encoder,
-            tokenizer=tokenizer,
-            transformer=transformer,
-            scheduler=scheduler,
-        )
-        self.vae_scale_factor = (
-            2 ** (len(self.vae.config.block_out_channels) - 1)
-            if hasattr(self, "vae") and self.vae is not None
-            else 8
-        )
-        self.image_processor = VaeImageProcessor(vae_scale_factor=self.vae_scale_factor)
-
-    def encode_prompt(
-        self,
-        prompt: Union[str, List[str]],
-        do_classifier_free_guidance: bool = True,
-        negative_prompt: str = "",
-        num_images_per_prompt: int = 1,
-        device: Optional[torch.device] = None,
-        prompt_embeds: Optional[torch.Tensor] = None,
-        negative_prompt_embeds: Optional[torch.Tensor] = None,
-        prompt_attention_mask: Optional[torch.Tensor] = None,
-        negative_prompt_attention_mask: Optional[torch.Tensor] = None,
-        max_sequence_length: int = 300,
-    ):
-        if device is None:
-            device = self._execution_device
-
-        if prompt is not None and isinstance(prompt, str):
-            batch_size = 1
-        elif prompt is not None and isinstance(prompt, list):
-            batch_size = len(prompt)
-        else:
-            batch_size = prompt_embeds.shape[0]
-
-        max_length = max_sequence_length
-
-        if prompt_embeds is None:
-            text_inputs = self.tokenizer(
-                prompt,
-                padding="max_length",
-                max_length=max_length,
-                truncation=True,
-                add_special_tokens=True,
-                return_tensors="pt",
-            )
-            text_input_ids = text_inputs.input_ids
-            prompt_attention_mask = text_inputs.attention_mask
-            prompt_attention_mask = prompt_attention_mask.to(device)
-
-            prompt_embeds = self.text_encoder(
-                text_input_ids.to(device), attention_mask=prompt_attention_mask
-            )
-            prompt_embeds = prompt_embeds[0]
-
-        if self.text_encoder is not None:
-            dtype = self.text_encoder.dtype
-        elif self.transformer is not None:
-            dtype = self.transformer.dtype
-        else:
-            dtype = None
-
-        prompt_embeds = prompt_embeds.to(dtype=dtype, device=device)
-
-        bs_embed, seq_len, _ = prompt_embeds.shape
-        # duplicate text embeddings and attention mask for each generation per prompt, using mps friendly method
-        prompt_embeds = prompt_embeds.repeat(1, num_images_per_prompt, 1)
-        prompt_embeds = prompt_embeds.view(
-            bs_embed * num_images_per_prompt, seq_len, -1
-        )
-        prompt_attention_mask = prompt_attention_mask.view(bs_embed, -1)
-        prompt_attention_mask = prompt_attention_mask.repeat(num_images_per_prompt, 1)
-
-        # get unconditional embeddings for classifier free guidance
-        if do_classifier_free_guidance and negative_prompt_embeds is None:
-            uncond_tokens = (
-                [negative_prompt] * batch_size
-                if isinstance(negative_prompt, str)
-                else negative_prompt
-            )
-            max_length = prompt_embeds.shape[1]
-            uncond_input = self.tokenizer(
-                uncond_tokens,
-                padding="max_length",
-                max_length=max_length,
-                truncation=True,
-                add_special_tokens=True,
-                return_tensors="pt",
-            )
-            negative_prompt_attention_mask = uncond_input.attention_mask
-            negative_prompt_attention_mask = negative_prompt_attention_mask.to(device)
-
-            negative_prompt_embeds = self.text_encoder(
-                uncond_input.input_ids.to(device),
-                attention_mask=negative_prompt_attention_mask,
-            )
-            negative_prompt_embeds = negative_prompt_embeds[0]
-
-        if do_classifier_free_guidance:
-            # duplicate unconditional embeddings for each generation per prompt, using mps friendly method
-            seq_len = negative_prompt_embeds.shape[1]
-
-            negative_prompt_embeds = negative_prompt_embeds.to(
-                dtype=dtype, device=device
-            )
-
-            negative_prompt_embeds = negative_prompt_embeds.repeat(
-                1, num_images_per_prompt, 1
-            )
-            negative_prompt_embeds = negative_prompt_embeds.view(
-                batch_size * num_images_per_prompt, seq_len, -1
-            )
-
-            negative_prompt_attention_mask = negative_prompt_attention_mask.view(
-                bs_embed, -1
-            )
-            negative_prompt_attention_mask = negative_prompt_attention_mask.repeat(
-                num_images_per_prompt, 1
-            )
-        else:
-            negative_prompt_embeds = None
-            negative_prompt_attention_mask = None
-
-        return (
-            prompt_embeds,
-            prompt_attention_mask,
-            negative_prompt_embeds,
-            negative_prompt_attention_mask,
-        )
-
-    def prepare_extra_step_kwargs(self, generator, eta):
-        # prepare extra kwargs for the scheduler step, since not all schedulers have the same signature
-        # eta (η) is only used with the DDIMScheduler, it will be ignored for other schedulers.
-        # eta corresponds to η in DDIM paper: https://arxiv.org/abs/2010.02502
-        # and should be between [0, 1]
-
-        accepts_eta = "eta" in set(
-            inspect.signature(self.scheduler.step).parameters.keys()
-        )
-        extra_step_kwargs = {}
-        if accepts_eta:
-            extra_step_kwargs["eta"] = eta
-
-        # check if the scheduler accepts generator
-        accepts_generator = "generator" in set(
-            inspect.signature(self.scheduler.step).parameters.keys()
-        )
-        if accepts_generator:
-            extra_step_kwargs["generator"] = generator
-        return extra_step_kwargs
-
-    def check_inputs(
-        self,
-        prompt,
-        height,
-        width,
-        negative_prompt,
-        callback_steps,
-        prompt_embeds=None,
-        negative_prompt_embeds=None,
-        prompt_attention_mask=None,
-        negative_prompt_attention_mask=None,
-    ):
-        if height % 8 != 0 or width % 8 != 0:
-            raise ValueError(
-                f"`height` and `width` have to be divisible by 8 but are {height} and {width}."
-            )
-
-        if (callback_steps is None) or (
-            callback_steps is not None
-            and (not isinstance(callback_steps, int) or callback_steps <= 0)
-        ):
-            raise ValueError(
-                f"`callback_steps` has to be a positive integer but is {callback_steps} of type"
-                f" {type(callback_steps)}."
-            )
-
-        if prompt is not None and prompt_embeds is not None:
-            raise ValueError(
-                f"Cannot forward both `prompt`: {prompt} and `prompt_embeds`: {prompt_embeds}. Please make sure to"
-                " only forward one of the two."
-            )
-        elif prompt is None and prompt_embeds is None:
-            raise ValueError(
-                "Provide either `prompt` or `prompt_embeds`. Cannot leave both `prompt` and `prompt_embeds` undefined."
-            )
-        elif prompt is not None and (
-            not isinstance(prompt, str) and not isinstance(prompt, list)
-        ):
-            raise ValueError(
-                f"`prompt` has to be of type `str` or `list` but is {type(prompt)}"
-            )
-
-        if prompt is not None and negative_prompt_embeds is not None:
-            raise ValueError(
-                f"Cannot forward both `prompt`: {prompt} and `negative_prompt_embeds`:"
-                f" {negative_prompt_embeds}. Please make sure to only forward one of the two."
-            )
-
-        if negative_prompt is not None and negative_prompt_embeds is not None:
-            raise ValueError(
-                f"Cannot forward both `negative_prompt`: {negative_prompt} and `negative_prompt_embeds`:"
-                f" {negative_prompt_embeds}. Please make sure to only forward one of the two."
-            )
-
-        if prompt_embeds is not None and prompt_attention_mask is None:
-            raise ValueError(
-                "Must provide `prompt_attention_mask` when specifying `prompt_embeds`."
-            )
-
-        if (
-            negative_prompt_embeds is not None
-            and negative_prompt_attention_mask is None
-        ):
-            raise ValueError(
-                "Must provide `negative_prompt_attention_mask` when specifying `negative_prompt_embeds`."
-            )
-
-        if prompt_embeds is not None and negative_prompt_embeds is not None:
-            if prompt_embeds.shape != negative_prompt_embeds.shape:
-                raise ValueError(
-                    "`prompt_embeds` and `negative_prompt_embeds` must have the same shape when passed directly, but"
-                    f" got: `prompt_embeds` {prompt_embeds.shape} != `negative_prompt_embeds`"
-                    f" {negative_prompt_embeds.shape}."
-                )
-            if prompt_attention_mask.shape != negative_prompt_attention_mask.shape:
-                raise ValueError(
-                    "`prompt_attention_mask` and `negative_prompt_attention_mask` must have the same shape when passed directly, but"
-                    f" got: `prompt_attention_mask` {prompt_attention_mask.shape} != `negative_prompt_attention_mask`"
-                    f" {negative_prompt_attention_mask.shape}."
-                )
-
-    def prepare_latents(
-        self,
-        batch_size,
-        num_channels_latents,
-        height,
-        width,
-        dtype,
-        device,
-        generator,
-        latents=None,
-    ):
-        shape = (
-            batch_size,
-            num_channels_latents,
-            int(height) // self.vae_scale_factor,
-            int(width) // self.vae_scale_factor,
-        )
-        if isinstance(generator, list) and len(generator) != batch_size:
-            raise ValueError(
-                f"You have passed a list of generators of length {len(generator)}, but requested an effective batch"
-                f" size of {batch_size}. Make sure the batch size matches the length of the generators."
-            )
-
-        if latents is None:
-            latents = randn_tensor(
-                shape, generator=generator, device=device, dtype=dtype
-            )
-        else:
-            latents = latents.to(device)
-
-        # scale the initial noise by the standard deviation required by the scheduler
-        latents = latents * self.scheduler.init_noise_sigma
-        return latents
-
-    @torch.no_grad()
-    def __call__(
-        self,
-        prompt: Union[str, List[str]] = None,
-        negative_prompt: str = "",
-        num_inference_steps: int = 20,
-        timesteps: List[int] = None,
-        sigmas: List[float] = None,
-        guidance_scale: float = 4.5,
-        num_images_per_prompt: Optional[int] = 1,
-        height: Optional[int] = None,
-        width: Optional[int] = None,
-        eta: float = 0.0,
-        generator: Optional[Union[torch.Generator, List[torch.Generator]]] = None,
-        latents: Optional[torch.Tensor] = None,
-        prompt_embeds: Optional[torch.Tensor] = None,
-        prompt_attention_mask: Optional[torch.Tensor] = None,
-        negative_prompt_embeds: Optional[torch.Tensor] = None,
-        negative_prompt_attention_mask: Optional[torch.Tensor] = None,
-        output_type: Optional[str] = "pil",
-        return_dict: bool = True,
-        guidance_rescale: float = 0.0,
-        callback: Optional[Callable[[int, int, torch.Tensor], None]] = None,
-        callback_steps: int = 1,
-        max_sequence_length: int = 300,
-    ):
-        # 1. Check inputs. Raise error if not correct
-        height = height or self.transformer.config.sample_size * self.vae_scale_factor
-        width = width or self.transformer.config.sample_size * self.vae_scale_factor
-
-        self.check_inputs(
-            prompt,
-            height,
-            width,
-            negative_prompt,
-            callback_steps,
-            prompt_embeds,
-            negative_prompt_embeds,
-            prompt_attention_mask,
-            negative_prompt_attention_mask,
-        )
-
-        # 2. Default height and width to transformer
-        if prompt is not None and isinstance(prompt, str):
-            batch_size = 1
-        elif prompt is not None and isinstance(prompt, list):
-            batch_size = len(prompt)
-        else:
-            batch_size = prompt_embeds.shape[0]
-
-        device = self._execution_device
-
-        # here `guidance_scale` is defined analog to the guidance weight `w` of equation (2)
-        # of the Imagen paper: https://arxiv.org/pdf/2205.11487.pdf . `guidance_scale = 1`
-        # corresponds to doing no classifier free guidance.
-        do_classifier_free_guidance = guidance_scale > 1.0
-
-        # 3. Encode input prompt
-        (
-            prompt_embeds,
-            prompt_attention_mask,
-            negative_prompt_embeds,
-            negative_prompt_attention_mask,
-        ) = self.encode_prompt(
-            prompt,
-            do_classifier_free_guidance,
-            negative_prompt=negative_prompt,
-            num_images_per_prompt=num_images_per_prompt,
-            device=device,
-            prompt_embeds=prompt_embeds,
-            negative_prompt_embeds=negative_prompt_embeds,
-            prompt_attention_mask=prompt_attention_mask,
-            negative_prompt_attention_mask=negative_prompt_attention_mask,
-            max_sequence_length=max_sequence_length,
-        )
-        if do_classifier_free_guidance:
-            prompt_embeds = torch.cat([negative_prompt_embeds, prompt_embeds], dim=0)
-            prompt_attention_mask = torch.cat(
-                [negative_prompt_attention_mask, prompt_attention_mask], dim=0
-            )
-
-        # 4. Prepare timesteps
-        timesteps, num_inference_steps = retrieve_timesteps(
-            self.scheduler, num_inference_steps, device, timesteps, sigmas
-        )
-
-        # 5. Prepare latents.
-        latent_channels = self.transformer.config.in_channels
-        latents = self.prepare_latents(
-            batch_size * num_images_per_prompt,
-            latent_channels,
-            height,
-            width,
-            prompt_embeds.dtype,
-            device,
-            generator,
-            latents,
-        )
-
-        # 6. Prepare rotary embeddings.
-        grid_height = height // 8 // self.transformer.config.patch_size
-        grid_width = width // 8 // self.transformer.config.patch_size
-        base_size = 512 // 8 // self.transformer.config.patch_size
-        grid_crops_coords = get_resize_crop_region_for_grid(
-            (grid_height, grid_width), (base_size, base_size)
-        )
-        image_rotary_emb = get_2d_rotary_pos_embed(
-            self.transformer.inner_dim // self.transformer.config.num_attention_heads,
-            grid_crops_coords,
-            (grid_height, grid_width),
-        )
-
-        # 7. Prepare extra step kwargs. TODO: Logic should ideally just be moved out of the pipeline
-        extra_step_kwargs = self.prepare_extra_step_kwargs(generator, eta)
-
-        # 8. Denoising loop
-        num_warmup_steps = max(
-            len(timesteps) - num_inference_steps * self.scheduler.order, 0
-        )
-
-        with self.progress_bar(total=num_inference_steps) as progress_bar:
-            for i, t in enumerate(timesteps):
-                latent_model_input = (
-                    torch.cat([latents] * 2) if do_classifier_free_guidance else latents
-                )
-                latent_model_input = self.scheduler.scale_model_input(
-                    latent_model_input, t
-                )
-
-                current_timestep = t
-                if not torch.is_tensor(current_timestep):
-                    # TODO: this requires sync between CPU and GPU. So try to pass timesteps as tensors if you can
-                    # This would be a good case for the `match` statement (Python 3.10+)
-                    is_mps = latent_model_input.device.type == "mps"
-                    if isinstance(current_timestep, float):
-                        dtype = torch.float32 if is_mps else torch.float64
-                    else:
-                        dtype = torch.int32 if is_mps else torch.int64
-                    current_timestep = torch.tensor(
-                        [current_timestep],
-                        dtype=dtype,
-                        device=latent_model_input.device,
-                    )
-                elif len(current_timestep.shape) == 0:
-                    current_timestep = current_timestep[None].to(
-                        latent_model_input.device
-                    )
-                # broadcast to batch dimension in a way that's compatible with ONNX/Core ML
-                current_timestep = current_timestep.expand(latent_model_input.shape[0])
-
-                # predict noise model_output
-                noise_pred = self.transformer(
-                    latent_model_input,
-                    encoder_hidden_states=prompt_embeds,
-                    encoder_attention_mask=prompt_attention_mask,
-                    timestep=current_timestep,
-                    image_rotary_emb=image_rotary_emb,
-                    return_dict=False,
-                )[0]
-
-                # perform guidance
-                if do_classifier_free_guidance:
-                    noise_pred_uncond, noise_pred_text = noise_pred.chunk(2)
-                    noise_pred = noise_pred_uncond + guidance_scale * (
-                        noise_pred_text - noise_pred_uncond
-                    )
-
-                if do_classifier_free_guidance and guidance_rescale > 0.0:
-                    # Based on 3.4. in https://arxiv.org/pdf/2305.08891.pdf
-                    noise_pred = rescale_noise_cfg(
-                        noise_pred, noise_pred_text, guidance_rescale=guidance_rescale
-                    )
-
-                # compute previous image: x_t -> x_t-1
-                latents = self.scheduler.step(
-                    noise_pred, t, latents, **extra_step_kwargs, return_dict=False
-                )[0]
-
-                # call the callback, if provided
-                if i == len(timesteps) - 1 or (
-                    (i + 1) > num_warmup_steps and (i + 1) % self.scheduler.order == 0
-                ):
-                    progress_bar.update()
-                    if callback is not None and i % callback_steps == 0:
-                        step_idx = i // getattr(self.scheduler, "order", 1)
-                        callback(step_idx, t, latents)
-
-        if not output_type == "latent":
-            image = self.vae.decode(
-                latents.to(device=self.vae.device, dtype=self.vae.dtype)
-                / self.vae.config.scaling_factor,
-                return_dict=False,
-            )[0]
-        else:
-            image = latents
-
-        if not output_type == "latent":
-            image = self.image_processor.postprocess(image, output_type=output_type)
-
-        # Offload all models
-        self.maybe_free_model_hooks()
-
-        if not return_dict:
-            return (image,)
-
-        return ImagePipelineOutput(images=image)
diff --git a/videotuna/third_party/flux/models/smoldit/transformer.py b/videotuna/third_party/flux/models/smoldit/transformer.py
deleted file mode 100644
index 3c4da1d4..00000000
--- a/videotuna/third_party/flux/models/smoldit/transformer.py
+++ /dev/null
@@ -1,413 +0,0 @@
-# Copyright 2024 Lumina, Hunyuan DiT, PixArt, The HuggingFace Team. All rights reserved.
-#
-# Licensed under the Apache License, Version 2.0 (the "License");
-# you may not use this file except in compliance with the License.
-# You may obtain a copy of the License at
-#
-#     http://www.apache.org/licenses/LICENSE-2.0
-#
-# Unless required by applicable law or agreed to in writing, software
-# distributed under the License is distributed on an "AS IS" BASIS,
-# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
-# See the License for the specific language governing permissions and
-# limitations under the License.
-from typing import Optional, Tuple
-
-import torch
-import torch.nn.functional as F
-from diffusers.configuration_utils import ConfigMixin, register_to_config
-from diffusers.models.attention import FeedForward
-from diffusers.models.embeddings import (
-    PatchEmbed,
-    PixArtAlphaTextProjection,
-    TimestepEmbedding,
-    Timesteps,
-    apply_rotary_emb,
-)
-from diffusers.models.modeling_outputs import Transformer2DModelOutput
-from diffusers.models.modeling_utils import ModelMixin
-from diffusers.models.normalization import AdaLayerNormContinuous, FP32LayerNorm
-from diffusers.models.transformers.hunyuan_transformer_2d import AdaLayerNormShift
-from diffusers.utils import logging
-from torch import nn
-
-logger = logging.get_logger(__name__)  # pylint: disable=invalid-name
-
-
-class SmolDiTAttention(nn.Module):
-    def __init__(
-        self,
-        query_dim,
-        cross_attention_dim,
-        dim_head,
-        num_heads,
-        kv_heads,
-        sliding_window=None,
-    ):
-        super().__init__()
-
-        self.inner_dim = dim_head * num_heads
-        self.inner_kv_dim = self.inner_dim if kv_heads is None else dim_head * kv_heads
-        self.query_dim = query_dim
-        self.num_heads = num_heads
-        self.is_cross_attention = cross_attention_dim is not None
-        self.cross_attention_dim = (
-            cross_attention_dim if cross_attention_dim is not None else query_dim
-        )
-
-        self.scale = dim_head**-0.5
-        self.sliding_window = sliding_window
-
-        self.to_q = nn.Linear(query_dim, self.inner_dim, bias=False)
-        self.to_k = nn.Linear(self.cross_attention_dim, self.inner_kv_dim, bias=False)
-        self.to_v = nn.Linear(self.cross_attention_dim, self.inner_kv_dim, bias=False)
-
-        self.to_out = nn.Linear(self.inner_dim, query_dim, bias=False)
-
-    # this mask processing utility is taken from the `prepare_attention_mask()`
-    # function from diffusers. it is here for self-containment.
-    def prepare_attention_mask(self, hidden_states, attention_mask):
-        sequence_length = hidden_states.shape[1]
-        current_length = attention_mask.shape[-1]
-        batch_size = hidden_states.shape[0]
-        if current_length != sequence_length:
-            if attention_mask.device.type == "mps":
-                padding_shape = (
-                    attention_mask.shape[0],
-                    attention_mask.shape[1],
-                    sequence_length,
-                )
-                padding = torch.zeros(
-                    padding_shape,
-                    dtype=attention_mask.dtype,
-                    device=attention_mask.device,
-                )
-                attention_mask = torch.cat([attention_mask, padding], dim=2)
-            else:
-                attention_mask = F.pad(attention_mask, (0, sequence_length), value=0.0)
-
-        if attention_mask.shape[0] < batch_size * self.num_heads:
-            attention_mask = attention_mask.repeat_interleave(self.num_heads, dim=0)
-
-        return attention_mask
-
-    def sliding_window_attention_mask(
-        self,
-        sequence_length: int,
-        window_size: int,
-        batch_size: int,
-        num_heads: int,
-        device,
-    ) -> torch.Tensor:
-        mask = torch.zeros(
-            (batch_size, num_heads, sequence_length, sequence_length), device=device
-        )
-        for i in range(sequence_length):
-            start = max(0, i - window_size)
-            end = min(sequence_length, i + window_size + 1)
-            mask[:, :, i, start:end] = 1
-        return mask
-
-    def forward(
-        self,
-        hidden_states: torch.Tensor,
-        encoder_hidden_states: torch.Tensor = None,
-        encoder_attention_mask: Optional[torch.Tensor] = None,
-        image_rotary_emb: Optional[Tuple[torch.Tensor, torch.Tensor]] = None,
-    ):
-        batch_size, _, _ = hidden_states.shape
-        encoder_hidden_states = (
-            hidden_states if encoder_hidden_states is None else encoder_hidden_states
-        )
-
-        # scaled_dot_product_attention expects attention_mask shape to be
-        # (batch, heads, source_length, target_length)
-        attention_mask = None
-        if encoder_attention_mask is not None:
-            encoder_attention_mask = self.prepare_attention_mask(
-                encoder_hidden_states, encoder_attention_mask
-            )
-            encoder_attention_mask = encoder_attention_mask.view(
-                batch_size, self.num_heads, -1, encoder_attention_mask.shape[-1]
-            )
-            attention_mask = encoder_attention_mask
-        elif self.sliding_window:
-            attention_mask = self.sliding_window_attention_mask(
-                sequence_length=hidden_states.shape[1],
-                window_size=self.sliding_window,
-                batch_size=batch_size,
-                num_heads=self.num_heads,
-                device=hidden_states.device,
-            )
-
-        # Projections.
-        query = self.to_q(hidden_states)
-        key = self.to_k(encoder_hidden_states)
-        value = self.to_v(encoder_hidden_states)
-
-        query_dim = query.shape[-1]
-        inner_dim = key.shape[-1]
-        head_dim = query_dim // self.num_heads
-        dtype = query.dtype
-
-        # Get key-value heads
-        kv_heads = inner_dim // head_dim
-        query = query.view(batch_size, -1, self.num_heads, head_dim).transpose(1, 2)
-        key = key.view(batch_size, -1, kv_heads, head_dim).transpose(1, 2)
-        value = value.view(batch_size, -1, kv_heads, head_dim).transpose(1, 2)
-
-        # GQA
-        if kv_heads != self.num_heads:
-            # if GQA or MQA, repeat the key/value heads to reach the number of query heads.
-            heads_per_kv_head = self.num_heads // kv_heads
-            key = torch.repeat_interleave(key, heads_per_kv_head, dim=1)
-            value = torch.repeat_interleave(value, heads_per_kv_head, dim=1)
-
-        # Apply RoPE if needed
-        if image_rotary_emb is not None:
-            query = apply_rotary_emb(query, image_rotary_emb)
-            query = query.to(dtype)
-            if not self.is_cross_attention:
-                key = apply_rotary_emb(key, image_rotary_emb)
-                key = query.to(dtype)
-
-        # the output of sdpa = (batch, num_heads, seq_len, head_dim)
-        hidden_states = F.scaled_dot_product_attention(
-            query, key, value, attn_mask=attention_mask, scale=self.scale
-        )
-
-        # out
-        hidden_states = hidden_states.transpose(1, 2).reshape(
-            batch_size, -1, self.num_heads * head_dim
-        )
-        hidden_states = hidden_states.to(query.dtype)
-        hidden_states = self.to_out(hidden_states)
-        return hidden_states
-
-
-class SmolDiTBlock(nn.Module):
-    def __init__(
-        self,
-        dim: int,
-        num_attention_heads: int,
-        num_kv_heads: int,
-        ff_inner_dim: int,
-        cross_attention_dim: int = 1024,
-        activation_fn: str = "gelu-approximate",
-        layer_idx: int = None,
-        sliding_window: int = None,
-    ):
-        super().__init__()
-
-        # 1. Self-Attn
-        self.norm1 = AdaLayerNormShift(dim, elementwise_affine=True, eps=1e-6)
-        if layer_idx is not None and sliding_window is not None:
-            sliding_window = sliding_window if not bool(layer_idx % 2) else None
-        else:
-            sliding_window = None
-
-        self.attn1 = SmolDiTAttention(
-            query_dim=dim,
-            cross_attention_dim=None,
-            dim_head=dim // num_attention_heads,
-            num_heads=num_attention_heads,
-            kv_heads=num_kv_heads,
-            sliding_window=sliding_window,
-        )
-
-        # 2. Cross-Attn
-        self.norm2 = FP32LayerNorm(dim, eps=1e-6, elementwise_affine=True)
-        self.attn2 = SmolDiTAttention(
-            query_dim=dim,
-            cross_attention_dim=cross_attention_dim,
-            dim_head=dim // num_attention_heads,
-            num_heads=num_attention_heads,
-            kv_heads=num_kv_heads,
-        )
-
-        # 3. Feed-forward
-        self.ff = FeedForward(
-            dim,
-            activation_fn=activation_fn,
-            inner_dim=ff_inner_dim,
-            bias=False,
-        )
-
-    def forward(
-        self,
-        hidden_states: torch.Tensor,
-        temb: torch.Tensor,
-        encoder_hidden_states: torch.Tensor,
-        encoder_attention_mask: Optional[torch.Tensor] = None,
-        image_rotary_emb: Optional[Tuple[torch.Tensor, torch.Tensor]] = None,
-    ) -> torch.Tensor:
-        # 1. Self-Attention
-        norm_hidden_states = self.norm1(hidden_states, temb)
-        attn_output = self.attn1(
-            norm_hidden_states,
-            image_rotary_emb=image_rotary_emb,
-        )
-        hidden_states = hidden_states + attn_output
-
-        # 2. Cross-Attention
-        hidden_states = hidden_states + self.attn2(
-            self.norm2(hidden_states),
-            encoder_hidden_states=encoder_hidden_states,
-            encoder_attention_mask=encoder_attention_mask,
-            image_rotary_emb=image_rotary_emb,
-        )
-
-        # FFN Layer
-        hidden_states = hidden_states + self.ff(hidden_states)
-
-        return hidden_states
-
-
-class SmolDiT2DModel(ModelMixin, ConfigMixin):
-    @register_to_config
-    def __init__(
-        self,
-        sample_size: int = 128,
-        patch_size: int = 2,
-        num_attention_heads: int = 16,
-        num_kv_heads: int = 8,
-        attention_head_dim: int = 88,
-        in_channels: int = 4,
-        out_channels: int = 4,
-        activation_fn: str = "gelu-approximate",
-        num_layers: int = 28,
-        mlp_ratio: float = 4.0,
-        cross_attention_dim: int = 1024,
-        sliding_window: int = None,
-    ):
-        super().__init__()
-        self.inner_dim = num_attention_heads * attention_head_dim
-
-        self.time_proj = Timesteps(
-            num_channels=256, flip_sin_to_cos=True, downscale_freq_shift=0
-        )
-        self.timestep_embedder = TimestepEmbedding(
-            in_channels=256, time_embed_dim=self.inner_dim
-        )
-
-        self.text_embedder = PixArtAlphaTextProjection(
-            in_features=cross_attention_dim,
-            hidden_size=cross_attention_dim * 4,
-            out_features=cross_attention_dim,
-            act_fn="silu_fp32",
-        )
-
-        self.pos_embed = PatchEmbed(
-            height=sample_size,
-            width=sample_size,
-            in_channels=in_channels,
-            embed_dim=self.inner_dim,
-            patch_size=patch_size,
-            pos_embed_type=None,
-        )
-
-        # SmolDiT Blocks
-        self.blocks = nn.ModuleList(
-            [
-                SmolDiTBlock(
-                    dim=self.inner_dim,
-                    num_attention_heads=num_attention_heads,
-                    num_kv_heads=num_kv_heads,
-                    ff_inner_dim=int(self.inner_dim * mlp_ratio),
-                    cross_attention_dim=cross_attention_dim,
-                    activation_fn=activation_fn,
-                    layer_idx=layer_idx,
-                    sliding_window=(
-                        sliding_window if sliding_window is not None else None
-                    ),
-                )
-                for layer_idx in range(num_layers)
-            ]
-        )
-
-        self.out_channels = out_channels
-        self.norm_out = AdaLayerNormContinuous(
-            self.inner_dim, self.inner_dim, elementwise_affine=False, eps=1e-6
-        )
-        self.proj_out = nn.Linear(
-            self.inner_dim, patch_size * patch_size * out_channels
-        )
-
-    def forward(
-        self,
-        hidden_states: torch.Tensor,
-        timestep: torch.Tensor,
-        encoder_hidden_states: torch.Tensor,
-        encoder_attention_mask: Optional[torch.Tensor] = None,
-        image_rotary_emb: Optional[Tuple[torch.Tensor, torch.Tensor]] = None,
-        return_dict=True,
-    ):
-        height, width = hidden_states.shape[-2:]
-        hidden_dtype = hidden_states.dtype
-
-        # convert encoder_attention_mask to a bias the same way we do for attention_mask
-        if encoder_attention_mask is not None and encoder_attention_mask.ndim == 2:
-            encoder_attention_mask = (
-                1 - encoder_attention_mask.to(hidden_states.dtype)
-            ) * -10000.0
-            encoder_attention_mask = encoder_attention_mask.unsqueeze(1)
-
-        # patch embed
-        hidden_states = self.pos_embed(hidden_states)
-
-        # timestep
-        batch_size = hidden_states.shape[0]
-        timesteps_proj = self.time_proj(timestep)
-        temb = self.timestep_embedder(timesteps_proj.to(dtype=hidden_dtype))  # (N, 256)
-
-        # text projection
-        batch_size, sequence_length, _ = encoder_hidden_states.shape
-        encoder_hidden_states = self.text_embedder(
-            encoder_hidden_states.view(-1, encoder_hidden_states.shape[-1])
-        )
-        encoder_hidden_states = encoder_hidden_states.view(
-            batch_size, sequence_length, -1
-        )
-
-        for _, block in enumerate(self.blocks):
-            hidden_states = block(
-                hidden_states=hidden_states,
-                temb=temb,
-                encoder_hidden_states=encoder_hidden_states,
-                encoder_attention_mask=encoder_attention_mask,
-                image_rotary_emb=image_rotary_emb,
-            )  # (N, L, D)
-
-        # final layer
-        hidden_states = self.norm_out(hidden_states, temb.to(torch.float32))
-        hidden_states = self.proj_out(hidden_states)
-        # (N, L, patch_size ** 2 * out_channels)
-
-        # unpatchify: (N, out_channels, H, W)
-        patch_size = self.pos_embed.patch_size
-        height = height // patch_size
-        width = width // patch_size
-
-        hidden_states = hidden_states.reshape(
-            shape=(
-                hidden_states.shape[0],
-                height,
-                width,
-                patch_size,
-                patch_size,
-                self.out_channels,
-            )
-        )
-        hidden_states = torch.einsum("nhwpqc->nchpwq", hidden_states)
-        output = hidden_states.reshape(
-            shape=(
-                hidden_states.shape[0],
-                self.out_channels,
-                height * patch_size,
-                width * patch_size,
-            )
-        )
-        if not return_dict:
-            return (output,)
-
-        return Transformer2DModelOutput(sample=output)
diff --git a/videotuna/third_party/flux/multiaspect/dataset.py b/videotuna/third_party/flux/multiaspect/dataset.py
deleted file mode 100644
index 11fd8eda..00000000
--- a/videotuna/third_party/flux/multiaspect/dataset.py
+++ /dev/null
@@ -1,84 +0,0 @@
-import logging
-import os
-
-from torch.utils.data import Dataset
-
-from videotuna.third_party.flux.image_manipulation.training_sample import TrainingSample
-from videotuna.third_party.flux.multiaspect.image import MultiaspectImage
-from videotuna.third_party.flux.training.state_tracker import StateTracker
-
-logger = logging.getLogger("MultiAspectDataset")
-logger.setLevel(os.environ.get("SIMPLETUNER_LOG_LEVEL", "INFO"))
-
-
-class MultiAspectDataset(Dataset):
-    """
-    A multi-aspect dataset requires special consideration and handling.
-    This class implements bucketed data loading for precomputed text embeddings.
-    This class does not do any image transforms, as those are handled by VAECache.
-    """
-
-    def __init__(
-        self,
-        id: str,
-        datasets: list,
-        print_names: bool = False,
-        is_regularisation_data: bool = False,
-    ):
-        self.id = id
-        self.datasets = datasets
-        self.print_names = print_names
-        self.is_regularisation_data = is_regularisation_data
-
-    def __len__(self):
-        # Sum the length of all data backends:
-        return sum([len(dataset) for dataset in self.datasets])
-
-    def __getitem__(self, image_tuple):
-        output_data = {
-            "training_samples": [],
-            "conditioning_samples": [],
-            "is_regularisation_data": self.is_regularisation_data,
-        }
-        first_aspect_ratio = None
-        for sample in image_tuple:
-            if type(sample) is TrainingSample:
-                image_metadata = sample.image_metadata
-            else:
-                image_metadata = sample
-                if "target_size" in image_metadata:
-                    calculated_aspect_ratio = (
-                        MultiaspectImage.calculate_image_aspect_ratio(
-                            image_metadata["target_size"]
-                        )
-                    )
-                    if first_aspect_ratio is None:
-                        first_aspect_ratio = calculated_aspect_ratio
-                    elif first_aspect_ratio != calculated_aspect_ratio:
-                        raise ValueError(
-                            f"Aspect ratios must be the same for all images in a batch. Expected: {first_aspect_ratio}, got: {calculated_aspect_ratio}"
-                        )
-                if "deepfloyd" not in StateTracker.get_args().model_type and (
-                    image_metadata["original_size"] is None
-                    or image_metadata["target_size"] is None
-                ):
-                    raise Exception(
-                        f"Metadata was unavailable for image: {image_metadata['image_path']}. Ensure --skip_file_discovery=metadata is not set."
-                    )
-
-                if self.print_names:
-                    logger.info(
-                        f"Dataset is now using image: {image_metadata['image_path']}"
-                    )
-
-            if type(sample) is TrainingSample:
-                output_data["conditioning_samples"].append(sample)
-                continue
-            else:
-                output_data["training_samples"].append(image_metadata)
-
-            if "instance_prompt_text" not in image_metadata:
-                raise ValueError(
-                    f"Instance prompt text must be provided in image metadata. Image metadata: {image_metadata}"
-                )
-        return output_data
diff --git a/videotuna/third_party/flux/multiaspect/image.py b/videotuna/third_party/flux/multiaspect/image.py
deleted file mode 100644
index dcef576b..00000000
--- a/videotuna/third_party/flux/multiaspect/image.py
+++ /dev/null
@@ -1,271 +0,0 @@
-import logging
-import os
-from math import sqrt
-
-import numpy as np
-from PIL import Image
-from torchvision import transforms
-
-from videotuna.third_party.flux.training.state_tracker import StateTracker
-
-logger = logging.getLogger("MultiaspectImage")
-logger.setLevel(os.environ.get("SIMPLETUNER_IMAGE_PREP_LOG_LEVEL", "INFO"))
-
-
-class MultiaspectImage:
-    @staticmethod
-    def get_image_transforms():
-        return transforms.Compose(
-            [
-                transforms.ToTensor(),
-                transforms.Normalize([0.5], [0.5]),
-            ]
-        )
-
-    @staticmethod
-    def _round_to_nearest_multiple(value):
-        """Round a value to the nearest multiple."""
-        multiple = StateTracker.get_args().aspect_bucket_alignment
-        rounded = round(value / multiple) * multiple
-        return max(rounded, multiple)  # Ensure it's at least the value of 'multiple'
-
-    @staticmethod
-    def is_image_too_large(image_size: tuple, resolution: float, resolution_type: str):
-        """
-        Determine if an image is too large to be processed.
-
-        Args:
-            image (PIL.Image): The image to check.
-            resolution (float): The maximum resolution to allow.
-            resolution_type (str): What form of resolution to check, choices: "pixel", "area".
-
-        Returns:
-            bool: True if the image is too large, False otherwise.
-        """
-        if resolution_type == "pixel":
-            return image_size[0] > resolution or image_size[1] > resolution
-        elif resolution_type == "area":
-            image_area = image_size[0] * image_size[1]
-            target_area = resolution * 1e6  # Convert megapixels to pixels
-            logger.debug(
-                f"Image is too large? {image_area > target_area} (image area: {image_area}, target area: {target_area})"
-            )
-            return image_area > target_area
-        else:
-            raise ValueError(f"Unknown resolution type: {resolution_type}")
-
-    @staticmethod
-    def calculate_new_size_by_pixel_edge(
-        aspect_ratio: float, resolution: int, original_size: tuple
-    ):
-        if type(aspect_ratio) != float:
-            raise ValueError(f"Aspect ratio must be a float, not {type(aspect_ratio)}")
-        if type(resolution) != int and (
-            type(resolution) != float or int(resolution) != resolution
-        ):
-            raise ValueError(f"Resolution must be an int, not {type(resolution)}")
-
-        W_original, H_original = original_size
-
-        # Start by determining the potential initial sizes
-        if W_original < H_original:  # Portrait or square orientation
-            W_initial = resolution
-            H_initial = int(W_initial / aspect_ratio)
-        else:  # Landscape orientation
-            H_initial = resolution
-            W_initial = int(H_initial * aspect_ratio)
-
-        # Round down to ensure we do not exceed original dimensions
-        W_adjusted = MultiaspectImage._round_to_nearest_multiple(W_initial)
-        H_adjusted = MultiaspectImage._round_to_nearest_multiple(H_initial)
-
-        # Intermediary size might be less than the reformed size.
-        # This situation is difficult.
-        # If the original image is roughly the size of the reformed image, and the intermediary is too small,
-        #  we can't really just boost the size of the reformed image willy-nilly. The intermediary size needs to be larger.
-        # We can't increase the intermediary size larger than the original size.
-        if W_initial < W_adjusted or H_initial < H_adjusted:
-            logger.debug(
-                f"Intermediary size {W_initial}x{H_initial} would be smaller than {W_adjusted}x{H_adjusted} (original size: {original_size}, aspect ratio: {aspect_ratio})."
-            )
-            # How much leeway to we have between the intermediary size and the reformed size?
-            reformed_W_diff = W_adjusted - W_initial
-            reformed_H_diff = H_adjusted - H_initial
-            bigger_difference = max(reformed_W_diff, reformed_H_diff)
-            logger.debug(
-                f"We have {reformed_W_diff}x{reformed_H_diff} leeway to the reformed image {W_adjusted}x{H_adjusted} from {W_initial}x{H_initial}, adjusting by {bigger_difference}px to both sides: {W_initial + bigger_difference}x{H_initial + bigger_difference}."
-            )
-            W_initial += bigger_difference
-            H_initial += bigger_difference
-
-        adjusted_aspect_ratio = MultiaspectImage.calculate_image_aspect_ratio(
-            (W_adjusted, H_adjusted)
-        )
-
-        return (W_adjusted, H_adjusted), (W_initial, H_initial), adjusted_aspect_ratio
-
-    @staticmethod
-    def calculate_new_size_by_pixel_area(
-        aspect_ratio: float, megapixels: float, original_size: tuple
-    ):
-        if type(aspect_ratio) not in [float, np.float64]:
-            raise ValueError(f"Aspect ratio must be a float, not {type(aspect_ratio)}")
-        target_pixel_area = (
-            megapixels * 1e6
-        )  # Convert megapixels to pixel area, eg. 1.0 mp = 1000000 pixels
-        target_pixel_edge = MultiaspectImage._round_to_nearest_multiple(
-            int(sqrt(target_pixel_area))
-        )
-        logger.debug(
-            f"Converted {megapixels} megapixels to {target_pixel_area} pixels with a square edge of {target_pixel_edge}."
-        )
-
-        W_initial, H_initial = original_size
-        if aspect_ratio == 1.0:
-            # If the aspect ratio is 1.0, we can just use the square edge as the target size.
-            logger.debug(
-                f"Returning the square edge {target_pixel_edge}x{target_pixel_edge} as the target size and original size as intermediary."
-            )
-            return (
-                (target_pixel_edge, target_pixel_edge),
-                (W_initial, H_initial),
-                aspect_ratio,
-            )
-
-        # Calculate the target size. This is what will be cropped-to.
-        W_target = MultiaspectImage._round_to_nearest_multiple(
-            target_pixel_edge * sqrt(aspect_ratio)
-        )
-        H_target = MultiaspectImage._round_to_nearest_multiple(
-            target_pixel_edge / sqrt(aspect_ratio)
-        )
-        calculated_resulting_megapixels = (W_target * H_target) / 1e6
-        target_aspect_ratio = MultiaspectImage.calculate_image_aspect_ratio(
-            (W_target, H_target)
-        )
-
-        if not np.isclose(calculated_resulting_megapixels, megapixels, rtol=1e-1):
-            logger.debug(
-                f"-!- This image will not have the correct target megapixel size: {calculated_resulting_megapixels}"
-            )
-
-        # Calculate the intermediary size. This will maintain aspect ratio and be resized-to.
-        if W_target < H_target:  # Portrait or square orientation
-            W_intermediary = W_target
-            H_intermediary = int(W_intermediary / aspect_ratio)
-        else:  # Landscape orientation
-            H_intermediary = H_target
-            W_intermediary = int(H_intermediary * aspect_ratio)
-
-        # retrieve the static mapping.
-        adjusted_aspect_ratio = MultiaspectImage.calculate_image_aspect_ratio(
-            (W_target, H_target)
-        )
-        previously_stored_resolution = StateTracker.get_resolution_by_aspect(
-            dataloader_resolution=megapixels, aspect=adjusted_aspect_ratio
-        )
-
-        if previously_stored_resolution:
-            logger.debug(
-                f"Using cached aspect-resolution map value for {adjusted_aspect_ratio}: {previously_stored_resolution}"
-            )
-            W_target, H_target = previously_stored_resolution
-        target_resolution = (W_target, H_target)
-
-        # The intermediary size might be smaller than the target. This is bad.
-        # If it happens, the cropped image will be cropped past the boundaries of the intermediary size.
-        if W_target > W_intermediary or H_target > H_intermediary:
-            _W_intermediary, _H_intermediary = W_intermediary, H_intermediary
-            if W_target > W_intermediary:
-                W_diff = W_target - W_intermediary
-                H_diff = int(W_diff / aspect_ratio)
-            else:
-                H_diff = H_target - H_intermediary
-                W_diff = int(H_diff * aspect_ratio)
-            H_intermediary += H_diff
-            W_intermediary += W_diff
-            logger.debug(
-                f"Intermediary size {_W_intermediary}x{_H_intermediary} would be smaller than {W_target}x{H_target} with a difference in size of {W_diff}x{H_diff}."
-                f" The size will be adjusted to maintain the aspect ratio: {W_intermediary}x{H_intermediary}."
-            )
-            calculated_resulting_megapixels = (W_intermediary * H_intermediary) / 1e6
-
-        intermediary_resolution = (W_intermediary, H_intermediary)
-
-        logger.debug(
-            f"Using target size of {megapixels} megapixels:"
-            f"\n-> initial size is {W_initial}x{H_initial}, original aspect ratio {aspect_ratio}."
-            f"\n-> intermediary size is {W_intermediary}x{H_intermediary}, with aspect ratio {adjusted_aspect_ratio}."
-            f"\n-> cropped size is {W_target}x{H_target}, with aspect ratio {target_aspect_ratio}."
-            f"\n-> cropped sample will be {calculated_resulting_megapixels} megapixels"
-        )
-        # Attempt to retrieve previously stored resolution by adjusted aspect ratio
-        if not previously_stored_resolution:
-            logger.debug(
-                f"No cached resolution found for aspect ratio {adjusted_aspect_ratio}. Storing {target_resolution}."
-            )
-            StateTracker.set_resolution_by_aspect(
-                dataloader_resolution=megapixels,
-                aspect=adjusted_aspect_ratio,
-                resolution=target_resolution,
-            )
-
-        return (target_resolution, intermediary_resolution, adjusted_aspect_ratio)
-
-    @staticmethod
-    def adjust_resolution_to_bucket_interval(
-        initial_resolution: tuple, target_resolution: tuple
-    ):
-        W_initial, H_initial = initial_resolution
-        W_adjusted, H_adjusted = target_resolution
-        # If W_initial or H_initial are < W_adjusted or H_adjusted, add the greater of the two differences to both values.
-        W_diff = W_adjusted - W_initial
-        H_diff = H_adjusted - H_initial
-        if W_diff > 0 and (W_diff > H_diff or W_diff == H_diff):
-            logger.debug(
-                f"Intermediary size {W_initial}x{H_initial} would be smaller than {W_adjusted}x{H_adjusted} with a difference in size of {W_diff}x{H_diff}. Adjusting both sides by {max(W_diff, H_diff)} pixels."
-            )
-            H_initial += W_diff
-            W_initial += W_diff
-        elif H_diff > 0 and H_diff > W_diff:
-            logger.debug(
-                f"Intermediary size {W_initial}x{H_initial} would be smaller than {W_adjusted}x{H_adjusted} with a difference in size of {W_diff}x{H_diff}. Adjusting both sides by {max(W_diff, H_diff)} pixels."
-            )
-            W_initial += H_diff
-            H_initial += H_diff
-
-        return W_initial, H_initial
-
-    @staticmethod
-    def calculate_image_aspect_ratio(image, rounding: int = 2):
-        """
-        Calculate the aspect ratio of an image and round it to a specified precision.
-
-        Args:
-            image (PIL.Image): The image to calculate the aspect ratio for.
-
-        Returns:
-            float: The rounded aspect ratio of the image.
-        """
-        to_round = StateTracker.get_args().aspect_bucket_rounding
-        if to_round is None:
-            to_round = rounding
-        if isinstance(image, Image.Image):
-            # An actual image was passed in.
-            width, height = image.size
-        elif isinstance(image, tuple) or isinstance(image, list):
-            # An image.size or a similar (W, H) tuple was provided.
-            width, height = image
-        elif isinstance(image, float):
-            # An externally-calculated aspect ratio was given to round.
-            return round(image, to_round)
-        else:
-            width, height = image.size
-        aspect_ratio = round(width / height, to_round)
-        return aspect_ratio
-
-
-resize_tools = {
-    "pixel": MultiaspectImage.calculate_new_size_by_pixel_edge,
-    "area": MultiaspectImage.calculate_new_size_by_pixel_area,
-}
diff --git a/videotuna/third_party/flux/multiaspect/sampler.py b/videotuna/third_party/flux/multiaspect/sampler.py
deleted file mode 100644
index dcf857c5..00000000
--- a/videotuna/third_party/flux/multiaspect/sampler.py
+++ /dev/null
@@ -1,639 +0,0 @@
-import logging
-import os
-import random
-
-import torch
-from accelerate.logging import get_logger
-
-from videotuna.third_party.flux.data_backend.base import BaseDataBackend
-from videotuna.third_party.flux.image_manipulation.training_sample import TrainingSample
-from videotuna.third_party.flux.metadata.backends.base import MetadataBackend
-from videotuna.third_party.flux.multiaspect.image import MultiaspectImage
-from videotuna.third_party.flux.multiaspect.state import BucketStateManager
-from videotuna.third_party.flux.prompts import PromptHandler
-from videotuna.third_party.flux.training.exceptions import MultiDatasetExhausted
-from videotuna.third_party.flux.training.multi_process import rank_info
-from videotuna.third_party.flux.training.state_tracker import StateTracker
-
-pil_logger = logging.getLogger("PIL.Image")
-pil_logger.setLevel(logging.WARNING)
-pil_logger = logging.getLogger("PIL.PngImagePlugin")
-pil_logger.setLevel(logging.WARNING)
-pil_logger = logging.getLogger("PIL.TiffImagePlugin")
-pil_logger.setLevel(logging.WARNING)
-
-
-class MultiAspectSampler(torch.utils.data.Sampler):
-    def __init__(
-        self,
-        id: str,
-        metadata_backend: MetadataBackend,
-        data_backend: BaseDataBackend,
-        accelerator,
-        batch_size: int,
-        debug_aspect_buckets: bool = False,
-        delete_unwanted_images: bool = False,
-        minimum_image_size: int = None,
-        resolution: int = 1024,
-        resolution_type: str = "pixel",
-        caption_strategy: str = "filename",
-        use_captions=True,
-        prepend_instance_prompt=False,
-        instance_prompt: str = None,
-        conditioning_type: str = None,
-        is_regularisation_data: bool = False,
-    ):
-        """
-        Initializes the sampler with provided settings.
-        Parameters:
-        - id: An identifier to link this with its VAECache and DataBackend objects.
-        - metadata_backend: An initialised instance of MetadataBackend.
-        - batch_size: Number of samples to draw per batch.
-        - state_path: Path to store the current state of the sampler.
-        - debug_aspect_buckets: Flag to log state for debugging purposes.
-        - delete_unwanted_images: Flag to decide whether to delete unwanted (small) images or just remove from the bucket.
-        - minimum_image_size: The minimum pixel length of the smallest side of an image.
-        """
-        self.id = id
-        if self.id != data_backend.id or self.id != metadata_backend.id:
-            raise ValueError(
-                f"Sampler ID ({self.id}) must match DataBackend ID ({data_backend.id}) and MetadataBackend ID ({metadata_backend.id})."
-            )
-        # Update the logger name with the id:
-        self.logger = get_logger(
-            f"MultiAspectSampler-{self.id}",
-            os.environ.get("SIMPLETUNER_LOG_LEVEL", "INFO"),
-        )
-        if conditioning_type is not None:
-            if conditioning_type not in ["controlnet", "mask"]:
-                raise ValueError(
-                    f"Unknown conditioning image type: {conditioning_type}"
-                )
-        self.conditioning_type = conditioning_type
-        self.is_regularisation_data = is_regularisation_data
-
-        self.rank_info = rank_info()
-        self.accelerator = accelerator
-        self.metadata_backend = metadata_backend
-        self.data_backend = data_backend
-        self.current_bucket = None
-        self.current_epoch = 1
-        self.batch_size = batch_size
-        if debug_aspect_buckets:
-            self.logger.setLevel(logging.DEBUG)
-        self.delete_unwanted_images = delete_unwanted_images
-        self.minimum_image_size = minimum_image_size
-        self.resolution = resolution
-        self.resolution_type = resolution_type
-        self.use_captions = use_captions
-        self.caption_strategy = caption_strategy
-        self.prepend_instance_prompt = prepend_instance_prompt
-        self.instance_prompt = instance_prompt
-        self.exhausted_buckets = []
-        self.buckets = self.load_buckets()
-        self.state_manager = BucketStateManager(self.id)
-
-    def save_state(self, state_path: str):
-        """
-        This method should be called when the accelerator save hook is called,
-         so that the state is correctly restored with a given checkpoint.
-        """
-        state = {
-            "aspect_ratio_bucket_indices": self.metadata_backend.aspect_ratio_bucket_indices,
-            "buckets": self.buckets,
-            "exhausted_buckets": self.exhausted_buckets,
-            "batch_size": self.batch_size,
-            "current_bucket": self.current_bucket,
-            "seen_images": self.metadata_backend.seen_images,
-            "current_epoch": self.current_epoch,
-        }
-        self.state_manager.save_state(state, state_path)
-
-    def load_states(self, state_path: str):
-        try:
-            self.buckets = self.load_buckets()
-            previous_state = self.state_manager.load_state(state_path)
-        except Exception as e:
-            raise e
-        self.exhausted_buckets = []
-        if "exhausted_buckets" in previous_state:
-            self.logger.info(
-                f"Previous checkpoint had {len(previous_state['exhausted_buckets'])} exhausted buckets."
-            )
-            self.exhausted_buckets = previous_state["exhausted_buckets"]
-        self.current_epoch = 1
-        if "current_epoch" in previous_state:
-            self.logger.info(
-                f"Previous checkpoint was on epoch {previous_state['current_epoch']}."
-            )
-            self.current_epoch = previous_state["current_epoch"]
-        # Merge seen_images into self.state_manager.seen_images Manager.dict:
-        if "seen_images" in previous_state:
-            self.logger.info(
-                f"Previous checkpoint had {len(previous_state['seen_images'])} seen images."
-            )
-            self.metadata_backend.seen_images.update(previous_state["seen_images"])
-
-    def load_buckets(self):
-        return list(
-            self.metadata_backend.aspect_ratio_bucket_indices.keys()
-        )  # These keys are a float value, eg. 1.78.
-
-    def retrieve_validation_set(self, batch_size: int):
-        """
-        Return random images from the set. They should be paired with their caption.
-
-        Args:
-            batch_size (int): Number of images to return.
-        Returns:
-            list: a list of tuples(validation_shortname, validation_prompt, validation_sample)
-        """
-        results = (
-            []
-        )  # [tuple(validation_shortname, validation_prompt, validation_sample)]
-        for img_idx in range(batch_size):
-            image_path = self._yield_random_image()
-            image_data = self.data_backend.read_image(image_path)
-            image_metadata = self.metadata_backend.get_metadata_by_filepath(image_path)
-            training_sample = TrainingSample(
-                image=image_data,
-                data_backend_id=self.id,
-                image_metadata=image_metadata,
-                image_path=image_path,
-            )
-            training_sample.prepare()
-            validation_shortname = f"{self.id}_{img_idx}"
-            validation_prompt = PromptHandler.magic_prompt(
-                sampler_backend_id=self.id,
-                data_backend=self.data_backend,
-                image_path=image_path,
-                caption_strategy=self.caption_strategy,
-                use_captions=self.use_captions,
-                prepend_instance_prompt=self.prepend_instance_prompt,
-                instance_prompt=self.instance_prompt,
-            )
-            if type(validation_prompt) == list:
-                validation_prompt = random.choice(validation_prompt)
-                self.debug_log(
-                    f"Selecting random prompt from list: {validation_prompt}"
-                )
-            results.append(
-                (validation_shortname, validation_prompt, training_sample.image)
-            )
-
-        return results
-
-    def _yield_n_from_exhausted_bucket(self, n: int, bucket: str):
-        """
-        when a bucket is exhausted, and we have to populate the remainder of the batch,
-        we shall use this quick and dirty method to retrieve n samples from the exhausted bucket.
-        the thing is we can have a batch size of 4 and 1 image. so we'll have to just return the same image 4 times.
-        """
-        available_images = self.metadata_backend.aspect_ratio_bucket_indices[bucket]
-        if len(available_images) == 0:
-            self.debug_log(f"Bucket {bucket} is empty.")
-            return []
-        samples = []
-        while len(samples) < n:
-            to_grab = min(n, len(available_images), (n - len(samples)))
-            if to_grab == 0:
-                break
-            samples.extend(random.sample(available_images, k=to_grab))
-
-        to_yield = self._validate_and_yield_images_from_samples(samples, bucket)
-        return to_yield
-
-    def _yield_random_image(self):
-        bucket = random.choice(self.buckets)
-        image_path = random.choice(
-            self.metadata_backend.aspect_ratio_bucket_indices[bucket]
-        )
-        return image_path
-
-    def yield_single_image(self, filepath: str):
-        """
-        Yield a single image from the dataset by path.
-
-        If the path prefix isn't in the path, we'll add it.
-        """
-        if (
-            self.metadata_backend.instance_data_dir is not None
-            and self.metadata_backend.instance_data_dir not in filepath
-            and not filepath.startswith("http")
-        ):
-            filepath = os.path.join(self.metadata_backend.instance_data_dir, filepath)
-        image_data = self.data_backend.read_image(filepath)
-        return image_data
-
-    def _bucket_name_to_id(self, bucket_name: str) -> int:
-        """
-        Return a bucket array index, by its name.
-
-        Args:
-            bucket_name (str): Bucket name, eg. "1.78"
-        Returns:
-            int: Bucket array index, eg. 0
-        """
-        if "." not in str(bucket_name):
-            self.debug_log(f"Assuming {bucket_name} is already an index.")
-            return int(bucket_name)
-        return self.buckets.index(str(bucket_name))
-
-    def _reset_buckets(self):
-        if (
-            len(self.metadata_backend.seen_images) == 0
-            and len(self._get_unseen_images()) == 0
-        ):
-            raise Exception(
-                f"No images found in the dataset: {self.metadata_backend.aspect_ratio_bucket_indices}"
-                f"\n-> Unseen images: {self._get_unseen_images()}"
-                f"\n-> Seen images: {self.metadata_backend.seen_images}"
-            )
-        if StateTracker.get_args().print_sampler_statistics:
-            self.logger.info(
-                "Resetting seen image list and refreshing buckets. State before reset:"
-            )
-            self.log_state()
-        # All buckets are exhausted, so we will move onto the next epoch.
-        self.current_epoch += 1
-        self.exhausted_buckets = []
-        self.buckets = self.load_buckets()
-        self.metadata_backend.reset_seen_images()
-        self.change_bucket()
-        raise MultiDatasetExhausted()
-
-    def _get_unseen_images(self, bucket=None):
-        """
-        Get unseen images from the specified bucket.
-        If bucket is None, get unseen images from all buckets.
-        """
-        if bucket and bucket in self.metadata_backend.aspect_ratio_bucket_indices:
-            return [
-                (
-                    os.path.join(self.metadata_backend.instance_data_dir, image)
-                    if not image.startswith("http")
-                    else image
-                )
-                for image in self.metadata_backend.aspect_ratio_bucket_indices[bucket]
-                if not self.metadata_backend.is_seen(image)
-            ]
-        elif bucket is None:
-            unseen_images = []
-            for b, images in self.metadata_backend.aspect_ratio_bucket_indices.items():
-                unseen_images.extend(
-                    [
-                        (
-                            os.path.join(self.metadata_backend.instance_data_dir, image)
-                            if not image.startswith("http")
-                            else image
-                        )
-                        for image in images
-                        if not self.metadata_backend.is_seen(image)
-                    ]
-                )
-            return unseen_images
-        else:
-            return []
-
-    def _handle_bucket_with_insufficient_images(self, bucket):
-        """
-        Handle buckets with insufficient images. Return True if we changed or reset the bucket.
-        """
-        if (
-            len(self.metadata_backend.aspect_ratio_bucket_indices[bucket])
-            < self.batch_size
-        ):
-            self.debug_log(
-                f"Bucket {bucket} has insufficient ({len(self.metadata_backend.aspect_ratio_bucket_indices[bucket])}) images."
-            )
-            if bucket not in self.exhausted_buckets:
-                self.debug_log(
-                    f"Bucket {bucket} is now exhausted and sleepy, and we have to move it to the sleepy list before changing buckets."
-                )
-                self.move_to_exhausted()
-            self.debug_log("Changing bucket to another random selection.")
-            self.change_bucket()
-            return True
-        self.debug_log(
-            f"Bucket {bucket} has sufficient ({len(self.metadata_backend.aspect_ratio_bucket_indices[bucket])}) images."
-        )
-        return False
-
-    def _get_next_bucket(self):
-        """
-        Get the next bucket excluding the exhausted ones.
-        If all buckets are exhausted, first reset the seen images and exhausted buckets.
-        """
-        available_buckets = [
-            bucket for bucket in self.buckets if bucket not in self.exhausted_buckets
-        ]
-        if not available_buckets:
-            # Raise MultiDatasetExhausted
-            self._reset_buckets()
-
-        if len(self.exhausted_buckets) > 0:
-            self.debug_log(f"exhausted buckets: {self.exhausted_buckets}")
-
-        # Sequentially get the next bucket
-        if hasattr(self, "current_bucket") and self.current_bucket is not None:
-            self.current_bucket = (self.current_bucket + 1) % len(available_buckets)
-        else:
-            self.current_bucket = 0
-        if self.buckets[self.current_bucket] not in available_buckets:
-            random_bucket = random.choice(available_buckets)
-            self.current_bucket = available_buckets.index(random_bucket)
-
-        next_bucket = available_buckets[self.current_bucket]
-        return next_bucket
-
-    def change_bucket(self):
-        """
-        Change the current bucket to a new one and exclude exhausted buckets from consideration.
-        During _get_next_bucket(), if all buckets are exhausted, reset the exhausted list and seen images.
-        """
-        next_bucket = self._get_next_bucket()
-        self.current_bucket = self._bucket_name_to_id(next_bucket)
-        self._clear_batch_accumulator()
-
-    def move_to_exhausted(self):
-        bucket = self.buckets[self.current_bucket]
-        self.exhausted_buckets.append(bucket)
-        self.buckets.remove(bucket)
-        self.debug_log(
-            f"Bucket {bucket} is empty or doesn't have enough samples for a full batch. Removing from bucket list. {len(self.buckets)} remain."
-        )
-
-    def log_state(self, show_rank: bool = True, alt_stats: bool = False):
-        self.debug_log(
-            f'Active Buckets: {", ".join(self.convert_to_human_readable(float(b), self.metadata_backend.aspect_ratio_bucket_indices[b], self.resolution) for b in self.buckets)}'
-        )
-        self.debug_log(
-            f'Exhausted Buckets: {", ".join(self.convert_to_human_readable(float(b), self.metadata_backend.aspect_ratio_bucket_indices.get(b, "N/A"), self.resolution) for b in self.exhausted_buckets)}'
-        )
-        if alt_stats:
-            # Return an overview instead of a snapshot.
-            # Eg. return totals, and not "as it is now"
-            total_image_count = len(self.metadata_backend.seen_images) + len(
-                self._get_unseen_images()
-            )
-            if self.accelerator.num_processes > 1:
-                # We don't know the direct count without more work, so we'll estimate it here for multi-GPU training.
-                total_image_count *= self.accelerator.num_processes
-                total_image_count = f"~{total_image_count}"
-            data_backend_config = StateTracker.get_data_backend_config(self.id)
-            printed_state = (
-                f"- Repeats: {data_backend_config.get('repeats', 0)}\n"
-                f"- Total number of images: {total_image_count}\n"
-                f"- Total number of aspect buckets: {len(self.buckets)}\n"
-                f"- Resolution: {self.resolution} {'megapixels' if self.resolution_type == 'area' else 'px'}\n"
-                f"- Cropped: {data_backend_config.get('crop')}\n"
-                f"- Crop style: {'None' if not data_backend_config.get('crop') else data_backend_config.get('crop_style')}\n"
-                f"- Crop aspect: {'None' if not data_backend_config.get('crop') else data_backend_config.get('crop_aspect')}\n"
-                f"- Used for regularisation data: {'Yes' if self.is_regularisation_data else 'No'}\n"
-            )
-            if self.conditioning_type:
-                printed_state += f"- Conditioning type: {self.conditioning_type}\n"
-        else:
-            # Return a snapshot of the current state during training.
-            printed_state = (
-                f"\n{self.rank_info if show_rank else ''}    -> Number of seen images: {len(self.metadata_backend.seen_images)}"
-                f"\n{self.rank_info if show_rank else ''}    -> Number of unseen images: {len(self._get_unseen_images())}"
-                f"\n{self.rank_info if show_rank else ''}    -> Current Bucket: {self.current_bucket}"
-                f"\n{self.rank_info if show_rank else ''}    -> {len(self.buckets)} Buckets: {self.buckets}"
-                f"\n{self.rank_info if show_rank else ''}    -> {len(self.exhausted_buckets)} Exhausted Buckets: {self.exhausted_buckets}"
-            )
-        self.logger.info(printed_state)
-
-        return printed_state
-
-    def _validate_and_yield_images_from_samples(self, samples, bucket):
-        """
-        Validate and yield images from given samples. Return a list of valid image paths.
-        """
-        to_yield = []
-        for image_path in samples:
-            image_metadata = self.metadata_backend.get_metadata_by_filepath(image_path)
-            if image_metadata is None:
-                image_metadata = {}
-            if (
-                StateTracker.get_args().model_type
-                not in [
-                    "legacy",
-                    "deepfloyd-full",
-                    "deepfloyd-lora",
-                    "deepfloyd-stage2",
-                    "deepfloyd-stage2-lora",
-                ]
-                and "crop_coordinates" not in image_metadata
-            ):
-                raise Exception(
-                    f"An image was discovered ({image_path}) that did not have its metadata: {self.metadata_backend.get_metadata_by_filepath(image_path)}"
-                )
-            image_metadata["data_backend_id"] = self.id
-            image_metadata["image_path"] = image_path
-
-            # Use the magic prompt handler to retrieve the captions.
-            instance_prompt = PromptHandler.magic_prompt(
-                sampler_backend_id=self.id,
-                data_backend=self.data_backend,
-                image_path=image_metadata["image_path"],
-                caption_strategy=self.caption_strategy,
-                use_captions=self.use_captions,
-                prepend_instance_prompt=self.prepend_instance_prompt,
-                instance_prompt=self.instance_prompt,
-            )
-            if type(instance_prompt) == list:
-                instance_prompt = random.choice(instance_prompt)
-                self.debug_log(f"Selecting random prompt from list: {instance_prompt}")
-            image_metadata["instance_prompt_text"] = instance_prompt
-
-            to_yield.append(image_metadata)
-        return to_yield
-
-    def _clear_batch_accumulator(self):
-        self.batch_accumulator = []
-
-    def get_conditioning_sample(self, original_sample_path: str) -> str:
-        """
-        Given an original dataset sample path, return a TrainingSample
-        """
-        # strip leading /
-        original_sample_path = original_sample_path.lstrip("/")
-        full_path = os.path.join(
-            self.metadata_backend.instance_data_dir, original_sample_path
-        )
-        try:
-            conditioning_sample_data = self.data_backend.read_image(full_path)
-        except Exception as e:
-            self.logger.error(f"Could not fetch conditioning sample: {e}")
-
-            return None
-        if not conditioning_sample_data:
-            self.debug_log(f"Could not fetch conditioning sample from {full_path}.")
-            return None
-
-        conditioning_sample = TrainingSample(
-            image=conditioning_sample_data,
-            data_backend_id=self.id,
-            image_metadata=self.metadata_backend.get_metadata_by_filepath(full_path),
-            image_path=full_path,
-            conditioning_type=self.conditioning_type,
-        )
-        return conditioning_sample
-
-    def connect_conditioning_samples(self, samples: tuple):
-        # Locate the conditioning data
-        conditioning_dataset = StateTracker.get_conditioning_dataset(self.id)
-        if conditioning_dataset is None:
-            return samples
-        sampler = conditioning_dataset["sampler"]
-        outputs = list(samples)
-        for sample in samples:
-            sample_path = sample["image_path"].split(
-                self.metadata_backend.instance_data_dir
-            )[-1]
-            conditioning_sample = sampler.get_conditioning_sample(sample_path)
-            outputs.append(conditioning_sample)
-        return tuple(outputs)
-
-    def __iter__(self):
-        """
-        Iterate over the sampler to yield image paths in batches.
-        """
-        self._clear_batch_accumulator()  # Initialize an empty list to accumulate images for a batch
-        self.change_bucket()
-        while True:
-            all_buckets_exhausted = True  # Initial assumption
-
-            # Loop through all buckets to find one with sufficient images
-            for _ in range(len(self.buckets)):
-                self._clear_batch_accumulator()
-                available_images = self._get_unseen_images(
-                    self.buckets[self.current_bucket]
-                )
-                self.debug_log(
-                    f"From {len(self.buckets)} buckets, selected {self.buckets[self.current_bucket]} ({self.buckets[self.current_bucket]}) -> {len(available_images)} available images, and our accumulator has {len(self.batch_accumulator)} images ready for yielding."
-                )
-                if len(available_images) > 0:
-                    all_buckets_exhausted = False  # Found a non-exhausted bucket
-                    break
-                else:
-                    # Current bucket doesn't have enough images, try the next bucket
-                    self.move_to_exhausted()
-                    self.change_bucket()
-            while len(available_images) > 0:
-                if len(available_images) < self.batch_size:
-                    need_image_count = self.batch_size - len(available_images)
-                    self.debug_log(
-                        f"Bucket {self.buckets[self.current_bucket]} has {len(available_images)} available images, but we need {need_image_count} more."
-                    )
-                    to_yield = self._yield_n_from_exhausted_bucket(
-                        need_image_count, self.buckets[self.current_bucket]
-                    )
-                    # add the available images
-                    to_yield.extend(
-                        self._validate_and_yield_images_from_samples(
-                            available_images, self.buckets[self.current_bucket]
-                        )
-                    )
-                else:
-                    all_buckets_exhausted = False  # Found a non-exhausted bucket
-                    samples = random.sample(
-                        available_images, k=min(len(available_images), self.batch_size)
-                    )
-                    to_yield = self._validate_and_yield_images_from_samples(
-                        samples, self.buckets[self.current_bucket]
-                    )
-                self.debug_log(
-                    f"Building batch with {len(self.batch_accumulator)} samples."
-                )
-                if len(self.batch_accumulator) < self.batch_size:
-                    remaining_entries_needed = self.batch_size - len(
-                        self.batch_accumulator
-                    )
-                    # Now we'll add only remaining_entries_needed amount to the accumulator:
-                    if "target_size" in to_yield[0]:
-                        self.debug_log(
-                            f"Current bucket: {self.current_bucket}. Adding samples with aspect ratios: {[MultiaspectImage.calculate_image_aspect_ratio(i['target_size']) for i in to_yield[:remaining_entries_needed]]}"
-                        )
-                    self.batch_accumulator.extend(to_yield[:remaining_entries_needed])
-                # If the batch is full, yield it
-                if len(self.batch_accumulator) >= self.batch_size:
-                    final_yield = self.batch_accumulator[: self.batch_size]
-                    self.debug_log(
-                        f"Yielding samples and marking {len(final_yield)} images as seen, we have {len(self.metadata_backend.seen_images.values())} seen images before adding."
-                    )
-                    self.metadata_backend.mark_batch_as_seen(
-                        [instance["image_path"] for instance in final_yield]
-                    )
-                    self.accelerator.wait_for_everyone()
-                    # if applicable, we'll append TrainingSample(s) to the end for conditioning inputs.
-                    final_yield = self.connect_conditioning_samples(final_yield)
-                    yield tuple(final_yield)
-                    # Change bucket after a full batch is yielded
-                    self.change_bucket()
-                    # Break out of the while loop:
-                    break
-
-                # Update available images after yielding
-                available_images = self._get_unseen_images(
-                    self.buckets[self.current_bucket]
-                )
-                self.debug_log(
-                    f"Bucket {self.buckets[self.current_bucket]} now has {len(available_images)} available images after yielding."
-                )
-
-            # Handle exhausted bucket
-            if len(available_images) < self.batch_size:
-                self.debug_log(
-                    f"Bucket {self.buckets[self.current_bucket]} is now exhausted and sleepy, and we have to move it to the sleepy list before changing buckets."
-                )
-                self.move_to_exhausted()
-                self.change_bucket()
-
-            # Check if all buckets are exhausted
-            if all_buckets_exhausted:
-                # If all buckets are exhausted, reset the seen images and refresh buckets
-                self.logger.warning(
-                    "All buckets exhausted - since this is happening now, most likely you have chronically-underfilled buckets."
-                )
-                # Resetting buckets raises MultiDatasetExhausted
-                self._reset_buckets()
-
-    def __len__(self):
-        backend_config = StateTracker.get_data_backend_config(self.id)
-        repeats = backend_config.get("repeats", 0)
-        # We need at least a multiplier of 1. Repeats is the number of extra sample steps.
-        multiplier = repeats + 1 if repeats > 0 else 1
-
-        total_samples = (
-            sum(
-                len(indices)
-                for indices in self.metadata_backend.aspect_ratio_bucket_indices.values()
-            )
-            * multiplier
-        )
-
-        # Calculate the total number of full batches
-        total_batches = (total_samples + (self.batch_size - 1)) // self.batch_size
-
-        return total_batches
-
-    @staticmethod
-    def convert_to_human_readable(
-        aspect_ratio_float: float, bucket: iter, resolution: int = 1024
-    ):
-
-        if aspect_ratio_float < 1:
-            ratio_width = resolution
-            ratio_height = int(resolution / aspect_ratio_float)
-        else:
-            ratio_width = int(resolution * aspect_ratio_float)
-            ratio_height = resolution
-
-        # Return the aspect ratio as a string in the format "width:height"
-        return f"{aspect_ratio_float} ({len(bucket)} samples)"
-        return f"{ratio_width}:{ratio_height}"
-
-    def debug_log(self, msg: str):
-        self.logger.debug(f"{self.rank_info} {msg}", main_process_only=False)
diff --git a/videotuna/third_party/flux/multiaspect/state.py b/videotuna/third_party/flux/multiaspect/state.py
deleted file mode 100644
index e6b28d33..00000000
--- a/videotuna/third_party/flux/multiaspect/state.py
+++ /dev/null
@@ -1,62 +0,0 @@
-import json
-import logging
-import os
-from multiprocessing.managers import DictProxy
-
-logger = logging.getLogger("BucketStateManager")
-logger.setLevel(os.environ.get("SIMPLETUNER_LOG_LEVEL", "INFO"))
-
-
-class BucketStateManager:
-    def __init__(self, id: str):
-        self.id = id
-
-    def mangle_state_path(self, state_path):
-        # When saving the state, it goes into the checkpoint dir.
-        # However, we need to save a single state for each data backend.
-        # Thus, we split the state_path from its extension, add self.id to the end of the name, and rejoin:
-        if self.id in os.path.basename(state_path):
-            return state_path
-        filename, ext = os.path.splitext(state_path)
-        return f"{filename}-{self.id}{ext}"
-
-    def load_seen_images(self, state_path: str):
-        if os.path.exists(state_path):
-            with open(state_path, "r") as f:
-                return json.load(f)
-        else:
-            return {}
-
-    def save_seen_images(self, seen_images, state_path: str):
-        with open(state_path, "w") as f:
-            json.dump(seen_images, f)
-
-    def deep_convert_dict(self, d):
-        if isinstance(d, dict):
-            return {key: self.deep_convert_dict(value) for key, value in d.items()}
-        elif isinstance(d, list):
-            return [self.deep_convert_dict(value) for value in d]
-        elif isinstance(d, DictProxy):
-            return self.deep_convert_dict(dict(d))
-        else:
-            return d
-
-    def save_state(self, state: dict, state_path: str):
-        if state_path is None:
-            raise ValueError("state_path must be specified")
-        state_path = self.mangle_state_path(state_path)
-        logger.debug(f"Saving trainer state to {state_path}")
-        final_state = self.deep_convert_dict(state)
-        with open(state_path, "w") as f:
-            json.dump(final_state, f)
-
-    def load_state(self, state_path: str):
-        if state_path is None:
-            raise ValueError("state_path must be specified")
-        state_path = self.mangle_state_path(state_path)
-        if os.path.exists(state_path):
-            with open(state_path, "r") as f:
-                return json.load(f)
-        else:
-            logger.debug(f"load_state found no file: {state_path}")
-            return {}
diff --git a/videotuna/third_party/flux/prompts.py b/videotuna/third_party/flux/prompts.py
deleted file mode 100644
index 054a69c4..00000000
--- a/videotuna/third_party/flux/prompts.py
+++ /dev/null
@@ -1,624 +0,0 @@
-import json
-from pathlib import Path
-
-import regex as re
-
-from videotuna.third_party.flux.training import image_file_extensions
-from videotuna.third_party.flux.training.multi_process import _get_rank as get_rank
-from videotuna.third_party.flux.training.state_tracker import StateTracker
-
-prompts = {
-    "alien_landscape": "Alien planet, strange rock formations, glowing plants, bizarre creatures, surreal atmosphere",
-    "alien_market": "Alien marketplace, bizarre creatures, exotic goods, vibrant colors, otherworldly atmosphere",
-    "child_balloon": "Child holding a balloon, happy expression, colorful balloons, sunny day, high detail",
-    "comic_strip": "a 4-panel comic strip showing an orange cat saying the words 'HELP' and 'LASAGNA'",
-    "comic_book": "a hand is holding a comic book with a cover that reads 'The Adventures of Superhero'",
-    "crystal_cave": "Underground cave filled with crystals, glowing lights, reflective surfaces, fantasy environment, high detail",
-    "cyberpunk_bazaar": "Bustling cyberpunk bazaar, vendors, neon signs, advanced tech, crowded, high detail",
-    "cyberpunk_hacker": "Cyberpunk hacker in a dark room, neon glow, multiple screens, intense focus, high detail",
-    "cybernetic_anne": "a cybernetic anne of green gables with neural implant and bio mech augmentations",
-    "dystopian_city": "Post-apocalyptic cityscape, ruined buildings, overgrown vegetation, dark and gritty, high detail",
-    "enchanted_castle": "Magical castle in a lush forest, glowing windows, fantasy architecture, high resolution, detailed textures",
-    "enchanted_forest_ruins": "Ruins of an ancient temple in an enchanted forest, glowing runes, mystical creatures, high detail",
-    "enchanted_forest": "Mystical forest, glowing plants, fairies, magical creatures, fantasy art, high detail",
-    "enchanted_garden": "Magical garden with glowing flowers, fairies, serene atmosphere, detailed plants, high resolution",
-    "fairy_garden": "Whimsical garden filled with fairies, magical plants, sparkling lights, serene atmosphere, high detail",
-    "fantasy_dragon": "Majestic dragon soaring through the sky, detailed scales, dynamic pose, fantasy art, high resolution",
-    "floating_islands": "Fantasy world, floating islands in the sky, waterfalls, lush vegetation, detailed landscape, high resolution",
-    "futuristic_cityscape": "Futuristic city skyline at night, neon lights, cyberpunk style, high contrast, sharp focus",
-    "galactic_battle": "Space battle scene, starships fighting, laser beams, explosions, cosmic background",
-    "haunted_fairground": "Abandoned fairground at night, eerie rides, ghostly figures, fog, dark atmosphere, high detail",
-    "haunted_mansion": "Spooky haunted mansion on a hill, dark and eerie, glowing windows, ghostly atmosphere, high detail",
-    "hardcover_textbook": "a hardcover physics textbook that is called PHYSICS FOR DUMMIES",
-    "medieval_battle": "Epic medieval battle, knights in armor, dynamic action, detailed landscape, high resolution",
-    "medieval_market": "Bustling medieval market with merchants, knights, and jesters, vibrant colors, detailed",
-    "medieval_tavern": "Cozy medieval tavern, warm firelight, adventurers drinking, detailed interior, rustic atmosphere",
-    "neon_cityscape": "Futuristic city skyline at night, neon lights, cyberpunk style, high contrast, sharp focus",
-    "neon_forest": "Forest with neon-lit trees, glowing plants, bioluminescence, surreal atmosphere, high detail",
-    "neon_sign": "Bright neon sign in a busy city street, 'Open 24 Hours', bold typography, glowing lights",
-    "neon_typography": "Vibrant neon sign, 'Bar', bold typography, dark background, glowing lights, detailed design",
-    "pirate_ship": "Pirate ship on the high seas, stormy weather, detailed sails, dramatic waves, photorealistic",
-    "pirate_treasure": "Pirate discovering a treasure chest, detailed gold coins, tropical island, dramatic lighting",
-    "psychedelic": "a photograph of a woman experiencing a psychedelic trip. trippy, 8k, uhd, fractal",
-    "rainy_cafe": "Cozy cafe on a rainy day, people sipping coffee, warm lights, reflections on wet pavement, photorealistic",
-    "retro_arcade": "1980s arcade, neon lights, vintage game machines, kids playing, vibrant colors, nostalgic atmosphere",
-    "retro_game_room": "1980s game room with vintage arcade machines, neon lights, vibrant colors, nostalgic feel",
-    "robot_blacksmith": "Robot blacksmith forging metal, sparks flying, detailed workshop, futuristic and medieval blend",
-    "robot_dancer": "Sleek robot performing a dance, futuristic theater, holographic effects, detailed, high resolution",
-    "robot_factory": "High-tech factory where robots are assembled, detailed machinery, futuristic setting, high detail",
-    "robotic_garden": "Garden tended by robots, mechanical plants, colorful flowers, futuristic setting, high detail",
-    "robotic_pet": "Cute robotic pet, futuristic home, sleek design, detailed features, friendly and animated",
-    "security_footage": "cctv trail camera night time security picture of a wendigo in the woods",
-    "space_explorer": "Astronaut exploring an alien planet, detailed landscape, futuristic suit, cosmic background",
-    "space_station": "Futuristic space station orbiting a distant exoplanet, sleek design, detailed structures, cosmic backdrop",
-    "soon": "a person holding a sign that reads 'SOON'",
-    "steampunk_airship": "Steampunk airship in the sky, intricate design, Victorian aesthetics, dynamic scene, high detail",
-    "steampunk_inventor": "Steampunk inventor in a workshop, intricate gadgets, Victorian attire, mechanical arm, goggles",
-    "stormy_ocean": "Stormy ocean with towering waves, dramatic skies, detailed water, intense atmosphere, high resolution",
-    "stormy_sea": "Dramatic stormy sea, lighthouse in the distance, lightning striking, dark clouds, high detail",
-    "urban_art": "Graffiti artist creating a mural, vibrant colors, urban setting, dynamic action, high resolution",
-    "urban_graffiti": "Urban alleyway filled with vibrant graffiti art, tags and murals, realistic textures",
-    "urban_street_sign": "Urban street sign, 'Main Street', bold typography, realistic textures, weathered look",
-    "vintage_car_show": "Classic car show with vintage vehicles, vibrant colors, nostalgic atmosphere, high detail",
-    "vintage_diner_sign": "Retro diner sign, 'Joe's Diner', classic 1950s design, neon lights, weathered look",
-    "vintage_store_sign": "Vintage store sign with elaborate typography, 'Antique Shop', hand-painted, weathered look",
-}
-
-
-def prompt_library_injection(new_prompts: dict) -> dict:
-    """
-    Add more prompts to the built-in SimpleTuner Prompt library.
-
-    Args:
-        new_prompts (dict): A dict of shortnames matching the existing prompt library format:
-        {
-            "nickname_here": "prompt goes here",
-            ...
-        }
-
-    Returns:
-        dict: Completed prompt library.
-    """
-
-    # Unpack the new prompts into the library.
-    global prompts
-    return {**prompts, **new_prompts}
-
-
-import logging
-import os
-
-from tqdm import tqdm
-
-from videotuna.third_party.flux.data_backend.base import BaseDataBackend
-
-logger = logging.getLogger("PromptHandler")
-logger.setLevel(os.environ.get("SIMPLETUNER_LOG_LEVEL", "INFO"))
-
-
-class PromptHandler:
-    def __init__(
-        self,
-        args: dict,
-        text_encoders: list,
-        tokenizers: list,
-        accelerator,
-        model_type: str = "sdxl",
-    ):
-        if args.disable_compel:
-            raise Exception(
-                "--disable_compel was provided, but the Compel engine was still attempted to be initialised."
-            )
-
-        from compel import Compel, ReturnedEmbeddingsType
-
-        self.accelerator = accelerator
-        self.encoder_style = model_type
-        self.compel = None
-        if model_type in ["sdxl", "legacy"]:
-            if (
-                len(text_encoders) == 2
-                and text_encoders[1] is not None
-                and text_encoders[0] is not None
-            ):
-                # SDXL Refiner and Base can both use the 2nd tokenizer/encoder.
-                logger.debug(
-                    "Initialising Compel prompt manager with dual text encoders."
-                )
-                self.compel = Compel(
-                    tokenizer=tokenizers,
-                    text_encoder=text_encoders,
-                    truncate_long_prompts=False,
-                    returned_embeddings_type=ReturnedEmbeddingsType.PENULTIMATE_HIDDEN_STATES_NON_NORMALIZED,
-                    requires_pooled=[
-                        False,  # CLIP-L does not produce pooled embeds.
-                        True,  # CLIP-G produces pooled embeds.
-                    ],
-                    device=accelerator.device,
-                )
-            elif len(text_encoders) == 2 and text_encoders[0] is None:
-                # SDXL Refiner has ONLY the 2nd tokenizer/encoder, which needs to be the only one in Compel.
-                logger.debug(
-                    "Initialising Compel prompt manager with just the 2nd text encoder."
-                )
-                self.compel = Compel(
-                    tokenizer=tokenizers[1],
-                    text_encoder=text_encoders[1],
-                    truncate_long_prompts=False,
-                    returned_embeddings_type=ReturnedEmbeddingsType.PENULTIMATE_HIDDEN_STATES_NON_NORMALIZED,
-                    requires_pooled=True,
-                    device=accelerator.device,
-                )
-                self.encoder_style = "sdxl-refiner"
-            elif model_type == "legacy":
-                # Any other pipeline uses the first tokenizer/encoder.
-                logger.debug(
-                    "Initialising the Compel prompt manager with a single text encoder."
-                )
-                pipe_tokenizer = tokenizers[0]
-                pipe_text_encoder = text_encoders[0]
-                self.compel = Compel(
-                    tokenizer=pipe_tokenizer,
-                    text_encoder=pipe_text_encoder,
-                    truncate_long_prompts=False,
-                    returned_embeddings_type=ReturnedEmbeddingsType.LAST_HIDDEN_STATES_NORMALIZED,
-                    device=accelerator.device,
-                )
-                self.encoder_style = "legacy"
-        self.text_encoders = text_encoders
-        self.tokenizers = tokenizers
-
-    @staticmethod
-    def retrieve_prompt_column_from_parquet(
-        sampler_backend_id: str,
-    ) -> str:
-        parquetdb = StateTracker.get_parquet_database(sampler_backend_id)
-        dataframe = parquetdb[0]
-        if dataframe is None:
-            raise ValueError(
-                f"Parquet database not found for sampler {sampler_backend_id}."
-            )
-        caption_column = (
-            StateTracker.get_data_backend_config(sampler_backend_id)
-            .get("parquet", {})
-            .get("caption_column", None)
-        )
-        if not caption_column:
-            raise ValueError(
-                f"Caption column not found for sampler {sampler_backend_id}. Config: {StateTracker.get_data_backend_config(sampler_backend_id)}"
-            )
-        # Return just that column
-        all_captions = dataframe[caption_column].values
-        fallback_caption_column = (
-            StateTracker.get_data_backend_config(sampler_backend_id)
-            .get("parquet", {})
-            .get("fallback_caption_column")
-        )
-        if fallback_caption_column is not None and all_captions is not None:
-            # Combine the lists
-            fallback_captions = dataframe[fallback_caption_column].values
-            all_captions = [
-                x if x else y for x, y in zip(all_captions, fallback_captions)
-            ]
-        return all_captions
-
-    @staticmethod
-    def prepare_instance_prompt_from_parquet(
-        image_path: str,
-        use_captions: bool,
-        prepend_instance_prompt: bool,
-        data_backend: BaseDataBackend,
-        instance_prompt: str = None,
-        sampler_backend_id: str = None,
-    ) -> str:
-        if sampler_backend_id is None:
-            raise ValueError("Sampler backend ID is required.")
-        if not use_captions:
-            if not instance_prompt:
-                raise ValueError(
-                    "Instance prompt is required when instance_prompt_only is enabled."
-                )
-            return instance_prompt
-        metadata_backend = StateTracker.get_data_backend(sampler_backend_id)[
-            "metadata_backend"
-        ]
-        if metadata_backend is None:
-            raise ValueError(
-                f"Could not find metadata backend for sampler {sampler_backend_id}: {StateTracker.get_data_backend(sampler_backend_id)}"
-            )
-        (
-            parquet_db,
-            filename_column,
-            caption_column,
-            fallback_caption_column,
-            identifier_includes_extension,
-        ) = StateTracker.get_parquet_database(sampler_backend_id)
-        backend_config = StateTracker.get_data_backend_config(
-            data_backend_id=data_backend.id
-        )
-        instance_data_dir = backend_config.get("instance_data_dir")
-        image_filename_stem = image_path
-        if instance_data_dir is not None and instance_data_dir in image_filename_stem:
-            image_filename_stem = image_filename_stem.replace(instance_data_dir, "")
-            if image_filename_stem.startswith("/"):
-                image_filename_stem = image_filename_stem[1:]
-
-        if not identifier_includes_extension:
-            image_filename_stem = os.path.splitext(image_filename_stem)[0]
-        image_caption = metadata_backend.caption_cache_entry(image_filename_stem)
-        if instance_prompt is None and fallback_caption_column and not image_caption:
-            raise ValueError(
-                f"Could not locate caption for image {image_path} in sampler_backend {sampler_backend_id} with filename column {filename_column}, caption column {caption_column}, and a parquet database with {len(parquet_db)} entries."
-            )
-        elif (
-            instance_prompt is None
-            and not fallback_caption_column
-            and not image_caption
-        ):
-            raise ValueError(
-                f"Could not locate caption for image {image_path} in sampler_backend {sampler_backend_id} with filename column {filename_column}, caption column {caption_column}, and a parquet database with {len(parquet_db)} entries."
-            )
-        if type(image_caption) == bytes:
-            image_caption = image_caption.decode("utf-8")
-        if image_caption:
-            image_caption = image_caption.strip()
-        if prepend_instance_prompt:
-            if type(image_caption) == list:
-                image_caption = [instance_prompt + " " + x for x in image_caption]
-            else:
-                image_caption = instance_prompt + " " + image_caption
-        return image_caption
-
-    @staticmethod
-    def prepare_instance_prompt_from_filename(
-        image_path: str,
-        use_captions: bool,
-        prepend_instance_prompt: bool,
-        instance_prompt: str = None,
-    ) -> str:
-        if not use_captions:
-            if not instance_prompt:
-                raise ValueError(
-                    "Instance prompt is required when instance_prompt_only is enabled."
-                )
-            return instance_prompt
-        image_caption = Path(image_path).stem
-        # Underscores to spaces.
-        image_caption = image_caption.replace("_", " ")
-        if prepend_instance_prompt:
-            image_caption = instance_prompt + " " + image_caption
-        return image_caption
-
-    @staticmethod
-    def prepare_instance_prompt_from_textfile(
-        image_path: str,
-        use_captions: bool,
-        prepend_instance_prompt: bool,
-        data_backend: BaseDataBackend,
-        instance_prompt: str = None,
-    ) -> str:
-        if not use_captions:
-            if not instance_prompt:
-                raise ValueError(
-                    "Instance prompt is required when instance_prompt_only is enabled."
-                )
-            return instance_prompt
-        caption_file = os.path.splitext(image_path)[0] + ".txt"
-        if not data_backend.exists(caption_file):
-            raise FileNotFoundError(f"Caption file {caption_file} not found.")
-        try:
-            image_caption = data_backend.read(caption_file)
-            # Convert from bytes to str:
-            if type(image_caption) == bytes:
-                image_caption = image_caption.decode("utf-8")
-
-            # any newlines? split into array
-            if "\n" in image_caption:
-                image_caption = image_caption.split("\n")
-                # Remove any empty strings
-                image_caption = [x for x in image_caption if x]
-
-            if prepend_instance_prompt:
-                if type(image_caption) is list:
-                    image_caption = [instance_prompt + " " + x for x in image_caption]
-                else:
-                    image_caption = instance_prompt + " " + image_caption
-
-            return image_caption
-        except Exception as e:
-            logger.error(f"Could not read caption file {caption_file}: {e}")
-
-    @staticmethod
-    def magic_prompt(
-        image_path: str,
-        use_captions: bool,
-        caption_strategy: str,
-        prepend_instance_prompt: bool,
-        data_backend: BaseDataBackend,
-        instance_prompt: str = None,
-        sampler_backend_id: str = None,
-    ) -> str:
-        """Pull a prompt for an image file like magic, using one of the available caption strategies.
-
-        Args:
-            image_path (str): The image path.
-            caption_strategy (str): Currently, 'filename' or 'textfile'.
-            use_captions (bool): If false, the folder containing the image is used as an instance prompt.
-            prepend_instance_prompt (bool): If true, the folder name of the image is prepended to the caption.
-
-        Raises:
-            ValueError: _description_
-
-        Returns:
-            _type_: _description_
-        """
-        if caption_strategy == "filename":
-            instance_prompt = PromptHandler.prepare_instance_prompt_from_filename(
-                image_path=image_path,
-                use_captions=use_captions,
-                prepend_instance_prompt=prepend_instance_prompt,
-                instance_prompt=instance_prompt,
-            )
-        elif caption_strategy == "textfile":
-            # Can return multiple captions, if the file has newlines.
-            instance_prompt = PromptHandler.prepare_instance_prompt_from_textfile(
-                image_path,
-                use_captions=use_captions,
-                prepend_instance_prompt=prepend_instance_prompt,
-                instance_prompt=instance_prompt,
-                data_backend=data_backend,
-            )
-        elif caption_strategy == "parquet":
-            # Can return multiple captions, if the field is a list.
-            instance_prompt = PromptHandler.prepare_instance_prompt_from_parquet(
-                image_path,
-                use_captions=use_captions,
-                prepend_instance_prompt=prepend_instance_prompt,
-                instance_prompt=instance_prompt,
-                data_backend=data_backend,
-                sampler_backend_id=sampler_backend_id,
-            )
-        elif caption_strategy == "instanceprompt":
-            return instance_prompt
-        elif caption_strategy == "csv":
-            return data_backend.get_caption(image_path)
-        else:
-            raise ValueError(
-                f"Unsupported caption strategy: {caption_strategy}. Supported: 'filename', 'textfile', 'parquet', 'instanceprompt'"
-            )
-
-        return instance_prompt
-
-    @staticmethod
-    def get_all_captions(
-        instance_data_dir: str,
-        use_captions: bool,
-        prepend_instance_prompt: bool,
-        data_backend: BaseDataBackend,
-        caption_strategy: str,
-        instance_prompt: str = None,
-    ) -> list:
-        captions = []
-        all_image_files = StateTracker.get_image_files(
-            data_backend_id=data_backend.id
-        ) or data_backend.list_files(
-            instance_data_dir=instance_data_dir, file_extensions=image_file_extensions
-        )
-        backend_config = StateTracker.get_data_backend_config(
-            data_backend_id=data_backend.id
-        )
-        if type(all_image_files) == list and type(all_image_files[0]) == tuple:
-            all_image_files = all_image_files[0][2]
-        from tqdm import tqdm
-
-        # if caption_strategy == "parquet":
-        #     return PromptHandler.retrieve_prompt_column_from_parquet(
-        #         sampler_backend_id=data_backend.id
-        #     )
-
-        for image_path in tqdm(
-            all_image_files,
-            desc="Loading captions",
-            total=len(all_image_files),
-            disable=True if get_rank() > 0 else False,
-            leave=False,
-            ncols=125,
-        ):
-            if caption_strategy == "filename":
-                caption = PromptHandler.prepare_instance_prompt_from_filename(
-                    image_path=str(image_path),
-                    use_captions=use_captions,
-                    prepend_instance_prompt=prepend_instance_prompt,
-                    instance_prompt=instance_prompt,
-                )
-            elif caption_strategy == "textfile":
-                caption = PromptHandler.prepare_instance_prompt_from_textfile(
-                    image_path,
-                    use_captions=use_captions,
-                    prepend_instance_prompt=prepend_instance_prompt,
-                    instance_prompt=instance_prompt,
-                    data_backend=data_backend,
-                )
-            elif caption_strategy == "parquet":
-                try:
-                    caption = PromptHandler.prepare_instance_prompt_from_parquet(
-                        image_path,
-                        use_captions=use_captions,
-                        prepend_instance_prompt=prepend_instance_prompt,
-                        instance_prompt=instance_prompt,
-                        data_backend=data_backend,
-                        sampler_backend_id=data_backend.id,
-                    )
-                except:
-                    continue
-            elif caption_strategy == "instanceprompt":
-                return [instance_prompt]
-            elif caption_strategy == "csv":
-                caption = data_backend.get_caption(image_path)
-            else:
-                raise ValueError(
-                    f"Unsupported caption strategy: {caption_strategy}. Supported: 'filename', 'textfile', 'parquet', 'instanceprompt'"
-                )
-
-            if type(caption) not in [tuple, list, dict]:
-                captions.append(caption)
-            else:
-                # allow caching of multiple captions, if returned by the backend.
-                captions.extend(caption)
-
-        # Deduplicate captions
-        # TODO: Investigate why this prevents captions from processing on multigpu systems.
-        # captions = list(set(captions))
-
-        return captions
-
-    @staticmethod
-    def filter_caption(data_backend: BaseDataBackend, caption: str) -> str:
-        """Just filter a single caption.
-
-        Args:
-            data_backend (BaseDataBackend): The data backend for the instance.
-            caption (str): The caption to filter.
-
-        Raises:
-            e: If caption filter list can not be loaded.
-            ValueError: If we have an invalid filter list.
-            FileNotFoundError: If the filter list can not be found.
-
-        Returns:
-            str: The filtered caption.
-        """
-        return PromptHandler.filter_captions(data_backend, [caption])[0]
-
-    @staticmethod
-    def filter_captions(data_backend: BaseDataBackend, captions: list) -> list:
-        """
-        If the data backend config contains the entry "caption_filter_list", this function will filter the captions.
-
-        The caption_filter file contains strings or regular expressions, one per line.
-
-        If a line doesn't have any regex control characters in it, we'll treat it as a string.
-        """
-        data_backend_config = StateTracker.get_data_backend_config(
-            data_backend_id=data_backend.id
-        )
-        caption_filter_list = data_backend_config.get("caption_filter_list", None)
-        if not caption_filter_list or caption_filter_list == "":
-            return captions
-        if (
-            type(caption_filter_list) == str
-            and os.path.splitext(caption_filter_list)[1] == ".json"
-        ):
-            # It's a path to a filter list. Load it in JSON format.
-            caption_filter_list_path = Path(caption_filter_list)
-            try:
-                with open(caption_filter_list_path, "r") as caption_filter_list:
-                    caption_filter_list = json.load(caption_filter_list)
-            except Exception as e:
-                logger.error(
-                    f"Caption filter list for data backend '{data_backend.id}' could not be loaded: {e}"
-                )
-                raise e
-        elif (
-            type(caption_filter_list) == str
-            and os.path.splitext(caption_filter_list)[1] == ".txt"
-        ):
-            # We have a plain text list of filter strings/regex. Load them into an array:
-            caption_filter_list_path = Path(caption_filter_list)
-            try:
-                with open(caption_filter_list_path, "r") as caption_filter_list:
-                    caption_filter_list = caption_filter_list.readlines()
-                    # Strip newlines from the ends:
-                    caption_filter_list = [x.strip("\n") for x in caption_filter_list]
-            except Exception as e:
-                logger.error(
-                    f"Caption filter list for data backend '{data_backend.id}' could not be loaded: {e}"
-                )
-                raise e
-        # We have the filter list. Is it valid and non-empty?
-        if type(caption_filter_list) != list or len(caption_filter_list) == 0:
-            logger.debug(
-                f"Data backend '{data_backend.id}' has an invalid or empty caption filter list."
-            )
-            return captions
-        elif type(caption_filter_list) is not list:
-            raise ValueError(
-                f"Data backend '{data_backend.id}' has an invalid caption filter list: {caption_filter_list}"
-            )
-        # Iterate through each caption
-        filtered_captions = []
-        for caption in tqdm(
-            captions,
-            desc="Filtering captions",
-            total=len(captions),
-            ncols=125,
-            disable=True if len(captions) < 10 else False,
-        ):
-            if type(caption) is list:
-                caption = caption[0]
-            modified_caption = caption
-            # Apply each filter to the caption
-            logger.debug(f"Filtering caption: {modified_caption}")
-            if modified_caption is None:
-                logger.error(
-                    f"Encountered a None caption in the list, data backend: {data_backend.id}"
-                )
-                continue
-            for filter_item in caption_filter_list:
-                # Check for special replace pattern 's/replace/entry/'
-                if filter_item.startswith("s/") and filter_item.count("/") == 2:
-                    _, search, replace = filter_item.split("/")
-                    regex_modified_caption = re.sub(search, replace, modified_caption)
-                    if regex_modified_caption != modified_caption:
-                        # logger.debug(
-                        #     f"Applying regex SEARCH {filter_item} to caption: {modified_caption}"
-                        # )
-                        modified_caption = regex_modified_caption
-                else:
-                    # Treat as plain string and remove occurrences
-                    if modified_caption is not None:
-                        modified_caption = str(modified_caption).replace(
-                            filter_item, ""
-                        )
-                try:
-                    # Assume all filters as regex patterns for flexibility
-                    pattern = re.compile(filter_item)
-                    try:
-                        regex_modified_caption = pattern.sub("", modified_caption)
-                    except:
-                        regex_modified_caption = modified_caption
-                    if regex_modified_caption != modified_caption:
-                        # logger.debug(
-                        #     f"Applying regex FILTER {filter_item} to caption: {modified_caption}"
-                        # )
-                        modified_caption = regex_modified_caption
-                except re.error as e:
-                    logger.error(f"Regex error with pattern {filter_item}: {e}")
-
-            # Add the modified caption to the filtered list
-            # if caption != modified_caption:
-            #     logger.debug(
-            #         f"After all filters have finished, here is the modified caption: {modified_caption}"
-            #     )
-            filtered_captions.append(modified_caption)
-
-        # Return the list of modified captions
-        return filtered_captions
-
-    @staticmethod
-    def load_user_prompts(user_prompt_path: str = None):
-        if not user_prompt_path:
-            return {}
-        # Does the file exist?
-        user_prompt_path = Path(user_prompt_path)
-        if not user_prompt_path.exists():
-            raise FileNotFoundError(f"User prompt file {user_prompt_path} not found.")
-        # Load the file.
-        try:
-            with user_prompt_path.open("r", encoding="utf-8") as f:
-                user_prompts = json.load(f)
-                # logger.debug(f"Loaded user prompts: {user_prompts}")
-            return user_prompts
-        except Exception as e:
-            logger.error(f"Could not read user prompt file {user_prompt_path}: {e}")
-            return {}
diff --git a/videotuna/third_party/flux/publishing/huggingface.py b/videotuna/third_party/flux/publishing/huggingface.py
deleted file mode 100644
index 8c49b145..00000000
--- a/videotuna/third_party/flux/publishing/huggingface.py
+++ /dev/null
@@ -1,226 +0,0 @@
-import logging
-import os
-from pathlib import Path
-
-from huggingface_hub import create_repo, upload_file, upload_folder
-
-from videotuna.third_party.flux.publishing.metadata import save_model_card
-from videotuna.third_party.flux.training.state_tracker import StateTracker
-
-logger = logging.getLogger(__name__)
-logger.setLevel(os.environ.get("SIMPLETUNER_LOG_LEVEL", logging.INFO))
-
-
-LORA_SAFETENSORS_FILENAME = "pytorch_lora_weights.safetensors"
-
-
-class HubManager:
-    def __init__(self, config, repo_id: str = None):
-        self.config = config
-        self.repo_id = (
-            repo_id or self.config.hub_model_id or self.config.tracker_project_name
-        )
-        self.hub_token = self._load_hub_token()
-        self.data_backends = StateTracker.get_data_backends()
-        self._create_repo()
-        self.validation_prompts = None
-        self.validation_shortnames = None
-        self.collected_data_backend_str = None
-
-    def _create_repo(self):
-        self._repo_id = create_repo(
-            repo_id=self.config.hub_model_id or self.config.tracker_project_name,
-            exist_ok=True,
-        ).repo_id
-
-    def _vae_string(self):
-        if "deepfloyd" in self.config.model_type:
-            return "\nDeepFloyd Pixel diffusion (no VAE)."
-        else:
-            return f"\nVAE: {self.config.pretrained_vae_model_name_or_path}"
-
-    def _commit_message(self):
-        return (
-            f"Trained for {StateTracker.get_epoch() - 1} epochs and {StateTracker.get_global_step()} steps."
-            f"\nTrained with datasets {self.collected_data_backend_str}"
-            f"\nLearning rate {self.config.learning_rate}, batch size {self.config.train_batch_size}, and {self.config.gradient_accumulation_steps} gradient accumulation steps."
-            f"\nUsed DDPM noise scheduler for training with {self.config.prediction_type} prediction type and rescaled_betas_zero_snr={self.config.rescale_betas_zero_snr}"
-            f"\nUsing '{self.config.training_scheduler_timestep_spacing}' timestep spacing."
-            f"\nBase model: {self.config.pretrained_model_name_or_path}"
-            f"{self._vae_string()}"
-        )
-
-    def _load_hub_token(self):
-        token_path = os.path.join(os.path.expanduser("~"), ".cache/huggingface/token")
-        if os.path.exists(token_path):
-            with open(token_path, "r") as f:
-                return f.read().strip()
-        raise ValueError(
-            f"No Hugging Face Hub token found ({token_path}). Please ensure you have logged in with 'huggingface-cli login'."
-        )
-
-    def set_validation_prompts(self, validation_prompts, validation_shortnames):
-        self.validation_prompts = validation_prompts
-        self.validation_shortnames = validation_shortnames
-
-    def upload_validation_folder(self, webhook_handler=None, override_path=None):
-        try:
-            upload_folder(
-                repo_id=self._repo_id,
-                folder_path=os.path.join(
-                    override_path or self.config.output_dir, "assets"
-                ),
-                path_in_repo="assets/",
-                commit_message="Validation images auto-generated by SimpleTuner",
-            )
-        except Exception as e:
-            logger.error(f"Error uploading validation images to Hugging Face Hub: {e}")
-
-    def upload_model(self, validation_images, webhook_handler=None, override_path=None):
-        if webhook_handler:
-            webhook_handler.send(
-                message=f"Uploading {'model' if override_path is None else 'intermediary checkpoint'} to Hugging Face Hub as `{self.repo_id}`."
-            )
-        save_model_card(
-            repo_id=self.repo_id,
-            images=validation_images,
-            base_model=self.config.pretrained_model_name_or_path,
-            train_text_encoder=self.config.train_text_encoder,
-            prompt=self.config.validation_prompt,
-            validation_prompts=self.validation_prompts,
-            validation_shortnames=self.validation_shortnames,
-            repo_folder=override_path
-            or os.path.join(
-                self.config.output_dir,
-                "pipeline" if "lora" not in self.config.model_type else "",
-            ),
-        )
-
-        try:
-            self.upload_validation_folder(
-                webhook_handler=webhook_handler, override_path=override_path
-            )
-        except:
-            logger.error("Error uploading validation images to Hugging Face Hub.")
-
-        attempt = 0
-        while attempt < 3:
-            attempt += 1
-            try:
-                if "lora" not in self.config.model_type:
-                    self.upload_full_model(override_path=override_path)
-                else:
-                    self.upload_lora_model(override_path=override_path)
-                break
-            except Exception as e:
-                if webhook_handler:
-                    webhook_handler.send(
-                        message=f"(attempt {attempt}/3) Error uploading model to Hugging Face Hub: {e}. Retrying..."
-                    )
-        if webhook_handler:
-            webhook_handler.send(
-                message=f"Model is now available [on Hugging Face Hub](https://huggingface.co/{self._repo_id})."
-            )
-
-    def upload_full_model(self, override_path=None):
-        folder_path = os.path.join(self.config.output_dir, "pipeline")
-        try:
-            upload_folder(
-                repo_id=self._repo_id,
-                folder_path=override_path or folder_path,
-                commit_message=self._commit_message(),
-            )
-        except Exception as e:
-            logger.error(f"Failed to upload pipeline to hub: {e}")
-
-    def upload_lora_model(self, override_path=None):
-        lora_weights_path = os.path.join(
-            override_path or self.config.output_dir, LORA_SAFETENSORS_FILENAME
-        )
-        try:
-            upload_file(
-                repo_id=self._repo_id,
-                path_in_repo=f"/{LORA_SAFETENSORS_FILENAME}",
-                path_or_fileobj=lora_weights_path,
-                commit_message=self._commit_message(),
-            )
-            readme_path = os.path.join(
-                override_path or self.config.output_dir, "README.md"
-            )
-            upload_file(
-                repo_id=self._repo_id,
-                path_in_repo="/README.md",
-                path_or_fileobj=readme_path,
-                commit_message="Model card auto-generated by SimpleTuner",
-            )
-        except Exception as e:
-            logger.error(f"Failed to upload LoRA weights to hub: {e}")
-
-    def find_latest_checkpoint(self):
-        checkpoints = list(Path(self.config.output_dir).rglob("checkpoint-*"))
-        highest_checkpoint_value = None
-        highest_checkpoint = None
-        if len(checkpoints) > 0:
-            highest_checkpoint_value = 0
-            for checkpoint in checkpoints:
-                # split by -
-                parts = checkpoint.stem.split("-")
-                checkpoint_value = int(parts[-1])
-                if checkpoint_value > highest_checkpoint_value:
-                    highest_checkpoint_value = checkpoint_value
-                    highest_checkpoint = checkpoint
-
-        return highest_checkpoint
-
-    def upload_latest_checkpoint(self, validation_images: dict, webhook_handler=None):
-        checkpoint_path = self.find_latest_checkpoint()
-        if checkpoint_path:
-            logging.info(f"Checkpoint path: {checkpoint_path}")
-            try:
-                self.upload_model(
-                    validation_images=validation_images,
-                    override_path=checkpoint_path,
-                    webhook_handler=webhook_handler,
-                )
-            except Exception as e:
-                logger.error(f"Failed to upload latest checkpoint: {e}")
-
-    def upload_validation_images(
-        self, validation_images, webhook_handler=None, override_path=None
-    ):
-        logging.info(f"Validation images for upload: {validation_images}")
-        if validation_images and len(validation_images) > 0:
-            idx = 0
-            for shortname, images in (
-                validation_images.items()
-                if type(validation_images) is dict
-                else validation_images
-            ):
-                # print(f"Shortname {shortname} images: {images}")
-                if type(images) is not list:
-                    images = [images]
-                sub_idx = 0
-                for image in images:
-                    image_path = os.path.join(
-                        override_path or self.config.output_dir,
-                        "assets",
-                        f"image_{idx}_{sub_idx}.png",
-                    )
-                    image.save(image_path, format="PNG")
-                    attempt = 0
-                    while attempt < 3:
-                        attempt += 1
-                        try:
-                            upload_file(
-                                repo_id=self._repo_id,
-                                path_in_repo=f"/assets/image_{idx}_{sub_idx}.png",
-                                path_or_fileobj=image_path,
-                                commit_message="Validation image auto-generated by SimpleTuner",
-                            )
-                        except Exception as e:
-                            if webhook_handler:
-                                webhook_handler.send(
-                                    message=f"(attempt {attempt}/3) Error uploading validation image to Hugging Face Hub: {e}. Retrying..."
-                                )
-                    sub_idx += 1
-                    idx += 1
diff --git a/videotuna/third_party/flux/publishing/metadata.py b/videotuna/third_party/flux/publishing/metadata.py
deleted file mode 100644
index b966eda6..00000000
--- a/videotuna/third_party/flux/publishing/metadata.py
+++ /dev/null
@@ -1,409 +0,0 @@
-import json
-import logging
-import os
-
-import torch
-
-from videotuna.third_party.flux.training.state_tracker import StateTracker
-
-logger = logging.getLogger(__name__)
-logger.setLevel(os.environ.get("SIMPLETUNER_LOG_LEVEL", "INFO"))
-
-licenses = {
-    "flux": "flux-1-dev-non-commercial-license",
-    "sdxl": "creativeml-openrail-m",
-    "legacy": "openrail++",
-    "pixart_sigma": "openrail++",
-    "kolors": "apache-2.0",
-    "smoldit": "apache-2.0",
-    "sd3": "stabilityai-ai-community",
-}
-allowed_licenses = [
-    "apache-2.0",
-    "mit",
-    "openrail",
-    "bigscience-openrail-m",
-    "creativeml-openrail-m",
-    "bigscience-bloom-rail-1.0",
-    "bigcode-openrail-m",
-    "afl-3.0",
-    "artistic-2.0",
-    "bsl-1.0",
-    "bsd",
-    "bsd-2-clause",
-    "bsd-3-clause",
-    "bsd-3-clause-clear",
-    "c-uda",
-    "cc",
-    "cc0-1.0",
-    "cc-by-2.0",
-    "cc-by-2.5",
-    "cc-by-3.0",
-    "cc-by-4.0",
-    "cc-by-sa-3.0",
-    "cc-by-sa-4.0",
-    "cc-by-nc-2.0",
-    "cc-by-nc-3.0",
-    "cc-by-nc-4.0",
-    "cc-by-nd-4.0",
-    "cc-by-nc-nd-3.0",
-    "cc-by-nc-nd-4.0",
-    "cc-by-nc-sa-2.0",
-    "cc-by-nc-sa-3.0",
-    "cc-by-nc-sa-4.0",
-    "cdla-sharing-1.0",
-    "cdla-permissive-1.0",
-    "cdla-permissive-2.0",
-    "wtfpl",
-    "ecl-2.0",
-    "epl-1.0",
-    "epl-2.0",
-    "etalab-2.0",
-    "eupl-1.1",
-    "agpl-3.0",
-    "gfdl",
-    "gpl",
-    "gpl-2.0",
-    "gpl-3.0",
-    "lgpl",
-    "lgpl-2.1",
-    "lgpl-3.0",
-    "isc",
-    "lppl-1.3c",
-    "ms-pl",
-    "apple-ascl",
-    "mpl-2.0",
-    "odc-by",
-    "odbl",
-    "openrail++",
-    "osl-3.0",
-    "postgresql",
-    "ofl-1.1",
-    "ncsa",
-    "unlicense",
-    "zlib",
-    "pddl",
-    "lgpl-lr",
-    "deepfloyd-if-license",
-    "llama2",
-    "llama3",
-    "llama3.1",
-    "gemma",
-    "unknown",
-    "other",
-    "array",
-]
-for _model, _license in licenses.items():
-    if _license not in allowed_licenses:
-        licenses[_model] = "other"
-
-
-def _model_imports(args):
-    output = "import torch\n"
-    output += "from diffusers import DiffusionPipeline"
-    if "lycoris" == args.lora_type.lower() and "lora" in args.model_type:
-        output += "\nfrom lycoris import create_lycoris_from_weights"
-
-    return f"{output}"
-
-
-def _model_load(args, repo_id: str = None):
-    hf_user_name = StateTracker.get_hf_username()
-    if hf_user_name is not None:
-        repo_id = f"{hf_user_name}/{repo_id}" if hf_user_name else repo_id
-    if "lora" in args.model_type:
-        if args.lora_type.lower() == "standard":
-            output = (
-                f"model_id = '{args.pretrained_model_name_or_path}'"
-                f"\nadapter_id = '{repo_id if repo_id is not None else args.output_dir}'"
-                f"\npipeline = DiffusionPipeline.from_pretrained(model_id)"
-                f"\npipeline.load_lora_weights(adapter_id)"
-            )
-        elif args.lora_type.lower() == "lycoris":
-            output = (
-                f"model_id = '{args.pretrained_model_name_or_path}'"
-                f"\nadapter_id = 'pytorch_lora_weights.safetensors' # you will have to download this manually"
-                "\nlora_scale = 1.0"
-            )
-    else:
-        output = (
-            f"model_id = '{repo_id if repo_id else os.path.join(args.output_dir, 'pipeline')}'"
-            f"\npipeline = DiffusionPipeline.from_pretrained(model_id)"
-        )
-    if args.model_type == "lora" and args.lora_type.lower() == "lycoris":
-        output += f"\nwrapper, _ = create_lycoris_from_weights(lora_scale, adapter_id, pipeline.transformer)"
-        output += "\nwrapper.merge_to()"
-
-    return output
-
-
-def _torch_device():
-    return """'cuda' if torch.cuda.is_available() else 'mps' if torch.backends.mps.is_available() else 'cpu'"""
-
-
-def _negative_prompt(args, in_call: bool = False):
-    if args.model_family.lower() == "flux":
-        return ""
-    if not in_call:
-        return f"negative_prompt = '{args.validation_negative_prompt}'"
-    return "\n    negative_prompt=negative_prompt,"
-
-
-def _guidance_rescale(args):
-    if args.model_family.lower() in ["sd3", "flux", "pixart_sigma"]:
-        return ""
-    return f"\n    guidance_rescale={args.validation_guidance_rescale},"
-
-
-def _validation_resolution(args):
-    if args.validation_resolution == "" or args.validation_resolution is None:
-        return f"width=1024,\n" f"    height=1024,"
-    resolutions = [args.validation_resolution]
-    if "," in args.validation_resolution:
-        # split the resolution into a list of resolutions
-        resolutions = args.validation_resolution.split(",")
-    for resolution in resolutions:
-        if "x" in resolution:
-            return (
-                f"width={resolution.split('x')[0]},\n"
-                f"    height={resolution.split('x')[1]},"
-            )
-        return f"width={resolution},\n" f"    height={resolution},"
-
-
-def code_example(args, repo_id: str = None):
-    """Return a string with the code example."""
-    code_example = f"""
-```python
-{_model_imports(args)}
-
-{_model_load(args, repo_id)}
-
-prompt = "{args.validation_prompt if args.validation_prompt else 'An astronaut is riding a horse through the jungles of Thailand.'}"
-{_negative_prompt(args)}
-pipeline.to({_torch_device()})
-image = pipeline(
-    prompt=prompt,{_negative_prompt(args, in_call=True) if args.model_family.lower() != 'flux' else ''}
-    num_inference_steps={args.validation_num_inference_steps},
-    generator=torch.Generator(device={_torch_device()}).manual_seed(1641421826),
-    {_validation_resolution(args)}
-    guidance_scale={args.validation_guidance},{_guidance_rescale(args)}
-).images[0]
-image.save("output.png", format="PNG")
-```
-"""
-    return code_example
-
-
-def model_type(args):
-    if "lora" in args.model_type:
-        if "standard" == args.lora_type.lower():
-            return "standard PEFT LoRA"
-        if "lycoris" == args.lora_type.lower():
-            return "LyCORIS adapter"
-    else:
-        return "full rank finetune"
-
-
-def lora_info(args):
-    """Return a string with the LORA information."""
-    if "lora" not in args.model_type:
-        return ""
-    if args.lora_type.lower() == "standard":
-        return f"""- LoRA Rank: {args.lora_rank}
-- LoRA Alpha: {args.lora_alpha}
-- LoRA Dropout: {args.lora_dropout}
-- LoRA initialisation style: {args.lora_init_type}
-    """
-    if args.lora_type.lower() == "lycoris":
-        lycoris_config_file = args.lycoris_config
-        # read the json file
-        with open(lycoris_config_file, "r") as file:
-            lycoris_config = json.load(file)
-        return f"""- LyCORIS Config:\n```json\n{json.dumps(lycoris_config, indent=4)}\n```"""
-
-
-def model_card_note(args):
-    """Return a string with the model card note."""
-    note_contents = args.model_card_note if args.model_card_note else ""
-    return f"\n{note_contents}\n"
-
-
-def flux_schedule_info(args):
-    if args.model_family.lower() != "flux":
-        return ""
-    output_args = []
-    if args.flux_fast_schedule:
-        output_args.append("flux_fast_schedule")
-    if args.flux_schedule_auto_shift:
-        output_args.append("flux_schedule_auto_shift")
-    if args.flux_schedule_shift is not None:
-        output_args.append(f"shift={args.flux_schedule_shift}")
-    if args.flux_guidance_value:
-        output_args.append(f"flux_guidance_value={args.flux_guidance_value}")
-    if args.flux_guidance_min:
-        output_args.append(f"flux_guidance_min={args.flux_guidance_min}")
-    if args.flux_guidance_mode == "random-range":
-        output_args.append(f"flux_guidance_max={args.flux_guidance_max}")
-        output_args.append(f"flux_guidance_min={args.flux_guidance_min}")
-    if args.flux_use_beta_schedule:
-        output_args.append(f"flux_beta_schedule_alpha={args.flux_beta_schedule_alpha}")
-        output_args.append(f"flux_beta_schedule_beta={args.flux_beta_schedule_beta}")
-    if args.flux_attention_masked_training:
-        output_args.append("flux_attention_masked_training")
-    if args.model_type == "lora" and args.lora_type == "standard":
-        output_args.append(f"flux_lora_target={args.flux_lora_target}")
-    output_str = (
-        f" (flux parameters={output_args})"
-        if output_args
-        else " (no special parameters set)"
-    )
-
-    return output_str
-
-
-def save_model_card(
-    repo_id: str,
-    images=None,
-    base_model: str = "",
-    train_text_encoder: bool = False,
-    prompt: str = "",
-    validation_prompts: list = None,
-    validation_shortnames: list = None,
-    repo_folder: str = None,
-):
-    if repo_folder is None:
-        raise ValueError("The repo_folder must be specified and not be None.")
-    if type(validation_prompts) is not list:
-        raise ValueError(
-            f"The validation_prompts must be a list. Received {validation_prompts}"
-        )
-    # if we have more than one '/' in the base_model, we will turn it into unknown/model
-    model_family = StateTracker.get_model_family()
-    if base_model.count("/") > 1:
-        base_model = f"{model_family}/unknown-model"
-    logger.debug(f"Validating from prompts: {validation_prompts}")
-    assets_folder = os.path.join(repo_folder, "assets")
-    optimizer_config = StateTracker.get_args().optimizer_config
-    if optimizer_config is None:
-        optimizer_config = ""
-    os.makedirs(assets_folder, exist_ok=True)
-    datasets_str = ""
-    for dataset in StateTracker.get_data_backends().keys():
-        if "sampler" in StateTracker.get_data_backends()[dataset]:
-            datasets_str += f"### {dataset}\n"
-            datasets_str += f"{StateTracker.get_data_backends()[dataset]['sampler'].log_state(show_rank=False, alt_stats=True)}"
-    widget_str = ""
-    idx = 0
-    shortname_idx = 0
-    negative_prompt_text = str(StateTracker.get_args().validation_negative_prompt)
-    if negative_prompt_text == "":
-        negative_prompt_text = "''"
-    if images is not None and len(images) > 0:
-        widget_str = "widget:"
-        for image_list in images.values() if isinstance(images, dict) else images:
-            if not isinstance(image_list, list):
-                image_list = [image_list]
-            sub_idx = 0
-            for image in image_list:
-                image_path = os.path.join(assets_folder, f"image_{idx}_{sub_idx}.png")
-                image.save(image_path, format="PNG")
-                validation_prompt = "no prompt available"
-                if validation_prompts is not None:
-                    try:
-                        validation_prompt = validation_prompts[shortname_idx]
-                    except IndexError:
-                        validation_prompt = f"prompt not found ({validation_shortnames[shortname_idx] if validation_shortnames is not None and shortname_idx in validation_shortnames else shortname_idx})"
-                if validation_prompt == "":
-                    validation_prompt = "unconditional (blank prompt)"
-                else:
-                    # Escape anything that YAML won't like
-                    validation_prompt = validation_prompt.replace("'", "''")
-                widget_str += f"\n- text: '{validation_prompt}'"
-                widget_str += "\n  parameters:"
-                widget_str += f"\n    negative_prompt: '{negative_prompt_text}'"
-                widget_str += "\n  output:"
-                widget_str += f"\n    url: ./assets/image_{idx}_{sub_idx}.png"
-                idx += 1
-                sub_idx += 1
-
-            shortname_idx += 1
-    args = StateTracker.get_args()
-    yaml_content = f"""---
-license: {licenses[model_family]}
-base_model: "{base_model}"
-tags:
-  - {model_family}
-  - {f'{model_family}-diffusers' if 'deepfloyd' not in args.model_type else 'deepfloyd-if-diffusers'}
-  - text-to-image
-  - diffusers
-  - simpletuner
-  - {'not-for-all-audiences' if not args.model_card_safe_for_work else 'safe-for-work'}
-  - {args.model_type}
-{'  - template:sd-lora' if 'lora' in args.model_type else ''}
-{f'  - {args.lora_type}' if 'lora' in args.model_type else ''}
-inference: true
-{widget_str}
----
-
-"""
-    model_card_content = f"""# {repo_id}
-
-This is a {model_type(args)} derived from [{base_model}](https://huggingface.co/{base_model}).
-
-{'This is a **diffusion** model trained using DDPM objective instead of Flow matching. **Be sure to set the appropriate scheduler configuration.**' if args.model_family == "sd3" and args.flow_matching_loss == "diffusion" else ''}
-{'The main validation prompt used during training was:' if prompt else 'Validation used ground-truth images as an input for partial denoising (img2img).' if args.validation_using_datasets else 'No validation prompt was used during training.'}
-{model_card_note(args)}
-{'```' if prompt else ''}
-{prompt}
-{'```' if prompt else ''}
-
-## Validation settings
-- CFG: `{StateTracker.get_args().validation_guidance}`
-- CFG Rescale: `{StateTracker.get_args().validation_guidance_rescale}`
-- Steps: `{StateTracker.get_args().validation_num_inference_steps}`
-- Sampler: `{StateTracker.get_args().validation_noise_scheduler}`
-- Seed: `{StateTracker.get_args().validation_seed}`
-- Resolution{'s' if ',' in StateTracker.get_args().validation_resolution else ''}: `{StateTracker.get_args().validation_resolution}`
-
-Note: The validation settings are not necessarily the same as the [training settings](#training-settings).
-
-{'You can find some example images in the following gallery:' if images is not None else ''}\n
-
-<Gallery />
-
-The text encoder {'**was**' if train_text_encoder else '**was not**'} trained.
-{'You may reuse the base model text encoder for inference.' if not train_text_encoder else 'If the text encoder from this repository is not used at inference time, unexpected or bad results could occur.'}
-
-
-## Training settings
-
-- Training epochs: {StateTracker.get_epoch() - 1}
-- Training steps: {StateTracker.get_global_step()}
-- Learning rate: {StateTracker.get_args().learning_rate}
-- Max grad norm: {StateTracker.get_args().max_grad_norm}
-- Effective batch size: {StateTracker.get_args().train_batch_size * StateTracker.get_args().gradient_accumulation_steps * StateTracker.get_accelerator().num_processes}
-  - Micro-batch size: {StateTracker.get_args().train_batch_size}
-  - Gradient accumulation steps: {StateTracker.get_args().gradient_accumulation_steps}
-  - Number of GPUs: {StateTracker.get_accelerator().num_processes}
-- Prediction type: {'flow-matching' if (StateTracker.get_args().model_family in ["sd3", "flux"]) else StateTracker.get_args().prediction_type}{flux_schedule_info(args=StateTracker.get_args())}
-- Rescaled betas zero SNR: {StateTracker.get_args().rescale_betas_zero_snr}
-- Optimizer: {StateTracker.get_args().optimizer}{optimizer_config if optimizer_config is not None else ''}
-- Precision: {'Pure BF16' if torch.backends.mps.is_available() or StateTracker.get_args().mixed_precision == "bf16" else 'FP32'}
-- Quantised: {f'Yes: {StateTracker.get_args().base_model_precision}' if StateTracker.get_args().base_model_precision != "no_change" else 'No'}
-- Xformers: {'Enabled' if StateTracker.get_args().enable_xformers_memory_efficient_attention else 'Not used'}
-{lora_info(args=StateTracker.get_args())}
-
-## Datasets
-
-{datasets_str}
-
-## Inference
-
-{code_example(args=StateTracker.get_args(), repo_id=repo_id)}
-"""
-
-    logger.debug(f"YAML:\n{yaml_content}")
-    logger.debug(f"Model Card:\n{model_card_content}")
-    with open(os.path.join(repo_folder, "README.md"), "w", encoding="utf-8") as f:
-        f.write(yaml_content + model_card_content)
diff --git a/videotuna/third_party/flux/training/__init__.py b/videotuna/third_party/flux/training/__init__.py
deleted file mode 100644
index e0a8b83f..00000000
--- a/videotuna/third_party/flux/training/__init__.py
+++ /dev/null
@@ -1,143 +0,0 @@
-quantised_precision_levels = [
-    "no_change",
-    "int8-quanto",
-    "int4-quanto",
-    "int2-quanto",
-    "int8-torchao",
-]
-import torch
-
-if torch.cuda.is_available():
-    quantised_precision_levels.extend(
-        [
-            "nf4-bnb",
-            # "fp4-bnb",
-            # "fp8-bnb",
-            "fp8-quanto",
-            "fp8uz-quanto",
-        ]
-    )
-    primary_device = torch.cuda.get_device_properties(0)
-    if primary_device.major >= 9:
-        # Hopper! Or blackwell+.
-        quantised_precision_levels.append("fp8-torchao")
-
-image_file_extensions = set(["jpg", "jpeg", "png", "webp", "bmp", "tiff", "tif"])
-
-lycoris_defaults = {
-    "lora": {
-        "algo": "lora",
-        "multiplier": 1.0,
-        "linear_dim": 64,
-        "linear_alpha": 32,
-        "apply_preset": {
-            "target_module": ["Attention", "FeedForward"],
-            "module_algo_map": {
-                "Attention": {"factor": 16},
-                "FeedForward": {"factor": 8},
-            },
-        },
-    },
-    "loha": {
-        "algo": "loha",
-        "multiplier": 1.0,
-        "linear_dim": 32,
-        "linear_alpha": 16,
-        "apply_preset": {
-            "target_module": ["Attention", "FeedForward"],
-            "module_algo_map": {
-                "Attention": {"factor": 16},
-                "FeedForward": {"factor": 8},
-            },
-        },
-    },
-    "lokr": {
-        "algo": "lokr",
-        "multiplier": 1.0,
-        "linear_dim": 10000,  # Full dimension
-        "linear_alpha": 1,  # Ignored in full dimension
-        "factor": 16,
-        "apply_preset": {
-            "target_module": ["Attention", "FeedForward"],
-            "module_algo_map": {
-                "Attention": {"factor": 16},
-                "FeedForward": {"factor": 8},
-            },
-        },
-    },
-    "full": {
-        "algo": "full",
-        "multiplier": 1.0,
-        "linear_dim": 1024,  # Example full matrix size
-        "linear_alpha": 512,
-        "apply_preset": {
-            "target_module": ["Attention", "FeedForward"],
-        },
-    },
-    "ia3": {
-        "algo": "ia3",
-        "multiplier": 1.0,
-        "linear_dim": None,  # No network arguments
-        "linear_alpha": None,
-        "apply_preset": {
-            "target_module": ["Attention", "FeedForward"],
-        },
-    },
-    "dylora": {
-        "algo": "dylora",
-        "multiplier": 1.0,
-        "linear_dim": 128,
-        "linear_alpha": 64,
-        "block_size": 1,  # Update one row/col per step
-        "apply_preset": {
-            "target_module": ["Attention", "FeedForward"],
-            "module_algo_map": {
-                "Attention": {"factor": 16},
-                "FeedForward": {"factor": 8},
-            },
-        },
-    },
-    "diag-oft": {
-        "algo": "diag-oft",
-        "multiplier": 1.0,
-        "linear_dim": 64,  # Block size
-        "constraint": False,
-        "rescaled": False,
-        "apply_preset": {
-            "target_module": ["Attention", "FeedForward"],
-            "module_algo_map": {
-                "Attention": {"factor": 16},
-                "FeedForward": {"factor": 8},
-            },
-        },
-    },
-    "boft": {
-        "algo": "boft",
-        "multiplier": 1.0,
-        "linear_dim": 64,  # Block size
-        "constraint": False,
-        "rescaled": False,
-        "apply_preset": {
-            "target_module": ["Attention", "FeedForward"],
-            "module_algo_map": {
-                "Attention": {"factor": 16},
-                "FeedForward": {"factor": 8},
-            },
-        },
-    },
-}
-
-
-def steps_remaining_in_epoch(current_step: int, steps_per_epoch: int) -> int:
-    """
-    Calculate the number of steps remaining in the current epoch.
-
-    Args:
-        current_step (int): The current step within the epoch.
-        steps_per_epoch (int): Total number of steps in the epoch.
-
-    Returns:
-        int: Number of steps remaining in the current epoch.
-    """
-    remaining_steps = steps_per_epoch - (current_step % steps_per_epoch)
-    return remaining_steps
diff --git a/videotuna/third_party/flux/training/adapter.py b/videotuna/third_party/flux/training/adapter.py
deleted file mode 100644
index e58dd543..00000000
--- a/videotuna/third_party/flux/training/adapter.py
+++ /dev/null
@@ -1,138 +0,0 @@
-import peft
-import safetensors.torch
-import torch
-
-
-def determine_adapter_target_modules(args, unet, transformer):
-    if unet is not None:
-        return ["to_k", "to_q", "to_v", "to_out.0"]
-    elif transformer is not None:
-        target_modules = ["to_k", "to_q", "to_v", "to_out.0"]
-
-        if args.model_family.lower() == "flux" and args.flux_lora_target == "all":
-            # target_modules = mmdit layers here
-            target_modules = [
-                "to_k",
-                "to_q",
-                "to_v",
-                "add_k_proj",
-                "add_q_proj",
-                "add_v_proj",
-                "to_out.0",
-                "to_add_out",
-            ]
-        elif args.flux_lora_target == "context":
-            # i think these are the text input layers.
-            target_modules = [
-                "add_k_proj",
-                "add_q_proj",
-                "add_v_proj",
-                "to_add_out",
-            ]
-        elif args.flux_lora_target == "context+ffs":
-            # i think these are the text input layers.
-            target_modules = [
-                "add_k_proj",
-                "add_q_proj",
-                "add_v_proj",
-                "to_add_out",
-                "ff_context.net.0.proj",
-                "ff_context.net.2",
-            ]
-        elif args.flux_lora_target == "all+ffs":
-            target_modules = [
-                "to_k",
-                "to_q",
-                "to_v",
-                "add_k_proj",
-                "add_q_proj",
-                "add_v_proj",
-                "to_out.0",
-                "to_add_out",
-                "ff.net.0.proj",
-                "ff.net.2",
-                "ff_context.net.0.proj",
-                "ff_context.net.2",
-                "proj_mlp",
-                "proj_out",
-            ]
-        elif args.flux_lora_target == "ai-toolkit":
-            # from ostris' ai-toolkit, possibly required to continue finetuning one.
-            target_modules = [
-                "to_q",
-                "to_k",
-                "to_v",
-                "add_q_proj",
-                "add_k_proj",
-                "add_v_proj",
-                "to_out.0",
-                "to_add_out",
-                "ff.net.0.proj",
-                "ff.net.2",
-                "ff_context.net.0.proj",
-                "ff_context.net.2",
-                "norm.linear",
-                "norm1.linear",
-                "norm1_context.linear",
-                "proj_mlp",
-                "proj_out",
-            ]
-        elif args.flux_lora_target == "tiny":
-            # From TheLastBen
-            # https://www.reddit.com/r/StableDiffusion/comments/1f523bd/good_flux_loras_can_be_less_than_45mb_128_dim/
-            target_modules = [
-                "single_transformer_blocks.7.proj_out",
-                "single_transformer_blocks.20.proj_out",
-            ]
-        elif args.flux_lora_target == "nano":
-            # From TheLastBen
-            # https://www.reddit.com/r/StableDiffusion/comments/1f523bd/good_flux_loras_can_be_less_than_45mb_128_dim/
-            target_modules = [
-                "single_transformer_blocks.7.proj_out",
-            ]
-
-        return target_modules
-
-
-@torch.no_grad()
-def load_lora_weights(dictionary, filename, loraKey="default", use_dora=False):
-    additional_keys = set()
-    state_dict = safetensors.torch.load_file(filename)
-    for prefix, model in dictionary.items():
-        lora_layers = {
-            (prefix + "." + x): y
-            for (x, y) in model.named_modules()
-            if isinstance(y, peft.tuners.lora.layer.Linear)
-        }
-    missing_keys = set(
-        [x + ".lora_A.weight" for x in lora_layers.keys()]
-        + [x + ".lora_B.weight" for x in lora_layers.keys()]
-        + ([x + ".lora_magnitude_vector.weight"] if use_dora else [])
-    )
-    for k, v in state_dict.items():
-        if "lora_A" in k:
-            kk = k.replace(".lora_A.weight", "")
-            if kk in lora_layers:
-                lora_layers[kk].lora_A[loraKey].weight.copy_(v)
-                missing_keys.remove(k)
-            else:
-                additional_keys.add(k)
-        elif "lora_B" in k:
-            kk = k.replace(".lora_B.weight", "")
-            if kk in lora_layers:
-                lora_layers[kk].lora_B[loraKey].weight.copy_(v)
-                missing_keys.remove(k)
-            else:
-                additional_keys.add(k)
-        elif ".alpha" in k or ".lora_alpha" in k:
-            kk = k.replace(".lora_alpha", "").replace(".alpha", "")
-            if kk in lora_layers:
-                lora_layers[kk].lora_alpha[loraKey] = v
-        elif ".lora_magnitude_vector" in k:
-            kk = k.replace(".lora_magnitude_vector.weight", "")
-            if kk in lora_layers:
-                lora_layers[kk].lora_magnitude_vector[loraKey].weight.copy_(v)
-                missing_keys.remove(k)
-            else:
-                additional_keys.add(k)
-    return (additional_keys, missing_keys)
diff --git a/videotuna/third_party/flux/training/collate.py b/videotuna/third_party/flux/training/collate.py
deleted file mode 100644
index 26943eaa..00000000
--- a/videotuna/third_party/flux/training/collate.py
+++ /dev/null
@@ -1,571 +0,0 @@
-import concurrent.futures
-import logging
-from concurrent.futures import ThreadPoolExecutor
-from os import environ
-
-import numpy as np
-import torch
-
-from videotuna.third_party.flux.image_manipulation.training_sample import TrainingSample
-from videotuna.third_party.flux.training.multi_process import rank_info
-from videotuna.third_party.flux.training.state_tracker import StateTracker
-
-logger = logging.getLogger("collate_fn")
-logger.setLevel(environ.get("SIMPLETUNER_COLLATE_LOG_LEVEL", "INFO"))
-rank_text = rank_info()
-from torchvision.transforms import ToTensor
-
-# Convert PIL Image to PyTorch Tensor
-to_tensor = ToTensor()
-
-
-def debug_log(msg: str):
-    logger.debug(f"{rank_text}{msg}")
-
-
-def compute_time_ids(
-    intermediary_size: tuple,
-    target_size: tuple,
-    weight_dtype,
-    vae_downscale_factor: int = 8,
-    crop_coordinates: list = None,
-):
-    if intermediary_size is None or target_size is None:
-        raise Exception(
-            f"Cannot continue, the intermediary_size or target_size were not provided: {intermediary_size}, {target_size}"
-        )
-    logger.debug(
-        f"Computing time ids for:"
-        f"\n-> intermediary_size = {intermediary_size}"
-        f"\n-> target_size = {target_size}"
-    )
-    # The dimensions of tensors are "transposed", as:
-    # (batch_size, height, width)
-    # An image would look like:
-    # (width, height)
-    # SDXL conditions are:
-    # [h, w, h, w, h, w]
-    original_width = intermediary_size[0]
-    original_height = intermediary_size[1]
-    target_width = int(target_size[2] * vae_downscale_factor)
-    target_height = int(target_size[1] * vae_downscale_factor)
-    final_target_size = (target_height, target_width)
-    if original_width is None:
-        raise ValueError("Original width must be specified.")
-    if original_height is None:
-        raise ValueError("Original height must be specified.")
-    if crop_coordinates is None:
-        raise ValueError("Crop coordinates were not collected during collate.")
-    if StateTracker.is_sdxl_refiner():
-        fake_aesthetic_score = StateTracker.get_args().data_aesthetic_score
-        add_time_ids = list(
-            (original_height, original_width)
-            + tuple(crop_coordinates)
-            + (fake_aesthetic_score,)
-        )
-    else:
-        add_time_ids = list(
-            (original_height, original_width)
-            + tuple(crop_coordinates)
-            + final_target_size
-        )
-
-    add_time_ids = torch.tensor([add_time_ids], dtype=weight_dtype)
-    logger.debug(
-        f"compute_time_ids returning {add_time_ids.shape} shaped time ids: {add_time_ids}"
-    )
-    return add_time_ids
-
-
-def extract_filepaths(examples):
-    filepaths = []
-    for example in examples:
-        filepaths.append(example["image_path"])
-    return filepaths
-
-
-def fetch_pixel_values(fp, data_backend_id: str):
-    """Worker method to fetch pixel values for a single image."""
-    debug_log(
-        f" -> pull pixels for fp {fp} from cache via data backend {data_backend_id}"
-    )
-    data_backend = StateTracker.get_data_backend(data_backend_id)
-    image = data_backend["data_backend"].read_image(fp)
-    training_sample = TrainingSample(
-        image=image,
-        data_backend_id=data_backend_id,
-    )
-    return training_sample.prepare(return_tensor=True).image
-
-
-def fetch_latent(fp, data_backend_id: str):
-    """Worker method to fetch latent for a single image."""
-    debug_log(
-        f" -> pull latents for fp {fp} from cache via data backend {data_backend_id}"
-    )
-    latent = StateTracker.get_vaecache(id=data_backend_id).retrieve_from_cache(fp)
-
-    # Move to CPU and pin memory if it's not on the GPU
-    if not torch.backends.mps.is_available():
-        debug_log(" -> push latents to GPU via pinned memory")
-        latent = latent.to("cpu").pin_memory()
-    return latent
-
-
-def deepfloyd_pixels(filepaths, data_backend_id: str):
-    """DeepFloyd doesn't use the VAE. We retrieve, normalise, and stack the pixel tensors directly."""
-    # Use a thread pool to fetch latents concurrently
-    try:
-        with concurrent.futures.ThreadPoolExecutor() as executor:
-            pixels = list(
-                executor.map(
-                    fetch_pixel_values, filepaths, [data_backend_id] * len(filepaths)
-                )
-            )
-    except Exception as e:
-        logger.error(f"(id={data_backend_id}) Error while computing pixels: {e}")
-        raise
-    pixels = torch.stack(pixels)
-    pixels = pixels.to(memory_format=torch.contiguous_format).float()
-
-    return pixels
-
-
-def fetch_conditioning_pixel_values(
-    fp, training_fp, conditioning_data_backend_id: str, training_data_backend_id: str
-):
-    """Worker method to fetch pixel values for a single image."""
-    # Retrieve data backends
-    conditioning_data_backend = StateTracker.get_data_backend(
-        conditioning_data_backend_id
-    )
-    training_data_backend = StateTracker.get_data_backend(training_data_backend_id)
-
-    # Use the provided training file path directly
-    training_sample = TrainingSample.from_image_path(
-        image_path=training_fp,
-        data_backend_id=training_data_backend_id,
-    )
-
-    conditioning_sample = TrainingSample.from_image_path(
-        image_path=fp,
-        data_backend_id=conditioning_data_backend_id,
-    )
-
-    # Prepare the conditioning sample to match the training sample
-    prepared_like = conditioning_sample.prepare_like(
-        training_sample, return_tensor=True
-    ).image
-
-    return prepared_like
-
-
-def conditioning_pixels(
-    filepaths,
-    training_filepaths,
-    conditioning_data_backend_id: str,
-    training_data_backend_id: str,
-):
-    """For pixel-based conditioning images that must be prepared matching a paired image's metadata.."""
-    try:
-        with concurrent.futures.ThreadPoolExecutor() as executor:
-            pixels = list(
-                executor.map(
-                    fetch_conditioning_pixel_values,
-                    filepaths,
-                    training_filepaths,
-                    [conditioning_data_backend_id] * len(filepaths),
-                    [training_data_backend_id] * len(filepaths),
-                )
-            )
-    except Exception as e:
-        logger.error(
-            f"(conditioning_data_backend_id={conditioning_data_backend_id}) Error while retrieving or transforming pixels (training data id={training_data_backend_id}): {e}"
-        )
-        raise
-    pixels = torch.stack(pixels)
-    pixels = pixels.to(memory_format=torch.contiguous_format).float()
-
-    return pixels
-
-
-def compute_latents(filepaths, data_backend_id: str):
-    # Use a thread pool to fetch latents concurrently
-    try:
-        if "deepfloyd" in StateTracker.get_args().model_type:
-            latents = deepfloyd_pixels(filepaths, data_backend_id)
-
-            return latents
-        if StateTracker.get_args().vae_cache_ondemand:
-            latents = StateTracker.get_vaecache(id=data_backend_id).encode_images(
-                [None] * len(filepaths), filepaths
-            )
-        else:
-            with concurrent.futures.ThreadPoolExecutor() as executor:
-                latents = list(
-                    executor.map(
-                        fetch_latent, filepaths, [data_backend_id] * len(filepaths)
-                    )
-                )
-    except Exception as e:
-        logger.error(f"(id={data_backend_id}) Error while computing latents: {e}")
-        raise
-
-    return latents
-
-
-def compute_single_embedding(
-    caption, text_embed_cache, is_sdxl, is_sd3: bool = False, is_flux: bool = False
-):
-    """Worker function to compute embedding for a single caption."""
-    if caption == "" or not caption:
-        # Grab the default text embed backend for null caption.
-        text_embed_cache = StateTracker.get_default_text_embed_cache()
-        debug_log(
-            f"Hashing caption '{caption}' on text embed cache: {text_embed_cache.id} using data backend {text_embed_cache.data_backend.id}"
-        )
-    if is_sdxl:
-        (
-            prompt_embeds,
-            pooled_prompt_embeds,
-        ) = text_embed_cache.compute_embeddings_for_sdxl_prompts([caption])
-        return (
-            prompt_embeds[0],
-            pooled_prompt_embeds[0],
-        )  # Unpack the first (and only) element
-    elif is_sd3:
-        prompt_embeds, pooled_prompt_embeds = (
-            text_embed_cache.compute_embeddings_for_sd3_prompts(prompts=[caption])
-        )
-        return prompt_embeds[0], pooled_prompt_embeds[0]
-    elif is_flux:
-        prompt_embeds, pooled_prompt_embeds, time_ids, masks = (
-            text_embed_cache.compute_embeddings_for_flux_prompts(prompts=[caption])
-        )
-        return (
-            prompt_embeds[0],
-            pooled_prompt_embeds[0],
-            time_ids[0],
-            masks[0] if masks is not None else None,
-        )
-    else:
-        prompt_embeds = text_embed_cache.compute_embeddings_for_legacy_prompts(
-            [caption]
-        )
-        if type(prompt_embeds) == tuple:
-            if StateTracker.get_model_family() in ["pixart_sigma", "smoldit"]:
-                # PixArt requires the attn mask be returned, too.
-                prompt_embeds, attn_mask = prompt_embeds
-
-                return prompt_embeds, attn_mask
-            elif "deepfloyd" in StateTracker.get_args().model_type:
-                # DeepFloyd doesn't use the attn mask on the unet inputs, we discard it
-                prompt_embeds = prompt_embeds[0]
-            prompt_embeds = prompt_embeds[0]
-        result = torch.squeeze(prompt_embeds[0])
-        debug_log(f"Torch shape: {result}")
-        return result, None  # Unpack and return None for the second element
-
-
-def compute_prompt_embeddings(captions, text_embed_cache):
-    """
-    Retrieve / compute text embeds in parallel.
-    Args:
-        captions: List of strings
-        text_embed_cache: TextEmbedCache instance
-
-    Returns:
-        prompt_embeds_all: Tensor of shape (batch_size, 512)
-        add_text_embeds_all: Tensor of shape (batch_size, 512)
-    """
-    debug_log(" -> get embed from cache")
-    is_sdxl = (
-        text_embed_cache.model_type == "sdxl" or text_embed_cache.model_type == "kolors"
-    )
-    is_sd3 = text_embed_cache.model_type == "sd3"
-    is_pixart_sigma = text_embed_cache.model_type == "pixart_sigma"
-    is_smoldit = text_embed_cache.model_type == "smoldit"
-    is_flux = text_embed_cache.model_type == "flux"
-
-    # Use a thread pool to compute embeddings concurrently
-    with ThreadPoolExecutor() as executor:
-        embeddings = list(
-            executor.map(
-                compute_single_embedding,
-                captions,
-                [text_embed_cache] * len(captions),
-                [is_sdxl] * len(captions),
-                [is_sd3] * len(captions),
-                [is_flux] * len(captions),
-            )
-        )
-
-    debug_log(f"Got embeddings: {embeddings}")
-    if is_sdxl:
-        # Separate the tuples
-        prompt_embeds = [t[0] for t in embeddings]
-        add_text_embeds = [t[1] for t in embeddings]
-        return (torch.stack(prompt_embeds), torch.stack(add_text_embeds))
-    elif is_sd3:
-        # Separate the tuples
-        prompt_embeds = [t[0] for t in embeddings]
-        add_text_embeds = [t[1] for t in embeddings]
-        return (torch.stack(prompt_embeds), torch.stack(add_text_embeds))
-    elif is_pixart_sigma or is_smoldit:
-        # the tuples here are the text encoder hidden states and the attention masks
-        prompt_embeds, attn_masks = [], []
-        for embed in embeddings:
-            prompt_embeds.append(embed[0][0])
-            attn_masks.append(embed[1][0])
-        if len(prompt_embeds[0].shape) == 3:
-            # some tensors are already expanded due to the way they were saved
-            prompt_embeds = [t.squeeze(0) for t in prompt_embeds]
-        return (torch.stack(prompt_embeds), torch.stack(attn_masks))
-    elif is_flux:
-        # Separate the tuples
-        prompt_embeds = [t[0] for t in embeddings]
-        add_text_embeds = [t[1] for t in embeddings]
-        time_ids = [t[2] for t in embeddings]
-        masks = [t[3] for t in embeddings]
-        return (
-            torch.stack(prompt_embeds),
-            torch.stack(add_text_embeds),
-            torch.stack(time_ids),
-            torch.stack(masks) if None not in masks else None,
-        )
-    else:
-        # Separate the tuples
-        prompt_embeds = [t[0] for t in embeddings]
-        return (torch.stack(prompt_embeds), None)
-
-
-def gather_conditional_pixart_size_features(examples, latents, weight_dtype):
-    bsz = len(examples)
-    # 1/8th scale VAE
-    LATENT_COMPRESSION_F = 8
-    batch_height = latents.shape[2] * LATENT_COMPRESSION_F
-    batch_width = latents.shape[3] * LATENT_COMPRESSION_F
-    resolution = torch.tensor([batch_height, batch_width]).repeat(bsz, 1)
-    aspect_ratio = torch.tensor([float(batch_height / batch_width)]).repeat(bsz, 1)
-    resolution = resolution.to(
-        dtype=weight_dtype, device=StateTracker.get_accelerator().device
-    )
-    aspect_ratio = aspect_ratio.to(
-        dtype=weight_dtype, device=StateTracker.get_accelerator().device
-    )
-
-    return {"resolution": resolution, "aspect_ratio": aspect_ratio}
-
-
-def gather_conditional_sdxl_size_features(examples, latents, weight_dtype):
-    batch_time_ids_list = []
-    if len(examples) != len(latents):
-        raise ValueError(
-            f"Number of examples ({len(examples)}) and latents ({len(latents)}) must match."
-        )
-
-    for idx, example in enumerate(examples):
-        # Compute time IDs for all examples
-        # - We use the intermediary size as the original size for SDXL.
-        # - This is because we first resize to intermediary_size before cropping.
-        time_ids = compute_time_ids(
-            intermediary_size=tuple(
-                example.get("intermediary_size", example.get("original_size"))
-            ),
-            target_size=latents[idx].shape,
-            crop_coordinates=example["crop_coordinates"],
-            weight_dtype=weight_dtype,
-        )
-
-        # Overwrite with zeros if conditioning is to be dropped
-        if example["drop_conditioning"]:
-            time_ids = torch.zeros_like(time_ids)
-
-        batch_time_ids_list.append(time_ids)
-
-    return torch.stack(batch_time_ids_list, dim=0)
-
-
-def check_latent_shapes(latents, filepaths, data_backend_id, batch):
-    # Validate shapes
-    test_shape = latents[0].shape
-    # Check all "aspect_ratio" values and raise error if any differ, with the two differing values:
-    for example in batch:
-        if example["aspect_ratio"] != batch[0]["aspect_ratio"]:
-            error_msg = f"(id=({data_backend_id}) Aspect ratio mismatch: {example['aspect_ratio']} != {batch[0][0]['aspect_ratio']}"
-            logger.error(error_msg)
-            logger.error(f"Erroneous batch: {batch}")
-            raise ValueError(error_msg)
-    for idx, latent in enumerate(latents):
-        # Are there any inf or nan positions?
-        if latent is None:
-            logger.debug(f"Error batch: {batch}")
-            error_msg = f"(id={data_backend_id}) File {filepaths[idx]} latent is None. Filepath: {filepaths[idx]}, data_backend_id: {data_backend_id}"
-            logger.error(error_msg)
-            raise ValueError(error_msg)
-        if torch.isnan(latent).any() or torch.isinf(latent).any():
-            # get the data_backend
-            data_backend = StateTracker.get_data_backend(data_backend_id)
-            # remove the object
-            data_backend["vaecache"].cache_data_backend.delete(filepaths[idx])
-            raise ValueError(
-                f"(id={data_backend_id}) Deleted cache file {filepaths[idx]}: contains NaN or Inf values: {latent}"
-            )
-        if latent.shape != test_shape:
-            raise ValueError(
-                f"(id={data_backend_id}) File {filepaths[idx]} latent shape mismatch: {latent.shape} != {test_shape}"
-            )
-
-    debug_log(f" -> stacking {len(latents)} latents")
-    return torch.stack(
-        [latent.to(StateTracker.get_accelerator().device) for latent in latents]
-    )
-
-
-def collate_fn(batch):
-    if len(batch) != 1:
-        raise ValueError(
-            "This trainer is not designed to handle multiple batches in a single collate."
-        )
-    debug_log("Begin collate_fn on batch")
-
-    # SDXL Dropout
-    dropout_probability = StateTracker.get_args().caption_dropout_probability
-    batch = batch[0]
-    examples = batch["training_samples"]
-    conditioning_examples = batch["conditioning_samples"]
-    is_regularisation_data = batch.get("is_regularisation_data", False)
-    if StateTracker.get_args().controlnet and len(examples) != len(
-        conditioning_examples
-    ):
-        raise ValueError(
-            "Number of training samples and conditioning samples must match for ControlNet."
-            f"\n-> Training samples: {examples}"
-            f"\n-> Conditioning samples: {conditioning_examples}"
-        )
-
-    # Randomly drop captions/conditioning based on dropout_probability
-    for example in examples:
-        data_backend_id = example["data_backend_id"]
-        if (
-            dropout_probability is not None
-            and dropout_probability > 0
-            and np.random.rand() < dropout_probability
-        ):
-            example["instance_prompt_text"] = ""  # Drop caption
-            example["drop_conditioning"] = True  # Flag to drop conditioning
-        else:
-            example["drop_conditioning"] = False
-
-    debug_log("Collect luminance values")
-    if "luminance" in examples[0]:
-        batch_luminance = [example["luminance"] for example in examples]
-    else:
-        batch_luminance = [0] * len(examples)
-    # average it
-    batch_luminance = sum(batch_luminance) / len(batch_luminance)
-    debug_log("Extract filepaths")
-    filepaths = extract_filepaths(examples)
-    debug_log("Compute latents")
-    latent_batch = compute_latents(filepaths, data_backend_id)
-    if "deepfloyd" not in StateTracker.get_args().model_type:
-        debug_log("Check latents")
-        latent_batch = check_latent_shapes(
-            latent_batch, filepaths, data_backend_id, examples
-        )
-
-    conditioning_filepaths = []
-    training_filepaths = []
-    conditioning_type = None
-    conditioning_pixel_values = None
-
-    if len(conditioning_examples) > 0:
-        if len(conditioning_examples) != len(examples):
-            raise ValueError(
-                "The number of conditioning examples must match the number of training examples."
-            )
-
-        data_backend = StateTracker.get_data_backend(data_backend_id)
-        conditioning_data_backend_id = data_backend.get("conditioning_data", {}).get(
-            "id"
-        )
-
-        for cond_example, train_example in zip(conditioning_examples, examples):
-            # Ensure conditioning types match
-            cond_type = cond_example.get_conditioning_type()
-            if conditioning_type is None:
-                conditioning_type = cond_type
-            elif cond_type != conditioning_type:
-                raise ValueError(
-                    f"Conditioning type mismatch: {conditioning_type} != {cond_type}"
-                    "\n-> Ensure all conditioning samples are of the same type."
-                )
-
-            # Collect conditioning and training file paths
-            conditioning_filepaths.append(cond_example.image_path(basename_only=False))
-            training_filepaths.append(train_example["image_path"])
-
-        # Pass both file paths to `conditioning_pixels`
-        conditioning_pixel_values = conditioning_pixels(
-            conditioning_filepaths,
-            training_filepaths,
-            conditioning_data_backend_id,
-            data_backend_id,
-        )
-
-        conditioning_pixel_values = torch.stack(
-            [
-                latent.to(StateTracker.get_accelerator().device)
-                for latent in conditioning_pixel_values
-            ]
-        )
-
-    # Compute embeddings and handle dropped conditionings
-    debug_log("Extract captions")
-    captions = [example["instance_prompt_text"] for example in examples]
-    debug_log("Pull cached text embeds")
-    text_embed_cache = StateTracker.get_data_backend(data_backend_id)[
-        "text_embed_cache"
-    ]
-
-    attn_mask = None
-    batch_time_ids = None
-    if StateTracker.get_model_family() == "flux":
-        debug_log("Compute and stack Flux time ids")
-        prompt_embeds_all, add_text_embeds_all, batch_time_ids, attn_mask = (
-            compute_prompt_embeddings(captions, text_embed_cache)
-        )
-    else:
-        prompt_embeds_all, add_text_embeds_all = compute_prompt_embeddings(
-            captions, text_embed_cache
-        )
-
-    if (
-        StateTracker.get_model_family() == "sdxl"
-        or StateTracker.get_model_family() == "kolors"
-    ):
-        debug_log("Compute and stack SDXL time ids")
-        batch_time_ids = gather_conditional_sdxl_size_features(
-            examples, latent_batch, StateTracker.get_weight_dtype()
-        )
-        debug_log(f"Time ids stacked to {batch_time_ids.shape}: {batch_time_ids}")
-    elif StateTracker.get_model_family() == "pixart_sigma":
-        debug_log("Compute and stack PixArt time ids")
-        batch_time_ids = gather_conditional_pixart_size_features(
-            examples, latent_batch, StateTracker.get_weight_dtype()
-        )
-        attn_mask = add_text_embeds_all
-    elif StateTracker.get_model_family() == "smoldit":
-        attn_mask = add_text_embeds_all
-
-    return {
-        "latent_batch": latent_batch,
-        "prompt_embeds": prompt_embeds_all,
-        "add_text_embeds": add_text_embeds_all,
-        "batch_time_ids": batch_time_ids,
-        "batch_luminance": batch_luminance,
-        "conditioning_pixel_values": conditioning_pixel_values,
-        "encoder_attention_mask": attn_mask,
-        "is_regularisation_data": is_regularisation_data,
-        "conditioning_type": conditioning_type,
-    }
diff --git a/videotuna/third_party/flux/training/custom_schedule.py b/videotuna/third_party/flux/training/custom_schedule.py
deleted file mode 100644
index 215ffd61..00000000
--- a/videotuna/third_party/flux/training/custom_schedule.py
+++ /dev/null
@@ -1,758 +0,0 @@
-import logging
-import math
-import os
-
-import accelerate
-import torch
-from torch.optim.lr_scheduler import LambdaLR, LRScheduler
-
-from videotuna.third_party.flux.training.state_tracker import StateTracker
-
-logger = logging.getLogger(__name__)
-logger.setLevel(os.environ.get("SIMPLETUNER_LOG_LEVEL", "INFO"))
-
-
-def segmented_timestep_selection(
-    actual_num_timesteps, bsz, weights, use_refiner_range: bool = False
-):
-    args = StateTracker.get_args()
-    # Determine the range of timesteps to use
-    num_timesteps = actual_num_timesteps
-    if use_refiner_range or args.refiner_training:
-        if args.refiner_training_invert_schedule:
-            # Inverted schedule calculation: we start from the last timestep and move downwards
-            start_timestep = (
-                actual_num_timesteps - 1
-            )  # Start from the last timestep, e.g., 999
-            # Calculate the end of the range based on the inverse of the training strength
-            end_timestep = int(args.refiner_training_strength * actual_num_timesteps)
-        else:
-            # Normal refiner training schedule
-            start_timestep = (
-                int(actual_num_timesteps * args.refiner_training_strength) - 1
-            )
-            end_timestep = 0
-        num_timesteps = start_timestep - end_timestep + 1
-    else:
-        start_timestep = actual_num_timesteps - 1
-        end_timestep = 0
-
-    # logger.debug(
-    #     f"{'Using SDXL refiner' if StateTracker.is_sdxl_refiner() else 'Training base model '} with {num_timesteps} timesteps from a full schedule of {actual_num_timesteps} and a segment size of {num_timesteps // bsz} timesteps."
-    # )
-    segment_size = max(num_timesteps // bsz, 1)
-    selected_timesteps = []
-
-    # Select one timestep from each segment based on the weights
-    for i in range(bsz):
-        start = start_timestep - i * segment_size
-        end = max(start - segment_size, end_timestep) if i != bsz - 1 else end_timestep
-        # logger.debug(f"Segment from {start} to {end}")
-        segment_weights = weights[end : start + 1]
-
-        # Normalize segment weights to ensure they sum to 1
-        segment_weights /= segment_weights.sum()
-
-        # Sample one timestep from the segment
-        segment_timesteps = torch.arange(end, start + 1)
-        selected_timestep = torch.multinomial(segment_weights, 1).item()
-        selected_timesteps.append(segment_timesteps[selected_timestep])
-
-    # logger.debug(f"Selected timesteps: {selected_timesteps}")
-    return torch.tensor(selected_timesteps)
-
-
-def get_sd3_sigmas(
-    accelerator, noise_scheduler_copy, timesteps, n_dim=4, dtype=torch.float32
-):
-    sigmas = noise_scheduler_copy.sigmas.to(device=accelerator.device, dtype=dtype)
-    # print(f'sigmas: {sigmas.shape}')
-    schedule_timesteps = noise_scheduler_copy.timesteps.to(accelerator.device)
-    timesteps = timesteps.to(accelerator.device)
-    step_indices = [(schedule_timesteps == t).nonzero().item() for t in timesteps]
-    # print(f'step_indices: {step_indices}')
-
-    sigma = sigmas[step_indices].flatten()
-    while len(sigma.shape) < n_dim:
-        # print('unsqueeze')
-        sigma = sigma.unsqueeze(-1)
-    # print('return')
-    return sigma
-
-
-def generate_timestep_weights(args, num_timesteps):
-    weights = torch.ones(num_timesteps)
-
-    # Determine the indices to bias
-    num_to_bias = int(args.timestep_bias_portion * num_timesteps)
-
-    if args.timestep_bias_strategy == "later":
-        bias_indices = slice(-num_to_bias, None)
-    elif args.timestep_bias_strategy == "earlier":
-        bias_indices = slice(0, num_to_bias)
-    elif args.timestep_bias_strategy == "range":
-        # Out of the possible 1000 timesteps, we might want to focus on eg. 200-500.
-        range_begin = args.timestep_bias_begin
-        range_end = args.timestep_bias_end
-        if range_begin < 0:
-            raise ValueError(
-                "When using the range strategy for timestep bias, you must provide a beginning timestep greater or equal to zero."
-            )
-        if range_end > num_timesteps:
-            raise ValueError(
-                "When using the range strategy for timestep bias, you must provide an ending timestep smaller than the number of timesteps."
-            )
-        bias_indices = slice(range_begin, range_end)
-    else:  # 'none' or any other string
-        return weights
-    if args.timestep_bias_multiplier <= 0:
-        raise ValueError(
-            "The parameter --timestep_bias_multiplier is not intended to be used to disable the training of specific timesteps."
-            " If it was intended to disable timestep bias, use `--timestep_bias_strategy none` instead."
-            " A timestep bias multiplier less than or equal to 0 is not allowed."
-        )
-
-    # Apply the bias
-    weights[bias_indices] *= args.timestep_bias_multiplier
-
-    # Normalize
-    weights /= weights.sum()
-
-    return weights
-
-
-def get_polynomial_decay_schedule_with_warmup(
-    optimizer,
-    num_warmup_steps: int,
-    num_training_steps: int,
-    lr_end: float = 1e-7,
-    power: float = 1.0,
-    last_epoch: int = -1,
-):
-    """
-    Create a schedule with a learning rate that decreases as a polynomial decay from the initial lr set in the
-    optimizer to end lr defined by *lr_end*, after a warmup period during which it increases linearly from 0 to the
-    initial lr set in the optimizer.
-
-    Args:
-        optimizer ([`~torch.optim.Optimizer`]):
-            The optimizer for which to schedule the learning rate.
-        num_warmup_steps (`int`):
-            The number of steps for the warmup phase.
-        num_training_steps (`int`):
-            The total number of training steps.
-        lr_end (`float`, *optional*, defaults to 1e-7):
-            The end LR.
-        power (`float`, *optional*, defaults to 1.0):
-            Power factor.
-        last_epoch (`int`, *optional*, defaults to -1):
-            The index of the last epoch when resuming training.
-
-    Note: *power* defaults to 1.0 as in the fairseq implementation, which in turn is based on the original BERT
-    implementation at
-    https://github.com/google-research/bert/blob/f39e881b169b9d53bea03d2d341b31707a6c052b/optimization.py#L37
-
-    Return:
-        `torch.optim.lr_scheduler.LambdaLR` with the appropriate schedule.
-
-    """
-
-    lr_init = optimizer.defaults["lr"]
-    if not (float(lr_init) > float(lr_end)):
-        raise ValueError(
-            f"lr_end ({lr_end}) must be be smaller than initial lr ({lr_init})"
-        )
-
-    def lr_lambda(current_step: int):
-        if current_step < num_warmup_steps:
-            return float(current_step) / float(max(1, num_warmup_steps))
-        elif current_step > num_training_steps:
-            return float(lr_end) / float(lr_init)  # as LambdaLR multiplies by lr_init
-        else:
-            lr_range = float(lr_init) - float(lr_end)
-            decay_steps = int(num_training_steps) - int(num_warmup_steps)
-            pct_remaining = 1 - (current_step - int(num_warmup_steps)) / decay_steps
-            decay = lr_range * pct_remaining**power + float(lr_end)
-            return decay / float(lr_init)  # as LambdaLR multiplies by lr_init
-
-    return LambdaLR(optimizer, lr_lambda, last_epoch)
-
-
-def enforce_zero_terminal_snr(betas):
-    # Convert betas to alphas_bar_sqrt
-    alphas = 1 - betas
-    alphas_bar = alphas.cumprod(0)
-    alphas_bar_sqrt = alphas_bar.sqrt()
-
-    # Store old values.
-    alphas_bar_sqrt_0 = alphas_bar_sqrt[0].clone()
-    alphas_bar_sqrt_T = alphas_bar_sqrt[-1].clone()
-    # Shift so last timestep is zero.
-    alphas_bar_sqrt -= alphas_bar_sqrt_T
-    # Scale so first timestep is back to old value.
-    alphas_bar_sqrt *= alphas_bar_sqrt_0 / (alphas_bar_sqrt_0 - alphas_bar_sqrt_T)
-    # Convert alphas_bar_sqrt to betas
-    alphas_bar = alphas_bar_sqrt**2
-    alphas = alphas_bar[1:] / alphas_bar[:-1]
-    alphas = torch.cat([alphas_bar[0:1], alphas])
-    betas = 1 - alphas
-    return betas
-
-
-def patch_scheduler_betas(scheduler):
-    scheduler.betas = enforce_zero_terminal_snr(scheduler.betas)
-
-
-class _enable_get_lr_call:
-    def __init__(self, o):
-        self.o = o
-
-    def __enter__(self):
-        self.o._get_lr_called_within_step = True
-        return self
-
-    def __exit__(self, type, value, traceback):
-        self.o._get_lr_called_within_step = False
-        return self
-
-
-class Cosine(LRScheduler):
-    r"""Use a cosine schedule for the learning rate, without restarts.
-    This makes a nice and pretty chart on the tensorboard.
-
-    Args:
-        optimizer (Optimizer): Wrapped optimizer.
-        T_0 (int): Number of iterations for the first restart.
-        T_mult (int, optional): A factor increases :math:`T_{i}` after a restart. Default: 1.
-        eta_min (float, optional): Minimum learning rate. Default: 0.
-        last_epoch (int, optional): The index of last epoch. Default: -1.
-        verbose (bool): If ``True``, prints a message to stdout for
-            each update. Default: ``False``.
-
-    .. _SGDR\: Stochastic Gradient Descent with Warm Restarts:
-        https://arxiv.org/abs/1608.03983
-    """
-
-    def __init__(
-        self,
-        optimizer,
-        T_0,
-        steps_per_epoch=-1,
-        T_mult=1,
-        eta_min=0,
-        last_step=-1,
-        last_epoch=-1,
-        verbose=False,
-    ):
-        if T_0 <= 0 or not isinstance(T_0, int):
-            raise ValueError(
-                f"Cosine learning rate expects to use warmup steps as its interval. Expected positive integer T_0, but got {T_0}"
-            )
-        if T_mult < 1 or not isinstance(T_mult, int):
-            raise ValueError(f"Expected integer T_mult >= 1, but got {T_mult}")
-        if last_epoch != -1 and last_step != -1:
-            last_epoch = last_step
-        elif last_epoch != -1 and last_step == -1:
-            last_step = last_epoch
-        self.T_0 = T_0
-        self.steps_per_epoch = steps_per_epoch
-        self.T_i = T_0
-        self.T_mult = T_mult
-        self.eta_min = eta_min
-        self.T_cur = last_step
-        super().__init__(optimizer, last_step, verbose)
-
-    def get_lr(self):
-        lrs = [
-            self.eta_min
-            + (base_lr - self.eta_min)
-            * (1 + math.cos(math.pi * self.T_cur / self.T_i))
-            / 2
-            for base_lr in self.base_lrs
-        ]
-        return lrs
-
-    def step(self, step=None):
-        if step is None and self.last_epoch < 0:
-            step = 0
-
-        if step is None:
-            step = self.last_epoch + 1
-            self.T_cur = (step // self.steps_per_epoch) + (
-                step % self.steps_per_epoch
-            ) / self.steps_per_epoch
-        else:
-            self.T_cur = (step // self.steps_per_epoch) + (
-                step % self.steps_per_epoch
-            ) / self.steps_per_epoch
-
-        if self.T_cur >= self.T_i:
-            self.T_cur = self.T_cur - self.T_i
-            self.T_i = self.T_i * self.T_mult
-
-        self.last_epoch = step
-
-        with _enable_get_lr_call(self):
-            for i, data in enumerate(zip(self.optimizer.param_groups, self.get_lr())):
-                param_group, lr = data
-                param_group["lr"] = math.floor(lr * 1e9) / 1e9
-                self.print_lr(self.verbose, i, lr, step)
-
-        self._last_lr = [group["lr"] for group in self.optimizer.param_groups]
-
-    def print_lr(self, is_verbose, group, lr, epoch=None):
-        """Display the current learning rate."""
-        if is_verbose:
-            if epoch is None:
-                print(
-                    "Adjusting learning rate"
-                    " of group {} to {:.8e}.".format(group, lr)
-                )
-            else:
-                epoch_str = ("%.2f" if isinstance(epoch, float) else "%.5d") % epoch
-                print(
-                    "Epoch {}: adjusting learning rate"
-                    " of group {} to {:.8e}.".format(epoch_str, group, lr)
-                )
-
-
-class CosineAnnealingHardRestarts(LRScheduler):
-    r"""Set the learning rate of each parameter group using a cosine annealing
-    schedule, where :math:`\eta_{max}` is set to the initial lr, :math:`T_{cur}`
-    is the number of epochs since the last restart and :math:`T_{i}` is the number
-    of epochs between two warm restarts in SGDR:
-
-    .. math::
-        \eta_t = \eta_{min} + \frac{1}{2}(\eta_{max} - \eta_{min})\left(1 +
-        \cos\left(\frac{T_{cur}}{T_{i}}\pi\right)\right)
-
-    When :math:`T_{cur}=T_{i}`, set :math:`\eta_t = \eta_{min}`.
-    When :math:`T_{cur}=0` after restart, set :math:`\eta_t=\eta_{max}`.
-
-    It has been proposed in
-    `SGDR: Stochastic Gradient Descent with Warm Restarts`_.
-
-    Args:
-        optimizer (Optimizer): Wrapped optimizer.
-        T_0 (int): Number of iterations for the first restart.
-        T_mult (int, optional): A factor increases :math:`T_{i}` after a restart. Default: 1.
-        eta_min (float, optional): Minimum learning rate. Default: 0.
-        last_epoch (int, optional): The index of last epoch. Default: -1.
-        verbose (bool): If ``True``, prints a message to stdout for
-            each update. Default: ``False``.
-
-    .. _SGDR\: Stochastic Gradient Descent with Warm Restarts:
-        https://arxiv.org/abs/1608.03983
-    """
-
-    def __init__(
-        self,
-        optimizer,
-        T_0,
-        steps_per_epoch=-1,
-        T_mult=1,
-        eta_min=0,
-        last_step=-1,
-        last_epoch=-1,
-        verbose=False,
-    ):
-        if T_0 <= 0 or not isinstance(T_0, int):
-            raise ValueError(f"Expected positive integer T_0, but got {T_0}")
-        if T_mult < 1 or not isinstance(T_mult, int):
-            raise ValueError(f"Expected integer T_mult >= 1, but got {T_mult}")
-        if last_epoch != -1 and last_step != -1:
-            last_epoch = last_step
-        elif last_epoch != -1 and last_step == -1:
-            last_step = last_epoch
-        self.T_0 = T_0
-        self.steps_per_epoch = steps_per_epoch
-        self.T_i = T_0
-        self.T_mult = T_mult
-        self.eta_min = eta_min
-        self.T_cur = last_step
-        self.last_step = last_step
-        super().__init__(optimizer, last_step, verbose)
-
-    def get_lr(self):
-        lrs = [
-            self.eta_min
-            + (base_lr - self.eta_min)
-            * (1 + math.cos(math.pi * self.T_cur / self.T_i))
-            / 2
-            for base_lr in self.base_lrs
-        ]
-        return lrs
-
-    def step(self, step=None):
-        # Check if the step argument is provided, if not, increment the last_step counter
-        if step is None:
-            step = self.last_step + 1
-
-        # Calculate T_cur: This represents the current step within the current cycle
-        # % operator ensures T_cur is always within the range of the current cycle
-        self.T_cur = step % self.steps_per_epoch
-
-        # Check if T_cur has reached the end of the current cycle (T_i)
-        # If so, it's time for a warm restart
-        if self.T_cur >= self.T_i:
-            self.T_cur = 0  # Reset T_cur to start a new cycle
-            self.T_i *= self.T_mult  # Increase the length of the next cycle
-
-        # Update the last step with the current step
-        self.last_step = step
-
-        # This context manager ensures that the learning rate is updated correctly
-        with _enable_get_lr_call(self):
-            # Loop through each parameter group and its corresponding learning rate
-            for i, data in enumerate(zip(self.optimizer.param_groups, self.get_lr())):
-                param_group, lr = data
-                # Update the learning rate for this parameter group
-                # We use math.floor to truncate the precision to avoid numerical issues
-                param_group["lr"] = math.floor(lr * 1e9) / 1e9
-                # Print the updated learning rate if verbose mode is enabled
-                self.print_lr(self.verbose, i, lr, step)
-
-        # Update the last learning rate values for each parameter group
-        self._last_lr = [group["lr"] for group in self.optimizer.param_groups]
-
-    def print_lr(self, is_verbose, group, lr, epoch=None):
-        """Display the current learning rate."""
-        if is_verbose:
-            if epoch is None:
-                print(
-                    "Adjusting learning rate"
-                    " of group {} to {:.8e}.".format(group, lr)
-                )
-            else:
-                epoch_str = ("%.2f" if isinstance(epoch, float) else "%.5d") % epoch
-                print(
-                    "Epoch {}: adjusting learning rate"
-                    " of group {} to {:.8e}.".format(epoch_str, group, lr)
-                )
-
-
-class Sine(LRScheduler):
-    def __init__(
-        self, optimizer, T_0, T_mult=1, eta_min=0, last_step=-1, verbose=False
-    ):
-        if T_0 <= 0 or not isinstance(T_0, int):
-            raise ValueError(
-                f"Sine learning rate expects positive integer T_0, but got {T_0}"
-            )
-        if T_mult < 1 or not isinstance(T_mult, int):
-            raise ValueError(f"Expected integer T_mult >= 1, but got {T_mult}")
-
-        self.optimizer = optimizer
-        self.T_0 = T_0
-        self.T_mult = T_mult
-        self.eta_min = eta_min
-        self.T_i = T_0
-        self.T_cur = last_step
-        self.last_epoch = last_step
-        self.base_lrs = [group["lr"] for group in optimizer.param_groups]
-        self.verbose = verbose
-        self._last_lr = self.base_lrs
-        self.total_steps = 0  # Track total steps for a continuous wave
-
-    def get_lr(self):
-        # Calculate learning rates using a continuous sine function based on total steps
-        lrs = [
-            self.eta_min
-            + (base_lr - self.eta_min)
-            * (0.5 * (1 + math.sin(math.pi * self.total_steps / self.T_0)))
-            for base_lr in self.base_lrs
-        ]
-        return lrs
-
-    def step(self, step=None):
-        if step is None:
-            step = self.last_epoch + 1
-
-        self.total_steps = step  # Use total steps instead of resetting per interval
-        self.last_epoch = step
-        for i, (param_group, lr) in enumerate(
-            zip(self.optimizer.param_groups, self.get_lr())
-        ):
-            param_group["lr"] = math.floor(lr * 1e9) / 1e9
-            self.print_lr(self.verbose, i, lr, step)
-
-        self._last_lr = [group["lr"] for group in self.optimizer.param_groups]
-
-    def print_lr(self, is_verbose, group, lr, epoch=None):
-        if is_verbose:
-            epoch_str = ("%.2f" if isinstance(epoch, float) else "%.5d") % epoch
-            print(
-                f"Epoch {epoch_str}: adjusting learning rate of group {group} to {lr:.8e}."
-            )
-
-
-from diffusers.optimization import get_scheduler
-
-
-def get_lr_scheduler(
-    args, optimizer, accelerator, logger, use_deepspeed_scheduler=False
-):
-    if use_deepspeed_scheduler:
-        logger.info("Using DeepSpeed learning rate scheduler")
-        lr_scheduler = accelerate.utils.DummyScheduler(
-            optimizer,
-            total_num_steps=args.max_train_steps,
-            warmup_num_steps=args.lr_warmup_steps,
-        )
-    elif args.lr_scheduler == "cosine_with_restarts":
-        logger.info("Using Cosine with Restarts learning rate scheduler.")
-        logger.warning(
-            "cosine_with_restarts is currently misbehaving, and may not do what you expect. sine is recommended instead."
-        )
-        from videotuna.third_party.flux.training.custom_schedule import (
-            CosineAnnealingHardRestarts,
-        )
-
-        lr_scheduler = CosineAnnealingHardRestarts(
-            optimizer=optimizer,
-            T_0=int(args.lr_warmup_steps * accelerator.num_processes),
-            T_mult=int(1),
-            eta_min=float(args.lr_end),
-            last_step=-1,
-            verbose=os.environ.get("SIMPLETUNER_SCHEDULER_VERBOSE", "false").lower()
-            == "true",
-        )
-    elif args.lr_scheduler == "sine":
-        logger.info("Using Sine learning rate scheduler.")
-        from videotuna.third_party.flux.training.custom_schedule import Sine
-
-        lr_scheduler = Sine(
-            optimizer=optimizer,
-            T_0=int(args.lr_warmup_steps * accelerator.num_processes),
-            T_mult=int(1),
-            eta_min=float(args.lr_end),
-            last_step=-1,
-            verbose=os.environ.get("SIMPLETUNER_SCHEDULER_VERBOSE", "false").lower()
-            == "true",
-        )
-    elif args.lr_scheduler == "cosine":
-        logger.info("Using Cosine learning rate scheduler.")
-        from videotuna.third_party.flux.training.custom_schedule import Cosine
-
-        lr_scheduler = Cosine(
-            optimizer=optimizer,
-            T_0=int(args.lr_warmup_steps * accelerator.num_processes),
-            T_mult=int(1),
-            eta_min=float(args.lr_end),
-            last_step=-1,
-            verbose=os.environ.get("SIMPLETUNER_SCHEDULER_VERBOSE", "false").lower()
-            == "true",
-        )
-    elif args.lr_scheduler == "polynomial":
-        logger.info(
-            f"Using Polynomial learning rate scheduler with last epoch {StateTracker.get_global_step() - 2}."
-        )
-        lr_scheduler = get_polynomial_decay_schedule_with_warmup(
-            optimizer=optimizer,
-            num_warmup_steps=args.lr_warmup_steps * accelerator.num_processes,
-            num_training_steps=args.max_train_steps * accelerator.num_processes,
-            lr_end=args.lr_end,
-            power=args.lr_power,
-            last_epoch=StateTracker.get_global_step() - 1,
-        )
-    else:
-        logger.info(f"Using generic '{args.lr_scheduler}' learning rate scheduler.")
-        lr_scheduler = get_scheduler(
-            name=args.lr_scheduler,
-            optimizer=optimizer,
-            num_warmup_steps=args.lr_warmup_steps * accelerator.num_processes,
-            num_training_steps=args.max_train_steps * accelerator.num_processes,
-            num_cycles=args.lr_num_cycles,
-            power=args.lr_power,
-        )
-
-    return lr_scheduler
-
-
-# from huggingface/diffusers#8449 (author: @leffff)
-# Copyright 2024 The HuggingFace Team. All rights reserved.
-#
-# Licensed under the Apache License, Version 2.0 (the "License");
-# you may not use this file except in compliance with the License.
-# You may obtain a copy of the License at
-#
-#     http://www.apache.org/licenses/LICENSE-2.0
-#
-# Unless required by applicable law or agreed to in writing, software
-# distributed under the License is distributed on an "AS IS" BASIS,
-# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
-# See the License for the specific language governing permissions and
-# limitations under the License.
-
-# DISCLAIMER: This code is strongly influenced by https://github.com/leffff/euler-scheduler
-
-from dataclasses import dataclass
-from typing import Optional, Tuple, Union
-
-import torch
-from diffusers.configuration_utils import ConfigMixin, register_to_config
-from diffusers.schedulers.scheduling_utils import SchedulerMixin
-from diffusers.utils import BaseOutput
-
-
-@dataclass
-class FlowMatchingEulerSchedulerOutput(BaseOutput):
-    """
-    Output class for the scheduler's `step` function output.
-
-    Args:
-        prev_sample (`torch.Tensor` of shape `(batch_size, num_channels, height, width)` for images):
-            Computed sample `(x_{t-1})` of previous timestep (which in flow-matching notation should be noted as
-            `(x_{t+h})`). `prev_sample` should be used as next model input in the denoising loop.
-        pred_original_sample (`torch.Tensor` of shape `(batch_size, num_channels, height, width)` for images):
-            The predicted denoised sample `(x_{0})` (which in flow-matching notation should be noted as
-            `(x_{1})`) based on the model output from the current timestep.
-            `pred_original_sample` can be used to preview progress or for guidance.
-    """
-
-    prev_sample: torch.Tensor
-    pred_original_sample: Optional[torch.Tensor] = None
-
-
-def get_time_coefficients(timestep: torch.Tensor, ndim: int) -> torch.Tensor:
-    return timestep.reshape((timestep.shape[0], *([1] * (ndim - 1))))
-
-
-class FlowMatchingEulerScheduler(SchedulerMixin, ConfigMixin):
-    """
-    `FlowMatchingEulerScheduler` is a scheduler for training and inferencing Conditional Flow Matching models (CFMs).
-
-    Flow Matching (FM) is a novel, simulation-free methodology for training Continuous Normalizing Flows (CNFs) by
-    regressing vector fields of predetermined conditional probability paths, facilitating scalable training and
-    efficient sample generation through the utilization of various probability paths, including Gaussian and
-    Optimal Transport (OT) paths, thereby enhancing model performance and generalization capabilities
-
-    Args:
-        num_inference_steps (`int`, defaults to 100):
-            The number of steps on inference.
-    """
-
-    @register_to_config
-    def __init__(self, num_inference_steps: int = 100):
-        self.timesteps = None
-        self.num_inference_steps = None
-        self.h = None
-
-        if num_inference_steps is not None:
-            self.set_timesteps(num_inference_steps)
-
-    @staticmethod
-    def add_noise(
-        original_samples: torch.Tensor, noise: torch.Tensor, timestep: torch.Tensor
-    ) -> torch.Tensor:
-        """
-        Add noise to the given sample
-
-        Args:
-            original_samples (`torch.Tensor`):
-                The original sample that is to be noised
-            noise (`torch.Tensor`):
-                The noise that is used to noise the image
-            timestep (`torch.Tensor`):
-                Timestep used to create linear interpolation `x_t = t * x_1 + (1 - t) * x_0`.
-                Where x_1 is a target distribution, x_0 is a source distribution and t (timestep) ∈ [0, 1]
-        """
-
-        t = get_time_coefficients(timestep, original_samples.ndim)
-
-        noised_sample = t * original_samples + (1 - t) * noise
-
-        return noised_sample
-
-    def set_timesteps(self, num_inference_steps: int = 100) -> None:
-        """
-        Set number of inference steps (Euler intagration steps)
-
-        Args:
-            num_inference_steps (`int`, defaults to 100):
-                The number of steps on inference.
-        """
-
-        self.num_inference_steps = num_inference_steps
-        self.h = 1 / num_inference_steps
-        self.timesteps = torch.arange(0, 1, self.h)
-
-    def step(
-        self,
-        model_output: torch.Tensor,
-        timestep: torch.Tensor,
-        sample: torch.Tensor,
-        return_dict: bool = True,
-    ) -> Union[FlowMatchingEulerSchedulerOutput, Tuple]:
-        """
-        Predict the sample from the previous timestep by reversing the SDE. This function propagates the diffusion
-        process from the learned model outputs (most often the predicted noise).
-
-        Args:
-            model_output (`torch.Tensor`):
-                The direct output from learned diffusion model.
-            timestep (`float`):
-                Timestep used to perform Euler Method `x_t = h * f(x_t, t) + x_{t-1}`.
-                Where x_1 is a target distribution, x_0 is a source distribution and t (timestep) ∈ [0, 1]
-            sample (`torch.Tensor`):
-                A current instance of a sample created by the diffusion process.
-            return_dict (`bool`, *optional*, defaults to `True`):
-                Whether or not to return a [`~schedulers.scheduling_ddpm.DDPMSchedulerOutput`] or `tuple`.
-
-        Returns:
-            [`~schedulers.scheduling_ddpm.DDPMSchedulerOutput`] or `tuple`:
-                If return_dict is `True`, [`~schedulers.scheduling_ddpm.DDPMSchedulerOutput`] is returned, otherwise a
-                tuple is returned where the first element is the sample tensor.
-        """
-
-        step = FlowMatchingEulerSchedulerOutput(
-            prev_sample=sample + self.h * model_output,
-            pred_original_sample=sample
-            + (1 - get_time_coefficients(timestep, model_output.ndim)) * model_output,
-        )
-
-        if return_dict:
-            return step
-
-        return (step.prev_sample,)
-
-    @staticmethod
-    def get_velocity(
-        original_samples: torch.Tensor, noise: torch.Tensor
-    ) -> torch.Tensor:
-        """
-        Predict the sample from the previous timestep by reversing the SDE. This function propagates the diffusion
-        process from the learned model outputs (most often the predicted noise).
-
-        Args:
-            original_samples (`torch.Tensor`):
-                The original sample that is to be noised
-            noise (`torch.Tensor`):
-                The noise that is used to noise the image
-
-        Returns:
-            `torch.Tensor`
-        """
-
-        return original_samples - noise
-
-    @staticmethod
-    def scale_model_input(
-        sample: torch.Tensor, timestep: Optional[int] = None
-    ) -> torch.Tensor:
-        """
-         Ensures interchangeability with schedulers that need to scale the denoising model input depending on the
-         current timestep.
-
-        Args:
-            sample (`torch.Tensor`):
-                The input sample.
-            timestep (`int`, *optional*):
-                The current timestep in the diffusion chain.
-
-        Returns:
-            `torch.Tensor`:
-                A scaled input sample.
-        """
-
-        return sample
diff --git a/videotuna/third_party/flux/training/deepspeed.py b/videotuna/third_party/flux/training/deepspeed.py
deleted file mode 100644
index 14eb73f8..00000000
--- a/videotuna/third_party/flux/training/deepspeed.py
+++ /dev/null
@@ -1,79 +0,0 @@
-import logging
-import os
-
-import accelerate
-from accelerate.state import AcceleratorState
-
-logger = logging.getLogger(__name__)
-logger.setLevel(os.environ.get("SIMPLETUNER_LOG_LEVEL", "INFO"))
-
-
-def deepspeed_zero_init_disabled_context_manager():
-    """
-    returns either a context list that includes one that will disable zero.Init or an empty context list
-    """
-    deepspeed_plugin = (
-        AcceleratorState().deepspeed_plugin
-        if accelerate.state.is_initialized()
-        else None
-    )
-    if deepspeed_plugin is None:
-        return []
-
-    return [deepspeed_plugin.zero3_init_context_manager(enable=False)]
-
-
-def prepare_model_for_deepspeed(accelerator, args):
-    use_deepspeed_optimizer = False
-    use_deepspeed_scheduler = False
-    if (
-        hasattr(accelerator, "state")
-        and hasattr(accelerator.state, "deepspeed_plugin")
-        and getattr(accelerator.state, "deepspeed_plugin") is not None
-    ):
-        offload_param = accelerator.state.deepspeed_plugin.deepspeed_config[
-            "zero_optimization"
-        ]["offload_param"]
-        accelerator.state.deepspeed_plugin.deepspeed_config["zero_optimization"][
-            "offload_param"
-        ]["pin_memory"] = True
-        if offload_param["device"] == "nvme":
-            if offload_param["nvme_path"] == "none":
-                if args.offload_param_path is None:
-                    raise ValueError(
-                        f"DeepSpeed is using {offload_param['device']} but nvme_path is not specified."
-                    )
-                else:
-                    accelerator.state.deepspeed_plugin.deepspeed_config[
-                        "zero_optimization"
-                    ]["offload_param"]["nvme_path"] = args.offload_param_path
-            logger.info(
-                f"Using DeepSpeed NVMe offload at {accelerator.state.deepspeed_plugin.deepspeed_config['zero_optimization']['offload_param']['nvme_path']}."
-            )
-
-        use_deepspeed_optimizer = True
-        if "optimizer" not in accelerator.state.deepspeed_plugin.deepspeed_config:
-            logger.info("Using DeepSpeed optimizer (AdamW).")
-            accelerator.state.deepspeed_plugin.deepspeed_config["optimizer"] = {
-                "type": "AdamW",
-                "params": {
-                    "lr": args.learning_rate,
-                    "betas": [args.adam_beta1, args.adam_beta2],
-                    "eps": args.adam_epsilon,
-                    "weight_decay": args.adam_weight_decay,
-                },
-            }
-
-        use_deepspeed_scheduler = True
-        if "scheduler" not in accelerator.state.deepspeed_plugin.deepspeed_config:
-            logger.info("Using DeepSpeed scheduler (WarmupLR).")
-            accelerator.state.deepspeed_plugin.deepspeed_config["scheduler"] = {
-                "type": "WarmupLR",
-                "params": {
-                    "warmup_min_lr": 0,
-                    "warmup_max_lr": args.learning_rate,
-                    "warmup_num_steps": args.lr_warmup_steps,
-                },
-            }
-
-    return use_deepspeed_optimizer, use_deepspeed_scheduler
diff --git a/videotuna/third_party/flux/training/default_settings/__init__.py b/videotuna/third_party/flux/training/default_settings/__init__.py
deleted file mode 100644
index b7fbb53f..00000000
--- a/videotuna/third_party/flux/training/default_settings/__init__.py
+++ /dev/null
@@ -1,15 +0,0 @@
-CURRENT_VERSION = 2
-
-LATEST_DEFAULTS = {1: {"hash_filenames": False}, 2: {"hash_filenames": True}}
-
-
-def default(setting: str, current_version: int = None, default_value=None):
-    if current_version <= 0 or current_version is None:
-        current_version = CURRENT_VERSION
-    if current_version in LATEST_DEFAULTS:
-        return LATEST_DEFAULTS[current_version].get(setting, default_value)
-    return default_value
-
-
-def latest_config_version():
-    return CURRENT_VERSION
diff --git a/videotuna/third_party/flux/training/default_settings/safety_check.py b/videotuna/third_party/flux/training/default_settings/safety_check.py
deleted file mode 100644
index c86ce69c..00000000
--- a/videotuna/third_party/flux/training/default_settings/safety_check.py
+++ /dev/null
@@ -1,125 +0,0 @@
-import logging
-import os
-import sys
-from os import environ
-
-from diffusers.utils import is_wandb_available
-
-from videotuna.third_party.flux.training.multi_process import _get_rank as get_rank
-from videotuna.third_party.flux.training.state_tracker import StateTracker
-
-logger = logging.getLogger(__name__)
-if get_rank() == 0:
-    logger.setLevel(environ.get("SIMPLETUNER_LOG_LEVEL", "INFO"))
-else:
-    logger.setLevel(logging.ERROR)
-from videotuna.third_party.flux.training.error_handling import (
-    validate_deepspeed_compat_from_args,
-)
-
-
-def safety_check(args, accelerator):
-    if accelerator is not None and accelerator.num_processes > 1:
-        # mulit-gpu safety checks & warnings
-        if args.model_type == "lora" and args.lora_type == "standard":
-            # multi-gpu PEFT checks & warnings
-            if args.base_model_precision in ["fp8-quanto"]:
-                logger.error(
-                    f"{args.base_model_precision} is incompatible with multi-GPU training on PEFT LoRA."
-                    " Use LORA_TYPE (--lora_type) lycoris for quantised multi-GPU training of LoKr models in FP8."
-                )
-                args.base_model_precision = "int8-quanto"
-
-    if (
-        args.base_model_precision in ["fp8-quanto", "int4-quanto"]
-        or (args.base_model_precision != "no_change" and args.quantize_activations)
-    ) and (
-        accelerator is not None
-        and accelerator.state.dynamo_plugin.backend.lower() == "inductor"
-    ):
-        logger.warning(
-            f"{args.base_model_precision} is not supported with Dynamo backend. Disabling Dynamo."
-        )
-        from accelerate.utils import DynamoBackend
-
-        accelerator.state.dynamo_plugin.backend = DynamoBackend.NO
-    if args.report_to == "wandb":
-        if not is_wandb_available():
-            raise ImportError(
-                "Make sure to install wandb if you want to use it for logging during training."
-            )
-        import wandb
-    if accelerator is not None and (
-        hasattr(accelerator.state, "deepspeed_plugin")
-        and accelerator.state.deepspeed_plugin is not None
-    ):
-        validate_deepspeed_compat_from_args(accelerator, args)
-    if args.controlnet:
-        if args.model_family in ["pixart_sigma", "sd3", "kolors", "flux", "smoldit"]:
-            raise ValueError(
-                f"ControlNet is not yet supported with {args.model_type} models. Please disable --controlnet, or switch model types."
-            )
-    if "lora" in args.model_type and "standard" == args.lora_type.lower():
-        if args.model_family == "pixart_sigma":
-            raise Exception(f"{args.model_type} does not support LoRA model training.")
-
-    if "lora" in args.model_type and args.train_text_encoder:
-        if args.lora_type.lower() == "lycoris":
-            logger.error(
-                "LyCORIS training is not meant to be combined with --train_text_encoder. It is powerful enough on its own!"
-            )
-            sys.exit(1)
-    if args.user_prompt_library and not os.path.exists(args.user_prompt_library):
-        raise FileNotFoundError(
-            f"User prompt library not found at {args.user_prompt_library}. Please check the path and try again."
-        )
-
-    # optimizer memory limit check for SOAP w/ 24G
-    if (
-        accelerator is not None
-        and accelerator.device.type == "cuda"
-        and accelerator.is_main_process
-    ):
-        import subprocess
-
-        output = subprocess.check_output(
-            [
-                "nvidia-smi",
-                "--query-gpu=memory.total",
-                "--format=csv,noheader,nounits",
-            ]
-        ).split(b"\n")[get_rank()]
-        total_memory = int(output.decode().strip()) / 1024
-        from math import ceil
-
-        total_memory_gb = ceil(total_memory)
-        if total_memory_gb < 32 and total_memory_gb > 16 and args.optimizer == "soap":
-            logger.warning(
-                f"Your GPU has {total_memory_gb}GB of memory. The SOAP optimiser may require more than this. Setting --accelerator_cache_clear_interval=10 may help to eliminate OOM."
-            )
-        elif total_memory_gb < 24 and args.optimizer == "soap":
-            logger.error(
-                f"Your GPU has {total_memory_gb}GB of memory. The SOAP optimiser requires a GPU with at least 24G of memory."
-            )
-            sys.exit(1)
-
-    if (
-        args.model_type != "lora"
-        and not args.controlnet
-        and args.base_model_precision != "no_change"
-        and not args.i_know_what_i_am_doing
-    ):
-        logger.error(
-            f"{args.model_type} tuning is not compatible with quantisation. Please set --base_model_precision to 'no_change' or train LyCORIS/LoRA."
-        )
-        sys.exit(1)
-
-    if (
-        args.flux_schedule_shift is not None
-        and args.flux_schedule_shift > 0
-        and args.flux_schedule_auto_shift
-    ):
-        logger.error(
-            f"--flux_schedule_auto_shift cannot be combined with --flux_schedule_shift. Please set --flux_schedule_shift to 0 if you want to train with --flux_schedule_auto_shift."
-        )
-        sys.exit(1)
diff --git a/videotuna/third_party/flux/training/diffusion_model.py b/videotuna/third_party/flux/training/diffusion_model.py
deleted file mode 100644
index d928aa28..00000000
--- a/videotuna/third_party/flux/training/diffusion_model.py
+++ /dev/null
@@ -1,153 +0,0 @@
-import os
-
-from accelerate.logging import get_logger
-
-logger = get_logger(__name__, log_level=os.environ.get("SIMPLETUNER_LOG_LEVEL", "INFO"))
-
-target_level = os.environ.get("SIMPLETUNER_LOG_LEVEL", "INFO")
-logger.setLevel(target_level)
-
-
-def determine_subfolder(folder_value: str = None):
-    if folder_value is None or str(folder_value).lower() == "none":
-        return None
-    return str(folder_value)
-
-
-def load_diffusion_model(args, weight_dtype):
-    pretrained_load_args = {
-        "revision": args.revision,
-        "variant": args.variant,
-        "torch_dtype": weight_dtype,
-        "use_safetensors": True,
-    }
-    unet = None
-    transformer = None
-
-    if "nf4-bnb" == args.base_model_precision:
-        import torch
-        from diffusers import BitsAndBytesConfig
-
-        pretrained_load_args["quantization_config"] = BitsAndBytesConfig(
-            load_in_4bit=True,
-            bnb_4bit_use_double_quant=True,
-            bnb_4bit_quant_type="nf4",
-            bnb_4bit_compute_dtype=weight_dtype,
-        )
-
-    if args.model_family == "sd3":
-        # Stable Diffusion 3 uses a Diffusion transformer.
-        logger.info("Loading Stable Diffusion 3 diffusion transformer..")
-        try:
-            from diffusers import SD3Transformer2DModel
-        except Exception as e:
-            logger.error(
-                f"Can not load SD3 model class. This release requires the latest version of Diffusers: {e}"
-            )
-        transformer = SD3Transformer2DModel.from_pretrained(
-            args.pretrained_transformer_model_name_or_path
-            or args.pretrained_model_name_or_path,
-            subfolder=determine_subfolder(args.pretrained_transformer_subfolder),
-            **pretrained_load_args,
-        )
-    elif (
-        args.model_family.lower() == "flux" and not args.flux_attention_masked_training
-    ):
-        import torch
-        from diffusers.models import FluxTransformer2DModel
-
-        if torch.cuda.is_available():
-            rank = (
-                torch.distributed.get_rank()
-                if torch.distributed.is_initialized()
-                else 0
-            )
-            primary_device = torch.cuda.get_device_properties(0)
-            if primary_device.major >= 9:
-                try:
-                    import diffusers
-                    from flash_attn_interface import flash_attn_func
-
-                    from videotuna.third_party.flux.models.flux.attention import (
-                        FluxAttnProcessor3_0,
-                        FluxSingleAttnProcessor3_0,
-                    )
-
-                    diffusers.models.attention_processor.FluxSingleAttnProcessor2_0 = (
-                        FluxSingleAttnProcessor3_0
-                    )
-                    diffusers.models.attention_processor.FluxAttnProcessor2_0 = (
-                        FluxAttnProcessor3_0
-                    )
-                    if rank == 0:
-                        print("Using FlashAttention3_0 for H100 GPU (Single block)")
-                except:
-                    if rank == 0:
-                        logger.warning(
-                            "No flash_attn is available, using slower FlashAttention_2_0. Install flash_attn to make use of FA3 for Hopper or newer arch."
-                        )
-
-        transformer = FluxTransformer2DModel.from_pretrained(
-            args.pretrained_transformer_model_name_or_path
-            or args.pretrained_model_name_or_path,
-            subfolder=determine_subfolder(args.pretrained_transformer_subfolder),
-            **pretrained_load_args,
-        )
-    elif args.model_family.lower() == "flux" and args.flux_attention_masked_training:
-        from videotuna.third_party.flux.models.flux.transformer import (
-            FluxTransformer2DModelWithMasking,
-        )
-
-        transformer = FluxTransformer2DModelWithMasking.from_pretrained(
-            args.pretrained_transformer_model_name_or_path
-            or args.pretrained_model_name_or_path,
-            subfolder=determine_subfolder(args.pretrained_transformer_subfolder),
-            **pretrained_load_args,
-        )
-    elif args.model_family == "pixart_sigma":
-        from diffusers.models import PixArtTransformer2DModel
-
-        transformer = PixArtTransformer2DModel.from_pretrained(
-            args.pretrained_transformer_model_name_or_path
-            or args.pretrained_model_name_or_path,
-            subfolder=determine_subfolder(args.pretrained_transformer_subfolder),
-            **pretrained_load_args,
-        )
-    elif args.model_family == "smoldit":
-        logger.info("Loading SmolDiT model..")
-        if args.validation_noise_scheduler is None:
-            args.validation_noise_scheduler = "ddpm"
-        transformer_variant = None
-        from videotuna.third_party.flux.models.smoldit import (
-            SmolDiT2DModel,
-            SmolDiTConfigurations,
-        )
-
-        if args.smoldit_config not in SmolDiTConfigurations:
-            raise ValueError(
-                f"Invalid SmolDiT size configuration: {args.smoldit_config}"
-            )
-
-        transformer = SmolDiT2DModel(**SmolDiTConfigurations[args.smoldit_config])
-        if "lora" in args.model_type:
-            raise ValueError("SmolDiT does not yet support LoRA training.")
-    else:
-        from diffusers import UNet2DConditionModel
-
-        logger.info("Loading U-net..")
-        unet_variant = args.variant
-        if (
-            args.model_family == "kolors"
-            and args.pretrained_model_name_or_path.lower()
-            == "kwai-kolors/kolors-diffusers"
-        ):
-            unet_variant = "fp16"
-        pretrained_load_args["variant"] = unet_variant
-        unet = UNet2DConditionModel.from_pretrained(
-            args.pretrained_unet_model_name_or_path
-            or args.pretrained_model_name_or_path,
-            subfolder=determine_subfolder(args.pretrained_unet_subfolder),
-            **pretrained_load_args,
-        )
-
-    return unet, transformer
diff --git a/videotuna/third_party/flux/training/ema.py b/videotuna/third_party/flux/training/ema.py
deleted file mode 100644
index f4ffdca0..00000000
--- a/videotuna/third_party/flux/training/ema.py
+++ /dev/null
@@ -1,431 +0,0 @@
-import contextlib
-import copy
-import logging
-import os
-from typing import Any, Dict, Iterable, Optional, Union
-
-import torch
-import transformers
-from diffusers.utils import is_transformers_available
-from diffusers.utils.deprecation_utils import deprecate
-
-logger = logging.getLogger("EMAModel")
-logger.setLevel(os.environ.get("SIMPLETUNER_LOG_LEVEL", "INFO"))
-
-
-def should_update_ema(args, step):
-    if args.ema_update_interval is None:
-        # If the EMA update interval is not set, always update the EMA.
-        return True
-    else:
-        should_update = step % args.ema_update_interval == 0
-        if should_update:
-            logger.debug("Updating EMA weights...")
-        return should_update
-
-
-class EMAModel:
-    """
-    Exponential Moving Average of models weights
-    """
-
-    def __init__(
-        self,
-        args,
-        accelerator,
-        parameters: Iterable[torch.nn.Parameter],
-        decay: float = 0.9999,
-        min_decay: float = 0.0,
-        update_after_step: int = 0,
-        use_ema_warmup: bool = False,
-        inv_gamma: Union[float, int] = 1.0,
-        power: Union[float, int] = 2 / 3,
-        foreach: bool = True,
-        model_cls: Optional[Any] = None,
-        model_config: Dict[str, Any] = None,
-        **kwargs,
-    ):
-        """
-        Args:
-            parameters (Iterable[torch.nn.Parameter]): The parameters to track.
-            decay (float): The decay factor for the exponential moving average.
-            min_decay (float): The minimum decay factor for the exponential moving average.
-            update_after_step (int): The number of steps to wait before starting to update the EMA weights.
-            use_ema_warmup (bool): Whether to use EMA warmup.
-            inv_gamma (float):
-                Inverse multiplicative factor of EMA warmup. Default: 1. Only used if `use_ema_warmup` is True.
-            power (float): Exponential factor of EMA warmup. Default: 2/3. Only used if `use_ema_warmup` is True.
-            foreach (bool): Use torch._foreach functions for updating shadow parameters. Should be faster.
-            device (Optional[Union[str, torch.device]]): The device to store the EMA weights on. If None, the EMA
-                        weights will be stored on CPU.
-
-        @crowsonkb's notes on EMA Warmup:
-            If gamma=1 and power=1, implements a simple average. gamma=1, power=2/3 are good values for models you plan
-            to train for a million or more steps (reaches decay factor 0.999 at 31.6K steps, 0.9999 at 1M steps),
-            gamma=1, power=3/4 for models you plan to train for less (reaches decay factor 0.999 at 10K steps, 0.9999
-            at 215.4k steps).
-        """
-
-        if isinstance(parameters, torch.nn.Module):
-            deprecation_message = (
-                "Passing a `torch.nn.Module` to `ExponentialMovingAverage` is deprecated. "
-                "Please pass the parameters of the module instead."
-            )
-            deprecate(
-                "passing a `torch.nn.Module` to `ExponentialMovingAverage`",
-                "1.0.0",
-                deprecation_message,
-                standard_warn=False,
-            )
-            parameters = parameters.parameters()
-
-            # set use_ema_warmup to True if a torch.nn.Module is passed for backwards compatibility
-            use_ema_warmup = True
-
-        if kwargs.get("max_value", None) is not None:
-            deprecation_message = (
-                "The `max_value` argument is deprecated. Please use `decay` instead."
-            )
-            deprecate("max_value", "1.0.0", deprecation_message, standard_warn=False)
-            decay = kwargs["max_value"]
-
-        if kwargs.get("min_value", None) is not None:
-            deprecation_message = "The `min_value` argument is deprecated. Please use `min_decay` instead."
-            deprecate("min_value", "1.0.0", deprecation_message, standard_warn=False)
-            min_decay = kwargs["min_value"]
-
-        parameters = list(parameters)
-        self.shadow_params = [p.clone().detach() for p in parameters]
-
-        if kwargs.get("device", None) is not None:
-            deprecation_message = (
-                "The `device` argument is deprecated. Please use `to` instead."
-            )
-            deprecate("device", "1.0.0", deprecation_message, standard_warn=False)
-            self.to(device=kwargs["device"])
-
-        self.temp_stored_params = None
-
-        self.decay = decay
-        self.min_decay = min_decay
-        self.update_after_step = update_after_step
-        self.use_ema_warmup = use_ema_warmup
-        self.inv_gamma = inv_gamma
-        self.power = power
-        self.optimization_step = 0
-        self.cur_decay_value = None  # set in `step()`
-        self.foreach = foreach
-
-        self.model_cls = model_cls
-        self.model_config = model_config
-        self.args = args
-        self.accelerator = accelerator
-
-    @classmethod
-    def from_pretrained(cls, path, model_cls) -> "EMAModel":
-        _, ema_kwargs = model_cls.load_config(path, return_unused_kwargs=True)
-        model = model_cls.from_pretrained(path)
-
-        ema_model = cls(
-            model.parameters(), model_cls=model_cls, model_config=model.config
-        )
-
-        ema_model.load_state_dict(ema_kwargs)
-        return ema_model
-
-    def save_pretrained(self, path, max_shard_size: str = "10GB"):
-        if self.model_cls is None:
-            raise ValueError(
-                "`save_pretrained` can only be used if `model_cls` was defined at __init__."
-            )
-
-        if self.model_config is None:
-            raise ValueError(
-                "`save_pretrained` can only be used if `model_config` was defined at __init__."
-            )
-
-        model = self.model_cls.from_config(self.model_config)
-        state_dict = self.state_dict()
-        state_dict.pop("shadow_params", None)
-
-        model.register_to_config(**state_dict)
-        self.copy_to(model.parameters())
-        model.save_pretrained(path, max_shard_size=max_shard_size)
-
-    def get_decay(self, optimization_step: int = None) -> float:
-        """
-        Compute the decay factor for the exponential moving average.
-        """
-        if optimization_step is None:
-            optimization_step = self.optimization_step
-
-        step = max(0, optimization_step - self.update_after_step - 1)
-
-        if step <= 0:
-            return 0.0
-
-        if self.use_ema_warmup:
-            cur_decay_value = 1 - (1 + step / self.inv_gamma) ** -self.power
-        else:
-            cur_decay_value = (1 + step) / (10 + step)
-
-        cur_decay_value = min(cur_decay_value, self.decay)
-        # make sure decay is not smaller than min_decay
-        cur_decay_value = max(cur_decay_value, self.min_decay)
-        return cur_decay_value
-
-    @torch.no_grad()
-    def step(self, parameters: Iterable[torch.nn.Parameter], global_step: int = None):
-        if not should_update_ema(self.args, global_step):
-
-            return
-
-        if self.args.ema_device == "cpu" and not self.args.ema_cpu_only:
-            # Move EMA to accelerator for faster update.
-            self.to(device=self.accelerator.device, non_blocking=True)
-        if isinstance(parameters, torch.nn.Module):
-            deprecation_message = (
-                "Passing a `torch.nn.Module` to `ExponentialMovingAverage.step` is deprecated. "
-                "Please pass the parameters of the module instead."
-            )
-            deprecate(
-                "passing a `torch.nn.Module` to `ExponentialMovingAverage.step`",
-                "1.0.0",
-                deprecation_message,
-                standard_warn=False,
-            )
-            parameters = parameters.parameters()
-
-        parameters = list(parameters)
-
-        if global_step is not None:
-            # When we're updating the EMA periodically, we can't trust the counter.
-            self.optimization_step = global_step
-        else:
-            self.optimization_step += 1
-
-        # Compute the decay factor for the exponential moving average.
-        decay = self.get_decay(self.optimization_step)
-        self.cur_decay_value = decay
-        one_minus_decay = 1 - decay
-
-        context_manager = contextlib.nullcontext
-        if (
-            is_transformers_available()
-            and transformers.deepspeed.is_deepspeed_zero3_enabled()
-        ):
-            import deepspeed
-
-        if self.foreach:
-            if (
-                is_transformers_available()
-                and transformers.deepspeed.is_deepspeed_zero3_enabled()
-            ):
-                context_manager = deepspeed.zero.GatheredParameters(
-                    parameters, modifier_rank=None
-                )
-
-            with context_manager():
-                params_grad = [param for param in parameters if param.requires_grad]
-                s_params_grad = [
-                    s_param
-                    for s_param, param in zip(self.shadow_params, parameters)
-                    if param.requires_grad
-                ]
-
-                if len(params_grad) < len(parameters):
-                    torch._foreach_copy_(
-                        [
-                            s_param
-                            for s_param, param in zip(self.shadow_params, parameters)
-                            if not param.requires_grad
-                        ],
-                        [param for param in parameters if not param.requires_grad],
-                        non_blocking=True,
-                    )
-
-                torch._foreach_sub_(
-                    s_params_grad,
-                    torch._foreach_sub(s_params_grad, params_grad),
-                    alpha=one_minus_decay,
-                )
-
-        else:
-            for s_param, param in zip(self.shadow_params, parameters):
-                if (
-                    is_transformers_available()
-                    and transformers.deepspeed.is_deepspeed_zero3_enabled()
-                ):
-                    context_manager = deepspeed.zero.GatheredParameters(
-                        param, modifier_rank=None
-                    )
-
-                with context_manager():
-                    if param.requires_grad:
-                        s_param.sub_(
-                            one_minus_decay * (s_param - param.to(s_param.device))
-                        )
-                    else:
-                        s_param.copy_(param)
-        if self.args.ema_device == "cpu" and not self.args.ema_cpu_only:
-            # Move back to CPU for safe-keeping.
-            self.to(device=self.args.ema_device, non_blocking=True)
-
-    def copy_to(self, parameters: Iterable[torch.nn.Parameter]) -> None:
-        """
-        Copy current averaged parameters into given collection of parameters.
-
-        Args:
-            parameters: Iterable of `torch.nn.Parameter`; the parameters to be
-                updated with the stored moving averages. If `None`, the parameters with which this
-                `ExponentialMovingAverage` was initialized will be used.
-        """
-        parameters = list(parameters)
-        if self.foreach:
-            torch._foreach_copy_(
-                [param.data for param in parameters],
-                [
-                    s_param.to(param.device).data
-                    for s_param, param in zip(self.shadow_params, parameters)
-                ],
-            )
-        else:
-            for s_param, param in zip(self.shadow_params, parameters):
-                param.data.copy_(s_param.to(param.device).data)
-
-    def pin_memory(self) -> None:
-        r"""
-        Move internal buffers of the ExponentialMovingAverage to pinned memory. Useful for non-blocking transfers for
-        offloading EMA params to the host.
-        """
-        if torch.backends.mps.is_available():
-            logger.warning("Apple silicon does not support pinned memory. Skipping.")
-            return
-
-        if self.args.ema_cpu_only:
-            return
-
-        # This probably won't work, but we'll do it anyway.
-        self.shadow_params = [p.pin_memory() for p in self.shadow_params]
-
-    def to(self, device=None, dtype=None, non_blocking=False) -> None:
-        r"""Move internal buffers of the ExponentialMovingAverage to `device`.
-
-        Args:
-            device: like `device` argument to `torch.Tensor.to`
-        """
-        # .to() on the tensors handles None correctly
-        self.shadow_params = [
-            (
-                p.to(device=device, dtype=dtype, non_blocking=non_blocking)
-                if p.is_floating_point()
-                else p.to(device=device, non_blocking=non_blocking)
-            )
-            for p in self.shadow_params
-        ]
-
-    def state_dict(self) -> dict:
-        r"""
-        Returns the state of the ExponentialMovingAverage as a dict. This method is used by accelerate during
-        checkpointing to save the ema state dict.
-        """
-        # Following PyTorch conventions, references to tensors are returned:
-        # "returns a reference to the state and not its copy!" -
-        # https://pytorch.org/tutorials/beginner/saving_loading_models.html#what-is-a-state-dict
-        return {
-            "decay": self.decay,
-            "min_decay": self.min_decay,
-            "optimization_step": self.optimization_step,
-            "update_after_step": self.update_after_step,
-            "use_ema_warmup": self.use_ema_warmup,
-            "inv_gamma": self.inv_gamma,
-            "power": self.power,
-            "shadow_params": self.shadow_params,
-        }
-
-    def store(self, parameters: Iterable[torch.nn.Parameter]) -> None:
-        r"""
-        Args:
-        Save the current parameters for restoring later.
-            parameters: Iterable of `torch.nn.Parameter`; the parameters to be
-                temporarily stored.
-        """
-        self.temp_stored_params = [param.detach().cpu().clone() for param in parameters]
-
-    def restore(self, parameters: Iterable[torch.nn.Parameter]) -> None:
-        r"""
-        Args:
-        Restore the parameters stored with the `store` method. Useful to validate the model with EMA parameters without:
-        affecting the original optimization process. Store the parameters before the `copy_to()` method. After
-        validation (or model saving), use this to restore the former parameters.
-            parameters: Iterable of `torch.nn.Parameter`; the parameters to be
-                updated with the stored parameters. If `None`, the parameters with which this
-                `ExponentialMovingAverage` was initialized will be used.
-        """
-        if self.temp_stored_params is None:
-            raise RuntimeError(
-                "This ExponentialMovingAverage has no `store()`ed weights "
-                "to `restore()`"
-            )
-        if self.foreach:
-            torch._foreach_copy_(
-                [param.data for param in parameters],
-                [c_param.data for c_param in self.temp_stored_params],
-            )
-        else:
-            for c_param, param in zip(self.temp_stored_params, parameters):
-                param.data.copy_(c_param.data)
-
-        # Better memory-wise.
-        self.temp_stored_params = None
-
-    def load_state_dict(self, state_dict: dict) -> None:
-        r"""
-        Args:
-        Loads the ExponentialMovingAverage state. This method is used by accelerate during checkpointing to save the
-        ema state dict.
-            state_dict (dict): EMA state. Should be an object returned
-                from a call to :meth:`state_dict`.
-        """
-        # deepcopy, to be consistent with module API
-        state_dict = copy.deepcopy(state_dict)
-
-        self.decay = state_dict.get("decay", self.decay)
-        if self.decay < 0.0 or self.decay > 1.0:
-            raise ValueError("Decay must be between 0 and 1")
-
-        self.min_decay = state_dict.get("min_decay", self.min_decay)
-        if not isinstance(self.min_decay, float):
-            raise ValueError("Invalid min_decay")
-
-        self.optimization_step = state_dict.get(
-            "optimization_step", self.optimization_step
-        )
-        if not isinstance(self.optimization_step, int):
-            raise ValueError("Invalid optimization_step")
-
-        self.update_after_step = state_dict.get(
-            "update_after_step", self.update_after_step
-        )
-        if not isinstance(self.update_after_step, int):
-            raise ValueError("Invalid update_after_step")
-
-        self.use_ema_warmup = state_dict.get("use_ema_warmup", self.use_ema_warmup)
-        if not isinstance(self.use_ema_warmup, bool):
-            raise ValueError("Invalid use_ema_warmup")
-
-        self.inv_gamma = state_dict.get("inv_gamma", self.inv_gamma)
-        if not isinstance(self.inv_gamma, (float, int)):
-            raise ValueError("Invalid inv_gamma")
-
-        self.power = state_dict.get("power", self.power)
-        if not isinstance(self.power, (float, int)):
-            raise ValueError("Invalid power")
-
-        shadow_params = state_dict.get("shadow_params", None)
-        if shadow_params is not None:
-            self.shadow_params = shadow_params
-            if not isinstance(self.shadow_params, list):
-                raise ValueError("shadow_params must be a list")
-            if not all(isinstance(p, torch.Tensor) for p in self.shadow_params):
-                raise ValueError("shadow_params must all be Tensors")
diff --git a/videotuna/third_party/flux/training/error_handling.py b/videotuna/third_party/flux/training/error_handling.py
deleted file mode 100644
index 09c010a2..00000000
--- a/videotuna/third_party/flux/training/error_handling.py
+++ /dev/null
@@ -1,29 +0,0 @@
-import os
-import sys
-
-from accelerate.logging import get_logger
-
-logger = get_logger(__name__, log_level=os.environ.get("SIMPLETUNER_LOG_LEVEL", "INFO"))
-
-target_level = os.environ.get("SIMPLETUNER_LOG_LEVEL", "INFO")
-logger.setLevel(target_level)
-
-
-def validate_deepspeed_compat_from_args(accelerator, args):
-    if "lora" in args.model_type:
-        logger.error(
-            "LoRA can not be trained with DeepSpeed. Please disable DeepSpeed via 'accelerate config' before reattempting."
-        )
-        sys.exit(1)
-    if (
-        "gradient_accumulation_steps"
-        in accelerator.state.deepspeed_plugin.deepspeed_config
-    ):
-        args.gradient_accumulation_steps = (
-            accelerator.state.deepspeed_plugin.deepspeed_config[
-                "gradient_accumulation_steps"
-            ]
-        )
-        logger.info(
-            f"Updated gradient_accumulation_steps to the value provided by DeepSpeed: {args.gradient_accumulation_steps}"
-        )
diff --git a/videotuna/third_party/flux/training/exceptions.py b/videotuna/third_party/flux/training/exceptions.py
deleted file mode 100644
index 74308ea3..00000000
--- a/videotuna/third_party/flux/training/exceptions.py
+++ /dev/null
@@ -1,2 +0,0 @@
-class MultiDatasetExhausted(Exception):
-    pass
diff --git a/videotuna/third_party/flux/training/min_snr_gamma.py b/videotuna/third_party/flux/training/min_snr_gamma.py
deleted file mode 100644
index 68fd4f92..00000000
--- a/videotuna/third_party/flux/training/min_snr_gamma.py
+++ /dev/null
@@ -1,47 +0,0 @@
-# From Diffusers repository: examples/research_projects/onnxruntime/text_to_image/train_text_to_image.py
-
-
-def compute_snr(timesteps, noise_scheduler, use_soft_min: bool = False, sigma_data=1.0):
-    """
-    Computes SNR using two different methods based on the `use_soft_min` flag.
-
-    Args:
-        timesteps (torch.Tensor): The timesteps at which SNR is computed.
-        noise_scheduler (NoiseScheduler): An object that contains the alpha_cumprod values.
-        use_soft_min (bool): If True, use the _weighting_soft_min_snr method to compute SNR.
-        sigma_data (torch.Tensor or None): The standard deviation of the data used in the soft min weighting method.
-
-    Returns:
-        torch.Tensor: The computed SNR values.
-    """
-    alphas_cumprod = noise_scheduler.alphas_cumprod
-    sqrt_alphas_cumprod = alphas_cumprod**0.5
-    sqrt_one_minus_alphas_cumprod = (1.0 - alphas_cumprod) ** 0.5
-
-    # Expand the tensors.
-    sqrt_alphas_cumprod = sqrt_alphas_cumprod.to(device=timesteps.device)[
-        timesteps
-    ].float()
-    while len(sqrt_alphas_cumprod.shape) < len(timesteps.shape):
-        sqrt_alphas_cumprod = sqrt_alphas_cumprod[..., None]
-    alpha = sqrt_alphas_cumprod.expand(timesteps.shape)
-
-    sqrt_one_minus_alphas_cumprod = sqrt_one_minus_alphas_cumprod.to(
-        device=timesteps.device
-    )[timesteps].float()
-    while len(sqrt_one_minus_alphas_cumprod.shape) < len(timesteps.shape):
-        sqrt_one_minus_alphas_cumprod = sqrt_one_minus_alphas_cumprod[..., None]
-    sigma = sqrt_one_minus_alphas_cumprod.expand(timesteps.shape)
-
-    # Choose the method to compute SNR
-    if use_soft_min:
-        if sigma_data is None:
-            raise ValueError(
-                "sigma_data must be provided when using soft min SNR calculation."
-            )
-        snr = (sigma * sigma_data) ** 2 / (sigma**2 + sigma_data**2) ** 2
-    else:
-        # Default SNR computation
-        snr = (alpha / sigma) ** 2
-
-    return snr
diff --git a/videotuna/third_party/flux/training/model.py b/videotuna/third_party/flux/training/model.py
deleted file mode 100644
index 1443f885..00000000
--- a/videotuna/third_party/flux/training/model.py
+++ /dev/null
@@ -1,2869 +0,0 @@
-import copy
-import glob
-import hashlib
-import json
-import logging
-import math
-import os
-import random
-import shutil
-import sys
-
-import huggingface_hub
-import pytorch_lightning as pl
-import torch.distributed as dist
-import wandb
-from pytorch_lightning.callbacks import Callback
-from pytorch_lightning.utilities import rank_zero_only
-from safetensors.torch import save_file
-
-from videotuna.third_party.flux.configuration.configure import model_labels
-from videotuna.third_party.flux.publishing.huggingface import HubManager
-from videotuna.third_party.flux.training.default_settings.safety_check import (
-    safety_check,
-)
-from videotuna.utils.callbacks import LoraModelCheckpoint
-
-# Quiet down, you.
-os.environ["ACCELERATE_LOG_LEVEL"] = "WARNING"
-from accelerate.logging import get_logger
-from diffusers.models.embeddings import get_2d_rotary_pos_embed
-
-from videotuna.utils.common_utils import get_resize_crop_region_for_grid
-from videotuna.third_party.flux import log_format  # noqa
-from videotuna.third_party.flux.caching.memory import reclaim_memory
-from videotuna.third_party.flux.configuration.loader import load_config
-from videotuna.third_party.flux.data_backend.factory import (
-    BatchFetcher,
-    configure_multi_databackend,
-    random_dataloader_iterator,
-)
-from videotuna.third_party.flux.training import steps_remaining_in_epoch
-from videotuna.third_party.flux.training.adapter import (
-    determine_adapter_target_modules,
-    load_lora_weights,
-)
-from videotuna.third_party.flux.training.custom_schedule import (
-    generate_timestep_weights,
-    get_lr_scheduler,
-    segmented_timestep_selection,
-)
-from videotuna.third_party.flux.training.deepspeed import (
-    deepspeed_zero_init_disabled_context_manager,
-    prepare_model_for_deepspeed,
-)
-from videotuna.third_party.flux.training.diffusion_model import load_diffusion_model
-from videotuna.third_party.flux.training.min_snr_gamma import compute_snr
-from videotuna.third_party.flux.training.multi_process import _get_rank as get_rank
-from videotuna.third_party.flux.training.optimizer_param import (
-    cpu_offload_optimizer,
-    determine_optimizer_class_with_config,
-    determine_params_to_optimize,
-    is_lr_scheduler_disabled,
-)
-from videotuna.third_party.flux.training.peft_init import (
-    init_lokr_network_with_perturbed_normal,
-)
-from videotuna.third_party.flux.training.schedulers import load_scheduler_from_args
-from videotuna.third_party.flux.training.state_tracker import StateTracker
-from videotuna.third_party.flux.training.text_encoding import (
-    determine_te_path_subfolder,
-    get_tokenizers,
-    import_model_class_from_model_name_or_path,
-    load_tes,
-)
-from videotuna.third_party.flux.training.validation import (
-    Validation,
-    prepare_validation_prompt_list,
-)
-from videotuna.third_party.flux.training.wrappers import unwrap_model
-
-logger = get_logger(
-    "SimpleTuner", log_level=os.environ.get("SIMPLETUNER_LOG_LEVEL", "INFO")
-)
-
-filelock_logger = get_logger("filelock")
-connection_logger = get_logger("urllib3.connectionpool")
-training_logger = get_logger("training-loop")
-
-# More important logs.
-target_level = os.environ.get("SIMPLETUNER_LOG_LEVEL", "INFO")
-logger.setLevel(target_level)
-training_logger_level = os.environ.get("SIMPLETUNER_TRAINING_LOOP_LOG_LEVEL", "INFO")
-training_logger.setLevel(training_logger_level)
-
-# Less important logs.
-filelock_logger.setLevel("WARNING")
-connection_logger.setLevel("WARNING")
-import accelerate
-import diffusers
-import torch
-import torch.nn.functional as F
-import torch.utils.checkpoint
-import transformers
-from accelerate import Accelerator
-from accelerate.utils import set_seed
-from torch.distributions import Beta
-
-from videotuna.third_party.flux.configuration.configure import model_classes
-
-try:
-    from lycoris import LycorisNetwork
-except:
-    print("[ERROR] Lycoris not available. Please install ")
-from diffusers import (
-    AutoencoderKL,
-    ControlNetModel,
-    DDIMScheduler,
-    DDPMScheduler,
-    EulerAncestralDiscreteScheduler,
-    EulerDiscreteScheduler,
-    FluxTransformer2DModel,
-    PixArtTransformer2DModel,
-    StableDiffusion3Pipeline,
-    UNet2DConditionModel,
-    UniPCMultistepScheduler,
-)
-from diffusers.utils import (
-    check_min_version,
-    convert_state_dict_to_diffusers,
-    is_wandb_available,
-)
-from diffusers.utils.import_utils import is_xformers_available
-from peft import LoraConfig
-from peft.utils import get_peft_model_state_dict
-from tqdm.auto import tqdm
-from transformers import CLIPTokenizer, PretrainedConfig
-from transformers.utils import ContextManagers
-
-from videotuna.third_party.flux.models.flux import (
-    apply_flux_schedule_shift,
-    get_mobius_guidance,
-    pack_latents,
-    prepare_latent_image_ids,
-    unpack_latents,
-)
-from videotuna.third_party.flux.models.sdxl.pipeline import StableDiffusionXLPipeline
-from videotuna.third_party.flux.training.ema import EMAModel
-
-is_optimi_available = False
-try:
-    from optimi import prepare_for_gradient_release
-
-    is_optimi_available = True
-except:
-    pass
-
-# Will error if the minimal version of diffusers is not installed. Remove at your own risks.
-check_min_version("0.27.0.dev0")
-
-SCHEDULER_NAME_MAP = {
-    "euler": EulerDiscreteScheduler,
-    "euler-a": EulerAncestralDiscreteScheduler,
-    "unipc": UniPCMultistepScheduler,
-    "ddim": DDIMScheduler,
-    "ddpm": DDPMScheduler,
-}
-logging.basicConfig(
-    format="%(asctime)s - %(levelname)s - %(name)s - %(message)s",
-    datefmt="%m/%d/%Y %H:%M:%S",
-    level=logging.INFO,
-)
-
-transformers.utils.logging.set_verbosity_warning()
-diffusers.utils.logging.set_verbosity_warning()
-
-lora_checkpoint_callback = LoraModelCheckpoint()
-
-
-class Model(pl.LightningModule):
-    def __init__(
-        self, config: dict = None, disable_accelerator: bool = False, job_id: str = None
-    ):
-        super().__init__()
-        self.accelerator = None
-        self.job_id = job_id
-        StateTracker.set_job_id(job_id)
-        self.parse_arguments(args=config, disable_accelerator=disable_accelerator)
-        self._misc_init()
-        self.lycoris_wrapped_network = None
-        self.lycoris_config = None
-        self.lr_scheduler = None
-        self.webhook_handler = None
-        self.should_abort = False
-        self.unet = None
-        self.transformer = None
-        self.vae = None
-        self.text_encoder_1 = None
-        self.text_encoder_2 = None
-        self.text_encoder_3 = None
-        self.controlnet = None
-        self.validation = None
-
-    def set_trainer(self, trainer):
-        self.trainer = trainer
-
-    def _config_to_obj(self, config):
-        if not config:
-            return None
-        return type("Config", (object,), config)
-
-    def parse_arguments(self, args=None, disable_accelerator: bool = False):
-        self.config = load_config(args)
-        report_to = (
-            None if self.config.report_to.lower() == "none" else self.config.report_to
-        )
-        if not disable_accelerator:
-            self.accelerator = Accelerator(
-                gradient_accumulation_steps=self.config.gradient_accumulation_steps,
-                mixed_precision=(
-                    self.config.mixed_precision
-                    if not torch.backends.mps.is_available()
-                    else None
-                ),
-                log_with=report_to,
-                project_config=self.config.accelerator_project_config,
-                kwargs_handlers=[self.config.process_group_kwargs],
-            )
-        safety_check(args=self.config, accelerator=self.accelerator)
-        if self.config.lr_scale:
-            logger.info(
-                f"Scaling learning rate ({self.config.learning_rate}), due to --lr_scale"
-            )
-            self.config.learning_rate = (
-                self.config.learning_rate
-                * self.config.gradient_accumulation_steps
-                * self.config.train_batch_size
-                * getattr(self.accelerator, "num_processes", 1)
-            )
-        StateTracker.set_accelerator(self.accelerator)
-        StateTracker.set_args(self.config)
-        StateTracker.set_weight_dtype(self.config.weight_dtype)
-        self.set_model_family()
-        # this updates self.config further, so we will run it here.
-        self.init_noise_schedule()
-
-    def run(self):
-        self.configure_webhook()
-        self.init_noise_schedule()
-        self.init_seed()
-        self.init_huggingface_hub()
-
-        # Core initialization steps with signal checks after each step
-        self._initialize_components_with_signal_check(
-            [
-                self.init_preprocessing_models,
-                self.init_data_backend,
-                self.init_validation_prompts,
-                self.init_unload_text_encoder,
-                self.init_unload_vae,
-                self.init_load_base_model,
-                self.init_precision,
-                self.init_controlnet_model,
-                self.init_freeze_models,
-                self.init_trainable_peft_adapter,
-                self.init_ema_model,
-            ]
-        )
-
-        # Model movement and validation setup
-        self.move_models(destination="accelerator")
-        self._exit_on_signal()
-        self.init_validations()
-        self._exit_on_signal()
-        self.init_benchmark_base_model()
-        self._exit_on_signal()
-        self.resume_and_prepare()
-        self._exit_on_signal()
-        self.init_trackers()
-
-        # except Exception as e:
-        #     import traceback
-
-        #     logger.error(
-        #         f"Failed to run training: {e}, traceback: {traceback.format_exc()}"
-        #     )
-        #     self._send_webhook_msg(
-        #         message=f"Failed to run training: {e}",
-        #     )
-        #     self._send_webhook_raw(
-        #         structured_data={
-        #             "message": f"Failed to run training: {e}",
-        #             "status": "error",
-        #         },
-        #         message_type="fatal_error",
-        #     )
-
-        #     raise e
-
-    def _initialize_components_with_signal_check(self, initializers):
-        """
-        Runs a list of initializer functions with signal checks after each.
-
-        Args:
-            initializers (list): A list of initializer functions to run sequentially.
-        """
-        for initializer in initializers:
-            initializer()
-            self._exit_on_signal()
-
-    def init_noise_schedule(self):
-        self.config, _flow_matching, self.noise_scheduler = load_scheduler_from_args(
-            self.config
-        )
-        self.config.flow_matching = _flow_matching
-        self.lr = 0.0
-
-    def configure_webhook(self, send_startup_message: bool = True):
-        self.webhook_handler = None
-        if self.config.webhook_config is None:
-            return
-        from videotuna.third_party.flux.webhooks.handler import WebhookHandler
-
-        self.webhook_handler = WebhookHandler(
-            self.config.webhook_config,
-            self.accelerator,
-            f"{self.config.tracker_project_name} {self.config.tracker_run_name}",
-        )
-        StateTracker.set_webhook_handler(self.webhook_handler)
-        if send_startup_message:
-            self._send_webhook_msg(
-                message="SimpleTuner has launched. Hold onto your butts!",
-                store_response=True,
-            )
-        self._send_webhook_raw(
-            structured_data={
-                "message": "Training job has started, configuration has begun."
-            },
-            message_type="configure_webhook",
-        )
-
-    def _misc_init(self):
-        """things that do not really need an order."""
-        torch.set_num_threads(self.config.torch_num_threads)
-        self.state = {}
-        self.state["lr"] = 0.0
-        # Global step represents the most recently *completed* optimization step, which means it
-        #  takes into account the number of gradient_accumulation_steps. If we use 1 gradient_accumulation_step,
-        #  then global_step and step will be the same throughout training. However, if we use
-        #  2 gradient_accumulation_steps, then global_step will be twice as large as step, and so on.
-        self.state["global_step"] = 0
-        self.state["global_resume_step"] = 0
-        self.state["first_epoch"] = 1
-        self.timesteps_buffer = []
-        self.guidance_values_list = []
-        self.train_loss = 0.0
-        self.bf = None
-        self.grad_norm = None
-        self.extra_lr_scheduler_kwargs = {}
-        StateTracker.set_global_step(self.state["global_step"])
-        self.config.use_deepspeed_optimizer, self.config.use_deepspeed_scheduler = (
-            prepare_model_for_deepspeed(self.accelerator, self.config)
-        )
-        self.config.base_weight_dtype = self.config.weight_dtype
-        self.config.is_quanto = False
-        self.config.is_torchao = False
-        self.config.is_bnb = False
-        if "quanto" in self.config.base_model_precision:
-            self.config.is_quanto = True
-        elif "torchao" in self.config.base_model_precision:
-            self.config.is_torchao = True
-        elif "bnb" in self.config.base_model_precision:
-            self.config.is_bnb = True
-        if self.config.is_quanto:
-            from videotuna.third_party.flux.training.quantisation import quantise_model
-
-            self.quantise_model = quantise_model
-        elif self.config.is_torchao:
-            from videotuna.third_party.flux.training.quantisation import quantise_model
-
-            self.quantise_model = quantise_model
-
-    def set_model_family(self, model_family: str = None):
-        model_family = getattr(self.config, "model_family", model_family)
-        if not model_family:
-            logger.warning(
-                "Using --model_family (or MODEL_FAMILY) to specify which model you are training will be required in a future release."
-            )
-            if self.config.model_family == "sd3":
-                model_family = "sd3"
-                logger.warning(
-                    "Using --sd3 is deprecated. Please use --model_family=sd3."
-                )
-            if self.config.model_family == "flux":
-                model_family = "flux"
-                logger.warning(
-                    "Using --flux is deprecated. Please use --model_family=flux."
-                )
-            if self.config.model_family == "pixart_sigma":
-                model_family = "pixart_sigma"
-                logger.warning(
-                    "Using --pixart_sigma is deprecated. Please use --model_family=pixart_sigma."
-                )
-            if self.config.model_family == "legacy":
-                model_family = "legacy"
-                logger.warning(
-                    "Using --legacy is deprecated. Please use --model_family=legacy."
-                )
-            if self.config.model_family == "kolors":
-                model_family = "kolors"
-                logger.warning(
-                    "Using --kolors is deprecated. Please use --model_family=kolors."
-                )
-            if self.config.model_family == "smoldit":
-                model_family = "smoldit"
-            if model_family is None:
-                model_family = "sdxl"
-                logger.warning(
-                    "Training SDXL without specifying --model_family is deprecated. Please use --model_family=sdxl."
-                )
-        elif model_family not in model_classes["full"]:
-            raise ValueError(f"Invalid model family specified: {model_family}")
-
-        self._set_model_paths()
-        StateTracker.set_model_family(model_family)
-        self.config.model_type_label = model_labels[model_family.lower()]
-        if StateTracker.is_sdxl_refiner():
-            self.config.model_type_label = "SDXL Refiner"
-
-    def init_clear_backend_cache(self):
-        if self.config.output_dir is not None:
-            os.makedirs(self.config.output_dir, exist_ok=True)
-        if self.config.preserve_data_backend_cache:
-            return
-        StateTracker.delete_cache_files(
-            preserve_data_backend_cache=self.config.preserve_data_backend_cache
-        )
-
-    def init_seed(self):
-        if self.config.seed is not None and self.config.seed != 0:
-            set_seed(self.config.seed, self.config.seed_for_each_device)
-
-    def init_huggingface_hub(self, access_token: str = None):
-        # Handle the repository creation
-        self.hub_manager = None
-        if not self.accelerator.is_main_process or not self.config.push_to_hub:
-            return
-        if access_token:
-            huggingface_hub.login(token=access_token)
-        self.hub_manager = HubManager(config=self.config)
-        try:
-            StateTracker.set_hf_user(huggingface_hub.whoami())
-            logger.info(
-                f"Logged into Hugging Face Hub as '{StateTracker.get_hf_username()}'"
-            )
-        except Exception as e:
-            logger.error(f"Failed to log into Hugging Face Hub: {e}")
-            raise e
-
-    def _set_model_paths(self):
-        self.config.vae_path = (
-            self.config.pretrained_model_name_or_path
-            if self.config.pretrained_vae_model_name_or_path is None
-            else self.config.pretrained_vae_model_name_or_path
-        )
-        self.config.text_encoder_path, self.config.text_encoder_subfolder = (
-            determine_te_path_subfolder(self.config)
-        )
-
-    def init_preprocessing_models(self, move_to_accelerator: bool = True):
-        # image embeddings
-        self.init_vae(move_to_accelerator=move_to_accelerator)
-        # text embeds
-        self.init_text_encoder(move_to_accelerator=move_to_accelerator)
-
-    def init_vae(self, move_to_accelerator: bool = True):
-        logger.info(f"Load VAE: {self.config.vae_path}")
-        self.config.vae_kwargs = {
-            "pretrained_model_name_or_path": self.config.vae_path,
-            "subfolder": "vae",
-            "revision": self.config.revision,
-            "force_upcast": False,
-            "variant": self.config.variant,
-        }
-        try:
-            self.vae = AutoencoderKL.from_pretrained(**self.config.vae_kwargs)
-        except:
-            logger.warning(
-                "Couldn't load VAE with default path. Trying without a subfolder.."
-            )
-            self.config.vae_kwargs["subfolder"] = None
-            self.vae = AutoencoderKL.from_pretrained(**self.config.vae_kwargs)
-        if not move_to_accelerator:
-            logger.debug("Not moving VAE to accelerator.")
-            return
-        if self.vae is not None:
-            # The VAE is in bfloat16 to avoid NaN losses.
-            _vae_dtype = torch.bfloat16
-            if hasattr(self.config, "vae_dtype"):
-                # Let's use a case-switch for convenience: bf16, fp16, fp32, none/default
-                if self.config.vae_dtype == "bf16":
-                    _vae_dtype = torch.bfloat16
-                elif self.config.vae_dtype == "fp16":
-                    raise ValueError(
-                        "fp16 is not supported for SDXL's VAE. Please use bf16 or fp32."
-                    )
-                elif self.config.vae_dtype == "fp32":
-                    _vae_dtype = torch.float32
-                elif (
-                    self.config.vae_dtype == "none"
-                    or self.config.vae_dtype == "default"
-                ):
-                    _vae_dtype = torch.bfloat16
-            logger.info(
-                f"Loading VAE onto accelerator, converting from {self.vae.dtype} to {_vae_dtype}"
-            )
-            self.vae.to(self.accelerator.device, dtype=_vae_dtype)
-            StateTracker.set_vae_dtype(_vae_dtype)
-            StateTracker.set_vae(self.vae)
-
-    def init_text_tokenizer(self):
-        logger.info("Load tokenizers")
-        self.tokenizer_1, self.tokenizer_2, self.tokenizer_3 = get_tokenizers(
-            self.config
-        )
-        self.tokenizers = [self.tokenizer_1, self.tokenizer_2, self.tokenizer_3]
-
-    def init_text_encoder(self, move_to_accelerator: bool = True):
-        self.init_text_tokenizer()
-        self.text_encoder_1, self.text_encoder_2, self.text_encoder_3 = None, None, None
-        self.text_encoder_cls_1, self.text_encoder_cls_2, self.text_encoder_cls_3 = (
-            None,
-            None,
-            None,
-        )
-        if self.tokenizer_1 is not None:
-            self.text_encoder_cls_1 = import_model_class_from_model_name_or_path(
-                self.config.text_encoder_path,
-                self.config.revision,
-                self.config,
-                subfolder=self.config.text_encoder_subfolder,
-            )
-        if self.tokenizer_2 is not None:
-            self.text_encoder_cls_2 = import_model_class_from_model_name_or_path(
-                self.config.pretrained_model_name_or_path,
-                self.config.revision,
-                self.config,
-                subfolder="text_encoder_2",
-            )
-        if self.tokenizer_3 is not None and self.config.model_family == "sd3":
-            self.text_encoder_cls_3 = import_model_class_from_model_name_or_path(
-                self.config.pretrained_model_name_or_path,
-                self.config.revision,
-                self.config,
-                subfolder="text_encoder_3",
-            )
-        with ContextManagers(deepspeed_zero_init_disabled_context_manager()):
-            tokenizers = [self.tokenizer_1, self.tokenizer_2, self.tokenizer_3]
-            text_encoder_classes = [
-                self.text_encoder_cls_1,
-                self.text_encoder_cls_2,
-                self.text_encoder_cls_3,
-            ]
-            (
-                text_encoder_variant,
-                self.text_encoder_1,
-                self.text_encoder_2,
-                self.text_encoder_3,
-            ) = load_tes(
-                args=self.config,
-                text_encoder_classes=text_encoder_classes,
-                weight_dtype=self.config.weight_dtype,
-                tokenizers=tokenizers,
-                text_encoder_path=self.config.text_encoder_path,
-                text_encoder_subfolder=self.config.text_encoder_subfolder,
-            )
-        if not move_to_accelerator:
-            logger.debug("Not moving text encoders to accelerator.")
-            return
-        self.text_encoders = []
-        self.tokenizers = []
-        if self.tokenizer_1 is not None:
-            logger.info("Moving text encoder to GPU.")
-            self.text_encoder_1.to(
-                self.accelerator.device, dtype=self.config.weight_dtype
-            )
-            self.tokenizers.append(self.tokenizer_1)
-            self.text_encoders.append(self.text_encoder_1)
-        if self.tokenizer_2 is not None:
-            logger.info("Moving text encoder 2 to GPU.")
-            self.text_encoder_2.to(
-                self.accelerator.device, dtype=self.config.weight_dtype
-            )
-            self.tokenizers.append(self.tokenizer_2)
-            self.text_encoders.append(self.text_encoder_2)
-        if self.tokenizer_3 is not None:
-            logger.info("Moving text encoder 3 to GPU.")
-            self.text_encoder_3.to(
-                self.accelerator.device, dtype=self.config.weight_dtype
-            )
-            self.tokenizers.append(self.tokenizer_3)
-            self.text_encoders.append(self.text_encoder_3)
-
-    def init_freeze_models(self):
-        # Freeze vae and text_encoders
-        if self.vae is not None:
-            self.vae.requires_grad_(False)
-        if self.text_encoder_1 is not None:
-            self.text_encoder_1.requires_grad_(False)
-        if self.text_encoder_2 is not None:
-            self.text_encoder_2.requires_grad_(False)
-        if self.text_encoder_3 is not None:
-            self.text_encoder_3.requires_grad_(False)
-        if "lora" in self.config.model_type or self.config.controlnet:
-            if self.transformer is not None:
-                self.transformer.requires_grad_(False)
-            if self.unet is not None:
-                self.unet.requires_grad_(False)
-        self.accelerator.wait_for_everyone()
-
-    def init_load_base_model(self):
-        webhook_msg = f"Loading model: `{self.config.pretrained_model_name_or_path}`..."
-        self._send_webhook_msg(message=webhook_msg)
-        self._send_webhook_raw(
-            structured_data={"message": webhook_msg},
-            message_type="init_load_base_model_begin",
-        )
-        self.unet, self.transformer = load_diffusion_model(
-            self.config, self.config.weight_dtype
-        )
-        self.accelerator.wait_for_everyone()
-        self._send_webhook_raw(
-            structured_data={"message": "Base model has loaded."},
-            message_type="init_load_base_model_completed",
-        )
-
-    def init_data_backend(self):
-        try:
-            self.init_clear_backend_cache()
-            self._send_webhook_msg(
-                message="Configuring data backends... (this may take a while!)"
-            )
-            self._send_webhook_raw(
-                structured_data={"message": "Configuring data backends."},
-                message_type="init_data_backend_begin",
-            )
-            configure_multi_databackend(
-                self.config,
-                accelerator=self.accelerator,
-                text_encoders=self.text_encoders,
-                tokenizers=self.tokenizers,
-            )
-            self._send_webhook_raw(
-                structured_data={"message": "Completed configuring data backends."},
-                message_type="init_data_backend_completed",
-            )
-        except Exception as e:
-            import traceback
-
-            logger.error(f"{e}, traceback: {traceback.format_exc()}")
-            self._send_webhook_msg(
-                message=f"Failed to load data backends: {e}",
-                message_level="critical",
-            )
-            self._send_webhook_raw(
-                structured_data={
-                    "message": f"Failed to load data backends: {e}",
-                    "status": "error",
-                },
-                message_type="fatal_error",
-            )
-
-            raise e
-
-        self.init_validation_prompts()
-        # We calculate the number of steps per epoch by dividing the number of images by the effective batch divisor.
-        # Gradient accumulation steps mean that we only update the model weights every /n/ steps.
-        collected_data_backend_str = list(StateTracker.get_data_backends().keys())
-        if self.config.push_to_hub and self.accelerator.is_main_process:
-            self.hub_manager.collected_data_backend_str = collected_data_backend_str
-            self.hub_manager.set_validation_prompts(
-                self.validation_prompts, self.validation_shortnames
-            )
-            logger.debug(f"Collected validation prompts: {self.validation_prompts}")
-        self._recalculate_training_steps()
-        logger.info(
-            f"Collected the following data backends: {collected_data_backend_str}"
-        )
-        self._send_webhook_msg(
-            message=f"Collected the following data backends: {collected_data_backend_str}"
-        )
-        self._send_webhook_raw(
-            structured_data={
-                "message": f"Collected the following data backends: {collected_data_backend_str}"
-            },
-            message_type="init_data_backend",
-        )
-        self.accelerator.wait_for_everyone()
-
-    def init_validation_prompts(self):
-        if self.accelerator.is_main_process:
-            if self.config.model_family == "flux":
-                (
-                    self.validation_prompts,
-                    self.validation_shortnames,
-                    self.validation_negative_prompt_embeds,
-                    self.validation_negative_pooled_embeds,
-                    self.validation_negative_time_ids,
-                ) = prepare_validation_prompt_list(
-                    args=self.config,
-                    embed_cache=StateTracker.get_default_text_embed_cache(),
-                )
-            else:
-                (
-                    self.validation_prompts,
-                    self.validation_shortnames,
-                    self.validation_negative_prompt_embeds,
-                    self.validation_negative_pooled_embeds,
-                ) = prepare_validation_prompt_list(
-                    args=self.config,
-                    embed_cache=StateTracker.get_default_text_embed_cache(),
-                )
-        else:
-            self.validation_prompts = None
-            self.validation_shortnames = None
-            self.validation_negative_prompt_embeds = None
-            self.validation_negative_pooled_embeds = None
-        self.accelerator.wait_for_everyone()
-
-    def stats_memory_used(self):
-        # Grab GPU memory used:
-        if torch.cuda.is_available():
-            curent_memory_allocated = torch.cuda.memory_allocated() / 1024**3
-        elif torch.backends.mps.is_available():
-            curent_memory_allocated = torch.mps.current_allocated_memory() / 1024**3
-        else:
-            logger.warning(
-                "CUDA, ROCm, or Apple MPS not detected here. We cannot report VRAM reductions."
-            )
-            curent_memory_allocated = 0
-
-        return curent_memory_allocated
-
-    def init_unload_text_encoder(self):
-        if self.config.model_type != "full" and self.config.train_text_encoder:
-            return
-        memory_before_unload = self.stats_memory_used()
-        if self.accelerator.is_main_process:
-            logger.info("Unloading text encoders, as they are not being trained.")
-        if self.text_encoder_1 is not None:
-            self.text_encoder_1 = self.text_encoder_1.to("cpu")
-        if self.text_encoder_2 is not None:
-            self.text_encoder_2 = self.text_encoder_2.to("cpu")
-        if self.text_encoder_3 is not None:
-            self.text_encoder_3 = self.text_encoder_3.to("cpu")
-        del self.text_encoder_1, self.text_encoder_2, self.text_encoder_3
-        self.text_encoder_1, self.text_encoder_2, self.text_encoder_3 = None, None, None
-        self.text_encoders = []
-        for backend_id, backend in StateTracker.get_data_backends().items():
-            if "text_embed_cache" in backend:
-                backend["text_embed_cache"].text_encoders = None
-                backend["text_embed_cache"].pipeline = None
-        reclaim_memory()
-        memory_after_unload = self.stats_memory_used()
-        memory_saved = memory_after_unload - memory_before_unload
-        logger.info(
-            f"After nuking text encoders from orbit, we freed {abs(round(memory_saved, 2))} GB of VRAM."
-            " The real memories were the friends we trained a model on along the way."
-        )
-
-    def init_precision(self):
-        self.config.enable_adamw_bf16 = (
-            True if self.config.weight_dtype == torch.bfloat16 else False
-        )
-        quantization_device = (
-            "cpu" if self.config.quantize_via == "cpu" else self.accelerator.device
-        )
-
-        if "bnb" in self.config.base_model_precision:
-            # can't cast or move bitsandbytes modelsthis
-            return
-
-        if not self.config.disable_accelerator and self.config.is_quantized:
-            if self.config.base_model_default_dtype == "fp32":
-                self.config.base_weight_dtype = torch.float32
-                self.config.enable_adamw_bf16 = False
-            elif self.config.base_model_default_dtype == "bf16":
-                self.config.base_weight_dtype = torch.bfloat16
-                self.config.enable_adamw_bf16 = True
-            if self.unet is not None:
-                logger.info(
-                    f"Moving U-net to dtype={self.config.base_weight_dtype}, device={quantization_device}"
-                )
-                self.unet.to(quantization_device, dtype=self.config.base_weight_dtype)
-            elif self.transformer is not None:
-                logger.info(
-                    f"Moving transformer to dtype={self.config.base_weight_dtype}, device={quantization_device}"
-                )
-                self.transformer.to(
-                    quantization_device, dtype=self.config.base_weight_dtype
-                )
-
-        if self.config.is_quanto:
-            with self.accelerator.local_main_process_first():
-                self.quantise_model(
-                    unet=self.unet,
-                    transformer=self.transformer,
-                    text_encoder_1=self.text_encoder_1,
-                    text_encoder_2=self.text_encoder_2,
-                    text_encoder_3=self.text_encoder_3,
-                    controlnet=None,
-                    args=self.config,
-                )
-        elif self.config.is_torchao:
-            with self.accelerator.local_main_process_first():
-                (
-                    self.unet,
-                    self.transformer,
-                    self.text_encoder_1,
-                    self.text_encoder_2,
-                    self.text_encoder_3,
-                    self.controlnet,
-                ) = self.quantise_model(
-                    unet=self.unet,
-                    transformer=self.transformer,
-                    text_encoder_1=self.text_encoder_1,
-                    text_encoder_2=self.text_encoder_2,
-                    text_encoder_3=self.text_encoder_3,
-                    controlnet=None,
-                    args=self.config,
-                )
-
-    def init_controlnet_model(self):
-        if not self.config.controlnet:
-            return
-        logger.info("Creating the controlnet..")
-        if self.config.controlnet_model_name_or_path:
-            logger.info("Loading existing controlnet weights")
-            self.controlnet = ControlNetModel.from_pretrained(
-                self.config.controlnet_model_name_or_path
-            )
-        else:
-            logger.info("Initializing controlnet weights from unet")
-            self.controlnet = ControlNetModel.from_unet(self.unet)
-
-        self.accelerator.wait_for_everyone()
-
-    def init_trainable_peft_adapter(self):
-        if "lora" not in self.config.model_type:
-            return
-        if self.config.controlnet:
-            raise ValueError("Cannot train LoRA with ControlNet.")
-        if "standard" == self.config.lora_type.lower():
-            lora_info_msg = f"Using LoRA training mode (rank={self.config.lora_rank})"
-            logger.info(lora_info_msg)
-            self._send_webhook_msg(message=lora_info_msg)
-            target_modules = determine_adapter_target_modules(
-                self.config, self.unet, self.transformer
-            )
-            addkeys, misskeys = [], []
-            if self.unet is not None:
-                unet_lora_config = LoraConfig(
-                    r=self.config.lora_rank,
-                    lora_alpha=(
-                        self.config.lora_alpha
-                        if self.config.lora_alpha is not None
-                        else self.config.lora_rank
-                    ),
-                    lora_dropout=self.config.lora_dropout,
-                    init_lora_weights=self.config.lora_initialisation_style,
-                    target_modules=target_modules,
-                    use_dora=self.config.use_dora,
-                )
-                logger.info("Adding LoRA adapter to the unet model..")
-                self.unet.add_adapter(unet_lora_config)
-                if self.config.init_lora:
-                    addkeys, misskeys = load_lora_weights(
-                        {"unet": self.unet},
-                        self.config.init_lora,
-                        use_dora=self.config.use_dora,
-                    )
-            elif self.transformer is not None:
-                transformer_lora_config = LoraConfig(
-                    r=self.config.lora_rank,
-                    lora_alpha=(
-                        self.config.lora_alpha
-                        if self.config.lora_alpha is not None
-                        else self.config.lora_rank
-                    ),
-                    init_lora_weights=self.config.lora_initialisation_style,
-                    target_modules=target_modules,
-                    use_dora=self.config.use_dora,
-                )
-                self.transformer.add_adapter(transformer_lora_config)
-                if self.config.init_lora:
-                    addkeys, misskeys = load_lora_weights(
-                        {"transformer": self.transformer},
-                        self.config.init_lora,
-                        use_dora=self.config.use_dora,
-                    )
-            if addkeys:
-                logger.warning(
-                    "The following keys were found in %s, but are not part of the model and are ignored:\n %s.\nThis is most likely an error"
-                    % (self.config.init_lora, str(addkeys))
-                )
-            if misskeys:
-                logger.warning(
-                    "The following keys were part of the model but not found in %s:\n %s.\nThese keys will be initialized according to the lora weight initialisation. This could be an error, or intended behaviour in case a lora is finetuned with additional keys."
-                    % (self.config.init_lora, str(misskeys))
-                )
-
-        elif "lycoris" == self.config.lora_type.lower():
-            from lycoris import create_lycoris
-
-            with open(self.config.lycoris_config, "r") as f:
-                self.lycoris_config = json.load(f)
-            multiplier = int(self.lycoris_config["multiplier"])
-            linear_dim = int(self.lycoris_config["linear_dim"])
-            linear_alpha = int(self.lycoris_config["linear_alpha"])
-            apply_preset = self.lycoris_config.get("apply_preset", None)
-            if apply_preset is not None and apply_preset != {}:
-                LycorisNetwork.apply_preset(apply_preset)
-
-            # Remove the positional arguments we extracted.
-            del self.lycoris_config["multiplier"]
-            del self.lycoris_config["linear_dim"]
-            del self.lycoris_config["linear_alpha"]
-
-            logger.info("Using lycoris training mode")
-            self._send_webhook_msg(message="Using lycoris training mode.")
-
-            model_for_lycoris_wrap = None
-            if self.transformer is not None:
-                model_for_lycoris_wrap = self.transformer
-            if self.unet is not None:
-                model_for_lycoris_wrap = self.unet
-
-            if self.config.init_lora is not None:
-                from lycoris import create_lycoris_from_weights
-
-                self.lycoris_wrapped_network = create_lycoris_from_weights(
-                    multiplier,
-                    self.config.init_lora,
-                    model_for_lycoris_wrap,
-                    weights_sd=None,
-                    **self.lycoris_config,
-                )[0]
-            else:
-                self.lycoris_wrapped_network = create_lycoris(
-                    model_for_lycoris_wrap,
-                    multiplier,
-                    linear_dim,
-                    linear_alpha,
-                    **self.lycoris_config,
-                )
-
-                if self.config.init_lokr_norm is not None:
-                    init_lokr_network_with_perturbed_normal(
-                        self.lycoris_wrapped_network,
-                        scale=self.config.init_lokr_norm,
-                    )
-
-            self.lycoris_wrapped_network.apply_to()
-            setattr(
-                self.accelerator,
-                "_lycoris_wrapped_network",
-                self.lycoris_wrapped_network,
-            )
-            lycoris_num_params = sum(
-                p.numel() for p in self.lycoris_wrapped_network.parameters()
-            )
-            logger.info(
-                f"LyCORIS network has been initialized with {lycoris_num_params:,} parameters"
-            )
-        self.accelerator.wait_for_everyone()
-
-    def init_post_load_freeze(self):
-        if self.config.layer_freeze_strategy == "bitfit":
-            from videotuna.third_party.flux.training.model_freeze import (
-                apply_bitfit_freezing,
-            )
-
-            if self.unet is not None:
-                logger.info("Applying BitFit freezing strategy to the U-net.")
-                self.unet = apply_bitfit_freezing(
-                    unwrap_model(self.accelerator, self.unet), self.config
-                )
-            if self.transformer is not None:
-                logger.warning(
-                    "Training DiT models with BitFit is not yet tested, and unexpected results may occur."
-                )
-                self.transformer = apply_bitfit_freezing(
-                    unwrap_model(self.accelerator, self.transformer), self.config
-                )
-
-        if self.config.gradient_checkpointing:
-            if self.unet is not None:
-                unwrap_model(
-                    self.accelerator, self.unet
-                ).enable_gradient_checkpointing()
-            if self.transformer is not None and self.config.model_family != "smoldit":
-                unwrap_model(
-                    self.accelerator, self.transformer
-                ).enable_gradient_checkpointing()
-            if self.config.controlnet:
-                unwrap_model(
-                    self.accelerator, self.controlnet
-                ).enable_gradient_checkpointing()
-            if (
-                hasattr(self.config, "train_text_encoder")
-                and self.config.train_text_encoder
-            ):
-                unwrap_model(
-                    self.accelerator, self.text_encoder_1
-                ).gradient_checkpointing_enable()
-                unwrap_model(
-                    self.accelerator, self.text_encoder_2
-                ).gradient_checkpointing_enable()
-
-    def _recalculate_training_steps(self):
-        # Scheduler and math around the number of training steps.
-        if not hasattr(self.config, "overrode_max_train_steps"):
-            self.config.overrode_max_train_steps = False
-        self.config.total_num_batches = sum(
-            [
-                len(
-                    backend["metadata_backend"] if "metadata_backend" in backend else []
-                )
-                for _, backend in StateTracker.get_data_backends().items()
-            ]
-        )
-        self.config.num_update_steps_per_epoch = math.ceil(
-            self.config.total_num_batches / self.config.gradient_accumulation_steps
-        )
-        if getattr(self.config, "overrode_max_train_steps", False):
-            self.config.max_train_steps = (
-                self.config.num_train_epochs * self.config.num_update_steps_per_epoch
-            )
-            # Afterwards we recalculate our number of training epochs
-            self.config.num_train_epochs = math.ceil(
-                self.config.max_train_steps / self.config.num_update_steps_per_epoch
-            )
-            logger.info(
-                "After removing any undesired samples and updating cache entries, we have settled on"
-                f" {self.config.num_train_epochs} epochs and {self.config.num_update_steps_per_epoch} steps per epoch."
-            )
-        if self.config.max_train_steps is None or self.config.max_train_steps == 0:
-            if (
-                self.config.num_train_epochs is None
-                or self.config.num_train_epochs == 0
-            ):
-                raise ValueError(
-                    "You must specify either --max_train_steps or --num_train_epochs with a value > 0"
-                )
-            self.config.max_train_steps = (
-                self.config.num_train_epochs * self.config.num_update_steps_per_epoch
-            )
-            logger.info(
-                f"Calculated our maximum training steps at {self.config.max_train_steps} because we have"
-                f" {self.config.num_train_epochs} epochs and {self.config.num_update_steps_per_epoch} steps per epoch."
-            )
-            self.config.overrode_max_train_steps = True
-        elif self.config.num_train_epochs is None or self.config.num_train_epochs == 0:
-            if self.config.max_train_steps is None or self.config.max_train_steps == 0:
-                raise ValueError(
-                    "You must specify either --max_train_steps or --num_train_epochs with a value > 0"
-                )
-            self.config.num_train_epochs = math.ceil(
-                self.config.max_train_steps / self.config.num_update_steps_per_epoch
-            )
-            logger.info(
-                f"Calculated our maximum training steps at {self.config.max_train_steps} because we have"
-                f" {self.config.num_train_epochs} epochs and {self.config.num_update_steps_per_epoch} steps per epoch."
-            )
-        if self.lr_scheduler is not None and hasattr(
-            self.lr_scheduler, "num_update_steps_per_epoch"
-        ):
-            self.lr_scheduler.num_update_steps_per_epoch = (
-                self.config.num_update_steps_per_epoch
-            )
-        self.config.total_batch_size = (
-            self.config.train_batch_size
-            * self.accelerator.num_processes
-            * self.config.gradient_accumulation_steps
-        )
-
-    def init_optimizer(self):
-        logger.info(f"Learning rate: {self.config.learning_rate}")
-        extra_optimizer_args = {"lr": self.config.learning_rate}
-        # Initialize the optimizer
-        optimizer_args_from_config, optimizer_class = (
-            determine_optimizer_class_with_config(
-                args=self.config,
-                use_deepspeed_optimizer=self.config.use_deepspeed_optimizer,
-                is_quantized=self.config.is_quantized,
-                enable_adamw_bf16=self.config.enable_adamw_bf16,
-            )
-        )
-        extra_optimizer_args.update(optimizer_args_from_config)
-
-        self.params_to_optimize = determine_params_to_optimize(
-            args=self.config,
-            controlnet=self.controlnet,
-            unet=self.unet,
-            transformer=self.transformer,
-            text_encoder_1=self.text_encoder_1,
-            text_encoder_2=self.text_encoder_2,
-            model_type_label=self.config.model_type_label,
-            lycoris_wrapped_network=self.lycoris_wrapped_network,
-        )
-
-        if self.config.use_deepspeed_optimizer:
-            logger.info(
-                f"DeepSpeed Optimizer arguments, weight_decay={self.config.adam_weight_decay} eps={self.config.adam_epsilon}, extra_arguments={extra_optimizer_args}"
-            )
-            self.optimizer = optimizer_class(self.params_to_optimize)
-        else:
-            logger.info(f"Optimizer arguments={extra_optimizer_args}")
-            if self.config.train_text_encoder and self.config.text_encoder_lr:
-                # changes the learning rate of text_encoder_parameters_one and text_encoder_parameters_two to be
-                # --learning_rate
-                self.params_to_optimize[1]["lr"] = float(self.config.learning_rate)
-                if self.text_encoder_2 is not None:
-                    self.params_to_optimize[2]["lr"] = float(self.config.learning_rate)
-
-            self.optimizer = cpu_offload_optimizer(
-                params_to_optimize=self.params_to_optimize,
-                optimizer_cls=optimizer_class,
-                optimizer_parameters=extra_optimizer_args,
-                fused=self.config.fuse_optimizer,
-                offload_gradients=self.config.optimizer_offload_gradients,
-                offload_mechanism=self.config.optimizer_cpu_offload_method,
-            )
-
-        if (
-            is_optimi_available
-            and self.config.optimizer_release_gradients
-            and "optimi" in self.config.optimizer
-        ):
-            logger.warning(
-                "Marking model for gradient release. This feature is experimental, and may use more VRAM or not work."
-            )
-            prepare_for_gradient_release(
-                (
-                    self.controlnet
-                    if self.config.controlnet
-                    else self.transformer if self.transformer is not None else self.unet
-                ),
-                self.optimizer,
-            )
-
-    def init_lr_scheduler(self):
-        self.config.is_schedulefree = is_lr_scheduler_disabled(self.config.optimizer)
-        if self.config.is_schedulefree:
-            logger.info(
-                "Using experimental AdamW ScheduleFree optimiser from Facebook. Experimental due to newly added Kahan summation."
-            )
-            # we don't use LR schedulers with schedulefree optimisers
-            lr_scheduler = None
-        if not self.config.use_deepspeed_scheduler and not self.config.is_schedulefree:
-            logger.info(
-                f"Loading {self.config.lr_scheduler} learning rate scheduler with {self.config.lr_warmup_steps} warmup steps"
-            )
-            lr_scheduler = get_lr_scheduler(
-                self.config,
-                self.optimizer,
-                self.accelerator,
-                logger,
-                use_deepspeed_scheduler=False,
-            )
-        else:
-            logger.info(f"Using dummy learning rate scheduler")
-            if torch.backends.mps.is_available():
-                lr_scheduler = None
-            else:
-                lr_scheduler = accelerate.utils.DummyScheduler(
-                    self.optimizer,
-                    total_num_steps=self.config.max_train_steps,
-                    warmup_num_steps=self.config.lr_warmup_steps,
-                )
-        if lr_scheduler is not None:
-            if hasattr(lr_scheduler, "num_update_steps_per_epoch"):
-                lr_scheduler.num_update_steps_per_epoch = (
-                    self.config.num_update_steps_per_epoch
-                )
-            if hasattr(lr_scheduler, "last_step"):
-                lr_scheduler.last_step = self.state.get("global_resume_step", 0)
-
-        return lr_scheduler
-
-    def init_ema_model(self):
-        # Create EMA for the unet.
-        self.ema_model = None
-        if not self.config.use_ema:
-            return
-        if self.accelerator.is_main_process:
-            logger.info("Using EMA. Creating EMAModel.")
-
-            ema_model_cls = None
-            if self.unet is not None:
-                ema_model_cls = UNet2DConditionModel
-            elif self.config.model_family == "pixart_sigma":
-                ema_model_cls = PixArtTransformer2DModel
-            elif self.config.model_family == "flux":
-                ema_model_cls = FluxTransformer2DModel
-            else:
-                raise ValueError(
-                    f"Please open a bug report or disable EMA. Unknown EMA model family: {self.config.model_family}"
-                )
-
-            ema_model_config = None
-            if self.unet is not None:
-                ema_model_config = self.unet.config
-            elif self.transformer is not None:
-                ema_model_config = self.transformer.config
-
-            self.ema_model = EMAModel(
-                self.config,
-                self.accelerator,
-                parameters=(
-                    self.unet.parameters()
-                    if self.unet is not None
-                    else self.transformer.parameters()
-                ),
-                model_cls=ema_model_cls,
-                model_config=ema_model_config,
-                decay=self.config.ema_decay,
-                foreach=not self.config.ema_foreach_disable,
-            )
-            logger.info("EMA model creation complete.")
-
-        self.accelerator.wait_for_everyone()
-
-    def init_hooks(self):
-        from videotuna.third_party.flux.training.save_hooks import SaveHookManager
-
-        self.model_hooks = SaveHookManager(
-            args=self.config,
-            unet=self.unet,
-            transformer=self.transformer,
-            ema_model=self.ema_model,
-            accelerator=self.accelerator,
-            text_encoder_1=self.text_encoder_1,
-            text_encoder_2=self.text_encoder_2,
-            use_deepspeed_optimizer=self.config.use_deepspeed_optimizer,
-        )
-        self.accelerator.register_save_state_pre_hook(self.model_hooks.save_model_hook)
-        self.accelerator.register_load_state_pre_hook(self.model_hooks.load_model_hook)
-
-    def init_prepare_models(self, lr_scheduler):
-        # Prepare everything with our `accelerator`.
-        logger.info("Preparing models..")
-
-        # TODO: Is this still needed? Seems like a hack job from January 2024.
-        self.train_dataloaders = []
-        for _, backend in StateTracker.get_data_backends().items():
-            if "train_dataloader" not in backend:
-                continue
-            self.train_dataloaders.append(backend["train_dataloader"])
-            break
-        if len(self.train_dataloaders) == 0:
-            logger.error("For some reason, no dataloaders were configured.")
-            sys.exit(0)
-        if self.config.disable_accelerator:
-            logger.warning(
-                "Because SIMPLETUNER_DISABLE_ACCELERATOR is set, we will not prepare the accelerator."
-            )
-            return
-        logger.info("Loading our accelerator...")
-        if torch.backends.mps.is_available():
-            self.accelerator.native_amp = False
-        self._send_webhook_msg(message="Moving weights to GPU...")
-        self._send_webhook_raw(
-            structured_data={"message": "Moving weights to GPU"},
-            message_type="init_prepare_models_begin",
-        )
-        primary_model = self.unet if self.unet is not None else self.transformer
-        if self.config.controlnet:
-            primary_model = self.controlnet
-        results = self.accelerator.prepare(
-            primary_model, lr_scheduler, self.optimizer, self.train_dataloaders[0]
-        )
-        if self.config.controlnet:
-            self.controlnet = results[0]
-        elif self.unet is not None:
-            self.unet = results[0]
-        elif self.transformer is not None:
-            self.transformer = results[0]
-
-        if self.config.unet_attention_slice:
-            if torch.backends.mps.is_available():
-                logger.warning(
-                    "Using attention slicing when training SDXL on MPS can result in NaN errors on the first backward pass. If you run into issues, disable this option and reduce your batch size instead to reduce memory consumption."
-                )
-            if self.unet is not None:
-                self.unet.set_attention_slice("auto")
-            if self.transformer is not None:
-                self.transformer.set_attention_slice("auto")
-        self.lr_scheduler = results[1]
-        self.optimizer = results[2]
-        # The rest of the entries are dataloaders:
-        self.train_dataloaders = [results[3:]]
-        if self.config.use_ema and self.ema_model is not None:
-            if self.config.ema_device == "accelerator":
-                logger.info("Moving EMA model weights to accelerator...")
-            self.ema_model.to(
-                (
-                    self.accelerator.device
-                    if self.config.ema_device == "accelerator"
-                    else "cpu"
-                ),
-                dtype=self.config.weight_dtype,
-            )
-
-            if self.config.ema_device == "cpu" and not self.config.ema_cpu_only:
-                logger.info("Pinning EMA model weights to CPU...")
-                try:
-                    self.ema_model.pin_memory()
-                except Exception as e:
-                    self._send_webhook_raw(
-                        structured_data={"message": f"Failed to pin EMA to CPU: {e}"},
-                        message_type="error",
-                    )
-                    logger.error(f"Failed to pin EMA model to CPU: {e}")
-
-        idx_count = 0
-        for _, backend in StateTracker.get_data_backends().items():
-            if idx_count == 0 or "train_dataloader" not in backend:
-                continue
-            self.train_dataloaders.append(
-                self.accelerator.prepare(backend["train_dataloader"])
-            )
-        idx_count = 0
-
-        if "lora" in self.config.model_type and self.config.train_text_encoder:
-            logger.info("Preparing text encoders for training.")
-            if self.config.model_family == "sd3":
-                logger.info("NOTE: The third text encoder is not trained for SD3.")
-            self.text_encoder_1, self.text_encoder_2 = self.accelerator.prepare(
-                self.text_encoder_1, self.text_encoder_2
-            )
-        self._recalculate_training_steps()
-        self.accelerator.wait_for_everyone()
-        self._send_webhook_raw(
-            structured_data={"message": "Completed moving weights to GPU"},
-            message_type="init_prepare_models_completed",
-        )
-
-    def init_unload_vae(self):
-        if self.config.keep_vae_loaded or self.config.vae_cache_ondemand:
-            return
-        memory_before_unload = self.stats_memory_used()
-        self.vae = self.vae.to("cpu")
-        del self.vae
-        self.vae = None
-        for _, backend in StateTracker.get_data_backends().items():
-            if "vaecache" in backend:
-                backend["vaecache"].vae = None
-        reclaim_memory()
-        memory_after_unload = self.stats_memory_used()
-        memory_saved = memory_after_unload - memory_before_unload
-        logger.info(
-            f"After nuking the VAE from orbit, we freed {abs(round(memory_saved, 2)) * 1024} MB of VRAM."
-        )
-
-    def init_validations(self):
-        if (
-            hasattr(self.accelerator, "state")
-            and hasattr(self.accelerator.state, "deepspeed_plugin")
-            and getattr(self.accelerator.state.deepspeed_plugin, "deepspeed_config", {})
-            .get("zero_optimization", {})
-            .get("stage")
-            == 3
-        ):
-            logger.error("Cannot run validations with DeepSpeed ZeRO stage 3.")
-            return
-        self.validation = Validation(
-            accelerator=self.accelerator,
-            unet=self.unet,
-            transformer=self.transformer,
-            args=self.config,
-            validation_prompts=self.validation_prompts,
-            validation_shortnames=self.validation_shortnames,
-            text_encoder_1=self.text_encoder_1,
-            tokenizer=self.tokenizer_1,
-            vae_path=self.config.vae_path,
-            weight_dtype=self.config.weight_dtype,
-            embed_cache=StateTracker.get_default_text_embed_cache(),
-            validation_negative_pooled_embeds=self.validation_negative_pooled_embeds,
-            validation_negative_prompt_embeds=self.validation_negative_prompt_embeds,
-            text_encoder_2=self.text_encoder_2,
-            tokenizer_2=self.tokenizer_2,
-            text_encoder_3=self.text_encoder_3,
-            tokenizer_3=self.tokenizer_3,
-            ema_model=self.ema_model,
-            vae=self.vae,
-            controlnet=self.controlnet if self.config.controlnet else None,
-        )
-        if not self.config.train_text_encoder and self.validation is not None:
-            self.validation.clear_text_encoders()
-        self.init_benchmark_base_model()
-        self.accelerator.wait_for_everyone()
-
-    def init_benchmark_base_model(self):
-        if (
-            self.config.disable_benchmark
-            or self.validation is None
-            or self.validation.benchmark_exists("base_model")
-        ):
-            # if we've disabled it or the benchmark exists, we will not do it again.
-            # deepspeed zero3 can't do validations at all.
-            return
-        if not self.accelerator.is_main_process:
-            return
-        logger.info(
-            "Benchmarking base model for comparison. Supply `--disable_benchmark: true` to disable this behaviour."
-        )
-        self._send_webhook_raw(
-            structured_data={"message": "Base model benchmark begins"},
-            message_type="init_benchmark_base_model_begin",
-        )
-        # we'll run validation on base model if it hasn't already.
-        self.validation.run_validations(validation_type="base_model", step=0)
-        self.validation.save_benchmark("base_model")
-        self._send_webhook_raw(
-            structured_data={"message": "Base model benchmark completed"},
-            message_type="init_benchmark_base_model_completed",
-        )
-
-    def init_resume_checkpoint(self, lr_scheduler):
-        # Potentially load in the weights and states from a previous save
-        self.config.total_steps_remaining_at_start = self.config.max_train_steps
-        self.state["current_epoch"] = self.state["first_epoch"]
-        self.state["global_resume_step"] = self.state["global_step"] = (
-            StateTracker.get_global_step()
-        )
-        StateTracker.set_global_resume_step(self.state["global_resume_step"])
-        if not self.config.resume_from_checkpoint:
-            return lr_scheduler
-        if self.config.resume_from_checkpoint != "latest":
-            path = os.path.basename(self.config.resume_from_checkpoint)
-        else:
-            # Get the most recent checkpoint
-            dirs = os.listdir(self.config.output_dir)
-            dirs = [d for d in dirs if d.startswith("checkpoint")]
-            dirs = sorted(dirs, key=lambda x: int(x.split("-")[1]))
-            path = dirs[-1] if len(dirs) > 0 else None
-
-        if path is None:
-            logger.info(
-                f"Checkpoint '{self.config.resume_from_checkpoint}' does not exist. Starting a new training run."
-            )
-            self._send_webhook_raw(
-                structured_data={
-                    "message": "No model to resume. Beginning fresh training run."
-                },
-                message_type="init_resume_checkpoint",
-            )
-
-            self.config.resume_from_checkpoint = None
-            return lr_scheduler
-
-        logger.info(f"Resuming from checkpoint {path}")
-        self.accelerator.load_state(os.path.join(self.config.output_dir, path))
-        try:
-            if (
-                "constant" == self.config.lr_scheduler
-                and not self.config.is_schedulefree
-            ):
-                for g in self.optimizer.param_groups:
-                    if "lr" in g:
-                        g["lr"] = self.config.learning_rate
-                for k, v in lr_scheduler.state_dict().items():
-                    if k in ("base_lrs", "_last_lr"):
-                        v[0] = self.config.learning_rate
-        except Exception as e:
-            self._send_webhook_raw(
-                structured_data={
-                    "message": "Could not update learning rate scheduler LR value."
-                },
-                message_type="warning",
-            )
-            logger.error(
-                f"Could not update lr_scheduler {self.config.lr_scheduler} learning rate to {self.config.learning_rate} upon resume: {e}"
-            )
-
-        self._send_webhook_raw(
-            structured_data={"message": f"Resuming model: {path}"},
-            message_type="init_resume_checkpoint",
-        )
-        training_state_filename = f"training_state.json"
-        if get_rank() > 0:
-            training_state_filename = f"training_state-{get_rank()}.json"
-        for _, backend in StateTracker.get_data_backends().items():
-            if "sampler" in backend:
-                backend["sampler"].load_states(
-                    state_path=os.path.join(
-                        self.config.output_dir,
-                        path,
-                        training_state_filename,
-                    ),
-                )
-        self.state["global_resume_step"] = self.state["global_step"] = (
-            StateTracker.get_global_step()
-        )
-        StateTracker.set_global_resume_step(self.state["global_resume_step"])
-        training_state_in_ckpt = StateTracker.get_training_state()
-        self._send_webhook_raw(
-            structured_data=training_state_in_ckpt,
-            message_type="init_resume_checkpoint_details",
-        )
-        logger.debug(f"Training state inside checkpoint: {training_state_in_ckpt}")
-        if hasattr(lr_scheduler, "last_step"):
-            lr_scheduler.last_step = self.state["global_resume_step"]
-        logger.info(f"Resuming from global_step {self.state['global_resume_step']}).")
-
-        # Log the current state of each data backend.
-        for _, backend in StateTracker.get_data_backends().items():
-            if "sampler" in backend:
-                backend["sampler"].log_state()
-        # We store the number of dataset resets that have occurred inside the checkpoint.
-        self.state["first_epoch"] = StateTracker.get_epoch()
-        if self.state["first_epoch"] > 1 or self.state["global_resume_step"] > 1:
-            self.config.total_steps_remaining_at_start -= self.state[
-                "global_resume_step"
-            ]
-            logger.debug(
-                f"Resuming from epoch {self.state['first_epoch']}, which leaves us with {self.config.total_steps_remaining_at_start}."
-            )
-        self.state["current_epoch"] = self.state["first_epoch"]
-        StateTracker.set_epoch(self.state["current_epoch"])
-        if hasattr(lr_scheduler, "last_epoch"):
-            lr_scheduler.last_epoch = (
-                training_state_in_ckpt.get(
-                    "epoch_step", self.state.get("global_resume_step", 1)
-                )
-                * self.accelerator.num_processes
-            )
-
-        if self.state["current_epoch"] > self.config.num_train_epochs + 1:
-            logger.info(
-                f"Reached the end ({self.state['current_epoch']} epochs) of our training run ({self.config.num_train_epochs} epochs). This run will do zero steps."
-            )
-        self.accelerator.wait_for_everyone()
-
-        return lr_scheduler
-
-    def init_trackers(self):
-        # We need to initialize the trackers we use, and also store our configuration.
-        # The trackers initializes automatically on the main process.
-        self.guidance_values_table = None
-        if self.accelerator.is_main_process:
-            # Copy args into public_args:
-            public_args = copy.deepcopy(self.config)
-            delattr(public_args, "accelerator_project_config")
-            delattr(public_args, "process_group_kwargs")
-            delattr(public_args, "weight_dtype")
-            delattr(public_args, "base_weight_dtype")
-            delattr(public_args, "vae_kwargs")
-
-            # Hash the contents of public_args to reflect a deterministic ID for a single set of params:
-            public_args_hash = hashlib.md5(
-                json.dumps(vars(public_args), sort_keys=True).encode("utf-8")
-            ).hexdigest()
-            project_name = self.config.tracker_project_name or "simpletuner-training"
-            tracker_run_name = (
-                self.config.tracker_run_name or "simpletuner-training-run"
-            )
-            self.accelerator.init_trackers(
-                project_name,
-                config=vars(public_args),
-                init_kwargs={
-                    "wandb": {
-                        "name": tracker_run_name,
-                        "id": f"{public_args_hash}",
-                        "resume": "allow",
-                        "allow_val_change": True,
-                    }
-                },
-            )
-            self._send_webhook_raw(
-                structured_data=public_args.__dict__,
-                message_type="training_config",
-            )
-
-    def resume_and_prepare(self):
-        self.init_optimizer()
-        lr_scheduler = self.init_lr_scheduler()
-        self.init_hooks()
-        self.init_prepare_models(lr_scheduler=lr_scheduler)
-        lr_scheduler = self.init_resume_checkpoint(lr_scheduler=lr_scheduler)
-        self.init_post_load_freeze()
-
-    def move_models(self, destination: str = "accelerator"):
-        target_device = "cpu"
-        if destination == "accelerator":
-            target_device = self.accelerator.device
-        logger.info(
-            f"Moving the {'U-net' if self.unet is not None else 'diffusion transformer'} to GPU in {self.config.weight_dtype if not self.config.is_quantized else self.config.base_model_precision} precision."
-        )
-        if self.unet is not None:
-            if self.config.is_quantized:
-                self.unet.to(target_device)
-            else:
-                self.unet.to(target_device, dtype=self.config.weight_dtype)
-        if self.transformer is not None:
-            if self.config.is_quantized:
-                self.transformer.to(target_device)
-            else:
-                self.transformer.to(target_device, dtype=self.config.weight_dtype)
-        if getattr(self.accelerator, "_lycoris_wrapped_network", None) is not None:
-            self.accelerator._lycoris_wrapped_network = (
-                self.accelerator._lycoris_wrapped_network.to(
-                    target_device, dtype=self.config.weight_dtype
-                )
-            )
-        if (
-            self.config.enable_xformers_memory_efficient_attention
-            and self.config.model_family
-            not in [
-                "sd3",
-                "pixart_sigma",
-                "flux",
-                "smoldit",
-                "kolors",
-            ]
-        ):
-            logger.info("Enabling xformers memory-efficient attention.")
-            if is_xformers_available():
-                import xformers  # type: ignore # noqa
-
-                if self.unet is not None:
-                    self.unet.enable_xformers_memory_efficient_attention()
-                if self.transformer is not None:
-                    self.transformer.enable_xformers_memory_efficient_attention()
-                if self.config.controlnet:
-                    self.controlnet.enable_xformers_memory_efficient_attention()
-            else:
-                raise ValueError(
-                    "xformers is not available. Make sure it is installed correctly"
-                )
-        elif self.config.enable_xformers_memory_efficient_attention:
-            logger.warning(
-                "xformers is not enabled, as it is incompatible with this model type."
-            )
-            self.config.enable_xformers_memory_efficient_attention = False
-
-        if self.config.controlnet:
-            self.controlnet.train()
-            logger.info(
-                f"Moving ControlNet to {target_device} in {self.config.weight_dtype} precision."
-            )
-            self.controlnet.to(device=target_device, dtype=self.config.weight_dtype)
-            if self.config.train_text_encoder:
-                logger.warning(
-                    "Unknown results will occur when finetuning the text encoder alongside ControlNet."
-                )
-
-    def mark_optimizer_train(self):
-        if is_lr_scheduler_disabled(self.config.optimizer) and hasattr(
-            self.optimizer, "train"
-        ):
-            # we typically have to call train() on the optim for schedulefree.
-            self.optimizer.train()
-
-    def mark_optimizer_eval(self):
-        if is_lr_scheduler_disabled(self.config.optimizer) and hasattr(
-            self.optimizer, "eval"
-        ):
-            # we typically have to call eval() on the optim for schedulefree before saving or running validations.
-            self.optimizer.eval()
-
-    def _send_webhook_msg(
-        self, message: str, message_level: str = "info", store_response: bool = False
-    ):
-        if type(message) is not str:
-            logger.error(
-                f"_send_webhook_msg received {type(message)} type message instead of str."
-            )
-            return False
-        if self.webhook_handler is None or not self.webhook_handler:
-            return
-        self.webhook_handler.send(
-            message=message, message_level=message_level, store_response=store_response
-        )
-
-    def _send_webhook_raw(
-        self,
-        structured_data: dict,
-        message_type: str,
-        message_level: str = "info",
-    ):
-        if type(structured_data) is not dict:
-            logger.error(
-                f"_send_webhook_msg received {type(structured_data)} type message instead of dict."
-            )
-            return False
-        if not self.webhook_handler:
-            return
-        self.webhook_handler.send_raw(
-            structured_data=structured_data,
-            message_type=message_type,
-            message_level=message_level,
-            job_id=self.job_id,
-        )
-
-    def _train_initial_msg(self):
-        initial_msg = "\n***** Running training *****"
-        initial_msg += f"\n-  Num batches = {self.config.total_num_batches}"
-        initial_msg += f"\n-  Num Epochs = {self.config.num_train_epochs}"
-        initial_msg += f"\n  - Current Epoch = {self.state['first_epoch']}"
-        initial_msg += f"\n-  Total train batch size (w. parallel, distributed & accumulation) = {self.config.total_batch_size}"
-        initial_msg += f"\n  - Instantaneous batch size per device = {self.config.train_batch_size}"
-        initial_msg += f"\n  - Gradient Accumulation steps = {self.config.gradient_accumulation_steps}"
-        initial_msg += f"\n-  Total optimization steps = {self.config.max_train_steps}"
-        if self.state["global_step"] > 1:
-            initial_msg += f"\n  - Steps completed: {self.state['global_step']}"
-        initial_msg += f"\n-  Total optimization steps remaining = {max(0, self.config.total_steps_remaining_at_start)}"
-        logger.info(initial_msg)
-        self._send_webhook_msg(message=initial_msg)
-        structured_data = {
-            "total_num_batches": self.config.total_num_batches,
-            "total_num_epochs": self.config.num_train_epochs,
-            "total_num_steps": self.config.max_train_steps,
-            "current_epoch": self.state["first_epoch"],
-            "total_batch_size": self.config.total_batch_size,
-            "micro_batch_size": self.config.train_batch_size,
-            "current_step": self.state["global_step"],
-            "remaining_num_steps": max(0, self.config.total_steps_remaining_at_start),
-        }
-        self._send_webhook_raw(
-            structured_data=structured_data, message_type="_train_initial_msg"
-        )
-
-    def _epoch_rollover(self, epoch):
-        if self.state["first_epoch"] == epoch:
-            return
-        logger.debug(
-            f"Just completed epoch {self.state['current_epoch']}. Beginning epoch {epoch}. Starting epoch was {self.state['first_epoch']}. Final epoch will be {self.config.num_train_epochs}"
-        )
-        for backend_id, backend in StateTracker.get_data_backends().items():
-            backend_config = StateTracker.get_data_backend_config(backend_id)
-            if (
-                backend_config.get("crop")
-                and backend_config.get("crop_aspect") == "random"
-                and "metadata_backend" in backend
-                and not self.config.aspect_bucket_disable_rebuild
-            ) or (
-                backend_config.get("vae_cache_clear_each_epoch")
-                and "vaecache" in backend
-            ):
-                # when the aspect ratio is random, we need to shuffle the dataset on each epoch.
-                if self.accelerator.is_main_process:
-                    # we only compute the aspect ratio indices on the main process.
-                    # we have to set read_only to False since we're generating a new, un-split list.
-                    # otherwise, we can't actually save the new cache to disk.
-                    backend["metadata_backend"].read_only = False
-                    # this will generate+save the new cache to the storage backend.
-                    backend["metadata_backend"].compute_aspect_ratio_bucket_indices(
-                        ignore_existing_cache=True
-                    )
-                self.accelerator.wait_for_everyone()
-                logger.info(f"Reloading cache for backend {backend_id}")
-                backend["metadata_backend"].reload_cache(set_config=False)
-                logger.info("Waiting for other threads to finish..")
-                self.accelerator.wait_for_everyone()
-                # we'll have to split the buckets between GPUs again now, so that the VAE cache distributes properly.
-                logger.info("Splitting buckets across GPUs")
-                backend["metadata_backend"].split_buckets_between_processes(
-                    gradient_accumulation_steps=self.config.gradient_accumulation_steps
-                )
-                # we have to rebuild the VAE cache if it exists.
-                if "vaecache" in backend:
-                    logger.info("Rebuilding VAE cache..")
-                    backend["vaecache"].rebuild_cache()
-                # no need to manually call metadata_backend.save_cache() here.
-        self.state["current_epoch"] = epoch
-        StateTracker.set_epoch(epoch)
-        if self.config.lr_scheduler == "cosine_with_restarts":
-            self.extra_lr_scheduler_kwargs["epoch"] = epoch
-
-    def _exit_on_signal(self):
-        if self.should_abort:
-            self._send_webhook_raw(
-                structured_data={"message": "Aborting training run."},
-                message_type="exit",
-            )
-            raise StopIteration("Training run received abort signal.")
-
-    def abort(self):
-        logger.info("Aborting training run.")
-        if self.bf is not None:
-            self.bf.stop_fetching()
-        # we should set should_abort = True on each data backend's vae cache, metadata, and text backend
-        for _, backend in StateTracker.get_data_backends().items():
-            if "vaecache" in backend:
-                logger.debug(f"Aborting VAE cache")
-                backend["vaecache"].should_abort = True
-            if "metadata_backend" in backend:
-                logger.debug(f"Aborting metadata backend")
-                backend["metadata_backend"].should_abort = True
-            if "text_backend" in backend:
-                logger.debug(f"Aborting text backend")
-                backend["text_backend"].should_abort = True
-            if "sampler" in backend:
-                logger.debug(f"Aborting sampler")
-                backend["sampler"].should_abort = True
-        self.should_abort = True
-
-    def model_predict(
-        self,
-        batch,
-        latents,
-        noisy_latents,
-        encoder_hidden_states,
-        added_cond_kwargs,
-        add_text_embeds,
-        timesteps,
-    ):
-        if self.config.controlnet:
-            training_logger.debug(
-                f"Extra conditioning dtype: {batch['conditioning_pixel_values'].dtype}"
-            )
-        if not self.config.disable_accelerator:
-            if self.config.controlnet:
-                # ControlNet conditioning.
-                controlnet_image = batch["conditioning_pixel_values"].to(
-                    dtype=self.config.weight_dtype
-                )
-                training_logger.debug(f"Image shape: {controlnet_image.shape}")
-                down_block_res_samples, mid_block_res_sample = self.controlnet(
-                    noisy_latents,
-                    timesteps,
-                    encoder_hidden_states=encoder_hidden_states,
-                    added_cond_kwargs=added_cond_kwargs,
-                    controlnet_cond=controlnet_image,
-                    return_dict=False,
-                )
-                # Predict the noise residual
-                if self.unet is not None:
-                    model_pred = self.unet(
-                        noisy_latents,
-                        timesteps,
-                        encoder_hidden_states=encoder_hidden_states,
-                        added_cond_kwargs=added_cond_kwargs,
-                        down_block_additional_residuals=[
-                            sample.to(dtype=self.config.weight_dtype)
-                            for sample in down_block_res_samples
-                        ],
-                        mid_block_additional_residual=mid_block_res_sample.to(
-                            dtype=self.config.weight_dtype
-                        ),
-                        return_dict=False,
-                    )[0]
-                if self.transformer is not None:
-                    raise Exception(
-                        "ControlNet predictions for transformer models are not yet implemented."
-                    )
-            elif self.config.model_family == "flux":
-                # handle guidance
-                packed_noisy_latents = pack_latents(
-                    noisy_latents,
-                    batch_size=latents.shape[0],
-                    num_channels_latents=latents.shape[1],
-                    height=latents.shape[2],
-                    width=latents.shape[3],
-                ).to(
-                    dtype=self.config.base_weight_dtype,
-                    device=self.accelerator.device,
-                )
-                if self.config.flux_guidance_mode == "mobius":
-                    guidance_scales = get_mobius_guidance(
-                        self.config,
-                        self.state["global_step"],
-                        self.config.num_update_steps_per_epoch,
-                        latents.shape[0],
-                        self.accelerator.device,
-                    )
-                elif self.config.flux_guidance_mode == "constant":
-                    guidance_scales = [
-                        float(self.config.flux_guidance_value)
-                    ] * latents.shape[0]
-
-                elif self.config.flux_guidance_mode == "random-range":
-                    # Generate a list of random values within the specified range for each latent
-                    guidance_scales = [
-                        random.uniform(
-                            self.config.flux_guidance_min,
-                            self.config.flux_guidance_max,
-                        )
-                        for _ in range(latents.shape[0])
-                    ]
-                self.guidance_values_list.append(guidance_scales)
-
-                # Now `guidance` will have different values for each latent in `latents`.
-                transformer_config = None
-                if hasattr(self.transformer, "module"):
-                    transformer_config = self.transformer.module.config
-                elif hasattr(self.transformer, "config"):
-                    transformer_config = self.transformer.config
-                if transformer_config is not None and getattr(
-                    transformer_config, "guidance_embeds", False
-                ):
-                    guidance = torch.tensor(
-                        guidance_scales, device=self.accelerator.device
-                    )
-                else:
-                    guidance = None
-                img_ids = prepare_latent_image_ids(
-                    latents.shape[0],
-                    latents.shape[2],
-                    latents.shape[3],
-                    self.accelerator.device,
-                    self.config.weight_dtype,
-                )
-                timesteps = (
-                    torch.tensor(timesteps)
-                    .expand(noisy_latents.shape[0])
-                    .to(device=self.accelerator.device)
-                    / 1000
-                )
-
-                text_ids = torch.zeros(
-                    batch["prompt_embeds"].shape[1],
-                    3,
-                ).to(
-                    device=self.accelerator.device,
-                    dtype=self.config.base_weight_dtype,
-                )
-                training_logger.debug(
-                    "DTypes:"
-                    f"\n-> Text IDs shape: {text_ids.shape if hasattr(text_ids, 'shape') else None}, dtype: {text_ids.dtype if hasattr(text_ids, 'dtype') else None}"
-                    f"\n-> Image IDs shape: {img_ids.shape if hasattr(img_ids, 'shape') else None}, dtype: {img_ids.dtype if hasattr(img_ids, 'dtype') else None}"
-                    f"\n-> Timesteps shape: {timesteps.shape if hasattr(timesteps, 'shape') else None}, dtype: {timesteps.dtype if hasattr(timesteps, 'dtype') else None}"
-                    f"\n-> Guidance: {guidance}"
-                    f"\n-> Packed Noisy Latents shape: {packed_noisy_latents.shape if hasattr(packed_noisy_latents, 'shape') else None}, dtype: {packed_noisy_latents.dtype if hasattr(packed_noisy_latents, 'dtype') else None}"
-                )
-
-                flux_transformer_kwargs = {
-                    "hidden_states": packed_noisy_latents,
-                    # YiYi notes: divide it by 1000 for now because we scale it by 1000 in the transforme rmodel (we should not keep it but I want to keep the inputs same for the model for testing)
-                    "timestep": timesteps,
-                    "guidance": guidance,
-                    "pooled_projections": batch["add_text_embeds"].to(
-                        device=self.accelerator.device,
-                        dtype=self.config.base_weight_dtype,
-                    ),
-                    "encoder_hidden_states": batch["prompt_embeds"].to(
-                        device=self.accelerator.device,
-                        dtype=self.config.base_weight_dtype,
-                    ),
-                    "txt_ids": text_ids.to(
-                        device=self.accelerator.device,
-                        dtype=self.config.base_weight_dtype,
-                    ),
-                    "img_ids": img_ids,
-                    "joint_attention_kwargs": None,
-                    "return_dict": False,
-                }
-                if self.config.flux_attention_masked_training:
-                    flux_transformer_kwargs["attention_mask"] = batch[
-                        "encoder_attention_mask"
-                    ]
-                    if flux_transformer_kwargs["attention_mask"] is None:
-                        raise ValueError(
-                            "No attention mask was discovered when attempting validation - this means you need to recreate your text embed cache."
-                        )
-
-                model_pred = self.transformer(**flux_transformer_kwargs)[0]
-
-            elif self.config.model_family == "sd3":
-                # Stable Diffusion 3 uses a MM-DiT model where the VAE-produced
-                #  image embeds are passed in with the TE-produced text embeds.
-                model_pred = self.transformer(
-                    hidden_states=noisy_latents.to(
-                        device=self.accelerator.device,
-                        dtype=self.config.base_weight_dtype,
-                    ),
-                    timestep=timesteps,
-                    encoder_hidden_states=encoder_hidden_states.to(
-                        device=self.accelerator.device,
-                        dtype=self.config.base_weight_dtype,
-                    ),
-                    pooled_projections=add_text_embeds.to(
-                        device=self.accelerator.device,
-                        dtype=self.config.weight_dtype,
-                    ),
-                    return_dict=False,
-                )[0]
-            elif self.config.model_family == "pixart_sigma":
-                model_pred = self.transformer(
-                    noisy_latents,
-                    encoder_hidden_states=encoder_hidden_states,
-                    encoder_attention_mask=batch["encoder_attention_mask"],
-                    timestep=timesteps,
-                    added_cond_kwargs=added_cond_kwargs,
-                    return_dict=False,
-                )[0]
-                model_pred = model_pred.chunk(2, dim=1)[0]
-            elif self.config.model_family == "smoldit":
-                first_latent_shape = noisy_latents.shape
-                height = first_latent_shape[1] * 8
-                width = first_latent_shape[2] * 8
-                grid_height = height // 8 // self.transformer.config.patch_size
-                grid_width = width // 8 // self.transformer.config.patch_size
-                base_size = 512 // 8 // self.transformer.config.patch_size
-                grid_crops_coords = get_resize_crop_region_for_grid(
-                    (grid_height, grid_width), (base_size, base_size)
-                )
-                inputs = {
-                    "hidden_states": noisy_latents,
-                    "timestep": timesteps,
-                    "encoder_hidden_states": encoder_hidden_states,
-                    "encoder_attention_mask": batch["encoder_attention_mask"],
-                    "image_rotary_emb": get_2d_rotary_pos_embed(
-                        self.transformer.inner_dim
-                        // self.transformer.config.num_attention_heads,
-                        grid_crops_coords,
-                        (grid_height, grid_width),
-                    ),
-                }
-                model_pred = self.transformer(**inputs).sample
-            elif self.unet is not None:
-                if self.config.model_family == "legacy":
-                    # SD 1.5 or 2.x
-                    model_pred = self.unet(
-                        noisy_latents,
-                        timesteps,
-                        encoder_hidden_states,
-                    ).sample
-                else:
-                    # SDXL, Kolors, other default unet prediction.
-                    model_pred = self.unet(
-                        noisy_latents,
-                        timesteps,
-                        encoder_hidden_states,
-                        added_cond_kwargs=added_cond_kwargs,
-                    ).sample
-            else:
-                raise Exception("Unknown error occurred, no prediction could be made.")
-
-            if self.config.model_family == "flux":
-                model_pred = unpack_latents(
-                    model_pred,
-                    height=latents.shape[2] * 8,
-                    width=latents.shape[3] * 8,
-                    vae_scale_factor=16,
-                )
-        else:
-            # Dummy model prediction for debugging.
-            model_pred = torch.randn_like(noisy_latents)
-
-        return model_pred
-
-    def forward(self, x):
-        pass
-
-    def on_train_start(self):
-        self.init_trackers()
-        self._train_initial_msg()
-
-        if self.config.validation_on_startup and self.state["global_step"] <= 1:
-            # Just in Case.
-            self.mark_optimizer_eval()
-            # normal run-of-the-mill validation on startup.
-            if self.validation is not None:
-                self.validation.run_validations(validation_type="base_model", step=0)
-
-        self.mark_optimizer_train()
-
-        # Only show the progress bar once on each machine.
-        show_progress_bar = True
-        if not self.accelerator.is_local_main_process:
-            show_progress_bar = False
-        self.progress_bar = tqdm(
-            range(0, self.config.max_train_steps),
-            disable=not show_progress_bar,
-            initial=self.state["global_step"],
-            desc=f"Epoch {self.state['first_epoch']}/{self.config.num_train_epochs} Steps",
-            ncols=125,
-        )
-        self.accelerator.wait_for_everyone()
-
-        # Some values that are required to be initialised later.
-        self.step = self.state["global_step"]
-        self.training_luminance_values = []
-        self.current_epoch_step = None
-        self.bf, self.fetch_thread = None, None
-        self.iterator_fn = random_dataloader_iterator
-
-    def on_train_epoch_start(self):
-        if self.state["current_epoch"] > self.config.num_train_epochs + 1:
-            # This might immediately end training, but that's useful for simply exporting the model.
-            logger.info(
-                f"Training run is complete ({self.config.num_train_epochs}/{self.config.num_train_epochs} epochs, {self.state['global_step']}/{self.config.max_train_steps} steps)."
-            )
-
-        self._epoch_rollover(self.current_epoch)
-        if self.config.controlnet:
-            self.controlnet.train()
-            self.training_models = [self.controlnet]
-        else:
-            if self.unet is not None:
-                self.unet.train()
-                self.training_models = [self.unet]
-            if self.transformer is not None:
-                self.transformer.train()
-                self.training_models = [self.transformer]
-        if (
-            "lora" in self.config.model_type
-            and self.config.train_text_encoder
-            and "standard" in self.config.lora_type.lower()
-        ):
-            self.text_encoder_1.train()
-            self.text_encoder_2.train()
-            self.training_models.append(self.text_encoder_1)
-            self.training_models.append(self.text_encoder_2)
-
-        if self.current_epoch_step is not None:
-            # We are resetting to the next epoch, if it is not none.
-            self.current_epoch_step = 0
-        else:
-            # If it's None, we need to calculate the current epoch step based on the current global step.
-            self.current_epoch_step = (
-                self.state["global_step"] % self.config.num_update_steps_per_epoch
-            )
-        train_backends = {}
-        for backend_id, backend in StateTracker.get_data_backends().items():
-            if (
-                StateTracker.backend_status(backend_id)
-                or "train_dataloader" not in backend
-            ):
-                # Exclude exhausted backends.
-                logger.debug(
-                    f"Excluding backend: {backend_id}, as it is exhausted? {StateTracker.backend_status(backend_id)} or not found {('train_dataloader' not in backend)}"
-                )
-                continue
-            train_backends[backend_id] = backend["train_dataloader"]
-        # Begin dataloader prefetch, if enabled.
-        self.iterator_args = [train_backends]
-        if self.config.dataloader_prefetch:
-            self.iterator_args = []
-            if self.bf is not None:
-                self.bf.stop_fetching()
-            self.bf = BatchFetcher(
-                datasets=train_backends,
-                max_size=self.config.dataloader_prefetch_qlen,
-                step=self.step,
-            )
-            if self.fetch_thread is not None:
-                self.fetch_thread.join()
-            self.fetch_thread = self.bf.start_fetching()
-            self.iterator_fn = self.bf.next_response
-
-    def training_step(self, batch, batch_idx):
-        self._exit_on_signal()
-        self.step += 1
-        batch = self.iterator_fn(self.step, *self.iterator_args)
-        training_logger.debug(f"Iterator: {self.iterator_fn}")
-        if self.config.lr_scheduler == "cosine_with_restarts":
-            self.extra_lr_scheduler_kwargs["step"] = self.state["global_step"]
-
-        if self.accelerator.is_main_process:
-            self.progress_bar.set_description(
-                f"Epoch {self.state['current_epoch']}/{self.config.num_train_epochs}, Steps"
-            )
-
-        # If we receive a False from the enumerator, we know we reached the next epoch.
-        if batch is False:
-            logger.debug(f"Reached the end of epoch {self.current_epoch}")
-            loss_output_dir = os.path.join(self.config.output_dir, "cache")
-            loss = torch.load(os.path.join(loss_output_dir, "loss_tensor.pt"))
-            return loss
-
-        if batch is None:
-            import traceback
-
-            raise ValueError(
-                f"Received a None batch, which is not a good thing. Traceback: {traceback.format_exc()}"
-            )
-
-        # Add the current batch of training data's avg luminance to a list.
-        if "batch_luminance" in batch:
-            self.training_luminance_values.append(batch["batch_luminance"])
-
-        with self.accelerator.accumulate(self.training_models):
-            training_logger.debug("Sending latent batch to GPU.")
-            latents = batch["latent_batch"].to(
-                self.accelerator.device, dtype=self.config.weight_dtype
-            )
-
-            # Sample noise that we'll add to the latents - self.config.noise_offset might need to be set to 0.1 by default.
-            noise = torch.randn_like(latents)
-            if not self.config.flow_matching:
-                if self.config.offset_noise:
-                    if (
-                        self.config.noise_offset_probability == 1.0
-                        or random.random() < self.config.noise_offset_probability
-                    ):
-                        noise = noise + self.config.noise_offset * torch.randn(
-                            latents.shape[0],
-                            latents.shape[1],
-                            1,
-                            1,
-                            device=latents.device,
-                        )
-
-            bsz = latents.shape[0]
-            if int(bsz) != int(self.config.train_batch_size):
-                logger.error(
-                    f"Received {bsz} latents, but expected {self.config.train_batch_size}. Processing short batch."
-                )
-            training_logger.debug(f"Working on batch size: {bsz}")
-            if self.config.flow_matching:
-                if (
-                    not self.config.flux_fast_schedule
-                    and not self.config.flux_use_beta_schedule
-                ):
-                    # imported from cloneofsimo's minRF trainer: https://github.com/cloneofsimo/minRF
-                    # also used by: https://github.com/XLabs-AI/x-flux/tree/main
-                    # and: https://github.com/kohya-ss/sd-scripts/commit/8a0f12dde812994ec3facdcdb7c08b362dbceb0f
-                    sigmas = torch.sigmoid(
-                        self.config.flow_matching_sigmoid_scale
-                        * torch.randn((bsz,), device=self.accelerator.device)
-                    )
-                    sigmas = apply_flux_schedule_shift(
-                        self.config, self.noise_scheduler, sigmas, noise
-                    )
-                elif self.config.flux_use_beta_schedule:
-                    alpha = self.config.flux_beta_schedule_alpha
-                    beta = self.config.flux_beta_schedule_beta
-
-                    # Create a Beta distribution instance
-                    beta_dist = Beta(alpha, beta)
-
-                    # Sample from the Beta distribution
-                    sigmas = beta_dist.sample((bsz,)).to(device=self.accelerator.device)
-
-                    sigmas = apply_flux_schedule_shift(
-                        self.config, self.noise_scheduler, sigmas, noise
-                    )
-                else:
-                    # fast schedule can only use these sigmas, and they can be sampled up to batch size times
-                    available_sigmas = [
-                        1.0,
-                        1.0,
-                        1.0,
-                        1.0,
-                        1.0,
-                        1.0,
-                        1.0,
-                        0.75,
-                        0.5,
-                        0.25,
-                    ]
-                    sigmas = torch.tensor(
-                        random.choices(available_sigmas, k=bsz),
-                        device=self.accelerator.device,
-                    )
-                timesteps = sigmas * 1000.0
-                sigmas = sigmas.view(-1, 1, 1, 1)
-            else:
-                # Sample a random timestep for each image, potentially biased by the timestep weights.
-                # Biasing the timestep weights allows us to spend less time training irrelevant timesteps.
-                weights = generate_timestep_weights(
-                    self.config, self.noise_scheduler.config.num_train_timesteps
-                ).to(self.accelerator.device)
-                # Instead of uniformly sampling the timestep range, we'll split our weights and schedule into bsz number of segments.
-                # This enables more broad sampling and potentially more effective training.
-                if bsz > 1 and not self.config.disable_segmented_timestep_sampling:
-                    timesteps = segmented_timestep_selection(
-                        actual_num_timesteps=self.noise_scheduler.config.num_train_timesteps,
-                        bsz=bsz,
-                        weights=weights,
-                        use_refiner_range=StateTracker.is_sdxl_refiner()
-                        and not StateTracker.get_args().sdxl_refiner_uses_full_range,
-                    ).to(self.accelerator.device)
-                else:
-                    timesteps = torch.multinomial(weights, bsz, replacement=True).long()
-
-            # Prepare the data for the scatter plot
-            for timestep in timesteps.tolist():
-                self.timesteps_buffer.append((self.state["global_step"], timestep))
-
-            if self.config.input_perturbation != 0 and (
-                not self.config.input_perturbation_steps
-                or self.state["global_step"] < self.config.input_perturbation_steps
-            ):
-                input_perturbation = self.config.input_perturbation
-                if self.config.input_perturbation_steps:
-                    input_perturbation *= 1.0 - (
-                        self.state["global_step"] / self.config.input_perturbation_steps
-                    )
-                input_noise = noise + input_perturbation * torch.randn_like(latents)
-            else:
-                input_noise = noise
-
-            if self.config.flow_matching:
-                noisy_latents = (1 - sigmas) * latents + sigmas * input_noise
-            else:
-                # Add noise to the latents according to the noise magnitude at each timestep
-                # (this is the forward diffusion process)
-                noisy_latents = self.noise_scheduler.add_noise(
-                    latents.float(), input_noise.float(), timesteps
-                ).to(
-                    device=self.accelerator.device,
-                    dtype=self.config.weight_dtype,
-                )
-
-            encoder_hidden_states = batch["prompt_embeds"].to(
-                dtype=self.config.weight_dtype, device=self.accelerator.device
-            )
-            training_logger.debug(
-                f"Encoder hidden states: {encoder_hidden_states.shape}"
-            )
-
-            add_text_embeds = batch["add_text_embeds"]
-            training_logger.debug(
-                f"Pooled embeds: {add_text_embeds.shape if add_text_embeds is not None else None}"
-            )
-            # Get the target for loss depending on the prediction type
-            if self.config.flow_matching:
-                # This is the flow-matching target for vanilla SD3.
-                # If self.config.flow_matching_loss == "diffusion", we will instead use v_prediction (see below)
-                if self.config.flow_matching_loss == "diffusers":
-                    target = latents
-                elif self.config.flow_matching_loss == "compatible":
-                    target = noise - latents
-                elif self.config.flow_matching_loss == "sd35":
-                    sigma_reshaped = sigmas.view(
-                        -1, 1, 1, 1
-                    )  # Ensure sigma has the correct shape
-                    target = (noisy_latents - latents) / sigma_reshaped
-
-            elif self.noise_scheduler.config.prediction_type == "epsilon":
-                target = noise
-            elif self.noise_scheduler.config.prediction_type == "v_prediction" or (
-                self.config.flow_matching
-                and self.config.flow_matching_loss == "diffusion"
-            ):
-                # When not using flow-matching, train on velocity prediction objective.
-                target = self.noise_scheduler.get_velocity(latents, noise, timesteps)
-            elif self.noise_scheduler.config.prediction_type == "sample":
-                # We set the target to latents here, but the model_pred will return the noise sample prediction.
-                # We will have to subtract the noise residual from the prediction to get the target sample.
-                target = latents
-            else:
-                raise ValueError(
-                    f"Unknown prediction type {self.noise_scheduler.config.prediction_type}"
-                    "Supported types are 'epsilon', `sample`, and 'v_prediction'."
-                )
-
-            added_cond_kwargs = None
-            # Predict the noise residual and compute loss
-            if (
-                StateTracker.get_model_family() == "sdxl"
-                or self.config.model_family == "kolors"
-            ):
-                added_cond_kwargs = {
-                    "text_embeds": add_text_embeds.to(
-                        device=self.accelerator.device,
-                        dtype=self.config.weight_dtype,
-                    ),
-                    "time_ids": batch["batch_time_ids"].to(
-                        device=self.accelerator.device,
-                        dtype=self.config.weight_dtype,
-                    ),
-                }
-            elif (
-                self.config.model_family == "pixart_sigma"
-                or self.config.model_family == "smoldit"
-            ):
-                # pixart requires an input of {"resolution": .., "aspect_ratio": ..}
-                if "batch_time_ids" in batch:
-                    added_cond_kwargs = batch["batch_time_ids"]
-                batch["encoder_attention_mask"] = batch["encoder_attention_mask"].to(
-                    device=self.accelerator.device,
-                    dtype=self.config.weight_dtype,
-                )
-
-            # a marker to know whether we had a model capable of regularised data training.
-            handled_regularisation = False
-            is_regularisation_data = batch.get("is_regularisation_data", False)
-            if is_regularisation_data and self.config.model_type == "lora":
-                training_logger.debug("Predicting parent model residual.")
-                handled_regularisation = True
-                with torch.no_grad():
-                    if self.config.lora_type.lower() == "lycoris":
-                        training_logger.debug(
-                            "Detaching LyCORIS adapter for parent prediction."
-                        )
-                        self.accelerator._lycoris_wrapped_network.restore()
-                    else:
-                        raise ValueError(
-                            f"Cannot train parent-student networks on {self.config.lora_type} model. Only LyCORIS is supported."
-                        )
-                    target = self.model_predict(
-                        batch=batch,
-                        latents=latents,
-                        noisy_latents=noisy_latents,
-                        encoder_hidden_states=encoder_hidden_states,
-                        added_cond_kwargs=added_cond_kwargs,
-                        add_text_embeds=add_text_embeds,
-                        timesteps=timesteps,
-                    )
-                    if self.config.lora_type.lower() == "lycoris":
-                        training_logger.debug(
-                            "Attaching LyCORIS adapter for student prediction."
-                        )
-                        self.accelerator._lycoris_wrapped_network.apply_to()
-
-            training_logger.debug("Predicting noise residual.")
-            model_pred = self.model_predict(
-                batch=batch,
-                latents=latents,
-                noisy_latents=noisy_latents,
-                encoder_hidden_states=encoder_hidden_states,
-                added_cond_kwargs=added_cond_kwargs,
-                add_text_embeds=add_text_embeds,
-                timesteps=timesteps,
-            )
-
-            # x-prediction requires that we now subtract the noise residual from the prediction to get the target sample.
-            if (
-                hasattr(self.noise_scheduler, "config")
-                and hasattr(self.noise_scheduler.config, "prediction_type")
-                and self.noise_scheduler.config.prediction_type == "sample"
-            ):
-                model_pred = model_pred - noise
-
-            parent_loss = None
-
-            # Compute the per-pixel loss without reducing over spatial dimensions
-            if self.config.flow_matching:
-                # For flow matching, compute the per-pixel squared differences
-                loss = (
-                    model_pred.float() - target.float()
-                ) ** 2  # Shape: (batch_size, C, H, W)
-            elif self.config.snr_gamma is None or self.config.snr_gamma == 0:
-                training_logger.debug("Calculating loss")
-                loss = self.config.snr_weight * F.mse_loss(
-                    model_pred.float(), target.float(), reduction="none"
-                )  # Shape: (batch_size, C, H, W)
-            else:
-                # Compute loss-weights as per Section 3.4 of https://arxiv.org/abs/2303.09556.
-                # Since we predict the noise instead of x_0, the original formulation is slightly changed.
-                # This is discussed in Section 4.2 of the same paper.
-                training_logger.debug("Using min-SNR loss")
-                snr = compute_snr(timesteps, self.noise_scheduler)
-                snr_divisor = snr
-                if self.noise_scheduler.config.prediction_type == "v_prediction" or (
-                    self.config.flow_matching
-                    and self.config.flow_matching_loss == "diffusion"
-                ):
-                    snr_divisor = snr + 1
-
-                training_logger.debug(
-                    "Calculating MSE loss weights using SNR as divisor"
-                )
-                mse_loss_weights = (
-                    torch.stack(
-                        [
-                            snr,
-                            self.config.snr_gamma * torch.ones_like(timesteps),
-                        ],
-                        dim=1,
-                    ).min(dim=1)[0]
-                    / snr_divisor
-                )  # Shape: (batch_size,)
-
-                # Compute the per-pixel MSE loss without reduction
-                loss = F.mse_loss(
-                    model_pred.float(), target.float(), reduction="none"
-                )  # Shape: (batch_size, C, H, W)
-
-                # Reshape mse_loss_weights for broadcasting and apply to loss
-                mse_loss_weights = mse_loss_weights.view(
-                    -1, 1, 1, 1
-                )  # Shape: (batch_size, 1, 1, 1)
-                loss = loss * mse_loss_weights  # Shape: (batch_size, C, H, W)
-
-            # Mask the loss using any conditioning data
-            conditioning_type = batch.get("conditioning_type")
-            if conditioning_type == "mask":
-                # Adapted from:
-                # https://github.com/kohya-ss/sd-scripts/blob/main/library/custom_train_functions.py#L482
-                mask_image = (
-                    batch["conditioning_pixel_values"]
-                    .to(dtype=loss.dtype, device=loss.device)[:, 0]
-                    .unsqueeze(1)
-                )  # Shape: (batch_size, 1, H', W')
-                mask_image = torch.nn.functional.interpolate(
-                    mask_image, size=loss.shape[2:], mode="area"
-                )  # Resize to match loss spatial dimensions
-                mask_image = mask_image / 2 + 0.5  # Normalize to [0,1]
-                loss = loss * mask_image  # Element-wise multiplication
-
-            # Reduce the loss by averaging over channels and spatial dimensions
-            loss = loss.mean(
-                dim=list(range(1, len(loss.shape)))
-            )  # Shape: (batch_size,)
-
-            # Further reduce the loss by averaging over the batch dimension
-            loss = loss.mean()  # Scalar value
-
-            if is_regularisation_data:
-                parent_loss = loss
-
-            # Gather the losses across all processes for logging (if using distributed training)
-            avg_loss = self.accelerator.gather(
-                loss.repeat(self.config.train_batch_size)
-            ).mean()
-            self.train_loss += avg_loss.item() / self.config.gradient_accumulation_steps
-            # Backpropagate
-            grad_norm = None
-            if not self.config.disable_accelerator:
-                training_logger.debug("Backwards pass.")
-                # self.accelerator.backward(loss)
-                loss.backward(retain_graph=True)
-                # loss_tensor save dir
-                loss_output_dir = os.path.join(self.config.output_dir, "cache")
-                if not os.path.exists(loss_output_dir):
-                    os.makedirs(loss_output_dir)
-                torch.save(loss, os.path.join(loss_output_dir, "loss_tensor.pt"))
-
-                if (
-                    self.config.optimizer != "adam_bfloat16"
-                    and self.config.gradient_precision == "fp32"
-                ):
-                    # After backward, convert gradients to fp32 for stable accumulation
-                    for param in self.params_to_optimize:
-                        if param.grad is not None:
-                            param.grad.data = param.grad.data.to(torch.float32)
-
-                if (
-                    self.accelerator.sync_gradients
-                    and self.config.optimizer != "optimi-stableadamw"
-                    and self.config.max_grad_norm > 0
-                ):
-                    # StableAdamW does not need clipping, similar to Adafactor.
-                    grad_norm = self.accelerator.clip_grad_norm_(
-                        self.params_to_optimize, self.config.max_grad_norm
-                    )
-                training_logger.debug("Stepping components forward.")
-                if self.config.optimizer_release_gradients:
-                    step_offset = 0  # simpletuner indexes steps from 1.
-                    should_not_release_gradients = (
-                        self.step + step_offset
-                    ) % self.config.gradient_accumulation_steps != 0
-                    training_logger.debug(
-                        f"step: {self.step}, should_not_release_gradients: {should_not_release_gradients}, self.config.optimizer_release_gradients: {self.config.optimizer_release_gradients}"
-                    )
-                    self.optimizer.optimizer_accumulation = should_not_release_gradients
-                else:
-                    self.optimizer.step()
-                self.optimizer.zero_grad(set_to_none=self.config.set_grads_to_none)
-
-        # Checks if the accelerator has performed an optimization step behind the scenes
-        wandb_logs = {}
-        if self.accelerator.sync_gradients:
-            try:
-                if self.config.is_schedulefree:
-                    # hackjob method of retrieving LR from accelerated optims
-                    self.lr = StateTracker.get_last_lr()
-                else:
-                    self.lr_scheduler.step(**self.extra_lr_scheduler_kwargs)
-                    self.lr = self.lr_scheduler.get_last_lr()[0]
-            except Exception as e:
-                logger.error(
-                    f"Failed to get the last learning rate from the scheduler. Error: {e}"
-                )
-            wandb_logs = {
-                "train_loss": self.train_loss,
-                "optimization_loss": loss,
-                "learning_rate": self.lr,
-                "epoch": self.current_epoch,
-            }
-            if parent_loss is not None:
-                wandb_logs["regularisation_loss"] = parent_loss
-            if self.config.model_family == "flux" and self.guidance_values_list:
-                # avg the values
-                guidance_values = torch.tensor(self.guidance_values_list).mean()
-                wandb_logs["mean_cfg"] = guidance_values.item()
-                self.guidance_values_list = []
-            if grad_norm is not None:
-                wandb_logs["grad_norm"] = grad_norm
-            self.progress_bar.update(1)
-            self.state["global_step"] += 1
-            self.current_epoch_step += 1
-            StateTracker.set_global_step(self.state["global_step"])
-
-            ema_decay_value = "None (EMA not in use)"
-            if self.config.use_ema:
-                if self.ema_model is not None:
-                    training_logger.debug("Stepping EMA forward")
-                    self.ema_model.step(
-                        parameters=(
-                            self.unet.parameters()
-                            if self.unet is not None
-                            else self.transformer.parameters()
-                        ),
-                        global_step=self.state["global_step"],
-                    )
-                    wandb_logs["ema_decay_value"] = self.ema_model.get_decay()
-                self.accelerator.wait_for_everyone()
-
-            # Log scatter plot to wandb
-            if self.config.report_to == "wandb" and self.accelerator.is_main_process:
-                # Prepare the data for the scatter plot
-                data = [
-                    [iteration, timestep]
-                    for iteration, timestep in self.timesteps_buffer
-                ]
-                table = wandb.Table(data=data, columns=["global_step", "timestep"])
-                wandb_logs["timesteps_scatter"] = wandb.plot.scatter(
-                    table,
-                    "global_step",
-                    "timestep",
-                    title="Timestep distribution by step",
-                )
-
-            # Clear buffers
-            self.timesteps_buffer = []
-
-            # Average out the luminance values of each batch, so that we can store that in this step.
-            avg_training_data_luminance = sum(self.training_luminance_values) / len(
-                self.training_luminance_values
-            )
-            wandb_logs["train_luminance"] = avg_training_data_luminance
-
-            logger.debug(
-                f"Step {self.state['global_step']} of {self.config.max_train_steps}: loss {loss.item()}, lr {self.lr}, epoch {self.current_epoch}/{self.config.num_train_epochs}, ema_decay_value {ema_decay_value}, train_loss {self.train_loss}"
-            )
-            self.accelerator.log(
-                wandb_logs,
-                step=self.state["global_step"],
-            )
-            webhook_pending_msg = f"Step {self.state['global_step']} of {self.config.max_train_steps}: loss {round(loss.item(), 4)}, lr {self.lr}, epoch {self.current_epoch}/{self.config.num_train_epochs}, ema_decay_value {ema_decay_value}, train_loss {round(self.train_loss, 4)}"
-
-            # Reset some values for the next go.
-            self.training_luminance_values = []
-            self.train_loss = 0.0
-
-            if (
-                self.config.webhook_reporting_interval is not None
-                and self.state["global_step"] % self.config.webhook_reporting_interval
-                == 0
-            ):
-                structured_data = {
-                    "state": self.state,
-                    "loss": round(self.train_loss, 4),
-                    "parent_loss": parent_loss,
-                    "learning_rate": self.lr,
-                    "epoch": self.current_epoch,
-                    "final_epoch": self.config.num_train_epochs,
-                }
-                self._send_webhook_raw(
-                    structured_data=structured_data, message_type="train"
-                )
-            if self.state["global_step"] % self.config.checkpointing_steps == 0:
-                self._send_webhook_msg(
-                    message=f"Checkpoint: `{webhook_pending_msg}`",
-                    message_level="info",
-                )
-                if self.accelerator.is_main_process:
-                    # _before_ saving state, check if this save would set us over the `checkpoints_total_limit`
-                    if self.config.checkpoints_total_limit is not None:
-                        checkpoints = os.listdir(self.config.output_dir)
-                        checkpoints = [
-                            d for d in checkpoints if d.startswith("checkpoint")
-                        ]
-                        checkpoints = sorted(
-                            checkpoints, key=lambda x: int(x.split("-")[1])
-                        )
-
-                        # before we save the new checkpoint, we need to have at _most_ `checkpoints_total_limit - 1` checkpoints
-                        if len(checkpoints) >= self.config.checkpoints_total_limit:
-                            num_to_remove = (
-                                len(checkpoints)
-                                - self.config.checkpoints_total_limit
-                                + 1
-                            )
-                            removing_checkpoints = checkpoints[0:num_to_remove]
-                            logger.debug(
-                                f"{len(checkpoints)} checkpoints already exist, removing {len(removing_checkpoints)} checkpoints"
-                            )
-                            logger.debug(
-                                f"removing checkpoints: {', '.join(removing_checkpoints)}"
-                            )
-
-                            for removing_checkpoint in removing_checkpoints:
-                                removing_checkpoint = os.path.join(
-                                    self.config.output_dir, removing_checkpoint
-                                )
-                                try:
-                                    shutil.rmtree(removing_checkpoint)
-                                except Exception as e:
-                                    logger.error(
-                                        f"Failed to remove directory: {removing_checkpoint}"
-                                    )
-                                    print(e)
-
-                if (
-                    self.accelerator.is_main_process
-                    or self.config.use_deepspeed_optimizer
-                ):
-                    save_path = os.path.join(
-                        self.config.output_dir,
-                        f"checkpoint-{self.state['global_step']}",
-                    )
-                    print("\n")
-                    # schedulefree optim needs the optimizer to be in eval mode to save the state (and then back to train after)
-                    self.mark_optimizer_eval()
-                    self.accelerator.save_state(save_path)
-                    self.mark_optimizer_train()
-                    for _, backend in StateTracker.get_data_backends().items():
-                        if "sampler" in backend:
-                            logger.debug(f"Backend: {backend}")
-                            backend["sampler"].save_state(
-                                state_path=os.path.join(
-                                    save_path,
-                                    self.model_hooks.training_state_path,
-                                ),
-                            )
-
-            if (
-                self.config.accelerator_cache_clear_interval is not None
-                and self.state["global_step"]
-                % self.config.accelerator_cache_clear_interval
-                == 0
-            ):
-                reclaim_memory()
-
-        logs = {
-            "step_loss": loss.detach().item(),
-            "lr": float(self.lr),
-        }
-        if "mean_cfg" in wandb_logs:
-            logs["mean_cfg"] = wandb_logs["mean_cfg"]
-
-        self.progress_bar.set_postfix(**logs)
-        self.mark_optimizer_eval()
-        if self.validation is not None:
-            self.validation.run_validations(
-                validation_type="intermediary", step=self.step
-            )
-        self.mark_optimizer_train()
-        if (
-            self.config.push_to_hub
-            and self.config.push_checkpoints_to_hub
-            and self.state["global_step"] % self.config.checkpointing_steps == 0
-            and self.step % self.config.gradient_accumulation_steps == 0
-            and self.state["global_step"] > self.state["global_resume_step"]
-        ):
-            if self.accelerator.is_main_process:
-                try:
-                    self.hub_manager.upload_latest_checkpoint(
-                        validation_images=(
-                            getattr(self.validation, "validation_images")
-                            if self.validation is not None
-                            else None
-                        ),
-                        webhook_handler=self.webhook_handler,
-                    )
-                except Exception as e:
-                    logger.error(f"Error uploading to hub: {e}, continuing training.")
-        self.accelerator.wait_for_everyone()
-
-        return loss
-
-    def on_train_end(self):
-        self.accelerator.wait_for_everyone()
-        validation_images = None
-        if self.accelerator.is_main_process:
-            self.mark_optimizer_eval()
-            if self.validation is not None:
-                validation_images = self.validation.run_validations(
-                    validation_type="final",
-                    step=self.state["global_step"],
-                    force_evaluation=True,
-                    skip_execution=True,
-                ).validation_images
-            if self.unet is not None:
-                self.unet = unwrap_model(self.accelerator, self.unet)
-            if self.transformer is not None:
-                self.transformer = unwrap_model(self.accelerator, self.transformer)
-            if (
-                "lora" in self.config.model_type
-                and "standard" == self.config.lora_type.lower()
-            ):
-                if self.transformer is not None:
-                    transformer_lora_layers = get_peft_model_state_dict(
-                        self.transformer
-                    )
-                elif self.unet is not None:
-                    unet_lora_layers = convert_state_dict_to_diffusers(
-                        get_peft_model_state_dict(self.unet)
-                    )
-                else:
-                    raise Exception(
-                        "Couldn't locate the unet or transformer model for export."
-                    )
-
-                if self.config.train_text_encoder:
-                    self.text_encoder_1 = self.accelerator.unwrap_model(
-                        self.text_encoder_1
-                    )
-                    self.text_encoder_lora_layers = convert_state_dict_to_diffusers(
-                        get_peft_model_state_dict(self.text_encoder_1)
-                    )
-                    if self.text_encoder_2 is not None:
-                        self.text_encoder_2 = self.accelerator.unwrap_model(
-                            self.text_encoder_2
-                        )
-                        text_encoder_2_lora_layers = convert_state_dict_to_diffusers(
-                            get_peft_model_state_dict(self.text_encoder_2)
-                        )
-                        if self.text_encoder_3 is not None:
-                            text_encoder_3 = self.accelerator.unwrap_model(
-                                self.text_encoder_3
-                            )
-                else:
-                    text_encoder_lora_layers = None
-                    text_encoder_2_lora_layers = None
-
-                if self.config.model_family == "flux":
-                    from diffusers.pipelines import FluxPipeline
-
-                    print("saving lora...")
-                    self.save_lora()
-
-                del self.unet
-                del self.transformer
-                del text_encoder_lora_layers
-                del text_encoder_2_lora_layers
-                reclaim_memory()
-        self.accelerator.end_training()
-
-    def configure_optimizers(self):
-
-        print("configuring optimizers...")
-        # print("model params:")
-        # print("model:", self)
-        # print(list(self.parameters()))
-        opt = torch.optim.Adam(self.parameters(), lr=1e-5)
-        return opt
-
-    @rank_zero_only
-    def save_lora(self):
-        with open("configs/006_flux/config.json", "r") as f:
-            output_dir = json.load(f).get("--output_dir")
-        lora_weights = lora_checkpoint_callback.on_save_checkpoint(
-            self.trainer, pl.LightningModule, self.state_dict()
-        )
-
-        new_lora_weights = {}
-        # rename the state_dict keys
-        for k in list(lora_weights.keys()):
-            k_list = k.split(".")
-            # remove the default and default_0 from the key list
-            if "default" in k_list:
-                k_list.remove("default")
-            if "default_0" in k_list:
-                k_list.remove("default_0")
-            new_k = ".".join(k_list)
-            new_lora_weights[new_k] = lora_weights[k]
-
-        save_path = os.path.join(output_dir, "pytorch_lora_own.ckpt")
-        torch.save(new_lora_weights, save_path)
-        print("lora saved successfully at:", save_path)
diff --git a/videotuna/third_party/flux/training/model_data.py b/videotuna/third_party/flux/training/model_data.py
deleted file mode 100644
index a0cb9b12..00000000
--- a/videotuna/third_party/flux/training/model_data.py
+++ /dev/null
@@ -1,140 +0,0 @@
-import json
-import os
-from pathlib import Path
-
-import pytorch_lightning as pl
-from accelerate.logging import get_logger
-from sklearn.model_selection import train_test_split
-from torch.utils.data import DataLoader, DistributedSampler
-
-from videotuna.third_party.flux.data_backend.factory import configure_multi_databackend
-from videotuna.third_party.flux.training.state_tracker import StateTracker
-
-logger = get_logger(
-    "SimpleTuner", log_level=os.environ.get("SIMPLETUNER_LOG_LEVEL", "INFO")
-)
-
-def create_txt_labels_from_dir(data_dir, caption):
-    """
-    Create multiple txt files, each txt file is the content of the caption string.
-    """
-    for image in os.listdir(data_dir):
-        with open(os.path.join(data_dir, Path(image).stem) + ".txt", "w") as f:
-            f.write(caption)
-
-class ModelData(pl.LightningDataModule):
-    def __init__(
-        self,
-        data_dir,
-        caption=None,
-        batch_size=1,
-    ):
-        super().__init__()
-
-        self.data_dir = data_dir
-        self.batch_size = batch_size
-        self.images = []
-        if caption is not None:
-            create_txt_labels_from_dir(data_dir, caption)
-
-    def init_data_backend(self):
-
-        try:
-            self.init_clear_backend_cache()
-            self._send_webhook_msg(
-                message="Configuring data backends... (this may take a while!)"
-            )
-            self._send_webhook_raw(
-                structured_data={"message": "Configuring data backends."},
-                message_type="init_data_backend_begin",
-            )
-            configure_multi_databackend(
-                self.config,
-                accelerator=self.accelerator,
-                text_encoders=self.text_encoders,
-                tokenizers=self.tokenizers,
-            )
-            self._send_webhook_raw(
-                structured_data={"message": "Completed configuring data backends."},
-                message_type="init_data_backend_completed",
-            )
-        except Exception as e:
-            import traceback
-
-            logger.error(f"{e}, traceback: {traceback.format_exc()}")
-            self._send_webhook_msg(
-                message=f"Failed to load data backends: {e}",
-                message_level="critical",
-            )
-            self._send_webhook_raw(
-                structured_data={
-                    "message": f"Failed to load data backends: {e}",
-                    "status": "error",
-                },
-                message_type="fatal_error",
-            )
-
-            raise e
-
-        self.init_validation_prompts()
-        # We calculate the number of steps per epoch by dividing the number of images by the effective batch divisor.
-        # Gradient accumulation steps mean that we only update the model weights every /n/ steps.
-        collected_data_backend_str = list(StateTracker.get_data_backends().keys())
-        if self.config.push_to_hub and self.accelerator.is_main_process:
-            self.hub_manager.collected_data_backend_str = collected_data_backend_str
-            self.hub_manager.set_validation_prompts(
-                self.validation_prompts, self.validation_shortnames
-            )
-            logger.debug(f"Collected validation prompts: {self.validation_prompts}")
-        self._recalculate_training_steps()
-        logger.info(
-            f"Collected the following data backends: {collected_data_backend_str}"
-        )
-        self._send_webhook_msg(
-            message=f"Collected the following data backends: {collected_data_backend_str}"
-        )
-        self._send_webhook_raw(
-            structured_data={
-                "message": f"Collected the following data backends: {collected_data_backend_str}"
-            },
-            message_type="init_data_backend",
-        )
-        self.accelerator.wait_for_everyone()
-
-    def create_dataset(self):
-        print("creating dataset...")
-        self.images = [
-            os.path.join(self.data_dir, image) for image in os.listdir(self.data_dir)
-        ]
-
-        print("dataset created!")
-
-    def prepare_data(self):
-        pass
-
-    def setup(self, stage=None):
-        if stage is None or stage == "fit":
-            self.train_set, _ = train_test_split(self.images, test_size=0.1)
-        if stage is None or stage == "test":
-            _, self.test_set = train_test_split(self.images, test_size=0.1)
-
-    def train_dataloader(self):
-        # train_sampler = DistributedSampler(self.train_set, shuffle=True)
-        return DataLoader(
-            self.train_set,
-            batch_size=self.batch_size,
-            drop_last=True,
-            num_workers=8,
-            pin_memory=True,
-        )
-
-    def test_dataloader(self):
-        # test_sampler = DistributedSampler(self.test_set, shuffle=True)
-
-        return DataLoader(
-            self.test_set,
-            batch_size=self.batch_size,
-            drop_last=True,
-            num_workers=8,
-            pin_memory=True,
-        )
diff --git a/videotuna/third_party/flux/training/model_freeze.py b/videotuna/third_party/flux/training/model_freeze.py
deleted file mode 100644
index 7e565f15..00000000
--- a/videotuna/third_party/flux/training/model_freeze.py
+++ /dev/null
@@ -1,177 +0,0 @@
-import logging
-import os
-import re
-
-from torch import nn
-
-logger = logging.getLogger("ModelFreeze")
-logger.setLevel(os.environ.get("SIMPLETUNER_LOG_LEVEL", "INFO"))
-
-
-def freeze_transformer_blocks(
-    model: nn.Module,
-    target_blocks: str,
-    first_unfrozen_dit_layer: int = 0,
-    first_unfrozen_mmdit_layer: int = 0,
-    freeze_direction: str = "up",
-    use_bitfit: bool = False,
-):
-    if target_blocks not in ["any", "dit", "mmdit"]:
-        raise ValueError(
-            f"Invalid target_blocks value {target_blocks}. Choose from 'any', 'dit', 'mmdit'."
-        )
-    if freeze_direction not in ["up", "down"]:
-        raise ValueError(
-            f"Invalid freeze_direction value {freeze_direction}. Choose from 'up', 'down'."
-        )
-    if first_unfrozen_dit_layer < 0 or first_unfrozen_mmdit_layer < 0:
-        raise ValueError(f"Invalid first_unfrozen layer value. Must be greater than 0.")
-    for name, param in model.named_parameters():
-        # Example names:
-        #  single_transformer_blocks.31.ff.c_proj.weight
-        #  joint_transformer_blocks.1.ff.c_proj.weight
-        try:
-            layer_group = name.split(".")[0]
-            layer_number = int(name.split(".")[1])
-        except Exception as e:
-            # non-numeric layer.
-            continue
-        try:
-            if hasattr(param, "requires_grad"):
-                # freeze by default.
-                param.requires_grad = False
-            else:
-                continue
-            if target_blocks != "any":
-                # We will exclude entire categories of blocks here if they aren't defined to be trained.
-                if (
-                    target_blocks == "dit"
-                    and layer_group != "single_transformer_blocks"
-                ):
-                    continue
-                if (
-                    target_blocks == "mmdit"
-                    and layer_group != "joint_transformer_blocks"
-                ):
-                    continue
-            should_train = False
-            if first_unfrozen_dit_layer is not None:
-                if layer_group == "single_transformer_blocks" or target_blocks == "any":
-                    if first_unfrozen_dit_layer == 0:
-                        should_train = True
-                    if (
-                        freeze_direction == "up"
-                        and layer_number < first_unfrozen_dit_layer
-                    ) or (
-                        freeze_direction == "down"
-                        and layer_number > first_unfrozen_dit_layer
-                    ):
-                        should_train = True
-
-            if first_unfrozen_mmdit_layer is not None:
-                if layer_group == "joint_transformer_blocks" or target_blocks == "any":
-                    if first_unfrozen_mmdit_layer == 0:
-                        should_train = True
-                    if (
-                        freeze_direction == "up"
-                        and layer_number < first_unfrozen_mmdit_layer
-                    ) or (
-                        freeze_direction == "down"
-                        and layer_number > first_unfrozen_mmdit_layer
-                    ):
-                        should_train = True
-
-            if should_train:
-                param.requires_grad = True
-                logger.debug(f"Unfreezing {name}.")
-
-        except Exception as e:
-            logger.error(e)
-            raise e
-
-    return model
-
-
-def apply_bitfit_freezing(model, args):
-    model_type = args.model_type
-    if "lora" in model_type:
-        # LoRAs don't have bias and arrive pre-frozen on the bottom.
-        return model
-
-    logger.debug("Applying BitFit freezing strategy for u-net tuning.")
-    for name, param in model.named_parameters():
-        if not hasattr(param, "requires_grad"):
-            logger.debug(
-                f"Skipping {name} as it does not have 'requires_grad' attribute."
-            )
-            continue
-        # Freeze everything that's not a bias
-        if "bias" not in name:
-            param.requires_grad = False
-        else:
-            # Unfreeze biases
-            param.requires_grad = True
-    return model
-
-
-def freeze_entire_component(component):
-    for name, param in component.named_parameters():
-        if hasattr(param, "requires_grad"):
-            param.requires_grad = False
-    return component
-
-
-def freeze_text_encoder(args, component):
-    from transformers import T5EncoderModel
-
-    if (
-        not args.train_text_encoder
-        or not args.freeze_encoder
-        or type(component) is T5EncoderModel
-    ):
-        if args.train_text_encoder:
-            logger.info("Not freezing text encoder. Live dangerously and prosper!")
-        return component
-    method = args.freeze_encoder_strategy
-    first_layer = args.freeze_encoder_before
-    last_layer = args.freeze_encoder_after
-    total_count = 0
-    for name, param in component.named_parameters():
-        total_count += 1
-        pieces = name.split(".")
-        if pieces[1] != "encoder" and pieces[2] != "layers":
-            logger.info(f"Ignoring non-encoder layer: {name}")
-            continue
-        else:
-            logger.debug(f"Freezing layer: {name}, which has keys: {pieces}")
-        current_layer = int(pieces[3])
-
-        freeze_param = False
-        if method == "between":
-            freeze_param = current_layer > first_layer or current_layer < last_layer
-        elif method == "outside":
-            freeze_param = first_layer <= current_layer <= last_layer
-        elif method == "before":
-            freeze_param = current_layer < first_layer
-        elif method == "after":
-            freeze_param = current_layer > last_layer
-        else:
-            raise ValueError(
-                f"Invalid method {method}. Choose between 'between', 'outside', 'before' or 'after'."
-            )
-
-        if freeze_param:
-            if hasattr(param, "requires_grad"):
-                param.requires_grad = False
-                # logger.debug(
-                #     f"Froze layer {name} with method {method} and range {first_layer} - {last_layer}"
-                # )
-            else:
-                # logger.info(
-                #     f"Ignoring layer that does not mark as gradient capable: {name}"
-                # )
-                pass
-    logger.info(
-        f"Applied {method} method with range {first_layer} - {last_layer} to {total_count} total layers."
-    )
-    return component
diff --git a/videotuna/third_party/flux/training/multi_process.py b/videotuna/third_party/flux/training/multi_process.py
deleted file mode 100644
index 3448a675..00000000
--- a/videotuna/third_party/flux/training/multi_process.py
+++ /dev/null
@@ -1,19 +0,0 @@
-import torch.distributed as dist
-
-
-def _get_rank():
-    if dist.is_available() and dist.is_initialized():
-        return dist.get_rank()
-    else:
-        return 0
-
-
-def rank_info():
-    try:
-        return f"(Rank: {_get_rank()}) "
-    except:
-        return ""
-
-
-def should_log():
-    return _get_rank() == 0
diff --git a/videotuna/third_party/flux/training/optimizer_param.py b/videotuna/third_party/flux/training/optimizer_param.py
deleted file mode 100644
index 32f322e3..00000000
--- a/videotuna/third_party/flux/training/optimizer_param.py
+++ /dev/null
@@ -1,669 +0,0 @@
-import os
-
-import accelerate
-import torch
-from accelerate.logging import get_logger
-
-logger = get_logger(__name__, log_level=os.environ.get("SIMPLETUNER_LOG_LEVEL", "INFO"))
-
-target_level = os.environ.get("SIMPLETUNER_LOG_LEVEL", "INFO")
-logger.setLevel(target_level)
-
-is_optimi_available = False
-from videotuna.third_party.flux.training.optimizers.adamw_bfloat16 import AdamWBF16
-from videotuna.third_party.flux.training.optimizers.adamw_schedulefree import (
-    AdamWScheduleFreeKahan,
-)
-from videotuna.third_party.flux.training.optimizers.soap import SOAP
-
-try:
-    from optimum.quanto import QTensor
-except:
-    pass
-
-try:
-    from torchao.prototype.low_bit_optim import AdamFp8 as AOAdamFp8
-    from torchao.prototype.low_bit_optim import AdamW4bit as AOAdamW4Bit
-    from torchao.prototype.low_bit_optim import AdamW8bit as AOAdamW8Bit
-    from torchao.prototype.low_bit_optim import AdamWFp8 as AOAdamWFp8
-    from torchao.prototype.low_bit_optim import (
-        CPUOffloadOptimizer as AOCPUOffloadOptimizer,
-    )
-
-    if torch.backends.mps.is_available():
-        import torch._dynamo
-
-        torch._dynamo.config.suppress_errors = True
-except Exception as e:
-    print("You need torchao installed for its low-precision optimizers.")
-    raise e
-
-try:
-    import optimi
-
-    is_optimi_available = True
-except:
-    logger.error(
-        "Could not load optimi library. Please install `torch-optimi` for better memory efficiency."
-    )
-
-is_bitsandbytes_available = False
-try:
-    import bitsandbytes
-
-    is_bitsandbytes_available = True
-except:
-    if torch.cuda.is_available():
-        logger.warning(
-            "Could not load bitsandbytes library. BnB-specific optimisers and other functionality will be unavailable."
-        )
-
-optimizer_choices = {
-    "adamw_bf16": {
-        "precision": "bf16",
-        "default_settings": {
-            "betas": (0.9, 0.999),
-            "weight_decay": 1e-2,
-            "eps": 1e-6,
-        },
-        "class": AdamWBF16,
-    },
-    "ao-adamw8bit": {
-        "gradient_precision": "bf16",
-        "precision": "any",
-        "default_settings": {
-            "betas": (0.9, 0.999),
-            "weight_decay": 1e-2,
-            "eps": 1e-6,
-        },
-        "class": AOAdamW8Bit,
-    },
-    "ao-adamw4bit": {
-        "gradient_precision": "bf16",
-        "precision": "any",
-        "default_settings": {
-            "betas": (0.9, 0.999),
-            "weight_decay": 1e-2,
-            "eps": 1e-6,
-        },
-        "class": AOAdamW4Bit,
-    },
-    "ao-adamfp8": {
-        "gradient_precision": "bf16",
-        "precision": "any",
-        "default_settings": {
-            "betas": (0.9, 0.999),
-            "weight_decay": 1e-2,
-            "eps": 1e-6,
-        },
-        "class": AOAdamFp8,
-    },
-    "ao-adamwfp8": {
-        "gradient_precision": "bf16",
-        "precision": "any",
-        "default_settings": {
-            "betas": (0.9, 0.999),
-            "weight_decay": 1e-2,
-            "eps": 1e-6,
-        },
-        "class": AOAdamWFp8,
-    },
-    "adamw_schedulefree": {
-        "precision": "any",
-        "override_lr_scheduler": True,
-        "can_warmup": True,
-        "default_settings": {
-            "betas": (0.9, 0.999),
-            "weight_decay": 1e-2,
-            "eps": 1e-8,
-        },
-        "class": AdamWScheduleFreeKahan,
-    },
-    "adamw_schedulefree+aggressive": {
-        "precision": "any",
-        "override_lr_scheduler": True,
-        "can_warmup": True,
-        "default_settings": {
-            "betas": (0.9, 0.999),
-            "weight_decay": 1e-3,
-            "eps": 1e-6,
-        },
-        "class": AdamWScheduleFreeKahan,
-    },
-    "adamw_schedulefree+no_kahan": {
-        "precision": "any",
-        "override_lr_scheduler": True,
-        "can_warmup": True,
-        "default_settings": {
-            "betas": (0.9, 0.999),
-            "weight_decay": 1e-3,
-            "eps": 1e-6,
-            "use_kahan": False,
-        },
-        "class": AdamWScheduleFreeKahan,
-    },
-    "optimi-stableadamw": {
-        "precision": "any",
-        "default_settings": {
-            "betas": (0.9, 0.99),
-            "weight_decay": 1e-2,
-            "eps": 1e-6,
-            "decouple_lr": False,
-            "max_lr": None,
-            "kahan_sum": True,
-            "foreach": True,
-        },
-        "class": optimi.StableAdamW,
-    },
-    "optimi-adamw": {
-        "precision": "any",
-        "default_settings": {
-            "betas": (0.9, 0.99),
-            "eps": 1e-6,
-            "weight_decay": 0.0,
-            "decouple_lr": False,
-            "kahan_sum": True,
-            "max_lr": None,
-        },
-        "class": optimi.AdamW,
-    },
-    "optimi-lion": {
-        "precision": "any",
-        "default_settings": {
-            "betas": (0.9, 0.99),
-            "weight_decay": 0.0,
-            "decouple_lr": False,
-            "max_lr": None,
-            "kahan_sum": True,
-            "foreach": True,
-        },
-        "class": optimi.Lion,
-    },
-    "optimi-radam": {
-        "precision": "any",
-        "default_settings": {
-            "betas": (0.9, 0.99),
-            "weight_decay": 0.0,
-            "eps": 1e-6,
-            "decouple_wd": True,
-            "decouple_lr": False,
-            "kahan_sum": True,
-            "foreach": True,
-        },
-        "class": optimi.RAdam,
-    },
-    "optimi-ranger": {
-        "precision": "any",
-        "default_settings": {
-            "betas": (0.9, 0.99),
-            "weight_decay": 0.0,
-            "eps": 1e-6,
-            "k": 6,
-            "alpha": 0.5,
-            "decouple_wd": True,
-            "decouple_lr": False,
-            "max_lr": None,
-            "kahan_sum": True,
-            "foreach": True,
-        },
-        "class": optimi.Ranger,
-    },
-    "optimi-adan": {
-        "precision": "any",
-        "default_settings": {
-            "betas": (0.98, 0.92, 0.999),
-            "weight_decay": 2e-2,
-            "eps": 1e-6,
-            "decouple_lr": False,
-            "max_lr": None,
-            "adam_wd": False,
-            "kahan_sum": True,
-            "foreach": True,
-        },
-        "class": optimi.Adan,
-    },
-    "optimi-adam": {
-        "precision": "any",
-        "default_settings": {
-            "betas": (0.9, 0.99),
-            "eps": 1e-6,
-            "weight_decay": 0.0,
-            "decouple_wd": False,
-            "decouple_lr": False,
-            "kahan_sum": True,
-            "max_lr": None,
-        },
-        "class": optimi.Adam,
-    },
-    "optimi-sgd": {
-        "precision": "any",
-        "default_settings": {
-            "momentum": 0,
-            "weight_decay": 0.0,
-            "dampening": False,
-            "decouple_wd": False,
-            "decouple_lr": False,
-            "max_lr": None,
-            "torch_init": False,
-            "kahan_sum": True,
-            "foreach": True,
-        },
-        "class": optimi.SGD,
-    },
-    "soap": {
-        "precision": "any",
-        "default_settings": {
-            "betas": (0.95, 0.95),
-            "shampoo_beta": -1,
-            "eps": 1e-8,
-            "weight_decay": 0.01,
-            "precondition_frequency": 10,
-            "max_precond_dim": 10000,
-            "merge_dims": False,
-            "precondition_1d": False,
-            "normalize_grads": False,
-            "data_format": "channels_first",
-            "correct_bias": True,
-        },
-        "class": SOAP,
-    },
-}
-
-if is_bitsandbytes_available:
-    optimizer_choices.update(
-        {
-            "bnb-adagrad": {
-                "precision": "any",
-                "default_settings": {
-                    "lr_decay": 0,
-                    "weight_decay": 0,
-                    "initial_accumulator_value": 0,
-                    "eps": 1e-10,
-                    "min_8bit_size": 4096,
-                    "percentile_clipping": 100,
-                },
-                "class": bitsandbytes.optim.Adagrad,
-            },
-            "bnb-adagrad8bit": {
-                "precision": "any",
-                "default_settings": {
-                    "lr_decay": 0,
-                    "weight_decay": 0,
-                    "initial_accumulator_value": 0,
-                    "eps": 1e-10,
-                    "min_8bit_size": 4096,
-                    "percentile_clipping": 100,
-                },
-                "class": bitsandbytes.optim.Adagrad8bit,
-            },
-            "bnb-adam": {
-                "precision": "any",
-                "default_settings": {
-                    "betas": (0.9, 0.999),
-                    "eps": 1e-08,
-                    "weight_decay": 0,
-                    "amsgrad": False,
-                    "min_8bit_size": 4096,
-                    "percentile_clipping": 100,
-                },
-                "class": bitsandbytes.optim.Adam,
-            },
-            "bnb-adam8bit": {
-                "precision": "any",
-                "default_settings": {
-                    "betas": (0.9, 0.999),
-                    "eps": 1e-08,
-                    "weight_decay": 0,
-                    "amsgrad": False,
-                    "min_8bit_size": 4096,
-                    "percentile_clipping": 100,
-                },
-                "class": bitsandbytes.optim.Adam8bit,
-            },
-            "bnb-adamw": {
-                "precision": "any",
-                "default_settings": {
-                    "betas": (0.9, 0.999),
-                    "weight_decay": 1e-2,
-                    "eps": 1e-6,
-                },
-                "class": bitsandbytes.optim.AdamW,
-            },
-            "bnb-adamw8bit": {
-                "precision": "any",
-                "default_settings": {
-                    "betas": (0.9, 0.999),
-                    "weight_decay": 1e-2,
-                    "eps": 1e-6,
-                },
-                "class": bitsandbytes.optim.AdamW8bit,
-            },
-            "bnb-adamw-paged": {
-                "precision": "any",
-                "default_settings": {
-                    "betas": (0.9, 0.999),
-                    "weight_decay": 1e-2,
-                    "eps": 1e-6,
-                },
-                "class": bitsandbytes.optim.PagedAdamW,
-            },
-            "bnb-adamw8bit-paged": {
-                "precision": "any",
-                "default_settings": {
-                    "betas": (0.9, 0.999),
-                    "weight_decay": 1e-2,
-                    "eps": 1e-6,
-                },
-                "class": bitsandbytes.optim.PagedAdamW8bit,
-            },
-            "bnb-ademamix": {
-                "precision": "any",
-                "default_settings": {
-                    "betas": (0.9, 0.999, 0.9999),
-                    "alpha": 5.0,
-                    "t_alpha": None,
-                    "t_beta3": None,
-                    "eps": 1e-08,
-                    "weight_decay": 0.01,
-                    "min_8bit_size": 4096,
-                },
-                "class": bitsandbytes.optim.AdEMAMix,
-            },
-            "bnb-ademamix8bit": {
-                "precision": "any",
-                "default_settings": {
-                    "betas": (0.9, 0.999, 0.9999),
-                    "alpha": 5.0,
-                    "t_alpha": None,
-                    "t_beta3": None,
-                    "eps": 1e-08,
-                    "weight_decay": 0.01,
-                    "min_8bit_size": 4096,
-                },
-                "class": bitsandbytes.optim.AdEMAMix8bit,
-            },
-            "bnb-ademamix-paged": {
-                "precision": "any",
-                "default_settings": {
-                    "betas": (0.9, 0.999, 0.9999),
-                    "alpha": 5.0,
-                    "t_alpha": None,
-                    "t_beta3": None,
-                    "eps": 1e-08,
-                    "weight_decay": 0.01,
-                    "min_8bit_size": 4096,
-                },
-                "class": bitsandbytes.optim.PagedAdEMAMix,
-            },
-            "bnb-ademamix8bit-paged": {
-                "precision": "any",
-                "default_settings": {
-                    "betas": (0.9, 0.999, 0.9999),
-                    "alpha": 5.0,
-                    "t_alpha": None,
-                    "t_beta3": None,
-                    "eps": 1e-08,
-                    "weight_decay": 0.01,
-                    "min_8bit_size": 4096,
-                },
-                "class": bitsandbytes.optim.PagedAdEMAMix8bit,
-            },
-            "bnb-lion": {
-                "precision": "any",
-                "default_settings": {
-                    "betas": (0.9, 0.99),
-                    "weight_decay": 0.0,
-                    "min_8bit_size": 4096,
-                },
-                "class": bitsandbytes.optim.Lion,
-            },
-            "bnb-lion8bit": {
-                "precision": "any",
-                "default_settings": {
-                    "betas": (0.9, 0.99),
-                    "weight_decay": 0.0,
-                    "min_8bit_size": 4096,
-                },
-                "class": bitsandbytes.optim.Lion8bit,
-            },
-            "bnb-lion-paged": {
-                "precision": "any",
-                "default_settings": {
-                    "betas": (0.9, 0.99),
-                    "weight_decay": 0.0,
-                    "min_8bit_size": 4096,
-                },
-                "class": bitsandbytes.optim.PagedLion,
-            },
-            "bnb-lion8bit-paged": {
-                "precision": "any",
-                "default_settings": {
-                    "betas": (0.9, 0.99),
-                    "weight_decay": 0.0,
-                    "min_8bit_size": 4096,
-                },
-                "class": bitsandbytes.optim.PagedLion8bit,
-            },
-        }
-    )
-
-args_to_optimizer_mapping = {
-    "use_adafactor_optimizer": "adafactor",
-    "use_prodigy_optimizer": "prodigy",
-    "use_dadaptation_optimizer": "dadaptation",
-    "adam_bfloat16": "adamw_bf16",
-    "use_8bit_adam": "adamw8bit",
-}
-
-deprecated_optimizers = {
-    "prodigy": "Prodigy optimiser has been removed due to issues with precision levels and convergence. Please use adamw_schedulefree instead.",
-    "dadaptation": "D-adaptation optimiser has been removed due to issues with precision levels and convergence. Please use adamw_schedulefree instead.",
-    "adafactor": "Adafactor optimiser has been removed in favour of optimi-stableadamw, which offers improved memory efficiency and convergence.",
-    "adamw8bit": "AdamW8Bit has been removed in favour of optimi-adamw optimiser, which offers better low-precision support. Please use this or adamw_bf16 instead.",
-}
-
-
-def convert_arg_to_parameters(args):
-    """--optimizer_config can have a format like --optimizer_config=eps=1e-6,weight_decay=0.0"""
-    out = {}
-    if args.optimizer_config is not None and args.optimizer_config:
-        optimizer_params = [
-            param.split("=") for param in args.optimizer_config.split(",")
-        ]
-        for param in optimizer_params:
-            if "." in param[1]:
-                out[param[0]] = float(param[1])
-            elif str(param[1]).isdigit():
-                out[param[0]] = int(param[1])
-            elif param[1].lower() == "true":
-                out[param[0]] = True
-            elif param[1].lower() == "false":
-                out[param[0]] = False
-            elif param[1].lower() == "none":
-                out[param[0]] = None
-            elif "e-" in param[1]:
-                out[param[0]] = float(param[1])
-            else:
-                out[param[0]] = param[1]
-        return out
-    if args.optimizer_beta1 is not None and args.optimizer_beta2 is not None:
-        # the user has supplied a beta1 and beta2 value
-        out["betas"] = tuple([args.optimizer_beta1, args.optimizer_beta2])
-
-    return out
-
-
-def optimizer_parameters(optimizer, args):
-    """Return the parameters for the optimizer"""
-    if optimizer in optimizer_choices:
-        optimizer_details = optimizer_choices.get(optimizer)
-        optimizer_class = optimizer_choices.get(optimizer).get("class")
-        optimizer_params = optimizer_choices.get(optimizer).get("default_settings")
-        optimizer_params.update(convert_arg_to_parameters(args))
-        if args.optimizer_release_gradients and "optimi-" in optimizer:
-            optimizer_params["gradient_release"] = True
-        optimizer_details["default_settings"] = optimizer_params
-        return optimizer_class, optimizer_details
-    else:
-        raise ValueError(f"Optimizer {optimizer} not found.")
-
-
-def is_lr_scheduler_disabled(optimizer: str):
-    """Check if the optimizer has a built-in LR scheduler"""
-    is_disabled = False
-    if optimizer in optimizer_choices:
-        is_disabled = optimizer_choices.get(optimizer).get(
-            "override_lr_scheduler", False
-        )
-    return is_disabled
-
-
-def show_optimizer_defaults(optimizer: str = None):
-    """we'll print the defaults on a single line, eg. foo=bar, buz=baz"""
-    if optimizer is None:
-        for key in optimizer_choices:
-            print(f"{key}={optimizer_choices[key].get('default_settings')}")
-    else:
-        print(f"{optimizer}={optimizer_choices.get(optimizer).get('default_settings')}")
-
-
-def is_optimizer_deprecated(optimizer: str) -> bool:
-    if optimizer in deprecated_optimizers:
-        raise ValueError(deprecated_optimizers.get(optimizer))
-
-
-def map_deprecated_optimizer_parameter(optimizer: str) -> str:
-    return args_to_optimizer_mapping.get(optimizer, None)
-
-
-def is_optimizer_bf16(optimizer: str) -> bool:
-    optimizer_precision = optimizer_choices.get(optimizer, {}).get("precision", "fp32")
-    if optimizer_precision in ["any", "bf16"]:
-        return True
-    return False
-
-
-def is_optimizer_grad_fp32(optimizer: str) -> bool:
-    optimizer_precision = optimizer_choices.get(optimizer, {}).get(
-        "gradient_precision", None
-    )
-    if optimizer_precision == "fp32":
-        return True
-    return False
-
-
-def cpu_offload_optimizer(
-    params_to_optimize,
-    optimizer_cls,
-    optimizer_parameters: dict,
-    offload_gradients: bool = True,
-    fused: bool = True,
-    offload_mechanism: str = None,
-):
-    if not offload_mechanism or offload_mechanism == "none":
-        return optimizer_cls(params_to_optimize, **optimizer_parameters)
-    if offload_mechanism != "torchao":
-        raise ValueError(
-            f"Unknown CPU optimiser offload mechanism: {offload_mechanism}"
-        )
-
-    if offload_gradients:
-        optimizer_parameters["offload_gradients"] = offload_gradients
-    if fused:
-        optimizer_parameters["fused"] = fused
-
-    optimizer_parameters["optimizer_class"] = optimizer_cls
-
-    return AOCPUOffloadOptimizer(params_to_optimize, **optimizer_parameters)
-
-
-def determine_optimizer_class_with_config(
-    args, use_deepspeed_optimizer, is_quantized, enable_adamw_bf16
-) -> tuple:
-    extra_optimizer_args = {}
-    if use_deepspeed_optimizer:
-        optimizer_class = accelerate.utils.DummyOptim
-        extra_optimizer_args["lr"] = float(args.learning_rate)
-        extra_optimizer_args["betas"] = (args.adam_beta1, args.adam_beta2)
-        extra_optimizer_args["eps"] = args.adam_epsilon
-        extra_optimizer_args["weight_decay"] = args.adam_weight_decay
-        default_settings = extra_optimizer_args
-        optimizer_details = {}
-    elif is_quantized and not enable_adamw_bf16:
-        logger.error(
-            f"When --base_model_default_dtype=fp32, AdamWBF16 may not be used. Switching to AdamW."
-        )
-        optimizer_class, optimizer_details = optimizer_parameters("optimi-adamw", args)
-    else:
-        optimizer_class, optimizer_details = optimizer_parameters(args.optimizer, args)
-        default_settings = optimizer_details.get("default_settings")
-    if optimizer_details.get("can_warmup", False):
-        logger.info(
-            f"Optimizer contains LR scheduler, warmup steps will be set to {args.lr_warmup_steps}."
-        )
-        default_settings["warmup_steps"] = args.lr_warmup_steps
-    logger.info(f"cls: {optimizer_class}, settings: {default_settings}")
-    return default_settings, optimizer_class
-
-
-def determine_params_to_optimize(
-    args,
-    controlnet,
-    unet,
-    transformer,
-    text_encoder_1,
-    text_encoder_2,
-    model_type_label,
-    lycoris_wrapped_network,
-):
-    if args.model_type == "full":
-        if args.controlnet:
-            params_to_optimize = controlnet.parameters()
-        elif unet is not None:
-            params_to_optimize = list(
-                filter(lambda p: p.requires_grad, unet.parameters())
-            )
-        elif transformer is not None:
-            params_to_optimize = list(
-                filter(lambda p: p.requires_grad, transformer.parameters())
-            )
-        if args.train_text_encoder:
-            raise ValueError(
-                "Full model tuning does not currently support text encoder training."
-            )
-    elif "lora" in args.model_type:
-        if args.controlnet:
-            raise ValueError(
-                "SimpleTuner does not currently support training a ControlNet LoRA."
-            )
-        if unet is not None:
-            params_to_optimize = list(
-                filter(lambda p: p.requires_grad, unet.parameters())
-            )
-        if transformer is not None:
-            params_to_optimize = list(
-                filter(lambda p: p.requires_grad, transformer.parameters())
-            )
-        if args.train_text_encoder:
-            if args.model_family in ["sd3", "pixart_sigma"]:
-                raise ValueError(
-                    f"{model_type_label} does not support finetuning the text encoders, as T5 does not benefit from it."
-                )
-            else:
-                # add the first text encoder's parameters
-                params_to_optimize = params_to_optimize + list(
-                    filter(lambda p: p.requires_grad, text_encoder_1.parameters())
-                )
-                # if text_encoder_2 is not None, add its parameters
-                if text_encoder_2 is None and args.model_family not in ["flux"]:
-                    # but not flux. it has t5 as enc 2.
-                    params_to_optimize = params_to_optimize + list(
-                        filter(lambda p: p.requires_grad, text_encoder_2.parameters())
-                    )
-
-        if args.lora_type == "lycoris" and lycoris_wrapped_network is not None:
-            params_to_optimize = list(
-                filter(lambda p: p.requires_grad, lycoris_wrapped_network.parameters())
-            )
-
-    return params_to_optimize
diff --git a/videotuna/third_party/flux/training/optimizers/adamw_bfloat16/__init__.py b/videotuna/third_party/flux/training/optimizers/adamw_bfloat16/__init__.py
deleted file mode 100644
index 796a2bbc..00000000
--- a/videotuna/third_party/flux/training/optimizers/adamw_bfloat16/__init__.py
+++ /dev/null
@@ -1,164 +0,0 @@
-"""
-Different versions appeared,
-they have identical interface, but sutiable for different scenarios.
-"""
-
-__version__ = "0.2.0"
-
-__all__ = ["AdamW_BF16"]
-
-"""
-This implementation uses torch.compile to speed up,
-should be suitable for different backends.
-"""
-
-import torch
-from torch.optim.optimizer import Optimizer
-
-from .stochastic import add_stochastic_, addcdiv_stochastic_
-
-
-class AdamWBF16(Optimizer):
-    decay_threshold = 5e-3
-
-    def __init__(
-        self,
-        params,
-        *,
-        lr=1e-4,
-        betas=(0.9, 0.999),
-        eps=1e-8,
-        weight_decay=0,
-    ):
-        """
-        Implements AdamW optimization specifically for bfloat16 models.
-        No other dtype is supported.
-        Compatible with cuda graphs.
-        Uses delayed accumulation for decays and compensated summation for Adam steps.
-        Uses only one additional bfloat16 weight for keeping correction.
-        Do not use schedulers - those can't affect cuda graphs.
-        :param lr_function: a callable that maps torch scalar (step) to torch scalar (learning rate)
-        """
-        if not 0.0 <= eps:
-            raise ValueError(f"Invalid epsilon value: {eps}")
-        if not 0.0 <= betas[0] < 1.0:
-            raise ValueError(f"Invalid beta parameter at index 0: {betas[0]}")
-        if not 0.0 <= betas[1] < 1.0:
-            raise ValueError(f"Invalid beta parameter at index 1: {betas[1]}")
-        if not 0.0 <= weight_decay:
-            raise ValueError(f"Invalid weight_decay value: {weight_decay}")
-        defaults = dict(betas=betas, eps=eps, weight_decay=weight_decay, lr=lr)
-        super().__init__(params, defaults)
-
-    @torch.no_grad()
-    def step(self, zero_grad: bool = False):
-        """Performs a single optimization step."""
-        for group in self.param_groups:
-            beta1, beta2 = group["betas"]
-
-            for p in group["params"]:
-                if p.grad is not None:
-                    state = self.state[p]
-                    # Lazy state initialization
-                    if len(state) == 0:
-                        assert p.dtype == torch.bfloat16, "only bfloat 16 is supported."
-                        state["step"] = 0.0
-                        # Exponential moving average of gradient values
-                        state["exp_avg"] = torch.zeros_like(
-                            p, memory_format=torch.preserve_format
-                        )
-                        # Exponential moving average of squared gradient values
-                        state["exp_avg_sq"] = torch.zeros_like(
-                            p, memory_format=torch.preserve_format
-                        )
-                        # accumulated shift that should be added to p, but wasn't because of truncation
-                        # true value is p + shift
-                        state["shift"] = torch.zeros_like(
-                            p, memory_format=torch.preserve_format
-                        )
-                        # using decay at each step will work only for float32, so we just remember how much owe to decay
-                        # and decay once in n iterations
-                        # Each weight has its own starting point to avoid simultaneous updates in all weights
-                        state["accumulated_decay"] = float(
-                            torch.rand([]) * self.decay_threshold
-                        )
-
-                    grad = p.grad
-                    state["step"] += 1
-                    lr = group["lr"]
-
-                    state["accumulated_decay"] += group["weight_decay"] * lr
-                    accum_decay = state["accumulated_decay"]
-                    decay_this_iteration = (
-                        accum_decay > self.decay_threshold
-                    ) * accum_decay
-                    state["accumulated_decay"] -= decay_this_iteration
-
-                    _make_step(
-                        grad,
-                        p,
-                        state["shift"],
-                        state["exp_avg"],
-                        state["exp_avg_sq"],
-                        beta1=beta1,
-                        beta2=beta2,
-                        step=state["step"],
-                        lr=lr,
-                        eps=group["eps"],
-                        decay_this_iteration=decay_this_iteration,
-                        zero_grad=zero_grad,
-                    )
-
-
-def _make_step(
-    grad,
-    p,
-    shift,
-    exp_avg,
-    exp_avg_sq,
-    beta1: float,
-    beta2: float,
-    step: float,
-    lr: float,
-    eps: float,
-    decay_this_iteration: float,
-    zero_grad: bool,
-):
-    # Originally:
-    # exp_avg.mul_(beta1).add_(grad, alpha=1 - beta1)
-    exp_avg.mul_(beta1)
-    add_stochastic_(exp_avg, grad, alpha=1 - beta1)
-    exp_avg_sq.mul_(beta2).addcmul_(grad, grad.conj(), value=1 - beta2)
-
-    denom_correction = (1 - beta2**step) ** 0.5
-
-    # Originally:
-    # shift.addcdiv_(
-    #     exp_avg,
-    #     exp_avg_sq.sqrt().add_(eps, alpha=1),
-    #     value=-lr * denom_correction,
-    # )
-
-    addcdiv_stochastic_(
-        shift,
-        exp_avg,
-        exp_avg_sq.sqrt().add_(eps, alpha=1),
-        value=-lr * denom_correction,
-    )
-
-    buffer = p.clone()
-    # Originally:
-    # p.add_(shift)
-    add_stochastic_(p, shift)
-
-    # Originally:
-    # shift.add_(buffer.sub_(p))
-    add_stochastic_(shift, buffer.sub_(p))
-
-    if decay_this_iteration > 0:
-        shift.add_(p, alpha=-decay_this_iteration)
-        # Do NOT do this, it will cause the model to become unstable.
-        # add_stochastic_(shift, p, alpha=-decay_this_iteration)
-
-    if zero_grad:
-        grad.zero_()
diff --git a/videotuna/third_party/flux/training/optimizers/adamw_bfloat16/stochastic/__init__.py b/videotuna/third_party/flux/training/optimizers/adamw_bfloat16/stochastic/__init__.py
deleted file mode 100644
index 7829aabf..00000000
--- a/videotuna/third_party/flux/training/optimizers/adamw_bfloat16/stochastic/__init__.py
+++ /dev/null
@@ -1,124 +0,0 @@
-import torch
-from torch import FloatTensor, Tensor
-
-
-def swap_first_and_last_dims(tensor: torch.Tensor) -> torch.Tensor:
-    """
-    Swap the first dimension with the last dimension of a tensor.
-
-    Args:
-        tensor (torch.Tensor): The input tensor of any shape.
-
-    Returns:
-        torch.Tensor: A tensor with the first dimension swapped with the last.
-    """
-    # Get the total number of dimensions
-    num_dims = len(tensor.shape)
-
-    # Create a new order of dimensions
-    new_order = list(range(1, num_dims)) + [0]
-
-    # Permute the tensor according to the new order
-    return tensor.permute(*new_order)
-
-
-def swap_back_first_and_last_dims(tensor: torch.Tensor) -> torch.Tensor:
-    """
-    Swap back the first dimension with the last dimension of a tensor
-    to its original shape after a swap.
-
-    Args:
-        tensor (torch.Tensor): The tensor that had its first and last dimensions swapped.
-
-    Returns:
-        torch.Tensor: A tensor with its original shape restored.
-    """
-    # Get the total number of dimensions
-    num_dims = len(tensor.shape)
-
-    # Create a new order to reverse the previous swapping
-    new_order = [num_dims - 1] + list(range(0, num_dims - 1))
-
-    # Permute the tensor according to the new order
-    return tensor.permute(*new_order)
-
-
-def copy_stochastic_(target: Tensor, source: Tensor):
-    """
-    copies source into target using stochastic rounding
-
-    Args:
-        target: the target tensor with dtype=bfloat16
-        source: the target tensor with dtype=float32
-    """
-    # create a random 16 bit integer
-    result = torch.randint_like(
-        source,
-        dtype=torch.int32,
-        low=0,
-        high=(1 << 16),
-    )
-
-    # add the random number to the lower 16 bit of the mantissa
-    result.add_(source.view(dtype=torch.int32))
-
-    # mask off the lower 16 bit of the mantissa
-    result.bitwise_and_(-65536)  # -65536 = FFFF0000 as a signed int32
-
-    # copy the higher 16 bit into the target tensor
-    target.copy_(result.view(dtype=torch.float32))
-
-    del result
-
-
-def add_stochastic_(_input: Tensor, other: Tensor, alpha: float = 1.0):
-    """
-    Adds other to input using stochastic rounding.
-
-    There is a hack to fix a bug on MPS where uneven final dimensions cause
-    a crash.
-
-    Args:
-        _input: the input tensor with dtype=bfloat16
-        other: the other tensor
-        alpha: a multiplier for other
-    """
-    _input_original = _input
-    if _input.device.type == "mps":
-        _input = _input.to(dtype=torch.float32)
-
-    if other.dtype == torch.float32:
-        result = other.clone()
-    else:
-        result = other.to(dtype=torch.float32)
-
-    if _input.device.type == "mps":
-        result.add_(_input, alpha=torch.tensor(alpha, dtype=torch.float32))
-    else:
-        result.add_(_input, alpha=alpha)
-
-    copy_stochastic_(_input, result)
-
-    if _input.device.type == "mps":
-        _input_original.copy_(_input.view(dtype=torch.float32))
-
-
-def addcdiv_stochastic_(
-    _input: Tensor, tensor1: Tensor, tensor2: Tensor, value: float = 1.0
-):
-    """
-    adds (tensor1 / tensor2 * value) to input using stochastic rounding
-
-    Args:
-        _input: the input tensor with dtype=bfloat16
-        tensor1: the numerator tensor
-        tensor2: the denominator tensor
-        value: a multiplier for tensor1/tensor2
-    """
-    if _input.dtype == torch.float32:
-        result = _input.clone()
-    else:
-        result = _input.to(dtype=torch.float32)
-
-    result.addcdiv_(tensor1, tensor2, value=value)
-    copy_stochastic_(_input, result)
diff --git a/videotuna/third_party/flux/training/optimizers/adamw_schedulefree/__init__.py b/videotuna/third_party/flux/training/optimizers/adamw_schedulefree/__init__.py
deleted file mode 100644
index 3e64082f..00000000
--- a/videotuna/third_party/flux/training/optimizers/adamw_schedulefree/__init__.py
+++ /dev/null
@@ -1,151 +0,0 @@
-import math
-from typing import Iterable
-
-import torch
-from torch.optim.optimizer import Optimizer
-
-from videotuna.third_party.flux.training.state_tracker import StateTracker
-
-
-class AdamWScheduleFreeKahan(Optimizer):
-    """AdamW optimizer with schedule-free adjustments and Kahan summation.
-
-    Args:
-        params: Iterable of parameters to optimize or dicts defining parameter groups.
-        lr: Learning rate.
-        betas: Coefficients for gradient and squared gradient moving averages (default: (0.9, 0.999)).
-        eps: Added to denominator to improve numerical stability (default: 1e-8).
-        weight_decay: Weight decay coefficient (default: 1e-2).
-        warmup_steps: Number of steps to warm up the learning rate (default: 0).
-        kahan_sum: Enables Kahan summation for more accurate parameter updates when training in low precision.
-    """
-
-    def __init__(
-        self,
-        params: Iterable,
-        lr: float = 1e-3,
-        betas: tuple = (0.9, 0.999),
-        eps: float = 1e-8,
-        weight_decay: float = 1e-2,
-        warmup_steps: int = 0,
-        kahan_sum: bool = True,
-    ):
-        defaults = dict(
-            lr=lr,
-            betas=betas,
-            eps=eps,
-            weight_decay=weight_decay,
-            warmup_steps=warmup_steps,
-            kahan_sum=kahan_sum,
-        )
-        super(AdamWScheduleFreeKahan, self).__init__(params, defaults)
-        self.k = 0
-        self.lr_max = -1.0
-        self.last_lr = -1.0
-        self.weight_sum = 0.0
-
-    def _initialize_state(self, state, p):
-        if "step" not in state:
-            state["step"] = 0
-            state["exp_avg"] = torch.zeros_like(p, memory_format=torch.preserve_format)
-            state["exp_avg_sq"] = torch.zeros_like(
-                p, memory_format=torch.preserve_format
-            )
-            if self.defaults["kahan_sum"]:
-                state["kahan_comp"] = torch.zeros_like(
-                    p, memory_format=torch.preserve_format
-                )
-
-    def eval(self):
-        for group in self.param_groups:
-            train_mode = group.get("train_mode", True)
-            beta1, _ = group["betas"]
-            if train_mode:
-                for p in group["params"]:
-                    state = self.state[p]
-                    if "z" in state:
-                        # Set p.data to x
-                        p.data.lerp_(
-                            end=state["z"].to(p.data.device), weight=1 - 1 / beta1
-                        )
-                group["train_mode"] = False
-
-    def train(self):
-        for group in self.param_groups:
-            train_mode = group.get("train_mode", False)
-            beta1, _ = group["betas"]
-            if not train_mode:
-                for p in group["params"]:
-                    state = self.state[p]
-                    if "z" in state:
-                        # Set p.data to y
-                        p.data.lerp_(end=state["z"].to(p.data.device), weight=1 - beta1)
-                group["train_mode"] = True
-
-    def step(self, closure=None):
-        """Performs a single optimization step."""
-        loss = None
-        if closure is not None:
-            loss = closure()
-
-        for group in self.param_groups:
-            beta1, beta2 = group["betas"]
-            lr = group["lr"]
-            eps = group["eps"]
-            weight_decay = group["weight_decay"]
-            warmup_steps = group.get("warmup_steps", 0)
-            kahan_sum = group["kahan_sum"]
-
-            k = self.k
-
-            # Adjust learning rate with warmup
-            if k < warmup_steps:
-                sched = (k + 1) / warmup_steps
-            else:
-                sched = 1.0
-
-            bias_correction2 = 1 - beta2 ** (k + 1)
-            adjusted_lr = lr * sched * (bias_correction2**0.5)
-            self.lr_max = max(adjusted_lr, self.lr_max)
-
-            for p in group["params"]:
-                if p.grad is None:
-                    continue
-                grad = p.grad.data
-
-                state = self.state[p]
-                self._initialize_state(state, p)
-
-                exp_avg, exp_avg_sq = state["exp_avg"], state["exp_avg_sq"]
-
-                if kahan_sum:
-                    kahan_comp = state["kahan_comp"]
-                    grad.add_(kahan_comp)
-
-                # Decay the first and second moment running average coefficient
-                exp_avg.mul_(beta1).add_(grad, alpha=1 - beta1)
-                exp_avg_sq.mul_(beta2).addcmul_(grad, grad, value=1 - beta2)
-
-                denom = exp_avg_sq.sqrt().add_(eps)
-
-                step_size = adjusted_lr / (bias_correction2**0.5)
-
-                if weight_decay != 0:
-                    p.data.add_(p.data, alpha=-weight_decay)
-
-                # Kahan summation to improve precision
-                step = exp_avg / denom
-                p.data.add_(-step_size * step)
-
-                if kahan_sum:
-                    buffer = p.data.add(-step_size * step)
-                    kahan_comp.copy_(p.data.sub(buffer).add(buffer.sub_(p.data)))
-
-            self.k += 1
-            self.last_lr = adjusted_lr
-            StateTracker.set_last_lr(adjusted_lr)
-
-        return loss
-
-    def get_last_lr(self):
-        return self.last_lr
diff --git a/videotuna/third_party/flux/training/optimizers/soap/__init__.py b/videotuna/third_party/flux/training/optimizers/soap/__init__.py
deleted file mode 100644
index 0d5c3927..00000000
--- a/videotuna/third_party/flux/training/optimizers/soap/__init__.py
+++ /dev/null
@@ -1,479 +0,0 @@
-from itertools import chain
-
-import torch
-import torch.nn as nn
-import torch.optim as optim
-
-# Parts of the code are modifications of Pytorch's AdamW optimizer
-# Parts of the code are modifications of code from https://github.com/jiaweizzhao/GaLore/blob/master/galore_torch/galore_projector.py
-
-
-class SOAP(optim.Optimizer):
-    """
-    Implements SOAP algorithm (https://arxiv.org/abs/2409.11321).
-
-    Parameters:
-        params (`Iterable[nn.parameter.Parameter]`):
-            Iterable of parameters to optimize or dictionaries defining parameter groups.
-        lr (`float`, *optional*, defaults to 0.003):
-            The learning rate to use.
-        betas (`Tuple[float,float]`, *optional*, defaults to `(0.95, 0.95)`):
-            Adam's betas parameters (b1, b2).
-        shampoo_beta (`float`, *optional*, defaults to -1):
-            If >= 0, use this beta for the preconditioner (L and R in paper, state['GG'] below) moving average instead of betas[1].
-        eps (`float`, *optional*, defaults to 1e-08):
-            Adam's epsilon for numerical stability.
-        weight_decay (`float`, *optional*, defaults to 0.01): weight decay coefficient.
-        precondition_frequency (`int`, *optional*, defaults to 10):
-            How often to update the preconditioner.
-        max_precond_dim (`int`, *optional*, defaults to 10000):
-            Maximum dimension of the preconditioner.
-            Set to 10000, so that we exclude most common vocab sizes while including layers.
-        merge_dims (`bool`, *optional*, defaults to `False`):
-            Whether or not to merge dimensions of the preconditioner.
-        precondition_1d (`bool`, *optional*, defaults to `False`):
-            Whether or not to precondition 1D gradients.
-        normalize_grads (`bool`, *optional*, defaults to `False`):
-            Whether or not to normalize gradients per layer.
-            Helps at large precondition_frequency (~100 in our experiments),
-            but hurts performance at small precondition_frequency (~10 in our experiments).
-        data_format (`str`, *optional*, defaults to `channels_first`):
-            Data format of the input for convolutional layers.
-            Should be "channels_last" for data_format of NHWC and "channels_first" for NCHW.
-        correct_bias (`bool`, *optional*, defaults to `True`):
-            Whether or not to use bias correction in Adam.
-    """
-
-    def __init__(
-        self,
-        params,
-        lr: float = 3e-3,
-        betas=(0.95, 0.95),
-        shampoo_beta: float = -1,
-        eps: float = 1e-8,
-        weight_decay: float = 0.01,
-        precondition_frequency: int = 10,
-        max_precond_dim: int = 10000,  #
-        merge_dims: bool = False,  # Merge dimensions till the product of the dimensions is less than or equal to max_precond_dim.
-        precondition_1d: bool = False,
-        normalize_grads: bool = False,
-        data_format: str = "channels_first",
-        correct_bias: bool = True,
-    ):
-        defaults = {
-            "lr": lr,
-            "betas": betas,
-            "shampoo_beta": shampoo_beta,
-            "eps": eps,
-            "weight_decay": weight_decay,
-            "precondition_frequency": precondition_frequency,
-            "max_precond_dim": max_precond_dim,
-            "merge_dims": merge_dims,
-            "precondition_1d": precondition_1d,
-            "normalize_grads": normalize_grads,
-            "correct_bias": correct_bias,
-        }
-        super().__init__(params, defaults)
-        self._data_format = data_format
-
-    def merge_dims(self, grad, max_precond_dim):
-        """
-        Merges dimensions of the gradient tensor till the product of the dimensions is less than or equal to max_precond_dim.
-        """
-        assert self._data_format in ["channels_first", "channels_last"]
-        if self._data_format == "channels_last" and grad.dim() == 4:
-            grad = grad.permute(0, 3, 1, 2)
-        shape = grad.shape
-        new_shape = []
-
-        curr_shape = 1
-        for sh in shape:
-            temp_shape = curr_shape * sh
-            if temp_shape > max_precond_dim:
-                if curr_shape > 1:
-                    new_shape.append(curr_shape)
-                    curr_shape = sh
-                else:
-                    new_shape.append(sh)
-                    curr_shape = 1
-            else:
-                curr_shape = temp_shape
-
-        if curr_shape > 1 or len(new_shape) == 0:
-            new_shape.append(curr_shape)
-
-        new_grad = grad.reshape(new_shape)
-        return new_grad
-
-    @torch.no_grad()
-    def step(self, closure=None):
-        """
-        Performs a single optimization step.
-
-        Arguments:
-            closure (`Callable`, *optional*): A closure that reevaluates the model and returns the loss.
-        """
-        loss = None
-        if closure is not None:
-            loss = closure()
-
-        for group in self.param_groups:
-            for p in group["params"]:
-                if p.grad is None:
-                    continue
-                grad = p.grad
-
-                state = self.state[p]
-
-                if "step" not in state:
-                    state["step"] = 0
-
-                # State initialization
-                if "exp_avg" not in state:
-                    # Exponential moving average of gradient values
-                    state["exp_avg"] = torch.zeros_like(grad)
-                    # Exponential moving average of squared gradient values
-                    state["exp_avg_sq"] = torch.zeros_like(grad)
-
-                if "Q" not in state:
-                    self.init_preconditioner(
-                        grad,
-                        state,
-                        precondition_frequency=group["precondition_frequency"],
-                        precondition_1d=group["precondition_1d"],
-                        shampoo_beta=(
-                            group["shampoo_beta"]
-                            if group["shampoo_beta"] >= 0
-                            else group["betas"][1]
-                        ),
-                        max_precond_dim=group["max_precond_dim"],
-                        merge_dims=group["merge_dims"],
-                    )
-                    self.update_preconditioner(
-                        grad,
-                        state,
-                        max_precond_dim=group["max_precond_dim"],
-                        merge_dims=group["merge_dims"],
-                        precondition_1d=group["precondition_1d"],
-                    )
-                    continue  # first step is skipped so that we never use the current gradients in the projection.
-
-                # Projecting gradients to the eigenbases of Shampoo's preconditioner
-                # i.e. projecting to the eigenbases of matrices in state['GG']
-                grad_projected = self.project(
-                    grad,
-                    state,
-                    merge_dims=group["merge_dims"],
-                    max_precond_dim=group["max_precond_dim"],
-                )
-
-                exp_avg, exp_avg_sq = state["exp_avg"], state["exp_avg_sq"]
-                beta1, beta2 = group["betas"]
-
-                state["step"] += 1
-
-                # Decay the first and second moment running average coefficient
-                # In-place operations to update the averages at the same time
-                exp_avg.mul_(beta1).add_(grad, alpha=(1.0 - beta1))
-                exp_avg_sq.mul_(beta2).add_(
-                    grad_projected.square(), alpha=(1.0 - beta2)
-                )
-
-                denom = exp_avg_sq.sqrt().add_(group["eps"])
-
-                # Projecting the exponential moving average of gradients to the eigenbases of Shampoo's preconditioner
-                # i.e. projecting to the eigenbases of matrices in state['GG']
-                exp_avg_projected = self.project(
-                    exp_avg,
-                    state,
-                    merge_dims=group["merge_dims"],
-                    max_precond_dim=group["max_precond_dim"],
-                )
-
-                step_size = group["lr"]
-                if group["correct_bias"]:
-                    bias_correction1 = 1.0 - beta1 ** (state["step"])
-                    bias_correction2 = 1.0 - beta2 ** (state["step"])
-                    step_size = step_size * (bias_correction2**0.5) / bias_correction1
-
-                # Projecting back the preconditioned (by Adam) exponential moving average of gradients
-                # to the original space
-                norm_grad = self.project_back(
-                    exp_avg_projected / denom,
-                    state,
-                    merge_dims=group["merge_dims"],
-                    max_precond_dim=group["max_precond_dim"],
-                )
-
-                if group["normalize_grads"]:
-                    norm_grad = norm_grad / (1e-30 + torch.mean(norm_grad**2) ** 0.5)
-
-                p.add_(norm_grad, alpha=-step_size)
-
-                # From AdamW code: Just adding the square of the weights to the loss function is *not*
-                # the correct way of using L2 regularization/weight decay with Adam,
-                # since that will interact with the m and v parameters in strange ways.
-                #
-                # Instead we want to decay the weights in a manner that doesn't interact
-                # with the m/v parameters. This is equivalent to adding the square
-                # of the weights to the loss with plain (non-momentum) SGD.
-                # Add weight decay at the end (fixed version)
-                if group["weight_decay"] > 0.0:
-                    p.add_(p, alpha=(-group["lr"] * group["weight_decay"]))
-
-                # Update is done after the gradient step to avoid using current gradients in the projection.
-                self.update_preconditioner(
-                    grad,
-                    state,
-                    max_precond_dim=group["max_precond_dim"],
-                    merge_dims=group["merge_dims"],
-                    precondition_1d=group["precondition_1d"],
-                )
-
-        return loss
-
-    def init_preconditioner(
-        self,
-        grad,
-        state,
-        precondition_frequency=10,
-        shampoo_beta=0.95,
-        max_precond_dim=10000,
-        precondition_1d=False,
-        merge_dims=False,
-    ):
-        """
-        Initializes the preconditioner matrices (L and R in the paper).
-        """
-        state["GG"] = (
-            []
-        )  # Will hold all the preconditioner matrices (L and R in the paper).
-        if grad.dim() == 1:
-            if not precondition_1d or grad.shape[0] > max_precond_dim:
-                state["GG"].append([])
-            else:
-                state["GG"].append(
-                    torch.zeros(grad.shape[0], grad.shape[0], device=grad.device)
-                )
-        else:
-            if merge_dims:
-                grad = self.merge_dims(grad, max_precond_dim)
-
-            for sh in grad.shape:
-                if sh > max_precond_dim:
-                    state["GG"].append([])
-                else:
-                    state["GG"].append(torch.zeros(sh, sh, device=grad.device))
-
-        state["Q"] = None  # Will hold all the eigenbases of the preconditioner.
-        state["precondition_frequency"] = precondition_frequency
-        state["shampoo_beta"] = shampoo_beta
-
-    def project(self, grad, state, merge_dims=False, max_precond_dim=10000):
-        """
-        Projects the gradient to the eigenbases of the preconditioner.
-        """
-        original_shape = grad.shape
-        if merge_dims:
-            if grad.dim() == 4 and self._data_format == "channels_last":
-                permuted_shape = grad.permute(0, 3, 1, 2).shape
-            grad = self.merge_dims(grad, max_precond_dim)
-
-        for mat in state["Q"]:
-            if len(mat) > 0:
-                grad = torch.tensordot(
-                    grad,
-                    mat.to(grad.dtype),
-                    dims=[[0], [0]],
-                )
-            else:
-                permute_order = list(range(1, len(grad.shape))) + [0]
-                grad = grad.permute(permute_order)
-
-        if merge_dims:
-            if self._data_format == "channels_last" and len(original_shape) == 4:
-                grad = grad.reshape(permuted_shape).permute(0, 2, 3, 1)
-            else:
-                grad = grad.reshape(original_shape)
-        return grad
-
-    def update_preconditioner(
-        self,
-        grad,
-        state,
-        max_precond_dim=10000,
-        merge_dims=False,
-        precondition_1d=False,
-    ):
-        """
-        Updates the preconditioner matrices and the eigenbases (L, R, Q_L, Q_R in the paper).
-        """
-        if grad.dim() == 1:
-            if precondition_1d and grad.shape[0] <= max_precond_dim:
-                state["GG"][0].lerp_(
-                    grad.unsqueeze(1) @ grad.unsqueeze(0), 1 - state["shampoo_beta"]
-                )
-        else:
-            if merge_dims:
-                new_grad = self.merge_dims(grad, max_precond_dim)
-                for idx, sh in enumerate(new_grad.shape):
-                    if sh <= max_precond_dim:
-                        outer_product = torch.tensordot(
-                            new_grad,
-                            new_grad,
-                            dims=[
-                                [
-                                    *chain(
-                                        range(idx), range(idx + 1, len(new_grad.shape))
-                                    )
-                                ]
-                            ]
-                            * 2,
-                        )
-                        state["GG"][idx].lerp_(outer_product, 1 - state["shampoo_beta"])
-            else:
-                for idx, sh in enumerate(grad.shape):
-                    if sh <= max_precond_dim:
-                        outer_product = torch.tensordot(
-                            grad,
-                            grad,
-                            # Contracts across all dimensions except for k.
-                            dims=[[*chain(range(idx), range(idx + 1, len(grad.shape)))]]
-                            * 2,
-                        )
-                        state["GG"][idx].lerp_(
-                            outer_product.to(state["GG"][idx].dtype),
-                            1 - state["shampoo_beta"],
-                        )
-
-        if state["Q"] is None:
-            state["Q"] = self.get_orthogonal_matrix(state["GG"])
-        if state["step"] > 0 and state["step"] % state["precondition_frequency"] == 0:
-            state["Q"] = self.get_orthogonal_matrix_QR(
-                state, max_precond_dim, merge_dims
-            )
-
-    def project_back(self, grad, state, merge_dims=False, max_precond_dim=10000):
-        """
-        Projects the gradient back to the original space.
-        """
-        original_shape = grad.shape
-        if merge_dims:
-            if self._data_format == "channels_last" and grad.dim() == 4:
-                permuted_shape = grad.permute(0, 3, 1, 2).shape
-            grad = self.merge_dims(grad, max_precond_dim)
-        for mat in state["Q"]:
-            if len(mat) > 0:
-                grad = torch.tensordot(
-                    grad.to(mat.dtype),
-                    mat,
-                    dims=[[0], [1]],
-                )
-            else:
-                permute_order = list(range(1, len(grad.shape))) + [0]
-                grad = grad.permute(permute_order)
-
-        if merge_dims:
-            if self._data_format == "channels_last" and len(original_shape) == 4:
-                grad = grad.reshape(permuted_shape).permute(0, 2, 3, 1)
-            else:
-                grad = grad.reshape(original_shape)
-        return grad
-
-    def get_orthogonal_matrix(self, mat):
-        """
-        Computes the eigenbases of the preconditioner using torch.linalg.eigh decomposition.
-        """
-        matrix = []
-        for m in mat:
-            if len(m) == 0:
-                matrix.append([])
-                continue
-            if m.data.dtype != torch.float:
-                float_data = False
-                original_type = m.data.dtype
-                original_device = m.data.device
-                matrix.append(m.data.float())
-            else:
-                float_data = True
-                matrix.append(m.data)
-
-        final = []
-        for m in matrix:
-            if len(m) == 0:
-                final.append([])
-                continue
-            try:
-                _, Q = torch.linalg.eigh(
-                    m + 1e-30 * torch.eye(m.shape[0], device=m.device)
-                )
-            except:
-                _, Q = torch.linalg.eigh(
-                    m.to(torch.float64) + 1e-30 * torch.eye(m.shape[0], device=m.device)
-                )
-                Q = Q.to(m.dtype)
-            Q = torch.flip(Q, [1])
-
-            if not float_data:
-                Q = Q.to(original_device).type(original_type)
-            final.append(Q)
-        return final
-
-    def get_orthogonal_matrix_QR(self, state, max_precond_dim=10000, merge_dims=False):
-        """
-        Computes the eigenbases of the preconditioner using one round of power iteration
-        followed by torch.linalg.qr decomposition.
-        """
-        precond_list = state["GG"]
-        orth_list = state["Q"]
-
-        matrix = []
-        orth_matrix = []
-        for m, o in zip(precond_list, orth_list):
-            if len(m) == 0:
-                matrix.append([])
-                orth_matrix.append([])
-                continue
-            if m.data.dtype != torch.float:
-                float_data = False
-                original_type = m.data.dtype
-                original_device = m.data.device
-                matrix.append(m.data.float())
-                orth_matrix.append(o.data.float())
-            else:
-                float_data = True
-                matrix.append(m.data.float())
-                orth_matrix.append(o.data.float())
-
-        orig_shape = state["exp_avg_sq"].shape
-        if self._data_format == "channels_last" and len(orig_shape) == 4:
-            permuted_shape = state["exp_avg_sq"].permute(0, 3, 1, 2).shape
-        if merge_dims:
-            exp_avg_sq = self.merge_dims(state["exp_avg_sq"], max_precond_dim)
-        else:
-            exp_avg_sq = state["exp_avg_sq"]
-
-        final = []
-        for ind, (m, o) in enumerate(zip(matrix, orth_matrix)):
-            if len(m) == 0:
-                final.append([])
-                continue
-            est_eig = torch.diag(o.T @ m @ o)
-            sort_idx = torch.argsort(est_eig, descending=True)
-            exp_avg_sq = exp_avg_sq.index_select(ind, sort_idx)
-            o = o[:, sort_idx]
-            power_iter = m @ o
-            Q, _ = torch.linalg.qr(power_iter)
-
-            if not float_data:
-                Q = Q.to(original_device).type(original_type)
-            final.append(Q)
-
-        if merge_dims:
-            if self._data_format == "channels_last" and len(orig_shape) == 4:
-                exp_avg_sq = exp_avg_sq.reshape(permuted_shape).permute(0, 2, 3, 1)
-            else:
-                exp_avg_sq = exp_avg_sq.reshape(orig_shape)
-
-        state["exp_avg_sq"] = exp_avg_sq
-        return final
diff --git a/videotuna/third_party/flux/training/peft_init.py b/videotuna/third_party/flux/training/peft_init.py
deleted file mode 100644
index e11417f2..00000000
--- a/videotuna/third_party/flux/training/peft_init.py
+++ /dev/null
@@ -1,25 +0,0 @@
-import torch
-
-
-def approximate_normal_tensor(inp, target, scale=1.0):
-    device = inp.device
-    tensor = torch.randn_like(target).to(device)
-    desired_norm = inp.norm().to(device)
-    desired_mean = inp.mean().to(device)
-    desired_std = inp.std().to(device)
-
-    current_norm = tensor.norm()
-    tensor = tensor * (desired_norm / current_norm)
-    current_std = tensor.std()
-    tensor = tensor * (desired_std / current_std)
-    tensor = tensor - tensor.mean() + desired_mean
-    tensor.mul_(scale)
-
-    target.copy_(tensor)
-
-
-def init_lokr_network_with_perturbed_normal(lycoris, scale=1e-3):
-    with torch.no_grad():
-        for lora in lycoris.loras:
-            lora.lokr_w1.fill_(1.0)
-            approximate_normal_tensor(lora.org_weight, lora.lokr_w2, scale=scale)
diff --git a/videotuna/third_party/flux/training/quantisation/__init__.py b/videotuna/third_party/flux/training/quantisation/__init__.py
deleted file mode 100644
index 07356201..00000000
--- a/videotuna/third_party/flux/training/quantisation/__init__.py
+++ /dev/null
@@ -1,224 +0,0 @@
-import logging
-import os
-
-import torch
-
-from videotuna.third_party.flux.training.multi_process import should_log
-from videotuna.third_party.flux.training.state_tracker import StateTracker
-
-logger = logging.getLogger(__name__)
-if should_log():
-    logger.setLevel(os.environ.get("SIMPLETUNER_LOG_LEVEL", "INFO"))
-else:
-    logger.setLevel(logging.ERROR)
-
-
-def _quanto_type_map(model_precision: str):
-    if model_precision == "no_change":
-        return None
-    from optimum.quanto import qfloat8, qfloat8_e4m3fnuz, qint2, qint4, qint8
-
-    if model_precision == "int2-quanto":
-        quant_level = qint2
-    elif model_precision == "int4-quanto":
-        quant_level = qint4
-    elif model_precision == "int8-quanto":
-        quant_level = qint8
-    elif model_precision == "fp8-quanto" or model_precision == "fp8uz-quanto":
-        if torch.backends.mps.is_available():
-            logger.warning(
-                "MPS doesn't support dtype float8, you must select another precision level such as bf16, int2, int8, or int8."
-            )
-
-            return None
-        if model_precision == "fp8-quanto":
-            quant_level = qfloat8
-        elif model_precision == "fp8uz-quanto":
-            quant_level = qfloat8_e4m3fnuz
-    else:
-        raise ValueError(f"Invalid quantisation level: {model_precision}")
-
-    return quant_level
-
-
-def _quanto_model(
-    model,
-    model_precision,
-    base_model_precision=None,
-    quantize_activations: bool = False,
-):
-    try:
-        from optimum.quanto import QTensor, freeze, quantize
-
-        from videotuna.third_party.flux.training.quantisation import quanto_workarounds
-    except ImportError as e:
-        raise ImportError(
-            f"To use Quanto, please install the optimum library: `pip install optimum-quanto`: {e}"
-        )
-
-    if model_precision is None:
-        model_precision = base_model_precision
-    if model is None:
-        return model
-    if model_precision == "no_change" or model_precision is None:
-        logger.info(f"...No quantisation applied to {model.__class__.__name__}.")
-        return model
-
-    logger.info(f"Quantising {model.__class__.__name__}. Using {model_precision}.")
-    weight_quant = _quanto_type_map(model_precision)
-    extra_quanto_args = {}
-    if StateTracker.get_args().model_family == "sd3":
-        extra_quanto_args["exclude"] = [
-            "*.norm",
-            "*.norm1",
-            "*.norm1_context",
-            "*.norm_q",
-            "*.norm_k",
-            "*.norm_added_q",
-            "*.norm_added_k",
-            "proj_out",
-            "pos_embed",
-            "norm_out",
-            "context_embedder",
-            "time_text_embed",
-        ]
-    elif StateTracker.get_args().model_family == "flux":
-        extra_quanto_args["exclude"] = [
-            "*.norm",
-            "*.norm1",
-            "*.norm2",
-            "*.norm2_context",
-            "proj_out",
-            "x_embedder",
-            "norm_out",
-            "context_embedder",
-        ]
-    if quantize_activations:
-        logger.info("Freezing model weights and activations")
-        extra_quanto_args["activations"] = weight_quant
-    else:
-        logger.info("Freezing model weights only")
-
-    try:
-        quantize(model, weights=weight_quant, **extra_quanto_args)
-        freeze(model)
-    except Exception as e:
-        if "out of memory" in str(e).lower():
-            logger.error(
-                "GPU ran out of memory during quantisation. Use --quantize_via=cpu to use the slower CPU method."
-            )
-        raise e
-
-    return model
-
-
-def _torchao_filter_fn(mod: torch.nn.Module, fqn: str):
-    # don't convert the output module
-    if fqn == "proj_out":
-        return False
-    # don't convert linear modules with weight dimensions not divisible by 16
-    if isinstance(mod, torch.nn.Linear):
-        if mod.in_features % 16 != 0 or mod.out_features % 16 != 0:
-            return False
-    return True
-
-
-def _torchao_model(
-    model,
-    model_precision,
-    base_model_precision=None,
-    quantize_activations: bool = False,
-):
-    if model_precision is None:
-        model_precision = base_model_precision
-    if model is None:
-        return model
-    if model_precision == "no_change" or model_precision is None:
-        logger.info(f"...No quantisation applied to {model.__class__.__name__}.")
-        return model
-
-    try:
-        import torchao
-        from torchao.float8 import Float8LinearConfig, convert_to_float8_training
-        from torchao.prototype.quantized_training import (
-            int8_weight_only_quantized_training,
-        )
-        from torchao.quantization import quantize_
-
-        from videotuna.third_party.flux.training.quantisation import torchao_workarounds
-    except ImportError as e:
-        raise ImportError(
-            f"To use torchao, please install the torchao library: `pip install torchao`: {e}"
-        )
-    logger.info(f"Quantising {model.__class__.__name__}. Using {model_precision}.")
-    if quantize_activations:
-        logger.warning(
-            "Activation quantisation is not used in TorchAO. This will be ignored."
-        )
-
-    if model_precision == "int8-torchao":
-        quantize_(
-            model,
-            int8_weight_only_quantized_training(),  # , filter_fn=_torchao_filter_fn
-        )
-    elif model_precision == "fp8-torchao":
-        model = convert_to_float8_training(
-            model,
-            module_filter_fn=_torchao_filter_fn,
-            config=Float8LinearConfig(pad_inner_dim=True),
-        )
-
-    else:
-        raise ValueError(f"Invalid quantisation level: {base_model_precision}")
-
-    return model
-
-
-def quantise_model(
-    unet, transformer, text_encoder_1, text_encoder_2, text_encoder_3, controlnet, args
-):
-    if "quanto" in args.base_model_precision.lower():
-        logger.info("Loading Quanto. This may take a few minutes.")
-        quant_fn = _quanto_model
-    elif "torchao" in args.base_model_precision.lower():
-        logger.info("Loading TorchAO. This may take a few minutes.")
-        quant_fn = _torchao_model
-    if transformer is not None:
-        transformer = quant_fn(
-            transformer,
-            model_precision=args.base_model_precision,
-            quantize_activations=args.quantize_activations,
-        )
-    if unet is not None:
-        unet = quant_fn(
-            unet,
-            model_precision=args.base_model_precision,
-            quantize_activations=args.quantize_activations,
-        )
-    if controlnet is not None:
-        controlnet = quant_fn(
-            controlnet,
-            model_precision=args.base_model_precision,
-            quantize_activations=args.quantize_activations,
-        )
-
-    if text_encoder_1 is not None:
-        text_encoder_1 = quant_fn(
-            text_encoder_1,
-            model_precision=args.text_encoder_1_precision,
-            base_model_precision=args.base_model_precision,
-        )
-    if text_encoder_2 is not None:
-        text_encoder_2 = quant_fn(
-            text_encoder_2,
-            model_precision=args.text_encoder_2_precision,
-            base_model_precision=args.base_model_precision,
-        )
-    if text_encoder_3 is not None:
-        text_encoder_3 = quant_fn(
-            text_encoder_3,
-            model_precision=args.text_encoder_3_precision,
-            base_model_precision=args.base_model_precision,
-        )
-
-    return unet, transformer, text_encoder_1, text_encoder_2, text_encoder_3, controlnet
diff --git a/videotuna/third_party/flux/training/quantisation/peft_workarounds.py b/videotuna/third_party/flux/training/quantisation/peft_workarounds.py
deleted file mode 100644
index 0a5be91d..00000000
--- a/videotuna/third_party/flux/training/quantisation/peft_workarounds.py
+++ /dev/null
@@ -1,421 +0,0 @@
-# Copyright 2024-present the HuggingFace Inc. team.
-#
-# Licensed under the Apache License, Version 2.0 (the "License");
-# you may not use this file except in compliance with the License.
-# You may obtain a copy of the License at
-#
-#     http://www.apache.org/licenses/LICENSE-2.0
-#
-# Unless required by applicable law or agreed to in writing, software
-# distributed under the License is distributed on an "AS IS" BASIS,
-# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
-# See the License for the specific language governing permissions and
-# limitations under the License.
-from __future__ import annotations
-
-import math
-import warnings
-from typing import Any, Optional
-
-import torch
-from peft.import_utils import is_quanto_available
-from peft.tuners.lora.layer import LoraLayer
-from peft.tuners.tuners_utils import BaseTunerLayer, check_adapters_to_merge
-from peft.utils.other import transpose
-from torch import nn
-from torch.nn import functional as F
-
-if is_quanto_available:
-    # ensure that there are no quanto imports unless optimum.quanto is installed
-    from optimum.quanto import QConv2d, QLinear
-else:
-    QConv2d, QLinear = None, None
-
-
-class QuantoLoraLinear(torch.nn.Module, LoraLayer):
-    """LoRA layer implementation for quanto QLinear"""
-
-    def __init__(
-        self,
-        base_layer,
-        adapter_name,
-        r: int = 0,
-        lora_alpha: int = 1,
-        lora_dropout: float = 0.0,
-        fan_in_fan_out: bool = False,  # Set this to True if the layer to replace stores weight like (fan_in, fan_out)
-        init_lora_weights: bool = True,
-        use_rslora: bool = False,
-        use_dora: bool = False,
-        **kwargs,
-    ):
-        if use_dora:
-            raise ValueError(
-                f"{self.__class__.__name__} does not support DoRA yet, please set it to False"
-            )
-
-        super().__init__()
-        LoraLayer.__init__(self, base_layer)
-        self.fan_in_fan_out = fan_in_fan_out
-
-        self._active_adapter = adapter_name
-        self.update_layer(
-            adapter_name, r, lora_alpha, lora_dropout, init_lora_weights, use_rslora
-        )
-
-    def forward(self, x: torch.Tensor, *args: Any, **kwargs: Any) -> torch.Tensor:
-        result = self.base_layer(x)
-        adapter_names = kwargs.pop("adapter_names", None)
-        if adapter_names is not None:
-            raise ValueError(
-                f"{self.__class__.__name__} does not support mixed_batch_forward yet."
-            )
-
-        if self.disable_adapters:
-            return result
-
-        if self.disable_adapters:
-            if self.merged:
-                self.unmerge()
-            result = self.base_layer(x, *args, **kwargs)
-        elif self.merged:
-            result = self.base_layer(x, *args, **kwargs)
-        else:
-            for active_adapter in self.active_adapters:
-                if active_adapter not in self.lora_A.keys():
-                    continue
-                lora_A = self.lora_A[active_adapter]
-                lora_B = self.lora_B[active_adapter]
-                dropout = self.lora_dropout[active_adapter]
-                scaling = self.scaling[active_adapter]
-
-                requires_conversion = not torch.is_autocast_enabled()
-                if requires_conversion:
-                    expected_dtype = result.dtype
-                    x = x.to(lora_A.weight.dtype)
-
-                output = lora_B(lora_A(dropout(x)))
-                if requires_conversion:
-                    output = output.to(expected_dtype)
-                output = output * scaling
-                result = result + output
-
-        return result
-
-    def get_delta_weight(self, adapter):
-        return (
-            transpose(
-                self.lora_B[adapter].weight @ self.lora_A[adapter].weight,
-                fan_in_fan_out=self.fan_in_fan_out,
-            )
-            * self.scaling[adapter]
-        )
-
-    def merge(
-        self, safe_merge: bool = False, adapter_names: Optional[list[str]] = None
-    ) -> None:
-        from optimum.quanto import quantize_weight
-
-        adapter_names = check_adapters_to_merge(self, adapter_names)
-        if not adapter_names:
-            # no adapter to merge
-            return
-
-        base_layer = self.get_base_layer()
-        orig_weight = base_layer.weight
-
-        for active_adapter in adapter_names:
-            delta_weight = self.get_delta_weight(active_adapter)
-            # note: no in-place for safe_merge=False
-            new_weight_data = orig_weight + delta_weight
-            if safe_merge:
-                if torch.isfinite(new_weight_data).all():
-                    raise ValueError(
-                        f"NaNs detected in the merged weights. The adapter {active_adapter} seems to be broken"
-                    )
-            quantized = quantize_weight(
-                new_weight_data, qtype=orig_weight.qtype, axis=orig_weight.axis
-            )
-            base_layer.weight._data = quantized._data
-            base_layer.weight._scale = quantized._scale
-            self.merged_adapters.append(active_adapter)
-
-    def unmerge(self) -> None:
-        from optimum.quanto import quantize_weight
-
-        if not self.merged:
-            warnings.warn("Already unmerged. Nothing to do.")
-            return
-
-        while len(self.merged_adapters) > 0:
-            active_adapter = self.merged_adapters.pop()
-            if active_adapter not in self.lora_A.keys():
-                continue
-
-            base_layer = self.get_base_layer()
-            orig_weight = base_layer.weight
-            new_weight_data = orig_weight - self.get_delta_weight(active_adapter)
-            quantized = quantize_weight(
-                new_weight_data, qtype=orig_weight.qtype, axis=orig_weight.axis
-            )
-            base_layer.weight._data = quantized._data
-            base_layer.weight._scale = quantized._scale
-
-    def __repr__(self) -> str:
-        rep = super().__repr__()
-        return "lora." + rep
-
-
-class QuantoLoraConv2d(torch.nn.Module, LoraLayer):
-    """LoRA layer implementation for quanto QConv2d"""
-
-    def __init__(
-        self,
-        base_layer,
-        adapter_name,
-        r: int = 0,
-        lora_alpha: int = 1,
-        lora_dropout: float = 0.0,
-        init_lora_weights: bool = True,
-        use_rslora: bool = False,
-        use_dora: bool = False,
-        **kwargs,
-    ):
-        if use_dora:
-            raise ValueError(
-                f"{self.__class__.__name__} does not support DoRA yet, please set it to False"
-            )
-
-        super().__init__()
-        LoraLayer.__init__(self, base_layer)
-
-        self._active_adapter = adapter_name
-        self.update_layer(
-            adapter_name, r, lora_alpha, lora_dropout, init_lora_weights, use_rslora
-        )
-
-    def update_layer(
-        self,
-        adapter_name,
-        r,
-        lora_alpha,
-        lora_dropout,
-        init_lora_weights,
-        use_rslora,
-        use_dora,
-    ):
-        # same as lora.layer.Conv2d
-        if r <= 0:
-            raise ValueError(
-                f"`r` should be a positive integer value but the value passed is {r}"
-            )
-
-        self.r[adapter_name] = r
-        self.lora_alpha[adapter_name] = lora_alpha
-        if lora_dropout > 0.0:
-            lora_dropout_layer = nn.Dropout(p=lora_dropout)
-        else:
-            lora_dropout_layer = nn.Identity()
-
-        self.lora_dropout[adapter_name] = lora_dropout_layer
-        # Actual trainable parameters
-        base_layer = self.get_base_layer()
-        kernel_size = base_layer.kernel_size
-        stride = base_layer.stride
-        padding = base_layer.padding
-        self.lora_A[adapter_name] = nn.Conv2d(
-            self.in_features, r, kernel_size, stride, padding, bias=False
-        )
-        self.lora_B[adapter_name] = nn.Conv2d(
-            r, self.out_features, (1, 1), (1, 1), bias=False
-        )
-        if use_rslora:
-            self.scaling[adapter_name] = lora_alpha / math.sqrt(r)
-        else:
-            self.scaling[adapter_name] = lora_alpha / r
-
-        if init_lora_weights == "loftq":
-            self.loftq_init(adapter_name)
-        elif init_lora_weights:
-            self.reset_lora_parameters(adapter_name, init_lora_weights)
-
-        # call this before dora_init
-        self._move_adapter_to_device_of_base_layer(adapter_name)
-
-        if use_dora:
-            # TODO: Implement DoRA
-            self.dora_init(adapter_name)
-            self.use_dora[adapter_name] = True
-        else:
-            self.use_dora[adapter_name] = False
-
-        self.set_adapter(self.active_adapters)
-
-    def forward(self, x: torch.Tensor, *args: Any, **kwargs: Any) -> torch.Tensor:
-        result = self.base_layer(x)
-        adapter_names = kwargs.pop("adapter_names", None)
-        if adapter_names is not None:
-            raise ValueError(
-                f"{self.__class__.__name__} does not support mixed_batch_forward yet."
-            )
-
-        if self.disable_adapters:
-            return result
-
-        if self.disable_adapters:
-            if self.merged:
-                self.unmerge()
-            result = self.base_layer(x, *args, **kwargs)
-        elif self.merged:
-            result = self.base_layer(x, *args, **kwargs)
-        else:
-            for active_adapter in self.active_adapters:
-                if active_adapter not in self.lora_A.keys():
-                    continue
-                lora_A = self.lora_A[active_adapter]
-                lora_B = self.lora_B[active_adapter]
-                dropout = self.lora_dropout[active_adapter]
-                scaling = self.scaling[active_adapter]
-
-                requires_conversion = not torch.is_autocast_enabled()
-                if requires_conversion:
-                    expected_dtype = result.dtype
-                    x = x.to(lora_A.weight.dtype)
-
-                output = lora_B(lora_A(dropout(x)))
-                if requires_conversion:
-                    output = output.to(expected_dtype)
-                output = output * scaling
-                result = result + output
-
-        return result
-
-    def get_delta_weight(self, adapter):
-        # same as lora.layer.Conv2d
-        device = self.lora_B[adapter].weight.device
-        dtype = self.lora_A[adapter].weight.dtype
-
-        # In case users wants to merge the adapter weights that are in
-        # (b)float16 while being on CPU, we need to cast the weights to float32, perform the merge and then cast back to
-        # (b)float16 because some CPUs have slow bf16/fp16 matmuls.
-        cast_to_fp32 = device.type == "cpu" and (
-            dtype == torch.float16 or dtype == torch.bfloat16
-        )
-
-        weight_A = self.lora_A[adapter].weight
-        weight_B = self.lora_B[adapter].weight
-
-        if cast_to_fp32:
-            weight_A = weight_A.float()
-            weight_B = weight_B.float()
-
-        # https://github.com/bmaltais/kohya_ss/blob/feb6728762a8f463d15ba936d189d4c3abfaa1ab/networks/lora.py#L117
-        if self.get_base_layer().weight.size()[2:4] == (1, 1):
-            # conv2d 1x1
-            output_tensor = (
-                weight_B.squeeze(3).squeeze(2) @ weight_A.squeeze(3).squeeze(2)
-            ).unsqueeze(2).unsqueeze(3) * self.scaling[adapter]
-        else:
-            # conv2d 3x3
-            output_tensor = (
-                F.conv2d(
-                    weight_A.permute(1, 0, 2, 3),
-                    weight_B,
-                ).permute(1, 0, 2, 3)
-                * self.scaling[adapter]
-            )
-
-        if cast_to_fp32:
-            output_tensor = output_tensor.to(dtype=dtype)
-
-            # cast back the weights
-            self.lora_A[adapter].weight.data = weight_A.to(dtype)
-            self.lora_B[adapter].weight.data = weight_B.to(dtype)
-
-        return output_tensor
-
-    def merge(
-        self, safe_merge: bool = False, adapter_names: Optional[list[str]] = None
-    ) -> None:
-        # same as lora.quanto.QuantoLoraLinear
-        from optimum.quanto import quantize_weight
-
-        adapter_names = check_adapters_to_merge(self, adapter_names)
-        if not adapter_names:
-            # no adapter to merge
-            return
-
-        base_layer = self.get_base_layer()
-        orig_weight = base_layer.weight
-
-        for active_adapter in adapter_names:
-            delta_weight = self.get_delta_weight(active_adapter)
-            # note: no in-place for safe_merge=False
-            new_weight_data = orig_weight + delta_weight
-            if safe_merge:
-                if torch.isfinite(new_weight_data).all():
-                    raise ValueError(
-                        f"NaNs detected in the merged weights. The adapter {active_adapter} seems to be broken"
-                    )
-            quantized = quantize_weight(
-                new_weight_data, qtype=orig_weight.qtype, axis=orig_weight.axis
-            )
-            base_layer.weight._data = quantized._data
-            base_layer.weight._scale = quantized._scale
-            self.merged_adapters.append(active_adapter)
-
-    def unmerge(self) -> None:
-        # same as lora.quanto.QuantoLoraLinear
-        from optimum.quanto import quantize_weight
-
-        if not self.merged:
-            warnings.warn("Already unmerged. Nothing to do.")
-            return
-
-        while len(self.merged_adapters) > 0:
-            active_adapter = self.merged_adapters.pop()
-            if active_adapter not in self.lora_A.keys():
-                continue
-
-            base_layer = self.get_base_layer()
-            orig_weight = base_layer.weight
-            new_weight_data = orig_weight - self.get_delta_weight(active_adapter)
-            quantized = quantize_weight(
-                new_weight_data, qtype=orig_weight.qtype, axis=orig_weight.axis
-            )
-            base_layer.weight._data = quantized._data
-            base_layer.weight._scale = quantized._scale
-
-    def __repr__(self) -> str:
-        rep = super().__repr__()
-        return "lora." + rep
-
-
-def dispatch_quanto(
-    target: torch.nn.Module,
-    adapter_name: str,
-    **kwargs: Any,
-) -> Optional[torch.nn.Module]:
-    new_module = None
-
-    if isinstance(target, BaseTunerLayer):
-        target_base_layer = target.get_base_layer()
-    else:
-        target_base_layer = target
-
-    if is_quanto_available() and isinstance(target_base_layer, QLinear):
-        new_module = QuantoLoraLinear(target, adapter_name, **kwargs)
-        target.weight = target_base_layer.weight
-
-        if hasattr(target, "bias"):
-            target.bias = target_base_layer.bias
-    elif is_quanto_available() and isinstance(target_base_layer, QConv2d):
-        new_module = QuantoLoraConv2d(target, adapter_name, **kwargs)
-        target.weight = target_base_layer.weight
-
-        if hasattr(target, "bias"):
-            target.bias = target_base_layer.bias
-
-    return new_module
-
-
-custom_module_mapping = {QConv2d: QuantoLoraConv2d, QLinear: QuantoLoraLinear}
diff --git a/videotuna/third_party/flux/training/quantisation/quanto_workarounds.py b/videotuna/third_party/flux/training/quantisation/quanto_workarounds.py
deleted file mode 100644
index c6b286d4..00000000
--- a/videotuna/third_party/flux/training/quantisation/quanto_workarounds.py
+++ /dev/null
@@ -1,115 +0,0 @@
-import optimum
-import torch
-
-if torch.cuda.is_available():
-    # the marlin fp8 kernel needs some help with dtype casting for some reason
-    # see: https://github.com/huggingface/optimum-quanto/pull/296#issuecomment-2380719201
-    from optimum.quanto.library.extensions.cuda import ext as quanto_ext
-
-    # Save the original operator
-    original_gemm_f16f8_marlin = torch.ops.quanto.gemm_f16f8_marlin
-
-    def fp8_marlin_gemm_wrapper(
-        a: torch.Tensor,
-        b_q_weight: torch.Tensor,
-        b_scales: torch.Tensor,
-        workspace: torch.Tensor,
-        num_bits: int,
-        size_m: int,
-        size_n: int,
-        size_k: int,
-    ) -> torch.Tensor:
-        # Ensure 'a' has the correct dtype
-        a = a.to(b_scales.dtype)
-        # Call the original operator
-        return original_gemm_f16f8_marlin(
-            a,
-            b_q_weight,
-            b_scales,
-            workspace,
-            num_bits,
-            size_m,
-            size_n,
-            size_k,
-        )
-
-    # Monkey-patch the operator
-    torch.ops.quanto.gemm_f16f8_marlin = fp8_marlin_gemm_wrapper
-
-    class TinyGemmQBitsLinearFunction(
-        optimum.quanto.tensor.function.QuantizedLinearFunction
-    ):
-        @staticmethod
-        def forward(ctx, input, other, bias):
-            ctx.save_for_backward(input, other)
-            if type(input) is not torch.Tensor:
-                input = input.dequantize()
-            in_features = input.shape[-1]
-            out_features = other.shape[0]
-            output_shape = input.shape[:-1] + (out_features,)
-            output = torch._weight_int4pack_mm(
-                input.view(-1, in_features).to(dtype=other.dtype),
-                other._data._data,
-                other._group_size,
-                other._scale_shift,
-            )
-            output = output.view(output_shape)
-            if bias is not None:
-                output = output + bias
-            return output
-
-    from optimum.quanto.tensor.weights import tinygemm
-
-    tinygemm.qbits.TinyGemmQBitsLinearFunction = TinyGemmQBitsLinearFunction
-
-
-class WeightQBytesLinearFunction(
-    optimum.quanto.tensor.function.QuantizedLinearFunction
-):
-    @staticmethod
-    def forward(ctx, input, other, bias=None):
-        ctx.save_for_backward(input, other)
-        if isinstance(input, optimum.quanto.tensor.QBytesTensor):
-            output = torch.ops.quanto.qbytes_mm(
-                input._data, other._data, input._scale * other._scale
-            )
-        else:
-            in_features = input.shape[-1]
-            out_features = other.shape[0]
-            output_shape = input.shape[:-1] + (out_features,)
-            output = torch.ops.quanto.qbytes_mm(
-                input.reshape(-1, in_features), other._data, other._scale
-            )
-            output = output.view(output_shape)
-        if bias is not None:
-            output = output + bias
-        return output
-
-
-optimum.quanto.tensor.weights.qbytes.WeightQBytesLinearFunction = (
-    WeightQBytesLinearFunction
-)
-
-
-def reshape_qlf_backward(ctx, gO):
-    # another one where we need .reshape instead of .view
-    input_gO = other_gO = bias_gO = None
-    input, other = ctx.saved_tensors
-    out_features, in_features = other.shape
-    if ctx.needs_input_grad[0]:
-        # grad(A@(B.t()) = gO => grad(A) = gO@(B.t().t()) = gO@B
-        input_gO = torch.matmul(gO, other)
-    if ctx.needs_input_grad[1]:
-        # grad(B@A.t()) = gO.t() => grad(B) = gO.t()@(A.t().t()) = gO.t()@A
-        other_gO = torch.matmul(
-            gO.reshape(-1, out_features).t(),
-            input.to(gO.dtype).reshape(-1, in_features),
-        )
-    if ctx.needs_input_grad[2]:
-        # Bias gradient is the sum on all dimensions but the last one
-        dim = tuple(range(gO.ndim - 1))
-        bias_gO = gO.sum(dim)
-    return input_gO, other_gO, bias_gO
-
-
-optimum.quanto.tensor.function.QuantizedLinearFunction.backward = reshape_qlf_backward
diff --git a/videotuna/third_party/flux/training/quantisation/torchao_workarounds.py b/videotuna/third_party/flux/training/quantisation/torchao_workarounds.py
deleted file mode 100644
index 0c5fa991..00000000
--- a/videotuna/third_party/flux/training/quantisation/torchao_workarounds.py
+++ /dev/null
@@ -1,41 +0,0 @@
-from typing import Optional
-
-import torch
-import torchao
-from torch import Tensor
-from torchao.prototype.quantized_training.int8 import Int8QuantizedTrainingLinearWeight
-
-
-class _Int8WeightOnlyLinear(torch.autograd.Function):
-    @staticmethod
-    def forward(
-        ctx,
-        input: Tensor,
-        weight: Int8QuantizedTrainingLinearWeight,
-        bias: Optional[Tensor] = None,
-    ):
-        ctx.save_for_backward(input, weight)
-        ctx.bias = bias is not None
-
-        # NOTE: we have to .T before .to(input.dtype) for torch.compile() mixed matmul to work
-        out = (input @ weight.int_data.T.to(input.dtype)) * weight.scale
-        out = out + bias if bias is not None else out
-        return out
-
-    @staticmethod
-    def backward(ctx, grad_output):
-        input, weight = ctx.saved_tensors
-
-        grad_input = (grad_output * weight.scale) @ weight.int_data.to(
-            grad_output.dtype
-        )
-        # print(f"dtypes: grad_output {grad_output.dtype}, input {input.dtype}, weight {weight.dtype}")
-        # here is the patch: we will cast the input to the grad_output dtype.
-        grad_weight = grad_output.view(-1, weight.shape[0]).T @ input.to(
-            grad_output.dtype
-        ).reshape(-1, weight.shape[1])
-        grad_bias = grad_output.view(-1, weight.shape[0]).sum(0) if ctx.bias else None
-        return grad_input, grad_weight, grad_bias
-
-
-torchao.prototype.quantized_training.int8._Int8WeightOnlyLinear = _Int8WeightOnlyLinear
diff --git a/videotuna/third_party/flux/training/save_hooks.py b/videotuna/third_party/flux/training/save_hooks.py
deleted file mode 100644
index ddd7b839..00000000
--- a/videotuna/third_party/flux/training/save_hooks.py
+++ /dev/null
@@ -1,520 +0,0 @@
-import json
-import logging
-import os
-import shutil
-
-from diffusers.training_utils import EMAModel, _set_state_dict_into_text_encoder
-from diffusers.utils import (
-    convert_state_dict_to_diffusers,
-    convert_unet_state_dict_to_peft,
-)
-from peft import set_peft_model_state_dict
-from peft.utils import get_peft_model_state_dict
-from safetensors import safe_open
-from safetensors.torch import save_file
-from tqdm import tqdm
-
-from videotuna.third_party.flux.models.sdxl.pipeline import StableDiffusionXLPipeline
-from videotuna.third_party.flux.models.smoldit import SmolDiT2DModel, SmolDiTPipeline
-from videotuna.third_party.flux.training.multi_process import _get_rank as get_rank
-from videotuna.third_party.flux.training.state_tracker import StateTracker
-from videotuna.third_party.flux.training.wrappers import unwrap_model
-
-logger = logging.getLogger("SaveHookManager")
-logger.setLevel(os.environ.get("SIMPLETUNER_LOG_LEVEL") or "INFO")
-
-try:
-    from diffusers import (
-        ControlNetModel,
-        FluxPipeline,
-        HunyuanDiTPipeline,
-        PixArtSigmaPipeline,
-        SD3Transformer2DModel,
-        StableDiffusion3Pipeline,
-        StableDiffusionPipeline,
-        UNet2DConditionModel,
-    )
-except ImportError:
-    logger.error("This release requires the latest version of Diffusers.")
-
-try:
-    from diffusers.models import PixArtTransformer2DModel
-except Exception as e:
-    logger.error(
-        f"Can not load Pixart Sigma model class. This release requires the latest version of Diffusers: {e}"
-    )
-    raise e
-
-try:
-    from diffusers.models import FluxTransformer2DModel
-except Exception as e:
-    logger.error(
-        f"Can not load FluxTransformer2DModel model class. This release requires the latest version of Diffusers: {e}"
-    )
-    raise e
-
-try:
-    from diffusers.models import HunyuanDiT2DModel
-except Exception as e:
-    logger.error(
-        f"Can not load Hunyuan DiT model class. This release requires the latest version of Diffusers: {e}"
-    )
-    raise e
-
-
-def merge_safetensors_files(directory):
-    json_file_name = "diffusion_pytorch_model.safetensors.index.json"
-    json_file_path = os.path.join(directory, json_file_name)
-    if not os.path.exists(json_file_path):
-        return
-
-    # Step 2: Load the JSON file and extract the weight map
-    with open(json_file_path, "r") as file:
-        data = json.load(file)
-        weight_map = data.get("weight_map")
-        if weight_map is None:
-            raise KeyError("'weight_map' key not found in the JSON file.")
-
-    # Collect all unique safetensors files from weight_map
-    files_to_load = set(weight_map.values())
-    all_tensors = {}
-
-    # Load tensors from each unique file
-    for file_name in files_to_load:
-        part_file_path = os.path.join(directory, file_name)
-        if not os.path.exists(part_file_path):
-            raise FileNotFoundError(f"Part file {file_name} not found.")
-
-        with safe_open(part_file_path, framework="pt", device="cpu") as f:
-            for tensor_key in f.keys():
-                if tensor_key in weight_map:
-                    all_tensors[tensor_key] = f.get_tensor(tensor_key)
-
-    # Step 4: Save all loaded tensors into a single new safetensors file
-    output_file_path = os.path.join(directory, "diffusion_pytorch_model.safetensors")
-    save_file(all_tensors, output_file_path)
-    # Step 5: If the file now exists, remove the index and part files
-    if os.path.exists(output_file_path):
-        os.remove(json_file_path)
-        for file_name in files_to_load:
-            os.remove(os.path.join(directory, file_name))
-
-    logger.info(f"All tensors have been merged and saved into {output_file_path}")
-
-
-class SaveHookManager:
-    def __init__(
-        self,
-        args,
-        unet,
-        transformer,
-        ema_model,
-        text_encoder_1,
-        text_encoder_2,
-        accelerator,
-        use_deepspeed_optimizer,
-    ):
-
-        self.args = args
-        self.unet = unet
-        self.transformer = transformer
-        if self.unet is not None and self.transformer is not None:
-            raise ValueError("Both `unet` and `transformer` cannot be set.")
-        self.text_encoder_1 = text_encoder_1
-        self.text_encoder_2 = text_encoder_2
-        self.ema_model = ema_model
-        self.accelerator = accelerator
-        self.use_deepspeed_optimizer = use_deepspeed_optimizer
-
-        self.denoiser_class = None
-        self.denoiser_subdir = None
-        self.pipeline_class = None
-        if self.unet is not None:
-            self.denoiser_class = UNet2DConditionModel
-            self.denoiser_subdir = "unet"
-            self.pipeline_class = StableDiffusionXLPipeline
-            if StateTracker.get_model_family() == "legacy":
-                self.pipeline_class = StableDiffusionPipeline
-        elif self.transformer is not None:
-            if args.model_family == "sd3":
-                self.denoiser_class = SD3Transformer2DModel
-                self.pipeline_class = StableDiffusion3Pipeline
-            elif (
-                args.model_family.lower() == "flux"
-                and not args.flux_attention_masked_training
-            ):
-                self.denoiser_class = FluxTransformer2DModel
-                self.pipeline_class = FluxPipeline
-            elif (
-                args.model_family.lower() == "flux"
-                and args.flux_attention_masked_training
-            ):
-                from videotuna.third_party.flux.models.flux.transformer import (
-                    FluxTransformer2DModelWithMasking,
-                )
-
-                self.denoiser_class = FluxTransformer2DModelWithMasking
-                self.pipeline_class = FluxPipeline
-            elif hasattr(args, "hunyuan_dit") and args.hunyuan_dit:
-                self.denoiser_class = HunyuanDiT2DModel
-                self.pipeline_class = HunyuanDiTPipeline
-            elif args.model_family == "pixart_sigma":
-                self.denoiser_class = PixArtTransformer2DModel
-                self.pipeline_class = PixArtSigmaPipeline
-            elif args.model_family == "smoldit":
-                self.denoiser_class = SmolDiT2DModel
-                self.pipeline_class = SmolDiTPipeline
-            self.denoiser_subdir = "transformer"
-
-        if args.controlnet:
-            self.denoiser_class = ControlNetModel
-            self.denoiser_subdir = "controlnet"
-        logger.debug(f"Denoiser class set to: {self.denoiser_class.__name__}.")
-        logger.debug(f"Pipeline class set to: {self.pipeline_class.__name__}.")
-
-        self.ema_model_cls = None
-        self.ema_model_subdir = None
-        if unet is not None:
-            self.ema_model_subdir = "unet_ema"
-            self.ema_model_cls = UNet2DConditionModel
-        if transformer is not None:
-            self.ema_model_subdir = "transformer_ema"
-            if self.args.model_family == "sd3":
-                self.ema_model_cls = SD3Transformer2DModel
-            elif self.args.model_family == "pixart_sigma":
-                self.ema_model_cls = PixArtTransformer2DModel
-        self.training_state_path = "training_state.json"
-        if self.accelerator is not None:
-            rank = get_rank()
-            if rank > 0:
-                self.training_state_path = f"training_state-rank{rank}.json"
-
-    def _save_lora(self, models, weights, output_dir):
-        # for SDXL/others, there are only two options here. Either are just the unet attn processor layers
-        # or there are the unet and text encoder atten layers.
-        unet_lora_layers_to_save = None
-        transformer_lora_layers_to_save = None
-        text_encoder_1_lora_layers_to_save = None
-        text_encoder_2_lora_layers_to_save = None
-        # Diffusers does not train the third text encoder.
-        # text_encoder_3_lora_layers_to_save = None
-
-        for model in models:
-            if isinstance(model, type(unwrap_model(self.accelerator, self.unet))):
-                unet_lora_layers_to_save = convert_state_dict_to_diffusers(
-                    get_peft_model_state_dict(model)
-                )
-            elif isinstance(
-                model, type(unwrap_model(self.accelerator, self.text_encoder_1))
-            ):
-                text_encoder_1_lora_layers_to_save = convert_state_dict_to_diffusers(
-                    get_peft_model_state_dict(model)
-                )
-            elif isinstance(
-                model, type(unwrap_model(self.accelerator, self.text_encoder_2))
-            ):
-                text_encoder_2_lora_layers_to_save = convert_state_dict_to_diffusers(
-                    get_peft_model_state_dict(model)
-                )
-
-            elif not isinstance(
-                model, type(unwrap_model(self.accelerator, HunyuanDiT2DModel))
-            ):
-                if isinstance(
-                    model, type(unwrap_model(self.accelerator, self.transformer))
-                ):
-                    transformer_lora_layers_to_save = get_peft_model_state_dict(model)
-
-            elif not self.use_deepspeed_optimizer:
-                raise ValueError(f"unexpected save model: {model.__class__}")
-
-            # make sure to pop weight so that corresponding model is not saved again
-            if weights:
-                weights.pop()
-
-        if self.args.model_family == "flux":
-            self.pipeline_class.save_lora_weights(
-                output_dir,
-                transformer_lora_layers=transformer_lora_layers_to_save,
-                text_encoder_lora_layers=text_encoder_1_lora_layers_to_save,
-            )
-        elif self.args.model_family == "sd3":
-            self.pipeline_class.save_lora_weights(
-                output_dir,
-                transformer_lora_layers=transformer_lora_layers_to_save,
-                text_encoder_lora_layers=text_encoder_1_lora_layers_to_save,
-                text_encoder_2_lora_layers=text_encoder_2_lora_layers_to_save,
-            )
-        elif self.args.model_family == "legacy":
-            self.pipeline_class.save_lora_weights(
-                output_dir,
-                unet_lora_layers=unet_lora_layers_to_save,
-                text_encoder_lora_layers=text_encoder_1_lora_layers_to_save,
-            )
-        elif self.args.model_family == "sdxl" or self.args.model_family == "kolors":
-            self.pipeline_class.save_lora_weights(
-                output_dir,
-                unet_lora_layers=unet_lora_layers_to_save,
-                text_encoder_lora_layers=text_encoder_1_lora_layers_to_save,
-                text_encoder_2_lora_layers=text_encoder_2_lora_layers_to_save,
-            )
-        else:
-            raise ValueError(f"unexpected model family: {self.args.model_family}")
-
-    def _save_lycoris(self, models, weights, output_dir):
-        """
-        save wrappers for lycoris. For now, text encoders are not trainable
-        via lycoris.
-        """
-        from videotuna.third_party.flux.publishing.huggingface import (
-            LORA_SAFETENSORS_FILENAME,
-        )
-
-        for _ in models:
-            if weights:
-                weights.pop()
-
-        lycoris_config = None
-        with open(self.args.lycoris_config, "r") as f:
-            lycoris_config = json.load(f)
-
-        self.accelerator._lycoris_wrapped_network.save_weights(
-            os.path.join(output_dir, LORA_SAFETENSORS_FILENAME),
-            list(self.accelerator._lycoris_wrapped_network.parameters())[0].dtype,
-            {"lycoris_config": json.dumps(lycoris_config)},  # metadata
-        )
-
-        # copy the config into the repo
-        shutil.copy2(
-            self.args.lycoris_config, os.path.join(output_dir, "lycoris_config.json")
-        )
-
-        logger.info("LyCORIS weights have been saved to disk")
-
-    def _save_full_model(self, models, weights, output_dir):
-        # Create a temporary directory for atomic saves
-        temporary_dir = output_dir.replace("checkpoint", "temporary")
-        os.makedirs(temporary_dir, exist_ok=True)
-
-        if self.args.use_ema:
-            tqdm.write("Saving EMA model")
-            self.ema_model.save_pretrained(
-                os.path.join(temporary_dir, self.ema_model_subdir),
-                max_shard_size="10GB",
-            )
-
-        if self.unet is not None:
-            sub_dir = "unet"
-        if self.transformer is not None:
-            sub_dir = "transformer"
-        if self.args.controlnet:
-            sub_dir = "controlnet"
-        for model in models:
-            model.save_pretrained(
-                os.path.join(temporary_dir, sub_dir), max_shard_size="10GB"
-            )
-            merge_safetensors_files(os.path.join(temporary_dir, sub_dir))
-            if weights:
-                weights.pop()  # Pop the last weight
-
-        # Copy contents of temporary directory to output directory
-        for item in os.listdir(temporary_dir):
-            s = os.path.join(temporary_dir, item)
-            d = os.path.join(output_dir, item)
-            if os.path.isdir(s):
-                shutil.copytree(s, d, dirs_exist_ok=True)  # Python 3.8+
-            else:
-                shutil.copy2(s, d)
-
-        # Remove the temporary directory
-        shutil.rmtree(temporary_dir)
-
-    def save_model_hook(self, models, weights, output_dir):
-        # Write "training_state.json" to the output directory containing the training state
-        StateTracker.save_training_state(
-            os.path.join(output_dir, self.training_state_path)
-        )
-        if not self.accelerator.is_main_process:
-            return
-        if "lora" in self.args.model_type and self.args.lora_type == "standard":
-            self._save_lora(models=models, weights=weights, output_dir=output_dir)
-            return
-        elif "lora" in self.args.model_type and self.args.lora_type == "lycoris":
-            self._save_lycoris(models=models, weights=weights, output_dir=output_dir)
-            return
-        else:
-            self._save_full_model(models=models, weights=weights, output_dir=output_dir)
-
-    def _load_lora(self, models, input_dir):
-        logger.info(f"Loading LoRA weights from Path: {input_dir}")
-        unet_ = None
-        transformer_ = None
-        denoiser = None
-        text_encoder_one_ = None
-        text_encoder_two_ = None
-
-        while len(models) > 0:
-            model = models.pop()
-
-            if isinstance(
-                unwrap_model(self.accelerator, model),
-                type(unwrap_model(self.accelerator, self.unet)),
-            ):
-                unet_ = model
-                denoiser = unet_
-            elif isinstance(
-                unwrap_model(self.accelerator, model),
-                type(unwrap_model(self.accelerator, self.transformer)),
-            ):
-                transformer_ = model
-                denoiser = transformer_
-            elif isinstance(
-                unwrap_model(self.accelerator, model),
-                type(unwrap_model(self.accelerator, self.text_encoder_1)),
-            ):
-                text_encoder_one_ = model
-            elif isinstance(
-                unwrap_model(self.accelerator, model),
-                type(unwrap_model(self.accelerator, self.text_encoder_2)),
-            ):
-                text_encoder_two_ = model
-            else:
-                raise ValueError(
-                    f"unexpected save model: {model.__class__}"
-                    f"\nunwrapped: {unwrap_model(self.accelerator, model).__class__}"
-                    f"\nunet: {unwrap_model(self.accelerator, self.unet).__class__}"
-                )
-
-        if self.args.model_family in ["sd3", "flux", "pixart_sigma"]:
-            key_to_replace = "transformer"
-            lora_state_dict = self.pipeline_class.lora_state_dict(input_dir)
-        else:
-            key_to_replace = "unet"
-            lora_state_dict, _ = self.pipeline_class.lora_state_dict(input_dir)
-
-        denoiser_state_dict = {
-            f'{k.replace(f"{key_to_replace}.", "")}': v
-            for k, v in lora_state_dict.items()
-            if k.startswith(f"{key_to_replace}.")
-        }
-        denoiser_state_dict = convert_unet_state_dict_to_peft(denoiser_state_dict)
-        incompatible_keys = set_peft_model_state_dict(
-            denoiser, denoiser_state_dict, adapter_name="default"
-        )
-
-        if incompatible_keys is not None:
-            # check only for unexpected keys
-            unexpected_keys = getattr(incompatible_keys, "unexpected_keys", None)
-            if unexpected_keys:
-                logger.warning(
-                    f"Loading adapter weights from state_dict led to unexpected keys not found in the model: "
-                    f" {unexpected_keys}. "
-                )
-
-        if self.args.train_text_encoder:
-            # Do we need to call `scale_lora_layers()` here?
-            _set_state_dict_into_text_encoder(
-                lora_state_dict,
-                prefix="text_encoder.",
-                text_encoder=text_encoder_one_,
-            )
-
-            _set_state_dict_into_text_encoder(
-                lora_state_dict,
-                prefix="text_encoder_2.",
-                text_encoder=text_encoder_two_,
-            )
-
-        logger.info("Completed loading LoRA weights.")
-
-    def _load_lycoris(self, models, input_dir):
-        from videotuna.third_party.flux.publishing.huggingface import (
-            LORA_SAFETENSORS_FILENAME,
-        )
-
-        while len(models) > 0:
-            model = models.pop()
-
-        state = self.accelerator._lycoris_wrapped_network.load_weights(
-            os.path.join(input_dir, LORA_SAFETENSORS_FILENAME)
-        )
-        if len(state.keys()) > 0:
-            logging.error(f"LyCORIS failed to load: {state}")
-            raise RuntimeError("Loading of LyCORIS model failed")
-        weight_dtype = StateTracker.get_weight_dtype()
-        if self.transformer is not None:
-            self.accelerator._lycoris_wrapped_network.to(
-                device=self.accelerator.device, dtype=weight_dtype
-            )
-        elif self.unet is not None:
-            self.accelerator._lycoris_wrapped_network.to(
-                device=self.accelerator.device, dtype=weight_dtype
-            )
-        else:
-            raise ValueError("No model found to load LyCORIS weights into.")
-
-        logger.info("LyCORIS weights have been loaded from disk")
-        # disable LyCORIS spam logging
-        lycoris_logger = logging.getLogger("LyCORIS")
-        lycoris_logger.setLevel(logging.ERROR)
-
-    def _load_full_model(self, models, input_dir):
-        if self.args.use_ema:
-            load_model = EMAModel.from_pretrained(
-                os.path.join(input_dir, self.ema_model_subdir), self.ema_model_cls
-            )
-            self.ema_model.load_state_dict(load_model.state_dict())
-            self.ema_model.to(self.accelerator.device)
-            del load_model
-        if self.args.model_type == "full":
-            return_exception = False
-            for i in range(len(models)):
-                try:
-                    # pop models so that they are not loaded again
-                    model = models.pop()
-                    load_model = self.denoiser_class.from_pretrained(
-                        input_dir, subfolder=self.denoiser_subdir
-                    )
-                    if (
-                        self.args.model_family == "sd3"
-                        and not self.args.train_text_encoder
-                    ):
-                        logger.info(
-                            "Unloading text encoders for full SD3 training without --train_text_encoder"
-                        )
-                        (self.text_encoder_1, self.text_encoder_2) = (None, None)
-
-                    model.register_to_config(**load_model.config)
-                    model.load_state_dict(load_model.state_dict())
-                    del load_model
-                except Exception as e:
-                    import traceback
-
-                    return_exception = f"Could not load model: {e}, traceback: {traceback.format_exc()}"
-
-            if return_exception:
-                raise Exception(return_exception)
-
-    def load_model_hook(self, models, input_dir):
-        # Check the checkpoint dir for a "training_state.json" file to load
-        training_state_path = os.path.join(input_dir, self.training_state_path)
-        if (
-            not os.path.exists(training_state_path)
-            and self.training_state_path != "training_state.json"
-        ):
-            logger.warning(
-                f"Could not find {training_state_path} in checkpoint dir {input_dir}. Trying the default path."
-            )
-            training_state_path = os.path.join(input_dir, "training_state.json")
-        if os.path.exists(training_state_path):
-            StateTracker.load_training_state(training_state_path)
-        else:
-            logger.warning(
-                f"Could not find {training_state_path} in checkpoint dir {input_dir}"
-            )
-        if "lora" in self.args.model_type and self.args.lora_type == "standard":
-            self._load_lora(models=models, input_dir=input_dir)
-        elif "lora" in self.args.model_type and self.args.lora_type == "lycoris":
-            self._load_lycoris(models=models, input_dir=input_dir)
-        else:
-            self._load_full_model(models=models, input_dir=input_dir)
diff --git a/videotuna/third_party/flux/training/schedulers.py b/videotuna/third_party/flux/training/schedulers.py
deleted file mode 100644
index 68561eb5..00000000
--- a/videotuna/third_party/flux/training/schedulers.py
+++ /dev/null
@@ -1,44 +0,0 @@
-import os
-
-from accelerate.logging import get_logger
-
-logger = get_logger(__name__, log_level=os.environ.get("SIMPLETUNER_LOG_LEVEL", "INFO"))
-
-target_level = os.environ.get("SIMPLETUNER_LOG_LEVEL", "INFO")
-logger.setLevel(target_level)
-
-
-def load_scheduler_from_args(args):
-    flow_matching = False
-    if (
-        args.model_family == "sd3" and args.flow_matching_loss != "diffusion"
-    ) or args.model_family == "flux":
-        # Stable Diffusion 3 uses rectified flow.
-        flow_matching = True
-        from diffusers import FlowMatchEulerDiscreteScheduler
-
-        noise_scheduler = FlowMatchEulerDiscreteScheduler.from_pretrained(
-            args.pretrained_model_name_or_path,
-            subfolder="scheduler",
-            shift=1 if args.model_family == "sd3" else 3,
-        )
-    else:
-        if args.model_family == "legacy":
-            args.rescale_betas_zero_snr = True
-            args.training_scheduler_timestep_spacing = "trailing"
-
-        from diffusers import DDPMScheduler
-
-        noise_scheduler = DDPMScheduler.from_pretrained(
-            args.pretrained_model_name_or_path,
-            subfolder="scheduler",
-            rescale_betas_zero_snr=args.rescale_betas_zero_snr,
-            timestep_spacing=args.training_scheduler_timestep_spacing,
-        )
-        args.prediction_type = noise_scheduler.config.prediction_type
-        if flow_matching and args.flow_matching_loss == "diffusion":
-            logger.warning(
-                "Since --flow_matching_loss=diffusion, we will be reparameterising the model to v-prediction diffusion objective. This will break things for a while. Perhaps forever.."
-            )
-
-    return args, flow_matching, noise_scheduler
diff --git a/videotuna/third_party/flux/training/state_tracker.py b/videotuna/third_party/flux/training/state_tracker.py
deleted file mode 100644
index a72910fd..00000000
--- a/videotuna/third_party/flux/training/state_tracker.py
+++ /dev/null
@@ -1,566 +0,0 @@
-import json
-import logging
-from os import environ
-from pathlib import Path
-
-logger = logging.getLogger("StateTracker")
-logger.setLevel(environ.get("SIMPLETUNER_LOG_LEVEL", "INFO"))
-
-filename_mapping = {
-    "all_image_files": "image",
-    "all_vae_cache_files": "vae",
-    "all_text_cache_files": "text",
-}
-
-
-class StateTracker:
-    config_path = None
-    # Class variables
-    model_type = ""
-    # Job ID for FastAPI. None if local.
-    job_id = None
-
-    ## Training state
-    global_step = 0
-    global_resume_step = None
-    epoch_step = 0
-    epoch_micro_step = 0
-    epoch = 1
-
-    ## Caches
-    all_image_files = {}
-    all_vae_cache_files = {}
-    all_text_cache_files = {}
-    all_caption_files = None
-
-    ## Backend entities for retrieval
-    default_text_embed_cache = None
-    _is_sdxl_refiner = False
-    accelerator = None
-    data_backends = {}
-    parquet_databases = {}
-    # A list of backend IDs to exhaust.
-    exhausted_backends = []
-    # A dict of backend IDs to the number of times they have been repeated.
-    repeats = {}
-    # The images we'll use for upscaling at validation time. Stored at startup.
-    validation_sample_images = []
-    vae = None
-    vae_dtype = None
-    weight_dtype = None
-    args = None
-    # Aspect to resolution map, we'll store once generated for consistency.
-    aspect_resolution_map = {}
-
-    # for schedulefree
-    last_lr = 0.0
-
-    # hugging face hub user details
-    hf_user = None
-
-    webhook_handler = None
-
-    @classmethod
-    def delete_cache_files(
-        cls, data_backend_id: str = None, preserve_data_backend_cache=False
-    ):
-        for cache_name in [
-            "all_image_files",
-            "all_vae_cache_files",
-            "all_text_cache_files",
-        ]:
-            if filename_mapping[cache_name] in str(preserve_data_backend_cache):
-                continue
-            data_backend_id_suffix = ""
-            if data_backend_id:
-                data_backend_id_suffix = f"_{data_backend_id}"
-            cache_path = (
-                Path(cls.args.output_dir) / f"{cache_name}{data_backend_id_suffix}.json"
-            )
-            if cache_path.exists():
-                try:
-                    cache_path.unlink()
-                except:
-                    pass
-
-    @classmethod
-    def _load_from_disk(cls, cache_name):
-        cache_path = Path(cls.args.output_dir) / f"{cache_name}.json"
-        if cache_path.exists():
-            try:
-                with cache_path.open("r") as f:
-                    return json.load(f)
-            except Exception as e:
-                logger.error(
-                    f"Invalidating cache: error loading {cache_name} from disk. {e}"
-                )
-                return None
-        return None
-
-    @classmethod
-    def _save_to_disk(cls, cache_name, data):
-        cache_path = Path(cls.args.output_dir) / f"{cache_name}.json"
-        with cache_path.open("w") as f:
-            json.dump(data, f)
-
-    @classmethod
-    def set_config_path(cls, config_path: str):
-        cls.config_path = config_path
-
-    @classmethod
-    def get_config_path(cls):
-        return cls.config_path
-
-    @classmethod
-    def set_model_family(cls, model_type: str):
-        if model_type not in [
-            "legacy",
-            "sdxl",
-            "sd3",
-            "pixart_sigma",
-            "kolors",
-            "smoldit",
-            "flux",
-        ]:
-            raise ValueError(f"Unknown model type: {model_type}")
-        cls.model_type = model_type
-
-    @classmethod
-    def get_model_family(cls):
-        return cls.model_type
-
-    @classmethod
-    def get_hf_user(cls):
-        return cls.hf_user
-
-    @classmethod
-    def set_hf_user(cls, hf_user):
-        cls.hf_user = hf_user
-
-    @classmethod
-    def get_hf_username(cls):
-        if cls.hf_user is not None and "name" in cls.hf_user:
-            return cls.hf_user["name"]
-        return None
-
-    @classmethod
-    def is_sdxl_refiner(cls, set_value=None):
-        if set_value is not None:
-            cls._is_sdxl_refiner = set_value
-        return cls._is_sdxl_refiner
-
-    @classmethod
-    def set_parquet_database(cls, data_backend_id: str, parquet_database: tuple):
-        """parquet_database is a tuple (dataframe, filename_column, caption_column, fallback_caption_column)"""
-        cls.parquet_databases[data_backend_id] = parquet_database
-
-    @classmethod
-    def get_parquet_database(cls, data_backend_id: str):
-        return cls.parquet_databases.get(data_backend_id, (None, None, None, None))
-
-    @classmethod
-    def set_image_files(cls, raw_file_list: list, data_backend_id: str):
-        if cls.all_image_files[data_backend_id] is not None:
-            cls.all_image_files[data_backend_id].clear()
-        else:
-            cls.all_image_files[data_backend_id] = {}
-        for subdirectory_list in raw_file_list:
-            _, _, files = subdirectory_list
-            for image in files:
-                cls.all_image_files[data_backend_id][image] = False
-        cls._save_to_disk(
-            "all_image_files_{}".format(data_backend_id),
-            cls.all_image_files[data_backend_id],
-        )
-        logger.debug(
-            f"set_image_files found {len(cls.all_image_files[data_backend_id])} images."
-        )
-        return cls.all_image_files[data_backend_id]
-
-    @classmethod
-    def get_image_files(cls, data_backend_id: str):
-        if data_backend_id not in cls.all_image_files:
-            cls.all_image_files[data_backend_id] = cls._load_from_disk(
-                "all_image_files_{}".format(data_backend_id)
-            )
-        return cls.all_image_files[data_backend_id]
-
-    @classmethod
-    def get_global_resume_step(cls):
-        return cls.global_resume_step
-
-    @classmethod
-    def set_global_resume_step(cls, global_resume_step: int):
-        cls.global_resume_step = global_resume_step
-
-    @classmethod
-    def get_global_step(cls):
-        return cls.global_step
-
-    @classmethod
-    def set_global_step(cls, global_step: int):
-        cls.global_step = global_step
-
-    @classmethod
-    def get_epoch(cls):
-        return cls.epoch
-
-    @classmethod
-    def set_epoch(cls, epoch: int):
-        logger.debug(f"Current training state: {cls.get_training_state()}")
-        cls.epoch = epoch
-
-    @classmethod
-    def get_epoch_step(cls):
-        return cls.epoch_step
-
-    @classmethod
-    def set_epoch_step(cls, epoch_step: int):
-        cls.epoch_step = epoch_step
-
-    @classmethod
-    def set_repeats(cls, repeats: dict):
-        cls.repeats = repeats
-
-    @classmethod
-    def load_training_state(cls, state_path: str):
-        try:
-            with open(state_path, "r") as f:
-                training_state = json.load(f)
-        except OSError as e:
-            logger.error(f"Error loading training state: {e}")
-            training_state = {}
-        except Exception as e:
-            logger.error(f"Error loading training state: {e}")
-            training_state = {}
-        cls.set_global_step(training_state.get("global_step", 0))
-        cls.set_epoch_step(training_state.get("epoch_step", 0))
-        cls.set_epoch(training_state.get("epoch", 1))
-        cls.set_exhausted_backends(training_state.get("exhausted_backends", []))
-        cls.init_repeats(training_state.get("repeats", {}))
-        logging.debug(f"Training state loaded: {cls.get_training_state()}")
-
-    @classmethod
-    def save_training_state(cls, state_path: str):
-        training_state = {
-            "global_step": cls.global_step,
-            "epoch_step": cls.epoch_step,
-            "epoch": cls.epoch,
-            "exhausted_backends": cls.exhausted_backends,
-            "repeats": cls.repeats,
-        }
-        logger.debug(f"Saving training state: {training_state}")
-        with open(state_path, "w") as f:
-            json.dump(training_state, f)
-
-    @classmethod
-    def get_training_state(cls):
-        return {
-            "global_step": cls.global_step,
-            "epoch_step": cls.epoch_step,
-            "epoch": cls.epoch,
-            "exhausted_backends": cls.exhausted_backends,
-            "repeats": cls.repeats,
-        }
-
-    @classmethod
-    def set_repeats(cls, repeats: int, data_backend_id: str = None):
-        if data_backend_id is None:
-            # set every entry in repeats to zero
-            for key in cls.repeats.keys():
-                cls.repeats[key] = repeats
-        else:
-            cls.repeats[data_backend_id] = repeats
-
-    @classmethod
-    def init_repeats(cls, repeats: int):
-        cls.repeats = repeats
-
-    @classmethod
-    def get_repeats(cls, data_backend_id: str):
-        if data_backend_id not in cls.repeats:
-            return 0
-        return cls.repeats[data_backend_id]
-
-    @classmethod
-    def increment_repeats(cls, data_backend_id: str):
-        cls.set_repeats(
-            data_backend_id=data_backend_id,
-            repeats=cls.get_repeats(data_backend_id) + 1,
-        )
-
-    @classmethod
-    def backend_status(cls, data_backend_id: str):
-        return data_backend_id in cls.exhausted_backends
-
-    @classmethod
-    def backend_exhausted(cls, data_backend_id: str):
-        cls.exhausted_backends.append(data_backend_id)
-
-    @classmethod
-    def backend_enable(cls, data_backend_id: str):
-        cls.exhausted_backends.remove(data_backend_id)
-
-    @classmethod
-    def set_exhausted_backends(cls, exhausted_backends: list):
-        cls.exhausted_backends = exhausted_backends
-
-    @classmethod
-    def clear_exhausted_buckets(cls):
-        cls.exhausted_backends = []
-
-    @classmethod
-    def set_vae_cache_files(cls, raw_file_list: list, data_backend_id: str):
-        if cls.all_vae_cache_files.get(data_backend_id) is not None:
-            cls.all_vae_cache_files[data_backend_id].clear()
-        else:
-            cls.all_vae_cache_files[data_backend_id] = {}
-        for subdirectory_list in raw_file_list:
-            _, _, files = subdirectory_list
-            for image in files:
-                cls.all_vae_cache_files[data_backend_id][image] = False
-        cls._save_to_disk(
-            "all_vae_cache_files_{}".format(data_backend_id),
-            cls.all_vae_cache_files[data_backend_id],
-        )
-        logger.debug(
-            f"set_vae_cache_files found {len(cls.all_vae_cache_files[data_backend_id])} images."
-        )
-
-    @classmethod
-    def get_vae_cache_files(cls: list, data_backend_id: str):
-        if (
-            data_backend_id not in cls.all_vae_cache_files
-            or cls.all_vae_cache_files.get(data_backend_id) is None
-        ):
-            cls.all_vae_cache_files[data_backend_id] = cls._load_from_disk(
-                "all_vae_cache_files_{}".format(data_backend_id)
-            )
-        return cls.all_vae_cache_files[data_backend_id] or {}
-
-    @classmethod
-    def set_text_cache_files(cls, raw_file_list: list, data_backend_id: str):
-        if cls.all_text_cache_files[data_backend_id] is not None:
-            cls.all_text_cache_files[data_backend_id].clear()
-        else:
-            cls.all_text_cache_files[data_backend_id] = {}
-        for subdirectory_list in raw_file_list:
-            _, _, files = subdirectory_list
-            for text_embed_path in files:
-                cls.all_text_cache_files[data_backend_id][text_embed_path] = False
-        cls._save_to_disk(
-            "all_text_cache_files_{}".format(data_backend_id),
-            cls.all_text_cache_files[data_backend_id],
-        )
-        logger.debug(
-            f"set_text_cache_files found {len(cls.all_text_cache_files[data_backend_id])} images."
-        )
-
-    @classmethod
-    def get_text_cache_files(cls: list, data_backend_id: str):
-        if data_backend_id not in cls.all_text_cache_files:
-            cls.all_text_cache_files[data_backend_id] = cls._load_from_disk(
-                "all_text_cache_files_{}".format(data_backend_id)
-            )
-        return cls.all_text_cache_files[data_backend_id]
-
-    @classmethod
-    def set_caption_files(cls, caption_files):
-        cls.all_caption_files = caption_files
-        cls._save_to_disk("all_caption_files", cls.all_caption_files)
-
-    @classmethod
-    def get_caption_files(cls):
-        if not cls.all_caption_files:
-            cls.all_caption_files = cls._load_from_disk("all_caption_files")
-        return cls.all_caption_files
-
-    @classmethod
-    def get_validation_sample_images(cls):
-        return cls.validation_sample_images
-
-    @classmethod
-    def set_validation_sample_images(cls, validation_sample_images):
-        cls.validation_sample_images = validation_sample_images
-
-    @classmethod
-    def register_data_backend(cls, data_backend):
-        cls.data_backends[data_backend["id"]] = data_backend
-
-    @classmethod
-    def get_data_backend(cls, id: str):
-        return cls.data_backends[id]
-
-    @classmethod
-    def get_dataset_size(cls, data_backend_id: str):
-        if "sampler" in cls.data_backends[data_backend_id]:
-            return len(cls.data_backends[data_backend_id]["sampler"])
-        return 0
-
-    @classmethod
-    def set_conditioning_dataset(
-        cls, data_backend_id: str, conditioning_backend_id: str
-    ):
-        cls.data_backends[data_backend_id]["conditioning_data"] = cls.data_backends[
-            conditioning_backend_id
-        ]
-
-    @classmethod
-    def get_conditioning_dataset(cls, data_backend_id: str):
-        return cls.data_backends[data_backend_id].get("conditioning_data", None)
-
-    @classmethod
-    def get_data_backend_config(cls, data_backend_id: str):
-        return cls.data_backends.get(data_backend_id, {}).get("config", {})
-
-    @classmethod
-    def set_data_backend_config(cls, data_backend_id: str, config: dict):
-        if data_backend_id not in cls.data_backends:
-            cls.data_backends[data_backend_id] = {}
-        cls.data_backends[data_backend_id]["config"] = config
-
-    @classmethod
-    def clear_data_backends(cls):
-        cls.data_backends = {}
-
-    @classmethod
-    def get_data_backends(cls, _type="image"):
-        output = {}
-        for backend_id, backend in dict(cls.data_backends).items():
-            if backend.get("dataset_type", "image") == _type:
-                output[backend_id] = backend
-        return output
-
-    @classmethod
-    def set_accelerator(cls, accelerator):
-        cls.accelerator = accelerator
-
-    @classmethod
-    def get_accelerator(cls):
-        return cls.accelerator
-
-    @classmethod
-    def get_webhook_handler(cls):
-        return cls.webhook_handler
-
-    @classmethod
-    def set_webhook_handler(cls, webhook_handler):
-        cls.webhook_handler = webhook_handler
-
-    @classmethod
-    def set_job_id(cls, job_id: str):
-        cls.job_id = job_id
-
-    @classmethod
-    def get_job_id(cls):
-        return cls.job_id
-
-    @classmethod
-    def set_vae(cls, vae):
-        cls.vae = vae
-
-    @classmethod
-    def get_vae(cls):
-        return cls.vae
-
-    @classmethod
-    def set_vae_dtype(cls, vae_dtype):
-        cls.vae_dtype = vae_dtype
-
-    @classmethod
-    def get_vae_dtype(cls):
-        return cls.vae_dtype
-
-    @classmethod
-    def set_weight_dtype(cls, weight_dtype):
-        cls.weight_dtype = weight_dtype
-
-    @classmethod
-    def get_weight_dtype(cls):
-        return cls.weight_dtype
-
-    @classmethod
-    def set_args(cls, args):
-        cls.args = args
-
-    @classmethod
-    def get_args(cls):
-        return cls.args
-
-    @classmethod
-    def get_vaecache(cls, id: str):
-        return cls.data_backends[id]["vaecache"]
-
-    @classmethod
-    def set_default_text_embed_cache(cls, default_text_embed_cache):
-        cls.default_text_embed_cache = default_text_embed_cache
-
-    @classmethod
-    def get_default_text_embed_cache(cls):
-        return cls.default_text_embed_cache
-
-    @classmethod
-    def get_embedcache(cls, data_backend_id: str):
-        return cls.data_backends[data_backend_id]["text_embed_cache"]
-
-    @classmethod
-    def get_metadata_by_filepath(cls, filepath, data_backend_id: str):
-        for _, data_backend in cls.get_data_backends().items():
-            if "metadata_backend" not in data_backend:
-                continue
-            if data_backend_id != data_backend["metadata_backend"].id:
-                continue
-            metadata = data_backend["metadata_backend"].get_metadata_by_filepath(
-                filepath
-            )
-            if metadata is not None:
-                return metadata
-        return None
-
-    @classmethod
-    def get_resolution_by_aspect(cls, dataloader_resolution: float, aspect: float):
-        return cls.aspect_resolution_map.get(dataloader_resolution, {}).get(
-            str(aspect), None
-        )
-
-    @classmethod
-    def set_resolution_by_aspect(
-        cls, dataloader_resolution: float, aspect: float, resolution: int
-    ):
-        if dataloader_resolution not in cls.aspect_resolution_map:
-            cls.aspect_resolution_map[dataloader_resolution] = {}
-        cls.aspect_resolution_map[dataloader_resolution][str(aspect)] = resolution
-        cls._save_to_disk(
-            f"aspect_resolution_map-{dataloader_resolution}",
-            cls.aspect_resolution_map[dataloader_resolution],
-        )
-        logger.debug(
-            f"Aspect resolution map: {cls.aspect_resolution_map[dataloader_resolution]}"
-        )
-
-    @classmethod
-    def save_aspect_resolution_map(cls, dataloader_resolution: float):
-        cls._save_to_disk(
-            f"aspect_resolution_map-{dataloader_resolution}",
-            cls.aspect_resolution_map[dataloader_resolution],
-        )
-
-    @classmethod
-    def load_aspect_resolution_map(cls, dataloader_resolution: float):
-        if dataloader_resolution not in cls.aspect_resolution_map:
-            cls.aspect_resolution_map = {dataloader_resolution: {}}
-
-        cls.aspect_resolution_map[dataloader_resolution] = (
-            cls._load_from_disk(f"aspect_resolution_map-{dataloader_resolution}") or {}
-        )
-        logger.debug(
-            f"Aspect resolution map: {cls.aspect_resolution_map[dataloader_resolution]}"
-        )
-
-    @classmethod
-    def get_last_lr(cls):
-        return cls.last_lr
-
-    @classmethod
-    def set_last_lr(cls, last_lr: float):
-        cls.last_lr = float(last_lr)
diff --git a/videotuna/third_party/flux/training/text_encoding.py b/videotuna/third_party/flux/training/text_encoding.py
deleted file mode 100644
index 31b7c943..00000000
--- a/videotuna/third_party/flux/training/text_encoding.py
+++ /dev/null
@@ -1,272 +0,0 @@
-import os
-
-from accelerate.logging import get_logger
-from transformers import PretrainedConfig
-
-from .state_tracker import StateTracker
-
-logger = get_logger(__name__, log_level=os.environ.get("SIMPLETUNER_LOG_LEVEL", "INFO"))
-
-target_level = os.environ.get("SIMPLETUNER_LOG_LEVEL", "INFO")
-logger.setLevel(target_level)
-
-
-def import_model_class_from_model_name_or_path(
-    pretrained_model_name_or_path: str,
-    revision: str,
-    args,
-    subfolder: str = "text_encoder",
-):
-    if args.model_family.lower() == "smoldit":
-        from transformers import AutoModelForSeq2SeqLM
-
-        return AutoModelForSeq2SeqLM
-    text_encoder_config = PretrainedConfig.from_pretrained(
-        pretrained_model_name_or_path, subfolder=subfolder, revision=revision
-    )
-    model_class = text_encoder_config.architectures[0]
-
-    if model_class == "CLIPTextModel":
-        from transformers import CLIPTextModel
-
-        return CLIPTextModel
-    elif model_class == "CLIPTextModelWithProjection":
-        from transformers import CLIPTextModelWithProjection
-
-        return CLIPTextModelWithProjection
-    elif model_class == "T5EncoderModel":
-        from transformers import T5EncoderModel
-
-        return T5EncoderModel
-    elif model_class == "UMT5EncoderModel":
-        from transformers import UMT5EncoderModel
-
-        return UMT5EncoderModel
-    elif model_class == "ChatGLMModel":
-        from diffusers.pipelines.kolors.text_encoder import ChatGLMModel
-
-        return ChatGLMModel
-    else:
-        raise ValueError(f"{model_class} is not supported.")
-
-
-def get_tokenizers(args):
-    tokenizer_1, tokenizer_2, tokenizer_3 = None, None, None
-    try:
-        if args.model_family.lower() == "smoldit":
-            from transformers import AutoTokenizer
-
-            tokenizer_1 = AutoTokenizer.from_pretrained(
-                "EleutherAI/pile-t5-base", pad_token="[PAD]"
-            )
-            return tokenizer_1, tokenizer_2, tokenizer_3
-
-        tokenizer_kwargs = {
-            "pretrained_model_name_or_path": args.pretrained_model_name_or_path,
-            "subfolder": "tokenizer",
-            "revision": args.revision,
-        }
-        is_t5_model = False
-        if args.model_family.lower() == "pixart_sigma":
-            from transformers import T5Tokenizer
-
-            tokenizer_cls = T5Tokenizer
-            is_t5_model = True
-        elif args.model_family.lower() == "kolors":
-            from diffusers.pipelines.kolors.tokenizer import ChatGLMTokenizer
-
-            tokenizer_cls = ChatGLMTokenizer
-            tokenizer_1 = tokenizer_cls.from_pretrained(
-                args.pretrained_model_name_or_path,
-                subfolder="tokenizer",
-                revision=args.revision,
-                use_fast=False,
-            )
-        else:
-            from transformers import CLIPTokenizer
-
-            tokenizer_1 = CLIPTokenizer.from_pretrained(**tokenizer_kwargs)
-
-        if is_t5_model:
-            text_encoder_path = (
-                args.pretrained_t5_model_name_or_path
-                if args.pretrained_t5_model_name_or_path is not None
-                else args.pretrained_model_name_or_path
-            )
-            logger.info(
-                f"Tokenizer path: {text_encoder_path}, custom T5 model path: {args.pretrained_t5_model_name_or_path} revision: {args.revision}"
-            )
-            try:
-                tokenizer_1 = tokenizer_cls.from_pretrained(
-                    text_encoder_path,
-                    subfolder="tokenizer",
-                    revision=args.revision,
-                    use_fast=False,
-                )
-            except Exception as e:
-                logger.warning(
-                    f"Failed to load tokenizer 1: {e}, attempting no subfolder"
-                )
-                tokenizer_1 = T5Tokenizer.from_pretrained(
-                    text_encoder_path,
-                    subfolder=None,
-                    revision=args.revision,
-                    use_fast=False,
-                )
-    except Exception as e:
-        import traceback
-
-        logger.warning(
-            "Primary tokenizer (CLIP-L/14) failed to load. Continuing to test whether we have just the secondary tokenizer.."
-            f"\nError: -> {e}"
-            f"\nTraceback: {traceback.format_exc()}"
-        )
-        if args.model_family in ["sd3"]:
-            raise e
-    from transformers import T5TokenizerFast
-
-    if args.model_family not in ["pixart_sigma", "kolors"]:
-        try:
-            tokenizer_2_cls = CLIPTokenizer
-            if args.model_family.lower() == "flux":
-                tokenizer_2_cls = T5TokenizerFast
-            tokenizer_2 = tokenizer_2_cls.from_pretrained(
-                args.pretrained_model_name_or_path,
-                subfolder="tokenizer_2",
-                revision=args.revision,
-                use_fast=False,
-            )
-            if tokenizer_1 is None:
-                logger.info("Seems that we are training an SDXL refiner model.")
-                StateTracker.is_sdxl_refiner(True)
-                if args.validation_using_datasets is None:
-                    logger.warning(
-                        "Since we are training the SDXL refiner and --validation_using_datasets was not specified, it is now being enabled."
-                    )
-                    args.validation_using_datasets = True
-        except Exception as e:
-            logger.warning(
-                f"Could not load secondary tokenizer ({'OpenCLIP-G/14' if args.model_family != 'flux' else 'T5 XXL'}). Cannot continue: {e}"
-            )
-            if args.model_family in ["flux", "sd3"]:
-                raise e
-        if not tokenizer_1 and not tokenizer_2:
-            raise Exception("Failed to load tokenizer")
-    else:
-        if not tokenizer_1:
-            raise Exception("Failed to load tokenizer")
-
-    if args.model_family == "sd3":
-        try:
-            tokenizer_3 = T5TokenizerFast.from_pretrained(
-                args.pretrained_model_name_or_path,
-                subfolder="tokenizer_3",
-                revision=args.revision,
-                use_fast=True,
-            )
-        except:
-            raise ValueError(
-                "Could not load tertiary tokenizer (T5-XXL v1.1). Cannot continue."
-            )
-    return tokenizer_1, tokenizer_2, tokenizer_3
-
-
-def determine_te_path_subfolder(args):
-    if args.model_family.lower() == "kolors":
-        logger.info("Loading Kolors ChatGLM language model..")
-        text_encoder_path = args.pretrained_model_name_or_path
-        text_encoder_subfolder = "text_encoder"
-    elif args.model_family.lower() == "smoldit":
-        text_encoder_path = "EleutherAI/pile-t5-base"
-        text_encoder_subfolder = None
-    elif args.model_family.lower() == "flux":
-        text_encoder_path = args.pretrained_model_name_or_path
-        text_encoder_subfolder = "text_encoder"
-    elif args.model_family.lower() == "pixart_sigma":
-        text_encoder_path = (
-            args.pretrained_t5_model_name_or_path
-            if args.pretrained_t5_model_name_or_path is not None
-            else args.pretrained_model_name_or_path
-        )
-        # Google's version of the T5 XXL model doesn't have a subfolder :()
-        text_encoder_subfolder = "text_encoder"
-    else:
-        # sdxl and sd3 use the sd 1.5 clip-L/14 as number one.
-        # sd2.x uses openclip vit-H/14
-        logger.info("Load CLIP text encoder..")
-        text_encoder_path = args.pretrained_model_name_or_path
-        text_encoder_subfolder = "text_encoder"
-
-    return text_encoder_path, text_encoder_subfolder
-
-
-def load_tes(
-    args,
-    text_encoder_classes,
-    tokenizers,
-    weight_dtype,
-    text_encoder_path,
-    text_encoder_subfolder,
-):
-    text_encoder_cls_1, text_encoder_cls_2, text_encoder_cls_3 = text_encoder_classes
-    tokenizer_1, tokenizer_2, tokenizer_3 = tokenizers
-    text_encoder_1, text_encoder_2, text_encoder_3 = None, None, None
-    text_encoder_variant = args.variant
-
-    if tokenizer_1 is not None and not args.model_family == "smoldit":
-        if args.model_family.lower() == "pixart_sigma":
-            logger.info(
-                f"Loading T5-XXL v1.1 text encoder from {text_encoder_path}/{text_encoder_subfolder}.."
-            )
-        elif args.model_family.lower() == "flux":
-            logger.info(
-                f"Loading OpenAI CLIP-L text encoder from {text_encoder_path}/{text_encoder_subfolder}.."
-            )
-        elif args.model_family.lower() == "kolors":
-            logger.info(
-                f"Loading ChatGLM language model from {text_encoder_path}/{text_encoder_subfolder}.."
-            )
-            text_encoder_variant = "fp16"
-        else:
-            logger.info(
-                f"Loading CLIP text encoder from {text_encoder_path}/{text_encoder_subfolder}.."
-            )
-        text_encoder_1 = text_encoder_cls_1.from_pretrained(
-            text_encoder_path,
-            subfolder=text_encoder_subfolder,
-            revision=args.revision,
-            variant=text_encoder_variant,
-            torch_dtype=weight_dtype,
-        )
-    elif args.model_family.lower() == "smoldit":
-        text_encoder_1 = text_encoder_cls_1.from_pretrained(
-            "EleutherAI/pile-t5-base",
-            torch_dtype=weight_dtype,
-        ).encoder
-        text_encoder_1.eval()
-
-    if tokenizer_2 is not None:
-        if args.model_family.lower() == "flux":
-            logger.info(
-                f"Loading T5 XXL v1.1 text encoder from {args.pretrained_model_name_or_path}/text_encoder_2.."
-            )
-        else:
-            logger.info("Loading LAION OpenCLIP-G/14 text encoder..")
-        text_encoder_2 = text_encoder_cls_2.from_pretrained(
-            args.pretrained_model_name_or_path,
-            subfolder="text_encoder_2",
-            revision=args.revision,
-            torch_dtype=weight_dtype,
-            variant=args.variant,
-        )
-    if tokenizer_3 is not None and args.model_family == "sd3":
-        logger.info("Loading T5-XXL v1.1 text encoder..")
-        text_encoder_3 = text_encoder_cls_3.from_pretrained(
-            args.pretrained_model_name_or_path,
-            subfolder="text_encoder_3",
-            torch_dtype=weight_dtype,
-            revision=args.revision,
-            variant=args.variant,
-        )
-
-    return text_encoder_variant, text_encoder_1, text_encoder_2, text_encoder_3
diff --git a/videotuna/third_party/flux/training/trainer.py b/videotuna/third_party/flux/training/trainer.py
deleted file mode 100644
index e69e4e6b..00000000
--- a/videotuna/third_party/flux/training/trainer.py
+++ /dev/null
@@ -1,3170 +0,0 @@
-import copy
-import glob
-import hashlib
-import json
-import logging
-import math
-import os
-import random
-import shutil
-import sys
-
-import huggingface_hub
-import torch
-import wandb
-from configure import model_labels
-
-from videotuna.third_party.flux.publishing.huggingface import HubManager
-from videotuna.third_party.flux.training.default_settings.safety_check import (
-    safety_check,
-)
-
-# Quiet down, you.
-os.environ["ACCELERATE_LOG_LEVEL"] = "WARNING"
-from accelerate.logging import get_logger
-from diffusers.models.embeddings import get_2d_rotary_pos_embed
-
-from videotuna.third_party.flux import log_format  # noqa
-from videotuna.third_party.flux.caching.memory import reclaim_memory
-from videotuna.third_party.flux.configuration.loader import load_config
-from videotuna.third_party.flux.data_backend.factory import (
-    BatchFetcher,
-    configure_multi_databackend,
-    random_dataloader_iterator,
-)
-from videotuna.utils.common_utils import get_resize_crop_region_for_grid
-from videotuna.third_party.flux.training import steps_remaining_in_epoch
-from videotuna.third_party.flux.training.adapter import (
-    determine_adapter_target_modules,
-    load_lora_weights,
-)
-from videotuna.third_party.flux.training.custom_schedule import (
-    generate_timestep_weights,
-    get_lr_scheduler,
-    segmented_timestep_selection,
-)
-from videotuna.third_party.flux.training.deepspeed import (
-    deepspeed_zero_init_disabled_context_manager,
-    prepare_model_for_deepspeed,
-)
-from videotuna.third_party.flux.training.diffusion_model import load_diffusion_model
-from videotuna.third_party.flux.training.min_snr_gamma import compute_snr
-from videotuna.third_party.flux.training.multi_process import _get_rank as get_rank
-from videotuna.third_party.flux.training.optimizer_param import (
-    cpu_offload_optimizer,
-    determine_optimizer_class_with_config,
-    determine_params_to_optimize,
-    is_lr_scheduler_disabled,
-)
-from videotuna.third_party.flux.training.peft_init import (
-    init_lokr_network_with_perturbed_normal,
-)
-from videotuna.third_party.flux.training.schedulers import load_scheduler_from_args
-from videotuna.third_party.flux.training.state_tracker import StateTracker
-from videotuna.third_party.flux.training.text_encoding import (
-    determine_te_path_subfolder,
-    get_tokenizers,
-    import_model_class_from_model_name_or_path,
-    load_tes,
-)
-from videotuna.third_party.flux.training.validation import (
-    Validation,
-    prepare_validation_prompt_list,
-)
-from videotuna.third_party.flux.training.wrappers import unwrap_model
-
-logger = get_logger(
-    "SimpleTuner", log_level=os.environ.get("SIMPLETUNER_LOG_LEVEL", "INFO")
-)
-
-filelock_logger = get_logger("filelock")
-connection_logger = get_logger("urllib3.connectionpool")
-training_logger = get_logger("training-loop")
-
-# More important logs.
-target_level = os.environ.get("SIMPLETUNER_LOG_LEVEL", "INFO")
-logger.setLevel(target_level)
-training_logger_level = os.environ.get("SIMPLETUNER_TRAINING_LOOP_LOG_LEVEL", "INFO")
-training_logger.setLevel(training_logger_level)
-
-# Less important logs.
-filelock_logger.setLevel("WARNING")
-connection_logger.setLevel("WARNING")
-import accelerate
-import diffusers
-import torch
-import torch.nn.functional as F
-import torch.utils.checkpoint
-import transformers
-from accelerate import Accelerator
-from accelerate.utils import set_seed
-from configure import model_classes
-from torch.distributions import Beta
-
-try:
-    from lycoris import LycorisNetwork
-except:
-    print("[ERROR] Lycoris not available. Please install ")
-from diffusers import (
-    AutoencoderKL,
-    ControlNetModel,
-    DDIMScheduler,
-    DDPMScheduler,
-    EulerAncestralDiscreteScheduler,
-    EulerDiscreteScheduler,
-    FluxTransformer2DModel,
-    PixArtTransformer2DModel,
-    StableDiffusion3Pipeline,
-    UNet2DConditionModel,
-    UniPCMultistepScheduler,
-)
-from diffusers.utils import (
-    check_min_version,
-    convert_state_dict_to_diffusers,
-    is_wandb_available,
-)
-from diffusers.utils.import_utils import is_xformers_available
-from peft import LoraConfig
-from peft.utils import get_peft_model_state_dict
-from tqdm.auto import tqdm
-from transformers import CLIPTokenizer, PretrainedConfig
-from transformers.utils import ContextManagers
-
-from videotuna.third_party.flux.models.flux import (
-    apply_flux_schedule_shift,
-    get_mobius_guidance,
-    pack_latents,
-    prepare_latent_image_ids,
-    unpack_latents,
-)
-from videotuna.third_party.flux.models.sdxl.pipeline import StableDiffusionXLPipeline
-from videotuna.third_party.flux.training.ema import EMAModel
-
-is_optimi_available = False
-try:
-    from optimi import prepare_for_gradient_release
-
-    is_optimi_available = True
-except:
-    pass
-
-# Will error if the minimal version of diffusers is not installed. Remove at your own risks.
-check_min_version("0.27.0.dev0")
-
-SCHEDULER_NAME_MAP = {
-    "euler": EulerDiscreteScheduler,
-    "euler-a": EulerAncestralDiscreteScheduler,
-    "unipc": UniPCMultistepScheduler,
-    "ddim": DDIMScheduler,
-    "ddpm": DDPMScheduler,
-}
-logging.basicConfig(
-    format="%(asctime)s - %(levelname)s - %(name)s - %(message)s",
-    datefmt="%m/%d/%Y %H:%M:%S",
-    level=logging.INFO,
-)
-
-transformers.utils.logging.set_verbosity_warning()
-diffusers.utils.logging.set_verbosity_warning()
-
-
-class Trainer:
-    def __init__(
-        self, config: dict = None, disable_accelerator: bool = False, job_id: str = None
-    ):
-        self.accelerator = None
-        self.job_id = job_id
-        StateTracker.set_job_id(job_id)
-        self.parse_arguments(args=config, disable_accelerator=disable_accelerator)
-        self._misc_init()
-        self.lycoris_wrapped_network = None
-        self.lycoris_config = None
-        self.lr_scheduler = None
-        self.webhook_handler = None
-        self.should_abort = False
-        self.unet = None
-        self.transformer = None
-        self.vae = None
-        self.text_encoder_1 = None
-        self.text_encoder_2 = None
-        self.text_encoder_3 = None
-        self.controlnet = None
-        self.validation = None
-
-    def _config_to_obj(self, config):
-        if not config:
-            return None
-        return type("Config", (object,), config)
-
-    def parse_arguments(self, args=None, disable_accelerator: bool = False):
-        self.config = load_config(args)
-        report_to = (
-            None if self.config.report_to.lower() == "none" else self.config.report_to
-        )
-        if not disable_accelerator:
-            self.accelerator = Accelerator(
-                gradient_accumulation_steps=self.config.gradient_accumulation_steps,
-                mixed_precision=(
-                    self.config.mixed_precision
-                    if not torch.backends.mps.is_available()
-                    else None
-                ),
-                log_with=report_to,
-                project_config=self.config.accelerator_project_config,
-                kwargs_handlers=[self.config.process_group_kwargs],
-            )
-        safety_check(args=self.config, accelerator=self.accelerator)
-        if self.config.lr_scale:
-            logger.info(
-                f"Scaling learning rate ({self.config.learning_rate}), due to --lr_scale"
-            )
-            self.config.learning_rate = (
-                self.config.learning_rate
-                * self.config.gradient_accumulation_steps
-                * self.config.train_batch_size
-                * getattr(self.accelerator, "num_processes", 1)
-            )
-        StateTracker.set_accelerator(self.accelerator)
-        StateTracker.set_args(self.config)
-        StateTracker.set_weight_dtype(self.config.weight_dtype)
-        self.set_model_family()
-        # this updates self.config further, so we will run it here.
-        self.init_noise_schedule()
-
-    def run(self):
-        try:
-            # Initialize essential configurations and schedules
-            self.configure_webhook()
-            self.init_noise_schedule()
-            self.init_seed()
-            self.init_huggingface_hub()
-
-            # Core initialization steps with signal checks after each step
-            self._initialize_components_with_signal_check(
-                [
-                    self.init_preprocessing_models,
-                    self.init_data_backend,
-                    self.init_validation_prompts,
-                    self.init_unload_text_encoder,
-                    self.init_unload_vae,
-                    self.init_load_base_model,
-                    self.init_precision,
-                    self.init_controlnet_model,
-                    self.init_freeze_models,
-                    self.init_trainable_peft_adapter,
-                    self.init_ema_model,
-                ]
-            )
-
-            # Model movement and validation setup
-            self.move_models(destination="accelerator")
-            self._exit_on_signal()
-            self.init_validations()
-            self._exit_on_signal()
-            self.init_benchmark_base_model()
-            self._exit_on_signal()
-            self.resume_and_prepare()
-            self._exit_on_signal()
-            self.init_trackers()
-
-            # Start the training process
-            self.train()
-
-        except Exception as e:
-            import traceback
-
-            logger.error(
-                f"Failed to run training: {e}, traceback: {traceback.format_exc()}"
-            )
-            self._send_webhook_msg(
-                message=f"Failed to run training: {e}",
-            )
-            self._send_webhook_raw(
-                structured_data={
-                    "message": f"Failed to run training: {e}",
-                    "status": "error",
-                },
-                message_type="fatal_error",
-            )
-
-            raise e
-
-    def _initialize_components_with_signal_check(self, initializers):
-        """
-        Runs a list of initializer functions with signal checks after each.
-
-        Args:
-            initializers (list): A list of initializer functions to run sequentially.
-        """
-        for initializer in initializers:
-            initializer()
-            self._exit_on_signal()
-
-    def init_noise_schedule(self):
-        self.config, _flow_matching, self.noise_scheduler = load_scheduler_from_args(
-            self.config
-        )
-        self.config.flow_matching = _flow_matching
-        self.lr = 0.0
-
-    def configure_webhook(self, send_startup_message: bool = True):
-        self.webhook_handler = None
-        if self.config.webhook_config is None:
-            return
-        from videotuna.third_party.flux.webhooks.handler import WebhookHandler
-
-        self.webhook_handler = WebhookHandler(
-            self.config.webhook_config,
-            self.accelerator,
-            f"{self.config.tracker_project_name} {self.config.tracker_run_name}",
-        )
-        StateTracker.set_webhook_handler(self.webhook_handler)
-        if send_startup_message:
-            self._send_webhook_msg(
-                message="SimpleTuner has launched. Hold onto your butts!",
-                store_response=True,
-            )
-        self._send_webhook_raw(
-            structured_data={
-                "message": "Training job has started, configuration has begun."
-            },
-            message_type="configure_webhook",
-        )
-
-    def _misc_init(self):
-        """things that do not really need an order."""
-        torch.set_num_threads(self.config.torch_num_threads)
-        self.state = {}
-        self.state["lr"] = 0.0
-        # Global step represents the most recently *completed* optimization step, which means it
-        #  takes into account the number of gradient_accumulation_steps. If we use 1 gradient_accumulation_step,
-        #  then global_step and step will be the same throughout training. However, if we use
-        #  2 gradient_accumulation_steps, then global_step will be twice as large as step, and so on.
-        self.state["global_step"] = 0
-        self.state["global_resume_step"] = 0
-        self.state["first_epoch"] = 1
-        self.timesteps_buffer = []
-        self.guidance_values_list = []
-        self.train_loss = 0.0
-        self.bf = None
-        self.grad_norm = None
-        self.extra_lr_scheduler_kwargs = {}
-        StateTracker.set_global_step(self.state["global_step"])
-        self.config.use_deepspeed_optimizer, self.config.use_deepspeed_scheduler = (
-            prepare_model_for_deepspeed(self.accelerator, self.config)
-        )
-        self.config.base_weight_dtype = self.config.weight_dtype
-        self.config.is_quanto = False
-        self.config.is_torchao = False
-        self.config.is_bnb = False
-        if "quanto" in self.config.base_model_precision:
-            self.config.is_quanto = True
-        elif "torchao" in self.config.base_model_precision:
-            self.config.is_torchao = True
-        elif "bnb" in self.config.base_model_precision:
-            self.config.is_bnb = True
-        if self.config.is_quanto:
-            from videotuna.third_party.flux.training.quantisation import quantise_model
-
-            self.quantise_model = quantise_model
-        elif self.config.is_torchao:
-            from videotuna.third_party.flux.training.quantisation import quantise_model
-
-            self.quantise_model = quantise_model
-
-    def set_model_family(self, model_family: str = None):
-        model_family = getattr(self.config, "model_family", model_family)
-        if not model_family:
-            logger.warning(
-                "Using --model_family (or MODEL_FAMILY) to specify which model you are training will be required in a future release."
-            )
-            if self.config.model_family == "sd3":
-                model_family = "sd3"
-                logger.warning(
-                    "Using --sd3 is deprecated. Please use --model_family=sd3."
-                )
-            if self.config.model_family == "flux":
-                model_family = "flux"
-                logger.warning(
-                    "Using --flux is deprecated. Please use --model_family=flux."
-                )
-            if self.config.model_family == "pixart_sigma":
-                model_family = "pixart_sigma"
-                logger.warning(
-                    "Using --pixart_sigma is deprecated. Please use --model_family=pixart_sigma."
-                )
-            if self.config.model_family == "legacy":
-                model_family = "legacy"
-                logger.warning(
-                    "Using --legacy is deprecated. Please use --model_family=legacy."
-                )
-            if self.config.model_family == "kolors":
-                model_family = "kolors"
-                logger.warning(
-                    "Using --kolors is deprecated. Please use --model_family=kolors."
-                )
-            if self.config.model_family == "smoldit":
-                model_family = "smoldit"
-            if model_family is None:
-                model_family = "sdxl"
-                logger.warning(
-                    "Training SDXL without specifying --model_family is deprecated. Please use --model_family=sdxl."
-                )
-        elif model_family not in model_classes["full"]:
-            raise ValueError(f"Invalid model family specified: {model_family}")
-
-        self._set_model_paths()
-        StateTracker.set_model_family(model_family)
-        self.config.model_type_label = model_labels[model_family.lower()]
-        if StateTracker.is_sdxl_refiner():
-            self.config.model_type_label = "SDXL Refiner"
-
-    def init_clear_backend_cache(self):
-        if self.config.output_dir is not None:
-            os.makedirs(self.config.output_dir, exist_ok=True)
-        if self.config.preserve_data_backend_cache:
-            return
-        StateTracker.delete_cache_files(
-            preserve_data_backend_cache=self.config.preserve_data_backend_cache
-        )
-
-    def init_seed(self):
-        if self.config.seed is not None and self.config.seed != 0:
-            set_seed(self.config.seed, self.config.seed_for_each_device)
-
-    def init_huggingface_hub(self, access_token: str = None):
-        # Handle the repository creation
-        self.hub_manager = None
-        if not self.accelerator.is_main_process or not self.config.push_to_hub:
-            return
-        if access_token:
-            huggingface_hub.login(token=access_token)
-        self.hub_manager = HubManager(config=self.config)
-        try:
-            StateTracker.set_hf_user(huggingface_hub.whoami())
-            logger.info(
-                f"Logged into Hugging Face Hub as '{StateTracker.get_hf_username()}'"
-            )
-        except Exception as e:
-            logger.error(f"Failed to log into Hugging Face Hub: {e}")
-            raise e
-
-    def _set_model_paths(self):
-        self.config.vae_path = (
-            self.config.pretrained_model_name_or_path
-            if self.config.pretrained_vae_model_name_or_path is None
-            else self.config.pretrained_vae_model_name_or_path
-        )
-        self.config.text_encoder_path, self.config.text_encoder_subfolder = (
-            determine_te_path_subfolder(self.config)
-        )
-
-    def init_preprocessing_models(self, move_to_accelerator: bool = True):
-        # image embeddings
-        self.init_vae(move_to_accelerator=move_to_accelerator)
-        # text embeds
-        self.init_text_encoder(move_to_accelerator=move_to_accelerator)
-
-    def init_vae(self, move_to_accelerator: bool = True):
-        logger.info(f"Load VAE: {self.config.vae_path}")
-        self.config.vae_kwargs = {
-            "pretrained_model_name_or_path": self.config.vae_path,
-            "subfolder": "vae",
-            "revision": self.config.revision,
-            "force_upcast": False,
-            "variant": self.config.variant,
-        }
-        try:
-            self.vae = AutoencoderKL.from_pretrained(**self.config.vae_kwargs)
-        except:
-            logger.warning(
-                "Couldn't load VAE with default path. Trying without a subfolder.."
-            )
-            self.config.vae_kwargs["subfolder"] = None
-            self.vae = AutoencoderKL.from_pretrained(**self.config.vae_kwargs)
-        if not move_to_accelerator:
-            logger.debug("Not moving VAE to accelerator.")
-            return
-        if self.vae is not None:
-            # The VAE is in bfloat16 to avoid NaN losses.
-            _vae_dtype = torch.bfloat16
-            if hasattr(self.config, "vae_dtype"):
-                # Let's use a case-switch for convenience: bf16, fp16, fp32, none/default
-                if self.config.vae_dtype == "bf16":
-                    _vae_dtype = torch.bfloat16
-                elif self.config.vae_dtype == "fp16":
-                    raise ValueError(
-                        "fp16 is not supported for SDXL's VAE. Please use bf16 or fp32."
-                    )
-                elif self.config.vae_dtype == "fp32":
-                    _vae_dtype = torch.float32
-                elif (
-                    self.config.vae_dtype == "none"
-                    or self.config.vae_dtype == "default"
-                ):
-                    _vae_dtype = torch.bfloat16
-            logger.info(
-                f"Loading VAE onto accelerator, converting from {self.vae.dtype} to {_vae_dtype}"
-            )
-            self.vae.to(self.accelerator.device, dtype=_vae_dtype)
-            StateTracker.set_vae_dtype(_vae_dtype)
-            StateTracker.set_vae(self.vae)
-
-    def init_text_tokenizer(self):
-        logger.info("Load tokenizers")
-        self.tokenizer_1, self.tokenizer_2, self.tokenizer_3 = get_tokenizers(
-            self.config
-        )
-        self.tokenizers = [self.tokenizer_1, self.tokenizer_2, self.tokenizer_3]
-
-    def init_text_encoder(self, move_to_accelerator: bool = True):
-        self.init_text_tokenizer()
-        self.text_encoder_1, self.text_encoder_2, self.text_encoder_3 = None, None, None
-        self.text_encoder_cls_1, self.text_encoder_cls_2, self.text_encoder_cls_3 = (
-            None,
-            None,
-            None,
-        )
-        if self.tokenizer_1 is not None:
-            self.text_encoder_cls_1 = import_model_class_from_model_name_or_path(
-                self.config.text_encoder_path,
-                self.config.revision,
-                self.config,
-                subfolder=self.config.text_encoder_subfolder,
-            )
-        if self.tokenizer_2 is not None:
-            self.text_encoder_cls_2 = import_model_class_from_model_name_or_path(
-                self.config.pretrained_model_name_or_path,
-                self.config.revision,
-                self.config,
-                subfolder="text_encoder_2",
-            )
-        if self.tokenizer_3 is not None and self.config.model_family == "sd3":
-            self.text_encoder_cls_3 = import_model_class_from_model_name_or_path(
-                self.config.pretrained_model_name_or_path,
-                self.config.revision,
-                self.config,
-                subfolder="text_encoder_3",
-            )
-        with ContextManagers(deepspeed_zero_init_disabled_context_manager()):
-            tokenizers = [self.tokenizer_1, self.tokenizer_2, self.tokenizer_3]
-            text_encoder_classes = [
-                self.text_encoder_cls_1,
-                self.text_encoder_cls_2,
-                self.text_encoder_cls_3,
-            ]
-            (
-                text_encoder_variant,
-                self.text_encoder_1,
-                self.text_encoder_2,
-                self.text_encoder_3,
-            ) = load_tes(
-                args=self.config,
-                text_encoder_classes=text_encoder_classes,
-                weight_dtype=self.config.weight_dtype,
-                tokenizers=tokenizers,
-                text_encoder_path=self.config.text_encoder_path,
-                text_encoder_subfolder=self.config.text_encoder_subfolder,
-            )
-        if not move_to_accelerator:
-            logger.debug("Not moving text encoders to accelerator.")
-            return
-        self.text_encoders = []
-        self.tokenizers = []
-        if self.tokenizer_1 is not None:
-            logger.info("Moving text encoder to GPU.")
-            self.text_encoder_1.to(
-                self.accelerator.device, dtype=self.config.weight_dtype
-            )
-            self.tokenizers.append(self.tokenizer_1)
-            self.text_encoders.append(self.text_encoder_1)
-        if self.tokenizer_2 is not None:
-            logger.info("Moving text encoder 2 to GPU.")
-            self.text_encoder_2.to(
-                self.accelerator.device, dtype=self.config.weight_dtype
-            )
-            self.tokenizers.append(self.tokenizer_2)
-            self.text_encoders.append(self.text_encoder_2)
-        if self.tokenizer_3 is not None:
-            logger.info("Moving text encoder 3 to GPU.")
-            self.text_encoder_3.to(
-                self.accelerator.device, dtype=self.config.weight_dtype
-            )
-            self.tokenizers.append(self.tokenizer_3)
-            self.text_encoders.append(self.text_encoder_3)
-
-    def init_freeze_models(self):
-        # Freeze vae and text_encoders
-        if self.vae is not None:
-            self.vae.requires_grad_(False)
-        if self.text_encoder_1 is not None:
-            self.text_encoder_1.requires_grad_(False)
-        if self.text_encoder_2 is not None:
-            self.text_encoder_2.requires_grad_(False)
-        if self.text_encoder_3 is not None:
-            self.text_encoder_3.requires_grad_(False)
-        if "lora" in self.config.model_type or self.config.controlnet:
-            if self.transformer is not None:
-                self.transformer.requires_grad_(False)
-            if self.unet is not None:
-                self.unet.requires_grad_(False)
-        self.accelerator.wait_for_everyone()
-
-    def init_load_base_model(self):
-        webhook_msg = f"Loading model: `{self.config.pretrained_model_name_or_path}`..."
-        self._send_webhook_msg(message=webhook_msg)
-        self._send_webhook_raw(
-            structured_data={"message": webhook_msg},
-            message_type="init_load_base_model_begin",
-        )
-        self.unet, self.transformer = load_diffusion_model(
-            self.config, self.config.weight_dtype
-        )
-        self.accelerator.wait_for_everyone()
-        self._send_webhook_raw(
-            structured_data={"message": "Base model has loaded."},
-            message_type="init_load_base_model_completed",
-        )
-
-    def init_data_backend(self):
-        try:
-            self.init_clear_backend_cache()
-            self._send_webhook_msg(
-                message="Configuring data backends... (this may take a while!)"
-            )
-            self._send_webhook_raw(
-                structured_data={"message": "Configuring data backends."},
-                message_type="init_data_backend_begin",
-            )
-            configure_multi_databackend(
-                self.config,
-                accelerator=self.accelerator,
-                text_encoders=self.text_encoders,
-                tokenizers=self.tokenizers,
-            )
-            self._send_webhook_raw(
-                structured_data={"message": "Completed configuring data backends."},
-                message_type="init_data_backend_completed",
-            )
-        except Exception as e:
-            import traceback
-
-            logger.error(f"{e}, traceback: {traceback.format_exc()}")
-            self._send_webhook_msg(
-                message=f"Failed to load data backends: {e}",
-                message_level="critical",
-            )
-            self._send_webhook_raw(
-                structured_data={
-                    "message": f"Failed to load data backends: {e}",
-                    "status": "error",
-                },
-                message_type="fatal_error",
-            )
-
-            raise e
-
-        self.init_validation_prompts()
-        # We calculate the number of steps per epoch by dividing the number of images by the effective batch divisor.
-        # Gradient accumulation steps mean that we only update the model weights every /n/ steps.
-        collected_data_backend_str = list(StateTracker.get_data_backends().keys())
-        if self.config.push_to_hub and self.accelerator.is_main_process:
-            self.hub_manager.collected_data_backend_str = collected_data_backend_str
-            self.hub_manager.set_validation_prompts(
-                self.validation_prompts, self.validation_shortnames
-            )
-            logger.debug(f"Collected validation prompts: {self.validation_prompts}")
-        self._recalculate_training_steps()
-        logger.info(
-            f"Collected the following data backends: {collected_data_backend_str}"
-        )
-        self._send_webhook_msg(
-            message=f"Collected the following data backends: {collected_data_backend_str}"
-        )
-        self._send_webhook_raw(
-            structured_data={
-                "message": f"Collected the following data backends: {collected_data_backend_str}"
-            },
-            message_type="init_data_backend",
-        )
-        self.accelerator.wait_for_everyone()
-
-    def init_validation_prompts(self):
-        if self.accelerator.is_main_process:
-            if self.config.model_family == "flux":
-                (
-                    self.validation_prompts,
-                    self.validation_shortnames,
-                    self.validation_negative_prompt_embeds,
-                    self.validation_negative_pooled_embeds,
-                    self.validation_negative_time_ids,
-                ) = prepare_validation_prompt_list(
-                    args=self.config,
-                    embed_cache=StateTracker.get_default_text_embed_cache(),
-                )
-            else:
-                (
-                    self.validation_prompts,
-                    self.validation_shortnames,
-                    self.validation_negative_prompt_embeds,
-                    self.validation_negative_pooled_embeds,
-                ) = prepare_validation_prompt_list(
-                    args=self.config,
-                    embed_cache=StateTracker.get_default_text_embed_cache(),
-                )
-        else:
-            self.validation_prompts = None
-            self.validation_shortnames = None
-            self.validation_negative_prompt_embeds = None
-            self.validation_negative_pooled_embeds = None
-        self.accelerator.wait_for_everyone()
-
-    def stats_memory_used(self):
-        # Grab GPU memory used:
-        if torch.cuda.is_available():
-            curent_memory_allocated = torch.cuda.memory_allocated() / 1024**3
-        elif torch.backends.mps.is_available():
-            curent_memory_allocated = torch.mps.current_allocated_memory() / 1024**3
-        else:
-            logger.warning(
-                "CUDA, ROCm, or Apple MPS not detected here. We cannot report VRAM reductions."
-            )
-            curent_memory_allocated = 0
-
-        return curent_memory_allocated
-
-    def init_unload_text_encoder(self):
-        if self.config.model_type != "full" and self.config.train_text_encoder:
-            return
-        memory_before_unload = self.stats_memory_used()
-        if self.accelerator.is_main_process:
-            logger.info("Unloading text encoders, as they are not being trained.")
-        if self.text_encoder_1 is not None:
-            self.text_encoder_1 = self.text_encoder_1.to("cpu")
-        if self.text_encoder_2 is not None:
-            self.text_encoder_2 = self.text_encoder_2.to("cpu")
-        if self.text_encoder_3 is not None:
-            self.text_encoder_3 = self.text_encoder_3.to("cpu")
-        del self.text_encoder_1, self.text_encoder_2, self.text_encoder_3
-        self.text_encoder_1, self.text_encoder_2, self.text_encoder_3 = None, None, None
-        self.text_encoders = []
-        for backend_id, backend in StateTracker.get_data_backends().items():
-            if "text_embed_cache" in backend:
-                backend["text_embed_cache"].text_encoders = None
-                backend["text_embed_cache"].pipeline = None
-        reclaim_memory()
-        memory_after_unload = self.stats_memory_used()
-        memory_saved = memory_after_unload - memory_before_unload
-        logger.info(
-            f"After nuking text encoders from orbit, we freed {abs(round(memory_saved, 2))} GB of VRAM."
-            " The real memories were the friends we trained a model on along the way."
-        )
-
-    def init_precision(self):
-        self.config.enable_adamw_bf16 = (
-            True if self.config.weight_dtype == torch.bfloat16 else False
-        )
-        quantization_device = (
-            "cpu" if self.config.quantize_via == "cpu" else self.accelerator.device
-        )
-
-        if "bnb" in self.config.base_model_precision:
-            # can't cast or move bitsandbytes modelsthis
-            return
-
-        if not self.config.disable_accelerator and self.config.is_quantized:
-            if self.config.base_model_default_dtype == "fp32":
-                self.config.base_weight_dtype = torch.float32
-                self.config.enable_adamw_bf16 = False
-            elif self.config.base_model_default_dtype == "bf16":
-                self.config.base_weight_dtype = torch.bfloat16
-                self.config.enable_adamw_bf16 = True
-            if self.unet is not None:
-                logger.info(
-                    f"Moving U-net to dtype={self.config.base_weight_dtype}, device={quantization_device}"
-                )
-                self.unet.to(quantization_device, dtype=self.config.base_weight_dtype)
-            elif self.transformer is not None:
-                logger.info(
-                    f"Moving transformer to dtype={self.config.base_weight_dtype}, device={quantization_device}"
-                )
-                self.transformer.to(
-                    quantization_device, dtype=self.config.base_weight_dtype
-                )
-
-        if self.config.is_quanto:
-            with self.accelerator.local_main_process_first():
-                self.quantise_model(
-                    unet=self.unet,
-                    transformer=self.transformer,
-                    text_encoder_1=self.text_encoder_1,
-                    text_encoder_2=self.text_encoder_2,
-                    text_encoder_3=self.text_encoder_3,
-                    controlnet=None,
-                    args=self.config,
-                )
-        elif self.config.is_torchao:
-            with self.accelerator.local_main_process_first():
-                (
-                    self.unet,
-                    self.transformer,
-                    self.text_encoder_1,
-                    self.text_encoder_2,
-                    self.text_encoder_3,
-                    self.controlnet,
-                ) = self.quantise_model(
-                    unet=self.unet,
-                    transformer=self.transformer,
-                    text_encoder_1=self.text_encoder_1,
-                    text_encoder_2=self.text_encoder_2,
-                    text_encoder_3=self.text_encoder_3,
-                    controlnet=None,
-                    args=self.config,
-                )
-
-    def init_controlnet_model(self):
-        if not self.config.controlnet:
-            return
-        logger.info("Creating the controlnet..")
-        if self.config.controlnet_model_name_or_path:
-            logger.info("Loading existing controlnet weights")
-            self.controlnet = ControlNetModel.from_pretrained(
-                self.config.controlnet_model_name_or_path
-            )
-        else:
-            logger.info("Initializing controlnet weights from unet")
-            self.controlnet = ControlNetModel.from_unet(self.unet)
-
-        self.accelerator.wait_for_everyone()
-
-    def init_trainable_peft_adapter(self):
-        if "lora" not in self.config.model_type:
-            return
-        if self.config.controlnet:
-            raise ValueError("Cannot train LoRA with ControlNet.")
-        if "standard" == self.config.lora_type.lower():
-            lora_info_msg = f"Using LoRA training mode (rank={self.config.lora_rank})"
-            logger.info(lora_info_msg)
-            self._send_webhook_msg(message=lora_info_msg)
-            target_modules = determine_adapter_target_modules(
-                self.config, self.unet, self.transformer
-            )
-            addkeys, misskeys = [], []
-            if self.unet is not None:
-                unet_lora_config = LoraConfig(
-                    r=self.config.lora_rank,
-                    lora_alpha=(
-                        self.config.lora_alpha
-                        if self.config.lora_alpha is not None
-                        else self.config.lora_rank
-                    ),
-                    lora_dropout=self.config.lora_dropout,
-                    init_lora_weights=self.config.lora_initialisation_style,
-                    target_modules=target_modules,
-                    use_dora=self.config.use_dora,
-                )
-                logger.info("Adding LoRA adapter to the unet model..")
-                self.unet.add_adapter(unet_lora_config)
-                if self.config.init_lora:
-                    addkeys, misskeys = load_lora_weights(
-                        {"unet": self.unet},
-                        self.config.init_lora,
-                        use_dora=self.config.use_dora,
-                    )
-            elif self.transformer is not None:
-                transformer_lora_config = LoraConfig(
-                    r=self.config.lora_rank,
-                    lora_alpha=(
-                        self.config.lora_alpha
-                        if self.config.lora_alpha is not None
-                        else self.config.lora_rank
-                    ),
-                    init_lora_weights=self.config.lora_initialisation_style,
-                    target_modules=target_modules,
-                    use_dora=self.config.use_dora,
-                )
-                self.transformer.add_adapter(transformer_lora_config)
-                if self.config.init_lora:
-                    addkeys, misskeys = load_lora_weights(
-                        {"transformer": self.transformer},
-                        self.config.init_lora,
-                        use_dora=self.config.use_dora,
-                    )
-            if addkeys:
-                logger.warning(
-                    "The following keys were found in %s, but are not part of the model and are ignored:\n %s.\nThis is most likely an error"
-                    % (self.config.init_lora, str(addkeys))
-                )
-            if misskeys:
-                logger.warning(
-                    "The following keys were part of the model but not found in %s:\n %s.\nThese keys will be initialized according to the lora weight initialisation. This could be an error, or intended behaviour in case a lora is finetuned with additional keys."
-                    % (self.config.init_lora, str(misskeys))
-                )
-
-        elif "lycoris" == self.config.lora_type.lower():
-            from lycoris import create_lycoris
-
-            with open(self.config.lycoris_config, "r") as f:
-                self.lycoris_config = json.load(f)
-            multiplier = int(self.lycoris_config["multiplier"])
-            linear_dim = int(self.lycoris_config["linear_dim"])
-            linear_alpha = int(self.lycoris_config["linear_alpha"])
-            apply_preset = self.lycoris_config.get("apply_preset", None)
-            if apply_preset is not None and apply_preset != {}:
-                LycorisNetwork.apply_preset(apply_preset)
-
-            # Remove the positional arguments we extracted.
-            del self.lycoris_config["multiplier"]
-            del self.lycoris_config["linear_dim"]
-            del self.lycoris_config["linear_alpha"]
-
-            logger.info("Using lycoris training mode")
-            self._send_webhook_msg(message="Using lycoris training mode.")
-
-            model_for_lycoris_wrap = None
-            if self.transformer is not None:
-                model_for_lycoris_wrap = self.transformer
-            if self.unet is not None:
-                model_for_lycoris_wrap = self.unet
-
-            if self.config.init_lora is not None:
-                from lycoris import create_lycoris_from_weights
-
-                self.lycoris_wrapped_network = create_lycoris_from_weights(
-                    multiplier,
-                    self.config.init_lora,
-                    model_for_lycoris_wrap,
-                    weights_sd=None,
-                    **self.lycoris_config,
-                )[0]
-            else:
-                self.lycoris_wrapped_network = create_lycoris(
-                    model_for_lycoris_wrap,
-                    multiplier,
-                    linear_dim,
-                    linear_alpha,
-                    **self.lycoris_config,
-                )
-
-                if self.config.init_lokr_norm is not None:
-                    init_lokr_network_with_perturbed_normal(
-                        self.lycoris_wrapped_network,
-                        scale=self.config.init_lokr_norm,
-                    )
-
-            self.lycoris_wrapped_network.apply_to()
-            setattr(
-                self.accelerator,
-                "_lycoris_wrapped_network",
-                self.lycoris_wrapped_network,
-            )
-            lycoris_num_params = sum(
-                p.numel() for p in self.lycoris_wrapped_network.parameters()
-            )
-            logger.info(
-                f"LyCORIS network has been initialized with {lycoris_num_params:,} parameters"
-            )
-        self.accelerator.wait_for_everyone()
-
-    def init_post_load_freeze(self):
-        if self.config.layer_freeze_strategy == "bitfit":
-            from videotuna.third_party.flux.training.model_freeze import (
-                apply_bitfit_freezing,
-            )
-
-            if self.unet is not None:
-                logger.info("Applying BitFit freezing strategy to the U-net.")
-                self.unet = apply_bitfit_freezing(
-                    unwrap_model(self.accelerator, self.unet), self.config
-                )
-            if self.transformer is not None:
-                logger.warning(
-                    "Training DiT models with BitFit is not yet tested, and unexpected results may occur."
-                )
-                self.transformer = apply_bitfit_freezing(
-                    unwrap_model(self.accelerator, self.transformer), self.config
-                )
-
-        if self.config.gradient_checkpointing:
-            if self.unet is not None:
-                unwrap_model(
-                    self.accelerator, self.unet
-                ).enable_gradient_checkpointing()
-            if self.transformer is not None and self.config.model_family != "smoldit":
-                unwrap_model(
-                    self.accelerator, self.transformer
-                ).enable_gradient_checkpointing()
-            if self.config.controlnet:
-                unwrap_model(
-                    self.accelerator, self.controlnet
-                ).enable_gradient_checkpointing()
-            if (
-                hasattr(self.config, "train_text_encoder")
-                and self.config.train_text_encoder
-            ):
-                unwrap_model(
-                    self.accelerator, self.text_encoder_1
-                ).gradient_checkpointing_enable()
-                unwrap_model(
-                    self.accelerator, self.text_encoder_2
-                ).gradient_checkpointing_enable()
-
-    def _recalculate_training_steps(self):
-        # Scheduler and math around the number of training steps.
-        if not hasattr(self.config, "overrode_max_train_steps"):
-            self.config.overrode_max_train_steps = False
-        self.config.total_num_batches = sum(
-            [
-                len(
-                    backend["metadata_backend"] if "metadata_backend" in backend else []
-                )
-                for _, backend in StateTracker.get_data_backends().items()
-            ]
-        )
-        self.config.num_update_steps_per_epoch = math.ceil(
-            self.config.total_num_batches / self.config.gradient_accumulation_steps
-        )
-        if getattr(self.config, "overrode_max_train_steps", False):
-            self.config.max_train_steps = (
-                self.config.num_train_epochs * self.config.num_update_steps_per_epoch
-            )
-            # Afterwards we recalculate our number of training epochs
-            self.config.num_train_epochs = math.ceil(
-                self.config.max_train_steps / self.config.num_update_steps_per_epoch
-            )
-            logger.info(
-                "After removing any undesired samples and updating cache entries, we have settled on"
-                f" {self.config.num_train_epochs} epochs and {self.config.num_update_steps_per_epoch} steps per epoch."
-            )
-        if self.config.max_train_steps is None or self.config.max_train_steps == 0:
-            if (
-                self.config.num_train_epochs is None
-                or self.config.num_train_epochs == 0
-            ):
-                raise ValueError(
-                    "You must specify either --max_train_steps or --num_train_epochs with a value > 0"
-                )
-            self.config.max_train_steps = (
-                self.config.num_train_epochs * self.config.num_update_steps_per_epoch
-            )
-            logger.info(
-                f"Calculated our maximum training steps at {self.config.max_train_steps} because we have"
-                f" {self.config.num_train_epochs} epochs and {self.config.num_update_steps_per_epoch} steps per epoch."
-            )
-            self.config.overrode_max_train_steps = True
-        elif self.config.num_train_epochs is None or self.config.num_train_epochs == 0:
-            if self.config.max_train_steps is None or self.config.max_train_steps == 0:
-                raise ValueError(
-                    "You must specify either --max_train_steps or --num_train_epochs with a value > 0"
-                )
-            self.config.num_train_epochs = math.ceil(
-                self.config.max_train_steps / self.config.num_update_steps_per_epoch
-            )
-            logger.info(
-                f"Calculated our maximum training steps at {self.config.max_train_steps} because we have"
-                f" {self.config.num_train_epochs} epochs and {self.config.num_update_steps_per_epoch} steps per epoch."
-            )
-        if self.lr_scheduler is not None and hasattr(
-            self.lr_scheduler, "num_update_steps_per_epoch"
-        ):
-            self.lr_scheduler.num_update_steps_per_epoch = (
-                self.config.num_update_steps_per_epoch
-            )
-        self.config.total_batch_size = (
-            self.config.train_batch_size
-            * self.accelerator.num_processes
-            * self.config.gradient_accumulation_steps
-        )
-
-    def init_optimizer(self):
-        logger.info(f"Learning rate: {self.config.learning_rate}")
-        extra_optimizer_args = {"lr": self.config.learning_rate}
-        # Initialize the optimizer
-        optimizer_args_from_config, optimizer_class = (
-            determine_optimizer_class_with_config(
-                args=self.config,
-                use_deepspeed_optimizer=self.config.use_deepspeed_optimizer,
-                is_quantized=self.config.is_quantized,
-                enable_adamw_bf16=self.config.enable_adamw_bf16,
-            )
-        )
-        extra_optimizer_args.update(optimizer_args_from_config)
-
-        self.params_to_optimize = determine_params_to_optimize(
-            args=self.config,
-            controlnet=self.controlnet,
-            unet=self.unet,
-            transformer=self.transformer,
-            text_encoder_1=self.text_encoder_1,
-            text_encoder_2=self.text_encoder_2,
-            model_type_label=self.config.model_type_label,
-            lycoris_wrapped_network=self.lycoris_wrapped_network,
-        )
-
-        if self.config.use_deepspeed_optimizer:
-            logger.info(
-                f"DeepSpeed Optimizer arguments, weight_decay={self.config.adam_weight_decay} eps={self.config.adam_epsilon}, extra_arguments={extra_optimizer_args}"
-            )
-            self.optimizer = optimizer_class(self.params_to_optimize)
-        else:
-            logger.info(f"Optimizer arguments={extra_optimizer_args}")
-            if self.config.train_text_encoder and self.config.text_encoder_lr:
-                # changes the learning rate of text_encoder_parameters_one and text_encoder_parameters_two to be
-                # --learning_rate
-                self.params_to_optimize[1]["lr"] = float(self.config.learning_rate)
-                if self.text_encoder_2 is not None:
-                    self.params_to_optimize[2]["lr"] = float(self.config.learning_rate)
-
-            self.optimizer = cpu_offload_optimizer(
-                params_to_optimize=self.params_to_optimize,
-                optimizer_cls=optimizer_class,
-                optimizer_parameters=extra_optimizer_args,
-                fused=self.config.fuse_optimizer,
-                offload_gradients=self.config.optimizer_offload_gradients,
-                offload_mechanism=self.config.optimizer_cpu_offload_method,
-            )
-
-        if (
-            is_optimi_available
-            and self.config.optimizer_release_gradients
-            and "optimi" in self.config.optimizer
-        ):
-            logger.warning(
-                "Marking model for gradient release. This feature is experimental, and may use more VRAM or not work."
-            )
-            prepare_for_gradient_release(
-                (
-                    self.controlnet
-                    if self.config.controlnet
-                    else self.transformer if self.transformer is not None else self.unet
-                ),
-                self.optimizer,
-            )
-
-    def init_lr_scheduler(self):
-        self.config.is_schedulefree = is_lr_scheduler_disabled(self.config.optimizer)
-        if self.config.is_schedulefree:
-            logger.info(
-                "Using experimental AdamW ScheduleFree optimiser from Facebook. Experimental due to newly added Kahan summation."
-            )
-            # we don't use LR schedulers with schedulefree optimisers
-            lr_scheduler = None
-        if not self.config.use_deepspeed_scheduler and not self.config.is_schedulefree:
-            logger.info(
-                f"Loading {self.config.lr_scheduler} learning rate scheduler with {self.config.lr_warmup_steps} warmup steps"
-            )
-            lr_scheduler = get_lr_scheduler(
-                self.config,
-                self.optimizer,
-                self.accelerator,
-                logger,
-                use_deepspeed_scheduler=False,
-            )
-        else:
-            logger.info(f"Using dummy learning rate scheduler")
-            if torch.backends.mps.is_available():
-                lr_scheduler = None
-            else:
-                lr_scheduler = accelerate.utils.DummyScheduler(
-                    self.optimizer,
-                    total_num_steps=self.config.max_train_steps,
-                    warmup_num_steps=self.config.lr_warmup_steps,
-                )
-        if lr_scheduler is not None:
-            if hasattr(lr_scheduler, "num_update_steps_per_epoch"):
-                lr_scheduler.num_update_steps_per_epoch = (
-                    self.config.num_update_steps_per_epoch
-                )
-            if hasattr(lr_scheduler, "last_step"):
-                lr_scheduler.last_step = self.state.get("global_resume_step", 0)
-
-        return lr_scheduler
-
-    def init_ema_model(self):
-        # Create EMA for the unet.
-        self.ema_model = None
-        if not self.config.use_ema:
-            return
-        if self.accelerator.is_main_process:
-            logger.info("Using EMA. Creating EMAModel.")
-
-            ema_model_cls = None
-            if self.unet is not None:
-                ema_model_cls = UNet2DConditionModel
-            elif self.config.model_family == "pixart_sigma":
-                ema_model_cls = PixArtTransformer2DModel
-            elif self.config.model_family == "flux":
-                ema_model_cls = FluxTransformer2DModel
-            else:
-                raise ValueError(
-                    f"Please open a bug report or disable EMA. Unknown EMA model family: {self.config.model_family}"
-                )
-
-            ema_model_config = None
-            if self.unet is not None:
-                ema_model_config = self.unet.config
-            elif self.transformer is not None:
-                ema_model_config = self.transformer.config
-
-            self.ema_model = EMAModel(
-                self.config,
-                self.accelerator,
-                parameters=(
-                    self.unet.parameters()
-                    if self.unet is not None
-                    else self.transformer.parameters()
-                ),
-                model_cls=ema_model_cls,
-                model_config=ema_model_config,
-                decay=self.config.ema_decay,
-                foreach=not self.config.ema_foreach_disable,
-            )
-            logger.info("EMA model creation complete.")
-
-        self.accelerator.wait_for_everyone()
-
-    def init_hooks(self):
-        from videotuna.third_party.flux.training.save_hooks import SaveHookManager
-
-        self.model_hooks = SaveHookManager(
-            args=self.config,
-            unet=self.unet,
-            transformer=self.transformer,
-            ema_model=self.ema_model,
-            accelerator=self.accelerator,
-            text_encoder_1=self.text_encoder_1,
-            text_encoder_2=self.text_encoder_2,
-            use_deepspeed_optimizer=self.config.use_deepspeed_optimizer,
-        )
-        self.accelerator.register_save_state_pre_hook(self.model_hooks.save_model_hook)
-        self.accelerator.register_load_state_pre_hook(self.model_hooks.load_model_hook)
-
-    def init_prepare_models(self, lr_scheduler):
-        # Prepare everything with our `accelerator`.
-        logger.info("Preparing models..")
-
-        # TODO: Is this still needed? Seems like a hack job from January 2024.
-        self.train_dataloaders = []
-        for _, backend in StateTracker.get_data_backends().items():
-            if "train_dataloader" not in backend:
-                continue
-            self.train_dataloaders.append(backend["train_dataloader"])
-            break
-        if len(self.train_dataloaders) == 0:
-            logger.error("For some reason, no dataloaders were configured.")
-            sys.exit(0)
-        if self.config.disable_accelerator:
-            logger.warning(
-                "Because SIMPLETUNER_DISABLE_ACCELERATOR is set, we will not prepare the accelerator."
-            )
-            return
-        logger.info("Loading our accelerator...")
-        if torch.backends.mps.is_available():
-            self.accelerator.native_amp = False
-        self._send_webhook_msg(message="Moving weights to GPU...")
-        self._send_webhook_raw(
-            structured_data={"message": "Moving weights to GPU"},
-            message_type="init_prepare_models_begin",
-        )
-        primary_model = self.unet if self.unet is not None else self.transformer
-        if self.config.controlnet:
-            primary_model = self.controlnet
-        results = self.accelerator.prepare(
-            primary_model, lr_scheduler, self.optimizer, self.train_dataloaders[0]
-        )
-        if self.config.controlnet:
-            self.controlnet = results[0]
-        elif self.unet is not None:
-            self.unet = results[0]
-        elif self.transformer is not None:
-            self.transformer = results[0]
-
-        if self.config.unet_attention_slice:
-            if torch.backends.mps.is_available():
-                logger.warning(
-                    "Using attention slicing when training SDXL on MPS can result in NaN errors on the first backward pass. If you run into issues, disable this option and reduce your batch size instead to reduce memory consumption."
-                )
-            if self.unet is not None:
-                self.unet.set_attention_slice("auto")
-            if self.transformer is not None:
-                self.transformer.set_attention_slice("auto")
-        self.lr_scheduler = results[1]
-        self.optimizer = results[2]
-        # The rest of the entries are dataloaders:
-        self.train_dataloaders = [results[3:]]
-        if self.config.use_ema and self.ema_model is not None:
-            if self.config.ema_device == "accelerator":
-                logger.info("Moving EMA model weights to accelerator...")
-            self.ema_model.to(
-                (
-                    self.accelerator.device
-                    if self.config.ema_device == "accelerator"
-                    else "cpu"
-                ),
-                dtype=self.config.weight_dtype,
-            )
-
-            if self.config.ema_device == "cpu" and not self.config.ema_cpu_only:
-                logger.info("Pinning EMA model weights to CPU...")
-                try:
-                    self.ema_model.pin_memory()
-                except Exception as e:
-                    self._send_webhook_raw(
-                        structured_data={"message": f"Failed to pin EMA to CPU: {e}"},
-                        message_type="error",
-                    )
-                    logger.error(f"Failed to pin EMA model to CPU: {e}")
-
-        idx_count = 0
-        for _, backend in StateTracker.get_data_backends().items():
-            if idx_count == 0 or "train_dataloader" not in backend:
-                continue
-            self.train_dataloaders.append(
-                self.accelerator.prepare(backend["train_dataloader"])
-            )
-        idx_count = 0
-
-        if "lora" in self.config.model_type and self.config.train_text_encoder:
-            logger.info("Preparing text encoders for training.")
-            if self.config.model_family == "sd3":
-                logger.info("NOTE: The third text encoder is not trained for SD3.")
-            self.text_encoder_1, self.text_encoder_2 = self.accelerator.prepare(
-                self.text_encoder_1, self.text_encoder_2
-            )
-        self._recalculate_training_steps()
-        self.accelerator.wait_for_everyone()
-        self._send_webhook_raw(
-            structured_data={"message": "Completed moving weights to GPU"},
-            message_type="init_prepare_models_completed",
-        )
-
-    def init_unload_vae(self):
-        if self.config.keep_vae_loaded or self.config.vae_cache_ondemand:
-            return
-        memory_before_unload = self.stats_memory_used()
-        self.vae = self.vae.to("cpu")
-        del self.vae
-        self.vae = None
-        for _, backend in StateTracker.get_data_backends().items():
-            if "vaecache" in backend:
-                backend["vaecache"].vae = None
-        reclaim_memory()
-        memory_after_unload = self.stats_memory_used()
-        memory_saved = memory_after_unload - memory_before_unload
-        logger.info(
-            f"After nuking the VAE from orbit, we freed {abs(round(memory_saved, 2)) * 1024} MB of VRAM."
-        )
-
-    def init_validations(self):
-        if (
-            hasattr(self.accelerator, "state")
-            and hasattr(self.accelerator.state, "deepspeed_plugin")
-            and getattr(self.accelerator.state.deepspeed_plugin, "deepspeed_config", {})
-            .get("zero_optimization", {})
-            .get("stage")
-            == 3
-        ):
-            logger.error("Cannot run validations with DeepSpeed ZeRO stage 3.")
-            return
-        self.validation = Validation(
-            accelerator=self.accelerator,
-            unet=self.unet,
-            transformer=self.transformer,
-            args=self.config,
-            validation_prompts=self.validation_prompts,
-            validation_shortnames=self.validation_shortnames,
-            text_encoder_1=self.text_encoder_1,
-            tokenizer=self.tokenizer_1,
-            vae_path=self.config.vae_path,
-            weight_dtype=self.config.weight_dtype,
-            embed_cache=StateTracker.get_default_text_embed_cache(),
-            validation_negative_pooled_embeds=self.validation_negative_pooled_embeds,
-            validation_negative_prompt_embeds=self.validation_negative_prompt_embeds,
-            text_encoder_2=self.text_encoder_2,
-            tokenizer_2=self.tokenizer_2,
-            text_encoder_3=self.text_encoder_3,
-            tokenizer_3=self.tokenizer_3,
-            ema_model=self.ema_model,
-            vae=self.vae,
-            controlnet=self.controlnet if self.config.controlnet else None,
-        )
-        if not self.config.train_text_encoder and self.validation is not None:
-            self.validation.clear_text_encoders()
-        self.init_benchmark_base_model()
-        self.accelerator.wait_for_everyone()
-
-    def init_benchmark_base_model(self):
-        if (
-            self.config.disable_benchmark
-            or self.validation is None
-            or self.validation.benchmark_exists("base_model")
-        ):
-            # if we've disabled it or the benchmark exists, we will not do it again.
-            # deepspeed zero3 can't do validations at all.
-            return
-        if not self.accelerator.is_main_process:
-            return
-        logger.info(
-            "Benchmarking base model for comparison. Supply `--disable_benchmark: true` to disable this behaviour."
-        )
-        self._send_webhook_raw(
-            structured_data={"message": "Base model benchmark begins"},
-            message_type="init_benchmark_base_model_begin",
-        )
-        # we'll run validation on base model if it hasn't already.
-        self.validation.run_validations(validation_type="base_model", step=0)
-        self.validation.save_benchmark("base_model")
-        self._send_webhook_raw(
-            structured_data={"message": "Base model benchmark completed"},
-            message_type="init_benchmark_base_model_completed",
-        )
-
-    def init_resume_checkpoint(self, lr_scheduler):
-        # Potentially load in the weights and states from a previous save
-        self.config.total_steps_remaining_at_start = self.config.max_train_steps
-        self.state["current_epoch"] = self.state["first_epoch"]
-        self.state["global_resume_step"] = self.state["global_step"] = (
-            StateTracker.get_global_step()
-        )
-        StateTracker.set_global_resume_step(self.state["global_resume_step"])
-        if not self.config.resume_from_checkpoint:
-            return lr_scheduler
-        if self.config.resume_from_checkpoint != "latest":
-            path = os.path.basename(self.config.resume_from_checkpoint)
-        else:
-            # Get the most recent checkpoint
-            dirs = os.listdir(self.config.output_dir)
-            dirs = [d for d in dirs if d.startswith("checkpoint")]
-            dirs = sorted(dirs, key=lambda x: int(x.split("-")[1]))
-            path = dirs[-1] if len(dirs) > 0 else None
-
-        if path is None:
-            logger.info(
-                f"Checkpoint '{self.config.resume_from_checkpoint}' does not exist. Starting a new training run."
-            )
-            self._send_webhook_raw(
-                structured_data={
-                    "message": "No model to resume. Beginning fresh training run."
-                },
-                message_type="init_resume_checkpoint",
-            )
-
-            self.config.resume_from_checkpoint = None
-            return lr_scheduler
-
-        logger.info(f"Resuming from checkpoint {path}")
-        self.accelerator.load_state(os.path.join(self.config.output_dir, path))
-        try:
-            if (
-                "constant" == self.config.lr_scheduler
-                and not self.config.is_schedulefree
-            ):
-                for g in self.optimizer.param_groups:
-                    if "lr" in g:
-                        g["lr"] = self.config.learning_rate
-                for k, v in lr_scheduler.state_dict().items():
-                    if k in ("base_lrs", "_last_lr"):
-                        v[0] = self.config.learning_rate
-        except Exception as e:
-            self._send_webhook_raw(
-                structured_data={
-                    "message": "Could not update learning rate scheduler LR value."
-                },
-                message_type="warning",
-            )
-            logger.error(
-                f"Could not update lr_scheduler {self.config.lr_scheduler} learning rate to {self.config.learning_rate} upon resume: {e}"
-            )
-
-        self._send_webhook_raw(
-            structured_data={"message": f"Resuming model: {path}"},
-            message_type="init_resume_checkpoint",
-        )
-        training_state_filename = f"training_state.json"
-        if get_rank() > 0:
-            training_state_filename = f"training_state-{get_rank()}.json"
-        for _, backend in StateTracker.get_data_backends().items():
-            if "sampler" in backend:
-                backend["sampler"].load_states(
-                    state_path=os.path.join(
-                        self.config.output_dir,
-                        path,
-                        training_state_filename,
-                    ),
-                )
-        self.state["global_resume_step"] = self.state["global_step"] = (
-            StateTracker.get_global_step()
-        )
-        StateTracker.set_global_resume_step(self.state["global_resume_step"])
-        training_state_in_ckpt = StateTracker.get_training_state()
-        self._send_webhook_raw(
-            structured_data=training_state_in_ckpt,
-            message_type="init_resume_checkpoint_details",
-        )
-        logger.debug(f"Training state inside checkpoint: {training_state_in_ckpt}")
-        if hasattr(lr_scheduler, "last_step"):
-            lr_scheduler.last_step = self.state["global_resume_step"]
-        logger.info(f"Resuming from global_step {self.state['global_resume_step']}).")
-
-        # Log the current state of each data backend.
-        for _, backend in StateTracker.get_data_backends().items():
-            if "sampler" in backend:
-                backend["sampler"].log_state()
-        # We store the number of dataset resets that have occurred inside the checkpoint.
-        self.state["first_epoch"] = StateTracker.get_epoch()
-        if self.state["first_epoch"] > 1 or self.state["global_resume_step"] > 1:
-            self.config.total_steps_remaining_at_start -= self.state[
-                "global_resume_step"
-            ]
-            logger.debug(
-                f"Resuming from epoch {self.state['first_epoch']}, which leaves us with {self.config.total_steps_remaining_at_start}."
-            )
-        self.state["current_epoch"] = self.state["first_epoch"]
-        StateTracker.set_epoch(self.state["current_epoch"])
-        if hasattr(lr_scheduler, "last_epoch"):
-            lr_scheduler.last_epoch = (
-                training_state_in_ckpt.get(
-                    "epoch_step", self.state.get("global_resume_step", 1)
-                )
-                * self.accelerator.num_processes
-            )
-
-        if self.state["current_epoch"] > self.config.num_train_epochs + 1:
-            logger.info(
-                f"Reached the end ({self.state['current_epoch']} epochs) of our training run ({self.config.num_train_epochs} epochs). This run will do zero steps."
-            )
-        self.accelerator.wait_for_everyone()
-
-        return lr_scheduler
-
-    def init_trackers(self):
-        # We need to initialize the trackers we use, and also store our configuration.
-        # The trackers initializes automatically on the main process.
-        self.guidance_values_table = None
-        if self.accelerator.is_main_process:
-            # Copy args into public_args:
-            public_args = copy.deepcopy(self.config)
-            delattr(public_args, "accelerator_project_config")
-            delattr(public_args, "process_group_kwargs")
-            delattr(public_args, "weight_dtype")
-            delattr(public_args, "base_weight_dtype")
-            delattr(public_args, "vae_kwargs")
-
-            # Hash the contents of public_args to reflect a deterministic ID for a single set of params:
-            public_args_hash = hashlib.md5(
-                json.dumps(vars(public_args), sort_keys=True).encode("utf-8")
-            ).hexdigest()
-            project_name = self.config.tracker_project_name or "simpletuner-training"
-            tracker_run_name = (
-                self.config.tracker_run_name or "simpletuner-training-run"
-            )
-            self.accelerator.init_trackers(
-                project_name,
-                config=vars(public_args),
-                init_kwargs={
-                    "wandb": {
-                        "name": tracker_run_name,
-                        "id": f"{public_args_hash}",
-                        "resume": "allow",
-                        "allow_val_change": True,
-                    }
-                },
-            )
-            self._send_webhook_raw(
-                structured_data=public_args.__dict__,
-                message_type="training_config",
-            )
-
-    def resume_and_prepare(self):
-        self.init_optimizer()
-        lr_scheduler = self.init_lr_scheduler()
-        self.init_hooks()
-        self.init_prepare_models(lr_scheduler=lr_scheduler)
-        lr_scheduler = self.init_resume_checkpoint(lr_scheduler=lr_scheduler)
-        self.init_post_load_freeze()
-
-    def move_models(self, destination: str = "accelerator"):
-        target_device = "cpu"
-        if destination == "accelerator":
-            target_device = self.accelerator.device
-        logger.info(
-            f"Moving the {'U-net' if self.unet is not None else 'diffusion transformer'} to GPU in {self.config.weight_dtype if not self.config.is_quantized else self.config.base_model_precision} precision."
-        )
-        if self.unet is not None:
-            if self.config.is_quantized:
-                self.unet.to(target_device)
-            else:
-                self.unet.to(target_device, dtype=self.config.weight_dtype)
-        if self.transformer is not None:
-            if self.config.is_quantized:
-                self.transformer.to(target_device)
-            else:
-                self.transformer.to(target_device, dtype=self.config.weight_dtype)
-        if getattr(self.accelerator, "_lycoris_wrapped_network", None) is not None:
-            self.accelerator._lycoris_wrapped_network = (
-                self.accelerator._lycoris_wrapped_network.to(
-                    target_device, dtype=self.config.weight_dtype
-                )
-            )
-        if (
-            self.config.enable_xformers_memory_efficient_attention
-            and self.config.model_family
-            not in [
-                "sd3",
-                "pixart_sigma",
-                "flux",
-                "smoldit",
-                "kolors",
-            ]
-        ):
-            logger.info("Enabling xformers memory-efficient attention.")
-            if is_xformers_available():
-                import xformers  # type: ignore # noqa
-
-                if self.unet is not None:
-                    self.unet.enable_xformers_memory_efficient_attention()
-                if self.transformer is not None:
-                    self.transformer.enable_xformers_memory_efficient_attention()
-                if self.config.controlnet:
-                    self.controlnet.enable_xformers_memory_efficient_attention()
-            else:
-                raise ValueError(
-                    "xformers is not available. Make sure it is installed correctly"
-                )
-        elif self.config.enable_xformers_memory_efficient_attention:
-            logger.warning(
-                "xformers is not enabled, as it is incompatible with this model type."
-            )
-            self.config.enable_xformers_memory_efficient_attention = False
-
-        if self.config.controlnet:
-            self.controlnet.train()
-            logger.info(
-                f"Moving ControlNet to {target_device} in {self.config.weight_dtype} precision."
-            )
-            self.controlnet.to(device=target_device, dtype=self.config.weight_dtype)
-            if self.config.train_text_encoder:
-                logger.warning(
-                    "Unknown results will occur when finetuning the text encoder alongside ControlNet."
-                )
-
-    def mark_optimizer_train(self):
-        if is_lr_scheduler_disabled(self.config.optimizer) and hasattr(
-            self.optimizer, "train"
-        ):
-            # we typically have to call train() on the optim for schedulefree.
-            self.optimizer.train()
-
-    def mark_optimizer_eval(self):
-        if is_lr_scheduler_disabled(self.config.optimizer) and hasattr(
-            self.optimizer, "eval"
-        ):
-            # we typically have to call eval() on the optim for schedulefree before saving or running validations.
-            self.optimizer.eval()
-
-    def _send_webhook_msg(
-        self, message: str, message_level: str = "info", store_response: bool = False
-    ):
-        if type(message) is not str:
-            logger.error(
-                f"_send_webhook_msg received {type(message)} type message instead of str."
-            )
-            return False
-        if self.webhook_handler is None or not self.webhook_handler:
-            return
-        self.webhook_handler.send(
-            message=message, message_level=message_level, store_response=store_response
-        )
-
-    def _send_webhook_raw(
-        self,
-        structured_data: dict,
-        message_type: str,
-        message_level: str = "info",
-    ):
-        if type(structured_data) is not dict:
-            logger.error(
-                f"_send_webhook_msg received {type(structured_data)} type message instead of dict."
-            )
-            return False
-        if not self.webhook_handler:
-            return
-        self.webhook_handler.send_raw(
-            structured_data=structured_data,
-            message_type=message_type,
-            message_level=message_level,
-            job_id=self.job_id,
-        )
-
-    def _train_initial_msg(self):
-        initial_msg = "\n***** Running training *****"
-        initial_msg += f"\n-  Num batches = {self.config.total_num_batches}"
-        initial_msg += f"\n-  Num Epochs = {self.config.num_train_epochs}"
-        initial_msg += f"\n  - Current Epoch = {self.state['first_epoch']}"
-        initial_msg += f"\n-  Total train batch size (w. parallel, distributed & accumulation) = {self.config.total_batch_size}"
-        initial_msg += f"\n  - Instantaneous batch size per device = {self.config.train_batch_size}"
-        initial_msg += f"\n  - Gradient Accumulation steps = {self.config.gradient_accumulation_steps}"
-        initial_msg += f"\n-  Total optimization steps = {self.config.max_train_steps}"
-        if self.state["global_step"] > 1:
-            initial_msg += f"\n  - Steps completed: {self.state['global_step']}"
-        initial_msg += f"\n-  Total optimization steps remaining = {max(0, self.config.total_steps_remaining_at_start)}"
-        logger.info(initial_msg)
-        self._send_webhook_msg(message=initial_msg)
-        structured_data = {
-            "total_num_batches": self.config.total_num_batches,
-            "total_num_epochs": self.config.num_train_epochs,
-            "total_num_steps": self.config.max_train_steps,
-            "current_epoch": self.state["first_epoch"],
-            "total_batch_size": self.config.total_batch_size,
-            "micro_batch_size": self.config.train_batch_size,
-            "current_step": self.state["global_step"],
-            "remaining_num_steps": max(0, self.config.total_steps_remaining_at_start),
-        }
-        self._send_webhook_raw(
-            structured_data=structured_data, message_type="_train_initial_msg"
-        )
-
-    def _epoch_rollover(self, epoch):
-        if self.state["first_epoch"] == epoch:
-            return
-        logger.debug(
-            f"Just completed epoch {self.state['current_epoch']}. Beginning epoch {epoch}. Starting epoch was {self.state['first_epoch']}. Final epoch will be {self.config.num_train_epochs}"
-        )
-        for backend_id, backend in StateTracker.get_data_backends().items():
-            backend_config = StateTracker.get_data_backend_config(backend_id)
-            if (
-                backend_config.get("crop")
-                and backend_config.get("crop_aspect") == "random"
-                and "metadata_backend" in backend
-                and not self.config.aspect_bucket_disable_rebuild
-            ) or (
-                backend_config.get("vae_cache_clear_each_epoch")
-                and "vaecache" in backend
-            ):
-                # when the aspect ratio is random, we need to shuffle the dataset on each epoch.
-                if self.accelerator.is_main_process:
-                    # we only compute the aspect ratio indices on the main process.
-                    # we have to set read_only to False since we're generating a new, un-split list.
-                    # otherwise, we can't actually save the new cache to disk.
-                    backend["metadata_backend"].read_only = False
-                    # this will generate+save the new cache to the storage backend.
-                    backend["metadata_backend"].compute_aspect_ratio_bucket_indices(
-                        ignore_existing_cache=True
-                    )
-                self.accelerator.wait_for_everyone()
-                logger.info(f"Reloading cache for backend {backend_id}")
-                backend["metadata_backend"].reload_cache(set_config=False)
-                logger.info("Waiting for other threads to finish..")
-                self.accelerator.wait_for_everyone()
-                # we'll have to split the buckets between GPUs again now, so that the VAE cache distributes properly.
-                logger.info("Splitting buckets across GPUs")
-                backend["metadata_backend"].split_buckets_between_processes(
-                    gradient_accumulation_steps=self.config.gradient_accumulation_steps
-                )
-                # we have to rebuild the VAE cache if it exists.
-                if "vaecache" in backend:
-                    logger.info("Rebuilding VAE cache..")
-                    backend["vaecache"].rebuild_cache()
-                # no need to manually call metadata_backend.save_cache() here.
-        self.state["current_epoch"] = epoch
-        StateTracker.set_epoch(epoch)
-        if self.config.lr_scheduler == "cosine_with_restarts":
-            self.extra_lr_scheduler_kwargs["epoch"] = epoch
-
-    def _exit_on_signal(self):
-        if self.should_abort:
-            self._send_webhook_raw(
-                structured_data={"message": "Aborting training run."},
-                message_type="exit",
-            )
-            raise StopIteration("Training run received abort signal.")
-
-    def abort(self):
-        logger.info("Aborting training run.")
-        if self.bf is not None:
-            self.bf.stop_fetching()
-        # we should set should_abort = True on each data backend's vae cache, metadata, and text backend
-        for _, backend in StateTracker.get_data_backends().items():
-            if "vaecache" in backend:
-                logger.debug(f"Aborting VAE cache")
-                backend["vaecache"].should_abort = True
-            if "metadata_backend" in backend:
-                logger.debug(f"Aborting metadata backend")
-                backend["metadata_backend"].should_abort = True
-            if "text_backend" in backend:
-                logger.debug(f"Aborting text backend")
-                backend["text_backend"].should_abort = True
-            if "sampler" in backend:
-                logger.debug(f"Aborting sampler")
-                backend["sampler"].should_abort = True
-        self.should_abort = True
-
-    def model_predict(
-        self,
-        batch,
-        latents,
-        noisy_latents,
-        encoder_hidden_states,
-        added_cond_kwargs,
-        add_text_embeds,
-        timesteps,
-    ):
-        if self.config.controlnet:
-            training_logger.debug(
-                f"Extra conditioning dtype: {batch['conditioning_pixel_values'].dtype}"
-            )
-        if not self.config.disable_accelerator:
-            if self.config.controlnet:
-                # ControlNet conditioning.
-                controlnet_image = batch["conditioning_pixel_values"].to(
-                    dtype=self.config.weight_dtype
-                )
-                training_logger.debug(f"Image shape: {controlnet_image.shape}")
-                down_block_res_samples, mid_block_res_sample = self.controlnet(
-                    noisy_latents,
-                    timesteps,
-                    encoder_hidden_states=encoder_hidden_states,
-                    added_cond_kwargs=added_cond_kwargs,
-                    controlnet_cond=controlnet_image,
-                    return_dict=False,
-                )
-                # Predict the noise residual
-                if self.unet is not None:
-                    model_pred = self.unet(
-                        noisy_latents,
-                        timesteps,
-                        encoder_hidden_states=encoder_hidden_states,
-                        added_cond_kwargs=added_cond_kwargs,
-                        down_block_additional_residuals=[
-                            sample.to(dtype=self.config.weight_dtype)
-                            for sample in down_block_res_samples
-                        ],
-                        mid_block_additional_residual=mid_block_res_sample.to(
-                            dtype=self.config.weight_dtype
-                        ),
-                        return_dict=False,
-                    )[0]
-                if self.transformer is not None:
-                    raise Exception(
-                        "ControlNet predictions for transformer models are not yet implemented."
-                    )
-            elif self.config.model_family == "flux":
-                # handle guidance
-                packed_noisy_latents = pack_latents(
-                    noisy_latents,
-                    batch_size=latents.shape[0],
-                    num_channels_latents=latents.shape[1],
-                    height=latents.shape[2],
-                    width=latents.shape[3],
-                ).to(
-                    dtype=self.config.base_weight_dtype,
-                    device=self.accelerator.device,
-                )
-                if self.config.flux_guidance_mode == "mobius":
-                    guidance_scales = get_mobius_guidance(
-                        self.config,
-                        self.state["global_step"],
-                        self.config.num_update_steps_per_epoch,
-                        latents.shape[0],
-                        self.accelerator.device,
-                    )
-                elif self.config.flux_guidance_mode == "constant":
-                    guidance_scales = [
-                        float(self.config.flux_guidance_value)
-                    ] * latents.shape[0]
-
-                elif self.config.flux_guidance_mode == "random-range":
-                    # Generate a list of random values within the specified range for each latent
-                    guidance_scales = [
-                        random.uniform(
-                            self.config.flux_guidance_min,
-                            self.config.flux_guidance_max,
-                        )
-                        for _ in range(latents.shape[0])
-                    ]
-                self.guidance_values_list.append(guidance_scales)
-
-                # Now `guidance` will have different values for each latent in `latents`.
-                transformer_config = None
-                if hasattr(self.transformer, "module"):
-                    transformer_config = self.transformer.module.config
-                elif hasattr(self.transformer, "config"):
-                    transformer_config = self.transformer.config
-                if transformer_config is not None and getattr(
-                    transformer_config, "guidance_embeds", False
-                ):
-                    guidance = torch.tensor(
-                        guidance_scales, device=self.accelerator.device
-                    )
-                else:
-                    guidance = None
-                img_ids = prepare_latent_image_ids(
-                    latents.shape[0],
-                    latents.shape[2],
-                    latents.shape[3],
-                    self.accelerator.device,
-                    self.config.weight_dtype,
-                )
-                timesteps = (
-                    torch.tensor(timesteps)
-                    .expand(noisy_latents.shape[0])
-                    .to(device=self.accelerator.device)
-                    / 1000
-                )
-
-                text_ids = torch.zeros(
-                    batch["prompt_embeds"].shape[1],
-                    3,
-                ).to(
-                    device=self.accelerator.device,
-                    dtype=self.config.base_weight_dtype,
-                )
-                training_logger.debug(
-                    "DTypes:"
-                    f"\n-> Text IDs shape: {text_ids.shape if hasattr(text_ids, 'shape') else None}, dtype: {text_ids.dtype if hasattr(text_ids, 'dtype') else None}"
-                    f"\n-> Image IDs shape: {img_ids.shape if hasattr(img_ids, 'shape') else None}, dtype: {img_ids.dtype if hasattr(img_ids, 'dtype') else None}"
-                    f"\n-> Timesteps shape: {timesteps.shape if hasattr(timesteps, 'shape') else None}, dtype: {timesteps.dtype if hasattr(timesteps, 'dtype') else None}"
-                    f"\n-> Guidance: {guidance}"
-                    f"\n-> Packed Noisy Latents shape: {packed_noisy_latents.shape if hasattr(packed_noisy_latents, 'shape') else None}, dtype: {packed_noisy_latents.dtype if hasattr(packed_noisy_latents, 'dtype') else None}"
-                )
-
-                flux_transformer_kwargs = {
-                    "hidden_states": packed_noisy_latents,
-                    # YiYi notes: divide it by 1000 for now because we scale it by 1000 in the transforme rmodel (we should not keep it but I want to keep the inputs same for the model for testing)
-                    "timestep": timesteps,
-                    "guidance": guidance,
-                    "pooled_projections": batch["add_text_embeds"].to(
-                        device=self.accelerator.device,
-                        dtype=self.config.base_weight_dtype,
-                    ),
-                    "encoder_hidden_states": batch["prompt_embeds"].to(
-                        device=self.accelerator.device,
-                        dtype=self.config.base_weight_dtype,
-                    ),
-                    "txt_ids": text_ids.to(
-                        device=self.accelerator.device,
-                        dtype=self.config.base_weight_dtype,
-                    ),
-                    "img_ids": img_ids,
-                    "joint_attention_kwargs": None,
-                    "return_dict": False,
-                }
-                if self.config.flux_attention_masked_training:
-                    flux_transformer_kwargs["attention_mask"] = batch[
-                        "encoder_attention_mask"
-                    ]
-                    if flux_transformer_kwargs["attention_mask"] is None:
-                        raise ValueError(
-                            "No attention mask was discovered when attempting validation - this means you need to recreate your text embed cache."
-                        )
-
-                model_pred = self.transformer(**flux_transformer_kwargs)[0]
-
-            elif self.config.model_family == "sd3":
-                # Stable Diffusion 3 uses a MM-DiT model where the VAE-produced
-                #  image embeds are passed in with the TE-produced text embeds.
-                model_pred = self.transformer(
-                    hidden_states=noisy_latents.to(
-                        device=self.accelerator.device,
-                        dtype=self.config.base_weight_dtype,
-                    ),
-                    timestep=timesteps,
-                    encoder_hidden_states=encoder_hidden_states.to(
-                        device=self.accelerator.device,
-                        dtype=self.config.base_weight_dtype,
-                    ),
-                    pooled_projections=add_text_embeds.to(
-                        device=self.accelerator.device,
-                        dtype=self.config.weight_dtype,
-                    ),
-                    return_dict=False,
-                )[0]
-            elif self.config.model_family == "pixart_sigma":
-                model_pred = self.transformer(
-                    noisy_latents,
-                    encoder_hidden_states=encoder_hidden_states,
-                    encoder_attention_mask=batch["encoder_attention_mask"],
-                    timestep=timesteps,
-                    added_cond_kwargs=added_cond_kwargs,
-                    return_dict=False,
-                )[0]
-                model_pred = model_pred.chunk(2, dim=1)[0]
-            elif self.config.model_family == "smoldit":
-                first_latent_shape = noisy_latents.shape
-                height = first_latent_shape[1] * 8
-                width = first_latent_shape[2] * 8
-                grid_height = height // 8 // self.transformer.config.patch_size
-                grid_width = width // 8 // self.transformer.config.patch_size
-                base_size = 512 // 8 // self.transformer.config.patch_size
-                grid_crops_coords = get_resize_crop_region_for_grid(
-                    (grid_height, grid_width), (base_size, base_size)
-                )
-                inputs = {
-                    "hidden_states": noisy_latents,
-                    "timestep": timesteps,
-                    "encoder_hidden_states": encoder_hidden_states,
-                    "encoder_attention_mask": batch["encoder_attention_mask"],
-                    "image_rotary_emb": get_2d_rotary_pos_embed(
-                        self.transformer.inner_dim
-                        // self.transformer.config.num_attention_heads,
-                        grid_crops_coords,
-                        (grid_height, grid_width),
-                    ),
-                }
-                model_pred = self.transformer(**inputs).sample
-            elif self.unet is not None:
-                if self.config.model_family == "legacy":
-                    # SD 1.5 or 2.x
-                    model_pred = self.unet(
-                        noisy_latents,
-                        timesteps,
-                        encoder_hidden_states,
-                    ).sample
-                else:
-                    # SDXL, Kolors, other default unet prediction.
-                    model_pred = self.unet(
-                        noisy_latents,
-                        timesteps,
-                        encoder_hidden_states,
-                        added_cond_kwargs=added_cond_kwargs,
-                    ).sample
-            else:
-                raise Exception("Unknown error occurred, no prediction could be made.")
-
-            if self.config.model_family == "flux":
-                model_pred = unpack_latents(
-                    model_pred,
-                    height=latents.shape[2] * 8,
-                    width=latents.shape[3] * 8,
-                    vae_scale_factor=16,
-                )
-        else:
-            # Dummy model prediction for debugging.
-            model_pred = torch.randn_like(noisy_latents)
-
-        return model_pred
-
-    def train(self):
-        self.init_trackers()
-        self._train_initial_msg()
-
-        if self.config.validation_on_startup and self.state["global_step"] <= 1:
-            # Just in Case.
-            self.mark_optimizer_eval()
-            # normal run-of-the-mill validation on startup.
-            if self.validation is not None:
-                self.validation.run_validations(validation_type="base_model", step=0)
-
-        self.mark_optimizer_train()
-
-        # Only show the progress bar once on each machine.
-        show_progress_bar = True
-        if not self.accelerator.is_local_main_process:
-            show_progress_bar = False
-        progress_bar = tqdm(
-            range(0, self.config.max_train_steps),
-            disable=not show_progress_bar,
-            initial=self.state["global_step"],
-            desc=f"Epoch {self.state['first_epoch']}/{self.config.num_train_epochs} Steps",
-            ncols=125,
-        )
-        self.accelerator.wait_for_everyone()
-
-        # Some values that are required to be initialised later.
-        step = self.state["global_step"]
-        training_luminance_values = []
-        current_epoch_step = None
-        self.bf, fetch_thread = None, None
-        iterator_fn = random_dataloader_iterator
-        for epoch in range(self.state["first_epoch"], self.config.num_train_epochs + 1):
-            if self.state["current_epoch"] > self.config.num_train_epochs + 1:
-                # This might immediately end training, but that's useful for simply exporting the model.
-                logger.info(
-                    f"Training run is complete ({self.config.num_train_epochs}/{self.config.num_train_epochs} epochs, {self.state['global_step']}/{self.config.max_train_steps} steps)."
-                )
-                break
-            self._epoch_rollover(epoch)
-            if self.config.controlnet:
-                self.controlnet.train()
-                training_models = [self.controlnet]
-            else:
-                if self.unet is not None:
-                    self.unet.train()
-                    training_models = [self.unet]
-                if self.transformer is not None:
-                    self.transformer.train()
-                    training_models = [self.transformer]
-            if (
-                "lora" in self.config.model_type
-                and self.config.train_text_encoder
-                and "standard" in self.config.lora_type.lower()
-            ):
-                self.text_encoder_1.train()
-                self.text_encoder_2.train()
-                training_models.append(self.text_encoder_1)
-                training_models.append(self.text_encoder_2)
-
-            if current_epoch_step is not None:
-                # We are resetting to the next epoch, if it is not none.
-                current_epoch_step = 0
-            else:
-                # If it's None, we need to calculate the current epoch step based on the current global step.
-                current_epoch_step = (
-                    self.state["global_step"] % self.config.num_update_steps_per_epoch
-                )
-            train_backends = {}
-            for backend_id, backend in StateTracker.get_data_backends().items():
-                if (
-                    StateTracker.backend_status(backend_id)
-                    or "train_dataloader" not in backend
-                ):
-                    # Exclude exhausted backends.
-                    logger.debug(
-                        f"Excluding backend: {backend_id}, as it is exhausted? {StateTracker.backend_status(backend_id)} or not found {('train_dataloader' not in backend)}"
-                    )
-                    continue
-                train_backends[backend_id] = backend["train_dataloader"]
-            # Begin dataloader prefetch, if enabled.
-            iterator_args = [train_backends]
-            if self.config.dataloader_prefetch:
-                iterator_args = []
-                if self.bf is not None:
-                    self.bf.stop_fetching()
-                self.bf = BatchFetcher(
-                    datasets=train_backends,
-                    max_size=self.config.dataloader_prefetch_qlen,
-                    step=step,
-                )
-                if fetch_thread is not None:
-                    fetch_thread.join()
-                fetch_thread = self.bf.start_fetching()
-                iterator_fn = self.bf.next_response
-
-            while True:
-                self._exit_on_signal()
-                step += 1
-                batch = iterator_fn(step, *iterator_args)
-                training_logger.debug(f"Iterator: {iterator_fn}")
-                if self.config.lr_scheduler == "cosine_with_restarts":
-                    self.extra_lr_scheduler_kwargs["step"] = self.state["global_step"]
-
-                if self.accelerator.is_main_process:
-                    progress_bar.set_description(
-                        f"Epoch {self.state['current_epoch']}/{self.config.num_train_epochs}, Steps"
-                    )
-
-                # If we receive a False from the enumerator, we know we reached the next epoch.
-                if batch is False:
-                    logger.debug(f"Reached the end of epoch {epoch}")
-                    break
-
-                if batch is None:
-                    import traceback
-
-                    raise ValueError(
-                        f"Received a None batch, which is not a good thing. Traceback: {traceback.format_exc()}"
-                    )
-
-                # Add the current batch of training data's avg luminance to a list.
-                if "batch_luminance" in batch:
-                    training_luminance_values.append(batch["batch_luminance"])
-
-                with self.accelerator.accumulate(training_models):
-                    training_logger.debug("Sending latent batch to GPU.")
-                    latents = batch["latent_batch"].to(
-                        self.accelerator.device, dtype=self.config.weight_dtype
-                    )
-
-                    # Sample noise that we'll add to the latents - self.config.noise_offset might need to be set to 0.1 by default.
-                    noise = torch.randn_like(latents)
-                    if not self.config.flow_matching:
-                        if self.config.offset_noise:
-                            if (
-                                self.config.noise_offset_probability == 1.0
-                                or random.random()
-                                < self.config.noise_offset_probability
-                            ):
-                                noise = noise + self.config.noise_offset * torch.randn(
-                                    latents.shape[0],
-                                    latents.shape[1],
-                                    1,
-                                    1,
-                                    device=latents.device,
-                                )
-
-                    bsz = latents.shape[0]
-                    if int(bsz) != int(self.config.train_batch_size):
-                        logger.error(
-                            f"Received {bsz} latents, but expected {self.config.train_batch_size}. Processing short batch."
-                        )
-                    training_logger.debug(f"Working on batch size: {bsz}")
-                    if self.config.flow_matching:
-                        if (
-                            not self.config.flux_fast_schedule
-                            and not self.config.flux_use_beta_schedule
-                        ):
-                            # imported from cloneofsimo's minRF trainer: https://github.com/cloneofsimo/minRF
-                            # also used by: https://github.com/XLabs-AI/x-flux/tree/main
-                            # and: https://github.com/kohya-ss/sd-scripts/commit/8a0f12dde812994ec3facdcdb7c08b362dbceb0f
-                            sigmas = torch.sigmoid(
-                                self.config.flow_matching_sigmoid_scale
-                                * torch.randn((bsz,), device=self.accelerator.device)
-                            )
-                            sigmas = apply_flux_schedule_shift(
-                                self.config, self.noise_scheduler, sigmas, noise
-                            )
-                        elif self.config.flux_use_beta_schedule:
-                            alpha = self.config.flux_beta_schedule_alpha
-                            beta = self.config.flux_beta_schedule_beta
-
-                            # Create a Beta distribution instance
-                            beta_dist = Beta(alpha, beta)
-
-                            # Sample from the Beta distribution
-                            sigmas = beta_dist.sample((bsz,)).to(
-                                device=self.accelerator.device
-                            )
-
-                            sigmas = apply_flux_schedule_shift(
-                                self.config, self.noise_scheduler, sigmas, noise
-                            )
-                        else:
-                            # fast schedule can only use these sigmas, and they can be sampled up to batch size times
-                            available_sigmas = [
-                                1.0,
-                                1.0,
-                                1.0,
-                                1.0,
-                                1.0,
-                                1.0,
-                                1.0,
-                                0.75,
-                                0.5,
-                                0.25,
-                            ]
-                            sigmas = torch.tensor(
-                                random.choices(available_sigmas, k=bsz),
-                                device=self.accelerator.device,
-                            )
-                        timesteps = sigmas * 1000.0
-                        sigmas = sigmas.view(-1, 1, 1, 1)
-                    else:
-                        # Sample a random timestep for each image, potentially biased by the timestep weights.
-                        # Biasing the timestep weights allows us to spend less time training irrelevant timesteps.
-                        weights = generate_timestep_weights(
-                            self.config, self.noise_scheduler.config.num_train_timesteps
-                        ).to(self.accelerator.device)
-                        # Instead of uniformly sampling the timestep range, we'll split our weights and schedule into bsz number of segments.
-                        # This enables more broad sampling and potentially more effective training.
-                        if (
-                            bsz > 1
-                            and not self.config.disable_segmented_timestep_sampling
-                        ):
-                            timesteps = segmented_timestep_selection(
-                                actual_num_timesteps=self.noise_scheduler.config.num_train_timesteps,
-                                bsz=bsz,
-                                weights=weights,
-                                use_refiner_range=StateTracker.is_sdxl_refiner()
-                                and not StateTracker.get_args().sdxl_refiner_uses_full_range,
-                            ).to(self.accelerator.device)
-                        else:
-                            timesteps = torch.multinomial(
-                                weights, bsz, replacement=True
-                            ).long()
-
-                    # Prepare the data for the scatter plot
-                    for timestep in timesteps.tolist():
-                        self.timesteps_buffer.append(
-                            (self.state["global_step"], timestep)
-                        )
-
-                    if self.config.input_perturbation != 0 and (
-                        not self.config.input_perturbation_steps
-                        or self.state["global_step"]
-                        < self.config.input_perturbation_steps
-                    ):
-                        input_perturbation = self.config.input_perturbation
-                        if self.config.input_perturbation_steps:
-                            input_perturbation *= 1.0 - (
-                                self.state["global_step"]
-                                / self.config.input_perturbation_steps
-                            )
-                        input_noise = noise + input_perturbation * torch.randn_like(
-                            latents
-                        )
-                    else:
-                        input_noise = noise
-
-                    if self.config.flow_matching:
-                        noisy_latents = (1 - sigmas) * latents + sigmas * input_noise
-                    else:
-                        # Add noise to the latents according to the noise magnitude at each timestep
-                        # (this is the forward diffusion process)
-                        noisy_latents = self.noise_scheduler.add_noise(
-                            latents.float(), input_noise.float(), timesteps
-                        ).to(
-                            device=self.accelerator.device,
-                            dtype=self.config.weight_dtype,
-                        )
-
-                    encoder_hidden_states = batch["prompt_embeds"].to(
-                        dtype=self.config.weight_dtype, device=self.accelerator.device
-                    )
-                    training_logger.debug(
-                        f"Encoder hidden states: {encoder_hidden_states.shape}"
-                    )
-
-                    add_text_embeds = batch["add_text_embeds"]
-                    training_logger.debug(
-                        f"Pooled embeds: {add_text_embeds.shape if add_text_embeds is not None else None}"
-                    )
-                    # Get the target for loss depending on the prediction type
-                    if self.config.flow_matching:
-                        # This is the flow-matching target for vanilla SD3.
-                        # If self.config.flow_matching_loss == "diffusion", we will instead use v_prediction (see below)
-                        if self.config.flow_matching_loss == "diffusers":
-                            target = latents
-                        elif self.config.flow_matching_loss == "compatible":
-                            target = noise - latents
-                        elif self.config.flow_matching_loss == "sd35":
-                            sigma_reshaped = sigmas.view(
-                                -1, 1, 1, 1
-                            )  # Ensure sigma has the correct shape
-                            target = (noisy_latents - latents) / sigma_reshaped
-
-                    elif self.noise_scheduler.config.prediction_type == "epsilon":
-                        target = noise
-                    elif (
-                        self.noise_scheduler.config.prediction_type == "v_prediction"
-                        or (
-                            self.config.flow_matching
-                            and self.config.flow_matching_loss == "diffusion"
-                        )
-                    ):
-                        # When not using flow-matching, train on velocity prediction objective.
-                        target = self.noise_scheduler.get_velocity(
-                            latents, noise, timesteps
-                        )
-                    elif self.noise_scheduler.config.prediction_type == "sample":
-                        # We set the target to latents here, but the model_pred will return the noise sample prediction.
-                        # We will have to subtract the noise residual from the prediction to get the target sample.
-                        target = latents
-                    else:
-                        raise ValueError(
-                            f"Unknown prediction type {self.noise_scheduler.config.prediction_type}"
-                            "Supported types are 'epsilon', `sample`, and 'v_prediction'."
-                        )
-
-                    added_cond_kwargs = None
-                    # Predict the noise residual and compute loss
-                    if (
-                        StateTracker.get_model_family() == "sdxl"
-                        or self.config.model_family == "kolors"
-                    ):
-                        added_cond_kwargs = {
-                            "text_embeds": add_text_embeds.to(
-                                device=self.accelerator.device,
-                                dtype=self.config.weight_dtype,
-                            ),
-                            "time_ids": batch["batch_time_ids"].to(
-                                device=self.accelerator.device,
-                                dtype=self.config.weight_dtype,
-                            ),
-                        }
-                    elif (
-                        self.config.model_family == "pixart_sigma"
-                        or self.config.model_family == "smoldit"
-                    ):
-                        # pixart requires an input of {"resolution": .., "aspect_ratio": ..}
-                        if "batch_time_ids" in batch:
-                            added_cond_kwargs = batch["batch_time_ids"]
-                        batch["encoder_attention_mask"] = batch[
-                            "encoder_attention_mask"
-                        ].to(
-                            device=self.accelerator.device,
-                            dtype=self.config.weight_dtype,
-                        )
-
-                    # a marker to know whether we had a model capable of regularised data training.
-                    handled_regularisation = False
-                    is_regularisation_data = batch.get("is_regularisation_data", False)
-                    if is_regularisation_data and self.config.model_type == "lora":
-                        training_logger.debug("Predicting parent model residual.")
-                        handled_regularisation = True
-                        with torch.no_grad():
-                            if self.config.lora_type.lower() == "lycoris":
-                                training_logger.debug(
-                                    "Detaching LyCORIS adapter for parent prediction."
-                                )
-                                self.accelerator._lycoris_wrapped_network.restore()
-                            else:
-                                raise ValueError(
-                                    f"Cannot train parent-student networks on {self.config.lora_type} model. Only LyCORIS is supported."
-                                )
-                            target = self.model_predict(
-                                batch=batch,
-                                latents=latents,
-                                noisy_latents=noisy_latents,
-                                encoder_hidden_states=encoder_hidden_states,
-                                added_cond_kwargs=added_cond_kwargs,
-                                add_text_embeds=add_text_embeds,
-                                timesteps=timesteps,
-                            )
-                            if self.config.lora_type.lower() == "lycoris":
-                                training_logger.debug(
-                                    "Attaching LyCORIS adapter for student prediction."
-                                )
-                                self.accelerator._lycoris_wrapped_network.apply_to()
-
-                    training_logger.debug("Predicting noise residual.")
-                    model_pred = self.model_predict(
-                        batch=batch,
-                        latents=latents,
-                        noisy_latents=noisy_latents,
-                        encoder_hidden_states=encoder_hidden_states,
-                        added_cond_kwargs=added_cond_kwargs,
-                        add_text_embeds=add_text_embeds,
-                        timesteps=timesteps,
-                    )
-
-                    # x-prediction requires that we now subtract the noise residual from the prediction to get the target sample.
-                    if (
-                        hasattr(self.noise_scheduler, "config")
-                        and hasattr(self.noise_scheduler.config, "prediction_type")
-                        and self.noise_scheduler.config.prediction_type == "sample"
-                    ):
-                        model_pred = model_pred - noise
-
-                    parent_loss = None
-
-                    # Compute the per-pixel loss without reducing over spatial dimensions
-                    if self.config.flow_matching:
-                        # For flow matching, compute the per-pixel squared differences
-                        loss = (
-                            model_pred.float() - target.float()
-                        ) ** 2  # Shape: (batch_size, C, H, W)
-                    elif self.config.snr_gamma is None or self.config.snr_gamma == 0:
-                        training_logger.debug("Calculating loss")
-                        loss = self.config.snr_weight * F.mse_loss(
-                            model_pred.float(), target.float(), reduction="none"
-                        )  # Shape: (batch_size, C, H, W)
-                    else:
-                        # Compute loss-weights as per Section 3.4 of https://arxiv.org/abs/2303.09556.
-                        # Since we predict the noise instead of x_0, the original formulation is slightly changed.
-                        # This is discussed in Section 4.2 of the same paper.
-                        training_logger.debug("Using min-SNR loss")
-                        snr = compute_snr(timesteps, self.noise_scheduler)
-                        snr_divisor = snr
-                        if (
-                            self.noise_scheduler.config.prediction_type
-                            == "v_prediction"
-                            or (
-                                self.config.flow_matching
-                                and self.config.flow_matching_loss == "diffusion"
-                            )
-                        ):
-                            snr_divisor = snr + 1
-
-                        training_logger.debug(
-                            "Calculating MSE loss weights using SNR as divisor"
-                        )
-                        mse_loss_weights = (
-                            torch.stack(
-                                [
-                                    snr,
-                                    self.config.snr_gamma * torch.ones_like(timesteps),
-                                ],
-                                dim=1,
-                            ).min(dim=1)[0]
-                            / snr_divisor
-                        )  # Shape: (batch_size,)
-
-                        # Compute the per-pixel MSE loss without reduction
-                        loss = F.mse_loss(
-                            model_pred.float(), target.float(), reduction="none"
-                        )  # Shape: (batch_size, C, H, W)
-
-                        # Reshape mse_loss_weights for broadcasting and apply to loss
-                        mse_loss_weights = mse_loss_weights.view(
-                            -1, 1, 1, 1
-                        )  # Shape: (batch_size, 1, 1, 1)
-                        loss = loss * mse_loss_weights  # Shape: (batch_size, C, H, W)
-
-                    # Mask the loss using any conditioning data
-                    conditioning_type = batch.get("conditioning_type")
-                    if conditioning_type == "mask":
-                        # Adapted from:
-                        # https://github.com/kohya-ss/sd-scripts/blob/main/library/custom_train_functions.py#L482
-                        mask_image = (
-                            batch["conditioning_pixel_values"]
-                            .to(dtype=loss.dtype, device=loss.device)[:, 0]
-                            .unsqueeze(1)
-                        )  # Shape: (batch_size, 1, H', W')
-                        mask_image = torch.nn.functional.interpolate(
-                            mask_image, size=loss.shape[2:], mode="area"
-                        )  # Resize to match loss spatial dimensions
-                        mask_image = mask_image / 2 + 0.5  # Normalize to [0,1]
-                        loss = loss * mask_image  # Element-wise multiplication
-
-                    # Reduce the loss by averaging over channels and spatial dimensions
-                    loss = loss.mean(
-                        dim=list(range(1, len(loss.shape)))
-                    )  # Shape: (batch_size,)
-
-                    # Further reduce the loss by averaging over the batch dimension
-                    loss = loss.mean()  # Scalar value
-
-                    if is_regularisation_data:
-                        parent_loss = loss
-
-                    # Gather the losses across all processes for logging (if using distributed training)
-                    avg_loss = self.accelerator.gather(
-                        loss.repeat(self.config.train_batch_size)
-                    ).mean()
-                    self.train_loss += (
-                        avg_loss.item() / self.config.gradient_accumulation_steps
-                    )
-                    # Backpropagate
-                    grad_norm = None
-                    if not self.config.disable_accelerator:
-                        training_logger.debug("Backwards pass.")
-                        self.accelerator.backward(loss)
-
-                        if (
-                            self.config.optimizer != "adam_bfloat16"
-                            and self.config.gradient_precision == "fp32"
-                        ):
-                            # After backward, convert gradients to fp32 for stable accumulation
-                            for param in self.params_to_optimize:
-                                if param.grad is not None:
-                                    param.grad.data = param.grad.data.to(torch.float32)
-
-                        if (
-                            self.accelerator.sync_gradients
-                            and self.config.optimizer != "optimi-stableadamw"
-                            and self.config.max_grad_norm > 0
-                        ):
-                            # StableAdamW does not need clipping, similar to Adafactor.
-                            grad_norm = self.accelerator.clip_grad_norm_(
-                                self.params_to_optimize, self.config.max_grad_norm
-                            )
-                        training_logger.debug("Stepping components forward.")
-                        if self.config.optimizer_release_gradients:
-                            step_offset = 0  # simpletuner indexes steps from 1.
-                            should_not_release_gradients = (
-                                step + step_offset
-                            ) % self.config.gradient_accumulation_steps != 0
-                            training_logger.debug(
-                                f"step: {step}, should_not_release_gradients: {should_not_release_gradients}, self.config.optimizer_release_gradients: {self.config.optimizer_release_gradients}"
-                            )
-                            self.optimizer.optimizer_accumulation = (
-                                should_not_release_gradients
-                            )
-                        else:
-                            self.optimizer.step()
-                        self.optimizer.zero_grad(
-                            set_to_none=self.config.set_grads_to_none
-                        )
-
-                # Checks if the accelerator has performed an optimization step behind the scenes
-                wandb_logs = {}
-                if self.accelerator.sync_gradients:
-                    try:
-                        if self.config.is_schedulefree:
-                            # hackjob method of retrieving LR from accelerated optims
-                            self.lr = StateTracker.get_last_lr()
-                        else:
-                            self.lr_scheduler.step(**self.extra_lr_scheduler_kwargs)
-                            self.lr = self.lr_scheduler.get_last_lr()[0]
-                    except Exception as e:
-                        logger.error(
-                            f"Failed to get the last learning rate from the scheduler. Error: {e}"
-                        )
-                    wandb_logs = {
-                        "train_loss": self.train_loss,
-                        "optimization_loss": loss,
-                        "learning_rate": self.lr,
-                        "epoch": epoch,
-                    }
-                    if parent_loss is not None:
-                        wandb_logs["regularisation_loss"] = parent_loss
-                    if self.config.model_family == "flux" and self.guidance_values_list:
-                        # avg the values
-                        guidance_values = torch.tensor(self.guidance_values_list).mean()
-                        wandb_logs["mean_cfg"] = guidance_values.item()
-                        self.guidance_values_list = []
-                    if grad_norm is not None:
-                        wandb_logs["grad_norm"] = grad_norm
-                    progress_bar.update(1)
-                    self.state["global_step"] += 1
-                    current_epoch_step += 1
-                    StateTracker.set_global_step(self.state["global_step"])
-
-                    ema_decay_value = "None (EMA not in use)"
-                    if self.config.use_ema:
-                        if self.ema_model is not None:
-                            training_logger.debug("Stepping EMA forward")
-                            self.ema_model.step(
-                                parameters=(
-                                    self.unet.parameters()
-                                    if self.unet is not None
-                                    else self.transformer.parameters()
-                                ),
-                                global_step=self.state["global_step"],
-                            )
-                            wandb_logs["ema_decay_value"] = self.ema_model.get_decay()
-                        self.accelerator.wait_for_everyone()
-
-                    # Log scatter plot to wandb
-                    if (
-                        self.config.report_to == "wandb"
-                        and self.accelerator.is_main_process
-                    ):
-                        # Prepare the data for the scatter plot
-                        data = [
-                            [iteration, timestep]
-                            for iteration, timestep in self.timesteps_buffer
-                        ]
-                        table = wandb.Table(
-                            data=data, columns=["global_step", "timestep"]
-                        )
-                        wandb_logs["timesteps_scatter"] = wandb.plot.scatter(
-                            table,
-                            "global_step",
-                            "timestep",
-                            title="Timestep distribution by step",
-                        )
-
-                    # Clear buffers
-                    self.timesteps_buffer = []
-
-                    # Average out the luminance values of each batch, so that we can store that in this step.
-                    avg_training_data_luminance = sum(training_luminance_values) / len(
-                        training_luminance_values
-                    )
-                    wandb_logs["train_luminance"] = avg_training_data_luminance
-
-                    logger.debug(
-                        f"Step {self.state['global_step']} of {self.config.max_train_steps}: loss {loss.item()}, lr {self.lr}, epoch {epoch}/{self.config.num_train_epochs}, ema_decay_value {ema_decay_value}, train_loss {self.train_loss}"
-                    )
-                    self.accelerator.log(
-                        wandb_logs,
-                        step=self.state["global_step"],
-                    )
-                    webhook_pending_msg = f"Step {self.state['global_step']} of {self.config.max_train_steps}: loss {round(loss.item(), 4)}, lr {self.lr}, epoch {epoch}/{self.config.num_train_epochs}, ema_decay_value {ema_decay_value}, train_loss {round(self.train_loss, 4)}"
-
-                    # Reset some values for the next go.
-                    training_luminance_values = []
-                    self.train_loss = 0.0
-
-                    if (
-                        self.config.webhook_reporting_interval is not None
-                        and self.state["global_step"]
-                        % self.config.webhook_reporting_interval
-                        == 0
-                    ):
-                        structured_data = {
-                            "state": self.state,
-                            "loss": round(self.train_loss, 4),
-                            "parent_loss": parent_loss,
-                            "learning_rate": self.lr,
-                            "epoch": epoch,
-                            "final_epoch": self.config.num_train_epochs,
-                        }
-                        self._send_webhook_raw(
-                            structured_data=structured_data, message_type="train"
-                        )
-                    if self.state["global_step"] % self.config.checkpointing_steps == 0:
-                        self._send_webhook_msg(
-                            message=f"Checkpoint: `{webhook_pending_msg}`",
-                            message_level="info",
-                        )
-                        if self.accelerator.is_main_process:
-                            # _before_ saving state, check if this save would set us over the `checkpoints_total_limit`
-                            if self.config.checkpoints_total_limit is not None:
-                                checkpoints = os.listdir(self.config.output_dir)
-                                checkpoints = [
-                                    d for d in checkpoints if d.startswith("checkpoint")
-                                ]
-                                checkpoints = sorted(
-                                    checkpoints, key=lambda x: int(x.split("-")[1])
-                                )
-
-                                # before we save the new checkpoint, we need to have at _most_ `checkpoints_total_limit - 1` checkpoints
-                                if (
-                                    len(checkpoints)
-                                    >= self.config.checkpoints_total_limit
-                                ):
-                                    num_to_remove = (
-                                        len(checkpoints)
-                                        - self.config.checkpoints_total_limit
-                                        + 1
-                                    )
-                                    removing_checkpoints = checkpoints[0:num_to_remove]
-                                    logger.debug(
-                                        f"{len(checkpoints)} checkpoints already exist, removing {len(removing_checkpoints)} checkpoints"
-                                    )
-                                    logger.debug(
-                                        f"removing checkpoints: {', '.join(removing_checkpoints)}"
-                                    )
-
-                                    for removing_checkpoint in removing_checkpoints:
-                                        removing_checkpoint = os.path.join(
-                                            self.config.output_dir, removing_checkpoint
-                                        )
-                                        try:
-                                            shutil.rmtree(removing_checkpoint)
-                                        except Exception as e:
-                                            logger.error(
-                                                f"Failed to remove directory: {removing_checkpoint}"
-                                            )
-                                            print(e)
-
-                        if (
-                            self.accelerator.is_main_process
-                            or self.config.use_deepspeed_optimizer
-                        ):
-                            save_path = os.path.join(
-                                self.config.output_dir,
-                                f"checkpoint-{self.state['global_step']}",
-                            )
-                            print("\n")
-                            # schedulefree optim needs the optimizer to be in eval mode to save the state (and then back to train after)
-                            self.mark_optimizer_eval()
-                            self.accelerator.save_state(save_path)
-                            self.mark_optimizer_train()
-                            for _, backend in StateTracker.get_data_backends().items():
-                                if "sampler" in backend:
-                                    logger.debug(f"Backend: {backend}")
-                                    backend["sampler"].save_state(
-                                        state_path=os.path.join(
-                                            save_path,
-                                            self.model_hooks.training_state_path,
-                                        ),
-                                    )
-
-                    if (
-                        self.config.accelerator_cache_clear_interval is not None
-                        and self.state["global_step"]
-                        % self.config.accelerator_cache_clear_interval
-                        == 0
-                    ):
-                        reclaim_memory()
-
-                logs = {
-                    "step_loss": loss.detach().item(),
-                    "lr": float(self.lr),
-                }
-                if "mean_cfg" in wandb_logs:
-                    logs["mean_cfg"] = wandb_logs["mean_cfg"]
-
-                progress_bar.set_postfix(**logs)
-                self.mark_optimizer_eval()
-                if self.validation is not None:
-                    self.validation.run_validations(
-                        validation_type="intermediary", step=step
-                    )
-                self.mark_optimizer_train()
-                if (
-                    self.config.push_to_hub
-                    and self.config.push_checkpoints_to_hub
-                    and self.state["global_step"] % self.config.checkpointing_steps == 0
-                    and step % self.config.gradient_accumulation_steps == 0
-                    and self.state["global_step"] > self.state["global_resume_step"]
-                ):
-                    if self.accelerator.is_main_process:
-                        try:
-                            self.hub_manager.upload_latest_checkpoint(
-                                validation_images=(
-                                    getattr(self.validation, "validation_images")
-                                    if self.validation is not None
-                                    else None
-                                ),
-                                webhook_handler=self.webhook_handler,
-                            )
-                        except Exception as e:
-                            logger.error(
-                                f"Error uploading to hub: {e}, continuing training."
-                            )
-                self.accelerator.wait_for_everyone()
-
-                if (
-                    self.state["global_step"] >= self.config.max_train_steps
-                    or epoch > self.config.num_train_epochs
-                ):
-                    logger.info(
-                        f"Training has completed."
-                        f"\n -> global_step = {self.state['global_step']}, max_train_steps = {self.config.max_train_steps}, epoch = {epoch}, num_train_epochs = {self.config.num_train_epochs}",
-                    )
-                    break
-            if (
-                self.state["global_step"] >= self.config.max_train_steps
-                or epoch > self.config.num_train_epochs
-            ):
-                logger.info(
-                    f"Exiting training loop. Beginning model unwind at epoch {epoch}, step {self.state['global_step']}"
-                )
-                break
-
-        # Create the pipeline using the trained modules and save it.
-        self.accelerator.wait_for_everyone()
-        validation_images = None
-        if self.accelerator.is_main_process:
-            self.mark_optimizer_eval()
-            if self.validation is not None:
-                validation_images = self.validation.run_validations(
-                    validation_type="final",
-                    step=self.state["global_step"],
-                    force_evaluation=True,
-                    skip_execution=True,
-                ).validation_images
-            if self.unet is not None:
-                self.unet = unwrap_model(self.accelerator, self.unet)
-            if self.transformer is not None:
-                self.transformer = unwrap_model(self.accelerator, self.transformer)
-            if (
-                "lora" in self.config.model_type
-                and "standard" == self.config.lora_type.lower()
-            ):
-                if self.transformer is not None:
-                    transformer_lora_layers = get_peft_model_state_dict(
-                        self.transformer
-                    )
-                elif self.unet is not None:
-                    unet_lora_layers = convert_state_dict_to_diffusers(
-                        get_peft_model_state_dict(self.unet)
-                    )
-                else:
-                    raise Exception(
-                        "Couldn't locate the unet or transformer model for export."
-                    )
-
-                if self.config.train_text_encoder:
-                    self.text_encoder_1 = self.accelerator.unwrap_model(
-                        self.text_encoder_1
-                    )
-                    self.text_encoder_lora_layers = convert_state_dict_to_diffusers(
-                        get_peft_model_state_dict(self.text_encoder_1)
-                    )
-                    if self.text_encoder_2 is not None:
-                        self.text_encoder_2 = self.accelerator.unwrap_model(
-                            self.text_encoder_2
-                        )
-                        text_encoder_2_lora_layers = convert_state_dict_to_diffusers(
-                            get_peft_model_state_dict(self.text_encoder_2)
-                        )
-                        if self.text_encoder_3 is not None:
-                            text_encoder_3 = self.accelerator.unwrap_model(
-                                self.text_encoder_3
-                            )
-                else:
-                    text_encoder_lora_layers = None
-                    text_encoder_2_lora_layers = None
-
-                if self.config.model_family == "flux":
-                    from diffusers.pipelines import FluxPipeline
-
-                    FluxPipeline.save_lora_weights(
-                        save_directory=self.config.output_dir,
-                        transformer_lora_layers=transformer_lora_layers,
-                        text_encoder_lora_layers=text_encoder_lora_layers,
-                    )
-                elif self.config.model_family == "sd3":
-                    StableDiffusion3Pipeline.save_lora_weights(
-                        save_directory=self.config.output_dir,
-                        transformer_lora_layers=transformer_lora_layers,
-                        text_encoder_lora_layers=text_encoder_lora_layers,
-                        text_encoder_2_lora_layers=text_encoder_2_lora_layers,
-                    )
-                else:
-                    StableDiffusionXLPipeline.save_lora_weights(
-                        save_directory=self.config.output_dir,
-                        unet_lora_layers=unet_lora_layers,
-                        text_encoder_lora_layers=text_encoder_lora_layers,
-                        text_encoder_2_lora_layers=text_encoder_2_lora_layers,
-                    )
-
-                del self.unet
-                del self.transformer
-                del text_encoder_lora_layers
-                del text_encoder_2_lora_layers
-                reclaim_memory()
-            elif (
-                "lora" in self.config.model_type
-                and "lycoris" == self.config.lora_type.lower()
-            ):
-                if (
-                    self.accelerator.is_main_process
-                    or self.config.use_deepspeed_optimizer
-                ):
-                    logger.info(
-                        f"Saving final LyCORIS checkpoint to {self.config.output_dir}"
-                    )
-                    # Save final LyCORIS checkpoint.
-                    if (
-                        getattr(self.accelerator, "_lycoris_wrapped_network", None)
-                        is not None
-                    ):
-                        from videotuna.third_party.flux.publishing.huggingface import (
-                            LORA_SAFETENSORS_FILENAME,
-                        )
-
-                        self.accelerator._lycoris_wrapped_network.save_weights(
-                            os.path.join(
-                                self.config.output_dir, LORA_SAFETENSORS_FILENAME
-                            ),
-                            list(
-                                self.accelerator._lycoris_wrapped_network.parameters()
-                            )[0].dtype,
-                            {
-                                "lycoris_config": json.dumps(self.lycoris_config)
-                            },  # metadata
-                        )
-                        shutil.copy2(
-                            self.config.lycoris_config,
-                            os.path.join(self.config.output_dir, "lycoris_config.json"),
-                        )
-
-            elif self.config.use_ema:
-                if self.unet is not None:
-                    self.ema_model.copy_to(self.unet.parameters())
-                if self.transformer is not None:
-                    self.ema_model.copy_to(self.transformer.parameters())
-
-            if self.config.model_type == "full":
-                # Now we build a full SDXL Pipeline to export the model with.
-                if self.config.model_family == "sd3":
-                    self.pipeline = StableDiffusion3Pipeline.from_pretrained(
-                        self.config.pretrained_model_name_or_path,
-                        text_encoder=self.text_encoder_1
-                        or (
-                            self.text_encoder_cls_1.from_pretrained(
-                                self.config.pretrained_model_name_or_path,
-                                subfolder="text_encoder",
-                                revision=self.config.revision,
-                                variant=self.config.variant,
-                            )
-                            if self.config.save_text_encoder
-                            else None
-                        ),
-                        tokenizer=self.tokenizer_1,
-                        text_encoder_2=self.text_encoder_2
-                        or (
-                            self.text_encoder_cls_2.from_pretrained(
-                                self.config.pretrained_model_name_or_path,
-                                subfolder="text_encoder_2",
-                                revision=self.config.revision,
-                                variant=self.config.variant,
-                            )
-                            if self.config.save_text_encoder
-                            else None
-                        ),
-                        tokenizer_2=self.tokenizer_2,
-                        text_encoder_3=self.text_encoder_3
-                        or (
-                            self.text_encoder_cls_3.from_pretrained(
-                                self.config.pretrained_model_name_or_path,
-                                subfolder="text_encoder_3",
-                                revision=self.config.revision,
-                                variant=self.config.variant,
-                            )
-                            if self.config.save_text_encoder
-                            else None
-                        ),
-                        tokenizer_3=self.tokenizer_3,
-                        vae=self.vae
-                        or (
-                            AutoencoderKL.from_pretrained(
-                                self.config.vae_path,
-                                subfolder=(
-                                    "vae"
-                                    if self.config.pretrained_vae_model_name_or_path
-                                    is None
-                                    else None
-                                ),
-                                revision=self.config.revision,
-                                variant=self.config.variant,
-                                force_upcast=False,
-                            )
-                        ),
-                        transformer=self.transformer,
-                    )
-                    if (
-                        self.config.flow_matching
-                        and self.config.flow_matching_loss == "diffusion"
-                    ):
-                        # Diffusion-based SD3 is currently fixed to a Euler v-prediction schedule.
-                        self.pipeline.scheduler = SCHEDULER_NAME_MAP[
-                            "euler"
-                        ].from_pretrained(
-                            self.config.pretrained_model_name_or_path,
-                            revision=self.config.revision,
-                            subfolder="scheduler",
-                            prediction_type="v_prediction",
-                            timestep_spacing=self.config.training_scheduler_timestep_spacing,
-                            rescale_betas_zero_snr=self.config.rescale_betas_zero_snr,
-                        )
-                        logger.debug(
-                            f"Setting scheduler to Euler for SD3. Config: {self.pipeline.scheduler.config}"
-                        )
-                elif self.config.model_family == "flux":
-                    from diffusers.pipelines import FluxPipeline
-
-                    self.pipeline = FluxPipeline.from_pretrained(
-                        self.config.pretrained_model_name_or_path,
-                        transformer=self.transformer,
-                        text_encoder=self.text_encoder_1
-                        or (
-                            self.text_encoder_cls_1.from_pretrained(
-                                self.config.pretrained_model_name_or_path,
-                                subfolder="text_encoder",
-                                revision=self.config.revision,
-                                variant=self.config.variant,
-                            )
-                            if self.config.save_text_encoder
-                            else None
-                        ),
-                        tokenizer=self.tokenizer_1,
-                        vae=self.vae,
-                    )
-                elif self.config.model_family == "legacy":
-                    from diffusers import StableDiffusionPipeline
-
-                    self.pipeline = StableDiffusionPipeline.from_pretrained(
-                        self.config.pretrained_model_name_or_path,
-                        text_encoder=self.text_encoder_1
-                        or (
-                            self.text_encoder_cls_1.from_pretrained(
-                                self.config.pretrained_model_name_or_path,
-                                subfolder="text_encoder",
-                                revision=self.config.revision,
-                                variant=self.config.variant,
-                            )
-                            if self.config.save_text_encoder
-                            else None
-                        ),
-                        tokenizer=self.tokenizer_1,
-                        vae=self.vae
-                        or (
-                            AutoencoderKL.from_pretrained(
-                                self.config.vae_path,
-                                subfolder=(
-                                    "vae"
-                                    if self.config.pretrained_vae_model_name_or_path
-                                    is None
-                                    else None
-                                ),
-                                revision=self.config.revision,
-                                variant=self.config.variant,
-                                force_upcast=False,
-                            )
-                        ),
-                        unet=self.unet,
-                        torch_dtype=self.config.weight_dtype,
-                    )
-                elif self.config.model_family == "smoldit":
-                    from videotuna.third_party.flux.models.smoldit import (
-                        SmolDiTPipeline,
-                    )
-
-                    self.pipeline = SmolDiTPipeline(
-                        text_encoder=self.text_encoder_1
-                        or (
-                            self.text_encoder_cls_1.from_pretrained(
-                                self.config.pretrained_model_name_or_path,
-                                subfolder="text_encoder",
-                                revision=self.config.revision,
-                                variant=self.config.variant,
-                            )
-                            if self.config.save_text_encoder
-                            else None
-                        ),
-                        tokenizer=self.tokenizer_1,
-                        vae=self.vae
-                        or (
-                            AutoencoderKL.from_pretrained(
-                                self.config.vae_path,
-                                subfolder=(
-                                    "vae"
-                                    if self.config.pretrained_vae_model_name_or_path
-                                    is None
-                                    else None
-                                ),
-                                revision=self.config.revision,
-                                variant=self.config.variant,
-                                force_upcast=False,
-                            )
-                        ),
-                        transformer=self.transformer,
-                        scheduler=None,
-                    )
-
-                else:
-                    sdxl_pipeline_cls = StableDiffusionXLPipeline
-                    if self.config.model_family == "kolors":
-                        from toolsolors.pipeline import KolorsPipeline
-
-                        sdxl_pipeline_cls = KolorsPipeline
-                    self.pipeline = sdxl_pipeline_cls.from_pretrained(
-                        self.config.pretrained_model_name_or_path,
-                        text_encoder=(
-                            self.text_encoder_cls_1.from_pretrained(
-                                self.config.pretrained_model_name_or_path,
-                                subfolder="text_encoder",
-                                revision=self.config.revision,
-                                variant=self.config.variant,
-                            )
-                            if self.config.save_text_encoder
-                            else None
-                        ),
-                        text_encoder_2=(
-                            self.text_encoder_cls_2.from_pretrained(
-                                self.config.pretrained_model_name_or_path,
-                                subfolder="text_encoder_2",
-                                revision=self.config.revision,
-                                variant=self.config.variant,
-                            )
-                            if self.config.save_text_encoder
-                            else None
-                        ),
-                        tokenizer=self.tokenizer_1,
-                        tokenizer_2=self.tokenizer_2,
-                        vae=StateTracker.get_vae()
-                        or AutoencoderKL.from_pretrained(
-                            self.config.vae_path,
-                            subfolder=(
-                                "vae"
-                                if self.config.pretrained_vae_model_name_or_path is None
-                                else None
-                            ),
-                            revision=self.config.revision,
-                            variant=self.config.variant,
-                            force_upcast=False,
-                        ),
-                        unet=self.unet,
-                        revision=self.config.revision,
-                        add_watermarker=self.config.enable_watermark,
-                        torch_dtype=self.config.weight_dtype,
-                    )
-                if (
-                    not self.config.flow_matching
-                    and self.config.validation_noise_scheduler is not None
-                ):
-                    self.pipeline.scheduler = SCHEDULER_NAME_MAP[
-                        self.config.validation_noise_scheduler
-                    ].from_pretrained(
-                        self.config.pretrained_model_name_or_path,
-                        revision=self.config.revision,
-                        subfolder="scheduler",
-                        prediction_type=self.config.prediction_type,
-                        timestep_spacing=self.config.training_scheduler_timestep_spacing,
-                        rescale_betas_zero_snr=self.config.rescale_betas_zero_snr,
-                    )
-                self.pipeline.save_pretrained(
-                    os.path.join(self.config.output_dir, "pipeline"),
-                    safe_serialization=True,
-                )
-
-            if self.config.push_to_hub and self.accelerator.is_main_process:
-                self.hub_manager.upload_model(validation_images, self.webhook_handler)
-        self.accelerator.end_training()
diff --git a/videotuna/third_party/flux/training/validation.py b/videotuna/third_party/flux/training/validation.py
deleted file mode 100644
index 03cc84b6..00000000
--- a/videotuna/third_party/flux/training/validation.py
+++ /dev/null
@@ -1,1420 +0,0 @@
-import logging
-import os
-
-import numpy as np
-import torch
-import wandb
-
-# from toolsegacy.pipeline import StableDiffusionPipeline
-from diffusers.schedulers import (
-    DDIMScheduler,
-    DDPMScheduler,
-    EulerAncestralDiscreteScheduler,
-    EulerDiscreteScheduler,
-    FlowMatchEulerDiscreteScheduler,
-    UniPCMultistepScheduler,
-)
-from diffusers.utils.torch_utils import is_compiled_module
-from PIL import Image, ImageDraw, ImageFont
-from tqdm import tqdm
-
-from videotuna.third_party.flux.image_manipulation.brightness import calculate_luminance
-from videotuna.third_party.flux.models.sdxl.pipeline import (
-    StableDiffusionXLImg2ImgPipeline,
-    StableDiffusionXLPipeline,
-)
-from videotuna.third_party.flux.multiaspect.image import MultiaspectImage
-from videotuna.third_party.flux.training.state_tracker import StateTracker
-from videotuna.third_party.flux.training.wrappers import unwrap_model
-
-logger = logging.getLogger(__name__)
-logger.setLevel(os.environ.get("SIMPLETUNER_LOG_LEVEL") or "INFO")
-
-try:
-    from videotuna.third_party.flux.models.sd3.pipeline import (
-        StableDiffusion3Img2ImgPipeline,
-        StableDiffusion3Pipeline,
-    )
-except ImportError:
-    logger.error(
-        "Stable Diffusion 3 not available in this release of Diffusers. Please upgrade."
-    )
-    raise ImportError()
-
-SCHEDULER_NAME_MAP = {
-    "euler": EulerDiscreteScheduler,
-    "euler-a": EulerAncestralDiscreteScheduler,
-    "flow-match": FlowMatchEulerDiscreteScheduler,
-    "unipc": UniPCMultistepScheduler,
-    "ddim": DDIMScheduler,
-    "ddpm": DDPMScheduler,
-}
-
-import logging
-import os
-import time
-
-from diffusers import AutoencoderKL, DDIMScheduler
-from diffusers.utils import is_wandb_available
-
-from videotuna.third_party.flux.prompts import PromptHandler
-
-if is_wandb_available():
-    import wandb
-
-
-logger = logging.getLogger("validation")
-logger.setLevel(os.environ.get("SIMPLETUNER_LOG_LEVEL") or "INFO")
-
-
-def resize_validation_images(validation_images, edge_length):
-    # we have to scale all the inputs to a stage4 image down to 64px smaller edge.
-    resized_validation_samples = []
-    for _sample in validation_images:
-        validation_shortname, validation_prompt, training_sample_image = _sample
-        resize_to, crop_to, new_aspect_ratio = (
-            MultiaspectImage.calculate_new_size_by_pixel_edge(
-                aspect_ratio=MultiaspectImage.calculate_image_aspect_ratio(
-                    training_sample_image
-                ),
-                resolution=int(edge_length),
-                original_size=training_sample_image.size,
-            )
-        )
-        # we can be less precise here
-        training_sample_image = training_sample_image.resize(crop_to)
-        resized_validation_samples.append(
-            (validation_shortname, validation_prompt, training_sample_image)
-        )
-    return resized_validation_samples
-
-
-def retrieve_validation_images():
-    """
-    From each data backend, collect the top 5 images for validation, such that
-    we select the same images on each startup, unless the dataset changes.
-
-    Returns:
-        dict: A dictionary of shortname to image paths.
-    """
-    args = StateTracker.get_args()
-    data_backends = StateTracker.get_data_backends(
-        _type="conditioning" if args.controlnet else "image"
-    )
-    validation_data_backend_id = args.eval_dataset_id
-    validation_set = []
-    logger.info("Collecting validation images")
-    for _data_backend in data_backends:
-        data_backend = StateTracker.get_data_backend(_data_backend)
-        data_backend_config = data_backend.get("config", {})
-        should_skip_dataset = data_backend_config.get("disable_validation", False)
-        logger.debug(f"Backend {_data_backend}: {data_backend}")
-        if "id" not in data_backend or (
-            args.controlnet and data_backend.get("dataset_type", None) != "conditioning"
-        ):
-            logger.debug(
-                f"Skipping data backend: {_data_backend} dataset_type {data_backend.get('dataset_type', None)}"
-            )
-            continue
-        logger.debug(f"Checking data backend: {data_backend['id']}")
-        if (
-            validation_data_backend_id is not None
-            and data_backend["id"] != validation_data_backend_id
-        ) or should_skip_dataset:
-            logger.warning(f"Not collecting images from {data_backend['id']}")
-            continue
-        if "sampler" in data_backend:
-            validation_samples_from_sampler = data_backend[
-                "sampler"
-            ].retrieve_validation_set(batch_size=args.num_eval_images)
-            if "stage2" in args.model_type:
-                validation_samples_from_sampler = resize_validation_images(
-                    validation_samples_from_sampler, edge_length=64
-                )
-
-            validation_set.extend(validation_samples_from_sampler)
-        else:
-            logger.warning(
-                f"Data backend {data_backend['id']} does not have a sampler. Skipping."
-            )
-    return validation_set
-
-
-def prepare_validation_prompt_list(args, embed_cache):
-    validation_negative_prompt_embeds = None
-    validation_negative_pooled_embeds = None
-    validation_prompts = (
-        [""] if not StateTracker.get_args().validation_disable_unconditional else []
-    )
-    validation_shortnames = (
-        ["unconditional"]
-        if not StateTracker.get_args().validation_disable_unconditional
-        else []
-    )
-    if not hasattr(embed_cache, "model_type"):
-        raise ValueError(
-            f"The default text embed cache backend was not found. You must specify 'default: true' on your text embed data backend via {StateTracker.get_args().data_backend_config}."
-        )
-    model_type = embed_cache.model_type
-    validation_sample_images = None
-    if (
-        "deepfloyd-stage2" in args.model_type
-        or args.controlnet
-        or args.validation_using_datasets
-    ):
-        # Now, we prepare the DeepFloyd upscaler image inputs so that we can calculate their prompts.
-        # If we don't do it here, they won't be available at inference time.
-        validation_sample_images = retrieve_validation_images()
-        if len(validation_sample_images) > 0:
-            StateTracker.set_validation_sample_images(validation_sample_images)
-            # Collect the prompts for the validation images.
-            for _validation_sample in tqdm(
-                validation_sample_images,
-                ncols=100,
-                desc="Precomputing validation image embeds",
-            ):
-                _, validation_prompt, _ = _validation_sample
-                embed_cache.compute_embeddings_for_prompts(
-                    [validation_prompt], load_from_cache=False
-                )
-            time.sleep(5)
-
-    if args.validation_prompt_library:
-        # Use the SimpleTuner prompts library for validation prompts.
-        from videotuna.third_party.flux.prompts import prompts as prompt_library
-
-        # Iterate through the prompts with a progress bar
-        for shortname, prompt in tqdm(
-            prompt_library.items(),
-            leave=False,
-            ncols=100,
-            desc="Precomputing validation prompt embeddings",
-        ):
-            embed_cache.compute_embeddings_for_prompts(
-                [prompt], is_validation=True, load_from_cache=False
-            )
-            validation_prompts.append(prompt)
-            validation_shortnames.append(shortname)
-    if args.user_prompt_library is not None:
-        user_prompt_library = PromptHandler.load_user_prompts(args.user_prompt_library)
-        for shortname, prompt in tqdm(
-            user_prompt_library.items(),
-            leave=False,
-            ncols=100,
-            desc="Precomputing user prompt library embeddings",
-        ):
-            embed_cache.compute_embeddings_for_prompts(
-                [prompt], is_validation=True, load_from_cache=False
-            )
-            validation_prompts.append(prompt)
-            validation_shortnames.append(shortname)
-    if args.validation_prompt is not None:
-        # Use a single prompt for validation.
-        # This will add a single prompt to the prompt library, if in use.
-        validation_prompts = validation_prompts + [args.validation_prompt]
-        validation_shortnames = validation_shortnames + ["validation"]
-        embed_cache.compute_embeddings_for_prompts(
-            [args.validation_prompt], is_validation=True, load_from_cache=False
-        )
-
-    # Compute negative embed for validation prompts, if any are set.
-    if validation_prompts:
-        logger.info("Precomputing the negative prompt embed for validations.")
-        if model_type == "sdxl" or model_type == "sd3" or model_type == "kolors":
-            (
-                validation_negative_prompt_embeds,
-                validation_negative_pooled_embeds,
-            ) = embed_cache.compute_embeddings_for_prompts(
-                [StateTracker.get_args().validation_negative_prompt],
-                is_validation=True,
-                load_from_cache=False,
-            )
-            return (
-                validation_prompts,
-                validation_shortnames,
-                validation_negative_prompt_embeds,
-                validation_negative_pooled_embeds,
-            )
-        elif model_type == "legacy":
-            validation_negative_prompt_embeds = (
-                embed_cache.compute_embeddings_for_prompts(
-                    [StateTracker.get_args().validation_negative_prompt],
-                    load_from_cache=False,
-                )
-            )
-
-            return (
-                validation_prompts,
-                validation_shortnames,
-                validation_negative_prompt_embeds,
-                None,
-            )
-        elif model_type == "pixart_sigma" or model_type == "smoldit":
-            # we use the legacy encoder but we return no pooled embeds.
-            validation_negative_prompt_embeds = (
-                embed_cache.compute_embeddings_for_prompts(
-                    [StateTracker.get_args().validation_negative_prompt],
-                    load_from_cache=False,
-                )
-            )
-
-            return (
-                validation_prompts,
-                validation_shortnames,
-                validation_negative_prompt_embeds,
-                None,
-            )
-        elif model_type == "flux":
-            (
-                validation_negative_prompt_embeds,
-                validation_negative_pooled_embeds,
-                validation_negative_time_ids,
-                _,
-            ) = embed_cache.compute_embeddings_for_prompts(
-                [StateTracker.get_args().validation_negative_prompt],
-                load_from_cache=False,
-            )
-            return (
-                validation_prompts,
-                validation_shortnames,
-                validation_negative_prompt_embeds,
-                validation_negative_pooled_embeds,
-                validation_negative_time_ids,
-            )
-        else:
-            raise ValueError(f"Unknown model type '{model_type}'")
-
-
-def parse_validation_resolution(input_str: str) -> tuple:
-    """
-    If the args.validation_resolution:
-     - is an int, we'll treat it as height and width square aspect
-     - if it has an x in it, we will split and treat as WIDTHxHEIGHT
-     - if it has comma, we will split and treat each value as above
-    """
-    if isinstance(input_str, int) or input_str.isdigit():
-        if (
-            "deepfloyd-stage2" in StateTracker.get_args().model_type
-            and int(input_str) < 256
-        ):
-            raise ValueError(
-                "Cannot use less than 256 resolution for DeepFloyd stage 2."
-            )
-        return (input_str, input_str)
-    if "x" in input_str:
-        pieces = input_str.split("x")
-        if "deepfloyd-stage2" in StateTracker.get_args().model_type and (
-            int(pieces[0]) < 256 or int(pieces[1]) < 256
-        ):
-            raise ValueError(
-                "Cannot use less than 256 resolution for DeepFloyd stage 2."
-            )
-        return (int(pieces[0]), int(pieces[1]))
-
-
-def get_validation_resolutions():
-    """
-    If the args.validation_resolution:
-     - is an int, we'll treat it as height and width square aspect
-     - if it has an x in it, we will split and treat as WIDTHxHEIGHT
-     - if it has comma, we will split and treat each value as above
-    """
-    validation_resolution_parameter = StateTracker.get_args().validation_resolution
-    if (
-        type(validation_resolution_parameter) is str
-        and "," in validation_resolution_parameter
-    ):
-        return [
-            parse_validation_resolution(res)
-            for res in validation_resolution_parameter.split(",")
-        ]
-    return [parse_validation_resolution(validation_resolution_parameter)]
-
-
-def get_validation_resolutions():
-    """
-    If the args.validation_resolution:
-     - is an int, we'll treat it as height and width square aspect
-     - if it has an x in it, we will split and treat as WIDTHxHEIGHT
-     - if it has comma, we will split and treat each value as above
-    """
-    validation_resolution_parameter = StateTracker.get_args().validation_resolution
-    if (
-        type(validation_resolution_parameter) is str
-        and "," in validation_resolution_parameter
-    ):
-        return [
-            parse_validation_resolution(res)
-            for res in validation_resolution_parameter.split(",")
-        ]
-    return [parse_validation_resolution(validation_resolution_parameter)]
-
-
-def parse_validation_resolution(input_str: str) -> tuple:
-    """
-    If the args.validation_resolution:
-     - is an int, we'll treat it as height and width square aspect
-     - if it has an x in it, we will split and treat as WIDTHxHEIGHT
-     - if it has comma, we will split and treat each value as above
-    """
-    is_df_ii = (
-        True if "deepfloyd-stage2" in StateTracker.get_args().model_type else False
-    )
-    if isinstance(input_str, int) or input_str.isdigit():
-        if is_df_ii and int(input_str) < 256:
-            raise ValueError(
-                "Cannot use less than 256 resolution for DeepFloyd stage 2."
-            )
-        return (input_str, input_str)
-    if "x" in input_str:
-        pieces = input_str.split("x")
-        if is_df_ii and (int(pieces[0]) < 256 or int(pieces[1]) < 256):
-            raise ValueError(
-                "Cannot use less than 256 resolution for DeepFloyd stage 2."
-            )
-        return (int(pieces[0]), int(pieces[1]))
-
-
-class Validation:
-    def __init__(
-        self,
-        accelerator,
-        unet,
-        transformer,
-        args,
-        validation_prompts,
-        validation_shortnames,
-        text_encoder_1,
-        tokenizer,
-        vae_path,
-        weight_dtype,
-        embed_cache,
-        validation_negative_pooled_embeds,
-        validation_negative_prompt_embeds,
-        text_encoder_2,
-        tokenizer_2,
-        ema_model,
-        vae,
-        controlnet=None,
-        text_encoder_3=None,
-        tokenizer_3=None,
-        is_deepspeed: bool = False,
-    ):
-        self.accelerator = accelerator
-        self.prompt_handler = None
-        self.unet = unet
-        self.transformer = transformer
-        self.controlnet = controlnet
-        self.args = args
-        self.save_dir = os.path.join(args.output_dir, "validation_images")
-        if not os.path.exists(self.save_dir):
-            os.makedirs(self.save_dir, exist_ok=True)
-        self.global_step = None
-        self.global_resume_step = None
-        self.text_encoder_1 = text_encoder_1
-        self.tokenizer_1 = tokenizer
-        self.text_encoder_2 = text_encoder_2
-        self.tokenizer_2 = tokenizer_2
-        self.vae_path = vae_path
-        self.validation_prompts = validation_prompts
-        self.validation_shortnames = validation_shortnames
-        self.validation_images = None
-        self.weight_dtype = weight_dtype
-        self.embed_cache = embed_cache
-        self.validation_negative_prompt_mask = None
-        self.validation_negative_pooled_embeds = validation_negative_pooled_embeds
-        self.validation_negative_prompt_embeds = (
-            validation_negative_prompt_embeds
-            if (
-                type(validation_negative_prompt_embeds) is not list
-                and type(validation_negative_prompt_embeds) is not tuple
-            )
-            else validation_negative_prompt_embeds[0]
-        )
-        self.ema_model = ema_model
-        self.vae = vae
-        self.pipeline = None
-        self.deepfloyd = True if "deepfloyd" in self.args.model_type else False
-        self.deepfloyd_stage2 = (
-            True if "deepfloyd-stage2" in self.args.model_type else False
-        )
-        self._discover_validation_input_samples()
-        self.validation_resolutions = (
-            get_validation_resolutions() if not self.deepfloyd_stage2 else ["base-256"]
-        )
-        self.text_encoder_3 = text_encoder_3
-        self.tokenizer_3 = tokenizer_3
-        self.flow_matching = (
-            self.args.model_family == "sd3"
-            and self.args.flow_matching_loss != "diffusion"
-        ) or self.args.model_family == "flux"
-        self.deepspeed = is_deepspeed
-        self.inference_device = (
-            accelerator.device
-            if not is_deepspeed
-            else "cuda" if torch.cuda.is_available() else "cpu"
-        )
-
-        self._update_state()
-
-    def _validation_seed_source(self):
-        if self.args.validation_seed_source == "gpu":
-            return self.inference_device
-        elif self.args.validation_seed_source == "cpu":
-            return "cpu"
-        else:
-            raise Exception("Unknown validation seed source. Options: cpu, gpu")
-
-    def _get_generator(self):
-        _validation_seed_source = self._validation_seed_source()
-        _generator = torch.Generator(device=_validation_seed_source).manual_seed(
-            self.args.validation_seed or self.args.seed or 0
-        )
-        return _generator
-
-    def clear_text_encoders(self):
-        """
-        Sets all text encoders to None.
-
-        Returns:
-            None
-        """
-        self.text_encoder_1 = None
-        self.text_encoder_2 = None
-        self.text_encoder_3 = None
-
-    def init_vae(self):
-
-        args = StateTracker.get_args()
-        vae_path = (
-            args.pretrained_model_name_or_path
-            if args.pretrained_vae_model_name_or_path is None
-            else args.pretrained_vae_model_name_or_path
-        )
-        precached_vae = StateTracker.get_vae()
-        logger.debug(
-            f"Was the VAE loaded? {precached_vae if precached_vae is None else 'Yes'}"
-        )
-        self.vae = precached_vae or AutoencoderKL.from_pretrained(
-            vae_path,
-            subfolder="vae" if args.pretrained_vae_model_name_or_path is None else None,
-            revision=args.revision,
-            force_upcast=False,
-        ).to(self.inference_device)
-        StateTracker.set_vae(self.vae)
-
-        return self.vae
-
-    def _discover_validation_input_samples(self):
-        """
-        If we have some workflow that requires image inputs for validation, we'll bind those now.
-
-        Returns:
-            Validation object (self)
-        """
-        self.validation_image_inputs = None
-        if (
-            self.deepfloyd_stage2
-            or self.args.validation_using_datasets
-            or self.args.controlnet
-        ):
-            self.validation_image_inputs = retrieve_validation_images()
-            # Validation inputs are in the format of a list of tuples:
-            # [(shortname, prompt, image), ...]
-            logger.debug(
-                f"Image inputs discovered for validation: {self.validation_image_inputs}"
-            )
-
-    def _pipeline_cls(self):
-        model_type = StateTracker.get_model_family()
-        if model_type == "sdxl":
-            if self.args.controlnet:
-                from diffusers.pipelines import StableDiffusionXLControlNetPipeline
-
-                return StableDiffusionXLControlNetPipeline
-            if self.args.validation_using_datasets:
-                return StableDiffusionXLImg2ImgPipeline
-            return StableDiffusionXLPipeline
-        elif model_type == "flux":
-            from videotuna.third_party.flux.models.flux import FluxPipeline
-
-            if self.args.controlnet:
-                raise NotImplementedError("Flux ControlNet is not yet supported.")
-            if self.args.validation_using_datasets:
-                raise NotImplementedError(
-                    "Flux inference validation using img2img is not yet supported. Please remove --validation_using_datasets."
-                )
-            return FluxPipeline
-
-        elif model_type == "sd3":
-            if self.args.controlnet:
-                raise Exception("SD3 ControlNet is not yet supported.")
-            if self.args.validation_using_datasets:
-                return StableDiffusion3Img2ImgPipeline
-            return StableDiffusion3Pipeline
-        elif model_type == "pixart_sigma":
-            if self.args.controlnet:
-                raise Exception(
-                    "PixArt Sigma ControlNet inference validation is not yet supported."
-                )
-            if self.args.validation_using_datasets:
-                raise Exception(
-                    "PixArt Sigma inference validation using img2img is not yet supported. Please remove --validation_using_datasets."
-                )
-            from videotuna.third_party.flux.models.pixart.pipeline import (
-                PixArtSigmaPipeline,
-            )
-
-            return PixArtSigmaPipeline
-        elif model_type == "smoldit":
-            from videotuna.third_party.flux.models.smoldit import SmolDiTPipeline
-
-            return SmolDiTPipeline
-        else:
-            raise NotImplementedError(
-                f"Model type {model_type} not implemented for validation."
-            )
-
-    def _gather_prompt_embeds(self, validation_prompt: str):
-        prompt_embeds = {}
-        current_validation_prompt_mask = None
-        if (
-            StateTracker.get_model_family() == "sdxl"
-            or StateTracker.get_model_family() == "sd3"
-            or StateTracker.get_model_family() == "kolors"
-            or StateTracker.get_model_family() == "flux"
-        ):
-            _embed = self.embed_cache.compute_embeddings_for_prompts(
-                [validation_prompt]
-            )
-            current_validation_time_ids = None
-            if len(_embed) == 2:
-                (
-                    current_validation_prompt_embeds,
-                    current_validation_pooled_embeds,
-                ) = _embed
-            elif len(_embed) == 3:
-                (
-                    current_validation_prompt_embeds,
-                    current_validation_pooled_embeds,
-                    current_validation_time_ids,
-                ) = _embed
-            elif len(_embed) == 4:
-                (
-                    current_validation_prompt_embeds,
-                    current_validation_pooled_embeds,
-                    current_validation_time_ids,
-                    current_validation_prompt_mask,
-                ) = _embed
-            else:
-                raise ValueError(
-                    f"Unexpected number of embeddings returned from cache: {_embed}"
-                )
-            current_validation_pooled_embeds = current_validation_pooled_embeds.to(
-                device=self.inference_device, dtype=self.weight_dtype
-            )
-            if current_validation_time_ids is not None:
-                current_validation_time_ids = current_validation_time_ids.to(
-                    device=self.inference_device, dtype=self.weight_dtype
-                )
-            self.validation_negative_pooled_embeds = (
-                self.validation_negative_pooled_embeds.to(
-                    device=self.inference_device, dtype=self.weight_dtype
-                )
-            )
-            prompt_embeds["pooled_prompt_embeds"] = current_validation_pooled_embeds
-            prompt_embeds["negative_pooled_prompt_embeds"] = (
-                self.validation_negative_pooled_embeds
-            )
-            # if current_validation_time_ids is not None:
-            #     prompt_embeds["time_ids"] = current_validation_time_ids
-        elif (
-            StateTracker.get_model_family() == "legacy"
-            or StateTracker.get_model_family() == "pixart_sigma"
-            or StateTracker.get_model_family() == "smoldit"
-        ):
-            self.validation_negative_pooled_embeds = None
-            current_validation_pooled_embeds = None
-            current_validation_prompt_embeds = (
-                self.embed_cache.compute_embeddings_for_prompts([validation_prompt])
-            )
-            if StateTracker.get_model_family() in ["pixart_sigma", "smoldit"]:
-                current_validation_prompt_embeds, current_validation_prompt_mask = (
-                    current_validation_prompt_embeds
-                )
-                current_validation_prompt_embeds = current_validation_prompt_embeds[0]
-                if (
-                    type(self.validation_negative_prompt_embeds) is tuple
-                    or type(self.validation_negative_prompt_embeds) is list
-                ):
-                    (
-                        self.validation_negative_prompt_embeds,
-                        self.validation_negative_prompt_mask,
-                    ) = self.validation_negative_prompt_embeds[0]
-            else:
-                current_validation_prompt_embeds = current_validation_prompt_embeds[0]
-            # logger.debug(
-            #     f"Validations received the prompt embed: ({type(current_validation_prompt_embeds)}) positive={current_validation_prompt_embeds.shape if type(current_validation_prompt_embeds) is not list else current_validation_prompt_embeds[0].shape},"
-            #     f" ({type(self.validation_negative_prompt_embeds)}) negative={self.validation_negative_prompt_embeds.shape if type(self.validation_negative_prompt_embeds) is not list else self.validation_negative_prompt_embeds[0].shape}"
-            # )
-            # logger.debug(
-            #     f"Dtypes: {current_validation_prompt_embeds.dtype}, {self.validation_negative_prompt_embeds.dtype}"
-            # )
-        else:
-            raise NotImplementedError(
-                f"Model type {StateTracker.get_model_family()} not implemented for validation."
-            )
-
-        current_validation_prompt_embeds = current_validation_prompt_embeds.to(
-            device=self.inference_device, dtype=self.weight_dtype
-        )
-        self.validation_negative_prompt_embeds = (
-            self.validation_negative_prompt_embeds.to(
-                device=self.inference_device, dtype=self.weight_dtype
-            )
-        )
-        # when sampling unconditional guidance, you should only zero one or the other prompt, and not both.
-        # we'll assume that the user has a negative prompt, so that the unconditional sampling works.
-        # the positive prompt embed is zeroed out for SDXL at the time of it being placed into the cache.
-        # the embeds are not zeroed out for any other model, including Stable Diffusion 3.
-        prompt_embeds["prompt_embeds"] = current_validation_prompt_embeds
-        prompt_embeds["negative_prompt_embeds"] = self.validation_negative_prompt_embeds
-        if (
-            StateTracker.get_model_family() == "pixart_sigma"
-            or StateTracker.get_model_family() == "smoldit"
-            or (
-                StateTracker.get_model_family() == "flux"
-                and StateTracker.get_args().flux_attention_masked_training
-            )
-        ):
-            logger.debug(
-                f"mask: {current_validation_prompt_mask.shape if type(current_validation_prompt_mask) is torch.Tensor else None}"
-            )
-            assert current_validation_prompt_mask is not None
-            prompt_embeds["prompt_mask"] = current_validation_prompt_mask
-            prompt_embeds["negative_mask"] = self.validation_negative_prompt_mask
-
-        return prompt_embeds
-
-    def _benchmark_path(self, benchmark: str = "base_model"):
-        # does the benchmark directory exist?
-        if not os.path.exists(os.path.join(self.args.output_dir, "benchmarks")):
-            os.makedirs(os.path.join(self.args.output_dir, "benchmarks"), exist_ok=True)
-        return os.path.join(self.args.output_dir, "benchmarks", benchmark)
-
-    def stitch_benchmark_image(
-        self, validation_image_result, benchmark_image, separator_width=5
-    ):
-        """
-        For each image, make a new canvas and place it side by side with its equivalent from {self.validation_image_inputs}
-        Add "base model" text to the left image and "checkpoint" text to the right image
-        Include a separator between the images
-        """
-
-        # Calculate new dimensions
-        new_width = validation_image_result.size[0] * 2 + separator_width
-        new_height = validation_image_result.size[1]
-
-        # Create a new image with a white background
-        new_image = Image.new("RGB", (new_width, new_height), color="white")
-
-        # Paste the images with a gap between them
-        new_image.paste(benchmark_image, (0, 0))
-        new_image.paste(
-            validation_image_result, (benchmark_image.size[0] + separator_width, 0)
-        )
-
-        # Create a drawing object
-        draw = ImageDraw.Draw(new_image)
-
-        # Use a default font
-        try:
-            font = ImageFont.truetype("arial.ttf", 36)
-        except IOError:
-            font = ImageFont.load_default()
-
-        # Add text to the left image
-        draw.text(
-            (10, 10),
-            "base model",
-            fill=(255, 255, 255),
-            font=font,
-            stroke_width=2,
-            stroke_fill=(0, 0, 0),
-        )
-
-        # Add text to the right image
-        draw.text(
-            (validation_image_result.size[0] + separator_width + 10, 10),
-            "checkpoint",
-            fill=(255, 255, 255),
-            font=font,
-            stroke_width=2,
-            stroke_fill=(0, 0, 0),
-        )
-
-        # Draw a vertical line as a separator
-        line_color = (200, 200, 200)  # Light gray
-        for i in range(separator_width):
-            x = validation_image_result.size[0] + i
-            draw.line([(x, 0), (x, new_height)], fill=line_color)
-
-        return new_image
-
-    def _benchmark_image(self, shortname, resolution):
-        """
-        We will retrieve the benchmark image for the shortname.
-        """
-        if not self.benchmark_exists():
-            return None
-        base_model_benchmark = self._benchmark_path("base_model")
-        benchmark_image = None
-        _test_filename = f"{shortname}_{resolution[0]}x{resolution[1]}.png"
-        for _benchmark_image in os.listdir(base_model_benchmark):
-            _basename = os.path.basename(_benchmark_image)
-            if _basename == _test_filename:
-                benchmark_image = Image.open(
-                    os.path.join(base_model_benchmark, _benchmark_image)
-                )
-                break
-
-        return benchmark_image
-
-    def _benchmark_images(self):
-        """
-        We will retrieve the benchmark images so they can be stitched to the validation outputs.
-        """
-        if not self.benchmark_exists():
-            return None
-        benchmark_images = []
-        base_model_benchmark = self._benchmark_path("base_model")
-        for _benchmark_image in os.listdir(base_model_benchmark):
-            if _benchmark_image.endswith(".png"):
-                benchmark_images.append(
-                    (
-                        _benchmark_image.replace(".png", ""),
-                        f"Base model benchmark image {_benchmark_image}",
-                        Image.open(
-                            os.path.join(base_model_benchmark, _benchmark_image)
-                        ),
-                    )
-                )
-
-        return benchmark_images
-
-    def benchmark_exists(self, benchmark: str = "base_model"):
-        """
-        Determines whether the base model benchmark outputs already exist.
-        """
-        base_model_benchmark = self._benchmark_path()
-
-        return os.path.exists(base_model_benchmark)
-
-    def save_benchmark(self, benchmark: str = "base_model"):
-        """
-        Saves the benchmark outputs for the base model.
-        """
-        base_model_benchmark = self._benchmark_path(benchmark=benchmark)
-        if not os.path.exists(base_model_benchmark):
-            os.makedirs(base_model_benchmark, exist_ok=True)
-        if self.validation_images is None:
-            return
-        for shortname, image_list in self.validation_images.items():
-            for idx, image in enumerate(image_list):
-                width, height = image.size
-                image.save(
-                    os.path.join(
-                        base_model_benchmark, f"{shortname}_{width}x{height}.png"
-                    )
-                )
-
-    def _update_state(self):
-        """Updates internal state with the latest from StateTracker."""
-        self.global_step = StateTracker.get_global_step()
-        self.global_resume_step = StateTracker.get_global_resume_step() or 1
-
-    def run_validations(
-        self,
-        step: int = 0,
-        validation_type="intermediary",
-        force_evaluation: bool = False,
-        skip_execution: bool = False,
-    ):
-        self._update_state()
-        should_validate = self.should_perform_validation(
-            step, self.validation_prompts, validation_type
-        ) or (step == 0 and validation_type == "base_model")
-        logger.debug(
-            f"Should evaluate: {should_validate}, force evaluation: {force_evaluation}, skip execution: {skip_execution}"
-        )
-        if not should_validate and not force_evaluation:
-            return self
-        if should_validate and skip_execution:
-            # If the validation would have fired off, we'll skip it.
-            # This is useful at the end of training so we don't validate 2x.
-            return self
-        if StateTracker.get_webhook_handler() is not None:
-            StateTracker.get_webhook_handler().send(
-                message="Validations are generating.. this might take a minute! 🖼️",
-                message_level="info",
-            )
-
-        if self.accelerator.is_main_process or self.deepspeed:
-            logger.debug("Starting validation process...")
-            self.setup_pipeline(validation_type)
-            if self.pipeline is None:
-                logger.error(
-                    "Not able to run validations, we did not obtain a valid pipeline."
-                )
-                self.validation_images = None
-                return self
-            self.setup_scheduler()
-            self.process_prompts()
-            self.finalize_validation(validation_type)
-            logger.debug("Validation process completed.")
-            self.clean_pipeline()
-
-        return self
-
-    def should_perform_validation(self, step, validation_prompts, validation_type):
-        should_do_intermediary_validation = (
-            validation_prompts
-            and self.global_step % self.args.validation_steps == 0
-            and step % self.args.gradient_accumulation_steps == 0
-            and self.global_step > self.global_resume_step
-        )
-        is_final_validation = validation_type == "final"
-        return (is_final_validation or should_do_intermediary_validation) and (
-            self.accelerator.is_main_process or self.deepseed
-        )
-
-    def setup_scheduler(self):
-        if self.args.validation_noise_scheduler is None:
-            return
-        if self.flow_matching:
-            # NO TOUCHIE FOR FLOW-MATCHING.
-            # Touchie for diffusion though.
-            return
-
-        scheduler_args = {}
-        if (
-            self.pipeline is not None
-            and "variance_type" in self.pipeline.scheduler.config
-        ):
-            variance_type = self.pipeline.scheduler.config.variance_type
-
-            if variance_type in ["learned", "learned_range"]:
-                variance_type = "fixed_small"
-
-            scheduler_args["variance_type"] = variance_type
-        if self.deepfloyd:
-            self.args.validation_noise_scheduler = "ddpm"
-        scheduler = SCHEDULER_NAME_MAP[
-            self.args.validation_noise_scheduler
-        ].from_pretrained(
-            self.args.pretrained_model_name_or_path,
-            subfolder="scheduler",
-            revision=self.args.revision,
-            prediction_type=self.args.prediction_type,
-            timestep_spacing=self.args.inference_scheduler_timestep_spacing,
-            rescale_betas_zero_snr=self.args.rescale_betas_zero_snr,
-            **scheduler_args,
-        )
-        if self.pipeline is not None:
-            self.pipeline.scheduler = scheduler
-        return scheduler
-
-    def setup_pipeline(self, validation_type, enable_ema_model: bool = True):
-        if validation_type == "intermediary" and self.args.use_ema:
-            if enable_ema_model:
-                if self.unet is not None:
-                    self.ema_model.store(self.unet.parameters())
-                    self.ema_model.copy_to(self.unet.parameters())
-                if self.transformer is not None:
-                    self.ema_model.store(self.transformer.parameters())
-                    self.ema_model.copy_to(self.transformer.parameters())
-                if self.args.ema_device != "accelerator":
-                    logger.info("Moving EMA weights to GPU for inference.")
-                    self.ema_model.to(self.inference_device)
-            else:
-                logger.debug(
-                    "Skipping EMA model setup for validation, as enable_ema_model=False."
-                )
-
-        if self.pipeline is None:
-            pipeline_cls = self._pipeline_cls()
-            extra_pipeline_kwargs = {
-                "text_encoder": self.text_encoder_1,
-                "tokenizer": self.tokenizer_1,
-                "vae": self.vae,
-                "safety_checker": None,
-            }
-            if type(pipeline_cls) is StableDiffusionXLPipeline:
-                del extra_pipeline_kwargs["safety_checker"]
-                del extra_pipeline_kwargs["text_encoder"]
-                del extra_pipeline_kwargs["tokenizer"]
-                if validation_type == "final":
-                    if self.text_encoder_1 is not None:
-                        extra_pipeline_kwargs["text_encoder_1"] = unwrap_model(
-                            self.accelerator, self.text_encoder_1
-                        )
-                        extra_pipeline_kwargs["tokenizer_1"] = self.tokenizer_1
-                        if self.text_encoder_2 is not None:
-                            extra_pipeline_kwargs["text_encoder_2"] = unwrap_model(
-                                self.accelerator, self.text_encoder_2
-                            )
-                            extra_pipeline_kwargs["tokenizer_2"] = self.tokenizer_2
-                else:
-                    extra_pipeline_kwargs["text_encoder_1"] = None
-                    extra_pipeline_kwargs["tokenizer_1"] = None
-                    extra_pipeline_kwargs["text_encoder_2"] = None
-                    extra_pipeline_kwargs["tokenizer_2"] = None
-
-            if self.args.model_family == "smoldit":
-                extra_pipeline_kwargs["transformer"] = unwrap_model(
-                    self.accelerator, self.transformer
-                )
-                extra_pipeline_kwargs["tokenizer"] = self.tokenizer_1
-                extra_pipeline_kwargs["text_encoder"] = self.text_encoder_1
-                extra_pipeline_kwargs["scheduler"] = self.setup_scheduler()
-
-            if self.args.controlnet:
-                # ControlNet training has an additional adapter thingy.
-                extra_pipeline_kwargs["controlnet"] = unwrap_model(
-                    self.accelerator, self.controlnet
-                )
-            if self.unet is not None:
-                extra_pipeline_kwargs["unet"] = unwrap_model(
-                    self.accelerator, self.unet
-                )
-
-            if self.transformer is not None:
-                extra_pipeline_kwargs["transformer"] = unwrap_model(
-                    self.accelerator, self.transformer
-                )
-
-            if self.args.model_family == "sd3" and self.args.train_text_encoder:
-                if self.text_encoder_1 is not None:
-                    extra_pipeline_kwargs["text_encoder"] = unwrap_model(
-                        self.accelerator, self.text_encoder_1
-                    )
-                    extra_pipeline_kwargs["tokenizer"] = self.tokenizer_1
-                if self.text_encoder_2 is not None:
-                    extra_pipeline_kwargs["text_encoder_2"] = unwrap_model(
-                        self.accelerator, self.text_encoder_2
-                    )
-                    extra_pipeline_kwargs["tokenizer_2"] = self.tokenizer_2
-                if self.text_encoder_3 is not None:
-                    extra_pipeline_kwargs["text_encoder_3"] = unwrap_model(
-                        self.accelerator, self.text_encoder_3
-                    )
-                    extra_pipeline_kwargs["tokenizer_3"] = self.tokenizer_3
-
-            if self.vae is None or not hasattr(self.vae, "device"):
-                extra_pipeline_kwargs["vae"] = self.init_vae()
-            if (
-                "vae" in extra_pipeline_kwargs
-                and extra_pipeline_kwargs.get("vae") is not None
-                and extra_pipeline_kwargs["vae"].device != self.inference_device
-            ):
-                extra_pipeline_kwargs["vae"] = extra_pipeline_kwargs["vae"].to(
-                    self.inference_device
-                )
-
-            pipeline_kwargs = {
-                "pretrained_model_name_or_path": self.args.pretrained_model_name_or_path,
-                "revision": self.args.revision,
-                "variant": self.args.variant,
-                "torch_dtype": self.weight_dtype,
-                **extra_pipeline_kwargs,
-            }
-            logger.debug(f"Initialising pipeline with kwargs: {pipeline_kwargs}")
-            attempt = 0
-            while attempt < 3:
-                attempt += 1
-                try:
-                    if self.args.model_family == "smoldit":
-                        self.pipeline = pipeline_cls(
-                            vae=self.vae,
-                            transformer=unwrap_model(
-                                self.accelerator, self.transformer
-                            ),
-                            tokenizer=self.tokenizer_1,
-                            text_encoder=self.text_encoder_1,
-                            scheduler=self.setup_scheduler(),
-                        )
-                    else:
-                        self.pipeline = pipeline_cls.from_pretrained(**pipeline_kwargs)
-                except Exception as e:
-                    import traceback
-
-                    logger.error(e)
-                    logger.error(traceback.format_exc())
-                    continue
-                return None
-            if self.args.validation_torch_compile:
-                if self.unet is not None and not is_compiled_module(self.unet):
-                    logger.warning(
-                        f"Compiling the UNet for validation ({self.args.validation_torch_compile})"
-                    )
-                    self.pipeline.unet = torch.compile(
-                        self.pipeline.unet,
-                        mode=self.args.validation_torch_compile_mode,
-                        fullgraph=False,
-                    )
-                if self.transformer is not None and not is_compiled_module(
-                    self.transformer
-                ):
-                    logger.warning(
-                        f"Compiling the transformer for validation ({self.args.validation_torch_compile})"
-                    )
-                    self.pipeline.transformer = torch.compile(
-                        self.pipeline.transformer,
-                        mode=self.args.validation_torch_compile_mode,
-                        fullgraph=False,
-                    )
-
-        self.pipeline = self.pipeline.to(self.inference_device)
-        self.pipeline.set_progress_bar_config(disable=True)
-
-    def clean_pipeline(self):
-        """Remove the pipeline."""
-        if self.pipeline is not None:
-            del self.pipeline
-            self.pipeline = None
-
-    def process_prompts(self):
-        """Processes each validation prompt and logs the result."""
-        validation_images = {}
-        _content = zip(self.validation_shortnames, self.validation_prompts)
-        total_samples = (
-            len(self.validation_shortnames)
-            if self.validation_shortnames is not None
-            else 0
-        )
-        if self.validation_image_inputs:
-            # Override the pipeline inputs to be entirely based upon the validation image inputs.
-            _content = self.validation_image_inputs
-            total_samples = len(_content) if _content is not None else 0
-        for content in tqdm(
-            _content if _content else [],
-            desc="Processing validation prompts",
-            total=total_samples,
-            leave=False,
-            position=1,
-        ):
-            validation_input_image = None
-            logger.debug(f"content: {content}")
-            if len(content) == 3:
-                shortname, prompt, validation_input_image = content
-            elif len(content) == 2:
-                shortname, prompt = content
-            else:
-                raise ValueError(
-                    f"Validation content is not in the correct format: {content}"
-                )
-            logger.debug(f"Processing validation for prompt: {prompt}")
-            validation_images.update(
-                self.validate_prompt(prompt, shortname, validation_input_image)
-            )
-            self._save_images(validation_images, shortname, prompt)
-            self._log_validations_to_webhook(validation_images, shortname, prompt)
-            logger.debug(f"Completed generating image: {prompt}")
-        self.validation_images = validation_images
-        try:
-            self._log_validations_to_trackers(validation_images)
-        except Exception as e:
-            logger.error(f"Error logging validation images: {e}")
-
-    def stitch_conditioning_images(self, validation_image_results, conditioning_image):
-        """
-        For each image, make a new canvas and place it side by side with its equivalent from {self.validation_image_inputs}
-        """
-        stitched_validation_images = []
-        for idx, image in enumerate(validation_image_results):
-            new_width = image.size[0] * 2
-            new_height = image.size[1]
-            new_image = Image.new("RGB", (new_width, new_height))
-            new_image.paste(image, (0, 0))
-            new_image.paste(conditioning_image, (image.size[0], 0))
-            stitched_validation_images.append(new_image)
-
-        return stitched_validation_images
-
-    def validate_prompt(
-        self, prompt, validation_shortname, validation_input_image=None
-    ):
-        """Generate validation images for a single prompt."""
-        # Placeholder for actual image generation and logging
-        logger.debug(f"Validating prompt: {prompt}")
-        validation_images = {}
-        for resolution in self.validation_resolutions:
-            extra_validation_kwargs = {}
-            if not self.args.validation_randomize:
-                extra_validation_kwargs["generator"] = self._get_generator()
-                logger.debug(
-                    f"Using a generator? {extra_validation_kwargs['generator']}"
-                )
-            if validation_input_image is not None:
-                extra_validation_kwargs["image"] = validation_input_image
-                if self.deepfloyd_stage2:
-                    validation_resolution_width, validation_resolution_height = (
-                        val * 4 for val in extra_validation_kwargs["image"].size
-                    )
-                elif self.args.controlnet or self.args.validation_using_datasets:
-                    validation_resolution_width, validation_resolution_height = (
-                        extra_validation_kwargs["image"].size
-                    )
-                else:
-                    raise ValueError(
-                        "Validation input images are not supported for this model type."
-                    )
-            else:
-                validation_resolution_width, validation_resolution_height = resolution
-
-            if not self.flow_matching and self.args.model_family not in [
-                "deepfloyd",
-                "pixart_sigma",
-                "kolors",
-                "flux",
-                "sd3",
-            ]:
-                extra_validation_kwargs["guidance_rescale"] = (
-                    self.args.validation_guidance_rescale
-                )
-
-            if StateTracker.get_args().validation_using_datasets:
-                extra_validation_kwargs["strength"] = getattr(
-                    self.args, "validation_strength", 0.2
-                )
-                logger.debug(
-                    f"Set validation image denoise strength to {extra_validation_kwargs['strength']}"
-                )
-
-            logger.debug(
-                f"Processing width/height: {validation_resolution_width}x{validation_resolution_height}"
-            )
-            if validation_shortname not in validation_images:
-                validation_images[validation_shortname] = []
-            try:
-                extra_validation_kwargs.update(self._gather_prompt_embeds(prompt))
-            except Exception as e:
-                import traceback
-
-                logger.error(
-                    f"Error gathering text embed for validation prompt {prompt}: {e}, traceback: {traceback.format_exc()}"
-                )
-                continue
-
-            try:
-                # print(f"pipeline dtype: {self.pipeline.unet.device}")
-                pipeline_kwargs = {
-                    "prompt": None,
-                    "negative_prompt": None,
-                    "num_images_per_prompt": self.args.num_validation_images,
-                    "num_inference_steps": self.args.validation_num_inference_steps,
-                    "guidance_scale": self.args.validation_guidance,
-                    "height": MultiaspectImage._round_to_nearest_multiple(
-                        int(validation_resolution_height)
-                    ),
-                    "width": MultiaspectImage._round_to_nearest_multiple(
-                        int(validation_resolution_width)
-                    ),
-                    **extra_validation_kwargs,
-                }
-                if self.args.validation_guidance_real > 1.0:
-                    pipeline_kwargs["guidance_scale_real"] = float(
-                        self.args.validation_guidance_real
-                    )
-                if (
-                    isinstance(self.args.validation_no_cfg_until_timestep, int)
-                    and self.args.model_family == "flux"
-                ):
-                    pipeline_kwargs["no_cfg_until_timestep"] = (
-                        self.args.validation_no_cfg_until_timestep
-                    )
-
-                logger.debug(
-                    f"Image being generated with parameters: {pipeline_kwargs}"
-                )
-                # Print the device attr of any parameters that have one
-                for key, value in pipeline_kwargs.items():
-                    if hasattr(value, "device"):
-                        logger.debug(f"Device for {key}: {value.device}")
-                for key, value in self.pipeline.components.items():
-                    if hasattr(value, "device"):
-                        logger.debug(f"Device for {key}: {value.device}")
-                if StateTracker.get_model_family() == "flux":
-                    if "negative_prompt" in pipeline_kwargs:
-                        del pipeline_kwargs["negative_prompt"]
-                if (
-                    StateTracker.get_model_family() == "pixart_sigma"
-                    or StateTracker.get_model_family() == "smoldit"
-                ):
-                    if pipeline_kwargs.get("negative_prompt") is not None:
-                        del pipeline_kwargs["negative_prompt"]
-                    if pipeline_kwargs.get("prompt") is not None:
-                        del pipeline_kwargs["prompt"]
-                    pipeline_kwargs["prompt_attention_mask"] = pipeline_kwargs.pop(
-                        "prompt_mask"
-                    )[0].to(device=self.inference_device, dtype=self.weight_dtype)
-                    pipeline_kwargs["negative_prompt_attention_mask"] = torch.unsqueeze(
-                        pipeline_kwargs.pop("negative_mask")[0], dim=0
-                    ).to(device=self.inference_device, dtype=self.weight_dtype)
-
-                validation_image_results = self.pipeline(**pipeline_kwargs).images
-                if self.args.controlnet:
-                    validation_image_results = self.stitch_conditioning_images(
-                        validation_image_results, extra_validation_kwargs["image"]
-                    )
-                elif not self.args.disable_benchmark and self.benchmark_exists(
-                    "base_model"
-                ):
-                    benchmark_image = self._benchmark_image(
-                        validation_shortname, resolution
-                    )
-                    if benchmark_image is not None:
-                        # user might have added new resolutions or something.
-                        validation_image_results[0] = self.stitch_benchmark_image(
-                            validation_image_results[0], benchmark_image
-                        )
-                validation_images[validation_shortname].extend(validation_image_results)
-            except Exception as e:
-                import traceback
-
-                logger.error(
-                    f"Error generating validation image: {e}, {traceback.format_exc()}"
-                )
-                continue
-
-        return validation_images
-
-    def _save_images(self, validation_images, validation_shortname, validation_prompt):
-        validation_img_idx = 0
-        for validation_image in validation_images[validation_shortname]:
-            res = self.validation_resolutions[validation_img_idx]
-            if "x" in res:
-                res_label = str(res)
-            elif type(res) is tuple:
-                res_label = f"{res[0]}x{res[1]}"
-            else:
-                res_label = f"{res}x{res}"
-            validation_image.save(
-                os.path.join(
-                    self.save_dir,
-                    f"step_{StateTracker.get_global_step()}_{validation_shortname}_{res_label}.png",
-                )
-            )
-            validation_img_idx += 1
-
-    def _log_validations_to_webhook(
-        self, validation_images, validation_shortname, validation_prompt
-    ):
-        if StateTracker.get_webhook_handler() is not None:
-            StateTracker.get_webhook_handler().send(
-                f"Validation image for `{validation_shortname if validation_shortname != '' else '(blank shortname)'}`"
-                f"\nValidation prompt: `{validation_prompt if validation_prompt != '' else '(blank prompt)'}`",
-                images=validation_images[validation_shortname],
-            )
-
-    def _log_validations_to_trackers(self, validation_images):
-        for tracker in self.accelerator.trackers:
-            if tracker.name == "comet_ml":
-                experiment = self.accelerator.get_tracker("comet_ml").tracker
-                for shortname, image_list in validation_images.items():
-                    for idx, image in enumerate(image_list):
-                        experiment.log_image(
-                            image,
-                            name=f"{shortname} - {self.validation_resolutions[idx]}",
-                        )
-            elif tracker.name == "tensorboard":
-                tracker = self.accelerator.get_tracker("tensorboard")
-                for shortname, image_list in validation_images.items():
-                    tracker.log_images(
-                        {
-                            f"{shortname} - {self.validation_resolutions[idx]}": np.moveaxis(
-                                np.array(image), -1, 0
-                            )[
-                                np.newaxis, ...
-                            ]
-                            for idx, image in enumerate(image_list)
-                        },
-                        step=StateTracker.get_global_step(),
-                    )
-            elif tracker.name == "wandb":
-                resolution_list = [
-                    f"{res[0]}x{res[1]}" for res in get_validation_resolutions()
-                ]
-
-                if self.args.tracker_image_layout == "table":
-                    columns = [
-                        "Prompt",
-                        *resolution_list,
-                        "Mean Luminance",
-                    ]
-                    table = wandb.Table(columns=columns)
-
-                    # Process each prompt and its associated images
-                    for prompt_shortname, image_list in validation_images.items():
-                        wandb_images = []
-                        luminance_values = []
-                        logger.debug(
-                            f"Prompt {prompt_shortname} has {len(image_list)} images"
-                        )
-                        for image in image_list:
-                            logger.debug(f"Adding to table: {image}")
-                            wandb_image = wandb.Image(image)
-                            wandb_images.append(wandb_image)
-                            luminance = calculate_luminance(image)
-                            luminance_values.append(luminance)
-                        mean_luminance = torch.tensor(luminance_values).mean().item()
-                        while len(wandb_images) < len(resolution_list):
-                            # any missing images will crash it. use None so they are indexed.
-                            logger.debug("Found a missing image - masking with a None")
-                            wandb_images.append(None)
-                        table.add_data(prompt_shortname, *wandb_images, mean_luminance)
-
-                    # Log the table to Weights & Biases
-                    tracker.log(
-                        {"Validation Gallery": table},
-                        step=StateTracker.get_global_step(),
-                    )
-
-                elif self.args.tracker_image_layout == "gallery":
-                    gallery_images = {}
-                    for prompt_shortname, image_list in validation_images.items():
-                        logger.debug(
-                            f"Prompt {prompt_shortname} has {len(image_list)} images"
-                        )
-                        for idx, image in enumerate(image_list):
-                            wandb_image = wandb.Image(
-                                image,
-                                caption=f"{prompt_shortname} - {resolution_list[idx]}",
-                            )
-                            gallery_images[
-                                f"{prompt_shortname} - {resolution_list[idx]}"
-                            ] = wandb_image
-
-                    # Log all images in one call to prevent the global step from ticking
-                    tracker.log(gallery_images, step=StateTracker.get_global_step())
-
-    def finalize_validation(self, validation_type, enable_ema_model: bool = True):
-        """Cleans up and restores original state if necessary."""
-        if validation_type == "intermediary" and self.args.use_ema:
-            if enable_ema_model:
-                if self.unet is not None:
-                    self.ema_model.restore(self.unet.parameters())
-                if self.transformer is not None:
-                    self.ema_model.restore(self.transformer.parameters())
-                if self.args.ema_device != "accelerator":
-                    self.ema_model.to(self.args.ema_device)
-            else:
-                logger.debug(
-                    "Skipping EMA model restoration for validation, as enable_ema_model=False."
-                )
-        if not self.args.keep_vae_loaded and not self.args.vae_cache_ondemand:
-            self.vae = self.vae.to("cpu")
-            self.vae = None
-        self.pipeline = None
-        if torch.cuda.is_available():
-            torch.cuda.empty_cache()
diff --git a/videotuna/third_party/flux/training/wrappers.py b/videotuna/third_party/flux/training/wrappers.py
deleted file mode 100644
index b94cc903..00000000
--- a/videotuna/third_party/flux/training/wrappers.py
+++ /dev/null
@@ -1,7 +0,0 @@
-from diffusers.utils.torch_utils import is_compiled_module
-
-
-def unwrap_model(accelerator, model):
-    model = accelerator.unwrap_model(model)
-    model = model._orig_mod if is_compiled_module(model) else model
-    return model
diff --git a/videotuna/third_party/flux/webhooks/config.py b/videotuna/third_party/flux/webhooks/config.py
deleted file mode 100644
index 42cdbfe9..00000000
--- a/videotuna/third_party/flux/webhooks/config.py
+++ /dev/null
@@ -1,51 +0,0 @@
-from json import load
-
-supported_webhooks = ["discord", "raw"]
-
-
-def check_discord_webhook_config(config: dict) -> bool:
-    if "webhook_type" not in config or config["webhook_type"] != "discord":
-        return False
-    if "webhook_url" not in config:
-        raise ValueError("Discord webhook config is missing 'webhook_url' value.")
-    return True
-
-
-def check_raw_webhook_config(config: dict) -> bool:
-    if config.get("webhook_type") != "raw":
-        return False
-    missing_fields = []
-    required_fields = ["callback_url"]
-    for config_field in required_fields:
-        if not config.get(config_field):
-            missing_fields.append(config_field)
-    if missing_fields:
-        raise ValueError(f"Missing fields on webhook config: {missing_fields}")
-    return False
-
-
-class WebhookConfig:
-    def __init__(self, config_path: str):
-        self.config_path = config_path
-        self.values = self.load_config()
-        if (
-            "webhook_type" not in self.values
-            or self.values["webhook_type"] not in supported_webhooks
-        ):
-            raise ValueError(
-                f"Invalid webhook type specified in config. Supported values: {supported_webhooks}"
-            )
-        if check_discord_webhook_config(self.values):
-            self.webhook_type = "discord"
-        elif check_raw_webhook_config(self.values):
-            self.webhook_type = "raw"
-
-    def load_config(self):
-        with open(self.config_path, "r") as f:
-            return load(f)
-
-    def get_config(self):
-        return self.values
-
-    def __getattr__(self, name):
-        return self.values.get(name, None)
diff --git a/videotuna/third_party/flux/webhooks/handler.py b/videotuna/third_party/flux/webhooks/handler.py
deleted file mode 100644
index 85024568..00000000
--- a/videotuna/third_party/flux/webhooks/handler.py
+++ /dev/null
@@ -1,171 +0,0 @@
-import json
-import logging
-import os
-import time
-from io import BytesIO
-
-import requests
-
-from videotuna.third_party.flux.webhooks.config import WebhookConfig
-
-# Define log levels
-log_levels = {"critical": 0, "error": 1, "warning": 2, "info": 3, "debug": 4}
-
-logger = logging.getLogger(__name__)
-logger.setLevel(os.environ.get("SIMPLETUNER_LOG_LEVEL", "INFO"))
-
-
-class WebhookHandler:
-    def __init__(
-        self,
-        config_path: str,
-        accelerator,
-        project_name: str,
-        mock_webhook_config: WebhookConfig = None,
-    ):
-        self.accelerator = accelerator
-        self.config = mock_webhook_config or WebhookConfig(config_path)
-        self.webhook_url = self.config.values.get(
-            "webhook_url", self.config.values.get("callback_url", None)
-        )
-        self.webhook_type = (
-            self.config.webhook_type
-        )  # Use webhook_type to differentiate behavior
-        self.message_prefix = (
-            f"`({self.config.message_prefix})` "
-            if self.config.message_prefix is not None
-            else f"`({project_name})` "
-        )
-        self.log_level = log_levels.get(
-            self.config.log_level or "info", log_levels["info"]
-        )
-        self.stored_response = None
-
-    def _check_level(self, level: str) -> bool:
-        """Check if the message level meets the configured log level."""
-        return log_levels.get(level, "info") <= self.log_level
-
-    def _send_request(
-        self,
-        message: str,
-        images: list = None,
-        store_response: bool = False,
-        raw_request: bool = False,
-    ):
-        """Send the webhook request based on the webhook type."""
-        if self.webhook_type == "discord":
-            # Prepare Discord-style payload
-            data = {"content": f"{self.message_prefix}{message}"}
-            files = self._prepare_images(images)
-            request_args = {
-                "data": data,
-                "files": files if self.webhook_type == "discord" else None,
-            }
-        elif self.webhook_type == "raw":
-            # Prepare raw data payload for direct POST
-            if raw_request:
-                data = message
-                files = None
-            else:
-                data = {
-                    "message": message,
-                    "images": (
-                        [self._convert_image_to_base64(img) for img in images]
-                        if images
-                        else []
-                    ),
-                }
-            files = None
-            request_args = {
-                "json": data,
-                "files": None,
-            }
-        else:
-            logger.error(f"Unsupported webhook type: {self.webhook_type}")
-            return
-
-        # Send request
-        try:
-            logger.debug(f"Sending webhook request: {request_args}")
-            post_result = requests.post(
-                self.webhook_url,
-                **request_args,
-            )
-            post_result.raise_for_status()
-        except Exception as e:
-            logger.error(f"Could not send webhook request: {e}")
-            return
-
-        if store_response:
-            self.stored_response = post_result.headers
-
-    def _prepare_images(self, images: list):
-        """Convert images to file objects for Discord uploads."""
-        files = {}
-        if images:
-            for index, img in enumerate(images):
-                img_byte_array = BytesIO()
-                img.save(img_byte_array, format="PNG")
-                img_byte_array.seek(0)
-                files[f"file{index}"] = (
-                    f"image{index}.png",
-                    img_byte_array,
-                    "image/png",
-                )
-        return files
-
-    def _convert_image_to_base64(self, image):
-        """Convert PIL image to a base64 string (for 'raw' webhook type)."""
-        import base64
-
-        img_byte_array = BytesIO()
-        image.save(img_byte_array, format="PNG")
-        img_byte_array.seek(0)
-        return base64.b64encode(img_byte_array.read()).decode("utf-8")
-
-    def send(
-        self,
-        message: str,
-        images: list = None,
-        message_level: str = "info",
-        store_response: bool = False,
-    ):
-        """Send a message through the webhook with optional images."""
-        if not self.accelerator.is_main_process or "discord" != self.webhook_type:
-            return
-        if not self._check_level(message_level):
-            return
-        if images is not None and not isinstance(images, list):
-            images = [images]
-
-        # Split the images into smaller chunks if there are too many (Discord limitation)
-        if images and len(images) > 10:
-            for i in range(0, len(images), 9):
-                self._send_request(
-                    message, images[i : i + 9], store_response=store_response
-                )
-        else:
-            self._send_request(message, images, store_response=store_response)
-
-    def send_raw(
-        self,
-        structured_data: dict,
-        message_type: str,
-        message_level: str = "info",
-        job_id: str = None,
-    ):
-        """
-        for sending structured dict to the callback for eg. training step progress updates
-        """
-        if (
-            "raw" != self.webhook_type
-            or not self.accelerator.is_main_process
-            or not self._check_level(message_level)
-        ):
-            return
-        structured_data["message_type"] = message_type
-        structured_data["job_id"] = job_id
-        structured_data["timestamp"] = int(time.time())
-        self._send_request(
-            message=structured_data, images=None, store_response=False, raw_request=True
-        )
diff --git a/videotuna/third_party/flux/webhooks/mixin.py b/videotuna/third_party/flux/webhooks/mixin.py
deleted file mode 100644
index 811d542a..00000000
--- a/videotuna/third_party/flux/webhooks/mixin.py
+++ /dev/null
@@ -1,31 +0,0 @@
-from videotuna.third_party.flux.training.multi_process import _get_rank as get_rank
-from videotuna.third_party.flux.training.state_tracker import StateTracker
-from videotuna.third_party.flux.webhooks.handler import WebhookHandler
-
-current_rank = get_rank()
-
-
-class WebhookMixin:
-    webhook_handler: WebhookHandler = None
-
-    def set_webhook_handler(self, webhook_handler: WebhookHandler):
-        self.webhook_handler = webhook_handler
-
-    def send_progress_update(self, type: str, progress: int, total: int, current: int):
-        if total == 1:
-            return
-        if int(current_rank) != 0:
-            return
-        progress = {
-            "message_type": "progress_update",
-            "message": {
-                "progress_type": type,
-                "progress": progress,
-                "total_elements": total,
-                "current_estimated_index": current,
-            },
-        }
-
-        self.webhook_handler.send_raw(
-            progress, "progress_update", job_id=StateTracker.get_job_id()
-        )
diff --git a/videotuna/training/__init__.py b/videotuna/training/__init__.py
new file mode 100644
index 00000000..7f94addd
--- /dev/null
+++ b/videotuna/training/__init__.py
@@ -0,0 +1 @@
+"""VideoTuna training entrypoints (first-party trainers)."""
diff --git a/videotuna/training/flux_lora/__init__.py b/videotuna/training/flux_lora/__init__.py
new file mode 100644
index 00000000..c03443f5
--- /dev/null
+++ b/videotuna/training/flux_lora/__init__.py
@@ -0,0 +1,5 @@
+"""First-party Flux LoRA fine-tuning (Diffusers + PEFT + Accelerate)."""
+
+from videotuna.training.flux_lora.config import FluxLoraTrainConfig, load_train_config
+
+__all__ = ["FluxLoraTrainConfig", "load_train_config"]
diff --git a/videotuna/training/flux_lora/bucketing.py b/videotuna/training/flux_lora/bucketing.py
new file mode 100644
index 00000000..813812a0
--- /dev/null
+++ b/videotuna/training/flux_lora/bucketing.py
@@ -0,0 +1,66 @@
+"""SimpleTuner-style pixel_area aspect bucketing for Flux LoRA training."""
+
+from __future__ import annotations
+
+import math
+
+
+def round_aspect_ratio(width: int, height: int, rounding: int) -> float:
+    if height <= 0:
+        raise ValueError(f"Invalid image height: {height}")
+    return round(width / height, rounding)
+
+
+def target_pixel_area(resolution: int) -> int:
+    return resolution * resolution
+
+
+def bucket_dimensions(
+    aspect: float,
+    target_area: int,
+    *,
+    align: int = 64,
+) -> tuple[int, int]:
+    if aspect <= 0:
+        raise ValueError(f"Aspect ratio must be positive, got {aspect}")
+    height = math.sqrt(target_area / aspect)
+    width = height * aspect
+    width = max(align, round(width / align) * align)
+    height = max(align, round(height / align) * align)
+    return int(width), int(height)
+
+
+def bucket_dimensions_for_image(
+    width: int,
+    height: int,
+    resolution: int,
+    resolution_type: str,
+    aspect_bucket_rounding: int,
+    *,
+    align: int = 64,
+) -> tuple[int, int]:
+    if resolution_type != "pixel_area":
+        raise ValueError(
+            "Unsupported resolution_type="
+            f"{resolution_type!r}; only 'pixel_area' is supported"
+        )
+    aspect = round_aspect_ratio(width, height, aspect_bucket_rounding)
+    area = target_pixel_area(resolution)
+    return bucket_dimensions(aspect, area, align=align)
+
+
+def meets_minimum_size(
+    width: int,
+    height: int,
+    minimum_image_size: int,
+    resolution_type: str,
+) -> bool:
+    if minimum_image_size <= 0:
+        return True
+    if resolution_type != "pixel_area":
+        raise ValueError(
+            "Unsupported resolution_type="
+            f"{resolution_type!r}; only 'pixel_area' is supported"
+        )
+    min_area = minimum_image_size * minimum_image_size
+    return width * height >= min_area
diff --git a/videotuna/training/flux_lora/checkpoint.py b/videotuna/training/flux_lora/checkpoint.py
new file mode 100644
index 00000000..470e96d2
--- /dev/null
+++ b/videotuna/training/flux_lora/checkpoint.py
@@ -0,0 +1,74 @@
+"""Save Flux LoRA checkpoints in Diffusers-compatible format."""
+
+from __future__ import annotations
+
+import re
+import shutil
+from pathlib import Path
+
+from diffusers import FluxPipeline
+from peft.utils import get_peft_model_state_dict, set_peft_model_state_dict
+
+_CHECKPOINT_RE = re.compile(r"^checkpoint-(\d+)$")
+
+
+def save_lora_checkpoint(transformer, output_dir: str | Path, step: int) -> Path:
+    save_path = Path(output_dir) / f"checkpoint-{step}"
+    save_path.mkdir(parents=True, exist_ok=True)
+
+    transformer_lora = get_peft_model_state_dict(transformer)
+    FluxPipeline.save_lora_weights(
+        save_directory=str(save_path),
+        transformer_lora_layers=transformer_lora,
+    )
+    return save_path
+
+
+def find_latest_checkpoint(output_dir: str | Path) -> Path | None:
+    root = Path(output_dir)
+    if not root.is_dir():
+        return None
+    best_step = -1
+    best_path: Path | None = None
+    for path in root.iterdir():
+        if not path.is_dir():
+            continue
+        match = _CHECKPOINT_RE.match(path.name)
+        if match is None:
+            continue
+        step = int(match.group(1))
+        if step > best_step:
+            best_step = step
+            best_path = path
+    return best_path
+
+
+def checkpoint_step(path: Path) -> int:
+    match = _CHECKPOINT_RE.match(path.name)
+    if match is None:
+        raise ValueError(f"Not a checkpoint directory: {path}")
+    return int(match.group(1))
+
+
+def load_lora_checkpoint(transformer, checkpoint_dir: str | Path) -> None:
+    path = Path(checkpoint_dir)
+    lora_state_dict = FluxPipeline.lora_state_dict(str(path))
+    set_peft_model_state_dict(transformer, lora_state_dict)
+
+
+def prune_checkpoints(output_dir: str | Path, limit: int | None) -> None:
+    if limit is None or limit <= 0:
+        return
+    root = Path(output_dir)
+    checkpoints: list[tuple[int, Path]] = []
+    for path in root.iterdir():
+        if not path.is_dir():
+            continue
+        match = _CHECKPOINT_RE.match(path.name)
+        if match is None:
+            continue
+        checkpoints.append((int(match.group(1)), path))
+    checkpoints.sort(key=lambda item: item[0])
+    while len(checkpoints) > limit:
+        _, path = checkpoints.pop(0)
+        shutil.rmtree(path)
diff --git a/videotuna/training/flux_lora/config.py b/videotuna/training/flux_lora/config.py
new file mode 100644
index 00000000..c137b360
--- /dev/null
+++ b/videotuna/training/flux_lora/config.py
@@ -0,0 +1,399 @@
+"""Load and normalize `configs/domain/` SimpleTuner-style JSON configs."""
+
+from __future__ import annotations
+
+import json
+from dataclasses import dataclass
+from pathlib import Path
+from typing import Any
+
+from videotuna.utils.logging_config import bound_logger
+
+logger = bound_logger(phase="t2i", flow="flux_lora")
+
+_ALLOWED_TRAIN_KEYS = frozenset(
+    {
+        "aspect_bucket_rounding",
+        "caption_dropout_probability",
+        "checkpointing_steps",
+        "checkpoints_total_limit",
+        "data_backend_config",
+        "disable_benchmark",
+        "disable_tf32",
+        "gradient_checkpointing",
+        "gradient_accumulation_steps",
+        "learning_rate",
+        "lora_rank",
+        "lora_type",
+        "lr_scheduler",
+        "lr_warmup_steps",
+        "max_train_steps",
+        "minimum_image_size",
+        "mixed_precision",
+        "model_family",
+        "model_type",
+        "num_train_epochs",
+        "num_workers",
+        "optimizer",
+        "output_dir",
+        "pretrained_model_name_or_path",
+        "resolution",
+        "resolution_type",
+        "resume_from_checkpoint",
+        "seed",
+        "train_batch_size",
+        "validation_guidance",
+        "validation_guidance_rescale",
+        "validation_num_inference_steps",
+        "validation_prompt",
+        "validation_resolution",
+        "validation_seed",
+        "validation_steps",
+        "write_batch_size",
+    }
+)
+
+
+def _normalize_key(key: str) -> str:
+    return key[2:] if key.startswith("--") else key
+
+
+def _coerce_value(key: str, value: Any) -> Any:
+    if key in {"gradient_checkpointing", "disable_benchmark", "disable_tf32"}:
+        if isinstance(value, str):
+            return value.lower() in {"true", "1", "yes"}
+        return bool(value)
+    if key in {
+        "lora_rank",
+        "max_train_steps",
+        "checkpointing_steps",
+        "checkpoints_total_limit",
+        "train_batch_size",
+        "write_batch_size",
+        "resolution",
+        "validation_steps",
+        "validation_num_inference_steps",
+        "lr_warmup_steps",
+        "num_train_epochs",
+        "seed",
+        "validation_seed",
+        "num_workers",
+        "aspect_bucket_rounding",
+        "minimum_image_size",
+        "gradient_accumulation_steps",
+    }:
+        return int(value)
+    if key in {
+        "learning_rate",
+        "validation_guidance",
+        "validation_guidance_rescale",
+        "caption_dropout_probability",
+    }:
+        return float(value)
+    return value
+
+
+@dataclass
+class FluxTextEmbedConfig:
+    cache_dir: str
+    write_batch_size: int | None = None
+    disabled: bool = False
+
+
+@dataclass
+class FluxLoraDataConfig:
+    instance_data_dir: str
+    caption_strategy: str = "filename"
+    default_caption: str | None = None
+    resolution: int = 512
+    crop: bool = True
+    crop_aspect: str = "square"
+    resolution_type: str = "pixel_area"
+    aspect_bucket_rounding: int = 2
+    minimum_image_size: int = 0
+    maximum_image_size: int | None = None
+    caption_dropout_probability: float = 0.0
+    text_embeds: FluxTextEmbedConfig | None = None
+
+
+@dataclass
+class FluxLoraTrainConfig:
+    pretrained_model_name_or_path: str
+    output_dir: str
+    instance_data_dir: str
+    model_family: str = "flux"
+    model_type: str = "lora"
+    lora_type: str = "standard"
+    lora_rank: int = 4
+    learning_rate: float = 8e-5
+    lr_scheduler: str = "polynomial"
+    lr_warmup_steps: int = 5
+    max_train_steps: int = 1000
+    num_train_epochs: int = -1
+    train_batch_size: int = 1
+    num_workers: int = 0
+    resolution: int = 512
+    resolution_type: str = "pixel_area"
+    aspect_bucket_rounding: int = 2
+    minimum_image_size: int = 0
+    checkpointing_steps: int = 500
+    checkpoints_total_limit: int | None = None
+    resume_from_checkpoint: str | None = None
+    mixed_precision: str = "bf16"
+    optimizer: str = "adamw"
+    seed: int = 42
+    disable_tf32: bool = False
+    disable_benchmark: bool = False
+    gradient_checkpointing: bool = True
+    gradient_accumulation_steps: int = 1
+    caption_dropout_probability: float = 0.0
+    write_batch_size: int = 128
+    validation_prompt: str | None = None
+    validation_steps: int | None = None
+    validation_resolution: str = "512x512"
+    validation_guidance: float = 3.0
+    validation_guidance_rescale: float = 0.0
+    validation_num_inference_steps: int = 10
+    validation_seed: int = 42
+    data_backend_config: str | None = None
+
+
+def _parse_text_embeds_backend(
+    backends: list[dict[str, Any]],
+) -> FluxTextEmbedConfig | None:
+    embed_backend = next(
+        (
+            b
+            for b in backends
+            if b.get("type") == "local" and b.get("dataset_type") == "text_embeds"
+        ),
+        None,
+    )
+    if embed_backend is None:
+        return None
+    if embed_backend.get("disabled", False):
+        return FluxTextEmbedConfig(cache_dir="", disabled=True)
+    cache_dir = embed_backend.get("cache_dir")
+    if not cache_dir:
+        raise ValueError("text_embeds backend requires cache_dir")
+    write_batch_size = embed_backend.get("write_batch_size")
+    parsed_write_batch_size = (
+        int(write_batch_size) if write_batch_size is not None else None
+    )
+    return FluxTextEmbedConfig(
+        cache_dir=str(cache_dir),
+        write_batch_size=parsed_write_batch_size,
+        disabled=False,
+    )
+
+
+def _parse_local_backend(backends: list[dict[str, Any]]) -> FluxLoraDataConfig:
+    image_backend = next(
+        (
+            b
+            for b in backends
+            if b.get("type") == "local"
+            and b.get("dataset_type") != "text_embeds"
+            and not b.get("disabled", False)
+        ),
+        None,
+    )
+    if image_backend is None:
+        raise ValueError(
+            "multidatabackend.json must include a local image backend "
+            "(text_embeds-only backends are not supported)."
+        )
+    maximum_image_size = image_backend.get("maximum_image_size")
+    return FluxLoraDataConfig(
+        instance_data_dir=image_backend["instance_data_dir"],
+        caption_strategy=image_backend.get("caption_strategy", "filename"),
+        default_caption=image_backend.get("caption"),
+        resolution=int(image_backend.get("resolution", 512)),
+        crop=bool(image_backend.get("crop", True)),
+        crop_aspect=image_backend.get("crop_aspect", "square"),
+        resolution_type=str(image_backend.get("resolution_type", "pixel_area")),
+        aspect_bucket_rounding=int(image_backend.get("aspect_bucket_rounding", 2)),
+        minimum_image_size=int(image_backend.get("minimum_image_size", 0)),
+        maximum_image_size=int(maximum_image_size) if maximum_image_size else None,
+        text_embeds=_parse_text_embeds_backend(backends),
+    )
+
+
+def _validate_train_values(normalized: dict[str, Any]) -> None:
+    if normalized.get("model_family", "flux") != "flux":
+        raise ValueError(
+            f"model_family must be 'flux', got {normalized.get('model_family')!r}"
+        )
+    if normalized.get("model_type", "lora") != "lora":
+        raise ValueError(
+            f"model_type must be 'lora', got {normalized.get('model_type')!r}"
+        )
+    if normalized.get("lora_type", "standard") != "standard":
+        raise ValueError(
+            f"lora_type must be 'standard', got {normalized.get('lora_type')!r}"
+        )
+    if int(normalized.get("num_train_epochs", -1)) != -1:
+        raise ValueError(
+            "num_train_epochs must be -1 "
+            "(PrivTune Flux trainer is step-based via max_train_steps)"
+        )
+    optimizer = normalized.get("optimizer", "adamw")
+    if optimizer not in {"adamw", "adamw_bf16"}:
+        raise ValueError(
+            f"optimizer must be 'adamw' or 'adamw_bf16', got {optimizer!r}"
+        )
+    resolution_type = normalized.get("resolution_type", "pixel_area")
+    if resolution_type != "pixel_area":
+        raise ValueError(
+            f"resolution_type must be 'pixel_area', got {resolution_type!r}"
+        )
+    if float(normalized.get("validation_guidance_rescale", 0.0)) != 0.0:
+        raise ValueError(
+            "validation_guidance_rescale is not supported for Flux (must be 0.0)"
+        )
+    _validate_gradient_accumulation(normalized)
+    _validate_resume_from_checkpoint(normalized)
+
+
+def _validate_gradient_accumulation(normalized: dict[str, Any]) -> None:
+    grad_accum = int(normalized.get("gradient_accumulation_steps", 1))
+    if grad_accum < 1:
+        raise ValueError(f"gradient_accumulation_steps must be >= 1, got {grad_accum}")
+
+
+def _validate_resume_from_checkpoint(normalized: dict[str, Any]) -> None:
+    resume = normalized.get("resume_from_checkpoint")
+    if resume is not None:
+        if not isinstance(resume, str) or not resume.strip():
+            raise ValueError(
+                "resume_from_checkpoint must be null, 'latest', or a non-empty path"
+            )
+
+
+def _merge_data_config(
+    data_cfg: FluxLoraDataConfig,
+    train_cfg: FluxLoraTrainConfig,
+) -> FluxLoraDataConfig:
+    """Apply train-config overrides onto data config for dataset construction."""
+    write_batch_size = train_cfg.write_batch_size
+    if data_cfg.text_embeds and data_cfg.text_embeds.write_batch_size is not None:
+        write_batch_size = data_cfg.text_embeds.write_batch_size
+    text_embeds = data_cfg.text_embeds
+    if text_embeds is not None and not text_embeds.disabled:
+        text_embeds = FluxTextEmbedConfig(
+            cache_dir=text_embeds.cache_dir,
+            write_batch_size=write_batch_size,
+            disabled=False,
+        )
+    return FluxLoraDataConfig(
+        instance_data_dir=train_cfg.instance_data_dir,
+        caption_strategy=data_cfg.caption_strategy,
+        default_caption=data_cfg.default_caption,
+        resolution=train_cfg.resolution,
+        crop=data_cfg.crop,
+        crop_aspect=data_cfg.crop_aspect,
+        resolution_type=train_cfg.resolution_type,
+        aspect_bucket_rounding=train_cfg.aspect_bucket_rounding,
+        minimum_image_size=train_cfg.minimum_image_size,
+        maximum_image_size=data_cfg.maximum_image_size,
+        caption_dropout_probability=train_cfg.caption_dropout_probability,
+        text_embeds=text_embeds,
+    )
+
+
+def load_train_config(
+    config_path: str | Path,
+    data_config_path: str | Path,
+) -> tuple[FluxLoraTrainConfig, FluxLoraDataConfig]:
+    with open(config_path) as f:
+        raw = json.load(f)
+    with open(data_config_path) as f:
+        backends = json.load(f)
+
+    normalized: dict[str, Any] = {}
+    for key, value in raw.items():
+        norm_key = _normalize_key(key)
+        if norm_key not in _ALLOWED_TRAIN_KEYS:
+            raise ValueError(
+                f"Unsupported Flux training config keys: {sorted({norm_key})}"
+            )
+        normalized[norm_key] = _coerce_value(norm_key, value)
+
+    _validate_train_values(normalized)
+
+    data_cfg = _parse_local_backend(backends)
+    instance_data_dir = (
+        normalized.get("instance_data_dir") or data_cfg.instance_data_dir
+    )
+    resolution = int(normalized.get("resolution", data_cfg.resolution))
+    minimum_image_size = int(
+        normalized.get("minimum_image_size", data_cfg.minimum_image_size)
+    )
+    aspect_bucket_rounding = int(
+        normalized.get("aspect_bucket_rounding", data_cfg.aspect_bucket_rounding)
+    )
+    resolution_type = str(normalized.get("resolution_type", data_cfg.resolution_type))
+
+    train_cfg = FluxLoraTrainConfig(
+        pretrained_model_name_or_path=normalized["pretrained_model_name_or_path"],
+        output_dir=normalized["output_dir"],
+        instance_data_dir=instance_data_dir,
+        model_family=normalized.get("model_family", "flux"),
+        model_type=normalized.get("model_type", "lora"),
+        lora_type=normalized.get("lora_type", "standard"),
+        lora_rank=int(normalized.get("lora_rank", 4)),
+        learning_rate=float(normalized.get("learning_rate", 8e-5)),
+        lr_scheduler=normalized.get("lr_scheduler", "polynomial"),
+        lr_warmup_steps=int(normalized.get("lr_warmup_steps", 5)),
+        max_train_steps=int(normalized.get("max_train_steps", 1000)),
+        num_train_epochs=int(normalized.get("num_train_epochs", -1)),
+        train_batch_size=int(normalized.get("train_batch_size", 1)),
+        num_workers=int(normalized.get("num_workers", 0)),
+        resolution=resolution,
+        resolution_type=resolution_type,
+        aspect_bucket_rounding=aspect_bucket_rounding,
+        minimum_image_size=minimum_image_size,
+        checkpointing_steps=int(normalized.get("checkpointing_steps", 500)),
+        checkpoints_total_limit=normalized.get("checkpoints_total_limit"),
+        resume_from_checkpoint=normalized.get("resume_from_checkpoint"),
+        mixed_precision=normalized.get("mixed_precision", "bf16"),
+        optimizer=normalized.get("optimizer", "adamw"),
+        seed=int(normalized.get("seed", 42)),
+        disable_tf32=bool(normalized.get("disable_tf32", False)),
+        disable_benchmark=bool(normalized.get("disable_benchmark", False)),
+        gradient_checkpointing=bool(normalized.get("gradient_checkpointing", True)),
+        gradient_accumulation_steps=int(
+            normalized.get("gradient_accumulation_steps", 1)
+        ),
+        caption_dropout_probability=float(
+            normalized.get("caption_dropout_probability", 0.0)
+        ),
+        write_batch_size=int(normalized.get("write_batch_size", 128)),
+        validation_prompt=normalized.get("validation_prompt"),
+        validation_steps=normalized.get("validation_steps"),
+        validation_resolution=str(normalized.get("validation_resolution", "512x512")),
+        validation_guidance=float(normalized.get("validation_guidance", 3.0)),
+        validation_guidance_rescale=float(
+            normalized.get("validation_guidance_rescale", 0.0)
+        ),
+        validation_num_inference_steps=int(
+            normalized.get("validation_num_inference_steps", 10)
+        ),
+        validation_seed=int(normalized.get("validation_seed", 42)),
+        data_backend_config=normalized.get("data_backend_config"),
+    )
+    merged_data_cfg = _merge_data_config(data_cfg, train_cfg)
+    return train_cfg, merged_data_cfg
+
+
+def stamp_output_dir(output_dir: str) -> str:
+    from datetime import datetime
+
+    path = Path(output_dir)
+    time_str = datetime.now().strftime("%Y%m%d%H%M%S")
+    folder_name = path.stem
+    name_list = folder_name.split("_")
+    if len(name_list[-1]) == 14 and name_list[-1].isdigit():
+        folder_name = "_".join(name_list[:-1])
+    stamped = path.parent / f"{folder_name}_{time_str}"
+    return str(stamped)
diff --git a/videotuna/training/flux_lora/dataset.py b/videotuna/training/flux_lora/dataset.py
new file mode 100644
index 00000000..7793f72f
--- /dev/null
+++ b/videotuna/training/flux_lora/dataset.py
@@ -0,0 +1,214 @@
+"""Local image + caption dataset for Flux LoRA training."""
+
+from __future__ import annotations
+
+import random
+from pathlib import Path
+from typing import TypedDict
+
+import torch
+from PIL import Image
+from torch.utils.data import BatchSampler, Dataset
+from torchvision import transforms
+
+from videotuna.training.flux_lora.bucketing import (
+    bucket_dimensions_for_image,
+    meets_minimum_size,
+)
+from videotuna.training.flux_lora.config import FluxLoraDataConfig
+from videotuna.utils.logging_config import bound_logger
+
+logger = bound_logger(phase="t2i", flow="flux_lora")
+
+_IMAGE_EXTENSIONS = {".jpg", ".jpeg", ".png", ".webp", ".bmp"}
+
+
+class FluxLoraSample(TypedDict, total=False):
+    pixel_values: torch.Tensor
+    caption: str
+    prompt_embeds: torch.Tensor
+    pooled_prompt_embeds: torch.Tensor
+    text_ids: torch.Tensor
+
+
+def _load_caption(
+    image_path: Path, caption_strategy: str, default_caption: str | None
+) -> str:
+    if caption_strategy == "filename":
+        txt_path = image_path.with_suffix(".txt")
+        if txt_path.is_file():
+            return txt_path.read_text(encoding="utf-8").strip()
+        if default_caption:
+            return default_caption
+        raise ValueError(
+            f"Missing caption file for {image_path} (caption_strategy=filename)"
+        )
+    if default_caption:
+        return default_caption
+    raise ValueError(
+        f"Unsupported caption_strategy={caption_strategy!r} without default caption"
+    )
+
+
+def _center_square_crop(image: Image.Image) -> Image.Image:
+    width, height = image.size
+    side = min(width, height)
+    left = (width - side) // 2
+    top = (height - side) // 2
+    return image.crop((left, top, left + side, top + side))
+
+
+class FluxBucketBatchSampler(BatchSampler):
+    """Batch indices grouped by target bucket dimensions."""
+
+    def __init__(
+        self,
+        bucket_ids: list[int],
+        batch_size: int,
+        *,
+        shuffle: bool = True,
+        seed: int = 0,
+    ) -> None:
+        self.bucket_ids = bucket_ids
+        self.batch_size = batch_size
+        self.shuffle = shuffle
+        self.seed = seed
+        self._epoch = 0
+
+    def __iter__(self):
+        rng = random.Random(self.seed + self._epoch)
+        buckets: dict[int, list[int]] = {}
+        for index, bucket_id in enumerate(self.bucket_ids):
+            buckets.setdefault(bucket_id, []).append(index)
+        batches: list[list[int]] = []
+        for indices in buckets.values():
+            if self.shuffle:
+                rng.shuffle(indices)
+            for start in range(0, len(indices), self.batch_size):
+                batches.append(indices[start : start + self.batch_size])
+        if self.shuffle:
+            rng.shuffle(batches)
+        yield from batches
+        self._epoch += 1
+
+    def __len__(self) -> int:
+        return sum(
+            (len(indices) + self.batch_size - 1) // self.batch_size
+            for indices in self._grouped_indices().values()
+        )
+
+    def _grouped_indices(self) -> dict[int, list[int]]:
+        buckets: dict[int, list[int]] = {}
+        for index, bucket_id in enumerate(self.bucket_ids):
+            buckets.setdefault(bucket_id, []).append(index)
+        return buckets
+
+
+class FluxLoraImageDataset(Dataset):
+    def __init__(
+        self,
+        data_config: FluxLoraDataConfig,
+        *,
+        embed_lookup: dict[str, dict[str, torch.Tensor]] | None = None,
+        seed: int = 42,
+    ):
+        self.data_dir = Path(data_config.instance_data_dir)
+        if not self.data_dir.is_dir():
+            raise FileNotFoundError(
+                f"Training data directory not found: {self.data_dir}"
+            )
+
+        self.caption_strategy = data_config.caption_strategy
+        self.default_caption = data_config.default_caption
+        self.resolution = data_config.resolution
+        self.crop = data_config.crop
+        self.resolution_type = data_config.resolution_type
+        self.aspect_bucket_rounding = data_config.aspect_bucket_rounding
+        self.minimum_image_size = data_config.minimum_image_size
+        self.caption_dropout_probability = data_config.caption_dropout_probability
+        self.embed_lookup = embed_lookup or {}
+        self._rng = random.Random(seed)
+
+        self.samples: list[tuple[Path, str, tuple[int, int]]] = []
+        self.bucket_ids: list[int] = []
+        bucket_map: dict[tuple[int, int], int] = {}
+        filtered = 0
+
+        for path in sorted(self.data_dir.iterdir()):
+            if path.suffix.lower() not in _IMAGE_EXTENSIONS:
+                continue
+            with Image.open(path) as image:
+                width, height = image.size
+            if not meets_minimum_size(
+                width,
+                height,
+                self.minimum_image_size,
+                self.resolution_type,
+            ):
+                filtered += 1
+                continue
+            if self.crop:
+                side = min(width, height)
+                width = height = side
+            bucket_w, bucket_h = bucket_dimensions_for_image(
+                width,
+                height,
+                self.resolution,
+                self.resolution_type,
+                self.aspect_bucket_rounding,
+            )
+            caption = _load_caption(path, self.caption_strategy, self.default_caption)
+            bucket_key = (bucket_w, bucket_h)
+            if bucket_key not in bucket_map:
+                bucket_map[bucket_key] = len(bucket_map)
+            self.samples.append((path, caption, bucket_key))
+            self.bucket_ids.append(bucket_map[bucket_key])
+
+        if filtered:
+            logger.info("Filtered {} images below minimum_image_size", filtered)
+
+        if not self.samples:
+            raise ValueError(f"No training images found in {self.data_dir}")
+
+        logger.info(
+            "Loaded {} training images from {} across {} aspect buckets",
+            len(self.samples),
+            self.data_dir,
+            len(bucket_map),
+        )
+
+    def __len__(self) -> int:
+        return len(self.samples)
+
+    def _maybe_dropout_caption(self, caption: str) -> str:
+        if (
+            self.caption_dropout_probability > 0.0
+            and self._rng.random() < self.caption_dropout_probability
+        ):
+            return ""
+        return caption
+
+    def __getitem__(self, index: int) -> FluxLoraSample:
+        path, caption, (bucket_w, bucket_h) = self.samples[index]
+        image = Image.open(path).convert("RGB")
+        if self.crop:
+            image = _center_square_crop(image)
+        transform = transforms.Compose(
+            [
+                transforms.Resize(
+                    (bucket_h, bucket_w),
+                    interpolation=transforms.InterpolationMode.BILINEAR,
+                ),
+                transforms.ToTensor(),
+                transforms.Normalize([0.5], [0.5]),
+            ]
+        )
+        pixel_values = transform(image)
+        caption = self._maybe_dropout_caption(caption)
+        sample: FluxLoraSample = {"pixel_values": pixel_values, "caption": caption}
+        cached = self.embed_lookup.get(caption)
+        if cached is not None:
+            sample["prompt_embeds"] = cached["prompt_embeds"]
+            sample["pooled_prompt_embeds"] = cached["pooled_prompt_embeds"]
+            sample["text_ids"] = cached["text_ids"]
+        return sample
diff --git a/videotuna/training/flux_lora/model_utils.py b/videotuna/training/flux_lora/model_utils.py
new file mode 100644
index 00000000..3510e729
--- /dev/null
+++ b/videotuna/training/flux_lora/model_utils.py
@@ -0,0 +1,75 @@
+"""Load Flux components and inject PEFT LoRA adapters."""
+
+from __future__ import annotations
+
+from typing import Any, cast
+
+import torch
+from diffusers import AutoencoderKL, FluxTransformer2DModel
+from peft import LoraConfig, get_peft_model
+from transformers import CLIPTextModel, CLIPTokenizer, T5EncoderModel, T5TokenizerFast
+
+FLUX_LORA_TARGET_MODULES = ["to_k", "to_q", "to_v", "to_out.0"]
+
+
+def load_flux_training_models(
+    pretrained_model_name_or_path: str,
+    lora_rank: int,
+    mixed_precision: str = "bf16",
+    gradient_checkpointing: bool = True,
+):
+    weight_dtype = torch.bfloat16 if mixed_precision == "bf16" else torch.float16
+
+    tokenizer_one = CLIPTokenizer.from_pretrained(
+        pretrained_model_name_or_path, subfolder="tokenizer"
+    )
+    tokenizer_two = T5TokenizerFast.from_pretrained(
+        pretrained_model_name_or_path, subfolder="tokenizer_2"
+    )
+    text_encoder_one = CLIPTextModel.from_pretrained(
+        pretrained_model_name_or_path,
+        subfolder="text_encoder",
+        torch_dtype=weight_dtype,
+    )
+    text_encoder_two = T5EncoderModel.from_pretrained(
+        pretrained_model_name_or_path,
+        subfolder="text_encoder_2",
+        torch_dtype=weight_dtype,
+    )
+    vae = AutoencoderKL.from_pretrained(
+        pretrained_model_name_or_path,
+        subfolder="vae",
+        torch_dtype=weight_dtype,
+    )
+    transformer = FluxTransformer2DModel.from_pretrained(
+        pretrained_model_name_or_path,
+        subfolder="transformer",
+        torch_dtype=weight_dtype,
+    )
+
+    vae.requires_grad_(False)
+    text_encoder_one.requires_grad_(False)
+    text_encoder_two.requires_grad_(False)
+
+    lora_config = LoraConfig(
+        r=lora_rank,
+        lora_alpha=lora_rank,
+        init_lora_weights="gaussian",
+        target_modules=FLUX_LORA_TARGET_MODULES,
+    )
+    transformer = get_peft_model(cast(Any, transformer), lora_config)
+
+    if gradient_checkpointing:
+        transformer.enable_gradient_checkpointing()
+    elif hasattr(transformer, "disable_gradient_checkpointing"):
+        transformer.disable_gradient_checkpointing()
+
+    return {
+        "tokenizer_one": tokenizer_one,
+        "tokenizer_two": tokenizer_two,
+        "text_encoder_one": text_encoder_one,
+        "text_encoder_two": text_encoder_two,
+        "vae": vae,
+        "transformer": transformer,
+        "weight_dtype": weight_dtype,
+    }
diff --git a/videotuna/training/flux_lora/text_embed_cache.py b/videotuna/training/flux_lora/text_embed_cache.py
new file mode 100644
index 00000000..b771c751
--- /dev/null
+++ b/videotuna/training/flux_lora/text_embed_cache.py
@@ -0,0 +1,95 @@
+"""On-disk text embedding cache for Flux LoRA training."""
+
+from __future__ import annotations
+
+import hashlib
+from pathlib import Path
+from typing import Any
+
+import torch
+
+from videotuna.utils.logging_config import bound_logger
+
+logger = bound_logger(phase="t2i", flow="flux_lora")
+
+
+def _caption_cache_path(cache_dir: Path, caption: str) -> Path:
+    digest = hashlib.sha256(caption.encode("utf-8")).hexdigest()
+    return cache_dir / f"{digest}.pt"
+
+
+def _load_cached_embed(path: Path) -> dict[str, torch.Tensor] | None:
+    if not path.is_file():
+        return None
+    data = torch.load(path, map_location="cpu", weights_only=True)
+    if not isinstance(data, dict):
+        return None
+    return data
+
+
+def _save_cached_embed(path: Path, embeds: dict[str, torch.Tensor]) -> None:
+    path.parent.mkdir(parents=True, exist_ok=True)
+    torch.save(
+        {
+            "prompt_embeds": embeds["prompt_embeds"].cpu(),
+            "pooled_prompt_embeds": embeds["pooled_prompt_embeds"].cpu(),
+            "text_ids": embeds["text_ids"].cpu(),
+        },
+        path,
+    )
+
+
+def build_or_load_cache(
+    pipeline: Any,
+    captions: list[str],
+    cache_dir: str | Path,
+    write_batch_size: int,
+    device: torch.device,
+) -> dict[str, dict[str, torch.Tensor]]:
+    """Build or load cached prompt embeddings keyed by caption text."""
+    cache_root = Path(cache_dir)
+    unique_captions = list(dict.fromkeys(captions))
+    lookup: dict[str, dict[str, torch.Tensor]] = {}
+    pending: list[str] = []
+
+    for caption in unique_captions:
+        path = _caption_cache_path(cache_root, caption)
+        cached = _load_cached_embed(path)
+        if cached is not None:
+            lookup[caption] = cached
+        else:
+            pending.append(caption)
+
+    if not pending:
+        logger.info("Text embed cache hit for all {} captions", len(unique_captions))
+        return lookup
+
+    logger.info(
+        "Encoding {} / {} captions into cache (write_batch_size={})",
+        len(pending),
+        len(unique_captions),
+        write_batch_size,
+    )
+    pipeline.text_encoder.to(device)
+    pipeline.text_encoder_2.to(device)
+
+    for start in range(0, len(pending), write_batch_size):
+        batch = pending[start : start + write_batch_size]
+        with torch.no_grad():
+            prompt_embeds, pooled_prompt_embeds, text_ids = pipeline.encode_prompt(
+                prompt=batch,
+                prompt_2=batch,
+                device=device,
+                num_images_per_prompt=1,
+                max_sequence_length=512,
+            )
+        for idx, caption in enumerate(batch):
+            embeds = {
+                "prompt_embeds": prompt_embeds[idx : idx + 1].cpu(),
+                "pooled_prompt_embeds": pooled_prompt_embeds[idx : idx + 1].cpu(),
+                "text_ids": text_ids[idx : idx + 1].cpu(),
+            }
+            lookup[caption] = embeds
+            _save_cached_embed(_caption_cache_path(cache_root, caption), embeds)
+
+    return lookup
diff --git a/videotuna/training/flux_lora/train.py b/videotuna/training/flux_lora/train.py
new file mode 100644
index 00000000..3eede1d2
--- /dev/null
+++ b/videotuna/training/flux_lora/train.py
@@ -0,0 +1,567 @@
+"""Accelerate training loop for Flux LoRA fine-tuning."""
+
+from __future__ import annotations
+
+import json
+from pathlib import Path
+from typing import Any
+
+import torch
+import torch.nn.functional as F
+from accelerate import Accelerator
+from accelerate.utils import ProjectConfiguration, set_seed
+from diffusers import FlowMatchEulerDiscreteScheduler, FluxPipeline
+from diffusers.optimization import get_scheduler
+from torch.utils.data import DataLoader
+from tqdm.auto import tqdm
+
+from videotuna.settings import get_settings
+from videotuna.training.flux_lora.checkpoint import (
+    checkpoint_step,
+    find_latest_checkpoint,
+    load_lora_checkpoint,
+    prune_checkpoints,
+    save_lora_checkpoint,
+)
+from videotuna.training.flux_lora.config import (
+    FluxLoraDataConfig,
+    FluxLoraTrainConfig,
+    load_train_config,
+    stamp_output_dir,
+)
+from videotuna.training.flux_lora.dataset import (
+    FluxBucketBatchSampler,
+    FluxLoraImageDataset,
+    _load_caption,
+)
+from videotuna.training.flux_lora.model_utils import load_flux_training_models
+from videotuna.training.flux_lora.text_embed_cache import build_or_load_cache
+from videotuna.utils.logging_config import bound_logger, resolve_device_label
+from videotuna.utils.training_metrics import (
+    DEFAULT_FLUX_TRACKIO_PROJECT,
+    build_trackio_init_kwargs,
+    describe_metrics_backend,
+    log_validation_image_to_trackio,
+    resolve_accelerate_log_with,
+    trackio_enabled,
+)
+
+logger = bound_logger(phase="t2i", flow="flux_lora")
+
+_IMAGE_EXTENSIONS = {".jpg", ".jpeg", ".png", ".webp", ".bmp"}
+
+
+def create_flux_accelerator(
+    output_dir: Path,
+    *,
+    mixed_precision: str,
+    gradient_accumulation_steps: int = 1,
+    metrics_backend: str | None = None,
+) -> Accelerator:
+    """Build an Accelerate instance with TensorBoard (and optional Trackio) tracking."""
+    if metrics_backend is None:
+        metrics_backend = get_settings().metrics_backend
+    project_config = ProjectConfiguration(
+        project_dir=str(output_dir),
+        logging_dir=str(output_dir / "tensorboard"),
+    )
+    return Accelerator(
+        gradient_accumulation_steps=gradient_accumulation_steps,
+        mixed_precision=mixed_precision,
+        log_with=resolve_accelerate_log_with(metrics_backend),
+        project_config=project_config,
+    )
+
+
+def _flux_tracker_config(config: FluxLoraTrainConfig) -> dict[str, Any]:
+    return {
+        "lora_rank": config.lora_rank,
+        "learning_rate": config.learning_rate,
+        "max_train_steps": config.max_train_steps,
+        "resolution": config.resolution,
+        "gradient_accumulation_steps": config.gradient_accumulation_steps,
+        "pretrained_model_name_or_path": config.pretrained_model_name_or_path,
+    }
+
+
+def _apply_runtime_flags(config: FluxLoraTrainConfig) -> None:
+    if config.disable_tf32:
+        torch.backends.cuda.matmul.allow_tf32 = False
+        torch.backends.cudnn.allow_tf32 = False
+    torch.backends.cudnn.benchmark = not config.disable_benchmark
+
+
+def _parse_validation_resolution(value: str) -> tuple[int, int]:
+    if "x" in value.lower():
+        width_str, height_str = value.lower().split("x", 1)
+        return int(width_str), int(height_str)
+    size = int(value)
+    return size, size
+
+
+def _collect_captions(data_config: FluxLoraDataConfig) -> list[str]:
+    data_dir = Path(data_config.instance_data_dir)
+    captions: list[str] = []
+    for path in sorted(data_dir.iterdir()):
+        if path.suffix.lower() not in _IMAGE_EXTENSIONS:
+            continue
+        captions.append(
+            _load_caption(
+                path,
+                data_config.caption_strategy,
+                data_config.default_caption,
+            )
+        )
+    return captions
+
+
+def _resolve_resume_checkpoint(
+    config: FluxLoraTrainConfig, output_dir: Path
+) -> Path | None:
+    if not config.resume_from_checkpoint:
+        return None
+    if config.resume_from_checkpoint == "latest":
+        return find_latest_checkpoint(output_dir)
+    candidate = Path(config.resume_from_checkpoint)
+    if not candidate.is_absolute():
+        candidate = output_dir / candidate
+    return candidate if candidate.is_dir() else None
+
+
+def _create_optimizer(
+    transformer, config: FluxLoraTrainConfig
+) -> torch.optim.Optimizer:
+    if config.optimizer not in {"adamw", "adamw_bf16"}:
+        raise ValueError(f"Unsupported optimizer: {config.optimizer}")
+    params = transformer.parameters()
+    kwargs = {
+        "lr": config.learning_rate,
+        "betas": (0.9, 0.999),
+        "weight_decay": 1e-4,
+        "eps": 1e-8,
+    }
+    if config.optimizer == "adamw_bf16":
+        from optimi import AdamW as OptimiAdamW
+
+        return OptimiAdamW(params, **kwargs)
+    return torch.optim.AdamW(params, **kwargs)
+
+
+def _collate_batch(batch: list[dict[str, Any]]) -> dict[str, Any]:
+    pixel_values = torch.stack([item["pixel_values"] for item in batch])
+    captions = [item["caption"] for item in batch]
+    collated: dict[str, Any] = {"pixel_values": pixel_values, "caption": captions}
+    if batch and "prompt_embeds" in batch[0]:
+        collated["prompt_embeds"] = torch.cat(
+            [item["prompt_embeds"] for item in batch], dim=0
+        )
+        collated["pooled_prompt_embeds"] = torch.cat(
+            [item["pooled_prompt_embeds"] for item in batch], dim=0
+        )
+        collated["text_ids"] = torch.cat([item["text_ids"] for item in batch], dim=0)
+    return collated
+
+
+def _prepare_batch_latents(vae, pixel_values, weight_dtype):
+    pixel_values = pixel_values.to(dtype=weight_dtype)
+    latents = vae.encode(pixel_values).latent_dist.sample()
+    latents = (latents - vae.config.shift_factor) * vae.config.scaling_factor
+    batch_size, num_channels, height, width = latents.shape
+    packed = FluxPipeline._pack_latents(
+        latents, batch_size, num_channels, height, width
+    )
+    return packed, height, width
+
+
+def _compute_loss(
+    pipeline: Any,
+    transformer,
+    batch,
+    weight_dtype,
+    accelerator,
+) -> torch.Tensor:
+    pixel_values = batch["pixel_values"]
+    captions = batch["caption"]
+    if isinstance(captions, str):
+        captions = [captions]
+
+    with torch.no_grad():
+        if "prompt_embeds" in batch:
+            prompt_embeds = batch["prompt_embeds"].to(accelerator.device)
+            pooled_prompt_embeds = batch["pooled_prompt_embeds"].to(accelerator.device)
+            text_ids = batch["text_ids"].to(accelerator.device)
+        else:
+            prompt_embeds, pooled_prompt_embeds, text_ids = pipeline.encode_prompt(
+                prompt=captions,
+                prompt_2=captions,
+                device=accelerator.device,
+                num_images_per_prompt=1,
+                max_sequence_length=512,
+            )
+        model_input, latent_height, latent_width = _prepare_batch_latents(
+            pipeline.vae, pixel_values, weight_dtype
+        )
+        noise = torch.randn_like(model_input)
+        bsz = model_input.shape[0]
+        u = torch.rand(bsz, device=accelerator.device)
+        sigmas = u
+        timesteps = (sigmas * pipeline.scheduler.config.num_train_timesteps).long()
+        sigmas = sigmas.view(-1, 1, 1)
+        noisy_input = (1.0 - sigmas) * model_input + sigmas * noise
+        target = noise - model_input
+
+        latent_image_ids = FluxPipeline._prepare_latent_image_ids(
+            bsz,
+            latent_height // 2,
+            latent_width // 2,
+            accelerator.device,
+            weight_dtype,
+        )
+
+    guidance = torch.tensor([1.0], device=accelerator.device, dtype=weight_dtype)
+    guidance = guidance.expand(model_input.shape[0])
+
+    model_pred = transformer(
+        hidden_states=noisy_input,
+        timestep=timesteps / 1000,
+        guidance=guidance,
+        pooled_projections=pooled_prompt_embeds,
+        encoder_hidden_states=prompt_embeds,
+        txt_ids=text_ids,
+        img_ids=latent_image_ids,
+        return_dict=False,
+    )[0]
+
+    return F.mse_loss(model_pred.float(), target.float(), reduction="mean")
+
+
+def _run_validation(
+    pipeline: FluxPipeline,
+    config: FluxLoraTrainConfig,
+    output_dir: Path,
+    global_step: int,
+    accelerator: Accelerator,
+    weight_dtype: torch.dtype,
+    log,
+    *,
+    metrics_backend: str,
+) -> None:
+    if not config.validation_prompt or not config.validation_steps:
+        return
+    if global_step % config.validation_steps != 0:
+        return
+    if not accelerator.is_main_process:
+        return
+
+    width, height = _parse_validation_resolution(config.validation_resolution)
+    validation_dir = output_dir / "validation"
+    validation_dir.mkdir(parents=True, exist_ok=True)
+
+    generator = torch.Generator(device=accelerator.device).manual_seed(
+        config.validation_seed
+    )
+    pipeline.transformer.eval()
+    with torch.inference_mode():
+        result = pipeline(
+            prompt=config.validation_prompt,
+            height=height,
+            width=width,
+            num_inference_steps=config.validation_num_inference_steps,
+            guidance_scale=config.validation_guidance,
+            generator=generator,
+        )
+    pipeline.transformer.train()
+    image = result.images[0]
+    image_path = validation_dir / f"step-{global_step:06d}.png"
+    image.save(image_path)
+    log.info("Saved validation image to {}", image_path)
+
+    tracker = accelerator.trackers[0] if accelerator.trackers else None
+    if tracker is not None and hasattr(tracker, "writer"):
+        import numpy as np
+
+        array = np.array(image.convert("RGB")).transpose(2, 0, 1)
+        tracker.writer.add_image(
+            "validation/sample",
+            array,
+            global_step,
+            dataformats="CHW",
+        )
+
+    if trackio_enabled(metrics_backend):
+        log_validation_image_to_trackio(image, global_step)
+
+
+def _build_dataloader(
+    dataset: FluxLoraImageDataset,
+    config: FluxLoraTrainConfig,
+) -> DataLoader:
+    if config.train_batch_size > 1:
+        batch_sampler = FluxBucketBatchSampler(
+            dataset.bucket_ids,
+            config.train_batch_size,
+            shuffle=True,
+            seed=config.seed,
+        )
+        return DataLoader(
+            dataset,
+            batch_sampler=batch_sampler,
+            num_workers=config.num_workers,
+            pin_memory=torch.cuda.is_available(),
+            collate_fn=_collate_batch,
+        )
+    return DataLoader(
+        dataset,
+        batch_size=config.train_batch_size,
+        shuffle=True,
+        num_workers=config.num_workers,
+        pin_memory=torch.cuda.is_available(),
+        collate_fn=_collate_batch,
+    )
+
+
+def _build_embed_lookup(
+    pipeline: FluxPipeline,
+    data_config: FluxLoraDataConfig,
+    config: FluxLoraTrainConfig,
+    device: torch.device,
+) -> dict[str, dict[str, torch.Tensor]]:
+    if not data_config.text_embeds or data_config.text_embeds.disabled:
+        return {}
+    captions = _collect_captions(data_config)
+    write_batch_size = (
+        data_config.text_embeds.write_batch_size or config.write_batch_size
+    )
+    return build_or_load_cache(
+        pipeline,
+        captions,
+        data_config.text_embeds.cache_dir,
+        write_batch_size,
+        device,
+    )
+
+
+def _run_training_loop(
+    *,
+    config: FluxLoraTrainConfig,
+    output_dir: Path,
+    pipeline: FluxPipeline,
+    transformer,
+    dataloader: DataLoader,
+    optimizer,
+    lr_scheduler,
+    accelerator: Accelerator,
+    weight_dtype: torch.dtype,
+    global_step: int,
+    max_train_steps: int,
+    log,
+    metrics_backend: str,
+) -> None:
+    progress = tqdm(
+        range(global_step, max_train_steps),
+        disable=not accelerator.is_main_process,
+        desc="Flux LoRA",
+        initial=global_step,
+        total=max_train_steps,
+    )
+
+    while global_step < max_train_steps:
+        for batch in dataloader:
+            with accelerator.accumulate(transformer):
+                loss = _compute_loss(
+                    pipeline,
+                    transformer,
+                    batch,
+                    weight_dtype,
+                    accelerator,
+                )
+                accelerator.backward(loss)
+                optimizer.step()
+                lr_scheduler.step()
+                optimizer.zero_grad(set_to_none=True)
+
+            if not accelerator.sync_gradients:
+                continue
+
+            global_step += 1
+            progress.update(1)
+            progress.set_postfix(loss=f"{loss.item():.4f}", step=global_step)
+            accelerator.log(
+                {
+                    "train/loss": loss.item(),
+                    "train/lr": lr_scheduler.get_last_lr()[0],
+                },
+                step=global_step,
+            )
+            _run_validation(
+                pipeline,
+                config,
+                output_dir,
+                global_step,
+                accelerator,
+                weight_dtype,
+                log,
+                metrics_backend=metrics_backend,
+            )
+            if (
+                global_step % config.checkpointing_steps == 0
+                or global_step == max_train_steps
+            ) and accelerator.is_main_process:
+                unwrapped = accelerator.unwrap_model(transformer)
+                ckpt = save_lora_checkpoint(unwrapped, output_dir, global_step)
+                prune_checkpoints(output_dir, config.checkpoints_total_limit)
+                log.info("Saved LoRA checkpoint to {}", ckpt)
+            if global_step >= max_train_steps:
+                break
+
+    progress.close()
+
+
+def train(config: FluxLoraTrainConfig, data_config: FluxLoraDataConfig) -> None:
+    set_seed(config.seed)
+    _apply_runtime_flags(config)
+    output_dir = Path(config.output_dir)
+    output_dir.mkdir(parents=True, exist_ok=True)
+
+    settings = get_settings()
+    metrics_backend = settings.metrics_backend
+
+    accelerator = create_flux_accelerator(
+        output_dir,
+        mixed_precision=config.mixed_precision,
+        gradient_accumulation_steps=config.gradient_accumulation_steps,
+        metrics_backend=metrics_backend,
+    )
+    log = logger.bind(device=resolve_device_label(accelerator.device))
+    if accelerator.is_main_process:
+        log.info("Training Flux LoRA → {}", output_dir)
+        log.info("Metrics backend: {}", describe_metrics_backend(metrics_backend))
+
+    components = load_flux_training_models(
+        config.pretrained_model_name_or_path,
+        lora_rank=config.lora_rank,
+        mixed_precision=config.mixed_precision,
+        gradient_checkpointing=config.gradient_checkpointing,
+    )
+    weight_dtype = components["weight_dtype"]
+    transformer = components["transformer"]
+
+    global_step = 0
+    resume_path = _resolve_resume_checkpoint(config, output_dir)
+    if resume_path is not None:
+        load_lora_checkpoint(transformer, resume_path)
+        global_step = checkpoint_step(resume_path)
+        log.info("Resumed LoRA weights from {} (step {})", resume_path, global_step)
+    elif config.resume_from_checkpoint:
+        raise ValueError(
+            f"No checkpoint found for resume_from_checkpoint="
+            f"{config.resume_from_checkpoint!r} under output_dir={output_dir}. "
+            "Remove resume_from_checkpoint or set it to null to start fresh, "
+            "or point output_dir at a run that contains checkpoint-* directories."
+        )
+
+    pipeline = FluxPipeline.from_pretrained(
+        config.pretrained_model_name_or_path,
+        vae=components["vae"],
+        text_encoder=components["text_encoder_one"],
+        text_encoder_2=components["text_encoder_two"],
+        tokenizer=components["tokenizer_one"],
+        tokenizer_2=components["tokenizer_two"],
+        transformer=transformer,
+        torch_dtype=weight_dtype,
+    )
+    pipeline.scheduler = FlowMatchEulerDiscreteScheduler.from_pretrained(
+        config.pretrained_model_name_or_path,
+        subfolder="scheduler",
+    )
+    pipeline.vae.to(accelerator.device)
+    pipeline.text_encoder.to(accelerator.device)
+    pipeline.text_encoder_2.to(accelerator.device)
+
+    embed_lookup = _build_embed_lookup(
+        pipeline, data_config, config, accelerator.device
+    )
+    dataset = FluxLoraImageDataset(
+        data_config,
+        embed_lookup=embed_lookup,
+        seed=config.seed,
+    )
+    dataloader = _build_dataloader(dataset, config)
+
+    optimizer = _create_optimizer(transformer, config)
+    opt_impl = (
+        "optimi.AdamW" if config.optimizer == "adamw_bf16" else "torch.optim.AdamW"
+    )
+    log.info("Using optimizer {} ({})", config.optimizer, opt_impl)
+    max_train_steps = config.max_train_steps
+    lr_scheduler = get_scheduler(
+        config.lr_scheduler,
+        optimizer=optimizer,
+        num_warmup_steps=config.lr_warmup_steps,
+        num_training_steps=max_train_steps,
+    )
+
+    transformer, optimizer, dataloader, lr_scheduler = accelerator.prepare(
+        transformer, optimizer, dataloader, lr_scheduler
+    )
+    pipeline.transformer = accelerator.unwrap_model(transformer)
+    if global_step > 0:
+        for _ in range(global_step):
+            lr_scheduler.step()
+        log.info(
+            "Advanced LR scheduler to step {} (optimizer state not restored)",
+            global_step,
+        )
+    trackio_project = settings.trackio_project or DEFAULT_FLUX_TRACKIO_PROJECT
+    trackio_init_kwargs = build_trackio_init_kwargs(space_id=settings.trackio_space_id)
+    accelerator.init_trackers(
+        trackio_project,
+        config=_flux_tracker_config(config),
+        init_kwargs=trackio_init_kwargs,
+    )
+
+    _run_training_loop(
+        config=config,
+        output_dir=output_dir,
+        pipeline=pipeline,
+        transformer=transformer,
+        dataloader=dataloader,
+        optimizer=optimizer,
+        lr_scheduler=lr_scheduler,
+        accelerator=accelerator,
+        weight_dtype=weight_dtype,
+        global_step=global_step,
+        max_train_steps=max_train_steps,
+        log=log,
+        metrics_backend=metrics_backend,
+    )
+
+    accelerator.end_training()
+    if accelerator.is_main_process:
+        model_path = config.pretrained_model_name_or_path
+        with open(output_dir / "training_config.json", "w") as f:
+            json.dump(
+                {
+                    "pretrained_model_name_or_path": model_path,
+                    "lora_rank": config.lora_rank,
+                    "max_train_steps": config.max_train_steps,
+                    "resolution": config.resolution,
+                },
+                f,
+                indent=2,
+            )
+        log.info("Training finished. Output: {}", output_dir)
+
+
+def run_training(
+    config_path: str, data_config_path: str, stamp_output: bool = True
+) -> None:
+    train_cfg, data_cfg = load_train_config(config_path, data_config_path)
+    if stamp_output and not train_cfg.resume_from_checkpoint:
+        train_cfg.output_dir = stamp_output_dir(train_cfg.output_dir)
+        with open(config_path) as f:
+            raw = json.load(f)
+        raw["--output_dir"] = train_cfg.output_dir
+        with open(config_path, "w") as f:
+            json.dump(raw, f, indent=4)
+    train(train_cfg, data_cfg)
diff --git a/videotuna/training/wan_lora/__init__.py b/videotuna/training/wan_lora/__init__.py
new file mode 100644
index 00000000..3dd2ada7
--- /dev/null
+++ b/videotuna/training/wan_lora/__init__.py
@@ -0,0 +1,13 @@
+"""First-party Wan 2.1 domain LoRA training config (Pydantic + YAML)."""
+
+from videotuna.training.wan_lora.config import (
+    WanLoraTrainConfig,
+    load_wan_lora_config,
+    validated_config_to_dictconfig,
+)
+
+__all__ = [
+    "WanLoraTrainConfig",
+    "load_wan_lora_config",
+    "validated_config_to_dictconfig",
+]
diff --git a/videotuna/training/wan_lora/config.py b/videotuna/training/wan_lora/config.py
new file mode 100644
index 00000000..c0fc55b0
--- /dev/null
+++ b/videotuna/training/wan_lora/config.py
@@ -0,0 +1,245 @@
+"""Load and validate domain Wan LoRA YAML configs via Pydantic v2."""
+
+from __future__ import annotations
+
+from pathlib import Path
+from typing import Any, Literal, Self, Sequence
+
+import torch
+from omegaconf import DictConfig, OmegaConf
+from pydantic import BaseModel, ConfigDict, Field, model_validator
+
+from videotuna.utils.config_mapping import apply_config_mappings
+
+WAN_VIDEO_FLOW_TARGET = "videotuna.flow.wanvideo.WanVideoModelFlow"
+DATA_MODULE_TARGET = "videotuna.data.lightningdata.DataModuleFromConfig"
+DATASET_FROM_CSV_TARGET = "videotuna.data.datasets.DatasetFromCSV"
+
+
+class InstantiateConfig(BaseModel):
+    model_config = ConfigDict(extra="forbid")
+
+    target: str
+    params: dict[str, Any] = Field(default_factory=dict)
+    use_from_pretrained: bool = False
+
+
+class WanLoraFlowParams(BaseModel):
+    model_config = ConfigDict(extra="forbid")
+
+    task: Literal["t2v-14B", "i2v-14B"]
+    ckpt_path: str
+    offload_model: bool = True
+    ulysses_size: int = 1
+    ring_size: int = 1
+    t5_fsdp: bool = False
+    t5_cpu: bool = False
+    dit_fsdp: bool = False
+    use_prompt_extend: bool = False
+    prompt_extend_method: str = "local_qwen"
+    prompt_extend_model: str | None = None
+    prompt_extend_target_lang: str = "zh"
+    seed: int = 42
+    gradient_checkpointing: bool = True
+    denoiser_config: InstantiateConfig
+    first_stage_config: InstantiateConfig
+    cond_stage_config: InstantiateConfig
+    lora_config: InstantiateConfig
+
+    @model_validator(mode="after")
+    def validate_lora_params(self) -> Self:
+        r = self.lora_config.params.get("r")
+        alpha = self.lora_config.params.get("lora_alpha")
+        if r is not None and int(r) <= 0:
+            raise ValueError("lora_config.params.r must be > 0")
+        if alpha is not None and float(alpha) <= 0:
+            raise ValueError("lora_config.params.lora_alpha must be > 0")
+        return self
+
+
+class WanFlowConfig(BaseModel):
+    model_config = ConfigDict(extra="forbid")
+
+    target: str
+    params: WanLoraFlowParams
+
+    @model_validator(mode="after")
+    def validate_flow_target(self) -> Self:
+        if self.target != WAN_VIDEO_FLOW_TARGET:
+            raise ValueError(
+                f"flow.target must be {WAN_VIDEO_FLOW_TARGET!r}, got {self.target!r}"
+            )
+        return self
+
+
+class WanTrainerConfig(BaseModel):
+    model_config = ConfigDict(extra="allow")
+
+    accelerator: str = "gpu"
+    benchmark: bool = True
+    num_nodes: int = 1
+    accumulate_grad_batches: int = 1
+    max_epochs: int
+    precision: str = "bf16-mixed"
+
+
+class WanLightningCallbacks(BaseModel):
+    model_config = ConfigDict(protected_namespaces=())
+
+    image_logger: InstantiateConfig
+    model_checkpoint: InstantiateConfig
+
+
+class WanLightningConfig(BaseModel):
+    model_config = ConfigDict(extra="forbid")
+
+    strategy: str
+    trainer: WanTrainerConfig
+    callbacks: WanLightningCallbacks
+
+
+class WanTrainSection(BaseModel):
+    model_config = ConfigDict(extra="allow")
+
+    name: str
+    ckpt: str
+    logdir: str
+    seed: int
+    debug: bool = False
+    first_stage_key: str
+    cond_stage_key: str
+    mapping: dict[str, str] | None = None
+    lr_config: dict[str, Any]
+    data: InstantiateConfig
+    lightning: WanLightningConfig
+
+    @model_validator(mode="after")
+    def validate_data_targets(self) -> Self:
+        if self.data.target != DATA_MODULE_TARGET:
+            raise ValueError(
+                f"train.data.target must be {DATA_MODULE_TARGET!r}, "
+                f"got {self.data.target!r}"
+            )
+        train_dataset = self.data.params.get("train")
+        if not isinstance(train_dataset, dict):
+            raise ValueError("train.data.params.train must be present")
+        if train_dataset.get("target") != DATASET_FROM_CSV_TARGET:
+            raise ValueError(
+                f"train.data.params.train.target must be {DATASET_FROM_CSV_TARGET!r}, "
+                f"got {train_dataset.get('target')!r}"
+            )
+        return self
+
+
+class WanInferenceSection(BaseModel):
+    model_config = ConfigDict(extra="allow")
+
+    mode: Literal["t2v", "i2v"]
+    ckpt_path: str
+    savedir: str
+    seed: int
+    height: int
+    width: int
+    image: str | None = None
+    prompt_file: str | None = None
+    prompt_dir: str | None = None
+    solver: str = "unipc"
+    num_inference_steps: int = 20
+    time_shift: float = 3.0
+    unconditional_guidance_scale: float = 5.0
+    frames: int = 81
+    n_samples_prompt: int = 1
+    bs: int = 1
+    savefps: int = 30
+    enable_model_cpu_offload: bool = True
+    mapping: dict[str, str] | None = None
+
+
+class WanLoraTrainConfig(BaseModel):
+    model_config = ConfigDict(extra="forbid")
+
+    flow: WanFlowConfig
+    train: WanTrainSection
+    inference: WanInferenceSection
+
+    @model_validator(mode="after")
+    def validate_i2v_consistency(self) -> Self:
+        if self.flow.params.task != "i2v-14B":
+            return self
+        if self.inference.mode != "i2v":
+            raise ValueError(
+                "inference.mode must be 'i2v' when flow.params.task is i2v-14B"
+            )
+        denoiser_params = self.flow.params.denoiser_config.params
+        if denoiser_params.get("model_type") != "i2v":
+            raise ValueError(
+                "flow.params.denoiser_config.params.model_type must be "
+                "'i2v' for i2v-14B"
+            )
+        if denoiser_params.get("subfolder") != "high_noise_model":
+            raise ValueError(
+                "flow.params.denoiser_config.params.subfolder must be "
+                "'high_noise_model' for i2v-14B"
+            )
+        return self
+
+
+def _register_dtype_resolver() -> None:
+    if OmegaConf.has_resolver("dtype_resolver"):
+        return
+
+    def resolve_dtype(dtype_str: str):
+        mapping = {
+            "torch.float16": torch.float16,
+            "torch.float32": torch.float32,
+            "torch.float64": torch.float64,
+            "torch.bfloat16": torch.bfloat16,
+        }
+        return mapping.get(dtype_str)
+
+    OmegaConf.register_new_resolver("dtype_resolver", resolve_dtype)
+
+
+def _merge_and_resolve(
+    config_paths: Sequence[str | Path],
+    cli_overrides: list[str] | None = None,
+    *,
+    apply_inference_mapping: bool = True,
+) -> dict[str, Any]:
+    configs = [OmegaConf.load(str(path)) for path in config_paths]
+    cli = OmegaConf.from_dotlist(cli_overrides or [])
+    if configs:
+        merged = OmegaConf.merge(*configs, cli)
+    else:
+        merged = cli
+
+    if not isinstance(merged, DictConfig):
+        raise TypeError(f"Expected YAML mapping config, got {type(merged).__name__}")
+
+    apply_config_mappings(merged, section="train")
+    if apply_inference_mapping:
+        apply_config_mappings(merged, section="inference")
+
+    _register_dtype_resolver()
+    OmegaConf.resolve(merged)
+    resolved = OmegaConf.to_container(merged, resolve=True)
+    if not isinstance(resolved, dict):
+        raise TypeError("Wan LoRA config must resolve to a mapping")
+    return resolved
+
+
+def load_wan_lora_config(
+    config_path: str | Path,
+    *,
+    extra_configs: Sequence[str | Path] = (),
+    cli_overrides: list[str] | None = None,
+) -> WanLoraTrainConfig:
+    """Load domain Wan YAML through merge, mapping, resolve, and Pydantic validation."""
+    paths = [Path(config_path), *[Path(path) for path in extra_configs]]
+    resolved = _merge_and_resolve(paths, cli_overrides)
+    return WanLoraTrainConfig.model_validate(resolved)
+
+
+def validated_config_to_dictconfig(config: WanLoraTrainConfig) -> DictConfig:
+    """Convert a validated model back to OmegaConf for existing runtime code."""
+    return OmegaConf.create(config.model_dump(mode="json"))
diff --git a/videotuna/utils/args_utils.py b/videotuna/utils/args_utils.py
index 362c441d..2dd1d9f9 100644
--- a/videotuna/utils/args_utils.py
+++ b/videotuna/utils/args_utils.py
@@ -1,16 +1,23 @@
 import argparse
-import json
+import os
 import time
-from colorama import Fore, Style
-from omegaconf import OmegaConf, MissingMandatoryValue
+from enum import Enum
 from pathlib import Path
 from typing import Union
+
 import torch
-from enum import Enum
+from colorama import Fore, Style
+from loguru import logger
+from omegaconf import DictConfig, OmegaConf
 from pytorch_lightning import Trainer
+
+from videotuna.training.wan_lora.config import (
+    WanLoraTrainConfig,
+    validated_config_to_dictconfig,
+)
+from videotuna.utils.config_mapping import apply_config_mappings
 from videotuna.utils.lightning_utils import add_trainer_args_to_parser
-from loguru import logger
-import os
+
 
 class VideoMode(Enum):
     I2V = "i2v"
@@ -20,7 +27,7 @@ class VideoMode(Enum):
 MANDATORY_INFERENCE_ARGS = ["savedir"]
 
 
-def prepare_train_args(parser: argparse.Namespace):
+def prepare_train_args(parser: argparse.ArgumentParser):
     """
     Prepare the arguments by updating the config with the command line arguments.
 
@@ -36,29 +43,23 @@ def prepare_train_args(parser: argparse.Namespace):
 
     configs = [OmegaConf.load(cfg) for cfg in args.base]
     cli = OmegaConf.from_dotlist(unknown)
-    config = OmegaConf.merge(*configs, cli)
+    merged = OmegaConf.merge(*configs, cli)
+    if not isinstance(merged, DictConfig):
+        raise TypeError(f"Expected YAML mapping config, got {type(merged).__name__}")
+    config = merged
 
-    ## parser args replace train config 
+    ## parser args replace train config
     train_config = config.get("train", OmegaConf.create())
     for k, v in vars(args).items():
-        if not k in train_config.keys():
+        if k not in train_config.keys():
             train_config[k] = v
         else:
             if v is not None:
                 train_config[k] = v
 
-    if OmegaConf.select(config, 'train.mapping') is not None:
-        for source_path, target_path in config.train.mapping.items():
-            if not path_exists(config, source_path):
-                raise ValueError(f"Error: invalid mapping {source_path} not exists")
-            if not path_exists(config, target_path):
-                raise ValueError(f"Error: invalid mapping {target_path} not exists")
-            
-            value = OmegaConf.select(config, source_path)
-            if value is not None:
-                OmegaConf.update(config, target_path, value)
-                logger.info(f"update {target_path} by {source_path} value: {value}")
+    apply_config_mappings(config, section="train")
     logger.info(f"All Config: {OmegaConf.to_yaml(config)}")
+
     def resolve_dtype(dtype_str):
         mapping = {
             "torch.float16": torch.float16,
@@ -67,13 +68,21 @@ def resolve_dtype(dtype_str):
             "torch.bfloat16": torch.bfloat16,
         }
         return mapping.get(dtype_str)
-    OmegaConf.register_new_resolver("dtype_resolver", resolve_dtype)
+
+    if not OmegaConf.has_resolver("dtype_resolver"):
+        OmegaConf.register_new_resolver("dtype_resolver", resolve_dtype)
 
     ## extract trainer config
-    trainer_config = config.train.lightning.trainer 
+    trainer_config = config.train.lightning.trainer
     for k in get_nondefault_trainer_args(args):
         trainer_config[k] = getattr(args, k)
-    return config
+
+    resolved = OmegaConf.to_container(config, resolve=True)
+    if not isinstance(resolved, dict):
+        raise TypeError("Training config must resolve to a mapping")
+    validated = WanLoraTrainConfig.model_validate(resolved)
+    return validated_config_to_dictconfig(validated)
+
 
 def get_nondefault_trainer_args(args):
     parser = argparse.ArgumentParser()
@@ -86,15 +95,8 @@ def get_nondefault_trainer_args(args):
         if getattr(args, k) != getattr(default_trainer_args, k)
     )
 
-# omegaconf has bug, does not work as expected
-def path_exists(cfg, path):
-    try:
-        OmegaConf.select(cfg, path, throw_on_missing=True)
-        return True
-    except MissingMandatoryValue:
-        return False
 
-def prepare_inference_args(args: argparse.Namespace, config: OmegaConf):
+def prepare_inference_args(args: argparse.Namespace, config: DictConfig) -> DictConfig:
     """
     Prepare the arguments by updating the config with the command line arguments.
 
@@ -102,36 +104,35 @@ def prepare_inference_args(args: argparse.Namespace, config: OmegaConf):
     :param config: The config object.
     :return: The updated config object.
     """
+    from videotuna.utils.inference_cli import (
+        prepare_cli_inference_args,
+        validate_cpu_offload_flags,
+    )
+    from videotuna.utils.inference_profile import resolve_inference_profile
+
+    prepare_cli_inference_args(args)
 
     # update the config with the command line arguments
     inference_config = config.pop("inference", OmegaConf.create())
     for k, v in vars(args).items():
-        if not k in inference_config.keys():
+        if k not in inference_config.keys():
             inference_config[k] = v
         else:
             if v is not None:
                 inference_config[k] = v
-                
+
+    resolve_inference_profile(inference_config)
+    validate_cpu_offload_flags(inference_config)
+
     check_args(inference_config)
-    inference_config.savedir = process_savedir(inference_config.savedir)    
+    inference_config.savedir = process_savedir(inference_config.savedir)
     config.inference = inference_config
     print_inference_config(inference_config)
 
-
-    #update flow config with inference mapping config
-    if OmegaConf.select(config, 'inference.mapping') is not None:
-        for source_path, target_path in config.inference.mapping.items():
-            if not path_exists(config, source_path):
-                raise ValueError(f"Error: invalid mapping {source_path} not exists")
-            if not path_exists(config, target_path):
-                raise ValueError(f"Error: invalid mapping {target_path} not exists")
-            
-            value = OmegaConf.select(config, source_path)
-            if value is not None:
-                OmegaConf.update(config, target_path, value)
-                logger.info(f"update {target_path} by {source_path} value: {value}")
+    apply_config_mappings(config, section="inference")
 
     logger.info(f"All Config: {OmegaConf.to_yaml(config)}")
+
     # resolve interpolation first
     def resolve_dtype(dtype_str):
         mapping = {
@@ -141,12 +142,17 @@ def resolve_dtype(dtype_str):
             "torch.bfloat16": torch.bfloat16,
         }
         return mapping.get(dtype_str)
-    OmegaConf.register_new_resolver("dtype_resolver", resolve_dtype)
-    config = OmegaConf.to_container(config, resolve=True)
-    config = OmegaConf.create(config, flags={"allow_objects": True})
+
+    if not OmegaConf.has_resolver("dtype_resolver"):
+        OmegaConf.register_new_resolver("dtype_resolver", resolve_dtype)
+    resolved = OmegaConf.to_container(config, resolve=True)
+    if not isinstance(resolved, dict):
+        raise TypeError("Inference config must resolve to a mapping")
+    config = OmegaConf.create(resolved, flags={"allow_objects": True})
     return config
 
-def check_args(inference_config: OmegaConf):
+
+def check_args(inference_config: DictConfig):
     """
     Check if all the mandatory arguments are provided.
 
@@ -160,7 +166,7 @@ def check_args(inference_config: OmegaConf):
 def process_savedir(savedir: str):
     """
     Process the savedir.
-    Add the current time to the savedir. 
+    Add the current time to the savedir.
     Remove empty directories.
 
     :param savedir: The savedir config.
@@ -169,17 +175,18 @@ def process_savedir(savedir: str):
 
     save_time = time.strftime("%Y%m%d_%H%M%S")
     savedir = os.path.join(savedir, save_time)
-    
+
     # create the savedir
     Path(savedir).mkdir(parents=True, exist_ok=True)
 
     return savedir
 
 
-def print_inference_config(inference_config: OmegaConf):
+def print_inference_config(inference_config: DictConfig):
     """
     Print the basic information of the inference config.
-    Such as the mode, savedir, the seed, the height, width, frames, fps, n_samples_prompt, bs.
+    Such as the mode, savedir, the seed, the height, width, frames, fps,
+    n_samples_prompt, bs.
 
     :param inference_config: The inference config.
     """
@@ -193,7 +200,7 @@ def print_inference_config(inference_config: OmegaConf):
     # Header
     border = f"{BORDER}{'=' * 60}{RESET}"
     title = f"{HEADER}Inference Configuration{RESET}"
-    
+
     print(border)
     print(f"{title:^60}")
     print(border)
@@ -220,4 +227,3 @@ def print_item(key: str, value: Union[int, str, float, None]):
 
     # Footer
     print(border)
-    
diff --git a/videotuna/utils/attention.py b/videotuna/utils/attention.py
new file mode 100644
index 00000000..cb3ca237
--- /dev/null
+++ b/videotuna/utils/attention.py
@@ -0,0 +1,377 @@
+"""
+Unified attention backend selection for VideoTuna model families.
+
+See ``videotuna.settings.PrivTuneSettings`` for VIDEOTUNA_* env vars.
+"""
+
+from __future__ import annotations
+
+import importlib
+import math
+import os
+from contextlib import contextmanager
+from typing import Literal, Optional, Tuple, cast
+
+import torch
+import torch.nn as nn
+import torch.nn.functional as F
+from loguru import logger
+
+from videotuna.settings import ENV_ATTN_BACKEND, get_settings
+from videotuna.utils.device_utils import detect_compute_backend, gpu_is_available
+
+AttnBackend = Literal["flash", "sdpa", "eager"]
+AttnLayout = Literal["bsnd", "bhsd"]
+
+
+def _optional_attr(module_name: str, attr_name: str):
+    try:
+        module = importlib.import_module(module_name)
+    except ImportError:
+        return None
+    return getattr(module, attr_name, None)
+
+
+_FLASH_ATTN_FUNC = _optional_attr("flash_attn", "flash_attn_func")
+_FLASH_ATTN_VARLEN_FUNC = _optional_attr("flash_attn", "flash_attn_varlen_func")
+_FLASH_ATTN_3_VARLEN_FUNC = _optional_attr(
+    "flash_attn_interface", "flash_attn_varlen_func"
+)
+_FLASH_ATTN_AVAILABLE = _FLASH_ATTN_FUNC is not None
+
+
+def is_flash_attn_available() -> bool:
+    return _FLASH_ATTN_AVAILABLE
+
+
+def _resolve_auto_backend() -> AttnBackend:
+    if detect_compute_backend() == "rocm":
+        return "sdpa" if gpu_is_available() else "eager"
+    if _FLASH_ATTN_AVAILABLE and gpu_is_available():
+        return "flash"
+    if gpu_is_available():
+        return "sdpa"
+    return "eager"
+
+
+def get_attn_backend_requested() -> str:
+    """Return the attention backend requested via env (before fallback)."""
+    return get_settings().attn_backend
+
+
+def get_attn_backend() -> AttnBackend:
+    """Resolve the active attention backend from env or auto-detection."""
+    settings = get_settings()
+    requested = settings.attn_backend
+    if requested == "auto":
+        return _resolve_auto_backend()
+    if requested in ("flash", "sdpa", "eager"):
+        if requested == "flash":
+            if detect_compute_backend() in ("rocm", "cpu"):
+                backend_label = (
+                    "AMD ROCm" if detect_compute_backend() == "rocm" else "CPU"
+                )
+                raise RuntimeError(
+                    "VIDEOTUNA_ATTN_BACKEND=flash is not supported on "
+                    f"{backend_label}. "
+                    "Use VIDEOTUNA_ATTN_BACKEND=sdpa or eager. "
+                    "See docs/install-rocm.md or docs/install-cpu.md."
+                )
+            if not _FLASH_ATTN_AVAILABLE:
+                if settings.attn_backend_strict:
+                    raise RuntimeError(
+                        "VIDEOTUNA_ATTN_BACKEND=flash requires flash-attn. "
+                        "Install with: poetry run install-flash-attn"
+                    )
+                logger.warning(
+                    "VIDEOTUNA_ATTN_BACKEND=flash requested but flash-attn is not "
+                    "installed; falling back to sdpa. Set "
+                    "VIDEOTUNA_ATTN_BACKEND_STRICT=1 to fail instead."
+                )
+                return "sdpa"
+        if requested == "sdpa" and not gpu_is_available():
+            return "eager"
+        return requested  # type: ignore[return-value]
+    raise ValueError(
+        f"Invalid {ENV_ATTN_BACKEND}={requested!r}. "
+        "Expected auto, flash, sdpa, or eager."
+    )
+
+
+def get_resolved_attn_backend() -> AttnBackend:
+    """Alias for get_attn_backend (resolved after auto-detection / fallback)."""
+    return get_attn_backend()
+
+
+def get_torch_compile_mode() -> str:
+    return get_settings().torch_compile_mode
+
+
+def _to_bhsd(
+    q: torch.Tensor, k: torch.Tensor, v: torch.Tensor, layout: AttnLayout
+) -> Tuple[torch.Tensor, torch.Tensor, torch.Tensor]:
+    if layout == "bhsd":
+        return q, k, v
+    return q.transpose(1, 2), k.transpose(1, 2), v.transpose(1, 2)
+
+
+def _from_bhsd(x: torch.Tensor, layout: AttnLayout) -> torch.Tensor:
+    if layout == "bhsd":
+        return x
+    return x.transpose(1, 2)
+
+
+@contextmanager
+def _sdpa_context():
+    """Prefer flash/mem-efficient SDPA kernels on CUDA when available."""
+    if not gpu_is_available():
+        yield
+        return
+    try:
+        from torch.nn.attention import SDPBackend, sdpa_kernel
+
+        if detect_compute_backend() == "rocm":
+            backends = [SDPBackend.EFFICIENT_ATTENTION, SDPBackend.MATH]
+        else:
+            backends = [
+                SDPBackend.FLASH_ATTENTION,
+                SDPBackend.EFFICIENT_ATTENTION,
+                SDPBackend.MATH,
+            ]
+        with sdpa_kernel(backends):
+            yield
+    except (ImportError, AttributeError):
+        yield
+
+
+def attention_eager(
+    q: torch.Tensor,
+    k: torch.Tensor,
+    v: torch.Tensor,
+    *,
+    attn_mask: Optional[torch.Tensor] = None,
+    dropout_p: float = 0.0,
+    causal: bool = False,
+    scale: Optional[float] = None,
+    layout: AttnLayout = "bsnd",
+) -> torch.Tensor:
+    q, k, v = _to_bhsd(q, k, v, layout)
+    if scale is None:
+        scale = 1.0 / math.sqrt(q.size(-1))
+
+    b, _, s, _ = q.shape
+    s1 = k.size(2)
+    attn_bias = torch.zeros(b, q.size(1), s, s1, dtype=q.dtype, device=q.device)
+    if causal:
+        assert attn_mask is None, "Causal mask and attn_mask cannot be used together"
+        temp_mask = torch.ones(
+            b, q.size(1), s, s, dtype=torch.bool, device=q.device
+        ).tril(diagonal=0)
+        attn_bias.masked_fill_(temp_mask.logical_not(), float("-inf"))
+
+    if attn_mask is not None:
+        if attn_mask.dtype == torch.bool:
+            if attn_mask.ndim == 3:
+                attn_mask = attn_mask.unsqueeze(1)
+            attn_bias.masked_fill_(attn_mask.logical_not(), float("-inf"))
+        else:
+            if attn_mask.ndim == 3:
+                attn_mask = attn_mask.unsqueeze(1)
+            attn_bias = attn_bias + attn_mask
+
+    dtype = q.dtype
+    attn = (q * scale) @ k.transpose(-2, -1)
+    attn = attn + attn_bias
+    attn = attn.softmax(dim=-1).to(dtype)
+    if dropout_p > 0.0:
+        attn = F.dropout(attn, p=dropout_p, training=True)
+    out = attn @ v
+    return _from_bhsd(out, layout)
+
+
+def attention_dense(
+    q: torch.Tensor,
+    k: torch.Tensor,
+    v: torch.Tensor,
+    *,
+    attn_mask: Optional[torch.Tensor] = None,
+    dropout_p: float = 0.0,
+    causal: bool = False,
+    scale: Optional[float] = None,
+    layout: AttnLayout = "bsnd",
+    backend: Optional[AttnBackend] = None,
+) -> torch.Tensor:
+    """Dense attention with unified backend selection."""
+    backend = backend or get_attn_backend()
+
+    if backend == "flash":
+        if layout == "bhsd":
+            q_f, k_f, v_f = q.transpose(1, 2), k.transpose(1, 2), v.transpose(1, 2)
+        else:
+            q_f, k_f, v_f = q, k, v
+        assert _FLASH_ATTN_FUNC is not None
+        return _FLASH_ATTN_FUNC(
+            q_f,
+            k_f,
+            v_f,
+            dropout_p=dropout_p,
+            softmax_scale=scale,
+            causal=causal,
+        )
+
+    if backend == "sdpa":
+        q_s, k_s, v_s = _to_bhsd(q, k, v, layout)
+        if attn_mask is not None and attn_mask.dtype != torch.bool:
+            attn_mask = attn_mask.to(q_s.dtype)
+        with _sdpa_context():
+            out = F.scaled_dot_product_attention(
+                q_s,
+                k_s,
+                v_s,
+                attn_mask=attn_mask,
+                dropout_p=dropout_p,
+                is_causal=causal,
+                scale=scale,
+            )
+        return _from_bhsd(out, layout)
+
+    return attention_eager(
+        q,
+        k,
+        v,
+        attn_mask=attn_mask,
+        dropout_p=dropout_p,
+        causal=causal,
+        scale=scale,
+        layout=layout,
+    )
+
+
+def attention_varlen(
+    q: torch.Tensor,
+    k: torch.Tensor,
+    v: torch.Tensor,
+    *,
+    cu_seqlens_q: torch.Tensor,
+    cu_seqlens_kv: torch.Tensor,
+    max_seqlen_q: int,
+    max_seqlen_kv: int,
+    dropout_p: float = 0.0,
+    causal: bool = False,
+    softmax_scale: Optional[float] = None,
+    batch_size: Optional[int] = None,
+    window_size: Tuple[int, int] = (-1, -1),
+    deterministic: bool = False,
+    prefer_flash3: bool = True,
+    backend: Optional[AttnBackend] = None,
+) -> torch.Tensor:
+    """Variable-length packed attention (flash varlen or dense fallback)."""
+    backend = backend or get_attn_backend()
+
+    if backend == "flash":
+        if prefer_flash3 and _FLASH_ATTN_3_VARLEN_FUNC is not None:
+            out = _FLASH_ATTN_3_VARLEN_FUNC(
+                q=q,
+                k=k,
+                v=v,
+                cu_seqlens_q=cu_seqlens_q,
+                cu_seqlens_k=cu_seqlens_kv,
+                max_seqlen_q=max_seqlen_q,
+                max_seqlen_k=max_seqlen_kv,
+                softmax_scale=softmax_scale,
+                causal=causal,
+                deterministic=deterministic,
+            )
+            if isinstance(out, tuple):
+                out = out[0]
+        else:
+            assert _FLASH_ATTN_VARLEN_FUNC is not None
+            out = _FLASH_ATTN_VARLEN_FUNC(
+                q,
+                k,
+                v,
+                cu_seqlens_q,
+                cu_seqlens_kv,
+                max_seqlen_q,
+                max_seqlen_kv,
+                dropout_p=dropout_p,
+                softmax_scale=softmax_scale,
+                causal=causal,
+                window_size=window_size,
+                deterministic=deterministic,
+            )
+        if batch_size is not None:
+            return out.view(batch_size, max_seqlen_q, out.shape[-2], out.shape[-1])
+        return out
+
+    if batch_size is None:
+        raise ValueError("batch_size is required for non-flash varlen fallback")
+
+    # Reshape packed varlen tensors back to padded batch for sdpa/eager.
+    q.shape[0]
+    n_heads = q.shape[1]
+    head_dim = q.shape[2]
+    q_pad = q.view(batch_size, max_seqlen_q, n_heads, head_dim)
+    k_pad = k.view(batch_size, max_seqlen_kv, n_heads, head_dim)
+    v_pad = v.view(batch_size, max_seqlen_kv, n_heads, head_dim)
+    return attention_dense(
+        q_pad,
+        k_pad,
+        v_pad,
+        dropout_p=dropout_p,
+        causal=causal,
+        scale=softmax_scale,
+        layout="bsnd",
+        backend=backend,
+    )
+
+
+_DIFFUSERS_BACKEND_MAP = {
+    "flash": "flash",
+    "sdpa": "native",
+    "eager": "_native_math",
+}
+
+
+def apply_diffusers_attention_backend(model) -> None:
+    """Map VIDEOTUNA_ATTN_BACKEND to diffusers set_attention_backend."""
+    backend = get_attn_backend()
+    diffusers_backend = _DIFFUSERS_BACKEND_MAP[backend]
+    if backend == "flash" and detect_compute_backend() == "rocm":
+        diffusers_backend = "native"
+
+    if hasattr(model, "set_attention_backend"):
+        try:
+            model.set_attention_backend(diffusers_backend)
+            return
+        except ValueError:
+            if backend == "flash":
+                model.set_attention_backend("native")
+                return
+            raise
+
+    os.environ["DIFFUSERS_ATTN_BACKEND"] = diffusers_backend
+
+
+_COMPILE_WARNED_ROCM = False
+
+
+def maybe_compile_denoiser(module: nn.Module) -> nn.Module:
+    """Optionally compile a denoiser module when VIDEOTUNA_TORCH_COMPILE=1."""
+    global _COMPILE_WARNED_ROCM
+    if not get_settings().torch_compile:
+        return module
+    if not gpu_is_available():
+        return module
+    if detect_compute_backend() == "rocm" and not _COMPILE_WARNED_ROCM:
+        logger.warning(
+            "torch.compile on AMD ROCm is experimental in PyTorch 2.6; "
+            "set VIDEOTUNA_TORCH_COMPILE=0 to disable."
+        )
+        _COMPILE_WARNED_ROCM = True
+    compile_mode = get_torch_compile_mode()
+    logger.info("torch.compile denoiser with mode={}", compile_mode)
+    return cast(
+        nn.Module,
+        torch.compile(module, mode=compile_mode, fullgraph=True),
+    )
diff --git a/videotuna/utils/callbacks.py b/videotuna/utils/callbacks.py
index 05e31475..ef541990 100755
--- a/videotuna/utils/callbacks.py
+++ b/videotuna/utils/callbacks.py
@@ -1,30 +1,24 @@
-import datetime
-import logging
 import os
 import time
-
-import numpy as np
-from einops import rearrange
-from omegaconf import OmegaConf
-from PIL import Image
+from typing import Any, Optional, Union
 from weakref import proxy
-from collections import OrderedDict
-from typing_extensions import override
-from typing import Any, Literal, Optional, Union
-from loguru import logger
-
-mainlogger = logging.getLogger("mainlogger")
 
 import pytorch_lightning as pl
 import torch
 import torchvision
-from torch import Tensor
 from pytorch_lightning.callbacks import Callback
 from pytorch_lightning.utilities import rank_zero_info, rank_zero_only
 from pytorch_lightning.utilities.types import STEP_OUTPUT
+from torch import Tensor
+from typing_extensions import override
+
+from videotuna.utils.device_utils import empty_accelerator_cache, gpu_is_available
+from videotuna.utils.logging_config import bound_logger
 
 from .save_video import log_local, prepare_to_log
 
+mainlogger = bound_logger(phase="t2v", flow="wanvideo")
+
 
 class LoraModelCheckpoint(pl.callbacks.ModelCheckpoint):
     def __init__(self, *args, **kwargs):
@@ -55,17 +49,31 @@ def on_save_checkpoint(self, trainer, pl_module, checkpoint):
 
 
 class VideoTunaModelCheckpoint(pl.callbacks.ModelCheckpoint):
-    def __init__(self, 
-                 save_flow: bool = True,
-                 save_only_selected_model: bool = True,
-                 selected_model: Optional[Union[str, list]] = None,
-                 *args, **kwargs):
-        assert save_flow or save_only_selected_model, "At least one of `save_flow` and `save_only_trained_model` should be True."
+    def __init__(
+        self,
+        save_flow: bool = True,
+        save_only_selected_model: bool = True,
+        selected_model: Optional[Union[str, list]] = None,
+        *args,
+        **kwargs,
+    ):
+        assert (
+            save_flow or save_only_selected_model
+        ), "At least one of `save_flow` and `save_only_trained_model` should be True."
         super().__init__(*args, **kwargs)
         self.save_flow = save_flow
         self.save_only_selected_model = save_only_selected_model
-        self.selected_model = selected_model
-    
+        if save_only_selected_model and not selected_model:
+            raise ValueError(
+                "selected_model must be set when save_only_selected_model is True"
+            )
+        if isinstance(selected_model, str):
+            self.selected_model: list[str] = [selected_model]
+        elif selected_model is None:
+            self.selected_model = []
+        else:
+            self.selected_model = list(selected_model)
+
     @override
     def on_train_batch_end(
         self,
@@ -75,19 +83,25 @@ def on_train_batch_end(
         batch: Any,
         batch_idx: int,
     ) -> None:
-        """Save checkpoint on train batch end if we meet the criteria for `every_n_train_steps`"""
+        """Save checkpoint on train batch end when `every_n_train_steps` met."""
         if self._should_skip_saving_checkpoint(trainer):
             return
-        skip_batch = self._every_n_train_steps < 1 or (trainer.global_step % self._every_n_train_steps != 0)
+        skip_batch = self._every_n_train_steps < 1 or (
+            trainer.global_step % self._every_n_train_steps != 0
+        )
 
         train_time_interval = self._train_time_interval
         skip_time = True
         now = time.monotonic()
         if train_time_interval:
             prev_time_check = self._last_time_checked
-            skip_time = prev_time_check is None or (now - prev_time_check) < train_time_interval.total_seconds()
+            skip_time = (
+                prev_time_check is None
+                or (now - prev_time_check) < train_time_interval.total_seconds()
+            )
             # in case we have time differences across ranks
-            # broadcast the decision on whether to checkpoint from rank 0 to avoid possible hangs
+            # broadcast the decision on whether to checkpoint from rank 0 to avoid
+            # possible hangs
             skip_time = trainer.strategy.broadcast(skip_time)
 
         if skip_batch and skip_time:
@@ -96,14 +110,18 @@ def on_train_batch_end(
             self._last_time_checked = now
 
         monitor_candidates = self._monitor_candidates(trainer)
-        self._save_last_checkpoint(trainer, monitor_candidates, pl_module)  # only save the last checkpoint
-    
+        self._save_last_checkpoint(trainer, monitor_candidates)
+
     @override
-    def on_train_epoch_end(self, trainer: "pl.Trainer", pl_module: "pl.LightningModule") -> None:
+    def on_train_epoch_end(
+        self, trainer: "pl.Trainer", pl_module: "pl.LightningModule"
+    ) -> None:
         pass
 
     @override
-    def on_validation_end(self, trainer: "pl.Trainer", pl_module: "pl.LightningModule") -> None:
+    def on_validation_end(
+        self, trainer: "pl.Trainer", pl_module: "pl.LightningModule"
+    ) -> None:
         pass
 
     @override
@@ -111,26 +129,38 @@ def _save_last_checkpoint(
         self,
         trainer: "pl.Trainer",
         monitor_candidates: dict[str, Tensor],
-        pl_module: "pl.LightningModule",
     ) -> None:
         if not self.save_last:
             return
 
-        # filepath = self.format_checkpoint_name(monitor_candidates, self.CHECKPOINT_NAME_LAST)
+        pl_module = trainer.lightning_module
+        if pl_module is None:
+            return
+
+        # filepath = self.format_checkpoint_name(monitor_candidates,
+        # self.CHECKPOINT_NAME_LAST)
         filepath = self._format_ckpt_path(monitor_candidates, prefix="flow")
 
         if self._enable_version_counter:
             version_cnt = self.STARTING_VERSION
-            while self.file_exists(filepath, trainer) and filepath != self.last_model_path:
-                filepath = self.format_checkpoint_name(monitor_candidates, self.CHECKPOINT_NAME_LAST, ver=version_cnt)
+            while (
+                self.file_exists(filepath, trainer) and filepath != self.last_model_path
+            ):
+                filepath = self.format_checkpoint_name(
+                    monitor_candidates, self.CHECKPOINT_NAME_LAST, ver=version_cnt
+                )
                 version_cnt += 1
 
         # set the last model path before saving because it will be part of the state.
         previous, self.last_model_path = self.last_model_path, filepath
-        if self.save_last == "link" and self._last_checkpoint_saved and self.save_top_k != 0:
+        if (
+            self.save_last == "link"
+            and self._last_checkpoint_saved
+            and self.save_top_k != 0
+        ):
             self._link_checkpoint(trainer, self._last_checkpoint_saved, filepath)
         else:
-            self._save_checkpoint(trainer, filepath, pl_module)
+            self._save_checkpoint(trainer, filepath)
         if previous and self._should_remove_checkpoint(trainer, previous, filepath):
             self._remove_checkpoint(trainer, previous)
 
@@ -139,10 +169,13 @@ def _save_checkpoint(
         self,
         trainer: "pl.Trainer",
         filepath: str,
-        pl_module: "pl.LightningModule",
     ) -> None:
+        pl_module = trainer.lightning_module
+        if pl_module is None:
+            return
         if self.save_flow:
-            # save all the state including the model, optimizer, and any state that the user has added
+            # save all the state including the model, optimizer, and any state that the
+            # user has added
             self._save_flow_checkpoint(trainer, pl_module, filepath)
         if self.save_only_selected_model:
             # only save the trained parameters
@@ -155,86 +188,111 @@ def _save_checkpoint(
         if trainer.is_global_zero:
             for logger in trainer.loggers:
                 logger.after_save_checkpoint(proxy(self))
-    
+
     def _save_flow_checkpoint(
-        self,
-        trainer: "pl.Trainer",
-        pl_module: "pl.LightningModule",
-        filepath
+        self, trainer: "pl.Trainer", pl_module: "pl.LightningModule", filepath
     ) -> None:
         """Save the whole model."""
         # check the save path
-        original_dirpath_list = filepath.split('/')
-        new_dirpath_list = original_dirpath_list[:-1] + ['flow']
-        new_dirpath = '/'.join(new_dirpath_list)
+        original_dirpath_list = filepath.split("/")
+        new_dirpath_list = original_dirpath_list[:-1] + ["flow"]
+        new_dirpath = "/".join(new_dirpath_list)
         if not os.path.exists(new_dirpath):
             os.makedirs(new_dirpath)
 
         new_filepath = os.path.join(new_dirpath, original_dirpath_list[-1])
         trainer.save_checkpoint(new_filepath, self.save_weights_only)
-    
+
     @rank_zero_only
     def _save_training_checkpoint(
-        self,
-        trainer: "pl.Trainer",
-        pl_module: "pl.LightningModule",
-        filepath
+        self, trainer: "pl.Trainer", pl_module: "pl.LightningModule", filepath
     ) -> None:
         """Save only the trained model."""
         # check the save path
-        original_dirpath_list = filepath.split('/')
-        new_dirpath_list = original_dirpath_list[:-1] + ['only_trained_model']
-        new_dirpath = '/'.join(new_dirpath_list)
+        original_dirpath_list = filepath.split("/")
+        new_dirpath_list = original_dirpath_list[:-1] + ["only_trained_model"]
+        new_dirpath = "/".join(new_dirpath_list)
         if not os.path.exists(new_dirpath):
             os.makedirs(new_dirpath)
 
-        if trainer.strategy.__class__.__name__  == "DeepSpeedStrategy":
-            from deepspeed.utils.zero_to_fp32 import get_fp32_state_dict_from_zero_checkpoint
+        if trainer.strategy.__class__.__name__ == "DeepSpeedStrategy":
+            from deepspeed.utils.zero_to_fp32 import (
+                get_fp32_state_dict_from_zero_checkpoint,
+            )
+
             original_filename = original_dirpath_list[-1]
-            deepspeed_flow_path = original_dirpath_list[:-1] + ['flow', original_filename]
-            state_dict = get_fp32_state_dict_from_zero_checkpoint('/'.join(deepspeed_flow_path))
-    
+            deepspeed_flow_path = original_dirpath_list[:-1] + [
+                "flow",
+                original_filename,
+            ]
+            state_dict = get_fp32_state_dict_from_zero_checkpoint(
+                "/".join(deepspeed_flow_path)
+            )
+            if state_dict is None:
+                raise RuntimeError(
+                    "Failed to load DeepSpeed zero checkpoint from "
+                    f"{'/'.join(deepspeed_flow_path)}"
+                )
+
             for seleted in self.selected_model:
-                new_state_dict = {name.replace(f"{seleted}.", ""): param for name, param in state_dict.items() if name.startswith(seleted)}
-                save_dict = {'state_dict': new_state_dict}
-                new_filename = original_filename.replace('flow', seleted)
+                new_state_dict = {
+                    name.replace(f"{seleted}.", ""): param
+                    for name, param in state_dict.items()
+                    if name.startswith(seleted)
+                }
+                save_dict = {"state_dict": new_state_dict}
+                new_filename = original_filename.replace("flow", seleted)
                 new_filepath = os.path.join(new_dirpath, new_filename)
                 torch.save(save_dict, new_filepath)
-                logger.info(f"Deepspeed Saving model {seleted} with {len(new_state_dict)} params to {new_filepath}")
+                mainlogger.info(
+                    "Deepspeed Saving model {} with {} params to {}",
+                    seleted,
+                    len(new_state_dict),
+                    new_filepath,
+                )
         else:
             original_filename = original_dirpath_list[-1]
             for seleted in self.selected_model:
                 model = getattr(pl_module, seleted)
                 state_dict = model.state_dict()
-                save_dict = {'state_dict': state_dict}
-                new_filename = original_filename.replace('flow', seleted)
+                save_dict = {"state_dict": state_dict}
+                new_filename = original_filename.replace("flow", seleted)
                 new_filepath = os.path.join(new_dirpath, new_filename)
                 torch.save(save_dict, new_filepath)
-                logger.info(f"Saving model {seleted} with {len(state_dict)} params  to {new_filepath}")
-    
+                mainlogger.info(
+                    "Saving model {} with {} params to {}",
+                    seleted,
+                    len(state_dict),
+                    new_filepath,
+                )
+
     def _format_ckpt_path(
-        self,
-        monitor_candidates: dict[str, Tensor],
-        prefix: str = None
+        self, monitor_candidates: dict[str, Tensor], prefix: str | None = None
     ) -> str:
-        """Format the checkpoint path with the current values of monitored quantities."""
-        epoch = monitor_candidates.get("epoch").item()
-        step = monitor_candidates.get("step").item()
-
-        if 'epoch' in self.filename and 'step' in self.filename:
-            format_filename = self.filename.format(epoch=epoch, step=step)
-        elif 'epoch' in self.filename and 'step' not in self.filename:
-            format_filename = self.filename.format(epoch=epoch)
-        elif 'epoch' not in self.filename and 'step' in self.filename:
-            format_filename = self.filename.format(step=step)
+        """Format checkpoint path with current values of monitored quantities."""
+        epoch_tensor = monitor_candidates.get("epoch")
+        step_tensor = monitor_candidates.get("step")
+        assert epoch_tensor is not None and step_tensor is not None
+        epoch = epoch_tensor.item()
+        step = step_tensor.item()
+
+        filename = self.filename or ""
+        if "epoch" in filename and "step" in filename:
+            format_filename = filename.format(epoch=epoch, step=step)
+        elif "epoch" in filename and "step" not in filename:
+            format_filename = filename.format(epoch=epoch)
+        elif "epoch" not in filename and "step" in filename:
+            format_filename = filename.format(step=step)
         else:
-            format_filename = self.filename
-        
+            format_filename = filename
+
         if prefix is not None:
-            format_filename = prefix + '-' + format_filename + '.ckpt'
-    
-        filepath = os.path.join(self.dirpath, format_filename)
-        
+            format_filename = prefix + "-" + format_filename + ".ckpt"
+
+        dirpath = self.dirpath
+        assert dirpath is not None
+        filepath = os.path.join(dirpath, format_filename)
+
         return filepath
 
 
@@ -278,7 +336,7 @@ def log_to_tensorboard(self, pl_module, batch_logs, filename, split, save_fps=10
                 n = video.shape[0]
                 video = video.permute(2, 0, 1, 3, 4)  # t,n,c,h,w
                 frame_grids = [
-                    torchvision.utils.make_grid(framesheet, nrow=int(n))
+                    torchvision.utils.make_grid(framesheet, nrow=n)
                     for framesheet in video
                 ]  # [3, n*h, 1*w]
                 grid = torch.stack(
@@ -291,7 +349,8 @@ def log_to_tensorboard(self, pl_module, batch_logs, filename, split, save_fps=10
                 )
             elif isinstance(value, torch.Tensor) and value.dim() == 4:
                 img = value
-                grid = torchvision.utils.make_grid(img, nrow=int(n))
+                n = img.shape[0]
+                grid = torchvision.utils.make_grid(img, nrow=n)
                 grid = (grid + 1.0) / 2.0  # -1,1 -> 0,1; c,h,w
                 pl_module.logger.experiment.add_image(
                     tag, grid, global_step=global_step
@@ -314,13 +373,13 @@ def log_batch_imgs(self, pl_module, batch, batch_idx, split="train"):
 
             ## process: move to CPU and clamp
             batch_logs = prepare_to_log(batch_logs, self.max_images, self.clamp)
-            torch.cuda.empty_cache()
+            empty_accelerator_cache()
 
             filename = "ep{}_idx{}_rank{}".format(
                 pl_module.current_epoch, batch_idx, pl_module.global_rank
             )
             if self.to_local:
-                mainlogger.info("Log [%s] batch <%s> to local ..." % (split, filename))
+                mainlogger.info("Log [{}] batch <{}> to local ...", split, filename)
                 filename = "gs{}_".format(pl_module.global_step) + filename
                 log_local(
                     batch_logs,
@@ -330,7 +389,7 @@ def log_batch_imgs(self, pl_module, batch, batch_idx, split="train"):
                 )
             else:
                 mainlogger.info(
-                    "Log [%s] batch <%s> to tensorboard ..." % (split, filename)
+                    "Log [{}] batch <{}> to tensorboard ...", split, filename
                 )
                 self.log_to_tensorboard(
                     pl_module, batch_logs, filename, split, save_fps=10
@@ -349,36 +408,127 @@ def on_train_batch_end(
     def on_validation_batch_end(
         self, trainer, pl_module, outputs, batch, batch_idx, dataloader_idx=None
     ):
-        ## different with validation_step() that saving the whole validation set and only keep the latest,
-        ## it records the performance of every validation (without overwritten) by only keep a subset
+        # # different with validation_step() that saving the whole validation set and
+        # only keep the latest,
+        # # it records the performance of every validation (without overwritten) by only
+        # keep a subset
         if self.batch_freq != -1 and pl_module.logdir:
             self.log_batch_imgs(pl_module, batch, batch_idx, split="val")
         if hasattr(pl_module, "calibrate_grad_norm"):
             if (
                 pl_module.calibrate_grad_norm and batch_idx % 25 == 0
             ) and batch_idx > 0:
-                self.log_gradients(trainer, pl_module, batch_idx=batch_idx)
+                if hasattr(self, "log_gradients"):
+                    self.log_gradients(trainer, pl_module, batch_idx=batch_idx)
+
+
+class TrainingMetricsCallback(Callback):
+    """Log per-epoch wall time and peak GPU memory to metrics.json in run dir."""
+
+    def __init__(self, save_dir: Optional[str] = None):
+        self.save_dir = save_dir
+        self._epoch_start: float = 0.0
+        self._epoch_peak_gb: float = 0.0
+        self.metrics: list[dict[str, float]] = []
+
+    def _gpu_index(self, trainer: "pl.Trainer") -> int:
+        device = trainer.strategy.root_device
+        return device.index if device.type == "cuda" else 0
+
+    def on_train_epoch_start(
+        self, trainer: "pl.Trainer", pl_module: "pl.LightningModule"
+    ):
+        if torch.cuda.is_available():
+            gpu_index = self._gpu_index(trainer)
+            torch.cuda.reset_peak_memory_stats(gpu_index)
+        self._epoch_start = time.time()
+        self._epoch_peak_gb = 0.0
+
+    def on_train_batch_end(
+        self,
+        trainer: "pl.Trainer",
+        pl_module: "pl.LightningModule",
+        outputs: STEP_OUTPUT,
+        batch: Any,
+        batch_idx: int,
+    ):
+        if not torch.cuda.is_available():
+            return
+        gpu_index = self._gpu_index(trainer)
+        peak_gb = torch.cuda.max_memory_allocated(gpu_index) / (1024**3)
+        self._epoch_peak_gb = max(self._epoch_peak_gb, peak_gb)
+
+    def on_train_epoch_end(
+        self, trainer: "pl.Trainer", pl_module: "pl.LightningModule"
+    ):
+        epoch_time_s = time.time() - self._epoch_start
+        entry = {
+            "epoch": float(trainer.current_epoch),
+            "epoch_time_s": round(epoch_time_s, 4),
+            "peak_vram_gb": round(self._epoch_peak_gb, 4),
+        }
+        self.metrics.append(entry)
+        rank_zero_info(
+            f"Epoch {trainer.current_epoch}: time={entry['epoch_time_s']}s "
+            f"peak_vram={entry['peak_vram_gb']}GB"
+        )
+        if trainer.logger is not None:
+            trainer.logger.log_metrics(
+                {
+                    "epoch_time_s": entry["epoch_time_s"],
+                    "peak_vram_gb": entry["peak_vram_gb"],
+                },
+                step=trainer.global_step,
+            )
+        save_dir = self.save_dir or getattr(pl_module, "logdir", None)
+        if save_dir and trainer.global_rank == 0:
+            import json
+
+            os.makedirs(save_dir, exist_ok=True)
+            metrics_path = os.path.join(save_dir, "metrics.json")
+            with open(metrics_path, "w") as f:
+                json.dump({"epochs": self.metrics}, f, indent=2)
 
 
 class CUDACallback(Callback):
     # see https://github.com/SeanNaren/minGPT/blob/master/mingpt/callback.py
-    def on_train_batch_start(self, trainer: "pl.Trainer", pl_module: "pl.LightningModule", batch: Any, batch_idx: int):
+    def on_train_batch_start(
+        self,
+        trainer: "pl.Trainer",
+        pl_module: "pl.LightningModule",
+        batch: Any,
+        batch_idx: int,
+    ):
+        if not gpu_is_available():
+            self.start_time = time.time()
+            return
         # Reset the memory use counter
-        # lightning update
         gpu_index = trainer.strategy.root_device.index
         torch.cuda.reset_peak_memory_stats(gpu_index)
         torch.cuda.synchronize(gpu_index)
         self.start_time = time.time()
 
-    def on_train_batch_end(self, trainer: "pl.Trainer", pl_module: "pl.LightningModule", outputs: STEP_OUTPUT, batch: Any, batch_idx: int):
+    def on_train_batch_end(
+        self,
+        trainer: "pl.Trainer",
+        pl_module: "pl.LightningModule",
+        outputs: STEP_OUTPUT,
+        batch: Any,
+        batch_idx: int,
+    ):
+        epoch_time = time.time() - self.start_time
+        if not gpu_is_available():
+            rank_zero_info(f"Average Epoch time: {epoch_time:.2f} seconds")
+            return
         gpu_index = trainer.strategy.root_device.index
         torch.cuda.synchronize(gpu_index)
         max_memory = torch.cuda.max_memory_allocated(gpu_index) / 2**20
-        epoch_time = time.time() - self.start_time
 
         try:
-            max_memory = trainer.training_type_plugin.reduce(max_memory)
-            epoch_time = trainer.training_type_plugin.reduce(epoch_time)
+            training_type_plugin = getattr(trainer, "training_type_plugin", None)
+            if training_type_plugin is not None:
+                max_memory = training_type_plugin.reduce(max_memory)
+                epoch_time = training_type_plugin.reduce(epoch_time)
 
             rank_zero_info(f"Average Epoch time: {epoch_time:.2f} seconds")
             rank_zero_info(f"Average Peak memory {max_memory:.2f}MiB")
diff --git a/videotuna/utils/common_utils.py b/videotuna/utils/common_utils.py
index ab90f6b7..df053059 100644
--- a/videotuna/utils/common_utils.py
+++ b/videotuna/utils/common_utils.py
@@ -1,22 +1,33 @@
 import importlib
+import json
 import os
-from colorama import Fore, Style
-from omegaconf import DictConfig, OmegaConf
 import time
-import psutil
-import subprocess
-import sys
+from argparse import Namespace
 from functools import wraps
-from loguru import logger
+from typing import Any, Dict, List, Optional, Union
 
 import cv2
 import numpy as np
+import psutil
 import torch
 import torch.distributed as dist
-import json
-from typing import List, Union
-from argparse import Namespace
+from colorama import Fore, Style
+from loguru import logger
+from omegaconf import DictConfig, OmegaConf
 
+from videotuna.settings import get_settings
+from videotuna.utils.attention import (
+    get_attn_backend_requested,
+    get_resolved_attn_backend,
+    get_torch_compile_mode,
+)
+from videotuna.utils.device_utils import (
+    detect_compute_backend,
+    gpu_is_available,
+    synchronize_accelerator,
+)
+from videotuna.utils.inference_cli import resolve_offload_mode
+from videotuna.utils.lora_utils import parameter_matches_lora_target
 
 precision_to_dtype = {
     "float32": torch.float32,
@@ -27,14 +38,15 @@
 
 def get_resize_crop_region_for_grid(src, target):
     """
-    Returns the centered crop region grid for a resized image to the target size while preserving aspect ratio.
+    Returns the centered crop region grid for a resized image to the target
+    size while preserving aspect ratio.
     src: (h, w)
     target: (h, w)
     """
-    
+
     h, w = src
     th, tw = target
-    
+
     r = h / w
     if r > (th / tw):
         resize_height = th
@@ -61,22 +73,21 @@ def check_istarget(name, para_list):
     name: full name of source para
     para_list: partial name of target para
     """
-    istarget = False
-    for para in para_list:
-        if para in name:
-            return True
-    return istarget
+    return parameter_matches_lora_target(name, para_list)
+
 
 def get_dtype_from_str(dtype_str):
     import torch
+
     dtype_map = {
         "float16": torch.float16,
         "float32": torch.float32,
         "float64": torch.float64,
-        "bfloat16": torch.bfloat16
+        "bfloat16": torch.bfloat16,
     }
     return dtype_map.get(dtype_str, torch.float32)  # 默认返回float32
 
+
 def get_params(config, resolve=True):
     params = config.get("params")
     if params is None:
@@ -86,18 +97,29 @@ def get_params(config, resolve=True):
         return OmegaConf.to_container(params, resolve=True)
     return params
 
+
 # resolve will make params dict type rather than DictConfig type
-def instantiate_from_config(config, resolve=False):
-    if not "target" in config:
+def instantiate_from_config(config, resolve=False) -> Any:
+    if "target" not in config:
         if config == "__is_first_stage__":
             return None
         elif config == "__is_unconditional__":
             return None
         raise KeyError("Expected key `target` to instantiate.")
-    if "diffusers" in config["target"] or config["target"].startswith("transformers") or config.get("use_from_pretrained", False):
-        return get_obj_from_str(config["target"]).from_pretrained(
-            **get_params(config, resolve)
-        )
+    target = config["target"]
+    is_videotuna_diffusers_flow = target.endswith("DiffusersVideoFlow")
+    if not is_videotuna_diffusers_flow and (
+        "diffusers" in target
+        or target.startswith("transformers")
+        or config.get("use_from_pretrained", False)
+    ):
+        params = get_params(config, resolve)
+        if isinstance(params.get("pretrained_model_name_or_path"), str):
+            local_path = os.path.abspath(params["pretrained_model_name_or_path"])
+            if os.path.isdir(local_path):
+                params = dict(params)
+                params["local_files_only"] = True
+        return get_obj_from_str(config["target"]).from_pretrained(**params)
     return get_obj_from_str(config["target"])(**get_params(config, resolve))
 
 
@@ -140,86 +162,229 @@ def resize_numpy_image(image, max_resolution=512 * 512, resize_short_edge=None):
 def setup_dist(args):
     if dist.is_initialized():
         return
-    torch.cuda.set_device(args.local_rank)
-    torch.distributed.init_process_group("nccl", init_method="env://")
+    if gpu_is_available():
+        torch.cuda.set_device(args.local_rank)
+        torch.distributed.init_process_group("nccl", init_method="env://")
+    else:
+        torch.distributed.init_process_group("gloo", init_method="env://")
 
 
 def print_green(text):
     print(Fore.GREEN + text + Style.RESET_ALL)
 
+
 def print_red(text):
     print(Fore.RED + text + Style.RESET_ALL)
 
+
 def print_yellow(text):
     print(Fore.YELLOW + text + Style.RESET_ALL)
 
 
-def monitor_resources(return_metrics=True):
+def _build_sample_metrics(
+    time_used: float,
+    gpu_mem_used: Optional[float],
+    frames: int,
+) -> Dict[str, Any]:
+    peak = round(gpu_mem_used, 2) if gpu_mem_used is not None else None
+    wall = round(time_used, 2)
+    spf = round(wall / frames, 4) if frames > 0 else None
+    return {
+        "time": wall,
+        "wall_time_s": wall,
+        "gpu": peak,
+        "peak_vram_gb": peak,
+        "seconds_per_frame": spf,
+    }
+
+
+def _current_cuda_device_index() -> int:
+    if not gpu_is_available():
+        return 0
+    return torch.cuda.current_device()
+
+
+def _peak_vram_stats(device_index: int) -> tuple[float | None, float | None]:
+    if not gpu_is_available():
+        return None, None
+    allocated = torch.cuda.max_memory_allocated(device_index) / (1024**3)
+    reserved = torch.cuda.max_memory_reserved(device_index) / (1024**3)
+    return round(allocated, 2), round(reserved, 2)
+
+
+def _strip_non_serializable_metrics(sample: Dict[str, Any]) -> Dict[str, Any]:
+    cleaned = dict(sample)
+    result = cleaned.pop("result", None)
+    if result is not None and not isinstance(
+        result, (str, int, float, bool, list, dict, type(None))
+    ):
+        cleaned["result_type"] = type(result).__name__
+    return cleaned
+
+
+def monitor_resources(
+    return_metrics: bool = True,
+    frames: int = 1,
+    inference_config: Optional[Any] = None,
+    device_index: Optional[int] = None,
+):
     def decorator(func):
         @wraps(func)
         def wrapper(*args, **kwargs):
             process = psutil.Process()
             start_time = time.time()
-            start_cpu_mem = process.memory_info().rss / 1024 / 1024 / 1024 # GB
+            start_cpu_mem = process.memory_info().rss / 1024 / 1024 / 1024  # GB
 
-            if torch.cuda.is_available():
-                torch.cuda.reset_peak_memory_stats()
-                torch.cuda.synchronize()
+            dev_idx = device_index
+            if dev_idx is None and gpu_is_available():
+                dev_idx = _current_cuda_device_index()
+
+            if gpu_is_available():
+                torch.cuda.reset_peak_memory_stats(dev_idx)
+                synchronize_accelerator()
 
             result = func(*args, **kwargs)
 
             end_time = time.time()
-            end_cpu_mem = process.memory_info().rss / 1024 / 1024 / 1024 # GB
+            end_cpu_mem = process.memory_info().rss / 1024 / 1024 / 1024  # GB
 
             time_used = end_time - start_time
             cpu_mem_used = end_cpu_mem - start_cpu_mem
 
             logger.info(f"Time used: {time_used:.2f} seconds")
             logger.info(f"CPU memory change: {cpu_mem_used:.2f} GB")
-            gpu_mem_used = None
-            if torch.cuda.is_available():
-                torch.cuda.synchronize()
-                gpu_mem_used = torch.cuda.max_memory_allocated() / 1024 / 1024 / 1024 # GB
-                logger.info(f"Peak GPU memory used: {gpu_mem_used:.2f} GB")
+            peak_alloc, peak_reserved = _peak_vram_stats(dev_idx or 0)
+            if peak_alloc is not None:
+                logger.info(f"Peak GPU memory allocated: {peak_alloc:.2f} GB")
+            if peak_reserved is not None:
+                logger.info(f"Peak GPU memory reserved: {peak_reserved:.2f} GB")
 
             if return_metrics:
-                return {
-                    "time": round(time_used, 2),
-                    "cpu": round(cpu_mem_used, 2),
-                    "gpu": round(gpu_mem_used, 2) if gpu_mem_used is not None else None,
-                    "result": result,
-                }
-            else:
-                return result
+                sample = _build_sample_metrics(time_used, peak_alloc, frames)
+                sample["cpu"] = round(cpu_mem_used, 2)
+                sample["peak_vram_reserved_gb"] = peak_reserved
+                sample["attention_backend"] = get_resolved_attn_backend()
+                sample["attention_backend_requested"] = get_attn_backend_requested()
+                sample["attention_backend_resolved"] = get_resolved_attn_backend()
+                sample["compute_backend"] = detect_compute_backend()
+                compile_on = get_settings().torch_compile
+                sample["torch_compile"] = compile_on
+                sample["compile_mode"] = (
+                    get_torch_compile_mode() if compile_on else None
+                )
+                sample["result"] = result
+                if dev_idx is not None and gpu_is_available():
+                    sample["gpu_index"] = dev_idx
+                    sample["gpu_name"] = torch.cuda.get_device_name(dev_idx)
+                if inference_config is not None:
+                    sample["offload_mode"] = resolve_offload_mode(inference_config)
+                    sample["dtype"] = getattr(inference_config, "dtype", None)
+                    sample["memory_preset"] = getattr(
+                        inference_config, "memory_preset", None
+                    )
+                    sample["requested_device"] = getattr(
+                        inference_config, "device", None
+                    )
+                return sample
+            return result
 
         return wrapper
-    return decorator
 
+    return decorator
 
 
-def save_metrics(gpu: List[float],
-                time: List[float],
-                config: Union[DictConfig, Namespace],
-                savedir: str):
+def save_metrics(
+    savedir: str,
+    config: Optional[Union[DictConfig, Namespace, Any]] = None,
+    *,
+    metrics: Optional[Dict[str, Any]] = None,
+    gpu: Optional[List[float]] = None,
+    time: Optional[List[float]] = None,
+    frames: int = 1,
+):
+    """Write metrics.json (and legacy metric.json) beside inference outputs."""
     config_dict = None
     if config is not None:
         if isinstance(config, DictConfig):
             config_dict = OmegaConf.to_container(config, resolve=True)
-        else:
+        elif isinstance(config, Namespace):
             config_dict = vars(config)
-    metrics = {
-        "gpu" : gpu,
-        "time": time,
-        "config" : config_dict
-    }
-    with open(f"{savedir}/metric.json", "w") as f:
+        elif hasattr(config, "items"):
+            config_dict = dict(config)
+
+    if metrics is None:
+        per_sample = []
+        gpu_list = gpu or []
+        time_list = time or []
+        for g, t in zip(gpu_list, time_list):
+            per_sample.append(
+                {
+                    "peak_vram_gb": g,
+                    "wall_time_s": t,
+                    "seconds_per_frame": (
+                        round(t / frames, 4) if frames > 0 and t else None
+                    ),
+                }
+            )
+        metrics = {
+            "per_sample": per_sample,
+            "gpu": gpu_list,
+            "time": time_list,
+            "attention_backend": get_resolved_attn_backend(),
+            "attention_backend_requested": get_attn_backend_requested(),
+            "attention_backend_resolved": get_resolved_attn_backend(),
+            "torch_compile": get_settings().torch_compile,
+        }
+        if config is not None:
+            metrics["offload_mode"] = resolve_offload_mode(config)
+            metrics["dtype"] = getattr(config, "dtype", None)
+            metrics["memory_preset"] = getattr(config, "memory_preset", None)
+            compile_on = metrics["torch_compile"]
+            metrics["compile_mode"] = get_torch_compile_mode() if compile_on else None
+
+    if metrics.get("per_sample"):
+        metrics["per_sample"] = [
+            _strip_non_serializable_metrics(s) if isinstance(s, dict) else s
+            for s in metrics["per_sample"]
+        ]
+    metrics = _strip_non_serializable_metrics(metrics)
+
+    if config_dict is not None:
+        metrics["config"] = config_dict
+
+    if metrics.get("per_sample"):
+        peaks = [
+            s.get("peak_vram_gb")
+            for s in metrics["per_sample"]
+            if s.get("peak_vram_gb") is not None
+        ]
+        times = [
+            s.get("wall_time_s")
+            for s in metrics["per_sample"]
+            if s.get("wall_time_s") is not None
+        ]
+        if peaks:
+            metrics["peak_vram_gb"] = max(peaks)
+        if times:
+            metrics["wall_time_s"] = sum(times)
+            metrics["seconds_per_frame"] = (
+                round(metrics["wall_time_s"] / frames, 4) if frames > 0 else None
+            )
+
+    os.makedirs(savedir, exist_ok=True)
+    metrics_path = os.path.join(savedir, "metrics.json")
+    with open(metrics_path, "w") as f:
+        json.dump(metrics, f, indent=4)
+    legacy_path = os.path.join(savedir, "metric.json")
+    with open(legacy_path, "w") as f:
         json.dump(metrics, f, indent=4)
-    
+
+
 def get_dist_info():
     try:
-        local_rank = int(os.environ.get("LOCAL_RANK"))
-        global_rank = int(os.environ.get("RANK"))
-        num_rank = int(os.environ.get("WORLD_SIZE"))
-    except:
+        local_rank = int(os.environ.get("LOCAL_RANK") or 0)
+        global_rank = int(os.environ.get("RANK") or 0)
+        num_rank = int(os.environ.get("WORLD_SIZE") or 1)
+    except (TypeError, ValueError):
         local_rank, global_rank, num_rank = 0, 0, 1
-    return local_rank, global_rank, num_rank
\ No newline at end of file
+    return local_rank, global_rank, num_rank
diff --git a/videotuna/utils/config_mapping.py b/videotuna/utils/config_mapping.py
new file mode 100644
index 00000000..cbefccc4
--- /dev/null
+++ b/videotuna/utils/config_mapping.py
@@ -0,0 +1,95 @@
+"""Shared OmegaConf mapping helpers for YAML config sections."""
+
+from __future__ import annotations
+
+import re
+from typing import Any, Self
+
+from loguru import logger
+from omegaconf import DictConfig, OmegaConf
+from pydantic import BaseModel, ValidationError, model_validator
+
+_DOT_PATH = re.compile(r"^[A-Za-z_]\w*(\.[A-Za-z_]\w*)*$")
+
+
+class ConfigMappingError(ValueError):
+    """Raised when a config section mapping is invalid or references missing paths."""
+
+
+class ConfigPathMappings(BaseModel):
+    root: dict[str, str]
+
+    @model_validator(mode="after")
+    def validate_dot_paths(self) -> Self:
+        for source, target in self.root.items():
+            if not _DOT_PATH.match(source):
+                raise ValueError(f"invalid source path {source!r}")
+            if not _DOT_PATH.match(target):
+                raise ValueError(f"invalid target path {target!r}")
+        return self
+
+
+def config_path_exists(cfg: DictConfig, path: str) -> bool:
+    """Return True when every dot segment of ``path`` exists in ``cfg``."""
+    if not path:
+        return False
+
+    node: object = cfg
+    for segment in path.split("."):
+        if not OmegaConf.is_config(node):
+            return False
+        if segment not in node:
+            return False
+        node = node[segment]
+    return True
+
+
+def get_config_path(cfg: DictConfig, path: str) -> Any:
+    """Return the value at ``path`` or raise ``ConfigMappingError`` if missing."""
+    if not config_path_exists(cfg, path):
+        raise ConfigMappingError(f"config path {path!r} does not exist")
+    return OmegaConf.select(cfg, path)
+
+
+def apply_config_mappings(cfg: DictConfig, *, section: str = "train") -> None:
+    """Validate and apply ``{section}.mapping`` entries (source -> target)."""
+    mapping = OmegaConf.select(cfg, f"{section}.mapping")
+    if mapping is None:
+        return
+
+    if not OmegaConf.is_dict(mapping):
+        raise ConfigMappingError(
+            f"{section}.mapping must be a mapping of source paths to target paths, "
+            f"got {type(mapping).__name__}"
+        )
+
+    raw_mapping = OmegaConf.to_container(mapping, resolve=False)
+    if not isinstance(raw_mapping, dict):
+        raise ConfigMappingError(
+            f"{section}.mapping must be a mapping of source paths to target paths, "
+            f"got {type(raw_mapping).__name__}"
+        )
+
+    try:
+        validated = ConfigPathMappings(root=raw_mapping)
+    except ValidationError as exc:
+        raise ConfigMappingError(
+            f"{section}.mapping contains invalid dot paths: {exc}"
+        ) from exc
+
+    for source_path, target_path in validated.root.items():
+        if not config_path_exists(cfg, source_path):
+            raise ConfigMappingError(
+                f"{section}.mapping source path {source_path!r} does not exist "
+                f"(entry {source_path!r} -> {target_path!r})"
+            )
+        if not config_path_exists(cfg, target_path):
+            raise ConfigMappingError(
+                f"{section}.mapping target path {target_path!r} does not exist "
+                f"(entry {source_path!r} -> {target_path!r})"
+            )
+
+        value = OmegaConf.select(cfg, source_path)
+        if value is not None:
+            OmegaConf.update(cfg, target_path, value)
+            logger.info(f"update {target_path} by {source_path} value: {value}")
diff --git a/videotuna/utils/device_utils.py b/videotuna/utils/device_utils.py
new file mode 100644
index 00000000..961921e4
--- /dev/null
+++ b/videotuna/utils/device_utils.py
@@ -0,0 +1,644 @@
+"""Device detection and inference hardware requirements."""
+
+from __future__ import annotations
+
+import os
+import re
+import subprocess
+from dataclasses import dataclass
+from typing import Literal
+
+import torch
+import torch.version
+from loguru import logger
+
+from videotuna.settings import (
+    ENV_ALLOW_CPU_INFERENCE,
+    ENV_CPU_MODE,
+    get_settings,
+)
+
+ComputeBackend = Literal["cuda", "rocm", "cpu"]
+InferenceDtype = Literal["bf16", "fp16", "fp32"]
+FlowCapabilityTier = Literal["cpu_ok", "cpu_smoke", "gpu_required"]
+CpuMode = Literal["off", "smoke", "force"]
+
+_DIFFUSERS_FLOW = "videotuna.flow.diffusers_video.DiffusersVideoFlow"
+_WAN_FLOW = "videotuna.flow.wanvideo.WanVideoModelFlow"
+
+FLOW_TIERS: dict[str, FlowCapabilityTier] = {
+    _DIFFUSERS_FLOW: "cpu_smoke",
+    _WAN_FLOW: "gpu_required",
+}
+
+
+@dataclass(frozen=True)
+class GpuInfo:
+    index: int
+    name: str
+    total_vram_gb: float
+    free_vram_gb: float
+    compute_capability: tuple[int, int]
+    supports_bf16: bool
+
+
+def _torch_hip_version() -> str | None:
+    hip = getattr(torch.version, "hip", None)
+    if hip is None:
+        return None
+    return str(hip)
+
+
+def _detect_compute_backend_raw() -> ComputeBackend:
+    if not torch.cuda.is_available():
+        return "cpu"
+    if _torch_hip_version() is not None:
+        return "rocm"
+    return "cuda"
+
+
+def detect_compute_backend() -> ComputeBackend:
+    """Return the active compute backend (cuda, rocm, or cpu)."""
+    requested = get_settings().compute_backend
+    if requested == "auto":
+        return _detect_compute_backend_raw()
+    if requested == "cpu":
+        return "cpu"
+    if requested == "rocm":
+        if _torch_hip_version() is None:
+            raise RuntimeError(
+                f"VIDEOTUNA_COMPUTE_BACKEND=rocm but PyTorch was not built with HIP. "
+                f"Detected: {describe_compute_environment()}\n"
+                "Install with: poetry install --extras rocm"
+            )
+        if not torch.cuda.is_available():
+            raise RuntimeError(
+                "VIDEOTUNA_COMPUTE_BACKEND=rocm but no ROCm GPU is visible. "
+                "Check ROCm driver and HIP_VISIBLE_DEVICES."
+            )
+        return "rocm"
+    # requested == "cuda"
+    if _torch_hip_version() is not None:
+        raise RuntimeError(
+            f"VIDEOTUNA_COMPUTE_BACKEND=cuda but PyTorch reports HIP "
+            f"({_torch_hip_version()}). "
+            "Use VIDEOTUNA_COMPUTE_BACKEND=rocm or install the CUDA PyTorch wheel."
+        )
+    if not torch.cuda.is_available():
+        raise RuntimeError(
+            "VIDEOTUNA_COMPUTE_BACKEND=cuda but torch.cuda.is_available() " "is False."
+        )
+    return "cuda"
+
+
+def gpu_is_available() -> bool:
+    """True when an accelerator GPU is available (NVIDIA CUDA or AMD ROCm)."""
+    return torch.cuda.is_available()
+
+
+def cuda_is_available() -> bool:
+    """Deprecated alias for gpu_is_available()."""
+    return gpu_is_available()
+
+
+def accelerator_device_string() -> str:
+    """PyTorch device type string for GPU autocast/offload (cuda for CUDA/ROCm)."""
+    return "cuda" if gpu_is_available() else "cpu"
+
+
+def normalize_device_prefer(prefer: str | int | None) -> str | None:
+    """Accept cpu, cuda, cuda:0, cuda:1, 0, 1 → canonical device string."""
+    if prefer is None:
+        return None
+    if isinstance(prefer, int):
+        return f"cuda:{prefer}"
+    text = str(prefer).strip().lower()
+    if not text:
+        return None
+    if text == "mps":
+        raise ValueError(
+            f"Invalid device {prefer!r}. MPS is not supported. "
+            "PrivTune supports cpu, cuda, and cuda:N. "
+            "For config validation on Apple Silicon, use "
+            "VIDEOTUNA_CPU_MODE=smoke or --cpu-smoke."
+        )
+    if text == "cpu":
+        return text
+    if text.isdigit():
+        return f"cuda:{int(text)}"
+    if text == "cuda":
+        return "cuda"
+    if re.match(r"^cuda:\d+$", text):
+        return text
+    raise ValueError(
+        f"Invalid device {prefer!r}. Expected cpu, cuda, cuda:N, "
+        "or an integer GPU index."
+    )
+
+
+def resolve_cpu_mode(*, cli_smoke: bool = False) -> CpuMode:
+    """Resolve CPU inference mode from CLI flag, env, or legacy allow_cpu."""
+    if cli_smoke:
+        return "smoke"
+    settings = get_settings()
+    mode: CpuMode = settings.cpu_mode
+    if settings.allow_cpu_inference:
+        logger.warning(
+            "{} is deprecated; use {}=force or --cpu-smoke instead.",
+            ENV_ALLOW_CPU_INFERENCE,
+            ENV_CPU_MODE,
+        )
+        return "force"
+    return mode
+
+
+def _is_production_video_resolution(
+    height: int | None,
+    width: int | None,
+) -> bool:
+    """True when H×W matches production video presets (720p-class)."""
+    if height is None or width is None:
+        return False
+    return (height >= 720 or width >= 1280) or (height >= 480 and width >= 720)
+
+
+def _is_init_smoke_resolution(
+    height: int | None,
+    width: int | None,
+    *,
+    frames: int | None = None,
+) -> bool:
+    """Tiny resolution for native-flow init-only CPU smoke (not full denoise)."""
+    if height is None or width is None:
+        return False
+    if height > 256 or width > 256:
+        return False
+    if frames is not None and frames > 2:
+        return False
+    return True
+
+
+def _diffusers_flow_tier(
+    family: str,
+    variant: str,
+    height: int | None,
+    width: int | None,
+    base: FlowCapabilityTier,
+) -> FlowCapabilityTier:
+    """CPU tier for DiffusersVideoFlow from model family and resolution."""
+    if family == "flux" and variant in ("schnell", "1-schnell"):
+        return "cpu_smoke"
+    if family == "flux" and variant in ("1-dev", "dev"):
+        if height is not None and height >= 512:
+            return "gpu_required"
+        return "cpu_smoke"
+    if family == "wan":
+        if _is_production_video_resolution(height, width):
+            return "gpu_required"
+        return "cpu_smoke"
+    if _is_production_video_resolution(height, width):
+        return "gpu_required"
+    return base
+
+
+def get_flow_tier(
+    flow_target: str,
+    *,
+    model_family: str | None = None,
+    model_variant: str | None = None,
+    height: int | None = None,
+    width: int | None = None,
+) -> FlowCapabilityTier:
+    """Return the CPU capability tier for a flow target and optional model hints."""
+    base = FLOW_TIERS.get(flow_target, "cpu_ok")
+    if flow_target != _DIFFUSERS_FLOW:
+        return base
+
+    family = (model_family or "").lower()
+    variant = (model_variant or "").lower()
+    return _diffusers_flow_tier(family, variant, height, width, base)
+
+
+def _validate_cuda_device_index(index: int) -> None:
+    if not gpu_is_available():
+        raise RuntimeError(
+            f"Requested CUDA device index {index} but no GPU accelerator is available."
+        )
+    count = torch.cuda.device_count()
+    if index < 0 or index >= count:
+        raise RuntimeError(
+            f"Invalid CUDA device index {index}. "
+            f"Visible GPU count is {count} (after CUDA_VISIBLE_DEVICES remapping)."
+        )
+
+
+def resolve_inference_device(prefer: str | int | None = None) -> torch.device:
+    """Pick the best available torch device for inference."""
+    if detect_compute_backend() == "cpu" and prefer is None:
+        return torch.device("cpu")
+
+    normalized = normalize_device_prefer(prefer)
+    if normalized:
+        device = torch.device(normalized)
+        if device.type == "cuda":
+            if not gpu_is_available():
+                raise RuntimeError(
+                    f"Requested device {prefer!r} but no GPU accelerator is available."
+                )
+            index = device.index if device.index is not None else 0
+            _validate_cuda_device_index(index)
+            torch.cuda.set_device(index)
+            return torch.device("cuda", index)
+        return device
+    if gpu_is_available() and detect_compute_backend() != "cpu":
+        torch.cuda.set_device(0)
+        return torch.device("cuda", 0)
+    return torch.device("cpu")
+
+
+def get_visible_gpus() -> list[GpuInfo]:
+    """Enumerate visible CUDA/ROCm devices with VRAM and compute capability."""
+    if not gpu_is_available():
+        return []
+    gpus: list[GpuInfo] = []
+    for index in range(torch.cuda.device_count()):
+        props = torch.cuda.get_device_properties(index)
+        free_bytes, total_bytes = torch.cuda.mem_get_info(index)
+        major, minor = props.major, props.minor
+        gpus.append(
+            GpuInfo(
+                index=index,
+                name=props.name,
+                total_vram_gb=total_bytes / (1024**3),
+                free_vram_gb=free_bytes / (1024**3),
+                compute_capability=(major, minor),
+                supports_bf16=major >= 8,
+            )
+        )
+    return gpus
+
+
+def recommend_dtype(device: torch.device) -> InferenceDtype:
+    """CPU → fp32; Ampere+ (sm >= 8.0) → bf16; older NVIDIA GPUs → fp16."""
+    if device.type == "cpu":
+        return "fp32"
+    if device.type != "cuda" or not gpu_is_available():
+        return "fp16"
+    index = device.index if device.index is not None else 0
+    major, _minor = torch.cuda.get_device_capability(index)
+    if major >= 8:
+        return "bf16"
+    return "fp16"
+
+
+def require_min_vram(
+    gb: float,
+    *,
+    device: torch.device | None = None,
+    context: str = "",
+) -> None:
+    """Fail fast when selected GPU total VRAM is below *gb*."""
+    if not gpu_is_available():
+        raise RuntimeError(
+            _format_hardware_context(context)
+            + "No GPU accelerator is available for VRAM check."
+        )
+    dev = device or resolve_inference_device()
+    if dev.type != "cuda":
+        return
+    index = dev.index if dev.index is not None else 0
+    props = torch.cuda.get_device_properties(index)
+    total_gb = props.total_memory / (1024**3)
+    if total_gb < gb:
+        prefix = _format_hardware_context(context, device_index=index)
+        raise RuntimeError(
+            f"{prefix}"
+            f"GPU total VRAM {total_gb:.1f} GB is below required {gb:.1f} GB.\n"
+            "Next steps:\n"
+            "  - Use --memory-preset low_vram or --enable_sequential_cpu_offload\n"
+            "  - Lower resolution or frame count in the config\n"
+            "  - Select a GPU with more VRAM via --device / CUDA_VISIBLE_DEVICES"
+        )
+
+
+def _cuda_runtime_version() -> str:
+    cuda_ver = getattr(torch.version, "cuda", None)
+    return str(cuda_ver) if cuda_ver else "unknown"
+
+
+def _driver_version() -> str:
+    try:
+        result = subprocess.run(
+            [
+                "nvidia-smi",
+                "--query-gpu=driver_version",
+                "--format=csv,noheader",
+            ],
+            capture_output=True,
+            text=True,
+            timeout=5,
+            check=False,
+        )
+        if result.returncode == 0 and result.stdout.strip():
+            return result.stdout.strip().splitlines()[0].strip()
+    except (OSError, subprocess.TimeoutExpired):
+        pass
+    return "unknown"
+
+
+def _format_hardware_context(
+    context: str = "",
+    *,
+    device_index: int = 0,
+) -> str:
+    lines: list[str] = []
+    if context:
+        lines.append(context.strip())
+        if not lines[-1].endswith("."):
+            lines[-1] += "."
+    if gpu_is_available():
+        props = torch.cuda.get_device_properties(device_index)
+        free_bytes, total_bytes = torch.cuda.mem_get_info(device_index)
+        lines.append(
+            f"  GPU: {props.name} "
+            f"({total_bytes / (1024**3):.1f} GB total, "
+            f"{free_bytes / (1024**3):.1f} GB free)"
+        )
+        lines.append(
+            f"  Driver: {_driver_version()} / "
+            f"CUDA runtime: {_cuda_runtime_version()} / "
+            f"PyTorch: {torch.__version__}"
+        )
+    else:
+        lines.append(f"  Detected: {describe_compute_environment()}")
+    return "\n".join(lines) + "\n"
+
+
+def log_startup_device_summary(
+    device: torch.device,
+    dtype: str | None,
+    attn_backend: str,
+    offload_mode: str,
+    *,
+    attn_backend_requested: str | None = None,
+    memory_preset: str | None = None,
+    compile_enabled: bool = False,
+    compile_mode: str | None = None,
+) -> None:
+    """Emit a single structured startup log for inference."""
+    gpu_name = "CPU"
+    if device.type == "cuda" and gpu_is_available():
+        index = device.index if device.index is not None else 0
+        gpu_name = torch.cuda.get_device_name(index)
+    requested = attn_backend_requested or attn_backend
+    resolved_note = f" (resolved {attn_backend})" if requested != attn_backend else ""
+    preset_note = f", preset={memory_preset}" if memory_preset else ""
+    compile_note = ""
+    if compile_enabled:
+        compile_note = f", compile={compile_mode or 'reduce-overhead'}"
+    logger.info(
+        "Inference startup: device={} gpu={} dtype={} attention={}{} offload={}{}{}",
+        device,
+        gpu_name,
+        dtype or "auto",
+        requested,
+        resolved_note,
+        offload_mode,
+        preset_note,
+        compile_note,
+    )
+
+
+def empty_accelerator_cache() -> None:
+    if gpu_is_available():
+        torch.cuda.empty_cache()
+
+
+def synchronize_accelerator() -> None:
+    if gpu_is_available():
+        torch.cuda.synchronize()
+
+
+# NVIDIA-oriented aliases (ROCm uses the same torch.cuda API).
+empty_cache = empty_accelerator_cache
+synchronize_device = synchronize_accelerator
+
+
+def describe_compute_environment() -> str:
+    backend = _detect_compute_backend_raw()
+    if backend == "rocm":
+        name = torch.cuda.get_device_name(0)
+        hip = _torch_hip_version() or "unknown"
+        return f"ROCm available ({name}, torch {torch.__version__}, HIP {hip})"
+    if backend == "cuda":
+        name = torch.cuda.get_device_name(0)
+        return f"CUDA available ({name}, torch {torch.__version__})"
+    return "No GPU accelerator (CPU-only PyTorch or no GPU driver)"
+
+
+def snapshot_nvidia_smi() -> str | None:
+    """Best-effort nvidia-smi snapshot for failure diagnostics."""
+    try:
+        result = subprocess.run(
+            ["nvidia-smi"],
+            capture_output=True,
+            text=True,
+            timeout=10,
+            check=False,
+        )
+        if result.returncode == 0 and result.stdout.strip():
+            return result.stdout.strip()
+    except (OSError, subprocess.TimeoutExpired):
+        pass
+    return None
+
+
+def _tiered_cpu_error_message(
+    flow_target: str,
+    tier: FlowCapabilityTier,
+    cpu_mode: CpuMode,
+) -> str:
+    lines = [
+        f"This inference command requires a GPU (tier={tier}, cpu_mode={cpu_mode}).\n",
+        _format_hardware_context(f"Flow: {flow_target}"),
+        "Install options:\n"
+        "  - NVIDIA: poetry install --extras cuda\n"
+        "  - AMD ROCm: poetry install --extras rocm (see docs/install-rocm.md)\n",
+        "What you can do without a GPU:\n"
+        "  - Unit tests: poetry run pytest tests/ -m 'not gpu'\n"
+        "  - CPU smoke presets: configs/inference/presets/*_cpu_smoke.yaml\n"
+        "  - Full matrix: docs/capability-matrix.md\n",
+    ]
+    if tier == "gpu_required":
+        lines.append(
+            "  - Debug init only (≤256px, ≤2 frames): --cpu-smoke with a tiny preset\n"
+            f"  - Full override (not recommended): {ENV_CPU_MODE}=force\n"
+        )
+    elif tier == "cpu_smoke" and cpu_mode == "off":
+        lines.append(f"  - Enable CPU smoke: --cpu-smoke or {ENV_CPU_MODE}=smoke\n")
+    lines.append("See docs/install-cpu.md and docs/capability-matrix.md.")
+    return "".join(lines)
+
+
+def require_accelerator_for_flow(
+    flow_target: str,
+    *,
+    min_vram_gb: float | None = None,
+    allow_cpu: bool = False,
+    cpu_mode: CpuMode | None = None,
+    tier: FlowCapabilityTier | None = None,
+    model_family: str | None = None,
+    model_variant: str | None = None,
+    height: int | None = None,
+    width: int | None = None,
+    frames: int | None = None,
+) -> None:
+    """
+    Fail fast when a GPU-backed video flow is started without an accelerator.
+
+    Passes when a CUDA or ROCm GPU is available, or when CPU mode permits the tier.
+    """
+    if allow_cpu:
+        logger.warning(
+            "allow_cpu=True is deprecated; use --cpu-smoke or VIDEOTUNA_CPU_MODE=force"
+        )
+        cpu_mode = "force"
+
+    resolved_tier = tier or get_flow_tier(
+        flow_target,
+        model_family=model_family,
+        model_variant=model_variant,
+        height=height,
+        width=width,
+    )
+    mode = cpu_mode if cpu_mode is not None else resolve_cpu_mode()
+
+    if resolved_tier == "cpu_ok":
+        return
+
+    backend = detect_compute_backend()
+
+    if gpu_is_available() and backend != "cpu":
+        logger.info("Inference device: {}", describe_compute_environment())
+        if min_vram_gb is not None:
+            props = torch.cuda.get_device_properties(0)
+            total_gb = props.total_memory / (1024**3)
+            if total_gb < min_vram_gb:
+                logger.warning(
+                    "GPU VRAM {:.1f} GB is below recommended {:.1f} GB for {}",
+                    total_gb,
+                    min_vram_gb,
+                    flow_target,
+                )
+        return
+
+    if mode == "force":
+        logger.warning(
+            "CPU force mode: skipping GPU requirement for {} (tier={}); "
+            "not suitable for production inference",
+            flow_target,
+            resolved_tier,
+        )
+        return
+
+    if mode == "smoke" and resolved_tier == "cpu_smoke":
+        logger.warning(
+            "CPU smoke mode: {} tier=cpu_smoke — tiny resolution/steps only",
+            flow_target,
+        )
+        return
+
+    if (
+        mode == "smoke"
+        and resolved_tier == "gpu_required"
+        and flow_target == _WAN_FLOW
+        and _is_init_smoke_resolution(height, width, frames=frames)
+    ):
+        logger.warning(
+            "CPU init smoke: {} at {}x{} (frames={}) — checkpoint load only, "
+            "not production 720p denoise",
+            flow_target,
+            height,
+            width,
+            frames,
+        )
+        return
+
+    raise RuntimeError(_tiered_cpu_error_message(flow_target, resolved_tier, mode))
+
+
+def require_nvidia_cuda_for_flow(
+    flow_target: str,
+    *,
+    allow_cpu: bool = False,
+    **kwargs: object,
+) -> None:
+    """Deprecated alias for require_accelerator_for_flow."""
+    require_accelerator_for_flow(flow_target, allow_cpu=allow_cpu, **kwargs)  # type: ignore[arg-type]
+
+
+def require_xfuser_sequence_parallel(flow_name: str) -> None:
+    """Fail when xfuser USP is requested on ROCm (CUDA-only dependency)."""
+    if detect_compute_backend() == "rocm":
+        raise RuntimeError(
+            f"Sequence parallel (ulysses_degree / ring_degree) is not supported on "
+            f"AMD ROCm for {flow_name}. xfuser requires NVIDIA CUDA.\n"
+            "Use single-GPU inference with VIDEOTUNA_ATTN_BACKEND=sdpa instead."
+        )
+
+
+def validate_sequence_parallel_degrees(
+    ulysses_degree: int | None,
+    ring_degree: int | None,
+    *,
+    world_size: int | None = None,
+) -> None:
+    """Validate xfuser USP degree product matches visible process count."""
+    u = ulysses_degree or 1
+    r = ring_degree or 1
+    if u <= 1 and r <= 1:
+        return
+    product = u * r
+    if world_size is None:
+        try:
+            world_size = int(os.environ.get("WORLD_SIZE", "1"))
+        except ValueError:
+            world_size = 1
+    if world_size != product:
+        raise ValueError(
+            f"ulysses_degree ({u}) × ring_degree ({r}) = {product} but "
+            f"WORLD_SIZE={world_size}. "
+            "Launch with torchrun --nproc_per_node=N where N equals the product."
+        )
+
+
+def checkpoints_exist(path: str | None) -> bool:
+    if not path:
+        return False
+    from pathlib import Path
+
+    p = Path(path)
+    return p.exists() and (p.is_dir() or p.is_file())
+
+
+def looks_like_hf_model_id(path: str) -> bool:
+    """True for org/model repo ids that are not local paths."""
+    if not path or path.startswith(("/", "./", "../")):
+        return False
+    from pathlib import Path
+
+    if Path(path).exists():
+        return False
+    parts = path.replace("\\", "/").split("/")
+    return len(parts) == 2 and all(parts) and " " not in path
+
+
+def checkpoint_available(path: str | None, *, flow_target: str = "") -> bool:
+    """Local checkpoint exists, or path is a Hugging Face model id."""
+    if not path:
+        return True
+    if checkpoints_exist(path):
+        return True
+    if "diffusers_video" in flow_target and looks_like_hf_model_id(path):
+        return True
+    return looks_like_hf_model_id(path)
diff --git a/videotuna/utils/diffusers_inference_shim.py b/videotuna/utils/diffusers_inference_shim.py
new file mode 100644
index 00000000..6ace61e7
--- /dev/null
+++ b/videotuna/utils/diffusers_inference_shim.py
@@ -0,0 +1,26 @@
+"""Redirect deprecated Diffusers inference scripts to inference_new.py."""
+
+from __future__ import annotations
+
+import subprocess
+import sys
+import warnings
+
+
+def run_diffusers_inference(config: str, extra_args: list[str] | None = None) -> int:
+    message = (
+        f"This script is deprecated. Use:\n"
+        f"  python scripts/inference_new.py --config {config}\n"
+        f"or the matching poetry run inference-* alias."
+    )
+    warnings.warn(message, DeprecationWarning, stacklevel=2)
+    cmd = [
+        sys.executable,
+        "scripts/inference_new.py",
+        "--config",
+        config,
+    ]
+    if extra_args:
+        cmd.extend(extra_args)
+    result = subprocess.run(cmd, check=False)
+    return result.returncode
diff --git a/videotuna/utils/diffusers_optimizations.py b/videotuna/utils/diffusers_optimizations.py
new file mode 100644
index 00000000..ca4440b9
--- /dev/null
+++ b/videotuna/utils/diffusers_optimizations.py
@@ -0,0 +1,162 @@
+"""Shared Diffusers pipeline memory and performance optimizations."""
+
+from __future__ import annotations
+
+from contextlib import nullcontext
+from typing import Any, Optional, cast
+
+import torch
+from loguru import logger
+
+from videotuna.utils.attention import (
+    apply_diffusers_attention_backend,
+    maybe_compile_denoiser,
+)
+from videotuna.utils.device_utils import gpu_is_available, resolve_inference_device
+from videotuna.utils.inference_cli import resolve_offload_mode
+
+
+def _maybe_compile_pipeline_transformer(pipe: Any, offload: str) -> None:
+    """Compile the transformer when full-GPU inference is requested."""
+    if offload != "none":
+        return
+    transformer = getattr(pipe, "transformer", None)
+    if transformer is None:
+        return
+    compiled = maybe_compile_denoiser(transformer)
+    if compiled is not transformer:
+        pipe.transformer = compiled
+
+
+def _apply_vae_memory_opts(pipe: Any, args: Any) -> None:
+    if getattr(args, "enable_vae_slicing", False) and hasattr(pipe, "vae"):
+        pipe.vae.enable_slicing()
+    if getattr(args, "enable_vae_tiling", False):
+        if hasattr(pipe, "enable_vae_tiling"):
+            pipe.enable_vae_tiling()
+        elif hasattr(pipe, "vae") and hasattr(pipe.vae, "enable_tiling"):
+            pipe.vae.enable_tiling()
+
+
+def _apply_attention_cache_opts(pipe: Any, args: Any) -> None:
+    transformer = getattr(pipe, "transformer", None)
+    if transformer is None or not getattr(args, "enable_attention_cache", False):
+        return
+    if hasattr(transformer, "enable_cache"):
+        transformer.enable_cache()
+        logger.info("Enabled transformer attention cache")
+    else:
+        logger.warning(
+            "enable_attention_cache requested but transformer has no enable_cache()"
+        )
+
+
+def apply_diffusers_optimizations(
+    pipe: Any,
+    args: Any,
+    *,
+    model_family: Optional[str] = None,
+    disable_progress_bar: bool = False,
+    device: Optional[torch.device] = None,
+) -> None:
+    """Apply offload, VAE tiling/slicing, QKV fusion, attention, and cache APIs."""
+    offload = resolve_offload_mode(args)
+    target_device = device or resolve_inference_device(getattr(args, "device", None))
+    device_map = getattr(args, "device_map", None)
+
+    if device_map == "auto" and offload == "none":
+        _apply_device_map(pipe, target_device)
+    elif offload == "sequential":
+        pipe.enable_sequential_cpu_offload()
+    elif offload == "model":
+        pipe.enable_model_cpu_offload()
+    elif hasattr(pipe, "to"):
+        pipe.to(target_device)
+
+    _apply_vae_memory_opts(pipe, args)
+
+    if getattr(args, "fuse_qkv", False) and hasattr(pipe, "fuse_qkv_projections"):
+        pipe.fuse_qkv_projections()
+        logger.info("Enabled fuse_qkv_projections on pipeline")
+
+    apply_diffusers_attention_backend(pipe)
+
+    if hasattr(pipe, "set_progress_bar_config"):
+        pipe.set_progress_bar_config(disable=disable_progress_bar)
+
+    _maybe_compile_pipeline_transformer(pipe, offload)
+    _apply_attention_cache_opts(pipe, args)
+
+
+def _apply_device_map(pipe: Any, device: torch.device) -> None:
+    """Spread large Diffusers models across visible GPUs (experimental)."""
+    try:
+        from accelerate import dispatch_model, infer_auto_device_map
+    except ImportError as exc:
+        raise RuntimeError(
+            "device_map=auto requires accelerate. Install with: poetry install"
+        ) from exc
+
+    if not gpu_is_available() or torch.cuda.device_count() < 2:
+        logger.warning(
+            "device_map=auto requested but fewer than 2 GPUs visible; using single GPU"
+        )
+        if hasattr(pipe, "to"):
+            pipe.to(device)
+        return
+
+    main_module = getattr(pipe, "transformer", None) or getattr(pipe, "unet", None)
+    if main_module is None:
+        logger.warning("device_map=auto: no transformer/unet on pipeline; skipping")
+        if hasattr(pipe, "to"):
+            pipe.to(device)
+        return
+
+    max_memory: dict[int | str, int | str] = {
+        i: "22GiB" for i in range(torch.cuda.device_count())
+    }
+    device_map = infer_auto_device_map(
+        main_module,
+        max_memory=max_memory,
+    )
+    dispatched = dispatch_model(
+        main_module, device_map=cast(dict[str, Any], device_map)
+    )
+    if hasattr(pipe, "transformer"):
+        pipe.transformer = dispatched
+    elif hasattr(pipe, "unet"):
+        pipe.unet = dispatched
+    logger.info(
+        "Applied accelerate device_map=auto across {} GPUs",
+        torch.cuda.device_count(),
+    )
+
+
+def apply_flow_memory_config(flow: Any, inference_config: Any) -> None:
+    """Apply memory/offload settings after from_pretrained for all flow types."""
+    flow_name = flow.__class__.__name__
+    if flow_name == "DiffusersVideoFlow":
+        if flow.pipeline is not None:
+            device = resolve_inference_device(getattr(inference_config, "device", None))
+            apply_diffusers_optimizations(
+                flow.pipeline,
+                inference_config,
+                model_family=getattr(flow, "model_family", None),
+                device=device,
+            )
+        return
+
+    if flow_name == "WanVideoModelFlow":
+        if getattr(inference_config, "enable_model_cpu_offload", False):
+            flow.offload_model = True
+        elif getattr(inference_config, "enable_sequential_cpu_offload", False):
+            flow.offload_model = True
+        return
+
+
+def transformer_cache_context(pipe: Any):
+    """Return a cache context manager when the transformer supports it."""
+    transformer = getattr(pipe, "transformer", None)
+    if transformer is not None and hasattr(transformer, "cache_context"):
+        return transformer.cache_context()
+    return nullcontext()
diff --git a/videotuna/utils/diffusers_quantization.py b/videotuna/utils/diffusers_quantization.py
new file mode 100644
index 00000000..87573a8b
--- /dev/null
+++ b/videotuna/utils/diffusers_quantization.py
@@ -0,0 +1,248 @@
+"""Diffusers pipeline quantization for Wan 2.2 / Flux inference."""
+
+from __future__ import annotations
+
+import importlib.metadata
+from typing import Any, Literal, Optional
+
+import torch
+from loguru import logger
+from packaging.version import Version
+
+from videotuna.utils.device_utils import detect_compute_backend, gpu_is_available
+
+TransformerQuant = Literal["none", "int8_wo", "int4_wo", "fp8_wo"]
+QuantBackend = Literal["torchao", "quanto"]
+
+TRANSFORMER_QUANT_CHOICES = ("none", "int8_wo", "int4_wo", "fp8_wo")
+QUANT_BACKEND_CHOICES = ("torchao", "quanto")
+
+_FP8_MIN_COMPUTE_CAPABILITY = (8, 9)
+_TORCHAO_MIN_VERSION = "0.15.0"
+
+
+def normalize_transformer_quant(value: Optional[str]) -> str:
+    """Return a validated transformer_quant string (default ``none``)."""
+    if value is None or value == "":
+        return "none"
+    quant = str(value).lower()
+    if quant not in TRANSFORMER_QUANT_CHOICES:
+        raise ValueError(
+            f"Unsupported transformer_quant {value!r}. "
+            f"Expected one of: {', '.join(TRANSFORMER_QUANT_CHOICES)}"
+        )
+    return quant
+
+
+def normalize_quant_backend(value: Optional[str]) -> str:
+    """Return a validated quant_backend string (default ``torchao``)."""
+    if value is None or value == "":
+        return "torchao"
+    backend = str(value).lower()
+    if backend not in QUANT_BACKEND_CHOICES:
+        raise ValueError(
+            f"Unsupported quant_backend {value!r}. "
+            f"Expected one of: {', '.join(QUANT_BACKEND_CHOICES)}"
+        )
+    return backend
+
+
+def resolve_quant_components(
+    model_family: str,
+    model_variant: Optional[str],
+    mode: str,
+) -> list[str]:
+    """Pipeline submodules to quantize for the given Diffusers model family."""
+    family = model_family.lower()
+    variant = (model_variant or "").strip()
+    if family == "wan" and variant == "2.2" and mode.lower() in ("t2v", "i2v"):
+        return ["transformer", "transformer_2"]
+    if family in ("wan", "flux"):
+        return ["transformer"]
+    return ["transformer"]
+
+
+def _require_cuda_for_quant(transformer_quant: str) -> None:
+    backend = detect_compute_backend()
+    if backend == "cpu":
+        raise RuntimeError(
+            f"transformer_quant={transformer_quant!r} is not supported on CPU. "
+            "Use transformer_quant: none or --cpu-smoke without quantization."
+        )
+    if backend == "rocm":
+        raise RuntimeError(
+            f"transformer_quant={transformer_quant!r} is not supported on AMD ROCm. "
+            "Use memory_preset low_vram with offload instead."
+        )
+    if backend != "cuda":
+        raise RuntimeError(
+            f"transformer_quant={transformer_quant!r} requires NVIDIA CUDA; "
+            f"detected backend {backend!r}."
+        )
+
+
+def _require_quanto() -> None:
+    try:
+        import optimum.quanto  # noqa: F401
+    except ImportError as exc:
+        raise ImportError(
+            "quant_backend=quanto requires the quanto extra. "
+            "Install with: poetry install -E quanto"
+        ) from exc
+
+
+def _require_torchao_min_version(min_version: str = _TORCHAO_MIN_VERSION) -> None:
+    try:
+        import torchao  # noqa: F401
+    except ImportError as exc:
+        raise ImportError(
+            f"quant_backend=torchao requires torchao>={min_version}. "
+            "Install with: poetry install"
+        ) from exc
+    installed = importlib.metadata.version("torchao")
+    if Version(installed) < Version(min_version):
+        raise ImportError(
+            f"quant_backend=torchao requires torchao>={min_version}; "
+            f"found {installed}. Run: poetry update torchao"
+        )
+
+
+def _require_fp8_gpu() -> None:
+    if not gpu_is_available():
+        return
+    major, minor = torch.cuda.get_device_capability(0)
+    min_major, min_minor = _FP8_MIN_COMPUTE_CAPABILITY
+    if (major, minor) < (min_major, min_minor):
+        raise RuntimeError(
+            f"transformer_quant=fp8_wo requires NVIDIA GPU compute capability >= "
+            f"{min_major}.{min_minor} (Ada/Hopper); detected {major}.{minor}. "
+            "Use int8_wo or int4_wo on older GPUs."
+        )
+    if not hasattr(torch, "float8_e4m3fn"):
+        raise RuntimeError(
+            "transformer_quant=fp8_wo requires torch.float8_e4m3fn (PyTorch 2.6+)."
+        )
+
+
+def validate_transformer_quant(
+    *,
+    transformer_quant: Optional[str],
+    quant_backend: Optional[str],
+    offload_mode: str,
+    compile_enabled: bool = False,
+    has_lora: bool = False,
+) -> str:
+    """
+    Validate quant settings before pipeline load.
+
+    Returns the normalized transformer_quant value.
+    """
+    quant = normalize_transformer_quant(transformer_quant)
+    if quant == "none":
+        return quant
+
+    backend = normalize_quant_backend(quant_backend)
+    _require_cuda_for_quant(quant)
+    if quant == "fp8_wo":
+        _require_fp8_gpu()
+
+    if backend == "quanto":
+        _require_quanto()
+    else:
+        _require_torchao_min_version()
+
+    if offload_mode == "sequential":
+        logger.warning(
+            "transformer_quant={} with sequential CPU offload may be slow or "
+            "incompatible; model CPU offload is preferred when quantizing.",
+            quant,
+        )
+    if compile_enabled and offload_mode != "none":
+        logger.warning(
+            "transformer_quant={} with CPU offload disables torch.compile on "
+            "the transformer (compile applies only when offload is none).",
+            quant,
+        )
+    if has_lora:
+        logger.info(
+            "Attempting native LoRA load on quantized transformers; "
+            "use transformer_quant: none if PEFT bridge fails."
+        )
+    return quant
+
+
+def maybe_adjust_offload_for_quant(args: Any, transformer_quant: str) -> None:
+    """Prefer model offload over sequential when quant is enabled (mutates args)."""
+    if transformer_quant == "none":
+        return
+    if getattr(args, "enable_sequential_cpu_offload", False):
+        logger.info(
+            "transformer_quant enabled: switching sequential CPU offload to "
+            "model CPU offload for Diffusers quant compatibility"
+        )
+        args.enable_sequential_cpu_offload = False
+        args.enable_model_cpu_offload = True
+
+
+def _build_torchao_component_config(transformer_quant: str) -> Any:
+    from diffusers import TorchAoConfig
+
+    if transformer_quant == "int8_wo":
+        from torchao.quantization import Int8WeightOnlyConfig, PerGroup
+
+        return TorchAoConfig(Int8WeightOnlyConfig(granularity=PerGroup(128), version=2))
+    if transformer_quant == "int4_wo":
+        from torchao.quantization import Int4WeightOnlyConfig
+
+        return TorchAoConfig(Int4WeightOnlyConfig(group_size=128, version=2))
+    if transformer_quant == "fp8_wo":
+        from torchao.quantization import Float8WeightOnlyConfig
+
+        return TorchAoConfig(Float8WeightOnlyConfig())
+    raise ValueError(f"Unsupported torchao quant scheme: {transformer_quant}")
+
+
+def _build_quanto_component_config(transformer_quant: str) -> Any:
+    _require_quanto()
+    from diffusers import QuantoConfig
+
+    mapping = {
+        "int8_wo": "int8",
+        "int4_wo": "int4",
+        "fp8_wo": "float8",
+    }
+    weights_dtype = mapping.get(transformer_quant)
+    if weights_dtype is None:
+        raise ValueError(
+            f"quanto backend does not support transformer_quant={transformer_quant!r}"
+        )
+    return QuantoConfig(weights_dtype=weights_dtype)
+
+
+def build_pipeline_quantization_config(
+    *,
+    transformer_quant: str,
+    quant_backend: str,
+    components: list[str],
+) -> Optional[Any]:
+    """Build a Diffusers PipelineQuantizationConfig or None when quant is disabled."""
+    quant = normalize_transformer_quant(transformer_quant)
+    if quant == "none":
+        return None
+
+    backend = normalize_quant_backend(quant_backend)
+    from diffusers import PipelineQuantizationConfig
+
+    if backend == "torchao":
+        component_cfg = _build_torchao_component_config(quant)
+    else:
+        component_cfg = _build_quanto_component_config(quant)
+
+    quant_mapping = {name: component_cfg for name in components}
+    logger.info(
+        "Diffusers quant: scheme={} backend={} components={}",
+        quant,
+        backend,
+        list(quant_mapping.keys()),
+    )
+    return PipelineQuantizationConfig(quant_mapping=quant_mapping)
diff --git a/videotuna/utils/diffusion_utils.py b/videotuna/utils/diffusion_utils.py
index 1b3a4675..c207890c 100644
--- a/videotuna/utils/diffusion_utils.py
+++ b/videotuna/utils/diffusion_utils.py
@@ -2,7 +2,6 @@
 
 import numpy as np
 import torch
-import torch.nn.functional as F
 from einops import repeat
 
 
@@ -74,7 +73,8 @@ def make_ddim_timesteps(
     if ddim_discr_method == "uniform":
         c = num_ddpm_timesteps // num_ddim_timesteps
         ddim_timesteps = np.asarray(list(range(0, num_ddpm_timesteps, c)))
-        # add one to get the final alpha values right (the ones from first scale to data during sampling)
+        # add one to get the final alpha values right (the ones from first
+        # scale to data during sampling)
         steps_out = ddim_timesteps + 1
     elif ddim_discr_method == "uniform_trailing":
         c = num_ddpm_timesteps / num_ddim_timesteps
@@ -86,7 +86,8 @@ def make_ddim_timesteps(
         ddim_timesteps = (
             (np.linspace(0, np.sqrt(num_ddpm_timesteps * 0.8), num_ddim_timesteps)) ** 2
         ).astype(int)
-        # add one to get the final alpha values right (the ones from first scale to data during sampling)
+        # add one to get the final alpha values right (the ones from first
+        # scale to data during sampling)
         steps_out = ddim_timesteps + 1
     else:
         raise NotImplementedError(
@@ -140,7 +141,8 @@ def betas_for_alpha_bar(num_diffusion_timesteps, alpha_bar, max_beta=0.999):
 
 def rescale_zero_terminal_snr(betas):
     """
-    Rescales betas to have zero terminal SNR Based on https://arxiv.org/pdf/2305.08891.pdf (Algorithm 1)
+    Rescales betas to have zero terminal SNR Based on
+    https://arxiv.org/pdf/2305.08891.pdf (Algorithm 1)
 
     Args:
         betas (`numpy.ndarray`):
@@ -175,8 +177,9 @@ def rescale_zero_terminal_snr(betas):
 
 def rescale_noise_cfg(noise_cfg, noise_pred_text, guidance_rescale=0.0):
     """
-    Rescale `noise_cfg` according to `guidance_rescale`. Based on findings of [Common Diffusion Noise Schedules and
-    Sample Steps are Flawed](https://arxiv.org/pdf/2305.08891.pdf). See Section 3.4
+    Rescale `noise_cfg` according to `guidance_rescale`. Based on findings of
+    [Common Diffusion Noise Schedules and Sample Steps are
+    Flawed](https://arxiv.org/pdf/2305.08891.pdf). See Section 3.4
     """
     std_text = noise_pred_text.std(
         dim=list(range(1, noise_pred_text.ndim)), keepdim=True
@@ -184,7 +187,8 @@ def rescale_noise_cfg(noise_cfg, noise_pred_text, guidance_rescale=0.0):
     std_cfg = noise_cfg.std(dim=list(range(1, noise_cfg.ndim)), keepdim=True)
     # rescale the results from guidance (fixes overexposure)
     noise_pred_rescaled = noise_cfg * (std_text / std_cfg)
-    # mix with the original results from guidance by factor guidance_rescale to avoid "plain looking" images
+    # mix with the original results from guidance by factor guidance_rescale
+    # to avoid "plain looking" images
     noise_cfg = (
         guidance_rescale * noise_pred_rescaled + (1 - guidance_rescale) * noise_cfg
     )
diff --git a/videotuna/utils/distributions.py b/videotuna/utils/distributions.py
index 3a851c2f..d63b8c39 100644
--- a/videotuna/utils/distributions.py
+++ b/videotuna/utils/distributions.py
@@ -57,7 +57,8 @@ def mode(self):
 def normal_kl(mean1, logvar1, mean2, logvar2):
     """
     Compute the KL divergence between two gaussians.
-    Shapes are automatically broadcasted, so batches can be compared to scalars, among other use cases.
+    Shapes are automatically broadcasted, so batches can be compared to
+    scalars, among other use cases.
     Source: https://github.com/openai/guided-diffusion/blob/27c20a8fab9cb472df5d6bdd6c8d11c8f430b924/guided_diffusion/losses.py#L12
     """
     tensor = None
diff --git a/videotuna/utils/ema.py b/videotuna/utils/ema.py
index 0e1447b0..1fbaf62a 100644
--- a/videotuna/utils/ema.py
+++ b/videotuna/utils/ema.py
@@ -49,7 +49,7 @@ def forward(self, model):
                         one_minus_decay * (shadow_params[sname] - m_param[key])
                     )
                 else:
-                    assert not key in self.m_name2s_name
+                    assert key not in self.m_name2s_name
 
     def copy_to(self, model):
         m_param = dict(model.named_parameters())
@@ -58,7 +58,7 @@ def copy_to(self, model):
             if m_param[key].requires_grad:
                 m_param[key].data.copy_(shadow_params[self.m_name2s_name[key]].data)
             else:
-                assert not key in self.m_name2s_name
+                assert key not in self.m_name2s_name
 
     def store(self, parameters):
         """
diff --git a/videotuna/utils/inference_cli.py b/videotuna/utils/inference_cli.py
new file mode 100644
index 00000000..c688266b
--- /dev/null
+++ b/videotuna/utils/inference_cli.py
@@ -0,0 +1,128 @@
+"""Shared CLI helpers for VideoTuna inference entrypoints."""
+
+from __future__ import annotations
+
+import os
+from typing import Any, Optional
+
+from loguru import logger
+from omegaconf import DictConfig, OmegaConf
+
+from videotuna.settings import (
+    ENV_ATTN_BACKEND,
+    ENV_CPU_MODE,
+    ENV_TORCH_COMPILE,
+    get_settings,
+)
+from videotuna.utils.inference_profile import resolve_inference_profile
+
+
+def apply_compile_env(compile_flag: bool) -> None:
+    """Set VIDEOTUNA_TORCH_COMPILE before model load when --compile is passed."""
+    if get_settings().cpu_mode == "smoke":
+        os.environ[ENV_TORCH_COMPILE] = "0"
+        return
+    os.environ[ENV_TORCH_COMPILE] = "1" if compile_flag else "0"
+
+
+def apply_cpu_smoke_env(args: Any) -> None:
+    """Set environment for CPU smoke mode from --cpu-smoke."""
+    if not getattr(args, "cpu_smoke", False):
+        return
+    os.environ[ENV_CPU_MODE] = "smoke"
+    os.environ[ENV_ATTN_BACKEND] = "eager"
+    os.environ[ENV_TORCH_COMPILE] = "0"
+
+
+def validate_cpu_offload_flags(args: Any) -> None:
+    """Reject GPU VRAM offload on CPU inference; resolve dual offload flag conflicts."""
+    from videotuna.utils.device_utils import (
+        detect_compute_backend,
+        gpu_is_available,
+        resolve_cpu_mode,
+    )
+
+    if getattr(args, "enable_sequential_cpu_offload", False) and getattr(
+        args, "enable_model_cpu_offload", False
+    ):
+        logger.warning(
+            "Both --enable_sequential_cpu_offload and --enable_model_cpu_offload "
+            "were set; using sequential CPU offload (ignoring model offload)."
+        )
+        args.enable_model_cpu_offload = False
+
+    cpu_mode = resolve_cpu_mode(cli_smoke=getattr(args, "cpu_smoke", False))
+    device = (getattr(args, "device", None) or "").strip().lower()
+    cpu_inference = (
+        cpu_mode in ("smoke", "force")
+        or device == "cpu"
+        or detect_compute_backend() == "cpu"
+        or not gpu_is_available()
+    )
+    if not cpu_inference:
+        return
+
+    offload = (
+        getattr(args, "enable_sequential_cpu_offload", False)
+        or getattr(args, "enable_model_cpu_offload", False)
+        or getattr(args, "memory_preset", None) == "low_vram"
+    )
+    if offload:
+        raise RuntimeError(
+            "CPU offload flags (--enable_model_cpu_offload, "
+            "--enable_sequential_cpu_offload, --memory-preset low_vram) "
+            "require a GPU accelerator to stage weights. "
+            "They are not CPU-only inference modes.\n"
+            "Install a GPU stack (poetry install --extras cuda) or run "
+            "without offload flags."
+        )
+
+
+def apply_cpu_smoke_limits(
+    inference_config: DictConfig,
+    flow_config: Optional[DictConfig] = None,
+) -> None:
+    """Cap resolution, frames, and steps for CPU smoke runs."""
+    caps = {
+        "frames": 2,
+        "height": 256,
+        "width": 256,
+        "num_inference_steps": 4,
+        "ddim_steps": 4,
+    }
+    for key, cap in caps.items():
+        current = getattr(inference_config, key, None)
+        if current is not None and int(current) > cap:
+            logger.warning("CPU smoke: capping {} from {} to {}", key, current, cap)
+            inference_config[key] = cap
+        elif current is None and key in ("num_inference_steps", "ddim_steps"):
+            inference_config[key] = cap
+
+    if getattr(inference_config, "device", None) is None:
+        inference_config.device = "cpu"
+    if getattr(inference_config, "dtype", None) is None:
+        inference_config.dtype = "fp32"
+
+    if flow_config is not None:
+        params = flow_config.get("params", OmegaConf.create())
+        if OmegaConf.select(params, "height") and int(params.height) > caps["height"]:
+            params.height = caps["height"]
+        if OmegaConf.select(params, "width") and int(params.width) > caps["width"]:
+            params.width = caps["width"]
+
+
+def resolve_offload_mode(args) -> str:
+    """Return offload mode string from parsed args."""
+    return resolve_inference_profile(args, apply_preset=False).offload_mode
+
+
+def prepare_cli_inference_args(args: Any) -> Any:
+    """Apply smoke env and validate parallel degrees before config merge."""
+    apply_cpu_smoke_env(args)
+    ulysses = getattr(args, "ulysses_degree", None)
+    ring = getattr(args, "ring_degree", None)
+    if ulysses is not None or ring is not None:
+        from videotuna.utils.device_utils import validate_sequence_parallel_degrees
+
+        validate_sequence_parallel_degrees(ulysses, ring)
+    return args
diff --git a/videotuna/utils/inference_profile.py b/videotuna/utils/inference_profile.py
new file mode 100644
index 00000000..d51a71e4
--- /dev/null
+++ b/videotuna/utils/inference_profile.py
@@ -0,0 +1,79 @@
+"""Unified inference memory preset and offload resolution."""
+
+from __future__ import annotations
+
+from dataclasses import dataclass
+from typing import Any, Literal
+
+MemoryPreset = Literal["low_vram", "balanced", "max_speed"]
+OffloadMode = Literal["sequential", "model", "none"]
+
+
+@dataclass(frozen=True)
+class InferenceProfile:
+    memory_preset: MemoryPreset | None
+    offload_mode: OffloadMode
+    enable_model_cpu_offload: bool
+    enable_sequential_cpu_offload: bool
+    enable_vae_tiling: bool
+    dtype: str | None
+
+
+def _apply_memory_preset(args: Any) -> None:
+    """Mutate *args* in place to apply a named memory preset."""
+    preset = getattr(args, "memory_preset", None)
+    if not preset:
+        return
+
+    if preset == "low_vram":
+        args.enable_sequential_cpu_offload = True
+        args.enable_model_cpu_offload = False
+        args.enable_vae_tiling = True
+        if getattr(args, "dtype", None) is None:
+            args.dtype = "fp16"
+    elif preset == "balanced":
+        args.enable_model_cpu_offload = True
+        args.enable_sequential_cpu_offload = False
+        args.enable_vae_tiling = True
+        if getattr(args, "dtype", None) is None:
+            args.dtype = "bf16"
+    elif preset == "max_speed":
+        args.enable_model_cpu_offload = False
+        args.enable_sequential_cpu_offload = False
+        if getattr(args, "dtype", None) is None:
+            args.dtype = "bf16"
+    else:
+        raise ValueError(
+            f"Unknown memory preset {preset!r}. "
+            "Expected low_vram, balanced, or max_speed."
+        )
+
+
+def _offload_mode_from_args(args: Any) -> OffloadMode:
+    if getattr(args, "enable_sequential_cpu_offload", False):
+        return "sequential"
+    if getattr(args, "enable_model_cpu_offload", False):
+        return "model"
+    return "none"
+
+
+def _profile_from_args(args: Any) -> InferenceProfile:
+    return InferenceProfile(
+        memory_preset=getattr(args, "memory_preset", None),
+        offload_mode=_offload_mode_from_args(args),
+        enable_model_cpu_offload=bool(getattr(args, "enable_model_cpu_offload", False)),
+        enable_sequential_cpu_offload=bool(
+            getattr(args, "enable_sequential_cpu_offload", False)
+        ),
+        enable_vae_tiling=bool(getattr(args, "enable_vae_tiling", False)),
+        dtype=getattr(args, "dtype", None),
+    )
+
+
+def resolve_inference_profile(
+    args: Any, *, apply_preset: bool = True
+) -> InferenceProfile:
+    """Apply memory preset side effects and return the resolved profile."""
+    if apply_preset:
+        _apply_memory_preset(args)
+    return _profile_from_args(args)
diff --git a/videotuna/utils/inference_utils.py b/videotuna/utils/inference_utils.py
index 5ffd801e..35fa51f3 100644
--- a/videotuna/utils/inference_utils.py
+++ b/videotuna/utils/inference_utils.py
@@ -1,19 +1,17 @@
-import glob
+import copy
 import os
-import sys
 from collections import OrderedDict
 
 import cv2
 import numpy as np
-import torch, copy
+import torch
 import torchvision
 import torchvision.transforms as transforms
-from decord import VideoReader, cpu
 from einops import rearrange, repeat
 from PIL import Image
 
-from videotuna.utils.load_weights import load_safetensors, init_weights_on_device
-
+from videotuna.utils.load_weights import init_weights_on_device, load_safetensors
+from videotuna.utils.video_io import AvVideoReader as VideoReader
 
 
 def get_target_filelist(data_dir, ext):
@@ -37,6 +35,7 @@ def get_target_filelist(data_dir, ext):
     file_list.sort()
     return file_list
 
+
 # inplemented in InferenceBase
 def load_prompts_from_txt(prompt_file: str):
     """Load and return a list of prompts from a text file, stripping whitespace."""
@@ -45,6 +44,7 @@ def load_prompts_from_txt(prompt_file: str):
     prompt_list = [line.strip() for line in lines if line.strip() != ""]
     return prompt_list
 
+
 # inplemented in GenerationFlow
 def load_model_checkpoint(model, ckpt):
     def load_checkpoint(model, ckpt, full_strict):
@@ -55,14 +55,14 @@ def load_checkpoint(model, ckpt, full_strict):
             for key in state_dict["module"].keys():
                 new_pl_sd[key[16:]] = state_dict["module"][key]
             model.load_state_dict(new_pl_sd, strict=full_strict)
-        except:
+        except Exception:
             if "state_dict" in list(state_dict.keys()):
                 state_dict = state_dict["state_dict"]
             try:
                 model.model.diffusion_model.load_state_dict(
                     state_dict, strict=full_strict
                 )
-            except:
+            except Exception:
                 model.load_state_dict(state_dict, strict=False)
         return model
 
@@ -84,7 +84,8 @@ def load_inputs_i2v(input_dir, video_size=(256, 256), video_frames=16):
     if len(prompt_files) > 1:
         # only use the first one (sorted by name) if multiple exist
         print(
-            f"Warning: multiple prompt files exist. The one {os.path.split(prompt_files[0])[1]} is used."
+            "Warning: multiple prompt files exist. The one "
+            f"{os.path.split(prompt_files[0])[1]} is used."
         )
         prompt_file = prompt_files[0]
     elif len(prompt_files) == 1:
@@ -134,7 +135,8 @@ def load_inputs_v2v(input_dir, video_size=None, video_frames=None):
     if len(prompt_files) > 1:
         # only use the first one (sorted by name) if multiple exist
         print(
-            f"Warning: multiple prompt files exist. The one {os.path.split(prompt_files[0])[1]} is used."
+            "Warning: multiple prompt files exist. The one "
+            f"{os.path.split(prompt_files[0])[1]} is used."
         )
         prompt_file = prompt_files[0]
     elif len(prompt_files) == 1:
@@ -143,7 +145,7 @@ def load_inputs_v2v(input_dir, video_size=None, video_frames=None):
         print(prompt_files)
         raise ValueError(f"Error: found NO prompt file in {input_dir}")
     prompt_list = load_prompts_from_txt(prompt_file)
-    n_samples = len(prompt_list)
+    len(prompt_list)
 
     ## load videos
     video_filepaths = get_target_filelist(input_dir, ext="mp4")
@@ -156,11 +158,9 @@ def load_inputs_v2v(input_dir, video_size=None, video_frames=None):
 
 def open_video_to_tensor(filepath, video_width=None, video_height=None):
     if video_width is None and video_height is None:
-        vidreader = VideoReader(
-            filepath, ctx=cpu(0), width=video_width, height=video_height
-        )
+        vidreader = VideoReader(filepath, width=video_width, height=video_height)
     else:
-        vidreader = VideoReader(filepath, ctx=cpu(0))
+        vidreader = VideoReader(filepath)
     frame_indices = list(range(len(vidreader)))
     frames = vidreader.get_batch(frame_indices)
     frame_tensor = torch.tensor(frames.asnumpy()).permute(3, 0, 1, 2).float()
@@ -174,16 +174,15 @@ def load_video_batch(
     """
     Notice about some special cases:
     1. video_frames=-1 means to take all the frames (with fs=1)
-    2. when the total video frames is less than required, padding strategy will be used (repreated last frame)
+    2. when the total video frames is less than required, padding strategy
+       will be used (repreated last frame)
     """
     fps_list = []
     batch_tensor = []
     assert frame_stride > 0, "valid frame stride should be a positive interge!"
     for filepath in filepath_list:
         padding_num = 0
-        vidreader = VideoReader(
-            filepath, ctx=cpu(0), width=video_size[1], height=video_size[0]
-        )
+        vidreader = VideoReader(filepath, width=video_size[1], height=video_size[0])
         fps = vidreader.get_avg_fps()
         total_frames = len(vidreader)
         max_valid_frames = (total_frames - 1) // frame_stride + 1
@@ -206,7 +205,8 @@ def load_video_batch(
                 [frame_tensor, *([frame_tensor[:, -1:, :, :]] * padding_num)], dim=1
             )
             print(
-                f"{os.path.split(filepath)[1]} is not long enough: {padding_num} frames padded."
+                f"{os.path.split(filepath)[1]} is not long enough: "
+                f"{padding_num} frames padded."
             )
         batch_tensor.append(frame_tensor)
         sample_fps = int(fps / frame_stride)
@@ -221,9 +221,7 @@ def load_image_batch(filepath_list, image_size=(256, 256)):
         _, filename = os.path.split(filepath)
         _, ext = os.path.splitext(filename)
         if ext == ".mp4":
-            vidreader = VideoReader(
-                filepath, ctx=cpu(0), width=image_size[1], height=image_size[0]
-            )
+            vidreader = VideoReader(filepath, width=image_size[1], height=image_size[0])
             frame = vidreader.get_batch([0])
             img_tensor = (
                 torch.tensor(frame.asnumpy()).squeeze(0).permute(2, 0, 1).float()
@@ -380,7 +378,6 @@ def sample_batch_i2v(
 
     batch_samples = []
     for _ in range(n_samples_prompt):
-
         if z0 is not None:
             cond_z0 = z0.clone()
             kwargs.update({"clean_cond": True})
@@ -454,6 +451,7 @@ def save_videos_vbench(batch_tensors, savedir, prompts, format_file, fps=10):
                 savepath, video, fps=fps, video_codec="h264", options={"crf": "10"}
             )
 
+
 def cast_to(weight, dtype, device):
     r = torch.empty_like(weight, dtype=dtype, device=device)
     r.copy_(weight)
@@ -461,7 +459,16 @@ def cast_to(weight, dtype, device):
 
 
 class AutoWrappedModule(torch.nn.Module):
-    def __init__(self, module: torch.nn.Module, offload_dtype, offload_device, onload_dtype, onload_device, computation_dtype, computation_device):
+    def __init__(
+        self,
+        module: torch.nn.Module,
+        offload_dtype,
+        offload_device,
+        onload_dtype,
+        onload_device,
+        computation_dtype,
+        computation_device,
+    ):
         super().__init__()
         self.module = module.to(dtype=offload_dtype, device=offload_device)
         self.offload_dtype = offload_dtype
@@ -473,29 +480,54 @@ def __init__(self, module: torch.nn.Module, offload_dtype, offload_device, onloa
         self.state = 0
 
     def offload(self):
-        if self.state == 1 and (self.offload_dtype != self.onload_dtype or self.offload_device != self.onload_device):
+        if self.state == 1 and (
+            self.offload_dtype != self.onload_dtype
+            or self.offload_device != self.onload_device
+        ):
             self.module.to(dtype=self.offload_dtype, device=self.offload_device)
             self.state = 0
 
     def onload(self):
-        if self.state == 0 and (self.offload_dtype != self.onload_dtype or self.offload_device != self.onload_device):
+        if self.state == 0 and (
+            self.offload_dtype != self.onload_dtype
+            or self.offload_device != self.onload_device
+        ):
             self.module.to(dtype=self.onload_dtype, device=self.onload_device)
             self.state = 1
 
     @torch.inference_mode
     def forward(self, *args, **kwargs):
-        if self.onload_dtype == self.computation_dtype and self.onload_device == self.computation_device:
+        if (
+            self.onload_dtype == self.computation_dtype
+            and self.onload_device == self.computation_device
+        ):
             module = self.module
         else:
-            module = copy.deepcopy(self.module).to(dtype=self.computation_dtype, device=self.computation_device)
+            module = copy.deepcopy(self.module).to(
+                dtype=self.computation_dtype, device=self.computation_device
+            )
         return module(*args, **kwargs)
-    
 
-class AutoWrappedLinear(torch.nn.Linear):
 
-    def __init__(self, module: torch.nn.Linear, offload_dtype, offload_device, onload_dtype, onload_device, computation_dtype, computation_device):
+class AutoWrappedLinear(torch.nn.Linear):
+    def __init__(
+        self,
+        module: torch.nn.Linear,
+        offload_dtype,
+        offload_device,
+        onload_dtype,
+        onload_device,
+        computation_dtype,
+        computation_device,
+    ):
         with init_weights_on_device(device=torch.device("meta")):
-            super().__init__(in_features=module.in_features, out_features=module.out_features, bias=module.bias is not None, dtype=offload_dtype, device=offload_device)
+            super().__init__(
+                in_features=module.in_features,
+                out_features=module.out_features,
+                bias=module.bias is not None,
+                dtype=offload_dtype,
+                device=offload_device,
+            )
         self.weight = module.weight
         self.bias = module.bias
         self.offload_dtype = offload_dtype
@@ -507,31 +539,56 @@ def __init__(self, module: torch.nn.Linear, offload_dtype, offload_device, onloa
         self.state = 0
 
     def offload(self):
-        if self.state == 1 and (self.offload_dtype != self.onload_dtype or self.offload_device != self.onload_device):
+        if self.state == 1 and (
+            self.offload_dtype != self.onload_dtype
+            or self.offload_device != self.onload_device
+        ):
             self.to(dtype=self.offload_dtype, device=self.offload_device)
             self.state = 0
 
     def onload(self):
-        if self.state == 0 and (self.offload_dtype != self.onload_dtype or self.offload_device != self.onload_device):
+        if self.state == 0 and (
+            self.offload_dtype != self.onload_dtype
+            or self.offload_device != self.onload_device
+        ):
             self.to(dtype=self.onload_dtype, device=self.onload_device)
             self.state = 1
 
     @torch.inference_mode
     def forward(self, x, *args, **kwargs):
-        if self.onload_dtype == self.computation_dtype and self.onload_device == self.computation_device:
+        if (
+            self.onload_dtype == self.computation_dtype
+            and self.onload_device == self.computation_device
+        ):
             weight, bias = self.weight, self.bias
         else:
-            weight = cast_to(self.weight, self.computation_dtype, self.computation_device)
-            bias = None if self.bias is None else cast_to(self.bias, self.computation_dtype, self.computation_device)
+            weight = cast_to(
+                self.weight, self.computation_dtype, self.computation_device
+            )
+            bias = (
+                None
+                if self.bias is None
+                else cast_to(self.bias, self.computation_dtype, self.computation_device)
+            )
         return torch.nn.functional.linear(x, weight, bias)
 
 
-def enable_vram_management_recursively(model: torch.nn.Module, module_map: dict, module_config: dict, max_num_param=None, overflow_module_config: dict = None, total_num_param=0):
+def enable_vram_management_recursively(
+    model: torch.nn.Module,
+    module_map: dict,
+    module_config: dict,
+    max_num_param=None,
+    overflow_module_config: dict = None,
+    total_num_param=0,
+):
     for name, module in model.named_children():
         for source_module, target_module in module_map.items():
             if isinstance(module, source_module):
                 num_param = sum(p.numel() for p in module.parameters())
-                if max_num_param is not None and total_num_param + num_param > max_num_param:
+                if (
+                    max_num_param is not None
+                    and total_num_param + num_param > max_num_param
+                ):
                     module_config_ = overflow_module_config
                 else:
                     module_config_ = module_config
@@ -540,10 +597,30 @@ def enable_vram_management_recursively(model: torch.nn.Module, module_map: dict,
                 total_num_param += num_param
                 break
         else:
-            total_num_param = enable_vram_management_recursively(module, module_map, module_config, max_num_param, overflow_module_config, total_num_param)
+            total_num_param = enable_vram_management_recursively(
+                module,
+                module_map,
+                module_config,
+                max_num_param,
+                overflow_module_config,
+                total_num_param,
+            )
     return total_num_param
 
 
-def enable_vram_management(model: torch.nn.Module, module_map: dict, module_config: dict, max_num_param=None, overflow_module_config: dict = None):
-    enable_vram_management_recursively(model, module_map, module_config, max_num_param, overflow_module_config, total_num_param=0)
-    model.vram_management_enabled = True
\ No newline at end of file
+def enable_vram_management(
+    model: torch.nn.Module,
+    module_map: dict,
+    module_config: dict,
+    max_num_param=None,
+    overflow_module_config: dict = None,
+):
+    enable_vram_management_recursively(
+        model,
+        module_map,
+        module_config,
+        max_num_param,
+        overflow_module_config,
+        total_num_param=0,
+    )
+    model.vram_management_enabled = True
diff --git a/videotuna/utils/lightning_utils.py b/videotuna/utils/lightning_utils.py
index 87821255..89be0d8d 100644
--- a/videotuna/utils/lightning_utils.py
+++ b/videotuna/utils/lightning_utils.py
@@ -1,6 +1,6 @@
 import inspect
 from argparse import ArgumentParser
-from typing import Any, Callable, Dict, List, Tuple, Type, TypeVar, Union, cast
+from typing import Any, Callable, Dict, List, Tuple, Type, Union
 
 import pytorch_lightning as pl
 
@@ -87,10 +87,12 @@ def _precision_allowed_type(x: Union[int, str]) -> Union[int, str]:
 
 
 def str_to_bool_or_str(val: str) -> Union[str, bool]:
-    """Possibly convert a string representation of truth to bool. Returns the input otherwise. Based on the python
-    implementation distutils.utils.strtobool.
+    """Possibly convert a string representation of truth to bool.
+    Returns the input otherwise. Based on the python implementation
+    distutils.utils.strtobool.
 
-    True values are 'y', 'yes', 't', 'true', 'on', and '1'; false values are 'n', 'no', 'f', 'false', 'off', and '0'.
+    True values are 'y', 'yes', 't', 'true', 'on', and '1'; false values are 'n', 'no',
+        'f', 'false', 'off', and '0'.
     """
     lower = val.lower()
     if lower in ("y", "yes", "t", "true", "on", "1"):
@@ -119,7 +121,8 @@ def str_to_bool(val: str) -> bool:
 
 
 def str_to_bool_or_int(val: str) -> Union[bool, int, str]:
-    """Convert a string representation to truth of bool if possible, or otherwise try to convert it to an int.
+    """Convert a string representation to truth of bool if possible,
+    or otherwise try to convert it to an int.
     >>> str_to_bool_or_int("FALSE")
     False
     >>> str_to_bool_or_int("1")
@@ -149,7 +152,7 @@ def add_trainer_args_to_parser(cls, parent_parser, use_argument_group=True):
         raise RuntimeError("Please only pass an `ArgumentParser` instance.")
     if use_argument_group:
         group_name = _get_abbrev_qualified_cls_name(cls)
-        parser: _ADD_ARGPARSE_RETURN = parent_parser.add_argument_group(group_name)
+        parser = parent_parser.add_argument_group(group_name)
     else:
         parser = ArgumentParser(parents=[parent_parser], add_help=False)
 
@@ -212,7 +215,7 @@ def add_trainer_args_to_parser(cls, parent_parser, use_argument_group=True):
                 required=(arg_default == inspect._empty),
                 **arg_kwargs,
             )
-        except:
+        except Exception:
             # TODO: check the argument appending to the parser
             pass
 
diff --git a/videotuna/utils/load_weights.py b/videotuna/utils/load_weights.py
index deca6adb..2b37215c 100755
--- a/videotuna/utils/load_weights.py
+++ b/videotuna/utils/load_weights.py
@@ -1,48 +1,49 @@
 import copy
 import logging
-
-from omegaconf import OmegaConf
-
-mainlogger = logging.getLogger("mainlogger")
-
 from collections import OrderedDict
+from contextlib import contextmanager
 
 import torch
+from omegaconf import OmegaConf
 from safetensors import safe_open
-from torch import nn
 
 from videotuna.utils.common_utils import instantiate_from_config
-from contextlib import contextmanager
-# from lvdm.personalization.lora import net_load_lora
+
+mainlogger = logging.getLogger("mainlogger")
+
+
+def net_load_lora(*_args, **_kwargs):
+    raise NotImplementedError("lvdm LoRA loading is unavailable in PrivTune")
 
 
 @contextmanager
-def init_weights_on_device(device = torch.device("meta"), include_buffers :bool = False):
-    
+def init_weights_on_device(device=torch.device("meta"), include_buffers: bool = False):
     old_register_parameter = torch.nn.Module.register_parameter
     if include_buffers:
         old_register_buffer = torch.nn.Module.register_buffer
-    
+
     def register_empty_parameter(module, name, param):
         old_register_parameter(module, name, param)
         if param is not None:
             param_cls = type(module._parameters[name])
             kwargs = module._parameters[name].__dict__
             kwargs["requires_grad"] = param.requires_grad
-            module._parameters[name] = param_cls(module._parameters[name].to(device), **kwargs)
+            module._parameters[name] = param_cls(
+                module._parameters[name].to(device), **kwargs
+            )
 
     def register_empty_buffer(module, name, buffer, persistent=True):
         old_register_buffer(module, name, buffer, persistent=persistent)
         if buffer is not None:
             module._buffers[name] = module._buffers[name].to(device)
-            
+
     def patch_tensor_constructor(fn):
         def wrapper(*args, **kwargs):
             kwargs["device"] = device
             return fn(*args, **kwargs)
 
         return wrapper
-    
+
     if include_buffers:
         tensor_constructors_to_patch = {
             torch_function_name: getattr(torch, torch_function_name)
@@ -50,19 +51,26 @@ def wrapper(*args, **kwargs):
         }
     else:
         tensor_constructors_to_patch = {}
-    
+
     try:
         torch.nn.Module.register_parameter = register_empty_parameter
         if include_buffers:
             torch.nn.Module.register_buffer = register_empty_buffer
         for torch_function_name in tensor_constructors_to_patch.keys():
-            setattr(torch, torch_function_name, patch_tensor_constructor(getattr(torch, torch_function_name)))
+            setattr(
+                torch,
+                torch_function_name,
+                patch_tensor_constructor(getattr(torch, torch_function_name)),
+            )
         yield
     finally:
         torch.nn.Module.register_parameter = old_register_parameter
         if include_buffers:
             torch.nn.Module.register_buffer = old_register_buffer
-        for torch_function_name, old_torch_function in tensor_constructors_to_patch.items():
+        for (
+            torch_function_name,
+            old_torch_function,
+        ) in tensor_constructors_to_patch.items():
             setattr(torch, torch_function_name, old_torch_function)
 
 
@@ -79,9 +87,9 @@ def load_from_pretrainedSD_checkpoint(
     model, pretained_ckpt, expand_to_3d=True, adapt_keyname=False
 ):
     mainlogger.info(
-        f"------------------- Load pretrained SD weights -------------------"
+        "------------------- Load pretrained SD weights -------------------"
     )
-    sd_state_dict = torch.load(pretained_ckpt, map_location=f"cpu")
+    sd_state_dict = torch.load(pretained_ckpt, map_location="cpu")
     if "state_dict" in list(sd_state_dict.keys()):
         sd_state_dict = sd_state_dict["state_dict"]
     model_state_dict = model.state_dict()
@@ -95,7 +103,8 @@ def load_from_pretrainedSD_checkpoint(
     mainlogger.info(f"Num of parameters of source model: {len(sd_state_dict.keys())}")
 
     if adapt_keyname:
-        ## adapting to standard 2d network: modify the key name because of the add of temporal-attention
+        # # adapting to standard 2d network: modify the key name because of the add of
+        # temporal-attention
         mapping_dict = {
             "middle_block.2": "middle_block.3",
             "output_blocks.5.2": "output_blocks.5.3",
@@ -133,8 +142,8 @@ def load_from_pretrainedSD_checkpoint(
     # load the new state dict
     try:
         model.load_state_dict(model_state_dict)
-    except:
-        state_dict = torch.load(model_state_dict, map_location=f"cpu")
+    except Exception:
+        state_dict = torch.load(model_state_dict, map_location="cpu")
         if "state_dict" in list(state_dict.keys()):
             state_dict = state_dict["state_dict"]
         model_state_dict = model.state_dict()
@@ -146,9 +155,7 @@ def load_from_pretrainedSD_checkpoint(
         model_state_dict.update(state_dict)
         model.load_state_dict(model_state_dict)
 
-    mainlogger.info(
-        f"---------------------------- Finish! ----------------------------"
-    )
+    mainlogger.info("---------------------------- Finish! ----------------------------")
     return model, empty_paras
 
 
@@ -209,13 +216,14 @@ def load_partial_weights(
     model_dict_ori = copy.deepcopy(model_dict)
 
     mainlogger.info(
-        f"-------------- Load pretrained LDM weights --------------------------"
+        "-------------- Load pretrained LDM weights --------------------------"
     )
     mainlogger.info(f"Num of parameters of target model: {len(model_dict.keys())}")
     mainlogger.info(f"Num of parameters of source model: {len(pretrained_dict.keys())}")
 
     if adapt_keyname:
-        ## adapting to menghan's standard 2d network: modify the key name because of the add of temporal-attention
+        # # adapting to menghan's standard 2d network: modify the key name because of
+        # the add of temporal-attention
         mapping_dict = {
             "middle_block.2": "middle_block.3",
             "output_blocks.5.2": "output_blocks.5.3",
@@ -259,7 +267,7 @@ def load_partial_weights(
     # load the new state dict
     try:
         model2.load_state_dict(model_dict)
-    except:
+    except Exception:
         # if parameter size mismatch, skip them
         skipped = []
         for n, p in model_dict.items():
@@ -278,7 +286,7 @@ def load_partial_weights(
         empty_paras += skipped
         mainlogger.info(f"Empty parameters: {len(empty_paras)} ")
 
-    mainlogger.info(f"-------------- Finish! --------------------------")
+    mainlogger.info("-------------- Finish! --------------------------")
     return model2, empty_paras
 
 
@@ -288,7 +296,8 @@ def load_autoencoder(model, config_path=None, ckpt_path=None, device=None):
     if ckpt_path is None:
         ckpt_path = "models/ldm/text2img-large/model.ckpt"
     # if device is None:
-    #     device = torch.device(f"cuda:{dist.get_rank()}") if torch.cuda.is_available() else torch.device("cpu")
+    #     device = torch.device(f"cuda:{dist.get_rank()}") if torch.cuda.is_available()
+    # else torch.device("cpu")
 
     pretrained_ldm = init_and_load_ldm_model(config_path, ckpt_path, device)
     autoencoder_dict = {}
@@ -336,7 +345,8 @@ def convert_lora(
     alpha=0.6,
 ):
     # load base model
-    # pipeline = StableDiffusionPipeline.from_pretrained(base_model_path, torch_dtype=torch.float32)
+    # pipeline = StableDiffusionPipeline.from_pretrained(base_model_path,
+    # torch_dtype=torch.float32)
 
     # load LoRA weight from .safetensors
     # state_dict = load_file(checkpoint_path)
@@ -359,7 +369,8 @@ def convert_lora(
 
         print(f"key={key}")
         if "text" in key and LORA_PREFIX_TEXT_ENCODER in key:
-            # layer_infos = key.split(".")[0].split(LORA_PREFIX_TEXT_ENCODER + "_")[-1].split("_")
+            # layer_infos = key.split(".")[0].split(LORA_PREFIX_TEXT_ENCODER +
+            # "_")[-1].split("_")
             layer_infos = ["cond_stage_model", "transformer"] + key.split(".")[0].split(
                 LORA_PREFIX_TEXT_ENCODER + "_"
             )[-1].split("_")
@@ -439,7 +450,7 @@ def change_sd_weight(
 ):
     model_sd = model.state_dict()
     device = model_sd[list(model_sd.keys())[0]].device
-    is_diffuser = check_diffuser_ckpt(sd_sd)
+    check_diffuser_ckpt(sd_sd)
     for k, v in sd_sd.items():
         if "cond_stage_model.model.transformer.text_model.embeddings" in k:
             continue
@@ -456,7 +467,9 @@ def change_sd_weight(
             k = k.replace("output_blocks.8.2.conv", "output_blocks.8.3.conv")
 
         if k not in model_sd:
-            import pdb; pdb.set_trace()
+            import pdb
+
+            pdb.set_trace()
 
         # merge new token
         if (
@@ -545,8 +558,10 @@ def load_model_checkpoint_t2v(
             sd_sd = load_sd_state_dict(sd_ckpt)
             token_emb = sd_sd["cond_stage_model.model.token_embedding.weight"][49408:]
         else:
-            import pdb; pdb.set_trace()
-        
+            import pdb
+
+            pdb.set_trace()
+
         if token_emb is not None:
             pl_sd["cond_stage_model.model.token_embedding.weight"] = torch.cat(
                 [pl_sd["cond_stage_model.model.token_embedding.weight"], token_emb],
diff --git a/videotuna/utils/logging_config.py b/videotuna/utils/logging_config.py
new file mode 100644
index 00000000..f7452124
--- /dev/null
+++ b/videotuna/utils/logging_config.py
@@ -0,0 +1,86 @@
+"""Central loguru configuration with structured phase/flow/device context."""
+
+from __future__ import annotations
+
+import sys
+from pathlib import Path
+from typing import TYPE_CHECKING
+
+from loguru import logger
+
+from videotuna.settings import get_settings
+
+if TYPE_CHECKING:
+    import torch
+
+DEFAULT_EXTRA = {"phase": "-", "flow": "-", "device": "-"}
+
+_LOG_FORMAT = (
+    "<green>{time:YYYY-MM-DD HH:mm:ss}</green> | "
+    "<level>{level: <8}</level> | "
+    "phase={extra[phase]} flow={extra[flow]} device={extra[device]} | "
+    "{message}"
+)
+
+_configured = False
+_stderr_handler_id: int | None = None
+_file_handler_id: int | None = None
+
+
+def configure_logging(
+    *,
+    log_file: Path | str | None = None,
+    level: str | None = None,
+) -> None:
+    """Configure loguru once per process (stderr + optional file sink)."""
+    global _configured, _stderr_handler_id, _file_handler_id
+
+    log_level = (level or get_settings().log_level).upper()
+    logger.configure(extra=DEFAULT_EXTRA)
+    logger.remove()
+
+    _stderr_handler_id = logger.add(
+        sys.stderr,
+        level=log_level,
+        format=_LOG_FORMAT,
+    )
+    _file_handler_id = None
+    if log_file is not None:
+        log_path = Path(log_file)
+        log_path.parent.mkdir(parents=True, exist_ok=True)
+        _file_handler_id = logger.add(
+            str(log_path),
+            level=log_level,
+            format=_LOG_FORMAT,
+            mode="w",
+        )
+
+    _configured = True
+
+
+def bound_logger(*, phase: str, flow: str, device: str = "-"):
+    """Return a loguru logger with structured context fields."""
+    return logger.bind(phase=phase, flow=flow, device=device)
+
+
+def resolve_device_label(device: torch.device | str | None = None) -> str:
+    """Return a compact device label for log context."""
+    if device is None:
+        return "-"
+    if isinstance(device, str):
+        return device
+    index_suffix = ""
+    if device.index is not None:
+        index_suffix = f":{device.index}"
+    return f"{device.type}{index_suffix}"
+
+
+def phase_from_wan_task(task: str) -> str:
+    """Map Wan task id to logging phase."""
+    if task.startswith("i2v"):
+        return "i2v"
+    if task.startswith("t2i"):
+        return "t2i"
+    if task.startswith("t2v"):
+        return "t2v"
+    return "inference"
diff --git a/videotuna/utils/lora_utils.py b/videotuna/utils/lora_utils.py
new file mode 100644
index 00000000..0858adb8
--- /dev/null
+++ b/videotuna/utils/lora_utils.py
@@ -0,0 +1,65 @@
+"""PEFT LoRA target-module resolution helpers."""
+
+from __future__ import annotations
+
+from typing import List, Union
+
+import torch.nn as nn
+
+
+def resolve_lora_target_modules(
+    model: nn.Module,
+    target_modules: Union[str, List[str], None],
+) -> Union[str, List[str]]:
+    """Resolve LoRA target modules from explicit lists or PEFT shortcuts."""
+    if target_modules is None:
+        raise ValueError("target_modules must be provided for LoRA configuration")
+
+    if target_modules == "all-linear":
+        return "all-linear"
+
+    if isinstance(target_modules, str):
+        if target_modules == "kappa":
+            return _kappa_targets(model)
+        return [target_modules]
+
+    if isinstance(target_modules, list):
+        return target_modules
+
+    raise TypeError(f"Unsupported target_modules type: {type(target_modules)}")
+
+
+def _module_path_from_param_name(param_name: str) -> str:
+    for suffix in (".weight", ".bias"):
+        if param_name.endswith(suffix):
+            return param_name[: -len(suffix)]
+    return param_name
+
+
+def parameter_matches_lora_target(param_name: str, target_modules: list[str]) -> bool:
+    """Return True when a parameter name matches a LoRA target module token."""
+    module_path = _module_path_from_param_name(param_name)
+    for target in target_modules:
+        if module_path == target or module_path.endswith(f".{target}"):
+            return True
+    return False
+
+
+def _kappa_targets(model: nn.Module) -> List[str]:
+    try:
+        from peft.helpers import find_kappa_target_modules
+    except ImportError as exc:
+        raise ImportError(
+            "kappa target discovery requires peft.helpers.find_kappa_target_modules"
+        ) from exc
+
+    targets = find_kappa_target_modules(model, top_p=0.2)
+    resolved = targets.get("target_modules") or []
+    if not resolved:
+        raise ValueError("find_kappa_target_modules returned no target_modules")
+    return resolved
+
+
+def collect_lora_parameter_names(model: nn.Module) -> set[str]:
+    """Return parameter names that belong to LoRA adapters."""
+    return {name for name, _ in model.named_parameters() if "lora" in name.lower()}
diff --git a/videotuna/utils/memory_presets.py b/videotuna/utils/memory_presets.py
new file mode 100644
index 00000000..9cceaaeb
--- /dev/null
+++ b/videotuna/utils/memory_presets.py
@@ -0,0 +1,14 @@
+"""Named memory/performance presets for inference CLI."""
+
+from __future__ import annotations
+
+from typing import Any
+
+from videotuna.utils.inference_profile import MemoryPreset, resolve_inference_profile
+
+__all__ = ["MemoryPreset", "apply_memory_preset"]
+
+
+def apply_memory_preset(args: Any) -> None:
+    """Mutate *args* in place to apply a named memory preset."""
+    resolve_inference_profile(args)
diff --git a/videotuna/utils/quantization.py b/videotuna/utils/quantization.py
new file mode 100644
index 00000000..087c1ca6
--- /dev/null
+++ b/videotuna/utils/quantization.py
@@ -0,0 +1,45 @@
+"""Optional 4-bit loading for frozen text encoders via bitsandbytes + accelerate."""
+
+from __future__ import annotations
+
+from typing import Any, Dict, Optional
+
+import torch
+
+
+def build_transformers_quant_config(load_in_4bit: bool = True) -> Optional[Any]:
+    """Return a transformers BitsAndBytesConfig for 4-bit loading, or None."""
+    if not load_in_4bit:
+        return None
+
+    try:
+        from transformers import BitsAndBytesConfig
+    except ImportError as exc:
+        raise ImportError(
+            "4-bit loading requires transformers with BitsAndBytesConfig support"
+        ) from exc
+
+    return BitsAndBytesConfig(
+        load_in_4bit=True,
+        bnb_4bit_compute_dtype=torch.bfloat16,
+        bnb_4bit_use_double_quant=True,
+        bnb_4bit_quant_type="nf4",
+    )
+
+
+def apply_quantization_to_config_params(params: Dict[str, Any]) -> Dict[str, Any]:
+    """
+    Inject quantization kwargs into a model config params dict when load_in_4bit is set.
+
+    Supports transformers from_pretrained-style configs.
+    """
+    if not params.get("load_in_4bit", False):
+        return params
+
+    updated = dict(params)
+    quant_config = build_transformers_quant_config(True)
+    if quant_config is not None:
+        updated["quantization_config"] = quant_config
+    updated.setdefault("torch_dtype", torch.bfloat16)
+    updated.setdefault("device_map", "auto")
+    return updated
diff --git a/videotuna/utils/save_video.py b/videotuna/utils/save_video.py
index 797a8f0e..e6a8361a 100755
--- a/videotuna/utils/save_video.py
+++ b/videotuna/utils/save_video.py
@@ -92,7 +92,6 @@ def tensor_to_mp4(video, savepath, fps, rescale=True, nrow=None):
 
 
 def tensor2videogrids(video, root, filename, fps, rescale=True, clamp=True):
-
     assert video.dim() == 5  # b,c,t,h,w
     assert isinstance(video, torch.Tensor)
 
@@ -163,9 +162,9 @@ def save_img_grid(grid, path, rescale):
 
     for key in batch_logs:
         value = batch_logs[key]
-        
+
         if isinstance(value, torch.Tensor) and (value.ndim == 6):
-            assert(value.size()[0] == 1)
+            assert value.size()[0] == 1
             value = value[0]
 
         if isinstance(value, list) and isinstance(value[0], str):
@@ -181,7 +180,7 @@ def save_img_grid(grid, path, rescale):
             ## only save grayscale or rgb mode
             if video.shape[1] != 1 and video.shape[1] != 3:
                 continue
-            n = video.shape[0]
+            video.shape[0]
             video = video.permute(2, 0, 1, 3, 4)  # t,n,c,h,w
             frame_grids = [
                 torchvision.utils.make_grid(framesheet, nrow=int(1))
@@ -211,12 +210,15 @@ def save_img_grid(grid, path, rescale):
             ## only save grayscale or rgb mode
             if img.shape[1] != 1 and img.shape[1] != 3:
                 continue
-            n = img.shape[0]
+            img.shape[0]
             grid = torchvision.utils.make_grid(img, nrow=1)
             path = os.path.join(save_dir, "%s-%s.jpg" % (key, filename))
             save_img_grid(grid, path, rescale)
         else:
-            raise ValueError(f"The value of type [{type(value)}[] and key [{key}] does not supported!")
+            raise ValueError(
+                f"The value of type [{type(value)}[] and key [{key}] "
+                "does not supported!"
+            )
 
 
 def prepare_to_log(batch_logs, max_images=100000, clamp=True):
@@ -245,7 +247,7 @@ def prepare_to_log(batch_logs, max_images=100000, clamp=True):
     return batch_logs
 
 
-# ----------------------------------------------------------------------------------------------
+# --------------------------------------------------------------------------------------
 
 
 def fill_with_black_squares(video, desired_len: int) -> Tensor:
@@ -263,7 +265,7 @@ def fill_with_black_squares(video, desired_len: int) -> Tensor:
     )
 
 
-# ----------------------------------------------------------------------------------------------
+# --------------------------------------------------------------------------------------
 def load_num_videos(data_path, num_videos):
     # first argument can be either data_path of np array
     if isinstance(data_path, str):
@@ -281,7 +283,8 @@ def load_num_videos(data_path, num_videos):
 def npz_to_video_grid(
     data_path, out_path, num_frames, fps, num_videos=None, nrow=None, verbose=True
 ):
-    # videos = torch.tensor(np.load(data_path)['arr_0']).permute(0,1,4,2,3).div_(255).mul_(2) - 1.0 # NTHWC->NTCHW, np int -> torch tensor 0-1
+    # videos = torch.tensor(np.load(data_path)['arr_0']).permute(0,1,4,2,3)
+    # .div_(255).mul_(2) - 1.0  # NTHWC->NTCHW, np int -> torch tensor 0-1
     if isinstance(data_path, str):
         videos = load_num_videos(data_path, num_videos)
     elif isinstance(data_path, np.ndarray):
diff --git a/videotuna/utils/sched_utils.py b/videotuna/utils/sched_utils.py
new file mode 100644
index 00000000..da37dd78
--- /dev/null
+++ b/videotuna/utils/sched_utils.py
@@ -0,0 +1,44 @@
+"""Small helpers shared by legacy diffusion schedulers."""
+
+from __future__ import annotations
+
+from typing import Any, Callable, Optional, Union
+
+import torch
+
+
+def exists(val: Any) -> bool:
+    return val is not None
+
+
+def default(val: Any, d: Union[Any, Callable[[], Any]]) -> Any:
+    if exists(val):
+        return val
+    return d() if callable(d) else d
+
+
+def extract_into_tensor(
+    a: torch.Tensor, t: torch.Tensor, x_shape: torch.Size
+) -> torch.Tensor:
+    b, *_ = t.shape
+    out = a.gather(-1, t)
+    return out.reshape(b, *((1,) * (len(x_shape) - 1)))
+
+
+def noise_like(
+    shape: torch.Size,
+    device: torch.device,
+    repeat: bool = False,
+    dtype: Optional[torch.dtype] = None,
+) -> torch.Tensor:
+    dtype = dtype or torch.float32
+
+    def repeat_noise():
+        return torch.randn((1, *shape[1:]), device=device, dtype=dtype).repeat(
+            shape[0], *((1,) * (len(shape) - 1))
+        )
+
+    def noise():
+        return torch.randn(shape, device=device, dtype=dtype)
+
+    return repeat_noise() if repeat else noise()
diff --git a/videotuna/utils/train_utils.py b/videotuna/utils/train_utils.py
index ffc7e409..9ceab3db 100755
--- a/videotuna/utils/train_utils.py
+++ b/videotuna/utils/train_utils.py
@@ -1,22 +1,13 @@
-import argparse
-import glob
-import logging
-import multiprocessing as mproc
 import os
-import sys
 from collections import OrderedDict
 
-from omegaconf import OmegaConf
-from packaging import version
-
-mainlogger = logging.getLogger("mainlogger")
-
-from collections import OrderedDict
-
-import pytorch_lightning as pl
 import torch
+from omegaconf import OmegaConf
 
 from videotuna.utils.load_weights import load_from_pretrainedSD_checkpoint
+from videotuna.utils.logging_config import bound_logger, configure_logging
+
+mainlogger = bound_logger(phase="t2v", flow="wanvideo")
 
 
 def init_workspace(name, logdir, model_config, lightning_config, rank=0):
@@ -25,7 +16,8 @@ def init_workspace(name, logdir, model_config, lightning_config, rank=0):
     cfgdir = os.path.join(workdir, "configs")
     loginfo = os.path.join(workdir, "loginfo")
 
-    # Create logdirs and save configs (all ranks will do to avoid missing directory error if rank:0 is slower)
+    # Create logdirs and save configs (all ranks will do to avoid missing directory
+    # error if rank:0 is slower)
     os.makedirs(workdir, exist_ok=True)
     os.makedirs(ckptdir, exist_ok=True)
     os.makedirs(cfgdir, exist_ok=True)
@@ -78,6 +70,10 @@ def get_trainer_callbacks(lightning_config, logdir, ckptdir):
             "params": {"logging_interval": "step", "log_momentum": False},
         },
         "cuda_callback": {"target": "videotuna.utils.callbacks.CUDACallback"},
+        "training_metrics": {
+            "target": "videotuna.utils.callbacks.TrainingMetricsCallback",
+            "params": {},
+        },
     }
 
     if "callbacks" in lightning_config:
@@ -143,7 +139,7 @@ def load_checkpoints(model, model_cfg):
         )
         print(f"Loading model from {pretrained_ckpt}")
         ## only load weight for the backbone model (e.g. latent diffusion model)
-        state_dict = torch.load(pretrained_ckpt, map_location=f"cpu")
+        state_dict = torch.load(pretrained_ckpt, map_location="cpu")
         if "state_dict" in list(state_dict.keys()):
             state_dict = state_dict["state_dict"]
         else:
@@ -181,7 +177,7 @@ def load_checkpoints(model, model_cfg):
                 for key in pl_sd["module"].keys():
                     new_pl_sd[key[16:]] = pl_sd["module"][key]
                 model.load_state_dict(new_pl_sd)
-        except:
+        except Exception:
             if "state_dict" in pl_sd.keys():
                 model.load_state_dict(pl_sd["state_dict"], strict=False)
             else:
@@ -190,13 +186,15 @@ def load_checkpoints(model, model_cfg):
         """
         try:
             model = model.load_from_checkpoint(pretrained_ckpt, **model_cfg.params)
-        except:
-            mainlogger.info("[Warning] checkpoint NOT complete matched. To adapt by skipping ...")
+        except Exception:
+            mainlogger.info("[Warning] checkpoint NOT complete matched. To adapt by
+                skipping ...")
             state_dict = torch.load(pretrained_ckpt, map_location=f"cpu")
             if "state_dict" in list(state_dict.keys()):
                 state_dict = state_dict["state_dict"]
             model_state_dict = model.state_dict()
-            ## for layer with channel changed (e.g. GEN 1's conditon-concatenating setting)
+            # # for layer with channel changed (e.g. GEN 1's conditon-concatenating
+            # setting)
             for n, p in model_state_dict.items():
                 if p.shape != state_dict[n].shape:
                     mainlogger.info(f"Skipped parameter [{n}] from pretrained! ")
@@ -249,7 +247,7 @@ def get_autoresume_path(logdir):
             gs = tmp["global_step"]
             mainlogger.info(f"[INFO] Resume from epoch {e}, global step {gs}!")
             del tmp
-        except:
+        except Exception:
             try:
                 mainlogger.info("Load last.ckpt failed!")
                 ckpts = sorted(
@@ -274,28 +272,14 @@ def get_autoresume_path(logdir):
     else:
         resume_checkpt_path = None
         mainlogger.info(
-            f"[INFO] no checkpoint found in current workspace: {os.path.join(logdir, 'checkpoints')}"
+            f"[INFO] no checkpoint found in current workspace: {os.path.join(logdir,
+                'checkpoints')}"
         )
 
     return resume_checkpt_path
 
 
 def set_logger(logfile, name="mainlogger"):
-    logger = logging.getLogger(name)
-    logger.setLevel(logging.INFO)
-
-    # Set the logger to prevent log propagation to the parent logger and print twice.
-    logger.propagate = False
-
-    fh = logging.FileHandler(logfile, mode="w")
-    fh.setLevel(logging.INFO)
-
-    ch = logging.StreamHandler()
-    ch.setLevel(logging.DEBUG)
-
-    fh.setFormatter(logging.Formatter("%(asctime)s-%(levelname)s: %(message)s"))
-    ch.setFormatter(logging.Formatter("%(message)s"))
-
-    logger.addHandler(fh)
-    logger.addHandler(ch)
-    return logger
+    """Configure loguru and return a bound training logger (legacy API)."""
+    configure_logging(log_file=logfile)
+    return bound_logger(phase="t2v", flow="wanvideo")
diff --git a/videotuna/utils/training_metrics.py b/videotuna/utils/training_metrics.py
new file mode 100644
index 00000000..8af38d90
--- /dev/null
+++ b/videotuna/utils/training_metrics.py
@@ -0,0 +1,56 @@
+"""Training experiment tracking helpers (TensorBoard + optional Trackio)."""
+
+from __future__ import annotations
+
+import importlib.util
+from typing import Any
+
+MetricsBackend = str
+
+DEFAULT_FLUX_TRACKIO_PROJECT = "flux-domain-lora"
+
+
+def trackio_available() -> bool:
+    return importlib.util.find_spec("trackio") is not None
+
+
+def require_trackio() -> None:
+    if trackio_available():
+        return
+    raise ImportError(
+        "VIDEOTUNA_METRICS_BACKEND=trackio requires the trackio extra. "
+        "Install with: poetry install -E trackio"
+    )
+
+
+def trackio_enabled(metrics_backend: MetricsBackend) -> bool:
+    return metrics_backend == "trackio"
+
+
+def resolve_accelerate_log_with(metrics_backend: MetricsBackend) -> str | list[str]:
+    if metrics_backend == "trackio":
+        require_trackio()
+        return ["tensorboard", "trackio"]
+    return "tensorboard"
+
+
+def describe_metrics_backend(metrics_backend: MetricsBackend) -> str:
+    if trackio_enabled(metrics_backend):
+        return "tensorboard + trackio"
+    return "tensorboard"
+
+
+def build_trackio_init_kwargs(
+    *,
+    space_id: str | None = None,
+) -> dict[str, dict[str, Any]] | None:
+    if not space_id:
+        return None
+    return {"trackio": {"space_id": space_id}}
+
+
+def log_validation_image_to_trackio(image: Any, step: int) -> None:
+    require_trackio()
+    import trackio
+
+    trackio.log({"validation/sample": trackio.Image(image)}, step=step)
diff --git a/videotuna/utils/video_io.py b/videotuna/utils/video_io.py
new file mode 100644
index 00000000..f8119e1d
--- /dev/null
+++ b/videotuna/utils/video_io.py
@@ -0,0 +1,290 @@
+"""Video frame sampling and decoding with PyAV / torchcodec fallbacks.
+
+Install torchcodec via the optional ``video-fast`` extra::
+
+    poetry install -E cuda --with training -E video-fast
+"""
+
+from __future__ import annotations
+
+import importlib.util
+import random
+from typing import Literal, Optional, Sequence, Union
+
+import av
+import numpy as np
+import torch
+from einops import rearrange
+
+VideoBackend = Literal["auto", "av", "torchcodec"]
+
+
+def sample_frame_indices(
+    total_frames: int,
+    num_frames: int,
+    frame_interval: int = 1,
+    begin_index: Optional[int] = None,
+) -> np.ndarray:
+    """Sample frame indices matching TemporalRandomCrop randomness."""
+    sample_length = num_frames * frame_interval
+    rand_end = max(0, total_frames - sample_length - 1)
+    if begin_index is None:
+        begin_index = random.randint(0, rand_end)
+    end_index = min(begin_index + sample_length, total_frames)
+    if end_index - begin_index < num_frames:
+        raise ValueError(
+            f"The video has not enough frames. total={total_frames}, "
+            f"need sample_length={sample_length}"
+        )
+    return np.linspace(begin_index, end_index - 1, num_frames, dtype=int)
+
+
+def _open_video_container(
+    video_path: str,
+) -> tuple[av.container.InputContainer, av.VideoStream]:
+    container = av.open(video_path)
+    stream = container.streams.video[0]
+    stream.thread_type = "AUTO"
+    return container, stream
+
+
+def _stream_frame_count(stream: av.VideoStream) -> int:
+    if stream.frames and stream.frames > 0:
+        return int(stream.frames)
+    if stream.duration and stream.average_rate:
+        return int(stream.duration * stream.time_base * stream.average_rate)
+    return 0
+
+
+def _count_frames_by_decode(
+    container: av.container.InputContainer, stream: av.VideoStream
+) -> int:
+    count = 0
+    for _ in container.decode(stream):
+        count += 1
+    return count
+
+
+def get_video_frame_count(video_path: str) -> int:
+    """Return total frame count using PyAV metadata or demux fallback."""
+    container, stream = _open_video_container(video_path)
+    try:
+        count = _stream_frame_count(stream)
+        if count > 0:
+            return count
+        return _count_frames_by_decode(container, stream)
+    finally:
+        container.close()
+
+
+def get_video_fps(video_path: str) -> float:
+    """Return average frame rate for a video file."""
+    container, stream = _open_video_container(video_path)
+    try:
+        if stream.average_rate:
+            return float(stream.average_rate)
+        if stream.base_rate:
+            return float(stream.base_rate)
+        return 0.0
+    finally:
+        container.close()
+
+
+def get_frame_timestamp(video_path: str, index: int) -> tuple[float, float]:
+    """Return (start, end) PTS in seconds for a frame index (decord-compatible)."""
+    container, stream = _open_video_container(video_path)
+    try:
+        if index < 0:
+            if stream.duration and stream.time_base:
+                end = float(stream.duration * stream.time_base)
+                return (0.0, end)
+            fps = float(stream.average_rate) if stream.average_rate else 1.0
+            count = _stream_frame_count(stream) or _count_frames_by_decode(
+                container, stream
+            )
+            return (0.0, count / fps)
+
+        fps = float(stream.average_rate) if stream.average_rate else 1.0
+        start = index / fps
+        end = (index + 1) / fps
+        return (start, end)
+    finally:
+        container.close()
+
+
+def _resize_frame_chw(
+    frame: torch.Tensor, width: Optional[int], height: Optional[int]
+) -> torch.Tensor:
+    if width is None and height is None:
+        return frame
+    import cv2
+
+    h, w = frame.shape[1], frame.shape[2]
+    target_w = width if width is not None else w
+    target_h = height if height is not None else h
+    if target_w == w and target_h == h:
+        return frame
+    arr = frame.permute(1, 2, 0).numpy()
+    resized = cv2.resize(arr, (target_w, target_h), interpolation=cv2.INTER_LINEAR)
+    return torch.from_numpy(resized).permute(2, 0, 1)
+
+
+def _read_av(
+    video_path: str,
+    indices: Sequence[int],
+    width: Optional[int] = None,
+    height: Optional[int] = None,
+) -> torch.Tensor:
+    idx_list = [int(i) for i in indices]
+    if not idx_list:
+        raise ValueError("indices must not be empty")
+
+    wanted = set(idx_list)
+    max_idx = max(idx_list)
+
+    container, stream = _open_video_container(video_path)
+    frames: dict[int, torch.Tensor] = {}
+    try:
+        for frame_idx, frame in enumerate(container.decode(stream)):
+            if frame_idx in wanted:
+                arr = frame.to_ndarray(format="rgb24")
+                tensor = torch.from_numpy(arr).permute(2, 0, 1)
+                frames[frame_idx] = _resize_frame_chw(tensor, width, height)
+            if frame_idx >= max_idx and len(frames) == len(wanted):
+                break
+    finally:
+        container.close()
+
+    missing = [i for i in idx_list if i not in frames]
+    if missing:
+        raise ValueError(
+            f"Video {video_path} has fewer decodable frames than requested; "
+            f"missing indices: {missing[:5]}"
+        )
+    return torch.stack([frames[i] for i in idx_list])
+
+
+def _torchcodec_available() -> bool:
+    return importlib.util.find_spec("torchcodec") is not None
+
+
+def _resolve_auto_backends() -> list[str]:
+    if _torchcodec_available():
+        return ["torchcodec", "av"]
+    return ["av"]
+
+
+def _read_torchcodec(video_path: str, indices: Sequence[int]) -> torch.Tensor:
+    from torchcodec.decoders import VideoDecoder
+
+    decoder = VideoDecoder(video_path, device="cpu")
+    idx = [int(i) for i in indices]
+    frames = decoder.get_frames_at(indices=idx)
+    data = frames.data
+    if data.ndim == 4 and data.shape[-1] in (1, 3, 4):
+        return rearrange(data, "t h w c -> t c h w")
+    return data
+
+
+def read_video_frames(
+    video_path: str,
+    indices: Sequence[int],
+    backend: VideoBackend = "auto",
+    width: Optional[int] = None,
+    height: Optional[int] = None,
+) -> torch.Tensor:
+    """Decode selected frames as TCHW uint8 tensor."""
+    backends: list[str]
+    if backend == "auto":
+        backends = _resolve_auto_backends()
+    else:
+        backends = [backend]
+
+    last_error: Optional[Exception] = None
+    for name in backends:
+        try:
+            if name == "av":
+                return _read_av(video_path, indices, width=width, height=height)
+            if name == "torchcodec":
+                return _read_torchcodec(video_path, indices)
+        except Exception as exc:
+            last_error = exc
+            continue
+
+    raise RuntimeError(
+        f"Failed to decode {video_path} with backends {backends}"
+    ) from last_error
+
+
+class _NumpyBatch:
+    """Mimics decord NDArray.asnumpy() for vendored Wan call sites."""
+
+    def __init__(self, data: Union[np.ndarray, torch.Tensor]):
+        if isinstance(data, torch.Tensor):
+            self._data = data.cpu().numpy()
+        else:
+            self._data = data
+
+    def asnumpy(self) -> np.ndarray:
+        return self._data
+
+
+class AvVideoReader:
+    """Decord-compatible video reader backed by PyAV."""
+
+    def __init__(
+        self,
+        video_path: str,
+        ctx=None,
+        width: Optional[int] = None,
+        height: Optional[int] = None,
+        num_threads: int = 1,
+    ):
+        del ctx, num_threads  # CPU-only; threading handled by PyAV AUTO
+        self.video_path = video_path
+        self.width = width
+        self.height = height
+        self._frame_count: Optional[int] = None
+        self._fps: Optional[float] = None
+
+    def _ensure_metadata(self) -> None:
+        if self._frame_count is None:
+            self._frame_count = get_video_frame_count(self.video_path)
+        if self._fps is None:
+            self._fps = get_video_fps(self.video_path)
+
+    def __len__(self) -> int:
+        self._ensure_metadata()
+        assert self._frame_count is not None
+        return self._frame_count
+
+    def get_avg_fps(self) -> float:
+        self._ensure_metadata()
+        assert self._fps is not None
+        return self._fps
+
+    def get_batch(self, indices: Sequence[int]) -> _NumpyBatch:
+        frames = read_video_frames(
+            self.video_path,
+            indices,
+            backend="av",
+            width=self.width,
+            height=self.height,
+        )
+        # TCHW uint8 -> THWC for decord compatibility
+        thwc = rearrange(frames, "t c h w -> t h w c").numpy()
+        return _NumpyBatch(thwc)
+
+    def __getitem__(self, index: int) -> np.ndarray:
+        return self.get_batch([index]).asnumpy()[0]
+
+    def get_frame_timestamp(self, index: int) -> tuple[float, float]:
+        return get_frame_timestamp(self.video_path, index)
+
+    def seek(self, index: int) -> None:
+        """No-op kept for decord API compatibility."""
+        del index
+
+
+def init_video_worker() -> None:
+    """Call once per DataLoader worker before decoding (no-op for PyAV)."""
diff --git a/videotuna/utils/wan_lora_bridge.py b/videotuna/utils/wan_lora_bridge.py
new file mode 100644
index 00000000..f0918b99
--- /dev/null
+++ b/videotuna/utils/wan_lora_bridge.py
@@ -0,0 +1,383 @@
+"""Bridge Wan 2.1 native Lightning LoRA checkpoints onto Wan 2.2 Diffusers pipelines."""
+
+from __future__ import annotations
+
+import re
+from dataclasses import dataclass, field
+from pathlib import Path
+from typing import Any, Dict, List, Optional, Tuple
+
+import torch
+from loguru import logger
+from peft import LoraConfig, get_peft_model
+from peft.utils import set_peft_model_state_dict
+
+# Native Wan 2.1 PEFT targets (training config in wan_t2v_lora.yaml).
+WAN_NATIVE_LORA_TARGETS = ["q", "k", "v", "o", "ffn.0", "ffn.2"]
+
+# Diffusers WanTransformer3DModel PEFT targets (attn1 self-attn + FFN only).
+WAN_DIFFUSERS_LORA_TARGETS = [
+    "attn1.to_q",
+    "attn1.to_k",
+    "attn1.to_v",
+    "attn1.to_out.0",
+    "ffn.net.0.proj",
+    "ffn.net.2",
+]
+
+DEFAULT_LORA_RANK = 16
+HIGH_NOISE_ADAPTER = "domain_high"
+LOW_NOISE_ADAPTER = "domain_low"
+
+_SELF_ATTN_RE = re.compile(
+    r"^blocks\.(\d+)\.self_attn\.(q|k|v|o)\.(lora_[AB]\.weight)$"
+)
+_FFN0_RE = re.compile(r"^blocks\.(\d+)\.ffn\.0\.(lora_[AB]\.weight)$")
+_FFN2_RE = re.compile(r"^blocks\.(\d+)\.ffn\.2\.(lora_[AB]\.weight)$")
+# Legacy / test shorthand: blocks.N.attn.q
+_LEGACY_ATTN_RE = re.compile(r"^blocks\.(\d+)\.attn\.(q|k|v|o)\.(lora_[AB]\.weight)$")
+
+
+@dataclass
+class WanLoraLoadReport:
+    """Structured result from loading native LoRA onto a Diffusers transformer."""
+
+    expert: str
+    rank: int
+    source_keys: int
+    remapped_keys: int
+    loaded_lora_params: int
+    missing_keys: List[str] = field(default_factory=list)
+    unexpected_keys: List[str] = field(default_factory=list)
+
+    @property
+    def remap_ratio(self) -> float:
+        if self.source_keys == 0:
+            return 0.0
+        return self.remapped_keys / self.source_keys
+
+    def as_dict(self) -> Dict[str, Any]:
+        return {
+            "expert": self.expert,
+            "rank": self.rank,
+            "source_keys": self.source_keys,
+            "remapped_keys": self.remapped_keys,
+            "loaded_lora_params": self.loaded_lora_params,
+            "remap_ratio": round(self.remap_ratio, 4),
+            "missing_keys": len(self.missing_keys),
+            "unexpected_keys": len(self.unexpected_keys),
+        }
+
+
+def is_native_wan_lora_ckpt(path: str | Path) -> bool:
+    """Return True when path looks like a Wan native Lightning LoRA .ckpt."""
+    p = Path(path)
+    if not p.is_file():
+        return False
+    if p.suffix not in (".ckpt", ".pt", ".pth"):
+        return False
+    try:
+        state = load_native_wan_lora_state_dict(p)
+    except (OSError, RuntimeError, ValueError):
+        return False
+    return bool(state) and any("lora" in k.lower() for k in state)
+
+
+def load_native_wan_lora_state_dict(ckpt_path: str | Path) -> Dict[str, torch.Tensor]:
+    """Load and normalize LoRA tensors from a Wan training checkpoint."""
+    raw = torch.load(ckpt_path, map_location="cpu", weights_only=False)
+    if isinstance(raw, dict) and "state_dict" in raw:
+        state_dict = raw["state_dict"]
+    elif isinstance(raw, dict):
+        state_dict = raw
+    else:
+        raise ValueError(f"Unexpected checkpoint type in {ckpt_path}")
+
+    lora_state: Dict[str, torch.Tensor] = {}
+    for key, tensor in state_dict.items():
+        if "lora" not in key.lower():
+            continue
+        normalized = key
+        for prefix in ("denoiser.", "model.", "module.", "base_model.model."):
+            if normalized.startswith(prefix):
+                normalized = normalized[len(prefix) :]
+        lora_state[normalized] = tensor
+    if not lora_state:
+        raise ValueError(f"No LoRA keys found in {ckpt_path}")
+    return lora_state
+
+
+def _infer_lora_rank(state_dict: Dict[str, torch.Tensor]) -> int:
+    for key, tensor in state_dict.items():
+        if key.endswith(".lora_A.weight") or ".lora_A." in key:
+            return int(tensor.shape[0])
+        if key.endswith(".lora_B.weight") or ".lora_B." in key:
+            return int(tensor.shape[1])
+    return DEFAULT_LORA_RANK
+
+
+def _remap_single_native_key(key: str) -> str:
+    """Map one native Wan 2.1 LoRA key to Diffusers WanTransformer3DModel naming."""
+    m = _SELF_ATTN_RE.match(key)
+    if m:
+        idx, proj, suffix = m.groups()
+        diff_proj = {"q": "to_q", "k": "to_k", "v": "to_v", "o": "to_out.0"}[proj]
+        return f"blocks.{idx}.attn1.{diff_proj}.{suffix}"
+
+    m = _LEGACY_ATTN_RE.match(key)
+    if m:
+        idx, proj, suffix = m.groups()
+        diff_proj = {"q": "to_q", "k": "to_k", "v": "to_v", "o": "to_out.0"}[proj]
+        return f"blocks.{idx}.attn1.{diff_proj}.{suffix}"
+
+    m = _FFN0_RE.match(key)
+    if m:
+        return f"blocks.{m.group(1)}.ffn.net.0.proj.{m.group(2)}"
+
+    m = _FFN2_RE.match(key)
+    if m:
+        return f"blocks.{m.group(1)}.ffn.net.2.{m.group(2)}"
+
+    # Back-compat: blocks.* -> transformer_blocks.* (older bridge attempt).
+    if key.startswith("blocks."):
+        return "transformer_blocks." + key[len("blocks.") :]
+    return key
+
+
+def _remap_native_to_diffusers_keys(
+    native_state: Dict[str, torch.Tensor],
+) -> Dict[str, torch.Tensor]:
+    """Remap native Wan 2.1 block names to Diffusers WanTransformer3DModel keys."""
+    remapped: Dict[str, torch.Tensor] = {}
+    for key, tensor in native_state.items():
+        remapped[_remap_single_native_key(key)] = tensor
+    return remapped
+
+
+def compute_remap_coverage(
+    native_state: Dict[str, torch.Tensor],
+) -> Tuple[int, int, float]:
+    """Return transformed key count, total keys, and coverage ratio."""
+    if not native_state:
+        return 0, 0, 0.0
+    remapped = _remap_native_to_diffusers_keys(native_state)
+    transformed = sum(1 for key in native_state if key != remapped.get(key))
+    total = len(native_state)
+    return transformed, total, transformed / total
+
+
+def analyze_native_wan_lora_ckpt(ckpt_path: str | Path) -> Dict[str, Any]:
+    """Inventory native checkpoint keys and remapped Diffusers targets."""
+    native_state = load_native_wan_lora_state_dict(ckpt_path)
+    remapped = _remap_native_to_diffusers_keys(native_state)
+    unchanged = [k for k in native_state if k == remapped.get(k)]
+    return {
+        "path": str(ckpt_path),
+        "rank": _infer_lora_rank(native_state),
+        "native_key_count": len(native_state),
+        "remapped_key_count": len(remapped),
+        "unchanged_keys": unchanged[:10],
+        "sample_native": sorted(native_state.keys())[:5],
+        "sample_remapped": sorted(remapped.keys())[:5],
+    }
+
+
+def _peft_prefix_keys(state: Dict[str, torch.Tensor]) -> Dict[str, torch.Tensor]:
+    return {
+        k if k.startswith("base_model.model.") else f"base_model.model.{k}": v
+        for k, v in state.items()
+    }
+
+
+def _count_lora_params(module: Any) -> int:
+    return sum(1 for name, _ in module.named_parameters() if "lora" in name.lower())
+
+
+def _apply_lora_to_transformer(
+    transformer: Any,
+    remapped_state: Dict[str, torch.Tensor],
+    *,
+    rank: int,
+    adapter_name: str,
+    expert_label: str,
+    source_keys: int,
+    remapped_keys: int,
+) -> Tuple[Any, WanLoraLoadReport]:
+    """Inject PEFT LoRA adapters and load remapped weights onto one transformer."""
+    if not hasattr(transformer, "peft_config") or not transformer.peft_config:
+        lora_config = LoraConfig(
+            r=rank,
+            lora_alpha=rank,
+            init_lora_weights=True,
+            target_modules=WAN_DIFFUSERS_LORA_TARGETS,
+        )
+        transformer = get_peft_model(
+            transformer, lora_config, adapter_name=adapter_name
+        )
+
+    peft_state = _peft_prefix_keys(remapped_state)
+    result = set_peft_model_state_dict(
+        transformer,
+        peft_state,
+        adapter_name=adapter_name,
+    )
+    missing: List[str] = []
+    unexpected: List[str] = []
+    if hasattr(result, "missing_keys"):
+        missing = [k for k in result.missing_keys if "lora" in k.lower()]
+        unexpected = list(result.unexpected_keys)
+
+    loaded = _count_lora_params(transformer)
+    report = WanLoraLoadReport(
+        expert=expert_label,
+        rank=rank,
+        source_keys=source_keys,
+        remapped_keys=remapped_keys,
+        loaded_lora_params=loaded,
+        missing_keys=missing,
+        unexpected_keys=unexpected,
+    )
+    if unexpected:
+        logger.warning(
+            "Wan LoRA bridge [{}]: {} unexpected keys (first 5): {}",
+            expert_label,
+            len(unexpected),
+            unexpected[:5],
+        )
+    if missing:
+        logger.warning(
+            "Wan LoRA bridge [{}]: {} missing LoRA keys (first 5): {}",
+            expert_label,
+            len(missing),
+            missing[:5],
+        )
+    logger.info("Wan LoRA bridge [{}]: {}", expert_label, report.as_dict())
+    return transformer, report
+
+
+def apply_native_wan_lora_to_pipeline(
+    pipeline: Any,
+    ckpt_path: str | Path,
+    *,
+    lora_scale: float = 1.0,
+    lora_scale_2: Optional[float] = None,
+) -> List[WanLoraLoadReport]:
+    """
+    Attach Wan 2.1 native LoRA weights to a Wan 2.2 Diffusers pipeline.
+
+    Loads the same adapter onto ``transformer`` (high-noise) and ``transformer_2``
+    (low-noise) when both are present, matching Diffusers community practice.
+    """
+    native_state = load_native_wan_lora_state_dict(ckpt_path)
+    rank = _infer_lora_rank(native_state)
+    remapped = _remap_native_to_diffusers_keys(native_state)
+    remapped_keys, source_keys, _ = compute_remap_coverage(native_state)
+    scale_2 = lora_scale if lora_scale_2 is None else lora_scale_2
+
+    reports: List[WanLoraLoadReport] = []
+    adapters: List[str] = []
+    scales: List[float] = []
+
+    pipeline.transformer, report_high = _apply_lora_to_transformer(
+        pipeline.transformer,
+        remapped,
+        rank=rank,
+        adapter_name=HIGH_NOISE_ADAPTER,
+        expert_label="transformer",
+        source_keys=source_keys,
+        remapped_keys=remapped_keys,
+    )
+    reports.append(report_high)
+    adapters.append(HIGH_NOISE_ADAPTER)
+    scales.append(lora_scale)
+
+    transformer_2 = getattr(pipeline, "transformer_2", None)
+    if transformer_2 is not None:
+        pipeline.transformer_2, report_low = _apply_lora_to_transformer(
+            transformer_2,
+            remapped,
+            rank=rank,
+            adapter_name=LOW_NOISE_ADAPTER,
+            expert_label="transformer_2",
+            source_keys=source_keys,
+            remapped_keys=remapped_keys,
+        )
+        reports.append(report_low)
+        adapters.append(LOW_NOISE_ADAPTER)
+        scales.append(scale_2)
+
+    total_loaded = sum(r.loaded_lora_params for r in reports)
+    if total_loaded == 0:
+        raise RuntimeError(
+            f"Wan LoRA bridge failed to load any parameters from {ckpt_path}. "
+            "Run tools/spike_wan_lora_bridge.py for a key inventory."
+        )
+
+    if hasattr(pipeline, "set_adapters"):
+        pipeline.set_adapters(adapters, adapter_weights=scales)
+    elif hasattr(pipeline, "fuse_lora"):
+        pipeline.fuse_lora(lora_scale=lora_scale)
+
+    min_remap = min(r.remap_ratio for r in reports)
+    if min_remap < 0.9 and remapped:
+        logger.warning(
+            "Wan LoRA bridge: remap ratio {:.1%} below 90% — visual QA recommended",
+            min_remap,
+        )
+
+    logger.info(
+        "Wan LoRA bridge: rank={} experts={} total_lora_params={} scales={}",
+        rank,
+        [r.expert for r in reports],
+        total_loaded,
+        scales,
+    )
+    return reports
+
+
+def apply_native_wan_lora_to_i2v_pipeline(
+    pipeline: Any,
+    ckpt_path: str | Path,
+    *,
+    lora_scale: float = 1.0,
+    lora_scale_2: Optional[float] = None,
+) -> List[WanLoraLoadReport]:
+    """
+    Attach Wan 2.1 native I2V LoRA weights to a Wan 2.2 I2V Diffusers pipeline.
+
+    Uses the same block-level key remap as T2V; both transformer experts receive
+    identical adapter weights when ``transformer_2`` is present.
+    """
+    return apply_native_wan_lora_to_pipeline(
+        pipeline,
+        ckpt_path,
+        lora_scale=lora_scale,
+        lora_scale_2=lora_scale_2,
+    )
+
+
+def export_diffusers_lora_state_dicts(
+    ckpt_path: str | Path,
+) -> Dict[str, Dict[str, torch.Tensor]]:
+    """
+    Export remapped LoRA tensors for Diffusers ``load_lora_weights``.
+
+    Returns a dict with ``high_noise`` and optionally ``low_noise`` entries
+    (same weights; Wan 2.2 loads low-noise expert via ``load_into_transformer_2``).
+    """
+    native_state = load_native_wan_lora_state_dict(ckpt_path)
+    remapped = _remap_native_to_diffusers_keys(native_state)
+    exports = {"high_noise": remapped, "low_noise": dict(remapped)}
+    return exports
+
+
+def strip_peft_prefix(state: Dict[str, torch.Tensor]) -> Dict[str, torch.Tensor]:
+    """Remove PEFT prefixes for safetensors export."""
+    cleaned: Dict[str, torch.Tensor] = {}
+    for key, tensor in state.items():
+        k = key
+        for prefix in ("base_model.model.", "base_model."):
+            if k.startswith(prefix):
+                k = k[len(prefix) :]
+        cleaned[k] = tensor
+    return cleaned
diff --git a/videotuna/utils/wan_training.py b/videotuna/utils/wan_training.py
new file mode 100644
index 00000000..18f2308e
--- /dev/null
+++ b/videotuna/utils/wan_training.py
@@ -0,0 +1,265 @@
+"""Flow-matching training helpers for Wan T2V / I2V native Lightning LoRA."""
+
+from __future__ import annotations
+
+from typing import Any, List, Optional, Sequence, Tuple
+
+import torch
+import torch.nn.functional as F
+
+from videotuna.schedulers.flow_matching import FlowMatchScheduler
+
+
+def is_i2v_task(task: str) -> bool:
+    return "i2v" in task.lower() and "t2v" not in task.lower()
+
+
+def wan_pipeline_backend(flow: Any) -> Any:
+    if hasattr(flow, "wan_t2v"):
+        return flow.wan_t2v
+    if hasattr(flow, "wan_i2v"):
+        return flow.wan_i2v
+    raise RuntimeError("WanVideoModelFlow has no wan_t2v or wan_i2v backend")
+
+
+def _latent_grid(
+    num_frames: int,
+    height: int,
+    width: int,
+    vae_stride: Tuple[int, int, int],
+    patch_size: Tuple[int, int, int],
+) -> Tuple[int, int, int, int]:
+    lat_t = (num_frames - 1) // vae_stride[0] + 1
+    lat_h = height // vae_stride[1]
+    lat_w = width // vae_stride[2]
+    seq_len = lat_t * lat_h * lat_w // (patch_size[1] * patch_size[2])
+    return lat_t, lat_h, lat_w, int(seq_len)
+
+
+def build_i2v_mask_and_latent(
+    image: torch.Tensor,
+    num_frames: int,
+    lat_h: int,
+    lat_w: int,
+    vae_stride: Tuple[int, int, int],
+    device: torch.device,
+    dtype: torch.dtype,
+) -> Tuple[torch.Tensor, torch.Tensor]:
+    """
+    Build I2V mask + VAE input tensor matching ``WanI2V.generate``.
+
+    ``image``: ``(C, H, W)`` in ``[-1, 1]``.
+    Returns ``(mask, video_tensor)`` with ``video_tensor`` shape ``(3, F, H, W)``.
+    """
+    c, h, w = image.shape
+    f = num_frames
+    video = torch.cat(
+        [
+            image.unsqueeze(1),
+            torch.zeros(c, f - 1, h, w, device=device, dtype=dtype),
+        ],
+        dim=1,
+    )
+    msk = torch.ones(1, f, lat_h, lat_w, device=device, dtype=dtype)
+    msk[:, 1:] = 0
+    msk = torch.concat(
+        [torch.repeat_interleave(msk[:, 0:1], repeats=4, dim=1), msk[:, 1:]],
+        dim=1,
+    )
+    msk = msk.view(1, msk.shape[1] // 4, 4, lat_h, lat_w)
+    msk = msk.transpose(1, 2)[0]
+    return msk, video
+
+
+def encode_i2v_condition(
+    vae: Any,
+    image: torch.Tensor,
+    num_frames: int,
+    height: int,
+    width: int,
+    vae_stride: Tuple[int, int, int],
+    patch_size: Tuple[int, int, int],
+) -> torch.Tensor:
+    """VAE-encode first-frame I2V conditioning for ``WanModel``."""
+    device = image.device
+    dtype = image.dtype
+    _, lat_h, lat_w, _ = _latent_grid(num_frames, height, width, vae_stride, patch_size)
+    if image.dim() == 4:
+        image = image.squeeze(1)
+    msk, video = build_i2v_mask_and_latent(
+        image, num_frames, lat_h, lat_w, vae_stride, device, dtype
+    )
+    if video.shape[-2:] != (height, width):
+        video = F.interpolate(
+            video.unsqueeze(0),
+            size=(num_frames, height, width),
+            mode="trilinear",
+            align_corners=False,
+        ).squeeze(0)
+    encoded = vae.encode([video])[0]
+    return torch.concat([msk, encoded], dim=0)
+
+
+def _encode_videos(vae: Any, videos: torch.Tensor) -> List[torch.Tensor]:
+    """Encode a video batch to per-sample latent tensors."""
+    latents: List[torch.Tensor] = []
+    for idx in range(videos.shape[0]):
+        clip = videos[idx]
+        latents.append(vae.encode([clip])[0])
+    return latents
+
+
+def _encode_prompts(text_encoder: Any, captions: Sequence[str], device: torch.device):
+    if isinstance(captions, str):
+        captions = [captions]
+    return text_encoder(list(captions), device)
+
+
+def _select_denoiser(flow: Any, timestep: torch.Tensor) -> Any:
+    boundary = flow.cfg.boundary * flow.cfg.num_train_timesteps
+    t_val = float(timestep.reshape(-1)[0].item())
+    if t_val < boundary:
+        return flow.low_denoiser
+    return flow.high_denoiser
+
+
+def _build_flow_scheduler(shift: float) -> FlowMatchScheduler:
+    scheduler = FlowMatchScheduler(
+        num_inference_steps=1000,
+        num_train_timesteps=1000,
+        shift=shift,
+    )
+    scheduler.set_timesteps(1000, training=True, shift=shift)
+    return scheduler
+
+
+def compute_wan_flow_matching_loss(flow: Any, batch: dict) -> torch.Tensor:
+    """
+    Flow-matching loss for Wan T2V / I2V LoRA training.
+
+    Expects ``batch`` keys: ``video`` ``(B,C,T,H,W)``, ``caption``, optional ``image``.
+    """
+    wan = wan_pipeline_backend(flow)
+    device = flow.device
+    dtype = flow.cfg.param_dtype
+
+    videos = batch["video"].to(device)
+    if videos.dim() != 5:
+        raise ValueError(f"Expected video batch (B,C,T,H,W), got {videos.shape}")
+
+    batch_size, _, num_frames, height, width = videos.shape
+    vae_stride = tuple(wan.vae_stride)
+    patch_size = tuple(wan.patch_size)
+    _, lat_h, lat_w, seq_len = _latent_grid(
+        num_frames, height, width, vae_stride, patch_size
+    )
+
+    with torch.no_grad():
+        latents = _encode_videos(wan.vae, videos)
+        contexts = _encode_prompts(wan.text_encoder, batch["caption"], device)
+
+        y_list: Optional[List[torch.Tensor]] = None
+        if is_i2v_task(flow.task):
+            images = batch.get("image")
+            if images is None:
+                raise ValueError("I2V training requires batch['image']")
+            images = images.to(device)
+            y_list = []
+            for idx in range(batch_size):
+                img = images[idx]
+                if img.dim() == 4:
+                    img = img.squeeze(1)
+                y_list.append(
+                    encode_i2v_condition(
+                        wan.vae,
+                        img,
+                        num_frames,
+                        height,
+                        width,
+                        vae_stride,
+                        patch_size,
+                    )
+                )
+
+    shift = 3.0 if height <= 480 else float(getattr(flow.cfg, "sample_shift", 5.0))
+    scheduler = _build_flow_scheduler(shift)
+
+    losses: List[torch.Tensor] = []
+    for idx in range(batch_size):
+        z = latents[idx].float()
+        noise = torch.randn_like(z)
+        t_idx = torch.randint(0, len(scheduler.timesteps), (1,), device=device)
+        timestep = scheduler.timesteps[t_idx].to(device)
+        sigma = scheduler.sigmas[t_idx].to(device)
+        noisy = (1.0 - sigma) * z + sigma * noise
+        target = noise - z
+
+        denoiser = _select_denoiser(flow, timestep)
+        denoiser.train()
+        model_input = [noisy.to(dtype)]
+        context = [contexts[idx]]
+        y_arg = [y_list[idx].to(dtype)] if y_list is not None else None
+
+        autocast_enabled = device.type == "cuda"
+        with torch.autocast(
+            device_type=device.type,
+            dtype=dtype,
+            enabled=autocast_enabled,
+        ):
+            pred_list = denoiser(
+                model_input,
+                t=timestep,
+                context=context,
+                seq_len=seq_len,
+                y=y_arg,
+            )
+        pred = pred_list[0].float()
+        weight = scheduler.training_weight(timestep).to(device)
+        loss = F.mse_loss(pred, target, reduction="mean") * weight
+        losses.append(loss)
+
+    return torch.stack(losses).mean()
+
+
+def init_wan_training_denoisers(flow: Any) -> None:
+    """Attach PEFT LoRA to Wan low/high noise experts for dual-expert training."""
+    from peft import LoraConfig, get_peft_model
+
+    from videotuna.utils.common_utils import instantiate_from_config
+    from videotuna.utils.lora_utils import (
+        collect_lora_parameter_names,
+        resolve_lora_target_modules,
+    )
+
+    wan = wan_pipeline_backend(flow)
+    if flow.use_lora and flow.lora_config is not None:
+        lora_cfg = instantiate_from_config(flow.lora_config)
+        if hasattr(lora_cfg, "target_modules"):
+            lora_cfg.target_modules = resolve_lora_target_modules(
+                wan.high_noise_model, lora_cfg.target_modules
+            )
+        flow.high_denoiser = get_peft_model(
+            wan.high_noise_model, lora_cfg, autocast_adapter_dtype=False
+        )
+        low_cfg = LoraConfig(
+            r=lora_cfg.r,
+            lora_alpha=lora_cfg.lora_alpha,
+            init_lora_weights=True,
+            target_modules=list(lora_cfg.target_modules),
+        )
+        flow.low_denoiser = get_peft_model(
+            wan.low_noise_model, low_cfg, autocast_adapter_dtype=False
+        )
+        flow.denoiser = flow.high_denoiser
+        flow.lora_params = collect_lora_parameter_names(flow.denoiser)
+        flow.denoiser.train()
+        for name, param in flow.denoiser.named_parameters():
+            if name in flow.lora_params:
+                param.requires_grad_(True)
+        for name, param in flow.low_denoiser.named_parameters():
+            if "lora" in name.lower():
+                param.requires_grad_(True)
+    else:
+        flow.low_denoiser = wan.low_noise_model
+        flow.high_denoiser = wan.high_noise_model
+        flow.denoiser = wan.high_noise_model
diff --git a/videotuna/vendor/VENDOR.md b/videotuna/vendor/VENDOR.md
new file mode 100644
index 00000000..1b3c1bde
--- /dev/null
+++ b/videotuna/vendor/VENDOR.md
@@ -0,0 +1,34 @@
+# Vendor: SimpleTuner (reference submodule)
+
+| Field | Value |
+|-------|-------|
+| **Path** | `videotuna/vendor/simpletuner/` (git submodule) |
+| **Upstream** | https://github.com/bghira/SimpleTuner |
+| **License** | Apache-2.0 |
+| **Pinned commit** | `34b1fd729fd0fa86e6b085ba0f3dbc44ca8757dc` (2025-01-29) |
+| **Import date** | 2025-06 (reference submodule; runtime trainer replaced) |
+| **VideoTuna entrypoints** | *(none — reference only)* |
+| **Runtime replacement** | `videotuna/training/flux_lora/` via `poetry run train-flux-lora` |
+
+## Purpose
+
+Reference-only submodule for upstream provenance. VideoTuna does **not** import this tree at runtime.
+The deleted in-tree snapshot (`videotuna/third_party/flux/`) was namespace-rewritten and had two
+functional patches — see [`docs/vendor/simpletuner-archive.md`](../../docs/vendor/simpletuner-archive.md).
+
+## Update procedure
+
+```bash
+cd videotuna/vendor/simpletuner
+git fetch origin
+git checkout <new-sha>
+cd ../../..
+git add videotuna/vendor/simpletuner
+# Update this file and docs/vendor/simpletuner-archive.md with the new SHA
+```
+
+Init on clone (optional):
+
+```bash
+git submodule update --init videotuna/vendor/simpletuner
+```