# [BETA] Side-Step for ACE-Step 1.5

```text
███████ ██ ██████  ███████       ███████ ████████ ███████ ██████
██      ██ ██   ██ ██            ██         ██    ██      ██   ██
███████ ██ ██   ██ █████   █████ ███████    ██    █████   ██████
     ██ ██ ██   ██ ██                 ██    ██    ██      ██
███████ ██ ██████  ███████       ███████    ██    ███████ ██
                                      by dernet ((BETA TESTING))
```
**Side-Step** is a **standalone** training toolkit for [ACE-Step 1.5](https://github.com/ace-step/ACE-Step-1.5) models. It provides corrected LoRA and LoKR fine-tuning implementations that fix fundamental bugs in the original trainer (which only trains correctly for turbo models), while adding low-VRAM support for local GPUs.

> **Standalone?** Yes. Side-Step installs as its own project with its own dependencies. The corrected (fixed) training loop, preprocessing, and wizard all work without a base ACE-Step installation -- you only need the model checkpoints. Vanilla training mode still requires base ACE-Step installed alongside.
## Why Side-Step?

The original ACE-Step trainer has two critical discrepancies from how the base models were actually trained. Side-Step was built to bridge this gap (and adds a few conveniences along the way):

1. **Continuous Timestep Sampling:** The original trainer uses a discrete 8-step schedule. This is fine for turbo, which the original training script is hardcoded for, but wrong for the other variants. Side-Step implements **logit-normal continuous sampling**, ensuring the model learns the full range of the denoising process.
2. **CFG Dropout (Classifier-Free Guidance):** The original trainer lacks condition dropout. Side-Step implements a **15% null-condition dropout**, teaching the model to handle both prompted and unprompted generation. Without this, inference quality suffers.
3. **Standalone Core:** The corrected training loop, preprocessing, and wizard bundle all required ACE-Step utilities. No base ACE-Step install needed -- just the model weights.
4. **Built for the cloud:** The original Gradio UI breaks when you try to use it for training. Use this instead :)
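The null-condition dropout in point 2 can be sketched in a few lines. This is an illustration only, not Side-Step's actual code; the tensor shapes and names are hypothetical:

```python
import torch

def apply_cfg_dropout(cond_emb: torch.Tensor, null_emb: torch.Tensor,
                      dropout_p: float = 0.15) -> torch.Tensor:
    """Replace each sample's condition with a null embedding with probability dropout_p.

    cond_emb: (batch, seq, dim) text-condition embeddings
    null_emb: (seq, dim) null embedding broadcast over dropped samples
    """
    # One Bernoulli draw per sample: True means "train this sample unconditioned"
    drop = torch.rand(cond_emb.shape[0], device=cond_emb.device) < dropout_p
    out = cond_emb.clone()
    out[drop] = null_emb
    return out
```

At inference time, the same null embedding feeds the unconditional branch of classifier-free guidance, which is why training without this dropout hurts generation quality.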
---
## Beta Status & Support

**Current Version:** 0.8.0-beta

| Feature | Status | Standalone? | Note |
| :--- | :--- | :--- | :--- |
| **Fixed Training (LoRA)** | Working | Yes | Recommended for all users. Corrected timesteps + CFG dropout. |
| **Fixed Training (LoKR)** | **Experimental** | Yes | Uses LyCORIS. May have rough edges. |
| **Vanilla Training** | Working | **No** | Reproduction of original behavior. Requires base ACE-Step 1.5 installed alongside. |
| **Interactive Wizard** | Working | Yes | `python train.py` with no args. Session loop, go-back, presets, first-run setup. |
| **CLI Preprocessing** | Beta | Yes | Two-pass pipeline, low VRAM. Adapter-agnostic (same tensors for LoRA and LoKR). |
| **Gradient Estimation** | Beta | Yes | Ranks attention modules by sensitivity. In Experimental menu. |
| **Presets System** | Working | Yes | Save/load/manage training configurations. Stores adapter type. |
| **TUI (Textual UI)** | **BROKEN** | -- | Do not use `sidestep_tui.py` yet. |
> **Something broken?** This is a beta. You can always roll back:
>
> ```bash
> git log --oneline -5   # find the commit you want
> git checkout <hash>
> ```
>
> If you hit issues, please open an issue -- it helps us stabilize faster.
### What's new in 0.8.0-beta

**Bug fixes:**

- **Fixed gradient checkpointing crash** -- Training with gradient checkpointing enabled (the default) would crash with `element 0 of tensors does not require grad`. The autograd graph was disconnecting through checkpointed segments because the `xt` input tensor wasn't carrying gradients. Now forces `xt.requires_grad_(True)` when checkpointing is active, matching ACE-Step's upstream behavior. This was the #1 blocker for new users.
- **Fixed training completing with 0 steps on Windows** -- Lightning Fabric's `setup_dataloaders()` was wrapping the DataLoader with a shim that yielded 0 batches on Windows, causing training to silently "complete" with 0 epochs and 0 steps. Reported by multiple users on RTX 3090 and other GPUs. The Fabric DataLoader wrapping is now skipped entirely (the model/optimizer are still Fabric-managed for mixed precision).
- **Fixed multi-GPU device selection** -- Using `cuda:1` (or any non-default GPU) no longer causes training to silently fail. The Fabric device setup has been rewritten to use `torch.cuda.set_device()` instead of passing device indices as lists.
- **LoRA save path fix** -- Adapter files (`adapter_config.json`, `adapter_model.safetensors`) are now saved directly into the output directory. Previously they were nested in an `adapter/` subdirectory, causing Gradio/ComfyUI to fail to find the weights at the path Side-Step reported.
- **Massive VRAM reduction** -- Gradient checkpointing is now ON by default and actually works (see the fix above). Measured at ~7 GiB for batch size 1 on a 48 GiB GPU (15% utilization). Previously Side-Step had checkpointing off or broken, causing 20-42 GiB of VRAM usage. This brings Side-Step well below ACE-Step's memory footprint.
- **0-step training detection** -- If training completes with zero steps processed, Side-Step now reports a clear `[FAIL]` error instead of a misleading "Training Complete" screen with 0 epochs.
- **Windows `num_workers` safety** -- Explicitly clamps `num_workers=0` on Windows even if overridden via CLI, preventing spawn-based multiprocessing crashes.
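The gradient checkpointing crash above is easy to reproduce in isolation. With PyTorch's reentrant checkpointing, gradients only flow through tensor inputs of the checkpointed segment, so a detached latent silently cuts the graph (a toy sketch, not Side-Step's trainer):

```python
import torch
from torch.utils.checkpoint import checkpoint

layer = torch.nn.Linear(4, 4)   # stands in for a trainable adapter layer
xt = torch.randn(2, 4)          # latent from a frozen path: no grad history

# With a detached input, the checkpointed output does not require grad even
# though the layer's weights do -- loss.backward() then fails with
# "element 0 of tensors does not require grad".
out = checkpoint(layer, xt, use_reentrant=True)
assert not out.requires_grad

# The fix: make the input carry gradients before entering the segment.
xt.requires_grad_(True)
out = checkpoint(layer, xt, use_reentrant=True)
out.sum().backward()
assert layer.weight.grad is not None
```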
**Features:**

- **Inference-ready checkpoints** -- Intermediate checkpoints (`checkpoints/epoch_N/`) now save adapter files flat alongside `training_state.pt`. Point any inference tool directly at a checkpoint directory -- no more digging into nested subdirectories. Checkpoints are usable for both inference AND resume.
- **Resume support in basic training loop** -- The non-Fabric fallback loop now supports `--resume-from`, matching the Fabric path.
- **VRAM-tier presets** -- Four new built-in presets (`vram_24gb_plus`, `vram_16gb`, `vram_12gb`, `vram_8gb`) with tuned settings for each GPU tier. Rank, optimizer, batch size, and offloading are pre-configured for your VRAM budget.
- **Flash Attention 2 auto-installed** -- Prebuilt wheels are now a default dependency. No compilation, no `--extra flash`. Falls back to SDPA silently on unsupported hardware.
- **Banner shows version** -- The startup banner now displays the Side-Step version for easier bug reporting.
### What's new in 0.7.0-beta

- **Truly standalone packaging** -- Side-Step is now its own project with a real `pyproject.toml` and full dependency list. Install it with `uv sync` -- no ACE-Step overlay required. The installer now creates Side-Step alongside ACE-Step as sibling directories.
- **First-run setup wizard** -- On first launch, Side-Step walks you through configuring your checkpoint directory and ACE-Step path (if you want vanilla mode), then validates your setup. Accessible any time from the main menu under "Settings".
- **Model discovery with fuzzy search** -- Instead of hardcoded `turbo/base/sft` choices, the wizard now scans your checkpoint directory for all model folders, labels official vs. custom models, and lets you pick by number or search by name. Fine-tunes with arbitrary folder names are fully supported.
- **Fine-tune training support** -- Train on custom fine-tunes by selecting their folder. Side-Step auto-detects the base model from `config.json`. If it can't, it asks which base the fine-tune descends from to condition timestep sampling correctly.
- **`--base-model` CLI argument** -- New flag for CLI users training on fine-tunes. Overrides timestep parameters when `config.json` doesn't contain them.
- **`--model-variant` accepts any folder name** -- No longer restricted to turbo/base/sft. Pass any subfolder name from your checkpoints directory (e.g., `--model-variant my-custom-finetune`).
- **`acestep.__path__` extension** -- When vanilla mode is configured, Side-Step extends its package path to reach ACE-Step's modules. No overlay, no symlinks, no `sys.path` hacks.
- **Settings persistence** -- Checkpoint dir, ACE-Step path, and vanilla intent are saved to `~/.config/sidestep/settings.json` and reused as defaults in subsequent sessions.
### What's new in 0.6.0-beta

- **Mostly standalone** -- The corrected (fixed) training loop, preprocessing pipeline, and wizard no longer require a base ACE-Step installation. All needed ACE-Step utilities are vendored in `_vendor/`. You only need the model checkpoint files. Vanilla training mode still requires base ACE-Step.
- **Enhanced prompt builder** -- Preprocessing now supports `custom_tag`, `genre`, and `prompt_override` fields from dataset JSON metadata, matching upstream feature parity without the AudioSample dependency.
- **Hardened metadata lookup** -- Dataset JSON entries with `audio_path` but no `filename` field are now handled correctly (the basename is extracted as a fallback key).
### What's new in 0.5.0-beta

- **LoKR adapter support (experimental)** -- Train LoKR (Low-Rank Kronecker) adapters via [LyCORIS](https://github.com/KohakuBlueleaf/LyCORIS) as an alternative to LoRA. LoKR uses Kronecker product factorization and may capture different patterns than LoRA. **This is experimental and may break.** The underlying LyCORIS + Fabric interaction has not been exhaustively tested across all hardware.
- **Restructured wizard menu** -- The main menu now offers "Train a LoRA" and "Train a LoKR" as distinct top-level choices, each leading to a corrected/vanilla sub-menu.
- **Unified preprocessing** -- Preprocessing is adapter-agnostic: the same tensors work for both LoRA and LoKR. The adapter type only affects weight injection during training, not the data pipeline. *(Previously, LoKR had a separate preprocessing mode that incorrectly fed target audio into context latents, giving the decoder the answer during training and producing misleadingly low loss.)*
- **LoKR-aware presets** -- Presets now save and restore adapter type and all LoKR-specific hyperparameters.
### What's new in 0.4.0-beta

- **Session loop** -- the wizard no longer exits after each action; preprocess, train, and manage presets in one session
- **Go-back navigation** -- type `b` at any prompt to return to the previous question
- **Step indicators** -- `[Step 3/8] LoRA Settings` shows your progress through each flow
- **Presets system** -- save, load, import, and export named training configurations
- **Flow chaining** -- after preprocessing, the wizard offers to start training immediately
- **Experimental submenu** -- gradient estimation and upcoming features live here
- **GPU cleanup** -- memory is released between session loop iterations to prevent VRAM leaks
- **Config summaries** -- preprocessing and estimation show a summary before starting
- **Basic/Advanced mode** -- choose how many questions the training wizard asks

---
## Prerequisites

- **Python 3.11+** -- Managed automatically by `uv`. If using pip, install Python 3.11 manually.
- **NVIDIA GPU with CUDA support** -- CUDA 12.x recommended. AMD and Intel GPUs are not supported; Apple Silicon (MPS) is experimental.
- **8 GB+ VRAM** -- See [VRAM Profiles](#optimization--vram-profiles) for per-tier settings. Training is possible on 8 GB GPUs with aggressive optimization.
- **Git** -- Required for cloning repositories and version management.
- **uv** (recommended) or **pip** -- `uv` handles Python, PyTorch+CUDA, and all dependencies automatically. Plain pip requires manual PyTorch installation.

---
## Installation

Side-Step is **mostly standalone**: the corrected training loop, preprocessing, wizard, and all CLI tools work without a base ACE-Step installation. You only need the model checkpoint files. The only thing that requires ACE-Step installed alongside is **vanilla training mode** (which reproduces the original bugged behavior for backward compatibility).

We **strongly recommend** using [uv](https://docs.astral.sh/uv/) for dependency management -- it handles Python 3.11, PyTorch with CUDA, Flash Attention wheels, and all other dependencies automatically.

### Windows (Easy Install)

Download or clone Side-Step, then double-click **`install_windows.bat`** (or run the PowerShell script). It handles everything: uv, Python 3.11, Side-Step dependencies, ACE-Step (alongside, for checkpoints), and the model download.

```powershell
# Or run from PowerShell directly:
git clone https://github.com/koda-dernet/Side-Step.git
cd Side-Step
.\install_windows.ps1
```

The installer creates two sibling directories:

- `Side-Step/` -- your training toolkit (standalone)
- `ACE-Step-1.5/` -- model checkpoints + optional vanilla mode
### Linux / macOS (Recommended: uv)

```bash
# 1. Install uv if you don't have it
curl -LsSf https://astral.sh/uv/install.sh | sh

# 2. Clone Side-Step
git clone https://github.com/koda-dernet/Side-Step.git
cd Side-Step

# 3. Install dependencies (includes PyTorch with CUDA + Flash Attention)
uv sync

# 4. The first run will guide you through setup (checkpoint path, etc.)
uv run python train.py
```
### Model Checkpoints

You need the model weights before you can train. Options:

1. **From ACE-Step (recommended):** Clone ACE-Step 1.5 alongside Side-Step and use `acestep-download`:

   ```bash
   git clone https://github.com/ace-step/ACE-Step-1.5.git
   cd ACE-Step-1.5 && uv sync && uv run acestep-download
   ```

   Then point Side-Step at the checkpoints folder on first run or via `--checkpoint-dir ../ACE-Step-1.5/checkpoints`.

2. **Manual download:** Get the weights from [HuggingFace](https://huggingface.co/ACE-Step/Ace-Step1.5) and place them in a `checkpoints/` directory inside Side-Step.

> **IMPORTANT: Never rename checkpoint folders.** The model loader uses folder names and `config.json` files to identify model variants (turbo, base, sft). Renaming them will break loading.
### Vanilla Mode (optional -- requires ACE-Step)

Vanilla training mode reproduces the original ACE-Step training behavior (bugged discrete timesteps, no CFG dropout). Most users should use **fixed** mode instead. If you specifically need vanilla mode:

```bash
# Clone ACE-Step alongside Side-Step
git clone https://github.com/ace-step/ACE-Step-1.5.git
cd ACE-Step-1.5 && uv sync && cd ..

# On first run, Side-Step's setup wizard will ask if you want vanilla mode
# and where your ACE-Step installation is.
```

> **Note:** With plain pip, you are responsible for installing the correct PyTorch version with CUDA support for your platform. This is the #1 source of "it doesn't work" issues. `uv sync` handles this automatically.
### Included automatically

Everything is installed by `uv sync` -- no extras, no manual pip installs:

- **Flash Attention 2** -- Prebuilt wheels, no compilation. Auto-detected on Ampere+ GPUs (RTX 30xx+). Falls back to SDPA on older hardware or macOS.
- **Gradient checkpointing** -- Enabled by default. Cuts VRAM dramatically (~7 GiB for batch size 1, down from 20-42 GiB without it).
- **PyTorch with CUDA 12.8** -- Correct CUDA-enabled build per platform.
- **bitsandbytes** -- 8-bit optimizers (AdamW8bit) for ~30-40% optimizer VRAM savings.
- **Prodigy** -- Adaptive optimizer that auto-tunes the learning rate.
- **LyCORIS** -- LoKR adapter support (experimental Kronecker product adapters).
---

## Platform Compatibility

| Platform | Status | Notes |
| :--- | :--- | :--- |
| **Linux (CUDA)** | Primary | Developed and tested here |
| **Windows (CUDA)** | Supported | Easy installer included; DataLoader workers auto-set to 0 |
| **macOS (MPS)** | Experimental | Apple Silicon only; some ops may fall back to CPU |

---
## Usage

### Option A: The Interactive Wizard (Recommended)

Simply run the script with no arguments. The wizard stays open in a session loop -- you can preprocess, configure, train, and manage presets without restarting.

```bash
# With uv (recommended)
uv run python train.py

# Without uv
python train.py
```

The wizard supports:

- **Go-back**: Type `b` at any prompt to return to the previous question
- **Presets**: Save and load named training configurations
- **Flow chaining**: After preprocessing, jump straight to training
- **Basic/Advanced modes**: Choose how detailed you want the configuration
### Option B: The Quick Start One-Liner

If you have your preprocessed tensors ready in `./my_data`, run:

```bash
# LoRA (default)
uv run python train.py fixed \
    --checkpoint-dir ./checkpoints \
    --model-variant turbo \
    --dataset-dir ./my_data \
    --output-dir ./output/my_lora \
    --epochs 100

# LoKR (experimental)
uv run python train.py fixed \
    --checkpoint-dir ./checkpoints \
    --model-variant turbo \
    --adapter-type lokr \
    --dataset-dir ./my_data \
    --output-dir ./output/my_lokr \
    --epochs 100
```
### Option C: Preprocess Audio (Two-Pass, Low VRAM)

Convert raw audio files into `.pt` tensors without loading all models at once. The pipeline runs in two passes: (1) VAE + Text Encoder (~3 GB), then (2) DIT encoder (~6 GB).

```bash
uv run python train.py fixed \
    --checkpoint-dir ./checkpoints \
    --model-variant turbo \
    --preprocess \
    --audio-dir ./my_audio \
    --tensor-output ./my_tensors
```

With a metadata JSON for lyrics/genre/BPM:

```bash
uv run python train.py fixed \
    --checkpoint-dir ./checkpoints \
    --preprocess \
    --audio-dir ./my_audio \
    --dataset-json ./my_dataset.json \
    --tensor-output ./my_tensors
```
### Option D: Gradient Estimation

Find which attention modules learn fastest for your dataset (useful for rank/target selection):

```bash
uv run python train.py estimate \
    --checkpoint-dir ./checkpoints \
    --model-variant turbo \
    --dataset-dir ./my_tensors \
    --estimate-batches 5 \
    --top-k 16
```
### Option E: Vanilla Training (Requires ACE-Step)

Reproduces the original ACE-Step training behavior (bugged discrete timesteps, no CFG dropout). Most users should use **fixed** mode instead. Requires a base ACE-Step installation alongside Side-Step:

```bash
uv run python train.py vanilla \
    --checkpoint-dir ./ACE-Step-1.5/checkpoints \
    --audio-dir ./my_audio \
    --output-dir ./output/my_vanilla_lora
```

> **Advanced subcommands:** `selective` (corrected training with dataset-specific module selection) and `compare-configs` (compare module config JSON files) are also available. These are advanced/WIP features -- run `uv run python train.py selective --help` or `uv run python train.py compare-configs --help` for details.
---

## Presets

Side-Step ships with seven built-in presets:

| Preset | Description |
| :--- | :--- |
| `recommended` | Balanced defaults for most LoRA fine-tuning tasks |
| `high_quality` | Rank 128, 1000 epochs -- for when quality matters most |
| `quick_test` | Rank 16, 10 epochs -- fast iteration for testing |
| `vram_24gb_plus` | Comfortable tier -- Rank 128, Batch 2, AdamW |
| `vram_16gb` | Standard tier -- Rank 64, Batch 1, AdamW |
| `vram_12gb` | Tight tier -- Rank 32, AdamW8bit, encoder offloading |
| `vram_8gb` | Minimal tier -- Rank 16, AdamW8bit, encoder offloading, high gradient accumulation |

User presets are saved to `./presets/` (project-local, next to your training data), so they persist across Docker runs and stay visible alongside your project. Presets from the global location (`~/.config/sidestep/presets/`) are also scanned as a fallback. You can import/export presets as JSON files to share with others.
---

## Optimization & VRAM Profiles

Side-Step is optimized for both heavy cloud GPUs (H100/A100) and local "underpowered" gear (RTX 3060/4070).

**Applied automatically (no configuration needed):**

- **Gradient checkpointing** (ON by default) -- recomputes activations during the backward pass, saving ~40-60% of activation VRAM. This matches the original ACE-Step behavior.
- **Flash Attention 2** (auto-installed) -- fused attention kernels for better GPU utilization. Requires an Ampere+ GPU (RTX 30xx+). Falls back to SDPA on older hardware.

| Profile | VRAM | Key Settings |
| :--- | :--- | :--- |
| **Comfortable** | 24 GB+ | AdamW, Batch 2+, Rank 64-128 |
| **Standard** | 16-24 GB | AdamW, Batch 1, Rank 64 |
| **Tight** | 10-16 GB | **AdamW8bit**, encoder offloading, Rank 32-64 |
| **Minimal** | <10 GB | **AdaFactor** or **AdamW8bit**, encoder offloading, Rank 16, high gradient accumulation |

### Additional VRAM Options (Advanced mode)

* **`--offload-encoder`**: Moves the heavy VAE and text encoders to CPU after setup. Frees ~2-4 GB of VRAM.
* **`--no-gradient-checkpointing`**: Disables gradient checkpointing for maximum speed if you have VRAM to spare.
* **`--optimizer-type prodigy`**: Uses the Prodigy optimizer to automatically find the best learning rate for you.
---

## Project Structure

```text
Side-Step/                        <-- Standalone project root
├── train.py                      <-- Your main entry point
├── pyproject.toml                <-- Dependencies (uv sync installs everything)
├── requirements-sidestep.txt     <-- Fallback for plain pip
├── install_windows.bat           <-- Windows easy installer (double-click)
├── install_windows.ps1           <-- PowerShell installer script
└── acestep/
    └── training_v2/              <-- Side-Step logic (all standalone)
        ├── trainer_fixed.py      <-- The corrected training loop
        ├── preprocess.py         <-- Two-pass preprocessing pipeline
        ├── estimate.py           <-- Gradient sensitivity estimation
        ├── model_loader.py       <-- Per-component model loading (supports fine-tunes)
        ├── model_discovery.py    <-- Checkpoint scanning & fuzzy search
        ├── settings.py           <-- Persistent user settings (~/.config/sidestep/)
        ├── _compat.py            <-- Version pin & compatibility check
        ├── optim.py              <-- 8-bit and adaptive optimizers
        ├── _vendor/              <-- Vendored ACE-Step utilities (standalone)
        ├── presets/              <-- Built-in preset JSON files
        ├── cli/                  <-- CLI argument parsing & dispatch
        └── ui/                   <-- Wizard, flows, setup, presets, visual logic
```
---

## Complete Argument Reference

Every argument, its default, and what it does.

### Global Flags

Available in: all subcommands (placed **before** the subcommand name)

| Argument | Default | Description |
|----------|---------|-------------|
| `--plain` | `False` | Disable Rich output; use plain text. Also set automatically when stdout is piped |
| `--yes` or `-y` | `False` | Skip the confirmation prompt and start training immediately |

### Model and Paths

Available in: vanilla, fixed

| Argument | Default | Description |
|----------|---------|-------------|
| `--checkpoint-dir` | **(required)** | Path to the root checkpoints directory (contains `acestep-v15-turbo/`, etc.) |
| `--model-variant` | `turbo` | Model variant or subfolder name. Official: `turbo`, `base`, `sft`. For fine-tunes, use the exact folder name (e.g., `my-custom-finetune`) |
| `--base-model` | *(auto)* | Base model a fine-tune was trained from: `turbo`, `base`, or `sft`. Auto-detected for official models. Only needed for custom fine-tunes whose `config.json` lacks timestep parameters |
| `--dataset-dir` | **(required)** | Directory containing your preprocessed `.pt` tensor files and `manifest.json` |
### Device and Precision

Available in: all subcommands

| Argument | Default | Description |
|----------|---------|-------------|
| `--device` | `auto` | Which device to train on. Options: `auto`, `cuda`, `cuda:0`, `cuda:1`, `mps`, `xpu`, `cpu`. Auto-detection priority: CUDA > MPS (Apple Silicon) > XPU (Intel) > CPU |
| `--precision` | `auto` | Floating-point precision. Options: `auto`, `bf16`, `fp16`, `fp32`. Auto picks bf16 on CUDA/XPU, fp16 on MPS, fp32 on CPU |

### Adapter Selection

Available in: vanilla, fixed

| Argument | Default | Description |
|----------|---------|-------------|
| `--adapter-type` | `lora` | Adapter type: `lora` (PEFT, stable) or `lokr` (LyCORIS, experimental). LoKR uses Kronecker product factorization |

### LoRA Settings (used when `--adapter-type=lora`)

Available in: vanilla, fixed

| Argument | Default | Description |
|----------|---------|-------------|
| `--rank` or `-r` | `64` | LoRA rank. Higher = more capacity and more VRAM. Recommended: 64 (ACE-Step dev recommendation) |
| `--alpha` | `128` | LoRA scaling factor. Controls how strongly the adapter affects the model. Usually 2x the rank. Recommended: 128 |
| `--dropout` | `0.1` | Dropout probability on LoRA layers. Helps prevent overfitting. Range: 0.0 to 0.5 |
| `--attention-type` | `both` | Which attention layers to target. Options: `both` (self + cross attention, 192 modules), `self` (self-attention only, audio patterns, 96 modules), `cross` (cross-attention only, text conditioning, 96 modules) |
| `--target-modules` | `q_proj k_proj v_proj o_proj` | Which projection layers get adapters. Space-separated list. Combined with `--attention-type` to determine the final target modules |
| `--bias` | `none` | Whether to train bias parameters. Options: `none` (no bias training), `all` (train all biases), `lora_only` (only biases in LoRA layers) |

### LoKR Settings (used when `--adapter-type=lokr`) -- Experimental

Available in: vanilla, fixed

| Argument | Default | Description |
|----------|---------|-------------|
| `--lokr-linear-dim` | `64` | LoKR linear dimension (analogous to LoRA rank) |
| `--lokr-linear-alpha` | `128` | LoKR linear alpha (scaling factor, analogous to LoRA alpha) |
| `--lokr-factor` | `-1` | Kronecker factorization factor. `-1` = automatic |
| `--lokr-decompose-both` | `False` | Decompose both Kronecker factors for additional compression |
| `--lokr-use-tucker` | `False` | Use Tucker decomposition for more efficient factorization |
| `--lokr-use-scalar` | `False` | Use scalar scaling |
| `--lokr-weight-decompose` | `False` | Enable DoRA-style weight decomposition |
### Training Hyperparameters

Available in: vanilla, fixed

| Argument | Default | Description |
|----------|---------|-------------|
| `--lr` or `--learning-rate` | `0.0001` | Initial learning rate. For the Prodigy optimizer, set to `1.0` |
| `--batch-size` | `1` | Number of samples per training step. Usually 1 for music generation (audio tensors are large) |
| `--gradient-accumulation` | `4` | Number of steps to accumulate gradients before updating weights. Effective batch size = batch-size x gradient-accumulation |
| `--epochs` | `100` | Maximum number of training epochs (full passes through the dataset) |
| `--warmup-steps` | `100` | Number of optimizer steps over which the learning rate ramps up from 10% to 100% |
| `--weight-decay` | `0.01` | Weight decay (L2 regularization). Helps prevent overfitting |
| `--max-grad-norm` | `1.0` | Maximum gradient norm for gradient clipping. Prevents training instability from large gradients |
| `--seed` | `42` | Random seed for reproducibility. Same seed + same data = same results |
| `--shift` | `3.0` | Noise schedule shift for inference. Turbo=`3.0`, base/sft=`1.0`. Stored as metadata -- does not affect the training loop (see [Technical Notes](#technical-notes-shift-and-timestep-sampling)) |
| `--num-inference-steps` | `8` | Denoising steps for inference. Turbo=`8`, base/sft=`50`. Stored as metadata -- does not affect the training loop |
| `--optimizer-type` | `adamw` | Optimizer: `adamw`, `adamw8bit` (saves VRAM), `adafactor` (minimal state), `prodigy` (auto-tunes LR) |
| `--scheduler-type` | `cosine` | LR schedule: `cosine`, `cosine_restarts`, `linear`, `constant`, `constant_with_warmup`. Prodigy auto-forces `constant` |
| `--gradient-checkpointing` | `True` | Recompute activations during backward to save VRAM (~40-60% less activation memory, ~10-30% slower). On by default; use `--no-gradient-checkpointing` to disable |
| `--offload-encoder` | `False` | Move the encoder/VAE to CPU after setup. Frees ~2-4 GB of VRAM with minimal speed impact |
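How `--batch-size`, `--gradient-accumulation`, and `--max-grad-norm` interact can be sketched as a bare training loop (a hypothetical illustration, not Side-Step's actual trainer):

```python
import torch

model = torch.nn.Linear(4, 1)
opt = torch.optim.AdamW(model.parameters(), lr=1e-4, weight_decay=0.01)
grad_accum = 4  # effective batch size = batch_size * grad_accum

batches = [torch.randn(1, 4) for _ in range(8)]  # batch_size = 1
for step, x in enumerate(batches, start=1):
    loss = model(x).pow(2).mean()
    (loss / grad_accum).backward()   # scale so accumulated grads average out
    if step % grad_accum == 0:       # one optimizer update per 4 micro-batches
        torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
        opt.step()
        opt.zero_grad()
```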
### Corrected Training (fixed mode only)

Available in: fixed

| Argument | Default | Description |
|----------|---------|-------------|
| `--cfg-ratio` | `0.15` | Classifier-free guidance dropout rate. With this probability, each sample's condition is replaced with a null embedding during training, teaching the model to work both with and without text prompts. The model was originally trained with 0.15 |

### Data Loading

Available in: vanilla, fixed

| Argument | Default | Description |
|----------|---------|-------------|
| `--num-workers` | `4` (Linux), `0` (Windows) | Number of parallel data-loading worker processes. Auto-set to 0 on Windows |
| `--pin-memory` / `--no-pin-memory` | `True` | Pin loaded tensors in CPU memory for faster GPU transfer. Disable if you're low on RAM |
| `--prefetch-factor` | `2` | Number of batches each worker prefetches in advance |
| `--persistent-workers` / `--no-persistent-workers` | `True` | Keep data-loading workers alive between epochs instead of respawning them |
### Checkpointing

Available in: vanilla, fixed

| Argument | Default | Description |
|----------|---------|-------------|
| `--output-dir` | **(required)** | Directory where LoRA weights, checkpoints, and TensorBoard logs are saved |
| `--save-every` | `10` | Save a full checkpoint (LoRA weights + optimizer + scheduler state) every N epochs |
| `--resume-from` | *(none)* | Path to a checkpoint directory to resume training from. Restores LoRA weights, optimizer state, and scheduler state |

### Logging and Monitoring

Available in: vanilla, fixed

| Argument | Default | Description |
|----------|---------|-------------|
| `--log-dir` | `{output-dir}/runs` | Directory for TensorBoard log files. View with `tensorboard --logdir <path>` |
| `--log-every` | `10` | Log loss and learning rate every N optimizer steps |
| `--log-heavy-every` | `50` | Log per-layer gradient norms every N optimizer steps. These are more expensive to compute but useful for debugging |
| `--sample-every-n-epochs` | `0` | Generate an audio sample every N epochs during training. 0 = disabled. (Not yet implemented) |

> **Log file:** All runs automatically append to `sidestep.log` in the working directory. This file captures full tracebacks and debug-level messages that may not appear in the terminal. Useful for diagnosing silent crashes or sharing logs when reporting issues.
### Preprocessing (optional)

Available in: vanilla, fixed

| Argument | Default | Description |
|----------|---------|-------------|
| `--preprocess` | `False` (flag) | If set, run audio preprocessing before training |
| `--audio-dir` | *(none)* | Source directory containing audio files (for preprocessing) |
| `--dataset-json` | *(none)* | Path to a labeled dataset JSON file (for preprocessing) |
| `--tensor-output` | *(none)* | Output directory where preprocessed `.pt` tensor files will be saved |
| `--max-duration` | `240` | Maximum audio duration in seconds. Longer files are truncated |

---
## Technical Notes: Shift and Timestep Sampling

> **Important:** The `--shift` and `--num-inference-steps` settings are **inference metadata only**. They are saved alongside your adapter so you know which values to use when generating audio with the trained LoRA/LoKR. **They do not enter the training loop.**

### How Side-Step trains (corrected/fixed mode)

Side-Step's corrected training loop uses **continuous logit-normal timestep sampling** -- an exact reimplementation of the `sample_t_r()` function defined inside each ACE-Step model variant's own `forward()` method. The core operation is:

```python
t = sigmoid(N(timestep_mu, timestep_sigma))
```

The `timestep_mu` and `timestep_sigma` parameters are read automatically from each model's `config.json` at startup. All three model variants (turbo, base, sft) define the same `sample_t_r()` function and call it the same way during their native training forward pass. Our `sample_timesteps()` matches this line for line.
### How the upstream community trainer trains

The original ACE-Step community trainer (`acestep/training/trainer.py`) uses a **discrete 8-step schedule** hardcoded from `shift=3.0`:

```python
TURBO_SHIFT3_TIMESTEPS = [1.0, 0.955, 0.9, 0.833, 0.75, 0.643, 0.5, 0.3]
```

Each training step randomly picks one of these 8 values. This is **not** how the models were originally trained -- it only approximates the turbo model's inference schedule. For base and sft models, this schedule is entirely wrong.
### Where shift actually matters

`shift` controls the **inference** timestep schedule via `t_shifted = shift * t / (1 + (shift - 1) * t)`. This warp is applied inside `generate_audio()`, not during training. With `shift=1.0` you get a uniform linear schedule (more steps needed); with `shift=3.0` the schedule compresses toward the high end (fewer steps needed -- that's what makes turbo fast).
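The warp is easy to check numerically (a standalone sketch of the formula above, not the actual `generate_audio()` code):

```python
def shift_timestep(t: float, shift: float) -> float:
    """Inference-time schedule warp: t_shifted = shift * t / (1 + (shift - 1) * t)."""
    return shift * t / (1 + (shift - 1) * t)

# shift=1.0 is the identity (uniform linear schedule)
assert shift_timestep(0.5, 1.0) == 0.5
# shift=3.0 pushes mid-schedule points toward t=1: 0.5 -> 0.75
assert shift_timestep(0.5, 3.0) == 0.75
```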
### Why this matters

- **Side-Step can train all variants** (turbo, base, sft) because it uses the same continuous sampling the models expect.
- **The upstream trainer only works properly for turbo** because its discrete schedule is derived from `shift=3.0`.
- **Changing `--shift` in Side-Step will not change your training results** -- the training timestep distribution is controlled by `timestep_mu` and `timestep_sigma` from the model config, which Side-Step reads automatically.
- **You still need the correct shift at inference time.** Use `shift=3.0` for turbo LoRAs and `shift=1.0` for base/sft LoRAs when generating audio.

---
## Contributing

Contributions are welcome! We're specifically looking for help fixing the **Textual TUI** and testing the new preprocessing + estimation modules.

**License:** Follows the original ACE-Step 1.5 licensing.