Ace-Step-Munk / docs /sidestep /VRAM Optimization Guide.md
OnyxMunk's picture
Add LoRA training assets: scripts, docs (no binaries), ui, my_dataset
bc9c638
# VRAM Optimization Guide
This page documents every VRAM optimization Side-Step applies and the VRAM profiles for different GPU tiers.
---
## VRAM Optimizations
Side-Step applies several layers of VRAM optimization. Some are automatic; others are user-configurable. They are listed here from most impactful to least.
### Gradient Checkpointing (automatic, on by default)
**What it does:** During the backward pass, PyTorch normally stores all intermediate activations from the forward pass so it can compute gradients. Gradient checkpointing discards those activations and recomputes them on-the-fly during backpropagation.
**VRAM savings:** ~40-60% reduction in activation memory. For ACE-Step's decoder this is easily 10-15 GB on large batches.
**Trade-off:** ~10-30% slower training (recomputation cost).
**Default:** ON. This matches what the original ACE-Step trainer always did silently. Side-Step exposes it as a toggle so power users can disable it if they have VRAM to spare and want maximum speed.
**How to disable:** In the wizard (Advanced mode), answer "no" to "Enable gradient checkpointing?". Via CLI: `--no-gradient-checkpointing`.
> Side-Step also disables `use_cache` on the decoder (frees KV-cache memory) and enables `input_require_grads` (needed for PEFT + checkpointing compatibility). These happen automatically when gradient checkpointing is on.
### Flash Attention 2 (automatic, no user action needed)
**What it does:** Replaces the standard attention implementation with a fused CUDA kernel that is both faster and more memory-efficient. Instead of materializing the full N x N attention matrix, Flash Attention computes attention in tiles.
**VRAM savings:** Reduces attention memory from O(N^2) to O(N). In practice, this translates to lower peak VRAM during the forward pass, especially for long audio sequences.
**Speed benefit:** 10-30% faster training steps on supported hardware due to better GPU utilization (100% compute occupancy vs ~70-80% with SDPA).
**Requirements:** NVIDIA Ampere or newer GPU (RTX 30xx, RTX 40xx, A100, H100, etc. -- compute capability >= 8.0).
**How it works:** Side-Step installs Flash Attention 2 from prebuilt wheels (no compilation needed) as part of `uv sync`. At runtime, the model loader auto-detects whether Flash Attention is available and falls back to SDPA (Scaled Dot-Product Attention) if it is not. You do not need to configure anything.
**Fallback:** On older GPUs (RTX 20xx, GTX 16xx, etc.) or macOS, SDPA is used automatically.
### 8-bit Optimizer (optional, Advanced mode)
**What it does:** Replaces the standard 32-bit AdamW optimizer with an 8-bit quantized version. AdamW stores two state tensors (momentum and variance) per trainable parameter. Quantizing these to 8-bit halves their memory footprint.
**VRAM savings:** ~30% reduction in optimizer state memory. For a rank-64 LoRA with ~10M trainable parameters, this saves ~2-3 GB.
**Trade-off:** Negligible quality difference in practice. The quantization is applied to optimizer state, not to the model weights or gradients.
**How to enable:** In the wizard (Advanced > Optimizer & Scheduler), select "AdamW 8-bit". Via CLI: `--optimizer-type adamw8bit`. The `bitsandbytes` package is included automatically by `uv sync`.
### Encoder Offloading (optional, Advanced mode)
**What it does:** After the model is set up for training, moves non-decoder submodules (VAE, text encoder, condition encoders) from GPU to CPU. These components are only needed during preprocessing -- during the training loop itself, only the decoder runs on GPU.
**VRAM savings:** ~2-4 GB, depending on the model.
**Trade-off:** Minimal. These components are not used during the training loop, so offloading them has no speed impact.
**How to enable:** In the wizard (Advanced > VRAM Savings), answer "yes" to "Offload encoder/VAE to CPU?". Via CLI: `--offload-encoder`.
### Lower Rank
**What it does:** Reduces the number of trainable parameters in the LoRA/LoKR adapter. Fewer parameters means less memory for gradients, optimizer state, and activation storage.
**VRAM savings:** Roughly proportional. Rank 16 uses ~4x less adapter VRAM than rank 64.
**Trade-off:** Lower capacity. The adapter can capture less of your dataset's style. See the rank guide in [[Training Guide]].
### Gradient Accumulation
**What it does:** Instead of updating weights after every batch, accumulates gradients over N batches before applying a single optimizer step. This gives the same effective batch size as `batch_size x gradient_accumulation` without the VRAM cost of a larger batch.
**VRAM savings:** Keeps per-step VRAM at `batch_size=1` levels while achieving the training dynamics of a larger effective batch.
**Default:** 4 steps of accumulation with batch size 1 (effective batch size = 4).
---
## VRAM Profiles
| Profile | GPU VRAM | Automatic | Recommended Settings |
|---|---|---|---|
| **Comfortable** | 24 GB+ | Grad checkpointing ON, Flash Attention | Batch 2+, AdamW, Rank 64-128 |
| **Standard** | 16-24 GB | Grad checkpointing ON, Flash Attention | Batch 1, AdamW, Rank 64, Grad accumulation 4 |
| **Tight** | 10-16 GB | Grad checkpointing ON, Flash Attention | Batch 1, **AdamW8bit**, Encoder offloading, Rank 32-64 |
| **Minimal** | <10 GB | Grad checkpointing ON, Flash Attention | Batch 1, **AdamW8bit** or **Adafactor**, Encoder offloading, Rank 16, High grad accumulation |
> Gradient checkpointing and Flash Attention are always active (when supported). The profiles differ in optimizer choice, rank, and optional offloading.
>
---
## See Also
- [[Training Guide]] -- Full training workflow, hyperparameter guide, LoRA vs LoKR
- [[Estimation Guide]] -- Gradient sensitivity analysis for targeted training
- [[Getting Started]] -- Installation and first-run setup
- [[Model Management]] -- Checkpoint structure and fine-tune support