Spaces:

OnyxMunk
/

Ace-Step-Munk

Running

App Files Files Community

Ace-Step-Munk / docs /sidestep /VRAM Optimization Guide.md

OnyxMunk

Add LoRA training assets: scripts, docs (no binaries), ui, my_dataset

bc9c638 about 2 months ago

preview code

raw

history blame contribute delete

5.89 kB

	# VRAM Optimization Guide

	This page documents every VRAM optimization Side-Step applies and the VRAM profiles for different GPU tiers.

	---

	## VRAM Optimizations

	Side-Step applies several layers of VRAM optimization. Some are automatic; others are user-configurable. They are listed here from most impactful to least.

	### Gradient Checkpointing (automatic, on by default)

	What it does: During the backward pass, PyTorch normally stores all intermediate activations from the forward pass so it can compute gradients. Gradient checkpointing discards those activations and recomputes them on-the-fly during backpropagation.

	VRAM savings: ~40-60% reduction in activation memory. For ACE-Step's decoder this is easily 10-15 GB on large batches.

	Trade-off: ~10-30% slower training (recomputation cost).

	Default: ON. This matches what the original ACE-Step trainer always did silently. Side-Step exposes it as a toggle so power users can disable it if they have VRAM to spare and want maximum speed.

	How to disable: In the wizard (Advanced mode), answer "no" to "Enable gradient checkpointing?". Via CLI: `--no-gradient-checkpointing`.

	> Side-Step also disables `use_cache` on the decoder (frees KV-cache memory) and enables `input_require_grads` (needed for PEFT + checkpointing compatibility). These happen automatically when gradient checkpointing is on.

	### Flash Attention 2 (automatic, no user action needed)

	What it does: Replaces the standard attention implementation with a fused CUDA kernel that is both faster and more memory-efficient. Instead of materializing the full N x N attention matrix, Flash Attention computes attention in tiles.

	VRAM savings: Reduces attention memory from O(N^2) to O(N). In practice, this translates to lower peak VRAM during the forward pass, especially for long audio sequences.

	Speed benefit: 10-30% faster training steps on supported hardware due to better GPU utilization (100% compute occupancy vs ~70-80% with SDPA).

	Requirements: NVIDIA Ampere or newer GPU (RTX 30xx, RTX 40xx, A100, H100, etc. -- compute capability >= 8.0).

	How it works: Side-Step installs Flash Attention 2 from prebuilt wheels (no compilation needed) as part of `uv sync`. At runtime, the model loader auto-detects whether Flash Attention is available and falls back to SDPA (Scaled Dot-Product Attention) if it is not. You do not need to configure anything.

	Fallback: On older GPUs (RTX 20xx, GTX 16xx, etc.) or macOS, SDPA is used automatically.

	### 8-bit Optimizer (optional, Advanced mode)

	What it does: Replaces the standard 32-bit AdamW optimizer with an 8-bit quantized version. AdamW stores two state tensors (momentum and variance) per trainable parameter. Quantizing these to 8-bit halves their memory footprint.

	VRAM savings: ~30% reduction in optimizer state memory. For a rank-64 LoRA with ~10M trainable parameters, this saves ~2-3 GB.

	Trade-off: Negligible quality difference in practice. The quantization is applied to optimizer state, not to the model weights or gradients.

	How to enable: In the wizard (Advanced > Optimizer & Scheduler), select "AdamW 8-bit". Via CLI: `--optimizer-type adamw8bit`. The `bitsandbytes` package is included automatically by `uv sync`.

	### Encoder Offloading (optional, Advanced mode)

	What it does: After the model is set up for training, moves non-decoder submodules (VAE, text encoder, condition encoders) from GPU to CPU. These components are only needed during preprocessing -- during the training loop itself, only the decoder runs on GPU.

	VRAM savings: ~2-4 GB, depending on the model.

	Trade-off: Minimal. These components are not used during the training loop, so offloading them has no speed impact.

	How to enable: In the wizard (Advanced > VRAM Savings), answer "yes" to "Offload encoder/VAE to CPU?". Via CLI: `--offload-encoder`.

	### Lower Rank

	What it does: Reduces the number of trainable parameters in the LoRA/LoKR adapter. Fewer parameters means less memory for gradients, optimizer state, and activation storage.

	VRAM savings: Roughly proportional. Rank 16 uses ~4x less adapter VRAM than rank 64.

	Trade-off: Lower capacity. The adapter can capture less of your dataset's style. See the rank guide in [[Training Guide]].

	### Gradient Accumulation

	What it does: Instead of updating weights after every batch, accumulates gradients over N batches before applying a single optimizer step. This gives the same effective batch size as `batch_size x gradient_accumulation` without the VRAM cost of a larger batch.

	VRAM savings: Keeps per-step VRAM at `batch_size=1` levels while achieving the training dynamics of a larger effective batch.

	Default: 4 steps of accumulation with batch size 1 (effective batch size = 4).

	---

	## VRAM Profiles

	\| Profile \| GPU VRAM \| Automatic \| Recommended Settings \|
	\|---\|---\|---\|---\|
	\| Comfortable \| 24 GB+ \| Grad checkpointing ON, Flash Attention \| Batch 2+, AdamW, Rank 64-128 \|
	\| Standard \| 16-24 GB \| Grad checkpointing ON, Flash Attention \| Batch 1, AdamW, Rank 64, Grad accumulation 4 \|
	\| Tight \| 10-16 GB \| Grad checkpointing ON, Flash Attention \| Batch 1, AdamW8bit, Encoder offloading, Rank 32-64 \|
	\| Minimal \| <10 GB \| Grad checkpointing ON, Flash Attention \| Batch 1, AdamW8bit or Adafactor, Encoder offloading, Rank 16, High grad accumulation \|

	> Gradient checkpointing and Flash Attention are always active (when supported). The profiles differ in optimizer choice, rank, and optional offloading.
	>
	---

	## See Also

	- [[Training Guide]] -- Full training workflow, hyperparameter guide, LoRA vs LoKR
	- [[Estimation Guide]] -- Gradient sensitivity analysis for targeted training
	- [[Getting Started]] -- Installation and first-run setup
	- [[Model Management]] -- Checkpoint structure and fine-tune support

	# VRAM Optimization Guide

	This page documents every VRAM optimization Side-Step applies and the VRAM profiles for different GPU tiers.

	---

	## VRAM Optimizations

	Side-Step applies several layers of VRAM optimization. Some are automatic; others are user-configurable. They are listed here from most impactful to least.

	### Gradient Checkpointing (automatic, on by default)

	What it does: During the backward pass, PyTorch normally stores all intermediate activations from the forward pass so it can compute gradients. Gradient checkpointing discards those activations and recomputes them on-the-fly during backpropagation.

	VRAM savings: ~40-60% reduction in activation memory. For ACE-Step's decoder this is easily 10-15 GB on large batches.

	Trade-off: ~10-30% slower training (recomputation cost).

	Default: ON. This matches what the original ACE-Step trainer always did silently. Side-Step exposes it as a toggle so power users can disable it if they have VRAM to spare and want maximum speed.

	How to disable: In the wizard (Advanced mode), answer "no" to "Enable gradient checkpointing?". Via CLI: `--no-gradient-checkpointing`.

	> Side-Step also disables `use_cache` on the decoder (frees KV-cache memory) and enables `input_require_grads` (needed for PEFT + checkpointing compatibility). These happen automatically when gradient checkpointing is on.

	### Flash Attention 2 (automatic, no user action needed)

	What it does: Replaces the standard attention implementation with a fused CUDA kernel that is both faster and more memory-efficient. Instead of materializing the full N x N attention matrix, Flash Attention computes attention in tiles.

	VRAM savings: Reduces attention memory from O(N^2) to O(N). In practice, this translates to lower peak VRAM during the forward pass, especially for long audio sequences.

	Speed benefit: 10-30% faster training steps on supported hardware due to better GPU utilization (100% compute occupancy vs ~70-80% with SDPA).

	Requirements: NVIDIA Ampere or newer GPU (RTX 30xx, RTX 40xx, A100, H100, etc. -- compute capability >= 8.0).

	How it works: Side-Step installs Flash Attention 2 from prebuilt wheels (no compilation needed) as part of `uv sync`. At runtime, the model loader auto-detects whether Flash Attention is available and falls back to SDPA (Scaled Dot-Product Attention) if it is not. You do not need to configure anything.

	Fallback: On older GPUs (RTX 20xx, GTX 16xx, etc.) or macOS, SDPA is used automatically.

	### 8-bit Optimizer (optional, Advanced mode)

	What it does: Replaces the standard 32-bit AdamW optimizer with an 8-bit quantized version. AdamW stores two state tensors (momentum and variance) per trainable parameter. Quantizing these to 8-bit halves their memory footprint.

	VRAM savings: ~30% reduction in optimizer state memory. For a rank-64 LoRA with ~10M trainable parameters, this saves ~2-3 GB.

	Trade-off: Negligible quality difference in practice. The quantization is applied to optimizer state, not to the model weights or gradients.

	How to enable: In the wizard (Advanced > Optimizer & Scheduler), select "AdamW 8-bit". Via CLI: `--optimizer-type adamw8bit`. The `bitsandbytes` package is included automatically by `uv sync`.

	### Encoder Offloading (optional, Advanced mode)

	What it does: After the model is set up for training, moves non-decoder submodules (VAE, text encoder, condition encoders) from GPU to CPU. These components are only needed during preprocessing -- during the training loop itself, only the decoder runs on GPU.

	VRAM savings: ~2-4 GB, depending on the model.

	Trade-off: Minimal. These components are not used during the training loop, so offloading them has no speed impact.

	How to enable: In the wizard (Advanced > VRAM Savings), answer "yes" to "Offload encoder/VAE to CPU?". Via CLI: `--offload-encoder`.

	### Lower Rank

	What it does: Reduces the number of trainable parameters in the LoRA/LoKR adapter. Fewer parameters means less memory for gradients, optimizer state, and activation storage.

	VRAM savings: Roughly proportional. Rank 16 uses ~4x less adapter VRAM than rank 64.

	Trade-off: Lower capacity. The adapter can capture less of your dataset's style. See the rank guide in [[Training Guide]].

	### Gradient Accumulation

	What it does: Instead of updating weights after every batch, accumulates gradients over N batches before applying a single optimizer step. This gives the same effective batch size as `batch_size x gradient_accumulation` without the VRAM cost of a larger batch.

	VRAM savings: Keeps per-step VRAM at `batch_size=1` levels while achieving the training dynamics of a larger effective batch.

	Default: 4 steps of accumulation with batch size 1 (effective batch size = 4).

	---

	## VRAM Profiles

	\| Profile \| GPU VRAM \| Automatic \| Recommended Settings \|
	\|---\|---\|---\|---\|
	\| Comfortable \| 24 GB+ \| Grad checkpointing ON, Flash Attention \| Batch 2+, AdamW, Rank 64-128 \|
	\| Standard \| 16-24 GB \| Grad checkpointing ON, Flash Attention \| Batch 1, AdamW, Rank 64, Grad accumulation 4 \|
	\| Tight \| 10-16 GB \| Grad checkpointing ON, Flash Attention \| Batch 1, AdamW8bit, Encoder offloading, Rank 32-64 \|
	\| Minimal \| <10 GB \| Grad checkpointing ON, Flash Attention \| Batch 1, AdamW8bit or Adafactor, Encoder offloading, Rank 16, High grad accumulation \|

	> Gradient checkpointing and Flash Attention are always active (when supported). The profiles differ in optimizer choice, rank, and optional offloading.
	>
	---

	## See Also

	- [[Training Guide]] -- Full training workflow, hyperparameter guide, LoRA vs LoKR
	- [[Estimation Guide]] -- Gradient sensitivity analysis for targeted training
	- [[Getting Started]] -- Installation and first-run setup
	- [[Model Management]] -- Checkpoint structure and fine-tune support