# ROCm and Metal/MPS Support

LightDiffusion-Next includes comprehensive support for AMD GPUs with ROCm and Apple Silicon Macs with Metal Performance Shaders (MPS). This guide covers the platform-specific considerations and optimizations available for non-NVIDIA hardware.

## ROCm Support (AMD GPUs)

### Overview

ROCm (Radeon Open Compute) is AMD's open-source platform for GPU computing. LightDiffusion-Next automatically detects and utilizes ROCm-compatible AMD GPUs through PyTorch's HIP backend.

### Supported Hardware

- **RDNA Architecture:**
    - RDNA 2 (RX 6000 series) - FP16 support
    - RDNA 3 (RX 7000 series) - FP16 and BF16 support
- **CDNA Architecture:**
    - CDNA (MI100)
    - CDNA 2 (MI200 series) - FP16 and BF16 support
    - CDNA 3 (MI300 series) - FP16 and BF16 support

### Installation

1. **Install ROCm drivers and runtime:**

    Follow the official [ROCm installation guide](https://rocm.docs.amd.com/en/latest/deploy/linux/quick_start.html) for your Linux distribution.

    ```bash
    # Example for Ubuntu 22.04 (check the official guide for the current package URL)
    wget https://repo.radeon.com/amdgpu-install/latest/ubuntu/jammy/amdgpu-install_latest_all.deb
    sudo apt-get install ./amdgpu-install_latest_all.deb
    sudo amdgpu-install --usecase=rocm
    ```

2. **Verify ROCm installation:**

    ```bash
    rocm-smi
    /opt/rocm/bin/rocminfo
    ```

3. **Install PyTorch with ROCm support:**

    ```bash
    # Create virtual environment
    python3 -m venv .venv
    source .venv/bin/activate
    pip install --upgrade pip uv

    # Install PyTorch with ROCm 6.0 support (adjust version as needed)
    uv pip install --index-url https://download.pytorch.org/whl/rocm6.0 torch torchvision

    # Install project dependencies
    uv pip install -r requirements.txt
    ```

    With plain `pip`, the equivalent install is `pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/rocm6.0`.

4. **Launch LightDiffusion-Next:**

    ```bash
    streamlit run streamlit_app.py --server.address=0.0.0.0 --server.port=8501
    ```

### ROCm-Specific Features

#### Automatic Detection

LightDiffusion-Next automatically detects ROCm GPUs at startup and reports them in the logs:

```
Device: cuda:0 AMD Radeon RX 7900 XTX (ROCm) :
```

The device appears as `cuda:0` because ROCm builds of PyTorch expose AMD GPUs through the `torch.cuda` API.

#### Memory Management

- **Cache Management:** On ROCm, the cache is cleared more conservatively than on CUDA. To avoid memory fragmentation, it is cleared only when explicitly forced.
- **Memory Statistics:** Full memory statistics are available through the standard PyTorch CUDA API (which works transparently with ROCm).

#### Precision Support

- **FP16:** Fully supported on all RDNA and CDNA architectures
- **BF16:** Supported on RDNA 3+ and CDNA 2+ GPUs (automatically detected)
- **FP32:** Always available as fallback

#### Attention Mechanisms

| Feature | ROCm Support | Notes |
|---------|--------------|-------|
| PyTorch Scaled Dot-Product Attention (SDPA) | ✅ Yes | Default and recommended |
| PyTorch Flash Attention | ✅ Yes | Available on RDNA 3 and CDNA 2+ |
| xformers | ✅ Yes | Works with ROCm builds of xformers |
| SageAttention | ❌ No | CUDA-only kernels |
| SpargeAttn | ❌ No | CUDA-only kernels |

**Recommendation:** Use PyTorch's built-in attention (SDPA) on ROCm for best compatibility. Install a ROCm build of xformers for additional optimizations.

### Performance Tips

1. **Use BF16 on supported GPUs:**
    - RDNA 3 (RX 7000 series) and CDNA 2+ support BF16 natively
    - BF16 provides better numerical stability than FP16

2. **Enable PyTorch attention:**
    - Automatically enabled for PyTorch 2.0+
    - Provides good performance without CUDA-specific optimizations

3. **Install ROCm-compatible xformers:**

    ```bash
    # Build xformers from source for ROCm
    git clone https://github.com/facebookresearch/xformers.git
    cd xformers
    git submodule update --init --recursive
    pip install -e . --no-build-isolation
    ```

4. **Monitor GPU utilization:**

    ```bash
    watch -n 1 rocm-smi
    ```

### Known Limitations

- **SageAttention and SpargeAttn:** These optimizations use CUDA-specific kernels and are not available on ROCm. The system automatically falls back to PyTorch SDPA.
- **Stable-Fast:** May have limited support depending on ROCm version. Test compilation before relying on it.
- **Driver Maturity:** Ensure you're using the latest ROCm version for best stability and performance.

---

## Metal/MPS Support (Apple Silicon)

### Overview

Metal Performance Shaders (MPS) provides GPU acceleration on Apple Silicon Macs (M1, M2, M3 series). LightDiffusion-Next automatically detects and utilizes MPS when running on macOS.

### Supported Hardware

- **Apple Silicon:**
    - M1, M1 Pro, M1 Max, M1 Ultra
    - M2, M2 Pro, M2 Max, M2 Ultra
    - M3, M3 Pro, M3 Max
    - All future M-series chips

### Installation

1. **Ensure macOS is up to date:**
    - macOS 12.3 (Monterey) or later required
    - macOS 13+ (Ventura) recommended for best performance

2. **Install Python 3.10:**

    ```bash
    # Using Homebrew
    brew install python@3.10
    ```

3. **Create virtual environment and install dependencies:**

    ```bash
    python3.10 -m venv .venv
    source .venv/bin/activate
    pip install --upgrade pip

    # Install PyTorch with MPS support
    pip install torch torchvision torchaudio

    # Install project dependencies
    pip install -r requirements.txt
    ```

4. **Launch LightDiffusion-Next:**

    ```bash
    streamlit run streamlit_app.py --server.address=0.0.0.0 --server.port=8501
    ```

### MPS-Specific Features

#### Automatic Detection

MPS is automatically detected and enabled on compatible hardware:

```
Device: mps
VAE dtype: torch.float16
Set vram state to: SHARED
```

#### Memory Management

- **Unified Memory:** Apple Silicon uses unified memory shared between CPU and GPU
- **VRAM State:** Automatically set to `SHARED` mode
- **Cache Management:** Uses `torch.mps.empty_cache()` for memory cleanup

#### Precision Support

- **FP16:** Fully supported and recommended (default)
- **FP32:** Supported but slower
- **BF16:** Not supported on MPS backend

#### Attention Mechanisms

| Feature | MPS Support | Notes |
|---------|-------------|-------|
| PyTorch Scaled Dot-Product Attention (SDPA) | ✅ Yes | Default and recommended |
| PyTorch Flash Attention | ❌ No | Not available on MPS |
| xformers | ❌ No | MPS backend not supported |
| SageAttention | ❌ No | CUDA-only kernels, incompatible with MPS |
| SpargeAttn | ❌ No | CUDA-only kernels |

**Recommendation:** Use PyTorch's built-in attention (SDPA) on MPS. It's well-optimized for Apple Silicon.

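The detection, precision, and cache-cleanup behaviour described in this section can be sketched in a few lines of Python. This is a minimal illustration and not LightDiffusion-Next's actual code: `pick_mps_settings` and `free_mps_cache` are hypothetical helper names, and the sketch assumes a fallback to CPU with FP32 when PyTorch or MPS is unavailable.

```python
import importlib.util


def pick_mps_settings():
    """Return (device, dtype_name): ("mps", "float16") if MPS is usable, else ("cpu", "float32")."""
    if importlib.util.find_spec("torch") is None:
        return "cpu", "float32"  # PyTorch is not installed
    import torch

    # torch.backends.mps is missing on very old PyTorch builds, so look it up defensively.
    mps = getattr(torch.backends, "mps", None)
    if mps is not None and mps.is_available():
        return "mps", "float16"  # FP16 is the recommended default on MPS
    return "cpu", "float32"


def free_mps_cache(device):
    """Release cached MPS memory; return True only if a cache clear was issued."""
    if device != "mps":
        return False
    import torch

    # torch.mps.empty_cache() is only present in newer PyTorch releases.
    if hasattr(torch, "mps") and hasattr(torch.mps, "empty_cache"):
        torch.mps.empty_cache()
        return True
    return False


device, dtype_name = pick_mps_settings()
print(f"Device: {device}, dtype: {dtype_name}")
free_mps_cache(device)
```

Because memory is unified, clearing the cache returns memory to the same pool the rest of the system uses, so calling a helper like this between large generations may ease memory pressure.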

### Performance Tips

- **Use FP16 precision:**
    - MPS works best with FP16
    - Automatically enabled by LightDiffusion-Next
- **Optimize batch sizes:**
    - Start with smaller batch sizes and increase gradually
    - Monitor memory usage through Activity Monitor
- **Keep macOS updated:**
    - Apple regularly improves MPS performance in system updates
- **Close unnecessary applications:**
    - Unified memory is shared with system processes
    - Free up RAM for better GPU performance
- **Monitor GPU usage:**

    ```bash
    # Use Activity Monitor -> GPU tab
    # Or use powermetrics (requires sudo):
    sudo powermetrics --samplers gpu_power -i 1000
    ```

### Known Limitations

- **Non-blocking transfers:** Not supported; MPS operations are blocking
- **Advanced optimizations:** SageAttention, SpargeAttn, and xformers are not available
- **BF16:** Not supported on MPS backend
- **Memory pressure:** System may swap under high memory load due to unified architecture

### Unified Memory Considerations

Apple Silicon's unified memory architecture means:

- GPU and CPU share the same physical memory pool
- Less memory copying between devices
- System processes compete for the same memory
- Available VRAM depends on total system RAM and current usage

**Recommended RAM:**

- 16 GB: SD1.5 models at moderate resolutions
- 32 GB: Comfortable for most workflows including Flux (with quantization)
- 64 GB+: Professional workflows with large batch sizes

---

## Comparison Table

| Feature | NVIDIA (CUDA) | AMD (ROCm) | Apple (MPS) |
|---------|---------------|------------|-------------|
| FP16 | ✅ Full | ✅ Full | ✅ Full |
| BF16 | ✅ Full | ✅ RDNA3+/CDNA2+ | ❌ No |
| PyTorch SDPA | ✅ Yes | ✅ Yes | ✅ Yes |
| Flash Attention | ✅ Yes | ✅ RDNA3+/CDNA2+ | ❌ No |
| xformers | ✅ Yes | ✅ Build from source | ❌ No |
| SageAttention | ✅ Yes | ❌ No | ❌ No |
| SpargeAttn | ✅ Yes (CC 8.0-9.0) | ❌ No | ❌ No |
| Stable-Fast | ✅ Yes | ⚠️ Limited | ❌ No |
| Memory Management | ✅ Dedicated VRAM | ✅ Dedicated VRAM | ⚠️ Unified Memory |

---

## Troubleshooting

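Before working through the platform-specific issues, it can help to capture what PyTorch itself reports about the machine. The following is a minimal sketch under stated assumptions: `backend_report` is a hypothetical helper that is not part of LightDiffusion-Next, and it relies only on standard PyTorch attributes (`torch.version.hip`, `torch.cuda.is_available()`, `torch.backends.mps.is_available()`).

```python
import importlib.util
import platform


def backend_report():
    """Collect basic platform and PyTorch backend facts as a dict."""
    report = {
        "os": platform.system(),
        "machine": platform.machine(),
        "torch": None,            # stays None when PyTorch is not installed
        "hip": None,              # ROCm/HIP version string on ROCm builds
        "cuda_available": False,  # also True on working ROCm setups
        "mps_available": False,
    }
    if importlib.util.find_spec("torch") is None:
        return report
    import torch

    report["torch"] = torch.__version__
    report["hip"] = getattr(torch.version, "hip", None)
    report["cuda_available"] = torch.cuda.is_available()
    mps = getattr(torch.backends, "mps", None)
    report["mps_available"] = bool(mps is not None and mps.is_available())
    return report


for key, value in backend_report().items():
    print(f"{key}: {value}")
```

On a working ROCm setup, `hip` should be a version string and `cuda_available` should be `True`; on Apple Silicon, `mps_available` should be `True`. If neither holds, the installed PyTorch build is the first thing to check.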

### ROCm Issues

**Problem:** PyTorch doesn't detect ROCm GPU

```bash
# Check ROCm installation
rocm-smi
rocminfo | grep "Name:"

# Verify PyTorch sees GPU
python -c "import torch; print(torch.cuda.is_available()); print(torch.version.hip)"
```

**Problem:** Out of memory errors

- Reduce batch size
- Enable lower VRAM mode in settings
- Close other GPU-using applications
- Check with `rocm-smi` for memory usage

**Problem:** Slow performance

- Verify you're using the correct ROCm-optimized PyTorch build
- Check GPU utilization with `rocm-smi`
- Ensure FP16 or BF16 is enabled (check logs)

### MPS Issues

**Problem:** MPS not detected

```bash
# Verify MPS support
python -c "import torch; print(torch.backends.mps.is_available())"
```

- Ensure macOS 12.3+
- Update to latest macOS version
- Reinstall PyTorch

**Problem:** Memory warnings or crashes

- Reduce batch size
- Close other applications to free unified memory
- Check Activity Monitor for memory pressure

**Problem:** Slower than expected performance

- Verify FP16 is being used (check logs)
- Close background applications
- Update to latest macOS version for performance improvements
- Some models may be CPU-bound on older M1 chips

---

## Getting Help

For platform-specific issues:

1. Check the [FAQ](faq.md) for common questions
2. Review PyTorch's platform-specific documentation:
    - [ROCm installation](https://pytorch.org/get-started/locally/#linux-rocm)
    - [MPS backend](https://pytorch.org/docs/stable/notes/mps.html)
3. Open an issue on GitHub with:
    - Platform details (GPU model, driver version, OS)
    - LightDiffusion-Next startup logs
    - Output of `python -c "import torch; print(torch.__version__); print(getattr(torch.version, 'hip', None) or 'not a ROCm build'); print(torch.cuda.is_available())"`

---

**Note:** This documentation reflects the current state of ROCm and MPS support in PyTorch and LightDiffusion-Next. As these platforms mature, more optimizations and features may become available.