## Latest News
- **[03/25/2026]** LightX2V-MagiCompiler is now available! This fork of LightX2V shows how to seamlessly integrate MagiCompiler into a SOTA framework: with minimal code changes, it unlocks even greater acceleration. Try it out, check the benchmark for details, and stay tuned for more integration demos!
- **[03/23/2026]** MagiCompiler is officially open-sourced! It delivers whole-graph compilation for multi-modality inference and FSDP-aware whole-layer compilation for large-model training.
## About
MagiCompiler is an advanced compiler and runtime-augmentation framework built on top of `torch.compile`. Designed specifically for large-scale Transformer-like architectures, it addresses the critical bottlenecks of memory walls and operator overhead.
By stepping beyond traditional local operator optimization, MagiCompiler introduces system-level optimizations, seamlessly accelerating both training and multi-modality inference workloads with minimal code intrusion.
## Design Philosophy
### Compiler as Manager

> "Reimagining the compiler: from generating kernels to orchestrating the entire dataflow."
MagiCompiler's core philosophy is Compiler as Manager. We believe a modern deep learning compiler should not be restricted to mere kernel fusion. Instead, it acts as a global manager that owns the full lifecycle of execution. MagiCompiler actively manages subgraph dispatching, dynamically orchestrates dataflow (like offloading and prefetching), and controls memory allocation, ensuring optimal balance between compute efficiency and memory footprint.
## Key Features
### 1. Unified Inference & Training
Tailored for Transformer-like architectures with scenario-specific strategies:
- Inference: Achieves full-graph capture across Transformer boundaries, maximizing the kernel-fusion scope.
- Training: Introduces FSDP-aware layer-wise compilation, unlocking aggressive cross-op fusion while keeping distributed parameter sharding entirely transparent.
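As a rough illustration of the layer-wise idea (a hand-rolled sketch, not MagiCompiler's internal mechanism): each Transformer layer is compiled as its own unit, so per-layer distributed wrapping such as FSDP stays outside the compiled region. `backend="eager"` is used here only to keep the sketch free of codegen dependencies.

```python
import torch

# Sketch: compile each layer as its own unit so per-layer parameter
# sharding (e.g. FSDP wrapping) remains outside the compiled region.
# backend="eager" runs Dynamo graph capture without backend codegen.
layers = torch.nn.ModuleList(torch.nn.Linear(16, 16) for _ in range(2))
for i in range(len(layers)):
    layers[i] = torch.compile(layers[i], backend="eager")
    # In a real FSDP setup, each compiled layer would then be wrapped:
    # layers[i] = FullyShardedDataParallel(layers[i])

x = torch.randn(4, 16)
for layer in layers:
    x = layer(x)
```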
### 2. Easy to Use, Free Gain, Plug and Play
No complex model refactoring needed: just two decorators deliver extra speedups of 20% or more out-of-the-box, integrating seamlessly into SOTA multi-modality frameworks.
### 3. Smart Asynchronous Offloading
For memory-constrained setups, our built-in selective offloading policy perfectly overlaps H2D transfers with computation, eliminating pipeline bubbles.
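The overlap pattern can be sketched with plain CUDA streams. This is an illustrative analogue, not MagiCompiler's actual offloading policy; `prefetch_layers` is a hypothetical helper name, and the CPU branch is only a fallback so the sketch runs anywhere.

```python
import torch

def prefetch_layers(cpu_weights, device):
    """Copy layer weights host-to-device on a side stream so the H2D
    transfers can overlap with compute on the default stream."""
    if device.type == "cuda":
        copy_stream = torch.cuda.Stream()
        gpu_weights = []
        with torch.cuda.stream(copy_stream):
            for w in cpu_weights:
                # non_blocking copies from pinned memory overlap with compute
                gpu_weights.append(w.pin_memory().to(device, non_blocking=True))
        # Make the default stream wait before consuming the prefetched weights.
        torch.cuda.current_stream().wait_stream(copy_stream)
        return gpu_weights
    # CPU fallback: plain synchronous copies
    return [w.clone() for w in cpu_weights]

weights = [torch.randn(64, 64) for _ in range(3)]
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
moved = prefetch_layers(weights, device)
```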
### 4. Heuristic Activation Recomputation
Say goodbye to manual `torch.utils.checkpoint`. MagiCompiler automatically saves compute-bound ops (e.g., MatMul, Attention) and recomputes memory-bound ones, slashing peak memory without sacrificing throughput.
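For comparison, here is the manual pattern this feature is said to automate, using stock `torch.utils.checkpoint`. The split below (the linear projection is saved; the cheap elementwise chain is recomputed in backward) is our illustrative choice, not MagiCompiler's heuristic.

```python
import torch
from torch.utils.checkpoint import checkpoint

class Block(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.proj = torch.nn.Linear(32, 32)

    def _memory_bound(self, x):
        # Cheap elementwise chain: recompute in backward instead of
        # storing its intermediate activations.
        return torch.nn.functional.gelu(x) * torch.sigmoid(x)

    def forward(self, x):
        x = self.proj(x)  # compute-bound: activations saved normally
        return checkpoint(self._memory_bound, x, use_reentrant=False)

blk = Block()
out = blk(torch.randn(4, 32))
out.sum().backward()
```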
### 5. Magi Depyf Introspection
Meet `magi_depyf`, MagiCompiler's native introspection toolkit. Compilation timelines, decompiled bytecode flows, split subgraphs, and backend artifacts are automatically dumped into the cache path as organized, human-readable files for easier debugging.
## Installation
Requirements:
- Python >= 3.12
- PyTorch >= 2.9
- CUDA Toolkit
For reproducibility, we recommend starting from the prebuilt Docker image and running the examples inside the container.
```shell
# Option A (recommended): use the prebuilt image

# Step 1: pull the image
docker pull sandai/magi-compiler:latest

# Step 2: start the container
docker run --name my-magi-compiler -it -d --privileged --gpus all --network host --ipc host \
    -v /path/on/host:/workspace sandai/magi-compiler:latest /bin/bash

# Step 3: attach to the container
docker exec -it my-magi-compiler /bin/bash
```
```shell
# Option B: local source installation

# Step 1: clone the repo
git clone https://github.com/SandAI-org/MagiCompiler.git
cd MagiCompiler

# Step 2: system dependencies (optional, for FX graph visualization; Debian/Ubuntu)
sudo apt update && sudo apt install -y graphviz

# Step 3: Python dependencies
pip install -r requirements.txt

# Step 4: install MagiCompiler (pick one)
pip install .  # end users (recommended)
# pip install -e . --no-build-isolation --config-settings editable_mode=compat  # developer / editable
```
## Quick Start
### 1. One Decorator to Rule Them All (`@magi_compile`)
Remove scattered `torch.compile` or `torch.compiler.disable` calls. Decorate your core Transformer block once for automatic full-graph capture and dynamic-shape support (defaulting to dim 0).
```python
import torch
from torch import nn

from magi_compiler import magi_compile

# Decorate your core module once. No more scattered compile tweaks!
# (Attention and MLP here stand in for your own submodules.)
@magi_compile
class TransformerBlock(nn.Module):
    def __init__(self, hidden_dim):
        super().__init__()
        self.attn = Attention(hidden_dim)
        self.mlp = MLP(hidden_dim)

    def forward(self, x: torch.Tensor, mask: torch.Tensor | None) -> torch.Tensor:
        x = x + self.attn(x, mask)
        x = x + self.mlp(x)
        return x

model = TransformerBlock(hidden_dim=1024).cuda()

# Execute normally: whole-graph compilation handles dynamic batches automatically!
out = model(torch.randn(4, 128, 1024, device="cuda"), None)
out = model(torch.randn(8, 128, 1024, device="cuda"), None)
```
### 2. Bridge Custom Kernels (`@magi_register_custom_op`)
Using custom kernels (FlashAttention, MoE routers) that break FX tracing? Don't disable compilation. Wrap them to teach the compiler how to handle them during graph partitioning and recomputation.
```python
import torch

from magi_compiler import magi_register_custom_op

@magi_register_custom_op(
    name="athena::flash_attn",
    infer_output_meta_fn=["q"],   # output shape matches parameter 'q'
    is_subgraph_boundary=True,    # split the graph here for subgraph compilation
    is_compute_sensitive=True,    # retain this output during recomputation
)
def flash_attn(q: torch.Tensor, k: torch.Tensor, v: torch.Tensor) -> torch.Tensor:
    ...  # your custom kernel or C++ extension
```
### 3. Advanced Configurations
Explore `magi_compiler/config.py` for power-user features such as custom backend toggles and fine-grained memory management. (Comprehensive guides for popular training/inference frameworks are coming soon!)
## Benchmark
### H100 Extreme Acceleration
On a single NVIDIA H100, MagiCompiler outperforms current SOTA solutions (like LightX2V) by 9% to 26% across mainstream open-source video generation models.
### RTX 5090 Near Real-Time
Thanks to our underlying JIT offloading engine, daVinci-MagiHuman achieves near real-time speeds, even on heavily VRAM-constrained consumer GPUs.
## Roadmap
We are actively developing MagiCompiler. Here is a glimpse into our upcoming milestones:
- Ecosystem Integration: Benchmarks and out-of-the-box integration guides for popular frameworks (e.g., sglang-diffusion, vllm-omni, and LLaMA training).
- Official Hub & Tech Blog: A dedicated website for advanced tutorials, documentation, and frontier engineering insights.
- Hardware-Aware Auto-Scheduler: An adaptive engine that dynamically orchestrates optimal strategies (auto-recomputation boundaries, offloading) based on your hardware constraints.
- Next-Gen Custom Backend (v2.0): Pushing hardware limits with extreme kernel-level efficiency, native distributed communication and MegaKernels.
## Citation
If you find MagiCompiler useful in your research or production, please consider citing us:
```bibtex
@software{magi_compiler_2026,
  author = {Hongyu Jia and Zhiyao Cen and Taoran Wang and Yunbo Zhang},
  title  = {MagiCompiler: Break the Boundaries of Local Compilation for Large Models},
  year   = {2026},
  url    = {https://github.com/SandAI-org/MagiCompiler}
}
```
## Acknowledgement
MagiCompiler is deeply inspired by and builds upon the shoulders of giants. We extend our heartfelt gratitude to the PyTorch team for their foundational work on `torch.compile` and `torch.fx`, and to the vLLM community for their pioneering contributions to large model inference.
We are moving fast, and we want you on board! MagiCompiler is under rapid development. If you are passionate about pushing the limits of large model compilation, we'd love to have you with us. From opening issues and discussing architectures to submitting core PRs, every contribution matters. Let's engineer the future of AI infrastructure together!
## Star History
## License
This project is licensed under the Apache License 2.0.