MagiCompiler
An engineering-oriented compiler and execution augmentation library for PyTorch 2.8+, providing module-level compilation decorators, backend adapters, graph partitioning strategies, readable and reusable compile artifacts, and tightly integrated runtime scheduling for any inference engine. The design goal is to systematically expose capabilities of PyTorch Dynamo / AOTAutograd / Inductor / Triton while prioritizing correctness, stability, observability, and maintainability.
Design Overview
- Compilation entrypoint: the @magi_compile decorator augments nn.Module.forward with compilation, including dynamic-shape annotation and argument validation.
- Partitioning and passes: configurable graph partitioning and pass management (e.g., InductorPass, PostGradPassManager) for fusion, kernel generation, and tuning.
- Artifact system: persists compile artifacts using a Python file/directory layout for readability, auditability, and portability (see “Compile Cache Overview”).
- Configuration: CompileConfig centralizes backend, partition rules, cache root, runtime shapes, and other key parameters.
Key Features
- Dynamic-shape annotations:
  - Automatic inference: when a forward parameter is annotated as torch.Tensor or torch.Tensor | None, dimension 0 is treated as dynamic by default.
  - Explicit specification: use @magi_compile(dynamic_arg_dims={...}) to mark dimensions (negative indices supported).
  - Consistency constraints: parameters that alternately appear as None and non-None across the model lifetime cannot be captured into the same computation graph.
- Backend selection and standalone compilation:
  - inductor mode defaults to PyTorch 2.8+ standalone_compile, producing reusable artifacts.
  - eager mode is available for debugging or fallback paths.
- Partitioning and passes: operator-set-driven partition rules and pass contexts that stabilize subgraph boundaries and kernel generation across runtime shapes.
- Readable, portable artifacts: structured directories with Python files for quick triage and cross-environment debugging.
- Engine integration: the decorator reads engine-level CompileConfig to stay aligned with distributed/scheduling components.
Installation and Requirements
- Python ≥ 3.10
- PyTorch ≥ 2.8 (with torch._inductor.standalone_compile available)
- Recommended to be used within the Athena environment, together with its dependencies and distributed components (e.g., CUDA Graph manager).
For local development, install in editable mode:
pip install -e . --no-build-isolation --config-settings editable_mode=compat
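The requirements listed above can be checked before installing. The following is a minimal sketch, assuming only the two conditions the list states (Python 3.10 or newer, and a PyTorch build exposing `torch._inductor.standalone_compile`). `check_requirements` is a hypothetical helper, not part of MagiCompiler.

```python
import importlib
import importlib.util
import sys


def check_requirements() -> dict[str, bool]:
    """Report whether the documented prerequisites appear to be met."""
    report = {
        "python_ok": sys.version_info >= (3, 10),
        "torch_installed": importlib.util.find_spec("torch") is not None,
        "standalone_compile": False,
    }
    if report["torch_installed"]:
        try:
            inductor = importlib.import_module("torch._inductor")
            report["standalone_compile"] = hasattr(inductor, "standalone_compile")
        except Exception:
            # A broken or partial PyTorch install is reported as "not available".
            pass
    return report


if __name__ == "__main__":
    for name, ok in check_requirements().items():
        print(f"{name}: {'ok' if ok else 'missing'}")
```

The attribute check avoids parsing version strings, which differ between release, nightly, and vendor builds.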
Quick Start
Minimal Example (automatic dynamic-dim inference)
import torch
from torch import nn
from magi_compiler.decorator import magi_compile

@magi_compile
class MyModel(nn.Module):
    def __init__(self, *, model_config):
        super().__init__()
        self.linear = nn.Linear(10, 5)

    def forward(self, x: torch.Tensor, y: torch.Tensor | None) -> torch.Tensor:
        if y is not None:
            return self.linear(x + y)
        return self.linear(x)

# In Athena, model_config is typically provided by the engine
model = MyModel(model_config=...)
out1 = model(torch.randn(4, 10), torch.randn(4, 10))
out2 = model(torch.randn(8, 10), None)  # dynamic batch dimension
Explicit Dynamic-Dim Specification
@magi_compile(dynamic_arg_dims={"x": -1})  # mark the last dimension as dynamic
class DynamicDimModel(nn.Module):
    def __init__(self, *, model_config):
        super().__init__()
        self.proj = nn.Linear(16, 16, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x has shape (16, T). nn.Linear acts on the last dimension, which must
        # stay 16, so move T out of the way before projecting and restore it after.
        return self.proj(x.transpose(-1, -2)).transpose(-1, -2)

m = DynamicDimModel(model_config=...)
_ = m(torch.randn(16, 2))
_ = m(torch.randn(16, 32))  # the last dimension is allowed to vary
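Negative indices such as `-1` have to be resolved against each argument's rank before they can be compared with concrete shapes. The sketch below shows one way to do that. `normalize_dynamic_dims` is a hypothetical helper, and accepting either a single integer or a list of integers per argument is an assumption, not a documented property of `dynamic_arg_dims`.

```python
def normalize_dynamic_dims(dynamic_arg_dims, shapes):
    """Resolve negative dimension indices against each argument's rank.

    dynamic_arg_dims: maps argument name to an int or a list of ints.
    shapes: maps argument name to the shape of an example input.
    """
    resolved = {}
    for name, dims in dynamic_arg_dims.items():
        rank = len(shapes[name])
        dims_list = [dims] if isinstance(dims, int) else list(dims)
        normalized = []
        for dim in dims_list:
            if not -rank <= dim < rank:
                raise IndexError(
                    f"dimension {dim} is out of range for '{name}' with rank {rank}"
                )
            normalized.append(dim % rank)  # -1 becomes rank - 1
        resolved[name] = normalized
    return resolved


print(normalize_dynamic_dims({"x": -1}, {"x": (16, 32)}))  # {'x': [1]}
```

Rejecting out-of-range indices early gives a clearer error than a failure deep inside graph capture.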
Configuration and Modes
- CompileConfig: centralizes compile parameters (backend, cache paths, partition strategy, dynamic shapes, traced files, etc.).
- CompileMode: typical setting is CompileMode.TORCH_COMPILE.
- Backends:
  - inductor: uses standalone_compile to produce reusable artifacts, ideal for production deployments.
  - eager: convenient for rapid debugging or as a fallback.
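To make the idea of a centralized configuration object concrete, here is an illustrative sketch. It does not reproduce MagiCompiler's actual `CompileConfig`: the class name `SketchCompileConfig`, its field names, and its defaults are all assumptions chosen for illustration. Only the two backend names, inductor and eager, come from the list above.

```python
from dataclasses import dataclass, field

# Backend names taken from the documentation above.
_KNOWN_BACKENDS = ("inductor", "eager")


@dataclass(frozen=True)
class SketchCompileConfig:
    """Hypothetical stand-in for a centralized compile configuration."""

    backend: str = "inductor"      # assumed default
    cache_root: str = "cache"      # assumed field name and default
    dynamic_arg_dims: dict = field(default_factory=dict)  # assumed field

    def __post_init__(self):
        # Fail fast on typos instead of silently falling back to another backend.
        if self.backend not in _KNOWN_BACKENDS:
            raise ValueError(
                f"unknown backend {self.backend!r}; expected one of {_KNOWN_BACKENDS}"
            )


debug_config = SketchCompileConfig(backend="eager")
print(debug_config)
```

Freezing the dataclass keeps the configuration immutable once the engine has handed it to the decorator, which makes cache keys derived from it stable.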
Architecture and Execution Flow (Brief)
- @magi_compile wraps nn.Module:
  - infers/validates dynamic_arg_dims;
  - extends MRO by injecting MagiCompilerBase;
  - reads engine-level CompileConfig in MagiCompiler.
- CompilerManager:
  - defines cache keys using (runtime_shape, graph_index, backend);
  - dispatches to backends via CompilerInterface (InductorStandaloneAdaptor or EagerAdaptor);
  - applies partition rules and pass contexts within compile_context(...);
  - serializes compile artifacts into a human-readable directory structure.
- Monitoring and statistics:
  - counters and timestamps report per-shape/per-subgraph latencies and milestones.
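The `(runtime_shape, graph_index, backend)` cache key described above can be pictured as a plain dictionary lookup. The sketch below is an assumption-laden illustration, not the `CompilerManager` code: `ArtifactIndex` and `artifact_dir_name` are hypothetical names, and the directory naming simply mirrors the `artifact_shape_None_subgraph_i/` pattern shown in the Compile Cache Overview.

```python
from pathlib import Path


def artifact_dir_name(runtime_shape, graph_index) -> str:
    """Build a directory name such as 'artifact_shape_None_subgraph_0'."""
    return f"artifact_shape_{runtime_shape}_subgraph_{graph_index}"


class ArtifactIndex:
    """Hypothetical index from cache keys to artifact directories."""

    def __init__(self, root):
        self.root = Path(root)
        self._entries = {}

    def register(self, runtime_shape, graph_index, backend) -> Path:
        key = (runtime_shape, graph_index, backend)
        path = self.root / artifact_dir_name(runtime_shape, graph_index)
        self._entries[key] = path
        return path

    def lookup(self, runtime_shape, graph_index, backend):
        """Return the registered path, or None on a cache miss."""
        return self._entries.get((runtime_shape, graph_index, backend))


index = ArtifactIndex("cache/torch_compile_cache/bfa0df33ea/rank_0/backbone")
index.register(None, 0, "inductor_standalone")
print(index.lookup(None, 0, "inductor_standalone").name)
# artifact_shape_None_subgraph_0
```

Keeping the backend in the key means artifacts from different backends never shadow each other, even when they are built for the same subgraph and shape.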
Compile Cache Overview
This document summarizes the cache files generated by torch.compile (TorchDynamo + TorchInductor + AOTAutograd + Triton). Reference path: cache/.
Directory Layout (Tree)
cache/
├─ depyf/
│  └─ rank_0/
│     ├─ __transformed_code_0_for_forward.py
│     ├─ decompiled_code.py
│     ├─ full_code_for_forward_0.py
│     ├─ __compiled_fn_1.BEFORE_PRE_GRAD.{0..N}.py
│     ├─ __compiled_fn_1.kernel_{0..K}.py
│     ├─ __compiled_fn_1.__compiled_fn_1_<uuid>.0.py
│     ├─ __compiled_fn_1.Before_split.0.py
│     ├─ __compiled_fn_1.After_split.0.py
│     ├─ __compiled_fn_1.pre_split_module.0.py
│     ├─ __compiled_fn_1.post_split_module.0.py
│     └─ __compiled_fn_1.pre_insert_deferred_runtime_asserts__<uuid>.0.py
│
└─ torch_compile_cache/
   └─ bfa0df33ea/                  # Hash for graph + compile options + device, etc.
      └─ rank_0/                   # Rank id in distributed/multi-GPU runs
         └─ backbone/
            ├─ computation_graph.py
            ├─ magi_compile_cache.py
            ├─ artifact_shape_None_subgraph_0/
            ├─ artifact_shape_None_subgraph_1/
            ├─ ...
            └─ artifact_shape_None_subgraph_30/
               ├─ ir/
               │  └─ *.py          # Python/Triton kernels generated by Inductor for this subgraph
               ├─ fxgraph/
               │  └─ */*/<hash>    # Binary FX IR/metadata snapshots (not human-readable)
               ├─ aotautograd/
               │  └─ */*/<hash>    # AOTAutograd partition/capture metadata and artifacts
               ├─ 44/
               │  └─ *.py          # Other sharded/generated code buckets
               └─ ...              # Structure may vary slightly across subgraphs
What Each File/Dir Is For
- cache/depyf/ (TorchDynamo/Depyf debug exports)
  - __transformed_code_0_for_forward.py: The Dynamo-transformed forward code (diff-friendly view of pre/post transformation).
  - decompiled_code.py: Decompiled snapshot to help map traced graphs back to original Python.
  - full_code_for_forward_0.py: A more complete expanded forward for inspection.
  - __compiled_fn_1.BEFORE_PRE_GRAD.{i}.py: Intermediate wrapper snapshots at specific compile stages (e.g., before autodiff).
  - __compiled_fn_1.kernel_{k}.py: Entrypoints/wrappers for kernels generated at various stages.
  - Before_split / After_split / pre_split_module / post_split_module: Intermediate forms around graph partitioning.
  - pre_insert_deferred_runtime_asserts__*.py: Snapshot before inserting deferred runtime assertions (dynamic shapes/guards).
- cache/torch_compile_cache/ (TorchInductor artifacts)
  - bfa0df33ea/: Namespace keyed by a hash of model structure, compile settings, and device info.
  - rank_0/: Bucket per process rank for distributed runs.
  - backbone/:
    - computation_graph.py: Full model FX GraphModule with symbolic dims; shared across subgraph kernels.
    - magi_compile_cache.py:
      - Maps subgraph indices to artifact directories, e.g. (None, i, 'inductor_standalone') -> artifact_shape_None_subgraph_i/.
      - Registers and asynchronously compiles Triton kernels via AsyncCompile.triton(...), including autotune metadata, device properties, scheduling hints, etc.
    - artifact_shape_None_subgraph_{N}/:
      - ir/*.py: Inductor-generated Python/Triton kernels and scheduling code for this subgraph (readable).
      - fxgraph/*/*/<hash>: FX IR/metadata snapshots for fast graph reconstruction (binary; do not edit).
      - aotautograd/*/*/<hash>: AOTAutograd partitions/captures and replay requirements.
      - Additional hashed/prefixed buckets (e.g., 44/, o5/, 55/, br/) containing generated operator/subtask code.
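Because the artifacts are ordinary files and directories, they can be inspected with a few lines of standard-library Python. The sketch below walks a cache root and summarizes each artifact_shape_*_subgraph_* directory. It assumes the `<hash>/rank_<id>/<model>/` nesting shown in the tree above, and `list_artifacts` is a hypothetical helper rather than a MagiCompiler API.

```python
import re
from pathlib import Path

# Matches names such as "artifact_shape_None_subgraph_30".
_ARTIFACT_RE = re.compile(r"^artifact_shape_(?P<shape>.+)_subgraph_(?P<index>\d+)$")


def list_artifacts(cache_root):
    """Return (model, shape, subgraph_index, kernel_file_count) tuples."""
    root = Path(cache_root) / "torch_compile_cache"
    if not root.is_dir():
        return []
    found = []
    for path in root.glob("*/rank_*/*/artifact_shape_*_subgraph_*"):
        match = _ARTIFACT_RE.match(path.name)
        if match is None or not path.is_dir():
            continue
        kernel_files = len(list((path / "ir").glob("*.py")))
        found.append(
            (path.parent.name, match["shape"], int(match["index"]), kernel_files)
        )
    # Sort numerically so that subgraph 10 follows subgraph 9, not subgraph 1.
    return sorted(found, key=lambda row: (row[0], row[2]))


if __name__ == "__main__":
    for model, shape, idx, kernels in list_artifacts("cache"):
        print(f"{model}: subgraph {idx} (shape={shape}) has {kernels} kernel file(s)")
```

A summary like this is a quick way to spot subgraphs that produced no kernels or an unexpectedly large number of them.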
FAQ
- How are these caches produced?
  - At runtime by torch.compile(...), after TorchDynamo tracing, AOTAutograd partitioning, TorchInductor lowering/fusion, and Triton codegen.
- Will they change across runs?
  - Yes. Different input shapes, env vars, device info, or compile options can produce different hash namespaces (e.g., a new bfa0df33ea).
- Is it safe to delete them?
  - Yes. You can delete cache/. It will be rebuilt on demand; the next run will be slower due to recompilation.
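Since the cache can be deleted safely, a small helper can report how much space it occupies before removing it. This is a hedged sketch using only the standard library. `clear_compile_cache` is a hypothetical name, and the default of a dry run is a design choice made here, not behavior provided by MagiCompiler.

```python
import shutil
from pathlib import Path


def clear_compile_cache(cache_root, dry_run=True) -> int:
    """Return the cache size in bytes and delete it unless dry_run is set."""
    root = Path(cache_root)
    if not root.is_dir():
        return 0  # nothing to delete
    size = sum(p.stat().st_size for p in root.rglob("*") if p.is_file())
    if not dry_run:
        shutil.rmtree(root)  # rebuilt on demand by the next compilation
    return size


if __name__ == "__main__":
    freed = clear_compile_cache("cache", dry_run=True)
    print(f"cache/ currently holds {freed} bytes")
```

Running with `dry_run=True` first makes it easy to confirm that the path points at the compile cache and not at another directory.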
Compatibility and Recommendations
- Prefer official PyTorch ≥ 2.8 builds to ensure standalone_compile availability.
- For highly dynamic models, explicitly mark key dynamic dimensions to improve graph capture and cache reuse.
Acknowledgments
This library builds upon capabilities of PyTorch Dynamo, AOTAutograd, Inductor, and Triton, and incorporates engineering practices and interface designs inspired by the vLLM community. We thank the relevant open-source communities and contributors.