MagiCompiler

An engineering-oriented compiler and execution augmentation library for PyTorch 2.8+, providing module-level compilation decorators, backend adapters, graph partitioning strategies, readable and reusable compile artifacts, and tightly integrated runtime scheduling for any inference engine. The design goal is to systematically expose capabilities of PyTorch Dynamo / AOTAutograd / Inductor / Triton while prioritizing correctness, stability, observability, and maintainability.

Design Overview

  • Compilation entrypoint: the @magi_compile decorator augments nn.Module.forward with compilation, including dynamic-shape annotation and argument validation.
  • Partitioning and passes: configurable graph partitioning and pass management (e.g., InductorPass, PostGradPassManager) for fusion, kernel generation, and tuning.
  • Artifact system: persists compile artifacts using a Python file/directory layout for readability, auditability, and portability (see “Compile Cache Overview”).
  • Configuration: CompileConfig centralizes backend, partition rules, cache root, runtime shapes, and other key parameters.
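As a rough illustration of operator-set-driven partitioning, the sketch below splits a linear sequence of operator names wherever an operator belongs to a "splitting" set, so each splitting operator ends up in its own subgraph. The function name partition_by_ops and the operator names are hypothetical; MagiCompiler's actual partition rules operate on FX graphs and may differ.

```python
def partition_by_ops(ops, splitting_ops):
    """Split an op sequence so that each splitting op forms its own segment.

    Hypothetical helper for illustration only; not part of MagiCompiler.
    """
    segments, current = [], []
    for op in ops:
        if op in splitting_ops:
            # Close the segment collected so far, then isolate the splitting op.
            if current:
                segments.append(current)
                current = []
            segments.append([op])
        else:
            current.append(op)
    if current:
        segments.append(current)
    return segments


ops = ["linear", "attention", "linear", "gelu", "attention", "linear"]
segments = partition_by_ops(ops, splitting_ops={"attention"})
for index, segment in enumerate(segments):
    print(index, segment)
```

Keeping the boundaries tied to a fixed operator set means the same model yields the same subgraph indices regardless of input shape, which is what makes per-subgraph caching practical.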

Key Features

  • Dynamic-shape annotations:
    • Automatic inference: when a forward parameter is annotated as torch.Tensor or torch.Tensor | None, dimension 0 is treated as dynamic by default.
    • Explicit specification: use @magi_compile(dynamic_arg_dims={...}) to mark dimensions (negative indices supported).
    • Consistency constraints: parameters that alternately appear as None and non-None across the model lifetime cannot be captured into the same computation graph.
  • Backend selection and standalone compilation:
    • inductor mode defaults to PyTorch 2.8+ standalone_compile, producing reusable artifacts.
    • eager mode is available for debugging or fallback paths.
  • Partitioning and passes: operator-set-driven partition rules and pass contexts that stabilize subgraph boundaries and kernel generation across runtime shapes.
  • Readable, portable artifacts: structured directories with Python files for quick triage and cross-environment debugging.
  • Engine integration: the decorator reads engine-level CompileConfig to stay aligned with distributed/scheduling components.

Installation and Requirements

  • Python ≥ 3.10
  • PyTorch ≥ 2.8 (with torch._inductor.standalone_compile available)
  • Recommended: run inside the Athena environment, together with its dependencies and distributed components (e.g., CUDA Graph manager).

For local development, install in editable mode:

pip install -e . --no-build-isolation --config-settings editable_mode=compat
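Before installing, it can help to verify the requirements listed above. The following is a minimal sketch that checks the Python version, the PyTorch version, and the presence of torch._inductor.standalone_compile; the function name check_environment is hypothetical and the script is not shipped with the library.

```python
import importlib.util
import sys


def check_environment() -> list[str]:
    """Return a list of unmet requirements (empty if everything looks fine)."""
    problems = []
    if sys.version_info < (3, 10):
        problems.append("Python >= 3.10 is required")
    if importlib.util.find_spec("torch") is None:
        problems.append("PyTorch is not installed")
        return problems

    import torch
    import torch._inductor

    # torch.__version__ looks like "2.8.0+cu128"; only major/minor matter here.
    major, minor = (int(part) for part in torch.__version__.split(".")[:2])
    if (major, minor) < (2, 8):
        problems.append(f"PyTorch >= 2.8 is required, found {torch.__version__}")
    if not hasattr(torch._inductor, "standalone_compile"):
        problems.append("torch._inductor.standalone_compile is not available")
    return problems


problems = check_environment()
print("environment OK" if not problems else "\n".join(problems))
```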

Quick Start

Minimal Example (automatic dynamic-dim inference)

import torch
from torch import nn
from magi_compiler.decorator import magi_compile

@magi_compile
class MyModel(nn.Module):
    def __init__(self, *, model_config):
        super().__init__()
        self.linear = nn.Linear(10, 5)

    def forward(self, x: torch.Tensor, y: torch.Tensor | None) -> torch.Tensor:
        if y is not None:
            return self.linear(x + y)
        return self.linear(x)

# In Athena, model_config is typically provided by the engine
model = MyModel(model_config=...)
out1 = model(torch.randn(4, 10), torch.randn(4, 10))
out2 = model(torch.randn(8, 10), None)  # dynamic batch dimension; y switches to None here (see "Consistency constraints" under Key Features)

Explicit Dynamic-Dim Specification

@magi_compile(dynamic_arg_dims={"x": -2})  # mark the second-to-last (sequence) dimension as dynamic
class DynamicDimModel(nn.Module):
    def __init__(self, *, model_config):
        super().__init__()
        self.proj = nn.Linear(16, 16, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.proj(x)

m = DynamicDimModel(model_config=...)
_ = m(torch.randn(2, 8, 16))
_ = m(torch.randn(2, 32, 16))  # the sequence dimension varies; the last dimension must stay 16 for nn.Linear(16, 16)
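Negative indices in dynamic_arg_dims count from the end of the tensor shape, as with ordinary Python and PyTorch indexing. The sketch below shows one way such an index could be resolved against a tensor's rank; resolve_dim is a hypothetical helper, not a MagiCompiler function.

```python
def resolve_dim(dim: int, ndim: int) -> int:
    """Map a possibly negative dimension index to its non-negative equivalent."""
    if not -ndim <= dim < ndim:
        raise IndexError(f"dimension {dim} is out of range for a {ndim}-D tensor")
    return dim % ndim


print(resolve_dim(-1, 2))  # last dimension of a 2-D tensor
print(resolve_dim(-2, 3))  # second-to-last dimension of a 3-D tensor
```

Validating the range up front gives a clear error at decoration or first-call time instead of a harder-to-read failure inside the compiler.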

Configuration and Modes

  • CompileConfig: centralizes compile parameters (backend, cache paths, partition strategy, dynamic shapes, traced files, etc.).
  • CompileMode: typical setting is CompileMode.TORCH_COMPILE.
  • Backends:
    • inductor: uses standalone_compile to produce reusable artifacts, ideal for production deployments.
    • eager: convenient for rapid debugging or as a fallback.

Architecture and Execution Flow (Brief)

  1. @magi_compile wraps nn.Module:
    • infers/validates dynamic_arg_dims;
    • extends MRO by injecting MagiCompilerBase;
    • reads engine-level CompileConfig in MagiCompiler.
  2. CompilerManager:
    • defines cache keys using (runtime_shape, graph_index, backend);
    • dispatches to backends via CompilerInterface (InductorStandaloneAdaptor or EagerAdaptor);
    • applies partition rules and pass contexts within compile_context(...);
    • serializes compile artifacts into a human-readable directory structure.
  3. Monitoring and statistics:
    • counters and timestamps report per-shape/per-subgraph latencies and milestones.
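The cache-key and dispatch steps above can be sketched as a small compile-on-miss cache. This is a toy model under stated assumptions: ToyCompilerManager, its attributes, and the "eager" backend callable are all hypothetical and stand in for CompilerManager and the real adaptors, whose interfaces are not shown in this document.

```python
import time
from collections import Counter
from typing import Any, Callable, Hashable

CacheKey = tuple[Hashable, int, str]  # (runtime_shape, graph_index, backend)


class ToyCompilerManager:
    """Hypothetical stand-in; not MagiCompiler's CompilerManager."""

    def __init__(self, backends: dict[str, Callable[[Any], Any]]):
        self.backends = backends
        self.cache: dict[CacheKey, Any] = {}
        self.counters: Counter = Counter()
        self.compile_seconds: dict[CacheKey, float] = {}

    def compile(self, graph, runtime_shape, graph_index, backend):
        key = (runtime_shape, graph_index, backend)
        if key in self.cache:
            self.counters["cache_hit"] += 1
            return self.cache[key]
        # Cache miss: dispatch to the chosen backend and record the latency.
        start = time.perf_counter()
        compiled = self.backends[backend](graph)
        self.compile_seconds[key] = time.perf_counter() - start
        self.cache[key] = compiled
        self.counters["cache_miss"] += 1
        return compiled


# The "eager" backend here simply returns the callable unchanged.
manager = ToyCompilerManager({"eager": lambda graph: graph})
fn = manager.compile(abs, runtime_shape=None, graph_index=0, backend="eager")
fn_again = manager.compile(abs, runtime_shape=None, graph_index=0, backend="eager")
print(dict(manager.counters))
```

Keying on the full tuple means the same subgraph compiled for a different runtime shape or backend gets its own entry, and the per-key timings give the per-shape/per-subgraph latencies mentioned above.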

Compile Cache Overview

This section summarizes the cache files generated by torch.compile (TorchDynamo + AOTAutograd + TorchInductor + Triton). All paths below are relative to the cache root, cache/.

Directory Layout (Tree)

cache/
├─ depyf/
│  └─ rank_0/
│     ├─ __transformed_code_0_for_forward.py
│     ├─ decompiled_code.py
│     ├─ full_code_for_forward_0.py
│     ├─ __compiled_fn_1.BEFORE_PRE_GRAD.{0..N}.py
│     ├─ __compiled_fn_1.kernel_{0..K}.py
│     ├─ __compiled_fn_1.__compiled_fn_1_<uuid>.0.py
│     ├─ __compiled_fn_1.Before_split.0.py
│     ├─ __compiled_fn_1.After_split.0.py
│     ├─ __compiled_fn_1.pre_split_module.0.py
│     ├─ __compiled_fn_1.post_split_module.0.py
│     └─ __compiled_fn_1.pre_insert_deferred_runtime_asserts__<uuid>.0.py
│
└─ torch_compile_cache/
   └─ bfa0df33ea/                # Hash for graph + compile options + device, etc.
      └─ rank_0/                 # Rank id in distributed/multi-GPU runs
         └─ backbone/
            ├─ computation_graph.py
            ├─ magi_compile_cache.py
            ├─ artifact_shape_None_subgraph_0/
            ├─ artifact_shape_None_subgraph_1/
            ├─ ...
            └─ artifact_shape_None_subgraph_30/
               ├─ ir/
               │  └─ *.py        # Python/Triton kernels generated by Inductor for this subgraph
               ├─ fxgraph/
               │  └─ */*/<hash>  # Binary FX IR/metadata snapshots (not human-readable)
               ├─ aotautograd/
               │  └─ */*/<hash>  # AOTAutograd partition/capture metadata and artifacts
               ├─ 44/
               │  └─ *.py        # Other sharded/generated code buckets
               └─ ...            # Structure may vary slightly across subgraphs

What Each File/Dir Is For

  • cache/depyf/ (TorchDynamo/Depyf debug exports)

    • __transformed_code_0_for_forward.py: The Dynamo-transformed forward code (diff-friendly view of pre/post transformation).
    • decompiled_code.py: Decompiled snapshot to help map traced graphs back to original Python.
    • full_code_for_forward_0.py: A more complete expanded forward for inspection.
    • __compiled_fn_1.BEFORE_PRE_GRAD.{i}.py: Intermediate wrapper snapshots at specific compile stages (e.g., before autodiff).
    • __compiled_fn_1.kernel_{k}.py: Entrypoints/wrappers for kernels generated at various stages.
    • Before_split / After_split / pre_split_module / post_split_module: Intermediate forms around graph partitioning.
    • pre_insert_deferred_runtime_asserts__*.py: Snapshot before inserting deferred runtime assertions (dynamic shapes/guards).
  • cache/torch_compile_cache/ (TorchInductor artifacts)

    • bfa0df33ea/: Namespace keyed by a hash of model structure, compile settings, and device info.
    • rank_0/: Bucket per process rank for distributed runs.
    • backbone/:
      • computation_graph.py: Full model FX GraphModule with symbolic dims; shared across subgraph kernels.
      • magi_compile_cache.py:
        • Maps subgraph indices to artifact directories, e.g. (None, i, 'inductor_standalone') -> artifact_shape_None_subgraph_i/.
        • Registers and asynchronously compiles Triton kernels via AsyncCompile.triton(...), including autotune metadata, device properties, scheduling hints, etc.
      • artifact_shape_None_subgraph_{N}/:
        • ir/*.py: Inductor-generated Python/Triton kernels and scheduling code for this subgraph (readable).
        • fxgraph/*/*/<hash>: FX IR/metadata snapshots for fast graph reconstruction (binary; do not edit).
        • aotautograd/*/*/<hash>: AOTAutograd partitions/captures and replay requirements.
        • Additional hashed/prefixed buckets (e.g., 44/, o5/, 55/, br/) containing generated operator/subtask code.
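The directory naming shown above, artifact_shape_<shape>_subgraph_<index>, is regular enough to enumerate programmatically. The sketch below builds a throwaway directory tree and lists the subgraph artifacts it contains; artifact_dir_name and list_artifacts are hypothetical helpers written for this illustration, and the naming pattern is inferred from the layout in this document rather than from the library source.

```python
import re
import tempfile
from pathlib import Path

ARTIFACT_RE = re.compile(r"^artifact_shape_(?P<shape>.+)_subgraph_(?P<index>\d+)$")


def artifact_dir_name(runtime_shape, graph_index: int) -> str:
    """Build the directory name for one (runtime_shape, graph_index) pair."""
    return f"artifact_shape_{runtime_shape}_subgraph_{graph_index}"


def list_artifacts(backbone_dir: Path) -> list[tuple[str, int]]:
    """Return (shape, subgraph_index) pairs found under backbone_dir, sorted by index."""
    found = []
    for child in backbone_dir.iterdir():
        match = ARTIFACT_RE.match(child.name)
        if child.is_dir() and match:
            found.append((match["shape"], int(match["index"])))
    return sorted(found, key=lambda item: item[1])


# Demonstrate on a temporary tree instead of a real cache directory.
with tempfile.TemporaryDirectory() as tmp:
    backbone = Path(tmp) / "backbone"
    for i in (1, 0, 2):
        (backbone / artifact_dir_name(None, i) / "ir").mkdir(parents=True)
    (backbone / "computation_graph.py").touch()
    artifacts = list_artifacts(backbone)

print(artifacts)
```

Pointing list_artifacts at a real backbone/ directory gives a quick inventory of which subgraphs have been compiled.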

FAQ

  • How are these caches produced?
    • At runtime by torch.compile(...), after TorchDynamo tracing, AOTAutograd partitioning, TorchInductor lowering/fusion, and Triton codegen.
  • Will they change across runs?
    • Yes. Different input shapes, env vars, device info, or compile options can produce a different hash namespace (i.e., a new directory alongside bfa0df33ea/).
  • Is it safe to delete them?
    • Yes. You can delete cache/. It will be rebuilt on demand; the next run will be slower due to recompilation.
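To see why the namespace changes across runs, consider a simplified model in which the directory name is a truncated hash over the relevant settings. This is only a sketch: the hashing scheme, the cache_namespace helper, and the settings keys are assumptions for illustration, and the actual inputs and hash function used by the cache are not specified in this document.

```python
import hashlib
import json


def cache_namespace(settings: dict, length: int = 10) -> str:
    """Derive a short, deterministic directory name from a settings dict.

    Hypothetical scheme for illustration; not the library's real hashing.
    """
    payload = json.dumps(settings, sort_keys=True, default=str).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()[:length]


base = {"backend": "inductor", "device": "cuda:0", "torch": "2.8.0"}
same = cache_namespace(dict(base))
changed = cache_namespace({**base, "device": "cuda:1"})

print(same == cache_namespace(base))  # identical settings map to the same namespace
print(same == changed)                # any changed setting yields a different one
```

Under a scheme like this, stale namespaces are never overwritten, which is why old directories accumulate until cache/ is cleaned up.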

Compatibility and Recommendations

  • Prefer official PyTorch ≥ 2.8 builds to ensure standalone_compile availability.
  • For highly dynamic models, explicitly mark key dynamic dimensions to improve graph capture and cache reuse.

Acknowledgments

This library builds upon capabilities of PyTorch Dynamo, AOTAutograd, Inductor, and Triton, and incorporates engineering practices and interface designs inspired by the vLLM community. We thank the relevant open-source communities and contributors.