	## MagiCompiler

	An engineering-oriented compiler and execution augmentation library for PyTorch 2.8+. It provides module-level compilation decorators, backend adapters, graph partitioning strategies, readable and reusable compile artifacts, and runtime scheduling that integrates tightly with any inference engine. The design goal is to systematically expose the capabilities of PyTorch Dynamo / AOTAutograd / Inductor / Triton while prioritizing correctness, stability, observability, and maintainability.

	### Design Overview

	- Compilation entrypoint: the `@magi_compile` decorator augments `nn.Module.forward` with compilation, including dynamic-shape annotation and argument validation.
	- Partitioning and passes: configurable graph partitioning and pass management (e.g., `InductorPass`, `PostGradPassManager`) for fusion, kernel generation, and tuning.
	- Artifact system: persists compile artifacts using a Python file/directory layout for readability, auditability, and portability (see “Compile Cache Overview”).
	- Configuration: `CompileConfig` centralizes backend, partition rules, cache root, runtime shapes, and other key parameters.
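To make the partitioning bullet above more concrete, here is a minimal, library-independent sketch of operator-set-driven splitting. It assumes the graph has been flattened to an ordered list of op names; `partition_ops` and the op names are hypothetical illustrations, not part of MagiCompiler's API, and the real partitioner works on FX graphs rather than lists.

```python
from collections.abc import Iterable


def partition_ops(ops: Iterable[str], split_ops: set[str]) -> list[list[str]]:
    """Split an ordered op sequence into subgraphs.

    Every op named in `split_ops` becomes its own single-op subgraph;
    the ops between two split points are grouped into one subgraph.
    """
    subgraphs: list[list[str]] = []
    current: list[str] = []
    for op in ops:
        if op in split_ops:
            if current:  # close the subgraph accumulated so far
                subgraphs.append(current)
                current = []
            subgraphs.append([op])  # the split op is isolated
        else:
            current.append(op)
    if current:  # trailing ops after the last split point
        subgraphs.append(current)
    return subgraphs


# Hypothetical op names, used only for illustration.
example = partition_ops(
    ["linear", "gelu", "attention", "linear", "attention"],
    split_ops={"attention"},
)
print(example)
# [['linear', 'gelu'], ['attention'], ['linear'], ['attention']]
```

Because the split points depend only on the operator set, the subgraph boundaries stay the same regardless of the input shapes.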

	### Key Features

	- Dynamic-shape annotations:
	  - Automatic inference: when a `forward` parameter is annotated as `torch.Tensor` or `torch.Tensor | None`, dimension 0 is treated as dynamic by default.
	  - Explicit specification: use `@magi_compile(dynamic_arg_dims={...})` to mark dimensions (negative indices supported).
	  - Consistency constraints: parameters that alternately appear as `None` and non-`None` across the model lifetime cannot be captured into the same computation graph.
	- Backend selection and standalone compilation:
	  - `inductor` mode defaults to PyTorch 2.8+ `standalone_compile`, producing reusable artifacts.
	  - `eager` mode is available for debugging or fallback paths.
	- Partitioning and passes: operator-set-driven partition rules and pass contexts that stabilize subgraph boundaries and kernel generation across runtime shapes.
	- Readable, portable artifacts: structured directories with Python files for quick triage and cross-environment debugging.
	- Engine integration: the decorator reads engine-level `CompileConfig` to stay aligned with distributed/scheduling components.

	## Installation and Requirements

	- Python ≥ 3.10
	- PyTorch ≥ 2.8 (with `torch._inductor.standalone_compile` available)
	- We recommend using MagiCompiler within the Athena environment, together with its dependencies and distributed components (e.g., CUDA Graph manager).

	For local development, install in editable mode:

	```bash
	pip install -e . --no-build-isolation --config-settings editable_mode=compat
	```
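The requirements listed above can be checked with a short script before installing. This is a sketch: `parse_major_minor` and `check_requirements` are hypothetical helper names, and the only PyTorch facts it relies on are `torch.__version__` and the `torch._inductor.standalone_compile` attribute named in the requirements.

```python
import importlib.util
import re
import sys


def parse_major_minor(version: str) -> tuple[int, int]:
    """Extract (major, minor) from strings such as '2.8.0+cu128'."""
    match = re.match(r"(\d+)\.(\d+)", version)
    if match is None:
        raise ValueError(f"unrecognized version string: {version!r}")
    return int(match.group(1)), int(match.group(2))


def check_requirements() -> dict[str, bool]:
    """Report which of the documented requirements hold in this environment."""
    report = {
        "python>=3.10": sys.version_info >= (3, 10),
        "torch>=2.8": False,
        "standalone_compile": False,
    }
    if importlib.util.find_spec("torch") is None:
        return report  # PyTorch is not installed at all

    import torch
    import torch._inductor

    report["torch>=2.8"] = parse_major_minor(torch.__version__) >= (2, 8)
    report["standalone_compile"] = hasattr(torch._inductor, "standalone_compile")
    return report


if __name__ == "__main__":
    for requirement, ok in check_requirements().items():
        print(f"{requirement}: {'ok' if ok else 'MISSING'}")
```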

	## Quick Start

	### Minimal Example (automatic dynamic-dim inference)

	```python
	import torch
	from torch import nn
	from magi_compiler.decorator import magi_compile

	@magi_compile
	class MyModel(nn.Module):
	    def __init__(self, *, model_config):
	        super().__init__()
	        self.linear = nn.Linear(10, 5)

	    def forward(self, x: torch.Tensor, y: torch.Tensor | None) -> torch.Tensor:
	        if y is not None:
	            return self.linear(x + y)
	        return self.linear(x)

	# In Athena, model_config is typically provided by the engine
	model = MyModel(model_config=...)
	out1 = model(torch.randn(4, 10), torch.randn(4, 10))
	out2 = model(torch.randn(8, 10), None)  # dynamic batch dimension
	```

	### Explicit Dynamic-Dim Specification

	```python
	import torch
	from torch import nn
	from magi_compiler.decorator import magi_compile

	# Mark the second-to-last (sequence) dimension as dynamic via a negative index.
	@magi_compile(dynamic_arg_dims={"x": -2})
	class DynamicDimModel(nn.Module):
	    def __init__(self, *, model_config):
	        super().__init__()
	        self.proj = nn.Linear(16, 16, bias=False)

	    def forward(self, x: torch.Tensor) -> torch.Tensor:
	        return self.proj(x)

	m = DynamicDimModel(model_config=...)
	_ = m(torch.randn(2, 4, 16))
	_ = m(torch.randn(2, 8, 16))  # sequence length varies; the last dim must stay 16 for nn.Linear(16, 16)
	```

	## Configuration and Modes

	- `CompileConfig`: centralizes compile parameters (backend, cache paths, partition strategy, dynamic shapes, traced files, etc.).
	- `CompileMode`: typical setting is `CompileMode.TORCH_COMPILE`.
	- Backends:
	  - `inductor`: uses `standalone_compile` to produce reusable artifacts, ideal for production deployments.
	  - `eager`: convenient for rapid debugging or as a fallback.
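As a rough picture of what a centralized compile configuration can look like, the sketch below models the parameters listed above as a frozen dataclass. Everything here is an assumption: `CompileConfigSketch` and its field names are hypothetical and do not mirror the real `CompileConfig`; only the two backend names come from this README.

```python
from dataclasses import dataclass
from pathlib import Path

# The two backends named in this README.
SUPPORTED_BACKENDS = ("inductor", "eager")


@dataclass(frozen=True)
class CompileConfigSketch:
    """Hypothetical stand-in for CompileConfig; field names are illustrative."""

    backend: str = "inductor"
    cache_root: Path = Path("cache")
    splitting_ops: tuple[str, ...] = ()   # ops at which the graph is partitioned
    runtime_shapes: tuple[int, ...] = ()  # shapes to specialize for, if any

    def __post_init__(self) -> None:
        # Fail early on typos instead of at compile time.
        if self.backend not in SUPPORTED_BACKENDS:
            raise ValueError(
                f"unknown backend {self.backend!r}; expected one of {SUPPORTED_BACKENDS}"
            )


production = CompileConfigSketch()              # inductor, reusable artifacts
debugging = CompileConfigSketch(backend="eager")  # quick fallback path
print(production.backend, debugging.backend)    # inductor eager
```

Keeping the object immutable means every component that reads it (decorator, compiler manager, cache) sees the same settings for the lifetime of the process.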

	## Architecture and Execution Flow (Brief)

	1. `@magi_compile` wraps `nn.Module`:
	   - infers/validates `dynamic_arg_dims`;
	   - extends MRO by injecting `MagiCompilerBase`;
	   - reads engine-level `CompileConfig` in MagiCompiler.
	2. `CompilerManager`:
	   - defines cache keys using `(runtime_shape, graph_index, backend)`;
	   - dispatches to backends via `CompilerInterface` (`InductorStandaloneAdaptor` or `EagerAdaptor`);
	   - applies partition rules and pass contexts within `compile_context(...)`;
	   - serializes compile artifacts into a human-readable directory structure.
	3. Monitoring and statistics:
	   - counters and timestamps report per-shape/per-subgraph latencies and milestones.
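The MRO-injection step in the flow above can be illustrated with plain Python classes. This is one possible technique, shown under stated assumptions: `CompilerBaseSketch` is a hypothetical stand-in for `MagiCompilerBase`, `inject_base` is not the real decorator, and the actual implementation may inject the base class differently.

```python
class CompilerBaseSketch:
    """Hypothetical stand-in for MagiCompilerBase.

    It intercepts calls so that a real implementation could dispatch to a
    compiled graph; here it only counts calls and delegates to `forward`.
    """

    def __call__(self, *args, **kwargs):
        self.call_count = getattr(self, "call_count", 0) + 1
        return self.forward(*args, **kwargs)


def inject_base(cls):
    """Class decorator that places CompilerBaseSketch ahead of `cls` in the MRO."""
    if issubclass(cls, CompilerBaseSketch):
        return cls  # already injected; keep the decorator idempotent
    return type(
        cls.__name__,
        (CompilerBaseSketch, cls),
        {"__module__": cls.__module__, "__qualname__": cls.__qualname__},
    )


@inject_base
class Doubler:
    def forward(self, x):
        return 2 * x


model = Doubler()
print(model(21))                                # 42
print(issubclass(Doubler, CompilerBaseSketch))  # True
print(model.call_count)                         # 1
```

The user-defined class keeps its name and its `forward`, while call dispatch now passes through the injected base.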

	## Compile Cache Overview

	This section summarizes the cache files generated by `torch.compile` (TorchDynamo + AOTAutograd + TorchInductor + Triton). Reference path: `cache/`.

	### Directory Layout (Tree)

	```text
	cache/
	├─ depyf/
	│  └─ rank_0/
	│     ├─ __transformed_code_0_for_forward.py
	│     ├─ decompiled_code.py
	│     ├─ full_code_for_forward_0.py
	│     ├─ __compiled_fn_1.BEFORE_PRE_GRAD.{0..N}.py
	│     ├─ __compiled_fn_1.kernel_{0..K}.py
	│     ├─ __compiled_fn_1.__compiled_fn_1_<uuid>.0.py
	│     ├─ __compiled_fn_1.Before_split.0.py
	│     ├─ __compiled_fn_1.After_split.0.py
	│     ├─ __compiled_fn_1.pre_split_module.0.py
	│     ├─ __compiled_fn_1.post_split_module.0.py
	│     └─ __compiled_fn_1.pre_insert_deferred_runtime_asserts__<uuid>.0.py
	│
	└─ torch_compile_cache/
	   └─ bfa0df33ea/                          # Hash for graph + compile options + device, etc.
	      └─ rank_0/                           # Rank id in distributed/multi-GPU runs
	         └─ backbone/
	            ├─ computation_graph.py
	            ├─ magi_compile_cache.py
	            ├─ artifact_shape_None_subgraph_0/
	            ├─ artifact_shape_None_subgraph_1/
	            ├─ ...
	            └─ artifact_shape_None_subgraph_30/
	               ├─ ir/
	               │  └─ *.py                  # Python/Triton kernels generated by Inductor for this subgraph
	               ├─ fxgraph/
	               │  └─ //<hash>              # Binary FX IR/metadata snapshots (not human-readable)
	               ├─ aotautograd/
	               │  └─ //<hash>              # AOTAutograd partition/capture metadata and artifacts
	               ├─ 44/
	               │  └─ *.py                  # Other sharded/generated code buckets
	               └─ ...                      # Structure may vary slightly across subgraphs
	```
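The hash directory in the tree (`bfa0df33ea/`) acts as a namespace that changes whenever the graph, compile options, or device change. The sketch below shows how such a namespace could be derived. It is an assumption for illustration only: the hash function, the inputs, and the name `cache_namespace` are hypothetical, and the real hashing scheme is not documented here.

```python
import hashlib
import json


def cache_namespace(
    graph_source: str,
    compile_options: dict,
    device: str,
    length: int = 10,
) -> str:
    """Derive a short, deterministic directory name from compile inputs."""
    payload = json.dumps(
        {"graph": graph_source, "options": compile_options, "device": device},
        sort_keys=True,  # make the hash independent of dict insertion order
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()[:length]


# Hypothetical inputs: identical inputs map to the same namespace,
# while any change (here, the device) yields a different one.
a = cache_namespace("def forward(x): ...", {"backend": "inductor"}, "device-A")
b = cache_namespace("def forward(x): ...", {"backend": "inductor"}, "device-A")
c = cache_namespace("def forward(x): ...", {"backend": "inductor"}, "device-B")
print(a == b, a == c)  # True False
```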

	### What Each File/Dir Is For

	- `cache/depyf/` (TorchDynamo/Depyf debug exports)
	  - `__transformed_code_0_for_forward.py`: The Dynamo-transformed `forward` code (diff-friendly view of pre/post transformation).
	  - `decompiled_code.py`: Decompiled snapshot to help map traced graphs back to original Python.
	  - `full_code_for_forward_0.py`: A more complete expanded `forward` for inspection.
	  - `__compiled_fn_1.BEFORE_PRE_GRAD.{i}.py`: Intermediate wrapper snapshots at specific compile stages (e.g., before autodiff).
	  - `__compiled_fn_1.kernel_{k}.py`: Entrypoints/wrappers for kernels generated at various stages.
	  - `Before_split` / `After_split` / `pre_split_module` / `post_split_module`: Intermediate forms around graph partitioning.
	  - `pre_insert_deferred_runtime_asserts__*.py`: Snapshot before inserting deferred runtime assertions (dynamic shapes/guards).

	- `cache/torch_compile_cache/` (TorchInductor artifacts)
	  - `bfa0df33ea/`: Namespace keyed by a hash of model structure, compile settings, and device info.
	  - `rank_0/`: Bucket per process rank for distributed runs.
	  - `backbone/`:
	    - `computation_graph.py`: Full model FX GraphModule with symbolic dims; shared across subgraph kernels.
	    - `magi_compile_cache.py`:
	      - Maps subgraph indices to artifact directories, e.g. `(None, i, 'inductor_standalone') -> artifact_shape_None_subgraph_i/`.
	      - Registers and asynchronously compiles Triton kernels via `AsyncCompile.triton(...)`, including autotune metadata, device properties, scheduling hints, etc.
	    - `artifact_shape_None_subgraph_{N}/`:
	      - `ir/*.py`: Inductor-generated Python/Triton kernels and scheduling code for this subgraph (readable).
	      - `fxgraph///<hash>`: FX IR/metadata snapshots for fast graph reconstruction (binary; do not edit).
	      - `aotautograd///<hash>`: AOTAutograd partitions/captures and replay requirements.
	      - Additional hashed/prefixed buckets (e.g., `44/`, `o5/`, `55/`, `br/`) containing generated operator/subtask code.

	### FAQ

	- How are these caches produced?
	  - At runtime by `torch.compile(...)`, after TorchDynamo tracing, AOTAutograd partitioning, TorchInductor lowering/fusion, and Triton codegen.
	- Will they change across runs?
	  - Yes. Different input shapes, env vars, device info, or compile options can produce different hash namespaces (e.g., a new `bfa0df33ea`).
	- Is it safe to delete them?
	  - Yes. You can delete `cache/`. It will be rebuilt on demand; the next run will be slower due to recompilation.

	## Compatibility and Recommendations

	- Prefer official PyTorch ≥ 2.8 builds to ensure `standalone_compile` availability.
	- For highly dynamic models, explicitly mark key dynamic dimensions to improve graph capture and cache reuse.

	## Acknowledgments

	This library builds upon capabilities of PyTorch Dynamo, AOTAutograd, Inductor, and Triton, and incorporates engineering practices and interface designs inspired by the vLLM community. We thank the relevant open-source communities and contributors.